ccc — a RISC-V computer, from scratch

ccc

a RISC-V computer, from scratch

in Zig · emulator → kernel → OS → network → browser → python → desktop

▶ try it in your browser 📖 learn the codebase

contents · the project

Seven phases, one computer

The original target: boot our own kernel, talk to a server we control, and render plain HTML in a terminal-mode browser — everything written from scratch. No Linux. No TLS. The roadmap then keeps going past that finish line: Phase 6 ports real CPython 3.12 on top of Phase 3's kernel, and Phase 7 reverses the original "no graphics" stance with a framebuffer + compositor + windowed apps. Both are now in scope, not optional.

RISC-V CPU emulator ✓ complete

Zig-written RV32IMA emulator · passes riscv-tests · runs a bare-metal hello world.

Bare-metal kernel ✓ complete

S-mode + Sv32 paging · trap delegation · async timers · process + scheduler + yield + ticks.

Multi-process OS + shell ✓ complete

Process scheduler · fork/exec · block device · filesystem · tiny shell, ls/cat, an editor. Plans 3.A + 3.B + 3.C + 3.D + 3.E + 3.F landed (Phase 3 done): emulator PLIC + block + UART RX, the kernel-side ptable / scheduler / swtch / ELF loader, the full Unix-shaped fork / exec / wait / exit / kill-flag lifecycle, the read-side filesystem (bufcache + inode cache + namei · /bin/init from disk), the write side + console + shell — writei / ialloc / itrunc / dirlink, cooked-mode line discipline on fd 0/1/2, and a real sh driving ls / cat / echo / mkdir / rm, and finally the cursor-moving edit binary + e2e-persist that proves writes survive emulator restart.

Network stack upcoming

Ethernet → ARP → IP → ICMP → UDP → TCP → DNS, layer by layer.

HTTP/1.0 + text browser upcoming

HTTP client · URL parsing · tiny HTML parser · terminal renderer · link navigation by number.

CPython 3.12 port upcoming

Real CPython 3.12 as /bin/python — picolibc + dlmalloc + frozen stdlib on top of Phase 3's kernel. REPL, scripts, exceptions, ^C → KeyboardInterrupt. python /usr/lib/demo/pi.py 50 prints 50 digits of π through _pydecimal.

Framebuffer + compositor + windowed apps upcoming

Linear framebuffer · mouse + keyboard input · kernel mmap + pipes · userland wm compositor · windowed clock, calc, and term demos. Reverses the original "no graphics" stance.

This deck walks Phase 1 and Phase 2 in full, plus Plans 3.A, 3.B, 3.C, 3.D, 3.E, and 3.F — the emulator-side substrate, the kernel-side multi-process foundation, the Unix-shaped process lifecycle, the read-side filesystem that finally loads /bin/init from disk, the write-side + console + shell that gives the OS its first interactive prompt, and the cursor-moving edit binary + persistence test that closes Phase 3. Phases 4–7 are specced and queued.

chapter 1.a · the hart

A register file and a fetch-decode loop

Phase 1 starts with nothing. A Cpu struct holds 32 general-purpose registers (x0 hardwired to zero) plus a program counter. The main loop fetches four bytes at PC, decodes them into a tagged Instruction, dispatches on the variant, and steps PC.

pub const Cpu = struct {
    regs: [32]u32,        // x0..x31
    pc:   u32,             // program counter
    mem:  *Memory,

    pub fn writeReg(self: *Cpu,
                       i: u5, v: u32) void {
        if (i == 0) return;  // x0 hardwired
        self.regs[i] = v;
    }

    pub fn step(self: *Cpu) !void {
        const raw = try self.mem.fetch(self.pc);
        const insn = try decode(raw);
        try self.execute(insn);
    }
};

40 RV32I instructions

2 · LUI · AUIPC
2 · JAL · JALR
6 · BEQ · BNE · BLT · BGE · BLTU · BGEU
5 · LB · LH · LW · LBU · LHU
3 · SB · SH · SW
9 · ADDI · SLTI · SLTIU · XORI · ORI · ANDI · SLLI · SRLI · SRAI
10 · ADD · SUB · SLL · SLT · SLTU · XOR · SRL · SRA · OR · AND
3 · FENCE · ECALL · EBREAK

All 40 decode into a single tagged Instruction union — execute switches on the tag, no opcode-bit shenanigans in the hot path. 78 inline unit tests pin encode/decode round-trips and arithmetic edge cases.

chapter 1.a · memory + mmio

128 MB of RAM, two devices, one halt

Memory is one flat allocation at 0x8000_0000. Loads and stores route through a single Memory object: addresses inside the RAM range hit the buffer; a small set of MMIO ranges shortcut to device handlers. Two devices is the minimum needed to make the guest do something visible: print bytes, then stop.

guest physical address space

 0x0000_0000  ┐
              │  invalid · access fault
 0x0010_0000  ┤  halt MMIO   (1 byte)
              │
 0x1000_0000  ┤  UART        (NS16550A subset)
              │
              │  invalid
 0x8000_0000  ┤  RAM         128 MB, flat
              │
 0x8800_0000  ┘

UART · 0x1000_0000

NS16550A subset. THR (offset 0) byte-stores forward to host stdout; LSR (offset 5) reports THR_EMPTY | TX_IDLE so the guest's spin-wait completes immediately.

Halt · 0x0010_0000

A single byte. Any store snapshots the low byte as the host process's exit code, then breaks the fetch-decode loop.

zig build e2e loads a hand-encoded RV32I program at 0x8000_0000 that loops over "hello world\n", byte-stores each character to the UART THR, then byte-stores 0 to the halt device. The first end-to-end signal that the emulator is real.

chapter 1.b · m extension

Multiply, divide, every edge case

RV32M adds 8 instructions. Multiply is straightforward — except the high-word variants (mulh / mulhu / mulhsu) need 64-bit intermediates with the right signedness on each operand. Divide and remainder are required to behave a specific way on the two cases that would otherwise trap on real silicon: divide by zero and signed overflow.

// rv32m: 8 ops, all OP-major (0x33),
// distinguished by funct7 = 0000001

// multiply
mul    rd, rs1, rs2     // (rs1 * rs2)[31:0]
mulh   rd, rs1, rs2     // (s64 × s64)[63:32]
mulhu  rd, rs1, rs2     // (u64 × u64)[63:32]
mulhsu rd, rs1, rs2     // (s64 × u64)[63:32]

// divide / remainder
div    rd, rs1, rs2     // signed
divu   rd, rs1, rs2     // unsigned
rem    rd, rs1, rs2     // signed
remu   rd, rs1, rs2     // unsigned

Edge cases · spec-mandated

Divide by zero
div / divu → 0xFFFF_FFFF
rem / remu → dividend

Signed overflow (INT_MIN ÷ -1)
div → INT_MIN
rem → 0

No traps. Ever. The result is what the hardware must return.

execute casts to i64/u64 with the correct signedness per opcode, then truncates to the requested half. ~30 new unit tests pin every edge case from the table above.

chapter 1.b · a extension

LR/SC plus nine atomic ops

RV32A adds 11 instructions. Load-reserved / store-conditional gives us a compare-and-swap-shaped primitive; the nine AMOs (atomic-memory-operation) are read-modify-write opcodes that return the original value. Single-hart semantics — the aq/rl ordering bits decode but don't reorder anything. A small reservation: ?u32 field on the hart tracks live LR/SC state.

    # 6 × 7 = 42, atomically stored, split for display
    li    t1, 6
    li    t2, 7
    mul   t4, t1, t2          # t4 ← 42                       (RV32M)
    amoswap.w t5, t4, (t3)    # atomically store 42          (RV32A)
    divu  t6, t4, 10           # tens  = 4                    (RV32M)
    remu  t4, t4, 10           # ones  = 2                    (RV32M)
    fence.i                     # instruction fence          (Zifencei)

11 A-ops · lr.w · sc.w + 9 AMOs (swap · add · xor · and · or · min · max · minu · maxu) · 1 Zifencei — fence.i is a no-op (we have no I-cache). lr.w sets the reservation; sc.w reads it, returns 0 on success / 1 on miss, then clears it. ~50 new unit tests green.

chapter 1.c · privilege + csrs

Two privileges, a control file

Up to now the hart had one mode and zero state besides registers and PC. Plan 1.C adds the machinery that separates kernel code from user code: a second privilege level (U), a small file of control and status registers, and the Zicsr instructions that read and write them. U-mode access to M-CSRs traps as illegal — the boundary becomes enforceable.

Zicsr · 6 instructions

csrrw  rd, csr, rs1   # swap
csrrs  rd, csr, rs1   # set bits
csrrc  rd, csr, rs1   # clear bits
csrrwi rd, csr, uimm5
csrrsi rd, csr, uimm5
csrrci rd, csr, uimm5

# side-effect rules:
#   rd  = x0 → suppress read
#   rs1 = x0 → suppress write

M-mode CSR file

writable
mstatus · mtvec · mepc
mcause · mtval
mie · mip · mscratch
mtimecmp (via CLINT mirror)

read-only
misa = 0x4014_1101 (RV32 I·M·A·U)
mhartid · mvendor/arch/impid · zero

U-mode CSR access → illegal instruction trap.

mstatus's key bits: MIE (interrupts enabled), MPIE (saved MIE), MPP (previous privilege — restored by mret). Setting MPP=00 then mret downgrades to U-mode. The whole privilege transition is one instruction.

chapter 1.c · traps + the real test suite

Synchronous traps, a real test suite

A trap is a tightly choreographed dance of CSR writes. Six fields snapshot the cause, the faulting PC, and the previous privilege; PC jumps to mtvec. mret reverses it. Eight causes cover everything an RV32IMA hart can throw without supervisor mode. Then we point the official RISC-V conformance tests at it.

Trap entry · mret

  U-mode code
     │  ecall / fault
     ▼
  mcause ← cause code
  mepc   ← pc of faulting inst
  mtval  ← fault info
  mstatus.MPP  ← prev priv
  mstatus.MPIE ← MIE; MIE ← 0
  pc     ← mtvec
     │
     ▼
  M-mode handler
     │  mret
     ▼
  pc     ← mepc
  priv   ← mstatus.MPP
  MIE    ← mstatus.MPIE
     │
     ▼
  resume

8 causes · ecall_u/m · ebreak ·
illegal · ld/st misaligned · ld/st fault

Conformance · riscv-tests

$ zig build riscv-tests
# 66/66 pass
#   rv32ui  39   integer
#   rv32um   8   multiply/divide
#   rv32ua  10   atomics
#   rv32mi   9   M-mode + traps

Each test is an ELF that runs under our hand-rolled ELF32 loader; success is signaled by writing 1 to the linker symbol tohost. CLINT lives at 0x0200_0000 for the rv32mi timer tests.

chapter 1.d · hello.elf · two halves

Two halves, one binary

hello.elf bundles two cooperating programs. A tiny supervisor written in assembly boots the hardware and services requests; the application, written in Zig, simply asks for what it wants. The split mirrors how real operating systems separate kernel code from user code.

monitor.S · M-mode

Boots the hart, installs the trap vector, drops to U-mode, and services ecall.

Hand-written RISC-V assembly. Linked at 0x80000000.

hello.zig · U-mode

Naked function. No prologue, no stack. Calls write(1, msg, 12) then exit(0) via ecall.

Cross-compiled Zig. No libc.

chapter 1.d · hello.elf · boot flow

Control flow, end to end

From the moment ccc opens the ELF file until hello world appears on the terminal, control passes through five handoffs — some inside the guest program, others across the boundary into ccc's emulated hardware.

  ccc                           ← host: ELF32 loader reads hello.elf; PC = e_entry
   ▼
  monitor._start       (M-mode)
   │  set sp, mtvec, mepc; clear MPP;  mret
   ▼
  hello.u_entry        (U-mode)
   │  li a7, 64 ; la a1, msg ; li a2, 12;  ecall
   ▼
  monitor.trap_vector  (M-mode, cause = ecall_from_u)
   │  dispatch on a7:  64 → sys_write   93 → sys_exit
   ▼
  UART THR  0x10000000          ← stdout byte
  halt MMIO 0x00100000          ← exit code

chapter 1.d · hello.elf · monitor.S

_start: configure the hart, drop to U-mode

The first code that runs inside the emulated CPU. It sets up the stack pointer, registers a handler for anything that might go wrong, voluntarily downgrades its own privilege, and jumps to the application. This is the standard pattern for bare-metal code handing off to user code.

_start:
    la    sp, _stack_top           # 8 KB stack above all sections
    la    t0, trap_vector
    csrw  mtvec, t0                # install the trap handler
    li    t0, 0x1888
    csrc  mstatus, t0              # clear MPP → post-mret priv = U
    la    t0, u_entry
    csrw  mepc, t0                 # return address
    mret                           # PC ← mepc, priv ← MPP (= U)

PMP setup omitted for clarity — the real monitor.S installs a PMP entry granting U-mode RWX on all of RAM so QEMU and ccc run the same binary.

chapter 1.d · hello.elf · hello.zig

The entire U-mode program

The application, in its entirety — a naked Zig function (no prologue, no stack setup) that issues two system calls: one to print the message, one to exit. Every piece of machinery elsewhere in this deck exists to make these ten lines of assembly work.

export fn u_entry() linksection(".text.umode")
                   callconv(.naked) noreturn {
    asm volatile (
        \\ # write(1, msg, 12)
        \\ li   a7, 64
        \\ li   a0, 1
        \\ la   a1, msg
        \\ li   a2, 12
        \\ ecall
        \\ # exit(0)
        \\ li   a7, 93
        \\ li   a0, 0
        \\ ecall
    );
}

chapter 1.d · hello.elf · the trap

`ecall` → `trap_vector`

When the user program hits ecall, the CPU snapshots state into CSRs and jumps to mtvec. The monitor's trap vector first confirms the cause really was an ecall, then branches on the syscall number the user placed in a7.

CSR state on entry

mcause = 8 (ecall from U)
mepc = pc of ecall
mstatus.MPP = U
mstatus.MPIE = MIE
MIE = 0

pc = mtvec

trap_vector:
    csrr  t0, mcause
    li    t1, 8                # ecall from U
    bne   t0, t1, unexpected_trap

    li    t0, 64
    beq   a7, t0, sys_write
    li    t0, 93
    beq   a7, t0, sys_exit

    li    a0, -38             # -ENOSYS
    j     mret_resume

chapter 1.d · hello.elf · the devices

Bytes out · exit code in

Two syscalls, two magic addresses. ccc intercepts stores to 0x10000000 and forwards them to the host terminal; stores to 0x00100000 halt the simulation and set the host process's exit code.

sys_write:               # a7 = 64
    mv    t2, a2            # save len
    li    t0, 0x10000000    # UART THR
1:  lbu   t1, 0(a1)
    sb    t1, 0(t0)        # byte → UART
    addi  a1, a1, 1
    addi  a2, a2, -1
    bnez  a2, 1b
    mv    a0, t2            # return len
    j     mret_resume

sys_exit:                # a7 = 93
    li    t0, 0x00100000    # halt MMIO
    sb    a0, 0(t0)        # low byte → exit
1:  j     1b                # unreachable

chapter 1.d · hello.elf · verification

Verified against QEMU, instruction by instruction

Two independent verification paths. The e2e-hello-elf test pins the final output to the exact string hello world\n. The QEMU-diff harness runs the same ELF through qemu-system-riscv32 and through ccc, then diffs the per-instruction traces.

End-to-end & conformance

$ zig build e2e-hello-elf
# asserts stdout == "hello world\n"

$ zig build riscv-tests
# 66/66 tests pass:
#   rv32ui  39
#   rv32um   8
#   rv32ua  10
#   rv32mi   9

QEMU-diff

$ scripts/qemu-diff.sh \
    zig-out/bin/hello.elf

# runs same ELF under:
#   qemu-system-riscv32 --trace
#   ccc --trace
# canonicalises to "PC insn" lines
# diffs — empty output = match

chapter 2.a · the missing privilege

A third privilege: supervisor

Phase 1's hart had two modes — M and U. A real OS sits in the middle, where it can run system code without owning the firmware. Plan 2.A adds S-mode, plus its own slice of CSRs that mirror the M-mode set. sstatus, sie, and sip are masked views over the same underlying mstatus/mie/mip storage — the kernel never sees machine-only bits.

S-mode CSRs

sstatus (SIE/SPIE/SPP/SUM/MXR)
sie · sip (SSI/STI/SEI bits only)
stvec · sepc · scause
stval · sscratch
satp (MODE · ASID · root PPN)

Two new instructions

sret — the S-mode counterpart of mret. Restores PC ← sepc, priv ← sstatus.SPP, SIE ← SPIE.

sfence.vma — TLB / page-table fence. We don't model a TLB, so this is effectively a no-op — but the encoding decodes and privilege-checks.

Privilege ladder

  M  (firmware)
   ↓  mret
  S  (kernel)
   ↓  sret
  U  (user)

misa now
0x4014_1101
RV32 I·M·A·U·S

U-mode access to any S-CSR traps as illegal. M-mode can read or write S-CSRs freely. Every Phase 1 test stays green — adding a privilege level shouldn't break anything below it.

chapter 2.a · sv32 paging

Virtual addresses, a two-level walk

When satp.MODE = 1 and the effective privilege is S or U, every address the hart touches goes through the page table. Two levels of 10 bits index 4 KB tables of 32-bit PTEs; the leaf carries an R/W/X/U/V/A/D flag set and the physical page number. Permission checks happen on every walk; failures raise one of three new page-fault causes.

Sv32 virtual address

 31      22 21      12 11        0
┌──────────┬─────────┬────────────┐
│  VPN[1]  │ VPN[0]  │   offset   │
│  10 bits │ 10 bits │  12 bits   │
└──────────┴─────────┴────────────┘

two-level walk
 L1 = (satp.PPN << 12) + VPN[1]*4
 L2 = ( L1.PPN  << 12) + VPN[0]*4
 PA = ( L2.PPN  << 12) |  offset

page-fault causes
 12 inst    13 load    15 store

PTE layout · 32 bits

 31     20 19    10 9 8 7 6 5 4 3 2 1 0
┌─────────┬───────┬───┬─┬─┬─┬─┬─┬─┬─┬─┐
│  PPN[1] │PPN[0] │RSW│D│A│G│U│X│W│R│V│
└─────────┴───────┴───┴─┴─┴─┴─┴─┴─┴─┴─┘

V · valid R/W/X · perms U · user-accessible
G · global A · accessed D · dirty

Leaf vs intermediate: a leaf has any of R·W·X set. An intermediate has none — the walk continues.

rv32si-p-*: csr · dirty · illegal · ma_fetch · sbreak · scall · wfi — all green. Permission check uses both the U-bit on the PTE and the SUM/MXR bits in sstatus. A/D bits update in place on access if the PTE allows.

chapter 2.b · trap delegation

Bypass M, land in S

Without delegation, every fault from S-mode and U-mode lands in M-mode, and the firmware has to ping-pong everything down to the kernel. Real hardware doesn't do that. medeleg and mideleg are bitmask CSRs — set bit N means cause N routes straight to S-mode (filling sepc/scause/stvec instead of their M-counterparts). M-mode itself never delegates.

Delegation decision

  fault / interrupt
  in U or S
       │
       ▼
  cause bit set in
  medeleg / mideleg?
       │
   ┌───┴───┐
   yes     no
    │       │
    ▼       ▼
  S-mode   M-mode
  sepc     mepc
  scause   mcause
  stvec    mtvec
  sstatus  mstatus
   .SPP    .MPP

What can be delegated

Synchronous (medeleg):
ecall_u · ebreak · illegal
ld/st misaligned · ld/st fault
page faults (12 · 13 · 15)

Interrupt (mideleg):
SSI (1) · STI (5) · SEI (9)

M-only causes (MSI/MTI/MEI/ecall_m) are never delegable — those bits in medeleg/mideleg are WARL-zero.

A trap that fires in M-mode always lands in M-mode regardless of delegation — delegation is "exit early to S on the way down", not "redirect M". The supervisor can finally service its own faults without the firmware in the loop.

chapter 2.b · async interrupts

Traps from outside the instruction stream

Up until now, every trap was synchronous — it came from an instruction. Plan 2.B introduces asynchronous interrupts: between every fetch, the hart checks mip & mie against the current privilege and the WFI/MIE state, and if any pending interrupt is enabled, it traps before the next instruction runs. The CLINT is the first source — mtime ≥ mtimecmp drives mip.MTIP live.

Interrupt priority

MEI > MSI > MTI > SEI > SSI > STI
(11)  (3)   (7)   (9)   (1)   (5)

at each instruction fetch:
 if mip & mie & (priv-mask)
   take highest-priority bit
   trap (cause & 0x80000000 set)

mie / mip are WARL-masked.
MTIP is hardware-only on write;
reads see CLINT's truth.

CLINT · live MTIP

CLINT @ 0x0200_0000
  msip      [4 B · MSI source]
  mtimecmp  [8 B · timer threshold]
  mtime     [8 B · 10 MHz clock]

read of mip recomputes MTIP:
 mip.MTIP = (mtime ≥ mtimecmp)

writes to mip.MTIP are ignored —
it's hardware-driven, not software.

Round-trip integration test: CLINT fires MTIP → M-mode ISR acks the timer, sets mip.SSIP → S-mode ISR receives a delegated software interrupt, clears it, writes a sentinel. One test, every layer of routing exercised.

chapter 2.c · the address space

Three privileges, one address space

Plans 2.A and 2.B built the CPU substrate. 2.C composes it into a real kernel. The first decision: one Sv32 page table, shared by kernel and user. Kernel pages are mapped identity (VA == PA) with S-only perms; user pages live at 0x0001_0000 with U=R/W/X. G=1 on every leaf so the (modeled-as-no-op) TLB doesn't need flushes when we'd switch contexts.

process page table
 0x0001_0000  user .text / .rodata    U R W X
 0x0003_0000  user stack (2 pages)    U R W
 0x0010_0000  halt MMIO               S R W
 0x0200_0000  CLINT                   S R W
 0x1000_0000  UART                    S R W
 0x8000_0000  kernel .text            S R X
              kernel .rodata          S R
              kernel .data / .bss     S R W
              kernel stack (16 KB)    S R W

kernel VA == kernel PA · direct-mapped · G=1 throughout

Why one address space? Phase 2.C runs exactly one user program, and kernel-side virtual addresses (which include MMIO and the kernel image) need to be reachable from inside trap handlers without a context switch. Splitting into per-process page tables waits for Plan 3.B.

chapter 2.c · m → s → u

Boot to hello from u-mode

A tiny boot.S in M-mode sets up delegation, the trap vector, and the timer, then drops into the S-mode kernel's kmain. kmain initializes the page allocator, builds the Sv32 root, maps everything in the table on the previous slide, installs the S-mode trap entry, and srets into the user payload. The payload prints hello from u-mode via write and exit, and the emulator returns 0.

M   boot.S zero BSS
M   medeleg / mideleg
M   mtvec = m_trap_vector
M   mtimecmp = mtime + 1e6
M   mstatus.MPP = S
M   mret ──────────────► S
S   page_alloc.init
S   vm.allocRoot
S   vm.mapKernelAndMmio
S   vm.mapUser(blob)
S   stvec = s_trap_entry
S   sscratch = &the_tf
S   csrw satp; sfence.vma
S   sret ──────────────► U
U   ecall (write 64)   ◄─┐
S   sys_write → UART     │
U   ecall (exit 93)    ◄─┤
S   sys_exit → halt      └──► emulator exit 0

Two syscalls

sys_write (a7=64) · loops over a1[0..a2], byte-stores each character to the UART THR.

sys_exit (a7=93) · stores a0's low byte to the halt MMIO, halting the emulator with that exit code.

Both arrive via ecall, are delegated to S, dispatched in trap.zig, and return via sret.

e2e-kernel: ccc kernel.elf prints exactly hello from u-mode\n and exits 0. No Process struct, no scheduler, no yield — those arrive in Plan 2.D.

chapter 2.d · process · scheduler stub

A `Process`, a scheduler stub, a `yield`

Plan 2.D finishes Phase 2 by giving the kernel state to describe a running program. The Process struct holds the trapframe (already saved on every trap), the page-table root, the kernel stack, a state, an exit code, and a per-process tick counter. schedule() is a one-line stub that returns the singleton — but everything downstream is wired as if it were a real picker.

pub const Process = extern struct {
    tf:   trap.TrapFrame,  // offset 0
    satp: u32,
    kstack_top: u32,
    state: State,          // Runnable
                           // /Running/Exited
    ticks_observed: u32,
    exit_code: u32,
};
pub export var the_process: Process
    = undefined;

pub fn schedule() *Process {
    return &proc.the_process;
}

pub fn context_switch_to(p: *Process) void {
    // csrw satp, p.satp
    // sfence.vma zero, zero
}

The yield syscall

sys_yield (a7=1) — a no-op handler that returns immediately via sret. Because every trap return goes through schedule(), calling yield gives the scheduler a clean reason to pretend it picked the same process.

The structure is real even though the picker is trivial. When Plan 3.B replaces schedule() with a round-robin scan, none of this code changes — only the body of one function.

Trap-frame at offset 0 of Process matters: the trampoline always treats sscratch as a *TrapFrame, never a *Process. The two pointers are interchangeable as long as tf stays first — an invariant that survives into Plan 3.B and beyond.

chapter 2.d · the tick

One process, counting ticks

Phase 2's Definition of Done: hello from u-mode\nticks observed: N\n with N > 0. The path the timer takes to bump that counter touches every layer below: CLINT raises MTIP, M-mode's ISR re-arms mtimecmp and forwards via mip.SSIP, S-mode's dispatcher catches the delegated software interrupt, increments the counter, calls schedule(), and srets back to the same user PC.

M   CLINT: mtime ≥ mtimecmp
M   mip.MTIP = 1   (async)
M   timer ISR: clear MTIP
M                bump mtimecmp
M                set sip.SSIP
M   mret ─────────────► S
S   s_trap_dispatch
S     cause = 1 (SSI, interrupt)
S     clear sip.SSIP
S     the_process.ticks_observed += 1
S     sched.schedule()
S   sret ─────────────► U
U   write ... yield ... busy-loop
U   li a7, 93 ; ecall    (exit)
S   sys_exit:
      kprintf "ticks observed: 19\n"
      halt MMIO ← 0

Tuning the slice

TIMESLICE tuned from 1 M → 10 K CLINT cycles (≈ 1 ms) so ticks accumulate during the user program's ~300 K-cycle busy-loop.

verify_e2e.zig — host tool asserts stdout matches hello from u-mode\nticks observed: (\d+)\n with N > 0 and exit 0.

Phase 2 §Definition of done · met.

chapter 3.a · plic

PLIC · 32 sources, one S-context

Phase 3 begins under the kernel. The emulator gains the platform-level interrupt controller — a tiny piece of memory-mapped state that aggregates many external interrupt sources behind a single S-mode external interrupt (SEIP). One S-context, 32 sources, per-source priority, per-source enable, per-context threshold, and a claim/complete handshake every external IRQ flows through. No kernel-side code yet — Plan 3.A is emulator-only.

PLIC @ 0x0c00_0000

 0x0000_0000  source priority   [32 × 4]
 0x0000_1000  pending           [1 × 4]
 0x0000_2080  enable[ctx 0]     [1 × 4]
 0x0020_1000  threshold[ctx 0]  [1 × 4]
 0x0020_1004  claim / complete  [1 × 4]

SEIP wiring (mirror of MTIP pattern)
 read of sip recomputes SEIP:
   sip.SEIP = (any enabled source
                pending & pri > thr)

delegation
 mideleg.SEIP = 1
 → external IRQs land in S directly

Claim · complete

Claim reads the highest-priority enabled-and-pending source, returns its ID, and atomically clears its pending bit.

Complete writes the source ID back when the handler is done — re-arms the source for future asserts.

Same shape as real silicon: the kernel never sees the priority arbiter, only the winner.

wfi becomes a real idle: the emulator sleeps in poll(2) until any device's pending edge fires. No more burning host CPU on the spin path.

chapter 3.a · the two sources

A disk and a keyboard

The PLIC by itself isn't useful — it needs sources. Plan 3.A wires the two devices Phase 3.D's filesystem and 3.E's shell will need: a synchronous block device backed by a host file (PLIC source #1), and a UART RX path with a 256-byte ring buffer (PLIC source #10). Both follow the same shape: an MMIO write or host-side edge raises a pending bit, the kernel claims it, services the device, completes.

Block · 0x1000_1000 · src #1

Four registers: SECTOR · BUFFER · CMD · STATUS. 4 KB sectors, host-file-backed via --disk PATH.

Every CMD write transfers synchronously into the guest's BUFFER address, then raises PLIC source #1 on the next instruction boundary. Sync I/O behind an async-shaped interface — easy to retrofit a real driver later.

UART RX · src #10

256-byte ring buffer. LSR.DR tracks live FIFO state. Empty → non-empty asserts source #10; draining via RBR de-asserts.

--input PATH streams a host file into the FIFO for scripted tests; otherwise the idle path drains host stdin. Same UART register file as Phase 1's TX-only model — RX is purely additive.

e2e-plic-block: M-mode boot.S sets mideleg.SEIP, programs PLIC enables/threshold, drops to S. The S-mode test issues a block read, executes wfi, takes a delegated S-external interrupt, claims source 1, verifies sector 0's magic bytes, completes, halts 0.

chapter 3.b · ptable + swtch + elf loader

From one process to many

Phase 2's kernel ran a single hard-coded the_process. Plan 3.B rebuilds the foundation: a free-list page allocator, a kernel context-switch primitive, a 16-slot process table, and a kernel-side ELF loader — every moving part a real round-robin scheduler needs. The emulator doesn't change: 3.B is kernel-side only.

page_alloc · free-list

Replaces Phase 2's bump allocator. Each free 4 KB page stores a *next at offset 0; alloc() pops the head and zero-fills, free(pa) pushes. O(1), plus freeCount() for debug.

Context + swtch.S

52-byte Context: ra · sp · s0..s11. 30 lines of asm save callee-saved regs to *old, restore from *new, ret into the new caller's ra. The single primitive every yield, preempt, and (3.D) wakeup funnels through.

ptable[16] + Cpu

Process grows pid · pgdir · sz · kstack · context (tf stays at offset 0 — trampoline invariants survive). Cpu singleton holds cur + its own scheduler stack so 3.D's sleep machinery stays surgical.

elfload.zig — kernel-side ELF32 loader walks every PT_LOAD header, allocates ceil(p_memsz / 4 KB) zeroed user pages, copies p_filesz bytes, maps at p_vaddr. Replaces Phase 2's objcopy flatten — userprogs now embed as raw ELFs. New syscalls: getpid (#172) · sbrk (#214).

chapter 3.b · the context switch

30 lines of asm, one ret

The mechanical heart of the scheduler. swtch(*old, *new) spills the 14 callee-saved registers — ra, sp, and s0..s11 — into *old, refills them from *new, and rets. The trick: ret jumps to whatever ra the new context was holding, which is the line after that thread's last swtch call. The compiler's caller-saved discipline takes care of everything else.

.equ CTX_RA,   0
.equ CTX_SP,   4
.equ CTX_S0,   8
# ... s1..s10 at offsets 12..48
.equ CTX_S11, 52

.globl swtch
swtch:
    # save *old (a0)
    sw  ra, CTX_RA(a0)
    sw  sp, CTX_SP(a0)
    sw  s0, CTX_S0(a0)
    # ... s1..s11

    # restore *new (a1)
    lw  ra, CTX_RA(a1)
    lw  sp, CTX_SP(a1)
    lw  s0, CTX_S0(a1)
    # ... s1..s11
    ret     # PC ← new.ra

Why it works

Callee-saved only. The RISC-V ABI says the caller is responsible for spilling a0..a7, t0..t6, and floats around any function call. swtch is just another function — the compiler already spilled the rest.

The first switch into a new process needs ra pre-loaded. The fork path stamps the child's context with ra = forkret, a stub that drops it into U-mode for the very first time.

Same primitive every yield, every preempt, every (3.D) wakeup will funnel through.

No satp in the context — that's swapped separately by switchTo(p) just before swtch, so the new code starts running with the new page table active. Trap-frame state (caller-saved registers + sepc) lives on the kernel stack and gets restored by sret later.

chapter 3.b · the scheduler dance

PID 1 and PID 2, interleaved

The scheduler runs forever on its own kernel stack, scanning ptable for the first Runnable proc and swtch-ing in. Every timer tick the SSI handler bumps cur.ticks_observed and yields back; swtch saves the outgoing context, restores the next one, returns into its ra. A freshly-allocated process's first ra is forkret — a stub that drops it into U-mode.

pub fn scheduler() noreturn {
    while (true) {
        const p = pickRunnable()
            orelse continue;
        cpu.cur = p;
        p.state = .Running;
        // re-arm sscratch + load pgdir
        // csrw sscratch,p · csrw satp,p.satp
        // sfence.vma
        switchTo(p);
        swtch(&cpu.sched_context,
              &p.context);
        // p yielded back; pick another
        cpu.cur = null;
    }
}

kmain   alloc PID 1 + PID 2
kmain   scheduler() ───────► loop
sched   pick PID 1
sched   swtch ─► forkret ─► U-mode
U       PID 1 prints "hello…"
U       (timer) ecall yield
S       cur.ticks_observed += 1
S       swtch back to sched_context
sched   pick PID 2
U       PID 2 prints "[2] hello…"
U       PID 2  exit(0) → Zombie
sched   pick PID 1 → exit
sysExit pid == 1?
          kprintf "ticks observed: N\n"
          halt MMIO ← 0

e2e-multiproc-stub: spawns ccc kernel-multi.elf (two embedded ELFs), asserts stdout contains both hello from u-mode\n and [2] hello from u-mode\n plus a ticks observed: N\n trailer (any N), exit 0. Interleaving is non-deterministic — verifier checks substrings, not byte-for-byte. Phase 2's e2e-kernel stays green: sysExit halts only when cur.pid == 1.

chapter 3.c · fork + exec + wait + exit + kill

A real process lifecycle

Plan 3.B gave the kernel a process table and a scheduler; Plan 3.C gives processes the Unix-shaped lifecycle that hangs off them. Five new primitives, plus the VM helpers and sleep/wakeup machinery they ride on. The emulator doesn't change: 3.C is kernel-side only.

vm helpers

copyUvm(src, dst, sz) walks every user PTE and deep-copies each leaf into a fresh frame; unmapUser(pgdir, sz, policy) frees every user leaf, the L0 tables that backed them, and (optionally) the L1 root. Both rollback to a consistent state on OOM. The pair turns proc.free's 3.B panic stub into a real reaper.

fork · exec · wait · exit

fork = full-AS clone (copy_uvm + dup tf, child.tf.a0 = 0). exec = build the new pgdir in scratch, write the System-V argv tail to the new user stack, then csrw satp + tear down the old AS. wait sleeps on self until a Zombie child appears. exit reparents children to PID 1, marks Zombie, wakes the parent.

sleep · wakeup · kill

xv6-style sleep(chan) sets .Sleeping + swtch; wakeup(chan) flips matching sleepers to .Runnable. kill(pid) sets the killed flag (and wakes if sleeping); the trap dispatcher checks it on every syscall return. Invariant: S-mode runs interrupts-off; only U-mode runs them on.

New syscalls wired: 220 (clone → flagless fork) · 221 (execve) · 260 (wait4) · 5000 (set_fg_pid, accept-and-discard until 3.E) · 5001 (console_set_mode, accept-and-discard until 3.E). The killed-check sits in trap.zig's ECALL branch, right before sret — inert in 3.C, ready for 3.E's ^C path.

chapter 3.c · address space duplication

copyUvm — page by page

fork's heaviest lift is duplicating the parent's user pages into fresh frames for the child. copyUvm walks every leaf PTE from VA 0 to parent.sz, asks the page allocator for a new frame per source page, copies 4 KB through kernel-direct pointers, and installs the new mapping in the child's page table. copyUserStack handles the 2-page user stack region separately. Both rollback to a consistent state if the allocator runs dry mid-copy.

pub fn copyUvm(src: u32, dst: u32, sz: u32)
    CopyError!void
{
    const end_va = (sz + (PAGE_SIZE - 1))
                  & ~@as(u32, PAGE_SIZE - 1);
    var va: u32 = 0;
    while (va < end_va) : (va += PAGE_SIZE) {
        const src_pa = lookupPA(src, va)
            orelse continue;
        const dst_pa = page_alloc.alloc()
            orelse {
                unmapUser(dst, end_va, .leave_root);
                return CopyError.OutOfMemory;
            };
        // kernel-direct 4 KB memcpy
        const s: [*]const volatile u8
            = @ptrFromInt(src_pa);
        const d: [*]volatile u8
            = @ptrFromInt(dst_pa);
        var i: u32 = 0;
        while (i < PAGE_SIZE) : (i += 1)
            d[i] = s[i];
        mapPage(dst, va, dst_pa, USER_RWX);
    }
}

What makes it tricky

Kernel-direct pointers. The kernel's identity mapping means physical addresses are valid Zig pointers — no double-mapping or temporary windows. The volatile qualifier keeps the compiler from optimizing the byte-loop away.

Rollback on OOM. If page_alloc.alloc() returns null mid-copy, unmapUser walks every leaf already installed in dst and frees it back. The parent is untouched throughout — fork can still return -1 cleanly.

Stack copied separately. The 2-page user stack lives at a fixed VA above the program's heap; copyUserStack uses USER_RW instead of USER_RWX — no execute on the stack.

unmapUser(pgdir, sz, policy) is the inverse: walks every user leaf, frees it, and (for .leave_root) keeps the L1 root for further use or (for .free_root) tears the whole pgdir down. Together they turn proc.free's 3.B panic stub into a real reaper.

chapter 3.c · init forks /bin/hello

`init` forks `hello`, parent reaps, exits 0

kernel-fork.elf embeds two userland ELFs (init.elf + hello.elf) and a comptime blob registry mapping "/bin/hello" → HELLO_ELF. init forks; the child execves /bin/hello (which the kernel resolves via that registry); the parent wait4s on its own pointer; exit wakes it; init prints and halts. End-to-end fork/exec/wait/exit on real ELFs.

pub fn fork() i32 {
    const parent = cur();
    const child = alloc()
        orelse return -1;
    const root = vm.allocRoot()
        orelse { freeKstackOnly(child);
                  return -1; };
    vm.mapKernelAndMmio(root);
    try vm.copyUvm(parent.pgdir,
                    root, parent.sz);
    try vm.copyUserStack(parent.pgdir, root);
    child.pgdir = root;
    child.satp  = SATP_SV32 | (root >> 12);
    child.sz    = parent.sz;
    child.tf    = parent.tf;
    // child returns 0 from the same ecall
    child.tf.a0 = 0;
    child.parent = parent;
    child.state  = .Runnable;
    return @intCast(child.pid);
}

kmain   alloc init
sched   pick init
U       init: fork() ──┐
S       copy_uvm + dup tf
S       child.tf.a0 = 0
        parent.a0 = child.pid ◄─┘
U       parent: wait4 ─► sleep(self)
sched   pick child
U       child: execve("/bin/hello")
S       lookupBlob ─► HELLO_ELF
S       new pgdir + argv tail
S       csrw satp · free old AS
U       child: write "hello from /bin/hello\n"
U       child: exit(0) ─► Zombie
S       wakeup(parent)
sched   pick parent
S       wait reaps · free(child)
U       parent: write "init: reaped\n"
U       parent: exit(0) ─► PID 1
sysExit pid == 1?
          kprintf "ticks observed: N\n"
          halt MMIO ← 0

e2e-fork: spawns ccc kernel-fork.elf, asserts stdout contains hello from /bin/hello\n + init: reaped\n + ticks observed: N\n, exit 0. Order is deterministic (init forks → wait sleeps → child runs → child exits → wakeup → init reaps → init exits) but the verifier checks substrings to stay robust against future scheduler tweaks. Both Plan 3.B regressions stay green — sysExit's PID-1 trailer + halt is preserved byte-for-byte.

chapter 3.d · the disk path

First real device wakeup — block IRQ → bufcache

Plan 3.A taught the emulator to fire a PLIC IRQ when a block transfer finishes; Plan 3.D finally drives it from the kernel side. The S-mode trap dispatcher gains an external-interrupt branch (cause == 9) that runs the canonical claim → ISR → complete handshake. block.zig is the spec's single-outstanding-request model: the kernel writes the four MMIO registers, sets req.state = .Pending, and proc.sleep(@intFromPtr(&req)). The ISR sets .Done and proc.wakeups the waiter. The first time the kernel actually blocks on hardware — every prior sleep was process-local (fork/wait, child-of).

// kernel/block.zig — single outstanding
fn submit(blk: u32, buf_pa: u32, cmd: u32) {
    if (req.state != .Idle) panic(...);
    req.state = .Pending;
    mmio(REG_SECTOR).* = blk;
    mmio(REG_BUFFER).* = buf_pa;
    mmio(REG_CMD).*    = cmd; // ← fires IRQ

    while (req.state != .Done)
        proc.sleep(@intFromPtr(&req));

    if (req.err) panic("block I/O error");
    req.state = .Idle;
}

pub fn isr() {
    req.err   = (mmio(REG_STATUS).* != 0);
    req.state = .Done;
    proc.wakeup(@intFromPtr(&req));
}

Bufcache · NBUF=16 · LRU + sleep-on-busy

On top of the driver, fs/bufcache.zig caches 16 4-KB blocks in a doubly-linked LRU list. bget(blk) finds an existing buffer or evicts the LRU tail; if the buffer is busy, the caller sleeps on @intFromPtr(&buf) until the holder calls brelse. bread = bget + (if !valid) block.read; brelse bumps LRU position, decrements refs, wakes sleepers.

Two channels per buffer: one for the I/O wait (&req), one for the bget contention wait (&buf). Both ride the same sleep/wakeup primitive 3.C built.

The scheduler had to grow too: when nothing is Runnable but something is sleeping, it opens a tiny csrs SIE; csrc SIE window so the block IRQ can fire. A separate s_kernel_trap_entry vector handles S-from-S traps — the user vector's sscratch/sp dance would have clobbered the sleeping proc's tf.

chapter 3.d · layout · inode · namei

A read-side filesystem — six files in `fs/`

The on-disk layout is xv6-shaped: 4 KB blocks, a superblock at block 1, a bitmap at block 2, four inode blocks (NINODES=64 × 64 B = 4 KB each), then data blocks. Inodes have 12 direct + 1 indirect addresses (max ~4 MB per file — plenty for 3.D's /etc/motd and /bin/init). The kernel side stacks six modules in fs/: layout (constants + structs) → balloc (bitmap; write-side waits for 3.E) → inode (NINODE=32 in-memory cache + iget/ilock/iput + bmap + readi) → dir (dirlookup) → path (namei walks left-to-right from root or cwd).

on-disk layout · 4 MB · 1024 blocks

 block 0     boot sector (zeros)
 block 1     SuperBlock (magic + counts)
 block 2     bitmap (1 bit / data block)
 blocks 3..6 inode table (64 inodes × 64 B)
 blocks 7..  data blocks

DiskInode  (64 bytes)
   type · nlink · size · addrs[12+1]

read stack
   read(fd, buf, n)
     ─► file.read(idx)
         ─► inode.readi(ip, dst, off, n)
             ─► loop bmap → bread → memcpy → brelse
                 bread = bget + block.read
                 block.read = submit + sleep

// fs/path.zig — iterative left-to-right
fn nameix(path: []const u8) ?*InMemInode {
    var cur = startInode(path); // "/"=root
    var off: u32 = 0;
    while (true) {
        const step = nextComponent(path, off)
            orelse return cur;
        ilock(cur);
        if (cur.dinode.type != .Dir)
            return putAndFail(cur);
        const child = dirlookup(cur, step.comp)
            orelse return putAndFail(cur);
        iunlock(cur);
        const next = iget(child);
        iput(cur);
        cur = next;
        off = step.next;
    }
}

Seven new syscalls land: 17 (getcwd) · 49 (chdir) · 56 (openat) · 57 (close) · 62 (lseek) · 63 (read) · 80 (fstat). Per-process ofile[16] + cwd on Process; fork file.dup's every fd, exit file.close's them. Write side, mkdir, unlink — all 3.E.

chapter 3.d · init no longer embedded

`/bin/init` opens `/etc/motd`, writes it to fd 1, exits 0

proc.exec stops calling boot_config.lookupBlob. In the FS-mode kernel it now does namei(path) → readi → elfload.load against a 64 KB .bss scratch buffer. A new host tool, mkfs.zig, walks --root + --bin directory trees and lays out a 4 MB image with the canonical superblock + bitmap + inode table + data layout. The build runs it to produce zig-out/fs.img; kernel-fs.elf boots against that image and the on-disk /bin/init (a tiny naked-asm Zig program) is the first thing it execs.

build pipeline

 fs_init.elf     kernel/user/fs_init.zig
   ─► stage as /bin/init
 motd            kernel/userland/fs/etc/motd
   ─► stage as /etc/motd
 mkfs --root --bin --out
   ─► zig-out/fs.img  (4 MB)

 kernel-fs.elf   FS_DEMO=true
   ─► boot ─► kmain.FS_DEMO arm
       bufcache.init · inode.init
       PLIC enable IRQ #1
       proc.alloc PID 1
       proc.exec("/bin/init", NULL)
         namei ─► readi ─► elfload.load
       schedule ─► sret to U
 PID 1 · /bin/init runs:
   openat(0, "/etc/motd", 0)
   read(fd, buf, 256)
   write(1, buf, n)
   close(fd) · exit(0)

// kernel/user/fs_init.zig — naked _start
export fn _start() callconv(.naked) noreturn {
    asm volatile (
        \\ // openat(0, &path, 0)
        \\ li   a7, 56
        \\ li   a0, 0
        \\ la   a1, path
        \\ li   a2, 0
        \\ ecall
        \\ bltz a0, fail
        \\ mv   s1, a0
        \\ // read(fd, buf, 256)
        \\ li   a7, 63
        \\ mv   a0, s1
        \\ la   a1, buf
        \\ li   a2, 256
        \\ ecall
        \\ // write(1, buf, n)
        \\ li   a7, 64
        \\ li   a0, 1
        \\ ...
        \\ // close · exit(0)
    );
}

e2e-fs: ccc --disk fs.img kernel-fs.elf → stdout contains hello from phase 3\n (motd content) + the canonical PID-1 ticks observed: N\n trailer; exit 0. Getting here surfaced nine subtle bugs along the way — most memorably, file.read's 4 KB stack-allocated kbuf overflowing the per-process kernel stack into the adjacent scheduler stack page (var x = undefined's 0xaa fill made the corruption pattern unmissable). Hoisted to a .bss static; problem solved.

chapter 3.e · the write side

From read-only to mutable: `writei` + `ialloc` + `itrunc` + a real `dirlink`

Plan 3.D shipped the read side; 3.E completes the FS by adding the write side. inode.writei mirrors readi but flips a for_write flag through bmap so missing direct/indirect blocks get lazily allocated via balloc. iupdate flushes the in-memory dinode back to its inode-table block. ialloc claims the first free inode slot. itrunc frees direct + indirect blocks, called from iput when ref+nlink hit zero. dirlink stops being a stub — it scans for an empty slot or appends — and a new dirunlink zeros the directory entry. A new fs/fsops.zig wraps the create + unlink pipelines so the syscall layer stays thin.

// fs/inode.zig — bmap with for_write hint
fn bmap(ip: *InMemInode, bn: u32,
        for_write: bool) ?u32 {
    if (bn < NDIRECT) {
        var a = ip.dinode.addrs[bn];
        if (a == 0 and for_write) {
            a = balloc.alloc() orelse return null;
            ip.dinode.addrs[bn] = a;
            iupdate(ip); // flush
        }
        return if (a == 0) null else a;
    }
    // indirect: same shape, one extra bread
    ...
}

// fsops.zig — the create pipeline
pub fn create(path: []const u8,
                 t: InodeType) ?*InMemInode {
    const parent = nameiparent(path) orelse ...;
    const name = lastComponent(path);
    if (dirlookup(parent, name)) |existing|
        return existing; // idempotent
    const ip = ialloc(t) orelse return null;
    dirlink(parent, name, ip.inum);
    return ip;
}

New + extended syscalls

New: mkdirat (#34) · unlinkat (#35).
Extended: openat with O_CREAT / O_TRUNC / O_APPEND.
Refactored: sysWrite stops short-circuiting straight to UART — it routes any fd through file.write, which dispatches to console.write (for the new Console-typed entries on fd 0/1/2) or writei (for inode-backed fds).

The file.write reroute means every previous build that hadn't wired Console fds suddenly needs them — a regression caught by e2e-kernel + e2e-multiproc-stub going silent until kmain's Phase 2 branch grew the same 3-line console install the FS branch already had.

iput-on-zero is the last piece: when a file's ref + nlink hit zero, iput calls itrunc, frees the inode, and writes a zero type back through iupdate. Combined with unlinkat's dirunlink + nlink--, an rm /tmp/x actually returns the data blocks + inode slot to the pool.

chapter 3.e · the input side

Cooked-mode line discipline on fd 0/1/2 — and the WFI loop it surfaced

console.zig is the new fd 0/1/2 backing. UART RX (PLIC source #10) drives uart.isr, which drains the FIFO via console.feedByte: cooked mode echoes the byte, handles backspace / ^U line-kill / ^C-via-proc.kill / ^D EOF, and on \n commits the line + wakes any sleepers. Raw mode is wired (the console_set_mode syscall lands real this time) but only exercised by 3.F's editor. The trap dispatcher gains an IRQ_UART_RX = 10 arm that mirrors the block IRQ shape — claim → uart.isr → complete.

the input loop · why WFI matters

 host stdin
   ─► --input bytes (one keystroke at a time)
       ─► cpu.idleSpin (only fires on WFI)
           rx_pump.drainOne ─► uart.pushRx
               PLIC src 10
                 ─► s_trap_dispatch (cause 9)
                     ─► uart.isr ─► console.feedByte
                         echo ─► input.buf[]
                         on \n: wakeup(&input.r)
       sh resumes in console.read

scheduler idle path  (3.D shape)
   csrs SIE          // open SIE briefly
   csrc SIE          // close it

scheduler idle path  (3.E)
   csrs SIE
   wfi            // idleSpin runs
   csrc SIE

The fixed-point loop

Adding wfi broke e2e-fs hard. cpu.step's prologue raises any pending S-trap before the next instruction is fetched. With SIE = 1, the pending PLIC IRQ traps before wfi is fetched; sret returns to wfi (sepc = wfi_addr); the same trap fires again. Forever.

Two fixes broke the loop: s_kernel_trap_dispatch advances sepc by 4 (the trap site is always the 4-byte WFI), and idleSpin drains one byte per iteration AFTER checking interrupts (so disk-I/O sleeps, where a block IRQ is already pending, don't leak --input bytes into the FIFO mid-boot — drain only happens when the guest is truly idle).

The byte-at-a-time pacing matters: bulk-draining all 51 input bytes would emit every cooked-mode echo before sh can print its second-and-subsequent prompts. One byte per WFI iteration mirrors interactive-tty cadence, and the harness sees $ ls /bin\n$ echo hi > /tmp/x\n… in the right order.

chapter 3.e · the milestone

An interactive prompt: `init_shell` forks `sh`, `sh` forks `cat`

A small userland stdlib lands at src/kernel/user/lib/: start.S parses argc/argv from the System-V tail and calls main; usys.S emits 19 syscall stubs (li a7; ecall; ret); ulib.zig bundles mem*/str* + Stat + O_* constants; uprintf.zig is a 90-line printf(fd, fmt, args). An addUserBinary build helper bolts these onto every userland binary identically. Seven binaries ship: init_shell (loops fork-exec-sh-wait, exits cleanly when sh status is 0), sh (line read + token split + redirect </>/>> + builtins cd/pwd/exit + fork+exec, ~250 LoC), and the four utilities ls / cat / echo / mkdir / rm.

// kernel/user/sh.zig — main loop
export fn main(argc: u32, argv: [*]const [*:0]const u8) i32 {
    while (true) {
        const prompt: [2]u8 = .{ '$', ' ' };
        _ = ulib.write(1, &prompt, 2);

        const got = ulib.getline(0, &line_buf, LINE_MAX);
        if (got <= 0) return 0;

        const parsed = parseLine(&line_buf, got);
        runCommand(&parsed); // fork+exec or builtin
    }
}

milestone session

 $ ccc --input shell_input.txt \
       --disk shell-fs.img kernel-fs.elf

 $ ls /bin
 .
 ..
 cat
 init
 echo
 sh
 mkdir
 ls
 rm
 $ echo hi > /tmp/x
 $ cat /tmp/x
 hi
 $ rm /tmp/x
 $ exit
 ticks observed: 6

shell-fs.img is the parallel image baking init_shell as /bin/init + every utility under /bin/. mkfs.zig learned --init (override) and walks every top-level subdir of --root (so the empty /tmp/ staging carries through). The e2e-shell harness pipes the canonical session through --input and asserts each prompt + command echo + the hi round-trip + the clean halt. This is our first interactive shell.

chapter 3.f · editor + persistence (final demo)

Chapter 3.F — editor + persistence (final demo)

Plan 3.E wired raw-mode in the kernel but never exercised it. Plan 3.F finally does, with a cursor-moving edit binary: console_set_mode(1) on entry, console_set_mode(0) on exit, a 16 KB content buffer with a single byte-offset cursor, and a redraw-on-every-keystroke ANSI loop dispatching ESC [ A/B/C/D arrows + printables + backspace + ^S save + ^X exit. e2e-editor drives the canonical session via a 43-byte binary fixture. e2e-persist proves the kernel's bwrite path actually persists by running ccc twice on the same disk image and asserting pass 2 sees pass 1's writes. e2e-cancel closes the last DoD bullet — pipes cat\n\x03exit\n and asserts cat\n^C\n$ exit appears in stdout, exercising the full kill-flag chain (console.feedByte(0x03) → proc.kill(fg_pid) → killed flag → console.read returns -1 → syscall dispatch calls proc.exit).

Phase 3 §Definition of Done holds end-to-end. kernel.elf + shell-fs.img boot to a shell, run our own programs, edit files interactively, observe changes survive emulator restarts, ^C cancels foreground programs. Next: Phase 4 (network stack) → Phase 5 (HTTP/1.0 + text browser) → Phase 6 (CPython 3.12 port) → Phase 7 (framebuffer + compositor).

epilogue · where we are

Phase 1 · complete. Phase 2 · complete. Phase 3 · complete.

Phase 1 asked: can we run hello world end-to-end on an emulator we wrote? Yes. Phase 2 asked: can we host a real OS kernel? Also yes — three privilege levels, Sv32 paging, trap delegation, async timers, a one-process scheduler stub, and a user program that prints how many times the timer fired while it ran. Phase 3 is a multi-process OS with a filesystem and a shell — Plan 3.A landed the emulator-side hardware (PLIC, block device, UART RX); Plan 3.B landed the kernel-side multi-process foundation (free-list page_alloc, ptable[16], round-robin scheduler with swtch, kernel-side ELF loader); Plan 3.C landed the full Unix-shaped process lifecycle — fork, execve, wait4, exit, and a kill-flag — with init forking /bin/hello end-to-end on real ELFs; Plan 3.D landed the read-side filesystem — bufcache, block driver with sleep-on-IRQ, inode cache + bmap + readi, namei, file table, seven new syscalls, and a host-side mkfs that builds a 4 MB image, with /bin/init loaded from disk; Plan 3.E just landed the write side + console + shell — writei/ialloc/itrunc/dirlink, cooked-mode line discipline on fd 0/1/2 driven by UART RX through PLIC IRQ #10, a small userland stdlib, and seven user binaries including a real sh. The OS finally has an interactive prompt. Plan 3.F wrapped Phase 3 — a cursor-moving edit binary that exercises the raw-mode console arm 3.E left wired-but-unexercised, plus an e2e-persist test that proves on-disk writes survive emulator restart. Phase 3 §Definition of Done holds: boot to a shell, run our own programs, edit files interactively, observe changes persist across reboots.

✓ 1.A · baseline emulator

RV32I · 128 MB RAM · UART · halt device · 78 tests

✓ 1.B · IMA extensions

RV32M · RV32A (LR/SC + 9 AMOs) · Zifencei

✓ 1.C · privilege + traps

Zicsr · M/U · CLINT · ELF loader · rv32ui/um/ua/mi

✓ 1.D · hello.elf

Monitor + U-mode payload · QEMU-diff · e2e-hello-elf

✓ 2.A · S-mode + Sv32

Supervisor mode · two-level paging · rv32si-p-*

✓ 2.B · delegation + async

medeleg / mideleg · CLINT MTIP · interrupt priority

✓ 2.C · kernel skeleton

M-mode boot shim · Sv32 paging · trap dispatcher · write + exit

✓ 2.D · process + scheduler + ticks

Process struct · sched stub · yield · timer tick counter · qemu-diff-kernel

✓ 3.A · PLIC + block + UART RX

Emulator-only · 32-source PLIC · 4 KB block device · 256-byte RX FIFO · real wfi idle · --disk/--input · e2e-plic-block

✓ 3.B · ptable + scheduler + ELF loader

Free-list page_alloc · Context + swtch.S · ptable[16] + Cpu · round-robin scheduler · kernel-side ELF loader · getpid/sbrk/yield · PID 1 + PID 2 · e2e-multiproc-stub

✓ 3.C · fork + exec + wait + exit + kill

vm.copyUvm/unmapUser · real proc.free · sleep/wakeup · fork/execve/wait4/exit · killed-check on syscall return · init forks /bin/hello · e2e-fork

✓ 3.D · bufcache + block + FS read

PLIC + block kernel drivers · NBUF=16 LRU bufcache (sleep-on-busy) · NINODE=32 inode cache + bmap/readi · namei · NFILE=64 file table + per-proc ofile[16]/cwd · 7 new syscalls (openat/close/read/lseek/fstat/chdir/getcwd) · mkfs + 4 MB fs.img · separate kernel trap vector for the scheduler SIE window · /bin/init from disk · e2e-fs

✓ 3.E · FS write + console + shell

writei/bmap lazy alloc · iupdate/ialloc/itrunc · dirlink/dirunlink · fsops.create/unlink · mkdirat/unlinkat + openat O_CREAT/O_TRUNC/O_APPEND · console fd 0/1/2 cooked-mode line discipline · UART RX via PLIC #10 · scheduler WFI + sepc+4 + paced rx_pump · userland stdlib (start.S/usys.S/ulib/uprintf) · init_shell/sh/ls/cat/echo/mkdir/rm · shell-fs.img · e2e-shell

✓ 3.F · editor + persistence

cursor-moving edit.zig · raw-mode console exercised · 16 KB content buffer · ESC [ A/B/C/D arrow parser · ANSI redraw (clear + home + buffer + position) · ^S save (close + O_TRUNC re-open) · ^X exit + cooked restore · e2e-editor (43-byte fixture: edit → 2× right → Y → ^S^X → cat asserts heYllo from phase 3) · e2e-persist (two ccc passes on copied image; pass 2 sees pass 1's writes) · e2e-cancel (10-byte fixture cat\n\x03exit\n proves ^C kill-flag end-to-end)