The original target: boot our own kernel, talk to a server we control, and render plain HTML in a terminal-mode browser — everything written from scratch. No Linux. No TLS. The roadmap then keeps going past that finish line: Phase 6 ports real CPython 3.12 on top of Phase 3's kernel, and Phase 7 reverses the original "no graphics" stance with a framebuffer + compositor + windowed apps. Both are now in scope, not optional.
riscv-tests · runs a bare-metal hello world.yield + ticks.ls/cat, an editor. Plans 3.A + 3.B + 3.C + 3.D + 3.E + 3.F landed (Phase 3 done): emulator PLIC + block + UART RX, the kernel-side ptable / scheduler / swtch / ELF loader, the full Unix-shaped fork / exec / wait / exit / kill-flag lifecycle, the read-side filesystem (bufcache + inode cache + namei · /bin/init from disk), the write side + console + shell — writei / ialloc / itrunc / dirlink, cooked-mode line discipline on fd 0/1/2, and a real sh driving ls / cat / echo / mkdir / rm, and finally the cursor-moving edit binary + e2e-persist that proves writes survive emulator restart./bin/python — picolibc + dlmalloc + frozen stdlib on top of Phase 3's kernel. REPL, scripts, exceptions, ^C → KeyboardInterrupt. python /usr/lib/demo/pi.py 50 prints 50 digits of π through _pydecimal.mmap + pipes · userland wm compositor · windowed clock, calc, and term demos. Reverses the original "no graphics" stance./bin/init from disk, the write-side + console + shell that gives the OS its first interactive prompt, and the cursor-moving edit binary + persistence test that closes Phase 3. Phases 4–7 are specced and queued.Phase 1 starts with nothing. A Cpu struct holds 32 general-purpose registers (x0 hardwired to zero) plus a program counter. The main loop fetches four bytes at PC, decodes them into a tagged Instruction, dispatches on the variant, and steps PC.
pub const Cpu = struct { regs: [32]u32, // x0..x31 pc: u32, // program counter mem: *Memory, pub fn writeReg(self: *Cpu, i: u5, v: u32) void { if (i == 0) return; // x0 hardwired self.regs[i] = v; } pub fn step(self: *Cpu) !void { const raw = try self.mem.fetch(self.pc); const insn = try decode(raw); try self.execute(insn); } };
2 · LUI · AUIPC
2 · JAL · JALR
6 · BEQ · BNE · BLT · BGE · BLTU · BGEU
5 · LB · LH · LW · LBU · LHU
3 · SB · SH · SW
9 · ADDI · SLTI · SLTIU · XORI · ORI · ANDI · SLLI · SRLI · SRAI
10 · ADD · SUB · SLL · SLT · SLTU · XOR · SRL · SRA · OR · AND
3 · FENCE · ECALL · EBREAK
Instruction union — execute switches on the tag, no opcode-bit shenanigans in the hot path. 78 inline unit tests pin encode/decode round-trips and arithmetic edge cases.Memory is one flat allocation at 0x8000_0000. Loads and stores route through a single Memory object: addresses inside the RAM range hit the buffer; a small set of MMIO ranges shortcut to device handlers. Two devices is the minimum needed to make the guest do something visible: print bytes, then stop.
guest physical address space 0x0000_0000 ┐ │ invalid · access fault 0x0010_0000 ┤ halt MMIO (1 byte) │ 0x1000_0000 ┤ UART (NS16550A subset) │ │ invalid 0x8000_0000 ┤ RAM 128 MB, flat │ 0x8800_0000 ┘
NS16550A subset. THR (offset 0) byte-stores forward to host stdout; LSR (offset 5) reports THR_EMPTY | TX_IDLE so the guest's spin-wait completes immediately.
A single byte. Any store snapshots the low byte as the host process's exit code, then breaks the fetch-decode loop.
zig build e2e loads a hand-encoded RV32I program at 0x8000_0000 that loops over "hello world\n", byte-stores each character to the UART THR, then byte-stores 0 to the halt device. The first end-to-end signal that the emulator is real.RV32M adds 8 instructions. Multiply is straightforward — except the high-word variants (mulh / mulhu / mulhsu) need 64-bit intermediates with the right signedness on each operand. Divide and remainder are required to behave a specific way on the two cases that would otherwise trap on real silicon: divide by zero and signed overflow.
// rv32m: 8 ops, all OP-major (0x33), // distinguished by funct7 = 0000001 // multiply mul rd, rs1, rs2 // (rs1 * rs2)[31:0] mulh rd, rs1, rs2 // (s64 × s64)[63:32] mulhu rd, rs1, rs2 // (u64 × u64)[63:32] mulhsu rd, rs1, rs2 // (s64 × u64)[63:32] // divide / remainder div rd, rs1, rs2 // signed divu rd, rs1, rs2 // unsigned rem rd, rs1, rs2 // signed remu rd, rs1, rs2 // unsigned
Divide by zero
div / divu → 0xFFFF_FFFF
rem / remu → dividend
Signed overflow (INT_MIN ÷ -1)
div → INT_MIN
rem → 0
No traps. Ever. The result is what the hardware must return.
execute casts to i64/u64 with the correct signedness per opcode, then truncates to the requested half. ~30 new unit tests pin every edge case from the table above.RV32A adds 11 instructions. Load-reserved / store-conditional gives us a compare-and-swap-shaped primitive; the nine AMOs (atomic-memory-operation) are read-modify-write opcodes that return the original value. Single-hart semantics — the aq/rl ordering bits decode but don't reorder anything. A small reservation: ?u32 field on the hart tracks live LR/SC state.
# 6 × 7 = 42, atomically stored, split for display li t1, 6 li t2, 7 mul t4, t1, t2 # t4 ← 42 (RV32M) amoswap.w t5, t4, (t3) # atomically store 42 (RV32A) divu t6, t4, 10 # tens = 4 (RV32M) remu t4, t4, 10 # ones = 2 (RV32M) fence.i # instruction fence (Zifencei)
lr.w · sc.w + 9 AMOs (swap · add · xor · and · or · min · max · minu · maxu) · 1 Zifencei — fence.i is a no-op (we have no I-cache). lr.w sets the reservation; sc.w reads it, returns 0 on success / 1 on miss, then clears it. ~50 new unit tests green.Up to now the hart had one mode and zero state besides registers and PC. Plan 1.C adds the machinery that separates kernel code from user code: a second privilege level (U), a small file of control and status registers, and the Zicsr instructions that read and write them. U-mode access to M-CSRs traps as illegal — the boundary becomes enforceable.
csrrw rd, csr, rs1 # swap csrrs rd, csr, rs1 # set bits csrrc rd, csr, rs1 # clear bits csrrwi rd, csr, uimm5 csrrsi rd, csr, uimm5 csrrci rd, csr, uimm5 # side-effect rules: # rd = x0 → suppress read # rs1 = x0 → suppress write
writable
mstatus · mtvec · mepc
mcause · mtval
mie · mip · mscratch
mtimecmp (via CLINT mirror)
read-only
misa = 0x4014_1101 (RV32 I·M·A·U)
mhartid · mvendor/arch/impid · zero
U-mode CSR access → illegal instruction trap.
mstatus's key bits: MIE (interrupts enabled), MPIE (saved MIE), MPP (previous privilege — restored by mret). Setting MPP=00 then mret downgrades to U-mode. The whole privilege transition is one instruction.A trap is a tightly choreographed dance of CSR writes. Six fields snapshot the cause, the faulting PC, and the previous privilege; PC jumps to mtvec. mret reverses it. Eight causes cover everything an RV32IMA hart can throw without supervisor mode. Then we point the official RISC-V conformance tests at it.
U-mode code
│ ecall / fault
▼
mcause ← cause code
mepc ← pc of faulting inst
mtval ← fault info
mstatus.MPP ← prev priv
mstatus.MPIE ← MIE; MIE ← 0
pc ← mtvec
│
▼
M-mode handler
│ mret
▼
pc ← mepc
priv ← mstatus.MPP
MIE ← mstatus.MPIE
│
▼
resume
8 causes · ecall_u/m · ebreak ·
illegal · ld/st misaligned · ld/st fault
$ zig build riscv-tests # 66/66 pass # rv32ui 39 integer # rv32um 8 multiply/divide # rv32ua 10 atomics # rv32mi 9 M-mode + traps
Each test is an ELF that runs under our hand-rolled ELF32 loader; success is signaled by writing 1 to the linker symbol tohost. CLINT lives at 0x0200_0000 for the rv32mi timer tests.
hello.elf bundles two cooperating programs. A tiny supervisor written in assembly boots the hardware and services requests; the application, written in Zig, simply asks for what it wants. The split mirrors how real operating systems separate kernel code from user code.
Boots the hart, installs the trap vector, drops to U-mode, and services ecall.
Hand-written RISC-V assembly. Linked at 0x80000000.
Naked function. No prologue, no stack. Calls write(1, msg, 12) then exit(0) via ecall.
Cross-compiled Zig. No libc.
From the moment ccc opens the ELF file until hello world appears on the terminal, control passes through five handoffs — some inside the guest program, others across the boundary into ccc's emulated hardware.
ccc ← host: ELF32 loader reads hello.elf; PC = e_entry ▼ monitor._start (M-mode) │ set sp, mtvec, mepc; clear MPP; mret ▼ hello.u_entry (U-mode) │ li a7, 64 ; la a1, msg ; li a2, 12; ecall ▼ monitor.trap_vector (M-mode, cause = ecall_from_u) │ dispatch on a7: 64 → sys_write 93 → sys_exit ▼ UART THR 0x10000000 ← stdout byte halt MMIO 0x00100000 ← exit code
The first code that runs inside the emulated CPU. It sets up the stack pointer, registers a handler for anything that might go wrong, voluntarily downgrades its own privilege, and jumps to the application. This is the standard pattern for bare-metal code handing off to user code.
_start:
la sp, _stack_top # 8 KB stack above all sections
la t0, trap_vector
csrw mtvec, t0 # install the trap handler
li t0, 0x1888
csrc mstatus, t0 # clear MPP → post-mret priv = U
la t0, u_entry
csrw mepc, t0 # return address
mret # PC ← mepc, priv ← MPP (= U)
monitor.S installs a PMP entry granting U-mode RWX on all of RAM so QEMU and ccc run the same binary.The application, in its entirety — a naked Zig function (no prologue, no stack setup) that issues two system calls: one to print the message, one to exit. Every piece of machinery elsewhere in this deck exists to make these ten lines of assembly work.
export fn u_entry() linksection(".text.umode") callconv(.naked) noreturn { asm volatile ( \\ # write(1, msg, 12) \\ li a7, 64 \\ li a0, 1 \\ la a1, msg \\ li a2, 12 \\ ecall \\ # exit(0) \\ li a7, 93 \\ li a0, 0 \\ ecall ); }
ecall → trap_vectorWhen the user program hits ecall, the CPU snapshots state into CSRs and jumps to mtvec. The monitor's trap vector first confirms the cause really was an ecall, then branches on the syscall number the user placed in a7.
mcause = 8 (ecall from U)
mepc = pc of ecall
mstatus.MPP = U
mstatus.MPIE = MIE
MIE = 0
pc = mtvec
trap_vector:
csrr t0, mcause
li t1, 8 # ecall from U
bne t0, t1, unexpected_trap
li t0, 64
beq a7, t0, sys_write
li t0, 93
beq a7, t0, sys_exit
li a0, -38 # -ENOSYS
j mret_resume
Two syscalls, two magic addresses. ccc intercepts stores to 0x10000000 and forwards them to the host terminal; stores to 0x00100000 halt the simulation and set the host process's exit code.
sys_write: # a7 = 64 mv t2, a2 # save len li t0, 0x10000000 # UART THR 1: lbu t1, 0(a1) sb t1, 0(t0) # byte → UART addi a1, a1, 1 addi a2, a2, -1 bnez a2, 1b mv a0, t2 # return len j mret_resume
sys_exit: # a7 = 93 li t0, 0x00100000 # halt MMIO sb a0, 0(t0) # low byte → exit 1: j 1b # unreachable
Two independent verification paths. The e2e-hello-elf test pins the final output to the exact string hello world\n. The QEMU-diff harness runs the same ELF through qemu-system-riscv32 and through ccc, then diffs the per-instruction traces.
$ zig build e2e-hello-elf # asserts stdout == "hello world\n" $ zig build riscv-tests # 66/66 tests pass: # rv32ui 39 # rv32um 8 # rv32ua 10 # rv32mi 9
$ scripts/qemu-diff.sh \
zig-out/bin/hello.elf
# runs same ELF under:
# qemu-system-riscv32 --trace
# ccc --trace
# canonicalises to "PC insn" lines
# diffs — empty output = match
Phase 1's hart had two modes — M and U. A real OS sits in the middle, where it can run system code without owning the firmware. Plan 2.A adds S-mode, plus its own slice of CSRs that mirror the M-mode set. sstatus, sie, and sip are masked views over the same underlying mstatus/mie/mip storage — the kernel never sees machine-only bits.
sstatus (SIE/SPIE/SPP/SUM/MXR)
sie · sip (SSI/STI/SEI bits only)
stvec · sepc · scause
stval · sscratch
satp (MODE · ASID · root PPN)
sret — the S-mode counterpart of mret. Restores PC ← sepc, priv ← sstatus.SPP, SIE ← SPIE.
sfence.vma — TLB / page-table fence. We don't model a TLB, so this is effectively a no-op — but the encoding decodes and privilege-checks.
M (firmware) ↓ mret S (kernel) ↓ sret U (user) misa now 0x4014_1101 RV32 I·M·A·U·S
When satp.MODE = 1 and the effective privilege is S or U, every address the hart touches goes through the page table. Two levels of 10 bits index 4 KB tables of 32-bit PTEs; the leaf carries an R/W/X/U/V/A/D flag set and the physical page number. Permission checks happen on every walk; failures raise one of three new page-fault causes.
Sv32 virtual address 31 22 21 12 11 0 ┌──────────┬─────────┬────────────┐ │ VPN[1] │ VPN[0] │ offset │ │ 10 bits │ 10 bits │ 12 bits │ └──────────┴─────────┴────────────┘ two-level walk L1 = (satp.PPN << 12) + VPN[1]*4 L2 = ( L1.PPN << 12) + VPN[0]*4 PA = ( L2.PPN << 12) | offset page-fault causes 12 inst 13 load 15 store
31 20 19 10 9 8 7 6 5 4 3 2 1 0 ┌─────────┬───────┬───┬─┬─┬─┬─┬─┬─┬─┬─┐ │ PPN[1] │PPN[0] │RSW│D│A│G│U│X│W│R│V│ └─────────┴───────┴───┴─┴─┴─┴─┴─┴─┴─┴─┘
V · valid R/W/X · perms U · user-accessible
G · global A · accessed D · dirty
Leaf vs intermediate: a leaf has any of R·W·X set. An intermediate has none — the walk continues.
SUM/MXR bits in sstatus. A/D bits update in place on access if the PTE allows.Without delegation, every fault from S-mode and U-mode lands in M-mode, and the firmware has to ping-pong everything down to the kernel. Real hardware doesn't do that. medeleg and mideleg are bitmask CSRs — set bit N means cause N routes straight to S-mode (filling sepc/scause/stvec instead of their M-counterparts). M-mode itself never delegates.
fault / interrupt
in U or S
│
▼
cause bit set in
medeleg / mideleg?
│
┌───┴───┐
yes no
│ │
▼ ▼
S-mode M-mode
sepc mepc
scause mcause
stvec mtvec
sstatus mstatus
.SPP .MPP
Synchronous (medeleg):
ecall_u · ebreak · illegal
ld/st misaligned · ld/st fault
page faults (12 · 13 · 15)
Interrupt (mideleg):
SSI (1) · STI (5) · SEI (9)
M-only causes (MSI/MTI/MEI/ecall_m) are never delegable — those bits in medeleg/mideleg are WARL-zero.
Up until now, every trap was synchronous — it came from an instruction. Plan 2.B introduces asynchronous interrupts: between every fetch, the hart checks mip & mie against the current privilege and the WFI/MIE state, and if any pending interrupt is enabled, it traps before the next instruction runs. The CLINT is the first source — mtime ≥ mtimecmp drives mip.MTIP live.
MEI > MSI > MTI > SEI > SSI > STI (11) (3) (7) (9) (1) (5) at each instruction fetch: if mip & mie & (priv-mask) take highest-priority bit trap (cause & 0x80000000 set) mie / mip are WARL-masked. MTIP is hardware-only on write; reads see CLINT's truth.
CLINT @ 0x0200_0000 msip [4 B · MSI source] mtimecmp [8 B · timer threshold] mtime [8 B · 10 MHz clock] read of mip recomputes MTIP: mip.MTIP = (mtime ≥ mtimecmp) writes to mip.MTIP are ignored — it's hardware-driven, not software.
mip.SSIP → S-mode ISR receives a delegated software interrupt, clears it, writes a sentinel. One test, every layer of routing exercised.Plans 2.A and 2.B built the CPU substrate. 2.C composes it into a real kernel. The first decision: one Sv32 page table, shared by kernel and user. Kernel pages are mapped identity (VA == PA) with S-only perms; user pages live at 0x0001_0000 with U=R/W/X. G=1 on every leaf so the (modeled-as-no-op) TLB doesn't need flushes when we'd switch contexts.
process page table 0x0001_0000 user .text / .rodata U R W X 0x0003_0000 user stack (2 pages) U R W 0x0010_0000 halt MMIO S R W 0x0200_0000 CLINT S R W 0x1000_0000 UART S R W 0x8000_0000 kernel .text S R X kernel .rodata S R kernel .data / .bss S R W kernel stack (16 KB) S R W kernel VA == kernel PA · direct-mapped · G=1 throughout
A tiny boot.S in M-mode sets up delegation, the trap vector, and the timer, then drops into the S-mode kernel's kmain. kmain initializes the page allocator, builds the Sv32 root, maps everything in the table on the previous slide, installs the S-mode trap entry, and srets into the user payload. The payload prints hello from u-mode via write and exit, and the emulator returns 0.
M boot.S zero BSS M medeleg / mideleg M mtvec = m_trap_vector M mtimecmp = mtime + 1e6 M mstatus.MPP = S M mret ──────────────► S S page_alloc.init S vm.allocRoot S vm.mapKernelAndMmio S vm.mapUser(blob) S stvec = s_trap_entry S sscratch = &the_tf S csrw satp; sfence.vma S sret ──────────────► U U ecall (write 64) ◄─┐ S sys_write → UART │ U ecall (exit 93) ◄─┤ S sys_exit → halt └──► emulator exit 0
sys_write (a7=64) · loops over a1[0..a2], byte-stores each character to the UART THR.
sys_exit (a7=93) · stores a0's low byte to the halt MMIO, halting the emulator with that exit code.
Both arrive via ecall, are delegated to S, dispatched in trap.zig, and return via sret.
ccc kernel.elf prints exactly hello from u-mode\n and exits 0. No Process struct, no scheduler, no yield — those arrive in Plan 2.D.Process, a scheduler stub, a yieldPlan 2.D finishes Phase 2 by giving the kernel state to describe a running program. The Process struct holds the trapframe (already saved on every trap), the page-table root, the kernel stack, a state, an exit code, and a per-process tick counter. schedule() is a one-line stub that returns the singleton — but everything downstream is wired as if it were a real picker.
pub const Process = extern struct { tf: trap.TrapFrame, // offset 0 satp: u32, kstack_top: u32, state: State, // Runnable // /Running/Exited ticks_observed: u32, exit_code: u32, }; pub export var the_process: Process = undefined; pub fn schedule() *Process { return &proc.the_process; } pub fn context_switch_to(p: *Process) void { // csrw satp, p.satp // sfence.vma zero, zero }
sys_yield (a7=1) — a no-op handler that returns immediately via sret. Because every trap return goes through schedule(), calling yield gives the scheduler a clean reason to pretend it picked the same process.
The structure is real even though the picker is trivial. When Plan 3.B replaces schedule() with a round-robin scan, none of this code changes — only the body of one function.
Process matters: the trampoline always treats sscratch as a *TrapFrame, never a *Process. The two pointers are interchangeable as long as tf stays first — an invariant that survives into Plan 3.B and beyond.Phase 2's Definition of Done: hello from u-mode\nticks observed: N\n with N > 0. The path the timer takes to bump that counter touches every layer below: CLINT raises MTIP, M-mode's ISR re-arms mtimecmp and forwards via mip.SSIP, S-mode's dispatcher catches the delegated software interrupt, increments the counter, calls schedule(), and srets back to the same user PC.
M CLINT: mtime ≥ mtimecmp M mip.MTIP = 1 (async) M timer ISR: clear MTIP M bump mtimecmp M set sip.SSIP M mret ─────────────► S S s_trap_dispatch S cause = 1 (SSI, interrupt) S clear sip.SSIP S the_process.ticks_observed += 1 S sched.schedule() S sret ─────────────► U U write ... yield ... busy-loop U li a7, 93 ; ecall (exit) S sys_exit: kprintf "ticks observed: 19\n" halt MMIO ← 0
TIMESLICE tuned from 1 M → 10 K CLINT cycles (≈ 1 ms) so ticks accumulate during the user program's ~300 K-cycle busy-loop.
verify_e2e.zig — host tool asserts stdout matches hello from u-mode\nticks observed: (\d+)\n with N > 0 and exit 0.
Phase 2 §Definition of done · met.
Phase 3 begins under the kernel. The emulator gains the platform-level interrupt controller — a tiny piece of memory-mapped state that aggregates many external interrupt sources behind a single S-mode external interrupt (SEIP). One S-context, 32 sources, per-source priority, per-source enable, per-context threshold, and a claim/complete handshake every external IRQ flows through. No kernel-side code yet — Plan 3.A is emulator-only.
PLIC @ 0x0c00_0000 0x0000_0000 source priority [32 × 4] 0x0000_1000 pending [1 × 4] 0x0000_2080 enable[ctx 0] [1 × 4] 0x0020_1000 threshold[ctx 0] [1 × 4] 0x0020_1004 claim / complete [1 × 4] SEIP wiring (mirror of MTIP pattern) read of sip recomputes SEIP: sip.SEIP = (any enabled source pending & pri > thr) delegation mideleg.SEIP = 1 → external IRQs land in S directly
Claim reads the highest-priority enabled-and-pending source, returns its ID, and atomically clears its pending bit.
Complete writes the source ID back when the handler is done — re-arms the source for future asserts.
Same shape as real silicon: the kernel never sees the priority arbiter, only the winner.
wfi becomes a real idle: the emulator sleeps in poll(2) until any device's pending edge fires. No more burning host CPU on the spin path.The PLIC by itself isn't useful — it needs sources. Plan 3.A wires the two devices Phase 3.D's filesystem and 3.E's shell will need: a synchronous block device backed by a host file (PLIC source #1), and a UART RX path with a 256-byte ring buffer (PLIC source #10). Both follow the same shape: an MMIO write or host-side edge raises a pending bit, the kernel claims it, services the device, completes.
Four registers: SECTOR · BUFFER · CMD · STATUS. 4 KB sectors, host-file-backed via --disk PATH.
Every CMD write transfers synchronously into the guest's BUFFER address, then raises PLIC source #1 on the next instruction boundary. Sync I/O behind an async-shaped interface — easy to retrofit a real driver later.
256-byte ring buffer. LSR.DR tracks live FIFO state. Empty → non-empty asserts source #10; draining via RBR de-asserts.
--input PATH streams a host file into the FIFO for scripted tests; otherwise the idle path drains host stdin. Same UART register file as Phase 1's TX-only model — RX is purely additive.
boot.S sets mideleg.SEIP, programs PLIC enables/threshold, drops to S. The S-mode test issues a block read, executes wfi, takes a delegated S-external interrupt, claims source 1, verifies sector 0's magic bytes, completes, halts 0.Phase 2's kernel ran a single hard-coded the_process. Plan 3.B rebuilds the foundation: a free-list page allocator, a kernel context-switch primitive, a 16-slot process table, and a kernel-side ELF loader — every moving part a real round-robin scheduler needs. The emulator doesn't change: 3.B is kernel-side only.
Replaces Phase 2's bump allocator. Each free 4 KB page stores a *next at offset 0; alloc() pops the head and zero-fills, free(pa) pushes. O(1), plus freeCount() for debug.
52-byte Context: ra · sp · s0..s11. 30 lines of asm save callee-saved regs to *old, restore from *new, ret into the new caller's ra. The single primitive every yield, preempt, and (3.D) wakeup funnels through.
Process grows pid · pgdir · sz · kstack · context (tf stays at offset 0 — trampoline invariants survive). Cpu singleton holds cur + its own scheduler stack so 3.D's sleep machinery stays surgical.
PT_LOAD header, allocates ceil(p_memsz / 4 KB) zeroed user pages, copies p_filesz bytes, maps at p_vaddr. Replaces Phase 2's objcopy flatten — userprogs now embed as raw ELFs. New syscalls: getpid (#172) · sbrk (#214).The mechanical heart of the scheduler. swtch(*old, *new) spills the 14 callee-saved registers — ra, sp, and s0..s11 — into *old, refills them from *new, and rets. The trick: ret jumps to whatever ra the new context was holding, which is the line after that thread's last swtch call. The compiler's caller-saved discipline takes care of everything else.
.equ CTX_RA, 0 .equ CTX_SP, 4 .equ CTX_S0, 8 # ... s1..s10 at offsets 12..48 .equ CTX_S11, 52 .globl swtch swtch: # save *old (a0) sw ra, CTX_RA(a0) sw sp, CTX_SP(a0) sw s0, CTX_S0(a0) # ... s1..s11 # restore *new (a1) lw ra, CTX_RA(a1) lw sp, CTX_SP(a1) lw s0, CTX_S0(a1) # ... s1..s11 ret # PC ← new.ra
Callee-saved only. The RISC-V ABI says the caller is responsible for spilling a0..a7, t0..t6, and floats around any function call. swtch is just another function — the compiler already spilled the rest.
The first switch into a new process needs ra pre-loaded. The fork path stamps the child's context with ra = forkret, a stub that drops it into U-mode for the very first time.
Same primitive every yield, every preempt, every (3.D) wakeup will funnel through.
satp in the context — that's swapped separately by switchTo(p) just before swtch, so the new code starts running with the new page table active. Trap-frame state (caller-saved registers + sepc) lives on the kernel stack and gets restored by sret later.The scheduler runs forever on its own kernel stack, scanning ptable for the first Runnable proc and swtch-ing in. Every timer tick the SSI handler bumps cur.ticks_observed and yields back; swtch saves the outgoing context, restores the next one, returns into its ra. A freshly-allocated process's first ra is forkret — a stub that drops it into U-mode.
pub fn scheduler() noreturn { while (true) { const p = pickRunnable() orelse continue; cpu.cur = p; p.state = .Running; // re-arm sscratch + load pgdir // csrw sscratch,p · csrw satp,p.satp // sfence.vma switchTo(p); swtch(&cpu.sched_context, &p.context); // p yielded back; pick another cpu.cur = null; } }
kmain alloc PID 1 + PID 2 kmain scheduler() ───────► loop sched pick PID 1 sched swtch ─► forkret ─► U-mode U PID 1 prints "hello…" U (timer) ecall yield S cur.ticks_observed += 1 S swtch back to sched_context sched pick PID 2 U PID 2 prints "[2] hello…" U PID 2 exit(0) → Zombie sched pick PID 1 → exit sysExit pid == 1? kprintf "ticks observed: N\n" halt MMIO ← 0
ccc kernel-multi.elf (two embedded ELFs), asserts stdout contains both hello from u-mode\n and [2] hello from u-mode\n plus a ticks observed: N\n trailer (any N), exit 0. Interleaving is non-deterministic — verifier checks substrings, not byte-for-byte. Phase 2's e2e-kernel stays green: sysExit halts only when cur.pid == 1.Plan 3.B gave the kernel a process table and a scheduler; Plan 3.C gives processes the Unix-shaped lifecycle that hangs off them. Five new primitives, plus the VM helpers and sleep/wakeup machinery they ride on. The emulator doesn't change: 3.C is kernel-side only.
copyUvm(src, dst, sz) walks every user PTE and deep-copies each leaf into a fresh frame; unmapUser(pgdir, sz, policy) frees every user leaf, the L0 tables that backed them, and (optionally) the L1 root. Both rollback to a consistent state on OOM. The pair turns proc.free's 3.B panic stub into a real reaper.
fork = full-AS clone (copy_uvm + dup tf, child.tf.a0 = 0). exec = build the new pgdir in scratch, write the System-V argv tail to the new user stack, then csrw satp + tear down the old AS. wait sleeps on self until a Zombie child appears. exit reparents children to PID 1, marks Zombie, wakes the parent.
xv6-style sleep(chan) sets .Sleeping + swtch; wakeup(chan) flips matching sleepers to .Runnable. kill(pid) sets the killed flag (and wakes if sleeping); the trap dispatcher checks it on every syscall return. Invariant: S-mode runs interrupts-off; only U-mode runs them on.
220 (clone → flagless fork) · 221 (execve) · 260 (wait4) · 5000 (set_fg_pid, accept-and-discard until 3.E) · 5001 (console_set_mode, accept-and-discard until 3.E). The killed-check sits in trap.zig's ECALL branch, right before sret — inert in 3.C, ready for 3.E's ^C path.fork's heaviest lift is duplicating the parent's user pages into fresh frames for the child. copyUvm walks every leaf PTE from VA 0 to parent.sz, asks the page allocator for a new frame per source page, copies 4 KB through kernel-direct pointers, and installs the new mapping in the child's page table. copyUserStack handles the 2-page user stack region separately. Both rollback to a consistent state if the allocator runs dry mid-copy.
pub fn copyUvm(src: u32, dst: u32, sz: u32) CopyError!void { const end_va = (sz + (PAGE_SIZE - 1)) & ~@as(u32, PAGE_SIZE - 1); var va: u32 = 0; while (va < end_va) : (va += PAGE_SIZE) { const src_pa = lookupPA(src, va) orelse continue; const dst_pa = page_alloc.alloc() orelse { unmapUser(dst, end_va, .leave_root); return CopyError.OutOfMemory; }; // kernel-direct 4 KB memcpy const s: [*]const volatile u8 = @ptrFromInt(src_pa); const d: [*]volatile u8 = @ptrFromInt(dst_pa); var i: u32 = 0; while (i < PAGE_SIZE) : (i += 1) d[i] = s[i]; mapPage(dst, va, dst_pa, USER_RWX); } }
Kernel-direct pointers. The kernel's identity mapping means physical addresses are valid Zig pointers — no double-mapping or temporary windows. The volatile qualifier keeps the compiler from optimizing the byte-loop away.
Rollback on OOM. If page_alloc.alloc() returns null mid-copy, unmapUser walks every leaf already installed in dst and frees it back. The parent is untouched throughout — fork can still return -1 cleanly.
Stack copied separately. The 2-page user stack lives at a fixed VA above the program's heap; copyUserStack uses USER_RW instead of USER_RWX — no execute on the stack.
unmapUser(pgdir, sz, policy) is the inverse: walks every user leaf, frees it, and (for .leave_root) keeps the L1 root for further use or (for .free_root) tears the whole pgdir down. Together they turn proc.free's 3.B panic stub into a real reaper.init forks hello, parent reaps, exits 0kernel-fork.elf embeds two userland ELFs (init.elf + hello.elf) and a comptime blob registry mapping "/bin/hello" → HELLO_ELF. init forks; the child execves /bin/hello (which the kernel resolves via that registry); the parent wait4s on its own pointer; exit wakes it; init prints and halts. End-to-end fork/exec/wait/exit on real ELFs.
pub fn fork() i32 { const parent = cur(); const child = alloc() orelse return -1; const root = vm.allocRoot() orelse { freeKstackOnly(child); return -1; }; vm.mapKernelAndMmio(root); try vm.copyUvm(parent.pgdir, root, parent.sz); try vm.copyUserStack(parent.pgdir, root); child.pgdir = root; child.satp = SATP_SV32 | (root >> 12); child.sz = parent.sz; child.tf = parent.tf; // child returns 0 from the same ecall child.tf.a0 = 0; child.parent = parent; child.state = .Runnable; return @intCast(child.pid); }
kmain alloc init sched pick init U init: fork() ──┐ S copy_uvm + dup tf S child.tf.a0 = 0 parent.a0 = child.pid ◄─┘ U parent: wait4 ─► sleep(self) sched pick child U child: execve("/bin/hello") S lookupBlob ─► HELLO_ELF S new pgdir + argv tail S csrw satp · free old AS U child: write "hello from /bin/hello\n" U child: exit(0) ─► Zombie S wakeup(parent) sched pick parent S wait reaps · free(child) U parent: write "init: reaped\n" U parent: exit(0) ─► PID 1 sysExit pid == 1? kprintf "ticks observed: N\n" halt MMIO ← 0
ccc kernel-fork.elf, asserts stdout contains hello from /bin/hello\n + init: reaped\n + ticks observed: N\n, exit 0. Order is deterministic (init forks → wait sleeps → child runs → child exits → wakeup → init reaps → init exits) but the verifier checks substrings to stay robust against future scheduler tweaks. Both Plan 3.B regressions stay green — sysExit's PID-1 trailer + halt is preserved byte-for-byte.Plan 3.A taught the emulator to fire a PLIC IRQ when a block transfer finishes; Plan 3.D finally drives it from the kernel side. The S-mode trap dispatcher gains an external-interrupt branch (cause == 9) that runs the canonical claim → ISR → complete handshake. block.zig is the spec's single-outstanding-request model: the kernel writes the four MMIO registers, sets req.state = .Pending, and proc.sleep(@intFromPtr(&req)). The ISR sets .Done and proc.wakeups the waiter. The first time the kernel actually blocks on hardware — every prior sleep was process-local (fork/wait, child-of).
// kernel/block.zig — single outstanding fn submit(blk: u32, buf_pa: u32, cmd: u32) { if (req.state != .Idle) panic(...); req.state = .Pending; mmio(REG_SECTOR).* = blk; mmio(REG_BUFFER).* = buf_pa; mmio(REG_CMD).* = cmd; // ← fires IRQ while (req.state != .Done) proc.sleep(@intFromPtr(&req)); if (req.err) panic("block I/O error"); req.state = .Idle; } pub fn isr() { req.err = (mmio(REG_STATUS).* != 0); req.state = .Done; proc.wakeup(@intFromPtr(&req)); }
On top of the driver, fs/bufcache.zig caches 16 4-KB blocks in a doubly-linked LRU list. bget(blk) finds an existing buffer or evicts the LRU tail; if the buffer is busy, the caller sleeps on @intFromPtr(&buf) until the holder calls brelse. bread = bget + (if !valid) block.read; brelse bumps LRU position, decrements refs, wakes sleepers.
Two channels per buffer: one for the I/O wait (&req), one for the bget contention wait (&buf). Both ride the same sleep/wakeup primitive 3.C built.
csrs SIE; csrc SIE window so the block IRQ can fire. A separate s_kernel_trap_entry vector handles S-from-S traps — the user vector's sscratch/sp dance would have clobbered the sleeping proc's tf.fs/The on-disk layout is xv6-shaped: 4 KB blocks, a superblock at block 1, a bitmap at block 2, four inode blocks (NINODES=64 × 64 B = 4 KB each), then data blocks. Inodes have 12 direct + 1 indirect addresses (max ~4 MB per file — plenty for 3.D's /etc/motd and /bin/init). The kernel side stacks six modules in fs/: layout (constants + structs) → balloc (bitmap; write-side waits for 3.E) → inode (NINODE=32 in-memory cache + iget/ilock/iput + bmap + readi) → dir (dirlookup) → path (namei walks left-to-right from root or cwd).
on-disk layout · 4 MB · 1024 blocks block 0 boot sector (zeros) block 1 SuperBlock (magic + counts) block 2 bitmap (1 bit / data block) blocks 3..6 inode table (64 inodes × 64 B) blocks 7.. data blocks DiskInode (64 bytes) type · nlink · size · addrs[12+1] read stack read(fd, buf, n) ─► file.read(idx) ─► inode.readi(ip, dst, off, n) ─► loop bmap → bread → memcpy → brelse bread = bget + block.read block.read = submit + sleep
// fs/path.zig — iterative left-to-right fn nameix(path: []const u8) ?*InMemInode { var cur = startInode(path); // "/"=root var off: u32 = 0; while (true) { const step = nextComponent(path, off) orelse return cur; ilock(cur); if (cur.dinode.type != .Dir) return putAndFail(cur); const child = dirlookup(cur, step.comp) orelse return putAndFail(cur); iunlock(cur); const next = iget(child); iput(cur); cur = next; off = step.next; } }
17 (getcwd) · 49 (chdir) · 56 (openat) · 57 (close) · 62 (lseek) · 63 (read) · 80 (fstat). Per-process ofile[16] + cwd on Process; fork file.dup's every fd, exit file.close's them. Write side, mkdir, unlink — all 3.E./bin/init opens /etc/motd, writes it to fd 1, exits 0proc.exec stops calling boot_config.lookupBlob. In the FS-mode kernel it now does namei(path) → readi → elfload.load against a 64 KB .bss scratch buffer. A new host tool, mkfs.zig, walks --root + --bin directory trees and lays out a 4 MB image with the canonical superblock + bitmap + inode table + data layout. The build runs it to produce zig-out/fs.img; kernel-fs.elf boots against that image and the on-disk /bin/init (a tiny naked-asm Zig program) is the first thing it execs.
build pipeline fs_init.elf kernel/user/fs_init.zig ─► stage as /bin/init motd kernel/userland/fs/etc/motd ─► stage as /etc/motd mkfs --root --bin --out ─► zig-out/fs.img (4 MB) kernel-fs.elf FS_DEMO=true ─► boot ─► kmain.FS_DEMO arm bufcache.init · inode.init PLIC enable IRQ #1 proc.alloc PID 1 proc.exec("/bin/init", NULL) namei ─► readi ─► elfload.load schedule ─► sret to U PID 1 · /bin/init runs: openat(0, "/etc/motd", 0) read(fd, buf, 256) write(1, buf, n) close(fd) · exit(0)
// kernel/user/fs_init.zig — naked _start export fn _start() callconv(.naked) noreturn { asm volatile ( \\ // openat(0, &path, 0) \\ li a7, 56 \\ li a0, 0 \\ la a1, path \\ li a2, 0 \\ ecall \\ bltz a0, fail \\ mv s1, a0 \\ // read(fd, buf, 256) \\ li a7, 63 \\ mv a0, s1 \\ la a1, buf \\ li a2, 256 \\ ecall \\ // write(1, buf, n) \\ li a7, 64 \\ li a0, 1 \\ ... \\ // close · exit(0) ); }
ccc --disk fs.img kernel-fs.elf → stdout contains hello from phase 3\n (motd content) + the canonical PID-1 ticks observed: N\n trailer; exit 0. Getting here surfaced nine subtle bugs along the way — most memorably, file.read's 4 KB stack-allocated kbuf overflowing the per-process kernel stack into the adjacent scheduler stack page (var x = undefined's 0xaa fill made the corruption pattern unmissable). Hoisted to a .bss static; problem solved.writei + ialloc + itrunc + a real dirlinkPlan 3.D shipped the read side; 3.E completes the FS by adding the write side. inode.writei mirrors readi but flips a for_write flag through bmap so missing direct/indirect blocks get lazily allocated via balloc. iupdate flushes the in-memory dinode back to its inode-table block. ialloc claims the first free inode slot. itrunc frees direct + indirect blocks, called from iput when ref+nlink hit zero. dirlink stops being a stub — it scans for an empty slot or appends — and a new dirunlink zeros the directory entry. A new fs/fsops.zig wraps the create + unlink pipelines so the syscall layer stays thin.
// fs/inode.zig — bmap with for_write hint fn bmap(ip: *InMemInode, bn: u32, for_write: bool) ?u32 { if (bn < NDIRECT) { var a = ip.dinode.addrs[bn]; if (a == 0 and for_write) { a = balloc.alloc() orelse return null; ip.dinode.addrs[bn] = a; iupdate(ip); // flush } return if (a == 0) null else a; } // indirect: same shape, one extra bread ... } // fsops.zig — the create pipeline pub fn create(path: []const u8, t: InodeType) ?*InMemInode { const parent = nameiparent(path) orelse ...; const name = lastComponent(path); if (dirlookup(parent, name)) |existing| return existing; // idempotent const ip = ialloc(t) orelse return null; dirlink(parent, name, ip.inum); return ip; }
New: mkdirat (#34) · unlinkat (#35).
Extended: openat with O_CREAT / O_TRUNC / O_APPEND.
Refactored: sysWrite stops short-circuiting straight to UART — it routes any fd through file.write, which dispatches to console.write (for the new Console-typed entries on fd 0/1/2) or writei (for inode-backed fds).
The file.write reroute means every previous build that hadn't wired Console fds suddenly needs them — a regression caught by e2e-kernel + e2e-multiproc-stub going silent until kmain's Phase 2 branch grew the same 3-line console install the FS branch already had.
iput-on-zero is the last piece: when a file's ref + nlink hit zero, iput calls itrunc, frees the inode, and writes a zero type back through iupdate. Combined with unlinkat's dirunlink + nlink--, an rm /tmp/x actually returns the data blocks + inode slot to the pool.console.zig is the new fd 0/1/2 backing. UART RX (PLIC source #10) drives uart.isr, which drains the FIFO via console.feedByte: cooked mode echoes the byte, handles backspace / ^U line-kill / ^C-via-proc.kill / ^D EOF, and on \n commits the line + wakes any sleepers. Raw mode is wired (the console_set_mode syscall lands real this time) but only exercised by 3.F's editor. The trap dispatcher gains an IRQ_UART_RX = 10 arm that mirrors the block IRQ shape — claim → uart.isr → complete.
the input loop · why WFI matters host stdin ─► --input bytes (one keystroke at a time) ─► cpu.idleSpin (only fires on WFI) rx_pump.drainOne ─► uart.pushRx PLIC src 10 ─► s_trap_dispatch (cause 9) ─► uart.isr ─► console.feedByte echo ─► input.buf[] on \n: wakeup(&input.r) sh resumes in console.read scheduler idle path (3.D shape) csrs SIE // open SIE briefly csrc SIE // close it scheduler idle path (3.E) csrs SIE wfi // idleSpin runs csrc SIE
Adding wfi broke e2e-fs hard. cpu.step's prologue raises any pending S-trap before the next instruction is fetched. With SIE = 1, the pending PLIC IRQ traps before wfi is fetched; sret returns to wfi (sepc = wfi_addr); the same trap fires again. Forever.
Two fixes broke the loop: s_kernel_trap_dispatch advances sepc by 4 (the trap site is always the 4-byte WFI), and idleSpin drains one byte per iteration AFTER checking interrupts (so disk-I/O sleeps, where a block IRQ is already pending, don't leak --input bytes into the FIFO mid-boot — drain only happens when the guest is truly idle).
sh can print its second-and-subsequent prompts. One byte per WFI iteration mirrors interactive-tty cadence, and the harness sees $ ls /bin\n$ echo hi > /tmp/x\n… in the right order.init_shell forks sh, sh forks catA small userland stdlib lands at src/kernel/user/lib/: start.S parses argc/argv from the System-V tail and calls main; usys.S emits 19 syscall stubs (li a7; ecall; ret); ulib.zig bundles mem*/str* + Stat + O_* constants; uprintf.zig is a 90-line printf(fd, fmt, args). An addUserBinary build helper bolts these onto every userland binary identically. Seven binaries ship: init_shell (loops fork-exec-sh-wait, exits cleanly when sh status is 0), sh (line read + token split + redirect </>/>> + builtins cd/pwd/exit + fork+exec, ~250 LoC), and the four utilities ls / cat / echo / mkdir / rm.
// kernel/user/sh.zig — main loop export fn main(argc: u32, argv: [*]const [*:0]const u8) i32 { while (true) { const prompt: [2]u8 = .{ '$', ' ' }; _ = ulib.write(1, &prompt, 2); const got = ulib.getline(0, &line_buf, LINE_MAX); if (got <= 0) return 0; const parsed = parseLine(&line_buf, got); runCommand(&parsed); // fork+exec or builtin } }
milestone session $ ccc --input shell_input.txt \ --disk shell-fs.img kernel-fs.elf $ ls /bin . .. cat init echo sh mkdir ls rm $ echo hi > /tmp/x $ cat /tmp/x hi $ rm /tmp/x $ exit ticks observed: 6
shell-fs.img is the parallel image baking init_shell as /bin/init + every utility under /bin/. mkfs.zig learned --init (override) and walks every top-level subdir of --root (so the empty /tmp/ staging carries through). The e2e-shell harness pipes the canonical session through --input and asserts each prompt + command echo + the hi round-trip + the clean halt. This is our first interactive shell.Plan 3.E wired raw-mode in the kernel but never exercised it. Plan 3.F finally does, with a cursor-moving edit binary: console_set_mode(1) on entry, console_set_mode(0) on exit, a 16 KB content buffer with a single byte-offset cursor, and a redraw-on-every-keystroke ANSI loop dispatching ESC [ A/B/C/D arrows + printables + backspace + ^S save + ^X exit. e2e-editor drives the canonical session via a 43-byte binary fixture. e2e-persist proves the kernel's bwrite path actually persists by running ccc twice on the same disk image and asserting pass 2 sees pass 1's writes. e2e-cancel closes the last DoD bullet — pipes cat\n\x03exit\n and asserts cat\n^C\n$ exit appears in stdout, exercising the full kill-flag chain (console.feedByte(0x03) → proc.kill(fg_pid) → killed flag → console.read returns -1 → syscall dispatch calls proc.exit).
kernel.elf + shell-fs.img boot to a shell, run our own programs, edit files interactively, observe changes survive emulator restarts, ^C cancels foreground programs. Next: Phase 4 (network stack) → Phase 5 (HTTP/1.0 + text browser) → Phase 6 (CPython 3.12 port) → Phase 7 (framebuffer + compositor).Phase 1 asked: can we run hello world end-to-end on an emulator we wrote? Yes. Phase 2 asked: can we host a real OS kernel? Also yes — three privilege levels, Sv32 paging, trap delegation, async timers, a one-process scheduler stub, and a user program that prints how many times the timer fired while it ran. Phase 3 is a multi-process OS with a filesystem and a shell — Plan 3.A landed the emulator-side hardware (PLIC, block device, UART RX); Plan 3.B landed the kernel-side multi-process foundation (free-list page_alloc, ptable[16], round-robin scheduler with swtch, kernel-side ELF loader); Plan 3.C landed the full Unix-shaped process lifecycle — fork, execve, wait4, exit, and a kill-flag — with init forking /bin/hello end-to-end on real ELFs; Plan 3.D landed the read-side filesystem — bufcache, block driver with sleep-on-IRQ, inode cache + bmap + readi, namei, file table, seven new syscalls, and a host-side mkfs that builds a 4 MB image, with /bin/init loaded from disk; Plan 3.E just landed the write side + console + shell — writei/ialloc/itrunc/dirlink, cooked-mode line discipline on fd 0/1/2 driven by UART RX through PLIC IRQ #10, a small userland stdlib, and seven user binaries including a real sh. The OS finally has an interactive prompt. Plan 3.F wrapped Phase 3 — a cursor-moving edit binary that exercises the raw-mode console arm 3.E left wired-but-unexercised, plus an e2e-persist test that proves on-disk writes survive emulator restart. Phase 3 §Definition of Done holds: boot to a shell, run our own programs, edit files interactively, observe changes persist across reboots.
Process struct · sched stub · yield · timer tick counter · qemu-diff-kernelwfi idle · --disk/--input · e2e-plic-blockpage_alloc · Context + swtch.S · ptable[16] + Cpu · round-robin scheduler · kernel-side ELF loader · getpid/sbrk/yield · PID 1 + PID 2 · e2e-multiproc-stubvm.copyUvm/unmapUser · real proc.free · sleep/wakeup · fork/execve/wait4/exit · killed-check on syscall return · init forks /bin/hello · e2e-forkbmap/readi · namei · NFILE=64 file table + per-proc ofile[16]/cwd · 7 new syscalls (openat/close/read/lseek/fstat/chdir/getcwd) · mkfs + 4 MB fs.img · separate kernel trap vector for the scheduler SIE window · /bin/init from disk · e2e-fswritei/bmap lazy alloc · iupdate/ialloc/itrunc · dirlink/dirunlink · fsops.create/unlink · mkdirat/unlinkat + openat O_CREAT/O_TRUNC/O_APPEND · console fd 0/1/2 cooked-mode line discipline · UART RX via PLIC #10 · scheduler WFI + sepc+4 + paced rx_pump · userland stdlib (start.S/usys.S/ulib/uprintf) · init_shell/sh/ls/cat/echo/mkdir/rm · shell-fs.img · e2e-shelledit.zig · raw-mode console exercised · 16 KB content buffer · ESC [ A/B/C/D arrow parser · ANSI redraw (clear + home + buffer + position) · ^S save (close + O_TRUNC re-open) · ^X exit + cooked restore · e2e-editor (43-byte fixture: edit → 2× right → Y → ^S^X → cat asserts heYllo from phase 3) · e2e-persist (two ccc passes on copied image; pass 2 sees pass 1's writes) · e2e-cancel (10-byte fixture cat\n\x03exit\n proves ^C kill-flag end-to-end)