WAFER Architecture Reference (updated 2026-04-16)
=================================================

WAFER = WebAssembly Forth Engine in Rust. Optimizing Forth-2012 compiler
that emits WASM at run time. Each colon definition becomes its own WASM
module that shares memory, globals, and a function table with every other
word.


1. COMPILATION PIPELINE
-----------------------

  Forth Source
       |
       v
  Outer Interpreter (outer.rs)
  +--------------------------------------------+
  | Tokenizer: whitespace-delimited words      |
  | For each token:                            |
  |   1. Dictionary lookup (HashMap + wordlist |
  |      search order)                         |
  |   2. Found + interpret mode: EXECUTE       |
  |   3. Found + compile mode:                 |
  |      - IMMEDIATE? Execute now              |
  |      - Normal? Append Call(WordId) to IR   |
  |   4. Not found: try parse as number        |
  |      - Interpret: push to data stack       |
  |      - Compile: append PushI32/64/F64      |
  |   5. Neither: error "unknown word"         |
  | Special cases handled here, not via IR:    |
  |   defining words (CREATE, VARIABLE, :),    |
  |   DOES> dispatch, S" / ." string parsing,  |
  |   {: ... :} locals, [: ... ;] quotations.  |
  +--------------------------------------------+
       |
       |  On `;` (end of colon definition):
       v
  Optimizer (optimizer.rs): IR -> IR
  +--------------------------------------------+
  | Phase 1 simplify:                          |
  |   peephole -> fold -> strength -> peephole |
  | Phase 2 inline (max 8 ops) then re-simpl.: |
  |   inline -> peephole -> fold -> strength   |
  |   -> peephole                              |
  | Phase 3 dead code: dce -> peephole         |
  | Phase 4 tail calls (must be last)          |
  | Total peephole passes: 5                   |
  +--------------------------------------------+
       |
       v
  Codegen (codegen.rs): IR -> WASM bytes
  +--------------------------------------------+
  | wasm-encoder builds one module per word.   |
  | Function locals (laid out in order):       |
  |   0     cached DSP (i32)                   |
  |   1..s  scratch i32 (or promoted           |
  |         stack-to-local slots)              |
  |   s..f  Forth locals from {: ... :}        |
  |         (i32 then f64)                     |
  |   f..l  loop locals: 2 per nested          |
  |         DO/?DO (index, limit)              |
  | DSP write-back before every Call,          |
  | reload after; keeps host functions and     |
  | call_indirect targets coherent.            |
  | Stack-to-local promotion (codegen flag):   |
  |   straight-line + simple control flow      |
  |   words skip the linear-memory data stack  |
  |   entirely; values stay in WASM locals.    |
  +--------------------------------------------+
       |
       v
  Runtime trait (runtime.rs): execution backend
  +--------------------------------------------+
  | ForthVM<R: Runtime> generic over backend.  |
  | Runtime owns:                              |
  |   - shared linear memory (16 pages init)   |
  |   - shared funcref table (grows on demand) |
  |   - 3 mutable i32 globals (dsp/rsp/fsp)    |
  |   - emit() import bound to output buffer   |
  | Runtime methods:                           |
  |   mem_read/write_{i32,u8,slice}            |
  |   get/set_{dsp,rsp,fsp}                    |
  |   ensure_table_size(n)                     |
  |   instantiate_and_install(wasm, fn_index)  |
  |   call_func(fn_index)                      |
  |   register_host_func(fn_index, HostFn)     |
  |                                            |
  | HostAccess trait: same memory/global ops   |
  |   exposed to host-fn callbacks; lets one   |
  |   HostFn closure run on either runtime.    |
  | HostFn = Box<dyn Fn(...) -> Result<()>     |
  |          + Send + Sync>                    |
  +--------------------------------------------+
       |                          |
       v                          v
  NativeRuntime               WebRuntime
  (runtime_native.rs,         (crates/web/src/
   feature = "native")         runtime_web.rs)
  +------------------+        +------------------+
  | wasmtime Engine, |        | js_sys WebAsm    |
  | Store, Memory,   |        | Memory, Table,   |
  | Table, Globals,  |        | Global, JS       |
  | Func closures    |        | Closures         |
  +------------------+        +------------------+


2. MEMORY LAYOUT (linear memory, single shared instance)
--------------------------------------------------------

  Address   Region              Size     Notes
  --------  ------------------  -------  --------------------------
  0x0000    System Variables    64 B     STATE, BASE, >IN, HERE,
                                         LATEST, SOURCE-ID, #TIB,
                                         HLD, LEAVE-FLAG
  0x0040    Input Buffer (TIB)  1024 B   Source line being parsed
  0x0440    PAD                 256 B    Scratch for string ops
  0x0540    Pictured Output     128 B    <# ... #> (HLD grows down)
  0x05C0    WORD Buffer         64 B     Transient counted string
  0x0600    Data Stack          4096 B   1024 cells, grows DOWN
                                         ^ DSP starts at top = 0x1600
  0x1600    Return Stack        4096 B   Grows DOWN
                                         ^ RSP starts at top = 0x2600
  0x2600    Float Stack         2048 B   256 doubles, grows DOWN
                                         ^ FSP starts at top = 0x2E00
  0x2E00    Hash Scratch        128 B    SHA1/256/512 output
  0x2E80    Dictionary          grows UP Linked list of entries

  Constants from crates/core/src/memory.rs (authoritative):

    SYSVAR_BASE         0x0000   size 64
    INPUT_BUFFER_BASE   0x0040   size 1024
    PAD_BASE            0x0440   size 256
    PICT_BUF_BASE       0x0540   size 128
    WORD_BUF_BASE       0x05C0   size 64
    DATA_STACK_BASE     0x0600   size 4096  (DATA_STACK_TOP   = 0x1600)
    RETURN_STACK_BASE   0x1600   size 4096  (RETURN_STACK_TOP = 0x2600)
    FLOAT_STACK_BASE    0x2600   size 2048  (FLOAT_STACK_TOP  = 0x2E00)
    HASH_SCRATCH_BASE   0x2E00   size 128
    DICTIONARY_BASE     0x2E80   grows up to memory.len()

  (Some inline `// 0x...` comments in memory.rs are stale. The values
  above are correct: each const is derived from the previous base plus
  its size.)

  Total initial memory: 16 pages = 1 MiB (max 256 pages = 16 MiB).
  Cell size: 4 bytes (i32). Float size: 8 bytes (f64).

  Stack layout note: the linear-memory data and float stacks are the
  fallback used whenever codegen can't keep values in WASM locals.
  After stack-to-local promotion, many words touch DSP only on
  entry/exit.


3. SYSTEM VARIABLES (offsets from 0x0000)
-----------------------------------------

  Offset  Name        Purpose
  ------  ----------  -----------------------------------
  0       STATE       0=interpreting, -1=compiling
  4       BASE        Number base (default 10)
  8       >IN         Parse offset into input buffer
  12      HERE        Next free dictionary address
  16      LATEST      Most recent dictionary entry addr
  20      SOURCE-ID   0=user input, -1=string, fileid>0
  24      #TIB        Length of current input
  28      HLD         Pictured numeric output pointer
  32      LEAVE-FLAG  Nonzero when LEAVE called in loop


4. DICTIONARY (dictionary.rs)
-----------------------------

  Entry layout in linear memory:

  +--------+-------+----------+---------+-----------+----------+
  | Link   | Flags | Name     | Padding | Code      | Param    |
  | 4 B    | 1 B   | N B      | 0-3 B   | 4 B       | optional |
  +--------+-------+----------+---------+-----------+----------+
  ^                                     ^
  entry_addr                            code field (fn-table idx)

  Flags byte:
    Bit 7 (0x80): IMMEDIATE
    Bit 6 (0x40): HIDDEN (during compilation)
    Bits 0-4    : name length (max 31)

  Link points to the previous entry (0 = end of list).
  Name is stored uppercase, padded to 4-byte alignment.
  Code field: index into the shared WASM function table.
  Parameter field follows the code field for CREATE'd / DOES> /
  VARIABLE / CONSTANT bodies.

  Lookup is NOT linear: dictionary.rs maintains a HashMap index from
  name -> Vec<(wid, addr, fn_index, immediate)>. Each entry is tagged
  with its wordlist id; resolution walks the current search order.

  Wordlists / Search-Order: wordlist ids are u32; the FORTH wordlist is
  id 1. `current_wid` selects where new definitions land;
  `search_order` is the lookup chain (top first). Implements the
  Forth-2012 Search-Order word set.


5. WORD CATEGORIES
------------------

  a) IR Primitives: register_primitive("DUP", false, vec![IrOp::Dup])
     - Body stored as Vec<IrOp>
     - Optimized, then compiled to WASM
     - Inlineable by optimizer
     - Batched at boot: ~110 primitive registrations compiled into a
       single WASM module to amortize instantiation cost

  b) Host Functions: register_host_primitive(".", false, func)
     - HostFn = Box<dyn Fn(...) -> Result<()> + Send + Sync>
     - Access memory/globals via HostAccess trait
     - NOT inlineable
     - Used for I/O, dictionary manipulation, complex stack ops
     - Same closure runs on NativeRuntime and WebRuntime

  c) Forth-defined words: `: SQUARE DUP * ;`
     - Compiled by the outer interpreter
     - Goes through the full optimize -> codegen pipeline
     - Stored in `ir_bodies` for future inlining

  d) Special interpreter tokens (immediate, with custom parsing)
     - Defining words: CREATE, VARIABLE, CONSTANT, :, ;, DOES>
     - String literals: S", ."
     - Control structures: IF/ELSE/THEN, BEGIN/UNTIL/WHILE/REPEAT,
       DO/?DO/LOOP/+LOOP, [: ... ;] quotations, {: ... :} locals
     - CONSOLIDATE
     Their body-collection / dictionary-side-effect logic lives
     directly in compile_token / interpret_token_immediate. They still
     emit IR ops (e.g. IrOp::If, IrOp::DoLoop, IrOp::ForthLocalGet).
     The difference is that they are NOT registered via
     register_primitive; the outer interpreter handles them as special
     syntax.


6. WASM MODULE STRUCTURE (per JIT-compiled word)
------------------------------------------------

  Imports (6), provided by the Runtime impl:
    0. emit    (func: i32 -> void)  Character output callback
    1. memory  (memory: 16 pages)   Shared linear memory
    2. dsp     (global: mut i32)    Data stack pointer
    3. rsp     (global: mut i32)    Return stack pointer
    4. fsp     (global: mut i32)    Float stack pointer
    5. table   (table: funcref)     Shared function table

  Types: () -> () for word bodies; (i32) -> () for emit.

  Functions (1): the compiled word body, typed () -> ().

  Element section: table[base_fn_index] = function 1
  (function index 0 is the imported emit).

  Runtime::instantiate_and_install(wasm_bytes, fn_index):
    - NativeRuntime: wasmtime Module::new + Instance::new with the
      6 imports above
    - WebRuntime: WebAssembly.instantiate with JS import objects
      pulled from the shared WaferRepl state


7. IR OPS (ir.rs, IrOp enum)
----------------------------

  Stack:          Drop, Dup, Swap, Over, Rot, Nip, Tuck, TwoDup, TwoDrop
  Literals:       PushI32, PushI64, PushF64
  Arithmetic:     Add, Sub, Mul, DivMod, Negate, Abs
  Compare:        Eq, NotEq, Lt, Gt, LtUnsigned, ZeroEq, ZeroLt
  Logic:          And, Or, Xor, Invert, Lshift, Rshift, ArithRshift
  Memory:         Fetch, Store, CFetch, CStore, PlusStore
  Control:        Call, TailCall, Exit, If{then, else?},
                  DoLoop{body, is_plus_loop}, BeginUntil, BeginAgain,
                  BeginWhileRepeat, BeginDoubleWhileRepeat,
                  LoopRestartIfFalse,
                  Block(label), BranchIfFalse(label), EndBlock(label)
                    -- for CS-ROLL'd patterns
  Return stack:   ToR, FromR, RFetch, LoopJ
  Forth locals:   ForthLocalGet/Set, ForthFLocalGet/Set
  I/O:            Emit, Dot, Cr, Type
  System:         Execute, SpFetch
  Float stack:    FDup, FDrop, FSwap, FOver
  Float math:     FAdd, FSub, FMul, FDiv, FNegate, FAbs, FSqrt,
                  FMin, FMax, FFloor, FRound
  Float compare:  FZeroEq, FZeroLt, FEq, FLt
  Float memory:   FetchFloat, StoreFloat
  Conversion:     StoF, FtoS


8. OPTIMIZATION PASSES (detail)
-------------------------------

  PEEPHOLE (5x across pipeline):
    PushI32(n), Drop   -> (removed)   Unused literal
    Dup, Drop          -> (removed)   Redundant copy
    Swap, Swap         -> (removed)   Self-inverse
    Swap, Drop         -> Nip         Combine
    PushI32(0), Add    -> (removed)   Identity
    PushI32(0), Or     -> (removed)   Identity
    PushI32(-1), And   -> (removed)   Identity
    PushI32(1), Mul    -> (removed)   Identity
    Over, Over         -> TwoDup      Combine
    Drop, Drop         -> TwoDrop     Combine
    Float variants: PushF64(_), FDrop / FDup, FDrop /
                    FSwap, FSwap / FNegate, FNegate

  CONSTANT FOLD:
    Binary i32: PushI32(a), PushI32(b), <op> -> PushI32(r)
      where <op> is one of Add, Sub, Mul, And, Or, Xor, Lshift,
      Rshift, ArithRshift, Eq, NotEq, Lt, Gt, LtUnsigned
    Unary i32:  Negate, Abs, Invert, ZeroEq, ZeroLt
    Float binary/unary equivalents on PushF64.

  STRENGTH REDUCE:
    PushI32(2^n), Mul  -> PushI32(n), Lshift
    PushI32(0), Eq     -> ZeroEq
    PushI32(0), Lt     -> ZeroLt

  DCE:
    PushI32(nonzero), If{then,else}  -> then_body only
    PushI32(0), If{then,else}        -> else_body only
    Everything after Exit            -> removed

  INLINE (max 8 ops, single pass):
    Call(id) -> body if all of:
      - body length <= 8 ops
      - no self-recursion
      - no Exit (would return from caller)
      - no ForthLocalGet/Set (would collide with caller locals)
    TailCall -> Call when inlined (no longer tail position)

  TAIL CALL (must be the last pass):
    trailing Call(id) -> TailCall(id) if the return stack is balanced
    (equal ToR / FromR pairs). Recurses into If branches for
    conditional tail calls.

  STACK-TO-LOCAL PROMOTION (codegen pass, not optimizer):
    Words whose effects on the data stack can be statically tracked
    are compiled to use WASM locals 1..s instead of DSP loads/stores.
    Triggered by `is_promotable(body)`. DSP is still written back
    before any Call so callees and host functions see a consistent
    stack.


9. CONSOLIDATION (consolidate.rs + codegen.rs)
----------------------------------------------

  CONSOLIDATE recompiles every JIT-compiled word into ONE WASM module:
    - All call_indirect to consolidated words become direct `call`
      (single-module direct calls)
    - External calls (host functions) stay call_indirect
    - Removes per-word instantiation overhead and lets the WASM engine
      inline / specialize across word boundaries

  Two parts:
    codegen::compile_consolidated_module()
        Builds the multi-function module.
    outer::ForthVM::consolidate()
        Collects ir_bodies, computes table layout, compiles,
        instantiates, and patches the shared function table.


10. EXPORT PIPELINE (`wafer build`)
-----------------------------------

  export.rs::export_module() steps:
    1. Evaluate the source file with recording_toplevel = true
    2. Collect every IR word + recorded top-level IR
    3. Resolve entry point (priority):
         --entry > MAIN > synthetic _start from the recorded top-level
    4. Snapshot WASM linear memory (system vars + dictionary + any
       user data)
    5. Walk the IR, find every Call/TailCall to a host word not in the
       consolidated set: those become required imports of the exported
       module
    6. Build metadata (JSON, custom "wafer" section): version,
       entry_table_index, host_functions, memory_size,
       dsp/rsp/fsp_init
    7. compile_exportable_module() emits the final WASM with a passive
       data section seeded from the memory snapshot
    8. Optional --js: also emit a JS loader + minimal HTML
    9. Optional --native: AOT-compile and append to the wafer binary
       itself, in this layout:
         [wafer ELF/Mach-O][precompiled WASM][metadata]
         [trailer: payload_len(8) | metadata_len(8) | "WAFEREXE"]
       The CLI detects the trailer at startup and runs the embedded
       payload directly (single-file distribution).


11. CRATE STRUCTURE
-------------------

  crates/
    core/   wafer-core: compiler, optimizer, codegen, dictionary,
            runtime trait, outer interpreter.
            Largest file: codegen.rs (~4.3k LOC).
            Feature flags:
              default = ["native"]
              "native"     pulls in wasmtime + NativeRuntime +
                           runner.rs (CLI executor) + export.rs
              "crypto"     enables SHA1/256/512 host words
              No features: pure-Rust core for wafer-web (dictionary,
                           IR, optimizer, codegen, outer interpreter
                           only)
    cli/    wafer: rustyline REPL + `wafer build` / `wafer run`
    web/    wafer-web: browser REPL.

  Key web files:
    crates/web/src/lib.rs          WaferRepl wasm-bindgen entry
    crates/web/src/runtime_web.rs  WebRuntime: js_sys WebAssembly
    crates/web/www/app.js          Frontend (terminal emulation)
    crates/web/www/index.html      HTML shell
    crates/web/www/style.css       Styling
    crates/web/www/pkg/            wasm-pack output (gitignored)


12. BOOT SEQUENCE
-----------------

  ForthVM::<R>::new() ->
    1. R::new(): create runtime (wasmtime or browser WASM)
    2. register_primitives() in batch_mode = true:
       - ~110 IR primitive registrations (DUP, +, @, ...)
       - ~87 host primitive registrations (., .S, M*, ACCEPT, ...)
       - special interpreter tokens (IF, DO, :, VARIABLE, S", {: :},
         [: ;], CONSOLIDATE, ...) handled directly in
         interpret_token_immediate / compile_token; they are not
         registered via register_primitive (see 5d)
    3. Word-set registrations: core, double, exception, facility,
       file (subset), floating-point, locals, memory, search-order,
       programming-tools, string, optional crypto
    4. batch_compile_deferred(): single WASM module for all deferred
       IR primitives
    5. Load boot.fth (include_str!), evaluated line by line so `\`
       comments terminate at end-of-line:
         Phase 1: stack/memory (DEPTH, PICK, 2OVER, FILL, MOVE,
                  CMOVE, /STRING, -TRAILING)
         Phase 2: double-cell arithmetic (D+, DNEGATE, D<, D=)
         Phase 3: mixed arithmetic (SM/REM, FM/MOD, */, */MOD)
         Phase 4: HERE, ALLOT, comma, ALIGN, ALIGNED
         Phase 5: I/O + pictured output (., U., TYPE, <# # #>, SIGN,
                  HOLD)
         Phase 6: DEFER support (DEFER, IS, ACTION-OF)
         Phase 7: more replacements (COMPARE, SOURCE, FALIGNED,
                  DFALIGN, structures, S" hint, ...)


13. RUNTIME-VS-EXPORT NOTE
--------------------------

  Two separate codegen entry points produce multi-function WASM
  modules from the same IR:

    compile_consolidated_module()   used by CONSOLIDATE
      - Targets the live runtime
      - Re-uses the shared globals/table/memory imports
      - External calls remain call_indirect

    compile_exportable_module()     used by `wafer build`
      - Targets a standalone module
      - Carries its own memory (passive data section seeded from the
        snapshot) and embeds metadata
      - Required host functions become imports the runner (or AOT
        loader) must satisfy

  Both share the same per-IrOp lowering helpers; the difference is in
  module-level wiring.
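

  Worked check of the section 2 layout. Section 2 says the memory.rs
  constants are derived rather than hard-coded, so the sketch below
  rebuilds each base as "previous base + previous size" and asserts the
  totals at compile time. It is an illustration, not a copy of
  memory.rs: the *_BASE and *_TOP names come from the table, while the
  *_SIZE names, CELL_SIZE, FLOAT_SIZE, WASM_PAGE_SIZE, INITIAL_PAGES and
  the three helper functions are assumed names introduced here.

```rust
// Sketch only. *_BASE / *_TOP names and all sizes are taken from the
// section 2 table; the *_SIZE names and helpers are hypothetical and the
// real crates/core/src/memory.rs may declare things differently.

pub const SYSVAR_BASE: u32 = 0x0000;
pub const SYSVAR_SIZE: u32 = 64;
pub const INPUT_BUFFER_BASE: u32 = SYSVAR_BASE + SYSVAR_SIZE;
pub const INPUT_BUFFER_SIZE: u32 = 1024;
pub const PAD_BASE: u32 = INPUT_BUFFER_BASE + INPUT_BUFFER_SIZE;
pub const PAD_SIZE: u32 = 256;
pub const PICT_BUF_BASE: u32 = PAD_BASE + PAD_SIZE;
pub const PICT_BUF_SIZE: u32 = 128;
pub const WORD_BUF_BASE: u32 = PICT_BUF_BASE + PICT_BUF_SIZE;
pub const WORD_BUF_SIZE: u32 = 64;

// Stacks grow down, so each stack pointer starts at its region's TOP,
// which is also the BASE of the next region.
pub const DATA_STACK_BASE: u32 = WORD_BUF_BASE + WORD_BUF_SIZE;
pub const DATA_STACK_SIZE: u32 = 4096;
pub const DATA_STACK_TOP: u32 = DATA_STACK_BASE + DATA_STACK_SIZE;
pub const RETURN_STACK_BASE: u32 = DATA_STACK_TOP;
pub const RETURN_STACK_SIZE: u32 = 4096;
pub const RETURN_STACK_TOP: u32 = RETURN_STACK_BASE + RETURN_STACK_SIZE;
pub const FLOAT_STACK_BASE: u32 = RETURN_STACK_TOP;
pub const FLOAT_STACK_SIZE: u32 = 2048;
pub const FLOAT_STACK_TOP: u32 = FLOAT_STACK_BASE + FLOAT_STACK_SIZE;

pub const HASH_SCRATCH_BASE: u32 = FLOAT_STACK_TOP;
pub const HASH_SCRATCH_SIZE: u32 = 128;
pub const DICTIONARY_BASE: u32 = HASH_SCRATCH_BASE + HASH_SCRATCH_SIZE;

pub const CELL_SIZE: u32 = 4; // i32 cell
pub const FLOAT_SIZE: u32 = 8; // f64 float
pub const WASM_PAGE_SIZE: u32 = 64 * 1024; // fixed by the WASM spec
pub const INITIAL_PAGES: u32 = 16;

// Compile-time checks against the addresses printed in the table.
const _: () = assert!(INPUT_BUFFER_BASE == 0x0040);
const _: () = assert!(PAD_BASE == 0x0440);
const _: () = assert!(PICT_BUF_BASE == 0x0540);
const _: () = assert!(WORD_BUF_BASE == 0x05C0);
const _: () = assert!(DATA_STACK_BASE == 0x0600);
const _: () = assert!(DATA_STACK_TOP == 0x1600);
const _: () = assert!(RETURN_STACK_TOP == 0x2600);
const _: () = assert!(FLOAT_STACK_TOP == 0x2E00);
const _: () = assert!(DICTIONARY_BASE == 0x2E80);

/// Number of i32 cells that fit in the data stack region.
pub const fn data_stack_cells() -> u32 {
    DATA_STACK_SIZE / CELL_SIZE
}

/// Number of f64 values that fit in the float stack region.
pub const fn float_stack_slots() -> u32 {
    FLOAT_STACK_SIZE / FLOAT_SIZE
}

/// Size in bytes of the initial linear memory.
pub const fn initial_memory_bytes() -> u32 {
    INITIAL_PAGES * WASM_PAGE_SIZE
}
```

  Because the checks are `const` assertions, a mismatch between a
  derived base and the documented address fails the build instead of
  leaving a stale comment behind. Calling data_stack_cells(),
  float_stack_slots() and initial_memory_bytes() from a test or a main
  function reproduces the 1024 cells, 256 doubles and 1 MiB quoted in
  section 2.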