From f3bc270904efb07ffb16a190bfa1a74445b48928 Mon Sep 17 00:00:00 2001
From: Oleksandr Kozachuk
Date: Thu, 2 Apr 2026 12:47:50 +0200
Subject: [PATCH] Update all docs to reflect current state

README: 392 tests, 200+ words, 12 word sets, optimization pipeline described
CLAUDE.md: 200+ words, 12 word sets, 392 tests, added optimizer/config/consolidate to key files
OPTIMIZATIONS.md: update all 14 section statuses (12 done, 2 not started)
WAFER.md: correct line counts, add optimizer/config/consolidate/types to project layout, add FSP global
---
 CLAUDE.md             |  7 +++++--
 README.md             | 11 +++++++----
 docs/OPTIMIZATIONS.md | 28 ++++++++++++++--------------
 docs/WAFER.md         | 11 ++++++++---
 4 files changed, 34 insertions(+), 23 deletions(-)

diff --git a/CLAUDE.md b/CLAUDE.md
index 090a164..e28d6d3 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -2,7 +2,7 @@
 
 ## What is WAFER?
 
-WAFER (WebAssembly Forth Engine in Rust) is an optimizing Forth 2012 compiler targeting WebAssembly. Currently a working Forth system with 130+ words, JIT compilation, and 11 word sets at 100% compliance.
+WAFER (WebAssembly Forth Engine in Rust) is an optimizing Forth 2012 compiler targeting WebAssembly. Currently a working Forth system with 200+ words, JIT compilation, 12 word sets at 100% compliance, and a full optimization pipeline (peephole, constant folding, inlining, strength reduction, DCE, tail calls, stack-to-local promotion, consolidation).
 
 ## Architecture
 
@@ -19,6 +19,9 @@ WAFER (WebAssembly Forth Engine in Rust) is an optimizing Forth 2012 compiler ta
 - `crates/core/src/dictionary.rs` -- Dictionary data structure with create/find/reveal
 - `crates/core/src/ir.rs` -- IrOp enum (the intermediate representation)
 - `crates/core/src/memory.rs` -- Memory layout constants (stack regions, dictionary base, etc.)
+- `crates/core/src/optimizer.rs` -- IR optimization passes (peephole, fold, inline, DCE, etc.)
+- `crates/core/src/config.rs` -- WaferConfig: unified optimization configuration
+- `crates/core/src/consolidate.rs` -- Consolidation recompiler (single-module direct calls)
 - `crates/cli/src/main.rs` -- CLI REPL with rustyline
 
 ## Adding a New Word
@@ -51,7 +54,7 @@ Handle in `interpret_token_immediate()` or `compile_token()` as a special case.
 
 ## Testing
 
-- Run `cargo test --workspace` before committing (currently 261 unit + 11 compliance tests)
+- Run `cargo test --workspace` before committing (currently 392 tests: 380 unit + 1 benchmark + 11 compliance)
 - Forth 2012 compliance: `cargo test -p wafer-core --test compliance`
 - Test helper in outer.rs: `eval_output("forth code")` returns printed output as String
 - Test helper: `eval_stack("forth code")` returns data stack as Vec
diff --git a/README.md b/README.md
index 57d5dc7..7a7e774 100644
--- a/README.md
+++ b/README.md
@@ -6,7 +6,7 @@ An optimizing Forth 2012 compiler targeting WebAssembly.
 
 ## Status
 
-WAFER is a working Forth system. It JIT-compiles each word definition to a separate WASM module and executes via `wasmtime`. 310 tests passing (299 unit + 11 compliance), **0 errors on all 12 tested Forth 2012 word sets** including Floating-Point.
+WAFER is a working Forth system with an optimizing compiler. It JIT-compiles each word definition to a separate WASM module and executes via `wasmtime`. 392 tests passing (380 unit + 1 benchmark + 11 compliance), **0 errors on all 12 tested Forth 2012 word sets** including Floating-Point.
 
 **Working features:**
 
@@ -50,7 +50,7 @@ Forth Source -> Outer Interpreter -> IR -> [Optimize] -> WASM Codegen (wasm-enco
 
 - **Subroutine threading** via WASM function tables and `call_indirect`
 - **JIT mode**: each new word compiles to a separate WASM module linked to shared memory/globals/table
-- **IR-based pipeline** enables future optimization passes before WASM emission
+- **IR-based pipeline** with 6 optimization passes (peephole, constant folding, strength reduction, DCE, tail call detection, inlining) plus stack-to-local promotion and consolidation
 - **Dictionary**: linked-list word headers in simulated linear memory
 
 ## Building
@@ -75,12 +75,15 @@ echo ': SQUARE DUP * ; 7 SQUARE .' | cargo run -p wafer
 ## Testing
 
 ```bash
-# All tests (185 currently passing)
+# All tests (392 currently passing)
 cargo test --workspace
 
 # Forth 2012 compliance dashboard
 cargo test -p wafer-core --test compliance
 
+# Optimization benchmark report
+cargo test -p wafer-core --test benchmark_report -- --nocapture --ignored
+
 # Lints
 cargo clippy --workspace
 ```
@@ -117,7 +120,7 @@ tests/ Forth 2012 compliance suite (gerryjackson/forth2012-test-suite sub
 
 ### Not Yet Implemented
 
-11 word sets at 100% compliance: Core, Core Ext, Core Plus, Exception, Double-Number, String, Search-Order, Memory-Allocation, Programming-Tools, Facility, Locals. 130+ words including VALUE, DEFER, CASE, DOES>, CATCH/THROW, double-cell arithmetic, string operations.
+12 word sets at 100% compliance: Core, Core Ext, Core Plus, Exception, Double-Number, String, Search-Order, Memory-Allocation, Programming-Tools, Facility, Locals, Floating-Point. 200+ words including VALUE, DEFER, CASE, DOES>, CATCH/THROW, double-cell arithmetic, string operations, and 70+ floating-point words.
 
 ## Compliance Status
 
diff --git a/docs/OPTIMIZATIONS.md b/docs/OPTIMIZATIONS.md
index ea7c285..5ba09ed 100644
--- a/docs/OPTIMIZATIONS.md
+++ b/docs/OPTIMIZATIONS.md
@@ -31,7 +31,7 @@ This document describes every optimization that makes sense for WAFER, why it ma
 
 ## 1. Stack-to-Local Promotion
 
-**Status: Not implemented.** Type infrastructure exists (`crates/core/src/types.rs`) but is not wired into codegen.
+**Status: Phase 1 done.** Straight-line words (no control flow, calls, or I/O) use WASM locals instead of memory stack. Stack manipulation ops (Swap, Rot, Nip, Tuck, Dup, Drop) emit zero WASM instructions. Switchable via `WaferConfig::codegen.stack_to_local_promotion`.
 
 ### The Problem
 
@@ -105,7 +105,7 @@ When the compiler can statically determine the types and lifetimes of values on
 
 ## 2. Peephole Optimization
 
-**Status: Not implemented.**
+**Status: Done.** Implemented in `optimizer.rs::peephole()`. Runs as fixpoint loop.
 
 A peephole optimizer scans adjacent IR operations and replaces recognized patterns with cheaper equivalents. This is the lowest-effort, highest-return IR pass because Forth's postfix style generates many redundant sequences.
 
@@ -134,7 +134,7 @@ A single function `fn peephole(ops: Vec) -> Vec` that makes repeated
 
 ## 3. Constant Folding
 
-**Status: Not implemented.**
+**Status: Done.** Implemented in `optimizer.rs::constant_fold()`. Folds binary and unary ops on known constants.
 
 When both operands of an operation are compile-time constants, compute the result at compile time.
 
@@ -163,7 +163,7 @@ A function `fn constant_fold(ops: Vec) -> Vec` that simulates a comp
 
 ## 4. Inlining
 
-**Status: Not implemented.**
+**Status: Done.** Implemented in `optimizer.rs::inline()`. Inlines word bodies <= 8 ops, non-recursive. TailCall converted back to Call when inlining. IR bodies stored in `ForthVM::ir_bodies`.
 
 Replace `Call(WordId)` with the callee's IR body, avoiding the `call_indirect` overhead and enabling further optimization of the combined code.
 
@@ -212,7 +212,7 @@ PushI32(34)
 
 ## 5. Strength Reduction
 
-**Status: Not implemented.**
+**Status: Done.** Implemented in `optimizer.rs::strength_reduce()`. Power-of-2 multiply to shift, zero comparisons to ZeroEq/ZeroLt.
 
 Replace expensive operations with cheaper equivalents when one operand is a known constant.
 
@@ -231,7 +231,7 @@ The most common case is `CELLS` which is defined as `PushI32(4), Mul`. Strength
 
 ## 6. Dead Code Elimination
 
-**Status: Not implemented.**
+**Status: Done.** Implemented in `optimizer.rs::dce()`. Truncates after Exit, eliminates constant-conditional branches.
 
 Remove IR operations that can never execute or whose results are never used.
 
@@ -246,7 +246,7 @@ DCE should run after constant folding, since folding can create new constant con
 
 ## 7. Tail Call Optimization
 
-**Status: Partial.** `IrOp::TailCall(WordId)` exists in `ir.rs` and codegen handles it in `codegen.rs`, but the compiler never generates it.
+**Status: Done.** `optimizer.rs::tail_call_detect()` converts the last `Call` to `TailCall` when return stack is balanced. Codegen emits `call_indirect + return`.
 
 ### What Exists
 
@@ -272,7 +272,7 @@ Detection rule: if the last IR op in a word body (or in a branch of an `If`) is
 
 ## 8. Consolidation
 
-**Status: Not implemented.** Stub exists at `crates/core/src/consolidate.rs`.
+**Status: Done.** `CONSOLIDATE` word recompiles all IR-based words into a single WASM module with direct `call` instructions. Implemented in `codegen.rs::compile_consolidated_module()` and `outer.rs::consolidate()`.
 
 ### The Idea
 
@@ -300,7 +300,7 @@ After interactive development, `CONSOLIDATE` recompiles all defined words into a
 
 ## 9. Compound IR Operations
 
-**Status: Not implemented.**
+**Status: Done.** `TwoDup` and `TwoDrop` IrOp variants with optimized codegen. Peephole converts `Over, Over -> TwoDup` and `Drop, Drop -> TwoDrop`.
 
 Some common multi-op sequences have more efficient WASM implementations than emitting each op individually.
 
@@ -342,7 +342,7 @@ These can be added as new `IrOp` variants recognized by peephole and emitted by
 
 ## 10. Codegen Improvements
 
-**Status: Not implemented.**
+**Status: Done (DSP caching).** `$dsp` cached in WASM local 0, written back before calls and at function exit. Commutative optimization and loop index in local are future work.
 
 These are improvements within `codegen.rs` `emit_op()` that do not require new IR operations.
 
@@ -394,7 +394,7 @@ i32.add ;; result on wasm stack
 
 ## 11. wasmtime Configuration
 
-**Status: Not implemented.** Currently using `Engine::default()`.
+**Status: Done.** NaN canonicalization disabled, module caching enabled via `cache_config_load_default()`.
 
 ### Available Knobs
 
@@ -410,7 +410,7 @@ Module caching is the most impactful: `wasmtime::Config::cache_config_load_defau
 
 ## 12. Dictionary Hash Index
 
-**Status: Not implemented.**
+**Status: Done.** `HashMap` in Dictionary struct. `find()` uses O(1) hash lookup with linked-list fallback. Updated on `reveal()` and `toggle_immediate()`.
 
 The dictionary lookup (`dictionary.rs` `find()`) walks a linked list from the most recent entry backward, comparing names character by character. After registering 80+ primitives plus user words, every lookup during compilation scans the full list.
 
@@ -422,7 +422,7 @@ This affects **compile time** (word lookup during parsing), not runtime (compile
 
 ## 13. Startup Batching
 
-**Status: Not implemented.** `compile_core_module()` stub exists in `codegen.rs`.
+**Status: Not started.** `compile_core_module()` stub exists in `codegen.rs`.
 
 Currently, each of the 80+ primitives registered at boot creates a separate WASM module: `wasm-encoder` builds it, `wasmparser` validates it, Cranelift compiles it, and wasmtime instantiates it. This happens 80+ times sequentially.
 
@@ -432,7 +432,7 @@ Batch all IR-based primitives into a single WASM module with multiple exported f
 
 ## 14. Float and Double-Cell Stack
 
-**Status: Not implemented.** `PushI64` and `PushF64` exist as IR ops but are stubs in codegen.
+**Status: Not started.** `PushI64` and `PushF64` exist as IR ops but are stubs in codegen. Float stack operations are currently all host functions.
 
 The float stack lives in its own memory region (0x2540--0x2D40). Float operations will have the same memory-based overhead as integer operations, but worse: `f64` values are 8 bytes, doubling the memory traffic per push/pop. Stack-to-local promotion (section 1) is even more impactful for floats because WASM has native `f64` locals and operand stack support.
 
diff --git a/docs/WAFER.md b/docs/WAFER.md
index bd2f2fd..8c85cb2 100644
--- a/docs/WAFER.md
+++ b/docs/WAFER.md
@@ -8,10 +8,14 @@ WAFER (WebAssembly Forth Engine in Rust) is a Forth 2012 compiler that JIT-compi
 crates/
   core/src/
     outer.rs        ForthVM: outer interpreter, compiler, all primitives
-    codegen.rs      IR-to-WASM translation, module generation
-    dictionary.rs   Dictionary (linked list in Vec)
+    codegen.rs      IR-to-WASM translation, module generation, stack-to-local promotion
+    dictionary.rs   Dictionary (linked list in Vec, hash index for O(1) lookup)
     ir.rs           IrOp enum -- the intermediate representation
+    optimizer.rs    IR optimization passes (peephole, fold, inline, DCE, etc.)
+    config.rs       WaferConfig: unified optimization configuration
+    consolidate.rs  Consolidation recompiler (single-module direct calls)
     memory.rs       Memory layout constants (addresses, sizes)
+    types.rs        Stack type inference infrastructure
     error.rs        Error types
   cli/src/
     main.rs         CLI REPL (rustyline), file execution
@@ -23,7 +27,7 @@ tests/
   forth2012-test-suite/  Forth 2012 compliance test suite (submodule)
 ```
 
-The entire compiler and runtime lives in `outer.rs` (~5200 lines). Codegen is in `codegen.rs` (~1500 lines). Everything else is supporting infrastructure.
+The compiler and runtime lives in `outer.rs` (~10,400 lines). Codegen is in `codegen.rs` (~2,800 lines). The optimizer is in `optimizer.rs` (~800 lines). Everything else is supporting infrastructure.
 
 ## What Happens When You Start WAFER
 
@@ -39,6 +43,7 @@ wasmtime Store Runtime state container
   Linear Memory     16 pages (1 MiB), expandable to 256 pages (16 MiB)
   Global: DSP       Data stack pointer, initialized to 0x1540
   Global: RSP       Return stack pointer, initialized to 0x2540
+  Global: FSP       Float stack pointer, initialized to 0x2D40
   Function Table    256 funcref entries (grows as needed)
 ```