Update docs: performance results, new optimizations, test counts

- README: add performance section (beats gforth 2-10x), update test
  commands, note self-recursive direct calls and loop promotion
- CLAUDE.md: update test counts (427 unit + comparison tests)
- OPTIMIZATIONS.md: stack-to-local Phase 1→Phase 2 (loops + IF),
  DO/LOOP locals done, J as IR done, add section 14 (self-recursive
  direct call), add current performance table vs gforth
- WAFER.md: document self-recursive call optimization, CONSOLIDATE,
  update test commands and line counts
- FORTH.md: expanded space history, add FORTH-IN-SPACE.md reference
- FORTH-IN-SPACE.md: new document with verified spacecraft history
2026-04-09 20:00:55 +02:00
parent 5555202bf0
commit 13a16ae2a4
6 changed files with 176 additions and 50 deletions
+61 -39
@@ -12,26 +12,29 @@ This document describes every optimization that makes sense for WAFER, why it ma
## Status Summary
| # | Optimization | Level | Status | Impact |
| -- | ------------------------- | ------------ | ----------- | ------- |
| 1 | Stack-to-Local Promotion | Codegen | Phase 2 | Highest |
| 2 | Peephole Optimization | IR pass | Done | High |
| 3 | Constant Folding | IR pass | Done | High |
| 4 | Inlining | IR pass | Done | High |
| 5 | Strength Reduction | IR pass | Done | Medium |
| 6 | Dead Code Elimination | IR pass | Done | Medium |
| 7 | Tail Call Optimization | IR + Codegen | Done | Medium |
| 8 | Consolidation | Architecture | Done | High |
| 9 | Compound IR Operations | IR + Codegen | Done | Medium |
| 10 | Codegen Improvements | Codegen | Done | High |
| 11 | wasmtime Configuration | Runtime | Done | Low |
| 12 | Dictionary Hash Index | Runtime | Done | Low |
| 13 | Startup Batching | Architecture | Done | Low |
| 14 | Self-Recursive Direct Call | Codegen | Done | High |
| 15 | Float / Double-Cell | Codegen | Not started | Future |
## 1. Stack-to-Local Promotion
**Status: Phase 2 done.** Words with straight-line code, DO/LOOP, and IF/ELSE use WASM locals instead of memory stack. Stack manipulation ops (Swap, Rot, Nip, Tuck, Dup, Drop) emit zero WASM instructions. Loop index/limit kept in WASM locals (zero return stack traffic). Switchable via `WaferConfig::codegen.stack_to_local_promotion`.
Phase 1 covered straight-line code only. Phase 2 extends to DO/LOOP (with stack-neutrality check) and IF/ELSE/THEN (with equal-branch-effect check). BEGIN loops and BeginDoubleWhileRepeat are not yet promoted.
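The Phase 2 eligibility checks can be sketched as follows. This is a minimal, hypothetical model (the `Op` variants and function names here are illustrative, not WAFER's actual `IrOp`): a loop body qualifies when its net stack effect is zero, and an IF/ELSE qualifies when both branches change the stack depth equally, so the locals after THEN are well-defined.

```rust
// Toy IR, just enough to model stack effects. Illustrative only.
#[allow(dead_code)]
enum Op {
    Push, // e.g. a literal: +1 cell
    Add,  // binary arithmetic: -1 cell
    Dup,  // +1
    Drop, // -1
    Swap, // 0
}

// Net stack effect of a straight-line op sequence.
fn net_effect(ops: &[Op]) -> i32 {
    ops.iter()
        .map(|op| match op {
            Op::Push | Op::Dup => 1,
            Op::Add | Op::Drop => -1,
            Op::Swap => 0,
        })
        .sum()
}

// A DO/LOOP body is promotable when it is stack-neutral: the number
// of values in locals must be the same on every iteration.
fn loop_promotable(body: &[Op]) -> bool {
    net_effect(body) == 0
}

// IF/ELSE is promotable when both branches have equal stack effect.
fn if_promotable(then_branch: &[Op], else_branch: &[Op]) -> bool {
    net_effect(then_branch) == net_effect(else_branch)
}

fn main() {
    // A body like `I +` (push index, add): net 0 -> promotable.
    assert!(loop_promotable(&[Op::Push, Op::Add]));
    // A body that grows the stack each iteration is not.
    assert!(!loop_promotable(&[Op::Dup]));
    // `IF 1 ELSE 2 THEN`: both branches push one value -> promotable.
    assert!(if_promotable(&[Op::Push], &[Op::Push]));
    println!("ok");
}
```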
### The Problem
@@ -342,7 +345,7 @@ These can be added as new `IrOp` variants recognized by peephole and emitted by
## 10. Codegen Improvements
**Status: Done.** DSP cached in WASM local 0, DO/LOOP index/limit in WASM locals (fast path when body has no calls/RS ops), J compiled as IR primitive (was host function). The self-recursive `call` optimization is described in section 14.
These are improvements within `codegen.rs` `emit_op()` that do not require new IR operations.
@@ -390,7 +393,11 @@ i32.add ;; result on wasm stack
### Loop Index in Local
**Status: Done.** DO/LOOP index and limit are kept in WASM locals. Two codegen paths:
- **Fast path** (body has no calls, no `>R`/`R>`): pure locals, zero return stack traffic. `I` reads from `local.get`. `J` also reads from outer loop's local.
- **Slow path** (body has calls or explicit RS ops): locals are still used for loop control but are synced to the return stack so that LEAVE/UNLOOP keep working.
`J` was converted from a host function (WASM-to-Rust roundtrip) to an IR primitive (`IrOp::LoopJ`) that compiles to `local.get` of the outer loop's index local.
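The fast/slow decision reduces to scanning the loop body for anything that could observe the return stack. A minimal sketch, with hypothetical types (the real decision lives inside WAFER's codegen and inspects actual `IrOp` values):

```rust
// Illustrative op subset; not WAFER's actual IrOp.
#[allow(dead_code)]
enum Op {
    Add,
    LoopI,     // read innermost loop index (I)
    Call(u32), // call another word
    ToR,       // >R
    FromR,     // R>
}

#[derive(Debug, PartialEq)]
enum LoopPath {
    Fast, // index/limit live purely in WASM locals
    Slow, // locals drive the loop, but are synced to the return stack
}

// The fast path is only safe when nothing in the body can observe
// the return stack: no calls to other words, no >R / R>.
fn choose_path(body: &[Op]) -> LoopPath {
    let needs_sync = body
        .iter()
        .any(|op| matches!(op, Op::Call(_) | Op::ToR | Op::FromR));
    if needs_sync { LoopPath::Slow } else { LoopPath::Fast }
}

fn main() {
    // `DO I + LOOP` body: pure arithmetic -> fast path.
    assert_eq!(choose_path(&[Op::LoopI, Op::Add]), LoopPath::Fast);
    // A body that calls another word must sync for LEAVE/UNLOOP.
    assert_eq!(choose_path(&[Op::Call(7)]), LoopPath::Slow);
    println!("ok");
}
```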
## 11. wasmtime Configuration
@@ -430,31 +437,46 @@ Currently, each of the 80+ primitives registered at boot creates a separate WASM
Batch all IR-based primitives into a single WASM module with multiple exported functions. One `Module::new()` + one `Instance::new()` replaces 80+ pairs. This is a subset of what Consolidation (section 8) achieves, but scoped to primitives only and simpler to implement.
## 14. Self-Recursive Direct Call
**Status: Done.** When a word calls itself (RECURSE), the codegen emits `call WORD_FUNC` (direct call to the same function) instead of `call_indirect` through the function table. This eliminates the table lookup and signature check overhead.
### Impact
Fibonacci(25) with ~243K recursive calls:
- `call_indirect`: ~21ns/call → 5.0ms total
- Direct `call`: ~7ns/call → 1.6ms total (3x faster)
- gforth: ~14ns/call → 3.4ms total
The optimization is implemented in `emit_op` for `IrOp::Call`: when `ctx.self_word_id == Some(word_id)`, emit `call WORD_FUNC` (function index 1 in the word's own module). The `self_word_id` is derived from `CodegenConfig::base_fn_index`.
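A minimal sketch of that branch, with hypothetical instruction and context types (the real `emit_op` emits encoded WASM rather than an enum):

```rust
// Illustrative subset of emitted call instructions.
#[derive(Debug, PartialEq)]
enum Instr {
    Call(u32),         // direct call to a function index
    CallIndirect(u32), // call through the function table, keyed by word id
}

struct Ctx {
    // Set when compiling a word body, so self-calls are recognizable.
    self_word_id: Option<u32>,
    // The word's own function index inside its module (1 per the text).
    self_fn_index: u32,
}

// IrOp::Call lowering: a self-call (RECURSE) becomes a direct `call`,
// skipping the table lookup and signature check; every other call
// still goes through `call_indirect`.
fn lower_call(ctx: &Ctx, word_id: u32) -> Instr {
    if ctx.self_word_id == Some(word_id) {
        Instr::Call(ctx.self_fn_index)
    } else {
        Instr::CallIndirect(word_id)
    }
}

fn main() {
    let ctx = Ctx { self_word_id: Some(42), self_fn_index: 1 };
    // The word calling itself -> direct call, no table overhead.
    assert_eq!(lower_call(&ctx, 42), Instr::Call(1));
    // Calling any other word still uses the table.
    assert_eq!(lower_call(&ctx, 7), Instr::CallIndirect(7));
    println!("ok");
}
```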
## 15. Float and Double-Cell Stack
**Status: Not started.** `PushI64` and `PushF64` exist as IR ops but are stubs in codegen. Float stack operations are currently all host functions.
The float stack lives in its own memory region (0x2540--0x2D40). Float operations will have the same memory-based overhead as integer operations, but worse: `f64` values are 8 bytes, doubling the memory traffic per push/pop. Stack-to-local promotion (section 1) is even more impactful for floats because WASM has native `f64` locals and operand stack support.
## Current Performance vs Gforth
All optimizations enabled, release mode, measured with UTIME:
```
Benchmark WAFER CONSOL gforth WAFER/gf
Fibonacci(25) 1629 1535 3422 0.45x
Factorial(12)x10K 340 339 638 0.53x
GCD-bench(500) 18 15 30 0.50x
NestedLoops(50) 84 73 720 0.10x
Collatz(2K) 1212 1202 3914 0.31x
```
Times in microseconds; the WAFER/gf column is the CONSOL time divided by the gforth time, so values below 1.0 mean WAFER is faster.
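As a quick check, the ratio column reproduces from the CONSOL and gforth times in the table above, rounded to two decimals:

```rust
// Ratio as published in the table: CONSOL time over gforth time,
// rounded to two decimal places.
fn ratio(consol_us: f64, gforth_us: f64) -> f64 {
    (consol_us / gforth_us * 100.0).round() / 100.0
}

fn main() {
    // (benchmark, CONSOL µs, gforth µs, published ratio)
    let rows = [
        ("Fibonacci(25)", 1535.0, 3422.0, 0.45),
        ("Factorial(12)x10K", 339.0, 638.0, 0.53),
        ("GCD-bench(500)", 15.0, 30.0, 0.50),
        ("NestedLoops(50)", 73.0, 720.0, 0.10),
        ("Collatz(2K)", 1202.0, 3914.0, 0.31),
    ];
    for (name, consol, gforth, published) in rows {
        assert!((ratio(consol, gforth) - published).abs() < 1e-9, "{name}");
    }
    println!("all ratios match");
}
```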
## Remaining Opportunities
| Optimization | Status | Potential Impact |
| --- | --- | --- |
| BEGIN loop promotion | Not started | Would speed up GCD-style tight loops further |
| BeginDoubleWhileRepeat promotion | Not started | Rare pattern, low priority |
| LEAVE as IR primitive | Not started | Would enable fast-path for loops with LEAVE |
| Float stack-to-local | Not started | Eliminate float stack memory traffic |
| WASM tail calls proposal | Waiting on wasmtime | Would eliminate stack growth for tail-recursive words |