Update docs: performance results, new optimizations, test counts

- README: add performance section (beats gforth 2-10x), update test commands, note self-recursive direct calls and loop promotion
- CLAUDE.md: update test counts (427 unit + comparison tests)
- OPTIMIZATIONS.md: stack-to-local Phase 1→Phase 2 (loops + IF), DO/LOOP locals done, J as IR done, add section 14 (self-recursive direct call), add current performance table vs gforth
- WAFER.md: document self-recursive call optimization, CONSOLIDATE, update test commands and line counts
- FORTH.md: expanded space history, add FORTH-IN-SPACE.md reference
- FORTH-IN-SPACE.md: new document with verified spacecraft history
2026-04-09 20:00:55 +02:00
parent 5555202bf0
commit 13a16ae2a4
6 changed files with 176 additions and 50 deletions
@@ -0,0 +1,77 @@
+# Forth in Space
+
+Forth's properties -- tiny footprint, deterministic execution, interactive testing on live hardware, no operating system required -- read like a specification sheet for spacecraft software. When your computer has kilobytes of memory, your budget for cosmic ray errors is zero, and the nearest debugger is hundreds of millions of kilometers away, you want a language where you can see every instruction and test every word before committing it to flight.
+
+This is a verified list of Forth's presence in space, drawn from the NASA Goddard Space Flight Center compilation, Johns Hopkins University Applied Physics Laboratory mission documentation, and cross-referenced public sources. Claims that cannot be traced to authoritative sources (such as the persistent but debunked claim that Voyager runs Forth) are excluded.
+
+## Before Forth: Apollo and the Verb-Noun Paradigm
+
+The Apollo Guidance Computer (AGC) was not programmed in Forth -- it used its own assembly language, and Forth did not exist yet. But its human interface deserves mention as a philosophical ancestor.
+
+The AGC's DSKY (Display and Keyboard) unit presented astronauts with a numeric vocabulary: two-digit _verb_ codes for actions, two-digit _noun_ codes for data. `VERB 37 NOUN 01` meant "change to program 01." `VERB 06 NOUN 62` displayed spacecraft attitude. There were roughly 100 verb-noun combinations. Astronauts memorized them, composed them interactively, and got immediate feedback on a seven-segment display -- all while traveling to the Moon on a computer with 74 kilobytes of memory.
+
+The parallel to Forth is structural, not genealogical. Both systems give the operator a small vocabulary of composable commands, entered interactively, with immediate execution and visible results. Both trust the human to know the vocabulary. Both reject the idea that the user needs to be protected from the machine by layers of abstraction. Charles Moore started building Forth in 1968, the same year astronauts were training on DSKY procedures for Apollo 8. The two systems emerged from the same engineering instinct: when the stakes are high, give skilled people direct control.
+
+## The RTX2010: A Processor That Speaks Forth
+
+Most of the missions below share a single piece of hardware: the Harris RTX2010, a radiation-hardened 16-bit stack processor. Understanding this chip explains why Forth dominated space instrument control for two decades.
+
+The RTX2010 is built on silicon-on-sapphire (SOS) -- a thin layer of silicon grown on an insulating sapphire substrate. Because sapphire is an insulator, ionizing radiation cannot create the parasitic currents that corrupt conventional silicon chips. This makes the processor inherently radiation-resistant without the performance penalties of other hardening techniques.
+
+More importantly, the RTX2010 is a _native Forth processor_. Its instruction set maps directly to Forth primitives. It has two hardware stacks -- a 256-word data stack and a 256-word return stack -- built into the chip. Subroutine calls and returns execute in a single clock cycle. Interrupt latency is four cycles, consistent and predictable. At 8 MHz, it draws milliwatts.
+
+The original architecture was designed by Chuck Moore, the inventor of Forth, as the Novix NC4000 in the mid-1980s. Harris Semiconductor licensed and radiation-hardened it as the RTX2000, then the RTX2010. The Johns Hopkins University Applied Physics Laboratory (APL) standardized on it for spacecraft instrument controllers, and it became the de facto Forth-in-space platform.
+
+On this chip, there is no compiler between the programmer and the hardware. Forth words _are_ the machine instructions. Every word can be tested interactively on the flight hardware before launch. There is no operating system to fail, no runtime to consume resources, no hidden behavior. What you write is what the silicon executes.
+
+## Verified Missions
+
+### The Early Pioneers
+
+**MAGSAT (1979)** -- NASA, built by APL. The Magnetic Field Satellite carried an attitude control system programmed in Forth on an RCA CDP1802 COSMAC processor. This is among the earliest documented uses of Forth in space. The 1802 was itself a radiation-hardened chip (silicon-on-sapphire, like the later RTX2010), and its simple architecture made it a natural target for Forth implementations.
+
+**AMSAT OSCAR 10, 13, and 21 (1983--1996)** -- Built by volunteer radio amateurs in the AMSAT organization, these Phase III amateur radio satellites carried flight software written in Forth. OSCAR 10 launched in 1983 on an Ariane rocket, OSCAR 13 in 1988, and OSCAR 21 in 1991. These were high-orbit satellites (Molniya-type orbits reaching 35,000+ km altitude) providing intercontinental amateur radio communication, and Forth was chosen for the same reasons as professional missions: tiny footprint, interactive debugging, and full control of the hardware.
+
+**Freja (1992)** -- Swedish Space Corporation. This Swedish magnetospheric research satellite carried a magnetometer experiment controlled by Forth running on an SC32 processor. Freja studied the aurora borealis and plasma processes in the upper ionosphere from a polar orbit.
+
+### The APL/RTX2010 Era
+
+The mid-1990s through early 2000s saw an extraordinary concentration of Forth-powered instruments, nearly all built by the Johns Hopkins University Applied Physics Laboratory on RTX2010 processors.
+
+**NEAR Shoemaker (1996)** -- NASA, built by APL. The Near Earth Asteroid Rendezvous mission was the first spacecraft to orbit and land on an asteroid. Four of its science instruments ran Forth on RTX2010 processors: the X-Ray/Gamma-Ray Spectrometer (9,545 lines of Forth), the Multispectral Imager (5,926 lines), the Near-Infrared Spectrometer/Magnetometer (3,019 lines), and the NEAR Laser Rangefinder (2,946 lines). These instruments mapped the composition and topography of asteroid 433 Eros in detail before the spacecraft soft-landed on its surface on February 12, 2001 -- a maneuver the spacecraft was never designed to perform, executed by Forth code running on 8 MHz processors 316 million kilometers from Earth.
+
+**Cassini MIMI (1997)** -- NASA/ESA joint mission, instrument built by APL. The Magnetospheric Imaging Instrument (MIMI) on the Cassini orbiter ran its CPU and Event Processing Unit on RTX2010 processors programmed in Forth. MIMI studied Saturn's magnetosphere for 13 years, from Cassini's arrival in 2004 until its planned destruction in Saturn's atmosphere in September 2017. During that time, the Forth code on those 8 MHz processors operated flawlessly through radiation environments far more intense than Earth orbit.
+
+**ACE (1997)** -- NASA. The Advanced Composition Explorer, stationed at the L1 Lagrange point 1.5 million kilometers sunward of Earth, carries a star scanner built by Ball Aerospace on an RTX2000 with Forth, and the Ultra Low Energy Isotope Spectrometer (ULEIS) built by APL on an RTX2010. ACE monitors the solar wind and is the primary early-warning system for solar storms heading toward Earth. As of 2026, it is still operating -- nearly three decades of continuous Forth execution in deep space.
+
+**Chandra X-ray Observatory (1999)** -- NASA. The Chandra X-ray telescope, one of NASA's four Great Observatories, uses science instrument control software and mechanism controllers built by Ball Aerospace on RTX2000/2010 processors. These systems manage the precise positioning of Chandra's X-ray detectors and gratings.
+
+**IMAGE (2000)** -- NASA, instruments by Baja Technology and APL. The Imager for Magnetopause-to-Aurora Global Exploration was the first spacecraft to image Earth's magnetosphere globally. Its Extreme Ultraviolet (EUV) imager and High-Energy Neutral Atom (HENA) imager both ran on RTX2010 processors programmed in Forth.
+
+**TIMED GUVI (2001)** -- NASA, built by APL. The Global Ultraviolet Imager (GUVI) on the TIMED satellite runs on an RTX2010, scanning the Earth's thermosphere and ionosphere in five ultraviolet wavelength bands. TIMED launched from Vandenberg Air Force Base in December 2001.
+
+### Comets, Shuttles, and Weather Satellites
+
+**Rosetta and Philae (2004/2014)** -- ESA. The most famous Forth-in-space story. Philae's central command and data management system ran Forth-83 on two 8 MHz RTX2010 processors. On November 12, 2014, Philae separated from the Rosetta orbiter and descended to the surface of comet 67P/Churyumov-Gerasimenko, 511 million kilometers from Earth -- the first controlled landing on a comet. But Forth was present on Rosetta as well: the orbiter's Ion and Electron Sensor (IES) instrument ran on an RTX2010 with Forth, built by Baja Technology.
+
+**Deep Impact (2005)** -- NASA. The spacecraft that deliberately collided an impactor with Comet Tempel 1 to study its interior composition used an RTX2010 running Forth for its Attitude and Orbit Control (AOC) system controller, built by Baja Technology.
+
+**SSUSI on DMSP (5 missions, 15+ years)** -- U.S. Department of Defense, built by APL. The Special Sensor Ultraviolet Spectrographic Imager (SSUSI) flew on five Defense Meteorological Satellite Program Block 5D-3 weather satellites, running on Harris RTX2000 processors with Forth. Each satellite orbited at 850 km altitude in sun-synchronous polar orbits, measuring ultraviolet emissions from the upper atmosphere to monitor space weather and auroral activity.
+
+**SSBUV on Space Shuttle (8 flights, 1989--1996)** -- NASA Goddard Space Flight Center. The Shuttle Solar Backscatter Ultraviolet instrument flew eight successful missions in the Shuttle payload bay, calibrating ozone-monitoring satellites. Its controller ran chipFORTH on a custom hardware platform, interfacing the instrument to the Shuttle's avionics systems through the Small Payload Accommodations Interface Module (SPAIM).
+
+## Additional Missions
+
+The NASA Goddard Space Flight Center compilation lists further missions with less detailed documentation, including: Shuttle Imaging Radar SIR-B (1984), Hopkins Ultraviolet Telescope on Astro-1 (1990), TOPEX/Poseidon (1992), Midcourse Space Experiment MSX (1996), X-ray Timing Explorer XTE (1995), Submillimeter Wave Astronomy Satellite SWAS (1998), SAGE III (2001 and 2017), and several ground support systems for Globalstar, Iridium, and ORBCOMM satellite constellations.
+
+## A Note on Myths
+
+The claim that NASA's Voyager spacecraft run Forth appears frequently online but is not supported by primary sources. NASA's official Voyager FAQ states that the onboard computers (CCS, AACS, and FDS) are programmed in their respective assembly languages, with ground software written in Fortran. Similar unverified claims circulate about the Mars rovers and other missions. This list includes only missions traceable to the NASA GSFC compilation, APL documentation, or equivalent authoritative sources.
+
+## Sources
+
+- NASA Goddard Space Flight Center. _Forth in Space Applications._ Compiled table of Forth-based spacecraft systems. Archived at [forth.com/resources/space-applications](https://www.forth.com/resources/space-applications/).
+- Ratfactor. _Forth in Space!_ Cross-referenced compilation with APL instrument details. [ratfactor.com/forth/forth-in-space](https://ratfactor.com/forth/forth-in-space).
+- Mecrisp Stellaris documentation. _Forth Use in Spacecraft._ [mecrisp-stellaris-folkdoc.sourceforge.io/spacecraft.html](https://mecrisp-stellaris-folkdoc.sourceforge.io/spacecraft.html).
+- Wikipedia. _RTX2010._ [en.wikipedia.org/wiki/RTX2010](https://en.wikipedia.org/wiki/RTX2010).
+- The CPU Shack Museum. _Here comes Philae! Powered by an RTX2010._ November 2014. [cpushack.com](https://www.cpushack.com/2014/11/12/here-comes-philae-powered-by-an-rtx2010/).
@@ -48,7 +48,7 @@ As FORTH, Inc. puts it: Forth was designed for "a programmer who was intelligent

 Forth is not mainstream. It never tried to be. But it persists in places where its particular nature -- tiny, deterministic, interactive, self-contained -- is exactly what the situation demands.

-**Space.** The Philae lander, which touched down on comet 67P/Churyumov-Gerasimenko in 2014 as part of ESA's Rosetta mission, ran its central command and data management system in Forth-83 on radiation-hardened RTX2010 stack processors. When you are 500 million kilometers from the nearest debugger and your computer has kilobytes of memory, you want a language where you can see every instruction and test every word interactively before committing it to flight.
+**Space.** Forth has flown on over 30 spacecraft and instruments, from MAGSAT's attitude controller in 1979 to ESA's Philae comet lander in 2014. The radiation-hardened RTX2010 -- a processor that executes Forth natively -- became the standard for instrument controllers at NASA and JHU/APL, powering missions including NEAR Shoemaker (first asteroid landing), Cassini (13 years at Saturn), Chandra X-ray Observatory, and Deep Impact (comet collision). When you are 500 million kilometers from the nearest debugger and your computer has kilobytes of memory, you want a language where you can see every instruction and test every word interactively before committing it to flight. See [Forth in Space](FORTH-IN-SPACE.md) for the full verified history.

 **Firmware.** Open Firmware, the boot ROM standard used by Apple, Sun, IBM, and the OLPC XO-1, is a Forth environment. Before the operating system loads -- before there are drivers, before there is a filesystem -- there is a Forth interpreter probing hardware, initializing buses, and running platform-independent device drivers encoded as FCode (a compiled Forth bytecode format). Forth is the language you reach for when there is nothing else yet.

@@ -12,26 +12,29 @@ This document describes every optimization that makes sense for WAFER, why it ma

 ## Status Summary

-| #  | Optimization             | Level        | Status      | Impact  |
-| -- | ------------------------ | ------------ | ----------- | ------- |
-| 1  | Stack-to-Local Promotion | Codegen      | Phase 1     | Highest |
-| 2  | Peephole Optimization    | IR pass      | Done        | High    |
-| 3  | Constant Folding         | IR pass      | Done        | High    |
-| 4  | Inlining                 | IR pass      | Done        | High    |
-| 5  | Strength Reduction       | IR pass      | Done        | Medium  |
-| 6  | Dead Code Elimination    | IR pass      | Done        | Medium  |
-| 7  | Tail Call Optimization   | IR + Codegen | Done        | Medium  |
-| 8  | Consolidation            | Architecture | Done        | High    |
-| 9  | Compound IR Operations   | IR + Codegen | Done        | Medium  |
-| 10 | Codegen Improvements     | Codegen      | Done        | Medium  |
-| 11 | wasmtime Configuration   | Runtime      | Done        | Low     |
-| 12 | Dictionary Hash Index    | Runtime      | Done        | Low     |
-| 13 | Startup Batching         | Architecture | Done        | Low     |
-| 14 | Float / Double-Cell      | Codegen      | Not started | Future  |
+| #  | Optimization               | Level        | Status      | Impact  |
+| -- | -------------------------- | ------------ | ----------- | ------- |
+| 1  | Stack-to-Local Promotion   | Codegen      | Phase 2     | Highest |
+| 2  | Peephole Optimization      | IR pass      | Done        | High    |
+| 3  | Constant Folding           | IR pass      | Done        | High    |
+| 4  | Inlining                   | IR pass      | Done        | High    |
+| 5  | Strength Reduction         | IR pass      | Done        | Medium  |
+| 6  | Dead Code Elimination      | IR pass      | Done        | Medium  |
+| 7  | Tail Call Optimization     | IR + Codegen | Done        | Medium  |
+| 8  | Consolidation              | Architecture | Done        | High    |
+| 9  | Compound IR Operations     | IR + Codegen | Done        | Medium  |
+| 10 | Codegen Improvements       | Codegen      | Done        | High    |
+| 11 | wasmtime Configuration     | Runtime      | Done        | Low     |
+| 12 | Dictionary Hash Index      | Runtime      | Done        | Low     |
+| 13 | Startup Batching           | Architecture | Done        | Low     |
+| 14 | Self-Recursive Direct Call | Codegen      | Done        | High    |
+| 15 | Float / Double-Cell        | Codegen      | Not started | Future  |

 ## 1. Stack-to-Local Promotion

-**Status: Phase 1 done.** Straight-line words (no control flow, calls, or I/O) use WASM locals instead of memory stack. Stack manipulation ops (Swap, Rot, Nip, Tuck, Dup, Drop) emit zero WASM instructions. Switchable via `WaferConfig::codegen.stack_to_local_promotion`.
+**Status: Phase 2 done.** Words with straight-line code, DO/LOOP, and IF/ELSE use WASM locals instead of the memory stack. Stack manipulation ops (Swap, Rot, Nip, Tuck, Dup, Drop) emit zero WASM instructions. The loop index and limit are kept in WASM locals (zero return stack traffic on the fast path, see section 10). Switchable via `WaferConfig::codegen.stack_to_local_promotion`.
+
+Phase 1 covered straight-line code only. Phase 2 extends promotion to DO/LOOP (the loop body must be stack-neutral) and IF/ELSE/THEN (both branches must have the same stack effect). BEGIN loops and BeginDoubleWhileRepeat are not yet promoted.
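
The two Phase 2 checks can be illustrated with a small stack-depth analysis. This is a minimal sketch under stated assumptions, not WAFER's implementation: the `Op` enum is a hypothetical toy IR (the real `IrOp` is larger), `net_effect` is an illustrative name, and the analysis tracks only net depth, ignoring underflow.

```rust
// Hypothetical toy IR for illustration; not WAFER's actual `IrOp`.
#[allow(dead_code)]
enum Op {
    Push(i32),
    Dup,
    Drop,
    Swap,
    Add,
    If { then_body: Vec<Op>, else_body: Vec<Op> },
    DoLoop(Vec<Op>),
}

/// Net change in data-stack depth, or `None` if a promotion check fails.
fn net_effect(ops: &[Op]) -> Option<i32> {
    let mut depth = 0;
    for op in ops {
        depth += match op {
            Op::Push(_) | Op::Dup => 1,
            Op::Drop | Op::Add => -1,
            Op::Swap => 0,
            // Equal-branch-effect check: IF pops the flag, then both
            // branches must change the depth by the same amount.
            Op::If { then_body, else_body } => {
                let then_effect = net_effect(then_body)?;
                let else_effect = net_effect(else_body)?;
                if then_effect != else_effect {
                    return None;
                }
                then_effect - 1
            }
            // Stack-neutrality check: DO pops limit and index, and the
            // body must leave the depth unchanged on every iteration.
            Op::DoLoop(body) => {
                if net_effect(body)? != 0 {
                    return None;
                }
                -2
            }
        };
    }
    Some(depth)
}

fn main() {
    // 0 10 0 DO 1 + LOOP  (body `1 +` is stack-neutral)
    let neutral_loop = vec![
        Op::Push(0), Op::Push(10), Op::Push(0),
        Op::DoLoop(vec![Op::Push(1), Op::Add]),
    ];
    // 10 0 DO 1 LOOP  (body leaves one extra cell per iteration)
    let leaky_loop = vec![Op::Push(10), Op::Push(0), Op::DoLoop(vec![Op::Push(1)])];
    // 1 IF 10 THEN  (branches disagree: +1 versus 0)
    let unbalanced_if = vec![
        Op::Push(1),
        Op::If { then_body: vec![Op::Push(10)], else_body: vec![] },
    ];
    println!("neutral loop:  {:?}", net_effect(&neutral_loop));
    println!("leaky loop:    {:?}", net_effect(&leaky_loop));
    println!("unbalanced if: {:?}", net_effect(&unbalanced_if));
}
```

A word whose analysis returns `None` would simply fall back to the memory stack in this sketch.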

 ### The Problem

@@ -342,7 +345,7 @@ These can be added as new `IrOp` variants recognized by peephole and emitted by

 ## 10. Codegen Improvements

-**Status: Done (DSP caching).** `$dsp` cached in WASM local 0, written back before calls and at function exit. Commutative optimization and loop index in local are future work.
+**Status: Done.** DSP is cached in WASM local 0, the DO/LOOP index and limit live in WASM locals (fast path when the body has no calls or return-stack ops), and J is compiled as an IR primitive (previously a host function). The self-recursive `call` optimization is described in section 14.

 These are improvements within `codegen.rs` `emit_op()` that do not require new IR operations.

@@ -390,7 +393,11 @@ i32.add               ;; result on wasm stack

 ### Loop Index in Local

-`DO...LOOP` currently stores the loop index and limit on the return stack (in memory). Keep them in WASM locals for the duration of the loop body. This makes `I` (read loop index) a simple `local.get` instead of a memory load from the return stack.
+**Status: Done.** DO/LOOP index and limit are kept in WASM locals. There are two codegen paths:
+- **Fast path** (body has no calls and no `>R`/`R>`): pure locals, zero return stack traffic. `I` compiles to a `local.get`; `J` compiles to a `local.get` of the outer loop's index local.
+- **Slow path** (body has calls or explicit return-stack ops): locals are still used for loop control, but they are synced to the return stack for LEAVE/UNLOOP compatibility.
+
+`J` was converted from a host function (a WASM-to-Rust round trip) to an IR primitive (`IrOp::LoopJ`) that compiles to `local.get` of the outer loop's index local.
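
The path selection described above amounts to scanning the loop body for anything that touches the return stack. The following is a sketch only: `Op`, `LoopPath`, and `choose_path` are hypothetical names, and treating a nested loop as slow when its own body needs the return stack is an assumption, not something the codegen is documented to do.

```rust
// Hypothetical toy IR for illustration; not WAFER's actual `IrOp`.
#[allow(dead_code)]
enum Op {
    Call(u32),       // call to another word
    ToR,             // >R
    RFrom,           // R>
    LoopI,           // I
    LoopJ,           // J
    Add,
    DoLoop(Vec<Op>), // nested DO ... LOOP
}

#[derive(Debug, PartialEq)]
enum LoopPath {
    Fast, // index and limit live only in WASM locals
    Slow, // locals are mirrored to the return stack
}

/// True if any op in the body (or in a nested loop) uses the return stack.
fn needs_return_stack(ops: &[Op]) -> bool {
    ops.iter().any(|op| match op {
        Op::Call(_) | Op::ToR | Op::RFrom => true,
        Op::DoLoop(inner) => needs_return_stack(inner),
        _ => false,
    })
}

fn choose_path(body: &[Op]) -> LoopPath {
    if needs_return_stack(body) {
        LoopPath::Slow
    } else {
        LoopPath::Fast
    }
}

fn main() {
    // DO I + LOOP : pure arithmetic on the loop index.
    println!("{:?}", choose_path(&[Op::LoopI, Op::Add]));
    // DO I some-word LOOP : the callee may inspect the return stack.
    println!("{:?}", choose_path(&[Op::LoopI, Op::Call(7)]));
}
```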

 ## 11. wasmtime Configuration

@@ -430,31 +437,46 @@ Currently, each of the 80+ primitives registered at boot creates a separate WASM

 Batch all IR-based primitives into a single WASM module with multiple exported functions. One `Module::new()` + one `Instance::new()` replaces 80+ pairs. This is a subset of what Consolidation (section 8) achieves, but scoped to primitives only and simpler to implement.

-## 14. Float and Double-Cell Stack
+## 14. Self-Recursive Direct Call
+
+**Status: Done.** When a word calls itself (RECURSE), the codegen emits `call WORD_FUNC` (direct call to the same function) instead of `call_indirect` through the function table. This eliminates the table lookup and signature check overhead.
+
+### Impact
+
+Fibonacci(25) with ~243K recursive calls:
+- `call_indirect`: ~21ns/call → 5.0ms total
+- Direct `call`: ~7ns/call → 1.6ms total (3x faster)
+- gforth: ~14ns/call → 3.4ms total
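
The call count and per-call figures above can be cross-checked with a few lines of arithmetic. This sketch assumes the benchmark is the naive doubly recursive Fibonacci with base case `n < 2`, which is consistent with the ~243K figure but is not stated in this document; `fib_calls` and `ns_per_call` are illustrative helpers.

```rust
/// Calls made by naive recursive Fibonacci, counting the initial call:
/// calls(n) = 1 + calls(n - 1) + calls(n - 2), with calls(0) = calls(1) = 1.
fn fib_calls(n: u32) -> u64 {
    if n < 2 {
        1
    } else {
        1 + fib_calls(n - 1) + fib_calls(n - 2)
    }
}

/// Convert a total runtime in microseconds into nanoseconds per call.
fn ns_per_call(total_us: f64, calls: u64) -> f64 {
    total_us * 1000.0 / calls as f64
}

fn main() {
    let calls = fib_calls(25);
    println!("calls for Fibonacci(25): {calls}"); // prints 242785

    // Totals taken from the list above, in microseconds.
    for (label, total_us) in [
        ("call_indirect", 5000.0),
        ("direct call", 1600.0),
        ("gforth", 3400.0),
    ] {
        println!("{label}: {:.1} ns/call", ns_per_call(total_us, calls));
    }
}
```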
+
+The optimization is implemented in `emit_op` for `IrOp::Call`: when `ctx.self_word_id == Some(word_id)`, emit `call WORD_FUNC` (function index 1 in the word's own module). The `self_word_id` is derived from `CodegenConfig::base_fn_index`.
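
The decision itself is a single comparison. A simplified sketch follows; `Instr`, `Ctx`, and `emit_call` are hypothetical stand-ins for the real codegen types, and the indirect call is modelled as one instruction rather than the table-index push plus `call_indirect` that WASM actually requires.

```rust
/// Simplified instruction model; not the real WASM encoder types.
#[derive(Debug, PartialEq)]
enum Instr {
    Call(u32),
    CallIndirect { table_slot: u32 },
}

/// Per the text above: the word's own function is index 1 in its module.
const WORD_FUNC: u32 = 1;

/// Hypothetical stand-in for the codegen context.
struct Ctx {
    self_word_id: Option<u32>,
}

fn emit_call(ctx: &Ctx, word_id: u32) -> Instr {
    if ctx.self_word_id == Some(word_id) {
        // Self-recursion: the target is known statically, so skip the table.
        Instr::Call(WORD_FUNC)
    } else {
        // Any other word: dispatch through the shared function table.
        Instr::CallIndirect { table_slot: word_id }
    }
}

fn main() {
    let ctx = Ctx { self_word_id: Some(42) };
    println!("{:?}", emit_call(&ctx, 42)); // Call(1)
    println!("{:?}", emit_call(&ctx, 7)); // CallIndirect { table_slot: 7 }
}
```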
+
+## 15. Float and Double-Cell Stack

 **Status: Not started.** `PushI64` and `PushF64` exist as IR ops but are stubs in codegen. Float stack operations are currently all host functions.

 The float stack lives in its own memory region (0x2540--0x2D40). Float operations will have the same memory-based overhead as integer operations, but worse: `f64` values are 8 bytes, doubling the memory traffic per push/pop. Stack-to-local promotion (section 1) is even more impactful for floats because WASM has native `f64` locals and operand stack support.

-## Suggested Implementation Order
+## Current Performance vs Gforth

-Ordered by effort-to-impact ratio (cheapest wins first):
+All optimizations enabled, release mode, measured with UTIME:

-| Priority | Optimization                                       | Effort  | Unlocks                                         |
-| -------- | -------------------------------------------------- | ------- | ----------------------------------------------- |
-| 1        | Peephole optimization                              | Low     | Immediate code size reduction                   |
-| 2        | Constant folding                                   | Low     | Composes with peephole                          |
-| 3        | Tail call detection                                | Low     | Recursive word optimization                     |
-| 4        | Dictionary hash index                              | Low     | Faster compilation                              |
-| 5        | wasmtime config tuning                             | Trivial | Caching, interruption                           |
-| 6        | Codegen improvements (global caching, loop locals) | Medium  | ~30% fewer instructions                         |
-| 7        | Inlining                                           | Medium  | Unlocks cross-word folding and peephole         |
-| 8        | Strength reduction                                 | Low     | Best after inlining exists                      |
-| 9        | Dead code elimination                              | Low     | Best after constant folding exists              |
-| 10       | Compound IR operations                             | Medium  | Cumulative gains                                |
-| 11       | Stack-to-local promotion                           | High    | The single biggest speedup (~7x for arithmetic) |
-| 12       | Startup batching                                   | Medium  | Faster boot                                     |
-| 13       | Consolidation                                      | High    | Direct calls, cross-word optimization           |
-| 14       | Float/double-cell                                  | Medium  | Depends on stack-to-local                       |
+```
+Benchmark                   WAFER     CONSOL     gforth      WAFER/gf
+Fibonacci(25)                1629       1535       3422        0.45x
+Factorial(12)x10K             340        339        638        0.53x
+GCD-bench(500)                 18         15         30        0.50x
+NestedLoops(50)                84         73        720        0.10x
+Collatz(2K)                  1212       1202       3914        0.31x
+```

-Stack-to-local promotion has the highest impact but also the highest implementation cost. The passes before it (peephole, folding, inlining) are simpler and their benefits multiply when stack-to-local promotion is eventually added. Consolidation is last because it requires storing IR bodies and restructuring the module generation -- it benefits most from having all other passes working first.
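+Times in microseconds. CONSOL is the run after `CONSOLIDATE`. The WAFER/gf column is the CONSOL time divided by the gforth time, so values below 1.0 mean WAFER is faster.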
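
The ratio column can be recomputed from the timing columns. The sketch below copies the timings from the table and prints both the WAFER/gforth and the CONSOL/gforth ratio for each benchmark; `Row` and `ratio` are illustrative names, not part of the benchmark harness.

```rust
struct Row {
    name: &'static str,
    wafer_us: f64,
    consol_us: f64,
    gforth_us: f64,
}

// Timings in microseconds, copied from the table above.
const ROWS: [Row; 5] = [
    Row { name: "Fibonacci(25)", wafer_us: 1629.0, consol_us: 1535.0, gforth_us: 3422.0 },
    Row { name: "Factorial(12)x10K", wafer_us: 340.0, consol_us: 339.0, gforth_us: 638.0 },
    Row { name: "GCD-bench(500)", wafer_us: 18.0, consol_us: 15.0, gforth_us: 30.0 },
    Row { name: "NestedLoops(50)", wafer_us: 84.0, consol_us: 73.0, gforth_us: 720.0 },
    Row { name: "Collatz(2K)", wafer_us: 1212.0, consol_us: 1202.0, gforth_us: 3914.0 },
];

/// Ratio below 1.0 means the measured time beats gforth.
fn ratio(time_us: f64, gforth_us: f64) -> f64 {
    time_us / gforth_us
}

fn main() {
    println!("{:<20} {:>9} {:>10}", "Benchmark", "WAFER/gf", "CONSOL/gf");
    for row in &ROWS {
        println!(
            "{:<20} {:>8.2}x {:>9.2}x",
            row.name,
            ratio(row.wafer_us, row.gforth_us),
            ratio(row.consol_us, row.gforth_us)
        );
    }
}
```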
+
+## Remaining Opportunities
+
+| Optimization | Status | Potential Impact |
+| --- | --- | --- |
+| BEGIN loop promotion | Not started | Would speed up GCD-style tight loops further |
+| BeginDoubleWhileRepeat promotion | Not started | Rare pattern, low priority |
+| LEAVE as IR primitive | Not started | Would enable fast-path for loops with LEAVE |
+| Float stack-to-local | Not started | Eliminate float stack memory traffic |
+| WASM tail calls proposal | Waiting on wasmtime | Would eliminate stack growth for tail-recursive words |
@@ -27,7 +27,7 @@ tests/
    forth2012-test-suite/   Forth 2012 compliance test suite (submodule)
 ```

-The compiler and runtime lives in `outer.rs` (~10,400 lines). Codegen is in `codegen.rs` (~2,800 lines). The optimizer is in `optimizer.rs` (~800 lines). Everything else is supporting infrastructure.
+The compiler and runtime live in `outer.rs` (~10,500 lines). Codegen is in `codegen.rs` (~3,900 lines). The optimizer is in `optimizer.rs` (~800 lines). Everything else is supporting infrastructure.

 ## What Happens When You Start WAFER

@@ -282,6 +282,10 @@ When the compiler encounters a word reference during compilation, it emits:
 (call_indirect (type $void) (table 0))    ;; indirect call through the table
 ```

+**Self-recursive optimization**: When a word calls itself (RECURSE), the codegen detects this and emits a direct `call` instead of `call_indirect`, eliminating the table lookup and signature check (~3x faster for recursive words like Fibonacci).
+
+**After CONSOLIDATE**: All `call_indirect` instructions between words in the consolidated module are replaced with direct `call` instructions, giving similar benefits for cross-word calls.
+
 At runtime, wasmtime resolves the table entry and calls the target function. Because all functions share the same memory, globals, and table, state passes between words through the data stack in linear memory. There are no function parameters or return values at the WASM level -- everything goes through the stack.

 This is subroutine threading: each word is a subroutine, and calling a word is an indirect function call.
@@ -362,13 +366,14 @@ Note: `EMIT` is an IR primitive -- it compiles to WASM code that calls the impor

 WAFER generates all WASM modules in memory. No `.wasm` files are written to disk. No caches, no configuration files, no persistent state. Every time you start WAFER, it rebuilds everything from scratch.

-The `--consolidate` CLI flag is reserved for a planned feature: compiling all words into a single optimized WASM module for ahead-of-time deployment. This is not yet implemented.
+The `CONSOLIDATE` word (available at the REPL and in source files) recompiles all defined words into a single optimized WASM module with direct `call` instructions. The `wafer build` subcommand compiles Forth source to standalone `.wasm` files or native executables.

 ## Running the Tests

 ```bash
-cargo test --workspace              # All unit tests (~220)
+cargo test --workspace              # All tests (~450)
 cargo test -p wafer-core --test compliance   # Forth 2012 compliance suite
+cargo test -p wafer-core --test comparison -- --nocapture --ignored  # vs gforth benchmarks
 cargo run -p wafer -- file.fth      # Execute a Forth source file
 echo '5 3 + .' | cargo run -p wafer # Pipe input (non-interactive)
 ```