tl;dr: on Apple Silicon, a WebAssembly module's linear memory can be shared directly with the GPU: no copies, no serialization, no intermediate buffers. The CPU and GPU read and write the same physical bytes. End-to-end, it works: a Wasm guest fills a matrix in its linear memory, the GPU reads it, computes, writes back, and the guest sees the result through the same pointer, same memory, zero copies.
Normally Wasm and GPUs are separated by an expensive serialization boundary: on most hardware, getting data from a VM sandbox to an accelerator means copying across a bus. Apple Silicon's Unified Memory Architecture erases that boundary (no bus, same physical memory), and what falls out is a runtime where Wasm is the control plane and the GPU is the compute plane, with near-zero overhead between them.
I'm building something called Driftwood that exploits this for stateful AI inference ... and this post is about the foundation (how the zero-copy chain works, what I measured, what it opens up). Still early, still poking at it.
Quick background, for anyone who doesn't live in this stack: WebAssembly gives you a sandbox. Your module gets a flat byte array (linear memory) and that's the universe ... everything outside is mediated by "host" function calls. The whole point is isolation, portability, determinism.
GPUs also want a flat byte array, but a specific kind: page-aligned, pinned, accessible to the DMA engine. On a discrete GPU (think NVIDIA, or AMD), that memory sits across a PCIe bus from the CPU, so getting data from a Wasm module's linear memory to the GPU means: copy out of the sandbox into host memory, then copy across the bus into GPU memory. Two copies, two latency hits, and an awkward impedance mismatch between "isolated VM" and "hardware accelerator."
Apple Silicon changes the physics. The CPU and GPU share the same physical memory (Apple's Unified Memory Architecture) ... no bus! A pointer the CPU can read, the GPU can also read, from the same DRAM. The real question: can you thread that pointer through the layers of abstraction (the Wasm runtime, the GPU API) without anyone making a defensive copy along the way?
Turns out ... you can!
Three links in the chain. I validated each one on its own before trying to compose them: this is the kind of pipeline where, if you skip the isolated checks and the whole thing breaks, you have no idea which joint is leaking.
Link 1: mmap gives you page-aligned memory. On ARM64 macOS, mmap with MAP_ANON | MAP_PRIVATE returns 16 KB-aligned addresses. That's no lucky accident: 16 KB is the ARM64 page size, and mmap returns page-aligned addresses by contract. The alignment matters because Metal's no-copy path requires it.
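You can check the alignment contract yourself. Here's a minimal stdlib-Python sketch standing in for what the runtime does in native code; note that on ARM64 macOS `mmap.PAGESIZE` is 16384, while on most Linux boxes it's 4096 — the point is that mmap aligns to whatever the page size is.

```python
import ctypes
import mmap

# Anonymous private mapping, the same flags a Wasm runtime would use for
# linear memory. mmap aligns the base address to the page size by contract.
SIZE = 1 << 24  # 16 MiB region

region = mmap.mmap(-1, SIZE, mmap.MAP_PRIVATE | mmap.MAP_ANON)
base = ctypes.addressof(ctypes.c_char.from_buffer(region))

assert base % mmap.PAGESIZE == 0
print(f"base=0x{base:x} page={mmap.PAGESIZE} aligned=True")
```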
Link 2: Metal accepts that pointer without copying. MTLDevice.makeBuffer(bytesNoCopy:length:options:deallocator:) wraps an existing page-aligned pointer as a Metal buffer. On Apple Silicon this is the zero-copy path: the GPU accesses the same physical memory the CPU does. I verified pointer identity (the MTLBuffer.contents() pointer equals the original mmap pointer) and checked for hidden copies: the RSS delta was 0.03 MB (measurement noise) versus 16.78 MB for the explicit-copy path, with identical compute latency either way.
Link 3: Wasmtime lets you bring your own allocator. Wasmtime's MemoryCreator trait lets you control how linear memory is allocated: instead of letting Wasmtime call mmap internally, you provide the backing memory yourself. I implemented MemoryCreator to return my own mmap region, and Wasmtime's memory.data_ptr() returned exactly the pointer I handed it. The Wasm module reads and writes through Wasmtime's memory API; the GPU reads and writes through the Metal buffer; both operate on the same bytes.
The composition: allocate an mmap region, hand it to both Wasmtime (as the actor's linear memory) and Metal (as a GPU buffer). The Wasm module writes data at known offsets, the GPU computes on it in place, and the results appear in the module's linear memory with no copies and no explicit data transfer.
I tested the full chain with a 128×128 matrix multiply: the Wasm module fills matrices A and B, the GPU runs a GEMM shader, the module reads result C back. Zero errors across 16,384 elements. Small test, but it's the kind of thing where either it all lines up or you get garbage, so zero errors is the signal I wanted.
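The shape of that test is easy to show without any of the real machinery. Below is a stdlib-Python sketch where one mmap region plays the shared linear memory, a "guest" function plays the Wasm module, and a "device" function plays the GPU — everything reads and writes the same bytes at known offsets, which is exactly the composition above (minus the actual sandbox and shader). B is the identity matrix so the expected result is just A.

```python
import mmap
import struct

N = 4
F32 = 4  # bytes per f32 element
OFF_A = 0
OFF_B = N * N * F32
OFF_C = 2 * N * N * F32

# One mmap region plays the role of the shared linear memory.
region = mmap.mmap(-1, 3 * N * N * F32)

def guest_write_inputs(mem):
    # The "Wasm guest": fill A and B at known offsets in linear memory.
    for i in range(N * N):
        struct.pack_into("<f", mem, OFF_A + i * F32, float(i))
        # B is the identity matrix, so the expected result C equals A.
        struct.pack_into("<f", mem, OFF_B + i * F32, 1.0 if i % (N + 1) == 0 else 0.0)

def device_gemm(mem):
    # The "GPU": read A and B through the same bytes, write C in place.
    a = [struct.unpack_from("<f", mem, OFF_A + i * F32)[0] for i in range(N * N)]
    b = [struct.unpack_from("<f", mem, OFF_B + i * F32)[0] for i in range(N * N)]
    for r in range(N):
        for c in range(N):
            acc = sum(a[r * N + k] * b[k * N + c] for k in range(N))
            struct.pack_into("<f", mem, OFF_C + (r * N + c) * F32, acc)

guest_write_inputs(region)
device_gemm(region)

# The guest reads C back through the same region: no copies anywhere.
result = [struct.unpack_from("<f", region, OFF_C + i * F32)[0] for i in range(N * N)]
assert result == [float(i) for i in range(N * N)]
```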
Three things I cared about: pointer identity (is it actually zero-copy?), memory overhead (any hidden copies sneaking in?), and correctness (does the GPU see what Wasm wrote?).
Measurement                   Zero-copy path      Copy path
─────────────────────────────────────────────────────────────
Pointer identity              mmap == MTLBuffer   different addrs
RSS delta (16 MB region)      0.03 MB             16.78 MB
GEMM latency (128×128)        ~6.75 ms            ~6.75 ms
Correctness (16K elements)    0 errors            0 errors
The latency equivalence makes sense: on UMA, the compute itself is identical either way. The memory picture is where it shows up: the zero-copy path has essentially no overhead for making data GPU-accessible, and the copy path doubles your memory footprint.
At small tensor sizes, nobody cares. At the scale of KV caches in transformer inference (hundreds of megabytes per conversation) it's the difference between fitting four actors in memory or two. That's the regime I actually want to operate in, so the memory part matters.
So now I've got a primitive: Wasm and the GPU share memory with no overhead. What do you do with it?
I plugged the chain into Apple's MLX framework and ran Llama 3.2 1B Instruct from a Wasm actor: a full transformer decoder written in Rust, compiled to WebAssembly, running in a native host runtime and driving inference on the Apple Silicon GPU through host function calls. (I was too lazy to wire up a custom kernel path from scratch, and ... MLX was there)
Measured latencies, running Llama 3.2 1B (4-bit quantized, 695 MB) on a 2021 M1 MacBook Pro (old personal laptop, I'll re-evaluate on a proper Mac Studio someday when I can get my hands on one 😄):
Operation                    Latency
──────────────────────────────────────
Model load (safetensors)     229 ms (one-time)
Prefill (5 tokens)           106 ms
Per-token generation         ~9 ms
Host function boundary       negligible
The host function boundary (the Wasm-to-GPU dispatch) isn't measurable against the inference cost. Anyone who's worked with sandboxed runtimes has probably winced at the thought of crossing that boundary per dispatch. On this hardware, it's not a thing.
Transformers maintain a key-value cache that accumulates context across conversation turns, which is normally ephemeral (kill the process, lose the cache, start over). If you've tried running local inference, you know the feeling.
Because the cache lives in GPU-accessible memory that I control, I can serialize it. So, I dump the KV cache to safetensors format (standard ML tensor serialization, nothing exotic) and restore it later, on the same machine, or a different machine, or potentially against a different model on a different machine! That last one I haven't tested across meaningfully different architectures yet ... we'll see.
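"Nothing exotic" is almost an understatement: the whole safetensors container is an 8-byte little-endian header length, a JSON header, then the raw tensor bytes. Here's a toy stdlib-Python writer/reader round-tripping a stand-in "KV cache" tensor — the tensor name and shapes are illustrative, not Driftwood's actual layout.

```python
import json
import os
import struct
import tempfile

def save_safetensors(path, tensors):
    # tensors: name -> (shape, raw_bytes); dtype fixed to F32 for this sketch.
    header, blob, off = {}, b"", 0
    for name, (shape, data) in tensors.items():
        header[name] = {"dtype": "F32", "shape": shape,
                        "data_offsets": [off, off + len(data)]}
        blob += data
        off += len(data)
    encoded = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        # Layout: u64 LE header length, JSON header, then the tensor bytes.
        f.write(struct.pack("<Q", len(encoded)) + encoded + blob)

def load_safetensors(path):
    with open(path, "rb") as f:
        (hlen,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(hlen))
        blob = f.read()
    return {name: (meta["shape"], blob[meta["data_offsets"][0]:meta["data_offsets"][1]])
            for name, meta in header.items()}

# Round-trip a tiny stand-in tensor and confirm bit-identical restore.
kv = {"layers.0.k_cache": ([2, 4], struct.pack("<8f", *range(8)))}
path = os.path.join(tempfile.gettempdir(), "kv.safetensors")
save_safetensors(path, kv)
assert load_safetensors(path) == kv
```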
Operation                  Latency    Size
───────────────────────────────────────────────────
Serialize (24 tokens)      1.1 ms     1.58 MB (~66 KB/token)
Restore from disk          1.4 ms
Re-prefill from scratch    67.7 ms    (the alternative)
───────────────────────────────────────────────────
Speedup from restore: 5.45×
Round-trip fidelity: bit-identical (10/10 tokens match)
5.45× at 24 tokens, and the ratio improves with context length: restore time is nearly constant, re-prefill scales linearly. At 4,096 tokens, restore would be around 100× faster than recomputation (I haven't actually pushed it to 4,096 yet; that's napkin math extrapolating from the constant-vs-linear shape).
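As a sanity check, the ~66 KB/token figure pencils out against the model's shapes. The numbers below are Llama 3.2 1B's published config as I understand it (16 layers, 8 KV heads, head dim 64), and the fp32 cache dtype is my assumption — treat this as napkin math, not a measurement.

```python
# Napkin check of the ~66 KB/token KV-cache figure.
# Assumed config: Llama 3.2 1B = 16 layers, 8 KV heads, head dim 64, fp32 cache.
layers, kv_heads, head_dim, dtype_bytes = 16, 8, 64, 4

per_token = layers * 2 * kv_heads * head_dim * dtype_bytes  # ×2 for K and V
print(per_token, per_token // 1024)   # 65536 bytes = 64 KiB per token
print(24 * per_token / 1e6)           # ≈ 1.57 MB at 24 tokens, vs 1.58 MB measured
```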
This is the basis for stateful actor mobility: freezing a conversation mid-exchange, moving it somewhere else, thawing it with full context intact. The Wasm module's linear memory captures the actor's logical state; the KV cache captures the inference engine's accumulated context. Together: a portable snapshot of a running AI conversation (or, at least, that's the plan 😅).
Driftwood is a runtime for stateful Wasm actors with GPU inference. The zero-copy chain is the foundation: on top of it I'm going to add actor snapshots (freeze and resume any conversation), checkpoint portability (move inference state across machines), and multi-model support (the snapshot format is model-agnostic, so in theory the actor's identity survives model swaps ... which might work; I'll revisit once I test it).
This is all early, still stitching things together. But the "physics" works: Wasm and the GPU can share memory on Apple Silicon with zero overhead, the KV cache is portable, and a full transformer runs from a sandboxed actor at native speed. The next things I want to poke at: whether the snapshot really survives a model swap, whether the chain holds up on larger models, and whether I'm missing some obvious reason this will fall over at scale. Slow and steady ...
More on the actor model and snapshot architecture in a future post, once I've actually shipped something past the "physics works" stage.