
World Emulation via Neural Network

I turned a forest trail near my apartment into a playable neural world, and you can explore it in your web browser.

By "neural world", I mean that the entire thing is a neural network generating new images based on previous images + controls. There is no level geometry, no code for lighting or shadows, no scripted animation. Just a neural net in a loop.

A diagram illustrating a neural network that consumes noise, controls, and memory, and produces video frames and memory.
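
In code, the loop looks something like the minimal sketch below; the names (`run_world`, `world_model`, `read_controls`) and the tensor shapes are illustrative placeholders, not the actual code behind this demo:

```python
import torch

def run_world(world_model, first_frame, read_controls, num_steps=1000):
    memory = [first_frame]                     # the network's view of the past
    with torch.no_grad():
        for _ in range(num_steps):
            noise = torch.randn(1, 4, 32, 32)  # randomness, so fresh detail can appear
            controls = read_controls()         # current walk / look input
            frame = world_model(noise, controls, memory)
            yield frame                        # show this frame on screen
            memory.append(frame)               # the new frame joins the memory
                                               # (in practice, a rolling window)
```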

By "in your web browser" I mean this world runs locally, in your web browser. Once the world has loaded, you can continue exploring even in Airplane Mode.

So, why bother creating a world this way? There are some interesting conceptual reasons (I'll get to them later), but my main goal was just to outdo a prior post.

See, three years ago, I got a simple two-dimensional video game world to run in-browser by training a neural network to mimic gameplay videos from YouTube.

Mimicking a 2D video game world was cute, but ultimately kind of pointless; if you want to play the original game, it already exists.

The wonderful, unique, exciting property of neural worlds is that they can be constructed from any video file, not just screen recordings of old video games.

So for this post, to demonstrate what makes neural worlds truly special, I decided to build one from footage of the real world.

Recording data

To begin this project, I walked along a forest trail, recording video with a customized camera app that also logged my phone's motion.

I collected ~15 minutes of video and motion recordings. I've visualized motion as a "walking" control stick on the left and a "looking" control stick on the right.

Back at home, I transferred the recordings to my laptop and shuffled them into a list of (previous frame + control → next frame) pairs, just like my previous game-emulation dataset.

A screenshot of two examples from the dataset, showing the memory and control inputs as well as the corresponding video frame outputs.
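
Building those pairs looks roughly like the sketch below; the structure mirrors the description above, while the function and field names are made up for illustration:

```python
import random

def make_pairs(frames, controls):
    """frames[i]: video frame at step i; controls[i]: phone-motion
    reading (the walk/look sticks) recorded between steps i and i+1."""
    pairs = []
    for i in range(len(frames) - 1):
        pairs.append({
            "memory":  frames[i],       # what the network saw last
            "control": controls[i],     # what the "player" did
            "target":  frames[i + 1],   # what the network should draw next
        })
    random.shuffle(pairs)               # shuffled, as described above
    return pairs
```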

Now, all I needed to do was train a neural network to mimic the behavior of these input→output pairs. I already had working code from my previous game-emulation project, so my first step was simply to rerun it on the new dataset.

Training baselines

Applying my previous game-emulation-via-neural-network recipe to this new dataset produced, regrettably, a sort of interactive forest-flavored soup.

My neural network couldn't predict the actual next frame accurately, and it couldn't make up new details fast enough to compensate, so the resulting world collapsed even if I gave it a running start by initializing from real video frames:
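
(For concreteness, the "running start" works roughly like this, reusing the placeholder names from the earlier sketch: fill the memory with real frames, then let the network continue on its own.)

```python
import torch

def running_start(world_model, real_frames, read_controls, num_steps=500):
    memory = list(real_frames)          # warm up on genuine video frames
    with torch.no_grad():
        for _ in range(num_steps):
            frame = world_model(torch.randn(1, 4, 32, 32),
                                read_controls(), memory)
            yield frame                 # in the baseline: gradual soup
            memory.append(frame)
```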

Undaunted, I started work on a new version of the neural world training code.

Upgrading the training recipe

To help my network understand real-world video, I made the following upgrades (the first two are sketched in code after the list):

  1. Adding more control information. I upgraded the "control" network input from simple 2D controls to more-informative 3D (6DoF) controls.
  2. Adding more memory. I upgraded the "memory" network input from a single frame to 32 frames (using lower resolution for the older frames).
  3. Adding multiple scales. I restructured the network to process all inputs across multiple resolutions, instead of a fixed 1/8 resolution.
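
Here's a rough sketch of what the first two upgrades mean for the network's inputs. The exact shapes, resolution cutoffs, and names below are illustrative guesses, and the third upgrade is architectural, so it doesn't appear here:

```python
import torch
import torch.nn.functional as F

def build_inputs(past_frames, pose_delta):
    """past_frames: up to 32 recent frames, newest last, each (3, H, W).
    pose_delta: 6DoF control (dx, dy, dz, pitch, yaw, roll)."""
    memory = []
    for age, frame in enumerate(reversed(past_frames[-32:])):
        # older frames are kept at lower resolution (cutoffs invented here)
        scale = 1.0 if age < 4 else 0.5 if age < 16 else 0.25
        h, w = frame.shape[1], frame.shape[2]
        size = (max(1, int(h * scale)), max(1, int(w * scale)))
        memory.append(F.interpolate(frame[None], size=size)[0])
    control = torch.tensor(pose_delta, dtype=torch.float32)  # six numbers
    return memory, control
```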

A before/after diagram of the neural network architecture.

These upgrades let me stave off soupification enough to get a half-baked demo:

This was significant progress. Unfortunately, the world was still pretty melty, so I kept tinkering.

Upgrading the training recipe more

This time, I left the inputs/outputs as-is and focused on finding incremental improvements to the training procedure. Here's a mercifully-abbreviated montage:

The biggest jumps in quality came from:

  1. Making the network bigger: I added even more layers of neural network processing, while striving to maintain a somewhat-playable FPS.
  2. Picking a better training objective: I adjusted training to put less emphasis on detail prediction and more emphasis on detail generation (see the sketch after this list).
  3. Training longer: I trained the network longer on a selected subset of video frames to try and eke out the highest-quality results.
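
To illustrate the second point: one way to put less weight on exact detail prediction is to compare downsampled frames, leaving the network free to invent plausible fine detail. The sketch below is illustrative, not my exact objective:

```python
import torch.nn.functional as F

def coarse_weighted_loss(pred, target, coarse_weight=4.0):
    """pred/target: (N, 3, H, W) image batches."""
    fine = F.l1_loss(pred, target)               # penalizes exact detail
    coarse = F.l1_loss(F.avg_pool2d(pred, 8),    # compare 1/8-res versions:
                       F.avg_pool2d(target, 8))  # broad structure only
    return coarse_weight * coarse + fine         # structure matters most
```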

Here's a summary of the final forest world recipe:

Whew. So, let's return to the original question:

Two ways to create worlds

Traditional game worlds are made like paintings. You sit in front of an empty canvas and layer keystroke upon keystroke until you get something beautiful. Every lifelike detail in a traditional game is only there because some artist painted it in.

Neural worlds are made rather differently: you record a place that already exists, and training does the rest. Every lifelike detail is there because the camera captured it.

So, if traditional game worlds are paintings, neural worlds are photographs.

A doodle showing how information flows in painting-style worlds vs. photo-style worlds.

Admittedly, as of this post, neural worlds resemble very early photographs.

An early daguerreotype.

The exciting part was that cameras reduced realistic-image-creation from an artistic problem to a technological one.

A modern iPhone photograph.

I think that neural worlds will improve in fidelity just like photographs did. In time, neural worlds will have trees that bend in the wind, lilypads that bob in the rain, birds that sing to each other.

I think the tools for creating neural worlds can also, eventually, be just as convenient as today's cameras. In the same way that a modern digital camera creates images or videos at the press of a button, we could have a tool to create worlds.

If neural worlds become as lifelike, cheap, and composable as photos are today, I think that would be very exciting indeed!


Neural networks that model the world are often called "world models", and many smart people have worked on them: a classic example is Comma's "Learning a Driving Simulator", and some more recent examples are OpenDriveLab's Vista and Wayve's GAIA-2. If you're a programmer interested in training your own world models, I recommend looking at DIAMOND or Diffusion Forcing.

Compared to serious "Foundation World Models" with billions of parameters, the network powering this forest world is tiny; it has to be, to run in real time in your web browser.