I turned a forest trail near my apartment into a playable neural world.
By "neural world", I mean that the entire thing is a neural network generating new images based on previous images + controls. There is no level geometry, no code for lighting or shadows, no scripted animation. Just a neural net in a loop.
By "in your web browser" I mean this world runs locally, in your web browser. Once the world has loaded, you can continue exploring even in Airplane Mode.
So, why bother creating a world this way? There are some interesting conceptual reasons (I'll get to them later), but my main goal was just to outdo a prior post.
See, three years ago, I got a simple two-dimensional video game world to run in-browser by training a neural network to mimic gameplay videos from YouTube.
Mimicking a 2D video game world was cute, but ultimately kind of pointless; the game already existed, so anyone who wanted to visit that world could simply play the original.
The wonderful, unique, exciting property of neural worlds is that they can be constructed from any video file, not just screen recordings of old video games.
So for this post, to demonstrate what makes neural networks truly special, I decided to build a playable world from video of a real place.
To begin this project, I walked through a forest trail, recording video with my phone using a customized camera app that also logged the phone's motion.
I collected ~15 minutes of video and motion recordings. I've visualized motion as a "walking" control stick on the left and a "looking" control stick on the right.
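(To make the data concrete: after syncing, each video frame gets paired with a handful of stick values. The exact format below is my illustrative guess, not the app's actual recording format.)

```python
import torch

# One recorded sample: an RGB video frame plus the phone's motion at that
# instant, flattened into four stick values (an illustrative format).
frame = torch.zeros(3, 64, 64)     # RGB frame, 64x64 here for the sketch
control = torch.tensor([0.0, 1.0,  # left stick: "walking" (x, y)
                        0.2, 0.0]) # right stick: "looking" (x, y)
```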
Back at home, I transferred the recordings to my laptop, and shuffled them into a list of (previous frame, control → next frame) pairs, just like my previous game-emulation dataset.
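In code, that pairing step might look something like this (`make_pairs` is an illustrative helper, not my actual dataset code):

```python
import random

def make_pairs(frames, controls):
    """Illustrative helper: turn an aligned recording into shuffled
    ((previous frame, control) -> next frame) training examples."""
    pairs = []
    for t in range(1, len(frames)):
        pairs.append(((frames[t - 1], controls[t]), frames[t]))
    random.shuffle(pairs)  # shuffling decorrelates neighboring examples
    return pairs
```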
Now, all I needed to do was train a neural network to mimic the behavior of these input→output pairs. I already had working code from my previous game-emulation project, so in theory this part was already solved.
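Here's roughly what "mimic the behavior of these pairs" means as a toy training loop, reusing the stand-ins from the sketches above. The per-pixel MSE loss, optimizer, and hyperparameters are my illustrative assumptions, not the actual recipe:

```python
import torch
import torch.nn.functional as F

# `model`/`CONTEXT` and `make_pairs` come from the earlier sketches;
# `recorded_frames`/`recorded_controls` stand for the transferred phone data.
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(10):
    for (prev_frame, ctrl), target in make_pairs(recorded_frames, recorded_controls):
        # tile the single previous frame to fill the model's context window
        pred = model(prev_frame.expand(CONTEXT, -1, -1, -1), ctrl)
        loss = F.mse_loss(pred, target)  # "mimic the next frame" as plain regression
        opt.zero_grad()
        loss.backward()
        opt.step()
```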
Applying my previous game-emulation-via-neural-network recipe to this new dataset produced, regrettably, a sort of interactive forest-flavored soup.
My neural network couldn't predict the actual next frame accurately, and it couldn't make up new details fast enough to compensate, so the resulting world collapsed even if I gave it a running start by initializing from real video frames:
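To see the collapse concretely, imagine seeding the model with real frames and letting it run on its own outputs while measuring the error against the recording; a sketch, again with the stand-in names from above:

```python
import torch
import torch.nn.functional as F

# Give the model a "running start" from real video, then roll out autoregressively.
rollout = [f.clone() for f in recorded_frames[:CONTEXT]]
for t in range(CONTEXT, len(recorded_frames)):
    pred = model(torch.stack(rollout[-CONTEXT:]), recorded_controls[t])
    rollout.append(pred.detach())  # the model now eats its own (imperfect) output
    drift = F.mse_loss(pred, recorded_frames[t]).item()
    print(f"step {t}: drift from real video = {drift:.4f}")  # climbs fast as the world turns to soup
```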
Undaunted, I started work on a new version of the neural world training code.
To help my network understand real-world video, I made a series of upgrades to the training setup.
These upgrades let me stave off soupification enough to get a half-baked demo:
This was significant progress. Unfortunately, the world was still pretty melty, so I went back for another round of tinkering.
This time, I left the inputs/outputs as-is and focused on finding incremental improvements to the training procedure. Here's a mercifully-abbreviated montage:
The biggest jumps in quality came from a few key changes to the training procedure.
Here's a summary of the final forest world recipe:
Whew. So, let's return to the original question: why bother creating a world this way?
Traditional game worlds are made like paintings. You sit in front of an empty canvas and layer keystroke upon keystroke until you get something beautiful. Every lifelike detail in a traditional game is only there because some artist painted it in.
Neural worlds are made rather differently. Nobody paints in the details; you record a real place, and the network learns to reproduce whatever the camera saw.
So, if traditional game worlds are paintings, neural worlds are photographs.
Admittedly, as of this post, neural worlds resemble very early photographs: blurry, smeary, and no match for what a skilled painter could produce. But early photographs were exciting anyway, because cameras reduced realistic-image-creation from an artistic problem to a technological one.
I think that neural worlds will improve in fidelity just like photographs did. In time, neural worlds will have trees that bend in the wind, lilypads that bob in the rain, birds that sing to each other.
I think the tools for creating neural worlds can also, eventually, be just as convenient as today's cameras. In the same way that a modern digital camera creates images or videos at the press of a button, we could have a tool to create worlds.
If neural worlds become as lifelike, cheap, and composable as photos are today, anyone will be able to capture a place they care about and share it as an explorable world.
I think that would be very exciting indeed!
Neural networks that model the world are often called "world models", and many smart people have worked on them; a classic example is Comma's "Learning a Driving Simulator", and some more recent examples are OpenDriveLab's Vista and Wayve's GAIA-2. If you're a programmer interested in training your own world models, I recommend looking at DIAMOND or Diffusion Forcing.
Compared to serious "Foundation World Models" with billions of parameters, my little forest world is a toy. But it's a toy that runs locally, in your web browser, and you can explore it right now.