Scroll mode
Our universe has about 10^80 atoms. Now imagine if each atom contained its own universe, with 10^80 atoms inside. Even that barely scratches the surface. You'd need about 5,000 layers of atom-universes, nested one inside the next, before you reached the number of possible images the size of the one above - about 10^400,000. That's a 1 with 400,000 zeroes after it.
As you can imagine, the vast majority of these are nothing but random noise:
Depending on your computer, you're seeing up to 60 random images per second. If you see anything that looks like a real image before the heat death of the universe, let me (or my descendants) know.
Amazingly, diffusion models can navigate this vast space of possibilities to produce coherent results. Unlike humans who start with a blank canvas and add paint, diffusion models start with random noise, and gradually remove the noise until an image emerges.
From noise to image
A monarch butterfly on a purple coneflower, macro close-up, delicate orange and black wing detail, morning dew, soft bokeh
If we think of all possible images as occupying a vast, multi-dimensional space, then a diffusion model starts at a random point in that space, and gradually forges a path towards a point that's consistent with your prompt.
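The shape of that journey can be sketched in a few lines. Here's a toy version in plain NumPy - nothing model-specific, just a random point in a high-dimensional space being nudged, step by step, toward a target:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
target = rng.normal(size=1000)   # stand-in for "a point consistent with your prompt"
point = rng.normal(size=1000)    # the random starting point

start_distance = np.linalg.norm(point - target)
for _ in range(50):              # each step nudges us a little closer
    point += 0.1 * (target - point)
end_distance = np.linalg.norm(point - target)
```

After 50 steps the point sits far closer to the target than where it started. Real diffusion steps are much cleverer about picking the direction, but the overall shape of the journey is the same.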
Actually, it's not as bad as I made out, because the model operates in a compressed, lower-dimensional space called latent space. During training, the model learns alongside an encoder/decoder that translates between this latent space and real images. In these examples, the latent space has 12x fewer dimensions than the full image space - still vast, but far more manageable.
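As a quick sanity check on that 12x figure - assuming, hypothetically, 256x256 RGB images and a 32x32 latent with 16 channels (the channel count is my assumption, not stated in the article):

```python
image_dims = 256 * 256 * 3    # a 256x256 RGB image: 196,608 numbers
latent_dims = 32 * 32 * 16    # a hypothetical 32x32 latent with 16 channels: 16,384 numbers
ratio = image_dims // latent_dims   # 12x fewer dimensions
```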
Latent space
32×32 latent PCA (shown 4×)
We can't show the full latent space because it has too many dimensions, but I've compressed it down to give you a feel for it.
From here on, we ignore the latent space and show the fully decoded image at each step - because it's easier for humans to understand. But in reality, the decoding only happens right at the end.
Just like the multi-dimensional space of possible images, text prompts can be mapped to a high-dimensional "embedding space". Each prompt lives at a particular spot in this space, and similar prompts cluster together.
Again, we can't show the full embedding because it's too high-dimensional but we can show this 2D compression. Notice how similar concepts group together.
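"Clustering" here just means that similar prompts end up with similar vectors. With made-up toy vectors (real embeddings have hundreds of dimensions), cosine similarity captures the idea:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-d "embeddings", invented purely for illustration
butterfly = np.array([0.9, 0.8, 0.1, 0.0])
moth      = np.array([0.8, 0.9, 0.2, 0.1])
truck     = np.array([0.1, 0.0, 0.9, 0.8])

# "butterfly" sits much closer to "moth" than to "truck"
```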
The embedding acts like a "compass" for the diffusion process. At each step on its journey, the model looks at the embedding to figure out the best direction to move.
If we start at different points in the possible-image space, we end up at slightly different destinations. This is determined by the random seed - a number that starts off the initial randomisation.
Different random seeds, same prompt
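In code, "determined by the random seed" is literal: the seed fixes the initial noise exactly. A NumPy sketch (the 32x32x16 latent shape is an assumption for illustration):

```python
import numpy as np

latent_shape = (32, 32, 16)   # hypothetical latent dimensions
a = np.random.default_rng(seed=42).normal(size=latent_shape)
b = np.random.default_rng(seed=42).normal(size=latent_shape)
c = np.random.default_rng(seed=7).normal(size=latent_shape)

same_seed_matches = bool((a == b).all())   # same seed, same starting noise
diff_seed_matches = bool((a == c).all())   # different seed, different start
```

Same seed, same starting point, same destination; change the seed and the whole journey shifts.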
We can choose how many steps we take before stopping. A small number means we have to take big steps, and can end up off track - if we get there at all. But after a certain point, taking more steps doesn't help much, and just wastes time.
Different numbers of inference steps
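One way to see the trade-off: with a simplified linear noise schedule (real schedulers are more sophisticated), fewer steps means each step has to remove far more noise at once:

```python
import numpy as np

def noise_removed_per_step(num_steps):
    """Noise level falls from 1.0 to 0.0; return how much each step removes."""
    levels = np.linspace(1.0, 0.0, num_steps + 1)
    return np.abs(np.diff(levels))

few = noise_removed_per_step(4)    # big, risky jumps
many = noise_removed_per_step(50)  # small, careful steps
```

Either way the total noise removed is the same - the question is just how coarsely you chop up the journey.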
A vague prompt leads to a more wobbly compass. More detailed prompts constrain the direction more tightly, leading to better results.
Variation of prompt detail
A monarch butterfly on a purple coneflower, macro
A monarch butterfly on a purple coneflower, macro close-up, delicate orange and black wing
A monarch butterfly on a purple coneflower, macro close-up, delicate orange and black wing detail, morning dew, soft bokeh
Explore the effect of your prompt on the generated image:
Prompt length explorer
A monarch butterfly on a purple coneflower, macro close-up, delicate orange and black wing detail, morning dew, soft bokeh
Because prompts also exist in their own "embedding space", we can follow a path between any two prompt embeddings, and generate images along the way. These "in-between" points don't correspond to human words, but they exist in the embedding space.
The space between two prompts
A monarch butterfly on a purple coneflower, macro close-up, delicate orange and black wing detail, morning dew, soft bokeh
A snail resting on a moss-covered rock, extreme close-up, shell spiral detail, rain droplets, rich green, soft overcast lighting
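Following a path between two embeddings is just interpolation. Here's a linear version - many pipelines actually use spherical interpolation, but the idea is the same - with toy 2-d vectors standing in for the real prompt embeddings:

```python
import numpy as np

def lerp(a, b, t):
    """Move fraction t of the way from embedding a to embedding b."""
    return (1 - t) * a + t * b

butterfly = np.array([1.0, 0.0])   # toy stand-in for the butterfly prompt
snail     = np.array([0.0, 1.0])   # toy stand-in for the snail prompt

midpoint = lerp(butterfly, snail, 0.5)   # no English sentence maps here,
                                         # but it's still a valid embedding
```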
You can explore this "in-between" space yourself:
A monarch butterfly on a purple coneflower, macro close-up, delicate orange and black wing detail, morning dew, soft bokeh
A snail resting on a moss-covered rock, extreme close-up, shell spiral detail, rain droplets, rich green, soft overcast lighting
The model decides how strongly to follow your prompt using a number called the guidance scale. A higher value gives the prompt a stronger "steering" effect, but if we set it too high we can end up with unnatural, oversaturated images:
Changing the guidance scale
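Under the hood this is usually classifier-free guidance (I'm assuming PRX follows the common recipe here): the model makes two predictions per step, one with the prompt and one without, and the guidance scale exaggerates the difference between them:

```python
import numpy as np

def guided_prediction(uncond, cond, guidance_scale):
    """Classifier-free guidance: push the prediction away from the
    unconditional one, toward (and past) the prompt-conditioned one."""
    return uncond + guidance_scale * (cond - uncond)

uncond = np.array([0.0])   # toy prediction without the prompt
cond = np.array([1.0])     # toy prediction with the prompt

mild = guided_prediction(uncond, cond, 1.0)     # just follow the prompt
strong = guided_prediction(uncond, cond, 15.0)  # overshoots far past it
```

At scale 1.0 you simply follow the prompt-conditioned prediction; at 15.0 you overshoot it by a wide margin, which is where the unnatural, oversaturated look comes from.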
You can explore how the guidance scale affects the image generation process:
Guidance Scale = 1.0
Guidance Scale = 15.0
Imagine you've been plonked in the middle of unfamiliar terrain with an uncertain destination and nothing but a compass to guide you. You come up with the following plan: check the compass, take a step in the direction it points, and repeat until you've used up your step budget.

Diffusion models follow a very similar pattern, guided by these 4 things:

1. The random seed, which picks the starting point.
2. The prompt embedding, which acts as the compass.
3. The number of steps, which decides how big each step has to be.
4. The guidance scale, which controls how strongly to trust the compass.
By combining these, the model navigates from pure chaos to a coherent image that matches your prompt. Here's that journey again, from noise to structure:
From noise to image
A monarch butterfly on a purple coneflower, macro close-up, delicate orange and black wing detail, morning dew, soft bokeh
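Putting all four ingredients together, the whole journey can be sketched as one loop. This is a heavily simplified sampler, not the real PRX one; `predict_noise` and `decode` are hypothetical stand-ins for the trained model and the latent decoder:

```python
import numpy as np

def generate(seed, num_steps, guidance_scale, prompt_embedding,
             predict_noise, decode):
    rng = np.random.default_rng(seed)
    latent = rng.normal(size=(32, 32, 16))   # 1. the seed fixes the starting noise
    for _ in range(num_steps):               # 2. a fixed number of steps
        cond = predict_noise(latent, prompt_embedding)    # 3. the prompt steers,
        uncond = predict_noise(latent, None)
        noise = uncond + guidance_scale * (cond - uncond) # 4. scaled by guidance
        latent = latent - noise / num_steps  # remove a little noise each step
    return decode(latent)                    # decoding happens only at the end

# Dummy stand-ins so the sketch actually runs
image = generate(seed=42, num_steps=10, guidance_scale=7.5,
                 prompt_embedding=None,
                 predict_noise=lambda latent, emb: 0.1 * latent,
                 decode=lambda latent: latent)
```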
An AI model generating an image may look like magic, but now you know it's just navigation through an unimaginably vast space of possibilities.
Which, to be fair, is still super cool.
Thanks to Photoroom, who open-sourced their text-to-image model PRX, allowing me to generate all of these examples. All examples were generated using the Photoroom/prx-256-t2i-sft model. Extra thanks to Jon Almazán for the support and ideas.