Scroll mode
Our universe has about 10^80 atoms. Now imagine if each atom contained its own universe, with 10^80 atoms inside. Even that barely scratches the surface. You'd need about 5,000 layers of atom-universes, nested one inside the next, before you reached the number of possible images the size of the one above - about 10^400,000. That's a 1 with 400,000 zeroes after it.
As you can imagine, the vast majority of these are nothing but random noise:
Depending on your computer, you're seeing up to 60 random images per second. If you see anything that looks like a real image before the heat death of the universe, let me (or my descendants) know.
Amazingly, diffusion models can navigate this vast space of possibilities to produce coherent results. Unlike humans who start with a blank canvas and add paint, diffusion models start with random noise, and gradually remove the noise until an image emerges.
From noise to image
A monarch butterfly on a purple coneflower, macro close-up, delicate orange and black wing detail, morning dew, soft bokeh
If we think of all possible images as occupying a vast, multi-dimensional space, then a diffusion model starts at a random point in that space, and gradually forges a path towards a point that's consistent with your prompt.
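The shape of that journey can be sketched in a few lines. Here's a toy version in plain NumPy - nothing model-specific, just a random point in a high-dimensional space being nudged, step by step, toward a target:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
target = rng.normal(size=1000)   # stand-in for "a point consistent with your prompt"
point = rng.normal(size=1000)    # the random starting point

start_distance = np.linalg.norm(point - target)
for _ in range(50):              # each step nudges us a little closer
    point += 0.1 * (target - point)
end_distance = np.linalg.norm(point - target)
```

After 50 steps the point sits far closer to the target than where it started. Real diffusion steps are much cleverer about picking the direction, but the overall shape of the journey is the same.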
Actually, it's not as bad as I made out, because the model operates in a compressed, lower-dimensional space called latent space. During training, the model learns alongside an encoder/decoder that translates between this latent space and real images. In these examples, the latent space has 12x fewer dimensions than the full image space - still vast, but far more manageable.
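As a quick sanity check on that 12x figure - assuming, hypothetically, 256x256 RGB images and a 32x32 latent with 16 channels (the channel count is my assumption, not stated in the article):

```python
image_dims = 256 * 256 * 3    # a 256x256 RGB image: 196,608 numbers
latent_dims = 32 * 32 * 16    # a hypothetical 32x32 latent with 16 channels: 16,384 numbers
ratio = image_dims // latent_dims   # 12x fewer dimensions
```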
Latent space
32×32 latent PCA (shown 4×)
We can't show the full latent space because it has too many dimensions, but I've compressed it down to give you a feel for it.
From here on, we ignore the latent space and show the fully decoded image at each step - because it's easier for humans to understand. But in reality, the decoding only happens right at the end.
Just like the multi-dimensional space of possible images, text prompts can be mapped to a high-dimensional "embedding space". Each prompt lives at a particular spot in this space, and similar prompts cluster together.
Again, we can't show the full embedding because it's too high-dimensional but we can show this 2D compression. Notice how similar concepts group together.
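"Clustering" here just means that similar prompts end up with similar vectors. With made-up toy vectors (real embeddings have hundreds of dimensions), cosine similarity captures the idea:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-d "embeddings", invented purely for illustration
butterfly = np.array([0.9, 0.8, 0.1, 0.0])
moth      = np.array([0.8, 0.9, 0.2, 0.1])
truck     = np.array([0.1, 0.0, 0.9, 0.8])

# "butterfly" sits much closer to "moth" than to "truck"
```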
The embedding acts like a "compass" for the diffusion process. At each step on its journey, the model looks at the embedding to figure out the best direction to move.
If we start at different points in the possible-image space, we end up at slightly different destinations. This is determined by the random seed - a number that starts off the initial randomisation.
Different random seeds, same prompt
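In code, "determined by the random seed" is literal: the seed fixes the initial noise exactly. A NumPy sketch (the 32x32x16 latent shape is an assumption for illustration):

```python
import numpy as np

latent_shape = (32, 32, 16)   # hypothetical latent dimensions
a = np.random.default_rng(seed=42).normal(size=latent_shape)
b = np.random.default_rng(seed=42).normal(size=latent_shape)
c = np.random.default_rng(seed=7).normal(size=latent_shape)

same_seed_matches = bool((a == b).all())   # same seed, same starting noise
diff_seed_matches = bool((a == c).all())   # different seed, different start
```

Same seed, same starting point, same destination; change the seed and the whole journey shifts.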
We can choose how many steps we take before stopping. A small number means we have to take big steps, and can end up off track - if we get there at all. But after a certain point, taking more steps doesn't help much, and just wastes time.
Different numbers of inference steps
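One way to see the trade-off: with a simplified linear noise schedule (real schedulers are more sophisticated), fewer steps means each step has to remove far more noise at once:

```python
import numpy as np

def noise_removed_per_step(num_steps):
    """Noise level falls from 1.0 to 0.0; return how much each step removes."""
    levels = np.linspace(1.0, 0.0, num_steps + 1)
    return np.abs(np.diff(levels))

few = noise_removed_per_step(4)    # big, risky jumps
many = noise_removed_per_step(50)  # small, careful steps
```

Either way the total noise removed is the same - the question is just how coarsely you chop up the journey.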
A vague prompt leads to a more wobbly compass. More detailed prompts constrain the direction more tightly, leading to better results.
Variation of prompt detail
A monarch butterfly on a purple coneflower, macro
A monarch butterfly on a purple coneflower, macro close-up, delicate orange and black wing
A monarch butterfly on a purple coneflower, macro close-up, delicate orange and black wing detail, morning dew, soft bokeh
Explore the effect of your prompt on the generated image:
Prompt length explorer
A monarch butterfly on a purple coneflower, macro close-up, delicate orange and black wing detail, morning dew, soft bokeh
Because prompts also exist in their own "embedding space", we can follow a path between any two prompt embeddings, and generate images along the way. These "in-between" points don't correspond to human words, but they exist in the embedding space.
The space between two prompts
A monarch butterfly on a purple coneflower, macro close-up, delicate orange and black wing detail, morning dew, soft bokeh
A snail resting on a moss-covered rock, extreme close-up, shell spiral detail, rain droplets, rich green, soft overcast lighting
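Following a path between two embeddings is just interpolation. Here's a linear version - many pipelines actually use spherical interpolation, but the idea is the same - with toy 2-d vectors standing in for the real prompt embeddings:

```python
import numpy as np

def lerp(a, b, t):
    """Move fraction t of the way from embedding a to embedding b."""
    return (1 - t) * a + t * b

butterfly = np.array([1.0, 0.0])   # toy stand-in for the butterfly prompt
snail     = np.array([0.0, 1.0])   # toy stand-in for the snail prompt

midpoint = lerp(butterfly, snail, 0.5)   # no English sentence maps here,
                                         # but it's still a valid embedding
```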
You can explore this "in-between" space yourself:
A monarch butterfly on a purple coneflower, macro close-up, delicate orange and black wing detail, morning dew, soft bokeh
A snail resting on a moss-covered rock, extreme close-up, shell spiral detail, rain droplets, rich green, soft overcast lighting
The model decides how strongly to follow your prompt using a number called the guidance scale. A higher value gives the prompt a stronger "steering" effect, but if we set it too high we can end up with unnatural, oversaturated images:
Changing the guidance scale
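Under the hood this is usually classifier-free guidance (I'm assuming PRX follows the common recipe here): the model makes two predictions per step, one with the prompt and one without, and the guidance scale exaggerates the difference between them:

```python
import numpy as np

def guided_prediction(uncond, cond, guidance_scale):
    """Classifier-free guidance: push the prediction away from the
    unconditional one, toward (and past) the prompt-conditioned one."""
    return uncond + guidance_scale * (cond - uncond)

uncond = np.array([0.0])   # toy prediction without the prompt
cond = np.array([1.0])     # toy prediction with the prompt

mild = guided_prediction(uncond, cond, 1.0)     # just follow the prompt
strong = guided_prediction(uncond, cond, 15.0)  # overshoots far past it
```

At scale 1.0 you simply follow the prompt-conditioned prediction; at 15.0 you overshoot it by a wide margin, which is where the unnatural, oversaturated look comes from.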
You can explore how the guidance scale affects the image generation process:
Guidance Scale = 1.0
Guidance Scale = 15.0
Imagine you've been plonked in the middle of unfamiliar terrain with an uncertain destination and nothing but a compass to guide you. You come up with the following plan: check the compass, take a step in the direction it points, and repeat until you've used up your step budget.

Diffusion models follow a very similar pattern, guided by these 4 things:

1. The random seed, which picks the starting point.
2. The prompt embedding, which acts as the compass.
3. The number of steps, which decides how big each step has to be.
4. The guidance scale, which controls how strongly to trust the compass.
By combining these, the model navigates from pure chaos to a coherent image that matches your prompt. Here's that journey again, from noise to structure:
From noise to image
A monarch butterfly on a purple coneflower, macro close-up, delicate orange and black wing detail, morning dew, soft bokeh
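Putting all four ingredients together, the whole journey can be sketched as one loop. This is a heavily simplified sampler, not the real PRX one; `predict_noise` and `decode` are hypothetical stand-ins for the trained model and the latent decoder:

```python
import numpy as np

def generate(seed, num_steps, guidance_scale, prompt_embedding,
             predict_noise, decode):
    rng = np.random.default_rng(seed)
    latent = rng.normal(size=(32, 32, 16))   # 1. the seed fixes the starting noise
    for _ in range(num_steps):               # 2. a fixed number of steps
        cond = predict_noise(latent, prompt_embedding)    # 3. the prompt steers,
        uncond = predict_noise(latent, None)
        noise = uncond + guidance_scale * (cond - uncond) # 4. scaled by guidance
        latent = latent - noise / num_steps  # remove a little noise each step
    return decode(latent)                    # decoding happens only at the end

# Dummy stand-ins so the sketch actually runs
image = generate(seed=42, num_steps=10, guidance_scale=7.5,
                 prompt_embedding=None,
                 predict_noise=lambda latent, emb: 0.1 * latent,
                 decode=lambda latent: latent)
```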
An AI model generating an image may look like magic, but now you know it's just navigation through an unimaginably vast space of possibilities.
Which, to be fair, is still super cool.
Thanks to Photoroom, who open-sourced their text-to-image model PRX, allowing me to generate all of these examples. All examples were generated using the Photoroom/prx-256-t2i-sft model. Extra thanks to Jon Almazán for the support and ideas.