Demonstration
I made a Game Boy Color game that renders images in real time. The player controls an orbiting light and spins an object.
Play it here
Check out the code, download the ROMs
https://github.com/nukep/gbshader
3D Workflow
Early lookdev
Before really diving into this project, I experimented with the look in Blender to see if it would even look good. IMO it did, so I went ahead with it!
I experimented with a "pseudo-dither" on the Blender monkey by adding a small random vector to each normal.


Blender to normal map workflow
tl;dr: Cryptomattes and custom shaders to adjust normal maps
It doesn't really matter what software I used to produce the normal maps. Blender was the path of least resistance for me, so I chose that.
For the teapot, I simply put in a teapot, rotated a camera around it, and exported the normal AOV as a PNG sequence. Pretty straightforward.
For the spinning Game Boy Color, I wanted to ensure that certain colors were solid, so I used cryptomattes in the compositor to identify specific geometry and write hard-coded values into the output.
The geometry in the screen was done by rendering a separate scene, then compositing it in the final render using a cryptomatte for the screen.

The Math
Normal Maps

The above animations are normal map frames that are used to compute the value of each pixel
Normal maps are a core concept of this project. They're already used everywhere in 3D graphics.
And indeed, normal map images are secretly a vector field. The reason normal maps tend to have a blue-ish baseline color is that everyone likes to associate XYZ with RGB, and +Z is the forward vector by convention.
In a typical 3D workflow, a normal map is used to encode the normal vector at any given point on a textured mesh.
Source: Own work (Danny Spencer). Suzanne model (c) Blender Foundation.
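For concreteness, decoding a normal map texel is just a linear remap from RGB in $[0, 1]$ to a vector in $[-1, 1]$. A minimal sketch of the standard convention (not code from this project):

```python
def decode_normal(r: float, g: float, b: float) -> tuple[float, float, float]:
    """Remap RGB channels in [0, 1] to normal components in [-1, 1]."""
    return (r * 2.0 - 1.0, g * 2.0 - 1.0, b * 2.0 - 1.0)

# The blue-ish baseline (0.5, 0.5, 1.0) decodes to (0, 0, 1): straight along +Z.
print(decode_normal(0.5, 0.5, 1.0))  # (0.0, 0.0, 1.0)
```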
Calculating a Lambert shader using dot products
The simplest way to shade a 3D object is using the dot product:

$$v = N \cdot L$$

where $N$ is the normal vector, and $L$ is the light position when it points towards the origin (or equivalently: the negative light direction).

Expanded out component-wise, this is:

$$v = N_x L_x + N_y L_y + N_z L_z$$
When the light vector is constant for all pixels, it models what most 3D graphics software calls a "distant light", or a "sun light".
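As a plain floating-point sketch (nothing Game Boy specific yet, and a hypothetical helper rather than project code), the whole per-pixel shader is one dot product:

```python
def lambert(n: tuple[float, float, float], l: tuple[float, float, float]) -> float:
    """Lambert shading: dot product of the unit normal and unit light vector."""
    v = n[0] * l[0] + n[1] * l[1] + n[2] * l[2]
    return max(0.0, v)  # surfaces facing away from the light get 0, not negative
```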
Spherical Coordinates
To speed up computation on the Game Boy, I use an alternate version of the dot product, using spherical coordinates.
A spherical coordinate is a point represented by a radius $r$, a primary angle $\theta$ ("theta"), and a secondary angle $\phi$ ("phi"). This is represented as a tuple: $(r, \theta, \phi)$

The dot product of two spherical coordinates $(r_1, \theta_1, \phi_1)$ and $(r_2, \theta_2, \phi_2)$:

$$A \cdot B = r_1 r_2 \left( \sin\theta_1 \sin\theta_2 \cos(\phi_1 - \phi_2) + \cos\theta_1 \cos\theta_2 \right)$$

Because all normal vectors are unit length, and the light vector is unit length, we can just assume the radius is equal to 1. This simplifies to:

$$A \cdot B = \sin\theta_1 \sin\theta_2 \cos(\phi_1 - \phi_2) + \cos\theta_1 \cos\theta_2$$

And using the previous variable names, we get the formula:

$$v = \sin\theta_N \sin\theta_L \cos(\phi_N - \phi_L) + \cos\theta_N \cos\theta_L$$
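A quick, throwaway sanity check (not project code) that the spherical form agrees with the Cartesian dot product:

```python
import math

def to_cartesian(theta: float, phi: float) -> tuple[float, float, float]:
    # Unit vector: theta measured from +Z, phi is the azimuth around Z
    return (math.sin(theta) * math.cos(phi),
            math.sin(theta) * math.sin(phi),
            math.cos(theta))

theta_n, phi_n = 0.7, 1.2
theta_l, phi_l = 2.1, -0.4

n = to_cartesian(theta_n, phi_n)
l = to_cartesian(theta_l, phi_l)

cartesian = n[0] * l[0] + n[1] * l[1] + n[2] * l[2]
spherical = (math.sin(theta_n) * math.sin(theta_l) * math.cos(phi_n - phi_l)
             + math.cos(theta_n) * math.cos(theta_l))

assert abs(cartesian - spherical) < 1e-12
```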
Making it work on the Game Boy
Encoding normal maps in the Game Boy ROM
In the ROM, I decided to fix $\theta_L$ ("L-theta") to a constant value for performance reasons. The player gets to control $\phi_L$ ("L-phi"), creating an orbiting light effect.
This means that we can extract constant coefficients $a = \sin\theta_L$ and $b = \cos\theta_L$ and rewrite the formula:

$$v = a \sin\theta_N \cos(\phi_N - \phi_L) + b \cos\theta_N$$

The ROM encodes each pixel as a 3-byte tuple of $(\phi_N,\ \log(a \sin\theta_N),\ b \cos\theta_N)$.
Why the $\log$? Well...
The Game Boy has no multiply instruction
Not only does the SM83 CPU not support multiplication, but it also doesn't support floats. That's a real bummer.
We have to get really creative when the entire mathematical foundation of this project involves multiplying non-integer numbers.
What do we do instead? We use logarithms and lookup tables!
Logarithms have this nice property of being able to factor products to outside the $\log$: $\log(x \cdot y) = \log(x) + \log(y)$. This way, we can add values instead!
This requires two lookups: a log lookup, and a pow lookup.
In pseudocode, multiplying 0.3 and 0.5 looks like this:
```python
pow = [ ... ]  # lookup table: log-space byte -> linear-space value

x = float_to_logspace(0.3)  # log lookup
y = float_to_logspace(0.5)  # log lookup
result = pow[x + y]         # pow lookup; result is approximately 0.3 * 0.5 = 0.15
```
One limitation of this is that it's not possible to take the log of a negative number. e.g. $\log(-0.5)$ has no real solution.
We can overcome this by encoding a "sign" bit in the MSB of the log-space value. When adding two log-space values together, the sign bit is effectively XOR'd (toggled). We just need to ensure the remaining bits don't overflow into it. We ensure this by keeping the remaining bits small enough.
The pow lookup accounts for this bit and returns a positive or negative result based on it.
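In Python terms, the trick looks something like this (a sketch with hypothetical names; `pow_magnitude` stands in for the positive half of the pow table):

```python
SIGN_BIT = 0x80

def mul_logspace(x: int, y: int) -> int:
    """Multiply two log-space bytes by adding them.
    As long as the low 7 bits can't carry into bit 7, plain addition
    XORs the two sign bits for free."""
    return (x + y) & 0xFF

def pow_lookup(b: int, pow_magnitude: list[float]) -> float:
    """Decode a log-space byte back to a signed linear value."""
    value = pow_magnitude[b & 0x7F]
    return -value if b & SIGN_BIT else value
```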
All scalars and lookups are 8-bit fractions
It's advantageous to restrict numbers to a single byte, for both run-time performance and ROM size. 8-bit fractions are pretty extreme by today's standards, but believe it or not, it works. It's lossy as hell, but it works!
All scalars we're working with are between -1.0 and +1.0.
| Byte | Resolved linear-space value | Resolved log-space value |
|---|---|---|
| 0 | $0/127 = 0.0$ | $2^{-0/6} = 1.0$ |
| 1 | $1/127$ | $2^{-1/6}$ |
| 2 | $2/127$ | $2^{-2/6}$ |
| ... | ... | ... |
| 126 | $126/127$ | $2^{-126/6}$ |
| 127 | $127/127 = 1.0$ | $2^{-127/6}$ |
| 128 | undefined | $-2^{-0/6} = -1.0$ |
| 129 | $-127/127 = -1.0$ | $-2^{-1/6}$ |
| 130 | $-126/127$ | $-2^{-2/6}$ |
| ... | ... | ... |
| 254 | $-2/127$ | $-2^{-126/6}$ |
| 255 | $-1/127$ | $-2^{-127/6}$ |
Addition and multiplication both use... addition!
Consider adding the two bytes: 5 + 10 = 15
- Addition uses linear-space values: $\frac{5}{127} + \frac{10}{127} = \frac{15}{127}$
- Multiplication uses log-space values: $2^{-5/6} \cdot 2^{-10/6} = 2^{-15/6}$
Why is the denominator 127 instead of 128? It's because I needed to represent both positive and negative 1. In a two's-complement encoding, signed positive 128 doesn't exist.
You might notice that the log-space values cycle and become negative at byte 128. The log-space values use bit 7 of the byte to encode the "sign" bit. As mentioned in the previous section, this is important for toggling the sign during multiplication.
The log-space values also use $2^{-1/6}$ as a base, because I chose this as a sufficiently small base to meet the requirement that adding 3 of these log-space values won't overflow (42 + 42 + 42 = 126, and $(2^{-1/6})^{42} = 2^{-7}$, just below the smallest nonzero magnitude $1/127$). Bytes 43 thru 127 are near 0, so in practice the ROM doesn't encode these values.
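A rough sketch of how such tables could be generated, assuming the $2^{-1/6}$ base above (the ROM's real tables are precomputed and may round differently):

```python
import math

BASE = 2 ** (-1 / 6)  # 42 steps reach 2**-7, just under the smallest magnitude 1/127

# pow: log-space magnitude (low 7 bits) -> linear-space byte
pow_table = [round(127 * BASE ** b) for b in range(128)]

def float_to_logspace(x: float) -> int:
    """Encode a real number in [-1, 1] as a sign bit plus log-space magnitude."""
    sign = 0x80 if x < 0 else 0x00
    mag = min(abs(x), 1.0)
    if mag == 0.0:
        return sign | 127  # smallest representable magnitude stands in for zero
    steps = round(math.log(mag) / math.log(BASE))
    return sign | min(steps, 127)
```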
The lookup tables look like this:

Where:
- float_to_logspace takes a real number and returns an unsigned byte.
- pow takes an unsigned byte and returns a real number. And: $\mathrm{pow}[\mathrm{float\_to\_logspace}(x)] \approx x$
Reconstructed functions look like this. The precision error is shown in the jagged "staircase" patterns:

It may look like there's a lot of error, but it's fast and it's passable enough to look alright! ;)
What's with cos_log?
It's basically a combined $\cos$ and log lookup: $\mathrm{cos\_log}[x] = \mathrm{float\_to\_logspace}(\cos(x))$. This exists because in practice, cosine is always used with a multiplication.
The core calculation for the shader is:

$$v = a \sin\theta_N \cos(\phi_N - \phi_L) + b \cos\theta_N$$

And we can rewrite it as:

$$v = \mathrm{pow}\big[\mathrm{cos\_log}[\phi_N - \phi_L] + \mathrm{float\_to\_logspace}(a \sin\theta_N)\big] + b \cos\theta_N$$
This amounts to, per-pixel:
- 1 subtraction
- 1 lookup to cos_log
- 1 addition
- 1 lookup to pow
- 1 addition
For a total of, per-pixel:
- 3 additions/subtractions
- 2 lookups
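Strung together in Python pseudocode (hypothetical names; angles stored as bytes where 256 steps = one full turn, and pow_table here is assumed to cover all 256 bytes, sign included):

```python
def shade_pixel(phi_n: int, log_a_sin_theta_n: int, b_cos_theta_n: int,
                l_phi: int, cos_log: list[int], pow_table: list[int]) -> int:
    diff = (phi_n - l_phi) & 0xFF                        # 1 subtraction
    log_term = cos_log[diff]                             # 1 lookup
    log_product = (log_term + log_a_sin_theta_n) & 0xFF  # 1 addition (log-space multiply)
    linear_term = pow_table[log_product]                 # 1 lookup (back to linear space)
    return linear_term + b_cos_theta_n                   # 1 addition (clamping omitted)
```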
How fast is it?
The procedure processes 15 tiles per frame. It can process more if some of the tiles' rows are empty (all 0), but it's guaranteed to process at least 15.

Figure: Mesen's "Event Viewer" window, showing a dot for each iteration (tile row) of the shader's critical loop.
There's some intentional visual tearing as well. The image itself is more than 15 tiles, so the ROM actually switches to rendering different portions of the image for each frame. The tearing is less noticeable because of ghosting on the LCD display, so I thought it was acceptable.
A pixel takes about 130 cycles, and an empty row's pixel takes about 3 cycles.
At one point I had calculated 15 tiles rendering at exactly 123,972 cycles, including the call and branch overhead. This is an overestimate now, because I since added an optimization for empty rows.
The Game Boy Color's CPU runs up to 8.388608 MHz, or roughly 139,810 T-cycles per frame (1/60 of a second).
About 89% of a frame's available CPU time goes to rendering the 15 tiles per frame. The remaining time goes to other functionality like responding to user input and performing hardware IO.
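That figure falls out of the numbers above:

$$\frac{123{,}972 \text{ cycles}}{139{,}810 \text{ cycles/frame}} \approx 0.887 \approx 89\%$$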
Self-modifying code

Figure: A hex representation of the shader subroutine instructions in RAM. The blue digits show a patch to change sub a, 0 into sub a, 8.
The core shader subroutine contains a hot path that processes about 960 pixels per frame. It's really important to make this as fast as possible!
Self-modifying code is a super-effective way to make code fast. But most modern developers don't do this anymore, and there are good reasons: It's difficult, rarely portable, and it's hard to do it right without introducing serious security vulnerabilities. Modern developers are spoiled by an abundance of processing power, super-scalar processors that take optimal paths, and modern JIT (Just-In-Time) runtimes that generate code on the fly. But we're on the Game Boy, baybeee, so we don't have those options.
If you're a developer who uses higher-level languages like Python and JavaScript, the closest equivalent to self-modifying code is eval(). Think about how nervous eval() makes you feel. That's almost exactly how native developers feel about modifying instructions.
On the Game Boy's SM83 processor, it's faster to add and subtract by a hard-coded number than it is to load that number from memory.
i.e. x += 5 is faster than x += variable.
```c
unsigned char Ltheta = 8;

v = (*in++) - Ltheta;  /* slower: subtracts a value loaded from memory */
v = (*in++) - 8;       /* faster: subtracts a constant baked into the instruction */
```
In SM83 assembly, this looks like:
```asm
; Slower: 28 cycles
ld a, [Ltheta]  ; 12 cycles: Read variable "Ltheta" from HRAM
ld b, a         ;  4 cycles: Move value to B register
ld a, [hl+]     ;  8 cycles: Read from the HL pointer
sub a, b        ;  4 cycles: A = A - B

; Faster: 16 cycles
ld a, [hl+]     ;  8 cycles: Read from the HL pointer
sub a, 8        ;  8 cycles: A = A - 8
```
The faster way shaves off 12 cycles. If we're rendering 960 pixels, this saves a total of 11,520 cycles. This doesn't sound like a lot, but it's roughly 10% of the shader's runtime!
So how can we get the faster subtraction if the value we're subtracting with changes?
By modifying the instruction operand!
```asm
2A     ld a, [hl+]
D6 08  sub a, 8
```
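Illustrated in Python for the sake of a sketch (the real patch happens in Game Boy RAM, where the shader subroutine was copied): the operand of `sub a, n` is the byte right after the `D6` opcode, so changing the light angle means overwriting one byte.

```python
# The RAM-resident routine, as raw bytes:
shader_ram = bytearray([
    0x2A,        # ld a, [hl+]
    0xD6, 0x08,  # sub a, 8
])

SUB_OPERAND = 2  # offset of the "8" in "sub a, 8"

def set_light_phi(l_phi: int) -> None:
    """The next run of the routine executes `sub a, l_phi` instead."""
    shader_ram[SUB_OPERAND] = l_phi & 0xFF
```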
An overall failed attempt at using AI
"AI Will Be Writing 90% of Code in 3 to 6 Months"
95% of this project was made by hand. Large language models struggle to write Game Boy assembly. I don't blame them.
Update: 2026-02-03: I attempted to use AI to try out the process, mostly because 1) the industry won't shut up about AI, and 2) I wanted a grounded opinion of it for novel projects, so I have a concrete and personal reference point when talking about it in the wild. At the end of the day, this is still a hobbyist project, so AI really isn't the point! But still...
I believe in disclosing all attempts or actual uses of generative AI output, because I think it's unethical to deceive people about the process of your work. Not doing so undermines trust, and amounts to disinformation or plagiarism. Disclosure also invites people who have disagreements to engage with the work, which they should be able to. I'm open to feedback, btw.
I'll probably write something about my experiences with AI in the future.
As far as disclosures go, I used AI for:
- Python: Reading OpenEXR layers, as part of a conversion script to read normal map data
- Python/Blender: Some Python scripts for populating Blender scenes, to demo the process in Blender
- SM83 assembly: Snippets for Game Boy Color features like double-speed and VRAM DMA. Unsurprising, because these are likely available somewhere else.
I attempted - and failed - to use AI for:
- SM83 assembly: (Unused) Generating an initial revision of the shader code
I'll also choose to disclose what I did NOT use AI for:
- Writing this article
- The algorithms, lookups, all other SM83 assembly
- 3D assets
- The soul 🌟 (AI techbros are groaning right now)
I tried to make AI write Game Boy assembly
Just to see what it would do, I fed pseudocode into Claude Sonnet 4 (the industry claims that it's the best AI model for coding in 2025), and got it to generate SM83 assembly:
https://claude.ai/share/846cb7d4-e4a6-40ab-8aaa-6e4c308e3da3
It was an interesting process. To start, I chewed Claude's food and gave it pseudocode, because I had a data format in mind, and I assumed it'd struggle with a higher-level description.
I expected it to do poorly, but it did better than I thought it would. It even produced code that worked once I persisted and guided it enough. However, it wasn't very fast, and it made some initial mistakes by assuming the SM83 processor was the Z80 processor. I attempted to get Claude to optimize it by offering suggestions. It did well initially, but it kept introducing errors until I reached the conversation limit.
After that point, I manually rewrote everything. My final implementation is aggressively optimized and barely has any resemblance to Claude's take.
And it loved telling me how "absolutely right" I always was. 🥺

It was better for small tasks and snippets of code. The tile demo in my video was partially AI scripted. A Game Boy subroutine for copying to VRAM was authored by AI. Few issues there.
An early iteration of the normal map conversion script accepted OpenEXR files. I didn't feel like drudging through a new library, so I asked ChatGPT to convert an OpenEXR file to a numpy array. It did pretty well! It however also introduced a very subtle bug that I didn't catch for weeks. Once I finally read the code, I realized it was sorting channel names alphabetically (so XYZ sorts as XYZ, but RGB sorts as BGR). It's the sort of error I'd never make myself.
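The bug in miniature:

```python
print(sorted("XYZ"))  # ['X', 'Y', 'Z'] - happens to stay in order
print(sorted("RGB"))  # ['B', 'G', 'R'] - alphabetical order swaps R and B
```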
Update: 2026-02-03 - Yeah, so the OpenEXR code could've been done in two lines this whole time. One of the first examples in the official PyPI readme shows how to get a numpy array from an OpenEXR file - exactly what I needed. I could update this snippet for different channels too in theory, but basically it's this. ChatGPT gave me 30 lines to handle edge cases that simply won't happen.
```python
import OpenEXR

with OpenEXR.File("readme.exr") as infile:
    RGB = infile.channels()["RGB"].pixels
```
At this point, I can't emphasize the word *verifiable* enough.
This, and other experiences, made me realize how easy it is to let your guard down when using AI like this, even if you're an experienced coder. AI can be helpful, but discretion is very much a required skill. I'm just thankful I never relied on it for installing hallucinated packages.
If you like this, share this post or like and comment on the YouTube video!
https://www.youtube.com/watch?v=SAQXEW3ePwo
(post will be updated once I post on Bluesky)