Can you encode UTF-8 without branches?
Yes.
In a Recurse chat, Nathan Goldbaum asked:
I know how to decode UTF-8 using bitmath and some LUTs (see https://github.com/skeeto/branchless-utf8), but if I want to to go from a codepoint to UTF-8, is there a way to do it without branches?
To start with, is there a way to write this C function, which returns the number of bytes needed to store the UTF-8 bytes for the codepoint without any branches? Or would I need a huge look-up-table?
int num_utf8_bytes_for_codepoint(uint32_t code) { if (code <= 0x7F) { return 1; } else if (code <= 0x07FF) { return 2; } else if (code <= 0xFFFF) { if ((code >= 0xD800) && (code <= 0xDFFF)) { // surrogates are invalid UCS4 code points return -1; } return 3; } else if (code <= 0x10FFFF) { return 4; } else { // codepoint is outside the valid unicode range return -1; } }
I pondered this but didn’t immediately see a way to do it without a huge (2^32) lookup table.
Until Lorenz pointed out:
Very handwavy idea: encode a 32 bit code point into utf8 but store the result in a 32bit word again. Count the number of leading / trailing zeroes to figure out how many bytes are necessary. Write four bytes into the output buffer but only advance your position in the output by the number of bytes you really need.
Aha!
The number of leading zeros will range from 12 to 321 – a very reasonable size for a lookup table. From there, we could look up other parameters by length (no more than 4).
I fired off a draft into the chat, then came back to test (and fix) it in the evening. When I got the tests passing, it looked like this:
/// Return the number of bytes required to UTF-8 encode a codepoint.
/// Returns 0 for surrogates and out-of-bounds values.
const fn utf8_bytes_for_codepoint(codepoint: u32) -> usize {
let len = LEN[codepoint.leading_zeros() as usize] as usize;
// Handle surrogates via bit-twiddling.
// Rust guarantees true == 1 and false == 0:
let surrogate_bit = ((codepoint >= 0xD800) && (codepoint <= 0xDFFF)) as usize;
// Extend that one bit into three, and use its inverse as a mask for length
let surrogate_mask = surrogate_bit << 2 | surrogate_bit << 1 | surrogate_bit;
// Handle exceeded values via bit-twiddling.
// Unfortunately, these don't align precisely with a leading-zero boundary;
// the largest codepoint is U+10FFFF.
let exceeded_bit = (codepoint > 0x10_FFFF) as usize;
let exceeded_mask = exceeded_bit << 2 | exceeded_bit << 1 | exceeded_bit;
len & !surrogate_mask & !exceeded_mask
}
/// Length, based on the number of leading zeros.
const LEN: [u8; 33] = [
// 0-10 leading zeros: not valid
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
// 11-15 leading zeros: 4 bytes
4, 4, 4, 4, 4,
//16-20 leading zeros: 3 bytes
3, 3, 3, 3, 3,
// 21-24 leading zeros: 2 bytes
2, 2, 2, 2,
// 25-32 leading zeros: 1 byte
1, 1, 1, 1, 1, 1, 1, 1,
];
/// Encode a UTF-8 codepoint.
/// Returns a buffer and the number of valid bytes in the buffer.
///
/// To add this codepoint to a string, append all four bytes in order,
/// and record that (usize) bytes were added to the string.
///
/// Returns a length of zero for invalid codepoints (surrogates and out-of-bounds values).
pub fn branchless_utf8(codepoint: u32) -> ([u8; 4], usize) {
let len = utf8_bytes_for_codepoint(codepoint);
let buf = [
PREFIX[len][0] | ((codepoint >> SHIFT[len][0]) & MASK[len][0] as u32) as u8,
PREFIX[len][1] | ((codepoint >> SHIFT[len][1]) & MASK[len][1] as u32) as u8,
PREFIX[len][2] | ((codepoint >> SHIFT[len][2]) & MASK[len][2] as u32) as u8,
PREFIX[len][3] | ((codepoint >> SHIFT[len][3]) & MASK[len][3] as u32) as u8,
];
(buf, len)
}
type Table = [[u8; 4]; 5];
// Byte prefix for a continuation byte.
const CONTINUE: u8 = 0b1000_0000;
const PREFIX: Table = [
[0u8; 4],
[0, 0, 0, 0],
[0b1100_0000, CONTINUE, 0, 0],
[0b1110_0000, CONTINUE, CONTINUE, 0],
[0b1111_0000, CONTINUE, CONTINUE, CONTINUE],
];
// We must arrange that the most-significant bytes are always in byte 0.
const SHIFT: Table = [
[0u8; 4],
[0, 0, 0, 0],
[6, 0, 0, 0],
[12, 6, 0, 0],
[18, 12, 6, 0],
];
const MASK: Table = [
[0u8; 4],
[0x7f, 0, 0, 0],
[0x1f, 0x3f, 0, 0],
[0x0f, 0x3f, 0x3f, 0],
[0x07, 0x3f, 0x3f, 0x3f],
];
No if statements, loops, or other conditionals. So, branchless, right?
…well, no. If we peek at the (optimized) code in Compiler Explorer, we can see the x86_64 assembly has two different kinds of branches.
There’s a branch right at the start of the function:
test esi, esi
je .LBB0_1
bsr eax, esi
xor eax, 31
jmp .LBB0_3
.LBB0_1:
mov eax, 32
.LBB0_3:
mov eax, eax
I wasn’t sure what this was about until I stepped through it.
The “special” case seems to be when the input (esi
) is zero;
then it returns 32.
Why the special case? Compiler Explorer’s tooltip for the bsr
instruction says:
If the content source operand is 0, the content of the destination operand is undefined.
So on x86_64 processors, we have to branch to say “a 32-bit zero value has 32 leading zeros”. Put differently, the “count leading zeros” intrinsic isn’t necessarily a branchless instruction. This might look nicer on another architecture!
The other jump seems to be a conflation of the several array-bounds checks.
cmp eax, 4
ja .LBB0_5
...
LBB0_5:
lea rdx, [rip + .L__unnamed_5]
mov esi, 5
mov rdi, rax
call qword ptr [rip + core::panicking::panic_bounds_check::h8307ccead484a122@GOTPCREL]
All of the jump arrays have the same bound (4), so the compiler can decide to only check once – and still get Rust’s famous safety guarantees.
In principle, if the compiler optimized through the LEN
table,
it could eliminate this check as well; the LEN
value is never
greater than 4, which is a valid index for all tables.
But apparently the constants don’t propagate that far.
Changing the code and dropping to unsafe array accesses eliminates the array bounds check. But still, there’s still the count-leading-zeros branch at the start. Can we get rid of that?
Let’s take another look at a bit of the code – specifically, how we handle out-of-bounds values:
let exceeded_bit = (codepoint > 0x10_FFFF) as usize;
The trick I pulled here was to cast a boolean (true or false) to an integer (1 or 0). Rust’s semantics guarantee this conversion is safe, and it happens to be a representation the hardware can work with; it doesn’t appear to incur a conditional after compilation.2
I used these booleans-as-integers to perform masking to zero. But you know what else we can do with integers?
Addition.
We can get rid of all the branches by tweaking the length-computing function:
const fn utf8_bytes_for_codepoint(codepoint: u32) -> usize {
let mut len = 1;
// In Rust, true casts to 1 and false to 0, so we can "just" sum lengths.
len += (codepoint > 0x7f) as usize;
len += (codepoint > 0x7ff) as usize;
len += (codepoint > 0xffff) as usize;
// As before:
let surrogate_bit = ((codepoint >= 0xD800) && (codepoint <= 0xDFFF)) as usize;
let surrogate_mask = surrogate_bit << 2 | surrogate_bit << 1 | surrogate_bit;
let exceeded_bit = (codepoint > 0x10_FFFF) as usize;
let exceeded_mask = exceeded_bit << 2 | exceeded_bit << 1 | exceeded_bit;
len & !surrogate_mask & !exceeded_mask
}
This is the answer to Nathan’s original question, about working out the number of bytes. Compiler explorer confirms that, with optimizations enabled, this function is branchless.
Happily, this transformation also allowed the compiler to realize len <= 4
on all paths,
and to statically eliminate the array bounds check.
That means the full code is branchless as well. Victory!
While this is branchless, I make absolutely no claim that it is optimized – my only goal here was a proof-of-concept of branchlessness. I haven’t even benchmarked it!
Chris Wellons notes in his post about branchless decoding that a DFA-based decoder can have similar performance; SIMD and other “use what the hardware gives you” techniques are probably even better. I wouldn’t bet on my encoder over the one in your favorite standard library.
I also make no claims of usefulness. But you’re welcome to do just about anything with the code: I hereby release it under the MIT license. The full code is here, along with the tests I used to match it against Rust’s implementation.
Thanks Nathan for the question and Lorenz for the insights! Any mistakes remaining are my own – give me a shout if you spot them!