James Routley

Can you encode UTF-8 without branches?

The question

In a Recurse chat, Nathan Goldbaum asked:

I know how to decode UTF-8 using bitmath and some LUTs (see https://github.com/skeeto/branchless-utf8), but if I want to to go from a codepoint to UTF-8, is there a way to do it without branches?

To start with, is there a way to write this C function, which returns the number of bytes needed to store the UTF-8 bytes for the codepoint without any branches? Or would I need a huge look-up-table?

The C function

int
num_utf8_bytes_for_codepoint(uint32_t code)
{
    if (code <= 0x7F) {
        return 1;
    }
    else if (code <= 0x07FF) {
        return 2;
    }
    else if (code <= 0xFFFF) {
        if ((code >= 0xD800) && (code <= 0xDFFF)) {
            // surrogates are invalid UCS4 code points
            return -1;
        }
        return 3;
        }
    else if (code <= 0x10FFFF) {
        return 4;
    }
    else {
        // codepoint is outside the valid unicode range
        return -1;
    }
}

I pondered this but didn’t immediately see a way to do it without a huge (2^32) lookup table.

The almost answer

Until Lorenz pointed out:

Very handwavy idea: encode a 32 bit code point into utf8 but store the result in a 32bit word again. Count the number of leading / trailing zeroes to figure out how many bytes are necessary. Write four bytes into the output buffer but only advance your position in the output by the number of bytes you really need.

Aha!

The number of leading zeros will range from 12 to 32¹ – a very reasonable size for a lookup table. From there, we could look up other parameters by length (no more than 4).

I fired off a draft into the chat, then came back to test (and fix) it in the evening. When I got the tests passing, it looked like this:

/// Return the number of bytes required to UTF-8 encode a codepoint.
/// Returns 0 for surrogates and out-of-bounds values.
const fn utf8_bytes_for_codepoint(codepoint: u32) -> usize {
    let len = LEN[codepoint.leading_zeros() as usize] as usize;

    // Handle surrogates via bit-twiddling.
    // Rust guarantees true == 1 and false == 0:
    let surrogate_bit = ((codepoint >= 0xD800) && (codepoint <= 0xDFFF)) as usize;
    // Extend that one bit into three, and use its inverse as a mask for length
    let surrogate_mask = surrogate_bit << 2 | surrogate_bit << 1 | surrogate_bit;

    // Handle exceeded values via bit-twiddling.
    // Unfortunately, these don't align precisely with a leading-zero boundary;
    // the largest codepoint is U+10FFFF.
    let exceeded_bit = (codepoint > 0x10_FFFF) as usize;
    let exceeded_mask = exceeded_bit << 2 | exceeded_bit << 1 | exceeded_bit;

    len & !surrogate_mask & !exceeded_mask
}

/// Length, based on the number of leading zeros.
const LEN: [u8; 33] = [
    // 0-10 leading zeros: not valid
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    // 11-15 leading zeros: 4 bytes
    4, 4, 4, 4, 4,
    //16-20 leading zeros: 3 bytes
    3, 3, 3, 3, 3,
    // 21-24 leading zeros: 2 bytes
    2, 2, 2, 2,
    // 25-32 leading zeros: 1 byte
    1, 1, 1, 1, 1, 1, 1, 1,
];



/// Encode a UTF-8 codepoint.
/// Returns a buffer and the number of valid bytes in the buffer.
///
/// To add this codepoint to a string, append all four bytes in order,
/// and record that (usize) bytes were added to the string.
///
/// Returns a length of zero for invalid codepoints (surrogates and out-of-bounds values).
pub fn branchless_utf8(codepoint: u32) -> ([u8; 4], usize) {
    let len = utf8_bytes_for_codepoint(codepoint);
    let buf = [
        PREFIX[len][0] | ((codepoint >> SHIFT[len][0]) & MASK[len][0] as u32) as u8,
        PREFIX[len][1] | ((codepoint >> SHIFT[len][1]) & MASK[len][1] as u32) as u8,
        PREFIX[len][2] | ((codepoint >> SHIFT[len][2]) & MASK[len][2] as u32) as u8,
        PREFIX[len][3] | ((codepoint >> SHIFT[len][3]) & MASK[len][3] as u32) as u8,
    ];

    (buf, len)
}

type Table = [[u8; 4]; 5];

// Byte prefix for a continuation byte.
const CONTINUE: u8 = 0b1000_0000;
const PREFIX: Table = [
    [0u8; 4],
    [0, 0, 0, 0],
    [0b1100_0000, CONTINUE, 0, 0],
    [0b1110_0000, CONTINUE, CONTINUE, 0],
    [0b1111_0000, CONTINUE, CONTINUE, CONTINUE],
];

// We must arrange that the most-significant bytes are always in byte 0.
const SHIFT: Table = [
    [0u8; 4],
    [0, 0, 0, 0],
    [6, 0, 0, 0],
    [12, 6, 0, 0],
    [18, 12, 6, 0],
];

const MASK: Table = [
    [0u8; 4],
    [0x7f, 0, 0, 0],
    [0x1f, 0x3f, 0, 0],
    [0x0f, 0x3f, 0x3f, 0],
    [0x07, 0x3f, 0x3f, 0x3f],
];

The branches

No if statements, loops, or other conditionals. So, branchless, right?

…well, no. If we peek at the (optimized) code in Compiler Explorer, we can see the x86_64 assembly has two different kinds of branches.

Count leading zeros

There’s a branch right at the start of the function:

            test    esi, esi
            je      .LBB0_1
            bsr     eax, esi
            xor     eax, 31
            jmp     .LBB0_3
    .LBB0_1:
            mov     eax, 32
    .LBB0_3:
            mov     eax, eax

I wasn’t sure what this was about until I stepped through it. The “special” case seems to be when the input (esi) is zero; then it returns 32.

Why the special case? Compiler Explorer’s tooltip for the bsr instruction says:

If the content source operand is 0, the content of the destination operand is undefined.

So on x86_64 processors, we have to branch to say “a 32-bit zero value has 32 leading zeros”. Put differently, the “count leading zeros” intrinsic isn’t necessarily a branchless instruction. This might look nicer on another architecture!

Bounds checks

The other jump seems to be a conflation of the several array-bounds checks.

        cmp     eax, 4
        ja      .LBB0_5
        ...
LBB0_5:
        lea     rdx, [rip + .L__unnamed_5]
        mov     esi, 5
        mov     rdi, rax
        call    qword ptr [rip + core::panicking::panic_bounds_check::h8307ccead484a122@GOTPCREL]

All of the jump arrays have the same bound (4), so the compiler can decide to only check once – and still get Rust’s famous safety guarantees.

In principle, if the compiler optimized through the LEN table, it could eliminate this check as well; the LEN value is never greater than 4, which is a valid index for all tables. But apparently the constants don’t propagate that far.

Eliminating branching

Changing the code and dropping to unsafe array accesses eliminates the array bounds check. But still, there’s still the count-leading-zeros branch at the start. Can we get rid of that?

Let’s take another look at a bit of the code – specifically, how we handle out-of-bounds values:

let exceeded_bit = (codepoint > 0x10_FFFF) as usize;

The trick I pulled here was to cast a boolean (true or false) to an integer (1 or 0). Rust’s semantics guarantee this conversion is safe, and it happens to be a representation the hardware can work with; it doesn’t appear to incur a conditional after compilation.²

I used these booleans-as-integers to perform masking to zero. But you know what else we can do with integers?

Addition.

The answer

We can get rid of all the branches by tweaking the length-computing function:

const fn utf8_bytes_for_codepoint(codepoint: u32) -> usize {
    let mut len = 1;
    // In Rust, true casts to 1 and false to 0, so we can "just" sum lengths.
    len += (codepoint > 0x7f) as usize;
    len += (codepoint > 0x7ff) as usize;
    len += (codepoint > 0xffff) as usize;

    // As before:
    let surrogate_bit = ((codepoint >= 0xD800) && (codepoint <= 0xDFFF)) as usize;
    let surrogate_mask = surrogate_bit << 2 | surrogate_bit << 1 | surrogate_bit;
    let exceeded_bit = (codepoint > 0x10_FFFF) as usize;
    let exceeded_mask = exceeded_bit << 2 | exceeded_bit << 1 | exceeded_bit;

    len & !surrogate_mask & !exceeded_mask
}

This is the answer to Nathan’s original question, about working out the number of bytes. Compiler explorer confirms that, with optimizations enabled, this function is branchless.

Happily, this transformation also allowed the compiler to realize len <= 4 on all paths, and to statically eliminate the array bounds check. That means the full code is branchless as well. Victory!

The caveats

While this is branchless, I make absolutely no claim that it is optimized – my only goal here was a proof-of-concept of branchlessness. I haven’t even benchmarked it!

Chris Wellons notes in his post about branchless decoding that a DFA-based decoder can have similar performance; SIMD and other “use what the hardware gives you” techniques are probably even better. I wouldn’t bet on my encoder over the one in your favorite standard library.

I also make no claims of usefulness. But you’re welcome to do just about anything with the code: I hereby release it under the MIT license. The full code is here, along with the tests I used to match it against Rust’s implementation.

Thanks!

Thanks Nathan for the question and Lorenz for the insights! Any mistakes remaining are my own – give me a shout if you spot them!

Branchless UTF-8 Encoding