Building Your Own Efficient uint128 in C++

TL;DR:

We build a minimal u128 as two u64 limbs and implement arithmetic using carry, borrow, and multiply intrinsics that map directly to x64 instructions. The generated code is on par with builtin __uint128_t for addition, subtraction, multiplication, and comparison. This is unsigned-only, x64-focused, and intentionally narrow in scope. The result is a solid foundation for exact, fixed-width arithmetic with a focus on good codegen and predictability, not abstraction.

Full code and compiler output: https://godbolt.org/z/K6dn3s91Y

Scope

We take the smallest reasonable definition of a 128-bit integer, two 64-bit words, and turn it into a usable arithmetic type whose generated code is indistinguishable from a builtin __uint128_t.

This post is explicitly not about dynamically-sized big integer arithmetic. It is about being explicit with range bounds and letting the compiler emit the exact instructions we want. The scope is deliberately limited: unsigned arithmetic, fixed width, modern x64, with Clang and GCC as the primary targets and notes for MSVC where it differs.

Why fixed-width big integers

In many domains, especially geometry and numerics, we do not need arbitrary precision. We need enough precision to be exact for known bounds, and we need the cost to be predictable.

Dynamic big integer libraries solve a different problem. They are flexible and general, but they pay for that generality in memory traffic, branches, and indirection. If your values fit into a fixed number of bits and you know that ahead of time, fixed-width arithmetic is usually the better trade. (In fact, our high-performance exact mesh booleans are completely built on this: Exact Arithmetic in Solidean)

A 128-bit integer is the gateway drug to fixed-width arithmetic. It is the smallest width that is no longer builtin, while still mapping cleanly to the underlying hardware. Once the carry and multiply patterns are explicit at 128 bits, extending them to 192 or 256 bits is straightforward. In production, we use 256-bit integers in our hot paths and go up to 564 bits for certain edge cases.

Representation

We represent a 128-bit unsigned integer as two 64-bit limbs. You can literally think of this as writing the u128 as a 2-digit number in base \(2^{64}\). Because it's unsigned, we don't need to think about two's complement.

#include <cstdint>
#include <immintrin.h> // _addcarry_u64, _subborrow_u64, _mulx_u64

// this is the type used in the intrinsics signature
// (and uint64_t is unsigned long, not unsigned long long...)
using u64 = unsigned long long;

struct u128
{
    u64 low = 0;
    u64 high = 0;
};

Addition, with carry

We start off easy. Addition is simply done using long addition on our base \(2^{64}\) digits. The intrinsic _addcarry_u64 corresponds to the x64 instruction adc and is exactly what we need: Given two u64 summands and an input carry (0 or 1), we get the u64 result (via a slightly cumbersome output parameter) and a new carry.

u128 operator+(u128 a, u128 b) 
{
    u128 r;
    unsigned char c = _addcarry_u64(0, a.low,  b.low,  &r.low);
    (void)_addcarry_u64(c, a.high, b.high, &r.high);
    return r;
}

The generated assembly is exactly what you would write by hand.

operator+(u128, u128):
        mov     rax, rdi
        add     rax, rdx
        adc     rsi, rcx
        mov     rdx, rsi
        ret

The moves are just calling convention noise. The core is an add (because the first addition has no input carry) followed by an adc. This is identical to what the compiler emits for __uint128_t addition.

A small but important point is that intrinsics are preferable to inline assembly here. They act like specialized IR operations: the compiler understands their semantics, can schedule around them, and can still optimize aggressively. Inline assembly is more of a black box, much easier to get wrong, and can inhibit optimizations.

Subtraction: same story, inverted

Subtraction mirrors addition almost perfectly. Instead of a carry, we track a borrow. On x64, this is sbb, subtract with borrow. The corresponding intrinsic is _subborrow_u64.

u128 operator-(u128 a, u128 b) 
{
    u128 r;
    unsigned char c = _subborrow_u64(0, a.low,  b.low,  &r.low);
    (void)_subborrow_u64(c, a.high, b.high, &r.high);
    return r;
}

And again, the assembly is exactly what we want.

operator-(u128, u128):
        mov     rax, rdi
        sub     rax, rdx
        sbb     rsi, rcx
        mov     rdx, rsi
        ret

At this point, addition and subtraction are basically solved. There is no hidden cost and no abstraction penalty. It's also easy to see how this scales to larger integer types with one extra instruction per u64 "digit".
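
To make the scaling concrete, here is a minimal sketch of addition for a hypothetical 4-limb u256 (the type and its layout are my own, for illustration, and not part of the original post): each extra digit costs exactly one more adc.

struct u256
{
    u64 limb[4] = {0, 0, 0, 0}; // limb[0] is the least significant base-2^64 digit
};

u256 operator+(u256 a, u256 b)
{
    u256 r;
    unsigned char c = _addcarry_u64(0, a.limb[0], b.limb[0], &r.limb[0]);
    c               = _addcarry_u64(c, a.limb[1], b.limb[1], &r.limb[1]);
    c               = _addcarry_u64(c, a.limb[2], b.limb[2], &r.limb[2]);
    (void)_addcarry_u64(c, a.limb[3], b.limb[3], &r.limb[3]);
    return r;
}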

Multiplication: regrouping our u64 digits

Multiplication is where things get more interesting. A 128-bit by 128-bit multiply produces a 256-bit result, but we only want the lower 128 bits for our result. It is the same story with u64 * u64, really, which produces a 128-bit result in theory, but you usually only use the lower u64. Speaking of which, all modern 64-bit architectures give you access to fast u64 * u64 -> u128 instructions. With BMI2, this is exposed as _mulx_u64. On MSVC, the equivalent is _umul128. This is our building block for large multiplication.
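
As a minimal sketch (the helper name is mine, not from the post), the building block on Clang/GCC with -mbmi2 looks like this:

u128 mul_64x64(u64 a, u64 b)
{
    u128 r;
    r.low = _mulx_u64(a, b, &r.high); // low half is returned, high half via out-parameter
    return r;
}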

You can derive the code by writing the u128 * u128 as a 2-digit long multiplication (u64, u64) * (u64, u64) and then carefully tracking which partial products end up in which digit.

That's how I do it on paper, but here we can also choose an algebraic route. Write the numbers as: $$ (a.\text{low} + 2^{64} \cdot a.\text{high}) \cdot (b.\text{low} + 2^{64} \cdot b.\text{high}) $$ Expanding this gives four terms. Of those, only three contribute to the low 128 bits:

  • \(a_\text{low} \cdot b_\text{low}\) contributes both low and high parts
  • \(a_\text{low} \cdot b_\text{high}\) contributes to bits 64..127
  • \(a_\text{high} \cdot b_\text{low}\) contributes to bits 64..127
  • \(a_\text{high} \cdot b_\text{high}\) contributes only above bit 128 and can be discarded

This works for larger integers as well, though summing up the intermediate terms can produce carries into higher "digits".

u128 operator*(u128 a, u128 b) 
{
    // we want the low 128 bits of:
    // (a.low + 2^64 a.high) * (b.low + 2^64 b.high)
    //
    // r.low  = lo64(a.low * b.low)
    // r.high = hi64(a.low * b.low)
    //        + lo64(a.low * b.high)
    //        + lo64(a.high * b.low)        (mod 2^64)

    // NOTE (MSVC): for multiply, you can use _umul128(a, b, &hi) instead of _mulx_u64.
    //              Clang/GCC: _mulx_u64 is BMI2 and needs -mbmi2.

    u128 r;

    u64 p0_hi;
    r.low = _mulx_u64(a.low, b.low, &p0_hi);

    // cross terms: only the low 64 bits contribute to r.high
    u64 t1_hi;
    u64 t1_lo = _mulx_u64(a.low,  b.high, &t1_hi);

    u64 t2_hi;
    u64 t2_lo = _mulx_u64(a.high, b.low,  &t2_hi);

    // simply adding is sufficient: carries would land in bit 128 and are discarded
    r.high = p0_hi + t1_lo + t2_lo;

    return r;
}

Note that we do not need any carry handling here. Any carry out of bit 127 would land in bit 128, which we are discarding anyway.

The compiler output reflects this reasoning.

operator*(u128, u128):
        mulx    r8, rax, rdi
        imul    rcx, rdi
        imul    rdx, rsi
        add     rdx, rcx
        add     rdx, r8
        ret

The compiler chooses slightly different instructions and registers for the builtin __uint128_t multiplication, but the result is otherwise identical.

Equality

Now an easy one. u128 equality is simple structural equality.

bool operator==(u128 a, u128 b) 
{
    return a.low == b.low && a.high == b.high;
}

The generated assembly is worth a quick look.

operator==(u128, u128):
        xor     rdi, rdx
        xor     rsi, rcx
        or      rsi, rdi
        sete    al
        ret

Instead of branching (as you might expect from the short-circuiting of &&), the compiler XORs the corresponding limbs. XOR produces zero if and only if the inputs are equal. ORing the results combines the checks. If the final value is zero, both limbs were equal.

This pattern continues for larger integers, though we might see branching due to short-circuiting at some point. (We could of course write the XOR approach explicitly in operator==, as sketched below, but it is distinctly less readable.)
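
For illustration, a minimal sketch of that explicit form, reusing the hypothetical u256 from the addition sketch above:

bool operator==(u256 a, u256 b)
{
    // each XOR is zero iff the corresponding limbs are equal; OR-ing combines all checks
    return ((a.limb[0] ^ b.limb[0]) | (a.limb[1] ^ b.limb[1]) |
            (a.limb[2] ^ b.limb[2]) | (a.limb[3] ^ b.limb[3])) == 0;
}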

Comparison: borrow beats branching

The straightforward way to compare two 128-bit integers is to compare the high parts first and then the low parts.

bool operator<(u128 a, u128 b)
{
    if (a.high != b.high) return a.high < b.high;
    return a.low < b.low;
}

This is correct, but the codegen is not great:

operator<(u128, u128):
        xor     r8d, r8d
        cmp     rdi, rdx
        setb    r8b
        xor     eax, eax
        cmp     rsi, rcx
        setb    al
        cmove   eax, r8d
        ret

Works, but heavier than necessary. We can do better by leaning on the hardware borrow flag.

Unsigned comparison a < b is equivalent to checking whether a - b produces a borrow:

bool operator<(u128 a, u128 b) 
{
    u64 dont_care;

    // compute borrow from (a.low - b.low). If a.low < b.low => borrow = 1.
    unsigned char borrow = _subborrow_u64(0, a.low, b.low, &dont_care);

    // now subtract highs with that borrow
    // final borrow tells us if a < b in 128-bit unsigned.
    borrow = _subborrow_u64(borrow, a.high, b.high, &dont_care);

    return borrow != 0;
}

The resulting assembly is minimal:

operator<(u128, u128):
        cmp     rdi, rdx
        sbb     rsi, rcx
        setb    al
        ret

One compare, one subtract with borrow, and a flag check. This is exactly the kind of codegen we want in hot code.

A small use site

To make sure everything composes properly, let's build a slightly larger function.

u128 demo_u128(u128 a, u128 b) 
{
    u128 x = a + b;
    u128 y = a * b;
    return x < y
            ? y - x
            : x - y;
}

All operators inline cleanly:

demo_u128(u128, u128):
        mov     r8, rdi
        add     r8, rdx
        mov     r9, rsi
        adc     r9, rcx
        mulx    rax, r10, rdi
        imul    rcx, rdi
        imul    rdx, rsi
        add     rdx, rcx
        add     rdx, rax
        mov     rax, r8
        sub     rax, r10
        mov     rcx, r9
        sbb     rcx, rdx
        jae     .LBB13_2
        sub     r10, r8
        sbb     rdx, r9
        mov     rax, r10
        mov     rcx, rdx
.LBB13_2:
        mov     rdx, rcx
        ret

There is a branch for the ternary operator, while the builtin version uses conditional moves instead. Which is better depends on the data patterns and could go either way. I consider this basically as good as it gets.

Platform notes

The examples shown are for x64 with Clang or GCC.

On MSVC, _addcarry_u64 and _subborrow_u64 work the same way. For multiplication, _umul128 replaces _mulx_u64.
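
For illustration, here is a minimal sketch of the multiplication in MSVC terms (my adaptation, assuming x64 MSVC and <intrin.h>): _umul128 returns the low half and writes the high half through a pointer, and the cross terms only need their low halves, so plain u64 multiplies suffice.

#include <intrin.h> // _umul128 (MSVC)

u128 mul_msvc(u128 a, u128 b)
{
    u128 r;
    u64 p0_hi;
    r.low  = _umul128(a.low, b.low, &p0_hi);           // full 64x64 -> 128 product
    r.high = p0_hi + a.low * b.high + a.high * b.low;  // cross terms, low halves only
    return r;
}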

On AArch64, the same approach applies using adds and adcs instructions for addition, subs and sbcs for subtraction, and mul + umulh for the low + high half of a 64-bit multiply. The patterns carry over directly, even though the intrinsics differ slightly (and multiplication is split into two parts).
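
As a minimal sketch of the multiply building block on AArch64 (my adaptation, assuming GCC or Clang, where unsigned __int128 is available): the shift by 64 compiles to umulh, while the low half is a plain mul.

u64 mul_hi_64(u64 a, u64 b)
{
    // the compiler lowers this to a single umulh instruction
    return (u64)(((unsigned __int128)a * b) >> 64);
}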

Outlook

This u128 is the easiest large integer you can write. Our goal is best performance, so we made sure the codegen is essentially identical to the builtin __uint128_t.

From here, the same patterns extend naturally to signed variants, widening multiplies such as u128 times u128 to u256, and chains of fixed-width integers like i192 or i256. More importantly, the same reasoning applies when you design predicate-specific arithmetic that only computes what is actually needed.
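
As one example of where this goes, here is a minimal sketch of a widening u128 * u128 -> 256-bit multiply (my own extension, reusing the hypothetical u256 from the addition sketch above): all four partial products are needed now, and the cross-term carries have to be propagated.

u256 mul_wide(u128 a, u128 b)
{
    u64 h00, h01, h10, h11;
    u64 l00 = _mulx_u64(a.low,  b.low,  &h00);
    u64 l01 = _mulx_u64(a.low,  b.high, &h01);
    u64 l10 = _mulx_u64(a.high, b.low,  &h10);
    u64 l11 = _mulx_u64(a.high, b.high, &h11);

    u256 r;
    r.limb[0] = l00;

    // digit 1: h00 + l01 + l10, remembering both possible carries
    unsigned char c1 = _addcarry_u64(0, h00, l01, &r.limb[1]);
    unsigned char c2 = _addcarry_u64(0, r.limb[1], l10, &r.limb[1]);

    // digit 2: h01 + h10 + l11, plus the carries from digit 1
    unsigned char c3 = _addcarry_u64(c1, h01, h10, &r.limb[2]);
    unsigned char c4 = _addcarry_u64(c2, r.limb[2], l11, &r.limb[2]);

    // digit 3: h11 plus the remaining carries (cannot overflow, the full product fits in 256 bits)
    r.limb[3] = h11 + c3 + c4;
    return r;
}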

A lot of our performance really boils down to these types: no BigInteger tax, no floating-point predicates (which need a few kB of stack space in the worst case). Just a lot of straight-line integer code and some light static branching.

Addendum 2026-01-24

Some notes based on reader feedback.

On PowerPC: I don't know much about PowerPC or the availability of intrinsics, but the instructions you need are definitely there. For add/sub with carry, there's addc / adde for addition and subfc / subfe for subtraction with borrow. For wide multiplication, mulhdu gives you the high half of a 64-bit multiply (and mulhd for signed).

On GCC codegen with intrinsics: Someone pointed out that GCC doesn't always handle the intrinsic-based addition optimally. It sometimes moves the result through the stack before setting it in the final registers. This is related to writing directly to the result struct. If you write to "real" scalars first, the codegen is optimal again. A weird one.
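
For reference, a minimal sketch of that workaround (the function name is mine): route the intrinsic outputs through local scalars and only assemble the struct at the end.

u128 add_via_scalars(u128 a, u128 b)
{
    u64 lo, hi;
    unsigned char c = _addcarry_u64(0, a.low,  b.low,  &lo);
    (void)_addcarry_u64(c, a.high, b.high, &hi);
    return u128{lo, hi}; // build the result from the scalars at the end
}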

On division: There is no neat codegen for division. Even the builtin delegates to a library call. The naive but practical approach is binary long division, which finishes in up to 128 steps. Either branchless with fixed runtime or with a loop that searches for the next set bit. Either way it's a bit of work. Our exact predicates are always formulated in a division-free way simply because division would be expensive.
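
For illustration, here is a minimal sketch of the loop variant (not from the original post; it reuses operator- and operator< from above, and the shift and bit helpers are my own additions). It assumes b != 0 and runs for up to 128 iterations.

u128 shl1(u128 v)
{
    // shift the whole 128-bit value left by one bit
    return u128{v.low << 1, (v.high << 1) | (v.low >> 63)};
}

bool get_bit(u128 v, int i)
{
    return i < 64 ? (v.low >> i) & 1 : (v.high >> (i - 64)) & 1;
}

u128 operator/(u128 a, u128 b)
{
    u128 q;   // quotient, starts at zero
    u128 rem; // running remainder, starts at zero
    for (int i = 127; i >= 0; --i)
    {
        rem = shl1(rem);
        rem.low |= get_bit(a, i); // bring down the next bit of the dividend
        q = shl1(q);
        if (!(rem < b)) // rem >= b: subtract and set the quotient bit
        {
            rem = rem - b;
            q.low |= 1;
        }
    }
    return q;
}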

On _BitInt(N): The upcoming _BitInt(N) type does work, but with caveats. For N = 128, you get the normal codegen. For N > 128, GCC calls into a library function where the bit size is a runtime parameter. So yes, these types would work, but performance will be subpar for larger widths. Clang, in contrast, generates properly inlined assembly.
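
For completeness, a minimal sketch of what using it looks like (assumption: Clang accepts _BitInt in C++ as an extension; GCC currently supports it in C23 only).

using u128_bi = unsigned _BitInt(128);

u128_bi add_bitint(u128_bi a, u128_bi b)
{
    return a + b; // at 128 bits this compiles to the same add/adc pair as the intrinsic version
}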