While many performance-oriented programmers are intimately acquainted with the almost preternatural ability of modern compilers to optimise their code, and many of us have spent countless hours on Compiler Explorer examining the differences between the Assembly generated by different versions of gcc and clang, most have likely not looked under the hood to see how the magic happens. It is a testament to their quality that most of us simply treat compilers as black boxes: more or less readable code goes in, fast binaries come out. Sometimes, however, seemingly innocuous changes—perhaps even meant to help the compiler—can cause surprising performance issues which we are hard-pressed to explain without a deeper understanding of the underlying machinery.
In this post we’ll dive into the implementation of some of the LLVM1 optimisation passes using two simple examples which, nonetheless, will help us pull back the veil on the complexity involved in producing highly-optimised code. We will see how small source changes can trigger different paths in the compiler’s internal processing with unexpected consequences, demonstrating how achieving high performance can be as much an art as it is a science for both compiler developers and users. I have also included a few exercises for those interested in getting their hands dirty, but they are not required to follow along with the main text.
I use LLVM 22.1.0 as the reference implementation throughout this post. The examples are written in (very basic) C++23 and target x86-64, and the Assembly code uses Intel syntax. Prior knowledge of LLVM IR is not required but it can be helpful (I recommend A Gentle Introduction to LLVM IR).
Consider the following C++ function to get the next index into an array or vector of elements
accessed in a round-robin fashion, with cur being the current index and count the number of
elements:
unsigned next_naive(unsigned cur, unsigned count)
{
return (cur + 1) % count;
}
As written, this code requires an expensive 32-bit division instruction (6 cycles with 12-cycle
latency on an Intel IceLake core, worse in previous generations). There are, of course, numerous
tricks to replace divisions by constants with cheaper arithmetic operations—powers of two being the
best-known case—but since here count is a dynamic runtime value the compiler cannot help us:
next_naive(unsigned int, unsigned int):
lea eax, [rdi + 1]
xor edx, edx
div esi
mov eax, edx
ret
However, we know something the compiler doesn’t: because cur is an index, it must always be
strictly smaller than count. We can take advantage of this to replace the division with a much
cheaper conditional move by writing:
unsigned next_cmov(unsigned cur, unsigned count)
{
auto n = cur + 1;
return n == count ? 0 : n;
}
next_cmov(unsigned int, unsigned int):
lea ecx, [rdi + 1]
xor eax, eax
cmp ecx, esi
cmovne eax, ecx
ret
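It's worth stressing that the two formulations only agree under the precondition: when cur >= count the division keeps reducing modulo count while the comparison version does not. A quick exhaustive check for small counts (a standalone sketch, restating both functions):

```cpp
#include <cassert>

unsigned next_naive(unsigned cur, unsigned count)
{
    return (cur + 1) % count;
}

unsigned next_cmov(unsigned cur, unsigned count)
{
    auto n = cur + 1;
    return n == count ? 0 : n;
}

// Exhaustive check: whenever cur < count, the cmov version matches the
// division version. Outside the precondition they can differ, e.g.
// next_naive(5, 3) == 0 but next_cmov(5, 3) == 6.
void check_equivalence()
{
    for (unsigned count = 1; count <= 64; ++count)
        for (unsigned cur = 0; cur < count; ++cur)
            assert(next_naive(cur, count) == next_cmov(cur, count));
}
```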
Could the compiler have done this transformation on its own? Let’s try out the new C++23 assume
attribute2:
unsigned next_assume(unsigned cur, unsigned count)
{
[[assume(cur < count)]];
return (cur + 1) % count;
}
next_assume(unsigned int, unsigned int):
lea ecx, [rdi + 1]
xor eax, eax
cmp ecx, esi
cmovne eax, ecx
ret
Exactly the same code as for next_cmov()! So how did clang figure it out?
When trying to understand where a certain optimisation comes from (or where it might be missing), the first step is to identify the pass where it occurs. Currently, the most convenient way to investigate the effects of the different LLVM optimisation passes is to use Compiler Explorer’s Opt Pipeline view:
(You can get the same information locally by passing the -mllvm -print-changed options to clang,
but without the nice highlighted diffs.)
When looking through the evolution of the LLVM IR for
next_assume(), we quickly find that the transformation from a urem instruction (representing the
modulo operation) to an icmp+select pair (which is eventually translated into an x86
cmp+cmov) happens in
InstCombine:
%cmp = icmp ult i32 %cur, %count
call void @llvm.assume(i1 %cmp)
%add = add i32 %cur, 1
- %rem = urem i32 %add, %count
+ %0 = icmp eq i32 %add, %count
+ %rem = select i1 %0, i32 0, i32 %add
ret i32 %rem
InstCombine is responsible for many of the peephole optimisations that do not modify the control flow graph, such as converting multiplications by powers of two into left shifts. It proceeds by visiting all the instructions in a function and matching them to one of many hundreds of patterns using a generic pattern matching mechanism. The pass actually runs multiple times during compilation3, since other passes typically expose new optimisation opportunities by performing different simplifications, unrolling loops, etc.
In this case the main transformation logic is quite simple and self-contained. If we look through
the lib/Transforms/InstCombine directory in the LLVM repository, we’ll see that the code is neatly
organised into different files for the different categories of instructions matched by the pass,
each file containing visitor methods for the opcodes of interest. One of these files is
InstCombineMulDivRem.cpp, where we can find the relatively short visitURem() method and, at the
end of it:4
// For "(X + 1) % Op1" and if (X u< Op1) => (X + 1) == Op1 ? 0 : X + 1 .
if (match(Op0, m_Add(m_Value(X), m_One()))) {
Value *Val =
simplifyICmpInst(ICmpInst::ICMP_ULT, X, Op1, SQ.getWithInstruction(&I));
if (Val && match(Val, m_One())) {
Value *FrozenOp0 = Op0;
if (!isGuaranteedNotToBeUndef(Op0))
FrozenOp0 = Builder.CreateFreeze(Op0, Op0->getName() + ".frozen");
Value *Cmp = Builder.CreateICmpEQ(FrozenOp0, Op1);
return createSelectInstWithUnknownProfile(
Cmp, ConstantInt::getNullValue(Ty), FrozenOp0);
}
}
As the comment preceding the block describes, this snippet matches exactly the pattern we used in
next_assume() and generates the same optimisation we wrote by hand in next_cmov() for cases
where it can prove that the pre-increment value is less than the modulus.5 To make this
guarantee, the compiler cleverly reuses the same analysis used to fold the result of icmp
instructions, namely the simplifyICmpInst()
function.
Scrolling down its implementation you can see the numerous patterns it matches; of particular
interest to us here is the simplifyICmpWithDominatingAssume helper, which checks all the
assumptions applicable to either side of the comparison (in the form of calls to the llvm.assume()
intrinsic) to see whether they imply our condition of interest.6 This, finally, explains
how the information we encoded in our assume attribute enabled the compiler to remove that pesky
division.
Looking through the implementation of simplifyICmpInst() also shows us other options to
communicate this precondition. For example, this version of our function is optimised in a similar
way when assertions are enabled:
unsigned next_assert(unsigned cur, unsigned count)
{
assert(cur < count);
return (cur + 1) % count;
}
Here, the simplification analysis uses the conditional hidden in the assert() macro to figure out
that we only execute the urem instruction if cur < count (see the isImpliedByDomCondition()
method). This can have the unexpected consequence of making builds with assertions enabled faster
in some cases; we can restore performance by replacing disabled assertions with assume attributes,
which have no runtime impact.
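One way to package this is a small wrapper macro (the name is my own invention, not a standard facility): it checks the condition in debug builds and degrades to a pure optimiser hint when assertions are disabled.

```cpp
#include <cassert>

// Hypothetical helper, not part of any standard library: an assertion
// that becomes an optimisation hint when NDEBUG is defined, so release
// builds keep the information without paying for the check.
#ifdef NDEBUG
  #define CHECKED_ASSUME(cond) [[assume(cond)]]
#else
  #define CHECKED_ASSUME(cond) assert(cond)
#endif

unsigned next_checked(unsigned cur, unsigned count)
{
    CHECKED_ASSUME(cur < count);
    return (cur + 1) % count;
}
```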
Exercise 1: If we try to be clever and rewrite our function using bitwise operations to avoid the potential branch, InstCombine sees right through our trickery and produces the exact same output:
unsigned next_bitwise(unsigned cur, unsigned count)
{
auto n = cur + 1;
return n & -(n != count);
}
%add = add i32 %cur, 1
- %cmp = icmp ne i32 %add, %count
- %conv = zext i1 %cmp to i32
- %sub = sub nsw i32 0, %conv
- %and = and i32 %add, %sub
+ %cmp.not = icmp eq i32 %add, %count
+ %and = select i1 %cmp.not, i32 0, i32 %add
ret i32 %and
Trace the execution of the pass to figure out what patterns it matches to do this. Start with the
visitor for the sub instruction. (rr can be very useful for this.)
So far all the examples we’ve seen required us to explicitly indicate, one way or another, what the
relationship was between cur and count so the code could be optimised accordingly. In other
cases this relationship might be implicit, such as when iterating through the modulo values in
a loop:
void iterate_1(unsigned upto, unsigned count)
{
for (unsigned i = 0, r = 0; i < upto; ++i, r = (r + 1) % count)
// force the compiler to calculate r instead of discarding the whole loop
asm("" :: "r" (r));
}
If we look at the generated Assembly, it doesn’t seem like the compiler is smart enough to deduce the optimisation on its own:
iterate_1(unsigned int, unsigned int):
test edi, edi
je .LBB4_3
xor edx, edx
.LBB4_2:
inc edx
mov eax, edx
xor edx, edx
div esi
dec edi
jne .LBB4_2
.LBB4_3:
ret
As is often the case, though, the compiler has simply noticed something we overlooked: our
optimisation relies on having r < count, but if count is zero this condition cannot be met.7
It’s instructive to see exactly how the optimiser reaches this conclusion. If we look at the IR at
the input to the first InstCombine pass, we will find that the (X + 1) % Op1 pattern from
visitURem() now matches X to the following phi instruction8:
%r.0 = phi i32 [ 0, %entry ], [ %rem, %for.body ]
When finding a phi instruction, simplifyICmpInst()
checks
that the comparison can be simplified for all possible input values to the phi (in this case, zero
and the last value calculated for r) and, if so, that it produces the same result for all of them.
With our code above there is no way to determine whether 0 < count and the simplification fails.
If we cover our bases by adding an if (count == 0) return; statement before the loop, we can close
the gap and get the expected result:
.LBB4_2:
inc ecx
cmp ecx, esi
cmove ecx, eax
dec edi
jne .LBB4_2
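Spelled out in source, the guarded variant looks like this (a sketch: I return the final r so the behaviour is observable, instead of keeping it alive via the inline asm as in the original):

```cpp
// The early return guarantees count != 0, so r < count now holds on every
// iteration and the compiler can replace the urem with a compare-and-select.
unsigned iterate_1_guarded(unsigned upto, unsigned count)
{
    if (count == 0)
        return 0;
    unsigned r = 0;
    for (unsigned i = 0; i < upto; ++i)
        r = (r + 1) % count;
    return r;
}
```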
But that doesn’t explain how the compiler determined that %rem < %count in the second input to the
phi, when we no longer have an explicit assumption or condition to that effect. That case is
handled by yet another
pattern
deep in the bowels of the instruction folding routines, which specifically handles comparisons
between remainders of a given modulus and the modulus itself:
// icmp pred (urem X, Y), Y
if (match(LBO, m_URem(m_Value(), m_Specific(RHS)))) {
switch (Pred) {
// ...
case ICmpInst::ICMP_NE:
case ICmpInst::ICMP_ULT:
case ICmpInst::ICMP_ULE:
return getTrue(ITy);
}
}
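The fact being folded is simply that an unsigned remainder is always strictly smaller than its (non-zero) modulus, so a comparison like icmp ult (urem X, Y), Y is unconditionally true:

```cpp
#include <cassert>
#include <cstdint>

// For any unsigned x and non-zero y, x % y < y by definition of the
// remainder; this invariant is exactly what the folding rule above encodes.
void check_urem_bound()
{
    for (uint32_t y = 1; y <= 64; ++y)
        for (uint32_t x = 0; x <= 256; ++x)
            assert(x % y < y);
}
```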
Yes, it’s patterns all the way down.
Exercise 2: If we instead change the loop condition to i < std::min(upto, count) the compiler
is no longer able to replace the urem instruction. Starting from the isImpliedByDomCondition()
method mentioned earlier, try to figure out where you could add support to see through min/max
idioms when evaluating implied conditions, and whether you could reuse any existing analysis code.
What if we wanted to remove the now-redundant remainder entirely?
It may seem unnecessary to define an extra auxiliary variable to hold the remainder when we could simply calculate it from the loop counter itself, even if the generated code is efficient. Let’s get rid of it:
void iterate_2(unsigned upto, unsigned count)
{
for (unsigned i = 0; i < upto; ++i)
asm("" :: "r" (i % count));
}
iterate_2(unsigned int, unsigned int):
test edi, edi
je .LBB5_3
xor eax, eax
xor ecx, ecx
xor edx, edx
.LBB5_2:
inc ecx
cmp ecx, esi
cmove ecx, eax
inc edx
cmp edi, edx
jne .LBB5_2
.LBB5_3:
ret
It looks like the compiler recognises the same opportunity here: the division is still replaced with a nice conditional move, using a separate register to hold the remainder. But this doesn’t look like the work of the InstCombine code we saw earlier; and why are we incrementing the loop counter, which requires an extra comparison, rather than decrementing it and using the zero flag, as before?
This time, if we inspect the results of the different optimisation passes, we’ll see that the
transformation comes not from InstCombine but from another pass, CodeGenPrepare, which runs shortly
before instruction selection but after the loop strength reduction (LSR) responsible for converting
the counter increment into a decrement. InstCombine can no longer match the pattern it expects,
since it does not hold that i < count; and, because i is used inside the body of the loop to
calculate the remainder, LSR can no longer replace it with a down counter. CodeGenPrepare then acts
as a sort of last resort, identifying a specific
pattern
which it can transform into almost-optimal code.
This illustrates two points that we'll explore in more depth in the next section.
Exercise 3: LLVM ships its optimiser, aptly named opt, as a modular binary separate from the
compiler front-end. Save our implementation of iterate_2() above in an iterate.cpp file and try
running the following command using your local clang install:
clang++ -O2 -S -emit-llvm -fno-discard-value-names -o- iterate.cpp \
| opt -S -passes="require<profile-summary>,loop-reduce,codegenprepare" -
You should obtain output similar to what you can observe after the CodeGenPrepare pass in the
Compiler Explorer Opt Pipeline view. What happens if you add another loop-reduce pass at the end?
Exercise 4: If we make the same change to the loop condition in iterate_2() as in Exercise 2,
the remainder instruction is eliminated from the generated code. What pass does this and why didn’t
it work before?
Most binary formats, whether for storage, IPC, or networking, specify a particular byte order for
numeric fields; for example, SQLite database files use big-endian fields, while Protobuf i32 and
i64 fields are little-endian. The customary way to write host endianness-agnostic decoding
functions is to explicitly join the individual input bytes. For 32-bit values that looks something
like (assuming a 32-bit int so we don’t need casts):
uint32_t load_le32(const uint8_t* data)
{
return data[0] | (data[1] << 8) | (data[2] << 16) | (data[3] << 24);
}
uint32_t load_be32(const uint8_t* data)
{
return data[3] | (data[2] << 8) | (data[1] << 16) | (data[0] << 24);
}
On x86, which supports unaligned loads from memory, this generates optimal Assembly at -O1 and
above:
load_le32(unsigned char const*):
mov eax, dword ptr [rdi]
ret
load_be32(unsigned char const*):
mov eax, dword ptr [rdi]
bswap eax
ret
Some developers, suffering from chronic DRYness and faced with the prospect of writing these functions and their 64-bit equivalents for the umpteenth time, or perhaps working with integer sizes other than 4 and 8 bytes, would prefer to write a single generic version instead:
static uint64_t load(const uint8_t* data, size_t sz, bool be)
{
uint64_t val = 0;
for (size_t i = 0; i < sz; ++i)
val |= static_cast<uint64_t>(data[be ? sz - i - 1 : i]) << (8 * i);
return val;
}
uint32_t load_le32(const uint8_t* data) { return load(data, 4, false); }
uint32_t load_be32(const uint8_t* data) { return load(data, 4, true); }
uint64_t load_le64(const uint8_t* data) { return load(data, 8, false); }
uint64_t load_be64(const uint8_t* data) { return load(data, 8, true); }
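As a quick sanity check that the generic version decodes both byte orders as intended (restating load() so the snippet stands alone):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

static uint64_t load(const uint8_t* data, size_t sz, bool be)
{
    uint64_t val = 0;
    for (size_t i = 0; i < sz; ++i)
        val |= static_cast<uint64_t>(data[be ? sz - i - 1 : i]) << (8 * i);
    return val;
}

// 0x12 is the first byte in memory: it lands in the low-order byte of a
// little-endian decode and the high-order byte of a big-endian one.
void check_load()
{
    const uint8_t bytes[] = {0x12, 0x34, 0x56, 0x78};
    assert(load(bytes, 4, false) == 0x78563412u);
    assert(load(bytes, 4, true)  == 0x12345678u);
}
```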
Code reviewers might then reasonably worry about whether there are any performance regressions compared to the more traditional functions. Let’s see what happens when we compile:
load_le32(unsigned char const*):
mov eax, dword ptr [rdi]
ret
load_be32(unsigned char const*):
mov eax, dword ptr [rdi]
bswap eax
ret
load_le64(unsigned char const*):
mov rax, qword ptr [rdi]
ret
load_be64(unsigned char const*):
mov rax, qword ptr [rdi]
bswap rax
ret
It looks like our lazy (or perhaps far-sighted) developer is in the clear—at least if they don’t need to use gcc, which as of version 15.2 cannot optimise this code. Let’s examine the progression of the optimisation pipeline again to gain some insight into how the compiler pulls this off.
Given what we’ve seen in the first half of the post, it may be somewhat surprising to find through our trusty Opt Pipeline view that it takes all the way until the instruction selection phase (also known as ISel) for the consecutive loads to be folded into single 32- or 64-bit loads. We’ll see in the next section how we can enable earlier simplification in some cases, but let’s take some time to understand what ISel is and how it implements folding.
Instruction selection is the first phase of the compiler back-end, which runs after all the generic
optimisation passes (in LLVM this distinction is also reflected in the separation between the opt
and llc binaries). Its purpose is to convert the LLVM IR produced by the optimiser into Machine
IR (MIR), a target-specific representation of the instruction stream suitable for downstream passes
such as scheduling or register allocation. Most LLVM targets, including x86, use SelectionDAG as
their instruction selection framework. We won’t go into the weeds of SelectionDAG here9;
for our purposes, it suffices to know that it includes its own peephole optimiser called
DAGCombiner, analogous to InstCombine, which can capture simplification opportunities exposed by
code generation or specific to that target.
Much like InstCombine visits each instruction in a basic block, DAGCombiner visits each node in the
DAG built from those same instructions. It’s here, among the many transformations performed by the
or visitor, that we find the MatchLoadCombine()
method:
/// Match a pattern where a wide type scalar value is loaded by several narrow
/// loads and combined by shifts and ors. Fold it into a single load or a load
/// and a BSWAP if the targets supports it.
This is, of course, the exact pattern we see in the LLVM IR at the input to SelectionDAG, after
load() is inlined, constant arguments propagated, and the loops fully unrolled. What’s
interesting about this method, though, is that it doesn’t perform a simple pattern match as we’ve
seen before in InstCombine, but instead tracks the origin of every single byte at the output of the
or instruction recursively through the DAG (using the auxiliary calculateByteProvider()
function),
not just from loads but also through bitwise operations, byte swaps, and zero/sign extensions. This
makes it powerful enough to also handle expressions like:
static uint64_t load_alt(const uint8_t* data, size_t sz, bool be)
{
uint64_t val = 0;
for (size_t i = 0; i < sz; ++i)
val = (val << 8) | data[be ? i : sz - i - 1];
return val;
}
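Despite accumulating in the opposite direction, load_alt() computes the same values as load(); for example (repeating the definition so the snippet is self-contained):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

static uint64_t load_alt(const uint8_t* data, size_t sz, bool be)
{
    uint64_t val = 0;
    for (size_t i = 0; i < sz; ++i)
        val = (val << 8) | data[be ? i : sz - i - 1];
    return val;
}

// The byte consumed first ends up in the most significant position, so a
// big-endian decode walks the buffer forwards and a little-endian one
// walks it backwards.
void check_load_alt()
{
    const uint8_t bytes[] = {0x12, 0x34, 0x56, 0x78};
    assert(load_alt(bytes, 4, false) == 0x78563412u);
    assert(load_alt(bytes, 4, true)  == 0x12345678u);
}
```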
In this case, however, we find that only the 32-bit functions are fully optimised. Although we have exactly the same number of LLVM IR instructions as before, this new version requires recursing deeper into the DAG to uncover the source of the bytes loaded earliest (since we do two operations on the output variable per iteration, instead of one), hitting the hard-coded limit of ten levels of recursion. This is a good example of how having a mental map of the compiler’s internal processing, as well as sticking to more common patterns, can help us avoid such pathological cases.10
Some seasoned C++ developers may have instinctively objected to this example. After all, if we only have a few possible argument values known at compile-time, why wouldn’t we just template our function rather than relying on the compiler to propagate the constants?
template <size_t sz, bool be>
static uint64_t load(const uint8_t* data)
{
uint64_t val = 0;
for (size_t i = 0; i < sz; ++i)
val |= static_cast<uint64_t>(data[be ? sz - i - 1 : i]) << (8 * i);
return val;
}
uint32_t load_le32(const uint8_t* data) { return load<4, false>(data); }
uint32_t load_be32(const uint8_t* data) { return load<4, true>(data); }
uint64_t load_le64(const uint8_t* data) { return load<8, false>(data); }
uint64_t load_be64(const uint8_t* data) { return load<8, true>(data); }
The output is unchanged, but this time the IR for the
little-endian functions is optimised into a single load long before we reach the instruction
selection stage, in a pass called
AggressiveInstCombine.
Like InstCombine, this pass performs a variety of peephole optimisations, but it’s focused on those
which tend to be less commonly needed or take longer to run, which is why it runs only once at -O2
(and not at all at -O1). One of these optimisations is implemented in the
foldConsecutiveLoads()
function,
which matches patterns of the form “bitwise-or of (optionally) left-shifted, zero-extended,
consecutive loads”, such as the ones we find in our functions.
Since it relies on simple pattern matching, AggressiveInstCombine is not as powerful as DAGCombiner;
in particular, it cannot optimise the fold used in load_alt() above even for 32 bits. It also has
no impact on our big-endian loads because, as a middle-end optimisation pass (unlike DAGCombiner,
which is part of the compiler back-end), it tries to produce target-independent IR as much as
possible. Folding a big-endian load on a little-endian target, and vice-versa, would require adding
a target-specific byte-swap intrinsic which would complicate downstream analysis and optimisation
passes.
Exercise 5: Template the load_alt() function as in this section and examine the resulting
output. Does it match our earlier results? What does this tell us about the interaction between
templating and the optimisation pipeline?
Exercise 6: Instead of loading bytes in a different order depending on endianness, we can also
fix the order and optionally use the __builtin_bswap64() intrinsic on the return from load().
How does this affect the ability of AggressiveInstCombine to optimise the code?
Why didn’t AggressiveInstCombine come into play in our initial non-templated version? If we inspect
the optimisation pipeline, we’ll find that earlier the pass ran after load() was inlined but,
crucially, before the loop was unrolled, so there was no pattern to match yet. Function templates,
on the other hand, are instantiated separately for each unique combination of parameters, so by the
time they are inlined the loops have already been fully unrolled (since the iteration count is
known) and can be matched by foldConsecutiveLoads(). We can obtain the same result with the
initial code by manually running:
clang++ -S -emit-llvm -O2 -o- endianness.cpp \
| opt -S -passes='aggressive-instcombine' - \
| llc --x86-asm-syntax=intel
This also exemplifies an important principle of function inlining in LLVM (I’m not sure about other compilers): whenever possible, callees are simplified before callers in order to improve the accuracy of the inlining cost heuristics. The more information is available to the optimiser before inlining, the better it can make those decisions. In general, in fact, the earlier in the pipeline we can enable specific simplifications, the more optimised the final result will be (see the exercises for more examples of this).
Exercise 7: If we comment out all of the specialised functions but load_le32() we get similar
results as when we template load(), i.e. AggressiveInstCombine successfully folds the
consecutive loads. What optimisation pass enables this? How can we restructure
performance-critical code to take advantage of it?
Exercise 8: If we now keep only load_le64() but replace load() with load_alt() above, the
DAGCombiner is able to fully optimise the resulting LLVM IR. Compare with our earlier results to
understand why this is the case.
If you’ve read this far you may be under the impression that, while they may not be magic, optimisers are at the very least superhuman. Let’s quickly dispel that notion by looking at two examples where the obvious thing to do turns out not to be so obvious for the compiler.
We’ll start by adding a small (and pretty useless) function to our initial example:
uint32_t load_le32(const uint8_t* data) { return load(data, 4, false); }
uint32_t load_add_le32(const uint8_t* data)
{
return load_le32(data) + data[0];
}
Take a moment before continuing to consider how you would expect this function to be compiled. Two loads and an add? Perhaps a load, zero extension of the low byte, and an add?
Now compare with the actual result:
load_add_le32(unsigned char const*):
movzx ecx, byte ptr [rdi]
movzx edx, word ptr [rdi + 1]
shl edx, 8
movzx eax, byte ptr [rdi + 3]
shl eax, 24
or eax, edx
or eax, ecx
add eax, ecx
ret
Huh. What happened here?
Remember that the loads aren’t folded together until we reach the ISel phase, so by the time
load_le32() is inlined it’s not yet fully optimised. It turns out DAGCombiner, when tracking the
origin of each byte in the load, specifically rules
out
bytes which are used elsewhere in the function to prevent introducing duplicate loads. This can be
sensible in general (even cache hits have a cost, after all), but it’s clearly suboptimal in this
case.
Can we make use of AggressiveInstCombine to “hide” the duplicate load from DAGCombiner? The answer turns out to be yes, but we must be careful. This, for example, won’t work:
uint32_t load_add_le32(const uint8_t* data)
{
return load<4, false>(data) + data[0];
}
Although AggressiveInstCombine does run after load<4, false>() is inlined here, it also
refuses
to create duplicate loads11. As the saying goes, however, there’s no problem that cannot
be solved with another layer of indirection: if we instead call the function template from
load_le32(), we can get AggressiveInstCombine to optimise the unrolled loop before it’s inlined
in load_add_le32(), producing the desired result (the
additional cleverness to avoid a second load comes courtesy of the GVN pass).
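Putting the pieces together, the layered arrangement described above looks like this (a sketch restating the template so the snippet stands alone; the interesting difference is only visible in the generated code, as the functions behave identically either way):

```cpp
#include <cstddef>
#include <cstdint>

template <size_t sz, bool be>
static uint64_t load(const uint8_t* data)
{
    uint64_t val = 0;
    for (size_t i = 0; i < sz; ++i)
        val |= static_cast<uint64_t>(data[be ? sz - i - 1 : i]) << (8 * i);
    return val;
}

// The extra layer of indirection: AggressiveInstCombine folds the unrolled
// loads inside load<4, false>() while simplifying load_le32(), before
// either function is inlined into load_add_le32().
static uint32_t load_le32(const uint8_t* data) { return load<4, false>(data); }

uint32_t load_add_le32(const uint8_t* data)
{
    return load_le32(data) + data[0];
}
```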
Exercise 9: What happens if we use val & 0xFF instead of data[0] in the examples above, and
why? Can you make a small change to make this work regardless of which function we call?12
For our second and final act, let’s reconsider what we’re optimising for in the first place. We
have been focusing on performance in the sense of execution time by using the -O<n> flags. But
what about optimising for size with -Os? In this case, they happen to coincide: surely you cannot
get any faster or smaller than a single 32- or 64-bit load. But does the compiler agree?
Unfortunately, it does not. With the curious exception of
load_be32() (or, more precisely, load<4, true>()), it has decided that unrolling the loops would
go against its mandate to minimise code size, making it impossible for downstream passes to fold the
loads. The solution here is to nudge it politely using #pragma unroll before the loop, allowing
AggressiveInstCombine or DAGCombiner to do their job properly. (You can also avoid pragmas by
writing a recursive function template if that’s more to your taste.)
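The nudge looks something like this (a sketch using clang's #pragma unroll; gcc spells it #pragma GCC unroll n, and unknown pragmas are merely warned about, not rejected):

```cpp
#include <cstddef>
#include <cstdint>

template <size_t sz, bool be>
static uint64_t load(const uint8_t* data)
{
    uint64_t val = 0;
    // Force full unrolling even at -Os, so the load-combining patterns in
    // AggressiveInstCombine and DAGCombiner have something to match.
    #pragma unroll
    for (size_t i = 0; i < sz; ++i)
        val |= static_cast<uint64_t>(data[be ? sz - i - 1 : i]) << (8 * i);
    return val;
}

uint32_t load_le32(const uint8_t* data) { return load<4, false>(data); }
```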
What can we take away from this somewhat meandering exploration? Perhaps the overarching theme is: get familiar with your compiler and what it can and cannot do for you. More actionable advice:
Use language facilities to make your intent clear. If you know something at compile-time, all else being equal, you can help the optimiser by stating it explicitly through e.g. C++ templates or Rust macros. In some cases it may even be preferable to choose at runtime from among a small set of specialised functions, instead of using a single generic function which cannot be optimised.
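For instance, the generic byte-order load from earlier could dispatch at runtime to fully specialised instantiations (a sketch; the names mirror the earlier examples, and only a couple of sizes are handled):

```cpp
#include <cstddef>
#include <cstdint>

template <size_t sz, bool be>
static uint64_t load(const uint8_t* data)
{
    uint64_t val = 0;
    for (size_t i = 0; i < sz; ++i)
        val |= static_cast<uint64_t>(data[be ? sz - i - 1 : i]) << (8 * i);
    return val;
}

// Runtime dispatch to compile-time specialisations: each case is a fully
// optimised instantiation, instead of one generic loop the optimiser
// cannot specialise.
uint64_t load_any(const uint8_t* data, size_t sz, bool be)
{
    switch (sz) {
        case 4: return be ? load<4, true>(data) : load<4, false>(data);
        case 8: return be ? load<8, true>(data) : load<8, false>(data);
        default: return 0;  // unsupported size in this sketch
    }
}
```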
Maximise the potential for early optimisation through things like templating (again), auxiliary variables, explicit constraints, or specialised functions. Having at least a rough idea of the different optimisation passes and their order can help with this. The earlier the compiler can optimise something, the greater the opportunity for beneficial knock-on effects.
Stick to well-known patterns. The more common a construction is, the more likely that someone else already did the hard work of making the compiler recognise and optimise it. Simple code beats clever nine times out of ten.
Haven’t had enough yet? Here are some helpful resources if you’re interested in diving deeper into LLVM in general and the optimiser in particular: