One unassuming week in September 2022, Google DeepMind dropped S6, a fully-fledged CPython JIT, squashed into one commit. I had heard nothing of its development even though I was working on Cinder at the time and generally heard about new JIT efforts. I started poking at it.
The README has an excellent structural explanation of how they optimize Python, including a nice introduction to hidden classes (also called shapes, layouts, and maps elsewhere). Hidden classes are core to making dynamic language runtimes fast: they turn what is normally a hashtable lookup into an integer comparison and a memory load. They rely on the assumption that even in a dynamic language, programmers are not very creative, and therefore for a given location in the code (PC), the number of types seen will be one or small. See a great tutorial by CF Bolz-Tereick on how to build a hidden-class-based object model.
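To make that concrete, here is a minimal sketch of a hidden-class-based attribute read with a simple cache at one access site. None of this is S6's object model; the names and layout are invented for illustration.

```c++
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical hidden class: maps attribute names to slot indices. Objects
// built the same way share the same HiddenClass pointer.
struct HiddenClass {
  std::unordered_map<std::string, int> slot_of;
};

struct Object {
  HiddenClass* klass;          // the object's hidden class ("shape"/"map")
  std::vector<int64_t> slots;  // attribute values, stored by slot index
};

// A monomorphic inline cache for one access site (one PC).
struct AttrCache {
  HiddenClass* cached_class = nullptr;
  int cached_slot = -1;
};

int64_t LoadAttr(Object* obj, const std::string& name, AttrCache* ic) {
  if (obj->klass == ic->cached_class) {
    // Fast path: pointer compare + indexed load, no hash lookup.
    return obj->slots[ic->cached_slot];
  }
  // Slow path: do the dictionary lookup and fill the cache for next time.
  int slot = obj->klass->slot_of.at(name);
  ic->cached_class = obj->klass;
  ic->cached_slot = slot;
  return obj->slots[slot];
}
```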
Hidden classes give you the ability to more quickly read from objects, but you, the runtime implementor, have to decide what kind of cache you want to use. Should you have a monomorphic cache? Or a polymorphic cache?
In an interpreter, a common approach is to do some kind of state-machine-based bytecode rewriting. Your generic opcodes (load an attribute, load a method, add) start off unspecialized, specialize to monomorphic when they first observe a hidden class HC, rewrite themselves to polymorphic when they observe the next hidden class HC’, and may again rewrite themselves to megamorphic (the sad case) when they see the K+1th hidden class. Pure interpreters take this approach because they want to optimize as they go and the unit of optimization is normally (PDF) one opcode at a time.
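A rough sketch of that state machine, reusing the Object and HiddenClass types from the sketch above (again, invented names; this is not CPython's or S6's actual handler code):

```c++
// Reuses the Object/HiddenClass types from the sketch above.
enum class CacheState { Unspecialized, Monomorphic, Polymorphic, Megamorphic };

constexpr int kMaxEntries = 4;  // the "K" from the prose

struct SiteCache {
  CacheState state = CacheState::Unspecialized;
  int num_entries = 0;
  HiddenClass* classes[kMaxEntries] = {};
  int slots[kMaxEntries] = {};
};

int64_t LoadAttrSpecializing(Object* obj, const std::string& name, SiteCache* c) {
  // Fast path: check the entries we have already specialized for.
  for (int i = 0; i < c->num_entries; i++) {
    if (c->classes[i] == obj->klass) return obj->slots[c->slots[i]];
  }
  // Miss: do the slow lookup, then advance the state machine.
  int slot = obj->klass->slot_of.at(name);
  if (c->num_entries < kMaxEntries) {
    c->classes[c->num_entries] = obj->klass;
    c->slots[c->num_entries] = slot;
    c->num_entries++;
    c->state = (c->num_entries == 1) ? CacheState::Monomorphic
                                     : CacheState::Polymorphic;
  } else {
    c->state = CacheState::Megamorphic;  // the sad case: give up on caching
  }
  return obj->slots[slot];
}
```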
One interesting observation here is that while the bytecode rewriting is used to help interpreter performance, you can reuse this specialized bytecode and its cache contents as a source of profiling information when the JIT kicks in. It’s a double use, which is a win for storage and run-time overhead.
In an optimizing JIT world that cares a little less about interpreter/baseline compiler performance, the monomorphic/polymorphic split may look a little different:
If you go for monomorphic and that code never sees any other hidden class, you’ve won big: the generated code is small and you get very strong type assumptions burned into the code from the beginning. If you’re wrong, though, and that ends up being a polymorphic site in the code, you lose on performance: the code will constantly jump back into the interpreter.
If you go for polymorphic but the code is mostly monomorphic, then you mostly just lose on peak performance. Your code may need to walk the cmp+jcc chain in the JIT and the operation’s inferred type in your IR will not be as fine-grained as in the monomorphic case. But you might side-exit less into the interpreter, which is nice.
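In compiled-code terms, the trade-off looks roughly like this schematic C++ (reusing the types from the first sketch; this is not S6’s generated code, and the baked-in constants and Deoptimize are stand-ins):

```c++
// Reuses the Object/HiddenClass types from the first sketch. The constants
// below stand in for values the JIT burned into the generated code.
extern HiddenClass* const kCachedClass;
extern const int kCachedSlot;
extern HiddenClass* const kClassA;
extern HiddenClass* const kClassB;
extern const int kSlotA, kSlotB;
[[noreturn]] void Deoptimize();  // side exit back to the interpreter

// Monomorphic speculation: one guard, then strong type assumptions.
int64_t LoadX_Monomorphic(Object* obj) {
  if (obj->klass != kCachedClass) Deoptimize();  // hopefully rare
  return obj->slots[kCachedSlot];
}

// Polymorphic speculation: walk a cmp+jcc chain. Fewer side exits, but the
// result type the IR can infer is the union of all the cases.
int64_t LoadX_Polymorphic(Object* obj) {
  if (obj->klass == kClassA) return obj->slots[kSlotA];
  if (obj->klass == kClassB) return obj->slots[kSlotB];
  Deoptimize();  // saw something outside the cached set
}
```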
But “polymorphic” and “megamorphic” are very coarse summaries of the access patterns at that site. Yes, side exits are slow, but if a call site S is specialized only for hidden class HC and mostly sees HC but sometimes sees HC’, that’s probably fine! We can take a few occasional side exits if the primary case is fast.
Let’s think about the information our caches give us right now: a monomorphic cache tells us the one hidden class it has seen; a polymorphic cache tells us which (up to K) hidden classes it has seen and, implicitly, the order in which it first saw them; a megamorphic site only tells us that it has seen more than K. But we want more information than that: we want to know if the access patterns are skewed in some way.
What if at some PC the interpreter sees 100x hidden class A and only 2x hidden class B? This would unfortunately look like a boring polymorphic [A, B] cache.
Or, maybe more interesting, what if we have a megamorphic site but one class more or less dominates? This would unfortunately look like a total bummer case even though it might be salvageable.
If only we had a nice data structure for this…
S6 has this small C++ class called ClassDistribution that the interpreter uses to register what hidden classes it sees during execution profiling. It dispenses with the implicit seen order that a polymorphic cache keeps in its cmp-jcc chain and instead uses two fixed-size (they chose K=4) parallel arrays: bucket_class_ids_ and bucket_counts_.
Every time the interpreter captures a profile, it calls ClassDistribution::Add, which increments the corresponding count associated with that ID. There are a couple of interesting things that this function does:
- It bubbles the hottest class IDs up toward bucket 0 on each call to Add, which keeps the buckets sorted by count and makes summarization easier (more on that later)
- It has an other_count_ to track more information about how megamorphic the call-site is
- It tracks a running difference between the bucketed counts and other_count_ using a field called count_disparity_. If this gets too high, it indicates that the execution patterns have shifted over time and that it might be time to reset the stats
- It keeps a pre_reset_event_count_. This can be used to determine if the current epoch has seen a statistically significant number of events compared to the pre-reset epoch

That is not much additional space and it gets you a totally different slice of the picture than a “normal” IC and bytecode rewriting. I find the bubbling up, the other count, and the running difference especially fun.
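Here is a stripped-down sketch of the idea, modeled on the description above rather than copied from S6; the field names follow the prose, but the reset and disparity bookkeeping is only stubbed out because I am guessing at the exact rules:

```c++
#include <array>
#include <cstdint>
#include <utility>

// Sketch of an S6-style class distribution, K = 4 buckets.
class ClassDistribution {
 public:
  static constexpr int kNumBuckets = 4;

  void Add(int32_t class_id) {
    for (int i = 0; i < kNumBuckets; i++) {
      if (bucket_class_ids_[i] == class_id && bucket_counts_[i] > 0) {
        bucket_counts_[i]++;
        // Bubble the hotter class up so bucket 0 stays the most popular.
        while (i > 0 && bucket_counts_[i] > bucket_counts_[i - 1]) {
          std::swap(bucket_counts_[i], bucket_counts_[i - 1]);
          std::swap(bucket_class_ids_[i], bucket_class_ids_[i - 1]);
          i--;
        }
        return;
      }
      if (bucket_counts_[i] == 0) {  // empty bucket: claim it
        bucket_class_ids_[i] = class_id;
        bucket_counts_[i] = 1;
        return;
      }
    }
    // No bucket left: this event is beyond the K classes we track.
    other_count_++;
  }

  int64_t total() const {
    int64_t t = other_count_;
    for (int64_t c : bucket_counts_) t += c;
    return t;
  }

 private:
  std::array<int32_t, kNumBuckets> bucket_class_ids_ = {};
  std::array<int64_t, kNumBuckets> bucket_counts_ = {};
  int64_t other_count_ = 0;
  // Per the prose, but the exact rules are omitted in this sketch:
  int64_t count_disparity_ = 0;        // drift signal; reset stats if too high
  int64_t pre_reset_event_count_ = 0;  // events seen before the last reset
};
```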
After a while, some bit of policy code decides that it’s time to switch execution modes for a given function and compile. The compiler would like to make use of this profile information. Sure, it can fiddle around with it in its raw state, but the S6 devs found a better API that random compiler passes can consume: the ClassDistributionSummary.
The ClassDistributionSummary is another very small C++ class. It has only three fields: the class IDs from the ClassDistribution (but not their counts), a kind_ field, and a stable_ field. We don’t need their counts because that’s not really the question the optimizer should be asking. The thing the optimizer actually wants to know is “how should I speculate at this PC?” and it can outsource the mechanism for that to the ClassDistributionSummary’s kind (and the information implicit in the ordering of the class IDs, where the hottest class ID is at index 0).
The kind can be one of five options: Empty, Monomorphic, Polymorphic, SkewedMegamorphic, and Megamorphic, each of which implies different things about how to speculate. Empty, Monomorphic, and Polymorphic are reasonably straightforward (did we see 0, 1, or <= K class IDs?) but SkewedMegamorphic is where it gets interesting.
Their heuristic for whether a megamorphic PC is skewed is that the class in bucket 0, the most popular class, accounts for over 75% of the total recorded events. This means that the optimizer still has a shot at doing something interesting at the given PC.
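A hypothetical summarization routine might look like the following; the 75% threshold is their heuristic, but everything else here (names, how the stable flag is handled) is my guess at the shape of the code, not S6’s:

```c++
#include <array>
#include <cstdint>

enum class Kind { kEmpty, kMonomorphic, kPolymorphic, kSkewedMegamorphic, kMegamorphic };

// Hypothetical summary: class IDs (hottest first), a kind, and a stability bit.
struct ClassDistributionSummary {
  std::array<int32_t, 4> class_ids;  // K = 4, hottest class at index 0
  Kind kind;
  bool stable;
};

ClassDistributionSummary Summarize(const std::array<int32_t, 4>& ids,
                                   const std::array<int64_t, 4>& counts,
                                   int64_t other_count, bool stable) {
  ClassDistributionSummary s{ids, Kind::kEmpty, stable};
  int seen = 0;
  int64_t total = other_count;
  for (int i = 0; i < 4; i++) {
    if (counts[i] > 0) seen++;
    total += counts[i];
  }
  if (total == 0) {
    s.kind = Kind::kEmpty;
  } else if (seen == 1 && other_count == 0) {
    s.kind = Kind::kMonomorphic;
  } else if (other_count == 0) {
    s.kind = Kind::kPolymorphic;  // saw between 2 and K classes
  } else if (counts[0] * 4 > total * 3) {
    s.kind = Kind::kSkewedMegamorphic;  // bucket 0 holds over 75% of events
  } else {
    s.kind = Kind::kMegamorphic;
  }
  return s;
}
```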
I wonder why they didn’t also have SkewedPolymorphic. I think that’s because for polymorphic PCs, they inline the entire compare-jump chain eagerly, which puts the check for the most popular ID in the first position. Still, I think there is potentially room to decide to monomorphize a polymorphic call site. There’s some ad-hoc checking for this kind of thing in optimize_calls.cc, for example to specialize a[b] where a is historically either a list or a tuple.
Also, sadly, they did not get around to implementing SkewedMegamorphic before the project shut down, so they only handle the monomorphic and polymorphic cases throughout the optimizer. Ah well.
Some of the time-shift profiling that you can do with a ClassDistribution seems really cool and I had not seen it before. It feels like it could help with the issues brought up in Why Aren’t More Users More Happy With Our VMs?. Maybe. Understanding the behavior of a program through a tiny lens over a small snapshot of time is challenging.
(Kind of a “bits and bobbles” section a la Phil Zucker. I’m trying it out.)
FeedbackVector in V8. See blog post by Benedikt Meurer, which explains how they profile generic instruction operands using a feedback lattice.
Speculation in JavaScriptCore, which continues to be a fantastic resource for fast runtime development. In it, Fil argues that the cost of speculating wrong is so high that you better be darn sure that cond is true in if (!cond) { side_exit(); }
See a blog post by Jan de Mooij and a blog post by Matthew Gaudet on CacheIR in SpiderMonkey (and paper! (PDF))
→ helpful for trial inlining? See warp improvement blog post
Tracing is just different
Basic block versioning
What if we had more context? Info from caller