A couple of weeks ago I came across one of Antonio's PRs, titled "Speed up point translation in the Rope" — now who doesn't stop to take a closer look at a PR with that title?
The description already lives up to the title. It contains benchmark results
telling me that a method on our Rope named point_to_offset
is now up to 70%
faster. 70%! Throughput increased by 250%. (I can't remember but I'm sure I made
one of those impressed whistling sounds when I saw those numbers.)
Then there's the code. I scrolled through the diff and with the fourth move of the mouse wheel I landed on this snippet and stopped:
Okay, okay, okay — "Parallel bit count intermediates", "Branchless select" in a PR that results in a 75% speed up — I'm in. I need to know the whole story. No way this won't be interesting.
So I asked Antonio what's up with the bit twiddling and he not only offered to walk me through the optimizations he made, but also said we should pair and add yet another optimization to make the Rope even faster when handling tabs.
And — lucky me — that's exactly what we did. We recorded the whole pairing session, so you can watch it too, and now I'll share with you everything I learned about bit twiddling optimizations on our Rope so far.
Companion Video: Rope Optimizations, Part 1
The 1.5hr companion video is the full pairing session in which Antonio and Thorsten first walk through these new optimizations on the Rope and then add another one to index tabs.
Watch the video here: https://youtu.be/9Ys9lpOyv08
Speeding up Point translation
First, we let's take a look at how Antonio sped up "point translation" in the Rope and what that even means.
If you've read our post on Rope & SumTree you already know this: our Rope is not a real Rope but a SumTree in a trench coat and looked roughly like this before Antonio made his changes:
It's a B-tree of Chunk
s and a Chunk
is nothing more than a stack-allocated
string with a maximum length of 128 bytes.
And that's already all the background you need — our Rope is a B-tree of 128-byte strings.
The Problem
So what's point translation? Expressed in code, it's roughly equivalent to this:
Point translation means: take an arbitrary offset into a string (represented
as a Rope) and translate it into a Point
— the row and column your cursor
would land on if moved to the given offset.
As you can imagine, this method is a popular one in a text editor. Zed gets offsets from all kinds of sources and has to turn them into lines and columns — hundreds if not thousands of times per second when you're moving around a file and editing.
Now imagine how you would implement that offset_to_point
method.
Conceptually, what you'd have to do is to go through each character in the file, count the newlines and characters you come across, and stop once you are at your offset. Then you know which line you are on.
And that's exactly what our Rope did, too.
If you called rope.offset_to_point(7234)
, the Rope would traverse its
SumTree
to find the Chunk
that contains offset 7234
and then, on that
Chunk
, it would call offset_to_point
again. And that method, on Chunk
,
looked pretty similar to this piece of code:
It's straightforward: set up a counter in the form of a Point
, loop through all the
characters in the 128-byte string, and keep track of newlines as you come across
them.
But that's also the problem, right there: while the Rope
can get us to the
right Chunk in O(log(n))
, we still have to loop through 128 characters and
count the newlines manually — like cavemen.
Now you might say that 128 characters isn't a lot and what harm comes from a little loop like that, but remember: we're talking about a text editor that supports multiple cursors and talks to multiple language servers at once — this loop gets executed so often, it's hot to the touch. We want it to be as fast as possible.
And that's exactly what Antonio achieved.
The Optimization
What Antonio figured out is that instead of looping through 128 characters every
time we need find a newline, we can do it once and remember where they are,
effectively building an index of the Chunk
.
And all we need for such an index is a u128
.
A u128
is Rust's 128-bit unsigned integer type and, hey, 128 — that's exactly
how many bytes are in a Chunk
.
A single u128
is enough to remember whether a given byte in a
Chunk
has a certain property or not. For example, we could set a bit in a
u128
at a given position to 1 if the corresponding byte in the Chunk
is a
newline character. Or we could flip bits to remember how many bytes a character
takes up — which can be more than one with UTF-8 and emojis.
That's what that Antonio's PR did. Our Chunk
now looks like this:
But how does that work? And what does it buy us? Is that really better than looping through the characters?
To illustrate the idea, let's focus on newlines only and use a u8
instead of a
u128
— less bits to show and count, same principle.
Indexing newlines in a u8
Say we have the following text:
We can use the following code to index the positions of the newline characters
in a u8
:
When we run this, we end up with a newlines
bitmask that looks like this:
newlines
now tells us that the 3rd and 6th characters in text
are newlines.
Sweet. Now with newlines
in hand, back to our original problem: translating an
offset into a Point
.
How does newlines
help with that?
Say our offset
is 4
and say our text is "ab\ncd\nef"
. We want to know what
line and column the character d
is on.
Here's how we can use newlines
to find out:
We first create a mask
that has all the bits set to 1
up until the offset
we're interested in:
Then we take that mask
and bitwise-and it with the newlines
we passed in:
That leaves only the newline-bits up to the offset we're interested in being
set. Now, to find out which line our offset is on, all we have to do is to count
the number of remaining bits set to 1
. That's this line:
That's the row the offset is on — 1
in our case. It's zero-indexed, meaning
that when talking to a human and not a computer, we'd say that the character d
is on line 2.
The next step is to figure out the column. To do that, we need to calculate the
distance between the offset
we're interested in and the last newline
character, because that is what a column is: the number of characters from the
last newline.
This line first gives us the position of the newline closest to our offset:
In our case, newline_ix
is 3
:
Then we plug that into the next line:
Which gives us the column 1
.
The result: offset 4
in ab\ncd\nef
translates to Point { row: 1, column: 1 }
— second line, second column.
The Beauty of It
Take a look at these two lines from above again:
count_ones()
and leading_zeros()
— sounds an awful lot like there might be
some looping going on to count those ones and zeros, right?
But no, that's the beautiful part! count_ones
and leading_zeros
are both
implemented with a single CPU instructions. No loop necessary. Turns out CPUs
are pretty good with zeros and ones.
And it's not just that we reduced the number of instructions, we also have less branch instructions now and branchless programming can often lead to tremendous speed-ups in hot loops.
If we put both versions of offset_to_point
— one with a loop and one with the
bitmask — into micro-benchmark and use an actual u128
instead of
an u8
to make the results more pronounced, we can see how much faster the
loop-less, branch-less version is:
Running benches/benchmark.rs (target/release/deps/benchmark-21888b29446a33c0)
offset_to_point_u128/loop_version
time: [56.914 ns 57.001 ns 57.096 ns]
offset_to_point_u128/mask_version
time: [1.0478 ns 1.0501 ns 1.0529 ns]
57ns
with the loop and 1ns
with the bitmask — 57x faster. Impressed
whistling sound.
Of course, all the disclaimers about micro-benchmarks apply and in our production code the results aren't that drastic, but very, very good nonetheless: the 70% speed-up I mentioned at the start is real.
Indexing Tabs
Fascinated and motivated by all of this, Antonio and I then set out to add the same index for tabs.
"Tabs?", you might say, "I don't use tabs." Yes, you don't, but Zed doesn't know that and still has to check whether you have tabs at the start of your lines in order to display them correctly.
And tabs are tricky. You can't display tabs like other characters. Tabs are... dynamic, for the lack of a better word.
How the string \t\tmy function
is displayed depends on what tab size you have
configured: if the tab size is
four, then \t\t
should be displayed as eight spaces. If it's two, it's four spaces.
"Poor text editor developers", you might be thinking, "they have to multiply numbers." Appreciate the compassion, but, listen, that's not all.
Consider this piece of text:
With a tab size of 4 and hard tabs enabled, it should be displayed like this:
That's right — the first tab in the first line only takes up two spaces, the others all take up four spaces.
You see: tabs are tricky and, as it turns out, also costly.
In a performance profile, Antonio saw that we spend a lot of time figuring out where and how many tabs there are in a given file. A lot of time.
So what we did in our pairing session was to add an index for tabs that works just like the index for newlines:
Does it make Zed faster?
We'll see. This time we only added the index, but we haven't actually changed the code the higher layers to use it yet. We'll do that in the next Zed Decoded episode.
Until then, watch the full pairing session in the companion video to see how all the nuts and bolts are put into place.