
Replacing a 3 GB SQLite db with a 10 MB FST (finite state transducer) binary

Note for numberphiles: all numbers have been rounded to their first significant digit, because I’m a fan of Rob Eastaway’s “zequals” method of getting to the point when it comes to estimation. It’s much more valuable to walk away with the heuristic “some dude got a 300x memory reduction by swapping out a database he hacked together for a tiny, static, specialized data structure that does exactly what he needs it to and no more.”

I found myself with an increasingly rare opportunity to work this weekend on Taskusanakirja, also often called tsk, a Finnish-English dictionary with incremental search-as-you-type.1 Fundamentally this problem reduces to prefix search, and the canonical solution for prefix search with autocomplete is a trie.

And this worked wonderfully for the first implementation of tsk, which was in Go (and which I have written about elsewhere and elsewhere and elsewhere), with a few basic optimizations. To prevent matching on some single-digit percentage of the mid-six-figures number of words we were baking into the binary (it’s been a design goal from the start to ship the entire program as one .exe, one .app, or one statically linked binary), we capped results at the first 50 or 100 matches and memoized all of the 1-, 2-, and 3-character prefixes, after which I never noticed a perceptible delay in the program through a year of heavy personal use. With optimizations like those, we could just about squeeze the trie into ~60 MB of space.
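None of tsk’s actual Go source appears in this post, so here is a minimal Rust sketch of the capped-trie idea described above, over a toy word list. The cap (the real program used something like 50 or 100) bounds the depth-first collection; in the real program, results for all 1-, 2-, and 3-character prefixes would additionally be memoized up front so the trie walk rarely runs at all.

```rust
use std::collections::HashMap;

#[derive(Default)]
struct Node {
    children: HashMap<char, Node>,
    word_id: Option<usize>, // index into the word list, set at word ends
}

#[derive(Default)]
struct Trie {
    root: Node,
}

impl Trie {
    fn insert(&mut self, word: &str, id: usize) {
        let mut node = &mut self.root;
        for ch in word.chars() {
            node = node.children.entry(ch).or_default();
        }
        node.word_id = Some(id);
    }

    /// Collect at most `cap` word ids under `prefix`.
    fn search(&self, prefix: &str, cap: usize) -> Vec<usize> {
        let mut node = &self.root;
        for ch in prefix.chars() {
            match node.children.get(&ch) {
                Some(next) => node = next,
                None => return Vec::new(),
            }
        }
        let mut out = Vec::new();
        collect(node, cap, &mut out);
        out
    }
}

// Depth-first walk that stops as soon as the cap is hit.
fn collect(node: &Node, cap: usize, out: &mut Vec<usize>) {
    if out.len() == cap {
        return;
    }
    if let Some(id) = node.word_id {
        out.push(id);
    }
    let mut keys: Vec<&char> = node.children.keys().collect();
    keys.sort(); // deterministic order for the demo
    for k in keys {
        collect(&node.children[k], cap, out);
        if out.len() == cap {
            return;
        }
    }
}

fn main() {
    let words = ["talo", "talossa", "talossamme", "tanssi"];
    let mut trie = Trie::default();
    for (i, w) in words.iter().enumerate() {
        trie.insert(w, i);
    }
    let hits = trie.search("talo", 50);
    let matched: Vec<&str> = hits.iter().map(|&i| words[i]).collect();
    println!("{:?}", matched); // ["talo", "talossa", "talossamme"]
}
```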

But Finnish is a heavily agglutinative language. It’s not impossible for a single base word to have over one hundred possible endings, once all combinations are considered. And the combinations are not regular!2 The extremely regularized orthography of Finnish also means the page never fibs about what speakers actually say, so base words stretch and shift and transmute in acoustically pleasing ways as you layer on endings. That makes perfect sense after you’ve spent a couple of years immersed in the language; but when you’re a beginner and you see a sentence like “Opiskelijassammekin on leijonan sydän”, there is one word you are disproportionately likely to get stuck on. Part of what this tool attempts to do is help the student figure out how to cleave the word at the right edges, by embedding all of that information as well.

The trie fell down at that point. I could keep ~400,000 items in a trie sipping ~50 MB of RAM; the same trick does not scale to 40-60 million. Not if you want it all to run on the old laptop of a college kid from Jakarta. Frustrated and running out of time, I threw up my hands and said “We’ll ship the inflections in a separate SQLite database with FTS (Full Text Search) and let them search on that if they’re so desperate,” which worked — still without perceptible delay — but it required a one-time 3-gigabyte download. Not ideal!

That was where the story stopped about 9 months ago. This weekend, with 9 more months of intense full-time software engineering under my belt, I boldly asked: had I considered rewriting it in Rust?3

It turns out there is a very, very smart guy named BurntSushi, aka Andrew Gallant, famous for ripgrep, a really, really fast grep — a tool so ubiquitously useful that I put it in my “Holy Trinity” of modern shell commands years ago — who faced a similar problem at some point in the past and wrote a post called Index 1,600,000,000 Keys with Automata and Rust. (Warning: long, extremely interesting.) The opening spoils it:

It turns out that finite state machines are useful for things other than expressing computation. Finite state machines can also be used to compactly represent ordered sets or maps of strings that can be [prefix, fuzzy, suffix] searched very quickly.

Well, I thought, this seems promising. Let’s write a minimal Rust program to strip the data out of that 3 GB database and compact it down into one of these FST thingies.4 I mean, it was always obvious the database was a hack, but it was the best hack I could manage with the time and energy I had. How small could we get it?

Ten _mega_bytes. A 300x reduction in space. Even among users of the fst crate, this particular application — mapping conjugations and declensions of a highly agglutinative language back to their source definitions — is unusually well suited to the data structure. Unlike tries, FSTs compress both prefixes and suffixes, and in a language like Finnish there is a small handful of popular suffixes repeated extremely often across the dictionary corpus. And the data is static at runtime, which sidesteps fst’s greatest weakness: an FST must be built from pre-sorted keys and cannot be cheaply updated once built.

I do wish to point out, of course, that the whole reason it was possible to experiment cheaply and stumble into this serendipity was that 9 months ago, faced with the choice to either do the bad easy thing or do nothing, I chose to do the bad easy thing.5 The SQLite database worked! I understood how it worked behind the scenes, with its B-trees and its Full Text Search extension. I think I even used that same FTS extension to power certain lesser-used features that are not in the alphas of tsk v2.0.0 for the time being, and that are likely to be dropped entirely if keeping them means compromising this now-salivatory memory footprint.

Because the Pro version of v2 is shaping up to be about 20 megabytes, all batteries included, a third of what the free version of v1 ever was. We’ll see what makes it past the cutting room in time.