A bloom filter is a probabilistic data structure that can potentially make SQL queries execute orders of magnitude faster. Today I want to tell you how we use them in Floe, and how we make them produce 2x fewer false positives.
Feel free to skip this section if you know the answer.
A bloom filter is a probabilistic data structure that answers one question: "Is this element definitely not in the set?" It can give false positives (claim an element might be present when it isn't), but never false negatives (it won't miss elements that are actually there). The main benefit is that a well-designed bloom filter can be really fast - a few CPU cycles per lookup. That's faster than a single function call.
The structure: An array of m bits, all initially set to 0.
Insertion: To add an element, we compute its k hash values and set the corresponding k bits to 1.
Lookup: To check if an element exists, we compute the same k hash values and test those bits. If any bit is 0, the element is definitely not in the set; if all are 1, it might be.
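To make that concrete, here is a minimal, textbook-style sketch in C++. The double-hashing scheme and the template parameters are purely illustrative, not how any particular engine (including ours) implements it:

```cpp
#include <bitset>
#include <cstddef>
#include <functional>
#include <string>

// Illustrative bloom filter: M bits, K hash functions.
template <std::size_t M, std::size_t K>
class BloomFilter {
public:
    // Insertion: compute k positions and set those bits to 1.
    void insert(const std::string& key) {
        for (std::size_t i = 0; i < K; ++i) bits_.set(index(key, i));
    }

    // Lookup: if any of the k bits is 0, the key is definitely not in the set;
    // if all are 1, it *might* be (possible false positive).
    bool mightContain(const std::string& key) const {
        for (std::size_t i = 0; i < K; ++i)
            if (!bits_.test(index(key, i))) return false;
        return true;
    }

private:
    std::size_t index(const std::string& key, std::size_t i) const {
        std::size_t h1 = std::hash<std::string>{}(key);
        std::size_t h2 = (h1 * 0x9E3779B97F4A7C15ULL) | 1;  // cheap derived second hash
        return (h1 + i * h2) % M;                            // double hashing: h1 + i*h2
    }

    std::bitset<M> bits_;  // m bits, all initially 0
};
```

Instantiating, say, `BloomFilter<1 << 20, 2>` gives a 1 Mbit filter with two hash functions per key.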

Here we're discussing bloom filters in the context of database engineering. If you're not familiar with how databases join tables - here's a quick primer: a hash join matches rows from two tables. First, it loads the smaller table into a hash table (that's the build side). Then it scans the larger table row by row, looking up each value in the hash table to find matches (that's the probe side). Most of the work happens on the probe side, because the larger table can have billions of rows.
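If that's easier to read as code, here is a bare-bones sketch of the two phases. The row type and table representation are hypothetical, just to show the shape of a hash join:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical row type, just for illustration.
struct Row { int64_t key; std::string payload; };

void hashJoin(const std::vector<Row>& smallTable, const std::vector<Row>& largeTable) {
    // Build side: load the smaller table into a hash table keyed by the join key.
    std::unordered_multimap<int64_t, const Row*> ht;
    for (const Row& r : smallTable) ht.emplace(r.key, &r);

    // Probe side: scan the larger table row by row, looking up each key.
    // This is where most of the work happens - the large table can have billions of rows.
    for (const Row& r : largeTable) {
        auto [first, last] = ht.equal_range(r.key);
        for (auto it = first; it != last; ++it) {
            // Match found: emit the joined row (r, *it->second) downstream.
        }
    }
}
```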
When processing millions of rows, we want to avoid all the extra work we can. Don't decompress data you won't use. Don't probe hash tables for keys that don't exist. Discard rows as soon as you can - it's called being efficient (not lazy!)
We use bloom filters in two critical places:
Let's imagine a join between two tables where only 1% of the 10 billion probe-side rows will find a match. Without filtering, we would have to decompress and probe all of those rows, only to discard 99% of them.
What we do instead:
Build phase: During the build phase we populate the bloom filter with hashes of the build-side keys.
Pushdown: After the build phase is complete, we push the bloom filter, which at this point is read-only, down to the storage engine.
First-pass filtering: The storage engine decompresses only the columns needed for bloom filtering. It checks each value against the bloom filter, and marks values that definitely do not match the build side.
Adaptive behaviour: Here it gets interesting. We keep statistics on how many rows we skip. If we end up discarding almost no rows, we don't bother with first-pass filtering and disable it. But we keep checking decompressed rows so we can re-enable filtering if the stats improve.
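Put together, the first pass plus the adaptive switch might look roughly like this. It's a sketch with made-up names and a made-up threshold, not Floe's actual API:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

struct FilterStats {
    uint64_t checked = 0;
    uint64_t skipped = 0;
    double skipRate() const { return checked ? double(skipped) / double(checked) : 0.0; }
};

// First-pass filtering inside the storage engine: only the join-key column has
// been decompressed; every key is checked against the pushed-down, read-only
// bloom filter, and rows that definitely don't match are marked so the
// remaining columns are never decompressed for them.
std::vector<bool> firstPassFilter(const std::vector<int64_t>& joinKeys,
                                  const std::function<bool(int64_t)>& bloomMightContain,
                                  FilterStats& stats) {
    std::vector<bool> keep(joinKeys.size(), true);
    for (std::size_t i = 0; i < joinKeys.size(); ++i) {
        ++stats.checked;
        if (!bloomMightContain(joinKeys[i])) {
            keep[i] = false;   // definitely not on the build side
            ++stats.skipped;
        }
    }
    return keep;
}

// Adaptive behaviour: if the filter barely discards anything, disable the first
// pass for subsequent chunks, but keep sampling so it can be re-enabled later.
bool filteringStillWorthIt(const FilterStats& stats) {
    constexpr double kMinUsefulSkipRate = 0.05;  // made-up threshold
    return stats.skipRate() >= kMinUsefulSkipRate;
}
```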
Rough estimate:
Without bloom filtering:
With bloom filtering:
That's a huge 9x reduction in scan and I/O!
Why do we need to keep the filtering adaptive? Because sometimes the bloom filter doesn't help:
For hash joins we use a simpler, almost textbook-style bloom filter: insert values into the filter during the build phase, read it during the probe phase before probing the hash buckets.
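As a sketch, reusing the hypothetical row and hash-table shapes from the primer above, the probe loop just gains one cheap check before touching the buckets:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

struct Row { int64_t key; std::string payload; };

// Probe phase with a bloom pre-check: a negative answer from the filter is
// definitive, so the much more expensive hash-bucket probe is skipped.
template <typename Bloom>
void probeWithBloom(const std::vector<Row>& largeTable,
                    const std::unordered_multimap<int64_t, const Row*>& ht,
                    const Bloom& bloom) {
    for (const Row& r : largeTable) {
        if (!bloom.mightContain(r.key)) continue;  // definitely no match: skip the probe
        auto [first, last] = ht.equal_range(r.key);
        for (auto it = first; it != last; ++it) {
            // Real match (or a bloom false positive resolved here): emit the joined row.
        }
    }
}
```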
We landed on a fixed 256 KB bloom filter per join as the sweet spot between size and efficiency. Go bigger and you waste memory and overflow the L2/L3 cache (cache misses hurt). Go smaller and you might as well flip a coin.
Why fixed size? Predictable performance. No dynamic allocation. Compiler can optimize the hell out of it. Lock-free access. The last one is especially critical when we're talking about a concurrent performance-first database engine.
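To show why a fixed size makes lock-free access straightforward, here is one possible shape of such a filter, assuming a single hash per key. This is an illustration, not Floe's actual code: inserts are a single atomic fetch_or on a compile-time-sized array of words, and lookups are plain loads.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

class ConcurrentBloomFilter {
    static constexpr std::size_t kBytes = 256 * 1024;              // fixed 256 KB
    static constexpr std::size_t kWords = kBytes / sizeof(uint64_t);
    static constexpr std::size_t kBits  = kBytes * 8;               // 2,097,152 bits

public:
    ConcurrentBloomFilter() {
        for (auto& w : words_) w.store(0, std::memory_order_relaxed);
    }

    // Single-hash (k = 1) insert: set one bit atomically, no locks needed.
    void insert(uint64_t hash) {
        const std::size_t bit = hash % kBits;
        words_[bit / 64].fetch_or(uint64_t{1} << (bit % 64), std::memory_order_relaxed);
    }

    // Read-only lookup: safe to call concurrently with inserts.
    bool mightContain(uint64_t hash) const {
        const std::size_t bit = hash % kBits;
        return (words_[bit / 64].load(std::memory_order_relaxed) >> (bit % 64)) & 1;
    }

private:
    std::array<std::atomic<uint64_t>, kWords> words_;  // no dynamic allocation
};
```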
All of the above works well only if the bloom filter is actually useful and doesn't lie too often. If it does, it's useless. In our engine we measure bloom filter health with a simple threshold on the number of bits set. Why does that matter? To understand, we need to dive deeper into the theory and look at the false positive rate of a bloom filter.
Why false positives? As we insert more elements (n), more bits get set to 1. Eventually, random elements will have all their k bits set to 1 by pure chance - even though they were never inserted. That's a false positive.
The occupancy problem: As we insert more elements, more bits get set to 1 and the filter gets saturated. For our single-hash (k=1) approach, that means the false positive rate climbs quickly - up to 10% and above - that's way too high!
There's a well-known equation for false positive rate:
(1 - e^(-kn/m))^k
Where:
k = number of hash functions
n = number of elements inserted
m = number of bits in the filter
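To get a feel for the numbers (the element counts here are purely illustrative): our fixed 256 KB filter has m = 2,097,152 bits, and with a single hash function (k = 1) the formula collapses to 1 - e^(-n/m), which is also the expected fraction of bits set to 1. Insert n = 200,000 build-side keys and the false positive rate is about 1 - e^(-0.095) ≈ 9%; push n to one million keys and it climbs to roughly 38%. That's why a simple threshold on the number of set bits works as a health check: with k = 1, the fill ratio of the filter is approximately its false positive rate.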