A bloom filter is a probabilistic data structure that can potentially make SQL queries execute orders of magnitude faster. Today I want to tell you how we use them in Floe, and how we make them produce 2x fewer false positives.
Feel free to skip this section if you know the answer.
A bloom filter is a probabilistic data structure that answers one question: "Is this element definitely not in the set?" It can give false positives (claim an element might be present when it isn't), but never false negatives (it won't miss elements that are actually there). The main benefit is that a well-designed bloom filter can be really fast - a few CPU cycles per lookup. That's faster than a single function call.
The structure: An array of m bits, all initially set to 0.
Insertion: To add an element, we compute its k hash values and set the corresponding k bits to 1.
Lookup: To check if an element exists, we compute the same k hash values and test those bits. If any bit is 0, the element is definitely not in the set; if all are 1, it might be.
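To make that concrete, here is a minimal, textbook-style sketch in C++. The double-hashing scheme and the template parameters are purely illustrative, not how any particular engine (including ours) implements it:

```cpp
#include <bitset>
#include <cstddef>
#include <functional>
#include <string>

// Illustrative bloom filter: M bits, K hash functions.
template <std::size_t M, std::size_t K>
class BloomFilter {
public:
    // Insertion: compute k positions and set those bits to 1.
    void insert(const std::string& key) {
        for (std::size_t i = 0; i < K; ++i) bits_.set(index(key, i));
    }

    // Lookup: if any of the k bits is 0, the key is definitely not in the set;
    // if all are 1, it *might* be (possible false positive).
    bool mightContain(const std::string& key) const {
        for (std::size_t i = 0; i < K; ++i)
            if (!bits_.test(index(key, i))) return false;
        return true;
    }

private:
    std::size_t index(const std::string& key, std::size_t i) const {
        std::size_t h1 = std::hash<std::string>{}(key);
        std::size_t h2 = (h1 * 0x9E3779B97F4A7C15ULL) | 1;  // cheap derived second hash
        return (h1 + i * h2) % M;                            // double hashing: h1 + i*h2
    }

    std::bitset<M> bits_;  // m bits, all initially 0
};
```

Instantiating, say, `BloomFilter<1 << 20, 2>` gives a 1 Mbit filter with two hash functions per key.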

Here we're discussing bloom filters in the context of database engineering. If you're not familiar with how databases join tables - here's a quick primer: a hash join matches rows from two tables. First, it loads the smaller table into a hash table (that's the build side). Then it scans the larger table row by row, looking up each value in the hash table to find matches (that's the probe side). Most of the work happens on the probe side, because the larger table can have billions of rows.
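If that's easier to read as code, here is a bare-bones sketch of the two phases. The row type and table representation are hypothetical, just to show the shape of a hash join:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical row type, just for illustration.
struct Row { int64_t key; std::string payload; };

void hashJoin(const std::vector<Row>& smallTable, const std::vector<Row>& largeTable) {
    // Build side: load the smaller table into a hash table keyed by the join key.
    std::unordered_multimap<int64_t, const Row*> ht;
    for (const Row& r : smallTable) ht.emplace(r.key, &r);

    // Probe side: scan the larger table row by row, looking up each key.
    // This is where most of the work happens - the large table can have billions of rows.
    for (const Row& r : largeTable) {
        auto [first, last] = ht.equal_range(r.key);
        for (auto it = first; it != last; ++it) {
            // Match found: emit the joined row (r, *it->second) downstream.
        }
    }
}
```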
When processing millions of rows, we want to avoid all the extra work we can. Don't decompress data you won't use. Don't probe hash tables for keys that don't exist. Discard rows as soon as you can - it's called being efficient (not lazy!)
We use bloom filters in two critical places:
Let's imagine a join between two tables where only 1% of the 10 billion probe-side rows will find a match. Without filtering, we would have to decompress and probe all of those rows, only to discard 99% of them.
What we do instead:
Build phase: During the build phase we populate the bloom filter with hashes of the build-side keys.
Pushdown: After the build phase is complete, we push the bloom filter, which at this point is read-only, down to the storage engine.
First-pass filtering: The storage engine decompresses only the columns needed for bloom filtering. It checks each value against the bloom filter, and marks values that definitely do not match the build side.
Adaptive behaviour: Here it gets interesting. We keep statistics on how many rows we skip. If we end up discarding almost no rows, we don't bother with first-pass filtering and disable it. But we keep checking decompressed rows so we can re-enable filtering if the stats improve.
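Put together, the first pass plus the adaptive switch might look roughly like this. It's a sketch with made-up names and a made-up threshold, not Floe's actual API:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

struct FilterStats {
    uint64_t checked = 0;
    uint64_t skipped = 0;
    double skipRate() const { return checked ? double(skipped) / double(checked) : 0.0; }
};

// First-pass filtering inside the storage engine: only the join-key column has
// been decompressed; every key is checked against the pushed-down, read-only
// bloom filter, and rows that definitely don't match are marked so the
// remaining columns are never decompressed for them.
std::vector<bool> firstPassFilter(const std::vector<int64_t>& joinKeys,
                                  const std::function<bool(int64_t)>& bloomMightContain,
                                  FilterStats& stats) {
    std::vector<bool> keep(joinKeys.size(), true);
    for (std::size_t i = 0; i < joinKeys.size(); ++i) {
        ++stats.checked;
        if (!bloomMightContain(joinKeys[i])) {
            keep[i] = false;   // definitely not on the build side
            ++stats.skipped;
        }
    }
    return keep;
}

// Adaptive behaviour: if the filter barely discards anything, disable the first
// pass for subsequent chunks, but keep sampling so it can be re-enabled later.
bool filteringStillWorthIt(const FilterStats& stats) {
    constexpr double kMinUsefulSkipRate = 0.05;  // made-up threshold
    return stats.skipRate() >= kMinUsefulSkipRate;
}
```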
Rough estimate:
Without bloom filtering:
With bloom filtering:
That's a huge 9x reduction in scan and I/O!
Why do we need to keep the filtering adaptive? Because sometimes the bloom filter doesn't help:
For hash joins we use a simpler, almost textbook-style bloom filter: insert values into the filter during the build phase, read it during the probe phase before probing the hash buckets.
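As a sketch, reusing the hypothetical row and hash-table shapes from the primer above, the probe loop just gains one cheap check before touching the buckets:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

struct Row { int64_t key; std::string payload; };

// Probe phase with a bloom pre-check: a negative answer from the filter is
// definitive, so the much more expensive hash-bucket probe is skipped.
template <typename Bloom>
void probeWithBloom(const std::vector<Row>& largeTable,
                    const std::unordered_multimap<int64_t, const Row*>& ht,
                    const Bloom& bloom) {
    for (const Row& r : largeTable) {
        if (!bloom.mightContain(r.key)) continue;  // definitely no match: skip the probe
        auto [first, last] = ht.equal_range(r.key);
        for (auto it = first; it != last; ++it) {
            // Real match (or a bloom false positive resolved here): emit the joined row.
        }
    }
}
```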
We landed on a fixed 256 KB bloom filter per join as the sweet spot between size and efficiency. Go bigger and you waste memory and overflow the L2/L3 cache (cache misses hurt). Go smaller and you might as well flip a coin.
Why fixed size? Predictable performance. No dynamic allocation. Compiler can optimize the hell out of it. Lock-free access. The last one is especially critical when we're talking about a concurrent performance-first database engine.
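To show why a fixed size makes lock-free access straightforward, here is one possible shape of such a filter, assuming a single hash per key. This is an illustration, not Floe's actual code: inserts are a single atomic fetch_or on a compile-time-sized array of words, and lookups are plain loads.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

class ConcurrentBloomFilter {
    static constexpr std::size_t kBytes = 256 * 1024;              // fixed 256 KB
    static constexpr std::size_t kWords = kBytes / sizeof(uint64_t);
    static constexpr std::size_t kBits  = kBytes * 8;               // 2,097,152 bits

public:
    ConcurrentBloomFilter() {
        for (auto& w : words_) w.store(0, std::memory_order_relaxed);
    }

    // Single-hash (k = 1) insert: set one bit atomically, no locks needed.
    void insert(uint64_t hash) {
        const std::size_t bit = hash % kBits;
        words_[bit / 64].fetch_or(uint64_t{1} << (bit % 64), std::memory_order_relaxed);
    }

    // Read-only lookup: safe to call concurrently with inserts.
    bool mightContain(uint64_t hash) const {
        const std::size_t bit = hash % kBits;
        return (words_[bit / 64].load(std::memory_order_relaxed) >> (bit % 64)) & 1;
    }

private:
    std::array<std::atomic<uint64_t>, kWords> words_;  // no dynamic allocation
};
```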
All of the above works well only if the bloom filter is actually useful and doesn't lie too often. If it does, it's useless. In our engine we measure bloom filter health with a simple threshold on the number of bits set. Why does that matter? To understand, we need to dive deeper into the theory and look at the false positive rate of a bloom filter.
Why false positives? As we insert more elements (n), more bits get set to 1. Eventually, random elements will have all their k bits set to 1 by pure chance - even though they were never inserted. That's a false positive.
The occupancy problem: As we insert more elements, more bits get set to 1 and the filter gets saturated. For our single-hash (k=1) approach, that means the false positive rate climbs quickly - up to 10% and above - that's way too high!
There's a well-known equation for false positive rate:
(1 - e^(-kn/m))^k
Where:
k = number of hash functions
n = number of elements inserted
m = number of bits in the filter
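To get a feel for the numbers (the element counts here are purely illustrative): our fixed 256 KB filter has m = 2,097,152 bits, and with a single hash function (k = 1) the formula collapses to 1 - e^(-n/m), which is also the expected fraction of bits set to 1. Insert n = 200,000 build-side keys and the false positive rate is about 1 - e^(-0.095) ≈ 9%; push n to one million keys and it climbs to roughly 38%. That's why a simple threshold on the number of set bits works as a health check: with k = 1, the fill ratio of the filter is approximately its false positive rate.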