Uber has rewritten its ledger systems five times in the last ten years. And at least one of those rewrites, if not all, could have been avoided.
That’s because each generation of money software at Uber was driven by bad incentives. Each started with a brand new proposal, approved as the definitive solution; in time, a fatal flaw surfaced; and finally, a new proposal came along to replace it.
Every rewrite was someone’s promotion project.
At least one of them could’ve been avoided: the one where Uber moved to DynamoDB. In 2017, Uber launched their new payment platform on it, and the critical factor that everyone involved seemed to miss was that DynamoDB is a consumption-priced database.
You pay for every read, and every write.
With each trip generating multiple ledger entries, and Uber as a whole processing 15 million trips per day, it didn’t matter that DynamoDB delivered high throughput at global scale. The proverbial bean counter should’ve stopped this madness from happening.
Within two years, the cost became prohibitive:
At Uber’s scale, DynamoDB became expensive. Hence, we started keeping only 12 weeks of data (i.e., hot data) in DynamoDB and started using Uber’s blobstore, TerraBlob, for older data (i.e., cold data). TerraBlob is similar to AWS S3. For a long-term solution, we wanted to use LSG.
— Migrating a Trillion Entries of Uber’s Ledger Data from DynamoDB to LedgerStore
A redesign that gets replaced 2 years later is a catastrophe.
And yet, history remembers Uber’s ledger on top of DynamoDB as a masterpiece. As late as 2024, ByteByteGo has an article praising it.
And that’s what concerns me. Uber’s design was a failure, but nobody seems to remember it that way.
That ends today.
I’m Alvaro Duran, and this is The Payments Engineer Playbook, the only newsletter on Earth tailor-made for engineers of money software. Every week, more than 2,000 subscribers from companies like Stripe, Coinbase and Modern Treasury get a deep dive into how to build software that moves money around. Not to pass interviews, but to do their job exceptionally well.
When money is on the line, stakes are sky high and the margin for error is razor thin.
In The Payments Engineer Playbook, we investigate the technology that transfers money. All to help you become a smarter, more skillful and more successful payments engineer. And we do that by cutting off one sliver of it and extracting insights from it.
Here’s what you can expect in today’s article:
Why DynamoDB works for payments but breaks when you use it as a ledger
The napkin math that would have saved Uber millions of dollars
And one shocking conclusion from all of this
Enough intro, let’s dive in.
But first: is DynamoDB a bad choice for financial software?
Not necessarily. I’ve already covered DynamoDB as a potential data store for payment systems, and it has plenty of features that make it worthwhile: zero-downtime migrations, low latency for a global audience, and built-in replication and failover.
If you’re accepting payments at scale around the globe, DynamoDB is a great choice.
It is because of what DynamoDB gives up: it guarantees strong consistency within a partition, and within a Region, but it doesn’t enforce full linearizability across partitions.
PostgreSQL can; DynamoDB cannot.
And for payments, that’s a trade-off worth making, because payments are independent of each other. You can interleave the authentication of one with the capture of another. There’s no need to maintain full linearizability across your data; causal consistency is enough. DynamoDB trades away the linearizability you don’t actually need for all those nice features I mentioned earlier, which means that for large enterprises serving customers all over the world at high volume and frequency, DynamoDB is better than PostgreSQL.
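To make that concrete, here’s a minimal sketch with boto3, using a made-up table and key. Strong consistency is something you opt into read by read, and it only covers the single item, and therefore the single partition, you’re asking for.

```python
import boto3

dynamodb = boto3.resource("dynamodb")
payments = dynamodb.Table("payments")  # hypothetical table name

# Strongly consistent read of a single payment. DynamoDB can honor
# this because the item lives on one partition; there is no flag
# that extends the same guarantee across the whole table.
response = payments.get_item(
    Key={"payment_id": "pay_123"},  # hypothetical key
    ConsistentRead=True,
)
payment = response.get("Item")
```

That missing flag is the linearizability DynamoDB gives up.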
But a ledger isn’t a payments system.
A ledger cannot simply say “hey, this account and that account can be dealt with independently”. The scope of a ledger system is The World; a data store that can’t enforce full linearizability isn’t going to cut it, no matter how good its throughput and latency are.
In other words: DynamoDB works well in payments because payments can give up global consistency for better availability. But ledgers can’t give up global consistency, even if that means they get worse availability.
DynamoDB capacity pricing is based on two main models: Provisioned and On-demand. You can buy reads and writes in bulk, pay for them as you go, or mix the two.
And then, there’s storage and the add-on features. For most workloads, throughput is the real deal, and storage is just the cream on top of the invoice. But for data-heavy applications, such as ledgers, storage can become the dominant cost. That’s why reserving capacity is important, but it demands some forecasting ability, or at least an educated guess about how much you’re going to need in advance, because you get a discount of more than 50 percent if you do it right.
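Here’s a rough sketch of how the two models compare for a steady write load. The unit prices are approximate list prices and the load is invented, so treat this as an illustration, not a quote; check the AWS pricing page before planning anything real.

```python
# Illustrative comparison of DynamoDB write pricing models.
# Unit prices are approximate, region-dependent, and may have changed;
# verify against the current AWS pricing page.

WRITES_PER_SECOND = 10_000        # hypothetical steady load, items <= 1KB
ON_DEMAND_PER_MILLION = 1.25      # $ per million on-demand writes
PROVISIONED_WCU_HOUR = 0.00065    # $ per provisioned WCU-hour
RESERVED_DISCOUNT = 0.5           # "more than 50 percent", roughly

HOURS_PER_MONTH = 24 * 30
writes_per_month = WRITES_PER_SECOND * 3600 * HOURS_PER_MONTH

on_demand = writes_per_month / 1e6 * ON_DEMAND_PER_MILLION
# 1 WCU sustains 1 write per second for items up to 1KB.
provisioned = WRITES_PER_SECOND * PROVISIONED_WCU_HOUR * HOURS_PER_MONTH
reserved = provisioned * (1 - RESERVED_DISCOUNT)

print(f"on-demand:   ${on_demand:>10,.0f} / month")
print(f"provisioned: ${provisioned:>10,.0f} / month")
print(f"reserved:    ${reserved:>10,.0f} / month")
```

The exact figures don’t matter; what matters is that the cheapest tiers only exist if you can commit to capacity in advance, which is exactly the forecasting problem described above.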
If you use DynamoDB at scale, you need to do some napkin math.
In 2017, Uber was doing around 11 million trips a day. Assuming 10 entries per trip and 5 WCUs per entry¹, that’s 550 million writes per day, and at $1.25 per million writes, that’s $687 per day.
$687 per day doesn’t sound like a lot. But that’s $250K a year, just in writes.
With 3x annual growth, the math is unsustainable: by year 3, we’re talking $2.25 million a year. I don’t have visibility into reads, indexes and global tables, but at Uber’s scale, the read side likely costs as much as the write side.
Which means that Uber was burning 5 million dollars a year on a freaking ledger.
Based on Uber’s data, by 2020 they had accumulated 1.2 petabytes of data. That, at $0.25 per gigabyte per month, is $300K per month. And assuming the same 3x annual growth from 2017 to 2020, ending at 1.2 petabytes, that’s a cumulative cost of $3.5 million.
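If you want to check the arithmetic, here’s the napkin math as a script. Every input is an assumption from this article (trip volume, entries per trip, WCUs per entry, list prices), not a figure Uber has published.

```python
# The napkin math from above, as a script. All inputs are this
# article's assumptions, not official Uber figures.

TRIPS_PER_DAY = 11_000_000         # Uber, circa 2017
ENTRIES_PER_TRIP = 10              # assumed ledger entries per trip
WCUS_PER_ENTRY = 5                 # assumed write units per entry
PRICE_PER_MILLION_WRITES = 1.25    # $, on-demand
ANNUAL_GROWTH = 3                  # 3x volume per year

writes_per_day = TRIPS_PER_DAY * ENTRIES_PER_TRIP * WCUS_PER_ENTRY
write_cost_per_day = writes_per_day / 1e6 * PRICE_PER_MILLION_WRITES
write_cost_year_1 = write_cost_per_day * 365
write_cost_year_3 = write_cost_year_1 * ANNUAL_GROWTH ** 2

# Assume the read side costs roughly as much as the write side.
throughput_cost_year_3 = write_cost_year_3 * 2

# Storage: 1.2 PB accumulated by 2020, at $0.25 per GB-month.
STORAGE_GB = 1_200_000
PRICE_PER_GB_MONTH = 0.25
storage_cost_per_month = STORAGE_GB * PRICE_PER_GB_MONTH

print(f"writes per day:          {writes_per_day / 1e6:,.0f} million")
print(f"write cost per day:      ${write_cost_per_day:,.2f}")
print(f"write cost, year 1:      ${write_cost_year_1:,.0f}")
print(f"write cost, year 3:      ${write_cost_year_3:,.0f}")
print(f"reads + writes, year 3:  ${throughput_cost_year_3:,.0f}")
print(f"storage per month:       ${storage_cost_per_month:,.0f}")
```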
No wonder they switched to keeping only the last 12 weeks of data in DynamoDB and moved the rest to TerraBlob, their in-house blobstore.
Add the writes to the storage, and you’re looking at an 8 million dollar bill for a ledger that didn’t need to exist on DynamoDB in the first place.
What do you do with an 8 million dollar bill? You turn it into a case study.
Since 2020, Uber has migrated away from DynamoDB to their own internal ledger, called LSG (Ledger Store...Gateway?²), built on top of their own internal distributed database, Docstore:
Docstore is a general-purpose multi-model database that provides a strict serializability consistency model on a partition level and can scale horizontally to serve high volume workloads. Features such as Transaction, Materialized View, Associations, and Change Data Capture combined with modeling flexibility and rich query support, significantly improve developer productivity, and reduce the time to market for new applications at Uber.
Why not use an open-source alternative? Because Uber builds in-house. That’s what Uber does.
You could argue that DocStore provided the features they needed in a way no other alternative could. But you would be wrong!
Our homegrown Docstore was a perfect match for our database requirements, except for Change Data Capture (CDC) a.k.a., streaming functionality. [...] We decided to build a streaming framework for Docstore (project name “Flux”) and used that for LedgerStore’s Manifest generation.
— How Uber Migrated Financial Data from DynamoDB to Docstore
So let me get this straight: DynamoDB was a bad choice because it was expensive, which is something you could have figured out in advance. You then decided to move everything to an internal data store that had been built for something else³, and that was already available when you decided to build on top of DynamoDB. And that internal data store wasn’t good enough on its own, so you had to build a streaming framework to complete the migration.
And nobody got fired for this?
But nobody was optimizing for cost. They were optimizing for their next promotion. Each rewrite was a new proposal, a new design doc, a new system to put on a resume. The incentive was never to pick the boring, correct choice — it was to pick the complex, impressive one.
This isn’t Metaverse-levels of disaster, but relative to Uber’s scale, it gets pretty close.
What bothers me the most is this: by 2019, it was painfully obvious that Uber had made a terrible decision when they built LedgerStore on top of DynamoDB.
And yet, when AWS invited Uber to present at re:Invent 2019, they said yes.
I’ve written about Uber’s testing practices before, and praised them for it — the article hit the front of Hacker News.
But let’s call a spade a spade: when you actively disguise an atrocious decision as a case study for a database technology, you’re no less fraudulent than one of those hedge fund managers talking their book on TV.
It is the technological equivalent of an arsonist writing a fire safety manual.
On a second level, there are the publications that regurgitated this case study without looking at the full picture: ByteByteGo has an article on LedgerStore praising “The cost savings from this migration” with yearly savings “exceeding $6 million due to reduced spend on DynamoDB”.
I can’t possibly comment on this.
When you’re tasked with building a system of any kind, not just ledgers, the technology is never enough. If you’re building a system that makes the economics of your company impossible, you’re better off not building it.
Focusing solely on the technical requirements, and not seeing the costs, is a disservice to the business that employs you.
Uber didn’t make a ledger mistake. It set the wrong incentives.
And it paid millions of dollars for it.
This was The Payments Engineer Playbook. I’ll see you next week.
Feel free to share this article with a system designer about to make a costly mistake.