Much of the developer world is familiar with the AWS outage in us-east-1 on October 20th, caused by a race condition bug inside a DNS management service. The backlog of events we needed to process from that outage stretched our system to its limits, so we decided to increase our headroom for event-handling throughput. When we attempted that infrastructure upgrade on October 23rd, we ran into yet another race condition bug, this time in Aurora RDS. This is the story of how we figured out it was an AWS bug (later confirmed by AWS) and what we learned.
The Hightouch Events product enables organizations to gather and centralize user behavioral data such as page views, clicks, and purchases. Customers can set up syncs to load events into a cloud data warehouse for analytics or stream them directly to marketing, operational, and analytics tools to support real-time personalization use cases.
Here is the portion of Hightouch’s architecture dedicated to our events system:

Hightouch events system architecture
Our system scales on three levers: Kubernetes clusters that contain event collectors and batch workers, Kafka for event processing, and Postgres as our virtual queue metadata store.
When our pagers went off during the AWS outage on the 20th, we observed:
Kafka’s durability meant that no events were dropped once they were accepted by the collectors, but there was a massive backlog to process. Syncs with consistently high traffic, or with enrichments that needed to call slower third-party services, took longer to catch up and tested the limits of our (small) Postgres instance’s ability to act as a queue for the batch metadata.
As an aside, at Hightouch we start with Postgres where we can. Postgres queues serve our non-events architecture well at ~1M syncs/day and, for events, have scaled to 500K events per second at ~1s end-to-end latency on a small Aurora instance.
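The pattern behind that queue is plain Postgres. As a rough illustration (the table, columns, and connection string below are made up for this sketch, not our actual schema), workers can claim batch-metadata rows with FOR UPDATE SKIP LOCKED so concurrent workers never block on the same row:

```python
# Minimal sketch of a Postgres-backed work queue using FOR UPDATE SKIP LOCKED.
# Table/column names and the connection string are illustrative placeholders.
import psycopg2

CLAIM_SQL = """
UPDATE batch_jobs
SET status = 'processing', claimed_at = now()
WHERE id IN (
    SELECT id
    FROM batch_jobs
    WHERE status = 'pending'
    ORDER BY created_at
    LIMIT %s
    FOR UPDATE SKIP LOCKED  -- rows locked by other workers are skipped, not waited on
)
RETURNING id, payload;
"""

def claim_batches(conn, limit=100):
    """Atomically claim up to `limit` pending batches for this worker."""
    with conn, conn.cursor() as cur:  # `with conn` commits on success, rolls back on error
        cur.execute(CLAIM_SQL, (limit,))
        return cur.fetchall()

if __name__ == "__main__":
    conn = psycopg2.connect("postgresql://user:pass@cluster-endpoint:5432/events")
    for batch_id, payload in claim_batches(conn):
        print("processing", batch_id)
```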
After observing the events of the 20th, we wanted to upsize the database to give us more headroom. Given that Aurora supports fast failovers for scaling up instances, we decided to proceed with the upgrade on October 23rd without a scheduled maintenance window.
The central datastore for real-time streaming and warehouse delivery of customer events is Amazon Aurora PostgreSQL.
Aurora's architecture differs from traditional PostgreSQL in a crucial way: it separates compute from storage. An Aurora cluster consists of a single writer instance, one or more reader instances, and a shared, distributed storage volume replicated across multiple Availability Zones.
This architecture enables fast failovers and efficient read scaling, but as we'd discover, it also introduces unique failure modes.
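To see that role split from the API side, a short boto3 sketch like the one below (the cluster identifier is a placeholder) lists each cluster member and whether it is currently the writer:

```python
# Inspect an Aurora cluster's members and which one is currently the writer.
# The cluster identifier is a placeholder, not our actual cluster name.
import boto3

rds = boto3.client("rds", region_name="us-east-1")
cluster = rds.describe_db_clusters(DBClusterIdentifier="events-metadata")["DBClusters"][0]

for member in cluster["DBClusterMembers"]:
    role = "writer" if member["IsClusterWriter"] else "reader"
    print(member["DBInstanceIdentifier"], role)
```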
A failover is the process of promoting a read replica to become the new primary writer, typically done automatically when the primary fails, or manually triggered for maintenance operations like ours. When you trigger a failover in the AWS console, Aurora promotes the chosen reader to writer, repoints the cluster endpoint at it, and restarts the old writer so that it rejoins the cluster as a reader.
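The console action corresponds to the FailoverDBCluster API; a minimal programmatic equivalent, with placeholder identifiers, looks roughly like this:

```python
# Programmatic equivalent of the console "Failover" action.
# Cluster and instance identifiers are placeholders.
import boto3

rds = boto3.client("rds", region_name="us-east-1")
rds.failover_db_cluster(
    DBClusterIdentifier="events-metadata",
    TargetDBInstanceIdentifier="events-metadata-instance-2",  # the reader to promote
)
```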
The diagram below explains how Hightouch Events uses Aurora.

How Hightouch Events uses Aurora
This was our upgrade plan: upsize the reader (instance #2) to the larger instance class while instance #1 continued to serve writes, then trigger a failover so the newly upgraded instance #2 became the writer.
The AWS docs supported this approach and we had already tested the process successfully in a staging environment while performing a load test, so we were confident in the correctness of the procedure.
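For reference, the first half of a plan like this is an ordinary instance modification. A hedged boto3 sketch (the identifiers and target instance class are examples, not our actual values):

```python
# Resize the reader ahead of the failover; identifiers and instance class are examples.
import boto3

rds = boto3.client("rds", region_name="us-east-1")
rds.modify_db_instance(
    DBInstanceIdentifier="events-metadata-instance-2",
    DBInstanceClass="db.r6g.4xlarge",
    ApplyImmediately=True,
)

# Wait until the resized reader reports "available" again before triggering the failover.
# (In practice you would also confirm the new instance class actually took effect.)
rds.get_waiter("db_instance_available").wait(
    DBInstanceIdentifier="events-metadata-instance-2"
)
```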
At 16:39 EDT on October 23, 2025, we triggered the failover to the newly upgraded instance #2. The AWS Console showed the typical progression: parameter adjustments, instance restarts, the usual status updates.
Then the page refreshed. Instance #1, the original writer, was still the primary. The failover had reversed itself.
According to AWS, the cluster was healthy across the board. But our backend services couldn't execute write queries. Restarting the services cleared the errors and restored normal operation, but the upgrade had failed.
We tried again at 16:43. Same result: brief promotion followed by immediate reversal.
Two failed failovers in five minutes. Nothing else had changed: no code updates, no unusual queries, no traffic spikes. We had successfully tested this exact procedure in a staging environment under load earlier in the day. We reviewed our process for mistakes and searched online for anyone else who had encountered this issue, but came up empty. Nothing obvious could explain why Aurora was refusing to complete the failover in this cluster. We were perplexed.
We first checked database metrics for anything unusual. There was a spike in connection count, network traffic, and commit throughput to the read replica (instance #2) during the failover.
The higher commit throughput could have been due to replication or the execution of write queries. The other two metrics simply indicated a higher query volume.
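For anyone retracing this kind of investigation, the same metrics are available through the CloudWatch API. A sketch along these lines (instance identifier, region, and time window are placeholders) pulls the connection count for the reader around the failover window:

```python
# Pull the connection-count metric for the reader around the failover window.
# Identifiers, region, and timestamps are placeholders.
from datetime import datetime, timezone
import boto3

cw = boto3.client("cloudwatch", region_name="us-east-1")
resp = cw.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="DatabaseConnections",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "events-metadata-instance-2"}],
    StartTime=datetime(2025, 10, 23, 20, 30, tzinfo=timezone.utc),
    EndTime=datetime(2025, 10, 23, 20, 50, tzinfo=timezone.utc),
    Period=60,
    Statistics=["Maximum"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"])
```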
We checked the read query traffic from the app (graph below) and found no change during this period. This told us the extra traffic to instance #2 came from our backend services, which are supposed to connect to the writer instance.

Query traffic from the Hightouch app
When we looked at the backend application logs, we found this error in some pods: DatabaseError: cannot execute UPDATE in a read-only transaction.

Backend application logs
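A quick way to see this condition from the application's side is to ask the session it is holding whether it is read-only. A sketch with a placeholder connection string, assuming standard hot-standby semantics (which Aurora readers appear to follow, surfacing read-only transactions exactly as in the error above):

```python
# Ask the session we reached through the cluster endpoint whether it is read-only.
# The connection string is a placeholder.
import psycopg2

conn = psycopg2.connect("postgresql://user:pass@cluster-endpoint:5432/events")
with conn.cursor() as cur:
    cur.execute("SHOW transaction_read_only")
    read_only = cur.fetchone()[0]          # "on" for a reader, "off" for the writer
    cur.execute("SELECT pg_is_in_recovery()")
    in_recovery = cur.fetchone()[0]        # typically true on a reader
print(f"transaction_read_only={read_only}, pg_is_in_recovery={in_recovery}")
```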
Our services do not connect directly to the writer instance, but rather to a cluster endpoint which points to the writer. This could mean one of three things:
1. Our services were holding stale connections or cached DNS that resolved the cluster endpoint to the wrong instance.
2. The cluster endpoint was repointed at instance #2 before it was actually promoted to writer.
3. Instance #2 was briefly promoted and began accepting connections, then was demoted back to a read-only reader while those connections were still open.
We did not find any evidence supporting or disproving #1 in the application logs. We had a strong suspicion it was either #2 or #3. We downloaded the database logs to take a closer look and found something interesting. In both the promoted reader and the original writer, we found the same sequence of logs:
2025-10-23 20:38:58 UTC::@:[569]:LOG: starting PostgreSQL...
...
...
...
LOG: database system is ready to accept connections
LOG: server process (PID 799) was terminated by signal 9: Killed
DETAIL: Failed process was running: <write query from backend application>
LOG: terminating any other active server processes
FATAL: Can't handle storage runtime process crash
LOG: database system is shut down
This led us to a hypothesis:
During the failover window, Aurora briefly allowed both instances to process writes. The distributed storage layer rejected the concurrent write operations, causing both instances to crash.
We expect Aurora’s failover orchestration to do something like this:
1. Receive the failover request and select instance #2 as the promotion target.
2. Stop routing new connections from the cluster endpoint to the old writer.
3. Demote the old writer (instance #1) to a read-only role.
4. Promote instance #2 to writer and repoint the cluster endpoint at it.
5. Restart both instances into their new roles.
There was clearly a race condition between steps 3 & 4: promotion began before demotion had completed, briefly leaving both instances willing to accept writes.
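One practical lesson here is to verify the outcome rather than trusting the console's progress indicators. A hedged sketch (placeholder identifiers) that polls the cluster until the expected writer shows up, and fails loudly if the failover silently reverts as it did for us:

```python
# Poll the cluster after requesting a failover to confirm the writer actually changed.
# Cluster and instance identifiers are placeholders.
import time
import boto3

rds = boto3.client("rds", region_name="us-east-1")

def current_writer(cluster_id: str) -> str:
    cluster = rds.describe_db_clusters(DBClusterIdentifier=cluster_id)["DBClusters"][0]
    return next(
        (m["DBInstanceIdentifier"] for m in cluster["DBClusterMembers"] if m["IsClusterWriter"]),
        "unknown",
    )

def wait_for_writer(cluster_id: str, expected: str, timeout_s: int = 300) -> bool:
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if current_writer(cluster_id) == expected:
            return True
        time.sleep(5)
    return False

if not wait_for_writer("events-metadata", "events-metadata-instance-2"):
    raise RuntimeError(
        "Failover did not stick: writer is still " + current_writer("events-metadata")
    )
```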
To validate the theory, we performed a controlled failover attempt. This time, we paused all writes from our backend services before triggering the failover and resumed them only after confirming that instance #2 had become the writer.
By eliminating concurrent writes, the failover completed successfully. This strongly reinforced the race-condition hypothesis.
We escalated the findings and log patterns to AWS. After an internal review, AWS confirmed that:
The root cause was an internal signaling issue in the demotion process of the old writer, which resulted in the writer being unchanged after the failover.
They also confirmed that there was nothing unique about our configuration or usage that would trigger the bug. The conditions that caused it were not under our control.
AWS has indicated a fix is on their roadmap, but as of now, the recommended mitigation aligns with our solution: use Aurora’s Failover feature on an as-needed basis and ensure that no writes are executed against the DB during the failover.
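That mitigation can be encoded as a pre-flight guard before any manual failover. The sketch below is illustrative only: it assumes application writes have already been paused, uses placeholder identifiers and connection details, and the pg_stat_activity filter is a best-effort heuristic rather than a guarantee.

```python
# Pre-flight guard: only trigger the failover if no active client sessions are running.
# Connection string and identifiers are placeholders; the filter is a heuristic.
import boto3
import psycopg2

def active_backends(conn) -> int:
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT count(*)
            FROM pg_stat_activity
            WHERE state <> 'idle'
              AND backend_type = 'client backend'
              AND pid <> pg_backend_pid()
            """
        )
        return cur.fetchone()[0]

conn = psycopg2.connect("postgresql://user:pass@cluster-endpoint:5432/events")
if active_backends(conn) > 0:
    raise SystemExit("Active sessions found; pause writers before failing over.")

boto3.client("rds", region_name="us-east-1").failover_db_cluster(
    DBClusterIdentifier="events-metadata",
    TargetDBInstanceIdentifier="events-metadata-instance-2",
)
```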
With the race condition understood and mitigated, we completed the instance upsize using the write-free failover procedure and made it our standard practice for future failovers.
The following principles were reinforced during this experience:
If challenges like this investigation sound interesting, we encourage you to check out our careers page.