Software projects often implement "soft delete", maybe with a deleted boolean or an archived_at timestamp column.
If customers accidentally delete their data, they can recover it, which makes work easier for customer support teams.
Perhaps archived records are even required for compliance or audit reasons.
I've run into some trouble with soft delete designs. I'll cover those, and ponder ideas for how I'd build this in the future.
Adding an archived_at column seems to ooze complexity out into queries, operations, and applications.
Recovering deleted records does happen, but 99% of archived records are never going to be read.
So, the database tables will have a lot of dead data. Depending on access patterns, that might even be a significant amount of data. I've seen APIs that didn't work well with Terraform, so Terraform would delete + recreate records on every run, and over time that led to millions of dead rows. Your database can probably handle the extra bytes, and storage is fairly cheap, so it's not necessarily a problem, at first.
Hopefully, the project decided on a retention period in the beginning, and set up a periodic job to clean up those rows. Unfortunately, I'd bet that a significant percentage of projects did neither – it's really easy to ignore the archived data for a long time.
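In practice, that cleanup is usually just a scheduled query. A minimal sketch, assuming a hypothetical documents table and a 90-day retention period:
-- Run periodically (cron, pg_cron, or similar); the table name and retention window are illustrative
DELETE FROM documents
WHERE archived_at IS NOT NULL
  AND archived_at < NOW() - INTERVAL '90 days';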
At some point, someone might want to restore a database backup. Hopefully that's for fun and profit and not because you lost the production database at 11 am. If your project is popular, you might have a giant database full of dead data that takes a long time to recreate from a dump file.
archived_at columns also complicate queries, operations, and application code. Applications need to make sure they always filter out the archived data that's sitting right next to the live data. Indexes should be defined to exclude archived rows, for example with partial indexes. Manual queries run for debugging or analytics become longer and more complicated.
There's always a risk that archived data accidentally leaks in when it's not wanted. The complexity grows when there are mapping tables involved.
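A partial index is one way to keep archived rows out of the hot path, but the same predicate then has to appear in every query. A sketch, assuming a hypothetical documents table with an account_id column:
-- Index only the live rows
CREATE INDEX idx_documents_live_account ON documents (account_id)
WHERE archived_at IS NULL;
-- Queries must repeat the predicate for the planner to use the index
SELECT * FROM documents
WHERE account_id = 42
  AND archived_at IS NULL;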
Migrations have to deal with archived data too. Migrations may involve more than just schema changes – perhaps you need to fix a mistake with default values, or add a new column and backfill values. Is that going to work on records from 2 years ago? I've done migrations where these questions were not trivial to answer.
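A contrived example of the kind of question that comes up, with made-up table and column names – when backfilling a new column, do years-old archived rows get the new value too, or are they excluded and dealt with at restore time?
ALTER TABLE documents ADD COLUMN region TEXT;
-- Backfill live rows only? Or archived rows as well?
UPDATE documents
SET region = 'default'
WHERE region IS NULL
  AND archived_at IS NULL;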
Restoring an archived record is not always as simple as just running SET archived_at = null – creating a record may involve making calls to external systems as well.
I've seen complex restoration code that was always a buggy, partial implementation of the "create" API endpoint. In the end, we removed the specialized restoration code and required all restoration to go through the standard APIs. That simplified the server implementation and ensured that old data that had since become invalid could not be restored incorrectly – it has to pass the current validation rules.
I'm not a fan of the archived_at column approach. It's simple at first, but in my experience, it's full of pitfalls down the line.
Let's look at some alternatives (in PostgreSQL): application events, triggers, and logical replication.
All these approaches store archived data separately from live data – that may be a separate database table, a separate database, object storage, etc.
One team I worked with took the approach of emitting an event at the application layer when a record was deleted. The event was sent to SQS, and another service would archive that object to S3 (among other things).
This had a few big benefits:
The tradeoffs:
A trigger can copy a row to an archive table before it's deleted. The archive table can be a single, generic table that stores JSON blobs:
CREATE TABLE archive (
id UUID PRIMARY KEY,
table_name TEXT NOT NULL,
record_id TEXT NOT NULL,
data JSONB NOT NULL,
archived_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
caused_by_table TEXT,
caused_by_id TEXT
);
CREATE INDEX idx_archive_table_record ON archive(table_name, record_id);
CREATE INDEX idx_archive_archived_at ON archive(archived_at);
The trigger function converts the deleted row to JSON:
CREATE OR REPLACE FUNCTION archive_on_delete()
RETURNS TRIGGER AS $$
BEGIN
INSERT INTO archive (id, table_name, record_id, data)
VALUES (
gen_random_uuid(),
TG_TABLE_NAME,
OLD.id::TEXT,
to_jsonb(OLD)
);
RETURN OLD;
END;
$$ LANGUAGE plpgsql;
Attach this trigger to any table you want to archive:
CREATE TRIGGER archive_users
BEFORE DELETE ON users
FOR EACH ROW EXECUTE FUNCTION archive_on_delete();
CREATE TRIGGER archive_documents
BEFORE DELETE ON documents
FOR EACH ROW EXECUTE FUNCTION archive_on_delete();
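Once the trigger is attached, a deleted row can be pulled back out of the archive as JSON. A quick sketch – the id and the email field here are made up for illustration:
DELETE FROM users WHERE id = '123e4567-e89b-12d3-a456-426614174000';
-- The whole old row lives in the data JSONB column
SELECT data ->> 'email' AS email, archived_at
FROM archive
WHERE table_name = 'users'
  AND record_id = '123e4567-e89b-12d3-a456-426614174000';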
When a parent record is deleted and the foreign keys are declared ON DELETE CASCADE, PostgreSQL cascades the delete to child records. These child deletes fire the archive trigger too, but in the context of a cascade you often want to know why a record was deleted.
One approach is to use a session variable to track the root cause:
CREATE OR REPLACE FUNCTION archive_on_delete()
RETURNS TRIGGER AS $$
DECLARE
cause_table TEXT;
cause_id TEXT;
BEGIN
-- Check if we're in a cascade context
cause_table := current_setting('archive.cause_table', true);
cause_id := current_setting('archive.cause_id', true);
-- If this is a top-level delete, set ourselves as the cause
IF cause_table IS NULL THEN
PERFORM set_config('archive.cause_table', TG_TABLE_NAME, true);
PERFORM set_config('archive.cause_id', OLD.id::TEXT, true);
cause_table := TG_TABLE_NAME;
cause_id := OLD.id::TEXT;
END IF;
INSERT INTO archive (id, table_name, record_id, data, caused_by_table, caused_by_id)
VALUES (
gen_random_uuid(),
TG_TABLE_NAME,
OLD.id::TEXT,
to_jsonb(OLD),
cause_table,
cause_id
);
RETURN OLD;
END;
$$ LANGUAGE plpgsql;
Now when you delete a user, you can see which archived documents were deleted because of that user:
SELECT * FROM archive
WHERE caused_by_table = 'users'
AND caused_by_id = '123e4567-e89b-12d3-a456-426614174000';
Triggers add some overhead to deletes, and the archive table will grow. But:
- Live tables stay clean – no archived_at columns, no dead rows.
- Cleanup is a single query against one table, e.g. DELETE FROM archive WHERE archived_at < NOW() - INTERVAL '90 days'.
- The archive table can even live in a separate tablespace or be partitioned by time if it grows large (a sketch follows below).
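If partitioning becomes worthwhile, the archive table could be declared partitioned from the start. A rough sketch (note that the partition key has to be part of the primary key):
CREATE TABLE archive (
    id UUID NOT NULL,
    table_name TEXT NOT NULL,
    record_id TEXT NOT NULL,
    data JSONB NOT NULL,
    archived_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    caused_by_table TEXT,
    caused_by_id TEXT,
    PRIMARY KEY (id, archived_at)
) PARTITION BY RANGE (archived_at);
CREATE TABLE archive_2025_01 PARTITION OF archive
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
-- Retention then becomes dropping old partitions (e.g. DROP TABLE archive_2025_01;)
-- instead of bulk DELETEs.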
PostgreSQL's write-ahead log (WAL) records every change to the database. Change data capture (CDC) tools can read the WAL and stream those changes to external systems. For archiving, you'd filter for DELETE events and write the deleted records to another datastore.
Debezium is the most well-known tool for this. It connects to PostgreSQL's logical replication slot, reads changes, and publishes them to Kafka. From there, a consumer writes the data wherever you want – S3, Elasticsearch, another database, etc.
PostgreSQL → Debezium → Kafka → Consumer → Archive Storage
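To get a feel for what a CDC consumer reads, you can poke at a logical replication slot by hand with PostgreSQL's built-in test_decoding plugin. This is purely illustrative (Debezium uses its own decoder) and requires wal_level = logical:
SELECT pg_create_logical_replication_slot('archive_demo', 'test_decoding');
DELETE FROM users WHERE id = '123e4567-e89b-12d3-a456-426614174000';
-- Each change comes back as a text description of the DELETE. Note that by default
-- only the primary key of a deleted row is included; REPLICA IDENTITY FULL is needed
-- to capture all columns, which matters if you want to archive the full record.
SELECT * FROM pg_logical_slot_get_changes('archive_demo', NULL, NULL);
-- Drop the slot when done, or it will retain WAL indefinitely
SELECT pg_drop_replication_slot('archive_demo');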
For simpler setups, there are lighter-weight alternatives:
The main downside is operational overhead. You're running additional services that need to be monitored, maintained, and made fault-tolerant. Debezium with Kafka is a significant infrastructure investment – Kafka alone requires careful tuning and monitoring.
The lighter-weight alternatives reduce this burden but shift reliability concerns to your custom code. If your consumer crashes or falls behind, you need to handle that gracefully.
A critical operational concern is WAL retention. PostgreSQL retains WAL segments until all replication slots have consumed them. If your CDC consumer stops processing – due to a bug, network issue, or downstream failure – WAL segments accumulate on the primary.
If this continues unchecked, the primary database can run out of disk space and crash.
PostgreSQL 13+ has max_slot_wal_keep_size to limit how much WAL a slot can retain:
ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';
If a slot falls too far behind, PostgreSQL invalidates it rather than filling the disk. This protects the primary but means your CDC pipeline loses data and needs to be re-synced from a snapshot.
You need monitoring and alerting on replication slot lag. If a slot starts falling behind, you want to know before it becomes a crisis.
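A starting point for that monitoring is a query against pg_replication_slots, which an alerting job could run on a schedule (wal_status and safe_wal_size require PostgreSQL 13+):
SELECT slot_name,
       active,
       wal_status,
       pg_size_pretty(safe_wal_size) AS remaining_budget,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;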
WAL-based CDC provides:
But:
This approach makes the most sense when you already have Kafka or similar infrastructure, or when you need to stream changes to multiple destinations beyond just archiving.
This is an idea I had never considered until I wrote this post – I haven't tested this, it's just an idea.
What if you kept a PostgreSQL replica (e.g. using logical replication) that just didn't process DELETE queries? Would it effectively accumulate records and updates without conflict over time?
One potential benefit of this is that the archive can be easily queried, so finding old data is simple.
Would the replica have any information about deletes? Could it separate live from deleted data? Would you be able to find a record that was "deleted 2 hours ago in account 123" for a customer? Perhaps instead of ignoring DELETE queries entirely, you could have a specialized replica that transforms DELETE events into an archived_at column.
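A very rough, untested sketch of that last variant, on the subscriber side. It assumes a hypothetical documents table and leans on two facts: a subscriber's table may have extra columns, and triggers only fire during replication apply when explicitly enabled for it:
-- On the subscriber only: an extra column the publisher doesn't have
ALTER TABLE documents ADD COLUMN archived_at TIMESTAMPTZ;
CREATE OR REPLACE FUNCTION archive_instead_of_delete()
RETURNS TRIGGER AS $$
BEGIN
    -- Mark the row instead of letting the replicated DELETE remove it
    UPDATE documents SET archived_at = NOW() WHERE id = OLD.id;
    RETURN NULL;  -- returning NULL skips the DELETE
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER keep_deleted_documents
BEFORE DELETE ON documents
FOR EACH ROW EXECUTE FUNCTION archive_instead_of_delete();
-- Triggers are skipped during replication apply unless explicitly enabled for it
ALTER TABLE documents ENABLE ALWAYS TRIGGER keep_deleted_documents;
Whether this holds up in practice is exactly what would need testing – for one, if the publisher re-inserts a row with the same primary key, the retained row on the subscriber would cause an apply conflict.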
One potential pitfall here could be schema migrations – would the archive run into difficulty applying migrations over time?
Another downside might be cost – running an extra replica and keeping all that storage isn't free, and it adds operational overhead of its own.
If I were starting a new project today and needed soft delete, I'd reach for the trigger-based approach first. It's simple to set up, keeps live tables clean, and doesn't require extra infrastructure. The archive table is easy to query when you need it, and easy to ignore when you don't.
If you have thoughts, comments, feedback, shoot me an email at atlas9@eabuc.com.