PostgreSQL and the OOM killer: Why we use strict memory overcommit

April 27, 2026 · 10 min read

Burak Yucesoy

Principal Software Engineer

Our team members built and operated five managed PostgreSQL services over the past 15 years. Across all of them, one configuration has remained constant: strict memory overcommit. 

Why PostgreSQL Can't Tolerate the OOM Killer

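Linux allows processes to allocate more virtual memory than is physically available. When a process allocates memory, for example with malloc(), the kernel reserves virtual address space for it. However, the kernel does not immediately back that space with physical memory. Physical pages are only consumed when the process actually touches the memory. If processes end up touching more memory than the system can provide, the kernel invokes the OOM killer, which picks a process and terminates it with SIGKILL.

For most processes, handling an OOM kill is simple: the process restarts, reconnects, and picks up where it left off. PostgreSQL is different. When a backend is killed with SIGKILL, the postmaster cannot know whether that backend left shared memory in a consistent state, so it terminates all other backends and runs crash recovery. A single OOM kill drops every connection to the database.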

Strict Overcommit: Fail Early, Not Catastrophically

It is possible to configure how the kernel behaves when processes ask for memory. Linux provides three overcommit policies via vm.overcommit_memory:

  • Mode 0 (Heuristic): The default. The kernel refuses any single allocation larger than what the system could realistically provide (roughly free memory + swap + reclaimable page cache and slab), but otherwise allows overcommitting freely. In practice, this only blocks absurd requests like a single process asking for more memory than the entire system memory.
  • Mode 1 (Always): The kernel never refuses an allocation request, regardless of how large it is or how much memory has already been committed. Every malloc() and mmap() succeeds. If processes later fault in more physical memory than the system can actually provide, the OOM killer steps in to free memory by terminating a process.
  • Mode 2 (Strict): The kernel tracks the total committed virtual memory across all processes in Committed_AS and enforces an upper bound called CommitLimit. Any allocation that would push Committed_AS past CommitLimit is refused immediately with ENOMEM.

Under strict overcommit, the kernel has two knobs to set CommitLimit: overcommit_kbytes and overcommit_ratio. The CommitLimit is calculated as:

CommitLimit = overcommit_kbytes + swap

Or, if overcommit_kbytes is not set:

CommitLimit = overcommit_ratio / 100 * (total_memory - huge_pages) + swap
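The two formulas can be sketched as a small Ruby function. This is a simplified model, not kernel code: the method and parameter names are made up for illustration, all values are in kB, the memory term is assumed to exclude huge pages as the kernel documentation describes, and the kernel's additional reserves for root and user processes are ignored. The default ratio of 50 matches the kernel's default for vm.overcommit_ratio.

```ruby
# Simplified model of CommitLimit (all sizes in kB). Names are illustrative.
def commit_limit_kb(total_ram_kb:, swap_kb:, hugetlb_kb: 0,
                    overcommit_kbytes: 0, overcommit_ratio: 50)
  base =
    if overcommit_kbytes > 0
      # overcommit_kbytes takes precedence when it is set
      overcommit_kbytes
    else
      (total_ram_kb - hugetlb_kb) * overcommit_ratio / 100
    end
  base + swap_kb
end

# Mode 2 admission check: refuse a request that would push
# Committed_AS past CommitLimit.
def strict_allocation_allowed?(committed_as_kb:, limit_kb:, request_kb:)
  committed_as_kb + request_kb <= limit_kb
end

limit = commit_limit_kb(total_ram_kb: 8 * 1024 * 1024, swap_kb: 0)
puts "CommitLimit with defaults on 8 GB, no swap: #{limit} kB"
puts strict_allocation_allowed?(committed_as_kb: 4_000_000,
                                limit_kb: limit,
                                request_kb: 500_000)
```

With the default ratio and no swap, only about half of memory can be committed, which is why strict mode is rarely useful without tuning one of the two knobs.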

When an allocation fails with the ENOMEM error code, PostgreSQL handles it gracefully. A backend that cannot allocate memory reports an error to the client, aborts the transaction, and continues. The postmaster stays up. Other connections remain unaffected. This is a routine error, not a catastrophe. In short, strict overcommit converts late, destructive failures into early, graceful ones.

A Kernel Bug and 648 GB of Phantom Memory

We always favored strict overcommit for PostgreSQL. We used it in previous managed PostgreSQL services we built and also in Ubicloud PostgreSQL. However, after enabling it this time, we quickly ran into trouble. A few weeks after we turned on strict memory overcommit, we started to get failures on some of the databases. They showed out of memory errors, even though there was plenty of free physical memory on the machines. We disabled strict memory overcommit and started investigating.

Discovery

The first clue came from a routine check of /proc/meminfo on one of our servers with 8 GB memory:

$> cat /proc/meminfo | grep "Committed_AS"

Committed_AS: 683547672 kB

651 GB of committed memory on an 8 GB machine! For comparison, a healthy server of the same size showed:

$> cat /proc/meminfo | grep "Committed_AS"

Committed_AS: 2703940 kB

The counter was off by orders of magnitude.
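A check like this is easy to automate. The sketch below parses /proc/meminfo text and returns the ratio of Committed_AS to MemTotal. The helper names are hypothetical, and the MemTotal value in the sample is an assumed round figure for an 8 GB machine (a real server reports slightly less); only the Committed_AS value comes from the output above.

```ruby
# Extract a "<key>: <n> kB" field from /proc/meminfo text, or nil if absent.
def meminfo_kb(text, key)
  m = text.match(/^#{Regexp.escape(key)}:\s+(\d+)\s+kB/)
  m && m[1].to_i
end

# Ratio of committed virtual memory to physical memory.
def commit_ratio(text)
  committed = meminfo_kb(text, "Committed_AS")
  total = meminfo_kb(text, "MemTotal")
  return nil if committed.nil? || total.nil? || total.zero?

  committed.to_f / total
end

# MemTotal here is an assumption, not a value from the affected server.
sample = <<~MEMINFO
  MemTotal:        8388608 kB
  Committed_AS:  683547672 kB
MEMINFO

puts format("Committed_AS / MemTotal = %.1f", commit_ratio(sample))
```

On a live system the same function can be fed `File.read("/proc/meminfo")`.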

Narrowing It Down

We first looked at ps output.

$> ps -C postgres -o pid,vsz,rss,cmd --sort=-vsz

PID   VSZ     RSS   CMD
96622 2242244 95416 postgres: 18/main: postgres postgres...
95721 2241668 94708 postgres: 18/main: postgres postgres...
96414 2241436 94892 postgres: 18/main: postgres postgres...
96619 2241076 93308 postgres: 18/main: postgres postgres...
96417 2240900 94300 postgres: 18/main: postgres postgres...
95728 2240736 93864 postgres: 18/main: postgres postgres...
96620 2240736 92852 postgres: 18/main: postgres postgres...
95727 2240428 93640 postgres: 18/main: postgres postgres...
96623 2239840 93164 postgres: 18/main: postgres postgres...

VSZ is the total virtual address space a process has mapped and RSS is the physical memory it's actually using. In the output above, each backend shows ~2 GB of VSZ covering its entire mapped address space, but a much smaller RSS (~95 MB) reflecting the memory it is actively using. On this 8 GB VM we configure 2 GB of shared_buffers, and if you think ~2 GB VSZ is suspiciously close to the shared_buffers size, you are right. Most of each backend's VSZ is actually the shared memory segment that holds shared_buffers. Every backend maps the same 2 GB region into its own address space, so it shows up in each backend's VSZ. With many backends, the VSZ numbers add up quickly.

$> sudo cat /proc/321784/smaps | grep -A 25 "hugepage"

7fce75000000-7fcef0c00000 rw-s 00000000 00:10 10723551  /anon_hugepage (deleted)
Size:            2027520 kB
Shared_Hugetlb:   393216 kB
Private_Hugetlb:       0 kB
...
...
VmFlags: rd wr sh mr mw me ms de ht sd

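No ac flag, which is the flag that marks a mapping as accountable. The huge page mapping that backs shared_buffers was correctly excluded from committed memory accounting, so the hypothesis that shared memory was inflating the counter is ruled out.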

$> sudo awk '/^Size/{size=$2} /VmFlags:/ && / ac/{sum+=size} END{printf "%.2f GB\n", sum/1048576}' /proc/[0-9]*/smaps

2.43 GB

2.43 GB accountable vs 651 GB reported; 648 GB of phantom committed memory. The vm_committed_as counter was leaking. We suspected that the memory was being charged on allocation but was never recredited. This made us consider a potential kernel bug in committed memory calculation.
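For readers who find the awk one-liner dense, here is a rough Ruby equivalent of the per-file logic: remember the most recent Size line, and add it to the total when the mapping's VmFlags line contains the ac token. The function name and the sample mappings are made up for illustration.

```ruby
# Sum the Size (kB) of every mapping whose VmFlags include "ac" (accountable).
def accountable_kb(smaps_text)
  size_kb = 0
  total_kb = 0
  smaps_text.each_line do |line|
    if (m = line.match(/^Size:\s+(\d+)\s+kB/))
      size_kb = m[1].to_i
    elsif line.start_with?("VmFlags:")
      flags = line.split.drop(1)
      total_kb += size_kb if flags.include?("ac")
    end
  end
  total_kb
end

# Hypothetical smaps excerpt: one accountable private mapping and one
# huge page mapping without the "ac" flag.
sample_smaps = <<~SMAPS
  7f0000000000-7f0000100000 rw-p 00000000 00:00 0
  Size:               1024 kB
  VmFlags: rd wr mr mw me ac sd
  7fce75000000-7fcef0c00000 rw-s 00000000 00:10 10723551  /anon_hugepage (deleted)
  Size:            2027520 kB
  VmFlags: rd wr sh mr mw me ms de ht sd
SMAPS

puts "accountable: #{accountable_kb(sample_smaps)} kB"
```

Summing this over every process's smaps file gives the figure that Committed_AS should roughly agree with.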

Fleet-Wide Analysis

At that time, we had two different kernels being used on our fleet. We checked our entire fleet of PostgreSQL servers and compared the ratio of Committed_AS to MemTotal against kernel version and uptime:

Metric                       Kernel 6.5.0   Kernel 6.8.0
Median Ratio                 0.55           0.27
Mean Ratio                   24.97          0.32
Max Ratio                    3,405          1.86
Servers with a ratio > 1.0   23%            < 1%

We also ran a statistical analysis and found that a server running the 6.5 kernel was 52x more likely to have inflated committed memory.
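The gap between the median and the mean on 6.5.0 is what a small number of badly inflated servers looks like. The sketch below computes the same four summary statistics over a list of ratios. The function name and the input values are hypothetical and are not our fleet data.

```ruby
# Summarize Committed_AS / MemTotal ratios for a group of servers.
def ratio_summary(ratios)
  raise ArgumentError, "no ratios given" if ratios.empty?

  sorted = ratios.sort
  n = sorted.size
  median = n.odd? ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0
  {
    median: median,
    mean: sorted.sum / n.to_f,
    max: sorted.last,
    share_above_one: sorted.count { |r| r > 1.0 } / n.to_f
  }
end

# Hypothetical group: four healthy servers and one with a leaked counter.
summary = ratio_summary([0.3, 0.5, 0.6, 0.4, 81.5])
summary.each { |name, value| puts format("%-16s %.2f", name, value) }
```

A single outlier barely moves the median but dominates the mean and the maximum, which matches the pattern in the 6.5.0 column.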

The One-Character Bug

To have definitive proof, we tasked an LLM to look into every commit between 6.5.0 and 6.8.0 to find possible bug fixes in committed memory calculations. It quickly found the following.

A commit in that range changed the return value convention of do_vmi_munmap():

  • Before: 0 = success, 1 = success with lock downgraded, negative = error
  • After: always 0 for success, negative = error

The commit updated callers throughout the mm subsystem. However, in mm/mremap.c, inside move_vma(), the error check was converted incorrectly:

BEFORE (correct): error handler runs only when return is negative (on error)

if (do_vmi_munmap(&vmi, mm, old_addr, old_len, uf_unmap, false) < 0) {
   /* OOM: unable to split vma, just get accounts right */
   if (vm_flags & VM_ACCOUNT && !(flags & MREMAP_DONTUNMAP))
       vm_acct_memory(old_len >> PAGE_SHIFT);
}

AFTER (broken): error handler runs when return is 0 (on success)

if (!do_vmi_munmap(&vmi, mm, old_addr, old_len, uf_unmap, false)) {
   /* OOM: unable to split vma, just get accounts right */
   if (vm_flags & VM_ACCOUNT && !(flags & MREMAP_DONTUNMAP))
       vm_acct_memory(old_len >> PAGE_SHIFT);
}

The change from < 0 to ! inverted the condition. To understand why this matters, consider what move_vma() does. It first decrements Committed_AS for the old region as part of the move, then calls do_vmi_munmap() to actually unmap it. If the unmap fails, the kernel needs to increment the counter back to keep accounting correct. After all, unmap has failed and the old region still exists. Its charge must be restored. With the inverted condition, this re-increment runs on every successful mremap instead of only on failure. The counter grew monotonically with every memory remap operation.
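The effect of the inverted check can be reproduced with a toy model of the counter. This is not kernel code: the class and method names are invented, and the bookkeeping follows the simplified description above (charge the new region, uncharge the old one, restore the old charge only if the unmap failed).

```ruby
# Toy model of Committed_AS bookkeeping during a moving mremap.
class CommitCounter
  attr_reader :pages

  def initialize(pages)
    @pages = pages
  end

  # munmap_ret models do_vmi_munmap(): 0 on success, negative on error.
  # buggy: true uses the inverted check (!ret); false uses (ret < 0).
  def remap(len_pages, buggy:, munmap_ret: 0)
    @pages += len_pages # charge the new region
    @pages -= len_pages # uncharge the old region as part of the move
    restore = buggy ? munmap_ret.zero? : munmap_ret.negative?
    @pages += len_pages if restore # "unmap failed" path: put the charge back
    @pages
  end
end

fixed = CommitCounter.new(1_000)
broken = CommitCounter.new(1_000)
10_000.times do
  fixed.remap(256, buggy: false)
  broken.remap(256, buggy: true)
end
puts "fixed: #{fixed.pages} pages, broken: #{broken.pages} pages"
```

In the model, the correct check leaves the counter unchanged across successful remaps, while the inverted check leaks the full length of the old region every time.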

The upstream fix restores the original check:

-   if (!do_vmi_munmap(&vmi, mm, old_addr, old_len, uf_unmap, false)) {
+   if (do_vmi_munmap(&vmi, mm, old_addr, old_len, uf_unmap, false) < 0) {

As Linus Torvalds wrote in the fix:

Setting the Commit Limit

With the kernel bug behind us, we can gradually go back to enabling strict memory overcommit. This is a good point to explain our heuristic in deciding the commit limit in case you want to enable it for your workloads as well.

overcommit_kbytes = total_memory_kb × 0.8 + 2 × 1048576

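In plain terms: 80% of total physical memory plus 2 GB. In our implementation, the 80% is applied to the memory that remains after the 25% we reserve for huge pages, because huge pages do not count towards the commit limit.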

Why 80%

The 20% holdback covers memory used by kernel data structures not seen in userspace. This includes items like page tables, slab caches, network buffers, and the kernel's own allocations.

Why +2 GB

Every PostgreSQL server in our fleet runs several sidecar processes. Some examples are prometheus, node_exporter, postgres_exporter and wal-g. These are Go programs, and Go's runtime reserves large virtual memory regions upfront via mmap but only faults in pages as needed. Their committed memory contribution is far larger than their actual resident memory.

Sidecar Committed Memory   Percentage of Servers
0.0 – 0.5 GB               ~64%
0.5 – 1.0 GB               ~32%
1.0 – 1.5 GB               ~1%
1.5 – 2.0 GB               ~1%
2.0 – 2.5 GB               ~1%
2.5 – 3.0 GB               ~1%
3.0 – 3.5 GB               ~1%

96% of servers fall under 1 GB. We found a weak positive correlation between vCPU count and sidecar committed memory (r = 0.22). This is likely driven by Go's runtime scaling with available CPUs but it is not strong enough to justify proportional scaling.

Implementation

If you are curious about how we implemented this, it is actually pretty straightforward. You can read the code in our GitHub repo. I’m also adding the core part of it below for convenience.

# r and safe_write_to_file are helpers from the surrounding codebase
# (not shown here).
def configure_memory_overcommit(strict: false)
  if strict
    total_mem_kb = File.read("/proc/meminfo").match(/MemTotal:\s+(\d+)/)[1].to_i
    # 25% of memory is reserved for hugepages, which do not count towards the
    # commit limit, so only the remaining 75% is available for overcommit.
    non_hugepage_mem_kb = total_mem_kb * 0.75
    # 80% of the non-hugepage memory plus a fixed 2 GB (2 * 1048576 kB).
    overcommit_kbytes = (non_hugepage_mem_kb * 0.8 + 2 * 1048576).round
    safe_write_to_file("/etc/sysctl.d/99-overcommit.conf", "vm.overcommit_memory=2\nvm.overcommit_kbytes=#{overcommit_kbytes}\n")
  else
    r "sudo rm -f /etc/sysctl.d/99-overcommit.conf"
  end

  # Reload sysctl settings so the change takes effect immediately.
  r "sudo sysctl --system"
end

Note that we use vm.overcommit_kbytes instead of vm.overcommit_ratio. We need overcommit_kbytes because our formula includes a fixed 2 GB component that can't be expressed as a percentage. On a 4 GB server, the 2 GB buffer is 50% of the physical memory; on a 64 GB server, it's 3%. A single ratio can't capture both.
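To make the point concrete, the sketch below computes the percentage of memory that the limit works out to at a few server sizes. It applies the plain "80% plus 2 GB" formula from the "Setting the Commit Limit" section to exact power-of-two memory sizes and leaves out the huge page adjustment from the implementation. The constant and function names are made up for illustration.

```ruby
GIB_KB = 1024 * 1024

# 80% of memory plus a fixed 2 GB, in kB.
def overcommit_kbytes_for(total_memory_kb)
  (total_memory_kb * 0.8 + 2 * GIB_KB).round
end

# The percentage of memory that the limit amounts to at this server size.
def equivalent_ratio_percent(total_memory_kb)
  100.0 * overcommit_kbytes_for(total_memory_kb) / total_memory_kb
end

[4, 16, 64].each do |gib|
  percent = equivalent_ratio_percent(gib * GIB_KB)
  puts format("%2d GB server -> limit is %.1f%% of memory", gib, percent)
end
```

The limit is roughly 130% of memory on a 4 GB server and roughly 83% on a 64 GB server, so no single percentage reproduces the formula across sizes.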

Conclusion

Strict memory overcommit is a small configuration change that provides a meaningful safety improvement for PostgreSQL. It converts catastrophic OOM kills into graceful allocation failures. This way, each backend can manage the issue without disrupting the whole system. Even though we had to disable it for a while due to a kernel bug, it remains a key configuration for healthy PostgreSQL deployments.