
PostgreSQL WAL Overgrowth Protection

The Hidden Risk That Can Bring Down Your Production Database

PostgreSQL replicas are designed to enhance system reliability and performance, but they can paradoxically become a threat to your master database's stability. When replicas fail to keep up with Write-Ahead Log (WAL) consumption, they can trigger a cascade of issues that ultimately lead to production outages.

This article explores how WAL accumulation can crash your master database and demonstrates how PostgreSQL's max_slot_wal_keep_size parameter provides essential protection against this scenario.

Understanding the Problem: Storage Depletion

The most critical issue occurs when replicas cannot keep pace with WAL consumption, or when WAL consumers become unavailable altogether. In this scenario, WAL files accumulate on the master node, progressively consuming available storage until the database becomes unavailable.
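To make the risk concrete, the short Python sketch below estimates how long a master can survive while a consumer is offline. It is a back-of-the-envelope illustration only: the function name and the figures (100 GiB free, 2 GiB of WAL per hour) are hypothetical and do not come from the experiment described later.

```python
GIB = 1024 ** 3


def hours_until_disk_full(free_bytes: int, wal_bytes_per_hour: float) -> float:
    """Estimate the hours left before retained WAL fills the free storage.

    Assumes WAL is the only thing growing and that the write rate is constant.
    """
    if wal_bytes_per_hour <= 0:
        # No WAL is being generated, so storage is never exhausted by WAL.
        return float("inf")
    return free_bytes / wal_bytes_per_hour


# Hypothetical figures: 100 GiB free, 2 GiB of WAL written per hour.
print(hours_until_disk_full(100 * GIB, 2 * GIB))  # 50.0
```

With these assumed numbers, an unnoticed consumer outage over a long weekend would be enough to fill the disk.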

The Core Problem: WAL Accumulation

PostgreSQL maintains Write-Ahead Logs for replicas to ensure data consistency and durability. However, when replicas become unavailable or fall behind, the master continues to retain WAL files that haven't been acknowledged. This protective mechanism becomes problematic when:

PostgreSQL's Built-in Safety Net: max_slot_wal_keep_size

PostgreSQL 13 introduced a crucial protection mechanism through the max_slot_wal_keep_size parameter:

-- Configure in your RDS Parameter Group (the value is in megabytes)
max_slot_wal_keep_size = 10240  -- 10 GB limit

-- Default value is "-1" (no limit)

When the WAL retained for a replication slot exceeds the specified limit, PostgreSQL automatically invalidates that slot. This protects the master database from disk exhaustion, at the cost of stopping replication to the problematic consumer.
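The following Python sketch approximates the comparison behind that limit. It relies on the fact that a PostgreSQL LSN such as 16/B374D848 is a 64-bit byte position printed as two hexadecimal halves. The function names are hypothetical, and the server enforces the limit itself at WAL segment granularity, so treat this as an illustration rather than a reimplementation.

```python
MIB = 1024 * 1024


def lsn_to_int(lsn: str) -> int:
    """Convert an LSN like '16/B374D848' into an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def retained_wal_mb(current_lsn: str, restart_lsn: str) -> float:
    """WAL (in MB) between a slot's restart_lsn and the current write position."""
    return (lsn_to_int(current_lsn) - lsn_to_int(restart_lsn)) / MIB


def exceeds_limit(current_lsn: str, restart_lsn: str, limit_mb: int) -> bool:
    """True if the retained WAL is above max_slot_wal_keep_size (given in MB)."""
    if limit_mb < 0:
        # -1 is the default and means "no limit".
        return False
    return retained_wal_mb(current_lsn, restart_lsn) > limit_mb


# Made-up LSNs that are exactly 10 GB apart.
print(retained_wal_mb("3/0", "0/80000000"))  # 10240.0
```

The two inputs correspond to pg_current_wal_lsn() and the restart_lsn column of pg_replication_slots on the master.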

Proof of Concept: Testing WAL Protection Mechanisms

To validate the effectiveness of max_slot_wal_keep_size in protecting our master database, we conducted a controlled experiment to observe its behavior under realistic failure conditions.

Test Environment Setup

Our test environment consisted of:

Methodology

The experiment followed these steps:

Test Results

The logical replication slot was automatically invalidated when WAL accumulation exceeded the configured limit of 10,240 MB (10 GB), demonstrating the parameter's effectiveness in protecting the master and replica databases.
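An invalidation like this can be detected through the wal_status column of the pg_replication_slots view, available since PostgreSQL 13. The sketch below is a minimal illustration, not the tooling used in our experiment: the query is held as a string for use with whatever database client you prefer, and the helper function and sample rows are hypothetical.

```python
# Query to run against the master; safe_wal_size is the number of bytes that
# can still be written before the slot is at risk of being invalidated.
SLOT_STATUS_QUERY = """
SELECT slot_name, slot_type, active, wal_status, safe_wal_size
FROM pg_replication_slots;
"""


def slots_needing_attention(rows):
    """Return (slot_name, reason) pairs for slots that are lost or at risk.

    `rows` is a list of dicts keyed by the column names in SLOT_STATUS_QUERY.
    """
    flagged = []
    for row in rows:
        status = row["wal_status"]
        if status == "lost":
            # The slot has been invalidated and is no longer usable.
            flagged.append((row["slot_name"], "invalidated"))
        elif status == "unreserved":
            # Required WAL may be removed at the next checkpoint.
            flagged.append((row["slot_name"], "at risk"))
    return flagged


# Made-up rows for illustration.
sample = [
    {"slot_name": "healthy_slot", "wal_status": "reserved"},
    {"slot_name": "lagging_slot", "wal_status": "unreserved"},
    {"slot_name": "dead_slot", "wal_status": "lost"},
]
print(slots_needing_attention(sample))
```

A slot reported as lost cannot resume streaming. The consumer typically has to be re-synchronized against a newly created slot.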

Without this safety mechanism, WAL files keep accumulating: free storage on the replica continues to shrink until the replica fails, and free storage on the master is eventually exhausted, leading to complete database failure. We experienced this exact scenario in our testing environment, where an unresponsive replica caused WAL accumulation that eventually crashed both the replica and its master database.

The following chart illustrates WAL size and free storage space throughout the 14-hour test period, showing the complete cycle of WAL accumulation, critical limit breach, and automatic recovery:

WAL log growth during the experiment

Detailed Timeline Analysis

Phase 1: Initial Stability (14:00 - 16:00)

Phase 2: WAL Accumulation Period (17:00 - 02:45)

Phase 3: Critical Event (02:48)

Automatic replication slot invalidation when WAL limit exceeded

Phase 4: Recovery and Stabilization (02:49 - 04:00)

Key Takeaways and Best Practices

Our testing demonstrates that max_slot_wal_keep_size provides reliable protection against WAL-induced storage exhaustion. The parameter successfully prevented master database failure by automatically invalidating problematic replication slots when WAL accumulation reached critical levels.

Implementation Recommendations

  1. Configure WAL limits: Set max_slot_wal_keep_size to an appropriate value based on your storage capacity and business requirements
  2. Monitor storage metrics: Implement alerting on FreeStorageSpace to detect accumulation trends before they become critical
  3. Test your configuration: Validate protection mechanisms in non-production environments to ensure they work as expected
  4. Plan for consumer failures: Design your replication consumers with automatic restart capabilities and proper error handling
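For the fourth recommendation, one common pattern is to restart a failed consumer with exponential backoff. The Python sketch below is a generic illustration under stated assumptions: `consume` stands in for whatever function runs your replication consumer, and every name and default here is hypothetical rather than part of PostgreSQL or any replication library.

```python
import time
from typing import Callable


def run_with_restarts(
    consume: Callable[[], None],
    max_restarts: int = 5,
    base_delay_s: float = 1.0,
    max_delay_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run consume(), restarting it with exponential backoff when it raises.

    Returns the number of restarts that were needed. Re-raises the last
    error once max_restarts is used up, so that alerting can take over.
    """
    restarts = 0
    while True:
        try:
            consume()
            return restarts
        except Exception:
            if restarts >= max_restarts:
                raise
            # Delays grow as 1s, 2s, 4s, ... and are capped at max_delay_s.
            delay = min(base_delay_s * 2 ** restarts, max_delay_s)
            sleep(delay)
            restarts += 1
```

Restarts help only while the slot is still valid. Once a slot has been invalidated, restarting the consumer will not bring it back, which is why the storage alerting in the second recommendation matters as well.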

Conclusion

Replication safety is not optional - it's critical infrastructure protection. While replicas are designed to enhance system reliability, they can become single points of failure if not properly configured and monitored.

Remember: Your replica's problems inevitably become your master's problems.

By implementing proper WAL protection mechanisms and monitoring, you can ensure that replication enhances rather than threatens your database infrastructure's stability.


Michał Łasisz

DevOps Engineer