Presented by:

Ibrar Ahmed, Principal Engineer at pgEdge, brings 25 years of experience in software design and open-source development, particularly PostgreSQL. With a strong background in system-level embedded development, Ibrar has made impactful contributions during his tenure at companies like EnterpriseDB, Percona, and Bitnine. Since 2006, he's been instrumental in enhancing PostgreSQL's core engine, driving performance improvements, and refining essential modules.

His expertise spans MySQL, Oracle, and NoSQL solutions like MongoDB and Hadoop, alongside tools like Hive, HBase, and Spark. A prolific author and blogger, Ibrar shares deep insights into PostgreSQL through several authoritative books. In the past year, he's delivered more than fifteen talks worldwide at PostgreSQL conferences, further cementing his reputation. His dedication to advancing data management technology continues to shape the PostgreSQL landscape.

Disks fail, RAM runs short, software breaks, and human error introduces faults that spread through a PostgreSQL cluster without warning. When these events occur, data integrity depends on a disciplined recovery process rather than ad hoc fixes. This talk provides a structured approach to handling corruption and service failures in production environments.

The session begins with early-detection methods based on log analysis, checksum validation, page header inspection, and common indicators of broken storage or inconsistent WAL records. Once failure signals appear, the next step is to stop the service immediately to prevent further changes to damaged files.

Recovery then moves to restoration from verified backups, with emphasis on base backups checked for integrity and WAL archives stored with consistent retention rules. After restoration, point-in-time recovery establishes a clean state by selecting a precise timestamp or LSN that precedes the corruption event. When backups are incomplete or missing, salvage techniques extract healthy tables via targeted dumps or low-level page inspection, enabling partial recovery when full rollback is not possible. In situations where the cluster cannot proceed with standard recovery, pg_resetwal remains an option of last resort, used only to regain startup access while accepting loss of recent transactions.

The session concludes with a practical set of measures to reduce future risk, including routine use of data checksums, scheduled integrity checks, durable backup policies, WAL archiving discipline, and the addition of high-availability replicas to support failover during critical events. The focus stays on established commands, stable operational habits, and recovery actions proven to limit data loss in real deployments.
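As a rough illustration of the log-analysis step mentioned in the abstract, the sketch below scans server log lines for a few message fragments that PostgreSQL emits when it reads damaged pages or TOAST data. It is not taken from the talk. The pattern list is deliberately short and not exhaustive, and the label names, the `scan_log_lines` helper, and the sample lines are all assumptions made for this example.

```python
import re
from collections import Counter

# Fragments of messages PostgreSQL logs when it encounters damaged data.
# Illustrative subset only; the labels on the left are hypothetical names.
CORRUPTION_PATTERNS = {
    "checksum_failure": re.compile(
        r"page verification failed, calculated checksum \d+ but expected \d+"
    ),
    "invalid_page": re.compile(r"invalid page in block \d+ of relation \S+"),
    "short_read": re.compile(r'could not read block \d+ in file "[^"]+"'),
    "toast_chunk": re.compile(r"(?:missing|unexpected) chunk number \d+"),
}


def scan_log_lines(lines):
    """Return (counts per label, list of (line_no, label, text)) for matching lines."""
    counts = Counter()
    findings = []
    for line_no, line in enumerate(lines, start=1):
        for label, pattern in CORRUPTION_PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
                findings.append((line_no, label, line.rstrip()))
                break  # one label per line is enough for triage
    return counts, findings


if __name__ == "__main__":
    # Made-up sample lines in the shape of real server messages (assumption).
    sample = [
        "LOG:  checkpoint starting: time",
        "WARNING:  page verification failed, calculated checksum 1234 but expected 5678",
        "ERROR:  invalid page in block 42 of relation base/16384/16397",
    ]
    counts, findings = scan_log_lines(sample)
    for line_no, label, text in findings:
        print(f"line {line_no}: {label}: {text}")
    print(dict(counts))
```

A scan like this only flags candidates for investigation. Any hit would still need confirmation with the server's own tooling before the service is stopped and recovery begins.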

Date:
Duration:
50 min
Room:
Conference:
Postgres Conference: 2026
Language:
Track:
Ops
Difficulty:
Hard