Be Sure To Stop Your Backups!

Database Platform | Dec 13, 2017

In a recent support case, I came across a customer who used a clever way to create streaming replication base backups: cloning a Google Cloud instance. With the proliferation of cloud computing, it's very convenient to be able to create a block-level clone of a VM within minutes or even seconds, which is much faster than copying the files with a program like scp or rsync. They had found it to be faster than pg_basebackup as well, on the order of several minutes for a ~50GB database. Basically, they would start a base backup, clone the VM, and then stand up the clone as a streaming replication standby. Unfortunately, for some reason, they could not use psql to log in to the standby; they would simply see the following error:

FATAL: the database system is starting up

It was really strange. If you go to the primary and run SELECT * FROM pg_stat_replication, you'll see that as WAL advances on the primary, it is being replayed on the standby. The data stream is flowing and replication is working, yet we're not able to log in to the standby to run read-only queries.
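
To compare the positions reported by pg_stat_replication by hand, here is a minimal sketch. An LSN is printed as two hexadecimal numbers separated by a slash, the high and low 32 bits of a 64-bit WAL position. The helper names parse_lsn and lsn_diff are hypothetical and not part of PostgreSQL.

```python
def parse_lsn(lsn: str) -> int:
    """Convert a textual LSN such as '13/B0000028' into a 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lsn_diff(a: str, b: str) -> int:
    """Return the number of WAL bytes between two LSNs (a minus b)."""
    return parse_lsn(a) - parse_lsn(b)


if __name__ == "__main__":
    # Positions taken from the sample log in this article.
    print(lsn_diff("13/B0000108", "13/B0000028"))
```

A small difference between the primary's current position and the standby's replay position only shows that replay is keeping up. As in this case, it says nothing about whether the standby has been declared consistent.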

What's going on?

A clue is that on a typical streaming replication standby, you'll see the following in your log on startup:

LOG: entering standby mode
LOG: redo starts at 13/B0000028
LOG: invalid record length at 13/B0000108
LOG: started streaming WAL from primary at 13/B0000000 on timeline 1
LOG: consistent recovery state reached at 13/B00235B8
LOG: database system is ready to accept read only connections

On this customer's instance, we weren't seeing the last two lines ("consistent recovery state reached…" and "database system is ready to accept read only connections"). Apparently, the standby wasn't in a consistent state with the primary.
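
As a rough sketch of that check, the following function scans a standby's log lines for the two messages. The marker strings are copied from the excerpt above and may be worded differently in other PostgreSQL versions. The function name and the returned labels are hypothetical.

```python
CONSISTENT_MARKER = "consistent recovery state reached"
READY_MARKER = "database system is ready to accept read only connections"


def standby_status(log_lines):
    """Classify how far a standby got, based only on its startup log lines."""
    text = "\n".join(log_lines)
    if READY_MARKER in text:
        return "accepting read-only connections"
    if CONSISTENT_MARKER in text:
        return "consistent, not yet accepting connections"
    if "entering standby mode" in text:
        return "replaying WAL, consistency not reached"
    return "no standby startup found"


if __name__ == "__main__":
    stuck = [
        "LOG: entering standby mode",
        "LOG: redo starts at 13/B0000028",
        "LOG: started streaming WAL from primary at 13/B0000000 on timeline 1",
    ]
    print(standby_status(stuck))
```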

But, it LOOKS consistent…

One may argue that if you look in pg_stat_replication, all the evidence points to the idea that the standby IS in a consistent state with the primary. It's replaying all the primary's WAL. The LSN is advancing on both the standby and the primary, so how could it NOT be consistent? To the human eye and intuition, things look consistent, as evidenced by the advancing LSN, but the machine has no way of knowing that. Recall that if you are not using pg_basebackup, the proper procedure for setting up a streaming replication standby involves the following steps:

1. Execute pg_start_backup('any_label') on the primary
2. Copy all the files in the primary's $PGDATA directory, including WAL files
3. Execute pg_stop_backup() on the primary
4. Set up recovery.conf on the standby (and delete postmaster.pid, set hot_standby=on, etc.)
5. Start up Postgres on the standby
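
The first three steps can be sketched as follows. This is a minimal illustration rather than a complete backup tool. The names run_sql and copy_data_dir are hypothetical callables supplied by the caller: for example, a database cursor's execute method and whatever triggers the file copy or VM clone. The try/finally block ensures that the stop call is issued even if the copy fails.

```python
def take_base_backup(run_sql, copy_data_dir, label="any_label"):
    """Run steps 1-3: start the backup, copy $PGDATA, always stop the backup.

    run_sql(sql, params=None) and copy_data_dir() are hypothetical callables
    provided by the caller; nothing here connects to a database by itself.
    The SQL function names are the ones used in this article; newer
    PostgreSQL releases have renamed them.
    """
    run_sql("SELECT pg_start_backup(%s)", (label,))
    try:
        copy_data_dir()
    finally:
        # Without this call the BACKUP_END record is never written.
        run_sql("SELECT pg_stop_backup()")


if __name__ == "__main__":
    # Dry run that only prints what would be executed.
    take_base_backup(
        run_sql=lambda sql, params=None: print("SQL:", sql, params),
        copy_data_dir=lambda: print("copying data directory..."),
    )
```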

Apparently, the customer had neglected to execute the pg_stop_backup() step, which left the standby technically in a state of perpetual inconsistency. This is because the pg_stop_backup() step writes a BACKUP_END entry into the WAL stream, which lets the standby know that it has finished replaying all the WAL generated while the files were being copied in step 2, has now technically reached a consistent state, and can allow read-only connections. Without this BACKUP_END entry, the standby can never know whether it has replayed all the WAL generated during the copy (what if the copy took a whole year to process?). This BACKUP_END entry is the foolproof way for Postgres to ensure a consistent state between the primary and standby.
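
A toy model of that rule, for illustration only: it is not PostgreSQL's recovery code, and the record names other than BACKUP_END are made up. The standby replays records in order and can only declare itself consistent once it has seen the end-of-backup marker.

```python
def first_consistent_index(wal_records):
    """Return the index of the record at which a toy standby becomes
    consistent, or None if the end-of-backup marker never arrives."""
    for index, record in enumerate(wal_records):
        if record == "BACKUP_END":
            return index
    return None


if __name__ == "__main__":
    with_stop = ["INSERT", "UPDATE", "BACKUP_END", "INSERT"]
    without_stop = ["INSERT", "UPDATE", "INSERT", "INSERT"]
    print(first_consistent_index(with_stop))
    print(first_consistent_index(without_stop))
```

In the second stream the toy standby keeps replaying indefinitely and never opens for read-only connections, which mirrors what the customer observed.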

Conclusion

The moral of the story: Be sure to stop your backups! When setting up a streaming replication standby, it is imperative to execute SELECT pg_stop_backup() after copying all of $PGDATA; without it, you'll never be able to log in to your standby and run your read-only queries.

Richard Yen | Senior Support Engineer, EDB Postgres
