Chapter 16. Handling network splits with repmgrd

A common pattern for replication cluster setups is to spread servers over more than one datacentre. This can provide benefits such as geographically- distributed read replicas and DR (disaster recovery capability). However this also means there is a risk of disconnection at network level between datacentre locations, which would result in a split-brain scenario if servers in a secondary data centre were no longer able to see the primary in the main data centre and promoted a standby among themselves.

repmgr enables provision of "witness server" to artificially create a quorum of servers in a particular location, ensuring that nodes in another location will not elect a new primary if they are unable to see the majority of nodes. However this approach does not scale well, particularly with more complex replication setups, e.g. where the majority of nodes are located outside of the primary datacentre. It also means the witness node needs to be managed as an extra PostgreSQL instance outside of the main replication cluster, which adds administrative and programming complexity.

repmgr4 introduces the concept of location: each node is associated with an arbitrary location string (default is default); this is set in repmgr.conf, e.g.:

    conninfo='host=node1 user=repmgr dbname=repmgr connect_timeout=2'

In a failover situation, repmgrd will check if any servers in the same location as the current primary node are visible. If not, repmgrd will assume a network interruption and not promote any node in any other location (it will however enter Chapter 18 mode until a primary becomes visible).