At the intervals specified by monitor_interval_secs in repmgr.conf, repmgrd
will ping each node to check if it's available. If a node isn't available,
repmgrd will recheck it reconnect_attempts times at intervals of
reconnect_interval seconds to confirm the node is definitely unreachable.
This buffer period is necessary to avoid false positives caused by transient
network outages.
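The recheck loop described above can be sketched roughly as follows. This is an illustrative Python model only, not repmgrd's actual implementation; the function name `confirm_unreachable` and the `is_available` callback are hypothetical, and the assumption that a sleep follows every attempt (including the last) is taken from the timestamps in the sample log output in this section.

```python
import time
from typing import Callable


def confirm_unreachable(
    is_available: Callable[[], bool],
    reconnect_attempts: int,
    reconnect_interval: float,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True only if the node failed every one of the checks.

    Hypothetical helper: is_available stands in for whatever connection
    test the monitoring process performs.
    """
    for _ in range(reconnect_attempts):
        if is_available():
            # The node answered, so the outage was transient.
            return False
        # Pause before the next check (assumed to happen after each attempt).
        sleep(reconnect_interval)
    return True
```

Under these assumptions a node is declared unreachable after roughly reconnect_attempts × reconnect_interval seconds, so larger values tolerate longer network blips at the cost of a slower failover.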
If the node is still unavailable, repmgrd will enter failover mode and execute
the script defined in event_notification_command; an entry will be logged
in the repmgr.events table, and repmgrd will (unless otherwise configured)
resume monitoring of the node in "degraded" mode until it reappears.
repmgrd logfile output during a failover event will look something like this
on one node (usually the node which has failed, here node2):
    ...
    [2017-07-27 21:08:39] [INFO] starting continuous BDR node monitoring
    [2017-07-27 21:08:39] [INFO] monitoring BDR replication status on node "node2" (ID: 2)
    [2017-07-27 21:08:55] [INFO] monitoring BDR replication status on node "node2" (ID: 2)
    [2017-07-27 21:09:11] [INFO] monitoring BDR replication status on node "node2" (ID: 2)
    [2017-07-27 21:09:23] [WARNING] unable to connect to node node2 (ID 2)
    [2017-07-27 21:09:23] [INFO] checking state of node 2, 0 of 5 attempts
    [2017-07-27 21:09:23] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-07-27 21:09:24] [INFO] checking state of node 2, 1 of 5 attempts
    [2017-07-27 21:09:24] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-07-27 21:09:25] [INFO] checking state of node 2, 2 of 5 attempts
    [2017-07-27 21:09:25] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-07-27 21:09:26] [INFO] checking state of node 2, 3 of 5 attempts
    [2017-07-27 21:09:26] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-07-27 21:09:27] [INFO] checking state of node 2, 4 of 5 attempts
    [2017-07-27 21:09:27] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-07-27 21:09:28] [WARNING] unable to reconnect to node 2 after 5 attempts
    [2017-07-27 21:09:28] [NOTICE] setting node record for node 2 to inactive
    [2017-07-27 21:09:28] [INFO] executing notification command for event "bdr_failover"
    [2017-07-27 21:09:28] [DETAIL] command is:
      /path/to/bdr-pgbouncer.sh 2 bdr_failover 1 "host=host=node1 dbname=bdrtest user=repmgr connect_timeout=2" "node1"
    [2017-07-27 21:09:28] [INFO] node 'node2' (ID: 2) detected as failed; next available node is 'node1' (ID: 1)
    [2017-07-27 21:09:28] [INFO] monitoring BDR replication status on node "node2" (ID: 2)
    [2017-07-27 21:09:28] [DETAIL] monitoring node "node2" (ID: 2) in degraded mode
    ...
Output on the other node (node1) during the same event will look like this:
    ...
    [2017-07-27 21:08:35] [INFO] starting continuous BDR node monitoring
    [2017-07-27 21:08:35] [INFO] monitoring BDR replication status on node "node1" (ID: 1)
    [2017-07-27 21:08:51] [INFO] monitoring BDR replication status on node "node1" (ID: 1)
    [2017-07-27 21:09:07] [INFO] monitoring BDR replication status on node "node1" (ID: 1)
    [2017-07-27 21:09:23] [WARNING] unable to connect to node node2 (ID 2)
    [2017-07-27 21:09:23] [INFO] checking state of node 2, 0 of 5 attempts
    [2017-07-27 21:09:23] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-07-27 21:09:24] [INFO] checking state of node 2, 1 of 5 attempts
    [2017-07-27 21:09:24] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-07-27 21:09:25] [INFO] checking state of node 2, 2 of 5 attempts
    [2017-07-27 21:09:25] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-07-27 21:09:26] [INFO] checking state of node 2, 3 of 5 attempts
    [2017-07-27 21:09:26] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-07-27 21:09:27] [INFO] checking state of node 2, 4 of 5 attempts
    [2017-07-27 21:09:27] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-07-27 21:09:28] [WARNING] unable to reconnect to node 2 after 5 attempts
    [2017-07-27 21:09:28] [NOTICE] other node's repmgrd is handling failover
    [2017-07-27 21:09:28] [INFO] monitoring BDR replication status on node "node1" (ID: 1)
    [2017-07-27 21:09:28] [DETAIL] monitoring node "node2" (ID: 2) in degraded mode
    ...
This assumes only the PostgreSQL instance on node2 has failed. In this case the
repmgrd instance running on node2 has performed the failover. However, if
the entire server becomes unavailable, repmgrd on node1 will perform
the failover.
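As an illustration of what a script configured in event_notification_command might receive, the following Python sketch parses the command line that appears in the node2 log output above. The meaning assigned to each positional argument (failed node ID, event name, success flag, conninfo string and name of the next available node) is an assumption inferred from that single log line; the actual arguments depend on how the command is defined in repmgr.conf. The names `FailoverEvent` and `parse_event_args` are hypothetical.

```python
import shlex
from typing import NamedTuple, Sequence


class FailoverEvent(NamedTuple):
    node_id: int         # assumed: ID of the node the event concerns
    event: str           # event name, e.g. "bdr_failover"
    success: bool        # assumed: "1" means the event succeeded
    next_conninfo: str   # assumed: conninfo of the next available node
    next_node_name: str  # assumed: name of the next available node


def parse_event_args(argv: Sequence[str]) -> FailoverEvent:
    """Parse the five positional arguments seen in the sample log."""
    if len(argv) != 5:
        raise ValueError(f"expected 5 arguments, got {len(argv)}")
    node_id, event, success, conninfo, name = argv
    return FailoverEvent(int(node_id), event, success == "1", conninfo, name)


# The command exactly as it is logged on node2 in the output above.
LOGGED_COMMAND = (
    '/path/to/bdr-pgbouncer.sh 2 bdr_failover 1 '
    '"host=host=node1 dbname=bdrtest user=repmgr connect_timeout=2" "node1"'
)

# shlex.split honours the shell-style quoting; drop the script path itself.
event = parse_event_args(shlex.split(LOGGED_COMMAND)[1:])
```

A real handler would read the same values from `sys.argv[1:]` and act only on the events it cares about, for example by repointing a connection pooler at `event.next_node_name`.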