15.5. Node monitoring and failover

At the interval specified by monitor_interval_secs in repmgr.conf, repmgrd pings each node to check whether it is available. If a node does not respond, repmgrd checks it again reconnect_attempts times, at intervals of reconnect_interval seconds, to confirm that the node is definitely unreachable. This buffer period is necessary to avoid false positives caused by transient network outages.
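
The confirmation loop described above can be sketched roughly as follows. This is an illustrative Python model under stated assumptions, not repmgrd's actual implementation: node_is_unreachable and the is_available callback are hypothetical names, and the values 5 and 1 used in the demo mirror the attempt count and sleep interval visible in the sample log output in this section.

```python
import time
from typing import Callable


def node_is_unreachable(
    is_available: Callable[[], bool],
    reconnect_attempts: int,
    reconnect_interval: float,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True only if every reconnection attempt fails.

    Hypothetical model of the buffer period: a node that answers on any
    attempt is treated as a transient outage, not as a failure.
    """
    for _attempt in range(reconnect_attempts):
        if is_available():
            return False  # node came back; no failover needed
        # Wait before the next attempt (the sample log also sleeps after the last one).
        sleep(reconnect_interval)
    return True


if __name__ == "__main__":
    # Simulated node that never responds; sleep is stubbed out so the demo is instant.
    print(node_is_unreachable(lambda: False, 5, 1, sleep=lambda _seconds: None))
```

With 5 attempts at 1-second intervals, a failure is confirmed roughly 5 seconds after it is first noticed, which is consistent with the 21:09:23 to 21:09:28 span in the sample logs.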

If the node is still unavailable, repmgrd will enter failover mode and execute the script defined in event_notification_command; an entry will be logged in the repmgr.events table and repmgrd will (unless otherwise configured) resume monitoring of the node in "degraded" mode until it reappears.

During a failover event, repmgrd log file output on one node (usually the node which has failed, here node2) will look something like this:

    ...
    [2017-07-27 21:08:39] [INFO] starting continuous BDR node monitoring
    [2017-07-27 21:08:39] [INFO] monitoring BDR replication status on node "node2" (ID: 2)
    [2017-07-27 21:08:55] [INFO] monitoring BDR replication status on node "node2" (ID: 2)
    [2017-07-27 21:09:11] [INFO] monitoring BDR replication status on node "node2" (ID: 2)
    [2017-07-27 21:09:23] [WARNING] unable to connect to node node2 (ID 2)
    [2017-07-27 21:09:23] [INFO] checking state of node 2, 0 of 5 attempts
    [2017-07-27 21:09:23] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-07-27 21:09:24] [INFO] checking state of node 2, 1 of 5 attempts
    [2017-07-27 21:09:24] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-07-27 21:09:25] [INFO] checking state of node 2, 2 of 5 attempts
    [2017-07-27 21:09:25] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-07-27 21:09:26] [INFO] checking state of node 2, 3 of 5 attempts
    [2017-07-27 21:09:26] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-07-27 21:09:27] [INFO] checking state of node 2, 4 of 5 attempts
    [2017-07-27 21:09:27] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-07-27 21:09:28] [WARNING] unable to reconnect to node 2 after 5 attempts
    [2017-07-27 21:09:28] [NOTICE] setting node record for node 2 to inactive
    [2017-07-27 21:09:28] [INFO] executing notification command for event "bdr_failover"
    [2017-07-27 21:09:28] [DETAIL] command is:
      /path/to/bdr-pgbouncer.sh 2 bdr_failover 1 "host=host=node1 dbname=bdrtest user=repmgr connect_timeout=2" "node1"
    [2017-07-27 21:09:28] [INFO] node 'node2' (ID: 2) detected as failed; next available node is 'node1' (ID: 1)
    [2017-07-27 21:09:28] [INFO] monitoring BDR replication status on node "node2" (ID: 2)
    [2017-07-27 21:09:28] [DETAIL] monitoring node "node2" (ID: 2) in degraded mode
    ...
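
A script invoked through event_notification_command receives its details as positional arguments. The sketch below splits the command line logged above and gives each field a name. The field names are assumptions inferred from this single log entry (node ID, event name, success flag, then the connection string and name of the next available node), and parse_notification is a hypothetical helper, not part of repmgr.

```python
import shlex
from typing import NamedTuple


class Notification(NamedTuple):
    # Field names are assumptions inferred from the sample log entry.
    node_id: int
    event: str
    success: bool
    next_conninfo: str
    next_node_name: str


def parse_notification(command_line: str) -> Notification:
    """Split a logged notification command into named fields."""
    # shlex honours the double quotes, so the connection string stays in one piece.
    _script, node_id, event, success, conninfo, name = shlex.split(command_line)
    return Notification(int(node_id), event, success == "1", conninfo, name)


if __name__ == "__main__":
    logged = (
        '/path/to/bdr-pgbouncer.sh 2 bdr_failover 1 '
        '"host=host=node1 dbname=bdrtest user=repmgr connect_timeout=2" "node1"'
    )
    print(parse_notification(logged))
```

A handler written along these lines could, for example, act only when the event is bdr_failover and redirect client connections to the node named in the last argument.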

Output on the other node (node1) during the same event will look like this:

    ...
    [2017-07-27 21:08:35] [INFO] starting continuous BDR node monitoring
    [2017-07-27 21:08:35] [INFO] monitoring BDR replication status on node "node1" (ID: 1)
    [2017-07-27 21:08:51] [INFO] monitoring BDR replication status on node "node1" (ID: 1)
    [2017-07-27 21:09:07] [INFO] monitoring BDR replication status on node "node1" (ID: 1)
    [2017-07-27 21:09:23] [WARNING] unable to connect to node node2 (ID 2)
    [2017-07-27 21:09:23] [INFO] checking state of node 2, 0 of 5 attempts
    [2017-07-27 21:09:23] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-07-27 21:09:24] [INFO] checking state of node 2, 1 of 5 attempts
    [2017-07-27 21:09:24] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-07-27 21:09:25] [INFO] checking state of node 2, 2 of 5 attempts
    [2017-07-27 21:09:25] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-07-27 21:09:26] [INFO] checking state of node 2, 3 of 5 attempts
    [2017-07-27 21:09:26] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-07-27 21:09:27] [INFO] checking state of node 2, 4 of 5 attempts
    [2017-07-27 21:09:27] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-07-27 21:09:28] [WARNING] unable to reconnect to node 2 after 5 attempts
    [2017-07-27 21:09:28] [NOTICE] other node's repmgrd is handling failover
    [2017-07-27 21:09:28] [INFO] monitoring BDR replication status on node "node1" (ID: 1)
    [2017-07-27 21:09:28] [DETAIL] monitoring node "node2" (ID: 2) in degraded mode
    ...

This example assumes that only the PostgreSQL instance on node2 has failed, in which case the repmgrd instance running on node2 performs the failover. However, if the entire server becomes unavailable, repmgrd on node1 will perform the failover.
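
To tell from a log file which repmgrd instance handled a failover, one can look for the two distinguishing messages in the samples above. The following is a minimal sketch, assuming the message texts match those samples exactly; failover_role is a hypothetical helper and its return labels are made up for illustration.

```python
from typing import Iterable, Optional


def failover_role(log_lines: Iterable[str]) -> Optional[str]:
    """Classify a repmgrd log excerpt by the failover message it contains.

    Returns "handler" if this instance ran the bdr_failover notification,
    "observer" if it deferred to the other node, or None if neither
    message appears. Message texts are copied from the sample logs.
    """
    for line in log_lines:
        if 'executing notification command for event "bdr_failover"' in line:
            return "handler"
        if "other node's repmgrd is handling failover" in line:
            return "observer"
    return None


if __name__ == "__main__":
    sample = [
        "[2017-07-27 21:09:28] [WARNING] unable to reconnect to node 2 after 5 attempts",
        "[2017-07-27 21:09:28] [NOTICE] other node's repmgrd is handling failover",
    ]
    print(failover_role(sample))
```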