Chapter 19. Monitoring with repmgrd

When repmgrd is running with the option monitoring_history=true, it will regularly write standby node status information to the repmgr.monitoring_history table, providing a near-real-time overview of replication status on all nodes in the cluster.

The view replication_status shows the most recent state for each node, e.g.:

    repmgr=# select * from repmgr.replication_status;
    -[ RECORD 1 ]-------------+------------------------------
    primary_node_id           | 1
    standby_node_id           | 2
    standby_name              | node2
    node_type                 | standby
    active                    | t
    last_monitor_time         | 2017-08-24 16:28:41.260478+09
    last_wal_primary_location | 0/6D57A00
    last_wal_standby_location | 0/5000000
    replication_lag           | 29 MB
    replication_time_lag      | 00:00:11.736163
    apply_lag                 | 15 MB
    communication_time_lag    | 00:00:01.365643
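
For reference, replication_lag above is the byte difference between the two reported LSNs; the same value can be computed in PostgreSQL with pg_wal_lsn_diff() (named pg_xlog_location_diff() before PostgreSQL 10), e.g.:

    -- byte difference between the primary's and standby's WAL positions;
    -- pg_size_pretty() renders it in the same form as replication_lag above
    SELECT pg_size_pretty(
             pg_wal_lsn_diff('0/6D57A00', '0/5000000')
           ) AS replication_lag;
    -- 0x6D57A00 - 0x5000000 = 30767616 bytes, i.e. 29 MB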

The interval at which monitoring history is written is controlled by the configuration parameter monitor_interval_secs; the default is 2 (seconds).
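
In repmgr.conf, the combination looks like this (values shown are illustrative):

    # repmgr.conf
    monitoring_history=true     # write status records to repmgr.monitoring_history
    monitor_interval_secs=2     # write interval in seconds (default: 2)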

As this can generate a large amount of monitoring data in the table repmgr.monitoring_history, it's advisable to regularly purge historical data using the repmgr cluster cleanup command; use the -k/--keep-history option to specify how many days' worth of data should be retained.
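
One way to automate the purge is a cron entry; the schedule, retention period, user and configuration file path below are illustrative and should be adapted to your installation:

    # /etc/cron.d/repmgr-cleanup: daily at 03:00, keep the last 7 days of history
    0 3 * * * postgres repmgr -f /etc/repmgr.conf cluster cleanup --keep-history=7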

repmgrd can be run in monitoring-only mode (without automatic failover capability) for some or all nodes by setting failover=manual in the node's repmgr.conf file. In the event of the node's upstream failing, no failover action will be taken, and the node will require manual intervention to be reattached to replication. If this occurs, a standby_disconnect_manual event notification will be created.
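
In the affected node's repmgr.conf:

    failover=manual    # repmgrd monitors this node but takes no failover action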

Note that when a standby node is not streaming directly from its upstream node, e.g. recovering WAL from an archive, apply_lag will always appear as 0 bytes.

Tip: If monitoring history is enabled, the contents of the repmgr.monitoring_history table will be replicated to attached standbys. This means there will be a small but constant stream of replication activity which may not be desirable. To prevent this, convert the table to an UNLOGGED one with:

     ALTER TABLE repmgr.monitoring_history SET UNLOGGED;

However, this will mean that monitoring history is not available on another node following a failover, and that the view repmgr.replication_status will not work on standbys.
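
Should replication of the table be required again, the change can be reverted with the corresponding SET LOGGED option (available since PostgreSQL 9.5):

    ALTER TABLE repmgr.monitoring_history SET LOGGED;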