Task #7655
Story #7652: Enable simple metrics reporting for core services
report on replication activity
100%
Description
At a minimum, report on the size of the replication backlog.
Results should be written to a file that is easily parsed from a variety of languages. Two common options are JSON and CSV. A CSV file would be convenient for appending entries if more than a single record is to be kept.
The contents of the file are still to be determined, but they should include the size of the replication backlog and may include additional per-member-node information, such as the timestamp of the last replication activity and perhaps the number of current replication tasks for that MN.
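As a rough illustration of the CSV option, the sketch below formats one record per member node and appends it to a report file. The class name, column names, and column order are hypothetical, since the file contents are still to be determined; it also assumes node identifiers contain no commas or quotes, so no CSV escaping is done.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.time.Instant;

/** Hypothetical helper (not part of the codebase): one CSV record per member node. */
class ReplicationCsvReport {

    // Assumed column layout; the task leaves the actual contents open.
    static final String HEADER = "reportTime,memberNode,backlogSize,lastActivity,currentTasks";

    /** Builds one CSV line. A null lastActivity is written as an empty field. */
    static String formatRecord(Instant reportTime, String memberNode,
                               long backlogSize, Instant lastActivity, int currentTasks) {
        return String.join(",",
                reportTime.toString(),
                memberNode,
                Long.toString(backlogSize),
                lastActivity == null ? "" : lastActivity.toString(),
                Integer.toString(currentTasks));
    }

    /** Appends a record, writing the header first if the file is new or empty. */
    static void append(Path file, String record) throws IOException {
        boolean isNew = Files.notExists(file) || Files.size(file) == 0;
        String text = (isNew ? HEADER + System.lineSeparator() : "")
                + record + System.lineSeparator();
        Files.write(file, text.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}
```

Appending keeps earlier records intact, which is the convenience the description attributes to CSV; a JSON file would typically have to be rewritten as a whole to stay well-formed.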
Associated revisions
refs: #7655: added ReplicationTaskMonitor (runnable) to report replicationTask queue statistics (by status and authMN). Commented out unused 'public InputStream ReplicationService#getObjectFromCN()' method.
refs: #7655: added ReplicaStatusMonitor (runnable) to report summary counts of replicas by target node and status. Added Replication MetricEvents.
refs: #7655. parameterized replication monitoring frequency.
refs: #7655
parameterized replication monitoring frequency. fix config properties
History
#1 Updated by Rob Nahf over 8 years ago
replicationDAO has methods to determine how many outstanding requests there are per MN, and hooks into ReplicationManager for reporting. It should be relatively straightforward to schedule a task that runs at regular intervals to generate monitoring statistics.
We may need to add more DAO queries that do not filter by task state, or queries that return counts instead of records (located in ReplicationDaoMetacatImpl in d1_cn_common).
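A minimal sketch of such a scheduled task follows, using the standard `ScheduledExecutorService`. The `BacklogMonitor` class, the `Supplier` that stands in for a DAO count query, and the `Consumer` that stands in for the report writer are all hypothetical and are not the actual ReplicationDaoMetacatImpl or ReplicationManager APIs.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Hypothetical monitor: reports pending request counts per member node. */
class BacklogMonitor implements Runnable {

    // Stand-in for a DAO query returning counts per MN (assumption).
    private final Supplier<Map<String, Integer>> pendingByNode;
    // Stand-in for whatever writes the report lines (assumption).
    private final Consumer<String> sink;

    BacklogMonitor(Supplier<Map<String, Integer>> pendingByNode, Consumer<String> sink) {
        this.pendingByNode = pendingByNode;
        this.sink = sink;
    }

    @Override
    public void run() {
        try {
            // Copy into a TreeMap so nodes are reported in a stable, sorted order.
            for (Map.Entry<String, Integer> e : new TreeMap<>(pendingByNode.get()).entrySet()) {
                sink.accept(e.getKey() + "," + e.getValue());
            }
        } catch (RuntimeException ex) {
            // scheduleAtFixedRate cancels all later runs if one run throws,
            // so failures are reported here instead of propagating.
            sink.accept("error," + ex.getMessage());
        }
    }

    /** Runs the monitor immediately and then every periodSeconds seconds. */
    static ScheduledExecutorService schedule(BacklogMonitor monitor, long periodSeconds) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        executor.scheduleAtFixedRate(monitor, 0, periodSeconds, TimeUnit.SECONDS);
        return executor;
    }
}
```

The caller keeps the returned executor so it can call `shutdown()` when the service stops. Passing the period as a parameter leaves room for making the monitoring frequency configurable.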
#2 Updated by Rob Nahf over 8 years ago
One pid can be associated with [0..n] replicas (identified by [pid, targetMN]),
so replication has two things that can potentially be backlogged:
1. the pid that has been picked up by the sysmeta map listener (the sysmeta has changed, so we need to re-evaluate whether enough replicas were created)
2. the replicas themselves that have been ordered to be created on a target node (a request issued to a replica node, but not completed).
The first type could be reported per authoritativeMN.
The second by target node. The StaleReplicationRequestAuditor looks for these and tries to address them.
pid statuses:
NEW - the listener picked up a pid to evaluate
IN_PROCESS - a processor picked the pid up to evaluate
(there is no complete status. I believe the item is removed from the repo.)
replica statuses:
QUEUED, REQUESTED, COMPLETED, FAILED, INVALIDATED
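The per-target-node summary described above could be computed along these lines. This is only a sketch: the `ReplicaSummary` and `Replica` types are hypothetical, and treating QUEUED plus REQUESTED as the backlog is an assumption drawn from the description of replicas that were ordered but not completed.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical summary of replicas by target node and status. */
class ReplicaSummary {

    // Status names as listed in the comment above.
    enum ReplicaStatus { QUEUED, REQUESTED, COMPLETED, FAILED, INVALIDATED }

    /** Minimal stand-in for a replica record, identified by [pid, targetNode]. */
    static final class Replica {
        final String pid;
        final String targetNode;
        final ReplicaStatus status;

        Replica(String pid, String targetNode, ReplicaStatus status) {
            this.pid = pid;
            this.targetNode = targetNode;
            this.status = status;
        }
    }

    /** Counts replicas per target node, broken down by status. */
    static Map<String, Map<ReplicaStatus, Integer>> countByNodeAndStatus(List<Replica> replicas) {
        Map<String, Map<ReplicaStatus, Integer>> summary = new TreeMap<>();
        for (Replica r : replicas) {
            summary.computeIfAbsent(r.targetNode, n -> new EnumMap<>(ReplicaStatus.class))
                   .merge(r.status, 1, Integer::sum);
        }
        return summary;
    }

    /** Assumption: a node's backlog is its QUEUED plus REQUESTED replicas. */
    static int backlog(Map<ReplicaStatus, Integer> counts) {
        return counts.getOrDefault(ReplicaStatus.QUEUED, 0)
                + counts.getOrDefault(ReplicaStatus.REQUESTED, 0);
    }
}
```

In practice a DAO query that returns counts directly would avoid loading every replica record; the in-memory grouping here only shows the shape of the result.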
#3 Updated by Robert Waltz over 8 years ago
- % Done changed from 0 to 30
- Status changed from New to In Progress
#4 Updated by Robert Waltz over 8 years ago
- % Done changed from 30 to 50
- Status changed from In Progress to Testing
#5 Updated by Robert Waltz over 8 years ago
- Status changed from Testing to Closed
- Remaining hours set to 0.0
- % Done changed from 50 to 100