Story #8582

Story #8756: Ensure replica auditor is effective

Replica Auditing service is throwing errors

Added by Chris Jones over 3 years ago. Updated almost 3 years ago.

Target version:
Start date:
Due date:
% Done:


Story Points:


Replica auditing should be auditing objects every 90 days for fixity, and setting the replicaStatus appropriately. The /var/log/dataone/cn-replication-audit.log* files are showing many errors:

cjones@cn-ucsb-1:replicate$ grep ERROR cn-replication-audit.log* | grep "Cannot update replica status" | wc -l

Determine if this is a configuration issue or a code issue and fix it as needed. Also, fix the code to call Identifier.getValue() when logging these errors to avoid printing the memory location of the object like org.dataone.service.types.v1.Identifier@7e90f2e8. There are multiple places where getValue() needs to be added.


#1 Updated by Rob Nahf over 3 years ago

additional logging information is in cn-replication.log (from ReplicationService, which performs setting the INVALID status).

cn-replication.log.1-[DEBUG] 2018-04-27 21:41:55,946 [pool-11-thread-3]  (ReplicationService:updateReplicationMetadata:3
58) org.dataone.service.exceptions.NotFound: No record found for: : null
cn-replication.log.1-[ERROR] 2018-04-27 21:41:55,946 [pool-11-thread-3]  (ReplicationService:updateReplicationMetadata:3
60) Error in calling updateReplicationMetadata(): No record found for: : null
cn-replication.log.1:[ERROR] 2018-04-27 21:41:55,946 [pool-11-thread-3]  (ReplicaAuditingDelegate:updateInvalidReplica:1
10) Cannot update replica status to INVALID for pid: org.dataone.service.types.v1.Identifier@37cca400 on MN: urn:node:DR

The NotFound comes from either cn.getSystemMetadata or cn.updateReplicationMetadata.

Maybe it has something to do with the MemberNode of the Replica being updated?

#2 Updated by Rob Nahf over 3 years ago

cn.updateReplicationMetadata in Metacat returns a string formatted like what's in the logs, and it looks like it does it for any issue with the Hazelcast.getInstance().getSystemMetadataMap().get() call. The method returns a NotFound, but both HZ client errors (resulting in .getSystemMetadadtaMap() returning null) and nulls returned from the call (indicating a true NotFound) are handled by catching NPEs, and returned as NotFounds.

It would be better to split out these two root causes for throwing an exception from the method.

 } catch (RuntimeException e) { // Catch is generic since HZ throws RuntimeException
              throw new NotFound("4854", "No record found for: " + pid.getValue() +
                  " : " + e.getMessage());


#3 Updated by Rob Nahf over 3 years ago

In this case, the NotFound is a true not found, because calling /meta from a browser returns a not found. Are we trying to audit pids that don't exist anymore? '', for example?

#4 Updated by Chris Jones almost 3 years ago

Well, somewhere in the CN processing pipeline, we are truncating the pid values that have special characters like ?. So, the NotFound errors are due to us not looking for the correct pid string.

#5 Updated by Chris Jones almost 3 years ago

After looking at the auditing code, the results of each audit task are being written to the /var/log/dataone/replicate/cn-replication-audit-splunk.log files. These files are nice in that they uniformly document the AuditLogEntry results. One thing to note is that every log entry is an ERROR entry (despite most being successful), so pretty much ignore that status.

The general logging debug information is stored in the /var/log/dataone/replicate/cn-replication-audit.log files. Note that the splunk logs are rotated out frequently, and Splunk is no longer persisting them, so we should consider creating file beat forwarders and persisting these entries in Elastic Search on to help us understand the actual frequency of replica auditing over time.

#6 Updated by Chris Jones almost 3 years ago

  • Parent task set to #8756

Also available in: Atom PDF

Add picture from clipboard (Maximum size: 14.8 MB)