Story #8759
CN incorrectly modified the replica information of an object
100%
Description
Matt got a NotFound exception when calling the cn.resolve method:
https://cn.dataone.org/cn/v2/resolve/urn%3Auuid%3Ab4b3cc45-4953-43d3-910a-847528577531
The cause is that all of the replicas are marked with the "failed" status and the replica entry for the original (authoritative) member node, GOA, is gone:
https://cn.dataone.org/cn/v2/meta/urn%3Auuid%3Ab4b3cc45-4953-43d3-910a-847528577531
The system metadata on the original member node shows two successful replicas, GOA itself and the one on mnUNM1:
https://goa.nceas.ucsb.edu/goa/d1/mn/v2/meta/urn%3Auuid%3Ab4b3cc45-4953-43d3-910a-847528577531
This information is correct, since the object can be downloaded from both nodes:
https://mn-unm-1.dataone.org/knb/d1/mn/v2/object/urn%3Auuid%3Ab4b3cc45-4953-43d3-910a-847528577531
https://goa.nceas.ucsb.edu/goa/d1/mn/v2/meta/urn%3Auuid%3Ab4b3cc45-4953-43d3-910a-847528577531
History
#1 Updated by Dave Vieglais almost 6 years ago
Identifier is: urn:uuid:b4b3cc45-4953-43d3-910a-847528577531
System metadata from the CN:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ns3:systemMetadata xmlns:ns2="http://ns.dataone.org/service/types/v1" xmlns:ns3="http://ns.dataone.org/service/types/v2.0">
<serialVersion>43</serialVersion>
<identifier>urn:uuid:b4b3cc45-4953-43d3-910a-847528577531</identifier>
<formatId>image/png</formatId>
<size>101868</size>
<checksum algorithm="SHA-1">d4570d228cc7259ab1bf5cc0106d6ed9666e8117</checksum>
<submitter>http://orcid.org/0000-0002-1006-9496</submitter>
<rightsHolder>http://orcid.org/0000-0002-1006-9496</rightsHolder>
<accessPolicy>
<allow>
<subject>public</subject>
<permission>read</permission>
</allow>
</accessPolicy>
<replicationPolicy replicationAllowed="true" numberReplicas="1">
<preferredMemberNode>urn:node:KNB</preferredMemberNode>
</replicationPolicy>
<archived>false</archived>
<dateUploaded>2017-05-05T15:14:12.529+00:00</dateUploaded>
<dateSysMetadataModified>2017-05-05T15:14:12.529+00:00</dateSysMetadataModified>
<originMemberNode>urn:node:GOA</originMemberNode>
<authoritativeMemberNode>urn:node:GOA</authoritativeMemberNode>
<replica>
<replicaMemberNode>urn:node:mnORC1</replicaMemberNode>
<replicationStatus>failed</replicationStatus>
<replicaVerified>2018-09-26T01:24:28.584+00:00</replicaVerified>
</replica>
<replica>
<replicaMemberNode>urn:node:mnUNM1</replicaMemberNode>
<replicationStatus>failed</replicationStatus>
<replicaVerified>2018-10-20T23:22:23.466+00:00</replicaVerified>
</replica>
<replica>
<replicaMemberNode>urn:node:KNB</replicaMemberNode>
<replicationStatus>failed</replicationStatus>
<replicaVerified>2018-11-16T22:37:54.935+00:00</replicaVerified>
</replica>
<fileName>hcdbSampleLocs.png</fileName>
</ns3:systemMetadata>
and from the GOA MN (content origin):
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ns3:systemMetadata xmlns:ns2="http://ns.dataone.org/service/types/v1" xmlns:ns3="http://ns.dataone.org/service/types/v2.0">
<serialVersion>28</serialVersion>
<identifier>urn:uuid:b4b3cc45-4953-43d3-910a-847528577531</identifier>
<formatId>image/png</formatId>
<size>101868</size>
<checksum algorithm="SHA-1">d4570d228cc7259ab1bf5cc0106d6ed9666e8117</checksum>
<submitter>http://orcid.org/0000-0002-1006-9496</submitter>
<rightsHolder>http://orcid.org/0000-0002-1006-9496</rightsHolder>
<accessPolicy>
<allow>
<subject>public</subject>
<permission>read</permission>
</allow>
</accessPolicy>
<replicationPolicy replicationAllowed="true" numberReplicas="1">
<preferredMemberNode>urn:node:KNB</preferredMemberNode>
</replicationPolicy>
<archived>false</archived>
<dateUploaded>2017-05-05T15:14:12.529+00:00</dateUploaded>
<dateSysMetadataModified>2017-05-05T15:14:12.529+00:00</dateSysMetadataModified>
<originMemberNode>urn:node:GOA</originMemberNode>
<authoritativeMemberNode>urn:node:GOA</authoritativeMemberNode>
<replica>
<replicaMemberNode>urn:node:GOA</replicaMemberNode>
<replicationStatus>completed</replicationStatus>
<replicaVerified>2017-05-05T15:15:22.319+00:00</replicaVerified>
</replica>
<replica>
<replicaMemberNode>urn:node:UIC</replicaMemberNode>
<replicationStatus>failed</replicationStatus>
<replicaVerified>2017-05-09T23:59:05.154+00:00</replicaVerified>
</replica>
<replica>
<replicaMemberNode>urn:node:mnORC1</replicaMemberNode>
<replicationStatus>failed</replicationStatus>
<replicaVerified>2017-05-11T15:40:29.894+00:00</replicaVerified>
</replica>
<replica>
<replicaMemberNode>urn:node:KNB</replicaMemberNode>
<replicationStatus>failed</replicationStatus>
<replicaVerified>2017-05-22T00:55:17.973+00:00</replicaVerified>
</replica>
<replica>
<replicaMemberNode>urn:node:mnUNM1</replicaMemberNode>
<replicationStatus>completed</replicationStatus>
<replicaVerified>2017-05-30T18:53:49.397+00:00</replicaVerified>
</replica>
<fileName>hcdbSampleLocs.png</fileName>
</ns3:systemMetadata>
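The difference between the two documents can be checked mechanically. The following is a minimal sketch, not part of the CN code: `summarize_replicas` is a hypothetical helper, and `CN_SAMPLE` is a trimmed copy of the CN record above that keeps only the elements the check needs.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the CN system metadata shown above (assumption: only the
# elements needed for the check are kept).
CN_SAMPLE = """
<ns3:systemMetadata xmlns:ns3="http://ns.dataone.org/service/types/v2.0">
  <identifier>urn:uuid:b4b3cc45-4953-43d3-910a-847528577531</identifier>
  <authoritativeMemberNode>urn:node:GOA</authoritativeMemberNode>
  <replica>
    <replicaMemberNode>urn:node:mnORC1</replicaMemberNode>
    <replicationStatus>failed</replicationStatus>
  </replica>
  <replica>
    <replicaMemberNode>urn:node:mnUNM1</replicaMemberNode>
    <replicationStatus>failed</replicationStatus>
  </replica>
  <replica>
    <replicaMemberNode>urn:node:KNB</replicaMemberNode>
    <replicationStatus>failed</replicationStatus>
  </replica>
</ns3:systemMetadata>
"""


def summarize_replicas(sysmeta_xml: str) -> dict:
    """Hypothetical helper: report replica state from a system metadata document."""
    root = ET.fromstring(sysmeta_xml)
    # Child elements are not namespace-qualified in these documents.
    auth_mn = root.findtext("authoritativeMemberNode")
    replicas = {
        r.findtext("replicaMemberNode"): r.findtext("replicationStatus")
        for r in root.findall("replica")
    }
    completed = sorted(node for node, status in replicas.items() if status == "completed")
    return {
        "authoritative_mn": auth_mn,
        "completed": completed,
        "auth_mn_listed": auth_mn in replicas,
    }


if __name__ == "__main__":
    print(summarize_replicas(CN_SAMPLE))
```

Run against the CN copy, the summary has no completed replicas and no entry for the authoritative node, which matches the two symptoms described in this ticket.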
#2 Updated by Rob Nahf almost 6 years ago
I ran several queries in postgres to help determine the size of the problem.
It is assumed that a lack of 'COMPLETED' replicas leads to failures in cn/resolve.
This query counts the number of objects where there are no 'COMPLETED' replicas:
select count(*) from systemmetadata s where not exists (select r.guid from smreplicationstatus r where r.guid =s.guid and status = 'COMPLETED');
count = 1754
This query counts the number of objects where the authMN is not listed as a replica:
select count(*) from systemmetadata s where not exists (select r.guid from smreplicationstatus r where r.guid =s.guid and s.authoritive_member_node = r.member_node);
count = 3054
This query counts the number of objects having no replicas at all:
select count(*) from systemmetadata s where not exists (select r.guid from smreplicationstatus r where r.guid =s.guid);
count = 657
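To see what each of the three queries selects, they can be run against a tiny stand-in database. This is only a sketch: it uses an in-memory SQLite database rather than the production postgres instance, the table layout is reduced to the columns the queries reference, and the three sample rows are invented for illustration.

```python
import sqlite3

# The three counting queries from above, unchanged apart from whitespace.
QUERIES = {
    "no_completed": (
        "select count(*) from systemmetadata s where not exists "
        "(select r.guid from smreplicationstatus r "
        "where r.guid = s.guid and status = 'COMPLETED')"
    ),
    "auth_mn_missing": (
        "select count(*) from systemmetadata s where not exists "
        "(select r.guid from smreplicationstatus r "
        "where r.guid = s.guid and s.authoritive_member_node = r.member_node)"
    ),
    "no_replicas": (
        "select count(*) from systemmetadata s where not exists "
        "(select r.guid from smreplicationstatus r where r.guid = s.guid)"
    ),
}


def build_toy_db() -> sqlite3.Connection:
    """Create a toy database (assumption: simplified schema, invented rows)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        create table systemmetadata (guid text primary key, authoritive_member_node text);
        create table smreplicationstatus (guid text, member_node text, status text);
        """
    )
    conn.executemany(
        "insert into systemmetadata values (?, ?)",
        [("a", "urn:node:GOA"), ("b", "urn:node:GOA"), ("c", "urn:node:KNB")],
    )
    conn.executemany(
        "insert into smreplicationstatus values (?, ?, ?)",
        [
            ("a", "urn:node:GOA", "COMPLETED"),  # healthy object
            ("b", "urn:node:mnUNM1", "FAILED"),  # only a failed replica, no authMN entry
            # object "c" has no replica rows at all
        ],
    )
    return conn


def run_counts(conn: sqlite3.Connection) -> dict:
    """Run each query and return its count keyed by a short label."""
    return {name: conn.execute(sql).fetchone()[0] for name, sql in QUERIES.items()}


if __name__ == "__main__":
    print(run_counts(build_toy_db()))
```

In the toy data, object "c" is counted by all three queries and object "b" by the first two, which shows that the three production counts overlap and should not be added together.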
#3 Updated by Dave Vieglais almost 6 years ago
- File picture19-1.png added
Some quick analysis on the missing replicas.
System metadata with no replicas:
System metadata with no COMPLETED entries:
System metadata with no authoritative mn in the replicas:
#6 Updated by Dave Vieglais almost 6 years ago
Origin member nodes for system metadata with no replication information. The CN* entries can be ignored as those items are revisions of the formatId lists:
urn:node:ARCTIC 161
urn:node:CDL 5
urn:node:CNORC1 2
urn:node:CNUCSB1 14
urn:node:CNUNM1 2
urn:node:DRYAD 106
urn:node:EDI 1
urn:node:ESS_DIVE 21
urn:node:FEMC 1
urn:node:GOA 3
urn:node:GRIIDC 1
urn:node:KNB 131
urn:node:LTER 13
urn:node:NMEPSCOR 54
urn:node:ONEShare_test 1
urn:node:PANGAEA 115
urn:node:PISCO 1
urn:node:TDAR 20
urn:node:USGS_SDC 5
#7 Updated by Rob Nahf almost 6 years ago
- Category set to d1_replication
- % Done changed from 0 to 100
- Status changed from New to Closed
I was unable to find a cause for the disappearance of the replicas after a thorough review of the replication code, so I had to do a manual fix of the replication metadata using cn.updateReplicationMetadata. I searched every registered and "up" node in production for replicas and found two, on GOA and mnUNM1. I verified the checksums against /checksum and created / updated replicas for them.
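The checksum comparison used for that verification can be sketched as follows. This is an illustration only: `checksum_matches` is a hypothetical helper, it works on bytes already retrieved from a node rather than calling any DataONE API, and it assumes the algorithm name from the system metadata (for example "SHA-1") maps onto a hashlib name once the hyphen is dropped.

```python
import hashlib


def checksum_matches(content: bytes, algorithm: str, expected_hex: str) -> bool:
    """Hypothetical helper: compare object bytes with a system metadata checksum.

    `algorithm` is the value of the checksum element's algorithm attribute,
    e.g. "SHA-1"; `expected_hex` is the element text.
    """
    # Assumption: "SHA-1" -> "sha1", "MD5" -> "md5", "SHA-256" -> "sha256".
    hashlib_name = algorithm.replace("-", "").lower()
    digest = hashlib.new(hashlib_name, content).hexdigest()
    return digest == expected_hex.strip().lower()


if __name__ == "__main__":
    # Well-known SHA-1 test vector for the bytes b"abc".
    print(checksum_matches(b"abc", "SHA-1", "a9993e364706816aba3e25717850c26c9cd0d89d"))
```

For the object in this ticket, the expected value would be the SHA-1 checksum d4570d228cc7259ab1bf5cc0106d6ed9666e8117 from the system metadata, compared against the bytes downloaded from each candidate replica node.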
Regarding mechanisms for removing a completed replica, I could find none. Within the code there are state transitions from completed to invalidated (for failed audits), and removal of failed replicas. But there is nothing to transition from invalidated to failed. (Failed, I believe, is only set by a callback from MNs attempting to replicate an object, and can only be called against the replica that represents itself.)
It may be possible for a target replica node to call failed on an already completed replica, but I can't think of a triggering event for that.
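The reasoning above can be restated as a small reachability check. This is a sketch of the transitions as described in this comment, not a copy of the replication code: the table `CN_TRANSITIONS` and the "removed" state label are assumptions made for illustration, and the MN callback that sets "failed" is deliberately left out because it is not a CN-internal transition.

```python
from collections import deque

# Assumed model of the CN-internal transitions described above.
CN_TRANSITIONS = {
    "completed": {"invalidated"},  # a failed audit invalidates a completed replica
    "invalidated": set(),          # nothing found that moves invalidated to failed
    "failed": {"removed"},         # failed replicas are removed ("removed" is a made-up label)
}


def reachable(start: str, transitions: dict) -> set:
    """Return every state reachable from `start`, using breadth-first search."""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for nxt in transitions.get(state, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


if __name__ == "__main__":
    print(sorted(reachable("completed", CN_TRANSITIONS)))
```

Under this model a completed replica can only become invalidated, so neither "failed" nor removal is reachable without some outside event such as the MN callback.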