Story #8759

CN incorrectly modified the replica information of an object

Added by Jing Tao about 5 years ago. Updated about 5 years ago.

Status: Closed
Priority: High
Assignee:
Category: d1_replication
Target version: -
Start date: 2019-02-02
Due date:
% Done: 100%
Story Points:

Description

Matt got a NotFound exception when calling the cn.resolve method:
https://cn.dataone.org/cn/v2/resolve/urn%3Auuid%3Ab4b3cc45-4953-43d3-910a-847528577531

The cause is that all of the replicas are marked with the "failed" status, and the replica entry for the original (authoritative) member node, GOA, is gone:
https://cn.dataone.org/cn/v2/meta/urn%3Auuid%3Ab4b3cc45-4953-43d3-910a-847528577531

The system metadata on the original member node shows two successful replicas, GOA itself and the one on mnUNM1:
https://goa.nceas.ucsb.edu/goa/d1/mn/v2/meta/urn%3Auuid%3Ab4b3cc45-4953-43d3-910a-847528577531

This information is correct, since we can download the object from both nodes:
https://mn-unm-1.dataone.org/knb/d1/mn/v2/object/urn%3Auuid%3Ab4b3cc45-4953-43d3-910a-847528577531
https://goa.nceas.ucsb.edu/goa/d1/mn/v2/meta/urn%3Auuid%3Ab4b3cc45-4953-43d3-910a-847528577531

picture19-1.png - Counts over time of sysmeta with no replicas (23.8 KB) Dave Vieglais, 2019-02-06 14:49

picture318-1.png (23.7 KB) Dave Vieglais, 2019-02-06 14:54

picture420-1.png (20.9 KB) Dave Vieglais, 2019-02-06 14:54

History

#1 Updated by Dave Vieglais about 5 years ago

Identifier is: urn:uuid:b4b3cc45-4953-43d3-910a-847528577531

System metadata from the CN:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ns3:systemMetadata xmlns:ns2="http://ns.dataone.org/service/types/v1" xmlns:ns3="http://ns.dataone.org/service/types/v2.0">
    <serialVersion>43</serialVersion>
    <identifier>urn:uuid:b4b3cc45-4953-43d3-910a-847528577531</identifier>
    <formatId>image/png</formatId>
    <size>101868</size>
    <checksum algorithm="SHA-1">d4570d228cc7259ab1bf5cc0106d6ed9666e8117</checksum>
    <submitter>http://orcid.org/0000-0002-1006-9496</submitter>
    <rightsHolder>http://orcid.org/0000-0002-1006-9496</rightsHolder>
    <accessPolicy>
        <allow>
            <subject>public</subject>
            <permission>read</permission>
        </allow>
    </accessPolicy>
    <replicationPolicy replicationAllowed="true" numberReplicas="1">
        <preferredMemberNode>urn:node:KNB</preferredMemberNode>
    </replicationPolicy>
    <archived>false</archived>
    <dateUploaded>2017-05-05T15:14:12.529+00:00</dateUploaded>
    <dateSysMetadataModified>2017-05-05T15:14:12.529+00:00</dateSysMetadataModified>
    <originMemberNode>urn:node:GOA</originMemberNode>
    <authoritativeMemberNode>urn:node:GOA</authoritativeMemberNode>
    <replica>
        <replicaMemberNode>urn:node:mnORC1</replicaMemberNode>
        <replicationStatus>failed</replicationStatus>
        <replicaVerified>2018-09-26T01:24:28.584+00:00</replicaVerified>
    </replica>
    <replica>
        <replicaMemberNode>urn:node:mnUNM1</replicaMemberNode>
        <replicationStatus>failed</replicationStatus>
        <replicaVerified>2018-10-20T23:22:23.466+00:00</replicaVerified>
    </replica>
    <replica>
        <replicaMemberNode>urn:node:KNB</replicaMemberNode>
        <replicationStatus>failed</replicationStatus>
        <replicaVerified>2018-11-16T22:37:54.935+00:00</replicaVerified>
    </replica>
    <fileName>hcdbSampleLocs.png</fileName>
</ns3:systemMetadata>

and from the GOA MN (content origin):

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ns3:systemMetadata xmlns:ns2="http://ns.dataone.org/service/types/v1" xmlns:ns3="http://ns.dataone.org/service/types/v2.0">
  <serialVersion>28</serialVersion>
  <identifier>urn:uuid:b4b3cc45-4953-43d3-910a-847528577531</identifier>
  <formatId>image/png</formatId>
  <size>101868</size>
  <checksum algorithm="SHA-1">d4570d228cc7259ab1bf5cc0106d6ed9666e8117</checksum>
  <submitter>http://orcid.org/0000-0002-1006-9496</submitter>
  <rightsHolder>http://orcid.org/0000-0002-1006-9496</rightsHolder>
  <accessPolicy>
    <allow>
      <subject>public</subject>
      <permission>read</permission>
    </allow>
  </accessPolicy>
  <replicationPolicy replicationAllowed="true" numberReplicas="1">
    <preferredMemberNode>urn:node:KNB</preferredMemberNode>
  </replicationPolicy>
  <archived>false</archived>
  <dateUploaded>2017-05-05T15:14:12.529+00:00</dateUploaded>
  <dateSysMetadataModified>2017-05-05T15:14:12.529+00:00</dateSysMetadataModified>
  <originMemberNode>urn:node:GOA</originMemberNode>
  <authoritativeMemberNode>urn:node:GOA</authoritativeMemberNode>
  <replica>
    <replicaMemberNode>urn:node:GOA</replicaMemberNode>
    <replicationStatus>completed</replicationStatus>
    <replicaVerified>2017-05-05T15:15:22.319+00:00</replicaVerified>
  </replica>
  <replica>
    <replicaMemberNode>urn:node:UIC</replicaMemberNode>
    <replicationStatus>failed</replicationStatus>
    <replicaVerified>2017-05-09T23:59:05.154+00:00</replicaVerified>
  </replica>
  <replica>
    <replicaMemberNode>urn:node:mnORC1</replicaMemberNode>
    <replicationStatus>failed</replicationStatus>
    <replicaVerified>2017-05-11T15:40:29.894+00:00</replicaVerified>
  </replica>
  <replica>
    <replicaMemberNode>urn:node:KNB</replicaMemberNode>
    <replicationStatus>failed</replicationStatus>
    <replicaVerified>2017-05-22T00:55:17.973+00:00</replicaVerified>
  </replica>
  <replica>
    <replicaMemberNode>urn:node:mnUNM1</replicaMemberNode>
    <replicationStatus>completed</replicationStatus>
    <replicaVerified>2017-05-30T18:53:49.397+00:00</replicaVerified>
  </replica>
  <fileName>hcdbSampleLocs.png</fileName>
</ns3:systemMetadata>
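
To make the difference between the two documents easier to see, here is a minimal Python sketch that lists where the replica entries disagree. The two XML strings are trimmed copies of the documents above (only the replica node and status are kept), and the function names `replica_status` and `diff_replicas` are made up for illustration; they are not part of any DataONE library.

```python
import xml.etree.ElementTree as ET

# Trimmed copies of the two documents above: only replica node and status are kept.
CN_SYSMETA = """<ns3:systemMetadata xmlns:ns3="http://ns.dataone.org/service/types/v2.0">
  <replica><replicaMemberNode>urn:node:mnORC1</replicaMemberNode><replicationStatus>failed</replicationStatus></replica>
  <replica><replicaMemberNode>urn:node:mnUNM1</replicaMemberNode><replicationStatus>failed</replicationStatus></replica>
  <replica><replicaMemberNode>urn:node:KNB</replicaMemberNode><replicationStatus>failed</replicationStatus></replica>
</ns3:systemMetadata>"""

GOA_SYSMETA = """<ns3:systemMetadata xmlns:ns3="http://ns.dataone.org/service/types/v2.0">
  <replica><replicaMemberNode>urn:node:GOA</replicaMemberNode><replicationStatus>completed</replicationStatus></replica>
  <replica><replicaMemberNode>urn:node:UIC</replicaMemberNode><replicationStatus>failed</replicationStatus></replica>
  <replica><replicaMemberNode>urn:node:mnORC1</replicaMemberNode><replicationStatus>failed</replicationStatus></replica>
  <replica><replicaMemberNode>urn:node:KNB</replicaMemberNode><replicationStatus>failed</replicationStatus></replica>
  <replica><replicaMemberNode>urn:node:mnUNM1</replicaMemberNode><replicationStatus>completed</replicationStatus></replica>
</ns3:systemMetadata>"""


def replica_status(sysmeta_xml):
    """Map replicaMemberNode -> replicationStatus for one system metadata document."""
    root = ET.fromstring(sysmeta_xml)
    # The replica elements carry no namespace prefix in these documents.
    return {
        replica.findtext("replicaMemberNode"): replica.findtext("replicationStatus")
        for replica in root.findall("replica")
    }


def diff_replicas(cn, mn):
    """Return (node, cn_status, mn_status) for every node where the two views differ."""
    return [
        (node, cn.get(node), mn.get(node))
        for node in sorted(set(cn) | set(mn))
        if cn.get(node) != mn.get(node)
    ]


differences = diff_replicas(replica_status(CN_SYSMETA), replica_status(GOA_SYSMETA))
for node, cn_status, mn_status in differences:
    print(f"{node}: CN={cn_status} MN={mn_status}")
```

A status of `None` means the node has no replica entry in that document. On the data above, the differing nodes are GOA and UIC (absent on the CN) and mnUNM1 (failed on the CN, completed on GOA).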

#2 Updated by Rob Nahf about 5 years ago

I ran several queries in postgres to help determine the size of the problem.

It is assumed that lack of 'COMPLETED' replicas leads to failures in cn/resolve.

This query counts the number of objects that have no 'COMPLETED' replicas:

select count(*) from systemmetadata s where not exists (select r.guid from smreplicationstatus r where r.guid =s.guid and status = 'COMPLETED');

count = 1754

This query counts the number of objects where the authMN is not listed as a replica:

select count(*) from systemmetadata s where not exists (select r.guid from smreplicationstatus r where r.guid =s.guid and s.authoritive_member_node = r.member_node);

count = 3054

This query counts the number of objects that have no replicas at all:

select count(*) from systemmetadata s where not exists (select r.guid from smreplicationstatus r where r.guid =s.guid);

count = 657
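
To show what each of the three queries selects, here is a small sketch that runs them against a toy dataset. It uses Python's built-in sqlite3 rather than postgres, and the two tables contain only the columns the queries touch; the four rows (guids a to d) are invented for illustration and are not production data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed minimal schema: only the columns referenced by the queries above.
    CREATE TABLE systemmetadata (guid TEXT, authoritive_member_node TEXT);
    CREATE TABLE smreplicationstatus (guid TEXT, member_node TEXT, status TEXT);

    INSERT INTO systemmetadata VALUES
        ('a', 'urn:node:GOA'), ('b', 'urn:node:GOA'),
        ('c', 'urn:node:KNB'), ('d', 'urn:node:KNB');

    INSERT INTO smreplicationstatus VALUES
        ('a', 'urn:node:GOA', 'COMPLETED'),     -- healthy: authMN replica is completed
        ('b', 'urn:node:mnUNM1', 'FAILED'),     -- like this ticket: only failed replicas
        ('b', 'urn:node:KNB', 'FAILED'),
        ('c', 'urn:node:mnUNM1', 'COMPLETED');  -- completed copy, but authMN not listed
    -- 'd' has no replica rows at all
""")

QUERIES = {
    "no COMPLETED replica": """
        SELECT count(*) FROM systemmetadata s WHERE NOT EXISTS (
            SELECT r.guid FROM smreplicationstatus r
            WHERE r.guid = s.guid AND r.status = 'COMPLETED')""",
    "authMN not listed as a replica": """
        SELECT count(*) FROM systemmetadata s WHERE NOT EXISTS (
            SELECT r.guid FROM smreplicationstatus r
            WHERE r.guid = s.guid AND s.authoritive_member_node = r.member_node)""",
    "no replicas at all": """
        SELECT count(*) FROM systemmetadata s WHERE NOT EXISTS (
            SELECT r.guid FROM smreplicationstatus r WHERE r.guid = s.guid)""",
}

counts = {label: conn.execute(sql).fetchone()[0] for label, sql in QUERIES.items()}
for label, count in counts.items():
    print(f"{label}: {count}")
```

On the toy rows the three conditions overlap: an object with no replica rows (d) is also counted by the other two queries, so the three production counts should not be added together.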

#3 Updated by Dave Vieglais about 5 years ago

Some quick analysis on the missing replicas.

System metadata with no replicas:
Counts over time of sysmeta with no replicas

System metadata with no COMPLETED entries:

System metadata with no authoritative mn in the replicas:

#4 Updated by Dave Vieglais about 5 years ago

716

#5 Updated by Dave Vieglais about 5 years ago

717

#6 Updated by Dave Vieglais about 5 years ago

Origin member nodes for system metadata with no replication information. The CN* entries can be ignored as those items are revisions of the formatId lists:

urn:node:ARCTIC 161
urn:node:CDL 5
urn:node:CNORC1 2
urn:node:CNUCSB1 14
urn:node:CNUNM1 2
urn:node:DRYAD 106
urn:node:EDI 1
urn:node:ESS_DIVE 21
urn:node:FEMC 1
urn:node:GOA 3
urn:node:GRIIDC 1
urn:node:KNB 131
urn:node:LTER 13
urn:node:NMEPSCOR 54
urn:node:ONEShare_test 1
urn:node:PANGAEA 115
urn:node:PISCO 1
urn:node:TDAR 20
urn:node:USGS_SDC 5
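
As a cross-check, the per-node counts above can be totalled with a short Python sketch. The numbers are copied from the list; treating every node id that starts with `urn:node:CN` as a coordinating node entry is an assumption based on the note about the CN* entries.

```python
# Counts copied from the list above (origin member node -> objects with no replica info).
COUNTS = {
    "urn:node:ARCTIC": 161, "urn:node:CDL": 5, "urn:node:CNORC1": 2,
    "urn:node:CNUCSB1": 14, "urn:node:CNUNM1": 2, "urn:node:DRYAD": 106,
    "urn:node:EDI": 1, "urn:node:ESS_DIVE": 21, "urn:node:FEMC": 1,
    "urn:node:GOA": 3, "urn:node:GRIIDC": 1, "urn:node:KNB": 131,
    "urn:node:LTER": 13, "urn:node:NMEPSCOR": 54, "urn:node:ONEShare_test": 1,
    "urn:node:PANGAEA": 115, "urn:node:PISCO": 1, "urn:node:TDAR": 20,
    "urn:node:USGS_SDC": 5,
}

total = sum(COUNTS.values())
# Assumption: the CN* entries are exactly the ids beginning with "urn:node:CN".
cn_total = sum(n for node, n in COUNTS.items() if node.startswith("urn:node:CN"))

print(total, cn_total, total - cn_total)  # prints: 657 18 639
```

The total of 657 matches the count of objects with no replicas at all from the postgres query in #2; ignoring the CN* entries leaves 639.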

#7 Updated by Rob Nahf about 5 years ago

  • Category set to d1_replication
  • % Done changed from 0 to 100
  • Status changed from New to Closed

I was unable to find a cause for the disappearance of the replicas after a thorough review of the replication code, so I had to fix the replication metadata manually using cn.updateReplicationMetadata. I searched every registered and "up" node in production for replicas, and found two, on GOA and mnUNM1. I verified the checksums against /checksum and created / updated replicas for them.
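
For reference, the checksum comparison in that manual fix amounts to something like the sketch below. It is a local illustration only: `verify_checksum` is a hypothetical helper, the bytes are a stand-in for a downloaded object, and no DataONE client call is shown. The only assumption about the system metadata is that the algorithm is named the way the documents above name it ("SHA-1").

```python
import hashlib


def verify_checksum(data, algorithm, expected_hex):
    """Return True if the digest of `data` matches the checksum from system metadata.

    `algorithm` uses the sysmeta spelling (e.g. "SHA-1"); it is normalized to the
    name hashlib expects (e.g. "sha1").
    """
    digest = hashlib.new(algorithm.replace("-", "").lower(), data).hexdigest()
    return digest == expected_hex.strip().lower()


# Stand-in bytes; in the real check these would be the object's content.
payload = b"example object bytes"
recorded = hashlib.sha1(payload).hexdigest()

print(verify_checksum(payload, "SHA-1", recorded))           # matching copy
print(verify_checksum(payload + b"x", "SHA-1", recorded))    # altered copy
```

A replica entry would only be created or set to completed for a node whose copy passes this comparison.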

Regarding mechanisms for removing a completed replica, I could find none. Within the code there are state transitions from completed to invalidated (for failed audits), and removal of failed replicas, but there is nothing to transition from invalidated to failed. (Failed, I believe, is only set by a callback from MNs attempting to replicate an object, and can only be called against the replica that represents itself.)

It may be possible for a target replica node to call failed on an already completed replica, but I can't think of a triggering event for that.
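
The reasoning above can be written down as a small reachability check. This is only a model of the transitions described in this comment, not the replication code itself: the `removed` pseudo-state and the names `TRANSITIONS`, `CALLBACK_EDGE`, `reachable` and `with_edge` are invented for the sketch.

```python
from collections import deque

# Transitions as described above: audits invalidate completed replicas,
# and failed replicas get removed. "removed" is a pseudo-state for this sketch.
TRANSITIONS = {
    "completed": {"invalidated"},
    "failed": {"removed"},
}

# Hypothetical extra edge: a target node calling "failed" on its own completed replica.
CALLBACK_EDGE = ("completed", "failed")


def reachable(start, transitions):
    """Return every state reachable from `start` (breadth-first search)."""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for nxt in transitions.get(state, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def with_edge(transitions, edge):
    """Return a copy of `transitions` with one extra (source, target) edge."""
    src, dst = edge
    merged = {state: set(targets) for state, targets in transitions.items()}
    merged.setdefault(src, set()).add(dst)
    return merged


print(sorted(reachable("completed", TRANSITIONS)))
print(sorted(reachable("completed", with_edge(TRANSITIONS, CALLBACK_EDGE))))
```

With only the transitions found in the code, a completed replica can become invalidated but never failed or removed. Adding the callback edge is what makes removal reachable, which is why that path is the one worth watching for.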
