Story #3352: Production CNs are out of sync in object count - Infrastructure - DataONE Tasks

Added by Chris Jones over 12 years ago. Updated about 12 years ago.

Status: Closed
Priority: Normal
Assignee: Ben Leinfelder
Category: Environment.Production
Target version: Sprint-2012.50-Block.6.4
Start date: 2012-10-23
Due date: 2013-01-05
% Done: 100%
Story Points:
Sprint:

Description

In the production environment, the CNs are slightly out of sync with regard to object count. cn-ucsb-1 is reporting 328691 objects, whereas both cn-orc-1 and cn-unm-1 are reporting 328448 objects. We need to figure out which pids are not in sync and why ucsb has fallen short.
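
One way to approach the comparison, sketched in Python: given the set of pids each CN reports, compute which pids each CN lacks relative to the union of all CNs. How the pid lists are obtained is left out here; the function name `diff_pids` and the sample pids are hypothetical and only illustrate the set arithmetic.

```python
from typing import Dict, Set


def diff_pids(pids_by_cn: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Return, for each CN, the pids present on some other CN but absent locally."""
    all_pids = set().union(*pids_by_cn.values())
    return {cn: all_pids - pids for cn, pids in pids_by_cn.items()}


# Hypothetical sample data; real lists would come from each CN.
sample = {
    "cn-ucsb-1": {"pid.1", "pid.2", "pid.3"},
    "cn-orc-1": {"pid.1", "pid.2", "pid.4"},
    "cn-unm-1": {"pid.1", "pid.2", "pid.4"},
}
missing = diff_pids(sample)
for cn in sorted(missing):
    print(cn, sorted(missing[cn]))
```

A pid that shows up in the "missing" set of one CN only is a candidate for the kind of mismatch described in this story.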

correct_cn_20121022.sql (6.03 KB) Ben Leinfelder, 2012-10-22 22:37

Subtasks

History

#1 Updated by Ben Leinfelder over 12 years ago

An example (which probably applies broadly to the 200+ "missing" pids):
EML file guid 'nceas.988.9' was inserted into UCSB. For some reason there is no systemMetadata entry for it in the backing table, but it is mapped to docid 'autogen.2012101220031313139.1' in the identifier table.
ORC and UNM have systemMetadata for this guid (so HZ replication has worked) but they also have the docid as a separate object in the xml_documents table -- there from the Metacat replication of the EML file.
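
A minimal sketch of how such entries could be listed, assuming a simplified schema: an anti-join that returns identifier rows with no matching systemMetadata row. Only the table names and the guid/docid pairing come from the comment above; the column layout, the in-memory SQLite database, and the 'example.1.1' row are assumptions for illustration, not the actual Metacat schema.

```python
import sqlite3

# Hypothetical, simplified schema: only the table names "identifier" and
# "systemMetadata" and the guid/docid pairing are taken from the comment above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE identifier (guid TEXT PRIMARY KEY, docid TEXT);
    CREATE TABLE systemMetadata (guid TEXT PRIMARY KEY);
    INSERT INTO identifier VALUES ('nceas.988.9', 'autogen.2012101220031313139.1');
    -- 'example.1.1' is a made-up pid that has both rows, for contrast.
    INSERT INTO identifier VALUES ('example.1.1', 'example.1.1');
    INSERT INTO systemMetadata VALUES ('example.1.1');
""")

# Anti-join: identifier rows with no matching systemMetadata row.
orphans = conn.execute("""
    SELECT i.guid, i.docid
    FROM identifier AS i
    LEFT JOIN systemMetadata AS s ON s.guid = i.guid
    WHERE s.guid IS NULL
    ORDER BY i.guid
""").fetchall()
print(orphans)
```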

#2 Updated by Ben Leinfelder over 12 years ago

Another pid with the same issue:
'kgordon.35.40' -> 'autogen.2012101713571339641.1'

#3 Updated by Chris Jones over 12 years ago

At some point UCSB dropped from the Hazelcast cluster but kept communicating via Metacat replication. Some objects were created on UCSB, but the call to saveLocally() appears to have failed there, even though it succeeded on the other two CNs (via the hz event). Ben created a short list of PIDs that were on UNM and ORC but not on UCSB, so we'll restart UCSB to get the object counts to resync. The process will be:

1) Shut down d1-processing on cn-ucsb-1
2) Remove cn-ucsb-1 from the DNS round robin
3) Shut down d1-index-processing and d1-index-generator on cn-ucsb-1
4) Restart Tomcat on cn-ucsb-1
5) cn-ucsb-1 should resync its missing identifiers in the system metadata tables by pulling them from the other two CNs
6) Restart d1-index-* on cn-ucsb-1
7) Restart d1-processing on cn-ucsb-1
8) Manually fix the incorrect guids on cn-orc-1 and cn-unm-1 for the short list of pids via SQL
9) Add cn-ucsb-1 back into the DNS round robin

Then,
10) Look through the cn-ucsb-1 logs to try to see why it dropped from the cluster in the first place.

#4 Updated by Ben Leinfelder over 12 years ago

Steps 1-4 and 8 are done. Now waiting for the resync to complete on UCSB. I am skeptical that it is in the cluster, given the large number of statements like this flying by in knb.log:
knb 20121022-22:34:11: [DEBUG]: Adding missing hzIdentifiers key: resourceMap_SRKX00_XXXIBTNXMBR11_20080505.50.2 [edu.ucsb.nceas.metacat.dataone.hazelcast.HazelcastService]
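
To gauge how many identifiers are being re-added, messages like the one above could be tallied with a short script. This is a sketch only: the pattern is modeled on the single quoted line, it assumes pids contain no whitespace, and the helper name `missing_keys` is hypothetical.

```python
import re
from typing import Iterable, List

# Pattern modeled on the single log line quoted above; other knb.log
# lines may be formatted differently. Assumes pids contain no whitespace.
MISSING_KEY = re.compile(r"Adding missing hzIdentifiers key: (\S+)")


def missing_keys(lines: Iterable[str]) -> List[str]:
    """Collect the pids named in 'Adding missing hzIdentifiers key' messages."""
    keys = []
    for line in lines:
        match = MISSING_KEY.search(line)
        if match:
            keys.append(match.group(1))
    return keys


sample_log = [
    # First line is the one quoted in this comment.
    "knb 20121022-22:34:11: [DEBUG]: Adding missing hzIdentifiers key: "
    "resourceMap_SRKX00_XXXIBTNXMBR11_20080505.50.2 "
    "[edu.ucsb.nceas.metacat.dataone.hazelcast.HazelcastService]",
    # Second line is invented to show that non-matching lines are skipped.
    "knb 20121022-22:34:12: [DEBUG]: some unrelated message",
]
found = missing_keys(sample_log)
print(len(found), found)
```

In practice the lines would be read from the log file itself, for example by passing an open file handle to `missing_keys`.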