Story #3352

Production CNs are out of sync in object count

Added by Chris Jones about 11 years ago. Updated almost 11 years ago.

Ben Leinfelder

In the production environment, the CNs are slightly out of sync with regard to object count. cn-ucsb-1 is reporting 328691 objects, whereas both cn-orc-1 and cn-unm-1 are reporting 328448 objects. We need to figure out which pids are not in sync and why ucsb has fallen short.
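One way to find the out-of-sync pids, sketched under the assumption that each CN's identifier list has already been dumped to a plain list of strings. The function name `diff_pids` and the sample pids are illustrative only and do not come from the DataONE code base:

```python
def diff_pids(pids_by_cn):
    """Return, per CN, the pids held by at least one other CN but missing locally."""
    # Union of every pid known to any CN.
    all_pids = set().union(*pids_by_cn.values())
    # For each CN, whatever is in the union but not in its own list is "missing".
    return {cn: sorted(all_pids - set(pids)) for cn, pids in pids_by_cn.items()}


if __name__ == "__main__":
    # Hypothetical sample data; real lists would come from each CN's database.
    sample = {
        "cn-ucsb-1": ["a.1", "b.1"],
        "cn-orc-1": ["a.1", "b.1", "c.1"],
        "cn-unm-1": ["a.1", "b.1", "c.1"],
    }
    for cn, missing in diff_pids(sample).items():
        print(cn, len(missing), missing)
```

Comparing sets rather than counts matters here: two CNs can report the same total while still holding different pids.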

correct_cn_20121022.sql (6.03 KB) Ben Leinfelder, 2012-10-22 22:37


Task #3357: Address ISet iterator bug that only iterates over a subset of the ISet (New, Ben Leinfelder)


#1 Updated by Ben Leinfelder about 11 years ago

An example (which probably applies broadly to the 200+ "missing" pids):
EML file guid 'nceas.988.9' was inserted into UCSB. For some reason there is no systemMetadata entry for it in the backing table, but it is mapped to docid 'autogen.2012101220031313139.1' in the identifier table.
ORC and UNM have systemMetadata for this guid (so HZ replication has worked) but they also have the docid as a separate object in the xml_documents table -- there from the Metacat replication of the EML file.
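A rough way to list guids in this state is an anti-join between the identifier table and the system metadata table. The sketch below runs against an in-memory SQLite database with a deliberately simplified, assumed schema (`identifier(guid, docid)` and `systemmetadata(guid)`); the real Metacat tables have more columns and live in a different database engine, so treat the table and column names as assumptions to verify. The second sample row is made up for contrast.

```python
import sqlite3


def unmatched_guids(conn):
    """List (guid, docid) rows in identifier that have no systemmetadata row."""
    query = """
        SELECT i.guid, i.docid
        FROM identifier i
        LEFT JOIN systemmetadata s ON s.guid = i.guid
        WHERE s.guid IS NULL
        ORDER BY i.guid
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # Assumed, simplified schema for illustration only.
    conn.executescript("""
        CREATE TABLE identifier (guid TEXT PRIMARY KEY, docid TEXT);
        CREATE TABLE systemmetadata (guid TEXT PRIMARY KEY);
        INSERT INTO identifier VALUES ('nceas.988.9', 'autogen.2012101220031313139.1');
        INSERT INTO identifier VALUES ('example.1.1', 'example.1');
        INSERT INTO systemmetadata VALUES ('example.1.1');
    """)
    print(unmatched_guids(conn))
```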

#2 Updated by Ben Leinfelder about 11 years ago

Another pid with the same issue:
'kgordon.35.40' -> 'autogen.2012101713571339641.1'

#3 Updated by Chris Jones about 11 years ago

At some point UCSB dropped from the Hazelcast cluster, but was still communicating via Metacat replication. Some objects were created on UCSB, but the call to saveLocally() appears to have failed there, even though it succeeded on the other two CNs (which saw the hz event). Ben created a short list of PIDs that were on UNM and ORC but not on UCSB, so we'll restart UCSB to get the object counts to resync. The process will be:

1) Shut down d1-processing on cn-ucsb-1
2) Remove cn-ucsb-1 from the DNS round robin
3) Shut down d1-index-processing and d1-index-generator on cn-ucsb-1
4) Restart Tomcat on cn-ucsb-1
5) cn-ucsb-1 should resync its missing identifiers in the system metadata tables by pulling them from the other two CNs
6) Restart d1-index-* on cn-ucsb-1
7) Restart d1-processing on cn-ucsb-1
8) Manually fix the incorrect guids on cn-orc-1 and cn-unm-1 for the short list of pids via SQL
9) Add cn-ucsb-1 back into the DNS round robin
10) Look through the cn-ucsb-1 logs to try to see why it dropped from the cluster in the first place.

#4 Updated by Ben Leinfelder about 11 years ago

Steps 1-4 and 8 are done. Now waiting for the resync to complete on UCSB. I am skeptical that it is in the cluster, given the large number of statements like this flying by in knb.log:
knb 20121022-22:34:11: [DEBUG]: Adding missing hzIdentifiers key: resourceMap_SRKX00_XXXIBTNXMBR11_20080505.50.2 [edu.ucsb.nceas.metacat.dataone.hazelcast.HazelcastService]
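To gauge how many identifiers are being re-added, the matching lines can be pulled out of the log. This is only a sketch: the pattern is modeled on the single knb.log line quoted above, it assumes pids contain no whitespace, and the function name `missing_keys` is made up for illustration.

```python
import re

# Pattern modeled on the one log line quoted above; other line shapes are ignored.
MISSING_KEY = re.compile(r"Adding missing hzIdentifiers key: (\S+)")


def missing_keys(lines):
    """Return the pids named in 'Adding missing hzIdentifiers key' log lines."""
    keys = []
    for line in lines:
        match = MISSING_KEY.search(line)
        if match:
            keys.append(match.group(1))
    return keys


if __name__ == "__main__":
    sample = [
        "knb 20121022-22:34:11: [DEBUG]: Adding missing hzIdentifiers key: "
        "resourceMap_SRKX00_XXXIBTNXMBR11_20080505.50.2 "
        "[edu.ucsb.nceas.metacat.dataone.hazelcast.HazelcastService]",
    ]
    print(len(missing_keys(sample)), missing_keys(sample))
```

For a real log, passing the open file object as `lines` avoids loading the whole file into memory.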

#5 Updated by Ben Leinfelder about 11 years ago

#6 Updated by Chris Jones about 11 years ago

  • Tracker changed from Task to Story
  • Due date set to 2012-10-27
  • Status changed from New to In Progress
  • Remaining hours set to 0.0

#7 Updated by Chris Jones about 11 years ago

We confirmed that cn-ucsb-1 was in the cluster, but there was a problem with it seeing all pids in the hzIdentifiers ISet. See subtask #3357.

#8 Updated by Chris Jones about 11 years ago

We're now seeing that Hazelcast connections are frequently being dropped:

WARNING: / [DataONE] hz.1.InThread Closing socket to endpoint Address[],

This looks like a known Hazelcast 1.9.x issue, and the recommended fix is to upgrade to Hazelcast 2.x.

#9 Updated by Chris Jones almost 11 years ago

  • Due date changed from 2012-10-27 to 2013-01-05
  • Target version changed from Sprint-2012.41-Block.6.1 to Sprint-2012.50-Block.6.4

#10 Updated by Ben Leinfelder almost 11 years ago

  • Status changed from In Progress to Closed

Counts are now synched.
