Production CNs are out of sync in object count
In the production environment, the CNs are slightly out of sync with regard to object count: cn-ucsb-1 is reporting 328691 objects, whereas both cn-orc-1 and cn-unm-1 are reporting 328448 objects. We need to figure out which pids are not in sync and why cn-ucsb-1 has diverged.
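Finding the out-of-sync pids amounts to diffing the identifier lists held by each CN. A minimal sketch, assuming the pid lists have been exported from each node (the file-loading helper and the literal sets below are illustrative, not actual Metacat tooling):

```python
def load_pids(path):
    """Read one pid per line from an exported list, ignoring blank lines."""
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def missing_pids(reference, suspect):
    """Pids present in the reference set but absent from the suspect set."""
    return sorted(reference - suspect)

# Example with literal sets standing in for the exported lists:
orc = {"nceas.988.9", "autogen.1", "autogen.2"}
unm = {"nceas.988.9", "autogen.1", "autogen.2"}
ucsb = {"autogen.1", "autogen.2"}

# Only trust pids that both healthy CNs agree on (orc & unm).
print(missing_pids(orc & unm, ucsb))  # → ['nceas.988.9']
```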
#1 Updated by Ben Leinfelder about 11 years ago
An example (which probably applies broadly to the 200+ "missing" pids):
The EML file with guid 'nceas.988.9' was inserted into UCSB. For some reason there is no systemMetadata entry for it in the backing table, but it is mapped to docid 'autogen.2012101220031313139.1' in the identifier table.
ORC and UNM have systemMetadata for this guid (so HZ replication has worked), and they also have the docid as a separate object in the xml_documents table, placed there by Metacat replication of the EML file.
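The inconsistency described above can be detected by cross-checking the two tables: any guid with a docid mapping but no systemMetadata row is in the broken state. A toy sketch with simplified stand-ins for the real Metacat schema (the dict and set below are illustrative snapshots, not the actual tables):

```python
# identifier: guid -> docid mapping, as in the identifier table.
identifier = {
    "nceas.988.9": "autogen.2012101220031313139.1",
    "autogen.1": "autogen.1.1",
}
# Guids that actually have a systemMetadata row.
system_metadata = {"autogen.1"}

# Guids mapped to a docid but lacking a systemMetadata entry:
orphaned = sorted(g for g in identifier if g not in system_metadata)
print(orphaned)  # → ['nceas.988.9']
```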
#3 Updated by Chris Jones about 11 years ago
At some point UCSB dropped from the Hazelcast cluster but was still communicating via Metacat replication. Some objects were created on UCSB, but the call to saveLocally() appears to have failed, even though it succeeded on the other two CNs (as seen from the hz event). Ben created a short list of PIDs that were on UNM and ORC but not UCSB, so we'll restart UCSB to get the object counts to resync. The process will be:
1) Shut down d1-processing on cn-ucsb-1
2) Remove cn-ucsb-1 from the DNS round robin
3) Shut down d1-index-processing and d1-index-generator on cn-ucsb-1
4) Restart Tomcat on cn-ucsb-1
5) cn-ucsb-1 should resync its missing identifiers in the system metadata tables by pulling them from the other two CNs
6) Restart d1-index-* on cn-ucsb-1
7) Restart d1-processing on cn-ucsb-1
8) Manually fix the incorrect guids on cn-orc-1 and cn-unm-1 for the short list of pids via SQL
9) Add cn-ucsb-1 back into the DNS round robin
10) Look through the cn-ucsb-1 logs to try to see why it dropped from the cluster in the first place.
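The resync in step 5 can be pictured as the restarted CN comparing its local systemMetadata store against the shared Hazelcast map and pulling whatever it lacks. A minimal sketch of that behavior, with illustrative names (not the actual Metacat/HazelcastService API):

```python
def resync(local_store, shared_map):
    """Copy entries present in the shared cluster map but missing locally."""
    pulled = []
    for pid, sysmeta in shared_map.items():
        if pid not in local_store:
            local_store[pid] = sysmeta
            pulled.append(pid)
    return sorted(pulled)

# The shared map (from the two healthy CNs) vs. UCSB's local store:
shared = {"nceas.988.9": "<sysmeta>", "autogen.1": "<sysmeta>"}
local = {"autogen.1": "<sysmeta>"}
print(resync(local, shared))  # → ['nceas.988.9']
```

A second pass over the same stores pulls nothing, which is the steady state the restart is meant to reach.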
#4 Updated by Ben Leinfelder about 11 years ago
Steps 1-4 and 8 are done. Now waiting for the resync to complete on UCSB. I am skeptical that it is in the cluster, given the large number of statements like this flying by in knb.log:
knb 20121022-22:34:11: [DEBUG]: Adding missing hzIdentifiers key: resourceMap_SRKX00_XXXIBTNXMBR11_20080505.50.2 [edu.ucsb.nceas.metacat.dataone.hazelcast.HazelcastService]
#8 Updated by Chris Jones about 11 years ago
We're now seeing that Hazelcast connections are frequently being dropped:
WARNING: /184.108.40.206:5701 [DataONE] hz.1.InThread Closing socket to endpoint Address[220.127.116.11:5701], Cause:java.io.EOFException
This looks like a known 1.9.x issue, and the recommended fix is to upgrade to Hazelcast 2.x.
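One quick way to gauge how severe the drops are is to count these warnings per remote endpoint. A small sketch, with the log pattern inferred from the warning excerpt above (a hypothetical helper, not part of Hazelcast or Metacat):

```python
import re
from collections import Counter

# Matches the "Closing socket" warning shown above and captures the endpoint.
drop_re = re.compile(r"Closing socket to endpoint Address\[([^\]]+)\].*EOFException")

def count_drops(lines):
    """Count dropped-socket warnings per remote endpoint."""
    counts = Counter()
    for line in lines:
        m = drop_re.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

log = [
    "WARNING: ... Closing socket to endpoint Address[10.0.0.2:5701], Cause:java.io.EOFException",
    "INFO: cluster joined",
    "WARNING: ... Closing socket to endpoint Address[10.0.0.2:5701], Cause:java.io.EOFException",
]
print(count_drops(log))  # → Counter({'10.0.0.2:5701': 2})
```

A high, steadily climbing count for one endpoint would support the cluster-instability theory and the case for the 2.x upgrade.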