h1. Restoring Consistency Across CNs

Network outages or other causes of disconnection between Coordinating Nodes (CNs) can lead to a discrepancy in the number of objects recorded by the CNs participating in an environment. The process below can be followed to restore consistency. In this example, the UNM production CN lost connectivity and needed to be brought back to a consistent state with the other CNs (UCSB was the primary CN at the time).

I. Check & Restore consistency of LDAP on cn-unm-1:

p(. A. On CN-UNM-1

p((. 1. Check LDAP consistency

p(((. a. Diff the node lists. If the node lists are the same, then compare the (fill in the blank) variable on all the Node entries.

p(((. b. If the results indicate consistency, go to II.
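p(((. The node list comparison in step 1.a can be sketched as a set difference. This is only an illustration: how the node lists are exported from LDAP is not described here, and the node identifiers below are hypothetical placeholders.

<pre><code class="python">
def diff_node_lists(local_nodes, remote_nodes):
    """Return (only_local, only_remote) as sorted lists of node identifiers."""
    local_set = set(local_nodes)
    remote_set = set(remote_nodes)
    return sorted(local_set - remote_set), sorted(remote_set - local_set)


# Hypothetical node identifiers, for illustration only.
unm_nodes = ["urn:node:exampleA", "urn:node:exampleB", "urn:node:exampleC"]
ucsb_nodes = ["urn:node:exampleA", "urn:node:exampleB", "urn:node:exampleD"]

only_unm, only_ucsb = diff_node_lists(unm_nodes, ucsb_nodes)
print("only on cn-unm-1: ", only_unm)
print("only on cn-ucsb-1:", only_ucsb)
</code></pre>

p(((. Two empty lists mean the node lists match, and the per-entry comparison can follow.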

p((. 2. Restore LDAP consistency

p(((. a. Stop and restart LDAP. The LDAP process can be disconnected and reconnected on the CN components without needing to stop or restart dependent processes (cn.war and process daemons).

p(((. b. Repeat 1.a. If 1.a still does not pass, the cause may lie elsewhere: there may be an issue on other machines, the network may still be split, or LDAP may be taking longer than normal to synchronize (in that case, wait and repeat the check in 10 minutes).
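p(((. The "wait and repeat the check" advice can be sketched as a small polling loop. The function name, the number of attempts, and the @check@ callable are assumptions made for illustration; @check@ stands for whatever consistency test is run in step 1.a.

<pre><code class="python">
import time


def wait_until_consistent(check, attempts=6, delay_seconds=600, sleep=time.sleep):
    """Call check() up to `attempts` times, pausing between tries.

    Returns the 1-based attempt number on which check() passed,
    or None if it never passed.
    """
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        if attempt != attempts:
            sleep(delay_seconds)  # default: 10 minutes between checks
    return None


# Demonstration with a stand-in check that passes on the third try.
# The sleep function is replaced so the demonstration does not block.
outcomes = iter([False, False, True])
passed_on = wait_until_consistent(lambda: next(outcomes), sleep=lambda seconds: None)
print("check passed on attempt", passed_on)
</code></pre>

p(((. A return value of @None@ suggests the problem is not just slow synchronization and needs further investigation.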

II. Check & Restore consistency of Metacat on cn-unm-1:

p(. A. On CN-UNM-1

p((. 1. Check Metacat consistency

p(((. a. Compare the object counts on monitor.dataone.org?

p(((. b. Request the following URL for each of the CNs and note the object count that each one reports:

<pre>
https://cn-unm-1.dataone.org/cn/v1/object?start=0&count=0
https://cn-ucsb-1.dataone.org/cn/v1/object?start=0&count=0
https://cn-orc-1.dataone.org/cn/v1/object?start=0&count=0
</pre>

p(((. c. If the counts are equal, then go to the end. If this is your first time running these commands, please make a note of what made you check Metacat consistency; the problem may be only in LDAP.
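p(((. The count comparison in step 1.b can be scripted. The sketch below assumes that each URL returns an XML object list whose root element carries a @total@ attribute; that response format, the helper names, and the timeout are assumptions to verify against an actual response before relying on the script.

<pre><code class="python">
import urllib.request
import xml.etree.ElementTree as ET

CN_HOSTS = [
    "cn-unm-1.dataone.org",
    "cn-ucsb-1.dataone.org",
    "cn-orc-1.dataone.org",
]


def parse_total(object_list_xml):
    """Read the 'total' attribute from the root element (assumed format)."""
    root = ET.fromstring(object_list_xml)
    return int(root.attrib["total"])


def fetch_total(host, timeout=30):
    """Request zero objects from a CN so that only the counts come back."""
    url = "https://%s/cn/v1/object?start=0&count=0" % host
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_total(response.read())


def counts_agree(totals):
    """True when every CN in the mapping reports the same total."""
    return len(set(totals.values())) == 1


def report(hosts=CN_HOSTS):
    """Print the total for each CN and whether all totals match."""
    totals = {host: fetch_total(host) for host in hosts}
    for host, total in totals.items():
        print(host, total)
    print("consistent" if counts_agree(totals) else "counts differ")
    return totals
</code></pre>

p(((. Calling @report()@ from a machine that can reach all three CNs prints one line per CN followed by the verdict.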

p((. 2. Restore Metacat consistency

p(((. a. Log in to the administrative interface of Metacat - https://cn-unm-1.dataone.org/metacat/admin

p(((. b. Click the button named 'Reconfigure Now' beside the label 'Replication Configuration'

p(((. c. Perform the 'Hazelcast Synchronization' operation.

p((((. i. Begin by tailing the log in /var/metacat/logs/metacat.log (tail -100f /var/metacat/logs/metacat.log)

p((((. ii. Click the button named 'Resynch' under the label 'Hazelcast Synchronization'

p((((. iii. Note the beginning of the synchronization process in the log file (remember that logs rotate, so re-issue the tail command periodically):

<pre>
metacat 20150126-20:55:51: [WARN]: Local SystemMetadata pid count: 577993
metacat 20150126-20:57:49: [WARN]: processedCount (identifiers from iterator): 578077
metacat 20150126-20:58:04: [ERROR]: Error looking up missing system metadata for pid: http://dx.doi.org/10.5061/dryad.ft48k/3/bitstream [edu.ucsb.nceas.metacat.dataone.hazelcast.HazelcastService]
</pre>

p((((. Many of these ERROR messages appear. Is this right? The Metacat code needs to be checked to ensure it is nothing fatal.

p((((. There is no official end to the process in the log files; the logging simply stops. Do not gamble that everything is complete: go to II.A.1.a and start over.
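p((((. Because the log is noisy, it can help to summarize it rather than read it line by line. The sketch below extracts the two counts and tallies the lookup errors, using only the message texts shown above; the function name and the summary keys are made up for illustration.

<pre><code class="python">
import re

PID_COUNT = re.compile(r"Local SystemMetadata pid count: (\d+)")
PROCESSED = re.compile(r"processedCount \(identifiers from iterator\): (\d+)")
LOOKUP_ERROR = "[ERROR]: Error looking up missing system metadata for pid:"


def summarize_resync(lines):
    """Collect the last reported counts and the number of lookup errors."""
    summary = {"local_pid_count": None, "processed_count": None, "lookup_errors": 0}
    for line in lines:
        match = PID_COUNT.search(line)
        if match:
            summary["local_pid_count"] = int(match.group(1))
            continue
        match = PROCESSED.search(line)
        if match:
            summary["processed_count"] = int(match.group(1))
            continue
        if LOOKUP_ERROR in line:
            summary["lookup_errors"] += 1
    return summary


# The three sample lines from the log excerpt above.
sample_lines = [
    "metacat 20150126-20:55:51: [WARN]: Local SystemMetadata pid count: 577993",
    "metacat 20150126-20:57:49: [WARN]: processedCount (identifiers from iterator): 578077",
    "metacat 20150126-20:58:04: [ERROR]: Error looking up missing system metadata for pid: "
    "http://dx.doi.org/10.5061/dryad.ft48k/3/bitstream "
    "[edu.ucsb.nceas.metacat.dataone.hazelcast.HazelcastService]",
]
print(summarize_resync(sample_lines))
</code></pre>

p((((. To run it against the live log, pass an open file object for /var/metacat/logs/metacat.log instead of @sample_lines@. The summary does not indicate that the synchronization has finished, so the count check in II.A.1 is still required.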

p(((. d. Perform the 'Replicate Now' operation

p((((. i. Begin by tailing the log in /var/metacat/logs/replicate.log (tail -100f /var/metacat/logs/replicate.log)

p((((. ii. Click the button named 'Get All' under the label 'Replicate Now'

p((((. iii. Messages to expect in replicate.log:

p((((. Start of the Metacat replication process:

<pre>
metacat 2015-01-26T14:48:47: [INFO]: ReplicationService.handleForceReplicateDataFileRequest - Force replication request from: cn-ucsb-1.dataone.org/metacat/servlet/replication
</pre>

p((((. End of the Metacat replication process:

<pre>
metacat 2015-01-27T07:34:22: [WARN]: ForceReplicationHandler.run - exiting ForceReplicationHandler Thread
</pre>
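p((((. Unlike the Hazelcast synchronization, the replication log has a recognizable end message, so completion can be detected from the log. The sketch below looks for the two messages quoted above and reports the time between them. It assumes that the first "Force replication request" line marks the start of the run and that timestamps follow the format shown; the function name is hypothetical.

<pre><code class="python">
import re
from datetime import datetime

TIMESTAMP = re.compile(r"^metacat (\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}):")
START_MARKER = "Force replication request from"
END_MARKER = "exiting ForceReplicationHandler Thread"


def replication_window(lines):
    """Return (start, end) datetimes; either is None if its marker is absent."""
    start = None
    end = None
    for line in lines:
        match = TIMESTAMP.match(line)
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
        if START_MARKER in line and start is None:
            start = stamp  # keep only the first request line
        elif END_MARKER in line:
            end = stamp  # keep the most recent exit line
    return start, end


# The two sample lines from replicate.log quoted above.
sample_lines = [
    "metacat 2015-01-26T14:48:47: [INFO]: ReplicationService.handleForceReplicateDataFileRequest"
    " - Force replication request from: cn-ucsb-1.dataone.org/metacat/servlet/replication",
    "metacat 2015-01-27T07:34:22: [WARN]: ForceReplicationHandler.run"
    " - exiting ForceReplicationHandler Thread",
]
start, end = replication_window(sample_lines)
if start is not None and end is not None:
    print("replication finished after", end - start)
else:
    print("replication still running or not started")
</code></pre>

p((((. For the sample lines, the elapsed time is a little under 17 hours, which gives a rough idea of how long a full 'Get All' run can take.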
