Project

General

Profile

Bug #7391

Synchronization and replica updates fail after MN.updateSystemMetadata()

Added by Chris Jones over 8 years ago. Updated over 8 years ago.

Status:
Closed
Priority:
High
Assignee:
Category:
d1_synchronization
Start date:
2015-09-28
Due date:
% Done:

100%

Story Points:
Sprint:

Description

In the mixed V1/V2 DEV2 environment, a call to MN.updateSystemMetadata() successfully calls CN.synchronize(), but the synchronization ultimately fails, and so the CN and the replica member nodes fail to receive the updated system metadata. An example object is pid http://dx.doi.org/10.5061/dryad.37mm8/2/bitstream, which was updated to include a seriesId field. We expect that the CN will then get the new system metadata, and that the V2 MNs with replicas will also get the update.

The error message on cn-dev-ucsb-2 in /var/log/dataone/synchronization/cn-synchronization.log was (unfortunately) a bit terse:

[ INFO] 2015-09-28 12:21:10,541 (V2TransferObjectTask:validateSeriesId:544) Task-urn:node:mnDevUCSB2-http://dx.doi.org/10.5061/dryad.37mm8/2/bitstream SeriesId doesn't exist for any object on the CN...
[DEBUG] 2015-09-28 12:21:10,541 (V2TransferObjectTask:alreadyExists:578) Task-urn:node:mnDevUCSB2-http://dx.doi.org/10.5061/dryad.37mm8/2/bitstream entering alreadyExists...
[ INFO] 2015-09-28 12:21:10,543 (V2TransferObjectTask:alreadyExists:595) Task-urn:node:mnDevUCSB2-http://dx.doi.org/10.5061/dryad.37mm8/2/bitstream Pid Exists. Must be a systemMetadata update.
[DEBUG] 2015-09-28 12:21:10,543 (V2TransferObjectTask:processUpdates:870) Task-urn:node:mnDevUCSB2-http://dx.doi.org/10.5061/dryad.37mm8/2/bitstream entering processUpdates...
[ INFO] 2015-09-28 12:21:10,543 (V2TransferObjectTask:processUpdates:877) Task-urn:node:mnDevUCSB2-http://dx.doi.org/10.5061/dryad.37mm8/2/bitstream Processing as an Update
[ INFO] 2015-09-28 12:21:10,543 (V2TransferObjectTask:processUpdates:878) Task-urn:node:mnDevUCSB2-http://dx.doi.org/10.5061/dryad.37mm8/2/bitstream Getting sysMeta from HazelCast map
[ INFO] 2015-09-28 12:21:10,544 (SystemMetadataValidator:schemaValidateSystemMetadata:71) Entering schemaValidateSysMeta method...
[ INFO] 2015-09-28 12:21:10,544 (SystemMetadataValidator:validateEssentialProperties:186) The submitted checksum matches existing one
[ERROR] 2015-09-28 12:21:10,546 (V2TransferObjectTask:processTask:410) Task-urn:node:mnDevUCSB2-http://dx.doi.org/10.5061/dryad.37mm8/2/bitstream
null

The general Exception is thrown on line 410 of V2TransferObjectTask, with a "null" message. Given the messages above, it looks like we can narrow it down to the call to processUpdates() where things go south, although there are no more messages coming from that method. Needs a bit of logging to track this down.

Associated revisions

Revision 16525
Added by Rob Nahf over 8 years ago

refs #7391. Addressed possible NPEs in V2TransferObjectTask.processUpdates having do to with problems with either NodeRegistry access or node.services properties. refactored NPE handling to try to get stacktraces into the logs.

Revision 16525
Added by Rob Nahf over 8 years ago

refs #7391. Addressed possible NPEs in V2TransferObjectTask.processUpdates having do to with problems with either NodeRegistry access or node.services properties. refactored NPE handling to try to get stacktraces into the logs.

Revision 16572
Added by Rob Nahf over 8 years ago

refs: #7391: addressed two NPE situations, notably switching the NodeComm method from getNodeRegisteryService to getCnCore because that property is left null in a SyncObjectNodeComm.

Revision 16572
Added by Rob Nahf over 8 years ago

refs: #7391: addressed two NPE situations, notably switching the NodeComm method from getNodeRegisteryService to getCnCore because that property is left null in a SyncObjectNodeComm.

Revision 16588
Added by Rob Nahf over 8 years ago

refs #7391: refactored NodeComm to include the NodeRegistryService in the constructor for NodecommSyncObjectFactory. V2TransferObjectTash reverted back to using nodeRegistryService.

Revision 16588
Added by Rob Nahf over 8 years ago

refs #7391: refactored NodeComm to include the NodeRegistryService in the constructor for NodecommSyncObjectFactory. V2TransferObjectTash reverted back to using nodeRegistryService.

Revision 16735
Added by Rob Nahf over 8 years ago

refs: #7391. removing the check for different serialVersions before notifying replicaMN. clearer logging statements.

Revision 16735
Added by Rob Nahf over 8 years ago

refs: #7391. removing the check for different serialVersions before notifying replicaMN. clearer logging statements.

History

#1 Updated by Rob Nahf over 8 years ago

this is most likely a NullPointerException, and the stacktrace will be in the logs. Unfortunately, not caught in a grep unless you look for it (the -A or -B argument). I don't have ssh access to DEV2 machines, so Chris, can you run this grep and copy in the result?

grep -B20 "Task-urn:node:mnDevUCSB2-http://dx.doi.org/10.5061/dryad.37mm8/2/bitstream" /var/log/dataone/synchronization/cn-synchronization.log

(or -B30 or whatever catches the relevant portions of the stacktrace )

#2 Updated by Chris Jones over 8 years ago

Hey Rob, I've run the grep above but didn't see any stack trace. I vi'd the sync log and looked for the stacktrace around that timestamp (2015-09-28 12:19:11,121) but also didn't see anything. Looks like we need to figure out why we're not getting the stacktrace. I've added your ssh key to cn-dev-{ucsb|unm}-2 and mn-dev-{ucsb|unm}-{1|2} and put you in the sudo group.

#3 Updated by Rob Nahf over 8 years ago

looks like the ex.printStackTrace doesn't end up in any logs I looked through. the bare-line 'null' is the message from the exception, which is what NPEs use. By the other log messages, I can isolate the NPE to 2 lines of code in sync:

        NodeList nl = nodeCommunications.getNodeRegistryService().listNodes();
        boolean isV1Object = AuthUtils.isCNAuthorityForSystemMetadataUpdate(nl, newSystemMetadata);

calling https://mn-dev-ucsb-2.test.dataone.org/metacat/d1/mn/v1/meta/http://dx.doi.org/10.5061/dryad.37mm8/2/bitstream, I can rule out problems with the newSystemMetadata parameter, so there's an issue with either procuring the NodeRegisteryService, or the returned NodeList. NPEs will be emitted from AuthUtils.isCNAuthorityFor... if either the passed in NodeList is null, or the chosen Node record has no Services.

Will add better parameter checking to the AuthUtils methods, because null services is schema valid, although maybe not by business rules...
Also, will change the catch Exception to include the exception in the log statement so the stack trace has a better chance to make it into the logs.

#4 Updated by Rob Nahf over 8 years ago

  • Status changed from New to In Progress
  • % Done changed from 0 to 30

added more null checks and the exception to the logging statement.
.
.
.

#5 Updated by Chris Jones over 8 years ago

Rob,

Looking at the d1_sync log on cn-dev-ucsb-2, here's the NPE that you were suspecting:

[ERROR] 2015-10-05 07:58:04,899 (V2TransferObjectTask:processTask:409) Task-urn:node:mnDevUNM2-testV2CreateV1Archive_201527873421430
null
java.lang.NullPointerException
at org.dataone.cn.batch.synchronization.tasks.V2TransferObjectTask.processUpdates(V2TransferObjectTask.java:896)
at org.dataone.cn.batch.synchronization.tasks.V2TransferObjectTask.processTask(V2TransferObjectTask.java:392)
at org.dataone.cn.batch.synchronization.tasks.V2TransferObjectTask.call(V2TransferObjectTask.java:204)
at org.dataone.cn.batch.synchronization.tasks.V2TransferObjectTask.call(V2TransferObjectTask.java:110)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

which is coming from the call to get the node list:

NodeList nl = nodeCommunications.getNodeRegistryService().listNodes();

#6 Updated by Rob Nahf over 8 years ago

the same thing happening on DEV:

[ERROR] 2015-10-04 05:11:11,811 (V2TransferObjectTask:processTask:409) Task-urn:node:mnDemo8-P1_20152762259634
null
java.lang.NullPointerException
at org.dataone.cn.batch.synchronization.tasks.V2TransferObjectTask.processUpdates(V2TransferObjectTask.java:896)
at org.dataone.cn.batch.synchronization.tasks.V2TransferObjectTask.processTask(V2TransferObjectTask.java:392)
at org.dataone.cn.batch.synchronization.tasks.V2TransferObjectTask.call(V2TransferObjectTask.java:204)
at org.dataone.cn.batch.synchronization.tasks.V2TransferObjectTask.call(V2TransferObjectTask.java:110)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:745)

and I also see an NPE in validateChecksum when trying to create meaningful error message

[ERROR] 2015-10-04 05:31:14,873 (V2TransferObjectTask:processTask:409) Task-urn:node:mnDemo8-testUpdateSystemMetadata_MutableAccessPolicy_2015276222828801
null
java.lang.NullPointerException
at org.dataone.cn.batch.synchronization.tasks.V2TransferObjectTask.validateChecksum(V2TransferObjectTask.java:765)
at org.dataone.cn.batch.synchronization.tasks.V2TransferObjectTask.createObject(V2TransferObjectTask.java:654)
at org.dataone.cn.batch.synchronization.tasks.V2TransferObjectTask.processNewObject(V2TransferObjectTask.java:430)
at org.dataone.cn.batch.synchronization.tasks.V2TransferObjectTask.processTask(V2TransferObjectTask.java:394)
at org.dataone.cn.batch.synchronization.tasks.V2TransferObjectTask.call(V2TransferObjectTask.java:204)
at org.dataone.cn.batch.synchronization.tasks.V2TransferObjectTask.call(V2TransferObjectTask.java:110)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

the latter can be addressed through null checking.

The former, I'm not sure the reason for. If it's happening in two environments, it's either a client resource consumption issue, or a network one. the DEV-2 mn where the failure is happening is on the same LAN as the DEV2 CN, so less likely a network issue.

#7 Updated by Rob Nahf over 8 years ago

  • Tracker changed from Task to Bug

#8 Updated by Robert Waltz over 8 years ago

Jing reports that after the hazelcast changes, synchronize method still does not work.

The error in the synchronize log file reads, “The requested identifier tao.13365.1 is already used by another object and therefore can not be used for this object. Clients should choose a new identifier that is unique and retry the operation or use CN.reserveIdentifier() to reserve one.”

Looks like synchronization believes the call to be a create instead of an update.
From log file:

[DEBUG] 2015-10-13 15:54:10,675 (V2TransferObjectTask:alreadyExists:566) Task-urn:node:mnDemo6-tao.13365.1 entering alreadyExists...
[ INFO] 2015-10-13 15:54:10,809 (V2TransferObjectTask:alreadyExists:583) Task-urn:node:mnDemo6-tao.13365.1 Pid doesn't exist as reservation or object.

it does not appear that the object was found in the reserveIdentifier service though it was already created in the object store. maybe a hangover from a previous bug that has been since squashed?

#9 Updated by Rob Nahf over 8 years ago

This is happening because sync is still relying on hasReservation to determine if something is a new object or an update. If the Identifier Manager Service no longer throws IdentifierNotUnique, it will only see no reservation and assume it's a new object. This will fix when the changes are committed and deployed.

#10 Updated by Rob Nahf over 8 years ago

  • % Done changed from 30 to 50
  • Status changed from In Progress to Testing

#11 Updated by Rob Nahf over 8 years ago

  • % Done changed from 50 to 100
  • Status changed from Testing to Closed

reviewed in standup. Jing confirms that it works.

Also available in: Atom PDF

Add picture from clipboard (Maximum size: 14.8 MB)