Bug #7830

Metacat reports "The CnList has not been initialized" after restarting Tomcat

Added by Robert Waltz almost 8 years ago. Updated over 6 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: d1_libclient_java
Start date: 2016-06-16
Due date:
% Done: 100%
Story Points:
Sprint:

Description

This error occurred only on cn-ucsb-1, immediately after an upgrade and after the DNS RR had been switched back to cn-ucsb-1.

org.dataone.service.exceptions.ServiceFailure: class org.dataone.client.exception.ClientSideException: Error: The CnList has not been initialized!!!
at org.dataone.client.v2.impl.NodeListNodeLocator.getCNode(NodeListNodeLocator.java:99)
at org.dataone.client.v2.impl.NodeListNodeLocator.getCNode(NodeListNodeLocator.java:54)
at org.dataone.client.v2.itk.D1Client.getCN(D1Client.java:134)
at org.dataone.client.v2.itk.D1Client.getCN(D1Client.java:100)
at edu.ucsb.nceas.metacat.dataone.CNodeService.setReplicationStatus(CNodeService.java:882)
at edu.ucsb.nceas.metacat.dataone.v1.CNodeService.setReplicationStatus(CNodeService.java:189)
at edu.ucsb.nceas.metacat.restservice.v1.CNResourceHandler.setReplicationStatus(CNResourceHandler.java:1656)
at edu.ucsb.nceas.metacat.restservice.v1.CNResourceHandler.handle(CNResourceHandler.java:344)
at edu.ucsb.nceas.metacat.restservice.D1RestServlet.doPut(D1RestServlet.java:102)
...

Associated revisions

Revision 18501
Added by Rob Nahf over 7 years ago

refs: #6570, #7830: Refactored D1Client behavior under situations where it can't get a NodeList from CN_URL. Improved error messages, cleaned up some unnecessary complexity. Harmonized v1 and v2 implementation classes (NodeListNodeLocator, SettingsContextNL, D1Client). Added NodeLocator expiration to be able to pick up new MN BaseUrls periodically (5 minutes).

Revision 19054
Added by Rob Nahf over 6 years ago

refs #6570, #7830: Manual merge of refactoring found in trunk: Refactored D1Client behavior under situations where it can't get a NodeList from CN_URL. Improved error messages, cleaned up some unnecessary complexity. Harmonized v1 and v2 implementation classes (NodeListNodeLocator, SettingsContextNL, D1Client). Added NodeLocator expiration to be able to pick up new MN BaseUrls periodically (5 minutes).

History

#1 Updated by Robert Waltz almost 8 years ago

  • Tracker changed from Task to Bug

#2 Updated by Rob Nahf over 7 years ago

  • Status changed from New to In Progress
  • % Done changed from 0 to 30

Looked at this as part of a minor refactor and code cleanup of the D1Client classes. I believe the exception was caused by a communication problem with the CN while building the NodeLocator (from cn.listNodes).

The newer version should have more informative exception messages and logging.
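
As a rough illustration of the suspected failure mode, the sketch below caches a node list, refreshes it after a time-to-live, and wraps a failed fetch in an exception that names the underlying cause. It is not the d1_libclient_java code: the class `ExpiringNodeListSketch`, its `fetcher` and `clock` parameters, and the choice to keep serving a stale list when a refresh fails are all assumptions made for the example. The 5-minute default only mirrors the expiration interval mentioned in the revision notes.

```java
import java.util.List;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical sketch; not the real d1_libclient_java implementation.
class ExpiringNodeListSketch {
    static final long DEFAULT_TTL_MILLIS = 5 * 60 * 1000L;

    private final Supplier<List<String>> fetcher; // stand-in for a listNodes call
    private final LongSupplier clock;             // injected so expiry can be tested
    private final long ttlMillis;
    private List<String> cached;                  // null until the first successful fetch
    private long loadedAt;

    ExpiringNodeListSketch(Supplier<List<String>> fetcher, LongSupplier clock, long ttlMillis) {
        this.fetcher = fetcher;
        this.clock = clock;
        this.ttlMillis = ttlMillis;
    }

    synchronized List<String> get() {
        long now = clock.getAsLong();
        boolean expired = cached == null || now - loadedAt >= ttlMillis;
        if (expired) {
            try {
                cached = List.copyOf(fetcher.get());
                loadedAt = now;
            } catch (RuntimeException e) {
                if (cached == null) {
                    // Nothing to fall back on: report why, instead of a bare
                    // "not initialized" message.
                    throw new IllegalStateException(
                        "Could not build the node list; fetching it failed: " + e.getMessage(), e);
                }
                // Assumption: keep the stale list and retry on the next call.
            }
        }
        return cached;
    }

    public static void main(String[] args) {
        ExpiringNodeListSketch nodes = new ExpiringNodeListSketch(
            () -> List.of("urn:node:mnA"), System::currentTimeMillis, DEFAULT_TTL_MILLIS);
        System.out.println(nodes.get());
    }
}
```

Because the first fetch is attempted lazily and retried on the next call, a transient communication failure in this sketch does not leave the holder permanently uninitialized.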

#3 Updated by Chris Jones over 6 years ago

This issue is still a problem, and since it is intermittent, multiple Tomcat reboots may be necessary for the CN list to initialize correctly. This affects both Metacat MNs and the CNs.

However, I don't understand why there is a CnList at all. As I understand it, the call to @D1Client.getCN()@ should always return the Round Robin CN entry in the node list (e.g. for production: @https://cn.dataone.org/cn/v2/node/urn:node:CN@). The client should never be communicating with a CN directly (like @https://cn.dataone.org/cn/v2/node/urn:node:CNUCSB1@). Perhaps I'm missing something here, though.

#4 Updated by Rob Nahf over 6 years ago

The fix for this is only in trunk, so it hasn't been deployed yet. I am not sure why it would take multiple Tomcat restarts to fix a state problem in a Java class instance, so maybe something more than a bug in the code is going on.

#5 Updated by Rob Nahf over 6 years ago

Regarding the cnList, this construct is used only in the more generalized NodeListNodeLocator, the implementation that relies solely on a NodeList to map between nodeReference and baseUrl. (This implementation is used a lot in the integration tests, where we need to get behind the round-robin to test all CN instances.) D1Client uses a subclass of NodeListNodeLocator (SettingsContextNodeLocator) that overrides that general behavior and does what we want, which is to return the CNode using the base URL found in the libclient.properties file.

Regarding how the Round Robin CN entry in the NodeList is used:
* in NodeListNodeLocator: if it can determine that one of the CNs in the nodelist is a round robin CN, it is used. If there isn't one, the NodeListNodeLocator makes its own round-robin from the listed CNs.
* in SettingsContextNodeLocator: it is not used at all - the baseURL from libclient.properties is used. libclient_java ships with the production RR URL, but the CNs change that property to point to themselves.
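
The two lookup behaviors described above can be sketched roughly as follows. This is a simplified illustration, not the d1_libclient_java source: the class names (`CnSelectionSketch`, `NodeEntry`, `ListLocator`, `SettingsLocator`), the `roundRobin` flag used to recognize the RR entry, and the example.org URLs are all hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; none of these types are the real d1_libclient_java classes.
class CnSelectionSketch {

    /** Simplified node list entry. The roundRobin flag is an assumption. */
    static final class NodeEntry {
        final String id;
        final String baseUrl;
        final boolean cn;
        final boolean roundRobin;

        NodeEntry(String id, String baseUrl, boolean cn, boolean roundRobin) {
            this.id = id;
            this.baseUrl = baseUrl;
            this.cn = cn;
            this.roundRobin = roundRobin;
        }
    }

    /** General behavior: choose the CN using only the node list. */
    static class ListLocator {
        private final List<NodeEntry> cns = new ArrayList<>();
        private final AtomicInteger next = new AtomicInteger();

        ListLocator(List<NodeEntry> nodes) {
            for (NodeEntry n : nodes) {
                if (n.cn) {
                    cns.add(n);
                }
            }
        }

        String cnBaseUrl() {
            if (cns.isEmpty()) {
                throw new IllegalStateException("The node list contains no CNs");
            }
            // Prefer an entry that is recognizably the round-robin CN.
            for (NodeEntry n : cns) {
                if (n.roundRobin) {
                    return n.baseUrl;
                }
            }
            // Otherwise rotate through the listed CNs ourselves.
            int i = Math.floorMod(next.getAndIncrement(), cns.size());
            return cns.get(i).baseUrl;
        }
    }

    /** Settings-based behavior: ignore the list and use the configured URL. */
    static class SettingsLocator extends ListLocator {
        private final String configuredCnUrl;

        SettingsLocator(List<NodeEntry> nodes, String configuredCnUrl) {
            super(nodes);
            this.configuredCnUrl = configuredCnUrl;
        }

        @Override
        String cnBaseUrl() {
            return configuredCnUrl;
        }
    }

    public static void main(String[] args) {
        List<NodeEntry> nodes = List.of(
            new NodeEntry("urn:node:cnA", "https://cn-a.example.org/cn", true, false),
            new NodeEntry("urn:node:cnB", "https://cn-b.example.org/cn", true, false),
            new NodeEntry("urn:node:mnA", "https://mn-a.example.org/mn", false, false));

        ListLocator general = new ListLocator(nodes);
        System.out.println(general.cnBaseUrl());
        System.out.println(general.cnBaseUrl());

        ListLocator fromSettings = new SettingsLocator(nodes, "https://cn.example.org/cn");
        System.out.println(fromSettings.cnBaseUrl());
    }
}
```

In this sketch the settings-based locator never consults the CN entries, so its answer does not depend on whether the node list contained any CNs.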

#6 Updated by Rob Nahf over 6 years ago

Copied the changes into the 2.3 branch so that Metacat can use them for its next deployment.

#7 Updated by Rob Nahf over 6 years ago

  • Status changed from In Progress to Testing
  • % Done changed from 30 to 50

...off to sandbox testing

#8 Updated by Rob Nahf over 6 years ago

  • Target version changed from CCI-2.4.0 to CCI-2.3.7

#9 Updated by Rob Nahf over 6 years ago

  • % Done changed from 50 to 80
  • Status changed from Testing to In Review

#10 Updated by Rob Nahf over 6 years ago

  • % Done changed from 80 to 100
  • Status changed from In Review to Closed

Jing reports success in Metacat - it didn't break things, but he doesn't have specific tests.
