DataONE Tasks: Issues https://redmine.dataone.org/ 2020-06-17T21:49:55Z DataONE Tasks
Redmine CN REST - Story #8864 (New): Synchronization does not register authoritative replica entry correctly https://redmine.dataone.org/issues/8864 2020-06-17T21:49:55Z Chris Jones cjones@nceas.ucsb.edu
<p>When objects are synchronized to the CN, the <code>d1_synchronization</code> component fetches the system metadata <br>
for each object and adds a <code>&lt;replica&gt;</code> entry for the origin node (such as <code>urn:node:ESS_DIVE</code>), <br>
as well as entries for other copies (for instance, for science metadata copied to the CN, <br>
a <code>&lt;replica&gt;urn:node:CN&lt;/replica&gt;</code> entry is added).</p>
<p>In some instances, the origin replica entry is not added to the replica list.<br><br>
This causes downstream problems for the <code>d1_replication</code> component, which relies on the origin node <br>
replica entry being present in order to set up a replication request to a target node. I'm seeing errors like:</p>
<pre>/var/log/dataone/replicate/cn-replication.log.90:[ERROR] 2020-06-04 05:18:30,179 [pool-15-thread-1] (MNCommunication:requestReplication:34) Could not determine replication source node for replication request for pid: ess-dive-eb6cbb22c605506-20200122T170607966. Replication request failed.
</pre>
<p>Looking back in the logs, this is the case for the following objects:</p>
<pre>ess-dive-3947e68e9825233-20180621T213650539
ess-dive-3b8d9f4513e02f9-20180621T214221437
ess-dive-467a6c3dda4dc88-20180621T211148554
ess-dive-51f345daca126f7-20180328T160350610716
ess-dive-53b37ae5d8c0f51-20200219T211634419654
ess-dive-6b688fab5524c46-20200121T210154766
ess-dive-7a31346c154f02b-20200127T155012488
ess-dive-a1fb05cbd903309-20200130T190835651
ess-dive-b420b097851c716-20180523T161714606
ess-dive-ba81a8a8e0bef31-20180727T200828345
ess-dive-bfaf3d6d6fd038c-20180716T154005175903
ess-dive-c2ef5f3af108c9c-20180621T220020545
ess-dive-eb6cbb22c605506-20200122T170607966
ess-dive-f3238db16593de5-20180621T215956950
</pre>
<p>We need to fix this issue in <code>d1_synchronization</code> so replication runs correctly and monthly <br>
replica auditing (done by ESS_DIVE) doesn't flag these issues.</p>
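<p>As a rough illustration of the check described above, the sketch below tests whether a replica list contains the origin node and adds it when missing. It is not the <code>d1_synchronization</code> code: the class <code>ReplicaCheck</code>, its methods, and the use of plain strings for node identifiers are all hypothetical simplifications of the real system metadata types.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

/** Hypothetical, simplified stand-in for a system metadata replica list (node ids as strings). */
final class ReplicaCheck {

    /** Returns true when the replica node list contains the origin (authoritative) node. */
    static boolean hasOriginReplica(List<String> replicaNodeIds, String originNodeId) {
        Objects.requireNonNull(originNodeId, "originNodeId");
        return replicaNodeIds.contains(originNodeId);
    }

    /** Returns a copy of the list in which the origin node is guaranteed to be present. */
    static List<String> withOriginReplica(List<String> replicaNodeIds, String originNodeId) {
        List<String> result = new ArrayList<>(replicaNodeIds);
        if (!hasOriginReplica(result, originNodeId)) {
            // Put the origin first so later lookups of the replication source find it quickly.
            result.add(0, originNodeId);
        }
        return result;
    }

    public static void main(String[] args) {
        // A list holding only the CN copy, as in the failing objects described above.
        List<String> replicas = List.of("urn:node:CN");
        System.out.println(hasOriginReplica(replicas, "urn:node:ESS_DIVE"));
        System.out.println(withOriginReplica(replicas, "urn:node:ESS_DIVE"));
    }
}
```

<p>A check of this kind could run at the end of synchronization or as a one-off audit over the identifiers listed above; where it belongs in the real component is an open question.</p>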
Infrastructure - Story #8796 (New): Various issues with service access after upgrade to 18.04 https://redmine.dataone.org/issues/8796 2019-05-14T23:57:48Z Dave Vieglais dave.vieglais@gmail.com
<p>Users have reported some issues with CNs after upgrades to 18.04. See individual issues for details.</p>
Infrastructure - Story #8782 (New): Upgrade OS to Ubuntu 18.04 on CNs https://redmine.dataone.org/issues/8782 2019-03-22T18:12:12Z Jing Tao tao@nceas.ucsb.edu
Member Nodes - Story #8683 (New): USGS SDC: redeploy as a v2 Slender Node with GMN https://redmine.dataone.org/issues/8683 2018-08-22T16:25:08Z Amy Forrester aforres4@utk.edu
Infrastructure - Story #8525 (In Progress): timeout exceptions thrown from Hazelcast disable sync... https://redmine.dataone.org/issues/8525 2018-03-27T22:36:54Z Rob Nahf rnahf@epscor.unm.edu
<p>Very occasionally, synchronization disables itself when RuntimeExceptions bubble up. The most common of these is when the Hazelcast client seemingly disconnects, or can't complete an operation, and a java.util.concurrent.TimeoutException is thrown.</p>
<p>These are usually due to network problems, as evidenced by timeout exceptions appearing in both the Metacat hazelcast-storage.log files and the d1-processing logs.</p>
<p>Temporary problems like this should be recoverable, so a retry or bypass for those timeouts should be implemented. It's not clear whether a new HazelcastClient should be instantiated or whether the same client is still usable. (Is the client tightly bound to a session, or does it recover?) If a new client is needed, a preliminary search through the code indicates that the HazelcastClientFactory.getProcessingClient() method is used in only a few places, so the singleton behavior it relies on can be sidestepped by removing the method and replacing it with a getLock() wrapper method (that seems to be the dominant use case for it). See the newer SyncQueueFacade in d1_synchronization for guidance on that. If the client is never exposed, it can be refreshed as needed.</p>
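<p>A minimal sketch of the retry idea, using only the JDK. The class <code>TimeoutRetry</code> and its method names are hypothetical and do not exist in d1_synchronization. It assumes the timeout may appear either in the cause chain or only in the exception message, and it does not address whether the Hazelcast client itself needs to be re-created.</p>

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: retries an operation when it fails with a timeout. */
final class TimeoutRetry {

    /** True if a TimeoutException is in the cause chain or named in any message along it. */
    static boolean isTimeout(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof TimeoutException) {
                return true;
            }
            // Assumption: a deserialized client exception may carry the timeout only as text.
            String msg = c.getMessage();
            if (msg != null && msg.contains("TimeoutException")) {
                return true;
            }
        }
        return false;
    }

    /**
     * Calls op up to maxAttempts times, sleeping between attempts with a doubling delay.
     * Non-timeout failures and the final timeout are rethrown unchanged.
     */
    static <T> T callWithRetry(Callable<T> op, int maxAttempts, long initialDelayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                if (!isTimeout(e) || attempt >= maxAttempts) {
                    throw e;
                }
                Thread.sleep(delay);
                delay *= 2;
            }
        }
    }
}
```

<p>A queue poll could then be wrapped as <code>TimeoutRetry.callWithRetry(() -&gt; queue.poll(), 3, 500)</code>, where <code>queue</code> and the attempt and delay values are placeholders to be tuned.</p>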
<pre>root@cn-unm-1:/var/metacat/logs# grep FATAL hazelcast-storage.log.1
[FATAL] 2018-03-27 03:15:19,380 (BaseManager$2:run:1402) [64.106.40.6]:5701 [DataONE] Caught error while calling event listener; cause: [CONCURRENT_MAP_CONTAINS_KEY] Operation Timeout (with no response!): 0
</pre><pre>[ERROR] 2018-03-27 03:15:19,781 [ProcessDaemonTask1] (SyncObjectTaskManager:run:84) java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.util.concurrent.TimeoutException: [CONCURRENT_MAP_REMOVE] Operation Timeout (with no response!): 0
java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.util.concurrent.TimeoutException: [CONCURRENT_MAP_REMOVE] Operation Timeout (with no response!): 0
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at org.dataone.cn.batch.synchronization.SyncObjectTaskManager.run(SyncObjectTaskManager.java:76)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: java.util.concurrent.TimeoutException: [CONCURRENT_MAP_REMOVE] Operation Timeout (with no response!): 0
at com.hazelcast.impl.ClientServiceException.readData(ClientServiceException.java:63)
at com.hazelcast.nio.Serializer$DataSerializer.read(Serializer.java:104)
at com.hazelcast.nio.Serializer$DataSerializer.read(Serializer.java:79)
at com.hazelcast.nio.AbstractSerializer.toObject(AbstractSerializer.java:121)
at com.hazelcast.nio.AbstractSerializer.toObject(AbstractSerializer.java:156)
at com.hazelcast.client.ClientThreadContext.toObject(ClientThreadContext.java:72)
at com.hazelcast.client.IOUtil.toObject(IOUtil.java:34)
at com.hazelcast.client.ProxyHelper.getValue(ProxyHelper.java:186)
at com.hazelcast.client.ProxyHelper.doOp(ProxyHelper.java:146)
at com.hazelcast.client.ProxyHelper.doOp(ProxyHelper.java:140)
at com.hazelcast.client.QueueClientProxy.innerPoll(QueueClientProxy.java:115)
at com.hazelcast.client.QueueClientProxy.poll(QueueClientProxy.java:111)
at org.dataone.cn.batch.synchronization.type.SyncQueueFacade.poll(SyncQueueFacade.java:231)
at org.dataone.cn.batch.synchronization.tasks.SyncObjectTask.call(SyncObjectTask.java:131)
at org.dataone.cn.batch.synchronization.tasks.SyncObjectTask.call(SyncObjectTask.java:73)
</pre>
Testing MN Management - Story #8463 (New): test: Testing & Development https://redmine.dataone.org/issues/8463 2018-03-01T21:04:41Z Amy Forrester aforres4@utk.edu
<p>Install or develop a functional member node to be registered to a non-production environment. </p>
Infrastructure - Story #8368 (New): Update backup strategy for jenkins job configurations to subv... https://redmine.dataone.org/issues/8368 2018-02-15T17:37:09Z Dave Vieglais dave.vieglais@gmail.com
Testing MN Management - Story #8347 (New): Testing & Development https://redmine.dataone.org/issues/8347 2018-02-08T15:28:11Z Monica Ihli email@monicaihli.com
<p>Install or develop a functional member node to be registered to a non-production environment. </p>
Member Nodes - Story #8244 (New): Upgrade Member Node to current version of Metacat (IOE) https://redmine.dataone.org/issues/8244 2018-01-22T16:38:20Z Dave Vieglais dave.vieglais@gmail.com
<p>The MN operations have been placed in the care of Todd Kipfer. He has requested some assistance with use of Metacat, and especially bringing it to the latest version.</p>
Infrastructure - Story #8234 (New): Use University of Kansas ORCID membership to support authenti... https://redmine.dataone.org/issues/8234 2018-01-09T02:00:28Z Dave Vieglais dave.vieglais@gmail.com
<p><a href="https://orcid.org/members/001G000001CAkZgIAL-university-of-kansas" class="external">KU is a premium ORCID member</a> as a member of the Greater Western Library Alliance (GWLA). As a result, KU has access to five ORCID API keys. One is currently in use for the KU DSpace instance.</p>
<p>The goal of this story is to leverage one of the remaining API keys to support ORCID authentication in the DataONE production environment.</p>
Infrastructure - Story #8036 (New): synchronization should respond to various MN down conditions ... https://redmine.dataone.org/issues/8036 2017-03-03T17:30:35Z Rob Nahf rnahf@epscor.unm.edu
<p>Currently, synchronization <em>does</em> heed the Node.status='DOWN' in harvesting, but this is limited, especially when there are long delays between harvest and processing. tDAR uses HTTP 502 / 503 responses (not sure which) to signal that the node is temporarily down, for example. Also, for network segregation events, where a node cannot even be reached, synchronization should halt processing - the member node cannot even be notified of sync failures in these situations. </p>
<p>One implementation idea is to add an Observer pattern to libclient (and/or NodeComms in synchronization) so the task can be retried at a later time. Synchronization should also try to read the Retry-After header sent with 503 responses.</p>
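<p>Reading the Retry-After header could look roughly like the sketch below. The header value is either a number of seconds or an HTTP date, so both forms are handled. The class <code>RetryAfter</code> and its <code>parse</code> method are hypothetical and not part of libclient; how the header value is obtained from the HTTP response is left out.</p>

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;

/** Hypothetical helper: turns a Retry-After header value into a wait duration. */
final class RetryAfter {

    /**
     * Parses a Retry-After value relative to the given instant.
     * Returns empty when the header is absent or not understood.
     */
    static Optional<Duration> parse(String headerValue, Instant now) {
        if (headerValue == null) {
            return Optional.empty();
        }
        String value = headerValue.trim();
        if (value.matches("\\d+")) {
            // Delay expressed in seconds, e.g. "120".
            try {
                return Optional.of(Duration.ofSeconds(Long.parseLong(value)));
            } catch (NumberFormatException e) {
                return Optional.empty(); // digits only, but too large for a long
            }
        }
        try {
            // Absolute HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
            Instant when = ZonedDateTime
                    .parse(value, DateTimeFormatter.RFC_1123_DATE_TIME)
                    .toInstant();
            Duration wait = Duration.between(now, when);
            // A date in the past means the request can be retried immediately.
            return Optional.of(wait.isNegative() ? Duration.ZERO : wait);
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }
}
```

<p>When the result is empty, the caller would fall back to its own default delay before retrying the task.</p>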
Infrastructure - Story #7889 (New): Synchronization not happening when authoritativeMN is not set... https://redmine.dataone.org/issues/7889 2016-09-13T17:19:40Z Rob Nahf rnahf@epscor.unm.edu
<p>This leads to an uncorrectable situation because system metadata updates can only happen from the authoritativeMN. We need to add validation upon first synchronization such that the authoritativeMN is a registeredMN, even if it is not really present (state=down).</p>
Infrastructure - Story #7882 (In Progress): Tune CN logfile management https://redmine.dataone.org/issues/7882 2016-09-09T20:13:07Z Dave Vieglais dave.vieglais@gmail.com
<p>There are many log files generated on the Coordinating Nodes which can make diagnostics challenging. Some logs also appear to be misconfigured or set to log at DEBUG level even on production systems, resulting in extremely verbose logs.</p>
<p>The goal of this activity is to streamline logging to make it easier to find useful information in the logs by reducing verbosity, consolidating where possible, and perhaps refining some log messages.</p>
Python GMN - Story #7219 (In Progress): Upgrade Django version used by GMN, add better support fo... https://redmine.dataone.org/issues/7219 2015-06-16T14:56:59Z Dave Vieglais dave.vieglais@gmail.com
Infrastructure - Story #7183 (New): Update wild card server certificate on all test.dataone.org s... https://redmine.dataone.org/issues/7183 2015-06-15T14:54:13Z Dave Vieglais dave.vieglais@gmail.com
<p>The *.test.dataone.org server certificate expires in July.</p>
<p>A replacement has been ordered and will be stored in subversion:</p>
<p>AdminAccounts.txt</p>
<p>The servers impacted are listed on the Google Sheet:</p>
<p><a href="https://docs.google.com/spreadsheets/d/1BrZgm0yPV9dzd6SIfjQ9P5W5WH666xaGAI6KeXTAiFs/edit#gid=0">https://docs.google.com/spreadsheets/d/1BrZgm0yPV9dzd6SIfjQ9P5W5WH666xaGAI6KeXTAiFs/edit#gid=0</a></p>