DataONE Tasks: Issueshttps://redmine.dataone.org/https://redmine.dataone.org/favicon.ico2019-06-19T02:03:44ZDataONE Tasks
Redmine Infrastructure - Story #8823 (New): Recent Apache and OpenSSL combinations break connectivity on ...https://redmine.dataone.org/issues/88232019-06-19T02:03:44ZDave Vieglaisdave.vieglais@gmail.com
<p>The latest Ubuntu 18.04 release of Apache is 2.4.29 and OpenSSL is 1.1.1.</p>
<p>This combination creates a significant delay in TLS renegotiation that results from the Apache config option on the CNs:</p>
<pre>SSLVerifyClient none
<Location "/cn">
<If " ! ( %{HTTP_USER_AGENT} =~ /(windows|chrome|mozilla|safari|webkit)/i )">
SSLVerifyClient optional
</If>
</Location>
</pre>
<p>Which is intended to disable client certificate authentication for web browsers, but allow it for others. This approach worked fine on older Apache / OpenSSL but the new combination creates a several second wait when the server discovers the client is not a web browser and tells it to reconnect with the option of including a client certificate.</p>
<p>The latest released version of Apache is 2.4.39 and this is available through a PPA intended for Debian developers. This has been installed so far on dev-2, sandbox, stage, and stage-2 with the process:</p>
<pre>sudo add-apt-repository ppa:ondrej/apache2
sudo apt update
sudo apt dist-upgrade
sudo systemctl restart apache2
</pre>
<p>This installs Apache 2.4.39 and OpenSSL 1.1.1c which appears to resolve the apparent bug in the 2.4.29 / 1.1.1 combination.</p>
<p>One issue with the update is that by default, Apache now offers TLSv1.3, which is great except that it appears to cause problems with at least Python clients failing to connect and getting a 403 error. For example:</p>
<pre>$ python3
>>> import requests
>>> r = requests.get("https://cn-sandbox-ucsb-1.test.dataone.org/cn/v2/monitor/ping")
>>> r.status_code
403
</pre>
<p>That TLSv1.3 is the problem was verified with cn-stage-unm-2 by configuring Apache with:</p>
<pre> SSLProtocol all -TLSv1.3 -SSLv2 -SSLv3
</pre>
<p>to disable TLSv1.3. After this change the Python client was able to connect as expected.</p>
<p>A workaround has not yet been researched.</p>
<p>It is not clear if this issue applies to other clients such as R and Java, so until we learn one way or the other, TLSv1.3 will be disabled on the CNs.</p>
<p>--This issue will likely apply to Member Nodes as well once TLSv1.3 is generally available or if MNs choose to install Apache 2.4.39.-- CORRECTION: this issue only applies when attempting to renegotiate TLS after headers have been transferred, so will not typically apply to a MN.</p>
CN REST - Story #8757 (New): Fix getChecksum() in MNAuditTask to use dynamic checksum algorithmshttps://redmine.dataone.org/issues/87572019-01-14T16:46:33ZChris Jonescjones@nceas.ucsb.edu
<p>The <code>MNAuditTask.call()</code> method is hardcoded to use <code>MD5</code> checksums on line 277. It requests the Member Node to generate an <code>MD5</code> checksum, and then compares that checksum to the checksum stated in the Coordinating Node<code>s cached</code>SystemMetadata.checksum<code>field for the object. This obviously will fail for objects that submitted objects using</code>SHA-1` or other algorithms.</p>
CN REST - Story #8756 (New): Ensure replica auditor is effectivehttps://redmine.dataone.org/issues/87562019-01-12T20:25:18ZChris Jonescjones@nceas.ucsb.edu
<p>The replication auditor service is currently configured to audit all objects every 90 days. As documented in <a class="issue tracker-4 status-1 priority-4 priority-default child" title="Story: Replica Auditing service is throwing errors (New)" href="https://redmine.dataone.org/issues/8582">#8582</a>, the auditor is not working correctly. While the errors being thrown that are described in that ticket seem to be limited to <code>pid</code>s with certain characters in them, I think the whole auditor process is not keeping up with our content.</p>
<p>Looking at the number of objects on each member node that haven't been audited in the last 90 days, auditing is well behind (if we consider it working at all):</p>
<pre>SELECT sm.authoritive_member_node, count(smr.guid) AS count
FROM systemmetadata sm INNER JOIN smreplicationstatus smr
ON sm.guid = smr.guid
WHERE
smr.member_node != 'urn:node:CN' AND
sm.date_uploaded < (SELECT CURRENT_DATE - interval '90 days') AND
smr.date_verified < (SELECT CURRENT_DATE - interval '90 days')
GROUP BY sm.authoritive_member_node
ORDER BY count DESC;
authoritive_member_node | count
-------------------------+--------
urn:node:ARCTIC | 771872
urn:node:PANGAEA | 507456
urn:node:LTER | 416339
urn:node:DRYAD | 374439
urn:node:CDL | 242115
urn:node:PISCO | 235791
urn:node:KNB | 86075
urn:node:TDAR | 75639
urn:node:NCEI | 50974
urn:node:USGS_SDC | 40290
urn:node:TERN | 31671
urn:node:ESS_DIVE | 28830
urn:node:NMEPSCOR | 16042
urn:node:GOA | 9266
urn:node:IARC | 7677
urn:node:NRDC | 6673
urn:node:TFRI | 6478
urn:node:PPBIO | 3464
urn:node:ORNLDAAC | 3328
urn:node:FEMC | 2430
urn:node:EDI | 2098
urn:node:GRIIDC | 2065
urn:node:mnTestKNB | 2010
urn:node:SANPARKS | 2008
urn:node:ONEShare | 1874
urn:node:R2R | 1787
urn:node:USGSCSAS | 1151
urn:node:EDACGSTORE | 1075
urn:node:US_MPC | 1032
urn:node:RW | 970
urn:node:KUBI | 516
urn:node:NEON | 487
urn:node:LTER_EUROPE | 343
urn:node:IOE | 279
urn:node:RGD | 273
urn:node:ESA | 272
urn:node:NKN | 218
urn:node:OTS_NDC | 126
urn:node:BCODMO | 115
urn:node:SEAD | 90
urn:node:mnTestNKN | 50
urn:node:EDORA | 28
urn:node:ONEShare.pem | 22
urn:node:CLOEBIRD | 17
urn:node:mnTestBCODMO | 11
urn:node:USANPN | 10
urn:node:mnTestTDAR | 10
urn:node:MyMemberNode | 1
</pre>
<p>The table above represents the number of un-audited objects (in the last 90 days), but I get the feeling that the auditor isn't able to audit any of the content it is charged to audit given 1) the frequency, 2) the number of threads allotted, and 3) the configured batch count (seems way too low). <del>Note that this query excludes replicated content - this is just the original objects</del> (After looking at my query again, I think the join is including all replicas - the total is 2,935,787, which is greater than the total objects in the system (2,751,136), so this query needs to be refined).</p>
<p>We need to evaluate the true effectiveness of the auditor. Some strategies may include: 1) looking to see if we may be in an infinite loop on processing a few <code>pid</code>s due to the issues in <a class="issue tracker-4 status-1 priority-4 priority-default child" title="Story: Replica Auditing service is throwing errors (New)" href="https://redmine.dataone.org/issues/8582">#8582</a>, 2) seeing if we can increase the batch size by increasing the total threads allocated in the executor, and 3) decide if we need to offload the process from the CNs and distribute the workload across a cluster of workers that can do the auditing faster. Needs some thought and discussion.</p>
Infrastructure - Story #8639 (New): Replication performance is too slow to service demandhttps://redmine.dataone.org/issues/86392018-07-04T11:17:55ZDave Vieglaisdave.vieglais@gmail.com
<p>The replication process is operating too slowly to service demand resulting in lengthy delays to completion of replication tasks for new and changed content.</p>
<p>This is particularly apparent in the stage environment where perhaps the number of orphaned objects and deprecated / defunct nodes is interfering with expected behaviors.</p>
<p>Goal of this story is to identify and address the immediate issues. Any significant refactoring of the replication process should be captured under another story / epic.</p>
Search UI - Story #8574 (New): PANGAEA Temporary Fix: SID only in Data Citationhttps://redmine.dataone.org/issues/85742018-04-30T13:26:15ZMonica Ihliemail@monicaihli.com
<p>Adjust appearance of data citation in the case of formatid <a href="http://www.isotc211.org/2005/gmd-pangaea">http://www.isotc211.org/2005/gmd-pangaea</a> such that the SID alone appears in the data citation and not the PID. This is intended as a temporary fix to be in place only while a longer term strategy for customizing appearances of data citations is designed and implemented. Once that longer term strategy is in place, whatever temporary code level changes were implemented to accommodate PANGAEA should be removed.</p>
Member Nodes - Story #8535 (In Progress): Neotoma: Story: Discovery & Planninghttps://redmine.dataone.org/issues/85352018-04-09T21:14:39ZAmy Forresteraforres4@utk.edu
<p>Discovery: establish contact and build relationship with a potential new member node. Determine if DataONE and the repository are a good fit for one another and if the repository generally meets the requirements of DataONE member nodes. </p>
<p>Planning: If the repository and DataONE have agreed to proceed with deployment as a member node. Decisions will be made as to how to proceed with development. Node operators will receive training. </p>
<p>This story is complete when a determination is made to either proceed with planning a new deployment, or that joining DataONE is not an option for the repository at this time.</p>
<p><strong>Record initial communication here</strong></p>
Infrastructure - Story #8525 (In Progress): timeout exceptions thrown from Hazelcast disable sync...https://redmine.dataone.org/issues/85252018-03-27T22:36:54ZRob Nahfrnahf@epscor.unm.edu
<p>Very occasionally, synchronization disables itself when RuntimeExceptions bubble up. The most common of these is when the Hazelcast client seemingly disconnects, or can't complete an operation, and a java.util.concurrent.TimeoutException is thrown.</p>
<p>These are usually due to network problems, as evidenced by timeout exceptions appearing in both the Metacat hazelcast-storage.log files as well as d1-processing logs.</p>
<p>Temporary problems like this should be recoverable, and so a retry or bypass for those timeouts should be implemented. It's not clear whether or not a new HazelcastClient should be instantiated, or whether the same client is still usable. (Is the client tightly bound to a session, or does it recover?) If a new client is needed, preliminary searching through the code indicates that refactoring the HazelcastClientFactory.getProcessingClient() method is only used in a few places, and the singleton behavior it uses can be sidestepped by removing the method and replacing it with a getLock() wrapper method (that seems to be the dominant use case for it). See the newer SyncQueueFacade in d1_synchronization for guidance on that. If the client is never exposed, it can be refreshed as needed.</p>
<pre>root@cn-unm-1:/var/metacat/logs# grep FATAL hazelcast-storage.log.1
[FATAL] 2018-03-27 03:15:19,380 (BaseManager$2:run:1402) [64.106.40.6]:5701 [DataONE] Caught error while calling event listener; cause: [CONCURRENT_MAP_CONTAINS_KEY] Operation Timeout (with no response!): 0
</pre><pre>[ERROR] 2018-03-27 03:15:19,781 [ProcessDaemonTask1] (SyncObjectTaskManager:run:84) java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.util.concurrent
.TimeoutException: [CONCURRENT_MAP_REMOVE] Operation Timeout (with no response!): 0
java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.util.concurrent.TimeoutException: [CONCURRENT_MAP_REMOVE] Operation Timeout (with no response!): 0
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at org.dataone.cn.batch.synchronization.SyncObjectTaskManager.run(SyncObjectTaskManager.java:76)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: java.util.concurrent.TimeoutException: [CONCURRENT_MAP_REMOVE] Operation Timeout (with no response!): 0
at com.hazelcast.impl.ClientServiceException.readData(ClientServiceException.java:63)
at com.hazelcast.nio.Serializer$DataSerializer.read(Serializer.java:104)
at com.hazelcast.nio.Serializer$DataSerializer.read(Serializer.java:79)
at com.hazelcast.nio.AbstractSerializer.toObject(AbstractSerializer.java:121)
at com.hazelcast.nio.AbstractSerializer.toObject(AbstractSerializer.java:156)
at com.hazelcast.client.ClientThreadContext.toObject(ClientThreadContext.java:72)
at com.hazelcast.client.IOUtil.toObject(IOUtil.java:34)
at com.hazelcast.client.ProxyHelper.getValue(ProxyHelper.java:186)
at com.hazelcast.client.ProxyHelper.doOp(ProxyHelper.java:146)
at com.hazelcast.client.ProxyHelper.doOp(ProxyHelper.java:140)
at com.hazelcast.client.QueueClientProxy.innerPoll(QueueClientProxy.java:115)
at com.hazelcast.client.QueueClientProxy.poll(QueueClientProxy.java:111)
at org.dataone.cn.batch.synchronization.type.SyncQueueFacade.poll(SyncQueueFacade.java:231)
at org.dataone.cn.batch.synchronization.tasks.SyncObjectTask.call(SyncObjectTask.java:131)
at org.dataone.cn.batch.synchronization.tasks.SyncObjectTask.call(SyncObjectTask.java:73)
</pre> Infrastructure - Story #8368 (New): Update backup strategy for jenkins job configurations to subv...https://redmine.dataone.org/issues/83682018-02-15T17:37:09ZDave Vieglaisdave.vieglais@gmail.comMember Nodes - Story #8358 (In Progress): Discovery & Planning (CAFF)https://redmine.dataone.org/issues/83582018-02-12T16:25:06ZAmy Forresteraforres4@utk.edu
<p>Discovery is about establishing contact and building a relationship with a potential new member node. In this phase, it is determined if DataONE and the repository are a good fit for one another and if the repository generally meets the requirements of DataONE member nodes. Broad discussions of deployment options may be reviewed as well.<br>
This story is complete when a determination is made to either proceed with planning a new deployment, or that joining DataONE is not an option for the repository at this time.</p>
Member Nodes - Story #8244 (New): Upgrade Member Node to current version of Metacat (IOE)https://redmine.dataone.org/issues/82442018-01-22T16:38:20ZDave Vieglaisdave.vieglais@gmail.com
<p>The MN operations have been placed in the care of Todd Kipfer. He has requested some assistance with use of Metacat, and especially bringing it to the latest version.</p>
Infrastructure - Story #8173 (New): add checks for retrograde systemMetadata changeshttps://redmine.dataone.org/issues/81732017-09-01T19:42:33ZRob Nahfrnahf@epscor.unm.edu
<p>with the ability to prioritize and the introduction of parallelized index task processing, the effective queue is not guaranteed to be time-ordered. If there are two valid system metadata changes resulting in two tasks and the second change hits the index first, the earlier task should be rejected, as its changes are out of date.</p>
Search UI - Story #7754 (In Progress): Support for XSL transform of various metadata formatshttps://redmine.dataone.org/issues/77542016-04-27T14:43:23ZDave Vieglaisdave.vieglais@gmail.com
<p>Currently the DCX and ISO metadata formats are being rendered in the view service using solr output rather than a transform of the XML metadata. This results in a less than satisfactory rendering.</p>
<p>Goal of this story is to implement XSLT for metadata formats that currently are relying on the solr only rendering.</p>
OGC-Slender Node - Story #7151 (In Progress): Generate system metadata for contenthttps://redmine.dataone.org/issues/71512015-06-04T20:24:16ZDave Vieglaisdave.vieglais@gmail.com
<p>For the three types of content (resource, science meta, data), develop a mechanism to provide system metadata.</p>
OGC-Slender Node - Story #7149 (Testing): Implement mechanism to retrieve a list of objects avail...https://redmine.dataone.org/issues/71492015-06-04T20:20:47ZDave Vieglaisdave.vieglais@gmail.com
<p>Using Python, implement a tool that is able to retrieve a list of packages, and the objects that make up each package.</p>
OGC-Slender Node - Story #7146 (In Progress): Determine formatId for content retrievable from NOD...https://redmine.dataone.org/issues/71462015-06-04T20:15:08ZDave Vieglaisdave.vieglais@gmail.com
<p>In order to create system metadata for objects held by NODC, it is necessary to infer the appropriate formatId for each object.</p>