DataONE Tasks: Issueshttps://redmine.dataone.org/https://redmine.dataone.org/favicon.ico2020-08-06T00:06:07ZDataONE Tasks
Redmine CN REST - Bug #8867 (New): CNCore.listChecksumAlgorithms() returns incorrect listhttps://redmine.dataone.org/issues/88672020-08-06T00:06:07ZMatthew Jonesjones@nceas.ucsb.edu
<p>The definition of the ChecksumAlgorithm type in SystemMetadata allows any checksum algorithm listed in the Library of Congress vocab. But the current CNCore.listChecksumAlgorithms() implementation only returns two, MD5 and SHA-1. Need to correct this to include the full list of supported algorithms (see <a href="http://id.loc.gov/vocabulary/preservation/cryptographicHashFunctions.html">http://id.loc.gov/vocabulary/preservation/cryptographicHashFunctions.html</a>).</p>
<p>The implementation of this is in a property file, which needs to be updated with the correct list. The file (d1_cn_rest/src/test/resources/org/dataone/configuration/node.properties) currently contains:</p>
<p><code>cn.checksumAlgorithmList=SHA-1;MD5</code></p>
<p>It should also contain all of the other valid algorithms from the LoC vocabulary.</p>
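<p>As a rough illustration of how the semicolon-delimited property value could be read, here is a minimal Python sketch. The function name <code>parse_algorithm_list</code> is hypothetical, and the extended value is only an illustrative subset; the authoritative names come from the LoC vocabulary linked above, not from this sketch.</p>

```python
def parse_algorithm_list(value: str) -> list[str]:
    """Split a semicolon-delimited property value into algorithm names.

    Empty entries and surrounding whitespace are dropped.
    """
    return [name.strip() for name in value.split(";") if name.strip()]


# Value currently in node.properties, per the issue text.
current = parse_algorithm_list("SHA-1;MD5")

# Hypothetical extended value, for illustration only.
extended = parse_algorithm_list("SHA-1;MD5;SHA-256;SHA-384;SHA-512")

print(current)
print(extended)
```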
CN REST - Story #8864 (New): Synchronization does not register authoritative replica entry correctlyhttps://redmine.dataone.org/issues/88642020-06-17T21:49:55ZChris Jonescjones@nceas.ucsb.edu
<p>When objects are synchronized to the CN, the <code>d1_synchronization</code> component will fetch the system metadata <br>
for each object and will add a <code><replica></code> entry for the origin node (like <code>urn:node:ESS_DIVE</code>), <br>
as well as entries for other copies (for instance, for science metadata copied to the CN, <br>
a <code><replica>urn:node:CN</replica></code> entry will be added).</p>
<p>In some instances, the origin replica instance is not added to the replica list.<br><br>
This causes downstream problems for the <code>d1_replication</code> component because it relies on the origin node <br>
replica entry to be present in order to set up a replication request to a target node. I'm seeing errors like:</p>
<pre>/var/log/dataone/replicate/cn-replication.log.90:[ERROR] 2020-06-04 05:18:30,179 [pool-15-thread-1] (MNCommunication:requestReplication:34) Could not determine replication source node for replication request for pid: ess-dive-eb6cbb22c605506-20200122T170607966. Replication request failed.
</pre>
<p>Looking back in the logs, this is the case for the following objects:</p>
<pre>ess-dive-3947e68e9825233-20180621T213650539
ess-dive-3b8d9f4513e02f9-20180621T214221437
ess-dive-467a6c3dda4dc88-20180621T211148554
ess-dive-51f345daca126f7-20180328T160350610716
ess-dive-53b37ae5d8c0f51-20200219T211634419654
ess-dive-6b688fab5524c46-20200121T210154766
ess-dive-7a31346c154f02b-20200127T155012488
ess-dive-a1fb05cbd903309-20200130T190835651
ess-dive-b420b097851c716-20180523T161714606
ess-dive-ba81a8a8e0bef31-20180727T200828345
ess-dive-bfaf3d6d6fd038c-20180716T154005175903
ess-dive-c2ef5f3af108c9c-20180621T220020545
ess-dive-eb6cbb22c605506-20200122T170607966
ess-dive-f3238db16593de5-20180621T215956950
</pre>
<p>We need to fix this issue in <code>d1_synchronization</code> so replication runs correctly and monthly <br>
replica auditing (done by ESS_DIVE) doesn't flag these issues.</p>
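<p>One way to spot affected objects is to check whether the origin node appears among the replica entries of a system metadata document. The Python sketch below assumes the element names <code>originMemberNode</code> and <code>replicaMemberNode</code> from the DataONE system metadata schema; the sample document and the helper name <code>has_origin_replica</code> are hypothetical.</p>

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Return an element name without any '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def has_origin_replica(sysmeta_xml: str) -> bool:
    """True if the origin member node is listed among the replica entries."""
    root = ET.fromstring(sysmeta_xml)
    origin = None
    replica_nodes = set()
    for element in root.iter():
        name = local_name(element.tag)
        if name == "originMemberNode":
            origin = (element.text or "").strip()
        elif name == "replicaMemberNode":
            replica_nodes.add((element.text or "").strip())
    return origin is not None and origin in replica_nodes


# Hypothetical, heavily trimmed system metadata with only a CN replica entry.
SAMPLE = """<systemMetadata>
  <identifier>example-pid</identifier>
  <originMemberNode>urn:node:ESS_DIVE</originMemberNode>
  <replica>
    <replicaMemberNode>urn:node:CN</replicaMemberNode>
    <replicationStatus>completed</replicationStatus>
  </replica>
</systemMetadata>"""

print(has_origin_replica(SAMPLE))
```

<p>For the sample above the check reports that the origin replica is missing, which is the condition that trips up <code>d1_replication</code>.</p>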
CN REST - Task #8778 (New): Ensure SystemMetadata replica auditing updates are saved and broadcasthttps://redmine.dataone.org/issues/87782019-03-12T16:54:08ZChris Jonescjones@nceas.ucsb.edu
<p>In <code>MNAuditTask.call()</code>, we process a batch of pids in need of auditing per Member Node. For each <code>pid</code> in the <code>auditPIDs</code> list, we call <code>MN.getChecksum()</code>. Regardless of success or failure, we set the <code>Replica.replicaVerified</code> date in the <code>SystemMetadata</code>. However, the task has a copy of the system metadata from <code>hzSystemMetadata.get()</code>, and the task doesn't subsequently call <code>hzSystemMetadata.put(pid, sysmeta)</code>. This means that while we are auditing content, we may just not be recording the results! I need to look at the code more to see if we make an API call to <code>CN.updateSystemMetadata()</code> elsewhere, but I would expect the <code>MNAuditTask</code> to do this. Also, if this happens in the task, we need to broadcast the system metadata change to the authoritative MN and all replica MNs. Lastly, we need to update the <code>serialVersion</code> field to show the other CNs what the most recent replica list is.</p>
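<p>The get, modify, put pattern described above can be sketched in Python with a plain dictionary standing in for the <code>hzSystemMetadata</code> map. This is an illustration only, not the Java implementation: the record layout and the function name <code>record_audit</code> are assumptions, and broadcasting the change to the MNs is left out.</p>

```python
import copy
from datetime import datetime, timezone


def record_audit(store: dict, pid: str, node_id: str, verified_at: datetime) -> dict:
    """Update replicaVerified for one replica and write the record back."""
    # Work on a copy, mirroring a task that holds a copy obtained from get().
    sysmeta = copy.deepcopy(store[pid])
    for replica in sysmeta["replicas"]:
        if replica["replicaMemberNode"] == node_id:
            replica["replicaVerified"] = verified_at.isoformat()
    # Bump the version so other holders can tell which copy is newest.
    sysmeta["serialVersion"] += 1
    # The step the issue says is missing: put the modified copy back.
    store[pid] = sysmeta
    return sysmeta


# Hypothetical store with a single object and a single replica entry.
store = {
    "example-pid": {
        "serialVersion": 1,
        "replicas": [{"replicaMemberNode": "urn:node:KNB", "replicaVerified": None}],
    }
}
record_audit(store, "example-pid", "urn:node:KNB", datetime(2019, 3, 12, tzinfo=timezone.utc))
print(store["example-pid"]["serialVersion"])
```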
CN REST - Task #8777 (New): Configure CN to audit objects greater than 1GBhttps://redmine.dataone.org/issues/87772019-03-12T16:47:42ZChris Jonescjones@nceas.ucsb.edu
<p>The replication auditor currently limits auditing of objects at 1GB. There are currently 4 objects greater than 1TB in size, and 3,588 objects greater than 1GB in size, both being very small counts compared to the 2,769,111 objects less than 1GB in size in the network. Nonetheless, they should still be audited if feasible. The limiting factor is likely HTTP timeout limits during the call to <code>MN.getChecksum()</code>. For reference, I'm seeing the following general times for calculating MD5 and SHA-1 checksums:</p>
<pre>Size   MD5       SHA-1
-----  --------  --------
1GB    00m02.5s  00m02.6s
10GB   00m25.9s  00m30.0s
100GB  03m28.0s  04m01.8s
1TB    50m14.2s  67m38.6s
</pre>
<p>10GB and 100GB objects seem pretty feasible if we set the HTTP client timeout to > 5 minutes, whereas the few > 1TB files may be challenging just due to the timeouts. The other factor is that the <code>AbstractReplicationAuditor</code> sets a default timeout to 60 seconds, and if the task future doesn't return in that time frame, the future gets cancelled. So the HTTP timeout and this timeout need to be increased and coordinated in order to handle larger object auditing.</p>
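<p>To make the timeout arithmetic concrete, the Python sketch below converts the measured times to seconds and derives a candidate timeout per size class. The safety margin of 2x is an arbitrary assumption, and the names <code>parse_duration</code> and <code>suggested_timeout</code> are hypothetical.</p>

```python
import re

# Measured checksum times from the table above: (MD5, SHA-1).
TIMINGS = {
    "1GB": ("00m02.5s", "00m02.6s"),
    "10GB": ("00m25.9s", "00m30.0s"),
    "100GB": ("03m28.0s", "04m01.8s"),
    "1TB": ("50m14.2s", "67m38.6s"),
}


def parse_duration(text: str) -> float:
    """Convert a string such as '03m28.0s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def suggested_timeout(size_label: str, margin: float = 2.0) -> float:
    """Slowest measured time for the size class, scaled by a safety margin."""
    slowest = max(parse_duration(t) for t in TIMINGS[size_label])
    return slowest * margin


for label in TIMINGS:
    print(label, round(suggested_timeout(label), 1))
```

<p>Whatever value is chosen would need to be applied to both the HTTP client timeout and the task future timeout, since the shorter of the two wins.</p>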
CN REST - Task #8776 (New): Set valid replica status to completedhttps://redmine.dataone.org/issues/87762019-03-12T15:57:36ZChris Jonescjones@nceas.ucsb.edu
<p>In <code>MNAuditTask.call()</code> we audit replica checksums, but on success, we only set the <code>Replica.replicaVerified</code> field to the current date. We don't set the <code>Replica.replicationStatus</code> field to <code>COMPLETED</code>. This is an issue because the <code>Replica</code> entry in the <code>SystemMetadata</code> may have been set to <code>FAILED</code> or <code>INVALIDATED</code>, but may now be valid, and so would need to be updated.</p>
CN REST - Story #8771 (New): Issue with LDAP when updating `nodeReplicationPolicy`https://redmine.dataone.org/issues/87712019-03-05T19:42:17ZRoger Dahldahl@unm.edu
<p>When submitting a Node doc update that includes a nodeReplicationPolicy, this section works:</p>
<pre><nodeReplicationPolicy>
<maxObjectSize>21474836480</maxObjectSize>
<spaceAllocated>1099511627776</spaceAllocated>
</nodeReplicationPolicy>
</pre>
<p>while the same section without <code>maxObjectSize</code> returns an error:</p>
<pre> <error detailCode="4822" errorCode="500" name="ServiceFailure">
<description>updateNodeCapabilities failed due to LDAP communication failure:: InvalidAttributeValueException:[LDAP: error code 21 - d1ReplicationPolicyMaxObjectSize: value #0 invalid per syntax]:[LDAP: error code 21 - d1ReplicationPolicyMaxObjectSize: value #0 invalid per syntax]</description>
</error>
</pre>
<p>The schema allows leaving <code>maxObjectSize</code> out, which means that the MN accepts replicas of unlimited size.</p>
<p>Both GMN and Metacat leave <code>maxObjectSize</code> out if the setting is configured to unlimited with <code>-1</code>.</p>
<p>I think it used to work.</p>
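<p>The client-side behavior described above (omitting <code>maxObjectSize</code> when the setting is <code>-1</code>) can be sketched as follows. This is a Python illustration and not code from GMN or Metacat; the function name <code>build_policy</code> is hypothetical.</p>

```python
import xml.etree.ElementTree as ET


def build_policy(space_allocated: int, max_object_size: int = -1) -> str:
    """Serialize a nodeReplicationPolicy, omitting maxObjectSize when unlimited."""
    policy = ET.Element("nodeReplicationPolicy")
    # A negative value means "unlimited", so the element is left out entirely.
    if max_object_size >= 0:
        ET.SubElement(policy, "maxObjectSize").text = str(max_object_size)
    ET.SubElement(policy, "spaceAllocated").text = str(space_allocated)
    return ET.tostring(policy, encoding="unicode")


print(build_policy(1099511627776, 21474836480))
print(build_policy(1099511627776))
```

<p>The second call produces the variant that currently triggers the LDAP syntax error on the CN.</p>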
CN REST - Story #8770 (New): Issue with CN handling of encoded identifiers in object/ meta/ node/...https://redmine.dataone.org/issues/87702019-03-05T19:37:13ZRoger Dahldahl@unm.edu
<p>Works:<br>
<a href="http://cn.dataone.org/cn/v2/object/doi:10.6073/AA/knb-lter-bes.298.37">http://cn.dataone.org/cn/v2/object/doi:10.6073/AA/knb-lter-bes.298.37</a><br>
<a href="https://cn.dataone.org/cn/v2/node/urn:node:LTER">https://cn.dataone.org/cn/v2/node/urn:node:LTER</a></p>
<p>Does not work:<br>
<a href="http://cn.dataone.org/cn/v2/object/doi%3A10.6073%2FAA%2Fknb-lter-bes.298.37">http://cn.dataone.org/cn/v2/object/doi%3A10.6073%2FAA%2Fknb-lter-bes.298.37</a><br>
<a href="https://cn.dataone.org/cn/v2/node/urn%3Anode%3ALTER">https://cn.dataone.org/cn/v2/node/urn%3Anode%3ALTER</a></p>
<p>Note: Behavior differs between HTTP / HTTPS.</p>
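<p>For reference, the failing URLs correspond to identifiers that have been percent-encoded as a single path segment. A small Python sketch of that encoding, using the standard library (the function name <code>encode_path_segment</code> is hypothetical):</p>

```python
from urllib.parse import quote, unquote


def encode_path_segment(identifier: str) -> str:
    """Percent-encode an identifier so it occupies exactly one path segment."""
    # safe="" forces ':' and '/' to be encoded as well.
    return quote(identifier, safe="")


pid = "doi:10.6073/AA/knb-lter-bes.298.37"
encoded = encode_path_segment(pid)
print(encoded)
print(unquote(encoded) == pid)
```

<p>A server that decodes the path segment before lookup should treat the encoded and unencoded forms as the same identifier.</p>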
CN REST - Story #8757 (New): Fix getChecksum() in MNAuditTask to use dynamic checksum algorithmshttps://redmine.dataone.org/issues/87572019-01-14T16:46:33ZChris Jonescjones@nceas.ucsb.edu
<p>The <code>MNAuditTask.call()</code> method is hardcoded to use <code>MD5</code> checksums on line 277. It requests the Member Node to generate an <code>MD5</code> checksum, and then compares that checksum to the checksum stated in the Coordinating Node's cached <code>SystemMetadata.checksum</code> field for the object. This obviously will fail for objects that were submitted using <code>SHA-1</code> or other algorithms.</p>
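<p>The fix amounts to choosing the algorithm from the stored checksum rather than hardcoding it. The Python sketch below illustrates the idea by computing the checksum locally with <code>hashlib</code>, as a stand-in for the remote <code>MN.getChecksum()</code> call; the name mapping and the function <code>checksum_matches</code> are assumptions.</p>

```python
import hashlib

# Assumed mapping from DataONE-style algorithm names to hashlib names.
HASHLIB_NAMES = {"MD5": "md5", "SHA-1": "sha1", "SHA-256": "sha256"}


def checksum_matches(data: bytes, algorithm: str, expected_hex: str) -> bool:
    """Compare data against a stored checksum using the stored algorithm."""
    name = HASHLIB_NAMES.get(algorithm.upper())
    if name is None:
        raise ValueError(f"unsupported checksum algorithm: {algorithm}")
    actual = hashlib.new(name, data).hexdigest()
    return actual.lower() == expected_hex.lower()


# An object registered with a SHA-1 checksum.
stored_algorithm = "SHA-1"
stored_value = "a9993e364706816aba3e25717850c26c9cd0d89d"  # SHA-1 of b"abc"

print(checksum_matches(b"abc", stored_algorithm, stored_value))
# Hardcoding MD5 compares an MD5 digest with a SHA-1 value and always fails.
print(checksum_matches(b"abc", "MD5", stored_value))
```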
CN REST - Story #8756 (New): Ensure replica auditor is effectivehttps://redmine.dataone.org/issues/87562019-01-12T20:25:18ZChris Jonescjones@nceas.ucsb.edu
<p>The replication auditor service is currently configured to audit all objects every 90 days. As documented in <a class="issue tracker-4 status-1 priority-4 priority-default child" title="Story: Replica Auditing service is throwing errors (New)" href="https://redmine.dataone.org/issues/8582">#8582</a>, the auditor is not working correctly. While the errors being thrown that are described in that ticket seem to be limited to <code>pid</code>s with certain characters in them, I think the whole auditor process is not keeping up with our content.</p>
<p>Looking at the number of objects on each member node that haven't been audited in the last 90 days, auditing is well behind (if we consider it working at all):</p>
<pre>SELECT sm.authoritive_member_node, count(smr.guid) AS count
FROM systemmetadata sm INNER JOIN smreplicationstatus smr
ON sm.guid = smr.guid
WHERE
smr.member_node != 'urn:node:CN' AND
sm.date_uploaded < (SELECT CURRENT_DATE - interval '90 days') AND
smr.date_verified < (SELECT CURRENT_DATE - interval '90 days')
GROUP BY sm.authoritive_member_node
ORDER BY count DESC;
authoritive_member_node | count
-------------------------+--------
urn:node:ARCTIC | 771872
urn:node:PANGAEA | 507456
urn:node:LTER | 416339
urn:node:DRYAD | 374439
urn:node:CDL | 242115
urn:node:PISCO | 235791
urn:node:KNB | 86075
urn:node:TDAR | 75639
urn:node:NCEI | 50974
urn:node:USGS_SDC | 40290
urn:node:TERN | 31671
urn:node:ESS_DIVE | 28830
urn:node:NMEPSCOR | 16042
urn:node:GOA | 9266
urn:node:IARC | 7677
urn:node:NRDC | 6673
urn:node:TFRI | 6478
urn:node:PPBIO | 3464
urn:node:ORNLDAAC | 3328
urn:node:FEMC | 2430
urn:node:EDI | 2098
urn:node:GRIIDC | 2065
urn:node:mnTestKNB | 2010
urn:node:SANPARKS | 2008
urn:node:ONEShare | 1874
urn:node:R2R | 1787
urn:node:USGSCSAS | 1151
urn:node:EDACGSTORE | 1075
urn:node:US_MPC | 1032
urn:node:RW | 970
urn:node:KUBI | 516
urn:node:NEON | 487
urn:node:LTER_EUROPE | 343
urn:node:IOE | 279
urn:node:RGD | 273
urn:node:ESA | 272
urn:node:NKN | 218
urn:node:OTS_NDC | 126
urn:node:BCODMO | 115
urn:node:SEAD | 90
urn:node:mnTestNKN | 50
urn:node:EDORA | 28
urn:node:ONEShare.pem | 22
urn:node:CLOEBIRD | 17
urn:node:mnTestBCODMO | 11
urn:node:USANPN | 10
urn:node:mnTestTDAR | 10
urn:node:MyMemberNode | 1
</pre>
<p>The table above represents the number of un-audited objects (in the last 90 days), but I get the feeling that the auditor isn't able to audit any of the content it is charged to audit given 1) the frequency, 2) the number of threads allotted, and 3) the configured batch count (seems way too low). <del>Note that this query excludes replicated content - this is just the original objects</del> (After looking at my query again, I think the join is including all replicas - the total is 2,935,787, which is greater than the total objects in the system (2,751,136), so this query needs to be refined).</p>
<p>We need to evaluate the true effectiveness of the auditor. Some strategies may include: 1) looking to see if we may be in an infinite loop on processing a few <code>pid</code>s due to the issues in <a class="issue tracker-4 status-1 priority-4 priority-default child" title="Story: Replica Auditing service is throwing errors (New)" href="https://redmine.dataone.org/issues/8582">#8582</a>, 2) seeing if we can increase the batch size by increasing the total threads allocated in the executor, and 3) deciding if we need to offload the process from the CNs and distribute the workload across a cluster of workers that can do the auditing faster. Needs some thought and discussion.</p>
CN REST - Story #8749 (New): Fix log aggregation events from the CN without associated CN IPshttps://redmine.dataone.org/issues/87492018-11-16T20:39:55ZChris Jonescjones@nceas.ucsb.edu
<p>The robots list used to filter out usage events includes the IP addresses of the CNs, so events logged during synchronization don't show up as true hits. Because of the SSL infrastructure at lbl.gov, the ESS-DIVE group doesn't see the public IP of an incoming request, but rather an internal private IP assigned by lbl.gov infrastructure. You can see the impact of this on the <a href="https://data.ess-dive.lbl.gov/#profile" class="external">ESS-DIVE profile page</a>. The spike of 11,000+ downloads in August 2018 was the CN synchronizing content.</p>
<p>Rushiraj summarized these events in a <a href="https://gist.github.com/rushirajnenuji/847d8239acf68a108bda30e04af0406b" class="external">gist</a></p>
<p>There are multiple <code>10.42.x.x</code> IPs associated with the CN requests. These events all need to be updated in the <code>logsolr</code> core and changed to an actual CN IP. For future synchronizations, perhaps we need to add <code>10.42.0.0/16</code> to the robots list?</p>
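<p>If the robots list were extended to accept CIDR ranges, the membership test could look like the Python sketch below. Whether the real robots list supports network ranges is not stated here, so treat <code>ROBOT_NETWORKS</code> and <code>is_robot</code> as hypothetical.</p>

```python
import ipaddress

# Hypothetical robots list expressed as networks rather than single addresses.
ROBOT_NETWORKS = [ipaddress.ip_network("10.42.0.0/16")]


def is_robot(ip: str) -> bool:
    """True if the address falls inside any listed network."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in ROBOT_NETWORKS)


print(is_robot("10.42.3.7"))
print(is_robot("10.43.0.1"))
```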
CN REST - Story #8582 (New): Replica Auditing service is throwing errorshttps://redmine.dataone.org/issues/85822018-05-01T19:15:35ZChris Jonescjones@nceas.ucsb.edu
<p>Replica auditing should be auditing objects every 90 days for fixity, and setting the <code>replicaStatus</code> appropriately. The <code>/var/log/dataone/cn-replication-audit.log*</code> files are showing many errors:</p>
<pre>cjones@cn-ucsb-1:replicate$ grep ERROR cn-replication-audit.log* | grep "Cannot update replica status" | wc -l
437601
</pre>
<p>Determine if this is a configuration issue or a code issue and fix it as needed. Also, fix the code to call <code>Identifier.getValue()</code> when logging these errors to avoid printing the memory location of the object like <code>org.dataone.service.types.v1.Identifier@7e90f2e8</code>. There are multiple places where <code>getValue()</code> needs to be added.</p>
CN REST - Bug #7918 (New): SEAD object only partially synchronized - missing autogen.201609291601...https://redmine.dataone.org/issues/79182016-10-21T17:20:29ZDave Vieglaisdave.vieglais@gmail.com
<p>While updating <a class="issue tracker-1 status-5 priority-4 priority-default closed" title="Bug: SEAD object only partially synchronized - missing autogen.2015061616000265251 document from /var/... (Closed)" href="https://redmine.dataone.org/issues/7222">#7222</a> and <a class="issue tracker-5 status-5 priority-4 priority-default closed" title="Task: SEAD object only partially synchronized in CN production - missing autogen.2016092916013111425 do... (Closed)" href="https://redmine.dataone.org/issues/7914">#7914</a>, I checked for the content of those objects, and it is accessible through the CN. There may still be an issue, however, as the SEAD MN reports 69 objects, though<br>
there are 68 reported in the search index.</p>
<p>Counting on the MN:</p>
<p>curl "<a href="http://seadva.d2i.indiana.edu:8081/sead/rest/mn/v1/object">http://seadva.d2i.indiana.edu:8081/sead/rest/mn/v1/object</a>" | xml fo</p>
<p>returns 69 objects.</p>
<p>Counting on the CN::</p>
<p>curl "<a href="https://cn.dataone.org/cn/v2/object?nodeId=urn:node:SEAD">https://cn.dataone.org/cn/v2/object?nodeId=urn:node:SEAD</a>" | xml fo</p>
<p>returns 69 objects.</p>
<p>Counting on the CN using the search index::</p>
<p><a href="https://cn.dataone.org/cn/v1/query/solr/?start=0&rows=10&fl=id%2Ctitle%2CformatId&q=datasource%3A%22urn%5C%3Anode%5C%3ASEAD%22">https://cn.dataone.org/cn/v1/query/solr/?start=0&rows=10&fl=id%2Ctitle%2CformatId&q=datasource%3A%22urn%5C%3Anode%5C%3ASEAD%22</a></p>
<p>returns 68 objects.</p>
<p>List of identifiers in search index::</p>
<p>curl "<a href="https://cn.dataone.org/cn/v1/query/solr/?start=0&rows=100&fl=id&q=datasource%3A%22urn%5C%3Anode%5C%3ASEAD%22">https://cn.dataone.org/cn/v1/query/solr/?start=0&rows=100&fl=id&q=datasource%3A%22urn%5C%3Anode%5C%3ASEAD%22</a>" | xml sel -t -m "//doc/str[@name='id']" -v . -n | sort > SEAD_index_pids.txt</p>
<p>List of identifiers on CN::</p>
<p>curl "<a href="https://cn.dataone.org/cn/v2/object?nodeId=urn:node:SEAD">https://cn.dataone.org/cn/v2/object?nodeId=urn:node:SEAD</a>" | xml sel -t -m "//objectInfo/identifier" -v . -n | sort > SEAD_cn_pids.txt</p>
<p>Missing PID::</p>
<p>diff SEAD_cn_pids.txt SEAD_index_pids.txt<br>
60d59<br>
< seadva-c918e4ff-2861-496a-a907-d2cb382ddb30</p>
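<p>The comparison above is effectively a set difference between the two identifier lists. A minimal Python equivalent, with a hypothetical function name and made-up sample identifiers:</p>

```python
def missing_from_index(cn_pids: list[str], index_pids: list[str]) -> list[str]:
    """Return identifiers known to the CN but absent from the search index."""
    return sorted(set(cn_pids) - set(index_pids))


# Made-up identifiers, for illustration only.
cn_pids = ["pid-a", "pid-b", "pid-c"]
index_pids = ["pid-a", "pid-c"]
print(missing_from_index(cn_pids, index_pids))
```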
<p>System Metadata for PID::</p>
<p><?xml version="1.0"?><br>
<br>
1<br>
seadva-c918e4ff-2861-496a-a907-d2cb382ddb30<br>
FGDC-STD-001-1998<br>
6553<br>
ff3d4641669b4e08c9f8b978b85d6113a4c1bab8<br>
SEAD<br>
CN=urn:node:SEAD,DC=dataone,DC=org<br>
<br>
<br>
public<br>
read<br>
<br>
<br>
<br>
sead-Martin-John-f1dbc3df-c27c-4647-b05a-4b1f05c99a24<br>
2016-08-17T12:41:34.03Z<br>
2016-08-17T14:43:58.677Z<br>
urn:node:SEAD<br>
urn:node:SEAD<br>
<br>
urn:node:SEAD<br>
completed<br>
2016-09-29T23:01:21.235Z<br>
<br>
<br>
urn:node:CN<br>
completed<br>
2016-09-29T23:01:21.241Z<br>
<br>
<a href="/ns1:systemMetadata">/ns1:systemMetadata</a></p>
<p>Confirm object can be retrieved::</p>
<p>curl "<a href="http://seadva.d2i.indiana.edu:8081/sead/rest/mn/v1/object/seadva-c918e4ff-2861-496a-a907-d2cb382ddb30">http://seadva.d2i.indiana.edu:8081/sead/rest/mn/v1/object/seadva-c918e4ff-2861-496a-a907-d2cb382ddb30</a>" | xml fo</p>
<p>System Metadata for obsoleted object::</p>
<p><?xml version="1.0"?><br>
<br>
1<br>
sead-Martin-John-f1dbc3df-c27c-4647-b05a-4b1f05c99a24<br>
FGDC-STD-001-1998<br>
7443<br>
ebd481ad8268f9714f23d772c15fb62e61384486<br>
CN=urn:node:SEAD,DC=dataone,DC=org<br>
CN=urn:node:SEAD,DC=dataone,DC=org<br>
<br>
<br>
public<br>
read<br>
<br>
<br>
<br>
seadva-c918e4ff-2861-496a-a907-d2cb382ddb30<br>
2013-10-24T18:41:31.213Z<br>
2016-08-17T14:43:58.677Z<br>
urn:node:SEAD<br>
urn:node:SEAD<br>
<br>
urn:node:CN<br>
completed<br>
2013-10-24T23:00:04.684Z<br>
<br>
<br>
urn:node:SEAD<br>
completed<br>
2013-10-24T23:00:04.579Z<br>
<br>
<a href="/ns1:systemMetadata">/ns1:systemMetadata</a><br></p>
<p>autogenid for <code>seadva-c918e4ff-2861-496a-a907-d2cb382ddb30</code>::</p>
<p>psql metacat<br>
select * from identifier where guid='seadva-c918e4ff-2861-496a-a907-d2cb382ddb30';</p>
<table><thead>
<tr>
<th>guid</th>
<th>docid</th>
<th>rev</th>
</tr>
</thead><tbody>
<tr>
<td>seadva-c918e4ff-2861-496a-a907-d2cb382ddb30</td>
<td>autogen.2016092916012224122</td>
<td>1</td>
</tr>
</tbody></table>
<p>$ ls /var/metacat/data/autogen.2016092916012224122*<br>
ls: cannot access /var/metacat/data/autogen.2016092916012224122: No such file or directory</p>
CN REST - Task #7096 (New): Unexpectedly closed streams / disconnects on UNM networkhttps://redmine.dataone.org/issues/70962015-05-12T17:49:44ZAndrei Buiumandreib@epscor.unm.edu
<p>When testing, making any CN or MN API call would intermittently (at random, but very often) yield an exception:</p>
<p>Could not resolve multipart files: Processing of multipart/form-data request failed. Stream ended unexpectedly</p>
<p>I was making CN.create() calls from UNM to the UCSB Dev CN. <br>
(It also happened for MN.create() calls from UNM to mnDemo6 when it was up. Not sure where mnDemo6 is physically located though.)<br>
This seems to happen when the connection is terminated while metacat is reading the object from the multipart files.</p>
CN REST - Task #2168 (New): Design UI for identity validationhttps://redmine.dataone.org/issues/21682012-01-06T03:18:46ZDave Vieglaisdave.vieglais@gmail.com
CN REST - Task #1479 (In Progress): Design web UI for validating newly created identitieshttps://redmine.dataone.org/issues/14792011-04-06T18:01:34ZMatthew Jonesjones@nceas.ucsb.edu