DataONE Tasks: Issues
https://redmine.dataone.org/
2020-06-17T21:49:55Z
Redmine CN REST - Story #8864 (New): Synchronization does not register authoritative replica entry correctly
https://redmine.dataone.org/issues/8864
2020-06-17T21:49:55Z Chris Jones <cjones@nceas.ucsb.edu>
<p>When objects are synchronized to the CN, the <code>d1_synchronization</code> component will fetch the system metadata <br>
for each object and will add a <code><replica></code> entry for the origin node (like <code>urn:node:ESS_DIVE</code>), <br>
as well as entries for other copies (for instance, for science metadata copied to the CN, <br>
a <code><replica>urn:node:CN</replica></code> entry will be added).</p>
<p>In some instances, however, the origin node's replica entry is not added to the replica list.<br><br>
This causes downstream problems for the <code>d1_replication</code> component because it relies on the origin node <br>
replica entry to be present in order to set up a replication request to a target node. I'm seeing errors like:</p>
<pre>/var/log/dataone/replicate/cn-replication.log.90:[ERROR] 2020-06-04 05:18:30,179 [pool-15-thread-1] (MNCommunication:requestReplication:34) Could not determine replication source node for replication request for pid: ess-dive-eb6cbb22c605506-20200122T170607966. Replication request failed.
</pre>
<p>Looking back in the logs, this is the case for the following objects:</p>
<pre>ess-dive-3947e68e9825233-20180621T213650539
ess-dive-3b8d9f4513e02f9-20180621T214221437
ess-dive-467a6c3dda4dc88-20180621T211148554
ess-dive-51f345daca126f7-20180328T160350610716
ess-dive-53b37ae5d8c0f51-20200219T211634419654
ess-dive-6b688fab5524c46-20200121T210154766
ess-dive-7a31346c154f02b-20200127T155012488
ess-dive-a1fb05cbd903309-20200130T190835651
ess-dive-b420b097851c716-20180523T161714606
ess-dive-ba81a8a8e0bef31-20180727T200828345
ess-dive-bfaf3d6d6fd038c-20180716T154005175903
ess-dive-c2ef5f3af108c9c-20180621T220020545
ess-dive-eb6cbb22c605506-20200122T170607966
ess-dive-f3238db16593de5-20180621T215956950
</pre>
<p>We need to fix this issue in <code>d1_synchronization</code> so replication runs correctly and monthly <br>
replica auditing (done by ESS_DIVE) doesn't flag these issues.</p>
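<p>A defensive guard in <code>d1_synchronization</code> could ensure the origin node's replica entry is present before the system metadata is saved. The sketch below is illustrative only, using a minimal stand-in type rather than the real <code>org.dataone.service.types</code> classes:</p>

```java
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

// Minimal stand-in for the DataONE Replica type (the real class lives in
// d1_common_java); used here only to illustrate the guard logic.
class Replica {
    final String memberNode;
    final String status;
    final Date verified;

    Replica(String memberNode, String status, Date verified) {
        this.memberNode = memberNode;
        this.status = status;
        this.verified = verified;
    }
}

public class EnsureOriginReplica {

    // Add a replica entry for the origin node if synchronization has not
    // recorded one, so d1_replication can later find a source node.
    static boolean ensureOriginReplica(List<Replica> replicas, String originNode) {
        for (Replica r : replicas) {
            if (originNode.equals(r.memberNode)) {
                return false; // origin entry already present, nothing to do
            }
        }
        replicas.add(new Replica(originNode, "COMPLETED", new Date()));
        return true;
    }

    public static void main(String[] args) {
        List<Replica> replicas = new ArrayList<>();
        replicas.add(new Replica("urn:node:CN", "COMPLETED", new Date()));

        // The origin (e.g. urn:node:ESS_DIVE) entry is missing; the guard adds it.
        boolean added = ensureOriginReplica(replicas, "urn:node:ESS_DIVE");
        System.out.println(added + ", replicas=" + replicas.size());
    }
}
```
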
CN REST - Task #8859 (Closed): Modify portal component to reload cached certificates when renewed...
https://redmine.dataone.org/issues/8859
2020-02-29T00:37:01Z Chris Jones <cjones@nceas.ucsb.edu>
<p>The Coordinating Node's SSL certificates get renewed approximately every three months through Let's Encrypt. This presents a problem for Metacat-based Member Nodes and the Metacat component of the CN because they both rely on the <code>d1_portal</code> component's <code>TokenGenerator</code>. This class follows the singleton pattern, so the <code>publicKey</code> and <code>privateKey</code> class variables are effectively cached in memory for the life of the singleton instance. When the Let's Encrypt certificates renew on disk, the in-memory versions remain: authentication fails when the CN Tomcat restarts (loading the new certs) while the various MN Tomcat instances don't restart and continue using the old certificate. The same is true in reverse, when the MNs restart but the CN hasn't yet.</p>
<p>To avoid this fragile scenario where we must coordinate restarts (certainly not scalable), modify the <code>TokenGenerator</code> to reload the SSL certificates when they change on disk so the in-memory version is up-to-date. Also, when a <code>TokenGenerator.getSession(token)</code> fails, reload the certificates from disk to ensure we are using up-to-date certificates.</p>
<p>Test this with Metacat installations that have also been updated with the new <code>d1_portal</code> jar file.</p>
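<p>The reload-on-change idea can be sketched as a modification-time check: instead of caching the certificate material forever in the singleton, remember the file's mtime and re-read when it changes. The class and field names below are illustrative, not the actual <code>d1_portal</code> implementation:</p>

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

// Sketch: cache the certificate bytes, but re-read from disk whenever the
// file's modification time changes (e.g. after a Let's Encrypt renewal).
public class ReloadingKeyCache {
    private final Path certPath;
    private long lastModified = -1L;
    private byte[] cachedCert;

    ReloadingKeyCache(Path certPath) {
        this.certPath = certPath;
    }

    // Return the cached bytes, reloading if the file changed on disk.
    synchronized byte[] getCert() throws IOException {
        long mtime = Files.getLastModifiedTime(certPath).toMillis();
        if (mtime != lastModified) {
            cachedCert = Files.readAllBytes(certPath);
            lastModified = mtime;
        }
        return cachedCert;
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("cert", ".pem");
        Files.write(tmp, "old-cert".getBytes());
        ReloadingKeyCache cache = new ReloadingKeyCache(tmp);
        System.out.println(new String(cache.getCert()));

        // Simulate a certificate renewal on disk; bump mtime explicitly to
        // sidestep filesystem timestamp granularity in this demo.
        Files.write(tmp, "new-cert".getBytes());
        Files.setLastModifiedTime(tmp,
                FileTime.fromMillis(System.currentTimeMillis() + 2000));
        System.out.println(new String(cache.getCert()));
        Files.delete(tmp);
    }
}
```

On top of this, the failure-path reload described above (re-reading the files when <code>TokenGenerator.getSession(token)</code> fails) covers the case where the renewal isn't observed via the mtime check.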
CN REST - Task #8778 (New): Ensure SystemMetadata replica auditing updates are saved and broadcast
https://redmine.dataone.org/issues/8778
2019-03-12T16:54:08Z Chris Jones <cjones@nceas.ucsb.edu>
<p>In <code>MNAuditTask.call()</code>, we process a batch of pids in need of auditing per Member Node. For each <code>pid</code> in the <code>auditPIDs</code> list, we call <code>MN.getChecksum()</code>. Regardless of success or failure, we set the <code>Replica.replicaVerified</code> date in the <code>SystemMetadata</code>. However, the task has a copy of the system metadata from <code>hzSystemMetadata.get()</code>, and the task doesn't subsequently call <code>hzSystemMetadata.put(pid, sysmeta)</code>. This means that while we are auditing content, we may just not be recording the results! I need to look at the code more to see if we make an API call to <code>CN.updateSystemMetadata()</code> elsewhere, but I would expect the<code>MNAuditTask</code> to do this. Also, if this happens in the task, we also need to broadcast the system metadata change to the authoritative MN and all replica MNS. Lastly, we need to update the <code>serialVersion</code> field to show the other CNs what the most recent replica list is.</p>
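<p>The missing step is the write-back after the mutation. A schematic of the read-modify-write pattern, with a plain <code>Map</code> standing in for the Hazelcast <code>hzSystemMetadata</code> map and a stub in place of the real <code>SystemMetadata</code> type (with Hazelcast, <code>get()</code> returns a deserialized copy, which is why the <code>put()</code> is required):</p>

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Stub of SystemMetadata with only the fields relevant to the write-back;
// the real type lives in d1_common_java.
class SysMeta {
    long serialVersion;
    String replicaVerified;

    SysMeta(long serialVersion, String replicaVerified) {
        this.serialVersion = serialVersion;
        this.replicaVerified = replicaVerified;
    }
}

public class AuditWriteBack {

    // After auditing, persist the mutated copy: bump serialVersion and put
    // it back into the shared map. This is the step MNAuditTask appears to omit.
    static void recordAuditResult(Map<String, SysMeta> hzSystemMetadata,
                                  String pid, String verifiedDate) {
        SysMeta sm = hzSystemMetadata.get(pid); // local copy, as in the task
        sm.replicaVerified = verifiedDate;      // what the task already sets
        sm.serialVersion += 1;                  // lets other CNs see the change
        hzSystemMetadata.put(pid, sm);          // the missing write-back
        // A complete fix would also notify the authoritative MN and all
        // replica MNs of the change (e.g. via systemMetadataChanged()).
    }

    public static void main(String[] args) {
        Map<String, SysMeta> map = new ConcurrentHashMap<>();
        map.put("pid-1", new SysMeta(1, null));
        recordAuditResult(map, "pid-1", "2019-03-12");
        SysMeta sm = map.get("pid-1");
        System.out.println(sm.serialVersion + " " + sm.replicaVerified);
    }
}
```
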
CN REST - Task #8777 (New): Configure CN to audit objects greater than 1GB
https://redmine.dataone.org/issues/8777
2019-03-12T16:47:42Z Chris Jones <cjones@nceas.ucsb.edu>
<p>The replication auditor currently limits auditing to objects no larger than 1GB. There are currently 4 objects greater than 1TB in size, and 3,588 objects greater than 1GB, both very small counts compared to the 2,769,111 objects less than 1GB in the network. Nonetheless, they should still be audited if feasible. The limiting factor is likely HTTP timeout limits during the call to <code>MN.getChecksum()</code>. For reference, I'm seeing the following general times for calculating MD5 and SHA-1 checksums:</p>
<pre>Size   MD5       SHA-1
-----  --------  --------
1GB    00m02.5s  00m02.6s
10GB   00m25.9s  00m30.0s
100GB  03m28.0s  04m01.8s
1TB    50m14.2s  67m38.6s
</pre>
<p>10GB and 100GB objects seem quite feasible if we set the HTTP client timeout to more than 5 minutes, whereas the few files over 1TB may be challenging due to the timeouts alone. The other factor is that <code>AbstractReplicationAuditor</code> sets a default timeout of 60 seconds, and if the task future doesn't return in that time frame, the future gets cancelled. So the HTTP timeout and this timeout need to be increased and coordinated in order to handle larger object auditing.</p>
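<p>One way to coordinate the two timeouts is to derive both from the object size, using the observed checksum throughput from the table above plus a safety factor; the constants below are assumptions for illustration, not current configuration:</p>

```java
public class AuditTimeouts {
    // The table above shows ~1 GB in ~2.5 s (~400 MB/s); be pessimistic
    // and assume 100 MB/s to absorb network and load variation.
    static final long BYTES_PER_SECOND = 100L * 1024 * 1024;
    static final long BASE_SECONDS = 60;   // the current default future timeout
    static final int SAFETY_FACTOR = 2;

    // Seconds to allow for MN.getChecksum() on an object of this size;
    // the same value would drive both the HTTP client timeout and the
    // future.get() wait, so they stay coordinated.
    static long auditTimeoutSeconds(long sizeBytes) {
        return BASE_SECONDS + SAFETY_FACTOR * (sizeBytes / BYTES_PER_SECOND);
    }

    public static void main(String[] args) {
        long GB = 1024L * 1024 * 1024;
        System.out.println(auditTimeoutSeconds(1 * GB));    // 80 seconds
        System.out.println(auditTimeoutSeconds(100 * GB));
        System.out.println(auditTimeoutSeconds(1024 * GB)); // ~1 TB
    }
}
```
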
CN REST - Task #8776 (New): Set valid replica status to completed
https://redmine.dataone.org/issues/8776
2019-03-12T15:57:36Z Chris Jones <cjones@nceas.ucsb.edu>
<p>In <code>MNAuditTask.call()</code> we audit replica checksums, but on success, we only set the <code>Replica.replicaVerified</code> field to the current date. We don't set the <code>Replica.replicationStatus</code> field to <code>COMPLETED</code>. This is an issue because the <code>Replica</code> entry in the <code>SystemMetadata</code> may have been set to <code>FAILED</code> or <code>INVALIDATED</code>, but may now be valid, and so would need to be updated.</p>
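<p>On a successful checksum match the audit should update both fields. A minimal sketch with a local enum and fields (not the real d1 types):</p>

```java
import java.util.Date;

public class MarkReplicaValid {
    enum ReplicationStatus { QUEUED, REQUESTED, COMPLETED, FAILED, INVALIDATED }

    static class Replica {
        ReplicationStatus status = ReplicationStatus.FAILED;
        Date verified;
    }

    // A previously FAILED/INVALIDATED replica that now passes the checksum
    // audit should be marked COMPLETED, not just re-dated.
    static void markValid(Replica replica) {
        replica.verified = new Date();                // what the task does today
        replica.status = ReplicationStatus.COMPLETED; // the missing update
    }

    public static void main(String[] args) {
        Replica r = new Replica();
        markValid(r);
        System.out.println(r.status);
    }
}
```
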
CN REST - Story #8757 (New): Fix getChecksum() in MNAuditTask to use dynamic checksum algorithms
https://redmine.dataone.org/issues/8757
2019-01-14T16:46:33Z Chris Jones <cjones@nceas.ucsb.edu>
<p>The <code>MNAuditTask.call()</code> method is hardcoded to use <code>MD5</code> checksums on line 277. It requests the Member Node to generate an <code>MD5</code> checksum, and then compares that checksum to the checksum stated in the Coordinating Node's cached <code>SystemMetadata.checksum</code> field for the object. This will obviously fail for objects that were submitted using <code>SHA-1</code> or other algorithms.</p>
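<p>The fix is to take the algorithm name from the object's system metadata rather than hardcoding <code>MD5</code>. A standalone sketch of the comparison side using <code>java.security.MessageDigest</code> (the real code would also pass the same algorithm name into <code>MN.getChecksum()</code>):</p>

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DynamicChecksum {
    // Compute a hex checksum using whatever algorithm the system metadata
    // declares (e.g. "MD5", "SHA-1"), instead of a hardcoded "MD5".
    static String checksum(byte[] data, String algorithm)
            throws NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance(algorithm);
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest(data)) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "hello".getBytes("UTF-8");
        // Same bytes, two different declared algorithms.
        System.out.println(checksum(data, "MD5"));
        System.out.println(checksum(data, "SHA-1"));
    }
}
```
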
CN REST - Story #8756 (New): Ensure replica auditor is effective
https://redmine.dataone.org/issues/8756
2019-01-12T20:25:18Z Chris Jones <cjones@nceas.ucsb.edu>
<p>The replication auditor service is currently configured to audit all objects every 90 days. As documented in <a class="issue tracker-4 status-1 priority-4 priority-default child" title="Story: Replica Auditing service is throwing errors (New)" href="https://redmine.dataone.org/issues/8582">#8582</a>, the auditor is not working correctly. While the errors being thrown that are described in that ticket seem to be limited to <code>pid</code>s with certain characters in them, I think the whole auditor process is not keeping up with our content.</p>
<p>Looking at the number of objects on each member node that haven't been audited in the last 90 days, auditing is well behind (if we consider it working at all):</p>
<pre>SELECT sm.authoritive_member_node, count(smr.guid) AS count
FROM systemmetadata sm INNER JOIN smreplicationstatus smr
ON sm.guid = smr.guid
WHERE
smr.member_node != 'urn:node:CN' AND
sm.date_uploaded < (SELECT CURRENT_DATE - interval '90 days') AND
smr.date_verified < (SELECT CURRENT_DATE - interval '90 days')
GROUP BY sm.authoritive_member_node
ORDER BY count DESC;
authoritive_member_node | count
-------------------------+--------
urn:node:ARCTIC | 771872
urn:node:PANGAEA | 507456
urn:node:LTER | 416339
urn:node:DRYAD | 374439
urn:node:CDL | 242115
urn:node:PISCO | 235791
urn:node:KNB | 86075
urn:node:TDAR | 75639
urn:node:NCEI | 50974
urn:node:USGS_SDC | 40290
urn:node:TERN | 31671
urn:node:ESS_DIVE | 28830
urn:node:NMEPSCOR | 16042
urn:node:GOA | 9266
urn:node:IARC | 7677
urn:node:NRDC | 6673
urn:node:TFRI | 6478
urn:node:PPBIO | 3464
urn:node:ORNLDAAC | 3328
urn:node:FEMC | 2430
urn:node:EDI | 2098
urn:node:GRIIDC | 2065
urn:node:mnTestKNB | 2010
urn:node:SANPARKS | 2008
urn:node:ONEShare | 1874
urn:node:R2R | 1787
urn:node:USGSCSAS | 1151
urn:node:EDACGSTORE | 1075
urn:node:US_MPC | 1032
urn:node:RW | 970
urn:node:KUBI | 516
urn:node:NEON | 487
urn:node:LTER_EUROPE | 343
urn:node:IOE | 279
urn:node:RGD | 273
urn:node:ESA | 272
urn:node:NKN | 218
urn:node:OTS_NDC | 126
urn:node:BCODMO | 115
urn:node:SEAD | 90
urn:node:mnTestNKN | 50
urn:node:EDORA | 28
urn:node:ONEShare.pem | 22
urn:node:CLOEBIRD | 17
urn:node:mnTestBCODMO | 11
urn:node:USANPN | 10
urn:node:mnTestTDAR | 10
urn:node:MyMemberNode | 1
</pre>
<p>The table above represents the number of un-audited objects (in the last 90 days), but I get the feeling that the auditor isn't able to audit any of the content it is charged to audit given 1) the frequency, 2) the number of threads allotted, and 3) the configured batch count (seems way too low). <del>Note that this query excludes replicated content - this is just the original objects</del> (After looking at my query again, I think the join is including all replicas - the total is 2,935,787, which is greater than the total objects in the system (2,751,136), so this query needs to be refined).</p>
<p>We need to evaluate the true effectiveness of the auditor. Some strategies may include: 1) looking to see if we may be in an infinite loop on processing a few <code>pid</code>s due to the issues in <a class="issue tracker-4 status-1 priority-4 priority-default child" title="Story: Replica Auditing service is throwing errors (New)" href="https://redmine.dataone.org/issues/8582">#8582</a>, 2) seeing if we can increase the batch size by increasing the total threads allocated in the executor, and 3) decide if we need to offload the process from the CNs and distribute the workload across a cluster of workers that can do the auditing faster. Needs some thought and discussion.</p>
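<p>As a starting point for refining the query, restricting the join to each object's authoritative replica row should exclude replica copies from the count. Untested; column names follow the query above, including the <code>authoritive_member_node</code> spelling used in the schema:</p>

```sql
-- Count only authoritative copies not verified in the last 90 days,
-- excluding CN and replica-target rows from the join.
SELECT sm.authoritive_member_node, count(smr.guid) AS count
FROM systemmetadata sm INNER JOIN smreplicationstatus smr
  ON sm.guid = smr.guid
 AND smr.member_node = sm.authoritive_member_node  -- authoritative row only
WHERE
  sm.date_uploaded < (SELECT CURRENT_DATE - interval '90 days') AND
  smr.date_verified < (SELECT CURRENT_DATE - interval '90 days')
GROUP BY sm.authoritive_member_node
ORDER BY count DESC;
```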
CN REST - Story #8749 (New): Fix log aggregation events from the CN without associated CN IPs
https://redmine.dataone.org/issues/8749
2018-11-16T20:39:55Z Chris Jones <cjones@nceas.ucsb.edu>
<p>The robots list used to filter out usage events includes the IP addresses of the CNs, so events logged during synchronization don't show up as true hits. Because of the SSL infrastructure at lbl.gov, the ESS-DIVE group doesn't see the public IP of an incoming request, but rather an internal private IP assigned by lbl.gov infrastructure. You can see the impact of this on the <a href="https://data.ess-dive.lbl.gov/#profile" class="external">ESS-DIVE profile page</a>. The spike of 11,000+ downloads in August 2018 was the CN synchronizing content.</p>
<p>Rushiraj summarized these events in a <a href="https://gist.github.com/rushirajnenuji/847d8239acf68a108bda30e04af0406b" class="external">gist</a></p>
<p>There are multiple <code>10.42.x.x</code> IP addresses associated with the CN requests. These events all need to be updated in the <code>logsolr</code> core and changed to an actual CN IP. For future synchronizations, perhaps we need to add <code>10.42.0.0/16</code> to the robots list? </p>
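<p>Matching <code>10.42.0.0/16</code> amounts to a prefix match on the first two octets. A quick sketch of the membership test (illustrative only, not the log-aggregation code):</p>

```java
public class CidrCheck {
    // True if the dotted-quad IPv4 address falls inside 10.42.0.0/16,
    // i.e. the first 16 bits match 10.42.
    static boolean inTenFortyTwo(String ip) {
        String[] octets = ip.split("\\.");
        return octets.length == 4
                && Integer.parseInt(octets[0]) == 10
                && Integer.parseInt(octets[1]) == 42;
    }

    public static void main(String[] args) {
        System.out.println(inTenFortyTwo("10.42.16.7"));  // inside the range
        System.out.println(inTenFortyTwo("10.43.16.7"));  // outside the range
    }
}
```
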
CN REST - Bug #8733 (New): Exception handling in ForesiteResourceMap needs improvement
https://redmine.dataone.org/issues/8733
2018-10-12T20:09:55Z Chris Jones <cjones@nceas.ucsb.edu>
<p>We're seeing some errors from the <code>org.dataone.cn.indexer.resourcemap.ForesiteResourceMap</code> constructor when indexing resource maps in Metacat 2.9.0, d1_cn_index_processor 2.3.4. A general <code>Exception</code> is being caught and re-thrown as an <code>OREParserException</code> on line 104:</p>
<pre>metacat 20181012-10:53:02: [INFO]: done queueing doc index for pid urn:uuid:5a860679-159e-45a6-bcb6-beb67f4166c2 [edu.ucsb.nceas.metacat.MetacatHandler:handleReindexAction:2777]
metacat 20181012-10:53:02: [DEBUG]: ================= in the finally statement [edu.ucsb.nceas.metacat.MetacatHandler:handleReindexAction:2808]
metacat 20181012-10:53:02: [DEBUG]: ================= in the finally statement which out is not null [edu.ucsb.nceas.metacat.MetacatHandler:handleReindexAction:2810]
org.dspace.foresite.OREParserException: java.lang.NullPointerException
at org.dataone.cn.indexer.resourcemap.ForesiteResourceMap.<init>(ForesiteResourceMap.java:104)
at org.dataone.cn.indexer.resourcemap.ResourceMapFactory.buildResourceMap(ResourceMapFactory.java:45)
at edu.ucsb.nceas.metacat.index.resourcemap.ResourceMapSubprocessor.processResourceMap(ResourceMapSubprocessor.java:97)
at edu.ucsb.nceas.metacat.index.resourcemap.ResourceMapSubprocessor.processDocument(ResourceMapSubprocessor.java:86)
at edu.ucsb.nceas.metacat.index.SolrIndex.process(SolrIndex.java:236)
at edu.ucsb.nceas.metacat.index.SolrIndex.insert(SolrIndex.java:426)
at edu.ucsb.nceas.metacat.index.SolrIndex.update(SolrIndex.java:637)
at edu.ucsb.nceas.metacat.index.SolrIndex.update(SolrIndex.java:592)
at edu.ucsb.nceas.metacat.index.SystemMetadataEventListener.entryUpdated(SystemMetadataEventListener.java:146)
at edu.ucsb.nceas.metacat.index.SystemMetadataEventListener.entryAdded(SystemMetadataEventListener.java:119)
at com.hazelcast.client.impl.EntryListenerManager.notifyListeners(EntryListenerManager.java:148)
at com.hazelcast.client.impl.EntryListenerManager.notifyListeners(EntryListenerManager.java:130)
at com.hazelcast.client.impl.ListenerManager.customRun(ListenerManager.java:88)
at com.hazelcast.client.ClientRunnable.run(ClientRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
</pre>
<p>The resource map that is failing to index is <a href="https://knb.ecoinformatics.org/knb/d1/mn/v2/object/urn%3Auuid%3A5a860679-159e-45a6-bcb6-beb67f4166c2">https://knb.ecoinformatics.org/knb/d1/mn/v2/object/urn%3Auuid%3A5a860679-159e-45a6-bcb6-beb67f4166c2</a></p>
<p>This resource map parses fine using <code>com.hp.hpl.jena.rdf.model.Model.read()</code> outside of the index processor code.</p>
<p>This <code>ForesiteResourceMap</code> constructor is taking an <code>org.w3c.dom.Document</code> as input, converting it to an in-memory <code>ByteArrayOutputStream</code>, and converting that to an in-memory <code>String</code>, to finally get a <code>ReaderInputStream</code> to be passed to <code>_init()</code>. We may need to address the handling inefficiency as a secondary improvement to the code, if feasible.</p>
<p>We also need to stop catching a general <code>Exception</code> on line 117 and instead catch the more specific exception that is thrown, which will help with troubleshooting where the exception comes from. Debug logging statements would help too.</p>
<p>The exception is a <code>NullPointerException</code>, so we ultimately need to track down where this is coming from.</p>
<p>It doesn't look like this code has changed in the 2.3.6 version of the library, so an upgrade wouldn't make a difference. I did look on <code>cn-orc-1</code> to see if we are having similar issues outside of a Metacat MN context, and see errors like these:</p>
<pre>cn-index-processor-daemon.log.9-[ERROR] 2018-10-12 06:42:44,977 (ErrorHandlerFactory$ErrorLogger:logError:84) {E201} rdf:resource not allowed as attribute here.
cn-index-processor-daemon.log.9:[ERROR] 2018-10-12 06:42:44,977 (ForesiteResourceMap:_init:163) Unable to parse ORE document:
cn-index-processor-daemon.log.9-org.apache.jena.riot.RiotException: {E201} rdf:resource not allowed as attribute here.
cn-index-processor-daemon.log.9- at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerStd.error(ErrorHandlerFactory.java:128)
cn-index-processor-daemon.log.9- at org.apache.jena.riot.lang.LangRDFXML$ErrorHandlerBridge.error(LangRDFXML.java:241)
cn-index-processor-daemon.log.9- at com.hp.hpl.jena.rdf.arp.impl.ARPSaxErrorHandler.error(ARPSaxErrorHandler.java:38)
cn-index-processor-daemon.log.9- at com.hp.hpl.jena.rdf.arp.impl.XMLHandler.warning(XMLHandler.java:206)
cn-index-processor-daemon.log.9- at com.hp.hpl.jena.rdf.arp.impl.XMLHandler.warning(XMLHandler.java:183)
cn-index-processor-daemon.log.9- at com.hp.hpl.jena.rdf.arp.impl.XMLHandler.warning(XMLHandler.java:178)
cn-index-processor-daemon.log.9- at com.hp.hpl.jena.rdf.arp.impl.ParserSupport.warning(ParserSupport.java:199)
cn-index-processor-daemon.log.9- at com.hp.hpl.jena.rdf.arp.impl.AttributeLexer.error(AttributeLexer.java:187)
cn-index-processor-daemon.log.9- at com.hp.hpl.jena.rdf.arp.impl.QNameLexer.lookup(QNameLexer.java:80)
cn-index-processor-daemon.log.9- at com.hp.hpl.jena.rdf.arp.impl.AttributeLexer.processSpecials(AttributeLexer.java:71)
cn-index-processor-daemon.log.9- at com.hp.hpl.jena.rdf.arp.states.WantDescription.startElement(WantDescription.java:60)
cn-index-processor-daemon.log.9- at com.hp.hpl.jena.rdf.arp.states.AbsWantLiteralValueOrDescription.startElement(AbsWantLiteralValueOrDescription.java:84)
cn-index-processor-daemon.log.9- at com.hp.hpl.jena.rdf.arp.states.WantLiteralValueOrDescription.startElement(WantLiteralValueOrDescription.java:56)
cn-index-processor-daemon.log.9- at com.hp.hpl.jena.rdf.arp.impl.XMLHandler.startElement(XMLHandler.java:121)
cn-index-processor-daemon.log.9- at org.apache.xerces.parsers.AbstractSAXParser.startElement(Unknown Source)
cn-index-processor-daemon.log.9- at org.apache.xerces.parsers.AbstractXMLDocumentParser.emptyElement(Unknown Source)
cn-index-processor-daemon.log.9- at org.apache.xerces.impl.XMLNamespaceBinder.handleStartElement(Unknown Source)
cn-index-processor-daemon.log.9- at org.apache.xerces.impl.XMLNamespaceBinder.emptyElement(Unknown Source)
cn-index-processor-daemon.log.9- at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanStartElement(Unknown Source)
cn-index-processor-daemon.log.9- at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
cn-index-processor-daemon.log.9- at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
cn-index-processor-daemon.log.9- at org.apache.xerces.parsers.DTDConfiguration.parse(Unknown Source)
cn-index-processor-daemon.log.9- at org.apache.xerces.parsers.DTDConfiguration.parse(Unknown Source)
cn-index-processor-daemon.log.9- at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
cn-index-processor-daemon.log.9- at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
cn-index-processor-daemon.log.9- at com.hp.hpl.jena.rdf.arp.impl.RDFXMLParser.parse(RDFXMLParser.java:151)
cn-index-processor-daemon.log.9- at com.hp.hpl.jena.rdf.arp.ARP.load(ARP.java:119)
cn-index-processor-daemon.log.9- at org.apache.jena.riot.lang.LangRDFXML.parse(LangRDFXML.java:142)
cn-index-processor-daemon.log.9- at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTFactoryImpl$1.read(RDFParserRegistry.java:142)
cn-index-processor-daemon.log.9- at org.apache.jena.riot.RDFDataMgr.process(RDFDataMgr.java:818)
cn-index-processor-daemon.log.9- at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:258)
cn-index-processor-daemon.log.9- at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:244)
cn-index-processor-daemon.log.9- at org.apache.jena.riot.adapters.RDFReaderRIOT.read(RDFReaderRIOT.java:69)
cn-index-processor-daemon.log.9- at com.hp.hpl.jena.rdf.model.impl.ModelCom.read(ModelCom.java:274)
cn-index-processor-daemon.log.9- at org.dspace.foresite.jena.JenaOREParser.parseToModel(JenaOREParser.java:119)
cn-index-processor-daemon.log.9- at org.dspace.foresite.jena.JenaOREParser.parse(JenaOREParser.java:69)
cn-index-processor-daemon.log.9- at org.dataone.ore.ResourceMapFactory.parseResourceMap(ResourceMapFactory.java:444)
cn-index-processor-daemon.log.9: at org.dataone.cn.indexer.resourcemap.ForesiteResourceMap._init(ForesiteResourceMap.java:160)
cn-index-processor-daemon.log.9: at org.dataone.cn.indexer.resourcemap.ForesiteResourceMap.<init>(ForesiteResourceMap.java:73)
cn-index-processor-daemon.log.9- at org.dataone.cn.indexer.resourcemap.ResourceMapFactory.buildResourceMap(ResourceMapFactory.java:41)
cn-index-processor-daemon.log.9- at org.dataone.cn.index.processor.IndexTaskProcessor.getNextIndexTask(IndexTaskProcessor.java:641)
cn-index-processor-daemon.log.9- at org.dataone.cn.index.processor.IndexTaskProcessor.processFailedIndexTaskQueue(IndexTaskProcessor.java:229)
cn-index-processor-daemon.log.9- at org.dataone.cn.index.processor.IndexTaskProcessor.processIndexTaskQueue(IndexTaskProcessor.java:191)
cn-index-processor-daemon.log.9- at org.dataone.cn.index.processor.IndexTaskProcessorJob.execute(IndexTaskProcessorJob.java:57)
</pre>
<p>This error is also related to <code>_init</code>, but isn't an NPE, so <code>_init()</code> might be a good place to look first.</p>
CN REST - Task #8617 (Closed): Fix CN to CN replication in Stage
https://redmine.dataone.org/issues/8617
2018-06-17T01:30:07Z Chris Jones <cjones@nceas.ucsb.edu>
<p>While working on on-boarding ESS-DIVE in the Stage environment, we found that <code>DATA</code> objects were both sync'ing fine (their <code>sysmeta</code>) to the CNs and indexing fine on the CNs. However, METADATA and RESOURCE object types were only sync'ing; they were failing to index. Looking at the indexer log files on <code>cn-stage-orc-1</code>, I saw errors where the indexer could not find the object (science metadata and resource map documents) on disk because there was no path to the object in the <code>hzObjectPaths</code> map.</p>
<p>I confirmed that:</p>
<p>1) There was no pid-to-docid entry in the PostgreSQL <code>identifier</code> table<br>
2) There was no file on disk for these objects in <code>/var/metacat/{data,documents}</code><br>
3) That the system metadata for each document <em>did</em> exist in the PostgreSQL <code>systemmetadata</code> table</p>
<p>Since <code>sysmeta</code> replicates via Hazelcast and is persisted on all CNs, this didn't surprise me. But science metadata and resource maps get replicated via the CN-to-CN Metacat replication service.</p>
<p>Looking at the replication.log files, the CNs were unable to create SSL connections among themselves, and replication hadn't been working since May 18th, 2018. I confirmed this by trying the replication manually.</p>
<p>After looking at the certificate key and cert files on each CN in <code>/etc/dataone/client/{certs,private}</code>, I saw that the <code>cn-stage-{orc,ucsb,unm}-1.test.dataone.org.{pem,key}</code> files had the wrong subject in them. For instance:</p>
<pre>subject= /DC=org/DC=dataone/CN=urn:node:cnStageORC1
notBefore=May 14 16:52:20 2018 GMT
notAfter=May 13 16:52:20 2021 GMT
</pre>
<p>This is the subject for the client certificate, not for the server certificate signed by the DataONE Test Intermediate CA, which, for instance, is:</p>
<pre>subject= /DC=org/DC=dataone/CN=cn-stage-orc-1.test.dataone.org
notBefore=Oct 13 22:12:57 2015 GMT
notAfter=Oct 12 22:12:57 2018 GMT
</pre>
<p>This file was updated in May on all three CNs, and the old certificate was moved to a <code>.bak</code> version. The <code>.bak</code> version still contained the un-expired certificate (expires in Oct 2018), which has the correct subject string in the certificate (the FQDN of the hostname).</p>
<p>I moved the <code>.bak</code> certificates back to the current active name. After forcing CN-to-CN replication via the Metacat Admin interface, all three CNs are now replicating to each other again, and are making their way through the backlog.</p>
<p>I think someone mistakenly copied the client certificate file to the server cert files used for Metacat, LDAP, and PostgreSQL replication when the client certificates were updated in May. This is likely just a procedure documentation issue. We'll have to look out for this as the certificates expire in the next few months.</p>
CN REST - Story #8582 (New): Replica Auditing service is throwing errors
https://redmine.dataone.org/issues/8582
2018-05-01T19:15:35Z Chris Jones <cjones@nceas.ucsb.edu>
<p>Replica auditing should be auditing objects every 90 days for fixity, and setting the <code>replicaStatus</code> appropriately. The <code>/var/log/dataone/cn-replication-audit.log*</code> files are showing many errors:</p>
<pre>cjones@cn-ucsb-1:replicate$ grep ERROR cn-replication-audit.log* | grep "Cannot update replica status" | wc -l
437601
</pre>
<p>Determine if this is a configuration issue or a code issue and fix it as needed. Also, fix the code to call <code>Identifier.getValue()</code> when logging these errors to avoid printing the memory location of the object like <code>org.dataone.service.types.v1.Identifier@7e90f2e8</code>. There are multiple places where <code>getValue()</code> needs to be added.</p>
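<p>The memory-address output comes from logging the <code>Identifier</code> object itself, which falls back to <code>Object.toString()</code>; logging <code>pid.getValue()</code> fixes it. A stub illustration (the real class is <code>org.dataone.service.types.v1.Identifier</code>):</p>

```java
public class IdentifierLogging {
    // Stub standing in for the DataONE Identifier type; like the real one
    // in this scenario, it does not override toString(), hence the
    // "Identifier@7e90f2e8"-style output when logged directly.
    static class Identifier {
        private final String value;
        Identifier(String value) { this.value = value; }
        public String getValue() { return value; }
    }

    public static void main(String[] args) {
        Identifier pid = new Identifier("ess-dive-eb6cbb22c605506-20200122T170607966");
        // Broken: prints something like "IdentifierLogging$Identifier@7e90f2e8"
        System.out.println("Cannot update replica status for pid: " + pid);
        // Fixed: prints the actual identifier string
        System.out.println("Cannot update replica status for pid: " + pid.getValue());
    }
}
```
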
CN REST - Story #8364 (In Progress): Ensure portal uses correct X509 certificates
https://redmine.dataone.org/issues/8364
2018-02-13T20:17:25Z Chris Jones <cjones@nceas.ucsb.edu>
<p>We've run into issues where, after an upgrade of the <code>dataone-cn-portal</code> package on the CNs, the properties pointing to the public certificate and private key incorrectly point to the old GeoTrust wildcard files rather than the new Let's Encrypt files:<br>
<br>
cn.server.publiccert.filename=/etc/ssl/certs/*.test.dataone.org.crt<br>
cn.server.privatekey.filename=/etc/ssl/private/*.test.dataone.org.key</p>
<p>These should be (in STAGE):</p>
<p>/etc/letsencrypt/live/cn-stage.test.dataone.org/cert.pem<br>
/etc/letsencrypt/live/cn-stage.test.dataone.org/privkey.pem</p>
<p>The issue might be that these are not being set correctly during the <code>postinst</code> script run. Jing pointed out that these values are taken from the debconf database settings that get set when <code>dataone-cn-os-core</code> is installed. So although the <code>postinst</code> script might be setting the correct values, the old cached values might still be in memory in the debconf database. If so, we'll need to clear those values during installations and upgrades.</p>
<p>Also, knowing where to look for these configuration settings can be challenging. These are referenced from <code>/var/lib/tomcat7/webapps/portal/WEB-INF/portal.properties</code>. These settings should be consolidated into <code>/etc/dataone/portal/portal.properties</code> so they also don't get blown away on war file upgrades in Tomcat.</p>
Infrastructure - Bug #8183 (Closed): CNDiagnostic.echoIndexedObject() fails on resource maps
https://redmine.dataone.org/issues/8183
2017-09-12T19:46:14Z Chris Jones <cjones@nceas.ucsb.edu>
<p>Calling <code>CNDiagnostic.echoIndexedObject()</code> with both known-good and known-bad resource maps fails with a <code>500 ServiceFailure</code> exception. We would expect it to return the list of Solr documents for the known-good resource map, and the original exception content for known-bad resource maps.</p>
<p>Here are two examples:</p>
<p>Good (i.e. has been successfully indexed by DataONE):<br>
<a href="https://knb.ecoinformatics.org/knb/d1/mn/v2/meta/resourceMap_knb.1389.1">https://knb.ecoinformatics.org/knb/d1/mn/v2/meta/resourceMap_knb.1389.1</a></p>
<p>Bad (failed indexing by DataONE):<br>
<a href="https://knb.ecoinformatics.org/knb/d1/mn/v2/meta/urn:uuid:c3cb05ae-0fd7-4ebc-960a-4781521d249d">https://knb.ecoinformatics.org/knb/d1/mn/v2/meta/urn:uuid:c3cb05ae-0fd7-4ebc-960a-4781521d249d</a></p>
<p>The latter ORE has a syntax error with regard to the <code>dc:creator</code> content, and fails to be parsed by the Jena library with:<br>
<br>
[ERROR] 2017-09-12 12:58:51,900 (RDFDefaultErrorHandler:error:40) (line 9 column 61): {E201} rdf:resource not allowed as attribute here.<br>
[ERROR] 2017-09-12 12:58:51,910 (RDFDefaultErrorHandler:error:40) (line 10 column 70): {E201} Multiple children of property element<br>
[ERROR] 2017-09-12 12:58:51,910 (RDFDefaultErrorHandler:error:40) (line 10 column 70): {E201} rdf:datatype not allowed as attribute here.<br>
[ERROR] 2017-09-12 12:58:51,911 (RDFDefaultErrorHandler:error:40) (line 10 column 82): {E202} Expecting XML start or end element(s). String data "Ocean Spaces" not allowed. Maybe there should be an rdf:parseType='Literal' for embedding mixed XML content in RDF. Maybe a striping error.</p>
<p>So we would expect these errors to trickle up the call stack and be reported to the submitter.</p>
Infrastructure - Task #8098 (Closed): Token-based authentication fails with LE CN certs
https://redmine.dataone.org/issues/8098
2017-05-21T17:06:46Z Chris Jones <cjones@nceas.ucsb.edu>
<p>When trying to call <code>MN.create()</code> on my local Metacat setup (which points to the production CN environment for ORCID authentication), I'm getting an <code>InvalidToken</code> error:<br>
<br>
<?xml version="1.0" encoding="UTF-8"?><br>
Session is required to WRITE to the Node.<br>
</p>
<p>This is odd because I recently logged in via ORCID, so it looked like a token verification issue. In the Metacat log, I see:</p>
<p>metacat 20170521-10:35:07: [WARN]: Could not use public key to verify provided token: eyJhbGciOiJSUzI1NiJ9.eyJzdWIiOiJodHRwOlwvXC9vcmNpZC5vcmdcLzAwMDAtMDAwMi04MTIxLTIzNDEiLCJmdWxsTmFtZSI6IkNocmlzdG9waGVyIEpvbmVzIiwiaXNzdWVkQXQiOiIyMDE3LTA1LTIxVDE1OjU3OjA2LjEyOCswMDowMCIsImNvbnN1bWVyS2V5IjoidGhlY29uc3VtZXJrZXkiLCJleHAiOjE0OTU0NDcwMjYsInVzZXJJZCI6Imh0dHA6XC9cL29yY2lkLm9yZ1wvMDAwMC0wMDAyLTgxMjEtMjM0MSIsInR0bCI6NjQ4MDAsImlhdCI6MTQ5NTM4MjIyNn0.dynDbRKqIuI1bXzPYlHfW7aFcrl2J7O8ZWqxS_2DHBotx4AqX_hbxuRrlQ_9s-V1mRJupyxkYxW3EWkLcoMUQNTuyMLGpV53GPoGdBjkTEd407GU-yxv_G3cmmSovXSLj6AAjeKJ8KHBt4y6JtgqR2isf5YGoM18CwM-IZV3nJVPBMZpNMPhYSWJeaeD2u02duKCpcy7L-XD_OCLJdzHjtjyFqqbHvqGyZIPqc9Kp_JTuTmlYaAZe9JiLcjHnyaOeHMGCEkmOekiRA_wh6DtnBLKyCczBjNg0kirxMk27abjAxt-ckhKfrCT6dnXbd1lCLNnxVYiJj5wztNOGH492T3nyaSQGROnSQd6cxB3pPAiwW7AOR34MPNJlNv_r-3WbwThDeOOtrMSvfZtYGv6Mn_i0-d1yjccRDzZeXdRS0P91GYfdK2lfog1lhiPuec3gD4V4plNJR3wKSSMhgjikH6igCB5I7C5n9Ye5vSeyWW9ApwLogfbEUc3xKgiCgj1jtED4L7E3WgUvtWxsyqMMtaEAJGvRHlGPPShD3xHPsm6ltCVrU1arLXneuGa0R7M-GgzMk0z5HdRE2bD2agu5WuN-w5-w9W6jwrzgI4wM7v8KiJYxeM332nx4f2BF6ArFJ2K-DxlpgmdK6bkPTtL7H-uj5digXvBoHFYZAJF49c</p>
<p>After grabbing the public certificate from the production CN:</p>
<p>-----BEGIN CERTIFICATE-----<br>
MIIFQzCCBCugAwIBAgISAxPSoq7BM7aFc1VzgyTJkz3wMA0GCSqGSIb3DQEBCwUA<br>
MEoxCzAJBgNVBAYTAlVTMRYwFAYDVQQKEw1MZXQncyBFbmNyeXB0MSMwIQYDVQQD<br>
ExpMZXQncyBFbmNyeXB0IEF1dGhvcml0eSBYMzAeFw0xNzA1MTcxMjI5MDBaFw0x<br>
NzA4MTUxMjI5MDBaMBkxFzAVBgNVBAMTDmNuLmRhdGFvbmUub3JnMIIBIjANBgkq<br>
hkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAtp++UWPu0Zm4gIs01F+LE94i4eExI+UX<br>
82DIB3Xn93FW4IgDTsjEfXCB3AHggdx6GnExbDzu/iXn+K3LiW6QaeasG47XOeup<br>
JjpmJqDROAJvLy1GpgrFeNxEe5F6xljPcAxUH/W/NkoHAem7wMatRNA53f6JkMVd<br>
sKXAYPOdKUOqhQ9QRMqEFIPImt+SHfvxUkQyL4g+1taQ5XYDu5zwF5+k77ZRre+o<br>
RVR9gHdbdlvLLQYP9eGJdi+nmFFTrEuXIklB8SQi6yvck0p6nR2sjmxFlnaLTe7Z<br>
iaVWaA1vvwvwgG27Q2iMcnAG+JXQDe7Jd1YIuXUW7vVYyGl4ONbp3QIDAQABo4IC<br>
UjCCAk4wDgYDVR0PAQH/BAQDAgWgMB0GA1UdJQQWMBQGCCsGAQUFBwMBBggrBgEF<br>
BQcDAjAMBgNVHRMBAf8EAjAAMB0GA1UdDgQWBBSyQkmQUHHO3EkItWseuA3L6vg8<br>
1DAfBgNVHSMEGDAWgBSoSmpjBH3duubRObemRWXv86jsoTBwBggrBgEFBQcBAQRk<br>
MGIwLwYIKwYBBQUHMAGGI2h0dHA6Ly9vY3NwLmludC14My5sZXRzZW5jcnlwdC5v<br>
cmcvMC8GCCsGAQUFBzAChiNodHRwOi8vY2VydC5pbnQteDMubGV0c2VuY3J5cHQu<br>
b3JnLzBcBgNVHREEVTBTghRjbi1vcmMtMS5kYXRhb25lLm9yZ4IVY24tdWNzYi0x<br>
LmRhdGFvbmUub3JnghRjbi11bm0tMS5kYXRhb25lLm9yZ4IOY24uZGF0YW9uZS5v<br>
cmcwgf4GA1UdIASB9jCB8zAIBgZngQwBAgEwgeYGCysGAQQBgt8TAQEBMIHWMCYG<br>
CCsGAQUFBwIBFhpodHRwOi8vY3BzLmxldHNlbmNyeXB0Lm9yZzCBqwYIKwYBBQUH<br>
AgIwgZ4MgZtUaGlzIENlcnRpZmljYXRlIG1heSBvbmx5IGJlIHJlbGllZCB1cG9u<br>
IGJ5IFJlbHlpbmcgUGFydGllcyBhbmQgb25seSBpbiBhY2NvcmRhbmNlIHdpdGgg<br>
dGhlIENlcnRpZmljYXRlIFBvbGljeSBmb3VuZCBhdCBodHRwczovL2xldHNlbmNy<br>
eXB0Lm9yZy9yZXBvc2l0b3J5LzANBgkqhkiG9w0BAQsFAAOCAQEAJo/aaCo0NweP<br>
prHz+9Ko39xZ/Y6kum0ZOSw6BFM8zgkOOd1R0rbc53j09yKDi3V+MKd5rXfISNsp<br>
LKBVe/R8HH/rglYUhMTBBizGsEdyPE4n5I3ml4RyOVmC1SpDPUzH0CAeSLkzBpBV<br>
WVIfEwl641GtT0hBcwVjMlDYywrvSHv4mifVLd/2ZTSYillrhQzQySKb9g7jbEld<br>
LHY1WoIU0E5XgQJq3b6Vhb5dXVkHsDfwPHNpJA5fVCVYoKazo+xSNBP757ta/ix4<br>
e9CbRsQQ0TgEsuUAOa9lh9+O8uAL5zkZ4kwZCLypxbkZ8/YYOCMGMtGz4632J7VF<br>
Ozukfk41bw==<br>
-----END CERTIFICATE-----</p>
<p>and trying to verify the token with this certificate, it fails. </p>
<p>However, it verifies correctly with the old CN certificate:<br>
<br>
-----BEGIN CERTIFICATE-----<br>
MIIFrTCCBJWgAwIBAgICbkowDQYJKoZIhvcNAQELBQAwRzELMAkGA1UEBhMCVVMx<br>
FjAUBgNVBAoTDUdlb1RydXN0IEluYy4xIDAeBgNVBAMTF1JhcGlkU1NMIFNIQTI1<br>
NiBDQSAtIEczMB4XDTE0MTEwMzEyNTMyNFoXDTE3MDUyMDIxNDU0OVowgZExEzAR<br>
BgNVBAsTCkdUMzkwMjU2MTcxMTAvBgNVBAsTKFNlZSB3d3cucmFwaWRzc2wuY29t<br>
L3Jlc291cmNlcy9jcHMgKGMpMTIxLzAtBgNVBAsTJkRvbWFpbiBDb250cm9sIFZh<br>
bGlkYXRlZCAtIFJhcGlkU1NMKFIpMRYwFAYDVQQDDA0qLmRhdGFvbmUub3JnMIIC<br>
IjANBgkqhkiG9w0BAQEFAAOCAg8AMIICCgKCAgEAzZaa/tslwA/CJ6Wqfzl72TrF<br>
/8IurHHrfzmme/B2dSUt0+zDfdfXWe7p6pZ4yJp95Kk34cf0EFWgFJ5Nc1gyXJUh<br>
Ht6IVweDDFrExeNPsNbI5DLFdUJ5ZfNhWrqu2C4kdeRfHqxOvI0w6XEfdZ4yI3QC<br>
zfx5EtsoFEXpqK5Xe3r5KEnXVsPq6azerVqvq2UqhPa0EYJA8/CVJiQ0CRQl+w9x<br>
Mh6GBvHUXqCHBPlRPIY7QomI+3Cx8gYgcLCCEcHVgzU05zQQRwdtIqjENq6CubH9<br>
UTMiKS81CFJbAVrKetDRI3bNGIcEEpjV1XC28OOWXNc9fXXAK3fvVFVl2tuzYFn0<br>
ROmRrtiz4+jXC7mp7/fTb5ekTeenKyoVA5UicbIHM1PPQeTwcHUH7CxybJVheGAo<br>
7wwzqrxin3LMMyn56QBXqB81qL+iMJ+ZBHXxiS5V6g4W1ag3VOtDvyRtN1QGB6J2<br>
enOTBOHNwr9bHuJcVPx1dYd6YjZD3LQbyJZyVtYHalnlCXGjLCxs9B2uL4MBllb5<br>
N++ouBiujO5ww6Ht+MgOq/gbahx9WlJCs5xXLy8Hf+FfjUBZXDdkvLwa36FWktZa<br>
ibbqqeBBq9IaW0gUNNmhYs3SB8J7JICVflUIp7e7wy7cXBJHpkATZKAuHVnqJ8ZT<br>
83YekoQFyxpcqB2fmRkCAwEAAaOCAVYwggFSMB8GA1UdIwQYMBaAFMOc8/zTRgg0<br>
u85Gf6B8W/PiCMtZMFcGCCsGAQUFBwEBBEswSTAfBggrBgEFBQcwAYYTaHR0cDov<br>
L2d2LnN5bWNkLmNvbTAmBggrBgEFBQcwAoYaaHR0cDovL2d2LnN5bWNiLmNvbS9n<br>
di5jcnQwDgYDVR0PAQH/BAQDAgWgMB0GA1UdJQQWMBQGCCsGAQUFBwMBBggrBgEF<br>
BQcDAjAlBgNVHREEHjAcgg0qLmRhdGFvbmUub3JnggtkYXRhb25lLm9yZzArBgNV<br>
HR8EJDAiMCCgHqAchhpodHRwOi8vZ3Yuc3ltY2IuY29tL2d2LmNybDAMBgNVHRMB<br>
Af8EAjAAMEUGA1UdIAQ+MDwwOgYKYIZIAYb4RQEHNjAsMCoGCCsGAQUFBwIBFh5o<br>
dHRwczovL3d3dy5yYXBpZHNzbC5jb20vbGVnYWwwDQYJKoZIhvcNAQELBQADggEB<br>
ABcvSyNwX1jHZ7HRX5Lzcua0Q4//wc5KCBvPgPrbr3bGSi3+t+Rc4ZagIUxFWSd1<br>
uZ+guQ4lywhQXGOXh7dH1SPljPOwZ9VPdhJMPW/woaQ0ndakLvW0OBIgyyqIcJ57<br>
8e6DKzZ0jd97xmXYAa7iMhCxL2lpXzDQMH5k8XhENHcjMXfVitkqmIS2Wfi1rEMK<br>
phszml9yRABtx+X0z/4/xmNZ2PrNApqmqVD2DnY1MgJNHga/KmPX/6VZ+NEszudP<br>
rvrD5hQvAjkJA+5kgqX31w98ggfXg4oxQo8AhKrHWnhI52SoWT1BOwSGDRpgRW/n<br>
1AdVxT9TIoHXbhf6+c8fWOU=<br>
-----END CERTIFICATE-----</p>
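The verification check can be reproduced with plain openssl: extract the public key from a candidate certificate and verify the token's RS256 signature (which covers the @header.payload@ portion) against it. A self-contained sketch, with a throwaway self-signed key pair standing in for the CN's certificate and signing key:

```shell
# Generate a stand-in certificate and key (the real check would use the
# CN's certificate and a token signed by the CN's key).
openssl req -x509 -newkey rsa:2048 -keyout key.pem -out cert.pem \
  -nodes -subj "/CN=cn.dataone.org" 2>/dev/null

# 1. Pull the public key out of the certificate (same step for the real CN cert):
openssl x509 -in cert.pem -pubkey -noout > pub.pem

# 2. An RS256 JWT signs "header.payload"; emulate that signing input:
printf 'eyJhbGciOiJSUzI1NiJ9.eyJzdWIiOiJ0ZXN0In0' > signing-input.txt
openssl dgst -sha256 -sign key.pem -out sig.bin signing-input.txt

# 3. Verify; "Verified OK" means the certificate matches the token's signing key:
openssl dgst -sha256 -verify pub.pem -signature sig.bin signing-input.txt
```

Running step 3 with the new Let's Encrypt public key against a token signed by the old RapidSSL key would print "Verification Failure", matching the symptom above.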
<p>So, effectively, the @d1_cn_portal@ component is still signing tokens with the old RapidSSL certificate, but (I think) MNs that have recently been restarted fetch the most recent CN certificate for verification, get the new Let's Encrypt certificate, and so can't verify incoming tokens signed by the CN. My guess is that this will also be a problem for any other MN that reboots or restarts and relies on the CN signing tokens. Looking at the @portal.properties@ file on the CN, I see that it is indeed still pointing to the old certificate and key:<br>
<br>
cn.server.publiccert.filename=/etc/ssl/certs/_.dataone.org.crt<br>
cn.server.privatekey.filename=/etc/ssl/private/dataone_org.key</p>
<p>So, in the short term, we need to plan to re-configure @portal.properties@ on the production CNs to use the new Let's Encrypt certificates for token signing:<br>
<br>
cn.server.publiccert.filename=/etc/letsencrypt/live/cn.dataone.org/fullchain.pem<br>
cn.server.privatekey.filename=/etc/letsencrypt/live/cn.dataone.org/privkey.pem</p>
<p>However, the @fullchain.pem@ includes the intermediate CA certs as well, and I don't know if @CertificateManager.loadCertificateFromFile()@ handles multiple certificates in a file (i.e. does it use the first found, last found, etc?). We need to determine this before making the properties change, but also before other production MNs get rebooted and begin to fail authentication for clients.</p>
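One quick way to see which certificate a single-certificate parser picks up from @fullchain.pem@: @openssl x509@ reads only the first PEM block in a file, which is also the documented behavior of Java's @CertificateFactory.generateCertificate()@, though whether @CertificateManager.loadCertificateFromFile()@ follows suit is exactly what needs testing. A sketch with self-signed stand-ins:

```shell
# Build a stand-in fullchain.pem: leaf first, then an "intermediate"
# (Let's Encrypt orders fullchain.pem this way).
openssl req -x509 -newkey rsa:2048 -keyout k1.pem -out leaf.pem \
  -nodes -subj "/CN=cn.dataone.org" 2>/dev/null
openssl req -x509 -newkey rsa:2048 -keyout k2.pem -out int.pem \
  -nodes -subj "/CN=Stand-in Intermediate CA" 2>/dev/null
cat leaf.pem int.pem > fullchain.pem

# openssl x509 parses only the FIRST PEM block, so this prints the leaf subject:
openssl x509 -in fullchain.pem -noout -subject

# Extracting the first (leaf) certificate to its own file would sidestep the
# multi-certificate ambiguity entirely:
openssl x509 -in fullchain.pem -out leaf-only.pem
```

If the leaf-first ordering holds, pointing @cn.server.publiccert.filename@ at a single-certificate file produced this way removes any dependence on the parser's multi-certificate behavior.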
<p>Once tested, for the long term, we need to update the portal properties in the buildout to make the changes permanent. We may also need to add some logic for ensuring the @/etc/letsencrypt@ files have the correct permissions as Dave pointed out.</p>
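For the permissions piece, the usual approach is to grant the service account group read access to the key material rather than loosening world permissions. A sketch on a scratch copy of the layout; the real run would target @/etc/letsencrypt@ under sudo, and the shared group (e.g. "ssl-cert") and Tomcat account name are assumptions that vary by deployment:

```shell
# Scratch copy of the Let's Encrypt layout; a real run would also chgrp the
# files to a shared group (e.g. "ssl-cert" -- an assumption) that the
# Tomcat user belongs to.
mkdir -p demo-letsencrypt/live/cn.dataone.org
touch demo-letsencrypt/live/cn.dataone.org/privkey.pem
chmod 600 demo-letsencrypt/live/cn.dataone.org/privkey.pem

# Grant group read so the service account can load the private key:
chmod 640 demo-letsencrypt/live/cn.dataone.org/privkey.pem
stat -c '%a' demo-letsencrypt/live/cn.dataone.org/privkey.pem   # prints 640
```

Whatever scheme is chosen would need to survive certbot renewals, which recreate files under @/etc/letsencrypt/archive@.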
Infrastructure - Bug #8051 (In Progress): CORS-based CN calls fail using Internet Explorer on Win...https://redmine.dataone.org/issues/80512017-03-22T20:02:54ZChris Jonescjones@nceas.ucsb.edu
<p>As noted in <a class="issue tracker-1 status-5 priority-4 priority-default closed" title="Bug: Error -1205 "Client Certificate Rejected" by Safari (Closed)" href="https://redmine.dataone.org/issues/2693">#2693</a>, <a class="issue tracker-1 status-5 priority-4 priority-default closed" title="Bug: completely unable to access cn.dataone.org from Safari 7.1 if user has any certificates installed (Closed)" href="https://redmine.dataone.org/issues/6539">#6539</a>, and <a class="issue tracker-1 status-6 priority-4 priority-default closed" title="Bug: Safari 6.0 fails to connect to Metacat MN with SSLVerifyClient (Rejected)" href="https://redmine.dataone.org/issues/3255">#3255</a>, Safari does not handle TLS handshakes correctly when asked for a client X509 certificate. Similarly, IE 11 (and 10) on Windows 7 (and maybe others) does not handle TLS handshakes correctly.</p>
<p>The MetacatUI application used on certain Member Nodes (KNB, ARCTIC, ...) makes calls to the CN Identity API to get account information for users and their associated groups. This is done via an XHR, using the CORS pre-flight mechanism of calling @OPTIONS@ on the REST endpoint. During this call, the CN returns an @HTTP 403@ (Forbidden) response to IE 11 on Windows, but the call succeeds from Firefox and Chrome on Windows. </p>
<p>It seems (but we're not sure) that IE is responding to the request for a client certificate, and whatever is sent is being rejected by the CN web server. However, it's not entirely straightforward. When Apache is configured with:<br>
<br>
SSLVerifyClient optional<br>
SSLVerifyDepth 10</p>
<p>IE11 succeeds on the @OPTIONS@ request. However, when the CN is configured to conditionally set @SSLVerifyClient@ within a @&lt;Location&gt;@ block:<br>
<br>
SSLVerifyClient none<br>
<br>
&lt;Location ...&gt;<br>
SSLVerifyClient optional<br>
SSLVerifyDepth 10<br>
&lt;/Location&gt;<br>
<br>
the request fails (and this is currently the production configuration).</p>
<p>However, after testing in STAGE, IE11 works fine when not asked for a client certificate (@SSLVerifyClient none@). It seems that the interaction with Apache changes based on the conditional logic in a specific @&lt;Location&gt;@ block, and IE11 responds incorrectly in some way.</p>
<p>To alleviate issues with browser-based client certificate requests, I suggest that we adopt the following configuration:<br>
<br>
SSLVerifyClient none<br>
<br>
&lt;Location ...&gt;<br>
SSLVerifyClient optional<br>
SSLVerifyDepth 10<br>
&lt;/Location&gt;<br>
<br>
This configuration excludes most desktop/handheld browser clients from being asked for an X509 certificate, while still allowing Java, Python, curl, R, and similar clients to connect using client-side certificates. Since we've migrated to JWT token-based browser authentication, this seems reasonable to me.</p>
<p>This is currently a blocker in production, so we should consider a manual change to the production CNs before this gets rolled into a CCI release, if agreed upon. Thoughts welcome.</p>