DataONE Tasks: Issueshttps://redmine.dataone.org/https://redmine.dataone.org/favicon.ico2020-07-15T18:03:40ZDataONE Tasks
Redmine Infrastructure - Bug #8866 (New): Java client tools should set a custom user agent stringhttps://redmine.dataone.org/issues/88662020-07-15T18:03:40ZBryce Mecummecum@nceas.ucsb.edu
<p>Related to <a href="https://redmine.dataone.org/issues/7047">https://redmine.dataone.org/issues/7047</a></p>
<p>It looks like nowhere in <code>d1_libclient_java</code> do we set a user agent string. Aside from being best practice, it limits our ability to customize our infrastructure around it. For example, OPC is running into HTTP 413s due to overrunning their TLS renegotiation buffer and we can't effectively whitelist their requests, which come from our Java client tools, to allow them to upload large files.</p>
Infrastructure - Task #8865 (Closed): Configure dataone.org web server to redirect DataONE datase...https://redmine.dataone.org/issues/88652020-07-02T19:06:39ZBryce Mecummecum@nceas.ucsb.edu
<p>As scoped out in <a href="https://hpad.dataone.org/8h3o_7VPTIibo5xL9bz24w">https://hpad.dataone.org/8h3o_7VPTIibo5xL9bz24w</a>, we'd like to be able to referenced DataONE resources in a linked-open-data manner. This is useful because it forms the basis of building more interested things on top of them. However, those resources (e.g., Data Packages) don't currently have, subjectively, suitable IRIs, though they have a variety of URLs.</p>
<p>The ideas in the above proposal are multi-tiered but a good first start can be achieved immediately: Support a PIRI space for "Datasets" (DataONE Data Packages) by redirecting requests from their IRI form:</p>
<p><code>https://dataone.org/datasets/$ID</code></p>
<p>to their URL form:</p>
<p><code>https://search.dataone.org/view/$ID</code>.</p>
<p>Such a redirection can be achieved within our Apache configuration using <code>mod_rewrite</code> and a rule similar to:</p>
<pre>RewriteRule "^/datasets/(.+)$" "https://search.dataone.org/view/$1" [L,R]
</pre> Infrastructure - Task #8858 (New): Update CN Apache configs in version control with directives to...https://redmine.dataone.org/issues/88582020-02-05T20:02:12ZBryce Mecummecum@nceas.ucsb.edu
<p>Sitemaps are located on disk in ${tomcat_webapps_dir}/${context}/sitemaps as <code>sitemap_index.xml</code> and <code>sitemap%d.xml</code> (for each sub-sitemap).</p>
<p>The rule we've come up with is:</p>
<p><code>RewriteRule ^/(sitemap.+) /metacat/sitemaps/$1 [R=303]</code></p>
Infrastructure - Bug #8836 (Closed): CSS and JS not loading on ProvONE documentationhttps://redmine.dataone.org/issues/88362019-08-12T21:54:25ZBryce Mecummecum@nceas.ucsb.edu
<p>I was looking at <a href="http://jenkins-1.dataone.org/jenkins/view/Documentation%20Projects/job/ProvONE-Documentation-trunk/ws/provenance/ProvONE/v1/provone.html">http://jenkins-1.dataone.org/jenkins/view/Documentation%20Projects/job/ProvONE-Documentation-trunk/ws/provenance/ProvONE/v1/provone.html</a> today and Chrome and Safari are both upset with how its CSS and JS are set up and are refusing to load either. Steps to reproduce:</p>
<ol>
<li>Go to <a href="http://jenkins-1.dataone.org/jenkins/view/Documentation%20Projects/job/ProvONE-Documentation-trunk/ws/provenance/ProvONE/v1/provone.html">http://jenkins-1.dataone.org/jenkins/view/Documentation%20Projects/job/ProvONE-Documentation-trunk/ws/provenance/ProvONE/v1/provone.html</a></li>
<li>See no styles are loaded, hide/show examples buttons don't work. See errors in devtools about CSP errors.</li>
</ol>
Infrastructure - Bug #8821 (New): Object doi:10.18739/A2610VR7Z fails to index on CNhttps://redmine.dataone.org/issues/88212019-06-18T22:43:38ZBryce Mecummecum@nceas.ucsb.edu
<p>I noticed <a href="https://search.dataone.org/cn/v2/meta/doi:10.18739/A2610VR7Z">https://search.dataone.org/cn/v2/meta/doi:10.18739/A2610VR7Z</a> isn't present in the search index and asked about it on DataONEOrg #ci Slack channel. Chris noted that no no task was in the index queue for the PID. He tried to force a reindex which failed. He couldn't find a log entry indicating why it failed so someone needs to look into what's going on. It occurs to me that whatever bug is affecting this Object may have affected more Objects so a second part of fixing this bug should probably be doing some housekeeping or an audit of some kind to make sure the search index isn't missing more content.</p>
Infrastructure - Task #8820 (Closed): Add new DataONE Object format for HDF4/5 file formatshttps://redmine.dataone.org/issues/88202019-06-13T19:54:01ZBryce Mecummecum@nceas.ucsb.edu
<p>HDF 4 and 5 are efficient binary formats for data commonly used in science: <a href="https://en.wikipedia.org/wiki/Hierarchical_Data_Format">https://en.wikipedia.org/wiki/Hierarchical_Data_Format</a>, <a href="https://www.hdfgroup.org/solutions/hdf5/">https://www.hdfgroup.org/solutions/hdf5/</a>. I don't think we have a lot of content in this format, if any, but it's a pretty common format and a good one at that.</p>
<p>I did some research on MIME types and file extensions:</p>
<p>Re: MIME type:</p>
<blockquote>
<p>The recommended content is application/x-hdf5 for data in HDF5 or application/x-hdf for data in earlier versions.</p>
</blockquote>
<p><a href="https://www.hdfgroup.org/2018/06/citations-for-hdf-data-and-software/">https://www.hdfgroup.org/2018/06/citations-for-hdf-data-and-software/</a></p>
<p>Re: Extension,</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Hierarchical_Data_Format">https://en.wikipedia.org/wiki/Hierarchical_Data_Format</a> lists a few of the variants I've seen</li>
<li>The R hdf5 package uses the h5 extension (<a href="https://github.com/grimbough/rhdf5/tree/master/inst/testfiles">https://github.com/grimbough/rhdf5/tree/master/inst/testfiles</a>) so I went with this</li>
<li>.hdf4 and .hdf5 seem common too but h4/h5 seems a tad more common</li>
</ul>
<p><strong>Here are the details for each of the new formats:</strong></p>
<p><code>HDF4</code></p>
<ul>
<li>formatId: <code>application/x-hdf</code></li>
<li>formatName: Hierarchical Data Format version 4 (HDF4)</li>
<li>mediaType: <code>application/octet-stream</code></li>
<li>extension: h4</li>
</ul>
<p><code>HDF5</code></p>
<ul>
<li>formatId: <code>application/x-hdf5</code></li>
<li>formatName: Hierarchical Data Format version 5 (HDF5)</li>
<li>mediaType: <code>application/octet-stream</code></li>
<li>extension: h5</li>
</ul>
Infrastructure - Task #8817 (New): Configure sitemaps on the CNhttps://redmine.dataone.org/issues/88172019-06-06T23:52:23ZBryce Mecummecum@nceas.ucsb.edu
<p>Support for sitemaps landed last fall in Metacat: <a href="https://github.com/NCEAS/metacat/pull/1283">https://github.com/NCEAS/metacat/pull/1283</a>. Sitemaps are good for users but especially for search engines and DataONE's Search Catalog could benefit from having sitemaps enabled. A sitemap could help crawlers discover all of the datasets in DataONE. The CNs already run Metacat and should use Metacat's sitemaps ability to generate sitemaps for all content.</p>
<p>To enable sitemaps on the CNs, a few things seem to be needed. I'm not very familiar with how the CNs get built so I may be wrong or be missing things:</p>
<ul>
<li>Sitemaps rely on two properties in Metacat's <code>metacat.properties</code> file, which should have values: <code>sitemap.location.base=https://search.dataone.org/</code> and <code>sitemap.entry.base=https://search.dataone.org/view</code></li>
<li>The Apache config the CNs are built with need to serve the <code>sitemap_index.xml</code> and individual sitemaps from the Tomcat webapps dir. Metacat generates sitemaps in the <code>sitemaps</code> subfolder (e.g., <code>/usr/lib/tomcat8/webapps/metacat/sitemaps</code>). A <code>Directory</code> directive should work so long as filesystem permissions are set up for Apache to see the files.</li>
<li>We need a robots.txt that points at the sitemap index file at search.dataone.org which provides the entrypoint to <code>sitemap_index.xml</code></li>
<li>Metacat generates sitemaps with a recurring job mechanism that's internal to Metacat. AFAIK this job isn't turned on when Tomcat loads Metacat and a request has to get sent to the admin API which turns this job on as a side-effect. We might want to change this to reduce maintenance burden or chance of having stale sitemaps</li>
</ul>
<p>Dave nominated Jing for this work and has targeted this for the next CCI release. I'm not sure which that is so please select whichever one is appropriate.</p>
<p>Note this relates to <a href="https://redmine.dataone.org/issues/8693">https://redmine.dataone.org/issues/8693</a> which we've delayed because Google's crawler infrastructure has changed and DataONE is now visible by Google. Google staff have indicated we only need to send them a robots.txt that points to our sitemaps for them to begin crawling.</p>
Infrastructure - Bug #8815 (New): Investigate and fix failed sync/harvest of doi:10.18739/A2CH48https://redmine.dataone.org/issues/88152019-06-04T23:26:43ZBryce Mecummecum@nceas.ucsb.edu
<p>I noticed this while doing something unrelated. There isn't a copy of the sysmeta for <a href="https://arcticdata.io/metacat/d1/mn/v2/meta/doi:10.18739/A2CH48">https://arcticdata.io/metacat/d1/mn/v2/meta/doi:10.18739/A2CH48</a> on the CN:</p>
<p>At: <a href="http://cn.dataone.org/cn/v2/meta/doi:10.18739/A2CH48">http://cn.dataone.org/cn/v2/meta/doi:10.18739/A2CH48</a></p>
<pre><code class="xml syntaxhl"><span class="CodeRay"><span class="preprocessor"><?xml version="1.0" encoding="UTF-8"?></span><span class="tag"><error</span> <span class="attribute-name">detailCode</span>=<span class="string"><span class="delimiter">"</span><span class="content">1420</span><span class="delimiter">"</span></span> <span class="attribute-name">errorCode</span>=<span class="string"><span class="delimiter">"</span><span class="content">404</span><span class="delimiter">"</span></span> <span class="attribute-name">name</span>=<span class="string"><span class="delimiter">"</span><span class="content">NotFound</span><span class="delimiter">"</span></span><span class="tag">></span>
<span class="tag"><description></span>No system metadata could be found for given PID: doi:10.18739/A2CH48<span class="tag"></description></span>
<span class="tag"></error></span>
</span></code></pre>
<p>CNs are in readonly mode at the moment so I'm filing this here so someone can a look later on.</p>
Infrastructure - Task #8775 (In Progress): Make taxonomic rank fields in Solr index non-case-sens...https://redmine.dataone.org/issues/87752019-03-11T22:39:44ZBryce Mecummecum@nceas.ucsb.edu
<p>In the current system, EML documents with taxonomic coverage get indexed into fields such as <code>species</code> if they contain XML such as:</p>
<pre>...snip
<taxonomicCoverage>
<taxonomicClassification>
<taxonRankName>Species</taxonRankName>
<taxonRankValue>Some species</taxonRankValue>
...snip
</pre>
<p>The field values are extracted using the XPath in In <a href="https://repository.dataone.org/software/cicore/trunk/cn/d1_cn_index_processor/src/main/resources/application-context-eml-base.xml:">https://repository.dataone.org/software/cicore/trunk/cn/d1_cn_index_processor/src/main/resources/application-context-eml-base.xml:</a></p>
<pre>//taxonomicClassification/taxonRankValue[../taxonRankName="Species"]/text()
</pre>
<p>We ran into a case where the <code>taxonRankName</code> had been entered as 'species' instead of 'Species' and we decided that the XPath is too restrictive and that the strictness is needless and surprising. This change should result in a slight but negligible decrease in performance.</p>
<ul>
<li>Change all EML taxonomy fields to also match the lowercase form of each taxonomic rank</li>
<li>Check over other indexing field definitions related to taxonomy to make sure the above change is consistent</li>
</ul>
Infrastructure - Task #8755 (New): Expand EML indexing support for EML 2.2https://redmine.dataone.org/issues/87552018-12-19T00:55:12ZBryce Mecummecum@nceas.ucsb.eduInfrastructure - Task #8754 (New): Add EML 2.2 to CN formats listhttps://redmine.dataone.org/issues/87542018-12-19T00:54:05ZBryce Mecummecum@nceas.ucsb.eduInfrastructure - Task #8753 (New): Add support for EML 2.2 (indexing, view)https://redmine.dataone.org/issues/87532018-12-19T00:43:52ZBryce Mecummecum@nceas.ucsb.edu
<p>EML 2.2 is nearly released and, with it, comes changes to indexing and the view service. The key points are that EML now supports Markdown and semantic annotations but some more stuff was added too. See the in-progress What's New in EML 2.2.0 page <a href="https://github.com/NCEAS/eml/blob/BRANCH_EML_2_2/docs/eml-220info.md">https://github.com/NCEAS/eml/blob/BRANCH_EML_2_2/docs/eml-220info.md</a> for details.</p>
<p>I'll track these with sub-tasks but here's an overview:</p>
<p><strong>Formats</strong></p>
<p>We need to expand the list of formats to include EML 2.2.</p>
<p><strong>Indexing</strong></p>
<p>EML 2.2 requires adding new fields and changes to some existing fields. For example, abstracts in EML can now also include a <code>markdown</code> child element which we'll also want to store in Solr. We also want to add in support for semantic annotations and probably structured funding information since that's been requested so much.</p>
<p>Support for semantic annotations will likely use the same approach as our previous work on semantic search where incoming annotations were subject to materialization during the indexing process. To explain what that means, the idea is that, when an annotation for term X comes in, the index processor loads the relevant ontology, materializes the superclass hierarchy for the term, and stores the entire hierarchy in Solr. For EML records with multiple annotations, only the unique set needs to be stored.</p>
<p><strong>View service</strong></p>
<p>As mentioned above, EML 2.2 changes how some existing elements (e.g., abstract) work and adds some new ones (e.g., annotations. The existing EML 2 stylesheets will work for EML 2.2 because 2.2 is backwards compatible but we'll want to extend them to support what's new and changed in EML 2.2</p>
<p>Of note: EML now supports storing Markdown where previously only EML TextType (basically DocBook) was allowed. Our plan on Metacat/MetacatUI to support this is to render the Markdown on the client side which does involve some security concerns (see <a href="https://github.com/NCEAS/metacatui/issues/860">https://github.com/NCEAS/metacatui/issues/860</a> for tracking).</p>
Infrastructure - Bug #8615 (Closed): isotc211 indexing component has the wrong XPath for the pubD...https://redmine.dataone.org/issues/86152018-06-14T00:14:32ZBryce Mecummecum@nceas.ucsb.edu
<p>The latest trunk isotc211 indexing component has the following XPATH for <code>pubDate</code>:</p>
<pre><bean id="isotc.pubDate" class="org.dataone.cn.indexer.parser.SolrField">
<constructor-arg name="name" value="pubDate"/>
<constructor-arg name="xpath" value="(//gmd:dateStamp/gco:Date/text() | //gmd:dateStamp/gco:DateTime/text())[1]"/>
<property name="converter" ref="dateConverter"/>
</bean>
</pre>
<p>This doesn't make any sense and probably wasn't vetted when it went into version control. If you look at an example document, like one from PANGAEA, you see this is how they describe the publication date:</p>
<pre><ns0:identificationInfo>
<ns0:MD_DataIdentification>
<ns0:citation>
<ns0:CI_Citation>
<ns0:title>
<ns2:CharacterString>Simulated ocean velocity at 420 m water depth for pre-industrial, glacial, Pliocene, and Miocene climate states</ns2:CharacterString>
</ns0:title>
<ns0:date>
<ns0:CI_Date>
<ns0:date>
<ns2:DateTime>2018-04-25T15:20:13</ns2:DateTime>
</ns0:date>
<ns0:dateType>
<ns0:CI_DateTypeCode codeList="http://www.isotc211.org/2005/resources/Codelist/gmxCodelists.xml#CI_DateTypeCode" codeListValue="http://www.isotc211.org/2005/resources/Codelist/gmxCodelists.xml#CI_DateTypeCode_publication">publication</ns0:CI_DateTypeCode>
</ns0:dateType>
</ns0:CI_Date>
</ns0:date>
</pre>
<p>I think it'd be good if the XPath pulled out the first <code>date</code> from the <code>identificationInfo/citation</code>, preferring one of <code>dateType</code> <code>publicationDate</code> and falling back to any <code>date</code> that's in the <code>identificationInfo/citation</code>.</p>
Infrastructure - Task #8499 (New): Improve rendering of http://www.isotc211.org/2005/gmd-pangaea ...https://redmine.dataone.org/issues/84992018-03-14T00:49:31ZBryce Mecummecum@nceas.ucsb.edu
<p>I offered to file this a while back but it just sat in my inbox.</p>
<p>Initial support for rendering went in with <a href="https://redmine.dataone.org/issues/8219">https://redmine.dataone.org/issues/8219</a> but we noted in Slack in #ci that the rendering could be a lot better. As an example of how much better, Pangaea's own landing pages are quite nice: <a href="https://doi.pangaea.de/10.1594/PANGAEA.511392">https://doi.pangaea.de/10.1594/PANGAEA.511392</a></p>
<p>Upon checking the first Pangaea dataset in the index, <a href="https://search.dataone.org/#view/af113afed60c98df50052feaf6cb7894">https://search.dataone.org/#view/af113afed60c98df50052feaf6cb7894</a>, I see that the view service isn't actually working at all for this document. So I guess this could be two tasks:</p>
<ol>
<li>Enable the Metacat View Service for Pangaea docs</li>
<li>Improve the XSLT</li>
</ol>
<p>2 just involve extending the existing isotc211 XSLT if we think that other isotc211 creators want those fixes but if we want to make a nice view that's Pangaea-specific, we may want to consider a separate stylesheet suite.</p>
Infrastructure - Story #7859 (New): Add formatID for the STL 3d model file formathttps://redmine.dataone.org/issues/78592016-08-04T19:02:58ZBryce Mecummecum@nceas.ucsb.edu
<p>The STL file format is a domain standard file format for storing 3d models and is the most common way I've managed 3d models used while 3d printing. Given that 3d printing is seeing increased usage in the sciences, I would say this is a good candidate for inclusion in the controlled list of format ids.</p>
<p>Type: DATA<br>
Id: STL<br>
Name: StereoLithography File Format<br>
Media type: application/sla (unofficial)<br>
Extension: .stl</p>
<p>There is an ASCII form and a Binary form of this format. They don't see to be distinguished according to any standard. What do we do in this case?</p>
<p>References: <br>
- <a href="https://en.wikipedia.org/wiki/STL_(file_format)">https://en.wikipedia.org/wiki/STL_(file_format)</a><br>
- <a href="https://reference.wolfram.com/language/ref/format/STL.html">https://reference.wolfram.com/language/ref/format/STL.html</a></p>