DataONE Tasks: Issueshttps://redmine.dataone.org/https://redmine.dataone.org/favicon.ico2020-07-15T18:03:40ZDataONE Tasks
Redmine Infrastructure - Bug #8866 (New): Java client tools should set a custom user agent stringhttps://redmine.dataone.org/issues/88662020-07-15T18:03:40ZBryce Mecummecum@nceas.ucsb.edu
<p>Related to <a href="https://redmine.dataone.org/issues/7047">https://redmine.dataone.org/issues/7047</a></p>
<p>It looks like nowhere in <code>d1_libclient_java</code> do we set a user agent string. Aside from being best practice, it limits our ability to customize our infrastructure around it. For example, OPC is running into HTTP 413s due to overrunning their TLS renegotiation buffer and we can't effectively whitelist their requests, which come from our Java client tools, to allow them to upload large files.</p>
Infrastructure - Task #8865 (Closed): Configure dataone.org web server to redirect DataONE datase...https://redmine.dataone.org/issues/88652020-07-02T19:06:39ZBryce Mecummecum@nceas.ucsb.edu
<p>As scoped out in <a href="https://hpad.dataone.org/8h3o_7VPTIibo5xL9bz24w">https://hpad.dataone.org/8h3o_7VPTIibo5xL9bz24w</a>, we'd like to be able to referenced DataONE resources in a linked-open-data manner. This is useful because it forms the basis of building more interested things on top of them. However, those resources (e.g., Data Packages) don't currently have, subjectively, suitable IRIs, though they have a variety of URLs.</p>
<p>The ideas in the above proposal are multi-tiered but a good first start can be achieved immediately: Support a PIRI space for "Datasets" (DataONE Data Packages) by redirecting requests from their IRI form:</p>
<p><code>https://dataone.org/datasets/$ID</code></p>
<p>to their URL form:</p>
<p><code>https://search.dataone.org/view/$ID</code>.</p>
<p>Such a redirection can be achieved within our Apache configuration using <code>mod_rewrite</code> and a rule similar to:</p>
<pre>RewriteRule "^/datasets/(.+)$" "https://search.dataone.org/view/$1" [L,R]
</pre> CN REST - Bug #8860 (New): /token endpoint doesn't set a content-type and character encodinghttps://redmine.dataone.org/issues/88602020-02-29T01:00:11ZBryce Mecummecum@nceas.ucsb.edu
<p>On Firefox only, requests to the /portal/token endpoint (i.e., the one MetacatUI and other clients use to fetch their auth tokens, like <a href="https://cn.dataone.org/portal/token">https://cn.dataone.org/portal/token</a>) result in errors in the browser console.</p>
<p>When you access the URL via an XHR request, you see:</p>
<blockquote>
<p>XML Parsing Error: syntax error<br>
Location: <a href="https://cn-stage.test.dataone.org/portal/token">https://cn-stage.test.dataone.org/portal/token</a><br>
Line Number 1, Column 1:</p>
</blockquote>
<p>When you access the URL directly in Firefox:</p>
<blockquote>
<p>The character encoding of the plain text document was not declared. The document will render with garbled text in some browser configurations if the document contains characters from outside the US-ASCII range. The character encoding of the file needs to be declared in the transfer protocol or file needs to use a byte order mark as an encoding signature.</p>
</blockquote>
<p>I had a hunch that this error would go away if the response simply had the <code>Content-Type</code> header set to <code>text/plain; charset=utf-8</code> so I spun up <code>mitmproxy</code>, made that edit to the intercepted response, and saw that the error does go away.</p>
<p>I think we should modify the portal code to set the <code>Content-Type</code> header like above so the error goes away.</p>
Infrastructure - Task #8858 (New): Update CN Apache configs in version control with directives to...https://redmine.dataone.org/issues/88582020-02-05T20:02:12ZBryce Mecummecum@nceas.ucsb.edu
<p>Sitemaps are located on disk in ${tomcat_webapps_dir}/${context}/sitemaps as <code>sitemap_index.xml</code> and <code>sitemap%d.xml</code> (for each sub-sitemap).</p>
<p>The rule we've come up with is:</p>
<p><code>RewriteRule ^/(sitemap.+) /metacat/sitemaps/$1 [R=303]</code></p>
Infrastructure - Bug #8836 (Closed): CSS and JS not loading on ProvONE documentationhttps://redmine.dataone.org/issues/88362019-08-12T21:54:25ZBryce Mecummecum@nceas.ucsb.edu
<p>I was looking at <a href="http://jenkins-1.dataone.org/jenkins/view/Documentation%20Projects/job/ProvONE-Documentation-trunk/ws/provenance/ProvONE/v1/provone.html">http://jenkins-1.dataone.org/jenkins/view/Documentation%20Projects/job/ProvONE-Documentation-trunk/ws/provenance/ProvONE/v1/provone.html</a> today and Chrome and Safari are both upset with how its CSS and JS are set up and are refusing to load either. Steps to reproduce:</p>
<ol>
<li>Go to <a href="http://jenkins-1.dataone.org/jenkins/view/Documentation%20Projects/job/ProvONE-Documentation-trunk/ws/provenance/ProvONE/v1/provone.html">http://jenkins-1.dataone.org/jenkins/view/Documentation%20Projects/job/ProvONE-Documentation-trunk/ws/provenance/ProvONE/v1/provone.html</a></li>
<li>See no styles are loaded, hide/show examples buttons don't work. See errors in devtools about CSP errors.</li>
</ol>
Infrastructure - Bug #8822 (New): Account queries in STAGE-2 failing from web browserhttps://redmine.dataone.org/issues/88222019-06-19T01:44:19ZBryce Mecummecum@nceas.ucsb.edu
<p>Steps to reproduce:</p>
<ol>
<li>Visit <a href="https://search-stage-2.test.dataone.org/">https://search-stage-2.test.dataone.org/</a></li>
<li>Log in</li>
<li>Navigate to <a href="https://search-stage-2.test.dataone.org/data">https://search-stage-2.test.dataone.org/data</a></li>
<li>Observe an HTTP 500 in the Network pane of whichever browser you're using to <a href="https://search-stage-2.test.dataone.org/cn/v2/accounts/?query=%7BYOUR_DN%7D">https://search-stage-2.test.dataone.org/cn/v2/accounts/?query={YOUR_DN}</a></li>
</ol>
<p>Response body:</p>
<pre><?xml version="1.0" encoding="UTF-8"?>
<error detailCode="500" errorCode="500" name="ServiceFailure">
<description>Internal Server Error: The server encountered an unexpected condition which prevented it from fulfilling the request.</description>
</error>
</pre>
<p>Some notes:</p>
<ul>
<li>I notice this in multiple browsers</li>
<li>I notice this only when the request is issued from MetacatUI, not when I visit the URL in my browser or hit it with curl</li>
<li>I don't notice this error on search.dataone.org</li>
<li>MetacatUI on search.dataone.org doesn't even issue this request</li>
<li>MetacatUI on stage-2 is at v2.4.2 and on search is at 2.6.1</li>
</ul>
<p>It seems like a bug to me that we see a service failure but I'm not sure if this is a MetacatUI bug or an issue in the CN stack (i.e., Apache config or something) but I wanted to file it for someone to take a look.</p>
Infrastructure - Bug #8821 (New): Object doi:10.18739/A2610VR7Z fails to index on CNhttps://redmine.dataone.org/issues/88212019-06-18T22:43:38ZBryce Mecummecum@nceas.ucsb.edu
<p>I noticed <a href="https://search.dataone.org/cn/v2/meta/doi:10.18739/A2610VR7Z">https://search.dataone.org/cn/v2/meta/doi:10.18739/A2610VR7Z</a> isn't present in the search index and asked about it on DataONEOrg #ci Slack channel. Chris noted that no no task was in the index queue for the PID. He tried to force a reindex which failed. He couldn't find a log entry indicating why it failed so someone needs to look into what's going on. It occurs to me that whatever bug is affecting this Object may have affected more Objects so a second part of fixing this bug should probably be doing some housekeeping or an audit of some kind to make sure the search index isn't missing more content.</p>
Infrastructure - Task #8820 (Closed): Add new DataONE Object format for HDF4/5 file formatshttps://redmine.dataone.org/issues/88202019-06-13T19:54:01ZBryce Mecummecum@nceas.ucsb.edu
<p>HDF 4 and 5 are efficient binary formats for data commonly used in science: <a href="https://en.wikipedia.org/wiki/Hierarchical_Data_Format">https://en.wikipedia.org/wiki/Hierarchical_Data_Format</a>, <a href="https://www.hdfgroup.org/solutions/hdf5/">https://www.hdfgroup.org/solutions/hdf5/</a>. I don't think we have a lot of content in this format, if any, but it's a pretty common format and a good one at that.</p>
<p>I did some research on MIME types and file extensions:</p>
<p>Re: MIME type:</p>
<blockquote>
<p>The recommended content is application/x-hdf5 for data in HDF5 or application/x-hdf for data in earlier versions.</p>
</blockquote>
<p><a href="https://www.hdfgroup.org/2018/06/citations-for-hdf-data-and-software/">https://www.hdfgroup.org/2018/06/citations-for-hdf-data-and-software/</a></p>
<p>Re: Extension,</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Hierarchical_Data_Format">https://en.wikipedia.org/wiki/Hierarchical_Data_Format</a> lists a few of the variants I've seen</li>
<li>The R hdf5 package uses the h5 extension (<a href="https://github.com/grimbough/rhdf5/tree/master/inst/testfiles">https://github.com/grimbough/rhdf5/tree/master/inst/testfiles</a>) so I went with this</li>
<li>.hdf4 and .hdf5 seem common too but h4/h5 seems a tad more common</li>
</ul>
<p><strong>Here are the details for each of the new formats:</strong></p>
<p><code>HDF4</code></p>
<ul>
<li>formatId: <code>application/x-hdf</code></li>
<li>formatName: Hierarchical Data Format version 4 (HDF4)</li>
<li>mediaType: <code>application/octet-stream</code></li>
<li>extension: h4</li>
</ul>
<p><code>HDF5</code></p>
<ul>
<li>formatId: <code>application/x-hdf5</code></li>
<li>formatName: Hierarchical Data Format version 5 (HDF5)</li>
<li>mediaType: <code>application/octet-stream</code></li>
<li>extension: h5</li>
</ul>
Infrastructure - Task #8817 (New): Configure sitemaps on the CNhttps://redmine.dataone.org/issues/88172019-06-06T23:52:23ZBryce Mecummecum@nceas.ucsb.edu
<p>Support for sitemaps landed last fall in Metacat: <a href="https://github.com/NCEAS/metacat/pull/1283">https://github.com/NCEAS/metacat/pull/1283</a>. Sitemaps are good for users but especially for search engines and DataONE's Search Catalog could benefit from having sitemaps enabled. A sitemap could help crawlers discover all of the datasets in DataONE. The CNs already run Metacat and should use Metacat's sitemaps ability to generate sitemaps for all content.</p>
<p>To enable sitemaps on the CNs, a few things seem to be needed. I'm not very familiar with how the CNs get built so I may be wrong or be missing things:</p>
<ul>
<li>Sitemaps rely on two properties in Metacat's <code>metacat.properties</code> file, which should have values: <code>sitemap.location.base=https://search.dataone.org/</code> and <code>sitemap.entry.base=https://search.dataone.org/view</code></li>
<li>The Apache config the CNs are built with need to serve the <code>sitemap_index.xml</code> and individual sitemaps from the Tomcat webapps dir. Metacat generates sitemaps in the <code>sitemaps</code> subfolder (e.g., <code>/usr/lib/tomcat8/webapps/metacat/sitemaps</code>). A <code>Directory</code> directive should work so long as filesystem permissions are set up for Apache to see the files.</li>
<li>We need a robots.txt that points at the sitemap index file at search.dataone.org which provides the entrypoint to <code>sitemap_index.xml</code></li>
<li>Metacat generates sitemaps with a recurring job mechanism that's internal to Metacat. AFAIK this job isn't turned on when Tomcat loads Metacat and a request has to get sent to the admin API which turns this job on as a side-effect. We might want to change this to reduce maintenance burden or chance of having stale sitemaps</li>
</ul>
<p>Dave nominated Jing for this work and has targeted this for the next CCI release. I'm not sure which that is so please select whichever one is appropriate.</p>
<p>Note this relates to <a href="https://redmine.dataone.org/issues/8693">https://redmine.dataone.org/issues/8693</a> which we've delayed because Google's crawler infrastructure has changed and DataONE is now visible by Google. Google staff have indicated we only need to send them a robots.txt that points to our sitemaps for them to begin crawling.</p>
Infrastructure - Bug #8802 (Closed): Titles in EML records that use <value> in <title> do not get...https://redmine.dataone.org/issues/88022019-05-18T00:55:24ZBryce Mecummecum@nceas.ucsb.edu
<p>Margaret O'Brien over at EDI noticed this and sent it my way. She found an EML record that serializes the title in a schema-valid but unusual way:</p>
<pre>...snip...
<title>
<value>Forest-wide bird survey at 183 sample sites the Andrews Experimental Forest from 2009-present (Reformatted to ecocomDP Design Pattern)</value>
</title>
...snip...
</pre>
<p>See <a href="https://search.dataone.org/view/https://pasta.lternet.edu/package/metadata/eml/edi/359/1">https://search.dataone.org/view/https://pasta.lternet.edu/package/metadata/eml/edi/359/1</a> and notice the citation is missing its title (which is powered by Solr) and also see the accompanying Solr doc at <a href="https://search.dataone.org/cn/v2/query/solr/?q=id:%22https://pasta.lternet.edu/package/metadata/eml/edi/359/1%22">https://search.dataone.org/cn/v2/query/solr/?q=id:%22https://pasta.lternet.edu/package/metadata/eml/edi/359/1%22</a> and notice the title field is not set.</p>
<p>This is a bit tricky. It's pretty clearly <em>not</em> endorsed in the EML spec, as per:</p>
<p><em>i18nNonEmptyStringType</em>: (emphasis mine)</p>
<blockquote>
<p>This type specifies a content pattern for all elements that require language translations. The xml:lang attribute can be used to define the default language for element content. <em>Additional translations should be included as child 'value' elements</em> that also have an optional xml:lang</p>
</blockquote>
<p>So <code>value</code> in <code>title</code> is intended to be used like this:</p>
<pre><title>
My title
<value xml:lang="fr">Mon titre</value>
</title>
</pre>
<p>and not how it is in <a href="https://search.dataone.org/view/https://pasta.lternet.edu/package/metadata/eml/edi/359/1">https://search.dataone.org/view/https://pasta.lternet.edu/package/metadata/eml/edi/359/1</a>.</p>
<p>Do we want to tweak the indexer to support this or is it actually a good thing that the indexer didn't pick this up because it's, subjectively, not well-formed EML?</p>
Infrastructure - Task #8755 (New): Expand EML indexing support for EML 2.2https://redmine.dataone.org/issues/87552018-12-19T00:55:12ZBryce Mecummecum@nceas.ucsb.eduInfrastructure - Decision #8693 (In Progress): Support Google Dataset Search on search.dataone.or...https://redmine.dataone.org/issues/86932018-09-07T00:16:59ZBryce Mecummecum@nceas.ucsb.edu
<a name="Background"></a>
<h2 >Background<a href="#Background" class="wiki-anchor">¶</a></h2>
<p>Yesterday, <a href="https://toolbox.google.com/datasetsearch" class="external">Google Dataset Search</a> launched. We previoiusly attempted to make MetacatUI (and by extension, DataONE Search) compatible with it by <a href="https://github.com/NCEAS/metacatui/issues/482" class="external">injecting Schema.org JSON-LD into appropriate pages</a>. During development and testing, we checked our compatibility with the upcoming Google Dataset Search using Google's <a href="https://search.google.com/structured-data/testing-tool" class="external">Structured Data Testing Tool</a>. During development, this was all working fine and the feature appeared to be compatible but, after launching the feature on search.dataone.org, behavior changed on Google's end making it so Google no longer saw this JSON-LD. The reason for this is likely that, because MetacatUI follows a single page application architecture and we inject the JSON-LD on the client side, Google's JSON-LD crawler only saw what was sent from the server (a nearly empty index.html) and not our full page (with JSON-LD). I was able to test this theory and, while Google's crawler does execute JavaScript, it limits execution to about or exactly five seconds and MetacatUI <em>usually</em> doesn't finish injecting JSON-LD and rendering all content until after that timeout.</p>
<p>Potential paths forward to get DataONE Search compatible with Google's Dataset Search include (none of which are mutually exclusive):</p>
<ol>
<li>The assets that make up MetacatUI and the asset loading strategies could be optimized: <a href="https://github.com/NCEAS/metacatui/issues/224">https://github.com/NCEAS/metacatui/issues/224</a></li>
<li>Move the code (and any dependencies) that injects JSON-LD further up in the app boot so that Google sees it</li>
<li>Inject the appropriate JSON-LD on the server side to guarantee that Google sees it (originally Matt Jones' idea!)</li>
</ol>
<p>(1) is being worked on for sure, and (2) may not be needed if (1) is successful. I want to talk about option (3) because:</p>
<ul>
<li>It's a quicker solution (I already have something working) which would help get us involved in the project faster</li>
<li>It paves the way for future features and/or improvements to MetacatUI (we could be rendering more on the server side than just JSON-LD, like other metadata, more page content, etc)</li>
</ul>
<a name="What-I-did"></a>
<h2 >What I did<a href="#What-I-did" class="wiki-anchor">¶</a></h2>
<p>To test this idea, I modified a <a href="https://github.com/amoeba/backbone-pushstate-example" class="external">previous project</a> which is just a simple Node (Express.js) app that hosts MetacatUI by intercepting every request and serving the appropriate asset. In injects Schema.org JSON-LD, when appropriate, by querying the CN Solr index before sending MetacatUI's index.html to the client. <a href="https://github.com/amoeba/metacatui-ssr" class="external">Code is here</a> and its deployed <a href="http://neutral-cat.nceas.ucsb.edu/" class="external">here</a>. View source on any /view/... pages and you'll see a minimal Schema.org/Dataset description in the head. More properties can be added later. I did it quick and dirty: The app pre-loads MetacatUI's index.html as a <code>String</code> at app boot and injects the JSON-LD into it. No templating language or other magic.</p>
<a name="Things-to-address"></a>
<h2 >Things to address<a href="#Things-to-address" class="wiki-anchor">¶</a></h2>
<ul>
<li>How do we feel abouts switching from hosting MetacatUI via Apache (simple, bullet proof) to a Node based deployment just to support this feature (new territory, at least for me)?</li>
<li>If we do switch, we'd want to make really sure the Node app doesn't have weird failure cases where it doesn't return index.html (e.g., when Solr is down, or slow). The app needs to return index.html (and every other static asset) on every request and do it very fast and we should decide what the cutoff is so that it doesn't hold up app boot if Solr is slow/down.</li>
<li>Can this type of deployment easily be integrated with CN buildouts? I've deployed Node apps before by fronting them with Apache/nginx (via reverse proxy) and then keeping the node process up with Upstart</li>
<li>Is this performant enough for DataONE? I think my implementation is non-blocking but I'm not a Node expert so we'd want to code review and probably benchmark </li>
<li>We could wait on (1) and stick with our current deployment strategy</li>
</ul>
<a name="Other-notes"></a>
<h2 >Other notes<a href="#Other-notes" class="wiki-anchor">¶</a></h2>
<p>Unrelated to the Google Dataset Search issue but related to Google's crawling for Google Search, we've also identified:</p>
<ul>
<li>That the Metacat View Service is often unreasonably slow: <a href="https://github.com/NCEAS/metacat/issues/1234">https://github.com/NCEAS/metacat/issues/1234</a> and are planning to figure out why</li>
<li>That we can and should make use of sitemaps to help Google crawl our pages: <a href="https://github.com/NCEAS/metacat/issues/1263">https://github.com/NCEAS/metacat/issues/1263</a></li>
</ul>
Infrastructure - Bug #8615 (Closed): isotc211 indexing component has the wrong XPath for the pubD...https://redmine.dataone.org/issues/86152018-06-14T00:14:32ZBryce Mecummecum@nceas.ucsb.edu
<p>The latest trunk isotc211 indexing component has the following XPATH for <code>pubDate</code>:</p>
<pre><bean id="isotc.pubDate" class="org.dataone.cn.indexer.parser.SolrField">
<constructor-arg name="name" value="pubDate"/>
<constructor-arg name="xpath" value="(//gmd:dateStamp/gco:Date/text() | //gmd:dateStamp/gco:DateTime/text())[1]"/>
<property name="converter" ref="dateConverter"/>
</bean>
</pre>
<p>This doesn't make any sense and probably wasn't vetted when it went into version control. If you look at an example document, like one from PANGAEA, you see this is how they describe the publication date:</p>
<pre><ns0:identificationInfo>
<ns0:MD_DataIdentification>
<ns0:citation>
<ns0:CI_Citation>
<ns0:title>
<ns2:CharacterString>Simulated ocean velocity at 420 m water depth for pre-industrial, glacial, Pliocene, and Miocene climate states</ns2:CharacterString>
</ns0:title>
<ns0:date>
<ns0:CI_Date>
<ns0:date>
<ns2:DateTime>2018-04-25T15:20:13</ns2:DateTime>
</ns0:date>
<ns0:dateType>
<ns0:CI_DateTypeCode codeList="http://www.isotc211.org/2005/resources/Codelist/gmxCodelists.xml#CI_DateTypeCode" codeListValue="http://www.isotc211.org/2005/resources/Codelist/gmxCodelists.xml#CI_DateTypeCode_publication">publication</ns0:CI_DateTypeCode>
</ns0:dateType>
</ns0:CI_Date>
</ns0:date>
</pre>
<p>I think it'd be good if the XPath pulled out the first <code>date</code> from the <code>identificationInfo/citation</code>, preferring one of <code>dateType</code> <code>publicationDate</code> and falling back to any <code>date</code> that's in the <code>identificationInfo/citation</code>.</p>
DataONE API - Bug #7684 (New): Call to MNStorage.update() via REST API returns java.lang.StackOve...https://redmine.dataone.org/issues/76842016-03-21T23:07:39ZBryce Mecummecum@nceas.ucsb.edu
<p>I was trying to update an object via the REST API via cURL and forgot to enter the correct URL. The cURL command I used and response is:</p>
<p>$ curl -X PUT -H "Authorization: Bearer $TOKEN" -F "pid=resourceMap_doi:10.5065/D6G44NFV" -F "object=@object.xml" -F "sysmeta=@sysmeta.xml" -F "newPid=resourceMap_doi:10.5065/D6G44NFV_v3" $URL<br>
<?xml version="1.0" encoding="UTF-8"?><br>
java.lang.StackOverflowError<br>
</p>
<p>Where $URL was '<a href="https://arcticdata.io/metacat/d1/mn/v2/object">https://arcticdata.io/metacat/d1/mn/v2/object</a>' instead of '<a href="https://arcticdata.io/metacat/d1/mn/v2/object/resourceMap_doi:10.5065/D6G44NFV">https://arcticdata.io/metacat/d1/mn/v2/object/resourceMap_doi:10.5065/D6G44NFV</a>'</p>
<p>I expected to receive some sort of warning/error that I had forgotten to specify the URL properly for this call but instead saw a StackOverflowError.</p>
DataONE API - Bug #7578 (New): Fix 404 link to d1_instance_generator folder in documentationhttps://redmine.dataone.org/issues/75782016-01-08T22:01:20ZBryce Mecummecum@nceas.ucsb.edu
<p>In the MN API documentation for MNStorage.create (<a href="https://jenkins-ucsb-1.dataone.org/job/API%20Documentation%20-%20trunk/ws/api-documentation/build/html//apis/MN_APIs.html#MNStorage.create">https://jenkins-ucsb-1.dataone.org/job/API%20Documentation%20-%20trunk/ws/api-documentation/build/html//apis/MN_APIs.html#MNStorage.create</a>), I found a the following paragraph contains a broken link to d1_instance_generator:</p>
<blockquote>
<p>"The system metadata included with the create call must contain values for the elements required to be set by clients (see System Metadata). The system metadata document can be crafted by hand or preferably with a tool such as generate_sysmeta.py which is available in the d1_instance_generator Python package. See documentation included with that package for more information on its operation."</p>
</blockquote>
<p>The link to d1_instance_generator was to the SVN folder <a href="https://repository.dataone.org/software/cicore/trunk/d1_instance_generator">https://repository.dataone.org/software/cicore/trunk/d1_instance_generator</a> which is currently a 404. I think the folder moved to /d1_test_utilities_python/src/d1_test/instance_generator.</p>