DataONE Tasks: Issueshttps://redmine.dataone.org/https://redmine.dataone.org/favicon.ico2018-09-07T00:16:59ZDataONE Tasks
Redmine Infrastructure - Decision #8693 (In Progress): Support Google Dataset Search on search.dataone.or...https://redmine.dataone.org/issues/86932018-09-07T00:16:59ZBryce Mecummecum@nceas.ucsb.edu
<a name="Background"></a>
<h2 >Background<a href="#Background" class="wiki-anchor">¶</a></h2>
<p>Yesterday, <a href="https://toolbox.google.com/datasetsearch" class="external">Google Dataset Search</a> launched. We previously attempted to make MetacatUI (and by extension, DataONE Search) compatible with it by <a href="https://github.com/NCEAS/metacatui/issues/482" class="external">injecting Schema.org JSON-LD into appropriate pages</a>. During development and testing, we checked our compatibility with the upcoming Google Dataset Search using Google's <a href="https://search.google.com/structured-data/testing-tool" class="external">Structured Data Testing Tool</a>. At that point everything worked and the feature appeared to be compatible. After we launched the feature on search.dataone.org, however, behavior changed on Google's end and Google no longer saw this JSON-LD. The likely reason is that MetacatUI follows a single-page application architecture and injects the JSON-LD on the client side, so Google's JSON-LD crawler saw only what was sent from the server (a nearly empty index.html) and not our full page (with JSON-LD). I was able to test this theory: Google's crawler does execute JavaScript, but it limits execution to about five seconds (possibly exactly five), and MetacatUI <em>usually</em> doesn't finish injecting JSON-LD and rendering all content until after that timeout.</p>
<p>Potential paths forward to get DataONE Search compatible with Google's Dataset Search include (none of which are mutually exclusive):</p>
<ol>
<li>The assets that make up MetacatUI and the asset loading strategies could be optimized: <a href="https://github.com/NCEAS/metacatui/issues/224">https://github.com/NCEAS/metacatui/issues/224</a></li>
<li>Move the code (and any dependencies) that injects JSON-LD earlier in the app boot sequence so that it runs before Google's crawler stops executing JavaScript</li>
<li>Inject the appropriate JSON-LD on the server side to guarantee that Google sees it (originally Matt Jones' idea!)</li>
</ol>
<p>(1) is being worked on for sure, and (2) may not be needed if (1) is successful. I want to talk about option (3) because:</p>
<ul>
<li>It's a quicker solution (I already have something working) which would help get us involved in the project faster</li>
<li>It paves the way for future features and/or improvements to MetacatUI (we could render more on the server side than just JSON-LD, such as other metadata, more page content, etc.)</li>
</ul>
<a name="What-I-did"></a>
<h2 >What I did<a href="#What-I-did" class="wiki-anchor">¶</a></h2>
<p>To test this idea, I modified a <a href="https://github.com/amoeba/backbone-pushstate-example" class="external">previous project</a>, a simple Node (Express.js) app that hosts MetacatUI by intercepting every request and serving the appropriate asset. It injects Schema.org JSON-LD, when appropriate, by querying the CN Solr index before sending MetacatUI's index.html to the client. <a href="https://github.com/amoeba/metacatui-ssr" class="external">Code is here</a> and it's deployed <a href="http://neutral-cat.nceas.ucsb.edu/" class="external">here</a>. View source on any /view/... page and you'll see a minimal Schema.org/Dataset description in the head. More properties can be added later. The implementation is quick and dirty: the app pre-loads MetacatUI's index.html as a <code>String</code> at app boot and injects the JSON-LD into it. No templating language or other magic.</p>
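<p>The string-injection step described above can be sketched as follows. This is a minimal illustration in Python, not the code from the linked Node app: the function names, the choice of Schema.org properties, and the <code>/view/</code> URL pattern with a percent-encoded identifier are assumptions made for the example.</p>

```python
import json
from urllib.parse import quote


def dataset_jsonld(pid: str, title: str, base_url: str) -> dict:
    """Build a minimal Schema.org Dataset description (hypothetical property set)."""
    return {
        "@context": "http://schema.org",
        "@type": "Dataset",
        "name": title,
        "identifier": pid,
        # Assumption: dataset landing pages live under <base_url>/view/<encoded pid>.
        "url": base_url + "/view/" + quote(pid, safe=""),
    }


def inject_jsonld(index_html: str, dataset: dict) -> str:
    """Return index_html with a JSON-LD script element placed just before </head>."""
    payload = json.dumps(dataset, ensure_ascii=False)
    # Escape "</" so a value containing "</script>" cannot close the element early.
    # "\/" is a valid JSON escape for "/", so the payload still parses to the same data.
    payload = payload.replace("</", "<\\/")
    script = '<script type="application/ld+json">' + payload + "</script>"
    head_end = index_html.find("</head>")
    if head_end == -1:
        # No head element found: serve the page unchanged rather than guess.
        return index_html
    return index_html[:head_end] + script + index_html[head_end:]
```

<p>Because the page is held as a plain string, the only per-request work is one JSON serialization and one string splice, which keeps the happy path cheap.</p>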
<a name="Things-to-address"></a>
<h2 >Things to address<a href="#Things-to-address" class="wiki-anchor">¶</a></h2>
<ul>
<li>How do we feel about switching from hosting MetacatUI via Apache (simple, bulletproof) to a Node-based deployment just to support this feature (new territory, at least for me)?</li>
<li>If we do switch, we'd want to make sure the Node app has no failure cases in which it doesn't return index.html (e.g., when Solr is down or slow). The app needs to return index.html (and every other static asset) on every request, and do so very quickly. We should also decide on a cutoff for the Solr query so that a slow or unavailable Solr doesn't hold up app boot.</li>
<li>Can this type of deployment easily be integrated with CN buildouts? I've deployed Node apps before by fronting them with Apache/nginx (via reverse proxy) and then keeping the Node process up with Upstart.</li>
<li>Is this performant enough for DataONE? I think my implementation is non-blocking, but I'm not a Node expert, so we'd want to review the code and probably benchmark it.</li>
<li>We could wait on (1) and stick with our current deployment strategy</li>
</ul>
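<p>The "always return index.html, with a cutoff for Solr" requirement from the list above could look roughly like this. It is a sketch in Python rather than Node, and <code>page_for</code>, <code>fetch_snippet</code>, and the 0.5 second default are hypothetical; the actual cutoff is one of the open questions.</p>

```python
from concurrent.futures import ThreadPoolExecutor

# Shared worker pool for metadata lookups (the size is an arbitrary assumption).
_EXECUTOR = ThreadPoolExecutor(max_workers=4)


def page_for(index_html, pid, fetch_snippet, cutoff_seconds=0.5):
    """Return index_html, enriched with a JSON-LD snippet only if the lookup is fast.

    fetch_snippet is a hypothetical callable that queries Solr for pid and
    returns a ready-made <script> element as a string, or None.
    """
    future = _EXECUTOR.submit(fetch_snippet, pid)
    try:
        snippet = future.result(timeout=cutoff_seconds)
    except Exception:
        # Timeout, Solr outage, or malformed response: fall back to the plain page.
        # Note that a timed-out lookup keeps running in its worker thread.
        future.cancel()
        return index_html
    if not snippet or "</head>" not in index_html:
        return index_html
    return index_html.replace("</head>", snippet + "</head>", 1)
```

<p>Every failure path returns the unmodified page, so the worst case for a client is a page without JSON-LD, delayed by at most the cutoff.</p>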
<a name="Other-notes"></a>
<h2 >Other notes<a href="#Other-notes" class="wiki-anchor">¶</a></h2>
<p>Unrelated to the Google Dataset Search issue but related to Google's crawling for Google Search, we've also identified:</p>
<ul>
<li>That the Metacat View Service is often unreasonably slow, and we are planning to figure out why: <a href="https://github.com/NCEAS/metacat/issues/1234">https://github.com/NCEAS/metacat/issues/1234</a></li>
<li>That we can and should make use of sitemaps to help Google crawl our pages: <a href="https://github.com/NCEAS/metacat/issues/1263">https://github.com/NCEAS/metacat/issues/1263</a></li>
</ul>
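<p>For the sitemap idea, a minimal sketch of generating a sitemap document follows. The function name and the example URL are illustrative, and how the list of pages would be obtained (for example from Solr) is left open here.</p>

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Serialize (url, last_modified) pairs as a sitemap <urlset> document.

    last_modified is an ISO 8601 date string, or None to omit <lastmod>.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod
    # ElementTree escapes special characters such as "&" in element text.
    return ET.tostring(urlset, encoding="unicode")


# Usage with a made-up dataset page:
# build_sitemap([("https://search.dataone.org/view/example-pid", "2018-09-07")])
```

<p>The sitemap protocol caps a single file at 50,000 URLs, so a catalog of DataONE's size would need several files plus a sitemap index.</p>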
Infrastructure - Bug #7698 (New): Exclude SLF4J jars in d1_solr_extensions.jarhttps://redmine.dataone.org/issues/76982016-03-28T16:22:54ZRobert Waltz
<p>Dave reported:<br>
when running “service solr status” on a CN, I see a complaint about multiple SLF4J bindings:<br>
(12:05:53 PM) vieglais: SLF4J: Class path contains multiple SLF4J bindings.<br>
(12:05:53 PM) vieglais: SLF4J: Found binding in [jar:file:/var/solr/server/solr-webapp/webapp/WEB-INF/lib/d1_solr_extensions.jar!/org/slf4j/impl/StaticLoggerBinder.class]<br>
(12:05:55 PM) vieglais: SLF4J: Found binding in [jar:file:/var/solr/server/lib/ext/slf4j-log4j12-1.7.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]<br>
(12:05:56 PM) vieglais: SLF4J: See <a href="http://www.slf4j.org/codes.html#multiple_bindings">http://www.slf4j.org/codes.html#multiple_bindings</a> for an explanation.<br>
(12:05:56 PM) vieglais: SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]</p>
<p>This looks like a class conflict: the shaded d1_solr_extensions.jar bundles SLF4J, and Solr's Jetty also ships SLF4J.</p>
<p>Try excluding the SLF4J jars from d1_solr_extensions.jar.</p>
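<p>One way to confirm which jars carry a binding, before and after the exclusion, is to scan them for the class named in the warning. This is a diagnostic sketch only; the function name is made up, and the directory to scan is an assumption based on the paths in the log above.</p>

```python
import zipfile
from pathlib import Path

# The class SLF4J reports in its "multiple bindings" warning.
BINDER = "org/slf4j/impl/StaticLoggerBinder.class"


def jars_with_slf4j_binding(roots):
    """Return paths of all .jar files under the given directories that contain BINDER."""
    hits = []
    for root in roots:
        for jar in sorted(Path(root).rglob("*.jar")):
            try:
                with zipfile.ZipFile(jar) as zf:
                    if BINDER in zf.namelist():
                        hits.append(str(jar))
            except zipfile.BadZipFile:
                # Skip files that are named .jar but are not valid archives.
                continue
    return hits


# Usage, assuming the layout shown in the log:
# print(jars_with_slf4j_binding(["/var/solr/server"]))
```

<p>After a successful exclusion, d1_solr_extensions.jar should no longer appear in the result.</p>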
DataONE API - Bug #7684 (New): Call to MNStorage.update() via REST API returns java.lang.StackOve...https://redmine.dataone.org/issues/76842016-03-21T23:07:39ZBryce Mecummecum@nceas.ucsb.edu
<p>I was trying to update an object through the REST API with cURL and forgot to enter the correct URL. The cURL command I used and the response were:</p>
<p>$ curl -X PUT -H "Authorization: Bearer $TOKEN" -F "pid=resourceMap_doi:10.5065/D6G44NFV" -F "object=@object.xml" -F "sysmeta=@sysmeta.xml" -F "newPid=resourceMap_doi:10.5065/D6G44NFV_v3" $URL<br>
<?xml version="1.0" encoding="UTF-8"?><br>
java.lang.StackOverflowError<br>
</p>
<p>Where $URL was '<a href="https://arcticdata.io/metacat/d1/mn/v2/object">https://arcticdata.io/metacat/d1/mn/v2/object</a>' instead of '<a href="https://arcticdata.io/metacat/d1/mn/v2/object/resourceMap_doi:10.5065/D6G44NFV">https://arcticdata.io/metacat/d1/mn/v2/object/resourceMap_doi:10.5065/D6G44NFV</a>'</p>
<p>I expected to receive some sort of warning/error that I had forgotten to specify the URL properly for this call but instead saw a StackOverflowError.</p>
Infrastructure - Bug #4303 (New): Fix potential bug where an object could be created and accepted...https://redmine.dataone.org/issues/43032014-03-06T22:42:59ZRoger Dahldahl@unm.edu
<p>Use SQL "select for update" (or similar) to lock the replication queue while a regular create takes place, so that a replication request cannot be accepted for an object at the same time as that object is being created. GMN already checks, in create(), that a replication request does not exist for the object. It also checks for the opposite, but there is a tiny window in which both could be created at the same time.</p>
Infrastructure - Bug #4300 (New): Create more appropriate "create" entry in the LogRecord for rep...https://redmine.dataone.org/issues/43002014-03-06T22:34:56ZRoger Dahldahl@unm.edu
<p>In the "create" log entry for objects created by the replication processor (replicas), IP address and subject should be for the CN that requested the replica. That information is currently not being captured when the replication request is received.</p>
Infrastructure - Feature #3762 (New): DataONE software download statisticshttps://redmine.dataone.org/issues/37622013-05-14T21:07:03ZRoger Dahldahl@unm.edu
<p>Gather and visualize statistics for downloads and/or installed user base for DataONE software components.</p>
Infrastructure - Task #3333 (New): Generalize mk_* scripts for host namehttps://redmine.dataone.org/issues/33332012-10-11T18:50:14ZDave Vieglaisdave.vieglais@gmail.com
<p>The various custom check_mk plugins currently depend on "-1" appearing in the hostname.</p>
<p>This should be generalized at some point.</p>
<p>This didn't work:</p>
<p>testhost=$(echo $thishost | sed s/-[a-z]*-([0-9])/-${host}-\1/)</p>
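<p>One likely reason the attempt above failed is that sed uses basic regular expressions by default, where an unescaped <code>(</code> is a literal character, so <code>\1</code> refers to no group; the unquoted expression is also subject to shell processing. The same substitution is sketched below in Python, where the pattern works as written. The function name and example hostnames are hypothetical, and <code>segment</code> stands in for whatever <code>${host}</code> held in the original.</p>

```python
import re

# Same pattern as the sed attempt: "-<letters>-<digit>", capturing the digit.
_HOST_PATTERN = re.compile(r"-[a-z]*-([0-9])")


def retarget_hostname(hostname: str, segment: str) -> str:
    """Replace the middle "-<letters>-" part of a hostname, keeping the trailing digit."""
    # A function replacement avoids any backslash interpretation inside segment.
    return _HOST_PATTERN.sub(
        lambda m: "-" + segment + "-" + m.group(1), hostname, count=1
    )


# Usage with made-up names:
# retarget_hostname("cn-dev-1.example.org", "stage")
```

<p>Capturing the digit rather than hard-coding "1" is what removes the "-1" dependency.</p>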
Infrastructure - Bug #3246 (New): Metacat returns 500 instead of 404 in some caseshttps://redmine.dataone.org/issues/32462012-09-11T02:00:24ZDave Vieglaisdave.vieglais@gmail.com
<p>For example:</p>
<p><a href="https://knb.ecoinformatics.org/knb/d1/mn/v1/bogus">https://knb.ecoinformatics.org/knb/d1/mn/v1/bogus</a></p>
<p>should return a 404 NotFound error, but instead returns a 500, ServiceFailure. </p>
<p>This is not an urgent issue, but should probably be cleaned up.</p>
Infrastructure - Task #3156 (New): Design Review: resource map indexing strategyhttps://redmine.dataone.org/issues/31562012-08-27T18:51:08ZSkye Roseboomsroseboo@dataone.unm.edu
<p>Mid to long term design review: </p>
<p>Current design is that resource maps are not indexed until all referenced objects are indexed or accounted for (archived/obsoleted).</p>
<p>This causes problems when showing links to documents that are already known, and it could prevent a resource map from ever being indexed if it references unknown ids.</p>
<p>The design works this way so that the data relationships are only introduced into the index when the entire resource map can be processed. An alternative would be to build or rebuild the relationships (resourceMap, describes, describedBy) on each update of every document. This would entail querying the index for resource maps that reference the current document and building the relationship links between the indexed documents for each event the index responds to. It would add a fair amount of rework to the index parsing API controller logic (XPathDocumentParser) and a fair bit of new processing load on the CN.</p>
Infrastructure - Task #2286 (New): Change Exceptions.InvalidToken to Exceptions.InvalidSessionhttps://redmine.dataone.org/issues/22862012-02-03T02:57:36ZRoger Dahldahl@unm.edu
Infrastructure - Task #2281 (New): Possibly update the Identity Management and Authenticated Sess...https://redmine.dataone.org/issues/22812012-02-01T05:10:33ZRoger Dahldahl@unm.edu
<p>I'm putting the link to some questions/comments I had about the Identity Management and Authenticated Session Management document in here, so that they don't get lost.</p>
<p><a href="http://epad.dataone.org/20120131-authn-authz-questions">http://epad.dataone.org/20120131-authn-authz-questions</a></p>
Infrastructure - Task #2147 (New): The Python stack does not support Unicode supplementary charac...https://redmine.dataone.org/issues/21472011-12-20T15:44:22ZRoger Dahldahl@unm.edu
<p>When given this identifier:</p>
<p>common-unicode-supplementary-escaped-</p>
Infrastructure - Task #1585 (New): add exec-maven-plugin to trigger python integration tests at v...https://redmine.dataone.org/issues/15852011-05-23T22:34:17ZRob Nahfrnahf@epscor.unm.edu
<p>see:</p>
<p><a href="http://steveberczuk.blogspot.com/2009/12/continuous-integration-of-python-code.html">http://steveberczuk.blogspot.com/2009/12/continuous-integration-of-python-code.html</a><br>
and<br>
<a href="http://mojo.codehaus.org/exec-maven-plugin/examples/example-exec-using-plugin-dependencies.html">http://mojo.codehaus.org/exec-maven-plugin/examples/example-exec-using-plugin-dependencies.html</a></p>
Infrastructure - Task #991 (New): implement HEAD /resolve/<guid>https://redmine.dataone.org/issues/9912010-10-11T17:41:17ZRob Nahfrnahf@epscor.unm.edu
<p>The resolution service supports the HEAD method:<br>
HEAD: Returns basic information about the resolve response document.<br>
Last-Modified: The date the resolve information was last updated for that identifier. This is helpful to clients that may cache resolve responses.</p>
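<p>The Last-Modified header for such a HEAD response could be produced as sketched below. The function name is hypothetical, and treating timezone-naive timestamps as UTC is an assumption of this example, not something the task specifies.</p>

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def resolve_head_headers(last_updated: datetime) -> dict:
    """Build the headers for a HEAD /resolve/<guid> response."""
    if last_updated.tzinfo is None:
        # Assumption: stored timestamps without a timezone are already UTC.
        last_updated = last_updated.replace(tzinfo=timezone.utc)
    # HTTP dates are always expressed in GMT, so normalize before formatting.
    stamp = last_updated.astimezone(timezone.utc)
    return {"Last-Modified": format_datetime(stamp, usegmt=True)}
```

<p>A caching client can send the same value back in an If-Modified-Since header on a later request.</p>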
Infrastructure - Task #817 (New): Integration testing: Make Java and Python stack return values e...https://redmine.dataone.org/issues/8172010-09-02T15:58:30ZRoger Dahldahl@unm.edu
<p>This is derived from the <a class="issue tracker-5 status-5 priority-5 priority-high3 closed" title="Task: post initial WBS for CCIT from the management meeting to the SVN/Collab site (Closed)" href="https://redmine.dataone.org/issues/8">#8</a> integration test, "Do the Java Stacks and Python stacks return the same thing for the same object?" This seems to focus on the get call, but I think all aspects of the stacks need to be compared and the differences fixed. A week may be a conservative estimate for this.</p>