https://redmine.dataone.org/https://redmine.dataone.org/favicon.ico2018-03-30T04:11:11ZDataONE TasksDataONE API - Task #8529: Add field to Solr index that includes the obsolescence chainhttps://redmine.dataone.org/issues/8529?journal_id=301382018-03-30T04:11:11ZRob Nahfrnahf@epscor.unm.edu
<ul><li><strong>Related to</strong> <i><a class="issue tracker-5 status-1 priority-4 priority-default" href="/issues/8528">Task #8528</a>: Add MNRead.getVersions, CNRead.getVersions</i> added</li></ul> DataONE API - Task #8529: Add field to Solr index that includes the obsolescence chainhttps://redmine.dataone.org/issues/8529?journal_id=301402018-03-30T04:33:09ZRob Nahfrnahf@epscor.unm.edu
<p>Performance on these updates should be evaluated, especially if we are implementing a chain identifier in CN/MN Read. In that case it would be easier to expose the chain identifier and query for it as we currently do for seriesId. Appropriate queries would make light work of ordering the retrieved items client-side.</p>
<p>Since each update changes 2 records using two index tasks, the number of updates would be 2n, unless we get away from the "something in the system metadata changed" approach of creating one-size-fits-all, "overwrite from the systemMetadata" tasks (which might be a good idea anyway - how many updates would happen if you updated the access policy on the whole chain? n<sup>2</sup>, it would seem).</p>
<p>Updating a Resource Map would be even more labor-intensive, because it reindexes all of its members, which would then cascade down to updates on every version of all members of the package (twice). The potential for locking issues would seem to rise.</p>
DataONE API - Task #8529: Add field to Solr index that includes the obsolescence chain https://redmine.dataone.org/issues/8529?journal_id=30141 2018-03-30T16:28:45Z Rob Nahf rnahf@epscor.unm.edu
<p>A simpler implementation would be the creation of a dedicated index for chains. When a new PID needs to be added, just look up the chain record on the pid field using the obsoletes value, then append the new value if it's not already there. (d1.updates create 2 index tasks)</p>
<p>The record would simply be structured with an <code>id</code> field, and a <code>pid</code> field. The id would correspond to the chain-identifier, and could be opaque.</p>
<p>Querying would be something like <code>q=pid:myPidOfInterest</code>, and would return the entire record with all pids in the chain.</p>
<p>Ordering of elements for the CN would be more complicated than for the MN, because we couldn't rely on order of insert due to out-of-order synchronizations. Reindexing would also be troublesome on both MN and CN.</p>
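<p>The lookup-then-append step described above can be sketched with an in-memory stand-in for the dedicated index. This is only an illustration under stated assumptions: the <code>ChainIndex</code> class, its method names, and the use of a random hex string as the opaque chain identifier are all hypothetical, and a real implementation would issue Solr queries and updates instead of touching dictionaries. It also shows the ordering weakness noted above, since values are kept in arrival order.</p>

```python
import uuid
from typing import Dict, List, Optional


class ChainIndex:
    """In-memory stand-in for a dedicated chain index (hypothetical).

    Each record has an opaque `id` (the chain identifier) and a
    multi-valued `pid` field holding every PID in the chain.
    """

    def __init__(self) -> None:
        self._records: Dict[str, List[str]] = {}  # chain id -> pids
        self._by_pid: Dict[str, str] = {}         # pid -> chain id

    def find(self, pid: str) -> Optional[dict]:
        """Equivalent of q=pid:<pid>: return the whole chain record."""
        chain_id = self._by_pid.get(pid)
        if chain_id is None:
            return None
        return {"id": chain_id, "pid": list(self._records[chain_id])}

    def add(self, pid: str, obsoletes: Optional[str] = None) -> str:
        """Look up the chain via the obsoletes value, then append pid."""
        chain_id = self._by_pid.get(obsoletes) if obsoletes else None
        if chain_id is None:
            # No chain known for the obsoleted PID: reuse pid's chain
            # if it has one, otherwise start a new chain.
            chain_id = self._by_pid.get(pid) or uuid.uuid4().hex
        pids = self._records.setdefault(chain_id, [])
        for p in (obsoletes, pid):
            # Append only values that are not indexed yet. Merging two
            # existing chains is deliberately not handled in this sketch.
            if p and p not in self._by_pid:
                pids.append(p)
                self._by_pid[p] = chain_id
        return chain_id


index = ChainIndex()
index.add("v1")
index.add("v2", obsoletes="v1")
index.add("v3", obsoletes="v2")
index.add("v2", obsoletes="v1")  # repeated task, no duplicate value
print(index.find("v2")["pid"])
```

<p>Because values are appended in arrival order, an MN that always sees updates in sequence gets a correctly ordered list, while a CN receiving out-of-order synchronizations would not.</p>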
DataONE API - Task #8529: Add field to Solr index that includes the obsolescence chain https://redmine.dataone.org/issues/8529?journal_id=30143 2018-03-30T19:45:12Z Rob Nahf rnahf@epscor.unm.edu
<ul></ul><p>The order of values in multi-valued fields is guaranteed, but the order of fields in a record is not.</p>
<pre>
Yeah, Solr has a weaker guarantee.
Order is guaranteed to be maintained for values in a multi-valued field.
Order of different fields is not maintained.
-Yonik
http://lucidworks.com
</pre>
<p>We will need to be careful to either insert values in the proper order, or develop the capability to remove all values and rebuild the list if we need to insert in the middle. Alternatively, we could create multivalued obsoletes and obsoletedBy fields to allow the client (or a separate processor) to figure out the definitive order.</p>
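<p>If the client (or a separate processor) has to work out the definitive order, one possible approach is to follow the obsoletes links from the oldest version forward. The sketch below assumes the links have already been retrieved into a plain mapping from each PID to the PID it obsoletes; the function name <code>order_chain</code> and that input shape are illustrative assumptions, not part of any existing DataONE API.</p>

```python
from typing import Dict, List, Optional


def order_chain(obsoletes: Dict[str, Optional[str]]) -> List[str]:
    """Return the PIDs of one chain, oldest first.

    `obsoletes` maps each PID to the PID it obsoletes, or None for the
    first version. Insertion order of the mapping is irrelevant, so the
    result does not depend on the order in which records arrived.
    """
    # Invert the links: old PID -> the PID that obsoletes it.
    obsoleted_by = {old: new for new, old in obsoletes.items() if old is not None}

    pids = set(obsoletes) | set(obsoleted_by)
    # The head of the chain is the PID that obsoletes nothing we know of.
    heads = [p for p in pids if obsoletes.get(p) is None]
    if len(heads) != 1:
        raise ValueError(f"expected exactly one chain head, found {len(heads)}")

    ordered: List[str] = []
    current: Optional[str] = heads[0]
    while current is not None:
        ordered.append(current)
        current = obsoleted_by.get(current)

    # A fork (two PIDs obsoleting the same one) leaves PIDs unvisited.
    if len(ordered) != len(pids):
        raise ValueError("links do not form a single linear chain")
    return ordered


# Records arriving out of order still produce the same chain.
print(order_chain({"v3": "v2", "v1": None, "v2": "v1"}))
```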
DataONE API - Task #8529: Add field to Solr index that includes the obsolescence chain https://redmine.dataone.org/issues/8529?journal_id=30144 2018-03-30T20:13:34Z Dave Vieglais dave.vieglais@gmail.com
<ul></ul><p>This seems like a reasonable solution - create a separate core for identifiers. Given two fields:</p>
<pre>pid  required  multivalued  text
sid  optional  single       text
</pre>
<p>It would be trivial to:</p>
<p>a) get the obsolescence chain given a pid or a sid: <code>q=pid:id_to_find OR sid:id_to_find</code></p>
<p>b) determine if an identifier is a pid or a sid by examining the matching record</p>
<p>Reindexing could be driven from the content in postgres. Adding a new entry to the chain is a trivial update.</p>
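<p>As a rough illustration of points a) and b), the following builds the query parameters for such a core and classifies an identifier from the matching record. The helper names, the <code>wt=json</code> parameter choice, and the record shape (a multi-valued <code>pid</code> list and an optional single <code>sid</code>) are assumptions based on the two fields listed above; no core of this kind is known to exist yet. Identifiers are wrapped in double quotes so that characters such as <code>:</code> and <code>/</code> are not interpreted as query syntax.</p>

```python
from typing import Dict, Optional
from urllib.parse import urlencode


def quote_term(value: str) -> str:
    """Wrap an identifier in double quotes for use in a Solr query."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def chain_query_params(identifier: str) -> Dict[str, str]:
    """Parameters for: q=pid:id_to_find OR sid:id_to_find."""
    term = quote_term(identifier)
    return {"q": f"pid:{term} OR sid:{term}", "wt": "json"}


def classify(identifier: str, record: dict) -> Optional[str]:
    """Decide whether an identifier is a pid or a sid from its record."""
    if record.get("sid") == identifier:
        return "sid"
    if identifier in record.get("pid", []):
        return "pid"
    return None


params = chain_query_params("doi:10.1/x")
print(params["q"])
print(urlencode(params))  # query string to append to the core's select URL

# Hypothetical record, shaped like the two-field schema above.
record = {"sid": "series-1", "pid": ["doi:10.1/x", "doi:10.1/y"]}
print(classify("doi:10.1/x", record), classify("series-1", record))
```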
<p>If choosing this route then it may be worth considering other types of common relationships that could be included in the core.</p>
<p>Rob Nahf wrote:</p>
<blockquote>
<p>a simpler implementation would be the creation of a dedicated index for chains. When a new PID needs to be added, just look up on the pid field using the obsoletes value, then append a new value if it's not already there. (d1.updates create 2 index tasks)</p>
<p>The record would simply be structured with an <code>id</code> field, and a <code>pid</code> field. The id would correspond to the chain-identifier, and could be opaque.</p>
<p>Querying would be something like <code>q=pid:myPidOfInterest</code>, and would return the entire record with all pids in the chain.</p>
<p>Ordering of elements for the CN would be more complicated than for the MN, because we couldn't rely on order of insert due to out of order synchronizations. Reindexing would also be troublesome on both MN and CN.</p>
</blockquote>
DataONE API - Task #8529: Add field to Solr index that includes the obsolescence chain https://redmine.dataone.org/issues/8529?journal_id=30218 2018-04-10T05:21:52Z Rob Nahf rnahf@epscor.unm.edu
<p>Don't forget to look into the graph queries available in later versions of Solr:</p>
<p><a href="https://lucene.apache.org/solr/guide/7_3/other-parsers.html">https://lucene.apache.org/solr/guide/7_3/other-parsers.html</a></p>
<p>and perhaps, but not as likely to be useful:</p>
<p><a href="https://lucene.apache.org/solr/guide/7_3/graph-traversal.html">https://lucene.apache.org/solr/guide/7_3/graph-traversal.html</a></p>