CN Index: Issues https://redmine.dataone.org/ 2016-05-23T18:51:06Z DataONE Tasks Redmine
Bug #7817 (Closed): The processIndexTaskQueue method on IndexTaskProcessor doesn't pick up all th... https://redmine.dataone.org/issues/7817 2016-05-23T18:51:06Z Jing Tao tao@nceas.ucsb.edu
<p>The processIndexTaskQueue method is supposed to pick up all failed index tasks for reindexing. However, it only picks up the first few.</p>
Bug #7816 (Closed): After reindex cn-dev, there are more than 20,000 index tasks keep the "IN PRO... https://redmine.dataone.org/issues/7816 2016-05-23T18:43:01Z Jing Tao tao@nceas.ucsb.edu
<p>When indexing of a task finishes, the task should either be marked as failed or be removed from the index_task table. However, more than 20,000 index tasks remain in the "IN PROCESS" status.</p>
Task #7771 (Closed): Use multiple threads to index objects https://redmine.dataone.org/issues/7771 2016-05-04T23:15:43Z Jing Tao tao@nceas.ucsb.edu
<p>Worked on building the index with multiple threads. The sequential loop:<br><br>
while ((nextTask = getNextTask()) != null) {<br><br>
process(nextTask); }<br><br>
becomes a loop that submits each task to an executor:<br><br>
while ((nextTask = getNextTask()) != null) {<br><br>
executor.submit(nextTask);<br><br>
}<br><br>
Fortunately, there are no shared (static) class variables, so we don’t need to lock them.<br><br>
Handling resource maps has a race condition:<br><br>
R1: s1 documents d1<br><br>
R2: s1 documents d2<br><br>
Initially, the Solr index entry for s1 has no documents or resourceMap values.<br><br>
Sequential:<br><br>
After processing R1, the Solr index entry for s1 contains:<br><br>
documents d1<br><br>
resourceMap R1<br><br>
After processing R2, the Solr index entry for s1 contains:<br><br>
documents d1<br><br>
documents d2<br><br>
resourceMap R1<br><br>
resourceMap R2<br><br>
Concurrent:<br><br>
1. The threads handling R1 and R2 each read a copy of the s1 entry that has no documents or resourceMap information.<br><br>
2. Thread 1, handling R1, finishes first and sends its result to the Solr server:<br><br>
documents d1<br><br>
resourceMap R1<br><br>
3. Thread 2, handling R2, finishes later and sends its result to the Solr server, overwriting what thread 1 wrote. The final result is:<br><br>
documents d2<br><br>
resourceMap R2<br><br>
This is wrong: d1 and R1 are lost.<br><br>
Handle resource map objects sequentially? No.<br><br>
Proposed solution:<br><br>
1. Maintain a set containing the ids of the relevant objects (e.g. s1 and d1) while a resource map is being processed.<br><br>
2. Before processing a resource map, check whether any of its relevant ids are in the set. If they are, wait and try again later (up to a maximum number of attempts); otherwise, add those ids to the set and start processing.<br><br>
3. When processing is done, remove those ids from the set.<br><br>
Implementation options: ConcurrentSkipListSet vs. HashSet + lock vs. HashSet + synchronized</p>
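<p>The three steps of the proposed solution could be sketched roughly as follows. This is only an illustration of the "HashSet + synchronized" option, not the code that was committed: the class name ResourceMapIdLock, its methods, and the retry parameters are all hypothetical.</p>

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper (not from the ticket): ids that a resource-map task is currently updating. */
class ResourceMapIdLock {
    private final Set<String> busyIds = new HashSet<>();

    /** Step 2: claim every id atomically, or claim nothing if any id is already busy. */
    synchronized boolean tryClaim(Collection<String> ids) {
        for (String id : ids) {
            if (busyIds.contains(id)) {
                return false;
            }
        }
        busyIds.addAll(ids);
        return true;
    }

    /** Step 3: give the ids back once the resource map has been processed. */
    synchronized void release(Collection<String> ids) {
        busyIds.removeAll(ids);
    }

    /** Retries the claim up to maxAttempts times; returns false if the task never ran. */
    boolean runExclusively(Collection<String> ids, Runnable task, int maxAttempts, long waitMillis) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (tryClaim(ids)) {
                try {
                    task.run();
                } finally {
                    release(ids); // always release, even if the task throws
                }
                return true;
            }
            try {
                Thread.sleep(waitMillis); // another task holds one of the ids; wait and retry
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        ResourceMapIdLock lock = new ResourceMapIdLock();
        boolean ran = lock.runExclusively(Arrays.asList("s1", "d1"),
                () -> System.out.println("processing R1"), 5, 100L);
        System.out.println("R1 processed: " + ran);
    }
}
```

<p>A plain synchronized HashSet is used here because the check in step 2 spans several ids: a ConcurrentSkipListSet makes each individual add atomic, but not the combined "none of these ids is present, so add them all" operation, which would need extra rollback logic.</p>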
Task #7770 (Closed): Profile the index process https://redmine.dataone.org/issues/7770 2016-05-04T23:12:30Z Jing Tao tao@nceas.ucsb.edu
<p>Solr Index profiling</p>
<p>EML: <br>
1. Create an index task queue: 328 (the first run is very high; later runs drop dramatically, generally to about 10 to 30) <br>
2. Fetch a task from the queue: 10 <br>
3. Total processing time: 1076<br><br>
- SolrIndexService.processObject - 536<br><br>
* process system metadata - 66 (each field takes 0 to 1, but the id took 32)<br><br>
* process science metadata - 255 (most fields take 2 to 4; abstract 12, keyword 10, title 10, project 8, CommonRootSolrField("attribute") 32, text 11, southBoundCoord 9, northBoundCoord 7, west 10, east 8, begin date 10, end date 7)<br><br>
* process by BaseReprocessSubprocessor - 71 (series id and resource map)<br><br>
* process merging - 141<br><br>
- sending to the Solr server - 532 </p>
<p>A resourceMap: <br>
1. Create an index task queue: 233 <br>
2. Fetch a task from the queue: 12<br>
3. Total processing time: 1652<br><br>
- SolrIndexService.processObject - 980<br><br>
* process system metadata - 38<br><br>
* process by ResourceMapSubprocessor.processResourceMap - 416<br><br>
ResourceMapFactory.buildResourceMap() creates a ResourceMap from the Document, 95<br><br>
ResourceMap.getAllDocumentIDs() gets the ids referenced in the ResourceMap, 134<br><br>
ResourceMapSubprocessor.clearSidChain() removes the obsoletes chain from the Solr index, 99<br><br>
HttpService.getDocumentsById() gets the existing Solr docs of the referenced ids, 26<br><br>
* process by RdfXmlSubprocessor.processDocument() - 400<br><br>
RdfXmlSubprocess.process gets a dataset from the triple store service, 116<br><br>
RdfXmlSubprocess.process adds the ont-model, 21<br><br>
RdfXmlSubprocess.process processes the fields in total, 238 (prov stuff)<br><br>
* process merging - 125<br><br>
- sending to the Solr server - 661</p>
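<p>The task does not say how these per-stage figures were collected. One simple way to gather comparable numbers is to wrap each stage in a timer, as in the sketch below. StageTimer and the stage names are hypothetical and are not part of the indexer code.</p>

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical helper: accumulates elapsed time per named stage (not thread-safe). */
class StageTimer {
    private final Map<String, Long> nanosByStage = new LinkedHashMap<>();

    /** Runs the work, records its elapsed time under the stage name, and returns its result. */
    <T> T time(String stage, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            // Sum repeated stages so a loop over many tasks yields one total per stage.
            nanosByStage.merge(stage, System.nanoTime() - start, Long::sum);
        }
    }

    Map<String, Long> nanosByStage() {
        return nanosByStage;
    }

    public static void main(String[] args) {
        StageTimer timer = new StageTimer();
        String merged = timer.time("process merging", () -> "merged-doc");
        timer.time("sending to the Solr server", () -> merged.length());
        timer.nanosByStage().forEach((stage, nanos) ->
                System.out.println(stage + ": " + nanos / 1_000_000 + " ms"));
    }
}
```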
Story #7769 (Closed): Improve the performance on solr index https://redmine.dataone.org/issues/7769 2016-05-04T23:10:44Z Jing Tao tao@nceas.ucsb.edu
<p>Currently it takes more than one week to reindex our production CNs. We need to improve the performance.</p>
Task #7752 (Closed): Add gmx:Anchor path for Solr indexing of the http://www.isotc211.org/2005/gm... https://redmine.dataone.org/issues/7752 2016-04-26T22:53:40Z Mark Servilla mark.servilla@gmail.com
<p>Some groups (specifically BCO-DMO) use the XPath "/gmd:identificationInfo/gmd:MD_DataIdentification/gmd:citation/gmd:CI_Citation/gmd:title/gmx:Anchor" to define a data package title in the metadata format <a href="http://www.isotc211.org/2005/gmd-noaa">http://www.isotc211.org/2005/gmd-noaa</a>. DataONE does not currently index this field, but should do so to pick up the correct data package title.</p>
<p>The current DataONE title XPath for <a href="http://www.isotc211.org/2005/gmd-noaa">http://www.isotc211.org/2005/gmd-noaa</a> is "/gmd:identificationInfo/gmd:MD_DataIdentification/gmd:citation/gmd:CI_Citation/gmd:title/gco:CharacterString"</p>
<p>See the attached files as examples (555889.xml for gmx:Anchor and {2C845538-0EFA-46BC-A303-50384267FEA9}.xml for gco:CharacterString).</p>
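<p>The intended lookup (try gco:CharacterString first, then fall back to gmx:Anchor) can be illustrated with the standard javax.xml.xpath API. This is a sketch under assumptions, not the DataONE indexer configuration: the class TitleXPathSketch, its methods, and the minimal sample document are made up for illustration. The path uses the ISO 19139 element name gmd:identificationInfo and a leading "//" so that it matches under any root element.</p>

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical sketch: extract a title from gco:CharacterString or, failing that, gmx:Anchor. */
class TitleXPathSketch {
    private static final Map<String, String> NS = new HashMap<>();
    static {
        NS.put("gmd", "http://www.isotc211.org/2005/gmd");
        NS.put("gco", "http://www.isotc211.org/2005/gco");
        NS.put("gmx", "http://www.isotc211.org/2005/gmx");
    }

    private static final String TITLE = "//gmd:identificationInfo/gmd:MD_DataIdentification"
            + "/gmd:citation/gmd:CI_Citation/gmd:title/";

    static String extractTitle(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // required for prefixed XPath steps to match
            Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    return NS.getOrDefault(prefix, XMLConstants.NULL_NS_URI);
                }
                public String getPrefix(String uri) { return null; }
                public Iterator<String> getPrefixes(String uri) { return Collections.emptyIterator(); }
            });
            // evaluate() returns an empty string when nothing matches.
            String title = xpath.evaluate(TITLE + "gco:CharacterString", doc).trim();
            if (title.isEmpty()) {
                title = xpath.evaluate(TITLE + "gmx:Anchor", doc).trim();
            }
            return title;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    /** Builds a minimal, made-up metadata document around the given title child element. */
    static String sample(String titleChild) {
        return "<gmd:MD_Metadata xmlns:gmd='http://www.isotc211.org/2005/gmd'"
                + " xmlns:gco='http://www.isotc211.org/2005/gco'"
                + " xmlns:gmx='http://www.isotc211.org/2005/gmx'>"
                + "<gmd:identificationInfo><gmd:MD_DataIdentification><gmd:citation>"
                + "<gmd:CI_Citation><gmd:title>" + titleChild + "</gmd:title></gmd:CI_Citation>"
                + "</gmd:citation></gmd:MD_DataIdentification></gmd:identificationInfo>"
                + "</gmd:MD_Metadata>";
    }

    public static void main(String[] args) {
        System.out.println(extractTitle(sample("<gmx:Anchor>Anchor title</gmx:Anchor>")));
        System.out.println(extractTitle(sample("<gco:CharacterString>Plain title</gco:CharacterString>")));
    }
}
```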