Task #6843
Update the prov instance of the RdfXmlSubprocessor to index renamed and inverse provenance properties
30%
Description
In the "sem-prov-design issue 66":https://github.com/DataONEorg/sem-prov-design/issues/66 we have renamed the provenance-based Solr fields to include 'prov_' as a prefix, and have added new fields. See also "issue 99":https://github.com/DataONEorg/sem-prov-design/issues/99 and "issue 100":https://github.com/DataONEorg/sem-prov-design/issues/100.
Modify the provRdfXmlSubprocessor bean to handle the renaming scheme, the new fields, and the inverse fields determined to be useful. Also, add these fields as static Solr fields so we can remove the '_sm' suffixes from the names.
Associated revisions
Change indexing of provenance-related fields to use field names with a 'prov_' prefix, and drop the dynamic field suffixes. Also change hadExecution to two fields: prov_wasExecutedByExecution and prov_wasExecutedByUser. refs #6843
Rename hadExecution bean to prov.wasExecutedByExecution and prov.wasExecutedByUser. refs #6843
Modify the provRdfXmlSubprocessor bean to index the prov_hasDerivations field, which indexes science metadata documents that describe data entities that were sources of other data entities. Also fix the SPARQL query logic in the prov_hasSources bean. refs #6843
Copy the provenance application context beans from d1_cn_index_processor to pick up the most recent field name changes in the SPARQL queries that get deployed. refs #6843
Pilot error - I missed adding this file to the last commit. refs #6843
Update the fixed SPARQL query in the buildout. refs #6843
Fix attribute name typo: multiValued, not multivalued. refs #6843
Fix attribute name typo: multiValued, not multivalued. refs #6843
Uncomment the provRdfXmlSubprocessor. refs #6843
Add minor debugging to help track down identifiers referenced in resource maps not found in the Solr index. refs #6843
Add the application-context-annotator bean definition into the tests to get the index processor tests working on jenkins. refs #6843
Modify the RdfXmlSubprocessor to use mergeWithIndexedDocument() rather than mergeDocs() to merge new content with content already in the index. The SolrIndexService iterates through the results provided by processDocument(), and so will index all of the document identifiers extracted from the triple statements found in the RDF/XML document. I've removed mergeDocs() (not needed now) and fleshed out mergeWithIndexedDocument(). refs #6843
Minor changes to indexing prov_used and prov_wasGeneratedBy. refs #6843
After troubleshooting issues that Lauren pointed out:
- Remove the prov_wasGeneratedBy Solr field (in favor of just prov_wasGeneratedByExecution and prov_wasGeneratedByProgram)
- Add the prov_generated field as the inverse, where a program generates an entity
- Update the testProvenanceFields() test to reflect the above
- For now, ignore the testInsertProvResourceMap() test due to HZ conflicts
refs #6843
- Remove the prov_wasGeneratedBy Solr field (in favor of just prov_wasGeneratedByExecution and prov_wasGeneratedByProgram)
- Add the prov_generated field as the inverse, where a program generates an entity
refs #6843
We've found a bug during index processing that causes documents to not be indexed because of a mismatch between a 'beginDate' or 'endDate' field being added, and an existing 'beginDate' or 'endDate' field. This is a temporary fix that converts potential date strings to Date objects, and compares them. If they match, it is a dupe, and we don't add the field. Otherwise we add it. See and refs #6843
Update the v1_1 Solr schema file with the prov_ field changes.
refs #6843
Update the other various versions of the Solr schema to reflect the prov_ changes, and add prov_generated and prov_used to the query field descriptions file. Again, add all prov_* to the list of default return fields for Solr queries.
refs #6843
Fix merging issues in the RdfXmlSubprocessor where, if mergeDocs() is called twice, and the existing map is larger than the pending map, we don't drop all of the existing documents, but rather add them to the merged map. Also, implement mergeWithIndexedDocument() in the same way as the AnnotatorSubprocessor (note that by following the API, we are cumulatively merging all existing docs, which seems a bit inefficient). refs #6843
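The merging behavior described in the last revision above can be sketched roughly as follows. This is only an illustration, not the actual RdfXmlSubprocessor code: the class name MergeSketch, the method signatures, and the use of plain maps (document id -> field name -> values) in place of the project's Solr document type are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MergeSketch {

    /**
     * Hypothetical stand-in for merging two Solr documents, each modeled as
     * field name -> values. Values from both sides are kept, without duplicates.
     */
    public static Map<String, List<String>> mergeFields(
            Map<String, List<String>> existing,
            Map<String, List<String>> pending) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Map<String, List<String>> doc : List.of(existing, pending)) {
            for (Map.Entry<String, List<String>> field : doc.entrySet()) {
                List<String> values =
                        merged.computeIfAbsent(field.getKey(), k -> new ArrayList<>());
                for (String value : field.getValue()) {
                    if (!values.contains(value)) {
                        values.add(value);
                    }
                }
            }
        }
        return merged;
    }

    /**
     * Hypothetical mergeDocs(): every existing document is carried over,
     * whether or not the pending map mentions it, and documents present in
     * both maps have their fields merged.
     */
    public static Map<String, Map<String, List<String>>> mergeDocs(
            Map<String, Map<String, List<String>>> pending,
            Map<String, Map<String, List<String>>> existing) {
        Map<String, Map<String, List<String>>> merged = new LinkedHashMap<>();
        // Start from the existing documents so that none of them are dropped.
        for (Map.Entry<String, Map<String, List<String>>> e : existing.entrySet()) {
            merged.put(e.getKey(), mergeFields(e.getValue(), Map.of()));
        }
        // Add pending documents, merging fields where the id already exists.
        for (Map.Entry<String, Map<String, List<String>>> p : pending.entrySet()) {
            merged.merge(p.getKey(), mergeFields(p.getValue(), Map.of()),
                    MergeSketch::mergeFields);
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Map<String, List<String>>> existing = new LinkedHashMap<>();
        existing.put("A", Map.of("prov_used", List.of("p1")));
        existing.put("B", Map.of("prov_generated", List.of("d1")));

        Map<String, Map<String, List<String>>> pending = new LinkedHashMap<>();
        pending.put("A", Map.of("prov_used", List.of("p1", "p2")));

        Map<String, Map<String, List<String>>> merged = mergeDocs(pending, existing);
        System.out.println(merged.keySet());                 // [A, B]
        System.out.println(merged.get("A").get("prov_used")); // [p1, p2]
    }
}
```

In this sketch the existing map is larger than the pending map, and document B survives the merge even though the pending map never mentions it.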
History
#1 Updated by Chris Jones almost 10 years ago
- Status changed from New to In Progress
- % Done changed from 0 to 30
#2 Updated by Chris Jones almost 10 years ago
This is finished, and we are now indexing fields with prov_ prefixes.
#3 Updated by Chris Jones almost 10 years ago
Lauren pointed out that the prov_used and prov_wasGeneratedBy Solr fields should be populated by the Program that used or generated the Entity, rather than the Execution, since Executions aren't first-class objects per se in DataONE. I'm changing the SPARQL queries for these fields to reflect this.
#4 Updated by Chris Jones over 9 years ago
In testing MsTMIP test document indexing, we've had some documents that fail to index because of a mismatch between the beginDate value being added, and existing beginDate values already in Solr (same goes for endDate). The core of the issue is that we are using date formats like:
yyyy-MM-ddTHH:mm:ss.SSSZ
Solr's native format is:
yyyy-MM-ddTHH:mm:ssZ
although the Solr schema states that the (optional) milliseconds are allowed.
It looks to me like Solr is reverting to its native format for any date string value with zeros as milliseconds, like:
1900-01-01T00:00:00.000Z
When this happens, the Solr indexed value is:
1900-01-01T00:00:00Z
When subprocessors (such as the AnnotatorSubprocessor) attempt to merge these fields, the values they are trying to insert (with millisecond precision) don't match the existing values (seconds precision), so the subprocessor treats the value as new and attempts to add it. In doing so, it attempts to add multiple values to a non-multivalued field, and the insertion fails.
As a temporary hack, I've updated the AnnotatorSubprocessor to compare these fields as Date objects instead of strings. It's not a long term solution, but may fix our immediate issues. Need to discuss with Ben.
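To illustrate the mismatch, here is a minimal sketch of comparing the two values as dates rather than as strings. It is not the code that went into the AnnotatorSubprocessor: the class DateFieldCompare and the method sameValue() are hypothetical names, and the sketch uses java.time.Instant where the actual fix may well use java.util.Date.

```java
import java.time.Instant;
import java.time.format.DateTimeParseException;

public class DateFieldCompare {

    /**
     * Returns true when both strings parse as ISO-8601 instants and denote
     * the same point in time. If either string is not a parseable date,
     * falls back to plain string equality.
     */
    public static boolean sameValue(String existing, String pending) {
        try {
            return Instant.parse(existing).equals(Instant.parse(pending));
        } catch (DateTimeParseException e) {
            // Not a date string: compare as ordinary text.
            return existing.equals(pending);
        }
    }

    public static void main(String[] args) {
        String pending = "1900-01-01T00:00:00.000Z"; // value being added
        String indexed = "1900-01-01T00:00:00Z";     // value as stored in Solr

        System.out.println(pending.equals(indexed));     // false
        System.out.println(sameValue(indexed, pending)); // true
    }
}
```

With a check like this, a value that differs only by zero milliseconds would be treated as a duplicate and skipped, while a value with non-zero milliseconds would still compare as different.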
#5 Updated by Ben Leinfelder over 9 years ago
Seems like a fine fix to compare Dates instead of strings as long as SOLR isn't discarding/rounding any non-000 millisecond values.
#6 Updated by Skye Roseboom about 9 years ago
Solr seems to not allow trailing 0 in the millisecond value and is intentionally truncating:
http://lucene.apache.org/solr/5_2_1/solr-core/org/apache/solr/schema/TrieDateField.html
#7 Updated by Dave Vieglais over 7 years ago
- Project changed from CN Index to Infrastructure
- Category changed from d1_cn_index_processor to d1_indexer
- Target version changed from CCI-2.0.0 to CCI-2.4.0
- Milestone set to None