Story #7668

Determine how indexing of data packages should work

Added by Bryce Mecum about 8 years ago. Updated about 6 years ago.

Status: New
Priority: Normal
Assignee:
Category: d1_indexer
Target version:
Start date: 2016-03-01
Due date:
% Done: 0%
Story Points:

Description

I've discovered (with Lauren's help) a strange requirement for how the resource maps for nested data packages have to be written. To get a nested data package indexed correctly in Solr, so that the 'resourceMap' field of the nested resource map is set to the parent resource map's PID, you have to add the appropriate @cito:documents@ statements in addition to the expected @ore:aggregates@ statements.

I expected the following to be sufficient (pardon the highly abstracted RDF; examples are linked below):

parent_resource_map#aggregation ore:aggregates child_resource_map
parent_resource_map#aggregation ore:aggregates metadata_object

However, I also had to add a @cito:documents@ statement from the parent resource map's metadata object to each resource map being nested:

parent_resource_map#aggregation ore:aggregates child_resource_map
parent_resource_map#aggregation ore:aggregates metadata_object

parent_metadata_object cito:documents child_resource_map

The documentation does not suggest this, and I found it confusing. A real-life example of what I expected to work is here: https://gist.github.com/amoeba/c7a6ba269c5a1f78db1d
What I actually had to insert is here: https://dev.nceas.ucsb.edu/knb/d1/mn/v2/object/resourceMap_urn:uuid:ab17b047-a341-4d06-b433-92eed90dacec

Is the @cito:documents@ statement really required, and is this the intended behavior? I've opened this issue in the hope that we can discuss it.

I suggest updating the API docs with whatever we decide, and hopefully that update will include example RDF for a nested data package.
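As a starting point for such an example, the abstract statements above can be spelled out as full triples. The following is a minimal sketch in plain Python, not taken from any DataONE library: the function names and the @urn:example:@ identifiers are hypothetical, and only the @ore:@ and @cito:@ namespace URIs are real. The @with_cito@ flag switches between the form I expected to work and the form I actually had to insert.

```python
# Namespace URIs for the two vocabularies used in the statements above.
ORE = "http://www.openarchives.org/ore/terms/"
CITO = "http://purl.org/spar/cito/"


def nested_package_triples(parent_map, parent_metadata, child_maps, with_cito=True):
    """Return (subject, predicate, object) triples for a parent resource map.

    All identifiers are passed in by the caller; nothing here is resolved
    against a real Member Node.
    """
    aggregation = parent_map + "#aggregation"
    triples = [(aggregation, ORE + "aggregates", parent_metadata)]
    for child in child_maps:
        triples.append((aggregation, ORE + "aggregates", child))
        if with_cito:
            # The extra statement that the indexer currently appears to need.
            triples.append((parent_metadata, CITO + "documents", child))
    return triples


def to_ntriples(triples):
    """Serialize the triples as N-Triples, one statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    print(to_ntriples(nested_package_triples(
        "urn:example:parent_resource_map",
        "urn:example:parent_metadata_object",
        ["urn:example:child_resource_map"],
    )))
```

Calling the function with @with_cito=False@ yields only the two @ore:aggregates@ statements, which makes it easy to compare how each variant is indexed.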


Related issues

Related to Infrastructure - Task #3156: Design Review: resource map indexing strategy New 2012-08-27

History

#1 Updated by Jing Tao over 7 years ago

  • Category set to d1_cn_index_processor
  • Assignee set to Jing Tao
  • Target version set to CCI-2.3.1

#2 Updated by Chris Jones over 7 years ago

My understanding is that there shouldn't be a requirement to add a

parent_metadata_object cito:documents child_resource_map

statement. To me, this documentation isn't correct:

??A data package in DataONE is composed of at least one science metadata document describing at least one data object with the relationships between them documented in a resource map document.??

See https://releases.dataone.org/online/api-documentation-v2.0/design/DataPackage.html#synopsis

The six requirements of a DataONE Data Package that are over and above the OAI-ORE spec are found here:

https://releases.dataone.org/online/api-documentation-v2.0/design/DataPackage.html#generating-resource-maps

None of these require a @cito:documents@ statement, so I think this documentation needs to be updated. Likewise, the @resourceMapSubprocessor@ code needs to be reviewed to remove any hard dependency on the presence of a @cito:documents@ statement. Currently, the @resourceMap@ field in Solr represents the @ore:aggregates@ statements, and I think those statements alone should be used to indicate participation in a resource map. In fact, I'd prefer having @aggregates@ and @isAggregatedBy@ Solr fields instead, to show both directions of the relationship in the index.

An example is where a data manager calls @MN.create()@ on a bunch of data objects, and then calls @MN.create()@ on the resource map with these objects aggregated. Later, when the data manager is able to get the metadata from a scientist, they may call @MN.create()@ for the science metadata document, and @MN.update()@ on the resource map, adding the science metadata to the aggregation, and inserting @cito:documents@ statements.
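To make the proposal and the scenario concrete, here is a rough sketch in plain Python. It is not the @resourceMapSubprocessor@ code, and the helper names and PIDs are hypothetical; @aggregates@ and @isAggregatedBy@ are the field names proposed above, not existing Solr fields. The sketch derives both fields from @ore:aggregates@ statements only, then shows a resource map before and after the science metadata is added. It does not model the @MN.create()@ and @MN.update()@ calls themselves or how PIDs change on update.

```python
from collections import defaultdict

ORE_AGGREGATES = "http://www.openarchives.org/ore/terms/aggregates"
CITO_DOCUMENTS = "http://purl.org/spar/cito/documents"


def index_aggregation(triples):
    """Build proposed 'aggregates' / 'isAggregatedBy' values per PID.

    Only ore:aggregates statements are consulted; cito:documents and any
    other statements are ignored.
    """
    docs = defaultdict(lambda: {"aggregates": [], "isAggregatedBy": []})
    for subject, predicate, obj in sorted(triples):
        if predicate != ORE_AGGREGATES:
            continue
        resource_map = subject.split("#", 1)[0]  # drop the '#aggregation' fragment
        docs[resource_map]["aggregates"].append(obj)
        docs[obj]["isAggregatedBy"].append(resource_map)
    return dict(docs)


def add_metadata(triples, map_pid, metadata_pid):
    """Return a revised triple set with the metadata aggregated and documenting the data."""
    aggregation = map_pid + "#aggregation"
    data_pids = [o for s, p, o in triples if s == aggregation and p == ORE_AGGREGATES]
    updated = set(triples)
    updated.add((aggregation, ORE_AGGREGATES, metadata_pid))
    updated.update((metadata_pid, CITO_DOCUMENTS, pid) for pid in data_pids)
    return updated


if __name__ == "__main__":
    # First version: data objects only, no science metadata yet.
    v1 = {
        ("urn:example:map#aggregation", ORE_AGGREGATES, "urn:example:data1"),
        ("urn:example:map#aggregation", ORE_AGGREGATES, "urn:example:data2"),
    }
    print(index_aggregation(v1))
    # Later version: metadata added along with cito:documents statements.
    v2 = add_metadata(v1, "urn:example:map", "urn:example:metadata")
    print(index_aggregation(v2))
```

In this sketch the data objects get the same @isAggregatedBy@ value in both versions, so their membership in the package does not depend on any @cito:documents@ statement being present.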

#3 Updated by Dave Vieglais over 7 years ago

  • Target version changed from CCI-2.3.1 to CCI-2.4.0

#4 Updated by Dave Vieglais almost 7 years ago

  • Category changed from d1_cn_index_processor to d1_indexer
  • Project changed from CN Index to Infrastructure
  • Milestone set to None

#5 Updated by Rob Nahf almost 7 years ago

  • Related to Task #3156: Design Review: resource map indexing strategy added

#6 Updated by Rob Nahf almost 7 years ago

  • Tracker changed from Task to Story

#7 Updated by Dave Vieglais about 6 years ago

  • Sprint set to Infrastructure backlog
