Story #3262: Object provenance needs to be supported across components - Infrastructure - DataONE Tasks

Story #3262

Object provenance needs to be supported across components

Added by Chris Jones over 12 years ago. Updated about 7 years ago.

Status: Closed

Priority: Normal

Assignee: Chris Jones

Category:

Target version:

Start date:

Due date:

% Done: 100%

Story Points:

Sprint:

Description

One of DataONE's goals is to support reproducible science. The "Provenance Working Group":http://www.dataone.org/working_groups/scientific-workflows-and-provenance-working-group (led by Bertram Ludascher and Paolo Missier) has been working toward this by developing a provenance model that traces derivations of objects through scientific workflow systems (VisTrails, Kepler, etc.). The model is now called D-PROV (formerly D-OPM), and it aims to be compatible, by extension, with the more generic W3C PROV model.

The working group has a use case in which user Alice has a dataset D1 that is processed through a workflow WF1 to produce a derivative dataset D2 with a provenance trace Pr1. These products are then uploaded to the DataONE system along with their system and science metadata (SM1, MD1), linked together as a collection using a resource map RM1. A second user, Bob, wants to find datasets that were derived by the WF1 workflow. After finding D2 in the DataONE system, Bob wants to use D2 as an input to a new workflow WF2 to produce a new derived dataset, D3. The data (D3), science metadata (MD3), system metadata (SM3), and provenance (Pr2) artifacts produced during WF2 are then uploaded to the DataONE system as a new collection using a resource map (RM2).
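To make the use case concrete, the derivation chain can be sketched as a handful of triples. This is only an illustration: the identifiers (D1, WF1, and so on) come from the story, the predicates are W3C PROV-O terms, and the tuple layout and helper names are assumptions. Here WF1 and WF2 stand in for the workflow executions, which D-PROV may well model differently.

```python
# Illustrative sketch only: a flat list of (subject, predicate, object) tuples.
# The predicate names are W3C PROV-O terms; everything else is hypothetical.
PROV_USED = "prov:used"
PROV_GENERATED_BY = "prov:wasGeneratedBy"
PROV_DERIVED_FROM = "prov:wasDerivedFrom"

TRIPLES = [
    ("WF1", PROV_USED, "D1"),
    ("D2", PROV_GENERATED_BY, "WF1"),
    ("D2", PROV_DERIVED_FROM, "D1"),
    ("WF2", PROV_USED, "D2"),
    ("D3", PROV_GENERATED_BY, "WF2"),
    ("D3", PROV_DERIVED_FROM, "D2"),
]


def generated_by(workflow, triples):
    """Return the datasets directly generated by the given workflow run."""
    return sorted(
        s for s, p, o in triples if p == PROV_GENERATED_BY and o == workflow
    )


def ancestors(dataset, triples):
    """Return every dataset that `dataset` derives from, directly or transitively."""
    found = set()
    frontier = [dataset]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if s == current and p == PROV_DERIVED_FROM and o not in found:
                found.add(o)
                frontier.append(o)
    return sorted(found)
```

Bob's search corresponds to `generated_by("WF1", TRIPLES)`, and `ancestors("D3", TRIPLES)` walks the chain back from his new dataset to Alice's original input.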

A serialization of the provenance traces produced by the workflow engine (in the use case, VisTrails) is needed. For now, this trace will be an RDF/XML document compatible with the W3C PROV model, using D-PROV extensions specific to provenance concepts in environmental science (as opposed to, say, provenance associated with corporate mergers). Two of the VisTrails developers (David Koop and Fernando Seabra Chirigati) will produce the provenance serialization, and they will need an easy way to insert the trace into the resource maps (RM1, RM2) for the two collections. This will require new predicates in the resource map that allow for triples that "provide provenance information for" the dataset in the data package. The Python and Java libclient tools will need methods for inserting these provenance triples.
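A minimal sketch of what such an insertion method might look like follows. It is not the libclient API: the class, its method names, the `example.org` URIs, and especially the `providesProvenanceFor` predicate are hypothetical, since the story calls for new predicates without naming them. Only the ORE `aggregates` term is taken from the OAI-ORE vocabulary.

```python
# Sketch under stated assumptions; not the DataONE libclient interface.
ORE_AGGREGATES = "http://www.openarchives.org/ore/terms/aggregates"
# Hypothetical predicate: the story does not name the new predicate.
PROVIDES_PROVENANCE_FOR = "http://example.org/terms/providesProvenanceFor"


class ResourceMapSketch:
    """Toy resource map holding URI-only triples for one data package."""

    def __init__(self, map_uri):
        self.aggregation_uri = map_uri + "#aggregation"
        self.members = set()
        self.triples = []

    def add_member(self, object_uri):
        """Aggregate an object (data, metadata, or provenance trace)."""
        if object_uri not in self.members:
            self.members.add(object_uri)
            self.triples.append((self.aggregation_uri, ORE_AGGREGATES, object_uri))

    def add_provenance(self, trace_uri, dataset_uri):
        """Record that a provenance trace describes a dataset in this package."""
        for uri in (trace_uri, dataset_uri):
            if uri not in self.members:
                raise ValueError("not a member of this package: " + uri)
        self.triples.append((trace_uri, PROVIDES_PROVENANCE_FOR, dataset_uri))

    def to_ntriples(self):
        """Serialize as N-Triples (all terms here are URIs, so no escaping)."""
        return "".join("<%s> <%s> <%s> .\n" % t for t in self.triples)
```

For the first collection in the use case, a caller would add D2 and Pr1 as members of RM1 and then call `add_provenance` with the trace and dataset URIs. Rejecting non-members is one possible design choice; it keeps the provenance triple from pointing outside the package.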

To enable enhanced search based on provenance information, we first need to create indexed attributes in the CN Solr index that are parsed out of the provenance serializations. These attributes will be similar to "derivedBy" and "derives", or whichever others in the provenance model we think are most useful for search.
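As a rough sketch of the parsing step, the following pulls `prov:wasDerivedFrom` statements out of a small RDF/XML trace and turns them into per-object field values. The sample trace, the `example.org` URIs, the function names, and the field names `derivedFrom` and `derives` are all assumptions for illustration. It also handles only one of the many shapes RDF/XML permits, so a real indexer would more likely rely on a full RDF parser.

```python
import xml.etree.ElementTree as ET

RDF_NS = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
PROV_NS = "{http://www.w3.org/ns/prov#}"

# Hypothetical trace; the URIs are placeholders, not DataONE identifiers.
SAMPLE_TRACE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:prov="http://www.w3.org/ns/prov#">
  <prov:Entity rdf:about="https://example.org/object/D2">
    <prov:wasDerivedFrom rdf:resource="https://example.org/object/D1"/>
  </prov:Entity>
  <prov:Entity rdf:about="https://example.org/object/D3">
    <prov:wasDerivedFrom rdf:resource="https://example.org/object/D2"/>
  </prov:Entity>
</rdf:RDF>"""


def extract_derivations(rdf_xml):
    """Return (derived, source) URI pairs from top-level node elements."""
    pairs = []
    root = ET.fromstring(rdf_xml)
    for node in root:
        subject = node.get(RDF_NS + "about")
        if subject is None:
            continue  # blank nodes and other forms are ignored in this sketch
        for prop in node:
            if prop.tag == PROV_NS + "wasDerivedFrom":
                source = prop.get(RDF_NS + "resource")
                if source:
                    pairs.append((subject, source))
    return pairs


def build_index_fields(pairs):
    """Map each object URI to hypothetical multi-valued index fields."""
    index = {}
    for derived, source in pairs:
        index.setdefault(derived, {"derivedFrom": [], "derives": []})
        index.setdefault(source, {"derivedFrom": [], "derives": []})
        index[derived]["derivedFrom"].append(source)
        index[source]["derives"].append(derived)
    return index
```

Storing both directions on each object lets a search start from either end of a derivation, at the cost of updating the source object's entry whenever a new derivative is indexed.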

The ONEMercury interface needs to be modified to enable search based on provenance attributes. This will likely be done in a development version of the code, so a development environment with at least one CN and one MN, accessible to the ProvWG members, needs to be deployed so they can reliably interact with the DataONE API.