h1. DataPackaging

This is the top-level page for discussion and in-progress documentation on data packaging. The idea is to work through the reasons for and against particular approaches as data packaging examples are developed. Final documentation will be moved into the overall DataONE documentation on mule1. This page is a place to comment, annotate, record the things that did not work as expected, and maintain our shared understanding of why we do things the way we do.

h2. Introduction

A "data package" is a set of one or more "data" objects and "science metadata" objects that together represent a scientifically useful unit of information. To properly interpret, preserve, and use the opaque set of bytes found in a data object, users and their software agents need access to the science metadata and system metadata that describe that data object. A data package records the conceptual relationships among its components: which objects are described by which metadata documents, and the role each plays in that description.

There are two main issues that need to be addressed:

* What is an appropriate mechanism for documenting the relationships between objects that form a data package?
* What is the best mechanism for a client to retrieve a data package?

h2. Documenting Object Relationships

Prior to July 21, 2011, the relationships between data and metadata within a package were recorded using the "describes" and "describedBy" elements of the System Metadata. That approach was too simplistic, however, to address the common situation where multiple metadata and data objects form a data package. The basic problem was that, given a package constructed of metadata A and B and data C and D, where A describes C and B describes D, there was no way to indicate that the pairs A,C and B,D were components of the same data package.
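The limitation can be illustrated with a small sketch. The dictionary below is a hypothetical stand-in for the "describes" entries in System Metadata, not the actual schema. Grouping objects by following those links yields two separate groups, and nothing in the data says that both belong to one package.

```python
# Hypothetical stand-in for the "describes" entries in System Metadata:
# metadata A describes data C, metadata B describes data D.
describes = {"A": ["C"], "B": ["D"]}


def linked_groups(describes):
    """Group objects that are connected through describes/describedBy links."""
    groups = []
    for metadata_id, data_ids in describes.items():
        members = {metadata_id, *data_ids}
        # Merge with any existing group that shares a member.
        overlapping = [g for g in groups if g & members]
        for g in overlapping:
            members |= g
            groups.remove(g)
        groups.append(members)
    return groups


groups = linked_groups(describes)
# Two unrelated pairs come back; the package {A, B, C, D} cannot be recovered.
print(sorted(sorted(g) for g in groups))
```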

To resolve this issue, another class of object, the "resource map", was introduced. The general plan is that resource maps will be constructed using OAI-ORE as the standard, and that the resource map will replace the relationships defined in the system metadata of the objects. There will be one resource map for each data package.

In ORE terms, the data and metadata objects would be referenced as aggregated objects which participate in an aggregation that captures the list of objects in the resource map.
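As a rough illustration of that structure, the sketch below builds the triples of a minimal resource map using the ORE terms "describes" and "aggregates". It uses only the Python standard library and plain tuples rather than an RDF library. The example.org URIs and the "#aggregation" naming are assumptions made for the example, not DataONE conventions.

```python
ORE = "http://www.openarchives.org/ore/terms/"


def build_resource_map(map_uri, member_uris):
    """Return (subject, predicate, object) triples for a minimal resource map."""
    # Assumed convention: the aggregation is named by a fragment on the map URI.
    aggregation_uri = map_uri + "#aggregation"
    triples = [(map_uri, ORE + "describes", aggregation_uri)]
    for uri in member_uris:
        triples.append((aggregation_uri, ORE + "aggregates", uri))
    return triples


def to_ntriples(triples):
    """Serialise URI-only triples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


# Hypothetical URIs for metadata A, B and data C, D.
members = [f"https://cn.example.org/object/{name}" for name in "ABCD"]
triples = build_resource_map("https://cn.example.org/object/package-1", members)
print(to_ntriples(triples))
```

All four objects now hang off a single aggregation, which is the package-level grouping that the pairwise System Metadata links could not express.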

There are two issues with using ORE resource maps:

* ORE does not provide semantics for defining relationships between objects; it states only that the contained objects form a cohesive unit.
* ORE requires that the identifiers of aggregated objects be "protocol-based URIs", which is more restrictive than the identifiers allowed in DataONE.

h3. Defining Inter-object Relationships

Since ORE does not provide semantics for describing inter-object relationships (e.g. metadata A describes data C), it is necessary to use another standard vocabulary for these definitions.

DataCite provides concepts semantically equivalent to what is required; however, these are expressed as enumerations in the DataCite XML schema rather than as RDF terms.

The document https://docs.google.com/document/d/1paJgvmCMu3pbM4in6PjWAKO0gP-6ultii3DWQslygq4/edit?authkey=CMeV3tgF&hl=en_GB referenced from http://opencitations.wordpress.com/2011/06/30/datacite2rdf-mapping-datacite-metadata-scheme-terms-to-ontologies-2/ provides a mapping between DataCite concepts and existing RDF terms. The suggestion there is to use CiTO terms ( http://purl.org/spar/cito ) such as "documents" and "isDocumentedBy" to indicate the relationships. The architecture documentation has been updated to reflect this suggestion.
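A minimal sketch of how those CiTO terms could express the A/C and B/D relationships from the earlier example follows. The triples are plain tuples, the example.org URIs are hypothetical, and emitting both directions of the relationship is an assumption of this sketch rather than a stated requirement.

```python
CITO = "http://purl.org/spar/cito/"


def documentation_triples(metadata_uri, data_uris):
    """Return CiTO triples linking one metadata object to the data it documents."""
    triples = []
    for data_uri in data_uris:
        # Forward and inverse relationships are both emitted (an assumption here).
        triples.append((metadata_uri, CITO + "documents", data_uri))
        triples.append((data_uri, CITO + "isDocumentedBy", metadata_uri))
    return triples


base = "https://cn.example.org/object/"  # hypothetical base URL
relations = (
    documentation_triples(base + "A", [base + "C"])
    + documentation_triples(base + "B", [base + "D"])
)
for s, p, o in relations:
    print(f"<{s}> <{p}> <{o}> .")
```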

h3. Dealing with Protocol Based URIs for Identifiers

The suggestion is to create URI identifiers for the aggregated content by providing the "get" REST URL for retrieving the object. Since the identifiers must be persistent, it will be necessary to utilize the general coordinating node URL rather than a URL for a specific node.

In addition, the actual DataONE identifier will be added to the aggregated object as a dcterms:identifier element. DataONE clients will use this value to identify the object, and retrieval will be through the currently defined resolution and get mechanism. Non-DataONE clients consuming the resource map will encounter the aggregated object URIs, which are resolvable URLs, and will use those URLs to retrieve the content. A separate REST endpoint that performs an HTTP 302 redirect for non-DataONE clients will be provided on the Coordinating Nodes for this purpose.
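The sketch below shows one way the two pieces could fit together: a protocol-based URI is formed by percent-encoding the DataONE identifier onto a coordinating node URL, and the original identifier is attached with the Dublin Core terms "identifier" property. The base URL is hypothetical, and the exact path of the endpoint is not specified on this page.

```python
from urllib.parse import quote, unquote

DCTERMS = "http://purl.org/dc/terms/"
# Hypothetical coordinating node endpoint; the real path is not defined here.
CN_BASE_URL = "https://cn.example.org/cn/object/"


def aggregated_object_uri(pid):
    """Build a resolvable URL by percent-encoding the identifier as one path segment."""
    return CN_BASE_URL + quote(pid, safe="")


def identifier_triple(pid):
    """Return the triple that carries the original DataONE identifier as a literal."""
    return (aggregated_object_uri(pid), DCTERMS + "identifier", pid)


pid = "doi:10.1234/abc 1"  # made-up identifier containing URL-unsafe characters
uri = aggregated_object_uri(pid)
print(uri)
print(identifier_triple(pid))
```

Encoding with `safe=""` keeps characters such as "/" and ":" from being read as URL structure, so the identifier can be recovered exactly by decoding the last path segment.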

h2. Packaging Content

The initial plans for data packaging do not actually do any packaging. Instead, the client will retrieve the resource map, parse it, and retrieve the required content one item at a time (though potentially using multiple threads of execution).
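The client-side flow described above can be sketched as follows. The resource map is assumed to be already parsed into (subject, predicate, object) tuples, and the `fetch` callable is a placeholder for whatever retrieval call the client library provides; a stub is used here so that the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor

ORE_AGGREGATES = "http://www.openarchives.org/ore/terms/aggregates"


def aggregated_uris(triples):
    """List the URIs of the objects aggregated by the resource map."""
    return [o for s, p, o in triples if p == ORE_AGGREGATES]


def retrieve_package(triples, fetch, max_workers=4):
    """Fetch every aggregated object, one request per item, on a thread pool."""
    uris = aggregated_uris(triples)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input URIs.
        contents = list(pool.map(fetch, uris))
    return dict(zip(uris, contents))


def fake_fetch(uri):
    """Stub standing in for an HTTP GET of the object."""
    return f"content of {uri}".encode()


# Hypothetical parsed resource map with two aggregated objects.
agg = "https://cn.example.org/object/package-1#aggregation"
parsed_map = [
    ("https://cn.example.org/object/package-1",
     "http://www.openarchives.org/ore/terms/describes", agg),
    (agg, ORE_AGGREGATES, "https://cn.example.org/object/A"),
    (agg, ORE_AGGREGATES, "https://cn.example.org/object/C"),
]
package = retrieve_package(parsed_map, fake_fetch)
print(sorted(package))
```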

We need to determine if this simplistic solution is satisfactory for all Member Nodes, at least for the version 1.0 release.

h2. Working examples

Annotated examples (intended for documentation and illustration purposes) are being developed and maintained in the subversion repository, starting at https://repository.dataone.org/documents/Projects/DataPackaging/Examples/

One example in progress is ORNL DAAC dataset 221. The evolution of that effort, with commentary, is recorded on the ORNLDAAC 221 Packaging page.
