IEDA Member Node Work Plan

1. Resources

  • Redmine MN Deployment: #8035 (Task #8252)
  • Implementation: Likely to be a GMN slender node
  • MN Description Document: IEDA Description Document
  • MN ID: urn:node:IEDA_EARTHCHEM
  • IEDA Contacts: Steve Richards (Technical); Kirsten Lenhardt (Administrative)

Actions

2018-04-06

Member Node (Steve)

  • Install and maintain Generic Member Node platform on web server.

  • Implement a site-map index which leads to distinct sitemaps for each data source.

  • Within the schema.org metadata for each page, embed a recognizable link that serves as the access point to the associated ISO metadata document.

DataONE

  • Draft implementation plan (Dave)

  • Prototype harvest of schema.org resources (Monica, Dave)

    • Develop a Python adapter for harvesting schema.org metadata into a DataONE installation of Generic Member Node platform.

2. Background

Partners of the Interdisciplinary Earth Data Alliance (IEDA) contribute catalog resources to a centralized data catalog. Partners include:

  • EarthChem Library

  • US Antarctic Program Data Center

  • Marine-Geo Digital Library

  • Academic Seismic Portal

The IEDA repository is implemented using ESRI Geoportal v. 2.5, customized in an IEDA fork of the ESRI GitHub repository. Metadata are exposed as DataCite XML, which is transformed and augmented to produce ISO 19139 XML.

2.1 Repository APIs

IEDA exposes the following APIs:

2.1.1. OGC CSW (Catalog Service for the Web)

The CSW service is provided by Geoportal.

Metadata are harvested into the catalog as ISO 19139 XML files from web-accessible folders, and are available as a web page, ISO 19139 XML, or JSON via links with each search result. The catalog can be searched via API using either the OGC Catalog Service for the Web (CSW), version 3.0.0, or OpenSearch. The APIs return metadata as CSW records or in Atom XML format, both of which contain basic Dublin Core metadata elements.

2.1.2. Web folders

Because the version of Geoportal currently in use at IEDA (2.5 at this time) no longer supports exposing ISO metadata via the CSW endpoint, a workaround has been developed in which ISO metadata documents are deposited as files in web directories as follows:

This is a non-standard approach that is unlikely to be adopted by other repositories.

2.1.3. Schema.org

Another alternative in progress at IEDA is the embedding of schema.org information in IEDA landing pages. A sitemap is currently available at http://get.iedadata.org/doi/xml-sitemap.php. At present the sitemap combines EarthChem and MGDL resources, but these can be split. To fully leverage schema.org, the access point for the ISO metadata document will need to be embedded in the schema.org metadata.

Direct access to science data is not available through IEDA, because the source repositories provide only landing pages on their home systems as data access points.

3. Development Plan & Specifications

Broadly speaking, there are two aspects that need to be resolved: 1) How does the repository represent a dataset and its components? and 2) How are datasets and their components accessed?

The initial approach is to deploy a GMN instance to act as a proxy to the resources offered by IEDA. The MN will be deployed as a Tier 1 service that will initially provide access to metadata only and will be implemented using the SlenderNode pattern.

Schema.org will be used to locate ISO 19115 metadata content and load it into the SlenderNode. Initially we will set up EarthChem as its own Member Node, but branded in such a way as to reflect the relationship with IEDA.

In the SlenderNode pattern, an adapter is needed to populate the GMN instance with resources listed by the repository. The adapter is to leverage the schema.org implementation provided by IEDA, since it is anticipated that this approach will facilitate participation of many other repositories that intend to work with DataONE at Tier 1.

DataONE shall be responsible for implementing the adapter for parsing and loading the resources exposed by schema.org mechanisms from the repository.

3.1 Dataset Structure

Within DataONE, datasets are composite structures composed of separate data and metadata components, plus a third component: an OAI-ORE document (resource map) that describes the relationships between the components of a dataset. See https://purl.dataone.org/architecture/design/DataPackage.html

How are the repository datasets structured?

Are resource maps available or do they need to be generated?

Are data and metadata separate components and individually identifiable?

Is each component of a dataset immutable?

When datasets or their components are updated, are old versions retained?

3.2 Identifiers

All content synchronized with DataONE is immutable (checksum of the object bytes never changes), and each object is identified with a persistent identifier (PID) that must be unique within the DataONE federation, and ideally globally unique. See http://purl.dataone.org/architecture/design/PIDs.html

Since version 2.0 of the infrastructure, DataONE also supports series identifiers (SIDs), which always resolve to the latest revision of an object. See http://purl.dataone.org/architecture/design/ContentImmutability.html

Within DataONE, SIDs or PIDs are treated as opaque strings and are resolved using the resolve service of the Coordinating Nodes. In practice, a repository may use different forms of identifier for different purposes. For example, DOIs may be used to identify the dataset and handles or UUIDs used to identify specific data components.
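Because identifiers are opaque strings, they have to be percent-encoded before being placed in a service URL. The following is a minimal sketch of building a resolve request URL. It assumes the production Coordinating Node base URL (the same host used for the formats list in this document) and a `/resolve/{id}` path; the identifier shown is a made-up placeholder, not a real IEDA DOI.

```python
from urllib.parse import quote

# Assumption: production Coordinating Node base URL for the v2 API.
CN_BASE_URL = "https://cn.dataone.org/cn/v2"


def resolve_url(identifier: str, base_url: str = CN_BASE_URL) -> str:
    """Build a Coordinating Node resolve URL for a PID or SID.

    The identifier is treated as opaque, so every reserved character
    (including "/" and ":") is percent-encoded before it goes in the path.
    """
    return f"{base_url}/resolve/{quote(identifier, safe='')}"


if __name__ == "__main__":
    # Hypothetical DOI-style identifier, used only for illustration.
    print(resolve_url("doi:10.1234/EXAMPLE"))
```

Encoding with `safe=''` matters for DOIs in particular, since an unencoded "/" would otherwise be read as a path separator.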

What form of identifier is being used for the different components of a dataset?

3.2.1 Persistent Identifiers (PIDs)

Does each component of a dataset have a persistent identifier (PID) that always refers to the exact same item (identical bytes)?

3.2.2 Series Identifiers (SIDs)

Does the repository support the notion of a series identifier (SID), that is, an identifier that refers to the current revision of the dataset (or its components)?

3.2.3 Format Identifiers

Different types of object are assigned unique "format identifiers" in DataONE. The formatId assists the infrastructure and consumers in appropriately managing and using the content. The list of formatIds is available from https://cn.dataone.org/cn/v2/formats.
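As a rough illustration, the list of formatIds could be inspected with a few lines of Python. This sketch assumes the formats endpoint returns an XML document containing `formatId` elements; it matches elements by local name so that it does not depend on the exact namespace layout. The function names are hypothetical helpers, not part of any DataONE library.

```python
import urllib.request
import xml.etree.ElementTree as ET
from typing import List

FORMATS_URL = "https://cn.dataone.org/cn/v2/formats"


def local_name(tag: str) -> str:
    """Reduce a namespaced tag such as '{uri}name' to 'name'."""
    return tag.rsplit("}", 1)[-1]


def parse_format_ids(xml_text, keyword: str = "") -> List[str]:
    """Return formatId values, optionally filtered by a case-insensitive keyword."""
    root = ET.fromstring(xml_text)
    ids = [
        el.text.strip()
        for el in root.iter()
        if local_name(el.tag) == "formatId" and el.text
    ]
    return [fid for fid in ids if keyword.lower() in fid.lower()]


def fetch_format_list(url: str = FORMATS_URL) -> bytes:
    """Download the format list (requires network access)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()
```

For example, `parse_format_ids(fetch_format_list(), "isotc211")` would list the candidate ISO formatIds, assuming the response has the structure described above.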

What formats are used for the science metadata documents?

What types of science data objects are exposed?

3.3 Listing Datasets and Components

Describe how a list of available datasets and their components can be retrieved.

Is the listing per dataset or per dataset component?

If the listing is per dataset, how is a list of components of the dataset obtained?

Resources are listed by following the workflow:

  1. Load sitemap.xml from the well-known address for the repository: http://get.iedadata.org/doi/xml-sitemap.php

  2. For each sitemap entry:

    1. If the entry is itself a sitemap, go back to step 1 with the newly loaded sitemap; otherwise:
    2. Load the resource from the sitemap url/loc value, then parse it and extract the schema.org constructs.
  3. Given the schema.org information, locate the link to the full metadata document. The full metadata document is the resource that will be synchronized with DataONE. The schema.org metadata will also contain the identifier for the resource, and should also contain links to the data associated with the resource.

  4. Generate system metadata for the resource.

  5. Add the resource to the MN implementation.
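Steps 1 and 2 of this workflow can be sketched as a recursive sitemap walk. The sketch assumes the sitemaps follow the standard sitemaps.org 0.9 schema, with `<sitemapindex>` documents pointing at child sitemaps and `<urlset>` documents listing landing pages. The function names are hypothetical, and the `fetch` callable is injected so the traversal can be exercised without network access.

```python
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def fetch_url(url):
    """Download a document and return its bytes (requires network access)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def iter_sitemap_entries(url, fetch=fetch_url, seen=None):
    """Yield (loc, lastmod) for every page reachable from a sitemap or sitemap index."""
    seen = set() if seen is None else seen
    if url in seen:  # guard against sitemap indexes that reference each other
        return
    seen.add(url)
    root = ET.fromstring(fetch(url))
    # Sitemap index: each <sitemap> entry is another sitemap, so recurse (step 2.1).
    for node in root.findall("sm:sitemap", SITEMAP_NS):
        child = node.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        if child:
            yield from iter_sitemap_entries(child, fetch, seen)
    # URL set: each <url> entry is a landing page to be loaded (step 2.2).
    for node in root.findall("sm:url", SITEMAP_NS):
        loc = node.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = node.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS).strip()
        if loc:
            yield (loc, lastmod or None)
```

Each yielded `loc` is a landing page whose schema.org content is then parsed. The `lastmod` value is passed through so that later harvests can skip unchanged pages.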

Schema.org information is embedded in the landing page for the resource, typically as JSON-LD. The following snippet shows one mechanism for loading schema.org resources from a landing page:

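import pprint

import extruct
import requests

# Fetch the landing page; fail early on HTTP errors rather than parsing an error page.
response = requests.get("http://get.iedadata.org/doi/323582", timeout=30)
response.raise_for_status()
# Extract all embedded structured data (JSON-LD, microdata, etc.), keyed by syntax.
data = extruct.extract(response.content, response.url)
pprint.pprint(data, indent=2)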

3.4 Change Detection

How are changes to the list of datasets and/or dataset components discovered?

How frequently are changes advertised?

Changes in resources can be determined from the sitemap.xml files. A sitemap.xml file contains, for example:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>http://get.iedadata.org/doi/323582</loc>
        <lastmod>2018-03-15T09:39:27-04:00</lastmod>
        <changefreq>yearly</changefreq>
        <priority>1</priority>
    </url>
    ...

The <lastmod> entry indicates when the resource at <loc> was last changed.
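A harvester could use `<lastmod>` to select only the entries that changed since the previous run. The following is a minimal sketch under two assumptions that are not stated in this plan: timestamps without a time zone are treated as UTC, and entries with no `<lastmod>` are always re-checked. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_lastmod(value):
    """Parse a sitemap <lastmod> value into a timezone-aware datetime."""
    parsed = datetime.fromisoformat(value.strip().replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Assumption: date-only or zone-less values are interpreted as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def changed_since(sitemap_xml, last_harvest):
    """Return <loc> values modified after last_harvest (a timezone-aware datetime)."""
    root = ET.fromstring(sitemap_xml)
    changed = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS).strip()
        if not loc:
            continue
        # Entries without <lastmod> are included because their state is unknown.
        if not lastmod or parse_lastmod(lastmod) > last_harvest:
            changed.append(loc)
    return changed
```

Comparing timezone-aware values avoids errors when the sitemap uses a local offset such as `-04:00` while the harvester records its own run times in UTC.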

3.5 Get Dataset Component

Describe how each component of a dataset can be retrieved. For example, given the identifier to a component, is there a service that provides access to the bytes of the component?

An individual resource is retrieved by following the link given in the schema.org metadata embedded in the resource landing page. The resource is expected to be an ISO 19139 XML document.

TODO: 2018-04-06: The precise location and properties of the element that is to contain the link to the full resource are not yet determined.

3.6 System Metadata Generation

Describe how system metadata will be generated: specifically, how formatIds for the dataset components will be determined, what the replication policy will be, and who owns the content.

System Metadata is the information used by DataONE to track and manage objects across the network infrastructure. It includes fields such as identifiers, version relationships, information about the file, and so forth. The system metadata for the resource should be straightforward and for the most part could be generated from a template. The identifier is obtained from the schema.org metadata. The formatId should be consistent, and should be one of the ISO formatIds.
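The template idea can be sketched as a function that fills in the per-object values (identifier, size, checksum) around fixed defaults. This is an illustration only: it produces a plain dictionary rather than a schema-valid system metadata document, the formatId shown is an assumed choice for ISO 19139 content that should be confirmed against the formats list, and the rights holder value is left to the caller.

```python
import hashlib

# Assumption: formatId for ISO 19139 documents; confirm against the DataONE formats list.
ISO_FORMAT_ID = "http://www.isotc211.org/2005/gmd"


def build_system_metadata(identifier, content, rights_holder, format_id=ISO_FORMAT_ID):
    """Fill a system metadata template for one metadata document.

    `content` is the exact bytes of the document; size and checksum are
    derived from it so they always describe the object being registered.
    """
    return {
        "identifier": identifier,
        "formatId": format_id,
        "size": len(content),
        "checksum": {
            "algorithm": "SHA-256",
            "value": hashlib.sha256(content).hexdigest(),
        },
        "rightsHolder": rights_holder,
    }
```

Deriving the size and checksum from the harvested bytes, rather than from values reported by the source, keeps the record consistent with the immutability requirement for synchronized objects.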

Identifiers

IEDA generates DOIs for the resources. The DOIs resolve to the landing page for the resource. The landing page for the resource contains the schema.org metadata providing links to components of the resource.

It is important for Member Nodes to understand that DataONE stores multiple versions of a single metadata record. So even though only one version of a record may exist on the Member Node’s repository, older versions will still be accessible (though not discoverable) on DataONE. More information about handling identifiers in DataONE systems for this implementation scenario can be found here: < insert document link >

Series ID

IEDA’s DOI for EarthChem content will be used as the SeriesID field in DataONE. This means that the DOI is the primary identifier that is always linked to the most current version of a record.

Persistent ID

This is the unique identifier assigned to every object in DataONE. Every time a metadata record changes in IEDA, DataONE will harvest a new copy and assign it a new PID. The relationship between the new PID and the PID of the version it replaces is recorded. Read more about obsolescence chains and versioning of records here: .
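The relationship between the DOI (used as the SeriesID) and the per-version PIDs can be sketched as a small obsolescence chain. How PIDs will actually be minted is not specified in this plan, so the scheme below (the DOI plus a version counter) is purely hypothetical, as are the function name and the dictionary layout.

```python
import hashlib
from typing import Dict, List, Optional


def record_version(chain: List[Dict], sid: str, content: bytes) -> Optional[Dict]:
    """Append a new version to `chain` if `content` differs from the latest one.

    Returns the new version entry, or None when the bytes are unchanged.
    """
    checksum = hashlib.sha256(content).hexdigest()
    latest = chain[-1] if chain else None
    if latest is not None and latest["checksum"] == checksum:
        return None  # identical bytes, so no new PID is needed
    entry = {
        # Hypothetical PID scheme: the DOI plus a version counter.
        "pid": f"{sid}_v{len(chain) + 1}",
        "seriesId": sid,
        "checksum": checksum,
        "obsoletes": latest["pid"] if latest else None,
        "obsoletedBy": None,
    }
    if latest is not None:
        # Link the previous version forward to its replacement.
        latest["obsoletedBy"] = entry["pid"]
    chain.append(entry)
    return entry
```

With this structure the SeriesID stays constant across versions, while each harvested change adds one PID that points back to the version it replaces.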

Target Data

4. Deployment Plan

Which environment will the deployment be tested in?

Will the test implementation be repurposed for production use?

Who will be the points of contact for DataONE and the Member Node?

Are there any particular deadlines to be aware of?

Deployment Process Summary

DataONE tracks member node deployments in an issue management system called Redmine.

  • A test deployment of the member node platform is set up on a web server.

  • Integration scripts (commonly called adapters by DataONE) are developed to harvest metadata (and if applicable, the location of science data) from the source system into the Member Node.

  • Local harvesting is confirmed to be working as expected.

  • Member confirms metadata views in stage.
