Task #3856: Re-harvest ORE documents from MN - Member Nodes - DataONE Tasks

MNDeployment #3552: USGS CSAS

Added by Skye Roseboom over 11 years ago. Updated over 9 years ago.

Status:

Closed

Priority:

High

Assignee:

Robert Waltz

Target version:

Start date:

2013-06-28

Due date:

% Done:

100%

Estimated time:

0.00 h

Story Points:

Sprint:

Description

The content of the ORE documents from USGS CSAS has been changed 'in-place'.

Need to determine how to re-harvest the content to the CN.

Potentially the same solution as the one being investigated for re-harvesting content from ORNL DAAC.

Directly related to issue 3839.

This content would be immediately available through the CN REST API and search index, if new resource maps were generated with new pids/system metadata which obsolete the original/edited versions.

History

#1 Updated by Chris Jones over 11 years ago

Assignee changed from Chris Jones to Robert Waltz
Priority changed from Normal to High
Status changed from New to In Progress

I'm assigning this to Robert for now, since he's working on the code for a one-time update to the CNs that will allow us to re-harvest USGSCSAS content this week before the DUG meeting.

#2 Updated by Robert Waltz over 11 years ago

Status changed from In Progress to Closed
Remaining hours set to 0.0

#3 Updated by Ranjeet Devarakonda over 11 years ago

Estimated time set to 0.00
Status changed from Closed to In Progress

The ONEMercury search is showing only 5 records: https://cn.dataone.org/onemercury/send/facetsQuerry2?filterForDataHidden=true&term1=*&term1attribute=text&op1=&term3attribute=overlaps&term3=%2C%2C%2C&op3=&term8=collection&pageSize=10&start=0&sortattribute=default&facetattribute=datasource&facet=urn:node:USGSCSAS

#4 Updated by Skye Roseboom over 11 years ago

Hi Robert,

Looking at the indexing output regarding these re-harvested pids -- I am seeing a lot of object path errors. In the index processing log it looks like this:

[ INFO] 2013-07-11 14:45:11,573 (IndexTaskProcessor:isObjectPathReady:262) Object path exists for pid: resourceMap_doi_10.5066_F77H1GHV.xml however the file location: /var/metacat/data/autogen.2013042612461679446.1 does not exist. Marking not ready - task will be marked new and retried.

[ INFO] 2013-07-11 14:45:11,606 (IndexTaskProcessor:isObjectPathReady:262) Object path exists for pid: resourceMap_doi_10.5066_F7WW7FN6.xml however the file location: /var/metacat/data/autogen.2013042612464924180.1 does not exist. Marking not ready - task will be marked new and retried.

[ INFO] 2013-07-11 14:45:11,640 (IndexTaskProcessor:isObjectPathReady:262) Object path exists for pid: resourceMap_doi_10.5066_F7NZ85MB.xml however the file location: /var/metacat/data/autogen.2013042612465153383.1 does not exist. Marking not ready - task will be marked new and retried.

This indicates that index processing is attempting to read the contents of the ORE document from the local disk, at the file path given by the shared Hazelcast data structure 'objectPath'. This is the structure in the storage cluster that maps PIDs to file system paths. Indexing does this in order to parse the ORE document and derive the information it contains for the index record. Without a valid object path (file system path), indexing is unable to process the ORE documents. This is why the updated documents have not appeared updated in the index.
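To make the failure mode concrete, here is a minimal sketch of the kind of readiness check described above. It is not the actual IndexTaskProcessor code: a plain Python dict stands in for the shared 'objectPath' map, and the function names and pids are hypothetical.

```python
import os
import tempfile


def is_object_path_ready(pid, object_path):
    """Return True only if the pid is mapped AND the mapped file exists on disk."""
    path = object_path.get(pid)
    if path is None:
        return False  # no mapping for this pid at all
    return os.path.isfile(path)  # mapping exists, but the file may be gone


def pids_not_ready(pids, object_path):
    """List the pids whose index tasks would have to be retried."""
    return [pid for pid in pids if not is_object_path_ready(pid, object_path)]


# Usage: one pid with a real file, one with a stale path, one unmapped.
with tempfile.NamedTemporaryFile() as present:
    object_path = {  # stand-in for the shared map (assumption)
        "pid-with-file": present.name,
        "pid-with-stale-path": present.name + ".missing",
    }
    stale = pids_not_ready(
        ["pid-with-file", "pid-with-stale-path", "pid-unmapped"], object_path
    )

print(stale)  # ['pid-with-stale-path', 'pid-unmapped']
```

The stale-path case corresponds to the log messages above: the map still has an entry for the pid, but the file it points to is no longer there, so the task is never ready.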

#5 Updated by Robert Waltz over 11 years ago

Status changed from In Progress to Closed

Added logic in the repair scripts to touch the hzObjectPath map when evicting pids.
