Task #8526

MNDeployment #7082: USGS Science Data Catalog (SDC)

Story #8683: USGS SDC: redeploy as a v2 Slender Node with GMN

duplicate data sets for USGS node + archiving

Added by Matthew Jones almost 4 years ago. Updated over 2 years ago.

In today's ESIP provenance workshop, researchers were exploring USGS data sets from DataONE, and were confused by duplicated data sets and by the inability to download the data directly. The search they did was for 'Hurricane Maria', which returned 14 datasets with identical titles of "Map data showing concentration of landslides caused by Hurricane Maria in Puerto Rico." For example, the first two are "" and "". In exploring these, it seems that most (maybe all) are for the same data set, and all point internally at the DOI "doi:10.5066/F7JD4VRF", which resolves to the USGS landing page here: . That landing page lists a bunch of data files that can be downloaded from USGS.

So, the issues that the group here would like to be resolved:

1) Can we eliminate these duplicates, or if they are different versions, can that be indicated via obsoletes/obsoletedBy relationships so that they don't seem duplicated in the catalog?
2) Why is the identifier in DataONE for the data set (22c0b3d1-2ef8-4ed9-a8c9-82e5d882f64e) different from the id used at USGS (doi:10.5066/F7JD4VRF)?
3) Can the data files that are linked and downloadable from the DOI landing page be registered with DataONE so they are resolvable and downloadable from there?

Thanks. I'm sharing this with Sky Bristol who said he can help find out answers to these questions on the USGS side.

usgs_pids.csv (3.71 MB) Monica Ihli, 2018-08-14 01:41

checksum_frequency_counts.csv (752 KB) Monica Ihli, 2018-08-14 01:43

usgs_pids_to_archive.txt (415 KB) John Evans, 2019-05-17 20:35


#1 Updated by Matthew Jones almost 4 years ago

  • Project changed from MN Dashboard to Member Nodes

#2 Updated by Matthew Jones almost 4 years ago

  • Parent task set to #7082

#3 Updated by Amy Forrester almost 4 years ago

{from Sky Bristol}
* Sending to Ben Wheeler and Drew Ignizio on the USGS side
* my guess is this is coming from a data lifecycle misalignment. You are likely getting a form of CSDGM metadata that only had an onlink to the ScienceBase Item but did not include distinfo links to the data download files from those items. Seems like something that definitely needs to be fixed and can really only be fixed on the USGS source end.
* I'll push on this to see that we come up with a solution for DataONE and other "third party" catalogs.

#4 Updated by Monica Ihli over 3 years ago


It is believed that duplicate records are being harvested into DataONE as a consequence of how SDC is handling its relationship with its own source repositories. The background story is that persistent identifiers are inconsistently applied, or in some cases not at all present, across contributors to the SDC. The way that SDC handles this is to essentially wipe and reload their entire set of metadata periodically, rather than identifying updated or new records and only making those changes.

As a consequence of how this process of wiping and reloading is handled, a single object will have been harvested into SDC from a given contributor more than once, but assigned a different arbitrary identifier by SDC each time in the process. This does not result in any ill effect for SDC, but an unintended consequence is that the same object would be harvested into DataONE more than once under different identifiers.
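
The mechanism described above can be illustrated with a minimal sketch. Everything in it is hypothetical: sdc_reload and harvest are stand-ins written for this note, not SDC or DataONE code, and the use of random UUIDs is an assumption about how the arbitrary identifiers are generated.

```python
import hashlib
import uuid


def sdc_reload(source_records):
    """Hypothetical wipe-and-reload: every record gets a fresh arbitrary identifier."""
    return {str(uuid.uuid4()): body for body in source_records}


def harvest(catalog, downstream):
    """Hypothetical harvest keyed by identifier: unseen identifiers are added."""
    for pid, body in catalog.items():
        downstream.setdefault(pid, body)


source = ["<metadata>Island of Maui water budget</metadata>"]
dataone = {}

# Three reload cycles over unchanged source metadata.
for _ in range(3):
    harvest(sdc_reload(source), dataone)

# Identical content, so one checksum, but one identifier per reload cycle.
checksums = {hashlib.md5(body.encode()).hexdigest() for body in dataone.values()}
print(len(dataone), "identifiers,", len(checksums), "distinct checksum")
```

Because the harvest can only compare identifiers, each reload looks like a set of new objects even though the content has not changed.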

Current Action

SDC had taken their member node offline in order to prevent the problem from being compounded by additional duplicate records being harvested. Today sync has been disabled on the CN side for USGS SDC to prevent any such issues. They have thus brought the node back online.

However, download links are still failing to resolve. This is most likely because the identifier assigned to a record when it was ingested into DataONE will have changed each time SDC picked the object back up from its source, while the MN links are tied to the system identifier.

For example, the USGS SDC record titled "Mean annual water-budget components for the Island of Maui, Hawaii, for current conditions, 2001-10 rainfall and 2001-10 land cover (version 2.0)."

is found 27 times with 27 different identifiers in DataONE. These are not expected to be different versions of the same object, just 27 duplicate records assigned 27 different arbitrary identifiers. One example identifier among the 27 has the object location URL: but this object does not exist on the MN and cannot presently be retrieved except by the copy on the CN.

Next Steps

DataONE will attempt to triage the objects already on Production, attempting to determine if obsolescence chains can be constructed to at least tie all duplicate versions of the same object together, so that if anyone has ever cited the data the availability of a new version is not lost.
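
One way such chains could be constructed is sketched below. This is an assumption-laden illustration, not the procedure DataONE used: it treats records with the same checksum as copies of one object and orders them by upload date, and build_obsolescence_chains is a hypothetical helper. Writing the resulting obsoletes/obsoletedBy values into system metadata is not shown.

```python
from collections import defaultdict


def build_obsolescence_chains(records):
    """Group records by checksum and link each group oldest-to-newest.

    `records` is an iterable of (pid, checksum, date_uploaded) tuples, with
    dates as ISO 8601 strings so that they sort chronologically as text.
    Returns a list of (obsoleted_pid, obsoleting_pid) pairs.
    """
    groups = defaultdict(list)
    for pid, checksum, date_uploaded in records:
        groups[checksum].append((date_uploaded, pid))

    links = []
    for members in groups.values():
        members.sort()  # oldest first; ties broken by pid
        for (_, older), (_, newer) in zip(members, members[1:]):
            links.append((older, newer))
    return links


# Made-up sample data: three copies of one object and one unique object.
sample = [
    ("pid-b", "abc", "2017-03-01T00:00:00Z"),
    ("pid-a", "abc", "2016-01-01T00:00:00Z"),
    ("pid-c", "abc", "2018-06-01T00:00:00Z"),
    ("pid-x", "def", "2017-01-01T00:00:00Z"),
]
for older, newer in build_obsolescence_chains(sample):
    print(f"{older} obsoletedBy {newer}")
```

Objects whose checksum appears only once (pid-x in the sample) get no links and are left alone.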

However, this still leaves up in the air the problem of the URLs not resolving to anything. This must be discussed in further detail.

Future steps will be to redeploy USGS SDC as a v2 Slender Node with GMN. The details of how that will be accomplished, including resolving identifier issues, will be addressed in a separate ticket. This ticket will only address putting a band-aid on the current situation in production as it stands now.

#5 Updated by Amy Forrester over 3 years ago

  • Assignee changed from Amy Forrester to Monica Ihli

#6 Updated by Amy Forrester over 3 years ago

8/13/18: email Aaron Stokes, to get re-engaged

#7 Updated by Monica Ihli over 3 years ago

usgs_pids.csv - pid, checksum, dateUploaded, and dateModified for 33,474 USGS records extracted from a solr query.

checksum_frequency_counts.csv - counts number of duplicates for each checksum.

The frequency distribution of the number of duplicates for 21,983 unique checksums is as follows:

67 checksums had 16 duplicates
63 checksums had 15 duplicates
28 checksums had 14 duplicates
12 checksums had 13 duplicates
106 checksums had 12 duplicates
37 checksums had 11 duplicates
30 checksums had 10 duplicates
49 checksums had 9 duplicates
44 checksums had 8 duplicates
29 checksums had 7 duplicates
42 checksums had 6 duplicates
12 checksums had 5 duplicates
160 checksums had 4 duplicates
2250 checksums had 3 duplicates
1178 checksums had 2 duplicates
17,876 checksums were found only once and had no duplicates
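
A count like the one above can be reproduced with a short script. The sketch below is not the script that produced checksum_frequency_counts.csv; the column name "checksum" and the inline sample rows are assumptions. It also cross-checks the reported figures: reading "N duplicates" as "seen N times in total" makes the distribution add up to exactly 21,983 checksums and 33,474 records.

```python
import csv
import io
from collections import Counter


def checksum_distribution(csv_file, checksum_column="checksum"):
    """Return {times_seen: number_of_checksums} for a CSV of harvested records."""
    per_checksum = Counter(row[checksum_column] for row in csv.DictReader(csv_file))
    return Counter(per_checksum.values())


# Tiny inline stand-in for usgs_pids.csv (column names are assumed).
sample = io.StringIO(
    "pid,checksum,dateUploaded,dateModified\n"
    "p1,aaa,2017-01-01,2017-01-01\n"
    "p2,aaa,2017-02-01,2017-02-01\n"
    "p3,bbb,2017-01-01,2017-01-01\n"
)
distribution = checksum_distribution(sample)
for times_seen, n_checksums in sorted(distribution.items(), reverse=True):
    print(f"{n_checksums} checksums seen {times_seen} time(s)")

# Cross-check of the figures reported in this comment.
reported = {16: 67, 15: 63, 14: 28, 13: 12, 12: 106, 11: 37, 10: 30, 9: 49,
            8: 44, 7: 29, 6: 42, 5: 12, 4: 160, 3: 2250, 2: 1178, 1: 17876}
unique_checksums = sum(reported.values())
total_records = sum(times * count for times, count in reported.items())
print(unique_checksums, total_records)
```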

#8 Updated by Monica Ihli over 3 years ago

Decided in today's maint. call that we will archive the current SDC content before proceeding with GMN.

#9 Updated by Amy Forrester over 3 years ago

8/20/18: follow-up email Aaron Stokes, to get re-engaged

8/21/18: Mike Frame sent email to USGS team to coordinate F2F meeting with Monica to knock out archiving the data

#10 Updated by Amy Forrester over 3 years ago

8/22/18: Call to discuss archiving and next steps
Aaron & Tudor; Monica & Amy

(1) Monica to archive current data
(2) MN start working in Sandbox environment and dev OAI-PMH
(3) Lisa Zolly working on USGS PID/DOI issues --> Aaron to see if there is a USGS source repository that can be used for testing while they hash out the issues.

#11 Updated by Amy Forrester over 3 years ago

  • Subject changed from duplicate data sets for USGS node to duplicate data sets for USGS node + archiving

#12 Updated by Amy Forrester over 3 years ago

Archival of Existing Objects:
- Monica will archive the 33,000 ish SDC objects currently on the CN.
- Monica will notify USGS once the archival is complete via email.
- USGS will take SDC member node offline temporarily.

#13 Updated by Amy Forrester over 3 years ago

10/8/18 - from Aaron

Monica and I met today to discuss our path moving forward, and we have come up with a good plan. Over the next two weeks we plan to archive the existing SDC metadata in the DataONE CN and replace it with a static list of SDC metadata records. These records will be available while we work on setting up the GMN software and introducing DOIs as identifiers to the process.

#14 Updated by Amy Forrester over 3 years ago

  • Parent task changed from #7082 to #8683

#15 Updated by Amy Forrester about 3 years ago

  • File usgs_pids.csv added

The USGS pids list to be archived is attached (usgs_pids.csv). Monica already installed the python library on the CN so a script can be run to call cn.archive() on them.
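
A rough sketch of what such a script could look like follows. It is not the script that was eventually run. The only thing taken from this ticket is that the client exposes an archive() call; that it accepts a single pid argument is an assumption, archive_pids is a hypothetical helper, and FakeClient exists only so the loop can be dry-run without touching a CN. Creating and authenticating a real client is not shown.

```python
def archive_pids(client, pids):
    """Call client.archive(pid) for each PID; return (archived, failed) lists."""
    archived, failed = [], []
    for pid in pids:
        try:
            client.archive(pid)
            archived.append(pid)
        except Exception as exc:  # keep going so one bad PID does not stop the run
            failed.append((pid, str(exc)))
    return archived, failed


class FakeClient:
    """Stand-in for a dry run; a real run would pass a CN client instead."""

    def archive(self, pid):
        if not pid:
            raise ValueError("empty pid")
        return True


# Made-up PIDs; the empty string simulates a bad row in the input list.
pids = ["example-pid-1", "example-pid-2", ""]
archived, failed = archive_pids(FakeClient(), pids)
print(len(archived), "archived,", len(failed), "failed")
```

Collecting failures instead of aborting makes it possible to rerun only the PIDs that did not go through.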

#16 Updated by Amy Forrester about 3 years ago

  • File deleted (usgs_pids.csv)

#17 Updated by Amy Forrester over 2 years ago

  • Assignee changed from Monica Ihli to John Evans

#18 Updated by John Evans over 2 years ago

Attached is a list of PIDs/UUIDs that can be archived. If the original CSV from USGS had 33,475 rows, and if Monica determined that there were 21,983 unique checksums, then the output file should have 11,491 rows, which turned out to be the case.
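
The row count can be checked against the record count reported earlier in this ticket: 33,474 records minus 21,983 unique checksums leaves 11,491 (the 33,475 figure presumably includes a header row, which is a guess). The sketch below shows one way the list could be derived. Which copy to keep is an assumption, since the ticket does not say which one was retained; here the most recently uploaded copy per checksum is kept, and pids_to_archive is a hypothetical helper.

```python
def pids_to_archive(records):
    """Keep the most recently uploaded PID per checksum; return all the others.

    `records` is a list of (pid, checksum, date_uploaded) tuples with unique
    PIDs and ISO 8601 date strings.
    """
    newest = {}
    for pid, checksum, date_uploaded in records:
        current = newest.get(checksum)
        if current is None or (date_uploaded, pid) > current:
            newest[checksum] = (date_uploaded, pid)
    keep = {pid for _, pid in newest.values()}
    return [pid for pid, _, _ in records if pid not in keep]


# Made-up sample: two copies of one object plus one unique object.
records = [
    ("p1", "aaa", "2017-01-01"),
    ("p2", "aaa", "2017-02-01"),
    ("p3", "bbb", "2017-01-01"),
]
to_archive = pids_to_archive(records)
print(to_archive)

# Same arithmetic as in the comment above, using the record count from #7.
expected_rows = 33474 - 21983
print(expected_rows)
```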

#19 Updated by Jing Tao over 2 years ago

  • Assignee changed from John Evans to Jing Tao

#20 Updated by Amy Forrester over 2 years ago

  • % Done changed from 0 to 100
  • Status changed from New to Closed

Jing wrote a script and ran it to archive all of the roughly 11,000 duplicated USGS objects. The archival succeeded.
