CN checksum inconsistencies
While transferring test data from production to the sandbox-2 environment, I noticed failures for a group of PIDs.
I'll use one as an example: doi_10.5066_F71C1TV7
CN.SystemMetadata reports checksum as:
Whereas calculating it from disk gives this:
The byte size is also off.
-rw-r--r-- 1 tomcat7 tomcat7 16529 Jun 25 2013 /var/metacat/documents/autogen.2013062508395355978.1
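The on-disk check described above can be sketched as follows. This is a minimal illustration, not the tooling actually used on the CN: the function names are hypothetical, and the `md5` default is an assumption (the algorithm to use is whatever the system metadata entry declares for the object).

```python
import hashlib


def file_size_and_checksum(path, algorithm="md5", chunk_size=65536):
    """Return (size_in_bytes, hex_digest) for the file at `path`.

    `algorithm` is a hashlib algorithm name; "md5" is only an assumed default.
    """
    digest = hashlib.new(algorithm)
    size = 0
    with open(path, "rb") as handle:
        # Read in chunks so large objects do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return size, digest.hexdigest()


def matches_system_metadata(path, recorded_size, recorded_checksum, algorithm="md5"):
    """True only if both the byte size and the checksum agree with the record."""
    size, checksum = file_size_and_checksum(path, algorithm)
    return size == recorded_size and checksum.lower() == recorded_checksum.lower()
```

Running `matches_system_metadata` over each affected file in /var/metacat/documents, with the size and checksum taken from the corresponding system metadata, would flag objects like the one above.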
There are ~70 similar PIDs with issues (perhaps more) in our test corpus. They are from the now-defunct USGS MN.
I'm not sure what our strategy should be: the original MN is no longer online, so we cannot get the "original" bytes from it.
#1 Updated by Ben Leinfelder almost 6 years ago
Here are the pids that are similar
#4 Updated by Dave Vieglais almost 6 years ago
Looking at the first item in the list: doi_10.5066_F7028PGW
With (A) as the document from the CN:
- Verified that the size of the object differs from that in the system metadata, as does the checksum.
- Verified that the object is not retrievable from the member node (node offline)
- A Google search on the PID returns the ONEMercury interface, but no results pointing to the clearing house.
- The DOI listed in the metadata is "doi:10.5066/F7028PGW". The DOI resolves to a zip file that contains an ArcGIS layer and a PDF document, no metadata.
- The URL in the metadata resolves to an fgdc file (B).
- Searching for the DOI in the USGS search box returns one result, which points us to: www1.usgs.gov/vip/kaho/metakahospatial.xml downloaded as (C)
- diff reports no difference between (B) and (C)
- diff reports minor differences between (A) and (B) ( < is copy (A) from CN):
< <?xml version="1.0"?>
> <?xml version="1.0" encoding="ISO-8859-1"?>
< Cogan, D. K. Schulz., D. Benitez, G. Kudray, and A. Ainsworth 2011. Vegetation inventory project: Kaloko-Honokohau National Historical Park NPS/KAHO/NRR2011/462. National Park Service, Fort Collins, Colorado.¶
> Cogan, D. K. Schulz., D. Benitez, G. Kudray, and A. Ainsworth 2011. Vegetation inventory project: Kaloko-Honokohau National Historical Park NPS/KAHO/NRR�2011/462. National Park Service, Fort Collins, Colorado.
- The original content is not available in exactly the same form as published to the CN.
- The currently available content differs from that on the CN in a primarily cosmetic manner.
- No currently available copy reports the same size or checksum as recorded in the system metadata.
- Adding the element to (B) did not reconcile the difference in checksum and size from (A).
Hence there is, by definition, no valid copy of the original data from DataONE's perspective. From a pragmatic viewpoint, the content remains available at the locations referenced within the metadata document (A), and so is still practically useful.
From a user perspective, the content remains valid, and the checksum and size should be updated to reflect that the copy held by the CN is the only valid copy. Since this is a version 1.0 object, such a change is not possible without violating self-imposed integrity constraints.
One possible solution is to "upgrade" the system metadata to version 2.0, make the current PID the SID, and create a new system metadata document that describes the current state of the object and references the original system metadata entry as obsoleted.
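A rough sketch of what that proposal could look like, using a simplified stand-in record rather than the real DataONE types. The `SysMeta` class, the `reissue` function and the new PID value are all hypothetical, and the sketch takes the proposal literally: it leaves open how the original entry would be addressed once its PID also serves as the SID.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class SysMeta:
    """Hypothetical, simplified stand-in for a system metadata record."""
    identifier: str
    size: int
    checksum: str
    series_id: Optional[str] = None
    obsoletes: Optional[str] = None
    obsoleted_by: Optional[str] = None


def reissue(original, new_pid, actual_size, actual_checksum):
    """Return (updated_original, new_record).

    The original keeps its recorded size and checksum and is only marked
    as obsoleted. The new record carries the values measured on the CN copy
    and uses the original PID as its series identifier.
    """
    updated_original = replace(original, obsoleted_by=new_pid)
    new_record = SysMeta(
        identifier=new_pid,
        size=actual_size,
        checksum=actual_checksum,
        series_id=original.identifier,
        obsoletes=original.identifier,
    )
    return updated_original, new_record
```

Under this reading, the originally recorded size and checksum are never overwritten, so the 1.0 record stays intact while the new record describes the bytes actually held by the CN.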