Task #4213
Story #3736: CN Consistency Check and CN Recovery
Evaluate deleted identifiers on each CN
100%
Description
In order to simplify the auditing code, we want to be able to exclude identifiers that have been 'deleted', in that the 'archived' flag has been set, and that a 'delete' event was recorded for the identifier in the CN Metacat's access-log table.
History
#1 Updated by Chris Jones almost 11 years ago
- Status changed from In Progress to Closed
- translation missing: en.field_remaining_hours set to 0.0
I've looked at each production CN with regard to 'deleted' identifiers. We've defined 'deleted' as 1) having the SystemMetadata.archived flag set to true, AND 2) Having a 'delete' event recorded in the Metacat access_log table. I've looked at the CNs with a series of SQL commands:
CREATE OR REPLACE VIEW deleted_docids AS SELECT a.docid, a.event, a.date_logged FROM access_log a WHERE a.event = 'delete' AND a.principal != 'replication';
CREATE OR REPLACE VIEW ids_with_deleted_docids AS SELECT * FROM deleted_docids INNER JOIN identifier i USING(docid);
SELECT s.guid, ids.docid, ids.rev, s.serial_version, s.rights_holder, s.authoritive_member_node, s.date_modified, s.object_format, s.archived, s.obsoleted_by, ids.date_logged from systemmetadata s INNER JOIN ids_with_deleted_docids ids USING(guid);
In getting the list of deleted_docids, I've excluded entries based on Metacat replication, since it uses DocumentImpl.delete() synonymously with the archive() method of the D1 API. All other delete events had a CN=urn:node:CNXXX1 principal listed. I'm assuming these were manual delete events, but need to confirm.
That said, when getting a list of pids from the identifier table from this deleted_docids list, all 3 CNs return a value of 0. Unless there's some SQL error here, my thought is that all of these pids were completely removed from that table, and so we have no record of them at this point, and can't look them up in the systemmetadata table.
One thing to note is that there are a number of docids listed as 'deleted' in the access-log, yet their archive flag is set to false in systemmetadata. An example is:
https://cn.dataone.org/cn/v1/meta/doi:10.5063/AA/nceas.314.1
My thought is that the archive flag should be set to true, especially since, in this example, the identifier is also obsoletedBy https://cn.dataone.org/cn/v1/meta/doi:10.5063/AA/nceas.314.2
So, although we can likely say there are 'zero' deleted pids to be dealt with during the audit merge, It may also be good to correct the entries found in the deleted_docids. To do this, instead use:
CREATE OR REPLACE VIEW deleted_docids AS SELECT a.docid, a.event, a.date_logged FROM access_log a WHERE a.event = 'delete';
CREATE OR REPLACE VIEW ids_with_deleted_docids AS SELECT * FROM deleted_docids INNER JOIN identifier i USING(docid);
SELECT s.guid from systemmetadata s INNER JOIN ids_with_deleted_docids ids USING(guid) WHERE s.archived = false;
Note that the principal = 'replication' constraint is removed.
This gives us a total pid list to repair of:
UCSB 16
ORC 27
UNM 6¶
Total 49
The list is (dupes not removed):
"doi:10.5063/AA/nceas.314.1"
"doi:10.6073/AA/knb-lter-and.2727.2"
"doi:10.6073/AA/knb-lter-and.2728.2"
"doi:10.6073/AA/knb-lter-and.3982.3"
"doi:10.6073/AA/knb-lter-and.2727.3"
"doi:10.6073/AA/knb-lter-gce.158.4"
"doi:10.6073/AA/knb-lter-gce.165.4"
"doi:10.6073/AA/knb-lter-gce.60.5"
"doi:10.6073/AA/knb-lter-gce.86.3"
"doi:10.6073/AA/knb-lter-kbs.19.2"
"doi:10.6073/AA/knb-lter-kbs.35.1"
"doi:10.6073/AA/knb-lter-kbs.6.2"
"doi:10.6073/AA/knb-lter-and.2726.2"
"doi:10.6073/AA/knb-lter-and.3237.3"
"doi:10.6073/AA/knb-lter-and.4021.6"
"doi:10.6073/AA/knb-lter-gce.154.2"
"doi:10.6073/AA/knb-lter-gce.156.4"
"doi:10.6073/AA/knb-lter-gce.156.5"
"doi:10.6073/AA/knb-lter-gce.157.3"
"doi:10.6073/AA/knb-lter-gce.168.5"
"doi:10.6073/AA/knb-lter-gce.64.13"
"doi:10.6073/AA/knb-lter-gce.72.14"
"doi:10.6073/AA/knb-lter-gce.74.14"
"doi:10.6073/AA/knb-lter-gce.82.5"
"doi:10.6073/AA/knb-lter-gce.86.4"
"doi:10.6073/AA/knb-lter-kbs.6.1"
"doi:10.6073/AA/knb-lter-kbs.36.8"
"doi:10.5063/AA/nceas.314.1"
"doi:10.6073/AA/knb-lter-gce.55.12"
"doi:10.6073/AA/knb-lter-gce.6.4"
"doi:10.6073/AA/knb-lter-gce.67.2"
"doi:10.6073/AA/knb-lter-gce.78.5"
"doi:10.6073/AA/knb-lter-gce.79.5"
"doi:10.6073/AA/knb-lter-gce.83.3"
"doi:10.6073/AA/knb-lter-gce.80.5"
"doi:10.6073/AA/knb-lter-gce.85.6"
"doi:10.6073/AA/knb-lter-kbs.17.8"
"doi:10.6073/AA/knb-lter-kbs.18.1"
"doi:10.6073/AA/knb-lter-gce.158.4"
"doi:10.6073/AA/knb-lter-gce.166.4"
"doi:10.6073/AA/knb-lter-gce.175.4"
"doi:10.6073/AA/knb-lter-gce.176.1"
"doi:10.6073/AA/knb-lter-gce.61.3"
"doi:10.6073/AA/knb-lter-gce.154.2"
"doi:10.6073/AA/knb-lter-gce.156.4"
"doi:10.6073/AA/knb-lter-gce.156.5"
"doi:10.6073/AA/knb-lter-gce.157.3"
"doi:10.6073/AA/knb-lter-gce.64.13"
"doi:10.6073/AA/knb-lter-gce.72.14"
#2 Updated by Robert Waltz almost 11 years ago
- Target version deleted (
2014.2-Block.1.1)