Story #8756

Updated by Chris Jones almost 6 years ago

The replication auditor service is currently configured to audit all objects every 90 days. As documented in #8582, the auditor is not working correctly. While the errors described in that ticket seem to be limited to `pid`s containing certain characters, I think the auditor process as a whole is not keeping up with our content.

Looking at the number of objects on each member node that haven't been audited in the last 90 days, auditing is well behind (if we consider it working at all):

```
SELECT sm.authoritive_member_node, count(smr.guid) AS count
FROM systemmetadata sm INNER JOIN smreplicationstatus smr
ON sm.guid = smr.guid
WHERE
sm.date_uploaded < (SELECT CURRENT_DATE - interval '90 days') AND
smr.date_verified < (SELECT CURRENT_DATE - interval '90 days')
GROUP BY sm.authoritive_member_node
ORDER BY count DESC;

authoritive_member_node | count
-------------------------+---------
urn:node:PANGAEA | 1014827
urn:node:ARCTIC | 806727
urn:node:DRYAD | 672981
urn:node:LTER | 581005
urn:node:PISCO | 383397
urn:node:CDL | 338936
urn:node:TDAR | 150395
urn:node:KNB | 109911
urn:node:NCEI | 101948
urn:node:USGS_SDC | 80576
urn:node:TERN | 44921
urn:node:ESS_DIVE | 31537
urn:node:NMEPSCOR | 18406
urn:node:GOA | 11463
urn:node:TFRI | 11367
urn:node:NRDC | 11120
urn:node:IARC | 8899
urn:node:ORNLDAAC | 6526
urn:node:PPBIO | 5877
urn:node:SANPARKS | 5770
urn:node:FEMC | 4860
urn:node:GRIIDC | 4130
urn:node:R2R | 3574
urn:node:EDI | 2736
urn:node:ONEShare | 2647
urn:node:mnTestKNB | 2112
urn:node:USGSCSAS | 2008
urn:node:EDACGSTORE | 1793
urn:node:US_MPC | 1548
urn:node:RW | 1091
urn:node:NEON | 974
urn:node:KUBI | 860
urn:node:LTER_EUROPE | 686
urn:node:RGD | 546
urn:node:ESA | 496
urn:node:IOE | 435
urn:node:NKN | 244
urn:node:OTS_NDC | 210
urn:node:SEAD | 180
urn:node:BCODMO | 129
urn:node:mnTestNKN | 100
urn:node:EDORA | 56
urn:node:ONEShare.pem | 31
urn:node:CLOEBIRD | 29
urn:node:mnTestTDAR | 20
urn:node:mnTestBCODMO | 17
urn:node:USANPN | 14
urn:node:MyMemberNode | 1
(48 rows)
```

The table above shows the number of objects that have not been audited in the last 90 days, but I get the feeling that the auditor isn't able to audit any of the content it is charged to audit given 1) the frequency, 2) the number of threads allotted, and 3) the configured batch count (which seems way too low). ~~Note that this query excludes replicated content - this is just the original objects.~~ (After looking at my query again, I think the join is including all replicas - the total is ... objects.)
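To illustrate why the join may be counting replicas rather than original objects, here is a minimal sketch using an in-memory SQLite database. It assumes `smreplicationstatus` holds one row per replica of a `guid`; the `member_node` column, the sample `guid`s, and the replica node names are hypothetical and not taken from the production schema. It compares `COUNT(smr.guid)` (one per joined row) with `COUNT(DISTINCT sm.guid)` (one per object).

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with made-up rows (illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE systemmetadata (
            guid TEXT PRIMARY KEY,
            authoritive_member_node TEXT
        );
        -- Assumption: one row per replica of each object.
        CREATE TABLE smreplicationstatus (
            guid TEXT,
            member_node TEXT
        );
        """
    )
    conn.executemany(
        "INSERT INTO systemmetadata VALUES (?, ?)",
        [
            ("doc.1", "urn:node:KNB"),
            ("doc.2", "urn:node:KNB"),
            ("doc.3", "urn:node:LTER"),
        ],
    )
    conn.executemany(
        "INSERT INTO smreplicationstatus VALUES (?, ?)",
        [
            ("doc.1", "urn:node:mnDemoA"),
            ("doc.1", "urn:node:mnDemoB"),
            ("doc.1", "urn:node:mnDemoC"),
            ("doc.2", "urn:node:mnDemoA"),
            ("doc.3", "urn:node:mnDemoA"),
            ("doc.3", "urn:node:mnDemoB"),
        ],
    )
    return conn


def count_rows_vs_objects(conn):
    """Return {node: (joined_rows, distinct_objects)} for the same join as above."""
    sql = """
        SELECT sm.authoritive_member_node,
               COUNT(smr.guid)         AS joined_rows,
               COUNT(DISTINCT sm.guid) AS distinct_objects
        FROM systemmetadata sm
        INNER JOIN smreplicationstatus smr ON sm.guid = smr.guid
        GROUP BY sm.authoritive_member_node
    """
    return {node: (rows, objs) for node, rows, objs in conn.execute(sql)}
```

If the one-row-per-replica assumption holds in production, swapping `count(smr.guid)` for `count(DISTINCT sm.guid)` in the original query would report objects rather than replicas.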

We need to evaluate the true effectiveness of the auditor. Some strategies may include: 1) looking to see if we may be in an infinite loop on processing a few `pid`s due to the issues in #8582, 2) seeing if we can increase the batch size by increasing the total threads allocated in the executor, and 3) deciding if we need to offload the process from the CNs and distribute the workload across a cluster of workers that can do the auditing faster. This needs some thought and discussion.
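As a rough aid for that discussion, the sketch below does the back-of-the-envelope arithmetic: how many audits per day are needed to cover a given number of objects within the 90-day window, and how long a full pass would take at a given batch size and batch frequency. The function names and any batch values passed in are hypothetical; the auditor's actual configuration is not recorded in this ticket.

```python
import math


def required_audits_per_day(object_count, period_days=90):
    """Audits per day needed to touch every object once within the period."""
    return math.ceil(object_count / period_days)


def days_to_audit_all(object_count, batch_size, batches_per_day):
    """Days for one full pass, given a (hypothetical) batch size and frequency."""
    audits_per_day = batch_size * batches_per_day
    return math.ceil(object_count / audits_per_day)
```

For example, `required_audits_per_day(1014827)` gives 11276 for the PANGAEA row alone, while `days_to_audit_all(1014827, 100, 24)` (made-up values of 100 objects per batch, 24 batches per day) gives 423 days, well beyond the 90-day target.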
