Story #8756

Updated by Chris Jones over 5 years ago

The replication auditor service is currently configured to audit all objects every 90 days. As documented in #8582, the auditor is not working correctly. Although the errors described in that ticket seem to be limited to `pid`s containing certain characters, I think the auditor process as a whole is not keeping up with our content.

Looking at the number of objects on each member node that haven't been audited in the last 90 days, auditing is well behind (if we consider it working at all):

```
SELECT sm.authoritive_member_node, count(smr.guid) AS count
FROM systemmetadata sm INNER JOIN smreplicationstatus smr
    ON sm.guid = smr.guid
WHERE
    smr.member_node != 'urn:node:CN' AND
    sm.date_uploaded < (SELECT CURRENT_DATE - interval '90 days') AND
    smr.date_verified < (SELECT CURRENT_DATE - interval '90 days')
GROUP BY sm.authoritive_member_node
ORDER BY count DESC;

 authoritive_member_node | count
-------------------------+--------
 urn:node:ARCTIC         | 771872
 urn:node:PANGAEA        | 507456
 urn:node:LTER           | 416339
 urn:node:DRYAD          | 374439
 urn:node:CDL            | 242115
 urn:node:PISCO          | 235791
 urn:node:KNB            |  86075
 urn:node:TDAR           |  75639
 urn:node:NCEI           |  50974
 urn:node:USGS_SDC       |  40290
 urn:node:TERN           |  31671
 urn:node:ESS_DIVE       |  28830
 urn:node:NMEPSCOR       |  16042
 urn:node:GOA            |   9266
 urn:node:IARC           |   7677
 urn:node:NRDC           |   6673
 urn:node:TFRI           |   6478
 urn:node:PPBIO          |   3464
 urn:node:ORNLDAAC       |   3328
 urn:node:FEMC           |   2430
 urn:node:EDI            |   2098
 urn:node:GRIIDC         |   2065
 urn:node:mnTestKNB      |   2010
 urn:node:SANPARKS       |   2008
 urn:node:ONEShare       |   1874
 urn:node:R2R            |   1787
 urn:node:USGSCSAS       |   1151
 urn:node:EDACGSTORE     |   1075
 urn:node:US_MPC         |   1032
 urn:node:RW             |    970
 urn:node:KUBI           |    516
 urn:node:NEON           |    487
 urn:node:LTER_EUROPE    |    343
 urn:node:IOE            |    279
 urn:node:RGD            |    273
 urn:node:ESA            |    272
 urn:node:NKN            |    218
 urn:node:OTS_NDC        |    126
 urn:node:BCODMO         |    115
 urn:node:SEAD           |     90
 urn:node:mnTestNKN      |     50
 urn:node:EDORA          |     28
 urn:node:ONEShare.pem   |     22
 urn:node:CLOEBIRD       |     17
 urn:node:mnTestBCODMO   |     11
 urn:node:USANPN         |     10
 urn:node:mnTestTDAR     |     10
 urn:node:MyMemberNode   |      1
(48 rows)
```

The table above shows the number of objects that have not been audited in the last 90 days, but I get the feeling that the auditor isn't able to audit any of the content it is charged with auditing given 1) the frequency, 2) the number of threads allotted, and 3) the configured batch count (seems way too low). ~~Note that this query excludes replicated content - this is just the original objects~~ (After looking at my query again, I think the join is including all replicas: the total is 2,935,787, which is greater than the total number of objects in the system (2,751,136), so this query needs to be refined.)
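One possible way to see why the join overcounts is to compare `count(smr.guid)` with `count(DISTINCT sm.guid)` on a toy dataset. The sketch below is only an illustration, not the refined production query: it uses an in-memory SQLite database rather than the CN's PostgreSQL instance, keeps only the two columns needed from each table, omits the 90-day date filters, and uses made-up guids and node names (`urn:node:MN_A`, `urn:node:MN_B`).

```python
import sqlite3


def build_toy_db():
    """Create a tiny in-memory stand-in for the two tables (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE systemmetadata (
            guid TEXT PRIMARY KEY,
            authoritive_member_node TEXT
        );
        CREATE TABLE smreplicationstatus (
            guid TEXT,
            member_node TEXT
        );
        """
    )
    # Two objects, both owned by the hypothetical node MN_A.
    conn.executemany(
        "INSERT INTO systemmetadata VALUES (?, ?)",
        [("a", "urn:node:MN_A"), ("b", "urn:node:MN_A")],
    )
    # Object "a" has a CN copy, its original, and one extra replica on MN_B;
    # object "b" has a CN copy and its original only.
    conn.executemany(
        "INSERT INTO smreplicationstatus VALUES (?, ?)",
        [
            ("a", "urn:node:CN"),
            ("a", "urn:node:MN_A"),
            ("a", "urn:node:MN_B"),
            ("b", "urn:node:CN"),
            ("b", "urn:node:MN_A"),
        ],
    )
    return conn


def count_per_node(conn):
    """Return {node: (replica_rows, distinct_objects)} for non-CN replica rows."""
    sql = """
        SELECT sm.authoritive_member_node,
               count(smr.guid)         AS replica_rows,
               count(DISTINCT sm.guid) AS distinct_objects
        FROM systemmetadata sm INNER JOIN smreplicationstatus smr
            ON sm.guid = smr.guid
        WHERE smr.member_node != 'urn:node:CN'
        GROUP BY sm.authoritive_member_node
        ORDER BY replica_rows DESC
    """
    return {node: (rows, objs) for node, rows, objs in conn.execute(sql)}


counts = count_per_node(build_toy_db())
```

In this toy data the join yields three replica rows for two distinct objects, which is the same kind of gap as 2,935,787 rows against 2,751,136 objects. Whether the refined query should count distinct objects or restrict `smr.member_node` to the authoritative node depends on what the auditor is actually expected to verify.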

We need to evaluate the true effectiveness of the auditor. Some strategies may include: 1) checking whether we are stuck in an infinite loop processing a few `pid`s due to the issues in #8582, 2) seeing if we can increase the batch size by increasing the total threads allocated in the executor, and 3) deciding whether we need to offload the process from the CNs and distribute the workload across a cluster of workers that can do the auditing faster. Needs some thought and discussion.
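As a starting point for that discussion, a back-of-the-envelope estimate of the throughput the auditor would need may help. The sketch below assumes the 2,935,787 replica rows counted above all have to be verified once per 90-day window and that the work is spread evenly over time. The function name and the even-spread assumption are illustrative and are not taken from the auditor's configuration.

```python
SECONDS_PER_DAY = 86_400


def required_audit_rate(total_replicas, period_days=90):
    """Return (audits per day, audits per second) needed to cover
    total_replicas once within period_days, assuming an even spread."""
    per_day = total_replicas / period_days
    per_second = per_day / SECONDS_PER_DAY
    return per_day, per_second


# Total taken from the query result above.
per_day, per_second = required_audit_rate(2_935_787)
print(f"{per_day:,.0f} audits/day, {per_second:.2f} audits/second")
```

Under these assumptions the auditor would need to sustain roughly 32,600 audits per day, or about 0.38 per second. Comparing that figure with the configured batch count and thread pool size should show whether tuning alone could close the gap or whether distributing the work is needed.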
