Task #8777

Story #8756: Ensure replica auditor is effective

Configure CN to audit objects greater than 1GB

Added by Chris Jones about 5 years ago. Updated about 5 years ago.

Status: New
Priority: Normal
Assignee:
Category: d1_replication_auditor
Target version: -
Start date: 2019-03-12
Due date:
% Done: 0%
Story Points:
Sprint:

Description

The replication auditor currently audits only objects up to 1GB in size. There are currently 4 objects greater than 1TB in size and 3,588 objects greater than 1GB in size, both very small counts compared to the 2,769,111 objects less than 1GB in size in the network. Nonetheless, they should still be audited if feasible. The limiting factor is likely the HTTP timeout during the call to MN.getChecksum(). For reference, I'm seeing the following approximate times for calculating MD5 and SHA-1 checksums:

Size   MD5        SHA-1
----   -------    -------
1GB    00m02.5s   00m02.6s
10GB   00m25.9s   00m30.0s
100GB  03m28.0s   04m01.8s
1TB    50m14.2s   67m38.6s
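
As a rough cross-check, the timings above can be converted to throughput. This is only a sketch: it assumes decimal units (1GB = 1000MB, 1TB = 1000GB), which the table does not state, and the helper names are illustrative.

```python
# Timings copied from the table above: size in GB -> (minutes, seconds) per algorithm.
# Assumption: decimal units (1 GB = 1000 MB, 1 TB = 1000 GB); the ticket does not say.
TIMINGS = {
    1:    {"MD5": (0, 2.5),   "SHA-1": (0, 2.6)},
    10:   {"MD5": (0, 25.9),  "SHA-1": (0, 30.0)},
    100:  {"MD5": (3, 28.0),  "SHA-1": (4, 1.8)},
    1000: {"MD5": (50, 14.2), "SHA-1": (67, 38.6)},
}


def to_seconds(minutes, seconds):
    """Convert a (minutes, seconds) timing to seconds."""
    return minutes * 60 + seconds


def throughput_mb_per_s(size_gb, minutes, seconds):
    """Checksum throughput in MB/s for an object of size_gb gigabytes."""
    return size_gb * 1000 / to_seconds(minutes, seconds)


def slowest_rate():
    """Lowest throughput across every size and algorithm in the table."""
    return min(
        throughput_mb_per_s(size_gb, *timing)
        for size_gb, algorithms in TIMINGS.items()
        for timing in algorithms.values()
    )


# Print one row per size/algorithm with elapsed seconds and derived rate.
for size_gb, algorithms in TIMINGS.items():
    for name, (minutes, seconds) in algorithms.items():
        elapsed = to_seconds(minutes, seconds)
        rate = throughput_mb_per_s(size_gb, minutes, seconds)
        print(f"{size_gb:>5} GB  {name:<5}  {elapsed:8.1f} s  {rate:6.1f} MB/s")
```

Under that unit assumption, the slowest case is the 1TB SHA-1 run at roughly 246 MB/s, which gives a conservative figure for sizing timeouts.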

10GB and 100GB objects seem feasible if we set the HTTP client timeout to more than 5 minutes, whereas the few objects larger than 1TB may be challenging because a checksum takes roughly an hour to compute. The other factor is that the AbstractReplicationAuditor sets a default timeout of 60 seconds, and if the task future doesn't return within that time, the future gets cancelled. So both the HTTP timeout and this task timeout need to be increased, and coordinated with each other, in order to audit larger objects.
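
A minimal sketch of what coordinating the two timeouts could look like, written in Python for illustration rather than as the auditor's actual Java code. Everything here is an assumption: the 200 MB/s rate (chosen to sit below the roughly 246 MB/s implied by the 1TB SHA-1 timing, taking 1TB as 10^12 bytes), the safety factor, the overhead, and all function names are hypothetical.

```python
import concurrent.futures

# All constants below are assumptions for illustration, not auditor settings.
ASSUMED_BYTES_PER_SECOND = 200e6  # conservative checksum rate on the MN
SAFETY_FACTOR = 1.5               # headroom for a loaded MN or slow disk
MIN_HTTP_TIMEOUT = 60.0           # never go below the current 60 s default
TASK_OVERHEAD = 30.0              # extra seconds the task gets beyond the HTTP call


def http_timeout_seconds(size_bytes):
    """HTTP read timeout scaled to the object size, with a 60 s floor."""
    estimate = size_bytes / ASSUMED_BYTES_PER_SECOND * SAFETY_FACTOR
    return max(MIN_HTTP_TIMEOUT, estimate)


def task_timeout_seconds(size_bytes):
    """Task deadline; always longer than the HTTP timeout it wraps."""
    return http_timeout_seconds(size_bytes) + TASK_OVERHEAD


def audit_with_timeouts(fetch_checksum, size_bytes, executor):
    """Run fetch_checksum(http_timeout) under a coordinated task deadline.

    fetch_checksum is a hypothetical callable standing in for the
    MN.getChecksum() request. Returns its result, or None on timeout.
    """
    future = executor.submit(fetch_checksum, http_timeout_seconds(size_bytes))
    try:
        return future.result(timeout=task_timeout_seconds(size_bytes))
    except concurrent.futures.TimeoutError:
        # cancel() only prevents a task that has not started yet; a call that
        # is already running ends when its own HTTP timeout expires.
        future.cancel()
        return None


def fake_fetch(http_timeout):
    """Stand-in for the remote call; returns a placeholder checksum string."""
    return "placeholder-checksum"


with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
    print(http_timeout_seconds(10**11), task_timeout_seconds(10**11))
    print(audit_with_timeouts(fake_fetch, 10**11, pool))
```

The point of deriving the task deadline from the HTTP timeout is that the future can never be cancelled while the HTTP call is still legitimately waiting on the MN.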

History

#1 Updated by Chris Jones about 5 years ago

  • Description updated (diff)

#2 Updated by Dave Vieglais about 5 years ago

We need to be smarter about verifying content. It is prohibitive for the CN to go around checking millions of objects, and that approach won't scale. Perhaps MNs should be responsible for ensuring their copy is accurate according to the checksum reported by the authoritative MN or the CN?
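
One way to read this suggestion, sketched in Python: the MN reads its own replica locally in chunks and compares the result with the checksum it was given. This is a hedged illustration only; the function names are hypothetical and nothing here is an existing DataONE API.

```python
import hashlib
import io


def stream_checksum(stream, algorithm="MD5", chunk_size=1024 * 1024):
    """Hex checksum of a binary stream, read in fixed-size chunks.

    Chunked reads keep memory use flat regardless of object size.
    Names such as "SHA-1" are normalized to hashlib's "sha1".
    """
    digest = hashlib.new(algorithm.replace("-", "").lower())
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def replica_matches(stream, expected_hex, algorithm="MD5"):
    """True if the local replica's checksum equals the expected value."""
    return stream_checksum(stream, algorithm) == expected_hex.strip().lower()


# Demo with in-memory bytes standing in for a replica file on the MN.
replica_bytes = b"example replica bytes"
expected = hashlib.sha1(replica_bytes).hexdigest()  # stands in for the reported checksum
print(replica_matches(io.BytesIO(replica_bytes), expected, "SHA-1"))
```

On a real MN the stream would come from `open(path, "rb")`, so no HTTP timeout is involved in the checksum itself.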
