Story #2622
d1_replication should prioritize MN replication tasks based on load, failures, and bandwidth factors
100%
Description
The current implementation of ReplicationManager evaluates ReplicationPolicies for objects and prioritizes target member nodes (MNs) only by the preferred and blocked lists. Otherwise, all MNs are treated as equals. We've seen performance problems on individual MNs that tend to cause performance problems on the CNs when MNReplicationTasks are submitted to the ExecutorService. Task execution appears to slow down significantly when threads in the thread pool are held up by non-performant MNs.
To alleviate this, we need to evaluate the capabilities of an MN as a target more intelligently, and prioritize targets that are performant.
The strategy is to 1) make MN implementations more resilient (i.e., queue replicate() requests), and 2) throttle requests to MNs based on a few different performance metrics. These are outlined at:
http://epad.dataone.org/20120420-replication-priority-queue
At first we will only throttle based on a limit of pending replication requests, but down the road will also evaluate the failure factor and the bandwidth factor.
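As a rough illustration of the first step (throttling on a limit of pending replication requests), here is a minimal sketch. It is not the ReplicationManager code: the class PendingRequestThrottle, its method names, and the idea of keying counts by a node identifier string are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical helper (not part of d1_replication): tracks how many
 * replication requests are pending per MN and refuses new ones once
 * a configured limit is reached.
 */
class PendingRequestThrottle {
    private final int maxPending; // assumed per-MN limit
    private final Map<String, AtomicInteger> pending = new ConcurrentHashMap<>();

    PendingRequestThrottle(int maxPending) {
        this.maxPending = maxPending;
    }

    /** Reserve a slot for nodeId; returns false if the node is at its limit. */
    boolean tryAcquire(String nodeId) {
        AtomicInteger count = pending.computeIfAbsent(nodeId, k -> new AtomicInteger());
        while (true) {
            int current = count.get();
            if (current >= maxPending) {
                return false;
            }
            // Retry if another thread changed the count in the meantime.
            if (count.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    /** Release a slot when a request completes or fails; never drops below zero. */
    void release(String nodeId) {
        AtomicInteger count = pending.get(nodeId);
        if (count != null) {
            count.updateAndGet(c -> c > 0 ? c - 1 : 0);
        }
    }

    /** Number of requests currently counted as pending for nodeId. */
    int pendingCount(String nodeId) {
        AtomicInteger count = pending.get(nodeId);
        return count == null ? 0 : count.get();
    }

    /** Candidates still under the limit, in their original order. */
    List<String> eligibleTargets(List<String> candidates) {
        List<String> eligible = new ArrayList<>();
        for (String nodeId : candidates) {
            if (pendingCount(nodeId) < maxPending) {
                eligible.add(nodeId);
            }
        }
        return eligible;
    }
}
```

In this sketch a caller would filter the candidate list with eligibleTargets() after applying the preferred and blocked lists, call tryAcquire() before submitting a task to the executor, and call release() when the task finishes. The failure and bandwidth factors are left out, as in the first phase described above.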
History
#1 Updated by Matthew Jones over 12 years ago
- Tracker changed from Task to Story
#2 Updated by Dave Vieglais over 12 years ago
- Position set to 1
- Target version changed from Sprint-2012.15-Block.2.4 to Sprint-2012.17-Block.3.1
#3 Updated by Dave Vieglais over 12 years ago
- Position deleted (6)
- Target version changed from Sprint-2012.17-Block.3.1 to Sprint-2012.19-Block.3.2
- Position set to 2
#4 Updated by Chris Jones over 12 years ago
- Status changed from New to Closed
This is implemented except for the bandwidth factors. We'll need to decide how each MN is scored based on bandwidth (calculated or assigned).