d1_replication should prioritize MN replication tasks based on load, failures, and bandwidth factors
The current implementation of ReplicationManager evaluates ReplicationPolicies for objects and prioritizes target member nodes (MNs) only by the preferred and blocked lists. Otherwise, all MNs are treated as equals. We've seen performance problems on individual MNs that in turn cause performance problems on the CNs when MNReplicationTasks are sent to the ExecutorService. Task execution appears to slow down significantly when threads in the thread pool are held up by non-performant MNs.
To alleviate this, we need to evaluate the capabilities of each MN as a target more intelligently, and prioritize targets that are performant.
The strategy is to 1) make MN implementations more resilient (i.e. queue replicate() requests), and 2) throttle requests to MNs based on a few different performance metrics. These are outlined at:
At first, we will throttle based only on a limit on pending replication requests; later, we will also evaluate the failure factor and the bandwidth factor.
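As a rough illustration of the first step (throttling on a pending-request limit), here is a minimal sketch. It assumes the limit is tracked per MN and that the CN can call a hook when a request is issued and when it completes or fails. The class name PendingRequestThrottle, its methods, and the node identifiers are hypothetical and are not part of d1_replication.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical helper (not an existing d1_replication class): caps the number
 * of pending replication requests per member node.
 */
class PendingRequestThrottle {

    // Assumed configuration value: maximum pending requests allowed per MN.
    private final int maxPendingPerNode;

    // Pending request counts keyed by MN identifier.
    private final ConcurrentMap<String, AtomicInteger> pending = new ConcurrentHashMap<>();

    PendingRequestThrottle(int maxPendingPerNode) {
        if (maxPendingPerNode < 1) {
            throw new IllegalArgumentException("maxPendingPerNode must be >= 1");
        }
        this.maxPendingPerNode = maxPendingPerNode;
    }

    /**
     * Reserves a slot for the given MN. Returns false when the MN is already
     * at its limit, in which case the caller should pick another target or
     * defer the task instead of tying up a pool thread.
     */
    boolean tryAcquire(String nodeId) {
        AtomicInteger count = pending.computeIfAbsent(nodeId, id -> new AtomicInteger());
        while (true) {
            int current = count.get();
            if (current >= maxPendingPerNode) {
                return false;
            }
            // Retry if another thread changed the count in the meantime.
            if (count.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    /** Frees a slot once the request has completed or failed; never goes below zero. */
    void release(String nodeId) {
        AtomicInteger count = pending.get(nodeId);
        if (count != null) {
            count.updateAndGet(c -> c > 0 ? c - 1 : 0);
        }
    }

    /** Current number of pending requests recorded for the given MN. */
    int pendingCount(String nodeId) {
        AtomicInteger count = pending.get(nodeId);
        return count == null ? 0 : count.get();
    }
}
```

In this sketch the caller would check tryAcquire() before submitting a task for a target MN and call release() when that task finishes. The failure and bandwidth factors could later feed into the same decision, but they are left out here because their definitions are not given in this ticket.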