Story #2622: d1_replication should prioritize MN replication tasks based on load, failures, and bandwidth factors - Infrastructure - DataONE Tasks

Story #2622

d1_replication should prioritize MN replication tasks based on load, failures, and bandwidth factors

Added by Chris Jones almost 13 years ago. Updated over 12 years ago.

Status:

Closed

Priority:

Normal

Assignee:

Chris Jones

Category:

d1_replication

Target version:

Sprint-2012.19-Block.3.2

Start date:

2012-04-20

Due date:

% Done:

100%

Story Points:

Sprint:

Description

The current implementation of ReplicationManager evaluates ReplicationPolicies for objects and prioritizes target member nodes (MNs) based only on the preferred and blocked lists. Otherwise, all MNs are treated as equals. We've seen performance problems on individual MNs that tend to cause performance problems on the CNs when MNReplicationTasks are sent to the ExecutorService. Task execution appears to slow down significantly when threads in the thread pool are held up by non-performant MNs.

To alleviate this, we need to more intelligently evaluate the capabilities of an MN as a target, and prioritize targets that are performant.

The strategy is to (1) make MN implementations more resilient (i.e., queue replicate() requests), and (2) throttle requests to MNs based on several performance metrics. These are outlined at:

http://epad.dataone.org/20120420-replication-priority-queue

Initially we will throttle based only on a limit on pending replication requests; later we will also evaluate the failure factor and the bandwidth factor.
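As a rough illustration of the first step (throttling on a pending-request limit), here is a minimal Java sketch. It is not the d1_replication implementation: the class name PendingRequestThrottle, its methods, the per-node limit, and the use of plain node identifier strings are all assumptions made for the example. The idea is that a target MN is skipped once its count of pending replication requests reaches the limit, and the remaining candidates are ordered so that the least-loaded node comes first.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper for illustration only; not part of d1_replication.
class PendingRequestThrottle {

    // Assumed policy: one fixed limit applied to every member node.
    private final int maxPendingPerNode;
    private final ConcurrentMap<String, AtomicInteger> pending = new ConcurrentHashMap<>();

    PendingRequestThrottle(int maxPendingPerNode) {
        if (maxPendingPerNode < 1) {
            throw new IllegalArgumentException("limit must be at least 1");
        }
        this.maxPendingPerNode = maxPendingPerNode;
    }

    // Reserve a slot for a new replication request; false means the node is at its limit.
    boolean tryAcquire(String nodeId) {
        AtomicInteger count = pending.computeIfAbsent(nodeId, id -> new AtomicInteger());
        while (true) {
            int current = count.get();
            if (current >= maxPendingPerNode) {
                return false;
            }
            if (count.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    // Call when a request completes or fails, so the slot becomes available again.
    void release(String nodeId) {
        AtomicInteger count = pending.get(nodeId);
        if (count != null) {
            count.updateAndGet(c -> c > 0 ? c - 1 : 0);
        }
    }

    int pendingFor(String nodeId) {
        AtomicInteger count = pending.get(nodeId);
        return count == null ? 0 : count.get();
    }

    // Drop candidates at the limit and order the rest by pending count, lowest first.
    // Counts are snapshotted first so the ordering stays consistent during the sort;
    // the sort is stable, so ties keep the caller's original order.
    List<String> prioritize(List<String> candidates) {
        Map<String, Integer> snapshot = new HashMap<>();
        List<String> eligible = new ArrayList<>();
        for (String nodeId : candidates) {
            int count = pendingFor(nodeId);
            if (count < maxPendingPerNode) {
                snapshot.put(nodeId, count);
                eligible.add(nodeId);
            }
        }
        eligible.sort((a, b) -> Integer.compare(snapshot.get(a), snapshot.get(b)));
        return eligible;
    }
}
```

In this sketch a caller would invoke tryAcquire() before submitting a task for a target node and release() when the task finishes. Because ties keep the input order, a candidate list that is already ordered by the preferred list would retain that ordering among equally loaded nodes. The failure and bandwidth factors are left out here.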

Subtasks