Replication tasks can contain stale potential target nodes
When a replication task is created, a list of potential replication targets is attached to it. Some time later, the task is executed (mn.replicate() requests are sent).
If a Member Node (MN) turns off replication between task creation and task execution, the mn.replicate() request will still be sent. Even though the MN should protect itself from unwanted requests, it seems bad form for the CN to send the request after the MN has told the CN not to.
ReplicationTaskQueue seems to be the class calling replicate. It should be easy enough to copy the original logic in ReplicationManager that checked the suitability of the potential member node so that the check can be repeated right before the call is made.
We should probably also recheck the systemMetadata, in case that's changed too.
Before doing anything, confirm that the tasks are potentially long-lived.
(Also, the NodeList is cached, so we already allow for 3 minutes of being out of date)
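A minimal sketch of the proposed re-check, run right before the requests go out. All class and field names here (QueuedReplication, NodeSnapshot, StaleTargetCheck) are hypothetical stand-ins, not the real ReplicationTaskQueue or ReplicationManager types, and the two maps are assumed to come from a fresh NodeList read and a fresh systemMetadata read.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a queued pid+target replication task. */
class QueuedReplication {
    final String pid;
    final String targetNodeId;

    QueuedReplication(String pid, String targetNodeId) {
        this.pid = pid;
        this.targetNodeId = targetNodeId;
    }
}

/** Hypothetical snapshot of a member node taken from a freshly read NodeList. */
class NodeSnapshot {
    final boolean up;
    final boolean replicationEnabled;

    NodeSnapshot(boolean up, boolean replicationEnabled) {
        this.up = up;
        this.replicationEnabled = replicationEnabled;
    }
}

class StaleTargetCheck {
    /**
     * Returns only the tasks that are still valid to send: the target node
     * is known, up, and replicating, and the pid's current systemMetadata
     * (summarised here as a boolean per pid, an assumption) still allows
     * replication.
     */
    static List<QueuedReplication> stillSendable(
            List<QueuedReplication> queued,
            Map<String, NodeSnapshot> nodesById,
            Map<String, Boolean> replicationAllowedByPid) {
        List<QueuedReplication> sendable = new ArrayList<>();
        for (QueuedReplication task : queued) {
            NodeSnapshot node = nodesById.get(task.targetNodeId);
            boolean nodeOk = node != null && node.up && node.replicationEnabled;
            boolean policyOk = Boolean.TRUE.equals(replicationAllowedByPid.get(task.pid));
            if (nodeOk && policyOk) {
                sendable.add(task);
            }
        }
        return sendable;
    }
}
```

Tasks that fail the check would presumably be dropped or returned to the pid-level queue rather than sent; which of the two is right depends on how long-lived the tasks turn out to be.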
#3 Updated by Rob Nahf over 4 years ago
The check for potential target nodes happens in a private method under ReplicationManager.createAndQueueTasks(Identifier pid). It seems to happen near the time of the request, but there is a shift from direct execution of mn.replicate to submitting replicate requests by targetMN. Tasks can be for pids (make more replicas) or for pid+targetNode (call mn.replicate).
in the Replication repository:
pid tasks can be: NEW or IN_PROCESS
in the object's systemMetadata:
pid+target tasks can be: QUEUED, REQUESTED, FAILED, COMPLETED
ReplicationEventListener.entryUpdated() / .entryAdded()
(a Hazelcast systemMetadata map listener)
- if the pid's authoritativeMN is listed as a replica with status COMPLETED
create new replica task in the task repository unless there already is one
(if there is more than one task for the pid, delete the existing ones and submit a new one)
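The listener rule above can be sketched roughly as follows. PidTaskRepository and ListenerSketch are hypothetical in-memory stand-ins, not the real task repository or ReplicationEventListener, and the replica statuses are passed in as a plain node-to-status map for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory stand-in for the replication task repository. */
class PidTaskRepository {
    // pid -> ids of the pid-level tasks currently stored for it
    private final Map<String, List<Long>> tasksByPid = new HashMap<>();
    private long nextId = 1;

    /** Test helper: store a task unconditionally, so duplicates can be simulated. */
    void addRaw(String pid) {
        tasksByPid.computeIfAbsent(pid, k -> new ArrayList<>()).add(nextId++);
    }

    int taskCount(String pid) {
        List<Long> tasks = tasksByPid.get(pid);
        return tasks == null ? 0 : tasks.size();
    }

    /** Leaves exactly one task for the pid; returns true if a new one was submitted. */
    boolean ensureSingleTask(String pid) {
        List<Long> existing = tasksByPid.computeIfAbsent(pid, k -> new ArrayList<>());
        if (existing.size() == 1) {
            return false;          // already has its one task, nothing to do
        }
        existing.clear();          // zero tasks, or duplicates: start over
        existing.add(nextId++);    // submit a single fresh task
        return true;
    }
}

class ListenerSketch {
    /** Acts only when the authoritative MN holds a replica in the COMPLETED state. */
    static boolean onSysmetaEvent(PidTaskRepository repo, String pid,
                                  String authoritativeMN,
                                  Map<String, String> replicaStatusByNode) {
        if (!"COMPLETED".equals(replicaStatusByNode.get(authoritativeMN))) {
            return false;
        }
        return repo.ensureSingleTask(pid);
    }
}
```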
(triggered every two minutes; trigger set up in ReplicationManager)
move a page of tasks from NEW to IN_PROCESS status
- call createAndQueueTasks (pid)
removeReplicationTasks if sysmeta disallows replication or the number of replicas is sufficient
determine potential target nodes (from NodeList and sysmeta)
loop to create a few per-target-node tasks
requeueReplicationTask(pid) #return the pid to the NEW state because we don't assume the requisite number of replicas have been created
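A rough sketch of one pass over a single pid task, following the steps above. The types are hypothetical, the candidate list is assumed to be computed already from the NodeList and sysmeta, and "a few" per-target tasks is interpreted here as the number of replicas still missing, which is an assumption rather than what createAndQueueTasks necessarily does.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

enum PidTaskState { NEW, IN_PROCESS }

/** Hypothetical pid-level task as stored in the task repository. */
class PidTask {
    final String pid;
    PidTaskState state = PidTaskState.NEW;

    PidTask(String pid) {
        this.pid = pid;
    }
}

/** Result of one pass: drop the pid task, or queue these targets and requeue it. */
class QueueOutcome {
    final boolean removePidTask;
    final List<String> queuedTargets;

    QueueOutcome(boolean removePidTask, List<String> queuedTargets) {
        this.removePidTask = removePidTask;
        this.queuedTargets = queuedTargets;
    }
}

class CreateAndQueueSketch {
    static QueueOutcome createAndQueue(PidTask task, boolean replicationAllowed,
                                       int desiredReplicas, int existingReplicas,
                                       List<String> candidateTargets) {
        task.state = PidTaskState.IN_PROCESS;   // picked up from the NEW page

        // Nothing left to do: the pid task should be removed.
        if (!replicationAllowed || existingReplicas >= desiredReplicas) {
            return new QueueOutcome(true, Collections.<String>emptyList());
        }

        // Create per-target tasks for as many candidates as replicas are missing.
        int needed = desiredReplicas - existingReplicas;
        int count = Math.min(needed, candidateTargets.size());
        List<String> chosen = new ArrayList<>(candidateTargets.subList(0, count));

        // Requeue: we do not assume the queued replicas will actually be created.
        task.state = PidTaskState.NEW;
        return new QueueOutcome(false, chosen);
    }
}
```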
(lock the targetNode)
get replication tasks in the QUEUED state for this target node
foreach task, requestReplication (ReplicationService.requestQueuedReplication(pid, target))
(unlock the targetNode)
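The per-target step can be sketched as below. This uses a local ReentrantLock per target node purely for illustration; the real lock is presumably cluster-wide, and TargetQueueProcessor and its callback are hypothetical names. The callback stands in for ReplicationService.requestQueuedReplication(pid, target), and would be the natural place to repeat the target suitability check before the request is made.

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.BiConsumer;

class TargetQueueProcessor {
    // One lock per target node id (local stand-in for the real target-node lock).
    private final ConcurrentHashMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    /**
     * Sends a replication request for every pid QUEUED for this target,
     * holding the target's lock for the duration. Returns the number sent.
     */
    int processTarget(String targetNodeId, List<String> queuedPids,
                      BiConsumer<String, String> requestReplication) {
        ReentrantLock lock = locks.computeIfAbsent(targetNodeId, k -> new ReentrantLock());
        lock.lock();                       // (lock the targetNode)
        try {
            int sent = 0;
            for (String pid : queuedPids) {
                requestReplication.accept(pid, targetNodeId);
                sent++;
            }
            return sent;
        } finally {
            lock.unlock();                 // (unlock the targetNode), even on failure
        }
    }
}
```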