Story #8447: synchronization queue equity and monitoring - Infrastructure - DataONE Tasks

Story #8447

synchronization queue equity and monitoring

Added by Rob Nahf almost 7 years ago. Updated almost 7 years ago.

Status: Closed
Priority: Normal
Assignee: Rob Nahf
Category: d1_synchronization
Target version: CCI-2.3.9
Start date: 2018-03-01
Due date:
% Done: 100%
Story Points:
Sprint: Infrastructure backlog

Description

The recent initial sync from a new, large member node (with 409k items) brought out some equity issues with synchronization: the queue is so long that new sync tasks can be 2 days back in line.

We configured the queue to be much shorter (about an hour's worth of tasks), but maintaining an equitable sync priority might be challenging for nodes that synchronize less frequently than the one with a large number of sync tasks.

Ideally, a multi-queue system would be adopted, with tasks "fanning-in" from each queue. This could be complicated to develop, however.
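As a rough illustration of the fan-in idea, the sketch below takes one task from each per-node queue in turn, so a node with a short queue is not stuck behind a node with a very long one. It is an in-memory toy, not the synchronization code: the function name `fan_in`, the node names, and the use of `collections.deque` are all assumptions made for the example.

```python
from collections import deque
from typing import Deque, Dict, Iterator, Tuple


def fan_in(queues: Dict[str, Deque[str]]) -> Iterator[Tuple[str, str]]:
    """Yield (node_id, task), taking one task per non-empty queue per round."""
    # Keep cycling over the nodes until every per-node queue is empty.
    while any(queues.values()):
        for node_id, node_queue in queues.items():
            if node_queue:
                yield node_id, node_queue.popleft()


# Hypothetical example: one large harvest and one small harvest.
queues = {
    "big": deque(["b1", "b2", "b3"]),
    "small": deque(["s1"]),
}
order = list(fan_in(queues))
```

Here the small node's only task is processed second rather than fourth, which is the equity property the multi-queue design is after.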

Other quicker improvements include:
1. limiting the max harvest to a percent of remaining capacity, instead of a fixed number. This might help get around the sync-frequency problem.
2. adding a metric that identifies the leading position of a member node's items.
3. limiting a node's harvest by sync frequency - we'd have to be able to convert the Node synchronization schedule into a frequency.
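Items 1 and 2 can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the `fraction` parameter, and the representation of the queue as a sequence of `(node_id, pid)` pairs are hypothetical and are not taken from the synchronization code.

```python
from typing import Optional, Sequence, Tuple


def harvest_limit(queue_capacity: int, queued: int, fraction: float) -> int:
    """Item 1: cap a harvest at a fraction of the queue's remaining capacity."""
    remaining = max(queue_capacity - queued, 0)
    return int(remaining * fraction)


def leading_position(queue: Sequence[Tuple[str, str]], node_id: str) -> Optional[int]:
    """Item 2: index of the node's first queued item, or None if it has none."""
    for index, (task_node, _pid) in enumerate(queue):
        if task_node == node_id:
            return index
    return None


# Hypothetical numbers: a 10,000-slot queue, each harvest capped at 50% of what is left.
first = harvest_limit(10_000, 4_000, 0.5)           # 6,000 slots remain
second = harvest_limit(10_000, 4_000 + first, 0.5)  # 3,000 slots remain

# Hypothetical queue contents for the position metric.
queue = [("mnA", "pid1"), ("mnA", "pid2"), ("mnB", "pid3")]
position_b = leading_position(queue, "mnB")
```

Because each harvest takes only a fraction of the remaining capacity, a single node cannot fill the queue in one pass, so later harvests from other nodes still find room.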

Related issues

Associated revisions

Revision 19161
Added by Rob Nahf almost 7 years ago

refs: #8447, #8466. Addressed unit test failures. Completed code changes, and ready for integration tests.

History

#1 Updated by Rob Nahf almost 7 years ago

Hazelcast doesn't support dynamic queue creation, so a Map of per-node queues isn't possible. We could create a bunch of anonymous sync-queues, and create a Map that maps a Node to one of the anonymous ones. This introduces a scalability issue when the number of Nodes exceeds the number of queues.

A simpler idea is to create a syncPriorityQueue for harvests under a configurable size and a syncBulkQueue for harvests over it.
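The two-queue idea amounts to a routing decision at harvest time. The sketch below shows only that decision, using plain in-memory deques in place of the distributed queues. The threshold value, the function name `route_harvest`, and the `(node_id, pid)` task shape are assumptions for illustration, and how the two queues are drained relative to each other is not specified here.

```python
from collections import deque
from typing import Deque, List, Tuple

Task = Tuple[str, str]

# Hypothetical configurable size separating small harvests from bulk ones.
BULK_THRESHOLD = 1000


def route_harvest(
    node_id: str,
    pids: List[str],
    sync_priority_queue: Deque[Task],
    sync_bulk_queue: Deque[Task],
    threshold: int = BULK_THRESHOLD,
) -> Deque[Task]:
    """Put the whole harvest on the bulk queue if it exceeds the threshold."""
    target = sync_bulk_queue if len(pids) > threshold else sync_priority_queue
    target.extend((node_id, pid) for pid in pids)
    return target


priority: Deque[Task] = deque()
bulk: Deque[Task] = deque()
route_harvest("mnSmall", ["p1", "p2"], priority, bulk)
route_harvest("mnLarge", [f"p{i}" for i in range(1500)], priority, bulk)
```

With this split, a small harvest never queues behind the backlog of a large one, which sidesteps the Nodes-versus-queues scaling problem of the per-node mapping.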

#2 Updated by Rob Nahf almost 7 years ago

Related to Bug #8468: synchronization requeueing for temporary unavailability of nodeComms causes massive delays for package added

#3 Updated by Dave Vieglais almost 7 years ago

Target version changed from CCI-2.3.8 to CCI-2.3.9

#4 Updated by Rob Nahf almost 7 years ago

Have sync running in sandbox, and demonstrated that a second harvest starts getting processed right away.

From the process-metric log below, you can see that each MN is being drained at nearly the same rate: over the one-minute interval, the count for mnTestNCEI drops by 42 and the count for mnSandboxUCSB1 drops by 41.

{"event":"synchronization queued","message":"Total Sync Objects Queued: 3765","threadName":"SynchronizationQuartzScheduler_Worker-2","threadId":61,"dateLogged":"2018-03-16T04:19:51.066+00:00"}
{"event":"synchronization queued","nodeId":"legacy","message":"Sync Objects Queued: 0","threadName":"SynchronizationQuartzScheduler_Worker-2","threadId":61,"dateLogged":"2018-03-16T04:19:51.066+00:00"}
{"event":"synchronization queued","nodeId":"urn:node:mnTestNCEI","message":"Sync Objects Queued: 3282","threadName":"SynchronizationQuartzScheduler_Worker-2","threadId":61,"dateLogged":"2018-03-16T04:19:51.066+00:00"}
{"event":"synchronization queued","nodeId":"urn:node:mnSandboxUCSB1","message":"Sync Objects Queued: 482","threadName":"SynchronizationQuartzScheduler_Worker-2","threadId":61,"dateLogged":"2018-03-16T04:19:51.066+00:00"}
{"event":"synchronization queued","message":"Total Sync Objects Queued: 3682","threadName":"SynchronizationQuartzScheduler_Worker-3","threadId":62,"dateLogged":"2018-03-16T04:20:51.071+00:00"}
{"event":"synchronization queued","nodeId":"legacy","message":"Sync Objects Queued: 0","threadName":"SynchronizationQuartzScheduler_Worker-3","threadId":62,"dateLogged":"2018-03-16T04:20:51.071+00:00"}
{"event":"synchronization queued","nodeId":"urn:node:mnTestNCEI","message":"Sync Objects Queued: 3240","threadName":"SynchronizationQuartzScheduler_Worker-3","threadId":62,"dateLogged":"2018-03-16T04:20:51.071+00:00"}
{"event":"synchronization queued","nodeId":"urn:node:mnSandboxUCSB1","message":"Sync Objects Queued: 441","threadName":"SynchronizationQuartzScheduler_Worker-3","threadId":62,"dateLogged":"2018-03-16T04:20:51.071+00:00"}
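The per-node drain can be checked directly from these records. The sketch below parses the per-node lines and reports how much each node's queued count dropped between the first and last sample. The records are copied from the log above with the thread fields omitted for brevity, and the function name `drained_per_node` is made up for this example.

```python
import json
import re
from collections import defaultdict

# Per-node records from the log above (threadName and threadId omitted).
LOG_LINES = [
    '{"event":"synchronization queued","nodeId":"urn:node:mnTestNCEI","message":"Sync Objects Queued: 3282","dateLogged":"2018-03-16T04:19:51.066+00:00"}',
    '{"event":"synchronization queued","nodeId":"urn:node:mnSandboxUCSB1","message":"Sync Objects Queued: 482","dateLogged":"2018-03-16T04:19:51.066+00:00"}',
    '{"event":"synchronization queued","nodeId":"urn:node:mnTestNCEI","message":"Sync Objects Queued: 3240","dateLogged":"2018-03-16T04:20:51.071+00:00"}',
    '{"event":"synchronization queued","nodeId":"urn:node:mnSandboxUCSB1","message":"Sync Objects Queued: 441","dateLogged":"2018-03-16T04:20:51.071+00:00"}',
]


def drained_per_node(lines):
    """Return {node_id: first queued count minus last queued count}."""
    samples = defaultdict(list)
    for line in lines:
        record = json.loads(line)
        node_id = record.get("nodeId")
        if node_id is None:
            continue  # records for the overall total carry no nodeId
        match = re.search(r"Queued: (\d+)", record["message"])
        if match:
            samples[node_id].append((record["dateLogged"], int(match.group(1))))
    drained = {}
    for node_id, points in samples.items():
        points.sort()  # timestamps share one format, so string order is time order
        drained[node_id] = points[0][1] - points[-1][1]
    return drained


drained = drained_per_node(LOG_LINES)
```

For the two samples shown, this gives a drop of 42 for mnTestNCEI and 41 for mnSandboxUCSB1.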

#5 Updated by Rob Nahf almost 7 years ago

% Done changed from 0 to 100
Status changed from New to Closed

Tested and deployed on March 21.
