Story #8447
synchronization queue equity and monitoring
100%
Description
The recent initial sync from a new, large member node (409k items) exposed some equity issues with synchronization: the queue becomes so long that new sync tasks can be as much as 2 days back in line.
We configured the queue to be much shorter (about an hour's worth of tasks), but maintaining an equitable sync priority might be challenging for nodes that synchronize less frequently than the one with a large number of sync tasks.
Ideally, a multi-queue system would be adopted, with tasks "fanning-in" from each queue. However, this could be complicated to develop.
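To make the fan-in idea concrete, here is a minimal Python sketch, not the Hazelcast-based implementation. It assumes one FIFO queue per node and a round-robin pass that takes at most one task from each non-empty queue. The class name `FanInQueues`, its methods, and the node ids are hypothetical.

```python
from collections import OrderedDict, deque


class FanInQueues:
    """Hypothetical per-node queues drained round-robin (fan-in)."""

    def __init__(self):
        # node id -> FIFO of pending sync tasks, kept in first-seen node order
        self._queues = OrderedDict()

    def put(self, node_id, task):
        """Append a task to the queue that belongs to node_id."""
        self._queues.setdefault(node_id, deque()).append(task)

    def take_round(self):
        """Take at most one task from each non-empty node queue."""
        batch = []
        for node_id, queue in self._queues.items():
            if queue:
                batch.append((node_id, queue.popleft()))
        return batch


# Illustrative node ids: one node with a large backlog, one with a single task.
queues = FanInQueues()
for i in range(5):
    queues.put("urn:node:big", f"big-{i}")
queues.put("urn:node:small", "small-0")

first_round = queues.take_round()
second_round = queues.take_round()
```

In this sketch the small node's only task is served in the first round even though the large node queued five tasks ahead of it, which is the equity property the multi-queue design is after.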
Other, quicker improvements include:
1. limiting the max harvest to a percentage of the queue's remaining capacity, instead of a fixed number. This might help get around the sync-frequency problem.
2. adding a metric that identifies the leading position in the queue of a member node's items.
3. limiting a node's harvest by sync frequency. For this we'd have to be able to convert the Node synchronization schedule into a frequency.
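Items 1 and 2 can be sketched in a few lines of Python. This is an illustration under assumptions, not the deployed code: the function names, the `(node_id, task)` tuple layout, and the example numbers are all hypothetical.

```python
def max_harvest(queue_capacity, queued, percent):
    """Cap a harvest at a percentage of the queue's remaining capacity."""
    remaining = max(queue_capacity - queued, 0)
    # Integer arithmetic avoids floating-point rounding surprises.
    return remaining * percent // 100


def leading_position(queue, node_id):
    """Return the 0-based position of the node's first queued item, or None."""
    for position, (owner, _task) in enumerate(queue):
        if owner == node_id:
            return position
    return None


# Hypothetical numbers: a 1000-slot queue with 600 tasks already queued.
first_limit = max_harvest(1000, 600, 25)                  # 25% of 400 free slots
second_limit = max_harvest(1000, 600 + first_limit, 25)   # limit shrinks as the queue fills

# Hypothetical queue contents as (node_id, task) pairs.
queue = [("urn:node:A", "t1"), ("urn:node:A", "t2"), ("urn:node:B", "t3")]
position_b = leading_position(queue, "urn:node:B")
```

Because the limit is recomputed from the remaining capacity, a node that harvests often gets progressively smaller allowances as the queue fills, leaving room for nodes that synchronize less frequently.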
History
#1 Updated by Rob Nahf almost 7 years ago
Hazelcast doesn't support dynamic queue creation, so a Map of per-Node queues isn't possible. We could create a fixed pool of anonymous sync-queues, and create a Map that maps a Node to one of the anonymous ones. This introduces a scalability issue when the number of Nodes exceeds the number of queues.
A simpler idea is to create a syncPriorityQueue and a syncBulkQueue, for harvests under and over a configurable size, respectively.
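A rough Python sketch of the two-queue idea follows. The queue names mirror syncPriorityQueue and syncBulkQueue from the comment above; everything else is an assumption made for illustration: the threshold value, the choice to send harvests of exactly the threshold size to the priority queue, and the policy of draining the priority queue first (which could starve the bulk queue under constant priority load).

```python
from collections import deque

BULK_THRESHOLD = 1000  # hypothetical configurable cutoff, in tasks per harvest

sync_priority_queue = deque()
sync_bulk_queue = deque()


def enqueue_harvest(node_id, tasks, threshold=BULK_THRESHOLD):
    """Route a whole harvest to the bulk or priority queue by its size."""
    # Assumption: only harvests strictly larger than the threshold count as bulk.
    target = sync_bulk_queue if len(tasks) > threshold else sync_priority_queue
    target.extend((node_id, task) for task in tasks)
    return target


def next_task():
    """Assumed drain policy: serve the priority queue first, then bulk."""
    if sync_priority_queue:
        return sync_priority_queue.popleft()
    if sync_bulk_queue:
        return sync_bulk_queue.popleft()
    return None


# Illustrative node ids: a large initial harvest arrives before a small one.
enqueue_harvest("urn:node:big", [f"big-{i}" for i in range(1500)])
enqueue_harvest("urn:node:small", ["small-0", "small-1", "small-2"])

first = next_task()
```

Here the small harvest is served first even though the large one was queued earlier.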
#2 Updated by Rob Nahf almost 7 years ago
- Related to Bug #8468: synchronization requeueing for temporary unavailability of nodeComms causes massive delays for package added
#3 Updated by Dave Vieglais almost 7 years ago
- Target version changed from CCI-2.3.8 to CCI-2.3.9
#4 Updated by Rob Nahf almost 7 years ago
Have sync running in sandbox, and demonstrated that a second harvest starts getting processed right away.
From the process-metric log below, you can see that each MN is being drained at nearly the same rate: between the two samples, taken one minute apart, the queued totals drop by 42 for urn:node:mnTestNCEI and by 41 for urn:node:mnSandboxUCSB1.
{"event":"synchronization queued","message":"Total Sync Objects Queued: 3765","threadName":"SynchronizationQuartzScheduler_Worker-2","threadId":61,"dateLogged":"2018-03-16T04:19:51.066+00:00"}
{"event":"synchronization queued","nodeId":"legacy","message":"Sync Objects Queued: 0","threadName":"SynchronizationQuartzScheduler_Worker-2","threadId":61,"dateLogged":"2018-03-16T04:19:51.066+00:00"}
{"event":"synchronization queued","nodeId":"urn:node:mnTestNCEI","message":"Sync Objects Queued: 3282","threadName":"SynchronizationQuartzScheduler_Worker-2","threadId":61,"dateLogged":"2018-03-16T04:19:51.066+00:00"}
{"event":"synchronization queued","nodeId":"urn:node:mnSandboxUCSB1","message":"Sync Objects Queued: 482","threadName":"SynchronizationQuartzScheduler_Worker-2","threadId":61,"dateLogged":"2018-03-16T04:19:51.066+00:00"}
{"event":"synchronization queued","message":"Total Sync Objects Queued: 3682","threadName":"SynchronizationQuartzScheduler_Worker-3","threadId":62,"dateLogged":"2018-03-16T04:20:51.071+00:00"}
{"event":"synchronization queued","nodeId":"legacy","message":"Sync Objects Queued: 0","threadName":"SynchronizationQuartzScheduler_Worker-3","threadId":62,"dateLogged":"2018-03-16T04:20:51.071+00:00"}
{"event":"synchronization queued","nodeId":"urn:node:mnTestNCEI","message":"Sync Objects Queued: 3240","threadName":"SynchronizationQuartzScheduler_Worker-3","threadId":62,"dateLogged":"2018-03-16T04:20:51.071+00:00"}
{"event":"synchronization queued","nodeId":"urn:node:mnSandboxUCSB1","message":"Sync Objects Queued: 441","threadName":"SynchronizationQuartzScheduler_Worker-3","threadId":62,"dateLogged":"2018-03-16T04:20:51.071+00:00"}
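The per-node drain can be checked mechanically. The Python sketch below embeds the per-node records from the log (trimmed to the fields it needs, with the "Total" records left out) and computes how far each node's queued count dropped between the first and last sample. The function names are hypothetical, and the parsing assumes the message format seen in this log.

```python
import json
import re
from collections import defaultdict

# Per-node records from the log above, trimmed to the fields used here.
LOG_LINES = """\
{"nodeId":"legacy","message":"Sync Objects Queued: 0","dateLogged":"2018-03-16T04:19:51.066+00:00"}
{"nodeId":"urn:node:mnTestNCEI","message":"Sync Objects Queued: 3282","dateLogged":"2018-03-16T04:19:51.066+00:00"}
{"nodeId":"urn:node:mnSandboxUCSB1","message":"Sync Objects Queued: 482","dateLogged":"2018-03-16T04:19:51.066+00:00"}
{"nodeId":"legacy","message":"Sync Objects Queued: 0","dateLogged":"2018-03-16T04:20:51.071+00:00"}
{"nodeId":"urn:node:mnTestNCEI","message":"Sync Objects Queued: 3240","dateLogged":"2018-03-16T04:20:51.071+00:00"}
{"nodeId":"urn:node:mnSandboxUCSB1","message":"Sync Objects Queued: 441","dateLogged":"2018-03-16T04:20:51.071+00:00"}
"""


def queued_by_node(lines):
    """Collect (timestamp, queued count) samples for each node id."""
    history = defaultdict(list)
    for line in lines.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        node_id = record.get("nodeId")
        if node_id is None:
            continue  # records without a nodeId are grand totals
        match = re.search(r"Queued: (\d+)", record["message"])
        if match:
            history[node_id].append((record["dateLogged"], int(match.group(1))))
    return history


def drained(history):
    """Return how many items each node lost between first and last sample."""
    result = {}
    for node_id, samples in history.items():
        ordered = sorted(samples)  # ISO timestamps in one format sort as strings
        result[node_id] = ordered[0][1] - ordered[-1][1]
    return result


drain = drained(queued_by_node(LOG_LINES))
```

For these samples `drain` reports 42 for urn:node:mnTestNCEI, 41 for urn:node:mnSandboxUCSB1, and 0 for legacy.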
#5 Updated by Rob Nahf almost 7 years ago
- % Done changed from 0 to 100
- Status changed from New to Closed
Tested and deployed on March 21.