Story #8447

synchronization queue equity and monitoring

Added by Rob Nahf about 6 years ago. Updated about 6 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: d1_synchronization
Target version:
Start date: 2018-03-01
Due date:
% Done: 100%
Story Points:

Description

The recent initial sync from a new, large member node (with 409k items) exposed some equity issues with synchronization: the queue is so long that new sync tasks from other nodes can end up 2 days back in line.

We configured the queue to be much shorter (about an hour's worth of tasks), but keeping sync priority equitable may still be challenging for nodes that synchronize less frequently than the node with the large number of sync tasks.

Ideally, a multi-queue system would be adopted, with one queue per node and tasks "fanning in" from each queue to the processors. However, this could be complicated to develop.
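
As a rough illustration of the fan-in idea (not the actual synchronization code), the sketch below keeps one in-memory queue per node and takes one task from each non-empty queue per pass. The function name `fan_in` and the node identifiers are hypothetical, and plain Python deques stand in for the distributed queues.

```python
from collections import deque
from typing import Deque, Dict, Iterator, Tuple


def fan_in(queues: Dict[str, Deque[str]]) -> Iterator[Tuple[str, str]]:
    """Yield (node_id, task) pairs, one task per non-empty queue per pass."""
    while any(queues.values()):
        for node_id, node_queue in queues.items():
            if node_queue:
                yield node_id, node_queue.popleft()


# Hypothetical example: a large backlog from one node, a small one from another.
queues = {
    "urn:node:big": deque(f"big-{i}" for i in range(5)),
    "urn:node:small": deque(["small-0", "small-1"]),
}
order = [task for _, task in fan_in(queues)]
```

With this kind of round-robin draining, the small node's first task is the second one processed, rather than waiting behind the entire large backlog.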

Other, quicker improvements include:
1. Limiting the max harvest to a percentage of the queue's remaining capacity, instead of a fixed number. This might help get around the sync-frequency problem.
2. Adding a metric that reports the queue position of each member node's leading (first) item.
3. Limiting a node's harvest by sync frequency. We'd have to be able to convert the Node synchronization schedule into a frequency.
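
Items 1 and 2 above could be sketched roughly as follows. This is only an illustration under stated assumptions: the names `harvest_limit` and `leading_positions`, the capacity figures, and the representation of the queue as a list of `(node_id, pid)` pairs are all hypothetical and not taken from the synchronization code.

```python
from typing import Dict, List, Tuple


def harvest_limit(queue_capacity: int, queued: int, fraction: float) -> int:
    """Cap a harvest at a fraction of the queue's remaining capacity."""
    remaining = max(queue_capacity - queued, 0)
    return int(remaining * fraction)


def leading_positions(queue: List[Tuple[str, str]]) -> Dict[str, int]:
    """Map each node to the queue position (0 = head) of its first task."""
    first: Dict[str, int] = {}
    for position, (node_id, _pid) in enumerate(queue):
        first.setdefault(node_id, position)
    return first


# Hypothetical numbers: the cap shrinks as the queue fills up.
limit_when_empty = harvest_limit(queue_capacity=10000, queued=0, fraction=0.5)
limit_when_busy = harvest_limit(queue_capacity=10000, queued=8000, fraction=0.5)

positions = leading_positions([("mnA", "pid1"), ("mnA", "pid2"), ("mnB", "pid3")])
```

Because the cap is proportional to what is left, a node that harvests later still finds some capacity available, whatever the size of the earlier harvests.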


Related issues

Related to Infrastructure - Bug #8468: synchronization requeueing for temporary unavailability of nodeComms causes massive delays for package Closed 2018-03-02

Associated revisions

Revision 19161
Added by Rob Nahf about 6 years ago

refs: #8447, #8466. Addressed unit test failures. Completed code changes, and ready for integration tests.

History

#1 Updated by Rob Nahf about 6 years ago

Hazelcast doesn't support dynamic queue creation, so a Map of per-node queues isn't possible. We could create a fixed set of anonymous sync queues, and create a Map that assigns each Node to one of them. This introduces a scalability issue when the number of Nodes exceeds the number of queues.

A simpler idea is to create a syncPriorityQueue and a syncBulkQueue, for harvests under and over a configurable size threshold, respectively.
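
A minimal sketch of the two-queue idea, for illustration only. The class name `TwoTierSyncQueue`, the `bulk_threshold` parameter, and in particular the draining policy (alternating between the two queues) are assumptions; the note above does not say how the queues would be consumed.

```python
from collections import deque
from typing import Deque, List, Optional, Tuple

Task = Tuple[str, str]  # (node_id, pid)


class TwoTierSyncQueue:
    """Route harvests to a priority or bulk queue by size (hypothetical sketch)."""

    def __init__(self, bulk_threshold: int) -> None:
        self.bulk_threshold = bulk_threshold
        self.priority: Deque[Task] = deque()
        self.bulk: Deque[Task] = deque()
        self._take_priority = True

    def add_harvest(self, node_id: str, pids: List[str]) -> None:
        # Harvests larger than the threshold go to the bulk queue.
        target = self.bulk if len(pids) > self.bulk_threshold else self.priority
        target.extend((node_id, pid) for pid in pids)

    def next_task(self) -> Optional[Task]:
        # Assumed policy: alternate between queues, falling back to the other
        # queue when the preferred one is empty.
        if self._take_priority:
            candidates = (self.priority, self.bulk)
        else:
            candidates = (self.bulk, self.priority)
        self._take_priority = not self._take_priority
        for candidate in candidates:
            if candidate:
                return candidate.popleft()
        return None


sync_queue = TwoTierSyncQueue(bulk_threshold=3)
sync_queue.add_harvest("big", ["b0", "b1", "b2", "b3", "b4"])
sync_queue.add_harvest("small", ["s0", "s1"])
drained = [sync_queue.next_task() for _ in range(7)]
```

Alternating keeps a large harvest from blocking small ones while still letting the bulk queue make progress; a strict priority-first policy would be a different trade-off.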

#2 Updated by Rob Nahf about 6 years ago

  • Related to Bug #8468: synchronization requeueing for temporary unavailability of nodeComms causes massive delays for package added

#3 Updated by Dave Vieglais about 6 years ago

  • Target version changed from CCI-2.3.8 to CCI-2.3.9

#4 Updated by Rob Nahf about 6 years ago

I have sync running in sandbox, and demonstrated that a second harvest starts getting processed right away.

From the process-metric log below, you can see that each MN is being drained at about the same rate: between the two samples, taken one minute apart, the queued count drops by 42 for urn:node:mnTestNCEI and by 41 for urn:node:mnSandboxUCSB1.

{"event":"synchronization queued","message":"Total Sync Objects Queued: 3765","threadName":"SynchronizationQuartzScheduler_Worker-2","threadId":61,"dateLogged":"2018-03-16T04:19:51.066+00:00"}
{"event":"synchronization queued","nodeId":"legacy","message":"Sync Objects Queued: 0","threadName":"SynchronizationQuartzScheduler_Worker-2","threadId":61,"dateLogged":"2018-03-16T04:19:51.066+00:00"}
{"event":"synchronization queued","nodeId":"urn:node:mnTestNCEI","message":"Sync Objects Queued: 3282","threadName":"SynchronizationQuartzScheduler_Worker-2","threadId":61,"dateLogged":"2018-03-16T04:19:51.066+00:00"}
{"event":"synchronization queued","nodeId":"urn:node:mnSandboxUCSB1","message":"Sync Objects Queued: 482","threadName":"SynchronizationQuartzScheduler_Worker-2","threadId":61,"dateLogged":"2018-03-16T04:19:51.066+00:00"}
{"event":"synchronization queued","message":"Total Sync Objects Queued: 3682","threadName":"SynchronizationQuartzScheduler_Worker-3","threadId":62,"dateLogged":"2018-03-16T04:20:51.071+00:00"}
{"event":"synchronization queued","nodeId":"legacy","message":"Sync Objects Queued: 0","threadName":"SynchronizationQuartzScheduler_Worker-3","threadId":62,"dateLogged":"2018-03-16T04:20:51.071+00:00"}
{"event":"synchronization queued","nodeId":"urn:node:mnTestNCEI","message":"Sync Objects Queued: 3240","threadName":"SynchronizationQuartzScheduler_Worker-3","threadId":62,"dateLogged":"2018-03-16T04:20:51.071+00:00"}
{"event":"synchronization queued","nodeId":"urn:node:mnSandboxUCSB1","message":"Sync Objects Queued: 441","threadName":"SynchronizationQuartzScheduler_Worker-3","threadId":62,"dateLogged":"2018-03-16T04:20:51.071+00:00"}
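
The per-node drain can be checked mechanically from log lines like those above. The following is a small sketch, not part of the deployed tooling: the helper names `queued_by_node` and `drained_per_node` are hypothetical, and the sample records are abbreviated copies of the log entries (thread and timestamp fields omitted).

```python
import json
import re
from collections import defaultdict
from typing import Dict, Iterable, List

COUNT_PATTERN = re.compile(r"Sync Objects Queued: (\d+)")


def queued_by_node(log_lines: Iterable[str]) -> Dict[str, List[int]]:
    """Collect the queued counts per nodeId, in log order."""
    series: Dict[str, List[int]] = defaultdict(list)
    for line in log_lines:
        record = json.loads(line)
        node_id = record.get("nodeId")
        if node_id is None:
            continue  # skip the "Total Sync Objects Queued" records
        match = COUNT_PATTERN.search(record["message"])
        if match:
            series[node_id].append(int(match.group(1)))
    return dict(series)


def drained_per_node(series: Dict[str, List[int]]) -> Dict[str, int]:
    """Difference between the first and last sample for each node."""
    return {node: counts[0] - counts[-1]
            for node, counts in series.items() if len(counts) >= 2}


sample = [
    '{"event":"synchronization queued","message":"Total Sync Objects Queued: 3765"}',
    '{"event":"synchronization queued","nodeId":"legacy","message":"Sync Objects Queued: 0"}',
    '{"event":"synchronization queued","nodeId":"urn:node:mnTestNCEI","message":"Sync Objects Queued: 3282"}',
    '{"event":"synchronization queued","nodeId":"urn:node:mnSandboxUCSB1","message":"Sync Objects Queued: 482"}',
    '{"event":"synchronization queued","message":"Total Sync Objects Queued: 3682"}',
    '{"event":"synchronization queued","nodeId":"legacy","message":"Sync Objects Queued: 0"}',
    '{"event":"synchronization queued","nodeId":"urn:node:mnTestNCEI","message":"Sync Objects Queued: 3240"}',
    '{"event":"synchronization queued","nodeId":"urn:node:mnSandboxUCSB1","message":"Sync Objects Queued: 441"}',
]
drained = drained_per_node(queued_by_node(sample))
```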

#5 Updated by Rob Nahf about 6 years ago

  • % Done changed from 0 to 100
  • Status changed from New to Closed

Tested and deployed on March 21.
