Bug #7706

Hazelcast Runtime exception halts synchronization

Added by Robert Waltz almost 6 years ago. Updated over 5 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Robert Waltz
Category:
d1_synchronization
Start date:
2016-04-04
Due date:
% Done:

100%

Story Points:
Sprint:

Description

The exception: java.lang.RuntimeException: java.util.concurrent.TimeoutException: [CONCURRENT_MAP_REMOVE] Operation Timeout (with no response!): 0

is raised at com.hazelcast.impl.ClientServiceException.readData(ClientServiceException.java:63)

The exception occurs at org.dataone.cn.batch.synchronization.tasks.SyncObjectTask.call(SyncObjectTask.java:116)

where it is not caught. It is caught only by the thread manager, SyncObjectTaskManager, which ends the thread. Thus, synchronization fails.

The exception may be caused by SyncObject: the class, whose instances are passed into the hazelcastSyncObjectQueue, does not define a serialVersionUID.
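As an illustration of the suspected cause, the sketch below shows a serializable class that declares an explicit serialVersionUID and survives a round trip through standard Java serialization. SyncObjectSketch, its fields, and roundTrip are hypothetical stand-ins, not the real SyncObject, and whether Hazelcast uses plain Java serialization for this queue is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for SyncObject; the field names are assumptions.
class SyncObjectSketch implements Serializable {
    // Explicit version id: without it the JVM derives one from the class
    // structure, so differently compiled copies of the class may not agree.
    private static final long serialVersionUID = 1L;

    private final String nodeId;
    private final String pid;

    SyncObjectSketch(String nodeId, String pid) {
        this.nodeId = nodeId;
        this.pid = pid;
    }

    String getNodeId() { return nodeId; }
    String getPid() { return pid; }

    // Serializes and deserializes an instance in memory, wrapping checked
    // exceptions so callers do not need to declare them.
    static SyncObjectSketch roundTrip(SyncObjectSketch original) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(buffer.toByteArray()))) {
                return (SyncObjectSketch) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("serialization round trip failed", e);
        }
    }
}
```

If the version id differs between the writer and the reader of the queue, standard Java deserialization fails with an InvalidClassException, which is one way a mismatch could surface on the client side.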

Also, a failure of the SyncObjectTaskManager should disable ObjectListHarvestTask.

Currently, to deactivate synchronization, the Synchronization.active property is set to false. However, we have no way of accomplishing that easily now. It may be best to have a global static 'disable' class that evaluates the Synchronization.active property along with a static settable boolean that can disable sync until it is ready to be restarted.


Related issues

Related to Infrastructure - Story #8525: timeout exceptions thrown from Hazelcast disable synchronization In Progress 2018-11-16

Associated revisions

Revision 18003
Added by Robert Waltz over 5 years ago

refs #7706

Hazelcast Runtime exception halts synchronization

Revision 18004
Added by Robert Waltz over 5 years ago

refs #7706

Hazelcast Runtime exception halts synchronization

History

#1 Updated by Robert Waltz almost 6 years ago

  • Tracker changed from Task to Bug

#2 Updated by Robert Waltz over 5 years ago

  • % Done changed from 0 to 30
  • Category changed from d1_cn_common to d1_synchronization
  • Status changed from New to In Progress

Handling Catastrophic failures of SyncObjectTask

If the SyncObjectTaskManager receives an exception that disables the running of SyncObjectTask, then all of synchronization should be halted, and a notification or log message sent regarding the issue.

The impact will be on quartz scheduling and any quartz jobs that are running.

Quartz jobs run the ObjectListHarvestTask. Any ObjectListHarvestTask job that is running when the exception happens should halt its processing and return.

Since quartz jobs are created by quartz scheduling, an observer pattern cannot be applied: we are not able to track the job instantiations in an observer class.

If SyncObjectTaskManager goes down, then HarvestSchedulingManager should shut down all of its jobs, never to be rescheduled.

NodeTopicListener calls the HarvestSchedulingManager.manageHarvest method when it receives a message about an updated node.

In DataONE CN Common, there is a class named ComponentActivationUtility that tracks the activation state of d1_processing components. synchronizationIsActive() is a method that returns the evaluation of a private method, synchronizationComponentActive(). It only returns the status of the property Synchronization.active. Add an additional AtomicBoolean set by SyncObjectTaskManager. Add a disableSynchronization method that sets the boolean to false, and combine Synchronization.active with the AtomicBoolean (logical AND) when calling synchronizationIsActive().
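The proposed change to the activation check could look roughly like the following. This is a minimal sketch, not the actual DataONE class: ActivationSketch is a hypothetical name, and reading Synchronization.active from a system property with a default of false is an assumption made only to keep the example self-contained.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for the activation utility described above.
class ActivationSketch {
    // Assumption: the real class reads this from its own configuration;
    // a system property is used here so the sketch runs on its own.
    static final String ACTIVE_PROPERTY = "Synchronization.active";

    // Runtime switch; once cleared it stays cleared for the life of the JVM
    // because the sketch offers no method to set it back to true.
    private static final AtomicBoolean runtimeEnabled = new AtomicBoolean(true);

    // Evaluates only the configured property.
    private static boolean synchronizationComponentActive() {
        return Boolean.parseBoolean(System.getProperty(ACTIVE_PROPERTY, "false"));
    }

    // Active only if the property says so AND nobody disabled sync at runtime.
    static boolean synchronizationIsActive() {
        return synchronizationComponentActive() && runtimeEnabled.get();
    }

    // Intended to be called by the task manager after a fatal failure.
    static void disableSynchronization() {
        runtimeEnabled.set(false);
    }
}
```

Because the flag is an AtomicBoolean, a harvest job running on another thread sees the change the next time it calls synchronizationIsActive(), which gives it a point at which to stop processing and return.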

Update HarvestSchedulingManager.manageHarvest to check the scheduler.isShutdown state before rescheduling tasks. Create a halt method that will shut down the scheduler, waiting for any active jobs to complete.
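The guard-then-halt behaviour can be sketched with the JDK's ScheduledExecutorService standing in for the Quartz scheduler, so that the example runs without extra dependencies. HarvestSchedulingSketch and its method signatures are hypothetical; with Quartz itself, the corresponding calls would presumably be Scheduler.isShutdown() and Scheduler.shutdown(true).

```java
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for HarvestSchedulingManager.
class HarvestSchedulingSketch {
    // JDK scheduler used in place of Quartz (assumption for this sketch).
    private final ScheduledExecutorService scheduler =
            Executors.newScheduledThreadPool(2);

    // Returns true if the harvest was scheduled, false if the scheduler
    // has been halted and nothing was scheduled.
    boolean manageHarvest(Runnable harvestTask, long periodSeconds) {
        if (scheduler.isShutdown()) {
            return false;
        }
        try {
            scheduler.scheduleWithFixedDelay(harvestTask, 0, periodSeconds, TimeUnit.SECONDS);
            return true;
        } catch (RejectedExecutionException e) {
            // halt() ran between the check and the scheduling call.
            return false;
        }
    }

    // Stops future runs and waits for running jobs to finish.
    // Returns true if everything terminated within the timeout.
    boolean halt(long timeoutSeconds) {
        scheduler.shutdown();
        try {
            return scheduler.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Checking the shutdown state inside manageHarvest matters because NodeTopicListener can still deliver node updates after a halt, and those must not bring harvesting back.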

SyncObjectTaskManager will need a reference to the HarvestSchedulingManager in order to call halt when an exception occurs.
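One way to wire this together is for the task manager to inspect each task's Future and call halt when the task ended with an exception. The sketch below assumes that design; SyncTaskManagerSketch, the HarvestHalter interface, and runTask are hypothetical names, and the real manager's threading model may differ.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for SyncObjectTaskManager.
class SyncTaskManagerSketch {
    // Minimal view of the harvest scheduling manager (assumption).
    interface HarvestHalter {
        void halt();
    }

    private final HarvestHalter harvestManager;
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    SyncTaskManagerSketch(HarvestHalter harvestManager) {
        this.harvestManager = harvestManager;
    }

    // Runs one task to completion. Returns true on success; on failure
    // logs the cause, halts harvesting, and returns false.
    boolean runTask(Callable<?> task) {
        Future<?> future = executor.submit(task);
        try {
            future.get();
            return true;
        } catch (ExecutionException e) {
            // The task threw, e.g. a RuntimeException wrapping a timeout.
            System.err.println("Synchronization halted: " + e.getCause());
            harvestManager.halt();
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            harvestManager.halt();
            return false;
        }
    }

    // Releases the worker thread.
    void shutdown() {
        executor.shutdownNow();
    }
}
```

Passing the harvest manager in through the constructor keeps the dependency explicit and lets a test substitute a simple recording implementation.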

#3 Updated by Robert Waltz over 5 years ago

  • Status changed from In Progress to Testing
  • % Done changed from 30 to 50

#4 Updated by Robert Waltz over 5 years ago

  • Status changed from Testing to Closed
  • % Done changed from 50 to 100

#5 Updated by Rob Nahf almost 4 years ago

  • Related to Story #8525: timeout exceptions thrown from Hazelcast disable synchronization added
