Story #2166

Hazelcast cluster errors need to be isolated

Added by Chris Jones about 13 years ago. Updated about 13 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: d1_cn_service
Start date: 2012-01-09
Due date:
% Done: 100%
Estimated time: (Total: 0.50 h)
Story Points:
Sprint:

Description

When MN-MN replication (ReplicationManager) is running on a CN, calls to the cluster from the HazelcastClient instance fail in that the Response to the Call never completes. This is reflected in the /var/log/dataone/d1-processing-jsvc.err log file on cn-dev-2.

In order to isolate the issue, tests need to be performed that take the D1 code out of the mix, and add in factors:

1) Client code connections with and without D1 objects
2) Default Hazelcast config vs optimized config (eviction policy, etc.)
3) Default JVM params and JVM params optimized to match CN
4) Inter-JVM HazelcastClient connections vs intra-JVM
5) Introduce threading to match ReplicationManager model


Subtasks

Task #2176: Ensure Metacat doesn't insert null SystemMetadata into Hazelcast (Closed, Chris Jones)

Task #2177: Use Lock.tryLock() (not Lock.lock()) in d1_replication (Closed, Chris Jones)

Task #2178: Use Lock.tryLock() (not Lock.lock()) in Metacat (Rejected, Chris Jones)

Task #2179: Use REST API calls from d1_replication for storage cluster writes (Closed, Chris Jones)

Task #2180: Coordinate d1_replication locks in the process cluster (Closed, Chris Jones)

Task #2182: Coordinate d1_indexer locks in the process cluster (Closed, Skye Roseboom)

Task #2183: Revise usage of d1Client in Synchronization (Closed, Robert Waltz)

Task #2185: In Metacat, add new method for Updating of ObsoletedBy of SystemMetadata (Closed, Ben Leinfelder)

Task #2186: In d1_cn_service, proxy new method for Updating of ObsoletedBy of SystemMetadata (Closed, Robert Waltz)

Task #2187: In Metacat, add new method for deleting replicas of SystemMetadata (Closed, Ben Leinfelder)

Task #2188: In d1_cn_service, proxy new method for deleting replicas of SystemMetadata (Closed, Robert Waltz)

Task #2203: Update d1_common_java to include new methods to APIs (Closed, Ben Leinfelder)

Task #2210: Update d1_libclient_java to implement new methods to APIs (Rejected, Rob Nahf)

History

#1 Updated by Chris Jones about 13 years ago

  • Position set to 1
  • Position changed from 1 to 351

#2 Updated by Chris Jones about 13 years ago

  • Status changed from In Progress to Closed

The locking issues that produced "No response for call" errors from the ProxyHelper class in Hazelcast were primarily due to contention between the calling code (d1_sync, d1_repl) and Metacat. Switching the synchronization and replication coordination locks to the process cluster helped immensely. We also switched to tryLock() calls to reduce the chance of code blocking indefinitely. Metacat code was updated to ensure locks were released properly in the storage cluster.
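For readers unfamiliar with the tryLock() change, here is a minimal sketch of the pattern. It is not the actual d1_replication or Metacat code. It assumes a plain java.util.concurrent ReentrantLock as a stand-in for the Hazelcast distributed lock (which, as far as I know, exposes the same java.util.concurrent.locks.Lock interface). The class name TryLockSketch, the helper runIfLockAcquired, and the 100 ms timeout are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class TryLockSketch {

    /**
     * Hypothetical helper: runs the work only if the lock can be acquired
     * within timeoutMs. Returns true if the work ran, false if the lock
     * stayed busy, so the caller can retry later instead of blocking.
     */
    static boolean runIfLockAcquired(Lock lock, long timeoutMs, Runnable work)
            throws InterruptedException {
        if (!lock.tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
            return false; // lock is contended; give up rather than wait forever
        }
        try {
            work.run();
            return true;
        } finally {
            lock.unlock(); // always release, even if the work throws
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a distributed lock (assumption, not Hazelcast itself).
        Lock lock = new ReentrantLock();

        // Uncontended case: the lock is free, so the work runs.
        System.out.println(runIfLockAcquired(lock, 100, () -> {})); // prints "true"

        // Contended case: a second thread holds the lock while we try.
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            lock.lock();
            try {
                held.countDown();
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                lock.unlock();
            }
        });
        holder.start();
        held.await(); // wait until the other thread really owns the lock

        // With lock.lock() this call would block until release; tryLock times out.
        System.out.println(runIfLockAcquired(lock, 100, () -> {})); // prints "false"

        release.countDown();
        holder.join();
    }
}
```

With lock() in place of tryLock(), the second call would wait for as long as the other holder kept the lock, which matches the kind of complete blocking described above. The finally block reflects the other half of the fix: a lock that is never released leaves every later caller waiting.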
