Story #2166
Hazelcast cluster errors need to be isolated
100%
Description
When MN-MN replication (ReplicationManager) is running on a CN, calls to the cluster from the HazelcastClient instance fail because the Response to the Call never completes. The failures are recorded in the /var/log/dataone/d1-processing-jsvc.err log file on cn-dev-2.
To isolate the issue, tests need to be performed that take the D1 code out of the mix and then add in the following factors:
1) Client code connections with and without D1 objects
2) Default hazelcast config vs optimized config (eviction policy, etc)
3) Default JVM params vs JVM params optimized to match CN
4) Inter-JVM HazelcastClient connections vs intra-JVM
5) Introduce threading to match ReplicationManager model
Subtasks
History
#1 Updated by Chris Jones about 13 years ago
- Position set to 1
- Position changed from 1 to 351
#2 Updated by Chris Jones about 13 years ago
- Status changed from In Progress to Closed
The locking issues that produced "No response for call" errors from the ProxyHelper class in Hazelcast were primarily due to contention between the calling code (d1_sync, d1_repl) and metacat. Switching the synchronization and replication coordination locks to the process cluster helped immensely. We also switched to using tryLock() calls to reduce the possibility of code blocking indefinitely. Metacat code was updated to ensure locks are released properly in the storage cluster.
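For reference, the tryLock() pattern mentioned above can be sketched roughly as follows. This is a minimal illustration, not the actual d1_sync or d1_repl code: the class name TryLockSketch, the helper runWithLock, and the timeout values are hypothetical, and a java.util.concurrent ReentrantLock stands in for the cluster lock so the example runs without a Hazelcast cluster. It assumes the cluster lock exposes the standard java.util.concurrent.locks.Lock interface.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class TryLockSketch {

    /**
     * Runs the task only if the lock can be acquired within timeoutMs.
     * Returns true if the task ran, false if the lock was not available in time.
     * (Hypothetical helper, not part of the D1 code base.)
     */
    static boolean runWithLock(Lock lock, long timeoutMs, Runnable task)
            throws InterruptedException {
        // Bounded wait: give up instead of blocking indefinitely on a contended lock.
        if (!lock.tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
            return false;
        }
        try {
            task.run();
            return true;
        } finally {
            // Always release, even if the task throws, so other callers are not starved.
            lock.unlock();
        }
    }

    public static void main(String[] args) throws Exception {
        // ReentrantLock is a local stand-in for a cluster-wide lock (assumption).
        Lock lock = new ReentrantLock();
        boolean ran = runWithLock(lock, 100,
                () -> System.out.println("doing work under the lock"));
        System.out.println("task ran: " + ran);
    }
}
```

A caller that gets false back can skip or reschedule the work rather than hold up its thread, which is the behaviour we were after when moving away from plain lock() calls.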