Hazelcast should be upgraded to version 2.X for stability
We've seen a number of issues in each environment: Metacat's PostgreSQL system metadata tables don't stay in sync, replication suffers from inconsistent replica statuses across CNs, and the hzIdentifiers set iterator intermittently fails to iterate through the entire Set. One major factor may be communication problems between Hazelcast cluster instances, which show up in the catalina.out logs as:
WARNING: /184.108.40.206:5701 [DataONE] hz.1.InThread Closing socket to endpoint Address[220.127.116.11:5701], Cause:java.io.EOFException
These are known issues in the Hazelcast forum and issue list for 1.9.X, and the recommended fix is to upgrade to Hazelcast 2.X, where the connection framework has been significantly rewritten. This story documents the components that need to be modified to handle the 2.X API changes.
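To gauge how often the socket warning quoted above occurs on each CN, before and after the upgrade, the log can be tallied per endpoint. The following is only a minimal sketch: the class name HzSocketWarningScanner and its methods are hypothetical and not part of Metacat or Hazelcast, the regular expression assumes the warning keeps exactly the format shown above, and the path to catalina.out is passed as an argument because it differs per installation.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Metacat or Hazelcast): tallies the
// "Closing socket to endpoint" warnings found in a catalina.out log.
public class HzSocketWarningScanner {

    // Assumes the warning format quoted in this ticket:
    // "... Closing socket to endpoint Address[host:port], Cause:some.Exception"
    private static final Pattern CLOSING = Pattern.compile(
            "Closing socket to endpoint Address\\[([^\\]]+)\\], Cause:(\\S+)");

    /** Counts matching warnings, keyed by "endpoint cause". */
    public static Map<String, Integer> countByEndpoint(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = CLOSING.matcher(line);
            if (m.find()) {
                String key = m.group(1) + " " + m.group(2);
                counts.merge(key, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: HzSocketWarningScanner <path-to-catalina.out>");
            return;
        }
        // Reads the whole file into memory; adequate for a sketch,
        // but a very large log would be better processed as a stream.
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        countByEndpoint(lines)
                .forEach((key, n) -> System.out.println(n + "\t" + key));
    }
}
```

Running this against the catalina.out of each CN and comparing the per-endpoint counts would show whether a particular pair of cluster members drops connections more often than the others.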
The plan is to use the Hazelcast 2.4.x series. However, there is an outstanding HazelcastClient connection bug (see https://github.com/hazelcast/hazelcast/issues/315) that affects all versions of Hazelcast from 1.9.3 to 2.4. It is fixed in 2.4.1, which has not been released yet. In the meantime, we will use Hudson to build Hazelcast 2.4.1 from the tag, refactor the code against that build, and then switch to the official 2.4.1 release once the Hazelcast group pushes it to Maven Central.
#5 Updated by Chris Jones almost 11 years ago
- Status changed from In Progress to Closed
We've upgraded the CN stack to Hazelcast 2.4.1 and tested it in the dev, sandbox, and stage environments using the 1.1 branch, with no newly introduced bugs. The network partitioning behavior is not resolved, however. Closing this ticket since the upgrade is complete.