Story #3059: Research potential HZ Bug - Infrastructure - DataONE Tasks

Added by Robert Waltz over 12 years ago. Updated over 12 years ago.

Status: Rejected
Priority: Normal
Assignee: Robert Waltz
Category:
Target version: Sprint-2012.41-Block.6.1
Start date: 2012-07-10
Due date: 2012-10-27
% Done:
Story Points:
Sprint:

Description

The messages below appear to be thrown when cn-orc-1 was the first system to be started in the cluster.

cn-unm-1 was restarted; once it was fully operational, cn-ucsb-1 was restarted. After all three were operational,
cn-orc-1 was shut down and left the cluster.

cn-unm-1 started to send these messages:

Jul 10, 2012 12:20:20 AM com.hazelcast.impl.PartitionManager
WARNING: /64.106.40.6:5701 [DataONE] Block [55] owner=Address[64.106.40.6:5701] migrationAddress=Address[128.111.36.80:5701] migration is not completed for 10 seconds.
Jul 10, 2012 12:20:49 AM com.hazelcast.impl.PartitionManager
WARNING: /64.106.40.6:5701 [DataONE] Block [81] owner=Address[64.106.40.6:5701] migrationAddress=Address[128.111.36.80:5701] migration is not completed for 10 seconds.

This indicates partitioning problems between cn-unm-1 and cn-ucsb-1.

After 10 minutes, when these messages stopped, cn-orc-1 was restarted.

cn-orc-1 reported these messages:

Jul 10, 2012 12:46:15 AM com.hazelcast.impl.ConcurrentMapManager
INFO: /160.36.13.150:5701 [DataONE] ======= -1: CONCURRENT_MAP_ADD_TO_SET ========
thisAddress= Address[160.36.13.150:5701], target= null
targetMember= null, targetConn=null, targetBlock=Block [39] owner=Address[128.111.36.80:5701] migrationAddress=Address[160.36.13.150:5701]
org.dataone.service.types.v1.Identifier@7caa5dd3 Re-doing [20] times! m:s:hzIdentifiers : null

and cn-unm-1 reported these messages:

Jul 10, 2012 12:41:28 AM com.hazelcast.impl.PartitionManager
WARNING: /64.106.40.6:5701 [DataONE] Block [30] owner=Address[128.111.36.80:5701] migrationAddress=Address[160.36.13.150:5701] migration is not completed for 10 seconds.
Jul 10, 2012 12:42:09 AM com.hazelcast.impl.PartitionManager
WARNING: /64.106.40.6:5701 [DataONE] Block [12] owner=Address[128.111.36.80:5701] migrationAddress=Address[160.36.13.150:5701] migration is not completed for 10 seconds.
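
To see at a glance which pair of nodes each stalled migration involves, the PartitionManager warnings above can be tallied by owner and migration target. The following is only a rough sketch: the regular expression is written against the line format quoted in this ticket, the function name is made up for illustration, and the IP-to-hostname mapping is inferred from the description (cn-ucsb-1 in particular is deduced, not confirmed).

```python
import re
from collections import Counter

# Matches the PartitionManager warning format quoted in this ticket.
WARNING_RE = re.compile(
    r"Block \[(?P<block>\d+)\] "
    r"owner=Address\[(?P<owner>[^\]]+)\] "
    r"migrationAddress=Address\[(?P<target>[^\]]+)\] "
    r"migration is not completed"
)

# ASSUMPTION: host names inferred from the description, not from any config.
HOSTS = {
    "64.106.40.6:5701": "cn-unm-1",
    "128.111.36.80:5701": "cn-ucsb-1",
    "160.36.13.150:5701": "cn-orc-1",
}


def stalled_migrations(lines):
    """Count stalled-migration warnings per (owner, target) host pair."""
    counts = Counter()
    for line in lines:
        match = WARNING_RE.search(line)
        if match:
            owner = HOSTS.get(match["owner"], match["owner"])
            target = HOSTS.get(match["target"], match["target"])
            counts[(owner, target)] += 1
    return counts


# The four warning lines copied from the description.
sample = [
    "WARNING: /64.106.40.6:5701 [DataONE] Block [55] owner=Address[64.106.40.6:5701] migrationAddress=Address[128.111.36.80:5701] migration is not completed for 10 seconds.",
    "WARNING: /64.106.40.6:5701 [DataONE] Block [81] owner=Address[64.106.40.6:5701] migrationAddress=Address[128.111.36.80:5701] migration is not completed for 10 seconds.",
    "WARNING: /64.106.40.6:5701 [DataONE] Block [30] owner=Address[128.111.36.80:5701] migrationAddress=Address[160.36.13.150:5701] migration is not completed for 10 seconds.",
    "WARNING: /64.106.40.6:5701 [DataONE] Block [12] owner=Address[128.111.36.80:5701] migrationAddress=Address[160.36.13.150:5701] migration is not completed for 10 seconds.",
]

for (owner, target), n in sorted(stalled_migrations(sample).items()):
    print(f"{owner} -> {target}: {n}")
```

On the lines quoted here, the first pair of warnings is cn-unm-1 migrating to cn-ucsb-1, and the second pair is cn-ucsb-1 migrating to cn-orc-1 after it rejoined.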

History

#1 Updated by Dave Vieglais over 12 years ago

Possibly related, adding mostly to keep track of one potential trail to follow. https://github.com/hazelcast/hazelcast/issues/117

That issue was related to Hazelcast's sensitivity to time offsets between the servers.
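
If the time-offset theory is worth ruling out, one quick check is to collect an epoch timestamp from each CN at roughly the same moment and compare them. A minimal sketch follows, assuming the readings have already been gathered by some other means (for example `date +%s` on each host); the function name is hypothetical and the numbers in the example are invented placeholders, not measurements from these servers.

```python
from itertools import combinations


def max_clock_skew(epoch_by_host):
    """Return (skew_seconds, host_a, host_b) for the worst-offset pair.

    epoch_by_host maps a host name to a Unix timestamp sampled on that host.
    Readings taken one after another include sampling delay, so small
    differences are not meaningful.
    """
    worst = (0.0, None, None)
    for (a, ta), (b, tb) in combinations(sorted(epoch_by_host.items()), 2):
        skew = abs(ta - tb)
        if skew > worst[0]:
            worst = (skew, a, b)
    return worst


# Placeholder readings for illustration only (NOT real measurements).
readings = {
    "cn-orc-1": 1341879620.0,
    "cn-ucsb-1": 1341879621.5,
    "cn-unm-1": 1341879650.0,
}

skew, host_a, host_b = max_clock_skew(readings)
print(f"largest offset: {skew:.1f}s between {host_a} and {host_b}")
```

A large offset would point toward NTP configuration rather than Hazelcast itself.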

#2 Updated by Dave Vieglais over 12 years ago

Another thread suggests this message is informational, though it is unusually slow for a migration to take more than 10 seconds: https://groups.google.com/forum/?fromgroups#!topic/hazelcast/54mAiG3PjTo

#3 Updated by Robert Waltz over 12 years ago

This is similar to what we saw last November with our client connection problems from the process daemons to the storage cluster:

http://stackoverflow.com/questions/9997057/issue-on-start-up-with-hazelcast-concurrent-map-put

#4 Updated by Robert Waltz over 12 years ago

Milestone changed from CCI-1.0.4 to None
Target version set to Sprint-2012.39-Block.5.4

#5 Updated by Robert Waltz over 12 years ago

Milestone changed from None to CCI-1.1
Target version changed from Sprint-2012.39-Block.5.4 to Sprint-2012.41-Block.6.1

#6 Updated by Robert Waltz over 12 years ago

Milestone changed from CCI-1.1 to CCI-1.2

#7 Updated by Robert Waltz over 12 years ago

Due date set to 2012-10-27
Remaining hours set to 0.0
Status changed from New to Rejected

This is not a bug. This type of warning will most likely be resolved by updating Hazelcast.
