Story #3434

Investigate Timeout Problems on Staging Hazelcast

Added by Robert Waltz over 11 years ago. Updated about 11 years ago.

Status: Closed
Priority: Normal
Assignee: Robert Waltz
Category: Metacat
Start date: 2012-12-19
Due date: 2013-01-05
% Done: 100%
Story Points:
Sprint:

Description

Since upgrading to Hazelcast 2.4.1, we have noted an increase on staging in the number of reported timeouts from HZ members to the master node of the HZ cluster (the oldest member of the cluster).

Determine what effects the exceptions may have based on timeout scenarios.

Determine if any simple modifications might alleviate some of the timeouts.

Manipulation of two variables had no effect on the performance of Staging machines, and I was unable to simulate split brain on the environment.

I modified two variables and ran a single-threaded script that would evict objects, get them, and then put them. All daemons were running while the script executed.
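The script itself is not attached to this ticket. As a rough sketch of the evict, get, put cycle described above, the following Java runs that sequence on a single thread. `EvictableMap`, `LocalStandInMap`, and `EvictGetPutSketch` are hypothetical names introduced for illustration; the in-memory stand-in (including its reload-on-miss behavior) is an assumption so the sketch runs without a cluster. Hazelcast's `IMap` offers `evict`, `get`, and `put` methods with the same names.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical minimal view of a distributed map (not a Hazelcast type). */
interface EvictableMap<K, V> {
    boolean evict(K key);
    V get(K key);
    V put(K key, V value);
}

/** In-memory stand-in so the sketch runs without a cluster (assumption). */
class LocalStandInMap<K, V> implements EvictableMap<K, V> {
    private final Map<K, V> store = new HashMap<>(); // pretend backing store
    private final Map<K, V> cache = new HashMap<>(); // pretend in-memory entries

    public boolean evict(K key) {
        return cache.remove(key) != null;
    }

    public V get(K key) {
        V value = cache.get(key);
        if (value == null) {
            // On a miss, reload from the pretend backing store.
            value = store.get(key);
            if (value != null) {
                cache.put(key, value);
            }
        }
        return value;
    }

    public V put(K key, V value) {
        store.put(key, value);
        return cache.put(key, value);
    }

    boolean isCached(K key) {
        return cache.containsKey(key);
    }
}

class EvictGetPutSketch {
    /** Runs evict, get, put once per key on the calling thread; returns how many keys had a value. */
    static <K, V> int exercise(EvictableMap<K, V> map, Iterable<K> keys) {
        int found = 0;
        for (K key : keys) {
            map.evict(key);
            V value = map.get(key);
            if (value != null) {
                map.put(key, value);
                found++;
            }
        }
        return found;
    }
}
```

Against a real cluster, the stand-in would be replaced by an adapter around the actual map, and the loop would be run with the daemons active, as in the tests described here.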

The first variable modified is hazelcast.max.no.heartbeat.seconds, the maximum time in seconds without a heartbeat before a node is assumed to be dead, with a default setting of 300. In the uploaded files, the results of setting the variable to 60 seconds are found in the zip files whose names contain the string 60SecTO.

The second variable modified is hazelcast.map.partition.count, the distributed map partition count, with a default setting of 271. In the uploaded files, the results of setting the variable to 2710 are found in the zip files whose names contain the strings Partitions and 2710.
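This ticket does not record how the two overrides were applied on staging. As one possible illustration, Hazelcast reads its `hazelcast.*` properties from JVM system properties (the same effect as passing `-D` flags), provided they are set before the instance starts. The class and method names below are hypothetical; the property names and values are the ones used in these tests.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class HazelcastPropertySketch {
    /** Property names and test values taken from the ticket text. */
    static Map<String, String> testOverrides() {
        Map<String, String> overrides = new LinkedHashMap<>();
        overrides.put("hazelcast.max.no.heartbeat.seconds", "60"); // default 300
        overrides.put("hazelcast.map.partition.count", "2710");    // default 271
        return overrides;
    }

    /** Applies the overrides as JVM system properties; call before any Hazelcast instance is created. */
    static void apply(Map<String, String> overrides) {
        for (Map.Entry<String, String> entry : overrides.entrySet()) {
            System.setProperty(entry.getKey(), entry.getValue());
        }
    }

    public static void main(String[] args) {
        apply(testOverrides());
        // Print what a subsequently started member in this JVM would see.
        for (String name : testOverrides().keySet()) {
            System.out.println(name + "=" + System.getProperty(name));
        }
    }
}
```

The same two values could instead be placed in the `<properties>` section of the Hazelcast XML configuration; which mechanism was actually used here is not stated.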

I first wanted to test if a lower heartbeat timeout setting would affect the number of timeouts. It did not.

I also wished to monitor the performance effect of increasing the partition count. I was curious whether the amount of traffic over TCP would decrease if the data structures were split among more partitions. I did not find any significant difference, with the exception that increasing the partition count increases migration time.

StageUnmBaseline.zip - Baseline of UNM using default settings with eviction script (105 KB) Robert Waltz, 2013-01-15 20:10

StageOrcBaseline.zip - Baseline of ORC using default settings with eviction script (125 KB) Robert Waltz, 2013-01-15 20:10

StageUnmPartitions2710.zip - Test of UNM using 2710 partitions (50.3 KB) Robert Waltz, 2013-01-15 20:10

StageOrcPartitions2710.zip - Test of ORC using 2710 partitions (162 KB) Robert Waltz, 2013-01-15 20:10

StageUnmBaseline60SecTO.zip - Test of UNM using 60 Sec Heartbeat (39.9 KB) Robert Waltz, 2013-01-15 20:10

StageOrcBaseline60SecTO.zip - Test of ORC using 60 Sec Heartbeat (81.7 KB) Robert Waltz, 2013-01-15 20:10

StageUnmPartitions60SecTO2710.zip - Test of UNM using 2710 partitions + 60 Sec Heartbeat (45.6 KB) Robert Waltz, 2013-01-15 20:10

StageOrcPartitions60SecTO2710.zip - Test of ORC using 2710 partitions + 60 Sec Heartbeat (92.2 KB) Robert Waltz, 2013-01-15 20:10


Subtasks

Task #3435: Examine the Locking exception thrown during call to updateReplicationMetadata() (Closed, Robert Waltz)

Task #3436: Examine the effects of modifying partition count variable on Sandbox (Closed, Robert Waltz)

Task #3437: Gather Baseline of HZ interactions for tests on Sandbox (Closed, Robert Waltz)

Task #3439: Examine the effects of modifying heartbeat timeout variable on Sandbox (Closed, Robert Waltz)

History

#1 Updated by Robert Waltz about 11 years ago

  • Description updated (diff)

#3 Updated by Robert Waltz about 11 years ago

  • Status changed from New to Closed
