Story #3470

CN cluster communication needs to be monitored

Added by Chris Jones about 11 years ago. Updated over 9 years ago.

Status: Closed
Priority: Normal
Assignee: Skye Roseboom
Category: performance-scalability
Target version:
Start date: 2013-01-09
Due date: 2014-10-01
% Done: 100%
Story Points:
Sprint:

Description

The Coordinating Node (CN) environments rely heavily on consistent network communication, especially for the Hazelcast cluster. If the network partitions and one or more members drop from the cluster, our services can get out of sync. We need to:

  • Monitor a few key states, with cluster membership being the most important so far.
  • Raise operational alerts through Nagios monitoring when the cluster gets partitioned.
  • Respond programmatically to a partitioned cluster (read-only mode?).
  • Develop a custom merge policy for when the cluster comes back into communication, so that set, map, and queue entries that are out of sync are brought back into sync in both number and content.

See: http://epad.dataone.org/ClusterPartitionDiscussions
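
As a rough illustration of the Nagios alerting requirement, the following is a minimal sketch of a plugin-style membership check. It is not the d1_cn_monitor implementation. It assumes the list of currently visible members has already been obtained from the cluster by some other means, and the host names and the `check_membership` function are hypothetical. The only fixed convention used is the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
from dataclasses import dataclass

# Standard Nagios plugin exit codes (WARNING is unused in this sketch).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


@dataclass
class CheckResult:
    code: int      # exit code the plugin would return to Nagios
    message: str   # single status line the plugin would print


def check_membership(expected, visible):
    """Compare the members a node can see against the expected member set.

    Both arguments are iterables of member names. How `visible` is collected
    from Hazelcast is outside this sketch (assumption).
    """
    expected_set = set(expected)
    if not expected_set:
        return CheckResult(UNKNOWN, "UNKNOWN - no expected members configured")

    missing = sorted(expected_set - set(visible))
    total = len(expected_set)
    if not missing:
        return CheckResult(OK, f"OK - {total}/{total} members visible")

    # Any missing member is treated as a possible partition (assumed policy).
    seen = total - len(missing)
    return CheckResult(
        CRITICAL,
        f"CRITICAL - {seen}/{total} members visible; missing: {', '.join(missing)}",
    )


if __name__ == "__main__":
    import sys

    # Hypothetical member names, for illustration only.
    result = check_membership(
        ["cn-a.example.org", "cn-b.example.org", "cn-c.example.org"],
        ["cn-a.example.org"],
    )
    print(result.message)
    sys.exit(result.code)
```

Treating any missing member as CRITICAL is a policy choice made for the sketch; a real check might instead warn on a single missing member and go critical only when a majority is lost.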


Subtasks

Task #3471: Create a d1_cn_monitor project that acts as an extensible framework for adding monitoring agents and reporting agents (Closed, Skye Roseboom)

Task #3472: Create a Statsd reporter for cluster membership status (Closed, Skye Roseboom)

Task #3473: Report membership status to Nagios (Closed, Skye Roseboom)

Task #4459: Monitor Hazelcast Logs with Splunk (Closed, David Doyle)
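
For the Statsd reporter subtask, a minimal sketch of reporting the member count as a StatsD gauge might look like the following. This is an illustration only, not the reporter that was built. The metric name `cn.hazelcast.member_count` and both function names are hypothetical, and the target host and port are assumptions (UDP 8125 is the conventional StatsD port). The `name:value|g` datagram is the standard StatsD gauge format.

```python
import socket


def format_gauge(metric, value):
    """Build a StatsD gauge datagram of the form <metric>:<value>|g."""
    return f"{metric}:{value}|g".encode("ascii")


def send_gauge(metric, value, host="127.0.0.1", port=8125):
    """Send one gauge over UDP, fire-and-forget (no delivery guarantee)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(format_gauge(metric, value), (host, port))


if __name__ == "__main__":
    # Hypothetical metric name; 3 stands in for the observed member count.
    send_gauge("cn.hazelcast.member_count", 3)
```

A gauge suits membership status because it records the latest observed value rather than accumulating counts between flushes.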

History

#1 Updated by Chris Jones about 11 years ago

  • Remaining hours set to 0.0
  • Due date set to 2013-01-19
  • Tracker changed from Task to Story

#2 Updated by Chris Jones about 11 years ago

  • Subject changed from CN cluster communication needs to be monitored and maintained to CN cluster communication needs to be monitored and consistency maintained

#3 Updated by Chris Jones about 11 years ago

  • Description updated (diff)

#4 Updated by Chris Jones about 11 years ago

  • Description updated (diff)

#5 Updated by Skye Roseboom about 11 years ago

  • Target version changed from 2013.2-Block.1.1 to 2013.12-Block.2.2
  • Due date changed from 2013-01-19 to 2013-03-30

#6 Updated by Skye Roseboom about 11 years ago

  • Milestone changed from CCI-1.1.1 to CCI-1.2
  • Status changed from New to In Progress

#7 Updated by Skye Roseboom almost 11 years ago

  • Target version changed from 2013.12-Block.2.2 to 2013.16-Block.2.4
  • Due date changed from 2013-03-30 to 2013-04-27

#8 Updated by Skye Roseboom almost 11 years ago

  • Target version changed from 2013.16-Block.2.4 to 2013.30-Block.4.3
  • Due date changed from 2013-04-27 to 2013-08-03

#9 Updated by Skye Roseboom over 10 years ago

  • Target version changed from 2013.30-Block.4.3 to 2013.35-Block.5.1
  • Due date changed from 2013-08-03 to 2013-09-07

#10 Updated by Skye Roseboom over 10 years ago

  • Due date changed from 2013-09-07 to 2013-10-26
  • Target version changed from 2013.35-Block.5.1 to 2013.42-Block.5.4

#11 Updated by Chris Jones about 10 years ago

  • Due date changed from 2013-10-26 to 2014-02-15
  • Target version changed from 2013.42-Block.5.4 to 2014.6-Block.1.3

#12 Updated by Robert Waltz about 10 years ago

  • Subject changed from CN cluster communication needs to be monitored and consistency maintained to CN cluster communication needs to be monitored

#13 Updated by Skye Roseboom about 10 years ago

  • Due date changed from 2014-02-15 to 2014-04-12
  • Target version changed from 2014.6-Block.1.3 to 2014.14-Block.2.3

#14 Updated by Skye Roseboom about 10 years ago

  • Due date changed from 2014-04-12 to 2014-04-26
  • Target version changed from 2014.14-Block.2.3 to 2014.16-Block.2.4

#15 Updated by Skye Roseboom almost 10 years ago

  • Due date changed from 2014-04-26 to 2014-05-10
  • Target version changed from 2014.16-Block.2.4 to 2014.18-Block.3.1

#16 Updated by Robert Waltz over 9 years ago

  • Due date changed from 2014-05-10 to 2014-09-24
  • Target version changed from 2014.18-Block.3.1 to Maintenance Backlog

#17 Updated by Skye Roseboom over 9 years ago

  • Status changed from In Progress to Closed

Monitoring is performed with Splunk log monitoring and email notifications.
