<p>DataONE Tasks | https://redmine.dataone.org/ | feed updated 2014-03-14T18:43:02Z</p>
<hr>
<p>Infrastructure - Task #4459: Monitor Hazelcast Logs with Splunk</p>
<p>https://redmine.dataone.org/issues/4459?journal_id=18941 | 2014-03-14T18:43:02Z | Robert Waltz</p>
<ul><li><strong>Milestone</strong> changed from <i>None</i> to <i>CCI-1.2</i></li></ul>
<hr>
<p>https://redmine.dataone.org/issues/4459?journal_id=18978 | 2014-03-14T18:59:47Z | David Doyle (ddoyle2@utk.edu)</p>
<p>Will start by:</p>
<ul>
<li>building Hazelcast indexes in the Splunk test environment</li>
<li>adding /var/log/dataone/daemon/hazelcast-process.log and /var/metacat/logs/hazelcast-storage.log to the Splunk test environment</li>
</ul>
<p>From there I'll see how Splunk treats those logs and whether any changes need to be made at the input/index level.</p>
<p>Going to add bwilson as a watcher to see if he has any input.</p>
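<p>For reference, a dedicated index on the test instance could be defined roughly like this in indexes.conf. This is a minimal sketch: the index name <code>hazelcast</code> is my placeholder, and the paths use Splunk's standard <code>$SPLUNK_DB</code> layout.</p>

```ini
# indexes.conf on the Splunk test instance (sketch; index name is assumed)
# Splunk requires all three path settings when defining a new index.
[hazelcast]
homePath   = $SPLUNK_DB/hazelcast/db
coldPath   = $SPLUNK_DB/hazelcast/colddb
thawedPath = $SPLUNK_DB/hazelcast/thaweddb
```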
<hr>
<p>https://redmine.dataone.org/issues/4459?journal_id=19009 | 2014-03-17T19:51:58Z | David Doyle (ddoyle2@utk.edu)</p>
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>In Progress</i></li></ul><p>Rather than working out how to route every log we might want to monitor in the future through rsyslog into Splunk, I'm going back to universal forwarders on the nodes, which monitor files based on simple Splunk input config stanzas.</p>
<p>Starting out with a forwarder on cn-sandbox-orc-1 sending /var/log/syslog and one or two Hazelcast logs to the Splunk test environment. I'll see how that works out and build further if this option runs with a minimal number of hiccups.</p>
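<p>The forwarder-side input stanzas for the two Hazelcast logs would look roughly like this. A sketch only: the index and sourcetype names are my assumptions, not settled conventions.</p>

```ini
# inputs.conf on the universal forwarder (sketch; index/sourcetype assumed)
[monitor:///var/log/dataone/daemon/hazelcast-process.log]
index = hazelcast
sourcetype = hazelcast

[monitor:///var/metacat/logs/hazelcast-storage.log]
index = hazelcast
sourcetype = hazelcast
```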
<hr>
<p>https://redmine.dataone.org/issues/4459?journal_id=19012 | 2014-03-17T22:39:11Z | David Doyle (ddoyle2@utk.edu)</p>
<p>The test setup sending /var/log/dataone/daemon/hazelcast-process.log and /var/metacat/logs/hazelcast-storage.log into Splunk prod is working.</p>
<hr>
<p>https://redmine.dataone.org/issues/4459?journal_id=19042 | 2014-03-19T22:59:16Z | David Doyle (ddoyle2@utk.edu)</p>
<p>Hazelcast logs are moving into Splunk from all sandbox CNs. Building out to other environments next.</p>
<hr>
<p>https://redmine.dataone.org/issues/4459?journal_id=19043 | 2014-03-20T03:31:54Z | David Doyle (ddoyle2@utk.edu)</p>
<p>Logging is now coming in from all non-prod CNs except cn-stage-unm-2, for which I don't have sudo access.</p>
<hr>
<p>https://redmine.dataone.org/issues/4459?journal_id=19110 | 2014-03-21T04:39:32Z | David Doyle (ddoyle2@utk.edu)</p>
<p>Robert reported some Hazelcast messages that should indicate an error state. Leaving a brief IRC transcript below for Bruce and me to work back from.</p>
<p>Also, I'm monitoring VM performance on the ORC sandbox/dev/stage machines that have Splunk forwarders and Hazelcast logging installed (mentioning this here since I'm rolling out Hazelcast logging with the Splunk forwarders). I'm seeing minor upticks in CPU and memory usage, all well within expected levels. Will in all likelihood build out to prod tomorrow or over the weekend.</p>
<hr>
<p>[10:08pm] robert: i just saw the messages you should be looking for from cn-dev-ucsb-1<br>
[10:08pm] robert: [ INFO] 2014-03-20 01:58:56,709 (AddOrRemoveConnection:process:58) [128.111.54.78]:5701 [DataONE] Removing Address Address[64.106.40.9]:5701<br>
[10:08pm] robert: [ INFO] 2014-03-20 01:58:56,710 (ConnectionManager:destroyConnection:338) [128.111.54.78]:5701 [DataONE] Connection [Address[64.106.40.9]:5701] lost. Reason: Explicit close<br>
[10:09pm] robert: [ INFO] 2014-03-20 01:58:56,743 (AddOrRemoveConnection:process:58) [128.111.54.78]:5701 [DataONE]<br>
[10:09pm] robert: Members [2] {<br>
[10:09pm] robert: Member [160.36.13.153]:5701<br>
[10:09pm] robert: Member [128.111.54.78]:5701 this<br>
[10:09pm] robert: }<br>
[10:09pm] robert: [ INFO] 2014-03-20 01:58:56,792 (MemberRemover:process:40) [128.111.54.78]:5701 [DataONE] Removing Address Address[64.106.40.9]:5701</p>
<p>[10:17pm] robert: Most important is the 'Members[\d] {' string followed by the Member [ip-address] lines<br>
[10:17pm] robert: also Connection Lost<br>
[10:21pm] robert: and ...</p>
<p>[10:21pm] robert: [ INFO] 2014-03-20 01:59:21,028 (SocketAcceptor$1:run:111) [128.111.54.78]:5701 [DataONE] 5701 is accepting socket connection from /64.106.40.9:58655<br>
[10:21pm] robert: [ INFO] 2014-03-20 01:59:21,029 (SocketAcceptor$1:run:122) [128.111.54.78]:5701 [DataONE] 5701 accepted socket connection from /64.106.40.9:58655</p>
<p>[10:55pm] robert: hmm, ya, so you'll see the Connect Lost, and I think that indicates an error condition until you see the accepted socket connection</p>
<hr>
<p>https://redmine.dataone.org/issues/4459?journal_id=19115 | 2014-03-23T00:46:13Z | David Doyle (ddoyle2@utk.edu)</p>
<p>Hazelcast logging rolled out to prod CNs.</p>
<p>TO-DO: Find out if there are any hazelcast-related logs to be monitored on the D1 MN boxes.</p>
<hr>
<p>https://redmine.dataone.org/issues/4459?journal_id=19144 | 2014-03-26T18:18:33Z | Chris Jones (cjones@nceas.ucsb.edu)</p>
<p>David,</p>
<p>As Robert said, we can monitor cluster membership via log entries such as:<br>
<br>
[ INFO] 2014-03-21 15:59:00,488 (?:?:?) [160.36.13.150]:5701 [DataONE] </p>
<p>Members [1] {<br>
Member [160.36.13.150]:5701 this<br>
}<br>
<br>
In normal, non-split-brain operations, we would have 3 CNs in the cluster, and they would be listed by IP address in this log entry. When we upgrade CN software, we usually take them out of the cluster one at a time, so, a cluster membership of 2 can happen because of maintenance or because of a split-brain event. At the moment, we have purposefully isolated the three production CNs until we get the system metadata store consistent again, which is why you see only one member in the entry above.</p>
<p>We have three clusters we'll want to monitor: hzStorage, hzProcess, and hzSession.</p>
<p>The hzStorage cluster is run via Metacat, under Tomcat. It manages the shared system metadata map across CNs. As Robert mentioned, it logs to /var/metacat/logs/hazelcast-storage.log. </p>
<p>The hzProcess cluster is used to manage queues during d1_synchronization, d1_replication, etc., and it logs to /var/log/dataone/daemon/hazelcast-process.log.</p>
<p>The hzSession cluster is used in d1_portal to manage shared http session information across the CNs (basically a cookie-to-certificate mapping). From what I see in /var/lib/tomcat6/webapps/portal/WEB-INF/log4j.properties, it should be logging to /var/log/tomcat6/portal.log.</p>
<p>Monitoring all three clusters is a priority, but I'd say the hzStorage cluster is first priority because it involves a persistence layer (storing system metadata).</p>
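<p>A membership check along these lines could be expressed as a Splunk search, for example scheduled over the hzStorage log. This is a sketch: the index name, source pattern, and the threshold of 3 expected members are assumptions drawn from the comments above.</p>

```
index=hazelcast source="*hazelcast-storage.log" "Members ["
| rex "Members \[(?<member_count>\d+)\]"
| where tonumber(member_count) < 3
```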
<hr>
<p>https://redmine.dataone.org/issues/4459?journal_id=19316 | 2014-04-08T00:37:31Z | David Doyle (ddoyle2@utk.edu)</p>
<p>I now have a (crude) scheduled search and alert in place that searches the Hazelcast logs every minute for explicit close events and emails an alert with the relevant information (log text, source, etc.) when it encounters more than 0 events over the last minute. This will in all likelihood need to be tweaked, especially once all the fixes on prod are in place and things are back to normal, but it's a starting point until things are more stable and we can hook alerts into abnormalities in the log data with more confidence. I'm meeting with Bruce on Wednesday, and hopefully we can look at this in more detail then.</p>
<p>Just realized that I don't have hzSession cluster monitoring in place. Will put that in ASAP.</p>
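<p>For the record, a saved-search alert like the one described might look roughly like this in savedsearches.conf. The stanza name, index, and recipient address are placeholders of mine, not the actual production config.</p>

```ini
# savedsearches.conf (sketch; stanza name, index, and recipients assumed)
[Hazelcast explicit close]
search = index=hazelcast "Reason: Explicit close"
enableSched = 1
cron_schedule = * * * * *
dispatch.earliest_time = -1m
dispatch.latest_time = now
alert_type = number of events
alert_comparator = greater than
alert_threshold = 0
action.email = 1
action.email.to = ops@example.org
```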
<hr>
<p>https://redmine.dataone.org/issues/4459?journal_id=19348 | 2014-04-09T17:02:33Z | David Doyle (ddoyle2@utk.edu)</p>
<p>Per cjones, added Chris, Matt, Dave, Robert, Skye, and Ben to the recipients for this alert.</p>
<hr>
<p>https://redmine.dataone.org/issues/4459?journal_id=19493 | 2014-04-30T23:59:45Z | David Doyle (ddoyle2@utk.edu)</p>
<p>Broke this alert up into four, one per environment, each with its own subject line.</p>
<hr>
<p>https://redmine.dataone.org/issues/4459?journal_id=19597 | 2014-05-05T21:48:36Z | Skye Roseboom (sroseboo@dataone.unm.edu)</p>
<ul><li><strong>Target version</strong> changed from <i>2014.16-Block.2.4</i> to <i>2014.18-Block.3.1</i></li></ul>
<hr>
<p>https://redmine.dataone.org/issues/4459?journal_id=19676 | 2014-05-24T03:57:48Z | David Doyle (ddoyle2@utk.edu)</p>
<ul><li><strong>Remaining hours</strong> set to <i>0.0</i></li><li><strong>% Done</strong> changed from <i>0</i> to <i>100</i></li><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Closed</i></li></ul><p>Added Hazelcast session logs to the Splunk inputs on all CNs, including archived logs. Complete, pending alert changes and additions.</p>
<hr>
<p>https://redmine.dataone.org/issues/4459?journal_id=21549 | 2014-09-24T18:15:02Z | Robert Waltz</p>
<ul><li><strong>Target version</strong> changed from <i>2014.18-Block.3.1</i> to <i>Maintenance Backlog</i></li></ul>