Bug #1397
CN resource leak
% Done: 100%
Description
CNs degrade in performance over time, noticeably within less than a week, eventually reaching the point where the CN service fails to respond.
At one point, cn-ucsb-1.dataone.org was returning an error about the garbage collector being out of resources or some such.
Early investigation indicated the problem might be in the XSLT processor.
Lots of errors appear in the syslog, e.g.:
Mar 1 16:37:34 cn-ucsb-1 jsvc.exec[1766]: e(JkCoyoteHandler.java:190)
    at org.apache.jk.common.HandlerRequest.invoke(HandlerRequest.java:291)
    at org.apache.jk.common.ChannelSocket.invoke(ChannelSocket.java:769)
    at org.apache.jk.common.ChannelSocket.processConnection(ChannelSocket.java:698)
    at org.apache.jk.common.ChannelSocket$SocketConnection.runIt(ChannelSocket.java:891)
    at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:690)
    at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Unexpected packet type: 101
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1361)
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:175)
    ... 128 more
Error querying system metadata: An I/O error occured while sending to the backend.
Mar 1 16:37:34 cn-ucsb-1 jsvc.exec[1766]: CN Dispatching: /d1/object
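The root cause is not pinned down in this report, but the PostgreSQL "Unexpected packet type" and backend I/O errors above are the kind of failure that tends to appear when JDBC connections, statements, or result sets are leaked or shared unsafely under load. The following is a purely illustrative sketch - the class, query, and table name are hypothetical and are not taken from Metacat or the CN code - showing the Java 6-era pattern of releasing JDBC resources in a finally block; omitting that cleanup is the classic slow resource leak.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.sql.DataSource;

// Hypothetical example only; not the Metacat/CN code under investigation.
public class SystemMetadataQuery {
    private final DataSource pool;

    public SystemMetadataQuery(DataSource pool) {
        this.pool = pool;
    }

    public int countObjects() throws SQLException {
        Connection conn = pool.getConnection();
        PreparedStatement stmt = null;
        ResultSet rs = null;
        try {
            stmt = conn.prepareStatement("SELECT count(*) FROM systemmetadata");
            rs = stmt.executeQuery();
            rs.next();
            return rs.getInt(1);
        } finally {
            // Close in reverse order of acquisition; skipping this is the
            // classic slow leak that exhausts a connection pool over days.
            if (rs != null) try { rs.close(); } catch (SQLException ignored) {}
            if (stmt != null) try { stmt.close(); } catch (SQLException ignored) {}
            conn.close();
        }
    }
}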
History
#1 Updated by Dave Vieglais almost 14 years ago
- Target version set to Sprint-2011.09-Block.2
- Position set to 1
#2 Updated by Dave Vieglais almost 14 years ago
- Target version changed from Sprint-2011.09-Block.2 to Sprint-2011.10-Block.2
- Position deleted (4)
- Position set to 1
#3 Updated by Chris Jones almost 14 years ago
- Status changed from New to In Progress
- Assignee set to Chris Jones
#4 Updated by Chris Jones almost 14 years ago
I've modified the cn-ucsb-1.dataone.org Tomcat installation to increase resources and to monitor memory usage. I increased the PermGen size, since it quickly hit 62m when only 64m was allocated. I increased heap min and max to 8G, since Metacat performance relies on in-memory query caching, and with a catalog on the order of 50K docs caching can be an issue. I also enabled parallel garbage collection to take advantage of the multi-core architecture when the heap is large. Specifically, I added the following to JAVA_OPTS:
-Djava.rmi.server.hostname=128.111.220.46
-Dcom.sun.management.jmxremote.port=8686
-Dcom.sun.management.jmxremote.ssl=false
-XX:+UseParallelGC
-XX:MaxPermSize=128m
-Xms8192m
-Xmx8192m
I hit the CN heavily using the fuse client and saw garbage collection consistently occurring when heap usage was around 2G. No exceptions thus far, but I will continue monitoring.
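As a side note on the monitoring piece: the jmxremote options above expose the JVM on port 8686 without SSL, so heap usage can be sampled remotely with jconsole or a small JMX client. The snippet below is a minimal sketch of such a client against the host given in the options (128.111.220.46); it assumes the port is reachable and that JMX authentication has been disabled, which the options shown above do not by themselves guarantee.

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Illustrative heap check over the JMX port enabled in JAVA_OPTS above.
public class HeapCheck {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://128.111.220.46:8686/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            MemoryMXBean memory = ManagementFactory.newPlatformMXBeanProxy(
                    mbsc, ManagementFactory.MEMORY_MXBEAN_NAME, MemoryMXBean.class);
            MemoryUsage heap = memory.getHeapMemoryUsage();
            System.out.println("heap used MB: " + heap.getUsed() / (1024 * 1024)
                    + " / max MB: " + heap.getMax() / (1024 * 1024));
        } finally {
            connector.close();
        }
    }
}

Watching the used/max ratio over a few days is usually enough to tell a genuine leak (monotonic growth that garbage collection never recovers) from ordinary cache growth.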
#5 Updated by Chris Jones almost 14 years ago
- Status changed from In Progress to Closed