Bug #1397
CN resource leak
% Done: 100%
Description
CNs degrade in performance over time, noticeably within less than a week, eventually reaching the point where the CN service fails to respond.
At one point, cn-ucsb-1.dataone.org was returning an error about the garbage collector being out of resources or some such.
Early investigation indicated the problem might be in the XSLT processor.
Lots of errors appear in the syslog, e.g.:
Mar 1 16:37:34 cn-ucsb-1 jsvc.exec[1766]: e(JkCoyoteHandler.java:190)
    at org.apache.jk.common.HandlerRequest.invoke(HandlerRequest.java:291)
    at org.apache.jk.common.ChannelSocket.invoke(ChannelSocket.java:769)
    at org.apache.jk.common.ChannelSocket.processConnection(ChannelSocket.java:698)
    at org.apache.jk.common.ChannelSocket$SocketConnection.runIt(ChannelSocket.java:891)
    at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:690)
    at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Unexpected packet type: 101
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1361)
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:175)
    ... 128 more
Error querying system metadata: An I/O error occured while sending to the backend.
Mar 1 16:37:34 cn-ucsb-1 jsvc.exec[1766]: CN Dispatching: /d1/object
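The root cause is not pinned down in this report, but the PostgreSQL "Unexpected packet type" and backend I/O errors above are the kind of failure that tends to appear when JDBC connections, statements, or result sets are leaked or shared unsafely under load. The following is a purely illustrative sketch - the class, query, and table name are hypothetical and are not taken from Metacat or the CN code - showing the Java 6-era pattern of releasing JDBC resources in a finally block; omitting that cleanup is the classic slow resource leak.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.sql.DataSource;

// Hypothetical example only; not the Metacat/CN code under investigation.
public class SystemMetadataQuery {
    private final DataSource pool;

    public SystemMetadataQuery(DataSource pool) {
        this.pool = pool;
    }

    public int countObjects() throws SQLException {
        Connection conn = pool.getConnection();
        PreparedStatement stmt = null;
        ResultSet rs = null;
        try {
            stmt = conn.prepareStatement("SELECT count(*) FROM systemmetadata");
            rs = stmt.executeQuery();
            rs.next();
            return rs.getInt(1);
        } finally {
            // Close in reverse order of acquisition; skipping this is the
            // classic slow leak that exhausts a connection pool over days.
            if (rs != null) try { rs.close(); } catch (SQLException ignored) {}
            if (stmt != null) try { stmt.close(); } catch (SQLException ignored) {}
            conn.close();
        }
    }
}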
History
#1 Updated by Dave Vieglais almost 14 years ago
- Target version set to Sprint-2011.09-Block.2
- Position set to 1
#2 Updated by Dave Vieglais almost 14 years ago
- Target version changed from Sprint-2011.09-Block.2 to Sprint-2011.10-Block.2
- Position deleted (4)
- Position set to 1
#3 Updated by Chris Jones almost 14 years ago
- Status changed from New to In Progress
- Assignee set to Chris Jones
#4 Updated by Chris Jones almost 14 years ago
I've modified the cn-ucsb-1.dataone.org Tomcat installation to increase resources and to monitor memory usage. I increased the PermGen size, since it quickly hit 62m when only 64m was allocated. I increased heap min and max to 8G, since Metacat performance relies on in-memory query caching, and with a catalog on the order of 50K docs caching can be an issue. I also enabled parallel garbage collection to take advantage of the multi-core architecture when the heap is large. Specifically, I added the following to JAVA_OPTS:
-Djava.rmi.server.hostname=128.111.220.46
-Dcom.sun.management.jmxremote.port=8686
-Dcom.sun.management.jmxremote.ssl=false
-XX:+UseParallelGC
-XX:MaxPermSize=128m
-Xms8192m
-Xmx8192m
I hit the CN heavily using the fuse client and saw garbage collection consistently occurring when heap usage was around 2G. No exceptions thus far, but I will continue monitoring.
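As a side note on the monitoring piece: the jmxremote options above expose the JVM on port 8686 without SSL, so heap usage can be sampled remotely with jconsole or a small JMX client. The snippet below is a minimal sketch of such a client against the host given in the options (128.111.220.46); it assumes the port is reachable and that JMX authentication has been disabled, which the options shown above do not by themselves guarantee.

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Illustrative heap check over the JMX port enabled in JAVA_OPTS above.
public class HeapCheck {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://128.111.220.46:8686/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            MemoryMXBean memory = ManagementFactory.newPlatformMXBeanProxy(
                    mbsc, ManagementFactory.MEMORY_MXBEAN_NAME, MemoryMXBean.class);
            MemoryUsage heap = memory.getHeapMemoryUsage();
            System.out.println("heap used MB: " + heap.getUsed() / (1024 * 1024)
                    + " / max MB: " + heap.getMax() / (1024 * 1024));
        } finally {
            connector.close();
        }
    }
}

Watching the used/max ratio over a few days is usually enough to tell a genuine leak (monotonic growth that garbage collection never recovers) from ordinary cache growth.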
#5 Updated by Chris Jones almost 14 years ago
- Status changed from In Progress to Closed