Project

General

Profile

Story #2763

WebTester is unresponsive and seems to hang

Added by Rob Nahf over 9 years ago. Updated almost 4 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
d1_integration
Target version:
Start date:
2012-08-10
Due date:
% Done:

100%

Story Points:
Sprint:

Description

The WebTester seems to hang, because it doesn't provide output until all tests are run, and tests now take multiple minutes to run. Also, the WebTester does hang, but it is difficult to differentiate between a long-running request and a true hang.

Need to make more responsive and informative to the end-user.

Continuing bug-fixes to WebTester in the related story #3367

stackdump (114 KB) Rob Nahf, 2012-08-15 21:36

jstack.out.2 (148 KB) Rob Nahf, 2012-08-15 21:36


Subtasks

Task #3164: allow tiers to be selected individually.ClosedRob Nahf

Task #3126: try running JUnit tests in parallelClosedRob Nahf

Task #3143: change the tomcat garbage collector ClosedRob Nahf

Task #3124: webTester seems unresponsive when running long testsClosedRob Nahf

Task #3125: refactor the web-test http servlet to stream test results as they come inClosedRob Nahf

Task #3163: split out known long-running tests into separate test classes by TierClosedRob Nahf

Task #3145: remove out.close() from TestRunnerHttpServletClosedRob Nahf


Related issues

Related to Infrastructure - Story #3367: WebTester crashes periodically Rejected 2014-10-01 2014-10-01

History

#1 Updated by Rob Nahf over 9 years ago

  • Category set to d1_integration
  • Assignee set to Rob Nahf

#2 Updated by Rob Nahf over 9 years ago

  • Position changed from 1 to 446
  • Target version changed from Sprint-2012.19-Block.3.2 to Sprint-2012.21-Block.3.3
  • Position set to 1
  • Position deleted (3)

#3 Updated by Rob Nahf over 9 years ago

At some point, we may need to investigate why the web tester crashes. So far, when it does crash, I haven't seen anything in syslog, or catalina.out that provides any clues.

#4 Updated by Rob Nahf over 9 years ago

  • Target version changed from Sprint-2012.21-Block.3.3 to Sprint-2012.23-Block.3.4
  • Position set to 1
  • Position deleted (447)

#5 Updated by Rob Nahf over 9 years ago

Some tests were using fromDate and toDate to get time-based slices of their respective lists, and were depending on the 'count' and 'total' attributes to terminate their loops properly. These were removed, because they apparently caused infinite loops.

#6 Updated by Rob Nahf over 9 years ago

  • % Done changed from 0 to 20
  • Status changed from New to In Progress

#7 Updated by Rob Nahf over 9 years ago

  • Position deleted (5)
  • Position changed from 1 to 462
  • Position set to 1
  • Target version changed from Sprint-2012.23-Block.3.4 to Sprint-2012.25-Block.4.1

#8 Updated by Rob Nahf over 9 years ago

  • Position deleted (473)
  • Position changed from 1 to 475
  • Position set to 1
  • Target version changed from Sprint-2012.25-Block.4.1 to Sprint-2012.27-Block.4.2

#9 Updated by Rob Nahf over 9 years ago

  • Position deleted (480)
  • Position set to 1
  • Target version changed from Sprint-2012.27-Block.4.2 to Sprint-2012.29-Block.4.3

#10 Updated by Rob Nahf over 9 years ago

  • Position deleted (1)
  • Target version changed from Sprint-2012.29-Block.4.3 to Sprint-2012.33-Block.5.1
  • Position set to 1

#11 Updated by Rob Nahf over 9 years ago

tomcat tends more to hang / become unresponsive more than it goes down.

troubleshooting a current hang, I find:
top - cpu is 100% idle
free - using zero swap, only 487M of 4G memory used
ps - nothing remarkable
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
tomcat6 20998 0.0 3.3 503136 156144 ? Sl Aug14 1:00 /usr/lib/jvm/java-6-sun/bin/java

jstack -F : showed no deadlocks

a thread-dump to catalina.out (via "kill -3") shows many threads in the BLOCKED state:

awk '/State: / { print }' < stackdump | sort | uniq -c

 32    java.lang.Thread.State: BLOCKED (on object monitor)
 14    java.lang.Thread.State: RUNNABLE
  1    java.lang.Thread.State: TIMED_WAITING (on object monitor)
  1    java.lang.Thread.State: TIMED_WAITING (sleeping)
  2    java.lang.Thread.State: WAITING (on object monitor)

and most of them at loadClass

awk '/State: BLOCKED/ { getline; print }' < stackdump | sort | uniq -c

  1     at java.lang.ClassLoader.loadClass(ClassLoader.java:291)
  3     at org.apache.catalina.core.StandardWrapper.allocate(StandardWrapper.java:804)
 28     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:293)

and

grep "waiting to lock" stackdump | sort

- waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader)
- waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader)
- waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader)
- waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader)
- waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader)
- waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader)
- waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader)
- waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader)
- waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader)
- waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader)
- waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader)
- waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader)
- waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader)
- waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader)
- waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader)
- waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader)
- waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader)
- waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader)
- waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader)
- waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader)
- waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader)
- waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader)
- waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader)
- waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader)
- waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader)
- waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader)
- waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader)
- waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader)
- waiting to lock <0x00000000f42eb988> (a org.apache.catalina.loader.StandardClassLoader)
- waiting to lock <0x00000000f464cb48> (a org.apache.catalina.core.StandardWrapper)
- waiting to lock <0x00000000f464cb48> (a org.apache.catalina.core.StandardWrapper)
- waiting to lock <0x00000000f464cb48> (a org.apache.catalina.core.StandardWrapper)

shows 28 waiting for the same lock/monitor - ending in 9180 and locked by "http-8080-2" thread
and 3 waiting for the same lock/monitor - ending in cb48 and locked by "http-8080-3" thread
and 1 waiting for a lock/monitor ending in b988 and also locked by "http-8080-3" thread

http-8080-3 thread is blocked, waiting for the monitor ending in 9180, so thread "http-8080-2" seems to be a key-log to the problem.

yet, http-8080-2 is in RUNNABLE state, so it's not clear why everything is stopped.

#12 Updated by Rob Nahf over 9 years ago

because there's not apparent cause, some possible explanations:
1) the ConcurrentMarkSweep garbage collector - the CNs use ParallelGC and do not experience problems
2) double-closing the outputStream (TestRunnerHttpServlet issues out.close() and I assume the container does, too)
3) Improperly portioned heap, although JSTAT command doesn't seem to indicate we're hitting limits, and Prof Google doesn't know of any multi-hour tomcat hangs due to GC
4) bouncy castle holding resources? - the "http-8080-2" thread remains in RUNNABLE state within a BounceyCastle routine. doubtful about this as a root cause.

see also the following for analyzing resource contention: https://sites.google.com/site/threaddumps/java-thread-dumps-2

#13 Updated by Rob Nahf over 9 years ago

attaching thread dump and jstat output (which is mostly redundant to the thread dump, but also shows the checks for deadlocks.

#14 Updated by Rob Nahf over 9 years ago

  • Position deleted (2)
  • Position set to 2

#15 Updated by Rob Nahf over 9 years ago

after separate hangs tried the following:
1. switched to ParallelGC, as is used in CNs. - still hangs
2. on two separate hangs, jstack traces blocked threads held up by a method call by BouncyCastle to load algorithms from uninstalled BC jars (see below).
Installed the missing jars on mncheck.test.dataone.org - still hanging.

"http-8080-2" daemon prio=10 tid=0x000000004162b000 nid=0x52b4 runnable [0x00007f90a2324000]
java.lang.Thread.State: RUNNABLE
at java.lang.ClassLoader.findBootstrapClass(Native Method)
at java.lang.ClassLoader.findBootstrapClassOrNull(ClassLoader.java:926)
at java.lang.ClassLoader.loadClass(ClassLoader.java:297)
- locked (a sun.misc.Launcher$ExtClassLoader)
at java.lang.ClassLoader.loadClass(ClassLoader.java:295)
- locked (a sun.misc.Launcher$AppClassLoader)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
- locked (a sun.misc.Launcher$AppClassLoader)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1399)
- locked (a java.lang.String)
at org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1329)
at org.bouncycastle.jce.provider.BouncyCastleProvider.loadAlgorithms(Unknown Source)
at org.bouncycastle.jce.provider.BouncyCastleProvider.setup(Unknown Source)
at org.bouncycastle.jce.provider.BouncyCastleProvider.access$000(Unknown Source)
at org.bouncycastle.jce.provider.BouncyCastleProvider$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at org.bouncycastle.jce.provider.BouncyCastleProvider.(Unknown Source)

  1. fixed the double-close problem of the input stream with change to d1_integration TestRunnerHttpServlet - still having problems, but this time not a hang.

#16 Updated by Rob Nahf over 9 years ago

tomcat on mncheck crashed with SIGSEGV, during GC. Appears to be a known HotSpot / GC bug (http://bugs.sun.com/view_bug.do;jsessionid=6eacdf513d1808ffffffffbcab6a0c182fab9?bug_id=6896647)

will apply the workaround. Noticing also that PermGen is at 99%, although didn't get a PermGen log message. PermGen full errors are thought to be more related to memory leaks than improper sizing. WebTesters long processes probably put more things into PermGen. will run jmap -permgen to see what's going there.

from the logs:

445938.448: [GC [PSYoungGen: 5504K->64K(5568K)] 15537K->10153K(54720K), 0.0016570 secs] [Times: user=0.00 sys=0.00, real
=0.00 secs]
446248.492: [GC [PSYoungGen: 5504K->64K(5568K)] 15593K->10209K(54720K), 0.0016810 secs] [Times: user=0.00 sys=0.00, real
=0.01 secs]
446421.280: [GC [PSYoungGen: 3242K->32K(5568K)] 13388K->10233K(54720K), 0.0016860 secs] [Times: user=0.00 sys=0.00, real
=0.01 secs]
446421.282: [Full GC (System) [PSYoungGen: 32K->0K(5568K)] [PSOldGen: 10201K->10207K(49152K)] 10233K->10207K(54720K) [PS
PermGen: 35714K->35714K(36032K)], 0.1750370 secs] [Times: user=0.17 sys=0.00, real=0.17 secs]
446738.558: [GC [PSYoungGen: 5440K->64K(5568K)] 15647K->10271K(54720K), 0.0015660 secs] [Times: user=0.00 sys=0.00, real
=0.00 secs]
447048.602: [GC [PSYoungGen: 5504K->64K(5568K)] 15711K->10327K(54720K), 0.0016890 secs] [Times: user=0.00 sys=0.00, real
=0.00 secs]
447358.645: [GC [PSYoungGen: 5504K->64K(5568K)] 15767K->10383K(54720K), 0.0016880 secs] [Times: user=0.00 sys=0.00, real
=0.00 secs]
447668.688: [GC [PSYoungGen: 5504K->64K(5568K)] 15823K->10439K(54720K), 0.0016900 secs] [Times: user=0.00 sys=0.00, real
=0.00 secs]
447978.734: [GC [PSYoungGen: 5504K->64K(5568K)] 15879K->10495K(54720K), 0.0016710 secs] [Times: user=0.00 sys=0.00, real
=0.00 secs]
448288.779: [GC [PSYoungGen: 5504K->64K(5568K)] 15935K->10559K(54720K), 0.0016570 secs] [Times: user=0.01 sys=0.00, real
=0.00 secs]
448598.825: [GC [PSYoungGen: 5504K->64K(5568K)] 15999K->10615K(54720K), 0.0016430 secs] [Times: user=0.00 sys=0.01, real
=0.00 secs]
448908.871: [GC [PSYoungGen: 5504K->64K(5568K)] 16055K->10671K(54720K), 0.0016730 secs] [Times: user=0.00 sys=0.00, real
=0.01 secs]
449228.916: [GC [PSYoungGen: 5504K->64K(5568K)] 16111K->10727K(54720K), 0.0016650 secs] [Times: user=0.00 sys=0.00, real
=0.00 secs]
449423.399: [GC#

A fatal error has been detected by the Java Runtime Environment:

#

SIGSEGV (0xb) at pc=0x00007f627ede6779, pid=21521, tid=140060941502208

#

JRE version: 6.0_26-b03

Java VM: Java HotSpot(TM) 64-Bit Server VM (20.1-b02 mixed mode linux-amd64 compressed oops)

Problematic frame:

V [libjvm.so+0x760779] PSPromotionManager::copy_to_survivor_space(oopDesc*)+0x89

#

An error report file with more information is saved as:

/tmp/hs_err_pid21521.log

#

If you would like to submit a bug report, please visit:

http://java.sun.com/webapps/bugreport/crash.jsp

#

#17 Updated by Rob Nahf over 9 years ago

  • Position deleted (3)
  • Position set to 1
  • Target version changed from Sprint-2012.33-Block.5.1 to Sprint-2012.35-Block.5.2

#18 Updated by Rob Nahf about 9 years ago

  • Position deleted (10)
  • Position changed from 1 to 492
  • Position set to 1
  • Target version changed from Sprint-2012.35-Block.5.2 to Sprint-2012.39-Block.5.4

#19 Updated by Rob Nahf about 9 years ago

  • Target version changed from Sprint-2012.39-Block.5.4 to Sprint-2012.41-Block.6.1
  • Due date set to 2012-10-27

#20 Updated by Rob Nahf about 9 years ago

  • Subject changed from make WebTester more robust to WebTester is unresponsive and seems to hang
  • Description updated (diff)

#21 Updated by Rob Nahf about 9 years ago

  • Description updated (diff)

#22 Updated by Rob Nahf about 9 years ago

  • Target version deleted (Sprint-2012.41-Block.6.1)
  • Start date deleted (2012-08-10)
  • Due date deleted (2012-10-27)

#23 Updated by Rob Nahf about 7 years ago

  • Due date set to 2014-10-02
  • Start date set to 2014-10-02
  • Target version set to Maintenance Backlog

Rob to review and decide what's needed to keep open.

#24 Updated by Dave Vieglais almost 4 years ago

  • Status changed from In Progress to Closed
  • % Done changed from 30 to 100

Also available in: Atom PDF

Add picture from clipboard (Maximum size: 14.8 MB)