Story #2763
WebTester is unresponsive and seems to hang
100%
Description
The WebTester seems to hang, because it doesn't provide output until all tests are run, and tests now take multiple minutes to run. Also, the WebTester does hang, but it is difficult to differentiate between a long-running request and a true hang.
Need to make more responsive and informative to the end-user.
Continuing bug-fixes to WebTester in the related story #3367
Subtasks
Related issues
History
#1 Updated by Rob Nahf over 12 years ago
- Category set to d1_integration
- Assignee set to Rob Nahf
#2 Updated by Rob Nahf over 12 years ago
- Position changed from 1 to 446
- Target version changed from Sprint-2012.19-Block.3.2 to Sprint-2012.21-Block.3.3
- Position set to 1
- Position deleted (
3)
#3 Updated by Rob Nahf over 12 years ago
At some point, we may need to investigate why the web tester crashes. So far, when it does crash, I haven't seen anything in syslog, or catalina.out that provides any clues.
#4 Updated by Rob Nahf over 12 years ago
- Target version changed from Sprint-2012.21-Block.3.3 to Sprint-2012.23-Block.3.4
- Position set to 1
- Position deleted (
447)
#5 Updated by Rob Nahf over 12 years ago
Some tests were using fromDate and toDate to get time-based slices of their respective lists, and were depending on the 'count' and 'total' attributes to terminate their loops properly. These were removed, because they apparently caused infinite loops.
#6 Updated by Rob Nahf over 12 years ago
- % Done changed from 0 to 20
- Status changed from New to In Progress
#7 Updated by Rob Nahf over 12 years ago
- Position deleted (
5) - Position changed from 1 to 462
- Position set to 1
- Target version changed from Sprint-2012.23-Block.3.4 to Sprint-2012.25-Block.4.1
#8 Updated by Rob Nahf over 12 years ago
- Position deleted (
473) - Position changed from 1 to 475
- Position set to 1
- Target version changed from Sprint-2012.25-Block.4.1 to Sprint-2012.27-Block.4.2
#9 Updated by Rob Nahf over 12 years ago
- Position deleted (
480) - Position set to 1
- Target version changed from Sprint-2012.27-Block.4.2 to Sprint-2012.29-Block.4.3
#10 Updated by Rob Nahf over 12 years ago
- Position deleted (
1) - Target version changed from Sprint-2012.29-Block.4.3 to Sprint-2012.33-Block.5.1
- Position set to 1
#11 Updated by Rob Nahf over 12 years ago
tomcat tends more to hang / become unresponsive more than it goes down.
troubleshooting a current hang, I find:
top - cpu is 100% idle
free - using zero swap, only 487M of 4G memory used
ps - nothing remarkable
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
tomcat6 20998 0.0 3.3 503136 156144 ? Sl Aug14 1:00 /usr/lib/jvm/java-6-sun/bin/java
jstack -F : showed no deadlocks
a thread-dump to catalina.out (via "kill -3") shows many threads in the BLOCKED state:
awk '/State: / { print }' < stackdump | sort | uniq -c¶
32 java.lang.Thread.State: BLOCKED (on object monitor) 14 java.lang.Thread.State: RUNNABLE 1 java.lang.Thread.State: TIMED_WAITING (on object monitor) 1 java.lang.Thread.State: TIMED_WAITING (sleeping) 2 java.lang.Thread.State: WAITING (on object monitor)
and most of them at loadClass
awk '/State: BLOCKED/ { getline; print }' < stackdump | sort | uniq -c¶
1 at java.lang.ClassLoader.loadClass(ClassLoader.java:291) 3 at org.apache.catalina.core.StandardWrapper.allocate(StandardWrapper.java:804) 28 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:293)
and
grep "waiting to lock" stackdump | sort¶
- waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader) - waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader) - waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader) - waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader) - waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader) - waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader) - waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader) - waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader) - waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader) - waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader) - waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader) - waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader) - waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader) - waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader) - waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader) - waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader) - waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader) - waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader) - waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader) - waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader) - waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader) - waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader) - waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader) - waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader) - waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader) - waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader) - waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader) - waiting to lock <0x00000000f42d9180> (a sun.misc.Launcher$AppClassLoader) - waiting to lock <0x00000000f42eb988> (a org.apache.catalina.loader.StandardClassLoader) - waiting to lock <0x00000000f464cb48> (a org.apache.catalina.core.StandardWrapper) - waiting to lock <0x00000000f464cb48> (a org.apache.catalina.core.StandardWrapper) - waiting to lock <0x00000000f464cb48> (a org.apache.catalina.core.StandardWrapper)
shows 28 waiting for the same lock/monitor - ending in 9180 and locked by "http-8080-2" thread
and 3 waiting for the same lock/monitor - ending in cb48 and locked by "http-8080-3" thread
and 1 waiting for a lock/monitor ending in b988 and also locked by "http-8080-3" thread
http-8080-3 thread is blocked, waiting for the monitor ending in 9180, so thread "http-8080-2" seems to be a key-log to the problem.
yet, http-8080-2 is in RUNNABLE state, so it's not clear why everything is stopped.
#12 Updated by Rob Nahf over 12 years ago
because there's not apparent cause, some possible explanations:
1) the ConcurrentMarkSweep garbage collector - the CNs use ParallelGC and do not experience problems
2) double-closing the outputStream (TestRunnerHttpServlet issues out.close() and I assume the container does, too)
3) Improperly portioned heap, although JSTAT command doesn't seem to indicate we're hitting limits, and Prof Google doesn't know of any multi-hour tomcat hangs due to GC
4) bouncy castle holding resources? - the "http-8080-2" thread remains in RUNNABLE state within a BounceyCastle routine. doubtful about this as a root cause.
see also the following for analyzing resource contention: https://sites.google.com/site/threaddumps/java-thread-dumps-2
#13 Updated by Rob Nahf over 12 years ago
- File jstack.out.2 added
- File stackdump added
attaching thread dump and jstat output (which is mostly redundant to the thread dump, but also shows the checks for deadlocks.
#14 Updated by Rob Nahf about 12 years ago
- Position deleted (
2) - Position set to 2
#15 Updated by Rob Nahf about 12 years ago
after separate hangs tried the following:
1. switched to ParallelGC, as is used in CNs. - still hangs
2. on two separate hangs, jstack traces blocked threads held up by a method call by BouncyCastle to load algorithms from uninstalled BC jars (see below).
Installed the missing jars on mncheck.test.dataone.org - still hanging.
"http-8080-2" daemon prio=10 tid=0x000000004162b000 nid=0x52b4 runnable [0x00007f90a2324000]
java.lang.Thread.State: RUNNABLE
at java.lang.ClassLoader.findBootstrapClass(Native Method)
at java.lang.ClassLoader.findBootstrapClassOrNull(ClassLoader.java:926)
at java.lang.ClassLoader.loadClass(ClassLoader.java:297)
- locked (a sun.misc.Launcher$ExtClassLoader)
at java.lang.ClassLoader.loadClass(ClassLoader.java:295)
- locked (a sun.misc.Launcher$AppClassLoader)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
- locked (a sun.misc.Launcher$AppClassLoader)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1399)
- locked (a java.lang.String)
at org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1329)
at org.bouncycastle.jce.provider.BouncyCastleProvider.loadAlgorithms(Unknown Source)
at org.bouncycastle.jce.provider.BouncyCastleProvider.setup(Unknown Source)
at org.bouncycastle.jce.provider.BouncyCastleProvider.access$000(Unknown Source)
at org.bouncycastle.jce.provider.BouncyCastleProvider$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at org.bouncycastle.jce.provider.BouncyCastleProvider.(Unknown Source)
- fixed the double-close problem of the input stream with change to d1_integration TestRunnerHttpServlet - still having problems, but this time not a hang.
#16 Updated by Rob Nahf about 12 years ago
tomcat on mncheck crashed with SIGSEGV, during GC. Appears to be a known HotSpot / GC bug (http://bugs.sun.com/view_bug.do;jsessionid=6eacdf513d1808ffffffffbcab6a0c182fab9?bug_id=6896647)
will apply the workaround. Noticing also that PermGen is at 99%, although didn't get a PermGen log message. PermGen full errors are thought to be more related to memory leaks than improper sizing. WebTesters long processes probably put more things into PermGen. will run jmap -permgen to see what's going there.
from the logs:
445938.448: [GC [PSYoungGen: 5504K->64K(5568K)] 15537K->10153K(54720K), 0.0016570 secs] [Times: user=0.00 sys=0.00, real
=0.00 secs]
446248.492: [GC [PSYoungGen: 5504K->64K(5568K)] 15593K->10209K(54720K), 0.0016810 secs] [Times: user=0.00 sys=0.00, real
=0.01 secs]
446421.280: [GC [PSYoungGen: 3242K->32K(5568K)] 13388K->10233K(54720K), 0.0016860 secs] [Times: user=0.00 sys=0.00, real
=0.01 secs]
446421.282: [Full GC (System) [PSYoungGen: 32K->0K(5568K)] [PSOldGen: 10201K->10207K(49152K)] 10233K->10207K(54720K) [PS
PermGen: 35714K->35714K(36032K)], 0.1750370 secs] [Times: user=0.17 sys=0.00, real=0.17 secs]
446738.558: [GC [PSYoungGen: 5440K->64K(5568K)] 15647K->10271K(54720K), 0.0015660 secs] [Times: user=0.00 sys=0.00, real
=0.00 secs]
447048.602: [GC [PSYoungGen: 5504K->64K(5568K)] 15711K->10327K(54720K), 0.0016890 secs] [Times: user=0.00 sys=0.00, real
=0.00 secs]
447358.645: [GC [PSYoungGen: 5504K->64K(5568K)] 15767K->10383K(54720K), 0.0016880 secs] [Times: user=0.00 sys=0.00, real
=0.00 secs]
447668.688: [GC [PSYoungGen: 5504K->64K(5568K)] 15823K->10439K(54720K), 0.0016900 secs] [Times: user=0.00 sys=0.00, real
=0.00 secs]
447978.734: [GC [PSYoungGen: 5504K->64K(5568K)] 15879K->10495K(54720K), 0.0016710 secs] [Times: user=0.00 sys=0.00, real
=0.00 secs]
448288.779: [GC [PSYoungGen: 5504K->64K(5568K)] 15935K->10559K(54720K), 0.0016570 secs] [Times: user=0.01 sys=0.00, real
=0.00 secs]
448598.825: [GC [PSYoungGen: 5504K->64K(5568K)] 15999K->10615K(54720K), 0.0016430 secs] [Times: user=0.00 sys=0.01, real
=0.00 secs]
448908.871: [GC [PSYoungGen: 5504K->64K(5568K)] 16055K->10671K(54720K), 0.0016730 secs] [Times: user=0.00 sys=0.00, real
=0.01 secs]
449228.916: [GC [PSYoungGen: 5504K->64K(5568K)] 16111K->10727K(54720K), 0.0016650 secs] [Times: user=0.00 sys=0.00, real
=0.00 secs]
449423.399: [GC#
A fatal error has been detected by the Java Runtime Environment:¶
#
SIGSEGV (0xb) at pc=0x00007f627ede6779, pid=21521, tid=140060941502208¶
#
JRE version: 6.0_26-b03¶
Java VM: Java HotSpot(TM) 64-Bit Server VM (20.1-b02 mixed mode linux-amd64 compressed oops)¶
Problematic frame:¶
V [libjvm.so+0x760779] PSPromotionManager::copy_to_survivor_space(oopDesc*)+0x89¶
#
An error report file with more information is saved as:¶
/tmp/hs_err_pid21521.log¶
#
If you would like to submit a bug report, please visit:¶
http://java.sun.com/webapps/bugreport/crash.jsp¶
#
#17 Updated by Rob Nahf about 12 years ago
- Position deleted (
3) - Position set to 1
- Target version changed from Sprint-2012.33-Block.5.1 to Sprint-2012.35-Block.5.2
#18 Updated by Rob Nahf about 12 years ago
- Position deleted (
10) - Position changed from 1 to 492
- Position set to 1
- Target version changed from Sprint-2012.35-Block.5.2 to Sprint-2012.39-Block.5.4
#19 Updated by Rob Nahf about 12 years ago
- Target version changed from Sprint-2012.39-Block.5.4 to Sprint-2012.41-Block.6.1
- Due date set to 2012-10-27
#20 Updated by Rob Nahf about 12 years ago
- Subject changed from make WebTester more robust to WebTester is unresponsive and seems to hang
- Description updated (diff)
#21 Updated by Rob Nahf about 12 years ago
- Description updated (diff)
#22 Updated by Rob Nahf about 12 years ago
- Target version deleted (
Sprint-2012.41-Block.6.1) - Start date deleted (
2012-08-10) - Due date deleted (
2012-10-27)
#23 Updated by Rob Nahf about 10 years ago
- Due date set to 2014-10-02
- Start date set to 2014-10-02
- Target version set to Maintenance Backlog
Rob to review and decide what's needed to keep open.
#24 Updated by Dave Vieglais almost 7 years ago
- Status changed from In Progress to Closed
- % Done changed from 30 to 100