Bug #3402
synchronization lags in dev
100%
Description
At least fo R client test, synchronization is taking wildly different amounts of time to sync objects. Looking at the cn-synchronization logs, I was able to use the timestamps in the identifiers to determine the delays. Attached is a spreadsheet with the line-item results. The delays are consistent across all 3 CNs, and break out into 4 basic groups:
a many-day delay group (looks like 2 periodic updates on Nov 4 and 9th)
a 22 minute delay group (one run of R tests)
a 2-4 minute range (normal?)
a 10 hour delay group (current)
History
#1 Updated by Rob Nahf about 12 years ago
fwiw: a recent top from cn-dev-ucsb-1 shows a somewhat high load comparatively:
top - 13:28:22 up 114 days, 17:27, 1 user, load average: 2.83, 2.89, 2.77
Tasks: 327 total, 1 running, 326 sleeping, 0 stopped, 0 zombie
Cpu(s): 3.6%us, 3.4%sy, 0.0%ni, 93.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 8193652k total, 7875420k used, 318232k free, 330236k buffers
Swap: 2027512k total, 5784k used, 2021728k free, 4585916k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8578 tomcat6 20 0 5475m 698m 12m S 66 8.7 784:00.99 jsvc
8656 tomcat6 20 0 5145m 579m 12m S 16 7.2 388:27.17 jsvc
7419 tomcat6 20 0 9152m 706m 14m S 8 8.8 213:41.45 java
10548 rnahf 20 0 11152 1432 936 R 1 0.0 0:00.14 top
8617 tomcat6 20 0 4473m 423m 11m S 1 5.3 18:48.06 jsvc
34 root 20 0 0 0 0 S 0 0.0 11:08.47 ksoftirqd/10
1152 root 20 0 11284 380 300 S 0 0.0 22:30.21 irqbalance
From unm-1, load is less
top - 13:24:36 up 114 days, 17:23, 1 user, load average: 2.03, 1.84, 1.74
Tasks: 175 total, 1 running, 174 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.0%sy, 0.0%ni, 99.9%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 33018204k total, 29065652k used, 3952552k free, 487508k buffers
Swap: 47415288k total, 0k used, 47415288k free, 24346228k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8131 tomcat6 20 0 5403m 804m 12m S 17 2.5 189:53.67 jsvc
32758 tomcat6 20 0 9146m 701m 14m S 2 2.2 65:37.56 java
8202 tomcat6 20 0 5081m 519m 12m S 1 1.6 96:29.05 jsvc
8163 tomcat6 20 0 4447m 411m 11m S 0 1.3 5:28.51 jsvc
10039 rnahf 20 0 11020 1332 972 R 0 0.0 0:00.08 top
1 root 20 0 17344 1844 1236 S 0 0.0 0:10.49 init
2 root 20 0 0 0 0 S 0 0.0 0:00.12 kthreadd
as with orc:
top - 13:26:28 up 114 days, 17:25, 2 users, load average: 1.57, 2.31, 2.29
Tasks: 179 total, 1 running, 178 sleeping, 0 stopped, 0 zombie
Cpu(s): 3.6%us, 2.1%sy, 0.0%ni, 94.3%id, 0.0%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 33018204k total, 25964556k used, 7053648k free, 316416k buffers
Swap: 1757176k total, 0k used, 1757176k free, 21555996k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
576 tomcat6 20 0 5530m 622m 12m S 13 1.9 153:06.33 jsvc
649 tomcat6 20 0 5142m 548m 12m S 4 1.7 94:51.70 jsvc
31531 tomcat6 20 0 9152m 691m 12m S 2 2.1 58:09.58 java
1023 root 20 0 50372 3076 2460 S 1 0.0 53:17.78 vmtoolsd
610 tomcat6 20 0 4410m 440m 11m S 0 1.4 4:49.78 jsvc
8951 rnahf 20 0 11020 1296 936 R 0 0.0 0:00.03 top
1 root 20 0 63876 2952 1936 S 0 0.0 0:10.63 init
#2 Updated by Robert Waltz about 12 years ago
- Status changed from New to In Progress
Execution of Jobs failing with the following exception:
[ERROR] 2012-11-15 23:38:07,543 (StdOutErrLog$2:print:51) java.lang.IllegalStateException: Node failed to start!
[ERROR] 2012-11-15 23:38:07,544 (StdOutErrLog$2:print:51) at com.hazelcast.impl.FactoryImpl.(FactoryImpl.j
ava:398)
[ERROR] 2012-11-15 23:38:07,544 (StdOutErrLog$2:print:51) at com.hazelcast.impl.FactoryImpl.newHazelcastInstance
Proxy(FactoryImpl.java:137)
[ERROR] 2012-11-15 23:38:07,545 (StdOutErrLog$2:print:51) at com.hazelcast.impl.FactoryImpl.newHazelcastInstance
Proxy(FactoryImpl.java:123)
[ERROR] 2012-11-15 23:38:07,545 (StdOutErrLog$2:print:51) at com.hazelcast.impl.FactoryImpl.newHazelcastInstance
Proxy(FactoryImpl.java:108)
[ERROR] 2012-11-15 23:38:07,546 (StdOutErrLog$2:print:51) at com.hazelcast.core.Hazelcast.getDefaultInstance(Haz
elcast.java:96)
[ERROR] 2012-11-15 23:38:07,546 (StdOutErrLog$2:print:51) at com.hazelcast.core.Hazelcast.getExecutorService(Haz
elcast.java:265)
[ERROR] 2012-11-15 23:38:07,546 (StdOutErrLog$2:print:51) at org.dataone.cn.batch.synchronization.jobs.MemberNod
eHarvestJob.execute(MemberNodeHarvestJob.java:84)
[ERROR] 2012-11-15 23:38:07,547 (StdOutErrLog$2:print:51) at org.quartz.core.JobRunShell.run(JobRunShell.java:21
3)
[ERROR] 2012-11-15 23:38:07,547 (StdOutErrLog$2:print:51) at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(
SimpleThreadPool.java:557)
I will review how the executor service should be configured and re-write some code.
#3 Updated by Dave Vieglais over 11 years ago
- Start date set to 2013-08-25
- Due date set to 2013-09-07
- Target version set to 2013.35-Block.5.1
#4 Updated by Robert Waltz about 11 years ago
- Target version deleted (
2013.35-Block.5.1) - Due date deleted (
2013-09-07) - Start date deleted (
2013-08-25) - Product Version set to *
#5 Updated by Robert Waltz over 10 years ago
- Status changed from In Progress to Closed
Have never been able to reproduce.