Project

General

Profile

Bug #3402

synchronization lags in dev

Added by Rob Nahf about 12 years ago. Updated over 10 years ago.

Status:
Closed
Priority:
High
Assignee:
Robert Waltz
Category:
d1_synchronization
Target version:
-
Start date:
Due date:
% Done:

100%

Milestone:
None
Product Version:
*
Story Points:
Sprint:

Description

At least fo R client test, synchronization is taking wildly different amounts of time to sync objects. Looking at the cn-synchronization logs, I was able to use the timestamps in the identifiers to determine the delays. Attached is a spreadsheet with the line-item results. The delays are consistent across all 3 CNs, and break out into 4 basic groups:

a many-day delay group (looks like 2 periodic updates on Nov 4 and 9th)
a 22 minute delay group (one run of R tests)
a 2-4 minute range (normal?)
a 10 hour delay group (current)

dev-sync-lags.xlsx (74.4 KB) Rob Nahf, 2012-11-16 13:11

History

#1 Updated by Rob Nahf about 12 years ago

fwiw: a recent top from cn-dev-ucsb-1 shows a somewhat high load comparatively:

top - 13:28:22 up 114 days, 17:27, 1 user, load average: 2.83, 2.89, 2.77
Tasks: 327 total, 1 running, 326 sleeping, 0 stopped, 0 zombie
Cpu(s): 3.6%us, 3.4%sy, 0.0%ni, 93.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 8193652k total, 7875420k used, 318232k free, 330236k buffers
Swap: 2027512k total, 5784k used, 2021728k free, 4585916k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

8578 tomcat6 20 0 5475m 698m 12m S 66 8.7 784:00.99 jsvc

8656 tomcat6 20 0 5145m 579m 12m S 16 7.2 388:27.17 jsvc

7419 tomcat6 20 0 9152m 706m 14m S 8 8.8 213:41.45 java

10548 rnahf 20 0 11152 1432 936 R 1 0.0 0:00.14 top

8617 tomcat6 20 0 4473m 423m 11m S 1 5.3 18:48.06 jsvc

34 root 20 0 0 0 0 S 0 0.0 11:08.47 ksoftirqd/10

1152 root 20 0 11284 380 300 S 0 0.0 22:30.21 irqbalance

From unm-1, load is less

top - 13:24:36 up 114 days, 17:23, 1 user, load average: 2.03, 1.84, 1.74
Tasks: 175 total, 1 running, 174 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.0%sy, 0.0%ni, 99.9%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 33018204k total, 29065652k used, 3952552k free, 487508k buffers
Swap: 47415288k total, 0k used, 47415288k free, 24346228k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

8131 tomcat6 20 0 5403m 804m 12m S 17 2.5 189:53.67 jsvc

32758 tomcat6 20 0 9146m 701m 14m S 2 2.2 65:37.56 java

8202 tomcat6 20 0 5081m 519m 12m S 1 1.6 96:29.05 jsvc

8163 tomcat6 20 0 4447m 411m 11m S 0 1.3 5:28.51 jsvc

10039 rnahf 20 0 11020 1332 972 R 0 0.0 0:00.08 top

1 root 20 0 17344 1844 1236 S 0 0.0 0:10.49 init

2 root 20 0 0 0 0 S 0 0.0 0:00.12 kthreadd

as with orc:

top - 13:26:28 up 114 days, 17:25, 2 users, load average: 1.57, 2.31, 2.29
Tasks: 179 total, 1 running, 178 sleeping, 0 stopped, 0 zombie
Cpu(s): 3.6%us, 2.1%sy, 0.0%ni, 94.3%id, 0.0%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 33018204k total, 25964556k used, 7053648k free, 316416k buffers
Swap: 1757176k total, 0k used, 1757176k free, 21555996k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

576 tomcat6 20 0 5530m 622m 12m S 13 1.9 153:06.33 jsvc

649 tomcat6 20 0 5142m 548m 12m S 4 1.7 94:51.70 jsvc

31531 tomcat6 20 0 9152m 691m 12m S 2 2.1 58:09.58 java

1023 root 20 0 50372 3076 2460 S 1 0.0 53:17.78 vmtoolsd

610 tomcat6 20 0 4410m 440m 11m S 0 1.4 4:49.78 jsvc

8951 rnahf 20 0 11020 1296 936 R 0 0.0 0:00.03 top

1 root 20 0 63876 2952 1936 S 0 0.0 0:10.63 init

#2 Updated by Robert Waltz about 12 years ago

  • Status changed from New to In Progress

Execution of Jobs failing with the following exception:

[ERROR] 2012-11-15 23:38:07,543 (StdOutErrLog$2:print:51) java.lang.IllegalStateException: Node failed to start!
[ERROR] 2012-11-15 23:38:07,544 (StdOutErrLog$2:print:51) at com.hazelcast.impl.FactoryImpl.(FactoryImpl.j
ava:398)
[ERROR] 2012-11-15 23:38:07,544 (StdOutErrLog$2:print:51) at com.hazelcast.impl.FactoryImpl.newHazelcastInstance
Proxy(FactoryImpl.java:137)
[ERROR] 2012-11-15 23:38:07,545 (StdOutErrLog$2:print:51) at com.hazelcast.impl.FactoryImpl.newHazelcastInstance
Proxy(FactoryImpl.java:123)
[ERROR] 2012-11-15 23:38:07,545 (StdOutErrLog$2:print:51) at com.hazelcast.impl.FactoryImpl.newHazelcastInstance
Proxy(FactoryImpl.java:108)
[ERROR] 2012-11-15 23:38:07,546 (StdOutErrLog$2:print:51) at com.hazelcast.core.Hazelcast.getDefaultInstance(Haz
elcast.java:96)
[ERROR] 2012-11-15 23:38:07,546 (StdOutErrLog$2:print:51) at com.hazelcast.core.Hazelcast.getExecutorService(Haz
elcast.java:265)
[ERROR] 2012-11-15 23:38:07,546 (StdOutErrLog$2:print:51) at org.dataone.cn.batch.synchronization.jobs.MemberNod
eHarvestJob.execute(MemberNodeHarvestJob.java:84)
[ERROR] 2012-11-15 23:38:07,547 (StdOutErrLog$2:print:51) at org.quartz.core.JobRunShell.run(JobRunShell.java:21
3)
[ERROR] 2012-11-15 23:38:07,547 (StdOutErrLog$2:print:51) at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(
SimpleThreadPool.java:557)

I will review how the executor service should be configured and re-write some code.

#3 Updated by Dave Vieglais over 11 years ago

  • Start date set to 2013-08-25
  • Due date set to 2013-09-07
  • Target version set to 2013.35-Block.5.1

#4 Updated by Robert Waltz about 11 years ago

  • Target version deleted (2013.35-Block.5.1)
  • Due date deleted (2013-09-07)
  • Start date deleted (2013-08-25)
  • Product Version set to *

#5 Updated by Robert Waltz over 10 years ago

  • Status changed from In Progress to Closed

Have never been able to reproduce.

Also available in: Atom PDF

Add picture from clipboard (Maximum size: 14.8 MB)