Project

General

Profile

Bug #8161

d1-processing Out of Memory Error

Added by Dave Vieglais over 4 years ago. Updated almost 4 years ago.

Status:
New
Priority:
Normal
Assignee:
Category:
d1_process_daemon
Target version:
Start date:
2017-08-13
Due date:
% Done:

0%

Milestone:
None
Product Version:
*
Story Points:
Sprint:

Description

Noticed that d1-processing had stopped processing synchronization, last successful log message received 2017-08-11T22:56:00Z

/var/log/dataone/daemon/cn-process-daemon.log reporting:

[ERROR] 2017-08-13 13:36:34,623 (StdOutErrLog$2:print:51) java.lang.OutOfMemoryError: Java heap space

d1-processing is started with:

JAVA_OPTS="-Djava.awt.headless=true -XX:UseParallelGC -Xmx4096M -Xms1024M -Xss1280k -XX:MaxPermSize=512M"

jsvc.exec -home /usr/lib/jvm/java-1.8.0-openjdk-amd64
-cp /usr/share/java/commons-daemon.jar:/usr/share/java/d1_process_daemon.jar
-debug -outfile /var/log/dataone/daemon/d1-processing-jsvc.log
-errfile /var/log/dataone/daemon/d1-processing-jsvc.err
-pidfile /var/run/d1-processing.pid
-user tomcat7
-Djava.awt.headless=true
-XX:UseParallelGC
-Xmx4096M
-Xms1024M
-Xss1280k
-XX:MaxPermSize=512M
org.dataone.cn.batch.daemon.SchedulerDaemon

Xmx sets the maximum heap size

Xms sets the initial heap size

Xss sets the java thread stack size

/proc/meminfo reports:

MemTotal: 32948380 kB
MemFree: 3428124 kB

Restarted d1-processing to continue operations while reviewing options for resolution.

History

#1 Updated by Dave Vieglais over 4 years ago

Pids in the sync queue when d1-processing stopped (from a dump of hzSyncObjectQueue)

https://pastebin.com/QviRisrj

#2 Updated by Dave Vieglais almost 4 years ago

  • Sprint set to CCI-2.3.7

#3 Updated by Dave Vieglais almost 4 years ago

d1-processing crashed.

dataone/daemon/cn-process-daemon.log:

[ INFO] 2018-02-20 13:28:52,903 (ObjectListHarvestTask:call:241) urn:node:LTER- ObjectListHarvestTask End
[DEBUG] 2018-02-20 13:29:05,230 (ReplicationManager:isAllowed:807) ReplicationManager.isAllowed() called for https://pasta.lternet.edu/package/data/eml/knb-lter-bnz/347/14/35b347649695b6800c9325a227a650dc
[DEBUG] 2018-02-20 13:29:05,231 (ReplicationManager:isAllowed:824) ReplicationManager.isAllowed() = false for https://pasta.lternet.edu/package/data/eml/knb-lter-bnz/347/14/35b347649695b6800c9325a227a650dc
[DEBUG] 2018-02-20 13:29:05,231 (ReplicationManager:processPid:413) Replication is not allowed for the object identified by https://pasta.lternet.edu/package/data/eml/knb-lter-bnz/347/14/35b347649695b6800c9325a227a650dc
[ INFO] 2018-02-20 13:29:05,231 (ReplicationManager:removeReplicationTasksForPid:772) removing replication tasks for pid: https://pasta.lternet.edu/package/data/eml/knb-lter-bnz/347/14/35b347649695b6800c9325a227a650dc
[ INFO] 2018-02-20 13:28:59,106 (MemberNodeHarvestJob:execute:71) urn:node:LTER - ObjectListHarvestTask Completed at Tue Feb 20 2018 13:28:59 UTC
[DEBUG] 2018-02-20 13:29:05,232 (HttpConnectionMonitorService:run:103) ...calling closeExpire/IdleConnections...
[DEBUG] 2018-02-20 13:29:05,232 (HttpConnectionMonitorService:run:103) ...calling closeExpire/IdleConnections...
[ INFO] 2018-02-20 13:29:11,381 (ObjectListHarvestTask:call:241) urn:node:SANPARKS- ObjectListHarvestTask End
[ INFO] 2018-02-20 13:29:11,382 (MemberNodeHarvestJob:execute:71) urn:node:SANPARKS - ObjectListHarvestTask Completed at Tue Feb 20 2018 13:29:11 UTC
[DEBUG] 2018-02-20 13:29:11,382 (SyncObjectTask:reapFutures:276) Polling empty hzSyncObjectQueue
[DEBUG] 2018-02-20 13:29:11,383 (SyncMetricLogReport:reportSyncMetrics:59) end
[DEBUG] 2018-02-20 13:29:11,383 (SyncObjectTask:call:164) ActiveCount: 0 Pool size 0 Max Pool Size 100
[DEBUG] 2018-02-20 13:29:11,383 (SyncMetricLogJob:execute:49) end
[DEBUG] 2018-02-20 13:29:56,641 (SyncMetricLogJobTriggerListener:releaseJob:98) releaseJob
[DEBUG] 2018-02-20 13:29:56,642 (SyncMetricLogJobTriggerListener:triggerComplete:62) triggersComplete
[DEBUG] 2018-02-20 13:29:36,899 (HttpConnectionMonitorService:run:103) ...calling closeExpire/IdleConnections...
[DEBUG] 2018-02-20 13:29:56,642 (HttpConnectionMonitorService:run:103) ...calling closeExpire/IdleConnections...
[ INFO] 2018-02-20 13:31:10,795 (MemberNodeHarvestJob:execute:56) urn:node:DRYAD - ObjectListHarvestTask Start
[ INFO] 2018-02-20 13:31:10,795 (MemberNodeHarvestJob:execute:56) urn:node:ESA - ObjectListHarvestTask Start
[ INFO] 2018-02-20 13:31:10,796 (MemberNodeHarvestJob:execute:56) urn:node:BCODMO - ObjectListHarvestTask Start
[ INFO] 2018-02-20 13:31:10,795 (MemberNodeHarvestJob:execute:56) urn:node:CLOEBIRD - ObjectListHarvestTask Start
[ INFO] 2018-02-20 13:31:26,199 (ObjectListHarvestTask:call:113) urn:node:CLOEBIRD- Calling ObjectListHarvestTask
[ERROR] 2018-02-20 13:31:17,081 (AbstractReplicationAuditor:handleFuture:223) Replica audit task timed out twice, cancelling.
[DEBUG] 2018-02-20 13:31:17,081 (HttpConnectionMonitorService:run:103) ...calling closeExpire/IdleConnections...
[DEBUG] 2018-02-20 13:32:13,150 (HttpConnectionMonitorService:run:103) ...calling closeExpire/IdleConnections...
[ INFO] 2018-02-20 13:32:13,150 (MemberNodeHarvestJob:execute:56) urn:node:NCEI - ObjectListHarvestTask Start
[ERROR] 2018-02-20 13:32:13,150 (StdOutErrLog$2:print:51) Exception in thread "Timer-3"
[DEBUG] 2018-02-20 13:31:10,795 (SyncMetricLogJobTriggerListener:triggerFired:68) triggerFired
[DEBUG] 2018-02-20 13:32:49,532 (SyncMetricLogJobTriggerListener:vetoJobExecution:86) vetoJobExecution
[DEBUG] 2018-02-20 13:32:49,533 (SyncMetricLogJobTriggerListener:lockJob:109) lockJob is true queue size is 1
[DEBUG] 2018-02-20 13:32:49,533 (SyncMetricLogJob:execute:41) start
[DEBUG] 2018-02-20 13:32:49,533 (AbstractReplicationAuditor:handleFuture:219) Replica audit task timed out.  waiting another60 seconds.
[DEBUG] 2018-02-20 13:32:49,533 (SyncMetricLogReport:reportSyncMetrics:29) start
[ INFO] 2018-02-20 13:32:49,531 (MemberNodeHarvestJob:execute:56) urn:node:TFRI - ObjectListHarvestTask Start
[ERROR] 2018-02-20 13:32:49,534 (StdOutErrLog$2:print:51) java.lang.NullPointerException
[ERROR] 2018-02-20 13:32:49,535 (StdOutErrLog$2:print:51)   at com.hazelcast.client.ConnectionManager$1.run(ConnectionManager.java:94)
[ERROR] 2018-02-20 13:32:49,535 (StdOutErrLog$2:print:51)   at java.util.TimerThread.mainLoop(Timer.java:555)
[ERROR] 2018-02-20 13:32:49,535 (StdOutErrLog$2:print:51)   at java.util.TimerThread.run(Timer.java:505)
[ INFO] 2018-02-20 13:32:49,531 (MemberNodeHarvestJob:execute:56) urn:node:TERN - ObjectListHarvestTask Start
[ INFO] 2018-02-20 13:32:49,531 (MemberNodeHarvestJob:execute:56) urn:node:TDAR - ObjectListHarvestTask Start
[ INFO] 2018-02-20 13:32:49,529 (MemberNodeHarvestJob:execute:56) urn:node:SANPARKS - ObjectListHarvestTask Start
[ INFO] 2018-02-20 13:32:43,348 (MemberNodeHarvestJob:execute:56) urn:node:R2R - ObjectListHarvestTask Start
[ INFO] 2018-02-20 13:32:37,069 (ObjectListHarvestTask:call:113) urn:node:NCEI- Calling ObjectListHarvestTask
[ INFO] 2018-02-20 13:32:13,150 (MemberNodeHarvestJob:execute:56) urn:node:NEON - ObjectListHarvestTask Start
[ INFO] 2018-02-20 13:31:26,199 (ObjectListHarvestTask:call:113) urn:node:DRYAD- Calling ObjectListHarvestTask
[ INFO] 2018-02-20 13:31:26,199 (ObjectListHarvestTask:call:113) urn:node:ESA- Calling ObjectListHarvestTask
[ INFO] 2018-02-20 13:31:26,199 (ObjectListHarvestTask:call:113) urn:node:BCODMO- Calling ObjectListHarvestTask
[ INFO] 2018-02-20 13:31:56,975 (MemberNodeHarvestJob:execute:56) urn:node:LTER - ObjectListHarvestTask Start
[ INFO] 2018-02-20 13:31:56,975 (MemberNodeHarvestJob:execute:56) urn:node:LTER_EUROPE - ObjectListHarvestTask Start
[ INFO] 2018-02-20 13:31:56,975 (MemberNodeHarvestJob:execute:56) urn:node:PPBIO - ObjectListHarvestTask Start
[ INFO] 2018-02-20 13:32:03,318 (MemberNodeHarvestJob:execute:56) urn:node:NMEPSCOR - ObjectListHarvestTask Start
[DEBUG] 2018-02-20 13:32:22,873 (HttpConnectionMonitorService:run:103) ...calling closeExpire/IdleConnections...
[DEBUG] 2018-02-20 13:32:59,026 (HttpConnectionMonitorService:run:103) ...calling closeExpire/IdleConnections...
[ INFO] 2018-02-20 13:32:49,572 (ObjectListHarvestTask:call:113) urn:node:R2R- Calling ObjectListHarvestTask
[ INFO] 2018-02-20 13:32:49,570 (ObjectListHarvestTask:call:113) urn:node:TDAR- Calling ObjectListHarvestTask
[ INFO] 2018-02-20 13:32:49,535 (ObjectListHarvestTask:call:113) urn:node:TFRI- Calling ObjectListHarvestTask
[ INFO] 2018-02-20 13:32:49,574 (ObjectListHarvestTask:call:113) urn:node:TERN- Calling ObjectListHarvestTask
[ INFO] 2018-02-20 13:32:59,026 (ObjectListHarvestTask:call:113) urn:node:PPBIO- Calling ObjectListHarvestTask
[ INFO] 2018-02-20 13:32:59,026 (ObjectListHarvestTask:call:113) urn:node:NEON- Calling ObjectListHarvestTask
[ INFO] 2018-02-20 13:34:37,082 (ObjectListHarvestTask:call:113) urn:node:LTER_EUROPE- Calling ObjectListHarvestTask
[DEBUG] 2018-02-20 13:35:12,737 (HttpConnectionMonitorService:run:103) ...calling closeExpire/IdleConnections...
[ERROR] 2018-02-20 13:35:59,506 (AbstractReplicationAuditor:handleFuture:223) Replica audit task timed out twice, cancelling.
[ INFO] 2018-02-20 13:35:25,394 (LogAggregationHarvestJob:execute:68) Job-urn:node:ONEShare executing job
[DEBUG] 2018-02-20 13:38:15,023 (HttpConnectionMonitorService:run:103) ...calling closeExpire/IdleConnections...
[ INFO] 2018-02-20 13:32:59,024 (ObjectListHarvestTask:call:113) urn:node:NMEPSCOR- Calling ObjectListHarvestTask
[DEBUG] 2018-02-20 13:44:17,745 (HttpConnectionMonitorService:run:103) ...calling closeExpire/IdleConnections...
[DEBUG] 2018-02-20 13:45:46,567 (AbstractReplicationAuditor:handleFuture:219) Replica audit task timed out.  waiting another60 seconds.
[ INFO] 2018-02-20 13:34:37,082 (ObjectListHarvestTask:call:113) urn:node:LTER- Calling ObjectListHarvestTask
[DEBUG] 2018-02-20 13:53:49,707 (CoordinatingNodeReplicaAuditingStrategy:auditPids:86) audit pids called with 5.
[DEBUG] 2018-02-20 13:53:56,410 (CoordinatingNodeReplicaAuditingStrategy:auditPid:93) auditPid for Coordinating Node replica called for pid: http://dx.doi.org/10.5061/dryad.8496?format=d1rem&ver=2011-09-02T15:30:29.785-04:00
[ERROR] 2018-02-20 13:53:56,411 (StdOutErrLog$2:print:51) 2337523836 [SynchronizationQuartzScheduler_Worker-11] ERROR org.dataone.cn.ldap.DirContextPooledObjectFactory - Could not set up TLS connection, using non-secure communication

[ERROR] 2018-02-20 13:54:12,369 (StdOutErrLog$2:print:51) java.lang.OutOfMemoryError: Java heap space
[ERROR] 2018-02-20 13:54:12,370 (StdOutErrLog$2:print:51)   at sun.security.ssl.SSLSocketImpl.init(SSLSocketImpl.java:627)
[ERROR] 2018-02-20 13:54:12,370 (StdOutErrLog$2:print:51)   at sun.security.ssl.SSLSocketImpl.<init>(SSLSocketImpl.java:572)
[ERROR] 2018-02-20 13:54:12,370 (StdOutErrLog$2:print:51)   at sun.security.ssl.SSLSocketFactoryImpl.createSocket(SSLSocketFactoryImpl.java:110)
[ERROR] 2018-02-20 13:54:12,370 (StdOutErrLog$2:print:51)   at com.sun.jndi.ldap.ext.StartTlsResponseImpl.startHandshake(StartTlsResponseImpl.java:330)
[ERROR] 2018-02-20 13:54:12,370 (StdOutErrLog$2:print:51)   at com.sun.jndi.ldap.ext.StartTlsResponseImpl.negotiate(StartTlsResponseImpl.java:217)
[ERROR] 2018-02-20 13:54:12,371 (StdOutErrLog$2:print:51)   at com.sun.jndi.ldap.ext.StartTlsResponseImpl.negotiate(StartTlsResponseImpl.java:170)
[ INFO] 2018-02-20 13:42:20,067 (LogAggregationHarvestJob:execute:68) Job-urn:node:mnORC1 executing job
[ERROR] 2018-02-20 13:55:36,095 (StdOutErrLog$2:print:51)   at org.dataone.cn.ldap.DirContextPooledObjectFactory.getSecureContext(DirContextPooledObjectFactory.java:126)
[ERROR] 2018-02-20 15:11:01,898 (StdOutErrLog$2:print:51) 2342149323 [pool-15-thread-2] INFO org.dataone.client.rest.RestClient - RestClient.doRequestNoBody, thread(104) call Info: GET https://cn-ucsb-1.dataone.org/cn/v2/meta/http:%2F%2Fdx.doi.org%2F10.5061%2Fdryad.8496%3Fformat=d1rem&ver=2011-09-02T15:30:29.785-04:00

[ERROR] 2018-02-20 15:09:57,673 (StdOutErrLog$2:print:51) 2342078579 [pool-15-thread-1] INFO org.dataone.client.rest.RestClient - RestClient.doRequestNoBody, thread(103) call Info: GET https://cn-ucsb-1.dataone.org/cn/v2/meta/http:%2F%2Fdx.doi.org%2F10.5061%2Fdryad.8444%3Fver=2011-09-02T13:01:00.854-04:00

[ INFO] 2018-02-20 15:06:34,039 (LogAggregationHarvestJob:execute:68) Job-urn:node:US_MPC executing job
[ INFO] 2018-02-20 15:06:15,212 (LogAggregationHarvestJob:execute:68) Job-urn:node:IOE executing job
[ INFO] 2018-02-20 15:05:58,647 (LogAggregationHarvestJob:execute:68) Job-urn:node:EDACGSTORE executing job
[ERROR] 2018-02-20 15:03:16,287 (StdOutErrLog$2:print:51) 2341313441 [LogAggregationQuartzScheduler_Worker-15] ERROR org.dataone.cn.ldap.DirContextPooledObjectFactory - Could not set up TLS connection, using non-secure communication

[ERROR] 2018-02-20 14:55:12,378 (StdOutErrLog$2:print:51) 2341193579 [SynchronizationQuartzScheduler_Worker-22] INFO org.dataone.client.rest.RestClient - RestClient.doRequestNoBody, thread(92) call Info: GET https://metacat.ots.ac.cr/metacat/d1/mn/v2/object?fromDate=2018-02-13T14:26:33.332%2B00:00&toDate=2018-02-20T13:27:02.930%2B00:00&start=0&count=1000

[ INFO] 2018-02-20 14:20:50,605 (LogAggregationHarvestJob:execute:68) Job-urn:node:DRYAD executing job
[ INFO] 2018-02-20 14:20:10,898 (LogAggregationHarvestJob:execute:68) Job-urn:node:CLOEBIRD executing job
[ INFO] 2018-02-20 14:18:15,501 (LogAggregationHarvestJob:execute:68) Job-urn:node:TFRI executing job
[ INFO] 2018-02-20 14:14:49,925 (LogAggregationHarvestJob:execute:68) Job-urn:node:LTER_EUROPE executing job
[ERROR] 2018-02-20 13:58:07,405 (AbstractReplicationAuditor:handleFuture:223) Replica audit task timed out twice, cancelling.

Also available in: Atom PDF

Add picture from clipboard (Maximum size: 14.8 MB)