Bug #6746

SSL Connection Errors across processing components

Added by Skye Roseboom about 9 years ago. Updated almost 9 years ago.

Discovered on Jan 6 that synchronization and replication (and log aggregation?) were throwing SSL handshake exceptions (caused by "SSL peer shut down incorrectly").

This prevents synchronization from harvesting new objects and prevents replication from requesting replicas. It also caused replication auditing to falsely identify replicas as invalid.

The source of the issue is unknown. It is odd that the same error started manifesting across multiple components within one JVM/daemon process.

It seems that each use of D1Node (CNode, MNode) creates a new D1RestClient, so uses of CNode and MNode should not be affected by errors in other threads:
D1Node.getSystemMetadata (used in replication) and listObject (used in sync) create a new D1RestClient
D1RestClient creates new RestClient and SSLSocketFactory
RestClient creates new DefaultHttpClient
CertificateManager.getSSLSocketFactory uses static SSLContext.getInstance

A possible common source of the error is CertificateManager or TrustManager, since it is a singleton and its methods are static.

syncSSLErrors - SSL Handshake error messages in synchronization (2.88 KB) Skye Roseboom, 2015-01-07 21:57

replicationSSLErrors - Sample SSL Handshake error, stack traces from replication (110 KB) Skye Roseboom, 2015-01-07 21:57

firstSSLException - examples of first SSL Handshake errors (also first replica audit task cancelation) (86.4 KB) Skye Roseboom, 2015-01-07 22:39


#1 Updated by Skye Roseboom about 9 years ago

  • Assignee deleted (Robert Waltz)

#2 Updated by Skye Roseboom about 9 years ago

#3 Updated by Skye Roseboom about 9 years ago

First occurrence of replica audit task cancellation:
[ERROR] 2015-01-05 22:17:02,429 (AbstractReplicationAuditor:handleFuture:214) Replica audit task cancelled.

First occurrence of SSL handshake exception:
[ERROR] 2015-01-06 16:23:04,227 (TransferObjectTask:retrieveSystemMetadata:282) Task-urn:node:LTER-doi:10.6073/AA/knb-lter-arc.1371.1
<?xml version="1.0" encoding="UTF-8"?>

class Remote host closed connection during handshake

First occurrence of SSL handshake exception in replication:
[ERROR] 2015-01-06 16:28:02,175 (MemberNodeReplicaAuditingStrategy:auditMemberNodeReplica:188) Unable to get checksum from mn: urn:node:mnUNM1.
org.dataone.service.exceptions.ServiceFailure: class Remote host closed connection during handshake


at java.util.concurrent.ThreadPoolExecutor.runWorker(
at java.util.concurrent.ThreadPoolExecutor$

Caused by: Remote host closed connection during handshake
... 31 more
Caused by: SSL peer shut down incorrectly


So it looks like sync started reporting the SSL handshake exception against LTER at 16:23 on 2015-01-06, and replication followed at 16:28 the same day.
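
For triage, the nested cause seen in the traces above can be checked programmatically instead of by reading stack traces. A minimal sketch, assuming a hypothetical helper class `HandshakeCauseInspector` (not part of libclient) that matches only on the message text that appears in these logs:

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

// Hypothetical helper, for illustration only; not part of d1_libclient_java.
final class HandshakeCauseInspector {
    // Message text copied from the innermost "Caused by" line in the logs.
    static final String PEER_SHUTDOWN = "SSL peer shut down incorrectly";

    private HandshakeCauseInspector() {}

    // Walks the cause chain and reports whether any throwable in it
    // carries the "peer shut down" message.
    static boolean causedByPeerShutdown(Throwable t) {
        // Identity-based set guards against cyclic cause chains.
        Set<Throwable> seen =
                Collections.newSetFromMap(new IdentityHashMap<Throwable, Boolean>());
        for (Throwable cur = t; cur != null && seen.add(cur); cur = cur.getCause()) {
            String msg = cur.getMessage();
            if (msg != null && msg.contains(PEER_SHUTDOWN)) {
                return true;
            }
        }
        return false;
    }
}
```

A caller could use this to count or tag affected requests per component, which would make it easier to compare onset times like the 16:23 and 16:28 entries above.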

#4 Updated by Skye Roseboom about 9 years ago

  • Category changed from d1_common_java to d1_libclient_java

#5 Updated by Skye Roseboom about 9 years ago

  • Description updated (diff)

#6 Updated by Rob Nahf about 9 years ago

A quick Google search turned up this posting from a couple of years ago.

The presumed cause of the error was network problems. If these errors were only coming from one CN, a local network disconnect from the outside world would explain why all three processes (trying to communicate with off-site servers) would show the same errors at the same time.

It would be interesting to also know if these errors go away by themselves without manual intervention.

#7 Updated by Dave Vieglais about 9 years ago

  • Priority changed from Normal to High
  • Project changed from Infrastructure to Java Client
  • Category deleted (d1_libclient_java)

#8 Updated by Rob Nahf about 9 years ago

  • Category set to d1_libclient_java

Clarifying that this happened on production (v1) machines, so working with a v1 libclient. Need to check the HttpClient version being used - it probably has moved forward to v4.2.x from v4.1.3.

Either way, there is no shared state from a shared connection manager between processes, so it's difficult to imagine a misconfiguration persisting across them.

The suggestion in the bug description was that CertificateManager might be the problem, since it's a singleton and has shared state. However, each request creates a new connection socket factory from scratch (reloading the certificate material), so there is no shared state there. The trust store is cached, but this would only mean that trusted CAs added during the lifetime of the application would not be picked up.
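
To illustrate the point about shared state, here is a minimal sketch of the pattern described: trust material is cached once, while a fresh SSLContext and socket factory are built for every request. The class `PerRequestSocketFactories` and its methods are hypothetical. The sketch uses only standard `javax.net.ssl` calls with the JVM default trust store and is not the actual CertificateManager code.

```java
import java.security.GeneralSecurityException;
import java.security.KeyStore;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSocketFactory;
import javax.net.ssl.TrustManager;
import javax.net.ssl.TrustManagerFactory;

// Hypothetical illustration; names are assumptions, not libclient API.
final class PerRequestSocketFactories {
    // The only shared state: trust managers loaded once and reused.
    private static TrustManager[] cachedTrustManagers;

    private PerRequestSocketFactories() {}

    static synchronized TrustManager[] trustManagers() {
        if (cachedTrustManagers == null) {
            try {
                TrustManagerFactory tmf = TrustManagerFactory.getInstance(
                        TrustManagerFactory.getDefaultAlgorithm());
                // A null KeyStore selects the JVM default trust store.
                tmf.init((KeyStore) null);
                cachedTrustManagers = tmf.getTrustManagers();
            } catch (GeneralSecurityException e) {
                throw new IllegalStateException("Could not load trust material", e);
            }
        }
        return cachedTrustManagers;
    }

    // Builds a new SSLContext on every call, so a failed handshake in one
    // request leaves nothing behind for the next one to inherit.
    static SSLSocketFactory newSocketFactory() {
        try {
            SSLContext context = SSLContext.getInstance("TLS");
            context.init(null, trustManagers(), null);
            return context.getSocketFactory();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Could not create SSL context", e);
        }
    }
}
```

In this arrangement the cached trust managers are read-only after the first load, which matches the observation that caching can only cause newly added CAs to be missed, not handshakes to start failing.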

I think a network partition is the more likely cause, and the SSL handshake exceptions logged are a result of the connection being broken in the middle, rather than at either end.

see 3rd answer down, by dave_thompson_085:

see also TLS 1.2 problems referenced here:
it's less likely because for us it's an intermittent problem, but worth mentioning

#9 Updated by Rob Nahf about 9 years ago

  • Target version set to CCI-2.0.0
  • Category changed from d1_libclient_java to d1_synchronization
  • Project changed from Java Client to CN REST

#10 Updated by Rob Nahf about 9 years ago

  • Target version changed from CCI-2.0.0 to CCI-1.5.1

#11 Updated by Rob Nahf almost 9 years ago

  • % Done changed from 0 to 100
  • Status changed from New to Closed

Based on the exception, this seems to be due to network partitioning, not internal problems with libclient.

