Bug #6746: SSL Connection Errors across processing components - CN REST - DataONE Tasks

Bug #6746

SSL Connection Errors across processing components

Added by Skye Roseboom about 10 years ago. Updated almost 10 years ago.

Status: Closed

Priority: High

Assignee:

Category: d1_synchronization

Target version: Infrastructure - CCI-1.5.1

Start date:

Due date:

% Done: 100%

Story Points:

Sprint:

Description

Discovered Jan 6 that synchronization and replication (and log aggregation?) were throwing SSL handshake exceptions (caused by java.io.EOFException: SSL peer shut down incorrectly).

Prevents synchronization from harvesting new objects. Prevents replication from requesting replicas. Caused replication auditing to falsely identify invalid replicas.

The source of the issue is unknown. It is odd that the same error started manifesting across multiple components within one JVM/daemon process.

It seems that each use of D1Node (CNode, MNode) creates a new D1RestClient, so uses of CNode and MNode should not be affected by errors in other threads.
D1Node.getSystemMetadata (used in replication) and listObject (used in sync) create a new D1RestClient
D1RestClient creates new RestClient and SSLSocketFactory
RestClient creates new DefaultHttpClient
CertificateManager.getSSLSocketFactory uses static SSLContext.getInstance
TrustManager

A possible common source of the error is CertificateManager or TrustManager, since CertificateManager is a singleton and its methods are used statically.
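To make the "no shared state" reading of the call chain above concrete, here is a minimal sketch using only JDK classes. It is not the d1_libclient code: the class and method names are hypothetical, and it assumes the JVM default trust store rather than the DataONE certificate material. One relevant fact: SSLContext.getInstance is a static factory method that returns a new SSLContext object on every call, so calling it statically does not by itself create shared state between clients.

```java
import java.security.KeyStore;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSocketFactory;
import javax.net.ssl.TrustManagerFactory;

// Hypothetical illustration, not part of d1_libclient_java.
class PerClientSslSketch {

    // Builds a fresh SSLContext and socket factory for each caller.
    static SSLSocketFactory newSocketFactory() {
        try {
            TrustManagerFactory tmf =
                TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
            // Assumption: a null KeyStore selects the JVM default trust store.
            tmf.init((KeyStore) null);

            // getInstance returns a new context object each time it is called.
            SSLContext ctx = SSLContext.getInstance("TLS");
            // No client key material in this sketch; default SecureRandom.
            ctx.init(null, tmf.getTrustManagers(), null);
            return ctx.getSocketFactory();
        } catch (Exception e) {
            throw new IllegalStateException("Could not build SSL socket factory", e);
        }
    }
}
```

If the real CertificateManager follows this pattern, a broken handshake in one thread should not poison the socket factory handed to another thread; any state shared across requests would have to live elsewhere (for example in a cached trust store).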

syncSSLErrors - SSL Handshake error messages in synchronization (2.88 KB) Skye Roseboom, 2015-01-07 21:57

replicationSSLErrors - Sample SSL Handshake error, stack traces from replication (110 KB) Skye Roseboom, 2015-01-07 21:57

firstSSLException - examples of first SSL Handshake errors (also first replica audit task cancelation) (86.4 KB) Skye Roseboom, 2015-01-07 22:39

History

#1 Updated by Skye Roseboom about 10 years ago

Assignee deleted (~~Robert Waltz~~)

#2 Updated by Skye Roseboom about 10 years ago

File firstSSLException added

#3 Updated by Skye Roseboom about 10 years ago

First occurrence of replica audit task cancellation:
[ERROR] 2015-01-05 22:17:02,429 (AbstractReplicationAuditor:handleFuture:214) Replica audit task cancelled.

First occurrence of SSL handshake exception:
[ERROR] 2015-01-06 16:23:04,227 (TransferObjectTask:retrieveSystemMetadata:282) Task-urn:node:LTER-doi:10.6073/AA/knb-lter-arc.1371.1
<?xml version="1.0" encoding="UTF-8"?>

class javax.net.ssl.SSLHandshakeException: Remote host closed connection during handshake

First occurrence of SSL handshake exception in replication:
[ERROR] 2015-01-06 16:28:02,175 (MemberNodeReplicaAuditingStrategy:auditMemberNodeReplica:188) Unable to get checksum from mn: urn:node:mnUNM1.
org.dataone.service.exceptions.ServiceFailure: class javax.net.ssl.SSLHandshakeException: Remote host closed connection during handshake
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:946)
at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1312)

at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1339)

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Caused by: javax.net.ssl.SSLHandshakeException: Remote host closed connection during handshake
... 31 more
Caused by: java.io.EOFException: SSL peer shut down incorrectly

at sun.security.ssl.InputRecord.read(InputRecord.java:482)

So it looks like sync started reporting the SSL handshake exception against LTER at 16:23 on 2015-01-06, followed by replication at 16:28 the same day.
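The timeline above was pulled from the logs by hand. A rough sketch of how the same "first occurrence per component" lookup could be automated is below. The class and method names are hypothetical, and it assumes log entries in the format quoted above: a header line of the form "[LEVEL] timestamp (Class:method:line) ...", followed by message and stack trace lines, in chronological order.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for scanning CN processing logs.
class FirstSslErrorScan {

    // Matches e.g. "[ERROR] 2015-01-06 16:23:04,227 (TransferObjectTask:..."
    private static final Pattern HEADER = Pattern.compile(
        "^\\[(\\w+)\\]\\s+(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3}) \\(([^:]+):");

    // Returns logging class -> timestamp of its first SSLHandshakeException.
    static Map<String, String> firstOccurrences(List<String> lines) {
        Map<String, String> first = new LinkedHashMap<>();
        String timestamp = null;
        String loggingClass = null;
        for (String line : lines) {
            Matcher m = HEADER.matcher(line);
            if (m.find()) {
                // A new log entry starts here; remember who logged it and when.
                timestamp = m.group(2);
                loggingClass = m.group(3);
            }
            // The exception name may be on the header line or a continuation line.
            if (loggingClass != null && line.contains("SSLHandshakeException")) {
                first.putIfAbsent(loggingClass, timestamp);
            }
        }
        return first;
    }
}
```

Running this over the combined sync and replication logs from each CN would also show whether the errors began at the same time on every CN or only on one.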

#4 Updated by Skye Roseboom about 10 years ago

Category changed from d1_common_java to d1_libclient_java

#5 Updated by Skye Roseboom about 10 years ago

Description updated (diff)

#6 Updated by Rob Nahf about 10 years ago

A quick Google search turned up this posting from a couple of years ago:

http://www.coderanch.com/t/563766/java/java/SSLHandshake-Error-connection-remote-server

The presumed cause of the error was network problems. If these errors were only coming from one CN, a local network disconnect from the outside world would explain why all three processes (trying to communicate with off-site servers) would show the same errors at the same time.

It would be interesting to also know if these errors go away by themselves without manual intervention.

#7 Updated by Dave Vieglais about 10 years ago

Priority changed from Normal to High
Project changed from Infrastructure to Java Client
Category deleted (~~d1_libclient_java~~)

#8 Updated by Rob Nahf about 10 years ago

Category set to d1_libclient_java

Clarifying that this happened on production (v1) machines, so we are working with a v1 libclient. We need to check the HttpClient version being used; it has probably moved forward from v4.1.3 to v4.2.x.

Either way, there is no shared state resulting from a shared connection manager between processes, so it is hard to see how a bad configuration could get stuck.

The bug description suggested possible problems with CertificateManager, since it is a singleton and has shared state. However, each request creates a new connection socket factory from scratch (reloading the certificate material), so there is no shared state there. The trustStore is cached, but this would only mean that trusted CAs added during the lifetime of the application would not be picked up.

I think a network partition is the more likely cause, and the SSLHandshakeExceptions logged are a result of the connection being broken somewhere in the middle, rather than at either end.

See the 3rd answer down, by dave_thompson_085: http://stackoverflow.com/questions/21245796/javax-net-ssl-sslhandshakeexception-remote-host-closed-connection-during-handsh

See also the TLS 1.2 problems referenced here: http://stackoverflow.com/questions/26604828/javax-net-ssl-sslhandshakeexception-remote-host-closed-connection-during-handsh
That cause is less likely, because for us the problem is intermittent, but it is worth mentioning.
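If we want to separate the "connection dropped mid-handshake" case from other handshake failures (such as certificate problems) when counting errors, one option is to walk the exception's cause chain and look for the pairing seen in the stack traces on this ticket: an SSLHandshakeException with a java.io.EOFException beneath it. The sketch below is only an illustration; the class and method names are hypothetical and nothing like this exists in libclient as far as I know. It cannot tell a network partition from a remote server that closes the socket deliberately, so it narrows the cause rather than proving it.

```java
import java.io.EOFException;
import javax.net.ssl.SSLHandshakeException;

// Hypothetical helper, not part of any DataONE library.
class HandshakeFailureKind {

    // True if the chain contains an SSLHandshakeException that is caused,
    // directly or indirectly, by an EOFException (stream ended abruptly).
    static boolean isAbruptClose(Throwable t) {
        boolean sawHandshake = false;
        int depth = 0;
        // The depth limit guards against pathological self-referencing chains.
        for (Throwable c = t; c != null && depth < 50; c = c.getCause(), depth++) {
            if (c instanceof SSLHandshakeException) {
                sawHandshake = true;
            } else if (sawHandshake && c instanceof EOFException) {
                return true;
            }
        }
        return false;
    }
}
```

One caveat: this only works where the original exception is preserved as a cause. Where the handshake exception has been flattened into the message string of another exception, the check would have to fall back to matching on the message text.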