Bug #6746
SSL Connection Errors across processing components
100%
Description
Discovered Jan 6 that synchronization and replication (and log aggregation?) were throwing SSL handshake exceptions (caused by java.io.EOFException: SSL peer shut down incorrectly).
Prevents synchronization from harvesting new objects. Prevents replication from requesting replicas. Caused replication auditing to falsely identify invalid replicas.
It is unknown what the source of the issue was/is. Odd that the same error started manifesting across multiple components within one JVM/daemon process.
It seems that each use of D1Node (CNode, MNode) creates a new D1RestClient, so uses of CNode and MNode should not be affected by errors in other threads.
D1Node.getSystemMetadata (used in replication) and listObject (used in sync) create a new D1RestClient
D1RestClient creates new RestClient and SSLSocketFactory
RestClient creates new DefaultHttpClient
CertificateManager.getSSLSocketFactory uses static SSLContext.getInstance
TrustManager
Possible source of common error in CertificateManager or TrustManager since it is a singleton and its use/methods are static.
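Whether the static call to SSLContext.getInstance implies shared state can be checked against the JDK alone. The sketch below is not the CertificateManager code (which is not reproduced in this ticket); the class and method names are hypothetical. It only shows that each call to the static factory method returns a separate SSLContext object, so the static call by itself does not create a singleton.

```java
import java.security.GeneralSecurityException;
import javax.net.ssl.SSLContext;

// Hypothetical helper, not part of d1_libclient_java.
class SslContextIsolationCheck {

    // Builds a context with the JVM default key managers, trust managers
    // and SecureRandom (that is what passing null for each argument selects).
    static SSLContext newContext() {
        try {
            SSLContext ctx = SSLContext.getInstance("TLS");
            ctx.init(null, null, null);
            return ctx;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Could not create a TLS context", e);
        }
    }

    // True when two consecutive calls yield two different objects.
    static boolean contextsAreDistinct() {
        return newContext() != newContext();
    }

    public static void main(String[] args) {
        System.out.println("distinct contexts: " + contextsAreDistinct());
    }
}
```

Any state that really is shared would have to live in fields of CertificateManager itself (for example a cached trust store), not in the SSLContext objects it hands out.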
History
#1 Updated by Skye Roseboom almost 10 years ago
- Assignee deleted (Robert Waltz)
#2 Updated by Skye Roseboom almost 10 years ago
- File firstSSLException added
#3 Updated by Skye Roseboom almost 10 years ago
First occurrence of replica audit task cancellation:
[ERROR] 2015-01-05 22:17:02,429 (AbstractReplicationAuditor:handleFuture:214) Replica audit task cancelled.
First occurrence of SSL handshake exception:
[ERROR] 2015-01-06 16:23:04,227 (TransferObjectTask:retrieveSystemMetadata:282) Task-urn:node:LTER-doi:10.6073/AA/knb-lter-arc.1371.1
<?xml version="1.0" encoding="UTF-8"?>
class javax.net.ssl.SSLHandshakeException: Remote host closed connection during handshake
First occurrence of SSL handshake exception in replication:
[ERROR] 2015-01-06 16:28:02,175 (MemberNodeReplicaAuditingStrategy:auditMemberNodeReplica:188) Unable to get checksum from mn: urn:node:mnUNM1.
org.dataone.service.exceptions.ServiceFailure: class javax.net.ssl.SSLHandshakeException: Remote host closed connection during handshake
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:946)
at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1312)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1339)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: javax.net.ssl.SSLHandshakeException: Remote host closed connection during handshake
... 31 more
Caused by: java.io.EOFException: SSL peer shut down incorrectly
at sun.security.ssl.InputRecord.read(InputRecord.java:482)
So it looks like sync started reporting the SSL handshake exception against LTER at 16:23 on Jan 6, followed by replication at 16:28 the same day.
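For reference, the gap between the two first occurrences can be computed directly from the log lines. This is a minimal sketch that assumes the "[LEVEL] yyyy-MM-dd HH:mm:ss,SSS" prefix seen in the excerpts above; the class and method names are hypothetical.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for comparing timestamps of log lines.
class LogTimestampGap {

    // Matches the "[ERROR] 2015-01-06 16:23:04,227" prefix and captures the timestamp.
    private static final Pattern STAMP = Pattern.compile(
            "^\\[\\w+\\]\\s+(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3})");

    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss,SSS");

    static LocalDateTime timestampOf(String logLine) {
        Matcher m = STAMP.matcher(logLine);
        if (!m.find()) {
            throw new IllegalArgumentException("No timestamp in: " + logLine);
        }
        return LocalDateTime.parse(m.group(1), FORMAT);
    }

    // Time elapsed from the first line to the second.
    static Duration gap(String earlier, String later) {
        return Duration.between(timestampOf(earlier), timestampOf(later));
    }

    public static void main(String[] args) {
        String sync = "[ERROR] 2015-01-06 16:23:04,227 (TransferObjectTask:retrieveSystemMetadata:282)";
        String repl = "[ERROR] 2015-01-06 16:28:02,175 (MemberNodeReplicaAuditingStrategy:auditMemberNodeReplica:188)";
        System.out.println(gap(sync, repl).toMillis() + " ms");
    }
}
```

For the two lines above this gives a gap of a little under five minutes (297948 ms).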
#4 Updated by Skye Roseboom almost 10 years ago
- Category changed from d1_common_java to d1_libclient_java
#5 Updated by Skye Roseboom almost 10 years ago
- Description updated (diff)
#6 Updated by Rob Nahf almost 10 years ago
A quick Google search turned up this posting from a couple of years ago:
http://www.coderanch.com/t/563766/java/java/SSLHandshake-Error-connection-remote-server
The presumed cause of the error was network problems. If these errors were only coming from one CN, a local network disconnect from the outside world would explain why all three processes (trying to communicate with off-site servers) would show the same errors at the same time.
It would be interesting to also know if these errors go away by themselves without manual intervention.
#7 Updated by Dave Vieglais almost 10 years ago
- Priority changed from Normal to High
- Project changed from Infrastructure to Java Client
- Category deleted (d1_libclient_java)
#8 Updated by Rob Nahf almost 10 years ago
- Category set to d1_libclient_java
Clarifying that this happened on production (v1) machines, so working with a v1 libclient. Need to check the HttpClient version being used - it probably has moved forward to v4.2.x from v4.1.3.
Either way, there is no shared state as a result of a shared connection manager between processes, so it's difficult to imagine a misconfiguration getting stuck.
The bug description suggested possible problems with CertificateManager, since it is a singleton and has shared state. However, each request creates a new connection socket factory from scratch (reloading the certificate material), so there is no shared state there. The trustStore is cached, but this would only mean that trusted CAs added during the lifetime of the application would not be picked up.
I think a network partition is the more likely cause, and the SSL handshake exceptions logged are a result of the connection being broken in the middle, rather than at either end.
See the 3rd answer down, by dave_thompson_085: http://stackoverflow.com/questions/21245796/javax-net-ssl-sslhandshakeexception-remote-host-closed-connection-during-handsh
See also the TLS 1.2 problems referenced here: http://stackoverflow.com/questions/26604828/javax-net-ssl-sslhandshakeexception-remote-host-closed-connection-during-handsh
The TLS 1.2 explanation is less likely, because for us the problem is intermittent, but it is worth mentioning.
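To separate this failure mode from other handshake failures when scanning future errors, the cause chain can be inspected. The following is a rough sketch only: the class, enum and method names are hypothetical, and an EOFException at the root only says that the stream ended before the handshake completed. It cannot tell whether the remote host or something in between closed the connection.

```java
import java.io.EOFException;
import javax.net.ssl.SSLHandshakeException;

// Hypothetical helper, not part of d1_libclient_java.
class HandshakeFailureClassifier {

    enum Kind {
        ABRUPT_CLOSE_DURING_HANDSHAKE, // handshake failed, root cause is an EOFException
        OTHER_HANDSHAKE_FAILURE,       // handshake failed for some other reason
        NOT_A_HANDSHAKE_FAILURE        // no SSLHandshakeException in the chain
    }

    static Kind classify(Throwable t) {
        boolean sawHandshake = false;
        Throwable root = t;
        int depth = 0;
        // Walk the cause chain; the depth limit guards against cyclic chains.
        for (Throwable cur = t; cur != null && depth < 50; cur = cur.getCause(), depth++) {
            if (cur instanceof SSLHandshakeException) {
                sawHandshake = true;
            }
            root = cur;
        }
        if (!sawHandshake) {
            return Kind.NOT_A_HANDSHAKE_FAILURE;
        }
        return (root instanceof EOFException)
                ? Kind.ABRUPT_CLOSE_DURING_HANDSHAKE
                : Kind.OTHER_HANDSHAKE_FAILURE;
    }

    public static void main(String[] args) {
        SSLHandshakeException h =
                new SSLHandshakeException("Remote host closed connection during handshake");
        h.initCause(new EOFException("SSL peer shut down incorrectly"));
        // RuntimeException stands in for the ServiceFailure wrapper seen in the logs.
        System.out.println(classify(new RuntimeException("wrapper", h)));
    }
}
```

Certificate or protocol mismatches would normally surface with a different root cause (or none), so they would fall into the second category rather than the first.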
#9 Updated by Rob Nahf almost 10 years ago
- Target version set to CCI-2.0.0
- Category changed from d1_libclient_java to d1_synchronization
- Project changed from Java Client to CN REST
#10 Updated by Rob Nahf almost 10 years ago
- Target version changed from CCI-2.0.0 to CCI-1.5.1
#11 Updated by Rob Nahf almost 10 years ago
- % Done changed from 0 to 100
- Status changed from New to Closed
Based on the exception, this seems to be due to network partitioning, not internal problems with libclient.
Closing.