Bug #6789

Read API calls to the CN are hanging

Added by Andrei Buium about 9 years ago. Updated almost 8 years ago.

Status: In Progress
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 2015-02-09
Due date:
% Done: 30%
Story Points:
Sprint:

Description

Calls to CNRead API methods block and sometimes hang.

The calls I've debugged went through MultipartD1Node methods. They hung inside restClient.doGetRequest() itself, rather than, say, just taking a long time to deserialize the response.
This doesn't happen every time, and even while the Java client call is failing, a GET request from a browser may return results instantly.

Some of the methods that have been hanging:
getSystemMetadata()
resolve()
describe()
listQueryEngines()
getQueryEngineDescription()


Subtasks

Task #6844: add default timeout settings into libclient (Closed, Andrei Buium)

Task #6845: Look into why the CNs are hanging connections (Closed, Rob Nahf)

Associated revisions

Revision 15194
Added by Andrei Buium about 9 years ago

Made the default timeout disable-able.
refs #6789, #6844

History

#1 Updated by Andrei Buium about 9 years ago

This may be related to how the timeout parameter is used.

It seems to fail on calls like this:

multiPartRestClient.doGetRequest(url.getUrl(), null); // where null is the timeoutMilliseconds Integer

And seems to pass on calls like this:
multiPartRestClient.doGetRequest(url.getUrl(), 1000);
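
For reference, here is a minimal sketch of how a null timeout could fall back to a default that can also be switched off. It is not the libclient code: it uses the JDK's java.net.HttpURLConnection instead of the rest client, and the class name, method names, and the 30-second default are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

/** Hypothetical helper, not part of libclient. */
class TimeoutDefaults {
    // Assumed default; the value actually chosen for libclient is not stated in this ticket.
    static final int DEFAULT_TIMEOUT_MS = 30_000;
    // HttpURLConnection interprets a timeout of 0 as "wait forever".
    static final int NO_TIMEOUT = 0;

    /**
     * Picks the timeout to apply: an explicit value wins, otherwise the
     * default is used unless the caller has disabled it.
     */
    static int resolveTimeout(Integer requestedMs, boolean defaultEnabled) {
        if (requestedMs != null) {
            return Math.max(requestedMs, 0); // negative values are treated as "no timeout"
        }
        return defaultEnabled ? DEFAULT_TIMEOUT_MS : NO_TIMEOUT;
    }

    /** Builds a GET connection with the resolved timeout applied; does not connect yet. */
    static HttpURLConnection prepareGet(String url, Integer timeoutMs, boolean defaultEnabled) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            int effective = resolveTimeout(timeoutMs, defaultEnabled);
            conn.setRequestMethod("GET");
            conn.setConnectTimeout(effective); // limit on establishing the connection
            conn.setReadTimeout(effective);    // limit on waiting for response data
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this shape, a caller passing null no longer gets an unbounded wait unless the default has been explicitly disabled.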

#2 Updated by Rob Nahf about 9 years ago

When first looking at this with Andrei, I noticed that the same tests against a Member Node (mn-demo-6) did not hang, so it seems that our CNs have some different connection management behavior that our libclient libraries aren't handling.

The introduction of default timeouts to calls from libclient seems to resolve the problem for the client, but it would be good to understand what's different for CNs (which use Metacat, just like mn-demo-6). Could it be a difference in Metacat version, something in the Apache/Tomcat configuration, the cn_rest layer, or the Java version?

Going to create 2 tasks - one for libclient to include default timeouts, the other to look into possible CN issues.

(We removed the closeIdleConnections command in v2 libclient, which may or may not be related. There isn't much information on Stack Overflow about this situation with HttpClient v4.3.x.)

#3 Updated by Rob Nahf about 9 years ago

  • Target version set to CCI-2.0.0

#4 Updated by Rob Nahf over 8 years ago

  • Target version changed from CCI-2.0.0 to CLJ
  • Assignee set to Rob Nahf

Not consistently seen, and seen less frequently since libclient_java v2.

#5 Updated by Rob Nahf almost 8 years ago

  • Status changed from New to In Progress
  • % Done changed from 0 to 30

This is too vague, and it's been too long, to confirm success or failure. There's been extensive reworking of libclient connection management since this ticket was opened, and we also removed a synchronized declaration on the doGetRequest method, which could stall an entire d1client if a Member Node is unresponsive and fails to time out.
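
To illustrate the synchronized point, a small sketch follows showing how one stalled request holding the lock keeps every other caller of the same client waiting. The class and method names are hypothetical and the stalled node is simulated with a latch; this is not the libclient implementation.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical demo, not libclient code. */
class SynchronizedStallDemo {

    /** Stand-in for a shared rest client whose request method may hold a lock. */
    static class SharedClient {
        private final boolean locked;

        SharedClient(boolean locked) {
            this.locked = locked;
        }

        void doGet(Runnable request) {
            if (locked) {
                synchronized (this) { // mimics a synchronized doGetRequest
                    request.run();
                }
            } else {
                request.run();
            }
        }
    }

    /** Returns true if a fast request could not finish while another request was stalled. */
    static boolean secondCallerBlocked(boolean locked) {
        SharedClient client = new SharedClient(locked);
        CountDownLatch stalledEntered = new CountDownLatch(1);
        CountDownLatch releaseStalled = new CountDownLatch(1);
        CountDownLatch fastDone = new CountDownLatch(1);

        // Simulates a request to an unresponsive node with no timeout.
        Thread stalled = new Thread(() -> client.doGet(() -> {
            stalledEntered.countDown();
            try {
                releaseStalled.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));
        // A request that would return immediately if it could get in.
        Thread fast = new Thread(() -> {
            client.doGet(() -> { });
            fastDone.countDown();
        });

        try {
            stalled.start();
            stalledEntered.await();          // the stalled request is now inside doGet
            fast.start();
            boolean finished = fastDone.await(500, TimeUnit.MILLISECONDS);
            releaseStalled.countDown();      // let the stalled request complete
            stalled.join();
            fast.join();
            return !finished;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

In this sketch, secondCallerBlocked(true) reports the fast request as blocked, while secondCallerBlocked(false) lets it complete alongside the stalled one.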

We are seeing CN bottlenecks under heavy load, associated with resolve() and getNodeList(), and possibly with SID lookup as well, so keeping this open.
