Bug #6789
Read API calls to the CN are hanging
30%
Description
Calls to CNRead API methods sometimes hang instead of returning.
The calls I stepped through in the debugger went through MultipartD1Node methods. They hung on calls to restClient.doGetRequest(), rather than, say, just taking a long time to deserialize the response.
This doesn't happen every time, and even while the Java client call is hanging, a GET call to the same endpoint from a browser may return results instantly.
Some of the methods that have been hanging:
getSystemMetadata()
resolve()
describe()
listQueryEngines()
getQueryEngineDescription()
History
#1 Updated by Andrei Buium almost 10 years ago
This may be related to how the timeout parameter is used.
It seems to fail on calls like this:
multiPartRestClient.doGetRequest(url.getUrl(), null); // where null is the timeoutMilliseconds Integer
And seems to pass on calls like this:
multiPartRestClient.doGetRequest(url.getUrl(), 1000);
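A minimal sketch of the idea behind the two calls above: treat a null timeout as "use a finite default" rather than "wait forever". This is not the libclient implementation. It uses the JDK's HttpURLConnection instead of HttpClient so it stays self-contained, and the class name, method names, and the 30-second default are all assumptions made for illustration.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URL;

class TimeoutDefaults {
    // Hypothetical default, chosen for illustration only (not a libclient value).
    static final int DEFAULT_TIMEOUT_MS = 30_000;

    // Map a nullable or non-positive timeout to a finite default, so that
    // "no timeout given" never means "block indefinitely".
    static int effectiveTimeout(Integer timeoutMilliseconds) {
        if (timeoutMilliseconds == null || timeoutMilliseconds <= 0) {
            return DEFAULT_TIMEOUT_MS;
        }
        return timeoutMilliseconds;
    }

    // Prepare (but do not yet send) a GET with both timeouts applied.
    static HttpURLConnection open(URL url, Integer timeoutMilliseconds) throws IOException {
        int t = effectiveTimeout(timeoutMilliseconds);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(t); // limit on establishing the connection
        conn.setReadTimeout(t);    // limit on waiting for response bytes
        return conn;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder URL; openConnection() does not contact the host.
        URL url = URI.create("https://cn.example.org/cn/v2/node").toURL();
        HttpURLConnection conn = open(url, null);
        System.out.println("read timeout (ms): " + conn.getReadTimeout());
    }
}
```

With this shape, a caller that passes null still gets a bounded wait, while an explicit value such as 1000 is used as given.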
#2 Updated by Rob Nahf almost 10 years ago
When first looking at this with Andrei, I noticed that the same tests against a Member Node (mn-demo-6) did not hang, so it seems that our CNs have some different connection management behavior that our libclient libraries aren't handling.
Introducing default timeouts for calls from libclient seems to resolve the problem for the client, but it would be good to understand what's different for CNs (which use Metacat, just like mn-demo-6). Could it be a difference in Metacat version, something in the Apache/Tomcat configuration, the cn_rest layer, or the Java version?
Going to create 2 tasks: one for libclient to include default timeouts, the other to look into possible CN issues.
(We removed the closeIdleConnections command in v2 libclient, which may or may not be related. There isn't much information on SO regarding this situation with HttpClient v4.3.x.)
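To illustrate why a default timeout helps the client even before the CN-side cause is understood, here is a small self-contained sketch. It does not reproduce the CN behavior, whose cause is still unknown. It only simulates a server that completes the TCP handshake and then never answers, and shows that a read timeout turns the hang into an exception. The class and method names are hypothetical, and it uses the JDK's HttpURLConnection rather than libclient or HttpClient.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.SocketTimeoutException;
import java.net.URI;

class SilentServerDemo {
    // Returns true if a GET against a server that never responds ends with
    // a SocketTimeoutException instead of blocking.
    static boolean timesOut(int timeoutMs) {
        // The listening socket never calls accept(); the OS still completes the
        // TCP handshake from its backlog, so the client connects and then waits.
        try (ServerSocket silent = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            URI uri = URI.create("http://127.0.0.1:" + silent.getLocalPort() + "/");
            HttpURLConnection conn = (HttpURLConnection) uri.toURL().openConnection();
            conn.setConnectTimeout(timeoutMs);
            conn.setReadTimeout(timeoutMs);
            try {
                conn.getResponseCode(); // sends the GET and waits for a status line
                return false;           // not expected: nothing ever replies
            } catch (SocketTimeoutException e) {
                return true;            // bounded failure instead of a hang
            } finally {
                conn.disconnect();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("timed out: " + timesOut(500));
    }
}
```

Without the setReadTimeout call, the same request would wait until the server side closes the connection, which matches the symptom of a call that never returns.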
#3 Updated by Rob Nahf almost 10 years ago
- Target version set to CCI-2.0.0
#4 Updated by Rob Nahf over 9 years ago
- Target version changed from CCI-2.0.0 to CLJ
- Assignee set to Rob Nahf
Not consistently seen, and seen less frequently since libclient_java v2.
#5 Updated by Rob Nahf over 8 years ago
- Status changed from New to In Progress
- % Done changed from 0 to 30
This ticket is too vague, and it's been too long to confirm success or failure. There's been extensive reworking of libclient connection management since this ticket was opened, and we also removed a synchronized declaration on the doGetRequest method, which could stall an entire d1client if a member node was unresponsive and the call failed to time out.
We are seeing CN bottlenecks under heavy load, associated with resolve() and getNodeList(), and possibly with SID lookup as well, so keeping this open.
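The effect of the synchronized declaration mentioned in this comment can be sketched in isolation. The class below is a hypothetical stand-in, not libclient code: one caller stalls inside a synchronized method (simulating a request to an unresponsive node with no timeout), and a second caller on the same instance is observed to be blocked on the monitor even though its own request would return immediately.

```java
import java.util.concurrent.CountDownLatch;

class SynchronizedStallDemo {
    private final CountDownLatch entered = new CountDownLatch(1);
    private final CountDownLatch release = new CountDownLatch(1);

    // Stand-in for a synchronized request method. When stall is true, the
    // caller keeps holding the monitor, like a request that never times out.
    synchronized void doGet(boolean stall) throws InterruptedException {
        if (stall) {
            entered.countDown();
            release.await(); // waiting on the latch does not release the monitor
        }
    }

    private static void call(SynchronizedStallDemo client, boolean stall) {
        try {
            client.doGet(stall);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Returns true if the fast caller was seen waiting for the monitor.
    static boolean secondCallerBlocks() {
        SynchronizedStallDemo client = new SynchronizedStallDemo();
        Thread slow = new Thread(() -> call(client, true));
        Thread fast = new Thread(() -> call(client, false));
        try {
            slow.start();
            client.entered.await(); // the slow caller now holds the monitor
            fast.start();
            boolean blocked = false;
            for (int i = 0; i < 200 && !blocked; i++) {
                blocked = fast.getState() == Thread.State.BLOCKED;
                Thread.sleep(10);
            }
            client.release.countDown(); // let the stalled request finish
            slow.join();
            fast.join();
            return blocked;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("second caller blocked: " + secondCallerBlocks());
    }
}
```

Dropping the synchronized keyword from doGet lets the second caller proceed independently, which is the behavior the removal described above was aiming for.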