Story #8082: implement SolrCloudClient to replace HttpService to allow concurrent updates of the solr index from differen machines - Infrastructure - DataONE Tasks

Story #8082

Story #8061: develop queue-based processing system for the CN

implement SolrCloudClient to replace HttpService to allow concurrent updates of the solr index from differen machines

Added by Rob Nahf almost 8 years ago. Updated about 7 years ago.

Status:

New

Priority:

Normal

Assignee:

Rob Nahf

Category:

d1_indexer

Target version:

Start date:

2017-04-25

Due date:

% Done:

Story Points:

Sprint:

Infrastructure backlog

Description

Based on Chris' advice:

Chris: So, to add my $.02, I think that yes, we can parallelize the indexing across CNs using a federated queue,
and it’s what we (somewhat desperately) need. One thing to consider, though, is how our clients handle conflicts.

As I understand it, the Solr server use the internal _version_ field to handle concurrent update requests on
documents. If the client request has the same _version_ as the server, the request should succeed. However,
if it doesn’t, the expected behavior is for the server to return an HTTP 409 error, and the client is suppose to
GET the latest document again and resend the update request. This may very well be baked into the SolrJ
HttpSolrClient and the ConcurrentUpdateSolrClient (which apparently optimizes requests by consolidating
multiple updates into a single HTTP request). However, our indexer code doesn’t use the SolrJ client, but rather
the home-grown HTTPService class that performs the update requests. A quick glance at that shows that it
just logs all errors coming back from Solr, and doesn’t handle _version_ mismatches. So, in moving toward
concurrent clients, we might consider moving to the SolrJ optimized clients if they look like they handle
mismatches gracefully.

Davev: and perhaps use a queue for feeding solr. Indexer tasks add processed docs to the queue, where a
process using solrj sends to solr.

The CloudSolrClient seems to fit the bill...

https://community.hortonworks.com/questions/9611/concurrentupdatesolrclient-vs-cloudsolrclient-for.html

Related issues

Associated revisions

Revision 18933
Added by Rob Nahf over 7 years ago

refs: #8082: Half-tested implementation of SolrJ-based client to the solr cores, with CloudSolrClient as the default implementation. Still need to test updates and deletes, tested querying. SolrJClientIT is an integration test for the class.

Revision 18933
Added by Rob Nahf over 7 years ago

Revision 18941
Added by Rob Nahf over 7 years ago

refs: #8082: Cleanup of SolrJClient - simplified constructors and created test-parser-context.xml for flexible implementation via Spring (where it should be). Added a testTypicalPackageIndex test, and test package documents.

Revision 18941
Added by Rob Nahf over 7 years ago

Revision 18942
Added by Rob Nahf over 7 years ago

refs: #8082, SolrJClient is set up to use the hard-commit-after-update that the current client does, to enable testing. configurable in the source code. Also only uses real-time-get if set. (RT get seems to take longer). Finished the simple package indexing test. Fixed a couple NPE errors in subprocessor, that arise if identifiers are looked up in Hz systemmetadata map and are not there, that is, when resourceMaps are indexed before their members.

Revision 18942
Added by Rob Nahf over 7 years ago

History

#1 Updated by Rob Nahf almost 8 years ago

Description updated (diff)

#2 Updated by Rob Nahf almost 8 years ago

Description updated (diff)

#3 Updated by Rob Nahf almost 8 years ago

Description updated (diff)

#4 Updated by Rob Nahf almost 8 years ago

Parent task deleted (~~#8081~~)

#5 Updated by Rob Nahf almost 8 years ago

Parent task set to #8061

#6 Updated by Rob Nahf over 7 years ago

regarding Dave's comment in the desxcription, see ConcurrentUpdateSolrClient https://lucene.apache.org/solr/5_3_1/solr-solrj/org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrClient.html

as well as this article on batching updates: https://lucidworks.com/2015/10/05/really-batch-updates-solr-2/

Both point us to the fact that batching updates reduces overhead and can speed up performance.

Need to check to see if optimistic concurrency is used by SolrJ be default. We'd need to at least put the version field in the updates we are submitting, I believe to enable this kind of concurrency control. This would be good because it will reduce race situations.

(Optimistic concurrency - "If first you don't succeed, try, try again")

For additional improvements, we also need to look at atomic updates, these can be combined with optimistic concurrency chris mentioned above: see http://yonik.com/solr/atomic-updates/

#7 Updated by Dave Vieglais about 7 years ago

Sprint set to Infrastructure backlog

#8 Updated by Rob Nahf over 6 years ago

Related to Story #8702: Indexing Refactor Strategy added

Also available in: Atom PDF

Project

General

Profile

Infrastructure

Issues

Custom queries