Story #8082: implement SolrCloudClient to replace HttpService to allow concurrent updates of the Solr index from different machines - Infrastructure - DataONE Tasks

Story #8082

Based on Chris' advice:

<pre>
Chris: So, to add my `$.02`, I think that yes, we can parallelize the indexing across CNs using a federated queue, and it’s what we (somewhat desperately) need. One thing to consider, though, is how our clients handle conflicts. As I understand it, the Solr server uses the internal `_version_` field to handle concurrent update requests on documents. If the client request has the same `_version_` as the server, the request should succeed. However, if it doesn’t, the expected behavior is for the server to return an `HTTP 409` error, and the client is supposed to `GET` the latest document again and resend the update request. This may very well be baked into the SolrJ `HttpSolrClient` and the `ConcurrentUpdateSolrClient` (which apparently optimizes requests by consolidating multiple updates into a single HTTP request). However, our indexer code doesn’t use the SolrJ client, but rather the home-grown `HTTPService` class that performs the update requests. A quick glance at that shows that it just logs all errors coming back from Solr, and doesn’t handle `_version_` mismatches. So, in moving toward concurrent clients, we might consider moving to the SolrJ optimized clients if they look like they handle mismatches gracefully.
Davev: and perhaps use a queue for feeding Solr. Indexer tasks add processed docs to the queue, where a process using SolrJ sends to Solr.
</pre>
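
For reference, the re-read-and-resend behavior Chris describes could look roughly like the sketch below. It uses only the Java standard library: `InMemoryIndex`, `VersionedDoc`, `VersionConflictException`, and `RetryingUpdater` are hypothetical stand-ins invented for illustration, not SolrJ or `HTTPService` classes. The stand-in treats version 0 as "document does not exist yet", which is a simplification and not Solr's own `_version_` convention. Whether the SolrJ clients already do something like this internally is still the open question from the chat.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.UnaryOperator;

/** Hypothetical stand-in for an indexed document plus the version it was stored at. */
final class VersionedDoc {
    final long version;
    final String body;
    VersionedDoc(long version, String body) { this.version = version; this.body = body; }
}

/** Plays the role of the HTTP 409 response in this sketch. */
final class VersionConflictException extends RuntimeException {
    VersionConflictException(String message) { super(message); }
}

/** In-memory stand-in for the server-side version check; not a Solr API. */
final class InMemoryIndex {
    private final Map<String, VersionedDoc> docs = new ConcurrentHashMap<>();

    VersionedDoc get(String id) { return docs.get(id); }

    /** Accepts the update only if expectedVersion matches the stored version (0 = no document yet). */
    synchronized VersionedDoc update(String id, long expectedVersion, String newBody) {
        VersionedDoc current = docs.get(id);
        long currentVersion = (current == null) ? 0L : current.version;
        if (currentVersion != expectedVersion) {
            throw new VersionConflictException(
                "expected version " + expectedVersion + " but found " + currentVersion);
        }
        VersionedDoc next = new VersionedDoc(currentVersion + 1, newBody);
        docs.put(id, next);
        return next;
    }
}

final class RetryingUpdater {
    /** Reads the latest document, applies the edit, and resends; on a conflict it re-reads and retries. */
    static VersionedDoc updateWithRetry(InMemoryIndex index, String id,
                                        UnaryOperator<String> edit, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        VersionConflictException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            VersionedDoc current = index.get(id);              // "GET the latest document again"
            long version = (current == null) ? 0L : current.version;
            String body = (current == null) ? "" : current.body;
            try {
                return index.update(id, version, edit.apply(body));
            } catch (VersionConflictException e) {
                last = e;                                      // another writer won; loop and re-read
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        InMemoryIndex index = new InMemoryIndex();
        index.update("doc1", 0L, "a");
        boolean[] raced = {false};
        VersionedDoc result = updateWithRetry(index, "doc1", body -> {
            if (!raced[0]) {
                // Simulate another machine writing between our read and our write.
                raced[0] = true;
                index.update("doc1", index.get("doc1").version, body + "b");
            }
            return body + "c";
        }, 3);
        System.out.println(result.version + " " + result.body); // prints "3 abc"
    }
}
```

In the `main` demo the first attempt is rejected because the simulated competing write bumps the version, and the second attempt succeeds against the re-read document. `HTTPService` as described above would instead log the rejection and lose the update.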

SolrJ's CloudSolrClient seems to fit the bill: https://dataoneorg.slack.com/archives/C2ASPD868/p1492633409959188

https://community.hortonworks.com/questions/9611/concurrentupdatesolrclient-vs-cloudsolrclient-for.html
