Story #8082: implement SolrCloudClient to replace HttpService to allow concurrent updates of the Solr index from different machines - Infrastructure - DataONE Tasks

Story #8082

Based on Chris' advice:

<pre>
Chris: So, to add my `$.02`, I think that yes, we can parallelize the indexing across CNs using a
federated queue, and it’s what we (somewhat desperately) need. One thing to consider, though, is how
our clients handle conflicts. As I understand it, the Solr server uses the internal `_version_` field
to handle concurrent update requests on documents. If the client request has the same `_version_` as
the server, the request should succeed. However, if it doesn’t, the expected behavior is for the
server to return an `HTTP 409` error, and the client is supposed to `GET` the latest document again
and resend the update request. This may very well be baked into the SolrJ `HttpSolrClient` and the
`ConcurrentUpdateSolrClient` (which apparently optimizes requests by consolidating multiple updates
into a single HTTP request). However, our indexer code doesn’t use the SolrJ client, but rather the
home-grown `HTTPService` class that performs the update requests. A quick glance at that shows that
it just logs all errors coming back from Solr, and doesn’t handle `_version_` mismatches. So, in
moving toward concurrent clients, we might consider moving to the SolrJ optimized clients if they
look like they handle mismatches gracefully.

Davev: and perhaps use a queue for feeding Solr. Indexer tasks add processed docs to the queue, where a
process using SolrJ sends them to Solr.
</pre>
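
The conflict handling Chris describes (read the latest document, resend the update, repeat on a version mismatch) can be sketched without Solr at all. The block below is only an illustration under stated assumptions: `VersionedStore`, `VersionConflictException`, and `updateWithRetry` are hypothetical names standing in for the Solr server, its `HTTP 409` response, and the client-side retry loop. None of them is part of SolrJ, and the sketch says nothing about whether the SolrJ clients already do this.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.UnaryOperator;

// Sketch only: an in-memory stand-in for a versioned index, not the SolrJ API.
final class OptimisticUpdateSketch {

    /** Immutable document body plus the version the store assigned to it. */
    static final class Doc {
        final String body;
        final long version;
        Doc(String body, long version) { this.body = body; this.version = version; }
    }

    /** Hypothetical stand-in for the HTTP 409 returned on a version mismatch. */
    static final class VersionConflictException extends RuntimeException {
        VersionConflictException(String message) { super(message); }
    }

    /** Hypothetical store that rejects any update carrying a stale version. */
    static final class VersionedStore {
        private final Map<String, Doc> docs = new ConcurrentHashMap<>();

        Doc get(String id) { return docs.get(id); }

        // In this sketch, version 0 stands for "not stored yet"; this is a
        // simplification and not Solr's own convention for _version_ values.
        synchronized Doc update(String id, String body, long expectedVersion) {
            Doc current = docs.get(id);
            long currentVersion = (current == null) ? 0L : current.version;
            if (currentVersion != expectedVersion) {
                throw new VersionConflictException("conflict on " + id + ": expected "
                        + expectedVersion + " but found " + currentVersion);
            }
            Doc next = new Doc(body, currentVersion + 1);
            docs.put(id, next);
            return next;
        }
    }

    /** Re-reads the latest document and reapplies the edit until the update is accepted. */
    static Doc updateWithRetry(VersionedStore store, String id,
                               UnaryOperator<String> edit, int maxAttempts) {
        if (maxAttempts < 1) throw new IllegalArgumentException("maxAttempts must be >= 1");
        VersionConflictException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            Doc latest = store.get(id);
            long version = (latest == null) ? 0L : latest.version;
            String body = edit.apply(latest == null ? "" : latest.body);
            try {
                return store.update(id, body, version);
            } catch (VersionConflictException e) {
                last = e; // another writer got in first; loop to re-read and retry
            }
        }
        throw last;
    }

    public static void main(String[] args) throws InterruptedException {
        VersionedStore store = new VersionedStore();
        // Two writers stand in for indexers on different machines.
        Runnable writer = () -> {
            for (int i = 0; i < 50; i++) {
                updateWithRetry(store, "doc1", body -> body + "x", 1000);
            }
        };
        Thread a = new Thread(writer);
        Thread b = new Thread(writer);
        a.start(); b.start();
        a.join(); b.join();
        System.out.println("final version: " + store.get("doc1").version);
    }
}
```

The point of the sketch is the contrast with a client that only logs errors: a rejected update is re-read and resent rather than dropped, so updates from both writers end up applied.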

The CloudSolrClient seems to fit the bill...

https://community.hortonworks.com/questions/9611/concurrentupdatesolrclient-vs-cloudsolrclient-for.html
