Story #8082
Updated by Rob Nahf over 7 years ago
Based on Chris' advice:
<pre>
Chris: So, to add my `$.02`, I think that yes, we can parallelize the indexing across CNs using a federated queue, and it’s what we
(somewhat desperately) need. One thing to consider, though, is how our clients handle conflicts. As I understand it, the Solr server uses
the internal `_version_` field to handle concurrent update requests on documents. If the client request has the same `_version_` as the
server, the request should succeed. However, if it doesn’t, the expected behavior is for the server to return an `HTTP 409` error, and the
client is supposed to `GET` the latest document again and resend the update request. This may very well be baked into the SolrJ
`HttpSolrClient` and the `ConcurrentUpdateSolrClient` (which apparently optimizes requests by consolidating multiple updates into a
single HTTP request). However, our indexer code doesn’t use the SolrJ client, but rather the home-grown `HTTPService` class that
performs the update requests. A quick glance at that shows that it just logs all errors coming back from Solr, and doesn’t handle
`_version_` mismatches. So, in moving toward concurrent clients, we might consider moving to the SolrJ optimized clients if they look like
they handle mismatches gracefully.
Davev: and perhaps use a queue for feeding solr. Indexer tasks add processed docs to the queue, where a process using solrj sends to solr.
</pre>
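For reference, the fetch-and-resend behaviour Chris describes can be sketched roughly as below. This is only an illustration of the retry loop, not our indexer code and not SolrJ: `VersionedStore`, `VersionConflict`, and `RetryingUpdater` are hypothetical names, the store is an in-memory stand-in for Solr, and its version rule (0 means "no document yet", otherwise the versions must match) is a simplification of what Solr does with `_version_`.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.UnaryOperator;

/** Hypothetical in-memory stand-in for the Solr server (illustration only). */
final class VersionedStore {
    /** A stored document plus the version the store assigned to it. */
    static final class Doc {
        final long version;
        final String body;
        Doc(long version, String body) { this.version = version; this.body = body; }
    }

    /** Raised where a real server would answer with HTTP 409. */
    static final class VersionConflict extends RuntimeException {}

    private final Map<String, Doc> docs = new ConcurrentHashMap<>();

    Doc get(String id) { return docs.get(id); }

    /** Accepts the update only if expectedVersion equals the stored version (0 = not stored yet). */
    synchronized Doc update(String id, long expectedVersion, String body) {
        Doc current = docs.get(id);
        long currentVersion = (current == null) ? 0L : current.version;
        if (currentVersion != expectedVersion) {
            throw new VersionConflict();
        }
        Doc next = new Doc(currentVersion + 1, body);
        docs.put(id, next);
        return next;
    }
}

/** Hypothetical client-side helper: re-read the latest document and resend on conflict. */
final class RetryingUpdater {
    static VersionedStore.Doc updateWithRetry(VersionedStore store, String id,
                                              UnaryOperator<String> edit, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            // Always start from the latest stored state, never from a cached copy.
            VersionedStore.Doc latest = store.get(id);
            long version = (latest == null) ? 0L : latest.version;
            String base = (latest == null) ? "" : latest.body;
            try {
                return store.update(id, version, edit.apply(base));
            } catch (VersionedStore.VersionConflict e) {
                // Another writer got in first; loop to re-read and reapply the edit.
            }
        }
        // Give up after maxAttempts so a hot document cannot spin forever.
        throw new VersionedStore.VersionConflict();
    }
}
```

The point of the sketch is that the edit is reapplied to the freshly fetched document on every attempt, which is exactly what `HTTPService` does not do today since it only logs the error.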
The CloudSolrClient seems to fit the bill...
https://community.hortonworks.com/questions/9611/concurrentupdatesolrclient-vs-cloudsolrclient-for.html
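Davev's queue idea could look something like the following. It is a rough sketch under assumptions, not a design: `SolrFeeder`, the `STOP` sentinel, `batchSize`, and the `sender` callback are all hypothetical, documents are plain strings, and `sender` stands in for whatever SolrJ client ends up doing the actual send.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

/** Hypothetical single consumer that drains processed docs from a queue and sends them in batches. */
final class SolrFeeder implements Runnable {
    /** Sentinel telling the feeder to finish; compared by identity, so it cannot clash with a real doc. */
    static final String STOP = new String("__stop__");

    private final BlockingQueue<String> queue;
    private final int batchSize;
    private final Consumer<List<String>> sender; // placeholder for the real call to Solr

    SolrFeeder(BlockingQueue<String> queue, int batchSize, Consumer<List<String>> sender) {
        this.queue = queue;
        this.batchSize = batchSize;
        this.sender = sender;
    }

    @Override
    public void run() {
        List<String> batch = new ArrayList<>();
        try {
            while (true) {
                String doc = queue.take(); // blocks until an indexer task adds something
                if (doc == STOP) {
                    break;
                }
                batch.add(doc);
                if (batch.size() >= batchSize) {
                    flush(batch);
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
        flush(batch); // send whatever is left over
    }

    private void flush(List<String> batch) {
        if (batch.isEmpty()) {
            return;
        }
        sender.accept(new ArrayList<>(batch)); // hand over a copy, then reuse the buffer
        batch.clear();
    }
}
```

With a single feeder per queue, all writes to Solr from that queue are serialized, so the indexer tasks themselves never race each other on the same document. Feeders on different CNs could still conflict, which is where the `_version_` handling matters.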