Story #8082

Updated by Rob Nahf over 5 years ago

Based on Chris' advice:

Chris: So, to add my `$.02`, I think that yes, we can parallelize the indexing across CNs using a federated queue, and it’s what we
(somewhat desperately) need. One thing to consider, though, is how our clients handle conflicts. As I understand it, the Solr server uses
the internal `_version_` field to handle concurrent update requests on documents. If the client request has the same `_version_` as the
server, the request should succeed. However, if it doesn’t, the expected behavior is for the server to return an `HTTP 409` error, and the
client is supposed to `GET` the latest document again and resend the update request. This may very well be baked into the SolrJ
`HttpSolrClient` and the `ConcurrentUpdateSolrClient` (which apparently optimizes requests by consolidating multiple updates into a
single HTTP request). However, our indexer code doesn’t use the SolrJ client, but rather the home-grown `HTTPService` class that
performs the update requests. A quick glance at that shows that it just logs all errors coming back from Solr, and doesn’t handle
`_version_` mismatches. So, in moving toward concurrent clients, we might consider moving to the SolrJ optimized clients if they look like
they handle mismatches gracefully.
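
For reference, the fetch-and-resend behavior Chris describes could look roughly like the sketch below. It is only an illustration, not our indexer code and not SolrJ: `VersionedStore` and `RetryingUpdater` are hypothetical names, the store is an in-memory stand-in for Solr, and it models only the exact-match `_version_` case (a mismatch yields a 409).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical in-memory stand-in for a Solr core; this is NOT a SolrJ class. */
class VersionedStore {
    /** A stored document: a value plus the version assigned by the store. */
    static final class Doc {
        final String value;
        final long version;
        Doc(String value, long version) { this.value = value; this.version = version; }
    }

    static final int OK = 200;
    static final int CONFLICT = 409;

    private final Map<String, Doc> docs = new HashMap<>();
    private long nextVersion = 1;

    synchronized Doc get(String id) { return docs.get(id); }

    /** Unconditional write; used to seed data or to play a competing writer. */
    synchronized void put(String id, String value) {
        docs.put(id, new Doc(value, nextVersion++));
    }

    /** Conditional write: succeeds only if expectedVersion matches the stored version. */
    synchronized int update(String id, long expectedVersion, String value) {
        Doc current = docs.get(id);
        if (current == null || current.version != expectedVersion) {
            return CONFLICT;
        }
        docs.put(id, new Doc(value, nextVersion++));
        return OK;
    }
}

/** Hypothetical client-side helper showing the re-fetch-and-resend loop. */
class RetryingUpdater {
    /** Returns the number of attempts used, or -1 if the update was abandoned. */
    static int updateWithRetry(VersionedStore store, String id,
                               UnaryOperator<String> change, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            VersionedStore.Doc latest = store.get(id);      // GET the latest document
            if (latest == null) {
                return -1;                                  // nothing to update
            }
            String updated = change.apply(latest.value);    // re-apply the change
            int status = store.update(id, latest.version, updated);
            if (status == VersionedStore.OK) {
                return attempt;
            }
            // 409: another writer got in first, so loop and fetch again
        }
        return -1;
    }
}
```

The point of the sketch is that the change has to be re-applied to the freshly fetched document on every attempt. Merely logging the 409, as `HTTPService` appears to do, silently drops the update.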

Davev: and perhaps use a queue for feeding Solr. Indexer tasks add processed docs to the queue, and a process using SolrJ sends them to Solr.
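
A rough sketch of that queue arrangement, using only the JDK: indexer tasks call `submit`, and a single sender thread drains the queue and hands batches to a sink. `SolrFeedQueue`, the batch size, and the `sink` callback are all assumptions made for illustration. In practice the sink would presumably wrap a SolrJ client call, which is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

/** Hypothetical feeder: many producers enqueue docs, one thread sends batches. */
class SolrFeedQueue {
    // Sentinel compared by identity, so a real document equal to "STOP" is not confused with it.
    private static final String STOP = new String("STOP");

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final Thread sender;

    SolrFeedQueue(int batchSize, Consumer<List<String>> sink) {
        sender = new Thread(() -> {
            List<String> batch = new ArrayList<>();
            try {
                while (true) {
                    String doc = queue.take();       // blocks until a doc is available
                    if (doc == STOP) {
                        break;
                    }
                    batch.add(doc);
                    if (batch.size() >= batchSize) {
                        sink.accept(new ArrayList<>(batch));
                        batch.clear();
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();  // preserve the interrupt flag
            }
            if (!batch.isEmpty()) {
                sink.accept(new ArrayList<>(batch)); // flush the final partial batch
            }
        });
        sender.start();
    }

    /** Called by indexer tasks; safe to call from multiple threads. */
    void submit(String doc) {
        queue.add(doc);
    }

    /** Signals the sender to finish and waits for it to flush. */
    void close() {
        queue.add(STOP);
        try {
            sender.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Because only the sender thread talks to the sink, updates reach it in queue order from a single writer within one process. That does not by itself remove conflicts between CNs, each of which would run its own sender.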

The CloudSolrClient seems to fit the bill...

