Project

General

Profile

Bug #6800

SOLR indexes malformed strings - identifier, id

Added by Mark Servilla almost 10 years ago. Updated almost 9 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Skye Roseboom
Category:
-
Target version:
-
Start date:
2015-02-03
Due date:
% Done:

100%

Story Points:
Sprint:

Description

During maintenance of mnTestLTER content on cn-sandbox-ucsb-1.test.dataone.org, it was found that results from a SOLR query returned records that contained malformed "identifier" and "id" strings. There are a total of 18 of such records.

Query:

curl -s -X GET "https://cn-sandbox-ucsb-1.test.dataone.org/cn/v1/query/solr/?q=datasource:urn\:node\:mnTestLTER"

List of malformed identifiers:

ld>:10.6073/pasta/e2c4d7746fc2dcda6bf62cba99389567

>oi:10.6073/pasta/18e14273378ba371de1833a41d32301d

otrends/2927/2/d5f755665edc9ckage/data/eml/ecotrends/2927/2/d5f755665edc967b5c9195366cbb6c8d

417/2.lternet.edu/package/report/eml/ecotrends/10417/2

rends/8468/2t.edu/package/report/eml/ecotrends/8468/2

85a238d13910aeaf02db6aage/data/eml/ecotrends/3920/2/ea7fd9dfab85a238d13910aeaf02db6a

bff04f425a9409e595b9ckage/data/eml/knb-lter-and/2722/6/924fa927309abff04f425a9409e595b9

071ce2bca9a8f73a007241b96f2240f1ml/ecotrends/1671/2/071ce2bca9a8f73a007241b96f2240f1

age/data/eml/.lternet.edu/package/data/eml/ecotrends/10568/2/a115d2a94070ca2085996903277db48a

34fd95f1581a290a6a35e1ege/data/eml/ecotrends/8459/2/5644b03fb34fd95f1581a290a6a35e1e

ld>ps://pasta.lternet.edu/package/data/eml/ecotrends/4473/2/f0e51aae0a51606233df892959d3a56d

otrends/8606/2/f0b80ea995d9fckage/data/eml/ecotrends/8606/2/f0b80ea995d9f6d585a2c9b2d99a7872

age/report/em.lternet.edu/package/report/eml/knb-lter-arc/10272/3

11085/2/18e3871ec8c06eb1420867cd2c6ata/eml/ecotrends/11085/2/18e3871ec8c06eb1420867cd2c6a0cb6

46a31afb6bfbeb683a12e938e/data/eml/ecotrends/6868/2/c2ca91ad46a31afb6bfbeb683a12e938

f3e297e1dd</pasta/a76c718448bff15dc8b004f3e297e1dd

211835310dbf305f731b9cce54b41bdeml/ecotrends/6584/2/e211835310dbf305f731b9cce54b41bd

d53e0930ernet.edu/package/data/eml/ecotrends/1586/2/c21358a57bffd9e176f634c1d53e0930

It is as if the string is being overwritten by left-over string-buffer content - for example, the malformed identifier "d53e0930ernet.edu/package/data/eml/ecotrends/1586/2/c21358a57bffd9e176f634c1d53e0930" contains the wrong sub-string "d53e0930" in position 0 through 15, thereby replacing the correct sub-string "https://pasta.lt". The same type of malformed strings is also found on cn-stage-ucsb-1.test.dataone.org.

The full output from the SOLR query may be found in the attached text document "cn-sandbox-ucsb-1-SOLR.txt".

cn-sandbox-ucsb-1-SOLR.txt Magnifier (19.4 KB) Mark Servilla, 2015-02-03 16:58

pidsInSolrNotInObjectStore.txt Magnifier (116 KB) Peter Slaughter, 2015-06-30 17:37


Related issues

Duplicated by ONE Mercury - Bug #6870: Fix handling of identifiers with url-escaped characters Rejected 2015-02-28

History

#1 Updated by Matthew Jones almost 10 years ago

The MDC project also found similarly malformed strings in the production SOLR index. Peter Slaughter is investigating.

#2 Updated by Skye Roseboom almost 10 years ago

  • translation missing: en.field_release set to 2

#3 Updated by Rob Nahf over 9 years ago

  • Related to Bug #6870: Fix handling of identifiers with url-escaped characters added

#4 Updated by Lauren Walker over 9 years ago

Here is an identifier that is experiencing this bug right now, on sandbox 2:

_NA_v2.0.xmlMAP_PRESENTVEG_C3Grass_RelaFrac_NA_v2.0.xml

#5 Updated by Skye Roseboom over 9 years ago

  • Status changed from New to In Progress
  • % Done changed from 0 to 30

#6 Updated by Skye Roseboom over 9 years ago

  • Target version set to CCI-1.5.1

#7 Updated by Skye Roseboom over 9 years ago

  • Related to deleted (Bug #6870: Fix handling of identifiers with url-escaped characters)

#8 Updated by Skye Roseboom over 9 years ago

  • Duplicated by Bug #6870: Fix handling of identifiers with url-escaped characters added

#9 Updated by Skye Roseboom over 9 years ago

  • Target version deleted (CCI-1.5.1)

#10 Updated by Skye Roseboom over 9 years ago

  • Subject changed from SOLR indexes malformed identifier and id strings for mnTestLTER to SOLR indexes malformed strings - identifier, id

Problem effects all environments. It also effects more than the identifier and id fields. Problem has been observed in date fields, full-text field, origin, author, fileID.

Problem is not consistent - in that re-indexing a solr record wit mangled values creates a new solr record without any mangled values.

Debugging indicates that the index-processor is sending proper solr xml messages to the solr server, but the string values are being misinterpreted by the solr server. This is supported by output of the xml payload being generated at d1's index processor and the xml payload received at the solr server - both of which when logged appear to be proper - no mangled string values.

Problem persists and will take more work to discover cause.

Working on some scripts to help detect and clean up solr records and re-index documents which get mangled.

#11 Updated by Peter Slaughter over 9 years ago

Ran a test on 6/26/2015 on cn-ucsb-1.dataone.org that fetched all pids from solr and then checked in object store (in this case guid field from 'systemmetadata' table). The attached
file 'pidsInSolrNotInObjectStore.txt' shows the pids that were in solr but not in the systemmetadata table.

#12 Updated by Matthew Jones over 9 years ago

This problem also extends to the AuthoritativeMN field. Here's an example of it being mangled:

evolve whRYAD

which can be seen here:

https://cn.dataone.org/cn/v1/query/solr/?fl=identifier,authoritativeMN&q=id:*dryad.gp23s/1%3Fver*

#13 Updated by Skye Roseboom about 9 years ago

Issue is a cause of running solr3.x on a Java7 JDK. Solr3.x line is no longer patched by solr community so issue will not be resolved until deployment of solr5 with the CCI V2.0

#14 Updated by Skye Roseboom almost 9 years ago

  • % Done changed from 30 to 50
  • Status changed from In Progress to Testing

New version of solr is being deployed to production and search index is rebuilding. No mangled strings have been detected in stage or production.

#15 Updated by Skye Roseboom almost 9 years ago

  • Status changed from Testing to Closed
  • % Done changed from 50 to 100

Solr 5 installed into production and index rebuilt. Problem appears to have gone away with solr5

Also available in: Atom PDF

Add picture from clipboard (Maximum size: 14.8 MB)