Task #2295
Story #1386: Generation of SOLR index for Mercury
Update fullText field in solr schema.xml
100%
Description
Need to ensure that the "fullText" field contains all text in the XML document (excluding element names).
Open questions: should fullText be a copyField destination? Should it replace the 'text' field? Is the existing fullText field used?
History
#1 Updated by Skye Roseboom almost 13 years ago
- Status changed from New to In Progress
#2 Updated by Skye Roseboom almost 13 years ago
Consider stripping //dataset/dataTable elements from fullText to avoid indexing large data sets.
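A minimal sketch of what stripping //dataset/dataTable before building fullText could look like. The class and method names (DataTableFilter, textWithoutDataTables) are hypothetical, and a DOM parse is used only for brevity; it assumes the dataset and dataTable elements are not namespace-qualified and that the document fits in memory.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the Mercury or indexing code base.
final class DataTableFilter {

    // Returns the document's text content with every //dataset/dataTable subtree removed.
    static String textWithoutDataTables(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // Select all dataTable elements that are direct children of a dataset element.
            NodeList tables = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//dataset/dataTable", doc, XPathConstants.NODESET);

            // Detach each selected subtree so its text is not collected below.
            for (int i = 0; i < tables.getLength(); i++) {
                Node table = tables.item(i);
                table.getParentNode().removeChild(table);
            }

            // Concatenate the remaining text nodes and collapse runs of whitespace.
            return doc.getDocumentElement().getTextContent()
                    .replaceAll("\\s+", " ")
                    .trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML document", e);
        }
    }
}
```

Because getTextContent() concatenates adjacent text nodes without a separator, documents whose elements are not separated by whitespace may need a space inserted between fields.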
#3 Updated by Skye Roseboom almost 13 years ago
Further clarification from the Mercury folks (Jim Green):
The fullText field is populated by building a string
which contains all of the contents.
The caveat here is that most of the FGDC files are relatively small,
and these do NOT contain actual data.
As for filtering, not much. Just some cleanup of artifacts from the
harvesting process. We remove all of the tags, and anything which is
part of a "<![CDATA[" block.
This is all done with some low-level java, since we had performance
issues using any of the DOM libraries for manipulating the xml.
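The tag and CDATA removal described above might look roughly like the following single-pass scan. This is only a sketch and not Mercury's actual code: the class and method names (FullTextExtractor, extract) are hypothetical, and it assumes tags and CDATA sections are simply dropped and replaced by a space. It does not decode entities, and a comment containing ">" would not be handled correctly.

```java
// Hypothetical helper illustrating tag and CDATA stripping without a DOM library.
final class FullTextExtractor {

    private static final String CDATA_OPEN = "<![CDATA[";
    private static final String CDATA_CLOSE = "]]>";

    // Returns the character data of the input with tags and CDATA sections removed.
    static String extract(String xml) {
        StringBuilder out = new StringBuilder(xml.length());
        int i = 0;
        while (i < xml.length()) {
            if (xml.startsWith(CDATA_OPEN, i)) {
                // Skip the entire CDATA section, including its content.
                int end = xml.indexOf(CDATA_CLOSE, i + CDATA_OPEN.length());
                i = (end < 0) ? xml.length() : end + CDATA_CLOSE.length();
                out.append(' ');
            } else if (xml.charAt(i) == '<') {
                // Skip a tag up to and including its closing '>'.
                int end = xml.indexOf('>', i + 1);
                i = (end < 0) ? xml.length() : end + 1;
                out.append(' ');
            } else {
                // Ordinary character data is kept as-is.
                out.append(xml.charAt(i));
                i++;
            }
        }
        // Collapse the whitespace left behind by removed markup.
        return out.toString().replaceAll("\\s+", " ").trim();
    }
}
```

For example, extract("<title>Soil survey</title><abstract>2010 data</abstract>") yields "Soil survey 2010 data".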
#4 Updated by Skye Roseboom almost 13 years ago
Since tags and element names are not part of the fullText field, I've re-implemented this as a copyField that accumulates the text from the science metadata fields, rather than creating and maintaining a class to strip tags, CDATA blocks, and dataTables.
#5 Updated by Skye Roseboom almost 13 years ago
- Status changed from In Progress to Closed
#6 Updated by Skye Roseboom almost 13 years ago
Skye Roseboom wrote:
Since tags and element names are not part of the fullText field, I've re-implemented this as a copyField that accumulates the text from the science metadata fields, rather than creating and maintaining a class to strip tags, CDATA blocks, and dataTables.
Also, it appears that Mercury actually queries the 'text' field, not 'fullText'.
#7 Updated by Skye Roseboom almost 13 years ago
- Parent task changed from #2004 to #1386