Task #2295: Update fullText field in solr schema.xml - Infrastructure - DataONE Tasks

Task #2295

Story #1386: Generation of SOLR index for Mercury

Update fullText field in solr schema.xml

Added by Skye Roseboom about 13 years ago. Updated about 13 years ago.

Status: Closed
Priority: Normal
Assignee: Skye Roseboom
Category: d1_indexer
Target version: Sprint-2012.07-Block.1.4
Start date: 2012-02-07
Due date:
% Done: 100%
Milestone: CCI-1.0.0
Product Version:
Story Points:
Sprint:
Description

Need to ensure that the "fullText" field is populated with all text in the XML document (excluding element names).

Open questions: Make fullText a copyField destination? Replace the 'text' field with fullText? Is the existing fullText field used anywhere?

History

#1 Updated by Skye Roseboom about 13 years ago

Status changed from New to In Progress

#2 Updated by Skye Roseboom about 13 years ago

Consider stripping //dataset/dataTable elements from fullText to avoid indexing large data sets.

#3 Updated by Skye Roseboom about 13 years ago

Further clarification from the Mercury folks (Jim Green):

The fullText field is populated by building a string
which contains all of the contents.
The caveat here is that most of the FGDC files are relatively small,
and these do NOT contain actual data.

As for filtering, not much. Just some cleanup of artifacts from the
harvesting process. We remove all of the tags, and anything which is
part of a "<![CDATA[" block.
This is all done with some low-level Java, since we had performance
issues using any of the DOM libraries for manipulating the XML.

#4 Updated by Skye Roseboom about 13 years ago

Since tags and element names are not part of the fullText field, I've re-implemented this as a copyField that accumulates the text from the science metadata fields, rather than creating and maintaining a class to strip tags, CDATA blocks, and dataTables.
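
For reference, a minimal sketch of what such a copyField setup could look like in schema.xml. This is not the committed change: the source field names (title, abstract, keywords) and the "text" field type are assumptions for illustration, and the real schema may use different names and attributes.

```xml
<!-- Sketch only. Source field names and the field type are assumed, not taken from the actual schema. -->
<fields>
  <!-- Destination field: multiValued because several source fields are copied into it. -->
  <field name="fullText" type="text" indexed="true" stored="false" multiValued="true"/>
</fields>

<!-- Each copyField copies the raw value of a science metadata field into fullText at index time. -->
<copyField source="title" dest="fullText"/>
<copyField source="abstract" dest="fullText"/>
<copyField source="keywords" dest="fullText"/>
```

Because the source fields already hold extracted text values, no tags, CDATA blocks, or dataTable contents reach fullText, so no separate stripping class is needed.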

#5 Updated by Skye Roseboom about 13 years ago

Status changed from In Progress to Closed

#6 Updated by Skye Roseboom about 13 years ago

Skye Roseboom wrote:

Since tags and element names are not part of the fullText field, I've re-implemented this as a copyField that accumulates the text from the science metadata fields, rather than creating and maintaining a class to strip tags, CDATA blocks, and dataTables.

Also, it appears that Mercury actually queries the 'text' field, not 'fullText'.

#7 Updated by Skye Roseboom about 13 years ago

Parent task changed from #2004 to #1386
