Bug #6870
Fix handling of identifiers with url-escaped characters
0%
Description
We've been getting new content from Dryad, and the ONEMercury handling of the identifiers looks to be broken. For example, search for * with Dryad as the Member Node:
https://cn.dataone.org/onemercury/send/query?term1=*&term1attribute=fullText&term2.1=&term2.1attribute=fullText&op2.1=AND&term2.2attribute=fullText&term2.2=&op2.2=AND&term2.3attribute=fullText&term2.3=&op2.3=and&term3=%2C%2C%2C&term3attribute=overlaps&op4=during&term4=&term4attribute=beginDate&term5=&term5attribute=endDate&term6attribute=datasource&term8=either&pageSize=10&queryString=+Entire+Document+%3A+*++and+true+coordinates+%28N%2CW%2CS%2CE%29+%3D+%28%2C%2C%2C%29+and+++and++from+sources%3A+urn%3Anode%3ADRYAD&instance=pilotcatalog&filterForDataHidden=&term6=urn%3Anode%3ADRYAD
Almost all of the identifiers are rendered incorrectly. Some look truncated, others contain XML markup, etc.:
ttp://dx.doi.org/10.5061/dryad.6gr7t/2?ver=2014-02-21T12:54:19.782-05:00
x.doi.org/10.5061/dryad.121d03jc/11?ver=2012-08-16T10:38:20.266-04:00
00oi.org/10.5061/dryad.121d03jc/8?ver=2012-08-16T10:36:22.333-04:00
080/9?ver=2013-01/10.5061/dryad.sd080/9?ver=2013-01-30T12:24:48.723-05:00
This results in broken links to metadata content, such as:
https://cn.dataone.org/onemercury/send/xsltText2?pid=%3Ettp://dx.doi.org/10.5061/dryad.6gr7t/2?ver=2014-02-21T12:54:19.782-05:00&fileURL=https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.6gr7t%2F2%3Fver%3D2014-02-21T12%3A54%3A19.782-05%3A00&full_datasource=Dryad%20Digital%20Repository&full_queryString=%20*%20AND%20has%20direct%20data%20AND%20%28%20datasource%20:%28%20urn:node:DRYAD%20%20%29%20%29%20&ds_id=
When using the actual pid, the content is present:
https://cn.dataone.org/onemercury/send/xsltText2?pid=http://dx.doi.org/10.5061/dryad.6gr7t/2?ver=2014-02-21T12:54:19.782-05:00&fileURL=https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.6gr7t%2F2%3Fver%3D2014-02-21T12%3A54%3A19.782-05%3A00&full_datasource=Dryad%20Digital%20Repository&full_queryString=%20*%20AND%20has%20direct%20data%20AND%20%28%20datasource%20:%28%20urn:node:DRYAD%20%20%29%20%29%20&ds_id=#top
We need to track down where the identifiers are getting mangled.
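One way to start narrowing this down, as a minimal sketch: in the links above, the fileURL parameter still carries the correctly escaped pid while the pid parameter carries the mangled one, so the two can be compared link by link. The helper names (query_params, compare_pids, encode_pid) and the RESOLVE_MARKER constant are made up for illustration and are not part of ONEMercury; only Python's standard urllib.parse is used.

```python
from urllib.parse import quote, unquote, urlsplit

# Path prefix seen in the fileURL values above (assumed stable for this check).
RESOLVE_MARKER = "/cn/v1/resolve/"


def query_params(url):
    """Split the query on '&' and on the first '=' only, without decoding.

    The pid is embedded unescaped and contains its own '?' and '=',
    so values are kept raw here and decoded by the caller.
    """
    params = {}
    for pair in urlsplit(url).query.split("&"):
        name, _, value = pair.partition("=")
        params[name] = value
    return params


def compare_pids(link):
    """Return (rendered pid, pid recovered from fileURL, whether they match)."""
    params = query_params(link)
    rendered = unquote(params["pid"])
    resolved = unquote(params["fileURL"].split(RESOLVE_MARKER, 1)[1])
    return rendered, resolved, rendered == resolved


def encode_pid(pid):
    """Percent-encode every reserved character, as the fileURL values do."""
    return quote(pid, safe="")
```

Running compare_pids over each "View full metadata" link in a result page would show which records have a pid that differs from the one in fileURL, and by how many leading characters.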
Related issues
History
#1 Updated by Robert Waltz almost 10 years ago
I performed the first search identified above and found this result:
Santos, Scott R.. 01/14/2014. Car_ign_only_1200bp+_nucleotide_contigs_from_Ray_assembly.
Identifier: >ttp://dx.doi.org/10.5061/dryad.6gr7t/2?ver=2014-02-21T12:54:19.782-05:00 Datasource: Dryad Digital Repository
FASTA file of only nucleotide contigs >=1,200 bp assembled from 139,329,276 100 bp pair end (PE) reads from an Illumina HiSeq 2000 using Ray v2.0.0 for Caranx ignobilis...
Above is the first entry returned from the search.
The raw HTML shows this text:
// Register button with downloadPanel component
M3.downloadPanel.registerButton('d1-download-panel-button-1','http://dx.doi.org/10.5061/dryad.6gr7t?format=d1rem&ver=2014-02-21T13:17:41.985-05:00', '>ttp://dx.doi.org/10.5061/dryad.6gr7t/2?ver=2014-02-21T12:54:19.782-05:00');
Download
ttp://dx.doi.org/10.5061/dryad.6gr7t/2?ver=2014-02-21T12:54:19.782-05:00&fileURL=https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.6gr7t%2F2%3Fver%3D2014-02-21T12%3A54%3A19.782-05%3A00&full_datasource=Dryad Digital Repository&full_queryString= * AND has direct data AND ( datasource :( urn:node:DRYAD ) ) &ds_id='">View full metadata
Identifier:
</field>x.doi.org/10.5061/dryad.121d03jc/11?ver=2012-08-16T10:38:20.266-04:00
I believe http://dx.doi.org/10.5061/dryad.6gr7t?format=d1rem&ver=2014-02-21T13:17:41.985-05:00 to be the valid identifier, but it is mangled in a variety of ways in the HTML produced above. It appears that when the identifier is combined into URLs, parts of the resulting URL are truncated.
#2 Updated by Dave Vieglais almost 10 years ago
- Tracker changed from Task to Bug
#3 Updated by Rob Nahf almost 10 years ago
- Related to Bug #6800: SOLR indexes malformed strings - identifier, id added
#4 Updated by Dave Vieglais almost 10 years ago
- Assignee changed from Mark Servilla to Dave Vieglais
#5 Updated by Skye Roseboom over 9 years ago
- Status changed from New to Rejected
Identifiers are being mangled in the indexing process. Not a bug in one-mercury. Duplicate of #6800.
#6 Updated by Skye Roseboom over 9 years ago
- Related to deleted (Bug #6800: SOLR indexes malformed strings - identifier, id)
#7 Updated by Skye Roseboom over 9 years ago
- Duplicates Bug #6800: SOLR indexes malformed strings - identifier, id added