Project

General

Profile

Bug #6391

Science metadata files with different checksums on CN and MN - encoding

Added by Skye Roseboom about 5 years ago. Updated over 4 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
Metacat
Target version:
Start date:
2014-11-04
Due date:
2015-01-06
% Done:

100%

Milestone:
None
Product Version:
*
Story Points:
Sprint:

Description

Discovered that some science metadata docs on MN KNB and the CN have different checksums. Further investigation shows what appears to be encoding differences which lead to the different checksums.

Examples:
(title element)
https://cn.dataone.org/cn/v1/object/doi:10.5063/AA/knb.277.1
https://knb.ecoinformatics.org/knb/d1/mn/object/doi:10.5063/AA/knb.277.1

    (method section)
https://cn.dataone.org/cn/v1/object/doi:10.5063/AA/nceas.985.1
https://knb.ecoinformatics.org/knb/d1/mn/object/doi:10.5063/AA/nceas.985.1

Related issues

Related to Infrastructure - Story #3309: get() downloads have inconsistent mime types and filenames Closed 2012-10-08
Related to Infrastructure - Task #6568: Create or update a metadata object without altering content Closed 2014-11-13 2014-11-14
Related to Infrastructure - Task #7042: Create an element for the character set encoding in the system metadata schema Closed 2015-04-14

History

#1 Updated by Dave Vieglais about 5 years ago

RFC 7303 Sections 8.8 and 8.9 are particularly relevant here. http://www.rfc-editor.org/rfc/rfc7303.txt

HTTP headers from the cn:

Content-Type: text/xml;charset=UTF-8

and from the mn:

Content-Type: text/xml

Both the MN and the CN are operating incorrectly, though the CN is worse.

The MN SHOULD be setting the charset parameter when transmitting over HTTP.

The CN output is an example of incorrect encoding of output. In that case, the charset parameter of the HTTP header specifies UTF-8, but the encoding parameter of the XML document indicates ISO-8859-1. HTTP stream readers will likely interpret the content as UTF-8, though if the file is saved without a BOM, then an editor will recognize the file as ISO-8859-1.

#2 Updated by Robert Waltz about 5 years ago

In the web.xml of the DataONE_CN_Rest project there is a filter applied to every request named CharacterEncodingFilter that forces all encoding to be UTF-8. We can either take the filter off, or make its application more selective.

So, if this bug is still seen to apply to Metacat, then we should make a second bug report for cn_rest_service as well.

#3 Updated by Skye Roseboom about 5 years ago

  • Due date set to 2014-09-24
  • Start date set to 2014-09-24
  • Target version set to CCI-1.5.0

#4 Updated by Dave Vieglais about 5 years ago

  • Due date changed from 2014-09-24 to 2014-09-25

Conclusion after examining literature and the requirements of the get() operation:

  • The server should follow the recommendations of RFC7303 and related operational specifications such as RFC6838 and RFC7231

  • If the client is making an exact copy (e.g. during CN sync of metadata, and for replicas), the client MUST stream the exact bytes of the HTTP GET body without alteration

  • The client SHOULD make accessible any MIME and other descriptive information provided by the origin

  • The CNs SHOULD record the MIME type and other descriptive information about the content so that this information can be passed on to clients.

#5 Updated by Jing Tao about 5 years ago

  • Due date changed from 2014-09-25 to 2014-10-29

MN seems ok. Please see 8.3 http://www.rfc-editor.org/rfc/rfc7303.txt.

#6 Updated by Jing Tao about 5 years ago

First step is to implement this:
If the client is making an exact copy (e.g. during CN sync of metadata, and for replicas), the client MUST stream the exact bytes of the HTTP GET body without alteration

#7 Updated by Jing Tao about 5 years ago

in the read actions (such as get and getReplica), bytes are read directly from files. So there is no alteration.

However, in the save actions (such as create and update), there is transform from an input stream to a string by using default utf-8 encoding:

CNode.create(inputStream) ---> D1Node.create(inputStream) ---> D1Node.insertOrUpdateDocument(xmlString)--->MetacatHandler.handleInsertOrUpdateAction( xmlString). The inputStream is transformed to a string by use UTF-8 encoding.

MNode.create(inputStream) ---> D1Node.create(inputStream) ---> D1Node.insertOrUpdateDocument(xmlString)--->MetacatHandler.handleInsertOrUpdateAction( xmlString). The inputStream is transformed to a string by use UTF-8 encoding.

MNode.update(inputStream) ---> D1Node.insertOrUpdateDocument(xmlString)--->MetacatHandler.handleInsertOrUpdateAction( xmlString). The inputStream is transformed to a string by use UTF-8 encoding.

#8 Updated by Jing Tao about 5 years ago

  • Due date changed from 2014-10-29 to 2014-11-03

During the replication in Metacats, Metacat change the input stream from another host to a string using the default encoding, then save the string to the file. We need to save the input stream directly.

#9 Updated by Jing Tao about 5 years ago

  • Due date changed from 2014-11-03 to 2014-11-04
  • Status changed from New to In Progress

#10 Updated by Jing Tao about 5 years ago

  • Status changed from In Progress to Testing
  • Due date changed from 2014-11-04 to 2014-11-13

#11 Updated by Jing Tao about 5 years ago

  • Due date changed from 2014-11-13 to 2014-11-14
  • Target version changed from CCI-1.5.0 to CCI-1.5.1

#12 Updated by Jing Tao about 5 years ago

The features will be released in cci-1.5.0 is:

https://redmine.dataone.org/issues/6568

#13 Updated by Dave Vieglais almost 5 years ago

  • Due date changed from 2014-11-14 to 2015-01-06
  • Target version changed from CCI-1.5.1 to CCI-2.0.0

#14 Updated by Jing Tao over 4 years ago

  • Related to Task #7042: Create an element for the character set encoding in the system metadata schema added

#15 Updated by Jing Tao over 4 years ago

  • Status changed from Testing to Closed
  • % Done changed from 0 to 100

I created a new ticket to generate an option element for the character set encoding in the system metadata.
https://redmine.dataone.org/issues/7042
close the ticket.

Also available in: Atom PDF

Add picture from clipboard (Maximum size: 14.8 MB)