Bug #7870
Metacat considers FGDC documents invalid from member node SEAD
100%
Description
When we harvested the documents from SEAD, we got the error "Fatal processing error."
I used a curl command to create an object on a Metacat instance through the DataONE API and got the same error with more details:
metacat 20160825-11:43:19: [WARN]: MetacatHandler.handleInsertOrUpdateAction - General error when writing eml document to the database: Fatal processing error. [edu.ucsb.nceas.metacat.MetacatHandler]
org.xml.sax.SAXException: Fatal processing error.
org.xml.sax.SAXParseException; systemId: http://www.fgdc.gov/metadata/fgdc-std-001-1998.xsd; lineNumber: 1; columnNumber: 50; White spaces are required between publicId and systemId.
at edu.ucsb.nceas.metacat.DBSAXHandler.fatalError(DBSAXHandler.java:736)
at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
at org.apache.xerces.impl.XMLScanner.scanExternalID(Unknown Source)
at org.apache.xerces.impl.XMLDocumentScannerImpl.scanDoctypeDecl(Unknown Source)
at org.apache.xerces.impl.XMLDocumentScannerImpl$PrologDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.impl.xs.opti.SchemaParsingConfig.parse(Unknown Source)
at org.apache.xerces.impl.xs.opti.SchemaParsingConfig.parse(Unknown Source)
at org.apache.xerces.impl.xs.opti.SchemaDOMParser.parse(Unknown Source)
at org.apache.xerces.impl.xs.traversers.XSDHandler.getSchemaDocument(Unknown Source)
at org.apache.xerces.impl.xs.traversers.XSDHandler.parseSchema(Unknown Source)
at org.apache.xerces.impl.xs.XMLSchemaLoader.loadSchema(Unknown Source)
at org.apache.xerces.impl.xs.XMLSchemaValidator.findSchemaGrammar(Unknown Source)
at org.apache.xerces.impl.xs.XMLSchemaValidator.handleStartElement(Unknown Source)
at org.apache.xerces.impl.xs.XMLSchemaValidator.startElement(Unknown Source)
at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown Source)
at org.apache.xerces.impl.XMLNSDocumentScannerImpl$NSContentDispatcher.scanRootElementHook(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at edu.ucsb.nceas.metacat.DocumentImpl.write(DocumentImpl.java:2903)
at edu.ucsb.nceas.metacat.DocumentImpl.write(DocumentImpl.java:2667)
at edu.ucsb.nceas.metacat.DocumentImplWrapper.write(DocumentImplWrapper.java:63)
at edu.ucsb.nceas.metacat.MetacatHandler.handleInsertOrUpdateAction(MetacatHandler.java:1807)
at edu.ucsb.nceas.metacat.dataone.D1NodeService.insertOrUpdateDocument(D1NodeService.java:1385)
at edu.ucsb.nceas.metacat.dataone.D1NodeService.create(D1NodeService.java:455)
at edu.ucsb.nceas.metacat.dataone.MNodeService.create(MNodeService.java:616)
at edu.ucsb.nceas.metacat.restservice.v2.MNResourceHandler.putObject(MNResourceHandler.java:1535)
at edu.ucsb.nceas.metacat.restservice.v2.MNResourceHandler.handle(MNResourceHandler.java:289)
at edu.ucsb.nceas.metacat.restservice.D1RestServlet.doPost(D1RestServlet.java:84)
Attached are the science metadata file and the system metadata file.
History
#1 Updated by Jing Tao over 8 years ago
If we cache the FGDC schema files on Metacat and also change xsi:noNamespaceSchemaLocation="http://www.fgdc.gov/metadata/fgdc-std-001-1998.xsd" to the local value xsi:noNamespaceSchemaLocation="http://valley.duckdns.org/metacat/fgdc-std-001/fgdc-std-001-1998.xsd", it works.
The main FGDC schema file includes other schema files. It seems to me that Xerces somehow can't find them remotely but can find them locally.
#2 Updated by Jing Tao over 8 years ago
I tried upgrading Xerces to version 2.11.0 and setting the property "http://apache.org/xml/properties/schema/external-noNamespaceSchemaLocation" explicitly, but neither worked.
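For reference, setting that property on a SAX parser might look roughly like the sketch below. It is not Metacat's code: the class name, the tiny stand-in schema, and the `validate` helper are made up for illustration, and it assumes the JDK's bundled Xerces-based parser, which recognizes the same feature and property URIs.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class ExternalSchemaLocationSketch {

    static final String EXTERNAL_NO_NS_LOCATION =
        "http://apache.org/xml/properties/schema/external-noNamespaceSchemaLocation";

    // Stand-in schema for illustration only; it is NOT the FGDC schema.
    static final String XSD =
        "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
        + "<xs:element name=\"metadata\"><xs:complexType><xs:sequence>"
        + "<xs:element name=\"title\" type=\"xs:string\"/>"
        + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    /** Validates xml against the schema at schemaUri and returns the validation error messages. */
    static List<String> validate(String xml, String schemaUri) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        // Turn on validation and XML Schema support.
        reader.setFeature("http://xml.org/sax/features/validation", true);
        reader.setFeature("http://apache.org/xml/features/validation/schema", true);
        // Supply the no-namespace schema location from outside the document.
        reader.setProperty(EXTERNAL_NO_NS_LOCATION, schemaUri);

        List<String> errors = new ArrayList<>();
        reader.setErrorHandler(new DefaultHandler() {
            @Override
            public void error(SAXParseException e) {
                errors.add(e.getMessage()); // recoverable validation errors are collected
            }
            // fatalError is inherited from DefaultHandler and rethrows the exception
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return errors;
    }
}
```

If the schema URL itself cannot be fetched as a schema, this property alone would not be expected to help, since the parser still has to retrieve whatever URL it is given.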
#3 Updated by Jing Tao over 8 years ago
- Status changed from New to Closed
- % Done changed from 0 to 100
Matt found that the URL in xsi:noNamespaceSchemaLocation is redirected from http://www.fgdc.gov/metadata/fgdc-std-001-1998.xsd to https://www.fgdc.gov/metadata/fgdc-std-001-1998.xsd. We suspected that Xerces can't download the schema because of the redirection, so we thought changing the value of this attribute from http to https would be a quick fix. Somehow the first test didn't work (I believe we made some mistakes in testing). Chris offered some Xerces validation code, and I modified it a little for testing. I found that the document was valid if the attribute value started with https; it gave the error (White spaces are required between publicId and systemId) if it started with http. So I believe changing http to https should work. I did a fresh installation of Metacat 2.7.2 on my local machine and then used a curl command to successfully create the object with the attribute value starting with https. Then I created another object on dev.nceas:
https://dev.nceas.ucsb.edu/knb/d1/mn/v2/object/test-jing-11
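One possible reading of the http result, not confirmed in this ticket, is that the parser received the HTML body of the redirect response instead of the schema. Java's HttpURLConnection does not follow redirects that switch from http to https. The sketch below parses a hypothetical redirect page as XML to see what error comes back. The page content is an assumption (a typical HTML 2.0 DOCTYPE followed by a short body), not a capture from www.fgdc.gov, and the class and method names are illustrative. That DOCTYPE declaration is 50 characters long, which is at least consistent with "lineNumber: 1; columnNumber: 50" in the trace.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class RedirectBodySketch {

    // Hypothetical redirect page; the real response body was not recorded in this ticket.
    static final String REDIRECT_BODY =
        "<!DOCTYPE HTML PUBLIC \"-//IETF//DTD HTML 2.0//EN\">\n"
        + "<html><head><title>301 Moved Permanently</title></head>\n"
        + "<body><h1>Moved Permanently</h1></body></html>\n";

    /** Parses text as XML and returns the fatal parse error, or null if it parsed cleanly. */
    static SAXParseException parseError(String text) throws Exception {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(text)), new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            return e;
        }
    }

    public static void main(String[] args) throws Exception {
        SAXParseException e = parseError(REDIRECT_BODY);
        // Print the location and message so they can be compared with the trace above.
        System.out.println(e.getLineNumber() + ":" + e.getColumnNumber() + " " + e.getMessage());
    }
}
```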
So I believe we need to notify the operator of SEAD to change the value of xsi:noNamespaceSchemaLocation from http://www.fgdc.gov/metadata/fgdc-std-001-1998.xsd to https://www.fgdc.gov/metadata/fgdc-std-001-1998.xsd.
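If documents ever need to be patched in bulk before submission, the change amounts to a one-attribute rewrite. A minimal sketch follows, with a hypothetical class and method name; it assumes the attribute uses the xsi prefix and double quotes exactly as quoted above, and it leaves any other text untouched.

```java
public class SchemaLocationHttps {

    // Attribute prefix as it appears in the SEAD documents, and its https replacement.
    static final String OLD = "xsi:noNamespaceSchemaLocation=\"http://www.fgdc.gov/";
    static final String NEW = "xsi:noNamespaceSchemaLocation=\"https://www.fgdc.gov/";

    /** Returns the document text with the FGDC schema location switched from http to https. */
    static String useHttps(String xml) {
        return xml.replace(OLD, NEW);
    }

    public static void main(String[] args) {
        String before =
            "<metadata xsi:noNamespaceSchemaLocation=\"http://www.fgdc.gov/metadata/fgdc-std-001-1998.xsd\"/>";
        System.out.println(useHttps(before));
    }
}
```

A plain text replacement is used here rather than an XML parse so that the rest of the document stays byte-for-byte identical; documents that quote the attribute differently would need a more careful approach.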