Task #8617
Updated by Chris Jones over 6 years ago
While working on on-boarding ESS-DIVE in the Stage environment, we found that `DATA` objects were both sync'ing fine (their `sysmeta`) to the CNs, and they were indexing fine on the CNs. However, METADATA and RESOURCE object types were only sync'ing fine. They were failing to index. Looking at the indexer log files on `cn-stage-orc-1`, I saw errors where were the indexer could not find the object (science metadata and resource map documents) on disk because there was no path to the object in the `hzObjectPaths` map.
I confirmed that:
1) There was no pid-to-docid entry in the PostgreSQL `identifier` table
2) There was no file on disk for these objects in `/var/metacat/{data,documents}`
3) That the system metadata for each document *did* exist in the PostgreSQL `systemmetadata` table
Since `sysmeta` replicates via Hazelcast and is persisted on all CNs, this didn't surprise me. But science metadata nad resource maps get replicated via the CN-to-CN Metacat replication service.
Looking at the replication.log files, the CNs were unable to create SSL connections among themselves, and replication wasn't working since May 18th, 2018. After trying this manually, I confirmed this didn't work.
After looking at the certificate key and cert files on each CN in `/etc/dsataone/client/{certs,private}`, I saw that the `cn-stage-{orc,ucsb,unm}-1.test.dataone.org.{pem,key}` files had the wrong subject in them. For instance:
```
subject= /DC=org/DC=dataone/CN=urn:node:cnStageORC1
notBefore=May 14 16:52:20 2018 GMT
notAfter=May 13 16:52:20 2021 GMT
```
This is the subject for the client certificate, not for the server certificate that is signed by the DataONE Test Intermediate CA, whic for instance is:
```
subject= /DC=org/DC=dataone/CN=cn-stage-orc-1.test.dataone.org
notBefore=Oct 13 22:12:57 2015 GMT
notAfter=Oct 12 22:12:57 2018 GMT
```
This file was updated in May on all three CNs, and the old certificate was moved to a `.bak` version. The `.bak` version still contained the un-expired certificate (expires in Oct 2018), which has the correct subject string in the certificate (the FQDN of the hostname).
I moved the `.bak` certificates back to the current active name. After forcing CN-to-CN replication via the Metacat Admin interface, all three CNs are now replicating to each other again, and are making their way through the backlog.
I think someone mistakenly copied the client certificate file to the server cert files used for Metacat, LDAP, and PostgreSQL replication when the client certificates were updated in may. This is likely just a procedure documentation issue. We'll have to look out for this as the certificates expire in the next few months.
I confirmed that:
1) There was no pid-to-docid entry in the PostgreSQL `identifier` table
2) There was no file on disk for these objects in `/var/metacat/{data,documents}`
3) That the system metadata for each document *did* exist in the PostgreSQL `systemmetadata` table
Since `sysmeta` replicates via Hazelcast and is persisted on all CNs, this didn't surprise me. But science metadata nad resource maps get replicated via the CN-to-CN Metacat replication service.
Looking at the replication.log files, the CNs were unable to create SSL connections among themselves, and replication wasn't working since May 18th, 2018. After trying this manually, I confirmed this didn't work.
After looking at the certificate key and cert files on each CN in `/etc/dsataone/client/{certs,private}`, I saw that the `cn-stage-{orc,ucsb,unm}-1.test.dataone.org.{pem,key}` files had the wrong subject in them. For instance:
```
subject= /DC=org/DC=dataone/CN=urn:node:cnStageORC1
notBefore=May 14 16:52:20 2018 GMT
notAfter=May 13 16:52:20 2021 GMT
```
This is the subject for the client certificate, not for the server certificate that is signed by the DataONE Test Intermediate CA, whic for instance is:
```
subject= /DC=org/DC=dataone/CN=cn-stage-orc-1.test.dataone.org
notBefore=Oct 13 22:12:57 2015 GMT
notAfter=Oct 12 22:12:57 2018 GMT
```
This file was updated in May on all three CNs, and the old certificate was moved to a `.bak` version. The `.bak` version still contained the un-expired certificate (expires in Oct 2018), which has the correct subject string in the certificate (the FQDN of the hostname).
I moved the `.bak` certificates back to the current active name. After forcing CN-to-CN replication via the Metacat Admin interface, all three CNs are now replicating to each other again, and are making their way through the backlog.
I think someone mistakenly copied the client certificate file to the server cert files used for Metacat, LDAP, and PostgreSQL replication when the client certificates were updated in may. This is likely just a procedure documentation issue. We'll have to look out for this as the certificates expire in the next few months.