Story #7716: How to facilitate resubmission of sync Failures? - Infrastructure - DataONE Tasks

Added by Jing Tao over 8 years ago. Updated almost 7 years ago.

Status: In Progress
Priority: Normal
Assignee: Dave Vieglais
Category: d1_synchronization
Target version: CCI-2.4.0
Start date: 2016-05-17
Due date:
% Done: 30%
Story Points:
Sprint: Infrastructure backlog

Description

Currently, when CNs harvest objects from MNs, they compare only the last-modified timestamp in the system metadata with the last harvest date. If the last-modified timestamp is earlier than the last harvest date, the object is not put on the synchronization queue.

This causes a problem: if synchronization of an object failed in a previous run and the cause of the failure has since been corrected (e.g. a misconfiguration on the MN, in which case the last-modified timestamp should NOT change), the harvest will not pick up the object. We have to either manually reset the harvest date or submit a v2.CN.synchronize(PID,NodeID) request in order to pick up the object.
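
To make the gap concrete, here is a minimal sketch of the comparison described above and of one possible variant that also consults a set of known failures. The function names, the `failed_pids` set, and the sample dates are hypothetical and are not taken from the CN code; how the equal-timestamp case is handled is also an assumption.

```python
from datetime import datetime, timezone


def needs_sync(last_modified: datetime, last_harvest: datetime) -> bool:
    """Harvest rule as described: only objects modified after the last
    harvest are queued (treatment of equal timestamps is assumed)."""
    return last_modified > last_harvest


def needs_sync_with_failures(pid, last_modified, last_harvest, failed_pids):
    """Hypothetical variant: also queue objects with a recorded sync failure."""
    return needs_sync(last_modified, last_harvest) or pid in failed_pids


# An object that failed earlier and whose timestamp has not changed since.
last_harvest = datetime(2016, 5, 17, tzinfo=timezone.utc)
last_modified = datetime(2016, 5, 1, tzinfo=timezone.utc)

skipped = not needs_sync(last_modified, last_harvest)
retried = needs_sync_with_failures("pid-1", last_modified, last_harvest, {"pid-1"})
```

With the timestamp rule alone the object is skipped; with a failure record available it would be queued again without touching the harvest date.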

MemberNode operators are not well versed in how to submit these requests, so some sort of monitoring and management tool for MemberNode operators (and DataONE administrators) is proposed to simplify this type of action.

We are considering persisting information about the failed objects in a database table, e.g. failedSynchronization, containing:
identifier, mn_id, last_failed_time, message.

The message could be moved to a separate table and referenced through a foreign key.
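
As a rough illustration of the proposed table, the sketch below uses an in-memory SQLite database. The column names follow the list above; the `failure_message` table name, the column types, the composite primary key, and the `record_failure` helper are assumptions made for the example, not a decided design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
CREATE TABLE failure_message (
    id      INTEGER PRIMARY KEY,
    message TEXT NOT NULL UNIQUE
);
CREATE TABLE failedSynchronization (
    identifier       TEXT NOT NULL,
    mn_id            TEXT NOT NULL,
    last_failed_time TEXT NOT NULL,   -- ISO 8601 timestamp
    message_id       INTEGER NOT NULL REFERENCES failure_message(id),
    PRIMARY KEY (identifier, mn_id)   -- assumed: one row per object per MN
);
""")


def record_failure(conn, identifier, mn_id, failed_time, message):
    """Store the message once, then insert or overwrite the failure row."""
    conn.execute(
        "INSERT OR IGNORE INTO failure_message (message) VALUES (?)", (message,)
    )
    (message_id,) = conn.execute(
        "SELECT id FROM failure_message WHERE message = ?", (message,)
    ).fetchone()
    conn.execute(
        "INSERT OR REPLACE INTO failedSynchronization VALUES (?, ?, ?, ?)",
        (identifier, mn_id, failed_time, message_id),
    )
    conn.commit()


# Two failures of the same object with the same (made-up) message.
record_failure(conn, "pid-1", "urn:node:mnExample", "2016-05-17T00:00:00Z", "example failure")
record_failure(conn, "pid-1", "urn:node:mnExample", "2016-05-18T00:00:00Z", "example failure")

failure_rows = conn.execute("SELECT COUNT(*) FROM failedSynchronization").fetchone()[0]
message_rows = conn.execute("SELECT COUNT(*) FROM failure_message").fetchone()[0]
latest = conn.execute("SELECT last_failed_time FROM failedSynchronization").fetchone()[0]
```

Keeping messages in their own table means a failure that repeats across many objects stores its text only once.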

Once we have the failure information, we can work on the monitoring and management tool.

History

#1 Updated by Rob Nahf over 8 years ago

Category set to d1_synchronization
Description updated
Subject changed from Tracking objects that failed during synchronization to Addressing synchronization failures

I also propose adding the CN nodeId for when we redistribute synchronization to all CNs.

We will also need a record of resolution so that MN operators can determine how many (and which) objects are still unsynchronized.

The format of the table should be actionable by any automated auditing process, and be able to be analyzed by a tool like Kibana (https://www.elastic.co/products/kibana). That's to say, it might best be log-oriented.
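
One way to read "log-oriented" is an append-only stream of JSON events, one per line, from which the set of still-unsynchronized objects can be derived by replaying the log. The sketch below assumes that reading; the field names (`event`, `cn_id`, `time`), the node identifiers, and the `unresolved` helper are all hypothetical.

```python
import json


def unresolved(log_lines):
    """Replay events in order and return the (mn_id, identifier) pairs
    whose most recent event is a failure."""
    state = {}
    for line in log_lines:
        event = json.loads(line)
        state[(event["mn_id"], event["identifier"])] = event["event"]
    return sorted(key for key, value in state.items() if value == "failed")


# Hypothetical log: pid-1 fails and is later resolved, pid-2 is still failing.
log = [
    json.dumps({"time": "2016-05-17T00:00:00Z", "event": "failed",
                "identifier": "pid-1", "mn_id": "urn:node:mnExample",
                "cn_id": "urn:node:cnExample", "message": "example failure"}),
    json.dumps({"time": "2016-05-17T01:00:00Z", "event": "failed",
                "identifier": "pid-2", "mn_id": "urn:node:mnExample",
                "cn_id": "urn:node:cnExample", "message": "example failure"}),
    json.dumps({"time": "2016-05-18T00:00:00Z", "event": "resolved",
                "identifier": "pid-1", "mn_id": "urn:node:mnExample",
                "cn_id": "urn:node:cnExample", "message": ""}),
]

still_failing = unresolved(log)
```

Because each line is a self-contained JSON document, the same stream could be read by an auditing script or indexed by a log-analysis tool.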

#2 Updated by Rob Nahf over 8 years ago

It is somewhat likely that synchronization will transition to a processing system based on multiple queues, in which case putting synchronizationFailed items on a separate queue would make sense and would leave open the opportunity for daily processing of failed items.

A system like that would need to make sure that TransferObjectTask removes an item from the syncFailed queue when it starts to sync it (an item could be on both queues).
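
The interaction between the two queues might look like the following in-memory sketch. The `SyncQueues` class and its method names are hypothetical stand-ins; a real deployment would presumably use the CN's distributed queues, and `start_transfer` only models the step that TransferObjectTask would need to perform.

```python
from collections import deque


class SyncQueues:
    """Toy model of a main sync queue plus a separate failed-items queue."""

    def __init__(self):
        self.sync_queue = deque()   # (pid, node_id) pairs waiting to sync
        self.sync_failed = {}       # pid -> node_id, in insertion order

    def mark_failed(self, pid, node_id):
        """Record that synchronization of pid failed."""
        self.sync_failed[pid] = node_id

    def requeue_failed(self):
        """Periodic (e.g. daily) pass: resubmit failed items to the main
        queue. They stay on the failed queue until a transfer starts,
        so an item can be on both queues at once."""
        for pid, node_id in self.sync_failed.items():
            self.sync_queue.append((pid, node_id))

    def start_transfer(self, pid):
        """Called when a transfer task begins syncing pid: remove it from
        the failed queue so it is not resubmitted while in flight."""
        self.sync_failed.pop(pid, None)


queues = SyncQueues()
queues.mark_failed("pid-1", "urn:node:mnExample")
queues.requeue_failed()
on_both = "pid-1" in queues.sync_failed and len(queues.sync_queue) == 1

pid, node_id = queues.sync_queue.popleft()
queues.start_transfer(pid)
```

If the retried transfer fails again, the task would call `mark_failed` once more, which puts the item back on the failed queue for the next pass.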