Story #3349: LogAggregation needs to be more fault tolerant - Infrastructure - DataONE Tasks

Updated by Robert Waltz over 11 years ago

A staging test of logAggregation failed after processing only 10% of the records, with the following error:
<pre>
[ERROR] 2012-10-06 05:43:57,221 (LogAggregatorTask:retrieve:289) LogAggregatorTask-urn:node:mnStageUCSB1 <?xml version="1.0" encoding="UTF-8"?>
<error detailCode="0 Client_Error" errorCode="500" name="ServiceFailure">
<description>class javax.net.ssl.SSLPeerUnverifiedException: peer not authenticated</description>
</error>
</pre>

Regardless of the cause, logAggregation needs a more fault-tolerant mechanism for harvesting records.

*Original Algorithm*

b = begin date
e = end date
t = total record count
l = limit of records to retrieve
p = total records processed
<pre>
harvestRecords()
{
    do
    {
        recordstoProcess = queryLogRecords(b, e, l);
        t = recordstoProcess.totalRecordsFound;
        if (recordstoProcess.isNotEmpty())
        {
            process logRecords;
            // maintain the date of the last record processed
            update lastLogRecordProcessedDate;
            p += recordstoProcess.getCount(); // count should be <= l
        }
    } while (p < t);
}
</pre>

First time trial with algorithm from current implementation:
05:53:09 began
22:16:49 (failed due to connection problem!)
=
16:23:40 elapsed (hh:mm:ss) at time of failure: retrieved 575000, count=1000, total=2306641
*By these numbers it would take about 2 3/4 days (roughly 66 hours) to retrieve all 2,306,641 records*
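
The projection above is a linear extrapolation from the failed run. A minimal sketch of the arithmetic, assuming the retrieval rate stays constant for the rest of the harvest (the ticket does not confirm this; the variable names are illustrative only):

```python
from datetime import timedelta

# Figures taken from the failed trial above.
elapsed = timedelta(hours=16, minutes=23, seconds=40)  # 05:53:09 -> 22:16:49
retrieved = 575_000
total = 2_306_641

# Assumption: throughput stays constant over the whole harvest.
rate = retrieved / elapsed.total_seconds()   # records per second
projected_seconds = total / rate
projected_days = projected_seconds / 86_400  # seconds per day

print(f"{rate:.2f} records/s, projected {projected_days:.2f} days")
```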

*Divide and Conquer Algorithm*

b = begin date
e = end date
t = total record count
l = limit insertions

// with recursion, the accumulated recordstoProcess may become a huge memory problem
<pre>
retrieveLogRecords(b, e)
{
    recordstoProcess = queryLogRecords(b, e);
    t = recordstoProcess.total();
    if (t > l && b != e)
    {
        // too many records in this range: split it at the midpoint and recurse
        e^1 = median(b, e);
        recordstoProcess = retrieveLogRecords(b, e^1);
        recordstoProcess += retrieveLogRecords(e^1, e);
    }
    return recordstoProcess;
}
</pre>
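
A runnable sketch of the recursive idea, for illustration only. It assumes integer timestamps, half-open ranges @[b, e)@ so the midpoint is not counted twice, and an in-memory stand-in for @queryLogRecords@. The names @query_log_records@, @STORE@ and @LIMIT@ are hypothetical and not part of the actual implementation.

```python
from typing import List, Tuple

LIMIT = 3  # hypothetical per-query limit, the "l" of the pseudocode
STORE = [1, 2, 2, 3, 5, 8, 8, 8, 9, 13]  # hypothetical log timestamps, sorted


def query_log_records(b: int, e: int) -> Tuple[int, List[int]]:
    """Stand-in for queryLogRecords: total matches in [b, e) and at most LIMIT of them."""
    hits = [ts for ts in STORE if b <= ts < e]
    return len(hits), hits[:LIMIT]


def retrieve_log_records(b: int, e: int) -> List[int]:
    total, page = query_log_records(b, e)
    # Split only while the range holds more than LIMIT records and can still be halved.
    if total > LIMIT and e - b > 1:
        mid = (b + e) // 2
        return retrieve_log_records(b, mid) + retrieve_log_records(mid, e)
    # Note: if one timestamp alone holds more than LIMIT records, the page is truncated.
    return page


result = retrieve_log_records(0, 16)
print(result)
```

Every level of the recursion holds its partial result until the call returns, which is the memory concern noted above.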

// use explicit stacking mechanism and loops instead of recursion
// earliest records should always be pushed on the top of the stack;

b = earliest date to process;
e = now;
l = 5000;
<pre>
harvestRecords()
{
    new DateStack(b, e);
    do
    {
        logRecords = null;
        try_again = false;
        try
        {
            logRecords = retrieveLogRecords();
        } catch (exception TryAgain e) {
            // the range was split and both halves were pushed; pop again
            try_again = true;
        } catch (exception SomethingBadHappened e) {
            // retry X number of times before giving up
            try_again = true;
        }
        if (logRecords != null && logRecords.isNotEmpty())
        {
            process logRecords;
            // maintain the date of the last record processed
            update lastLogRecordProcessedDate;
        }
    } while (DateStack.count > 0 || try_again);
}

retrieveLogRecords()
{
    (b, e) = pop DateStack();
    records = queryLogRecords(b, e);
    t = records.total();
    if (t > l && b != e)
    {
        e^1 = median(b, e);
        // push the later half first so the earlier half ends up on top
        push DateStack(e^1, e);
        push DateStack(b, e^1);
        throw TryAgain;
    }
    else
    {
        return records;
    }
}
</pre>
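
A runnable sketch of the stack-based variant, under stated assumptions rather than as the actual implementation. It uses integer timestamps and half-open ranges @[lo, hi)@. The failing query is simulated with @ConnectionError@. One choice here is mine, not the ticket's: a range whose query fails is pushed back onto the stack so that it is retried rather than lost. @harvest@, @flaky_query@, @LIMIT@ and @MAX_RETRIES@ are hypothetical names.

```python
from typing import Callable, List, Tuple

LIMIT = 3        # hypothetical per-query limit, the "l" of the pseudocode
MAX_RETRIES = 2  # hypothetical "X number of times before giving up"
STORE = [1, 2, 2, 3, 5, 8, 8, 8, 9, 13]  # hypothetical log timestamps, sorted


def harvest(query: Callable[[int, int], Tuple[int, List[int]]], b: int, e: int) -> List[int]:
    stack = [(b, e)]  # date-range stack; the earliest range is always on top
    harvested: List[int] = []
    failures = 0      # consecutive failures
    while stack:
        lo, hi = stack.pop()
        try:
            total, page = query(lo, hi)
        except ConnectionError:
            failures += 1
            if failures > MAX_RETRIES:
                raise
            stack.append((lo, hi))  # assumption: re-push so the range is retried
            continue
        failures = 0
        if total > LIMIT and hi - lo > 1:
            mid = (lo + hi) // 2
            stack.append((mid, hi))  # later half first...
            stack.append((lo, mid))  # ...so the earlier half is popped next
            continue
        harvested.extend(page)  # stands in for "process logRecords"
    return harvested


calls = {"n": 0}


def flaky_query(lo: int, hi: int) -> Tuple[int, List[int]]:
    """Stand-in for queryLogRecords that fails once, on its second call."""
    calls["n"] += 1
    if calls["n"] == 2:
        raise ConnectionError("peer not authenticated")
    hits = [ts for ts in STORE if lo <= ts < hi]
    return len(hits), hits[:LIMIT]


result = harvest(flaky_query, 0, 16)
print(result)
```

Because only one page is held at a time and ranges are popped earliest first, records are processed in date order and a failure costs one range rather than the whole run.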
Stats of current run:

[ INFO] 2012-10-19 20:45:49,558 (DivideAndConquer:getRecords:94) retrieving from start=0
[ INFO] 2012-10-19 22:05:56,186 (DivideAndConquer:doit:80) Total Harvested Log count 575586
=
01:20:07 to retrieve 575586 records thus far

*This shows about a 92% reduction in processing time compared with the 16:23:40 the original algorithm took to retrieve 575000 records*
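
The reduction can be checked from the two elapsed times reported in this ticket. A small sketch, treating the two runs as comparable because they retrieved nearly the same number of records (575000 versus 575586):

```python
from datetime import timedelta

# Elapsed times reported in this ticket.
original = timedelta(hours=16, minutes=23, seconds=40)         # 575000 records
divide_and_conquer = timedelta(hours=1, minutes=20, seconds=7)  # 575586 records

# Dividing one timedelta by another yields a plain float ratio.
reduction = 1 - divide_and_conquer / original

print(f"{reduction:.1%}")
```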
