DataONE Tasks: Issues
https://redmine.dataone.org/ (retrieved 2019-03-14T17:11:26Z)
Infrastructure - Story #8779 (New): ForesiteResourceMap performance issue
https://redmine.dataone.org/issues/8779 | 2019-03-14T17:11:26Z | Rob Nahf <rnahf@epscor.unm.edu>
<p>Profiling reveals that much time is spent in IndexVisibilityDelegate, which is seemingly called twice unnecessarily: first in _init, then again in getAllResourceIDs().</p>
<p>This class in general is not well documented and has some confusing traversal code, so it is difficult to assess what exactly is going on. It also seems to be a misleading encapsulation of data, in that it attempts to filter out resource map members based on current system metadata properties (archived or not), but that's not mentioned at all in the sparse javadocs.</p>
<p>The code needs to be reviewed to make sure no unnecessary calls are made.<br>
If resource map checking (for completeness) is not going to be done anymore, this class probably should be deprecated or removed.</p>
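If the duplicate calls turn out to be real, one low-risk fix would be to memoize the visibility lookup for the lifetime of a single resource map traversal, so that _init and getAllResourceIDs() hit the expensive delegate at most once per identifier. A minimal sketch, assuming hypothetical names (this is not the actual IndexVisibilityDelegate API):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical wrapper: caches per-pid visibility answers so repeated
// traversals ask the underlying (expensive) delegate only once per pid.
class CachedVisibility {
    private final Predicate<String> delegate;           // e.g. a system metadata lookup
    private final Map<String, Boolean> cache = new HashMap<>();
    private int delegateCalls = 0;                      // instrumentation for profiling

    CachedVisibility(Predicate<String> delegate) {
        this.delegate = delegate;
    }

    boolean isVisible(String pid) {
        return cache.computeIfAbsent(pid, p -> {
            delegateCalls++;
            return delegate.test(p);
        });
    }

    int getDelegateCalls() {
        return delegateCalls;
    }
}
```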
Infrastructure - Story #8702 (New): Indexing Refactor Strategy
https://redmine.dataone.org/issues/8702 | 2018-09-21T22:42:48Z | Rob Nahf <rnahf@epscor.unm.edu>
<p>Indexing performs poorly and has some consistency problems.</p>
<p>A solution was developed that addresses the main issues, and involves the creation of a separate solr core for relationships (the resource maps). Initially, the solution will create the separate core as a behind the scenes reference for the main search index. Relationships (resource_map, documents, isDocumentedBy) will still be copied into the main search record.</p>
<p>Additionally, archived objects will not be removed from the index, but the field archived will be added to the schema.</p>
<p>The new logic for processing resource maps and archiving objects should remove many of the inefficient checks that cause records to be reindexed.</p>
<p>The main phases for development will be:</p>
<ol>
<li>refactor out the custom solr client for use of the standard org.apache.solrj-client.<br></li>
<li>migrate the schema to include the archived field and introduce the relationships core. Refactor the resourcemap subprocessor to use it, and trigger relationship tasks.</li>
<li>refactor the delete subprocessor (for archived records) and add the search handler.</li>
</ol>
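For phase 2, the archived flag could be added with a schema.xml fragment along these lines (field attributes are a sketch, not a final schema decision):

```xml
<!-- sketch only: archived objects stay in the index, flagged instead of removed -->
<field name="archived" type="boolean" indexed="true" stored="true" default="false"/>
```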
Infrastructure - Story #8504 (New): Support creation of data citation record from solr record
https://redmine.dataone.org/issues/8504 | 2018-03-19T21:53:13Z | Dave Vieglais <dave.vieglais@gmail.com>
<p>The goal of this story is to ensure that elements in the solr search schema are available and appropriately populated to support generation of DataCite version 4.x or later records.</p>
<p>By ensuring support for this schema, it can also be asserted that suitable citation metadata can be provided in landing pages and other renderings of content provided by DataONE.</p>
<p>Resources:</p>
<ul>
<li><a href="https://schema.datacite.org/meta/kernel-4.1/" class="external">DataCite Schema version 4</a></li>
<li><a href="http://indexer-documentation.readthedocs.io/en/latest/generated/solr_schema.html" class="external">DataONE solr Search fields</a></li>
<li><a href="https://rd-alliance.org/group/data-citation-wg/outcomes/data-citation-recommendation.html" class="external">RDA Data Citation Recommendations</a></li>
</ul>
Infrastructure - Story #8363 (New): indexer shutdown generates index tasks
https://redmine.dataone.org/issues/8363 | 2018-02-12T21:42:22Z | Rob Nahf <rnahf@epscor.unm.edu>
<p>Seen in STAGE: somehow the index processor generated about 15k tasks (after processing 215k tasks over the weekend) during a service stop. It also created about 12.5k failures. Before trying to stop services, this was the status in postgres:</p>
<pre>d1-index-queue=# select status, count(*) from index_task group by status;
status | count
------------+-------
NEW | 5
FAILED | 1659
IN PROCESS | 367
(3 rows)
</pre>
<p>Execution of <code>/etc/init.d/d1-index-task-processor stop</code> timed out.<br>
I performed <code>/etc/init.d/d1-index-task-generator stop</code> successfully, getting an <code>[OK]</code><br>
Then I performed <code>/etc/init.d/d1-processing stop</code> on UCSB, also getting an <code>[OK]</code>.</p>
<p>Examination of the indexing log file a couple of minutes later showed this:</p>
<pre>[ INFO] 2018-02-12 20:36:08,975 (IndexTaskProcessor:logProcessorLoad:245) new tasks:0, tasks previously failed: 1661
[ INFO] 2018-02-12 20:36:09,361 (IndexTaskProcessor:processFailedIndexTaskQueue:226) IndexTaskProcessor.processFailedIndexTaskQueue with size 0
[ WARN] 2018-02-12 20:36:09,361 (IndexTaskProcessorJob:execute:58) processing job [org.dataone.cn.index.processor.IndexTaskProcessorJob@515de84e] finished execution of index task processor [org.dataone.cn.index.processor.IndexTaskProcessor@20621d44]
[ WARN] 2018-02-12 20:36:26,571 (IndexTaskProcessorScheduler:stop:99) stopping index task processor quartz scheduler [org.dataone.cn.index.processor.IndexTaskProcessorScheduler@103bbd22] ...
[ INFO] 2018-02-12 20:36:26,572 (QuartzScheduler:standby:572) Scheduler QuartzScheduler_$_NON_CLUSTERED paused.
[ INFO] 2018-02-12 20:36:26,572 (IndexTaskProcessorScheduler:stop:111) Scheuler.interrupt method can't succeed to interrupt the d1 index job and the static method IndexTaskProcessorJob.interruptCurrent() will be called.
[ WARN] 2018-02-12 20:36:26,572 (IndexTaskProcessorJob:interruptCurrent:92) IndexTaskProcessorJob class [1806183035] interruptCurrent called, shutting down processor [org.dataone.cn.index.processor.IndexTaskProcessor@20621d44]
[ WARN] 2018-02-12 20:36:26,573 (IndexTaskProcessor:shutdownExecutor:952) processor [org.dataone.cn.index.processor.IndexTaskProcessor@20621d44] Shutting down the ExecutorService. Will allow active tasks to finish; will cancel submitted tasks and return them to NEW status, wait for active tasks to finish, then return any remaining task not yet submitted to NEW status....
[ WARN] 2018-02-12 20:36:26,573 (IndexTaskProcessor:shutdownExecutor:955) ...1.) closing ExecutorService to new tasks...
[ WARN] 2018-02-12 20:36:26,574 (IndexTaskProcessor:shutdownExecutor:957) ...2.) cancelling cancellable futures...
[ WARN] 2018-02-12 20:36:26,575 (IndexTaskProcessor:shutdownExecutor:958) ...number of futures: 591344
[ WARN] 2018-02-12 20:36:26,575 (IndexTaskProcessor:shutdownExecutor:959) ... number of tasks in futures map: 591344
</pre>
<p>15 minutes or so later, the log showed this:</p>
<pre>[ WARN] 2018-02-12 20:36:26,573 (IndexTaskProcessor:shutdownExecutor:955) ...1.) closing ExecutorService to new tasks...
[ WARN] 2018-02-12 20:36:26,574 (IndexTaskProcessor:shutdownExecutor:957) ...2.) cancelling cancellable futures...
[ WARN] 2018-02-12 20:36:26,575 (IndexTaskProcessor:shutdownExecutor:958) ...number of futures: 591344
[ WARN] 2018-02-12 20:36:26,575 (IndexTaskProcessor:shutdownExecutor:959) ... number of tasks in futures map: 591344
[ WARN] 2018-02-12 20:52:30,811 (IndexTaskProcessor:shutdownExecutor:988) ...number of (cancellable) runnables/tasks reset to new: 0
[ WARN] 2018-02-12 20:52:30,811 (IndexTaskProcessor:shutdownExecutor:989) ...number of (cancellable) runnables not mapped to tasks: 0
[ WARN] 2018-02-12 20:52:30,811 (IndexTaskProcessor:shutdownExecutor:990) ...number of uncancellable runnables: 591344 (completed or in process)
[ WARN] 2018-02-12 20:52:30,812 (IndexTaskProcessor:shutdownExecutor:993) ...3.) waiting (with timeout) for active futures to finish...
[ WARN] 2018-02-12 20:52:30,812 (IndexTaskProcessor:shutdownExecutor:998) ...4.) Reviewing remaining uncancellables to check for completion, returning incomplete ones to NEW status...
[ WARN] 2018-02-12 20:52:30,835 (IndexTaskProcessor:shutdownExecutor:1026) ...5.) Calling shutdownNow on the executor service.
[ WARN] 2018-02-12 20:52:30,835 (IndexTaskProcessor:shutdownExecutor:1028) ... .... number of runnables still waiting: 0
[ WARN] 2018-02-12 20:52:30,835 (IndexTaskProcessor:shutdownExecutor:1030) ...6.) returning preSubmitted tasks to NEW status...
[ WARN] 2018-02-12 20:52:30,835 (IndexTaskProcessor:shutdownExecutor:1031) ... .... number of preSubmitted tasks: 34735
[ INFO] 2018-02-12 20:52:30,835 (IndexTask:markNew:454) Even tough it was masked new, it is still considered failed for id testGetPackage_2017119234441164 since it was tried to many times.
[ERROR] 2018-02-12 20:52:30,891 (IndexTaskProcessor:shutdownExecutor:1038) ....... Exception thrown trying to return task to NEW status for pid: testGetPackage_2017119234441164
org.springframework.orm.hibernate3.HibernateOptimisticLockingFailureException: Object of class [org.dataone.cn.index.task.IndexTask] with identifier [13071797]: optimistic locking failed; nested exception is org.hibernate.StaleObjectStateException: Row was updated or deleted by another transaction (or unsaved-value mapping was incorrect): [org.dataone.cn.index.task.IndexTask#13071797]
...
[ INFO] 2018-02-12 20:54:19,618 (IndexTask:markNew:454) Even tough it was masked new, it is still considered failed for id P3_201622214921901 since it was tried to many times.
[ WARN] 2018-02-12 20:54:19,621 (IndexTaskProcessor:shutdownExecutor:1036) ... preSubmittedTask for pid P3_201622214921901returned to NEW status.
[ WARN] 2018-02-12 20:54:19,623 (IndexTaskProcessor:shutdownExecutor:1036) ... preSubmittedTask for pid resource_map_doi:10.5065/D6VD6WFPreturned to NEW status.
[ INFO] 2018-02-12 20:54:19,623 (IndexTask:markNew:454) Even tough it was masked new, it is still considered failed for id testGetPackage_NotAuthorized_201710605522454 since it was tried to many times.
[ WARN] 2018-02-12 20:54:19,626 (IndexTaskProcessor:shutdownExecutor:1036) ... preSubmittedTask for pid testGetPackage_NotAuthorized_201710605522454returned to NEW status.
[ WARN] 2018-02-12 20:54:19,628 (IndexTaskProcessor:shutdownExecutor:1036) ... preSubmittedTask for pid resource_map_urn:uuid:d3606ccb-2d50-4723-ae45-c0d01b817e48returned to NEW status.
[ WARN] 2018-02-12 20:54:19,631 (IndexTaskProcessor:shutdownExecutor:1036) ... preSubmittedTask for pid resource_map_doi:10.18739/A2165Freturned to NEW status.
[ WARN] 2018-02-12 20:54:19,631 (IndexTaskProcessor:shutdownExecutor:1041) ............7.) DONE with shutting down IndexTaskProcessor.
[ INFO] 2018-02-12 20:54:19,631 (IndexTaskProcessorScheduler:stop:113) The scheuler.interrupt method seems not interrupt the d1 index job and the static method IndexTaskProcessorJob.interruptCurrent() was called.
[ WARN] 2018-02-12 20:54:19,632 (IndexTaskProcessorScheduler:stop:128) Job scheduler [org.dataone.cn.index.processor.IndexTaskProcessorScheduler@103bbd22] finished executing all jobs. The d1-index-processor shut down sucessfully.============================================
</pre>
<p>but postgres yielded this:</p>
<pre>d1-index-queue=# select status, count(*) from index_task group by status;
status | count
--------+-------
NEW | 15367
FAILED | 14032
(2 rows)
</pre>
<p>Indexer shutdowns are a stubborn problem...</p>
Infrastructure - Story #8307 (New): Check node subject on node registration and subsequent calls
https://redmine.dataone.org/issues/8307 | 2018-02-06T20:04:39Z | Dave Vieglais <dave.vieglais@gmail.com>
<p>The <code>/node/subject</code> entry of the node document should match the subject of the certificate used to register the node (unless the call is being made by a CN certificate).</p>
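A sketch of the intended check, with hypothetical names (the real implementation would pull the subject from the client certificate and the CN subjects from the node registry):

```java
import java.util.Set;

// Sketch of the proposed registration check: the certificate subject must match
// the node document's /node/subject entry, unless the caller is a CN certificate.
class NodeSubjectCheck {
    static boolean isAuthorized(String certSubject, String nodeSubject, Set<String> cnSubjects) {
        if (cnSubjects.contains(certSubject)) {
            return true;                        // CNs may act on behalf of any node
        }
        return certSubject.equals(nodeSubject); // otherwise subjects must match exactly
    }
}
```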
Infrastructure - Story #8234 (New): Use University of Kansas ORCID membership to support authenti...
https://redmine.dataone.org/issues/8234 | 2018-01-09T02:00:28Z | Dave Vieglais <dave.vieglais@gmail.com>
<p><a href="https://orcid.org/members/001G000001CAkZgIAL-university-of-kansas" class="external">KU is a premium ORCID member</a> as a member of the Greater Western Library Alliance (GWLA). As a result, KU has access to five ORCID API keys. One is currently in use for the KU DSpace instance.</p>
<p>Goal of this story is to leverage one of the remaining API keys to support ORCID authentication in the DataONE production environment.</p>
Infrastructure - Story #8227 (In Progress): ExceptionHandler regurgitates long html pages into th...
https://redmine.dataone.org/issues/8227 | 2017-12-13T21:19:23Z | Rob Nahf <rnahf@epscor.unm.edu>
<p>While it is useful to know what was returned in the error response when it was not the correct response, HTML pages can be verbose and include excessive markup that's not useful. Especially when a GMN MN is in debugging mode and there is a systematic error being returned (such as during an authentication issue), these logged HTML pages can end up being 75% of the log files and can cause meaningful log lines to scroll off the end of the log rotation.</p>
<p>An option should be provided to limit the number of characters being returned in the ServiceFailure.</p>
<p>Options are to:<br>
1. eliminate the message body altogether<br>
2. truncate the message body<br>
3. only print the visible parts of the HTML (strip the non-visible markup elements)<br>
4. a combination of 2 and 3</p>
<p>Since this is a new feature, develop it in trunk.</p>
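A rough sketch of option 4, combining stripping and truncation (the helper name is hypothetical, and regex-based tag stripping is only an approximation of "visible parts"):

```java
// Hypothetical helper, not the current ExceptionHandler API: strip markup
// first, then cap the number of characters kept in the ServiceFailure.
class ErrorBodyTrimmer {
    static String trim(String body, int maxChars) {
        // drop script/style blocks and all remaining tags, keeping visible text
        String visible = body
            .replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ")
            .replaceAll("(?s)<[^>]+>", " ")
            .replaceAll("\\s+", " ")
            .trim();
        return visible.length() <= maxChars ? visible
                                            : visible.substring(0, maxChars) + "...";
    }
}
```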
Infrastructure - Story #8173 (New): add checks for retrograde systemMetadata changes
https://redmine.dataone.org/issues/8173 | 2017-09-01T19:42:33Z | Rob Nahf <rnahf@epscor.unm.edu>
<p>With the ability to prioritize tasks and the introduction of parallelized index task processing, the effective queue is no longer guaranteed to be time-ordered. If there are two valid system metadata changes resulting in two tasks and the second change hits the index first, the earlier task should be rejected, as its changes are out of date.</p>
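A sketch of the guard, assuming the comparison is based on dateSysMetadataModified (names are hypothetical):

```java
import java.time.Instant;

// Sketch of the proposed check: reject an index task whose system metadata
// modification date is older than what the index already holds for the pid.
class RetrogradeCheck {
    static boolean shouldApply(Instant taskSysmetaModified, Instant indexedSysmetaModified) {
        if (indexedSysmetaModified == null) {
            return true;                                   // nothing indexed yet
        }
        // strictly-older tasks are out of date; equal timestamps pass through
        return !taskSysmetaModified.isBefore(indexedSysmetaModified);
    }
}
```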
Infrastructure - Story #8172 (In Progress): investigate atomic updates for some solr updates
https://redmine.dataone.org/issues/8172 | 2017-09-01T19:35:25Z | Rob Nahf <rnahf@epscor.unm.edu>
<p>Atomic updates came to solr with v4.0. (We're currently at 5.x)</p>
<p>Atomic updates are supposed to be more efficient, and could help us with the race condition in <a class="issue tracker-5 status-5 priority-4 priority-default closed child" title="Task: Use multiple threads to index objects (Closed)" href="https://redmine.dataone.org/issues/7771">#7771</a>.<br>
(multiple tasks reading a solr record and then modifying it in divergent ways by overwriting existing values).</p>
<p>Atomic add and remove modifiers allow addition and removal of values in multivalued fields, which is where our race conditions arise.</p>
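For illustration, an atomic update touching multivalued relationship fields would be a JSON payload like the following, posted to the core's /update handler (the pids shown are made up; field names follow the search schema):

```json
[
  {
    "id": "urn:pid:science-metadata-1",
    "documents":      { "add":    "urn:pid:data-object-2" },
    "isDocumentedBy": { "remove": "urn:pid:obsolete-map" }
  }
]
```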
Infrastructure - Story #8061 (New): develop queue-based processing system for the CN
https://redmine.dataone.org/issues/8061 | 2017-04-05T22:40:24Z | Rob Nahf <rnahf@epscor.unm.edu>
<p>The event-based mechanism for generating indexing tasks is not robust to network segregation, and it is inefficient because it triggers indexing tasks when system metadata are loaded into the Hazelcast map - these are not "real" events, just data hydration from persistent storage.</p>
<p>Investigate using reliable queues instead. The design should be abstracted so that different implementations can be swapped in at a later date, so use standard messaging patterns.</p>
<p>RabbitMQ, ActiveMQ are potential implementations to use.<br>
ZeroMQ is a lower-level implementation, probably a bit more complicated, but very performant.</p>
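Whatever broker is chosen, the abstraction could be as small as this (hypothetical names; an in-memory implementation doubles as a test fixture):

```java
// Sketch of the abstraction layer: CN processing code depends only on this
// interface, so RabbitMQ, ActiveMQ, or ZeroMQ backends can be swapped in
// later without touching producers or consumers.
interface TaskQueue {
    void publish(String taskPayload);
    String poll();                    // returns null when the queue is empty
}

// In-memory reference implementation, useful for tests.
class InMemoryTaskQueue implements TaskQueue {
    private final java.util.ArrayDeque<String> queue = new java.util.ArrayDeque<>();
    public void publish(String taskPayload) { queue.addLast(taskPayload); }
    public String poll() { return queue.pollFirst(); }
}
```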
Infrastructure - Story #8028 (Rejected): Migrate UNM CN servers to DMZ network
https://redmine.dataone.org/issues/8028 | 2017-02-28T15:42:47Z | Dave Vieglais <dave.vieglais@gmail.com>
<p>UNM now has a DMZ available which will place servers outside of the campus intrusion prevention infrastructure, and so should significantly reduce latency and increase throughput for network activity.</p>
<p>The goal of this story is to migrate all UNM CNs including test instances to the new network.</p>
<p>Network info:<br>
<br>
IP Range: 64.106.84.2/27 (.2 - .8 currently reserved for DataONE, .5 - .8 currently available for CNs)<br>
Gateway: 64.106.84.1<br>
Netmask: 255.255.255.224<br>
Broadcast: 64.106.84.31</p>
<p>To move a VM to the new network:</p>
<ol>
<li> Select IP Address</li>
<li>Update /etc/network/interfaces</li>
<li>Update /etc/hosts</li>
<li>Reconfigure any services that specify IP address, including but not limited to:</li>
</ol>
<p>a) apache<br>
b) UFW<br>
c) Zookeeper<br>
d) LDAP (?)<br>
e) Hazelcast<br>
f) Metacat replication<br>
g) CILogon ?</p>
<p>Note that the other CNs also specify specific IP addresses for connectivity, so it will be necessary to update configurations on those machines as well.</p>
<ol start="5">
<li>Select the DMZ network in VMWare configuration for the VM</li>
<li>Restart the VM</li>
</ol>
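For step 2, the /etc/network/interfaces stanza would look roughly like this, using .5 as the first address listed as available (the interface name may differ per VM):

```
# sketch only: DMZ addressing from the ranges above
auto eth0
iface eth0 inet static
    address 64.106.84.5
    netmask 255.255.255.224
    gateway 64.106.84.1
    broadcast 64.106.84.31
```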
Infrastructure - Story #8025 (Rejected): Review authorization requirements for all DataONE API me...
https://redmine.dataone.org/issues/8025 | 2017-02-20T22:20:27Z | Dave Vieglais <dave.vieglais@gmail.com>
<p>In order to ensure reliable access control to read and alter content it is prudent to periodically review implementations to ensure consistency with design criteria.</p>
<p>The goal of this story is to review CN implementation of all services to ensure access control is implemented as expected. The areas to be covered include:</p>
<ul>
<li>synchronization (new and altered content)</li>
<li>reading system metadata and content</li>
<li>access to index entries</li>
<li>replication</li>
</ul>
Infrastructure - Story #7940 (New): Retrieval of system metadata is too slow
https://redmine.dataone.org/issues/7940 | 2016-11-25T20:48:28Z | Dave Vieglais <dave.vieglais@gmail.com>
<p>Retrieving a system metadata document takes 1-2 seconds in the production environment. Response time improves on subsequent calls, but still takes longer than a second to complete. Since system metadata is critical for many operations, its retrieval should not be an impediment to users. At this rate, a simple single-threaded client may download information about 30 or so objects per minute, or about 1800 per hour. Since some data packages have content on the order of hundreds to thousands of entries, it would take an hour or so simply to iterate over the system metadata for a moderate data package. </p>
<p>The retrieval process should be profiled to identify which portions are inefficient, then those portions addressed where possible.</p>
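As a back-of-the-envelope check of the numbers above (illustrative only):

```java
// At ~2 seconds per getSystemMetadata call, a single-threaded client
// iterating a package of N entries needs N * 2 / 3600 hours.
class SysmetaThroughput {
    static double hoursToIterate(int packageSize, double secondsPerCall) {
        return packageSize * secondsPerCall / 3600.0;
    }
}
```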
Infrastructure - Story #7939 (Rejected): Indexing is too slow, especially with large packages
https://redmine.dataone.org/issues/7939 | 2016-11-25T19:03:24Z | Dave Vieglais <dave.vieglais@gmail.com>
<p>It appears that the indexing process is far too slow to keep up with content additions and changes. Since the version 2.3 upgrade which includes support for multiple indexing threads, the performance appears improved, but it falls far short of what is needed to provide reasonable currency.</p>
<p>In particular, it appears that large resource maps such as those provided by the ARCTIC node are very slow to evaluate.</p>
<p>Some optimization may be possible without major refactoring of the indexing process.</p>
<p>A few possible options:</p>
<ol>
<li><p>Check that changes to properties such as ownership do not trigger an entire re-index of the package. If permissions change, then there is no need to reindex the entire package since other properties are unchanged. This should be in place now since content is immutable, and only mutable metadata fields should be updated.</p></li>
<li><p>Dedicate a single thread to resource map processing, expanding to more threads when there is no backlog of other content. This would allow efficient processing of content on which the resource map indexing may depend.</p></li>
<li><p>Refactor the index so that resource maps may be processed independently, without the need for all other objects to be loaded and processed.</p></li>
<li><p>Refactor the indexing of resource maps so that a partially processed resource map is persisted so that processing may continue as content becomes available rather than starting from scratch each time.</p></li>
</ol>
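Option 1 amounts to a scope decision per change set; a sketch with hypothetical field names (the actual mutable-field list would come from the system metadata model):

```java
import java.util.Set;

// Sketch of option 1: when only mutable system metadata fields changed,
// update just those fields instead of re-indexing the whole package.
class ReindexScope {
    // fields that can change without affecting package structure (illustrative)
    static final Set<String> MUTABLE_ONLY = Set.of("rightsHolder", "accessPolicy", "archived");

    static boolean needsFullReindex(Set<String> changedFields) {
        // any change outside the mutable set implies structural reprocessing
        return !MUTABLE_ONLY.containsAll(changedFields);
    }
}
```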
Infrastructure - Story #7224 (New): push synchronization request status indicator: synchronizeSta...
https://redmine.dataone.org/issues/7224 | 2015-06-18T08:30:42Z | Rob Nahf <rnahf@epscor.unm.edu>
<p>Push synchronization (cn.synchronize, mn.updateSystemMetadata) involves an end user who might want an idea of how long the queued action will take to complete. Something as simple as returning the place in line of the sync request might suffice as the indicator, or a more complete data packet, including the place in line and the queue velocity, could be attempted.</p>
<p>The real-world analogy for this kind of indicator is taking a number at the deli counter: you don't know when you will be served, but you know how many people are in front of you. </p>
<p>This option is a separate call to the CN to check the status of the sync request, so that the current place in line is returned. The advantage of this is that if the velocity of synchronization changes, the interested party can call again and get an updated value - it has more diagnostic and monitoring power. This could lead to over-use, however.</p>
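The deli-counter response could be as simple as this sketch (names are hypothetical; how velocity is measured is left open):

```java
// Sketch of the status payload: place in line plus queue velocity lets a
// client estimate the remaining wait, and re-polling refreshes the estimate.
class SyncStatus {
    final int placeInLine;             // tasks ahead of this request
    final double tasksPerSecond;       // recent queue velocity

    SyncStatus(int placeInLine, double tasksPerSecond) {
        this.placeInLine = placeInLine;
        this.tasksPerSecond = tasksPerSecond;
    }

    // estimated seconds until the request is processed; -1 if the queue is stalled
    double estimatedWaitSeconds() {
        return tasksPerSecond > 0 ? placeInLine / tasksPerSecond : -1;
    }
}
```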