Story #8749

Fix log aggregation events from the CN without associated CN IPs

Added by Chris Jones over 5 years ago. Updated almost 5 years ago.

Status: New
Priority: Normal
Assignee:
Category: -
Target version: -
Start date: 2018-11-16
Due date:
% Done: 0%
Story Points:
Sprint:

Description

The robots list used to filter out usage events includes the IP addresses of the CNs, so events logged during synchronization don't show up as true hits. Because of the SSL infrastructure at lbl.gov, the ESS-DIVE group doesn't see the public IP of an incoming request, but rather an internal private IP assigned by lbl.gov infrastructure. You can see the impact of this on the ESS-DIVE profile page. The spike of 11,000+ downloads in August 2018 was the CN synchronizing content.

Rushiraj summarized these events in a gist

There are multiple 10.42.x.x IPs associated with the CN requests. These events all need to be updated in the logsolr core and changed to an actual CN IP. For future synchronizations, perhaps we need to add 10.42.0.0/16 to the robots list?
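As a rough illustration of what adding 10.42.0.0/16 to the robots list would cover, here is a minimal shell sketch that tests whether an address falls in that range. The function name `in_cn_private_range` and the sample addresses are hypothetical; this is not how the robots list itself is implemented, only a way to reason about which IPs the /16 would match.

```shell
# Hypothetical helper: succeed (exit 0) when $1 is inside 10.42.0.0/16.
# For a /16, membership is decided by the first two octets, so a prefix
# match is enough. It does not validate that the last two octets are numeric.
in_cn_private_range() {
  case "$1" in
    10.42.*.*) return 0 ;;
    *)         return 1 ;;
  esac
}

# Sample addresses are made up for illustration only.
for ip in 10.42.7.19 10.43.0.1 128.3.0.5; do
  if in_cn_private_range "$ip"; then
    echo "$ip: would be treated as CN/robot traffic"
  else
    echo "$ip: would be counted normally"
  fi
done
```

Note that a range-based rule would also hide any non-CN traffic that lbl.gov maps into 10.42.x.x, so it may be worth combining it with the CN subject check.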

History

#1 Updated by Jing Tao about 5 years ago

Those events should be filtered out anyway, so maybe we can just delete them? The criteria would be that the subject is a CN and the IP address is 10.42.x.x.

#2 Updated by Chris Jones about 5 years ago

I think it's fine to delete them, Jing, since we know they are CN events. They can be deleted from Elasticsearch as well, so ask Rushi or Dave about that if need be.

#3 Updated by Jing Tao about 5 years ago

I ran this query:
curl "http://localhost:8983/solr/event_core/select?q=subject:CN=urn\:node\:CN*%20AND%20ipAddress:10.42.*&fl=subject,ipAddress"

It returned 32642 records.

#4 Updated by Jing Tao about 5 years ago

Proposed three delete commands:

curl http://localhost:8983/solr/event_core/update/?commit=true -H "Content-Type: text/xml" -d "<delete><query>(subject:CN=urn\:node\:CNUCSB1*)AND(ipAddress:10.42*)</query></delete>"

curl http://localhost:8983/solr/event_core/update/?commit=true -H "Content-Type: text/xml" -d "<delete><query>(subject:CN=urn\:node\:CNORC1*)AND(ipAddress:10.42*)</query></delete>"

curl http://localhost:8983/solr/event_core/update/?commit=true -H "Content-Type: text/xml" -d "<delete><query>(subject:CN=urn\:node\:CNUNM1*)AND(ipAddress:10.42*)</query></delete>"

#5 Updated by Jing Tao about 5 years ago

This query curl -d "q=(subject:CN=urn\:node\:CNUCSB1*)AND(ipAddress:10.42*)&fl=subject,ipAddress" http://localhost:8983/solr/event_core/select returns 32462 records;

curl -d "q=(subject:CN=urn\:node\:CNORC1*)AND(ipAddress:10.42*)&fl=subject,ipAddress" http://localhost:8983/solr/event_core/select returns 0 records.

curl -d "q=(subject:CN=urn\:node\:CNUNM*)AND(ipAddress:10.42*)&fl=subject,ipAddress" http://localhost:8983/solr/event_core/select returns 180 records.

So the delete queries will remove a total of 32,642 records. Chris, does that sound like a reasonable number?
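A quick sanity check of the arithmetic, using only the three per-CN counts reported above (the variable names are just labels for this sketch):

```shell
# Per-CN record counts as reported by the three select queries above.
ucsb=32462   # CNUCSB1*
orc=0        # CNORC1*
unm=180      # CNUNM*

total=$((ucsb + orc + unm))
echo "$total"   # prints 32642
```

This matches the 32642 records returned by the combined CN* query in #3.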

#6 Updated by Chris Jones almost 5 years ago

Hi Jing - we discussed this with ESS-DIVE yesterday, and it reminded me of this ticket - sorry for the delayed response.

I wanted to get a sense of how many read events your query entailed, so I issued this query:

curl -d "q=(subject:CN=urn\:node\:CNUCSB1*)AND(ipAddress:10.42*)&rows=0&facet=true&facet.field=event&facet.limit=1000000" http://localhost:8983/solr/event_core/select | xmlstarlet fo

This summarizes the count of each event name, and we get:

<int name="updateSystemMetadata">79878</int>
<int name="read">13937</int>
<int name="synchronization_failed">221</int>
<int name="INSERT">0</int>
<int name="UPDATE">0</int>
<int name="create">0</int>
<int name="delete">0</int>
<int name="replicate">0</int>
<int name="unknown">0</int>
<int name="update">0</int>
<int name="upload">0</int>

So a large part of the query deletes updateSystemMetadata events and it also catches the synchronization_failed events. I don't think we want to delete those events since they are there for reference, but we also don't want them to have the wrong IP address.

To clean this up, I'd say your delete query should be (subject:CN=urn\:node\:CNUCSB1*)AND(ipAddress:10.42*)AND(event:read). Then we probably want to update the remaining Solr documents where subject:CN=urn\:node\:CNUCSB1* and change the IP address to the actual IP address, and do the same for the other CNs' records as well.

#7 Updated by Jing Tao almost 5 years ago

Proposed three delete commands:

curl http://localhost:8983/solr/event_core/update/?commit=true -H "Content-Type: text/xml" -d "<delete><query>(subject:CN=urn\:node\:CNUCSB1*)AND(ipAddress:10.42*)AND(event:read)</query></delete>"

curl http://localhost:8983/solr/event_core/update/?commit=true -H "Content-Type: text/xml" -d "<delete><query>(subject:CN=urn\:node\:CNORC1*)AND(ipAddress:10.42*)AND(event:read)</query></delete>"

curl http://localhost:8983/solr/event_core/update/?commit=true -H "Content-Type: text/xml" -d "<delete><query>(subject:CN=urn\:node\:CNUNM1*)AND(ipAddress:10.42*)AND(event:read)</query></delete>"
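Since the three commands differ only in the CN name, one option is to generate the delete payload from a list and review it before sending anything. This is a dry-run sketch: the function name `delete_payload` is hypothetical, and nothing here contacts Solr.

```shell
# Hypothetical helper: print the delete-by-query XML for one CN name ($1).
# In the printf format, "\\" emits a single backslash, so the output keeps
# the "\:" escapes that the Solr query parser needs.
delete_payload() {
  printf '<delete><query>(subject:CN=urn\\:node\\:%s*)AND(ipAddress:10.42*)AND(event:read)</query></delete>\n' "$1"
}

# Dry run: print each payload for review instead of posting it.
for cn in CNUCSB1 CNORC1 CNUNM1; do
  delete_payload "$cn"
done
```

Once the printed payloads look right, each one could be passed to the same curl command as above via -d "$(delete_payload CNUCSB1)".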

#8 Updated by Jing Tao almost 5 years ago

This page gives some information on how to update a document:
https://solr.pl/en/2012/07/09/solr-4-0-partial-documents-update/
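For reference, a partial (atomic) update in Solr 4+ sends a document containing the unique key plus a "set" modifier for the field to change. The sketch below only builds that JSON body. It assumes the unique key field of event_core is named `id` and that the other fields are stored (atomic updates need that to rebuild the document); neither has been checked against our schema. The function name, the document id, and the 192.0.2.10 address are placeholders, not a real CN IP.

```shell
# Hypothetical helper: build an atomic-update body that sets ipAddress.
#   $1 = unique key of the Solr document (field name "id" is an assumption)
#   $2 = replacement IP address
# No JSON escaping is done, so the arguments must not contain quotes.
atomic_ip_update() {
  printf '[{"id":"%s","ipAddress":{"set":"%s"}}]\n' "$1" "$2"
}

# Placeholder values for illustration only.
atomic_ip_update doc-1 192.0.2.10

# Not run here: the body would be posted to the update handler, e.g.
#   curl "http://localhost:8983/solr/event_core/update?commit=true" \
#     -H "Content-Type: application/json" \
#     -d "$(atomic_ip_update doc-1 192.0.2.10)"
```

Updating by query is not available for atomic updates, so the affected document ids would have to be fetched first with a select query and then updated one by one or in batches.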

#9 Updated by Jing Tao almost 5 years ago

Chris and I used this command to delete those records:
curl http://localhost:8983/solr/event_core/update/?commit=true -H "Content-Type: text/xml" -d "<delete><query>(subject:CN=urn\:node\:CN*)AND(ipAddress:10.42*)AND(event:read)</query></delete>"
