Task #4179

Story #3736: CN Consistency Check and CN Recovery

Testing CN to CN connectivity

Added by Robert Waltz over 10 years ago. Updated over 9 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Chris Brumgard
Category:
performance-scalability
Target version:
-
Start date:
2013-11-18
Due date:
2013-11-23
% Done:

100%

Estimated time:
0.00 h
Milestone:
CCI-1.3
Product Version:
*
Story Points:
Sprint:

Description

We have noticed a number of problems affecting the distributed memory application, Hazelcast. We have also seen problems affecting OpenLDAP's multi-master replication capabilities. Lastly, in the past, large transfers of streaming data through ssh from UNM servers to other servers (such as tailing rapidly changing log files) have tended to seize up the connection (the connection remains open on both ends, but no data appears to pass through).

Given the time it is now taking to solve the consistency issues created by these problems, I would like certain tests to be run between the CNs on the different environments.

Verify routes to CNs

Confirm that the route taken is via Internet2 routers.
The routes between the various CNs of an environment should be examined, e.g. unm->orc, unm->ucsb, ucsb->unm, ucsb->orc, orc->ucsb, orc->unm.
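One way to check the route requirement is to parse the traceroute output and flag whether any hop's reverse-DNS name falls in a known research-backbone domain. A minimal sketch, assuming the domain list and function names below (they are illustrative, not part of this ticket):

```python
import re

# Domains of Internet2-affiliated backbones that appear in the
# traceroutes recorded below; this list is an assumption for illustration.
RESEARCH_NET_DOMAINS = ("nlr.net", "cenic.net", "sox.net")

def hops(traceroute_output):
    """Extract (hostname, ip) pairs from textual `traceroute` output."""
    pattern = re.compile(r"^\s*\d+\s+(\S+)\s+\(([\d.]+)\)", re.MULTILINE)
    return pattern.findall(traceroute_output)

def uses_research_backbone(traceroute_output):
    """True if any hop's name ends in one of the research-network domains."""
    return any(host.endswith(RESEARCH_NET_DOMAINS)
               for host, _ in hops(traceroute_output))
```

Running this over the six pairwise traceroutes would give a quick pass/fail per path; hops that resolve only to bare IPs (as in the ucsb -> unm trace) would need a whois lookup instead.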

Assert that they are of Internet2 bandwidth

Each of the CNs should be able to connect to any other CN at Internet2 speeds. I have never personally witnessed this bandwidth.
Bandwidth should be examined between the various CNs of an environment, e.g. unm->orc, unm->ucsb, ucsb->unm, ucsb->orc, orc->ucsb, orc->unm.
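The pairwise bandwidth measurements below were produced with iperf; a sketch of how the six directed tests could be enumerated is given here. The hostnames are placeholders, not the real CN machine names:

```python
from itertools import permutations

# Hypothetical CN hostnames -- placeholders for illustration only.
CNS = {"unm": "cn-unm.example.org",
       "ucsb": "cn-ucsb.example.org",
       "orc": "cn-orc.example.org"}

def iperf_udp_command(dst_host, rate="1000M", seconds=30):
    """iperf 2.x UDP client invocation: blast at `rate` for `seconds`."""
    return ["iperf", "-c", dst_host, "-u", "-b", rate, "-t", str(seconds)]

# One command per directed pair, matching the matrix in the ticket.
commands = {(src, dst): iperf_udp_command(CNS[dst])
            for src, dst in permutations(CNS, 2)}
```

Each command would be run with `subprocess` on the source CN against an `iperf -s -u` server on the destination; the 30-second duration matches the "30 sec blast tests" reported below.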

Verify consistency of data transferred over the routes.

Dropped packet counts
Multiple large document transfers over a long span of time.

History

#1 Updated by Chris Jones over 10 years ago

  • Target version deleted (2013.46-Block.6.2)

#2 Updated by Chris Jones over 10 years ago

  • Parent task set to #3736

#3 Updated by Chris Brumgard over 10 years ago

  • Estimated time set to 0.00
  • Status changed from New to In Progress

Traceroute results.

orc —> ucsb
1 chm01-vlan92-vrrp1.ns.utk.edu (160.36.13.131) 12.833 ms 12.898 ms 13.038 ms
2 10.8.2.18 (10.8.2.18) 5.311 ms 5.443 ms 5.499 ms
3 bsm01v20.ns.utk.edu (160.36.128.133) 0.727 ms 0.792 ms 0.674 ms
4 bhm01ge3-3.ns.utk.edu (160.36.2.74) 69.491 ms 69.600 ms 69.486 ms
5 sox-atlanta2.ns.utk.edu (160.36.128.158) 7.532 ms 7.489 ms 7.570 ms
6 nlr-to-slr.sox.net (143.215.193.5) 6.780 ms 6.415 ms 6.500 ms
7 vlan-53.jack.layer2.nlr.net (216.24.186.55) 62.905 ms 64.146 ms 64.135 ms
8 vlan-51.hous.layer2.nlr.net (216.24.186.78) 62.938 ms 62.723 ms 62.811 ms
9 vlan-47.elpa.layer2.nlr.net (216.24.186.74) 62.964 ms 63.072 ms 63.056 ms
10 vlan-43.losa.layer2.nlr.net (216.24.186.73) 62.705 ms 63.990 ms 63.431 ms
11 hpr-lax-hpr2--nlr-pn.cenic.net (137.164.26.25) 63.342 ms 63.208 ms 63.223 ms
12 ucsb--lax-hpr2-10ge.cenic.net (137.164.26.6) 65.535 ms 65.508 ms 65.601 ms
13 r2--r1--1.commserv.ucsb.edu (128.111.252.169) 65.501 ms 65.611 ms 65.492 ms
14 574-c--r2--2.commserv.ucsb.edu (128.111.252.149) 66.774 ms 66.157 ms 66.201 ms
15 535-c-v1071.noc.ucsb.edu (128.111.4.51) 67.948 ms 66.952 ms 66.306 ms

orc —> unm
1 chm01-vlan92-vrrp1.ns.utk.edu (160.36.13.131) 0.760 ms 0.896 ms 1.043 ms
2 10.8.2.18 (10.8.2.18) 1.351 ms 1.516 ms 1.593 ms
3 bsm01v20.ns.utk.edu (160.36.128.133) 0.929 ms 1.096 ms 1.014 ms
4 bhm01ge3-3.ns.utk.edu (160.36.2.74) 0.706 ms 0.760 ms 0.653 ms
5 sox-atlanta2.ns.utk.edu (160.36.128.158) 6.540 ms 6.543 ms 6.537 ms
6 nlr-to-slr.sox.net (143.215.193.5) 6.548 ms 6.720 ms 6.780 ms
7 vlan-53.jack.layer2.nlr.net (216.24.186.55) 51.863 ms 51.668 ms 51.838 ms
8 vlan-51.hous.layer2.nlr.net (216.24.186.78) 51.836 ms 51.833 ms 51.855 ms
9 vlan-47.elpa.layer2.nlr.net (216.24.186.74) 51.799 ms 51.833 ms 51.790 ms
10 vlan-45.albu.layer2.nlr.net (216.24.186.50) 147.476 ms 147.556 ms 147.603 ms
11 129.24.198.105 (129.24.198.105) 51.570 ms 51.524 ms 51.527 ms
12 129.24.212.35 (129.24.212.35) 52.200 ms 52.371 ms 52.345 ms
13 129.24.192.26 (129.24.192.26) 53.545 ms 53.531 ms 53.521 ms

ucsb —> orc
1 128.111.54.65 (128.111.54.65) 0.552 ms 0.524 ms 0.669 ms
2 574-c-v1071.noc.ucsb.edu (128.111.4.52) 0.523 ms 0.586 ms 0.676 ms
3 r2--574-c--2.commserv.ucsb.edu (128.111.252.148) 0.570 ms 0.559 ms 0.592 ms
4 r1--r2--1.commserv.ucsb.edu (128.111.252.168) 0.554 ms 0.551 ms 0.582 ms
5 lax-hpr2--ucsb-10ge.cenic.net (137.164.26.5) 3.242 ms 3.211 ms 3.142 ms
6 hpr-nlr-pn--lax-hpr2.cenic.net (137.164.26.26) 3.151 ms 3.289 ms 3.214 ms
7 vlan-43.elpa.layer2.nlr.net (216.24.186.72) 59.543 ms 59.564 ms 59.552 ms
8 vlan-47.hous.layer2.nlr.net (216.24.186.75) 59.690 ms 59.611 ms 59.784 ms
9 vlan-51.jack.layer2.nlr.net (216.24.186.79) 59.741 ms 59.733 ms 59.697 ms
10 vlan-53.atla.layer2.nlr.net (216.24.186.54) 59.753 ms 59.722 ms 59.807 ms
11 slr-to-nlr.sox.net (143.215.193.6) 59.561 ms 59.519 ms 60.237 ms
12 sox-atlanta1.ns.utk.edu (160.36.128.157) 65.357 ms 65.383 ms 65.342 ms
13 fhm01v18.ns.utk.edu (160.36.2.38) 66.239 ms 66.296 ms 66.265 ms

ucsb —> unm
1 128.111.54.65 (128.111.54.65) 0.500 ms 0.505 ms 0.786 ms
2 574-c-v1071.noc.ucsb.edu (128.111.4.52) 0.438 ms 0.540 ms 0.638 ms
3 r2--574-c--2.commserv.ucsb.edu (128.111.252.148) 0.658 ms 0.657 ms 0.676 ms
4 r1--r2--1.commserv.ucsb.edu (128.111.252.168) 0.684 ms 0.748 ms 0.673 ms
5 lax-hpr2--ucsb-10ge.cenic.net (137.164.26.5) 3.124 ms 3.102 ms 3.085 ms
6 208.77.77.5 (208.77.77.5) 24.556 ms 201.506 ms 201.464 ms
7 208.77.76.149 (208.77.76.149) 24.187 ms 24.194 ms 24.834 ms
8 129.24.212.35 (129.24.212.35) 25.062 ms 25.171 ms 25.031 ms
9 bldg116-0020.unm.edu (129.24.192.30) 25.789 ms 25.781 ms 25.829 ms

unm —> orc
Incomplete

unm —> ucsb
Incomplete

UDP bandwidth (30 sec blast tests)
———————
orc —> ucsb: 0.0-30.3 sec 53.5 MBytes 14.8 Mbits/sec 15.608 ms 114291/152430 (75%)
orc —> unm: 0.0-30.0 sec 213 MBytes 59.5 Mbits/sec
ucsb —> orc: 0.0-30.2 sec 2.91 GBytes 825 Mbits/sec 15.513 ms 1150598/3273464 (35%)
ucsb —> unm: 0.0-30.0 sec 5.28 GBytes 1.51 Gbits/sec
unm —> orc: 0.0-30.2 sec 3.27 GBytes 927 Mbits/sec 13.630 ms 2708933/5094083 (53%)
unm —> ucsb: 0.0-30.2 sec 1.27 GBytes 361 Mbits/sec 14.781 ms 4318368/5245940 (82%)
(Columns: interval, transfer, bandwidth, jitter, lost/total datagrams and loss %.)
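The loss percentages iperf prints are simply lost datagrams over total datagrams sent, which can be verified against the lost/total columns above:

```python
def loss_pct(lost, total):
    """Datagram loss as a percentage, as iperf reports it."""
    return 100.0 * lost / total

# orc -> ucsb figures from the table: 114291 of 152430 datagrams lost.
print(round(loss_pct(114291, 152430)))  # prints 75, matching "(75%)"
```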

TCP bandwidth tests
—————————————
orc —> ucsb: 0.0-30.0 sec 610 MBytes 171 Mbits/sec
orc —> unm:
ucsb —> orc: 0.0-60.0 sec 2.07 GBytes 297 Mbits/sec
ucsb —> unm:
unm —> orc: 0.0-60.1 sec 215 MBytes 30.0 Mbits/sec
unm —> ucsb: 0.0-60.1 sec 225 MBytes 31.4 Mbits/sec

Potential Packet loss

Running UDP iperf tests with the target rate set to the measured TCP bandwidth, the corresponding UDP packet loss is:
orc —> ucsb: 72%
ucsb —> orc: 6.5e-05%
unm —> orc: 0.038%
unm —> ucsb: 0.034%

#4 Updated by Dave Vieglais over 10 years ago

Seems odd that traceroute didn't complete from UNM to other locations. I would have expected the reverse because UNM blocks ICMP at the perimeter.

Are the results generated by a script? If so, I'd like to see the results from different times of the day, perhaps every 4 hours or so.

#5 Updated by Chris Brumgard over 10 years ago

Dave, I agree that it seems weird that they blocked outgoing ICMP. The results weren't generated by a script, but I wrote one today. It does a traceroute and an iperf run. It also runs tcpdump to watch the iperf activity and then tshark to check the TCP packet-loss diagnostics. It runs from each site to the other sites once an hour.

#6 Updated by Chris Brumgard over 10 years ago

By the way, I should mention that I tried all 3 traceroute methods (ICMP, UDP, and TCP) and none of them succeeded for UNM. I also can't use ICMP ping to reach unm beyond the first unm hop, so I think they're just blocking most ICMP traffic.

#7 Updated by Robert Waltz about 10 years ago

  • Target version set to 2014.2-Block.1.1

#8 Updated by Robert Waltz about 10 years ago

  • Target version deleted (2014.2-Block.1.1)

#9 Updated by Robert Waltz over 9 years ago

  • Status changed from In Progress to Closed
