Task #5988

Determine cause of ldap "server not responding" errors between prod CNs

Added by David Doyle about 8 years ago. Updated about 8 years ago.

Status: New
Priority: Normal
Assignee:
Category: Hardware
Target version: -
Start date: 2014-07-22
Due date:
% Done: 0%
Milestone: None
Product Version:
Story Points:
Sprint:

Description

Check_MK is reporting errors like the following on prod environment:

Host: cn-ucsb-1.dataone.org
Alias: cn-ucsb-1.dataone.org
Address: 128.111.54.80
Service: LDAP cn-UNM-1.dataone.org/389
State: CRITICAL -> CRITICAL (PROBLEM)
Command: check_mk-ldap
Output: CRIT - server not responding

Perfdata:

Cjones reports that he has seen these errors from all three CNs referring to all three CNs.
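
For anyone who wants to reproduce the symptom outside Check_MK, a plain TCP connect to port 389 is a rough stand-in. This is only a sketch: it does not perform an LDAP bind, so it is not equivalent to whatever the check_mk-ldap check does internally, and the function name `ldap_port_reachable` and the 3-second timeout are my own choices. The hostname `cn-orc-1.dataone.org` in the usage comment is assumed by analogy with the other two CNs.

```python
import socket


def ldap_port_reachable(host, port=389, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: this only tests that the port accepts connections.
    It does not speak LDAP, so a True result does not prove the directory
    service itself is healthy.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS lookup failures.
        return False


# Example usage, run from each prod CN in turn:
#   for cn in ("cn-ucsb-1.dataone.org", "cn-unm-1.dataone.org",
#              "cn-orc-1.dataone.org"):  # last hostname is assumed
#       print(cn, ldap_port_reachable(cn))
```

A firewall that silently drops packets will show up here as a timeout rather than an immediate refusal, which would be consistent with a "server not responding" report.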

Will get on Check_MK shortly to see what I see on that end, but in the meantime, I logged into the three prod CNs to check which addresses port 389 is open to on each CN.

cn-orc-1:

389 ALLOW 160.36.13.150
389 ALLOW 127.0.0.1
389 ALLOW 64.106.40.6
389 ALLOW 160.36.13.153

This doesn't look right. In order, these are itself (160.36.13.150), itself (127.0.0.1), cn-unm-1 (64.106.40.6, which interestingly does not show up in nslookup and cannot be pinged from cn-orc-1), and cn-dev-orc-1 (160.36.13.153).

cn-ucsb-1:

Ufw reports no entries for port 389.

cn-unm-1:

389 ALLOW 64.106.40.6
389 ALLOW 160.36.13.150

Itself (64.106.40.6) and cn-orc-1 (160.36.13.150). No entry for cn-ucsb-1.

Unless some fancy port forwarding tricks are happening on prod, these look like pretty glaring discrepancies. Will discuss with coredev as soon as a quorum is available to do so.
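
To make the gaps explicit, the comparison above can be sketched as a small script. The allow lists are copied from the ufw output in this ticket. The address for cn-ucsb-1 is taken from the Check_MK alert, and cn-orc-1 is assumed to be 160.36.13.150, as in the cn-unm-1 rules. The names `PROD_CNS`, `OBSERVED_389_ALLOW` and `missing_peer_rules` are made up for illustration.

```python
# Prod CN addresses as they appear in this ticket.
# Assumption: cn-orc-1 is 160.36.13.150.
PROD_CNS = {
    "cn-orc-1": "160.36.13.150",
    "cn-ucsb-1": "128.111.54.80",
    "cn-unm-1": "64.106.40.6",
}

# Source addresses allowed on port 389, per the ufw output on each CN.
OBSERVED_389_ALLOW = {
    "cn-orc-1": {"160.36.13.150", "127.0.0.1", "64.106.40.6", "160.36.13.153"},
    "cn-ucsb-1": set(),
    "cn-unm-1": {"64.106.40.6", "160.36.13.150"},
}


def missing_peer_rules(peers, observed):
    """Map each host to the sorted list of peer CNs it does not allow on 389."""
    missing = {}
    for host in peers:
        allowed = observed.get(host, set())
        gaps = sorted(
            name for name, ip in peers.items()
            if name != host and ip not in allowed
        )
        if gaps:
            missing[host] = gaps
    return missing


for host, gaps in sorted(missing_peer_rules(PROD_CNS, OBSERVED_389_ALLOW).items()):
    print(f"{host}: no port 389 rule for {', '.join(gaps)}")
```

With the data above, this reports that cn-orc-1 and cn-unm-1 each lack a rule for cn-ucsb-1, and that cn-ucsb-1 lacks rules for both of the other CNs.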

History

#1 Updated by David Doyle about 8 years ago

Added entries to ufw for prod CNs as needed to allow prod CNs to contact each other on port 389. While I was doing that, check_MK began sending out "server is responding" service recovery emails.

Going to reassign this to Jing to check over prod CN build/upgrade scripts and procedures to ensure that port 389 is opened correctly during buildouts and OS upgrades.
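
As a starting point for the buildout review, here is a rough sketch of generating the peer rules per CN. The exact ufw commands used for the fix were not recorded in this ticket, so the rule form below (`ufw allow from <ip> to any port 389 proto tcp`) is an assumption, as is the helper name `ufw_allow_commands`. The IPs are the ones listed in the description, with cn-orc-1 assumed to be 160.36.13.150. The script only prints commands; it does not run them.

```python
# Prod CN addresses as listed in this ticket.
# Assumption: cn-orc-1 is 160.36.13.150.
PROD_CNS = {
    "cn-orc-1": "160.36.13.150",
    "cn-ucsb-1": "128.111.54.80",
    "cn-unm-1": "64.106.40.6",
}


def ufw_allow_commands(host, peers, port=389):
    """Build ufw commands that let every other CN reach `host` on `port`.

    Hypothetical helper: restricting to TCP is an assumption, since the
    existing rules in this ticket do not show a protocol.
    """
    return [
        f"ufw allow from {ip} to any port {port} proto tcp"
        for name, ip in sorted(peers.items())
        if name != host
    ]


for cn in sorted(PROD_CNS):
    print(f"# on {cn}:")
    for cmd in ufw_allow_commands(cn, PROD_CNS):
        print(cmd)
```

Keeping the CN list in one place means a build or upgrade script can regenerate the same rules on every node, rather than relying on whatever was entered by hand.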

#2 Updated by David Doyle about 8 years ago

  • Project changed from Infrastructure Administration to Infrastructure
  • Category changed from ORC - general to Hardware
  • Assignee changed from David Doyle to Jing Tao
  • Milestone set to None
