Task #5988
Determine cause of ldap "server not responding" errors between prod CNs
0%
Description
Check_MK is reporting errors like the following on prod environment:
Host: cn-ucsb-1.dataone.org
Alias: cn-ucsb-1.dataone.org
Address: 128.111.54.80
Service: LDAP cn-UNM-1.dataone.org/389
State: CRITICAL -> CRITICAL (PROBLEM)
Command: check_mk-ldap
Output: CRIT - server not responding
Perfdata:
Cjones reports that he has seen these errors from all three CNs referring to all three CNs.
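To reproduce the symptom by hand from any one CN, a plain TCP connect to port 389 on each peer is usually enough. The following is a minimal sketch, not the check that Check_MK itself runs: the function name `port_is_open` and the timeout value are illustrative assumptions, and a successful connect only shows that the port is reachable, not that LDAP answers queries.

```python
import socket


def port_is_open(host: str, port: int = 389, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; it does not speak LDAP, it only
    tests whether the firewall and listener accept a connection.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False
```

Run from cn-ucsb-1, a call such as `port_is_open("cn-unm-1.dataone.org")` returning False would be consistent with the "server not responding" alert above.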
I will log into Check_MK shortly to see what it shows on that end. In the meantime, I logged into the three prod CNs to check which source addresses port 389 is open to on each CN.
cn-orc-1:
389 ALLOW 160.36.13.150
389 ALLOW 127.0.0.1
389 ALLOW 64.106.40.6
389 ALLOW 160.36.13.153
This doesn't look right. In order, these are itself (160.36.13.150), itself again via loopback (127.0.0.1), cn-unm-1 (64.106.40.6, which interestingly does not show up in nslookup and cannot be pinged from cn-orc-1), and cn-dev-orc-1 (160.36.13.153).
cn-ucsb-1:
Ufw reports no entries for port 389.
cn-unm-1:
389 ALLOW 64.106.40.6
389 ALLOW 160.36.13.150
Itself (64.106.40.6) and cn-orc-1 (160.36.13.150). No entry for cn-ucsb-1.
Unless some fancy port forwarding tricks are happening on prod, these look like pretty glaring discrepancies. Will discuss with coredev as soon as a quorum is available to do so.
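The comparison above can be sketched as a small script that takes the port 389 lines from each CN's ufw listing and reports which peer CNs are missing. This is only an illustration under stated assumptions: the helper names are hypothetical, the addresses are the ones quoted in this ticket (taking 160.36.13.150 as cn-orc-1, as in the cn-unm-1 listing), and the regular expression assumes rule lines shaped like the abbreviated ones pasted here.

```python
import re

# Addresses as quoted in this ticket; treat them as assumptions.
CN_ADDRESSES = {
    "cn-orc-1": "160.36.13.150",
    "cn-ucsb-1": "128.111.54.80",
    "cn-unm-1": "64.106.40.6",
}

# Matches lines such as "389 ALLOW 160.36.13.150" (optionally "389/tcp"
# and "ALLOW IN"), capturing the source address.
RULE = re.compile(r"^\s*389(?:/tcp)?\s+ALLOW(?:\s+IN)?\s+(\S+)\s*$")


def allowed_sources(ufw_lines):
    """Collect the source addresses allowed to reach port 389."""
    allowed = set()
    for line in ufw_lines:
        match = RULE.match(line)
        if match:
            allowed.add(match.group(1))
    return allowed


def missing_peers(local_cn, ufw_lines):
    """Return the names of the other CNs that have no port 389 rule."""
    allowed = allowed_sources(ufw_lines)
    return sorted(
        name
        for name, address in CN_ADDRESSES.items()
        if name != local_cn and address not in allowed
    )
```

Fed the three listings above, this reports cn-ucsb-1 as missing on both cn-orc-1 and cn-unm-1, and both peers as missing on cn-ucsb-1.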
History
#1 Updated by David Doyle over 10 years ago
Added ufw entries on the prod CNs as needed so that they can contact each other on port 389. While I was doing that, Check_MK began sending out "server is responding" service recovery emails.
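For the build/upgrade scripts, the rules each CN needs could be generated rather than typed by hand. The sketch below is an assumption about how that might look, not a record of the commands actually run for this ticket: the helper name is hypothetical, the addresses are the ones quoted in the description, and the output uses the `ufw allow proto tcp from <address> to any port <port>` rule form.

```python
# Addresses as quoted in the ticket description; treat them as assumptions.
CN_ADDRESSES = {
    "cn-orc-1": "160.36.13.150",
    "cn-ucsb-1": "128.111.54.80",
    "cn-unm-1": "64.106.40.6",
}


def ufw_allow_commands(local_cn, peers=CN_ADDRESSES, port=389):
    """Build one ufw command per peer CN, skipping the local CN itself.

    The commands are returned as strings so they can be reviewed before
    anyone runs them with root privileges.
    """
    return [
        f"ufw allow proto tcp from {address} to any port {port}"
        for name, address in sorted(peers.items())
        if name != local_cn
    ]
```

For example, `ufw_allow_commands("cn-ucsb-1")` yields one rule for 160.36.13.150 and one for 64.106.40.6, which are the two entries that were absent on that host.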
Going to reassign this to Jing to check over prod CN build/upgrade scripts and procedures to ensure that port 389 is opened correctly during buildouts and OS upgrades.
#2 Updated by David Doyle over 10 years ago
- Project changed from Infrastructure Administration to Infrastructure
- Category changed from ORC - general to Hardware
- Assignee changed from David Doyle to Jing Tao
- Milestone set to None