Speeding up Nagios II.

In the previous article of this miniseries we showed how to dramatically speed up Nagios by putting the spool directory on a RAM disk. Today we are going to speed up the most basic check of every host. Like the previous improvement, this one came from a real need to eliminate false positives on the network.

Our company’s Nagios currently checks 1000 hosts with 1940 services. One nice Sunday it began to freak out. Not the usual way, but really massively: mission-critical hosts and services were reported as unreachable, or reachable only with high latency and packet loss. Fortunately I was near a PC, so I looked into the whole incident. But nothing was actually happening; the network was happy despite the screaming Nagios.

This really pissed me off, so I started looking for the cause. Like everyone who monitors a network with Nagios, we use it mainly to check the availability of devices. Suspicion fell on two commands: check-host-alive and check_ping. By default they are defined like this:

define command{
   command_name check-host-alive
   command_line $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5
}

define command{
   command_name check_ping
   command_line $USER1$/check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 5
}

OK, let’s have a look at check_ping performance.

root@nagios:/usr/local/nagios/libexec# time ./check_ping -H 172.17.110.7 -w 3000.0,80% -c 5000.0,100% -p5
PING OK - Packet loss = 0%, RTA = 6.19 ms|rta=6.193000ms;3000.000000;5000.000000;0.000000 pl=0%;80;100;0

real 0m4.090s
user 0m0.020s
sys 0m0.000s

Uff, four seconds to execute one check. That’s a lot. What the hell is it doing all that time? Let’s have a look at a system call trace.

root@nagios:/usr/local/nagios/libexec# strace ./check_ping -H 172.17.110.7 -w 3000.0,80% -c 5000.0,100% -p5
execve("./check_ping", ["./check_ping", "-H", "172.17.110.7", "-w", "3000.0,80%", "-c", "5000.0,100%", "-p5"], [/* 22 vars */]) = 0
brk(0) = 0xfeb000
...
lseek(5, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
read(3, "PING 172.17.110.7 (172.17.110.7)"..., 4096) = 115
read(3, "64 bytes from 172.17.110.7: icmp"..., 4096) = 60
read(3, "64 bytes from 172.17.110.7: icmp"..., 4096) = 60
read(3, "64 bytes from 172.17.110.7: icmp"..., 4096) = 60
read(3, "64 bytes from 172.17.110.7: icmp"..., 4096) = 59
read(3, "\n", 4096) = 1
read(3, "--- 172.17.110.7 ping statistics"..., 4096) = 150
read(3, "", 4096) = 0
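
The read() calls at the end of the trace pull in human-readable ping output line by line, which suggests the plugin gets its numbers by scraping text. As a rough illustration of that pattern (not the plugin's actual code), here is a minimal Python sketch. The function names, the regular expressions, and the sample text are all assumptions made up for this example; the sample uses a documentation address and invented timings.

```python
import re
import subprocess

# Illustrative iputils-style summary text; NOT taken from the trace above.
SAMPLE_PING_OUTPUT = """\
--- 192.0.2.1 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4005ms
rtt min/avg/max/mdev = 6.100/6.193/6.300/0.070 ms
"""

LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")
RTT_RE = re.compile(r"= ([\d.]+)/([\d.]+)/([\d.]+)")


def run_ping(host, count=5):
    """Run the system ping as a child process and return its text output."""
    result = subprocess.run(
        ["ping", "-n", "-c", str(count), host],
        capture_output=True,
        text=True,
    )
    return result.stdout


def parse_ping_summary(text):
    """Return (packet_loss_percent, average_rtt_ms) scraped from ping output.

    average_rtt_ms is None when ping printed no rtt line (no replies).
    """
    loss_match = LOSS_RE.search(text)
    if loss_match is None:
        raise ValueError("no packet loss figure found in ping output")
    rtt_match = RTT_RE.search(text)
    average_rtt = float(rtt_match.group(2)) if rtt_match else None
    return float(loss_match.group(1)), average_rtt


if __name__ == "__main__":
    # Parse the canned sample so the sketch runs without network access.
    print(parse_ping_summary(SAMPLE_PING_OUTPUT))
```

Whatever the real implementation looks like, a check built this way can never finish before the child ping process does.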

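Ahaaaa! So it basically runs the ordinary system utility ping as a subprocess and parses its output. Presumably that is also where the four seconds go: ping waits about a second between packets by default, so five packets take roughly four seconds no matter how fast the host answers. Can we do it faster and better? Of course we can. Let’s have a look at the plugin directory. There we find the check_icmp plugin, which does exactly the same job, only faster, because it generates the ICMP packets itself. The price is that the binary needs the SETUID bit set. Let’s try it.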

root@nagios:/usr/local/nagios/libexec# time ./check_icmp -w 3000.0,80% -c 5000.0,100% -p5 -H 172.17.110.7
OK - 172.17.110.7: rta 0.208ms, lost 0%|rta=0.208ms;3000.000;5000.000;0; pl=0%;80;100;; rtmax=0.331ms;;;; rtmin=0.140ms;;;;

real    0m0.004s
user    0m0.010s
sys     0m0.010s
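
To put the two timings in perspective, here is a back-of-the-envelope calculation in Python. It multiplies the measured "real" time of each plugin by the 1000 hosts mentioned earlier. The function name is made up for this sketch, and the model is deliberately naive: Nagios runs checks in parallel, so the result is the total time spent waiting on checks, not the wall-clock length of one round.

```python
HOSTS = 1000                 # hosts monitored, as stated in the article
CHECK_PING_SECONDS = 4.090   # "real" time of the check_ping run
CHECK_ICMP_SECONDS = 0.004   # "real" time of the check_icmp run


def cumulative_check_seconds(hosts, seconds_per_check):
    """Total time spent in host checks if each host is checked once."""
    return hosts * seconds_per_check


ping_total = cumulative_check_seconds(HOSTS, CHECK_PING_SECONDS)
icmp_total = cumulative_check_seconds(HOSTS, CHECK_ICMP_SECONDS)

if __name__ == "__main__":
    print(f"check_ping: {ping_total:.0f} s (about {ping_total / 60:.0f} min)")
    print(f"check_icmp: {icmp_total:.0f} s")
```

That is roughly 68 minutes of cumulative waiting per round of host checks against about 4 seconds.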

This looks much better. We change the definitions from the beginning of this article and reload the configuration.

define command{
   command_name check-host-alive
   command_line $USER1$/check_icmp -w 3000.0,80% -c 5000.0,100% -p 5 -H $HOSTADDRESS$
}

define command{
   command_name check_ping
   command_line $USER1$/check_icmp -w $ARG1$ -c $ARG2$ -p 5 -H $HOSTADDRESS$ 
}
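
Since check_icmp depends on the SETUID bit, it may be worth confirming that the bit is really in place before reloading, otherwise every host check could fail at once. The following Python sketch shows one way to check it. The helper names are invented for this example, and the plugin path is simply the directory used in the transcripts, so adjust it to your installation.

```python
import os
import stat


def mode_has_setuid(mode):
    """Return True if the setuid bit is set in a file mode."""
    return bool(mode & stat.S_ISUID)


def plugin_is_setuid_root(path):
    """Return True if the file is owned by root and has the setuid bit."""
    info = os.stat(path)
    return info.st_uid == 0 and mode_has_setuid(info.st_mode)


if __name__ == "__main__":
    # Assumed location, taken from the shell prompts in this article.
    plugin = "/usr/local/nagios/libexec/check_icmp"
    try:
        ok = plugin_is_setuid_root(plugin)
    except FileNotFoundError:
        print(f"{plugin}: not found")
    else:
        print(f"{plugin}: {'setuid root' if ok else 'NOT setuid root'}")
```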

Since then Nagios has behaved correctly. 🙂 Meanwhile, version 4.0 has recently come out, bringing many performance improvements. I have not tried it yet, but I will certainly let you know.
