If we take networking seriously, we have to monitor the status of the network and the health of its services. In a small network with tens of devices the whole issue is quite simple and the default configuration is sufficient. But what if the network has “grown a little” to several hundred devices and several thousand services, and Nagios starts running out of breath?
I had been living with this problem for quite a long time. I kept postponing it, hoping that “it still works, just an update or upgrade will be needed” — until today, when I finally managed to look at the problem more closely.
Our corporate Nagios monitors a network consisting of dozens of routers, switches, radios and other jewelry spread over an area of three or four districts. Of course, with so many devices it has a lot of work to do to walk the whole network and check everything. The problem was that this load was causing latency and jitter during basic work with the server over SSH, and it also distorted some latency measurements, resulting in false alarms.
What was slowing Nagios down? I/O operations. Using the iotop utility from the package of the same name, I confirmed that the high I/O load was generated by Nagios. A few well-aimed questions to Google turned up a solution: move the spool directory to a ramdisk. And it helped.
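For reference, a typical way to pin down the offender with iotop looks like this (it needs root; the -o flag limits the output to processes actually doing I/O):

```shell
# Batch mode (-b), five iterations (-n 5), only processes with
# active I/O (-o), accumulated totals (-a) so short bursts show up too
iotop -b -o -a -n 5
```

With accumulated mode, the nagios process and its check workers climb to the top of the list quickly if they are indeed the source of the load.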
Let’s see how to do this. First, create a mountpoint, e.g. /var/ramdrive.

mkdir /var/ramdrive
Define the filesystem in /etc/fstab.
tmpfs /var/ramdrive tmpfs size=128M,mode=0755,uid=1001,gid=1001 0 0
A ramdisk size of 128 MB will be big enough to hold all the necessary data. Replace uid=1001,gid=1001 with the UID and GID of your installation’s nagios user. Then mount the ramdisk and create the basic directory structure.
mount /var/ramdrive
mkdir -p /var/ramdrive/spool/checkresults
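Before pointing Nagios at the new location, it is worth confirming that the tmpfs is really mounted and writable by the nagios user — a quick sanity check, run as root (the chown is only needed if the spool directories were created by root rather than the nagios user):

```shell
# Verify the tmpfs mount and its size limit
df -h /var/ramdrive

# Make sure the nagios user owns and can write into the spool directory
chown -R nagios:nagios /var/ramdrive/spool
su -s /bin/sh nagios -c 'touch /var/ramdrive/spool/checkresults/.test && rm /var/ramdrive/spool/checkresults/.test'
```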
Now change the corresponding parameters in nagios.cfg.
object_cache_file=/var/ramdrive/objects.cache
status_file=/var/ramdrive/status.dat
check_result_path=/var/ramdrive/spool/checkresults
If we use performance data collection and visualization with PNP4Nagios, we can likewise point its data files to the ramdisk. After a restart with /etc/init.d/nagios restart, the system will be faster, because all the time-consuming, I/O-heavy operations now take place in the ramdisk.
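For PNP4Nagios in bulk mode, the redirection means adjusting the perfdata file directives in nagios.cfg as well — a sketch, assuming the stock directive names and that /var/ramdrive/perfdata exists and is writable by the nagios user:

```
process_performance_data=1
service_perfdata_file=/var/ramdrive/perfdata/service-perfdata
host_perfdata_file=/var/ramdrive/perfdata/host-perfdata
```

The processing commands that feed the files to npcpd/process_perfdata.pl stay as they were; only the spool location moves to RAM.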
I admit that this is only a short-term solution until this setup also hits its limits and distributed monitoring becomes simply inevitable, but for now it is enough 🙂