I am building out a Nagios system to replace the traditional system monitoring tools within my enterprise. Those tools consisted of scripts that checked file system size and uptime, plus fping to ping all the hosts within the enterprise. Fping is very efficient and effective for host discovery; however, for host monitoring you want to move away from the “ping” command.
The ping command is simple and great. You can ping a host and receive the packet round-trip time and whether the host is up or down. Wait, did I say whether a host is up or down? In reality, assuming you do not have any firewalls between the “pinger” and “pingee”, a ping only tells you whether the network interface card (NIC) has power and the network stack behind it is able to respond. Ping tells you little about the state of the rest of the operating system.
Ping also consumes bandwidth. Bandwidth usage may appear negligible for a small enterprise (fewer than 100 hosts). However, what about larger enterprises? To be proactive about host issues, the enterprise should perform a host-alive check on every server, switch, and router every five minutes. Now we can start to see the bandwidth problem.
To shorten reaction time and identify down hosts, we use a service check instead of a straight ping. Service checks go right down to the operating system level. Because you are already performing service checks on these hosts, reusing one as the host check adds no extra traffic, which leaves resources available for other enterprise concerns. If you are still in love with the ping command, then use it as a service check.
I recommend using the NRPE daemon for service checks. You want to place the small load on the client, and with the exception of some very old operating systems, like SunOS, you can compile and configure the NRPE daemon for any operating system. Once the NRPE daemon is set up, you want to transform the default ping into a service:
- Define the check_nrpe command.
- Define the ping command. Using the already defined check-host-alive will work just fine.
- Define the service and attributes. Simply place check_nrpe! in front of the check-host-alive command within the default services file.
- Change all check_commands to check_nrpe!check-host-alive
- Add the command to every client’s nrpe.cfg file. Your ping address will always be a loop-back address like 127.0.0.1.
- Reload Nagios and you are finished.
Note: The reason to use a loop-back address is that we are less interested in the actual ping than in the client’s ability to return a value.
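The steps above could be sketched roughly as follows. This is a minimal sketch, not a definitive configuration: the plugin path /usr/local/nagios/libexec, the template name nrpe-ping-host, and the file names are assumptions for illustration, and the check_ping thresholds are simply copied from the stock check-host-alive definition in the Nagios sample configuration. Adjust all of them to match your installation.

```
# --- Nagios server: commands.cfg (file name assumed) ---
# Wrapper that asks the NRPE daemon on the target host to run
# the command named in $ARG1$.
define command {
    command_name    check_nrpe
    command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
}

# --- Nagios server: host template (template name is hypothetical) ---
define host {
    name            nrpe-ping-host
    check_command   check_nrpe!check-host-alive
    register        0       ; template only, not a real host
}

# --- Each client: nrpe.cfg ---
# The client pings its own loop-back address; what matters is that
# the NRPE daemon is able to run the plugin and return a result.
command[check-host-alive]=/usr/local/nagios/libexec/check_ping -H 127.0.0.1 -w 3000.0,80% -c 5000.0,100% -p 5
```

If NRPE runs as a standalone daemon on the client, it typically needs a restart before it picks up a new command entry.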
Now, if you are already using the NRPE daemon for service checks, the only change is to put one of those service checks on the check_command line in each template. You want to use a command that is on every client. The command must also check a service that you need to monitor and that rarely changes state. Remember, when the service sends a CRITICAL state back to the Nagios host, the host is also going to show as down.
I use the check_users command. The command is very lightweight, so running it every 5 minutes across the enterprise is not going to place additional load on the client or network. We already have the command set up and monitoring, so we are killing two birds with one stone. Traditional system monitoring tools caught 70% to 80% of hosts rebooting on their own; now we catch 100%.
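As an illustration of reusing check_users as the host check, a sketch might look like this. It assumes the check_nrpe command is already defined on the Nagios server and that each client's nrpe.cfg contains a check_users entry like the one in the sample nrpe.cfg. The template name, plugin path, and thresholds are assumptions, not values from my environment.

```
# --- Nagios server: host template (template name is hypothetical) ---
define host {
    name                nrpe-users-host
    check_command       check_nrpe!check_users
    check_interval      5       ; in time units; one unit is 60 seconds by default
    register            0       ; template only, not a real host
}

# --- Each client: nrpe.cfg ---
# Thresholds are assumptions; keep them high enough that normal logins
# never push the check to CRITICAL, which would mark the host as down.
command[check_users]=/usr/local/nagios/libexec/check_users -w 5 -c 10
```

The design trade-off is the one noted earlier: the host state now depends on the plugin result, so the thresholds have to be loose enough that only a genuinely unresponsive client produces a CRITICAL.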
You cannot fix it if you do not know it is broken. Replacing traditional system monitoring tools with Nagios is the best first step. The next step is to replace the traditional ping with a service check. Catching an unresponsive operating system before your end users do will keep the unwelcome spotlight off your information technology section.
Comments or thoughts?