Ping is just another four letter word, like fail or fake, unless properly deployed.

I am building out a Nagios system to replace the traditional system monitoring tools within my enterprise. Those tools consisted of scripts that checked file system size and uptime, plus fping to ping every host in the enterprise. Fping is very efficient and effective for host discovery; however, for host monitoring you want to move away from the “ping” command.

The ping command is simple, and it is great. You can ping a host and receive the packet round-trip time and whether the host is up or down. Wait, did I say whether a host is up or down? In reality, a ping will only tell you (assuming you do not have any firewalls between the “pinger” and “pingee”) whether the network interface card (NIC) has power and is able to respond. Ping tells you nothing about the state of the operating system.

Ping also consumes bandwidth. The usage may appear negligible in a small enterprise (fewer than 100 hosts). However, what about larger enterprises? To be proactive about host issues, the enterprise should perform a host-alive check on every server, switch and router every five minutes. Now we can start to see the bandwidth problem.

To shorten reaction time and identify down hosts, we use a service check instead of a straight ping. Service checks go right down to the operating system level. Because you are already performing service checks on hosts, reusing one as the host check adds no extra traffic, which leaves resources available for other enterprise concerns. If you are still in love with the ping command, then use it as a service check.
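
To make the distinction concrete, here is a minimal Python sketch of a service-level check. It is not NRPE and not a Nagios plugin; it only assumes that some TCP service is expected to be listening on the host, and the function name `service_responds` is hypothetical. A successful connect shows that the operating system's TCP stack accepted a connection on a listening port, which an ICMP echo reply alone does not show.

```python
import socket


def service_responds(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

A call such as `service_responds("192.0.2.10", 5666)` would probe the default NRPE port on a placeholder address. A real NRPE check goes further, because the daemon also has to run a command and return its result.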

I recommend using the NRPE daemon for service checks. You want to place the small load on the client, and with the exception of some very old operating systems, like SunOS, you can compile and configure the NRPE daemon for any operating system. Once the NRPE daemon is set up, you want to transform the default ping into a service:

  1. Define the check_nrpe command.
  2. Define the ping command. Using the already defined check-host-alive will work just fine.
  3. Define the service and attributes. Simply place check_nrpe! in front of the check-host-alive command within the default services file.
  4. Change all check_commands to check_nrpe!check-host-alive.
  5. Add the command to every client’s nrpe.cfg file. Your ping address will always be a loop-back address like 127.0.0.1.
  6. Reload Nagios and you are finished.

Note: The reason for using a loop-back address is that we are less interested in the ping itself than in the client’s ability to return a value.

Now, if you are already using the NRPE daemon for service checks, the only change is to add a service check to the check_command line in each template. You want to use a command that is on every client. The command must also check a service you need to monitor and that rarely changes state. Remember, when the service sends a CRITICAL state back to the Nagios host, the host is also going to show as down.
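
The mapping behind that last point can be sketched in Python. Nagios plugins report their state through the exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN), and when a plugin is used as the host check, Nagios derives the host state from that code. The function name `host_state` is hypothetical, and the sketch assumes the default behavior in which a WARNING result still counts as UP unless aggressive host checking is enabled.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def host_state(plugin_exit_code: int, aggressive: bool = False) -> str:
    """Sketch of how a host check result translates into a host state.

    Assumption: reachability logic (DOWN versus UNREACHABLE) is ignored here.
    """
    if plugin_exit_code == OK:
        return "UP"
    if plugin_exit_code == WARNING:
        # WARNING only counts against the host when aggressive checking is on.
        return "DOWN" if aggressive else "UP"
    # CRITICAL, UNKNOWN, and anything unexpected are treated as not UP.
    return "DOWN"


for code in (OK, WARNING, CRITICAL, UNKNOWN):
    print(code, host_state(code))
```

This is why the chosen command should rarely leave the OK state: a CRITICAL from, say, a disk-space check would mark the whole host as down even though it is still answering.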

I use the check_users command. The command is very lightweight, so running it every 5 minutes across the enterprise is not going to place additional load on the client or network. We already have the command set up and monitoring, so we are killing two birds with one stone. Traditional system monitoring tools caught 70% – 80% of hosts rebooting on their own; now we catch 100%.

You cannot fix it if you do not know it is broken. Replacing traditional system monitoring tools with Nagios is the best first step. The next step is to replace the traditional ping with a service check. Catching an unresponsive operating system before your end users do will keep the unwelcome spotlight off your information technology section.

Comments or thoughts?

Mike Kniaziewicz, MIS

5 Responses to “Ping is just another four letter word, like fail or fake, unless properly deployed.”


  • Hi Mike,

    After reading your article I have a couple of concerns I want to share:

    • Not using a ping initiated from a monitoring server to a remote host makes you lose important information such as network response speed between a monitoring server and a remote host.
    You will not be able to see degradation in the network link quality.

    • You lose the ability to check whether a remote host is available through the network from a monitoring perspective. You’re not testing the network path.
    Granted, when the host can’t be reached over the network, your NRPE check will fail too. But is this because of the network, or because the NRPE daemon is down or malfunctioning on the remote host?
    This way, it’s clunky and unclear to create a dependency between your host-executed checks and the network connectivity check.

    • *quote*In reality a ping will only tell you if (assuming you do not have any firewalls between the “pinger” and “pingee”) whether the network interface card (NIC) has power and is able to respond.
    Ping tells nothing about the state of the operating system.*quote*
    I agree you can’t determine the complete state of an OS through a simple ping, but it’s the OS network stack which “assigns” the IP address to your NIC, which in turn is what a remote host pings. Merely powering a NIC will not suffice to get an ICMP reply (for example, have a look at http://en.wikipedia.org/wiki/Wake-on-LAN#How_it_works). The OS network stack must be able to reply.

    • Following your philosophy, you will end up with 2 ways of determining whether a remote system is available on the network, as you can’t implement the same approach for routers, switches, firewalls, or any other device which can’t run the NRPE daemon. Although this doesn’t look like a big deal, in a very large network you want to make sure your test results have the same meaning across devices.

    • *quote*Ping also consumes bandwidth. Bandwidth usage may appear to be negligible in regard to a small enterprise ( less than 100 host). However, what about larger enterprises?
    In order to be proactive in regard to host issues, the enterprise should perform a host-alive check on every server, switch and router every five minutes. Now we can start to see the bandwidth problem.*quote*

    The default ping check in Nagios is: check_ping -H ‘remote_host’ -w 5000,100% -c 5000,100% -p 1
    When executed, this check consumes 196 bytes of data.
    When executed every 5 minutes, 1 host will consume 94080 bytes per day.
    With a fairly large setup of, let’s say, 2000 hosts, this makes 188160000 bytes of ICMP checks per day, which equals merely 179.4 megabytes per day.

    If you send 4 packets instead of one for the check_ping check in Nagios, you end up with 717.6 MB per day, which still is NOTHING if you have an infrastructure capable of hosting 2000 devices!

    I didn’t try out how much bandwidth your proposal consumes, but it must most definitely be more, as you’re working with a TCP session: check_nrpe sends a request to the NRPE daemon, which in turn replies back. That kind of defeats a big chunk of your purpose.
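
The arithmetic in the comment above can be reproduced with a short Python sketch. The 196 bytes per check is the commenter's figure and is taken as given; the function name `daily_icmp_bytes` is hypothetical. One thing the sketch makes visible: the quoted 94080 bytes per host per day corresponds to 480 checks per day (one every three minutes), whereas a strict five-minute interval gives 288 checks per day and a proportionally smaller total.

```python
BYTES_PER_CHECK = 196      # commenter's figure for one check_ping with -p 1
MINUTES_PER_DAY = 24 * 60


def daily_icmp_bytes(hosts: int, interval_minutes: int, packets: int = 1) -> int:
    """Total bytes per day, assuming traffic scales linearly with packet count."""
    checks_per_day = MINUTES_PER_DAY // interval_minutes
    return BYTES_PER_CHECK * packets * checks_per_day * hosts


for interval in (3, 5):
    total = daily_icmp_bytes(hosts=2000, interval_minutes=interval)
    print(f"every {interval} min: {total} bytes/day = {total / 2**20:.1f} MiB/day")
```

For 2000 hosts this prints 188160000 bytes (179.4 MiB) per day at three minutes and 112896000 bytes (107.7 MiB) per day at five minutes, so the commenter's conclusion holds either way.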

  • Hi Mike,

    I have exactly the same concerns as smetj when I read your article.

    Pinging a host is obviously important for checking the network stack, and it is lighter than NRPE on the bandwidth front.

    Of course, if you want to check resources on the OS, NRPE is a good choice, but that is another question.

    I also would like to mention that NRPE is running fine on SunOS :)

    I’m sorry but I don’t really see the point of this article, except discovering NRPE.

    Arnaud

  • Thanks for your replies to this post. Your thoughts are very well received and intuitive.

    Now, if you want to use the “ping” command to check network latency, then that is fine; however, “ping” does not tell you if the operating system is functioning properly. As a Linux/Unix Systems Administrator, I have seen countless cases where a ping returns an “OK” status for the hardware while the software is not functioning properly. To see this point in action, launch a continuous “ping” command against a host and bounce the host. I can assure you the ping will reply much sooner than you are able to log into the host, and the gap is even longer if the host has databases to start.

    My concerns rest with the end user being able to perform his or her tasks on the server. Using a service check will show a problem with the operating system. If the NRPE daemon times out on a service check, then I need to investigate the problem more deeply. Many times I have discovered the xinetd daemon was not running after a system restart. Other times I have discovered a process consuming enough resources to slow the system down considerably. Using a “ping” command to check the host would never have revealed the problem.

    In regard to consumption of bandwidth, well the comment points and basic mathematics are valid. However, I am considering the other traffic flowing to and from servers. My organization’s stores have very narrow bandwidth through which must flow credit card validations, pricing and inventory updates as well as other Nagios service checks that are occurring throughout the course of a business cycle. If I can obtain valuable monitoring information, like checking directory size or checking the “sendmail” process, wouldn’t that be more advantageous for checking a host’s up/down status?

    The article focuses on maximizing Nagios’s service check ability. The “ping” command is very valuable; however, I would recommend only using it on systems that require minimal checks. If you really want the entire value of “ping,” I would suggest using “traceroute” (“tracert” on Windows), so you can identify where the latency is on the network. You also do not want Nagios checks to become an issue on the enterprise network or systems.

    Thank you both for responding to this article. GREAT insight and both I and the community appreciate your posts.

  • Hi Mike,

    Thanks for your reply too.

    I understand your point but still, for me it was quite obvious that you can’t say “processes are running because your ping is OK”.

    As a SysAdmin you have so many more things to check.

    Yes, I like the way the Nagios Community is moving! So I’m quite happy to participate.

  • My recommendation is to move away from the ping command to check system health. If you want to check the network health to a host then use the ping command. Ping is an excellent initial method for checking if a system is “alive” or not; however, what do you do with wireless devices?

    I am currently using a ping command to check a system’s health, and you can read the article here: Nagios: How to check a host, that for security reasons has ping disabled. When you process a ping command locally by pinging a loopback address you are engaging the operating system. What we are concentrating on here is the Nagios host’s ability to connect to a client, run a command and receive a response.

    A “ping” from the Nagios host to a client does not engage the operating system. It engages the hardware, I do believe at the Networking layer of the OSI model.

    System Administrators do have a lot of processes to check throughout the enterprise. Using a service check as the check_command reduces the number of commands running against a host by 1. So, by eliminating just one command per host, you have in reality reduced the network load by 2000 commands in a large environment of 2000 hosts.

    Thanks for your question and I hope my thoughts in regard to the “ping” command are a little clearer.
