This website demonstrates simple scripts that monitor servers by pinging them. They are designed to illustrate different strategies for monitoring, and can be used as a starting point for many situations.
Note: This page was written in 2003 and is no longer maintained (moved to the Crypt). I'll keep it online as I've had emails over the years to say it has helped people out.
The action that checks the service. The check may be to ping the host, test whether a port is listening, or simulate a client transaction.
The most basic test is to use the ping tool to check if a host is alive. It sends ICMP echo requests and listens for ICMP echo responses. This lets you know that the host address has resolved (if a DNS name was used), and a host with that address has responded. What happens in detail on the remote host is this:
This means that you haven't actually checked much about the health of the operating system, and haven't checked the health of running applications at all. That's not to say ICMP-based testing is useless: it's very useful data in conjunction with other tests.
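As a rough illustration of the ICMP check, here is a minimal Python sketch. It assumes a Unix-like `ping` that accepts `-c` (packet count) and exits with status 0 when a reply arrives; some platforms use different options. The names `is_alive`, `report` and the `runner` parameter are made up for this example and are not part of the scripts on this page.

```python
import subprocess


def is_alive(host, runner=subprocess.run):
    """Return True if one ICMP echo request to `host` gets a reply.

    Assumes a Unix-like `ping` that accepts -c and exits 0 on a reply.
    `runner` is injectable so the logic can be exercised without a network.
    """
    try:
        result = runner(
            ["ping", "-c", "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=10,
        )
    except (subprocess.TimeoutExpired, OSError):
        # Timed out, or no ping binary available: treat as no answer.
        return False
    return result.returncode == 0


def report(host, alive):
    """Format a status line in the style of the sample output on this page."""
    return f"{host} is alive" if alive else f"no answer from {host}"
```

A caller would typically do something like `print(report("mars", is_alive("mars")))`. A True result still only tells you that something at that address answered the echo request.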
The next test would be to check if a port is listening (for example, my portping tool tests TCP). This checks more functionality: that an application has listened on a port and is still present (ie, the process hasn't terminated - if it had, the kernel would close the port). While it means the application hasn't died, it could be completely frozen, or the kernel could be inundated with work and be unable to schedule it (which is why TCP has a backlog). If by connecting to the TCP port the remote application sends a protocol string (eg, SSH port 22 replies with something like "SSH-2.0-OpenSSH_5.3p1 Debian-3ubuntu4"), then you know a lot more: the kernel was able to complete the accept() syscall, schedule the application process, and the application process was able to send an initial response.
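The port test can be sketched in Python with the standard socket module. This is not how portping is implemented; it is only an illustration, and the function name `check_port` and its defaults are assumptions. It connects, then optionally waits briefly for a banner such as the SSH version string.

```python
import socket


def check_port(host, port, timeout=5.0, read_banner=True):
    """Connect to a TCP port and optionally read the first bytes sent.

    Returns (is_open, banner). banner is "" if the service sent nothing
    within the timeout (many protocols wait for the client to speak first).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            if not read_banner:
                return True, ""
            try:
                data = sock.recv(256)
            except OSError:
                # Read timed out or was reset: the port accepted the
                # connection, but the application sent nothing readable.
                data = b""
            return True, data.decode("ascii", errors="replace").strip()
    except OSError:
        # Connection refused, unreachable, DNS failure or connect timeout.
        return False, ""
```

An open port with an empty banner is the weaker result described above: the kernel completed the connection, but you have no evidence yet that the application itself is responding.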
Testing further would involve simulating a client transaction and checking the response - which can check that the application is not just running, but also behaving normally. If it was a web server, a script could fetch a particular web page and check its MD5 and response time - tools like wget may assist. If a database server is to be tested, a database fetch could be attempted and the data checked - using Perl and CPAN libraries can help here. Using such specific tests is sometimes called "focused monitoring".
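For the web server case, a simulated transaction might look like the following Python sketch. It uses urllib instead of wget, and the function names, the expected MD5 and the response time threshold are all assumptions chosen for illustration.

```python
import hashlib
import time
import urllib.request


def fetch(url, timeout=10.0):
    """Fetch a URL and return (body_bytes, elapsed_seconds)."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        body = response.read()
    return body, time.monotonic() - start


def evaluate(body, elapsed, expected_md5, max_seconds):
    """Return a list of problems; an empty list means the check passed."""
    problems = []
    actual = hashlib.md5(body).hexdigest()
    if actual != expected_md5:
        problems.append(f"content changed (md5 {actual})")
    if elapsed > max_seconds:
        problems.append(f"slow response ({elapsed:.2f}s > {max_seconds:.2f}s)")
    return problems
```

In use, the script would call `fetch()` on a page whose content is known to be static, then pass the result to `evaluate()` with the MD5 recorded when the page was known to be good. Pages with dynamic content (dates, counters) will fail an MD5 comparison, so pick the test page with care.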
Now, if we have gone to the effort to check a particular detail by writing our own script, it is helpful to have this script update a log whenever the check is performed. If a user were to say "the service was slow at 9am this morning", it is nice to have a log where we have recorded service response times from our end-user simulation script.
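A log like this needs very little code. The sketch below appends one line per check; the line layout and the name `log_check` are assumptions, so adjust them to whatever your other tools expect.

```python
import time


def log_check(path, service, ok, elapsed, now=None):
    """Append one line per check: timestamp, service, status, response time.

    `now` is seconds since the epoch; None means the current time.
    """
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    status = "OK" if ok else "FAIL"
    line = f"{stamp} {service} {status} {elapsed:.3f}s\n"
    with open(path, "a", encoding="utf-8") as log:
        log.write(line)
    return line
```

With one line per check, answering "was it slow at 9am?" becomes a grep on the timestamp.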
This is how we draw attention to problems. Some things to consider:
Colour coding is very effective: red=bad, green=good. This is sometimes called "traffic lights", and will allow any staff member to understand your reports. However, be careful about false positives and negatives: objective metrics suit colour coding (eg, hardware or failures), whereas subjective metrics may not (eg, performance).
Audio can be either keyboard beeps or recorded samples played through an audio card. Email or pager alerts should only be attempted if the code is "stateful" - a message is sent only when a change happens.
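"Stateful" here just means remembering the previous result and comparing. A minimal Python sketch, assuming each round of checks produces a `{host: is_up}` dictionary (the function name and message wording are illustrative):

```python
def state_changes(previous, current):
    """Compare two {host: is_up} snapshots and return alert messages.

    Only hosts whose state differs from the previous snapshot produce a
    message, so a host that stays down is reported once, not every check.
    Hosts missing from `previous` are assumed to have been up.
    """
    messages = []
    for host in sorted(current):
        was_up = previous.get(host, True)
        is_up = current[host]
        if was_up and not is_up:
            messages.append(f"{host} has gone DOWN")
        elif is_up and not was_up:
            messages.append(f"{host} is back UP")
    return messages
```

Only the returned messages would be emailed or paged. Without this comparison, a host that is down overnight generates one email per check, and people quickly learn to ignore them.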
The examples on this website are written for Unix or Linux, as they are ideal platforms to run monitoring tools from. This is because it is common for an install to have powerful scripting tools such as sh, ksh, bash, sed, awk; a powerful language with network libraries such as perl or python; tools to send emails such as sendmail, mail, mailx; and a webserver that may be needed to host the reports.
Other platforms such as Microsoft Windows could be used; however, this may require installing extra software such as a perl distribution.
Screenshots and Downloads
Strategy 1 - CLI
pinghosts is a simple command line program to ping hosts in /etc/hosts
and colour the output. Variants could be written to ping an /etc/prodhosts file,
or /etc/devhosts - whatever is suitable. Using files like this makes maintenance easier.
$ pinghosts
Checking 127.0.0.1: 127.0.0.1 is alive
Checking 192.168.1.1: 192.168.1.1 is alive
Checking 192.168.1.2: no answer from 192.168.1.2
Checking 192.168.1.5: 192.168.1.5 is alive
Checking 192.168.1.150: 192.168.1.150 is alive
Checking 192.168.1.151: no answer from 192.168.1.151
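The two pieces specific to this strategy are reading addresses from a hosts-format file and colouring each output line. The Python sketch below shows one way to do both; it is not the downloadable script, and it assumes a terminal that understands ANSI colour escape codes. The names `read_hosts` and `colour` are made up for the example.

```python
GREEN = "\033[32m"
RED = "\033[31m"
RESET = "\033[0m"


def read_hosts(path="/etc/hosts"):
    """Return the addresses listed in a hosts-format file, in file order."""
    addresses = []
    with open(path, encoding="utf-8") as hosts:
        for line in hosts:
            line = line.split("#", 1)[0].strip()  # drop comments
            if not line:
                continue
            address = line.split()[0]  # first field is the address
            if address not in addresses:
                addresses.append(address)
    return addresses


def colour(line, alive):
    """Wrap a status line in ANSI green (alive) or red (no answer)."""
    return f"{GREEN if alive else RED}{line}{RESET}"
```

Pointing `read_hosts()` at a different file gives the /etc/prodhosts or /etc/devhosts variants without changing any code.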
Strategy 2 - Website
This pings servers and produces a colour-coded HTML page which is
served by a webserver. It could be scheduled to
run via crontab to update every 5 minutes.
Ping began at Saturday January 25 22:33:20 EST 2003
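A report generator of this kind can be sketched in a few lines of Python. This is an illustration only: the HTML layout and the names `render_report` and `write_report` are assumptions, and the results are expected as a `{host: is_alive}` dictionary produced by whatever check you use.

```python
import html
import os
import time


def render_report(results, started=None):
    """Build a colour-coded HTML page from a {host: is_alive} mapping."""
    began = time.strftime("%A %B %d %H:%M:%S %Z %Y", time.localtime(started))
    rows = []
    for host, alive in results.items():
        colour = "green" if alive else "red"
        text = f"{host} is alive" if alive else f"no answer from {host}"
        rows.append(
            f'<tr><td style="background:{colour};color:white">'
            f"{html.escape(text)}</td></tr>"
        )
    page = [
        "<html><head><title>Ping report</title></head><body>",
        f"<p>Ping began at {html.escape(began)}</p>",
        "<table>",
        *rows,
        "</table>",
        "</body></html>",
    ]
    return "\n".join(page) + "\n"


def write_report(path, page):
    """Write to a temporary file, then rename, so the webserver never
    serves a half-written page while cron is updating it."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as out:
        out.write(page)
    os.replace(tmp, path)
```

The timestamp matters here: a static page that cron has stopped updating looks exactly like a healthy network, so the reader needs to see when the data was collected.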
Strategy 3 - CGI
This pings servers and produces a colour-coded HTML report. As a CGI, it is
triggered from a browser "on demand" to produce live data.
Ping began at Saturday January 25 22:34:30 EST 2003
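The only difference from a static page is that a CGI program prints an HTTP header, a blank line, and then the HTML to standard output each time it is requested. A minimal Python sketch, with placeholder results where the real checks would run (the function name and HTML layout are assumptions):

```python
import html
import sys


def cgi_response(results):
    """Return a complete CGI response for a list of (host, is_alive) pairs."""
    items = []
    for host, alive in results:
        colour = "green" if alive else "red"
        text = f"{host} is alive" if alive else f"no answer from {host}"
        items.append(f'<li style="color:{colour}">{html.escape(text)}</li>')
    body = "<html><body><ul>\n" + "\n".join(items) + "\n</ul></body></html>\n"
    # The blank line after the header separates it from the body.
    return "Content-Type: text/html\r\n\r\n" + body


if __name__ == "__main__":
    # Placeholder results: a real script would run its checks here, at
    # request time, which is what makes the report "live".
    sys.stdout.write(cgi_response([("mars", True), ("phobos", False)]))
```

Because the checks run while the browser waits, keep the host list short or the timeouts low, otherwise a few dead hosts make the page very slow to load.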
Strategy 4 - Multi (Email, Syslog, CLI and Web)
watchping is a watchdog program to ping servers and take action if they go down. It is designed to run as a daemon so that it can be stateful - eg, only email when something changes, not every time a check is made. The four actions are to send email alerts, send syslog alerts, log everything, and generate a website. By default it uses email and syslog.
WatchPing Report, Friday January 3 03:07:13 EST 2003
mars is alive
Example running 1:
Here, watchping is run in verbose mode for the hosts mars and phobos. Phobos
is down, so it sends a syslog message, emails root a message, and prints
messages to the screen (verbose):
# ./watchping -v mars phobos
Running WatchPing...
Sleep Interval: 60 secs
Email address: root
Syslog priority: user.err
Checking Hosts: mars phobos
-----
Friday January 3 02:34:37 EST 2003
mars is alive
no answer from phobos
-----
Friday January 3 02:35:42 EST 2003
mars is alive
no answer from phobos
# tail -1 /var/adm/messages
Jan 3 02:34:42 mars watchping: [ID 702911 user.alert] Hosts Down: phobos
# mail
From firstname.lastname@example.org Fri Jan 3 02:35:47 2003
Date: Fri, 3 Jan 2003 02:35:47 +1100 (EST)
From: Root
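The general shape of such a watchdog loop can be sketched in Python as below. This is not watchping's source. The check and the actions are passed in as functions, the "All hosts are alive" recovery message is my own assumption (only the "Hosts Down:" text appears in the output above), and `syslog_action` relies on Python's Unix-only syslog module.

```python
import time


def watch(hosts, check, actions, interval=60, rounds=None, sleep=time.sleep):
    """Check hosts repeatedly; call every action when the down set changes.

    check(host) -> bool and each action(message) are supplied by the caller
    (for example functions that send email or write to syslog).
    rounds=None loops forever, as a daemon would.
    """
    last_down = set()
    done = 0
    while rounds is None or done < rounds:
        down = {host for host in hosts if not check(host)}
        if down != last_down:
            # State changed since the previous round: this is the only
            # time alerts are sent.
            if down:
                message = "Hosts Down: " + " ".join(sorted(down))
            else:
                message = "All hosts are alive"  # assumed wording
            for action in actions:
                action(message)
            last_down = down
        done += 1
        if rounds is None or done < rounds:
            sleep(interval)


def syslog_action(message):
    """Example action: log at user.err. Unix only."""
    import syslog

    syslog.openlog("watchping")
    syslog.syslog(syslog.LOG_USER | syslog.LOG_ERR, message)
```

Passing the actions in as a list keeps the loop independent of how alerts are delivered, so email, syslog, a log file and a web page can be combined in any mix.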
Example running 2:
Watchping can be used as a background process on
startup, configured to check against custom lists of hosts, email different
addresses, etc. This example demonstrates a combination of actions:
watchping -e sysadmin@mars -w /var/http/prod.html -i /etc/prod.txt &
watchping -e dbadmin@venus -w /var/http/db.html -i /etc/db.txt &
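If you write your own variant, option handling along these lines is easy to reproduce. The Python sketch below mirrors the flags seen in the examples on this page; their meanings (-e email address, -w web report file, -i host list file, -v verbose) are inferred from those examples and are not taken from watchping's documentation.

```python
import argparse


def parse_args(argv=None):
    """Parse watchping-style options (meanings inferred from the examples)."""
    parser = argparse.ArgumentParser(prog="watchping")
    parser.add_argument("-v", dest="verbose", action="store_true",
                        help="print each check to the screen")
    parser.add_argument("-e", dest="email", default="root",
                        help="address to email alerts to")
    parser.add_argument("-w", dest="webfile",
                        help="HTML report file to write")
    parser.add_argument("-i", dest="hostfile",
                        help="file listing the hosts to check")
    parser.add_argument("hosts", nargs="*",
                        help="hosts to check, if no host file is given")
    return parser.parse_args(argv)
```

Keeping the host list and the email address as options is what lets one script serve several teams, each with its own hosts, report page and alert recipient.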
If I've passed on some ideas for monitoring then the goal of this website has been a success. Good luck!