7. Using Nagios to trigger alerts


Nagios is a sophisticated monitoring system. One of its major strengths is its ability to monitor many services and provide a configurable response to services going into Warning or Critical states. More information is available at the Nagios project home page.

This section will show how to configure MonAMI and Nagios to alert if something is wrong. To get the most out of this section, you will need Nagios configured on your computer or computers. Nagios requires little configuration for a basic setup: the default configuration should be sufficient.

Currently, MonAMI offers only passive monitoring: MonAMI sends updates to Nagios indicating the current state of the monitored services. Passive monitoring requires some extra configuration. The section “Setting up Nagios” below gives an overview of this.

Configuration file

As before, save the following configuration file as /etc/monami.d/example.conf.

##
##  MonAMI by Example, Section 7
##

# Our root filesystem
[filesystem]
 name = root-fs
 location = /
 cache = 2

# Our /home filesystem
[filesystem]
 name = home-fs
 location = /home
 cache = 2

# Once a minute, record / and /home available space.
[sample]
 read = root-fs, home-fs
 write = nagios
 interval = 1m  ❶

# Nagios target that sends alert data.
[nagios]
 host = nagios-svr.example.org  ❷
 port = 5668
 password = NotSecretEnough

 service❸ = rfs❹ : ROOT_FILESYSTEM❺  
 check❻ = rfs❹ : root-fs.capacity.available❼, 10❽, 0.5❾
 check = rfs : root-fs.files.available,   400, 100
 
 service = hfs : HOME_FILESYSTEM
 check = hfs : home-fs.capacity.available, 10, 0.5
 check = hfs : home-fs.files.available,   400, 100

Some points to note:

❶	Once a minute, this sample target will query the current status of the filesystems and send the new data to the nagios target.
❷	This is the host that has the nsca daemon. `nagios-svr.example.org` is an example FQDN and should be replaced with the hostname of your Nagios server.
❸	Each nagios target should have at least one service attribute. A service is what is reported back to Nagios as being `OK`, `Warning`, or `Critical` (or, if misconfigured, as `Unknown`). Without at least one service, a nagios target has nothing to do!
❹	A short name used within the MonAMI configuration for this service.
❺	The name of the service according to Nagios. By convention, this is in capital letters.
❻	Each service should have at least one check attribute associated with it. The checks determine the current status of a service.
❼	The name of a metric. This is the path within the supplied datatree.
❽	The first value that will result in the service going into `Warning` state.
❾	The first value that will result in the service going into `Critical` state.

A service is something that Nagios will monitor on a specific server. In general, a service is an abstract (usually high-level) activity or resource that the server provides. Examples of services include a MySQL DBMS, a Torque resource manager and filesystem space.

Nagios considers each service as being in one of four possible states:

OK: the service is behaving normally.
Warning: normal behaviour continues but there is an early indicator of a problem. If there is any impact on available service, it is slight.
Critical: normal behaviour is no longer possible. If any service is still available it is heavily impacted and complete failure is expected soon.
Unknown: Nagios does not know the status of this service.

Setting up Nagios

For a Nagios host to accept status update messages from MonAMI, the host must either run the nsca daemon (nscad) directly or have an inetd-like daemon (e.g., inetd or xinetd) configured to accept these connections and run the nscad program indirectly.

The nsca daemon, whether running independently or via an inetd-like daemon, will receive the update notice and write a command to the Nagios “external commands” socket. For this to work, two conditions must hold: the socket must be created by the Nagios daemon, and the nsca daemon must have write access to this socket. The first is controlled by a configuration option (check_external_commands, usually found in nagios.cfg). The second requires the directory in which the socket is created (as defined in the command_file option and typically /var/log/nagios/rw/) to have the correct permissions.
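To make this hand-off concrete, the following Python sketch formats the kind of line that is written to the external commands socket for a passive service result. PROCESS_SERVICE_CHECK_RESULT is a standard Nagios external command and the numeric return codes follow the usual Nagios convention; the function name, the example output text and the fixed timestamp are illustrative assumptions. The exact text that nscad writes on behalf of MonAMI is not shown in this tutorial.

```python
import time

# Conventional Nagios return codes for the four service states.
RETURN_CODES = {"OK": 0, "Warning": 1, "Critical": 2, "Unknown": 3}


def passive_result_line(host, service, state, output, timestamp=None):
    """Format one passive service check result as an external command line.

    The function name and arguments are hypothetical; only the
    PROCESS_SERVICE_CHECK_RESULT line layout comes from Nagios itself.
    """
    if timestamp is None:
        timestamp = int(time.time())
    return "[{}] PROCESS_SERVICE_CHECK_RESULT;{};{};{};{}".format(
        timestamp, host, service, RETURN_CODES[state], output
    )


if __name__ == "__main__":
    # Host and service names are taken from this section's examples.
    print(passive_result_line("svr017", "ROOT_FILESYSTEM", "OK", "example output"))
```

The host name and service description in such a line must match a host_name and service_description pair known to Nagios, which is why the service definitions later in this section matter.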

Each MonAMI check must have a corresponding entry in the Nagios configuration. The following creates a generic template for use with MonAMI passive updates.

define service {
     name                      monami-service
     use                       generic-service
     active_checks_enabled     0
     passive_checks_enabled    1
     register                  0
     check_command             check_monami_dummy
     notification_interval     240
     notification_period       24x7
     notification_options      c,r
     check_period              24x7
     contact_groups            monami-admins
     max_check_attempts        3
     normal_check_interval     5
     retry_check_interval      1
}

Although the service is purely passive, a valid check_command setting is still needed. We use the command check_monami_dummy, a simple command that always succeeds (it runs /bin/true, which exits with status 0):

define command {
   command_name    check_monami_dummy
   command_line    /bin/true
}

The final step is to define the individual services that are to be monitored. The following Nagios configuration declares the two filesystem services used in the MonAMI configuration above.

define service {
     use                       monami-service
     host_name                 svr017
     service_description       ROOT_FILESYSTEM
}

define service {
     use                       monami-service
     host_name                 svr017
     service_description       HOME_FILESYSTEM
}

Saying who you are

By default, nagios targets will use the local machine's fully-qualified domain name (FQDN) as the hostname. However, Nagios allows the configuration to specify the shorter hostname. In the above example, the hostname (svr017) is used instead of the longer FQDN (svr017.gla.scotgrid.ac.uk).

The localhost attribute allows you to configure what the MonAMI nagios target specifies as its identity. To correctly identify itself, the target would need localhost = svr017 within its nagios stanza.

Adjust your Nagios configuration to include the new template and checks and restart your Nagios service. Nagios is careful about checking the configuration before starting: if you have a mistake in your configuration file you must correct it before Nagios will start.

If the new services have been correctly configured you will see two new entries in Nagios' “Service Detail” web-page for the host. These will have the PASV symbol indicating that passive updates are accepted for this service, allowing MonAMI to send fresh data.

The two services described above will be in an initial (or Pending) state when Nagios starts. They will remain in that state until MonAMI first sends data. The following figure shows these services.

Part of Nagios' Service Detail page. This shows two passive services for which MonAMI will provide data.

Figure 7. Example of passively monitored services before MonAMI has sent data.

Writing checks

Within MonAMI, each service has one or more checks associated with it. The checks are simple tests; they determine the status of their corresponding service. For example, the filesystem service might have checks for available capacity and available inodes for the different partitions; the Torque service might have checks that the Torque daemon is contactable, that there aren't too many queued jobs, that there aren't jobs stuck in wait state, and so on.

Numerical checks are written as a metric and two numbers separated by commas. The first number is the value at which the check goes into Warning state; the second is the value at which it goes into Critical state. The relative order of these numbers indicates which direction is “getting worse”. If the first number is greater than the second, then lower values are worse: the metric measures a resource that can be exhausted (e.g., available disk space, free memory, time spent idle). If the second number is larger than the first, then higher values are worse: the metric measures resource usage (e.g., number of concurrent processes, number of jobs, number of network connections).

The following service and check attributes monitor available capacity for non-root users.

 service = hfs : HOME_FILESYSTEM
 check = hfs : home-fs.capacity.available, 10, 0.5

Consider a scenario where a user is downloading data, filling the /home partition. If the available capacity (i.e., home-fs.capacity.available) is 20 MiB, then HOME_FILESYSTEM is in state OK. If, when next measured, the available capacity has dropped to 10.01 MiB, then HOME_FILESYSTEM is still OK. Once the measured value has dropped to 10 MiB, HOME_FILESYSTEM will be in Warning state. As the partition fills further, it will stay in Warning state until, finally, the value drops to 0.5 MiB. At this point HOME_FILESYSTEM is in Critical state.
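The scenario above can be traced with a small Python sketch of the threshold rule. This is an illustration of the behaviour described in this section, not MonAMI's own code: the function name check_state is hypothetical, and treating equal thresholds as "higher is worse" is an assumption, since that case is not covered here.

```python
def check_state(value, warning, critical):
    """Return "OK", "Warning" or "Critical" for one numerical check.

    The order of the two thresholds decides which direction is worse.
    """
    if warning > critical:
        # Lower is worse, e.g. available disk space.
        if value <= critical:
            return "Critical"
        if value <= warning:
            return "Warning"
        return "OK"
    # Higher is worse, e.g. number of queued jobs.
    # Assumption: equal thresholds are also handled by this branch.
    if value >= critical:
        return "Critical"
    if value >= warning:
        return "Warning"
    return "OK"


if __name__ == "__main__":
    # Available capacity in MiB as /home fills up, with thresholds 10 and 0.5.
    for available in (20, 10.01, 10, 5, 0.5):
        print(available, check_state(available, 10, 0.5))
```

Each threshold is the first value that triggers its state, so 10.01 MiB is still OK while exactly 10 MiB is already Warning.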

If there are multiple check attributes for a service attribute, the service's state is a combination of the different check states. The rule is simple: the most important state wins. The states, in order of increasing importance, are: OK, Unknown, Warning and Critical. So, if all of a service's checks are OK, then the service is OK. If some checks are Unknown and the rest are OK, then the service is in Unknown state. If some checks are in Warning state but none is in Critical state, then the service is in Warning state; but if at least one check is Critical, then the service is in Critical state.

In the following example:

 service = hfs : HOME_FILESYSTEM
 check = hfs : home-fs.capacity.available,  10, 0.5
 check = hfs : home-fs.files.available,    400, 100

If the available space drops to 10 MiB or the available files (the available inodes) drop to 400, then HOME_FILESYSTEM is in Warning state. If the available files drop to 100 or fewer, then HOME_FILESYSTEM becomes Critical, independent of the available space.
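The "most important state wins" rule can be sketched in Python as follows. The names SEVERITY and service_state are hypothetical, and returning Unknown for a service with no checks is an assumption made for this illustration; it is not taken from MonAMI.

```python
# States in order of increasing importance, as listed in this section.
SEVERITY = ["OK", "Unknown", "Warning", "Critical"]


def service_state(check_states):
    """Combine the states of a service's checks: the most important wins."""
    # Assumption: a service without any checks is reported as Unknown.
    return max(check_states, key=SEVERITY.index, default="Unknown")


if __name__ == "__main__":
    # HOME_FILESYSTEM with plenty of space left but only 90 free inodes:
    # the capacity check is OK, the files check is Critical.
    print(service_state(["OK", "Critical"]))
```

With this ordering, a single Critical check is enough to put the whole service into Critical state, whatever the other checks report.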

Running the example

As before, run MonAMI with the supplied configuration. Make sure you run MonAMI for at least one minute, so that the sample target has triggered at least once. There may be a slight further delay between the nsca daemon writing the command to Nagios' socket and Nagios updating its internal state.

Once MonAMI has written data to Nagios, you will see the services' state change from Pending to one of the four normal states of a service. In the example below, both services are in state OK.

Part of Nagios' Service Detail page. This shows two passive services for which MonAMI has provided data. Both are in OK state.

Figure 8. Example of passively monitored services after MonAMI has sent data.
