Nagios is a sophisticated monitoring system. One of its major strengths is its ability to monitor many services and provide a configurable response to services going into Warning or Critical states. More information is available at the Nagios project home page.
This section will show how to configure MonAMI and Nagios to alert if something is wrong. To get the most out of this section, you will need Nagios configured on your computer or computers. Nagios requires little configuration for a basic setup: the default configuration should be sufficient.
Currently, MonAMI offers only passive monitoring: MonAMI sends updates to Nagios indicating the current state of the monitored services. Passive monitoring requires some extra configuration; Section 7, “Setting up Nagios”, gives an overview of this.
As before, copy the following configuration file as /etc/monami.d/example.conf.
    ##
    ##  MonAMI by Example, Section 7
    ##

    # Our root filesystem
    [filesystem]
     name = root-fs
     location = /
     cache = 2

    # Our /home filesystem
    [filesystem]
     name = home-fs
     location = /home
     cache = 2

    # Once a minute, record / and /home available space.
    [sample]
     read = root-fs, home-fs
     write = nagios
     interval = 1m  ❶

    # Nagios target that sends alert data.
    [nagios]
     host = nagios-svr.example.org  ❷
     port = 5668
     password = NotSecretEnough
     service ❸ = rfs ❹ : ROOT_FILESYSTEM ❺
     check ❻ = rfs ❹ : root-fs.capacity.available ❼, 10 ❽, 0.5 ❾
     check = rfs : root-fs.files.available, 400, 100
     service = hfs : HOME_FILESYSTEM
     check = hfs : home-fs.capacity.available, 10, 0.5
     check = hfs : home-fs.files.available, 400, 100
Some points to note (the circled numbers refer to the markers in the listing and are not part of the configuration file):
❶ Once a minute, this sample target will query the current status of the filesystems and send the new data to the nagios target.
❷ This is the host that runs the nsca daemon.
❸ Each nagios target should have at least one service attribute. A service is what is reported back to Nagios as being in a particular state (OK, Warning, Critical or Unknown).
❹ A short name used within the MonAMI configuration for this service.
❺ The name of the service according to Nagios. By convention, this is in capital letters.
❻ Each service should have at least one check attribute associated with it. The checks determine the current status of a service.
❼ The name of a metric. This is the path within the supplied datatree.
❽ The first value that will result in the service going into Warning state.
❾ The first value that will result in the service going into Critical state.
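To make the layout of the service and check values concrete, here is a minimal Python sketch that splits them into their parts. It is not MonAMI's own parser; the function names `parse_service` and `parse_check` are hypothetical, and the sketch assumes that the short name and metric path contain no colons or commas.

```python
def parse_service(value):
    """Split 'short : NAGIOS_NAME' into (short name, Nagios service name)."""
    short, nagios_name = (part.strip() for part in value.split(":", 1))
    return short, nagios_name


def parse_check(value):
    """Split 'short : metric, warning, critical' into its four parts."""
    short, rest = value.split(":", 1)
    metric, warning, critical = (part.strip() for part in rest.split(","))
    # Thresholds are treated as numbers so they can be compared later.
    return short.strip(), metric, float(warning), float(critical)


print(parse_service("rfs : ROOT_FILESYSTEM"))
print(parse_check("rfs : root-fs.capacity.available, 10, 0.5"))
```

The short name (rfs here) is what ties each check to its service; the Nagios name is only needed when reporting.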
A service is something that Nagios will monitor on a specific server. In general, services are abstract (usually high-level) activities or resources that the server provides. Examples of services include a MySQL DBMS, a Torque resource manager and filesystem space.
Nagios considers each service as being in one of four possible states:
OK
the service is behaving normally.
Warning
normal behaviour continues but there is an early indicator of a problem. If there is any impact on available service, it is slight.
Critical
normal behaviour is no longer possible. If any service is still available it is heavily impacted and complete failure is expected soon.
Unknown
Nagios does not know the status of this service.
In order for a Nagios host to accept status update messages from MonAMI, it must either run the nsca daemon (nscad) or configure an inetd-like daemon (e.g., inetd or xinetd) to accept these connections and run the nscad program indirectly.
The nsca daemon, whether running independently or via an
inetd-like daemon, will receive the update notice and write a
command to the Nagios “external commands” socket.
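As an illustration of what such a command looks like, the following Python sketch builds a line in the format of the standard Nagios PROCESS_SERVICE_CHECK_RESULT external command, using the usual Nagios return codes (0 for OK, 1 for Warning, 2 for Critical, 3 for Unknown). The function name `passive_result_command` and the output text are hypothetical; the sketch only shows the shape of the line and does not reproduce the nsca daemon's own code.

```python
import time

# Standard Nagios return codes for the four service states.
RETURN_CODES = {"OK": 0, "Warning": 1, "Critical": 2, "Unknown": 3}


def passive_result_command(host, service, state, output, now=None):
    """Format a passive service check result as a Nagios external command."""
    timestamp = int(time.time() if now is None else now)
    code = RETURN_CODES[state]
    return (f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{code};{output}\n")


# Hypothetical update for the ROOT_FILESYSTEM service on host svr017.
print(passive_result_command("svr017", "ROOT_FILESYSTEM", "Warning",
                             "available capacity is low"), end="")
```

The host and service fields must match the host_name and service_description used in the Nagios configuration, otherwise Nagios cannot associate the result with a service.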
For this to work, the socket must be created by the Nagios daemon and the nsca daemon must have write access to this socket. The former is controlled by a configuration option (check_external_commands, usually found in nagios.cfg) whilst the latter requires the directory in which the socket is created (as defined in the command_file option and typically /var/log/nagios/rw/) to have the correct permissions.
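A quick way to check both conditions is to inspect the command file as the user that runs the nsca daemon. The Python sketch below assumes the command file is a named pipe (FIFO), as it is in stock Nagios installations; the function name `command_file_problems` and the file name nagios.cmd in the usage line are assumptions, so substitute the value of your own command_file option.

```python
import os
import stat


def command_file_problems(path):
    """Return a list of problems found with the Nagios command file."""
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return [f"{path} does not exist (is check_external_commands enabled?)"]
    except PermissionError:
        return [f"{path} cannot be inspected: check directory permissions"]

    problems = []
    if not stat.S_ISFIFO(mode):
        problems.append(f"{path} exists but is not a named pipe")
    if not os.access(path, os.W_OK):
        problems.append(f"{path} is not writable by the current user")
    return problems


# Assumed location; use the path from the command_file option instead.
for problem in command_file_problems("/var/log/nagios/rw/nagios.cmd"):
    print(problem)
```

An empty result means the file exists, is a pipe and is writable by the user running the script.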
Each MonAMI service must have a corresponding entry in the Nagios configuration. The following creates a generic template for use with MonAMI passive updates.
    define service {
        name                    monami-service
        use                     generic-service
        active_checks_enabled   0
        passive_checks_enabled  1
        register                0
        check_command           check_monami_dummy
        notification_interval   240
        notification_period     24x7
        notification_options    c,r
        check_period            24x7
        contact_groups          monami-admins
        max_check_attempts      3
        normal_check_interval   5
        retry_check_interval    1
    }
Despite the service being purely passive, a valid check_command setting is still needed. We use the command check_monami_dummy, which is a simple command that always returns True:
    define command {
        command_name    check_monami_dummy
        command_line    /bin/true
    }
The final step is to define the individual services that are to be monitored. The following Nagios configuration defines the two filesystem services described above.
    define service {
        use                     monami-service
        host_name               svr017
        service_description     ROOT_FILESYSTEM
    }

    define service {
        use                     monami-service
        host_name               svr017
        service_description     HOME_FILESYSTEM
    }
By default, nagios targets will use the local machine's fully-qualified domain name (FQDN) as the hostname. However, Nagios allows the configuration to specify the shorter hostname. In the above example, the hostname (svr017) is used instead of the longer FQDN (svr017.gla.scotgrid.ac.uk).
The localhost attribute allows you to configure what the MonAMI nagios target specifies as its identity. To correctly identify itself, the target would need localhost = svr017 within its nagios stanza.
Adjust your Nagios configuration to include the new template and checks and restart your Nagios service. Nagios is careful about checking the configuration before starting: if you have a mistake in your configuration file you must correct it before Nagios will start.
If the new services have been correctly configured you will see two new entries in Nagios' “Service Detail” web page for the host. These will have the PASV symbol, indicating that passive updates are accepted for this service, allowing MonAMI to send fresh data.
The two services described above will be in an initial (or Pending) state when Nagios starts. They will remain in that state until MonAMI first sends data. The following figure shows these services.
Within MonAMI, each service has one or more checks associated with it. The checks are simple tests; they determine the status of their corresponding service. For example, the filesystem service might have checks for available capacity and available inodes for the different partitions. The torque service might have checks that the torque daemon is contactable, that there aren't too many queued jobs, that there aren't jobs stuck in wait state, and so on.
Numerical checks are written as a metric and two numbers separated by commas. The first number is when the check should go into Warning state; the second is when it should go into Critical state. The gradient of these numbers indicates which direction is “getting worse”. If the first number is greater than the second then the metric is measuring resource exhaustion (e.g., available disk space, free memory, time spent idle); whereas, if the second number is larger than the first then the metric is measuring resource usage (e.g., number of concurrent processes, number of jobs, number of network connections).
The following service and check attributes monitor available capacity for non-root users.
    service = hfs : HOME_FILESYSTEM
    check = hfs : home-fs.capacity.available, 10, 0.5
Consider a scenario where a user is downloading data, filling the /home partition. If the available capacity (i.e., home-fs.capacity.available) is 20 MiB, then HOME_FILESYSTEM is in state OK. If, when next measured, the available capacity has dropped to 10.01 MiB then HOME_FILESYSTEM is still OK.
Once the measured value has dropped to 10 MiB, HOME_FILESYSTEM will be in Warning state. As the partition fills further, it will stay in Warning state until, finally, the value drops to 0.5 MiB. At this point HOME_FILESYSTEM is in Critical state.
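The threshold rule and the scenario above can be sketched in a few lines of Python. This is an illustration of the described behaviour, not MonAMI's implementation: the function name `check_state` is hypothetical, and treating a value exactly equal to a threshold as having crossed it in the resource-usage direction is an assumption (the text only spells out the boundary for the exhaustion case).

```python
def check_state(value, warning, critical):
    """Classify a metric value against warning and critical thresholds."""
    if warning > critical:
        # Resource exhaustion: smaller values are worse.
        if value <= critical:
            return "Critical"
        if value <= warning:
            return "Warning"
        return "OK"
    # Resource usage: larger values are worse (boundary is an assumption).
    if value >= critical:
        return "Critical"
    if value >= warning:
        return "Warning"
    return "OK"


# The /home scenario: thresholds 10 (Warning) and 0.5 (Critical).
for available in (20, 10.01, 10, 3, 0.5):
    print(available, check_state(available, 10, 0.5))
```

Note that the direction is inferred purely from the order of the two thresholds, so no separate "higher is worse" flag is needed.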
If there are multiple check attributes for a service attribute, the service's state is a combination of the different check states. The rule is simple: the most important state wins. The states, in the order of increasing importance, are: OK, Unknown, Warning and Critical. So, if all a service's checks are OK, then the service is OK. If most of the checks are OK but some are Unknown, then the service is in Unknown state. If there are some checks that are in Warning state but none yet in Critical state, then the service is in Warning state; but, if at least one check is Critical, then the service is in Critical state.
In the following example:
    service = hfs : HOME_FILESYSTEM
    check = hfs : home-fs.capacity.available, 10, 0.5
    check = hfs : home-fs.files.available, 400, 100
If the available space drops to 10 MiB or the available files (the available inodes) drop to 400, then HOME_FILESYSTEM is Warning. If the available files drop to 100 or fewer, then HOME_FILESYSTEM becomes Critical, independent of the available space.
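The "most important state wins" rule amounts to taking a maximum over the check states, using the importance order given above. A minimal Python sketch follows; the names `IMPORTANCE` and `service_state` are hypothetical, and raising an error for a service with no checks is an assumption.

```python
# States in order of increasing importance, as listed in the text.
IMPORTANCE = ["OK", "Unknown", "Warning", "Critical"]


def service_state(check_states):
    """Combine the states of a service's checks into one service state."""
    if not check_states:
        raise ValueError("a service needs at least one check")
    # The state that appears latest in IMPORTANCE wins.
    return max(check_states, key=IMPORTANCE.index)


# HOME_FILESYSTEM with plenty of space but very few free inodes.
print(service_state(["OK", "Critical"]))
# One check has no data yet while the other is fine.
print(service_state(["OK", "Unknown"]))
```

Because Unknown ranks above OK, a single check without a usable value is enough to take the whole service out of the OK state.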
As before, run MonAMI with the supplied configuration. Make sure you run MonAMI for at least one minute. There may be a slight further delay between the NSCA daemon writing the command to Nagios' socket and Nagios updating its internal state.
Once MonAMI has written data to Nagios, you will see the services' state change from Pending to one of the four normal states of a service. In the example below, both services are in state OK.