3.4. Monitoring Plugins

This section describes the different services that can be monitored (for example, a MySQL database or an Apache webserver). It gives a brief introduction to each service the plugins can monitor and explains how each plugin is configured. Wherever possible, sensible defaults are provided, so common deployment scenarios often need little or no configuration.

The available monitoring plugins depend on which plugins have been built and installed. If you have received this document as part of a binary distribution, it is possible that the distribution does not include all the plugins described here. It might also contain other plugins provided independently from the main MonAMI release.

3.4.1. AMGA

AMGA (ARDA Metadata Catalogue Project) is a metadata server provided by the ARDA/EGEE project as part of their gLite software releases. It provides additional metadata functionality by wrapping an underlying database storage. More information about AMGA is available from the AMGA project page.

The amga monitoring plugin monitors the server's database connection usage and the number of incoming connections. For both, the current value and the configured maximum are reported.

Attributes

host string, optional

the host on which the AMGA server is running. The default value is localhost.

port integer, optional

the port on which the AMGA server listens. The default value is 8822.

3.4.2. Apache

The Apache HTTP (or web) server is perhaps the best-known project from the Apache Software Foundation. Since April 1996, the Netcraft web survey has shown it to be the most popular web server on the Internet. More details can be found at the Apache home page.

The apache plugin monitors the current status of an Apache HTTP server. It can also provide event-based monitoring, based on various log files.

Apache server monitoring is achieved by downloading the server-status page (provided by the mod_status Apache module) and parsing the output. Usually, this option is available within the Apache configuration but is commented out by default (depending on the distribution). The location of the Apache configuration depends on the Apache version and the operating system, but it is usually found in one of the /etc/apache, /etc/apache2 or /etc/httpd directories. To enable the server-status page, uncomment the section or add lines within the Apache configuration that look like:

<Location /server-status>
    SetHandler server-status
    Order deny,allow
    Deny from all
    Allow from .example.com
</Location>

Here .example.com is an illustration of how to limit access to this page. You should change this to either your DNS domain or explicitly to the machine on which you are to run MonAMI.

There is an ExtendedStatus option that configures Apache to include some additional information. This is controlled within the Apache configuration by lines similar to:

<IfModule mod_status.c>
  ExtendedStatus On
</IfModule>

Switching on the extended status should not greatly affect the server's load and provides some additional information. MonAMI can understand this extra information, so it is recommended to switch on this ExtendedStatus option.
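
To give a feel for the data the server-status page exposes, the following Python sketch parses the machine-readable form of the page that mod_status serves when ?auto is appended to the URL. It is an illustration only and is not MonAMI's own parser. The function names parse_status and fetch_status and the sample text are assumptions made for this example; the exact field names vary between Apache versions, and fields such as Total Accesses are only present when ExtendedStatus is on.

```python
from urllib.request import urlopen


def parse_status(text):
    """Turn 'Key: value' lines from server-status?auto into a dict."""
    metrics = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that are not key/value pairs
        metrics[key.strip()] = value.strip()
    return metrics


def fetch_status(host="localhost", port=80):
    """Download and parse the status page (access must be allowed)."""
    url = "http://%s:%d/server-status?auto" % (host, port)
    with urlopen(url, timeout=10) as reply:
        return parse_status(reply.read().decode("ascii", "replace"))


# Hypothetical sample text, not taken from a real server.
SAMPLE = """Total Accesses: 1234
Total kBytes: 5678
BusyWorkers: 3
IdleWorkers: 7
"""

status = parse_status(SAMPLE)
print(status["BusyWorkers"], status["IdleWorkers"])
```

Calling fetch_status() from the machine running MonAMI is also a quick way to check that the Allow rule in the Location section admits that machine.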

Event-based monitoring

Event-based monitoring is made available by watching log files. Any time the Apache server writes to a watched log file, an event is generated. The plugin supports multiple event channels, allowing support for multi-homed servers that log events to different log files.

Event channels are specified by log attributes. The attribute can be repeated to configure multiple event channels. Each log attribute has a value of the form:

name:path[type]

where:

name

is an arbitrary name given to this channel. It cannot have a colon (:) and should not have a dot (.) but most names are valid.

path

is the path to the file. Log rotations (where a log file is archived and a new one created) are supported.

type

is either combined or error.

The following example configures the access channel to read the log file /var/log/apache2/access.log, which is in the Apache standard “combined” format.

[apache]
 log = access: /var/log/apache2/access.log [combined]
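
The name:path[type] format can be made concrete with a short Python sketch that splits a log attribute value into its three parts. This is an illustration of the format as described above, not MonAMI's parsing code; the function name parse_log_value and the regular expression are assumptions made for this example.

```python
import re

# name : path [type]; the name may not contain a colon.
LOG_VALUE = re.compile(r"^\s*([^:\s][^:]*?)\s*:\s*(.+?)\s*\[\s*(\w+)\s*\]\s*$")


def parse_log_value(value):
    """Split a log attribute value into (name, path, type)."""
    match = LOG_VALUE.match(value)
    if match is None:
        raise ValueError("expected name:path[type], got %r" % value)
    name, path, log_type = match.groups()
    if log_type not in ("combined", "error"):
        raise ValueError("unknown log type %r" % log_type)
    return name, path, log_type


print(parse_log_value("access: /var/log/apache2/access.log [combined]"))
```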

Attributes

host string, optional

the hostname of the webserver to monitor. The default value is localhost.

port integer, optional

the port on which the webserver listens. The default value is 80.

log string, zero or more

specifies an event monitoring channel. Each log attribute has a value like: name : path [ type ]

3.4.3. dCache

dCache (see dCache home page) is a system jointly developed by Deutsches Elektronen-Synchrotron (DESY) and Fermilab that aims to provide a mechanism for storing and retrieving huge amounts of data among a large number of heterogeneous server nodes, which can be of varying architectures (x86, ia32, ia64). It provides a single namespace view of all of the files that it manages and allows access to these files using a variety of protocols, including SRM, GridFTP, dCap and xroot. By connecting dCache to a tape storage backend, it becomes a hierarchical storage manager (HSM).

Authentication

The dCache monitoring plugin works by connecting to the underlying PostgreSQL database that dCache uses to store the current system state. To achieve this, MonAMI must have the credentials (a username and password) to log into the database and perform read queries.

If you do not already have a read-only account, you will need to create such an account. It is strongly recommended not to use an account with any write privileges as the password will be stored plain-text within the MonAMI configuration file (see Section 4.2.2, “Passwords being stored insecurely”).

To configure PostgreSQL, SQL commands need to be sent to the database server. To achieve this, you will need to use the psql command, connecting to the dcache database. On many systems you must log in as the database user “postgres”, which often has no password when connecting from the same machine on which the database server is running. A suitable command is:

psql -U postgres -d dcache

The following SQL commands will create an account monami with password monami-secret that has read-only access to the tables that MonAMI will read.

Important

Please ensure you change the example password (monami-secret).

CREATE USER monami;
ALTER USER monami PASSWORD 'monami-secret';

GRANT SELECT ON TABLE copyfilerequests_b TO monami;
GRANT SELECT ON TABLE getfilerequests_b TO monami;
GRANT SELECT ON TABLE putfilerequests_b TO monami;

If you intend to monitor the database remotely, you may need to add an extra entry in PostgreSQL's remote access file: pg_hba.conf. With some distributions, this file is located in the directory /var/lib/pgsql/data.

Currently, the information gathered is limited to the rate of SRM GET, PUT and COPY requests received. This information is gathered from the copyfilerequests_b, getfilerequests_b and putfilerequests_b tables. Future versions of MonAMI may read other tables, which would require additional GRANT statements.

Attributes

host string, optional

the host on which the PostgreSQL database is running. The default is localhost.

ipaddr string, optional

the IP address of the host on which the database is running. This is useful when the host is on multiple IP subnets and a specific one must be used. The default is to look up the IP address from the host.

port integer, optional

the TCP port to use when connecting to the database. The default is port 5432 (the standard PostgreSQL port).

user string, optional

the username to use when connecting to the database. The default is the username of the system account MonAMI is running under. When running as a daemon from a standard RPM-based installation, the default user will be monami.

password string, optional

the password to use when authenticating. The default is to attempt password-less login to the database.

3.4.4. Disk Pool Manager (DPM)

Disk Pool Manager (DPM) is a service that implements the SRM protocol (mainly for remote access) and rfio protocol (for site-local access). It is an easy-to-deploy solution that can support multiple disk servers but has no support for tape/mass-storage systems. More information on DPM can be found at the DPM home page.

Figure 3.1. Data from DPM displayed within Ganglia.

The dpm plugin connects to the MySQL server DPM uses. By querying this database, information is extracted such as the status of the filesystems and the used and available space. The space statistics are available as a summary, and broken down for each group, and for each filesystem. The daemon activity on the head node can also be monitored.

Authentication

This plugin requires read-only privileges for the database DPM uses. The following set of SQL statements creates login credentials with username of monamiuser and password of monamipass suitable for local access:

GRANT SELECT ON cns_db.* TO 'monamiuser'@'localhost'
                IDENTIFIED BY 'monamipass';
GRANT SELECT ON dpm_db.* TO 'monamiuser'@'localhost'
                IDENTIFIED BY 'monamipass';

If MonAMI is to monitor the MySQL database remotely, the following SQL can be used to create login credentials:

GRANT SELECT ON cns_db.* TO 'monamiuser'@'%'
                IDENTIFIED BY 'monamipass';
GRANT SELECT ON dpm_db.* TO 'monamiuser'@'%'
                IDENTIFIED BY 'monamipass';

If both local and remote access to the DPM database is needed, all four of the SQL commands above should be used.

Attributes

host string, optional

the host on which the MySQL server is running. Default is localhost.

user string, required

the username with which to log into the server.

password string, required

the password with which to log into the server.

3.4.5. Filesystem

The filesystem plugin monitors generic (i.e., not filesystem-specific) features of a mounted filesystem. It reports both capacity and “file” statistics. The “file” statistics correspond to inode usage for filesystems that use inodes (such as ext2).

Note

With both reported resources (blocks and files), there are similar-sounding metrics: “free” and “available”. “free” refers to total resource potentially available and “available” refers to the resource available to general (non-root) users.

The difference between the two comes about because it is common to reserve some capacity for the root user. This allows core system services to continue when a partition is full: normal users cannot create files but root (and processes running as root) can.
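
The distinction between “free” and “available” can be seen directly with the POSIX statvfs call, which reports both figures for blocks and for files (inodes). The Python sketch below is an illustration only; whether the filesystem plugin itself uses statvfs is an assumption, and the function name usage is hypothetical.

```python
import os


def usage(location):
    """Return block and file (inode) counts for the filesystem at location."""
    st = os.statvfs(location)
    return {
        "blocks": {"total": st.f_blocks, "free": st.f_bfree,
                   "available": st.f_bavail},
        "files": {"total": st.f_files, "free": st.f_ffree,
                  "available": st.f_favail},
    }


stats = usage("/")
# The gap between the two figures is the capacity held back for root.
reserved = stats["blocks"]["free"] - stats["blocks"]["available"]
print("blocks reserved for root:", reserved)
```

As with the location attribute, any path on the filesystem of interest can be passed to usage.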

Attributes

location string, required

the absolute path to any file on the filesystem.

3.4.6. GridFTP

The Globus Alliance distribute a modified version of the WU-FTP client that has been patched to allow GSI-based authentication and multiple streams. This is often referred to as “GridFTP”.

Various grid components use GridFTP as an underlying transfer mechanism. Often, these have the same log-file format for recording transfers, so parsing this log-file is a common requirement.

The gridftp plugin monitors GridFTP log files, providing an event for each transfer. This is under the transfers channel.

Attributes

filename string, required

the absolute path to the GridFTP log file.

3.4.7. Maui

On their website, Cluster Resources describe Maui as “an advanced batch scheduler with a large feature set well suited for high performance computing (HPC) platforms”. Within a cluster it is used to decide which job (of many that are available) should be run next. Maui provides sophisticated scheduling features such as advanced fair-share definitions and “allocation bank”. More details are available within the Maui homepage.

Access control

The MonAMI maui plugin will need sufficient access rights to query the Maui server. If MonAMI is running on the same machine as the Maui server, (most likely) no additional host will need to be added. If MonAMI is running on a remote machine, then access rights must be granted for that machine: append the remote host's hostname to the space-separated ADMINHOST list.

The plugin will also need to use a valid username. By default it will use the name of the user it is running as (monami), but the plugin can use an alternative username (see the user attribute). To add an additional username, append the username to the space-separated ADMIN3 list.

The following example configuration shows how to configure Maui to allow monitoring from host monami.example.org as user monami.

SERVERHOST              maui-server.example.org
ADMIN1                  root
ADMIN3                  monami
ADMINHOST               maui-server.example.org  monami.example.org
RMCFG[base]             TYPE=PBS
SERVERPORT              40559
SERVERMODE              NORMAL

Password

Maui authenticates messages using a secret shared between the client and server: a password. Currently this password must be an integer. Unfortunately, the password is decided as part of the Maui build process. If one is not explicitly specified, a random number is selected as the password. The password is then embedded within the Maui client programs and used when they communicate with the Maui server. Currently, it is not possible to configure the Maui server to use an alternative password without rebuilding the Maui client and server programs.

To communicate with the Maui server the maui plugin must know the password. Unfortunately, as the password is only stored within the executables, it is difficult to discover. The maui plugin has heuristics that allow it to scan a Maui client program and, in most cases, discover the password. This requires a Maui client program to be present on whichever computer MonAMI is running. If the Maui client is in a non-standard location, its absolute path can be specified with the exec attribute.

If the password is known (for example, its value was specified when compiling Maui) then it can be specified using the password attribute. Specifying the password attribute will stop MonAMI from scanning Maui client programs.

Once the password is known, it can be stored in the MonAMI configuration using the password attribute. This removes the need for a Maui client program. However, should the Maui binaries change (for example, upgrading an installed Maui package), it is likely that the password will also change. This would stop the MonAMI plugin from working until the new password was supplied.

The recommended deployment strategy is to install MonAMI on the Maui server and allow the maui plugin to scan the Maui client programs for the required password.

Time synchronisation

When the maui plugin and the Maui server communicate, both parties want to know that the messages are really from the other party. The shared secret is one part of this process; another is to check the time within the message. This is to prevent a malicious third party from resending messages that have already been sent: a “replay attack”.

To prevent these replay attacks, the clocks on the Maui server and on the machine where MonAMI is running must agree. If both machines are well configured, their clocks will agree to within about 10 milliseconds. Since the network may introduce a slight delay, some tolerance is needed.

The maui plugin requires agreement to within one second by default. This should be easy to satisfy with modern networks. If, for whatever reason, this is not possible, the tolerance can be made more lax by specifying the max_time_delta attribute.

Note

Should there be a systematic error between the clocks on the two servers, effort should be made to synchronise those clocks. Increasing max_time_delta makes MonAMI more vulnerable to replay attacks.
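
The trade-off behind max_time_delta can be sketched in a few lines of Python. The check below accepts a message only if its embedded timestamp lies within the tolerance of the local clock. It illustrates the general technique and is not the Maui protocol itself; the function name is_fresh and the example timestamps are assumptions made for this example.

```python
import time


def is_fresh(message_time, max_time_delta=1, now=None):
    """Accept a message only if its timestamp is close to the local clock."""
    if now is None:
        now = time.time()
    return abs(now - message_time) <= max_time_delta


now = 1000.0
print(is_fresh(999.6, now=now))                    # within one second
print(is_fresh(997.0, now=now))                    # three seconds old
print(is_fresh(997.0, max_time_delta=5, now=now))  # laxer tolerance
```

With the default tolerance the three-second-old message is rejected, but it is accepted once the tolerance is raised to five seconds. A captured message can be replayed for just as long, which is why a larger max_time_delta weakens the protection.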

Attributes

host string, optional

the hostname of the Maui server. If not specified, localhost will be used.

port integer, optional

the TCP port to which the plugin will connect. If not specified, the default value is 40559.

user string, optional

the user name to present to the Maui server when communicating. The default value is the name of the account under which MonAMI is running.

max_time_delta integer, optional

the maximum allowed time difference, in seconds, between the server and client. The default value is one second.

password integer, optional

the shared-secret between this plugin and the Maui server. The default policy is to attempt to discover the password automatically. Specifying the password will prevent attempts at discovering it automatically.

timeout string, optional

the time MonAMI should wait for a reply. The string is in time-interval format (e.g., “5m 10s” is five minutes and ten seconds; “310” would be equivalent). The default behaviour is to wait indefinitely.

exec string, optional

the absolute path to the mclient (or similar) Maui client program. If the plugin was unsuccessful scanning the program given by exec it will also try standard locations.
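
The time-interval format accepted by the timeout attribute can be illustrated with a small Python sketch that converts such a string to seconds. Only the forms shown in the attribute description (“5m 10s” and a bare number of seconds) are handled; whether MonAMI accepts further units is not covered here. The function name parse_interval is an assumption made for this example.

```python
import re

UNIT_SECONDS = {"s": 1, "m": 60}  # only the units shown in the text
TOKEN = re.compile(r"(\d+)\s*([a-z]?)")


def parse_interval(text):
    """Convert a string such as '5m 10s' or '310' to seconds."""
    text = text.strip()
    if not text:
        raise ValueError("empty time interval")
    total = 0
    pos = 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError("bad time interval %r" % text)
        number, unit = match.groups()
        if unit and unit not in UNIT_SECONDS:
            raise ValueError("unknown unit %r" % unit)
        # A number without a unit is taken to be seconds.
        total += int(number) * UNIT_SECONDS.get(unit, 1)
        pos = match.end()
        while pos < len(text) and text[pos].isspace():
            pos += 1
    return total


print(parse_interval("5m 10s"), parse_interval("310"))
```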

3.4.8. MySQL

This plugin monitors the performance of a MySQL database. MySQL is a commonly used Free (GPLed) database. The parent company (MySQL AB) describe it as “the world's most popular open source database”. For more information, please see the MySQL home page.

The statistics monitored are taken from the status variables. They are acquired by executing the SQL statement SHOW STATUS;. The raw variables are described in the MySQL manual, section 5.2.5: Status Variables.

Note

The metric names provided by MySQL are in a flat namespace. These names are not used by MonAMI; instead, the metrics are mapped into a tree structure, allowing easier navigation of, and selection from, the available metrics.

Privileges

To function, this plugin requires an account to access the database. Please note: this database account requires no database access privileges, only that the username and password will allow MonAMI to connect to the MySQL database. For security considerations, you should not employ login credentials used elsewhere (and never root or similar power-user). The following is a suitable SQL statement for creating a username and password of monami and monamipass.

CREATE USER 'monami'@'localhost' IDENTIFIED BY "monamipass";

Sharing login credentials is not recommended. If you decide to share credentials make sure the MonAMI configuration file is readable only by the monami user (see Section 3.2.2, “Dropping root privileges”).

Note

In addition to monitoring a MySQL database, the mysql plugin can also store information MonAMI has gathered within MySQL. This is described in Section 3.5.8, “MySQL”.

Attributes

user string, required

the username with which to log into the server.

password string, required

the password with which to log into the server.

host string, optional

the host on which the MySQL server is running. If no host is specified, the default localhost is used.

3.4.9. null

The null plugin is perhaps the simplest to understand. As a monitoring plugin, it provides an empty datatree when asked for data. The main use for null as a monitoring target is to demonstrate aspects of MonAMI without the distraction of real-life effects from other monitoring plugins.

The null plugin will supply an empty datatree. In conjunction with a reporting plugin (e.g., the snapshot), this can be used to demonstrate the map attribute for adding static content. This attribute is described in Section 3.3.3, “The map attribute”.

Delays

Another use for a null target is to investigate the effect of a service taking a variable length of time to respond with monitoring data. This is emulated by specifying a delay file. If the delayfile attribute is set, then the corresponding file is read. It should contain a single integer number. This number dictates how long (in seconds) a null target should wait when requested for data. The file can be changed at any time and the change will affect the next time the null target is read from. This is particularly useful for demonstrating how MonAMI estimates future delays (see Section 3.3.4, “Estimating future data-gathering delays”) and undertakes adaptive monitoring (see Section 3.6.4, “Adaptive monitoring”).

The following example will demonstrate this usage:

[null]
 delayfile=/tmp/monami-delay

[sample]
 read = null
 write = null
 interval = 1s

Then, by changing the number stored in /tmp/monami-delay, the delay can be adjusted dynamically. To set the delay to three seconds, do:

$ echo 3 > /tmp/monami-delay

To remove the delay, simply set the delay to zero:

$ echo 0 > /tmp/monami-delay

Attributes

delayfile string, optional

the filename of the delay file, the contents of which are parsed as an integer. This number is the number of seconds the null target will delay when replying with an empty datatree.

3.4.10. NUT

Network UPS Tools (NUT) provides a standard method through which an Uninterruptible Power Supply (UPS) can be monitored. Part of this framework allows for signalling, so that machines can undergo a controlled shutdown in the event of a power failure. Further details of NUT are available from the NUT home page.

The MonAMI nut plugin connects to the NUT data aggregator daemon (upsd) and queries the status of all known, attached UPS devices. The ups.conf file must be configured for available hardware and the startup scripts must be configured to start the required UPS-specific monitoring daemons.

By default, localhost will be allowed access to the upsd daemon but access for external hosts must be added explicitly in the upsd.conf file. See the NUT documentation on how best to achieve this.

Attributes

host string, optional

the host on which the NUT upsd daemon is running. The default value is localhost.

port integer, optional

the port on which the NUT upsd daemon listens. The default value is 3493.

3.4.11. Process

The process plugin monitors Unix processes. It can count the number of processes that match search criteria and can give detailed information on a specific process.

The information process gives should not be confused with any process, memory or thread statistics other monitoring plugins provide. Some services report their current thread, process or memory usage, which may duplicate some of the information this plugin reports (see, for example, Section 3.4.2, “Apache” and Section 3.4.8, “MySQL”). However, process reports information from the kernel and should work with any application.

The process plugin has two main types of monitors: counting processes and detailed information about a single process. A single process target can be configured to do any number of either type of monitoring and the results are combined in the resulting datatree.

Counting processes

To count the number of processes, a count attribute must be specified. In its simplest form, the count attribute value is simply the name of the process to count. The following example reports the number of imapd processes that are currently in existence.

[process]
 count = imapd

The format of the count attribute allows for more sophisticated queries of form: reported name : proc name [cond1, cond2, ...]

All of the parts are optional: the part up to and including the colon (reported name :), the part after the colon but before the square brackets (proc name) and the part in square brackets ([cond1, cond2, ...]) can be omitted, but at least one of the first two parts must be specified. The examples below may help clarify this!

To be included in the count, a process' name must match the proc name (if specified). The statistics will be reported as reported name. If no reporting name is specified, then proc name will be used.

The part in square brackets, if present, specifies some additional constraints. The comma-separated list of key, value pairs define additional predicates; for example, [uid=root, state=R] means only processes that are running as root and are in state running will be counted. The valid conditions are:

uid = uid

to be considered, the process must be running with a user ID of uid. The value may be the numerical uid or the username.

gid = gid

the process must be running with a group ID of gid. The value may be the numerical gid or the group name.

state = statelist

the process must have one of the states listed in statelist. Each acceptable process state is represented by a single capital letter and the letters are concatenated together. Valid process state letters are:

R

process is running (or ready to be run),

S

sleeping, awaiting some external event,

D

in uninterruptible sleep (typically waiting for disk IO to complete),

T

stopped (due to being traced),

W

paging,

X

dead,

Z

defunct (or "zombie" state).

The following example illustrates count used to count the number of processes. The different attributes show how the different criteria are represented.

[process]
 count = imapd ❶
 count = io_imapd    : imapd [state=D] ❷
 count = all_java    : java ❸
 count = tomcat_java : java  [uid=tomcat5] ❹
 count = zombies     :       [state=Z] ❺
 count = tcat_z      : java  [uid=tomcat4, state=Z] ❻
 count = run_as_root :       [uid=0] ❼

❶ Count the number of imapd processes.

❷ Count the number of imapd processes that are in “uninterruptible sleep” state: stopped whilst waiting for block I/O (e.g. disk I/O). Store the number as a metric called io_imapd.

❸ Count the number of java processes that are running. Store the number as a metric called all_java.

❹ Count the number of java processes that are running as user tomcat5. Store the number as a metric called tomcat_java.

❺ Count the total number of zombie processes. Store the number as a metric called zombies.

❻ Count the number of zombie java processes running as user tomcat4. Store the number as a metric called tcat_z.

❼ Count the number of processes running as root. Store the number as a metric called run_as_root.
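
How a count value breaks down into its three parts can be sketched in Python. The function below follows the rules given above (every part is optional, at least one of the first two must be present, and the process name doubles as the reported name when no reported name is given). It illustrates the format only and is not MonAMI's parser; the function name parse_count and the regular expression are assumptions made for this example.

```python
import re

# [reported name :] [proc name] [ [cond1, cond2, ...] ]
COUNT_VALUE = re.compile(r"^(?:([^:\[]*):)?([^:\[]*)(?:\[(.*)\])?\s*$")


def parse_count(value):
    """Split a count value into (reported name, proc name, conditions)."""
    match = COUNT_VALUE.match(value)
    if match is None:
        raise ValueError("malformed count value %r" % value)
    reported, proc, conds = match.groups()
    reported = (reported or "").strip() or None
    proc = proc.strip() or None
    if reported is None and proc is None:
        raise ValueError("need a reported name or a process name")
    conditions = {}
    for item in (conds or "").split(","):
        if item.strip():
            key, sep, val = item.partition("=")
            if not sep:
                raise ValueError("bad condition %r" % item)
            conditions[key.strip()] = val.strip()
    # With no reported name, the process name is used for reporting.
    return reported or proc, proc, conditions


print(parse_count("imapd"))
print(parse_count("tcat_z      : java  [uid=tomcat4, state=Z]"))
print(parse_count("zombies     :       [state=Z]"))
```

A proc name of None in the result means that processes of any name are considered, as in the zombies example.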

Detailed information

The watch attribute specifies a process to monitor in detail. The process to watch is identified using the same format as with count statements; however, the expectation is that only a single process will match the criteria.

If there is more than one process matching the search criteria then one is chosen and that process is reported. In principle, the selected process might change from one time to the next, which would lead to confusing results. In practice, the process with the lowest pid is chosen, so it is both likely to be the oldest process and unlikely to change over time. However, this behaviour is not guaranteed.

Much information is gathered with a watch attribute. This information is documented in the stat and status sections of the proc(5) manual page. Some of the more useful entries are copied below:

pid

the process ID of the process being monitored.

ppid

the process ID of the parent process.

state

a single character, with the same semantics as the different process states listed above.

minflt

number of minor memory page faults (no disk swap activity was required).

majflt

number of major memory page faults (those requiring disk swap activity).

utime

number of jiffies[1] of time spent with this process scheduled in user-mode.

stime

number of jiffies[1] of time spent with this process scheduled in kernel-mode.

threads

number of threads in use by this process.

Note

An accurate value is provided by the 2.6-series kernels. Under 2.4-series kernels with LinuxThreads, heuristics are used to derive a value. This value should be correct under most circumstances, but it may be confused if multiple instances of the same multi-threaded process are running concurrently.

vsize

virtual memory size: total memory used by the process.

rss

Resident Set Size: number of pages of physical memory a process is using (less 3 for administrative bookkeeping).
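
The entries above correspond to numbered fields in /proc/<pid>/stat, as listed in proc(5). The Python sketch below extracts them from one line of that file. It is an illustration rather than the plugin's own code: the function name parse_stat and the sample line are assumptions, and the thread count is read from field 20, which holds the number of threads on 2.6-series and later kernels.

```python
# 1-based field numbers from the stat section of proc(5).
FIELDS = {"state": 3, "ppid": 4, "minflt": 10, "majflt": 12, "utime": 14,
          "stime": 15, "threads": 20, "vsize": 23, "rss": 24}


def parse_stat(line):
    """Extract selected fields from one /proc/<pid>/stat line."""
    # The command name is in parentheses and may itself contain spaces
    # or parentheses, so split on the last closing parenthesis.
    head, _, tail = line.rpartition(")")
    pid, _, comm = head.partition("(")
    rest = tail.split()  # rest[0] is field 3 (state)
    info = {"pid": int(pid), "comm": comm}
    for name, number in FIELDS.items():
        value = rest[number - 3]
        info[name] = value if name == "state" else int(value)
    return info


# Hypothetical sample line, not taken from a real system.
SAMPLE = ("4242 (my (odd) name) S 1 4242 4242 0 -1 4194560 150 0 2 0 "
          "30 12 0 0 20 0 5 0 100000 10485760 256 18446744073709551615")

info = parse_stat(SAMPLE)
print(info["pid"], info["state"], info["threads"], info["rss"])
```

On a Linux machine the same function can be applied to the contents of /proc/self/stat to inspect the running Python process.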

Attributes

count string, optional

either the name of the process(es) to count, or the conditions processes must satisfy to be included in the count. This attribute may be repeated for multiple process counting.

count attributes have the form: reported name : proc name [cond1, cond2, ...]

watch string, optional

either the name of the process to obtain detailed information, or the conditions a process must satisfy to be watched. This attribute may be repeated to obtain detailed information about multiple processes.

watch attributes have the form: reported name : proc name [cond1, cond2, ...]

3.4.12. Stocks

The stocks plugin uses one of the web-services provided by XMethods to obtain a near real-time quote (delayed by 20 minutes) for one or more stocks on the United States Stock market. Further details of this service are available from the Stocks service summary page.

In addition to providing financial information, stocks is a pedagogical example that demonstrates the use of SOAP within MonAMI.

Caution

The authors of MonAMI expressly disclaim the accuracy, adequacy, or completeness of any data and shall not be liable for any errors, omissions or other defects in, delays or interruptions in such data, or for any actions taken in reliance thereon.

Please do not send too many requests. A request every couple of minutes should be sufficient.

Attributes

symbols string, required

a comma- (or space-) separated list of ticker symbols to monitor. For example, GOOG is the symbol for Google Inc. and RHT is the symbol for RedHat Inc.

3.4.13. TCP

The tcp monitoring plugin provides information about the number of TCP sockets in a particular state. Here, a socket is either a TCP connection to some machine or the ability to receive a particular connection (i.e., that the local machine is “listening” for incoming connections).

A tcp monitoring target takes an arbitrary number of count attributes. The value of a count attribute describes how to report the number of matching sockets and the criteria for including a socket within that count. These attributes take values like: name [cond1, cond2, ...], where name is the name used to report the number of matching TCP sockets. The conditions (cond1, cond2, etc.) are comma-separated keyword-value pairs (e.g., state=ESTABLISHED). A socket must match all conditions to be included in the count.

The condition keywords may be any of the following:

local_addr

The local IP address to which the socket is bound. This may be useful on multi-homed machines for sockets bound to a single interface.

remote_addr

The remote IP address of the socket, if connected.

local_port

The port on the local machine. This can be the numerical value or a common name for the port, as defined in /etc/services.

remote_port

The port on the remote machine, if connected. This can be the numerical value or a common name for the port.

port

A socket's local or remote port must match. This can be the numerical value or a common name for the port.

state

The current state of the socket. Each local socket will be in one of a number of states and changes state during the lifetime of a connection. All the states listed below are valid and may occur naturally on a working system; however, under normal circumstances some states are transitory: one would not expect a socket to stay in a transitory state for long. A large and/or increasing number of sockets in one of these transitory states might indicate a networking problem somewhere.

The valid states are listed below. For each state, a brief description is given and the possible subsequent states are listed.

LISTEN

A program has indicated it will receive connections from remote sites.

Next: SYN_RECV, SYN_SENT

SYN_SENT

Either a program on the local machine is the client and is attempting to connect to a remote machine, or the local machine sends data from a LISTENing socket (less likely).

Next: ESTABLISHED, SYN_RECV or CLOSED

SYN_RECV

Either a LISTENing socket has received an incoming request to establish a connection, or both the local and remote machines are attempting to connect at the same time (less likely).

Next: ESTABLISHED, FIN_WAIT_1 or CLOSED

ESTABLISHED

Data can be sent to/from local and remote site.

Next: FIN_WAIT_1 or CLOSE_WAIT

FIN_WAIT_1

Start of an active close. The application on local machine has closed the connection. Indication of this has been sent to the remote machine.

Next: FIN_WAIT_2, CLOSING or TIME_WAIT

FIN_WAIT_2

Remote machine has acknowledged that local application has closed the connection.

Next: TIME_WAIT

CLOSING

Both local and remote applications have closed their connections “simultaneously”, but remote machine has not yet acknowledged that the local application has closed the local connection.

Next: TIME_WAIT

TIME_WAIT

The local connection is closed and we know the remote site knows this. We know the remote site's connection is closed, but we don't know whether the remote site knows that we know this. (It is possible that the last ACK packet was lost and, after a timeout, the remote site will retransmit the final FIN packet.)

To prevent the potential loss of the local machine's final ACK packet from accidentally closing a fresh connection, the socket stays in this state for twice the maximum segment lifetime (MSL); depending on the implementation, this is a minute or so.

Next: CLOSED

CLOSE_WAIT

The start of a passive close. The application on the remote machine has closed its end of the connection. The local application has not yet closed this end of the connection.

Next: LAST_ACK

LAST_ACK

Local application has closed its end of the connection. This has been sent to the remote machine but the remote machine has not yet acknowledged this.

Next: CLOSED

CLOSED

The socket is not in use.

Next: LISTEN or SYN_SENT

CONNECTING

A pseudo state that matches the transitory states when starting a connection: either SYN_SENT or SYN_RECV.

DISCONNECTING

A pseudo state that matches the transitory states when shutting down a connection: any of FIN_WAIT_1, FIN_WAIT_2, CLOSING, TIME_WAIT, CLOSE_WAIT or LAST_ACK.

The states ESTABLISHED and LISTEN are long-lived states. It is natural to find sockets that are in these states for extended periods.
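
The "Next:" entries in the state descriptions can be read as a transition table. The following Python sketch transcribes them into a dictionary and checks whether a sequence of states is consistent with that table. It is only an illustration of the descriptions given here, not MonAMI code; the names NEXT_STATES and is_valid_path are made up for this example.

```python
# Transition table transcribed from the "Next:" entries in the state list.
# LAST_ACK is the standard TCP name for the final passive-close state.
NEXT_STATES = {
    "LISTEN": {"SYN_RECV", "SYN_SENT"},
    "SYN_SENT": {"ESTABLISHED", "SYN_RECV", "CLOSED"},
    "SYN_RECV": {"ESTABLISHED", "FIN_WAIT_1", "CLOSED"},
    "ESTABLISHED": {"FIN_WAIT_1", "CLOSE_WAIT"},
    "FIN_WAIT_1": {"FIN_WAIT_2", "CLOSING", "TIME_WAIT"},
    "FIN_WAIT_2": {"TIME_WAIT"},
    "CLOSING": {"TIME_WAIT"},
    "TIME_WAIT": {"CLOSED"},
    "CLOSE_WAIT": {"LAST_ACK"},
    "LAST_ACK": {"CLOSED"},
    "CLOSED": {"LISTEN", "SYN_SENT"},
}


def is_valid_path(states):
    """Return True if every state in the sequence may follow its predecessor."""
    return all(b in NEXT_STATES[a] for a, b in zip(states, states[1:]))


# An active close, as seen by the side that closes first.
active_close = ["ESTABLISHED", "FIN_WAIT_1", "FIN_WAIT_2", "TIME_WAIT", "CLOSED"]
# A passive close, as seen by the other side.
passive_close = ["ESTABLISHED", "CLOSE_WAIT", "LAST_ACK", "CLOSED"]

print(is_valid_path(active_close))                # True
print(is_valid_path(passive_close))               # True
print(is_valid_path(["LISTEN", "ESTABLISHED"]))   # False: a handshake state is skipped
```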

For applications that use “half-closed” connections, the FIN_WAIT_2 and CLOSE_WAIT states are less transitory. As the name suggests, a half-closed connection allows data to flow in one direction only. This is achieved by the application that no longer wishes to send data closing its end of the connection (see FIN_WAIT_1 above), whilst the application wishing to continue sending data does nothing (and so undergoes a passive close). Once the half-closed connection is established, the active-close socket (which can no longer send data) will be in FIN_WAIT_2, whilst the passive-close socket (which can still send data) will be in CLOSE_WAIT.

There are two pseudo states for the normal transitory states: CONNECTING and DISCONNECTING. They are intended to help catch networking or software problems.
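
As a rough illustration of how a pseudo state groups several real states, the Python sketch below expands a state name into the set of states it matches and counts matching sockets in a list of observed states. The function names and the sample data are hypothetical; this is not how the tcp plugin is implemented, only a model of the matching rule described above.

```python
# Real states covered by each pseudo state, as listed in the state descriptions.
PSEUDO_STATES = {
    "CONNECTING": {"SYN_SENT", "SYN_RECV"},
    "DISCONNECTING": {
        "FIN_WAIT_1", "FIN_WAIT_2", "CLOSING",
        "TIME_WAIT", "CLOSE_WAIT", "LAST_ACK",
    },
}


def expand_state(name):
    """Return the set of real TCP states that the given (pseudo) state matches."""
    return PSEUDO_STATES.get(name, {name})


def count_in_state(socket_states, wanted):
    """Count how many of the observed socket states match the wanted state."""
    matching = expand_state(wanted)
    return sum(1 for state in socket_states if state in matching)


# Made-up sample: the states of five sockets at one instant.
observed = ["ESTABLISHED", "TIME_WAIT", "SYN_RECV", "FIN_WAIT_2", "ESTABLISHED"]

print(count_in_state(observed, "ESTABLISHED"))    # 2
print(count_in_state(observed, "CONNECTING"))     # 1
print(count_in_state(observed, "DISCONNECTING"))  # 2
```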

The following example checks whether an application is listening on three well-known port numbers. This might be used to check whether services are running as expected.

[tcp]
  name = listening
  count = ssh          [local_port=ssh, state=LISTEN]
  count = ftp          [port=ftp, state=LISTEN]
  count = mysql        [local_port=mysql, state=LISTEN]
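
The example above refers to ports by service name (ssh, ftp, mysql) rather than by number. The Python sketch below shows one way such a name could be turned into a port number, using the system services database (/etc/services on most Unix-like systems). The helper name resolve_port is hypothetical, and whether MonAMI resolves names in exactly this way is an assumption.

```python
import socket


def resolve_port(value):
    """Return the numeric TCP port for a port number or a service name."""
    if value.isdigit():
        return int(value)
    # Looks the name up in the system services database;
    # raises OSError if the name is not defined there.
    return socket.getservbyname(value, "tcp")


print(resolve_port("80"))  # 80
```

On a machine whose services database defines it, resolve_port("ssh") would typically return 22.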

The following example records the number of connections to a webserver. The established metric records the connections where data may flow in either direction. The other two metrics record connections in the two pseudo states. Normal traffic should not stay long in these pseudo states; connections that persist in these states may be symptomatic of some problem.

[tcp]
  name = incoming_web_con
  count = established   [local_port=80, state=ESTABLISHED]
  count = connecting    [local_port=80, state=CONNECTING]
  count = disconnecting [local_port=80, state=DISCONNECTING]

Attributes

count string, optional

the name to report for this metric followed by square brackets containing a comma-separated list of conditions a socket must satisfy to be included in the count. This option may be repeated for multiple TCP connection counts.

The conditions are keyword-value pairs, separated by =, with the following valid keywords: local_addr, remote_addr, local_port, remote_port, port, state.

The state keyword can have one of the following TCP states: LISTEN, SYN_RECV, SYN_SENT, ESTABLISHED, CLOSED, FIN_WAIT_1, FIN_WAIT_2, CLOSE_WAIT, CLOSING, TIME_WAIT, LAST_ACK; or one of the following two pseudo states: CONNECTING, DISCONNECTING.
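
To make the syntax of the count attribute concrete, the following Python sketch splits a value into the metric name and its conditions, rejecting keywords that are not in the list above. It is a minimal model of the format as described in this section, not the parser MonAMI uses; parse_count and the regular expression are illustrative only.

```python
import re

# Keywords permitted inside the square brackets.
VALID_KEYWORDS = {
    "local_addr", "remote_addr", "local_port", "remote_port", "port", "state",
}

# A metric name, followed by a bracketed, comma-separated list of conditions.
COUNT_RE = re.compile(r"^\s*(?P<name>[^\[\]]+?)\s*\[(?P<conds>[^\]]*)\]\s*$")


def parse_count(value):
    """Split 'name [key=value, ...]' into (name, {key: value})."""
    match = COUNT_RE.match(value)
    if match is None:
        raise ValueError(f"malformed count value: {value!r}")
    conditions = {}
    for item in match.group("conds").split(","):
        if not item.strip():
            continue  # tolerate an empty list or a trailing comma
        key, sep, val = item.partition("=")
        key, val = key.strip(), val.strip()
        if not sep or key not in VALID_KEYWORDS:
            raise ValueError(f"unknown or malformed condition: {item.strip()!r}")
        conditions[key] = val
    return match.group("name"), conditions


print(parse_count("established   [local_port=80, state=ESTABLISHED]"))
# ('established', {'local_port': '80', 'state': 'ESTABLISHED'})
```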

3.4.14. Tomcat

Apache Tomcat is one of the projects from the Apache Software Foundation. It is a Java-based application server (or servlet container) based on Java Servlet and JavaServer Pages technologies. Servlets and JSP are defined under Sun's Java Community Process. More information about Tomcat can be found at the Apache Tomcat home page.

Also developed under the Java Community Process are the Java Management Extensions (JMX). JMX provides a standard method of instrumenting servlets and JSPs, allowing remote monitoring and control of Java applications and servlets.

The tomcat plugin uses the JMX-proxy servlet to monitor (potentially) arbitrary aspects of servlets and JSPs. The JMX-proxy servlet provides structured plain-text output from Tomcat's JMX MBean interface. Applications that require monitoring should register with that interface so that MonAMI can discover their data.

To monitor a custom servlet, the required instrumentation within the servlet/JSP must be written. Currently, there is an additional light-weight conversion needed within MonAMI, adding some extra information about the monitored data. Sample code exists that monitors aspects of the Tomcat server itself.

Any tomcat monitoring target needs a username and password that match a valid account within the Tomcat server that has the manager role. This is normally configured in the file $CATALINA_HOME/conf/tomcat-users.xml. Including the following line within this file creates a new Tomcat user monami, with password monami-secret and the manager role.

<user username="monami" password="monami-secret" roles="manager"/>

This line should be added within the <tomcat-users> element.

Warning

Be aware that Basic authentication sends the username and password unencrypted over the network. These values are at risk if packets can be captured. If you are unsure whether the network can be trusted, run MonAMI on the same machine as Tomcat.
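
To see why the credentials are exposed, the Python sketch below builds the Authorization header used by HTTP Basic authentication and then recovers the username and password from it. The header carries the credentials base64-encoded, which is trivially reversible. The helper name basic_auth_header is made up for this illustration; the credentials are those from the example user above.

```python
import base64


def basic_auth_header(username, password):
    """Return the value of an HTTP Basic 'Authorization' header."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


header = basic_auth_header("monami", "monami-secret")

# Anyone who captures the header can simply reverse the encoding.
recovered = base64.b64decode(header.split(" ", 1)[1]).decode("utf-8")
print(recovered)  # monami:monami-secret
```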

In addition to connecting to Tomcat, you also need to specify which classes of information you wish to monitor. The following are available: ThreadPool and Connector. To monitor some aspect, you must specify the object type along with the identifier for that object within the monitoring definition. For example:

[tomcat]
 name = local-tomcat
 ThreadPool = http-8080
 Connector = 8080

ThreadPool monitors a named thread pool (e.g., http-8080) and reports the following quantities:

minSpareThreads

the minimum number of threads the server will maintain.

currentThreadsBusy

the number of threads that are either actively processing a request or waiting for input.

currentThreadCount

total number of threads within this ThreadPool.

maxSpareThreads

if the number of spare threads exceeds this value, the excess are deleted.

maxThreads

an absolute maximum number of threads.

threadPriority

the priority at which the threads run.

The Connector monitors a ConnectorMBean and is identified by which port it listens on. It monitors the following quantities:

allowTrace

Is the HTTP TRACE method allowed?

clientAuth

Is client (certificate) authentication required?

compression

Is the connection compressed?

disableUploadTimeout

Is the upload timeout disabled?

emptySessionPath

Are the paths of session cookies set to /?

enableLookups

Are DNS lookups of the client's hostname enabled?

tcpNoDelay

Is the TCP_NODELAY socket option set?

useBodyEncodingForURI

Is the encoding of the request body also used for URI query parameters?

secure

Are the connections secure?

acceptCount

number of pending connections this Connector will accept before rejecting incoming connections.

bufferSize

size of the input buffer.

connectionLinger

how long the sockets used by this connector linger when they are closed.

connectionTimeout

the timeout for this connection.

connectionUploadTimeout

the timeout for uploads.

maxHttpHeaderSize

the maximum size for HTTP header.

maxKeepAliveRequests

the maximum number of requests served over a single keep-alive connection before the server closes it.

maxPostSize

maximum size of the information POSTed.

maxSpareThreads

see ThreadPool above.

maxThreads

see ThreadPool above.

minSpareThreads

see ThreadPool above.

threadPriority

see ThreadPool above.

port

the port on which this connector listens.

proxyPort

the proxy port associated with this connector.

redirectPort

the port to which this connector will redirect.

protocol

which protocol the connector uses (e.g., HTTP/1.1)

sslProtocol

the SSL protocol the connector uses (e.g., TLS)

scheme

which scheme the URI will use (e.g., http, https)

Attributes

The tomcat monitoring target accepts the following options:

host string, optional

the hostname of the machine to monitor. The default value is localhost.

port integer, optional

the TCP port on which Tomcat listens. The default value is 8080.

jmxpath string, optional

the path to the JMX-proxy servlet within the application server URI namespace. The default path is /manager/jmxproxy/

username string, optional

the username to use when completing Basic authentication.

password string, optional

the password to use when completing Basic authentication.
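
As a sketch of how the host, port and jmxpath attributes combine, the following Python function assembles the URL of the JMX-proxy servlet using the defaults listed above. The function name jmxproxy_url is hypothetical. The optional qry parameter and the MBean name in the usage example are assumptions about the JMX-proxy servlet's query interface, not something this guide specifies; check the Tomcat documentation for your version.

```python
from urllib.parse import urlencode


def jmxproxy_url(host="localhost", port=8080, jmxpath="/manager/jmxproxy/",
                 query=None):
    """Build the JMX-proxy URL from the tomcat target's attributes."""
    url = f"http://{host}:{port}{jmxpath}"
    if query is not None:
        # Assumption: the servlet accepts an MBean query in a 'qry' parameter.
        url += "?" + urlencode({"qry": query})
    return url


print(jmxproxy_url())
# http://localhost:8080/manager/jmxproxy/

# Hypothetical query selecting the thread pool from the earlier example.
print(jmxproxy_url(query="Catalina:type=ThreadPool,name=http-8080"))
```

A request to such a URL would also need to carry the Basic authentication credentials given by the username and password attributes.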

3.4.15. Torque

The Torque homepage describes Torque as “an open source resource manager providing control over batch jobs and distributed compute nodes”. Torque is based on the original PBS/Open-PBS project, but incorporates many new features. It is now a widely used batch control system.

Torque is heavily influenced by the IEEE 1003.1 specification, in particular Section 3 (Batch Environment Services) of the Shell & Utilities volume. However, it also includes some additional features, such as support for jobs in the suspended state.

Access control

Torque uses username-and-host based authorisation. Users may query the status of their own jobs, but may require special privileges to view the status of all jobs. Because of this, the MonAMI torque plugin may require authorisation to gather monitoring information.

To grant the torque plugin sufficient privileges to conduct its monitoring, the Torque server must either have query_other_jobs set to True (allowing all users to see other users' job information) or have the MonAMI user (typically monami) and host added as one of the operators. Setting either option is sufficient and both can be achieved using the qmgr command.

The command qmgr -ac "list server query_other_jobs" will display the current value of query_other_jobs. To allow all users to see other users' job status, run the command: qmgr -ac "set server query_other_jobs = True".

The command qmgr -ac "list server operators" will display the current list of operators. To add user monami running on host mon-hq.example.org as another operator, use the command qmgr -ac "set server operators += monami@mon-hq.example.org".

Queue groups

It is often useful to group together multiple execution queues when generating statistics. A group may represent queues with a similar purpose, or a set of queues that support a wider community. MonAMI supports this by allowing the definition of queue-groups and will report statistics for each of these groups.

A queue-group is defined by including a group attribute in the torque target. Multiple groups can be defined by repeating the group attributes, one attribute for each group.

A group attribute's value has the form name : queue1, queue2, ..., where name is the name of the queue-group, queue1 is the first queue to be included, queue2 the second, and so on. The group statistics are generated from all jobs that are in any of the listed execution queues.

As an example, the following torque stanza defines four groups: HEP, LHC, Grid OPS, and Local.

[torque]
  group = HEP      : alice, atlas, babar, dzero, lhcb, cms, zeus
  group = LHC      : atlas, lhcb, cms, alice
  group = Grid OPS : dteam, ops
  group = Local    : biomed, carmont, glbio, glee
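
The Python sketch below models the group syntax using the four definitions from the stanza above: it splits each value into a group name and a list of queues, then looks up the groups to which a queue contributes. It illustrates the format only and is not MonAMI's own parser; parse_group and groups_for_queue are hypothetical names.

```python
def parse_group(value):
    """Split 'name : queue1, queue2, ...' into (name, [queues])."""
    name, sep, queues = value.partition(":")
    if not sep:
        raise ValueError(f"missing ':' in group definition: {value!r}")
    return name.strip(), [q.strip() for q in queues.split(",") if q.strip()]


# The group values from the example stanza.
definitions = [
    "HEP      : alice, atlas, babar, dzero, lhcb, cms, zeus",
    "LHC      : atlas, lhcb, cms, alice",
    "Grid OPS : dteam, ops",
    "Local    : biomed, carmont, glbio, glee",
]
groups = dict(parse_group(d) for d in definitions)


def groups_for_queue(queue):
    """Return the names of all queue-groups that include the given queue."""
    return sorted(name for name, queues in groups.items() if queue in queues)


print(groups_for_queue("atlas"))  # ['HEP', 'LHC']
print(groups_for_queue("dteam"))  # ['Grid OPS']
```

Note that a queue such as atlas belongs to more than one group, so its jobs contribute to the statistics of each.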

Attributes

host string, optional

the hostname of the Torque server. If not specified, a default value will be used, which is specified externally to MonAMI. This default may be localhost or may be configured to whatever is the most appropriate Torque server.

group string, optional

defines a new queue-group that statistics are collected against. The group value is like: name : queue1, queue2, .... Each Torque queue may appear in any number (zero or more) of queue-group definitions.

3.4.16. Varnish

The Varnish home page describes Varnish as a “state-of-the-art, high-performance HTTP accelerator”. Varnish is targeted primarily at the FreeBSD 6/7 and Linux 2.6 platforms, and takes full advantage of the virtual memory system and advanced I/O features offered by these operating systems.

Varnish offers a management interface. The MonAMI varnish plugin connects to this interface and requests the server's current set of statistics.

Attributes

host string, optional

the host on which Varnish is running. The default value is localhost.

port integer, optional

the TCP port on which the Varnish management interface is listening. The default value is 6082.



[1] A jiffy is a hard-coded period of time. On most Linux machines, it is 10ms (1/100s). It can be altered to some different value, but it remains constant whilst the kernel is running. In practice, the number of jiffies since the machine booted is held as a counter, which is incremented when the timer interrupt occurs.