NAME

tg_threshold - Threshold checking for ToGather


DESCRIPTION

This document describes the threshold checking facility of the ToGather statistics application.

For information about the general syntax of a ToGather configuration file, please consult tg_config(1).


INTRODUCTION

The ToGather package contains a statistics daemon. This daemon periodically collects (gathers) data, e.g. network statistics. Sometimes the collected values should stay within certain limits, i.e. they should be neither too big nor too small. For example, if the number of dropped packets on a network rises above several packets per second, this could be the consequence of a failing network device. Of course you could notice this by looking at the graphs produced by ToGather every minute, but this is inconvenient at best.

ToGather provides the facility to set limits for collected data. If these limits are exceeded, ToGather raises an alert, i.e. it sends a mail to you or a message to your pager, or does anything else you can do with a program.


POSSIBLE THRESHOLD TESTS

ToGather can test any gathered value against an upper limit, a lower limit or both. It can raise an alert immediately, wait until a specified number of limit violations in a row has been noticed, or wait until the limit has been violated continuously for a specified time. It can raise an alert for every further limit violation or wait a configurable time between two successive alerts.


CONFIGURATION

Threshold configuration is done inside datasource declarations. The keywords (options) used are:

 threshold
 thresh_action (or synonym threshold_action)
 thresh_okaction (or synonym threshold_okaction)
 threshprogenv
 mailprogram

The threshold option defines the limits for the test and how long a limit violation must last before an alert is raised. thresh_action and thresh_okaction define what is done when a limit violation appears or disappears. threshprogenv lets you pass additional information to external programs called as threshold actions.

mailprogram is a global option. It selects a program for sending mail other than the default ``/usr/lib/sendmail -t''. The specified program is called without additional arguments and receives the mail on STDIN, so it has to read the recipient address from the mail header. If the program needs the address as an argument, you can specify it in the mailprogram option:

 mailprogram /usr/lib/sendmail admin@netmon.domain.org

The ``To:'' header in the mail will contain the address from the thresh_action or thresh_okaction option.
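
As a rough illustration of what a replacement mail program has to do, the following Python sketch extracts the recipient from the ``To:'' header of a mail. It is not part of ToGather: the function name and the sample mail are made up for this example, and only the Python standard library is used.

```python
from email import message_from_string
from email.utils import getaddresses


def recipients(raw_mail):
    """Return the addresses found in the To: header of a raw mail text."""
    msg = message_from_string(raw_mail)
    return [addr for _name, addr in getaddresses(msg.get_all("To", []))]


# Made-up sample mail; a real mail program would read it from STDIN.
SAMPLE = (
    "To: admin@netmon.domain.org\n"
    "Subject: sample alert\n"
    "\n"
    "body text\n"
)
print(recipients(SAMPLE))  # ['admin@netmon.domain.org']
```

A real replacement would read the mail with sys.stdin.read() instead of using a sample string and then hand it to whatever transport it wraps.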


THRESHOLD

The threshold option takes one variable name, which defines the value that is checked against the limits, and several key/value pairs. An example:

 threshold packet_drop over 100 duration 120 quiettime 2h;

``packet_drop'' is the variable to test. ``over 100'' sets an upper bound of 100 for the measured value. ``duration 120'' means that an alert is raised if the measured values stay over 100 for at least 120 seconds without going under 100 in that time. ``quiettime 2h'' sets the minimal time between an 'ok alert' and the next failure alert.

All options and their meaning:

 over value      an alert is raised if the measured value is
                 over the given 'value'
 under value     an alert is raised if the measured value is
                 under the given 'value'
 repeat count    an alert is raised if the limit is exceeded
                 'count' times in a row
 duration sec    an alert is raised if the limit is exceeded
                 for 'sec' seconds continuously
 quiettime sec   the minimal time between an "ok alert" and the
                 following "failure alert" (default is 24 hours)

At least one of 'over' or 'under' must be given. An alert is raised if a limit is exceeded 'repeat' times in a row or for 'duration' seconds (a failure alert is sent only if one is configured with thresh_action). When the measured value is inside the limits again, an 'ok alert' is raised, provided that one is configured with thresh_okaction and a corresponding failure alert was raised before.

After an 'ok alert' has been raised, no failure alert is raised for the next 'quiettime' seconds. The reason is that a device that exceeds its limits might flip between inside and outside the limits at every other measurement cycle. Without 'quiettime' you would get a message up to every 'interval' seconds. With 'quiettime', failure alerts are suppressed for 'quiettime' seconds after an 'ok alert' was raised. If the values still exceed the limits at the next measurement after that, a failure alert is raised.

A failure message tells you whether there were threshold violations during the 'quiettime' period and how many.

'Raising an alert' means sending a mail or calling an external program, depending on the setting of thresh_action or thresh_okaction.
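
The interplay of 'repeat', 'duration' and 'quiettime' can be explored with a small toy model. The Python sketch below is not the daemon's code; the class and all of its names are invented for illustration. It assumes that 'duration' is counted in whole measurement intervals (so that 'repeat 5' with 'interval 60' behaves like 'duration 300') and that an alert is raised at once if neither 'repeat' nor 'duration' is given.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Threshold:
    """Toy model of one threshold setting (hypothetical, for illustration)."""
    interval: int                      # seconds between two measurements
    over: Optional[float] = None
    under: Optional[float] = None
    repeat: Optional[int] = None
    duration: Optional[int] = None     # seconds
    quiettime: int = 24 * 3600         # documented default: 24 hours
    run: int = 0                       # violations in a row so far
    alerted: bool = False              # a failure alert is outstanding
    last_ok: Optional[int] = None      # time of the last 'ok alert'
    events: List[Tuple[int, str]] = field(default_factory=list)

    def violated(self, value: float) -> bool:
        too_high = self.over is not None and value > self.over
        too_low = self.under is not None and value < self.under
        return too_high or too_low

    def feed(self, now: int, value: float) -> None:
        """Process one measurement taken at time 'now' (in seconds)."""
        if not self.violated(value):
            self.run = 0
            if self.alerted:
                # Back inside the limits after a failure alert.
                self.alerted = False
                self.last_ok = now
                self.events.append((now, "ok"))
            return
        self.run += 1
        if self.alerted:
            return
        long_enough = True
        if self.repeat is not None or self.duration is not None:
            # Assumption: duration is measured in whole intervals.
            long_enough = (
                (self.repeat is not None and self.run >= self.repeat)
                or (self.duration is not None
                    and self.run * self.interval >= self.duration)
            )
        quiet = (self.last_ok is not None
                 and now - self.last_ok < self.quiettime)
        if long_enough and not quiet:
            self.alerted = True
            self.events.append((now, "failure"))


# 'threshold load1 over 0.8 duration 60 quiettime 2h' with 'interval 20'
load1 = Threshold(interval=20, over=0.8, duration=60, quiettime=2 * 3600)
for step, value in enumerate([0.9, 0.9, 0.9, 0.5, 0.9, 0.9, 0.9]):
    load1.feed(step * 20, value)
print(load1.events)  # [(40, 'failure'), (60, 'ok')]
```

In this run the second burst of high values raises no failure alert because it falls inside the two hours of 'quiettime' that follow the 'ok alert'.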


THRESH_ACTION

This sets the action to perform when an alert is about to be raised. There are two possible actions: send a mail or call an external, user-supplied program. To send a mail, set

 thresh_action packet_drop mail address@of.network.staff ;

``mail'' is the keyword that selects mailing; the argument after it is the address the mail is sent to.

To call an external program use

 thresh_action packet_drop prog /the/program with options

``prog'' is the keyword that selects an external program call. The rest of the line is the program to call and its arguments. The program will get four (4) additional arguments:

 datasource-name variable-name limit currentvalue

That is, the name of the datasource that holds the threshold configuration, the name of the variable that exceeded the limits, the limit that was exceeded (i.e. the value of the upper or lower bound) and the value that exceeded the limit.

(Hint: if the measured value is smaller than the given limit, a lower bound was violated, otherwise an upper bound.)

Additionally, the program gets the environment variables ``TOGATHER'' (the version of the ToGather daemon) and ``TOGATHER_CFG'' (the name of the config file). If the datasource declaration with this threshold setting comes from an included config file, ``INCLUDE_CFG'' is set to the name of that included file.
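
An action program might pick its input apart along the following lines. This Python sketch is only an illustration: the function name, the sample argument values, the version string and the config path are invented, and it assumes that the daemon appends its four values after the options configured in thresh_action.

```python
def parse_alert(argv, environ):
    """Split an action program's arguments and environment.

    argv    -- the arguments without the program name
    environ -- a mapping of environment variables, e.g. os.environ
    """
    if len(argv) < 4:
        raise ValueError("expected at least four arguments")
    # Assumption: the four daemon-supplied values come last.
    *options, datasource, variable, limit, value = argv
    return {
        "options": options,
        "datasource": datasource,
        "variable": variable,
        "limit": limit,      # kept as strings, exactly as received
        "value": value,
        "version": environ.get("TOGATHER"),
        "config": environ.get("TOGATHER_CFG"),
        # Only set if the datasource comes from an included config file.
        "include": environ.get("INCLUDE_CFG"),
    }


# Made-up sample input for 'thresh_action inErr prog /the/program with options'
alert = parse_alert(
    ["with", "options", "router1", "inErr", "250", "312.5"],
    {"TOGATHER": "x.y", "TOGATHER_CFG": "/path/to/togather.conf"},
)
print(alert["datasource"], alert["variable"], alert["limit"], alert["value"])
```

In a real program the two arguments would be sys.argv[1:] and os.environ.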


THRESH_OKACTION

This is mostly the same as thresh_action, but the action is performed when a variable is back within its limits after having exceeded them. The ``repeat'', ``duration'' and ``every'' settings are not consulted when doing an ``ok'' action.

The syntax of thresh_okaction is the same as for thresh_action. The only difference is that, for an ``ok'' action, the ``limit'' argument passed to an external program is the same as the measured value. The program still gets four arguments, so you can use the same program for action and okaction and distinguish the two cases by comparing ``limit'' and ``value''.
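
A program shared between thresh_action and thresh_okaction could tell the cases apart as in this Python sketch. The function name is made up, and the sketch assumes that both arguments arrive as plain decimal numbers that float() can parse.

```python
def classify(limit, value):
    """Return 'ok', 'under' or 'over' for the last two program arguments."""
    limit, value = float(limit), float(value)
    if limit == value:
        # For an ``ok'' action the limit argument equals the measured value.
        return "ok"
    # A value below the limit means a lower bound was violated,
    # otherwise an upper bound.
    return "under" if value < limit else "over"


print(classify("250", "312.5"))   # over
print(classify("10", "3"))        # under
print(classify("0.42", "0.42"))   # ok
```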


THRESHPROGENV

Normally the action programs don't get any environment variables other than the three mentioned above (not even PATH or USER). If your program needs some environment variables, you can provide them with threshprogenv. The syntax is:

 threshprogenv variable var=value ...

The first parameter is again the variable, so that different sets of environment variables can be set for different actions. Use ``*'' to set environment variables for all actions. The remaining parameters are key/value pairs in Bourne shell syntax (sh(1)), i.e. key ``='' value. If you want spaces in a value, you have to quote the whole parameter. An example:

 threshprogenv packet_drop USER=netman "SUBJECT=Network bad!" ;

This sets the two environment variables ``USER'' and ``SUBJECT'' to the values ``netman'' and ``Network bad!'', respectively.
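
On the receiving side, an action program should not rely on variables that were not passed explicitly. The following Python sketch shows one defensive way to read them; the function name and the fallback texts are invented for this example.

```python
def alert_subject(environ, variable):
    """Build a subject line from the variables given with threshprogenv."""
    # Fall back to fixed texts, since the variables may not be set.
    subject = environ.get("SUBJECT", "ToGather alert")
    user = environ.get("USER", "unknown")
    return f"{subject} ({variable}, user {user})"


# Environment as set by the threshprogenv example above
print(alert_subject({"USER": "netman", "SUBJECT": "Network bad!"}, "packet_drop"))
# Environment without any threshprogenv setting
print(alert_subject({}, "packet_drop"))
```

A real program would pass os.environ as the first argument. Because PATH is not set either, it is safest to call helper commands by absolute path.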


EXAMPLE

A more or less complete example:

 # default should be ok:
 # mailprogram /usr/lib/sendmail -t
 datasource router1 {
    # log network errors on "router1" interface 1
    host router1;
    mibs IF-MIB;
    object ifInErrors;
    variables inErr;
    port 1;
    interval 60;
    # the processed data is checked
    # i.e. errors/second in this case
    threshold inErr over 250 repeat 5 quiettime 120min;
    # 'repeat 5' with 'interval 60' is the same as 'duration 300'
    thresh_action inErr mail admin@network.example.dom;
    thresh_okaction inErr mail admin@network.example.dom;
 }

A complete example: this one logs the CPU load of a Linux host and alerts the user via mail if the load stays over 0.8 for at least one minute. After the load was over 0.8 and is back under 0.8 again, another mail will not be sent for the following 2 hours.

 basedir /var/netstat;
 graphdir /var/tmp;
 datasource load {
   dstype * gauge;
   withunits 0;
   type exec;
   interval 20;
   variables load1 load5 load15;
   command uptime;
   match regex;
   regex "(\d+\.\d\d)";
   
   threshold load1 over 0.8 duration 60 quiettime 2h;
   thresh_action load1 mail root@localhost;
   thresh_okaction load1 mail root@localhost;
 }
 target load {
   title "cpu load on host x";
   legend load1 "load average per minute";
   legend load5 "load average per 5 minutes";
   legend load15 "load average per 15 minutes";
   graphs 2hour daily weekly;
   indexgraph 2hour.s;
 }
 graph 2hour { length 2h; }
 graph 2hour.s {
   length 2h;
   withlegends 0;
   size 200 100;
 }


PERFORMANCE

If ToGather has to send many mails, this might slow down the daemon. Such an amount of mail also produces a lot of network traffic and may keep the mail recipient from doing anything about the cause of the alerts, so you might want to limit the number of mails sent with the ``quiettime'' option of threshold.

For every external program an independent process is spawned, so the run time of an action program shouldn't hurt the daemon. Of course, action programs that use too many system resources might still affect the ToGather daemon.


SEE ALSO

togatherd(1), tg_config(1).


AUTHOR

 Rainer Bawidamann, Rainer.Bawidamann@rz.uni-ulm.de
 University of Ulm, University Computer Centre


COPYRIGHT

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.