tg_threshold - Threshold checking for ToGather
This document describes the threshold checking facility of the ToGather statistics application.
For information about the general syntax of a ToGather configuration file
please consult tg_config(1).
The ToGather package contains a statistics daemon. This daemon periodically collects (gathers) data, e.g. network statistics. Sometimes the collected values should stay within certain limits, i.e. they should not become too big or too small. For example, if the number of dropped packets on a network rises above several packets per second, this could be the consequence of a failing network device. Of course you could notice this by checking the graphs produced by ToGather every minute, but that is inconvenient at best.
ToGather provides the facility to set limits for collected data. If these limits are exceeded, ToGather raises an alert, i.e. it sends a mail to you or a message to your pager, or does anything else you can do with a program.
ToGather can test any gathered value against an upper limit, a lower limit or both. It can raise an alert immediately, wait until a specified number of limit violations in a row have been noticed, or wait until the limit violations have persisted without interruption for a specified time. It can raise an alert for every further limit violation, or it can wait a configurable time between two successive alerts.
Threshold configuration is done inside datasource declarations. The keywords (options) used are:
    threshold
    thresh_action (or synonym threshold_action)
    thresh_okaction (or synonym threshold_okaction)
    threshprogenv
    mailprogram
The threshold option defines the limits for the test and how long a limit violation must persist before an alert is raised. thresh_action and thresh_okaction define what is done when a limit violation appears or disappears. threshprogenv lets you pass information to external programs called as threshold actions.
mailprogram is a global option. It can be used to select a program other than ``/usr/lib/sendmail -t'' to send mail. The specified program is called without additional arguments and receives the mail on STDIN. It has to read the recipient address from the mail header. If the program needs the address as an argument, you can specify it in the mailprogram option:
    mailprogram /usr/lib/sendmail admin@netmon.domain.org;
The ``To:'' header in the mail will contain the address from the thresh_action or thresh_okaction option.
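To illustrate what a replacement mail program has to do with the message it receives, here is a minimal sketch in Python. The function name recipients_from_mail and the sample message are made up for this illustration. A real replacement program would read the message from STDIN instead of using a sample string, and would then hand it to some transport.

```python
from email import message_from_string
from email.utils import getaddresses


def recipients_from_mail(raw_mail: str) -> list[str]:
    """Return the addresses found in the To: header of a raw mail message."""
    message = message_from_string(raw_mail)
    to_headers = message.get_all("To", [])
    # getaddresses() yields (display name, address) pairs; keep the addresses.
    return [address for _name, address in getaddresses(to_headers) if address]


# Hypothetical sample message; a real mail program would use sys.stdin.read().
sample_mail = (
    "To: admin@netmon.domain.org\n"
    "Subject: sample alert\n"
    "\n"
    "sample body\n"
)

for recipient in recipients_from_mail(sample_mail):
    print(recipient)
```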
The threshold option takes one variable name, which defines which value is checked against the limits, and several key/value pairs. An example:
threshold packet_drop over 100 duration 120 quiettime 2h;
``packet_drop'' is the variable to test. ``over 100'' sets an upper bound of 100 for the measured value. ``duration 120'' means that an alert is raised if the measured values stay over 100 for at least 120 seconds without dropping under 100 during that time. ``quiettime 2h'' sets the minimal time between an 'ok alert' and the next failure alert.
All options and their meaning:
    over value      an alert is raised if the measured value is over the given 'value'
    under value     an alert is raised if the measured value is under the given 'value'
    repeat count    an alert is raised if the limit is exceeded 'count' times in a row
    duration sec    an alert is raised if the limit is exceeded for 'sec' seconds continuously
    quiettime sec   the minimal time between an "ok alert" and the following "failure alert" (default is 24 hours)
At least one of 'over' or 'under' must be given. An alert is raised if a limit is exceeded 'repeat' times or for 'duration' seconds (a failure alert is sent only if configured with thresh_action). If the measured value is inside the limits again, an 'ok' alert is raised (if configured with thresh_okaction), provided a corresponding failure alert was raised before.
After an 'ok alert' has been raised, no failure alert is raised for the next 'quiettime' seconds. The reason is that a device which exceeds its limits might be inside and outside the limits at alternating measurement cycles. Without 'quiettime' you could get a message as often as every 'interval' seconds. With 'quiettime', failure alerts are suppressed for 'quiettime' seconds after an 'ok alert' was raised. If the values still exceed the limits at the next measurement after that, a failure alert is raised.
A failure message will tell you if there have been threshold violations in the 'quiettime' period and how often.
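The rules above can be sketched as a small state machine. The following Python model is only one possible reading of this description and not ToGather's actual code. The class name ThresholdModel and all of its fields are invented for the illustration. It assumes that either 'repeat' or 'duration' being satisfied is enough, that an alert is immediate if neither is given, and that only one failure alert is raised per violation period.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThresholdModel:
    """Toy model of the documented alert rules (hypothetical, not ToGather code)."""
    over: Optional[float] = None
    under: Optional[float] = None
    repeat: Optional[int] = None       # violations in a row before alerting
    duration: Optional[float] = None   # seconds of uninterrupted violation
    quiettime: float = 24 * 3600       # documented default: 24 hours
    # Internal state.
    streak: int = 0
    streak_start: Optional[float] = None
    failure_raised: bool = False
    quiet_until: float = float("-inf")
    suppressed: int = 0                # violations swallowed by quiettime

    def __post_init__(self) -> None:
        if self.over is None and self.under is None:
            raise ValueError("at least one of 'over' or 'under' must be given")

    def violates(self, value: float) -> bool:
        too_high = self.over is not None and value > self.over
        too_low = self.under is not None and value < self.under
        return too_high or too_low

    def check(self, now: float, value: float) -> Optional[str]:
        """Process one measurement; return 'failure', 'ok' or None."""
        if not self.violates(value):
            self.streak, self.streak_start = 0, None
            if self.failure_raised:
                # Back inside the limits after a failure alert: 'ok' alert,
                # and the quiet period starts now.
                self.failure_raised = False
                self.quiet_until = now + self.quiettime
                return "ok"
            return None
        if self.streak_start is None:
            self.streak_start = now
        self.streak += 1
        if self.failure_raised:
            return None  # assumption: one failure alert per violation period
        by_count = self.repeat is not None and self.streak >= self.repeat
        by_time = (self.duration is not None
                   and now - self.streak_start >= self.duration)
        immediate = self.repeat is None and self.duration is None
        if not (by_count or by_time or immediate):
            return None
        if now < self.quiet_until:
            self.suppressed += 1  # would have alerted, but quiettime is active
            return None
        self.failure_raised = True
        return "failure"


# 'threshold packet_drop over 100 duration 120 quiettime 2h', measured every 60 s.
model = ThresholdModel(over=100, duration=120, quiettime=2 * 3600)
for second, measured in [(0, 150), (60, 150), (120, 150), (180, 50), (240, 150)]:
    print(second, measured, model.check(second, measured))
```

In this model the 'suppressed' counter stands in for the number of violations that a later failure message reports for the 'quiettime' period.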
'Raising an alert' means to send a mail or to call an external program, depending on the setting of thresh_action or thresh_okaction.
thresh_action sets the action to take when an alert is about to be raised. There are two possible actions: send a mail or call an external, user-supplied program. To send a mail set
thresh_action packet_drop mail address@of.network.staff ;
``mail'' is the keyword to set the action to mailing, the second argument is the mail address the mail should be sent to.
To call an external program use
thresh_action packet_drop prog /the/program with options;
``prog'' is the keyword to select external program call. The rest of the arguments are arguments for the program to call. The program will get four (4) additional arguments:
datasource-name variable-name limit currentvalue
These are the name of the datasource that holds the threshold configuration, the name of the variable that exceeded the limits, the limit that was exceeded (i.e. the value of the upper or lower bound) and the value that exceeded the limit.
(hint: if the measured value is smaller than the given limit a lower bound was violated else an upper bound)
Additionally the program gets the environment variables ``TOGATHER'' (the version of the ToGather daemon) and ``TOGATHER_CFG'' (the name of the config file). If the datasource declaration with this threshold setting comes from an included config file, ``INCLUDE_CFG'' is set to the name of that included file.
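A user-supplied action program might pick up these arguments and environment variables as in the following Python sketch. It assumes that the four additional arguments are appended after the program's own arguments and that limit and value are printed as plain numbers; neither detail is stated above. The names Alert, parse_alert and daemon_context are hypothetical.

```python
import os
from typing import NamedTuple, Optional


class Alert(NamedTuple):
    datasource: str
    variable: str
    limit: float
    value: float


def parse_alert(argv: list[str]) -> tuple[list[str], Alert]:
    """Split argv into the program's own options and the four alert fields."""
    if len(argv) < 5:
        raise ValueError("expected at least four alert arguments")
    # Assumption: the daemon appends the four fields after the own options.
    own_options = argv[1:-4]
    datasource, variable, limit, value = argv[-4:]
    return own_options, Alert(datasource, variable, float(limit), float(value))


def daemon_context(environ) -> dict[str, Optional[str]]:
    """Collect the three documented environment variables (None if unset)."""
    names = ("TOGATHER", "TOGATHER_CFG", "INCLUDE_CFG")
    return {name: environ.get(name) for name in names}


# Sample invocation matching 'thresh_action packet_drop prog /the/program with options'.
sample_argv = ["/the/program", "with", "options", "router1", "packet_drop", "100", "130"]
options, alert = parse_alert(sample_argv)
print(options, alert)
print(daemon_context(os.environ))
```

A real program would pass sys.argv instead of sample_argv. INCLUDE_CFG is reported as None here unless the datasource comes from an included config file.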
thresh_okaction is mostly the same as thresh_action, but this action is done when a variable is back inside its limits after it exceeded them. The ``repeat'', ``duration'' and ``every'' settings are not consulted when doing an ``ok'' action.
The syntax of thresh_okaction is the same as for thresh_action. The only difference is that the ``limit'' argument passed to an external program for an ``ok'' action is the same as the measured value. The program still gets four arguments, so you can use the same program for action and okaction; you can distinguish the two cases by comparing ``limit'' and ``value''.
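A program shared between action and okaction could tell the cases apart as sketched below in Python. The function name classify and the returned labels are invented for this illustration. The sketch takes the two arguments as the raw strings the program receives and assumes they parse as numbers.

```python
def classify(limit_text: str, value_text: str) -> str:
    """Decide which kind of alert a limit/value argument pair describes."""
    limit, value = float(limit_text), float(value_text)
    if limit_text == value_text or limit == value:
        # For an 'ok' action the limit argument equals the measured value.
        return "ok"
    if value < limit:
        # Value below the reported limit: a lower bound was violated.
        return "lower bound violated"
    return "upper bound violated"


for limit_arg, value_arg in [("0.8", "0.8"), ("250", "312"), ("10", "3")]:
    print(limit_arg, value_arg, classify(limit_arg, value_arg))
```

A failure alert with equal limit and value should not occur if ``over'' and ``under'' are strict comparisons, as their descriptions suggest, so the equality test is unambiguous under that reading.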
Normally the action programs don't get any environment variables other than the three mentioned above (not even PATH or USER). If your program needs some environment variables, you can provide them with threshprogenv. The syntax is:
threshprogenv variable var=value ...
The first parameter is again the variable (so that different sets of environment variables can be set for different actions). Use ``*'' to set environment variables for all actions. The remaining parameters are key/value pairs in Bourne shell syntax (sh(1)), i.e. key ``='' value. If you want any spaces in a value, you have to quote the whole parameter. An example:
threshprogenv packet_drop USER=netman "SUBJECT=Network bad!" ;
This will set the two environment variables ``USER'' and ``SUBJECT'' to the values ``netman'' and ``Network bad!'' respectively.
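The quoting rule can be tried out with the following Python sketch. It uses Python's shlex module as a stand-in for Bourne-shell-like word splitting; how ToGather itself parses the parameters is not shown here, so this is only an approximation. The function name parse_env_parameters is hypothetical.

```python
import shlex


def parse_env_parameters(parameters: str) -> dict[str, str]:
    """Turn 'KEY=value "KEY2=value with spaces"' into a dictionary."""
    env = {}
    for word in shlex.split(parameters):
        key, separator, value = word.partition("=")
        if not separator or not key:
            raise ValueError(f"not a key=value pair: {word!r}")
        env[key] = value
    return env


# The parameters from the example above, without the leading variable name.
print(parse_env_parameters('USER=netman "SUBJECT=Network bad!"'))
```

Without the quotes, ``bad!'' would become a separate word that is not a key/value pair, which is why the whole parameter has to be quoted.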
A more or less complete example:
    # default should be ok:
    # mailprogram /usr/lib/sendmail -t
    datasource router1 {
        # log network errors on "router1" interface 1
        host router1;
        mibs IF-MIB;
        object ifInErrors;
        variables inErr;
        port 1;
        interval 60;
        # the processed data is checked
        # i.e. errors/second in this case
        threshold inErr over 250 repeat 5 quiettime 120min;
        # 'repeat 5' with 'interval 60' is the same as 'duration 300'
        thresh_action inErr mail admin@network.example.dom;
        thresh_okaction inErr mail admin@network.example.dom;
    }
A complete example: This one logs the CPU load of a Linux host and alerts the user via mail if the load stays over 0.8 for at least one minute. After the load was over 0.8 and has dropped back under 0.8, another mail will not be sent in the following 2 hours.
    basedir /var/netstat;
    graphdir /var/tmp;
    datasource load {
        dstype * gauge;
        withunits 0;
        type exec;
        interval 20;
        variables load1 load5 load15;
        command uptime;
        match regex;
        regex "(\d+\.\d\d)";
        threshold load1 over 0.8 duration 60 quiettime 2h;
        thresh_action load1 mail root@localhost;
        thresh_okaction load1 mail root@localhost;
    }
    target load {
        title "cpu load on host x";
        legend load1 "load average per minute";
        legend load5 "load average per 5 minutes";
        legend load15 "load average per 15 minutes";
        graphs 2hour daily weekly;
        indexgraph 2hour.s;
    }
    graph 2hour {
        length 2h;
    }
    graph 2hour.s {
        length 2h;
        withlegends 0;
        size 200 100;
    }
If ToGather has to send many mails, this might slow down the daemon. Such an amount of mail will also produce a lot of network traffic and may keep the mail recipient from doing anything about the cause of the alerts, so you might want to limit the number of mails sent with the ``quiettime'' option of threshold.
An independent process is spawned for every external program, so the run time of an action program should not hurt the daemon. If the action programs use too many system resources, this might of course affect the ToGather daemon.
togatherd(1),
tg_config(1).
Rainer Bawidamann, Rainer.Bawidamann@rz.uni-ulm.de University of Ulm, University Computer Centre
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA