Processing the input - the big picture
This section explains how a log message is transformed into
an incident, what the delay and pending queues are for,
how logpecker suppresses duplicate problem reports, and
how logpecker operates in detail.
First Stage: Message Parsing and Matching
As soon as logpecker has finished its initialization, it starts watching
its input files. When a new line arrives, it is parsed and matched
against the rules you have defined. Based on the rules, a so-called
incident is created from the input line. If several rules "fire",
you get several incidents.
An incident has certain properties:
- severity: How important this incident is. Can be one of
ignore, info, notice, warn, error, crit, or the special "ok"
(which indicates a resolved problem). Each rule you define
includes the severity.
- name: a symbolic name like "NFS.server.servername.unreachable".
If several incidents have the same name, logpecker considers
them to refer to the same problem, so it is important to
include all relevant information in the name. The name is
assigned by matching the input line to your rule definition.
- time, host, syslog facility and priority, message-string: these
are parsed from the input line. You will see this information
in the reports.
If there is no rule that matches an input line, logpecker creates a
special "unknown" incident type.
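The matching step can be pictured with a small Python sketch. It is only
an illustration under stated assumptions, not logpecker's rule syntax or
its implementation: the RULES table, the Incident class, the match_line
function and the severity given to the "unknown" incident are all
hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Incident:
    severity: str
    name: str
    host: str
    message: str


# Hypothetical rules: (pattern, severity, name template).
# Named groups from the pattern are substituted into the name, so that
# all relevant information ends up in the incident name.
RULES = [
    (re.compile(r"NFS server (?P<server>\S+) not responding"),
     "error", "NFS.server.{server}.unreachable"),
    (re.compile(r"NFS server (?P<server>\S+) OK"),
     "ok", "NFS.server.{server}.unreachable"),
]


def match_line(host, message):
    """Return one incident per rule that fires, or a single 'unknown' incident."""
    incidents = []
    for pattern, severity, template in RULES:
        m = pattern.search(message)
        if m:
            name = template.format(**m.groupdict())
            incidents.append(Incident(severity, name, host, message))
    if not incidents:
        # Assumption: the severity of the "unknown" incident is not stated
        # in this section; "notice" is only a placeholder.
        incidents.append(Incident("notice", "unknown", host, message))
    return incidents
```

Note that the "problem" rule and the "problem resolved" rule produce the
same name, which is what lets the later stages pair them up.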
Now, when an input line has been parsed and transformed into
one or several incidents, the following can happen to them:
- If it has the severity "ignore", it is silently ignored. That's
probably what you have expected.
- If another incident with the same name already exists, the
further processing depends on the stage this incident has reached.
This is explained below.
- If the incident has the severity "ok" (i.e. problem resolved) and there
is no other incident with the same name, it is also ignored.
- Otherwise, the incident is put into the "delay queue" for a certain,
configurable period (20 seconds by default).
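The order of these checks can be written down as a short Python sketch.
The function name, the returned labels and the known_names set are
hypothetical and only serve to illustrate the decision described above.

```python
def first_stage_action(severity, name, known_names):
    """Decide what happens to a freshly created incident.

    known_names is the set of incident names that are currently held in
    the delay queue or the pending queue (a hypothetical helper structure).
    """
    if severity == "ignore":
        return "drop"                 # silently ignored
    if name in known_names:
        return "handle-in-queue"      # depends on the stage of the existing incident
    if severity == "ok":
        return "drop"                 # nothing to resolve
    return "delay"                    # goes into the delay queue
```

For example, first_stage_action("ok", "NFS.server.fs01.unreachable", set())
returns "drop", because there is no earlier incident that the "ok" could
resolve.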
Second Stage: Delay Queue
Now we have created an incident and put it in the delay queue.
The delay queue holds incidents for a short period to catch message storms
and to see whether the problem is cancelled by a "problem resolved" incident
that follows directly.
During the delay period, the following can happen:
- Another incident with the same name arrives.
To deal with message storms, where the same message is repeated over and
over for a short period, all the repeats are silently dropped.
- A "problem resolved" incident with the same name arrives.
To accommodate messages like the notorious "NFS server does not respond" /
"NFS server OK", you can define special rules that create "problem resolved"
incidents. If such an incident arrives, all other incidents with the same
name are silently removed from the delay queue without any fuss.
- The incident times out.
If the incident is still in the queue when the delay period ends (20 seconds
by default), it is finally reported as "initial occurrence" through your
configured reports and moved to the "pending queue". For details on the
reports, please see the separate report reference section.
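The three cases above can be sketched in Python as follows. This is a
minimal model under assumptions, not logpecker's code: the DelayQueue
class, its method names and the returned labels are hypothetical, and
time is passed in as a plain number of seconds to keep the example
self-contained.

```python
DELAY_SECONDS = 20  # default delay period mentioned in the text


class DelayQueue:
    def __init__(self, delay=DELAY_SECONDS):
        self.delay = delay
        self.entries = {}  # incident name -> (arrival time, severity)

    def add(self, name, severity, now):
        """Handle an incident arriving at time `now` (in seconds)."""
        if severity == "ok":
            # "problem resolved": remove the waiting incident, report nothing
            removed = self.entries.pop(name, None)
            return "cancelled" if removed else "ignored"
        if name in self.entries:
            return "dropped"  # repeat during a message storm
        self.entries[name] = (now, severity)
        return "queued"

    def expire(self, now):
        """Return the names whose delay period is over.

        The caller would report these as initial occurrences and move
        them to the pending queue.
        """
        due = [n for n, (t, _) in self.entries.items() if now - t >= self.delay]
        for n in due:
            del self.entries[n]
        return due
```

With this model, an incident queued at second 0 and repeated at second 5
is reported once, when expire is called at second 20 or later.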
Third Stage: Pending Queue
After the incident has been reported, it is held in the pending queue
for typically 6 hours. This queue holds all "active" incidents and makes
it possible to identify recurring problems.
During this time, the following can happen to it:
- Another incident with the same name arrives: This incident is then reported as
"follow-up occurrence". For details, please take a look at the report reference section.
The timeout (6 hours) is restarted for this incident.
- A "problem resolved" incident with the same name arrives: This is reported as
"problem solved" (see report reference), and the incident is removed from the
pending queue.
- Incidents are defined in groups. If the number of incidents in a
group that have the same severity exceeds a configurable constant
(default is 30), a "group overflow" is reported (see report reference)
and further incidents of this group with the same or lower severity are
ignored for a certain period. This mechanism is a kind of
self-protection against masses of messages and keeps the memory usage
low under all circumstances. (Well, I have cheated in the previous section:
the same mechanism exists in the delay queue, too.)
- Eventually, if nothing removes the incident, the pending period times out. The incident
is then removed from the pending queue.
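The pending queue can be modelled in the same way. Again, this is a
simplified sketch under assumptions: the PendingQueue class, its method
names and the group argument are hypothetical, and the period during
which a group stays muted after an overflow is left out.

```python
PENDING_SECONDS = 6 * 60 * 60  # typical pending period mentioned in the text
GROUP_LIMIT = 30               # default group limit mentioned in the text


class PendingQueue:
    def __init__(self, timeout=PENDING_SECONDS, group_limit=GROUP_LIMIT):
        self.timeout = timeout
        self.group_limit = group_limit
        self.entries = {}  # incident name -> dict(group, severity, deadline)

    def add_reported(self, name, group, severity, now):
        """Take over an incident that has just been reported."""
        same = sum(1 for e in self.entries.values()
                   if e["group"] == group and e["severity"] == severity)
        if same >= self.group_limit:
            return "group overflow"  # adding one more would exceed the limit
        self.entries[name] = {"group": group, "severity": severity,
                              "deadline": now + self.timeout}
        return "pending"

    def arrive(self, name, severity, now):
        """Handle a new incident whose name may already be pending."""
        entry = self.entries.get(name)
        if entry is None:
            return "not pending"
        if severity == "ok":
            del self.entries[name]
            return "problem solved"
        entry["deadline"] = now + self.timeout  # restart the timeout
        return "follow-up occurrence"

    def expire(self, now):
        """Remove and return the names whose pending period has run out."""
        gone = [n for n, e in self.entries.items() if now >= e["deadline"]]
        for n in gone:
            del self.entries[n]
        return gone
```

The design point worth noting is that a follow-up pushes the deadline
forward, so a problem that keeps recurring stays "active" and is never
reported as a new initial occurrence.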
Further reading