In IBM Tivoli Monitoring 6.1, a situation is defined as a set of conditions that are measured according to criteria and evaluated to be true or false. The ability to use situations is one of the most powerful features of IBM Tivoli Monitoring Express 6.1. This Technote discusses some best practices for creating situations.
Data for situations and events is collected at regular intervals. However, situations often do not need to be active on a 24 x 7 basis. For example, many alerts may only be required during normal business hours. The first way to control resource usage by situations is to start and stop them so that they run only when they are required. You can accomplish this by creating policies that start and stop situations at the right time. Alternatively, you can use external automation or scheduling software that starts and stops situations using Web services.
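As a rough illustration of the scheduling decision, the following Python sketch decides whether a business-hours situation should be started or stopped at a given moment. The time window, the weekday set, and the function names are hypothetical and chosen only for this example. The sketch does not call any Tivoli interface; an external scheduler would still have to issue the actual start or stop request.

```python
from datetime import datetime, time

# Hypothetical business-hours window; adjust to the site's requirements.
BUSINESS_START = time(8, 0)
BUSINESS_END = time(18, 0)
BUSINESS_DAYS = {0, 1, 2, 3, 4}  # Monday=0 ... Friday=4


def should_be_active(now: datetime) -> bool:
    """Return True if a business-hours situation should be running at `now`."""
    in_days = now.weekday() in BUSINESS_DAYS
    in_hours = BUSINESS_START <= now.time() < BUSINESS_END
    return in_days and in_hours


def desired_action(now: datetime, currently_running: bool) -> str:
    """Decide what the scheduler should do: 'start', 'stop', or 'none'."""
    wanted = should_be_active(now)
    if wanted and not currently_running:
        return "start"
    if not wanted and currently_running:
        return "stop"
    return "none"
```

A scheduler could call `desired_action` periodically and issue a request only when the result is not `"none"`, which avoids sending redundant start or stop commands.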
One of the most critical success factors of the project is to discuss with the end-user departments which situations to build and how to build them. If critical components are not monitored by a situation, or are monitored with the wrong thresholds, you risk problems arising without being noticed.
If the situation intervals are set too short, the overhead will be too high, and the overall implementation may even become unreliable when the components can no longer handle the workload. For example, if the TEMS is evaluating more situations or receiving more alerts than it can handle, it starts to queue them. This generates even more overhead and delays critical alerts.
If the situation interval is too long, problems may be detected too late. This is why an in-depth plan and review with the end-user department is important. The department also needs to delegate a senior member to assist with this project. A junior representative may not be sufficiently aware of critical performance factors and may lack the authority to defend the outcome of the discussions within the department.
When planning and reviewing with end-user departments, discuss the following items:
- Critical performance factors for the application or system:
These must be translated into data attributes to be monitored by using them in situations.
- What are the best values to check these attributes against? Should multiple situations be created to watch several levels of severity?
- If you do need several levels of severity for the same data, keep the sampling interval the same:
The situations can then be grouped together, and the data is collected only once (see the discussion of grouping situations below).
- Select realistic alert values:
For example, if a situation triggers and resets frequently, the TEP user in Operations is overtasked and may become less responsive over time. Moreover, this causes unnecessary overhead on the TEMS: processing is required not only to handle the alerts, but also to store the data that leads to the alert.
- At which intervals should these factors be checked? Also check whether the data that is required for the situation is collected in the background:
Some TEMA data (mainly Plex mainframe data) is not collected on demand, but rather at fixed intervals. These data collectors run as UADVISOR probes in the TEMA/TEMS. A situation that uses this data should have an interval at least as long as the UADVISOR interval; otherwise, the same UADVISOR-collected data is used two or more times for situation evaluation.
- Which systems should be monitored? Systems should be grouped into user-managed system lists (MSLs). Situations are then distributed to these MSLs. When a system must later be added for the same kind of alerting, the only change required is to the MSL.
- When the situation triggers, what advice can be given to the operator? This information is put into the situation advice and is presented to the operator when the alert is raised and advice is selected. This way the operator is assisted in taking the right actions, that is, actions that are consistent with the company’s policies.
- Is any automated action required? And if so, what? This results in either a simple command to be executed on the system (reflex automation in the situation) or in a more complex set of automation scripts (that are added into a policy).
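Two of the review items above lend themselves to a mechanical check: situations on the same data should share one sampling interval, and a situation interval should not be shorter than the background (UADVISOR) collection interval. The following Python sketch is one possible way to check a planned set of situations on paper. The class, field names, and sample names are hypothetical, and using the attribute group as a stand-in for "the same data" is an assumption of this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class PlannedSituation:
    name: str
    attribute_group: str
    interval_seconds: int
    # Background collection interval for the data, if any (hypothetical field).
    uadvisor_interval_seconds: Optional[int] = None


def review_plan(situations):
    """Return a list of human-readable warnings for a planned situation set."""
    warnings = []

    # Situations on the same data should share one sampling interval.
    intervals_by_group = defaultdict(set)
    for s in situations:
        intervals_by_group[s.attribute_group].add(s.interval_seconds)
    for group, intervals in sorted(intervals_by_group.items()):
        if len(intervals) > 1:
            warnings.append(f"{group}: mixed intervals {sorted(intervals)}")

    # The situation interval should be at least the background interval.
    for s in situations:
        u = s.uadvisor_interval_seconds
        if u is not None and s.interval_seconds < u:
            warnings.append(
                f"{s.name}: interval {s.interval_seconds}s is shorter than "
                f"UADVISOR interval {u}s"
            )
    return warnings


# Hypothetical plan with two severity levels and one background-collected item.
plan = [
    PlannedSituation("Example_Warning", "EXAMPLE_GROUP", 300),
    PlannedSituation("Example_Critical", "EXAMPLE_GROUP", 60),
    PlannedSituation("Example_Plex", "EXAMPLE_PLEX", 60, 300),
]
plan_warnings = review_plan(plan)
```

For this sample plan, `plan_warnings` contains one warning about the mixed intervals of the two severity levels and one about the situation that samples faster than its background collector.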
Grouping situations can potentially save a lot of resources, but unfortunately it cannot be set manually. The TEMS decides whether to group situations. The following conditions must be met before a situation can be part of a group:
- All situations in the group must use elements from the same attribute group.
- All situations in the group must use the same interval setting.
- The situation must have autostart set to YES.
- The situation cannot contain an UNTIL clause.
- Distribution lists may be different.
- The situation cannot contain a display item.
- The situation cannot contain a take action item.
- The MISSING function is not supported.
- The SCAN and STR functions are not supported.
- Group functions on the attribute criteria (such as average or total) are not supported.
- Event persistence is not supported.
If the situation is grouped with other situations, the data collection required to get the attributes that are referenced in the situation occurs only once for the group. All situations in the group make use of the same data.
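The conditions listed above can be summarized in a small Python sketch that flags which situations are candidates for grouping. This is only a planning aid built from the list in this Technote; it does not reproduce the internal TEMS logic. The class and field names are hypothetical, and requiring at least two members per group is an assumption of this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Situation:
    name: str
    attribute_group: str
    interval_seconds: int
    autostart: bool = True
    has_until_clause: bool = False
    has_display_item: bool = False
    has_take_action: bool = False
    uses_missing: bool = False
    uses_scan_or_str: bool = False
    uses_group_function: bool = False  # for example, average or total
    uses_event_persistence: bool = False


def is_eligible(s: Situation) -> bool:
    """Apply the per-situation conditions from the list above."""
    return s.autostart and not any([
        s.has_until_clause, s.has_display_item, s.has_take_action,
        s.uses_missing, s.uses_scan_or_str, s.uses_group_function,
        s.uses_event_persistence,
    ])


def candidate_groups(situations):
    """Bucket eligible situations by (attribute group, interval).

    Distribution lists are deliberately not part of the key, because they
    may differ. Only buckets with two or more members are returned
    (an assumption: a single situation has nothing to share).
    """
    buckets = defaultdict(list)
    for s in situations:
        if is_eligible(s):
            buckets[(s.attribute_group, s.interval_seconds)].append(s.name)
    return {key: names for key, names in buckets.items() if len(names) > 1}
```

Running `candidate_groups` over an inventory of planned situations shows which ones could share a single data collection and which ones are excluded, for example by a take action item or a different interval.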
The TEMS performs situation grouping during its start-up. If the TEMS finds a number of situations that are eligible for grouping, it creates a new internal situation that performs the data collection at the specified interval.
All grouped situations then compare their criteria to the data returned by the internal situation. These internal situations only exist for the duration of the TEMS run. They get an internal name that starts with _Z_ and the full name is built from the following parts: _Z_, table name, sequence number.
For example, on Windows, when grouping situations on table WTPROCESS, the grouped situation is called _Z_WTPROCESS0. These situations are not added to the permanent situation tables in TEMS (such as TSITDESC). Because they are only temporary, they can be seen only in temporary situation tables such as TSITSTSC.
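The naming rule can be expressed in a few lines of Python. The sketch below builds a name from the three parts and recognizes such names in a list, for example a list of situation names copied from a status view. The function names and the sample names other than _Z_WTPROCESS0 are hypothetical.

```python
PREFIX = "_Z_"


def internal_name(table: str, sequence: int) -> str:
    """Build a name from the three parts: prefix, table name, sequence number."""
    return f"{PREFIX}{table}{sequence}"


def is_internal_group_situation(name: str) -> bool:
    """Return True if the name carries the prefix used for grouped situations."""
    return name.startswith(PREFIX)


# Hypothetical list of situation names, for illustration only.
names = ["Example_Disk_Full", "_Z_WTPROCESS0", "Example_CPU_High"]
grouped = [n for n in names if is_internal_group_situation(n)]
```

Here `internal_name("WTPROCESS", 0)` yields the name from the example above, and `grouped` keeps only the internally generated entry.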
To verify whether any grouped situations are created, run a SQL statement from a TEP View, using custom SQL, as shown in the following Example.
Example SQL statement from a TEP View
The grouping occurs only at TEMS start-up, so any new situations or modifications do not benefit from grouping until the TEMS restarts.
Where is the situation evaluated?
Situations can be evaluated at either the TEMA or the TEMS. Ideally, all situations are evaluated at the TEMA, as close to the data source as possible. Unfortunately, the TEMA is limited in its capacity to evaluate situations. The evaluation is moved to the TEMS under the following conditions:
- If the situation has attributes that cross TEMAs.
- If advanced checking is used (such as string scan).
If situations cannot be evaluated at the TEMA, the TEMS takes over. Avoid evaluating situations at the HUB TEMS. All TEMAs should report to a Remote TEMS.
Building a situation in the right order
When starting to build a new situation, first make an overview of the attributes to test. Attributes are tested from first to last, or from left to right on the TEP panel, in the order they are entered in the situation.
Knowing the data behind attributes is recommended. The first test to make should return as few rows as possible. The next step can then further filter a limited set of rows. For example, on Windows, to check whether process XYZ uses more than a given amount n of real storage, we have to test two attributes (process name and real storage usage).
If we first test on real storage use, the result set may contain multiple rows, and we then have to check whether our process name is among the returned rows. It is more efficient to first test on process name (the result will be one row), followed by the test on the storage usage, which then runs on just this single row.
Also, complex conditions, such as string scan, sum, or average, are best performed on a limited result set: first evaluate the attributes against simple conditions to reduce the result set.
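The effect of test order can be illustrated with a small Python model that counts how many row comparisons each ordering needs. This is a simplified sketch of sequential filtering, not the actual situation evaluator. The process table, the attribute names, and the threshold are invented for the example.

```python
def evaluate(rows, predicates):
    """Apply predicates in order; return (matching rows, comparisons made)."""
    comparisons = 0
    for predicate in predicates:
        comparisons += len(rows)  # every remaining row is tested once
        rows = [row for row in rows if predicate(row)]
    return rows, comparisons


def is_xyz(row):
    """Selective test: matches a single process name."""
    return row["name"] == "XYZ"


def over_limit(row):
    """Less selective test: many processes exceed the threshold."""
    return row["storage_kb"] > 10_500


# Hypothetical process table: 1000 processes, one of them named "XYZ".
rows = [{"name": f"proc{i}", "storage_kb": 10_000 + i} for i in range(999)]
rows.append({"name": "XYZ", "storage_kb": 50_000})

name_first = evaluate(rows, [is_xyz, over_limit])
storage_first = evaluate(rows, [over_limit, is_xyz])
```

Both orderings return the same single row, but in this model the name-first ordering needs 1001 comparisons and the storage-first ordering needs 1499, because the storage test alone leaves 499 rows for the second test.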
This material has not been submitted to any formal IBM test and is published AS IS. It has not been the subject of rigorous review. IBM assumes no responsibility for its accuracy or completeness. The use of this information or the implementation of any of these techniques is a client responsibility and depends upon the client's ability to evaluate and integrate them into the client's operational environment.