Problem Determination Tools for Large Domino Systems
Published 02 October 2002
Authors: Mike Ebbers
This tip describes the tools available for problem determination in a large Domino for S/390 production environment.
This tip suggests how to approach problems that may occur while running a large Domino system in a production environment on the z/OS platform. It is based on two IBM Redbooks: Lotus Domino for S/390 Release 5: Problem Determination Guide, SG24-5599, and Debugging UNIX System Services, Lotus Domino, Novell Network Services, and other Applications on OS/390, SG24-5613. From a functional standpoint, the Domino server on z/OS, including its problem determination tools, is the same as on other platforms. In addition, z/OS offers unique problem determination capabilities relevant to Domino. This tip describes functions that are not available on Windows NT.
The primary requirement for Domino on z/OS is higher availability than on other platforms. z/OS differs from other platforms in two specific areas. The first is that it captures enough information on the first occurrence of a problem to diagnose the cause, so that a fix can be made and applied before the problem recurs. Following this approach over the years, z/OS has proven to be a very reliable operating system with high availability.
The other approach is that when a failure occurs, it is isolated to a specific piece of work. For an operating system like z/OS, which runs multiple workloads on one server, this is important. A failure of one workload will almost never affect z/OS itself or other workloads on the system. z/OS will isolate the failure to a single component, clean it up after failure, and optionally restart it. For instance, z/OS forces an abend of any program that addresses memory outside of its authorized range. A dump that points directly to the erroneous instruction is taken. Thus, the system is protected and the cause of the problem is easily diagnosed. In order to help you capture enough material for problem determination, the following sections describe information and tools that are available on z/OS.
Tools in the z/OS environment
- Operator commands
- USS commands
- Component trace
- Domino trace
- nsd script
- RMF Monitors II and III
- Console Support for Domino
- SVC dumps
- CEE dumps
- PTF checker
- Service Update Facility
- IBM problem databases
Because Domino runs under z/OS using UNIX System Services (USS), z/OS writes messages to the system log during startup and shutdown. z/OS also routes error messages to the syslog.
All messages that Domino issues are kept in the joblog of the started task. This is only possible when the z/OS Console Support for Domino is installed. The only exception is RACF messages, which appear in the joblog even without z/OS Console Support for Domino.
Operator commands
Basic commands that help in diagnosing the overall health of the system and commands regarding Domino can be very useful in problem determination. Some useful commands are listed here.
- Displaying network connection of the Domino server. All clients must be connected to a Domino server using TCP/IP in z/OS. You can see whether this connection is active or not by issuing the following command: TSO PING <ip address>
- Displaying Global Resource Serialization contention. The output of the D GRS,C command provides information about contention on the system. If a contention situation persists for an unreasonable amount of time, cancel the address spaces causing the contention. A dump is useful for diagnosing the cause. We recommend issuing the D GRS,C command at least every 15 minutes using AOC or another automation tool.
- Displaying OMVS summary. The D OMVS,S command displays the status of z/OS UNIX processes, file systems, and servers, and the BPXPRMxx parmlib member set by initialization or specified by the SET OMVS=xx command.
- Displaying address space ID. The D OMVS,A=ALL or D OMVS,A=xx commands enable you to display process information of the USS address spaces. Using A=ALL gives you information on all USS address spaces, but you can also specify a single address space by using A=xx.
- Displaying user ID. The D OMVS,U=<UID> command allows you to get process information for all processes associated with the specified TSO/E user ID.
- Displaying a specific process. Using the D OMVS,PID=<processid> command enables you to display thread information (in decimal numbers) for the specified process ID.
- Displaying options. If you want to list all of the options that are set in parmlib member BPXPRMxx, use the D OMVS,O command. Keep in mind that the actual status is displayed. If a previous SETOMVS or T OMVS command was given, then that information is displayed and not the contents of the BPXPRMxx member.
- Changing options. You can dynamically change USS system characteristics that were set from the BPXPRMxx member by using the SETOMVS command. For instance, SETOMVS MAXPROCUSER=xx changes the MAXPROCUSER parameter to xx.
- Displaying file system information. The D OMVS,FILE command gives detailed information on currently mounted file systems.
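The recommendation in the list above to issue D GRS,C at regular intervals implies comparing successive responses to see whether the same contention is still present. The following is a minimal sketch of that comparison, not part of any IBM tool. It assumes the command responses have already been reduced to (resource name, job name) pairs by your automation product; that representation, the function name, the sample resource and job names, and the three-sample threshold are all hypothetical.

```python
from typing import List, Set, Tuple

# Hypothetical representation: one (resource_name, jobname) pair per contended
# resource, already extracted from a D GRS,C response by an automation tool.
Contention = Tuple[str, str]


def persistent_contention(samples: List[Set[Contention]],
                          min_samples: int = 3) -> Set[Contention]:
    """Return the contentions present in every one of the last `min_samples` samples."""
    if min_samples < 1:
        raise ValueError("min_samples must be at least 1")
    if len(samples) < min_samples:
        return set()  # not enough history yet to call anything persistent
    recent = samples[-min_samples:]
    result = set(recent[0])
    for sample in recent[1:]:
        result &= sample  # keep only pairs seen in every recent sample
    return result


# Illustrative history of three samples taken 15 minutes apart (made-up names).
history = [
    {("SYSZTEST.RES1", "DOMINO1")},
    {("SYSZTEST.RES1", "DOMINO1"), ("SYSZTEST.RES2", "BATCH7")},
    {("SYSZTEST.RES1", "DOMINO1")},
]
print(persistent_contention(history))
```

With samples taken every 15 minutes, a threshold of three samples flags contention that has been seen at each check over a 30-minute span. What counts as an unreasonable amount of time is an installation decision.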
USS commands
To navigate in the UNIX System Services environment, it is very helpful to have a short reference of the most common UNIX commands. For a complete list of all available commands and parameters, refer to z/OS UNIX System Services User's Guide, SA22-7801. Useful commands for the Domino environment include:
Component trace
A component trace (CTRACE) can be turned on for OMVS to gather useful information for root cause analysis of performance problems. See Lotus Domino for S/390 Release 5: Problem Determination Guide, SG24-5599, section 8.5.2. A CTRACE can be turned on for other components as well. For difficult network problems, consider a CTRACE of TCP/IP.
Domino trace
A DTRACE can be turned on and off dynamically. DTRACE is a function in Domino that enables the systems programmer to follow the flow of load modules being used in USS. It is designed to collect data for highly intermittent problems that may occur on production servers, and it is provided as an IBM support tool only. The tool makes it possible to record voluminous internal Domino trace records, both in memory and in a wraparound HFS file with a specified maximum size, without filling up your notes.log database or console output file. IBM and Lotus use this data for root cause analysis.
nsd script
The nsd.sh script is located in the /usr/lpp/lotus/bin/tools/diag directory. It gathers information about your Domino server, such as environment variables, job call stack, job logs, and shared memory. See Debugging UNIX System Services, Lotus Domino, Novell Network Services and other Applications on OS/390, SG24-5613, section 2.2.1, for more details.
RMF Monitors II and III
The RMF component of z/OS provides detailed information on the utilization of resources, such as processor, I/O, and memory. RMF Monitor II provides real-time displays of resource consumption, enqueues, and more. In DELTA mode, the displayed statistics are updated each time you press Enter and show how much resource was used since the last time Enter was pressed. Monitor III, the work delay monitor, can display resources that are slowing down the Domino server. For instance, if the Domino server is waiting for CPU resources because other work is consuming them, Monitor III will show this. Monitor III can also display enqueue and I/O delays.
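The idea behind DELTA mode, showing consumption since the previous refresh rather than a running total, can be illustrated with a short sketch. This is not RMF code and does not use any RMF interface. The counter names and values are hypothetical, and the sketch simply assumes each snapshot holds cumulative totals.

```python
from typing import Dict


def delta(previous: Dict[str, float], current: Dict[str, float]) -> Dict[str, float]:
    """Return the change in each cumulative counter between two snapshots.

    A counter that is missing from `previous` is treated as having started at 0.
    """
    return {name: value - previous.get(name, 0.0) for name, value in current.items()}


# Two hypothetical snapshots of cumulative totals for one address space.
first = {"cpu_seconds": 10.0, "io_count": 100}
second = {"cpu_seconds": 12.5, "io_count": 160}
print(delta(first, second))
```

Reading the difference instead of the total makes it easier to see which address space is consuming resources right now, as opposed to which one has consumed the most since it started.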
Console Support for Domino
This feature routes messages to the z/OS console under SDSF.
SVC dumps
When Domino fails or abends, z/OS can automatically produce an SVC dump of the system. When necessary (for example, in case of a hang) the operations department can also manually dump the address spaces. The dump is analyzed by the IBM Support Center using IPCS. To handle a Domino-related SVC dump, the dump data sets should be 2000 MB.
A SLIP trap tells z/OS to capture requested information when certain specified conditions are met. For example, you can specify that any time a specified message is produced, z/OS should take an SVC dump to a z/OS dump data set.
CEE dumps
Because Domino is written in the C language, it runs under the z/OS Language Environment (LE). If LE encounters errors, it can trigger a CEEDUMP. This dump is normally placed in the /notesdata subdirectory, and its name contains a date and time stamp and a process ID. The IBM Support Center uses these dumps for root cause analysis of server code processes. See Lotus Domino for S/390 Release 5: Problem Determination Guide, SG24-5599, section 8.6 for further details.
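Because the dump name carries a date and time stamp and a process ID, the dumps in a directory can be sorted to find the most recent ones and matched to a process. The following is a minimal sketch that assumes the file names follow the pattern CEEDUMP.yyyymmdd.hhmmss.pid. Verify that pattern on your own system before relying on it. The function and class names are hypothetical.

```python
import re
from datetime import datetime
from typing import List, NamedTuple, Optional

# Assumed naming pattern: CEEDUMP.<yyyymmdd>.<hhmmss>.<pid>
CEEDUMP_NAME = re.compile(r"^CEEDUMP\.(\d{8})\.(\d{6})\.(\d+)$")


class CeeDump(NamedTuple):
    filename: str
    taken_at: datetime
    pid: int


def parse_ceedump_name(filename: str) -> Optional[CeeDump]:
    """Return the timestamp and PID encoded in a dump file name, or None if it does not match."""
    match = CEEDUMP_NAME.match(filename)
    if match is None:
        return None
    date_part, time_part, pid = match.groups()
    try:
        taken_at = datetime.strptime(date_part + time_part, "%Y%m%d%H%M%S")
    except ValueError:
        return None  # digits present but not a valid date or time
    return CeeDump(filename, taken_at, int(pid))


def newest_first(filenames: List[str]) -> List[CeeDump]:
    """Parse all matching names and sort them from most recent to oldest."""
    dumps = [d for d in map(parse_ceedump_name, filenames) if d is not None]
    return sorted(dumps, key=lambda d: d.taken_at, reverse=True)
```

In practice the list of names could come from os.listdir() on the Domino data directory. The PID can then be compared with the process information shown by D OMVS,PID=<processid>.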
System Management Facilities (SMF) collects data about processor, storage, and I/O consumption for address spaces in z/OS, in more detail than RMF does. For a Domino workload, the following record types are relevant:
- Record type 30: Common Address Space Work. Record type 30 contains information about CPU and storage by address space, as well as file reads and writes and data blocks transferred to disk.
- Record type 42 subtype 6: DFSMS Statistics and Configuration. This record shows the activity of all the UNIX files in each HFS data set, such as number of I/Os, average connect, disconnect, pending and control unit time, and response time.
- Record type 92 subtypes 5 and 11: UNIX File System Activity. Subtype 5 contains information about the UNIX file system. Subtype 11 contains information on the requests made to the UNIX files by the Domino address space, such as read and write calls, directory I/O blocks, blocks read and written, and bytes read and written.
- Record type 108. Starting with Release 5, the Domino server writes information into this record type. Some of the information includes:
  - Current number of users
  - Currently connected active users or users that have been active in the last 1, 3, 5, 15, and 30 minutes
  - Average size of Domino mail and SMTP messages delivered to local users and other servers
  - Total transactions in interval
  - Total number of pool threads (Server_pool_tasks)
  - Number of virtual threads currently in use
  - Transaction sections for each transaction type showing the number of transactions processed in the interval
  - Accumulated milliseconds of response time
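The transaction counts and accumulated response times listed for record type 108 can be combined into an average response time per transaction type. The sketch below shows only that arithmetic. It does not read or decode SMF records, and it assumes the accumulated milliseconds are reported per transaction type. The class name, field names, and sample transaction type names are hypothetical and are not the actual SMF field names.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TransactionSection:
    """Hypothetical, already-decoded view of one transaction section."""
    transaction_type: str
    count: int                    # transactions processed in the interval
    accumulated_response_ms: int  # accumulated milliseconds of response time


def average_response_ms(section: TransactionSection) -> Optional[float]:
    """Average response time in milliseconds, or None when nothing was processed."""
    if section.count == 0:
        return None
    return section.accumulated_response_ms / section.count


def interval_summary(sections: List[TransactionSection]) -> Dict[str, Optional[float]]:
    """Map each transaction type to its average response time for the interval."""
    return {s.transaction_type: average_response_ms(s) for s in sections}


# Illustrative values only.
sample = [
    TransactionSection("OPEN_DB", count=200, accumulated_response_ms=5000),
    TransactionSection("GET_NOTE", count=0, accumulated_response_ms=0),
]
print(interval_summary(sample))
```

Tracking this average from interval to interval gives a simple trend line that can be compared with the RMF and type 30 data when response time degrades.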
PTF checker
IBM offers an aggressive maintenance policy. It is crucial for the customer, as well as for IBM, to keep maintenance as up-to-date as possible. Fixes should be applied as soon as possible to avoid known problems. Some customers do not apply all the fixes that were called for, and run into problems. For the implementation of Domino, it is absolutely crucial to have all maintenance applied, whether it seems to affect Domino or not.
To help you keep up-to-date with your maintenance, you can run the PTF checker tool. This tool is an MVS batch job that you download in text format. It tells you if you are missing any needed PTFs, or you can do a manual check against the list of PTFs. For either maintenance choice, visit the following Web site:
Service Update Facility
z/OS Service Update Facility (SUF) is a Java-based z/OS software tool that makes ordering and receiving z/OS software quick and easy using the Internet. As we stated, it is very important to be as up-to-date as possible with your maintenance level. A large percentage of problems concerning Domino are caused by maintenance not being current. For details regarding prerequisites, entitlement, and how to obtain SUF, refer to the z/OS SUF homepage at:
IBM problem databases
IBM maintains databases with solved and unsolved problems. The location of these databases varies from country to country. It is best to contact your local IBM Support Center to find out how to access these databases. A good starting point is to visit the following Web site:
The Lotus Support Center also has a database of Domino problems.
For additional information, also see these Redbooks:
Lotus Domino for S/390 Release 5: Problem Determination Guide, SG24-5599
Debugging UNIX System Services, Lotus Domino, Novell Network Services, and other Applications on OS/390, SG24-5613
This material has not been submitted to any formal IBM test and is published AS IS. It has not been the subject of rigorous review. IBM assumes no responsibility for its accuracy or completeness. The use of this information or the implementation of any of these techniques is a client responsibility and depends upon the client's ability to evaluate and integrate them into the client's operational environment.