IBM HPC Cluster Health Check

Published on 14 January 2014, updated 17 December 2014


IBM Form #: TIPS1078

Authors: Dino Quintero, Ross Aiken, Shivendra Ashish, Manmohan Brahma, Murali Dhandapani, Rico Franke, Jie Gong, Markus Hilger, Herbert Mehlhose, Justin I Morosi, Thorsten Nitsch and Fernando Pizzano

Abstract

IBM® has built High Performance Computing (HPC) clusters for some time, and this experience can help customers with configuration choices and with noncompliant designs that are made during cluster deployments. Problems in large clusters become more difficult to correct as the system scales in terms of nodes, applications, and users. This IBM Redbooks® Solution Guide describes a toolset that can aid system administrators with the initial stages of installing their cluster.

This Solution Guide covers infrastructure health checks, for example, checking the configuration and verifying the functions of the common subsystems (nodes or servers, switch fabric, parallel file system, job management, and problem areas).

This Solution Guide is targeted at technical professionals (consultants, technical support staff, IT Architects, and IT Specialists) who are responsible for delivering cost-effective Technical Computing and HPC solutions to optimize business results, product development, and scientific discoveries.

The IBM® Cluster Health Check (CHC) toolset is an extensible framework and collection of tools to check and verify the health of an IBM Cluster. The CHC framework provides an environment for integrating and running a subset of individual checks that can be performed on a single node or a group of nodes. The tests are categorized into different tool types and groups so that only checks for a specific part of the cluster are performed. Because the framework and toolset are extensible, users can also create their own groups and tools for checking different components of the cluster. Figure 1 shows the different cluster components and the verification pyramid. Each level of verification depends on the correctness of the underlying stages. The higher in the pyramid a problem is found, the more difficult it might be to trace it back to a more foundational cause.

Figure 1. Cluster components verification pyramid

Did you know?

The IBM Cluster Health Check toolset (CHC) is an extensible framework and collection of tools that is used to check and verify the health of an IBM Cluster. The main wrapper tool is hcrun, which provides access to all tools and passes them the environment. Currently, the toolset is available on an as-is support basis with the intention of receiving input from users regarding its usefulness. Although many of the tools work in both the x86 and IBM POWER® solutions, the initial toolset concentrates on x86 solutions. There is nothing in the initial framework that precludes integrating tools that are geared toward a POWER solution.

Business value

Health checking involves a series of checking and verification steps. Fundamentally, the individual computers within a cluster system must be uniform. Usually, thousands of individual computers are involved in one HPC cluster system. Checking them one by one manually is not a good approach because manual operations are time-consuming and error prone. Thus, some kind of automated checking and verification facility is necessary.

A single, fully automated checking and verification facility for an HPC cluster system does not exist because it is too difficult to develop such a comprehensive facility. There are too many different computer architectures, different types of operating systems, different types of input/output adapters, and different types of storage systems. To implement such a fully automated checking and verification facility, the verification tool itself must be tested and verified in an environment that includes various combinations of all these types of components. This is a difficult goal to achieve because there are too many combinations. For these reasons, the administrators of an HPC cluster system should implement their own checking and verification tools, leveraging and adapting existing tools for their unique cluster environment.

The IBM Cluster Health Check toolset provides a set of tools that helps check and ensure consistency and stability of the cluster. The focus for the tools is to help reduce initial environment deployment time, minimize the efforts that are required for health state checking, and help decrease performance variation.

Solution overview

There are significant challenges in obtaining pertinent health data, which resides in many layers of a clustered system. Reporting and correlating critical system data and transforming it into usable information is not trivial. Today’s cluster health check tools mostly test components and some point-to-point network performance. They cannot comprehensively monitor and assess overall system health, truly aggregate performance, and reconcile conflicting usage, user expectations, and demands.

In this first release, the health check toolkit attempts to establish a few reference baselines of performance and to verify the prerequisite services at the component level. This approach allows productive testing and detection of anomalous conditions, and it can extend to periodic consistency checking at the component level when the tests are not invasive. It is particularly important when you try to isolate the cause of performance problems as initial deployment system testing uncovers them.

These tools are not intended to devalue benchmarking and performance-based modeling as an alternative approach. Some researchers have had success with performance modeling that used strict cluster start and acceptance criteria. However, this performance-modeling approach extends the time that it takes to start a cluster and adds considerable expense. Even those who are successful with performance-based modeling admit that it is difficult to set realistic performance expectations for broadly diversified workloads on new HPC systems at all stages of system integration.

Solution architecture

The CHC framework has the following characteristics:

Display available tools in an organized fashion.
Organize and display results of the tools with as much or as little information as the user requests.
Configuration file driven to allow for customization and extension.
Customize tool execution order through the usage of configuration files.
Extensible. You can add your own tools and make them available under CHC by updating the configuration files and using key environment variables from CHC.
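
As an illustration of the configuration-driven approach, the following minimal Python sketch loads tool definitions from configuration text and returns the tools of one group in their configured order. The section names, the keys (group, order, description), and the helper functions are hypothetical and chosen only for this example; they are not the actual CHC configuration format.

```python
import configparser

# Hypothetical configuration text; the section and key names below are
# illustrative and are not the actual CHC configuration format.
SAMPLE_CONFIG = """
[cpu_check]
group = node
order = 20
description = Compare CPU model and count across nodes

[memory_check]
group = node
order = 10
description = Compare installed memory across nodes

[fabric_links]
group = fabric
order = 10
description = Verify link state and speed
"""


def load_tools(config_text):
    """Parse the configuration into a list of tool dictionaries."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    tools = []
    for name in parser.sections():
        section = parser[name]
        tools.append({
            "name": name,
            "group": section.get("group", "misc"),
            "order": section.getint("order", fallback=100),
            "description": section.get("description", ""),
        })
    return tools


def tools_for_group(tools, group):
    """Return the tool names of one group, sorted by configured order."""
    selected = [t for t in tools if t["group"] == group]
    return [t["name"] for t in sorted(selected, key=lambda t: t["order"])]


if __name__ == "__main__":
    tools = load_tools(SAMPLE_CONFIG)
    print(tools_for_group(tools, "node"))
```

In this sketch, adding a tool or changing the execution order requires only a change to the configuration text, not to the code that runs the tools.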

The CHC tools cover the following broad areas:

Node health and configuration consistency, and test tools.
InfiniBand fabric health and configuration consistency, and test tools to verify health of the fabric.
Utilities.
Basic configuration checking for consistency across a group of nodes and devices, as well as against a baseline.
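
A consistency check across a group of nodes, and against a baseline, can be sketched as follows. This is a simplified illustration, not the CHC implementation: the function names are hypothetical, and the node names and firmware levels are invented sample data. Grouping nodes by the value that they report makes a deviating node stand out even when no baseline is known.

```python
from collections import defaultdict


def group_by_value(values_by_node):
    """Map each distinct value to the sorted list of nodes reporting it."""
    groups = defaultdict(list)
    for node, value in values_by_node.items():
        groups[value].append(node)
    return {value: sorted(nodes) for value, nodes in groups.items()}


def check_against_baseline(values_by_node, baseline):
    """Return the sorted nodes whose value differs from the baseline."""
    return sorted(n for n, v in values_by_node.items() if v != baseline)


if __name__ == "__main__":
    # Hypothetical sample data: firmware level reported by each node.
    firmware = {
        "node01": "1.4.2",
        "node02": "1.4.2",
        "node03": "1.3.9",
        "node04": "1.4.2",
    }
    print(group_by_value(firmware))
    print(check_against_baseline(firmware, "1.4.2"))
```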

Usage scenarios

From build-up to final decommissioning, a cluster passes through different phases during its lifetime in iterative cycles. Updates, changes, maintenance, or outages require methods to run through the stages consistently. Each phase has its own processes, characteristics, and needs. The goal of the methodology is to reduce nonproductive time to a minimum.

The lifecycle starts with the deployment stage, where all the different components are assembled, connected, powered, installed, and configured. While the cluster grows, it must be carefully and continuously examined. Basic functions are verified and uniformity between similar devices must be ensured. Discrepancies and abnormalities must be analyzed and eliminated; otherwise, they multiply as the cluster grows and ultimately impair its ability to meet expectations. IBM provides preferred practices and recommendations to help prevent you from running into known problems and traps.

More components are collated and tested until the whole cluster is ready for verification. During that phase, a cluster must prove that it is able to perform estimated workloads and meet expectations. This phase is when you attest functionality and performance through extensive testing and evaluation. Verification starts with the smallest entity in the cluster and then scales to the whole cluster size. For example, it is a preferred practice to prove the performance of each single compute node before running a test for the whole cluster. Reducing the size and complexity to a minimum is one of the basic rules about health. If a test fails, the source of the failure must return to the deployment stage until it can meet the requirements.
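
The practice of proving each compute node before running a cluster-wide test can be sketched as a simple gate. The following Python example is an illustration under stated assumptions: each node has already produced a single benchmark score where higher is better, and a node counts as slow when it falls more than a chosen tolerance below the median. The function names and the 5% default tolerance are hypothetical.

```python
import statistics


def find_slow_nodes(results, tolerance=0.05):
    """Return nodes whose score is more than `tolerance` below the median.

    `results` maps a node name to a benchmark score (higher is better).
    """
    median = statistics.median(results.values())
    threshold = median * (1.0 - tolerance)
    return sorted(node for node, score in results.items() if score < threshold)


def ready_for_cluster_test(results, tolerance=0.05):
    """Gate the cluster-wide test on every node passing individually."""
    return not find_slow_nodes(results, tolerance)
```

A node that is reported by find_slow_nodes would return to the deployment stage, and the cluster-wide test would wait until ready_for_cluster_test returns True.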

Finally, after all inspections, a cluster reaches the production phase and must carry out its duty: Run user jobs and return results. To keep the cluster at that state, it is mandatory to monitor all key characteristics continuously. Failures or discrepancies from standards must be quickly identified, the cause investigated, and the source isolated. If it is not possible to separate the cause of a problem, then the cluster as a whole must be verified again, and if a system-wide change is required to solve a problem, then the lifecycle starts again at the beginning of the deployment phase. Insufficient results in one of the stages cause a restart from the beginning of the previous phase.
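
The phase transitions that are described above can be summarized as a small state machine. The following Python sketch is one possible reading of the methodology, not part of the CHC toolset; the phase names follow the text, and the function name and parameters are hypothetical.

```python
PHASES = ["deployment", "verification", "production"]


def next_phase(current, passed, system_wide_change=False):
    """Return the phase to enter after the checks of `current` complete."""
    if current not in PHASES:
        raise ValueError("unknown phase: %s" % current)
    if system_wide_change:
        # A system-wide change restarts the lifecycle at deployment.
        return "deployment"
    index = PHASES.index(current)
    if passed:
        # Advance one phase, or stay in production once it is reached.
        return PHASES[min(index + 1, len(PHASES) - 1)]
    # Insufficient results send the cluster back to the previous phase.
    return PHASES[max(index - 1, 0)]
```

For example, next_phase("verification", passed=False) returns "deployment", and next_phase("production", passed=False) returns "verification".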

The lifecycle that is shown in Figure 2 does not describe only the phases of the cluster; it also applies to each single component. There is no difference between a large cluster and a single device when it comes to changes. So, even if a single cable must be replaced, the old one must be isolated from the remaining cluster. Afterward, a new cable must be deployed according to preferred practices and carefully verified before you resume production.

Figure 2. Health lifecycle methodology

It is an HPC rule of thumb that the slowest device determines the overall performance of the cluster. Therefore, extensive verification and continuous monitoring are crucial to obtaining and maintaining a healthy cluster. After a maintenance action, there is no excuse for overlooking one poorly performing memory DIMM that throttles all parallel jobs. It is important to have tests in place to ensure that all critical components perform as expected. Most of the testing should be done before combining them with other similar components.
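
The rule of thumb can be made concrete with a small calculation. The following Python sketch assumes a simplified model in which all nodes of a parallel job synchronize at the end of each step, so the step takes as long as the slowest node. The function names and the timings in the usage note are hypothetical.

```python
def step_time(node_times):
    """Time of one synchronized step: all nodes wait for the slowest one."""
    return max(node_times)


def efficiency(node_times):
    """Fraction of the allocated node time that is spent on useful work."""
    total_useful = sum(node_times)
    total_allocated = step_time(node_times) * len(node_times)
    return total_useful / total_allocated
```

Under this model, if 99 nodes finish a step in 10 seconds and one node with a degraded DIMM needs 20 seconds, the step takes 20 seconds for the whole job, and only about half of the allocated node time (1010 of 2000 node-seconds) is spent on useful work.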

Integration

The following solutions are also part of the cluster delivery solutions in the areas of file system management, cluster installation and management, and networking:

The General Parallel File System is a high performance, shared-disk file system that provides data access from all nodes in a homogeneous or heterogeneous cluster of IBM UNIX servers running either the IBM AIX® or the Linux operating system.

xCAT is open source cluster computing management software that is used for cluster deployment and administration. It provides the following functions:

Provisions operating systems, such as Linux and AIX, on different architectures, such as IBM System x® and IBM Power Systems™.
Creates and manages clusters.
Installs and manages many cluster machines in parallel.
Remotely manages a system through a remote console or distributed shell.

For more information about xCAT, see the following website:

http://xcat.sourceforge.net/

The OpenFabrics Enterprise Distribution (OFED) is open source software for RDMA and operating system kernel bypass applications. OFED includes kernel-level drivers, channel-oriented RDMA and send/receive operations, operating system kernel bypasses, and a kernel and user-level application programming interface (API).

OFED is available for many Linux distributions, including Red Hat Enterprise Linux and SUSE Linux Enterprise Server.
OFED comes with many user space tools that can be used to verify and diagnose the state of InfiniBand interfaces and topology.

For more information about OFED, see the following website:

http://www.openfabrics.org/

Supported platforms

The CHC toolkit should be installed on the xCAT server because many of the individual checks and the framework itself use xCAT commands. In addition, some of the health checks are MPI parallel applications, which run using the IBM Parallel Operating Environment (POE) runtime environment. Here are the required software packages:

xCAT
POE environment
Python 2.6

Although many of the initial tools work in both the x86 and POWER solutions, the initial toolset concentrates on x86 solutions. There is nothing in the initial framework that precludes integrating tools that are geared toward the POWER solution.

Ordering information

The IBM Cluster Health Check tools can be downloaded from the IBM HPC Central at the following website:

http://ibm.co/1hmVBzd

Related information

For more information, see the following documents:

IBM HPC Cluster Health Check, SG24-8168
http://www.redbooks.ibm.com/abstracts/sg248168.html
IBM HPC Central
http://ibm.co/1hmVBzd


Special Notices

The material included in this document is in DRAFT form and is provided 'as is' without warranty of any kind. IBM is not responsible for the accuracy or completeness of the material, and may update the document at any time. The final, published document may not include any, or all, of the material included herein. Client assumes all risks associated with Client's use of this document.