IBM® has built High Performance Computing (HPC) clusters for some time, and this experience can help customers avoid poor configuration choices and noncompliant designs during cluster deployments. Such problems become more difficult to correct in large clusters as the system scales in terms of nodes, applications, and users. This IBM Redbooks® Solution Guide describes a toolset that can aid system administrators with the initial stages of installing their cluster.
This Solution Guide covers infrastructure health checks, for example, checking the configuration and verifying the functions of the common subsystems (nodes or servers, switch fabric, parallel file system, job management, and problem areas).
This Solution Guide is targeted at technical professionals (consultants, technical support staff, IT Architects, and IT Specialists) who are responsible for delivering cost-effective Technical Computing and HPC solutions to optimize business results, product development, and scientific discoveries.
The IBM® Cluster Health Check (CHC) toolset is an extensible framework and collection of tools to check and verify the health of an IBM Cluster. The CHC framework provides an environment for integrating and running a subset of individual checks that can be performed on a single node or a group of nodes. The tests are categorized into different tool types and groups so that only checks for a specific part of the cluster are performed. Because the framework and toolset are extensible, users can also create their own groups and tools for checking different components of the cluster. Figure 1 shows the different cluster components and the verification pyramid. Each level of verification depends on the correctness of the underlying stages. The higher in the pyramid a problem is found, the more difficult it might be to trace it back to a more foundational problem.
Figure 1. Cluster components verification pyramid
Did you know?
The main wrapper tool of the IBM Cluster Health Check (CHC) toolset is hcrun, which provides access to all tools and passes them the environment. Currently, the toolset is available on an as-is support basis with the intention of receiving input from users regarding its usefulness. Although many of the tools work on both x86 and IBM POWER® solutions, the initial toolset concentrates on x86 solutions. Nothing in the initial framework precludes integrating tools that are geared toward a POWER solution.
Business value
Health checking involves a series of checking and verification steps. Fundamentally, the individual computers within a cluster system must be uniform. An HPC cluster system can involve thousands of individual computers. Checking them one by one manually is not a good approach because manual operations are needlessly time-consuming and error prone. Thus, automated checking and verification facilities are necessary.
A single, fully automated checking and verification facility for an HPC cluster system does not exist because it is too difficult to develop such a comprehensive facility. There are too many different computer architectures, operating systems, input/output adapters, and storage systems. To implement such a fully automated checking and verification facility, the verification tool itself must be tested and verified in an environment that includes various combinations of all these types of components. This is a difficult goal to achieve because there are too many combinations. For these reasons, the administrators of an HPC cluster system should implement their own checking and verification tools by leveraging and adapting existing tools to their unique cluster environment.
The IBM Cluster Health Check toolset provides a set of tools that helps check and ensure the consistency and stability of the cluster. The focus of the tools is to help reduce initial environment deployment time, minimize the effort that is required for health state checking, and decrease performance variation.
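As an illustration of the kind of automated uniformity check described above, the following is a minimal sketch in Python. It is not part of the CHC toolset. The function name find_inconsistent_nodes, the node names, and the kernel version strings are all hypothetical. The sketch assumes that one property per node has already been collected and treats the most common value as the reference.

```python
from collections import Counter


def find_inconsistent_nodes(values_by_node):
    """Return (reference, outliers) for one property collected per node.

    reference is the most common value across the nodes, and outliers maps
    each node whose value differs from the reference to that node's value.
    This is a hypothetical helper, not a CHC tool.
    """
    if not values_by_node:
        return None, {}
    counts = Counter(values_by_node.values())
    reference, _ = counts.most_common(1)[0]
    outliers = {
        node: value
        for node, value in values_by_node.items()
        if value != reference
    }
    return reference, outliers


# Hypothetical inventory: kernel version reported by each node.
kernel_by_node = {
    "node01": "3.10.0-957",
    "node02": "3.10.0-957",
    "node03": "3.10.0-862",
    "node04": "3.10.0-957",
}

reference, outliers = find_inconsistent_nodes(kernel_by_node)
print(f"reference value: {reference}")
for node, value in sorted(outliers.items()):
    print(f"{node}: {value} differs from the reference")
```

Using the majority value as the reference is only one possible design choice. A site might instead compare every node against an explicitly configured expected value.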
Solution overview
There are significant challenges in obtaining pertinent health data that resides in the many layers of clustered systems. Reporting and correlating critical system data and transforming it into usable information are not trivial. Today’s cluster health check tools mostly test components and some point-to-point network performance. They cannot comprehensively monitor and assess overall system health, truly aggregate performance, or reconcile conflicting usage, user expectations, and demands.
In this first release, the health check toolkit attempts to establish a few reference baselines of performance and verify the prerequisite services at the component level. This approach allows productive testing and detection of anomalous conditions, and it can even extend to periodic consistency checking at the component level when the tests are not invasive. The approach is particularly important when you try to isolate the cause of performance problems as initial deployment system testing uncovers them.
These tools are not intended to devalue benchmarking and performance-based modeling as an alternative approach. Some researchers have had success with performance modeling results that used strict cluster start and acceptance criteria. This performance-modeling approach does extend the time that it takes to start a cluster and adds considerable expense. Even those who are successful with performance-based modeling admit that there are significant challenges in setting realistic and accurate performance expectations for broadly diversified workloads on new HPC systems at all stages of system integration.
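The idea of comparing component-level results against a reference baseline can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the method that the toolkit uses. The function name flag_anomalies, the 10% tolerance, the baseline of 80.0, and the per-node bandwidth figures are hypothetical values chosen only for the example.

```python
def flag_anomalies(measurements, baseline, tolerance=0.10):
    """Return the nodes whose measurement is below the baseline by more
    than the given fractional tolerance.

    measurements maps a node name to a measured value where higher is
    better (for example, a bandwidth figure). Hypothetical helper.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    threshold = baseline * (1.0 - tolerance)
    return {
        node: value
        for node, value in measurements.items()
        if value < threshold
    }


# Hypothetical per-node memory bandwidth results in GB/s.
bandwidth = {"node01": 78.2, "node02": 79.1, "node03": 61.5, "node04": 77.8}

slow = flag_anomalies(bandwidth, baseline=80.0, tolerance=0.10)
for node, value in sorted(slow.items()):
    print(f"{node}: {value} GB/s is below the accepted range")
```

Because a check like this only reads results that were already collected, it could be rerun periodically, provided that the underlying measurement itself is not invasive.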
Solution architecture
The CHC framework has the following characteristics: