Testing Highly Available IBM Tivoli Storage Manager Cluster Environments

Abstract

This Technote discusses the testing of our cluster configurations. We focus on two layers of testing: the cluster infrastructure, and application failure and recovery scenarios for the Tivoli Storage Manager server, client, and storage agent.

Contents




Objectives
Testing highly available clusters is a science. However well the solution is architected and implemented, its reliability ultimately depends on how well the environment is tested. If the tester does not understand the application and its limitations, or the cluster solution and its implementation, unexpected outages will follow.

The importance of creative, thorough testing cannot be emphasized enough. You should not invest in cluster technology unless you are prepared to invest in the testing time, both pre-production and post-production. Here are the major task items involved in testing a cluster:

  • Build the testing scope.
  • Build the test plan.
  • Build a schedule for testing the various application components.
  • Document the initial test results.
  • Hold review meetings with the application owners, discuss and understand the results, and build the next test plans.
  • Retest as required from the review meetings.
  • Build process documents, including the data flow and a description of each failure situation with its anticipated results.
  • Build recovery processes for the most common user intervention situations.
  • Prepare final documentation.


Important: Planning for the appropriate testing time in a project is a challenge, and testing is often the forgotten or abused phase. It is our team’s experience that the testing phase must be at least twice the total implementation time for the cluster (including application customization).
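
The sizing guideline above is simple arithmetic. The following is a minimal sketch, assuming effort is tracked in days; the function name and the `factor` parameter are illustrative and not part of any IBM tool.

```python
import math


def minimum_testing_days(implementation_days: float, factor: float = 2.0) -> int:
    """Return the smallest whole number of testing days that meets the guideline.

    implementation_days: total cluster implementation effort, including
        application customization (assumed unit: days).
    factor: multiplier from the guideline; 2.0 means "at least twice".
    """
    if implementation_days < 0:
        raise ValueError("implementation_days must be non-negative")
    return math.ceil(implementation_days * factor)


if __name__ == "__main__":
    # Hypothetical project with 15 days of implementation work.
    print(minimum_testing_days(15))  # prints 30
```

The result is a lower bound for planning purposes; retests that come out of the review meetings may push the real figure higher.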

Testing the clusters

Testing is critical for building a successful (and reliable) Tivoli Storage Manager cluster environment.

Cluster infrastructure tests

The following cluster infrastructure tests should be performed:

  • Test manual failover for the core cluster
  • Test manual failback for the core cluster
  • Start each Resource Group (Service Group)
  • Stop each Resource Group (Service Group)
  • Test FC adapter failure
  • Test FC adapter recovery
  • Test public NIC failure
  • Test public NIC recovery
  • Test private NIC failure
  • Test private NIC recovery
  • Test disk heartbeat failure
  • Test disk heartbeat recovery
  • Test power failure of each node
  • Test power failure recovery of each node

These are considered a minimal set of cluster infrastructure tests to ensure that a reliable, predictable, highly available cluster has been designed and implemented.
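
One way to make sure a test plan covers this minimal set is to compare the plan against the checklist. The sketch below assumes the plan is kept as a simple list of test names; the names are transcribed from the list above, and `missing_tests` is a hypothetical helper.

```python
# Minimal infrastructure test set, transcribed from the list above.
MINIMAL_INFRASTRUCTURE_TESTS = (
    "manual failover for the core cluster",
    "manual failback for the core cluster",
    "start each Resource Group (Service Group)",
    "stop each Resource Group (Service Group)",
    "FC adapter failure",
    "FC adapter recovery",
    "public NIC failure",
    "public NIC recovery",
    "private NIC failure",
    "private NIC recovery",
    "disk heartbeat failure",
    "disk heartbeat recovery",
    "power failure of each node",
    "power failure recovery of each node",
)


def missing_tests(planned):
    """Return the minimal tests that are absent from a plan, in checklist order.

    Matching ignores letter case and surrounding whitespace.
    """
    planned_names = {name.strip().lower() for name in planned}
    return [
        test for test in MINIMAL_INFRASTRUCTURE_TESTS
        if test.lower() not in planned_names
    ]


if __name__ == "__main__":
    # Hypothetical plan that forgot the two disk heartbeat tests.
    plan = [t for t in MINIMAL_INFRASTRUCTURE_TESTS if "heartbeat" not in t]
    for test in missing_tests(plan):
        print("Not planned:", test)
```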

For each of these tests, a document detailing the testing process and resulting behavior should be produced. Following this regimen ensures that issues will surface, be resolved, and be retested, thus producing final documentation.
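
The per-test document described above can be modeled as a small record. This is a sketch only: the field names are assumptions about what such a document might hold, and the sample procedure and observed behavior are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class ClusterTestRecord:
    """One documented test: what was done, what was expected, what happened."""
    name: str
    procedure: str
    expected_behavior: str
    observed_behavior: str = ""
    passed: bool = False
    run_count: int = 0

    def record_run(self, observed: str, passed: bool) -> None:
        """Store the outcome of the latest run and count the attempt."""
        self.observed_behavior = observed
        self.passed = passed
        self.run_count += 1


def needs_retest(records):
    """Return the names of tests that have not passed yet."""
    return [record.name for record in records if not record.passed]


if __name__ == "__main__":
    # Hypothetical entry; the procedure and outcomes are illustrative only.
    nic_test = ClusterTestRecord(
        name="Test public NIC failure",
        procedure="Disconnect the public network cable on one node.",
        expected_behavior="Service remains reachable.",
    )
    nic_test.record_run("Service unreachable for several minutes.", passed=False)
    print(needs_retest([nic_test]))  # prints ['Test public NIC failure']

    # After the issue is resolved, the test is run again and documented.
    nic_test.record_run("Service remained reachable.", passed=True)
    print(needs_retest([nic_test]))  # prints []
```

Keeping the run count alongside the latest outcome preserves the surface, resolve, and retest history that the final documentation is built from.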

Application tests

Resource Group (or Service Group) testing includes the complete application (Tivoli Storage Manager component) and all associated resources supporting the application.

Tivoli Storage Manager server tests

These tests are designed around failure situations for a highly available Tivoli Storage Manager server:

  • Server nodeA fails during a scheduled client backup to diskpool.
  • Server recovers on nodeB after failing during a scheduled client backup to diskpool.
  • Server nodeA fails during a migration from disk to tape.
  • Server recovers on nodeB after the migration failure.
  • Server nodeA fails during a backup storage pool tape-to-tape operation.
  • Server recovers on nodeB after the backup storage pool failure.
  • Server nodeA fails during a full DB backup to tape.
  • Server recovers on nodeB after the full DB backup failure.
  • Server nodeA fails during an expire inventory.
  • Server recovers on nodeB after failing during an expire inventory.
  • Server nodeA fails during a StorageAgent backup to tape.
  • Server recovers on nodeB after failing during a StorageAgent backup to tape.
  • Server nodeA fails during a session serving as a library manager for a library client.
  • Server recovers on nodeB after failing as a library manager.
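
The scenarios above follow a pattern: for each operation in progress, one test fails the active node and a companion test verifies recovery on the other node. The sketch below generates such pairs from a list of operations. The wording of the generated titles is normalized and does not reproduce the list verbatim, and `failover_cases` is a hypothetical helper.

```python
# Operations in progress when the failure is injected, taken from the list above.
SERVER_OPERATIONS = [
    "a scheduled client backup to diskpool",
    "a migration from disk to tape",
    "a backup storage pool tape-to-tape operation",
    "a full DB backup to tape",
    "an expire inventory",
    "a StorageAgent backup to tape",
    "a session serving as a library manager for a library client",
]


def failover_cases(component, operations, primary="nodeA", standby="nodeB"):
    """Return a failure test and a recovery test for every operation."""
    cases = []
    for operation in operations:
        cases.append(f"{component} {primary} fails during {operation}.")
        cases.append(
            f"{component} recovers on {standby} after failing during {operation}."
        )
    return cases


if __name__ == "__main__":
    for number, case in enumerate(failover_cases("Server", SERVER_OPERATIONS), 1):
        print(f"{number:2d}. {case}")
```

Generating the pairs from one list of operations makes it harder to add a failure test without its recovery counterpart. The same helper could be reused with other component names and operation lists.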


Tivoli Storage Manager client tests

These are application tests for a highly available Tivoli Storage Manager client:

  • Client nodeA fails during a scheduled backup.
  • Client recovers on nodeB after failing during a scheduled backup.
  • Client nodeA fails during a client restore.
  • Client recovers on nodeB after failing during a client restore.


Tivoli Storage Manager storage agent tests

These are application tests for a highly available Tivoli Storage Manager storage agent (and the associated Tivoli Storage Manager client):

  • StorageAgent nodeA fails during a scheduled backup to tape.
  • StorageAgent recovers on nodeB after failing during a scheduled backup.

Special Notices

This material has not been submitted to any formal IBM test and is published AS IS. It has not been the subject of rigorous review. IBM assumes no responsibility for its accuracy or completeness. The use of this information or the implementation of any of these techniques is a client responsibility and depends upon the client's ability to evaluate and integrate them into the client's operational environment.

Profile

Publish Date
10 June 2005


IBM Form Number
TIPS0574