
Abstract

Active Memory is the term given to a collection of memory technologies implemented in high-end IBM eServer xSeries servers.

Contents

There are a number of advanced features implemented in the memory subsystem of IBM's Enterprise X-Architecture servers, including the x440, x445, and the 64-bit x450. These memory features are collectively known as Active Memory:

  • Memory ProteXion

    Memory ProteXion, also known as “redundant bit steering”, uses redundant bits in a data packet to provide backup in the event of a DIMM failure.

    Currently, other industry-standard servers use 8 bits of each 72-bit data packet for ECC functions and the remaining 64 bits for data. An Enterprise X-Architecture server, however, needs only 6 bits to perform the same ECC functions, which leaves 2 bits free. If memory scrubbing detects a chip failure on the DIMM, the memory controller can re-route data around the failed chip through the spare bits (similar to the hot-spare drive of a RAID array). It does this automatically, without issuing a Predictive Failure Analysis (PFA) or light path diagnostics alert to the administrator. After the second DIMM failure, PFA and light path diagnostics alerts occur on that DIMM as normal.

  • Memory scrubbing

    Memory scrubbing is an automatic daily test of all the system memory that detects and reports memory errors that might be developing before they cause a server outage.

    Memory scrubbing and Memory ProteXion work in conjunction with each other and do not require memory mirroring to be enabled to work properly.

    When a bit error is detected, memory scrubbing determines if the error is recoverable or not. If it is recoverable, Memory ProteXion is enabled and the data that was stored in the damaged locations is rewritten to a new location. The error is then reported so that preventative maintenance can be performed. As long as there are enough good locations to allow the proper operation of the server, no further action is taken other than recording the error in the error logs.

    If the error is not recoverable, then memory scrubbing sends an error message to the light path diagnostics, which then turns on the proper lights and LEDs to guide you to the damaged DIMM. If memory mirroring is enabled, then the mirrored copy of the data in the damaged DIMM is used until the system is powered down and the DIMM replaced. If hot-add is enabled in the BIOS, then no rebooting would be required and the new DIMM would be enabled immediately.

  • Memory mirroring

    Memory mirroring is roughly equivalent to RAID-1 in disk arrays: memory is divided into two ports, and one port is mirrored to the other. If 8 GB is installed, the operating system sees 4 GB once memory mirroring is enabled (it is disabled in the BIOS by default). Because all mirroring activities are handled by the hardware, memory mirroring is operating-system independent. Certain restrictions apply to the placement and size of memory DIMMs when memory mirroring is enabled, and these are system dependent.

  • Hot-swap and hot-add memory

    Currently, only the x445 supports hot-swap and hot-add memory. There are two configurations where you can add or replace memory while the server is still running:

    • Hot-swap, where you can replace failed DIMMs of the same type, size, and clock speed without turning off the server. Hot-swap memory is operating-system independent. Memory mirroring must be enabled to use hot-swap.
    • Hot-add, where you can add new DIMMs without turning off the server, thereby increasing the amount of RAM available to the operating system. This feature is currently supported only by Windows Server 2003, Enterprise Edition and Datacenter Edition. Memory mirroring must be disabled when using hot-add. Because of the way memory is implemented in the x445, the port you are adding memory to must be empty before you add memory, and DIMMs must be added in multiples of two.

  • Chipkill memory

    Chipkill is integrated into the XA-32 second-generation chipset and does not require special Chipkill DIMMs. Chipkill corrects multiple single-bit errors to keep a DIMM from failing. Combined with Memory ProteXion and the other Active Memory features, Chipkill gives the server very high reliability in the memory subsystem. Chipkill memory is approximately 100 times more effective than ECC technology, providing correction for up to four bits per DIMM (eight bits per memory controller), whether on a single chip or multiple chips.

    If a memory chip error does occur, Chipkill is designed to automatically take the inoperative memory chip offline while the server keeps running. The memory controller provides memory protection similar in concept to disk array striping with parity, writing the memory bits across multiple memory chips on the DIMM. The controller is able to reconstruct the “missing” bit from the failed chip and continue working as usual.

    Chipkill support is provided in the memory controller and implemented using standard ECC DIMMs, so it is transparent to the OS.
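
The Memory ProteXion re-routing described above can be pictured with a small Python sketch. It is an illustration only, not IBM's controller logic: the `SteeringMap` class, its method names, and the assumption that a failed chip contributes at most two lanes to each 72-bit packet are all hypothetical. Only the 64/6/2 bit split comes from the text.

```python
DATA_BITS = 64
ECC_BITS = 6     # per the text, 6 check bits are enough
SPARE_BITS = 2   # the 2 bits left free in each 72-bit packet
WORD_BITS = DATA_BITS + ECC_BITS + SPARE_BITS  # 72


class SteeringMap:
    """Hypothetical model: maps each logical bit to a physical lane."""

    def __init__(self):
        active = DATA_BITS + ECC_BITS
        self.lane_of = list(range(active))             # logical bit -> lane
        self.spares = list(range(active, WORD_BITS))   # unused spare lanes
        self.alerts = []

    def report_failure(self, failed_lanes):
        """Steer around failed lanes; return True if handled without an alert."""
        needed = [lane for lane in failed_lanes if lane in self.lane_of]
        if len(needed) > len(self.spares):
            # Assumption: no spares left means the failure is surfaced.
            self.alerts.append(f"PFA alert for lanes {needed}")
            return False
        for lane in needed:
            logical = self.lane_of.index(lane)
            self.lane_of[logical] = self.spares.pop(0)
        return True

    def write(self, bits):
        """Place the 70 logical bits onto the 72 physical lanes."""
        physical = [0] * WORD_BITS
        for logical, lane in enumerate(self.lane_of):
            physical[lane] = bits[logical]
        return physical

    def read(self, physical):
        """Collect the 70 logical bits back from their current lanes."""
        return [physical[lane] for lane in self.lane_of]
```

In this model the first failure is absorbed silently by the two spare bits and the next one raises an alert, which mirrors the behaviour the text describes for the first and second failures.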

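The hot-swap and hot-add preconditions listed above can be written down as two small checks. This is a minimal sketch under stated assumptions: the function names and the `(type, size_mb, clock_mhz)` tuple used to describe a DIMM are hypothetical, and the example values are illustrative rather than taken from any x445 specification.

```python
def can_hot_swap(mirroring_enabled, failed_dimm, replacement_dimm):
    """Hot-swap needs mirroring on and a DIMM of the same type, size, and speed.

    Each DIMM is a hypothetical (type, size_mb, clock_mhz) tuple.
    """
    return mirroring_enabled and failed_dimm == replacement_dimm


def can_hot_add(mirroring_enabled, port_is_empty, dimm_count):
    """Hot-add needs mirroring off, an empty port, and DIMMs added in pairs."""
    return (
        not mirroring_enabled
        and port_is_empty
        and dimm_count > 0
        and dimm_count % 2 == 0
    )


# Illustrative values only.
old = ("DDR", 1024, 266)
print(can_hot_swap(True, old, ("DDR", 1024, 266)))
print(can_hot_add(False, True, 2))
```

The two features are mutually exclusive in this model because one requires memory mirroring to be enabled and the other requires it to be disabled, as the text states.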

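The comparison between Chipkill and disk array striping with parity can be illustrated with a single XOR parity bit. This sketch shows only the analogy: the function names are hypothetical, the real Chipkill code is stronger than simple parity, and the example assumes the position of the failed chip is already known.

```python
from functools import reduce
from operator import xor


def write_striped(bits):
    """Spread one bit per chip and append an XOR parity bit on an extra chip."""
    parity = reduce(xor, bits, 0)
    return list(bits) + [parity]


def reconstruct(stored, failed_chip):
    """Recover the bit held by `failed_chip` by XOR-ing the surviving chips."""
    survivors = (bit for i, bit in enumerate(stored) if i != failed_chip)
    return reduce(xor, survivors, 0)


stored = write_striped([1, 0, 1, 1])
# Pretend chip 2 has gone offline and rebuild its bit from the rest.
print(reconstruct(stored, failed_chip=2) == stored[2])
```

Because the XOR of all stored bits, parity included, is always zero, the XOR of the surviving bits equals the missing one, which is why the controller in this analogy can keep serving data with one chip offline.
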
In addition, to maintain the highest levels of system availability, if a memory error is detected during POST or memory configuration, the server can automatically disable the failing memory bank and continue operating with reduced memory capacity. After the problem is corrected, you can manually re-enable the memory bank through the Setup menu in the BIOS.
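
The effect of disabling a failing bank at POST can be sketched as simple bookkeeping. The function names, bank labels, and sizes below are hypothetical and only illustrate the reduced-capacity behaviour described above. They do not reflect how the BIOS stores its configuration.

```python
def disable_failing_banks(bank_sizes_mb, failing_banks):
    """Return the banks left enabled and the capacity they provide, in MB."""
    enabled = {
        bank: size
        for bank, size in bank_sizes_mb.items()
        if bank not in failing_banks
    }
    return enabled, sum(enabled.values())


def reenable_bank(enabled, bank_sizes_mb, bank):
    """Model the manual re-enable step performed after the fault is fixed."""
    restored = dict(enabled)
    restored[bank] = bank_sizes_mb[bank]
    return restored


# Hypothetical layout: four banks of 2 GB each.
banks = {"bank1": 2048, "bank2": 2048, "bank3": 2048, "bank4": 2048}
enabled, capacity_mb = disable_failing_banks(banks, {"bank3"})
print(sorted(enabled), capacity_mb)
```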

Special Notices

This material has not been submitted to any formal IBM test and is published AS IS. It has not been the subject of rigorous review. IBM assumes no responsibility for its accuracy or completeness. The use of this information or the implementation of any of these techniques is a client responsibility and depends upon the client's ability to evaluate and integrate them into the client's operational environment.

Profile

Publish Date
25 July 2003





IBM Form Number
TIPS0259