From Cluster Documentation Project

Error Detection And Correction

A cluster of systems with large amounts of RAM provides system integrators and administrators with an opportunity to become familiar with Soft Errors.

According to arch/*/kernel/mce.c and arch/*/kernel/traps.c, Linux kernels older than 2.6.16 handle an uncorrectable bit error in one of two ways. Normally the error is raised as a Machine Check Exception (MCE): the kernel prints a message identifying the DIMM bank and then panics. If the MCE panic has been disabled with the mce=off kernel boot parameter, the error is instead seen as a Non-Maskable Interrupt (NMI), and the kernel prints a "Dazed" message and continues running.

New capabilities are available beginning with the 2.6.16 kernel, in which code from the EDAC project was merged as optional modules. The modules provide counters for correctable and uncorrectable errors, the ability to reset those counters through sysfs, a timer reporting the seconds elapsed since the last reset, and more.
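The sysfs counters described above can be polled from a script. The following is a minimal sketch, not part of the EDAC project itself. It assumes the memory-controller directories appear under /sys/devices/system/edac/mc and that each one exposes files named ce_count, ue_count, and seconds_since_reset; check the EDAC documentation for your kernel version to confirm the layout. The function names read_int and read_edac_counters are hypothetical.

```python
from pathlib import Path

# Assumption: EDAC memory controllers are exposed under this sysfs directory.
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")

# Assumption: these per-controller files exist on the running kernel.
COUNTER_FILES = ("ce_count", "ue_count", "seconds_since_reset")


def read_int(path):
    """Return the integer stored in a sysfs file, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def read_edac_counters(root=EDAC_MC_ROOT):
    """Collect error counters for every mcN directory found under root."""
    root = Path(root)
    counters = {}
    if not root.is_dir():
        # EDAC modules not loaded, or no supported memory controller present.
        return counters
    for mc_dir in sorted(root.glob("mc[0-9]*")):
        counters[mc_dir.name] = {
            name: read_int(mc_dir / name) for name in COUNTER_FILES
        }
    return counters


if __name__ == "__main__":
    for controller, values in read_edac_counters().items():
        print(controller, values)
```

On a machine without the EDAC modules loaded, the function returns an empty dictionary rather than raising an error, which makes it convenient to run across every node of a cluster.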

The Linux EDAC modules support the following memory controllers:

  • AMD 76x
  • Intel e752x
  • Intel e7xxx
  • Intel 82860
  • Intel D82875P
  • Radisys 82600

I/O on 64-bit Systems With Large Amounts of Memory

Part of the transition from 32-bit x86 to x86_64 with large amounts of memory involves handling I/O devices that support only 32-bit memory addresses. AMD products include a hardware IOMMU that, for the most part, makes everything work transparently. Intel EM64T and IA64 products do not include an IOMMU, so the Linux kernel implements a "software I/O translation buffer" (swiotlb) instead. The memory allocated to the swiotlb is unavailable to normal processes, and some device drivers (such as the proprietary NVIDIA graphics driver) may require more memory to be reserved in order to operate reliably. See the Linux kernel's documentation for details of the swiotlb and iommu boot parameters. Much of the information summarized in this paragraph is drawn from an LWN article on DMA by Jonathan Corbet.
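To see which of these boot parameters a node was started with, the kernel command line can be inspected. The sketch below is illustrative only. It assumes the swiotlb parameter takes the form swiotlb=<slabs>[,force] and that each slab is 2 KiB; both should be verified against the kernel documentation for the version in use. The helper names parse_dma_params and swiotlb_mib are hypothetical.

```python
# Assumption: one swiotlb slab is 2 KiB; verify against your kernel's documentation.
SLAB_BYTES = 2048


def parse_dma_params(cmdline):
    """Extract the swiotlb= and iommu= values from a kernel command line string."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key in ("swiotlb", "iommu"):
            params[key] = value
    return params


def swiotlb_mib(value):
    """Estimate the reserved buffer size in MiB from a swiotlb= value.

    Returns None when the value carries no numeric slab count.
    """
    slabs = value.split(",")[0]
    if not slabs.isdigit():
        return None
    return int(slabs) * SLAB_BYTES / (1024 * 1024)


if __name__ == "__main__":
    # Example command line; on a live system, read /proc/cmdline instead.
    example = "ro root=/dev/sda1 swiotlb=65536 iommu=soft quiet"
    params = parse_dma_params(example)
    print(params)
    if "swiotlb" in params:
        print("estimated swiotlb size (MiB):", swiotlb_mib(params["swiotlb"]))
```

Passing the contents of /proc/cmdline to parse_dma_params on each node gives a quick way to confirm that a cluster-wide boot parameter change actually took effect.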