Ek 4100a SV A01
Service Manual
Order Number: EK-4100A-SV. A01
This manual is for anyone who services an AlphaServer 4100 pedestal or cabinet system. It includes troubleshooting information, configuration rules, and instructions for removal and replacement of field-replaceable units (FRUs).
First Printing, August 1996

Digital Equipment Corporation makes no representations that the use of its products in the manner described in this publication will not infringe on existing or future patent rights, nor do the descriptions contained in this publication imply the granting of licenses to make, use, or sell equipment or software in accordance with the description.

The information in this document is subject to change without notice and should not be construed as a commitment by Digital Equipment Corporation. Digital Equipment Corporation assumes no responsibility for any errors that may appear in this document.

The software, if any, described in this document is furnished under a license and may be used or copied only in accordance with the terms of such license. No responsibility is assumed for the use or reliability of software or equipment that is not supplied by Digital Equipment Corporation or its affiliated companies.

Copyright 1996 by Digital Equipment Corporation. All rights reserved.

The following are trademarks of Digital Equipment Corporation: AlphaGeneration, AlphaServer, OpenVMS, StorageWorks, the AlphaGeneration logo, and the DIGITAL logo.

The following are third-party trademarks: UNIX is a registered trademark in the U.S. and other countries, licensed exclusively through X/Open Company Ltd. Windows NT is a trademark of Microsoft, Inc. All other trademarks and registered trademarks are the property of their respective holders.

FCC Notice: The equipment described in this manual generates, uses, and may emit radio frequency energy. The equipment has been type tested and found to comply with the limits for a Class A digital device pursuant to Part 15 of FCC Rules, which are designed to provide reasonable protection against such radio frequency interference. Operation of this equipment in a residential area may cause interference, in which case the user at his own expense will be required to take whatever measures are required to correct the interference.

Shielded Cables: If shielded cables have been supplied or specified, they must be used on the system in order to maintain international regulatory compliance.

Warning! This is a Class A product. In a domestic environment this product may cause radio interference, in which case the user may be required to take adequate measures.

Achtung! Dieses ist ein Gerät der Funkstörgrenzwertklasse A. In Wohnbereichen können bei Betrieb dieses Gerätes Rundfunkstörungen auftreten, in welchen Fällen der Benutzer für entsprechende Gegenmaßnahmen verantwortlich ist.

Avertissement! Cet appareil est un appareil de Classe A. Dans un environnement résidentiel, cet appareil peut provoquer des brouillages radioélectriques. Dans ce cas, il peut être demandé à l'utilisateur de prendre les mesures appropriées.
Contents
Chapter 1  System Overview
  System Drawer
  Cabinet System
  Pedestal System
  Control Panel and Drives
  System Consoles
  System Architecture
  System Motherboard
  CPU Types
  Memory Modules
  Memory Addressing
  System Bus
  System Bus to PCI Bus Bridge Module
  PCI I/O Subsystem
  Server Control Module
  Power Control Module
  Power Supply
Chapter 2  Power-Up
  2.1  Control Panel
  2.2  Power-Up Sequence
  2.3  SROM Power-Up Test Flow
  2.4  SROM Errors Reported
  2.5  XSROM Power-Up Test Flow
  2.6  XSROM Errors Reported
  2.7  Console Power-Up Tests
  2.8  Console Device Determination
  2.9  Console Power-Up Display
  2.10  Fail-Safe Loader
Chapter 3  Troubleshooting
  3.1  Troubleshooting with LEDs
    3.1.1  Cabinet Power and Fan LEDs
  3.2  Troubleshooting Power Problems
    3.2.1  Power Control Module LEDs
  3.3  Maintenance Bus (I2C Bus)
  3.4  Running Diagnostics - Test Command
  3.5  Testing an Entire System
    3.5.1  Testing Memory
    3.5.2  Testing PCI
Chapter 4  Power System
  4.1  Power Supply
  4.2  Power Control Module Features
  4.3  Power Circuit and Cover Interlocks
  4.4  Power-Up/Down Sequence
  4.5  Cabinet Power Configuration Rules
  4.6  Pedestal Power Configuration Rules (North America and Japan)
  4.7  Pedestal Power Configuration Rules (Europe and Asia Pacific)
Chapter 5  Error Logs
  5.1  Using Error Logs
    5.1.1  Hard Errors
    5.1.2  Soft Errors
    5.1.3  Error Log Events
  5.2  Using DECevent
    5.2.1  Translating Event Files
    5.2.2  Filtering Events
    5.2.3  Selecting Alternative Reports
  5.3  Error Log Examples and Analysis
    5.3.1  MCHK 670 CPU-Detected Failure
    5.3.2  MCHK 670 CPU and IOD Detected Failure
    5.3.3  MCHK 670 Read Dirty CPU Detected Failure
    5.3.4  MCHK 660 IOD-Detected Failure
    5.3.5  MCHK 630 Correctable CPU Error
    5.3.6  MCHK 620 Correctable Error
  5.4  Troubleshooting IOD-Detected Errors
    5.4.1  System Bus ECC Error
    5.4.2  System Bus Nonexistent Address Error
    5.4.3  System Bus Address Parity Error
    5.4.4  PIO Buffer Overflow Error (PIO_OVFL)
    5.4.5  Page Table Entry Invalid Error
    5.4.6  PCI Master Abort
    5.4.7  PCI System Error
    5.4.8  PCI Parity Error
    5.4.9  Broken Memory
    5.4.10  Command Codes
    5.4.11  Node IDs
  5.5  Double Error Halts and Machine Checks While in PAL Mode
    5.5.1  PALcode Overview
    5.5.2  Double Error Halt
    5.5.3  Machine Checks While in PAL
Chapter 6  Error Registers
  6.1  External Interface Status Register - EI_STAT
    6.1.1  External Interface Address Register - EI_ADDR
    6.1.2  MC Error Information Register 0
    6.1.3  MC Error Information Register 1
    6.1.4  CAP Error Register
    6.1.5  PCI Error Status Register 1
Chapter 7  Removal and Replacement
  7.1  System Safety
  7.2  FRU List
  7.3  Power System FRUs
  7.4  System Drawer Exposure (Cabinet)
  7.5  System Drawer Exposure (Pedestal)
  7.6  CPU Removal and Replacement
  7.7  CPU Fan Removal and Replacement
  7.8  Memory Removal and Replacement
  7.9  Power Control Module Removal and Replacement
  7.10  System Bus to PCI Bus Bridge Module Removal and Replacement
  7.11  System Motherboard Removal and Replacement
  7.12  PCI Motherboard Removal and Replacement
  7.13  Server Control Module Removal and Replacement
  7.14  PCI/EISA Option Removal and Replacement
  7.15  Power Supply Removal and Replacement
  7.16  Power Harness Removal and Replacement
  7.17  System Drawer Fan Removal and Replacement
  7.18  Cover Interlock Removal and Replacement
  7.19  Operator Control Panel Removal and Replacement (Cabinet)
  7.20  Operator Control Panel Removal and Replacement (Pedestal)
  7.21  Floppy Removal and Replacement
  7.22  CD-ROM Removal and Replacement
  7.23  Cabinet Fan Tray Removal and Replacement
  Cabinet Fan Tray Power Supply Removal and Replacement
  Cabinet Fan Tray Fan Removal and Replacement
  Cabinet Fan Tray Fan Fail Detect Module Removal and Replacement
  StorageWorks Shelf Removal and Replacement
Appendix A  Running Utilities
  A.1  Running Utilities from a Graphics Monitor
  A.2  Running Utilities from a Serial Terminal
  A.3  Running ECU
  A.4  Running RAID Standalone Configuration Utility
  A.5  Updating Firmware with LFU
    A.5.1  Updating Firmware from the Internal CD-ROM
    A.5.2  Updating Firmware from the Internal Floppy Disk - Creating the Diskettes
    A.5.3  Updating Firmware from the Internal Floppy Disk - Performing the Update
    A.5.4  Updating Firmware from a Network Device
    A.5.5  LFU Commands
  A.6  Updating Firmware from AlphaBIOS
  A.7  Upgrading AlphaBIOS
Appendix B  SRM Console Commands and Environment Variables
  B.1  Summary of SRM Console Commands
  B.2  Summary of SRM Environment Variables
  B.3  Recording Environment Variables
Appendix C  Operating the System Remotely
  C.1  RCM Console Overview
    C.1.1  Modem Usage
    C.1.2  Entering and Leaving Command Mode
    C.1.3  RCM Commands
    C.1.4  Dial-Out Alerts
    C.1.5  Resetting the RCM to Factory Defaults
    C.1.6  Troubleshooting Guide
    C.1.7  Modem Dialog Details
Index
Examples
  2-1  SROM Errors Reported at Power-Up
  2-2  XSROM Errors Reported at Power-Up
  2-3  Power-Up Display
  3-1  Test Command Syntax
  3-2  Sample Test Command
  3-3  Sample Test Memory Command
  3-4  Sample Test Command for PCI
  5-1  MCHK 670
  5-2  MCHK 670 CPU and IOD-Detected Failure
  5-3  MCHK 670 Read Dirty Failure
  5-4  MCHK 660 IOD Detected Failure
  5-5  MCHK 630 Correctable CPU Error
  5-6  MCHK 620 Correctable Error
  5-7  INFO 3 Command
  5-8  INFO 5 Command
  5-9  INFO 8 Command
  A-1  Starting LFU from the SRM Console
  A-2  Updating Firmware from the Internal CD-ROM
  A-3  Creating Update Diskettes on an OpenVMS System
  A-4  Updating Firmware from the Internal Floppy Disk
  A-5  Selecting AS4X00FW to Update Firmware from the Internal Floppy Disk
  A-6  Updating Firmware from a Network Device
  C-1  Sample Remote Dial-In Dialog
  C-2  Entering and Leaving RCM Command Mode
  C-3  Configuring the Modem for Dial-Out Alerts
  C-4  Typical RCM Dial-Out Command
Figures
  1-1  Components of the BA30A System Drawer
  1-2  Cover Interlock Circuit
  1-3  AlphaServer 4100 Cabinet System
  1-4  Cabinet Fan Tray
  1-5  Pedestal System Front
  1-6  Pedestal System Rear
  1-7  Control Panel Assembly
  1-8  Architecture Diagram
  1-9  System Motherboard
  1-10  CPU Module Layout and Placement
  1-11  Memory Module Layout and Placement
  1-12  How Memory Addressing Is Calculated
  1-13  System Bus Block Diagram and Slot Designation
  1-14  Bridge Module
  1-15  PCI Motherboard
  1-16  PCI I/O Subsystem Block Diagram
  1-17  Server Control Module
  1-18  Power Control Module
  1-19  Location of Power Supply
  2-1  Control Panel and LCD Display
  2-2  Power-Up Flow
  2-3  Contents of FEPROMs
  2-4  Console Code Critical Path
  2-5  SROM Power-Up Test Flow
  2-6  XSROM Power-Up Flowchart
  2-7  Console Device Determination Flowchart
  3-1  CPU and Bridge Module LEDs
  3-2  Cabinet Power and Fan LEDs
  3-4  PCM LEDs
  3-5  I2C Bus Block Diagram
  4-1  Power Supply Outputs
  4-2  Power Control Module
  4-3  Power Circuit Diagram
  4-4  Power Up/Down Sequence Flowchart
  4-5  Simple Cabinet Power Configuration
  4-6  Worst-Case Cabinet Power Configuration
  4-7  Pedestal Power Distribution (N.A. and Japan)
  4-8  Pedestal Power Distribution (Europe and AP)
  5-1  Error Detector Placement
  7-1  System Drawer FRU Locations
  7-2  Location of Power System FRUs
  7-3  Exposing System Drawer (Cabinet)
  7-4  Exposing System Drawer (Pedestal)
  7-5  Removing CPU Module
  7-6  Removing CPU Fan
  7-7  Removing Memory Module
  7-8  Removing Power Control Module
  7-9  Removing System Bus to PCI Bus Bridge Module
  7-10  Removing System Motherboard
  7-11  Replacing PCI Motherboard
  7-12  Removing Server Control Module
  7-13  Removing PCI/EISA Option
  7-14  Removing Power Supply
  7-15  Removing Power Harness
  7-16  Removing System Drawer Fan
  7-17  Removing Cover Interlocks
  7-18  Removing OCP (Cabinet)
  7-19  Removing OCP (Pedestal)
  7-20  Removing Floppy Drive
  7-21  Removing CD-ROM
  7-22  Removing Cabinet Fan Tray
  7-23  Removing Cabinet Fan Tray Power Supply
  7-24  Removing Cabinet Fan Tray Fan
  7-25  Removing Fan Tray Fan Fail Detect Module
  7-26  Removing StorageWorks Shelf
  A-1  Running a Utility from a Graphics Monitor
  A-2  Starting LFU from the AlphaBIOS Console
  A-3  AlphaBIOS Setup Screen
  C-1  RCM Connections
Tables
  1-1  PCI Motherboard Slot Numbering
  2-1  Control Panel Display
  2-2  SROM Tests
  2-3  XSROM Tests
  2-4  Memory Tests
  2-5  IOD Tests
  2-6  PCI Motherboard Tests
  3-1  Power Control Module LED States
  5-1  Types of Error Log Events
  5-2  DECevent Report Formats
  5-3  CAP Error Register Data Pattern
  5-4  System Bus ECC Error Data Pattern
  5-5  System Bus Nonexistent Address Error Troubleshooting
  5-6  Address Parity Error Troubleshooting
  5-7  Cause of PIO_OVFL Error
  5-8  ECC Syndrome Bits Table
  5-9  Decoding Commands
  5-10  Node IDs
  6-1  External Interface Status Register
  6-2  Loading and Locking Rules for External Interface Registers
  6-3  MC Error Information Register 0
  6-4  MC Error Information Register 1
  6-5  CAP Error Register
  6-6  PCI Error Status Register 1
  7-1  Field-Replaceable Unit Part Numbers
  A-1  AlphaBIOS Option Key Mapping
  A-2  File Locations for Creating Update Diskettes on a PC
  A-3  LFU Command Summary
  Summary of SRM Console Commands
  Environment Variable Summary
  Environment Variables Worksheet
  RCM Command Summary
  RCM Status Command Fields
  RCM Troubleshooting
  RCM/Modem Interchange Summary
Preface
Intended Audience
This manual is written for the customer service engineer.
Document Structure
This manual uses a structured documentation design. Topics are organized into small sections for efficient online and printed reference. Each topic begins with an abstract. You can quickly gain a comprehensive overview by reading only the abstracts. Next is an illustration or example, which also provides quick reference. Last in the structure are descriptive text and syntax definitions.

This manual has seven chapters and three appendixes, as follows:

Chapter 1, System Overview, introduces the AlphaServer 4100 pedestal and cabinet systems and gives a brief overview of the system bus modules.
Chapter 2, Power-Up, provides information on how to interpret the power-up display on the operator control panel, the console screen, and the system LEDs. It also describes how the hardware diagnostic programs are executed when the system is initialized.
Chapter 3, Troubleshooting, describes troubleshooting during power-up and booting, as well as the test command.
Chapter 4, Power System, describes the AlphaServer 4100 power system.
Chapter 5, Error Logs, explains how to interpret error logs and how to use DECevent.
Chapter 6, Error Registers, describes the error registers used to hold error information.
Chapter 7, Removal and Replacement, describes removal and replacement procedures for field-replaceable units (FRUs).
Appendix A, Running Utilities, explains how to run utilities such as the EISA Configuration Utility and RAID Standalone Configuration Utility.
Appendix B, SRM Console Commands and Environment Variables, summarizes the commands used to examine and alter the system configuration.
Appendix C, Operating the System Remotely, describes how to use the remote console monitor (RCM) to monitor and control the system remotely.
Documentation Titles
Table 1 lists titles related to AlphaServer 4100 systems.
System Overview
Figure 1-1 Components of the BA30A System Drawer (drawing PK-0702-96)
When the system drawer is in a pedestal, the control panel assembly is mounted in a tray at the top of the drawer. The numbered callouts in Figure 1-1 refer to components of the system drawer.
1. System card cage, which holds the system motherboard and the CPU, memory, bridge, and power control modules.
2. PCI card cage, which holds the PCI motherboard, option cards, and server control module.
3. Server control module, which holds the I/O connectors and remote console monitor.
4. Control panel assembly, which includes the control panel, a floppy drive, and a CD-ROM drive.
5. Power and cooling section, which contains one to three power supplies and three fans.
Cover Interlocks

The system drawer has three cover interlocks: one for the system bus card cage, one for the PCI card cage, and one for the power and system fan area.
[Figure: cover interlock circuit. Labels: 3 Interlock Switches; 170429401 To OCP; RSM_DC_EN_L.]
NOTE: The cover interlocks must be engaged to enable power-up. To override the cover interlocks, find a suitable object to close the interlock circuit.
Cabinet System Fan Tray
At the top of cabinet systems is a fan tray containing three exhaust fans, a small 12-volt power supply, and a module that distributes power to the server control module in each drawer.
In the pedestal system, the control panel is located at the top left in a tray. See Figure 1-7. There is space for an optional device beside it.
Control Panel
On/Off button. Powers the system drawer on or off. When the LED at the top of the button is lit, the power is on. The On/Off button is connected to the power supplies and the system interlocks.
NOTE: The LEDs on some modules are on when the line cord is plugged in, regardless of the position of the On/Off button.
Halt button. Pressing this button in (so the LED at the top of the button is on) does the following:
If Digital UNIX or OpenVMS is running, halts the operating system and returns to the SRM console. The Halt button has no effect on Windows NT.
If the Halt button is in when the system is reset or powered up, the system halts in the SRM console, regardless of the operating system. Digital UNIX and OpenVMS systems that are configured for autoboot will not boot if the Halt button is in. Windows NT systems halt in the SRM console; AlphaBIOS is not loaded and started.
If you press the Halt button in (LED on) and do not issue commands that disturb the system state, entering the continue command returns the system to the operating system it was running. To return to console mode again, press the Halt button (LED off) and then press it again (LED on).
If the system is hung, pressing the Halt button (LED on) usually brings up the SRM console. Enter the crash command to do a crash dump. If pressing the Halt button does not bring up the SRM console, there is a problem with the machine state and you will need to reset your system.
Reset button. Initializes the system drawer. If the Halt button is pressed (LED on) when the system is reset, the SRM console is loaded and remains in the system regardless of any other conditions.
Control panel display. Indicates status during power-up and self-test. The OCP display is a 16-character LCD. Its controller is on the XBUS on the PCI motherboard. While the operating system is running, the display shows the system type by default. This message can be changed by the user.
CD-ROM drive. The CD-ROM drive is used to load software, firmware, and updates. Its controller is on PCI1 on the PCI motherboard.
Floppy disk drive. The floppy drive is used to load software and firmware updates. The floppy controller is on the XBUS on the PCI motherboard.
SRM Console Prompt
On systems running the Digital UNIX or OpenVMS operating system, the following console prompt is displayed after system startup messages are displayed, or whenever the SRM console is invoked:
P00>>>
NOTE: The console prompt displays only after the entire power-up sequence is complete. This can take up to several minutes if the memory is very large.
AlphaBIOS Boot Menu
On systems running the Windows NT operating system, the Boot menu is displayed when the AlphaBIOS console is invoked:
AlphaBIOS Version 5.12
Please select the operating system to start:
Windows NT Server 3.51
Use the up and down arrow keys to move the highlight to your choice. Press Enter to choose.
Press <F2> to enter SETUP
SRM Console
The SRM console is a command-line interface that is used to boot the Digital UNIX and OpenVMS operating systems. It also provides support for examining and modifying the system state and for configuring and testing the system. The SRM console can be run from a serial terminal or a graphics monitor.
AlphaBIOS Console
The AlphaBIOS console is a menu-based interface that supports the Microsoft Windows NT operating system. AlphaBIOS is used to set up operating system selections, boot Windows NT, and display information about the system configuration. The EISA Configuration Utility and the RAID Standalone Configuration Utility are run from the AlphaBIOS console. AlphaBIOS runs on either a serial or graphics terminal, but Windows NT requires a graphics monitor.
Environment Variables
Environment variables are software parameters that define, among other things, the system configuration. They are used to pass information to different pieces of software running in the system at various times. The os_type environment variable, which can be set to VMS, UNIX, or NT, determines which of the two consoles is to be used. The SRM console is always brought into memory, but AlphaBIOS is loaded only if os_type is set to NT and the Halt button is out (not lit).
Refer to Appendix B of this guide for a list of the environment variables used to configure AlphaServer 4100 systems. Refer to the AlphaServer 4100 System Drawer User's Guide for information on setting environment variables.
Keep a record of the environment variables for each system that you service. Some environment variable settings are lost when a module is swapped and must be restored after the new module is installed. Refer to Appendix B for a convenient worksheet for recording environment variable settings.
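The console selection rule described above can be summarized in a short sketch. This is an illustrative Python model, not firmware code; the function name and the lowercase spellings it accepts for os_type are assumptions made for the example.

```python
def console_after_power_up(os_type: str, halt_button_in: bool) -> str:
    """Model which console is left running after power-up or reset.

    Hypothetical helper for illustration only. The SRM console is always
    brought into memory; AlphaBIOS is loaded only when os_type is NT and
    the Halt button is out (LED not lit).
    """
    value = os_type.strip().lower()
    if value not in {"vms", "openvms", "unix", "nt"}:
        raise ValueError(f"unexpected os_type: {os_type!r}")
    if value == "nt" and not halt_button_in:
        return "AlphaBIOS"
    return "SRM"
```

For example, a Windows NT system powered up with the Halt button in stays in the SRM console, which matches the Halt button description earlier in this chapter.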
[Figure: system block diagram. Memory pairs connect over the system bus (128-bit data bus plus 16 ECC bits, 40-bit command/address bus) to the bridge module. System to PCI bus bridge 0 connects to PCI bus 0 (64 bits: EISA bridge, one PCI slot, three PCI/EISA slots). System to PCI bus bridge 1 connects to PCI bus 1 (64 bits: four PCI slots). Both PCI buses are on the PCI motherboard.]
AlphaServer 4100 systems use the Alpha chip for the CPU. The CPU, memory, and an I/O bridge module to PCI/EISA I/O buses are connected to the system bus motherboard. A fourth type of module, the power control module, also plugs into the system motherboard. A fully configured system drawer can have up to four CPUs, four memory pairs, and a total of eight I/O options. The I/O options can be all PCI options or a combination of PCI options and EISA options, but there can be no more than three EISA options.
The system bus has a 128-bit data bus protected by 16 bits of ECC and a 40-bit command/address bus protected by parity. The bus speed depends on the speed of the CPU in slot 0, which provides the clock for the buses. The 40-bit address bus can create one terabyte of addresses. The bus connects CPUs, memory, and the system bus to PCI bus bridge.
The CPU modules are available with and without an external (onboard) cache. The Alpha chip has an 8-Kbyte instruction cache (I-cache), an 8-Kbyte write-through data cache (D-cache), and a 96-Kbyte write-back secondary data cache (S-cache). The cache system is write-back. The system drawer supports up to four CPUs.
The memory modules are placed on the system motherboard in pairs. Each module drives half of the system bus, along with the associated ECC bits. Memory pairs consist of two modules that are the same size and type. Two types are available: synchronous and asynchronous (EDO) memory.
The system bus to PCI bus bridge module translates system bus commands and data addressed to I/O space to PCI commands and data. It also translates PCI bus commands and data addressed to system memory or CPUs to system bus commands and data. The PCI bus is a 64-bit wide bus used for I/O.
The power control module, which is on the system motherboard, monitors power and the system environment.
The system motherboard has the logic for the system bus. It is the backplane that holds the CPU, memory, bridge, and power control modules. Figure 1-9 shows the locations of these modules and of connectors on the motherboard.
Alpha Chip Composition
The Alpha chip is made using state-of-the-art chip technology, has a transistor count of 9.3 million, consumes 50 watts of power, and is air cooled (a fan is on the chip). The default cache system is write-back; when the module has an external cache, that cache is also write-back.

Chip Description
Unit          Description
Instruction   8-Kbyte cache, 4-way issue
Execution     4-way execution; 2 integer units, 1 floating-point adder, 1 floating-point multiplier
Memory        Merge logic, 8-Kbyte write-through first-level data cache, 96-Kbyte write-back second-level data cache, bus interface unit

CPU Variants
Module Variant   Clock Frequency   Onboard Cache
B3001-CA         300 MHz           None
B3002-AB         300 MHz           2 Mbytes
B3004-AA         400 MHz           4 Mbytes
CPU Configuration Rules
The first CPU must be in CPU slot 0 to provide the system clock.
Additional CPU modules should be installed in ascending order by slot number.
All CPUs must have the same Alpha chip clock speed. The system bus will hang without an error message if the oscillators clocking the CPUs are different.
Mixing of cached and uncached CPUs is not supported.
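The configuration rules above lend themselves to a simple pre-installation check. The following Python sketch is illustrative only: the function name, the dictionary layout, and the reading of "ascending order" as "consecutive slots starting at 0" are assumptions, while the variant data comes from the CPU Variants table.

```python
from typing import Dict, List

# Clock frequency (MHz) and onboard cache (Mbytes) per module variant,
# taken from the CPU Variants table.
CPU_VARIANTS = {
    "B3001-CA": (300, 0),
    "B3002-AB": (300, 2),
    "B3004-AA": (400, 4),
}


def check_cpu_config(slots: Dict[int, str]) -> List[str]:
    """Return a list of rule violations for a {slot: variant} mapping.

    Hypothetical helper: an empty list means no rule is violated.
    """
    if not slots:
        return ["no CPU modules installed"]
    for slot, variant in slots.items():
        if slot not in range(4):
            raise ValueError(f"no such CPU slot: {slot}")
        if variant not in CPU_VARIANTS:
            raise ValueError(f"unknown CPU variant: {variant}")

    problems: List[str] = []
    if 0 not in slots:
        problems.append("CPU slot 0 is empty; it must provide the system clock")
    # Assumption: ascending order means slots 0, 1, 2, ... with no gaps.
    if sorted(slots) != list(range(len(slots))):
        problems.append("CPU modules are not in ascending consecutive slots")
    clocks = {CPU_VARIANTS[v][0] for v in slots.values()}
    if len(clocks) > 1:
        problems.append("CPU clock speeds differ; the system bus can hang")
    cached = {CPU_VARIANTS[v][1] > 0 for v in slots.values()}
    if len(cached) > 1:
        problems.append("cached and uncached CPUs are mixed")
    return problems
```

A call such as check_cpu_config({0: "B3002-AB", 1: "B3002-AB"}) returns an empty list, while mixing B3001-CA with B3002-AB reports the cached/uncached violation.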
Color Codes
The top edge of each CPU module variant is color coded for easy identification.
Option Number   Description
B3001-CA        300 MHz, uncached
B3002-AB        300 MHz, 2MB cached
B3004-AA        400 MHz, 4MB cached
Memory Variants
Each memory option consists of two identical modules. Each drawer supports up to four memory options, for a total of 4 Gbytes of memory. Memory modules are used only in pairs and are available in 128-Mbyte, 512-Mbyte, and 1-Gbyte sizes. The 128-Mbyte option is synchronous memory, while the larger sizes are asynchronous memory (EDO).

Option     Size     Module     Type            DRAM Number   DRAM Size
MS320-CA   128 MB   B3020-CA   Synch.          36            4 MB x 4
MS330-EA   512 MB   B3030-EA   Asynch. (EDO)   144           4 MB x 4
MS330-FA   1 GB     B3030-FA   Asynch. (EDO)   72            16 MB x 4
Memory Operation
Memory modules are used only in pairs; each module provides half the data, or 64 bits plus 8 ECC bits, of the octaword (16 bytes) transferred on the system bus.
NOTE: Modules in slots MEMxL drive the lower 8 bytes, and modules in slots MEMxH drive the higher 8 bytes of the 16-byte transfer.
Unless otherwise programmed, memory drives the system bus in bursts. Upon each memory fetch, data is transferred in 4 consecutive cycles, transferring 64 bytes. There are situations, however, when memories made with EDO DRAMs cannot provide data fast enough to complete the system bus transactions. When these situations arise, EDO type memories assert a signal that causes the system bus to stall for one (occasionally more) clock tick. When memory completes such an operation, it releases the system bus.
Memory Configuration Rules
In a system, memories of different sizes and types are permitted, but:
Memory modules are installed and used in pairs.
Both modules in a memory pair must be of the same size and type.
The largest memory pair must be in slots MEM 0L and MEM 0H.
Other memory pairs must be the same size or smaller than the first memory pair.
Memory pairs must be installed in consecutive slots.
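The memory configuration rules can be checked mechanically before power-up. The sketch below is a minimal Python illustration; the function name, the tuple layout for a module, and the type labels "sync" and "edo" are assumptions made for the example and are not console or firmware interfaces.

```python
from typing import Dict, List, Tuple

Module = Tuple[int, str]       # (size in Mbytes, memory type label)
Pair = Tuple[Module, Module]   # (module in MEMxL, module in MEMxH)


def check_memory_config(pairs: Dict[int, Pair]) -> List[str]:
    """Return rule violations for a {pair slot number: (low, high)} mapping."""
    if not pairs:
        return ["no memory pairs installed"]

    problems: List[str] = []
    # Both modules in a pair must match in size and type.
    for slot, (low, high) in sorted(pairs.items()):
        if low != high:
            problems.append(f"MEM{slot}: L and H modules differ in size or type")
    # Pairs must occupy consecutive slots; MEM0 must be populated.
    if sorted(pairs) != list(range(len(pairs))):
        problems.append("memory pairs are not in consecutive slots from MEM0")
    # No pair may be larger than the pair in MEM0.
    if 0 in pairs:
        first = sum(module[0] for module in pairs[0])
        for slot, pair in sorted(pairs.items()):
            if sum(module[0] for module in pair) > first:
                problems.append(f"MEM{slot} is larger than the pair in MEM0")
    return problems
```

An empty result means the layout satisfies the rules as modeled here; each string in a non-empty result names one violated rule.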
[Figure: memory addressing example. The first pair (2 B3020-EA, 256 Mbyte/module) starts at address 0, defines the 512-Mbyte address space allotted to each pair, and is always fully occupied. The second pair (2 B3020-DA, 128 Mbyte/module) occupies half of its 512-Mbyte address space.]
The rules for addressing memory are as follows:
1. Address space is determined by the memory pair in slot MEM0.
2. Memory pairs need not be the same size.
3. The memory pair in slot MEM0 must be the largest of all memory pairs. Other memory pairs may be as large but none may be larger.
4. The starting address of each memory pair is N times the size of the memory pair in slot MEM0. N=0,1,2,3.
5. Memory addresses are contiguous within each module pair.
6. If memory pairs are of different sizes, memory holes can occur in the physical address space. See Figure 1-12.
7. Software creates contiguous virtual memory even though physical memory may not be contiguous.
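The addressing rules can be worked through numerically. The following Python sketch computes the starting address of each pair and any resulting holes; the function name and the choice to express addresses in Mbytes are assumptions for illustration. Unused space above the last installed pair is not counted as a hole.

```python
from typing import List, Tuple

Region = Tuple[int, int, int]   # (pair number N, start Mbyte, end Mbyte)
Hole = Tuple[int, int]          # (start Mbyte, end Mbyte)


def memory_map(pair_sizes_mb: List[int]) -> Tuple[List[Region], List[Hole]]:
    """Lay out memory pairs MEM0..MEMn using the addressing rules.

    pair_sizes_mb[N] is the size of the pair in slot MEMN. Each pair
    starts at N times the size of the pair in MEM0.
    """
    if not pair_sizes_mb:
        raise ValueError("at least one memory pair is required")
    stride = pair_sizes_mb[0]
    regions: List[Region] = []
    holes: List[Hole] = []
    last = len(pair_sizes_mb) - 1
    for n, size in enumerate(pair_sizes_mb):
        if size > stride:
            raise ValueError(f"MEM{n} is larger than the pair in MEM0")
        start = n * stride
        regions.append((n, start, start + size))
        # A smaller pair leaves a gap before the next pair's start address.
        if size < stride and n < last:
            holes.append((start + size, start + stride))
    return regions, holes
```

With a 512-Mbyte pair in MEM0 followed by two 256-Mbyte pairs, the second pair starts at 512 Mbytes, the third at 1024 Mbytes, and a hole appears between 768 and 1024 Mbytes.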
[Figure: system bus block diagram. Labels: CPU0 through CPU3 (Alpha chip, EV_ADR, EV_DATA); MEM0 through MEM3 (synchronous DRAMs, memory control, row and column address); control and arbitration; system to PCI bus bridge (IOD0, IOD1); PCI0 with PCI/EISA, and PCI1.]
The system bus connects up to four CPUs, four pairs of memory modules, and the system bus to PCI bus bridge. It is implemented on the motherboard for the system drawer. The power control module also connects to the board.
The bus consists of a 40-bit command/address bus, a 128-bit plus ECC data bus, several control signals, clocks, and a bus arbiter. The bus requires that all CPUs have the same high-speed oscillator providing the clock to the Alpha chip.
The system bus clock is provided by an oscillator on the CPU in slot CPU0. This oscillator has a 1:5 ratio to the Alpha chip. With 300 MHz CPUs, for example, the system bus operates at 60 MHz.
The system bus motherboard initiates memory refresh transactions. 5-volt and 3.43-volt power is provided directly to the module from the power supplies.
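The clock ratio and address width stated above translate into simple arithmetic. This Python sketch is illustrative; the constant and function names are assumptions. Only the 300 MHz case (60 MHz bus) is stated in this manual; results for other CPU speeds simply follow from the 1:5 ratio.

```python
CPU_TO_BUS_CLOCK_RATIO = 5   # bus clock is one fifth of the Alpha chip clock
ADDRESS_BITS = 40            # width of the command/address bus


def system_bus_mhz(cpu_mhz: float) -> float:
    """System bus clock derived from the clock of the CPU in slot CPU0."""
    return cpu_mhz / CPU_TO_BUS_CLOCK_RATIO


def address_space_bytes(bits: int = ADDRESS_BITS) -> int:
    """Number of byte addresses reachable with the given address width."""
    return 1 << bits


if __name__ == "__main__":
    print(system_bus_mhz(300))              # bus clock for 300 MHz CPUs
    print(address_space_bytes() // 2**40)   # address space in terabytes
```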
[Figure: system bus to PCI bus bridge block diagram. Labels: CAP; MDPA; MDPB; AD<31:0>; AD<63:32>; Data A to B bus.]
The system bus to PCI bus bridge module converts system bus commands and data addressed to I/O space to PCI commands and data, and converts PCI bus commands and data addressed to system memory or CPUs to system bus commands and data. The bridge has two major components:
Command/address processor (CAP) chip
Two data path chips (MDPA and MDPB)
There are two sets of these three chips, one set on each side of the module. Each set bridges to one of the PCI buses on the PCI motherboard.
The interface on the system bus side of the bridge responds to system bus commands addressed to the upper 64 Gbytes of I/O space. I/O space is addressed whenever bit <39> on the system bus address lines is set. The space so defined is 512 Gbytes in size. The first 448 Gbytes are reserved and the last 64 Gbytes, when bits <38:36> are set, are mapped to the two PCI I/O buses.
The interface on the PCI side of the bridge responds to commands addressed to CPUs and memory on the system bus. On the PCI side, the bridge provides the interface to two PCIs, PCI0 and PCI1. Each PCI bus is addressed separately. The bridge does not respond to devices communicating with each other on the same PCI bus. However, should a device on one PCI bus address a device on the other PCI bus, commands, addresses, and data run through the bridge out onto the system bus and back through the bridge to the other PCI bus.
In addition to its bridge function, the system bus to PCI bus bridge module monitors every transaction on the system bus for errors. It monitors the data lines for ECC errors and the command/address lines for parity errors.
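The I/O space decoding described above can be expressed as a few bit tests. The Python sketch below is an illustration of that decoding only; the function name and the returned labels are assumptions, and the sketch does not model how the 64-Gbyte window is divided between the two PCI buses.

```python
IO_SPACE_BIT = 1 << 39            # bit <39> set selects I/O space
PCI_WINDOW_MASK = 0b111 << 36     # bits <38:36> all set select the PCI window
ADDRESS_LIMIT = 1 << 40           # 40-bit system bus address


def classify_address(addr: int) -> str:
    """Classify a 40-bit system bus address as memory, reserved, or PCI."""
    if not 0 <= addr < ADDRESS_LIMIT:
        raise ValueError("address does not fit in 40 bits")
    if not addr & IO_SPACE_BIT:
        return "memory"
    if addr & PCI_WINDOW_MASK == PCI_WINDOW_MASK:
        return "pci"        # last 64 Gbytes of I/O space
    return "reserved"       # first 448 Gbytes of I/O space


if __name__ == "__main__":
    for addr in (0x0, 0x80_0000_0000, 0xF0_0000_0000):
        print(hex(addr), classify_address(addr))
```

The three addresses in the usage lines are the lowest memory address, the lowest I/O space address, and the lowest address in the PCI window.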
[Figure: PCI motherboard slot locations. Slot labels shown:]
PCI 1 slot 5
PCI 1 slot 4
PCI 1 slot 3
PCI 1 slot 2
PCI 0 slot 5
PCI 0 slot 4 or EISA slot 3
PCI 0 slot 3 or EISA slot 2
PCI 0 slot 2 or EISA slot 1
The logic for two PCI buses is on the PCI motherboard. PCI0 is a 64-bit bus with a PCI to EISA bus bridge. PCI0 has one dedicated PCI slot and three shared PCI/EISA slots. PCI1 is a 64-bit bus that has four PCI slots.
The PCI motherboard has cable connections to remote I/O (mouse, keyboard, serial port, and parallel port), an internal floppy drive, an internal CD-ROM drive, the control panel, and 5V power. Also on this module are the chips for the PCI to EISA bridge and the internal CD-ROM controller. An 8-bit XBUS is connected to the EISA bus. On this bus there are an interface to the system I2C bus; mouse and keyboard support; an I/O combo controller supporting two serial ports, the floppy controller, and a parallel port; a real-time clock; two 1-Mbyte flash ROMs containing system firmware; and an 8-Kbyte NVRAM.
The server control module has two sections: the remote console monitor (RCM) and the standard I/O. See Appendix C for information on controlling the system remotely. The remote console monitor connects to a modem through the modem port on the bulkhead. The RCM requires a 12V power connection. The standard I/O ports (keyboard, mouse, COM1 and COM2 serial, and parallel ports) are on the same bulkhead.
The power control module performs these functions:
Controls power sequencing.
Monitors the combined output of power supplies and shuts down power if it is not in range.
Monitors system temperature and shuts off power if it is out of range.
Monitors the fans in the system drawer and on the CPU modules and shuts down power if a fan fails.
Provides visual indication of faults through LEDs.
Description
One to three power supplies provide power to components in the system drawer. (They supply power only for the drawer in which they are located.) These power supplies share the load, and redundant configurations are supported. They autoselect line voltage (120V to 240V). Each has 450 W output and supplies up to 75A of 3.43V, 50A of 5.0V, 11A of 12V, and small amounts of -5V, -12V, and auxiliary voltage (Vaux).
NOTE: The LEDs on some modules are on when the line cord is plugged in, regardless of the position of the On/Off button.
Configuration
Systems with one or two CPUs require one power supply (two for redundancy).
Systems with three or four CPUs require two power supplies (three for redundancy).
Power supply 0 is installed first, power supply 2 second, and power supply 1 third. See Figure 1-19. (The power supply numbering shown here corresponds to the numbering displayed by the SRM console's show power command.)
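The power supply configuration rules reduce to a small lookup. The Python sketch below is illustrative; the function names are assumptions, and the numbers come directly from the configuration rules above.

```python
from typing import Tuple

# Order in which power supply positions are populated.
INSTALL_ORDER: Tuple[int, ...] = (0, 2, 1)


def power_supplies_needed(cpu_count: int, redundant: bool = False) -> int:
    """Number of power supplies required for a drawer with cpu_count CPUs."""
    if not 1 <= cpu_count <= 4:
        raise ValueError("a system drawer holds one to four CPUs")
    required = 1 if cpu_count <= 2 else 2
    return required + 1 if redundant else required


def supply_positions(count: int) -> Tuple[int, ...]:
    """Positions to populate, in installation order, for count supplies."""
    if not 1 <= count <= len(INSTALL_ORDER):
        raise ValueError("a system drawer holds one to three power supplies")
    return INSTALL_ORDER[:count]
```

For a four-CPU drawer without redundancy, power_supplies_needed(4) gives 2, and supply_positions(2) gives positions 0 and 2.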
Chapter 2 Power-Up
This chapter describes system power-up testing and explains the power-up displays. The following topics are covered:
Control Panel
Power-Up Sequence
SROM Power-Up Test Flow
SROM Errors Reported
XSROM Power-Up Test Flow
XSROM Errors Reported
Console Power-Up Tests
Console Device Determination
Console Power-Up Display
Fail-Safe Loader
P0 TEST 11 CPU00
When the On/Off button LED is on, power is applied and the system is running. When it is off, the system is not running, but power may or may not be present. If power is present, the PCM or the power LED on the system bus to PCI bus bridge module should be flashing. Otherwise, there is a power problem. When the Halt button LED is lit and the On/Off button is on, the system should be running either the SRM console or Windows NT. If the Halt button is in, but the LED is off, the OCP, its cables, or the PCM is likely to be broken.
Meaning of the display fields:
CPU reporting status
Tests are executing
Failure has been detected
Machine check has occurred
Error interrupt has occurred
Test number (for Digital use only)

Suspected device         Meaning
CPU0-3                   CPU module number
MEM0-3 and L, H, or *    Memory pair number and low module, high module, or either
IOD0                     Bridge to PCI bus 0
IOD1                     Bridge to PCI bus 1
FROM0                    Flash ROM
COMBO                    COM controller
PCEB                     PCI-to-EISA bridge
ESC                      EISA system controller
NVRAM                    Nonvolatile RAM
TOY                      Real-time clock
I8242                    Keyboard and mouse controller
Power-Up/Reset
Definitions
SROM. The SROM is a 128-Kbit ROM on each CPU module. The ROM contains minimal diagnostics that test the Alpha chip and the path to the XSROM. Once the path is verified, it loads XSROM code into the Alpha chip and jumps to it.
XSROM. The XSROM, or extended SROM, contains back-up cache and memory tests, and a fail-safe loader. The XSROM code resides in sector 0 of FEPROM 0 on the XBUS. Sector 2 of FEPROM 0 contains a duplicate copy of the code and is used if sector 0 is bad.
FEPROM. Two 1-Mbyte programmable ROMs are on the XBUS on the PCI motherboard. FEPROM 0 contains two copies of the XSROM, the OpenVMS and Digital UNIX PALcode, and the SRM console and decompression code. FEPROM 1 contains the AlphaBIOS and NT PALcode. See Figure 2-3. These two FEPROMs can be flash updated. Refer to Appendix A.
For the console to run, the path from the CPU to the XSROM must be functional. The XSROM resides in FEPROM on the XBUS, off the EISA bus, off PCI 0, off IOD 0. See Figure 2-4. This path is minimally tested by SROM.
The SROM contents are loaded into each CPU's I-cache and executed on power-up/reset. After testing the caches on each processor chip, the SROM code tests the path to the XSROM. Once this path is tested and deemed reliable, layers of the XSROM are loaded sequentially into the processor chip on each CPU. None of the SROM or XSROM power-up tests are run from memory; all run from the caches in the CPU chip, thus providing excellent diagnostic isolation. Later, power-up tests run under the console are used to complete testing of the I/O subsystem.
There are two console programs: the SRM console and the AlphaBIOS console, as detailed in the AlphaServer 4100 System Drawer User's Guide (EK4100AUG). By default, the SRM console is always loaded and I/O system tests are run under it before the system loads AlphaBIOS. To load AlphaBIOS, the os_type environment variable must be set to NT and the Halt button must be out (LED not lit). Otherwise, the SRM console continues to run.
Figure 2-5
[Flowchart. Labels shown: D-cache errors; Determine Primary; Size IOD; Check integrity of XSROM; Loopback on each IOD; Light IOD LEDs; Load first 8K of XSROM into S-cache. Outcomes shown: Yes, No, Pass, Fail, HANG.]
The Alpha chip has a built-in self-test; it tests the I-cache at power-up and upon reset. Each CPU chip loads its SROM code into its I-cache and starts executing it. If the chip is partially functional, the SROM code continues to execute. However, if the chip cannot perform most of its functions, that CPU hangs and that CPU's pass/fail LED remains off.
If the system has more than one CPU and at least one passes both the SROM and XSROM power-up tests, the system will bring up the console. The console checks the FW_SCRATCH register, where evidence of the power-up failure is left. Upon finding the error, the console sends these messages to COM1 and the OCP:
COM1 (or VGA): Power-up tests have detected a problem with your system
OCP: Power-up failure
FEPROM Failures (PCI Motherboard Error) -XSROM -XSROM -XSROM -XSROM -XSROM -XSROM
[Flowchart: XSROM power-up test flow. Steps shown: XSROM banner to OCP/console device. Run B-cache tests; print errors to OCP/console device; done message to console device. Print memory information to console device; check for illegal memory configuration; print warnings to console device and OCP; initialize all memory pairs. Run memory tests; print trace to OCP/console device; print errors to OCP/console device; done message to console device. Primary verifies checksum of PAL/decompression/console code (Pass or Fail); primary unloads PAL/decompression code or the fail-safe loader, depending upon the results of the checksum. Secondaries are alerted that the console has started; they jump to and run PALcode, joining the console.]
Note: The XSROM can print to the console device only if the environment variable console is set to serial. It always sends output to the OCP.
XSROM tests are described in Table 2-3. Failure indicates a CPU failure.
After jumping to the primary CPU's S-cache, the code intentionally I-caches itself and is completely register based (no D-stream for stack or data storage is used). The only D-stream accesses are writes/reads during testing.
Each FEPROM has sixteen 64-Kbyte sectors. In FEPROM 0, the first sector contains B-cache tests, memory tests, and a fail-safe loader. The second sector contains PALcode. The third sector contains a copy of the first sector. The remaining thirteen sectors contain the SRM console and decompression code.
NOTE: Memory tests are run during power-up and reset (see Table 2-4). They are also affected by the state of the memory_test environment variable, which can have the following values:
Value     Meaning
FULL      Test all memory
PARTIAL   Test up to the first 256 Mbytes
NONE      Test 32 Mbytes
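The sector layout described above can be captured in a small lookup. The Python sketch below is illustrative only; the function names are hypothetical, and the byte offsets assume that sectors are laid out consecutively from offset 0 of the device, which this manual does not state explicitly.

```python
from typing import Tuple

SECTOR_SIZE = 64 * 1024   # each sector is 64 Kbytes
SECTOR_COUNT = 16         # sixteen sectors per FEPROM (1 Mbyte total)


def feprom0_sector_contents(sector: int) -> str:
    """Describe what FEPROM 0 holds in the given sector (0 to 15)."""
    if not 0 <= sector < SECTOR_COUNT:
        raise ValueError("sector must be 0 through 15")
    if sector == 0:
        return "XSROM: B-cache tests, memory tests, fail-safe loader"
    if sector == 1:
        return "PALcode"
    if sector == 2:
        return "XSROM backup copy"
    return "SRM console and decompression code"


def sector_byte_range(sector: int) -> Tuple[int, int]:
    """First and last byte offset of a sector (assumes consecutive layout)."""
    if not 0 <= sector < SECTOR_COUNT:
        raise ValueError("sector must be 0 through 15")
    start = sector * SECTOR_SIZE
    return start, start + SECTOR_SIZE - 1
```

Sixteen 64-Kbyte sectors add up to the 1-Mbyte FEPROM size given in the definitions earlier in this chapter.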
Test Name
B-cache Tag Data Line test
B-cache Tag March test
B-cache Data Line test
B-cache Data March test
B-cache Data ECC March test
CPU chip ECC Single/Double bit Error test
B-cache Tag Store Parity Error test
B-cache STAT Store Parity Error test

Logic Tested
Access to B-cache tags; shorts between tag data and its status and parity bits
B-cache tag store RAMs, B-cache STAT store RAMs
B-cache data lines to B-cache data RAMs, B-cache read/write logic
B-cache data RAMs, CPU chip B-cache control, CPU chip B-cache address decode, INDEX_H<2x:6> (address bus)
CPU chip ECC generation and checking logic, ECC lines from CPU chip to B-cache, B-cache ECC RAMs
Portion of B-cache data RAMs used for ECC
CPU chip ECC single-bit error detection and correction, ECC double-bit error detection, ECC error reporting
B-cache tag array, CPU parity detection, EI_ADDR and EI_STAT register operation
B-cache STAT array, CPU chip B-cache STAT parity generation/detection

Test numbers shown: 15, 16, 17, 18, 19
21
Address path to and from memory Address path on memory and RAMs No new logic
23*
Maps out bad memory by way of the bitmap. It does not completely fail memory. N/A
24
Memory Error (Memory Module Indicated)
20..21..
TEST ERR on cpu0                     #CPU running test
FRU: MEM1L                           #Low member of memory pair 1
err# c
tst# 21
22..23..24..Memory testing complete on cpu0

Memory Configuration Error (Operator Error)
ERR! mem_pair0 misconfigured
ERR! mem_pair1 card size mismatch
ERR! mem_pair1 card type mismatch
ERR! mem_pair1 EMPTY

FEPROM Failures (PCI Motherboard Error)
Sctr 1 -PAL headr PTTRN fail
Sctr 1 -PAL headr CHKSM fail
Sctr 1 -PAL code CHKSM fail
Sctr 3 -CONSLE headr PTTRN fail
Sctr 3 -CONSLE headr CHKSM fail
Sctr 3 -CONSLE code CHKSM fail
ECC test
5 Translation Error test
6 Write Pending test
7 PCI Loopback test
8 PCI Peer-to-Peer Byte Mask test
ncr810_diag
For both IOD tests and PCI motherboard tests, trace and failure status is sent to the OCP. If any of these tests fail, a warning is sent to the SRM console device after the console prompt (or AlphaBIOS pop-up box). The LEDs on the system bus to PCI bus bridge module are controlled by the diagnostics. If an LED is off, a failure occurred.
[Flowchart: console device determination. Labels shown: Enable COM port 1 and send messages as system is powering up. Warning message sent if a VGA adapter is seen on PCI 1.]
Console Device Options
The console device can be either a serial terminal or a graphics monitor. Specifically:
A serial terminal connected to COM1 off the server control module. The terminal connected to COM1 must be set to 9600 baud. This baud rate cannot be changed.
A graphics monitor off an adapter on PCI0.
Systems running Windows NT must have a graphics monitor as the console device and run AlphaBIOS as the console program.
During power-up, the SROM and the XSROM always send progress and error messages to the OCP. They also send them to the COM1 serial port if the SRM console environment variable (set with the set console command) is set to serial. If the console environment variable is set to graphics, no messages are sent to COM1.
If the console device is connected to COM1, both the SROM and XSROM send messages to it once it has been initialized. If the console device is a graphics device, console power-up messages are sent to it, but SROM and XSROM power-up messages are lost.
No matter what the console environment variable setting, each of the three programs sends messages to the control panel display.

Power-Up Messages   Console Set to Serial   Console Set to Graphics
SROM                COM1                    Lost
XSROM               COM1                    Lost
SRM console         COM1                    VGA
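The message routing rules above can be summarized as a small decision function. This Python sketch is an illustration under stated assumptions: the function name and the lowercase program labels "srom", "xsrom", and "console" are hypothetical, and the control panel display (OCP) is always included because each of the three programs sends messages to it.

```python
from typing import Set


def power_up_message_destinations(program: str, console: str) -> Set[str]:
    """Where power-up messages from a program appear.

    program: "srom", "xsrom", or "console" (hypothetical labels).
    console: value of the console environment variable,
             "serial" or "graphics".
    """
    program = program.strip().lower()
    console = console.strip().lower()
    if program not in {"srom", "xsrom", "console"}:
        raise ValueError(f"unknown program: {program!r}")
    if console not in {"serial", "graphics"}:
        raise ValueError(f"unknown console setting: {console!r}")

    destinations = {"OCP"}             # always sent to the control panel
    if console == "serial":
        destinations.add("COM1")       # all three programs print to COM1
    elif program == "console":
        destinations.add("VGA")        # only console messages reach VGA
    return destinations
```

With console set to graphics, SROM and XSROM messages appear only on the control panel display, which is why a serial terminal is useful when diagnosing early power-up failures.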
Changing Where Console Output Is Displayed
You can change where console output is displayed, assuming the SRM console has fully powered up and the os_type environment variable is set to OpenVMS or Digital UNIX. (The following does not work if os_type is set to NT.)
If the console environment variable is set to serial and no serial terminal is attached to COM1, pressing a carriage return on a graphics monitor attached to the system makes it the console device and the console prompt is sent to it.
If the console environment variable is set to graphics and no graphics monitor is attached to the adapter, pressing a carriage return on a serial terminal attached to COM1 makes it the console device and the console prompt is sent to it.
At power-up or reset, the SROM code on each CPU module is loaded into that module's I-cache and tests the module. If all tests pass, the processor's LED lights. If any test fails, the LED remains off and power-up testing terminates on that CPU.
The first determination of the primary processor is made, and the primary processor executes a loopback test to each PCI bridge. If this test passes, the bridge LED lights. If it fails, the LED remains off and power-up continues. The EISA system controller, PCI-to-EISA bridge, COM1 port, and control panel port are all initialized thereafter.
Each CPU prints an SROM banner to the device attached to the COM1 port and to the control panel display. (The banner prints to the COM1 port if the console environment variable is set to serial. If it is set to graphics, nothing prints to the console terminal, only to the control panel display, until the console starts.)
Each processor's S-cache is initialized, and the XSROM code in the FEPROM on the PCI motherboard is unloaded into it. (If the unload is not successful, a copy is unloaded from a different FEPROM sector. If the second try fails, the CPU hangs.)
Each processor jumps to the XSROM code and sends an XSROM banner to the COM1 port and to the control panel display.
The three S-cache banks on each processor are enabled, and then the B-cache is tested. If a failure occurs, a message is sent to the COM1 port and to the control panel display. Each CPU sends a B-cache completion message to COM1.
The primary CPU is again determined, and it sizes memory by reading memory registers on the I2C bus. The information on memory pairs is sent to COM1. If an illegal memory configuration is detected, a warning message is sent to COM1 and the control panel display.
Memory is initialized and tested, and the test trace is sent to COM1 and the control panel display. Each CPU participates in the memory testing. The numbers for tests 20 and 21 might appear interspersed, as in Example 2-3. This is normal behavior. Test 24 can take several minutes if the memory is very large. The message P0 TEST 24 MEM** is displayed on the control panel display; the second asterisk rotates to indicate that testing is continuing. If a failure occurs, a message is sent to the COM1 port and to the control panel display.
Each CPU sends a test completion message to COM1.
The final primary CPU determination is made. The primary CPU unloads PALcode and decompression code from the FEPROM on the PCI motherboard to its B-cache. The primary CPU then jumps to the PALcode to start the SRM console.
The primary CPU prints a message indicating that it is running the console. Starting with this message, the power-up display is printed to the default console terminal, regardless of the state of the console environment variable. (If console is set to graphics, the display from here to the end is saved in a memory buffer and printed to the graphics monitor after the PCI buses are sized and the graphics device is initialized.)
The size and type of each memory pair is determined.
The console is started on each of the secondary CPUs. A status message prints for each CPU.
The PCI bridges (indicated as IODn) are probed and the devices are reported. I/O adapters are configured.
The SRM console banner and prompt are printed. (The SRM prompt is shown in this manual as P00>>>. It can, however, be P01>>>, P02>>>, or P03>>>. The number indicates the primary processor.) If the auto_action environment variable is set to boot or restart and the os_type environment variable is set to unix or openvms, the Digital UNIX or OpenVMS operating system boots.
If the system is running the Windows NT operating system (the os_type environment variable is set to nt), the SRM console loads and starts the AlphaBIOS console and does not print the SRM banner or prompt.
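The final step of the power-up sequence depends on a few settings. The Python sketch below models that decision for illustration; the function name and the returned strings are assumptions, and the Halt button behavior is taken from the control panel description in Chapter 1.

```python
def after_console_start(os_type: str, auto_action: str,
                        halt_button_in: bool = False) -> str:
    """Model what the system does once the SRM console has started.

    Hypothetical helper for illustration; it is not a console command.
    """
    os_type = os_type.strip().lower()
    auto_action = auto_action.strip().lower()

    # With the Halt button in, the system halts in the SRM console
    # regardless of the operating system; AlphaBIOS is not started and
    # autoboot does not occur.
    if halt_button_in:
        return "SRM prompt"
    # Windows NT systems: the SRM console loads and starts AlphaBIOS.
    if os_type == "nt":
        return "AlphaBIOS"
    # Digital UNIX and OpenVMS boot automatically only when auto_action
    # is set to boot or restart.
    if os_type in {"unix", "openvms"} and auto_action in {"boot", "restart"}:
        return "boot operating system"
    return "SRM prompt"
```

For instance, a Digital UNIX system with auto_action set to boot stops at the SRM prompt instead of booting if the Halt button is in.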
NOTE: FEPROM 0 contains images of the SROM, XSROM, PAL, decompression, and SRM console code. If the fail-safe loader loads, the following conditions exist on the machine: The SROM has passed its tests and successfully unloaded the XSROM. If the SROM fails to unload both copies of XSROM, it reports the failure to the control panel display and COM1 if possible, and the system hangs. The XSROM has completed its B-cache and memory tests but has failed to unload the PALcode in FEPROM 0 sector 1 or the SRM console code. The XSROM reports the errors encountered and loads the fail-safe loader.
Chapter 3 Troubleshooting
This chapter describes troubleshooting during power-up and booting, as well as diagnostics for AlphaServer 4100 systems. The following topics are covered:
Troubleshooting with LEDs
Troubleshooting Power Problems
Running Diagnostics: Test Command
Figure: CPU and bridge module LEDs. Labels: DC_OK, SROM Oscillator, CPU Self-Test Pass, POWER_FAN_OK, TEMP_OK.
CPU LEDs

If the CPU STP LED on any CPU module is lit, that CPU chip is functioning properly. You can use the Halt button on the OCP to prevent the AlphaBIOS console from booting, thus assuring the validity of the CPU STP LED. If the LED is off, replace the CPU. If the LED is lit, you can use the SRM console command alphabios to load and run the AlphaBIOS console.

The top LED on a CPU module is a DC OK LED. It is driven by the PCM module. If it is not lit, there are probably power problems.

The middle LED on a CPU lights only when the SROM on each CPU is being loaded.
System Bus to PCI Bus Bridge Module LEDs

There are four LEDs on the system bus to PCI bus bridge module:

The top two LEDs indicate the condition of the bridge module. If either is off, the module should be replaced.

The bottom two LEDs are passed from the PCM. Both should be on during normal operation. If either is off while the system is on, the LEDs on the PCM module should indicate what failed. If they do not, the PCM could be broken or the bridge module is not passing the signals to the LEDs.

NOTE: If AC power is applied, the system is off, and a power supply is in operation, the power LED (the top one of the bottom two) flashes, indicating the presence of Vaux (auxiliary voltage).
A cabinet system has three exhaust fans at the top of the cabinet. They are powered from a small power supply in the fan tray. This power supply also powers the server control module at the bottom of the PCI card cage to allow remote access to the system. A failure of the power supply is indicated only by the LEDs. No messages are displayed.

There are two LEDs on the top panel: a fan LED and a power LED.

When the fan LED (amber) is flashing, a cabinet fan needs replacing. Look to see which fan appears broken: either it is not turning at all, or it is turning more slowly than the others.

When the power LED (green) is off, either the power supply in the fan tray is broken or there is a power problem.
If Halt Is Caused by Power, Fan, or Overtemperature

If a system is stopped because of a power, fan, or overtemperature problem, use the PCM LEDs to diagnose the problem. See Section 3.2.1.
If Power Problem Occurs at Power-Up

If the system has a power problem on a cold start, the PCM LEDs are not valid until after DCOK_SENSE has been asserted. The cause is one of the following:
Broken system fan
Broken CPU fan
Power supplied to the system is out of tolerance (a power supply could be broken and the system could still power up)
PCM failure
Interlock failure
Wire problems
Temperature problem (unlikely)
Recommended Order for Troubleshooting Failure at Power-Up

1. Check to see if any CPU fan or system fan is not spinning. Fans can fail by not spinning and/or not putting out the tachometer output necessary as input to the PCM comparator that checks the fans. (See steps 4 and 5.) Replace broken fan.
2. Replace the PCM.
3. Sequentially remove CPUs and try to power up after you remove a CPU. If the system powers up, the last CPU you removed had a fan failure.
4. Check the output of the power supplies. See Section 4.1 for locations of +5 and +3.43 volt output pins. If the output is above or below the threshold, replace the faulty power supply.
5. Check the output of each system fan with a voltmeter. Probe the middle of three outputs of the fans with the positive lead of the meter and ground the other probe. The meter should read 2.5 volts to 3 volts. If a fan's output is out of this range, replace the fan.

NOTE: You will have to disable the interlocks to check the voltages in step 5. You will have only 10 seconds to measure them. There is a 10-second delay before the PCM turns off the power. The PCM must sense a change in Vaux (auxiliary voltage) to start the power supplies. Pressing the On button has no effect if the machine halted because of a failure in the power system. The power supplies must be unplugged and plugged back in for the On button to work.
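The range check in step 5 can be written down as a small sketch for recording bench readings. The function names and the fan labels are hypothetical; only the 2.5 volt to 3 volt window comes from the procedure above.

```python
# Hypothetical helper for logging voltmeter readings taken in step 5.
FAN_TACH_MIN_V = 2.5  # lower bound quoted in step 5
FAN_TACH_MAX_V = 3.0  # upper bound quoted in step 5


def fan_output_ok(volts: float) -> bool:
    """Return True if a fan's middle-pin reading is within 2.5 V to 3 V."""
    return FAN_TACH_MIN_V <= volts <= FAN_TACH_MAX_V


def fans_to_replace(readings: dict[str, float]) -> list[str]:
    """Return the (illustrative) fan labels whose readings are out of range."""
    return sorted(name for name, volts in readings.items()
                  if not fan_output_ok(volts))
```

For example, readings of 2.7 V, 1.9 V, and 3.0 V on three fans would flag only the second fan for replacement.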
Figure: PCM LEDs. Labels: DCOK_SENSE, PS0_OK, PS1_OK, PS2_OK, TEMP_OK, CPUFAN_OK, SYSFAN_OK, CS_FAN0, CS_FAN1, CS_FAN2, C_FAN3. Legend: normally on; tested at one-second intervals; off if power supply not present or broken.
LED         State  Description
DCOK_SENSE  On     Both +5.0V and +3.43V are present and within limits.
PS0_OK      On     Power supply 0 is present and has asserted POK_H.
PS1_OK      On     Power supply 1 is present and has asserted POK_H.
            Off    Power supply 1 not present.
PS2_OK      On     Power supply 2 is present and has asserted POK_H.
            Off    Power supply 2 not present.
TEMP_OK     On     The system temperature is below 55 C.
CPUFAN_OK   On     All CPU fans are OK.
            Off    A CPU fan has failed. The specific fan is identified by the CS_FANx or C_FAN3 LED that remains lit.
SYSFAN_OK   On     All system fans are OK.
            Off    A system fan has failed. The specific fan is identified by the CS_FANx that remains lit.
CS_FAN0     On     CPU fan 0 and system fan 0 are being sampled or one of them has failed as indicated by CPUFAN_OK and SYSFAN_OK.
            Off    CPU fan 0 and system fan 0 are not being sampled and are functioning properly.
CS_FAN1     On     CPU fan 1 and system fan 1 are being sampled or one of them has failed as indicated by CPUFAN_OK and SYSFAN_OK.
            Off    CPU fan 1 and system fan 1 are not being sampled and are functioning properly.
CS_FAN2     On     CPU fan 2 and system fan 2 are being sampled or one of them has failed as indicated by CPUFAN_OK and SYSFAN_OK.
            Off    CPU fan 2 and system fan 2 are not being sampled and are functioning properly.
C_FAN3      On     CPU fan 3 is being sampled or has failed as indicated by CPUFAN_OK and SYSFAN_OK.
            Off    CPU fan 3 and system fan 3 are not being sampled and are functioning properly.
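The LED table can be read as a lookup from observed LED states to likely causes. The following is a minimal sketch of that reading, not a tool supplied with the system. The function name and the dictionary layout are assumptions, and the findings for DCOK_SENSE off and TEMP_OK off are inferred as the opposites of the "On" rows, since the table does not list them. A fan LED is passed as True only if it remains lit rather than blinking during its one-second sample.

```python
FAN_LEDS = ("CS_FAN0", "CS_FAN1", "CS_FAN2", "C_FAN3")


def diagnose_pcm_leds(leds: dict[str, bool]) -> list[str]:
    """Map PCM LED states (True = lit) to findings based on the LED table.

    Fan LEDs should be True only when they REMAIN lit.
    """
    findings = []
    if not leds.get("DCOK_SENSE", False):
        # Inferred: opposite of "present and within limits".
        findings.append("+5.0V or +3.43V missing or out of limits")
    for n in range(3):
        if not leds.get(f"PS{n}_OK", False):
            findings.append(f"power supply {n} not present or broken")
    if not leds.get("TEMP_OK", False):
        # Inferred: opposite of "temperature is below 55 C".
        findings.append("system temperature not below 55 C")
    stuck = ", ".join(name for name in FAN_LEDS if leds.get(name, False))
    if not leds.get("CPUFAN_OK", False):
        findings.append("CPU fan failed; check fan LEDs: " + stuck)
    if not leds.get("SYSFAN_OK", False):
        findings.append("system fan failed; check fan LEDs: " + stuck)
    return findings
```

A system with all status LEDs on and no fan LED stuck on returns an empty list. Note that a drawer configured with fewer than three power supplies legitimately reports the absent supplies.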
Figure: I2C bus block diagram. Labels: CPUs, memory pairs (MEMs), motherboard, I2C bus, IOD 0 (PCI 0), IOD 1 (PCI 1), XBUS, EISA.
Monitor

The I2C bus monitors the state of system conditions scanned by the PCM. There are two registers on the PCM:
One records the state of the fans and power supplies and is latched when there is a fault.
The other causes an interrupt on the I2C bus when a CPU or system fan fails, an overtemperature condition exists, or power supplied to the system is out of tolerance.

The interrupt received by the I2C bus controller on PCI 0 alerts the system of imminent power shutdown. The controller has 30 seconds to read the two registers and store the information in the EEPROM on the PCM. The SRM console command show power reads these registers.

Fault Display

The OCP display is written through the I2C bus.

Error State

Error state is written and read for power conditions. The state of the Halt button (in/out) is read on the I2C bus.

Configuration Tracking

Each CPU, PCI bridge, PCI motherboard, and system motherboard has an EEPROM that contains information about the module that can be written and read over the I2C bus. All modules contain the following information:
Module type
Module serial number
Hardware revision
Firmware revision
Memory size (only required for memory modules)
NOTE: If you are running the Microsoft Windows NT operating system, switch from AlphaBIOS to the SRM console in order to enter the test command. From the AlphaBIOS console, press in the Halt button (the LED will light) and reset the system, or select Digital UNIX (SRM) or OpenVMS (SRM) from the Advanced CMOS Setup screen and reset the system.

test [-t time] [-q] [option]

-t time   Specifies the run time in seconds. The default for system test is 600 seconds (10 minutes).
-q        Disables the display of status messages as exerciser processes are started and stopped during testing.
option    Either cpun, memn, or pcin, where n is 0, 1, 2, 3, or *. If nothing is specified, the entire system is tested.
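When test runs are scripted over a serial console, it can help to assemble the command line from validated arguments. The sketch below follows only the syntax given above; the function name is hypothetical, and the requirement that the run time be a positive number is an assumption.

```python
import re

# Accepts cpun, memn, or pcin, where n is 0, 1, 2, 3, or *.
_OPTION_RE = re.compile(r"(cpu|mem|pci)[0-3*]")


def build_test_command(seconds=None, quiet=False, option=None):
    """Build an SRM 'test' command string: test [-t time] [-q] [option]."""
    parts = ["test"]
    if seconds is not None:
        if seconds <= 0:
            # Assumption: a run time must be positive.
            raise ValueError("run time must be a positive number of seconds")
        parts += ["-t", str(seconds)]
    if quiet:
        parts.append("-q")
    if option is not None:
        if not _OPTION_RE.fullmatch(option):
            raise ValueError("option must be cpun, memn, or pcin")
        parts.append(option)
    return " ".join(parts)
```

Calling the function with no arguments yields the bare command, which tests the entire system for the default 600 seconds.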
Configuring system.. polling ncr0 (NCR 53C810) slot 1, bus 0 PCI, hose 1 dka500.5.0.1.1 DKa500 SCSI Bus ID 7
polling ncr1 (NCR 53C810) slot 3, bus 0 PCI, hose 1 dkb200.2.0.3.1 dkb400.4.0.3.1 DKb200 DKb400
polling tulip0 (DECchip 21040-AA) slot 2, bus 0 PCI, hose 1 ewa0.0.0.2.1: 08-00-2B-E5-B4-1A
Starting background memory test, affinity to all CPUs.. Starting processor/cache thrasher on each CPU.. Starting processor/cache thrasher on each CPU.. Starting processor/cache thrasher on each CPU.. Starting processor/cache thrasher on each CPU..
Testing SCSI disks (read-only) No CD/ROM present, skipping embedded SCSI test Testing other SCSI devices (read-only)..
ID       Program  Device        Pass  Bytes Read
-------- -------- ------------ ------ -----------
00003047 memtest  memory            1   134217728
00003050 memtest  memory          205   213883392
00003059 memtest  memory          192   200253568
00003062 memtest  memory          192   200253568
00003084 memtest  memory           80    82827392
000030d8 exer_kid dkb200.2.0.3     26    13690880
000030d9 exer_kid dkb400.4.0.3     26    13674496
0000310d exer_kid dva0.0.0.100      0           0

ID       Program  Device        Pass  Bytes Read
-------- -------- ------------ ------ -----------
00003047 memtest  memory            1   432013312
00003050 memtest  memory          635   664716032
00003059 memtest  memory          619   647940864
00003062 memtest  memory          620   648989312
00003084 memtest  memory          263   274693376
000030d8 exer_kid dkb200.2.0.3     90    47572992
000030d9 exer_kid dkb400.4.0.3     90    47523840
0000310d exer_kid dva0.0.0.100      0      327680

ID       Program  Device        Pass  Bytes Read
-------- -------- ------------ ------ -----------
00003047 memtest  memory            1   727711744
00003050 memtest  memory         1054  1104015744
00003059 memtest  memory         1039  1088289024
00003062 memtest  memory         1041  1090385920
00003084 memtest  memory          447   467607808
000030d8 exer_kid dkb200.2.0.3    155    81488896
000030d9 exer_kid dkb400.4.0.3    155    81472512
0000310d exer_kid dva0.0.0.100      1      607232
Testing aborted. Shutting down tests. Please wait.. System test complete ^C P00>>>
ID       Program  Device        Pass  Bytes Read
-------- -------- ------------ ------ -----------
000046d7 memtest  memory            1   583008256
000046e0 memtest  memory         1456  1525491840
000046e9 memtest  memory         1446  1515007360
000046f2 memtest  memory         1444  1512910464
000046fb memtest  memory          550   575597952

ID       Program  Device        Pass  Bytes Read
-------- -------- ------------ ------ -----------
000046d7 memtest  memory            1   761266176
000046e0 memtest  memory         1901  1992051200
000046e9 memtest  memory         1892  1982615168
000046f2 memtest  memory         1889  1979469824
000046fb memtest  memory          720   753834112

ID       Program  Device        Pass  Bytes Read
-------- -------- ------------ ------ -----------
000046d7 memtest  memory            1   937426944
000046e0 memtest  memory         2346  2458610560
000046e9 memtest  memory         2337  2449174528
000046f2 memtest  memory         2333  2444980736
000046fb memtest  memory          890   932070272
polling tulip0 (DECchip 21040-AA) slot 2, bus 0 PCI, hose 1 ewa0.0.0.2.1: 08-00-2B-E5-B4-1A polling floppy0 (FLOPPY) PCEB - XBUS hose 0 dva0.0.0.1000.0 DVA0 RX23
Testing all PCI buses..
Testing EWA0 network device
Testing VGA (alphanumeric mode only)
Testing SCSI disks (read-only)
Testing floppy (dva0, read-only)

ID       Program      Device       Pass   Hard/Soft Bytes Written Bytes Read
-------- ------------ ------------ ------ --------- ------------- -----------
00002c29 exer_kid     dkb200.2.0.3     27    0    0             0    14642176
00002c2a exer_kid     dkb400.4.0.3     27    0    0             0    14642176
00002c5e exer_kid     dva0.0.0.100      0    0    0             0           0
Program      Device       Pass
------------ ------------ ------
exer_kid     dkb200.2.0.3     92
exer_kid     dkb400.4.0.3     92
exer_kid     dva0.0.0.100      0
Testing aborted. Shutting down tests. Please wait.. Testing complete ^C P00>>>
Power System
Figure: Power supply outputs. Labels: current share, +5V/return, +3.4V/return, +12V/return.
Power Supply Features 90-264 Vrms input 450 watts output. Output voltages are as follows:
Remote sense on +5.0V and +3.43V
  +5.0V is sensed on all CPUs in the system, the system bus motherboard, and the PCI bus motherboard.
  +3.43V is sensed on all CPUs in the system and the system bus motherboard.
Current share on +5.0V, +3.43V, and +12V.
1% regulation on +3.43V.
Fault protection (latched). If a fault is detected by the power supply, it will shut down. The faults detected are:
  Overvoltage
  Overcurrent
  Power overload
DC_ENABLE_L input signal starts the DC outputs.
POK_H output signal indicates that the power supply is operating properly.
The power control module performs the following functions:
Controls the power-up/down sequencing.
Monitors the combined output of power supplies VDD (3.43V) and VCC (5.0V). It asserts DCOK_SENSE if these voltages are within range, and asserts POWER_FAULT_L, causing an immediate power shutdown, if either is not.
Monitors system temperature and asserts TEMP_FAIL if the temperature exceeds 55 C.
Monitors CPU and system drawer fans. It asserts CPUFAN_OK if all CPU fans are functioning properly and asserts SYSTEM_FAN_OK if the drawer cooling fans are functioning properly; otherwise it asserts FAN_FAULT_L. Each fan is checked at 1-second intervals.
Powers down the system 30 seconds after detecting TEMP_FAIL, or the absence of CPUFAN_OK, or the absence of SYSTEM_FAN_OK, by asserting POWER_FAULT_L.
Provides visual indication of faults through LEDs.
Has two registers, one that generates interrupts when bits change, and one that latches errors but does not generate interrupts.
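The shutdown rules listed above can be summarized as a decision function. This is an illustrative model of the described behavior, not PCM firmware; the function name, argument names, and return convention are all assumptions. Only the signal names, the 55 C limit, and the 30-second delay come from the text.

```python
TEMP_LIMIT_C = 55        # TEMP_FAIL threshold from the text
FAN_TEMP_DELAY_S = 30    # delay before shutdown on fan or temperature faults


def pcm_action(vdd_ok, vcc_ok, temp_c, cpu_fans_ok, sys_fans_ok):
    """Return (signal, shutdown_delay_seconds) for one monitoring pass.

    shutdown_delay_seconds is None when no shutdown is pending.
    """
    if not (vdd_ok and vcc_ok):
        # Voltage out of range: immediate power shutdown.
        return ("POWER_FAULT_L", 0)
    if temp_c > TEMP_LIMIT_C or not cpu_fans_ok or not sys_fans_ok:
        # Overtemperature or fan failure: shutdown after 30 seconds.
        return ("POWER_FAULT_L", FAN_TEMP_DELAY_S)
    return ("DCOK_SENSE", None)
```

The 30-second window is what gives the I2C bus controller time to read the PCM registers before power is removed.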
Figure 4-3: Power distribution in the system drawer. Labels: 17-04201-01 or 17-04302-01, RSM_DC_EN_L.
Figure 4-3 shows the distribution of power throughout the system drawer. DC power to the system is interrupted by an open in the circuit, by the PCM signal POWER_FAULT_L, or by the SCM signal RSM_DC_EN_L. Opens can be caused by the On/Off button or the cover interlocks. The PCM module asserts POWER_FAULT_L if it detects a fault; RSM_DC_EN_L is controlled remotely. A failure anywhere in the circuit results in the removal of DC power. A potential failure is the relay used on the SCM modules to control the RSM_DC_EN_L signal.

The system drawer has three cover interlocks: one for the system bus card cage, one for the PCI card cage, and one for the power and system fan area. To override the cover interlocks, find a suitable object to close the interlock circuit at the location identified in Figure 4-3. The switch assembly that contains single switches for all three covers is located at the point in the system drawer where all three covers meet.
Figure: Power-up/down sequence flowchart. Labels: Vaux on; On/Off button (On, Off); assert DC_ENABLE_L; power supply starts; 10-second delay; 12-second delay; voltages OK (yes/no); fan/temp OK (yes/no); 30-second delay; deassert DC_ENABLE_L; halt.
When AC is applied to the system, Vaux (auxiliary voltage) is asserted and is sensed by the PCM. The PCM asserts DC_ENABLE_L starting the power supplies. If there is a hard fault on power-up, the power supplies shut down immediately; otherwise, the power system powers up and remains up until the system is shut off or the PCM senses a fault. If a power fault is sensed, the power system attempts to restore power and will do so if the fault is not sensed a second time. If the fault is still present, the power system shuts down. Since Vaux is independent of the power supply start, the AC plugs at the front of the supplies must be removed to reset Vaux, allowing capacitors to drain voltage. All power failures require this procedure since the PCM must sense a change in Vaux to start the power supplies.
Figure: Cabinet AC power distribution. Labels: AC distribution box (30A input, three 10A outputs), three power strips (10A each), system drawers, StorageWorks shelves, fan tray.
Total Power Available        4800 VA
Single Drawer                1100 VA
Single StorageWorks Shelf    150 VA
System Fan Tray              100 VA
Outlets                      18 IEC 320 max. (3 power strips)
Site Grounding               Leakage current exceeds 3.5 mArms.
Power Strip                  One system drawer per power strip. In a four-system-drawer configuration, the fourth drawer should have its three power cords distributed among the three power strips.
4.6
Figure: AC power distribution for a single system drawer. Labels: system drawer, 15A.
Total Power Available        N. America: 1800 VA per branch circuit and 1400 VA per line cord
                             Japan: 1500 VA per branch circuit and 1200 VA per line cord
Single Drawer                1100 VA
Single StorageWorks Shelf    150 VA
Outlets                      12 NEMA receptacles
Power Strip                  Single AC power strip supports one system drawer and one StorageWorks shelf. When two AC power strips are used, combined AC input line current cannot exceed the site circuit breaker restriction, assuming both strips are plugged in to the same circuit.
4.7
Figure: AC power strip distribution. Labels: StorageWorks shelves, system drawer.
Total Power Available        2200 VA per power strip
Single Drawer                1100 VA
Single StorageWorks Shelf    150 VA
Outlets                      10 IEC 320 receptacles max. One receptacle is blocked on each power strip to control leakage.
Power Strip                  Single AC power strip supports one system drawer and three StorageWorks shelves.
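The figures in this table can be checked with simple arithmetic. The sketch below adds up the VA load on one power strip and compares it with the 2200 VA limit. The function names are hypothetical, and the check covers the VA budget only; it does not model receptacle counts or leakage current, which also constrain what a strip supports.

```python
DRAWER_VA = 1100       # single system drawer
SHELF_VA = 150         # single StorageWorks shelf
STRIP_LIMIT_VA = 2200  # total power available per power strip


def strip_load_va(drawers: int, shelves: int) -> int:
    """Return the combined VA load of the drawers and shelves on one strip."""
    return drawers * DRAWER_VA + shelves * SHELF_VA


def fits_on_strip(drawers: int, shelves: int) -> bool:
    """Return True if the VA load is within the per-strip limit."""
    return strip_load_va(drawers, shelves) <= STRIP_LIMIT_VA
```

The supported configuration of one system drawer and three StorageWorks shelves draws 1100 + 3 x 150 = 1550 VA, which is within the limit; two drawers and one shelf (2350 VA) would not be.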
Error Logs
Figure 5-1. Labels: memory (ECC), transceivers (XVER), system bus data, system bus command/address, B-cache, tag and status, data, CPU module, CPU chip, PCI, EISA, EISA bus bridge; ECC (E), parity (P), and PS markings on each path.
Device
IOD on every transaction, CPU when using the bus IOD on every transaction, CPU when using the bus Parity Protected
System bus command/address lines Duplicate tag store B-cache index lines PCI bus EISA bus
IOD on every transaction, CPU when using the bus IOD on every transaction, CPU when using the bus CPU IOD EISA bridge
As shown in Figure 5-1 and the accompanying table, the CPU chip is isolated by transceivers (XVER) from the data and command/address lines on the module. This allows the CPU chip access to the duplicate tag and B-cache while the system bus is in use. The CPU detects errors only when it is the consumer of the data. The IOD detects errors on each system bus cycle regardless of whether it is involved in the transaction.

System bus errors detected by the CPU may also be detected by the IOD. It is necessary to check the IOD for errors any time there is a CPU machine check. If the CPU sees bad data and the IOD does not, the CPU is at fault. If both the CPU and the IOD see bad data on the system bus, either memory or a secondary CPU is the cause. In such a case, check whether the Dirty bit, bit<20>, in the IOD MC_ERR1 Register is set or clear. If the Dirty bit is set, the source of the data is a CPU's cache, and the data was destined for a different CPU. If the Dirty bit is not set, memory caused the bad data on the bus. In this case, multiple error log entries occur and must be analyzed together to determine the cause of the error.
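The isolation rules in the preceding paragraph can be sketched as a short decision function. The function name, its arguments, and the returned strings are illustrative; only the rules themselves and the Dirty bit position (bit<20> of MC_ERR1) come from the text.

```python
DIRTY_BIT = 1 << 20  # Dirty bit, bit<20>, in the IOD MC_ERR1 Register


def isolate_fault(cpu_saw_error: bool, iod_saw_error: bool, mc_err1: int) -> str:
    """Suggest the likely failing component after a CPU machine check."""
    if cpu_saw_error and not iod_saw_error:
        # Bad data never appeared on the system bus.
        return "reporting CPU"
    if cpu_saw_error and iod_saw_error:
        # Bad data was on the system bus; the Dirty bit names its source.
        if mc_err1 & DIRTY_BIT:
            return "another CPU's cache"
        return "memory"
    return "undetermined"
```

Passing the MC_ERR1 value as an integer keeps the bit test explicit; a value with only bit 20 set is 0x00100000.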
MCHK 660
MCHK 630 MCHK 620 Last fail I/O error interrupt System environment Configuration
DECevent allows you to do the following:
Translate event log files into readable reports
Select alternate input and output files
Filter input events
Select alternative reports
Translate events as they occur
Maintain and customize your environment with the interactive shell commands
To access on-line help:

OpenVMS
$ HELP DIAGNOSE
or
$ DIA /INTERACTIVE
DIA> HELP

Digital UNIX
> man dia
or
> dia hlp
Privileges necessary to use DECevent:
SYSPRV for the utility
DIAGNOSE to use the /CONTINUOUS qualifier
To reverse the order of the input events:

OpenVMS
$ DIAGNOSE/TRANSLATE/REVERSE

Digital UNIX
> dia -R

These commands reverse the order in which events are displayed. The default order is forward chronologically.
Use the /BEFORE and /SINCE qualifiers to select events before or after a certain date and time.

OpenVMS
$ DIAGNOSE/TRANSLATE/BEFORE=15-JAN-1996:10:30:00
or
$ DIAGNOSE/TRANSLATE/SINCE=15-JAN-1996:10:30:00

Digital UNIX
> dia -t s:15-jan-1996 e:20-jan-1996

If no time is specified, the default time is 00:00:00, and all events for that day are selected.

The /BEFORE and /SINCE qualifiers can be combined to select a certain period of time.

OpenVMS
$ DIAGNOSE/TRANSLATE/SINCE=15-JAN-1996/BEFORE=20-JAN-1996

If no value is supplied with the /SINCE or /BEFORE qualifiers, DECevent defaults to TODAY.
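If a date range is prepared on another machine before being typed at the console, a small formatter avoids mistakes in the date syntax. The sketch below produces the combined OpenVMS command in the same form as the example above; the function names are hypothetical, and it formats dates only (no time of day).

```python
from datetime import date

# Month abbreviations are spelled out to avoid locale-dependent formatting.
MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")


def vms_date(d: date) -> str:
    """Format a date as DD-MMM-YYYY, for example 15-JAN-1996."""
    return f"{d.day:02d}-{MONTHS[d.month - 1]}-{d.year}"


def diagnose_between(since: date, before: date) -> str:
    """Build a DIAGNOSE command that selects events in a date range."""
    return (f"DIAGNOSE/TRANSLATE/SINCE={vms_date(since)}"
            f"/BEFORE={vms_date(before)}")
```

For the dates used above, diagnose_between(date(1996, 1, 15), date(1996, 1, 20)) reproduces the combined command shown in the example.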
CPU1 logged the error in a system with two CPUs. During a D-ref fill, the External Interface Status Register logged an uncorrectable ECC error. (When a CPU chip does not find data it needs to perform a task in any of its caches, it requests data from off the chip to fill its D-cache. It performs a D-ref fill.) Bit<30> is clear, indicating that the source of the error is the B-cache. Neither IOD CAP Error Register saw an error.
The error was detected by a CPU and the data was not on the system bus. Otherwise, the IODs would have seen the error. Therefore, CPU1 is broken. NOTE: The error log example has been edited to decrease its size; registers of interest are in bold type. The Horse module referred to in the error log is the system bus to PCI bus bridge module, the B3040 module. The Saddle module is the PCI motherboard, the B3050 module. The MC bus is the system bus. Refer to Table 5-9 for information on decoding commands, and refer to Table 5-10 for information on node IDs.
System type register x00000016 Number of CPUs (mpnum) x00000002 CPU logging event (mperr) x00000001 Event validity Event severity Entry type CPU Minor class Software Flags Active CPUs Hardware Rev System Serial Number Module Serial Number Module Type System Revision * MCHK 670 Regs * Flags: PCI Mask Machine Check Reason PAL SHADOW REG 0 PAL SHADOW REG 1 . . . PAL SHADOW REG 6 PAL SHADOW REG 7 PALTEMP0 PALTEMP1 PALTEMP2 . . . PALTEMP22 PALTEMP23 Exception Address Reg Exception Summary Reg Exception Mask Reg PAL BASE x0000000008 Interrupt Summary Reg IBOX Ctrl and Status Reg
1. O/S claims event is valid 1. Severe Priority 100. CPU Machine Check Errors 1. Machine check (670 entry) x0000000300000000 IOD 1 Register Subpkt Pres IOD 2 Register Subpkt Pres x00000003 x00000000 C1563 x0000 x00000000 x00000000 x0000 x0098 x00000000 x00000000
xFFFFFC00004F9D60 x00000000E8709A58 xFFFFFC00003BFB88 Native-mode instruction Exception PC x3FFFFF00000EFEE2 x00000000 x00000000 x00000000020000 Base addr for palcode = x00000000 AST requests 3 - 0 x00000000 x000000C160000000 Timeout Bit Not Set PAL Shadow Registers Enabled Correctable Err Intrpts Enabled ICACHE BIST Successful
Icache Par Err Stat Reg Dcache Par Err Stat Reg Virtual Address Reg Memory Mgmt Flt Sts Reg
Scache Address Reg Scache Status Reg Bcache Tag Address Reg
TEST_STATUS_H Pin Asserted x00000000 x00000000 xFFFFFFFE8F63BD38 x000000000166D1 Ref which caused err was a write Ref resulted in DTB miss RA Field x0000000000001B Opcode Field x0000000000002C xFFFFFF00000254BF x00000000 xFFFFFF80E98F7FFF External cache hit Parity for ds and v bits Cache block dirty Cache block valid Ext cache tag addr parity bit Tag address<38:20> is
x00000000000E98 Ext Interface Address Reg xFFFFFF00E984DBCF Fill Syndrome Reg x0000000000002B Ext Interface Status Reg xFFFFFFF104FFFFFF Uncorrectable ECC error Error occurred during D-ref fill LD LOCK xFFFFFF003797340F
IOD 0 Register Subpacket Device ID x0000003B Bcache Size = 2MB VCTY ASIC Rev = 0 Module Revision 0. Base Address of Bridge x000000F9E0000000 PCI Revision x06008021 CAP Chip Revision x00000001 Horse Module Revision x00000002 Saddle Module Revision x00000000 Saddle Module Type Left Hand EISA Present PCI Class Code x00000600 MC-PCI Command Register x06480FF1 Selftest passed Delayed read enabled Bridge PCI trans enabled Req 64 bit data trans enabled Accept 64 bit data trans enabled Check PCI Addr Parity enabled Check MC bus CMS/Addr Parity enabled Check MC bus NXM enabled Check all transaction enabled 16 byte aligned block write enabled Write Pend Number Thresho x00000008 RD_TYPE Short RL_TYPE Medium RM_TYPE Long ARB_MODE MC-PCI Bridge Priority Mode Memory Host Addr Exten x00000000 IO Host Addr Extension x00000000 Interrupt Control x00000003 MC-PCI Intr Enabled Device intr info enabled if en_int= 1 Interrupt Request x00000000 Interrupts asserted x00000000 Interrupt Mask Register 0 x00C50010 Interrupt Mask Register 1 x00000000 MC Error Info Register 0 xE0000000 MC bus trans addr <31:4> x0E000000 MC Error Info Register 1 x000E88FD MC bus trans addr <39:32>x000000FD MC_Command x00000008 x000000BB
CAP Error Register PCI Bus Trans Error Adr MDPA Status Register MDPA Error Syndrome Reg
Device Id x0000003A (no error seen) MDPA Chip Revision x00000000 Cycle 0 ECC Syndrome x00000000 Cycle 1 ECC Syndrome x00000000 Cycle 2 ECC Syndrome x00000000 Cycle 3 ECC Syndrome x00000000 MDPB Chip Revision x00000000 Cycle 0 ECC Syndrome x00000000 Cycle 1 ECC Syndrome x00000000 Cycle 2 ECC Syndrome x00000000 Cycle 3 ECC Syndrome x00000000
x00000000 x00000000
IOD 1 Register Subpacket Device ID x0000003B Bcache Size = 2MB VCTY ASIC Rev = 0 Module Revision 0. Base Address of Bridge x000000FBE0000000 PCI Revision x06000021 CAP Chip Revision x00000001 Horse ModuleRevision x00000002 Saddle Module Revision x00000000 Saddle Module Type Left Hand PCI Class Code x00000600 MC-PCI Command Register x06480FF1 Selftest passed Delayed read enabled Bridge PCI trans enabled Req 64 bit data trans enabled Accept 64 bit data trans enabled Check PCI Addr Parity enabled Check MC bus CMS/Addr Parity enabled Check MC bus NXM enabled Check all transaction enabled 16 byte aligned block write enabled Write Pend Number Thresho x00000008 RD_TYPE Short RL_TYPE Medium RM_TYPE Long ARB_MODE MC-PCI Bridge Priority Mode Memory Host Addr Exten x00000000 IO Host Addr Extension x00000000 Interrupt Control x00000003 MC-PCI Intr Enabled Device intr info enabled if en_int = 1 Interrupt Request x00000000 Interrupts asserted x00000000 Interrupt Mask Register 0 x00C50001 Interrupt Mask Register 1 x00000000 MC Error Info Register 0 xE0000000 MC bus trans addr <31:4> x0E000000 MC Error Info Register 1 x000E88FD MC bus trans addr <39:32> x000000FD MC_Command x00000008 Device Id x0000003A CAP Error Register x00000000 (no error seen) PCI Bus Trans Error Adr xC0018B48 MDPA Status Register x00000000 MDPA Chip Revision x00000000 MDPA Error Syndrome Reg x00000000 Cycle 0 ECC Syndrome x00000000 Cycle 1 ECC Syndrome x00000000 Cycle 2 ECC Syndrome x00000000 Cycle 3 ECC Syndrome x00000000 MDPB Status Register x00000000 MDPB Chip Revision x00000000 MDPB Error Syndrome Reg x00000000 Cycle 0 ECC Syndrome x00000000 Cycle 1 ECC Syndrome x00000000 Cycle 2 ECC Syndrome x00000000 x000000BB
CPU3 logged the error in a system with four CPUs. The External Interface Status Register logged an uncorrectable ECC error during a D-ref fill. (When a CPU chip does not find data it needs to perform a task in any of its caches, it requests data from off the chip to fill its D-cache. It performs a D-ref fill.) Bit <30> is set, indicating that the source of the error is memory or the system. Bits <32> and <35> are set, indicating an uncorrectable ECC error and a second external interface hard error, respectively. Both IOD CAP Error Registers logged an error. The command at the time of the error was a read. The bus master at the time of the error was CPU3.
The Dirty bit, bit <20>, in the MC_ERR1 Register is clear, indicating the data is clean and comes from memory. The error was detected by a CPU, and the data was on the system bus and is clean. Therefore, a memory module provided the wrong data. (If the Dirty bit had been set, the data would have come from the cache of another CPU.) To determine which memory, see Section 5.4.

NOTE: The error log example has been edited to decrease its size; registers of interest are in bold type. The Horse module referred to in the error log is the system bus to PCI bus bridge module, the B3040 module. The Saddle module is the PCI motherboard, the B3050 module. The MC bus is the system bus. Refer to Table 5-9 for information on decoding commands, and refer to Table 5-10 for information on node IDs.
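The bit tests used in both error log walkthroughs can be done mechanically. The sketch below parses register values in the xNNNN form printed in the logs and tests only the bits named in the text: bits <30>, <32>, and <35> of the External Interface Status Register and bit <20> of MC_ERR1. The function and key names are hypothetical.

```python
def parse_log_hex(text: str) -> int:
    """Convert an error-log value such as 'x800FD800' to an integer."""
    return int(text[1:] if text.startswith("x") else text, 16)


def bit_is_set(value: int, n: int) -> bool:
    """Return True if bit n of value is 1."""
    return (value >> n) & 1 == 1


def decode_ei_stat(text: str) -> dict:
    """Decode the External Interface Status bits discussed in the text."""
    v = parse_log_hex(text)
    return {
        "source": "memory or system" if bit_is_set(v, 30) else "B-cache",
        "uncorrectable_ecc": bit_is_set(v, 32),
        "second_hard_error": bit_is_set(v, 35),
    }


def dirty_bit_set(mc_err1_text: str) -> bool:
    """Test the Dirty bit, bit <20>, of an MC_ERR1 value."""
    return bit_is_set(parse_log_hex(mc_err1_text), 20)
```

Applied to the External Interface Status value in the first example log, xFFFFFFF104FFFFFF, the decode reports a B-cache source and an uncorrectable ECC error. Applied to the MC Error Info Register 1 value in the second example log, x800FD800, dirty_bit_set returns False, which agrees with the analysis that memory supplied the bad data.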
System type register x00000016 Number of CPUs (mpnum) x00000004 CPU logging event (mperr) x00000003 Event validity Event severity Entry type CPU Minor class Software Flags Active CPUs Hardware Rev System Serial Number Module Serial Number Module Type System Revision * MCHK 670 Regs * Flags: PCI Mask Machine Check Reason PAL SHADOW REG 0 PAL SHADOW REG 1 . . . PAL SHADOW REG 6 PAL SHADOW REG 7 PALTEMP0 PALTEMP1 . . . PALTEMP23 Exception Address Reg Exception Summary Reg Exception Mask Reg PAL BASE x0000000008 Interrupt Summary Reg IBOX Ctrl and Status Reg
1. O/S claims event is valid 1. Severe Priority 100. CPU Machine Check Errors 1. Machine check (670 entry) x0000000300000000 IOD 1 Register Subpkt Pres IOD 2 Register Subpkt Pres x0000000F x00000000 C1563 x0000 x00000000 x00000000 x0000 x0098 x00000000 x00000000
x00000000ECE77A58 x000000012005A8B4 Native-mode instruction Exception PC x0000000048016A2D x00000000 x00000000 x00000000020000 Base addr for palcode = x00000000 AST requests 3 - 0 x00000000 x000000C164000000 Timeout Bit Not Set Floating Point Instr. may be issued PAL Shadow Registers Enabled Correctable Err Intrpts Enabled ICACHE BIST Successful TEST_STATUS_H Pin Asserted x00000000
Dcache Par Err Stat Reg Virtual Address Reg Memory Mgmt Flt Sts Reg
x00000000 x00000001407D6000 x00000000011A10 Ref resulted in DTB miss RA Field x0000000008 Opcode Field x00000000000023 xFFFFFF00000254BF x00000000 xFFFFFF80286F7FFF External cache hit Parity for ds and v bits Cache block dirty Cache block valid Ext cache tag addr parity bit Tag address<38:20> is
Scache Address Reg Scache Status Reg Bcache Tag Address Reg
x00000000000286 Ext Interface Address Reg xFFFFFF0028681A8F Fill Syndrome Reg x00000000004B00 Ext Interface Status Reg xFFFFFFF904FFFFFF Uncorrectable ECC error Error occurred during D-ref fill Second external interface hard error LD LOCK xFFFFFF000020040F ** IOD SUBPACKET -> ** WHOAMI x000000BF
IOD 0 Register Subpacket Device ID x0000003F Bcache Size = 2MB VCTY ASIC Rev = 0 Module Revision 0. Base Address of Bridge x000000F9E0000000 PCI Revision x06008021 CAP Chip Revision x00000001 Horse Module Revision x00000002 Saddle Module Revision x00000000 Saddle Module Type Left Hand EISA Present PCI Class Code x00000600 MC-PCI Command Register x06460FF1 Selftest passed Delayed read enabled Bridge PCI trans enabled Req 64 bit data trans enabled Accept 64 bit data trans enabled Check PCI Addr Parity enabled Check MC bus CMS/Addr Parity enabled Check MC bus NXM enabled Check all transaction enabled 16 byte aligned block write enabled Write Pend Number Thresho x00000006 RD_TYPE Short RL_TYPE Medium RM_TYPE Long ARB_MODE MC-PCI Bridge Priority Mode Memory Host Addr Exten x00000000 IO Host Addr Extension x00000000 Interrupt Control x00000003 MC-PCI Intr Enabled Device intr info enabled if en_int = 1 Interrupt Request x00810000 Interrupts asserted x00010000 Hard Error Interrupt Mask Register 0 x00C50010 Interrupt Mask Register 1 x00000000
MC Error Info Register 0 x28681A80 MC bus trans addr <31:4> x028681A8
x800FD800
MC bus trans addr <39:32> x00000000 MC_Command x00000018 Device Id x0000003F MC error info valid
CAP Error Register PCI Bus Trans Error Adr MDPA Status Register MDPA Error Syndrome Reg
x80000000
x0000004B
IOD 1 Register Subpacket Device ID x0000003F Bcache Size = 2MB VCTY ASIC Rev = 0 Module Revision 0. Base Address of Bridge x000000FBE0000000 PCI Revision x06000021 CAP Chip Revision x00000001 Horse Module Revision x00000002 Saddle Module Revision x00000000 Saddle Module Type Left Hand PCI Class Code x00000600 MC-PCI Command Register x06460FF1 Selftest passed Delayed read enabled Bridge PCI trans enabled Req 64 bit data trans enabled Accept 64 bit data trans enabled Check PCI Addr Parity enabled Check MC bus CMS/Addr Parity enabled Check MC bus NXM enabled Check all transaction enabled 16 byte aligned block write enabled Write Pend Number Thresho x00000006 RD_TYPE Short RL_TYPE Medium RM_TYPE Long ARB_MODE MC-PCI Bridge Priority Mode Memory Host Addr Exten x00000000 IO Host Addr Extension x00000000 Interrupt Control x00000003 MC-PCI Intr Enabled Device intr info enabled if en_int = 1 Interrupt Request x00800000 Interrupts asserted x00000000 Hard Error Interrupt Mask Register 0 x00C50001 Interrupt Mask Register 1 x00000000 x000000BF
MC Error Info Register 0 MC Error Info Register 1 x28681A80 x800FD800 MC bus trans addr <31:4> x028681A8 MC bus trans addr <39:32> x00000000 MC_Command x00000018 Device Id x0000003F
xC0000000
MC error info valid Uncorrectable ECC err det by MDPB MC error info latched
PCI Bus Trans Error Adr MDPA Status Register MDPA Error Syndrome Reg
x80000000 x0000004B
MDPA Chip Revision x00000000 Cycle 0 ECC Syndrome x00000000 Cycle 1 ECC Syndrome x00000000 Cycle 2 ECC Syndrome x00000000 Cycle 3 ECC Syndrome x00000000 MDPB Chip Revision x00000000 MPDB Error Syndrome of uncorrectable read error Cycle 0 ECC Syndrome Cycle 1 ECC Syndrome x00000000 Cycle 2 ECC Syndrome x00000000 Cycle 3 ECC Syndrome x00000000
PALcode Revision
CPU0 logged the error in a system with two CPUs. The External Interface Status Register records an uncorrectable ECC error from the system (bit <30> set). Both IOD CAP Error Registers logged an error. The MC Error Info Registers 0 and 1 captured the error information. The commander at the time of the error was CPU0 (known from MC_ERR1). The command on the bus at the time was a read memory command. The address read was a memory address, not an I/O address. The data associated with the read was dirty.
From this information you know that CPU0 requested data that was dirty; therefore, neither memory nor an I/O device provided it. Only another CPU could have provided the data from its cache. There is only one other CPU in this system, so that CPU is faulty. Had there been more than two CPUs, you could not have isolated the error to a particular CPU. See Section 5.4 for a procedure designed to help with IOD-detected errors.

NOTE: The error log example has been edited to decrease its size; registers of interest are in bold type. The Horse module referred to in the error log is the system bus to PCI bus bridge module, the B3040 module. The Saddle module is the PCI motherboard, the B3050 module. The MC bus is the system bus. Refer to Table 5-9 for information on decoding commands, and refer to Table 5-10 for information on node IDs.
System type register x00000016 Number of CPUs (mpnum) x00000002 CPU logging event (mperr) x00000000 Event validity Event severity Entry type CPU Minor class Software Flags Active CPUs Hardware Rev System Serial Number Module Serial Number Module Type System Revision * MCHK 670 Regs * Flags: PCI Mask Machine Check Reason PAL SHADOW REG 0 PAL SHADOW REG 1 PAL SHADOW REG 2 PAL SHADOW REG 3 PAL SHADOW REG 4 PAL SHADOW REG 5 PAL SHADOW REG 6 PAL SHADOW REG 7 PALTEMP0 PALTEMP1 PALTEMP2 . . . PALTEMP22 PALTEMP23 Exception Address Reg Exception Summary Reg Exception Mask Reg PAL Base Address Reg x0000000000000008 Interrupt Summary Reg x0000000000000000 IBOX Ctrl and Status Reg
1. O/S claims event is valid 1. Severe Priority 100. CPU Machine Check Errors 1. Machine check (670 entry) x0000000300000000 IOD 0 Register Subpkt Pres IOD 1 Register Subpkt Pres x00000003 x00000000 C1563 x0000 x00000000 x00000000 x0000 x0098 Fatal Alpha Chip Detected HardError x0000000000000000 x0000000000000000 x0000000000000000 x0000000000000000 x0000000000000000 x0000000000000000 x0000000000000000 x0000000000000000 xFFFFFC00006C00C0 x00000000000061A8 xFFFFFC00004E1E00
xFFFFFC00006530E0 x0000000003D2BA58 xFFFFFC000047395C Native-mode Instruction Exception PC x3FFFFF000011CE57 x0000000000000000 x0000000000000000 x0000000000020000 Base Addr for PALcode: x0000000000200000 External HW Interrupt at IPL21 AST Requests 3-0: x000000C160000000 Timeout Counter Bit Clear. IBOX Timeout Counter Enabled.
Icache Par Err Stat Reg Dcache Par Err Stat Reg Virtual Address Reg Memory Mgmt Flt Sts Reg
Floating Point Instructions will cause FEN Exceptions. PAL Shadow Registers Enabled. Correctable Error Interrupts Enabled. ICACHE BIST (Self Test) Was Successful. TEST_STATUS_H Pin Asserted x0000000000000000 x0000000000000000 x0000000000044000 x0000000000005D10 If Err, Reference Resulted in DTB Miss Fault Inst RA Field: Fault Inst Opcode:
x0000000000000014 x000000000000000B Scache Address Reg Scache Status Reg Bcache Tag Address Reg
xFFFFFF00000254BF x0000000000000000 xFFFFFF8007EE2FFF Last Bcache Access Resulted in a Miss. Value of Parity Bit for Tag Control Status Bits Dirty, Shared & Valid is Set. Value of Tag Control Dirty Bit is Clear. Value of Tag Control Shared Bit is Clear. Value of Tag Control Valid Bit is Clear. Value of Parity Bit Covering Tag Store ddress Bits is Set. Tag Address<38:20> Is:
x000000000000007E Ext Interface Address Reg xFFFFFF0007FBF08F Fill Syndrome Reg x000000000000D189 Ext Interface Status Reg xFFFFFFF944FFFFFF Error Source is Memory or System UNCORRECTABLE ECC ERROR Error Occurred During D-ref Fill Error LD LOCK xFFFFFF0007FBF00F
IOD 0 Register Subpacket Module Revision 0. VCTY ASIC Rev = 0 Bcache Size = 2MB MID 2. GID 7.
x000000F9E0000000 x06008021 CAP Chip Revision: x00000001 HORSE Module Revision: x00000002 SADDLE Module Revision: x00000000 SADDLE Module Type: LeftHand PCI-EISA Bus Bridge Present on PCI Segment PCI Class Code x00000600 x06480FF1 Module SelfTest Passed LED on Delayed PCI Bus Reads Protocol: Enabled Bridge to PCI Transactions: Enabled
Bridge REQUESTS 64 Bit Data Transactions Bridge ACCEPTS 64 Bit Data Transactions PCI Address Parity Check: Enabled MC Bus CMD/Addr Parity Check: Enabled MC Bus NXM Check: Enabled Check ALL Transactions for Errors Use MC_BMSK for 16 Byte Align Blk Mem Wrt Wrt PEND_NUM Threshold: 8. RD_TYPE Memory Prefetch Algorithm: Short RL_TYPE Mem Rd Line Prefetch Type: Medium RM_TYPE Mem Rd Multiple Cmd Type: Long ARB_MODE Arbitration: MC-PCI Priority Mode Mem Host Address Ext Reg IO Host Adr Ext Register Interrupt Ctrl Register Struct:Enabled Interrupt Request Interrupt Mask0 Register Interrupt Mask1 Register MC Error Info Register 0
MC Error Info Register 1
HAE Sparse Mem Adr<31:27> x00000000 PCI Upper Adr Bits<31:25> x00000000 Write Device Interrupt Info Interrupts asserted Hard Error x00000000
x801E8800
xE0000000
Sys Environmental Regs PCI Bus Trans Error Adr MDPA Status Register MDPA Error Syndrome Reg Valid MDPB Status Register MDPB Error Syndrome Reg Valid ** IOD SUBPACKET -> ** WHOAMI
Device ID 2 x00000002 MC bus error assoc w read/dirty MC error info valid Uncorrectable ECC err det by MDPB MC error info latched x00000000
MC Bus Trans Addr<31:4>: 7FBF080 MC Command is Read0-Mem
MC bus trans addr <39:32> x00000000 Uncorrectable ECC err det by MDPA
MDPA Status Register Data Not Valid MDPA Syndrome Register Data Not MDPB Status Register Data Not Valid MDPB Syndrome Register Data Not IOD 1 Register Subpacket
x000000BA
Module Revision 0. VCTY ASIC Rev = 0 Bcache Size = 2MB MID 2. GID 7.
x000000FBE0000000 x06000021 CAP Chip Revision: x00000001 HORSE Module Revision: x00000002 SADDLE Module Revision: x00000000 SADDLE Module Type: LeftHand Internal CAP Chip Arbiter: Enabled PCI Class Code x00000600 x06480FF1 Module SelfTest Passed LED on
Delayed PCI Bus Reads Protocol: Enabled Bridge to PCI Transactions: Enabled Bridge REQUESTS 64 Bit Data Transactions Bridge ACCEPTS 64 Bit Data Transactions PCI Address Parity Check: Enabled MC Bus CMD/Addr Parity Check: Enabled MC Bus NXM Check: Enabled Check ALL Transactions for Errors Use MC_BMSK for 16 Byte Align Blk Mem Wrt Wrt PEND_NUM Threshold: 8. RD_TYPE Memory Prefetch Algorithm: Short RL_TYPE Mem Rd Line Prefetch Type: Medium RM_TYPE Mem Rd Multiple Cmd Type: Long ARB_MODE Arbitration: MC-PCI Priority Mode HAE Sparse Mem Adr<31:27> x00000000 PCI Upper Adr Bits<31:25> x00000000 Write Device Interrupt Info Interrupts asserted Hard Error x00000001
Mem Host Address Ext Reg IO Host Adr Ext Register Interrupt Ctrl Register Struct:Enabled Interrupt Request Interrupt Mask0 Register Interrupt Mask1 Register
MC Error Info Register 0 MC Error Info Register 1
x07FBF080 x801E8800
MC Bus Trans Addr<31:4>: 7FBF080 MC bus trans addr <39:32> x00000000 MC Command is Read0-Mem Device ID 2 x00000002
MC bus error assoc w read/dirty MC error info valid CAP Error Register xE0000000 Uncorrectable ECC err det by MDPA Uncorrectable ECC err det by MDPB MC error info latched
Sys Environmental Regs PCI Bus Trans Error Adr MDPA Status Register MDPA Error Syndrome Reg Valid MDPB Status Register MDPB Error Syndrome Reg Valid PALcode Revision
MDPA Status Register Data Not Valid MDPA Syndrome Register Data Not MDPB Status Register Data Not Valid MDPB Syndrome Register Data Not Palcode Rev: 1.21-3
CPU0 logged the error in a system with two CPUs. The External Interface Status Register does not record an error. Both IOD CAP Error Registers logged an error. The MC Error Info Registers 0 and 1 captured the error information. The commander at the time of the error was CPU1 (known from MC_ERR1: Device Id x3B, that is, MID 3). The command on the bus at the time was a write-back memory command.

Since this is an MCHK 660, the IOD detected the error on the bus, and CPU0 is logging the error. The CPU0 registers are not important in this case, since CPU0 is only servicing the IOD interrupt. Three kinds of devices can put data on the system bus: CPUs, memory, and IODs. From MC_ERR Register 1 we know that at the time of the error CPU1 put bad data on the bus while writing to memory. See Section 5.4 for a procedure designed to help with IOD-detected errors.

NOTE: The error log example has been edited to decrease its size; registers of interest are in bold type. The Horse module referred to in the error log is the system bus to PCI bus bridge module, the B3040 module. The Saddle module is the PCI motherboard, the B3050 module. The MC bus is the system bus. Refer to Table 5-9 for information on decoding commands, and refer to Table 5-10 for information on node IDs.
System type register x00000016 Number of CPUs (mpnum) x00000002 CPU logging event (mperr) x00000000 Event validity 1. Event severity 1. Entry type 100. CPU Minor class Software Flags Active CPUs Hardware Rev System Serial Number Module Serial Number Module Type System Revision * MCHK 660 Regs * Flags: PCI Mask Machine Check Reason PAL SHADOW REG 0 . . . PAL SHADOW REG 7 PALTEMP0 . . . PALTEMP23 Exception Address Reg Exception Summary Reg Exception Mask Reg PAL BASE x0000000008 Interrupt Summary Reg IBOX Ctrl and Status Reg
O/S claims event is valid Severe Priority CPU Machine Check Errors
2. 660 Entry x0000000300000000 IOD 1 Register Subpkt Pres IOD 2 Register Subpkt Pres x00000003 x00000000 C1563 x0000 x00000000 x00000000 x0000 x0202 x00000000
x00000000 x0000000007
x00000000047FDA58 xFFFFFC000038D784 Native-mode instruction Exception PC x3FFFFF00000E35E1 x00000000 x00000000 x00000000020000 Base addr for palcode = x00000000200000 EXT. HW interrupt at IPL21 AST requests 3 - 0 x00000000 x000000C160000000 Timeout Bit Not Set PAL Shadow Registers Enabled Correctable Err Intrpts Enabled ICACHE BIST Successful TEST_STATUS_H Pin Asserted x00000000 x00000000 xFFFFFFFFFF800130 x00000000014990 Ref resulted in DTB miss
Icache Par Err Stat Reg Dcache Par Err Stat Reg Virtual Address Reg Memory Mgmt Flt Sts Reg
RA Field Scache Address Reg Scache Status Reg Bcache Tag Address Reg
x0000000006
Opcode Field x00000000000029 xFFFFFF0000024EAF x00000000 xFFFFFF80FFED6FFF Parity for ds and v bits Cache block dirty Cache block valid Tag address<38:20> is
x00000000000FFE Ext Interface Address Reg xFFFFFF00FC00000F Fill Syndrome Reg x0000000000C5D2 Ext Interface Status Reg xFFFFFFF004FFFFFF Error occurred during D-ref fill LD LOCK xFFFFFF000020065F ** IOD SUBPACKET -> ** WHOAMI x000000BA
IOD 0 Register Subpacket Device ID x0000003A Bcache Size = 2MB VCTY ASIC Rev = 0 Module Revision 0. Base Address of Bridge x000000F9E0000000 PCI Revision x06008021 CAP Chip Revision x00000001 Horse Module Revision x00000002 Saddle Module Revision x00000000 Saddle Module Type Left Hand EISA Present PCI Class Code x00000600 MC-PCI Command Register x06480FF1 Selftest passed Delayed read enabled Bridge PCI trans enabled Req 64 bit data trans enabled Accept 64 bit data trans enabled Check PCI Addr Parity enabled Check MC bus CMS/Addr Parity enabled Check MC bus NXM enabled Check all transaction enabled 16 byte aligned block write enabled Write Pend Number Thresho x00000008 RD_TYPE Short RL_TYPE Medium RM_TYPE Long ARB_MODE MC-PCI Bridge Priority Mode Memory Host Addr Exten x00000000 IO Host Addr Extension x00000000 Interrupt Control x00000003 MC-PCI Intr Enabled Device intr info enabled if en_int = 1 Interrupt Request x00800000 Interrupts asserted x00000000 Hard Error Interrupt Mask Register 0 x00C50010 Interrupt Mask Register 1 x00000000 MC Error Info Register 0 x4A26DBF0 MC bus trans addr <31:4> x04A26DBF MC Error Info Register 1 x800ED600 MC bus trans addr <39:32> x00000000 MC_Command x00000016 Device Id x0000003B MC error info valid
CAP Error Register xA0000000 Uncorrectable ECC err det by MDPA
x80000000
MDPA Chip Revision x00000000 MDPA Error Syndrome of error Cycle 0 ECC Syndrome Cycle 1 ECC Syndrome x00000000 Cycle 2 ECC Syndrome x00000000 Cycle 3 ECC Syndrome
x1E00001E
x00000000 x00000000
MDPB Chip Revision x00000000 Cycle 0 ECC Syndrome x00000000 Cycle 1 ECC Syndrome x00000000 Cycle 2 ECC Syndrome x00000000 Cycle 3 ECC Syndrome x00000000
IOD 1 Register Subpacket Device ID x0000003A Bcache Size = 2MB VCTY ASIC Rev = 0 Module Revision 0. Base Address of Bridge x000000FBE0000000 PCI Revision x06000021 CAP Chip Revision x00000001 Horse ModuleRevision x00000002 Saddle Module Revision x00000000 Saddle Module Type Left Hand PCI Class Code x00000600 MC-PCI Command Register x06480FF1 Selftest passed Delayed read enabled Bridge PCI trans enabled Req 64 bit data trans enabled Accept 64 bit data trans enabled Check PCI Addr Parity enabled Check MC bus CMS/Addr Parity enabled Check MC bus NXM enabled Check all transaction enabled 16 byte aligned block write enabled Write Pend Number Thresho x00000008 RD_TYPE Short RL_TYPE Medium RM_TYPE Long ARB_MODE MC-PCI Bridge Priority Mode Memory Host Addr Exten x00000000 IO Host Addr Extension x00000000 Interrupt Control x00000003 MC-PCI Intr Enabled Device intr info enabled if en_int = 1 Interrupt Request x00800000 Interrupts asserted x00000000 Hard Error Interrupt Mask Register 0 x00C50001 Interrupt Mask Register 1 x00000000 MC Error Info Register 0 x4A26DBF0 MC bus trans addr <31:4> x04A26DBF MC Error Info Register 1 x800ED600 MC bus trans addr <39:32> x00000000 MC_Command x00000016 x000000BA Device Id x0000003B MC error info valid
CAP Error Register xA0000000
MC error info latched PCI Bus Trans Error Adr MDPA Status Register x00000000 x80000000
x1E00001E
x00000000 x00000000
PALcode Revision
error Cycle 0 ECC Syndrome x00000000 Cycle 1 ECC Syndrome x00000000 Cycle 2 ECC Syndrome x00000000 Cycle 3 ECC Syndrome x00000000 MDPB Chip Revision x00000000 Cycle 0 ECC Syndrome x00000000 Cycle 1 ECC Syndrome x00000000 Cycle 2 ECC Syndrome x00000000 Cycle 3 ECC Syndrome x00000000 Palcode Rev: 1.21-3
CPU0 logged the error in a system with two CPUs. During a D-ref fill, the External Interface Status Register shows no error but states that the data source is the B-cache. (When a CPU chip does not find the data it needs in any of its caches, it requests the data from off the chip to fill its D-cache; that is, it performs a D-ref fill.) Both IOD CAP Error Registers logged no error. The Fill Syndrome Register has a valid ECC code for the lower half of the data.

Machine check 630s are detected by CPUs either when they take data off the system bus or when they access their own B-cache. In this case, the data did not come from the system bus; otherwise, bit <30> would be set in the External Interface Status Register. CPU0 had a single-bit, ECC-correctable error.

NOTE: The error log example has been edited to decrease its size; registers of interest are in bold type. The Horse module referred to in the error log is the system bus to PCI bus bridge module, the B3040 module. The Saddle module is the PCI motherboard, the B3050 module. The MC bus is the system bus. Refer to Table 5-9 for information on decoding commands, and refer to Table 5-10 for information on node IDs.
System type register x00000016 Number of CPUs (mpnum) x00000002 CPU logging event (mperr) x00000000 Event validity Event severity Entry type CPU Minor class Software Flags Active CPUs Hardware Rev System Serial Number Module Serial Number Module Type System Revision Machine Check Reason B-Cache EI STAT
1. O/S claims event is valid 3. High Priority 100. CPU Machine Check Errors 3. Bcache error (630 entry) x00000000 x00000003 x00000000 C1563 x0000 x00000000 x0086 Alpha Chip Detected ECC Err, From
xFFFFFFF004FFFFFF DATA SOURCE IS BCACHE D-ref fill EV5 Chip Rev 4 xFFFFFF00138D85EF x00000000000800 x0000000100200000 x00000000 Module Revision MID 0. GID 0. 0.
Sys Environmental Regs Base Addr of Bridge Dev Type & Rev Register
CAP Chip Revision: x00000000 Horse Module Revision: x00000000 Saddle Module Revision: x00000000 Saddle Module Type: LeftHand Internal CAP Chip Arbiter: Enabled PCI Class Code x00000000 MC Bus Trans Addr<31:4>: 0 MC bus trans addr <39:32> x00000000 MC Command is Illegal Illegal Device ID 2 x00000000 MDPA Status Register Data Not Valid MDPA Syndrome Register Data Not MDPB Status Register Data Not Valid MDPB Syndrome Register Data Not
x00000000 x00000000
CAP Error Register MDPA Status Register MDPA Error Syndrome Reg Valid MDPB Status Register MDPB Error Syndrome Reg Valid
PALcode Revision
CPU0 logged the error in a system with two CPUs. The External Interface Status Register is not valid. The MC Error Info Registers 0 and 1 captured the error information. The commander at the time of the error was CPU0 (Device ID 2 in MC_ERR1). The command at the time of the error was a write-back memory command.

The IOD detected a recoverable error on the system bus. The MC command at the time of the error was a WriteBack Mem command (MC_CMD x16 in MC_ERR1 bits <13:8>). The system bus commander at the time of the error was CPU0. Since this is a write, the defective FRU is CPU0.

NOTE: The error log example has been edited to decrease its size; registers of interest are in bold type. The Horse module referred to in the error log is the system bus to PCI bus bridge module, the B3040 module. The Saddle module is the PCI motherboard, the B3050 module. The MC bus is the system bus. Refer to Table 5-9 for information on decoding commands, and refer to Table 5-10 for information on node IDs.
System type register        x00000016
Number of CPUs (mpnum)      x00000002
CPU logging event (mperr)   x00000000
Event validity              1. O/S claims event is valid
Event severity              5. Low Priority
Entry type                  100. CPU Machine Check Errors
CPU Minor class             4. 620 System Correctable Error
Software Flags              x0000000000000000
Active CPUs                 x00000003
Hardware Rev                x00000000
System Serial Number        C1563
Module Serial Number
Module Type                 x0000
System Revision             x00000000
Machine Check Reason        x0204 IOD Detected Soft Error
Ext Interface Status Reg    x0000000000000000 Not Valid for 620 System Correctable Errors
Ext Interface Address Reg   x0000000000000000 Not Valid for 620 System Correctable Errors
Fill Syndrome Reg           x0000000000000000 Not Valid for 620 System Correctable Errors
Interrupt Summary Reg       x0000000000000000 Not Valid for 620 System Correctable Errors
WHOAMI                      x00000000 Module Revision 0. MID 0. GID 0.
Sys Environmental Regs      x00000000
Base Addr of Bridge         x000000FBE0000000
Dev Type & Rev Register     x06000032
                            CAP Chip Revision: x00000002
                            HORSE Module Revision: x00000003
                            SADDLE Module Revision: x00000000
                            SADDLE Module Type: LeftHand
                            Internal CAP Chip Arbiter: Enabled
                            PCI Class Code x00000600
MC Error Info Register 0    x122D5640
                            MC Bus Trans Addr<31:4>: 122D5640
MC Error Info Register 1    x800E9600
                            MC bus trans addr <39:32> x00000000
                            MC Command is WriteBack Mem
                            CPU0 Master at Time of Error
                            Device ID 2 x00000002
                            MC error info valid
CAP Error Register          x89000000
                            Error Detected but Not Logged
                            Correctable ECC err det by MDPA
                            MC error info latched
MDPA Status Register        x00000000 MDPA Status Register Data Not Valid
MDPA Error Syndrome Reg     x00000000 MDPA Syndrome Register Data Not Valid
MDPB Status Register        x00000000 MDPB Status Register Data Not Valid
MDPB Error Syndrome Reg     x00000000 MDPB Syndrome Register Data Not Valid
PALcode Revision            Palcode Rev: 0.0-1
Go to Step 2
Go to Step 2
Go to Step 2
100x x10x x000 0000 0000 0000 000x xxxx 100x x01x x000 0000 0000 0000 000x xxxx 100x x00x 1000 0000 0000 0000 000x xxxx 0000 0000 0000 0000 0000 0000 0001 1xxx 0000 0000 0000 0000 0000 0000 0001 x1xx 0000 0000 0000 0000 0000 0000 0001 xx1x 0000 0000 0000 0000 0000 0000 0001 xxx1
Bad nondirty data from memory (bad memory) Bad nondirty data from memory (bad memory) Bad dirty data from a CPU Bad dirty data from a CPU
Bad data from MID = 2 Bad data from MID = 3 Bad data from MID = 4 Bad data from MID = 5 Bad data from MID = 6 Bad data from MID = 7
Replace CPU0 Replace CPU1 Replace IOD Replace IOD Replace CPU2 Replace CPU3
1000 0000 000x xxxx xxxx xxxx 0xxx xxxx Software generated an MC ADDR > TOP_OF_MEM reg 1000 0000 0000 xxxx xxxx xxxx 1xxx 100x PCI0 bridge did not respond 1000 0000 0001 xxxx xxxx xxxx 1xxx 101x PCI1 bridge did not respond NOTE: IOD = B3040 bridge module
For a Corrected Read Data Error (CRD)

When a CRD error occurs, determine which memory module pair caused the error as follows:
1. At the SRM console prompt, enter the show mem command. This command displays the base address and size of the memory module pair for each slot.
   P00>>> show mem
2. Compare this address to the failing address from the MC_ERR1 and MC_ERR0 Registers to determine which memory slot is failing.
3. When you have isolated the failing memory pair, determine which of the two modules is bad. (You cannot do this if the operating system is Windows NT.) Read the CPU Fill Syndrome Register. If this register is non-zero, use the ECC syndrome bits in Table 5-8 to determine which module had the single-bit error.
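The address comparison described above can be sketched in a few lines of Python. This is only an illustration: it assumes, as the sample logs and register descriptions in this manual suggest, that MC_ERR0 holds system bus address bits <31:4> and that MC_ERR1 bits <7:0> hold address bits <39:32>. The function names and the memory pair table are hypothetical; real base addresses and sizes must come from the show mem output on the system being serviced.

```python
def failing_address(mc_err0: int, mc_err1: int) -> int:
    """Assemble the failing system bus address from MC_ERR0 and MC_ERR1.

    Assumption: MC_ERR0<31:4> carries address bits <31:4> and
    MC_ERR1<7:0> carries address bits <39:32>.
    """
    low = mc_err0 & 0xFFFFFFF0   # address bits <31:4>; bits <3:0> are dropped
    high = mc_err1 & 0xFF        # address bits <39:32>
    return (high << 32) | low


def find_memory_pair(address: int, pairs):
    """Return the name of the memory pair whose range contains address.

    pairs is a list of (name, base, size) tuples copied by hand from
    the show mem display (hypothetical structure).
    """
    for name, base, size in pairs:
        if base <= address < base + size:
            return name
    return None


# Hypothetical memory configuration, for illustration only.
example_pairs = [
    ("pair A (hypothetical)", 0x00000000, 0x08000000),
    ("pair B (hypothetical)", 0x08000000, 0x08000000),
]

# Register values taken from one of the sample logs in this chapter.
addr = failing_address(0x07FBF080, 0x801E8800)
print(hex(addr), find_memory_pair(addr, example_pairs))
```

Keeping the lookup table separate from the decoding makes it easy to retype the show mem values for each system without touching the bit arithmetic.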
54 xx 00 xx xx x0 x0 x1 x1 00 10
Description Mem Idle Write Pend Ack Mem Refresh Set Dirty Write Thru - Mem Write Thru - I/O Write Back - Mem Write Intr - I/O Write Full - Mem Write Part - Mem (B-cache CPU only) Write Mask - I/O Write Merge Mem Read0 - Mem Read0 - I/O Read1 - Mem Read1 - I/O Read Mod0 Mem
IOD Y Y
Y Y Y Y Y Y Y Y Y Y
x0 x0 xx xx xx xx xx
0/2 7 0/2 7 X8 X8 X9 X9 XA
1 0 0 1 0 1 0
Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y
54 xx xx xx xx 10 10 xx xx
Description Read Mod0 Mem Read Peer0 - I/O Read Mod1 Mem Read Peer1 - I/O FILL0 (due to Read0/Peer0) FILL1 (due to Read1/Peer1) Read0 - Mem Read1 - Mem
IOD Y Y
Y Y Y Y
Y Y
Y Y
5.5
Two error cases require special attention. Neither double error halts nor machine checks that occur while the machine is in PAL mode result in error log entries. Nevertheless, information is available that can help determine what error occurred.
The machine returns to the console and displays the following message:
Example 5-7 INFO 3 Command
P00>>> info 3
cpu00 per_cpu impure area cns$flag cns$flag+4 cns$hlt cns$hlt+4 cns$mchkflag cns$mchkflag+4 cns$exc_addr cns$exc_addr+4 cns$pal_base cns$pal_base+4 cns$mm_stat cns$mm_stat+4 cns$va cns$va+4 cns$icsr cns$icsr+4 cns$ipl cns$ipl+4 cns$ps cns$ps+4 cns$itb_asn cns$itb_asn+4 cns$aster cns$aster+4 cns$astrr cns$astrr+4 cns$isr cns$isr+4 cns$ivptbr cns$ivptbr+4 cns$mcsr cns$mcsr+4 cns$dc_mode cns$dc_mode+4 cns$maf_mode cns$maf_mode+4 cns$sirr cns$sirr+4 cns$fpcsr cns$fpcsr+4 cns$icperr_stat cns$icperr_stat+4 cns$pmctr cns$pmctr+4 cns$exc_sum cns$exc_sum+4 cns$exc_mask cns$exc_mask+4 cns$intid cns$intid+4 cns$dcperr_stat cns$dcperr_stat+4 cns$sc_stat cns$sc_stat+4 cns$sc_addr cns$sc_addr+4 cns$sc_ctl cns$sc_ctl+4 cns$bc_tag_addr cns$bc_tag_addr+4 cns$ei_stat cns$ei_stat+4 00004400 00000001 00000000 00000000 00000000 00000228 00000000 20000004 00000000 00000000 00000000 0000da10 00000000 00080000 00000002 40000000 000000c1 0000001f 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00400000 00000000 00000000 00000002 00000000 00000000 00000001 00000000 00000080 00000000 00000000 00000000 00000000 ff900000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000016 00000000 00000000 00000000 00000000 00000000 000047cf ffffff00 0000f000 00000000 ff7fefff ffffffff 04ffffff fffffff0
: : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : :
0000 0004 0008 000c 0210 0214 0318 031c 0320 0324 0338 033c 0340 0344 0348 034c 0350 0354 0358 035c 0360 0364 0368 036c 0370 0374 0378 037c 0380 0384 0388 038c 0390 0394 0398 039c 03a0 03a4 03a8 03ac 03b0 03b4 03b8 03bc 03c0 03c4 03c8 03cc 03d0 03d4 03d8 03dc 03e0 03e4 03e8 03ec 03f0 03f4 03f8 03fc 0400 0404
: : : :
mchk$fill_syn+4 mchk$ei_stat mchk$ei_stat+4 mchk$ld_lock mchk$ld_lock+4 IOD: 0 base address: f9e0000000 WHOAMI: CAP_CTL: INT_CTL: INT_MASK1: CAP_ERR: MDPA_SYN: 0000003a 02490fb1 00000003 00000000 84000000 00000000 PCI_REV: HAE_MEM: INT_REQ: MC_ERR0: PCI_ERR: MDPB_STAT:
00000000 : 018c 04ffffff : 0190 fffffff0 : 0194 00005b6f : 0198 ffffff00 : 019c
06008221 00000000 00800000 e0000000 00000000 00000000 HAE_IO: INT_MASK0: MC_ERR1: MDPA_STAT: MDPB_SYN: 00000000 00010000 800e88fd 00000000 00000000
IOD: 1 base address: fbe0000000 WHOAMI: CAP_CTL: INT_CTL: INT_MASK1: CAP_ERR: MDPA_SYN: 0000003a 02490fb1 00000003 00000000 84000000 00000000 PCI_REV: HAE_MEM: INT_REQ: MC_ERR0: PCI_ERR: MDPB_STAT: 06000221 00000000 00800000 e0000000 00000000 00000000 HAE_IO: INT_MASK0: MC_ERR1: MDPA_STAT: MDPB_SYN: 00000000 00010000 800e88fd 00000000 00000000
Error Registers
FF FFF0 0168 R
24 23 0
Fill data from B-cache or main memory could have correctable or uncorrectable errors in ECC mode. In parity mode, fill data parity errors are treated as uncorrectable hard errors. System address/command parity errors are always treated as uncorrectable hard errors, irrespective of the mode.

The sequence for reading, unlocking, and clearing EI_STAT, EI_ADDR, BC_TAG_ADDR, and FILL_SYN is as follows:
1. Read the EI_ADDR, BC_TAG_ADDR, and FILL_SYN registers in any order. This does not unlock or clear any register.
2. Read the EI_STAT register. This operation unlocks the EI_ADDR, BC_TAG_ADDR, and FILL_SYN registers. It also unlocks the EI_STAT register subject to conditions given in Table 6-2, which defines the loading and locking rules for external interface registers.
NOTE: If the first error is correctable, the registers are loaded but not locked. On the second correctable error, the registers are neither loaded nor locked. Registers are locked on the first uncorrectable error except the second hard error bit. This bit is set only for an uncorrectable error that follows an uncorrectable error. A correctable error that follows an uncorrectable error is not logged as a second error. B-cache tag parity errors are uncorrectable in this context.
COR_ECC_ERR <31>
EI_ES
<30>
BC_TPERR
<28>
CHIP_ID
<27:24>
<23:0>
FIL_IRD
<34>
UNC_ECC_ER
<32>
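As a rough aid for reading EI_STAT values in the error logs, the following Python sketch extracts the fields whose bit positions appear in the field list above. It is not a complete decoder: only the bits named in this section are handled, the field names are copied as printed (including UNC_ECC_ER), and the function name decode_ei_stat is hypothetical.

```python
# Bit positions as listed in this section; other EI_STAT bits are ignored.
EI_STAT_FLAGS = {
    "FIL_IRD": 34,
    "UNC_ECC_ER": 32,
    "COR_ECC_ERR": 31,
    "EI_ES": 30,      # set when the error source is memory or the system bus
    "BC_TPERR": 28,
}


def decode_ei_stat(value: int) -> dict:
    """Return the named EI_STAT flags plus the CHIP_ID field <27:24>."""
    fields = {name: bool((value >> bit) & 1)
              for name, bit in EI_STAT_FLAGS.items()}
    fields["CHIP_ID"] = (value >> 24) & 0xF
    return fields


# EI_STAT values taken from two of the sample logs in Chapter 5.
for sample in (0xFFFFFFF944FFFFFF, 0xFFFFFFF004FFFFFF):
    print(hex(sample), decode_ei_stat(sample))
```

For the first sample value, the sketch reports UNC_ECC_ER and EI_ES set, which matches the log text for that entry (uncorrectable ECC error, error source is memory or system). For the second, no error flag is set and EI_ES is clear, consistent with a B-cache data source.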
FF FFF0 0148 R
4 3 0
All 1s
61
40 39
32
All 1s
EI_ADDR <39:32>
PKW0454-96
Table 6-2 Loading and Locking Rules for External Interface Registers

Correctable  Uncorrectable  Second Hard   Load      Lock      Action When EI_STAT Is Read
Error        Error          Error         Register  Register
0            0              Not possible  No        No        Clears and unlocks all registers
1            0              Not possible  Yes       No        Clears and unlocks all registers
0            1              0             Yes       Yes       Clears and unlocks all registers
1 [1]        1              0             Yes       Yes       Clear bit (c) does not unlock. Transition to 0,1,0 state.
0            1              1             No                  Clears and unlocks all registers
1 [1]        1              1             No                  Clear bit (c) does not unlock. Transition to 0,1,1 state.

[1] These are special cases. It is possible that when EI_ADDR is read, only the correctable error bit is set and the registers are not locked. By the time EI_STAT is read, an uncorrectable error is detected and the registers are loaded again and locked. The value of EI_ADDR read earlier is no longer valid. Therefore, for the 1,1,x case, when EI_STAT is read, the correctable error bit is cleared and the registers are not unlocked or cleared. Software must reexecute the IPR read sequence. On the second read operation, the error bits are in the 0,1,x state, all the related IPRs are unlocked, and EI_STAT is cleared.
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
Reserved
<3:0>
RO
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
1 1 1 MID
Reserved Dirty
<30:21> <20>
RO RO
0 0 Set if the system bus error was associated with a Read/Dirty transaction. When set, the device ID field <19:14> does not indicate the source of the data. All ones.
Reserved DEVICE_ID
<19:17> <16:14> RO 0
Slot number of bus master at the time of the error. Active command at the time the error was detected. Address bits <39:32> of the transaction on the system bus when an error is detected.
MC_CMD<5:0>
<13:8>
RO
ADDR<39:32>
<7:0>
RO
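The field layout above can be applied mechanically to an MC_ERR1 value from an error log. The Python sketch below is a minimal illustration under two assumptions: that bit <31> is the valid bit (the field table in this extract starts at bit <30>, but the sample logs report "MC error info valid" whenever bit <31> is set), and that the MID-to-FRU mapping is the one given in the IOD-detected error tables of Chapter 5 (MID 2 = CPU0, 3 = CPU1, 4 and 5 = IOD, 6 = CPU2, 7 = CPU3). The function and dictionary names are hypothetical.

```python
# MID-to-FRU mapping as given in the Chapter 5 troubleshooting tables.
MID_TO_FRU = {2: "CPU0", 3: "CPU1", 4: "IOD", 5: "IOD", 6: "CPU2", 7: "CPU3"}


def decode_mc_err1(value: int) -> dict:
    """Split an MC_ERR1 value into the fields described in this section."""
    return {
        "valid": bool((value >> 31) & 1),   # assumed position of the valid bit
        "dirty": bool((value >> 20) & 1),   # Read/Dirty transaction
        "mid": (value >> 14) & 0x7,         # DEVICE_ID <16:14>
        "mc_cmd": (value >> 8) & 0x3F,      # MC_CMD<5:0> in bits <13:8>
        "addr_39_32": value & 0xFF,         # ADDR<39:32> in bits <7:0>
    }


def commander(value: int) -> str:
    """Name the bus master at the time of the error.

    When the dirty bit is set, the device ID identifies the requester,
    not the source of the data.
    """
    return MID_TO_FRU.get(decode_mc_err1(value)["mid"], "unknown")


# MC_ERR1 values taken from the sample logs in Chapter 5.
for sample in (0x801E8800, 0x800ED600, 0x800E9600):
    print(hex(sample), decode_mc_err1(sample), commander(sample))
```

The decoded mc_cmd value is the raw 6-bit command code; use Table 5-9 to translate it into a command name.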
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
PERR SERR MAB PTE_INV PCI_ERR_VALID reserved PIO_OVFL LOST_MC_ERR MC_ADR_PERR NXM CRDA CRDB RDSA RDSB MC_ERR VALID
RDSB
<30>
RW1C
RDSA
<29>
RW1C
CRDB
<28>
RW1C
CRDA
<27>
RW1C
NXM
<26>
RW1C
MC_ADR_PERR
<25>
RW1C
PIO_OVFL
<23>
RW1C
Reserved PCI_ERR_VALID
<22:5> RO <4> RO
0 0 Logical OR of bits <3:0> of this register. When set, the PCI error address register is locked. Invalid page table entry on scatter/gather access. PCI master state machine detected PCI Target Abort (likely cause: NXM) (except Special Cycle). On reads fill error is also returned. PCI target state machine observed SERR#. CAP asserts SERR when it is master and detects target abort. PCI master state machine observed PERR#.
PTE_INV MAB
<3> <2>
RW1C RW1C
0 0
SERR
<1>
RW1C
PERR
<0>
RW1C
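To relate a CAP_ERR value in an error log to the bit names above, a small lookup such as the following Python sketch may help. The positions of bits <30:25>, <23>, and <4:0> come from the field list in this section; the positions of MC_ERR VALID <31> and LOST_MC_ERR <24> are inferred from the register diagram ordering and should be treated as assumptions. The names CAP_ERR_BITS and decode_cap_err are hypothetical.

```python
# (bit, name) pairs, highest bit first.
CAP_ERR_BITS = [
    (31, "MC_ERR_VALID"),   # position inferred from the register diagram
    (30, "RDSB"),
    (29, "RDSA"),
    (28, "CRDB"),
    (27, "CRDA"),
    (26, "NXM"),
    (25, "MC_ADR_PERR"),
    (24, "LOST_MC_ERR"),    # position inferred from the register diagram
    (23, "PIO_OVFL"),
    (4, "PCI_ERR_VALID"),
    (3, "PTE_INV"),
    (2, "MAB"),
    (1, "SERR"),
    (0, "PERR"),
]


def decode_cap_err(value: int) -> list:
    """Return the names of all CAP_ERR bits that are set, highest first."""
    return [name for bit, name in CAP_ERR_BITS if (value >> bit) & 1]


# CAP_ERR values taken from the sample logs in Chapter 5.
for sample in (0xE0000000, 0xA0000000, 0x89000000):
    print(hex(sample), decode_cap_err(sample))
```

For x89000000 the sketch lists MC_ERR_VALID, CRDA, and LOST_MC_ERR, which lines up with the log annotations "MC error info latched", "Correctable ECC err det by MDPA", and "Error Detected but Not Logged" for that entry.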
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
CAUTION: Wear an antistatic wrist strap whenever you work on a system. The AlphaServer 4100 cabinet system has a wrist strap connected to the frame at the front and rear of the cabinet. The pedestal system does not have an attached strap, so you will have to take one to the site.

WARNING: When the system interlocks are disabled and the system is still powered on, voltages are low in the system drawer, but current is high. Observe the following guidelines to prevent personal injury.
1. Remove any jewelry that may conduct electricity before working on the system.
2. Do not insert your hands between the fan and the power supply.
3. If you need to access the system card cage, power down the system and wait 2 minutes to allow components in that area to cool.
[Figure callouts: Fan Tray, Power Cable, Chassis Assembly]
Required System Drawer Modules and Display
54-23803-01   System motherboard
B3040-AA      System bus to PCI bus bridge module
54-24117-01   Power control module
B3050-AA      PCI motherboard
54-24364-01   OCP logic module
54-24366-01   OCP switch module
54-24674-01   Server control module
54-24691-01   Fan fail detect module
30-43049-01   OCP display
Fans
12-23609-21   4.5-inch fan
12-24701-34   CPU fan
Fan Tray Cables (Cabinet Only)
17-04324-01   Elec fan power harness
17-04325-01   12V power for SCM
17-04338-01   Power ground cable
17-04339-01   AC cable power
Current share conn on PS1 and PS2 Floppy OCP signal (system drawer only) OCP 7 conns. sys mbrd sys fans 0, 1 5V conn on PCI mbrd CD-ROM drv pwr Floppy pwr 1 OCP DC enable pwr conn or pwr conn on ped tray pwr drive cable (17-04293-01)
17-04294-01
Other OCP DC enable pwr conn or pwr conn on ped tray pwr drive cable (17-04293-01) 12 V DC enable conn on SCM SCM Sys fan 2 and SCM internal 12V conn 16 pos conn on SCM
SCM 12V interlock jumper SCM 34-position jumper SCM 12V power jumper SCM 16-position jumper
Interlock conn on PCI mbrd SCM Power harness (17-04217-01) SCM sig conn on PCI mbrd
17-04302-01 17-04305-01
17-04306-01
17-04380-01
[Figure callouts: Cabinet OCP, SCSI, Interlock; item numbers 10 and 13]
Item  Part Number                Description
1     30-45353-01, 30-45353-02   AC input box; only in cabinet systems: The 01 variant is for N. Amer./Japan and has a NEMA L6-30P power cord; the 02 variant is for Europe and AP and has an IEC 309 power cord.
2     12-23501-01, 12-45334-02   AC power strip: The 12-23501-01 is used on pedestals in N. Amer./Japan only and has six NEMA outlets and a 15 ft. cord to the wall outlet; the 12-45334-02 is used on pedestals in Eur./AP and on cabinet systems worldwide and has six IEC320 outlets. In pedestal systems, cords match country-specific wall outlets.
2a    17-04285-01                Power cord from AC input box to power strip. 0.5 meter, IEC320 to IEC320 connector used in cabinet systems only.
3     17-00606-02, 17-04285-02   Power cord from power strip to power supply: The 17-00606-02 is a 2 m NEMA to IEC320 AC jumper used with the 12-23501-01 power strip in N. Amer./Japan pedestals. The 17-04285-02 is a 2 m IEC320 to IEC320 AC jumper used with the 12-45334-02 power strip on pedestals in Eur./AP and on cabinet systems worldwide.
4     30-44712-01                Power supply; 92 to 264 VAC input; one to three in a system drawer.
5     17-04199-01                Cable connecting power supplies
6     17-04217-01                Power distribution harness
7     17-04201-01                Cable from OCP to PCI motherboard (cabinet system)
8     17-04294-01                Interconnect and cable to OCP
9     17-04351-01                Power from power harness between harness and Fan 2 to SCM.
10    17-04293-01                Cable from power harness to interconnect cable and pedestal tray connector (pedestal system)
11    17-04302-01                Cable from pedestal tray connector to PCI motherboard (pedestal system)
12    17-04201-01                Cable from pedestal tray connector to OCP (pedestal system)
13    17-04305-01                Cable from pedestal tray connector to OCP and SCSI devices (pedestal system)
14    17-04339-01                Power cord from power strip to cabinet fan tray (cabinet only)
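The rules for choosing the AC input box, power strip, and power cords depend on the system type and region. They can be summarized as a small lookup, sketched below under the assumption that the descriptions in the parts table are complete. The function name and the region keys `na_japan` and `europe_ap` are hypothetical labels introduced only for this example.

```python
def select_power_parts(system: str, region: str) -> dict:
    """Return AC power part numbers per the parts table descriptions.

    system: "pedestal" or "cabinet"
    region: "na_japan" (N. Amer./Japan) or "europe_ap" (Europe and AP);
            both keys are illustrative labels, not DIGITAL terminology.
    """
    system, region = system.lower(), region.lower()
    if system not in ("pedestal", "cabinet"):
        raise ValueError(f"unknown system type: {system}")
    if region not in ("na_japan", "europe_ap"):
        raise ValueError(f"unknown region: {region}")

    if system == "pedestal" and region == "na_japan":
        # Six NEMA outlets; paired with the NEMA to IEC320 jumper.
        strip, jumper = "12-23501-01", "17-00606-02"
    else:
        # Six IEC320 outlets: Eur./AP pedestals and all cabinet systems.
        strip, jumper = "12-45334-02", "17-04285-02"

    parts = {"power_strip": strip, "strip_to_supply_cord": jumper}
    if system == "cabinet":
        # The AC input box and its 0.5 m cord exist only in cabinet systems.
        parts["ac_input_box"] = "30-45353-01" if region == "na_japan" else "30-45353-02"
        parts["input_box_to_strip_cord"] = "17-04285-01"
    return parts


if __name__ == "__main__":
    print(select_power_parts("cabinet", "europe_ap"))
```

Always confirm part numbers against the current parts list before ordering; this lookup only restates the table.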
Exposing the System Bus or PCI Bus Card Cages
1. Open the front and rear doors of the cabinet.
2. At the front of the cabinet, unplug the drawer's power supplies.
3. At the rear, remove the two Phillips screws holding the shipping bracket on the right rail so that the drawer can be pulled out.
4. Using a flathead screwdriver, disengage the lock mechanism at the lower left hand corner of the drawer.
5. Pull the drawer out part way and release the lock mechanism by removing the screwdriver. If you wish to remove the whole drawer for some reason, leave the screwdriver in place.
6. Once the lock mechanism has been released, slide the drawer out until it locks.
7. Remove the system bus card cage cover. Unscrew the two Phillips head screws holding the cover in place and slide it off the drawer.
8. Remove the PCI bus card cage cover. Unscrew the three Phillips head screws holding the cover to the side of the drawer and slide it off the drawer.
Exposing the Power System or System Fans
1. Open the front and rear doors of the cabinet.
2. At the rear of the cabinet, remove any cables from PCI options that may interfere with pulling the drawer forward.
3. At the front, remove the shipping brackets on the right and left rails that hold the drawer.
4. Pull out the drawer until it locks.
5. Remove the power section cover. Unscrew the two Phillips head screws and slide the cover off the drawer.
Exposing the System Drawer
1. Open the front door and remove it by lifting and pulling it away from the system.
2. Remove the top cover. Unscrew the two Phillips head screws midway up on each side of the pedestal, tilt the cover up, and lift it away from the frame.
3. Remove the system bus card cage cover at the back of the pedestal if you are replacing any of the following: CPU, memory, power control module, system bus to PCI bus module, system motherboard, cables that attach to the system motherboard, or a system fan. To remove the cover, unscrew the two Phillips head screws and slide the cover off the drawer.
4. Remove the PCI bus card cage cover at the back of the pedestal if you are replacing any of the following: PCI or EISA option, server control module, PCI motherboard, or cables attached to the PCI motherboard. To remove the cover, unscrew the three Phillips head screws holding the cover to the side of the drawer and slide the cover off the drawer.
5. Remove the pedestal tray as described below if you are replacing any of the following: system fan, power supply, or power cables.
Removing the Pedestal Tray
1. Remove the tray cover by loosening the screws at the back of the tray.
2. Disconnect the cables from the OCP and any optional SCSI device from the bulkhead connector in the rear right corner of the tray.
3. Unscrew the Phillips head screw holding the bulkhead to the tray.
4. Unscrew the two Phillips head retaining screws and slide the tray off the drawer.
CPU Module
WARNING: CPU modules and memory modules have parts that operate at high temperatures. Wait 2 minutes after power is removed before touching any module.
Removal
1. Shut down the operating system and power down the system.
2. Expose the system drawer.
3. Expose the system bus card cage. Remove the two Phillips head screws holding the cover in place and slide it off the drawer.
4. Identify and remove the faulty CPU. A label to the left of the system bus card cage identifies which slot contains CPU0, CPU1, CPU2, or CPU3. The CPU is held in place with levers at both ends; simultaneously raise the levers and lift the CPU from the cage.
Replacement
Reverse the steps in the Removal procedure.
Verification: Digital UNIX and OpenVMS Systems
1. Bring the system up to the SRM console by pressing the Halt button, if necessary.
2. Issue the show cpu command to display the status of the new module.
Verification: Windows NT Systems
1. Start AlphaBIOS Setup, select Display System Configuration, and press Enter.
2. Using the arrow keys, select MC Bus Configuration to display the status of the new module.
Removal
1. Follow the CPU Removal and Replacement procedure.
2. Unplug the fan from the module.
3. Remove the four Phillips head screws holding the fan to the Alpha chip's heatsink.
Replacement
Reverse the above procedure.
Verification
If the system powers up, the CPU fan is working.
WARNING: CPU modules and memory modules have parts that operate at high temperatures. Wait 2 minutes after power is removed before touching any module.
Removal
1. Shut down the operating system and power down the system.
2. Expose the system drawer.
3. Expose the system bus card cage. Remove the two Phillips head screws holding the cover in place and slide it off the drawer.
4. Identify and remove the faulty module. A label to the left of the system card cage identifies which slot contains the high or low halves of memory banks. The memory module is held in place by a flathead captive screw attached to the top brace of the module. Loosen the screw and lift the module from the cage.
Replacement
Reverse the steps in the Removal procedure.
NOTE: Memory modules must be installed in pairs. When you replace a bad module, be sure the second module in the pair is in place.
Verification: Digital UNIX and OpenVMS Systems
1. Bring the system up to the SRM console by pressing the Halt button, if necessary.
2. Issue the show memory command to display the status of the new memory.
3. Verify the functioning of the new memory by issuing the command test memn, where n is 0, 1, 2, 3, or *.
Verification: Windows NT Systems
1. Start AlphaBIOS Setup, select Display System Configuration, and press Enter.
2. Using the arrow keys, select Memory Configuration to display the status of the new memory.
3. Switch to the SRM console (press the Halt button in so that the LED on the button lights and reset the system). Verify the functioning of the new memory by issuing the command test memn, where n is 0, 1, 2, 3, or *.
7.9
Removal
1. Shut down the operating system and power down the system.
2. Expose the system drawer.
3. Expose the system bus card cage. Remove the two Phillips head screws holding the cover in place and slide it off the drawer.
4. Remove the faulty PCM. The PCM is located in the back left corner of the system bus card cage. A captive flathead screw and the rear card guide hold the PCM in place. Unscrew the screw and lift the module from the cage.
Replacement
Reverse the steps in the Removal procedure.
Verification
Power up the system. If the PCM is faulty or not seated properly, the system will not come up.
7.10 System Bus to PCI Bus Bridge Module Removal and Replacement
Figure 7-9 Removing System Bus to PCI Bus Bridge Module
Removal
1. Shut down the operating system and power down the system.
2. Expose the system drawer.
3. Expose the system bus card cage. Remove the two Phillips head screws holding the cover in place and slide it off the drawer.
4. Expose the PCI bus card cage. Remove the three Phillips head screws holding the cover in place and slide it off the drawer.
5. Remove all the PCI/EISA options.
6. Remove the server control module.
7. Remove the PCI motherboard.
8. Remove the two Phillips head screws holding the system bus to PCI bus bridge module to the sheet metal between the system bus card cage and the PCI bus card cage.
9. Remove enough CPU and memory modules to the right of the bridge module to allow a flathead screwdriver to be inserted in the slot in the middle of the module's top bracket.
10. Place a flathead screwdriver into the slot in the middle of the module's top bracket and into the corresponding slot in the sheet metal between the two card cages. Use the screwdriver as a lever to disconnect the bridge module from the connector on the system motherboard.
11. Remove the bridge module from the system bus card cage.
Replacement
Reverse the steps in the Removal procedure.
Verification
Power up the system (press the Halt button if necessary to bring up the SRM console) and issue the show device command at the console prompt to verify that the system sees all system options and peripherals.
Removal
1. Shut down the operating system and power down the system.
2. Expose the system drawer.
3. Expose the system bus card cage by removing the two Phillips head screws holding the cover in place and sliding the cover off the drawer.
4. Remove all CPUs, memory modules, and the PCM from the system motherboard.
5. Expose the PCI bus card cage. Remove the three Phillips head screws holding the cover in place and slide it off the drawer.
6. Remove all the PCI/EISA options.
7. Remove the server control module.
8. Remove the PCI motherboard.
9. Remove the system bus to PCI bus module from the system motherboard.
10. Remove the bracket holding the power cables in place as they pass from the system bus section to the power section of the drawer.
11. Disconnect all cables to the system motherboard and lay them back over the power supply section of the system drawer.
CAUTION: Secure the power harness connectors in the system card cage to ensure that they cannot damage the pins in the CPU connectors.
12. Remove both the front and back module card guides. Unscrew the two screws that hold the guides in place.
13. Remove the system motherboard from the card cage by removing the 15 Phillips head screws holding it in place. Record the system serial number. (The serial number is on a barcode on the side of the system drawer or on the system bus card cage.)
Replacement
Reverse the above procedure. To align the motherboard in the cage, start replacing the screws in the corners next to the system bus to PCI bus bridge module and then the PCM module. Subsequent screws should align properly.
Verification
1. Power up the system (press the Halt button if necessary to bring up the SRM console) and issue the show device command at the console prompt to verify that all system options are seen.
2. Restore the system serial number by issuing the set sys_serial_num command at the SRM console prompt.
PCI Motherboard
Removal
The PCI motherboard contains an NVRAM with ECU data and customized console environment variables. Therefore, if the console runs, execute a show * command at the console prompt and, if you have not done so earlier, record the settings for the sys_model_number and sys_type environment variables. These environment variables are used to display the system model number and type, and they are used to compute certain information passed to the operating system. When you replace the PCI motherboard, these environment variables are lost and must be restored after the module swap.
1. Shut down the operating system and power down the system.
2. Expose the system drawer.
3. Expose the PCI bus card cage. Remove the three Phillips head screws holding the cover in place and slide it off the drawer.
4. Remove all PCI and EISA options.
5. Disconnect all cables connected to the PCI motherboard.
6. Remove the server control module.
7. Unscrew the two screws holding the system bus to PCI bus bridge module in the system bus card cage to the PCI motherboard.
8. Remove the nine Phillips head screws that hold the motherboard in place. To reach the screws on the bottom of the board, thread your screwdriver through the three holes in the sheet metal.
9. Carefully pry the motherboard loose from the system bus to PCI bus bridge module on the other side of the sheet metal separating the system bus card cage from the PCI card cage.
10. Remove the motherboard from the card cage.
Replacement
Reverse the steps in the Removal procedure.
Verification
1. Power up the system (press the Halt button if necessary to bring up the SRM console) and issue the show device command at the console prompt to verify that the system sees all options.
2. Restore the sys_model_num, sys_type, and other customized environment variables to their previous settings. Run the ECU to restore EISA configuration data. This must be done regardless of whether there is an EISA option in the EISA slot on PCI 0.
Removal
1. Shut down the operating system and power down the system.
2. Expose the system drawer.
3. Expose the PCI bus card cage. Remove three Phillips head screws holding the cover in place and slide it off the drawer.
4. Disconnect the cables connected at the bulkhead to the server control module.
5. If necessary, remove several PCI and EISA options from the bottom of the PCI card cage up until you can access the server control module.
6. Disconnect the two cables connected to the PCI motherboard at the server control module end.
7. The server control module is held in place by four stud snaps. Gently pull the module off the snaps and remove it.
Replacement
Reverse the steps in the Removal procedure.
Verification
Verify console output on COM1.
WARNING: To prevent fire, use only modules with current limited outputs. See National Electrical Code NFPA 70 or Safety of Information Technology Equipment, Including Electrical Business Equipment EN 60 950.
Removal
1. Shut down the operating system and power down the system.
2. Expose the system drawer.
3. Expose the PCI bus card cage. Remove three Phillips head screws holding the cover in place and slide it off the drawer.
4. Remove the faulty option. Disconnect cables connected to the option. Unscrew the small Phillips head screw securing the option to the card cage. Slide the option from the card cage.
Replacement
Reverse the steps in the Removal procedure.
Verification: Digital UNIX and OpenVMS Systems
1. Power up the system (press the Halt button if necessary to bring up the SRM console) and run the ECU to restore EISA configuration data.
2. Issue the show config command or show device command at the console prompt to verify that the system sees the option you replaced.
3. Run any diagnostic appropriate for the option you replaced.
Verification: Windows NT Systems
1. Start AlphaBIOS Setup, select Display System Configuration, and press Enter.
2. Using the arrow keys, select PCI Configuration or EISA Configuration to determine that the new option is listed.
Removal
1. Shut down the operating system and power down the system.
2. Expose the system drawer.
3. Remove the cover to the power section of the drawer. Remove the two Phillips head screws holding the cover in place and slide it off the drawer.
4. Release the power supply tray by removing the two Phillips head screws on the side of the drawer. See .
5. Lift the power supply tray to release it from the sheet metal and slide it out from the drawer until it locks (about 4 inches).
6. Tilt the tray to allow easier access to the back of the power supplies.
7. Unplug the connectors at the rear of the supply that is being replaced.
8. Unscrew the four Phillips head screws at the front of the tray that hold the power supply in place. Also unscrew the two screws at the back of the power supply. See .
9.
Replacement
Reverse the steps in the Removal procedure.
Verification
Power up the system. If the system has redundant power, the system will power up regardless of whether the replaced power supply is faulty. In this case look at the PCM LEDs to determine that the power supply is functioning properly. If the system does not have redundant power, it will not power up.
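The verification logic above depends on whether the system has redundant power. It can be written out as a small decision function. This is only a sketch of the reasoning in the paragraph above; the function name, argument names, and result strings are hypothetical.

```python
from typing import Optional


def power_supply_verdict(powers_up: bool, redundant: bool,
                         pcm_leds_ok: Optional[bool] = None) -> str:
    """Interpret a power-up attempt after replacing a power supply.

    powers_up:   the system came up after the replacement
    redundant:   the system has redundant power
    pcm_leds_ok: what the PCM LEDs indicate for the replaced supply,
                 or None if they have not been checked yet
    """
    if not redundant:
        # Without redundancy, a faulty supply keeps the system from powering up.
        return "ok" if powers_up else "suspect"
    if not powers_up:
        # With redundancy the system should come up either way, so look elsewhere.
        return "check_other_causes"
    if pcm_leds_ok is None:
        # Power-up alone proves nothing with redundant power.
        return "check_pcm_leds"
    return "ok" if pcm_leds_ok else "suspect"


if __name__ == "__main__":
    print(power_supply_verdict(powers_up=True, redundant=True))
```

The point of the third argument is that a successful power-up is conclusive only on a system without redundant power.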
[Figure callouts: Fans, Power Supplies, P18, P15, OCP Tray, To OCP, To CD-ROM, To Floppy]
Removal
1. Shut down the operating system and power down the system.
2. Expose the system drawer.
3. Expose the power and system card cage sections of the drawer by removing the two covers. Unscrew the two Phillips head screws holding each cover in place and slide the covers off the drawer.
4. If you want more space to work on the fans, do this step and the next; otherwise skip to step 7. Release the power supply tray by removing the two Phillips head screws on the side of the drawer.
5. Lift the power supply tray to release it from the sheet metal and slide it out from the drawer until it locks.
6. Tilt the tray to allow easier access to the fans.
7. Remove the bracket holding the power harness as it passes from the power section to the system card cage section of the drawer. Remove the three Phillips head screws holding the bracket in place.
8. Disconnect the power harness from the system motherboard and fold the harness back over the power supplies.
CAUTION: Secure the power harness connectors in the system card cage to ensure that they cannot damage the pins in the CPU connectors.
9. Push the power cables through the hole from the tray into the power section of the drawer.
10. Disconnect the power harness from the power supplies. Remove the harness from the system.
Replacement
Reverse the steps in the Removal procedure.
Removal
1. Shut down the operating system and power down the system.
2. Expose the system drawer.
3. Expose the power system, the system card cage, and the PCI card cage sections of the drawer by removing all three covers. Unscrew the two Phillips head screws holding each cover on top of the drawer in place and slide them off the drawer. Release the two lever latches holding the PCI card cage cover in place and slide it off.
4. Release the power supply tray by removing the two Phillips head screws on the side of the drawer.
5. Lift the power supply tray to release it from the sheet metal and slide it out from the drawer.
6. Tilt the tray to allow easier access to the fans.
7. Remove the bracket holding the power harness as it passes from the power section to the system card cage section of the drawer. Remove the three Phillips head screws holding the bracket in place.
8. Disconnect the power harness from the system motherboard and fold the harness back over the power supplies. Remove any modules that prevent you from disconnecting the harness from the system motherboard.
CAUTION: Secure the power harness connectors in the system card cage to ensure that they cannot damage the pins in the CPU connectors.
9. Disconnect the two power connectors from the PCI motherboard and pass them through the hole from the PCI card cage to the power section of the drawer.
10. Disconnect the fan power cables from the power harness.
11. Remove the four Phillips head screws holding the OCP tray to the system drawer. Slide the tray out of the system drawer far enough to disconnect power cables attached to the OCP, the floppy, and the CD-ROM drive.
12. Remove the tray from the system.
13. Release the three lever latches on the bracket holding all three fans in place.
14. Disconnect the broken fan's power cable from the power harness and lift the fan from the drawer.
Replacement
Reverse the steps in the Removal procedure.
Verification
Power up the system. If the fan you installed is faulty, the system will not power up. Look at the PCM LEDs to determine that the fan you replaced is functioning properly.
Removal
1. Shut down the operating system and power down the system.
2. Expose the system drawer.
3. Remove all three section covers to expose the interlock switch assembly.
4. Remove the two screws holding the interlock in place.
5. Push the interlock toward the opposite side of the system drawer (be sure not to twist it) and tilt it so that the switches affected by the power and system card cage covers clear the openings in the side of the drawer.
6. Slide it toward the front of the drawer and remove it, letting it hang loosely over the side of the drawer.
7. If you are working on a pedestal system, disconnect the switch connection from the tray bulkhead and remove the interlock switch assembly.
8. If you are working on a system drawer, unscrew the four screws holding the OCP tray assembly in place beneath the drawer in front. Slide the tray out and remove it from the system.
9. Pull the interlock switch connection to the OCP back through the access hole and remove the entire switch assembly.
Replacement
Reverse the steps in the Removal procedure.
Verification
Power up the system. If the switch you installed is faulty, the system will not power up.
Removal
1. Shut down the operating system and power down the system.
2. Expose the system drawer.
3. While you need not remove the tray containing the OCP, you do need to slide it forward to access the OCP retaining screws under the tray. The tray is attached to the power system section cover. To slide the tray forward:
   a. Remove the tray cover by loosening the retaining screws at the back of the tray and sliding it toward the back of the system.
   b. Disconnect the cables from the OCP, and any optional SCSI device in the tray from the bulkhead at the rear right of the tray.
   c. Unscrew the Phillips head retaining screw holding the bulkhead to the tray.
   d. Unscrew the two Phillips head retaining screws at the front of the system drawer and slide the tray forward.
4. Remove the white power interconnect wire and the signal ribbon cable from the OCP.
5. Remove the two Phillips head screws holding the OCP in place and remove it from the tray.
Replacement
Reverse the steps in the Removal procedure.
Verification
Power up the system. If the OCP you installed is faulty, the system will not power up.
Removal
1. Shut down the operating system and power down the system.
2. Expose the system drawer.
3. Remove the four Phillips head screws holding the OCP tray to the system drawer.
4. Slide the tray out of the system drawer far enough to disconnect cables attached to the OCP, the floppy, and the CD-ROM drive.
5. Remove the tray from the system.
6. Move the tray to some handy work surface. Hold the tray vertically and remove the two Phillips head screws that hold the OCP in place from the bottom of the tray and remove the OCP assembly from the tray.
Replacement
Reverse the steps in the Removal procedure. As you replace the tray in the drawer, be sure that the slides on the sides of the tray are placed on the rails in the drawer.
Verification
Power up the system. If the OCP you installed is faulty, the system will not power up or you will not see messages on the OCP display.
Removal
1. Shut down the operating system and power down the system.
2. Expose the system drawer.
3. Remove the four Phillips head screws holding the OCP tray to the system drawer.
4. Slide the tray out of the system drawer and disconnect cables attached to the OCP (unnecessary on a pedestal system), the floppy, and the CD-ROM drive. (In the pedestal system the OCP is in the tray above the power supplies.)
5. Move the tray to some handy work surface. Hold the tray vertically and from the bottom of the tray remove the four Phillips head screws that hold the floppy in place and remove it from the tray.
Replacement
Reverse the steps in the Removal procedure. As you replace the tray in the drawer, be sure that the slides on the sides of the tray are placed on the rails in the drawer.
Verification
Power up the system. Use the following SRM console commands to test the floppy:
P00>>> show dev floppy
P00>>> HD buf/dva0
Removal
1. Shut down the operating system and power down the system.
2. Expose the system drawer.
3. Remove the four Phillips head screws holding the OCP tray to the system drawer.
4. Slide the tray out of the system drawer and disconnect cables attached to the OCP (unnecessary on a pedestal system), the floppy, and the CD-ROM drive. (In the pedestal system the OCP is in the pedestal tray above the power supplies.)
5. Move the tray to some handy work surface. Hold the tray vertically and from the bottom of the tray remove the four Phillips head screws that hold the CD-ROM drive in place and remove it from the tray.
Replacement
Reverse the steps in the Removal procedure. As you replace the tray in the drawer, be sure that the slides on the sides of the tray are placed on the rails in the drawer.
Verification
Power up the system (press the Halt button if necessary to bring up the SRM console). Use the following SRM console commands to test the CD-ROM:
P00>>> show dev ncr0
P00>>> HD buf/dka nnn
where nnn is the device number; for example, dka500.
Removal
1. Shut down the operating system and power down the system.
2. Unplug the AC power cable from the cabinet tray power supply.
3. If present, unplug any power cables going to the server control modules at the back of system drawers.
4. Unscrew the four Phillips head screws securing the fan tray to the top of the cabinet.
5. Loosen the four hexnuts that hold the tray to the top of the cabinet.
6. Holding the bottom of the tray, slide it out so that the holes in the tray frame slip over the loosened hexnuts. Move the tray to a work surface to remove whatever component is being replaced.
Replacement
Reverse the steps in the Removal procedure.
Verification
Power up the system. If the green power LED comes on, and the fan LED is off, the cabinet fan tray is verified.
Removal
1. Remove the cabinet fan tray.
2. Disconnect the power harness from the fan fail detect module and each fan.
3. Remove the power supply cover. It is held in place by two screws that go through the AC bulkhead spot welded to the tray weldment.
4. Remove the power harness from the tray by disconnecting it from the power supply.
5. Disconnect the neutral and load leads from the power supply.
6. Remove the four screws holding the power supply to the tray. Keep track of the standoffs that provide space between the power supply and weldment. You will need them during replacement.
Replacement
1. Reverse the steps in the Removal procedure.
2. Place the fan tray back in the cabinet.
Verification
Power up the system. If the green power LED comes on, and the fan LED is off, the cabinet fan tray power supply is verified.
Removal
1. Remove the cabinet fan tray.
2. Disconnect the power harness from the fan you wish to replace.
3. Remove the fan finger guard.
4. Remove the two remaining screws holding the fan to the tray and remove the fan.
5. If the new fan does not have clip nuts, remove them from the old fan.
Replacement
1. Reverse the Removal procedure, taking care to orient the fan so that the connection to the power harness is neatly dressed.
2. Place the fan tray back in the cabinet.
Verification
Power up the system. If the green power LED comes on, and the fan LED is off, the cabinet fan tray fan is verified.
7.26 Cabinet Fan Tray Fan Fail Detect Module Removal and Replacement
Figure 7-25 Removing Fan Tray Fan Fail Detect Module
Removal
1. Remove the cabinet fan tray.
2. Disconnect the power harness from the fan fail detect module.
3. Remove the fan fail detect module. In early systems, the module is held in place by three screws that go through the weldment, through three standoffs, through the module to nuts. In later systems, the module snaps in place.
Replacement
1. Reverse the steps in the Removal procedure.
2. Place the fan tray back in the cabinet.
Verification
Power up the system. If the green power LED comes on, and the fan LED is off, the cabinet fan fail detect module is verified.
Removal
1. Shut down the operating system and power down the system.
2. Remove the power cord and signal cord(s) from the StorageWorks shelf.
3. Remove the two retaining brackets holding the shelf in the mounting rail by removing the Phillips head screws holding the brackets in place.
4. Slide the shelf out of the system.
Verification
Power up the system. Use the show device console command to verify that the StorageWorks shelf is configured into the system.
Running Utilities
[Figure: AlphaBIOS Setup menu. Items: Display System Configuration..., Upgrade AlphaBIOS, Hard Disk Setup..., CMOS Setup..., Install Windows NT, Utilities, About AlphaBIOS...]
1. Start AlphaBIOS Setup. If the system is in the SRM console, issue the command alphabios. (If the system has a graphics monitor, you can set the SRM console environment variable to graphics.)
2. From AlphaBIOS Setup, select Utilities, then select Run ECU from floppy from the submenu that displays, and press Enter.
NOTE: The EISA Configuration Utility is supplied on diskettes shipped with the system. There is a diskette for Microsoft Windows NT and a diskette for Digital UNIX and OpenVMS.
3. Insert the correct ECU diskette for the operating system and press Enter to run it.
The ECU main menu displays the following options:
EISA Configuration Utility
Steps in configuring your computer
STEP 1: Important EISA configuration information
STEP 2: Add or remove boards
STEP 3: View or edit details
STEP 4: Examine required details
STEP 5: Save and exit
NOTE: Step 1 of the ECU provides online help. It is recommended that you select this step and become familiar with the utility before proceeding.
The AlphaServer 4100 system supports the KZPSC-xx PCI RAID controller (SWXCR). The KZPSC-xx kit includes the controller, RAID Array 230 Subsystems software, and documentation.
1. Start AlphaBIOS Setup. If the system is in the SRM console, issue the command alphabios. (If the system has a graphics monitor, you can set the SRM console environment variable to graphics.)
2. At the Utilities screen, select Run Maintenance Program. Press Enter.
3. In the Run Maintenance Program dialog box, type swxcrmgr in the Program Name: field.
4. Press Enter to execute the program. The Main menu displays the following options:
  [01.View/Update Configuration]
   02.Automatic Configuration
   03.New Configuration
   04.Initialize Logical Drive
   05.Parity Check
   06.Rebuild
   07.Tools
   08.Select SWXCR
   09.Controller Setup
   10.Diagnostics

Refer to the RAID Array Subsystems documentation for information on using the Standalone Configuration Utility to set up RAID drives.
[Figure PK-0726A-96: AlphaBIOS Setup screen]
  Display System Configuration...
  Upgrade AlphaBIOS
  Hard Disk Setup
  CMOS Setup...
  Install Windows NT
  Utilities
  About AlphaBIOS...
  ESC=Exit
Use the Loadable Firmware Update (LFU) utility to update system firmware. You can start LFU from either the SRM console or the AlphaBIOS console. From the SRM console, start LFU by issuing the lfu command. From the AlphaBIOS console, select Upgrade AlphaBIOS from the AlphaBIOS Setup screen (see Figure A-2).

A typical update procedure is:
1. Start LFU.
2. Use the LFU list command to show the revisions of modules that LFU can update and the revisions of update firmware.
3. Use the LFU update command to write the new firmware.
4. Use the LFU exit command to exit back to the console.

The sections that follow show examples of updating firmware from the local CD-ROM, the local floppy, and a network device. Following the examples is an LFU command reference.
AS4X00CP from DKA500.5.0.1.1 .
[as4x00]RHREADME from DKA500.5.0.1.1 .
[as4x00]RHSRMROM from DKA500.5.0.1.1 ....................
[as4x00]RHARCROM from DKA500.5.0.1.1 .............

-----------------------------------------------------------------
Function    Description
-----------------------------------------------------------------
Display     Displays the system's configuration table.
Exit        Done exit LFU (reset).
List        Lists the device, revision, firmware name, and update revision.
Lfu         Restarts LFU.
Readme      Lists important release information.
Update      Replaces current firmware with loadable data image.
Verify      Compares loadable and hardware images.
? or Help   Scrolls this function table.
-----------------------------------------------------------------
UPD> list
Device       Current Revision    Filename
AlphaBIOS    V5.12-2             arcrom
srmflash     V1.0-9              srmrom
Select the device from which firmware will be loaded. The choices are the internal CD-ROM, the internal floppy disk, or a network device. In this example, the internal CD-ROM is selected.

Select the file that has the firmware update, or press Enter to select the default file. The file options are:
  AS4X00FW (default)   SRM console, AlphaBIOS console, and I/O adapter firmware
  AS4X00CP             SRM console and AlphaBIOS console firmware only
  AS4X00IO             I/O adapter firmware only
In this example the file for console firmware (AlphaBIOS and SRM) is selected.

The LFU function table and prompt (UPD>) display.

Use the LFU list command to determine the revision of firmware in a device and the most recent revision of that firmware available in the selected file. In this example, the resident firmware for each console (SRM and AlphaBIOS) is at an earlier revision than the firmware in the update file.
The update command updates the device specified or all devices. In this example, the wildcard indicates that all devices supported by the selected update file will be updated. For each device, you are asked to confirm that you want to update the firmware. The default is no. Once the update begins, do not abort the operation. Doing so will corrupt the firmware on the module. The exit command returns you to the console from which you entered LFU (either SRM or AlphaBIOS).
A.5.2 Updating Firmware from the Internal Floppy Disk: Creating the Diskettes
Create the update diskettes before starting LFU. See Section A.5.3 for an example of the update procedure.
To update system firmware from floppy disk, you first must create the firmware update diskettes. You will need to create two diskettes: one for console updates, and one for I/O.
1. Download the update files from the Internet (see the Preface of this book).
2. On a PC, copy files onto two FAT-formatted diskettes. From an OpenVMS system, copy files onto two ODS2-formatted diskettes as shown in Example A-3.

I/O Update Diskette
$ inquire ignore "Insert blank HD floppy in DVA0, then continue"
$ set verify
$ set proc/priv=all
$ init /density=hd/index=begin dva0: rhods2io
$ mount dva0: rhods2io
$ create /directory dva0:[as4x00]
$ create /directory dva0:[options]
$ copy as4x00fw.sys dva0:[as4x00]as4x00fw.sys
$ copy as4x00io.sys dva0:[as4x00]as4x00io.sys
$ copy rhreadme.sys dva0:[as4x00]rhreadme.sys
$ copy as4x00fw.txt dva0:[as4x00]as4x00fw.txt
$ copy as4x00io.txt dva0:[as4x00]as4x00io.txt
$ copy cipca214.sys dva0:[options]cipca214.sys
$ copy dfpaa246.sys dva0:[options]dfpaa246.sys
$ copy kzpsaA10.sys dva0:[options]kzpsaa10.sys
$ dismount dva0:
$ set noverify
$ exit
A.5.3 Updating Firmware from the Internal Floppy Disk: Performing the Update
Insert an update diskette (see Section A.5.2) into the internal floppy drive. Start LFU and select dva0 as the load device.
Please enter the name of the options firmware files list, or
Press <return> to use the default filename [AS4X00IO,(AS4X00CP)]: AS4X00IO

Copying AS4X00IO from DVA0 .
Copying RHREADME from DVA0 .
Copying CIPCA214 from DVA0 .
Copying DFPAA252 from DVA0 ...
Copying KZPSAA11 from DVA0 ...
.
.    (The function table displays, followed by the UPD> prompt, as
.    shown in Example A-2.)

UPD> list
Device       Current Revision    Filename    Update Revision
AlphaBIOS    V5.12-3             arcrom      Missing file
pfi0         2.46                dfpaa_fw    2.52
srmflash     T3.2-21             srmrom      Missing file
                                 cipca_fw    A214
                                 kzpsa_fw    A11
Select the device from which firmware will be loaded. The choices are the internal CD-ROM, the internal floppy disk, or a network device. In this example, the internal floppy disk is selected.

Select the file that has the firmware update, or press Enter to select the default file. When the internal floppy disk is the load device, the file options are:
  AS4X00CP (default)   SRM console and AlphaBIOS console firmware only
  AS4X00IO             I/O adapter firmware only
The default option in Example A-2 (AS4X00FW) is not available, since the file is too large to fit on a 1.44 MB diskette. This means that when a floppy disk is the load device, you can update either console firmware or I/O adapter firmware, but not both in the same LFU session. If you need to update both, after finishing the first update, restart LFU with the lfu command and insert the floppy disk with the other file.
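The file choices for each load device can be summarized in a short Python sketch. The function name and its boolean parameters are hypothetical, chosen only for illustration; the file names and device names (cda0, dva0, ewa0) come from the examples in this appendix. The sketch assumes the simple rule stated above, one update file per LFU session when loading from floppy, and does not model the two-diskette prompt shown in Example A-5.

```python
def plan_lfu_sessions(load_device, console=True, io=True):
    """Return the update file to select in each LFU session.

    load_device: "cda0" (CD-ROM), "dva0" (floppy), or "ewa0" (network).
    console / io: which kinds of firmware the operator wants to update.
    """
    if not (console or io):
        return []
    if console and io:
        if load_device == "dva0":
            # AS4X00FW is too large for one 1.44 MB diskette, so plan two
            # sessions and restart LFU with the lfu command between them.
            return ["AS4X00IO", "AS4X00CP"]
        return ["AS4X00FW"]
    return ["AS4X00CP"] if console else ["AS4X00IO"]
```

For example, plan_lfu_sessions("dva0") yields two sessions, while plan_lfu_sessions("cda0") yields a single session with the default file. The order of the two floppy sessions follows Example A-4 and is otherwise arbitrary.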
In this example the file for I/O adapter firmware is selected.

Use the LFU list command to determine the revision of firmware in a device and the most recent revision of that firmware available in the selected file. In this example, the update revision for console firmware displays as Missing file because only the I/O firmware files are available on the floppy disk.
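The way to read the list display can be sketched in Python. This is an illustration only: the rows are transcribed by hand from Example A-4, the function name is hypothetical, and two points are assumptions rather than documented LFU behavior: that a file row with no device means the adapter is not installed, and that a device is worth updating whenever its two revisions differ.

```python
def updatable(rows):
    """Return the devices the selected update file can bring to a new revision.

    rows: (device, current_revision, filename, update_revision) tuples.
    device and current_revision are None when the display shows only a file.
    """
    result = []
    for device, current, filename, update in rows:
        if device is None:
            continue  # file is available, but no matching device is listed
        if update == "Missing file":
            continue  # firmware for this device is not on the load device
        if update != current:
            result.append(device)
    return result


# Rows from Example A-4 (I/O update diskette in the floppy drive).
rows = [
    ("AlphaBIOS", "V5.12-3", "arcrom", "Missing file"),
    ("pfi0", "2.46", "dfpaa_fw", "2.52"),
    ("srmflash", "T3.2-21", "srmrom", "Missing file"),
    (None, None, "cipca_fw", "A214"),
    (None, None, "kzpsa_fw", "A11"),
]
```

With these rows only pfi0 qualifies, which matches the explanation above: the console firmware files are missing from the I/O diskette.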
The update command updates the device specified or all devices. For each device, you are asked to confirm that you want to update the firmware. The default is no. Once the update begins, do not abort the operation. Doing so will corrupt the firmware on the module. The lfu command restarts the utility so that console firmware can be updated. (Another method is shown in Example A-5, where the user specifies the file AS4X00FW and is prompted to insert the second diskette.) The default update file, AS4X00CP, is selected. The console firmware can now be updated, using the same procedure as for the I/O firmware. The exit command returns you to the console from which you entered LFU (either SRM or AlphaBIOS).
Example A-5 Selecting AS4X00FW to Update Firmware from the Internal Floppy Disk
P00>>> lfu

***** Loadable Firmware Update Utility *****

Select firmware load device (cda0, dva0, ewa0), or
Press <return> to bypass loading and proceed to LFU: dva0

Please enter the name of the firmware files list, or
Press <return> to use the default filename [AS4X00IO,(AS4X00CP)]: as4x00fw

Copying AS4X00FW from DVA0 .
Copying RHREADME from DVA0 .
Copying RHSRMROM from DVA0 ..........................
Copying RHARCROM from DVA0 ...............
Copying CIPCA214 from DVA0
Please insert next floppy containing the firmware,
Press <return> when ready. Or type DONE to abort.
Copying CIPCA214 from DVA0 .
Copying DFPAA246 from DVA0 ...
Copying KZPSAA10 from DVA0 ...
.
.
.
.
.    [The function table displays, followed by the UPD> prompt, as
.    shown in Example A-2.]

UPD> list
Device       Current Revision    Filename    Update Revision
AlphaBIOS    V5.12-2             arcrom      V6.40-1
kzpsa0       A10                 kzpsa_fw    A11
kzpsa1       A10                 kzpsa_fw    A11
srmflash     V1.0-9              srmrom      V2.0-3
                                 cipca_fw    A214
                                 dfpaa_fw    2.46
Before starting LFU, download the update files from the Internet (see Preface). You will need the files with the extension .SYS. Copy these files to your local MOP server's MOP load area.

Select the device from which firmware will be loaded. The choices are the internal CD-ROM, the internal floppy disk, or a network device. In this example, a network device is selected.

Select the file that has the firmware update, or press Enter to select the default file. The file options are:
  AS4X00FW (default)   SRM console, AlphaBIOS console, and I/O adapter firmware
  AS4X00CP             SRM console and AlphaBIOS console firmware only
  AS4X00IO             I/O adapter firmware only
In this example the default file, which has both console firmware (AlphaBIOS and SRM) and I/O adapter firmware, is selected.

Use the LFU list command to determine the revision of firmware in a device and the most recent revision of that firmware available in the selected file. In this example, the resident firmware for each console (SRM and AlphaBIOS) and I/O adapter is at an earlier revision than the firmware in the update file.
DO NOT ABORT!
Updating to V6.40-1... Verifying V6.40-1... PASSED.
DO NOT ABORT!
Updating to A11 ... Verifying A11... PASSED.
DO NOT ABORT!
Updating to A11 ... Verifying A11... PASSED.
DO NOT ABORT!
Updating to V2.0-3... Verifying V2.0-3... PASSED.

The update command updates the device specified or all devices. In this example, the wildcard indicates that all devices supported by the selected update file will be updated. Typically, LFU requests confirmation before updating each console's or device's firmware. The -all option removes the update confirmation requests.

The exit command returns you to the console from which you entered LFU (either SRM or AlphaBIOS).
display
The display command shows the system physical configuration. Display is equivalent to issuing the SRM console command show configuration. Because it shows the slot for each module, display can help you identify the location of a device.

exit
The exit command terminates the LFU program, causes system initialization and testing, and returns the system to the console from which LFU was called.

help
The help (or ?) command displays the LFU command list, shown below.
---------------------------------------------------------------------
Function    Description
---------------------------------------------------------------------
Display     Displays the system's configuration table.
Exit        Done exit LFU (reset).
List        Lists the device, revision, firmware name, and update revision.
Lfu         Restarts LFU.
Readme      Lists important release information.
Update      Replaces current firmware with loadable data image.
Verify      Compares loadable and hardware images.
? or Help   Scrolls this function table.
---------------------------------------------------------------------

lfu
The lfu command restarts the LFU program. This command is used when the update files are on a floppy disk. The files for updating both console firmware and I/O firmware are too large to fit on a 1.44 MB disk, so only one type of firmware can be updated at a time. Restarting LFU enables you to specify another update file.
list
The list command displays the inventory of update firmware on the CD-ROM, network, or floppy. Only the devices listed at your terminal are supported for firmware updates. The list command shows three pieces of information for each device:
  Current Revision   The revision of the device's current firmware
  Filename           The name of the file used to update that firmware
  Update Revision    The revision of the firmware update image

readme
The readme command lists release notes for the LFU program.
update
The update command writes new firmware to the module. Then LFU automatically verifies the update by reading the new firmware image from the module into memory and comparing it with the source image.

To update more than one device, you may use a wildcard but not a list. For example, update k* updates all devices with names beginning with k, and update * updates all devices. When you do not specify a device name, LFU tries to update all devices; it lists the selected devices to update and prompts before devices are updated. (The default is no.) The -all option removes the update confirmation requests, enabling the update to proceed without operator intervention.

CAUTION: Never abort an update operation. Aborting corrupts the firmware on the module.
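The wildcard behavior described for update can be pictured with Python's standard fnmatch module. This is only an analogy: LFU's own matching rules are not documented here, the function name is hypothetical, and fnmatch accepts extra pattern characters (such as ? and brackets) that LFU may not. The device names are taken from Example A-5.

```python
from fnmatch import fnmatchcase


def select_devices(pattern, devices):
    """Return the devices whose names match a shell-style wildcard pattern."""
    return [name for name in devices if fnmatchcase(name, pattern)]


# Device names as shown by the list command in Example A-5.
devices = ["AlphaBIOS", "kzpsa0", "kzpsa1", "srmflash"]
```

Here select_devices("k*", devices) picks the two kzpsa adapters and select_devices("*", devices) picks every device, mirroring update k* and update *.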
verify
The verify command reads the firmware from the module into memory and compares it with the update firmware. If a module already verified successfully when you updated it, but later failed tests, you can use verify to tell whether the firmware has become corrupted.
sys_serial_num sys_type
tga_sync_green tt_allow_login
The RCM also provides an autonomous dial-out capability when it detects a power failure within the system. When triggered, the RCM dials a paging service at 30-minute intervals until the administrator clears the alert within the RCM.
[Figure PK-0651-96: Console Terminal and Modem connections]
Modem Selection
The RCM requires a Hayes-compatible modem. The controls that the RCM sends to the modem have been selected to be acceptable to a wide selection of modems. The modems that have been tested and qualified include:
  - Motorola LifeStyle Series 28.8
  - AT&T DATAPORT 14.4/FAX
  - Zoom Model 360
The U.S. Robotics Sportster DATA/FAX MODEM is also supported, but requires some modification of the modem initialization and answer strings. See Section C.1.7.

Modem Configuration Procedure
1. Connect a Hayes-compatible modem to the RCM as shown in Figure C-1, and power up the modem.
2. From the local serial console terminal, enter the RCM firmware console by typing the following escape sequence:
   ^]^]rcm
   Each ^] is entered by holding down the Ctrl key and pressing the ] key (right square bracket). The firmware prompt, RCM>, should now be displayed.
3. Enter a modem password with the setpass command. See Section C.1.3.14.
4. Enable the modem port with the enable command. See Section C.1.3.5.
5. Enter the quit command to leave the RCM console.
6. You are now ready to dial in remotely.
Dialing In to the RCM Modem Port
1. Dial the modem connected to the server control module. The RCM answers the call and after a few seconds prompts for a password with a # character.
2. Enter the password that was loaded using the setpass command. The user has three tries to correctly enter the password. On the third unsuccessful attempt, the connection is terminated, and as a security precaution, the modem is not answered again for 5 minutes. On successful entry of the password, the RCM banner message RCM V1.0 is displayed, and the user is connected to the system COM1 port. At this point the local terminal keyboard is disabled except for entering the RCM console firmware. The local terminal displays all the terminal traffic going out to the modem.
3. To connect to the RCM firmware console, type the RCM escape sequence.
Refer to Example C-1 for an example of the modem dial-in procedure.
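The password rule for a dial-in call (three tries, then a 5-minute refusal to answer) can be modeled with a small Python sketch. The function and constant names are hypothetical, and exact, case-sensitive comparison of the password is an assumption; the manual does not say how the RCM compares passwords.

```python
MAX_TRIES = 3          # password attempts allowed per call
LOCKOUT_SECONDS = 300  # modem is not answered again for 5 minutes


def dial_in(stored_password, attempts):
    """Model the password prompt for one incoming call.

    attempts: the strings the caller types, in order.
    Returns (connected, lockout_seconds).
    """
    for entered in attempts[:MAX_TRIES]:
        if entered == stored_password:
            return True, 0
    if len(attempts) >= MAX_TRIES:
        # Third unsuccessful attempt: connection dropped, answering suspended.
        return False, LOCKOUT_SECONDS
    return False, 0  # caller gave up before the third failure
```

A fourth entry is never examined, because the connection is terminated on the third failure.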
Terminating a Modem Session
Terminate the modem session by executing a hangup command from the RCM console firmware. This will cleanly terminate the modem connection. If the modem connection is terminated without using the hangup command, or if the line is dropped due to phone line problems, the RCM will detect carrier loss and initiate an internal hangup command. This process can take a minute or more, and the local terminal will be locked out until the auto hangup process completes. If the modem link is idle for more than 20 minutes, the RCM initiates an auto hangup.
Entering the RCM Firmware Console
To enter the RCM firmware console, enter the RCM escape sequence. See Example C-2 for the default sequence.
The escape sequence is not echoed on the terminal or sent to the system. Once in the RCM firmware console, the user is in RCM command mode and can enter RCM console commands.

Leaving Command Mode
To leave RCM command mode and reconnect to the system console port, enter the quit command, then press Return to get a prompt from the operating system or system console.
Command Conventions
The commands are not case sensitive. A command must be entered in full. If a command is entered that is not valid, the command fails with the message:
*** ERROR - unknown command ***
Enter a valid command. The RCM commands are described on the following pages.
C.1.3.1 alert_clr
The alert_clr command clears an alert condition within the RCM. The alert enable condition remains active, and the RCM will again enter the alert condition when it detects a system power failure. RCM>alert_clr
C.1.3.2 alert_dis
The alert_dis command disables RCM dial-out capability. It also clears any outstanding alerts. The alert disable state is nonvolatile. Dial-out capability remains disabled until the alert_ena command is issued.
RCM>alert_dis
C.1.3.3 alert_ena
The alert_ena command enables the RCM to automatically dial out when it detects a power failure within the system. The RCM repeats the dial-out alert at 30-minute intervals until the alert is cleared. The alert enable state is nonvolatile. Dial-out capability remains enabled until the alert_dis command is issued.
RCM>alert_ena
In order for the alert_ena command to work, two conditions must be met:
  - A modem dial-out string must be entered with the system console.
  - Remote access to the RCM modem port must be enabled with the enable command.
If the alert_ena command is entered when remote access is disabled, the following message is returned:
*** error ***
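The two conditions for dial-out alerts can be written as a simple pre-check, sketched here in Python. The function name and the wording of the returned strings are hypothetical; the RCM itself reports only the error message shown above.

```python
def alert_blockers(dialout_string, remote_access_enabled):
    """List the unmet conditions that keep dial-out alerts from working."""
    blockers = []
    if not dialout_string:
        blockers.append("no modem dial-out string entered from the system console")
    if not remote_access_enabled:
        blockers.append("remote access not enabled with the enable command")
    return blockers
```

An empty list means both conditions are met and the alert enable command should be accepted.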
C.1.3.4 disable
The disable command disables remote access to the RCM modem port.
RCM>disable
The module's remote access default state is DISABLED. The modem enable state is nonvolatile. When the modem is disabled, it remains disabled until the enable command is issued. If a modem connection is in progress, entering the disable command terminates it.
C.1.3.5 enable
The enable command enables remote access to the RCM modem port. It can take up to 10 seconds for the enable command to be executed.
RCM>enable
The module's remote access default state is DISABLED. The modem enable state is nonvolatile. When the modem is enabled, it remains enabled until the disable command is issued. The enable command can fail for two reasons:
  - There is no modem access password configured.
  - The modem is not connected or is not working properly.
If the enable command fails, the following message is displayed: *** ERROR - enable failed ***
C.1.3.6 hangup
The hangup command terminates the modem session. When this command is issued, the remote user is disconnected from the server. This command can be issued from either the local or remote console. RCM>hangup
C.1.3.7 halt
The halt command attempts to halt the managed system. It is functionally equivalent to pressing the Halt button on the system operator control panel to the in position and then releasing it to the out position. The RCM console firmware exits command mode and reconnects the user's terminal to the server's COM1 serial port.
RCM>halt
Focus returned to COM port
NOTE: Pressing the Halt button has no effect on systems running Windows NT.
C.1.3.8 help or ?
The help or ? command displays the RCM firmware command set.
C.1.3.9 poweroff
The poweroff command requests the RCM module to power off the system. It is functionally equivalent to turning off the system power from the operator control panel. RCM>poweroff If the system is already powered off, this command has no effect. The external power to the RCM must be connected in order to power off the system from the RCM firmware console. If the external power supply is not connected, the command will not power the system down, and displays the message: *** ERROR ***
C.1.3.10 poweron
The poweron command requests the RCM module to power on the system. For the system power to come on, the following conditions must be met:
  - AC power must be present at the power supply inputs.
  - The DC On/Off button must be in the on position.
  - All system interlocks must be set correctly.
The RCM firmware console exits command mode and reconnects the user's terminal to the system console port.
RCM>poweron
Focus returned to COM port
NOTE: If the system is powered off with the DC On/Off button, the system will not power up. The RCM will not override the off state of the DC On/Off button. If the system is already powered on, the poweron command has no effect.
C.1.3.11 quit
The quit command exits the user from command mode and reconnects the user's terminal to the system console port. The following message is displayed:
Focus returned to COM port
The next display depends on what the system was doing when the RCM was invoked. For example, if the RCM was invoked from the SRM console prompt, the console prompt will be displayed when you enter a carriage return. Or, if the RCM was invoked from the operating system prompt, the operating system prompt will be displayed when you enter a carriage return.
C.1.3.12 reset
The reset command requests the RCM module to perform a hardware reset. It is functionally equivalent to pressing the Reset button on the system operator control panel.
RCM>reset
Focus returned to COM port
The following events occur when the reset command is executed:
  - The system restarts and the system console firmware reinitializes.
  - The console exits RCM command mode and reconnects the user's terminal to the server's COM1 serial port.
  - The power-up messages are displayed, and then the console prompt is displayed or the operating system boot messages are displayed, depending on the state of the Halt button.
C.1.3.13 setesc
The setesc command allows the user to reset the default escape sequence for entering console mode. The escape sequence can be any character string. A typical sequence consists of 2 or more characters, to a maximum of 15 characters. The escape sequence is stored in the module's on-board NVRAM.
NOTE: If you change the escape sequence, be sure to record the new sequence. Although the module factory defaults can be restored if the user has forgotten the escape sequence, this involves accessing the server control module and moving a jumper.
The following sample escape sequence consists of five iterations of the Ctrl key and the letter o.
RCM>setesc ^o^o^o^o^o
RCM>
If the escape sequence entered exceeds 15 characters, the command fails with the message:
*** ERROR ***
When changing the default escape sequence, avoid using special characters that are used by the system's terminal emulator or applications. Control characters are not echoed when entering the escape sequence. To verify the complete escape sequence, use the status command.
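A candidate escape sequence can be checked against the 15-character limit before it is typed at the RCM> prompt. The Python sketch below is illustrative: the function names are hypothetical, and it assumes that each control character (such as ^o) counts as one character toward the limit.

```python
MAX_ESCAPE_LEN = 15


def ctrl(key):
    """Return the control character produced by Ctrl plus the given key."""
    return chr(ord(key.upper()) & 0x1F)


def check_escape_sequence(seq):
    """Return None if seq fits the limit, else the error text setesc reports."""
    if len(seq) > MAX_ESCAPE_LEN:
        return "*** ERROR ***"
    return None


sample = ctrl("o") * 5           # the ^o^o^o^o^o sample above
default = ctrl("]") * 2 + "rcm"  # the factory default ^]^]rcm
```

Both sequences are five characters long under this counting, well inside the limit.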
C.1.3.14 setpass
The setpass command allows the user to change the modem access password that is prompted for at the beginning of a modem session. The password is stored in the module's on-board NVRAM.
RCM>setpass
new pass>*********
RCM>
The maximum password length is 15 characters. If the password entered exceeds 15 characters, the command fails with the message:
*** ERROR ***
The minimum password length is one character, followed by a carriage return. If only a carriage return is entered, the command fails with the message:
*** ERROR - illegal password ***
If the user has forgotten the password, a new password can be entered.
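The length rules for setpass can be expressed as a small Python check. The function name is hypothetical; the two error strings are the ones quoted above, and the password is assumed to be passed without its terminating carriage return.

```python
def check_password(password):
    """Return None if the password is acceptable, else the setpass error text."""
    if len(password) == 0:
        return "*** ERROR - illegal password ***"  # only a carriage return entered
    if len(password) > 15:
        return "*** ERROR ***"                     # longer than 15 characters
    return None
```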
C.1.3.15 status
The status command displays the current state of the server's sensors, as well as the current escape sequence and alarm information.
RCM>status
Firmware Rev: V1.0
Escape Sequence: ^]^]RCM
Remote Access: ENABLE/DISABLE
Alerts: ENABLE/DISABLE
Alert Pending: YES/NO
Temp (C): 26.0
RCM Power Control: ON/OFF
External Power: ON
Server Power: OFF
RCM>
The status fields are explained in Table C-2.
Enabling the Dial-Out Alert Function:
1. Enter the set rcm_dialout command, followed by a dial-out alert string, from the SRM console (see Example C-3). The string is a modem dial-out character string, not to exceed 47 characters, that is used by the RCM when dialing out through the modem. See the next topic for details on composing the modem dial-out string.
2. Enter the RCM firmware console and enter the enable command to enable remote access dial-in. The RCM firmware status command should display Remote Access: ENABLE.
3. Enter the RCM firmware alert_ena command to enable outgoing alerts.
Composing a Modem Dial-Out String
The modem dial-out string emulates a user dialing an automatic paging service. Typically, the user dials the pager phone number, waits for a tone, and then enters a series of numbers. The RCM dial-out string (Example C-4) has the following requirements:
  - The entire string following the set rcm_dialout command must be enclosed by quotation marks.
  - The characters ATDT must be entered after the opening quotation marks. Do not mix case. Enter the characters either in all uppercase or all lowercase.
  - Enter the character X if the line to be used also carries voice mail. Refer to the example that follows.
  - The valid characters for the dial-out string are the characters on a phone keypad: 0 through 9, *, and #. In addition, a comma (,) requests that the modem pause for 2 seconds, and a semicolon (;) is required to terminate the string.

Elements of the Dial-Out String
ATXDT   AT = Attention
        X = Forces the modem to dial blindly (not look for a dial tone). Enter this character if the dial-out line modifies its dial tone when used for services such as voice mail.
        D = Dial
        T = Tone (for touch-tone)
,       = Pause for 2 seconds.
The remaining elements of the example string are, in order:
  - In the example, 9 gets an outside line. Enter the number for an outside line if your system requires it.
  - Dial the paging service.
  - Pause for 12 seconds for paging service to answer.
  - Message, usually a call-back number for the paging service.
  - Return to console command mode. Must be entered at end of string.
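The requirements above can be turned into a small composer and checker, sketched here in Python. The function names are hypothetical, the telephone numbers in the usage note are made up, and the sketch assumes the 47-character limit applies to the text between the quotation marks.

```python
import re

MAX_DIALOUT_LEN = 47
PAUSE_SECONDS = 2  # each comma asks the modem to pause for 2 seconds


def compose_dialout(pager_number, message, outside_line="", blind=False,
                    wait_seconds=12):
    """Build a dial-out string: prefix, number, pause, message, semicolon."""
    prefix = "ATXDT" if blind else "ATDT"
    # Round the wait up to a whole number of 2-second pauses.
    commas = "," * -(-wait_seconds // PAUSE_SECONDS)
    line = outside_line + "," if outside_line else ""
    return prefix + line + pager_number + commas + message + ";"


def check_dialout(text):
    """Return a list of problems found in a dial-out string (empty if none)."""
    problems = []
    if len(text) > MAX_DIALOUT_LEN:
        problems.append("longer than 47 characters")
    # All-uppercase or all-lowercase prefix, keypad characters and commas,
    # then the terminating semicolon.
    if re.fullmatch(r"(ATX?DT|atx?dt)[0-9*#,]*;", text) is None:
        problems.append("bad prefix, invalid character, or missing semicolon")
    return problems
```

For example, compose_dialout("15085553333", "5085553332", outside_line="9", blind=True) produces ATXDT9,15085553333,,,,,,5085553332; where the six commas give the 12-second pause. The result would then be entered from the SRM console inside quotation marks, after the set rcm_dialout command.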
Reset Procedure
1. Power down the AlphaServer system and access the server control module, as follows:
   - Expose the PCI bus card cage. Remove three Phillips head screws holding the cover in place and slide it off the drawer.
   - If necessary, remove several PCI and EISA options from the bottom of the PCI card cage until you have enough space to access the server control module.
2. Unplug the external power supply to the server control module. Locate the password and option reset jumper. The jumper number, which is etched on the board, depends on the revision of the server control module.
   NOTE: If the RCM section of the server control module does not have an orange relay, the jumper number is J6. If the RCM section of the server control module has an orange relay, the jumper number is J7.
3. Move the jumper so that it is sitting on both pins.
4. Replace any panels or covers as necessary so you can power up the system. Press the Halt button and then power up the system to the SRM console prompt. Powering up with the password and option reset jumper in place resets the escape sequence, password, and modem enable states to the factory default.
5. When the console prompt is displayed, power down the system and move the password and option reset jumper back onto the single pin.
6. Replace any PCI or EISA modules you removed and replace the PCI bus card cage cover.
7. Power up the system to the SRM console prompt and type the default escape sequence to enter RCM command mode:
   ^]^]RCM
8. Configure the module as desired. You must reset the password and modem enable states in order to enable remote access.
Cables not correctly installed. RCM will not answer when the modem is called. Modem cables may be incorrectly installed. RCM remote access is disabled. RCM does not have a valid password set. The local terminal is currently in the RCM console firmware. On power-up, the RCM defers initializing the modem for 30 seconds to allow the modem to complete its internal diagnostics and initialization. Modem may have had power cycled since last being initialized or modem is not set up correctly.
Enter RCM console and issue the poweron command. After resetting RCM to factory defaults, move the jumper so that it is sitting on only one pin.
The password and option reset jumper is still installed. If the RCM section of the server control module does not have an orange relay, the jumper number is J6. If it does have an orange relay, the number is J7. The modem is confirming whether the modem has really lost carrier. This occurs when the modem sees an idle time, followed by a 3, followed by a carriage return, with no subsequent traffic. If the modem is still connected, it will remain so. The terminal or terminal emulator is including a linefeed character with the carriage return.
The message unknown command is displayed when the user enters a carriage return by itself.
Change the terminal or terminal emulator setting so that new line is not selected.
Phases of Modem Operation
The RCM is programmed to expect specific responses from the modem during four phases of operation:
  - Initialization
  - Ring detection
  - Answer
  - Hang-up

The initialization and answer command strings are stored in the RCM NVRAM. The factory default strings are:
  Initialization string:   AT&F0EVS0=0S12=50<cr>
  Answer string:           ATXA<cr>

NOTE: All modem commands must be terminated with a <cr> character (0x0d hex).

Initialization
The RCM initializes the modem to the following configuration:
  - Factory defaults (&F0)
  - No Echo (E)
  - Numeric response codes (V)
  - No Auto Answer (S0=0)
  - Guard-band = 1 second (S12=50)
  - Fixed modem-to-RCM baud rate
  - Connect at highest possible reliability and speed
The RCM expects to receive a 0<cr> (OK) in response to the initialization string. If it does not, the enable command will fail.
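To see how the default initialization string maps onto the settings listed above, it can be split into its individual commands. The Python sketch below is illustrative only: the function names are hypothetical and the tokenizer covers just the command forms that appear in this appendix (plain letters, ampersand commands, and S-register assignments). The guard-time conversion relies on the usual Hayes convention that S12 counts fiftieths of a second, which agrees with the 1-second figure given for S12=50.

```python
import re

# &F0 style commands, S-register assignments, then plain letter commands.
TOKEN = re.compile(r"&[A-Z]\d*|S\d+=\d+|[A-Z]\d*")


def split_at_command(command):
    """Split a command line such as AT&F0EVS0=0S12=50 into its commands."""
    text = command.strip().upper()
    if not text.startswith("AT"):
        raise ValueError("command line must start with AT")
    body = text[2:]
    tokens = TOKEN.findall(body)
    if "".join(tokens) != body:
        raise ValueError("unrecognized characters in " + command)
    return tokens


def guard_time_seconds(tokens):
    """Return the S12 guard time in seconds, or None if S12 is not set."""
    for token in tokens:
        if token.startswith("S12="):
            return int(token[4:]) / 50
    return None
```

Splitting the default string yields &F0, E, V, S0=0, and S12=50, one command per item in the configuration list, and the guard time works out to 1.0 second.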
This default initialization string works on a wide variety of modems. If your modem does not configure itself to these parameters, the initialization string will need to be modified. See the topic in this section entitled Modifying Initialization and Answer Strings.

Ring Detection
The RCM expects to be informed of an in-bound call by the modem signaling the RCM with the string 2<cr> (RING).

Answer
When the RCM receives the ring message from the modem, it responds with the answer string. The X command modifier used in the default answer string forces the modem to report simple connect, rather than connect at xxxx. The RCM expects a simple connect message, 1<cr> (CONNECTED). If the modem responds with anything else, the RCM forces a hang-up and initializes the modem. The default answer string is formatted to request the modem to provide only basic status. If your modem does not provide the basic response, the answer string, the initialization string, or both will need to be modified. See the topic in this section entitled Modifying Initialization and Answer Strings. After receiving the connect status, the RCM waits for 6 seconds and then prompts the user for a password.

Hangup
When the RCM is requested to hang up the modem, it forces the modem into command mode and issues the hang-up command to the modem. This is done by pausing for at least the guard time and then sending the modem +++. When the modem responds with 0<cr> (OK), the hang-up command string is sent. The modem should respond with 3<cr> (NO CARRIER). After this interchange, the modem is reinitialized in preparation for the next dial-in session.
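The exchange described in this section can be summarized as a lookup from phase and numeric response to the RCM's next step. The Python sketch below paraphrases the text; it is not the RCM firmware. The phase names, including the split of hang-up into an escape step and a command step, and the function name are hypothetical labels.

```python
# Numeric response codes and their meanings as given in this section.
RESPONSES = {"0": "OK", "1": "CONNECTED", "2": "RING", "3": "NO CARRIER"}

# (phase, expected response) -> what the RCM does next.
EXPECTED = {
    ("initialization", "0"): "modem ready",
    ("ring detection", "2"): "send answer string",
    ("answer", "1"): "wait 6 seconds, then prompt for password",
    ("hang-up escape", "0"): "send hang-up command string",
    ("hang-up", "3"): "reinitialize modem",
}


def rcm_next_step(phase, reply):
    """Return the RCM's next step for a modem reply such as '0' plus <cr>."""
    code = reply.rstrip("\r")
    if (phase, code) in EXPECTED:
        return EXPECTED[(phase, code)]
    if phase == "initialization":
        return "enable command fails"
    if phase == "answer":
        return "force hang-up and initialize modem"
    return "not specified in this section"
```

Only the two failure paths named in the text are modeled; any other unexpected reply is reported as unspecified rather than guessed.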
RCM/Modem Interchange Overview

Table C-4 summarizes the actions between the RCM and the modem from initialization to hangup.
To display all the RCM user-settable strings:

P00>>> show rcm*
rcm_answer          ATXA
rcm_dialout
rcm_init            AT&F0EVS0=0S12=50
P00>>>

Initialization and Answer String Substitutions

The RCM default initialization and answer strings are as follows:

Initialization String: AT&F0EVS0=0S12=50
Answer String: ATXA

The following modem requires a modified answer string.

Modem: USRobotics Sportster 28,800 Data/Fax Modem
Initialization String: RCM default
Answer String: ATX0&B1&A0A
Index
?
? command, RCM, C-10
C
Cabinet fan tray fan removal and replacement, 7-52 fan tray fan fail detect module removal and replacement, 7-54 power supply removal and replacement, 7-50 removal and replacement, 7-48 Cabinet system, 1-4 power supply for remote access, 3-5 power and fan LEDs, 3-4 Cables and jumpers, system drawer, 7-5, 7-6 Cables, pedestal, 7-7 CAP Error Register, 6-11 CAP Error Register Data Pattern, 5-38 CAP_ERR Register, 6-11 CD-ROM removal and replacement, 7-46 COM1 port, 2-19 Command codes, 5-46 Command summary (SRM), B-2 Components housed in system drawer, 1-2 Console SRM, 2-23 Console device determination, 2-18 Console device options, 2-19 Console device, changing, 2-19 console environment variable, SRM, 2-21, 2-23
A
alert_clr command, RCM, C-8 alert_dis command, RCM, C-8 alert_ena command, RCM, C-8 Alpha 21164 microprocessor, 1-12 Alpha chip composition, 1-16 AlphaBIOS upgrading, A-26 AlphaBIOS console, 1-11 loading, 2-7 Architecture, system, 1-12 auto_action environment variable, SRM, 2-23 Auxiliary voltage (vaux), 4-9
B
B3002-AA CPU module, 1-17 B3002-AB CPU module, 1-17 B3004-AA CPU module, 1-17 B3020-CA memory module, 1-19, 7-3 B3030-EA memory module, 1-19, 7-3 B3030-FA memory module, 1-19, 7-3 B3040-AA bridge module, 1-24, 7-3 B3050-AA PCI motherboard, 1-26, 7-3 BA30A system drawer, 1-2 B-cache, 2-21, 2-23 Bridge module removal and replacement, 7-22 Bridge module LEDs, 3-3
Console power-up tests, 2-16 Control panel, 1-8, 2-2 display, 2-21 Halt button, 1-9 messages in display, 2-3 Controls Halt button, 1-9 Cover interlocks, 1-3, 4-7 overriding, 4-7 removal and replacement, 7-38 CPU and bridge module LEDs, 3-2 CPU LEDs, 3-3 CPU module, 1-16 configuration rules, 1-17 removal and replacement, 7-14 variants, 1-17 CPU modules, 1-13, 7-3
exit command (LFU), A-11, A-17, A-21, A-22, A-23 External Interface Address Register, 6-6 External Interface Registers loading and locking rules, 6-7 External Interface Status Register, 6-2
F
Fail-safe loader, 2-24 Fan removal and replacement, 7-36 Fan tray cables (cabinet), 7-4 Fan tray, cabinet system, 1-5 Fans, 7-3 Fans, top of cabinet, 3-5 Fatal errors, 5-5 FEPROM and XSROM test flow, 2-13 defined, 2-5 Firmware RCM, C-6 updating, A-7 updating from AlphaBIOS, A-25 updating from CD-ROM, A-8 updating from floppy disk, A-12, A-14 updating from network device, A-18 updating, AlphaBIOS selection, A-6 updating, SRM command, A-6 Floppy removal and replacement, 7-44 FRU list, 7-2 power system, 7-8 FRU part numbers, 7-3
D
DECevent, 5-6 report formats, 5-10 DIAGNOSE command, 5-7 Diagnostics, test command, 3-12 disable command, RCM, C-9 display command (LFU), A-22, A-23 Double error halt, 5-49 Drives, CD-ROM and floppy, 1-8
E
ECC syndrome bits, 5-45 ECU, running, A-4 EL_ADDR Register, 6-6 EL_STAT Register, 6-2 enable command, RCM, C-9 Environment variables SRM console, B-4 Environment variables, SRM, 1-11 auto_action, 2-23 console, 2-21, 2-23 os_type, 2-23 Error detector placement, 5-2 Error log events, 5-5 Error registers, 6-1 Event files, translating, 5-7 Events, filtering, 5-8
G
Graphics monitor, VGA, 2-19
H
halt command, RCM, C-10 Halts caused by power problem, 3-6 hangup command, RCM, C-9
Hard errors, categories of, 5-4 help command (LFU), A-22, A-23 help command, RCM, C-10
I
I squared C bus, 3-10 INFO 3 command, 5-50 INFO 5 command, 5-52 INFO 8 command, 5-54 Initialization and answer strings modifying for modem, C-24 substitutions, C-25 Interlock switches, 7-38 IOD, 2-23 IOD error interrupts, 5-5 IOD, defined, 5-2
update, A-11, A-22, A-24 verify, A-22, A-24 list command (LFU), A-9, A-15, A-19, A-22, A-24 Loadable Firmware Update utility. See LFU
M
Machine checks in PAL mode, 5-49 Maintenance bus, 3-10 Maintenance bus controller, 3-10 MC Error Information Register 0, 6-8 MC Error Information Register 1, 6-9 MC_ERR0 Register, 6-8 MC_ERR1 Register, 6-9 MCHK 620 correctable error, 5-36 MCHK 630 correctable CPU error, 5-33 MCHK 660 IOD detected failure, 5-28 MCHK 670 CPU and IOD detected failure, 5-16 MCHK 670 CPU-detected failure, 5-11 MCHK 670 read dirty failure, 5-22 Memory addressing, 1-20 rules, 1-21 Memory errors corrected read data error, 5-44 read data substitute error, 5-44 Memory module variants, 1-19 Memory modules, 1-13, 1-18, 7-3 removal and replacement, 7-18 Memory operation, 1-19 Memory option configuration rules, 1-19 Memory pairs, 1-19 Memory tests, 2-14, 2-21 Memory, broken, 5-44 Modem, C-2 answer, C-23 dial-in procedure, C-4 hangup, C-23 phases of operation, C-22
L
LEDs troubleshooting with, 3-2 LEDs, fan and power in cabinet, 3-5 LFU exit command, A-23 starting, A-6, A-7 starting the utility, A-6 typical update procedure, A-7 update command, A-24 updating firmware from CD-ROM, A-8 updating firmware from floppy disk, A-12, A-14 updating firmware from network device, A-18 lfu command (LFU), A-15, A-17, A-22, A-23 LFU commands display, A-22, A-23 exit, A-11, A-17, A-21, A-22, A-23 help, A-22, A-23 lfu, A-15, A-17, A-22, A-23 list, A-9, A-15, A-17, A-19, A-21, A-22, A-24 readme, A-22, A-24 summary, A-22
N
Node IDs, 5-47 NVRAM, 2-3, 2-8, 7-27
O
Operator control panel removal and replacement cabinet system, 7-40 pedestal system, 7-42 os_type environment variable, SRM, 2-7, 2-23
P
Page table entry invalid error, 5-43 PALcode, 2-23 PALcode, described, 5-48 PCI Error Status Register 1, 6-14 PCI I/O subsystem, 1-26 PCI master abort, 5-43 PCI motherboard, 1-27 removal and replacement, 7-26 PCI parity error, 5-43 PCI system error, 5-43 PCI/EISA option removal and replacement, 7-30 PCI_ERR Register, 6-14 Pedestal system, 1-6 PIO buffer overflow error (PIO_OVFL), 5-42 Power circuit and cover interlocks, 4-6 diagram, 4-6 failures, 4-7 Power configuration rules cabinet system, 4-10 pedestal system, 4-12, 4-13 Power control module, 1-13, 1-30 LED states, 3-9 removal and replacement, 7-20 Power control module features, 4-4 Power control module LEDs, 3-8 Power cords, internal, 7-4 Power faults, 4-9
Power harness removal and replacement, 7-34 Power problems at power-up, 3-7 Power supply, 1-32 fault protection, 4-3 outputs, 4-2 removal and replacement, 7-32 voltages, 4-3 Power system components, 7-4 poweroff command, RCM, C-10 poweron command, RCM, C-11 Power-up SROM and XSROM messages during, 2-19 Power-up display, 2-20 Power-up sequence, 2-4 Power-up/down sequence, 4-8 Processor determining primary, 2-21 Processor correctable error, 5-5 Processor machine checks, 5-5
Q
quit command, RCM, C-11
R
RAID Standalone Configuration Utility, running, A-5 RCM, C-1 command summary, C-6 dial-out alerts, C-15 entering and leaving command mode, C-5 modem usage, C-2 resetting to factory defaults, C-18 troubleshooting, C-19 typical dialout command, C-15 RCM commands ?, C-10 alert_clr, C-8 alert_dis, C-8 alert_ena, C-8 disable, C-9 enable, C-9
halt, C-10 hangup, C-9 help, C-10 poweroff, C-10 poweron, C-11 quit, C-11 reset, C-12 setesc, C-12 setpass, C-13 status, C-14 rcm_dialout command, C-15 readme command (LFU), A-22, A-24 Registers, 6-1 Remote console monitor. See RCM Remote console monitor module, 1-28 reset command, RCM, C-12
S
Safety guidelines, 7-1 Serial number, system, 7-24 restoring with set sys_serial_num, 7-25 Serial ports, 1-27 Serial terminal, 2-19 Server control module, 1-28 removal and replacement, 7-28 Server control module power, 7-5 set sys_serial_num command, 7-25 setesc command, RCM, C-12 setpass command, RCM, C-13 show power command (SRM), 1-33 Soft errors, categories of, 5-4 SRM commands show power, 1-33 SRM console, 1-11, 2-23 SROM, 2-21 defined, 2-4 errors, 2-11 power-up test flow, 2-8 tests, 2-10 Standard I/O, 1-28 status command, RCM, C-14 StorageWorks shelf removal and replacement, 7-56
sys_model_number environment variable, 7-27 sys_type environment variable, 7-27 System bus, 1-13, 1-22 System bus address parity error, 5-41 System bus ECC error, 5-39 System bus nonexistent address error, 5-40 System bus to PCI bus bridge module, 1-13, 1-24 System consoles, 1-10 System correctable errors, 5-5 System drawer, 1-2 components of, 1-2 FRU locations, 7-2 fully configured, 1-13 remote operation, C-1 System drawer exposure cabinet, 7-10 pedestal, 7-12 System drawer modules, 7-3 System machine checks, 5-5 System model number, displaying, 7-27 System motherboard, 1-14 removal and replacement, 7-24
T
Test command for entire system, 3-13 Test mem command, 3-15 Test pci command, 3-17 Troubleshooting failures at power-up, 3-7 IOD detected errors, 5-38 power problems, 3-6 using error logs, 5-2
U
update command (LFU), A-11, A-17, A-21, A-22, A-24 Updating firmware AlphaBIOS console, A-25 from AlphaBIOS console, A-6 from SRM console, A-6
Utility programs running from graphics monitor, A-2 running from serial terminal, A-3
X
XBUS, 1-27 XSROM defined, 2-4 errors, 2-15 power-up test flow, 2-12 tests, 2-13
V
verify command (LFU), A-22, A-24