Sun Fire X4600 and Sun Fire X4600 M2 Servers Diagnostics Guide
Copyright 2006 Sun Microsystems, Inc., 4150 Network Circle, Santa Clara, California 95054, U.S.A. All rights reserved.
Sun Microsystems, Inc. has intellectual property rights relating to technology that is described in this document. In particular, and without limitation, these intellectual property rights may include one or more of the U.S. patents listed at http://www.sun.com/patents and one or more additional patents or pending patent applications in the U.S. and in other countries.
This product or document is protected by copyright and distributed under licenses restricting its use, copying, distribution, and decompilation. No part of this product or document may be reproduced in any form by any means without prior written authorization of Sun and its licensors, if any.
Third-party software, including font technology, is copyrighted and licensed from Sun suppliers.
Parts of the product may be derived from Berkeley BSD systems, licensed from the University of California. UNIX is a registered trademark in the U.S. and in other countries, exclusively licensed through X/Open Company, Ltd.
Sun, Sun Microsystems, the Sun logo, Java, AnswerBook2, docs.sun.com, Sun Fire, SunVTS, and Solaris are trademarks or registered trademarks of Sun Microsystems, Inc. in the U.S. and in other countries.
All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. in the U.S. and in other countries. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc.
AMD Opteron is a trademark or registered trademark of Advanced Micro Devices, Inc.
The OPEN LOOK and Sun™ Graphical User Interface was developed by Sun Microsystems, Inc. for its users and licensees. Sun acknowledges the pioneering efforts of Xerox in researching and developing the concept of visual or graphical user interfaces for the computer industry. Sun holds a non-exclusive license from Xerox to the Xerox Graphical User Interface, which license also covers Sun's licensees who implement OPEN LOOK GUIs and otherwise comply with Sun's written license agreements.
THE DOCUMENTATION IS PROVIDED "AS IS" AND ALL OTHER EXPRESS OR IMPLIED CONDITIONS, REPRESENTATIONS, AND WARRANTIES ARE EXPRESSLY EXCLUDED TO THE EXTENT PERMITTED BY APPLICABLE LAW, INCLUDING IN PARTICULAR ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR NON-INFRINGEMENT.
Contents
Preface xi
SunVTS Documentation 18
Diagnosing Server Problems With the Bootable Diagnostics CD 18
Requirements 18
Using the Bootable Diagnostics CD 19
E. Error Handling 75
Handling of Uncorrectable Errors 75
Handling of Correctable Errors 78
vi Sun Fire X4600/X4600 M2 Servers Diagnostics Guide • October 2006
Figures
FIGURE 1-3 Sun Fire X4600 Designation of DIMM Slots on CPU Modules 1–11
FIGURE 1-4 Sun Fire X4600 M2 Designation of DIMM Slots on CPU Modules 1–12
FIGURE A-1 Sun Fire X4600 BIOS Advanced Settings Menu Main Screen A–22
FIGURE A-2 Sun Fire X4600 M2 BIOS Advanced Settings Menu Main Screen A–22
FIGURE A-3 BIOS Advanced Menu, Event Logging Details Screen A–23
FIGURE A-4 BIOS Advanced Menu, IPMI 2.0 Configuration Screen A–24
FIGURE A-6 BIOS Boot Menu, Boot Settings Configuration Screen A–28
FIGURE B-1 Sun Fire X4600/X4600 M2 Server Front Panel LEDs B–38
FIGURE B-2 Sun Fire X4600/X4600 M2 Server Back Panel LEDs B–40
FIGURE B-4 LED and Button Locations on the Sun Fire X4600/X4600 M2 CPU Module B–42
FIGURE B-5 Sun Fire X4600/X4600 M2 Service Processor Board Power Status LED Location B–43
FIGURE E-2 DMI Log Screen, Correctable Error E–78
FIGURE E-3 DMI Log Screen, Correctable Error, Memory Decreased E–79
Preface
The Sun Fire™ X4600 and Sun Fire X4600 M2 Servers Diagnostics Guide contains
information and procedures for using available tools to diagnose problems with the
servers.
Note – The information in this chapter applies to the original Sun Fire X4600 server,
and to the Sun Fire X4600 M2 server, unless otherwise noted in the text.
Related Documentation
For a description of the document set for the Sun Fire X4600/X4600 M2 servers, see
the Where To Find Documentation sheet that is packed with your system and also
posted at the product's documentation site. See the following URLs:
http://www.sun.com/products-n-solutions/hardware/docs/Servers/x64_servers/x4600/index.html
http://www.sun.com/products-n-solutions/hardware/docs/Servers/x64_servers/x4600m2/index.html
Translated versions of some of these documents are available in French, Simplified
Chinese, Traditional Chinese, Korean, and Japanese at the web sites listed above.
English documentation is revised more frequently and might be more up-to-date
than the translated documentation.
http://www.sun.com/documentation
For Solaris™ and other software documentation, see the following URL:
http://docs.sun.com
Web Sites
Sun™ is not responsible for the availability of third-party web sites mentioned in this
document. Sun does not endorse and is not responsible or liable for any content,
advertising, products, or other materials that are available on or through such sites
or resources. Sun will not be responsible or liable for any actual or alleged damage
or loss caused by or in connection with the use of or reliance on any such content,
goods, or services that are available on or through such sites or resources.
Sun Welcomes Your Comments
Sun is interested in improving its documentation and welcomes your comments and
suggestions. You can submit your comments by going to:
http://www.sun.com/hwdocs/feedback
Please include the title and part number of your document with your feedback:
Note – The information in this chapter applies to the original Sun Fire X4600 server,
and to the Sun Fire X4600 M2 server, unless otherwise noted in the text.
To perform this task                                   Refer to these sections
View BIOS event logs and POST messages.                “Viewing Event Logs” on page 21 and “Power-On Self-Test (POST)” on page 25
View service processor logs and sensor information.    “Using the ILOM Service Processor GUI to View System Information” on page 45
View service processor logs and sensor information.    “Using IPMItool to View System Information” on page 57
3. Take note of the results of any change you make. Include any errors or
informational messages.
4. Check for potential device conflicts before you add a new device.
1. Check that AC power cords are attached firmly to the server’s power supplies and
to the AC sources.
1. Inspect the external status indicator LEDs, which can indicate component
malfunction.
For the LED locations and descriptions of their behavior, see “External Status
Indicator LEDs” on page 37.
2. Verify that nothing in the server environment is blocking airflow or making
contact with the server in a way that could short out power.
3. If the problem is not evident, continue with the next section, “Internally Inspecting
the Server” on page 5.
1. Choose a method for shutting down the server from main power mode to standby
power mode.
■ Graceful shutdown – Use a ballpoint pen or other stylus to press and release the
Power button on the front panel. This causes Advanced Configuration and Power
Interface (ACPI) enabled operating systems to perform an orderly shutdown of
the operating system. Servers not running ACPI-enabled operating systems will
shut down to standby power mode immediately.
■ Emergency shutdown – Use a ballpoint pen or other stylus to press and hold the
Power button for four seconds to force main power off and enter standby power
mode.
When main power is off, the Power/OK LED on the front panel will begin flashing,
indicating that the server is in standby power mode.
Caution – When you use the Power button to enter standby power mode, power is
still supplied to the service processor board and the power supply fans; the flashing
Power/OK LED indicates this state. To completely power off the server, you must
disconnect the AC power cords from the back panel of the server.
3. Inspect the internal status indicator LEDs, which can indicate component
malfunction.
For the LED locations and descriptions of their behavior, see “Internal Status
Indicator LEDs” on page 42.
Note – The server must be in standby power mode for viewing the internal LEDs.
Note – You can hold down the Locate button on the server back panel or front panel
for 5 seconds to initiate a “push-to-test” mode that illuminates all other LEDs both
inside and outside of the chassis for 15 seconds.
5. Verify that all cable connectors inside the system are firmly and correctly attached
to their appropriate connectors.
7. Check that the installed DIMMs comply with the supported DIMM population
rules and configurations, as described in “Troubleshooting DIMM Problems” on
page 7.
9. To restore main power mode to the server (all components powered on), use a
ballpoint pen or other stylus to press and release the Power button on the server
front panel. See FIGURE 1-2.
When main power is applied to the full server, the Power/OK LED next to the
Power button lights and remains lit.
10. If the problem with the server is not evident, you can try viewing the power-on
self-test (POST) messages and BIOS event logs during system startup. Continue
with “Viewing Event Logs” on page 21.
Note – For information on Sun’s DIMM replacement policy for x64 servers, contact
your Sun Service representative.
1. When a UCE occurs, the memory controller causes an immediate reboot of the
system.
2. During reboot, the BIOS checks the Machine Check registers of the NorthBridge
memory controller, determines that the previous reboot was caused by a UCE, and
then reports this in POST after the memtest stage:
A Hypertransport Sync Flood occurred on last boot
3. The BIOS also logs this event in the service processor’s system event log (SEL), as
shown in the sample IPMItool output below:
# ipmitool -H 10.6.77.249 -U root -P changeme -I lanplus sel list
f000 | 02/16/2006 | 03:32:38 | OEM #0x12 |
f100 | OEM record e0 | 00000000040f0c0200200000a2
f200 | OEM record e0 | 01000000040000000000000000
f300 | 02/16/2006 | 03:32:50 | Memory | Uncorrectable ECC | CPU 1 DIMM 0
f400 | 02/16/2006 | 03:32:50 | Memory | Memory Device Disabled | CPU 1 DIMM 0
f500 | 02/16/2006 | 03:32:55 | System Firmware Progress | Motherboard initialization
f600 | 02/16/2006 | 03:32:55 | System Firmware Progress | Video initialization
f700 | 02/16/2006 | 03:33:01 | System Firmware Progress | USB resource configuration
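When you collect the SEL remotely, it can help to filter the output for memory records. The following Python sketch runs the same IPMItool command shown above and extracts the Memory entries. It assumes the pipe-separated column layout of the sample output (record ID, date, time, sensor type, event, location); other ipmitool versions may print additional columns, so check the field positions against your own output before relying on them.

```python
import subprocess
from typing import List, Tuple


def read_sel(host: str, user: str, password: str) -> str:
    """Run `ipmitool ... sel list` over the lanplus interface and return its text output."""
    cmd = ["ipmitool", "-H", host, "-U", user, "-P", password,
           "-I", "lanplus", "sel", "list"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def parse_sel_list(text: str) -> List[List[str]]:
    """Split each non-blank SEL line into its pipe-separated, whitespace-trimmed fields."""
    records = []
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if fields[0]:  # skip blank lines
            records.append(fields)
    return records


def memory_events(text: str) -> List[Tuple[str, str]]:
    """Return (event, location) pairs for timestamped records whose sensor type is Memory."""
    events = []
    for fields in parse_sel_list(text):
        # Assumed layout: id | date | time | sensor type | event | location
        if len(fields) >= 6 and fields[3] == "Memory":
            events.append((fields[4], fields[5]))
    return events
```

For the sample log above, `memory_events` returns two entries for CPU 1 DIMM 0: Uncorrectable ECC and Memory Device Disabled. Against a live SP you would call it as `memory_events(read_sel("10.6.77.249", "root", "changeme"))`.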
2. You must open Event Viewer manually to view the errors. Access Event
Viewer through this menu path:
Start-->Administrative Tools-->Event Viewer
3. You can then open individual errors (listed by time) to see their details.
■ Solaris:
There is no reporting of CEs in Solaris x86 at this time.
■ Linux:
There is no reporting of CEs in the Linux distributions that Sun supports on this
server at this time.
Note – The Sun Fire X4600 and the Sun Fire X4600 M2 servers have slightly
different CPU modules. The visible difference is that the Sun Fire X4600 CPU
modules have DIMM slots in alternating white and black, while the Sun Fire X4600
M2 has two white DIMM slots adjacent to each other, and two black slots adjacent to
each other. See FIGURE 1-3 and FIGURE 1-4 for the locations of the DIMMs and of the
fault LEDs on the CPU module.
The DIMM ejector levers contain LEDs that can indicate a faulty DIMM.
■ DIMM fault LED is off – The DIMM is operating properly.
■ DIMM fault LED is on (amber) – At least one of the DIMMs in this DIMM pair is
faulty and should be replaced.
The system designation of the DIMM slots on each Sun Fire X4600 CPU module is
shown in FIGURE 1-3.
[Figure labels: DIMM0, DIMM2, DIMM1, DIMM3; Fault Remind button; CPU fault LED CR8]
FIGURE 1-3 Sun Fire X4600 Designation of DIMM Slots on CPU Modules
[Figure labels: DIMM0, DIMM1, DIMM2, DIMM3; Fault Remind button; CPU fault LED CR8]
FIGURE 1-4 Sun Fire X4600 M2 Designation of DIMM Slots on CPU Modules
The system designation of the DIMM slots on each Sun Fire X4600 M2 CPU module
is shown in FIGURE 1-4.
Note – The original Sun Fire X4600 servers use only DDR1 DIMMs. The Sun Fire
X4600 M2 servers use only DDR2 DIMMs.
Sun Fire X4600 (DDR1 DIMMs):
DIMM0   DIMM2   DIMM1   DIMM3   Total per CPU module
1 GB    0       1 GB    0       2 GB
1 GB    1 GB    1 GB    1 GB    4 GB
2 GB    1 GB    2 GB    1 GB    6 GB
2 GB    0       2 GB    0       4 GB
2 GB    2 GB    2 GB    2 GB    8 GB

Sun Fire X4600 M2 (DDR2 DIMMs):
DIMM0   DIMM1   DIMM2   DIMM3   Total per CPU module
1 GB    1 GB    0       0       2 GB
1 GB    1 GB    1 GB    1 GB    4 GB
2 GB    2 GB    1 GB    1 GB    6 GB
4 GB    4 GB    1 GB    1 GB    10 GB
2 GB    2 GB    0       0       4 GB
2 GB    2 GB    2 GB    2 GB    8 GB
4 GB    4 GB    2 GB    2 GB    12 GB
4 GB    4 GB    0       0       8 GB
4 GB    4 GB    4 GB    4 GB    16 GB
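To check an installed CPU module against the configurations listed above, you can encode the rows as data. This Python sketch is illustrative only: the slot order of each column is an assumption taken from the slot layouts in FIGURE 1-3 and FIGURE 1-4, and the rows are transcribed from this section rather than from the complete DIMM population rules.

```python
from typing import Dict

# Column order of each table row, assumed from the slot order in FIGURE 1-3 and FIGURE 1-4.
COLUMNS = {
    "X4600": ("DIMM0", "DIMM2", "DIMM1", "DIMM3"),
    "X4600 M2": ("DIMM0", "DIMM1", "DIMM2", "DIMM3"),
}

# DIMM sizes in GB, one tuple per listed row (0 = empty slot).
LISTED_ROWS = {
    "X4600": {(1, 0, 1, 0), (1, 1, 1, 1), (2, 1, 2, 1), (2, 0, 2, 0), (2, 2, 2, 2)},
    "X4600 M2": {(1, 1, 0, 0), (1, 1, 1, 1), (2, 2, 1, 1), (4, 4, 1, 1), (2, 2, 0, 0),
                 (2, 2, 2, 2), (4, 4, 2, 2), (4, 4, 0, 0), (4, 4, 4, 4)},
}


def is_listed(model: str, installed: Dict[str, int]) -> bool:
    """True if the module layout (slot name -> size in GB) matches a listed row."""
    row = tuple(installed.get(slot, 0) for slot in COLUMNS[model])
    return row in LISTED_ROWS[model]


def module_total_gb(installed: Dict[str, int]) -> int:
    """Total memory on one CPU module, in GB."""
    return sum(installed.values())
```

For example, two 1-GB DIMMs in DIMM0 and DIMM1 on a Sun Fire X4600 module match the first listed row, while 1-GB DIMMs in DIMM0 and DIMM2 (two different pairs) do not.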
In this example, the log file reports an error with the DIMM in CPU0, slot 1. The
fault LEDs on CPU0, slots 1 and 0 are lit.
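The example implies that the fault LEDs identify a DIMM pair rather than a single slot, with slots 0 and 1 forming one pair and slots 2 and 3 the other (the same pairing used when DIMMs are moved to slots 1 and 0 or 3 and 2 of another module). A small Python sketch of that mapping, assuming this pairing holds on every CPU module:

```python
from typing import List, Tuple


def dimm_pair(slot: int) -> Tuple[int, int]:
    """Return the two slots of the pair that contains `slot` (pairs 0/1 and 2/3)."""
    if slot not in range(4):
        raise ValueError("CPU module DIMM slots are numbered 0 to 3")
    first = slot & ~1  # clear the low bit: 0,1 -> 0 and 2,3 -> 2
    return (first, first + 1)


def expected_fault_leds(cpu: int, slot: int) -> List[str]:
    """Fault LEDs expected to light when the log reports an error in `slot` on `cpu`."""
    return [f"CPU{cpu}, slot {s}" for s in dimm_pair(slot)]
```

`expected_fault_leds(0, 1)` returns the LEDs for CPU0, slots 0 and 1, which matches the example.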
2. Inspect the installed DIMMs to ensure that they comply with the “DIMM
Population Rules” on page 12.
3. Inspect the fault LEDs on the DIMM slot ejectors and the CPU fault LED on the
CPU module. See FIGURE 1-3.
If any of these LEDs are lit, they indicate which component has the fault.
7. Visually inspect the DIMMs for physical damage, dust, or any other
contamination on the connector or circuits.
8. Visually inspect the DIMM slot for physical damage. Look for cracked or broken
plastic on the slot.
9. Dust off the DIMMs, clean the contacts, and reseat them.
10. If there is no obvious damage, exchange the individual DIMMs between the two
slots of a given pair. Ensure that they are inserted correctly with ejector latches
secured. Using the slot numbers from the example:
11. Reinstall the CPU module that has the DIMM problem.
Refer to the Sun Fire X4600 and Sun Fire X4600 M2 Servers Service Manual, 819-4342.
13. Power on the server and run the diagnostics test again.
15. Shut down the server again and disconnect the AC power cords.
16. Remove the CPU module that has the DIMM problem, and remove another CPU
module that does not indicate a DIMM problem.
Refer to the Sun Fire X4600 and Sun Fire X4600 M2 Servers Service Manual, 819-4342.
17. Remove both DIMMs of the pair and install them into paired slots on the second
CPU module that did not indicate a DIMM problem.
Using the slot numbers in the example, install the two DIMMs from CPU0, slots 1
and 0 into CPU1, slots 1 and 0 or CPU1, slots 3 and 2.
20. Power on the server and run the diagnostics test again.
This chapter contains information about SunVTS, a diagnostic software tool that
you can use to test and validate the server hardware.
Note – The information in this chapter applies to the original Sun Fire X4600 server,
and to the Sun Fire X4600 M2 server, unless otherwise noted in the text.
SunVTS is the Sun Validation Test Suite, which provides a comprehensive diagnostic
tool that tests and validates Sun hardware by verifying the connectivity and
functionality of most hardware controllers and devices on Sun platforms. SunVTS
software can be tailored with modifiable test instances and processor affinity
features.
Only the following tests are supported on x86/x64 platforms. The current x86/x64
support is for the 32-bit operating system only.
■ CD DVD Test (cddvdtest)
■ CPU Test (cputest)
■ Disk and Diskette Drives Test (disktest)
■ Data Translation Look-Aside Buffer (dtlbtest)
■ Floating Point Unit Test (fputest)
■ Network Hardware Test (nettest)
■ Ethernet Loopback Test (netlbtest)
■ Physical Memory Test (pmemtest)
■ Serial Port Test (serialtest)
■ System Test (systest)
■ Universal Serial Bus Test (usbtest)
■ Virtual Memory Test (vmemtest)
SunVTS software has a sophisticated graphical user interface (GUI) that provides
test configuration and status monitoring. The user interface can be run on one
system to display the SunVTS testing of another system on the network. SunVTS
software also provides a TTY-mode interface for situations in which running a GUI
is not possible.
SunVTS Documentation
For the most up-to-date information on SunVTS software, go to:
http://docs.sun.com/app/docs/coll/1140.2
Requirements
■ To use the Sun Fire X4600 Server Bootable Diagnostics CD, you must have a
keyboard, mouse, and monitor attached to the server on which you are
performing diagnostics.
1. With the server powered on, insert the Sun Fire X4600 Server Bootable
Diagnostics CD (705-1439) into the DVD-ROM drive.
2. Reboot the server, and press F2 early in the reboot so that you can change the
BIOS setting for boot-device priority.
3. When the BIOS Main menu appears, navigate to the BIOS Boot menu.
Instructions for navigating within the BIOS screens are printed on the BIOS screens.
8. In the SunVTS GUI, press Enter or click the Start button when you are prompted
to start the tests.
The test suite will run until it encounters an error or the test is completed.
9. When SunVTS software completes the test, review the log files generated during
the test.
SunVTS provides access to four different log files:
■ SunVTS test error log contains time-stamped SunVTS test error messages. The log
file path name is /var/opt/SUNWvts/logs/sunvts.err. This file is not
created until a SunVTS test failure occurs.
■ SunVTS kernel error log contains time-stamped SunVTS kernel and SunVTS
probe errors. SunVTS kernel errors are errors that relate to running SunVTS, and
not to testing of devices. The log file path name is
/var/opt/SUNWvts/logs/vtsk.err. This file is not created until SunVTS
reports a SunVTS kernel error.
■ SunVTS information log contains informative messages that are generated when
you start and stop the SunVTS test sessions. The log file path name is
/var/opt/SUNWvts/logs/sunvts.info. This file is not created until a
SunVTS test session runs.
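Because each of these logs is created only when its first event occurs, a missing file is itself informative. A minimal Python sketch that reports which of the three logs exist, assuming it runs on a system that can read /var/opt/SUNWvts/logs (or a saved copy of that directory):

```python
from pathlib import Path
from typing import Dict

DEFAULT_LOG_DIR = Path("/var/opt/SUNWvts/logs")

# What the presence of each file indicates, per the log descriptions above.
LOG_MEANING = {
    "sunvts.err": "at least one SunVTS test failure was logged",
    "vtsk.err": "at least one SunVTS kernel or probe error was logged",
    "sunvts.info": "a SunVTS test session has run",
}


def present_logs(log_dir: Path = DEFAULT_LOG_DIR) -> Dict[str, str]:
    """Map each SunVTS log file found in log_dir to what its presence indicates."""
    log_dir = Path(log_dir)
    return {name: meaning for name, meaning in LOG_MEANING.items()
            if (log_dir / name).is_file()}
```

If only sunvts.info is present after a session, no test failure or SunVTS kernel error has been logged.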
b. Specify the log file that you want to view by selecting it from the Log file
window.
The content of the selected log file is displayed in the window.
c. Use the three buttons at the bottom of the window to perform the following actions:
■ Print the log file – A dialog box appears for you to specify your printer
options and printer name.
■ Delete the log file – The file remains displayed, but will be gone the next time
you try to display it.
■ Close the Log file window – The window is closed.
Note – To keep the log files, you must save them to another networked system or to
a removable media device. When you use the Bootable Diagnostics CD, the server
boots from the CD, so the test log files are not on the server’s hard disk drive and
are deleted when you power cycle the server.
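One way to keep the logs is to copy the whole log directory to removable media before you power cycle the server. A Python sketch, assuming a Python interpreter is available in the diagnostic environment and that the destination (for example a hypothetical mount point such as /mnt/usb) is writable:

```python
import shutil
from pathlib import Path
from typing import List


def save_logs(dest: str, log_dir: str = "/var/opt/SUNWvts/logs") -> List[str]:
    """Copy every regular file in log_dir into dest and return the copied file names."""
    target = Path(dest)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(Path(log_dir).iterdir()):
        if src.is_file():
            shutil.copy2(src, target / src.name)  # copy2 also preserves file timestamps
            copied.append(src.name)
    return copied
```

For example, `save_logs("/mnt/usb/vts-logs")` copies whichever log files exist at that moment.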
This appendix contains information about the BIOS event log, the BMC system event
log, the power-on self test (POST), and console redirection.
Note – The information in this appendix applies to the original Sun Fire X4600
server, and to the Sun Fire X4600 M2 server, unless otherwise noted in the text.
Note – Some of the Event Log screens for the Sun Fire X4600 M2 server differ from
those for the Sun Fire X4600 server. Where there are differences, the figures below
show both screens.
1. To turn on main power mode (all components powered on), use a ballpoint pen
or other stylus to press and release the Power button on the server front panel. See
FIGURE 1-2.
When main power is applied to the full server, the Power/OK LED next to the
Power button lights and remains lit.
2. Enter the BIOS Setup utility by pressing the F2 key while the system is
performing the power-on self-test (POST).
The BIOS Main menu screen is displayed.
Sun Fire X4600 Server Screen
Main Advanced PCIPnP Boot Security Chipset Exit
********************************************************************************
* Advanced Settings * Options for CPU *
* *************************************************** * *
* WARNING: Setting wrong values in below sections * *
* may cause system to malfunction. * *
* * *
* * CPU Configuration * *
* * IDE Configuration * *
* * SuperIO Configuration * *
* * ACPI Configuration * *
* * Event Log Configuration * *
* * Hyper Transport Configuration * *
* * IPMI 2.0 Configuration * *
* * MPS Configuration * ** Select Screen *
* * PCI express Configuration * *
* * AMD PowerNow Configuration * ** Select Item *
* * Remote Access Configuration * Enter Go to Sub Screen *
* * USB Configuration * F1 General Help *
* * F10 Save and Exit *
* * ESC Exit *
********************************************************************************
FIGURE A-1 Sun Fire X4600 BIOS Advanced Settings Menu Main Screen
FIGURE A-2 Sun Fire X4600 M2 BIOS Advanced Settings Menu Main Screen
c. From the Event Logging Details screen, select View Event Log.
All unread events are displayed.
Advanced
********************************************************************************
* IPMI 2.0 Configuration * View all events in the *
* *************************************************** * BMC Event Log. *
* Status Of BMC Working * *
* * View BMC System Event Log * It will take up to *
* Reload BMC System Event Log * 60 Seconds approx. *
* Clear BMC System Event Log * to read all *
* * LAN Configuration * BMC SEL records. *
* * PEF Configuration * *
* BMC Watch Dog Timer Action [Disabled] * *
* * *
* * *
* * *
* * ** Select Screen *
* * ** Select Item *
* * Enter Go to Sub Screen *
* * F1 General Help *
* * F10 Save and Exit *
* * ESC Exit *
* * *
* * *
********************************************************************************
c. From the IPMI 2.0 Configuration screen, select View BMC System Event Log.
The log takes about 60 seconds to generate, then it is displayed on the screen.
5. If the problem with the server is not evident, continue with “Using the ILOM
Service Processor GUI to View System Information” on page 45, or “Using IPMItool
to View System Information” on page 57.
The progress of the self-test is indicated by a series of POST codes. These codes are
displayed at the bottom right corner of the system’s VGA screen once the self-test
has progressed far enough to initialize the system video. However, as the self-test
runs, the codes scroll off the screen too quickly to be read. An alternate method of
displaying the POST codes is to redirect the output of the console to a serial port
(see “Redirecting Console Output” on page 26).
1. The first megabyte of DRAM is tested by the BIOS before the BIOS code is
shadowed (that is, copied from ROM to DRAM).
2. Once executing out of DRAM, the BIOS performs a simple memory test (a
write/read of every location with the pattern 55aa55aa).
Note – This memory test is performed only if Quick Boot is not enabled from the
Boot Settings Configuration screen. Enabling Quick Boot causes the BIOS to skip the
memory test. See “Changing POST Options” on page 27 for more information.
Note – Because the Sun Fire X4600 server can contain up to 64 GB of memory, the
memory test can take several minutes. You can escape from POST testing by
pressing any key during POST.
3. The BIOS polls the memory controllers for both correctable and uncorrectable
memory errors and logs those errors into the service processor.
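Step 2 describes a write/read pattern test. The Python sketch below models it on simulated memory so the logic can be followed; it is not the BIOS implementation. The `StuckBitMemory` class is a hypothetical fault model used only to show how a mismatch is detected.

```python
from typing import List

PATTERN = bytes.fromhex("55aa55aa")


class Memory:
    """Fault-free simulated memory."""

    def __init__(self, size: int) -> None:
        self._cells = bytearray(size)

    def __len__(self) -> int:
        return len(self._cells)

    def write(self, offset: int, data: bytes) -> None:
        self._cells[offset:offset + len(data)] = data

    def read(self, offset: int, length: int) -> bytes:
        return bytes(self._cells[offset:offset + length])


class StuckBitMemory(Memory):
    """Hypothetical fault model: one bit of one byte always reads back as 0."""

    def __init__(self, size: int, bad_byte: int, bad_bit: int) -> None:
        super().__init__(size)
        self._bad_byte = bad_byte
        self._mask = ~(1 << bad_bit) & 0xFF

    def read(self, offset: int, length: int) -> bytes:
        data = bytearray(super().read(offset, length))
        if offset <= self._bad_byte < offset + length:
            data[self._bad_byte - offset] &= self._mask
        return bytes(data)


def pattern_test(mem: Memory) -> List[int]:
    """Write 55aa55aa to every 4-byte word, read each word back, return failing offsets."""
    words = range(0, len(mem) - len(mem) % 4, 4)
    for offset in words:
        mem.write(offset, PATTERN)
    return [offset for offset in words if mem.read(offset, 4) != PATTERN]
```

A single fixed pattern detects a stuck bit only when the pattern writes the opposite value to that bit. For example, `StuckBitMemory(64, bad_byte=8, bad_bit=0)` fails at offset 8 because 0x55 has bit 0 set, while `StuckBitMemory(64, bad_byte=8, bad_bit=1)` passes because 0x55 already has bit 1 cleared.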
1. Initialize the BIOS Setup utility by pressing the F2 key while the system is
performing the power-on self-test (POST).
The BIOS Main menu screen is displayed.
a. Select the IP Assignment option that you want to use (DHCP or Static).
■ If you choose DHCP, the server’s IP address is retrieved from your network’s
DHCP server and displayed using the following format:
Current IP address in BMC : xxx.xxx.xxx.xxx
■ If you choose Static to assign the IP address manually, perform the following
steps:
iii. Select Refresh and press Return to see your new settings displayed in the
Current IP address in BMC field.
6. Start a web browser and type the service processor’s IP address in the browser’s
URL field.
7. When you are prompted for a user name and password, type the following:
■ User Name: root
■ Password: changeme
The Sun Integrated Lights Out Manager main GUI screen is displayed.
12. When you are prompted for a user name and password, type the following:
■ User Name: root
■ Password: changeme
1. Initialize the BIOS Setup utility by pressing the F2 key while the system is
performing the power-on self-test (POST).
The BIOS Main menu screen is displayed.
2. Select Boot.
The Boot Settings screen is displayed.
Boot
********************************************************************************
* Boot Settings Configuration * Allows BIOS to skip *
* *************************************************** * certain tests while *
* Quick Boot [Disabled] * booting. This will *
* System Configuration Display [Disabled] * decrease the time *
* Quiet Boot [Disabled] * needed to boot the *
* Language [English] * system. *
* AddOn ROM Display Mode [Force BIOS] * *
* Bootup Num-Lock [On] * *
* Wait For 'F1' If Error [Disabled] * *
* Interrupt 19 Capture [Disabled] * *
* Default Boot Order [CRHB] * *
* * *
* * ** Select Screen *
* * ** Select Item *
* * +- Change Option *
* * F1 General Help *
* * F10 Save and Exit *
* * ESC Exit *
* * *
********************************************************************************
4. On the Boot Settings Configuration screen, there are several options that you can
enable or disable:
■ Quick Boot – This option is disabled by default. If you enable this, the BIOS skips
certain tests while booting, such as the extensive memory test. This decreases the
time it takes for the system to boot.
■ System Configuration Display – This option is disabled by default. If you enable
this, the System Configuration screen is displayed before booting begins.
■ Quiet Boot – This option is disabled by default. If you enable this, the Sun
Microsystems logo is displayed instead of POST codes.
■ Language – This option is reserved for future use. Do not change.
■ Add On ROM Display Mode – This option is set to Force BIOS by default. It takes
effect only if you have also enabled the Quiet Boot option, and it controls whether
output from the Option ROM is displayed. The two settings for this option are as
follows:
■ Force BIOS – Remove the Sun logo and display Option ROM output.
■ Keep Current – Do not remove the Sun logo. The Option ROM output is not
displayed.
This appendix contains information about the locations and behaviors of the status
and fault LEDs on the server. The information is organized to describe external LEDs
that can be viewed on the outside of the server and internal LEDs that can be viewed
only with the main cover removed.
Note – The information in this appendix applies to the original Sun Fire X4600
server, and to the Sun Fire X4600 M2 server, unless otherwise noted in the text.
[Figure callouts: Locate button/LED; Service action required LED; Power/OK LED; Hard disk drive status indicator LEDs; Power button; Front fan fault LED; Power supply fault LED; System overheat fault LED]
Locate button/LED – This LED helps you identify which system you are working on
in a rack full of servers.
• Push and release this button to make the Locate LED blink for 30 minutes.
• Hold down the button for 5 seconds to initiate a “push-to-test” mode that
illuminates all other LEDs both inside and outside of the chassis for 15 seconds.
Service Action Required LED – This LED has two states:
• Off – Normal operation.
• Slow Blinking – An event that requires a service action has been detected. It also
blinks when only one power supply is plugged in.
Power/OK LED – This LED has three states:
• Off – Server main power and standby power are off.
• Blinking – Server is in standby power mode, with AC power applied to only the
service processor board and the power supply fans.
• On – Server is in main power mode with AC power supplied to all components.
Front Fan Fault LED – This LED lights when a front cooling fan module has failed.
LEDs on the individual fan modules indicate which fan module has failed.
Power Supply Fault LED – This LED lights when:
• Two power supplies are present in the system but only one has AC power
connected. To clear this condition, either plug in the second power supply or
remove it from the chassis.
• Any voltage-related event occurs in the system. For CPU-related voltage errors,
the associated CPU Fault LED is also lit.
System Overheat Fault LED – This LED lights when an upper temperature limit is
detected.
Hard Disk Drive Status LEDs – Each hard disk drive has three LEDs:
• Top LED (blue) – Reserved for future use.
• Middle LED (amber) – Hard disk drive failed.
• Bottom LED (green) – Hard disk drive is operating properly.
Power Supply Status LEDs – Each power supply has one LED:
• LED is on (amber) – Power supply failed.
• LED is off – AC power to the power supply is operating properly.
10/100/1000 Gigabit Ethernet port LEDs (NET0 - NET3) – Each connector has two LEDs:
• Right side LED on (green) – Indicates link activity.
• Left side LED green – Link is established at 1 gigabit.
• Left side LED orange – Link is established at 10 or 100 megabits.
10/100 Ethernet management port (NET MGT) – The connector has two LEDs:
• Right side LED on (green) – Indicates link activity.
• Left side LED green – Link is established at 100 megabits.
• Left side LED orange – Link is established at 10 megabits.
Locate button/LED (same function as on the front panel) – This LED helps you
identify which system you are working on in a rack full of servers.
• Push and release this button to make the Locate LED blink for 30 minutes.
• Hold down the button for 5 seconds to initiate a “push-to-test” mode that
illuminates all other LEDs both inside and outside of the chassis for 15 seconds.
Service Action Required LED (same function as on the front panel) – This LED has
two states:
• Off – Normal operation.
• Slow Blinking – An event that requires a service action has been detected.
Power/OK LED (same function as on the front panel) – This LED has three states:
• Off – Server main power and standby power are off.
• Blinking – Server is in standby power mode, with AC power applied to only the
service processor board and the power supply fans.
• On – Server is in main power mode with AC power supplied to all components.
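The Ethernet port LED descriptions amount to a small lookup table. A Python sketch that decodes them using only the states listed above; any other combination is reported as undescribed rather than guessed:

```python
from typing import Tuple

# Left-side LED color -> link speed, from the port LED descriptions above.
DATA_PORT_SPEED = {"green": "1 Gb/s", "orange": "10 or 100 Mb/s"}
MGT_PORT_SPEED = {"green": "100 Mb/s", "orange": "10 Mb/s"}
DATA_PORTS = {"NET0", "NET1", "NET2", "NET3"}


def port_status(port: str, left_led: str, right_led_on: bool) -> Tuple[str, bool]:
    """Return (link speed, link activity) for a back-panel Ethernet port."""
    if port in DATA_PORTS:
        speeds = DATA_PORT_SPEED
    elif port == "NET MGT":
        speeds = MGT_PORT_SPEED
    else:
        raise ValueError(f"unknown port: {port!r}")
    speed = speeds.get(left_led.lower(), "state not described in this guide")
    return speed, right_led_on  # right-side green LED on means link activity
```

The same color means different speeds on the data ports and on NET MGT, which is why the port name selects the table.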
[Figure callouts: fan trays FT2, FT3, FT0, FT1; fault LED (one on each fan tray); front of server]
Fan tray fault LEDs. Each fan tray has one LED:
• LED is on (amber) – Fan tray failed.
• LED is off – Fan tray is operating properly.
DIMM and CPU fault LEDs on the CPU module provide further indications of
which component has a fault condition. A capacitor on the CPU module can keep
these CPU and DIMM fault LEDs lit for up to one minute, even after the CPU module
is removed from the server. To light the fault LEDs from the capacitor, push the
small button on the CPU module labeled “FAULT REMIND BUTTON.”
Note – FIGURE B-4 shows the Sun Fire X4600 CPU module, but the LEDs have the
same locations on the Sun Fire X4600 M2 CPU module.
[Figure labels: DIMM0, DIMM2, DIMM1, DIMM3; Fault Remind button; CPU fault LED CR8]
FIGURE B-4 LED and Button Locations on the Sun Fire X4600/X4600 M2 CPU Module
[Figure callouts: H G F E D C B A; FT2, FT3, FT0, FT1]
FIGURE B-5 Sun Fire X4600/X4600 M2 Service Processor Board Power Status LED
Location
This appendix contains information about using the Integrated Lights Out Manager
(ILOM) service processor (SP) GUI to view monitoring and maintenance information
for your server.
Note – The information in this appendix applies to the original Sun Fire X4600
server, and to the Sun Fire X4600 M2 server, unless otherwise noted in the text.
For more information on using the ILOM SP GUI to maintain the server (for
example, configuring alerts), refer to the Integrated Lights Out Manager Administration
Guide, 819-1160.
■ If any of the logs or information screens indicate a DIMM error, see
“Troubleshooting DIMM Problems” on page 7 and “Isolating and Correcting
DIMM ECC Errors” on page 14.
■ If the problem with the server is not evident after viewing ILOM SP logs and
information, continue with “Running SunVTS Diagnostic Tests” on page 17.
Making a Serial Connection to the SP
To make a serial connection to the SP:
1. Connect a serial cable from the RJ-45 Serial Management port on your ILOM SP to
a terminal device.
Note – If you are connecting to the serial port on the SP before it has been powered
up or during its power-up sequence, you will see bootup messages displayed.
3. Log in to the SP with the default user name, root, and the default password, changeme.
After you log in, the SP displays its default command prompt:
->
Appendix C Using the ILOM Service Processor GUI to View System Information 47
3. Select a category of event that you want to view in the log from the drop-down list
box.
After you have selected a category of event, the Event Log table is updated with the
specified events. The fields in the Event Log are described in TABLE C-1.
Field Description
4. To clear the event log, click the Clear Event Log button.
A confirmation dialog box is displayed.
6. If the problem with the server is not evident after viewing ILOM SP logs and
information, continue with “Running SunVTS Diagnostic Tests” on page 17.
When the service processor reboots, the SP clock is set to Thu Jan 1 00:00:00 UTC
1970. The SP reboots as a result of the following:
■ A complete system unplug/replug power cycle
■ An IPMI command; for example, mc reset cold
■ A command-line interface (CLI) command; for example, reset /SP
■ An ILOM web GUI operation; for example, selecting Reset SP from the Maintenance tab
■ An SP firmware upgrade
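Because the SP clock restarts at Thu Jan 1 00:00:00 UTC 1970 after any of these events, SEL entries written before the clock is set again will likely carry 1970 dates. The following Python sketch flags such entries when you review a log. The helper name logged_before_clock_set() is hypothetical, and the MM/DD/YYYY date and HH:MM:SS time layout is assumed from the sample SEL output elsewhere in this guide.

```python
from datetime import datetime, timezone

# SP clock value after a reboot: Thu Jan 1 00:00:00 UTC 1970 (Unix time 0).
SP_CLOCK_AFTER_REBOOT = datetime.fromtimestamp(0, tz=timezone.utc)


def logged_before_clock_set(sel_date: str, sel_time: str) -> bool:
    """Return True if a SEL timestamp falls in 1970.

    Hypothetical helper: assumes MM/DD/YYYY dates and HH:MM:SS times,
    as in the sample SEL output shown in this guide.
    """
    stamp = datetime.strptime(f"{sel_date} {sel_time}", "%m/%d/%Y %H:%M:%S")
    return stamp.year == SP_CLOCK_AFTER_REBOOT.year


print(logged_before_clock_set("01/01/1970", "00:02:15"))  # True
print(logged_before_clock_set("08/26/2005", "05:04:04"))  # False
```

A 1970 date is only a hint that the entry predates clock synchronization; the event itself is still valid.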
Viewing Replaceable Component Information
Depending on the component you select, information about the manufacturer,
component name, serial number, and part number can be displayed. To view
replaceable component information:
4. If the problem with the server is not evident after viewing replaceable component
information, continue with “Running SunVTS Diagnostic Tests” on page 17.
Viewing Temperature, Voltage, and Fan Sensor Readings
This section describes how to view the server temperature, voltage, and fan sensor
readings.
A total of six temperature sensors are monitored. All of them generate IPMI events that are logged in the system event log (SEL) when an upper threshold is exceeded. Three of these sensor readings are also used to adjust the fan speeds and to perform other actions, such as illuminating LEDs and powering off the chassis. These three sensors and their thresholds are as follows:
■ Front panel ambient temperature (fp.t_amb)
■ Upper non-critical: 30 degrees C
■ Upper critical: 35 degrees C
■ Upper non-recoverable: 40 degrees C
■ CPU 0 (p0.t_core) and CPU 1 (p1.t_core) die temperatures
■ Upper non-critical: 55 degrees C
■ Upper critical: 65 degrees C
■ Upper non-recoverable: 75 degrees C
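The threshold list above can be expressed as a small lookup. This Python sketch, with a hypothetical helper classify_temperature(), reports the highest upper threshold that a reading has crossed. The values are the ones listed for fp.t_amb, p0.t_core, and p1.t_core; treating a reading at or above a threshold as crossing it is an assumption, not documented SP behavior.

```python
# Upper thresholds in degrees C, as listed above:
# (non-critical, critical, non-recoverable)
THRESHOLDS = {
    "fp.t_amb": (30, 35, 40),
    "p0.t_core": (55, 65, 75),
    "p1.t_core": (55, 65, 75),
}


def classify_temperature(sensor: str, reading_c: float) -> str:
    """Return the highest upper threshold crossed by a reading.

    Hypothetical helper: assumes a reading at or above a threshold
    counts as crossing it.
    """
    nc, cr, nr = THRESHOLDS[sensor]
    if reading_c >= nr:
        return "upper non-recoverable"
    if reading_c >= cr:
        return "upper critical"
    if reading_c >= nc:
        return "upper non-critical"
    return "ok"


print(classify_temperature("fp.t_amb", 22.0))   # ok
print(classify_temperature("p0.t_core", 66.0))  # upper critical
```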
3. Select the type of sensor readings that you want to view from the drop-down list.
You can select All Sensors, Temperature Sensors, Voltage Sensors, or Fan Sensors.
The sensor readings are displayed. The Sensor Readings fields are described in
TABLE C-2.
Field Description
Status Reports the status of the sensor, including State Asserted, State
Deasserted, Predictive Failure, Device Inserted/Device Present,
Device Removed/Device Absent, Unknown, and Normal.
Name Reports the name of the sensor. The names correspond to the
following components:
• sys: System or chassis
• bp: Back panel
• fp: Front panel
• mb: Motherboard
• io: I/O board
• p0: Processor 0
• p1: Processor 1
• ft0: Fan tray 0
• ft1: Fan tray 1
• pdb: Power distribution board
• ps0: Power supply 0
• ps1: Power supply 1
Reading Reports the rpm, temperature, and voltage measurements.
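Because each sensor name starts with one of the prefixes above, the component can be recovered from the name alone. The following Python sketch encodes the prefix list as a dictionary; the function name component_for() is hypothetical, and any prefix not listed in the table is reported as unknown rather than guessed.

```python
# Sensor-name prefixes from the Name field description above.
COMPONENTS = {
    "sys": "System or chassis",
    "bp": "Back panel",
    "fp": "Front panel",
    "mb": "Motherboard",
    "io": "I/O board",
    "p0": "Processor 0",
    "p1": "Processor 1",
    "ft0": "Fan tray 0",
    "ft1": "Fan tray 1",
    "pdb": "Power distribution board",
    "ps0": "Power supply 0",
    "ps1": "Power supply 1",
}


def component_for(sensor_name: str) -> str:
    """Map a sensor name such as 'fp.t_amb' to its component."""
    prefix = sensor_name.split(".", 1)[0]  # text before the first dot
    return COMPONENTS.get(prefix, "Unknown component")


print(component_for("fp.t_amb"))          # Front panel
print(component_for("ft1.fm2.f0.speed"))  # Fan tray 1
```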
4. Click the Refresh button to update the sensor readings to their current status.
5. Click the Show Thresholds button to display the settings that trigger alerts.
The Sensor Readings table is updated. See the example in FIGURE C-4.
For example, if system temperature reaches 30 degrees C, the service processor sends an alert. Sensor thresholds include the following:
■ Low/High NR: Low or high non-recoverable
■ Low/High CR: Low or high critical
■ Low/High NC: Low or high non-critical
7. If the problem with the server is not evident after viewing sensor readings
information, continue with “Running SunVTS Diagnostic Tests” on page 17.
56 Sun Fire X4600/X4600 M2 Servers Diagnostics Guide • October 2006
APPENDIX D
Caution – Although you can use IPMItool to view sensor and LED information, do
not use any interface other than the ILOM CLI or the WebGUI to alter the state or
configuration of any sensor or LED. Doing so could void your warranty.
Note – The information in this appendix applies to the original Sun Fire X4600
server, and to the Sun Fire X4600 M2 server, unless otherwise noted in the text.
About IPMI
IPMI is an open-standard hardware management interface specification that defines
a specific way for embedded management subsystems to communicate. IPMI
information is exchanged through baseboard management controllers (BMCs), which
are located on IPMI-compliant hardware components. Using low-level hardware
intelligence instead of the operating system has two main benefits: first, this
configuration allows for out-of-band server management, and second, the operating
system is not burdened with transporting system status data.
Your Service Processor (SP) is IPMI v2.0 compliant. You can access IPMI
functionality through the command line with the IPMItool utility either in-band or
out-of-band. Additionally, you can generate an IPMI-specific trap from the web
interface or manage the server's IPMI functions from any external management
solution that is IPMI v1.5 or v2.0 compliant. For more information about the IPMI
v2.0 specification, go to
http://www.intel.com/design/servers/ipmi/spec.htm#spec2
About IPMItool
IPMItool is included on the Sun Fire X4600 server Tools and Drivers CD (705-1438).
IPMItool is a simple command-line interface that is useful for managing IPMI-enabled devices. You can use this utility to perform IPMI functions with a kernel device driver or over a LAN interface. IPMItool enables you to manage system hardware components, monitor system health, and monitor and manage system environmentals, independent of the operating system.
IPMItool and its related documentation are on your Sun Fire X4600 Server Tools and Drivers CD. You can also download the tool from:
http://ipmitool.sourceforge.net/
Note – The example commands in this appendix use the default user name, root, and the default password, changeme. Type the user name and password that have been set for your server.
To configure an SSH key, supply the user ID and the location of the RSA or DSA public key to the ipmitool sunoem sshkey command. For example:
ipmitool -I lanplus -H <IPADDR> -U root -P changeme sunoem sshkey set 2 id_rsa.pub
You can also clear the key for a particular user. For example:
ipmitool -I lanplus -H <IPADDR> -U root -P changeme sunoem sshkey del 2
The five fields of each output line, read from left to right, are:
1. Sensor name
2. Sensor number
3. Status
4. Entity ID and instance
5. Sensor reading
For example:
fp.t_amb | 0Ah | ok | 12.0 | 22 degrees C
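If you save sdr elist output for later analysis, each line can be split on the pipe characters. This Python sketch assumes the layout of the sample lines in this appendix: the sensor name first, a hexadecimal sensor number with a trailing h second, the status third, the entity ID and instance (for example, 12.0) fourth, and the reading last. The names SdrRecord and parse_sdr_elist_line() are hypothetical.

```python
from typing import NamedTuple


class SdrRecord(NamedTuple):
    name: str
    number: int     # "0Ah" -> 10
    status: str
    entity_id: int  # "12.0" -> 12
    instance: int   # "12.0" -> 0
    reading: str    # may be empty, e.g. for discrete sensors


def parse_sdr_elist_line(line: str) -> SdrRecord:
    """Split one 'sdr elist' line into its five fields.

    Raises ValueError if the line does not have five fields.
    """
    name, number, status, entity, reading = (f.strip() for f in line.split("|", 4))
    entity_id, instance = entity.split(".")
    return SdrRecord(name, int(number.rstrip("hH"), 16), status,
                     int(entity_id), int(instance), reading)


rec = parse_sdr_elist_line("fp.t_amb | 0Ah | ok | 12.0 | 22 degrees C")
print(rec.name, rec.number, rec.entity_id, rec.reading)  # fp.t_amb 10 12 22 degrees C
```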
For example, to see only the temperature, voltage, and fan sensors, use the following command with the full argument:
ipmitool -I lanplus -H <IPADDR> -U root -P changeme sdr elist full
You can also generate a list of all sensors for a specific entity. Use the list output to determine which entity you are interested in, then use the sdr entity command to list all sensors for that entity. This command accepts an entity ID and an optional entity instance argument. If no entity instance is specified, all instances of that entity are displayed.
The entity ID is given in the fourth field of the output, as read from left to right. For
example, in the output shown in the previous example, all the fans are entity 29. The
last fan listed (29.5) is entity 29, with instance 5:
ft1.fm2.f0.speed | 48h | ok | 29.5 | 6000 RPM
For example, to see all fan-related sensors, you would use the following command
that uses the entity 29 argument.
ipmitool -I lanplus -H <IPADDR> -U root -P changeme sdr entity 29
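To apply the same selection offline to saved sdr elist output, you can filter on the fourth field. This Python sketch is not a replacement for sdr entity; it only mirrors the described behavior, where omitting the instance selects every instance of the entity. The function name lines_for_entity() is hypothetical, and the sample lines are taken from this appendix.

```python
from typing import Iterable, List, Optional


def lines_for_entity(lines: Iterable[str], entity_id: int,
                     instance: Optional[int] = None) -> List[str]:
    """Keep 'sdr elist' lines whose fourth field matches the entity.

    If instance is None, all instances of the entity are kept.
    """
    selected = []
    for line in lines:
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 4:
            continue  # not a sensor record
        ent, _, inst = fields[3].partition(".")  # "29.5" -> "29", "5"
        if ent != str(entity_id):
            continue
        if instance is not None and inst != str(instance):
            continue
        selected.append(line)
    return selected


sample = [
    "fp.t_amb | 0Ah | ok | 12.0 | 22 degrees C",
    "ft1.fm2.f0.speed | 48h | ok | 29.5 | 6000 RPM",
    "sys.id | 00h | ok | 23.0 | State Asserted",
]
print(lines_for_entity(sample, 29))  # only the ft1.fm2.f0.speed line
```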
Note – When you use this command, an event record gives a sensor number, but
does not display the name of the sensor for the event. For example, in line 100 in the
sample output above, the sensor number 0x16 is displayed. For information about
how to map sensor names to the different sensor number formats that might be
displayed, see “Sensor Numbers and Sensor Names in SEL Events” on page 67.
■ View the ILOM SP SEL with a detailed event output by using the sel elist
command instead of sel list. The sel elist command cross-references event
records with sensor data records to produce descriptive event output. It takes
longer to execute because it has to read from both the SEL and the Sensor Data Repository (SDR). For increased speed, generate an SDR cache before using the
sel elist command. See “Using the Sensor Data Repository (SDR) Cache” on
page 67. For example:
Certain qualifiers are available to refine and limit the SEL output. To see only the first NUM records, add the qualifier first NUM to the command; to see the last NUM records, add last NUM. For example, to see the last three records in the SEL, type the following command:
ipmitool -I lanplus -H <IPADDR> -U root -P changeme sel elist last 3
For more detailed information on a particular event, use the sel get ID command, where ID is the SEL record ID. For example:
In the example above, the event indicates that Power Supply #0 was detected and is present.
To speed up these operations, you can pre-cache the static data in the SDR and feed it back into IPMItool. This can dramatically reduce the processing time of some commands. To generate an SDR cache for later use, use the sdr dump command. For example:
After you have generated a cache file, it can be supplied to future invocations of
IPMItool with the -S option. For example:
ipmitool -I lanplus -H <IPADDR> -U root -P changeme -S galaxy.sdr sel elist
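If you script these calls, building the argument list in one place keeps the connection options and the optional -S cache file consistent. The following Python sketch uses only the options shown in this appendix (-I lanplus, -H, -U, -P, -S); the helper name ipmitool_argv() is hypothetical, and <IPADDR>, root, and changeme are the same placeholders used in the examples.

```python
import shlex
from typing import List, Optional


def ipmitool_argv(host: str, user: str, password: str, *command: str,
                  sdr_cache: Optional[str] = None) -> List[str]:
    """Build an ipmitool argument list for the lanplus interface.

    sdr_cache is the file written earlier by 'sdr dump'; when it is
    given, it is passed with -S before the subcommand.
    """
    argv = ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-P", password]
    if sdr_cache is not None:
        argv += ["-S", sdr_cache]
    argv += list(command)
    return argv


cmd = ipmitool_argv("<IPADDR>", "root", "changeme", "sel", "elist", "last", "3",
                    sdr_cache="galaxy.sdr")
print(shlex.join(cmd))  # quoted command line, ready to paste into a shell
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a host where IPMItool is installed.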
The output from certain commands might not display the sensor name along with
the corresponding sensor number. To see all sensor names in your server mapped to
the corresponding sensor numbers, you can use the following command:
ipmitool -I lanplus -H <IPADDR> -U root -P changeme sdr elist
sys.id | 00h | ok | 23.0 | State Asserted
sys.intsw | 01h | ok | 23.0 |
sys.psfail | 02h | ok | 23.0 | Predictive Failure Asserted
...
In the sample output above, the sensor name is in the first column and the
corresponding sensor number is in the second column.
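Because a SEL event may show a sensor number in a different format (for example, 0x16) from the sdr elist column (for example, 16h), converting both to integers avoids mismatches. This Python sketch builds a number-to-name lookup from saved sdr elist text; the function name sensor_number_map() is hypothetical, and the sample lines are the ones shown above.

```python
from typing import Dict


def sensor_number_map(sdr_elist_output: str) -> Dict[int, str]:
    """Map sensor numbers (as integers) to sensor names from 'sdr elist' text."""
    mapping: Dict[int, str] = {}
    for line in sdr_elist_output.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 2 or not fields[1].lower().endswith("h"):
            continue  # skip lines such as "..." that are not sensor records
        try:
            mapping[int(fields[1][:-1], 16)] = fields[0]  # "02h" -> 2
        except ValueError:
            continue
    return mapping


sample = """sys.id | 00h | ok | 23.0 | State Asserted
sys.intsw | 01h | ok | 23.0 |
sys.psfail | 02h | ok | 23.0 | Predictive Failure Asserted
..."""

names = sensor_number_map(sample)
# A SEL event that reports sensor number 0x02 can now be named:
print(names.get(0x02, "unknown sensor"))  # sys.psfail
```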
For a detailed explanation of each sensor, listed by name, refer to the Integrated
Lights Out Manager Supplement For Sun Fire X4600 Servers, 819-5432.
To read the FRU inventory information on these servers, you must first have the
FRU ROMs programmed. After that is done, you can see a full list of the available
FRU data by using the fru print command, as shown in the following example
(only two FRU devices are shown in the example, but all devices would be shown).
ipmitool -I lanplus -H <IPADDR> -U root -P changeme fru print
...
See “LED Sensor IDs” on page 69 and “LED Modes” on page 71 for information
about the variables in these commands.
Each LED has both a descriptor sensor and a status reading sensor, and the two are linked; that is, if you use the .led sensor to turn on a particular LED, the status change is reflected in the associated .fail sensor. For some of these LEDs, an event is also generated in the SEL. No events are generated for LEDs that blink on failure instead of staying steady on, because an event would otherwise be logged every time the LED flashed during the blink cycle.
TABLE D-2 lists the LED sensor IDs in these servers. See “Status Indicator LEDs” on
page 37 for diagrams of the LED locations.
LED Modes
You supply the modes in TABLE D-3 to the led set commands to specify the mode in
which you want the LED to be placed.
Mode Description
It is desirable to have these sensors “linked” so that both the front and back panel
LEDs can be controlled at the same time. This is handled through the use of Entity
Association Records. These are records in the SDR that contain a list of entities that
are considered part of a group.
For each Entity Association Record we also define another Generic Device Locator as
a logical entity to indicate to system software that it refers to a group of LEDs rather
than a single physical LED. TABLE D-4 describes the LED sensor groups.
Sensor group      Grouped LED sensors
sys.power.led     bp.power.led, fp.power.led
sys.locate.led    bp.locate.led, fp.locate.led
sys.alert.led     bp.alert.led, fp.alert.led
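The grouping in TABLE D-4 can be modeled as a lookup that tells you which physical LEDs a led set command is expected to affect. This Python sketch is an illustration only; the dictionary mirrors the table above, and the function name physical_leds() is hypothetical.

```python
from typing import Dict, Tuple

# Group sensors and their member LEDs, as listed in TABLE D-4.
LED_GROUPS: Dict[str, Tuple[str, ...]] = {
    "sys.power.led": ("bp.power.led", "fp.power.led"),
    "sys.locate.led": ("bp.locate.led", "fp.locate.led"),
    "sys.alert.led": ("bp.alert.led", "fp.alert.led"),
}


def physical_leds(sensor_id: str) -> Tuple[str, ...]:
    """Return the physical LEDs affected by setting sensor_id.

    A sensor that is not a group sensor affects only itself.
    """
    return LED_GROUPS.get(sensor_id, (sensor_id,))


print(physical_leds("sys.power.led"))  # ('bp.power.led', 'fp.power.led')
print(physical_leds("bp.power.led"))   # ('bp.power.led',)
```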
For example, to set both the front and back panel Power/OK LEDs to a standby
blink rate, you could type the following command:
ipmitool -I lanplus -H <IPADDR> -U root -P changeme sunoem led set sys.power.led standby
You could turn off the back panel Power/OK LED but leave the front panel
Power/OK LED blinking by typing the following command:
ipmitool -I lanplus -H <IPADDR> -U root -P changeme sunoem led set bp.power.led off
The following example shows a script to turn on all fan module LEDs:
sunoem led set ft0.fm0.led on
sunoem led set ft0.fm1.led on
sunoem led set ft0.fm2.led on
sunoem led set ft1.fm0.led on
sunoem led set ft1.fm1.led on
sunoem led set ft1.fm2.led on
If this script file is saved as leds_fan_on.isc, you can run it as follows:
ipmitool -I lanplus -H <IPADDR> -U root -P changeme exec leds_fan_on.isc
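Because the script is a regular pattern over fan trays ft0 and ft1 and fan modules fm0 through fm2, it can be generated rather than typed by hand. This Python sketch writes the same six lines shown above; the function name fan_led_script() is hypothetical, and the state argument accepts the on and off values used in this appendix.

```python
from pathlib import Path


def fan_led_script(state: str = "on") -> str:
    """Return 'sunoem led set' lines for fm0-fm2 in fan trays ft0 and ft1."""
    lines = [f"sunoem led set ft{tray}.fm{module}.led {state}"
             for tray in range(2)      # ft0, ft1
             for module in range(3)]   # fm0, fm1, fm2
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Writes the script file used with 'ipmitool ... exec leds_fan_on.isc'.
    Path("leds_fan_on.isc").write_text(fan_led_script("on"))
```

Calling fan_led_script("off") produces a matching script to turn the LEDs off again.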
Error Handling
This appendix contains information about how the servers process and log errors.
Note – The information in this appendix applies to the original Sun Fire X4600
server, and to the Sun Fire X4600 M2 server, unless otherwise noted in the text.
Note – The BIOS ChipKill feature must be disabled if you are testing for failures of
multiple bits within a DRAM (ChipKill corrects for the failure of a four-bit-wide DRAM).
■ The BIOS logs the error to the SP system event log (SEL) through the baseboard management controller (BMC).
■ The SP's SEL is updated with the failing DIMM pair's particular bank address.
■ The system reboots.
■ The BIOS logs the error in DMI.
Note – If the error is in the low 1 MB of memory, the BIOS freezes after rebooting, so no DMI log is recorded.
■ An example of the error reported by the SEL through IPMI 2.0 is as follows:
■ When the error is in low memory, the BIOS freezes during the pre-boot low-memory test, because it cannot decompress itself into faulty DRAM and execute. The SEL shows entries such as the following:
ipmitool> sel list
100 | 08/26/2005 | 11:36:09 | OEM #0xfb |
200 | 08/26/2005 | 11:36:12 | System Firmware Error | No usable system memory
300 | 08/26/2005 | 11:36:12 | Memory | Memory Device Disabled | CPU 0 DIMM 0
■ When the faulty DIMM is outside the low 1 MB region into which the BIOS decompresses itself, the system boots normally:
ipmitool> sel list
100 | 08/26/2005 | 05:04:04 | OEM #0xfb |
200 | 08/26/2005 | 05:04:09 | Memory | Memory Device Disabled | CPU 0 DIMM 0
■ Note the following considerations for this revision:
■ Uncorrectable ECC Memory Error is not reported.
■ Multi-bit ECC errors are reported as Memory Device Disabled.
■ On the first reboot, the BIOS logs a HyperTransport Error in the DMI log.
■ The BIOS disables the DIMM.
■ The BIOS sends the SEL records to the BMC.
■ The BIOS reboots again.
■ The BIOS skips the faulty DIMM on the next POST memory test.
■ The BIOS reports available memory, excluding the faulty DIMM pair.
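When reviewing saved sel list output like the samples above, a small parser can pick out the Memory Device Disabled records and the DIMM they name. This Python sketch assumes the pipe-separated layout of those samples (record ID, date, time, sensor, then any further event fields) and skips lines without enough fields, such as the ipmitool> prompt. The names SelEntry, parse_sel_list(), and disabled_memory_events() are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class SelEntry(NamedTuple):
    record_id: str            # kept as text; no numeric base is assumed
    date: str
    time: str
    sensor: str
    details: Tuple[str, ...]  # remaining non-empty fields, if any


def parse_sel_list(text: str) -> List[SelEntry]:
    """Parse 'sel list' lines of the form 'ID | date | time | sensor | ...'."""
    entries: List[SelEntry] = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 4:
            continue  # skips prompts such as "ipmitool> sel list"
        details = tuple(f for f in fields[4:] if f)
        entries.append(SelEntry(fields[0], fields[1], fields[2], fields[3], details))
    return entries


def disabled_memory_events(entries: List[SelEntry]) -> List[SelEntry]:
    """Return entries that report 'Memory Device Disabled'."""
    return [e for e in entries if "Memory Device Disabled" in e.details]


sample = """ipmitool> sel list
100 | 08/26/2005 | 05:04:04 | OEM #0xfb |
200 | 08/26/2005 | 05:04:09 | Memory | Memory Device Disabled | CPU 0 DIMM 0"""

hits = disabled_memory_events(parse_sel_list(sample))
print([(e.record_id, e.details[-1]) for e in hits])  # [('200', 'CPU 0 DIMM 0')]
```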
FIGURE E-1 shows an example of a DMI log screen from BIOS Setup Page.
■ The BIOS displays the following messages and freezes (during POST or DOS):
■ NMI EVENT!!
■ System Halted due to Fatal NMI!
■ The Linux NMI trap catches the interrupt and reports the following NMI
“confusion report” sequence:
Aug 5 05:15:00 d-mpk12-53-159 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Aug 5 05:15:00 d-mpk12-53-159 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 1.
Aug 5 05:15:00 d-mpk12-53-159 kernel: Dazed and confused, but trying to continue
Aug 5 05:15:00 d-mpk12-53-159 kernel: Do you have a strange power saving mode enabled?
Aug 5 05:15:00 d-mpk12-53-159 kernel: Uhhuh. NMI received for unknown reason 3d on CPU 1.
Aug 5 05:15:00 d-mpk12-53-159 kernel: Dazed and confused, but trying to continue
Aug 5 05:15:00 d-mpk12-53-159 kernel: Do you have a strange power saving mode enabled?
Aug 5 05:15:00 d-mpk12-53-159 kernel: Uhhuh. NMI received for unknown reason 3d on CPU 0.
Note – The Linux system reboots, but does not inform the BIOS of this incident.
■ SERR and HyperTransport Synch Flood Error are logged in DMI and the SP
SEL. See the following sample output:
SEL Record ID : 0a00
Record Type : 00
Timestamp : 08/10/2005 06:05:32
Generator ID : 0001
EvM Revision : 04
Sensor Type : Critical Interrupt
Sensor Number : 00
Event Type : Sensor-specific Discrete
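The sel get output is a list of Field : value lines, so it converts naturally into a dictionary. This Python sketch splits each line at its first colon only, which keeps time values such as 06:05:32 intact. The function name parse_sel_get() is hypothetical, and reading the record ID as hexadecimal is an assumption based on the 0a00 sample.

```python
from typing import Dict


def parse_sel_get(text: str) -> Dict[str, str]:
    """Turn 'Field : value' lines from 'sel get' into a dictionary."""
    record: Dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")  # split at the first colon only
        if sep:
            record[key.strip()] = value.strip()
    return record


sample = """SEL Record ID : 0a00
Record Type : 00
Timestamp : 08/10/2005 06:05:32
Generator ID : 0001
EvM Revision : 04
Sensor Type : Critical Interrupt
Sensor Number : 00
Event Type : Sensor-specific Discrete"""

record = parse_sel_get(sample)
print(record["Sensor Type"])              # Critical Interrupt
print(int(record["SEL Record ID"], 16))   # 2560
```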
SP failure
  Description: The SP fails to boot upon application of system power.
  Handling: The SP controls the system reset, so the system may power on, but will not come out of reset.
    • During power up, the SP's boot loader turns on the power LED.
    • During SP boot, Linux startup, and SP sanity check, the power LED blinks.
    • The LED is turned off when SP management code (the IPMI stack) is started.
    • At exit of BIOS POST, the LED goes to STEADY ON state.
  Logged (DMI Log or SP SEL): Not logged
  Fatal?: Fatal

SP failure
  Description: SP boots but fails POST.
  Handling: The SP controls the system RESET, so the system will not come out of reset.
  Logged (DMI Log or SP SEL): Not logged
  Fatal?: Fatal

BIOS POST failure
  Description: Server BIOS does not pass POST.
  Handling: There are fatal and non-fatal errors in POST. The BIOS does detect some errors that are announced during POST as POST codes on the bottom right corner of the display on the serial console and on the video display. Some POST codes are forwarded to the SP for logging.
  The POST codes do not come out in sequential order and some are repeated, because some POST codes are issued by code in add-in card BIOS expansion ROMs.
  In the case of early POST failures (for example, the BSP fails to operate correctly), BIOS just halts without logging.
  For some other POST failures subsequent to memory and SP initialization, the BIOS logs a message to the SP's SEL.

Single-bit DRAM ECC error
  Description: With ECC enabled in the BIOS Setup, the CPU detects and corrects a single-bit error on the DIMM interface.
  Handling: The CPU corrects the error in hardware. No interrupt or machine check is generated by the hardware. The polling is triggered every half-second by SMI timer interrupts and is done by the BIOS SMI handler. The BIOS SMI handler starts logging each detected error and stops logging when the limit for the same error is reached. The BIOS's polling can be disabled through a software interface.
  Logged (DMI Log or SP SEL): SP SEL
  Fatal?: Normal operation

Single four-bit DRAM error
  Description: With CHIP-KILL enabled in the BIOS Setup, the CPU detects and corrects for the failure of a four-bit-wide DRAM on the DIMM interface.
  Handling: The CPU corrects the error in hardware. No interrupt or machine check is generated by the hardware. The polling is triggered every half-second by SMI timer interrupts and is done by the BIOS SMI handler. The BIOS SMI handler starts logging each detected error and stops logging when the limit for the same error is reached. The BIOS's polling can be disabled through a software interface.
  Logged (DMI Log or SP SEL): SP SEL
  Fatal?: Normal operation

Uncorrectable DRAM ECC error
  Description: The CPU detects an uncorrectable multiple-bit DIMM error.
  Handling: The “sync flood” method of handling this is used to prevent the erroneous data from being propagated across the HyperTransport links. The system reboots, the BIOS recovers the machine check register information, maps this information to the failing DIMM (when CHIPKILL is disabled) or DIMM pair (when CHIPKILL is enabled), and logs that information to the SP. The BIOS will halt the CPU.
  Logged (DMI Log or SP SEL): SP SEL
  Fatal?: Fatal

Unsupported DIMM configuration
  Description: Unsupported DIMMs are used, or supported DIMMs are loaded improperly.
  Handling: The BIOS displays an error message, logs an error, and halts the system.
  Logged (DMI Log or SP SEL): DMI Log, SP SEL
  Fatal?: Fatal

HyperTransport link failure
  Description: CRC or link error on one of the HyperTransport Links.
  Handling: Sync floods on HyperTransport links, the machine resets itself, and error information gets retained through reset. The BIOS reports, A Hyper Transport sync flood error occurred on last boot, press F1 to continue.
  Logged (DMI Log or SP SEL): DMI Log, SP SEL
  Fatal?: Fatal

PCI SERR, PERR
  Description: System or parity error on a PCI bus.
  Handling: Sync floods on HyperTransport links, the machine resets itself, and error information gets retained through reset. The BIOS reports, A Hyper Transport sync flood error occurred on last boot, press F1 to continue.
  Logged (DMI Log or SP SEL): DMI Log, SP SEL
  Fatal?: Fatal

BIOS POST Microcode Error
  Description: The BIOS could not find or load the CPU Microcode Update to the CPU. The message most likely appears when a new CPU is installed in a motherboard with an outdated BIOS. In this case, the BIOS must be updated.
  Handling: The BIOS displays an error message, logs the error to DMI, and boots.
  Logged (DMI Log or SP SEL): DMI Log
  Fatal?: Non-fatal

BIOS POST CMOS Checksum Bad
  Description: CMOS contents failed the Checksum check.
  Handling: The BIOS displays an error message, logs the error to DMI, and boots.
  Logged (DMI Log or SP SEL): DMI Log
  Fatal?: Non-fatal

Unsupported CPU configuration
  Description: The BIOS supports mismatched frequency and steppings in CPU configuration, but some CPUs might not be supported.
  Handling: The BIOS displays an error message, logs the error, and halts the system.
  Logged (DMI Log or SP SEL): DMI Log
  Fatal?: Fatal

Correctable error
  Description: The CPU detects a variety of correctable errors in the MCi_STATUS registers.
  Handling: The CPU corrects the error in hardware. No interrupt or machine check is generated by the hardware. The polling is triggered every half second by SMI timer interrupts, and is done by the BIOS SMI handler. The SMI handler logs a message to the SP SEL if the SEL is available, otherwise SMI logs a message to DMI. The BIOS's polling can be disabled through software SMI.
  Logged (DMI Log or SP SEL): DMI Log, SP SEL
  Fatal?: Normal operation

Single fan failure
  Description: Fan failure is detected by reading tach signals.
  Handling: The Front Fan Fault, Service Action Required, and individual fan module LEDs are lit.
  Logged (DMI Log or SP SEL): SP SEL
  Fatal?: Non-fatal

Multiple fan failure
  Description: Fan failure is detected by reading tach signals.
  Handling: The Front Fan Fault, Service Action Required, and individual fan module LEDs are lit.
  Logged (DMI Log or SP SEL): SP SEL
  Fatal?: Fatal

Single power supply failure
  Description: Any of the AC/DC PS_VIN_GOOD or PS_PWR_OK signals is deasserted.
  Handling: The Service Action Required and Power Supply Fault LEDs are lit.
  Logged (DMI Log or SP SEL): SP SEL
  Fatal?: Non-fatal

DC/DC power converter failure
  Description: Any POWER_GOOD signal is deasserted from the DC/DC converters.
  Handling: The Service Action Required LED is lit, the system is powered down to standby power mode, and the Power LED enters standby blink state.
  Logged (DMI Log or SP SEL): SP SEL
  Fatal?: Fatal

Voltage above/below Threshold
  Description: The SP monitors system voltages and detects voltage above or below a given threshold.
  Handling: The Service Action Required LED and Power Supply Fault LED blink.
  Logged (DMI Log or SP SEL): SP SEL
  Fatal?: Fatal

High temperature
  Description: The SP monitors CPU and system temperatures, and detects specified temperatures above a given threshold.
  Handling: The Service Action Required LED and System Overheat Fault LED blink. The motherboard is shut down above the critical level.
  Logged (DMI Log or SP SEL): SP SEL
  Fatal?: Fatal

Processor thermal trip
  Description: The CPU drives the THERMTRIP_L signal upon detecting an overtemp condition.
  Handling: CPLD shuts down power to the CPU. The Service Action Required LED and System Overheat Fault LED blink.
  Logged (DMI Log or SP SEL): SP SEL
  Fatal?: Fatal

Boot device failure
  Description: The BIOS is not able to boot from a device in the boot device list.
  Handling: The BIOS goes to the next boot device in the list. If all devices in the list fail, an error message is displayed, and the BIOS retries from the beginning of the list. The SP can control or change the boot order.
  Logged (DMI Log or SP SEL): DMI Log
  Fatal?: Non-fatal
A
anonymous user, IPMItool 59
B
back panel
  LED functions 40
  LED locations 40
BIOS
  changing POST options 27
  event logs 21
  POST code checkpoints 32
  POST codes 30
  POST overview 25
  redirecting console output for POST 26
Bootable Diagnostics CD 18
C
comments and suggestions xiv
component inventory
  viewing with ILOM SP GUI 50
  viewing with IPMItool 68
configurations supported for DIMMs 12
console output, redirecting 26
correctable errors, handling 78
CPU
  fault LED 44
  module attention LED 44
D
default password, changing with IPMItool 60
diagnostic software
  Bootable Diagnostics CD 18
  SunVTS 17
DIMMs
  error handling 7
  fault LEDs 10, 44
  isolating errors 14
  population rules 12
  supported configurations 12
E
emergency shutdown 5
error handling
  correctable 78
  DIMMs 7
  hardware errors 86
  mismatching processors 85
  parity errors 80
  system errors 83
  uncorrectable errors 75
event logs, BIOS 21
external inspection 4
external LEDs 37
F
fan tray fault LEDs 41
faults, DIMM 10
finding sensor names 67
Front Fan Fault LED 39
front panel
  LED functions 39
  LED locations 38
FRU inventory
  viewing with ILOM SP GUI 50
  viewing with IPMItool 68
G
gathering service visit information 3
general troubleshooting guidelines 3
graceful shutdown 5
GRASP board power status LED 44
guidelines for troubleshooting 3
H
hard disk drive status LEDs 39
hardware errors, handling 86
I
ILOM SP GUI
  general information 45
  serial connection 46
  time stamps 49
  viewing component inventory 50
  viewing sensors 52
  viewing SP event log 47
inspection
  external 4
  internal 5
Integrated Lights-Out Manager Service Processor, See ILOM SP GUI
Intelligent Platform Management Interface, See IPMI
internal inspection 5
internal LEDs 42
IPMI, general information 58
IPMItool
  changing default password 60
  clearing SP SEL 66
  configuring SSH key 60
  connecting to server 59
  enabling anonymous user 59
  general information 58
  LED modes 71
  LED sensor groups 71
  LED sensor IDs 69
  location of package 58
  man page 58
  setting LED status 69
  using scripts for testing 72
  using SDR cache 67
  viewing component inventory 68
  viewing LED status 69
  viewing sensor status 61
  viewing SP SEL 65
isolating DIMM ECC errors 14
L
LEDs
  back panel functions 40
  back panel locations 40
  CPU fault 44
  CPU module attention 44
  DIMM fault 44
  external 37
  Fan Tray fault 41
  Front Fan Fault 39
  front fan fault functions 41
  front panel functions 39
  front panel locations 38
  GRASP Board Power Status 44
  Hard Disk Drive Status 39
  internal 42
  Locate 39
  modes 71
  Power Supply Fault 39
  Power Supply Status 40
  Power/OK 39
  sensor groups 71
  sensor IDs 69
  Service Action Required 39
  setting status with IPMItool 69
  System Overheat Fault 39
  viewing status with IPMItool 69
Locate LED and button 39
M
mapping sensor numbers to sensor names 67
mismatching processors, error handling 85
P
parity errors, handling 80
password, changing with IPMItool 60
PERR 80
population rules for DIMMs 12
POST
  changing options 27
  code checkpoints 32
R
redirecting console output 26
related documentation xii
S
safety guidelines xi
scripts, IPMItool 72
SDR cache, using with IPMItool 67
sensor data repository, See SDR
sensor IDs for LEDs 69
sensor number formats 67
sensors
  viewing with ILOM SP GUI 52
  viewing with IPMItool 61
serial connection to ILOM SP 46
SERR 83
Service Action Required LED 39
Service Processor system event log, See SP SEL
service visit information, gathering 3
shutdown procedure 5
SP event log
  viewing with ILOM SP GUI 47
SP SEL
  clearing with IPMItool 66
  sensor numbers and names 67
  time stamps 49
  using SDR cache 67
  viewing with IPMItool 65
SSH key, configuring with IPMItool 60
Sun Fire X4200
  Power button 5
SunVTS
U
uncorrectable errors, handling 75