Memory correctable errors -EDAC- elastic1029
Closed, ResolvedPublic

Description

There's a warning alert on icinga about Memory correctable errors -EDAC- on elastic1029.

Event Timeline

I'm closing this task as invalid; I no longer see any error.

Dzahn subscribed.

It's back:

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=elastic1029&service=Memory+correctable+errors+-EDAC-

Current Status: CRITICAL
(for 1d 10h 32m 40s)
Status Information: 4.001 ge 4
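
For context, the Linux EDAC subsystem exposes a correctable-error counter per memory controller under /sys/devices/system/edac/mc/. The sketch below reads those counters by hand and compares their sum to a threshold. It is not the Icinga check itself (this task does not show how the check arrives at `4.001`); the function names and the default threshold of 4, borrowed from the `ge 4` in the status line, are illustrative assumptions.

```python
from pathlib import Path

# Standard sysfs location for EDAC memory controllers on Linux.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_ce_counts(root=EDAC_ROOT):
    """Return {controller_name: correctable_error_count} for each mc* directory."""
    counts = {}
    for mc_dir in sorted(Path(root).glob("mc[0-9]*")):
        ce_file = mc_dir / "ce_count"
        if ce_file.is_file():
            counts[mc_dir.name] = int(ce_file.read_text().strip())
    return counts


def exceeds_threshold(counts, threshold=4):
    """True when the summed correctable errors reach the (assumed) threshold."""
    return sum(counts.values()) >= threshold


if __name__ == "__main__":
    counts = read_ce_counts()
    for name, value in counts.items():
        print(f"{name}: {value} correctable errors")
    print("over threshold" if exceeds_threshold(counts) else "below threshold")
```

Run on the host itself, this prints one line per memory controller; an empty result usually means no EDAC driver is loaded for the platform.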

Error reset as documented in Monitoring/Memory.

@Cmjohnson this seems to happen often enough that we probably need to have a look at those memory modules. What do you need from me to move forward?

Note that you can downtime and shutdown this server whenever you need.

@Cmjohnson any news on this? Do you need anything from our side?

@Gehel you will need to take the server offline for a day so I can reseat the DIMM. The server logs do not indicate any memory errors. If you want to downtime it for Wednesday or Thursday let me know.

Mentioned in SAL (#wikimedia-operations) [2019-06-11T15:41:10Z] <gehel> shutting down elastic1029 for investigation - T214283

@Cmjohnson elastic1029 is shut down and downtimed in icinga, do whatever you need to do and restart whenever it is done.

Mentioned in SAL (#wikimedia-operations) [2019-06-18T11:22:35Z] <akosiaris> set elastic1029 as inactive in all conftool data. Command was sudo confctl select "name=elastic1029.eqiad.wmnet" set/pooled=inactive T214283

The DIMM has been reseated and swapped to the opposite sides.

Cmjohnson claimed this task.

Closing this for now; let me know if there is another issue. Keep in mind this server is out of warranty.

Mentioned in SAL (#wikimedia-operations) [2019-06-19T16:23:29Z] <onimisionipe> pooling elastic1029 - T214283

This shows no errors in the system event log (SEL) for the memory:

/admin1-> racadm getsel
Record:      1
Date/Time:   10/06/2014 10:01:16
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   10/23/2014 13:56:59
Source:      system
Severity:    Critical
Description: The chassis is open while the power is off.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   10/23/2014 13:57:04
Source:      system
Severity:    Ok
Description: The chassis is closed while the power is off.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   08/03/2017 14:22:24
Source:      system
Severity:    Critical
Description: The chassis is open while the power is off.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   08/03/2017 14:22:30
Source:      system
Severity:    Ok
Description: The chassis is closed while the power is off.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   08/16/2018 16:19:51
Source:      system
Severity:    Critical
Description: The chassis is open while the power is off.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   08/16/2018 16:19:57
Source:      system
Severity:    Ok
Description: The chassis is closed while the power is off.
-------------------------------------------------------------------------------
Record:      8
Date/Time:   06/19/2019 15:44:46
Source:      system
Severity:    Critical
Description: The chassis is open while the power is off.
-------------------------------------------------------------------------------
Record:      9
Date/Time:   06/19/2019 15:44:51
Source:      system
Severity:    Ok
Description: The chassis is closed while the power is off.
-------------------------------------------------------------------------------
/admin1->
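
A log like the one above can also be checked mechanically once it gets long. The following is a minimal sketch, assuming the `racadm getsel` output has been saved as text in the same `Field: value` layout shown here; the helper names and the keyword list used to spot memory-related entries are hypothetical choices, not part of any Dell tooling.

```python
import re

# Field names as they appear in the pasted output above.
FIELDS = {"Record", "Date/Time", "Source", "Severity", "Description"}

# Hypothetical keyword filter for memory-related descriptions.
MEMORY_PATTERN = re.compile(r"memory|dimm|\becc\b", re.IGNORECASE)


def parse_sel(text):
    """Split saved SEL output into a list of {field: value} dicts."""
    records, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("---"):  # separator line ends a record
            if current:
                records.append(current)
            current = {}
            continue
        key, sep, value = line.partition(":")
        if sep and key.strip() in FIELDS:
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


def memory_records(records):
    """Return only the records whose description mentions memory."""
    return [r for r in records if MEMORY_PATTERN.search(r.get("Description", ""))]


if __name__ == "__main__":
    sample = """\
Record:      8
Date/Time:   06/19/2019 15:44:46
Source:      system
Severity:    Critical
Description: The chassis is open while the power is off.
-------------------------------------------------------------------------------
Record:      9
Date/Time:   06/19/2019 15:44:51
Source:      system
Severity:    Ok
Description: The chassis is closed while the power is off.
-------------------------------------------------------------------------------
"""
    records = parse_sel(sample)
    print(f"{len(records)} records, {len(memory_records(records))} memory-related")
```

Prompt lines such as `/admin1->` are ignored because they do not start with a known field name.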

The next step would be to reboot the machine into the Dell ePSA tool and run memtest via that tool. Can this system be taken offline for this work?

I don't see any SEL output pasted into this task showing the original errors, and the log above goes back a long way, so the original error never showed up in the SEL either.

Also, in the future, please open a new task for hardware troubleshooting and follow all directions on:

https://phabricator.wikimedia.org/maniphest/task/edit/form/55/

debt subscribed.

Hi @Mathew.onipe - can you create a new ticket for the 'new' errors, please? Thanks!