There's a warning alert on icinga about Memory correctable errors -EDAC- on elastic1029.
Description
Event Timeline
It's back:
Current Status: CRITICAL
(for 1d 10h 32m 40s)
Status Information: 4.001 ge 4
Error reset as documented in Monitoring/Memory.
@Cmjohnson this seems to happen often enough that we probably need to have a look at those memory modules. What do you need from me to move forward?
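For anyone who wants to look at the raw counters on the host before or after a reset, here is a minimal sketch. It assumes the kernel EDAC driver is loaded and exposes per-controller counters in the standard sysfs location (`/sys/devices/system/edac/mc/mc*/ce_count`). The function name `read_ce_counts` is made up for illustration; this is not how the Icinga check is implemented, nor is it the reset procedure from Monitoring/Memory.

```python
from pathlib import Path


def read_ce_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller_name: correctable_error_count} read from EDAC sysfs.

    Hypothetical helper: assumes the standard Linux EDAC layout, where each
    memory controller directory (mc0, mc1, ...) contains a ce_count file.
    """
    counts = {}
    root = Path(edac_root)
    if not root.is_dir():
        # EDAC driver not loaded, or the path differs on this host.
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        ce_file = mc / "ce_count"
        if ce_file.is_file():
            counts[mc.name] = int(ce_file.read_text().strip())
    return counts


if __name__ == "__main__":
    for name, count in read_ce_counts().items():
        print(f"{name}: {count} correctable errors")
```

Comparing these numbers before and after the reset should show whether the count is climbing again on the same controller.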
Note that you can downtime and shutdown this server whenever you need.
@Gehel you will need to take the server offline for a day so I can reseat the DIMM. The server logs do not indicate any memory errors. If you want to downtime it for Wednesday or Thursday let me know.
Mentioned in SAL (#wikimedia-operations) [2019-06-11T15:41:10Z] <gehel> shutting down elastic1029 for investigation - T214283
@Cmjohnson elastic1029 is shut down and downtimed in icinga, do whatever you need to do and restart whenever it is done.
Mentioned in SAL (#wikimedia-operations) [2019-06-18T11:22:35Z] <akosiaris> set elastic1029 as inactive in all conftool data. Command was sudo confctl select "name=elastic1029.eqiad.wmnet" set/pooled=inactive T214283
Closing this for now; let me know if there is another issue. Keep in mind this server is out of warranty.
Mentioned in SAL (#wikimedia-operations) [2019-06-19T16:23:29Z] <onimisionipe> pooling elastic1029 - T214283
elastic1029 is back on icinga showing memory errors. See https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=elastic1029&service=Memory+correctable+errors+-EDAC-
I'm reopening this.
This shows no errors in the service event log for the memory:
/admin1-> racadm getsel
Record: 1
Date/Time: 10/06/2014 10:01:16
Source: system
Severity: Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record: 2
Date/Time: 10/23/2014 13:56:59
Source: system
Severity: Critical
Description: The chassis is open while the power is off.
-------------------------------------------------------------------------------
Record: 3
Date/Time: 10/23/2014 13:57:04
Source: system
Severity: Ok
Description: The chassis is closed while the power is off.
-------------------------------------------------------------------------------
Record: 4
Date/Time: 08/03/2017 14:22:24
Source: system
Severity: Critical
Description: The chassis is open while the power is off.
-------------------------------------------------------------------------------
Record: 5
Date/Time: 08/03/2017 14:22:30
Source: system
Severity: Ok
Description: The chassis is closed while the power is off.
-------------------------------------------------------------------------------
Record: 6
Date/Time: 08/16/2018 16:19:51
Source: system
Severity: Critical
Description: The chassis is open while the power is off.
-------------------------------------------------------------------------------
Record: 7
Date/Time: 08/16/2018 16:19:57
Source: system
Severity: Ok
Description: The chassis is closed while the power is off.
-------------------------------------------------------------------------------
Record: 8
Date/Time: 06/19/2019 15:44:46
Source: system
Severity: Critical
Description: The chassis is open while the power is off.
-------------------------------------------------------------------------------
Record: 9
Date/Time: 06/19/2019 15:44:51
Source: system
Severity: Ok
Description: The chassis is closed while the power is off.
-------------------------------------------------------------------------------
/admin1->
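When the SEL is long, it can help to filter it instead of reading it by eye. The following is a rough sketch that parses `racadm getsel` output in the shape pasted above and keeps only entries whose description mentions memory. The helper names (`parse_sel`, `memory_entries`) and the keyword list are assumptions made for illustration, not part of racadm or any Dell tooling, and the regex only targets the field layout seen in this paste.

```python
import re

# Matches one SEL record in the layout shown in this task. Whitespace between
# fields may be spaces or newlines, so it works on wrapped or flattened pastes.
SEL_RECORD = re.compile(
    r"Record:\s*(?P<record>\d+)\s+"
    r"Date/Time:\s*(?P<when>\S+\s+\S+)\s+"
    r"Source:\s*(?P<source>\S+)\s+"
    r"Severity:\s*(?P<severity>\S+)\s+"
    r"Description:\s*(?P<description>.*?)\s*(?=-{5,}|\Z)",
    re.DOTALL,
)


def parse_sel(text):
    """Return a list of dicts, one per SEL record found in the text."""
    records = []
    for match in SEL_RECORD.finditer(text):
        entry = match.groupdict()
        entry["record"] = int(entry["record"])
        records.append(entry)
    return records


def memory_entries(records, keywords=("memory", "dimm", "ecc")):
    """Keep records whose description contains any keyword (assumed list)."""
    return [
        r for r in records
        if any(k in r["description"].lower() for k in keywords)
    ]


if __name__ == "__main__":
    import sys

    # Usage: paste or pipe saved `racadm getsel` output on stdin.
    for entry in memory_entries(parse_sel(sys.stdin.read())):
        print(entry["record"], entry["when"], entry["severity"],
              entry["description"])
```

Run against the paste above, the filter should print nothing, since every record is either the log clear or a chassis open/close event.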
The next step would be to reboot the machine into the Dell ePSA tool and run memtest via that tool. Can this system be taken offline for this work?
I don't see any SEL output pasted into this task showing the original errors. The log goes back to 2014 and contains no memory entries, so the earlier errors never showed up in the SEL either.
Also, in the future, please open a new task for hardware troubleshooting and follow all directions on:
https://phabricator.wikimedia.org/maniphest/task/edit/form/55/