
Disk (sdv) failed on ms-be1065
Closed, Resolved (Public)

Description

Hi,

A disk has failed in ms-be1065 - I think the RAID output below is sufficient information? In any case, could this be replaced ASAP, please? You can work on this system at any time without further input from me.

A degraded RAID (megacli) was detected on host ms-be1065. An automatic snapshot of the current RAID status is attached below.

CRITICAL: 1 failed LD(s) (Offline)

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 22 (Target Id: 22)
	RAID Level: Primary-0, Secondary-0, RAID Level Qualifier-0
	State: =====> Offline <=====
	Number Of Drives: 1
	Number of Spans: 1
	Current Cache Policy: WriteThrough, ReadAheadNone, Cached, No Write Cache if Bad BBU

		Span: 0 - Number of PDs: 1

			PD: 0 Information
			Enclosure Device ID: 32
			Slot Number: 20
			Drive's position: DiskGroup: 20, Span: 0, Arm: 0
			Media Error Count: 76
			Other Error Count: 1
			Predictive Failure Count: 0
			Last Predictive Failure Event Seq Number: 0

				Raw Size: 7.277 TB [0x3a3812ab0 Sectors]
				Firmware state: =====> Failed <=====
				Media Type: Hard Disk Device
				Drive Temperature: 38C (100.40 F)

=== RaidStatus completed
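To locate the drive to swap, the snapshot above can be reduced to the enclosure, slot and media error count of each failed physical drive. The following is a minimal sketch, assuming the field labels appear exactly as in the output above; `failed_drives` and `SAMPLE` are hypothetical names introduced for illustration and are not part of the Nagios plugin.

```python
import re

# Fields worth keeping for each physical drive (labels copied from the snapshot).
FIELD = re.compile(r"(Enclosure Device ID|Slot Number|Media Error Count):\s*(\d+)")


def failed_drives(status_text):
    """Return (enclosure, slot, media_errors) for each PD in Failed state."""
    results = []
    current = {}
    for raw in status_text.splitlines():
        line = raw.strip()
        if line.startswith("PD:"):
            current = {}  # a new physical drive block starts here
        match = FIELD.match(line)
        if match:
            current[match.group(1)] = int(match.group(2))
        elif line.startswith("Firmware state:") and "Failed" in line:
            results.append((
                current.get("Enclosure Device ID"),
                current.get("Slot Number"),
                current.get("Media Error Count"),
            ))
    return results


# Excerpt of the snapshot from this task.
SAMPLE = """
    PD: 0 Information
    Enclosure Device ID: 32
    Slot Number: 20
    Drive's position: DiskGroup: 20, Span: 0, Arm: 0
    Media Error Count: 76
    Other Error Count: 1
        Firmware state: =====> Failed <=====
"""

print(failed_drives(SAMPLE))
```

For this host the result points at enclosure 32, slot 20, which matches the SCSI address `0:0:20:0` seen later in kern.log.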

Event Timeline

DC ops: Would you have an 8 TB disk spare for this host? It seems out of warranty.

MatthewVernon renamed this task from Degraded RAID on ms-be1065 to Disk (sdv) failed on ms-be1065. Oct 14 2024, 8:46 AM
MatthewVernon triaged this task as High priority.
MatthewVernon updated the task description.
Jclark-ctr claimed this task.
Jclark-ctr subscribed.

@jcrespo Yes, we do have spare 8 TB drives. Am I able to change it in the morning?

I also updated the iDRAC firmware while I was logged into the server.

@Jclark-ctr please do replace the disk at your earliest convenience - the server is ready for the disk swap.

[I've reopened this task, as I don't think the disk has yet been replaced]

@MatthewVernon The drive has been replaced, but it will not let me add the new drive. A reboot might be needed.
I get this error, but the new drive is listed as ready. I have rebooted the iDRAC, and that has cleared any errors. @Papaul, do you have any input on this?

  1. Try importing foreign drives if any.
  2. Make sure that the enclosure containing the virtual drive is connected to the controller.
  3. Install any drives that are reported as missed or failed.

@Jclark-ctr I don't know what error you're referring to, but kern.log shows a new disk being added and then removed again:

Oct 16 14:21:43 ms-be1065 kernel: [11498629.969979] scsi 0:0:20:0: Direct-Access     ATA      ST8000NM012A-2KE CALD PQ: 0 ANSI: 6
Oct 16 14:21:43 ms-be1065 kernel: [11498629.983931] sd 0:0:20:0: Attached scsi generic sg22 type 0
Oct 16 14:21:44 ms-be1065 kernel: [11498629.992190] sd 0:0:20:0: [sdv] 15628053168 512-byte logical blocks: (8.00 TB/7.28 TiB)
Oct 16 14:21:44 ms-be1065 kernel: [11498629.992193] sd 0:0:20:0: [sdv] 4096-byte physical blocks
Oct 16 14:21:44 ms-be1065 kernel: [11498630.019355] sd 0:0:20:0: [sdv] Write Protect is off
Oct 16 14:21:44 ms-be1065 kernel: [11498630.019360] sd 0:0:20:0: [sdv] Mode Sense: 9b 00 10 08
Oct 16 14:21:44 ms-be1065 kernel: [11498630.021681] sd 0:0:20:0: [sdv] Write cache: disabled, read cache: enabled, supports DPO and FUA
Oct 16 14:21:44 ms-be1065 kernel: [11498630.244122] sd 0:0:20:0: [sdv] Attached SCSI disk
Oct 16 14:24:02 ms-be1065 kernel: [11498768.773563] megaraid_sas 0000:18:00.0: scanning for scsi0...
Oct 16 14:24:02 ms-be1065 kernel: [11498768.775325] sd 0:0:20:0: SCSI device is removed
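As a quick sanity check on the figures above, the raw sector count of the failed drive in the RAID snapshot (`0x3a3812ab0`) can be compared with the block count the kernel reports for the replacement disk. This is a small sketch using only numbers quoted in this task; the variable names are illustrative.

```python
# Raw size of the failed drive, from the megacli snapshot (hex sector count).
failed_sectors = int("0x3a3812ab0", 16)

# Block count of the replacement disk, from the kern.log line for sdv.
new_blocks = 15628053168
block_size = 512  # bytes per logical block, as stated in kern.log

size_bytes = new_blocks * block_size
size_tb = size_bytes / 10**12   # decimal terabytes
size_tib = size_bytes / 2**40   # binary tebibytes

print("same sector count:", failed_sectors == new_blocks)
print(f"{size_tb:.2f} TB / {size_tib:.3f} TiB")
```

The two counts are identical, so the replacement is the same capacity as the failed drive. The "7.277 TB" in the megacli output is the binary figure that the kernel labels TiB.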

Looking back through the log, I see that this has happened a couple of times, after a chunk of output from the initial replacement:

Oct 16 13:30:28 ms-be1065 kernel: [11495554.478648] sd 0:2:22:0: SCSI device is removed
Oct 16 13:30:28 ms-be1065 kernel: [11495554.797310] megaraid_sas 0000:18:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
Oct 16 13:30:28 ms-be1065 kernel: [11495554.797349] megaraid_sas 0000:18:00.0: FW in FAULT state Fault code:0x10000 subcode:0x0 func:megasas_wait_for_outstanding_fusion
Oct 16 13:30:28 ms-be1065 kernel: [11495554.809199] megaraid_sas 0000:18:00.0: resetting fusion adapter scsi0.
Oct 16 13:30:28 ms-be1065 kernel: [11495554.809355] megaraid_sas 0000:18:00.0: Outstanding fastpath IOs: 24
Oct 16 13:30:39 ms-be1065 kernel: [11495565.401029] megaraid_sas 0000:18:00.0: Waiting for FW to come to ready state
Oct 16 13:31:03 ms-be1065 kernel: [11495589.156432] megaraid_sas 0000:18:00.0: FW now in Ready state
Oct 16 13:31:03 ms-be1065 kernel: [11495589.156439] megaraid_sas 0000:18:00.0: FW now in Ready state
Oct 16 13:31:03 ms-be1065 kernel: [11495589.156639] megaraid_sas 0000:18:00.0: Current firmware supports maximum commands: 928   LDIO threshold: 0
Oct 16 13:31:03 ms-be1065 kernel: [11495589.156643] megaraid_sas 0000:18:00.0: Performance mode :Latency
Oct 16 13:31:03 ms-be1065 kernel: [11495589.156646] megaraid_sas 0000:18:00.0: FW supports sync cache   : No
Oct 16 13:31:03 ms-be1065 kernel: [11495589.156654] megaraid_sas 0000:18:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
Oct 16 13:31:03 ms-be1065 kernel: [11495589.296418] megaraid_sas 0000:18:00.0: FW provided supportMaxExtLDs: 1  max_lds: 64
Oct 16 13:31:03 ms-be1065 kernel: [11495589.296423] megaraid_sas 0000:18:00.0: controller type  : MR(2048MB)
Oct 16 13:31:03 ms-be1065 kernel: [11495589.296426] megaraid_sas 0000:18:00.0: Online Controller Reset(OCR)     : Enabled
Oct 16 13:31:03 ms-be1065 kernel: [11495589.296429] megaraid_sas 0000:18:00.0: Secure JBOD support      : No
Oct 16 13:31:03 ms-be1065 kernel: [11495589.296431] megaraid_sas 0000:18:00.0: NVMe passthru support    : No
Oct 16 13:31:03 ms-be1065 kernel: [11495589.296434] megaraid_sas 0000:18:00.0: FW provided TM TaskAbort/Reset timeout   : 0 secs/0 secs
Oct 16 13:31:03 ms-be1065 kernel: [11495589.296437] megaraid_sas 0000:18:00.0: JBOD sequence map support        : No
Oct 16 13:31:03 ms-be1065 kernel: [11495589.296439] megaraid_sas 0000:18:00.0: PCI Lane Margining support       : No
Oct 16 13:31:03 ms-be1065 kernel: [11495589.325242] megaraid_sas 0000:18:00.0: JBOD sequence map is disabled megasas_setup_jbod_map 5746
Oct 16 13:31:03 ms-be1065 kernel: [11495589.325250] megaraid_sas 0000:18:00.0: megasas_enable_intr_fusion is called outbound_intr_mask:0x40000000
Oct 16 13:31:03 ms-be1065 kernel: [11495589.325278] megaraid_sas 0000:18:00.0: Adapter is OPERATIONAL for scsi:0
Oct 16 13:31:03 ms-be1065 kernel: [11495589.366824] megaraid_sas 0000:18:00.0: Reset successful for scsi0.
Oct 16 13:31:03 ms-be1065 kernel: [11495589.374144] megaraid_sas 0000:18:00.0: 3133 (782400602s/0x0020/CRIT) - Controller encountered a fatal error and was reset
Oct 16 13:31:03 ms-be1065 kernel: [11495589.408962] megaraid_sas 0000:18:00.0: 3221 (782400627s/0x0021/FATAL) - Controller cache pinned for missing or offline VDs:  16
Oct 16 13:36:16 ms-be1065 kernel: [11495902.626312] megaraid_sas 0000:18:00.0: scanning for scsi0...
Oct 16 13:36:16 ms-be1065 kernel: [11495902.630039] scsi 0:0:20:0: Direct-Access     ATA      ST8000NM012A-2KE CALD PQ: 0 ANSI: 6
Oct 16 13:36:16 ms-be1065 kernel: [11495902.641775] sd 0:0:20:0: Attached scsi generic sg22 type 0
Oct 16 13:36:16 ms-be1065 kernel: [11495902.649945] sd 0:0:20:0: [sdv] 15628053168 512-byte logical blocks: (8.00 TB/7.28 TiB)
Oct 16 13:36:16 ms-be1065 kernel: [11495902.649947] sd 0:0:20:0: [sdv] 4096-byte physical blocks
Oct 16 13:36:16 ms-be1065 kernel: [11495902.677088] sd 0:0:20:0: [sdv] Write Protect is off
Oct 16 13:36:16 ms-be1065 kernel: [11495902.677093] sd 0:0:20:0: [sdv] Mode Sense: 9b 00 10 08
Oct 16 13:36:16 ms-be1065 kernel: [11495902.679505] sd 0:0:20:0: [sdv] Write cache: disabled, read cache: enabled, supports DPO and FUA
Oct 16 13:36:16 ms-be1065 kernel: [11495902.896606] sd 0:0:20:0: [sdv] Attached SCSI disk
Oct 16 13:38:43 ms-be1065 kernel: [11496049.198207] megaraid_sas 0000:18:00.0: scanning for scsi0...
Oct 16 13:38:43 ms-be1065 kernel: [11496049.203119] sd 0:0:20:0: SCSI device is removed
Oct 16 13:53:45 ms-be1065 kernel: [11496951.528852] megaraid_sas 0000:18:00.0: scanning for scsi0...
Oct 16 13:53:45 ms-be1065 kernel: [11496951.546184] scsi 0:0:20:0: Direct-Access     ATA      ST8000NM012A-2KE CALD PQ: 0 ANSI: 6
Oct 16 13:53:45 ms-be1065 kernel: [11496951.558342] sd 0:0:20:0: Attached scsi generic sg22 type 0
Oct 16 13:53:45 ms-be1065 kernel: [11496951.566671] sd 0:0:20:0: [sdv] 15628053168 512-byte logical blocks: (8.00 TB/7.28 TiB)
Oct 16 13:53:45 ms-be1065 kernel: [11496951.566674] sd 0:0:20:0: [sdv] 4096-byte physical blocks
Oct 16 13:53:45 ms-be1065 kernel: [11496951.593861] sd 0:0:20:0: [sdv] Write Protect is off
Oct 16 13:53:45 ms-be1065 kernel: [11496951.593866] sd 0:0:20:0: [sdv] Mode Sense: 9b 00 10 08
Oct 16 13:53:45 ms-be1065 kernel: [11496951.596269] sd 0:0:20:0: [sdv] Write cache: disabled, read cache: enabled, supports DPO and FUA
Oct 16 13:53:45 ms-be1065 kernel: [11496951.813326] sd 0:0:20:0: [sdv] Attached SCSI disk
Oct 16 13:57:36 ms-be1065 kernel: [11497182.239872] megaraid_sas 0000:18:00.0: scanning for scsi0...
Oct 16 13:57:36 ms-be1065 kernel: [11497182.241486] sd 0:0:20:0: SCSI device is removed

Maybe the drive's not seated right and/or is faulty? If you want to try a reboot, feel free (the system doesn't need depooling or anything), but I'm not sure that's likely to solve it.
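The attach/remove pattern in the log excerpts above can be summarised by pairing each "Attached SCSI disk" line with the next "SCSI device is removed" line for the same SCSI address. The following is a rough sketch, assuming kern.log lines in the format quoted here; `attach_durations` and `SAMPLE` are hypothetical names, and the durations use the kernel timestamps in square brackets.

```python
import re

# Matches the kernel timestamp, the SCSI address and the event text.
EVENT = re.compile(
    r"\[\s*(\d+\.\d+)\] sd ([\d:]+): (?:\[\w+\] )?"
    r"(Attached SCSI disk|SCSI device is removed)"
)


def attach_durations(lines):
    """Return (scsi_address, seconds_attached) for each attach/remove pair."""
    attached_at = {}
    spans = []
    for line in lines:
        match = EVENT.search(line)
        if not match:
            continue
        ts, addr, event = float(match.group(1)), match.group(2), match.group(3)
        if event == "Attached SCSI disk":
            attached_at[addr] = ts
        elif addr in attached_at:
            # Removal of a device we saw being attached: record how long it lasted.
            spans.append((addr, round(ts - attached_at.pop(addr), 1)))
    return spans


# Lines copied from the kern.log excerpts in this task.
SAMPLE = [
    "Oct 16 13:30:28 ms-be1065 kernel: [11495554.478648] sd 0:2:22:0: SCSI device is removed",
    "Oct 16 13:36:16 ms-be1065 kernel: [11495902.896606] sd 0:0:20:0: [sdv] Attached SCSI disk",
    "Oct 16 13:38:43 ms-be1065 kernel: [11496049.203119] sd 0:0:20:0: SCSI device is removed",
    "Oct 16 13:53:45 ms-be1065 kernel: [11496951.813326] sd 0:0:20:0: [sdv] Attached SCSI disk",
    "Oct 16 13:57:36 ms-be1065 kernel: [11497182.241486] sd 0:0:20:0: SCSI device is removed",
]

for addr, seconds in attach_durations(SAMPLE):
    print(f"{addr} stayed attached for {seconds} s")
```

In these excerpts the new disk at `0:0:20:0` stays attached for only a few minutes each time before being removed again. The removal of `0:2:22:0` (the old virtual drive) is ignored because no matching attach appears in the sample.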

Rebooted drive and cleared cache. Added drive back in. Looks good to me now.

Host rebooted by mvernon@cumin2002 with reason: disks badly ordered