ManagementSSHDown
Closed, Resolved · Public

Assigned To

Authored By

	phaultfinder
	Sep 26 2024, 2:17 PM

Description

Common information

dashboard: TODO
runbook: https://wikitech.wikimedia.org/wiki/Management_Interfaces#Reset_the_management_card

alertname: ManagementSSHDown
job: probes/mgmt
module: ssh_banner
prometheus: ops
rack: E1
severity: task
site: eqiad
source: prometheus
team: dcops

Firing alerts

dashboard: TODO
description: The management interface at an-presto1006.mgmt:22 has been unresponsive for multiple hours.
runbook: https://wikitech.wikimedia.org/wiki/Management_Interfaces#Reset_the_management_card
summary: Unresponsive management for an-presto1006.mgmt:22
alertname: ManagementSSHDown
instance: an-presto1006.mgmt:22
job: probes/mgmt
module: ssh_banner
prometheus: ops
rack: E1
severity: task
site: eqiad
source: prometheus
team: dcops

dashboard: TODO
description: The management interface at backup1010.mgmt:22 has been unresponsive for multiple hours.
runbook: https://wikitech.wikimedia.org/wiki/Management_Interfaces#Reset_the_management_card
summary: Unresponsive management for backup1010.mgmt:22
alertname: ManagementSSHDown
instance: backup1010.mgmt:22
job: probes/mgmt
module: ssh_banner
prometheus: ops
rack: E1
severity: task
site: eqiad
source: prometheus
team: dcops

dashboard: TODO
description: The management interface at dse-k8s-worker1005.mgmt:22 has been unresponsive for multiple hours.
runbook: https://wikitech.wikimedia.org/wiki/Management_Interfaces#Reset_the_management_card
summary: Unresponsive management for dse-k8s-worker1005.mgmt:22
alertname: ManagementSSHDown
instance: dse-k8s-worker1005.mgmt:22
job: probes/mgmt
module: ssh_banner
prometheus: ops
rack: E1
severity: task
site: eqiad
source: prometheus
team: dcops

dashboard: TODO
description: The management interface at dumpsdata1006.mgmt:22 has been unresponsive for multiple hours.
runbook: https://wikitech.wikimedia.org/wiki/Management_Interfaces#Reset_the_management_card
summary: Unresponsive management for dumpsdata1006.mgmt:22
alertname: ManagementSSHDown
instance: dumpsdata1006.mgmt:22
job: probes/mgmt
module: ssh_banner
prometheus: ops
rack: E1
severity: task
site: eqiad
source: prometheus
team: dcops

dashboard: TODO
description: The management interface at elastic1090.mgmt:22 has been unresponsive for multiple hours.
runbook: https://wikitech.wikimedia.org/wiki/Management_Interfaces#Reset_the_management_card
summary: Unresponsive management for elastic1090.mgmt:22
alertname: ManagementSSHDown
instance: elastic1090.mgmt:22
job: probes/mgmt
module: ssh_banner
prometheus: ops
rack: E1
severity: task
site: eqiad
source: prometheus
team: dcops
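For a quick manual check of the instances listed above, here is a minimal sketch in the spirit of the `ssh_banner` probe. It is not the actual Prometheus probe configuration, which this task does not show. The function names `is_ssh_banner` and `probe_ssh_banner`, the 5-second timeout, and the assumption that the `.mgmt` names resolve from wherever the script runs are all illustrative.

```python
import socket

# Instance labels copied from the firing alerts in this task.
INSTANCES = [
    "an-presto1006.mgmt",
    "backup1010.mgmt",
    "dse-k8s-worker1005.mgmt",
    "dumpsdata1006.mgmt",
    "elastic1090.mgmt",
]


def is_ssh_banner(data: bytes) -> bool:
    """Return True if the data starts like an SSH identification string ("SSH-...")."""
    return data.startswith(b"SSH-")


def probe_ssh_banner(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Open a TCP connection and check whether the peer greets with an SSH banner.

    Any connection error or timeout is treated as "down" (assumed behaviour,
    chosen for this sketch).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            data = sock.recv(256)
    except OSError:
        return False
    return is_ssh_banner(data)


if __name__ == "__main__":
    for name in INSTANCES:
        state = "up" if probe_ssh_banner(name) else "DOWN"
        print(f"{name}:22 {state}")
```

If every instance in the same rack reports DOWN at once, that points at something shared (cabling or the switch) rather than at the individual management cards.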

Event Timeline

phaultfinder created this task. Sep 26 2024, 2:17 PM

Restricted Application added a project: DC-Ops. Sep 26 2024, 2:17 PM

Maintenance_bot added a project: SRE. Sep 26 2024, 2:29 PM

After troubleshooting the cables and seeing multiple issues with other servers, it was recommended to reboot the switch. I logged it and then proceeded with the reboot. It looks like this has cleared up the issue. Closing this now.

VRiley-WMF closed this task as Resolved. Sep 26 2024, 6:13 PM
