Page MenuHomePhabricator

ManagementSSHDown
Closed, ResolvedPublic

Description

Common information

  • alertname: ManagementSSHDown
  • job: probes/mgmt
  • module: ssh_banner
  • prometheus: ops
  • rack: E1
  • severity: task
  • site: eqiad
  • source: prometheus
  • team: dcops

Firing alerts


  • dashboard: TODO
  • description: The management interface at an-presto1006.mgmt:22 has been unresponsive for multiple hours.
  • runbook: https://wikitech.wikimedia.org/wiki/Management_Interfaces#Reset_the_management_card
  • summary: Unresponsive management for an-presto1006.mgmt:22
  • alertname: ManagementSSHDown
  • instance: an-presto1006.mgmt:22
  • job: probes/mgmt
  • module: ssh_banner
  • prometheus: ops
  • rack: E1
  • severity: task
  • site: eqiad
  • source: prometheus
  • team: dcops
  • Source

  • dashboard: TODO
  • description: The management interface at backup1010.mgmt:22 has been unresponsive for multiple hours.
  • runbook: https://wikitech.wikimedia.org/wiki/Management_Interfaces#Reset_the_management_card
  • summary: Unresponsive management for backup1010.mgmt:22
  • alertname: ManagementSSHDown
  • instance: backup1010.mgmt:22
  • job: probes/mgmt
  • module: ssh_banner
  • prometheus: ops
  • rack: E1
  • severity: task
  • site: eqiad
  • source: prometheus
  • team: dcops
  • Source

  • dashboard: TODO
  • description: The management interface at dse-k8s-worker1005.mgmt:22 has been unresponsive for multiple hours.
  • runbook: https://wikitech.wikimedia.org/wiki/Management_Interfaces#Reset_the_management_card
  • summary: Unresponsive management for dse-k8s-worker1005.mgmt:22
  • alertname: ManagementSSHDown
  • instance: dse-k8s-worker1005.mgmt:22
  • job: probes/mgmt
  • module: ssh_banner
  • prometheus: ops
  • rack: E1
  • severity: task
  • site: eqiad
  • source: prometheus
  • team: dcops
  • Source

  • dashboard: TODO
  • description: The management interface at dumpsdata1006.mgmt:22 has been unresponsive for multiple hours.
  • runbook: https://wikitech.wikimedia.org/wiki/Management_Interfaces#Reset_the_management_card
  • summary: Unresponsive management for dumpsdata1006.mgmt:22
  • alertname: ManagementSSHDown
  • instance: dumpsdata1006.mgmt:22
  • job: probes/mgmt
  • module: ssh_banner
  • prometheus: ops
  • rack: E1
  • severity: task
  • site: eqiad
  • source: prometheus
  • team: dcops
  • Source

  • dashboard: TODO
  • description: The management interface at elastic1090.mgmt:22 has been unresponsive for multiple hours.
  • runbook: https://wikitech.wikimedia.org/wiki/Management_Interfaces#Reset_the_management_card
  • summary: Unresponsive management for elastic1090.mgmt:22
  • alertname: ManagementSSHDown
  • instance: elastic1090.mgmt:22
  • job: probes/mgmt
  • module: ssh_banner
  • prometheus: ops
  • rack: E1
  • severity: task
  • site: eqiad
  • source: prometheus
  • team: dcops
  • Source

Event Timeline

After troubleshooting the cables and seeing multiple issues with other servers. It was recommended to reboot the switch. Logged it and then proceeded to reboot. It looks like this has cleard up the issue. Closing this now.