
mr1-eqsin performance issue
Open, Medium, Public

Description

mr1-eqsin has been flapping regularly in monitoring (and actually rebooting), SSH is very slow, and several core dumps have happened since January:

mr1-eqsin> show system core-dumps
-rw-rw----  1 root  wheel  123406399 Dec 11  2022 /var/crash/vmcore.0.gz
-rw-rw----  1 root  wheel  127015005 Jan 20 17:17 /var/crash/vmcore.1.gz
-rw-rw----  1 root  wheel  131292187 Mar 13 06:32 /var/crash/vmcore.2.gz
-rw-rw----  1 root  wheel  131527624 Mar 27 12:02 /var/crash/vmcore.3.gz
-rw-rw----  1 root  wheel  133546320 Apr 12 02:47 /var/crash/vmcore.4.gz

We should do one or more of:

  • Open a JTAC ticket
  • Upgrade to a more recent Junos version (20 to 21)
  • Do a clean reboot of the device (rough commands sketched below)
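For the record, the upgrade and clean reboot would be along these lines; the package filename is only a placeholder for whichever 21.x image we end up picking:

request system software add /var/tmp/junos-srxsme-21.4R3-S5.tgz no-copy
request system reboot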

Event Timeline

ayounsi created this task.
Restricted Application added a subscriber: Aklapper.

Opened JTAC 2024-0415-128563 and attached logs/RSI/coredump.

I have checked the logs and it looks like the issue we are facing with the slowness on the device and the reboots is a product of a brute-force SSH attack on the SRX.
The login attempts are creating processes on the SRX that sometimes don't close correctly or take longer to fully close. If enough of them stack up, it can cause the reboot.
To fix this we can set a firewall filter for the control plane of the SRX and use an allow list to limit the packets that actually reach the device.
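For reference, a rough sketch of the kind of control-plane filter being described, in set form. The filter name, prefix-list name and example prefix below are made up for illustration; a real filter would also need terms for the other traffic the RE must keep accepting (BGP, SNMP, NTP, etc.) plus an IPv6 equivalent:

set policy-options prefix-list mgmt-ssh-allow 203.0.113.0/28
set firewall family inet filter protect-re term ssh-allow from source-prefix-list mgmt-ssh-allow
set firewall family inet filter protect-re term ssh-allow from protocol tcp
set firewall family inet filter protect-re term ssh-allow from destination-port ssh
set firewall family inet filter protect-re term ssh-allow then accept
set firewall family inet filter protect-re term ssh-drop from protocol tcp
set firewall family inet filter protect-re term ssh-drop from destination-port ssh
set firewall family inet filter protect-re term ssh-drop then discard
set firewall family inet filter protect-re term accept-rest then accept
set interfaces lo0 unit 0 family inet filter input protect-re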

I asked if there were any other workarounds; otherwise T277438: Move management routers ssh port might be the best path forward.

> it looks like the issue we are facing with the slowness on the device and the reboots is a product of a brute-force SSH attack on the SRX

Yeah that's the reason for the high CPU on all our MRs afaik. Still unsure why it should totally crash/reboot.

> I asked if there were any other workarounds; otherwise T277438: Move management routers ssh port might be the best path forward.

Agreed, the port change is probably the lowest-hanging fruit.

FWIW I changed the key-exchange algo configured on mr1-eqsin to see if it would make any difference; from some brief searching the curve25519 one seems to use less CPU than the DH group-exchange one we have now. Thus far it hasn't made much impact, but I'll recheck the graphs in a day or two.

cmooney@mr1-eqsin# show | compare 
[edit system services ssh]
-    key-exchange group-exchange-sha2;
+    key-exchange curve25519-sha256;

> FWIW I changed the key-exchange algo configured on mr1-eqsin to see if it would make any difference

CPU shows roughly the same pattern since yesterday, so changing the kex algo hasn't helped. Reverted now.

One thing I noticed when doing a quick scan of the magru ranges for open ports is that all the MR public IPs are open to the world for SSH.

I haven't quite weighed this up in terms of emergency access etc. But one thing that occurs to me is that if we limited access on the non-OOB IPs to our bastions/trusted space, we'd reduce the number of IPs getting hit with junk SSH attempts, which should in turn reduce the CPU use?
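To quantify that junk before and after any change, the standard Junos CLI gives a rough view of who is hitting sshd and what it costs in CPU (nothing here is site-specific):

show log messages | match sshd | last 50
show system connections | match "\.22 "
show system processes extensive | match sshd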

Good idea, worth trying! If it's enough, it would be less of a pain than changing the SSH port.

> Good idea, worth trying! If it's enough, it would be less of a pain than changing the SSH port.

My guess is we probably need to change the port eventually, but this might help.

I’ll have a look into what’s required.

Change #1025279 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/homer/public@master] mr: only allow ssh from bast hosts on production side

https://gerrit.wikimedia.org/r/1025279

Change #1025279 merged by Ayounsi:

[operations/homer/public@master] mr: only allow ssh from bast hosts on production side

https://gerrit.wikimedia.org/r/1025279

cmooney lowered the priority of this task from High to Medium. · May 3 2024, 9:35 AM

I checked the router again today after the Junos upgrade and reboot; no core-dump file so far.

show system core-dumps no-forwarding 
/var/crash/*core*: No such file or directory
/var/tmp/*core*: No such file or directory
/var/tmp/pics/*core*: No such file or directory
/var/crash/kernel.*: No such file or directory
/var/jails/rest-api/tmp/*core*: No such file or directory
/tftpboot/corefiles/*core*: No such file or directory
/jail/var/tmp/*core*: No such file or directory

We will need to monitor it a bit more, as they seem to happen about once a month.