
mr1-eqsin performance issue
Open, Medium, Public

Description

mr1-eqsin has been flapping regularly in monitoring (and actually rebooting), SSH is very slow, and several core dumps have happened since January:

mr1-eqsin> show system core-dumps
-rw-rw----  1 root  wheel  123406399 Dec 11  2022 /var/crash/vmcore.0.gz
-rw-rw----  1 root  wheel  127015005 Jan 20 17:17 /var/crash/vmcore.1.gz
-rw-rw----  1 root  wheel  131292187 Mar 13 06:32 /var/crash/vmcore.2.gz
-rw-rw----  1 root  wheel  131527624 Mar 27 12:02 /var/crash/vmcore.3.gz
-rw-rw----  1 root  wheel  133546320 Apr 12 02:47 /var/crash/vmcore.4.gz

We should do one or more of:

  • Open a JTAC ticket
  • Upgrade to a more recent Junos version (20 to 21)
  • Do a clean reboot of the device (rough commands sketched below)
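For the record, the upgrade and clean reboot would be along these lines; the package filename is only a placeholder for whichever 21.x image we end up picking:

request system software add /var/tmp/junos-srxsme-21.4R3-S5.tgz no-copy
request system reboot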

Event Timeline

ayounsi created this task.
Restricted Application added a subscriber: Aklapper.

Opened JTAC 2024-0415-128563 and attached logs/RSI/coredump.

I have checked the logs and it looks like the issue we are facing with the slowness on the device and the reboots is a product of a brute-force SSH attack on the SRX.
The login attempts are creating processes on the SRX that sometimes don't close correctly or take longer to fully close. If enough of them stack up, it can cause the reboot.
To fix this we can set a firewall filter for the control plane of the SRX and use an allow list to limit the packets that actually reach the device.
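For reference, a rough sketch of the kind of control-plane filter being described, in set form. The filter name, prefix-list name and example prefix below are made up for illustration; a real filter would also need terms for the other traffic the RE must keep accepting (BGP, SNMP, NTP, etc.) plus an IPv6 equivalent:

set policy-options prefix-list mgmt-ssh-allow 203.0.113.0/28
set firewall family inet filter protect-re term ssh-allow from source-prefix-list mgmt-ssh-allow
set firewall family inet filter protect-re term ssh-allow from protocol tcp
set firewall family inet filter protect-re term ssh-allow from destination-port ssh
set firewall family inet filter protect-re term ssh-allow then accept
set firewall family inet filter protect-re term ssh-drop from protocol tcp
set firewall family inet filter protect-re term ssh-drop from destination-port ssh
set firewall family inet filter protect-re term ssh-drop then discard
set firewall family inet filter protect-re term accept-rest then accept
set interfaces lo0 unit 0 family inet filter input protect-re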

I asked if there were any other workarounds; otherwise T277438: Move management routers ssh port might be the best path forward.

> it looks like the issue we are facing with the slowness on the device and the reboots is a product of a brute-force SSH attack on the SRX

Yeah that's the reason for the high CPU on all our MRs afaik. Still unsure why it should totally crash/reboot.

> I asked if there were any other workarounds; otherwise T277438: Move management routers ssh port might be the best path forward.

Agreed, the port change is probably the lowest-hanging fruit.

FWIW I changed the key-exchange algo configured on mr1-eqsin to see if it would make any difference; from some brief searching the curve25519 one seems to use less CPU than the DH group-exchange one we have now. Thus far it hasn't made much impact, but I'll recheck the graphs in a day or two.

cmooney@mr1-eqsin# show | compare 
[edit system services ssh]
-    key-exchange group-exchange-sha2;
+    key-exchange curve25519-sha256;

> FWIW I changed the key-exchange algo configured on mr1-eqsin to see if it would make any difference

CPU shows roughly the same pattern since yesterday, so changing the kex algo hasn't helped. Reverted now.

One thing I noticed when doing a quick scan of the magru ranges for open ports is that all the MR public IPs are open to the world for SSH.

I haven't quite weighed this up in terms of emergency access etc. But one thing that occurs to me is that if we limited access on the non-OOB IPs to our bastions/trusted space, we'd reduce the number of IPs getting hit with junk SSH attempts, which should in turn reduce the CPU use?
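To quantify that junk before and after any change, the standard Junos CLI gives a rough view of who is hitting sshd and what it costs in CPU (nothing here is site-specific):

show log messages | match sshd | last 50
show system connections | match "\.22 "
show system processes extensive | match sshd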

Good idea, worth trying! If it's enough, it would be less of a pain than changing the SSH port.

> Good idea, worth trying! If it's enough, it would be less of a pain than changing the SSH port.

My guess is we probably need to change the port eventually, but this might help.

I’ll have a look into what’s required.

Change #1025279 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/homer/public@master] mr: only allow ssh from bast hosts on production side

https://gerrit.wikimedia.org/r/1025279

Change #1025279 merged by Ayounsi:

[operations/homer/public@master] mr: only allow ssh from bast hosts on production side

https://gerrit.wikimedia.org/r/1025279

cmooney lowered the priority of this task from High to Medium. · May 3 2024, 9:35 AM

I checked the router again today after the Junos upgrade and reboot; no core-dump file so far.

show system core-dumps no-forwarding 
/var/crash/*core*: No such file or directory
/var/tmp/*core*: No such file or directory
/var/tmp/pics/*core*: No such file or directory
/var/crash/kernel.*: No such file or directory
/var/jails/rest-api/tmp/*core*: No such file or directory
/tftpboot/corefiles/*core*: No such file or directory
/jail/var/tmp/*core*: No such file or directory

We will need to monitor it a bit more, as they seem to happen about once a month.