User Details
- User Since
- Apr 3 2017, 6:23 PM (397 w, 3 d)
- Availability
- Available
- IRC Nick
- xionox
- LDAP User
- Ayounsi
- MediaWiki User
- AYounsi (WMF) [ Global Accounts ]
Today
Because of the various limitations listed in {T342673} (plus the ones from pygnmi) we're not going to proceed any further on Dell SONiC, focusing on {T371088} now.
Because of the various limitations listed in T340045: Package pyGNMI and dictdiffer to be used by cookbooks we're not going to proceed any further on Dell SONiC, focusing on {T371088} now.
Thanks for dictdiffer. Because of a change in priorities and current limitations in pyGNMI, there is no longer any need to package it.
Going to close that task as we're not planning on using gNMI for automation any further, due to various shortcomings in the existing Python gNMI library. We're looking into JSON-RPC as an alternative; see T371088#10272661 for example.
Cool, nothing urgent. In that case, please let us know when you can which hosts you want to migrate (or which ones are not worth it), and we can then figure out a plan of attack.
Yesterday
Wed, Nov 13
Interesting idea, definitely worth a try. I'm particularly curious about how routing between VMs would work in that setup, and where to apply filtering. But not requiring multihop would be a plus.
Tue, Nov 12
Updated :)
Thu, Nov 7
If it's a bug on the switch it's probably worth opening a JTAC ticket. Even if it's not fixed in time for us, they could provide a workaround or fix it in the longer run.
Tue, Nov 5
Another point, after running the script, the changelog on a problematic interface shows 3 changes (for that interface) in the same transaction:
- "updated" Post-Change Data looks like what we want (disabled, no vlans, no mtu, cable still attached).
- "updated" that "reverts" the values we don't want to keep <- that's the odd one
- "delete" that removes the cable termination, as expected
Thu, Oct 31
@bking I think it's a question worth asking, but probably not in that task :) Could you open a dedicated one for the Procurement/DCops team?
Had a chat with Riccardo on IRC, here is the new list I came up with:
@Papaul 54, but that only included rows A and B. Now C and D are also eligible for a free 10G upgrade when available.
Wed, Oct 30
It doesn't, but once it's ready to receive traffic we need to:
1/ review then deploy to all the eqiad switches/routers https://gerrit.wikimedia.org/r/1084760
2/ Set the BGP flag on https://netbox.wikimedia.org/dcim/devices/121/ to True, then run Homer on lsw1-e1-eqiad
3/ Potentially a bit of fine tuning as it's the first time we would do (2) for LVS in eqiad
Actually, a 2nd look at https://wikitech.wikimedia.org/wiki/IP_and_AS_allocations shows that 14907:11 is a bit better. But it doesn't matter much ultimately.
Script deployed, I don't think it will be extremely useful, but let's see how it goes.
@cmooney @Papaul What do you think of:
- Keeping the new script presented previously for the "easy" use cases
- Introducing optional "Server interface" (and port speed) choices to the existing move server script https://netbox.wikimedia.org/extras/scripts/7/ to move individual servers while upgrading their NIC speed
Tue, Oct 29
Mon, Oct 28
If I understand correctly, this task is about upgrading iDRAC to be able to upgrade iDRAC or other firmware more easily in the future.
There might be some edge cases, but I think ideally we should disable the autoconf on all hosts as they're supposed to be statically configured.
Thu, Oct 24
If we don't want to use dummy interface names I think the simple way forward is Option 1, which seems like a big improvement, and we could investigate how to add the cables as a phase 2?
SGTM!
It's quite a big task overall; splitting it into several well-defined sub-tasks will make it easier to accomplish. For example, splitting the IP side from the vlan side.
Wed, Oct 23
Fri, Oct 18
We can work through those nodes as reimages (slowly), but it would be nice(r) if we could know all the new IPs up front and add them all to that set at once.
Nicely written plan !!
Thu, Oct 17
This now applies to rows C and D too, as the switches there got upgraded as well.
re1.cr1-eqiad> show system alarms
1 alarms currently active
Alarm time               Class  Description
2024-07-18 16:11:37 UTC  Minor  Backup RE Active
Perfect, thanks !
We will need to monitor it a bit more, as they seem to happen about once a month.
Oct 16 2024
Oct 14 2024
Closing this. Please re-open if it happens again.
Closing that task as the original goal has been reached.
Above patch tested on Netbox next and ready for review.
Oct 11 2024
FYI, I finally cleaned up the description field and removed the WikiKube tag in Netbox.
A few more reasons to upgrade in {T376986}.
Oct 10 2024
No objection to that. Seems like a good idea. In the short term we can delete the old account too.
Oct 9 2024
Phase 2 lgtm, one point though : you need to trunk the management vlan between the old and new switch for fasw to be reachable between steps 3 and 9.
Oct 8 2024
About phase 1. I checked the pfw1 config and steps here. Gave some feedback over IRC. Overall lgtm.
Oct 7 2024
I re-ran John's script:
Oct 3 2024
Let's use the latest recommended, so 23. Thx!
Oct 2 2024
No interface range as each switch will be independent.
Thinking out loud I'm wondering if we could/should add an ASN (multi-)object(s) custom field to prefixes.
The idea is to have something that not only works for k8s but would be generic enough for all parts of our infra.
Oct 1 2024
cr3-ulsfo> request vmhost snapshot ?
Possible completions:
  <[Enter]>            Execute this command
  config               Sychronise Configuration between the disks
  no-confirm           Do not ask for confirmation
  partition            Partition the target media
  recovery             Recover the primary media from snapshot
  |                    Pipe through a command
Thanks, all is good now !
Sep 30 2024
From JTAC :
You can periodically take a vmhost snapshot of the device to avoid losing configurations.
On the device, back up the snapshot of the host OS image along with the Junos OS image. In case of failure of the primary disk, you can boot from the image available in the backup disk and then recover the primary disk with the snapshot created using the recovery option.
https://www.juniper.net/documentation/us/en/software/junos/cli-reference/topics/ref/command/request-vmhost-snapshot.html
Sep 27 2024
I went ahead and replaced the Kubernetes tag with the role.
Sep 26 2024
"ssh: connect to host login.toolforge.org port 22: No route to host" is a red herring; SSH will show that whenever it just can't reach the end node.
A bit more info thanks to @aborrero on IRC.
It would be useful to capture more data (eg. packet capture) next time this happens. The ICMP no route to host packet contains more data, including which host actually sends it.
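To illustrate what such a capture could tell us, here is a minimal sketch that decodes one raw IPv4 packet carrying an ICMP Destination Unreachable message (type 3). It assumes the packet bytes start at the IPv4 header (no Ethernet framing) and that the original flow was TCP; the function name and the returned keys are hypothetical, picked only for this example.

```python
import ipaddress
import struct


def parse_icmp_unreachable(packet: bytes) -> dict:
    """Decode a raw IPv4 packet carrying an ICMP Destination Unreachable."""
    # Outer IPv4 header: the low nibble of byte 0 is the header length
    # in 32-bit words, byte 9 is the protocol number (1 = ICMP).
    outer_len = (packet[0] & 0x0F) * 4
    if packet[9] != 1:
        raise ValueError("not an ICMP packet")
    # The outer source address is the host that generated the error.
    reporter = ipaddress.IPv4Address(packet[12:16])

    icmp = packet[outer_len:]
    icmp_type, icmp_code = icmp[0], icmp[1]
    if icmp_type != 3:
        raise ValueError("not a Destination Unreachable message")

    # After the 8-byte ICMP header comes the IPv4 header of the packet
    # that triggered the error, plus at least 8 bytes of its payload.
    inner = icmp[8:]
    inner_len = (inner[0] & 0x0F) * 4
    result = {
        "reporter": str(reporter),
        "code": icmp_code,  # 1 means "host unreachable"
        "orig_src": str(ipaddress.IPv4Address(inner[12:16])),
        "orig_dst": str(ipaddress.IPv4Address(inner[16:20])),
    }
    # For TCP (protocol 6) the first 4 payload bytes are the ports.
    if inner[9] == 6 and len(inner) >= inner_len + 4:
        sport, dport = struct.unpack("!HH", inner[inner_len:inner_len + 4])
        result["orig_sport"] = sport
        result["orig_dport"] = dport
    return result
```

The "reporter" field is the interesting one here: it is the router or host that decided the destination was unreachable, which is not necessarily the SSH endpoint itself.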
Sep 25 2024
cr3-ulsfo> show system alarms
1 alarms currently active
Alarm time               Class  Description
2024-09-25 13:11:42 UTC  Minor  FPC 0 Minor Errors
Sep 24 2024
Closing, will re-open if the issue happens again and we need to RMA it.