SRE · Group
Active · Public

Recent Activity

Today

aborrero added a comment to T375847: openstack: initial IPv6 support in neutron.

I see the dhcp6 packets from my test VM arriving into neutron:

Fri, Sep 27, 11:21 AM · User-aborrero, cloud-services-team, Infrastructure-Foundations, SRE, netops
aborrero added a comment to T375847: openstack: initial IPv6 support in neutron.

We got DNS integration half working:

Fri, Sep 27, 11:12 AM · User-aborrero, cloud-services-team, Infrastructure-Foundations, SRE, netops
aborrero added a comment to T375847: openstack: initial IPv6 support in neutron.

The VM did not get the IPv6 assigned in the interface via dhcpv6 :-(

Fri, Sep 27, 11:11 AM · User-aborrero, cloud-services-team, Infrastructure-Foundations, SRE, netops
aborrero added a comment to T375847: openstack: initial IPv6 support in neutron.

neutron virtual router has the right IPv6 address:

Fri, Sep 27, 10:54 AM · User-aborrero, cloud-services-team, Infrastructure-Foundations, SRE, netops
Maintenance_bot removed a project from T375847: openstack: initial IPv6 support in neutron: Patch-For-Review.
Fri, Sep 27, 10:30 AM · User-aborrero, cloud-services-team, Infrastructure-Foundations, SRE, netops
aborrero added a comment to T375847: openstack: initial IPv6 support in neutron.

however, instance creation itself failed:

Fri, Sep 27, 10:14 AM · User-aborrero, cloud-services-team, Infrastructure-Foundations, SRE, netops
Stashbot added a comment to T375847: openstack: initial IPv6 support in neutron.

Mentioned in SAL (#wikimedia-cloud) [2024-09-27T10:04:30Z] <arturo> [codfw1dev] enable IPv6 on the neutron virtual router T375847

Fri, Sep 27, 10:05 AM · User-aborrero, cloud-services-team, Infrastructure-Foundations, SRE, netops
aborrero added a comment to T375847: openstack: initial IPv6 support in neutron.

new instance creation will allocate an IPv6 by default for a VM:

Fri, Sep 27, 10:04 AM · User-aborrero, cloud-services-team, Infrastructure-Foundations, SRE, netops
CodeReviewBot added a comment to T375847: openstack: initial IPv6 support in neutron.

aborrero merged https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/64

Fri, Sep 27, 9:58 AM · User-aborrero, cloud-services-team, Infrastructure-Foundations, SRE, netops
CodeReviewBot added a comment to T375847: openstack: initial IPv6 support in neutron.

aborrero opened https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/64

Fri, Sep 27, 9:55 AM · User-aborrero, cloud-services-team, Infrastructure-Foundations, SRE, netops
phaultfinder updated the task description for T375776: PDU sensor over limit.
Fri, Sep 27, 9:55 AM · SRE, DC-Ops, ops-eqiad
CodeReviewBot added a comment to T375847: openstack: initial IPv6 support in neutron.

aborrero merged https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/63

Fri, Sep 27, 9:52 AM · User-aborrero, cloud-services-team, Infrastructure-Foundations, SRE, netops
CodeReviewBot added a project to T375847: openstack: initial IPv6 support in neutron: Patch-For-Review.

aborrero opened https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/63

Fri, Sep 27, 8:53 AM · User-aborrero, cloud-services-team, Infrastructure-Foundations, SRE, netops
aborrero added a comment to T374712: netbox: create IPv6 entries for Cloud VPS.

@arturo I guess /64s for VM usage can be allocated from 2a02:ec80:a100::/56. We still need to decide how to set up the routing (and firewalling on cloudgw), and make the necessary adjustments on our edge (RIR DB entries, RPKI objects and route policies) to make it usable on the internet. But you can probably get going with seeing how Neutron can assign some of that to VMs and get it to the point of pinging between two VMs, etc.
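
As a rough illustration of the allocation arithmetic only (not the actual netbox or Neutron procedure), a /56 contains 2^(64-56) = 256 distinct /64 networks. A minimal Python sketch using the standard `ipaddress` module; the prefix is the one quoted in the comment, and the variable names `pool` and `subnets` are hypothetical:

```python
import ipaddress

# Prefix quoted in the comment; the names below are illustrative assumptions.
pool = ipaddress.ip_network("2a02:ec80:a100::/56")

# Enumerate every /64 that fits inside the /56.
subnets = list(pool.subnets(new_prefix=64))

print(len(subnets))   # 256
print(subnets[0])     # 2a02:ec80:a100::/64
print(subnets[-1])    # 2a02:ec80:a100:ff::/64
```

Which of these /64s would actually be handed to Neutron, and how they would be routed, is left open in the comment and is not decided by this sketch.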

Fri, Sep 27, 8:34 AM · User-aborrero, Infrastructure-Foundations, SRE, netops
aborrero updated the task description for T375847: openstack: initial IPv6 support in neutron.
Fri, Sep 27, 8:28 AM · User-aborrero, cloud-services-team, Infrastructure-Foundations, SRE, netops
aborrero updated the task description for T375847: openstack: initial IPv6 support in neutron.
Fri, Sep 27, 8:23 AM · User-aborrero, cloud-services-team, Infrastructure-Foundations, SRE, netops
aborrero changed the status of T375847: openstack: initial IPv6 support in neutron from Open to In Progress.
Fri, Sep 27, 8:22 AM · User-aborrero, cloud-services-team, Infrastructure-Foundations, SRE, netops
aborrero changed the status of T375847: openstack: initial IPv6 support in neutron, a subtask of T245495: CloudVPS: IPv6 in codfw1dev, from Open to In Progress.
Fri, Sep 27, 8:21 AM · User-aborrero, cloud-services-team, Infrastructure-Foundations, SRE, netops
aborrero created T375847: openstack: initial IPv6 support in neutron.
Fri, Sep 27, 8:21 AM · User-aborrero, cloud-services-team, Infrastructure-Foundations, SRE, netops
akosiaris added a comment to T256098: Segfault for systemd-sysusers.service on stat1007.

And we've just seen this on parsoidtest1001, which is on bullseye. The old host, scandium, is on buster.

Fri, Sep 27, 7:59 AM · Infrastructure-Foundations, SRE
Jelto closed T375837: PuppetFailure, a subtask of T375839: puppetserver[1001-1002,2001] crashed on 2024-09-27 00:00, as Resolved.
Fri, Sep 27, 7:41 AM · Infrastructure-Foundations, SRE, Puppet-Infrastructure
Jelto added a subtask for T375839: puppetserver[1001-1002,2001] crashed on 2024-09-27 00:00: T375837: PuppetFailure.
Fri, Sep 27, 7:40 AM · Infrastructure-Foundations, SRE, Puppet-Infrastructure
MoritzMuehlenhoff merged T375839: puppetserver[1001-1002,2001] crashed on 2024-09-27 00:00 into T373527: puppetserver1002 thrashing and requiring a power cycle as a result.
Fri, Sep 27, 6:56 AM · User-Elukey, Infrastructure-Foundations, SRE
MoritzMuehlenhoff merged task T375839: puppetserver[1001-1002,2001] crashed on 2024-09-27 00:00 into T373527: puppetserver1002 thrashing and requiring a power cycle as a result.
Fri, Sep 27, 6:56 AM · Infrastructure-Foundations, SRE, Puppet-Infrastructure
Vgutierrez triaged T375839: puppetserver[1001-1002,2001] crashed on 2024-09-27 00:00 as High priority.

after a powercycle puppetserver1001 is responsive again

Fri, Sep 27, 3:35 AM · Infrastructure-Foundations, SRE, Puppet-Infrastructure
Stashbot added a comment to T375839: puppetserver[1001-1002,2001] crashed on 2024-09-27 00:00.

Mentioned in SAL (#wikimedia-operations) [2024-09-27T03:26:31Z] <vgutierrez> powercycle puppetserver1001 - T375839

Fri, Sep 27, 3:26 AM · Infrastructure-Foundations, SRE, Puppet-Infrastructure
Vgutierrez created T375839: puppetserver[1001-1002,2001] crashed on 2024-09-27 00:00.
Fri, Sep 27, 3:22 AM · Infrastructure-Foundations, SRE, Puppet-Infrastructure

Yesterday

bd808 edited projects for T375795: With XWikimediaDebug enabled, wikitech.wikimedia.org gets redirected to foundation.wikimedia.org until Wikitech is on k8s, added: wikitech.wikimedia.org; removed Traffic.
Thu, Sep 26, 8:33 PM · wikitech.wikimedia.org, WikimediaDebug, SRE
bd808 added a comment to T375795: With XWikimediaDebug enabled, wikitech.wikimedia.org gets redirected to foundation.wikimedia.org until Wikitech is on k8s.

Interesting, good to know. This is fairly inconvenient though, especially since this is a permanent (301) redirect, meaning my browser is now confused even w/o WikimediaDebug. Can we at least use the 302 redirect in this case? Alternatively, getting a "not here" error could work as well.

Thu, Sep 26, 8:30 PM · wikitech.wikimedia.org, WikimediaDebug, SRE
Urbanecm_WMF added a comment to T375795: With XWikimediaDebug enabled, wikitech.wikimedia.org gets redirected to foundation.wikimedia.org until Wikitech is on k8s.

I don't know if there is a task for this yet, but it is known. [...]

Thu, Sep 26, 8:24 PM · wikitech.wikimedia.org, WikimediaDebug, SRE
bd808 added a comment to T375795: With XWikimediaDebug enabled, wikitech.wikimedia.org gets redirected to foundation.wikimedia.org until Wikitech is on k8s.

I don't know if there is a task for this yet, but it is known. The bug here is that we changed WikimediaDebug to support Wikitech once Wikitech is in the k8s cluster as part of T371537: MVP: Privately serve wikitech via mwdebug1001 (https://gerrit.wikimedia.org/r/c/performance/WikimediaDebug/+/1070275), but we have not yet moved Wikitech to the k8s cluster. This should be magically resolved by T292707: ☂ Migrate Wikitech to Kubernetes.

Thu, Sep 26, 8:21 PM · wikitech.wikimedia.org, WikimediaDebug, SRE
bd808 renamed T375795: With XWikimediaDebug enabled, wikitech.wikimedia.org gets redirected to foundation.wikimedia.org until Wikitech is on k8s from With XWikimediaDebug enabled, wikitech.wikimedia.org gets redirected to foundation.wikimedia.org to With XWikimediaDebug enabled, wikitech.wikimedia.org gets redirected to foundation.wikimedia.org until Wikitech is on k8s.
Thu, Sep 26, 8:20 PM · wikitech.wikimedia.org, WikimediaDebug, SRE
Urbanecm_WMF created T375795: With XWikimediaDebug enabled, wikitech.wikimedia.org gets redirected to foundation.wikimedia.org until Wikitech is on k8s.
Thu, Sep 26, 8:14 PM · wikitech.wikimedia.org, WikimediaDebug, SRE
RobH added a comment to T373993: CPU temperature issues in cp hosts.

Opened ticket CS1011077 for the above updated google doc draft.

Thu, Sep 26, 7:15 PM · SRE, ops-esams, ops-magru, DC-Ops, Traffic
Stashbot added a comment to T375766: cp2037 hardware issues: A fatal error was detected on a component at bus 174 device 0 function 0.

Mentioned in SAL (#wikimedia-operations) [2024-09-26T18:23:51Z] <sukhe> repooling cp2037; downtimed removed for some time, looks good to repool: T375766

Thu, Sep 26, 6:23 PM · SRE, DC-Ops, Traffic, ops-codfw
VRiley-WMF closed T374897: ManagementSSHDown - elastic1089 as Resolved.

After troubleshooting this, we had to reboot the E1 management switch. This issue should be cleared up.

Thu, Sep 26, 6:15 PM · SRE, DC-Ops, ops-eqiad
VRiley-WMF claimed T374897: ManagementSSHDown - elastic1089.
Thu, Sep 26, 6:14 PM · SRE, DC-Ops, ops-eqiad
VRiley-WMF closed T375758: ManagementSSHDown as Resolved.
Thu, Sep 26, 6:13 PM · SRE, DC-Ops, ops-eqiad
VRiley-WMF added a comment to T375758: ManagementSSHDown.

After troubleshooting the cables and seeing multiple issues with other servers, it was recommended to reboot the switch. Logged it and then proceeded to reboot. It looks like this has cleared up the issue. Closing this now.

Thu, Sep 26, 6:13 PM · SRE, DC-Ops, ops-eqiad
VRiley-WMF claimed T375758: ManagementSSHDown.
Thu, Sep 26, 6:12 PM · SRE, DC-Ops, ops-eqiad
Maintenance_bot added a project to T375785: PowerSupplyFailure: SRE.
Thu, Sep 26, 5:29 PM · SRE, ops-codfw, DC-Ops
fnegri added a comment to T371486: Hide the value of gb_address column in public replicas if gb_autoblock_parent_id is not null.

However, I checked and the globalblocks table in the labswiki db is empty

Thu, Sep 26, 5:12 PM · cloud-services-team (FY2024/2025-Q1-Q2), SRE, Trust and Safety Product Sprint (Sprint Beatboxing (Sept 16-27)), Temporary accounts (Blockers to minor pilot wiki deployment), Data-Engineering, Data-Services, Trust and Safety Product Team
Stashbot added a comment to T375766: cp2037 hardware issues: A fatal error was detected on a component at bus 174 device 0 function 0.

Mentioned in SAL (#wikimedia-operations) [2024-09-26T17:03:29Z] <sukhe> removing downtime on cp2037 but still keeping it depooled: T375766

Thu, Sep 26, 5:03 PM · SRE, DC-Ops, Traffic, ops-codfw
Jhancock.wm added a comment to T375766: cp2037 hardware issues: A fatal error was detected on a component at bus 174 device 0 function 0.

firmware updated and event log cleared.

Thu, Sep 26, 5:02 PM · SRE, DC-Ops, Traffic, ops-codfw
RobH added a comment to T375345: cr3-ulsfo incident 22 Sep 2024.

Inbound shipment ticket 00980858 for UPS 1Z20506Y0100053206 (already delivered today and got the shipment notice last night).

Thu, Sep 26, 5:00 PM · DC-Ops, ops-ulsfo, Infrastructure-Foundations, netops, SRE
fnegri closed T371486: Hide the value of gb_address column in public replicas if gb_autoblock_parent_id is not null as Resolved.

This task is Resolved, as the view change has been applied to all wiki replicas hosts (an-redacteddb1001 and clouddb10[13-20]*).

Thu, Sep 26, 4:50 PM · cloud-services-team (FY2024/2025-Q1-Q2), SRE, Trust and Safety Product Sprint (Sprint Beatboxing (Sept 16-27)), Temporary accounts (Blockers to minor pilot wiki deployment), Data-Engineering, Data-Services, Trust and Safety Product Team
fnegri moved T371486: Hide the value of gb_address column in public replicas if gb_autoblock_parent_id is not null from Backlog to Done on the cloud-services-team (FY2024/2025-Q1-Q2) board.
Thu, Sep 26, 4:47 PM · cloud-services-team (FY2024/2025-Q1-Q2), SRE, Trust and Safety Product Sprint (Sprint Beatboxing (Sept 16-27)), Temporary accounts (Blockers to minor pilot wiki deployment), Data-Engineering, Data-Services, Trust and Safety Product Team
RobH added a comment to T373993: CPU temperature issues in cp hosts.

Draft body of support request for magru temp investigation: https://docs.google.com/document/d/1T-XwSS_Rwfb9nfC1aHQW4AjptLjxiviyZfGFdFcowZY/edit?usp=sharing

Thu, Sep 26, 4:43 PM · SRE, ops-esams, ops-magru, DC-Ops, Traffic
fnegri closed T375760: update-views cookbook doesn't handle filters correctly, a subtask of T371486: Hide the value of gb_address column in public replicas if gb_autoblock_parent_id is not null, as Resolved.
Thu, Sep 26, 4:34 PM · cloud-services-team (FY2024/2025-Q1-Q2), SRE, Trust and Safety Product Sprint (Sprint Beatboxing (Sept 16-27)), Temporary accounts (Blockers to minor pilot wiki deployment), Data-Engineering, Data-Services, Trust and Safety Product Team
ops-monitoring-bot added a comment to T371486: Hide the value of gb_address column in public replicas if gb_autoblock_parent_id is not null.

Cookbook cookbooks.sre.wikireplicas.update-views started by fnegri executed with errors:

  • an-redacteddb1001.eqiad.wmnet (FAIL)
    • Ran Puppet agent
    • The maintain-views run failed, see OUTPUT of 'maintain-views ...' above for details
Thu, Sep 26, 4:34 PM · cloud-services-team (FY2024/2025-Q1-Q2), SRE, Trust and Safety Product Sprint (Sprint Beatboxing (Sept 16-27)), Temporary accounts (Blockers to minor pilot wiki deployment), Data-Engineering, Data-Services, Trust and Safety Product Team