jcrespo (Jaime Crespo)
Sr Database Administrator


User Details

User Since
May 11 2015, 8:31 AM (496 w, 5 h)
Availability
Available
IRC Nick
jynus
LDAP User
Jcrespo
MediaWiki User
JCrespo (WMF)

Recent Activity

Today

jcrespo added a comment to T379519: Restore commons-l subscribers removed due to fat finger "remove all members".

It seems to me that not all members were deleted: while I can see some people subscribing manually, only around 300 are missing at the moment.

Mon, Nov 11, 1:09 PM · User-revi, Wikimedia-Mailing-lists, SRE
jcrespo added a comment to T379519: Restore commons-l subscribers removed due to fat finger "remove all members".

I was able to pin down the undesired mass deletion starting at:

# at 906744201
#241111  6:21:17 server id 171966607  end_log_pos 906744243 CRC32 0xeb3133f4    GTID 171966607-171966607-71484404 trans
Mon, Nov 11, 11:49 AM · User-revi, Wikimedia-Mailing-lists, SRE
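
(For context, a minimal sketch of how such a deletion window can be located with mysqlbinlog; the datetimes, binlog file name, and grep pattern below are illustrative placeholders, not the actual values from this incident.)

# Decode row events around the suspected window and look for the deletes.
mysqlbinlog --base64-output=decode-rows --verbose \
  --start-datetime='2024-11-11 06:15:00' \
  --stop-datetime='2024-11-11 06:30:00' \
  /srv/sqldata/server-bin.000123 | grep -B 5 'DELETE FROM'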
jcrespo added a comment to T379519: Restore commons-l subscribers removed due to fat finger "remove all members".

Hey @revi, no worries. Could you confirm a timestamp at which the member list would still have been good? 2024-11-11 06:20:00? Or a little earlier, at 06:00:00?

Mon, Nov 11, 10:31 AM · User-revi, Wikimedia-Mailing-lists, SRE

Wed, Nov 6

jcrespo added a comment to T371416: Q1:rack/setup/install backup1012.
Wed, Nov 6, 9:22 PM · SRE, Data-Persistence-Backup, Data-Persistence, ops-eqiad, DC-Ops

Mon, Nov 4

jcrespo added a comment to T378628: db2190 is not coming back online.

Was db2190 taken care of data-wise and repooled? Not super worried, and it isn't super urgent, but we should track it somewhere and make sure it doesn't fall through the cracks of depooled hosts pending a return to production.

Mon, Nov 4, 10:35 AM · SRE, ops-codfw, DC-Ops, DBA

Thu, Oct 31

jcrespo added a comment to T378628: db2190 is not coming back online.

[attached image: image.png (111×2 px, 33 KB)]

Thu, Oct 31, 3:04 PM · SRE, ops-codfw, DC-Ops, DBA

Wed, Oct 30

jcrespo added a comment to T371416: Q1:rack/setup/install backup1012.

@wiki_willy I am going to remove backup1010 and backup2010 from bacula and use them for mediabackups instead. This will solve my immediate needs, but please note I am only delaying the inevitable: those hosts were for the general bacula expansion. We can delay that a bit more, as the growth there was much smaller (and we will impact another team instead of mine :-S).

Wed, Oct 30, 9:58 AM · SRE, Data-Persistence-Backup, Data-Persistence, ops-eqiad, DC-Ops

Tue, Oct 29

jcrespo added a comment to T377451: zarcillo: Take inventory of all active use cases.

It wouldn't tell us much about where the queries are coming from.

Tue, Oct 29, 10:57 PM · Data-Persistence-SRE, DBA
jcrespo added a comment to T378143: Q2:rack/setup/install es104[1-6].

T262388 is the bug, but I couldn't fix it because I couldn't reproduce it at the time. I highly recommend double-checking the data after the copy to make sure nothing was lost (counting tables and comparing file sizes is a fast way to do it). transfer does a checksum, but one never knows; other errors could be missed.

Tue, Oct 29, 10:49 PM · Data-Persistence-SRE, DBA, Patch-For-Review, SRE, Data-Persistence, ops-eqiad, DC-Ops
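
(A minimal sketch of the post-copy verification suggested above; the schema name and datadir path are illustrative.)

# Run on both source and destination and compare the results.
sudo mysql --batch --execute \
  "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema = 'wikidb'"
# Compare the on-disk size of the copied data directory on both hosts.
du -s --block-size=1M /srv/sqldata/wikidb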
jcrespo added a comment to T371416: Q1:rack/setup/install backup1012.

I can wait a few extra days for that, but I need either certainty on dates or to know that a new testing period is ahead of us that could take weeks (?). In the latter case, I would prefer to use what we have now to solve the backup storage, and do the testing elsewhere, under less stressful constraints. We can migrate or do whatever later, but I really need to start working on this now and to know those files won't be lost in a few weeks; migrating files on these hosts can take weeks. I would take any kind of performance penalty in exchange for certainty and stability.

Tue, Oct 29, 9:26 PM · SRE, Data-Persistence-Backup, Data-Persistence, ops-eqiad, DC-Ops
jcrespo added a comment to T378068: pc1017 crashed.

Discover what happened to pc1017, fix it, and prevent it from happening again.

Tue, Oct 29, 1:47 PM · Data-Persistence-SRE, DBA
jcrespo added a comment to T371416: Q1:rack/setup/install backup1012.

This is not great for me. I need these hosts by 2024-09-08, as documented at T368926, and I am running out of space. If I start resharding now, we will likely lose all data on them when the RAID solution changes.

Tue, Oct 29, 11:17 AM · SRE, Data-Persistence-Backup, Data-Persistence, ops-eqiad, DC-Ops

Mon, Oct 28

jcrespo added a comment to T378076: Parsercache issues in codfw causing large-scale outage.

I linked the above task, T378385: Spike in JobQueue job backlog time (500ms -> 4-8 minutes), as mitigations may have slowed down jobqueue processing. We may need to have a look: while it was almost surely a consequence of the reduced concurrency, it could show some interesting signal too, as the bottleneck/increased load could potentially have started before the incident.

Mon, Oct 28, 5:21 PM · SRE, DBA, Wikimedia-production-error
jcrespo added a comment to T378385: Spike in JobQueue job backlog time (500ms -> 4-8 minutes).

Potentially related to T378076, as concurrency was lowered to mitigate ongoing issues.

Mon, Oct 28, 5:16 PM · FlaggedRevs, serviceops, WMF-JobQueue
jcrespo added a comment to T371984: Q1:rack/setup/install backup2012.

No need to be sorry. I was a bit pushy about it because we are in a bit of a hurry due to the parent ticket, as we are soon running out of space for media backups (6% left).

Mon, Oct 28, 10:49 AM · SRE, Data-Persistence, Data-Persistence-Backup, ops-codfw, DC-Ops

Thu, Oct 24

jcrespo added a comment to T378068: pc1017 crashed.

pc1014 has been set up as the host covering for pc5 in eqiad, instead of pc1017. dbctl and puppet may need further adjustments to configure the candidates and mark them in puppet and dbctl, but otherwise, and with some 10% impact on cache misses, the servers seem to be working as intended.

Thu, Oct 24, 10:49 AM · Data-Persistence-SRE, DBA
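
(For reference, repooling such a host would go through dbctl; this is a sketch assuming the standard dbctl subcommands documented on Wikitech, with an illustrative weight and commit message.)

# Pool pc1014 and commit the change; the weight value is a placeholder.
dbctl instance pc1014 pool -p 100
dbctl config commit -m "Pool pc1014 to cover pc5 in eqiad T378068"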

Wed, Oct 23

jcrespo added a comment to T377853: RAID monitoring on new hardware spec requires new or updated user space cli tool.

perccli and storecli are not exactly the same either; the existing script fails with:

Failed to execute ['/usr/local/lib/nagios/plugins/get-raid-status-perccli']: KeyError 'BBU_Info'
Wed, Oct 23, 8:55 AM · DC-Ops, SRE-tools, observability, Puppet, Infrastructure-Foundations
jcrespo added a comment to T377853: RAID monitoring on new hardware spec requires new or updated user space cli tool.

Nope, megacli doesn't work. That's the option I tried first, before going down this rabbit hole. 0:-)

Wed, Oct 23, 8:51 AM · DC-Ops, SRE-tools, observability, Puppet, Infrastructure-Foundations
jcrespo added a comment to T377451: zarcillo: Take inventory of all active use cases.

Note it is possible to use --filter on pt-query-digest. However, if usage is what is needed, enabling performance_schema stats and querying by database may just be easier.

Wed, Oct 23, 7:12 AM · Data-Persistence-SRE, DBA
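
(A sketch of the performance_schema approach mentioned above, assuming performance_schema is enabled; the table and column names are the stock MySQL/MariaDB ones.)

# Rough per-schema read/write activity accumulated since server start.
sudo mysql --batch --execute "
  SELECT object_schema,
         SUM(count_read)  AS reads,
         SUM(count_write) AS writes
  FROM performance_schema.table_io_waits_summary_by_table
  GROUP BY object_schema
  ORDER BY SUM(count_read) + SUM(count_write) DESC;"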

Tue, Oct 22

jcrespo added a comment to T371984: Q1:rack/setup/install backup2012.

How far along is this in the queue? The original need-by date was 2024-09-08, and this could help debug issues with T377853.

Tue, Oct 22, 3:53 PM · SRE, Data-Persistence, Data-Persistence-Backup, ops-codfw, DC-Ops
jcrespo added a comment to T377853: RAID monitoring on new hardware spec requires new or updated user space cli tool.

perccli and storecli are not exactly the same either; the existing script fails with:

Failed to execute ['/usr/local/lib/nagios/plugins/get-raid-status-perccli']: KeyError 'BBU_Info'
Tue, Oct 22, 3:37 PM · DC-Ops, SRE-tools, observability, Puppet, Infrastructure-Foundations
jcrespo added a comment to T377820: Increase the frequency of backups of the analytics_meta MariaDB instance.

Let's talk. I don't disagree with more frequent backups, but this is the first time I have heard about a data incident, and with point-in-time recovery (binlogs) it is possible to recover from any exact point in time within the past 3 months, not just the weekly backup time. So I would like to discuss what the real needs are, rather than start from a specific decision already made.

Tue, Oct 22, 3:15 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29), Data-Persistence-Backup, Data-Persistence
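
(A minimal sketch of the point-in-time recovery flow described above; the file names, start position, and stop datetime are placeholders.)

# 1. Restore the most recent weekly logical backup.
mysql < /srv/backups/latest/analytics_meta.sql
# 2. Replay binlogs from the backup's position up to just before the incident.
mysqlbinlog --start-position=1234 \
  --stop-datetime='2024-10-20 10:00:00' \
  /srv/backups/binlogs/server-bin.000042 | mysql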
jcrespo added a comment to T377853: RAID monitoring on new hardware spec requires new or updated user space cli tool.

After testing, storecli seems to work on older hosts from a different vendor, so either something is broken with this host in particular (I would need a second new host with a RAID controller to check) or we should upgrade the cli to use this other tool, which seems to work for both vendors.

Tue, Oct 22, 3:11 PM · DC-Ops, SRE-tools, observability, Puppet, Infrastructure-Foundations
jcrespo added a comment to T371416: Q1:rack/setup/install backup1012.

I filed T377853 with a possible fix.

Tue, Oct 22, 3:04 PM · SRE, Data-Persistence-Backup, Data-Persistence, ops-eqiad, DC-Ops
jcrespo updated the task description for T377853: RAID monitoring on new hardware spec requires new or updated user space cli tool.
Tue, Oct 22, 2:46 PM · DC-Ops, SRE-tools, observability, Puppet, Infrastructure-Foundations
jcrespo created T377853: RAID monitoring on new hardware spec requires new or updated user space cli tool.
Tue, Oct 22, 2:42 PM · DC-Ops, SRE-tools, observability, Puppet, Infrastructure-Foundations
jcrespo added a comment to T371416: Q1:rack/setup/install backup1012.

Thanks, I can start provisioning it. However, there seems to be an issue with the disk monitoring; our puppet installation seems to be identifying the RAID controller:

Tue, Oct 22, 1:22 PM · SRE, Data-Persistence-Backup, Data-Persistence, ops-eqiad, DC-Ops

Mon, Oct 21

jcrespo added a comment to T377718: db2205 and db2227 need to be recloned from 10.6.17.
apt1002:/home/marostegui$ ls -R
Mon, Oct 21, 5:07 PM · Data-Persistence-SRE, DBA
jcrespo closed T375144: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication as Resolved.

Done: https://wikitech.wikimedia.org/wiki/Incidents/2024-09-18_replication

Mon, Oct 21, 1:35 PM · SRE-OnFire, Sustainability, Data-Persistence-SRE, DBA
jcrespo added a comment to T375144: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication.

Deployment went well; I will update the incident doc with the long-term fix and then call this resolved.

Mon, Oct 21, 12:52 PM · SRE-OnFire, Sustainability, Data-Persistence-SRE, DBA

Fri, Oct 18

jcrespo updated the task description for T376921: Upgrade dbprov hosts to Debian Bookworm.
Fri, Oct 18, 3:58 PM · Patch-For-Review, database-backups, Data-Persistence-Backup
jcrespo updated subscribers of T375144: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication.

Let's carefully merge https://gerrit.wikimedia.org/r/1081103 early next week (so we can monitor that it doesn't affect production hosts). CC @Ladsgroup @ABran-WMF

Fri, Oct 18, 10:11 AM · SRE-OnFire, Sustainability, Data-Persistence-SRE, DBA
jcrespo closed T374933: Add section alias for databases in the test-s1 and test-s4 sections as Resolved.

This is resolved in the sense that "this is deployed", but feel free to reopen if it causes any issue or if you want to discuss the removal.

Fri, Oct 18, 10:03 AM · DBA
jcrespo added a comment to T371351: Automate the pre/post switchover tasks related to databases.

As soon as the other patches are merged, this is done for me IMHO. Core section ordering can be discussed afterwards.

Fri, Oct 18, 10:00 AM · Data-Persistence-SRE, DBA, Datacenter-Switchover
jcrespo closed T374933: Add section alias for databases in the test-s1 and test-s4 sections, a subtask of T371351: Automate the pre/post switchover tasks related to databases, as Resolved.
Fri, Oct 18, 10:00 AM · Data-Persistence-SRE, DBA, Datacenter-Switchover

Thu, Oct 17

jcrespo added a comment to T371351: Automate the pre/post switchover tasks related to databases.

Indeed it is. Let me use blame to see when this happened. This is good news, because we finally know WHY this happened only recently; the sad part is that we may have to partially discard this patch.

Nice catch! I don't mind discarding the patch if we solved the problem :)

Thu, Oct 17, 3:20 PM · Data-Persistence-SRE, DBA, Datacenter-Switchover
jcrespo added a comment to T371351: Automate the pre/post switchover tasks related to databases.

Indeed it is. Let me use blame to see when this happened. This is good news, because we finally know WHY this happened only recently; the sad part is that we may have to partially discard this patch.

Nice catch! I don't mind discarding the patch if we solved the problem :)

Thu, Oct 17, 2:34 PM · Data-Persistence-SRE, DBA, Datacenter-Switchover
jcrespo added a comment to T375144: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication.

The (potential) change that caused it was: https://gerrit.wikimedia.org/r/c/operations/puppet/+/693162

Thu, Oct 17, 1:24 PM · SRE-OnFire, Sustainability, Data-Persistence-SRE, DBA
jcrespo added a comment to T371351: Automate the pre/post switchover tasks related to databases.

Indeed it is. Let me use blame to see when this happened. This is good news, because we finally know WHY this happened only recently; the sad part is that we may have to partially discard this patch.

Thu, Oct 17, 10:42 AM · Data-Persistence-SRE, DBA, Datacenter-Switchover
jcrespo added a comment to T371351: Automate the pre/post switchover tasks related to databases.

I cannot reproduce the bad scenario at the moment, because this is the binlog from, e.g., x1:

# at 1199094
#241017  9:53:45 server id 180360966  end_log_pos 1199136 CRC32 0x45c248cc      GTID 180360966-180360966-115215619 trans
/*!100001 SET @@session.gtid_seq_no=115215619*//*!*/;
START TRANSACTION
/*!*/;
# at 1199136
# at 1199400
#241017  9:53:45 server id 180360966  end_log_pos 1199400 CRC32 0xd37d7cf7      Annotate_rows:
#Q> REPLACE INTO `heartbeat`.`heartbeat` (ts, server_id, file, position, relay_master_log_file, exec_master_log_pos, shard, datacenter) VALUES ('2024-10-17T09:53:45.001510', '180360966', 'db2196-bin.003776', '1199094', NULL, NULL, 'x1', 'codfw')
#241017  9:53:45 server id 180360966  end_log_pos 1199474 CRC32 0xe050a53c      Table_map: `heartbeat`.`heartbeat` mapped to number 23
# at 1199474
#241017  9:53:45 server id 180360966  end_log_pos 1199642 CRC32 0x14ae218b      Update_rows: table id 23 flags: STMT_END_F
### UPDATE `heartbeat`.`heartbeat`
### WHERE
###   @1='2024-10-17T09:53:44.001370' /* VARSTRING(26) meta=26 nullable=0 is_null=0 */
###   @2=180360966 /* INT meta=0 nullable=0 is_null=0 */
###   @3='db2196-bin.003775' /* VARSTRING(255) meta=255 nullable=1 is_null=0 */
###   @4=1048427332 /* LONGINT meta=0 nullable=1 is_null=0 */
###   @5=NULL /* VARSTRING(255) meta=255 nullable=1 is_null=1 */
###   @6=NULL /* LONGINT meta=0 nullable=1 is_null=1 */
###   @7='x1' /* VARSTRING(10) meta=10 nullable=1 is_null=0 */
###   @8='codfw' /* STRING(5) meta=65029 nullable=1 is_null=0 */
### SET
###   @1='2024-10-17T09:53:45.001510' /* VARSTRING(26) meta=26 nullable=0 is_null=0 */
###   @2=180360966 /* INT meta=0 nullable=0 is_null=0 */
###   @3='db2196-bin.003776' /* VARSTRING(255) meta=255 nullable=1 is_null=0 */
###   @4=1199094 /* LONGINT meta=0 nullable=1 is_null=0 */
###   @5=NULL /* VARSTRING(255) meta=255 nullable=1 is_null=1 */
###   @6=NULL /* LONGINT meta=0 nullable=1 is_null=1 */
###   @7='x1' /* VARSTRING(10) meta=10 nullable=1 is_null=0 */
###   @8='codfw' /* STRING(5) meta=65029 nullable=1 is_null=0 */
# Number of rows: 1
# at 1199642
#241017  9:53:45 server id 180360966  end_log_pos 1199673 CRC32 0xc6f0d777      Xid = 3891994318
COMMIT/*!*/;
# at 1199673
Thu, Oct 17, 10:40 AM · Data-Persistence-SRE, DBA, Datacenter-Switchover
jcrespo added a comment to T371351: Automate the pre/post switchover tasks related to databases.

There is a minor usability issue (which could be confusing under pressure); I get this text:

==> Run on section test-s4 was manually aborted. Continue with the remaining sections or abort completely?

However, if it is the last or the only section, it doesn't make much sense, as either choice would basically do the same thing. Maybe just change the wording when there are no more sections left, even if you want to keep the pause?

Thu, Oct 17, 10:09 AM · Data-Persistence-SRE, DBA, Datacenter-Switchover

Wed, Oct 16

jcrespo added a comment to T375144: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication.

Thanks Riccardo, as I said on IRC it looked like a minor issue, so I wasn't too worried, and it was. You are very fast at this, so big ❤ to you. Will continue testing, probably tomorrow morning.

Wed, Oct 16, 6:30 PM · SRE-OnFire, Sustainability, Data-Persistence-SRE, DBA
jcrespo added a comment to T375144: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication.

I need to research the line 255 change more:

self._validate_slave_status(f"MASTER_TO {self.master_to.host}", status, expected)
Wed, Oct 16, 2:32 PM · SRE-OnFire, Sustainability, Data-Persistence-SRE, DBA
jcrespo added a comment to T375144: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication.
DRY-RUN: MASTER_TO db2230.codfw.wmnet Ignoring MASTER STATUS is not stable in DRY-RUN
DRY-RUN: [test-s4] Binlog format is STATEMENT. No heartbeat corrective action needed.
DRY-RUN: [test-s4] MASTER_FROM db1125.eqiad.wmnet CHANGE MASTER to ReplicationInfo(primary='db2230.codfw.wmnet', binlog='db2230-bin.000005', position=201650676, port=3306)
DRY-RUN: MASTER_FROM db1125.eqiad.wmnet CHANGE MASTER to ReplicationInfo(primary='db2230.codfw.wmnet', binlog='db2230-bin.000005', position=201650676, port=3306) and user repl2024
DRY-RUN: Executing commands ['/usr/local/bin/mysql --socket /run/mysqld/mysqld.sock --batch --execute "START SLAVE"'] on 1 hosts: db1125.eqiad.wmnet
DRY-RUN: MASTER_FROM db1125.eqiad.wmnet START SLAVE
DRY-RUN: MASTER_FROM db1125.eqiad.wmnet skipping replication from MASTER_TO db2230.codfw.wmnet verification
DRY-RUN: Executing commands ['/bin/systemctl start pt-heartbeat-wikimedia.service'] on 1 hosts: db2230.codfw.wmnet
DRY-RUN: MASTER_TO db2230.codfw.wmnet started pt-heartbeat.
DRY-RUN: Executing commands ['/usr/local/bin/mysql --socket /run/mysqld/mysqld.sock --batch --execute "START SLAVE"'] on 1 hosts: db2230.codfw.wmnet
DRY-RUN: MASTER_TO db2230.codfw.wmnet START SLAVE.
DRY-RUN: Executing commands ['/usr/local/bin/mysql --socket /run/mysqld/mysqld.sock --batch --execute "SHOW SLAVE STATUS\\G"'] on 1 hosts: db2230.codfw.wmnet
DRY-RUN: [test-s4] MASTER_TO db2230.codfw.wmnet checking SLAVE STATUS Master_Host=db1125.eqiad.wmnet
DRY-RUN: [test-s4] MASTER_TO db2230.codfw.wmnet checking SLAVE STATUS Master_User=repl2024
DRY-RUN: [test-s4] MASTER_TO db2230.codfw.wmnet checking SLAVE STATUS Master_Port=3306
DRY-RUN: [test-s4] MASTER_TO db2230.codfw.wmnet checking SLAVE STATUS Slave_IO_Running=Yes
DRY-RUN: [test-s4] MASTER_TO db2230.codfw.wmnet checking SLAVE STATUS Slave_SQL_Running=Yes
DRY-RUN: [test-s4] MASTER_TO db2230.codfw.wmnet checking SLAVE STATUS Last_IO_Errno=0
DRY-RUN: [test-s4] MASTER_TO db2230.codfw.wmnet checking SLAVE STATUS Last_SQL_Errno=0
DRY-RUN: MASTER_TO db2230.codfw.wmnet replication from MASTER_FROM db1125.eqiad.wmnet verified
DRY-RUN: Executing commands ['/usr/local/bin/mysql --socket /run/mysqld/mysqld.sock --batch --execute "SHOW SLAVE STATUS\\G"'] on 1 hosts: db1125.eqiad.wmnet
DRY-RUN: Failed to run cookbooks.sre.switchdc.databases.prepare.PrepareSection.master_from_check_replication: SHOW SLAVE STATUS seems to have been executed on a master.
DRY-RUN: Traceback
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/wmflib/interactive.py", line 183, in confirm_on_failure
    ret = func(*args, **kwargs)
  File "/home/jynus/cookbooks_testing/cookbooks/cookbooks/sre/switchdc/databases/prepare.py", line 331, in master_from_check_replication
    status = self.master_from.show_slave_status()
  File "/usr/lib/python3/dist-packages/spicerack/mysql_legacy.py", line 201, in show_slave_status
    raise MysqlLegacyError(f"{sql} seems to have been executed on a master.")
spicerack.mysql_legacy.MysqlLegacyError: SHOW SLAVE STATUS seems to have been executed on a master.
==> What do you want to do? "retry" the last command, manually fix the issue and "skip" the last command to continue the execution or completely "abort" the execution.
>
Wed, Oct 16, 2:08 PM · SRE-OnFire, Sustainability, Data-Persistence-SRE, DBA
jcrespo added a comment to T374933: Add section alias for databases in the test-s1 and test-s4 sections.

I am going to execute this now because it is making the fix for T375144 harder to test. It can be removed afterwards, but I need this now.

Wed, Oct 16, 1:22 PM · DBA
jcrespo added a comment to T375652: Wikimedia\Rdbms\DBQueryError: Error 1062: Duplicate entry '1' for key 'PRIMARY' Function: MediaWiki\CheckUser\Services\CheckUserLogService::addLogEntry.

@Ladsgroup Did you add the backup sources into your checks? If not, could you do it (assuming it won't take much effort)? I would like to make sure backups are generated fine on both DCs before archiving them for the other ticket.

Wed, Oct 16, 10:40 AM · DBA, Trust and Safety Product Team, CheckUser, Wikimedia-production-error
jcrespo added a comment to T377276: db2205 is stuck at "Shutdown in progress".

We should also mark it as failed (?) on netbox/dbctl, in case another SRE runs into issues, so it is not pooled by accident until this ticket or its parent is resolved.

Wed, Oct 16, 10:37 AM · DBA
jcrespo added a comment to T377276: db2205 is stuck at "Shutdown in progress".

Whatever you do, can I suggest removing it as master/master candidate and choosing a different host for that?

Wed, Oct 16, 10:22 AM · DBA

Tue, Oct 15

jcrespo added a comment to T321808: Port all Icinga checks to Prometheus/Alertmanager.

I would like to send a question to observability (Observability-Alerting) about how to approach a very simple check that currently lives in icinga. I am not too worried about the individual check, as I know how to fix it technologically; it is more a question of how to do this kind of check in general.

Tue, Oct 15, 5:53 PM · SRE Observability (FY2024/2025-Q2), Observability-Alerting
jcrespo added a comment to T377164: Switchover s3 master (db2205 -> db2209).

db2205 had gotten stuck previously due to gtid/replication at T374425. It is either the newer mariadb version plus the large number of objects, or hardware; I would check the latter first.

Tue, Oct 15, 4:27 PM · DBA
jcrespo added a comment to T374425: db2205 stuck replication/processlist.

It happened again at T377164.

Tue, Oct 15, 4:24 PM · DBA
jcrespo added a comment to T376627: Drop ad-hoc or obsolete tables in production.

I will generate a tarball from today's logical export. Will ping here when done.

Tue, Oct 15, 11:38 AM · DBA

Mon, Oct 14

jcrespo added a comment to T376926: frwiki trashing for 20241001 dump.

Not sure if it is related, but I am seeing lots of:

PHP Notice: fwrite(): write of 137 bytes failed with errno=32 Broken pipe
Mon, Oct 14, 4:52 PM · Data-Platform, Dumps-Generation
jcrespo added a comment to T375789: Replace on-wiki raw HTML overrides for "MediaWiki:Copyright" etc. and set $wgAllowRawHtmlCopyrightMessages = false.

For some reason, the script that checks legal conformity of the footer failed with:

Mon, Oct 14, 9:55 AM · User-notice-archive, WMF-General-or-Unknown, MediaWiki-Platform-Team, SUL3

Oct 11 2024

jcrespo added a comment to T376988: db2175 replication breakage with: Error 'Index for table 'recentchanges' is corrupt; try to repair it' on query. Default database: 'nlwiki'..

Please add it there. I am happy to help with the mysql side, but I am not familiar with the DBAs' day-to-day procedures.

Oct 11 2024, 11:05 AM · Data-Persistence, Data-Persistence-SRE, DBA
jcrespo added a comment to T376988: db2175 replication breakage with: Error 'Index for table 'recentchanges' is corrupt; try to repair it' on query. Default database: 'nlwiki'..

No need to close the ticket, but I found something really weird: there is no megacli cmd. Do you know what the binary is for manual checking on newer hosts? Also, learning cycles are disabled on those hosts too, right?

Oct 11 2024, 10:58 AM · Data-Persistence, Data-Persistence-SRE, DBA
jcrespo reassigned T376988: db2175 replication breakage with: Error 'Index for table 'recentchanges' is corrupt; try to repair it' on query. Default database: 'nlwiki'. from jcrespo to ABran-WMF.

The repair finished in 10 seconds, faster than I expected, and without issues (table & index rebuilt successfully, and I restarted replication).

Oct 11 2024, 10:54 AM · Data-Persistence, Data-Persistence-SRE, DBA
jcrespo claimed T376988: db2175 replication breakage with: Error 'Index for table 'recentchanges' is corrupt; try to repair it' on query. Default database: 'nlwiki'..
Oct 11 2024, 10:47 AM · Data-Persistence, Data-Persistence-SRE, DBA
jcrespo added a comment to T376988: db2175 replication breakage with: Error 'Index for table 'recentchanges' is corrupt; try to repair it' on query. Default database: 'nlwiki'..

Started on a screen on db2175:

mysql:root@localhost [nlwiki]> ALTER TABLE recentchanges ENGINE=InnoDB, FORCE;
Oct 11 2024, 10:46 AM · Data-Persistence, Data-Persistence-SRE, DBA
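
(A sketch of the surrounding steps, assuming the standard replication commands; only the ALTER above is the author's actual command, and replication may already have been stopped by the error itself.)

sudo mysql nlwiki --execute "STOP SLAVE;"
sudo mysql nlwiki --execute "ALTER TABLE recentchanges ENGINE=InnoDB, FORCE;"  # rebuilds the table and its indexes
sudo mysql nlwiki --execute "START SLAVE;"
sudo mysql nlwiki --execute "SHOW SLAVE STATUS\G"  # confirm replication resumed cleanly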
jcrespo created T376988: db2175 replication breakage with: Error 'Index for table 'recentchanges' is corrupt; try to repair it' on query. Default database: 'nlwiki'..
Oct 11 2024, 10:43 AM · Data-Persistence, Data-Persistence-SRE, DBA

Oct 10 2024

jcrespo added a parent task for T362509: Setup new dbprov hosts and decommission the old ones: T376921: Upgrade dbprov hosts to Debian Bookworm.
Oct 10 2024, 6:12 PM · Patch-For-Review, database-backups, Data-Persistence-Backup
jcrespo added a subtask for T376921: Upgrade dbprov hosts to Debian Bookworm: T362509: Setup new dbprov hosts and decommission the old ones.
Oct 10 2024, 6:12 PM · Patch-For-Review, database-backups, Data-Persistence-Backup
jcrespo triaged T376921: Upgrade dbprov hosts to Debian Bookworm as High priority.
Oct 10 2024, 3:59 PM · Patch-For-Review, database-backups, Data-Persistence-Backup
jcrespo created T376921: Upgrade dbprov hosts to Debian Bookworm.
Oct 10 2024, 3:59 PM · Patch-For-Review, database-backups, Data-Persistence-Backup
jcrespo merged T291454: Clean up backup metadata into T349360: Clean up dbbackups.backup_files table.
Oct 10 2024, 3:56 PM · database-backups, Data-Persistence-Backup
jcrespo merged task T291454: Clean up backup metadata into T349360: Clean up dbbackups.backup_files table.
Oct 10 2024, 3:55 PM · Data-Persistence-Backup
jcrespo renamed T366092: Upgrade eqiad mediabackups database hosts to Debian Bookworm from Upgrade eqiad mediabackups hosts to Debian Bookworm to Upgrade eqiad mediabackups database hosts to Debian Bookworm.
Oct 10 2024, 3:51 PM · Data-Persistence-Backup
jcrespo added a parent task for T366092: Upgrade eqiad mediabackups database hosts to Debian Bookworm: T376916: Upgrade backup hosts to Debian Bookworm 12.X.
Oct 10 2024, 3:51 PM · Data-Persistence-Backup
jcrespo added a subtask for T376916: Upgrade backup hosts to Debian Bookworm 12.X: T366092: Upgrade eqiad mediabackups database hosts to Debian Bookworm.
Oct 10 2024, 3:51 PM · Epic, Data-Persistence-Backup
jcrespo moved T366092: Upgrade eqiad mediabackups database hosts to Debian Bookworm from Triage to Ready on the Data-Persistence-Backup board.
Oct 10 2024, 3:51 PM · Data-Persistence-Backup
jcrespo moved T376892: Expand media backup storage available space to 960 TB per datacenter from Triage to Blocked on the Data-Persistence-Backup board.
Oct 10 2024, 3:50 PM · Patch-For-Review, media-backups, Data-Persistence-Backup, SRE
jcrespo moved T262668: WMF media storage must be adequately backed up from Blocked to Ready on the Data-Persistence-Backup board.
Oct 10 2024, 3:50 PM · media-backups, Data-Persistence-Backup, Epic, Goal, SRE, SRE-swift-storage
jcrespo claimed T362509: Setup new dbprov hosts and decommission the old ones.
Oct 10 2024, 3:50 PM · Patch-For-Review, database-backups, Data-Persistence-Backup
jcrespo changed the status of T362509: Setup new dbprov hosts and decommission the old ones from Stalled to In Progress.
Oct 10 2024, 3:50 PM · Patch-For-Review, database-backups, Data-Persistence-Backup
jcrespo moved T362509: Setup new dbprov hosts and decommission the old ones from Triage to In Progress on the Data-Persistence-Backup board.
Oct 10 2024, 3:49 PM · Patch-For-Review, database-backups, Data-Persistence-Backup
jcrespo moved T283017: Create a dashboard for database backups monitoring/reporting from In Progress to Ready on the Data-Persistence-Backup board.
Oct 10 2024, 3:49 PM · dbbackups-dashboard, Patch-For-Review, Goal, database-backups, Data-Persistence-Backup
jcrespo moved T376916: Upgrade backup hosts to Debian Bookworm 12.X from Triage to In Progress on the Data-Persistence-Backup board.
Oct 10 2024, 3:48 PM · Epic, Data-Persistence-Backup
jcrespo added a project to T376916: Upgrade backup hosts to Debian Bookworm 12.X: Epic.
Oct 10 2024, 3:48 PM · Epic, Data-Persistence-Backup
jcrespo triaged T376916: Upgrade backup hosts to Debian Bookworm 12.X as High priority.
Oct 10 2024, 3:47 PM · Epic, Data-Persistence-Backup
jcrespo created T376916: Upgrade backup hosts to Debian Bookworm 12.X.
Oct 10 2024, 3:47 PM · Epic, Data-Persistence-Backup
jcrespo added a comment to T375144: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication.

Yes, the REPLACE is not the issue; it is ROW format that translates it into UPDATE or DELETE + INSERT, and those would cause the same issues when applied in the wrong case (but they do the right thing if the row was to be inserted after a select). We want to keep doing REPLACE: even if we did INSERT IGNORE, it wouldn't fix things for replicas. The issue is the ROW behavior, not the query itself.

Oct 10 2024, 2:07 PM · SRE-OnFire, Sustainability, Data-Persistence-SRE, DBA
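
(To illustrate the ROW behavior described above: a sketch against a scratch table, not the production heartbeat setup; the binlog path is a placeholder.)

# With binlog_format=ROW, a REPLACE over an existing row is logged as an
# Update_rows (or Delete_rows + Write_rows) event, not as the REPLACE text.
sudo mysql --execute "
  CREATE TABLE test.t (id INT PRIMARY KEY, v INT);
  INSERT INTO test.t VALUES (1, 1);
  REPLACE INTO test.t VALUES (1, 2);"
# Decoding the binlog shows the row events that replicas will actually apply.
mysqlbinlog --base64-output=decode-rows --verbose /srv/sqldata/server-bin.000001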
jcrespo added a comment to T375144: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication.

So this is my request for you @Volans, this is the best thing I think we can do now:

Oct 10 2024, 1:26 PM · SRE-OnFire, Sustainability, Data-Persistence-SRE, DBA
jcrespo updated subscribers of T376775: Disk (sdv) failed on ms-be1065.
Oct 10 2024, 12:49 PM · SRE-swift-storage, SRE, ops-eqiad, DC-Ops
jcrespo updated the task description for T376892: Expand media backup storage available space to 960 TB per datacenter.
Oct 10 2024, 12:26 PM · Patch-For-Review, media-backups, Data-Persistence-Backup, SRE
jcrespo added a parent task for T371984: Q1:rack/setup/install backup2012: T376892: Expand media backup storage available space to 960 TB per datacenter.
Oct 10 2024, 12:17 PM · SRE, Data-Persistence, Data-Persistence-Backup, ops-codfw, DC-Ops
jcrespo added a subtask for T376892: Expand media backup storage available space to 960 TB per datacenter: T371984: Q1:rack/setup/install backup2012.
Oct 10 2024, 12:17 PM · Patch-For-Review, media-backups, Data-Persistence-Backup, SRE
jcrespo added a parent task for T371416: Q1:rack/setup/install backup1012: T376892: Expand media backup storage available space to 960 TB per datacenter.
Oct 10 2024, 12:17 PM · SRE, Data-Persistence-Backup, Data-Persistence, ops-eqiad, DC-Ops
jcrespo added a subtask for T376892: Expand media backup storage available space to 960 TB per datacenter: T371416: Q1:rack/setup/install backup1012.
Oct 10 2024, 12:17 PM · Patch-For-Review, media-backups, Data-Persistence-Backup, SRE
jcrespo updated the task description for T376892: Expand media backup storage available space to 960 TB per datacenter.
Oct 10 2024, 12:17 PM · Patch-For-Review, media-backups, Data-Persistence-Backup, SRE
jcrespo triaged T376892: Expand media backup storage available space to 960 TB per datacenter as High priority.
Oct 10 2024, 12:12 PM · Patch-For-Review, media-backups, Data-Persistence-Backup, SRE
jcrespo created T376892: Expand media backup storage available space to 960 TB per datacenter.
Oct 10 2024, 12:12 PM · Patch-For-Review, media-backups, Data-Persistence-Backup, SRE
jcrespo added a comment to T376775: Disk (sdv) failed on ms-be1065.

DC ops: would you have a spare 8 TB disk for this host? It seems to be out of warranty.

Oct 10 2024, 12:07 PM · SRE-swift-storage, SRE, ops-eqiad, DC-Ops

Oct 9 2024

jcrespo added a comment to T375652: Wikimedia\Rdbms\DBQueryError: Error 1062: Duplicate entry '1' for key 'PRIMARY' Function: MediaWiki\CheckUser\Services\CheckUserLogService::addLogEntry.

Thanks. I don't have access to the pastes:

Oct 9 2024, 2:16 PM · DBA, Trust and Safety Product Team, CheckUser, Wikimedia-production-error
jcrespo added a member for SRE: Ladsgroup.
Oct 9 2024, 2:16 PM
jcrespo added a comment to T375652: Wikimedia\Rdbms\DBQueryError: Error 1062: Duplicate entry '1' for key 'PRIMARY' Function: MediaWiki\CheckUser\Services\CheckUserLogService::addLogEntry.

No concerns. Look, if you can, for any pattern common to the issues, both those corrected at T375507#10175248 and those pending at T375507#10175251.

Oct 9 2024, 1:32 PM · DBA, Trust and Safety Product Team, CheckUser, Wikimedia-production-error
jcrespo added a comment to T371416: Q1:rack/setup/install backup1012.

@elukey thanks, that is all I needed: some info on where we were. I wasn't aware of the ongoing progress during my sabbatical, which is understandable. No worries. :-)

Oct 9 2024, 10:32 AM · SRE, Data-Persistence-Backup, Data-Persistence, ops-eqiad, DC-Ops

Oct 7 2024

jcrespo added a comment to T371416: Q1:rack/setup/install backup1012.

Any ETA for this and the codfw equivalent, DC-ops? I know there may be some delays due to the vendor peculiarities, but my "Need by date" was 2024-09-08 and I haven't seen any update. Thank you (this is not an emergency, but we are soon running out of space on a shard of media backups).

Oct 7 2024, 3:05 PM · SRE, Data-Persistence-Backup, Data-Persistence, ops-eqiad, DC-Ops

Oct 4 2024

jcrespo added a comment to T376387: repool pc1013.

Feel free to merge it into T375382; the only reason I had reopened it is that repooling/bringing it back to normal operation was still missing.

Oct 4 2024, 11:43 AM · DBA, Data-Persistence-SRE

Oct 2 2024

jcrespo awarded T376249: Wikimedia\Rdbms\DBUnexpectedError: Database servers in cluster30 are overloaded. In order to protect application servers, the circuit breaking to databases of this section have been activated. Please try again a few seconds. a Love token.
Oct 2 2024, 12:39 PM · Wikimedia-database-issue, DBA, wikitech.wikimedia.org, Wikimedia-production-error
jcrespo moved T375144: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication from Backlog to Pending Review & Scorecard on the SRE-OnFire board.
Oct 2 2024, 11:44 AM · SRE-OnFire, Sustainability, Data-Persistence-SRE, DBA