User Details
- User Since
- May 11 2015, 8:31 AM (496 w, 5 h)
- Availability
- Available
- IRC Nick
- jynus
- LDAP User
- Jcrespo
- MediaWiki User
JCrespo (WMF)
Today
It seems to me that not all members were deleted; while I can see some people subscribing manually, only around 300 are missing at the moment.
I was able to pin down the start of the undesired mass deletion to:
# at 906744201
#241111 6:21:17 server id 171966607 end_log_pos 906744243 CRC32 0xeb3133f4 GTID 171966607-171966607-71484404 trans
Hey, @revi, no worries. Could you confirm a timestamp at which the member list would still be good? 2024-11-11 06:20:00? Or a little earlier, at 06:00:00?
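For reference, something like this is how the events from that position onwards can be reviewed before deciding on a recovery point (a sketch; the binlog file name, path and stop time below are placeholders):
# Decode the row events starting at the position quoted above, to see exactly which rows were deleted.
mysqlbinlog --base64-output=decode-rows -vv \
    --start-position=906744201 --stop-datetime='2024-11-11 06:30:00' \
    /srv/sqldata/db1234-bin.000123 | less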
Wed, Nov 6
Mon, Nov 4
Was db2190 taken care of, data-wise, and repooled? Not super worried or super urgent, but I want to track it somewhere and make sure it doesn't fall through the cracks of depooled hosts pending a return to production.
Thu, Oct 31
Wed, Oct 30
@wiki_willy I am going to remove backup1010 and backup2010 from bacula and use them for mediabackups instead. This will solve my immediate needs, but please note I am only delaying the inevitable: those hosts were for the general bacula expansion, and we can delay that a bit more, as the growth there was much smaller (and we will impact another team instead of mine :-S).
Tue, Oct 29
That wouldn't tell us much about where the queries are coming from.
T262388 is the bug, but I couldn't fix it because I couldn't reproduce it at the time. I highly recommend double-checking the data after the copy to make sure nothing was lost (counting tables and comparing file sizes is a fast way to do it). transfer does a checksum, but one never knows, and other errors could be missed.
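As a rough illustration of the kind of post-copy check I mean (a sketch; run it on both source and destination and compare the output, adjusting the socket and datadir paths to the actual instance):
# Table count per schema (compare between the two hosts).
mysql --socket /run/mysqld/mysqld.sock -BN -e "SELECT table_schema, COUNT(*) FROM information_schema.tables GROUP BY table_schema"
# On-disk size of the copied data directory (compare between the two hosts).
du -s --block-size=1 /srv/sqldata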
I can wait a few extra days for that, but I need certainty on dates, or to know that a new testing period that could take weeks is ahead of us. In the latter case, I would prefer to use what we have now to solve the backup storage, and do the testing elsewhere, under less stressful constraints. Later we can migrate or do whatever, but I really need to start working on it now and know those files won't be lost in a few weeks. Migrating files on these hosts can take weeks. I would rather take any kind of performance penalty in exchange for certainty and stability.
Discover what happened to pc1017 / fix it / prevent it from happening again
This is not great for me. I need these hosts by 2024-09-08, as documented at T368926, and I am running out of space. If I start resharding now, we will likely lose all the data on them due to the RAID solution changing.
Mon, Oct 28
I linked the above task, T378385: Spike in JobQueue job backlog time (500ms -> 4-8 minutes), as mitigations may have slowed down jobqueue processing. We may need to have a look: while it was almost surely a consequence of the reduced concurrency, it could show some interesting signal too, as the bottleneck/increased load could potentially have started before the incident.
Potentially related to T378076, as concurrency was lowered to mitigate ongoing issues.
No need to be sorry. I was a bit pushy about it because we are in a bit of a hurry due to the parent ticket, as we are soon running out of space for media backups (6% left).
Thu, Oct 24
pc1014 has been set up as the host covering for pc5 in eqiad, instead of pc1017. dbctl and puppet may need further adjustments to decide how to configure the candidates and mark them in puppet and dbctl, but otherwise, and with some 10% impact on cache misses, the servers seem to be working as intended.
Wed, Oct 23
Nope, megacli doesn't work. That's the option I tried first, before going down this rabbit hole. 0:-)
Note it is possible to use --filter with pt-query-digest. However, if usage is what is needed, enabling performance_schema stats and querying by database may just be easier.
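For example (a sketch; the database name and slow-log path are placeholders):
# Restrict pt-query-digest to queries against a single database.
pt-query-digest --filter '$event->{db} && $event->{db} eq "enwiki"' /srv/tmp/slow.log
# Or, with performance_schema enabled, aggregate statement stats per database.
mysql -e "SELECT SCHEMA_NAME, SUM(COUNT_STAR) AS queries, SUM(SUM_TIMER_WAIT)/1e12 AS total_seconds
          FROM performance_schema.events_statements_summary_by_digest
          GROUP BY SCHEMA_NAME ORDER BY queries DESC"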
Tue, Oct 22
How far along is this in the queue? The original need-by date was 2024-09-08, and this could help debug issues with T377853.
perccli and storcli are not exactly the same either; the existing script fails with:
Failed to execute ['/usr/local/lib/nagios/plugins/get-raid-status-perccli']: KeyError 'BBU_Info'
Let's talk. I don't disagree with more frequent backups, but this is the first time I hear about a data incident, and with point-in-time recovery (binlogs) it is possible to recover from any exact point in time in the past 3 months, not just the weekly backup time. So I would like to discuss what the real needs are, rather than starting from an already-made specific decision.
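To illustrate what I mean by point-in-time recovery (a hedged sketch, not our exact procedure; binlog names, position and cutoff timestamp below are placeholders): restore the closest snapshot, then replay the binlogs only up to the moment just before the incident.
# 1. Restore the latest weekly snapshot/dump onto a recovery instance (not shown here).
# 2. Replay the binlogs from the snapshot's recorded coordinates up to just before the incident.
#    --start-position comes from the snapshot metadata; 4 here is only a placeholder.
mysqlbinlog --start-position=4 --stop-datetime='2024-10-20 12:00:00' \
    db1234-bin.000120 db1234-bin.000121 | mysql --socket /run/mysqld/mysqld.sock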
After testing, storcli seems to work on older hosts from a different vendor, so either something is broken with this host in particular (I would need a second new host with a RAID controller to check) or we should update the CLI script to use this other tool, which seems to work for both vendors.
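For manual checking, a hedged sketch of what I mean (the controller index and the JSON layout may differ per model/firmware, hence the jq default to guard against the missing key that broke the plugin):
# Ask either CLI for BBU info as JSON and fall back gracefully if the section is absent.
sudo perccli64 /c0/bbu show all J | jq '.Controllers[0]["Response Data"] // "no BBU data reported"'
sudo storcli64 /c0/bbu show all J | jq '.Controllers[0]["Response Data"] // "no BBU data reported"'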
I filed T377853 with a possible fix.
Thanks, I can start provisioning it; however, there seems to be an issue with the disk monitoring. Our puppet installation seems to be identifying the RAID controller:
Mon, Oct 21
apt1002:/home/marostegui$ ls -R
Deployment went well, I will update the incident doc with the long-term fix and then call this resolved.
Fri, Oct 18
Let's carefully merge https://gerrit.wikimedia.org/r/1081103 early next week (so we can monitor that it doesn't affect production hosts). CC @Ladsgroup @ABran-WMF
This is resolved as long as "this is deployed", but feel free to reopen if it causes any issue or you want to discuss the removal.
As soon as the other patches are merged, this is done for me IMHO. Core section ordering can be discussed afterwards.
Thu, Oct 17
The (potential) change that caused it was: https://gerrit.wikimedia.org/r/c/operations/puppet/+/693162
Indeed it is. Let me use blame to see when this happened. This is good news, because we finally know WHY this happened only recently; the sad part is that we may have to partially discard this patch.
I cannot reproduce the bad scenario at the moment, because this is the binlog from, e.g., x1:
# at 1199094
#241017 9:53:45 server id 180360966 end_log_pos 1199136 CRC32 0x45c248cc GTID 180360966-180360966-115215619 trans
/*!100001 SET @@session.gtid_seq_no=115215619*//*!*/;
START TRANSACTION
/*!*/;
# at 1199136
# at 1199400
#241017 9:53:45 server id 180360966 end_log_pos 1199400 CRC32 0xd37d7cf7 Annotate_rows:
#Q> REPLACE INTO `heartbeat`.`heartbeat` (ts, server_id, file, position, relay_master_log_file, exec_master_log_pos, shard, datacenter) VALUES ('2024-10-17T09:53:45.001510', '180360966', 'db2196-bin.003776', '1199094', NULL, NULL, 'x1', 'codfw')
#241017 9:53:45 server id 180360966 end_log_pos 1199474 CRC32 0xe050a53c Table_map: `heartbeat`.`heartbeat` mapped to number 23
# at 1199474
#241017 9:53:45 server id 180360966 end_log_pos 1199642 CRC32 0x14ae218b Update_rows: table id 23 flags: STMT_END_F
### UPDATE `heartbeat`.`heartbeat`
### WHERE
###   @1='2024-10-17T09:53:44.001370' /* VARSTRING(26) meta=26 nullable=0 is_null=0 */
###   @2=180360966 /* INT meta=0 nullable=0 is_null=0 */
###   @3='db2196-bin.003775' /* VARSTRING(255) meta=255 nullable=1 is_null=0 */
###   @4=1048427332 /* LONGINT meta=0 nullable=1 is_null=0 */
###   @5=NULL /* VARSTRING(255) meta=255 nullable=1 is_null=1 */
###   @6=NULL /* LONGINT meta=0 nullable=1 is_null=1 */
###   @7='x1' /* VARSTRING(10) meta=10 nullable=1 is_null=0 */
###   @8='codfw' /* STRING(5) meta=65029 nullable=1 is_null=0 */
### SET
###   @1='2024-10-17T09:53:45.001510' /* VARSTRING(26) meta=26 nullable=0 is_null=0 */
###   @2=180360966 /* INT meta=0 nullable=0 is_null=0 */
###   @3='db2196-bin.003776' /* VARSTRING(255) meta=255 nullable=1 is_null=0 */
###   @4=1199094 /* LONGINT meta=0 nullable=1 is_null=0 */
###   @5=NULL /* VARSTRING(255) meta=255 nullable=1 is_null=1 */
###   @6=NULL /* LONGINT meta=0 nullable=1 is_null=1 */
###   @7='x1' /* VARSTRING(10) meta=10 nullable=1 is_null=0 */
###   @8='codfw' /* STRING(5) meta=65029 nullable=1 is_null=0 */
# Number of rows: 1
# at 1199642
#241017 9:53:45 server id 180360966 end_log_pos 1199673 CRC32 0xc6f0d777 Xid = 3891994318
COMMIT/*!*/;
# at 1199673
There is a minor usability issue (which could be confusing under pressure); I get this text:
==> Run on section test-s4 was manually aborted. Continue with the remaining sections or abort completely?
However, if it is the last or the only section, it doesn't make much sense, as it would do the same thing either way. Maybe just change the wording when there are no sections left, even if you want to keep the pause?
Wed, Oct 16
Thanks Riccardo, as I said on IRC it looked like a minor issue, so I wasn't too worried, and it was. You are very fast at this, so big ❤ to you. Will continue testing, probably tomorrow morning.
I need to research the line 255 change more:
self._validate_slave_status(f"MASTER_TO {self.master_to.host}", status, expected)
DRY-RUN: MASTER_TO db2230.codfw.wmnet Ignoring MASTER STATUS is not stable in DRY-RUN
DRY-RUN: [test-s4] Binlog format is STATEMENT. No heartbeat corrective action needed.
DRY-RUN: [test-s4] MASTER_FROM db1125.eqiad.wmnet CHANGE MASTER to ReplicationInfo(primary='db2230.codfw.wmnet', binlog='db2230-bin.000005', position=201650676, port=3306)
DRY-RUN: MASTER_FROM db1125.eqiad.wmnet CHANGE MASTER to ReplicationInfo(primary='db2230.codfw.wmnet', binlog='db2230-bin.000005', position=201650676, port=3306) and user repl2024
DRY-RUN: Executing commands ['/usr/local/bin/mysql --socket /run/mysqld/mysqld.sock --batch --execute "START SLAVE"'] on 1 hosts: db1125.eqiad.wmnet
DRY-RUN: MASTER_FROM db1125.eqiad.wmnet START SLAVE
DRY-RUN: MASTER_FROM db1125.eqiad.wmnet skipping replication from MASTER_TO db2230.codfw.wmnet verification
DRY-RUN: Executing commands ['/bin/systemctl start pt-heartbeat-wikimedia.service'] on 1 hosts: db2230.codfw.wmnet
DRY-RUN: MASTER_TO db2230.codfw.wmnet started pt-heartbeat.
DRY-RUN: Executing commands ['/usr/local/bin/mysql --socket /run/mysqld/mysqld.sock --batch --execute "START SLAVE"'] on 1 hosts: db2230.codfw.wmnet
DRY-RUN: MASTER_TO db2230.codfw.wmnet START SLAVE.
DRY-RUN: Executing commands ['/usr/local/bin/mysql --socket /run/mysqld/mysqld.sock --batch --execute "SHOW SLAVE STATUS\\G"'] on 1 hosts: db2230.codfw.wmnet
DRY-RUN: [test-s4] MASTER_TO db2230.codfw.wmnet checking SLAVE STATUS Master_Host=db1125.eqiad.wmnet
DRY-RUN: [test-s4] MASTER_TO db2230.codfw.wmnet checking SLAVE STATUS Master_User=repl2024
DRY-RUN: [test-s4] MASTER_TO db2230.codfw.wmnet checking SLAVE STATUS Master_Port=3306
DRY-RUN: [test-s4] MASTER_TO db2230.codfw.wmnet checking SLAVE STATUS Slave_IO_Running=Yes
DRY-RUN: [test-s4] MASTER_TO db2230.codfw.wmnet checking SLAVE STATUS Slave_SQL_Running=Yes
DRY-RUN: [test-s4] MASTER_TO db2230.codfw.wmnet checking SLAVE STATUS Last_IO_Errno=0
DRY-RUN: [test-s4] MASTER_TO db2230.codfw.wmnet checking SLAVE STATUS Last_SQL_Errno=0
DRY-RUN: MASTER_TO db2230.codfw.wmnet replication from MASTER_FROM db1125.eqiad.wmnet verified
DRY-RUN: Executing commands ['/usr/local/bin/mysql --socket /run/mysqld/mysqld.sock --batch --execute "SHOW SLAVE STATUS\\G"'] on 1 hosts: db1125.eqiad.wmnet
DRY-RUN: Failed to run cookbooks.sre.switchdc.databases.prepare.PrepareSection.master_from_check_replication: SHOW SLAVE STATUS seems to have been executed on a master.
DRY-RUN: Traceback
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/wmflib/interactive.py", line 183, in confirm_on_failure
    ret = func(*args, **kwargs)
  File "/home/jynus/cookbooks_testing/cookbooks/cookbooks/sre/switchdc/databases/prepare.py", line 331, in master_from_check_replication
    status = self.master_from.show_slave_status()
  File "/usr/lib/python3/dist-packages/spicerack/mysql_legacy.py", line 201, in show_slave_status
    raise MysqlLegacyError(f"{sql} seems to have been executed on a master.")
spicerack.mysql_legacy.MysqlLegacyError: SHOW SLAVE STATUS seems to have been executed on a master.
==> What do you want to do? "retry" the last command, manually fix the issue and "skip" the last command to continue the execution or completely "abort" the execution.
>
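For context, what the failing check is reacting to can be seen manually (a sketch; on a host with no upstream configured, the first query returns an empty set, which appears to be what the spicerack helper interprets as "executed on a master"):
sudo mysql --socket /run/mysqld/mysqld.sock -e 'SHOW SLAVE STATUS\G'   # empty on db1125 while it has no upstream
sudo mysql --socket /run/mysqld/mysqld.sock -e 'SHOW MASTER STATUS\G'  # shows the binlog coordinates instead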
I am going to execute this now because it is making the fix for T375144 harder to test. It can be removed afterwards, but I need this now.
@Ladsgroup Did you add the backup sources into your checks? If not, could you do it (assuming it won't take much effort)? I would like to make sure backups are generated fine on both dcs before archiving them for the other ticket.
We should also mark it on netbox/dbctl as failed (?), in case another SRE runs into issues, so it is not pooled by accident until this ticket or its parent is resolved.
Whatever you do, can I suggest removing it from master/master candidate and choosing a different host for that?
Tue, Oct 15
I would like to send a question to Observability (Observability-Alerting) about how to handle a very simple check that currently lives in Icinga. I am not too worried about the individual check itself, as I know how to fix it technically; it is more a question of how this kind of check should be done.
db2205 had gotten stuck previously due to GTID/replication at T374425. It is either the newer MariaDB version plus a large number of objects, or hardware; I would check the latter first.
Happened again at T377164
I will generate a tarball from today's logical export. Will ping here when done.
Mon, Oct 14
Not sure if it is related, but I am seeing lots of:
PHP Notice: fwrite(): write of 137 bytes failed with errno=32 Broken pipe
For some reason, the script that checks legal conformity of the footer failed with:
Oct 11 2024
Please add it there; I am happy to help with the MySQL side, but I am not familiar with the DBAs' day-to-day procedures.
Not needed to close the ticket, but I found something really weird: there is no megacli command. Do you know what the binary for manual checking is on newer hosts? Also, learning cycles are disabled on those hosts too, right?
The repair finished in 10 seconds, faster than I expected, without issues (the table and index rebuilt successfully, and I restarted replication).
Started on a screen on db2175:
mysql:root@localhost [nlwiki]> ALTER TABLE recentchanges ENGINE=InnoDB, FORCE;
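(A rough sketch of how a long rebuild like this can be run inside screen and watched; the session name and the monitoring queries are illustrative, not a record of the exact commands used.)
screen -S recentchanges-rebuild              # start a named screen session on the host so the rebuild survives a disconnect
mysql nlwiki -e 'ALTER TABLE recentchanges ENGINE=InnoDB, FORCE;'   # run the rebuild inside that session
# From another terminal, watch the running ALTER and, afterwards, the replication lag:
mysql -e 'SHOW PROCESSLIST;'
mysql -e 'SHOW SLAVE STATUS\G' | grep Seconds_Behind_Master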
Oct 10 2024
Yes, the REPLACE is not the issue; it is ROW format that translates it into an UPDATE or a DELETE + INSERT, and those would cause the same issues if applied in the wrong case (though they will do the right thing if the row simply had to be inserted after the select). We want to keep the REPLACE: even if we used INSERT IGNORE, it wouldn't fix things for the replicas; the issue is the ROW behavior, not the query itself.
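A hedged way to see this behavior directly (the table, database and binlog names below are made up): run a REPLACE against an existing row with binlog_format=ROW on a scratch instance and decode the resulting events; the REPLACE shows up as an Update_rows (or Delete_rows + Write_rows) event rather than a plain Write_rows.
# On a scratch instance: log a REPLACE of an already-existing row under ROW format...
mysql test -e "SET SESSION binlog_format=ROW; REPLACE INTO heartbeat_test (id, ts) VALUES (1, NOW());"
# ...then decode how it was actually written to the binlog.
mysqlbinlog --base64-output=decode-rows -vv test-bin.000001 | grep -E -A2 'Write_rows|Update_rows|Delete_rows'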
So this is my request for you, @Volans; this is the best thing I think we can do now:
DC ops: Would you have an 8 TB disk spare for this host? It seems out of warranty.
Oct 9 2024
No concerns. Look, if you can, for any pattern common to the issues, both the corrected ones (T375507#10175248) and the pending ones (T375507#10175251).
@elukey thanks, that is all I needed: some info on where we were. I wasn't aware of the ongoing progress during my sabbatical, which is understandable. No worries. :-)
Oct 7 2024
Any ETA for this and the codfw equivalent, DC-ops? I know there may be some delays due to the vendor peculiarities, but my "Need by date" was 2024-09-08 and I haven't seen any update. Thank you (this is not an emergency, but we are soon running out of space on a shard of media backups).
Oct 4 2024
Feel free to merge it into T375382, the only reason I had reopened it is because repooling/making it back to normal operation was missing.