User Details
- User Since: Jan 6 2020, 12:19 PM (254 w, 6 d)
- Availability: Available
- LDAP User: Unknown
- MediaWiki User: HNowlan (WMF)
Thu, Nov 21
I am going to close this issue as AQS1 is no longer in service.
Tue, Nov 19
Things done to address this issue so far:
This appears to be a problem with the image: rsvg-convert returns "rendering error: NoMemory" (which is a bit misleading). My understanding of SVG internals is relatively limited, but the attribute patternTransform="matrix(0.142 -0.0168 -0.0205 -0.1008 -91816.0078 -14072.0449)" suggests this is a recurrence of T292439.
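For anyone reproducing this locally, a minimal sketch (the filename here is a placeholder, not the actual affected file):
rsvg-convert -o /tmp/out.png Affected_image.svg
# expected to fail with the "rendering error: NoMemory" quoted above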
Mon, Nov 18
As of the 13th of November, all video transcoding has been moved to shellbox-video. The service seems quite stable. We'll reclaim the videoscaler hardware at a later point.
We've migrated to shellbox-video and the pod failures are no longer an issue, thanks to the use of both the process check and TCP keepalives.
Mon, Nov 11
tl;dr: we have an issue with TIFF conversion that is causing workers to block indefinitely, revealing a multitude of issues.
If we see a recurrence of this in the future, please isolate the pod rather than deleting it, so that it can be debugged.
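For reference, a rough sketch of one way to isolate a pod without deleting it (namespace, pod name and label key are placeholders/assumptions, not taken from this task):
kubectl -n shellbox-video label pod <pod-name> app-
# removing the selector label detaches the pod from its Service so it stops receiving traffic,
# while the container keeps running and can still be inspected with kubectl exec / kubectl debug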
This is a recurrence of T374350
Wed, Nov 6
Looks like the same crashpad flood issue again. The service needs a restart, and I think we should implement the flags @TheDJ has mentioned.
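The exact flags aren't quoted in this task, but as a hedged sketch, the usual way to stop Chromium spawning crashpad handlers is something along these lines (whether the renderer exposes a way to pass them through is an assumption):
chromium --headless --disable-crash-reporter --disable-breakpad <url>
# --disable-crash-reporter / --disable-breakpad prevent chrome_crashpad_handler processes being spawned on crashes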
Tue, Oct 29
Just to note, Joely has verified the SSH key in this ticket via Slack.
Oct 25 2024
This access requires signing an NDA; adding @KFrancis as per the access request documentation. Thanks!
Closing as a duplicate; following up in T378181.
This request first requires signing an NDA with Legal - tagging @KFrancis as per the access request process. Thanks!
Oct 23 2024
Key updated - please let me know if it works.
sessionstore in codfw and eqiad is running with an Envoy TLS terminator, and latencies etc. look acceptable.
Merged!
Running the client directly against a k8s worker IP also succeeds, which means that kube-proxy most likely isn't to blame here.
Oct 22 2024
eqiad is currently using the mesh - codfw is not. We decided to leave this config in place for the evening to gain confidence in it and to allow for time constraints. eqiad is looking fine so far. If an emergency revert is needed, both 2adb4cf4c6aa6e534aa7a596e796f5f099abc60f and 622bec969ea59a4352abc1e6daa20313ae1fe4f3 will need to be reverted before applying in eqiad.
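If that emergency revert does become necessary, the rough shape of it would be (assuming the two changes live in the usual deployment-charts repo and go out via the normal deploy flow, which isn't confirmed here):
git revert 622bec969ea59a4352abc1e6daa20313ae1fe4f3 2adb4cf4c6aa6e534aa7a596e796f5f099abc60f
# then merge and apply the resulting change to eqiad via the standard deploy process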
When connecting the same client to a k8s pod IP, the encoding and download of the file complete successfully, so some point of the communication path in between is definitely at fault here. We can now say with reasonable confidence that Envoy and Apache are not to blame. Isolating which part will be a bit of a challenge, but it's a clearer task.
I've mocked up a horrible Frankenstein script that mimics the TimedMediaHandler behaviour - when calling shellbox-video.discovery.wmnet directly via it, we see the exact same behaviour. This means that at the very least we can rule out failures at the JobQueue or RunSingleJob layer:
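The script itself isn't reproduced here; as a rough stand-in for the comparison made in the surrounding comments, the idea is to time the identical request against the discovery hostname and against a single pod (the endpoint path, port and payload are placeholders, not the real Shellbox call):
time curl -sk -X POST --data-binary @payload.mime https://shellbox-video.discovery.wmnet/<route>
time curl -sk -X POST --data-binary @payload.mime https://<pod-ip>:<port>/<route>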
Oct 21 2024
Minor datapoint that hasn't been noted - when testing with a larger file that takes longer to convert, we see the same behaviour. This lends credence to the idea that the issue is not caused by a timeout, and is most likely caused by some kind of issue with the handling and reading of responses, probably beyond shellbox.
I've removed the Upstream tag as requested. T40010 may be of interest for similar threads of conversation, might be worth making this task a subtask of that one for now.
Oct 18 2024
Chromium is leaking processes, most likely leaving chrome_crashpad handlers lying around after a failure:
root@wikikube-worker2070:/home/hnowlan# ps uax | grep chrome_crashpad | wc -l
115357
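As a stopgap on an affected worker, a blunt cleanup might look like the following (assuming the leaked handlers are safe to kill, which hasn't been verified here; the restart and flags discussed above remain the real fix):
root@wikikube-worker2070:/home/hnowlan# pkill -f chrome_crashpad
root@wikikube-worker2070:/home/hnowlan# ps uax | grep chrome_crashpad | wc -l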
Oct 16 2024
Mercurius is now built into the php8.1-fpm-multiversion-base image as of docker-registry.discovery.wmnet/php8.1-fpm-multiversion-base:8.1.30-2.
Debian packages are now in the apt repo
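A quick sanity check of both artefacts (the registry path and tag are taken from the comment above; the Debian package being named mercurius is an assumption):
apt-cache policy mercurius
docker pull docker-registry.discovery.wmnet/php8.1-fpm-multiversion-base:8.1.30-2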
Oct 15 2024
This appears to be a rerun of T375521 - the temporary fix last time was a roll restart, but there's clearly a deeper issue.
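For the record, a generic sketch of that stopgap roll restart (namespace and deployment names are placeholders; in practice this is normally done through the service's own tooling rather than raw kubectl):
kubectl -n <namespace> rollout restart deployment/<service>
kubectl -n <namespace> rollout status deployment/<service>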
Oct 14 2024
aqs1 is disabled in RESTBase and the Puppet configuration has been removed. All that remains is to archive the codebase and the deploy repos.
Oct 9 2024
The main reason sessionstore didn't go ahead with using the mesh was concern about the extremely broad impact any issues might have had. The risk profile for echostore is a lot lower, so I think we can move ahead with testing the mesh. I can't quite remember the details, but I'm fairly sure there are a bug or two in the chart logic; nothing that isn't obvious and can't be ironed out :)
Oct 5 2024
Just to explain the issue: a while ago, a rate-limiting feature that was known to be problematic was re-enabled in an emergency due to a harmful surge in traffic. It was left enabled and caused this issue to recur. I've since disabled the feature, and we'll be removing it to prevent it being erroneously triggered again. However, the fact that this required manual reporting and wasn't noticed on the SRE side isn't really acceptable, so next week I'll be working on adding per-format alerting so that an increase in errors for a single format is caught before it can have a wide impact. This will be tracked in T376538.
Thanks for the report - this was caused by T372470. I'm seeing recoveries on thumbnailing those files, could you confirm?
I'm seeing recoveries on most of the linked images, but reopening this until we're sure this is resolved.
High ThumbnailRender volume is normal - it's a constant background process that generates thumbnails for newly uploaded files. The change in the graphs from eqiad to codfw is part of the datacentre switchover (T370962).
Oct 3 2024
Mercurius images for bookworm and bullseye are now building via CI (with some modifications for bullseye): https://gitlab.wikimedia.org/hnowlan/mercurius/-/artifacts
Sep 19 2024
Just to note, I've been testing by forcing a reencode of this video in VP9 format. This can also be tested by grabbing a job from kafka using kafkacat (kafkacat -b kafka-main1004.eqiad.wmnet:9092 -t eqiad.mediawiki.job.webVideoTranscode -o -200) and then POSTing the inner parts of the event via curl to a specific videoscaler to test logging changes etc:
time curl -H "Host: videoscaler.discovery.wmnet" -k -v -v -X POST -d '{"database":"testwiki","type":"webVideoTranscode","params": {"transcodeMode":"derivative" ,"transcodeKey":"240p.vp9.webm","prioritized":false,"manualOverride":true,"remux":false,"requestId":"A_REQ_ID","namespace":6,"title":"CC_1916_10_02_ThePawnshop.mpg"},"mediawiki_signature":"A_SIG"}' https://mw1437.eqiad.wmnet/rpc/RunSingleJob.php
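To avoid assembling that POST body by hand, the relevant fields can be pulled out of a consumed event with something like the following (the jq filter assumes those fields are top-level in the event, which isn't shown here):
kafkacat -b kafka-main1004.eqiad.wmnet:9092 -t eqiad.mediawiki.job.webVideoTranscode -o -200 -e | tail -n1 | jq '{database, type, params, mediawiki_signature}'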
These all appear to be requests from jobrunner hosts, which leads me to assume they're from the ThumbnailRender job. Could it be an ordering issue where we're triggering thumbnail generation during upload or something? The images themselves all seem to be fine when requested directly.
Sep 16 2024
I think that's fairly on the money; we can probably remove this now. We still have some bare-metal deployments on debug hosts (though I think scap is aware of this versioning during a deploy) and on the videoscalers, so we're not completely free of it. But I think at this point we stand to lose little from removing it.
Sep 13 2024
We have at least partially addressed the healthchecking issues by introducing a second readiness probe on the shellbox app container that checks for a running ffmpeg process; this appears to be working quite well.
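A minimal sketch of the idea, assuming the probe is meant to mark a pod not-ready while it is busy transcoding (the actual chart wiring isn't shown here):
# exec readiness probe command: report not-ready while an ffmpeg process exists
pgrep -x ffmpeg > /dev/null && exit 1 || exit 0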
Sep 11 2024
At this point in time I'd say it's not out of the question that we could have mercurius up and running some jobs, but for the purposes of the switchover I think it makes sense to revert to using videoscalers for the short term. It's a much better-understood problem space, and while I hope to have some jobs running via mercurius, I really doubt we'd be doing it for *all* jobs.
From php-fpm's fpm-status we can even see this behaviour so our check isn't at fault:
root@mw1451:/home/hnowlan# for i in `seq 200`; do curl -s 10.67.165.241:9181/fpm-status | grep ^active; sleep 0.2; done | sort | uniq -c
 18 active processes: 1
182 active processes: 2
The healthcheck endpoint is not consistently returning a 503 when workers are busy - this could be some kind of a race condition. When all of the following were executed the pod was actively running an ffmpeg process:
root@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1
Service Unavailable
root@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1
Service Unavailable
root@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1
OK
root@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1
Service Unavailable
root@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1
Service Unavailable
root@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1
Service Unavailable
root@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1
Service Unavailable
root@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1
OK
root@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1
Service Unavailable
root@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1
Service Unavailable
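A quick way to quantify how often the check misfires while a transcode is running, along the same lines as the fpm-status loop above (a sketch; counts will obviously vary):
root@mw1482:/home/hnowlan# for i in `seq 50`; do nsenter -t 313825 -n curl -s -o /dev/null -w '%{http_code}\n' 10.67.139.145:9181/healthz?min_avail_workers=1; sleep 0.2; done | sort | uniq -c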