WMF-JobQueue
Component · Active · Public

Details

Description

The infrastructure used by Wikimedia Foundation for storage and execution of the MediaWiki job queue.

As of July 2018, the MediaWiki JobQueue infrastructure (at WMF) in a nutshell:

  • Jobs are submitted from MediaWiki web servers to Kafka using EventBus.
  • Jobs are scheduled using ChangeProp.
  • Jobs are executed using the rpc/RunSingleJob endpoint in wmf-config, on a dedicated "jobrunner" pool of MediaWiki app servers.
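
As a rough illustration of the three stages listed above, the following toy model uses a temporary file in place of a Kafka topic. Every function name here (submit_job, schedule_jobs, run_single_job) is hypothetical, and none of it reflects the real EventBus, ChangeProp, or RunSingleJob interfaces; it only shows the order in which a job passes through submission, scheduling, and execution.

```shell
# Toy model only: a plain file stands in for a Kafka topic (assumption).
topic="$(mktemp)"

# Stage 1 (hypothetical): a web server appends a job event to the topic.
submit_job() {
  printf '%s %s\n' "$1" "$2" >> "$topic"
}

# Stage 3 (hypothetical): a jobrunner executes exactly one job.
run_single_job() {
  printf 'executed %s on %s\n' "$1" "$2"
}

# Stage 2 (hypothetical): the scheduler reads events in order and
# hands each one to the executor.
schedule_jobs() {
  while read -r job_type wiki; do
    run_single_job "$job_type" "$wiki"
  done < "$topic"
}

submit_job refreshLinks commonswiki
submit_job htmlCacheUpdate dewiki
schedule_jobs
```

The point of the separation is that producers never call the executor directly: they only write events, and the scheduler decides when each event is turned into an execution request.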

Recent Activity

Yesterday

mdaniels5757 added a comment to T380544: Temporarily run more refreshLinks jobs on Commons.

Just processed those edit requests.

Sun, Nov 24, 7:19 PM · Commons, serviceops, WMF-JobQueue

Sat, Nov 23

Pppery edited projects for T175146: JobQueue: Unify JobRunner entry points, added: Patch-Needs-Improvement; removed Patch-For-Review.
Sat, Nov 23, 9:33 PM · Patch-Needs-Improvement, Security, MW-Interfaces-Team, Platform Team Workboards (Initiatives), WMF-JobQueue, TechCom-RFC (TechCom-RFC-Closed), MediaWiki-Core-JobQueue, MediaWiki-Configuration
LucasWerkmeister added a comment to T380544: Temporarily run more refreshLinks jobs on Commons.

For the record, I just found another set of templates we want to update on most of the affected files (Cc-by(-sa)-layout should bypass the SDC_statement_exist template) and filed edit requests for them.

Sat, Nov 23, 2:38 PM · Commons, serviceops, WMF-JobQueue

Fri, Nov 22

Nikki added a comment to T380544: Temporarily run more refreshLinks jobs on Commons.

@LucasWerkmeister how was the 10-year estimate calculated?

Fri, Nov 22, 1:24 PM · Commons, serviceops, WMF-JobQueue
akosiaris added a comment to T380544: Temporarily run more refreshLinks jobs on Commons.

This type of Kafka Consumer lag for that job isn't unheard of. In fact, just recently we had way higher consumer lags for commons specifically.

Fri, Nov 22, 9:51 AM · Commons, serviceops, WMF-JobQueue

Thu, Nov 21

Scott_French added a comment to T380544: Temporarily run more refreshLinks jobs on Commons.

Indeed, it looks like the refreshLinks_partitioner rule is easily keeping up with the "upstream" rate of new jobs [0] but the "real" refreshLinks rule on partition 3 (commons) has a rather deep backlog.

Thu, Nov 21, 11:35 PM · Commons, serviceops, WMF-JobQueue
LucasWerkmeister added a comment to T380544: Temporarily run more refreshLinks jobs on Commons.

That’s great, thank you!

Thu, Nov 21, 11:12 PM · Commons, serviceops, WMF-JobQueue
LucasWerkmeister updated the task description for T380544: Temporarily run more refreshLinks jobs on Commons.
Thu, Nov 21, 11:12 PM · Commons, serviceops, WMF-JobQueue
Platonides added a comment to T380544: Temporarily run more refreshLinks jobs on Commons.

I have processed those edit requests, so at least when a page contains multiple license templates, it will only need to be reparsed once.

Thu, Nov 21, 11:11 PM · Commons, serviceops, WMF-JobQueue
Krinkle edited Description on WMF-JobQueue.
Thu, Nov 21, 10:34 PM
Krinkle updated the task description for T380543: WMF-JobQueue project description is out of date.
Thu, Nov 21, 10:34 PM · MW-Interfaces-Team, Phabricator, Documentation, WMF-JobQueue
Krinkle edited projects for T380543: WMF-JobQueue project description is out of date, added: MW-Interfaces-Team; removed MediaWiki-Platform-Team.
Thu, Nov 21, 10:32 PM · MW-Interfaces-Team, Phabricator, Documentation, WMF-JobQueue
LucasWerkmeister added a comment to T380544: Temporarily run more refreshLinks jobs on Commons.

Currently, the number of links to Template:SDC statement has value (as counted by search – this lags somewhat behind the “real” number, as it depends on a further job, but it should be a decent approximation)

Thu, Nov 21, 10:30 PM · Commons, serviceops, WMF-JobQueue
LucasWerkmeister added a parent task for T380544: Temporarily run more refreshLinks jobs on Commons: T343131: Commons database is growing way too fast.
Thu, Nov 21, 9:51 PM · Commons, serviceops, WMF-JobQueue
AntiCompositeNumber added a project to T380544: Temporarily run more refreshLinks jobs on Commons: Commons.
Thu, Nov 21, 9:45 PM · Commons, serviceops, WMF-JobQueue
LucasWerkmeister triaged T380544: Temporarily run more refreshLinks jobs on Commons as High priority.

Per IRC discussion, marking as High priority. @AntiCompositeNumber reports that this results in category changes being slow to propagate (Category:Johann Baptist Hops not showing up in Category:Hops (surname) yet).

Thu, Nov 21, 9:43 PM · Commons, serviceops, WMF-JobQueue
LucasWerkmeister created T380544: Temporarily run more refreshLinks jobs on Commons.
Thu, Nov 21, 9:40 PM · Commons, serviceops, WMF-JobQueue
Reedy renamed T380543: WMF-JobQueue project description is out of date from WMF-JobQueue description is out of date to WMF-JobQueue project description is out of date.
Thu, Nov 21, 9:35 PM · MW-Interfaces-Team, Phabricator, Documentation, WMF-JobQueue
Reedy created T380543: WMF-JobQueue project description is out of date.
Thu, Nov 21, 9:35 PM · MW-Interfaces-Team, Phabricator, Documentation, WMF-JobQueue
Scott_French closed T378385: Spike in JobQueue job backlog time (500ms -> 4-8 minutes) as Resolved.

Monitoring for sustained latency impact on low-traffic jobs is now live.

Thu, Nov 21, 6:45 PM · FlaggedRevs, serviceops, WMF-JobQueue

Wed, Nov 13

Scott_French claimed T378385: Spike in JobQueue job backlog time (500ms -> 4-8 minutes).

Since the three job types critical to uploads have now been moved to dedicated consumers (T379035), the primary follow-up here is monitoring (T378609) to detect when these kinds of isolation failures occur so that we can reactively isolate the "antagonist" job.

Wed, Nov 13, 8:02 PM · FlaggedRevs, serviceops, WMF-JobQueue

Tue, Nov 12

Samwalton9-WMF moved T379476: Page deletion queued via Nuke is sometimes very slow to complete from Backlog to Bugs on the MediaWiki-extensions-Nuke board.
Tue, Nov 12, 1:35 PM · WMF-JobQueue, Moderator-Tools-Team, MediaWiki-extensions-Nuke
Samwalton9-WMF moved T379476: Page deletion queued via Nuke is sometimes very slow to complete from Inbox to Triaged on the Moderator-Tools-Team board.
Tue, Nov 12, 1:14 PM · WMF-JobQueue, Moderator-Tools-Team, MediaWiki-extensions-Nuke

Mon, Nov 11

jijiki placed T377512: runJobs.log isn't being written to up for grabs.
Mon, Nov 11, 4:07 PM · WMF-JobQueue, MW-on-K8s
jijiki moved T378385: Spike in JobQueue job backlog time (500ms -> 4-8 minutes) from Incoming 🐫 to Production Errors 🚜 on the serviceops board.
Mon, Nov 11, 1:11 PM · FlaggedRevs, serviceops, WMF-JobQueue
jijiki triaged T378385: Spike in JobQueue job backlog time (500ms -> 4-8 minutes) as High priority.
Mon, Nov 11, 1:07 PM · FlaggedRevs, serviceops, WMF-JobQueue
Scott_French added a comment to T378385: Spike in JobQueue job backlog time (500ms -> 4-8 minutes).

Thanks for flagging, all. Yes, this looks like another isolation failure on the low-traffic consumer, and appears to have largely self-resolved as of ~ 14:50 UTC on the 10th. I'll follow up on T379462 for this particular instance, and aim to prioritize T379035 when I'm back this week.

Mon, Nov 11, 1:51 AM · FlaggedRevs, serviceops, WMF-JobQueue

Sun, Nov 10

Wargo added a comment to T379476: Page deletion queued via Nuke is sometimes very slow to complete.

Happened to me some months ago.

Sun, Nov 10, 6:24 PM · WMF-JobQueue, Moderator-Tools-Team, MediaWiki-extensions-Nuke
Samwalton9 added a comment to T378385: Spike in JobQueue job backlog time (500ms -> 4-8 minutes).

Possibly the cause of T379476?

Sun, Nov 10, 10:30 AM · FlaggedRevs, serviceops, WMF-JobQueue
matej_suchanek added a project to T379476: Page deletion queued via Nuke is sometimes very slow to complete: WMF-JobQueue.
Sun, Nov 10, 9:07 AM · WMF-JobQueue, Moderator-Tools-Team, MediaWiki-extensions-Nuke

Sat, Nov 9

Myrealnamm added a comment to T378385: Spike in JobQueue job backlog time (500ms -> 4-8 minutes).

Subscribing myself. I've been seeing this for a while, and yes, today some mw tags are taking forever to update.

Sat, Nov 9, 8:51 PM · FlaggedRevs, serviceops, WMF-JobQueue
Bawolff added a comment to T378385: Spike in JobQueue job backlog time (500ms -> 4-8 minutes).

That said, it does seem like the p99 for AssembleUploadChunks jobs has spiked to ~15 min for the last 2 hours (was fine before that point), so maybe that is just it. Sounds like a dedicated queue as Scott suggests would really help.

Sat, Nov 9, 10:18 AM · FlaggedRevs, serviceops, WMF-JobQueue
Bawolff added a comment to T378385: Spike in JobQueue job backlog time (500ms -> 4-8 minutes).

@MBH let's open a separate new task to investigate, as the cause could be something different from the job queue issue this task is about. If you want, you could email the HAR file to me ( [email protected] ).

Sat, Nov 9, 10:01 AM · FlaggedRevs, serviceops, WMF-JobQueue
MBH added a comment to T378385: Spike in JobQueue job backlog time (500ms -> 4-8 minutes).

Now files are waiting ~10 minutes before publishing and then fail to publish with the errors "Unknown server error" and "Incorrect CSRF token". Uploading (the first UploadWizard step) was very slow too, with the same behavior as in the previous case: 3 files in the queue and all other files waiting; after several minutes these 3 files were uploaded and the next 3 files entered the queue.

Sat, Nov 9, 9:53 AM · FlaggedRevs, serviceops, WMF-JobQueue
MBH reopened T378385: Spike in JobQueue job backlog time (500ms -> 4-8 minutes) as "Open".

@Bawolff The problem described in T378276 has arisen again. I have recorded a HAR file; please give me an e-mail address where I should send it. I will not clear any cookies because I don't know how to do it, I'll just send you the raw file.

Sat, Nov 9, 9:49 AM · FlaggedRevs, serviceops, WMF-JobQueue

Tue, Nov 5

Bawolff merged T378276: Mass uploads to Commons doesn't work for me into T378385: Spike in JobQueue job backlog time (500ms -> 4-8 minutes).
Tue, Nov 5, 2:22 AM · FlaggedRevs, serviceops, WMF-JobQueue
Scott_French added a comment to T378385: Spike in JobQueue job backlog time (500ms -> 4-8 minutes).

Thanks, @Bawolff - Yes, indeed, those both fan into the low-traffic consumer. While we don't really have a prioritization mechanism in this context that I'm aware of, it would probably be fairly straightforward to at least move them out of low-traffic to a dedicated consumer, as @Ladsgroup points out. I've opened T379035 to look into that.

Tue, Nov 5, 1:07 AM · FlaggedRevs, serviceops, WMF-JobQueue

Mon, Nov 4

Ladsgroup added a comment to T378385: Spike in JobQueue job backlog time (500ms -> 4-8 minutes).

It shouldn't be too hard to give it a dedicated lane with small concurrency

Mon, Nov 4, 10:12 AM · FlaggedRevs, serviceops, WMF-JobQueue

Thu, Oct 31

Bawolff added a comment to T378385: Spike in JobQueue job backlog time (500ms -> 4-8 minutes).

Just as an aside, I believe PublishStashedFile and AssembleUploadChunks are considered low-traffic jobs. Unlike normal jobs, these are very latency sensitive: they don't happen in the background, and the UI actually makes users wait while these jobs complete (see also T378276). It would be really great if these jobs could somehow be prioritized in a job queue backlog situation.

Thu, Oct 31, 9:23 PM · FlaggedRevs, serviceops, WMF-JobQueue

Wed, Oct 30

lmata moved T359472: Migrate MediaWiki.jobqueue to statslib from Inbox to Prioritized on the Observability-Metrics board.
Wed, Oct 30, 6:58 PM · MW-Interfaces-Team, WMF-JobQueue, MediaWiki-Engineering, Observability-Metrics
kostajh closed T378385: Spike in JobQueue job backlog time (500ms -> 4-8 minutes) as Resolved.

Thanks everyone!

Wed, Oct 30, 3:53 PM · FlaggedRevs, serviceops, WMF-JobQueue
Scott_French added a comment to T378385: Spike in JobQueue job backlog time (500ms -> 4-8 minutes).

Thanks @Ammarpad for the pointer to where this category came from, and thanks @cscott for providing some context. Indeed, this seems likely to be an interaction between the new category and the way flaggedrevs is used on dewiki.

Wed, Oct 30, 2:50 PM · FlaggedRevs, serviceops, WMF-JobQueue
Dreamy_Jazz added a comment to T378385: Spike in JobQueue job backlog time (500ms -> 4-8 minutes).

MediaModeration jobs have gone back to normal processing.

Wed, Oct 30, 1:22 PM · FlaggedRevs, serviceops, WMF-JobQueue

Tue, Oct 29

cscott added a comment to T378385: Spike in JobQueue job backlog time (500ms -> 4-8 minutes).

I don't think they are related. That same tracking category appeared on every wiki; nothing explains why it would have caused a load spike for dewiki only. Unless perhaps flagged revs does something "funny" with categories that would cause much higher load?

Tue, Oct 29, 8:55 PM · FlaggedRevs, serviceops, WMF-JobQueue
Ammarpad added a comment to T378385: Spike in JobQueue job backlog time (500ms -> 4-8 minutes).

Thanks, @kostajh.

FWIW, it looks like Kategorie:Wikipedia:Seite,_die_JsonConfig_verwendet is a brand new category created on the 24th, with a lot of associated pages.

Tue, Oct 29, 8:48 AM · FlaggedRevs, serviceops, WMF-JobQueue

Mon, Oct 28

Scott_French added a comment to T378385: Spike in JobQueue job backlog time (500ms -> 4-8 minutes).

FWIW, it looks like Kategorie:Wikipedia:Seite,_die_JsonConfig_verwendet is a brand new category created on the 24th, with a lot of associated pages.

Mon, Oct 28, 7:21 PM · FlaggedRevs, serviceops, WMF-JobQueue
kostajh added a comment to T378385: Spike in JobQueue job backlog time (500ms -> 4-8 minutes).

Looking at the breakdown by wiki for flaggedrevs_CacheUpdate jobs among the last 1M entries in the executor log on mwlog1002:

$ tail -1000000 JobExecutor.log | grep flaggedrevs_CacheUpdate | cut -f 5 -d ' ' | sort | uniq -c | sort -nr
 261208 dewiki
   3026 arwiki
    111 ukwiki
 ...

So yeah, seems to effectively all be targeting dewiki. Example:

Mon, Oct 28, 7:12 PM · FlaggedRevs, serviceops, WMF-JobQueue
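
For readers who want to try the per-wiki breakdown from the comment above without access to mwlog1002, here is a minimal sketch on synthetic data. The log layout is an assumption: the real JobExecutor.log format is not shown in this feed, so the sample lines below simply place the wiki name in the fifth space-separated field, which is what the original cut -f 5 -d ' ' implies. The helper names sample_log and count_by_wiki are made up for illustration.

```shell
# Synthetic stand-in for JobExecutor.log. The layout is an assumption:
# field 5 (space-separated) holds the wiki name.
sample_log() {
  cat <<'EOF'
2024-10-28 19:00:01 mw-jobrunner INFO dewiki flaggedrevs_CacheUpdate status=ok
2024-10-28 19:00:02 mw-jobrunner INFO dewiki flaggedrevs_CacheUpdate status=ok
2024-10-28 19:00:03 mw-jobrunner INFO arwiki flaggedrevs_CacheUpdate status=ok
2024-10-28 19:00:04 mw-jobrunner INFO enwiki refreshLinks status=ok
2024-10-28 19:00:05 mw-jobrunner INFO dewiki flaggedrevs_CacheUpdate status=ok
EOF
}

# Count log lines per wiki for one job type, most frequent wiki first.
count_by_wiki() {
  grep -F -- "$1" | cut -f 5 -d ' ' | sort | uniq -c | sort -nr
}

sample_log | count_by_wiki flaggedrevs_CacheUpdate
```

On this sample, dewiki is listed first with a count of 3, followed by arwiki with 1; the enwiki line is filtered out because it belongs to a different job type.
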
kostajh added a project to T378385: Spike in JobQueue job backlog time (500ms -> 4-8 minutes): FlaggedRevs.
Mon, Oct 28, 7:08 PM · FlaggedRevs, serviceops, WMF-JobQueue
Scott_French added a comment to T378385: Spike in JobQueue job backlog time (500ms -> 4-8 minutes).

Looking at the breakdown by wiki for flaggedrevs_CacheUpdate jobs among the last 1M entries in the executor log on mwlog1002:

Mon, Oct 28, 6:47 PM · FlaggedRevs, serviceops, WMF-JobQueue
kostajh added a comment to T378385: Spike in JobQueue job backlog time (500ms -> 4-8 minutes).

Given the rate at which the backlog is draining, this should self-resolve in ~ 24h.

We can try to relax the concurrency limit for the low_traffic_jobs rule, but that might not make a significant difference in either clearing the backlog or improving queue times for other jobs handled by that rule.

I'm happy to give it a try, though.

I think it self-resolving in ~24 hrs would be okay. The concern was that this was a more long-term change to the wait times.

Mon, Oct 28, 6:32 PM · FlaggedRevs, serviceops, WMF-JobQueue
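
The "should self-resolve in ~ 24h" estimate in the comment above is the kind of figure that follows from dividing the remaining backlog by the net drain rate. The sketch below shows that arithmetic only; the function name drain_eta_hours and all input numbers are hypothetical and are not taken from the task or from any dashboard.

```shell
# Estimate whole hours until a backlog is empty, assuming constant rates.
# Arguments (all hypothetical): backlog size in jobs, jobs processed per
# second, jobs inserted per second. Integer arithmetic rounds down.
drain_eta_hours() {
  backlog="$1"
  processed_per_sec="$2"
  inserted_per_sec="$3"
  net=$(( processed_per_sec - inserted_per_sec ))
  if [ "$net" -le 0 ]; then
    # Consumers are not outpacing producers, so the backlog never drains.
    echo "never"
    return 0
  fi
  echo $(( backlog / net / 3600 ))
}

drain_eta_hours 1800000 30 5
```

With these made-up inputs the net drain rate is 25 jobs per second, so the function prints 20. If the insertion rate matches or exceeds the processing rate, it prints never, which corresponds to the concern about a long-term change in wait times rather than a one-off spike.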