Description
Forked from T159618
The edit rate may have been the issue, but we should still utilize the tools we have (maxlag) to notify bots that the server is under high load. If we add a check to the maxlag value calculation that looks at the number of JobQueue entries and raises the reported maxlag accordingly, it would prevent bots from causing this issue again. Regardless of whether it's one bot or several causing the spike, the existing maxlag checks could be used to notify all bots to back off.
Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
API: Optionally include in job queue size in maxlag | mediawiki/core | master | +51 -5
Status | Subtype | Assigned | Task
---|---|---|---
Open | None | | T160003 Factor the JobQueue into the maxlag value
Resolved | | Legoktm | T159618 Job queue rising to nearly 3 million jobs
Event Timeline
How would the number of job queue entries translate into a number of seconds of database lag?
Most bots use a maxlag check of 5-10 seconds. We could pick a baseline value for each wiki based on its average job queue size, and for every X entries over that baseline add a second or so to the lag param; the check would scale with wiki size. Say enwiki (these are made-up numbers, not fact-checked) has an average JobQueue of 1 million items: for every 100,000 entries over that, add 1 second to the maxlag value, so that when the JobQueue reaches 150% of normal, bots back off until it's down to normal levels.
To be clear, do you mean something like setting $jobQueueLag = ( $jobQueueLength - $A ) / $B; (in your example A is 1000000 and B is 100000) and then comparing that to the maxlag parameter in the same way the lag returned from wfGetLB()->getMaxLag() is compared to the maxlag parameter?
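Something like this, as a standalone sketch (the names jobQueuePseudoLag, $baseline, and $perSecond are placeholders for $A and $B, not identifiers from core):

```php
<?php
// Minimal sketch of the proposed formula; $baseline and $perSecond
// would be per-wiki tuning values.
function jobQueuePseudoLag( int $jobQueueLength, int $baseline, int $perSecond ): float {
    // No pseudo-lag at all while the queue is at or below its normal size.
    return max( 0, ( $jobQueueLength - $baseline ) / $perSecond );
}

// With the example numbers above: a 1.6M-item queue, 1M baseline,
// 100k entries per second of pseudo-lag.
$lag = jobQueuePseudoLag( 1600000, 1000000, 100000 ); // 6.0

// Compared against the client's maxlag parameter the same way DB lag is:
// a bot sending maxlag=5 would be told to back off and retry later.
$maxLagParam = 5;
$shouldBackOff = ( $lag > $maxLagParam ); // true
```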
I haven't looked at the actual code, but the formula looks correct. My thought would be for getMaxLag() to apply that formula and, if the result is a positive value greater than 1, add it to the total. So the final maxlag value would be ServerLag + JobQueueLag = MaxLag.
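In code, roughly (a sketch only, assuming wfGetLB()->getMaxLag() yields the host/lag pair as you described, with $A and $B as the tuning values from your comment):

```php
// Sketch of the additive idea, not actual core code: fold the job
// queue pseudo-lag into the total compared against the maxlag parameter.
list( $host, $dbLag ) = wfGetLB()->getMaxLag();
$jobQueueLag = max( 0, ( $jobQueueLength - $A ) / $B );

// Only count the queue once it contributes more than a second of lag.
$maxLag = $dbLag + ( $jobQueueLag > 1 ? $jobQueueLag : 0 );
```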
Ah, so what I said isn't what you meant. Why are you proposing adding the job queue pseudo-lag to the database lag, instead of using whichever of the two is larger?
We probably want some kind of dividing factor to turn job queue lag into something that is useable for maxlag (I think most clients use maxlag=5). For Wikimedia sites a job queue of 100k might be where we want to stop bots, but on smaller sites they might want to hit 1k jobs or something. It's probably easier if we use $lag = max( $adjustedJobQueueCount, $dbLag );
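Something like this, as a sketch ($factor is a made-up number, and $jobQueueLength/$dbLag are assumed to already be in scope):

```php
// Turn the raw queue length into something on the same scale as seconds
// of replication lag, then take whichever signal is worse.
$factor = 100000; // made-up: ~100k jobs reads as 1 "second" of lag
$adjustedJobQueueCount = $jobQueueLength / $factor;
$lag = max( $adjustedJobQueueCount, $dbLag );
// A client's maxlag=5 then trips whenever either the DBs are 5+ seconds
// behind or the queue is 500k+ jobs deep.
```

Using max() also avoids double-penalizing clients when both the databases and the job queue are struggling at once.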
Sorry if that wasn't clear in the initial post. The reason I was thinking of a combined total was to include the JobQueue as part of the value, so that it represented the overall server lag. Returning the larger of the two would also work.
Change 347320 had a related patch set uploaded (by Legoktm):
[mediawiki/core@master] API: Optionally include in job queue size in maxlag
Change 347320 merged by jenkins-bot:
[mediawiki/core@master] API: Optionally include in job queue size in maxlag
Bleh, yet another global configuration variable.
Does this task include setting $wgJobQueueIncludeInMaxLagFactor to true on Wikimedia wikis?
I'd think so, although the value isn't "true". The code can now factor the JobQueue in, but it isn't actually happening yet on Wikimedia wikis because the variable isn't set.
Part of setting the variable will be determining an appropriate value for Wikimedia wikis, and whether one value works for all sites or different sites need different values.
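For example, the eventual config change might look something like this (entirely hypothetical values; picking real ones is the open question):

```php
// InitialiseSettings.php sketch - hypothetical per-wiki factors, not
// a deployed configuration.
'wgJobQueueIncludeInMaxLagFactor' => [
    'default' => false,    // feature stays off until a factor is chosen
    'enwiki' => 100000,    // e.g. ~100k jobs would read as 1 second of lag
    'wikidatawiki' => 300000,
],
```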