Factor the JobQueue into the maxlag value
Open, Needs Triage, Public

Assigned To

None

Authored By

	Betacommand
	Mar 8 2017, 11:39 PM

Description

Forked from T159618
The edit rate may have been the issue, but we should still use the tools we have (maxlag) to notify bots that the server is under high load. If we add a check to the maxlag value calculation that looks at the number of JobQueue entries and raises the maxlag value accordingly, it would prevent bots from causing this issue again. Regardless of whether it's one bot or several causing the spike, the existing maxlag checks could be used to notify all bots to back off.
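For context, the client side of the maxlag mechanism can be sketched roughly as follows. A bot sends `maxlag=N` with each request and, when the API answers with the `maxlag` error code, waits before retrying. This is a minimal Python sketch, not any particular bot framework's code: the `send` callable and its `(error_code, retry_after, payload)` return shape are hypothetical stand-ins for a real HTTP layer.

```python
import time
from typing import Callable, Optional, Tuple

# Hypothetical transport result for this sketch:
# (API error code or None, seconds to wait before retrying, decoded payload).
Response = Tuple[Optional[str], float, dict]


def call_with_maxlag_backoff(
    send: Callable[[dict], Response],
    params: dict,
    maxlag: int = 5,
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Send a request with a maxlag parameter and back off while the server reports lag."""
    request = dict(params, maxlag=maxlag)
    for _ in range(max_attempts):
        error_code, retry_after, payload = send(request)
        if error_code != "maxlag":
            # Either success or an unrelated error; hand it back to the caller.
            return payload
        # The server is lagged: wait as long as it asked, then try again.
        sleep(retry_after)
    raise RuntimeError("server still lagged after %d attempts" % max_attempts)
```

If the job queue were factored into the reported lag, a client written this way would back off during a queue spike without any change on the bot side.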

Details

	Subject	Repo	Branch	Lines +/-
	API: Optionally include in job queue size in maxlag	mediawiki/core	master	+51 -5

Related Objects

		Status	Subtype	Assigned	Task
		Open		None	T160003 Factor the JobQueue into the maxlag value
		Resolved		Legoktm	T159618 Job queue rising to nearly 3 million jobs

Event Timeline

Betacommand created this task. Mar 8 2017, 11:39 PM

Restricted Application added a subscriber: Aklapper. Mar 8 2017, 11:39 PM

Betacommand added a subtask: T159618: Job queue rising to nearly 3 million jobs. Mar 8 2017, 11:40 PM

Reedy added projects: MediaWiki-Core-JobQueue, MediaWiki-Action-API. Mar 8 2017, 11:42 PM

Reedy added a project: MediaWiki-libs-Rdbms.

How would the number of job queue entries translate into a number of seconds of database lag?

Anomie moved this task from Unsorted to Needs details or plan on the MediaWiki-Action-API board. Mar 9 2017, 2:17 PM

Most bots use a maxlag check of 5-10 seconds. We could pick a baseline value for each wiki based on its average JobQueue size and, for every X entries over that baseline, add a few seconds to the lag value. We would want to scale the check based on wiki size.

Take enwiki as an example (these are made-up numbers, not fact-checked): say it has an average JobQueue of 1 million items. For every 100,000 over that, add 1 second to the value of maxlag, so that when the JobQueue reaches 150% of normal, bots back off until it's down to normal levels.

To be clear, do you mean something like setting $jobQueueLag = ( $jobQueueLength - $A ) / $B; (in your example A is 1000000 and B is 100000) and then comparing that to the maxlag parameter in the same way the lag returned from wfGetLB()->getMaxLag() is compared to the maxlag parameter?
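The formula in this comment can be worked through with the made-up enwiki numbers from the previous one. The sketch below is Python for illustration only, not the MediaWiki code. The function and parameter names are hypothetical, and clamping to zero below the baseline is an assumption added here, since the formula as written goes negative when the queue is shorter than A.

```python
def job_queue_pseudo_lag(queue_length: int, baseline: int, jobs_per_second: int) -> float:
    """Convert a job queue length into a pseudo-lag in seconds.

    baseline corresponds to A and jobs_per_second to B in the formula above.
    Clamping at zero is an assumption of this sketch.
    """
    return max(0.0, (queue_length - baseline) / jobs_per_second)


# Made-up figures from the thread: A = 1,000,000 and B = 100,000.
A = 1_000_000
B = 100_000
for length in (800_000, 1_000_000, 1_200_000, 1_500_000):
    print(length, job_queue_pseudo_lag(length, A, B))
```

With these figures a queue at 150% of normal (1,500,000 jobs) yields a pseudo-lag of 5 seconds, which is the point where a client sending `maxlag=5` starts to be affected.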

I haven't looked at the actual code, but the formula looks correct. My thought would be for getMaxLag() to call that formula and, if the result is a positive value greater than 1, add it to the total lag value, so that the final value would be ServerLag + JobQueueLag = MaxLag.

Ah, so what I said isn't what you meant. Why are you proposing adding the job queue pseudo-lag to the database lag, instead of using whichever of the two is larger?

In T160003#3088276, @Anomie wrote:

Ah, so what I said isn't what you meant. Why are you proposing adding the job queue pseudo-lag to the database lag, instead of using whichever of the two is larger?

We probably want some kind of dividing factor to turn job queue lag into something that is usable for maxlag (I think most clients use maxlag=5). For Wikimedia sites a job queue of 100k might be where we want to stop bots, but on smaller sites the threshold might be 1k jobs or something. It's probably easier if we use $lag = max( $adjustedJobQueueCount, $dbLag );

Sorry if that wasn't clear in the initial post. The reason I was thinking of a combined total was to include the JobQueue as part of the value so that it was a representation of the current server lag. Returning the max of either would also work.
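The two ways of combining the values discussed above behave differently when both are moderately elevated. A small Python sketch of the difference, with hypothetical function names and example figures that are not measurements:

```python
def combined_lag_sum(db_lag: float, job_queue_lag: float) -> float:
    """Additive proposal: ServerLag + JobQueueLag."""
    return db_lag + job_queue_lag


def combined_lag_max(db_lag: float, job_queue_lag: float) -> float:
    """Alternative proposal: whichever of the two is larger."""
    return max(db_lag, job_queue_lag)


def exceeds_maxlag(lag: float, client_maxlag: float = 5.0) -> bool:
    """A client is asked to back off when the reported lag exceeds its maxlag."""
    return lag > client_maxlag


# Example: 3 seconds of database lag and 3 seconds of job queue pseudo-lag.
db_lag, queue_lag = 3.0, 3.0
print(exceeds_maxlag(combined_lag_sum(db_lag, queue_lag)))  # sum is 6.0
print(exceeds_maxlag(combined_lag_max(db_lag, queue_lag)))  # max is 3.0
```

With the sum, two individually tolerable values can push a `maxlag=5` client into backing off. With the max, the client backs off only when one of the two signals is over the limit on its own.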

Change 347320 had a related patch set uploaded (by Legoktm):
[mediawiki/core@master] API: Optionally include in job queue size in maxlag

https://gerrit.wikimedia.org/r/347320

gerritbot added a project: Patch-For-Review. Apr 10 2017, 6:54 AM

Change 347320 merged by jenkins-bot:
[mediawiki/core@master] API: Optionally include in job queue size in maxlag

https://gerrit.wikimedia.org/r/347320

ReleaseTaggerBot added projects: MW-1.29-release (WMF-deploy-2017-04-11_(1.29.0-wmf.20)), MW-1.29-release-notes. Apr 11 2017, 3:00 PM

Krinkle removed projects: MediaWiki-libs-Rdbms, MediaWiki-Core-JobQueue. Apr 18 2017, 10:40 PM

Anomie moved this task from Needs details or plan to Done on the MediaWiki-Action-API board. May 5 2017, 9:18 PM

In T160003#3171448, @gerritbot wrote:

Change 347320 merged by jenkins-bot:
[mediawiki/core@master] API: Optionally include in job queue size in maxlag

https://gerrit.wikimedia.org/r/347320

Is this task resolved? I'm not sure I see what else needs doing.

Bleh, yet another global configuration variable.

Does this task include setting $wgJobQueueIncludeInMaxLagFactor to true on Wikimedia wikis?

In T160003#3246297, @MZMcBride wrote:

Does this task include setting $wgJobQueueIncludeInMaxLagFactor to true on Wikimedia wikis?

I'd think so, although the value isn't "true". The code can now factor the JobQueue in, but it isn't actually happening yet on Wikimedia wikis because the variable isn't set.

Part of setting the variable will be determining what an appropriate value for the variable is on Wikimedia wikis, and whether one value works for all or we need different values for different sites.
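One way to reason about a starting value is to work backwards from the queue size at which bots should stop. The Python sketch below assumes the variable acts as the dividing factor suggested earlier in the thread (pseudo-lag = total jobs / factor); that reading of the merged patch is an assumption and is not verified here. The function names are hypothetical, and the 100k and 1k thresholds are the illustrative figures from the earlier comment.

```python
def factor_for_threshold(stop_at_jobs: int, client_maxlag: float = 5.0) -> float:
    """Factor at which `stop_at_jobs` queued jobs maps to `client_maxlag` seconds."""
    return stop_at_jobs / client_maxlag


def pseudo_lag(total_jobs: int, factor: float) -> float:
    """Assumed conversion for this sketch: job count divided by the factor."""
    return total_jobs / factor


# Illustrative thresholds from the thread, with most clients assumed to use maxlag=5.
large_wiki_factor = factor_for_threshold(100_000)  # 20000.0
small_wiki_factor = factor_for_threshold(1_000)    # 200.0

print(pseudo_lag(50_000, large_wiki_factor))  # 2.5 seconds, under maxlag=5
```

Under these assumptions the factor differs by two orders of magnitude between the two cases, which suggests a single value is unlikely to suit both large and small sites.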

Krinkle removed projects: MW-1.29-release (WMF-deploy-2017-04-11_(1.29.0-wmf.20)), Patch-For-Review. May 25 2017, 1:09 PM

revi subscribed. Aug 13 2017, 7:14 PM

Aklapper removed a subscriber: Anomie. Oct 16 2020, 5:01 PM

Restricted Application added a project: Platform Engineering. Oct 16 2020, 5:01 PM

AMooney removed a project: Platform Engineering. Oct 20 2020, 7:26 PM
