
Temporarily run more refreshLinks jobs on Commons
Open, High, Public

Description

Wikimedia Commons currently has a large backlog of (I believe) refreshLinks jobs, as a result of some edits (and some edit requests that have since been processed) to highly-used CC license templates: see edit, edit, edit, edit, edit and request, request, request, request. Together, these should result in a large percentage of Commons’ files being re-rendered, and having one templatelinks row each removed from the database (CC T343131).

Currently, the number of links to Template:SDC statement has value (as counted by search – this lags somewhat behind the “real” number, as it depends on a further job, but it should be a decent approximation) is only going down rather slowly; @Nikki estimated that the jobs would take some 10 years to complete at the current rate. Can we increase the rate at which these jobs are run? Discussion in #wikimedia-tech suggests this should be possible in changeprop-jobqueue.

Event Timeline

Per IRC discussion, marking as High priority. @AntiCompositeNumber reports that this results in category changes being slow to propagate (Category:Johann Baptist Hops not showing up in Category:Hops (surname) yet).

Currently, the number of links to Template:SDC statement has value (as counted by search – this lags somewhat behind the “real” number, as it depends on a further job, but it should be a decent approximation)

To put a concrete number on it: at the moment, Quarry reports 79,158,475 links (79.1M) and CirrusSearch reports 77,979,457 (77.9M). I’m not really sure why Quarry’s number is higher, to be honest – but at least they’re somewhat close to each other. (If we bump the refreshLinks concurrency, or do whatever else turns out to be the right technical fix for this task, we might have to do the same for some CirrusSearch-related jobs too… I think cirrusSearchLinksUpdate might be the right job type? But I’m not sure at all.)

Also interesting: pages with cc-by-sa-3.0 and *without* SDC statement has value – currently 1,289,684 (1.2M), almost two weeks after I updated the former template to no longer use the latter. (I’m not trying a Quarry version of this because I expect it would be prohibitively expensive.)
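For reference, here’s a rough sketch of how these counts can be pulled programmatically, using the standard MediaWiki search API (list=search with srinfo=totalhits, which is the same CirrusSearch count the searches above report). The search strings mirror the ones in this comment; the exact template spellings are assumptions on my part.

```python
# Sketch: ask the Commons search API (CirrusSearch) for the total number of
# File: pages matching the two searches discussed above. Only the total hit
# count is requested, not the result list itself.
import requests

API = "https://commons.wikimedia.org/w/api.php"

def total_hits(search):
    """Return CirrusSearch's total hit count for a search in the File namespace."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": search,
        "srnamespace": 6,      # File:
        "srlimit": 1,          # we only care about the total, not the results
        "srinfo": "totalhits",
        "format": "json",
    }
    r = requests.get(API, params=params, timeout=30)
    r.raise_for_status()
    return r.json()["query"]["searchinfo"]["totalhits"]

# Template names as quoted in this comment (spelling assumed, not verified):
print(total_hits('hastemplate:"SDC statement has value"'))
print(total_hits('hastemplate:"Cc-by-sa-3.0" -hastemplate:"SDC statement has value"'))
```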

I have processed those edit requests, so at least when a page contains multiple license templates, it will only need to be reparsed once.

Indeed, it looks like the refreshLinks_partitioner rule is easily keeping up with the "upstream" rate of new jobs [0], but the "real" refreshLinks rule on partition 3 (commons) has a rather deep backlog.

Unfortunately, I don't think changeprop offers a way to increase just the concurrency for commons [0], so if we do increase the concurrency for refreshLinks as a whole, we'd need to be comfortable with the aggregate concurrency possibly being 8x that at times (one consumer per partition). That said, since partitioning is by database, the amplification factor is perhaps a bit less concerning.

I am hopeful that some other folks in serviceops might have practical experience with the risks of doing something like this.

[0] In short, there's no support for manual assignment of {topic, partition} to a given consumer.
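To make the amplification concern concrete, a tiny back-of-the-envelope sketch; the concurrency values below are made up for illustration and are not the actual changeprop-jobqueue settings.

```python
# Illustration of the aggregate-concurrency concern: the concurrency setting
# applies per partition consumer, so bumping it scales the worst-case
# aggregate by the number of partitions. Values are illustrative only.
PARTITIONS = 8  # refreshLinks is partitioned by database across 8 partitions

def worst_case_aggregate(per_partition_concurrency: int) -> int:
    """Worst case: every partition consumer runs at its full concurrency."""
    return per_partition_concurrency * PARTITIONS

for concurrency in (30, 60):  # hypothetical current vs. bumped value
    print(f"per-partition {concurrency} -> aggregate up to {worst_case_aggregate(concurrency)}")

# In practice only the commons partition has a deep backlog, so the other
# consumers are unlikely to actually run at this ceiling.
```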

This type of Kafka Consumer lag for that job isn't unheard of. In fact, just recently we had way higher consumer lags for commons specifically.

(screenshot: consumer lag panel)

This has been happening in eqiad as well – we have since switched over to codfw.

(screenshot: consumer lag panel)

That panel is somewhat misleading, btw. The metric used is kafka_burrow_partition_lag, which isn't the same as consumer lag (despite the title of the panel). It represents the number of messages in that partition that haven't been consumed yet (i.e. the backlog), whereas Kafka consumer lag is about the ACKed consumer processing delay (partition offset vs. the consumer group's committed offset).

Anyway, doing an eyeball linear regression of codfw's Kafka consumer lag seems to say ~3-4 weeks at the current rate, but the history of that panel says that this probably isn't the most interesting metric, as it is not unheard of for message processing to suddenly spike from ~500 jobs per second to 5k jobs per second for a short amount of time.
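For what it's worth, the eyeballed regression can be made explicit with something like the sketch below. The lag samples are placeholders standing in for the kafka_burrow_partition_lag series of the commons refreshLinks partition; nothing here is the real data.

```python
# Sketch: fit a line to (time, backlog) samples and extrapolate to the zero
# crossing, i.e. when the backlog would drain at the current rate.
# The samples below are placeholder values, not real measurements.
import numpy as np

samples = np.array([
    # (hours since first sample, backlog in messages)
    (0.0,  60_000_000),
    (24.0, 57_500_000),
    (48.0, 55_200_000),
    (72.0, 52_800_000),
])

t, lag = samples[:, 0], samples[:, 1]
slope, intercept = np.polyfit(t, lag, 1)   # messages per hour (negative if draining)

if slope >= 0:
    print("Backlog is not shrinking; no ETA.")
else:
    hours_to_zero = -intercept / slope
    print(f"Draining at {-slope:,.0f} msgs/hour, "
          f"~{(hours_to_zero - t[-1]) / 24 / 7:.1f} weeks left from the last sample")
```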

As another data point, per the RefreshLinks JobQueue job stats, it seems like refreshLinks is servicing requests at double the insertion rate, and the backlog time is consistent with business as usual.

Unfortunately, I don't think changeprop offers a way to increase just the concurrency for commons

That is correct: the setting is for all partitioners; we can't do it just for commons.

if we do increase the concurrency for refreshLinks as a whole, we'd need to be comfortable with the aggregate concurrency possibly being 8x that at times (one consumer per partition)

We'd definitely want to gather more information about the issue at hand before we bump this setting considerably.

@LucasWerkmeister how was the 10 year estimation calculated?

@LucasWerkmeister how was the 10 year estimation calculated?

I've been using the number of results of a hastemplate: search at two points in time to work out the average rate over that period. I did various searches between the 8th and 10th and also checked things like https://templatecount.toolforge.org/, and it didn't seem to be going down much at all, so I mentioned it to Lucas and decided to see how much one search would drop over a few days:

At 14:44 on the 10th, https://commons.wikimedia.org/w/index.php?search=hastemplate%3A%22SDC_statement_has_value%22&ns6=1 had 79,628,867 results.
At 22:33 on the 14th, it had 79,533,769 results, so it went down by ~95,000 in ~106 hours, around 900 per hour (and 900 per hour would be 21,600 per day, 7.8 million per year).

It seems it has finally started to go faster now though:

At 23:18 on the 21st, there were 77,979,524 results (down by ~1,550,000 in ~168 hours, around 9200 per hour).
At 12:52 on the 22nd, there were 77,593,815 results (down by ~385,000 in ~13 hours, around 30,000 per hour).
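The same arithmetic as a small script, using the result counts and elapsed hours quoted above (the elapsed times are the approximate values from this comment, not recomputed from exact timestamps):

```python
# Hourly rates implied by the hastemplate: result counts quoted above, plus
# the yearly extrapolation behind the original "~10 years" estimate.
observations = [
    # (approx. hours since previous observation, result count)
    (None, 79_628_867),   # 14:44 on the 10th
    (106,  79_533_769),   # 22:33 on the 14th
    (168,  77_979_524),   # 23:18 on the 21st
    (13,   77_593_815),   # 12:52 on the 22nd
]

prev_count = None
for hours, count in observations:
    if prev_count is not None:
        rate = (prev_count - count) / hours
        print(f"{prev_count - count:,} fewer results over ~{hours} h "
              f"(~{rate:,.0f} per hour, ~{rate * 24 * 365:,.0f} per year)")
    prev_count = count

# ~78M remaining results at ~900/hour is where the "~10 years" estimate came from:
print(f"{78_000_000 / (900 * 24 * 365):.1f} years")
```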

Anyway, doing an eyeball linear regression of codfw's Kafka consumer lag seems to say ~3-4 weeks at the current rate

If it continues at that rate, then that seems fine (and a lot more like what I was expecting).

For the record, I just found another set of templates we want to update on most of the affected files (Cc-by(-sa)-layout should bypass the SDC_statement_exist template) and filed edit requests for them.

Just processed those edit requests.