Page MenuHomePhabricator

Create a way to filter only WB-related changes from Commons recentchanges
Closed, ResolvedPublic

Description

Right now one of the ways we fetch change data - and only one available outside WMF production servers - is using recentchanges API. However, currently there is no support in this API for fetching only updates that relate to structured data - all edits to Commons File: space look the same to it, even though most of them aren't related to SDC data and thus are useless for SDC Query engine to fetch and process. It would be nice to have some mechanism that allows only fetch SDC-related changes. Unfortunately, right now the only thing that distinguishes them is comments like /* wbsetlabel-add:1|en */ but these would be expensive to parse and impossible to pre-filter on API level.

Maybe if we could tag such changes or select changes by slot, it would solve the issue.

Event Timeline

This is a special case of being able to filter by MCR slot type, which we'll surely need in the long term.

daniel subscribed.

Maybe if we could tag such changes or select changes by slot, it would solve the issue.

The problem with selecting changes by is that changes can affect multiple slots. Not at the moment with SDC, but in principle. I don't see any efficient way to apply this kind of filtering with the way we currently store recentchanges.

Using change tags for all MediaInfo edits might be a solution. Though it may overload the ChangeTags system. Perhaps the recentchanges system should be migrated to Elastic, that would help with a lot of things :)

I'm untagging MCR and CPT for now. Once the Multimedia team figures out what they want to do, they can ping us or the wikidata team to discuss the technical solution.

Tags should work, at least for now, I think, if I can filter by tag efficiently. There's not a lot of data edits so far, compared to overall Commons edit volume.

Eventually, I'd like to be able to select all edits that touch particular slot (even if it also touches other slots) but from what I see in recentchanges table I am not sure it's possible without changing the table structure. Please correct me if I'm wrong and there's a way.

I also think while short-term it's #SDC issue, longer term it is general Multi-Content-Revisions issue since tools would want to know which edit changed which slot.

In theory it would be something like recentchanges JOIN revision ON rc_this_oldid = rev_id JOIN slots ON slot_revision_id = rev_id AND slot_role_id = whatever AND slot_origin = rev_id. Except slot_origin is not used consistently that way; this would miss rollbacks, for example.

Perhaps the recentchanges system should be migrated to Elastic, that would help with a lot of things :)

OTOH, that would break it for any MediaWiki wikis not using Elastic...

In theory it would be something like recentchanges JOIN revision ON rc_this_oldid = rev_id JOIN slots ON slot_revision_id = rev_id AND slot_role_id = whatever AND slot_origin = rev_id. Except slot_origin is not used consistently that way; this would miss rollbacks, for example.

We'd also have to watch whether that query runs efficiently enough.

BTW, I think you could simplify that: recentchanges JOIN slots ON slot_revision_id = rc_this_oldid AND slot_role_id = whatever AND slot_origin = rc_this_oldid

OTOH, that would break it for any MediaWiki wikis not using Elastic...

IMO we'll have to bite that bullet at some point and change MediaWiki from a PHP-based application to a container-based one, so we can package ElasticSearch, Node.js, BlazeGraph, Lua etc. with it. (I'm hoping to start some conversation about that at TechConf which seems thematically well-aligned.) And yeah, recentchanges is fundamentally poorly suited to being in SQL.

RecentChanges has many flaws (for example, it is not a reliable stream as timestamps are not sequential and it can't be queried by RC ID - see https://gerrit.wikimedia.org/r/c/mediawiki/core/+/302368) but it is the only way to get change stream for a wiki without setting up Kafka, etc. as I understand. So I imagine until we get containers with all that stuff working we're stuck with RC as the only option to get changes in public.

IMO we'll have to bite that bullet at some point and change MediaWiki from a PHP-based application to a container-based one, so we can package ElasticSearch, Node.js, BlazeGraph, Lua etc. with it. (I'm hoping to start some conversation about that at TechConf which seems thematically well-aligned.) And yeah, recentchanges is fundamentally poorly suited to being in SQL.

That has been proposed and opposed by various people for a long time. Being a container-based application limits you to whatever the container-maker includes, and limits you to environments where you can use a container in the first place. I believe one reason MediaWiki is so popular is because it's easy to get started on any generic LAMP stack (and several variations as well).

That has been proposed and opposed by various people for a long time.

I think it's worth revisiting, and there was unanimous agreement on that at last TechConf's relevant session. It's probably a conversation better left to another task, though (my bad for derailing) - for now, the filter has to deal with SQL as @Smalyshev says.

I think it's worth revisiting, and there was unanimous agreement on that at last TechConf's relevant session.

When I look at the minutes at that link I see very little discussion of containers at all, and mostly in the context of large farms rather than individual installs.

Smalyshev moved this task from Incoming to SDAW on the Wikidata-Query-Service board.

We're no longer the Multimedia team (we're "Structured Data" now 😺 )

So say we add a 'structured-data-mediainfo' tag to every revision that has a mediainfo slot - would that be adequate? @EBernhardson ?

So say we add a 'structured-data-mediainfo' tag to every revision that has a mediainfo slot - would that be adequate? @EBernhardson ?

I assume this means every edit that changes mediainfo, not every revision that has mediainfo.

Change 538066 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/WikibaseMediaInfo@master] Always report indexable fields on NS_FILE

https://gerrit.wikimedia.org/r/538066

I started writing a patch for this, but got stuck trying to get mw vagrant back into working order. In this patch MediaInfo essentially always provides its fields to the NS_FILE namespace, but when no mediainfo is present it provides appropriate empty values. I'm not clear on what tagging the revision would do, I assume that must trigger some other process? I've uploaded the patch as is, but I've been unable to test this.

Hi @EBernhardson Did you intend that patch for this task, or T231038 ?

I started writing a patch for this, but got stuck trying to get mw vagrant back into working order. In this patch MediaInfo essentially always provides its fields to the NS_FILE namespace, but when no mediainfo is present it provides appropriate empty values. I'm not clear on what tagging the revision would do, I assume that must trigger some other process? I've uploaded the patch as is, but I've been unable to test this.

Indeed I've completely mixed the two, sorry for confusion!

Change 538249 had a related patch set uploaded (by Cparle; owner: Cparle):
[mediawiki/extensions/WikibaseMediaInfo@master] Add a tag to recentchanges that touch MediaInfo data

https://gerrit.wikimedia.org/r/538249

So say we add a 'structured-data-mediainfo' tag to every revision that has a mediainfo slot - would that be adequate? @EBernhardson ?

Given that this is a functionality that will be needed for pretty much every new MCR role, maybe worth the effort to come up with a generic solution instead of having to work twice?

At the database level, you should be able to detect RC entries that changed a particular slot easily enough, by including a join like

JOIN slots ON (rc_this_oldid = slot_revision_id AND slot_role_id = ? AND slot_origin = slot_revision_id)

If you want to allow for "touched any of multiple slots", that would probably need an EXISTS. I forget the exact semantics of slot_origin, that may or may not find reverts.

You could also exclude changes that touched any particular slot or slots, like

LEFT JOIN slots ON (rc_this_oldid = slot_revision_id AND slot_role_id IN (?) AND slot_origin = slot_revision_id)
WHERE slots.slot_revision_id IS NULL

The potential problem with any of these is performance, particularly in situations where few or no edits pass the filter (cf. T97797). Although we already have such problems with other RC queries, so one more may not be a huge deal.

So ... if the CPT is going to do this in a generic fashion instead of it being done in MediaInfo, can I remove the Structured Data tags?

(FWIW the answer is "no", and CPT *wants it done* in a generic fashion - they haven't actually volunteered to do it)

@Anomie @Tgr perhaps there's a way to always tag with the slot name, if an edit is not to the main slot? Would that be preferable to adding joins to the RC queries?

Using tags has the advantage of more directly identifying the relevant revisions, if the planner decides that gathering all revisions with the tag then filtering by which are in RC (and probably filesorting) is a better plan than taking rows from RC (in order) and filtering by which have the tag. The equivalent querying the slots table is much less likely to be workable since it would have to fetch all revisions with the slot rather than all revisions actually changing the slot. On the other hand, I'm skeptical that long-term there would be few enough revisions so tagged that the tag-first plan would actually be better.

If the planner is going with an RC-first plan, then it seems unlikely to matter much whether it filters by joining change_tag or joining slots. There's no always-good option here, and the maybe-bad parts are the same for both.

Personally I'd lean against adding lots of rows to change_tag, which will be visible in various UIs, if the only use case is filtering on slots edited. Using a wider index (bigint+bigint+smallint rather than big​int+int) for a filtering-join seems less troublesome than adding more to an already-tall table. But you might want to ask the DBAs for their opinion.

@Marostegui do you have a position on which index would be better?

Using tags has the advantage of more directly identifying the relevant revisions, if the planner decides that gathering all revisions with the tag then filtering by which are in RC (and probably filesorting) is a better plan than taking rows from RC (in order) and filtering by which have the tag. The equivalent querying the slots table is much less likely to be workable since it would have to fetch all revisions with the slot rather than all revisions actually changing the slot. On the other hand, I'm skeptical that long-term there would be few enough revisions so tagged that the tag-first plan would actually be better.

If the planner is going with an RC-first plan, then it seems unlikely to matter much whether it filters by joining change_tag or joining slots. There's no always-good option here, and the maybe-bad parts are the same for both.

Personally I'd lean against adding lots of rows to change_tag, which will be visible in various UIs, if the only use case is filtering on slots edited. Using a wider index (bigint+bigint+smallint rather than big​int+int) for a filtering-join seems less troublesome than adding more to an already-tall table. But you might want to ask the DBAs for their opinion.

I agree with Brad.
Once this is finally release, we do have to keep a close eye on the optimizer for the first iterations, as I am worried that it will decide to use the wrong index/plan we we can end up scanning LOTS of rows.
The good thing is that we can test this on 10.1 and on 10.3.

Change 544873 had a related patch set uploaded (by Anomie; owner: Cparle):
[mediawiki/core@master] Add 'slotId' param for recentchanges API query

https://gerrit.wikimedia.org/r/544873

Change 538249 abandoned by Cparle:
Add a tag to recentchanges that touch MediaInfo data

https://gerrit.wikimedia.org/r/538249

Change 544873 merged by jenkins-bot:
[mediawiki/core@master] Add 'slot' param for recentchanges API query

https://gerrit.wikimedia.org/r/544873

rcslot=mediainfo works!