Page MenuHomePhabricator

Replace use of removed hook InternalParseBeforeSanitize
Closed, ResolvedPublic

Description

Unit tests failure:

Use of InternalParseBeforeSanitize hook (used in VariablesHooks::onInternalParseBeforeSanitize) was deprecated in MediaWiki 1.35. [Called from MediaWiki\HookRunner\HookContainer::run in /workspace/src/includes/HookRunner/HookContainer.php at line 126]

Tests also failing for dependency extension: RegexFun and Arrays

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

This can not be fixed by the extension without dropping features. See T236809 for further information.

Parsoid will probably be the doom for this extension.

Parsoid will probably be the doom for this extension.

It need not be necessarily. Most extensions will need some work to be compatible with Parsoid. In some cases, it might be become much harder to provide equivalent functionality since Parsoid does things differently. But, there may be scenarios where wikis might choose functionality over performance and in that mode, it may be possible to support features that wouldn't be possible otherwise.

https://www.mediawiki.org/wiki/Parsoid/Extension_API is where we are working out a Parsoid extension API that extensions will need to work with. It is still really early work in progress but we expect to have a first working version for extensions to play with in the coming month or two.

I took a quick look at the Variables extension and as far as I can tell, even though you might think (on the surface) that the "No_support_for_global_ordering" would mean that your extension won't work with Parsoid, it actually will. In a future where Parsoid starts doing more aggressive performance optimizations, using extensions like this might potentially disable those performance optimizations (unless we figure out appropriate solutions). But, let us cross that bridge when we get there.

As far as I can tell, Variables extension can implement #var_final in Parsoid by registering a DOM post-processing pass that inspects the DOM and updates the DOM appropriately. This maybe potentially be less performant in Parsoid as a result, but if we can provide options in Parsoid for extensions to add custom annotations to the DOM, you can use custom annotations to fetch just the DOM nodes that match them (which will likely be more performant than full DOM walks). So, at worst, this extension will be less performant in Parsoid but will still continue to function.

So, that just leaves us with the 'InternalParseBeforeSanitize' hook. Parsoid is highly unlikely to support this (because the parsing pipeline is different). So, the simplest solution would indeed be for Variables to ask for its content to be sanitized before insertion into Parsoid's final DOM. We can provide sanitization options in Parsoid's extension API so you don't have to explicit sanitize anything yourself. But for now, while Parsoid isn't the default yet, we could consider removing the deprecation on this hook but add clear guidelines that this hook will not be present when we switch over to Parsoid and extensions will have to implement their functionality differently. That said, @cscott and I will chat about this in our team meeting to see what the best approach is.

So TLDR is: I haven't seen anything in the extension that is fundamentally incompatible with Parsoid. But, any extension that relies on global ordered state can gets its work done by registering a DOM post processing pass that inspects the "final" DOM and update it appropriately (in your case, processing the #var_final uses).

Considering the discussion in T236809, I am against doing anything here. There is no suitable alternative, as InternalParseBeforeLinks will probably recieve the same treatment, and the deprecation should be removed soon.

Removing the deprecation will not 'fix' anything. I think perhaps T253768: No easy way to suppress hard-deprecation warnings for hooks would solve your problem, if you believe that #var_final is not worth reimplementing for Parsoid and you're content to just let it die once MediaWiki switches parsers. But deprecation is 100% the correct way to indicate to authors of new code that this hook should not be used because it will not be supported in the future.

There are other hooks that can be used to do roughly the same thing as #var_final, including ParserAfterTidy.

The thing is that i am in over my head here. I was able to pick up and modernize this extensions regarding best practices, extension registration, and do some basic bug fixes and so on. However, the nuances of the MediaWiki parser and Parsoid in particular escape me. (And the parser documentation is horrible.) And these are points with significant security relevance too – at the moment I am wondering if the migration from ParserBeforeTidy to ParserAfterTidy has not introduced various security holes, since the output of these extensions is apparently no longer tidied. I do not want to be reponsible for introducing serious security vulnerabilities into thousands of wikis.

Currently, Variables introduces a central store in the parser (probably, using ParserOutput would work as well) which can be freely read from and written to during a page parse. Hence, a change at a given point in the wikitext can affect arbitrary later points by changing the var values at these points. I would like to know how you think this can be made work with Parsoid considering the "No support for global ordering" heading on the Extension API page. If this feature can not recovered using Parsoid, all efforts to work around the current deprecations will be in vain anyway.

This behavior can actually be used quite effectively to cache constant, expensive template queries done may times within a single page. Using other extensions like Loops additionally, it can become even more potent.

"No support for global ordering" heading on the Extension API page

We should probably clarify that. That means that we don't guarantee that extension invocations will execute in the same order that they show up on the page. But, there is a global DOM pass that gives extensions a global ordered DOM. So, for example extensions can walk the DOM and get a global order.

However, if extensions introduce state that depends on the page parsed in linear order, yes, we cannot support that. So, this really is a difference between sequential parsing vs. out-of-order parsing (and in some cases, parsing could be skipped on segments as part of perf improvement strategies).

Which hooks are deprecated or undeprecated will not make a difference to that strategy. If this is a real blocker, then, we'll have to introduce some kind of page property OR some kind of extension config option that will force full sequential reparse for any page that has that flag or extension config set.

Legoktm subscribed.

I'm re-opening this based on discussion in T236809: Refactor Parser.php to allow alternate parser (Parsoid), where it seems that the declining of this task was based on a misunderstanding.

If this is a real blocker, then, we'll have to introduce some kind of page property OR some kind of extension config option that will force full sequential reparse for any page that has that flag or extension config set.

To me that sounds like the Parsing Team is open to modifying Parsoid to support your use case.

I would really appreciate clarification on this point, since right above it is stated "However, if extensions introduce state that depends on the page parsed in linear order, yes, we cannot support that.". At the time I read this, this lead me to the conclusion that this is more of a theoretical possibility which has no real chance to be implemented.

But rereading this again, this might actually be a misunderstanding.

Sorry about using imprecise language in my comments across phabricator tasks.

High-performance and other functionality that depends on independent / decoupled parsing of fragments of a page cannot, by definition, support linear in-order parsing (To clarify: that does not mean hat the final page output won't be in the right order. It will be.). But, if we provide an opt-out mechanism in Parsoid for that to force a linear-ordered full reparse of a page, then I suppose that does mean that Parsoid supports such pages (but disable certain features / functionality on the page as well). Once again, we haven't worked through the feasibility and complexity of providing such support (that is a separate technical investigation), but hopefully that answers your question. And, all things considered, once again so we aren't offering false hopes, we'll lean towards breaking incompatible functionality in the pursuit of higher-performance and future-proofing the technology. But, we will not go down the route of breaking things simply because we can. We have attempted to minimize impacts to the extent it is practical to do so.

I believe there are two issues that need solving:

  1. How to implement Variables as a Parsoid extension, and if new Parsoid functionality is needed for that
  2. What to do in the meantime for the legacy Parser, as the InternalParseBeforeSanitize hook is hard deprecated but still in use

I believe there are two issues that need solving:

  1. How to implement Variables as a Parsoid extension, and if new Parsoid functionality is needed for that

Yes. However, the fundamental question is not really related to the hook discussed here, since the core issue is modifying ParserOutput extension data in place during parsing. I have created T282499: Consider whether Parsoid will support forced linear parsing. to durther discuss this point.

  1. What to do in the meantime for the legacy Parser, as the InternalParseBeforeSanitize hook is hard deprecated but still in use

I still think undeprecating is not the worst idea, since "still in use by Semantic MediaWiki" was a sufficient point to not deprecate InternalParseBeforeLinks exhibiting precisely the same issues. Alternatively, silencing the warning would probably be fine as well.

var_final will be a special, independent problem from the first (and the problem this task is actually about).

While from past conversation with the Parsing-Team in this task, I am confident a solution in Parsoid exists for this case, there has not been an actual migration path until now. Which makes it hard to act on the deprecation warning in a sensible way…

Change 721950 had a related patch set uploaded (by Universal Omega; author: Universal Omega):

[mediawiki/extensions/Variables@master] [Variables] Replace deprecated InternalParseBeforeSanitize hook

https://gerrit.wikimedia.org/r/721950

It was recommended in one of the live channels that users of Variables should share their use cases with maintainers here on this phabricator thread. That said, that's what I'm hear to do.

I operate and maintain a semantic mediawiki server at NASA for a multitude of enterprise knowledge management and operational process management purposes. Over the years we have developed a large ensemble of wiki-based tools based on collections of Forms, Templates, Categories, Properties that all work together for specific (and sometimes complicated) purposes .. these wiki-tools are every-day becoming more and more the back-bone of our internal business processes and the "Variables" extension is just one of a number of extension we are critically dependent on. That said, we are critically dependent on it. I suppose if we had to, we could move our template logic to Lua, but I would prefer not to because what we have works really well using the #vardefine, #vardefineechom and #var. I would never have guessed that this would be one of the extensions that the value of would be questioned. From out perspective, it's a must have. Please don't let this extension fall by the way-side. Thank you for reading.

Echoing @Revansx , here's my primary use case:

Nearly all of my code on Leaguepedia (https://lol.fandom.com) is in Lua already; however, I still make extensive use of the Variables extension. The primary use is as a singleton object to coordinate creating indices for primary/foreign keys for stored values in Cargo. For example, if I need a "current row" key to join 3 tables together, that will be of the form {{FULLPAGENAME}}_{{#var:counter}}, and that counter must be globally incrementing across separate module calls.

There is no way to get around this, even putting all of my individual row stores into a single Lua module, because I need individual h2 tags splitting up sections so that editors are able to edit only a single section at a time. Without these anchors, the process would be prohibitively overwhelming. Here's an example page: https://lol.fandom.com/wiki/Data:LCS/2021_Season/Spring_Season

I wrote a blog post about some of the challenges of dealing with these keys here: https://river.me/blog/primary-key-caching-adventures/ - it's also not always as simple as just incrementing a counter by 1, or even as close to that.

Other, less critical use cases:

  • Adding line numbers to Cargo queries - https://help.fandom.com/wiki/Extension:Cargo/customizing_tables#Special_case:_adding_line_numbers - this is something I do a lot when a query doesn't require an entire module, and losing access to this functionality would add a large burden to create individual modules for what should be extremely simple queries (though I suppose this could be functionality written into the Cargo extension)
  • Caching results of Cargo queries from the top of the page to the bottom - I could simply redo the query, sure, but that seems rather counterproductive when the goal is to improve performance (as an example, this module reuses the list of redirects to a page - https://lol.fandom.com/wiki/Module:PlayerPageEnd - (the list is deliberately done through Cargo and not another tool like DPL because I want to be using the right set of equivalences for special characters, capitalizations, etc) - in a lot of cases I could maybe reorganize page content, but I'd rather not in most
  • When debugging Lua modules, I often use variables to log to {{#var:log}} and then print the contents of the log on the page afterwards

I am curious: Where has this topic come up recently?

@MGChecker

I'm following up on @Revansx, who may have been replying here due to me bringing this up on a live channel the same day he posted based on reviewing the documentation for all of the extensions we use as I prepare for a major version upgrade to 1.36. We use Variables extensively on our five Guild Wars 1 and 2 wikis.

I do not do any of the editing, but I build and manage the wiki platform for our communities. To summarize what I was told by a few of our main editors, we absolutely depend on this extension "on basically every page we have that's not a one-liner" and that "we use this extension in a huge number of templates[, though] Var_final is much less frequently used – we mainly use it on the achievement overview pages." One editor commented that "I am not sure there will be a work around for us if it has to go."

In general, Ext:Variables is incredibly useful for optimizing page performance when using SMW/Cargo, since it allows query results to be cached and reused. This isn't something that comes up constantly, but it happens often enough to have a sitewide impact for us on Yugipedia.

T282499 is related and we'll probably get around to that within the next 6 months. But, we'll discuss this on the team and see what action, if any, we need to take in the interim.

Thank you all for moving the discussion forward. As a bit of 'power user' of Variables myself, in conjunction with other extensions (Parser Functions, Arrays, Semantic MediaWiki, etc.), I can certainly add to the list of use cases.

But ... I'm a little confused now about the scope of this task/discussion. InternalParseBeforeSanitize is relevant only to the #var_final parser function from Variables and a number of other extensions (I don't think SMW uses the hook any longer), but we've gone on to have a fruitful discussion about the future of the Variables extension in general in the light of development on Parsoid and page parsing in non-linear order. (And there's also T282499 and T236809. )

Is this the right place to continue?

Change 868399 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[integration/config@master] Zuul: [mediawiki/extensions/Variables] Use extension-broken

https://gerrit.wikimedia.org/r/868399

Change 868399 merged by jenkins-bot:

[integration/config@master] Zuul: [mediawiki/extensions/Variables] Use extension-broken

https://gerrit.wikimedia.org/r/868399

It seems to me the issue is Parsoid introducing parallel processing across all the individual parts of parsing during a page render. This way it can memoize the results of these individual parts and potentially run them out-of-order. In order for it to accomplish such, Parsoid has to know all of the inputs to any part of the parse and it assumes that anytime the inputs are the same the output will be the same.

Any part of the parse (e.g., parser functions introduced in an extension) that wants to store and modify data behind the back of such an out-of-order parser will cause the page render to potentially fail to create the same output as a purely sequential parser.

Since we are talking about parsers, we typically think of just the text in and text out of a bunch of functions that modify such (parsing functions). An out-of-order parser like Parsoid wants all of these parsing functions to be referentially transparent without hidden side effects.

The solution is to make each parsing function referentially transparent by either storing your results in the output text or having some other means of providing the parser with information about what you are using as input and stashing as output.

Why not legitimize variable storage by adding such an API to the parser itself? This is akin to the same issues seen in out-of-order instruction execution seen in modern microprocessors where each instruction can read and write to memory and registers. If we create a variable storage API in the parser (it would have to be designed just right), it can then do more advanced things like register renaming to analyze and eliminate data dependency "hazards". The parser can then also cache these variables and know when and how they change and how to schedule reparsing of all the individual parse parts.

I am sure it will not be a simple thing (very few real world parsers are "simple") but this would give current parser function extensions affected by this, a path to porting their changes by making use of this API to store their variable values in a way that the parser can know all of its input and outputs and thus allowing it to "know" all of the data dependencies across page renders.

That would be ideal, and I can see at least one potential side-benefits to that approach, although it would depend on exactly how Parsoid works. I'm thinking specifically of dynamic functions. With an approach like you're outlining, the document would inherently be aware of whether parts of it change based on other parts, so if you had a dynamic function that had no effect on any other part of the document, that bit could remain uncached and updated at every refresh, while the remainder of the document could still be drawn from the parser cache.

The one problem I see with this is the complexity of such a design. The more complex the design is, the fewer people there will be who can maintain it. Since I haven't looked at Parsoid, I have no idea how it and the proposed design compare.

I second that the variable extension functionality should be incorporated in to parsoid or drop the whole parsoid improvement ideas that call for removing the variable extension. For me the variables extension is much more important than the parsoid improvement ideas discussed here. These improvement ideas are IMHO introducing more complexity than solving problems for end users if the current way of implementation is enforced. The worst part for me is currently that on each and every new wiki i create the first thing that pops up is the deprecation message in a fully useable wiki - this makes it look as if the wiki would be broken which it isn't at this point. De-Deprecation would be much appreciated by me.

I would like to add this extension to our wiki (bg3.wiki) but the big red warning on the extension page, saying that its future is unclear, made me worried and led me here.

From the discussion above, it sounds like a lot of wikis rely on this heavily, so we want to make sure it keeps working in the future, is that correct?

In my opinion, the functionality added by this extension is invaluable. So far, in complicated templates, I've avoided having to use it by calling to another helper template to which I pass the calculated value. But I've already had challenges with that approach, because if one "variable" will rely on the value of another, you have to create yet another template and call to that one. So, your template graph has to correspond to the graph of how your variables depend on each other, meaning you could end up with a ton of templates calling to each other which quickly makes things unwieldy and causes a lot of small helper templates being created, when everything could have fit neatly into one template using the Variables extension.

I know it's recommended to use Lua for complicated things needing lots of variables. I'm a programmer, and would *normally* prefer a proper programming language anyway, instead of the wonky template language. However, I have to say, after using the template language for some complex templates, I greatly appreciate how well it works for constructing a page. It could be considered a DSL for generating wiki pages, and works really well in that regard. I suppose it's also easier to get into if you've never used Lua and/or are not a programmer.

(If the template language supported a cleaner syntax than {{{foo}}} for referring to parameters, that would be perfect, but that's another topic. Even for a Lisp appreciator, that's too many braces!)

I'm trying to get up to speed on this whole issue, after asking to have the corporate wiki I manage updated to a newer version and being informed that the big scary deprecation message appeared and we mustn't upgrade to a version that doesn't support our primary needs. I'm in basically the same situation as @Revansx; I could give more information about my use case if that's needed, but basically we use Cargo and Parser Functions a lot. I hope a solution is possible (and maybe even in progress?) that would allow my wiki to update to a current version without also requiring my whole team of only-moderately-technical folks to learn Lua and then rewrite all of our templates. If there's anything I can do to help make that happen, I'm interested!

Pppery renamed this task from Replace use of deprecated hook InternalParseBeforeSanitize to Replace use of removed hook InternalParseBeforeSanitize.Nov 18 2023, 12:23 AM

Change 504488 had a related patch set uploaded (by Func; author: Universal Omega):

[mediawiki/extensions/Variables@master] Use the default StripState for '#var_final'

https://gerrit.wikimedia.org/r/504488

Change 721950 abandoned by MGChecker:

[mediawiki/extensions/Variables@master] [Variables] Replace deprecated InternalParseBeforeSanitize hook

Reason:

Superseded by I320531a6bd

https://gerrit.wikimedia.org/r/721950

Change 504488 merged by jenkins-bot:

[mediawiki/extensions/Variables@master] Use the default StripState for '#var_final'

https://gerrit.wikimedia.org/r/504488

I got hit by this today - the deprecation message even shows up in API calls that should only create JSON results. I did not find any way to switch it off yet. I badly need a workaround and still hope that the whole parsoid idea will be dropped in favor of compatibility. Why would people have to use Lua? Python would be the language of choice these days.

The workaround i found with ChatGPT-4 today is:

class CustomMWDebug extends MWDebug {
    public static function addDeprecationFilter($pattern, $callback = null) {
        self::$deprecationFilters[$pattern] = $callback;
    }
}
CustomMWDebug::addDeprecationFilter('/Use of InternalParseBeforeSanitize hook/');

which can be put right into LocalSettings.php

it's strange how the original code prevents adding deprecation filters for mere mortals

Please keep it on topic. This task isn't about the deprecation, it's about fixing the deprecation warning in the Variables extension. In order to have situation you describe, I believe your wiki has to be configured both with https://www.mediawiki.org/wiki/Manual:$wgDevelopmentWarnings set to true and PHP to print notices during web requests. This is definitely not a configuration a production wiki should use. Notices should be directed to a log file.

@Nikerabbit - thanks for the hint - why did the message show up in the JSON response and made mwclient choke? see https://github.com/mwclient/mwclient/issues/296#issuecomment-2064066084 - or better - where should i discuss such a topic?

These are not the default configuration settings for MediaWiki or PHP (to my knowledge at least) so your wiki has been explicitly configured to exhibit this behavior. You should just get that fixed. Please see https://www.mediawiki.org/wiki/Manual:How_to_debug and for most parts do NOT do what it says.

@Nikerabbit - thx for the hint - my fear is that the only speciality here is that the wiki is running in a docker container. The wiki setup is generated by https://github.com/WolfgangFahl/pymediawikidocker which explicitly tries to create a "clean" environment. The templates (general and per version) in https://github.com/WolfgangFahl/pymediawikidocker/tree/main/mwdocker/resources/templates are cut&paste from an out-of-the box installation without any extensions. The LocalSettings.php changes are defined per extension in https://github.com/WolfgangFahl/pymediawikidocker/blob/main/mwdocker/resources/extensions.yaml - and except for the current issue there is no fiddling with debug parameters activated at this point - i added some commented out stuff yesterday but that's it.

Func added a subscriber: Universal_Omega.

@WolfgangFahl regarding

I got hit by this today - the deprecation message even shows up in API calls that should only create JSON results. I did not find any way to switch it off yet. [...]

PHP's display_errors setting is On by default only in development environments. You can set it to Off in your php.ini or use production settings directly. That should make all such messages go to the serverside logs, rather than to the webserver outout.

In general you should scan your MediaWiki settings for $wgDevelopmentWarnings = true; and remove it, or set it to false explicitly.