Current status
Initial requirements gathering, research, and further discussion led to four proposals, which are listed below. Proposal 4 has been implemented in Parsoid and was deployed to production at the end of December 2021. We are currently fixing bugs and working through issues in Parsoid's support for it.
Page translation
Docs: https://www.mediawiki.org/wiki/Help:Extension:Translate/Page_translation_administration
Requirements
Support translation of wiki pages
- Pages are living documents, so changes must be tracked
- Each document has one source language (the source) and any number of translations
- Each translated version is its own page with a separate history
- The source is annotated to mark:
  - translatable and non-translatable parts
  - the units into which the translatable parts are further divided, each individually translatable
  - non-translatable holes (variables) within units
Translators want:
- Small but meaningful units with minimal amount of mark-up
Authors want:
- Control of what can be translated
- Minimal degradation of editing experience, e.g. minimal amount of new mark-up to understand
Current architecture
Translate uses the ParserBeforeInternalParse hook to mangle the wikitext of translatable source pages. For example:
<languages/>
<translate>
== Heading ==
You have <tvar|1>999</> bugs.
</translate>
This is mangled as follows (note the whitespace handling) for the parser to parse:
<languages/>
== Heading ==
You have 999 bugs.
Translation pages have no such markup. They are generated by Translate.
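The mangling step can be illustrated with a minimal sketch. This is not the Translate extension's actual code: the regular expressions, the function name mangle, and the rule of dropping one newline next to each tag are assumptions made only to show the idea.

```python
import re

# Hypothetical patterns for illustration; Translate has its own parser.
TVAR = re.compile(r"<tvar\|[^>]*>(.*?)</>", re.DOTALL)
OPEN_TAG = re.compile(r"<translate>\n?")
CLOSE_TAG = re.compile(r"\n?</translate>")


def mangle(wikitext: str) -> str:
    """Strip translation markup so that only plain wikitext remains."""
    # Replace each variable with the value it carries in the source page.
    text = TVAR.sub(r"\1", wikitext)
    # Drop the tags plus one adjacent newline (an assumed whitespace rule).
    text = OPEN_TAG.sub("", text)
    text = CLOSE_TAG.sub("", text)
    return text


source = (
    "<languages/>\n"
    "<translate>\n"
    "== Heading ==\n"
    "You have <tvar|1>999</> bugs.\n"
    "</translate>"
)
print(mangle(source))
```

The sketch leaves <languages/> alone, since that tag is handled separately from the <translate> markup.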
Proposal 1: concept of preprocessing
Parsoid would have a new kind of hook for “preprocess” (name up for discussion) that would be run on a whole page of wikitext. Translate would register such a hook and mangle the wikitext before Parsoid starts processing it.
Anticipated issues:
- May break some of Parsoid's assumptions and would probably break html2wt.
Pros and cons:
- Have absolute control over parsing.
- Requires minimal changes to our code.
- No need to worry about balanced DOM.
- Does not harm wikitext editing
- Visual editing would become even worse or impossible
- Does not address the underlying architecture issues going forward
Proposal 2: <translatablepage> wrapper
For translatable pages, either implicitly or explicitly wrap the whole page under another tag, such as <translatablepage>. This tag would do the current mangling: basically removing <translate> tags and converting variable syntax to the actual value.
<translatablepage> would be type 3 without postprocessing.
Pros and cons:
- Have absolute control over parsing.
- Requires minimal changes to our code.
- Does not harm wikitext editing, except a bit for the wrapper tag.
- No need to worry about balanced DOM
- Visual editing could get even a bit worse (but maybe with some effort it could get better, just allow editing the whole contents as wikitext instead of the weird mix it currently has)
- Does not address the underlying architecture issues going forward
Proposal 3: <translate> is an extension tag
We would register type 3 <translate> tags with Parsoid and have them do parseWT( mangle( $input ) ).
Anticipated issues:
- Block-vs-inline rendering based on content may be difficult to implement.
- There may be a lot of "unbalanced DOM" type of mark-up usage on existing pages.
Pros and cons:
- Enables better VE support in the future
- Likely requires the most effort to implement
- Likely requires a lot of effort to support migration
- May require introducing new mark-up to better support cases where unbalanced mark-up is used currently
Proposal 4: <translate> is an annotation
See T261181#6476451. In short, Parsoid treats translation tags as transparent annotations and handles them as such in Parsoid HTML. The new Parsoid markup spec is documented at https://www.mediawiki.org/wiki/Specs/HTML/2.4.0#Annotation_tags and enables editing of translatable content in VE.
Plural parsing
Validation of MediaWiki plural syntax is part of the message validation framework used on translatewiki.net.
Requirements
Validate plural syntax in messages
- Should match the MediaWiki core parsing behavior.
Developers want:
- Simple way to parse (expand) plural syntax to validate the number of forms.
Current architecture
Translate installs a custom parser function that overrides the normal plural function to gather the parser function arguments.
Proposals
Not investigated yet.
Open questions
1. Block vs. inline rendering
Block vs. inline rendering is currently proposed to be a setting per extension tag. How would <translate> keep working while supporting e.g. the following cases:
Some stuff <translate>inline text</translate>

<translate>
A paragraph goes here.

Another here.
</translate>
The current logic is: IF the tag contents contain a newline THEN block context ELSE inline context
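As a minimal sketch, the newline rule can be written as follows. The regular expression and the function names are hypothetical and only mirror the rule stated above; this is not how Translate or Parsoid implements it.

```python
import re

# Hypothetical pattern: non-nested <translate> regions only.
TRANSLATE = re.compile(r"<translate>(.*?)</translate>", re.DOTALL)


def rendering_context(contents: str) -> str:
    """Apply the rule: a newline in the tag contents means block context."""
    return "block" if "\n" in contents else "inline"


def contexts(wikitext: str) -> list:
    """Return the context chosen for each <translate> region, in order."""
    return [rendering_context(m.group(1)) for m in TRANSLATE.finditer(wikitext)]


page = (
    "Some stuff <translate>inline text</translate>\n"
    "\n"
    "<translate>\n"
    "A paragraph goes here.\n"
    "\n"
    "Another here.\n"
    "</translate>"
)
print(contexts(page))  # ['inline', 'block']
```

The decision depends on each individual use of the tag, which is why a single per-tag setting does not cover both cases.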
Having to specify explicitly whether the context is inline or block would, imho, be too much overhead for translation admins.
sourceToDom allows specifying a wrapper tag. Can this tag have attributes? Can it be different for different calls of the tag?
2. Balanced DOM
How to find out what would break currently?
Can a linter be written? In a way that doesn't affect current workflow?
What exactly are the rules of a balanced DOM? For example, could <translate> tags span over multiple sections? Can they stop or start in the middle of a section when spanning multiple sections?
What would the migration process look like?
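One conceivable starting point for such a linter is sketched below. Since the rules of a balanced DOM are still an open question, the rule used here is purely an assumption: a <translate> region is flagged when it contains more than one heading, or when it contains a heading preceded by other text, which means it crosses a section boundary. The names lint, TRANSLATE and HEADING are hypothetical.

```python
import re

# Hypothetical patterns; real wikitext needs a proper tokenizer.
TRANSLATE = re.compile(r"<translate>(.*?)</translate>", re.DOTALL)
HEADING = re.compile(r"^=+[^=\n].*?=+[ \t]*$", re.MULTILINE)

MESSAGE = "<translate> region crosses a section boundary"


def lint(wikitext: str) -> list:
    """Return (line number, message) pairs for suspicious regions."""
    warnings = []
    for match in TRANSLATE.finditer(wikitext):
        body = match.group(1)
        headings = list(HEADING.finditer(body))
        # Text before the first heading means the region starts mid-section.
        leading = body[: headings[0].start()].strip() if headings else ""
        if len(headings) > 1 or (headings and leading):
            line = wikitext.count("\n", 0, match.start()) + 1
            warnings.append((line, MESSAGE))
    return warnings


ok_page = "<translate>\n== A ==\nText\n</translate>"
bad_page = "Intro\n<translate>\nText\n== A ==\nMore\n</translate>"
print(lint(ok_page))
print(lint(bad_page))
```

Because it only reads the wikitext and reports, a check like this could run offline over existing pages without touching the current workflow.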
3. Is <translate> actually a new type of an extension tag, or an extension tag at all?
The tech talk mostly focuses on transforming content. The main point of <translate> is to just annotate parts of a page in a machine readable way, and any effects on parsing are to be considered unwanted implementation details.
4. Parsing plural syntax
Does Parsoid provide a way to do this:
$input = 'Some translation here with {{PLURAL:$1|a house|$1 houses}}';
$output = $parsoid->doSomething( $input );
$output being something like:
[ [ 'a house', '$1 houses' ] ]
Must support multiple plurals in one string. Nice to support nested plurals. Must handle {} inside plural options.
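To make the desired behaviour concrete, here is a small standalone sketch of the extraction. It is not a Parsoid API and does not claim to match MediaWiki core parsing; the function names are hypothetical, only the upper-case spelling PLURAL is recognised, and pipes inside links or other constructs are not treated specially.

```python
MARKER = "{{PLURAL:"


def _read_call(text: str, pos: int):
    """Split the arguments of a call whose body starts at pos.

    Returns (arguments, index after the closing braces), or ([], -1)
    if the call is never closed.
    """
    parts, current, stack = [], [], []
    i = pos
    while i < len(text):
        two, ch = text[i:i + 2], text[i]
        if two == "{{":
            stack.append("{{")
            current.append(two)
            i += 2
        elif ch == "{":
            stack.append("{")
            current.append(ch)
            i += 1
        elif ch == "}" and stack and stack[-1] == "{":
            stack.pop()
            current.append(ch)
            i += 1
        elif two == "}}":
            if not stack:  # closes the call being read
                parts.append("".join(current))
                return parts, i + 2
            stack.pop()  # closes a nested {{...}}
            current.append(two)
            i += 2
        elif ch == "|" and not stack:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    return [], -1


def extract_plurals(text: str) -> list:
    """Return the forms of every plural call, outer calls before nested ones."""
    results = []
    start = text.find(MARKER)
    while start != -1:
        parts, end = _read_call(text, start + len(MARKER))
        if end == -1:
            break  # unterminated call: stop rather than guess
        forms = parts[1:]  # parts[0] is the count argument, e.g. "$1"
        results.append(forms)
        for form in forms:
            results.extend(extract_plurals(form))
        start = text.find(MARKER, end)
    return results


print(extract_plurals("Some translation here with {{PLURAL:$1|a house|$1 houses}}"))
# [['a house', '$1 houses']]
```

A validator could then compare the length of each inner list with the number of plural forms expected for the target language.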