
Add a configuration variable that allows disabling language codes for labels, descriptions, and aliases
Open, High, Public, Feature

Description

Context:

  • Wikibase Core

Problem:
There are various lists defining language codes for Wikibase. Some values in these lists are not suitable for the termbox on Wikidata.org (codes for labels/descriptions/aliases). See T44396 for the general problem. Despite efforts to clean this up by bot, the numbers are currently at 500,000 (June 2021, see T44396#7150919).

Suggested solution:
The idea is to add a configuration variable that allows disabling language codes for the termbox that are not suitable in this context.

This doesn't touch the use of these language codes for other purposes (like lexemes or monolingual strings). For the latter we already have a similar implementation, see DifferenceContentLanguages with DefaultMonolingualTextLanguages.

Example:

  • Sample: "no" isn't used on Wikidata, but is in the domain name for no.wikipedia.org. Wikidata uses just "nb".
  • Other codes for an initial version of the variable: 'bat-smg' (→'sgs'), 'bh' (→'bho'), 'fiu-vro' (→'vro'), 'roa-rup' (→'rup'), 'simple' (→'en'), 'zh-classical' (→'lzh'), 'zh-min-nan' (→'nan'), 'zh-yue' (→'yue'), 'be-x-old' (→ 'be-tarask'), 'shy' (→ 'shy-latn'), 'de-formal', 'es-formal', 'hu-formal', 'nl-informal'

Acceptance criteria:

  • there is a configuration variable that allows disabling language codes for "labels, descriptions, and aliases" (everywhere including in the API, Special:SetLabel, etc.)
  • there should be a default configuration that makes sense for Wikibase instances in general (e.g. including "simple")
  • edge cases are cared for
    • Deletion of existing disabled language code values should still be possible
    • Reverts should still be possible, even if a disabled language code was used in the old revision.
    • When a disallowed code is used as the UI language, item labels, the page title and termbox all use the correct language instead (e.g. UI language de-formal should behave the same as de) (see discussion below)
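The first criterion and the deletion edge case could be sketched roughly as below. This is a minimal Python illustration under stated assumptions, not Wikibase code: `DISABLED_TERM_LANGUAGES`, `DisabledLanguageError` and `validate_term_change` are hypothetical names, and the default list simply reuses the example codes from this task.

```python
# Hypothetical sketch; names are illustrative and not part of Wikibase.

# Assumed default: the codes named in the task description as unsuitable for terms.
DISABLED_TERM_LANGUAGES = frozenset({
    "no", "bat-smg", "bh", "fiu-vro", "roa-rup", "simple",
    "zh-classical", "zh-min-nan", "zh-yue", "be-x-old", "shy",
    "de-formal", "es-formal", "hu-formal", "nl-informal",
})


class DisabledLanguageError(ValueError):
    """Raised when an edit adds or changes a term in a disabled language."""


def validate_term_change(language, new_value, disabled=DISABLED_TERM_LANGUAGES):
    """Allow removals in any language; reject additions or changes in disabled ones.

    `new_value` is None (or empty) when the term is being removed.
    """
    is_removal = not new_value
    if language in disabled and not is_removal:
        raise DisabledLanguageError(
            f"Language code {language!r} is disabled for labels, "
            "descriptions and aliases"
        )
```

Under this sketch, `validate_term_change("no", None)` passes because it is a removal, while `validate_term_change("no", "Oslo")` raises. Running the same check in every write path (API, special pages, UI) is what "everywhere" would require.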

Original:
There are various lists defining language codes for Wikibase. Some values in these lists are not suitable for the termbox on Wikidata.org. See T44396 for the general problem.

The idea is to add a configuration variable that allows disabling such language codes. Deletion of existing values should still be possible.

  • Sample: "no" isn't used on Wikidata, but is the domain name for no.wikipedia.org. Wikidata uses just "nb".
  • Other codes for an initial version of the variable: "bat-smg", "bh", "fiu-vro", "roa-rup", "simple", "zh-classical", "zh-min-nan", "zh-yue", "de-formal", "es-formal", "hu-formal", "nl-informal"

Despite efforts to clean this up by bot, the numbers are currently at 500,000 (June 2021, see T44396#7150919).

This doesn't touch their use for lexemes or monolingual strings. For the latter, see DifferenceContentLanguages with DefaultMonolingualTextLanguages.

Event Timeline

I think the list of disallowed languages can be something like:

    [
        'no', // T284808, use nb
        'be-x-old' // T284808, use be-tarask
    ]

This is just the list used for monolingual language codes, as done in lib/includes/WikibaseContentLanguages.php.
But I think it's best for @Lydia_Pintscher and her team to say how easy it is to implement the option and if the list is usable.

Esc3300 updated the task description.

For backwards compatibility you may want:

  1. We automatically rewrite some languages to other ones (e.g. be-x-old to be-tarask) with a warning
  2. Throw an error if (1) a single request has different content for the be-x-old and be-tarask terms; and probably also if (2) it tries to overwrite existing content using the wrong language code

If such language codes are removed entirely, it may be a breaking change.
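The two backwards-compatibility steps above can be sketched as follows. This is only an illustration of the proposal, not an existing Wikibase behaviour: `REPLACEMENTS`, `ConflictingTermsError` and `normalize_terms` are hypothetical names, and the mapping is a small excerpt of the arrows in the task description.

```python
# Hypothetical sketch of the rewrite-with-warning proposal; not Wikibase code.

# Assumed mapping, taken from the arrows in the task description.
REPLACEMENTS = {
    "be-x-old": "be-tarask",
    "bat-smg": "sgs",
    "zh-yue": "yue",
}


class ConflictingTermsError(ValueError):
    """Raised when the old and the new code carry different values."""


def normalize_terms(terms, existing=None):
    """Return (normalized_terms, warnings) for a {language: value} request.

    `existing` holds the terms already stored on the entity, if any.
    """
    existing = existing or {}
    normalized = {}
    warnings = []
    for language, value in terms.items():
        target = REPLACEMENTS.get(language, language)
        if target != language:
            warnings.append(f"{language} was rewritten to {target}")
            # Case (2): do not overwrite stored content via the old code.
            if target in existing and existing[target] != value:
                raise ConflictingTermsError(
                    f"{language} would overwrite the existing {target} term"
                )
        # Case (1): both codes in one request with different content.
        if target in normalized and normalized[target] != value:
            raise ConflictingTermsError(f"conflicting values for {target}")
        normalized[target] = value
    return normalized, warnings
```

A request that only sends a be-x-old term would come back stored under be-tarask with one warning, while a request that sends different values under both codes would be rejected.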

  1. We automatically rewrite some languages to other ones (e.g. be-x-old to be-tarask) with a warning

Not sure if we should actually encourage using incorrect language codes. This would be a change from the current solution where this doesn't happen.

If such language codes are removed entirely, it may be a breaking change.

For future code additions, this is something one has to bear in mind.

Esc3300 changed the subtype of this task from "Task" to "Feature Request". Jun 17 2021, 7:05 AM

@Lydia_Pintscher what priority should we give to this? I'd suggest Medium to High.

Don't we already have this? I don't think we allow adding terms for simple for example? Or am I missing something?

Don't we already have this? I don't think we allow adding terms for simple for example? Or am I missing something?

Nope, there are 181k simple labels.

Don't we already have this? I don't think we allow adding terms for simple for example? Or am I missing something?

Only for monolingual codes. See https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/Wikibase/+/refs/heads/master/lib/includes/WikibaseContentLanguages.php#254

I think the need was mentioned when the configuration for monolingual texts was implemented, but no one followed up on it.

Manuel renamed this task from add a configuration variable that disallows some language codes for labels/descriptions/aliases (termbox) to Add a configuration variable that allows disabling language codes for the termbox. Jul 20 2021, 9:42 AM
Manuel updated the task description.

@Lucas_Werkmeister_WMDE: You mentioned some edge cases that we need to consider here. In case an important edge case is still missing, could you please let me know or add it directly?

I think the edge case I was thinking of was which language would be used on nowiki when getting the label of an item in “the wiki’s language”, but it looks like that already uses the nb label (tested on my user sandbox there, though that test will stop working the next time someone resets the sandbox item).

But I have a different question: the task description keeps talking about the “termbox”; is this task specifically about the termbox user interface (either the old one or the new / mobile one), or is it supposed to prevent labels, descriptions and aliases in these language codes, everywhere (including in the API, Special:SetLabel, etc.)?

is this task specifically about the termbox user interface (either the old one or the new / mobile one), or is it supposed to prevent labels, descriptions and aliases in these language codes, everywhere (including in the API, Special:SetLabel, etc.)?

Ideally the latter, though if the former is dealt with as a consequence of handling the latter then that's great too. (Perhaps I'm not the only one that uses "termbox" as a two-syllable equivalent to the nine-syllable "labels, descriptions, and aliases".)

Manuel renamed this task from Add a configuration variable that allows disabling language codes for the termbox to Add a configuration variable that allows disabling language codes for labels, descriptions, and aliases. Jul 21 2021, 7:32 AM
Manuel updated the task description.

Thank you @Mahir256 and @Lucas_Werkmeister_WMDE! I have changed the description accordingly.

Looked at in story time, but not picked up today and will be tied into some other language related things in the coming week.

Language codes uses termbox as a synonym for labels, descriptions, aliases.

On a practical level, disabling additions through the API would probably lead to the largest improvement for Wikidata.org.

I'm not really convinced by @Nikki 's addition of today:

  • When a disallowed code is used as the UI language, item labels, the page title and termbox all use the correct language instead (e.g. UI language de-formal should behave the same as de)

If I understand this correctly, it should do some conversion automatically.

  • If this is meant for codes used for labels/descriptions/aliases, I think this should be avoided, otherwise people will keep using invalid language codes. The information can still be displayed.
  • If it's merely for the GUI, this may be another task

But I have a different question: the task description keeps talking about the “termbox”; is this task specifically about the termbox user interface (either the old one or the new / mobile one), or is it supposed to prevent labels, descriptions and aliases in these language codes, everywhere (including in the API, Special:SetLabel, etc.)?

I think it needs to be everywhere, otherwise we're still going to get people adding them on Special:NewItem, via gadgets (e.g. Label Lister), via QuickStatements, via bots, etc.

I'm not really convinced by @Nikki 's addition of today:

  • When a disallowed code is used as the UI language, item labels, the page title and termbox all use the correct language instead (e.g. UI language de-formal should behave the same as de)

If I understand this correctly, it should do some conversion automatically.

  • If this is meant for codes used for labels/descriptions/aliases, I think this should be avoided, otherwise people will keep using invalid language codes. The information can still be displayed.
  • If it's merely for the GUI, this may be another task

UI language means the language you have selected for the user interface. You can change that by clicking on the language name next to your username at the top of the page, and you can also set which language you prefer on the "User profile" tab of the preferences (globally, if you wish).

Right now, UI languages are automatically label languages and the language you select as your UI language is used to select which label to show as the page title, which labels to show in links to entities and which language comes first in the termbox. This ticket involves changing it so that UI languages are no longer automatically valid label languages and therefore one of the acceptance criteria should be that it does not break when the UI language is one which can't be used for labels.
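One way to read that acceptance criterion is a lookup from the UI language to the term language that should be shown instead. The sketch below is a simplified Python illustration, not how Wikibase resolves languages: `TERM_LANGUAGE_FOR_UI`, `term_language_for_ui` and `pick_label` are hypothetical names, and the single `en` fallback is an assumption standing in for a real fallback chain.

```python
# Hypothetical sketch; mapping and function names are assumptions, not Wikibase code.

# Assumed mapping from UI languages that are disabled for terms
# to the term language that should be used instead.
TERM_LANGUAGE_FOR_UI = {
    "de-formal": "de",
    "es-formal": "es",
    "hu-formal": "hu",
    "nl-informal": "nl",
    "no": "nb",
}


def term_language_for_ui(ui_language):
    """Return the language code whose terms should be shown for a UI language."""
    return TERM_LANGUAGE_FOR_UI.get(ui_language, ui_language)


def pick_label(labels, ui_language, fallback="en"):
    """Pick the label to display; fall back to `fallback`, then to None."""
    language = term_language_for_ui(ui_language)
    return labels.get(language) or labels.get(fallback)
```

With this lookup, a user whose UI language is de-formal would see the same page title and link labels as a user with de, and nothing breaks when the UI language itself cannot carry labels.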

The feature request doesn't propose to change existing UI functionality.

It is something that could be improved, also for locales that aren't interface languages, e.g. uselang=fr-be, but this isn't directly linked to the feature request above.

Reverts should still be possible, even if a disabled language code was used in the old revision.

What should happen in this case – should the terms in the disabled language codes be restored, or silently removed?

Currently, Wikibase lets you undo edits that removed statements with deleted properties (i.e. restoring the item to a state where it has a statement for a deleted property), so I guess it would be consistent to allow restoring terms in these languages as well.

Reverts should still be possible, even if a disabled language code was used in the old revision.

What should happen in this case – should the terms in the disabled language codes be restored, or silently removed?

I think it would make sense to ignore any codes which aren't accepted - I don't see any benefit in restoring them just to remove them again. If an edit removed something that you wanted to keep, you can add it using the correct code instead.

I don't think we should change behavior on diffs and reverts for this feature only and potentially create incomprehensible reverts hiding content.

I also don't think disallowed codes should be excluded from a revert, in the interest of keeping behavior consistent with how deleted items/properties are dealt with.
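The two revert behaviours debated above differ only in whether terms in disabled languages survive the revert. A minimal Python sketch of both options, assuming terms are a plain language-to-value mapping; `terms_after_revert` and its `restore_disabled` flag are hypothetical and only serve to compare the options:

```python
# Hypothetical sketch comparing the two revert behaviours; not Wikibase code.

def terms_after_revert(old_revision_terms, disabled, restore_disabled=True):
    """Return the {language: value} terms an entity would have after a revert.

    restore_disabled=True  restores the old revision unchanged, in line with
                           how statements for deleted properties are restored.
    restore_disabled=False silently drops terms in disabled languages.
    """
    if restore_disabled:
        return dict(old_revision_terms)
    return {
        language: value
        for language, value in old_revision_terms.items()
        if language not in disabled
    }
```

The first option keeps the restored revision identical to the old one, so diffs stay comprehensible; the second avoids reintroducing codes that a bot would remove again.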

@Manuel does this need more input or can this move ahead?

@Addshore re T284808#7226880 which ones?

Hi @Esc3300, I did not forget this. I am currently working on finding a solution for the principal challenge of getting rid of invalid data in Wikidata. I will pick it up after I have more clarity on this.

@Manuel is this in any way related to this request? While general questions can be interesting, they tend not to be easily solved in wikis. T51024 (from 2013) might have waited for that too.

If the question is just about invalid labels: once new additions are stopped with the feature, bots should eventually clean it up.

Given that Wikipedia hasn't really figured out how to update language codes they are using, we need to address this for labels at Wikidata anyways.

It seems that there was an attempt to implement this in August 2013: T53071, abandoned in March 2014 after a few code rewrite requests weren't addressed: https://gerrit.wikimedia.org/r/81001