$linkPrefixCharset is too broad in Arabic
Closed, Resolved, Public

Assigned To
Authored By
Esanders
Sep 18 2020, 3:05 PM
Referenced Files
F32374645: capture-20201005-215159.png
Oct 5 2020, 7:53 PM
F32374642: capture-20201005-215118.png
Oct 5 2020, 7:53 PM
F32374605: image.png
Oct 5 2020, 7:19 PM
F32371616: image.png
Oct 2 2020, 8:16 PM
F32371609: image.png
Oct 2 2020, 8:12 PM
F32371599: image.png
Oct 2 2020, 8:00 PM
F32371596: image.png
Oct 2 2020, 8:00 PM
F32370486: T263266-list.txt
Oct 2 2020, 12:36 AM

Description

See T261750 for context.

The current character range is 'a-zA-Z\\x{80}-\\x{10ffff}', which matches any non-ASCII character plus Latin letters. This is too broad, and much broader than the link suffix setting, which is just Latin and Arabic letters.

The range should be changed to Latin and Arabic letters to match the rule for suffixes.
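
To make the difference concrete, here is a minimal Python sketch comparing the two character classes. It assumes the PHP-style `\x{...}` escapes can be translated directly into Python's `\u`/`\U` escapes; the names `BROAD_PREFIX`, `NARROW_PREFIX` and `classify` are illustrative and are not identifiers from MediaWiki.

```python
import re

# Current prefix class: Latin letters plus every non-ASCII code point.
BROAD_PREFIX = re.compile(r"[a-zA-Z\u0080-\U0010ffff]")
# Proposed class: Latin letters plus Arabic letters U+0621 (ء) to U+064A (ي).
NARROW_PREFIX = re.compile(r"[a-zA-Z\u0621-\u064A]")


def classify(ch):
    """Return (matches_broad, matches_narrow) for a single character."""
    return (
        bool(BROAD_PREFIX.fullmatch(ch)),
        bool(NARROW_PREFIX.fullmatch(ch)),
    )
```

Under these assumptions, a punctuation mark such as « matches only the broad class, while an Arabic letter such as ب matches both.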

Event Timeline

Change 628351 had a related patch set uploaded (by Esanders; owner: Esanders):
[mediawiki/core@master] Use the same character set for link prefix as for suffix

https://gerrit.wikimedia.org/r/628351

I have put this in User-notice (because it affects multiple wikis) and asked Dyolf to post to arwiki separately.

Community notified here.

I wanted to estimate the impact of this change before we do it.

To restate: Currently, any non-ASCII character, when used directly before a link, is treated as part of the link. After the proposed change is deployed, only Latin and Arabic letters will be treated as part of the link (a-z, A-Z, ء-ي).

I wanted to find pages that will be affected (where a non-ASCII non-Arabic character is used directly before a link), which I did by running this regexp: [\u{80}-\u{0620}\u{064b}-\u{10ffff}]\[\[ on the database dump "arwiki-20200920-pages-meta-current.xml.bz2" (https://dumps.wikimedia.org/arwiki/20200920/).


This generated a list of 19545 pages. I think this is suspiciously huge and I'd like someone to check whether this makes sense before we go ahead. Full list is here if you want to view it: F32370486
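
For anyone who wants to reproduce this kind of count, the following is a rough Python sketch of the tally, not the script that was actually run. It assumes the wikitext has already been extracted from the dump into a string; `BEFORE_LINK`, `count_prefix_chars` and `describe` are hypothetical names.

```python
import re
import unicodedata
from collections import Counter

# Python spelling of the regexp quoted above: a non-ASCII character outside
# U+0621..U+064A, immediately followed by "[[".
BEFORE_LINK = re.compile(r"([\u0080-\u0620\u064B-\U0010ffff])\[\[")


def count_prefix_chars(wikitext):
    """Tally each affected character that appears directly before a link."""
    return Counter(m.group(1) for m in BEFORE_LINK.finditer(wikitext))


def describe(ch):
    """Format a character as 'U+XXXX NAME' (the name may be unknown)."""
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch, "<unnamed>"))
```

Running `count_prefix_chars` over each page's text and summing the counters would give a per-character total; sorting with `Counter.most_common()` puts the most frequent characters first.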

These are the characters that appear directly before links in those pages (and would no longer be part of the links after this change). Can you please check whether it's reasonable that they won't be pulled into the links? (particularly for the characters with "Arabic" in the name)

Number of occurrences | Character | Character name
8225 |   | U+00A0 NO-BREAK SPACE
8190 | ، | U+060C ARABIC COMMA
7351 | « | U+00AB LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
2927 |  | U+2013 EN DASH
2014 |  | U+2022 BULLET
1451 | َ | U+064E ARABIC FATHA
1230 |  | U+2014 EM DASH
1137 |  | U+2020 DAGGER
683 | ِ | U+0650 ARABIC KASRA
516 |  | U+201C LEFT DOUBLE QUOTATION MARK
443 | ّ | U+0651 ARABIC SHADDA
338 | ؟ | U+061F ARABIC QUESTION MARK
309 | · | U+00B7 MIDDLE DOT
128 |  | U+201D RIGHT DOUBLE QUOTATION MARK
107 | ؛ | U+061B ARABIC SEMICOLON
88 |  | U+2500 BOX DRAWINGS LIGHT HORIZONTAL
85 |  | U+2019 RIGHT SINGLE QUOTATION MARK
85 | ً | U+064B ARABIC FATHATAN
73 |  | U+2190 LEFTWARDS ARROW
60 | ْ | U+0652 ARABIC SUKUN
53 |   | U+3000 IDEOGRAPHIC SPACE
51 | ° | U+00B0 DEGREE SIGN
50 |  | U+30FB KATAKANA MIDDLE DOT
43 | 📻 | U+1F4FB RADIO
41 |  | U+300A LEFT DOUBLE ANGLE BRACKET
38 |  | U+200C ZERO WIDTH NON-JOINER
34 |  | U+FF08 FULLWIDTH LEFT PARENTHESIS
34 |  | U+2660 BLACK SPADE SUIT
34 |  | U+2192 RIGHTWARDS ARROW
30 |  | U+200D ZERO WIDTH JOINER
27 |  | U+2018 LEFT SINGLE QUOTATION MARK
26 |  | U+3001 IDEOGRAPHIC COMMA
23 | ± | U+00B1 PLUS-MINUS SIGN
22 | ٍ | U+064D ARABIC KASRATAN
21 |  | U+25AA BLACK SMALL SQUARE
19 |  | U+FF0C FULLWIDTH COMMA
19 |  | U+2212 MINUS SIGN
18 | » | U+00BB RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
16 | ُ | U+064F ARABIC DAMMA
15 |  | U+2009 THIN SPACE
13 |  | U+2039 SINGLE LEFT-POINTING ANGLE QUOTATION MARK
13 | × | U+00D7 MULTIPLICATION SIGN
12 |  | U+202F NARROW NO-BREAK SPACE
12 | ב | U+05D1 HEBREW LETTER BET
10 | 📺 | U+1F4FA TELEVISION
10 |  | U+3008 LEFT ANGLE BRACKET

(…and 147 other characters appearing less than 10 times each)

Hi @matmarex, I don't know if this answers your question, but I want you to know that many of the characters in this list with "Arabic" in the name are Arabic diacritics (a kind of vowel mark), not separate characters/letters.
For example U+064E ARABIC FATHA:

  • it is this character َ (called Fatha and similar to the vowel a )
  • but it is always used with another letter. For example, when it is added to ب (the letter b without a vowel) it becomes بَ, which is pronounced ba.

Characters from this list which have to be combined with letters are: U+064E ; U+0650 ; U+0651 ; U+064B ; U+0652 ; U+064D ; U+064F
I hope you get the idea!

Most of the stuff in the list is punctuation so that shouldn't be a problem.

We should add combining diacritics. They wouldn't have been a problem with link trails as the diacritic codepoint always goes after the letter codepoint.

Fortunately, from our experience with UnicodeJS, we already have these ranges to hand: it is everything in the 0x06-- range (Arabic) in unicodeJS.wordbreakproperties.Extend:

[ 0x0610, 0x061A ], [ 0x064B, 0x065F ], 0x0670, [ 0x06D6, 0x06DC ], [ 0x06DF, 0x06E4 ], 0x06E7, 0x06E8, [ 0x06EA, 0x06ED ]

These ranges cover all the characters you picked out of that list, which is a good sign.
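
The coverage claim can be checked mechanically. The sketch below copies the ranges quoted above into Python and tests the seven diacritics reported earlier in the thread; `ARABIC_EXTEND`, `REPORTED`, `is_extend` and `to_char_class` are illustrative names, and the character-class output format is an assumption about how one might feed the ranges into a regex.

```python
# Arabic-block entries quoted above, as (start, end) pairs;
# single code points are written as (n, n).
ARABIC_EXTEND = [
    (0x0610, 0x061A), (0x064B, 0x065F), (0x0670, 0x0670),
    (0x06D6, 0x06DC), (0x06DF, 0x06E4), (0x06E7, 0x06E7),
    (0x06E8, 0x06E8), (0x06EA, 0x06ED),
]

# Diacritics singled out earlier in this thread.
REPORTED = [0x064E, 0x0650, 0x0651, 0x064B, 0x0652, 0x064D, 0x064F]


def is_extend(cp):
    """True if the code point falls inside one of the quoted ranges."""
    return any(lo <= cp <= hi for lo, hi in ARABIC_EXTEND)


def to_char_class(ranges):
    """Build the body of a regex character class from (start, end) pairs."""
    parts = []
    for lo, hi in ranges:
        if lo == hi:
            parts.append("\\u%04X" % lo)
        else:
            parts.append("\\u%04X-\\u%04X" % (lo, hi))
    return "".join(parts)
```

All seven reported code points fall inside the single range [0x064B, 0x065F], so `all(is_extend(cp) for cp in REPORTED)` holds.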

Thank you @Dyolf77_WMF, I think that answers my question… Just to be perfectly clear, those characters should be included within the link when written before link syntax?

This makes me wonder whether the current configuration for characters after link syntax is correct. I searched the dump for that situation too.

For example, on this page: https://ar.wikipedia.org/wiki/صوديوم ("sodium"), around the middle, there is a link "نصف قطرذرّة" pointing to https://ar.wikipedia.org/wiki/نصف_قطر_ذري ("atomic radius"). In wikitext this link is written as [[نصف قطر ذري|نصف قطر]]ذرّة. And if you look at it closely, the last character and a half are not actually part of the link. This looks incorrect to me.

Wikitext editor screenshot (zoomed in):

image.png (89×1 px, 26 KB)

Article screenshot (zoomed in) – note how the last character and a half are black instead of blue:

image.png (207×757 px, 15 KB)

It looks like this would also trigger a bug in the link trail functionality. As soon as a diacritic is encountered, no characters past that point will be included in the link:

image.png (261×246 px, 5 KB)

Here I wrote the ba/بَ combined glyph multiple times, but only linked the first one. Note that in the preview at the top, only the first two glyphs are linked (red), as once it sees the fatha diacritic outside of the link it stops link trailing.
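
The behaviour described here can be modelled with a small Python sketch. It is a simplified stand-in for the link trail handling, not the MediaWiki code: the old class is assumed to be a-z plus the Arabic letters ء-ي, and the new class adds the Arabic combining marks from the UnicodeJS list quoted earlier. `OLD_TRAIL`, `NEW_TRAIL` and `split_trail` are hypothetical names.

```python
import re

ARABIC_LETTERS = r"\u0621-\u064A"
# Arabic combining marks from the UnicodeJS Extend list quoted in this thread.
DIACRITICS = (
    r"\u0610-\u061A\u064B-\u065F\u0670"
    r"\u06D6-\u06DC\u06DF-\u06E4\u06E7\u06E8\u06EA-\u06ED"
)

# Assumed old trail class: Latin lowercase plus Arabic letters only.
OLD_TRAIL = re.compile("^([a-z%s]*)(.*)$" % ARABIC_LETTERS, re.DOTALL)
# Assumed new trail class: the same, plus the combining marks.
NEW_TRAIL = re.compile(
    "^([a-z%s%s]*)(.*)$" % (ARABIC_LETTERS, DIACRITICS), re.DOTALL
)


def split_trail(pattern, text_after_link):
    """Return (part pulled into the link, remaining text)."""
    m = pattern.match(text_after_link)
    return m.group(1), m.group(2)
```

For the word ذرّة from the earlier example (ذ, ر, shadda, ة), the old pattern stops after the first two letters because the shadda is not in the class, which leaves the shadda and the final letter outside the link; the new pattern consumes the whole word.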

Seems like I should refresh the page, we've posted the same comment :)

I'm going to add the combining diacritics to the link prefixer and the suffixer.

Testing this on my fake example, this fixes the suffixer:

image.png (57×90 px, 1 KB)

Yes, those diacritics should be included within the link when written before link syntax.

Thanks for pointing out the صوديوم ("sodium") example, but there is a typo in the article text: it is missing a space, because قطر and ذرّة are separate words. But yes, if it were a single word, the whole text should be a link. You are right.

Thanks for checking. I guess I picked an unfortunate example, heh. There are more (not sure how many – I stopped the search after 100, I could generate the whole list if anyone would find it useful), but in many cases, the only character left out of the link is the diacritical mark itself, which is almost impossible to see and not really a big problem.

Change 628351 merged by jenkins-bot:
[mediawiki/core@master] Use the same character set for link prefix as for suffix in Arabic 'ar'

https://gerrit.wikimedia.org/r/628351

We merged the patch with fixes for both prefixes and suffixes. I'm not sure when it will be deployed to production wikis, since the deployment is currently blocked (see T263177); ordinarily it would happen this Thursday (2020-10-08). You can test the new behavior on https://ar.wikipedia.beta.wmflabs.org/ right now (feel free to create new pages there for testing).

For the production deployment, there are a few things to note:

  • Because the HTML rendering of the wikitext is cached, you might still see the old behavior when reading pages, until the caches are cleared (you can do this manually by purging a specific page).
  • If you start editing a page using visual editor before the deployment, and save your edit after it happens, some <nowiki/> tags may be added before/after links whose rendering would change. This should be very rare, please let us know if you see more than a few edits like that.

Description of the final version of the change:

Setting | Previously | After the change
Link prefixes | a-z, A-Z, and any character other than ASCII characters | a-z, A-Z, Arabic letters ء-ي (marked in orange), and some Arabic diacritics (marked in yellow)
Link suffixes | a-z, Arabic letters ء-ي (marked in orange) | a-z, Arabic letters ء-ي (marked in orange), and some Arabic diacritics (marked in yellow)

image.png (1×1 px, 249 KB)

(this image is based on https://en.wikipedia.org/wiki/Arabic_(Unicode_block)#Block, you can find the names of each character there)

Hello,

Sorry for the late comment, but can this task also cover "و" (in English: "and")? It should not be included in the link:

Present

capture-20201005-215118.png (179×127 px, 2 KB)

Expected

capture-20201005-215159.png (198×185 px, 3 KB)

@alanajjar It would be possible to implement, but that seems like a much bigger change and I think it should have a separate task.

Also, note that we can't implement different behaviors for letters at the beginning and in the middle of words. For example, if we removed "a" from link prefix, only the phrase "test" would be linked in both of the following:

a[[test]]
ba[[test]]

I don't know if that would be a problem with "و" in Arabic. (Sorry for using an English example, but I don't want to mess up Arabic text, since I can't read it.)
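
The point about the two wikitext examples can be sketched in Python. This is a simplified model of prefix extraction, not the MediaWiki parser: it takes the text before `[[` and pulls in the longest run of prefix characters that ends right at the link. `split_prefix`, `WITH_A` and `WITHOUT_A` are hypothetical names, and `WITHOUT_A` is an invented class with lowercase "a" removed.

```python
import re


def split_prefix(text_before_link, prefix_chars):
    """Return (text left outside the link, prefix pulled into the link)."""
    # Longest run of prefix characters ending exactly at the link start.
    m = re.search("[%s]*\\Z" % prefix_chars, text_before_link)
    return text_before_link[:m.start()], m.group(0)


WITH_A = "a-zA-Z"
WITHOUT_A = "b-zA-Z"  # hypothetical class with lowercase "a" removed
```

With `WITHOUT_A`, both "a" and "ba" yield an empty prefix, because the character directly before the link is excluded and the run stops there; the class has no notion of where a word begins. With `WITH_A`, "ba" is pulled into the link in full.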

Good to know, thanks! I ran a first test, and it seems the </nowiki> tag is no longer added when using the right arrow symbol.
I also made a couple of links on this test page, and I think they are correct!
I also agree with @alanajjar about the Arabic Waw و ("and"): Arabic typography says there should be no space after Waw, which is why the link should cover only the word, without the Waw (as in the second image by Alaa).

We can remove Waw from the list of characters, but I would want to check a few things first:

  • Doing so would mean that this character never gets pulled into the link, regardless of context. I don't know if this character can ever be used inside another word, or have a different meaning, so I don't know if this is a problem.
  • This would apply to any wiki that has a content language of 'ar'. I don't know if there are any wikis where they write in a dialect but use this language code, but it's worth making sure that the meaning of this character is reasonably global.
  • Waw was included in the link trail set prior to this task, so this would be a change to that behaviour.

Note that "this character never gets pulled into the link, regardless of context" would also cover the case where Waw is combined with a diacritic.

I'm boldly resolving this task for now.

@Dyolf77_WMF and @alanajjar, would it be accurate for me to assume one of you will file a new ticket to have the Arabic (and) Waw و removed from the list of $linkPrefixCharset?

Done here T268631