Quotes turn on exact term matching, but searching for ~"daß" also finds pages containing "dass" on de.wikipedia. Compare this with the behaviour of Google, which finds 28,200 results for "daß" and 1,630,000 for "dass" on site:de.wikipedia.org.
Description
Details
Subject | Repo | Branch | Lines +/- | |
---|---|---|---|---|
Unpack German, Portuguese, and Dutch Elasticsearch Analyzers | mediawiki/extensions/CirrusSearch | master | +939 -234 |
Related Objects
- Mentioned In
  - T284185: Reindex German, Dutch, and Portugese Wikis to Enabled Unpacked Versions
  - T272606: [EPIC] Unpack all Elasticsearch analyzers
  - T226812: de.wikipedia: search for "Bedusz" does not find "Będusz"
  - T182856: Add current issues to "exactly this text" helptext
  - T182447: Search in "" does not distinguish between "ss" and "ß"
  - T104814: Appropriately ignore diacritics for German-language wikis
  - T147636: German ß (sharp s, eszett) triggers confusing behavior in insource: regular expressions
  - T87112: Wrong stemming in German
  - T90089: ~"Maße" should not match "Masse"
- Mentioned Here
  - T281379: Unpack German, Portuguese, and Dutch Elasticsearch Analyzers
  - T90089: ~"Maße" should not match "Masse"
Event Timeline
Another example: ~"Maße" and ~"Masse" (T90089). They have different meanings in German.
@FriedhelmW: Did you reproduce this problem on a local MediaWiki instance with its default search backend, or why did you add the project "MediaWiki-Search" to this task? Clarifying comment welcome. :)
I have now tested on another wiki, and MediaWiki-Search is not affected. Thank you for clarifying!
This sounds like a Unicode normalization issue. Some details can be found on the Elasticsearch site. I thought we were using [[https://github.com/wikimedia/mediawiki-extensions-CirrusSearch/blob/master/includes/Maintenance/AnalysisConfigBuilder.php#L229-L234|nfkc]] normalization, which in my understanding should not normalize ß to ss, but it looks like that is what is happening here and in T90089.
When using quotes, no normalization or stemming must be done (except for converting to lower case). Of course, there needs to be an unnormalized index, too.
When quotes are used in CirrusSearch, exactly the same fields (specific analyzed versions of the text) are queried in Elasticsearch. Without moving the parsing into PHP, we don't really have control over how that happens.
Filed https://github.com/elastic/elasticsearch/pull/20814
Not sure it's the right approach for us, but it will give us a chance to blacklist some chars, which is not currently possible.
The reason ß is folded to ss is that nfkc_cf is a case folding technique, designed to map words to a single form regardless of the input:
DASS => dass
daß => dass
The question is: should we track a per-language list of chars to exclude from nfkc_cf normalization, or should we simply disable nfkc_cf and use a naive lower-casing approach for the plain field (replace nfkc_cf with [nfkc, lowercase] for plain)?
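To make the difference between the two options concrete, here is a minimal Python sketch. It only approximates the ICU filters using the standard library: `nfkc_cf_like` and `nfkc_lowercase` are hypothetical helper names, and `str.casefold()` followed by NFKC is a stand-in for nfkc_cf, not the exact ICU algorithm.

```python
import unicodedata


def nfkc_cf_like(token: str) -> str:
    """Rough stand-in for nfkc_cf: full Unicode case folding, then NFKC."""
    return unicodedata.normalize("NFKC", token.casefold())


def nfkc_lowercase(token: str) -> str:
    """The proposed alternative for the plain field: NFKC, then lower-casing."""
    return unicodedata.normalize("NFKC", token).lower()


if __name__ == "__main__":
    for word in ("DASS", "daß", "Maße", "Masse"):
        # Case folding expands ß to ss; plain lower-casing leaves ß alone.
        print(f"{word}: folded={nfkc_cf_like(word)} lowercased={nfkc_lowercase(word)}")
```

With case folding, "daß" and "dass" (and "Maße" and "Masse") collapse to the same token, while NFKC plus lower-casing keeps them distinct.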
@dcausse's patch above was accepted about a week ago, so the ability to exclude certain characters from other kinds of normalization will be improved in Elasticsearch 6, but that's a little far off. I don't know if it's possible to unpack the German language analyzer and/or mess with the German config to get the desired result in the meantime.
It's not clear to me how we would ideally want to treat ß. I know there are potential complexities related to the 1996 spelling reform, Swiss and Austrian variants, and words that have both ß and ss in different forms of the word (like beißen / bissen).
Would it make sense to have ß normalize to ss for "regular" searches (i.e., in the stemmed "text" field), but remain "ß" for quoted searches (i.e., in the exact-match "plain" field)?
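As a toy illustration of that idea (not how CirrusSearch or Elasticsearch is implemented), the sketch below indexes each token twice: once folded for a "text" field and once only lower-cased for a "plain" field. `ToyIndex` and both analyzer functions are hypothetical, no stemming is done, and only single-word queries are handled.

```python
from collections import defaultdict


def analyze_text(token: str) -> str:
    """Stand-in for the "text" field: case folding, so ß becomes ss."""
    return token.casefold()


def analyze_plain(token: str) -> str:
    """Stand-in for the "plain" field: lower-casing only, so ß is kept."""
    return token.lower()


class ToyIndex:
    def __init__(self) -> None:
        self.fields = {"text": defaultdict(set), "plain": defaultdict(set)}

    def add(self, doc_id: int, body: str) -> None:
        # Index every whitespace-separated token under both analyzers.
        for token in body.split():
            self.fields["text"][analyze_text(token)].add(doc_id)
            self.fields["plain"][analyze_plain(token)].add(doc_id)

    def search(self, query: str) -> set:
        # A query wrapped in double quotes goes to the plain field.
        quoted = len(query) >= 2 and query[0] == query[-1] == '"'
        if quoted:
            return set(self.fields["plain"].get(analyze_plain(query[1:-1]), set()))
        return set(self.fields["text"].get(analyze_text(query), set()))


index = ToyIndex()
index.add(1, "Ich glaube daß es regnet")
index.add(2, "Ich glaube dass es regnet")
```

In this sketch an unquoted `index.search("daß")` returns both documents, while the quoted `index.search('"daß"')` returns only document 1.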
Also, as an expensive workaround to the problem of searching for daß but not dass you can use insource. The simplest version would be insource:/[dD]aß/ , which would match daß or Daß in any part of a word. You could try to do your own word boundary detection with something like insource:/[ .,;:'"!?][Dd]aß[ .,;:'"!?]/ , but Erik may come and give you a stern talking to for overtaxing the Elastic cluster.
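The two patterns can be tried out locally before running them against the cluster. This sketch uses Python's `re` module, whose syntax differs from the Lucene regular expressions behind insource:, so it only illustrates the matching logic for these two simple patterns; the sample sentences are made up.

```python
import re

# Matches "daß" or "Daß" anywhere, including inside a longer word.
ANYWHERE = re.compile(r"[dD]aß")
# Requires one of the listed boundary characters on both sides.
BOUNDED = re.compile(r"""[ .,;:'"!?][Dd]aß[ .,;:'"!?]""")

samples = [
    "Ich glaube, daß es regnet.",
    "Ich glaube, dass es regnet.",
    "Es regnete, sodaß wir blieben.",
    "Daß es regnet, ist klar.",
]

for text in samples:
    anywhere = bool(ANYWHERE.search(text))
    bounded = bool(BOUNDED.search(text))
    print(f"{text!r}: anywhere={anywhere} bounded={bounded}")
```

Note that the hand-rolled boundary version needs an actual character on each side, so it misses "Daß" at the very start of the text, and the unbounded version also matches inside "sodaß".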
"Would it make sense to have ß normalize to ss for "regular" searches (i.e., in the stemmed "text" field), but remain "ß" for quoted searches (i.e., in the exact-match "plain" field)?" Yes, because quoted means exact match.
A related request was made on-wiki by a friendly IP editor: https://www.mediawiki.org/wiki/Topic:Updgir90a9m8tij9
Change 692700 had a related patch set uploaded (by Tjones; author: Tjones):
[mediawiki/extensions/CirrusSearch@master] Unpack German, Portuguese, and Dutch Elasticsearch Analyzers
Change 692700 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Unpack German, Portuguese, and Dutch Elasticsearch Analyzers