User Story: As a search engineer, I want to have general improvements apply to analysis chains for all languages, and to be able to customize and improve individual analysis chains, so that search can incrementally improve for all users.
The intent is to unpack the analyzers (i.e., converting language-specific monolithic analyzers to their constituent parts, based on their breakdown from Elastic) and then apply general improvements (upgrading lowercasing to ICU norm and adding homoglyph norm, for example), test the results, deploy the changes, and re-index.
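As a rough illustration of what "unpacking" and "general improvements" mean at the settings level, here is a minimal Python sketch that builds Elasticsearch analysis settings as plain dictionaries. The unpacked Spanish chain follows Elastic's documented breakdown of the `spanish` analyzer (the optional keyword-marker filter is omitted). The analyzer name `text`, the helper `apply_general_upgrades`, and the placement of `homoglyph_norm` ahead of `icu_normalizer` are assumptions made for illustration; this is not the actual AnalysisConfigBuilder logic.

```python
import copy

# Monolithic form: a single built-in language analyzer.
MONOLITHIC_SPANISH = {"analyzer": {"text": {"type": "spanish"}}}

# Unpacked form: the same analyzer spelled out as tokenizer + filters,
# per Elastic's documented breakdown (keyword marker omitted here).
UNPACKED_SPANISH = {
    "filter": {
        "spanish_stop": {"type": "stop", "stopwords": "_spanish_"},
        "spanish_stemmer": {"type": "stemmer", "language": "light_spanish"},
    },
    "analyzer": {
        "text": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", "spanish_stop", "spanish_stemmer"],
        }
    },
}


def apply_general_upgrades(analysis, analyzer_name="text"):
    """Return a copy with 'lowercase' upgraded to ICU normalization and
    homoglyph normalization added (hypothetical helper; the filter
    ordering is an assumption)."""
    upgraded = copy.deepcopy(analysis)
    new_chain = []
    for name in upgraded["analyzer"][analyzer_name]["filter"]:
        if name == "lowercase":
            new_chain.extend(["homoglyph_norm", "icu_normalizer"])
        else:
            new_chain.append(name)
    upgraded["analyzer"][analyzer_name]["filter"] = new_chain
    return upgraded


upgraded_spanish = apply_general_upgrades(UNPACKED_SPANISH)
```

The upgrade step only works on the unpacked form: the monolithic analyzer exposes no filter list to modify, which is why unpacking has to come first.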
There are 27 analyzers to unpack, and there seem to be two more analyzers to enable (which is more to test, since we’d be enabling new stemmers). Re-indexing takes time, too. Because of the number of analyzers to work on, this task is probably "Epic-ish".
Details: There are 25 Wikipedias using the default Elasticsearch analyzers. All of these need to be unpacked: Arabic, Armenian, Basque, Bulgarian, Catalan, Czech, CJK for Japanese, Danish, Dutch, Finnish, Galician, German, Hindi, Hungarian, Irish, Latvian, Lithuanian, Norwegian, Persian, Portuguese, Romanian, Sorani, Spanish, Turkish, Thai. There is also a "Brazilian" analyzer from Elastic, for Brazilian Portuguese, which is used by br.wikimedia.org.
There is one non-Elastic monolithic analyzer plugin that we have deployed that should be unpackable: Ukrainian.
Two other analyzers are available, but aren't in use on their respective Wikipedias: Bengali and Estonian. Those should be enabled (as unpacked analyzers) and tested, though that will be more involved since they both include stemmers.
Re-indexing after unpacking is not too much actual work if it goes smoothly, but does take time for larger wikis (Spanish, German, Portuguese, Dutch, Japanese, and Ukrainian all have >1M articles) and may be delayed to sync with other SRE activities.
As we unpack more analyzers, we may also need to do some refactoring of AnalysisConfigBuilder. It may also be worthwhile to do some simple bug fixes after unpacking, to minimize re-indexing, but we'll have to see how quickly we proceed and how "simple" the bugs turn out to be. ;)
Done / In Progress—See tickets for deployment status and T147505 for reindexing status
- Spanish T277699
- German, Portuguese, Dutch T281379
- Basque, Catalan, Danish T283366
- Czech, Finnish, Galician T284578
- Hindi, Irish, Norwegian T289612
- Arabic, Thai T294147 — includes languages from emerging regions (Arabic and Thai)
- Ukrainian T318264 (not from Elasticsearch, may be more complicated) — language from emerging regions
- CJK for Japanese (may be worth looking into Kuromoji again) T326822
- Armenian, Latvian, Hungarian T325089
- Bulgarian, Lithuanian, Persian T325090
- Romanian, Sorani T325091
- Turkish T329762
- Brazilian (only used on br.wikimedia.org) T325092
Not yet in use for some reason, so install, test, and unpack:
- Bengali, Estonian
Acceptance Criteria (per analyzer):
- Unpacked analyzers perform the same as their monolithic counterparts (without general upgrades).
- Upgraded analyzers either have no unexpected impact (we know what to expect from ICU norm and homoglyph norm, for example), or the impact is reviewed by a speaker of the language.
- Analysis changes are deployed, re-indexing sub-tasks are created here, and linked to in T147505.
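The first acceptance criterion (unpacked analyzers behave the same as their monolithic counterparts) can be checked by running the same sample text through both chains and comparing the token output. The sketch below shows the comparison logic only. The function `compare_token_streams` and the two stand-in analyzers are hypothetical; in a real check each callable would wrap a request to Elasticsearch's `_analyze` API for the corresponding analyzer.

```python
def compare_token_streams(samples, analyze_monolithic, analyze_unpacked):
    """Run each sample through both analyzers and collect mismatches.

    Each analyzer is a callable taking a string and returning a list of
    tokens. Returns a list of (text, monolithic_tokens, unpacked_tokens)
    tuples for every sample where the outputs differ.
    """
    mismatches = []
    for text in samples:
        expected = analyze_monolithic(text)
        actual = analyze_unpacked(text)
        if expected != actual:
            mismatches.append((text, expected, actual))
    return mismatches


# Toy stand-ins for illustration only; they are not real analyzers.
def toy_monolithic(text):
    return text.lower().split()


def toy_unpacked(text):
    return [word.lower() for word in text.split()]


samples = ["Hola Mundo", "Los analizadores"]
mismatches = compare_token_streams(samples, toy_monolithic, toy_unpacked)
```

An empty result means the two chains agree on every sample. Any entries show exactly which text diverged and how, which is also the material a speaker of the language would need when reviewing the impact of an upgrade.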