On the russian UI of Q120180, in position held property (P39) (занимаемая должность), the reference label for Q19180760 is trimmed.
Description
Details
Subject | Repo | Branch | Lines +/- | |
---|---|---|---|---|
Fix truncated terms on the fly | mediawiki/extensions/Wikibase | master | +19 -4 |
Status | Subtype | Assigned | Task | ||
---|---|---|---|---|---|
Invalid | None | T142691 [Bug] wb_terms table truncates labels exceeding 255 bytes, possibly leaving invalid UTF-8 | |||
Resolved | thiemowmde | T142820 [Task] Decide on quick fix for "Max length of reference label in russian" |
Event Timeline
There are multiple things to consider here here:
- The label, description and alias fields are limited to 250 characters (see multilang-limits in Wikibase.default.php). Same for monolingual terms. But string values are limited to 400 characters. I'm not absolutely sure all this is as intended.
- We are currently using the term_text column from the wb_terms table to display these labels. It's set to varbinary(255). But:
- There are multiple problems with touching the term_text field:
- There is no absolute maximum byte length a 250 characters UTF-8 string can have. A sequence of emoji may even exceed 1000 bytes.
- [Solution: Increase length] but does MySQL still support all indexing features when we increase the length?
- [Solution: Use VARCHAR] But does MySQL still support all indexing features when we change the field from VARBINARY to VARCHAR?
- [Solution: Fallback to retrieving lookup] A possible workaround is to check the length of the strings returned by the EntityInfoTermLookup. If it's exactly 255 characters we repeat the lookup a second time with the EntityRetrievingTermLookup. (Or to add this fallback behavior directly to the EntityInfoTermLookup) The "retrieving" lookup was the one we used for all label lookups a few months ago. We started using the term table based lookup for performance reasons, to avoid deserializing hundreds of entity blobs just because we need a few labels.
- [Solution: Elastic lookup] Long term solution should be to "drop" the terms table (read: do not use it any more) in favor of a proper elastic based index (see T125500: [Epic] Index Wikidata labels and descriptions as separate fields in ElasticSearch).
Yes, we had a similar problem with ips_site_page which we solved that way: T99459: ips_site_page is too short to store some (full) page titles.
We have 2133 affected terms:
MariaDB [wikidatawiki_p]> SELECT COUNT(*) FROM wb_terms WHERE LENGTH( term_text) > 254; +----------+ | COUNT(*) | +----------+ | 2133 | +----------+ 1 row in set (8 min 15.88 sec)
Some quick notes:
In repo/sql/Wikibase.sql, this is declared as term_text VARCHAR(255) BINARY NOT NULL, which is NOT the same as VARBINARY. VARCHAR(255) BINARY should use binary collation for UTF8 data, any would allow 255 unicode characters to be stored. The field was apparently changed to VARBINARY during deployment, possibly for backwards compat with MySQL4.
How ever we work around the issue that this field is VARBINARY(255) on the live system, we should strip any broken UTF8 from the end of the string, using StringNormalizer::removeBadCharLast.
Change 306253 had a related patch set uploaded (by Daniel Kinzler):
Fix truncated terms on the fly
It's VARBINARY in both my local database and on the live system (we can see it truncates at 255 bytes). I don't know how this happened. I assume the meaning of VARCHAR BINARY changed. It appears this was an alias for VARBINARY in the MySQL versions we use.
Here is an example search: https://www.wikidata.org/w/api.php?action=wbsearchentities&search=%D0%92%D1%8B%D1%81%D0%BE%D1%87%D0%B0%D0%B9%D1%88%D0%B8%D0%B9+%D0%BC%D0%B0%D0%BD%D0%B8%D1%84%D0%B5%D1%81%D1%82+1+%D0%BC%D0%B0%D1%80%D1%82%D0%B0+1881+%D0%B3%D0%BE%D0%B4&format=json&language=ru&uselang=ru&type=item. Note that the search result ends with \ufffd. This should be stripped, as @daniel suggests.
During story time we found that:
- Certain closed subtasks of T75087: Efficient entity label lookup (tracking) are probably the reason for this bug.
- The remaining open subtasks of T75087 will solve this specific bug.
- A much wider parent task T86530: Replace wb_terms table with more specialized mechanisms for terms (tracking) also exists, outlining a series of individual replacements necessary to get rid of the terms table.
We said that we will, for now, solve this bug by implementing the "fallback" workaround described in T142691#2543325.
Given that it is only a problem in a small number of cases let's not spend time on the workaround and instead work on the proper fix long-term. That's what we also said in story time.
Could be relevant: I was able to create a item with 810 characters (https://www.wikidata.org/wiki/Q26903397), but I can't add more labels with that length. https://www.wikidata.org/wiki/Q2732136#P734 shows the broken display as described in this ticket.
The proper long term fix for this one is killing the wb_terms table.
Tagging T198866 as related.