- Affected components: MediaWiki core:
  - Database schema.
  - ParserOutput, LinksUpdate, RefreshLinks job internals.
  - Special:WhatLinksHere, API:Backlinks, etc.
- Engineer for initial implementation: @Ladsgroup.
- Code steward: Platform Engineering.
Motivation
To be precise, this RFC is about these tables: imagelinks, pagelinks, templatelinks, and categorylinks.
These link tables are among the largest tables in any MediaWiki database. Here's an example from commonswiki:
```
SELECT table_schema, table_name AS `Table`,
       round(((data_length + index_length) / 1024 / 1024 / 1024), 2) `Size in GB`
FROM information_schema.TABLES
WHERE table_name LIKE '%links'
ORDER BY (data_length + index_length) DESC;

+-----------------+------------------+------------+
| table_schema    | Table            | Size in GB |
+-----------------+------------------+------------+
| commonswiki     | templatelinks    |     301.45 |
| commonswiki     | externallinks    |     179.16 |
| commonswiki     | categorylinks    |     175.10 |
| commonswiki     | globalimagelinks |     136.36 |
| commonswiki     | pagelinks        |      84.96 |
| commonswiki     | imagelinks       |      55.47 |
| commonswiki     | iwlinks          |      42.81 |
| commonswiki     | langlinks        |       2.74 |
+-----------------+------------------+------------+
```
In total they are responsible for 0.8TB of data in Wikimedia Commons. Most of this data is redundant because the target fields are repeated over and over again; for example, "License_template_tag" has been repeated 53 million times as the target title in templatelinks. This could easily be normalized through the page table, but the problem is that the actual content in MediaWiki binds to the text: you write {{License template tag}}, not the page id of Template:License template tag on commonswiki.
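For illustration, the repetition can be measured with a query along these lines against the current schema (tl_namespace and tl_title are the existing target columns in templatelinks; this query is illustrative, not part of the proposal):

```sql
-- Count how often each template target is repeated in templatelinks.
SELECT tl_namespace, tl_title, COUNT(*) AS uses
FROM templatelinks
GROUP BY tl_namespace, tl_title
ORDER BY uses DESC
LIMIT 10;
```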
Binding the content to the text rather than to page ids has several consequences:
- You can link to a non-existing page and store those links (which powers special pages like "Most wanted categories" and "Most wanted pages").
- If a page moves, the links keep pointing at the old target, which either gets redirected, repurposed as a completely different page, or fixed by a bot. This can be a hassle; we have bots to fix moved categories, but fixing that is outside the scope of this RFC.
Requirements
- Any individual table ideally remains under 50GB; right now the largest of these is over 300GB. (DBA)
- The total size of an entire wiki's database tables must remain well under 4TB. This means tables must either be reduced in size or split.
Exploration
Proposed solution
I propose to add a new table with this structure:
```sql
CREATE TABLE /*_*/title (
  title_id BIGINT NOT NULL PRIMARY KEY AUTO_INCREMENT,
  title_namespace INT NOT NULL,
  title_title VARCHAR(255) BINARY NOT NULL -- Naming is hard
) /*$wgDBTableOptions*/;

CREATE UNIQUE INDEX /*i*/title_name_title ON /*_*/title (title_namespace, title_title);
```
(The ts prefix used to be used by the tag_summary table, but that table doesn't exist anymore; I killed it with my bare hands.)
Then a lookup service would be introduced, similar to NameTableStore but with a different caching strategy (caching only the most-read lookups in memcached rather than all of them), to provide a lookup from string to id and vice versa.
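At the SQL level, the lookups such a service would issue boil down to something like the following sketch against the proposed title table above (the real service would add caching and batching on top; the literal values are just examples):

```sql
-- Look up the id for a given (namespace, title); namespace 10 is "Template:".
SELECT title_id
FROM title
WHERE title_namespace = 10 AND title_title = 'License_template_tag';

-- If no row exists yet, insert one and read the id back.
-- INSERT IGNORE relies on the unique (title_namespace, title_title) index.
INSERT IGNORE INTO title (title_namespace, title_title)
VALUES (10, 'License_template_tag');
```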
Then, for each of the mentioned *links tables:
- Add a new column to the table.
- Set the migration stage to "write both, read old".
- Run a maintenance script to populate the new column.
- Move the migration stage to "write both, read new".
- After a period, read only from the normalized column and drop the non-normalized column.
(The whole procedure is repeated for each table; I did something similar, but more complex, with the change_tags table. A rough sketch for one table follows below.)
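As a sketch of what this could look like on the schema side, here is templatelinks as an example. The new column name tl_target_id and the single-statement backfill are assumptions for illustration, not settled details:

```sql
-- Step 1: add the normalized column (nullable while migration is in progress).
ALTER TABLE /*_*/templatelinks
  ADD COLUMN tl_target_id BIGINT UNSIGNED DEFAULT NULL;

-- Step 3: the maintenance script backfills it from the title table;
-- in production this would run in small batches, not one big UPDATE.
UPDATE /*_*/templatelinks tl
JOIN /*_*/title t
  ON t.title_namespace = tl.tl_namespace
 AND t.title_title = tl.tl_title
SET tl.tl_target_id = t.title_id
WHERE tl.tl_target_id IS NULL;

-- After "read new": a backlinks query (e.g. for Special:WhatLinksHere on a
-- template) joins through the title table instead of comparing strings.
SELECT tl.tl_from
FROM /*_*/templatelinks tl
JOIN /*_*/title t ON t.title_id = tl.tl_target_id
WHERE t.title_namespace = 10 AND t.title_title = 'License_template_tag';

-- Final step: once nothing reads the old columns, they can be dropped.
ALTER TABLE /*_*/templatelinks
  DROP COLUMN tl_namespace,
  DROP COLUMN tl_title;
```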
That would save a lot of space and address significant scalability issues for Wikimedia Commons and English Wikipedia (these tables are around 60GB in wikidatawiki; rather big, but not the most pressing storage issue there).
Disclaimer: I am doing this in my volunteer capacity; this work is not owned by WMDE or WMF. I would be very happy if any team would like to help with this.