This could be a meta task with some discrete parts that engineers could work on in parallel.
At a high level (see https://wikitech.wikimedia.org/wiki/Add_Link and related docs for more detail), we need a maintenance script that:
- iterates over ORES topics
- for each ORES topic
  - get a random list of articles that don't have link recommendations (use an ES query)
  - for each article that doesn't have a link recommendation
    - query the link recommendation service
    - cache the results in the GrowthExperiments MySQL table
    - fire an event to indicate result of querying the service; search pipeline handles updating the index
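
The loop described above could be sketched roughly as follows. This is only an illustration in Python, not the actual maintenance script: the function name and the four callables (`find_candidates`, `fetch_recommendation`, `store`, `fire_event`) are hypothetical stand-ins for the ES query, the link recommendation service client, the GrowthExperiments table write, and the event that the search pipeline consumes.

```python
from typing import Callable, Dict, Iterable, List, Optional


def refresh_link_recommendations(
    topics: Iterable[str],
    find_candidates: Callable[[str, int], List[str]],
    fetch_recommendation: Callable[[str], Optional[dict]],
    store: Callable[[str, dict], None],
    fire_event: Callable[[str, bool], None],
    per_topic: int = 500,
) -> Dict[str, int]:
    """Run one pass over all topics.

    Returns the number of recommendations stored per topic.
    All callables are hypothetical placeholders for the real components.
    """
    stored: Dict[str, int] = {}
    for topic in topics:
        count = 0
        # Stand-in for the ES query: random articles without recommendations.
        for title in find_candidates(topic, per_topic):
            recommendation = fetch_recommendation(title)
            if recommendation is not None:
                # Stand-in for the write to the GrowthExperiments MySQL table.
                store(title, recommendation)
                count += 1
            # The event reports the outcome either way, so the search
            # pipeline can update the index accordingly.
            fire_event(title, recommendation is not None)
        stored[topic] = count
    return stored
```

Passing each step in as a callable is one way to keep the pieces swappable, for example with in-memory fakes in tests.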
Additionally there are a couple of parameters we should think about:
- A configurable number of tasks to query for, defining an allocation of how many link tasks we want per topic (the default is 500).
- The query/cache/search index update code should be modular enough to reuse in a job, because we may want to do some of this same work on page edit (refreshing recommendations) or deletion (purging the cache).
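
One possible reading of the per-topic allocation is a "top up to the target" calculation, sketched below. This is an assumption rather than something settled in this task: the function name, the `existing_counts` input (how many link tasks a topic already has), and the per-topic `overrides` are all hypothetical; only the default of 500 comes from the description above.

```python
from typing import Dict, Iterable, Optional

DEFAULT_TASKS_PER_TOPIC = 500  # default mentioned in the task description


def tasks_needed(
    topics: Iterable[str],
    existing_counts: Dict[str, int],
    target: int = DEFAULT_TASKS_PER_TOPIC,
    overrides: Optional[Dict[str, int]] = None,
) -> Dict[str, int]:
    """Return how many more link tasks to generate for each topic.

    Assumes (hypothetically) that we top up each topic to its target
    instead of regenerating everything on each run.
    """
    overrides = overrides or {}
    needed: Dict[str, int] = {}
    for topic in topics:
        goal = overrides.get(topic, target)
        # Never return a negative number when a topic is already over target.
        needed[topic] = max(0, goal - existing_counts.get(topic, 0))
    return needed
```

For example, with a target of 500, a topic that already has 120 tasks would need 380 more, and a topic with 700 would need none.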
After selecting the random set of articles for a topic, here are some additional rules. Some apply before calling the link recommendation service, others apply after calling the service but before saving the recommendations to the database.
Attribute | Initial setting |
---|---|
Suggested links | 1) Must have at least 2 suggestions (configurable) per article over X probability score (X should be configurable per wiki). 2) We will only display a maximum of 10 (should be configurable) suggestions per article. If the service provides more than 10, we should save those in the database in case we end up being unable to locate the phrases in the article text for a particular recommendation. Configuration should be allowed via NewcomerTasks.json |
Existing links | The idea is to filter out already well-linked articles. This is probably handled well enough by the link recommendation service, but let's double check on this. |
Protection status | Exclude articles with any protection |
Categories to include/exclude | None, but make configurable via NewcomerTasks.json |
Templates to include/exclude | None, but make configurable via NewcomerTasks.json |
Time since last edit | 1 day (configurable via NewcomerTasks.json) |
Time since last suggested links edits | Do not use article if previous edit was link recommendations edit *or* if previous edit was a revert of a link recommendations edit (see footnote [1]) |
Article word count (max/min) | None (configurable via NewcomerTasks.json) |
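
The "Suggested links" row could be applied after the service call along these lines. This is a minimal sketch, not the real implementation: the `Suggestion` and `SuggestionRules` types and both function names are hypothetical, and the `min_score` default of 0.5 is a placeholder for the per-wiki "X" that would come from NewcomerTasks.json.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Suggestion:
    phrase: str
    target: str
    score: float


@dataclass
class SuggestionRules:
    min_score: float = 0.5      # placeholder for "X"; the real value is per wiki
    min_suggestions: int = 2    # minimum suggestions per article
    max_displayed: int = 10     # maximum suggestions shown to the user


def select_for_storage(
    suggestions: List[Suggestion], rules: SuggestionRules
) -> Optional[List[Suggestion]]:
    """Keep suggestions scoring over the threshold, best first.

    Returns None when the article has too few good suggestions and
    should be skipped. Everything that passes is kept for storage,
    even beyond the display limit, as spares in case some phrases
    cannot be located in the article text.
    """
    good = sorted(
        (s for s in suggestions if s.score > rules.min_score),
        key=lambda s: s.score,
        reverse=True,
    )
    if len(good) < rules.min_suggestions:
        return None
    return good


def select_for_display(
    stored: List[Suggestion], rules: SuggestionRules
) -> List[Suggestion]:
    """Cap what is shown to the user at the configured maximum."""
    return stored[: rules.max_displayed]
```

Separating storage from display keeps the spare suggestions in the database while the interface still shows at most the configured number.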
[1] We want to avoid these two scenarios (but maybe there are better ways to accomplish that avoidance):
- User A gets 10 link suggestions on a long article. Adds most of them. Then the article goes back in the queue, gets its suggestions regenerated, and User B chooses it from the queue. User B adds 10 more suggestions. If that keeps happening, the article could get overlinked.
- User A gets 10 link suggestions and adds them all. Then they get reverted. Then the article’s suggestions are regenerated and it goes back in the queue, where User B adds those same 10 suggestions again. Then it gets reverted again.
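
The rule from footnote [1] could be checked against the article's latest revision, roughly as below. This is a sketch under stated assumptions: the `Revision` type, its `is_link_recommendation_edit` flag, and its `reverts` field are hypothetical, and how the real code would detect a link recommendations edit or a revert is not specified here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Revision:
    rev_id: int
    # Hypothetical flag: True if this edit was made via link recommendations.
    is_link_recommendation_edit: bool = False
    # Hypothetical field: the revision this edit reverted, if any.
    reverts: Optional["Revision"] = None


def is_eligible(latest: Optional[Revision]) -> bool:
    """Return False if the latest edit was a link recommendations edit
    or a revert of one; otherwise the article may be queued again."""
    if latest is None:
        return True
    if latest.is_link_recommendation_edit:
        return False
    if latest.reverts is not None and latest.reverts.is_link_recommendation_edit:
        return False
    return True
```

Both scenarios above are covered: an article stays out of the queue directly after a link recommendations edit, and also directly after that edit is reverted, until some other edit happens.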