This is a tracking-only task for the search-related part of the Add Link project, mainly to have a self-contained description of the plan, which might also serve as a reference point for the upcoming image recommendation work. All work is happening in the linked tasks.
Product overview
WMF Product is planning to implement a series of structured tasks that can be recommended for new users; these tasks can be performed without using the editor and without having to understand all wiki policies, thus offering a more gradual learning experience. (More details here.) The first one is recommending links (Add-Link), the second recommending lead images (T254768); more might come based on the success of those.
The shape of the plan for link recommendations (and probably the others): there is some tool that can generate suggested edits (e.g. which words in an article should link where), which is used to provide a task feed or similar interface to the users. Users review the suggestions, which might or might not result in edits (the suggestion might be fully or partially accepted, or rejected). Outcomes are logged, and used to track the user's progress, and possibly to improve the recommendation tool.
Product requirements
- Users can filter tasks via some selection criteria (currently, the ORES topics of the article). For any criterion, the number of available tasks should be above a certain threshold (to make sure users can find tasks they are interested in). This is not enforced in real time, but whenever the number of tasks dips below the limit, it should be fixed within a reasonable amount of time.
- Picking a task from the pool must be fast (faster than the recommendation tool itself).
- Recommendations need to fulfill certain criteria (e.g. confidence level). This might change per wiki.
- The same recommendation should not be given to multiple users (this can be probabilistic). If a recommendation has been reviewed (accepted or rejected), it should not be given to users anymore. The same applies if the article has been edited.
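The requirements above can be illustrated with a minimal sketch. It is not the actual implementation: the `Task` fields, `pick_task`, and `discard` are hypothetical names, and an in-memory list stands in for the real task pool. It shows topic filtering, a confidence cutoff, and random selection as the probabilistic way of avoiding handing the same task to two users.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    # All fields are illustrative assumptions, not the real schema.
    page_id: int
    topics: frozenset   # ORES topic names of the article
    confidence: float   # score reported by the recommendation tool


def pick_task(pool, topic, min_confidence, rng=random):
    """Return a random eligible task for the topic, or None if there is none."""
    eligible = [
        t for t in pool
        if topic in t.topics and t.confidence >= min_confidence
    ]
    # A random choice makes it unlikely (but not impossible) that two users
    # browsing the same topic at the same time receive the same task.
    return rng.choice(eligible) if eligible else None


def discard(pool, page_id):
    """Return the pool without the task for a reviewed or edited article."""
    return [t for t in pool if t.page_id != page_id]
```

With N eligible tasks, two simultaneous picks collide with probability 1/N, which is why the pool size threshold and the collision requirement are related.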
Engineering requirements
- We don't want to build any kind of custom search system; instead, use the proper one we already have.
- Minimize the number of needed search index updates.
- Limited storage size, at least initially (i.e. we probably can't store recommendations for every article). In the future, storage might be reimplemented via the upcoming AI platform.
- If possible, keep business logic in MediaWiki, since product requirements might change and product teams can change it more easily there.
Design decisions
Since storage is limited, and directly querying the recommendation tool is too slow, we maintain a recommendation pool as a MediaWiki DB table. For finding tasks, we use the search engine (which supports ORES topics; this also gives us flexibility in case the filtering criteria change), with randomized sorting to avoid collisions. We use a cronjob to make sure the pool is large and diverse enough; since the pool is consumed by newly registered users (whose number and pace don't change that much), this doesn't have to be super fast. We need to keep the search index in sync; this does not have to be real-time (the cronjob isn't anyway), so it should be batched to minimize search index updates. There is no way to do batched updates from within MediaWiki, so we use the EventGate - Kafka - Hadoop - Spark pipeline that other similar features (e.g. ORES scores) use. Invalidating tasks on edits (or when a user rejects the recommendation) does have to be fast; we rely on the normal MediaWiki index update mechanism there, which is more or less real-time.
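The refill logic of the cronjob could look roughly like the sketch below. Everything in it is an assumption made for illustration: `refill` and `fetch_candidates` are hypothetical names, a dict stands in for the DB table, and the returned list stands in for the events that would be sent for batched index updates.

```python
from collections import Counter


def refill(pool, all_topics, min_tasks, fetch_candidates):
    """Top up under-filled topics; return the page IDs that were added.

    pool: dict mapping page_id -> set of topics (stand-in for the DB table).
    fetch_candidates: hypothetical callable(topic) yielding
        (page_id, topics) pairs from the recommendation tool.
    """
    # Count how many pooled tasks each topic currently has.
    counts = Counter(t for topics in pool.values() for t in topics)
    added = []
    for topic in sorted(all_topics):
        for page_id, topics in fetch_candidates(topic) if counts[topic] < min_tasks else ():
            if counts[topic] >= min_tasks:
                break  # this topic has reached the threshold
            if page_id in pool:
                continue  # already pooled, nothing to do
            pool[page_id] = set(topics)
            counts.update(topics)  # a task may count towards several topics
            added.append(page_id)
    # The caller would emit one event per added page for the batched
    # search index update.
    return added
```

Because a task can belong to several topics, filling one topic can also satisfy another, which keeps the number of fetched recommendations (and index updates) down.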
Implementation steps
- a MediaWiki DB table for storing recommendations (T261410)
- a search index field for whether an article has recommendations (this got rolled into the new, more generic weighted_tags field)
- a cronjob that monitors the size of the table and fetches more recommendations when needed (T261408)
- the cronjob sends an event via EventBus when it adds a new recommendation (T261407)
- the Elasticsearch update pipeline consumes these events (probably in hourly batches) and updates the index (T262226)
- a search keyword for filtering for articles which have recommendations (T269493)
- when an article is edited or a recommendation has been reviewed, discard the recommendation from the MediaWiki DB table (T275790)
- during the MediaWiki search index update, check the recommendation table and set the search index flag accordingly (note that this happens after an edit, in which case the flag should be set to false; but also after a null edit or other re-render, in which case it should keep its value) (T261409)
- when a recommendation was reviewed without making an edit, trigger the index update manually (T261409)
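How the table and the search index flag interact in the steps above can be sketched with a small in-memory model. This is only an illustration under stated assumptions: `RecommendationStore` and its method names are hypothetical, and two dicts stand in for the MediaWiki DB table and the search index.

```python
class RecommendationStore:
    """In-memory stand-in for the recommendation table and the index flag."""

    def __init__(self):
        self.table = {}       # page_id -> recommendation payload
        self.index_flag = {}  # page_id -> bool, as seen by search

    def add(self, page_id, recommendation):
        # The cronjob stores the recommendation; the flag is set later
        # by the batched update pipeline, not here.
        self.table[page_id] = recommendation

    def apply_batch(self, page_ids):
        # Batched pipeline: flag pages that still have a recommendation.
        for page_id in page_ids:
            if page_id in self.table:
                self.index_flag[page_id] = True

    def update_index(self, page_id):
        # Regular index update: recompute the flag from the table.
        self.index_flag[page_id] = page_id in self.table

    def on_edit(self, page_id):
        self.table.pop(page_id, None)
        self.update_index(page_id)  # an edit triggers an index update anyway

    def on_rerender(self, page_id):
        self.update_index(page_id)  # table untouched, so the flag is kept

    def on_review_without_edit(self, page_id):
        self.table.pop(page_id, None)
        self.update_index(page_id)  # no edit happened, so trigger it manually
```

Deriving the flag from the table on every index update is what lets null edits and re-renders keep the value without any special casing.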
Open questions
- How do we avoid the same (rejected) recommendation being generated again? We'll probably want to keep some list of rejected recommendations, and make the cron job ignore them (T266446)
- Should we use EventGate for logging the outcome of users reviewing the recommendations? In theory, that could also be an alternative mechanism for index updates. It doesn't seem to have any benefit, though.
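For the first question, one possible shape of the rejected list is sketched below. This is an assumption, not a decided design: the `(page_id, anchor_text, target_title)` key and the `filter_rejected` name are hypothetical.

```python
def filter_rejected(candidates, rejected):
    """Keep only candidate links that were not rejected before.

    candidates: iterable of (page_id, anchor_text, target_title) tuples
        freshly generated by the recommendation tool.
    rejected: set of such tuples recorded when users rejected a link.
    """
    return [c for c in candidates if c not in rejected]
```

The cron job would apply such a filter before storing new recommendations, and could skip an article entirely if no candidate links remain.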