Add selector for editors to control exclusion of navigational elements from search results
Closed, Resolved, Public

Description

See this new feature documented at https://www.mediawiki.org/wiki/Help:CirrusSearch#Exclude_content_from_the_search_index


(Based on this feedback from French Wikipedia.)

When you perform a particular search, the results can be polluted by navigation elements that are not supposed to be displayed.

For instance, this request displays the following description for the first result (also true for the other ones):
approchent. Ligue des champions de l'UEFA 2017-2018 Navigation Édition précédente Édition suivante modifier La Ligue des champions 2017-2018 est la

The block "Navigation Édition précédente Édition suivante modifier" is a navigation element, located at the bottom of the infobox and designed to go to the previous/next article about this football competition. These elements are not supposed to be included in the search and displayed as results.

The expected search result would look like this. However, this result is obtained by excluding the whole infobox with -hastemplate:"Infobox Compétition sportive". This is not a real solution, because some information may appear only in the infobox.

Would it be possible to implement something like class="navigation-not-searchable" that would prevent a particular element from being displayed in search results?

Event Timeline

This system already exists, although it's done manually. Basically, any content that matches this list is entirely excluded from search: https://github.com/wikimedia/mediawiki/blob/master/includes/content/WikiTextStructure.php#L29-L40

	private $excludedElementSelectors = [
		// "it looks like you don't have javascript enabled..." – do not need to index
		'audio', 'video',
		// The [1] for references
		'sup.reference',
		// The ↑ next to references in the references section
		'.mw-cite-backlink',
		// Headings are already indexed in their own field.
		'h1', 'h2', 'h3', 'h4', 'h5', 'h6',
		// Collapsed fields are hidden by default so we don't want them showing up.
		'.autocollapse',
	];

Items that match this list are moved from the 'text' field into an auxiliary field: https://github.com/wikimedia/mediawiki/blob/master/includes/content/WikiTextStructure.php#L45-L56

	private $auxiliaryElementSelectors = [
		// Thumbnail captions aren't really part of the text proper
		'.thumbcaption',
		// Neither are tables
		'table',
		// Common style for "See also:".
		'.rellink',
		// Common style for calling out helpful links at the top of the article.
		'.dablink',
		// New class users can use to mark stuff as auxiliary to searches.
		'.searchaux',
	];
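
As a rough illustration of how the two lists above behave, here is a minimal Python sketch. It is not the WikiTextStructure.php code: it uses the standard library's html.parser, handles only bare tag names and single class names rather than full CSS selectors, and the names FieldSplitter and split_fields are hypothetical. The navigation-not-searchable entry is included on the assumption that the class proposed in this task ends up on the exclude list.

```python
from html.parser import HTMLParser

# Simplified stand-ins for the two selector lists quoted above.
EXCLUDED_TAGS = {"audio", "video", "h1", "h2", "h3", "h4", "h5", "h6"}
EXCLUDED_CLASSES = {"mw-cite-backlink", "autocollapse",
                    "navigation-not-searchable"}  # assumed: class proposed in this task
EXCLUDED_TAG_CLASS = {("sup", "reference")}       # stands in for 'sup.reference'
AUX_TAGS = {"table"}
AUX_CLASSES = {"thumbcaption", "rellink", "dablink", "searchaux"}
# Void elements never get an end tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta", "wbr"}


class FieldSplitter(HTMLParser):
    """Route text into 'text', 'aux', or 'excluded' buckets (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.stack = ["text"]  # mode of each currently open element
        self.fields = {"text": [], "aux": [], "excluded": []}

    def _classify(self, tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        if (tag in EXCLUDED_TAGS or classes & EXCLUDED_CLASSES
                or any((tag, c) in EXCLUDED_TAG_CLASS for c in classes)):
            return "excluded"
        if tag in AUX_TAGS or classes & AUX_CLASSES:
            return "aux"
        return None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        parent, own = self.stack[-1], self._classify(tag, attrs)
        # Exclusion wins over auxiliary, and both are inherited by children.
        if "excluded" in (parent, own):
            self.stack.append("excluded")
        elif "aux" in (parent, own):
            self.stack.append("aux")
        else:
            self.stack.append("text")

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.fields[self.stack[-1]].append(data.strip())


def split_fields(html):
    parser = FieldSplitter()
    parser.feed(html)
    parser.close()
    return {name: " ".join(parts) for name, parts in parser.fields.items()}


sample = ('<p>Main text <sup class="reference">[1]</sup></p>'
          '<div class="navigation-not-searchable">Previous edition Next edition</div>'
          '<div class="thumbcaption">A caption</div>')
print(split_fields(sample))
```

In this sketch the "excluded" bucket exists only so the dropped text can be inspected; a real indexer would simply discard it.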

The currently suggested plan would be to mark something with the 'searchaux' class. The auxiliary text is indexed and searched, but with a low weight. We could add a selector to the exclude list as well if needed?
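
To make the difference between the two options concrete, here is a toy Python sketch of field-weighted matching. The weights, the toy_score function, and the sample document are all made up for illustration; the actual ranking used by search is far more involved and is not described in this task. The only point is that auxiliary text still contributes a little, while excluded text contributes nothing because it is never indexed.

```python
# Hypothetical weights: auxiliary text is assumed to count for less than
# main text. The actual values are not stated anywhere in this task.
FIELD_WEIGHTS = {"text": 1.0, "aux": 0.25}


def toy_score(query, fields):
    """Sum weighted term counts over the indexed fields only."""
    terms = query.lower().split()
    score = 0.0
    for name, weight in FIELD_WEIGHTS.items():
        words = fields.get(name, "").lower().split()
        score += weight * sum(words.count(term) for term in terms)
    return score


doc = {
    "text": "champions league season",
    "aux": "champions table",
    # Shown only for contrast: excluded content is never looked at.
    "excluded": "navigation previous next",
}
print(toy_score("champions", doc))
print(toy_score("navigation", doc))
```

With these made-up weights, a term found only in excluded content cannot match at all, whereas a term in 'searchaux' content can still match, just with less influence.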

> We could add a selector to the exclude list as well if needed?

Yes.
"searchaux" does not cover the need to exclude content outright, because it targets useful elements that may help the search. Navigation items are elements that are not supposed to be taken into consideration at all, IMHO.

EBernhardson renamed this task from Have a system to exclude from search results some elements used as navigation items to Add selector for editors to control exclusion of navigational elements from search results. Apr 18 2017, 9:04 PM
EBernhardson claimed this task.

Change 348855 had a related patch set uploaded (by EBernhardson):
[mediawiki/core@master] Allow editors to exclude navigation items from search indices

https://gerrit.wikimedia.org/r/348855

Change 348855 merged by jenkins-bot:
[mediawiki/core@master] Allow editors to exclude navigation items from search indices

https://gerrit.wikimedia.org/r/348855

We should find a place to document this new selector. We have https://www.mediawiki.org/wiki/Help:CirrusSearch, but I'm not sure this page is well known by editors. @Trizek-WMF, any suggestions?

@dcausse, Help:CirrusSearch is the right place to document it, and User-notice is the best way to communicate it. But @CKoerner_WMF will have more ideas than me about it. :)

I think Help:CirrusSearch makes sense. @Cpiral, assuming you agree, I'd love your opinion on where in the documentation it should go. :)

How would you phrase this in a couple of simple sentences for non-technical readers?

> How would you phrase this in a couple of simple sentences for non-technical readers?

How about something like "You can now tag navigation elements like hatnotes to not appear in search result snippets"? That does include a few technical terms, but it might be understandable to the Tech News audience. What do you think?

I'll probably simplify it, but I can work with that. Thanks! (:

> I'll probably simplify it, but I can work with that. Thanks! (:

Great. Thanks!

Just ping me when you know when this will go into production.

debt triaged this task as Medium priority.
debt subscribed.

@Johan - this is done and can be announced now, thanks!

It looks like this section was added to the documentation, for this new feature:
https://www.mediawiki.org/wiki/Help:CirrusSearch#Exclude_content_from_the_search_index

I suggest adding more details, explaining things like

  • How it gets added to a template (e.g. searchaux)
    • And how it gets added to a Lua module (?!)
  • How it impacts the search results
  • 1 or 2 examples of where we should use each, and where we should NOT use each
    • e.g. I'm unsure where exactly class="navigation-not-searchable" would ever be used (perhaps the entire 'module:mbox' family?), or if it might have negative consequences that should be kept in mind?
    • e.g. I would hesitantly guess that class="searchaux" should be added to Module:Navbox, but I'm not sure (and it has 6 million page uses, so...)
  • How it relates to other CSS classes, such as
    • class="metadata", which I see used in Module:Navbox/configuration. Note: AFAIK the only documentation about that class is in the very outdated https://en.wikipedia.org/wiki/Wikipedia:Catalogue_of_CSS_classes#Classes - I don't know if that's a global standard already, or if each wiki has invented its own?
    • class="noprint", which is used in things like the `<languages/>` bar.
    • Possibly class="navigation-not-searchable" is either redundant to one of those, or has an overlap but subtle difference (which would need to be explained).

Thanks!!

> It looks like this section was added to the documentation, for this new feature:
> https://www.mediawiki.org/wiki/Help:CirrusSearch#Exclude_content_from_the_search_index

Good! Gotta start somewhere.

>   • How it relates to other CSS classes, such as
>     • class="metadata", which I see used in Module:Navbox/configuration. Note: AFAIK the only documentation about that class is in the very outdated https://en.wikipedia.org/wiki/Wikipedia:Catalogue_of_CSS_classes#Classes - I don't know if that's a global standard already, or if each wiki has invented its own?
>     • class="noprint", which is used in things like the `<languages/>` bar.
>     • Possibly class="navigation-not-searchable" is either redundant to one of those, or has an overlap but subtle difference (which would need to be explained).

The metadata class is mostly documented in https://www.mediawiki.org/wiki/Help:Extension:Media_Viewer#How_can_I_disable_Media_Viewer_for_unrelated_images.3F, I think. It would indeed be useful to have a single index for such classes, perhaps within https://www.mediawiki.org/wiki/Manual:Interface/IDs_and_classes . There's also nomobile, which has an unclear overlap with noprint.