Information Retrieval and Artificial Intelligence.
Abstract
Over the past two decades, there have been very fruitful interactions between information
retrieval and artificial intelligence. Most of these have to do with machine learning, language
understanding, and knowledge representation. As search engines morph from relevance ma-
chines and assistants to goal completion devices that proactively help users achieve their
goals, the potential for artificial intelligence to contribute to information retrieval increases.
In this paper, I sketch some of the possibilities for interaction between information retrieval
and artificial intelligence that I see ahead.
1 Artificial intelligence
For the purposes of this paper, artificial intelligence (AI) is concerned with building agents that
take the best possible action in a situation [5]. The leading textbook in AI, Russell and Norvig’s
Artificial Intelligence: A Modern Approach, groups the field into a number of areas:
• Problem solving
• Knowledge, reasoning, and planning
• Uncertain knowledge and reasoning
• Learning
• Communicating, perceiving, and acting
Many of these core AI areas have witnessed significant progress in recent years and some of this has
figured prominently in the popular press. Think, for example, of machines that beat humans at the
game of go (“learning”), the widespread use of chat bots (“communicating”), speech recognition
as used on mobile phones (“perception”), and, of course, self-driving cars (“acting”).
While it is useful to present the vast amount of work in AI using these somewhat isolated
groupings, the ambition to build complete agents forces us to consider their connections. In
particular, machine-based perception cannot deliver perfectly reliable information about the world.
Hence, reasoning and planning systems must be able to handle uncertainty. A second major
consequence of trying to build complete agents is that AI has been drawn into much closer contact
with other fields. Examples include control theory and economics, both of which also deal with
agents, and human-computer interaction, as AI agents are increasingly exposed to humans. A
third and urgent consequence of the ambition to build complete AI agents concerns ethical and
legal aspects of AI, as we are beginning to realize that the deployment of AI agents has a range
of implications, both short-term and long-term, for people’s lives.
One of the most important environments for intelligent agents is the internet. AI systems have
become common in web-based applications. AI technologies underlie many web-based tools that
we have come to rely on, such as search engines, recommender systems, and web site aggregators.
Moreover, with the widespread use of smartphones, AI systems that solve problems (e.g.,
scheduling), learn from our behavior (e.g., to tag photos), and perceive their environment (e.g.,
through voice recognition) have become ubiquitous.
2 Information retrieval
Interaction with information is a fundamental activity of the human condition. Interactions with
search systems play an important role in the daily activities of many people, informing their
decisions and guiding their actions [6]. For many years, the field of information retrieval (IR) has
accepted “the provision of relevant documents” as the goal of its most fundamental algorithms [1].
The focus of the field is shifting towards algorithms that are able to get the right information to
the right people in the right way.
Making search results as useful as possible may be easy for some search tasks, but in many
cases it depends on context, such as a user’s background knowledge, age, or location, or their
specific search goals. There is a wide variety in the tasks and goals encountered in web search
but web search engines are only the tip of the iceberg [4]. Much harder and more specialized
search needs are everywhere: search engines for a company’s intranet, academic resources, local
and national libraries, and users’ personal documents (e.g., photos, emails, and music) all provide
access to different, more or less specialized, document collections, and cater to different users with
different search goals and expectations. The more we learn about how much context influences
people's search behavior and goals, the more it becomes clear that many hundreds of preference
criteria play a role.
Addressing each optimal combination of search result preferences individually is not feasible.
Instead, we need to look for scalable methods that can learn good search results without expensive
tuning. In recent years, there has been significant work on developing methods for “self-learning
search engines” that learn online, i.e., directly from natural interactions with their users; see
Figure 1 for a schematic overview. Such algorithms are able to continuously adapt and improve
their results to the specific setting they are deployed in, and continue to learn for as long as they
are being used.
[Figure 1: A self-learning search engine: given a query and its current state s_t, the engine generates a result list and learns from the implicit feedback (e.g., clicks) it receives.]
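As a minimal sketch of how such a self-learning engine can work, the Python fragment below implements one step of dueling bandit gradient descent, an online learning to rank approach studied in [2, 3]: perturb the current linear ranker, compare the two resulting rankings through interleaved user feedback, and step toward the winner. The interleaved_preference callback, the feature matrix, and the parameter values are illustrative assumptions, not a definitive implementation.

import numpy as np

def dbgd_step(w, doc_features, interleaved_preference, delta=1.0, alpha=0.01):
    """One step of dueling bandit gradient descent (minimal sketch).

    w: weight vector of a linear ranker (score = doc_features @ w).
    doc_features: (n_docs, n_features) matrix for the current query.
    interleaved_preference: hypothetical callback that shows an interleaved
        ranking to users and returns True if clicks favor the candidate.
    """
    # Sample a uniformly random unit vector as the exploration direction.
    u = np.random.randn(w.shape[0])
    u /= np.linalg.norm(u)
    w_candidate = w + delta * u

    # Rank documents under the current and the perturbed ranker.
    current_ranking = np.argsort(-(doc_features @ w))
    candidate_ranking = np.argsort(-(doc_features @ w_candidate))

    # If implicit feedback prefers the candidate, step toward it.
    if interleaved_preference(current_ranking, candidate_ranking):
        w = w + alpha * u
    return w

Because the update relies only on natural interaction data, a ranker trained this way keeps adapting for as long as it is used, which is exactly the property described above.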
Our focus on “getting the right information to the right people in the right way” suggests a
different view of IR, one where we begin with users and then ask ourselves what it would take to
address a user's information need. This takes us from queries to query understanding, and asks
for offline preparations (largely independent of the query) and online actions (fully dependent on
the query), along with ways to assess, automatically or not, the quality of the results that we
are producing. Here, we follow standard practice in computer science and say that algorithms
that receive their input sequentially operate in an online modality. In contrast, batch or offline
processing does not need human interaction. Typical offline computations in IR involve any kind
of processing that is not query dependent (crawling, document enriching, aggregation, indexing,
...). Typical online computations in IR involve any type of processing that depends on users
and their input (query improvement, ranking, presentation, online evaluation, ...). We detail our
view of IR in four steps, pausing at every step so as to identify where AI plays a role or could
play a role.
Figure 2 provides the first step, called the front door, as it is here that user and search engine
interface.
[Figure 2: The front door: queries come in and SERPs go out; clicks flow back; the front door supports UX improvement and query suggestions.]
The front door determines the user experience and produces the search engine result page
(SERP). It receives queries, and may return query improvements or suggestions for improvements.
The front door also receives other user signals (clicks, shares, abandonment, ...). What are the
opportunities, realized or not, for AI at the front door?
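As a small illustration of one front door function, the sketch below completes a user's prefix against a query log and ranks candidates by frequency. The PrefixSuggester name and the toy log are invented for illustration; a production system would add personalization, spelling correction, and freshness.

from collections import Counter

class PrefixSuggester:
    """A minimal front-door query suggester: complete the user's prefix
    with past queries from a (hypothetical) log, most frequent first."""

    def __init__(self, query_log):
        self.counts = Counter(q.strip().lower() for q in query_log)

    def suggest(self, prefix, k=5):
        prefix = prefix.strip().lower()
        matches = [(q, c) for q, c in self.counts.items() if q.startswith(prefix)]
        matches.sort(key=lambda qc: -qc[1])  # most frequent first
        return [q for q, _ in matches[:k]]

# Example with a toy query log.
log = ["ai agents", "ai agents", "ai and ir", "air travel", "ai ethics"]
print(PrefixSuggester(log).suggest("ai "))  # ['ai agents', 'ai and ir', 'ai ethics']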
Next comes the offline component, illustrated in Figure 3, where (mostly) query independent
processing takes place.
[Figure 3: The offline component: a scheduler drives crawling/ingestion of sources; extraction, enriching, and aggregation feed indexing, which produces the indexes.]
The offline component is about getting content and about enriching content, e.g., by classifying
documents, detecting duplicates, and enriching documents with additional labels or entities
occurring in them. Another important part of the offline component is content aggregation, e.g.,
aggregating all documents in which a certain entity occurs, or all documents authored by the same
person. Creating indexes that allow efficient access to large volumes of data is another key aspect
of the offline component. What are some of the obvious opportunities, again realized or not, for AI
within the offline component of an IR system?
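To make the indexing part concrete, here is a minimal sketch of the central offline data structure, the inverted index; the toy documents are invented, and a real pipeline would add tokenization, stemming, duplicate detection, and the enrichment steps mentioned above.

from collections import defaultdict

def build_inverted_index(docs):
    """Minimal offline indexing: map each term to the documents (and
    in-document positions) in which it occurs."""
    index = defaultdict(list)  # term -> [(doc_id, position), ...]
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].append((doc_id, pos))
    return index

# Example with a toy collection.
docs = {1: "search engines rank documents", 2: "ai agents rank actions"}
index = build_inverted_index(docs)
print(index["rank"])  # [(1, 2), (2, 2)]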
The online component deals with providing adequate responses to queries submitted by users;
see Figure 4.
A key ingredient of the online component is query understanding: what is the user looking for?
What is their intent? Depending on the estimated intents, different specialized search engines,
so-called verticals, serving up specialized types of content (news, images, social media, ...) may
be activated. Each of these has a complex ranking function that is often machine-learned to yield
highly personalized results. The results from disparate verticals are blended and then served to
the user. Again, we may ask what the potential for AI is in this component.
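The sketch below caricatures this pipeline: a per-vertical intent estimator decides which verticals to activate, and their top results are blended into the organic web ranking. The Vertical abstraction, the threshold, and the fixed insertion position are all assumptions; production blenders learn where, and whether, to place each vertical block.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Vertical:
    """A specialized search engine (news, images, ...); hypothetical shape."""
    name: str
    intent_score: Callable[[str], float]  # estimated P(vertical relevant | query)
    results: List[str]                    # the vertical's own ranked results

def blend_serp(query: str, verticals: List[Vertical], web_results: List[str],
               threshold: float = 0.5, slots: int = 10) -> List[str]:
    """Activate verticals whose estimated intent clears the threshold and
    slot their top result into the web ranking (a deliberately naive blend)."""
    serp = list(web_results[:slots])
    # Ascending order, so the most likely vertical is inserted last and
    # therefore ends up highest on the page.
    for v in sorted(verticals, key=lambda v: v.intent_score(query)):
        if v.intent_score(query) >= threshold and v.results:
            serp.insert(1, f"[{v.name}] {v.results[0]}")
    return serp[:slots]

# Example with an invented news vertical and toy results.
news = Vertical("news", lambda q: 0.9 if "election" in q else 0.1,
                ["Election results roll in"])
print(blend_serp("election results", [news], ["web hit 1", "web hit 2"]))
# -> ['web hit 1', '[news] Election results roll in', 'web hit 2']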
The final component, evaluation, deals with different ways of assessing the performance of the
IR system; see Figure 5.
[Figure 5: The evaluation component: interaction logs feed offline evaluation, A/B testing, interleaving, and online learning.]
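Among these, interleaved comparisons are a distinctly IR-flavored evaluation method, so a minimal sketch may help: two rankers take turns contributing their next-best document to a single result list shown to the user, and clicks are credited to the ranker that contributed the clicked document. The round-based drafting below is a simplified variant of team-draft interleaving.

import random

def team_draft_interleave(ranking_a, ranking_b):
    """Merge two rankings into one list, remembering which ranker
    ('team') contributed each document (team-draft style, minimal sketch)."""
    interleaved, team = [], {}
    all_docs = set(ranking_a) | set(ranking_b)
    while len(interleaved) < len(all_docs):
        # Randomize which ranker drafts first in each round to remove bias.
        pair = [("A", ranking_a), ("B", ranking_b)]
        random.shuffle(pair)
        for name, ranking in pair:
            pick = next((d for d in ranking if d not in team), None)
            if pick is not None:
                interleaved.append(pick)
                team[pick] = name
    return interleaved, team

def credit_clicks(clicked, team):
    """Credit each click to the contributing ranker; more credit wins."""
    votes = {"A": 0, "B": 0}
    for doc in clicked:
        if doc in team:
            votes[team[doc]] += 1
    return votes

Because each comparison is decided by the same implicit feedback that drives online learning, interleaving connects the evaluation component back to the self-learning search engines of Figure 1.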
References
[1] N. Belkin. People interacting with information. SIGIR Forum, 49(2):13–27, 2015.
[2] K. Hofmann, S. Whiteson, and M. de Rijke. Balancing exploration and exploitation in learning
to rank online. In ECIR ’11. Springer, April 2011.
[3] K. Hofmann, S. Whiteson, and M. de Rijke. Balancing exploration and exploitation in listwise
and pairwise online learning to rank for information retrieval. Information Retrieval Journal,
16(1):63–90, February 2013.
[4] K. Hofmann, S. Whiteson, A. Schuth, and M. de Rijke. Learning to rank for information
retrieval from user interactions. ACM SIGWEB Newsletter, (Spring):Article 5, April 2014.
[5] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Pearson, 3rd edition,
2010.
[6] R. W. White. Interactions with Search Systems. Cambridge University Press, 2016.