Beresi PhD thesis.pdf - OpenAIR @ RGU - Robert Gordon University
September 2011
Every thesis should have a section which acknowledges the contributions of those who
directly, or indirectly, have helped create it. This thesis is no different and as such contains
said section. It is very possible that this section is both incomplete and unjust in giving
This thesis is dedicated to my wife Jorgelina, to whom I say thank you for your
patience, understanding and endurance. However, and most importantly, thank you for
trusting in me and joining me in this pursuit. Finishing this project would not have been
who taught me the importance of approaching the world with a clear and educated mind,
thinking for myself, questioning everything, and the importance of always keeping an open
to them. To David, thanks for being direct. You taught me how to communicate both
praise and criticism frontally. To Ian, thank you for trusting in both my abilities and
judgement. When times looked bleak you provided me with encouragement and advice. I
have to also admire your boldness: you agreed to supervise me after knowing me for about
one hour (how silly was that?) To Mark, thank you for your sound advice. I will never
forget your words: “a good thesis is a finished one.” To Patrik, thank you for being there.
To both examiners of this thesis, Prof. Ian Allison and Prof. Birger Larsen. Whatever
clarity and insights are contained in this work, they are due to your insistence. Not
only did your suggestions lead to a much improved dissertation, but your insights and questions
helped me realise the importance of communicating research in a clear and concise manner.
To Sandy, Thierry and Malcolm, who provided hours of amusement and sanity-preserving
tea breaks. To Prof. Dawei Song, who supported and encouraged me endlessly; I would
probably still be writing up without his support. Last, but certainly not least, to you,
Yunhyong: without your constant challenges much of the quality of my work would be
below par.
As some portions of this dissertation have been presented in various forums, each
individual publication went through a review process which provided me with insightful
comments from the reviewers. These were invaluable in improving the final dissertation.
Many others are to be thanked; however, you are too many to be listed here. So, thank
1 Introduction
2 Literature Review
2.4 On Relevance
2.7 Summary
3 Study
3.1 Method
3.4 Collections
3.6 Measurements
3.7.1 Interaction
3.7.2 Intent
4 Results
4.1 Participants
4.2 Collections
4.4 Interaction
4.5 Intent
5 Results II
6 Discussion
Appendices
A Forms
B Publications
List of Tables
3.3 Encoding used to tag the utterances that express a relevance criterion
4.1 Number of participants per school/group for which valid data was gathered
4.2 Queries issued to the search engine for constructing the collection for the
4.3 Number of occurrences for each criterion according to the global relevance
criteria profile
4.5 Number of occurrences for each criterion according to each school relevance
criteria profile
4.7 Number of occurrences for each criterion according to each research expe-
5.1 Average use (averaged across participants that expressed using them) of
5.2 Frequency of each criterion as distributed across single criterion rule uses.
List of Figures
2.1 The LBD models in action. The inner box depicts the open model in which
a searcher is interested in forming hypotheses about the topics of interest.
The outer box depicts the closed model; a model in which searchers either
reject or find evidence for pursuing the verification of hypotheses.
3.1 Search task introduced during the first meeting with participants
3.2 Example search task introduced during the second meeting with participants
3.3 User interface of the system participants used during their first session
3.4 An example document. On the top left the document identifier is displayed.
3.5 Visual representation of the open search algorithm. The first step is to
model the topics contained in the user submitted documents and retrieve
more documents about them (step a). The second step is to model the topics
contained in the pooled documents and retrieve even more documents (step
b). The final step is to model the topics contained in each document set
3.6 Topic tree. The A node is the participant’s initial topic. The inner nodes,
the Bi nodes, are the immediately related topics as described by both the
literature and the topic extraction algorithm. The tree leaves are the indi-
3.7 First screen users see when doing the second part of the study. In the
image the terms representing the initial topic are presented on the left
panel whereas the initial C topics, also in the form of terms, are displayed.
3.8 Navigation screen. The top panel (a) displays both the initial topic (listed as
“Your topic”) and the potentially related topic (listed as “Related topic”).
The middle panel (b) lists the intermediate topics. The bottom panel (c)
3.9 Search task given to all research students that participated in the study
3.10 Search task given to all researchers that participated in the study
3.11 Search task given to all senior researchers that participated in the study
3.12 A typical relevance criteria profile. Frequencies are normalised, hence the
3.14 An example with four relevance criteria plotted. Interactions are further
4.2 Relevance criteria profile of the global group. Values in the y axis vary
between 0 and 1.
4.5 The profiles of the schools, normalised within criteria, plotted together.
It must be noted that the closer to 0 the score, the more similar the two
profiles are. This is in line with traditional heatmap plots where red is used
5.2 Total number of participants that used, at least once, a relevance judgement
5.4 Navigation screen. The top panel (a) displays both the initial topic (listed as
“Your topic”) and the potentially related topic (listed as “Related topic”).
The middle panel (b) lists the intermediate topics. The bottom panel (c)
5.5 Proportion of interactions with the left (right) panel as the session progresses.
The two curves mirror each other. For every interaction observed
the proportion of one increases while the proportion of the other decreases,
the x axis and the curve for the interactions with the left panel increases
accordingly while the curve for the interactions with the right panel decreases.
6.1 Feedback form. Several criteria are present which, when combined, present
a more informative judgement for why the video is worth seeing.
A.6 Form used to capture the documents selected as well as the topics they
support.
A.7 Form used at the end of the second search session (page 1).
A.8 Form used at the end of the second search session (page 2).
Chapter 1
Introduction
As pools of information increase in size and complexity, the mechanisms to retrieve in-
formation from them must evolve accordingly. The discipline that deals with the issues
involved in the design and testing of these mechanisms is called Information Retrieval
(IR). Traditionally, the retrieval of information has been based on the matching of infor-
mation objects to user requests. To perform the matches, keys are extracted, from both
information objects and user requests, and they are compared. This procedure, however,
places the burden of choosing the right keys on the users. Hence, users are faced with
the problem of having to select the keys that they think are likelier to guide them to the
relevant information.
with searching for and retrieving relevant information, not just any kind of information.
However, this notion of relevance is elusive. At a basic level one could talk about relevance
estimation of true relevance and the one usually embedded in the matching techniques
of IR systems. However, “...users can read into results a lot more than correspondence
between noun phrases or some such in queries and objects, used primarily by systems for
problem that is not retrieved by a system for a variety of reasons, for example not reflected
In a series of articles, Swanson provides examples and explores the possibility of deriv-
ing relevance relations between seemingly disconnected areas of medicine (Swanson 1986a,
Swanson 1988b, Swanson 1990). Swanson (1986c) attributes the existence of these knowl-
edge gaps to the natural fragmentation of science and defines these relevance relations
as Undiscovered Public Knowledge (UPK) alluding to the fact that while the knowledge
was implicit in the literature bodies, nobody had noticed the links before. The discovery
of these hidden relationships –or the derivation of these relevance relations– led Swanson
search strategy for finding these logically connected but non-interactive (in terms of cross-
and co-citations) literatures (Swanson 1986c, Swanson & Smalheiser 1997b). The proposed
search strategy is composed of two stages where the first stage is “ exploratory process
related pairs of literatures...” and is followed by “...a method for eliminating all pairs
except those that are noninteractive” (Swanson 1989). These two stages of the search pro-
cess have been labelled the open model and closed model of discovery respectively (Weeber,
Klein, de Jong-van den Berg & Vos 2001). Eventually, the art of retrieving these disjoint
Researchers have taken on Swanson’s message and started developing specialised In-
formation Retrieval (IR) systems; systems that are designed and tuned to find these com-
plementary literatures and provide information as to why they are potentially so. While
traditional IR systems are designed with a best match strategy in mind, in LBD systems
this approach is only complementary to an initial step of discovery in which topics of inter-
est and their relationships are modelled and assessed. Research in LBD has been mostly
of words, recent techniques have included information from specialised databases such as
the Unified Medical Language System (UMLS) and the Medical Subject Headings (MeSH)
(Lowe & Barnett 1994). LBD is a relatively young discipline, though the array of tried
Evaluation of LBD techniques typically involves identifying whether key concepts from
the Swanson discoveries were promoted through the results returned by the system. For
example, the appearance of phrases such as “blood viscosity” and “platelet aggregation”,
which were important in connecting “Raynaud’s disease” with “Fish oil” (Swanson 1986a)
useful to measure the performance at a system level (albeit in a limited fashion); however,
this approach leaves several questions unanswered that, in the case of LBD, might be
central to measuring the success of a system. Effectively, the evaluation method used
collection of requests, and a collection of relevance judgements– are used to assess the
products of IR systems are evaluated (Harman 1997). In this tradition potential end users
are rarely involved in the process, hence cognitive factors such as information seeking and
Borlund (2003b) proposes a framework for evaluating Interactive IR (IIR) systems that
takes into account user-centred issues. The framework is composed of three experimen-
tal components which allow the evaluation of IIR systems to take place under realistic
settings while still retaining the control observed in the evaluations conducted using the
test persons, while control is retained by the use of simulated work task descriptions; a
core component of the framework proposed by Borlund. Simulated work task situations
describe a situation in which needs for information are triggered in users; users are led
to a cognitive state in which information needs arise and need to be satisfied before they
can move on. Because potential end-users are involved, relevance can be treated as a dy-
processes can be considered; all factors that may be central to the evaluation of LBD
1.1. Aims of the Dissertation
LBD systems are inherently interactive since it is through interacting with the system,
assessing proposed relationships and inspecting the literature that researchers propose,
verify or reject hypotheses. Cognitive issues, therefore, such as information seeking and
interaction processes, are factors that should be taken into account during the evaluation
of such systems.
is likely that several factors, for instance the interactivity of the literatures, may affect the
judgements made by the researchers. It is not entirely clear what the nature of relevance
in the context of LBD is, how it can be observed (or measured) nor whether this can be
have not been explicitly observed. The multidimensional and dynamic nature of relevance
in this context is of particular interest in this work. Effectively, the core of this work lies
in observing which criteria form part of this notion of relevance, the fluctuations across
these criteria and whether researchers use them in varying proportions when assessing the
Hence, the main research question that this dissertation aims to answer is
What relevance criteria, if any, do researchers use when assessing the relevance
is achieved by having the researchers interact with an LBD system while completing a
knowledge discovery task. As LBD is mainly aimed at the scientific community, questions
ferent frequencies?
1.2. Structure of the Dissertation
Does research experience affect the relevance criteria, and their frequency of
To guide the evolution and improvement of LBD systems, and IR systems in general,
here where relevance plays an important role. Hence, the aforementioned questions gain
To try and find answers to these questions, it is proposed in this dissertation to fol-
low the guidelines and include the experimental components as proposed by Borlund &
Ingwersen (2000) in the design of a user study. As has been argued before, studies
on the nature of relevance should be conducted with real end-users and in realistic set-
tings (Schamber, Eisenberg & Nilan 1990). The context of the study is that of scientific
discovery, hence the test persons are recruited from the scientific community; researchers
are the end users of LBD systems. Researchers from three different disciplines, namely
computing, information management and pharmacy, are invited to take part in the study.
review of LBD: an introduction to the problems, a survey of the most representative tech-
niques, and a description of the evaluation methods used are offered. Additionally both
the system-driven tradition of IR research and Borlund’s proposal for evaluating IIR sys-
tems are described here. In Chapter 3 details of the design and the materials used during
the study including the characteristics of the user group, the collections searched and a
description of the system used are offered. Relevance criteria profiles and session visuali-
sation techniques are central to the analysis of the data gathered during the study. These
analysis tools, developed during the course of this investigation, are described in Chapter
3. Chapters 4 and 5 present the data gathered during the study. In Chapter 4 an analysis
of the relevance criteria observed using relevance criteria profiles is offered. Relevance
criteria profiles are revisited in Chapter 5. Furthermore, the data is segmented to isolate
and analyse the relevance judgement processes, and the interactions observed with the
system are described and analysed. Chapter 6 includes a discussion of the implications of
the results and a number of recommendations for future work. Chapter 7 concludes this
dissertation with a summary of the research carried out regarding the research questions.
Chapter 2
Literature Review
As research fields become more specialised, academics tend to interact more with re-
searchers and literature from their chosen speciality, and less with researchers and litera-
ture outside of their own specific area of interest. Consequently, the interaction between
fields, through cross-referencing across fields and shared use of common literature, be-
comes reduced and related fields detach from one another. The result is relatively isolated
(fragmented) and highly specialised bodies of literature, a phenomenon that has recently
accelerated due to the increased rate of new publications available online (Swanson 1986c).
This detachment of research fields means that academics who share common interests
and approaches but work in different fields can miss important connections. It is becom-
ing increasingly challenging for most researchers, especially in established fields, to keep
up to date with important developments in their own chosen speciality (Swanson 1988a).
However, keeping track of useful new developments in allied fields is even more demand-
ing, relying far more heavily on the inefficient processes of manual literature searching
neered through dedicated, but often small-scale, initiatives. Ashworth (1966) argues that
librarians are in a unique position since much knowledge passes through their hands. A
librarian should, then, be able to perceive which items of knowledge might be profitably
combined even though he will not combine them himself or discover the full potential
behind said combination. Ashworth also suggests that “this need for combination is so
fundamental that library systems should be designed to meet the requirement, thus enabling
librarians to play an active and vital part in future innovation”. In Ashworth’s proposed
• To demonstrate automatically when ideas are neighbours of each other, and therefore
The aim of research into Literature Based Discovery (LBD) is to help discover these
literatures. This area of research is motivated by the findings of Don Swanson, who in
the mid-80s discovered two disjoint literature bodies that were complementary, i.e. when
put together, they suggested an answer to a question which was not previously published.
Swanson saw the potential in this procedure –combining knowledge from both literatures
to form an answer– and started to systematically investigate it under the name of Undis-
covered Public Knowledge (UPK), more recently known as Literature Based Discovery
(Swanson 1986c).
To this day researchers keep on improving the techniques used to discover and retrieve
these unrelated literatures. Techniques vary from system to system; however, the search
models implemented remain the same: the open model and the closed model. The open
(Weeber et al. 2001). Users begin a search session with a topic in mind. The system is in
charge of generating (suggesting) a set of directly related topics which in turn will be used
to infer a new set of related topics. This last set of topics is interpreted as being indirectly
related to the researcher’s initial topic. Using these suggested topics, users can then start
hypothesising about the potential relationships. The closed model, on the other hand,
aids users in verifying the hypotheses generated using the open model. A typical search
session begins with a hypothesis. This hypothesis is that the topics selected during the
open model search session (the researcher’s initial topic and an indirectly related topic)
2.1. Literature Based Discovery
are indeed related. A system implementing the closed model search is responsible for
retrieving the intermediate topics and the relevant literatures. It is with these literatures
that the researcher can start considering whether this hypothesis has merit or not.
replicating Swanson’s initial discoveries. To this end, researchers model and extract
the concepts contained in a subset of the relevant literature and observe, after filtering
and ranking, how many of these modelled concepts correspond to those that Swanson
found to play a role in his discoveries. We suggest that evaluating systems in
such a way, albeit convenient and cheap, may not be the most appropriate method. For
example, the users’ context and interactions are disregarded. In a scenario in which
background knowledge, for instance, and other contextual factors may greatly affect the
outcome of the search sessions, an evaluation method that simply ignores them might not
In the following sections a description of the most representative work in the area of
LBD is offered. The history of LBD is covered in Section 2.1 while the search models are
described in Section 2.2. The techniques and evaluation methods are discussed in Section
2.3. In Section 2.5 a short overview of evaluation methods for Interactive Information
Retrieval (IIR) is offered; it is our suggestion that these methods may be more appropriate
The most famous example of a successful discovery using LBD is that of the relation
between dietary fish oil and Raynaud’s disease (Swanson 1986a). Dietary fish oil has
been shown to have several effects that improve blood circulation, e.g. reduction in blood
circulatory disorder, presents several symptoms that seem to be the negation of the effects
achieved by dietary fish oil. Both arguments, if considered together, suggest that dietary
In his seminal article, Swanson (1986a) presented a set of 25 articles discussing the
2.2. LBD Search Models
beneficial effects of dietary fish oil on blood circulation and another set of 35 articles
discussing the symptoms of Raynaud’s syndrome that would be affected by these effects.
A careful analysis of these two sets showed that they were not interacting with each other in
the sense that no article in either set cited any article in the other set (cross-citation) and
that only 4 articles cited at least one article from each set (co-citation). However, Swanson
noticed and explained that out of the 4 articles which cited at least one article from each
set, none did so in a way that related fish oil and Raynaud’s syndrome. Contrasting the
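The cross-citation and co-citation checks described above can be illustrated with a minimal sketch. The article identifiers, the `citation_interaction` helper and the toy citation map are all hypothetical; the sketch only shows how the two counts could be derived and is not Swanson's own procedure.

```python
def citation_interaction(set_a, set_b, citations):
    """Count how two literatures interact through citations.

    citations maps an article id to the set of article ids it cites.
    Returns (cross, co): cross lists (citing, cited) pairs that link one
    set directly to the other; co lists articles citing at least one
    article from each set.
    """
    cross = []
    for src in sorted(set_a | set_b):
        for dst in sorted(citations.get(src, ())):
            if (src in set_a and dst in set_b) or (src in set_b and dst in set_a):
                cross.append((src, dst))
    co = sorted(
        art for art, cited in citations.items()
        if cited & set_a and cited & set_b
    )
    return cross, co


# Hypothetical toy data: two small literatures and a citation map.
SET_A = {"a1", "a2"}
SET_B = {"b1", "b2"}
CITATIONS = {
    "a1": {"x"},
    "a2": {"a1"},
    "b1": {"b2"},
    "x": {"a1", "b2"},   # cites one article from each set
    "y": {"a2"},
}

cross, co = citation_interaction(SET_A, SET_B, CITATIONS)
print("cross-citations:", cross)
print("co-citing articles:", co)
```

Under this reading, two literatures with an empty cross-citation list and few co-citing articles would be candidates for being non-interactive; whether a co-citing article actually relates the two topics still requires reading it, as the text points out.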
the general idea of Undiscovered Public Knowledge (UPK) of which the fish oil - Raynaud’s
syndrome was a particular instance. Swanson argued that the creative use of information
searching strategies could lead to meaningful discoveries as these isolated, but complementary,
bodies of literature were waiting to be discovered. By working on the fish oil
- Raynaud’s syndrome example Swanson not only suggested that dietary fish oil might
ameliorate or even prevent Raynaud’s syndrome, he also suggested that, if his suggested
link was indeed correct, logically related, but isolated, bodies of literature existed. At
in Section 2.3, this domain offers certain structural information and metadata that has
been beneficial to enhancing the techniques used to not only replicate Swanson’s original
discoveries but also to make new discoveries. There have been a few examples of LBD being
applied outside the field of medicine (Gordon, Lindsay & Fan 2002, Cory 1997); however,
the question of whether LBD can be applied outside the field of medicine remains largely
replicate Swanson’s original discoveries which suggests that extrapolating the proposed
Weeber et al. (2001) suggested that most LBD systems are designed to implement one
(or both) of two search models. The open model, an exploratory model by nature, is
characterised by the generation of a hypothesis at the end of the process. The closed
model, on the other hand, aids the researcher in evaluating whether a hypothesis has
a laboratory. Both models complement each other and are described next.
The search process begins with a topic of a scientific problem or research question. This
initial topic is usually referred to as “the starting topic” or “A-topic”. For instance, a
scientist may be interested in finding a novel treatment for Raynaud’s syndrome. Next,
the A-topic (Raynaud’s disease in the example) is used to query a database, Medline
for instance, and the pertinent literature is retrieved. This literature is referred to as
the “starting literature” or “A-literature”. Important topics are then extracted and pro-
cessed, according to techniques that vary from system to system. There may be human
intervention in this extraction and processing step. The extracted topics are referred
to as “intermediate topics” or “B-topics”. If, for instance, the starting topic is that of
our example (Raynaud’s disease), the B-topics may then correspond to characteristics or
symptoms of the syndrome, already tried treatments, etc. These B-topics are then used
to query a database to obtain the pertinent literature; this literature is referred to as the
extracted. These topics are known as “target topics” or “C-topics”. The searcher is then
presented with this network of topics so that one or more target topics can be selected for
further inspection.
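The open model steps above can be sketched as follows. This is a minimal illustration under strong assumptions: the corpus is a handful of invented titles rather than Medline records, topics come from a fixed hypothetical vocabulary matched by substring, and C-topics that already co-occur with the A-topic are discarded. None of the names (`search`, `extract_topics`, `open_discovery`, `rank_targets`) come from an existing LBD system.

```python
from collections import Counter

# Hypothetical toy corpus and topic vocabulary (not real Medline data).
CORPUS = [
    "raynaud disease and blood viscosity",
    "platelet aggregation in raynaud disease",
    "fish oil reduces blood viscosity",
    "fish oil inhibits platelet aggregation",
    "aspirin and platelet aggregation",
]
TOPICS = ["raynaud disease", "blood viscosity", "platelet aggregation",
          "fish oil", "aspirin"]


def search(topic, corpus):
    """Stand-in for a database query: documents that mention the topic."""
    return [doc for doc in corpus if topic in doc]


def extract_topics(docs, exclude):
    """Count the known topics mentioned in docs, skipping excluded ones."""
    counts = Counter()
    for doc in docs:
        for topic in TOPICS:
            if topic in doc and topic not in exclude:
                counts[topic] += 1
    return counts


def open_discovery(a_topic, corpus):
    """Return a map from each C-topic to the B-topics that link it to A."""
    a_literature = search(a_topic, corpus)
    b_topics = extract_topics(a_literature, exclude={a_topic})
    links = {}
    for b in b_topics:
        b_literature = search(b, corpus)
        candidates = extract_topics(b_literature, exclude={a_topic} | set(b_topics))
        for c in candidates:
            # Assumption: keep only C-topics never mentioned together with A.
            if not any(a_topic in doc and c in doc for doc in corpus):
                links.setdefault(c, set()).add(b)
    return links


def rank_targets(links):
    """Order C-topics by the number of linking B-topics, then by name."""
    return sorted(links, key=lambda c: (-len(links[c]), c))


links = open_discovery("raynaud disease", CORPUS)
for c in rank_targets(links):
    print(c, "via", sorted(links[c]))
```

Even in this toy setting every B-topic triggers a further search, which hints at why the number of candidate A-to-C relationships grows quickly and why filtering and ranking matter.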
a B-topic, which in turn is related to a C-topic, then there is a chance that an indirect re-
lationship between the A-topic and the C-topic exists. This simplification of the reasoning
which are reviewed in Section 2.3. And because the process is simplified, the result is an
exponential growth in the number of potential relationships between any one A-topic and
any one C-topic. Hence, the two biggest challenges, in terms of building systems, posed
In Figure 2.1 (figure adapted from (Weeber, Klein, Aronson, Mork, de Jong-van den
Berg & Vos 2000)) we can observe a graphical depiction of the open model.
Figure 2.1: The LBD models in action. The inner box depicts the open model in which
a searcher is interested in forming hypotheses about the topics of interest. The outer box
depicts the closed model; a model in which searchers either reject or find evidence for
pursuing the verification of hypotheses.
With a hypothesis in mind, a searcher embarks on using the closed model to assess whether
it merits further investigation. The process begins with two topics, the A-topic and the
C-topic. The database (or databases) is queried twice, once with the A-topic and once
with the C-topic and both the A-literature and the C-literature are retrieved. These two
literatures are pooled together and the resulting set is processed. The processing of the
pool of literatures results in a set of intermediate B-topics that may, or may not, link
2.3. LBD Systems
the two starting topics together. The searcher’s job is then to investigate the B-topics
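A minimal sketch of the closed model search just described is given below. It assumes a toy corpus of invented titles, a fixed hypothetical topic vocabulary, and one particular way of processing the pooled literatures: a topic is kept as a B-topic only if it appears in both the A-literature and the C-literature. The function name `closed_discovery` is illustrative only.

```python
# Hypothetical toy corpus and topic vocabulary (not real Medline data).
CORPUS = [
    "raynaud disease and blood viscosity",
    "platelet aggregation in raynaud disease",
    "fish oil reduces blood viscosity",
    "fish oil inhibits platelet aggregation",
    "aspirin and platelet aggregation",
]
TOPICS = ["raynaud disease", "blood viscosity", "platelet aggregation",
          "fish oil", "aspirin"]


def closed_discovery(a_topic, c_topic, corpus, vocabulary):
    """Return candidate B-topics that may link the A-topic and the C-topic."""
    # Query the collection twice, once per starting topic.
    a_literature = [doc for doc in corpus if a_topic in doc]
    c_literature = [doc for doc in corpus if c_topic in doc]
    b_topics = []
    for topic in vocabulary:
        if topic in (a_topic, c_topic):
            continue
        in_a = any(topic in doc for doc in a_literature)
        in_c = any(topic in doc for doc in c_literature)
        # Assumption: a linking topic must be mentioned in both literatures.
        if in_a and in_c:
            b_topics.append(topic)
    return b_topics


print(closed_discovery("raynaud disease", "fish oil", CORPUS, TOPICS))
```

The returned topics are only candidates; deciding whether they support the hypothesis remains the searcher's job.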
Doyle (1961) points out that “literature searchers value both the unexpected and the ex-
pected. When something unexpected is found, one thereby obtains information; when the
expected is found, one obtains confirmation. However, when one formulates a search, the
unexpected is hardly ever involved. Search requests are practically always constructed out
of familiar combinations of terms.” Both models of LBD share a direct relationship with
Doyle’s observation. By using the open model of search, users will be confronted with the
unexpected, with what is unknown to them yet related to their initial search. When using
the open model, users should obtain information about the possibly related topics. As a
complementary step, users could perform a search using the closed model. Results from
this search may provide evidence supporting different types of relationships between the
initial topics. At this stage users obtain confirmation (or refutation) on the relationship
between the provided topics; that is, the hypothesis that the topics are linked.
In this section a review of the most prominent examples of the techniques used to build
LBD systems is offered. Both techniques used as well as the evaluation methods are
Although Weeber et al. (2001) suggest that the closed model serves as a complement to
the open model, most research focuses on one of the two. The initial discovery made
by Swanson (1986a), for instance, could be classified as a use of the closed model. After
manually inspecting the literatures on Raynaud’s disease and several articles discussing the
benefits of dietary fish oil, Swanson suggested the link between them. Swanson (1988b)
hypothesised next that migraine attacks may be linked, through eleven connections, to
magnesium deficiency. The procedure followed by Swanson, however, resembles the open
model as it begins with the migraine literature (A-literature) and eventually reaches the
in which he assumes that the relevant literature has been retrieved. Swanson analyses
whether the initial topic can be logically linked to the target topic.
The procedure followed by authors whose research is focused on the open model, e.g.
(Gordon & Lindsay 1996, Gordon & Dumais 1998, Lindsay & Gordon 1999, Pratt &
Yetisgen-Yildiz 2003, Hristovski, Peterlin, Mitchell & Humphrey 2005), is such that it be-
gins with a starting topic and the goal is to rank highly the known target topics. Authors
who conduct their research following the closed model, e.g. (Weeber et al. 2001, Srinivasan
2004, Swanson 1988b, Smalheiser & Swanson 1996b, Smalheiser & Swanson 1996a, Swanson
& Smalheiser 1997a), on the other hand, have a different goal. The procedure they follow
begins with two topics whose linking intermediate topics are known and tries to
While generally based on co-occurrence, techniques differ in how topics are modelled,
how relationships are inferred and how these are filtered and ranked. Approaches based
After having manually inspected the literatures for co-occurring terms in the article titles,
Gordon & Lindsay (1996) investigate statistical approaches further; however, they take
a more principled analytical approach and use traditional IR weighting techniques such as
tf.idf (Salton & Buckley 1988). Gordon & Lindsay (1996) model topics as either unigrams
or bi-grams (words or two-word phrases) and the statistics of these grams are used to filter
and rank the modelled topics: those that do not meet a user-defined threshold are dis-
carded (they are considered noise). While Swanson (1986a) analyses the co-occurrence of
words in article titles, Gordon & Lindsay (1996) do so in the full text (whenever available)
of the articles. Gordon and colleagues continue evaluating the appropriateness of term
statistics for filtering and ranking purposes (Lindsay & Gordon 1999, Gordon et al. 2002).
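The idea of modelling topics as unigrams and bi-grams and discarding those below a user-defined threshold can be sketched as follows. The weighting used here, term frequency times log(N/df), is one common tf.idf variant and is an assumption; the exact statistics and thresholds used by Gordon & Lindsay are not reproduced. The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def grams(text):
    """Return the unigrams and bi-grams of a whitespace-tokenised text."""
    words = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams


def tfidf_topics(docs, threshold):
    """Score every gram by its best tf.idf weight and drop low scorers."""
    tokenised = [grams(doc) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each gram appears.
    df = Counter(g for doc in tokenised for g in set(doc))
    scores = {}
    for doc in tokenised:
        for gram, freq in Counter(doc).items():
            weight = freq * math.log(n_docs / df[gram])
            scores[gram] = max(scores.get(gram, 0.0), weight)
    # Grams below the user-defined threshold are treated as noise.
    kept = {g: s for g, s in scores.items() if s >= threshold}
    return sorted(kept.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy documents.
DOCS = [
    "fish oil lowers blood viscosity",
    "fish oil and platelet aggregation",
    "blood viscosity in raynaud disease",
]

for gram, score in tfidf_topics(DOCS, threshold=1.0):
    print(f"{score:.3f}  {gram}")
```

No stopword removal is applied in this sketch, so function words that happen to be rare in the toy documents survive the threshold; a realistic pipeline would need to handle them separately.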
Gordon & Dumais (1998) report on the application of a technique called Latent Se-
mantic Indexing (LSI) (Deerwester, Dumais, Furnas, Landauer & Harshman 1990). The
use of LSI in this context is to reveal hidden potential relationships amongst terms in
text, as semantically similar terms lie close together in the LSI space. Unfortunately, no
mention is made of how topics are filtered or discarded. Topics are ranked according to
their proximity (using cosine distance) to the starting topic (e.g. “Raynaud’s disease”).
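For reference, the ranking step can be written compactly using the standard LSI formulation. The symbols below are not taken from the source: X is assumed to be a term-by-document matrix, k the number of retained dimensions, and t_a the starting topic.

```latex
X \approx X_k = U_k \Sigma_k V_k^{\top}, \qquad
\hat{t}_i = \text{row } i \text{ of } U_k \Sigma_k, \qquad
\operatorname{sim}(t_a, t_i) =
  \frac{\hat{t}_a \cdot \hat{t}_i}{\lVert \hat{t}_a \rVert \, \lVert \hat{t}_i \rVert}
```

Ranking candidate topics by decreasing similarity is equivalent to ranking them by increasing cosine distance, taken here as 1 minus the similarity.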
In addition to text statistics, other approaches have involved the use of external
databases and metadata for topic modelling, filtering and ranking. The system imple-
mented by Weeber et al. (2001) –an extension to their previously implemented system
(Weeber et al. 2000)– uses the MetaMap algorithm (Aronson 2001). The MetaMap
algorithm maps free-form text to Unified Medical Language System (UMLS) medical
concepts, which are usually used as topics. The semantic information associated
with these concepts is also usually used for filtering. Algorithms that use MetaMap
include LitLinker (Pratt & Yetisgen-Yildiz 2003) and the work conducted by Wren, Bek-
eredjian, Stewart, Shohet & Garner (2004). Pratt & Yetisgen-Yildiz (2003) initially map
text to medical concepts using MetaMap. To uncover potential links, LitLinker uses as-
sociation rule mining (ARM), an unsupervised machine learning algorithm for learning
c, hence a is considered to indirectly co-occur with c (R. Agrawal, Mannila, Srikant, Toivo-
nen & Verkamo 1996). Topics are pruned according to several criteria. Firstly, topics that
are considered to be too general (according to the UMLS) are pruned. Secondly, topics
that appear in more than 10,000 titles are removed. Thirdly, the proximity of the topics
to the initial A-topic is measured using the number of parents shared in the UMLS hi-
erarchy. Topics that are considered too close are also removed. Finally, UMLS semantic
types are used to filter the remaining topics. Once pruning is over, similar concepts are
grouped together to increase their statistics. Wren et al. (2004) model topics by mapping
free-form text to concepts using several databases, among them UMLS and the
Medical Subject Headings (MeSH) (Lowe & Barnett 1994). Relationships are identified
by analysing the co-occurrence of topics within Medline records. To filter and rank the
discovered relationships, they are compared to a random network model. The ranking of
a connection depends on the ratio between the number of connections for any given link
and the expected number of connections by chance. Those that exceed a defined threshold
Previous approaches rely on free-form text to model and extract the relevant topics;
however, the algorithm proposed by Srinivasan (2004) relies only on MeSH (Lowe & Barnett
1994) and UMLS. The algorithm models topics as combinations of MeSH profiles where
profiles are vectors of weighted MeSH terms. Topics are ranked according to the number of
intermediate B-topics, and the strength of their association, that link the starting A-topic
who include genetic information into their LBD system named BITOLA. This is an ex-
tension to their previous work (Hristovski, Stare, Peterlin & Dzeroski 2001, Hristovski,
Peterlin, Mitchell & Humphrey 2003). Information about chromosome location of the
starting and target topics is integrated so that disease-gene discoveries can be performed
with their system. To this extent, despite the focus on the Medline literature, other
are included. Much like LitLinker (Pratt & Yetisgen-Yildiz 2003), to discover relation-
ships, Hristovski et al. (2005) use association rule mining (R. Agrawal et al. 1996) between
medical concepts.
Now superseded by Entrez Gene –
Swanson’s initial discoveries were later corroborated in subsequent articles in the biomedicine
field (Weeber et al. 2001). Most researchers evaluate their systems using Swanson’s dis-
coveries as gold standards. The golden data is composed of golden topics, Swanson’s
reported A, B, and C topics. Although the details of each approach vary, to measure the
performance of their systems, researchers follow a similar approach. The products of the
• Determine the number of golden topics present in the list of returned topics
Gordon & Lindsay (1996) attempt to replicate the fish oil-Raynaud’s disease discovery
(Swanson 1986a). To do so, the authors perform their experiments with the original fish-oil
documents dated between 1982 and 1986. The effectiveness of their system is measured
by means of calculating Precision and Recall (Cleverdon, Mills & Keen 1966); however,
the authors do not specify how they compare the topics retrieved by their system with
the golden topics in order to count them as either true or false positives. To assess further the
ranked lists of topics extracted by their systems, human experts are involved and these
suggest that “blood viscosity” is a salient topic. It remains unclear, however, why this
topic was considered to be salient. Additionally, while it is true that its statistics placed
this topic within the top ranked topics, other topics were also ranked highly and hence, if
one were to set a threshold for picking potential candidates, these other topics could have
In a subsequent article, Gordon & Dumais (1998) evaluate the performance of LSI
by comparing the results of this technique with those obtained through the use of term
statistics of Gordon & Lindsay (1996). A total of 136 LSI concepts found to be “near”
the concept “raynaud” (“near” being defined in terms of the cosine distance between
two concepts) are compared to the union of 6 top-40 concepts from the results of Gordon
& Lindsay (1996). The authors find that there is a 42% overlap between the two sets
of concepts. More importantly, however, the authors report that the top 10 LSI
concepts include 9 of the top 10 concepts of (Gordon & Lindsay 1996) and conclude that
the approaches are, to a certain extent, uncovering the same relationships. The authors
do not explicitly mention, however, how the correspondence between the LSI concepts and
In an extension to Gordon & Lindsay (1996), Lindsay & Gordon (1999) use tri-grams
(three-word phrases). The authors follow the evaluation procedure of Gordon & Lindsay
(1996); however, the golden data is taken to be Swanson’s discovery that migraine might
is the open model, using “migraine” as the starting concept, the evaluation is done in
two stages. During the first stage, the set of intermediate concepts is evaluated to see
the authors consider to also be related) are discovered. The authors report that “10 of
the 12 intermediate concepts linking migraine and magnesium were among the first few
dozen items suggested by either token or record count analyses of the migraine literature”
(Lindsay & Gordon 1999). During the second stage the authors evaluate the results of
further processing of all the 12 intermediate concepts3. This processing involved extracting
the top 500 topics from each intermediate literature, pooling them together and filtering
to extract the top 50. These are in turn given to a medical student who further groups
and consolidates the list. The authors report that magnesium did not appear in the top
Weeber et al. (2001) also try to replicate Swanson’s original discoveries (Swanson
1986a, Swanson 1988b). In the case of the fish oil-Raynaud’s discovery, to build the corpus
of documents to analyse, the authors run the query “Raynaud’s disease” on Medline. The
system then maps the raw text from the titles and the abstracts of the documents retrieved
to UMLS concepts using MetaMap (Aronson 2001). The semantic information associated
with these concepts is also used for filtering and ranking. For the closed model, the
authors examine the 68 top ranked intermediate MetaMap concepts and report that the
The authors argue that since they are trying to replicate Swanson’s findings, selecting only those for
the experiment is reasonable. Additionally, it is suggested that if the purpose of the experiment was to
generate new discoveries, all top intermediate topics in the ranking should be investigated further.
original B-topics of the fish oil-Raynaud’s disease discovery are included in this list. When
using the open model, the authors suggest that, while the original fish oil concepts were
not ranked highly, other concepts related to fish oil were, and that researchers should
still be able to recognise fish oil as a target concept from them. Again, it remains unclear
how the golden topics from Swanson’s discovery are matched against those produced by
the system.
As the work by Gordon et al. (2002) is exploratory in nature –the authors set out to explore
the appropriateness of using LBD on the World Wide Web (WWW)– they do not attempt
to replicate any of Swanson’s discoveries. To conduct their experiments the authors focus
on the open model. The process begins by retrieving4 the top 50 documents on the topic
“Genetic Algorithms”. On these documents, several statistics are calculated such as term
frequency and document frequency (the statistics calculated are explained in more detail
in their previous work (Gordon & Lindsay 1996, Gordon & Dumais 1998, Lindsay &
Gordon 1999)). The extracted concepts, represented as n-word phrases, are ranked and
filtered according to the term statistics calculated on the pooled documents. A human
expert then proceeds to select the 12 most salient topics from the list of ranked topics5.
salient topic largely depends on the expert’s knowledge of the starting topic “Genetic
Algorithms”. Each of these topics is then used to query the search engine and the top 100
documents are retrieved from the WWW. Each document set is statistically analysed and
the results are pooled together. Out of this new list of topics, the authors selected 42 topics
et al. 2002). To validate the results, an expert in the field is asked to generate a list of
report that they did not find any overlap between the list of the expert-generated topics
Other discoveries made by Swanson were used as gold standards as well. For instance,
the migraine-magnesium link is used as a gold standard by Pratt & Yetisgen-Yildiz (2003).
The search engine used is Altavista –
No rationale is given as to why they chose 12 topics.
The experiment conducted by the authors is focused on the closed model of search. The
system works by mapping the raw text of the documents to UMLS concepts by means of
the MetaMap algorithm (Aronson 2001). The resulting concepts are initially filtered in
two steps:
1. Filtering too general concepts: the authors observe that many general concepts are
usually present in the second level of the UMLS hierarchy of concepts and decide to
2. Filtering too frequent concepts: as there are still too many general concepts present,
the authors decide to filter out concepts with a document frequency greater than
Once filtered, concepts are grouped together by folding upwards to parent concepts
in an attempt to increase the statistics of the different concepts extracted. After starting
an open model search with an A-topic representing “migraine”, the authors observe that
the target topic “magnesium” is ranked at position 11. The target topics are ranked
according to how many in-links each topic has6. When conducting a closed model search
using “migraine” and “magnesium” as start and target topics respectively, the authors
report that 5 out of the 11 original links found by Swanson are suggested by the system.
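The in-link ranking of target topics can be sketched as below. This is a simplified illustration of the general open model, assuming a precomputed co-occurrence map between topics; it is not the LitLinker implementation and omits its pruning and grouping steps. The function name, the placeholder topic names and the toy data are hypothetical.

```python
from collections import defaultdict


def rank_targets_by_in_links(a_topic, cooccurs):
    """Rank candidate C-topics by the number of B-topics that link to them.

    `cooccurs` maps a topic to the set of topics it co-occurs with.
    Topics that co-occur directly with the A-topic are treated as
    B-topics and are not returned as targets.
    """
    b_topics = cooccurs.get(a_topic, set())
    in_links = defaultdict(set)
    for b in b_topics:
        for c in cooccurs.get(b, set()):
            if c != a_topic and c not in b_topics:
                in_links[c].add(b)
    counts = [(c, len(linking_bs)) for c, linking_bs in in_links.items()]
    # Highest in-link count first; ties broken alphabetically.
    return sorted(counts, key=lambda pair: (-pair[1], pair[0]))


# Toy, invented co-occurrence data with placeholder intermediate topics.
toy_cooccurs = {
    "migraine": {"b1", "b2", "b3"},
    "b1": {"migraine", "magnesium", "c2"},
    "b2": {"migraine", "magnesium", "c2"},
    "b3": {"migraine", "magnesium", "c3"},
}
targets = rank_targets_by_in_links("migraine", toy_cooccurs)
```

In this toy example the target reached through the most intermediate topics is ranked first, regardless of how strong each individual link is.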
to extracting concepts from the documents. Additionally, the authors’ interest is to see
whether the process can be fully automated and present the end-user with a final ranked
list of C-topics, i.e. no human intervention takes place during the ranking and filtering of
intermediate B-topics. The topics used and returned by the system are composed of
MeSH terms. These concepts are filtered according to the semantic types of the MeSH
terms. Term statistics are then calculated within each semantic type. Concepts are then
are represented as MeSH term weights, within the semantic types, as calculated for a
particular document set. To test the performance of this approach, the authors perform
experiments using both the open and the closed model. In the open model the end-user
identifies the semantic types of interest and the system in turn returns a ranked list of
In-links in this case is the number of intermediate topics related to the target topic.
topics to investigate further. The ranking is based on the number of connections from
each topic to the initial topic as indicated by the user. These are the final target C-topics.
In the closed model, the approach is similar; however, the user is in charge of
providing both the starting A topic and the target C topic. The system returns a ranked
list of intermediate B-topics filtered by the semantic types of both the A and the C
topics. MeSH terms are ranked within B-topics hence each B-topic acts as a group of
1988b). After manually inspecting the returned list, the authors report that key MeSH
terms were ranked within the top 10 ranked topics. The same conclusion is reached for
the closed model experiment; however, the authors state that comparing results with
previous work is difficult as the ranking strategy for the closed model is incompatible
with previous approaches. Additionally, the authors attempt to also replicate Swanson’s
(Smalheiser, Swanson & Ross 1998). The results for these experiments are reported to be
Other approaches to evaluation included using artificial data sets (Van Der Eijk, Van
Mulligen, Kors, Mons & Van Den Berg 2004) and conducting clinical trials on mice (Wren
et al. 2004). Although Hristovski et al. (2005) did not evaluate their system, they
reported that future evaluation was to be carried out and proposed a plan for doing so.
The proposed guidelines suggest that, in order to evaluate their system, valid
disease-gene relationships must first be detected. Their plan is to observe whether the system
can detect such relationships by only having access to the literature prior to the date
suffer from the same drawbacks as the system-driven tradition. Some of these drawbacks
are addressed by the user-centred tradition of evaluation and, more recently, a hybrid
2.4. On Relevance 23
approach which combines the two traditions. Before we discuss these models, however,
we have to discuss the concept of relevance and its role in LBD. In the following section
the reader is presented with a brief account of the research efforts on the notion of
relevance that are pertinent to this dissertation. Once concluded, the evaluation models are described
2.4 On Relevance
due to the increase in size of the available information that we have to search today, we
humans rely more and more on computers to retrieve relevant information for us. The
contraposition and need for cooperation between computers and humans are what make
the notion of relevance hard both to define and to investigate. IR systems have a notion
of relevance embedded in their algorithms and retrieve and offer what may be relevant
according to this notion. People, on the other hand, then go and assess relevance on their
own. However, each version of the notion may have a different take on what relevance is
(Saracevic 2006).
The concept of relevance in IR is not new as it is tightly tied to that of the evaluation
of IR systems. Good systems are so because they return relevant information. The
discussion on relevance has evolved over time and several interpretations and variations
on the concept of relevance have emerged. Each variation has placed emphasis on different
aspects of this notion. Amongst others we find the notion of Logical Relevance (Cooper
1971), Situational Relevance (Wilson 1973), Objective and Subjective Relevance (Swanson
1986b) and Psychological Relevance (Harter 1992). Each interpretation brings its own set
of assumptions and rough edges and this only highlights the difficulty of the debate.
document and an information need as judged by a person (Saracevic 1975). And even
This section offers a brief account of the most pertinent strands of research into the concept of relevance.
The reader is, however, encouraged to read the review by Mizzaro (1998) and those by Saracevic (1975,
2006, 2007) for a fuller account of the research efforts in this area.
this simple interpretation comes with a variety of difficulties. The idea of information
(Belkin, Oddy & Brooks 1982). According to Belkin et al. (1982), this non-specifiability
of need can be exhibited at two levels. Firstly, we have the cognitive level. At this
level, the range of non-specifiability can vary between two extremes. At one end of the
spectrum, we have people who know exactly what they need to solve their information
needs. At the other end, we have people who are able to recognise they have the need
for information but cannot express this need, or can only do so vaguely. Secondly, in
Belkin’s hypothesis, we have the level of linguistics. To begin the interaction with the
The difficulty of linguistic non-specifiability resides in that people may not know how to
best use the language to write queries to the system. This is aggravated as documents
are usually created independently from requests and persons. Hence people are not aware
of the underlying language included in the database and the documents. Finally, the
person, when judging, brings with him all his background, knowledge, current state of
judgements are the verdicts that requesters emit on the information retrieved for their
queries. Swanson (1977) proposed two frames of reference for relevance judgements:
While the first interpretation allows for more room for subjectivity from the requesters,
the second one is more specific, to the point where judgements might be collected
from third-party judges who might not necessarily be the original requesters. Two
major studies on relevance judgements were conducted (Rees & Schultz 1967, Cuadra &
Katter 1967). These studies suggested that there are about 40 variables, e.g. specificity
and difficulty of documents, that might affect relevance judgements; however, not all variables
were analysed. Even though relevance, in both projects, was judged by third-party
judges under experimental conditions, it was still suggested that the reliability of human
relevance judgement is questionable. That is to say that the desired stability and objec-
tivity that researchers were after might not be achievable by using professional specialist
By the late 1970s, the cognitive point of view in information sciences had gained
enough impetus to influence most empirical studies, causing a shift from a system-centric
view, in which relevance can be judged statically and objectively, to a more user-centred
view (Ingwersen 1988). The shift to a more user-centred approach made it possible for
researchers to find, for instance, that psychological factors, such as cognitive states or
affective feelings, influence search behaviour (Kuhlthau 2004). Additionally, the nature
et al. (1990) maintain that despite this dynamic and multidimensional nature of relevance,
it is both systematic and measurable, hence its study can be done in a controlled and
repeatable manner. This led researchers to argue that relevance studies should observe real
users in natural settings holistically rather than study recruited judges under experimental
conditions. Effectively, Schamber (1994) advocates that relevance studies should focus on
During the last decades, several studies focused on observing the relevance criteria
as used by various real end-users in real-world situations (Schamber 1991, Barry 1993,
Barry 1994, Cool, Belkin, Frieder & Kantor 1993). In these studies, qualitative methods
were applied to collect data. Users were not given a predefined set of criteria to perform
relevance judgments; on the contrary, criteria were derived directly from content analysis
of verbal or written reports. This resulted in user-defined relevance criteria; criteria that
are commonly used by real end-users during relevance judgement. During the analysis of
the data from these studies, it was observed that most decisions were based on additional
variables in document surrogates such as title, author, journal, and descriptor. After an
in-depth analysis of the concept of topicality, Green (1995) points out that topicality is
2.5. Evaluation of IR Systems 26
only part of the subject contents of a document. One of the suggestions stemming from
these studies is that topicality alone is not enough to make decisions and that other factors
affect relevance judgments (Green 1995). A key difference between these studies and those
by Rees & Schultz (1967) and Cuadra & Katter (1967) is that the variables found to affect
relevance judgements were derived from real end-users on real-world tasks (as opposed to
In the context of LBD, the notion of relevance might be even more complex. For one,
there are additional moving pieces (as compared to the simple view of request, document,
person) such as the start, intermediate and target topics. Additionally, the type of task
to be solved is that of knowledge discovery which might be, in principle, quite heavy in
terms of cognitive effort. The discussion on relevance in the context of LBD has been
brief and mostly in the form of examples, e.g. (Swanson 1986c). However, the general
consensus seems to be that topicality alone is not enough to derive relevance in this
context (Swanson 1991). Effectively, the task of an LBD system is to first distill the topics
that might compose a fruitful combination and second retrieve the pertinent literature.
Pertinent, in this sense, is not to be taken as being on topic since there may be several
topics in play in any given combination, but rather supportive, amongst others, in the
Traditionally, most of the research in IR has been carried out following the principles of
the system-driven tradition. The principles behind this tradition serve to design, and con-
sequently test, better algorithms and systems. The evaluation of newly crafted algorithms
follows the Cranfield tradition (Cleverdon et al. 1966); a tradition that is heavily based
on the concept of test collections (Harman 1997). Test collections consist of:
1. A collection of documents,
search this new database, requests have to be transformed into queries: a representation
that is suitable for matching against the contents of the database. The algorithm can,
then, run the queries against the database. This procedure results in a ranked list whereby
documents are ranked according to their likelihood of relevance (Belkin & Croft 1987).
Because users are not involved in the evaluation of systems, this tradition is commonly referred to
as “laboratory IR”, emphasising that the evaluation of systems takes place in
laboratory conditions (as opposed to real conditions) (Ingwersen & Järvelin 2007).
The quality of the results of any one algorithm is measured by assessing its ability to
retrieve relevant documents. To measure this, relevance judgements are used as “ground
truth”. Relevance judgements link (relevant) documents to requests, hence the assessment
of the products of the algorithms is possible. During evaluation, the recall and precision
of a system are usually measured and averaged over requests (Cleverdon et al. 1966).
These two metrics measure how many of all the known relevant documents have been
retrieved (recall) and how many of the documents included in the ranked list are
relevant (precision). Because requests have been typically of a topical nature, and relevance
judgements were made by experts in the topics, this type of relevance is usually referred
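As an illustration of the two metrics, the following minimal sketch computes set-based precision and recall for one request and then averages them over several requests. The function names and the toy identifiers are hypothetical; ranked-list variants (for example, precision at a cut-off) are not covered here.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for a single request.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = retrieved_set & relevant_set
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


def averaged_over_requests(runs):
    """Average precision and recall over (retrieved, relevant) pairs."""
    if not runs:
        raise ValueError("at least one request is required")
    scores = [precision_recall(retrieved, relevant) for retrieved, relevant in runs]
    mean_precision = sum(p for p, _ in scores) / len(scores)
    mean_recall = sum(r for _, r in scores) / len(scores)
    return mean_precision, mean_recall


# Toy, invented judgements for two requests.
toy_runs = [
    (["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"]),
    (["d2"], ["d2"]),
]
first_precision, first_recall = precision_recall(*toy_runs[0])
mean_precision, mean_recall = averaged_over_requests(toy_runs)
```

For the first toy request, two of the four retrieved items are relevant and two of the three relevant items are retrieved, so precision is lower than recall.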
This model of evaluation fits the underlying structure of all of the approaches used to
evaluate the performance of LBD systems described in the previous section: Swanson’s
original discoveries make up the collection of relevance judgements (the B- and C-topics in the
case of the open model and only the B-topics in the case of the closed model), a subset of
Medline the collection of documents and Swanson’s reported starting and target topics (the
A- and C-topics) the collection of requests. However, these are not standard across research
efforts. Each researcher who attempts to replicate any of Swanson’s discoveries chooses
use (collection of documents) and which topics to include (collection of requests). When
it comes to reporting performance, only one group of researchers reports standard
metrics such as Precision and Recall. Most researchers report subjectively whether the
golden topics have been detected within the products of their system and the ranks of
these topics. Additionally, the judgement as to whether the system has produced a true
positive (how topics produced by a system are matched to those reported by Swanson) is
subjective, which makes the comparison of approaches very difficult. Further assumptions
make this model inappropriate for evaluating LBD systems. As discussed
earlier, LBD searches are inherently interactive and most researchers seem to agree that
human intervention is required. However, all efforts either disregard the end-user by
assuming that a researcher should be able to identify the salient topics or include humans
While the system-driven tradition of evaluating systems constitutes most of the literature
of IR research, and accommodates the current approaches to evaluating LBD systems,
the type of relevance assessment employed has been met with criticism, in particular, when
Contrary to the laboratory IR tradition, in which the user is seldom involved, research
in user-centred IR focuses on the behavioural and psychological aspects that affect how
one, research in the user-centred IR tradition sees it as a problem solving, and goal-oriented
interactive task. It is by analysing and understanding these aspects that researchers aim
research themes investigated include, for instance, the nature (Taylor 1967) and types of
information needs that can be identified (Ingwersen 1996) as well as user defined relevance
(see e.g. (Borlund 2003a, Barry & Schamber 1998, Saracevic 1996, Schamber et al. 1990)).
Initially, a user-centred approach might seem like an appropriate one to evaluate LBD
systems. However, this would mean completely disregarding the underlying systems and
instead focusing on factors such as the perceptive power of different researchers, the ap-
propriateness of workflows for particular discovery tasks, etc. Hence, applying such an
approach would provide a partial picture of the performance of an LBD system, if at all.
This stems from the fact that the community involved in user-centred IR research does not
seem to concern itself with the details of the IR system with which the users under exami-
considers systems as being constants and rarely linked to human beings (Ingwersen 1992).
This view seems to be, in principle, as limited as that of the system-driven community
There are examples of research that attempt to integrate both viewpoints, however an
approach two properties must be present in any evaluation framework of this nature:
2. Realism: while control must be retained, the evaluation has to take place in a real-
Borlund (2000) who state that the user-centred approach is ideal for evaluating interactive
IR systems, except for the lack of control and the expense incurred during
experiments (involving real users with real and evolving information needs). Motivated by
the demand of evaluation methods that take into account both the system-driven and the
interactive IR (IIR). The aim of the framework is twofold: i) to allow for the controlled
information needs and relevance judgements and ii) to include in measures of system per-
is a reaction to the demands stemming from what has been termed as the three revolutions
1. The relevance revolution: the fact that a request is not the same as an information
need and that relevance should, therefore, be judged in relation to this need and not
and multidimensional.
3. The interactive revolution: the fact that systems are interactive by nature (or are
becoming more interactive) and that the system-driven approach to evaluation does
Borlund & Ingwersen (2000) acknowledge these revolutions and propose three exper-
Relevance is then judged in relation to the user’s information needs and the situation in
which the needs arise. Further, the relevance of the information presented is also judged in
an interactive and non-binary way, hence the concept of relevance is such that relevance is
treated as being multidimensional and dynamic. By allowing end users to interact with the
system the dynamic and multidimensional nature of information needs, and hence both
the cognitive and interactive revolutions, are considered in the evaluation framework.
the experimental settings. This realism is reinforced, and control is achieved, by the
use of simulated work task situations (Borlund & Ingwersen 1997). Simulated work task
situations are a semantically open description of the context of a given work situation,
i.e. they are a cover-story that provides context and describes a situation in which IR
is needed. The simulated work task situation triggers the information needs in users.
Realism is provided as simulated work situations lead users into a cognitive state which
creates information needs; needs that need to be satisfied before the user can move on.
2.6. Evaluation of LBD Systems 31
The experimental control is provided by the fact that simulated work situations remain
the same across users (and possibly systems). Relevance judgements are made in relation
to the information needs triggered by the simulated work situation, hence they provide
The approach proposed by Borlund & Ingwersen (2000) may be appropriate for conducting
experiments in LBD, whether they are to test the performance of a given system or
to explore some of the yet unexplored cognitive aspects involved in LBD. As the end-users
of the system under test are involved in the evaluation loop the experimenter has then the
option of gathering cognitive data. In the case of LBD, this means inviting researchers
to take part in the proposed experiment and have them use the system under evaluation.
Only then does it become possible to effectively observe whether researchers are able to detect
salient intermediate or target topics for further evaluation. Furthermore, which topics
and why they are deemed salient becomes observable. This opens up the possibility of
breaking free from having to replicate Swanson’s discoveries and moving to a more general
evaluation approach where discoveries are those that the researchers involved deem as such. By
keeping the tasks stable across end-users (researchers) and potentially across systems, one
attains control. Effectively, this provides anchors against which relevance judgements are
made. These anchors are the artefacts that make it possible to compare results across
Evaluating knowledge discovery systems is a complex task. One of the main obstacles is
that if systems are successful they are, by definition, capturing new unproven knowledge
(Pratt & Yetisgen-Yildiz 2003). Moreover, much of the success of a system may actually
depend on the expertise and interpretation of the operator. A system will perform well
as long as the operator is able to interpret the results and draw inferences from them.
Swanson’s initial discoveries –fish oil-Raynaud’s disease (Swanson 1986a) and migraine-
researchers (Gordon & Lindsay 1996). Finding supporting evidence by conducting clinical
trials (in the case of the medical domain) is one form of discovery validation. Effectively,
this is the approach that Wren et al. (2004) followed. However, this approach may not
only be very expensive but may also be infeasible in some cases.
tial discoveries, while some either did not evaluate their systems or used human experts
in an ad-hoc fashion for this task. Those that attempted to replicate Swanson’s initial
discoveries reported to have been successful at doing so. To measure the correctness of
the results, most researchers resorted to measuring accuracy. The top suggested topics
were usually inspected and compared to those reported by Swanson albeit in a rather sub-
jective fashion. Researchers usually informally measured recall (Baeza-Yates & Ribeiro-
Neto 1999) as they measured how many of the links reported by Swanson their systems
had predicted (at any position in the ranking). However, measuring how well the
system replicates these discoveries evaluates only its predictive power (albeit in
a limited fashion). There are many factors in the performance of a system one could
evaluate, and the ones one chooses to evaluate will depend on the ultimate goal of the
system. For systems where the goal is to predict, as precisely as possible, the potential
Systems aimed at helping researchers make discoveries are more complex to evaluate
as they are likely to be inherently interactive. Their predictive power must be
evaluated (as a minimum requirement, these systems should exhibit a level of predictive
power), but so must other factors such as their interface, usability and speed.
Two studies either evaluated or observed other factors such as interface usability
Supporting evidence for Swanson’s fish oil-Raynaud’s discovery (the most replicated
throughout the literature) has been provided; however, there is no indication that the
original links found are the only ones connecting the literatures. As newly crafted
algorithms suggest potential links, there is no clear approach to measuring how “good” they
are, not only on their own merit but also in comparison to Swanson’s discoveries.
Indeed, Weeber et al. (2001) report that even though their system did not rank
the original fish oil related intermediate topics highly, other fish oil related topics were, and that
researchers should be able to infer the link based on those as well. The interpretation of
these suggested links poses an additional problem. Most researchers assume, or suggest,
that a researcher should be able to detect salient topics. However, this assumption is
made regardless of the researcher’s background experience, ability to use search engines,
and information needs. It is also worth mentioning that while finding the hidden link
between the active ingredients in fish oil and Raynaud’s disease led Swanson to propose
the hypothesis that the use of dietary fish oil might ameliorate the symptoms of Raynaud’s
disease, it also led Swanson to propose that there may be more hidden links awaiting
discovery and to develop LBD. Hence, the interpretation of these suggested links is a
factor to consider even for those systems in which the ultimate goal is to discover hidden
tion were observed, two of which were the visualisation of links between two topics (even
if known)8 and the browsing of literature detached from the searcher’s research field (lit-
erature evaluation in a new context). Hence search tasks and outcomes, other than the
i.e. the topics A, B and C are deemed relevant precisely because “A affects B” and
“B affects C”. However, relevance in these unexpected uses, and others, may be harder to define
and assess.
2.7 Summary
Co-occurrences are measured on the basic building blocks chosen by the researcher, e.g.
A use of LBD that had been previously suggested as a potential use by Gordon et al. (2002).
n-grams (Lindsay & Gordon 1999). Filtering is then usually applied on the statistics of
these building blocks. Infrequent items are filtered out on the assumption that an item that is
too infrequent is likely to be noise. Items that are too frequent are also filtered out as they
are considered to be either noise or too common to be of interest. Lastly, the remaining
items are ranked before being presented to users. This ranking step varies according to
both the researchers’ idea of what characteristics of an item make it potentially more
Evaluation approaches are varied, however they all can be modelled using the system-
this evaluation as automatically as possible and using the two original discoveries made
by Swanson (Swanson 1986a, Swanson 1988b) as ground truth. Researchers then evaluate
the performance of their systems by observing how many of the intermediate (or target)
topics from these two discoveries are suggested by their system. Additionally, the ranks of
• The matching between the concepts produced by the systems and those reported by
• The ranks of these concepts are also evaluated subjectively as to whether they are
“reasonable” or not.
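
The two checks listed above can be sketched in a few lines. This is only a minimal illustration and is not taken from any of the reviewed systems: the function name `evaluate_against_gold`, the ranked list and the gold topics are all assumptions made for the example. It counts how many gold topics appear anywhere in a ranked list (an informal recall) and records the rank at which each one was found, which is the information a researcher would then inspect subjectively.

```python
def evaluate_against_gold(ranked_topics, gold_topics):
    """Return (recall, ranks) for a ranked list against a set of gold topics.

    recall: fraction of gold topics appearing anywhere in the ranking.
    ranks:  1-based rank of each gold topic found (missing ones are omitted).
    """
    position = {}
    for rank, topic in enumerate(ranked_topics, start=1):
        position.setdefault(topic, rank)  # keep the best (first) rank
    ranks = {t: position[t] for t in gold_topics if t in position}
    recall = len(ranks) / len(gold_topics) if gold_topics else 0.0
    return recall, ranks


# Hypothetical system output and gold intermediate topics (illustrative only).
ranked = ["platelet aggregation", "vasodilation", "blood viscosity", "calcium"]
gold = {"blood viscosity", "platelet aggregation", "vascular reactivity"}
recall, ranks = evaluate_against_gold(ranked, gold)
```

Note that such a measure says nothing about whether an end-user would recognise the retrieved topics as salient, which is the limitation discussed in the remainder of this section.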
Observing how well a system has reproduced Swanson’s discoveries poses some initial
drawbacks. The dataset is composed of only two examples and is therefore limited in size.
Not only may small datasets prove ineffective in training models (Halevy, Norvig &
Pereira 2009), but when it comes to evaluating systems, reproducing a reduced set of golden
outcomes limits the conclusions that can be reached on the effectiveness of a technique.
Additionally, over-tailoring a system to reproduce Swanson’s findings may limit its
generalisation power. An interesting approach to overcoming the data size limitation imposed
by reproducing Swanson’s discoveries is proposed by Van Der Eijk et al. (2004), whereby
the golden dataset is crafted by following back the references of a set of articles detailing
the testing of a hypothesis. It is not entirely clear, however, how the dataset is to be
constructed since starting and target topics should be defined as well as the intermediate
that while not fully automated (researchers manually inspect the ranked lists of topics
and take different measurements) the final end-users are never involved. Swanson (1991)
argues that fully automating the discovery procedure is complex at best. The main issue
is that neither the syntactic nor the semantic structures provide for the deductive chain
of reasoning that the author followed. Additionally, Swanson (1991) suggests that
of literature searching. Human experts, in this context, are to provide the background
knowledge needed to interpret the statistical cues offered by the system as extracted from
the literatures. This is contrary to the assumptions made during nearly all evaluation
approaches that users should be able to make the discoveries once they have been presented
is here where different factors, for instance the users’ background knowledge, come into
suggested that relevance is present in all steps of the process. One must be careful though
as to what one means by relevance since the word is heavily overloaded with meaning and
subjective and has been done by the researcher on behalf of the end user of a system.
by a system, the saliency of a particular topic (or combination thereof) and the utility of
The first decision is made during the text modelling phase of all systems, where concepts
are extracted from natural language documents: what is an appropriate
representation of a concept extracted from text? The basic building blocks, such as
n-grams, are decided beforehand by the researcher since the system needs them to perform
the desired processing and relationship-finding. This is a reasonable and necessary
decision from an algorithmic point of view. However, it becomes less clear how reasonable
each representation is when it is used during evaluation. For instance, to
evaluate their system, Weeber et al. (2001) try to replicate both of Swanson’s original
discoveries (Swanson 1986a, Swanson 1988b). The authors suggest that, while the original
fish oil concepts were not ranked highly, other concepts related to fish oil were, and
that researchers should still be able to recognise fish oil as a target concept from them.
The main assumption behind this suggestion is that the presented combination should
interpretation has been made on behalf of the end users and this is where the process
observe and count how many of the original intermediate topics have been found, ranked
highly and hence proposed for further inspection by the system under evaluation. Firstly,
it is assumed that the representation of the combination, be it n-grams or any other more
complex structure, conveys enough information for it to be salient, i.e. the combination is
readily interpretable without ambiguity. Secondly, it is assumed that the combination not
only contains enough information but that the information present is of the right kind.
For instance that the combination is easily mapped to a higher-level representation that
corresponds to one of the original discoveries, e.g. “omega-3” is suggested by the system
“fish-oil”. Finally, the sole presence of a golden combination is taken as a positive sign of
performance and it is inferred that if researchers were presented with it they should be
Additional assumptions are made on the context in which the discovery task is to be
stances such as personal background and research experience, should be able to derive
meaning from a golden combination once it has been found. This is to say that the leap
from a golden combination to a meaningful discovery is granted once the right combina-
tion has been selected. Additionally, it is implied in this assumption that there is a single
discovery to be made after a particular golden combination. Essentially, because the re-
searchers know about Swanson’s findings, they know what to look for, so when found they
We argue that there are too many areas where the subjective judgement of the
end-user plays a crucial role and that these should be studied in more detail. However, it is
not entirely clear how to proceed. On the one hand, one must retain control over how
such a study is performed. Control is necessary so that the results are comparable and
not only performance data is obtained, but also cognitive data regarding the interaction
and information seeking processes can be gathered. The method relies on the concept of
simulated work task situations for experimental control and involves the use of potential
end users as test persons to accommodate for realism. Simulated work task situations
provide context and describe a situation which leads users into a cognitive state in which
information needs arise and have to be satisfied before users can move on. Additionally,
it was suggested that these properties of the framework make it appealing to conduct
the experiments in LBD, whether they are designed to evaluate systems or explore the
In Chapter 3 the reader is offered a description of the design of a user study tailored
towards finding an answer to the three main research questions of this dissertation. This
study has been designed keeping the components of the framework proposed by Borlund
1. Real end-users of the system are included: researchers were invited to take part in
2. Real and dynamic information needs are applied: researchers were provided with a
task that only provided a context in which they should formulate and execute their
searches, and
the literature presented in any way they wished to and as many times as they deemed
As discussed in previous chapters, relevance is not only dynamic but subjective. As such
it is very hard to come up with an objective and universal measure of relevance. Moreover,
in the context of LBD, relevance might be especially complex as it is derived from the
relevance and the reasons why it is derived in this context an observational study was
conducted between the months of January and August of 2008. The study is described in
the following sections. The purpose of this study was to observe the relevance criteria, as
defined by Barry & Schamber (1998), used by participants when assessing the relevance
of related literature. During the study data was gathered using a combination of feedback
Harter (1992) suggests that only weak relevance, or hope for relevance, can be derived
from reading surrogates of documents. This means that once a user is presented with
surrogates of documents, as retrieved and generated by an IR system, the user can only
hope that the document related to the presented surrogate is relevant. It is in this sense
that relevance becomes weak (as opposed to full or strong relevance). In the context of
LBD, the implication is that end-users of an LBD system would only derive full relevance
of a suggested combination of topics after having read the documents that supported it.
The focus of the study was on the closed model of search as it is at this stage where
section 2.2.2, the closed model of search is aimed at aiding the user search for literature
The original discoveries using LBD were restricted to the scientific community, in
particular the medical scientific community. Even though attempts have been made to
extrapolate the mechanisms and search patterns outside this community, e.g. the work
done by Cory (1997) and by Gordon et al. (2002), much of the literature on LBD is in
this domain. For this study, however, the scope was broadened by inviting participants
from three different communities: the computer science community, represented by the
School of Pharmacy. All these schools are part of the Robert Gordon University, located
in Aberdeen, Scotland.
This chapter is structured as follows. In section 3.1 a description of the study and the
two sessions that compose it is offered. The systems used in the study are described in
section 3.2. Section 3.3 contains a description of the population that participated in the
offered. The collections used in the study and their particulars are discussed in section
3.4 while the search tasks and measurements taken are discussed in sections 3.5 and 3.6
respectively. Finally, in section 3.7, a description of the different types of data analysis
3.1 Method
The study consisted of two sessions with a time gap between them of no more than a
week. This time gap was necessary as the system used by the participants during the
study processed the results from the first session offline, and this process took, on average,
between 5 and 6 hours. The length of the time gap was chosen so that results from the first
session would still be present in the participants’ minds when doing the second session.
The results of the offline processing, topics and potential relationships between them, were
Participants were asked, before beginning the session, to read, agree to and sign a
confidentiality agreement. The agreement stated that their data would be anonymised and
that no personal references of any kind would be made in the write-up of the study.
Participants were assured that their data would remain secure and appropriately stored.
Once the agreement was signed, participants were asked to fill in a form with information
about their background such as research experience and confidence in using search engines.
The form can be seen in appendix A. Instructions on how to operate the provided system
were delivered next. The system used in this session is described in section 3.2.1. Once
comfortable with the system participants were given the search task. The task consisted
required that the participant searched for and found five documents that described or
represented the participant’s area of research. The search task given to the participants
The goal of the first session was to gather data from the participants to initiate an
automated open search. This initial open search is needed as only two paths can be
2. The user has formed a relation in his mind after looking at the suggestions resulting
The documents selected during the search session were interpreted as a representa-
tion of the participant’s area of research and used to seed the automated open search as
described in section 3.2.2. There was no time limit imposed on this session.
Representing an area of research might be a practically impossible task. Yet for this study
participants needed to provide such a representation in a form that either
could be a set of words. This approach would be one which participants could be familiar
with. Hence, one could have asked them to describe their area of research using a few
words and then work with them. This would have provided, however, a limited view of
their areas of research. This limitation is better worded in one of Swanson’s works in
which several postulates of impotence were put forward. It is the first postulate that
states that “an information need cannot be fully expressed as a search request that is
includes, amongst others, the participant’s background knowledge and the database being
• There are documents in the database that match the keywords and that
• The matched documents, if any, provide a good representation of their area of re-
Should any of these not be satisfied, participants would be left with an inadequate
representation of their field of work; one that is lacking in information and generally
incomplete or even erroneous (in respect to the contents of the database). Accepting the
veracity of this postulate also rules out providing an interface that integrates the two
search models of LBD. Moreover, the computationally expensive nature of the algorithm
As an alternative approach participants were asked to search for and provide documents
a description of the participant’s research area participants were freed from the burden
of having to come up with terms that would, to a certain extent, describe their research.
Following this approach meant that topics had to be automatically extracted from the
documents. This also allowed for unsuspected but related vocabulary to creep into the
search and relationship discovery. Incidentally, Swanson’s original and current procedure
works as described. Users of Swanson’s system are asked to search PubMed for the source
and destination literatures and then provide these to Swanson’s system to work with1 .
Visit the Arrowsmith website (
start.cgi) for more information.
Dear participant,
I’d like to ask you to search for documents that describe your
area of research, or an aspect of it. Whenever you think you have
found one, please write down the document ID (located at the
top of the viewing window) on the provided sheet. The purpose of
this search is so that the system under test can then suggest
topics that might be related to your area of research for you to
further investigate.
Figure 3.1: Search task introduced during the first meeting with participants
During the offline processing the documents found during the first session were used as
the starting topic for an automated open search. The automated open search is described
in section 3.2.2 and it results in a network of topics in which the entry topic (see Figure
3.6) represents the participant’s area of research. This automated construction of the
to investigate them all would render the study pointless. Instead the system ranks the
final topics Cij according to a simple algorithm (see 3.2) and presents the participants
with the top 10 topics. During the second session participants were asked to investigate
the potential relationships between their area of research and 3 out of these 10 topics.
At the beginning of their second session participants were given training on two aspects of
the session: the system and the talk-aloud protocol. Once the instructions were delivered,
participants were allowed to practise both using the system (on example but realistic data)
and talking-aloud. This practice session had a time limit of 15 minutes. At the end of
Unlike the first session, in which all participants were given the same search task,
participants were given a search task that corresponded to their expressed level of research
the beginning of the first session. An example search task can be seen in Figure 3.2. To
Dear participant,
Figure 3.2: Example search task introduced during the second meeting with participants
3.2. The System 45
complete the search task participants had an hour. During this hour they were asked to
investigate exactly three of the ten potential relations presented. Participants were also
asked to write down the document identifier whenever they thought they had found a
document that complied with the request. The form used during this second
At the end of the session participants were asked to fill in a questionnaire which can
be seen in Appendix A.7. Information gathered referred to, amongst others, the quality
of the results obtained during the session as well as general comments about the whole
In this section the details of the systems participants used in each session are described.
Firstly the system used during the first session is described. In this session participants
used a simple keyword driven search engine. This engine offered a search box and returned
a ranked list of document surrogates. Each returned document surrogate had a hyperlink
to access the full document. Depending on the affiliation of the participant, the system
searched the appropriate database. Secondly the offline processing of the documents re-
trieved during the first session is described. During this offline processing topics were
& Jordan 2003). These topics were used as queries to retrieve a new set of documents
from the database. An extra set of topics was automatically extracted from this new
set of documents. The process finishes by suggesting topics which are potentially related
to the participant’s area of research. Lastly the system used during the second session
is described. This system presents and makes use of the topics discovered during the of-
fline processing. Participants used the interface for navigating topics and retrieving the
Figure 3.3: User interface of the system participants used during their first session
The search facilities provided during the first session are rather rudimentary. A screenshot
of the user interface is presented in Figure 3.3. In the interface, after searching for “infor-
mation” it can be seen that the query matched 4752 documents. The document surrogates
of the initial 10 documents retrieved are listed below the horizontal line. The document
surrogates were built as follows. The text in the hyperlink (in blue in Figure 3.3) is the
document identifier — the internal code used when indexing the document. Initially this
might seem like a poor choice for the text in the hyperlink as, for instance, the document
title could have been used. Using document titles, however, was not feasible. As
explained in section 3.4, the original documents were in the Portable Document Format
(PDF). This format is particularly problematic when it comes to parsing and extracting
particular portions of text such as titles. Given the different layouts of the documents,
extracting titles was simply not possible. The document snippet below the hyperlink con-
sists of the first 5 sentences of the document (a sentence was taken to be any character
Clicking on a hyperlink would bring up a new window with the full contents of the
document. An example document can be seen in Figure 3.4. On the top left corner the
Figure 3.4: An example document. On the top left the document identifier is displayed.
document identifier is displayed again so that users could quickly identify it should they
The engine behind the interface was implemented using the Kullback-Leibler divergence
retrieval model (Zhai & Lafferty 2004)2 . In this retrieval model documents and
queries are represented as statistical language models (Ponte & Croft 1998). A statisti-
cal language model is defined as a probability distribution over the vocabulary (the set
of unique words in the collection of documents). The score of a document, then, can
language model of the query and that of the document (Lafferty & Zhai 2001). For this
system, the model was used out of the box, i.e. no parameters were tuned. Potentially,
this could have impacted on the length of the first session and participants might have
taken longer than they would have had the parameters of the engine been tuned.
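
To make the retrieval model more concrete, the following is a minimal sketch of ranking by negative Kullback-Leibler divergence between a query language model and a smoothed document language model. It is not the engine used in the study: the function names `language_model` and `kl_score`, the toy documents, the use of Dirichlet smoothing and the value of `mu` are all assumptions made for illustration, since the smoothing method and parameters of the actual engine are not detailed here.

```python
import math
from collections import Counter


def language_model(tokens):
    """Maximum-likelihood unigram model: word -> probability."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def kl_score(query_tokens, doc_tokens, collection_model, mu=2000.0):
    """Negative KL divergence KL(query || document); higher is better.

    The document model is Dirichlet-smoothed with the collection model.
    Both the smoothing method and mu are assumptions of this sketch.
    """
    query_model = language_model(query_tokens)
    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    score = 0.0
    for w, p_q in query_model.items():
        p_c = collection_model.get(w, 0.0)
        p_d = (doc_counts[w] + mu * p_c) / (doc_len + mu)
        if p_d == 0.0:
            return float("-inf")  # word unseen in the whole collection
        score -= p_q * math.log(p_q / p_d)
    return score


# Toy collection (illustrative only).
docs = {
    "d1": "fish oil reduces blood viscosity".split(),
    "d2": "language models for information retrieval".split(),
}
collection = language_model([w for d in docs.values() for w in d])
query = "information retrieval".split()
ranking = sorted(docs, key=lambda d: kl_score(query, docs[d], collection),
                 reverse=True)
```

Since only the query words contribute to the sum, the score depends on how much probability the document model assigns to them; tuning a parameter such as `mu` changes how strongly the collection model dominates short documents.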
The initial step in the process is to automatically extract the topics contained in the
documents selected during the first session. This is achieved by modelling the documents
Figure 3.5: Visual representation of the open search algorithm. The first step is to model
the topics contained in the user submitted documents and retrieve more documents about
them (step a). The second step is to model the topics contained in the pooled documents
and retrieve even more documents (step b). The final step is to model the topics contained
in each document set retrieved in the previous step.
using a technique called Latent Dirichlet Allocation (Blei et al. 2003). In LDA each
over words. We refer to the initial topics as topics tA . These topics are each used as
a query to retrieve more documents. To build a query representing each topic the top
three words, according to their probability of being generated by the topic P (w|tA ), are
taken3 . These queries are issued to the engine of the system used in session one and the
top 50 documents are retrieved. This is depicted as step a in Figure 3.5. The assumption
behind this first step is that topics discussed within a document share a relationship.
Retrieving more documents using the top terms for the central topic should then increase
All documents retrieved in step a are pooled together. Topics are then automatically
extracted again using LDA. This is effectively extracting the first layer of related topics
in the open model search (referred to as B topics in the LBD literature). The process of
building queries from topics and retrieving documents is repeated; however, this time the
retrieved documents are not pooled together. This is so that each individual B topic could be
This is a rather ad-hoc procedure. A more principled approach would have been to take the actual
topic (a probability distribution over words, i.e. a language model) and use it directly as the query model,
as the engine was implemented using the Kullback-Leibler divergence model. This was not implemented
for this study as the library used to build the system did not provide access to the low level language
modelling framework and hence a custom language model could not be used to query the engine.
linked to their corresponding set of C-topics. This is depicted as step b in Figure 3.5.
performed. These are the final tC topics which are potentially related to the tA topics
through one or more of their directly related tB topics (step c in Figure 3.5).
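
The overall shape of steps a to c can be sketched as below. This is an assumption-laden outline rather than the implementation used in the study: `open_search`, `top_words`, `toy_extract_topics`, `toy_search` and `CORPUS` are hypothetical names, the topic extractor is a simple word-frequency stand-in (not LDA), and the search function is a naive keyword match (not the Kullback-Leibler engine). Only the structure follows the description above: queries are built from the three most probable words of each topic, documents are pooled in step a, and they are kept separate per B topic in steps b and c.

```python
from collections import Counter


def top_words(topic, k=3):
    """Query for a topic: its k most probable words (ties broken by word)."""
    ordered = sorted(topic.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ordered[:k]]


def open_search(seed_docs, extract_topics, search, n_docs=50):
    """Return (A topics, [(B topic, [C topics]), ...]).

    extract_topics(docs) -> list of topics, each a dict word -> probability
    search(query_words, n) -> list of at most n documents
    Both callables are placeholders for the topic model and search engine.
    """
    a_topics = extract_topics(seed_docs)
    # Step a: query with each A topic and pool everything retrieved.
    pooled = []
    for t_a in a_topics:
        pooled.extend(search(top_words(t_a), n_docs))
    b_topics = extract_topics(pooled)
    # Steps b and c: retrieve per B topic (no pooling), then model each set.
    tree = []
    for t_b in b_topics:
        docs_b = search(top_words(t_b), n_docs)
        tree.append((t_b, extract_topics(docs_b)))
    return a_topics, tree


# Toy stand-ins so the sketch runs end to end (illustrative only).
CORPUS = [
    "fish oil lowers blood viscosity",
    "blood viscosity affects raynaud disease",
    "raynaud disease symptoms",
]


def toy_extract_topics(docs):
    """Stand-in for LDA: a single 'topic' of relative word frequencies."""
    counts = Counter(w for d in docs for w in d.split())
    total = sum(counts.values())
    return [{w: c / total for w, c in counts.items()}] if total else []


def toy_search(query_words, n):
    """Stand-in for the engine: documents containing any query word."""
    hits = [d for d in CORPUS if any(w in d.split() for w in query_words)]
    return hits[:n]


a_topics, tree = open_search(["fish oil"], toy_extract_topics, toy_search)
```

Passing the topic extractor and the search function as arguments keeps the two-level expansion independent of any particular topic model or retrieval engine.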
The result of this process can be represented as a tree. The root node of the tree is
the starting topic as represented by the documents selected in the first session. The inner
nodes of the tree are the immediately related topics tB and the leaves are the potentially
—indirectly— related topics tC . This tree is depicted in Figure 3.6. Each edge represents
Figure 3.6: Topic tree. The A node is the participant’s initial topic. The inner nodes, the
Bi nodes, are the immediately related topics as described by both the literature and the
topic extraction algorithm. The tree leaves are the indirectly related, to the initial topic
A, Cij topics.
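
As an illustration of this structure, the sketch below stores a depth-two topic tree and enumerates every A, Bi, Cij path. The class `TopicNode`, the function `paths` and the node labels are hypothetical and chosen only for the example; in the study each node would carry a topic rather than a plain label.

```python
from dataclasses import dataclass, field


@dataclass
class TopicNode:
    label: str  # e.g. the top three words of the topic
    children: list = field(default_factory=list)


def paths(root):
    """Yield every (A, B_i, C_ij) label triple in a depth-two topic tree."""
    for b in root.children:
        for c in b.children:
            yield (root.label, b.label, c.label)


tree = TopicNode("A", [
    TopicNode("B1", [TopicNode("C11"), TopicNode("C12")]),
    TopicNode("B2", [TopicNode("C21")]),
])
triples = list(paths(tree))
```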
The system used during session two consists of two parts: the entry screen and the
navigation screen. The purpose of the entry screen is to provide an overview of the
potentially related topics. On the left panel a representation of the initial topic, the
participant’s area of research, is provided. This representation is actually the query words
used to retrieve more documents during step a of the offline processing. It is a set of
keywords. On the right panel the 10 potentially related topics are listed. These topics
are also represented as a set of keywords. These keywords are the three most important
words for each potentially related topic. The number of terms selected for each C topic is
Figure 3.7: First screen users see when doing the second part of the study. In the image
the terms representing the initial topic are presented on the left panel whereas the initial
C topics, also in the form of terms, are displayed on the right panel.
motivated by the observation that the average number of terms in user-submitted queries
during web-searches is 3 (Spink, Wolfram, Jansen & Saracevic 2001). Although LBD
might not particularly fit the context of web search, it was decided that this was still a
reasonable approach. The importance of a word is measured by the probability that the
topic will generate the word. The user interface of the entry screen can be seen in Figure
On the entry screen the potentially related topics are actually hyperlinks. When a
user clicks on any of these hyperlinks they navigate to the second screen: the navigation
screen. On the navigation screen, the user can navigate and investigate the intermediate
topics as well as the supporting literatures. The user interface of the navigation screen
consists of three panels. On the top panel both the initial topic and the potentially related
topic are listed again. This provides the context in which the intermediate topics are to
be interpreted. In the middle panel the intermediate B topics are listed. Three columns
of topics are presented to the user. Each entry is a hyperlink where the text is the top
three words for the intermediate topic. Clicking on any intermediate topic performs a
search for the literature supporting the topic. The lower panel, which is split into two,
3.3. User groups 51
Figure 3.8: Navigation screen. The top panel (a) displays both the initial topic (listed as
“Your topic”) and the potentially related topic (listed as “Related topic”). The middle
panel (b) lists the intermediate topics. The bottom panel (c) contains the supporting
is used to display the supporting literatures. On the left hand side of the lower panel
the literature that supports the relationship A ↔ B is listed. The literature supporting
the relationship B ↔ C is listed on the right hand side of the panel. The supporting
literature was listed as a ranked list of document surrogates. The title of each surrogate
is a hyperlink. Clicking on this hyperlink opened a new window with the full document
contents (as seen in Figure 3.4). The document snippet consisted of the first five sentences
Participants from three different schools were invited to take part in the study. The three
participating schools were the School of Computing, the School of Pharmacy and the
Potential participants were emailed with a request for participation. Prior to this, the
heads of the respective schools were contacted, informed of the purpose and mechanics
Note that the name of the school, the Business School, is actually an umbrella name which groups
many different schools, one of which is the Information Management Group.
of the study (this ruled them out immediately as potential participants), and asked for
Participants were classified according to their expressed research experience into one
3.4 Collections
Inviting participants from three different schools meant that three different collections had
to be created, one for each group, as no standard collections existed for this type of study.
However different in content, all collections shared certain commonalities. First of all,
all collections covered a variety of topics. Covering several different topics, always within
a main theme, meant that participants were not restricted in their searches. Secondly,
all collections contained a mixture of general public articles and scientific papers. This
allowed for participants to have a good range of depth in the information they could find.
Had the collections only contained scientific articles the second session would have had to
1. Converted documents from the Portable Document Format (PDF) to plain text:
all collections harvested consisted of documents in the PDF format. The original
2. Removed too-frequent words that do not add information: all words (stop words)
3. Stemming: this standard step was not performed on the documents despite its po-
from document terms. Had the stemming step been performed the resulting esti-
mated topics would contain the stemmed terms (instead of the original terms). We
assumed that this would make the topics harder to interpret for the participants.
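The pre-processing decisions listed above (stop word removal, no stemming) can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: it assumes the PDF documents have already been converted to plain text, and the stop word list and the function name `preprocess` are hypothetical.

```python
import re

# Hypothetical, deliberately tiny stop word list for illustration only;
# the list actually used in the study is not given in this section.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}


def preprocess(text, stop_words=STOP_WORDS):
    """Lower-case and tokenise plain text, then drop stop words.

    No stemming is applied, so the surviving terms keep their original
    surface forms (e.g. "walked" and "walking" stay distinct).
    """
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    return [t for t in tokens if t not in stop_words]


terms = preprocess("The participants walked and were walking in the library")
print(terms)
```

Because the terms are left unstemmed, any topic estimated from them would list readable words rather than stems, which is the property the text argues for.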
The collections used during the study are further discussed in Chapter 4, Section 4.2.
Simulated work task descriptions were written according to the three different levels of
research experience described in Table 3.1. All tasks, however, were written with the
intention of pushing participants outside their area of research and making them find (or
create) relations with topics outside their own. This was masked as a request to search for
literature that would help the participant fulfil a particular task. An example task is presented
in Figure 3.2.
Participants classified in the category of research student were given the search task
depicted in Figure 3.9. This task suggested that, even though the work they had been
carrying out was good, their supervisor had suggested they broaden their literary scope.
Research students would have to search outside their own area of research and look for
connections. Their ultimate goal was not only to make their supervisors happy (as we all
did once) but also to enrich both their work and literature survey.
Participants in the category researcher were presented with the task depicted in Figure
3.10. The task suggested that they were immersed in the process of writing a grant proposal.
As funding is vital to research activities, any help in getting a proposal accepted would
be more than welcome. The search task mentions that a senior colleague had suggested
the researcher write about the potential impact outside the main theme of the proposal.
This would increase the chances of getting the proposal accepted. This sets the stage for
One approach to solving this is to use a dictionary for stemming and then do a reverse lookup;
however, this may lead to situations where a stem maps back to at least two words, e.g. walk leads
to at least walked and walking.
3.5. Search Tasks 54
Dear participant,
Figure 3.9: Search task given to all research students that participated in the study
Senior researchers were given the search task depicted in Figure 3.11. This task sug-
gested that they had been invited to deliver a keynote speech at a very prestigious confer-
ence. To make their speech more appealing a colleague suggested they look for connections
between their area of research and other research areas so that the speech was more focused
on the grand scheme of things rather than on the particulars of their research area.
To do so, senior researchers would have to search for potentially related areas of research
Dear participant,
Figure 3.10: Search task given to all researchers that participated in the study
3.6 Measurements
Different measurements were made during the search sessions. Data was gathered in both
written and verbal form. Data gathered included information on the searches performed
and their results as well as background information on the participant. Data was not only
anonymized but also kept securely to avoid both privacy breaches and tampering.
different forms. These questions aimed at understanding the quality of the documentation
found (if any) amongst other things. Information regarding their background was also
At the beginning of their first session participants had to fill in the form depicted
Dear participant,
Figure 3.11: Search task given to all senior researchers that participated in the study
others, the participants’ professions, research fields and topics, their expressed confidence
During their second session participants had to record the documents they selected in
a form. This form was designed not only to capture the document identifiers but also the
suggested relationship supported. Participants had to write down the document identifier
together with the intermediate topics which had retrieved it. Together with the initial topic
and the indirectly related topic (topic C) the document identifier and the intermediate
When the second session ended, participants were presented with a final form. This form
was designed to capture data on the search results of the closed model search system. The
information gathered ranges from the topic variety present in the search results to the
validity and intent regarding the connections suggested. This form is depicted in Figure
Talk aloud protocols are based on the idea that talking aloud while solving a task provides
a view of the thoughts as the task solving process is ongoing. The assumption behind this
idea is that people retain a small amount of information in a short term memory store.
If you can tap into this memory store you can learn about the person’s thought process
in solving the task (Ericsson & Simon 1993). The information obtained could be used
to improve not only the understanding of the problem-solving processes but also, for
instance, to devise problem-solving computer algorithms. It was decided to use talk aloud
protocols as they would provide a raw view of the relevance judgement process that users
Two levels of reporting are claimed to be the closest reflection of the thinking process: concurrent
reports and retrospective reports (Ericsson & Simon 1993). Retrospective reports
refer to verbalisations where the thought process is no longer occurring. The person has
to remember and summarise what they were thinking as they were solving the task. It can
happen that the original contents of the memory store are changed as the person might
leave out invalid reasoning steps, details that are deemed irrelevant and so on. Concur-
rent reports on the other hand happen while the thought process is ongoing. Reporting
on the thought process as it happens provides a much rawer view of the process;
however, it may also introduce irrelevant details (Green 1998). Additionally, the burden
of verbalising thoughts concurrent to the action of solving a problem may overload the
person’s short-term memory and alter the normal process path (Ericsson & Simon 1993).
For this study it was decided to use concurrent reports as retrospective reports would
not have provided the desired data granularity. Additionally, as participants were likely to
judge the relevance of several documents during their second session a concurrent reporting
Talking aloud
The process of gathering verbal data relies on the participant talking aloud during the
study. To ensure the quality of the verbal reports gathered, following two main guidelines
the nature of verbal protocol data can be influenced by the instructions received, these
must be carefully and consistently worded (Ericsson & Simon 1993). In the instructions,
should reassure participants so that they are comfortable verbalising their thoughts using
their own words (Pressley & Afflerbach 1995). Additionally, and to maximise reliability,
all participants should receive the same instructions (Ericsson & Simon 1993). These
participants should follow pre-session warm-up exercises. As the ability of different people
to verbalise their thoughts varies, these warm-up exercises are also important to maximise
the quality of the generated verbal protocol data (Pressley & Afflerbach 1995). These
exercises usually require less than 15 minutes (van Someren, Barnard & Sandberg 1994)
and result in several benefits. First, they ensure that participants have understood the
instructions and that researchers and participants share the same understanding of the
kind of data required (Ericsson & Simon 1993). Second, participant anxiety is reduced and
so participants feel more comfortable in reporting their thinking (Ericsson & Simon 1993).
To maximise reliability, just as with the instructions, all participants should not only
follow the same warm-up exercises, but these exercises should also be designed to be as similar
as possible to the target task (Ericsson & Simon 1993, van Someren et al. 1994).
During the second session in the study, participants were instructed to verbalise everything
that came to mind as they solved the task presented. These instructions were
given verbally to each participant. Prior training was also provided. This consisted of a
single warm-up session of up to 15 minutes navigating the system on example, but realistic,
data. During the actual session, the verbal protocols were captured using digital audio
Data Analysis
Reducing threats to data validity and reliability throughout the data analysis process can
be achieved by following three suggestions found in the literature (Ericsson & Simon 1993,
The first guideline refers to the transcription of the verbal data. When the recordings
are transcribed, verbal data is to be transcribed verbatim, capturing as much verbal data
as possible by including pauses, emphases, and indications of tone (Ericsson & Simon
1993). This additional information becomes a secondary data source, as it assists the
interpretation of concurrent verbal data (Pressley & Afflerbach 1995). The transcribed
data is then to be segmented (divided into “utterances”). This step ensures that all the
Code              Description
...               pause or silence
[read] “text”     the participant is reading a document out loud; this usually appears
                  in the form of a mechanical voice at a constant speed with the
                  occasional mumbling
[mumbles] “...”   the participant is mumbling and the audio cannot be transcribed
Table 3.2: Codes used during the transcription step of the protocol
data is segmented in standard units for later encoding/analysis. Care must be taken,
The second guideline suggests that a valid coding scheme that identifies major processes
and patterns of knowledge in the data collected is to be designed. Special attention must
be paid so that it facilitates cross-case analysis. The encoding of the data can be achieved
with minimum threat to validity when the encoding scheme is developed from the data and,
once developed, further data is encoded to check it (Ericsson & Simon 1993). However,
there are advantages to building on existing encoding schemes. First, the method can be
strengthened by applying and refining a common encoding scheme across data and second,
the processes being studied can be further elaborated by the analysis of new data.
The third guideline pertains to the assessment of the reliability of the encoding scheme
and the encoding procedure (Pressley & Afflerbach 1995). The reliability of encoding
schemes can be enhanced by the use of clearly defined codes, illustrated with examples
(Rowe 1985), and the reliability of the coding procedure can be tested by using
inter-rater agreement measures such as Kappa (van Someren et al. 1994) for
the utterances (Ericsson & Simon 1993, Pressley & Afflerbach 1995). Additionally, coders
should practise using the encoding scheme until the codes are both familiar and applied
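As an illustration of an inter-rater agreement measure of the Kappa family, the sketch below computes Cohen's kappa for two coders who labelled the same utterances. The choice of Cohen's variant, the function name and the example labels are assumptions made for illustration; the text above only names "Kappa".

```python
from collections import Counter


def cohen_kappa(codes_a, codes_b):
    """Cohen's kappa for two coders labelling the same utterances.

    kappa = (observed agreement - chance agreement) / (1 - chance agreement)
    """
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("both coders must label the same, non-empty set of utterances")
    n = len(codes_a)
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    # Chance agreement: probability that both coders pick the same code
    # if each assigns codes independently with their own observed frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one and the same code throughout
    return (observed - expected) / (1 - expected)


# Hypothetical first-level codes assigned by two independent coders.
coder_1 = ["interaction", "relevance", "relevance", "intent", "interaction", "relevance"]
coder_2 = ["interaction", "relevance", "intent", "intent", "interaction", "relevance"]
kappa = cohen_kappa(coder_1, coder_2)
print(round(kappa, 2))
```

A value of 1 indicates perfect agreement and a value of 0 indicates agreement no better than chance.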
The recordings captured during the session were transcribed verbatim. The transcriptions
were annotated using the tags listed in Table 3.2. The annotated transcriptions were
then split into “utterances”. Utterances are defined as the minimum unit of speech that
could be assigned a label from the encoding scheme. Often these minimal units were
3.7. Data Analysis 61
As coding schemes that suited the study were not readily available, a custom encoding
scheme was developed to analyse the data gathered in the sessions. The scheme was
designed so that three types of events could be categorised: i) criteria used when assessing
information, ii) any kind of interaction between the user and the system, and iii) the
on/with the system or interacting with it, e.g. reading a document, clicking on a
• Intent: any mention of the participant’s intentions regarding the obtained informa-
tion or regarding their actions, e.g. using a retrieved document to impress their
• Relevance Criteria: any mention of factors that may affect the participant’s choices
regarding whether or not to keep a document, e.g. if the user picks the
The encoders practised using the encoding on data gathered during a pilot study
conducted to test the viability of the design of the study. The details of this pilot study
are described in Cerviño Beresi, Baillie & Ruthven (2008). Additionally, the reliability of
the encoding was tested by measuring the overlap of the assigned codes by independent
Once the recordings had been transcribed and segmented into utterances, they were labelled using the
first level encoding described earlier. The utterances were then further analysed within
each group.
3.7.1 Interaction
Utterances labeled with interaction were analysed to see if any search patterns emerged
from the sessions. They were also analysed to confirm that the participants understood
3.7.2 Intent
Expressions of the participant’s intents were observed and are presented in Section 4.5.
Expressions of the relevance criteria used for either selecting or discarding documentation
were the primary interest of this study. These expressions were classified further according
to a second encoding scheme. The encoding scheme used was the one presented by Barry
& Schamber (1998), which is briefly revisited in the following
• Clarity: whether the information is presented in a clear fashion. This includes well
written documents as well as the presence of visual cues such as images.
• Tangibility: whether the information relates to tangible issues, hard data/facts are
• Quality of Sources: whether the quality of the information can be derived from the
• Verification: whether other information in the field, or the user, agrees with the
presented information.
• Affectiveness: whether the user shows an affective or emotional response when pre-
According to Barry & Schamber (1998), accessibility refers to the cost or effort involved
in obtaining the information. Effort, in their interpretation, refers to physical and not
mental effort. If a document is available only through an interlibrary loan, then it would
require physical effort from the user to obtain it. Cost involved refers to possible fees
involved in obtaining such document. In this study documents were readily available and
no fees were involved in obtaining them. Since the mental effort necessary to process
the information is not interpreted to be a type of “effort”, it was not expected that this
Since documents were available at all times, this criterion was not expected to be observed.
Despite these expectations all codes were included in the encoding scheme.
Verifying the validity of a piece of information in a research field can be very hard to do
for a newcomer. The task given to participants required them to branch out to potentially
unknown areas of science, placing them in the position of newcomers. Considering this,
accuracy/validity, as a criterion, was not expected to be observed very often. What was
expected, though, was that different forms of novelty would play an important role when
users judged documents. Codes that would account for this were included. In the study
• Source novelty: whether the source of the document is new to the user, e.g. an
unknown author
These codes were used to tag utterances that expressed if a document had been seen
before (in the current session or not) and if the document or the information contained
within it was known to the participant. The code source novelty was included to code
As participants were asked to search for literature in potentially unknown areas of science,
it could well happen that the information found would be deemed non-relevant because
they had not been able to understand it. In this situation participants
could either silently reject the document or express their inability to correctly understand
the information presented. To accommodate this situation it was decided to include a
code found in Barry’s list of criteria denoted by the tag ability to understand (Barry 1994).
According to Barry, utterances that denote “the user’s judgement that he/she will be able
participants could possess prior experience that would enable them to make educated
guesses much more easily which combined with the right information could lead to the
mentions of the use of background knowledge or information during the search session
a code found in Barry’s original listing was included: Background Experience. Barry
(1994) states that this code is used to denote “the degree of knowledge with which the
encoding scheme used to tag the utterances found in the transcriptions is depicted in table
Once utterances have been coded they are grouped at the session level and counted, i.e.
all mentions of a particular relevance criterion within the search session are added up and
contribute to a single count for that session and that criterion. For any one participant
Tag                       Description
Depth/Scope/Specificity   the extent to which information is in-depth or focused; is
                          specific to the user’s needs; has sufficient detail or depth;
                          provides a summary, interpretation, or explanation; provides
                          a sufficient variety or volume
Accuracy/Validity         information found is accurate or valid
Clarity                   information is presented in a clear fashion
Currency                  information is current or up to date
Tangibility               information relates to tangible issues
Quality of Sources        quality can be derived from the quality of the sources
Accessibility             the extent to which some effort is required to obtain
                          information; some cost is required to obtain information
Availability              the extent to which information or sources of information
                          are available
Verification              whether other information in the field, or the user, agrees
                          with the presented information
Affectiveness             whether the user shows an affective or emotional response
                          when presented the information
Background/Experience     degree of knowledge with which the user approaches
                          information
Ability to Understand     user’s judgement that he/she will be able to understand
                          information presented
Content novelty           the extent to which the information presented is novel to the
                          user
Source novelty            the extent to which a source of the document (i.e., author,
                          journal) is novel to the user
Document novelty          the extent to which the document itself is novel to the user
Table 3.3: Encoding used to tag the utterances that express a relevance criterion
Figure 3.12: A typical relevance criteria profile. Frequencies are normalised, hence the y
axis varies between 0 and 1.
there is what is defined as a “relevance criteria profile”. A relevance criteria profile, simply
put, is the grouping of the mentions of the relevance criteria during the search session. A
typical relevance criteria profile, visualised as a chart, looks like Figure 3.12. These profiles
provide a global view of the number of times, generally speaking, that each criterion has
Aggregating Profiles
Aggregating profiles, for instance if participants are grouped by their affiliation, does not
require any special processing. Criterion counts are added together and the profile is
\[ rc_i = \sum_{j} rc_{ij} \tag{3.1} \]
where $rc_i$ is the count for criterion $i$ in the new aggregated profile and $rc_{ij}$ is the count
for criterion $i$ in the profile of participant $j$. The variable $j$ is then restricted to the
group for which the new aggregated profile is being calculated, e.g. j = 1 . . . 21 such that
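The construction of per-session profiles and their aggregation (Equation 3.1) can be sketched as follows. This is a minimal illustration under assumed data structures: the participant identifiers, group assignments and coded utterances are hypothetical, and profiles are represented as `Counter` objects keyed by criterion name.

```python
from collections import Counter


def session_profile(coded_utterances):
    """Count the mentions of each relevance criterion within one session."""
    return Counter(coded_utterances)


def aggregate(profiles):
    """Equation 3.1: add up criterion counts over the profiles of a group."""
    total = Counter()
    for profile in profiles:
        total.update(profile)
    return total


# Hypothetical coded sessions and school affiliations.
sessions = {
    "p1": ["Clarity", "Currency", "Clarity"],
    "p2": ["Currency", "Content novelty"],
    "p3": ["Clarity"],
}
school = {"p1": "Computing", "p2": "Computing", "p3": "Pharmacy"}

profiles = {pid: session_profile(codes) for pid, codes in sessions.items()}
computing = aggregate(p for pid, p in profiles.items() if school[pid] == "Computing")
print(dict(computing))
```

Restricting the generator to one school corresponds to restricting the index $j$ to the members of that group.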
Normalising Profiles
Modelling the participants’ preferences using relevance criteria profiles allows one to perform
different types of analyses. A profile can be analysed as defined;
however, comparative analyses need a normalising step before they can be performed.
Two types of normalising can be performed and each allows a different type of
analysis. On the one hand, normalising within a group (or individual session) is necessary
when one wishes to investigate the relationships and relative weight of criteria within the
group (or individual session). On the other hand, normalising within criteria is necessary
when one wishes to investigate the relative weight across groups (or individual sessions).
To normalise within a group (or individual session) one applies the following formula:
\[ rc'_i = \frac{rc_i}{\sum_{j=0}^{N} rc_j} \tag{3.2} \]
where $rc'_i$ is the new, normalised, count for relevance criterion $i$, $rc_i$ is the count for
relevance criterion $i$ in the relevance criteria profile of the group (or individual) and $N$ is
applied. The result of this extra normalisation step is that criteria counts, in each profile,
represent the proportional mentions across the profiles. To normalise across groups, and
\[ rc'^{\,j}_{i} = \frac{rc^{j}_{i}}{\sum_{m=0}^{P} rc^{m}_{i}} \tag{3.3} \]
where $rc'^{\,j}_{i}$ is the relative count of criterion $i$ for profile $j$, $rc^{j}_{i}$ is the actual count of
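The two normalisation steps (Equations 3.2 and 3.3) can be sketched as below. The function names, the dictionary representation of profiles and the example counts are assumptions made for illustration only.

```python
def normalise_within(profile):
    """Equation 3.2: divide each criterion count by the profile's total count."""
    total = sum(profile.values())
    if total == 0:
        raise ValueError("cannot normalise an empty profile")
    return {criterion: count / total for criterion, count in profile.items()}


def normalise_across(profiles):
    """Equation 3.3: for each criterion, divide each profile's count by the
    sum of that criterion's counts over all profiles."""
    criteria = {c for p in profiles.values() for c in p}
    totals = {c: sum(p.get(c, 0) for p in profiles.values()) for c in criteria}
    return {
        name: {c: (p.get(c, 0) / totals[c] if totals[c] else 0.0) for c in criteria}
        for name, p in profiles.items()
    }


# Hypothetical aggregated counts for two groups.
groups = {
    "Computing": {"Clarity": 6, "Currency": 2},
    "Pharmacy": {"Clarity": 2, "Currency": 2},
}
within = normalise_within(groups["Computing"])
across = normalise_across(groups)
print(within)
print(across)
```

After the first step the values of a profile add up to 1, so criteria can be weighed against each other within a group; after the second step the values of one criterion add up to 1 over all profiles, so groups can be compared on that criterion.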
a “true” probability distribution, p, and a target distribution q (Kullback & Leibler 1951).
\[ D_{KL}(p \| q) = \sum_{i} p_i \log_2 \frac{p_i}{q_i} \]
as it does not satisfy the triangle inequality. The KL divergence is also non-symmetric
($D_{KL}(p \| q) \neq D_{KL}(q \| p)$). The properties of the equation make it non-negative and 0 if
both distributions are equal (p = q). The smaller the divergence the more similar the two
distributions are.
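A small numerical sketch of the KL divergence and of the properties just listed is given below. The two distributions are hypothetical, and the handling of zero entries (terms with $p_i = 0$ contribute nothing; $q_i = 0$ with $p_i > 0$ yields infinity) follows the usual convention rather than anything stated in this section.

```python
import math


def kl_divergence(p, q):
    """KL divergence, in bits, between two discrete distributions given as
    equally long sequences of probabilities."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of outcomes")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue  # 0 * log(0 / q_i) is taken to be 0
        if q_i == 0:
            return math.inf  # p puts mass where q puts none
        total += p_i * math.log2(p_i / q_i)
    return total


p = [0.5, 0.5]
q = [0.25, 0.75]
d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)
print(d_pq, d_qp, kl_divergence(p, p))
```

The two directions give different values, which illustrates the lack of symmetry, and comparing a distribution with itself gives 0.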
(JS) divergence (Lin 1991a). The JS divergence considers the KL divergence between p
and q under the assumption that if they are similar to each other they should both be
“close” to their average. Setting $\lambda = \frac{1}{2}$ results in the JS divergence:
\[ D_{JS}(p \| q) = \frac{1}{2} D_{KL}(p \| m) + \frac{1}{2} D_{KL}(q \| m) \tag{3.4} \]
where $m = \frac{1}{2}(p + q)$. As the JS divergence is based on the KL divergence, the smaller the
A discrete probability distribution $p(x)$ is a function that satisfies the following properties:
$P[X = x] = p(x) = p_x$
• The sum of $p(x)$ over all $x$ equals 1, i.e. $\sum_x p(x) = 1$
Normalised relevance criteria profiles satisfy all these properties so they can be inter-
preted as a discrete probability function. One can, hence, compare profiles using any of
ity between different profiles and to spot outliers in the data by comparing each individual
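Comparing two normalised profiles with the JS divergence (Equation 3.4) can be sketched as follows. The profiles, the fixed list of criteria and the helper names are hypothetical; profiles are assumed to have been normalised within the session so that their values add up to 1.

```python
import math


def kl(p, q):
    """KL divergence in bits; terms with p_i = 0 contribute nothing."""
    return sum(p_i * math.log2(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)


def js_divergence(p, q):
    """Equation 3.4: average KL divergence of p and q to their midpoint m."""
    m = [(p_i + q_i) / 2 for p_i, q_i in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def as_vector(profile, criteria):
    """Lay a normalised profile out in a fixed criterion order."""
    return [profile.get(c, 0.0) for c in criteria]


criteria = ["Clarity", "Currency", "Content novelty"]
profile_a = {"Clarity": 0.5, "Currency": 0.25, "Content novelty": 0.25}
profile_b = {"Clarity": 0.25, "Currency": 0.25, "Content novelty": 0.5}

d_ab = js_divergence(as_vector(profile_a, criteria), as_vector(profile_b, criteria))
d_ba = js_divergence(as_vector(profile_b, criteria), as_vector(profile_a, criteria))
print(d_ab, d_ba)
```

Since the midpoint $m$ is non-zero wherever $p$ or $q$ is non-zero, the JS divergence stays finite even when a criterion is missing from one of the profiles, which makes it convenient for sparse profiles.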
Relevance criteria profiles provide a global view of the relevance criteria mentioned throughout
a search session. This view, however, does not show how said criteria are distributed
within the session. A relevance criterion might not be evenly distributed; it could perhaps be that
the distribution of its occurrences is skewed towards the beginning, or end, of the session.
Another drawback of global profiles is that the sequence of occurrence of relevance criteria
is lost. If during the session relevance criterion ci was mentioned before cj , this order is
was designed. Graphs resulting from applying this technique include information on the
order of occurrence of the relevance criteria observed during a search session and the
Sequence is denoted by a time line. The time line only denotes an order in time and not
any measure of it; equal spacing on the line does not mean equal time spans in the session.
Relevance criteria ordering and grouping are represented as piles of coloured blocks. Each
block represents the observation of a particular relevance criterion. Different criteria are
assigned different colours. Piles of relevance criteria are used to model relevance judgement
processes. As long as relevance criteria are observed together one after the other with
no other utterances of a different type in between, e.g. interactions, they are considered
This is to say that $D_{JS}(p \| q) = D_{JS}(q \| p)$.
Figure 3.13: An example with four relevance criteria and interactions plotted.
Figure 3.14: An example with four relevance criteria plotted. Interactions are further
encoded and plotted accordingly.
to be part of the same relevance judgement process. Interactions are plotted in between
Plotting Sessions
To plot a search session first the tagged utterances are grouped. For each group, the first
relevance criterion in the sequence is plotted at the bottom of the pile, the second on top
of it one unit to the right and so on. Blocks are made as long as need be so that the final
shape of the pile resembles a staircase. The graph of the example sequence can be seen in
Figure 3.13. In this graph there are two interactions on either side of the relevance pile
There are assumptions behind the piles metaphor. First of all there is the assumption
of aggregation. When a relevance criterion has been observed it is assumed that it will
apply all the way until the user has made a final judgement. The length of each block in
the graph symbolises this assumption. The application of criteria is done sequentially until
the user is able to make a judgement about the relevance of the information. Each criterion
are represented as a minus sign next to the block in the graph (as seen in figures 3.13 and
3.14). One of the consequences, should this assumption hold true, is that the order in
which criteria are used matters and that there might be a degree of relationship between
relevance criteria. Users might follow a pattern when using relevance criteria. By using
piles one can start analysing whether a user’s relevance judgement process exhibits these
ited by the appearance of interactions. During the study it was observed that relevance
judgements usually ended with the user navigating away from the document. This interaction
can be preceded by the explicit verbalisation of the relevance judgement, e.g. the
user utters “I don’t like this document”. A pile is then defined as occurrences of utter-
ances that are not interactions. The shortcomings are obvious. First of all, depending on
what the researcher considers to be an interaction, piles will (or will not) correspond to
documents and their judgement processes as interactions are not necessarily all naviga-
tion interactions. Further encoding of interactions might alleviate this to a certain extent
since the dynamics of the session might become more visible. For instance re-encoding the
interactions results in Figure 3.14. Gathering click-through data and using it to better
delimit the relevance judgement processes might also alleviate this situation.
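The grouping of tagged utterances into piles, and the staircase layout of each pile, can be sketched as follows. The tuple representation of utterances, the kind names and the function names are hypothetical; any utterance that is not a relevance criterion (e.g. an interaction) is treated as a pile delimiter.

```python
def split_into_piles(utterances):
    """Group consecutive relevance criteria into piles.

    `utterances` is a sequence of (kind, label) pairs in session order,
    where kind is "criterion" for relevance criteria.
    """
    piles, current = [], []
    for kind, label in utterances:
        if kind == "criterion":
            current.append(label)
        elif current:
            piles.append(current)
            current = []
    if current:
        piles.append(current)
    return piles


def staircase(pile):
    """Return (label, offset, length) per block: the k-th criterion starts
    k units to the right and extends to the end of the pile."""
    n = len(pile)
    return [(label, k, n - k) for k, label in enumerate(pile)]


session = [
    ("interaction", "open document"),
    ("criterion", "Criterion 1"),
    ("criterion", "Criterion 2"),
    ("criterion", "Criterion 3"),
    ("criterion", "Criterion 4"),
    ("interaction", "navigate away"),
]
piles = split_into_piles(session)
blocks = staircase(piles[0])
print(blocks)
```

The decreasing block lengths encode the aggregation assumption: a criterion, once observed, is taken to apply until the judgement ends.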
process. In Figure 3.13 we see that one of the four criteria mentioned has a negative
sign next to it (“Criterion 2”). This represents situations in which the user expressed a
relevance criterion in a negative way, e.g. “this is too old, it’s from back in the 60’s”.
In the example, Criterion 2 is negative yet the judgement process continues. This may
suggest that the strength of Criterion 2, relative to the overall judgement process, is not
sufficient to end it there and then. The explanations can be varied; however, the
point is that researchers can direct their attention to further investigate these scenarios.
According to Ware (1988) the effectiveness of using colours for coding is degraded as
more categories are added. Ware recommends 12 colours which are normally used when
labelling using colours. The first six colours, which also correspond to the basic colours
in the colour opponent theory (Hurvich & Jameson 1957), are: white, black, red, green,
yellow and blue. The remaining six colours are: pink, grey, brown, magenta, orange and
an appropriate colour. The most frequently occurring relevance criterion should then be assigned the
first colour in the sequence, the second most occurring criterion the second colour in the
sequence and so on. The rationale behind this procedure is that, since aggregated profiles
are obtained by averaging across users, higher relevance criteria counts mean that users
have mentioned the criterion, on average, more often hence the relevance criterion is likelier
to be observed in any one search session. Choosing the most contrasting colours for the
most commonly occurring relevance criteria should ease the visual detection of the
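The colour assignment procedure can be sketched as below. The colour sequence contains only the eleven colours named in the text above, and the function name, the tie-breaking rule (alphabetical) and the example counts are assumptions made for illustration.

```python
# Colour sequence in the order given in the text; only the colours named
# there are included.
COLOUR_SEQUENCE = [
    "white", "black", "red", "green", "yellow", "blue",
    "pink", "grey", "brown", "magenta", "orange",
]


def assign_colours(aggregated_profile, colours=COLOUR_SEQUENCE):
    """Give the most frequent criterion the first colour, the second most
    frequent the second colour, and so on (ties broken alphabetically)."""
    ranked = sorted(aggregated_profile, key=lambda c: (-aggregated_profile[c], c))
    if len(ranked) > len(colours):
        raise ValueError("more criteria than available colours")
    return dict(zip(ranked, colours))


# Hypothetical aggregated counts.
palette = assign_colours({"Clarity": 5, "Currency": 9, "Tangibility": 2})
print(palette)
```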
In this chapter the data gathered during the study is described. Data dealing with the
mentioned relevance criteria is presented and analysed in this chapter. We initially analyse
the data using relevance criteria profiles (this technique is explained in Chapter 3, Section
3.7.4). As discussed, relevance criteria profiles allow the analysis of the occurrence of
relevance criteria at a global level. As such they provide a quick view of the occurrence of
the relevance criteria on a per session basis while visualising one or more of these profiles
as charts aids in uncovering the salient differences between individuals and groups.
Participants came from different schools and possessed different levels of research ex-
perience. Affiliation and research experience level lead to natural groupings of the partic-
ipants. The following subsections describe the relevance criteria profiles of three groups,
• Global: participants are not grouped. Statistics are calculated across all partici-
• School: participants are grouped by their affiliation. Statistics are calculated inde-
pendently for each school (as listed in Table 4.1) regardless of all the other partici-
Statistics are calculated for each research experience level (as described in Table 3.1)
In Section 4.1 the user groups, their affiliations and research experience levels are
described. The collections searched during the study are described next in Section 4.2.
While a general overview of the data gathered is provided in Section 4.3, relevance profiles
are discussed and their plots are presented in Section 4.7. Section 4.8 concludes with
4.1 Participants
came from the School of Computing, making this the largest school to take part
in the study. Participants from the School of Computing were distributed as follows:
6 expressed that their research experience was that of a “research student”, 2 that their
research experience was that of a “researcher” and 2 that their research experience was that
of a “senior researcher”. The second largest school is the Information Management Group
with 8 participants in total. Out of these 8 participants, 2 were research students, 4 were
researchers and 2 were senior researchers. Only 3 people from the School of Pharmacy
agreed to take part in the study, of whom 2 were research students and 1 was a
researcher. No senior researchers from the School of Pharmacy accepted the invitation to
take part in the study. The distribution of the participants according to their affiliation
is displayed in Table 4.1 while the distribution of participants per research level (grouped
School/Group             Participants
Computing                10
Information Management   8
Pharmacy                 3
Total                    21
Table 4.1: Number of participants per school/group for which valid data was gathered
[Figure: number of participants per research experience level (research student, researcher, senior researcher), grouped by school (School of Computing, Information Management, School of Pharmacy)]
4.2 Collections
The collection searched by participants coming from the School of Computing consisted of
several volumes (up to volume 50) of the Communications of the Association for Comput-
ing Machinery (CACM)1 . The collection contained 7028 articles covering several areas of
Computer Science. Even though the most recent articles in CACM are magazine-style,
earlier volumes contained scientific articles. Topics covered in this collection
ranged from peer-to-peer (P2P) computing to software engineering theory and practice.
The average document length is approximately 2676 terms with 85% of the documents
containing between 0 and 5000 terms (after stopword removal). The longest document
contains 34184 terms and the shortest only 79 terms. This collection was created by
downloading all available documents from the CACM web site up to volume number 50.
Query issued
Information Management
Content Management
Information Retrieval
Information Systems
Knowledge Management
Profiles + content
Table 4.2: Queries issued to the search engine for constructing the collection for the
Information Management Group
articles revolving around the topic Information Management. To create the collection,
documents were searched for and retrieved from Library and Information Science
Abstracts2 (LISA). As the list of participants was known beforehand, queries were crafted
to reflect, as much as possible, each participant’s research areas. To do so, the research
interests section of each participant’s home page (whenever available) was visited and the
most significant, but still generic, keywords (such as Knowledge Management or Digital
Libraries) were extracted. The full list of queries can be seen in Table 4.2. The topics
covered by the retrieved documents revolved around a main theme: information management.
Amongst the topics covered we find, in particular, Web 2.0, law librarians and new trends
in enterprise content management solutions.
came from a group with a common research theme, there was a significant overlap in
the queries constructed (e.g. the term information was present in several queries). This
resulted in repeated documents when pooling all the document sets returned for each
query. Duplicates were removed before the study. The collection contained a total of 4756
documents after de-duplication. The average document length is 3128 terms with 88% of
The collection searched by participants from the School of Pharmacy contained documents
from two different sources: the Public Library of Science (PLoS)3 and The Pharmaceutical
Journal Online (PJO)4. Articles published in PLoS, as well as in PJO, can
be of two types: i) scientific articles or ii) magazine articles. Both types were
included. The collection contained a total of 11426 documents. The topics covered by the
documents were quite varied and ranged from tropical diseases (PLoS Neglected Tropical
5064 terms with 88% of the documents containing between 0 and 10128 terms.
To create this collection all available documents from both sources were downloaded
encoded a total of 300 of these (approximately 17%). We found that the overlap between our
encoding and that of the researcher amounted to a total of 87%, i.e. 87% of the utterances
had been assigned the same label by the two independent encodings (some utterances had
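The overlap figure reported here is a simple percent agreement between two independent encodings. As a minimal sketch, assuming exactly one label per utterance from each coder (the labels below are hypothetical, and the handling of utterances with more than one label is not covered), it can be computed as follows.

```python
def percent_agreement(labels_a, labels_b):
    """Share of utterances given the same label by two independent coders."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both coders must label the same utterances")
    if not labels_a:
        raise ValueError("no utterances to compare")
    matches = sum(1 for a, b in zip(labels_a, labels_b) if a == b)
    return matches / len(labels_a)


# Hypothetical labels for five utterances.
coder_1 = ["tangibility", "currency", "clarity", "tangibility", "affectiveness"]
coder_2 = ["tangibility", "currency", "tangibility", "tangibility", "affectiveness"]

agreement = percent_agreement(coder_1, coder_2)  # 4 of the 5 labels match
```

Percent agreement does not correct for agreement expected by chance, so it is usually read as an optimistic estimate of reliability.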
4.4 Interaction
Interaction information was analysed mainly to see if participants had understood and
knew how to use the system. Sessions would generally follow a pattern of interactions
which could be laid out as follows. Participants, after being trained and having interacted
briefly with the system, would initiate their searches by looking at the offered main topics
as well as the description of their research topic. This was reflected in utterances like
“I’m going through the keywords first” and “the possible related topics for this is”. Once
the participant had selected a topic to investigate further, an initial quick search over
the potential intermediate topics was usually done. Utterances like “now I’m looking at
the top B ones” or utterances which denote that the participant was reading out loud
topic usually followed and the literature was retrieved. At this stage participants would be
able to examine the actual literature connecting the two topics chosen. The session would
progress with participants spending more or less time on a single intermediate topic, going
back and forth between literatures. Unfortunately there were two exceptions where the
participants were frustrated and abandoned the study prematurely. Interactions observed
during the search sessions are analysed in more depth in Chapter 5, Section 5.3.
4.5 Intent
While participants usually verbalised their interactions with the system quite regularly,
intentions were not expressed as often. Intentions generally referred to their search pur-
poses such as “I need to find complementary”. Mentions of other types of intentions such
as the use of the information found (when found) were also observed (“I’m thinking that
The encoding provided in Section 3.7.3 is a reinterpretation of the overlap of two other
encodings as presented in (Barry 1993) and (Schamber 1991). Using this encoding for
1. If it was found that the encoding applied to the data gathered during our study,
Since this type of investigation had never been done before in the context of LBD, it
provided by Barry & Schamber (1998), in this study a number of codes were used according
to a personal reinterpretation. Moreover, as the Barry (1993) study is more in line with
this study (when compared to that of Schamber (1991)), most interpretations are closer
to those of Barry (1993) than to those of Schamber (1991). The interpretations of the
The definition provided by Barry & Schamber (1998) for depth/scope/specificity covers
utterances regarding “whether the information is in depth or focused, has enough
detail or is specific to the user’s needs [...] it provides a summary or overview or a sufficient
variety or volume.” In this study, the code was interpreted as originally intended.
information presented is accurate or valid. Even though the criterion refers to a personal
judgement, information validity (or accuracy) does not depend on personal opinion, nor
on whether a person may disagree with it. This criterion refers to the act of the user judging the
The differences in the definitions of clarity from the Barry (1993) study and the Schamber
(1991) study are likely to be due to the different information objects studied in each study.
In the cross-comparison study done by Barry & Schamber (1998), however, it is clarified
that, in the broadest sense, what users are actually judging is whether the information is
presented in a clear and understandable way. In this study, this is how the criterion was
4.6. Interpretation of the Relevance Criteria 80
Barry’s definition of currency (1993) agrees with that of Schamber (1991). According
to this definition, currency refers to the extent to which users judge the information to
be current, up to date, etc. In this study, expressions of this nature were also coded as
A code with which mentions of topicality (or aboutness) should be encoded is not included
in the encoding used in this study. This stems from the assumptions in the Barry (1993)
study and the Schamber (1991) study. The assumption behind both studies is that users
judge relevance beyond topicality. In this respect, as the nature of both studies was to
observe and list the relevance criteria observed, it seems that the participants of both
studies did not mention topicality as a criterion explicitly. Regardless of why the
participants of both the Barry (1993) and the Schamber (1991) studies did not
mention topicality, participants in this study did, and so these mentions had to be labelled
accordingly. Initially a code for this purpose could have been added. We believed, however,
mentions of the information being on topic (or about a topic). Consider the following
Retrieval which is pretty bang on the topic I was looking for so,
yeah ...’’
In this extract we can see three mentions of topicality. The first one is signalled by the
utterance “...this is [about] 2nd life...”. The participant, in this case, is mentioning that he
has recognised the overall theme of the document and what it talks about. The second and
third mentions of topicality are “’s about accessibility...” and “’s about different
people’s abilities to use environments for Information Retrieval...”. Here, again, the user
refers to the topics being discussed in the document, and the document is interpreted as
providing information about those topics. Utterances like “it’s about [topic]” are interpreted
as expressions of the document contents discussing the topic. It is in this sense that the
document is providing information about the topic, so utterances of this type were coded as
tangibility. The last utterance, “...which is pretty bang on the topic I was looking for...”,
Quality of Sources
Quality of Sources, as interpreted in this study, refers to the different sources that partic-
ipants could evaluate such as authors (or editors), affiliations or the publications in which
the documents appeared. This interpretation is consistent with that of the criteria Source
Quality and Source Reputation/Visibility in the Barry (1993) study. Utterances regarding
the extent to which the quality of the information could be inferred either from personal
experience with the source of the information or from the reputation (visibility) of the
The differences between the definitions of accessibility provided by Barry (1993) and by
Schamber (1991) seem to be a result of the information objects examined by the users in
their studies. The interpretation of this criterion in this study refers to both the effort
and cost involved in obtaining a document; these correspond to Obtainability and Cost as
defined by Barry (1993). Mentions of the effort and/or the cost involved in obtaining the
Barry (1993) defined availability on two levels: environmental and personal. Environmen-
tal availability refers to the extent to which the information presented is available in other
documents within the environment. Personal availability refers to the extent to which the
information presented was already possessed by the participant. In this study availability
to be part of a different code, namely document novelty as will be explained later on.
Despite its apparent similarity with accuracy/validity, verification refers to personal
agreement with the information presented regardless of the validity (or accuracy) of the
information. It may be the case that the information presented to the user is invalid (such
as the statement 2 + 2 = 5); however, a person might still agree with it (“well, for large
values of 2 that statement holds!”). Furthermore, the interpretation of the criterion refers
on by different sources of information. In the Barry (1993) study the code is actually
agreed with the information presented or to the extent to which the participants’ point of
Affectiveness, as defined by both Barry (1993) and Schamber (1991), refers to the extent
In this study, the interpretation of the code was extended to include expressions of raised
Ability to Understand
The code ability to understand, according to Barry (1993), is used to code utterances
that denote “the user’s judgement that he/she will be able to understand or follow the
information presented”. In this study the code has been interpreted accordingly.
Background Experience
Mentions of the use of background knowledge or information during the search session
were encoded with Background Experience. Barry (1993) states that this code is used to
denote “the degree of knowledge with which the user approaches information, as indicated
defined by Barry.
Content Novelty
Utterances expressing that the information contained within documents is known (or un-
known) to the participant were coded as content novelty. Expressions indicating the extent
to which the information is novel to the participant, and consistent with the definition
Source Novelty
The code source novelty refers to the extent to which the source of the information (for
instance the author) was novel to the user. In this study the interpretation was extended
to also include mentions, for instance, of a known author writing in a different field or
in an unexpected journal. It is not only interpreted to refer to the extent to which the
sources are novel but also to the extent to which the relationship between the information
Document Novelty
Utterances expressing that a document had been seen before (in the current session or not)
were coded as document novelty. An expression classified as document novelty can express
that, for instance, a document was not known by the participant prior to finding it. As
such, this means that the document is novel but also that the document is not available
in the participant’s personal collection. Personal availability was covered by Barry (1993) under the
4.7. Relevance Criteria Profiles 84
code availability (explained earlier); however, in this study it is covered as part of document
Relevance criteria information was analysed at a global level using relevance
criteria profiles. Relevance criteria profiles allowed the analysis of frequencies at an in-
dividual and group level. Comparative types of analyses are also possible on relevance
were followed when plotting profiles: plotting profiles individually and plotting profiles
together. Plotting profiles individually aids interpretation, while plotting them together provides a quick
In Section 4.7.1 an account of the most mentioned relevance criteria, according to
a global relevance criteria profile, is provided. Profiles for the groups listed in Section
4 are also calculated and plotted together. In Sections 4.7.2 and 4.7.3, the school and
research experience profiles are presented and briefly analysed. An account of the observed
relevance criteria, together with plots of the profiles, is offered in Section 4.7.4. To
produce these plots one of two approaches, depending on the desired analysis, was
was performed before plotting the profiles together. Plotting profiles together as they
are is done in an attempt to uncover salient (dis)similarities in terms of within-profile
proportions. By plotting two, or more, profiles together the salient criteria, within each
needed to uncover a different type of pattern: the proportional mentions within criteria.
By re-normalising and then plotting the profiles together we can observe how each criterion
Because verbal reports are only a subset of the thought processes that occurred during
the search session, the results presented next can only be interpreted as indicative and
coverage) from user to user, the absence of mentions of a particular criterion does not
Table 4.3: Number of occurrences for each criterion according to the global relevance
criteria profile
mean that the participant never considered it during the relevance judgement process.
Moreover, as the volume of the verbalisations varied from participant to participant, cer-
criteria profiles.
The global relevance profile was obtained by applying Formula 3.1 and restricting j to
all participants, i.e. j = 1 . . . 21. In Figure 4.2 we can see that, overall, criteria dealing
with the tangibility and with the depth/scope/specificity of the information were the two
most common. Document novelty and affectiveness follow in third and fourth place
respectively. These four criteria account for 76.9% of the total number of observations
(1355 occurrences). The list of all counts per criterion can be seen in Table 4.3 are also
The distribution of participants according to their affiliation can be seen in Table 4.1.
The profiles of the three schools were obtained by applying formula 3.1 with the variable
Figure 4.2: Relevance criteria profile of the global group. Values on the y axis vary between 0 and 1.
Group Restriction
Ten participants from the School of Computing took part in the study. This is the most
represented school in the study. The most mentioned criteria, by members of the School
times (about 40.6%) while depth/scope/specificity 127 times (about 14.4%). Their third
most mentioned criterion is document novelty which has been mentioned 98 times (about
11%). This may suggest that members of the School of Computing prefer tangible data
over, for instance, voluminous information. Eight participants were affiliated to the Infor-
mation Management Group making it the second most represented school in the study.
Members of this group mentioned depth/scope/specificity 231 times (about 30.3%) and
tangibility 215 times (about 28.2%). Document novelty was mentioned 103 times (about
11.7%). Unlike members of the School of Computing, who seem to have a preference for
tangible data, members of the Information Management Group seem to be more interested
in other properties of the information such as its volume and its specificity. This prefer-
ence, however, is not as marked as that of the members of the School of Computing. Only
3 participants from the School of Pharmacy accepted the invitation and took part in the
study. Members of this school also seem to exhibit the same preferences as members of the
(about 38.7%) and tangibility 24 times (about 19.3%). The criterion document novelty
Table 4.5: Number of occurrences for each criterion according to each school relevance
criteria profile
topicality, so care must be taken when comparing the mentions of tangibility with those
of any other criterion. This will be examined in more depth in the following subsections.
The distribution of participants per research level is depicted in Figure 4.1. The profiles
for these three groups were obtained by applying formula 3.1 using the restrictions listed
in Table 4.6. The distribution of the utterances according to each profile is depicted in
Table 4.7.
A total of 10 participants expressed that their research experience level was on
par with that of a research student. This is the largest group. The second largest group
is the group of participants that were classified as researchers. This group consisted of 7
participants. The smallest group, with 4 participants, is that of the senior researchers.
Group               Restriction
Research Student    j = 1 ... 21 such that p_j has expressed that they are a research student
Researcher          j = 1 ... 21 such that p_j has expressed that they are a researcher
Senior Researcher   j = 1 ... 21 such that p_j has expressed that they are a senior researcher
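Formula 3.1 is not reproduced in this chapter. The sketch below assumes that it sums, for each criterion, the coded utterances of every participant p_j that satisfies the restriction of the group. The participant records, the field names and the function `profile` are hypothetical and serve only to illustrate how a restriction selects the participants that contribute to a profile.

```python
from collections import Counter

# Hypothetical coded data: one record per participant p_j.
participants = [
    {"id": 1, "level": "research student",
     "codes": ["tangibility", "tangibility", "currency"]},
    {"id": 2, "level": "researcher",
     "codes": ["depth/scope/specificity", "tangibility"]},
    {"id": 3, "level": "research student",
     "codes": ["document novelty", "tangibility"]},
]


def profile(records, restriction=lambda p: True):
    """Count criterion mentions over the participants that satisfy `restriction`."""
    counts = Counter()
    for p in records:
        if restriction(p):
            counts.update(p["codes"])
    return counts


# No restriction: every participant contributes (the global profile).
global_profile = profile(participants)

# Restriction: only participants who said they are research students.
student_profile = profile(participants, lambda p: p["level"] == "research student")
```

The school profiles would follow the same pattern with a restriction on affiliation instead of research experience level.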
Regardless of research experience level, the two most mentioned criteria were, in order,
32%) and depth/scope/specificity 166 times (about 22%), researchers mentioned tangibility
218 times (about 39%) and depth/scope/specificity 150 times (about 26.8%) and senior
times (about 20.5%). As with the school profiles, it must be noted that there may be
mentions of “aboutness” or “topicality” that have been encoded as tangibility and are
driving the counts up. This phenomenon will be analysed further in the next subsections.
Plotting the profiles together may help visualise the (dis)similarities between uses of criteria
per school (or research experience level) more clearly. Different individuals had varying
degrees of verbosity which resulted in different numbers of utterances coded. To make the
Table 4.7: Number of occurrences for each criterion according to each research experience
level relevance criteria profile
by normalising the profiles. The normalisation step was done applying Formula 3.2. A
graphical depiction of the normalised school profiles can be seen in Figure 4.3 while the
In both figures we can observe that the two most mentioned criteria are
depth/scope/specificity and tangibility. In the case of the school profiles we can see that
while the other two schools mentioned depth/scope/specificity more often. This preference
is not as clear in the case of the Information Management Group; however, the data in
Table 4.5 confirms that they did mention depth/scope/specificity more often than
they mentioned tangibility. Students, researchers and senior researchers all mentioned
top mentioned criteria across groups and doing basic comparisons, however, adding an
extra normalising step, coupled with combined plotting, helps reveal even more patterns.
Applying Formula 3.3 on the already normalised profiles results in normalised proportions
[Figure 4.3: normalised school profiles plotted together; legend: School of Computing, Information Management Group, School of Pharmacy]
Figure 4.4: Research experience profiles plotted together.
per criterion, i.e. for each criterion the proportions are normalised, resulting in a distribution
over groups of the proportional mentions of the criterion. As the resulting proportions of
each criterion now sum to 1 (100%), we can observe, when plotting these profiles, how
much each school, relative to the others, has used each criterion.
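The two normalisation steps can be illustrated with a small sketch. Formulas 3.2 and 3.3 are not reproduced in this chapter, so the code assumes the reading suggested by the prose: the first step divides each count by the total of its profile, and the second divides each normalised value by the sum of that criterion's values over all groups. The counts are those quoted in this chapter for two criteria only; the full tables contain more criteria, so the resulting proportions differ from those plotted in the figures.

```python
# Counts quoted in this chapter for two criteria only (a partial illustration).
counts = {
    "School of Computing": {"tangibility": 356, "depth/scope/specificity": 127},
    "Information Management Group": {"tangibility": 215, "depth/scope/specificity": 231},
    "School of Pharmacy": {"tangibility": 24, "depth/scope/specificity": 48},
}


def normalise_within_profile(profile):
    """Assumed reading of Formula 3.2: divide each count by the profile total."""
    total = sum(profile.values())
    return {criterion: n / total for criterion, n in profile.items()}


def normalise_within_criteria(profiles):
    """Assumed reading of Formula 3.3: for each criterion, divide each group's
    proportion by the sum of that criterion's proportions over all groups."""
    criteria = list(next(iter(profiles.values())).keys())
    column_totals = {c: sum(p[c] for p in profiles.values()) for c in criteria}
    return {
        group: {c: p[c] / column_totals[c] for c in criteria}
        for group, p in profiles.items()
    }


normalised = {group: normalise_within_profile(p) for group, p in counts.items()}
renormalised = normalise_within_criteria(normalised)
```

After the first step each profile sums to 1, which removes the effect of group size and verbosity. After the second step each criterion sums to 1 across groups, which shows the relative use of a criterion but no longer shows how often it was used in absolute terms.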
Figures 4.5 and 4.6 depict the re-normalised school profiles and the re-normalised
research experience level profiles respectively. These two figures will be used as guides in
Figure 4.5: The profiles of the schools, normalised within criteria, plotted together.
(23.07%). This criterion deals not only with scope, but also with specificity, volume, detail
and even genre of the information contained in the document. Reasonably so, participants
were interested in these properties of the information obtained. As utterances that ex-
press that the document refers to a topic specific to the user’s needs were also coded as
• “general summary”
• “detailed enough”
Figure 4.6: Research experience profiles, normalised within criteria, plotted together.
• “lots of information”
Out of these 406 occurrences, 61 (15%) were references to exemplary documents. Ac-
cording to Blair & Kimbrough (2002), “exemplary documents are those documents that
varies significantly across research fields in science. One function these exemplary docu-
ments perform is to provide a definition of the words included in these vocabularies. This
documents in the scientific community is the survey article. Survey articles summarise up
to a point in time the most important advances and issues to be treated in a field, include
a list of references to follow up and possibly a list of important academics and institutions.
Participants of this study referred to this type of exemplary document in ways such as
“ overview of the key papers...” and “ overview of data collection techniques...”.
This utterance could also be classified as clarity or ability to understand.
This utterance could also be encoded as tangibility
There is some evidence that suggests that an exemplary document of this type may,
for instance, ease the entrance of a newcomer to the world of research in that field. It
would ease this entrance by not only providing an overview of the field itself but also of
the pertinent vocabulary and major players in it. It would be reasonable to observe users
preferring a survey article to the latest article on a specific topic when getting acquainted
with the field being investigated. This must be kept in mind as exemplary documents
can be of most use when their topicality has been assessed to be outwith the participant’s
own field of work. Indeed, one participant remarked negatively that a document on his
own research was “high level”. As the participant continued the session, the expression
“high level” was observed again, this time used positively: the participant was referring
to a document on a topic different from, but related to, his own field of research.
Preferences for exemplary documents included, but
• “general summary”
• “gentle introduction”
• “good overview”
A possible reason why documents of this type are preferred by users entering a new
field is that they may have a high ratio of information obtained to processing effort
(both concepts introduced in Harter’s theory of psychological relevance (Harter 1992)).
Perhaps by providing this roadmap to the field, together with the associated jargon,
survey articles offer plenty of information in exchange for little mental processing
effort. This would allow users to judge quickly whether or not it would
be beneficial to go deeper into the field and search for possible connections.
For two schools, namely the Information Management Group and the School of Phar-
macy, this criterion was their most mentioned; each school mentioned it 231 and 48 times
times. In Figure 4.5 we observe that, proportionally speaking, the school that mentioned
Management Group and the School of Computing. It seems reasonable that, while mem-
bers of the School of Computing have used this criterion to a certain extent, they have
as follows: students mentioned it 166, researchers 150 and senior researchers 90 times.
This makes this criterion the second most mentioned criterion by any one research ex-
perience profile. In figures 4.4 and 4.6 we can also observe that it was researchers who,
Judging the accuracy/validity of the information presented when entering new fields of
research may be very hard to do, if not impossible. It was not expected to be observed very
often, and in fact it was not observed at all. A potential explanation is that users, as they
had entered new territories, took for granted the accuracy/validity of the information as
the information came from documents that had been published in different, and sometimes
well known, publications. In a sense, it may be that there was an implicit use of quality
of sources.
The code clarity refers to the extent to which the information was presented in a clear and
understandable way and it has been observed 38 times (about 2.1%). This is a criterion
that might have an effect on the mentions of ability to understand, as it may happen that
because the information is not presented in a clear fashion, the user expresses his (or her)
documents which have gone through a reviewing process, it seems reasonable to have
observed this criterion less often than most of the other criteria, as the peer-reviewing process is
supposed to guarantee, amongst other things, a certain level of clarity. However, the expressions
of the criterion alone (its counts) only indicate its presence. It may have happened that
the information was in general very clear so that participants mentioned clarity only when
the information had been presented in an outstandingly clear and understandable way (or
the total opposite) and that generally the clarity of the information is silently ignored.
Mentions coded as currency accounted for 3.2% of all relevance criteria mentions (57
utterances coded). Currency refers to the extent to which the information was judged
to be current or up to date. It is not entirely clear what role this criterion plays
in this context, as both “old” and “new” information could potentially be very
relevant, i.e. regardless of the date published, related information would remain
related. However, users might prefer more current information as the chances of making
“new” discoveries may increase by incorporating current information. Users of this study
• “outdated...yeah, 1985”
• “it’s up to date”
in Figure 4.6, most mentions of this criterion came from senior researchers. This phe-
nomenon could have been influenced by the search task given to senior researchers: while
students had to complement their literature review and researchers had to write a pro-
posal, senior researchers had to gather information for a keynote speech. Perhaps, the
activities undertaken by both researchers and students allowed them more room when it
came to the currency of the information. Despite the preference usually given to current
understanding of the research field they are involved in. A similar line of reasoning applies
to researchers in the process of writing a research proposal: while special attention
is usually paid to the state of the art and very current information, potentially outdated
information serves as a complement in this task. Preparing a keynote speech, where the
organisers have asked you to “focus your speech on the future directions and implications
of advances in your research field, especially on those fields outside your own”, however,
might be more restrictive regarding the currency of the information used for this. Perhaps
it was this that made senior researchers concentrate on very current information and hence
the higher proportion observed in terms of mentions. This suggests that bias might have
been inadvertently introduced while crafting the simulated work tasks. More specifically,
the bias may stem from differentiating the underlying tasks that participants had to
perform, in an attempt to make the tasks more realistic with regard to the research
experience levels. Each task has an implicit time8 constraint which was not detected beforehand.
In terms of the school profiles, members of the School of Computing were the least
interested in the currency of the information; most mentions (about 81%) came from
Recalling the explanation of the code tangibility from Section 3.7.3, it is not surprising that
this was the most common criterion used by the participants (it appeared 595 times,
representing 33.8% of all the criteria mentioned). Tangibility refers to the document’s
contents, the actual explicit information contained within. However, some of the utterances
coded with tangibility actually refer to the topics discussed. Out of the 595 utterances
utterances as topicality would result in tangibility becoming the second most mentioned cri-
terion followed by topicality with 338 and 257 counts respectively (depth/scope/specificity
Participants of this study mentioned the topicality of the information presented less
often than, for instance, the volume of it. There are at least two potential explanations
for this phenomenon. On the one hand, it may be that topicality is less important than,
with a short document, for instance, a participant could express that “it’s too short” and
disregard it. In this situation, regardless of the topics being discussed, the information
would have been deemed irrelevant by the user simply because there was not
enough of it9. On the other hand, it could be that topicality is a sine qua non without
which relevance cannot be judged. When presented with information, it could be that a user first
assesses whether it is on topic or not and expresses this as “it’s about [topic]”. Once the
information has been deemed on topic, the judgement can proceed and the user applies
other criteria. Either of these two potential explanations could be the cause for the higher
Members of the School of Computing mentioned tangibility more often than any of the
other criteria. Out of the 356 mentions of tangibility, 135 (37% of all mentions encoded
actual tangibility, according to the Barry & Schamber (1998) interpretation, and topicality
does not affect the ranking in terms of mentions of criteria: members of the School of
Computing still mentioned tangibility more often than any other criterion. The revised
tangibility as the most mentioned criterion with 221 mentions followed by topicality with
135 mentions. In third place we would find depth/scope/specificity with 127 mentions.
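The effect of re-encoding topicality with its own code is simple arithmetic and can be checked with a short sketch. The counts are those quoted in the text; only the criteria mentioned there are included, and the function names `split_topicality` and `ranking` are illustrative assumptions.

```python
def split_topicality(profile, topicality_mentions):
    """Move `topicality_mentions` from tangibility into a separate topicality code."""
    revised = dict(profile)
    revised["tangibility"] = profile["tangibility"] - topicality_mentions
    revised["topicality"] = topicality_mentions
    return revised


def ranking(profile):
    """Criteria ordered from most to least mentioned."""
    return sorted(profile, key=profile.get, reverse=True)


# Counts quoted in the text (other criteria omitted).
global_counts = {"tangibility": 595, "depth/scope/specificity": 406}
computing_counts = {
    "tangibility": 356,
    "depth/scope/specificity": 127,
    "document novelty": 98,
}

global_revised = split_topicality(global_counts, 257)       # 595 = 338 + 257
computing_revised = split_topicality(computing_counts, 135)  # 356 = 221 + 135
```

In the global profile the split moves depth/scope/specificity ahead of tangibility, whereas in the School of Computing profile tangibility keeps first place with topicality second.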
We must remember that this analysis is solely based on what users verbalised. Silent rejections or
use of criteria, as they cannot be detected, cannot be taken into account when analysing and proposing
potential hypotheses.
This preference for hard data, as exhibited by members of the School of Computing, is not
entirely surprising given the nature of the discipline. As the code tangibility refers to the
actual contents of documents, this suggests that even when users referred to the contents
of the document, topicality was not the only factor affecting the judgements. This can
this hard data in ways such as “..that’s an application...” and “...there’s some interesting
facts in this one...”. One must be careful, however, when interpreting this observation as
not all mentions of tangibility were positive. Indeed, at least one member of the School of
As observed before, in the case of the profile of the Information Management Group and
or “topicality” makes the difference even larger. In the case of the mentions from members
of the Information Management Group, out of the 215 utterances encoded as tangibility,
115 (53.5%) were actual mentions of topicality and, in the case of the mentions from
actual mentions of topicality. Taking this into account means that the members of the
Information Management Group and members of the School of Pharmacy have a marked
those of tangibility. Although members of these schools showed interest in
tangible data (expressed, for instance, as “...they achieved a 50% repose rate...”), most
did not regard hard data as a factor affecting the relevance of the information
All three research profiles mentioned tangibility the most; however, if we take into
account that topicality accounts for 43% of the mentions coded as tangibility, this situation
changes. As observed before for the school profiles, re-encoding mentions of topicality
with their own code results in a new ranking where depth/scope/specificity is the most
mentioned criterion regardless of research experience level. The second most mentioned
criterion in this new ranking, however, depends on research experience. Students and
researchers mostly mentioned tangibility, while senior researchers mentioned topicality.
Potentially, this could be influenced by the task each group had to try to complete. Students
were asked to complement their literature review. The nature of literature reviews is to
provide an overview of the landscape of a field of practice and as such they are required to
include not only the subtopics that might be found within the field but also more tangible
information such as previous results and techniques. Perhaps this is what motivated stu-
dents to mention tangibility more often than topicality. A similar explanation applies to
researchers who were asked to write a funding proposal. In the case of senior researchers,
however, the situation changes. Senior researchers mentioned topicality more often than
tangibility and this may also have been influenced by the task they had to complete.
Senior researchers were requested to write a keynote speech. It may be that preparing
a keynote speech actually does not require tangible data but only information that is on
topic as keynote speeches are usually about the future of an area. When suggesting what
the future may bring, only related topics and potential interactions are described but not
in great detail.
Quality of Sources
Participants of the study resorted to the reputation of the authors, their affiliation and/or
the reputation of the publications as an indicator of the quality of the information. Men-
tions of this usage were encoded as quality of sources and were observed 85 times (4.4%).
In the context of entering new, and possibly unknown, fields of research, resorting to this
criterion seems like a sensible approach. However, evaluating the credentials of the sources
of the information may not be an easy, or even feasible, task. Perhaps it was this that
made the participants refer mostly to generic qualities such as position and not to more
specific factors such as names and familiarity with the authors’ work. Some examples of
there were more specific mentions of names and expressions of personal relationships with
the authors of the documents such as “...she’s a friend of mine...” and “...Carol’s review
Mentions of quality of sources came mostly from students and senior researchers. It
may be the case that students, as they are beginning their careers as researchers, are more
easily impressed by positions and affiliations and have mentioned these often. This could
potentially explain the high proportion of mentions coming from students. In addition,
senior researchers, being established in their field, are familiar with its major players
and their work, and potentially share a personal relationship with them. They may
therefore have mentioned the quality of sources in a less generic way, referring
to people and names instead of positions and affiliations. While members of the School
of Computing and members of the Information Management Group mentioned the quality
of sources of the information, members of the School of Pharmacy expressed almost no
interest in this criterion. Potentially this could be due to an inadequacy of the
collection searched by members of this school. It could be that because all, or most of, the
articles and even the journals are not well known by members of the school, they silently
Accessibility was not observed at all. This is likely to be due to the settings in which
the study took place. Obtaining any document from the collection did not involve any
sort of fee or effort (except the effort of clicking on the provided hyperlink). This could
potentially explain why there were no mentions of accessibility; however, participants
could still have referred to the potential cost of obtaining documents
cited within the documents being inspected in ways such as “...getting that might be
Availability was also not observed at all. As the collections were all crawled beforehand,
all the documents were available when requested. This, however, does not mean that there
could not have been mentions of availability. For instance, a participant could have expressed
interest in reading a document cited in the document he was examining and then mention
Utterances coded as verification accounted for 3.4% (60 occurrences) of all the coded
utterances. The code verification was used to tag utterances from participants expressing
that the information presented was, for instance, supported by other information within
the field. Mentions of personal agreement with the information, or the support of a user’s
point of view, were also coded as verification. Participants were placed in the position of
newcomers by the very search task they had to solve so confirming, or rejecting, that a
piece of information was supported by other information within the field seems unlikely
• “I’m thinking about [topic X] but this one is looking all the way through [topic Y]”
students. While senior researchers are quite interested in obtaining information that is
interpreted, refers to the extent to which the information supports the user’s point of view
or is agreed on by the user. It is a subjective criterion in the sense that it does not depend
on the accuracy or validity of the information (as reflected by the code accuracy/validity).
It may be that the level of agreement is related to the research experience level as
students do not mention this criterion as much as researchers and senior researchers do.
Perhaps the more experienced, in terms of research, a person is, the more views and beliefs
he (or she) has. Having more views, or more established views, would mean that the (dis)
agreements with the information presented, in terms of these views, are likely to happen
more often. This observation does not contradict what can be observed in Figure 4.5: the
school that least mentioned verification is the School of Computing and 60% of the members
Affectiveness was observed 141 times (7.5%). Utterances coded with the tag affectiveness
included expressions of surprise, rejection and disregard amongst others. Even though
affectiveness was analysed in isolation¹⁰, it is possible that affectiveness (to the information)
has an effect on the user’s search experience. A person constantly expressing negative
affectiveness towards the information retrieved by the system may develop a level of ani-
mosity to the point where future judgements are negative regardless of the appropriateness
of the information presented. Under such circumstances, the user may even choose to end
the search session prematurely. Some examples of the utterances coded as affectiveness
• “I like [topic]”
Members of the School of Computing seem to be, according to their expressions, rel-
atively more affective than members of any of the other schools (Figure 4.5). Moreover,
Ability to Understand
Ability to understand was mentioned 49 times (about 2.7%). The code is used to encode
utterances that refer to the extent that users express that they will be able to understand
or follow the information presented. As such, this criterion may be related to background
experience. It seems reasonable to relate them as it may be that the availability of (or lack
of) background knowledge could result in users expressing their (in)ability to understand
the information presented. Hence, it is reasonable to observe that most mentions are
produced by the less experienced people in the group, e.g. students. This might suggest
that experience does affect its occurrence but also that there might be a relationship
between the two codes. Participants mentioned their (in)ability to understand in ways
such as:
Background Experience
Background experience mentions were observed 51 times (about 2.9%). These mentions
referred to the extent to which background knowledge or experience was used during
the session to judge the relevance of the information presented. Initially it could be
suggested that this code is similar to content novelty; however, they differ as background
experience refers to uses of background experience while judging the information presented
and content novelty refers to whether the information had been encountered before. This
ability to understand can be observed. The distributions of the mentions, at first sight,
seem similar to each other. Members of the School of Pharmacy mentioned both criteria
more often than members of the other two schools. This could be a coincidence; however,
it could also be that both criteria are related (as suggested earlier). For instance, a
presented. Exactly the opposite interaction could also explain the occurrences: due to
information presented could be observed. This pattern in usage can only be confirmed (or
Content Novelty
Verification and content novelty were both observed an equal number of times; utterances
coded as content novelty accounted for 3.4% (60 occurrences) of all coded utterances.
Content novelty refers to the extent to which the information contents are novel to the
user. As can be seen in Figure 4.6, expressions of content novelty came mainly from
research students. It may be that the research experience level exhibited by participants,
as well as their background knowledge, has an effect on how novel they will find the
information presented.
Content novelty was mostly mentioned by members of the School of Computing. This
however, that the sign of the mentions is not depicted in the figures. Members of the
School of Computing could either be familiar with the contents of the collection or the
contents could be totally new. Should most content in the collection be new to the members
of this school, one should expect a high proportion of positive mentions of content novelty
and vice-versa. Relevance criteria counts alone only indicate the presence (or absence) of
a criterion being used to judge the relevance of the information presented and not whether
Background experience, ability to understand and content novelty are three criteria
that, according to the figure, have all been mentioned primarily by students. Even though
whether the mentions are either positive or negative is not displayed on the graph, it
is sensible to expect students to mention these criteria more often than researchers and
senior researchers. Since students are supposed to have a limited experience in both their
are negative in the sense that students judge the information irrelevant as they lack the
students could be explained by students being less experienced than researchers and senior
researchers (the same argument applies to researchers). Students being less experienced
could result in higher, and negative, mentions of ability to understand (students express
their inability to understand) and also in higher, though positive, mentions of content
novelty (being less experienced could also mean less familiar with research topics outside
their own).
Source Novelty
Source novelty refers to whether the source of the information is novel to the user. This
criterion has not been observed during this study. There are at least two potential
explanations for this. On the one hand, as users entered a new field of research it may have been
that they did not know any of the sources of the information (the authors, the journals,
etc.) and they did not express this. When everything is new, perhaps one cannot
say this all the time or even at all. On the other hand, it may be that they knew all the
sources of the information, but that is very unlikely. There is also the situation that some
utterances, for instance “I don’t know who he is”, were coded as quality of sources only
when they could have also been encoded as source novelty. This is due to the subjective
nature of the encoding and analysis: while one researcher could have encoded these
utterances as source novelty, another researcher could have encoded them as quality
of sources.
Document Novelty
Utterances regarding the novelty of the documents were observed 213 times (10.1%).
According to almost all authors, e.g. (Weeber et al. 2001, Gordon et al. 2002, Pratt
& Yetisgen-Yildiz 2003) novelty is an important factor to consider. Systems are actually
tailored to prefer novel information, according to the definition of novelty embedded in the
system, and rank them higher. Its importance is deemed so to the point where non-novel
As Gordon et al. (2002), amongst others, mention, novelty is highly subjective. Novelty
is also dependent on the context in which it is observed. At least three scopes of novelty
can be suggested: community, personal and task. Novelty at the community scope means
that documents are novel to a whole community of practice. Novel, in this context, means
that the document has not been discussed nor cited in any of the documents produced
by the community. This type of novelty may be observed, for instance, when a member
a document that might benefit his work and, in turn, the community itself. Validating
the novelty in this scope, however, might not be feasible in practice, and can only be
partial, as it would involve examining all publications produced by the community and
their references.
When documents are already known to a community of practice, they might still
happen to be novel to an individual. In situations like this, the novelty is purely personal.
While doing a literature search for his (or her) Ph.D., a research student might come
across a document he had not come across before. From his point of view, this is a novel
document. It could well be, however, that the document in question is widely known and
Novelty in the scope of a task refers to documents which are novel in the context of
the task being solved but not necessarily in either the personal scope or the scope of a
the task scope. This could result in the searcher gaining new insight, due to the influence
of the context, on the information contained in the document. Novelty in this scope is
heavily related to personal novelty but is not necessarily tied to novelty in the scope of a
It is very hard to determine the novelty scope from the mentions of document novelty
observed in this study. However, the polarity of such mentions can be determined as the
criterion was often mentioned in either a positive or negative way. Intuitively, within
this context of knowledge discovery, one would expect a correlation between positive
mentions of novelty and a positive influence on relevance judgements; however, this is
not entirely true. Some participants actually interpreted being presented with the same
document over and over (“negative” novelty) as a sign of relevance while others took that
pattern like “I’ve seen this before, therefore I’m not interested in it”. Examples of such
Initially this is the most expected behaviour in these settings: already known infor-
mation (or documents) has little to offer. However, a reversed pattern was also observed.
Some participants followed a pattern like “[because] I’ve seen this before, I will take it”.
reinforcing positive sign, regarding the relevance of the documents. Documents being re-
retrieved for different intermediate topics were judged to be relevant as they kept “cropping
topics, what made the re-occurrence of known documents to be judged as a positive sign
and not a negative sign. If this was the case, it could then be derived that the novelty
4.8 Summary
We began this chapter with an account of the collections used during the study, the
characteristics of the user groups and our interpretation of the encoding used to label the
was aligned to the original interpretation as described in (Barry & Schamber 1998). When
the segmented transcriptions were encoded, we observed a total of 1726 relevance criteria
was re-encoded by an independent researcher and the overlap between assignments was
found to be 87%, i.e. out of the 300 utterances, 261 shared at least one label with our
original encoding. This suggests that, although there were differences, the interpretation
The types of relevance criteria observed were analysed at three different levels:
Initially, profiles were described both quantitatively and qualitatively and the top two
relevance criteria were reported. These were consistent across groupings: tangibility and
Relevance criteria observations were normalised at the criterion level. This normalisa-
tion step allows the observation of how each criterion had been mentioned across group-
ings. These new slices of the data were visualised to aid their analysis. For each criterion,
the occurrences at the three levels were analysed and examples of these occurrences were
referred to exemplary documents (Blair & Kimbrough 2002). We argued that this was a
reasonable observation since this type of document could well be used by newcomers as
an entry point to a research field of potential interest. Additionally, it was suggested that
this preference might indeed relate to the theory of psychological relevance (Harter 1992)
to be high. It must be noted, however, that this type of document would only be of use
An unexpected introduction of potential bias was detected while discussing the crite-
rion currency. It was noted that as the research experience level increased, the mentions
of this criterion would increase. It was suggested that this might be an undesired effect of
how the simulated work tasks were crafted. Because all efforts strived to make these more
realistic, they were tied to the research experience level by creating different narratives
and tasks for each of the levels. It is possible that each task has an implicit association
to the concept of time, in the sense of dates, that made participants mention the criterion
currency in different proportions. This suggests that the differences in mentions might be
Mentions encoded as tangibility were further dissected. This was motivated by the
realisation that several of these were mentions of topicality while encoding the utterances.
tangibility was the most mentioned criterion; however, relabelling utterances as topicality
meant that depth/scope/specificity would become the most mentioned criterion. Two
explanations as to why this could be happening were offered. On the one hand, it might be
that there are situations where relevance can be judged immediately and independently
of topicality, e.g. if a document is too short, it does not matter whether it is on topic or
4.8. Summary 110
not; if it’s too short, it’s too short. On the other hand, it might be that topicality is a
pre-requisite so that relevance can be judged and that most participants made an implicit
Mentions of quality of sources were made mostly by research students and senior
researchers. It was further observed that students had referred to generic attributes such
sources in more specific ways such as authors’ names and the reputation of some
publishing houses. When it comes to verifying the information presented (utterances encoded as
verification) we observed that the more experienced a participant was as a researcher, the
more mentions of this criterion occurred. It was posited that as researchers progress
in their careers and gain experience, the more likely it is that (dis)agreement will arise with
certain works. This is due to the strengthening (weakening) of current and past beliefs.
was suggested as the distributions of these two criteria were indeed very similar. It seems
reasonable to suggest that if a person lacks the relevant background knowledge, their
Lastly, the criterion document novelty was analysed. The different scopes in which
novelty could occur were discussed and it was pointed out that deriving this context from
the utterances would be a very difficult, if not impossible, task. Additionally, the polarity
of the mentions was briefly discussed and examples of these and how negative mentions of
Overall, both a quantitative and qualitative description of the observed relevance cri-
teria were given and attempts to provide explanations for each of the observed frequencies
were made. Behaviour was inspected to see if any of its types, as represented by the
frequencies of mentions of a particular criterion, would relate to either the research expe-
rience levels or to the affiliations or both. It was discovered, however, that this might not
frequencies varied; however, none clearly related the frequencies to either type of grouping.
Additionally, while analysing the frequencies of the criterion currency, it was discussed
how the way the simulated work tasks were crafted might have introduced bias.
Chapter 5
Results II
In this chapter relevance criteria profiles are revisited in Section 5.1. Here these profiles,
and their similarities, are analysed to see if there is any relationship between participants’
comparative analysis of their divergences suggests that there might be three naturally
emerging clusters. This analysis is aided by the use of heatmaps and the Jensen-Shannon
Isolated mentions of relevance criteria give partial insight into the cognitive processes
that were present during the search session. However, one may wish to analyse how
these mentions interact together and how these groups of mentions are used to judge the
presented information. In Section 5.2 a technique for isolating and analysing these groups
are segmented with the objective of isolating the so-called relevance judgement processes.
These processes are defined as uses of relevance criteria as delimited by user interactions.
the number of relevance criteria included in the process and one can use this as a
2. Selection rules: one can relate the number of criteria in each relevance judgement
process to the six rules of selection presented by Wang & White (1999). These rules
can then be interpreted in the context of LBD and related to user intentions.
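The definition of a relevance judgement process as uses of relevance criteria delimited by user interactions can be illustrated with a minimal sketch. The event encoding used here (tuples tagged `"criterion"` or `"interaction"`), the interaction labels and the function names are hypothetical; the actual transcripts and segmentation procedure of the study may differ.

```python
from collections import Counter


def segment_processes(events):
    """Split a session into relevance judgement processes.

    `events` is a chronological list of (kind, label) tuples, where kind is
    "criterion" or "interaction" (an assumed encoding). A process is the run
    of criteria mentioned between two consecutive user interactions.
    """
    processes, current = [], []
    for kind, label in events:
        if kind == "interaction":
            if current:  # close the process opened since the last interaction
                processes.append(current)
            current = []
        elif kind == "criterion":
            current.append(label)
        else:
            raise ValueError(f"unknown event kind: {kind!r}")
    if current:  # criteria mentioned after the final interaction
        processes.append(current)
    return processes


def size_distribution(processes):
    """Count how many processes contain 1, 2, 3, ... criteria."""
    return Counter(len(process) for process in processes)


# Hypothetical session fragment.
session = [
    ("interaction", "open document"),
    ("criterion", "topicality"),
    ("criterion", "tangibility"),
    ("interaction", "next result"),
    ("criterion", "document novelty"),
    ("interaction", "next result"),
    ("interaction", "new query"),
]
processes = segment_processes(session)
sizes = size_distribution(processes)
```

The number of criteria in each process is the quantity that can then be related to the selection rules of Wang & White (1999); for example, a process with a single criterion is a candidate for the single criterion decision rule.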
In addition to our analysis of the relevance criteria, judgement processes, and selection
rules, the interactions with the system in which the participants engaged are described
and analysed in Section 5.3. The description and analysis are broken down into three
subsections, each of which corresponds to one of the three main activities in a closed model search
in LBD:
After analysing both relevance criteria profiles and user interactions in isolation, both
of these ideas are analysed in conjunction. Relevance criteria profiles, complemented with
user interactions, provide a fuller picture than either alone. Carrying out such analysis,
however, can be difficult and time-consuming. Having a bird’s eye view of the search
session that includes both these notions (the relevance criteria and the user interactions)
would help detect, for instance, the segments of the session which might be of particular
interest to us. Unfortunately, there are currently no available plotting techniques that
combine these two entities into a single picture¹. In Section 5.4 a custom visual
representation of a search session is included. This visualisation includes the relevance judgement
processes, the user interactions and the search session evolution. This visual representa-
tion provides a holistic view of the search session as it progressed and aids in the analysis
of the complexity and interactions between the relevance criteria observed and the user
Individual and aggregated relevance criteria profiles provide a global view of the most
commonly mentioned criteria for that particular session or group of sessions. Aggregated
profiles, however, provide a view for arbitrary groups of profiles, e.g. profiles grouped
profiles by measuring their similarities to each other. Since relevance criteria profiles can
The JS-divergence was chosen to measure the similarity between relevance criteria
profiles. The profiles were first normalised and then the JS-divergence value was calculated
for every possible pair of profiles. This is depicted as heat maps in Figure 5.1.
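The computation just described (normalise each profile, then compute the JS-divergence for every pair) can be sketched as follows. This is an illustration under stated assumptions rather than the code used in the study: profiles are assumed to be lists of mention counts over the same ordered set of criteria, the natural logarithm is assumed (the text does not state the base), and the function names are hypothetical.

```python
import math


def normalise(counts):
    """Turn raw mention counts into a probability distribution."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("a profile needs at least one mention")
    return [count / total for count in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence; terms with p_i = 0 contribute nothing."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def js_divergence(p_counts, q_counts):
    """Jensen-Shannon divergence between two relevance criteria profiles."""
    if len(p_counts) != len(q_counts):
        raise ValueError("profiles must cover the same criteria")
    p, q = normalise(p_counts), normalise(q_counts)
    m = [(a + b) / 2 for a, b in zip(p, q)]  # mixture distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def divergence_matrix(profiles):
    """Symmetric matrix whose cell (i, j) holds the JS-divergence of profiles i and j."""
    n = len(profiles)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = js_divergence(profiles[i], profiles[j])
    return matrix


# Three toy profiles over three criteria (made-up counts).
toy_profiles = [[3, 1, 0], [6, 2, 0], [0, 0, 4]]
toy_matrix = divergence_matrix(toy_profiles)
```

With the natural logarithm the measure is bounded between 0 (identical distributions) and ln 2 (distributions with no criterion in common); a different base only rescales the values and leaves the pattern of the heat map unchanged.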
In each heat map, the value in cell (i, j) corresponds to the JS-divergence value between
the profiles of participants i and j. The matrices are symmetric as the JS-divergence is
a symmetric measure, i.e. the value in cell (i, j) is equal to that in cell (j, i). Rows and
columns are ordered by the date on which the participant took part in the study. This leads to
the participants being ordered by school. Index values from 1 to 10 represent the School
School of Pharmacy.
In the figure we find four heat maps. The matrices in each map are all equal and the
only difference between maps is the number of colours used in the palette for the JS-divergence
values. In all maps, the redder the colour of the cells the less divergent the two profiles
are. In Figure 5.1 (a) only two colours have been used. In this map we can observe that
the profile in row/column 6 has a high divergence with almost all the other profiles in the
map. The divergence between the profile and most others (with the exception of two) is
above 0.4². This suggests that the participant represented by the profile in row 6 is an
outlier. In Figure 5.1 (b) three colours have been used in the palette and we begin to better
JS-divergence values closer to 0 mean that the two distributions compared are less divergent and
5.1. Profile Similarities 115
[Heat map panels (a), (b), (c) and (d); colour scale from 0.0 to 0.7; participant index on both axes.]
Figure 5.1: JS Divergence scores between participants. Each cell represents a divergence
score between two participants (rows/columns represent participants). It should be noted
that the closer to 0 the score, the more similar the two profiles are. This is in line with
traditional heatmap plots where red is used for higher activity.
observe the divergences between profiles. Profiles in rows 11 and 18 diverge mostly from
profiles of members of the School of Computing (rows/columns 1-10). In the third heat
map, (c) in the figure, four colours have been used in the palette. The divergences become
more noticeable and we can observe that the profile in row 18 is actually divergent with
most other profiles (with the exception of some profiles of members of the Information
Management Group). We can also observe that the profiles of the participants from the
School of Computing exhibit a level of convergence and this shows as a red block on the
bottom left of the map. Moreover, their profiles do not diverge with most of the profiles
of members of the Information Management Group (with the exception of profiles in rows
11 and 18 which seem to diverge with almost all profiles). In the last heat map, Figure
5.1 (d), five colours were used. In this figure the divergences become even clearer. The
profile in row 18 diverges from practically every other profile except two. One of these
two profiles is that in row 11 which also seems to diverge with most other profiles. In the
figure we can also observe that the profiles of the participants of the School of Computing
remain convergent and that they diverge more with the profiles of the members of the
School of Pharmacy than with those of the Information Management Group. The profile
in row 17 seems to be very similar to almost every other profile with the exception of two:
profiles in rows 18 and 4. There seems to be a group of profiles that are convergent with
almost every other profile. These profiles are those in rows 1,2,3,7 (members of the School
these profiles are convergent with most other profiles could be that the participants
represented by these profiles follow a globally shared behaviour in using relevance criteria
Analysing the JS-divergence reveals three emerging clusters; however, these do not
correspond to any of the groups analysed in the previous chapter: none of the clusters wholly
corresponds to either the school groups or the research experience groups. The three clusters
found through the analysis of the heat maps are: i) that of the potential outliers –profiles
in rows 6, 11 and 18– ii) that of the potential representatives of the whole group –profiles
in rows 1, 2, 3, 7, 12 and 17– and iii) the rest. The profile in row 6 is divergent with
practically every other profile. This suggests that the participant, in terms of his/her
associated relevance criteria profile, may be an outlier. The participant’s profile is only
close to two other profiles and they may also be considered outliers. Effectively, the profiles
in rows 11 and 18 do not seem to be convergent with almost any other profile. An outlier
is a sample that is numerically distant from the rest of the data. As such, outliers may be
an indication of measurement errors. It has been suggested that the profiles in rows 6, 11
and 18 may be outliers; therefore, checking whether there have been errors in measuring
5.2. Relevance Judgement Processes 117
during the search sessions of the participants represented by these profiles can only be
done by closely inspecting these and analysing their evolution. However, outliers can also
indicate the areas in which a theory might not be valid. It may be the case that the
participants in question behave in a way that does not correspond with that of the rest of
the group. This should also be checked by analysing their search sessions in more detail.
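The visual identification of potential outliers could be complemented by a simple programmatic check on the divergence matrix. The sketch below flags profiles that are close to only a few others; the function name, the threshold and the tolerance are illustrative assumptions, not values prescribed by this analysis, and indices are zero-based whereas the heat maps number rows from 1.

```python
def flag_divergent(matrix, threshold=0.4, max_close=2):
    """Return indices of profiles that are close to at most `max_close` others.

    Two profiles count as close when their divergence does not exceed
    `threshold`. Both parameters are illustrative and should be tuned to the
    data at hand.
    """
    flagged = []
    for i, row in enumerate(matrix):
        close = sum(
            1 for j, value in enumerate(row) if j != i and value <= threshold
        )
        if close <= max_close:
            flagged.append(i)
    return flagged


# Toy symmetric divergence matrix (made-up values): profile 3 is far from all others.
toy_divergences = [
    [0.00, 0.10, 0.20, 0.60],
    [0.10, 0.00, 0.15, 0.50],
    [0.20, 0.15, 0.00, 0.55],
    [0.60, 0.50, 0.55, 0.00],
]
candidates = flag_divergent(toy_divergences, threshold=0.4, max_close=0)
```

A check of this kind only nominates candidates; as argued above, whether a flagged profile reflects a measurement error or genuinely different behaviour can only be established by inspecting the corresponding search session.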
Some initial suggestions can be proposed from this simple divergence analysis. Firstly,
visualising the JS-divergence in this fashion helped detect that some search sessions may
either be anomalous or at least different enough, in terms of the relevance criteria mentions,
so that they merit a closer inspection. Secondly, the profiles of the members of the School of
Computing seem to exhibit a certain level of convergence; however, this
convergence could be due to the way the profiles were ordered in the matrix. For instance,
swapping places between participants 11 and 12 would have revealed a bigger reddish
Management Group. Before suggesting that members of any one predefined group behave
in a certain way, or use relevance criteria following a shared pattern, a deeper inspection
profiles that are convergent with almost every other profile. This group of profiles could
be signalling that the participants, whom these profiles represent, may be using a group
Analysis of the JS-divergence between relevance criteria profiles is only useful to indicate
which participants (or groups of participants) may be behaving in a particular way. As such, this
analysis may only be useful to detect these individuals and so further inspect their search
In the study conducted by Wang & White (1999), participants were asked to select, from
the results of searches conducted by librarians, which documentation they would use for
their projects. One of the observations resulting from the analysis of the selection process is
that users applied a set of decision rules when selecting this documentation. The selection
1. Single criterion decision: if the user detects a single salient unwanted aspect in the
2. Multiple criteria decision: if users cannot reach a judgement after applying the single
3. Dominance rule: users select documents such that they excel in at least one criterion
and are no worse in any of the other criteria, e.g. two documents which provide the
same information, however one of them is more current than the other.
4. Scarcity rule: when information is scarce, users tend to be more lax regarding the
5. Abundance rule: when users have found enough information, they tend to stop
6. Chain rule: when users have detected that they are on a chain, or vein, of information
they tend to make a collective decision on the set, e.g. because the previous
document, deemed relevant, is on this chain, a new document on the same chain is
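As an illustration of the dominance rule listed above, the comparison can be read as a pairwise dominance check: one document is at least as good as another on every criterion and strictly better on at least one. The sketch below assumes, purely for illustration, that each document carries a numeric score per criterion; participants did not assign such scores, and the criterion names and values are hypothetical.

```python
def dominates(a, b):
    """True if document `a` is no worse than `b` on every criterion
    and strictly better on at least one (scores are hypothetical)."""
    no_worse = all(a[c] >= b[c] for c in a)
    better_somewhere = any(a[c] > b[c] for c in a)
    return no_worse and better_somewhere

# Two documents providing the same information, one more current than the other.
recent = {"topicality": 3, "currency": 5}
older = {"topicality": 3, "currency": 2}

print(dominates(recent, older))
print(dominates(older, recent))
```

Under this reading the more current document dominates the older one, while a document never dominates an identical copy of itself.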
In this study, it was observed that the participants applied a subset of these rules in
varying proportions. This suggests that the rules found in the study conducted by Wang
& White (1999) are also applicable to the context of LBD, which makes them more general.
To estimate the frequency with which these rules were used, the following procedure
was applied. Firstly, search sessions were segmented to obtain the relevance judgement
processes as described in Section 3.7.5. Secondly, the length of each of these sequences was
measured, i.e. the number of utterances encoded as a relevance criterion within each sequence
was counted. Lastly, once the length of each sequence had been measured, sequences of length n
were counted per participant, i.e. for each session, how many sequences of length 1, 2, ..., n
there are.
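The counting procedure above can be sketched as follows. The sketch assumes that segmentation has already produced, for each participant, a list of relevance judgement processes, each represented as the list of criterion codes assigned to its utterances; the participant identifiers and criterion codes are hypothetical.

```python
from collections import Counter

def length_counts(processes):
    """Count how many processes of each length a session contains.
    The length is the number of criterion mentions, repeats included."""
    return Counter(len(process) for process in processes)

def counts_per_participant(sessions):
    """Map each participant to the length counts of their session."""
    return {pid: length_counts(procs) for pid, procs in sessions.items()}

# Hypothetical segmented sessions (not thesis data).
sessions = {
    "P1": [["novelty"], ["currency", "depth"], ["novelty"]],
    "P2": [["tangibility", "depth", "depth"]],
}

counts = counts_per_participant(sessions)
print(counts["P1"])
print(counts["P2"])
```

A repeated criterion counts towards the length each time it is mentioned, so the third process of the example has length 3 even though it involves only two distinct criteria.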
used in it. This definition stems from the assumption that the more criteria are mentioned
within any one process, even if one criterion is mentioned many times, the more complex
the process is. It was observed that participants reported applying relevance criteria in
sequence when evaluating the presented literature, and that the more criteria were
applied, the more time-consuming and difficult the judgement was. Hesitations, together
with mentions of criteria and backtracking, were an indication of this type of behaviour.
Although this definition rests on a limited number of observations, Wang & White (1999)
found that “participants often apply a salient criterion to reject a document. Participants
tend not to scan all aspects of a document in decision-making” suggesting that the com-
plexity of a decision may be related to the number of relevance criteria used to reach said
and hence map to the elimination rule of (Wang & White 1999). In this case this
and hence map directly to the multiple criteria rule of (Wang & White 1999)
A total of 589 relevance judgement processes (of any complexity) were counted. Out
of these, 215 (36.5%) are of complexity 1 and 374 (63.5%) of complexity 2 and larger. The
bars in Figure 5.2 denote the total number of participants (y axis) that used a sequence of
n criteria (x axis) to assess the relevance of the information presented at least once. In the
figure we can see that all participants applied, at least once, a sequence of one criterion to
judge the information presented. This corresponds to the rule of single criterion decision
described by Wang & White (1999). We can also see that most participants (at least 14
participants) used up to 6 criteria in any one relevance judgement process. More complex
In Figure 5.3 we see the average use (y axis) of relevance judgement processes of
complexity n (x axis). In the figure we see that sequences of complexity 1 (single criterion
rule) were used, on average, about 10.6 times per session. The more complex the relevance
judgement process becomes, the less it is used. As can be seen in Table 5.1 (and also
in Figure 5.3), the average number of uses decreases as the complexity of the process
increases (with a few exceptions) and is always lower than the average use of relevance
processes of complexity 2 and above, on average, they were used 3.5 times per session.
This suggests that participants wanted to quickly judge the information and keep the
information flow dynamic. Quickly assessing the information presented, possibly with aims
of a quick dismissal, means that they could spend more time assessing more information
and obtain a broader coverage of the information space. This behaviour may have been
encouraged by the settings in which the study took place: search sessions had a time limit
of 1 hour.
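The two aggregates reported above (the number of participants that used a process of complexity n at least once, and the average use of such processes among the participants that used them) can be derived from per-participant counts along the following lines. The counts below are hypothetical, and the use of the population standard deviation is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, pstdev

def participants_using(counts, n):
    """Number of participants with at least one process of complexity n."""
    return sum(1 for c in counts.values() if c.get(n, 0) > 0)

def average_use(counts, n):
    """Mean and population standard deviation of uses of complexity n,
    averaged only across participants that used it at least once."""
    uses = [c[n] for c in counts.values() if c.get(n, 0) > 0]
    if not uses:
        return None
    return mean(uses), pstdev(uses)

# Hypothetical counts: participant -> {complexity: number of processes}.
counts = {
    "P1": {1: 12, 2: 4},
    "P2": {1: 8, 3: 2},
    "P3": {1: 10, 2: 2},
}

for n in (1, 2, 3, 4):
    print(n, participants_using(counts, n), average_use(counts, n))
```

Averaging only across the participants that used a given complexity follows the wording of the caption of Table 5.1; averaging across all participants instead would give lower values for the rarer, more complex processes.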
According to Wang & White (1999), the single criterion decision rule is mostly
applied to quickly dismiss information based on salient unwanted
features. During this study, two types of use of the single criterion rule were observed:
• Filter out: in concordance with the original description of the rule, participants
• Eager acceptance: contrary to the original mention of the rule, participants detected
a salient feature that made them consider the presented information relevant
The frequency with which these two uses were observed can be estimated as follows.
Firstly, for each use of the single criterion rule the criterion used in it was counted. The
polarity of the expression was also taken into account. A positive mention of currency, for
Figure 5.2: Total number of participants that used, at least once, a relevance judgement
process of complexity n. [Bar chart. x axis: complexity n (1 to 16); y axis: number of participants.]
Figure 5.3: Average use of relevance judgement processes of length n. Bars represent a
standard deviation. [Bar chart. x axis: complexity n (0 to 16); y axis: average number of applications.]
Table 5.1: Average use (averaged across participants that expressed using them) of rele-
vance judgement processes of complexity n
instance, was considered different from a negative expression of the same criterion. This
expresses our assumption that negative mentions of criteria correspond to uses of the rule
to filter out irrelevant information and that positive mentions correspond to uses of the
rule to eagerly accept the relevance of the presented information. Finally, the counts were
The assessment of the polarity of any one utterance was done by analysing the type
of words used in the utterance itself. In the few cases where the language itself was not
enough to determine the polarity, the tone of the voice of the participant and the preced-
ing utterances were taken into account. Consider the following example. A participant
mentions that “...[the document] is too old...”. This utterance is classified as currency and
its polarity deemed negative. The negative polarity is inferred from the use of “too old”
in the utterance. This expression suggests that the participant deemed the information to
not fulfil a specific criterion: that the information is current or up to date. Currency is
used as a criterion, but in a negative fashion, and will probably push the final
relevance judgement of the information presented towards the negative. Consider now the polarity
of utterances such as “... it’s from 2006...”. The language used in the utterance indicates
that the person is referring to the date of publication and the potential currency of the
information; however, it does not offer any indication regarding the polarity of the expression.
In cases like this, one can resort to the audio recordings to assess the tone of the
person’s voice and also consider the influence of the preceding utterances. Consider these
two potential scenarios in which the polarity of the utterance is to be inferred. Both begin
in the same fashion: as soon as the user is presented with the document, the relevance
judgement process begins and the criteria mentions start. Suppose that the first
e.g. “... it’s only 2 pages long...”. As expressed, the user is already starting to lean towards
a negative judgement. Should the next utterance be “...[and] it’s from 2006...”,
then its polarity would be deemed negative. This stems from the use of the word “and” to
connect the two mentions of criteria, suggesting that they share the same polarity. On the
contrary, should the next utterance be “...[but] it’s from 2006...”, then its polarity would
be deemed positive. In this case, the polarity is deemed positive, instead
of negative as in the first example, because the expression is introduced by
the word “but”, which signals a polarity opposite to that of the first utterance (negative).
The preceding utterance and its polarity are used as a reference point against which the
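The reasoning in the examples above can be summarised in a small sketch. It is not the coding procedure used in this study, which relied on human judgement and, where needed, on the audio recordings; it only illustrates how the wording of an utterance, the connective that introduces it and the polarity of the preceding utterance interact. The cue phrases are hypothetical and taken from the examples in the text.

```python
import re

# Hypothetical cue phrases, drawn from the examples discussed in the text.
NEGATIVE_CUES = ("too old", "only")
POSITIVE_CUES = ("up to date",)
FLIP = {"positive": "negative", "negative": "positive"}

def infer_polarity(utterance, previous=None):
    """Return 'positive', 'negative' or None (undetermined from the text).

    Wording is checked first; otherwise a leading 'and' keeps the polarity
    of the preceding utterance and a leading 'but' reverses it."""
    tokens = re.findall(r"[a-z0-9']+", utterance.lower())
    text = " " + " ".join(tokens) + " "
    if any(f" {cue} " in text for cue in NEGATIVE_CUES):
        return "negative"
    if any(f" {cue} " in text for cue in POSITIVE_CUES):
        return "positive"
    if previous in FLIP and tokens:
        if tokens[0] == "and":
            return previous
        if tokens[0] == "but":
            return FLIP[previous]
    return None  # would require the audio recording or further context

first = infer_polarity("it's only 2 pages long...")
print(first)
print(infer_polarity("...[and] it's from 2006...", previous=first))
print(infer_polarity("...[but] it's from 2006...", previous=first))
print(infer_polarity("it's from 2006..."))
```

The last call returns no polarity, which corresponds to the cases in which the language alone was not enough and the tone of voice had to be consulted.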
The counts shown in Table 5.2 indicate that criteria mentions are almost evenly distributed
across polarity: out of a total of 215 criteria mentions, 114 are positive
and 101 are negative. All criteria were used –either positively, nega-
Because the verbal data gathered from participants did not always correspond to actual
relevance judgements of documents, a portion of the uses of the single criterion rule
was observed in a different context. Positive uses of this rule served mostly to assess
the potential relevance of the information. That is, participants reported using a
criterion in a positive fashion to decide whether the information could be relevant. The
relevance of the information would then be decided, possibly by using more than one relevance
criterion, once it had been assessed more thoroughly. Negative uses, on the other
Table 5.2: Frequency of each criterion as distributed across single criterion rule uses.
hand, were always used to immediately dismiss the information and hence corresponded
Filtering out irrelevant information was mostly done on the grounds that the docu-
ments were not novel, e.g. a document had been re-retrieved. Participants mentioned
document novelty in a negative fashion 41 times (about 19%) when using a single criterion
“old ants! ah the little buddies ... that’s the document I already have now so
“... yes, I’ve seen it before ... oh, not again, no, still not want to see that,
hmm ...”
The second most used criterion for filtering out irrelevant information was
“... I’m gonna put it back because it’s very brief and a bit journalistic ...”
Positive relevance judgements made using the single criterion rule had tangibility as the
most used criterion. This may suggest that a group of participants found hard data a good
indicator of the relevance of the information and when this criterion was met they were
quick to accept the information as relevant. However, one must remember that mentions
“... oh yes it’s about simulations, interactive kind of thing, I’ll write that one
of irrelevance. This observation, coupled with the fact that document novelty is the second most
used criterion in positive relevance judgements, suggests that the correlation between relevance
judgements and the polarity of document novelty may be high. Document novelty
“... again that one has already been identified as high up which is really em-
“... I think maybe I’ve seen before, again it gives me a lot of theoretical un-
derpinning it has a lot of really nice, well not really nice, mathematical stuff
anyway and yeah I think that’s probably the one I would take ...”
Using the segmentation and counting process described we can estimate the frequencies
with which participants applied the single and multiple criteria rules. Estimating the
frequencies of the other rules is a much more subjective task, hence no quantitative data
The dominance rule was mentioned by participants. The use of this rule suggests that
participants assessed the relevance of documents not only in relation to the topics being
“...that’s the kind of paper that I’m looking for, it’s probably the most appro-
“... this must be one of the best ones I’ve found so far ... ”
“... I’ll put it as relevant but it’s not as relevant as the others ...”
This mention suggests that the participant deemed a document relevant, though when
compared to the previously assessed documentation, it was not “as relevant”. This could
be due to two potential reasons: a reversed dominance rule, which means that the document
is considered relevant despite being worse (in some aspects) than the previously
found documents, or an application of the scarcity rule. It could have
been the case that the participant had found a few other documents before the one assessed,
but that the number of documents found was not enough. If the scarcity rule was
applied and, even though the document might have been “less relevant” than the others,
“... can’t help feeling that this one should be a rich vein ...”
“... that [topic] was quite a rich one so ... I got quite a lot out of that one ...”
and one mention was even coupled with a suggestion for a desired feature of the system:
“... this is something that I want, I’m not going to read it because the title
says it all ... now what I really desperately want is a little box at the bottom
That the participant requested a feature that retrieved “more like this” suggests that
the participant suspected that the document might have been the first example of a set of
documents in the same vein of information and that they might have all been interesting.
One could, therefore, consider this expression as a use of the chain rule.
In addition to the six rules presented by Wang & White (1999), participants of this
1. The reoccurrence rule: participants selected, for further inspection, reoccurring doc-
when a document appeared high on the ranked list on both the left and the right
These two rules were applied to assess the potential relevance of the documents (defined
as weak relevance by Harter (1992)). This means that participants used these rules to
decide whether they would click on a link to obtain the full documents and assess them
further. The reoccurrence rule refers to documents being re-retrieved during the session.
As such, it seems that the number of times the document would reoccur was interpreted
as a signal of relevance:
“... it’s been presented to me for every single damn search query I input in
“... again this one, this one is cropping up everywhere this document, it’s about
chips ...“... simulation model” okay that’s definitely something that I will have
a look ...”
“... in this case the Schneider thing has come up again and again, we’ll have
a look at it just to see ... yeah, well okay we’ll pick that just because it’s
interesting to me ... just because it has popped under my nose enough ...”
“... it’s [the document] popped up a few times so I’ll take it ...”
This rule was also affected, or so it seems, by the intermediate topic being evaluated:
“... yeah I’ve seen this one before, I think there is some overlap between the
topics which I guess it’s a good thing ... I’ve seen that before as well ... already
However, in this case, it may have been that the document was relevant regardless
of which intermediate topic had retrieved it (the relevance of the document could be
considered invariant to a certain extent) or the document was relevant because a new
interpretation had been derived due to the context provided by the intermediate topic.
Unless participants expressed it, one cannot assess whether it was one case or the other.
The reoccurrence rule seems to contradict special cases of the single criterion rule. It was
observed that the most used criterion, in single criterion negative relevance judgements,
was document novelty. So how could reoccurring documents be selected for further inspection
when a high proportion of documents was automatically dismissed on the grounds that
they were not novel? Based on observations, it is suggested that the reoccurrence rule
depends not only on documents reoccurring but also on the frequency of the reoccurrence.
A document reoccurring for the first time may initially be dismissed automatically on the
grounds of “having seen it before”; however, should the same document reoccur for the
nth time, it may be selected for further inspection because it had “cropped
and ranked highly, on several of the intermediate topics inspected by participants. That
the different intermediate topics, seems to have prompted participants to select them for
further inspection.
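The suggested dependence of the reoccurrence rule on frequency can be sketched as a simple counter over the result lists seen during a session. The threshold `min_occurrences` is a hypothetical parameter: the observations do not indicate after how many reoccurrences participants changed their minds, and this number probably varied between participants.

```python
from collections import Counter

def reoccurrence_decisions(result_lists, min_occurrences=3):
    """Walk through successive result lists and label each appearance.

    First appearance: the document is assessed as usual.
    Early reoccurrences: dismissed as already seen (not novel).
    From `min_occurrences` appearances (an assumed threshold) onwards:
    selected for further inspection."""
    seen = Counter()
    decisions = []
    for results in result_lists:
        for doc in results:
            seen[doc] += 1
            if seen[doc] == 1:
                action = "assess"
            elif seen[doc] < min_occurrences:
                action = "dismiss (seen before)"
            else:
                action = "select for further inspection"
            decisions.append((doc, seen[doc], action))
    return decisions

# Hypothetical result lists for three successive intermediate topics.
lists = [["d1", "d2"], ["d1", "d3"], ["d1"]]
for decision in reoccurrence_decisions(lists):
    print(decision)
```

In this illustration the same document is first assessed, then dismissed as already seen, and finally selected once it has appeared three times, which mirrors the apparent contradiction with the single criterion rule discussed above.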
The concordance rule refers to documents that appear ranked highly on both the right
and the left panel. Participants inspected the top ranked documents on both the left and
“... So, again I’ve looked at the top four of the results I’ve been produced and
“.. again the top one on both has matched so again I’m thinking that that’s
“... and again here at the top, the top ... here I’m thinking that the top two
And some participants were even puzzled when they did not see this concordance:
“... So, the first time that the two documents listing haven’t agreed at all so
I’m thinking that I will really have to think about how I’m gonna tackle this
topic ...”
This interpretation of the search results as presented on both panels is plausible. Documents
on the left panel are supposed to discuss the relationship between the participant’s
research area and the intermediate topic, while the documents on the right panel
should discuss the relationship between the intermediate topic and the target topic.
The concordance rule seems reasonable as it suggests that documents highly ranked on
both panels may be likelier to explain both sides of the relationship and as such they may
Even though listed as two separate rules, participants applied the two rules combined
in a single step:
“two wildly different results come out of the search on either side, our old
friend the first article I picked has actually come out at the top again which
again makes me think that this must be some article, must be really really good,
and yeah I’m kind of thinking that I’m always getting that article and that I
should really just take the hint and go and read that particular one ...”
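The concordance rule described above amounts to checking which documents appear near the top of both ranked lists. A minimal sketch follows; the cut-off `k` is a hypothetical parameter (participants mentioned looking at the top one, two or four results), and the document identifiers are invented for illustration.

```python
def concordant_documents(left, right, k=4):
    """Documents ranked within the top `k` of both panels,
    in the order in which they appear on the left panel."""
    top_right = set(right[:k])
    return [doc for doc in left[:k] if doc in top_right]

# Hypothetical ranked lists for the left and right panels.
left_panel = ["d7", "d2", "d9", "d4", "d1"]
right_panel = ["d2", "d7", "d5", "d8", "d9"]

print(concordant_documents(left_panel, right_panel))
print(concordant_documents(left_panel, ["d5", "d8"]))
```

An empty result corresponds to the situation that puzzled some participants, in which the two document listings did not agree at all.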
5.3. Interactions Revisited 130
Participants interacted with the system in different ways; however, patterns were observed.
These patterns depended on two factors: the information presented by the system at any
one stage and how this presentation was done, i.e. the user interface. A typical search
At the beginning of the search session, participants were presented with their research
topic and a list of ten possibly related topics (target topics). Participants were asked
to investigate, one at a time, three out of these ten target topics. Initially, participants
went over the list of the topics, starting with their own, and tried to assimilate them.
Participants usually began by trying to assimilate their own area of research first:
“Ok, right, so, I’m looking at the starting topic first ... evolutionary ehw means
nothing to me, heuristics, p2p peer yeah ok let’s see where that came from ...
computation ... very general ... genetic constraints ... genocop was a particular
kind of optimisation software ants and food, so these are my starting topics and
they may well relate to some grant sort of thing that I was doing...”
“Ok, so let’s see what keywords I’ve got to start with ... induction maybe,
expert yes, training possibly, CBR retrieval definitely ... belief evidence ...
“... the ones immediately jump off the page are off the screen are archives
cultural heritage probably human if we take human in the broadest sense, user
Participants may have deemed this step necessary due to how the initial topics were
presented. Each topic was represented as a bag of words. Because these bags of words had
no structure, participants had to interpret what the represented topic might have been.
After inspecting their initial topics, participants set out to decide which related topic they
would investigate further. The selection of topics, at this stage, was based on two main
1. Whether the topics “jumped out”, i.e. they stood out either by being “obvious
choices” or by being strange enough combinations of words to arouse the
participant’s curiosity.
2. Whether relationships between their area of research and the presented topics could
Participants initially scanned the list of related topics in search of something
salient enough to make a particular topic stand out from the list. As such,
they tried to assimilate the topics and make sense of the bags of words they had been
presented with. This is an area in which the system could be improved. A more intuitive
representation of the topics may make the selection process easier by leaving more
cognitive energy for finding connections instead of interpreting bags of words.
Once participants had a topic in mind, they started forming initial potential explanations
“hmm related to this maybe, maybe possible applications for ... [...] ... inte-
grations transaction, what the heck is that? ... [...] ... oh heck! Difficult to
find any obvious meaning from the keywords for the topics, my guess health
care seems a very clear one so let’s start with health care ... ”
“... possible related topics ... yes, ok, I’m looking down these to see if anything
tion of my research ... ok, I’m going to start with “mathematical computation
would that I would be able to discovery something about that ... so I’ll select
“the possible related topics, hmm ... I supposed if I’m looking at it from a
colleague has told me, then “retrieval classification” and “evaluation”, “evalu-
ation” particularly ... I like the next one which is “indigenous” I don’t like the
“Africa” but the “indigenous” bit is something that I quite like and that would
actually tie in with the Australian people the aboriginal ones ... [...] ... but
When participants clicked on a topic, they were presented with a second screen in which
they were given the opportunity to investigate the intermediate topics that completed the
potential relationships between their area of research and the selected topic. Participants
entered this stage, usually, with a preconception of the type of relationship they were after.
This may have affected how they evaluated the documents as these preconceptions may
“ah it’s interesting what it decided to come up with “computational java infras-
tructure” ... I would’ve thought it would’ve been about just general optimisation
“I’m thinking that my initial thing is that “teachers ...” has come up which
I’m finding a bit strange, I’m finding also a bit strange that some of the other
ones that have come up are quite application-based maybe that’s the “applied”
coming out; our old friend from number one task is back ... so I’m thinking
“... it’s floating up for a few things and I’ll take it although it’s not directly
relevant to the current ... it’s not what I’d expect to find in this current thing
“... I think this is interesting that this document would be related to this topic,
Almost all participants devoted an equal amount of time to the inspection of each
related topic, i.e. they spent about one third of the allocated time on each topic. This
may be because, before the session began, participants were offered a prompt every 15
minutes; almost all participants accepted the offer, with the exception of one
participant who asked to be prompted every 20 minutes. Only one participant spent
almost the entire session inspecting a single related topic. When prompted at minute 45,
the participant realised that there was not much time left in the session and decided to
end it.
The second screen offered three different panels. The top panel was vertically split into
two panels. The top left panel contained the representation, as a bag of words, of the
participant’s research area. The top right panel contained the bag of words representing
the selected related topic. These two topics were static in the sense that participants
could not interact with them by either modifying them, selecting a new topic, etc. In the
middle panel, a list of intermediate topics was offered. These intermediate topics were also
represented as bags of words. In the bottom panel, also split vertically into two panels
–left and right–, the supporting literatures were displayed. Initially, as participants had
not selected any intermediate topic, each panel at the bottom listed the retrieved literature
for the participant’s research area and the selected related topic, respectively. The bottom left panel
contained the literature retrieved for the participant’s research area and the bottom right
panel the literature retrieved for the related topic. At this stage two common behaviours
• Participants directed their attention directly to the middle panel and the interme-
diate topics.
Figure 5.4: Navigation screen. The top panel (a) displays both the initial topic (listed as
“Your topic”) and the potentially related topic (listed as “Related topic”). The middle
panel (b) lists the intermediate topics. The bottom panel (c) contains the supporting
When participants started analysing the literatures first, they found that most of the
documents listed on the left panel had already been seen. This is because
the representation of their research area, the bag of words, had been extracted from the
initial set of documents retrieved during their first session and hence was likely to
re-retrieve these documents and rank them at the top of the list. To some participants, this did
not seem to be an impediment to selecting these documents again. Perhaps the new
context, as provided by the related and intermediate topics, shed new light on the
“I’m intrigued by [reads] “record culture institutions” ... I’m drawn to this one
... but I think it’s one of the ones I took last week ... it is ... and that very
much interest me ... even though it’s under [topic] is giving examples of the
though it’s of the USA there are lots of examples that are probably transferable
... so I’d be interested in that ... and that’s from the left ...”
The purpose of the participant’s research topic and selected related topic was to
provide a context in which the intermediate topics were to be evaluated. When participants
selected the intermediate topics, they also applied the same rules as for the target topics:
“okay, let’s explore some subtopics and see if some of these relate to health
care and CBR problem solving ... I don’t think most of these do but ... face
video participants ... hmm ... let’s see this one ... I guess I should look at both
“ok so I’m now looking down the list of things that we have there ... ok, so I see
the one that says combinatorial complexity which is quite close to evolutionary
algorithms or I suppose that before I click on that ... okay I will click on that
combinatorial complexity travelling and now I’ve got a bunch of things that
In this case, the participants mentioned their use of the rule of connection making for
choosing the intermediate topic selected. In this context, however, the rule of connection
making may have been harder to apply than before as there is less freedom, i.e. while
initially participants had to speculate about what the potential relationships between their
areas of research and the suggested target topics might be, now they had to make
sense of their speculations in such a way that they included some (if any) of the intermediate
topics presented. This may explain why there were cases in which only the rule of saliency
“... this is where we get our related topics ... I see we’ve got various that have
their countries of origin because I remember that some of the papers where in
Uganda and Ghana and so on ... Nigeria gets a mention ... some mention
nurses which clearly relates to the medic professions papers, clinical trials,
again seems to be related to the medical papers ... health women ... South
Africa, Nigeria again, oncology ... right ... so if again I try and find one
that seems more general and generic as opposed to one that has a very specific
geographic or sectoral kind of focus ... I’ll try “viewed abstracts abstract” ...”
“... [reads] “routing street women” I want to look at that just because it seems
“... I’m very tempted by [reads] “chatterbot terrorism” because it’s such a weird
combination ...”
“... there are not so many intermediate topics that are jumping out to me this
time ... ”
“... I’ve kinda been through the ones here that kinda stick out, the ones that
Once an intermediate topic was selected, each panel would list the retrieved documen-
tation for:
• left panel: the participant’s area of research combined with the selected intermediate
• right panel: the selected intermediate topic combined with the target topic.
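The way the two panels are populated can be sketched as follows. The sketch assumes that topics are bags of words and that "combining" two topics means taking the union of their terms to form a query; the actual combination performed by the system is not specified in this section, and the function names and example terms are hypothetical.

```python
def combine(*bags):
    """Union of the terms of several bags of words, keeping first-seen order
    (an assumed way of combining topics into a query)."""
    terms = []
    for bag in bags:
        for term in bag:
            if term not in terms:
                terms.append(term)
    return terms

def panel_queries(research_area, target_topic, intermediate=None):
    """Query terms behind the left and right panels.

    Without an intermediate topic, each panel uses its own topic only;
    with one, each side is combined with the intermediate topic."""
    if intermediate is None:
        return {"left": list(research_area), "right": list(target_topic)}
    return {
        "left": combine(research_area, intermediate),
        "right": combine(intermediate, target_topic),
    }

# Hypothetical bags of words.
area = ["cbr", "retrieval"]
target = ["health", "care"]
middle = ["clinical", "trials"]

print(panel_queries(area, target))
print(panel_queries(area, target, middle))
```

Whatever the actual combination, the point illustrated is that the intermediate topic is shared by both queries, which is why documents appearing on both panels may speak to both sides of the relationship.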
Once the literatures had been retrieved and listed on both panels, participants proceeded
to investigate them. However, to decide when it was time to finish their current
inspection, by selecting either a new intermediate topic or a new target topic, participants
1. Satisfaction: participants were satisfied with the information gathered through the
information obtained from the selected intermediate topic was not satisfactory/enough/etc.
“... the feeling at this point is that I have a round selection, I have news pieces
which are going to give me examples of sites that I can go to and look at actual
practice and I’ve got theory and I’ve got actual practice ...”
The rule of frustration or boredom was also applied to end current searches. Some
participants decided to select a new intermediate topic when they either could not find any
relevant documentation or simply got bored by the information retrieved by the selected
intermediate topic:
“... I’m going to do what a good librarian should not do and say that I’m now
topic ...”
The purpose of the bottom panel was to present the retrieved literature for each side of
the relationship. The left panel displayed the literature for the combination of the
participant’s research topic and the selected intermediate topic (if any). The right panel
displayed the literature for the selected related topic combined with the selected interme-
topic. This was observed as the documents on the left panel were inspected more frequently
than the documentation on the right panel. Participants also expressed this verbally. The
literature on the right panel started being considered once participants had exhausted that
“... it’s interesting because I’m instinctively drawn to the left column first
rather than the right column, I don’t know why that should be, maybe because
that leads my topic field closer to me than those in these fields ...”
“... I’m scrolling through the ones on the left hand side ... [...] ... still on
the left hand side, scrolling down ... I’m gonna have a look and see how many
others are of interest in this area ... [...] ... searching though the ones on the
left for terms that broadly fit the brief that I’ve been given well, related to the
“... so I’m more drawn to the left side because it’s more related to my keywords,
which is why I’m getting all the 2nd life stuff on the left hand side ...”
Participants seemed more comfortable, initially, with staying “close” to their initial
topic (their research area). Perhaps they felt that they would be more competent at eval-
uating the literature if it was closely related to their research area. There were, however,
exceptions to this behaviour as some participants set out to explore new territories from
the beginning. One participant, for instance, expressed at the beginning of the session
“... so I’m motivated to go to the right hand side first because I feel that that’s
more relevant to the direction I’m trying to go in so I’m gonna look at more
carefully ... ’cause I don’t really care whether the left hand side is that relevant
As the sessions progressed, however, participants realised that diverging into other
research areas may be beneficial if they were to fulfil the task they had been assigned and
started evaluating the documents on the right panel more often. Documents listed on the
left panel became “ironically too close” to their research topic so they started diverging
and analysing documents on the right panel (assuming that these were “closer” to the
target topic). This was observed in two ways: i) mentions of examination of the right
panel became more frequent and ii) some participants expressed this explicitly:
“... looking at the documents on the left hand side it’s very close to ... but
in terms of the brief, future relations, it’s actually ironically too close to the
initial area and probably isn’t looking at relations with other fields ...”
“... I’m on the right hand side ... which is really I suppose where I should be
panel follows. Below is the transcription of an entire search session that depicts this
diverging behaviour. Irrelevant parts have been omitted, e.g. mentions of use of relevance
criteria, interactions that are not related to the right/left panels, etc.
“[...] ... I’m scrolling through the ones on the left hand side ... [...] ... still
on the left hand side, scrolling down ... [...] ... searching though the ones on
the left ... [...] ... I’ll look at the ones on the right hand side ... [...] ... still
looking at the right hand side ... [...] ... looking at the documents on the left
hand side it’s very close to ... [...] ... I’ll look at the documents on the right
hand side now ... [...] ... pick one of the left again ... [...] ... moving on to
the right ... [...] still haven’t looked at the ones on the left hand side ... [...] ...
looking at the left hand side ... [...] ... take one on the right hand side ... [...]
... theres another article on the right hand side ... [...] ... picking up more on
the right hand side ... [...] ... again an article from the left ... [...] ... looking
at the right hand side ... [...] ... okay pulled up one on the left ... [...] ... okay
still scrolling through the left ... [...] ... scrolling on the ones of the right ...
[...] ... and that’s from the right ... [...] ... I picked from the left hand side ...
[...] ... I’m on the right hand side ... [...] ... I’ll have a look at the next 10 on
the right hand side ... [...] ... I’m not really seeing anything on the left so I’ll
just focus on the right ... [...] ... scrolling through the right ...”
During the session, the participant verbalised a total of 23 interactions with either
panel. Out of these, 10 were interactions with the left panel and 13 were interactions
with the right panel. Overall, the proportions seem to suggest that both panels are
equally relevant in terms of user interactions; however, it is how these interactions were
distributed as the session progressed that depicts the previously mentioned behaviour.
Encoding the interactions in the transcription using the code L, for interactions with the
left panel, and the code R, for those with the right panel, results in an encoded stream
of interactions as follows: L L L R R L R L R L R R R L R L L R R L R R R.
Visualising these interactions can be done as follows. First, each interaction is considered
to occur at a point in time and in an ordered fashion. At any one point in time one of
the two types of interactions can occur: the participant reports interacting either
with the left panel or with the right panel but not both at the same time. As such,
these interactions are considered to be mutually exclusive. At each step, all previous
any point in time, the proportions sum to 1 (100%). Figure 5.5 depicts the sequence of
Figure 5.5: Proportion of interactions with the left (right) panel as the session progresses.
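The plotting technique described above can be sketched in a few lines of Python. This is a minimal illustration only: it assumes that the proportion at each step is computed over all interactions observed up to and including that step, and the function name `running_proportions` is hypothetical rather than taken from the tool developed in this dissertation. The encoded stream is the one given above.

```python
from fractions import Fraction


def running_proportions(stream):
    """Return the cumulative (left, right) proportions after each interaction.

    Assumption: the proportion at step n is computed over the first n
    interactions, so the two values always sum to 1.
    """
    left_count = 0
    proportions = []
    for step, code in enumerate(stream.split(), start=1):
        if code not in ("L", "R"):
            raise ValueError(f"unexpected code: {code!r}")
        if code == "L":
            left_count += 1
        left = Fraction(left_count, step)
        proportions.append((left, 1 - left))
    return proportions


# Encoded stream of interactions from the example session.
STREAM = "L L L R R L R L R L R R R L R L L R R L R R R"

props = running_proportions(STREAM)
for step, (left, right) in enumerate(props, start=1):
    print(f"{step:2d}  L={float(left):.2f}  R={float(right):.2f}")
```

Plotting the two columns against the step number gives two mirrored curves; the final step yields 10/23 for the left panel and 13/23 for the right panel, in line with the counts reported above.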
The two curves mirror each other. For every interaction observed the proportion of one
increases while the proportion of the other decreases, e.g. an interaction with the left
panel is accounted for at value 10 of the x axis, and the curve for the interactions with the
left panel increases accordingly while the curve for the interactions with the right panel
The proportions for the interactions with the left panel seem to follow a decreasing
trend while those of the interactions with the right panel seem to follow an increasing
trend. At the beginning of the sequence of interactions (closer to the beginning of the
search session) the proportion of interactions with the left panel is larger than the pro-
portion of interactions with the right panel. However, as the session progresses, both
proportions tend towards the centre, i.e. interactions with the left panel stabilise while
interactions with the right panel seem to become more frequent. At this point, the partici-
pant seems to be interacting in equal proportions with both panels. As the session reaches
its end, however, the proportion of interactions with the right panel follows an increas-
ing trend as interactions with this panel become more and more frequent. At the same
time, interactions with the left panel begin decreasing in frequency. Despite this potential
behaviour, not all participants found the right panel a “richer source” of information:
“ feeling is that there’s more useful on the left side than on the right side
“... so I’m just going to look at the left hand side because that was more
successful in the previous search ... I will look at the right hand side in case it
is better this time... left hand side ... not finding much interest on the right
These examples uncover some of the deficiencies with the analysis of the interactions
with the left and the right panel performed on the participant’s transcription. This type
interactions may not be available. Participants were asked to say out loud “anything that
went through their minds” during the search session; however, that does not guarantee
that they will verbalise all interactions with the system. Secondly, assuming that these
verbalisations are available, they may not be properly aligned with time. Verbalisations
of interactions with the left and the right panel, for instance, may be observed for the
first time after half the search session has elapsed and hence the temporal analysis is
rendered invalid. Finally, even if the verbalisations are observed and properly aligned
with the progress of the search session, they may still be ambiguous, which makes the
analysis highly subjective. Some participants verbalised
the interactions with the panels in ways such as “...on the other panel...” and “...I will
now investigate the other side...”. These utterances cannot be encoded with either L or
As an alternative to verbal data, one could resort to analysing the user-click logs or
eye-tracking information to get a more reliable account of the interactions with the panels.
5.4. Sessions Visualised 142
The technique for plotting the interactions remains unchanged for either type of data
assuming that it was collected appropriately. For instance, to analyse user logs
one would have to record clicks for each panel in a distinguishable way, and also
record scrolling actions on both panels, as it could be the case that the surrogates are
briefly inspected but that the user does not click on any document. Unfortunately, none
of these data were available hence an example of how such analysis could be carried out
In Section 5.1, it was suggested that some relevance criteria profiles may be considered
outliers. One such profile was that on row 6. In this section, the corresponding search
session is visualised and analysed in more depth. The search session is firstly segmented as
described in Section 3.7.5 and then plotted as described in Section 3.7.5. This procedure
is also applied to the search sessions of rows 2 and 19 in the divergence matrix (see Figure
The result of the segmentation and visualisation process for the profile of participant 7 (in
row 6 in the JS-divergence matrix) is presented in Figure 5.6. At first sight it can be seen
that the participant spent almost all of the session reading out loud. This could reflect a
misunderstanding of the instructions for the study. The participant may have interpreted
the request to talk out loud as a request to read out loud. We can
also observe that the participant did not mention many relevance criteria, nor did they do so very
frequently. This explains the high divergence value between the participant’s profile and
the other profiles. Because the participant may have misinterpreted the instructions and
spent most of the session reading out loud, fewer expressions of relevance criteria may have
been observed. However, it could also be that the participant did not find any documents
that were even remotely interesting and hence silently (in the sense of mentioning relevance
Participant 2 is a research student from the School of Computing. Applying the segmen-
Participants 2 and 19
Figure 5.7: Visual representation of the search session of participant 2
[Figure: visual representation of a search session. Rows: Accuracy/Validity, Clarity, Currency, Quality of Sources, Verification, Background Experience, Ability to Understand, Content Novelty, Source Novelty, Document Novelty, Interactions.]
depicted in Figure 5.7. In the figure we can observe that the user was engaged during the
ment, then we can observe that the participant is engaged, and remains so throughout the
session from the beginning of it. These affective responses, encoded as affectiveness, are
processes (depicted as coloured piles in the graph) 22 (about 45%) contain at least one
the beginning of the session than closer to the end of the session. Perhaps the participant
begins to express fewer emotions (or have fewer emotional responses) as the session progresses
cesses very often. The interweaving of piles and interactions (including acts of reading out
loud) is frequent. This may suggest a more “careful” approach to searching for relevant
may be due to the participant constantly analysing the presented information looking for
cues to derive its relevance. As such, it may be a sign of the participant’s experience in
finding these cues. A person relatively inexperienced in finding these cues may have to
sequentially assess each information piece in more detail. This would be translated to
stacks of 2 or 3 relevance criteria blocks. It may also be that the participant is wary and
does not want to filter out potentially relevant information too quickly. Hence, the partic-
ipant assesses in more detail (than average) each piece of information. As the participant
expressed: “... hmmm ... I’m usually crap at selecting things for my literature review, I
either go for everything or select hardly anything ...” which suggests that the participant
Tangibility, which includes topicality, seems to play an important role during the par-
ticipant’s search session. Out of the 49 relevance judgement processes, 37 (75.5%) include
at least one utterance encoded as tangibility. This complements the global view pre-
sented by the relevance criteria profile (see Figure 4.3) which showed that tangibility was
a commonly used criterion by participants from the School of Computing. During the
participant’s session, tangibility was not only a commonly used criterion, but also one that
was present in most relevance judgement processes. Moreover, the criterion is present in
relevance judgement processes of different complexities covering almost the full range.
participant’s search session is depicted in Figure 5.8. Contrary to the interaction behaviour
the participant considers to have found a promising source of information, however, the
relevance judgement processes are rich both in the number of uses of relevance criteria and
in their variety. On average, the relevance judgement processes in which the participant
engaged seem to be more complex than those of participant 2. Figure 5.9 contains a
bar chart depicting the frequency of the relevance judgement processes in which both
1, 2 and 3 with some occasions in which more complex processes are used. Participant
19, on the other hand, seems to make use, on average, of more complex processes. Even
though simple processes (of complexity 1) are used frequently –possibly for quickly filtering
irrelevant information– the remaining processes are more evenly “spread out” and more
the other hand, appears at least once in 27 (about 65%) relevance judgement processes.
In both sessions we see that some criteria are repeated within relevance judgement
judgement process (participant 2). This is, however, reasonable. The code tangibility,
like “... [a] neural network, ah you know what that could almost be an application if a
[Figure: visual representation of a search session. Rows: Currency, Tangibility, Quality of Sources, Verification, Affectiveness, Background Experience, Ability to Understand, Content Novelty, Source Novelty, Document Novelty, Interactions.]
[Figure: bar chart with series Participant 19 and Participant 2; x axis from 1 to 10.]
one to look at to get references from this ...” are both encoded as tangibility; however,
they refer to different types of tangible information being analysed. One refers to
could also be encoded as depth/scope/specificity, while the other expression refers to the
references to be extracted from the document (which could also be encoded as intent).
any one relevance judgement process (participant 2). As a criterion that encompasses
mentions of different properties of the information being assessed (its depth, its scope,
its specificity with respect to the user’s information needs, etc.), depth/scope/specificity
“... [this] is really what I’m interested in and again is really relevant to the brief
which is find new technologies or technologies used in a new way for knowledge
management and sharing ... looks quite current, november 07, looks a wee bit
These expressions correspond to Figure 5.10. As can be observed, there are repetitions
of depth/scope/specificity, even with opposite polarity. This is because there
are expressions referring to the specificity with respect to the participant’s information
needs (“it’s really relevant to the brief ”) and to the volume of the information (“it’s very
frequently for participant 19 (and other members of the Information Management Group)
but not for participant 2 (and the remaining members of the School of Computing) while
repetitions of mentions of tangibility were more frequently observed during the session
5.5 Summary
In this chapter the relevance criteria profiles and the user interactions at different levels
were analysed. Initially, the similarity of the relevance criteria profiles was examined using
a divergence measure. The Jensen-Shannon divergence (Lin 1991b) was measured between
pairs of relevance criteria profiles and these similarities were plotted as a heatmap. On
the heatmap we observed that there may be three naturally emerging clusters. However,
it was noted that the emerging clusters mapped neither onto the participating schools
nor onto the research experience levels of the participants. This may be due to the fact that
are common to either the schools or the research experience levels. Alternatively, it might
have been that the analysis technique is not entirely appropriate and that other methods
would reveal that there are indeed clusters that correspond to these predefined groups.
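As an illustration of the divergence measure referred to above, the following is a minimal sketch of the Jensen-Shannon divergence between two relevance criteria profiles. It assumes that each profile is a vector of mention counts over the same ordered set of criteria, that both profiles carry equal weight, and that logarithms are taken in base 2 (so the value lies between 0 and 1); these choices, the function names and the example counts are assumptions made for illustration, not details taken from the analysis itself.

```python
import math


def normalise(counts):
    """Turn raw mention counts into a probability distribution."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("a profile needs at least one mention")
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence with equal weights (assumed), in bits."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same criteria")
    # The mixture m is non-zero wherever p or q is non-zero, so the
    # divisions inside kl_divergence are always defined.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical mention counts for two participants over four criteria.
profile_a = normalise([12, 5, 3, 0])
profile_b = normalise([4, 6, 2, 8])
print(round(js_divergence(profile_a, profile_b), 3))
```

Computing this value for every pair of profiles yields a symmetric matrix with zeros on the diagonal, which is the kind of matrix that can then be rendered as a heatmap.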
Relevance judgement processes were analysed next. It was described how these
processes are obtained through a segmentation process of the search sessions. This
segmentation process relies on expressions of interactions and uses these as delimiters. The
segmentation process rests on the assumption that a judgement process begins and ends
with user interactions. This implies that relevance judgement processes are not necessar-
process was defined as the number of criteria mentioned within the process. The
complexity of a process was used to map it to either the single-criterion rule (referred to as
elimination due to its most common usage) or the multiple-criteria rule as presented
by Wang & White (1999). It was observed that participants usually engaged in simpler
processes; however, there were observations of more complex processes (of complexity up
to 16 criteria). Single-criterion rules were used, in accordance with Wang & White (1999),
to quickly reject unwanted (or potentially irrelevant) documents. However, we also
observed the opposite behaviour. The single-criterion rule was also used to quickly accept
and judge as relevant the information presented. This was discussed through the analysis
of the polarity of the criteria involved in such uses of the rule. Negative mentions of the
criterion used in the rule were mapped to rejections of the information presented while
positive mentions of the criterion to the potential acceptance of it. Overall, all mentions
were evenly distributed across polarity; however, some criteria leaned more towards one
of the two. Rejections were mostly made on the basis of the documents not being
novel, i.e. 19.0% of all mentions of a single criterion were negative mentions of document
novelty. Most positive mentions of a single criterion (25.6%) were mentions of tangibil-
ity. It was observed that the second most mentioned positive criterion is also document
novelty, which suggests that the polarity of document novelty may be highly correlated
with the actual relevance judgement of the documents. Additionally, a discussion, through
examples and qualitative data, of the occurrences of the other 4 rules listed in (Wang &
The user interaction patterns were also described and analysed, with examples. These
The way topics were represented –as bags of words– may have had an effect on the
interactions and how participants decided which ones to inspect further. The selection
1. Standing out: if the topic immediately stood out, it would be selected for further
inspection. This included “obvious” topics or topics which would arouse the partic-
ipant’s curiosity.
could be inferred, the destination topic would be selected for further inspection.
The selection of intermediate topics was based on similar patterns, however when
it came to inferring relationships, the degrees of freedom were more restricted (when
compared to the top level topic selection) since there was more context in which the
The interactions during the literature selection process were also examined. A qualita-
tive example in which a participant slowly digressed from interacting frequently with the
left panel to interacting more frequently with the right panel was offered. To exemplify this
behaviour, a simple approach for plotting these interactions was used. Interactions were
modelled as a sequence of L and R events (signifying an interaction with the Left and
Right panel respectively), each of which was interpreted as contributing to a total number
of interactions at any given point in time. Verbal data from an example session was used.
Plotting the proportions for this session then resulted in mirroring curves showing the
participant's digression. It was suggested that verbal data might not
be the best data for this kind of analysis and that user-clicks or eye-tracking data might
be more appropriate. The technique presented, however, remains data-agnostic and could
During the analysis of the divergence measures between profiles it was suggested that
one of the participants’ profiles was very different from the other profiles. When we
visualised the sessions of three participants, it was confirmed that said participant was
indeed an outlier. Further, it was suggested that the way in which the participant behaved
may have been due to a misinterpretation of the instructions delivered at the beginning
of the session. This was reflected in the visualisation of the participant’s session as being
mostly read-alouds with very few mentions of relevance criteria. The other two sessions
were briefly analysed, and it was observed that some commonly occurring criteria were also
distributed across the search sessions (as opposed to being concentrated on some portion
of them). One of the participants, participant 2, appeared to be engaged throughout the search
session and to have mostly used relevance judgement processes of moderate complexity.
This behaviour suggests that the participant approached the task with caution so as not to
filter out any potentially relevant information. Participant 19, on the other hand, exhibited
a more diligent approach to judging information. Although the participant used the
single-criterion rule quite often (perhaps for quickly filtering out irrelevant information), he also
The custom visualisation tool developed in this dissertation has been useful in:
1. Outlier detection: as suggested, looking at the plot of the search session quickly
confirmed that the session was drastically different from the other two
the spread of different relevance criteria across the search sessions. This is useful in
complementing the global analysis of relevance criteria profiles which provide
a total number (or proportion) of mentions of criteria for the entire session
processes of different complexities are used across the search session which gives an
not comparable across sessions. This stems from the fact that they are not time-annotated,
hence one cannot analyse, for instance, at which point in time (during the session) each
participant mentions their first relevance criterion. This drawback arises because the encoding
of the verbal utterances was not time-annotated either. Had they been so, the informa-
tion would be available and it could be included in the session visualisation. Secondly,
relevance judgements are not visualised. This is related to the way relevance judgements
processes does not guarantee that they will be aligned with document judgements hence
these cannot be included in the visualisation. Coupling verbal analysis with the analysis
of logs to discover the user interactions might alleviate this situation. Finally, a scheme
for picking colours for labelling the different relevance criteria was recommended, however
if the number of observed criteria increases the number of different colours must do so
Chapter 6
Carrying out an experiment designed as explained in Chapter 3 allowed for the observa-
tion of the relevance criteria used when researchers, of different disciplines and research
experience levels, judged the relevance of information related to their area of research. In
Chapter 4 the observations were presented in the form of relevance criteria profiles, a
technique for grouping these observations developed in Section 3.7.4 which allows for different
visualisations of the data as well as comparisons between the observations. These profiles
showed that the participating researchers used different criteria on different occasions and
In Chapter 5 the relevance profiles were compared against each other using a divergence
measure. It was observed that even though there may be three naturally emerging groups,
none conformed to the imposed groups: discipline or research experience level. This
suggests that additional factors may be affecting the uses of relevance criteria. Relevance
judgement processes were isolated, and the global trends and the rules used were analysed. It
was reported that while all participants often used a single criterion for judging the relevance
of the information presented, more complex processes were also common, although
less frequently used. The chapter concluded with a description of the common interaction
patterns when selecting target topics, intermediate topics and the supporting literature.
their search sessions. An example was provided in which this behaviour could be observed.
6.1. Verbal Protocols 155
Participants usually began by analysing the literature that was deemed to be “closer” to
their research experience to eventually drift to the literature concerning the selected related
Overall, the approach taken seems appropriate for conducting, at least, observational
The design of the study included the use of verbal protocols for gathering data regard-
ing the cognitive processes involved in the assessment of the relationships suggested by
the system. It was decided that concurrent reports, as opposed to retrospective reports,
would be used as they would provide a raw view of these processes. An alternative, al-
reports as this might not only improve the reliability of the data gathered but also provide
new insights into the cognitive processes being observed (Taylor & Dionne 2000, Ericsson
& Simon 1993). Both convergent as well as divergent information contained in comple-
mentary reports are of use to the analysis of the data. Convergent information offers
divergent information may indicate where the complex relationships that are part of the
During the description of the recommended guidelines for gathering data using verbal
protocols (Section 3.6.2) it was suggested that all participants should receive the same
instructions in order to maximise reliability. This guideline was followed as during this
study all participants received the same set of instructions, however these instructions were
such as the natural fluctuations in language, for instance, that are likely to be present when
communicating instructions verbally, the instructions should have been given in writing to
participants. This has the added benefit of ensuring that participants understand what
is required of them. Although great care was taken to make sure participants had
understood the instructions, during the analysis of the data gathered during the study, it
was suggested that one participant might have misunderstood the instructions since all
the data gathered during their session consisted of read-aloud data points. Providing
the instructions in writing, allowing for reading and re-reading time, and allowing for
questions to be asked afterwards would have minimised the risk of misunderstandings and
improved reliability, and hence it is the preferred approach when delivering instructions
to participants.
but realistic, data. The exercise, however, was not identical to the task they would have to
perform afterwards. The exercise allowed for maximum freedom in navigating the system.
The goal of the exercise was twofold: i) to make sure participants would be familiar
and comfortable with the system as its user interface diverged from those of traditional
search engines and ii) to make sure participants had understood the instructions and were
comfortable verbalising their thoughts as they used the system. The design of future
studies, however, should include a warm-up exercise that mimics, as much as possible,
the actual task to be solved. This is to increase the chances that participants not only
have understood the instructions and are familiar with the system in question, but also so
that they are comfortable providing verbal data as they solve a similar task as opposed
The reliability of the encoding procedure was assessed by having an independent
encoder encode a random sample of utterances and then measuring the overlap
between codes. Although this overlap was found to be significant, this is an ad-hoc procedure:
it is an intuitive approach, but its reliability was not validated. Approaches
better grounded in the literature should have been followed instead. For example,
using inter-rater agreement measures such as Kappa (van Someren et al. 1994) might
be more appropriate. Future studies should take this issue seriously, as using
standard measures allows one not only to obtain a measure of reliability but also to judge its
standard. Additionally, having more than a single independent encoder will reduce the
threat of subjectivity and personal interpretation in the encoding of the transcribed data.
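To make the suggestion concrete, the following is a minimal sketch of one such agreement measure, Cohen's kappa for two encoders. It assumes that each sampled utterance receives exactly one code from each encoder; the choice of this particular Kappa variant, the function name `cohens_kappa` and the example labels are assumptions made for illustration, and the sample shown is not data from the study.

```python
from collections import Counter


def cohens_kappa(codes_a, codes_b):
    """Cohen's kappa for two encoders labelling the same utterances.

    Assumption: one code per utterance per encoder, in the same order.
    """
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("both encoders must label the same non-empty sample")
    n = len(codes_a)
    # Observed agreement: fraction of utterances given the same code.
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Agreement expected by chance, from each encoder's code frequencies.
    freq_a = Counter(codes_a)
    freq_b = Counter(codes_b)
    expected = sum(freq_a[code] * freq_b[code] for code in freq_a) / (n * n)
    if expected == 1:
        # Both encoders used a single identical code throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical sample of four utterances labelled by two encoders.
encoder_1 = ["tangibility", "tangibility", "currency", "currency"]
encoder_2 = ["tangibility", "tangibility", "currency", "tangibility"]
print(cohens_kappa(encoder_1, encoder_2))
```

In this hypothetical sample the raw overlap is 75%, but half of that agreement would be expected by chance given how often each encoder used each code, so kappa is 0.5. This correction for chance agreement is what a plain overlap measure lacks.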
As a final note on the gathering and analysis of verbal data, it must be pointed out
6.2. Relevance Criteria 157
that although a predefined encoding was used, a custom encoding could have been derived
from the data as this is the suggested approach in the literature. However, using a
pre-defined encoding increases the stability of the existing encoding and provides both
refinements and new interpretations of the codes. Much like complementing concurrent
reports with retrospective reports, a similar approach could be followed in future experi-
ments where independent encoders evolve their own encoding from the data in parallel to
encoders labelling utterances using a pre-defined encoding scheme (if a suitable encoding is
readily available). Although measuring the agreement between encodings might prove to
be a very difficult task, doing so will increase the likelihood that any one code is objective
and describes the cognitive process under study (or part of).
Researchers do use different relevance criteria when judging the relevance of the
presented information, in this case information related to their area of research. The most
criterion dealing with whether the information is tangible; utterances coded with
tangibility were observed a total of 595 times. Mentions of topicality accounted for
almost half of its mentions: topicality was observed 257 times, or 43.2% of all mentions
encoded with tangibility. If we take this into account, we then have to conclude that it is
It seems sensible that topicality was not the most used criterion. When assessing the
information and the starting topic that is assessed and not its topicality, at least not
initially. Keeping in mind that assessing the relevance of related information is hard, if we
accept that it is the nature of the potential relationship that is being assessed, it is
likely that the participants were actually judging the potential relevance and not the
actual relevance of the presented information. In this respect, they may have judged the
relevance in a shallow fashion, e.g. “this looks like it has potential, I’ll save it for later”,
especially considering the time constraints imposed. Hence, interpreting properties of the
information such as document length, the genre of the document, and the specificity of the
information as signals of potential relevance seems sensible. Longer documents may offer
more relevant information; general overview documents may be easier to understand and
may provide a broader picture of the field. An interesting case of the mentions encoded as
interest. In so doing, they provide both an indexing vocabulary for that area and, more
importantly, a narrative context in which the indexing terms have a clearer meaning”
(Blair & Kimbrough 2002). An example of exemplary documents is the survey article
which broadly covers an area of research and as such offers, amongst other things, a list of the
major trends in it, the state-of-the-art techniques as well as the most influential authors.
Based on the observation that these documents were chosen across research experience
levels and affiliation groups, it is not unreasonable to suggest that these are worth special
attention. Perhaps these documents, as they offer an overview in a single document, are
a good entry point for researchers to a new field of research. Moreover, considering the
amount of information usually contained in these documents, it may be that the rewards
obtained once read, when compared to other documents in the field, are large, i.e. the ratio
$\frac{\text{information obtained}}{\text{effort needed to obtain it}}$ is large (Harter 1992).
Members from the School of Computing exhibited a preference for tangible informa-
tion, as utterances encoded with tangibility were the ones they expressed most frequently. Members
of the other two schools on the other hand, namely the Information Management Group
and the School of Pharmacy, mentioned depth/scope/specificity more often than any other
criterion. Given the nature of the discipline, it is not unreasonable to observe that members
of the School of Computing had a preference for tangible information (even after discrim-
inating between tangibility and topicality). However, the reasons behind the difference in
behaviour of this group, relative to the global trends, are not clear. It may be that members
of the different disciplines make inferences at different levels of abstraction. While mem-
bers of the School of Computing may be making connections at very detailed levels, and
hence they require tangible information for doing so, members of the other two disciplines
may be making them at more abstract levels, hence the requirements of depth and scope
of the information.
When the data was grouped according to the research experience level expressed by the
researcher, the picture changed. Regardless of experience level, all participants mentioned
tions of tangibility that are actual mentions of topicality, the preferences actually depend
on the research experience level of the researcher. Students and Researchers preferred
tangibility over topicality while Senior Researchers did not. Depth/scope/specificity was
the most mentioned criterion regardless of research experience. This behaviour may have
been provoked by the search tasks assigned to each group. While research students and
researchers had to complete their literature review and a research proposal respectively,
senior researchers had to deliver a keynote speech at a conference. It may be that when
writing a keynote speech, tangible data is not as important as topicality since in keynote
The implications of these observations are clear: relevance, in the context of LBD, is multi-
find discipline and research experience. Despite the subjective and highly personal nature
of these manifestations, potentially global trends were observed which can be used either
Operational estimations of the two most observed criteria (topicality is excluded, as it is
already operationalised, to a certain extent, in the underlying best match algorithms) may be embedded in sys-
and only if, we can measure tangibility, for instance by looking at the number of tables in
may then embed these measurements in ranking algorithms. Two approaches may then
be followed. If one knows beforehand the target audience of the system one may decide
to favour one criterion over the other. For instance, if one knows that the system is
to be tailored towards the Computer Science scientific community then tangible informa-
tion may be preferred. Systems tailored, on the other hand, to Information Management
and Pharmacy would favour documents for which the depth/scope/specificity score is high.
However, things are not this simple. When researchers were categorised by their research
experience level, the preferences, as exemplified by the criteria frequencies, were different
from those of the school categories. This complicates matters because, if we had embedded in
our system the measurements proposed before, it is not clear how one should favour
one over the other. Moreover, detecting when we are in the presence of a research student,
may then be reached by taking into account both criteria in equal proportions and favouring
documents which exhibit both properties over those that only exhibit one or none.
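The two approaches can be illustrated with a small re-ranking sketch. It assumes that each retrieved document already carries a best-match score and two operational criterion scores in [0, 1]; the dictionary keys, the weights and the multiplicative combination are hypothetical choices made for this example and are not a method proposed or evaluated in the study.

```python
def rerank(results, weight_tangibility=0.5, weight_depth=0.5):
    """Re-rank best-match results using operational criterion scores.

    Each result is a dict with the hypothetical keys 'id', 'base'
    (best-match score), 'tangibility' and 'depth' (both in [0, 1]).
    """
    def combined(doc):
        criteria = (weight_tangibility * doc["tangibility"]
                    + weight_depth * doc["depth"])
        # The criteria boost the best-match score rather than replace it.
        return doc["base"] * (1.0 + criteria)

    return sorted(results, key=combined, reverse=True)


# Hypothetical documents and scores.
docs = [
    {"id": "d1", "base": 0.8, "tangibility": 0.0, "depth": 0.0},
    {"id": "d2", "base": 0.7, "tangibility": 0.9, "depth": 0.1},
    {"id": "d3", "base": 0.7, "tangibility": 0.6, "depth": 0.6},
]
balanced = [d["id"] for d in rerank(docs)]
# Audience known beforehand, e.g. a system tailored to Computer Science.
computing = [d["id"] for d in rerank(docs, weight_tangibility=1.0, weight_depth=0.0)]
```

With equal weights the document exhibiting both properties (d3) is ranked first; with all the weight on tangibility the most tangible document (d2) is ranked first. Note that an equally weighted sum only approximates the idea of favouring documents that exhibit both properties, since a very high score on one criterion can still offset a low score on the other.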
Simulated work task situations are a powerful tool if used correctly; however, their
crafting should be approached with great care. Moreover, as suggested in this disserta-
tion, comparing results obtained with different variants of these task descriptions may
introduced bias while crafting the simulated work task situations, especially in terms of
the relevance criterion currency. When we designed the tasks we decided that we wanted
to have one per research experience level and to word each so that participants would relate
better to it. Each task required participants to search outwith their area of expertise.
However, when we asked senior researchers to find information to use for their keynote
speech we might have asked them, implicitly, to favour more recent information. This
implicit requirement may not be present, for instance, in the request for finding informa-
tion to complete a literature review for a doctoral dissertation (the task given to research
students). The proportional observations of the relevance criterion currency, then, might
have been artificially influenced by the wording of the tasks. This implies that the data
obtained might not be comparable across tasks, i.e. across research experience levels.
6.3 Measuring Profile Similarities
Relevance profiles were compared using a divergence measure and the similarities analysed.
Three naturally emerging clusters were observed: those that do not conform to group
trends, those that do, and the rest. Unfortunately no cluster corresponded to any of
the imposed groupings, i.e. no cluster corresponded to either the school disciplines or the
research experience levels. This suggests that neither discipline nor research experience
level alone is driving the behaviours in terms of relevance criteria used. However, being
able to measure the similarity in terms of judgement behaviour may be a useful tool
for detecting naturally emerging groups which can then be traced to groups in terms of
population features, e.g. people more proficient in use of search engines use the criterion
topicality less often as they assume the results are already topically relevant.
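To make the comparison of profiles concrete, the sketch below treats a relevance criteria profile as a frequency distribution over criteria and compares two profiles with the Jensen-Shannon divergence. This passage does not name the divergence measure that was used, so Jensen-Shannon is an assumption chosen because it is symmetric and bounded; the function names and counts are likewise illustrative.

```python
import math


def to_distribution(counts, criteria):
    """Turn raw criterion counts into relative frequencies."""
    total = sum(counts.get(c, 0) for c in criteria)
    if total == 0:
        raise ValueError("a profile needs at least one coded utterance")
    return [counts.get(c, 0) / total for c in criteria]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(profile_a, profile_b):
    """Jensen-Shannon divergence: 0 for identical profiles, 1 for disjoint ones."""
    criteria = sorted(set(profile_a) | set(profile_b))
    p = to_distribution(profile_a, criteria)
    q = to_distribution(profile_b, criteria)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical criterion counts for two participants.
student = {"tangibility": 30, "topicality": 20, "depth/scope/specificity": 10}
senior = {"tangibility": 10, "topicality": 20, "depth/scope/specificity": 30}
distance = js_divergence(student, senior)
```

A matrix of such pairwise values could then be passed to any standard clustering procedure to look for naturally emerging groups.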
Measuring (dis)similarities between relevance criteria profiles may result in improved user
profiles can be estimated (built) using cheap approaches, i.e. cheap when compared to
verbal reports, then being able to compare them may be a good approach, for instance, for
recommender systems to escape the binary ratings usually used to build the co-occurrence
matrices upon which they base their recommendations. Additionally, individual profiles
may be used when modelling user preferences in IR systems targeted at personalised search.
As more detailed information about the nature of the user judgements is available, better
Individual profiles can be compared to other individual profiles, and what is more,
they can be aggregated to build group profiles. These two mechanisms, that of compari-
son and that of aggregation, form the basis for building group profiles and testing which
individuals are to be a part of such groups. Hence, they may have an impact on how
communities are modelled, for instance for collaborative search or social network analysis.
Additionally, because the profiles define clear relevance criteria, they may serve as guide-
lines for designing feedback systems in such scenarios. Indeed, one such example can
be found in the rating system implemented by the videolectures website which includes
criteria such as tangibility (valuable and informative), clarity (well presented) and ability
to understand (easily understandable). For all the criteria considered see Figure 6.1.
Figure 6.1: Feedback form. Several criteria are present which, when combined, present a
more informative judgement for why the video is worth seeing.
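The two mechanisms of aggregation and comparison described above can be sketched as follows. Group profiles are built by summing individual criterion counts, and membership is tested by thresholding a distance between an individual profile and the group profile. The total variation distance, the threshold value and all names are assumptions made for the example; any divergence between profiles could be substituted.

```python
from collections import Counter


def aggregate(profiles):
    """Build a group profile by summing individual criterion counts."""
    group = Counter()
    for profile in profiles:
        group.update(profile)
    return group


def total_variation(counts_a, counts_b):
    """Half the L1 distance between the two relative-frequency profiles."""
    criteria = set(counts_a) | set(counts_b)
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    return 0.5 * sum(
        abs(counts_a.get(c, 0) / total_a - counts_b.get(c, 0) / total_b)
        for c in criteria
    )


def belongs(profile, group, threshold=0.2):
    """Treat an individual as part of the group if the profiles are close."""
    return total_variation(profile, group) <= threshold


# Hypothetical individual profiles.
members = [
    {"tangibility": 6, "topicality": 4},
    {"tangibility": 4, "topicality": 6},
]
group_profile = aggregate(members)
candidate = {"tangibility": 5, "topicality": 5}
outsider = {"depth/scope/specificity": 10}
```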
6.4 Relevance Judgement Processes
Relevance judgement processes were defined as a set of relevance criteria used to judge
the information presented in between interactions with the system. Their complexity was
defined as the number of criteria used during the process. It was observed that processes
of complexity 1 were used by all participants, and that there were two common scenarios
More complex processes were also observed; however, as processes became more com-
plex their use decreased on average. The polarity of the judgements was evenly distributed
for 1-criterion processes, which makes it difficult to decide, initially, how these processes
are likelier to be used. In the case of negative uses, i.e. to filter out irrelevant informa-
tion, the most used criterion was document novelty. This suggests that the criterion is
an important one in the context of LBD. Despite the many observations of re-occurrence
being interpreted as a sign of relevance, the negative uses of document novelty are more numerous.
Positive uses of the 1-criterion rule have tangibility as their most frequently mentioned
criterion. This observation seems to correspond with what was suggested: that once the
potential profitability of the relationship had been established, tangibility would be used
when devising details of the relationship and that topicality may eventually act as a proxy
Expressions of uses of the dominance rule (and a reversed version) and the chain rule
(Wang & White 1999) were also observed. The dominance and chain rules are related to a
certain extent. While the dominance rule relates to comparative relevance judgements in
which users judge the relevance of the new information with regard to the previously judged
information, the chain rule refers to the cases in which users detect that they are on a chain of
information and make a collective judgement on the set. These two rules are related in
the sense that relevance judgements are dependent on the previously judged documents
and as such suggest that the relevance of any one document is not invariant to the order
and set within which it has been evaluated, i.e. the context of the relevance judgement is
In addition to the rules defined by Wang & White (1999), two new rules were observed:
Sometimes the negative impact of re-occurring documents was overridden by the fre-
relevant. In a sense, the users expressing this exhibited a certain trust that the system
knew better than them as it was constantly suggesting the same document as being rel-
evant. The concordance rule is related to the re-occurrence rule in the sense that when
both the left and right panel showed a concordance on the top ranked documents, then
these would be selected as potentially relevant and inspected further. When the panel
contents did not match, users were forced to analyse the document surrogates in more
The high frequency of the 1-criterion processes observed suggests that users were often
interested in quick judgements. Quick judgements, perhaps, would allow users to cover
more of the search space, so in a sense it may be that they consider that knowledge
discovery is a recall oriented task. This seems to contradict the approaches
taken by most LBD systems which attempt to filter out as much information as possible
(when modelling, filtering and ranking the topics) before presenting the results to the
user. If users actually consider it to be a recall oriented task, filtering out information
is a potential risk. Hence, it may be more appropriate to build systems so that they
present as much information as sensibly possible while offering an interactive user interface
appropriate for quick judging of the information. Moreover, the observations of the rules
used suggest that there may be two stages to the closed-model search: the first stage is
recall oriented, hence systems should support users in covering as much ground as possible
before aiding them (second stage – precision oriented) to focus on the potentially relevant
pockets of information.
Document novelty has been frequently used in 1-criterion judgements, and the polarity
of the use may have a strong correlation with that of the relevance judgement. Hence doc-
ument novelty may be embedded in the user interface of LBD systems. Including information
about document novelty in the user interface is already done, to a certain extent, in mod-
ern browsers. Browsers change the colour of already visited links, letting users know that
they have already seen the document linked. However, this use can be extended to include
the number of times it has been read, the number of times it has been retrieved but not
necessarily read, and the context in which these events have happened. For instance, next
to links leading to already seen documents one could include the number of times it has
been read and when hovering over the link the user interface could inform users in which
contexts (intermediate topics) this document has been read. Alternatively, already seen
documents could be hidden altogether from users, offering, however, the option to display
them should users wish to do so. In addition, this suggestion would accommodate
the observations of the re-occurrence rule, as either hiding re-occurring documents until
their nth re-occurrence has been observed or displaying the number of re-occurrences next
to their link would inform users of these events. As expressed by a participant, “...I would
say maybe it would be nice to have on the paper, to have which topics it’s related to ...
because if you say one [document] is relevant, then if it’s involving some of the other topics
which also appear to be relevant then you may want to look at them, you may not have
noticed that because you get so many ... so I think that if you find a key paper you could
have “I want more like this” then you may want to know, out of all these, because they
re-appear, which ones that is appearing in ... ’cause it’s often the case you find something
from a totally different angle and you want more of that ... and ok, you do it by following
the references at the end of the paper or visiting the author’s website but if you have some-
thing like this you could actually tell me “ok this paper is relevant, tell me all the topics
in here, the suggested ones, in which ones this paper appears” because that gives you an
idea of what else to look for ... ’cause some of these ones have some rectangle topological,
now there may be things in there but I don’t know ... ’cause the keywords may not be the
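The document novelty indicators suggested above (read counts, retrieval counts and the topics in whose context a document appeared) could be kept by a small bookkeeping component such as the one sketched here. The class, its method names and the re-occurrence threshold are hypothetical and do not describe the system used in the study.

```python
from collections import defaultdict


class NoveltyTracker:
    """Records how often, and under which topics, a document was seen."""

    def __init__(self):
        self.retrieved = defaultdict(int)   # doc_id -> times retrieved
        self.read = defaultdict(int)        # doc_id -> times opened and read
        self.contexts = defaultdict(set)    # doc_id -> topics it appeared under

    def record_retrieval(self, doc_id, topic):
        self.retrieved[doc_id] += 1
        self.contexts[doc_id].add(topic)

    def record_read(self, doc_id, topic):
        self.read[doc_id] += 1
        self.contexts[doc_id].add(topic)

    def should_display(self, doc_id, show_after=3):
        """Hide already-read documents until they have re-occurred often enough."""
        if self.read.get(doc_id, 0) == 0:
            return True
        return self.retrieved.get(doc_id, 0) >= show_after

    def label(self, doc_id):
        """Text that could be shown next to a link or when hovering over it."""
        topics = ", ".join(sorted(self.contexts.get(doc_id, ())))
        return (f"read {self.read.get(doc_id, 0)}x, "
                f"retrieved {self.retrieved.get(doc_id, 0)}x, topics: {topics}")


# Hypothetical usage with made-up document and topic names.
tracker = NoveltyTracker()
tracker.record_retrieval("d7", "logic speed processor")
tracker.record_read("d7", "logic speed processor")
tracker.record_retrieval("d7", "circuits")
hidden_for_now = not tracker.should_display("d7")
summary = tracker.label("d7")
```

The same records would also answer the participant's request to list, for a document judged relevant, the suggested topics under which it re-appears.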
The concordance rule has been observed, and it has been suggested that this rule aided
users in starting the evaluation of the retrieved literature. Using colour codes, for instance,
to provide visual cues to help users detect which documents are present in both panels may
Lastly, the observation of uses of the dominance and chain rules suggests that the
6.5 Interactions
How topics are presented seems to have had an effect on how the selections were made.
Because the system used during the study modelled topics as tri-grams (bags of three
words), participants had to add an interpretation step to the topic selection stage. Se-
lections were guided by two factors: i) surprise and/or ii) ease of inference of the relation-
ships. The surprise factor included strange combinations of words of which the partici-
pants could not make sense. The ease of inference of the potential relationships suggests
that participants entered the evaluation of the literature phase with a preconception of
Interactions with the system also included those with the literature panels, i.e. the
left and right panels displaying the related literatures. Participants initially inspected the
literature on the left panel more often than that in the right panel, suggesting that they
felt inclined to analyse the literature “closer” to their area of research first. Eventually, the
interactions diverged towards the literature in the right panel. As participants felt more
comfortable with the literature outside their research area, these interactions increased in
It was observed that the interactions with the panels containing the literatures followed
a pattern of slow drifting towards the literature pertinent to the related topic. It may be
that, as initially users focused on the literature closer to their research area, the option
to hide either panel may be a useful one. By being able to hide either panel, users could
fully concentrate on the literature presented. Moreover, offering colour cues to help users
identify which portions of the document surrogates correspond to which topic may aid the
Three search sessions were visualised and analysed. One session was confirmed to be an
outlier as the corresponding graph showed that the participant had spent the entire search
session reading documents out loud. The other two sessions were deemed to be more repre-
sentative of what a “normal” session looked like. From the visualisations it was suggested
quently expressed criterion. Furthermore, the participant was “careful” when judging the
relevance of the information presented. This may correspond to the participant’s research
experience level (research student). Tangibility was not only frequent globally, in the participant’s
relevance criteria profile, but was also used in most relevance judgement processes.
Participant 19, on the other hand, mentioned depth/scope/specificity more often than any
other criterion. The criterion was also present in most relevance judgement processes; pro-
cesses which were mostly of low complexity, suggesting that the participant engaged in a
highly interactive session. Repetitions of uses of criteria within relevance judgement pro-
cesses were observed in both sessions. These may be due to the many manifestations of
any one relevance criterion, e.g. mentions of document length, information breadth, and
The technique developed in Section 3.7.5 results in graphs that offer sequential informa-
tion needed to analyse whether criteria are used in different portions of the search session
and are a good complement to relevance criteria profiles. However, as explained in Section
3.7.5, there are drawbacks to the visualisation technique used to display the search ses-
sions. The reasons behind these shortcomings are not all particular to the technique itself
as they include incomplete taxonomies (of both interactions and relevance criteria), the
assumptions behind the segmentation technique (that relevance judgement processes are
delimited by interactions) and the partial gathering of data during the study.
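The segmentation assumption mentioned above, that relevance judgement processes are delimited by interactions, can be sketched as follows. The encoded session is taken to be a flat sequence of codes in which N (navigation) and R (reading out loud) are interactions and every other code is a relevance criterion. Counting distinct criteria as the complexity of a process is an assumption of this example, as are the function names and the sample session.

```python
INTERACTIONS = {"N", "R"}  # navigation and reading out loud


def segment(events):
    """Split an encoded session into processes delimited by interactions."""
    processes, current = [], []
    for code in events:
        if code in INTERACTIONS:
            if current:  # an interaction closes the process in progress
                processes.append(current)
                current = []
        else:
            current.append(code)
    if current:  # the session may end in the middle of a process
        processes.append(current)
    return processes


def complexity(process):
    """Number of distinct criteria used in the process (an assumption)."""
    return len(set(process))


# Hypothetical encoded session.
session = [
    "content novelty", "N",
    "tangibility", "N", "R",
    "depth/scope/specificity", "tangibility", "tangibility", "N",
]
processes = segment(session)
complexities = [complexity(p) for p in processes]
```

Consecutive interactions produce no empty processes, and a repeated criterion within one process is counted once.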
Using the beginning of the search session of participant 2, depicted in Figure 6.2, we
The session begins with an expression of recognition of the contents presented (pink
box denoting an expression of content novelty). The novelty of the contents analysed is
in relation to the participant remembering that in the set of documents retrieved during
the first session “... there was something about networking or something [...]” and that
“[...] that’s where the other keywords come from ...”. The participant then interacts with
the system. This interaction, even though encoded as N, has the user selecting a top
related topic from the presented list at the beginning of the session. This is expressed as
“... I will try the top “logic speed processor” ...”. After this interaction, the user engages
expression; the participant utters that “... [the topic] could be interesting ...”. The list of
intermediate topics is assessed next; the participant expresses that there are “... things to
do with circuits and ... logic ...”. The judgement process is ended by an expression that
“... actually there’s a few things I’ve no idea what they are to do with anything, but never
As described, the participant is using relevance criteria to assess the relevance of the
information presented by the system; in this case, the related and intermediate topics. In
the graph, no distinction is made on whether the relevance criteria (and hence the relevance
distinction is made, certain types of analysis may become more difficult to perform on
the plot as is. Should one want to analyse the relevance judgements on the contents of
the documents inspected, for instance, one would have to resort to the transcription of
the verbal reports to decide whether the relevance judgement process corresponds to the
this extent, and to annotate the graph accordingly, extra information should be collected
during the search sessions. Click information, for instance, can indicate when the user has
opened (or closed) a document. Hence the graph can be annotated to indicate whether
the information being judged is coming from a document or some other part of the system.
judged may differ from information object to information object. When participants
evaluate the relevance of the related (or intermediate) topics, for instance, they are
actually evaluating the potential relevance of the documents that will be retrieved when
“video games...” that might be relevant, that’s right hand side, which is presumably meant
to lead me into new areas which might be of interest ...” Harter (1992) referred to this
included into the graph. Depending on how the researcher wishes to do it, a mechanism for
the user to provide feedback on whether the judgement has been positive or negative can be
built into the system. Judgements can then be incorporated into the graphs. For instance,
a button to “print” documents that the user wishes to keep for later reference could be
document, may be interpreted as positive judgements, i.e. the user wants to print the
document for later reference. Sequences with the print document step missing may then
The taxonomy used in this study to encode interaction utterances is simple: an in-
teraction with the system is either a Navigation interaction or the act of Reading out
varied. Discriminating these, and assigning those that one wishes to analyse further their
own code, is necessary as otherwise the analysis of the interactions using the graph may
become too difficult. In this respect, the shortcoming is not particular to the visu-
alisation technique itself but to the study design. Extending the encoding should suffice
also of extreme importance. Resorting to verbal reports of these interactions is far from
optimal. As described in Section 5.3.3, interactions with the left and right panel were
not always verbally reported. Interaction data with these panels, for instance, should be
about order only, sessions (or portions of) cannot be directly compared, i.e. one cannot
compare the first quarter of one participant’s session with that of another. To be able to
do so, the timeline should be annotated with time and the stacks and interactions placed
accordingly. Having time-annotated sessions would mean one could draw the graphs in
parallel and compare, for instance, the session lengths or session subsets.
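Once events carry timestamps, comparing portions of sessions becomes a matter of slicing by time. The sketch below assumes each session is a list of (seconds from start, code) pairs and that the session duration is known; both the representation and the names are assumptions made for illustration.

```python
def portion(events, duration, start_fraction, end_fraction):
    """Codes whose timestamp falls in the given fraction of the session."""
    lower = start_fraction * duration
    upper = end_fraction * duration
    return [code for seconds, code in events if lower <= seconds < upper]


# Hypothetical time-annotated sessions of different lengths (in seconds).
session_a = [(0, "N"), (15, "tangibility"), (70, "R"), (200, "topicality")]
session_b = [(0, "N"), (30, "N"), (155, "depth/scope/specificity")]

first_quarter_a = portion(session_a, 240, 0.0, 0.25)  # events before 60 s
first_quarter_b = portion(session_b, 600, 0.0, 0.25)  # events before 150 s
```

Because the slices are expressed as fractions of each session's duration, the first quarter of a short session can be set against the first quarter of a long one.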
Figure 6.3 depicts a mock-up of the first 3 minutes of two search sessions. The sessions
have been drawn in parallel and aligned on the time axis for better comparison. In
addition, information about document relevance judgements was added and the taxonomy
In the figure we see that participant 2 (top graph) immediately engages in assessing
the information presented. We also know that this information is not coming from a
document; this narrows down the possibilities to the related or the intermediate topics (as
it is the beginning of the session). Participant 19, on the other hand, starts navigating the
collection and only selects a topic around minute 1, the point at which participant 2 begins
judging the first document of the session. The red box around the stacks of criteria and
interactions indicates that the participant found the document irrelevant. The document
identifier is also present on the top right of the box. Participant 2 judges the document
to be irrelevant around minute 2 and the participant continues searching for information.
Even by this time, participant 19 has not judged a single document yet. The first relevance
judgement of a document by participant 19 comes around 2:35 and we know the participant
has found it relevant because of the green box around the relevance criteria stacks.
document contents or the screen) and actions of selections of topics. These are encoded as
Scrll and Ts. The full taxonomy allows for the encoding of scrolling of contents (Scrll),
selection of topics (either related or intermediate – Ts), acts of reading out loud (R), and,
6.7 Known Limitations
Results from this study are to be interpreted with care. Firstly, results from the talk-
aloud reports are to be interpreted as indicative rather than definitive. Verbal reports
indicate only a subset of the cognitive processes in which the researchers engaged, hence
they provide a partial view of reality. Furthermore, as there were variations in terms of
volume and coverage of the observed utterances, absences of mentions of criteria cannot be
the interaction analysis was also performed on the verbal reports. The taxonomy used to
encode interaction utterances is limited to two events: read out loud and navigation, with
navigation being a placeholder for any interaction event that is not the act of reading out loud.
Clearly this limited the analysis and a more varied and comprehensive taxonomy should
be used in future studies. However, there is also the issue of ambiguity. Certain interaction
utterances depended on the interaction history, e.g. “...I’ll check out the other panel...”,
and this issue cannot be resolved with an extended taxonomy. Hence, interactions should
actually be analysed from log data, for instance, gathered through hooking mechanisms
embedded in the user interface of the system under test. Thirdly, the user groups, although
varied, were small and skewed. Too little data was gathered to reach any significant
conclusions. Moreover, the skewness of the distribution of the participants made the
comparison analyses difficult and the results very tentative. Finally, the time imposition
on the search sessions may have influenced the outcome of the study. Researchers under
pressure may have modified their behaviour in terms of both relevance judgements and
6.8 Summary
In this chapter the implications of the results presented in chapters 4 and 5 were sum-
marised and discussed. Firstly, it was suggested that relevance in the context of LBD
evaluation method. Additionally, there were criteria that were very frequent and, if op-
evaluation experiments. Secondly, it was suggested that relevance criteria profiles and the
of IR systems. Having more detailed information regarding why certain judgements have
been reached may prove useful in recommender systems which are traditionally based on
binary co-occurrences of events. Thirdly, the observation that 1-criterion relevance judge-
ment processes are the most frequently used may be due to that users may have been
quickly judge the presented information and continue browsing. Although not rare, more
complex processes were observed with a frequency inversely proportional to their complexity. This would suggest
that the users may have approached the solution of the task in two stages: an initial recall-
oriented stage in which they aimed at quickly judging the information presented followed
by a precision-oriented stage in which they more meticulously analysed the retrieved in-
formation. Systems may then be designed to aid this searching activity by supporting the
two stages.
polarity, it may be a good candidate for embedding an operational version of the criterion
evaluations. Although obvious in the context of LBD, it was suggested that the relevance
of documents depends on the context in which they are judged. This observation stemmed
from the observations of uses of the dominance and the chain rule.
Other observations resulted in suggestions for improving LBD user interfaces. It was
suggested that the number of times a document has been read or retrieved, and the context
in which these events happened (for instance, which topics led to the document), should
be provided. Regarding the concordance of the left and right panels, perhaps visual cues
indicating the matching documents would provide an indication of this concordance and
would support users in planning their approach to assessing the literature bodies. Finally,
to accommodate the drifting behaviour observed in the interactions between panels,
hiding either panel would allow users to focus on one literature at a time. Complementing
this feature with visual cues indicating topic (in terms of the starting, intermediate and
related topic) matches both in the document surrogates and the document reader window
During the preceding chapters we described the steps followed in the investigation of the
manifestations of relevance in the context of LBD. In Chapter 2 the reader was introduced
to the problem of LBD. Initially we described the original discoveries made by Swanson.
These discoveries were made manually, however they resulted in a recommended set of
steps to follow when attempting to make new discoveries. The common LBD search mod-
els were described next. The open model was described as an exploratory model aimed at
stimulating the researchers’ intellect and suggesting potentially fruitful relationships. The
closed model was described as a filtering model in which the researchers approach the
search task with a potential relationship in mind and set out to find if there is information
that supports it. Next, the most representative works in LBD were described and anal-
ysed. While most techniques rely, initially, on word co-occurrence statistics, it was seen
that newer approaches use specialised metadata for topic modelling, filtering and ranking.
A broad overview of research in IR was offered next. The system-driven tradition of eval-
uation was described and the concept of a test collection explained. Following this, Borlund’s
approach (Borlund & Ingwersen 2000) was suggested as a potential set of guidelines for
In Chapter 3 we described the design of the study and the components involved. The
system and the search sessions were described first. The population was described next;
a group of research scientists coming from three different disciplines and three different
research experience backgrounds. For each discipline, a collection was crawled and indexed.
Three simulated work task situations were created: one for research students, one for
researchers and one for senior researchers. The method for gathering the data, both in
written and verbal form, was explained; it was the verbal reports which allowed for the
observation of the multi-dimensional nature of relevance. How data was to be analysed was
explained next. The encoding protocol and the relevance criteria profiles were explained.
Further, the technique for session segmentation and visualisation was developed in this
In Chapters 4 and 5 the results were presented. Researchers use a variety of crite-
ria and in different frequencies when evaluating the relevance of information in an LBD
context. For instance, we observed that participants from the School of Computing fre-
quently mentioned tangibility while participants from the Information Management group
judgements with aims, potentially, of covering as much as possible of the presented in-
formation space. When interacting with the system, they seem to try and stay “close”
to their area of research initially while they are still learning about the related domains,
The results were summarised and the implications discussed in Chapter 6. Several
suggestions regarding the inclusion of operational versions of the criteria observed into
both LBD user interfaces and ranking algorithms were made. These included, amongst
others, adding visual cues denoting document novelty and concordance between literature
The main objective of the dissertation is to investigate the relevance criteria used by
research scientists as they solve an LBD type of search. To do so, a study, designed as
described in Chapter 3 to take these components into account, was conducted. The main
outcome of the study is the observation of the multidimensional and dynamic nature of
relevance and the interaction patterns in which users engaged. In Chapter 1 we presented
What relevance criteria, if any, do researchers use when assessing the relevance
Researchers do use different relevance criteria, and in different frequencies, when evalu-
ating the relevance of the information in an LBD context. We observed researchers as they
interacted with an LBD system and gathered information about their mental processes
as they judged the relevance of the information presented. After analysing this informa-
tion we were able to conclude that researchers use a variety of criteria and in different
proportions. The list of criteria used was presented in Chapter 3, Section 3.7.3. Because
Two additional questions, related to the main question investigated in this dissertation,
were introduced:
ferent frequencies?
Does research experience affect the relevance criteria, and their frequency of
Tentatively, the answer is “yes” to both questions. However, the real answer may be
more complex than that. Although researchers from different disciplines used the same
array of criteria, the frequencies with which they used them differed. The relevance criteria profiles for
each discipline showed that different criteria were expressed more frequently, suggesting
that researchers from different disciplines exhibit certain preferences when evaluating
the presented information. There is also evidence that research experience affected
the frequencies of these uses. The research experience background relevance criteria pro-
there may be naturally emerging clusters in terms of relevance criteria profile similarities
and none correspond to either discipline or research experience background, we suggest
The results of the study led to suggestions for both the design and testing of LBD systems.
It is our belief that LBD systems should not be tested using methods that resemble the
system-driven tradition of IR research as the cognitive processes that are involved during
LBD searches have an effect on how relevance is judged in this scenario. We described
and analysed how previous attempts at evaluating LBD systems approached the problem and
pointed out that there are several places where researchers have taken decisions on behalf of
the end-users, e.g. how to map the concepts, as represented by their system, to the golden
topics from Swanson’s original discoveries. This makes the results obtained with this
method hard to compare as different researchers might take different decisions on behalf
of the end-users. Additionally, we observed that certain rules affect relevance judgements,
e.g. the re-occurrence rule, and that these depend heavily on the user interactions with
the system, hence interaction patterns are to be considered to be part of the evaluation
Several by-products resulted from our investigation. As we used some of the exper-
attempt at modelling simulated work tasks in the context of LBD. We have discovered
that it is possible, although the crafting must be approached with utmost care so as not to
introduce bias inadvertently. Although we paid a great deal of attention to the crafting
of the simulated work tasks, we might have introduced bias regarding the currency of the
information to be judged in at least one of the tasks. Had we not decided to investigate
whether different groups expressed criteria in different proportions, we might have used a
single work task situation and hence retained more control over the experimental settings.
We have also developed three tools for the analysis of data gathered in future stud-
ies of this nature. Firstly, we developed the concept of relevance criteria profiles. These
7.2. Designing and Evaluating LBD Systems 178
represent the criteria expressed by a participant or group during a time window (in our
case an entire search session) and provide a global view of the proportions included therein.
These profiles can be compared using standard divergence measures. These measures can
in turn be used as a basis for further analysis, e.g. the divergence values between profiles
can be used as input values for affinity propagation clustering (Frey & Dueck 2007).
In our case we resorted to plotting these divergence values as a heatmap and manually
inspecting it to see if any natural clusters emerged. Secondly, we developed the notion
of relevance judgement processes. These are smaller groups of relevance criteria as delimited
by interactions. These were used during our data analysis to investigate the different
uses of selection rules as defined by Wang & White (1999). Additionally, these processes
can be used to analyse the behaviour of a user in terms of judging information along the
dimensions of variety of criteria used in any one process, average complexity of the processes
and more. Finally, we developed, and provided two examples of, a custom session
visualisation tool. Using this tool we confirmed our observation that a participant in our
study was an outlier in terms of the data gathered during his session. This was made more
explicit when visualising the session as it was clear that the participant had spent most
of the session reading out loud and that very few relevance criteria had been mentioned.
Additionally, we presented an example of how the session visualisation tool can be used to
further analyse the behaviour of participants in terms of the relevance criteria they used. We
plotted two example sessions and each was analysed in terms of relevance judgement pro-
cess complexity, process variety, cadence of the judgements and most commonly preferred
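The comparison of relevance criteria profiles described in this chapter can be illustrated with a short sketch. It is not the analysis code used in the study. It assumes that a profile is the vector of proportions with which each criterion was expressed in a session, and it assumes the Jensen-Shannon divergence as the "standard divergence measure"; the criteria names and session data are hypothetical and serve only as an illustration.

```python
import math
from collections import Counter


def profile(criteria_mentions, vocabulary):
    """Turn a list of coded criteria mentions into a vector of proportions.

    `vocabulary` fixes the order of the criteria so that profiles from
    different sessions are directly comparable.
    """
    counts = Counter(criteria_mentions)
    total = sum(counts[c] for c in vocabulary)
    if total == 0:
        raise ValueError("no coded criteria found for this session")
    return [counts[c] / total for c in vocabulary]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence (assumed measure): symmetric, in [0, 1] bits."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def divergence_matrix(profiles):
    """Pairwise divergences between named profiles, suitable for a heatmap."""
    names = sorted(profiles)
    matrix = [[js_divergence(profiles[a], profiles[b]) for b in names]
              for a in names]
    return names, matrix


# Hypothetical criteria vocabulary and coded sessions (illustrative only).
VOCABULARY = ["tangibility", "novelty", "topicality"]
sessions = {
    "participant_1": ["tangibility", "tangibility", "novelty"],
    "participant_2": ["topicality", "novelty", "topicality", "topicality"],
}
profiles = {name: profile(mentions, VOCABULARY)
            for name, mentions in sessions.items()}
names, matrix = divergence_matrix(profiles)
```

The resulting matrix has zeros on the diagonal and is symmetric, so it can be plotted directly as a heatmap for manual inspection, or converted to similarities before being passed to a clustering algorithm.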
Appendix A
Forms 181
Figure A.6: Form used to capture the documents selected as well as the topics they
Figure A.7: Form used at the end of the second search session (page 1).
Figure A.8: Form used at the end of the second search session (page 2).
Appendix B
Portions of this thesis have been published in different forums (ordered chronologically):
• Ulises Cerviño Beresi, Yunhyong Kim, Ian Ruthven and Dawei Song, Why did you
• Ulises Cerviño Beresi, Yunhyong Kim, Mark Baillie, Ian Ruthven and Dawei Song,
braries (ECDL 2010). Glasgow, UK, September 2010. Springer Verlag Lecture
• Ulises Cerviño Beresi, Yunhyong Kim, Mark Baillie, Ian Ruthven and Dawei Song,
formation Retrieval (ECIR 2010). Milton Keynes, UK, March 2010. Springer Verlag
• Ulises Cerviño Beresi, Mark Baillie and Ian Ruthven, Towards the Evaluation of
• Ulises Cerviño Beresi, Narrowing gaps in Science, in BCS IRSG Symposium: Future
Directions in Information Access 2007 (FDIA) held in conjunction with ESSIR 2007
Aronson, A. (2001). Effective mapping of biomedical text to the UMLS Metathesaurus: the
Baeza-Yates, R. & Ribeiro-Neto, B. (1999). Modern Information Retrieval, 1st edn, Addison
Barry, C. (1993). The identification of user criteria of relevance and document characteristics:
Barry, C. (1994). User-defined relevance criteria: an exploratory study, Journal of the Ameri-
Belkin, N. & Croft, W. (1987). Retrieval techniques, Annual review of information science and
Belkin, N., Oddy, R. & Brooks, H. (1982). Anomalous states of knowledge as a basis for
Blair, D. & Kimbrough, S. (2002). Exemplary documents: a foundation for information re-
Blei, D., Ng, A. & Jordan, M. (2003). Latent Dirichlet allocation, Journal of Machine Learning
Borlund, P. (2003a). The concept of relevance in IR, Journal of the American Society for
Borlund, P. (2003b). The IIR evaluation model: a framework for evaluation of interactive
Borlund, P. & Ingwersen, P. (1997). The development of a method for the evaluation of interac-
Borlund, P. & Ingwersen, P. (2000). Experimental components for the evaluation of interactive
Cerviño Beresi, U., Baillie, M. & Ruthven, I. (2008). Towards the evaluation of literature based
Cleverdon, C. W., Mills, J. & Keen, E. (1966). Factors Determining the Performance of
Indexing Systems, Vol. 1: Design, Vol. 2: Test Results, Aslib Cranfield Research
Cool, C., Belkin, N., Frieder, O. & Kantor, P. (1993). Characteristics of text affecting relevance
Cuadra, C. & Katter, R. (1967). Experimental studies of relevance judgments: Final report.
Vol. I: Project summary (NSF report no. TM-3520/001/00). Santa Monica, CA: System
Development Corp.
Deerwester, S., Dumais, S., Furnas, G., Landauer, T. & Harshman, R. (1990). Indexing by
latent semantic analysis, Journal of the American Society for Information Science
41(6): 391–407.
Doyle, L. (1961). Semantic road maps for literature searchers, Journal of the ACM 8(4): 553–
Ericsson, K. & Simon, H. (1993). Protocol analysis: verbal reports as data, MIT Press Cam-
bridge, MA.
Frey, B. & Dueck, D. (2007). Clustering by passing messages between data points, Science
315(5814): 972.
Gordon, M. & Dumais, S. (1998). Using latent semantic indexing for literature based discovery,
Journal of the American Society for Information Science and Technology 49(8): 674–
Gordon, M. & Lindsay, R. (1996). Toward discovery support systems: a replication, re-
connection between Raynaud’s and fish oil, Journal of the American Society for In-
Gordon, M., Lindsay, R. & Fan, W. (2002). Literature-based discovery on the world wide web,
Green, A. (1998). Verbal protocol analysis in language testing research: A handbook, Vol. 5,
Green, R. (1995). Topical relevance relationships. I. Why topic matching fails, Journal of the
Halevy, A., Norvig, P. & Pereira, F. (2009). The unreasonable effectiveness of data, IEEE
Harter, S. (1992). Psychological relevance and information science, Journal of the American
Hristovski, D., Peterlin, B., Mitchell, J. & Humphrey, S. (2003). Improving literature based
Hristovski, D., Peterlin, B., Mitchell, J. & Humphrey, S. (2005). Using literature-based discov-
74(2-4): 289–298.
Hristovski, D., Stare, J., Peterlin, B. & Dzeroski, S. (2001). Supporting discovery in medicine
by association rule mining in Medline and UMLS, Studies in Health Technology and
pp. 150–168.
Ingwersen, P. & Järvelin, K. (2007). On the holistic cognitive theory for information retrieval,
Kekäläinen, J. & Järvelin, K. (2002). Evaluating information retrieval systems under the chal-
ceptions of Library and Information Science, Seattle, WA, USA, July 21-25, 2002,
Kuhlthau, C. (2004). Seeking meaning: A process approach to library and information services,
Westport, CT.
Lafferty, J. & Zhai, C. (2001). Document language models, query models, and risk minimization
for information retrieval, Proceedings of the 24th annual international ACM SIGIR
Lin, J. (1991a). Divergence measures based on the shannon entropy, IEEE Transactions on
Lin, J. (1991b). Divergence measures based on the Shannon entropy, Information Theory,
the American Society for Information Science and Technology 50(7): 574–587.
Lowe, H. & Barnett, G. (1994). Understanding and using the medical subject headings (MeSH)
vocabulary to perform literature searches., JAMA: the journal of the American Med-
Mizzaro, S. (1998). How many relevances in information retrieval?, Interacting with computers
10(3): 303–320.
Ponte, J. & Croft, W. (1998). A language modeling approach to information retrieval, Research
Pratt, W. & Yetisgen-Yildiz, M. (2003). Litlinker: capturing connections across the biomedical
Pressley, M. & Afflerbach, P. (1995). Verbal protocols of reading: The nature of constructively
Agrawal, R., Mannila, H., Srikant, R., Toivonen, H. & Verkamo, A. (1996). Fast discovery of
association rules, Advances in knowledge discovery and data mining 12: 307–328.
Rees, A. & Schultz, D. (1967). A field experimental approach to the study of relevance assess-
ments in relation to document searching. final report to the national science founda-
Saracevic, T. (1975). Relevance: A review of and a framework for the thinking on the no-
tion in information science, Journal of the American Society for Information Science
26(6): 321–343.
tions of Library and Information Science (COLIS 2), Copenhagen, Denmark pp. 201–
Saracevic, T. (2006). Relevance: A Review of the Literature and a Framework for Thinking
on the Notion in Information Science. Part II, Advances in Librarianship 30: 69.
Saracevic, T. (2007). Relevance: A review of the literature and a framework for thinking on
the notion in information science. Part III: Behavior and effects of relevance, Journal
of the American Society for Information Science and Technology 58(13): 2126–2144.
Skeels, M. M., Henning, K., Yildiz, M. Y. & Pratt, W. (2005). Interaction design for literature-
based discovery, CHI ’05: extended abstracts on Human factors in computing systems,
Smalheiser, N., Torvik, V., Bischoff-Grethe, A., Burhans, L., Gabriel, M., Homayouni, R.,
Martone, A. K. M., Perkins, G., Price, D., Talk, A. & West, R. (2006). Collaborative
development of the arrowsmith two node search interface designed for laboratory
Spink, A., Wolfram, D., Jansen, M. & Saracevic, T. (2001). Searching the web: The public and
their queries, Journal of the American Society for Information Science and Technology
52(3): 226–234.
Srinivasan, P. (2004). Text mining: generating hypotheses from MEDLINE, Journal of the
47(2): 128–148.
Swanson, D. (1986a). Fish oil, Raynaud’s syndrome and undiscovered public knowledge, Per-
56(2): 103–118.
Swanson, D. (1988a). Historical note: Information retrieval and the future of an illusion,
Proceedings of the 14th annual international ACM SIGIR conference on Research and
Development in Information Retrieval, ACM Press, New York, NY, USA, pp. 280–
Taylor, K. L. & Dionne, J.-P. (2000). Accessing problem-solving strategy knowledge: The
Van Der Eijk, C., Van Mulligen, E., Kors, J., Mons, B. & Van Den Berg, J. (2004). Constructing
van Someren, M., Barnard, Y. & Sandberg, J. (1994). The think aloud method: a practical
Wang, P. & White, M. D. (1999). A cognitive model of document use during a research project.
Study II. Decisions at the reading and citing stages, Journal of the American Society
Ware, C. (1988). Color sequences for univariate maps: Theory, experiments and principles,
Weeber, M., Klein, H., Aronson, A., Mork, J., de Jong-van den Berg, L. & Vos, R. (2000). Text-
Weeber, M., Klein, H., de Jong-van den Berg, L. & Vos, R. (2001). Using Concepts in
Magnesium Discoveries, Journal of the American Society for Information Science and
Wilson, P. (1973). Situational relevance, Information storage and retrieval 9(8): 457–471.
Wren, J., Bekeredjian, R., Stewart, J., Shohet, R. & Garner, H. (2004). Knowledge discov-
20(3): 389–398.
Zhai, C. & Lafferty, J. (2004). A study of smoothing methods for language models applied to