
Student Name: ____________________________

Midterm Exam
Information Retrieval (INLS 509)
March 6th, 2013

Answer all of the following questions. Each answer should be thorough, complete, and relevant. Points
will be deducted for irrelevant details. Use the back of the pages if you need more room for your answer.

The exam should take you about 60 minutes to complete. The points are a clue about how much time you
should spend on each question. Plan your time accordingly.

Please check that your exam has 9 pages (including this one).

Good luck!

Question   Points
1          20
2          25
3          20
4          15
5          20
TOTAL      100
1. Document Representation [20 points]

Given a query, a search engine uses an index to decide which documents to retrieve. We discussed
two general approaches to document representation (deciding what goes in the index): controlled
vocabularies and free-text indexing.

(a) Give two advantages of controlled-vocabulary indexing over free-text indexing. [10 points]

Any of the following were accepted:

- Concepts do not need to appear explicitly in the text


- Can express relationships between concepts, which facilitates non-query based
navigation/exploration (e.g., http://www.dmoz.org) and provides a holistic view of
the major topics covered in the collection.
- Developed by experts who know the data and the users (i.e., they know what the most
prevalent information needs are)
- Concepts are unambiguous. For example, if the controlled vocabulary is organized in
a hierarchy, then a concept's place in the hierarchy (i.e., its relation to more
general/specific concepts) disambiguates its meaning.
- Controlled vocabulary indexes require less space because they typically have a
shorter indexed vocabulary. [This may be an advantage in cases where storage-space
is limited. Typically, however, disk-space is cheap.]

(b) Give two advantages of free-text indexing over controlled-vocabulary indexing. [10 points]

Any of the following were accepted:

- Free-text indexing can be used to represent any arbitrary level of detail. Relevant
documents can be found even when the query-topic is not a central topic in the
document.
- Does not require that users know the controlled vocabulary. Users can query based
on terms that occur within the documents, which may not necessarily be technical
terms.
- Assigning index-terms to documents is inexpensive.
2. Text Processing and Indexed-Vocabulary Size [25 points]

Free-text indexing requires making a few text-processing decisions that determine what goes in the
index. For each of the following text-processing decisions (a)-(e), specify whether it is likely to
increase or decrease the size of the indexed vocabulary (i.e., the set of unique terms represented in
the index). Provide a short explanation (one sentence is enough).

(a) Down-casing (making all the text lower-case) [5 points]:

Down-casing would reduce the size of the indexed vocabulary. All word variations
based on capitalization (e.g., SMART vs. Smart vs. smart) would become the same entry
in the index (e.g., smart).
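
As a minimal sketch in Python (the toy tokens are hypothetical, not from the exam), the effect on vocabulary size is easy to see:

    # Down-casing collapses capitalization variants into one vocabulary entry.
    tokens = ["SMART", "Smart", "smart", "retrieval"]
    print(len(set(tokens)))                  # 4 distinct entries before down-casing
    print(len({t.lower() for t in tokens}))  # 2 distinct entries after down-casing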

(b) Stemming [5 points]:

Stemming would reduce the size of the indexed vocabulary. Morphological variants
(e.g., computer, computing, and computation) would become the same entry in the
index (e.g., comput).
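
A quick sketch of the same effect, assuming NLTK and its Porter stemmer are available (the exact stems depend on the stemmer used):

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    words = ["computer", "computing", "computation"]
    # The Porter stemmer typically maps all three variants to the stem "comput",
    # collapsing three vocabulary entries into one.
    print({stemmer.stem(w) for w in words})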

(c) Removing stopwords [5 points]:

Stopword-removal would reduce the size of the indexed vocabulary by eliminating those
entries that correspond to stopwords.

Some answers incorrectly stated that removing stopwords would reduce the vocabulary
size by 50%. Removing stopwords would reduce the size of the index by about 50%, but
it would reduce the size of the indexed vocabulary only by the number of stopwords
(typically ~50-250 terms).
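
The distinction between index size and vocabulary size can be sketched as follows (the stoplist and text below are toy examples; real stoplists have ~50-250 terms):

    # Stopwords are few in type but frequent in occurrence, so removing them
    # shrinks the index (total postings) far more than the vocabulary.
    stopwords = {"the", "a", "of", "and"}
    tokens = ["the", "cat", "sat", "on", "the", "mat", "and", "the", "dog"]
    kept = [t for t in tokens if t not in stopwords]
    print(len(tokens), "->", len(kept))            # occurrences: 9 -> 5
    print(len(set(tokens)), "->", len(set(kept)))  # vocabulary: 7 -> 5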
(d) Distinguishing whether a specific word occurs in the TITLE field or the BODY field of a
document [5 points]:

Distinguishing between terms in different fields would increase the size of the indexed
vocabulary. Potentially, it could double its size. Any term that occurs in both the TITLE
field and the BODY field would need to be a separate entry in the index: term.TITLE and
term.BODY (and potentially term.ANY).
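
One way to realize this is sketched below; the term.FIELD naming follows the answer above, and the toy document is hypothetical:

    # Emit one index term per (term, field) pair, so a word occurring in both
    # fields contributes two separate vocabulary entries.
    doc = {"TITLE": ["smart", "retrieval"], "BODY": ["smart", "systems"]}
    index_terms = {f"{term}.{field}" for field, terms in doc.items() for term in terms}
    print(sorted(index_terms))
    # ['retrieval.TITLE', 'smart.BODY', 'smart.TITLE', 'systems.BODY']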

(e) Distinguishing between different senses of the same word (for example, jaguar the car and
jaguar the animal) [5 points]:

Distinguishing between different senses of the same word would increase the size of the
indexed vocabulary. A term with n senses or meanings would have n entries in the
index (one per sense).
[Side note: In an end-to-end system, we would have to have a component that would map
each input query-term to a particular sense.]
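
A sketch of sense-tagged indexing, assuming a (hypothetical) word-sense disambiguation component has already labeled each occurrence:

    # A word with n observed senses yields n vocabulary entries.
    occurrences = [("jaguar", "animal"), ("jaguar", "car"), ("jaguar", "animal")]
    vocabulary = {f"{word}.{sense}" for word, sense in occurrences}
    print(sorted(vocabulary))  # ['jaguar.animal', 'jaguar.car']: two entries for one word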
3. Retrieval Models [20 points]

The goal of a retrieval model is to score and rank documents for a query. Different retrieval models
make different assumptions about what makes a document more (or less) relevant than another.
Suppose we have a single-term query, and suppose the query term occurs exactly twice in each of
two documents, D101 and D123. Answer the following questions.

(a) Would the ranked Boolean retrieval model necessarily give both documents the same score? If
not, what information would determine which document is scored higher? [5 points]

The ranked Boolean model scores documents based on the number of ways the document
redundantly satisfies the query. In this case, we have a single-term query which happens
to occur twice in each document. Therefore, each document would obtain a score of two.
So, both documents would necessarily have the same score.

(b) Would the inner product (with a binary representation) necessarily give both documents the same
score? If not, what information would determine which document is scored higher? [5 points]

The inner product (with a binary representation) scores documents based on the number of
terms in common between the query and the document. Again, we have a single-term query
that occurs in both documents, so each document would have an inner-product score of one.
Both documents would necessarily have the same score.
(c) Would the cosine similarity (with a binary representation) necessarily give both documents the
same score? If not, what would determine which document is scored higher? [5 points]

The cosine similarity is basically the inner-product divided by the vector length of the
query times the vector length of the document. The vector length of the document is the
square root of the number of unique terms. So, the scores given to both documents could
be different, if the number of unique terms in both documents were different. The
document with fewer unique terms (note: not fewer term occurrences) would get a
higher score.
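
To make the contrast between (b) and (c) concrete, here is a minimal sketch with binary term vectors (the documents and terms are toy examples, not from the exam):

    import math

    query = {"jaguar"}                                    # single-term query
    d101 = {"jaguar", "car", "fast"}                      # 3 unique terms
    d123 = {"jaguar", "animal", "jungle", "cat", "prey"}  # 5 unique terms

    for doc in (d101, d123):
        inner = len(query & doc)  # binary inner product: terms in common
        cosine = inner / (math.sqrt(len(query)) * math.sqrt(len(doc)))
        print(inner, round(cosine, 3))
    # Both inner products are 1; the document with fewer unique terms
    # gets the higher cosine (0.577 vs. 0.447).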

(d) Would the query-likelihood model (without linear interpolation) necessarily give both documents
the same score? If not, what would determine which document is scored higher? [5 points]

The query-likelihood model scores documents based on the probability of the query given
the document language model. For a single-term query and assuming no linear
interpolation, this results in the proportion of the document's text associated with the
query term: the number of times the term occurs divided by the total number of term
occurrences in the document. Because we do not know the total number of term occurrences
in each document, we cannot say for sure that both documents would get the same score.
The document with fewer term occurrences would get a higher score.
4. Query-likelihood Model [15 points]

The query-likelihood model scores documents based on the probability of the query given the
document language model.

(a) Suppose you have a collection of 4 documents (shown below) and the query "humpty dumpty".
What would be the score given to each document (with no linear interpolation)? [15 points]

Doc1: Humpty Dumpty sat on a wall,
Doc2: Humpty Dumpty had a great fall.
Doc3: All the king's horses and all the king's men
Doc4: Couldn't put Humpty together again.

Doc1: (1/6) x (1/6) = (1/36)

Doc2: (1/6) x (1/6) = (1/36)

Doc3: (0/9) x (0/9) = 0

Doc4: (1/5) x (0/5) = 0
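
The same arithmetic as a short sketch, using the query "humpty dumpty" and maximum-likelihood estimates (no smoothing or interpolation):

    docs = {
        "Doc1": "humpty dumpty sat on a wall",
        "Doc2": "humpty dumpty had a great fall",
        "Doc3": "all the king's horses and all the king's men",
        "Doc4": "couldn't put humpty together again",
    }
    query = ["humpty", "dumpty"]

    for name, text in docs.items():
        tokens = text.split()
        score = 1.0
        for q in query:
            score *= tokens.count(q) / len(tokens)  # P(term | document model)
        print(name, score)
    # Doc1 and Doc2 score 1/36; Doc3 and Doc4 score 0, because one
    # missing query term zeroes out the whole product.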


5. Document Priors [20 points]
Regardless of the query, some documents tend to be more relevant than others for different reasons:
authoritativeness, better formatting, lack of profanity, etc. Favoring some documents over others can
be achieved by assigning each document a prior probability of relevance (or prior, for short).¹
Answer the following questions. Please read both questions before you start. [Hint: be specific]

(a) You want to build a search engine on top of a collection of blog posts and want to favor blog
posts with lots of user-generated comments (i.e., you see comments as a surrogate for
interestingness). Using this information, how would you assign each document a prior
probability of relevance? [10 points]

The goal of a document prior is to multiply the query-likelihood score by the prior
probability that the document is relevant to any query.

Score(q,d) = P(q|d) x P(d)

We can set the prior to whatever we want as long as the sum of priors for all documents
in the collection equals 1.0.

If we want to favor blog posts with many comments, we could potentially set each post's
prior to:

(# of comments in the post) / (# of comments in all posts)

This is basically the proportion of total comments that are found in the particular
document.

¹ In probabilistic retrieval, the prior probability that the document is relevant (or prior,
for short) is multiplied by the query-likelihood to produce a final document score.
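
As a sketch with hypothetical comment counts:

    # Priors proportional to comment counts; they sum to 1.0 by construction.
    comments = {"post1": 12, "post2": 3, "post3": 0}  # hypothetical counts
    total = sum(comments.values())
    priors = {post: count / total for post, count in comments.items()}
    print(priors)  # {'post1': 0.8, 'post2': 0.2, 'post3': 0.0}
    # Note that post3, with zero comments, gets a prior (and final score) of zero.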
(b) You want to avoid always giving blog posts with zero comments a retrieval score of zero.
Otherwise, they would never be retrieved! How could you mitigate this problem? [10 points]

To avoid giving posts with no comments a prior of zero and, therefore, a final score of
zero, we could use add-one smoothing.

This is equivalent to giving every post one imaginary comment. Posts that previously
had zero comments would now have one and posts that previously had several would
now have one additional comment.

This would correspond to:


(# of comments in the post + 1) / (# of comments in all posts + # of posts)
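
Continuing the sketch above, add-one smoothing keeps every prior nonzero:

    # Every post gets one imaginary comment, so no prior is ever zero.
    comments = {"post1": 12, "post2": 3, "post3": 0}  # hypothetical counts
    total = sum(comments.values()) + len(comments)    # + one imaginary comment per post
    priors = {post: (count + 1) / total for post, count in comments.items()}
    print(priors)  # post3 now gets a small but nonzero prior: 1/18, about 0.056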

You might also like