Information Retrieval and Web Search
Content-rich XML: representation
[Figure: two Book trees. One has Title "Microsoft" and Author "Bill Gates"; the other has Title "The Pearly Gates" and Author "Bill Wulf". The leaf words are the lexicon terms.]
Encoding the Gates differently
What are the axes of the vector space?
In text retrieval, there would be a single axis
for Gates
Here we must separate out the two
occurrences, under Author and Title
Thus, axes must represent not only terms, but
something about their position in an XML tree
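A minimal sketch of the idea above, assuming the simplest possible position information: each axis is an (enclosing element, term) pair. The tuple encoding of documents, the `axes` helper, and the assignment of titles and authors to the two Book documents are assumptions made for illustration, not part of the original slides.

```python
from collections import Counter

# Hypothetical encoding: elements are (label, children) tuples,
# text leaves are plain strings.
book1 = ("Book", [("Title", ["Microsoft"]), ("Author", ["Bill Gates"])])
book2 = ("Book", [("Title", ["The Pearly Gates"]), ("Author", ["Bill Wulf"])])

def axes(node, parent=None):
    """Count (enclosing element, term) pairs; each distinct pair is one axis."""
    counts = Counter()
    if isinstance(node, str):
        # A text leaf contributes one count per word, tagged with its parent.
        for term in node.lower().split():
            counts[(parent, term)] += 1
        return counts
    label, children = node
    for child in children:
        counts += axes(child, label)
    return counts

vec1 = axes(book1)
vec2 = axes(book2)
print(sorted(vec1))
print(sorted(vec2))
```

With this encoding, Gates falls on the axis `("Author", "gates")` in the first document and on `("Title", "gates")` in the second, so the two occurrences no longer share an axis.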
Queries
Before addressing this, let us consider the
kinds of queries we want to handle
[Figure: two example query trees rooted at Book, using Title and Author nodes with the terms Microsoft, Gates, and Bill.]
Query types
The preceding examples can be viewed as
subtrees of the document
But what about:
[Figure: a query tree with Book at the root and Gates somewhere underneath.]
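[Figure: Book trees with Title and Author subtrees containing the terms Gates and Bill, alongside the text "To be or not to be".]
[Figure: a Book tree with Title and Author subtrees containing Gates and Bill; the weights 0.6 and 0.4 are assigned to its subtrees.]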
Weight propagation
The assignment of the weights 0.6 and 0.4
in the previous example to subtrees was
simplistic
Can be more sophisticated
Think of it as generated by an application,
not necessarily an end-user
Queries, documents become normalized
vectors
Retrieval score computation “just” a matter
of cosine similarity computation
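The scoring step can be sketched as follows, assuming query and document are already sparse vectors keyed by structural term. The (element, term) keys and the unit weights are illustrative assumptions; a real system would use propagated weights as discussed above.

```python
import math

def normalize(vec):
    """Scale a sparse vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {axis: w / norm for axis, w in vec.items()} if norm else {}

def cosine(query, doc):
    """Cosine similarity as the dot product of the normalized vectors."""
    q, d = normalize(query), normalize(doc)
    return sum(w * d.get(axis, 0.0) for axis, w in q.items())

# Hypothetical vectors over (element, term) axes.
query = {("Author", "gates"): 1.0}
doc_a = {("Title", "microsoft"): 1.0, ("Author", "bill"): 1.0, ("Author", "gates"): 1.0}
doc_b = {("Title", "gates"): 1.0, ("Author", "bill"): 1.0, ("Author", "wulf"): 1.0}

score_a = cosine(query, doc_a)
score_b = cosine(query, doc_b)
print(score_a, score_b)
```

Only `doc_a` shares an axis with the query, so `doc_b` scores zero even though it contains the word Gates.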
Restrict structural terms?
Depending on the application, we may
restrict the structural terms
E.g., may never want to return a Title node,
only Book or Play nodes
So don’t enumerate/index/retrieve/score
structural terms rooted at some nodes
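One way to picture this restriction, assuming structural terms are stored as label paths whose first entry is the root. The `RETURNABLE_ROOTS` setting and the path encoding are hypothetical.

```python
# Hypothetical application setting: only these node types may be returned.
RETURNABLE_ROOTS = {"Book", "Play"}

def restrict(structural_terms):
    """Keep only structural terms whose root label is returnable."""
    return [st for st in structural_terms if st[0] in RETURNABLE_ROOTS]

terms = [
    ("Book", "Title", "gates"),
    ("Title", "gates"),            # rooted at Title: dropped
    ("Play", "Title", "hamlet"),
]
kept = restrict(terms)
print(kept)
```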
The catch remains
This is all very promising, but …
How big is this vector space?
Can be exponentially large in the size of the
document
Cannot hope to build such an index
And in any case, still fails to answer queries
like
Book
(somewhere underneath)
Gates
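To get a feel for the growth, the sketch below counts all connected subtrees of an unlabeled tree. This is a simplification: it ignores labels and any requirement that a subtree contain a lexicon term, so treat it as an illustration of the blow-up rather than an exact count of structural terms. The list-of-children encoding and the `star` helper are assumptions.

```python
def rooted_subtrees(children):
    """Number of connected subtrees that contain the root of this tree.

    A tree is given as the list of its child trees; a leaf is [].
    Each child is either left out (1 way) or included together with
    one of its own rooted subtrees.
    """
    count = 1
    for child in children:
        count *= 1 + rooted_subtrees(child)
    return count

def all_subtrees(children):
    """Connected subtrees rooted at any node of the tree."""
    return rooted_subtrees(children) + sum(all_subtrees(c) for c in children)

def star(n):
    """A root with n leaf children."""
    return [[] for _ in range(n)]

small = all_subtrees(star(10))   # 11 nodes
large = all_subtrees(star(20))   # 21 nodes
print(small, large)
```

Doubling the number of children from 10 to 20 takes the count from about a thousand to about a million.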
Two solutions
Query-time materialization of axes
Restrict the kinds of subtrees to a
manageable set
Query-time materialization
Instead of enumerating all structural terms
of all docs (and the query), enumerate only
for the query
The latter is hopefully a small set
Now, we’re reduced to checking which
structural term(s) from the query match a
subtree of any document
This is tree pattern matching: given a text
tree and a pattern tree, find matches
Except we have many text trees
Our trees are labeled and weighted
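A minimal sketch of tree pattern matching on one labeled text tree, ignoring weights. It assumes the pattern root must match a node exactly and each pattern child must match some child of that node; two pattern children are allowed to match the same text child. The tuple encoding and function names are illustrative, and the trees are taken from the Hamlet example in these slides.

```python
def matches_at(pattern, text):
    """True if pattern matches with its root mapped onto the root of text."""
    p_label, p_children = pattern
    t_label, t_children = text
    if p_label != t_label:
        return False
    # Every pattern child must match at least one child of the text node.
    return all(any(matches_at(pc, tc) for tc in t_children) for pc in p_children)

def find_matches(pattern, text, path=()):
    """Yield the label path to every node of text where pattern matches."""
    label, children = text
    here = path + (label,)
    if matches_at(pattern, text):
        yield here
    for child in children:
        yield from find_matches(pattern, child, here)

play = ("Play", [
    ("Title", [("Hamlet", [])]),
    ("Act", [("Scene", [("Alas", []), ("poor", []), ("Yorick", [])])]),
])
hamlet_query = ("Title", [("Hamlet", [])])
yorick_query = ("Title", [("Yorick", [])])

hits = list(find_matches(hamlet_query, play))
misses = list(find_matches(yorick_query, play))
print(hits, misses)
```

With many text trees, the same search would run once per document, which is why the size of the query's structural-term set matters.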
Example
Here we seek a doc with Hamlet in the title
On finding the match, we compute the cosine similarity score
After all matches are found, rank by sorting
Text =
Play
  Title
    Hamlet
  Act
    Scene
      Alas poor Yorick
Query =
Title
  Hamlet
(Still infeasible)
A doc with Yorick somewhere in it:
Query =
Title
Yorick
Will get to it …
Restricting the subtrees
Enumerating all structural terms (subtrees) is
prohibitive for indexing
Most subtrees may never be used in
processing any query
Can we get away with indexing a restricted
class of subtrees?
Ideally – focus on subtrees likely to arise in
queries
JuruXML (IBM Haifa)
Only paths including a lexicon term
In this example there are only 14 (why?) such paths
Thus we have 14 structural terms in the index
Play
  Title
    Hamlet
  Act
    Scene
      To be or not to be
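One counting convention that reproduces the figure of 14 is sketched below. It is an assumption, not taken from the JuruXML documentation: a structural term is a downward path of one or more element labels that ends in a distinct (case-folded) lexicon term. The tree encoding and the `path_terms` helper are hypothetical.

```python
def path_terms(node, ancestors=()):
    """Collect paths of element labels that end in a lexicon term."""
    terms = set()
    if isinstance(node, str):
        for word in node.lower().split():
            # One path per suffix of the ancestor chain, so every path
            # has at least one element label before the term.
            for start in range(len(ancestors)):
                terms.add(ancestors[start:] + (word,))
        return terms
    label, children = node
    for child in children:
        terms |= path_terms(child, ancestors + (label,))
    return terms

play = ("Play", [
    ("Title", ["Hamlet"]),
    ("Act", [("Scene", ["To be or not to be"])]),
])
structural_terms = path_terms(play)
print(len(structural_terms))
```

Under this reading, Hamlet contributes 2 paths (from Title and from Play), and each of the 4 distinct words to, be, or, not contributes 3 paths (from Scene, Act, and Play).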
Bill Gates
No known DTD.
Query seeks Gates under Author.
Handling descendants in the vector space
Devise a match function that yields a score in [0,1]
between structural terms
E.g., when the structural terms are paths, measure
overlap
[Figure: the paths Book/Bill vs. Book/Author/Bill, matched in the path Book/Author/LastName/Bill.]
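A sketch of one possible overlap measure for path-shaped structural terms. The scoring rule is an assumption chosen for illustration, not the rule used by any particular system: the query path must appear in order within the document path (gaps allowed, which is what lets Author match above LastName), and the score is the fraction of the document path it covers.

```python
def is_subsequence(short, long):
    """True if the labels of short appear in long in the same order."""
    remaining = iter(long)
    # 'label in remaining' consumes the iterator up to the first hit,
    # so later labels can only match further along the path.
    return all(label in remaining for label in short)

def path_match(query_path, doc_path):
    """Hypothetical match score in [0, 1] between two label paths."""
    if not query_path or not is_subsequence(query_path, doc_path):
        return 0.0
    return len(query_path) / len(doc_path)

doc_path = ("Book", "Author", "LastName", "Bill")
loose = path_match(("Book", "Bill"), doc_path)
closer = path_match(("Book", "Author", "Bill"), doc_path)
exact = path_match(doc_path, doc_path)
wrong_order = path_match(("Author", "Book"), doc_path)
print(loose, closer, exact, wrong_order)
```

The more of the document path the query names, the higher the score, and an exact path match scores 1.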
[Figure: a query structural term matches index entry ST5 with Match = 0.63; the postings for ST5 are Doc3 (1.0), Doc6 (0.8), Doc9 (0.6). ST = Structural Term.]
Corpus
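\[
f_{\text{generalized}}(rel, cov) =
\begin{cases}
1.00 & \text{if } (rel, cov) = 3E \\
0.75 & \text{if } (rel, cov) \in \{2E, 3L, 3S\} \\
0.50 & \text{if } (rel, cov) \in \{1E, 2L, 2S\} \\
0.25 & \text{if } (rel, cov) \in \{1S, 1L\} \\
0.00 & \text{if } (rel, cov) = 0N
\end{cases}
\]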
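The quantization above amounts to a table lookup. The sketch below assumes, from the codes in the table, that rel is a digit from 0 to 3 and cov is one of the letters E, L, S, N; the function and constant names are illustrative. Combinations not listed in the table raise a `KeyError` rather than being given a guessed value.

```python
# Values copied from the generalized quantization table.
F_GENERALIZED = {
    "3E": 1.00,
    "2E": 0.75, "3L": 0.75, "3S": 0.75,
    "1E": 0.50, "2L": 0.50, "2S": 0.50,
    "1S": 0.25, "1L": 0.25,
    "0N": 0.00,
}

def f_generalized(rel, cov):
    """Map an assessment (rel, cov) to its scalar value."""
    return F_GENERALIZED[f"{rel}{cov}"]

print(f_generalized(3, "E"), f_generalized(2, "L"), f_generalized(0, "N"))
```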
The f-values
Scalar measure of goodness of a retrieved
element
Can compute f-values for varying numbers
of retrieved elements: 10, 20, … etc.
Means for comparing engines.
From raw f-values to … ?
INEX provides a method for turning these
into precision-recall curves
“Standard” issue: only elements returned by
some participant engine are assessed
Lots more commentary (and proceedings
from previous INEX bakeoffs):
http://inex.is.informatik.uni-duisburg.de:2004/
See also previous years
Resources
Querying and Ranking XML Documents
Torsten Schlieder, Holger Meuss
http://citeseer.ist.psu.edu/484073.html
Generating Vector Spaces On-the-fly for Flexible XML Retrieval.
T. Grabs, H-J Schek
www.cs.huji.ac.il/course/2003/sdbi/Papers/ir-xml/xmlirws.pdf
JuruXML - an XML retrieval system at INEX'02.
Y. Mass, M. Mandelbrod, E. Amitay, A. Soffer.
http://einat.webir.org/INEX02_p43_Mass_etal.pdf
See also INEX proceedings online.