XML Query Processing: A Survey ∗
Gang Gou, Rada Chirkova
Department of Computer Science
North Carolina State University
Raleigh, NC USA 27695-8207
Email: [email protected], [email protected]
May 10, 2005
Abstract
XML (Extensible Markup Language) is emerging as a de facto standard for information
exchange among various applications on the web because of its inherent data self-describing
capability and flexibility of organizing data. With increased impact of XML on information
exchange, it is particularly important to develop high-performance techniques to query large
XML data repositories efficiently.
The core of XML query processing is twig pattern matching, i.e. finding from XML
documents all matches that satisfy the twig (or path) pattern specified by a given query. In
this survey we will review and compare major techniques for processing XML twig queries.
We categorize these techniques into three classes based on the storage format of XML data.
First, we review the file approach, in which XML data have to be stored in commonly used
flat files, in the form of just original XML documents, for special-purpose applications. Then,
we review the relational approach, in which XML data are stored in relational databases so
that all existing important techniques that have been developed for relational databases can
be fully reused and so no extra development efforts are needed. Finally, we review the native
approach, in which XML data are stored in inverted lists and native algorithms are developed
to further improve XML query processing performance.
To the best of our knowledge, this is the first survey work that systematically reviews,
classifies, and compares state-of-the-art techniques for XML query processing.
*All copyrights of this technical report are reserved by the authors and North Carolina State University.
1 Introduction
XML (Extensible Markup Language) is emerging as a de facto standard for information exchange among
various applications on the web because of its inherent data self-describing capability and flexibility of
organizing data [Gro04a].
First, data in XML documents are self-describing. Similar to the familiar HTML (HyperText Markup
Language), XML is based on nested tags. Figure 1 (a) shows an example of an XML document, which
records information about publishers. However, unlike HTML, in which tags associated with data are used
to express the presentation style (e.g. font styles) of data, tags in XML are used to describe the semantics
of data. For example, Line 3 in Figure 1 (a) says that ‘Cambridge’ is an address of a publisher named
‘MITPress’. Therefore, when an application receives an XML document from another application over the
web, it can understand the content of this XML document, since data in XML documents are self-describing.
Second, XML is flexible in organizing data. The nested hierarchy of tags structures the content of XML
documents. The role of nested tags is somewhat similar to that of schemas in relational databases. However, the
nested XML model is more flexible than the flat relational model. Objects of the same kind in an XML document
might have different kinds of sub-objects or different numbers of sub-objects of the same kind. For example,
in Figure 1 (a), the first publisher has an address sub-element but the second publisher does not. The book
under the first publisher has two author sub-elements but the book under the second publisher has only one
author sub-element.
[Figure 1 panels: (a) XML document without ID/IDREF attributes; (b) XML document with ID/IDREF attributes]
Figure 1: XML documents
1.1 Data Model
1.1.1 Basic Model: Tree
The basic data model of XML is a directed, rooted, labeled, and ordered tree. Figure 2 (a) and (b) show
the XML data tree of the XML document in Figure 1 (a) (footnote 1). Figure 2 (a) is based on a node-labeled model
where labels are on nodes, and Figure 2 (b) is based on an edge-labeled model where labels are on edges.
These two models are equivalent. Most research papers use the node-labeled model, while the edge-labeled
model is also used in some scenarios, such as in the Edge approach that will be introduced in Section 4.2.
Here we explain the XML data tree based on the node-labeled model, and analogous explanations can also
be applied to the edge-labeled model.
There are three classes of nodes in a data tree. (1) Element Node (internal node). This class of nodes
corresponds to tags in XML documents, such as publisher, address, etc. Labels on element nodes are just tags
in XML documents. (2) Attribute Node (internal node). This class of nodes corresponds to attributes in XML
documents, such as ‘@name’ under the first publisher element. In contrast to element nodes, attribute nodes
are not nested (i.e. an attribute cannot have any sub-elements), are not repeatable (i.e. two same-name
attributes cannot occur under one element), and are unordered (i.e. attributes of an element can freely
interchange their occurrence locations under this element). (3) Value Node (leaf node). This class of nodes
corresponds to data values in XML documents, such as ‘MITPress’, ‘Database’, etc.
Edges in a data tree represent structural relationships between elements/attributes/values.
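The three node classes can be illustrated with a short Python sketch. It is only an illustration under
stated assumptions: the document DOC is a hypothetical toy example (not the report's Figure 1), the helper
classify_nodes is a name introduced here, and text that follows a child element (mixed content) is ignored.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy document, loosely in the spirit of the publisher example.
DOC = """<publishers>
  <publisher name="MITPress">
    <address>Cambridge</address>
    <book><title>Database</title><author>John</author></book>
  </publisher>
</publishers>"""


def classify_nodes(elem, out=None):
    """Collect (kind, label) pairs in document order for the node-labeled model."""
    if out is None:
        out = []
    out.append(("element", elem.tag))             # element node, labeled by its tag
    for name, value in elem.attrib.items():
        out.append(("attribute", "@" + name))     # attribute node
        out.append(("value", value))              # value node under the attribute
    text = (elem.text or "").strip()
    if text:
        out.append(("value", text))               # value node (leaf)
    for child in elem:                            # children are visited in document order
        classify_nodes(child, out)
    return out


nodes = classify_nodes(ET.fromstring(DOC))
for kind, label in nodes:
    print(f"{kind:9s} {label}")
```

Element nodes carry tags as labels, the attribute node is written with the ‘@’ prefix used in the text,
and value nodes appear only as leaves.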
1.1.2 Extended Model: DAG and General Graph
XML documents allow users to define ID/IDREF attributes of elements, where the id attribute is used to
uniquely identify an element and idref attributes are used to refer to other elements which are explicitly
identified by their id attributes. ID/IDREF attributes increase the flexibility of the XML model so that
elements in XML documents may directly refer to each other freely. Figure 1 (b) shows an XML document
with ID/IDREF attributes, where the newly introduced ID/IDREF attributes are underlined.
Therefore, in addition to the original tree edges in XML data trees, which describe the main skeleton of structural
relationships in XML documents, ID/IDREF edges are also introduced into the XML data model to represent
direct reference relationships between elements. This extends the original tree model to a DAG (Directed
Acyclic Graph) or to an even more general graph with cycles. Figure 2 (c) is such a graph with cycles; it
corresponds to the XML document in Figure 1 (b).
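The effect of reference edges on the shape of the data model can be sketched in Python as follows. This
is a hedged illustration: without a DTD, the sketch simply assumes that an attribute named id plays the
role of an ID and an attribute named ref plays the role of an IDREF; the toy document and the helper names
build_graph and has_cycle are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy document: the book refers back to its publisher.
DOC = """<publishers>
  <publisher id="p1">
    <book id="b1" ref="p1"><title>Database</title></book>
  </publisher>
</publishers>"""


def build_graph(root):
    """Return adjacency lists containing tree edges plus id/ref reference edges."""
    by_id = {e.get("id"): e for e in root.iter() if e.get("id")}
    edges = {e: list(e) for e in root.iter()}      # tree edges: parent -> children
    for e in root.iter():
        target = by_id.get(e.get("ref"))
        if target is not None:
            edges[e].append(target)                # reference edge: e -> referenced element
    return edges


def has_cycle(edges):
    """Depth-first search with three colors; a grey successor means a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {n: WHITE for n in edges}

    def visit(n):
        color[n] = GREY
        for m in edges[n]:
            if color[m] == GREY or (color[m] == WHITE and visit(m)):
                return True
        color[n] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in edges)


print(has_cycle(build_graph(ET.fromstring(DOC))))
```

In the toy document the tree edge from the publisher to the book and the reference edge from the book back
to the publisher form a cycle, so the data model is no longer a tree or a DAG.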
1.2 XML Queries
Unlike keyword search in text retrieval, which concerns only the contents of text documents, XML queries concern
structure as well as contents of XML documents.
1.2.1 XPath
XPath [Gro04c] is a basic XML query language that is used to select nodes from XML documents such
that the path from the root to each selected node satisfies a specified pattern. A simple XPath query is
Footnote 1: We defer discussing the pairs of numbers adorning nodes until later.
[Figure 2 panels: (a) node-labeled XML data tree, with a pair of numbers on each node; (b) edge-labeled XML data tree; (c) XML data graph with ID/IDREF edges]
Figure 2: XML data model
specified by a sequence of alternating axes and tags. Two commonly used axes are the child axis ‘/’, where ‘A/B’
denotes selecting B-tagged child nodes of A-tagged nodes, and the descendant axis ‘//’, where ‘A//B’ denotes
selecting B-tagged descendant nodes of A-tagged nodes. An example XPath query is "/Publisher//title"
(its standard form should be "root/Publisher//title", but "root" is always omitted for simplicity), which
returns all book titles of all publishers. The result of this query against the data tree in Figure 2 (a) is a set
of title nodes that have values ‘Database’ and ‘Life’.
The query pattern specified by the XPath query above is a simple path pattern shown in Figure 3 (a)
where the arrow with ‘=’ denotes the ‘//’ axis. Generally, an XPath query can specify a more complex
tree pattern (also called twig pattern) by introducing selection predicates into XPath expressions. One such
example is "/Publisher[@name = ‘MITPress’]/book/title", in which "/Publisher/book/title" is the main
path of this query and the content between ‘[’ and ‘]’ is a selection predicate. This query returns all book
titles of the publisher named ‘MITPress’. The pattern of this query is shown in Figure 3 (b). Generally,
multiple selection predicates might be involved in XPath queries.
[Figure 3 panels: (a) path pattern; (b) twig pattern]
Figure 3: Query pattern
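The two kinds of XPath queries discussed above can be tried out with the limited XPath subset supported by
Python's standard xml.etree.ElementTree module. This is a sketch under assumptions: the document is a
hypothetical toy example with lowercase tags (so the paths are written in lowercase), the publisher name
‘XYZPress’ is invented for illustration, and the paths are evaluated relative to the root element.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy document; the second publisher has no address.
DOC = """<publishers>
  <publisher name="MITPress">
    <address>Cambridge</address>
    <book><title>Database</title></book>
  </publisher>
  <publisher name="XYZPress">
    <book><title>Life</title></book>
  </publisher>
</publishers>"""

root = ET.fromstring(DOC)  # root is the <publishers> element

# Path pattern: a child step followed by a descendant ('//') step.
all_titles = [t.text for t in root.findall("./publisher//title")]

# Twig pattern: the predicate in brackets adds a branch to the main path.
mit_titles = [t.text for t in root.findall("./publisher[@name='MITPress']/book/title")]

print(all_titles)
print(mit_titles)
```

The first query returns the titles of all publishers; the second returns only the titles under the
publisher whose name attribute satisfies the selection predicate.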
1.2.2 XQuery
XQuery [Gro04e, Cha02] is another popular XML query language, which is an extension to XPath and
is more powerful than XPath. It is a functional language comprised of FLWR (For-Let-Where-Return)
clauses that can be nested and composed with full generality. For and Let clauses bind nodes selected by
XPath expressions to user-defined node variables. Where clauses specify selection or join predicates on node
variables. Return clauses operate on node variables to construct a new XML document as the query result.
Figure 4 (a) shows a simple XQuery, which groups books by their publisher addresses. The query pattern is
shown in Figure 4 (b), and the format of the resulting XML document is shown in Figure 4 (c), where the edge
annotation means that a books node might have multiple book nodes as children. This example shows
that XQuery logically (rather than physically) includes two parts: twig pattern matching (defined by FLW)
and result construction (defined by Return).
Tree algebras have been developed to express more complex XQueries. [JAKC+02, PAKJ+02, PWLJ04]
address transforming XQuery to an algebraic tree. The algebraic tree represents an efficient logical plan of
answering XQuery. Each node in this tree is a tree algebraic operator. The basic tree algebraic operators
are selection, projection, and grouping, each of which takes one or multiple twig patterns as inputs.
[Figure 4 panels: (a) XQuery; (b) query pattern; (c) format of the resulting XML document]
Figure 4: An example of XQuery
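The two logical parts of an XQuery, twig pattern matching and result construction, can be mimicked in
Python. The sketch below is not the XQuery of Figure 4 (a); it is a hedged illustration on a hypothetical
toy document, and the output format (a books element per address, with the address stored as an attribute)
is an assumption made here for brevity.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical toy document with two publishers.
DOC = """<publishers>
  <publisher name="MITPress"><address>Cambridge</address>
    <book><title>Database</title></book>
    <book><title>Networks</title></book>
  </publisher>
  <publisher name="XYZPress"><address>Raleigh</address>
    <book><title>Life</title></book>
  </publisher>
</publishers>"""

root = ET.fromstring(DOC)

# Part 1, twig pattern matching: bind (address, book) pairs under the same publisher.
tuples = [(addr.text, book)
          for pub in root.findall("publisher")
          for addr in pub.findall("address")
          for book in pub.findall("book")]

# Part 2, result construction: group the matched books by address.
groups = defaultdict(list)
for addr, book in tuples:
    groups[addr].append(book)

result = ET.Element("result")
for addr, books in groups.items():
    books_node = ET.SubElement(result, "books", {"address": addr})
    books_node.extend(books)   # one books node may hold several book children

print(ET.tostring(result, encoding="unicode"))
```

The intermediate list of (address, book) pairs is what the survey calls tuple solutions; the grouping step
only rearranges them into the output document.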
1.2.3 Summary
The core of both XPath and XQuery queries is twig pattern matching (also called twig query), i.e. finding
from XML documents all matches that satisfy the twig (or path) pattern specified by a given query. We
call nodes in XML data trees data nodes and nodes in query twigs query nodes. For XPath queries, the
output of twig pattern matching is a set of data nodes whose corresponding query node is the end node of
the main path in a query twig. For example, the output of matching the twig pattern in Figure 3 (b) is a
set of title nodes. We call this type of output single-node solutions. For XQuery queries, the output of twig
pattern matching is a set of tuples of data nodes that correspond to multiple query nodes in a query twig.
For example, the output of matching the twig pattern in Figure 4 (b) is a set of (address, book) tuples, but
not a set of only book or address nodes. We call this type of output tuple solutions.
Another important point is that current XPath and XQuery do not support ID/IDREF axis queries, i.e.
they always assume that queries work on the tree-shaped XML data model. In fact, this assumption has also been
adopted by most research papers on XML query processing. The first reason for this assumption is
that a general graph-shaped data model significantly increases the complexity of XML query processing. The
second reason is that graph-shaped XML documents with ID/IDREF attributes are uncommon in practical
applications. We therefore keep this assumption in this survey unless explicitly stated otherwise.
In the remainder of this survey we review major techniques for processing XML twig queries. We
categorize these techniques into three classes based on the storage format of XML data. Section 3 introduces
the file approach, in which XML data must be stored in commonly used flat files, as required by special-
purpose applications. Sections 4 and 5 introduce the relational approach and the native approach, in which
XML data are stored in relational databases and inverted lists, respectively. With value indexes and structural
indexes available in these two approaches, XML queries can be answered much more efficiently than in the
file approach. Before we begin to review these approaches, we first introduce numbering schemes.
2 Numbering Schemes
In this section, we introduce numbering schemes that can overcome the weakness of the file approach and
have been taken as an important foundation for many techniques in the relational and native approaches.
Edges in XML data trees represent structural relationships between data nodes. The key idea of answering
XML twig queries is determining structural relationships, or more specifically reachability, between any
pair of nodes in XML data trees. For example, in order to answer a path query ‘A//B’, given any pair of
an A-tagged node and a B-tagged node, say (a, b), in a data tree, we need to determine whether there exists a
path from a to b.
A straightforward method of determining reachability is tree navigation [MW99], which consists of either
traversing the subtree rooted at an A-tagged node to see if a B-tagged node can be found (forward navigation),
or, more intelligently, backtracking from a B-tagged node upwards to see if an A-tagged node can be found
(backward navigation). Backward navigation is usually more efficient because each node in a tree has only
one incoming path from the root but multiple outgoing paths. However, if A-tagged nodes are more selective
than B-tagged nodes, i.e. most A nodes have B descendants but most B nodes have no A ancestors, then
forward navigation might be more efficient. Therefore, a trade-off has to be determined, which is just the
motivation of hybrid navigation [MW99]. However, on the whole, the navigational method is not efficient,
since both forward and backward navigation involve traversing a large number of irrelevant nodes, i.e. nodes
tagged with neither A nor B. For example, for a path ‘/A/D/E/F/B’ in a data tree, irrelevant nodes tagged
with D, E or F also have to be traversed to answer the query ‘A//B’ when the navigational method is
used.
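Forward and backward navigation can be sketched in a few lines of Python. The Node class and the two
function names are hypothetical helpers introduced for this illustration; the tree is the single path
‘/A/D/E/F/B’ from the example above.

```python
class Node:
    """Minimal tree node with a parent pointer and a child list."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def forward_navigation(a, tag_b):
    """Traverse the subtree rooted at a and collect descendants tagged tag_b."""
    found, visited, stack = [], 0, list(a.children)
    while stack:
        n = stack.pop()
        visited += 1
        if n.tag == tag_b:
            found.append(n)
        stack.extend(n.children)
    return found, visited


def backward_navigation(b, tag_a):
    """Walk from b towards the root and collect ancestors tagged tag_a."""
    found, visited, n = [], 0, b.parent
    while n is not None:
        visited += 1
        if n.tag == tag_a:
            found.append(n)
        n = n.parent
    return found, visited


# Build the path /A/D/E/F/B.
a = Node("A")
d = Node("D", a)
e = Node("E", d)
f = Node("F", e)
b = Node("B", f)

print(forward_navigation(a, "B")[1], backward_navigation(b, "A")[1])
```

Both directions visit four nodes on this path, three of which (D, E, F) are irrelevant to the query
‘A//B’, which is the inefficiency the text describes.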
Another method of determining reachability is precomputing, for each node in a data tree, the set of nodes
that can be reached from this node, i.e. materializing the transitive closure of this data tree. The transitive
closure is typically very large and so could waste storage space. Therefore, we need a less exhaustive method
to compactly represent the transitive closure. Numbering schemes are one such method.
[Die82] is the origin of numbering schemes for trees. It proposed a kind of numbering scheme we call
PrePost Coding, which uses tree-traversal orders of nodes to compactly represent transitive closure of
trees. Specifically, each node in a tree is labelled with a pair of numbers, (start, end), where start and
end correspond to preorder and postorder traversal numbers of this node in the tree, respectively. [ZND+01]
introduced PrePost coding into XML applications. As can be seen from Figure 2 (a), the following property
always holds.
Property 1 (Ancestor-Descendant Relationship) In a data tree, node a is an ancestor of node b if
and only if a.start < b.start < a.end.
PrePost Coding has two big advantages. (1) (start, end) numbers (also called PrePost
numbers) need only modest storage space: 2 ∗ |V| numbers, where |V| is the number of nodes in the data tree. (2)
Using PrePost numbers, we can determine the ancestor-descendant relationship between any pair
of nodes in constant time by using only two number comparison operations. In addition, PrePost coding
can also be easily extended to check the parent-child relationship if we attach another number, level, to each
node, which denotes the depth of this node in the tree.
Property 2 (Parent-Child Relationship) In a data tree, node a is a parent of node b if and only if
a.start < b.start < a.end and a.level + 1 = b.level.
In fact, in addition to the commonly used ‘/’ and ‘//’ axes, PrePost coding extended with the number level
is able to process all other axes defined in XPath, such as following, following-sibling, etc. [Gru02, GvKT04].
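Properties 1 and 2 can be checked with a small Python sketch. It rests on one assumption that should be
stated loudly: start and end are drawn from a single counter that is incremented when a node is entered
and again when it is left, so that the comparison a.start < b.start < a.end can be applied exactly as
written. The Label class, the function names, and the toy tree are hypothetical.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Label:
    tag: str
    start: int
    end: int
    level: int


def assign_labels(tree, clock=None, level=0, out=None):
    """tree is (tag, [subtrees]); returns the labels in document order."""
    clock = clock if clock is not None else count(1)
    out = out if out is not None else []
    tag, children = tree
    label = Label(tag, next(clock), 0, level)   # start: assigned when the node is entered
    out.append(label)
    for child in children:
        assign_labels(child, clock, level + 1, out)
    label.end = next(clock)                     # end: assigned when the node is left
    return out


def is_ancestor(a, b):
    """Property 1: two number comparisons."""
    return a.start < b.start < a.end


def is_parent(a, b):
    """Property 2: Property 1 plus a level check."""
    return is_ancestor(a, b) and a.level + 1 == b.level


# Toy tree: A has children D and B; D has a child B.
labels = assign_labels(("A", [("D", [("B", [])]), ("B", [])]))
a, d, b_deep, b_child = labels
print([(l.tag, l.start, l.end, l.level) for l in labels])
```

Each check costs two integer comparisons (plus one for the level), independent of the size of the tree,
which is the constant-time behavior claimed above.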
Another famous numbering scheme for trees is Dewey Coding [OCL04], which was originally developed
for general knowledge classification. [TVB+02] introduced it into XML query processing. With this coding,
each node is associated with a vector of numbers that represents the path from the root to this node. This
coding method is illustrated in Figure 5. We can show that in a data tree, node a is an ancestor of node b
if and only if a.vector is a prefix of b.vector.
[Figure 5 content: an XML data tree in which each node is labeled with its Dewey vector (1, 1.1, 1.1.1, ...)]
Figure 5: Dewey Coding
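The prefix test of Dewey Coding can be sketched as follows. The sketch assumes that the root receives the
vector (1) and that the i-th child of a node extends its parent's vector with i; it also requires a
proper prefix, so that a node is not reported as its own ancestor. The function names and the toy tree
are hypothetical.

```python
def assign_dewey(tree, vector=(1,), out=None):
    """tree is (tag, [subtrees]); returns (tag, vector) pairs in document order."""
    out = out if out is not None else []
    tag, children = tree
    out.append((tag, vector))
    for i, child in enumerate(children, start=1):
        assign_dewey(child, vector + (i,), out)   # child vector = parent vector + position
    return out


def is_ancestor(a_vec, b_vec):
    """a is an ancestor of b iff a's vector is a proper prefix of b's vector."""
    return len(a_vec) < len(b_vec) and b_vec[:len(a_vec)] == a_vec


def dewey_str(vec):
    return ".".join(map(str, vec))


tree = ("publishers", [("publisher", [("@name", []), ("book", [])]),
                       ("publisher", [])])
coded = assign_dewey(tree)
for tag, vec in coded:
    print(dewey_str(vec), tag)
```

The prefix check compares up to len(a_vec) components, which is why it is more expensive than the two
integer comparisons needed by PrePost Coding.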
An advantage of Dewey Coding over PrePost Coding is that Dewey Coding is easier to maintain when
dynamic updates occur on data trees. Using Dewey Coding, when a new node is inserted somewhere in
a data tree, only nodes in subtrees rooted at the following sibling nodes of this new node need to change
their Dewey vectors. In contrast, using PrePost Coding, when a new node is inserted, most nodes in a data
tree might need to update their (start, end) numbers. ORDPATH Coding, which is a variant of Dewey
Coding but even easier to maintain than Dewey coding, has been integrated into the XML query processing
component of Microsoft SQL Server 2005 [OOP+04].
However, compared with PrePost, Dewey has some obvious weaknesses. (1) The path vector associated
with each node needs more storage space than the (start, end) numbers in PrePost Coding. (2) PrePost
provides more efficient support for checking the ancestor-descendant relationship between two nodes, since
number comparisons can be implemented more efficiently than checking the prefix
containment relationship between two path vectors. Due to these properties of PrePost, most XML
research papers use PrePost as their numbering scheme. Our survey will continue this tradition.
In addition to numbering schemes for trees, numbering schemes have also been developed for DAGs
[ABJ89] and for even more general graphs with cycles [CHKZ03]. [STW04, STW05] applied 2-hop labels
developed in [CHKZ03] to deal with general XML data graphs. However, the size of 2-hop labels is usually
very large, which limits their application in practice.
3 XML Query Processing: the File Approach
XML data are originally created in the form of XML documents (Figure 1) and stored in flat files. Generally,
various indexes need to be built on XML data to facilitate answering XML queries, since indexes can locate
target data quickly without exhaustively scanning the data. Such indexes include the classical B+-tree index
(Section 4), which is an index on data values (value indexing), and the recently developed numbering schemes
(Section 2), which index the structure of XML documents (structure indexing). However, indexes
themselves are redundant data. In some application scenarios, XML data must be exchanged in the form of
flat files only, and no redundant data such as indexes may be associated with them. In those
cases where indexes are not available, entire XML documents have to be scanned to answer queries.
One example of such applications is SDI (Selective Dissemination of Information) [AF00, DFFT02, DF03,
DAF+03, BGKS03, TRP+04]. SDI is essentially an XML Publish/Subscribe system. Figure 6 illustrates its
structure. The filtering system stores XPath queries from subscribers. It matches each incoming streaming
XML document D from publishers with each subscribed XPath query. If a match is found in D with
some XPath query Q, then D will be sent to subscribers of Q. In order to reduce network bandwidth,
publishers disseminate only XML documents, without any redundant data such as indexes associated with
these documents. In this scenario, only tree navigation methods, specifically only the forward navigation
method (Section 2), can be used, since scanning XML documents sequentially in document order is essentially
a depth-first traversal of XML data trees.
[Figure 6 content: publishers send streaming XML documents to a filtering system, which stores the XPath queries of subscribers and forwards matching documents to them]
Figure 6: SDI application
3.1 Single-Query Processing
3.1.1 The Automata Approach
The automata approach is a natural implementation of forward navigation, which has been widely researched
[AF00, DFFT02, DF03, DAF+03, BGKS03, HBG+03]. This approach expresses an XPath query as an
automaton and runs XML documents on this automaton as if XML documents were strings.
When a streaming XML document arrives, a SAX parser [Org04] parses it sequentially on the fly. SAX is
an event-based XML parser. A StartElement event is triggered when the opening tag of an XML element
is encountered, which returns the tag name and all associated attributes (if any) of this element to the
event handler. Similarly, an EndElement event is triggered when the closing tag of an XML element is
encountered, which returns the tag name of this element to the event handler. The event handler then uses
the opening/closing tags returned by events to activate the corresponding state transitions of the automaton.
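The event stream described above can be observed with the SAX interface in Python's standard library, where
the StartElement and EndElement events correspond to the startElement and endElement callbacks of
xml.sax.ContentHandler. The handler class EventLogger and the toy input are hypothetical; a real filtering
system would drive automaton transitions in these callbacks instead of logging.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Record the opening and closing tag events in document order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        # Opening tag: tag name plus all attributes of the element.
        self.events.append(("start", name, dict(attrs.items())))

    def endElement(self, name):
        # Closing tag: tag name only.
        self.events.append(("end", name))


handler = EventLogger()
xml.sax.parseString(b"<a><b id='1'><c/></b></a>", handler)
for event in handler.events:
    print(event)
```

The events arrive in document order, which corresponds to a depth-first traversal of the data tree; no
tree is ever materialized in memory.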
Figure 7 illustrates this approach. Figure 7 (b) is an automaton equivalent to the XPath query ‘//A//B/C’,
where ‘//’ axes are represented using ∗-edges (‘∗’ denotes any tag name), and the leaf query node is taken
as an accept state (State 3). The key idea of this approach is using a run-time stack, in which each stack
element is a set of automaton states. When an opening tag is encountered, each state in the stack-top
element is transformed to new states (or to this state itself if there is a ∗-edge outgoing from it) based on
this tag. These newly generated states are collected into a new stack element, which is in turn pushed onto
the run-time stack as the new stack top. Conversely, when a closing tag is encountered, the stack-top element
is simply popped off the stack. For SDI applications whose goal is just to check if there exists one match
between the published XML document and subscribed XPath queries, the matching process can terminate
once an accept state, such as State 3 in Figure 7 (b), is reached. However, for general query applications
whose goal is to find all matches, the matching process has to continue until the end of the XML document
is reached. All elements resulting in accept states, such as elements c1 and c2 in Figure 7 (c), are output as
query results.
[Figure 7 panels: (a) XML data; (b) automaton; (c) run-time stack]
Figure 7: The Automata approach: processing XPath query ‘//A//B/C’
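The run-time stack of the automata approach can be sketched in Python for the query ‘//A//B/C’. The sketch
makes several assumptions: the automaton has States 0 to 3 with State 3 as the accept state, States 0 and
1 carry the self-loops that encode the ‘//’ axes, and the input is a hand-written list of
(kind, tag, node id) events for a hypothetical document in which a1 contains b1, b1 contains d1, d1
contains a2, a2 contains b2, and b2 contains c1 and c2.

```python
# Automaton for //A//B/C (assumed encoding).
TRANSITIONS = {(0, "A"): 1, (1, "B"): 2, (2, "C"): 3}
SELF_LOOPS = {0, 1}   # states that stay active on any tag because of '//'
ACCEPT = 3

EVENTS = [
    ("start", "A", "a1"), ("start", "B", "b1"), ("start", "D", "d1"),
    ("start", "A", "a2"), ("start", "B", "b2"),
    ("start", "C", "c1"), ("end", "C", "c1"),
    ("start", "C", "c2"), ("end", "C", "c2"),
    ("end", "B", "b2"), ("end", "A", "a2"),
    ("end", "D", "d1"), ("end", "B", "b1"), ("end", "A", "a1"),
]


def run(events):
    """Return the ids of all elements whose opening tag leads to the accept state."""
    stack = [{0}]                      # each stack element is a set of automaton states
    matches = []
    for kind, tag, node_id in events:
        if kind == "start":
            new_states = set()
            for state in stack[-1]:
                if (state, tag) in TRANSITIONS:
                    new_states.add(TRANSITIONS[(state, tag)])
                if state in SELF_LOOPS:
                    new_states.add(state)
            stack.append(new_states)   # push the new state set as the stack top
            if ACCEPT in new_states:
                matches.append(node_id)
        else:
            stack.pop()                # closing tag: discard the stack top
    return matches


print(run(EVENTS))
```

Only the matching C elements are reported; the stack holds automaton states, not data nodes, so the A and
B elements that participate in each match cannot be recovered from it.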
3.1.2 The PathStack Approach
The automata approach described above is simple and feasible. However, its big weakness is that although
it derives single-node solutions (e.g. a set of C nodes), it is difficult to derive tuple solutions (e.g. a set of (A,
B, C) tuples). The reason is that the run-time stack tracks only states in automata but not data nodes in
data trees. In addition, the run-time stack wastes memory space. Due to the ‘//’ axes, states with outgoing
∗-edges, such as State 0 and State 1, are copied repeatedly into a large number of stack elements.
[BKS02] introduced an elegant data structure, PathStack, which can overcome the weaknesses of the
automata approach described above. PathStack was introduced in [BKS02] originally as a native approach
to answering XML twig queries. [BGKS03] extended it to process multi-queries. Here we only introduce its
role in the file approach while leaving the introduction to its role in the native approach to Section 5. Figure
8 illustrates this PathStack approach.
[Figure 8 content: the example XML data and the contents of the linked stacks after each of Steps (1) to (9), with the solutions output at Steps (6) and (8)]
Figure 8: The PathStack approach: processing XPath query ‘//A//B/C’
The key idea of the PathStack approach is using a series of linked stacks to track scanned data nodes.
Specifically, one stack is created for each query node in a path query. For example, for a path query
‘//A//B/C’, there are three stacks, Stack C, Stack B, and Stack A, each linked to the stack of its parent
query node. When an opening tag is encountered,
the corresponding XML element is pushed onto the stack named by this tag, together with a pointer to
the top element in its parent stack (see Steps (2), (5), (6), and (8)). Note that elements such as d1, whose tags
do not correspond to any stack (i.e. are irrelevant to the given path query), are simply discarded. Conversely,
when a closing tag is encountered, the top element in its corresponding stack is simply popped off (see Steps
(7) and (9)).
The procedure above guarantees that at all times, elements in all stacks are from the same path in the
data tree. Therefore, when an element is pushed into the stack corresponding to the end node of the path
query, such as Stack C, it implies that some matches might have been found. These matches can be output
immediately as solutions through backtracking pointers associated with the elements in stacks (see Steps
(6) and (8)). Note that in order to check the child-parent relationship, each element that is pushed into the
stack also needs to be associated with its depth number in the data tree, which can be easily derived in the
parsing process.
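The linked stacks can be sketched in Python for the path query ‘//A//B/C’. This is a simplified
illustration, not the algorithm of [BKS02] as published: it assumes that every query node has a distinct
tag, represents the pointer into the parent stack as an index, and runs on a hand-written event list for
a hypothetical document in which a1 contains b1, b1 contains d1, d1 contains a2, a2 contains b2, and b2
contains c1 and c2.

```python
# (tag, axis connecting this query node to its parent query node)
QUERY = [("A", "//"), ("B", "//"), ("C", "/")]

EVENTS = [
    ("start", "A", "a1"), ("start", "B", "b1"), ("start", "D", "d1"),
    ("start", "A", "a2"), ("start", "B", "b2"),
    ("start", "C", "c1"), ("end", "C", "c1"),
    ("start", "C", "c2"), ("end", "C", "c2"),
    ("end", "B", "b2"), ("end", "A", "a2"),
    ("end", "D", "d1"), ("end", "B", "b1"), ("end", "A", "a1"),
]


def path_stack(events, query=QUERY):
    tags = [tag for tag, _ in query]
    stacks = [[] for _ in query]       # one stack per query node
    solutions = []
    level = 0

    def expand(q, entry, suffix):
        """Follow pointers from stack q towards the root stack, emitting tuples."""
        node_id, node_level, pointer = entry
        current = (node_id,) + suffix
        if q == 0:
            solutions.append(current)
            return
        axis = query[q][1]
        # Every entry at or below the pointer lies on the same root-to-node path.
        for parent in stacks[q - 1][: pointer + 1]:
            if axis == "//" or parent[1] + 1 == node_level:   # '/' needs adjacent depths
                expand(q - 1, parent, current)

    for kind, tag, node_id in events:
        if kind == "start":
            level += 1
            if tag in tags:                       # irrelevant tags are discarded
                q = tags.index(tag)
                pointer = len(stacks[q - 1]) - 1 if q > 0 else -1
                entry = (node_id, level, pointer)
                stacks[q].append(entry)
                if q == len(query) - 1:           # leaf query node: output matches now
                    expand(q, entry, ())
        else:
            if tag in tags:
                stacks[tags.index(tag)].pop()
            level -= 1
    return solutions


for solution in path_stack(EVENTS):
    print(solution)
```

Because the stacks hold data nodes rather than automaton states, each match is reported as a full
(A, B, C) tuple. The depth stored with each entry is what allows the child axis between B and C to be
checked, which excludes b1 from every tuple.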
Compared with the automata approach, the PathStack approach has the following advantages. (1) The
PathStack approach saves memory space. The number of stacks in the PathStack approach is the length of
the path query, while the depth of the run-time stack in the automata approach is the depth of the entire
XML data tree. (2) More importantly, the PathStack approach can derive tuple solutions, rather than only
single-node solutions. Tuple solutions are very important. They might be required by XQueries (Section
1.2.3). More importantly, tuple solutions help answer twig queries as well as simple path queries through
using a post-joining procedure as introduced below.
3.1.3 The TwigStack Approach
The TwigStack approach extends the PathStack approach to answer general twig queries [BKS02]. Its key
idea is twig decomposition, i.e. decomposing twig queries into multiple root-to-leaf path queries. Each path
query is still processed as in PathStack, and the query results are finally joined together to get the result
of the original twig query. However, since path queries from the twig decomposition have common prefix
(query) nodes, the stacks corresponding to these common prefix nodes can be shared. Therefore, in contrast
to PathStack, which links stacks in the form of a path, TwigStack links stacks in the form of a twig.
An example of TwigStack is shown in Figure 9. In this example, once a C-tagged node is pushed onto
Stack C, the tuple solutions obtained from Stack C, Stack B, and Stack A are immediately sent to Table 1.
Similarly, once an E-tagged node is pushed onto Stack E, the tuple solutions obtained from Stack E, Stack
D, Stack B, and Stack A are immediately sent to Table 2. If the path queries are very selective so that the size of
Table 1 and Table 2 is small, then these two tables can be temporarily stored in memory. Otherwise, they
have to be sent to disk. Finally, after the entire XML document is scanned, there is a post-joining procedure
which joins Table 1 and Table 2 on their shared attributes, A and B, to get tuple solutions (A, B, C, D, E).
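The post-joining procedure amounts to an equi-join of the two path-solution tables on their shared
attributes. The following Python sketch uses an in-memory hash join; the function name join_on_ab and the
node identifiers in the two tables are hypothetical, and a table that does not fit in memory would need a
disk-based join instead.

```python
from collections import defaultdict


def join_on_ab(table1, table2):
    """Join (A, B, C) tuples with (A, B, D, E) tuples on the shared (A, B) prefix."""
    index = defaultdict(list)
    for a, b, d, e in table2:
        index[(a, b)].append((d, e))          # hash Table 2 on (A, B)
    return [(a, b, c, d, e)
            for a, b, c in table1
            for d, e in index.get((a, b), ())]


# Hypothetical path solutions for the two root-to-leaf paths of the twig.
table1 = [("a1", "b1", "c1"), ("a1", "b2", "c2")]
table2 = [("a1", "b1", "d1", "e1"), ("a1", "b1", "d2", "e2"), ("a2", "b3", "d3", "e3")]

twig_solutions = join_on_ab(table1, table2)
for solution in twig_solutions:
    print(solution)
```

Path solutions whose (A, B) pair has no partner in the other table, such as (a1, b2, c2) here, are
dropped by the join even though they were computed and stored during the scan.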
3.2 Multi-Query Processing
The problem of multi-query processing is answering a batch of queries rather than a single query. For example,
the SDI application is essentially an XML multi-query processing problem, except that it only needs to find one
match, rather than all matches, of a streaming XML document with each of the subscribed queries.
The problem of multi-query processing has been widely researched in the context of relational databases
(e.g. in [RSSB00]). The key idea of improving the performance of multi-query processing is answering
multiple queries simultaneously rather than separately through exploring shared parts of these queries. This
idea is similarly applicable in the context of XML multi-query processing.
[Figure 9 panels: (a) twig query; (b) TwigStack with Stack A, Stack B, Stack C, Stack D, and Stack E; (c) post-joining of Table 1 and Table 2]
Figure 9: The TwigStack approach
It is straightforward to extend the three approaches to XML single-query processing (Section 3.1) to XML
multi-query processing. The extension of the automata approach finds common prefixes of the given path
queries and shares the states corresponding to these common prefixes in a newly constructed automaton
[DFFT02, DF03, DAF+03], as Figure 10 (a) illustrates. Similarly, the extension of the PathStack approach
finds common prefixes of the given path queries and shares the stacks corresponding to these common prefixes
in a newly constructed TwigStack [BGKS03], as Figure 10 (b) illustrates. Note that this extension of the
PathStack approach is essentially the same as the TwigStack approach.
Figure 10: XML Multi-Query Processing
Through the extensions above, each XML document needs to be scanned only once to answer multiple
queries simultaneously.
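The prefix-sharing idea can be sketched as follows. This is an illustrative simplification under our own assumptions, not the construction of [DFFT02, DF03, DAF+03]: it handles only path queries with ‘/’ axes, represents the document as a hypothetical stream of ('start', tag) and ('end', tag) events, and reports, as in the SDI setting, only whether each query has a match. The names `build_query_trie` and `match_once` are ours.

```python
def build_query_trie(queries):
    """Merge path queries such as '/a/b/c' into one trie; common prefixes share states."""
    root = {"children": {}, "queries": []}
    for q in queries:
        node = root
        for tag in q.strip("/").split("/"):
            node = node["children"].setdefault(tag, {"children": {}, "queries": []})
        node["queries"].append(q)      # this state accepts query q
    return root

def match_once(events, trie):
    """Scan the event stream once and return the set of queries with at least one match."""
    matched = set()
    stack = [trie]                     # trie state per open element; None = dead state
    for kind, tag in events:
        if kind == "start":
            top = stack[-1]
            state = top["children"].get(tag) if top is not None else None
            if state is not None:
                matched.update(state["queries"])
            stack.append(state)
        else:                          # 'end' event: leave the current element
            stack.pop()
    return matched

# Events for the hypothetical document <a><b><c/></b><d/></a>.
events = [("start", "a"), ("start", "b"), ("start", "c"), ("end", "c"),
          ("end", "b"), ("start", "d"), ("end", "d"), ("end", "a")]
trie = build_query_trie(["/a/b/c", "/a/b", "/a/d/e"])
matched_queries = match_once(events, trie)
```

The three queries share the state for ‘/a’, and the first two also share the state for ‘/a/b’, so the document is traversed once regardless of the number of queries.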
3.3 Summary
The file approach is mainly used for special-purpose applications in which XML data must be stored in
commonly used flat files in the form of just original XML documents. Because no redundant data such as
indexes are available in such applications, entire XML documents have to be scanned sequentially element
by element despite the fact that most elements in documents might be irrelevant to the specified queries,
which usually results in poor query processing performance. In the following two sections, we investigate
the relational approach and the native approach, in which XML data are stored in relational databases and
in inverted lists, respectively. With value indexes and structural indexes available in these two approaches,
XML queries can be answered much more efficiently than in the file approach.
4 XML Query Processing: the Relational Approach
Relational database systems are today’s mainstream database systems. Today’s well-known commercial
database systems, such as IBM DB2, Microsoft SQL Server, and Oracle, are all relational database management
systems (RDBMS). Due to more than thirty years of academic and industrial efforts, RDBMSs
have acquired strong capabilities in storage management, query processing and optimization, concurrency
control and recovery, etc. Therefore, a lot of research efforts have addressed storing and querying XML data
in RDBMS. In Section 4.1, we review past work on storing and querying XML data with a ‘schema’. In
Sections 4.2 through 4.4, we review past work on storing and querying schemaless XML data.
4.1 The DTD Approach
This approach is developed to store and query XML data with a ‘schema’. As introduced in Section 1, XML
is a flexible data model. However, XML data in many practical applications also conform to a schema to
some extent, since for various inter-operating applications that exchange data with each other, a common
agreement on the schema of exchanged data will facilitate data exchange among them significantly. Such
schemas can be described using standard Document Type Definitions (DTDs) [Gro04b] or XML Schemas
[Gro04d]. Here we briefly introduce basic issues on DTD only, since XML Schemas are essentially extensions
to DTDs.
DTD is a set of statements where each statement specifies a relationship between an XML element and
its sub-elements/attributes, or the data type of an XML element/attribute. DTD statements are usually
stored in a special document for reference. If an XML document, X, cites a DTD document, D, on its file
head, then the structure of XML data in X must conform to the schema specified by D. We show a simple
DTD example in Figure 11. Figure 11 (a) is a DTD document, whose semantics can also be explained using
a DTD graph in Figure 11 (b). The ‘*’ symbol associated with an element in DTD statements implies that
this element can have multiple copies under its parent element. For example, a Publisher element might
have multiple Book sub-elements.
Figure 11: A DTD example: (a) a DTD document, (b) its DTD graph, (c) the resulting relational schema
DTD schemas can be naturally transformed into relational schemas [STZ+99, SSK+01], as Figure 11 (c)
illustrates. In the resulting relational schema, separate relations are created for the root element (Publisher)
and all ‘*’ sub-elements (Book and Author) in the DTD. Each ‘*’-element relation has a foreign-key reference,
e.g. attribute p_name in the Book table and attribute b_title in the Author table, to its parent-element table.
After XML data conforming to a DTD schema have been shredded into relational tables, XML queries over
XML data can be easily transformed into SQL queries over relational data. For example, a twig query
‘/Publisher[address = ‘Cambridge’]//Author/name’ can be transformed into a SQL query that joins three
tables, Publisher, Book, and Author, together, as Figure 12 illustrates.
Figure 12: The DTD approach: SQL query for ‘/Publisher[address = ‘Cambridge’]//Author/name’
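As an illustration of such a three-table join, the following sketch runs the query on an in-memory SQLite database from Python. The table and column layout (Publisher(name, address), Book(title, p_name), Author(name, b_title)) is our assumption based on the foreign keys named in the text, and the sample rows are invented; the exact schema and SQL text of the original figures may differ.

```python
import sqlite3

dtd_db = sqlite3.connect(":memory:")
dtd_db.executescript("""
    CREATE TABLE Publisher (name TEXT PRIMARY KEY, address TEXT);
    CREATE TABLE Book      (title TEXT PRIMARY KEY, p_name TEXT REFERENCES Publisher(name));
    CREATE TABLE Author    (name TEXT, b_title TEXT REFERENCES Book(title));
    -- Invented sample data.
    INSERT INTO Publisher VALUES ('Press One', 'Cambridge');
    INSERT INTO Publisher VALUES ('Press Two', 'Boston');
    INSERT INTO Book      VALUES ('Book One', 'Press One');
    INSERT INTO Book      VALUES ('Book Two', 'Press Two');
    INSERT INTO Author    VALUES ('Alice', 'Book One');
    INSERT INTO Author    VALUES ('Bob', 'Book Two');
""")

# /Publisher[address = 'Cambridge']//Author/name as a join along the foreign keys.
dtd_rows = dtd_db.execute("""
    SELECT A.name
    FROM Publisher P, Book B, Author A
    WHERE P.address = 'Cambridge'
      AND B.p_name  = P.name
      AND A.b_title = B.title
""").fetchall()
```

Because the schema mirrors the DTD, the ‘//’ axis between Publisher and Author is resolved at translation time: the only path from Publisher to Author goes through Book.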
4.2 The Edge Approach
The Basic Edge Approach
[FK99] proposed a simple approach to shredding schemaless XML data into relations. This approach is
based on edge-labeled XML data trees. In this approach, all edges in a data tree are stored in a single
relational table, Edge. The schema of this Edge table is shown in Figure 13. The key idea of this schema
is an attribute pair (Source, Target), which represents end points of edges. Attribute Label represents tags
on edges. Attributes Flag and Value give the type and value of target nodes of edges, respectively. As an
example, Figure 13 populates the Edge table with XML data shown in Figure 2 (b).
Figure 13: The Edge Table
Two edges A and B can be joined together if and only if A.Target = B.Source. Based on this property,
it is easy to transform XML twig queries without ‘//’ axes into SQL queries. The transformation method
is illustrated in Figure 14 with a twig query ‘/Publisher[address = ‘Cambridge’]/book/author/name’.
Execution of this SQL query comprises two steps. The first step is a candidate-edge finding step, which
retrieves data edges for each label in the twig query, as Part (1) in Figure 14 shows. A
clustered index pre-built on the Label attribute can significantly speed up this step. The
second step is an edge joining step, which joins adjacent edges, as Part (2) in Figure 14 shows. This
step can be made more efficient by pre-building indexes on the attributes (Source, Target).
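The two steps can be sketched with SQLite from Python as follows. The Edge schema follows the attributes described above, but the column order, the Flag values (‘ref’, ‘string’), the node numbers, and the sample data are our assumptions; the SQL is a reconstruction in the spirit of the approach, not a copy of the original figure.

```python
import sqlite3

edge_db = sqlite3.connect(":memory:")
edge_db.execute(
    "CREATE TABLE Edge (Source INT, Target INT, Label TEXT, Flag TEXT, Value TEXT)")
edge_db.executemany("INSERT INTO Edge VALUES (?, ?, ?, ?, ?)", [
    (0, 1, "Publisher", "ref", None),
    (1, 2, "address", "string", "Cambridge"),
    (1, 3, "book", "ref", None),
    (3, 4, "author", "ref", None),
    (4, 5, "name", "string", "Alice"),
    (0, 6, "Publisher", "ref", None),
    (6, 7, "address", "string", "Boston"),
    (6, 8, "book", "ref", None),
    (8, 9, "author", "ref", None),
    (9, 10, "name", "string", "Bob"),
])

# /Publisher[address = 'Cambridge']/book/author/name: one Edge alias per query node.
edge_rows = edge_db.execute("""
    SELECT n.Value
    FROM Edge p, Edge ad, Edge b, Edge au, Edge n
    WHERE p.Label = 'Publisher' AND ad.Label = 'address'   -- (1) candidate edges
      AND b.Label = 'book' AND au.Label = 'author' AND n.Label = 'name'
      AND ad.Value = 'Cambridge'
      AND p.Target = ad.Source AND p.Target = b.Source     -- (2) edge joins
      AND b.Target = au.Source AND au.Target = n.Source
""").fetchall()
```

Every join predicate is an equality between a Target and a Source, so only parent-child steps can be expressed this way.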
The Binary Approach
A weakness of the above Edge approach is that it involves multiple self-joins of the large Edge table. For
example, five Edge tables are joined in Figure 14, one table for each query node in the query twig. In order
to overcome this weakness, [FK99] also proposed a Binary approach, which is a variant of the basic Edge
approach, to avoid exploring the large Edge table. The key idea of this approach is grouping all edges with
the same label into one table respectively, i.e. creating one table for each distinct label. Each label table has
the schema (Source, Target, Flag, Value), with the Label attribute being dropped from the Edge schema.
An example of a SQL query against this schema is shown in Figure 15. In this example, the candidate-edge
finding operations in Part (1) of Figure 14 are saved. In addition to improving query processing performance,
the Binary approach also saves storage space, since it does not store the labels of edges. However, for large XML
documents with many distinct labels, the Binary approach unavoidably results in a large number of
relational tables, which increases the management workload of the DBMS. We also note that the basic
idea of this approach, clustering edges by their labels, is very similar to the idea of inverted lists that will
be introduced in Section 5.
Figure 14: The Edge approach: SQL for ‘/Publisher[address = ‘Cambridge’]/book/author/name’
Figure 15: The Binary approach: SQL for ‘/Publisher[address = ‘Cambridge’]/book/author/name’
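The partitioning behind the Binary approach can be sketched in a few lines of Python. This is an in-memory illustration under our own assumptions rather than the relational implementation of [FK99]: the function names `partition_by_label` and `join_tables`, the Flag values, and the sample edges are hypothetical.

```python
from collections import defaultdict

def partition_by_label(edges):
    """Split Edge rows (source, target, label, flag, value) into one table per label."""
    tables = defaultdict(list)
    for source, target, label, flag, value in edges:
        # The Label column is dropped: it is implied by the table the row lives in.
        tables[label].append((source, target, flag, value))
    return dict(tables)

def join_tables(parents, children):
    """Equi-join two label tables on parent.Target = child.Source."""
    by_target = {row[1]: row for row in parents}   # Target is unique in a data tree
    return [(by_target[c[0]], c) for c in children if c[0] in by_target]

edges = [
    (0, 1, "Publisher", "ref", None),
    (1, 2, "address", "string", "Cambridge"),
    (1, 3, "book", "ref", None),
    (3, 4, "author", "ref", None),
    (4, 5, "name", "string", "Alice"),
]
label_tables = partition_by_label(edges)
author_name_pairs = join_tables(label_tables["author"], label_tables["name"])
```

Looking up `label_tables["author"]` replaces the candidate-edge selection on Label, which is the step the Binary approach saves.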
On the whole, the Edge approach has two weaknesses. (1) It involves many join operations. The number of
Edge tables joined is just the number of query nodes in a twig query, so it fails to process large twig queries efficiently.
(2) Its biggest weakness is that it does not support twig queries with ‘//’ axes (e.g. ‘A//B’), since it does
not know how many tags, or which tags, lie between tag A and tag B.
4.3 The Node Approach
As we introduced in Section 2, numbering schemes are essentially structural indexes, which help answer ‘//’
axis queries efficiently. [ZND+01] is the first paper that applied PrePost coding developed in [Die82] to XML
research. This paper contributed a Node approach to shredding schemaless XML data into relations. This
approach is based on node-labeled XML data trees. In this approach, all internal nodes (i.e. element nodes
and attribute nodes) in a data tree are stored in a relational table, Node. The schema of this Node table is
shown in Figure 16. The key idea of this schema is an attribute triple (Start, End, Level), which replaces
the attribute pair (Source, Target) in the Edge schema. ‘//’ axis queries can be answered efficiently through
using (start, end) numbers of nodes. Level is used with (start, end) together to answer ‘/’ axis queries. As
an example, Figure 16 populates the Node table with XML data shown in Figure 2 (a).
Figure 16: The Node Table
Based on Property 1 and Property 2 in Section 2, it is easy to transform XML queries with both ‘/’
and ‘//’ axes into SQL queries. The transformation method is illustrated in Figure 17 with a twig query
‘/Publisher[address = ‘Cambridge’]//author/name’. Similar to the Edge approach, execution of this SQL
query comprises two steps, candidate-node finding (Part (1)) and node joining (Part (2)). The difference is
that in the second step, the Node approach joins nodes using the (Start, End, Level) attributes. Just as in the
Edge approach, Part (1) can be saved in the Node approach if a variant similar to the Binary approach is
used.
Figure 17: The Node approach: SQL query for ‘/Publisher[address = ‘Cambridge’]//author/name’
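A hedged sketch of this translation, run on SQLite from Python, is given below. The column names StartPos and EndPos stand in for Start and End (our renaming, to stay clear of SQL keywords), the sample numbering is invented, and the containment tests assume the usual interval property: a node d is a descendant of a node a if a.Start < d.Start and d.End < a.End, and a child if, in addition, d.Level = a.Level + 1.

```python
import sqlite3

node_db = sqlite3.connect(":memory:")
node_db.execute(
    "CREATE TABLE Node (Label TEXT, StartPos INT, EndPos INT, Level INT, Value TEXT)")
node_db.executemany("INSERT INTO Node VALUES (?, ?, ?, ?, ?)", [
    ("Publisher", 1, 10, 1, None),
    ("address", 2, 3, 2, "Cambridge"),
    ("book", 4, 9, 2, None),
    ("author", 5, 8, 3, None),
    ("name", 6, 7, 4, "Alice"),
    ("Publisher", 11, 20, 1, None),
    ("address", 12, 13, 2, "Boston"),
    ("book", 14, 19, 2, None),
    ("author", 15, 18, 3, None),
    ("name", 16, 17, 4, "Bob"),
])

# /Publisher[address = 'Cambridge']//author/name
node_rows = node_db.execute("""
    SELECT n.Value
    FROM Node p, Node ad, Node au, Node n
    WHERE p.Label = 'Publisher' AND ad.Label = 'address'      -- (1) candidate nodes
      AND au.Label = 'author' AND n.Label = 'name'
      AND ad.Value = 'Cambridge'
      AND p.StartPos < ad.StartPos AND ad.EndPos < p.EndPos   -- (2) Publisher/address
      AND ad.Level = p.Level + 1
      AND p.StartPos < au.StartPos AND au.EndPos < p.EndPos   --     Publisher//author
      AND au.StartPos < n.StartPos AND n.EndPos < au.EndPos   --     author/name
      AND n.Level = au.Level + 1
""").fetchall()
```

Note that the book node is never touched: the ‘//’ step is answered by interval containment alone, at the price of inequality predicates in the join.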
The Node approach overcomes the weakness of the Edge approach, which does not support ‘//’ axis
queries. However, similar to the Edge approach, it involves many join operations. Specifically, the number
of joins is just the number of query nodes in a twig query, which results in inefficient query processing of
large twig queries.
4.4 The Path Materialization Approach
The Basic PM Approach
In order to reduce the number of node joins, [YASU01] proposed a Path Materialization (PM) approach
to shredding schemaless XML data into a relational table, Path. The schema of this Path table is shown in
Figure 18. It is very similar to the Node table. The difference is that rather than storing the tag of each
node in the Label attribute, the PM approach stores the tag path from the root to each node (called root
path) in a new attribute Path.
Figure 18: The Path Table
Through the Path attribute, the PM approach can answer twig queries efficiently in units of paths rather
than in units of single edges. Specifically, given a twig query, the PM approach first decomposes it into
multiple root-to-leaf path queries, as the TwigStack approach in Section 3.1.3 does, and then joins the results
of these path queries together. Figure 19 illustrates how to use a SQL query to answer a twig query
‘/Publisher[address = ‘Cambridge’]/book/author/name’. Part (1) is the twig decomposition step, which
uses the root paths of leaf nodes (address, name) and branching nodes (publisher) in the query twig
to retrieve their corresponding data nodes in the data tree. Part (2) is the path joining step, which joins the data
nodes retrieved in Part (1) through their (start, end) numbers.
Figure 19: The Basic PM approach: SQL for ‘/Publisher[address = ‘Cambridge’]/book/author/name’
The PM approach has two advantages. (1) It involves fewer join operations in Part (2) than the Node
approach, since it answers twig queries in units of paths rather than in units of single edges. For example,
for the twig query in Figure 19, the Node approach needs to join five Node tables but the PM approach
needs to join only three Path tables. Therefore, the PM approach generally has higher query processing
performance. (2) The PM approach can also support ‘//’ axis queries, as the Node approach does,
by using the Optional String Pattern Matching (OSPM) function ("LIKE") provided by SQL. For example,
in order to answer a query ‘/Publisher[address = ‘Cambridge’]//name’, we only need to replace
"name.Path = ‘/Publisher/book/author/name’" in the where clause in Figure 19 with "name.Path LIKE
‘/Publisher/%/name’".
However, we can also observe that although the number of join operations in Part (2) is reduced, this comes at
the expense of more complex selection operations in Part (1). SQL supports exact
string matching ("=") efficiently through a pre-built B+-tree index on string attributes, but B+-tree indexes do
not support optional string pattern matching ("LIKE") efficiently, due to the inherent structure of B+-trees.
In order to find patterns with multiple ‘%’ symbols, a large number of irrelevant strings in tables might have
to be checked exhaustively. Therefore, the PM approach does not support ‘//’ axis queries efficiently when
there are multiple ‘//’ axes in queries (e.g. //A//B/C//D).
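Both forms of the selection step can be tried out with the following sketch, which uses SQLite from Python. The Path table layout (Path, StartPos, EndPos, Value), the column names StartPos and EndPos, and the sample rows are our assumptions; the queries are reconstructions in the spirit of the PM approach, not the SQL of [YASU01].

```python
import sqlite3

pm_db = sqlite3.connect(":memory:")
pm_db.execute("CREATE TABLE Path (Path TEXT, StartPos INT, EndPos INT, Value TEXT)")
pm_db.executemany("INSERT INTO Path VALUES (?, ?, ?, ?)", [
    ("/Publisher", 1, 10, None),
    ("/Publisher/address", 2, 3, "Cambridge"),
    ("/Publisher/book", 4, 9, None),
    ("/Publisher/book/author", 5, 8, None),
    ("/Publisher/book/author/name", 6, 7, "Alice"),
    ("/Publisher", 11, 20, None),
    ("/Publisher/address", 12, 13, "Boston"),
    ("/Publisher/book", 14, 19, None),
    ("/Publisher/book/author", 15, 18, None),
    ("/Publisher/book/author/name", 16, 17, "Bob"),
])

# Three Path aliases: the branching node and the two leaf nodes of the twig.
PM_QUERY = """
    SELECT n.Value
    FROM Path p, Path ad, Path n
    WHERE p.Path = '/Publisher' AND ad.Path = '/Publisher/address'   -- (1) root paths
      AND {name_predicate}
      AND ad.Value = 'Cambridge'
      AND p.StartPos < ad.StartPos AND ad.EndPos < p.EndPos          -- (2) path joins
      AND p.StartPos < n.StartPos AND n.EndPos < p.EndPos
"""

# /Publisher[address = 'Cambridge']/book/author/name: exact string matching.
exact_rows = pm_db.execute(PM_QUERY.format(
    name_predicate="n.Path = '/Publisher/book/author/name'")).fetchall()

# /Publisher[address = 'Cambridge']//name: optional string pattern matching.
like_rows = pm_db.execute(PM_QUERY.format(
    name_predicate="n.Path LIKE '/Publisher/%/name'")).fetchall()
```

The two queries differ only in the predicate on the name path; whether that predicate can use an index is exactly what separates the efficient case from the inefficient one.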
The RP Approach
[PCS+04] proposed a Reversed Path (RP) approach to overcome the weakness of the PM approach discussed
above. This approach uses the schema shown in Figure 20. Its key idea is storing the reversed root paths of data
nodes in a new attribute, ReversedPath. In addition, the RP approach uses an ORDPATH attribute to
replace the (start, end) attribute pair of the PM approach. ORDPATH coding is a variant of the Dewey coding
we mentioned in Section 2. It can be used to determine ancestor-descendant and parent-child relationships
between nodes, as PrePost coding does [OOP+04]. Here we simply ignore the difference between ORDPATH
numbers and (Start, End) numbers, and concentrate our discussion on the ReversedPath attribute.
Figure 20: The ReversedPath Table
Figure 21 shows an example of how the RP approach answers twig queries with multiple ‘//’ axes. The
first step is still twig decomposition, which decomposes the query twig into three paths. However, Path (3)
involves three ‘//’ axes. In the PM approach, we have to use ‘%A/B/C%E/F%G’ as a search pattern on the
Path attribute to retrieve the corresponding data nodes. As we analyzed earlier, this is not efficient. Therefore,
the RP approach further decomposes Path (3) into Path (4) and Path (5), each of which includes only one
‘//’ axis, at the very beginning. So we can use ‘/F/E%’ and ‘/G%’ as search patterns on the ReversedPath
attribute to retrieve the data nodes of Path (4) and Path (5), respectively. The task here is just finding
strings with a specified prefix, which can be implemented more efficiently than the general Optional String
Pattern Matching task with multiple ‘%’ symbols. Finally, similar to the PM approach, the RP approach
has a path joining step, which joins the results of the path queries together through the ORDPATH attribute.
Figure 21: The RP approach
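The effect of reversing root paths can be sketched in Python as follows. The helper names `reverse_path` and `matches_suffix_query` and the sample paths are hypothetical; the sketch also appends a trailing ‘/’ before comparing, a small tightening of our own so that a tag such as ‘EX’ is not mistaken for ‘E’.

```python
def reverse_path(root_path):
    """'/A/B/C' -> '/C/B/A'"""
    return "/" + "/".join(reversed(root_path.strip("/").split("/")))

def matches_suffix_query(root_path, suffix_tags):
    """True if the node's root path ends with suffix_tags, e.g. //E/F -> ['E', 'F'].

    On the reversed path this is a plain prefix test, the analogue of a
    LIKE pattern with a single trailing '%'.
    """
    prefix = "/" + "/".join(reversed(suffix_tags)) + "/"
    return (reverse_path(root_path) + "/").startswith(prefix)

reversed_example = reverse_path("/A/B/C/D/E/F")
hit = matches_suffix_query("/A/B/C/D/E/F", ["E", "F"])   # answers //E/F
miss = matches_suffix_query("/A/B/C/D/EX/F", ["E", "F"])
```

A leading ‘//’ in the query thus turns into a trailing wildcard on the stored string, which is the only position where a prefix-ordered index can help.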
The BLAS Approach
As we saw above, the RP approach in [PCS+04] has simplified the task of general Optional String Pattern
Matching to an easier task of String Prefix Matching (SPM). However, [PCS+04] did not provide any details
on how to implement SPM efficiently. It seems that they simply push the SPM task down to the
SQL engine. In contrast, another work [CDZ04] not only introduced the RP approach independently from
[PCS+04] but also developed a very intelligent method named BLAS (Bi-LAbeling System) to implement
SPM efficiently. The key idea of BLAS is encoding each ReversedPath string into a number, PLabel. This
encoding method is illustrated in Figure 22.
Figure 22: How to compute PLabel(‘/p2/p3/p1/p4’)
In this example, we assume that there is a total of four distinct tag names in some XML document, p1
through p4. At the first level, these four tags divide the reserved number space [0, 1024) into four equal-length
segments, each with length 1024/4 = 256. In the same way, at the second level, the four tags divide each segment
of the first level into four equal-length segments, each with length 256/4 = 64, and so on. So we
have
PLabel(‘/p2/p3/p1/p4’) = 256 × (2 − 1) + 64 × (3 − 1) + 16 × (1 − 1) + 4 × (4 − 1) = 396
In the same way, we can also get
PLabel(‘/p4/p2/p3’) = 256 × (4 − 1) + 64 × (2 − 1) + 16 × (3 − 1) = 864
A very nice property of PLabel is that all strings with a common prefix cluster in adjacent number ranges.
For example, all ReversedPath strings with prefix ‘/p2/p3’ cluster together. So if we pre-build a clustered
B+-tree index on the PLabel attribute of the ReversedPath table, then all reversed paths with the specified
prefix can be retrieved very efficiently using a SQL range query. For example, in order to retrieve all
reversed paths with prefix ‘/p2/p3/’, BLAS first computes lower bound = PLabel(‘/p2/p3/’) = 384 and
upper bound = PLabel(‘/p2/p4/’) = 448. Then a SQL range query is issued to retrieve all reversed paths
with PLabel within [384, 448).
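The arithmetic of this example can be reproduced with a few lines of Python. The sketch below follows only the simplified scheme described above (four tags, number space [0, 1024), integer segment lengths); the function names `plabel` and `prefix_range` and the `tag_rank` dictionary are our own, and the actual encoding in [CDZ04] may differ in detail.

```python
def _locate(reversed_path, tag_rank, space):
    """Return (start, width) of the segment assigned to reversed_path in [0, space)."""
    n = len(tag_rank)                  # number of distinct tags
    start, width = 0, space
    for tag in reversed_path.strip("/").split("/"):
        width //= n                    # each level splits a segment into n equal parts
        start += width * (tag_rank[tag] - 1)
    return start, width

def plabel(reversed_path, tag_rank, space=1024):
    """PLabel of a reversed path: the start of its segment."""
    return _locate(reversed_path, tag_rank, space)[0]

def prefix_range(prefix, tag_rank, space=1024):
    """Half-open PLabel interval [lower, upper) covering every path with this prefix."""
    start, width = _locate(prefix, tag_rank, space)
    return start, start + width

tag_rank = {"p1": 1, "p2": 2, "p3": 3, "p4": 4}
label_a = plabel("/p2/p3/p1/p4", tag_rank)
label_b = plabel("/p4/p2/p3", tag_rank)
bounds = prefix_range("/p2/p3/", tag_rank)
```

With these definitions, `label_a` falls inside `bounds`, so a range predicate on the PLabel attribute replaces the string prefix test. With integer segments of this size the sketch only works for paths of at most five tags; deeper paths would need a larger number space.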
4.5 Summary
In this section, we saw that XML data can be simply loaded into relational databases and XML twig queries
over XML data can also be easily transformed into SQL queries over relational data. In the relational
approach, all query processing work is pushed down to the relational query optimizer and no extra processing work
is needed.
When XML data conform to a schema such as DTD, the DTD approach introduced in Section 4.1
provides better query processing performance than other approaches introduced in Sections 4.2 through
4.4. The reason is that the DTD approach generates different relational schemas for different DTDs. Each
generated relational schema is tailored for a specific DTD and so precisely captures the structure of XML
data conforming to that DTD schema. In contrast, approaches in Sections 4.2 through 4.4 generate the same
relational schema (tables Edge, Node, Path, etc.) for various XML data despite their different structures,
and so fail to process a specific data set efficiently. The experimental work in [TDCZ02] also verifies
this point.
When XML data are schemaless (i.e. no DTD is available for them), the PM approach is the best compared
with the Edge approach and the Node approach, since (1) it supports ‘//’ axis queries and (2) it needs fewer
join operations. Further, among the three variations of the PM approach (Basic PM, RP, BLAS), the RP
approach with the BLAS extension is the best. In fact, the basic RP approach has been integrated into
Microsoft SQL Server 2005 [OOP+04, PCS+04]. Interestingly, [OOP+04, PCS+04] do not mention the work
of BLAS [CDZ04]. We propose to extend the basic RP approach in [OOP+04, PCS+04] with the PLabel
method of BLAS to gain the best query processing performance.
5 XML Query Processing: the Native Approach
Although the relational approach is simple and feasible, it could have inferior query performance. In order
to answer ‘//’ axis queries, the Node approach and the PM approach use θ-joins (see footnote 2) to implement the
node/path-joining step (see Part (2) in Figures 17 and 19), discarding the equi-joins used in the Edge approach (see Part
(2) in Figure 14). θ-joins are more complex and costly than equi-joins. Although current DBMSs have been
coupled with efficient techniques to optimize and process equi-joins, they do not support θ-joins efficiently,
particularly when multiple comparison predicates are involved in queries. Some experimental work has
verified this point [ZND+01].
Much research has been done on developing native algorithms to efficiently process θ-joins involved
in XML twig queries. We say these techniques are in the native approach since their storage and query
mechanisms are developed from scratch, without involving relational databases. The authors of these native
techniques believe that a special storage and query system tailored for XML data will improve XML query
processing performance significantly. In the native approach, θ-joins are also called structural joins.
Specifically, in the native approach, XML data are stored in inverted lists. Inverted indexes have been
widely used in Information Retrieval to implement efficient text search [SM83]. An inverted index creates one
list for each distinct word in text documents; the list gives the positions of all occurrences of this word. These
lists are called inverted lists. Borrowing this idea, the native approach creates one inverted list for each
distinct tag in XML documents; the list gives the positions of all elements with that tag name. The position of an
element is expressed using its (start, end, level) numbers. Positions in a list are sorted in increasing
order of their start numbers. Figure 23 shows the inverted lists of the XML document in Figure 2 (a).
Figure 23: Inverted lists
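A minimal sketch of how such inverted lists could be built is given below, using Python's standard XML parser. It rests on our own assumptions: a single counter is incremented at every element open and close (text content consumes no numbers, which may differ from the numbering used in Section 2), the root has level 1, and the function name `build_inverted_lists` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_inverted_lists(xml_text):
    """Return {tag: [(start, end, level), ...]} with each list sorted by start."""
    lists = defaultdict(list)
    counter = 0

    def visit(elem, level):
        nonlocal counter
        counter += 1                  # number assigned when the element opens
        start = counter
        for child in elem:
            visit(child, level + 1)
        counter += 1                  # number assigned when the element closes
        lists[elem.tag].append((start, counter, level))

    visit(ET.fromstring(xml_text), 1)
    for positions in lists.values():
        positions.sort()              # increasing order of start numbers
    return dict(lists)

doc = ("<Publisher><address>Cambridge</address>"
       "<book><author><name>Alice</name></author></book></Publisher>")
inverted = build_inverted_lists(doc)
```

Each list can then be scanned sequentially by the structural join algorithms discussed in the remainder of this section.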
5.1 The MPMGJN Approach
[ZND+01] proposed an MPMGJN (Multi-Predicate MerGe JoiN) algorithm, which is the first native approach
to implementing structural joins. Its implementation is somewhat similar to that of the standard Merge Join
algorithm developed in relational query optimizers for equi-joins. In order to answer a query ‘A//B’ or
‘A/B’, two cursors are created on AList and BList, which have been sorted in increasing order of start
numbers. Initially, these two cursors point to the heads of AList and BList, respectively. Then, the elements
under the cursors are compared with each other and the cursors are advanced in turn to implement the merge join.
Footnote 2: θ-joins are joins involving ‘>’ and ‘<’ comparisons, while equi-joins involve only ‘=’ comparisons.
In contrast to the standard merge-join implementation for equi-joins, MPMGJN has its own cursor-
advancing mechanism, which is specially tailored to efficiently support structural joins. Specifically, at each
step, it compares and advances two cursors as Figure 24 describes. The working process of MPMGJN is
also illustrated in Figure 25. Note that dotted edges in Figure 25 (a) mean there might be other data nodes
than A-tagged or B-tagged nodes on those edges although we show only A-tagged and B-tagged nodes in
this data tree. Experimental work in [ZND+01] found that MPMGJN algorithm is more than an order of
magnitude faster than RDBMS join implementation in most query cases.
Figure 24: The core of the MPMGJN algorithm
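The cursor-advancing idea can be sketched in Python as follows. This is our own reconstruction in the spirit of a multi-predicate merge join, not a transcription of the algorithm in [ZND+01]; the function name `mpmgjn`, the (start, end, level) tuple layout, and the sample data are assumptions.

```python
def mpmgjn(a_list, b_list, child_only=False):
    """Structural join of two inverted lists of (start, end, level), sorted by start.

    Returns (a, b) pairs where b is a descendant of a (or a child of a
    if child_only is True).
    """
    out = []
    mark = 0   # first B entry that can still match the current or a later A
    for a in a_list:
        a_start, a_end, a_level = a
        # B entries starting before this A cannot be inside it, nor inside any
        # later A, because later A entries start even later.
        while mark < len(b_list) and b_list[mark][0] <= a_start:
            mark += 1
        # Scan every B entry that starts inside A's interval. The scan restarts
        # from the mark for each A, so nested A entries revisit the same B entries.
        j = mark
        while j < len(b_list) and b_list[j][0] < a_end:
            b = b_list[j]
            if not child_only or b[2] == a_level + 1:
                out.append((a, b))
            j += 1
    return out

# Hypothetical data: a1 contains b1 and b6 as children; b2..b5 are nested inside b1.
a_list = [(1, 14, 1)]
b_list = [(2, 11, 2), (3, 10, 3), (4, 9, 4), (5, 8, 5), (6, 7, 6), (12, 13, 2)]
descendant_pairs = mpmgjn(a_list, b_list)
child_pairs = mpmgjn(a_list, b_list, child_only=True)
```

For the child-axis query, the scan still walks over the four nested B entries and rejects each one on the level test, which is the kind of unnecessary work that motivates the next approach.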
5.2 The StackTree Approach
[AKJK+02] observed that although the MPMGJN approach is efficient for ‘//’ axis queries, it fails to process
‘/’ axis queries efficiently in some cases. A motivating example is shown in Figure 26 (a). In this example,
a1 has only two B children, b1 and b6. However, we can see from Figure 26 (b) that MPMGJN finds the
child b6 only after it has scanned b1 through b5; b2 through b5, which are indirect descendants but
not children of a1, have to be visited unnecessarily.
In order to avoid such unnecessary node scanning, [AKJK+02] proposed a new approach, StackTree.
StackTree uses a nice stack structure to cache A nodes nested on the same path in data trees. Figure 27
shows the core of the StackTree algorithm. At each step, the data node with the smallest start number is
taken out of its list. If it is an A-tagged node, it is pushed into the stack. If it is a B-tagged node, StackTree
tries to use it to form tuple solutions with existing A-tagged nodes in the stack. Figure 26 (c) illustrates its
working process. From this example, we can see that there are no redundant comparisons of b2 through b5
with a1. Therefore, StackTree has better query processing performance than MPMGJN.
Figure 25: The MPMGJN approach (for the query ‘A//B’)
Figure 26: The StackTree approach (for the query ‘A/B’)
Both StackTree and MPMGJN are binary join algorithms, i.e. they join only a pair of inverted lists (or
only one edge in the query twig). Since a complete twig query consists of a series of binary joins, the problem
of join order selection has to be considered seriously. Just as in the context of relational databases, join order
significantly affects XML query processing performance. Most relational query optimizers use
a classical dynamic programming method to select an optimal join order. [WPJ03] proposed similar
dynamic programming methods to select an optimal or sub-optimal order of binary structural joins for XML
twig queries. The StackTree binary join algorithm and the corresponding dynamic-programming-based join
order selection algorithm have been integrated into Timber [JAKC+02], a well-known native XML database
prototype from the University of Michigan.
[Figure 27: The core of the StackTree algorithm]
[Figure 28: The core of the PathStack algorithm]
Note that the StackTree algorithm in Figure 27 outputs all tuple solutions in
increasing order of the start numbers of descendant nodes (i.e. B-tagged nodes). For example, the six tuple solutions
in Figure 26 (c) are output in the order of b1 through b6. Complementarily, [AKJK+02] also proposed a
variant of the StackTree algorithm that outputs tuple solutions in increasing order of the start numbers of
ancestor nodes (i.e. A-tagged nodes). This is very important for twig queries. Consider a twig query
‘C//A//B’. If we select the query plan C ⋈ (A ⋈ B), then the query results of A ⋈ B have to be sorted by A
nodes, since the next binary join will occur between C and A.
In addition, [CVZ+02] extends the StackTree algorithm with a skip technique, so that some nodes in the
inverted lists do not need to be visited during the join process if they are predicted not to form any
tuple solutions with other nodes.
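One way to picture such skipping is sketched below. This is an illustrative Python sketch only, not the algorithm of [CVZ+02]: it assumes (start, end) region numbers, start-sorted lists, and it uses a binary search over the B list (`bisect_right`) as a stand-in for an index lookup. The function name `join_with_skip` and the sample data are hypothetical. When no A node is open, every B node that starts before the next A node cannot have an A ancestor, so the sketch jumps over all of them at once.

```python
from bisect import bisect_right
from typing import List, Tuple

Region = Tuple[int, int]  # (start, end)

def join_with_skip(a_list: List[Region], b_list: List[Region]):
    """'A//B' join that returns (pairs, number of B nodes skipped)."""
    b_starts = [b[0] for b in b_list]   # search key for the jump
    stack: List[Region] = []
    out: List[Tuple[Region, Region]] = []
    skipped = 0
    i = j = 0
    while j < len(b_list):
        b = b_list[j]
        # Start number of the next node in document order.
        nxt = min(a_list[i][0], b[0]) if i < len(a_list) else b[0]
        while stack and stack[-1][1] < nxt:
            stack.pop()
        if i < len(a_list) and a_list[i][0] < b[0]:
            stack.append(a_list[i])
            i += 1
            continue
        if not stack:
            if i == len(a_list):
                skipped += len(b_list) - j   # no A node can match any more
                break
            # Jump to the first B node that starts after the next A node.
            new_j = bisect_right(b_starts, a_list[i][0], j)
            skipped += new_j - j
            j = new_j
            continue
        out.extend((a, b) for a in stack)
        j += 1
    return out, skipped

A = [(10, 20), (50, 60)]
B = [(1, 2), (3, 4), (5, 6), (12, 13), (30, 31), (32, 33), (52, 53), (70, 71)]
pairs, skipped = join_with_skip(A, B)
print(pairs, skipped)
```

In this made-up input, six of the eight B nodes are jumped over or cut off without being joined.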
5.3 The Holistic Approach
A weakness of decomposing twig queries into multiple binary joins is that this method can generate a large
number of intermediate query results. For example, for the query plan (A ⋈ B) ⋈ C, the query result of the
first join A ⋈ B has to be written to disk if it is too large to fit in memory, and then
read back into memory to join with C after A ⋈ B has been completed. This results in high disk I/O cost.
To overcome this weakness, [BKS02] proposed a Holistic approach, which is essentially a pipelined
join, i.e. it joins multiple inverted lists at one time so that no intermediate query results are generated.
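The effect of intermediate results can be seen on a tiny, made-up example. The sketch below assumes a path query ‘A//B//C’ evaluated with the binary plan (A ⋈ B) ⋈ C, uses (start, end) region numbers, and checks containment by brute force purely for illustration. The names `contains` and `binary_plan` are hypothetical.

```python
from itertools import product
from typing import List, Tuple

Region = Tuple[int, int]  # (start, end)

def contains(anc: Region, desc: Region) -> bool:
    # Region containment: anc is an ancestor of desc.
    return anc[0] < desc[0] and desc[1] < anc[1]

def binary_plan(a_list: List[Region], b_list: List[Region], c_list: List[Region]):
    # First join: materialize every (A, B) pair, whether or not a C follows.
    ab = [(a, b) for a, b in product(a_list, b_list) if contains(a, b)]
    # Second join: only now are pairs without a matching C discarded.
    abc = [(a, b, c) for (a, b), c in product(ab, c_list) if contains(b, c)]
    return ab, abc

A = [(1, 100)]
B = [(2, 9), (10, 19), (20, 29), (30, 39), (40, 49)]
C = [(3, 4)]
ab, abc = binary_plan(A, B, C)
print(len(ab), len(abc))  # 5 intermediate pairs, 1 final match
```

Four of the five intermediate pairs never contribute to the final answer; these are the results a holistic join avoids materializing.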
Figure 28 shows the core of the PathStack algorithm, which uses the Holistic approach to answer simple
path queries. This algorithm is structurally very similar to the StackTree algorithm
in Figure 27. The difference is that StackTree uses only one stack to cache nested A nodes, whereas
PathStack has multiple stacks, one for each non-leaf node in a path query, since the inverted lists of all nodes
in a path query are involved in the pipelined join. Also, each node cached in a stack has an associated
pointer to a corresponding node in its parent stack, in order to track tuple solutions.
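The multi-stack idea can be sketched as follows. This is a simplified Python sketch under stated assumptions, not the listing from [BKS02]: it handles only paths whose edges are all ‘//’, assumes each query node has its own start-sorted inverted list of (name, start, end) entries, and skips a node outright when its parent stack is empty. The function name `path_stack` and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple

Region = Tuple[str, int, int]  # (name, start, end)

def path_stack(path: List[str], lists: Dict[str, List[Region]]) -> List[Tuple[str, ...]]:
    """Evaluate q1//q2//...//qn over start-sorted inverted lists."""
    n = len(path)
    leaf = n - 1
    pos = [0] * n  # cursor into each inverted list
    # One stack per query node; the leaf's stack is never used.
    stacks: List[List[Tuple[Region, int]]] = [[] for _ in range(n)]
    out: List[Tuple[str, ...]] = []

    def expand(level: int, top: int, suffix: Tuple[str, ...]) -> None:
        # Follow the stack pointers to enumerate every ancestor chain.
        if level < 0:
            out.append(suffix)
            return
        for k in range(top + 1):
            node, ptr = stacks[level][k]
            expand(level - 1, ptr, (node[0],) + suffix)

    while pos[leaf] < len(lists[path[leaf]]):
        # Query node whose next list entry has the smallest start number.
        q = min((i for i in range(n) if pos[i] < len(lists[path[i]])),
                key=lambda i: lists[path[i]][pos[i]][1])
        node = lists[path[q]][pos[q]]
        pos[q] += 1
        # Pop every cached node that ends before the new node starts.
        for s in stacks:
            while s and s[-1][0][2] < node[1]:
                s.pop()
        parent_top = len(stacks[q - 1]) - 1 if q > 0 else -1
        if q > 0 and parent_top < 0:
            continue                      # no open ancestor, cannot match
        if q == leaf:
            expand(q - 1, parent_top, (node[0],))
        else:
            stacks[q].append((node, parent_top))
    return out

lists = {
    "A": [("a1", 1, 20), ("a2", 2, 15)],
    "B": [("b1", 3, 10), ("b2", 16, 19)],
    "C": [("c1", 4, 5), ("c2", 17, 18), ("c3", 21, 22)],
}
print(path_stack(["A", "B", "C"], lists))
```

The pointer stored with each cached node records how much of the parent stack was open when the node arrived, so all entries at or below that position are its ancestors and complete solutions can be read off without materializing any intermediate join result.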
Recall that PathStack was also used as a file approach to answering path queries over XML documents
(Section 3.1.2). Here, Figure 29 illustrates how to use PathStack as a native approach to answering path
queries over inverted lists. This figure is very similar to Figure 8. The differences are: (1) only A-tagged and
B-tagged nodes are read in the native approach. Therefore, no other irrelevant nodes in XML documents,
such as node d1 in Figure 8, are read. (2) In the native approach, the event of nodes being popped out of
stacks is triggered by the arrival of other nodes with higher start numbers than their end numbers, rather
than being triggered by the arrival of their own closing tags as in the file approach.
[Figure 29: The Holistic approach (for the query ‘A//B/C’)]
Similarly, the Holistic approach also provides a TwigStack algorithm to answer general twig queries. The
main idea of TwigStack has been illustrated in Figure 9.
[BKS02] also experimentally compared the Holistic approach with the StackTree approach. Their
experimental results show that the Holistic approach generally processes queries more than six times faster
than the StackTree approach coupled with the optimal join order. Due to its high query
processing performance and algorithmic simplicity, the Holistic approach has been used extensively in
recent research work. For example, [JWLY03] extended it with a skip technique to avoid visiting nodes
in inverted lists that do not form any tuple solutions with other nodes. [JLW04] extended Holistic to process
twig queries with OR predicates. [BGKS03] applied Holistic to multi-query processing.
6 Conclusions
In this survey we reviewed major techniques for processing XML twig queries. These techniques are
categorized into three classes based on the storage format of XML data.
The file approach is mainly used for special-purpose applications in which XML data must be stored
in commonly used flat files in the form of just original XML documents. Since no indexes are available in
such applications, the entire XML document, including a large volume of elements irrelevant to the specified
query, has to be visited, which usually results in poor query processing performance.
In the relational approach, XML data can be simply loaded into relational databases and XML twig
queries over XML data can be easily transformed into SQL queries over relational data. In this approach, all
specific query processing work is pushed into relational query optimizers and no extra processing is needed.
However, current RDBMSs do not support θ-joins efficiently, despite the fact that θ-joins are a necessary
component for answering ‘//’ axis XML queries efficiently. Among relational approaches, the RP approach
with the BLAS extension has the best performance for querying schemaless XML data.
The native approach develops native algorithms to efficiently process the θ-joins involved in XML twig queries,
which are essentially structural joins of inverted lists. In this approach, many important existing components
of an RDBMS, such as storage management, access methods, query processing and optimization, concurrency
control and recovery, have to be rebuilt from scratch. Among native approaches, the Holistic approach shows
the best query processing performance in experiments.
As [ZND+01] implies, a good approach is to integrate native θ-join algorithms for XML twig
queries into existing relational query optimizers, so that the extended optimizers are able to
process XML twig queries more efficiently. Meanwhile, in this integration approach, the other existing
RDBMS components besides query optimizers, such as concurrency control and recovery, can be fully
reused, which significantly reduces development effort. Therefore, this integration approach achieves
the best trade-off between XML query processing performance and development effort.
References
[ABJ89] Rakesh Agrawal, Alexander Borgida, and H. V. Jagadish. Management of transitive relationships
in large data and knowledge bases. SIGMOD Conference, 1989.
[AF00] Mehmet Altinel and Michael J. Franklin. Efficient filtering of XML documents for selective
dissemination of information. VLDB Conference, 2000.
[AKJK+02] Shurug Al-Khalifa, H. V. Jagadish, Nick Koudas, Jignesh M. Patel, Divesh Srivastava, and
Yuqing Wu. Structural joins: a primitive for efficient XML query pattern matching. ICDE
Conference, 2002.
[BGKS03] Nicolas Bruno, Luis Gravano, Nick Koudas, and Divesh Srivastava. Navigation- vs. index-based
XML multi-query processing. ICDE Conference, 2003.
[BKS02] N. Bruno, N. Koudas, and D. Srivastava. Holistic twig joins: optimal XML pattern matching.
SIGMOD Conference, 2002.
[CDZ04] Yi Chen, Susan B. Davidson, and Yifeng Zheng. BLAS: An efficient XPath processing system.
SIGMOD Conference, 2004.
[Cha02] D. Chamberlin. XQuery: an XML query language. 41 (4), 2002.
[CHKZ03] Edith Cohen, Eran Halperin, Haim Kaplan, and Uri Zwick. Reachability and distance queries
via 2-Hop labels. SIAM Journal on Computing, 32:1338–1355, 2003.
[CVZ+02] Shu-Yao Chien, Zografoula Vagena, Donghui Zhang, Vassilis J. Tsotras, and Carlo Zaniolo.
Efficient structural joins on indexed XML documents. VLDB Conference, 2002.
[DAF+03] Yanlei Diao, Mehmet Altinel, Michael J. Franklin, Hao Zhang, and Peter M. Fischer. Path
sharing and predicate evaluation for high-performance XML filtering. ACM Transactions on
Database Systems (TODS), 28:467–516, 2003.
[DF03]
Yanlei Diao and Michael J. Franklin. High-performance XML filtering: an overview of YFilter.
IEEE Data Engineering Bulletin, 26:41–48, 2003.
[DFFT02] Yanlei Diao, Peter M. Fischer, Michael J. Franklin, and Raymond To. YFilter: Efficient and
scalable filtering of XML documents. ICDE Conference, 2002.
[Die82] Paul F. Dietz. Maintaining order in a linked list. ACM Symposium on Theory of Computing,
1982.
[FK99] Daniela Florescu and Donald Kossmann. Storing and querying XML data using an RDMBS.
IEEE Data Engineering Bulletin, 22:27–34, 1999.
[Gro04a] W3C Group. Extensible Markup Language (XML). http://www.w3.org/XML/, 2004.
[Gro04b] W3C Group. Guide to the W3C XML specification (XMLspec) DTD, version 2.1.
http://www.w3.org/XML/1998/06/xmlspec-report.htm, 2004.
[Gro04c] W3C Group. XML path language (XPath) 2.0. http://www.w3.org/TR/xpath20/, 2004.
[Gro04d] W3C Group. XML Schema. http://www.w3.org/XML/Schema, 2004.
[Gro04e] W3C Group. XQuery 1.0: an XML query language. http://www.w3.org/TR/xquery/, 2004.
[Gru02] Torsten Grust. Accelerating XPath location steps. SIGMOD Conference, 2002.
[GvKT04] Torsten Grust, Maurice van Keulen, and Jens Teubner. Accelerating XPath evaluation in any
RDBMS. ACM Transactions on Database Systems (TODS), 29:91–131, 2004.
[HBG+03] Alan Halverson, Josef Burger, Leonidas Galanis, Ameet Kini, Rajasekar Krishnamurthy,
Ajith Nagaraja Rao, Feng Tian, Stratis Viglas, Yuan Wang, Jeffrey F. Naughton, and David J.
DeWitt. Mixed mode XML query processing. VLDB Conference, 2003.
[JAKC+02] H. V. Jagadish, Shurug Al-Khalifa, Adriane Chapman, Laks V. S. Lakshmanan, Andrew Nierman,
Stelios Paparizos, Jignesh M. Patel, Divesh Srivastava, Nuwee Wiwatwattana, Yuqing Wu,
and Cong Yu. TIMBER: A native XML database. VLDB Journal, 11:274–291, 2002.
[JLW04] Haifeng Jiang, Hongjun Lu, and Wei Wang. Efficient processing of twig queries with
OR-predicates. SIGMOD Conference, 2004.
[JWLY03] Haifeng Jiang, Wei Wang, Hongjun Lu, and Jeffrey Xu Yu. Holistic twig joins on indexed XML
documents. VLDB Conference, 2003.
[MW99] Jason McHugh and Jennifer Widom. Query optimization for XML. VLDB Conference, 1999.
[OCL04] OCLC. Dewey decimal classification. http://www.oclc.org/dewey/, 2004.
[OOP+04] Patrick E. O’Neil, Elizabeth J. O’Neil, Shankar Pal, Istvan Cseri, Gideon Schaller, and Nigel
Westbury. ORDPATHs: Insert-friendly XML node labels. SIGMOD Conference, 2004.
[Org04] SAX Project Organization. SAX: Simple API for XML. http://www.saxproject.org/, 2004.
[PAKJ+02] Stelios Paparizos, Shurug Al-Khalifa, H. V. Jagadish, Laks V. S. Lakshmanan, Andrew Nierman,
Divesh Srivastava, and Yuqing Wu. Grouping in XML. EDBT Workshops, 2002.
[PCS+04] Shankar Pal, Istvan Cseri, Gideon Schaller, Oliver Seeliger, Leo Giakoumakis, and Vasili
Zolotov. Indexing XML data stored in a relational database. VLDB Conference, 2004.
[PWLJ04] Stelios Paparizos, Yuqing Wu, Laks V. S. Lakshmanan, and H. V. Jagadish. Tree logical classes
for efficient evaluation of XQuery. SIGMOD Conference, 2004.
[RSSB00] Prasan Roy, S. Seshadri, S. Sudarshan, and Siddhesh Bhobe. Efficient and extensible algorithms
for multi query optimization. SIGMOD Conference, 2000.
[SM83] G. Salton and M. J. McGill. Introduction to modern information retrieval. McGraw-Hill, 1983.
[SSK+01] Jayavel Shanmugasundaram, Eugene J. Shekita, Jerry Kiernan, Rajasekar Krishnamurthy,
Stratis Viglas, Jeffrey F. Naughton, and Igor Tatarinov. A general technique for querying
XML documents using a relational database system. SIGMOD Record, 30:20–26, 2001.
[STW04] Ralf Schenkel, Anja Theobald, and Gerhard Weikum. HOPI: An efficient connection index for
complex XML document collections. EDBT Conference, 2004.
[STW05] Ralf Schenkel, Anja Theobald, and Gerhard Weikum. Efficient creation and incremental
maintenance of the HOPI index for complex XML document collections. ICDE Conference, 2005.
[STZ+99] Jayavel Shanmugasundaram, Kristin Tufte, Chun Zhang, Gang He, David J. DeWitt, and
Jeffrey F. Naughton. Relational databases for querying XML documents: Limitations and
opportunities. VLDB Conference, 1999.
[TDCZ02] Feng Tian, David J. DeWitt, Jianjun Chen, and Chun Zhang. The design and performance
evaluation of alternative XML storage strategies. SIGMOD Record, 31 (1):5–10, 2002.
[TRP+04] Feng Tian, Berthold Reinwald, Hamid Pirahesh, Tobias Mayr, and Jussi Myllymaki.
Implementing a scalable XML publish/subscribe system using a relational database system. SIGMOD
Conference, 2004.
[TVB+02] Igor Tatarinov, Stratis Viglas, Kevin S. Beyer, Jayavel Shanmugasundaram, Eugene J. Shekita,
and Chun Zhang. Storing and querying ordered XML using a relational database system.
SIGMOD Conference, 2002.
[WPJ03] Yuqing Wu, Jignesh M. Patel, and H. V. Jagadish. Structural join order selection for XML
query optimization. ICDE Conference, 2003.
[YASU01] Masatoshi Yoshikawa, Toshiyuki Amagasa, Takeyuki Shimura, and Shunsuke Uemura. XRel:
a path-based approach to storage and retrieval of XML documents using relational databases.
ACM Transactions on Internet Technology (TOIT), 1:110–141, 2001.
[ZND+01] Chun Zhang, Jeffrey F. Naughton, David J. DeWitt, Qiong Luo, and Guy M. Lohman. On
supporting containment queries in relational database management systems. SIGMOD Conference,
2001.