XML Query Processing: A Survey ∗

Gang Gou, Rada Chirkova

Department of Computer Science

North Carolina State University

Raleigh, NC USA 27695-8207

Email: [email protected], [email protected]

May 10, 2005

Abstract

XML (Extensible Markup Language) is emerging as a de facto standard for information

exchange among various applications on the web because of its inherent data self-describing

capability and flexibility of organizing data. With increased impact of XML on information

exchange, it is particularly important to develop high-performance techniques to query large

XML data repositories efficiently.

The core of XML query processing is twig pattern matching, i.e. finding from XML

documents all matches that satisfy the twig (or path) pattern specified by a given query. In

this survey we will review and compare major techniques for processing XML twig queries.

We categorize these techniques into three classes based on the storage format of XML data.

First, we review the file approach, in which XML data must be stored in commonly used

flat files, as the original XML documents, for special-purpose applications. Then,

we review the relational approach, in which XML data are stored in relational databases so

that all existing important techniques that have been developed for relational databases can

be fully reused and so no extra development efforts are needed. Finally, we review the native

approach, in which XML data are stored in inverted lists and native algorithms are developed

to further improve XML query processing performance.

To the best of our knowledge, this is the first survey work that systematically reviews,

classifies, and compares state-of-the-art techniques for XML query processing.

*All copyrights of this technical report are reserved by the authors and North Carolina State University.

1 Introduction

XML (Extensible Markup Language) is emerging as a de facto standard for information exchange among

various applications on the web because of its inherent data self-describing capability and flexibility of

organizing data [Gro04a].

First, data in XML documents are self-describing. Similar to the familiar HTML (HyperText Markup

Language), XML is based on nested tags. Figure 1 (a) shows an example of an XML document, which

records information about publishers. However, unlike HTML, in which tags associated with data are used

to express the presentation style (e.g. font styles) of data, tags in XML are used to describe the semantics

of data. For example, Line 3 in Figure 1 (a) says that ‘Cambridge’ is an address of a publisher named

‘MITPress’. Therefore, when an application receives an XML document from another application over the

web, it can understand the content of this XML document, since data in XML documents are self-describing.

Second, XML is flexible in organizing data. The nested hierarchy of tags structures the content of XML

documents. The role of nested tags is somewhat similar to that of schemas in relational databases. However, the

nested XML model is more flexible than the flat relational model. Objects of the same kind in an XML document

might have different kinds of sub-objects or different numbers of sub-objects of the same kind. For example,

in Figure 1 (a), the first publisher has an address sub-element but the second publisher does not. The book

under the first publisher has two author sub-elements but the book under the second publisher has only one

author sub-element.

Figure 1: XML documents. (a) An XML document describing publishers; (b) the same document extended with ID/IDREF attributes.

1.1 Data Model

1.1.1 Basic Model: Tree

The basic data model of XML is a directed, rooted, labeled, and ordered tree. Figure 2 (a) and (b) show

the XML data tree of the XML document in Figure 1 (a).¹ Figure 2 (a) is based on a node-labeled model

where labels are on nodes, and Figure 2 (b) is based on an edge-labeled model where labels are on edges.

These two models are equivalent. Most research papers use the node-labeled model, while the edge-labeled

model is also used in some scenarios, such as in the Edge approach that will be introduced in Section 4.2.

Here we explain the XML data tree based on the node-labeled model, and analogous explanations can also

be applied to the edge-labeled model.

There are three classes of nodes in a data tree. (1) Element Node (internal node). Nodes of this class

correspond to tags in XML documents, such as publisher, address, etc. Labels on element nodes are just tags

in XML documents. (2) Attribute Node (internal node). Nodes of this class correspond to attributes in XML

documents, such as ‘@name’ under the first publisher element. In contrast to element nodes, attribute nodes

are not nested (i.e. an attribute cannot have any sub-elements), are not repeatable (i.e. two same-name

attributes cannot occur under one element), and are unordered (i.e. attributes of an element can freely

interchange their occurrence locations under this element). (3) Value Node (leaf node). Nodes of this class

correspond to data values in XML documents, such as ‘MITPress’, ‘database’, etc.

Edges in a data tree represent structural relationships between elements/attributes/values.

1.1.2 Extended Model: DAG and General Graph

XML documents allow users to define ID/IDREF attributes of elements, where the id attribute is used to

uniquely identify an element and idref attributes are used to refer to other elements which are explicitly

identified by their id attributes. ID/IDREF attributes increase the flexibility of the XML model so that

elements in XML documents may directly refer to each other freely. Figure 1 (b) shows an XML document

with ID/IDREF attributes, where the newly introduced ID/IDREF attributes are underlined.

Therefore, in addition to original tree edges in XML data trees which describe main skeleton structural

relationships in XML documents, ID/IDREF edges are also introduced into the XML data model to represent

direct reference relationships between elements, which extends the original tree model to a DAG (Directed

Acyclic Graph) or an even more general graph with cycles. Figure 2 (c) shows such a graph with cycles, which

corresponds to the XML document in Figure 1 (b).

1.2 XML Queries

Unlike keyword search in text retrieval, which concerns only the contents of text documents, XML queries concern

structure as well as contents of XML documents.

1.2.1 XPath

XPath [Gro04c] is a basic XML query language that is used to select nodes from XML documents such

that the path from the root to each selected node satisfies a specified pattern. A simple XPath query is

specified by a sequence of alternating axes and tags. Two commonly used axes are the child axis ‘/’, where ‘A/B’

denotes selecting B-tagged child nodes of A-tagged nodes, and the descendant axis ‘//’, where ‘A//B’ denotes

selecting B-tagged descendant nodes of A-tagged nodes. An example XPath query is "/Publisher//title"

(its standard form is "root/Publisher//title", but "root" is always omitted for simplicity), which

returns all book titles of all publishers. The result of this query against the data tree in Figure 2 (a) is a set

of title nodes that have values ‘Database’ and ‘Life’.

Figure 2: XML data model. (a) Node-labeled XML data tree; (b) edge-labeled XML data tree; (c) XML data graph with ID/IDREF edges.

¹ We defer discussing the pairs of numbers adorning nodes until later.

The query pattern specified by the XPath query above is a simple path pattern shown in Figure 3 (a),

where the arrow with ‘=’ denotes the ‘//’ axis. Generally, an XPath query can specify a more complex

tree pattern (also called twig pattern) by introducing selection predicates into XPath expressions. One such

example is "/Publisher[@name = ‘MITPress’]/book/title", in which ‘/Publisher/book/title’ is the main

path of this query and the content between ‘[’ and ‘]’ is a selection predicate. This query returns all book

titles of the publisher named ‘MITPress’. The pattern of this query is shown in Figure 3 (b). Generally,

multiple selection predicates might be involved in XPath queries.
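The two example queries can be tried out with a small sketch. It uses Python's standard `xml.etree.ElementTree` module, which supports only a limited subset of XPath. The document below is a hypothetical reconstruction of the one described for Figure 1 (a): the lowercase tag names, the author names, and the second publisher's name are assumptions made for illustration, and the queries are written relative to the root element.

```python
import xml.etree.ElementTree as ET

# Hypothetical document modeled on the description of Figure 1 (a).
# Values other than 'MITPress', 'Cambridge', 'Database' and 'Life' are invented.
DOC = """
<publishers>
  <publisher name="MITPress">
    <address>Cambridge</address>
    <book>
      <title>Database</title>
      <author>Tom</author>
      <author>Joe</author>
    </book>
  </publisher>
  <publisher name="XYZPress">
    <book>
      <title>Life</title>
      <author>Ann</author>
    </book>
  </publisher>
</publishers>
"""

root = ET.fromstring(DOC)

# Path pattern: child axis to publisher, then descendant axis ('//') to title.
all_titles = [t.text for t in root.findall("publisher//title")]

# Twig pattern: the predicate in brackets branches off the main path.
mit_titles = [t.text for t in root.findall("publisher[@name='MITPress']/book/title")]

print(all_titles)
print(mit_titles)
```

The first query follows a single path, while the second one must additionally verify the attribute branch, which is what makes its pattern a twig rather than a path.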

Figure 3: Query pattern. (a) Path pattern; (b) twig pattern.

1.2.2 XQuery

XQuery [Gro04e, Cha02] is another popular XML query language, which is an extension to XPath and

is more powerful than XPath. It is a functional language comprised of FLWR (For-Let-Where-Return)

clauses that can be nested and composed with full generality. For and Let clauses bind nodes selected by

XPath expressions to user-defined node variables. Where clauses specify selection or join predicates on node

variables. Return clauses operate on node variables to construct a new XML document as the query result.

Figure 4 (a) shows a simple XQuery, which groups books by their publisher addresses. The query pattern is

shown in Figure 4 (b), and the format of the resulting XML document is shown in Figure 4 (c) where the ‘∗’

edge means that a books node might have multiple book nodes as children. This example shows

that XQuery logically (rather than physically) includes two parts: twig pattern matching (defined by FLW)

and result construction (defined by Return).

Tree algebras have been developed to express more complex XQueries. [JAKC+02, PAKJ+02, PWLJ04]

address transforming XQuery to an algebraic tree. The algebraic tree represents an efficient logical plan of

answering XQuery. Each node in this tree is a tree algebraic operator. The basic tree algebraic operators

are selection, projection, and grouping, each of which takes one or multiple twig patterns as inputs.

Figure 4: An example of XQuery. (a) XQuery; (b) query pattern; (c) format of the resulting XML document.

1.2.3 Summary

The core of both XPath and XQuery queries is twig pattern matching (also called twig query), i.e. finding

from XML documents all matches that satisfy the twig (or path) pattern specified by a given query. We

call nodes in XML data trees data nodes and nodes in query twigs query nodes. For XPath queries, the

output of twig pattern matching is a set of data nodes whose corresponding query node is the end node of

the main path in a query twig. For example, the output of matching the twig pattern in Figure 3 (b) is a

set of title nodes. We call this type of output single-node solutions. For XQuery queries, the output of twig

pattern matching is a set of tuples of data nodes that correspond to multiple query nodes in a query twig.

For example, the output of matching the twig pattern in Figure 4 (b) is a set of (address, book) tuples, but

not a set of only book or address nodes. We call this type of output tuple solutions.

Another important point is that current XPath and XQuery do not support ID/IDREF axis queries, i.e.

they always assume that queries work on the tree-shaped XML data model. In fact, this assumption is also

made by most research papers on XML query processing. The first reason for making this assumption is

that the general graph-shaped data model significantly increases the complexity of XML query processing. The

second reason is that graph-shaped XML documents with ID/IDREF attributes are not common in practical

applications. We therefore keep this assumption throughout this survey unless explicitly stated otherwise.

In the remainder of this survey we review major techniques for processing XML twig queries. We

categorize these techniques into three classes based on the storage format of XML data. Section 3 introduces

the file approach, in which XML data must be stored in commonly used flat files, as required by special-

purpose applications. Sections 4 and 5 introduce the relational approach and the native approach, in which

XML data are stored in relational databases and inverted lists, respectively. With value indexes and structural

indexes available in these two approaches, XML queries can be answered much more efficiently than in the

file approach. Before we begin to review these approaches, we first introduce numbering schemes.

2 Numbering Schemes

In this section, we introduce numbering schemes that can overcome the weakness of the file approach and

have been taken as an important foundation for many techniques in the relational and native approaches.

Edges in XML data trees represent structural relationships between data nodes. The key idea of answering

XML twig queries is just determining structural relationships, or more specifically reachability, between any

pair of nodes in XML data trees. For example, in order to answer a path query ‘A//B’, given any pair of

A-tagged node and B-tagged node, say (a, b), in a data tree, we need to determine whether there exists a

path from a to b.

A straightforward method of determining reachability is tree navigation [MW99], which consists of either

traversing the subtree rooted at an A-tagged node to see if a B-tagged node can be found (forward navigation),

or, more intelligently, backtracking from a B-tagged node upwards to see if an A-tagged node can be found

(backward navigation). Backward navigation is usually more efficient because each node in a tree has only

one incoming path from the root but multiple outgoing paths. However, if A-tagged nodes are more selective

than B-tagged nodes, i.e. most A nodes have B descendants but most B nodes have no A ancestors, then

forward navigation might be more efficient. Therefore, a trade-off has to be determined, which is just the

motivation of hybrid navigation [MW99]. However, on the whole, the navigational method is not efficient,

since both forward and backward navigation involve traversing a large number of irrelevant nodes, i.e. nodes

tagged with neither A nor B. For example, for a path ‘/A/D/E/F/B’ in a data tree, the irrelevant nodes tagged

with D, E or F also have to be traversed to answer the query ‘A//B’ when the navigational method is

used.
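The cost of navigation can be made concrete with a minimal sketch. The `Node` class, the two functions, and the extra `X` branch are illustrative assumptions, not code from [MW99]; each function returns the matches together with the number of nodes it had to visit.

```python
class Node:
    """A tree node with a tag, a parent pointer and a list of children."""
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children = tag, parent, []
        if parent is not None:
            parent.children.append(self)


def forward(a, tag):
    """Forward navigation: scan the subtree below `a` for nodes tagged `tag`."""
    found, visited, stack = [], 0, list(a.children)
    while stack:
        n = stack.pop()
        visited += 1
        if n.tag == tag:
            found.append(n)
        stack.extend(n.children)
    return found, visited


def backward(b, tag):
    """Backward navigation: walk from `b` towards the root looking for `tag`."""
    found, visited, n = [], 0, b.parent
    while n is not None:
        visited += 1
        if n.tag == tag:
            found.append(n)
        n = n.parent
    return found, visited


# The path /A/D/E/F/B from the text, plus one extra branch X under A.
a = Node("A")
d = Node("D", a)
e = Node("E", d)
f = Node("F", e)
b = Node("B", f)
x = Node("X", a)

print(forward(a, "B")[1], backward(b, "A")[1])
```

In both directions the irrelevant nodes D, E and F are visited, and forward navigation additionally visits the unrelated branch X.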

Another method of determining reachability is precomputing, for each node in a data tree, a set of nodes

that can be reached from this node, i.e. materializing transitive closure of this data tree. The transitive

closure is typically very large and so could waste storage space. Therefore, we need a less exhaustive method

to compactly represent transitive closure. Numbering schemes are one such method.

[Die82] is the origin of numbering schemes for trees. It proposed a kind of numbering scheme we call

PrePost Coding, which uses tree-traversal orders of nodes to compactly represent transitive closure of

trees. Specifically, each node in a tree is labelled with a pair of numbers, (start, end), where start and

end correspond to preorder and postorder traversal numbers of this node in the tree, respectively. [ZND+01]

introduced PrePost coding into XML applications. As can be seen from Figure 2 (a), the following property

always holds.

Property 1 (Ancestor-Descendant Relationship) In a data tree, node a is an ancestor of node b if

and only if a.start < b.start < a.end.

PrePost Coding has two major advantages. (1) (start, end) numbers (also called PrePost

numbers) need only modest storage space: 2 ∗ |V| numbers, where |V| is the number of nodes in the data tree. (2)

Using PrePost numbers, we can determine the ancestor-descendant relationship between any pair

of nodes in constant time by using only two number comparison operations. In addition, PrePost coding

can also be easily extended to check the parent-child relationship if we attach another number, level, to each

node, which denotes the depth of this node in the tree.

Property 2 (Parent-Child Relationship) In a data tree, node a is a parent of node b if and only if

a.start < b.start < a.end and a.level + 1 = b.level.

In fact, in addition to the commonly used ‘/’ and ‘//’ axes, PrePost coding extended with the number level

is able to process all other axes defined in XPath, such as following, following-sibling, etc. [Gru02, GvKT04].
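Properties 1 and 2 can be illustrated with a short sketch. It assumes one possible way of assigning (start, end) numbers, namely a single counter that is incremented when a node is entered and again when it is left; the concrete numbers in Figure 2 (a) may be assigned differently. The `Node` class and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)
    start: int = 0
    end: int = 0
    level: int = 0


def number(root):
    """Assign (start, end, level) in one depth-first traversal."""
    counter = 0

    def visit(node, level):
        nonlocal counter
        counter += 1                      # number taken when the node is entered
        node.start, node.level = counter, level
        for child in node.children:
            visit(child, level + 1)
        counter += 1                      # number taken when the node is left
        node.end = counter

    visit(root, 1)


def is_ancestor(a, b):
    """Property 1: two integer comparisons."""
    return a.start < b.start < a.end


def is_parent(a, b):
    """Property 2: Property 1 plus a level check."""
    return is_ancestor(a, b) and a.level + 1 == b.level


title = Node("title")
book = Node("book", [title])
address = Node("address")
publisher = Node("publisher", [address, book])
number(publisher)

print([(n.tag, n.start, n.end, n.level) for n in (publisher, address, book, title)])
```

With this assignment the interval of every node strictly contains the intervals of its descendants, which is exactly what the two comparisons test.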

Another famous numbering scheme for trees is Dewey Coding [OCL04], which was originally developed

for general knowledge classification. [TVB+02] introduced it into XML query processing. With this coding,

each node is associated with a vector of numbers that represents the path from the root to this node. This

coding method is illustrated in Figure 5. We can show that in a data tree, node a is an ancestor of node b

if and only if a.vector is a prefix of b.vector.
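The prefix test can be sketched as follows. The tree is given as nested `(tag, children)` pairs, vectors are Python tuples starting at `(1,)` for the root, and the test uses a proper prefix so that a node is not reported as its own ancestor; all of these choices are assumptions made for illustration.

```python
def assign_dewey(tree, prefix=(1,)):
    """Return [(tag, dewey_vector), ...] in document order for a (tag, children) tree."""
    tag, children = tree
    labels = [(tag, prefix)]
    for i, child in enumerate(children, start=1):
        # A child's vector is its parent's vector extended by its sibling position.
        labels.extend(assign_dewey(child, prefix + (i,)))
    return labels


def is_ancestor(a, b):
    """a is an ancestor of b iff a's vector is a proper prefix of b's vector."""
    return len(a) < len(b) and b[:len(a)] == a


tree = ("publishers", [
    ("publisher", [("address", []), ("book", [("title", [])])]),
    ("publisher", [("book", [])]),
])
labels = assign_dewey(tree)
for tag, vector in labels:
    print(".".join(map(str, vector)), tag)
```

Unlike the two integer comparisons of PrePost coding, the prefix test compares up to len(a) components, so its cost grows with the depth of the ancestor.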

Figure 5: Dewey Coding

An advantage of Dewey Coding over PrePost Coding is that Dewey Coding is easier to maintain when

dynamic updates occur on data trees. Using Dewey Coding, when a new node is inserted somewhere in

a data tree, only nodes in subtrees rooted at the following sibling nodes of this new node need to change

their Dewey vectors. In contrast, using PrePost Coding, when a new node is inserted, most nodes in a data

tree might need to update their (start, end) numbers. ORDPATH Coding, which is a variant of Dewey

Coding but even easier to maintain than Dewey coding, has been integrated into the XML query processing

component of Microsoft SQL Server 2005 [OOP+04].

However, compared with PrePost, Dewey has some obvious weaknesses. (1) The path vector associated

with each node needs more storage space than the (start, end) numbers in PrePost Coding. (2) PrePost

provides more efficient support for checking the ancestor-descendant relationship between two nodes, since

number comparison can be implemented more efficiently than checking the prefix

containment relationship between two path vectors. Due to these properties of PrePost, most XML

research papers use PrePost as their numbering scheme. Our survey will continue this tradition.

In addition to numbering schemes for trees, numbering schemes have also been developed for DAGs

[ABJ89] and for even more general graphs with cycles [CHKZ03]. [STW04, STW05] applied 2-hop labels

developed in [CHKZ03] to deal with general XML data graphs. However, the size of 2-hop labels is usually

very large, which limits its application in practice.

3 XML Query Processing: the File Approach

XML data are originally created in the form of XML documents (Figure 1) and stored in flat files. Generally,

various indexes need to be built on XML data to facilitate answering XML queries, since indexes can locate

goal data quickly without exhaustively scanning the data. Such indexes include classical B+-tree index

(Section 4), which is an index on data values (value indexing), and recently developed numbering schemes

(Section 2), which is an index on structure of XML documents (structure indexing). However, indexes

themselves are redundant data. In some application scenarios, XML data must be exchanged in the form of

flat files only, without any redundant data such as indexes being allowed to associate with them. In those

cases where indexes are not available, entire XML documents have to be scanned to answer queries.

One example of such applications is SDI (Selective Dissemination of Information) [AF00, DFFT02, DF03,

DAF+03, BGKS03, TRP+04]. SDI is essentially an XML Publish/Subscribe system. Figure 6 illustrates its

structure. The filtering system stores XPath queries from subscribers. It matches each incoming streaming

XML document D from publishers with each subscribed XPath query. If a match is found in D with

some XPath query Q, then D will be sent to subscribers of Q. In order to reduce network bandwidth,

publishers disseminate only XML documents, without any redundant data such as indexes associated with

these documents. In this scenario, only tree navigation methods, specifically only the forward navigation

method (Section 2), can be used, since scanning XML documents sequentially in document order is essentially

a depth-first traversal of XML data trees.

Figure 6: SDI application

3.1 Single-Query Processing

3.1.1 The Automata Approach

The automata approach is a natural implementation of forward navigation, which has been widely researched

[AF00, DFFT02, DF03, DAF+03, BGKS03, HBG+03]. This approach expresses an XPath query as an

automaton and runs XML documents on this automaton as if XML documents were strings.

When a streaming XML document arrives, a SAX parser [Org04] parses it sequentially on the fly. SAX is

an event-based XML parser. A StartElement event is triggered when the opening tag of an XML element

is encountered, which returns the tag name and all associated attributes (if any) of this element to the

event handler. Similarly, an EndElement event is triggered when the closing tag of an XML element is

encountered, which returns the tag name of this element to the event handler. The event handler then uses

opening/closing tags returned by events to activate the corresponding state transitions of the automaton.
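The event-based interface can be illustrated with Python's standard `xml.sax` module, whose `ContentHandler` class exposes `startElement` and `endElement` callbacks. The `TraceHandler` class and the one-line document are hypothetical; the handler only records the events that a filtering system would feed into its automaton.

```python
import xml.sax


class TraceHandler(xml.sax.ContentHandler):
    """Record start and end events in the order the parser reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        # Opening tag: the tag name and its attributes are passed in.
        self.events.append(("start", name, dict(attrs.items())))

    def endElement(self, name):
        # Closing tag: only the tag name is passed in.
        self.events.append(("end", name))


handler = TraceHandler()
xml.sax.parseString(
    b'<publisher name="MITPress"><address>Cambridge</address></publisher>',
    handler,
)
for event in handler.events:
    print(event)
```

The events arrive in document order, so processing them one by one amounts to a depth-first traversal of the data tree without ever materializing the tree.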

Figure 7 illustrates this approach. Figure 7 (b) is an automaton equivalent to XPath query ‘//A//B/C’

where ‘//’ axes are represented using ∗-edges (‘∗’ denotes any tag name), and the leaf query node is taken

as an accept state (State 3). The key idea of this approach is using a run-time stack, in which each stack

element is a set of automaton states. When an opening tag is encountered, each state in the stack-top

element is transformed to new states (or to this state itself if there is a ∗-edge outgoing from it) based on

this tag. These newly generated states are collected into a new stack element which is in turn pushed into

the run-time stack as the new stack top. In contrast, when a closing tag is encountered, the stack-top element

is simply popped off the stack. For SDI applications, whose goal is just to check whether there exists one match

between the published XML document and a subscribed XPath query, the matching process can terminate

once an accept state, such as State 3 in Figure 7 (b), is reached. However, for general query applications

whose goal is to find all matches, the matching process has to continue until the end of the XML document

is reached. All elements resulting in accept states, such as elements c1 and c2 in Figure 7 (c), are output as

query results.
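The run-time stack of state sets can be sketched as follows for the query ‘//A//B/C’. The transition table is our reading of the automaton described above (States 0 and 1 carry ∗-edges, State 3 is the accept state), and the event stream is a hypothetical document of the shape a1/b1/d1/a2/b2 with two C children c1 and c2 under b2; neither is copied from the cited systems.

```python
# Automaton for //A//B/C: tag transitions, states with a *-edge, accept state.
TRANSITIONS = {(0, "A"): 1, (1, "B"): 2, (2, "C"): 3}
SELF_LOOPS = {0, 1}
ACCEPT = 3

# Hypothetical stream: (event kind, tag, element id).
EVENTS = [
    ("start", "A", "a1"), ("start", "B", "b1"), ("start", "D", "d1"),
    ("start", "A", "a2"), ("start", "B", "b2"),
    ("start", "C", "c1"), ("end", "C", "c1"),
    ("start", "C", "c2"), ("end", "C", "c2"),
    ("end", "B", "b2"), ("end", "A", "a2"), ("end", "D", "d1"),
    ("end", "B", "b1"), ("end", "A", "a1"),
]


def run(events):
    """Return the ids of all elements that drive the automaton into ACCEPT."""
    stack = [{0}]                     # each stack element is a set of states
    results = []
    for kind, tag, elem in events:
        if kind == "start":
            nxt = set()
            for state in stack[-1]:
                if state in SELF_LOOPS:               # *-edge: stay in the state
                    nxt.add(state)
                if (state, tag) in TRANSITIONS:       # tag edge: advance
                    nxt.add(TRANSITIONS[(state, tag)])
            stack.append(nxt)
            if ACCEPT in nxt:
                results.append(elem)
        else:
            stack.pop()               # closing tag: restore the parent's states
    return results


print(run(EVENTS))
```

Note that the result contains only the matching C elements; the stack holds automaton states rather than data nodes, so the A and B elements that took part in each match are not available.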

Figure 7: The Automata approach: processing XPath query ‘//A//B/C’

3.1.2 The PathStack Approach

The automata approach described above is simple and feasible. However, its big weakness is that although

it derives single-node solutions (e.g. a set of C nodes), it is difficult to derive tuple solutions (e.g. a set of (A,

B, C) tuples). The reason is that the run-time stack tracks only states in automata but not data nodes in

data trees. In addition, the run-time stack wastes memory space. Due to the ‘//’ axes, states with outgoing

∗-edges, such as State 0 and State 1, have copies in a large number of stack elements repeatedly.

[BKS02] introduced an elegant data structure, PathStack, which can overcome the weaknesses of the

automata approach described above. PathStack was introduced in [BKS02] originally as a native approach

to answering XML twig queries. [BGKS03] extended it to process multi-queries. Here we only introduce its

role in the file approach while leaving the introduction to its role in the native approach to Section 5. Figure

8 illustrates this PathStack approach.

Figure 8: The PathStack approach: processing XPath Query ‘//A//B/C’

The key idea of the PathStack approach is using a series of linked stacks to track scanned data nodes.

Specifically, one stack is created for each query node in a path query. For example, for a path query

‘//A//B/C’, there are three stacks, Stack C → Stack B → Stack A. When an opening tag is encountered,

the corresponding XML element is pushed into the stack named by this tag, together with a pointer to

the top element in its parent stack (see Steps (2), (5), (6) and (8)). Note that elements such as d1, whose tags

do not correspond to any stack (i.e. are irrelevant to the given path query), are simply discarded. In contrast,

when a closing tag is encountered, the top element in its corresponding stack is simply popped out (see Steps

(7) and (9)).

The procedure above guarantees that at all times, elements in all stacks are from the same path in the

data tree. Therefore, when an element is pushed into the stack corresponding to the end node of the path

query, such as Stack C, it implies that some matches might have been found. These matches can be output

immediately as solutions through backtracking pointers associated with the elements in stacks (see Steps

(6) and (8)). Note that in order to check the child-parent relationship, each element that is pushed into the

stack also needs to be associated with its depth number in the data tree, which can be easily derived in the

parsing process.

Compared with the automata approach, the PathStack approach has the following advantages. (1) The

PathStack approach saves memory space. The number of stacks in the PathStack approach is the length of

the path query, while the depth of the run-time stack in the automata approach is the depth of the entire

XML data tree. (2) More importantly, the PathStack approach can derive tuple solutions, rather than only

single-node solutions. Tuple solutions matter for two reasons. They might be required by XQueries (Section

1.2.3). In addition, tuple solutions help answer twig queries as well as simple path queries by

means of a post-joining procedure, as introduced below.
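The linked stacks described in this section can be sketched for the streaming setting as follows. This is a simplified illustration rather than the algorithm as published in [BKS02]: it assumes that all query nodes carry distinct tags, represents the query as (tag, axis to parent) pairs, and runs on a hypothetical event stream of the shape a1/b1/d1/a2/b2 with children c1 and c2 under b2.

```python
# Hypothetical stream: (event kind, tag, element id).
EVENTS = [
    ("start", "A", "a1"), ("start", "B", "b1"), ("start", "D", "d1"),
    ("start", "A", "a2"), ("start", "B", "b2"),
    ("start", "C", "c1"), ("end", "C", "c1"),
    ("start", "C", "c2"), ("end", "C", "c2"),
    ("end", "B", "b2"), ("end", "A", "a2"), ("end", "D", "d1"),
    ("end", "B", "b1"), ("end", "A", "a1"),
]

# Path query //A//B/C as (tag, axis connecting the node to its parent).
QUERY = [("A", "//"), ("B", "//"), ("C", "/")]


def path_stack(events, query):
    index = {tag: i for i, (tag, _) in enumerate(query)}   # assumes distinct tags
    stacks = [[] for _ in query]   # entries: (element id, depth, pointer into parent stack)
    solutions, depth = [], 0

    def tuples_for(i, entry):
        """Enumerate tuples ending at `entry` by following the stored pointers."""
        elem, d, ptr = entry
        if i == 0:
            if query[0][1] == "//" or d == 1:
                yield (elem,)
            return
        # Every parent-stack entry at or below the pointer is an ancestor.
        for parent in stacks[i - 1][: ptr + 1]:
            if query[i][1] == "/" and parent[1] != d - 1:
                continue           # child axis requires adjacent depths
            for prefix in tuples_for(i - 1, parent):
                yield prefix + (elem,)

    for kind, tag, elem in events:
        i = index.get(tag)
        if kind == "start":
            depth += 1
            if i is None or (i > 0 and not stacks[i - 1]):
                continue           # irrelevant element: discard
            entry = (elem, depth, len(stacks[i - 1]) - 1 if i > 0 else -1)
            stacks[i].append(entry)
            if i == len(query) - 1:            # leaf query node: output matches
                solutions.extend(tuples_for(i, entry))
        else:
            depth -= 1
            if i is not None and stacks[i] and stacks[i][-1][0] == elem:
                stacks[i].pop()
    return solutions


for solution in path_stack(EVENTS, QUERY):
    print(solution)
```

Because the stacks hold data nodes, every output is a complete (A, B, C) tuple, and the depth stored with each entry is what allows the child axis between B and C to be checked.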

3.1.3 The TwigStack Approach

The TwigStack approach extends the PathStack approach to answer general twig queries [BKS02]. Its key

idea is twig decomposition, i.e. decomposing twig queries into multiple root-to-leaf path queries. Each path

query is still processed as in PathStack, and the query results are finally joined together to get the result

of the original twig query. However, since path queries from the twig decomposition have common prefix

(query) nodes, the stacks corresponding to these common prefix nodes can be shared. Therefore, in contrast

to PathStack, which links stacks in the form of a path, TwigStack links stacks in the form of a twig.

An example of TwigStack is shown in Figure 9. In this example, once a C-tagged node is pushed into

Stack C, the tuple solutions obtained from Stack C → Stack B → Stack A are immediately sent to Table 1.

Similarly, once an E-tagged node is pushed into Stack E, the tuple solutions obtained from Stack E → Stack

D → Stack B → Stack A are immediately sent to Table 2. If a path query is very selective so the size of

Table 1 and Table 2 is small, then these two tables can be temporarily stored in memory. Otherwise, they

have to be sent to disk. Finally, after the entire XML document is scanned, there is a post-joining procedure

which joins Table 1 and Table 2 on their shared attributes, A and B, to get tuple solutions (A, B, C, D, E).
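The post-joining step can be sketched as a hash join on the shared (A, B) columns. The choice of a hash join, the function name, and the sample rows are illustrative assumptions; the source does not prescribe a particular join algorithm.

```python
from collections import defaultdict


def post_join(table1, table2):
    """Join rows (A, B, C) with rows (A, B, D, E) on the shared (A, B) prefix."""
    by_prefix = defaultdict(list)
    for a, b, c in table1:
        by_prefix[(a, b)].append(c)            # build side: Table 1
    joined = []
    for a, b, d, e in table2:                  # probe side: Table 2
        for c in by_prefix.get((a, b), []):
            joined.append((a, b, c, d, e))
    return joined


# Hypothetical path solutions for the two root-to-leaf paths of the twig.
table1 = [("a1", "b1", "c1"), ("a1", "b2", "c2")]
table2 = [("a1", "b1", "d1", "e1"), ("a2", "b3", "d2", "e2")]

print(post_join(table1, table2))
```

Path solutions whose (A, B) prefix has no partner in the other table, such as the second row of each table here, are computed and stored but contribute nothing to the final answer.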

3.2 Multi-Query Processing

The problem of multi-query processing is answering a batch of queries rather than a single query. For example,

the SDI application is essentially an XML multi-query processing problem, except that it only needs to find one

match, rather than all matches, of a streaming XML document with each of the subscribed queries.

The problem of multi-query processing has been widely researched in the context of relational databases

(e.g. in [RSSB00]). The key idea of improving the performance of multi-query processing is answering

multiple queries simultaneously rather than separately through exploring shared parts of these queries. This

idea is similarly applicable in the context of XML multi-query processing.

Figure 9: The TwigStack approach

It is straightforward to extend the three approaches to XML single-query processing (Section 3.1) to XML

multi-query processing. The extension of the automata approach finds common prefixes of the given path

queries and shares the states corresponding to these common prefixes in a newly constructed automaton

[DFFT02, DF03, DAF+03], as Figure 10 (a) illustrates. Similarly, the extension of the PathStack approach

finds common prefixes of the given path queries and shares the stacks corresponding to these common prefixes

in a newly constructed TwigStack [BGKS03], as Figure 10 (b) illustrates. Note that the extension to the

PathStack approach above is essentially the same as the TwigStack approach.

Figure 10: XML Multi-Query Processing

Through the extensions above, each XML document needs to be scanned only once to answer multiple

queries simultaneously.
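The prefix-sharing idea can be sketched as a trie over the steps of the path queries, in which each shared prefix is stored once. This is a minimal Python illustration under the assumption that the queries use only child ('/') steps; the function names and the sample queries are hypothetical, and each trie node merely stands in for a shared automaton state or stack.

```python
def build_shared_prefix_tree(path_queries):
    """Merge path queries into a trie so that common prefixes are stored once."""
    root = {"children": {}, "queries": []}
    for query_id, query in enumerate(path_queries):
        node = root
        for step in query.strip("/").split("/"):
            node = node["children"].setdefault(step, {"children": {}, "queries": []})
        node["queries"].append(query_id)  # this query is matched at this node
    return root

def count_nodes(node):
    """Count trie nodes, including the root."""
    return 1 + sum(count_nodes(child) for child in node["children"].values())

queries = ["/A/B/C", "/A/B/D", "/A/E"]
tree = build_shared_prefix_tree(queries)
```

Here the three queries contain eight steps in total, but the trie needs only five step nodes (plus the root), because the prefixes /A and /A/B are shared.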

3.3 Summary

The file approach is mainly used for special-purpose applications in which XML data must be stored in

commonly used flat files in the form of just original XML documents. Because no redundant data such as

indexes are available in such applications, entire XML documents have to be scanned sequentially element

by element despite the fact that most elements in documents might be irrelevant to the specified queries,

which usually results in poor query processing performance. In the following two sections, we investigate

the relational approach and the native approach, in which XML data are stored in relational databases and

in inverted lists, respectively. With value indexes and structural indexes available in these two approaches,

XML queries can be answered much more efficiently than in the file approach.

4 XML Query Processing: the Relational Approach

Relational database systems are today’s mainstream database systems. Today’s well-known commercial

database systems, such as IBM DB2, Microsoft SQL Server, and Oracle, are all relational database management

systems (RDBMSs). Due to more than thirty years of academic and industrial effort, RDBMSs

have acquired strong capabilities in storage management, query processing and optimization, concurrency

control and recovery, etc. Therefore, a lot of research efforts have addressed storing and querying XML data

in RDBMS. In Section 4.1, we review past work on storing and querying XML data with a ‘schema’. In

Sections 4.2 through 4.4, we review past work on storing and querying schemaless XML data.

4.1 The DTD Approach

This approach is developed to store and query XML data with a ‘schema’. As introduced in Section 1, XML

is a flexible data model. However, XML data in many practical applications also conform to a schema to

some extent, since for various inter-operating applications that exchange data with each other, a common

agreement on the schema of exchanged data will facilitate data exchange among them significantly. Such

schemas can be described using standard Document Type Definitions (DTDs) [Gro04b] or XML Schemas

[Gro04d]. Here we briefly introduce basic issues on DTD only, since XML Schemas are essentially extensions

to DTDs.

A DTD is a set of statements where each statement specifies a relationship between an XML element and

its sub-elements/attributes, or the data type of an XML element/attribute. DTD statements are usually

stored in a separate document for reference. If an XML document, X, cites a DTD document, D, in its

header, then the structure of XML data in X must conform to the schema specified by D. We show a simple

DTD example in Figure 11. Figure 11 (a) is a DTD document, whose semantics can also be explained using

a DTD graph in Figure 11 (b). The ‘∗’ symbol associated with an element in DTD statements means that

this element can occur multiple times under its parent element. For example, a Publisher element might

have multiple Book sub-elements.

[Figure 11 panels: (a) a DTD document; (b) the corresponding DTD graph; (c) the relational schema derived from the DTD: Publisher(name, address), Book(title, p_name), Author(name, b_title, age)]

Figure 11: A DTD example

DTD schemas can be naturally transformed into relational schemas [STZ+99, SSK+01], as Figure 11 (c)

illustrates. In the resulting relational schema, separate relations are created for the root element (Publisher)

and all ‘∗’ sub-elements (Book and Author) in DTD. Each ‘∗’-element relation has a foreign-key reference,

e.g. attribute p_name in the Book table and attribute b_title in the Author table, to its parent-element table.

After XML data conforming to a DTD schema have been shredded into relational tables, XML queries over

XML data can be easily transformed into SQL queries over relational data. For example, a twig query

‘/Publisher[address = ‘Cambridge’]//Author/name’ can be transformed into a SQL query that joins three

tables, Publisher, Book, and Author, together, as Figure 12 illustrates.

Figure 12: The DTD approach: SQL query for ‘/Publisher[address = ‘Cambridge’]//Author/name’
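The three-table join can be sketched with Python's built-in sqlite3 module. The tables Publisher, Book, and Author and the foreign keys p_name and b_title follow the text; the remaining columns, the sample rows, and the exact SQL wording are illustrative assumptions and are not the query of Figure 12.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Publisher (name TEXT PRIMARY KEY, address TEXT);
CREATE TABLE Book (title TEXT PRIMARY KEY, p_name TEXT REFERENCES Publisher(name));
CREATE TABLE Author (name TEXT, b_title TEXT REFERENCES Book(title), age INTEGER);
-- Made-up sample data.
INSERT INTO Publisher VALUES ('MIT Press', 'Cambridge'), ('Wiley', 'Hoboken');
INSERT INTO Book VALUES ('Book One', 'MIT Press'), ('Book Two', 'Wiley');
INSERT INTO Author VALUES ('Alice', 'Book One', 40), ('Bob', 'Book Two', 35);
""")

# /Publisher[address = 'Cambridge']//Author/name as a join along the foreign keys.
dtd_rows = conn.execute("""
    SELECT a.name
    FROM Publisher p, Book b, Author a
    WHERE p.address = 'Cambridge'
      AND b.p_name = p.name
      AND a.b_title = b.title
""").fetchall()
```

Because the schema mirrors the DTD, the ‘//’ step needs no special treatment here: the only way from Publisher to Author is through Book, so two equi-joins suffice.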

4.2 The Edge Approach

• The Basic Edge Approach

[FK99] proposed a simple approach to shredding schemaless XML data into relations. This approach is

based on edge-labeled XML data trees. In this approach, all edges in a data tree are stored in a single

relational table, Edge. The schema of this Edge table is shown in Figure 13. The key idea of this schema

is an attribute pair (Source, Target), which represents end points of edges. Attribute Label represents tags

on edges. Attributes Flag and Value give the type and value of target nodes of edges, respectively. As an

example, Figure 13 populates the Edge table with XML data shown in Figure 2 (b).

Figure 13: The Edge Table

Two edges A and B can be joined together if and only if A.Target = B.Source. Based on this property,

it is easy to transform XML twig queries without ‘//’ axes into SQL queries. The transformation method

is illustrated in Figure 14 with a twig query ‘/Publisher[address = ‘Cambridge’]/book/author/name’.

Execution of this SQL query comprises two steps. The first step is a candidate-edge finding step, which

retrieves data edges for each label in the twig query, as Part (1) in Figure 14 shows. We can see that a

clustered index pre-built on the Label attribute can significantly speed up the processing of this step. The

second step is an edge joining step, which joins adjacent edges as Part (2) in Figure 14 shows. The processing

of this step can be made more efficient by pre-building indexes on attributes (Source, Target).
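The two steps can be sketched with Python's built-in sqlite3 module. The Edge schema (Source, Target, Label, Flag, Value) follows the text; the node identifiers, the Flag values 'ref' and 'string', the sample rows, and the exact SQL are assumptions made for illustration. SQLite offers ordinary secondary indexes rather than a clustered index on Label, so the indexes below only approximate the ones discussed above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Edge (Source INT, Target INT, Label TEXT, Flag TEXT, Value TEXT);
CREATE INDEX edge_label ON Edge (Label);            -- helps step (1)
CREATE INDEX edge_ends  ON Edge (Source, Target);   -- helps step (2)
-- Made-up data: two publishers below a root node with id 0.
INSERT INTO Edge VALUES
  (0, 1, 'publisher', 'ref', NULL), (1, 2, 'address', 'string', 'Cambridge'),
  (1, 3, 'book', 'ref', NULL),      (3, 4, 'author', 'ref', NULL),
  (4, 5, 'name', 'string', 'Alice'),
  (0, 6, 'publisher', 'ref', NULL), (6, 7, 'address', 'string', 'Hoboken'),
  (6, 8, 'book', 'ref', NULL),      (8, 9, 'author', 'ref', NULL),
  (9, 10, 'name', 'string', 'Bob');
""")

edge_rows = conn.execute("""
    SELECT name.Value
    FROM Edge publisher, Edge address, Edge book, Edge author, Edge name
    WHERE publisher.Label = 'publisher' AND address.Label = 'address'  -- (1) candidate edges
      AND book.Label = 'book' AND author.Label = 'author' AND name.Label = 'name'
      AND publisher.Target = address.Source                            -- (2) edge joins
      AND publisher.Target = book.Source
      AND book.Target = author.Source
      AND author.Target = name.Source
      AND address.Value = 'Cambridge'
""").fetchall()
```

The query joins five instances of the same Edge table, one per query node, which is the cost the text attributes to this approach.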

• The Binary Approach

A weakness of the above Edge approach is that it involves multiple self-joins of the large Edge table. For

example, five Edge tables are joined in Figure 14, one table for each query node in the query twig. In order

to overcome this weakness, [FK99] also proposed a Binary approach, which is a variant of the basic Edge

approach, to avoid exploring the large Edge table. The key idea of this approach is grouping all edges with

the same label into one table, i.e. creating one table for each distinct label. Each label table has

the schema (Source, Target, Flag, Value), with the Label attribute being dropped from the Edge schema.

An example of a SQL query against this schema is shown in Figure 15. In this example, the candidate-edge

finding operations in Part (1) of Figure 14 are saved. In addition to improving query processing performance,

the Binary approach also saves storage space, since it does not store labels of edges. However, for large XML

documents with many distinct labels, the Binary approach will unavoidably result in a large number of

relational tables, which increases the management workload of the DBMS. In addition, we note that the basic

idea of this approach, clustering edges by their labels, is very similar to the idea of inverted lists that will

be introduced in Section 5.

Figure 14: The Edge approach: SQL for ‘/Publisher[address = ‘Cambridge’]/book/author/name’

Figure 15: The Binary approach: SQL for ‘/Publisher[address = ‘Cambridge’]/book/author/name’
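As a rough illustration of the Binary variant, the sketch below partitions a small edge list into one table per label, each with the schema (Source, Target, Flag, Value), and runs the same twig query without any label predicates. It uses Python's sqlite3 module; the sample edges, the Flag values, and the assumption that every label is a safe SQL identifier are hypothetical and chosen only for this example.

```python
import sqlite3

# (source, target, label, flag, value); made-up sample data.
edges = [
    (0, 1, "publisher", "ref", None), (1, 2, "address", "string", "Cambridge"),
    (1, 3, "book", "ref", None),      (3, 4, "author", "ref", None),
    (4, 5, "name", "string", "Alice"),
    (0, 6, "publisher", "ref", None), (6, 7, "address", "string", "Hoboken"),
    (6, 8, "book", "ref", None),      (8, 9, "author", "ref", None),
    (9, 10, "name", "string", "Bob"),
]

conn = sqlite3.connect(":memory:")
for source, target, label, flag, value in edges:
    # One table per distinct label; the Label column itself is dropped.
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {label} (Source INT, Target INT, Flag TEXT, Value TEXT)"
    )
    conn.execute(f"INSERT INTO {label} VALUES (?, ?, ?, ?)", (source, target, flag, value))

binary_rows = conn.execute("""
    SELECT name.Value
    FROM publisher, address, book, author, name
    WHERE publisher.Target = address.Source
      AND publisher.Target = book.Source
      AND book.Target = author.Source
      AND author.Target = name.Source
      AND address.Value = 'Cambridge'
""").fetchall()
```

The candidate-edge predicates on Label disappear because the choice of table already fixes the label, at the price of one table per distinct tag.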

On the whole, the Edge approach has two weaknesses. (1) It involves many join operations. The number of

joined Edge tables equals the number of query nodes in a twig query, so it fails to process large twig queries efficiently.

(2) Its biggest weakness is that it does not support twig queries with ‘//’ axes (e.g. ‘A//B’), since it does

not know how many tags and which tags are involved between tag A and tag B.

4.3 The Node Approach

As we introduced in Section 2, numbering schemes are essentially structural indexes, which help answer ‘//’

axis queries efficiently. [ZND+01] is the first paper that applied PrePost coding developed in [Die82] to XML

research. This paper contributed a Node approach to shredding schemaless XML data into relations. This

approach is based on node-labeled XML data trees. In this approach, all internal nodes (i.e. element nodes

and attribute nodes) in a data tree are stored in a relational table, Node. The schema of this Node table is

shown in Figure 16. The key idea of this schema is an attribute triple (Start, End, Level), which replaces

the attribute pair (Source, Target) in the Edge schema. ‘//’ axis queries can be answered efficiently through

using (start, end) numbers of nodes. Level is used with (start, end) together to answer ‘/’ axis queries. As

an example, Figure 16 populates the Node table with XML data shown in Figure 2 (a).

Figure 16: The Node Table

Based on Property 1 and Property 2 in Section 2, it is easy to transform XML queries with both ‘/’

and ‘//’ axes into SQL queries. The transformation method is illustrated in Figure 17 with a twig query

‘/Publisher[address = ‘Cambridge’]//author/name’. Similar to the Edge approach, execution of this SQL

query comprises two steps, candidate-node finding (Part (1)) and node joining (Part (2)). The difference is

that in the second step, the Node approach joins nodes using (Start, End, Level) attributes. Just as in the

Edge approach, Part (1) can be saved in the Node approach if a variant similar to the Binary approach is

used.

Figure 17: The Node approach: SQL query for ‘/Publisher[address = ‘Cambridge’]//author/name’
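The node-joining step can be sketched with Python's built-in sqlite3 module. The sketch assumes the usual containment test on (start, end) numbers, namely that A is an ancestor of B when A.start < B.start and B.end < A.end, with an additional level difference of one for parent-child. The columns are named StartPos and EndPos because END is a reserved word in SQL; the sample rows, the Value column, and the exact SQL are illustrative assumptions and are not the query of Figure 17.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Node (Label TEXT, StartPos INT, EndPos INT, Level INT, Value TEXT);
-- Made-up data: two publisher subtrees with nested (start, end) regions.
INSERT INTO Node VALUES
  ('publisher', 1, 10, 1, NULL), ('address', 2, 3, 2, 'Cambridge'),
  ('book', 4, 9, 2, NULL),       ('author', 5, 8, 3, NULL),
  ('name', 6, 7, 4, 'Alice'),
  ('publisher', 11, 20, 1, NULL), ('address', 12, 13, 2, 'Hoboken'),
  ('book', 14, 19, 2, NULL),      ('author', 15, 18, 3, NULL),
  ('name', 16, 17, 4, 'Bob');
""")

node_rows = conn.execute("""
    SELECT name.Value
    FROM Node publisher, Node address, Node author, Node name
    WHERE publisher.Label = 'publisher' AND address.Label = 'address'  -- (1) candidate nodes
      AND author.Label = 'author' AND name.Label = 'name'
      -- (2) publisher/address: containment plus a level difference of one
      AND publisher.StartPos < address.StartPos AND address.EndPos < publisher.EndPos
      AND address.Level = publisher.Level + 1
      -- publisher//author: containment only
      AND publisher.StartPos < author.StartPos AND author.EndPos < publisher.EndPos
      -- author/name: containment plus a level difference of one
      AND author.StartPos < name.StartPos AND name.EndPos < author.EndPos
      AND name.Level = author.Level + 1
      AND address.Value = 'Cambridge'
""").fetchall()
```

No book table instance is needed for the ‘//’ step, but every join predicate is now an inequality rather than an equality.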

The Node approach overcomes the weakness of the Edge approach, which does not support ‘//’ axis

queries. However, similar to the Edge approach, it involves many join operations. Specifically, the number

of joins is just the number of query nodes in a twig query, which results in inefficient query processing of

large twig queries.

4.4 The Path Materialization Approach

• The Basic PM Approach

In order to reduce the number of node joins, [YASU01] proposed a Path Materialization (PM) approach

to shredding schemaless XML data into a relational table, Path. The schema of this Path table is shown in

Figure 18. It is very similar to the Node table. The difference is that rather than storing the tag of each

node in the Label attribute, the PM approach stores the tag path from the root to each node (called root

path) in a new attribute Path.

Figure 18: The Path Table

Through the Path attribute, the PM approach can answer twig queries efficiently in units of paths rather

than in units of single edges. Specifically, given a twig query, the PM approach first decomposes it into

multiple root-to-leaf path queries, as the TwigStack approach in Section 3.1.3 does, and then joins the results

of these path queries together. Figure 19 illustrates how to use a SQL query to answer a twig query

‘/Publisher[address = ‘Cambridge’]/book/author/name’. Part (1) is the twig decomposition step, which

uses the value of root paths of leaf nodes (address, name) and branching nodes (publisher) in the query twig

to retrieve their corresponding data nodes in the data tree. Part (2) is the path joining step, which joins data

nodes retrieved from Part (1) through their (start, end) numbers.

Figure 19: The Basic PM approach: SQL for ‘/Publisher[address = ‘Cambridge’]/book/author/name’

The PM approach has two advantages. (1) It involves fewer join operations in Part (2) than the Node

approach, since it answers twig queries in units of paths rather than in units of single edges. For example,

for the twig query in Figure 19, the Node approach needs to join five Node tables but the PM approach

needs to join only three Path tables. Therefore, the PM approach generally has higher query processing

performance. (2) The PM approach can also support ‘//’ axis queries, as the Node approach does,

by using the Optional String Pattern Matching (OSPM) function (‘LIKE’) provided by SQL. For example,

in order to answer a query ‘/Publisher[address = ‘Cambridge’]//name’, we only need to replace

“name.Path = ‘/Publisher/book/author/name’” in the where clause in Figure 19 with “name.Path LIKE

‘/Publisher/%/name’”.

However, we can also observe that although the number of join operations in Part (2) is reduced, it is at

the expense of increasing the complexity of selection operations in Part (1). As we know, SQL supports Exact

String Matching (‘=’) efficiently through pre-building a B+-tree index on string attributes, but B+-tree indexes do

not support optional string pattern matching (‘LIKE’) efficiently due to the inherent structure of B+-trees.

In order to find patterns with multiple ‘%’ symbols, a large number of irrelevant strings in tables might have

to be checked exhaustively. Therefore, the PM approach does not support ‘//’ axis queries efficiently when

there are multiple ‘//’ axes in queries (e.g. //A//B/C//D).
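The difference between the exact-match and the LIKE-based selection can be sketched with Python's built-in sqlite3 module. The Path table below stores root paths together with (start, end) numbers; the column names StartPos, EndPos, and Value, the lowercase tags, the sample rows, and the helper `run_pm_query` are assumptions made for illustration and are not the query of Figure 19.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Path (Path TEXT, StartPos INT, EndPos INT, Value TEXT);
CREATE INDEX path_idx ON Path (Path);
-- Made-up data: two publisher subtrees.
INSERT INTO Path VALUES
  ('/publisher', 1, 10, NULL), ('/publisher/address', 2, 3, 'Cambridge'),
  ('/publisher/book', 4, 9, NULL), ('/publisher/book/author', 5, 8, NULL),
  ('/publisher/book/author/name', 6, 7, 'Alice'),
  ('/publisher', 11, 20, NULL), ('/publisher/address', 12, 13, 'Hoboken'),
  ('/publisher/book', 14, 19, NULL), ('/publisher/book/author', 15, 18, NULL),
  ('/publisher/book/author/name', 16, 17, 'Bob');
""")

def run_pm_query(name_predicate):
    """Join three Path instances: the branching node and the two leaf nodes."""
    sql = f"""
        SELECT name.Value
        FROM Path publisher, Path address, Path name
        WHERE publisher.Path = '/publisher'              -- (1) path selection
          AND address.Path = '/publisher/address'
          AND {name_predicate}
          AND publisher.StartPos < address.StartPos      -- (2) path joining
          AND address.EndPos < publisher.EndPos
          AND publisher.StartPos < name.StartPos
          AND name.EndPos < publisher.EndPos
          AND address.Value = 'Cambridge'
    """
    return conn.execute(sql).fetchall()

exact = run_pm_query("name.Path = '/publisher/book/author/name'")
wildcard = run_pm_query("name.Path LIKE '/publisher/%/name'")
```

Both variants join only three table instances. The exact-match predicate can be served by the index on Path, whereas a pattern with a ‘%’ in the middle generally forces more strings to be inspected.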

• The RP Approach

[PCS+04] proposed a Reversed Path (RP) approach to overcome the weakness of the PM approach discussed

above. This approach uses a schema shown in Figure 20. Its key idea is storing reversed root paths of data

nodes in a new attribute ReversedPath. In addition, the RP approach uses an ORDPATH attribute to

replace the (start, end) attribute pair in the PM approach. ORDPATH coding is a variant of Dewey coding

we mentioned in Section 2. It can be used to determine ancestor-descendant/parent-child relationships

between nodes as PrePost coding does [OOP+04]. Here we simply ignore the difference between ORDPATH

numbers and (Start, End) numbers, and concentrate our discussion on the ReversedPath attribute.

Figure 20: The ReversedPath Table

Figure 21 shows an example of how the RP approach answers twig queries with multiple ‘//’ axes. The

first step is still twig decomposition, which decomposes the query twig into three paths. However, Path (3)

involves three ‘//’ axes. In the PM approach, we have to use ‘%A/B/C%E/F%G’ as a search pattern on the

Path attribute to retrieve corresponding data nodes. As we analyzed earlier, this is not efficient. Therefore,

the RP approach continues to decompose Path (3) into Path (4) and Path (5), both of which include only one

‘//’ axis, at the very beginning. So we can use ‘/F/E%’ and ‘/G%’ as search patterns on the ReversedPath

attribute to retrieve data nodes of Path (4) and Path (5), respectively. The task here is thus just finding

strings with a specified prefix, which can be implemented more efficiently than the general Optional String

Pattern Matching task with multiple ‘%’ symbols. Finally, similar to the PM approach, the RP approach

has a path joining step, which joins results of path queries together through the ORDPATH attribute.

Figure 21: The RP approach
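Why a leading ‘//’ becomes a prefix search after reversal can be sketched in a few lines of Python. The helpers `reverse_path` and `prefix_range`, the sample root paths, and the sentinel character are assumptions made for this illustration. The sketch also appends a trailing ‘/’ to every reversed path, which is an addition to the patterns in the text, so that the prefix ‘/F/E/’ cannot match a tag that merely begins with E.

```python
import bisect

def reverse_path(path):
    """'/A/B/C' -> '/C/B/A/' (reversed root path with a trailing slash)."""
    return "/" + "/".join(reversed(path.strip("/").split("/"))) + "/"

def prefix_range(sorted_keys, prefix):
    """Return all keys starting with prefix using two binary searches."""
    lo = bisect.bisect_left(sorted_keys, prefix)
    # '\uffff' is assumed to sort above every character used in tag names.
    hi = bisect.bisect_left(sorted_keys, prefix + "\uffff")
    return sorted_keys[lo:hi]

# Hypothetical root paths of data nodes, stored reversed and sorted,
# as a B+-tree index on the ReversedPath attribute would keep them.
root_paths = ["/A/B/C/E/F", "/A/E/F", "/A/B/G", "/X/F/E"]
reversed_keys = sorted(reverse_path(p) for p in root_paths)

# The path query //E/F matches exactly the reversed paths with prefix '/F/E/'.
matches = prefix_range(reversed_keys, reverse_path("/E/F"))
```

All matches lie in one contiguous slice of the sorted keys, which is what makes the prefix search cheaper than a pattern with several ‘%’ symbols.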

• The BLAS Approach

As we saw above, the RP approach in [PCS+04] has simplified the task of general Optional String Pattern

Matching to an easier task of String Prefix Matching (SPM). However, [PCS+04] did not provide any details

on how to efficiently implement SPM. They appear to simply push the SPM task down to the

SQL engine. In contrast, another work [CDZ04] not only introduced the RP approach independently from

[PCS+04] but also developed a very intelligent method named BLAS (Bi-LAbeling System) to implement

SPM efficiently. The key idea of BLAS is encoding each ReversedPath string into a number, PLabel. This

encoding method is illustrated in Figure 22.

Figure 22: How to compute PLabel(‘/p2/p3/p1/p4’)

In this example, we assume that there is a total of four distinct tag names in some XML document, p1

through p4. At the first level, these four tags divide the reserved number space [0, 1024) into four equal-length

segments, each with length 1024/4 = 256. In the same way, at the second level, four tags divide each segment

at the first level into four equal-length segments, each with length 256/4 = 64, and so forth. So we

have

PLabel(‘/p2/p3/p1/p4’) = 256 ∗ (2 − 1) + 64 ∗ (3 − 1) + 16 ∗ (1 − 1) + 4 ∗ (4 − 1) = 396

In the same way, we can also get

PLabel(‘/p4/p2/p3’) = 256 ∗ (4 − 1) + 64 ∗ (2 − 1) + 16 ∗ (3 − 1) = 864

A very nice property of PLabel is that all strings with a common prefix cluster in one contiguous numeric range.

For example, all ReversedPath strings with prefix ‘/p2/p3’ cluster together. So if we pre-build a clustered

B+-tree index on the PLabel attribute of the ReversedPath table, then all reversed paths with the specified

prefix can be retrieved very efficiently using a SQL range query. For example, in order to retrieve all

reversed paths with prefix ‘/p2/p3/’, BLAS first computes lower bound = PLabel(‘/p2/p3/’) = 384 and

upper bound = PLabel(‘/p2/p4/’) = 448. Then a SQL range query is issued to retrieve all reversed paths

with PLabel within [384, 448).
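The arithmetic above can be checked with a short Python sketch. It implements only the simplified scheme of this example (four tags, number space [0, 1024), each level splitting its segment into four equal parts); the function names `plabel` and `plabel_range` are hypothetical, and the actual encoding in [CDZ04] may differ in detail.

```python
def plabel(path, tags=("p1", "p2", "p3", "p4"), space=1024):
    """Simplified PLabel: each level splits its segment into len(tags) equal parts."""
    width = space
    label = 0
    for tag in path.strip("/").split("/"):
        width //= len(tags)               # segment length: 256, 64, 16, 4, ...
        label += width * tags.index(tag)  # 0-based index, i.e. (i - 1) for tag p_i
    return label

def plabel_range(prefix, tags=("p1", "p2", "p3", "p4"), space=1024):
    """Half-open PLabel interval covering every reversed path with this prefix."""
    depth = len(prefix.strip("/").split("/"))
    lower = plabel(prefix, tags, space)
    return lower, lower + space // len(tags) ** depth

label_a = plabel("/p2/p3/p1/p4")    # 256*1 + 64*2 + 16*0 + 4*3
label_b = plabel("/p4/p2/p3")       # 256*3 + 64*1 + 16*2
bounds = plabel_range("/p2/p3/")    # segment of width 64 starting at 384
```

With a number space of 1024 and four tags, the segment width reaches 1 after five levels, so this toy setting cannot distinguish deeper paths; a larger number space would be needed for them.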

4.5 Summary

In this section, we saw that XML data can be simply loaded into relational databases and XML twig queries

over XML data can also be easily transformed into SQL queries over relational data. In the relational

approach, all query processing work is pushed into the relational query optimizer and no extra processing work

is needed.

When XML data conform to a schema such as DTD, the DTD approach introduced in Section 4.1

provides better query processing performance than other approaches introduced in Sections 4.2 through

4.4. The reason is that the DTD approach generates different relational schemas for different DTDs. Each

generated relational schema is tailored for a specific DTD and so precisely captures the structure of XML

data conforming to that DTD schema. In contrast, approaches in Sections 4.2 through 4.4 generate the same

relational schema (tables Edge, Node, Path, etc.) for various XML data despite their different structures,

and so fail to process a specific target data set efficiently. The experimental work in [TDCZ02] also verifies

this point.

When XML data is schemaless (i.e. a DTD for it is not available), the PM approach is the best compared

with the Edge approach and the Node approach, since (1) it supports ‘//’ axis queries and (2) it needs fewer

join operations. Further, among the three variations of the PM approach (Basic PM, RP, BLAS), the RP

approach with the BLAS extension is the best. In fact, the basic RP approach has been integrated into

Microsoft SQL Server 2005 [OOP+04, PCS+04]. Interestingly, [OOP+04, PCS+04] do not mention the work

of BLAS [CDZ04]. We propose to extend the basic RP approach in [OOP+04, PCS+04] with the PLabelling

method in BLAS to gain the best query processing performance.

5 XML Query Processing: the Native Approach

Although the relational approach is simple and feasible, it could have inferior query performance. In order

to answer ‘//’ axis queries, the Node approach and the PM approach use θ-joins 2 to implement node/path-

joining step (see Part (2) in Figures 17 and 19), discarding equi-joins used in the Edge approach (see Part

(2) in Figure 14). θ-joins are more complex and costly than equi-joins. Although current DBMSs have been

coupled with efficient techniques to optimize and process equi-joins, they do not support θ-joins efficiently,

particularly when multiple comparison predicates are involved in queries. Some experimental work has

verified this point [ZND+01].

Much research has been done on developing native algorithms to efficiently process θ-joins involved

in XML twig queries. We say these techniques are in the native approach since their storage and query

mechanisms are developed from scratch, without involving relational databases. The authors of these native

techniques believe that a special storage and query system tailored for XML data will improve XML query

processing performance significantly. In the native approach, θ-joins are also called structural joins.

Specifically, in the native approach, XML data are stored in inverted lists. Inverted indexes have been

widely used in Information Retrieval to implement efficient text search [SM83]. An inverted index creates one

list for each distinct word in text documents; the list gives positions of all occurrences of this word. These

lists are called inverted lists. Borrowing this idea, the native approach creates one inverted list for each

distinct tag in XML documents; the list gives positions of all elements with that tag name. The location of an

element is expressed using its (start, end, level) numbers. Locations in a list are sorted in the increasing

order of their start numbers. Figure 23 shows inverted lists of the XML document in Figure 2 (a).

Figure 23: Inverted lists
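Building such inverted lists can be sketched with Python's standard XML parser. The sketch assumes a simplified numbering in which a single counter is incremented at every start tag and every end tag (text content is not counted), with the root element at level 1; the function name and the tiny sample document are hypothetical.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_inverted_lists(xml_text):
    """Map each tag to a list of (start, end, level) triples sorted by start."""
    lists = defaultdict(list)
    counter = 0
    open_starts = []  # start numbers of the currently open elements
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        counter += 1
        if event == "start":
            open_starts.append(counter)
        else:
            start = open_starts.pop()
            level = len(open_starts) + 1  # number of ancestors plus one
            lists[elem.tag].append((start, counter, level))
    for positions in lists.values():
        positions.sort()  # increasing order of start numbers
    return dict(lists)

inverted = build_inverted_lists("<a><b/><a><b/></a></a>")
```

For this document the list for tag a holds the outer and the inner a element, and the list for tag b holds the two b elements, each ordered by start number.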

5.1 The MPMGJN Approach

[ZND+01] proposed an MPMGJN (Multi-Predicate MerGe JoiN) algorithm, which is the first native approach

to implementing structural joins. Its implementation is somewhat similar to that of the standard Merge Join

algorithm developed in relational query optimizers for equi-joins. In order to answer a query ‘A//B’ or

‘A/B’, two cursors are created on AList and BList that have been sorted in the increasing order of start

numbers. Initially, these two cursors are pointing to the heads of AList and BList, respectively. Then, they

are compared with each other and advanced in turn to implement the merge join.

2 θ-joins are joins involving ‘>’ and ‘<’ comparisons, while equi-joins involve only ‘=’ comparisons.

In contrast to the standard merge-join implementation for equi-joins, MPMGJN has its own cursor-

advancing mechanism, which is specially tailored to efficiently support structural joins. Specifically, at each

step, it compares and advances two cursors as Figure 24 describes. The working process of MPMGJN is

also illustrated in Figure 25. Note that dotted edges in Figure 25 (a) mean there might be other data nodes

than A-tagged or B-tagged nodes on those edges although we show only A-tagged and B-tagged nodes in

this data tree. Experimental work in [ZND+01] found that the MPMGJN algorithm is more than an order of

magnitude faster than RDBMS join implementations in most query cases.

Figure 24: The core of the MPMGJN algorithm
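As a rough illustration of a merge-style structural join over two inverted lists, consider the following Python sketch. It is a simplified reading of the idea, written under the assumption that both lists hold (start, end, level) triples sorted by start number; it is not a transcription of the algorithm in Figure 24, and the function name and sample lists are hypothetical.

```python
def merge_structural_join(a_list, b_list, parent_child=False):
    """Join A and B nodes on containment; optionally require parent-child."""
    results = []
    mark = 0  # first B entry that can still join with the current or a later A
    for a in a_list:
        a_start, a_end, a_level = a
        # B nodes starting before this A cannot lie inside it or inside any later A.
        while mark < len(b_list) and b_list[mark][0] < a_start:
            mark += 1
        # Scan every B that starts inside A's region; the next A rescans from mark.
        j = mark
        while j < len(b_list) and b_list[j][0] < a_end:
            b_start, b_end, b_level = b_list[j]
            if b_end < a_end and (not parent_child or b_level == a_level + 1):
                results.append((a, b_list[j]))
            j += 1
    return results

a_nodes = [(1, 8, 1), (4, 7, 2)]
b_nodes = [(2, 3, 2), (5, 6, 3)]
descendants = merge_structural_join(a_nodes, b_nodes)
children = merge_structural_join(a_nodes, b_nodes, parent_child=True)
```

In the parent-child case the inner scan still visits every B node inside the region of an A node, including descendants that are not children, and merely filters them out by level.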

5.2 The StackTree Approach

[AKJK+02] observed that although the MPMGJN approach is efficient for ‘//’ axis queries, it fails to process

‘/’ axis queries efficiently in some cases. A motivating example is shown in Figure 26 (a). In this example,

a1 has only two B children, b1 and b6. However, we can see from Figure 26 (b) that MPMGJN finds the

child b6 only after it has scanned b1 through b5, where b2 through b5, which are indirect descendants but

not children of a1, have to be visited unnecessarily.

In order to avoid such unnecessary node scanning, [AKJK+02] proposed a new approach, StackTree.

StackTree uses a nice stack structure to cache A nodes nested on the same path in data trees. Figure 27

shows the core of the StackTree algorithm. At each step, the data node with the smallest start number is

taken out of its list. If it is an A-tagged node, it is pushed into the stack. If it is a B-tagged node, StackTree

tries to use it to form tuple solutions with existing A-tagged nodes in the stack. Figure 26 (c) illustrates its

working process. From this example, we can see that there are no redundant comparisons of b2 through b5

with a1. Therefore, StackTree has better query processing performance than MPMGJN.

Figure 25: The MPMGJN approach (For the query ‘A//B’)

Figure 26: The StackTree approach (For the query ‘A/B’)

Both StackTree and MPMGJN are binary join algorithms, i.e. they join only a pair of inverted lists (or

only one edge in the query twig). Since a complete twig query consists of a series of binary joins, the problem

of join order selection has to be considered seriously. Just as in the context of relational databases, join order

significantly affects XML query processing performance. As we know, most relational query optimizers use

a classical dynamic programming method to select an optimal join order. [WPJ03] also proposed similar

dynamic programming methods to select an optimal or sub-optimal order of binary structural joins for XML

twig queries. The StackTree binary join algorithm and the corresponding dynamic-programming-based join

order selection algorithm have been integrated into Timber [JAKC+02], a famous native XML database

prototype from the University of Michigan.

Figure 27: The core of the StackTree algorithm

Figure 28: The core of the PathStack algorithm

Note also that the StackTree algorithm in Figure 27 outputs all tuple solutions in increasing order of the
start numbers of descendant nodes (i.e. B-tagged nodes). For example, the six tuple solutions in Figure 26 (c)
are output in the order of b1 through b6. [AKJK+02] also proposed a variant of the StackTree algorithm that
outputs tuple solutions in increasing order of the start numbers of ancestor nodes (i.e. A-tagged nodes).
This is very important for twig queries. Consider the twig query ‘C//A//B’. If we select the query plan
C ⋈ (A ⋈ B), then the results of A ⋈ B have to be sorted by A nodes, since the next binary join occurs
between C and A.


In addition, [CVZ+02] extends the StackTree algorithm with a skip technique, so that nodes in the inverted
lists that are predicted not to form any tuple solutions with other nodes do not need to be visited during
the join process.
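To make the idea of skipping concrete, the following Python sketch adds one simple skip rule to a stack-based structural join for ‘A//B’. It is only an illustration under stated assumptions and is not the method of [CVZ+02]: the index structure is replaced by a binary search over an in-memory list of start numbers, only B nodes are skipped, and the names `Node` and `stack_tree_join_with_skip` are hypothetical. The rule is that when no A node is open, every B node that starts before the next A node cannot have an A ancestor and can be jumped over.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Tuple


class Node(NamedTuple):
    name: str   # display label only (hypothetical)
    start: int  # start number of the element
    end: int    # end number of the element


def stack_tree_join_with_skip(
    a_list: List[Node], b_list: List[Node]
) -> Tuple[List[Tuple[Node, Node]], int]:
    """Join A//B and also return how many B nodes were never inspected.

    Both lists are assumed to be sorted by start number.
    """
    b_starts = [b.start for b in b_list]  # search keys standing in for an index
    results: List[Tuple[Node, Node]] = []
    stack: List[Node] = []
    skipped = 0
    i = j = 0
    while j < len(b_list):
        if i < len(a_list) and a_list[i].start < b_list[j].start:
            a = a_list[i]
            while stack and stack[-1].end < a.start:
                stack.pop()
            stack.append(a)
            i += 1
            continue
        b = b_list[j]
        while stack and stack[-1].end < b.start:
            stack.pop()
        if stack:
            results.extend((a, b) for a in stack)
            j += 1
        elif i == len(a_list):
            # No open A node and no A node left: the remaining B nodes cannot match.
            skipped += len(b_list) - j - 1
            break
        else:
            # No open A node: jump to the first B node that starts after the next A node.
            nxt = bisect_right(b_starts, a_list[i].start, lo=j)
            skipped += nxt - j - 1
            j = nxt
    return results, skipped


# Invented sample data: b3 and b4 lie between a1 and a2 and are never inspected.
a_nodes = [Node("a1", 1, 4), Node("a2", 50, 60)]
b_nodes = [Node("b1", 2, 3), Node("b2", 10, 11), Node("b3", 12, 13),
           Node("b4", 20, 21), Node("b5", 55, 56)]
pairs, skipped = stack_tree_join_with_skip(a_nodes, b_nodes)
print([(a.name, b.name) for a, b in pairs], skipped)
```

The binary search is the part that an index over the inverted list would serve in a disk-based setting; the join logic itself is unchanged.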

5.3 The Holistic Approach

A weakness of decomposing twig queries into multiple binary joins is that this method generates a large
number of intermediate query results. For example, for a query plan (A ⋈ B) ⋈ C, the result of the first
join A ⋈ B has to be written to disk if it is too large to fit in memory, and then read back into memory to
be joined with C after A ⋈ B has completed. This results in high disk I/O cost. In order to overcome this
weakness, [BKS02] proposed the Holistic approach, which is essentially a pipelined join: multiple inverted
lists are joined at one time, so that no intermediate query results are generated.

Figure 28 shows the core of the PathStack algorithm, which uses the Holistic approach to answer simple
path queries. This algorithm is structurally very similar to the StackTree algorithm in Figure 27. The
difference is that StackTree uses only one stack to cache nested A nodes, whereas PathStack has multiple
stacks, one for each non-leaf node in a path query, since the inverted lists of all nodes in the path query
are involved in the pipelined join. In addition, each node cached in a stack has an associated pointer to a
corresponding node in its parent stack, which is used to track tuple solutions.
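The multi-stack structure and the parent pointers described above can be sketched in Python as follows. This is a simplified illustration, not the pseudocode of [BKS02]: it supports only path queries whose edges are all ‘//’ and whose tags are distinct, it does not push a node whose parent stack is empty, and the names `Node`, `path_stack` and `expand` are hypothetical. The sample data is invented.

```python
from typing import Iterator, List, NamedTuple, Tuple


class Node(NamedTuple):
    name: str   # display label only (hypothetical)
    start: int  # start number of the element
    end: int    # end number of the element


def path_stack(streams: List[List[Node]]) -> List[Tuple[Node, ...]]:
    """Answer the path query q1//q2//...//qn (n >= 1) in a single pass.

    streams[i] is the inverted list of the i-th query node, sorted by start number.
    """
    n = len(streams)
    leaf = n - 1
    pos = [0] * n
    # One stack per non-leaf query node. Each entry is (node, parent_top), where
    # parent_top is the index of the top of the parent stack at push time.
    stacks: List[List[Tuple[Node, int]]] = [[] for _ in range(leaf)]
    results: List[Tuple[Node, ...]] = []

    def expand(level: int, top: int) -> Iterator[Tuple[Node, ...]]:
        """Yield all ancestor chains that end in stacks[level][0..top]."""
        if level < 0:
            yield ()
            return
        for k in range(top + 1):
            node, parent_top = stacks[level][k]
            for prefix in expand(level - 1, parent_top):
                yield prefix + (node,)

    while pos[leaf] < len(streams[leaf]):
        # Pick the query node whose next element has the smallest start number.
        q = min((i for i in range(n) if pos[i] < len(streams[i])),
                key=lambda i: streams[i][pos[i]].start)
        node = streams[q][pos[q]]
        pos[q] += 1
        # Pop, from every stack, the nodes that end before the new node starts.
        for stack in stacks:
            while stack and stack[-1][0].end < node.start:
                stack.pop()
        parent_top = len(stacks[q - 1]) - 1 if q > 0 else -1
        if q == leaf:
            # Follow the pointers to enumerate every solution ending at this leaf.
            for prefix in expand(q - 1, parent_top):
                results.append(prefix + (node,))
        elif q == 0 or parent_top >= 0:
            stacks[q].append((node, parent_top))
    return results


# Invented sample data for the query A//B//C.
a_nodes = [Node("a1", 1, 20), Node("a2", 3, 14)]
b_nodes = [Node("b1", 2, 15), Node("b2", 4, 9)]
c_nodes = [Node("c1", 5, 6), Node("c2", 10, 11), Node("c3", 16, 17)]
for solution in path_stack([a_nodes, b_nodes, c_nodes]):
    print([node.name for node in solution])
```

The pointer stored with each entry bounds how far down the parent stack a solution may reach, which is what prevents, for instance, b1 from being paired with a2 in the sample data even though both are on their stacks when c1 arrives.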

Recall that PathStack was also used as a file approach to answering path queries over XML documents
(Section 3.1.2). Here, Figure 29 illustrates how to use PathStack as a native approach to answering path
queries over inverted lists. This figure is very similar to Figure 8. The differences are as follows.
(1) In the native approach, only nodes whose tags appear in the query are read. Therefore, irrelevant nodes
in XML documents, such as node d1 in Figure 8, are not read. (2) In the native approach, a node is popped
from its stack when another node arrives whose start number is higher than the popped node's end number,
rather than when its own closing tag arrives as in the file approach.

Figure 29: The Holistic approach (For the query ‘A//B/C’)

Similarly, the Holistic approach also provides a TwigStack algorithm to answer general twig queries. The

main idea of TwigStack has been illustrated in Figure 9.


[BKS02] also experimentally compared the Holistic approach with the StackTree approach. Their experimental
results show that the Holistic approach is generally more than six times faster in query processing than the
StackTree approach coupled with the optimal join order. Owing to its high query processing performance and
algorithmic simplicity, the Holistic approach has been used extensively in recent research work. For example,
[JWLY03] extended it with a skip technique to avoid visiting nodes in inverted lists that do not form any
tuple solutions with other nodes. [JLW04] extended Holistic to process twig queries with OR predicates.
[BGKS03] applied Holistic to multi-query processing.

6 Conclusions

In this survey we reviewed major techniques for processing XML twig queries. These techniques are
categorized into three classes based on the storage format of XML data.

The file approach is mainly used for special-purpose applications in which XML data must be stored in
commonly used flat files, as the original XML documents. Since no indexes are available in such applications,
the entire XML document, including a large volume of elements irrelevant to the specified query, has to be
visited, which usually results in poor query processing performance.

In the relational approach, XML data can simply be loaded into relational databases, and XML twig queries
over XML data can easily be transformed into SQL queries over relational data. In this approach, all
query processing work is pushed into relational query optimizers and no extra processing is needed.
However, current RDBMSs do not support θ-joins efficiently, even though θ-joins are a necessary component
for answering ‘//’-axis XML queries efficiently. Among relational approaches, the RP approach with the BLAS
extension has the best performance for querying schemaless XML data.

The native approach develops native algorithms to efficiently process the θ-joins involved in XML twig
queries, which are essentially structural joins of inverted lists. In this approach, many important
components that already exist in RDBMSs, such as storage management, access methods, query processing and
optimization, concurrency control and recovery, have to be rebuilt from scratch. Among native approaches,
the Holistic approach shows the best query processing performance in experiments.

As [ZND+01] implies, a good approach is to integrate native θ-join algorithms for XML twig queries into
existing relational query optimizers, so that the extended optimizers can process XML twig queries more
efficiently. In this integration approach, the existing RDBMS components other than the query optimizer,
such as concurrency control and recovery, can also be fully reused, which significantly reduces development
effort. Therefore, this integration approach offers the best trade-off between XML query processing
performance and development effort.

References

[ABJ89] Rakesh Agrawal, Alexander Borgida, and H. V. Jagadish. Management of transitive relationships in large data and knowledge bases. SIGMOD Conference, 1989.

[AF00] Mehmet Altinel and Michael J. Franklin. Efficient filtering of XML documents for selective dissemination of information. VLDB Conference, 2000.

[AKJK+02] Shurug Al-Khalifa, H. V. Jagadish, Nick Koudas, Jignesh M. Patel, Divesh Srivastava, and Yuqing Wu. Structural joins: a primitive for efficient XML query pattern matching. ICDE Conference, 2002.

[BGKS03] Nicolas Bruno, Luis Gravano, Nick Koudas, and Divesh Srivastava. Navigation- vs. index-based XML multi-query processing. ICDE Conference, 2003.

[BKS02] N. Bruno, N. Koudas, and D. Srivastava. Holistic twig joins: optimal XML pattern matching. SIGMOD Conference, 2002.

[CDZ04] Yi Chen, Susan B. Davidson, and Yifeng Zheng. BLAS: An efficient XPath processing system. SIGMOD Conference, 2004.

[Cha02] D. Chamberlin. XQuery: an XML query language. 41 (4), 2002.

[CHKZ03] Edith Cohen, Eran Halperin, Haim Kaplan, and Uri Zwick. Reachability and distance queries via 2-Hop labels. SIAM Journal on Computing, 32:1338–1355, 2003.

[CVZ+02] Shu-Yao Chien, Zografoula Vagena, Donghui Zhang, Vassilis J. Tsotras, and Carlo Zaniolo. Efficient structural joins on indexed XML documents. VLDB Conference, 2002.

[DAF+03] Yanlei Diao, Mehmet Altinel, Michael J. Franklin, Hao Zhang, and Peter M. Fischer. Path sharing and predicate evaluation for high-performance XML filtering. ACM Transactions on Database Systems (TODS), 28:467–516, 2003.

[DF03] Yanlei Diao and Michael J. Franklin. High-performance XML filtering: an overview of YFilter. IEEE Data Engineering Bulletin, 26:41–48, 2003.

[DFFT02] Yanlei Diao, Peter M. Fischer, Michael J. Franklin, and Raymond To. YFilter: Efficient and scalable filtering of XML documents. ICDE Conference, 2002.

[Die82] Paul F. Dietz. Maintaining order in a linked list. ACM Symposium on Theory of Computing, 1982.

[FK99] Daniela Florescu and Donald Kossmann. Storing and querying XML data using an RDMBS. IEEE Data Engineering Bulletin, 22:27–34, 1999.

[Gro04a] W3C Group. Extensible Markup Language (XML). http://www.w3.org/XML/, 2004.

[Gro04b] W3C Group. Guide to the W3C XML specification (XMLspec) DTD, version 2.1. http://www.w3.org/XML/1998/06/xmlspec-report.htm, 2004.

[Gro04c] W3C Group. XML path language (XPath) 2.0. http://www.w3.org/TR/xpath20/, 2004.

[Gro04d] W3C Group. XML Schema. http://www.w3.org/XML/Schema, 2004.

[Gro04e] W3C Group. XQuery 1.0: an XML query language. http://www.w3.org/TR/xquery/, 2004.

[Gru02] Torsten Grust. Accelerating XPath location steps. SIGMOD Conference, 2002.

[GvKT04] Torsten Grust, Maurice van Keulen, and Jens Teubner. Accelerating XPath evaluation in any RDBMS. ACM Transactions on Database Systems (TODS), 29:91–131, 2004.

[HBG+03] Alan Halverson, Josef Burger, Leonidas Galanis, Ameet Kini, Rajasekar Krishnamurthy, Ajith Nagaraja Rao, Feng Tian, Stratis Viglas, Yuan Wang, Jeffrey F. Naughton, and David J. DeWitt. Mixed mode XML query processing. VLDB Conference, 2003.

[JAKC+02] H. V. Jagadish, Shurug Al-Khalifa, Adriane Chapman, Laks V. S. Lakshmanan, Andrew Nierman, Stelios Paparizos, Jignesh M. Patel, Divesh Srivastava, Nuwee Wiwatwattana, Yuqing Wu, and Cong Yu. TIMBER: A native XML database. VLDB Journal, 11:274–291, 2002.

[JLW04] Haifeng Jiang, Hongjun Lu, and Wei Wang. Efficient processing of twig queries with OR-predicates. SIGMOD Conference, 2004.

[JWLY03] Haifeng Jiang, Wei Wang, Hongjun Lu, and Jeffrey Xu Yu. Holistic twig joins on indexed XML documents. VLDB Conference, 2003.

[MW99] Jason McHugh and Jennifer Widom. Query optimization for XML. VLDB Conference, 1999.

[OCL04] OCLC. Dewey decimal classification. http://www.oclc.org/dewey/, 2004.

[OOP+04] Patrick E. O’Neil, Elizabeth J. O’Neil, Shankar Pal, Istvan Cseri, Gideon Schaller, and Nigel Westbury. ORDPATHs: Insert-friendly XML node labels. SIGMOD Conference, 2004.

[Org04] SAX Project Organization. SAX: Simple API for XML. http://www.saxproject.org/, 2004.

[PAKJ+02] Stelios Paparizos, Shurug Al-Khalifa, H. V. Jagadish, Laks V. S. Lakshmanan, Andrew Nierman, Divesh Srivastava, and Yuqing Wu. Grouping in XML. EDBT Workshops, 2002.

[PCS+04] Shankar Pal, Istvan Cseri, Gideon Schaller, Oliver Seeliger, Leo Giakoumakis, and Vasili Zolotov. Indexing XML data stored in a relational database. VLDB Conference, 2004.

[PWLJ04] Stelios Paparizos, Yuqing Wu, Laks V. S. Lakshmanan, and H. V. Jagadish. Tree logical classes for efficient evaluation of XQuery. SIGMOD Conference, 2004.

[RSSB00] Prasan Roy, S. Seshadri, S. Sudarshan, and Siddhesh Bhobe. Efficient and extensible algorithms for multi query optimization. SIGMOD Conference, 2000.

[SM83] G. Salton and M. J. McGill. Introduction to modern information retrieval. McGraw-Hill, 1983.

[SSK+01] Jayavel Shanmugasundaram, Eugene J. Shekita, Jerry Kiernan, Rajasekar Krishnamurthy, Stratis Viglas, Jeffrey F. Naughton, and Igor Tatarinov. A general technique for querying XML documents using a relational database system. SIGMOD Record, 30:20–26, 2001.

[STW04] Ralf Schenkel, Anja Theobald, and Gerhard Weikum. HOPI: An efficient connection index for complex XML document collections. EDBT Conference, 2004.

[STW05] Ralf Schenkel, Anja Theobald, and Gerhard Weikum. Efficient creation and incremental maintenance of the HOPI index for complex XML document collections. ICDE Conference, 2005.

[STZ+99] Jayavel Shanmugasundaram, Kristin Tufte, Chun Zhang, Gang He, David J. DeWitt, and Jeffrey F. Naughton. Relational databases for querying XML documents: Limitations and opportunities. VLDB Conference, 1999.

[TDCZ02] Feng Tian, David J. DeWitt, Jianjun Chen, and Chun Zhang. The design and performance evaluation of alternative XML storage strategies. SIGMOD Record, 31 (1):5–10, 2002.

[TRP+04] Feng Tian, Berthold Reinwald, Hamid Pirahesh, Tobias Mayr, and Jussi Myllymaki. Implementing a scalable XML publish/subscribe system using a relational database system. SIGMOD Conference, 2004.

[TVB+02] Igor Tatarinov, Stratis Viglas, Kevin S. Beyer, Jayavel Shanmugasundaram, Eugene J. Shekita, and Chun Zhang. Storing and querying ordered XML using a relational database system. SIGMOD Conference, 2002.

[WPJ03] Yuqing Wu, Jignesh M. Patel, and H. V. Jagadish. Structural join order selection for XML query optimization. ICDE Conference, 2003.

[YASU01] Masatoshi Yoshikawa, Toshiyuki Amagasa, Takeyuki Shimura, and Shunsuke Uemura. XRel: a path-based approach to storage and retrieval of XML documents using relational databases. ACM Transactions on Internet Technology (TOIT), 1:110–141, 2001.

[ZND+01] Chun Zhang, Jeffrey F. Naughton, David J. DeWitt, Qiong Luo, and Guy M. Lohman. On supporting containment queries in relational database management systems. SIGMOD Conference, 2001.