Query Languages for XML Documents:
A QL '98 Position Paper
Michael Rys, Stanford University
This position paper presents several issues that I consider important
in the design of a query language for XML. It draws on my experience in
database and information systems research, in building database prototype
systems, and as a developer of information-system-based applications.
Some Terminology
- XML document:
Any document marked up with well-formed XML.
- XML data:
Any XML document that contains only semistructured data structured
by means of XML attributes and elements, and that does not contain
any untagged CDATA or HTML text.
- XML text:
Any XML document that is not XML data.
- Query language:
A formal language to describe the search for data in a data collection, its
restructuring and transformation (query), as well as the changes to the
original data (update).
Why is a Query Language important?
XML will be (and already is) used to encode, provide and transfer
partially structured data between data providers and consumers.
In order to facilitate data and information retrieval for the consumer,
it is necessary to provide query abstractions that allow access to
the data in a declarative way (what do I want?). Standard navigational
"query interfaces" that only allow navigation along predefined
relationships will not scale to large amounts of data and are not well suited
for efficient information discovery.
Fundamental Issues
- Dependence on the application domain, the XML data model and
desired expressiveness:
Query languages normally depend on the application domain and the desired
expressiveness. They are defined on the basis of a formal description
that specifies how data is represented (by means of a data model) and
how the data is queried and updated (often described in the form of
a so-called algebra and/or calculus). The final query language may be on a
similar level of abstraction as its description (e.g., SQL and relational
calculus) or on a higher level (e.g., QBE and relational algebra).
In order to understand what a query language for XML should look like,
we should know the domain in which the query language will be used, the
operations we want to have executed, and finally the data model that
describes the XML document.
- Levels of abstraction:
Since the operations and the data model may
be on a different level of abstraction (e.g., the model describes
relations, but the query language has an object-oriented view of the
data and its operations), an additional mapping may need to be defined.
XML Applications
Based on my experience, I see three main application domains in which XML
plays, and will continue to play, an important role as the data
representation for data management and interchange:
- Document management:
Documents will be encoded as XML texts, where certain information
about the structure and metainformation will be represented in
XML structures but most of the text will not be XML-tagged.
- Transfer of data from single repositories:
In this case, data will be encoded as XML documents (most likely
as XML data). The data might be stored in a specific XML repository or
in another database system (relational, object-oriented), but the
clients only see XML.
- Information integration among multiple repositories:
In this case, data from different sources needs to be transformed
from its source representation into a common representation suitable
for the integration process (performed for example by mediators).
XML is well-suited as the lingua franca of the integration layer due
to its flexibility and portability. Most likely, data from the
different sources will be represented as XML data for the
integration.
In all three scenarios, XML is used to represent the data. However,
the operational requirements and the underlying data model differ
across the three domains.
Domain specific data models
- Document model as data model:
Document management normally uses the notion of a document
as the underlying concept for its data model.
- Object-relationship model as data model:
The second and third domains (data transfer from single repositories and
information integration) normally deal only with exact queries and use the
finer granularity of an object and its relationships as their basic data model.
Domain specific operational requirements
- Document queries:
Document management normally requires a mixture of
exact queries on the structured part and information retrieval queries on the
unstructured part (e.g., find all documents published by W3C containing
'XML' next to 'XSL').
- Repository queries:
Applications in this domain normally require only exact queries and
operate in a syntactically and semantically homogeneous source context.
Thus the standard operations provided
are selection, projection, aggregation, join and set operations.
- Integration queries:
Integration applications in general need to integrate syntactically and
semantically heterogeneous sources. Thus, additional important operations
besides the standard operations are:
- object composition by means of fusion,
- generation of new (derived) object identities (by means of Skolem functions),
- querying of meta information about sources in order to identify and handle
heterogeneity,
- access to ontological tools such as thesaurus-based comparisons.
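The fusion and Skolem-function operations listed above can be illustrated with a minimal Python sketch. The record layout, the names `skolem` and `fuse`, and the "first source wins" conflict policy are assumptions made purely for illustration; they are not taken from any existing mediator system.

```python
from collections import defaultdict

def skolem(function_name, *key_values):
    """Derive a deterministic object identity from a function name and key values."""
    return f"{function_name}({', '.join(map(str, key_values))})"

def fuse(records, key_fields, function_name="person"):
    """Fuse records from different sources that map to the same Skolem identity."""
    fused = defaultdict(dict)
    for record in records:
        oid = skolem(function_name, *(record[field] for field in key_fields))
        for field, value in record.items():
            # Assumed policy: the first source wins when values conflict.
            fused[oid].setdefault(field, value)
    return dict(fused)

# Two hypothetical sources describing overlapping sets of people.
source_a = [{"name": "Ann", "born": 1960, "city": "Zurich"}]
source_b = [{"name": "Ann", "born": 1960, "phone": "555-0100"},
            {"name": "Bob", "born": 1971, "phone": "555-0199"}]

fused = fuse(source_a + source_b, key_fields=("name", "born"))
```

Because the identity is a function of the key values only, the two records for Ann receive the same derived identity and are merged into one object, while Bob becomes a separate object.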
Goal: Common Data Model and Common, Extensible Query Language
Based on the different domain requirements, it will be important to decide
what the target application domain of the query language will be. I hope
that the communities can agree on a common data model that allows
us to define a query language providing operations for all three domains
in a simple, elegant and consistent way. Clearly, this means
that it needs to provide the operations normally found in database query languages,
information integration systems and document management systems.
It is also important that the query language can easily be
extended, for example to accommodate new domains and their requirements
(geographical queries etc.) and to add new document management operations.
Meta data
In any of the three domains above, meta information plays an important role.
While XML provides a way to define simple meta information about XML documents
in the form of a DTD, more complex meta information needs to be provided as
well. For example, a DTD can express relationships among objects (XML
elements) by means of referential attributes. However, there is no standard way
to define integrity constraints or ontology information (besides
the sub-element relationship).
It will be important to query such meta information as well. If it is
represented in XML, the query language can be used for querying the
meta data. If the meta data is represented in RDF, then an RDF-QL
needs to be specified in addition. If RDF is represented in XML, the
RDF-QL can be mapped to the XML query language.
A Data Model for XML
Graph Structure
An XML document itself can be viewed as a linearization of graph-structured
data in which the order of the tagged and untagged elements is, in general,
important. Unfortunately, the XML element hierarchy can only express
tree-structured data; the remaining graph structure needs to be expressed using
element attributes. Since there are many ways to linearize a graph,
XML alone is not well suited as its own data model. Either
the data model needs to be a full graph-based model, or
XML needs to have a canonical form for representing the graph.
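To make the linearization problem concrete, the following Python sketch maps two different XML encodings to the same set of labeled edges. It is only an illustration under stated assumptions: the `oid` attribute is taken to carry object identity, every other attribute is assumed to hold a space-separated list of referenced identities, and the function name `to_graph` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Two linearizations of the same relationship: p3 has the child c1.
NESTED = '<DB><PERSON oid="p3"><CHILD oid="c1"/></PERSON></DB>'
REFERENCED = '<DB><PERSON oid="p3" CHILD="c1"/><CHILD oid="c1"/></DB>'

def to_graph(xml_text, id_attr="oid"):
    """Return the set of (source, label, target) edges encoded in the document."""
    root = ET.fromstring(xml_text)
    edges = set()
    for node in root.iter():
        source = node.get(id_attr)
        if source is None:
            continue  # elements without an identity are treated as containers
        # Tree edges: nested subelements that carry an identity.
        for child in node:
            if child.get(id_attr) is not None:
                edges.add((source, child.tag, child.get(id_attr)))
        # Reference edges: every other attribute is assumed to list identities.
        for name, value in node.attrib.items():
            if name != id_attr:
                for target in value.split():
                    edges.add((source, name, target))
    return edges

same_graph = to_graph(NESTED) == to_graph(REFERENCED)
```

Both documents yield the single edge from `p3` to `c1` labeled `CHILD`, which is why a query language defined directly on the XML text would have to treat them differently while a graph-based model would not.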
The query language should be able to deal with graph structured data:
- Navigation and restructuring of semistructured graphs:
It should be able to navigate the graph (using path expressions) and
restructure it.
The path expressions need to be able to deal with semistructured
data, i.e., provide wildcards and regular path expressions.
- Query composition:
In order to
be able to compose queries, the result of a query should be another graph.
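Path expressions with wildcards, as called for above, can be sketched over a small labeled graph in Python. The graph encoding (a dictionary from node to labeled edges), the wildcard symbols `_` (any single label) and `_*` (any sequence of zero or more edges), and the function names are assumptions chosen for this illustration, not the syntax of an existing language.

```python
def step(graph, nodes, label):
    """Follow one edge whose label matches `label` ('_' matches any label)."""
    return {target
            for node in nodes
            for edge_label, target in graph.get(node, [])
            if label == "_" or edge_label == label}

def closure(graph, nodes):
    """All nodes reachable from `nodes` over zero or more edges of any label."""
    seen, frontier = set(nodes), set(nodes)
    while frontier:
        frontier = step(graph, frontier, "_") - seen
        seen |= frontier
    return seen

def evaluate(graph, start, path):
    """Evaluate a path expression given as a list of labels and wildcards."""
    nodes = {start}
    for label in path:
        nodes = closure(graph, nodes) if label == "_*" else step(graph, nodes, label)
    return nodes

# Hypothetical graph: p3 has a nested child c1 and a referenced child p1.
graph = {
    "root": [("ADULTPERSON", "p1"), ("ADULTPERSON", "p3")],
    "p3": [("CHILD", "c1"), ("CHILD", "p1"), ("NAME", "n3")],
    "p1": [("NAME", "n1")],
    "c1": [("NAME", "nc1")],
}

children = evaluate(graph, "root", ["ADULTPERSON", "CHILD"])
all_names = evaluate(graph, "root", ["_*", "NAME"])
```

The second query finds every `NAME` node without knowing at which depth it occurs, which is the situation that semistructured data typically creates.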
Other questions that need to be addressed are:
- Querying attributes and subelements
Should there be a difference between attributes and subelements in terms
of query syntax, especially if both are used to express parts of the same
relation? For example:
<ADULTPERSON oid="p1"> ... </>
<ADULTPERSON oid="p2"> ... </>
<ADULTPERSON oid="p3" CHILD="p1 p2">
<CHILD oid="c1"> ... </>
<CHILD oid="c2"> ... </>
</>
It is my belief that the query language should allow all three alternatives:
- query both (e.g., ADULTPERSON.CHILD returns p1 p2 c1 c2)
- query only the attributes (e.g., ADULTPERSON.@CHILD returns p1 p2)
- query only the subelements (e.g., ADULTPERSON.$CHILD returns c1 c2)
- Preservation of input graph representation:
Does a query have to preserve
the graph representation of the user input or always return a canonical
linearization regardless of the input? Or is it user definable?
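The three alternatives for querying the CHILD attribute and the CHILD subelements can be emulated with Python's standard `xml.etree.ElementTree` module. This is a sketch under assumptions: the example data is rewritten with explicit end tags and wrapped in a hypothetical `PEOPLE` root element so that it parses as well-formed XML, and the three helper functions are illustrative names rather than operators of a proposed language.

```python
import xml.etree.ElementTree as ET

DOC = """
<PEOPLE>
  <ADULTPERSON oid="p1"/>
  <ADULTPERSON oid="p2"/>
  <ADULTPERSON oid="p3" CHILD="p1 p2">
    <CHILD oid="c1"/>
    <CHILD oid="c2"/>
  </ADULTPERSON>
</PEOPLE>
""".strip()

def attribute_refs(root, element, name):
    """Like ADULTPERSON.@CHILD: identities referenced through the attribute."""
    return [oid
            for node in root.iter(element)
            for oid in node.get(name, "").split()]

def subelement_oids(root, element, name):
    """Like ADULTPERSON.$CHILD: identities of direct subelements."""
    return [child.get("oid")
            for node in root.iter(element)
            for child in node.findall(name)]

def both(root, element, name):
    """Like ADULTPERSON.CHILD: attribute references followed by subelements."""
    return attribute_refs(root, element, name) + subelement_oids(root, element, name)

root = ET.fromstring(DOC)
only_attributes = attribute_refs(root, "ADULTPERSON", "CHILD")
only_subelements = subelement_oids(root, "ADULTPERSON", "CHILD")
combined = both(root, "ADULTPERSON", "CHILD")
```

Listing the attribute references before the subelements in the combined result is one possible choice; it happens to match the order given in the example above.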
Extensional vs. Intensional Order
Often, especially in the context of documents but also in data
management contexts, the extensional order (the order in which items
appear in the document) is important. Thus, the data model
should be able to preserve the extensional order of the XML documents.
The query language should therefore allow the user to specify not only
an intensional order (e.g., via an order by clause), but also
the extensional order in the case of updates. It should also be able to preserve
the extensional order when querying, if required by the user on a
query-by-query basis.
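The difference between the two kinds of order can be shown with a short Python sketch using the standard `xml.etree.ElementTree` module. The `LIST`/`ITEM` document and the function `item_names` are hypothetical and serve only to illustrate the distinction.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<LIST><ITEM name='pear'/><ITEM name='apple'/><ITEM name='fig'/></LIST>"
)

def item_names(root, order_by=None):
    """Return ITEM names in document order, or sorted when order_by is given."""
    items = list(root.iter("ITEM"))      # extensional (document) order
    if order_by is not None:
        items.sort(key=order_by)         # intensional order stated by the query
    return [item.get("name") for item in items]

extensional = item_names(root)
intensional = item_names(root, order_by=lambda item: item.get("name"))

# An update that states the extensional position of the new element.
root.insert(1, ET.Element("ITEM", {"name": "kiwi"}))
after_update = item_names(root)
```

Here the caller decides per query whether the document order is kept, and the update names the position at which the new element is placed.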
Physical Design: Structured vs. Semistructured vs. Unstructured Data
For some applications, and in order to exploit performance opportunities, the
physical design of the data model should use as much structural
information as possible:
- If XML data conform to their DTD:
The data model should be able to
utilize this information for an efficient physical design and be
able to provide it to the query language for performance
optimization.
- If no DTD is present:
The XML data should still be
queryable according to the semantic nesting of the object relationship graph
and not just as a text blob. This preserves the simplicity of the query language
and avoids inconsistencies in the query formulation.
- Untagged data (CDATA):
Untagged data should be treated by the physical mapping as a single
text-blob object to which
information retrieval operations can be applied.
Query Language Operations
I do not want to describe all the operations in detail.
Instead, from the database and information
integration point of view, I would like to refer to the research in the area
of semistructured information processing. In particular, the
XML-QL proposal and Stanford's Lore and TSIMMIS projects are, in my opinion,
a very good starting point. For the area of information retrieval, Lore has
presented some ideas with nearness- and similarity-based query operators,
but there are certainly other contributions from the document management
world.
Besides the already mentioned points, it is, in my opinion, important that
- the query language provides operations for both (bulk) queries and (bulk)
updates,
- the query language provides both
- object (graph) preserving operations:
the result is an already existing object (i.e., the same object id).
It should allow extension and projection of object relationships for the
answer of the query.
- object (graph) generating operations:
for example, to compose a new object out of other objects.
- value generating operations:
While the two previous kinds of operations preserve closure, these operations return
values and go beyond closure (i.e., they are not compositional). Examples of
such operations are IR retrievals, where the result is a list of locations.
- the query language is simple (high level of abstraction with
clean semantics, no baroque syntax),
- the execution is efficient (data model and query language allow standard
and new optimization techniques),
- the execution is scalable (in terms of execution time
vs size of data/number of documents/number of sites).
Some Database Issues
The chosen QL and data model should make the following possible:
- a transaction oriented framework for the execution
- name scoping (currently a problem with XML element names)
- views (either virtual or materialized)
- constraints specification and triggers