Query Languages for XML Documents:
A QL '98 Position Paper

Michael Rys, Stanford University

This position paper presents several aspects that I consider important in the design of a query language for XML. They are based on my experience in database and information system research, in building database prototype systems, and as a developer of information-system-based applications.

Some Terminology

XML document	Any document marked up with well-formed XML
XML data	Any XML document that contains only semistructured data, structured by means of XML attributes and elements, and that does not contain any untagged CDATA or HTML text.
XML text	Any XML document that is not XML data.
Query language	A formal language to describe the search for data in a data collection, its restructuring and transformation (query), as well as the changes to the original data (update).

Why is a Query Language important?

XML will be (and already is) used to encode, provide and transfer partially structured data between data providers and consumers. In order to facilitate data and information retrieval for the consumer, it is necessary to provide query abstractions that allow access to the data in a declarative way (what do I want?). Standard navigational "query interfaces" that only allow navigation along predefined relationships will not scale to large amounts of data and are not well suited to efficient information discovery.

Fundamental Issues

Dependence on the application domain, the XML data model and desired expressiveness:
Query languages normally depend on the application domain and the desired expressiveness. They are defined on the basis of a formal description that specifies how data is represented (the data model) and how the data is queried and updated (often described in the form of a so-called algebra and/or calculus). The final query language may be on a similar level of abstraction as its description (e.g., SQL and relational calculus) or on a higher level (e.g., QBE and relational algebra).
In order to understand what a query language for XML should look like, we should know the domain in which the query language will be used, the operations we want to have executed, and finally the data model that describes the XML document.
Levels of abstraction:
Since the operations and the data model may be on a different level of abstraction (e.g., the model describes relations, but the query language has an object-oriented view of the data and its operations), an additional mapping may need to be defined.

XML Applications

Based on my experience, I see three main application domains in which XML is playing, and will continue to play, an important role as a data representation for data management and interchange:

Document management:
Documents will be encoded as XML texts, where certain information about the structure and metainformation will be represented in XML structures but most of the text will not be XML-tagged.
Transfer of data from single repositories:
In this case, data will be encoded as XML documents (most likely as XML data). The data might be stored in a specific XML repository or in another database system (relational, object-oriented), but the clients only see XML.
Information integration among multiple repositories:
In this case, data from different sources needs to be transformed from their source representation into a common representation suitable for the integration process (performed for example by mediators). XML is well-suited as the lingua franca of the integration layer due to its flexibility and portability. Most likely, data from the different sources will be represented as XML data for the integration.

In all three scenarios, XML is used to represent the data. However, the operational requirements and the underlying data model in all three domains differ.

Domain specific data models

document model as data model:
Document management normally uses the notion of a document as the underlying concept of its data model.
object-relationship model as data model:
The second and third domains (transfer of data from single repositories and information integration) normally deal only with exact queries and use the finer granularity of an object and its relationships as their basic data model.

Domain specific operational requirements

Document queries:
Document management normally requires a mixture of exact queries on the structured part and information retrieval queries on the unstructured part (e.g., find all documents published by W3C containing 'XML' next to 'XSL').
Repository queries:
Applications in this domain normally require only exact queries and operate in a syntactically and semantically homogeneous source context. Thus the standard operations provided are selection, projection, aggregation, join and set operations.
Integration queries:
Integration applications in general need to integrate syntactically and semantically heterogeneous sources. Thus, additional important operations besides the standard operations are:
- object composition by means of fusion
- generation of new (derived) object identities (by means of Skolem functions)
- querying of meta information about sources to identify and handle heterogeneity
- access to ontological tools such as thesaurus-based comparisons
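
As a minimal sketch of the first two operations, assume that records are plain dictionaries and that a name together with a birth year identifies a person. Both are assumptions made for illustration, and `skolem` and `fuse` are hypothetical helper names. A Skolem function can then be modeled as a deterministic mapping from key values to a derived object identity, and fusion as merging all records that receive the same identity:

```python
from collections import defaultdict


def skolem(function_name, *key_values):
    """Return a derived object identity that depends only on the arguments.

    The same function name and key values always yield the same identity,
    which is the property fusion relies on.
    """
    return function_name + "(" + ", ".join(repr(v) for v in key_values) + ")"


def fuse(records, key_fields, function_name="person"):
    """Merge records from heterogeneous sources that share a derived identity."""
    fused = defaultdict(dict)
    for record in records:
        oid = skolem(function_name, *(record[f] for f in key_fields))
        # Later sources only add fields that earlier ones did not provide.
        for field, value in record.items():
            fused[oid].setdefault(field, value)
    return dict(fused)


# Two hypothetical sources describing overlapping people.
source_a = [{"name": "Ann", "born": 1960, "email": "ann@example.org"}]
source_b = [
    {"name": "Ann", "born": 1960, "phone": "555-0100"},
    {"name": "Bob", "born": 1971, "phone": "555-0101"},
]

result = fuse(source_a + source_b, key_fields=("name", "born"))
```

In this sketch the two records for Ann collapse into one object carrying both the email and the phone field, while Bob stays a separate object. Resolving conflicting values for the same field is deliberately left out.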

Goal: Common Data Model and Common, Extensible Query Language

Given the different domain requirements, it will be important to decide what the target application domain of the query language will be. I hope that the communities can agree on a common data model that allows us to define a query language providing operations for all three domains in a simple, elegant, and consistent way. This clearly means that the language needs to provide operations normally found in database query languages, information integration systems, and document management systems. It is also important that the query language can easily be extended, for example to accommodate new domains and their requirements (geographical queries etc.) and to add new document management operations.

Meta data

In all three domains above, meta information plays an important role. While XML provides a way to define simple meta information about XML documents in the form of a DTD, more complex meta information needs to be provided as well. For example, a DTD can express relationships among objects (XML elements) by means of referential attributes. However, there is no standard way to define integrity constraints or ontology information (besides the sub-element relationship).

It will be important to query such meta information as well. If it is represented in XML, the query language can be used for querying the meta data. If the meta data is represented in RDF, then an RDF-QL needs to be specified in addition. If RDF is represented in XML, the RDF-QL can be mapped to the XML query language.

A Data Model for XML

Graph Structure

An XML document itself can be viewed as a linearization of graph structured data in which the order of the different tagged and untagged elements is, in general, important. Unfortunately, the XML element hierarchy can only express tree structured data, so the graph structure needs to be expressed using element attributes. Since there are many ways to linearize a graph, XML alone is not well suited as its own data model. Either the data model needs to be a full graph-based model, or XML needs to have a canonical form for representing the graph.
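
To illustrate why several linearizations of one graph are a problem, the following sketch uses Python's standard `xml.etree.ElementTree` module to extract the (unlabeled) edges from two documents. The element and attribute names `DB`, `PERSON`, `oid` and `CHILD`, as well as the helper `edges`, are assumptions chosen for this example. One document nests the child element, the other refers to it through an attribute, yet both encode the same edge:

```python
import xml.etree.ElementTree as ET

# Two hypothetical linearizations of the same graph: one nests the child
# element, the other points to it through a reference attribute.
NESTED = """
<DB>
  <PERSON oid="p3">
    <PERSON oid="p1"/>
  </PERSON>
</DB>
"""

BY_REFERENCE = """
<DB>
  <PERSON oid="p1"/>
  <PERSON oid="p3" CHILD="p1"/>
</DB>
"""


def edges(xml_text, id_attr="oid", ref_attr="CHILD"):
    """Return the set of (parent oid, child oid) edges encoded in a document."""
    root = ET.fromstring(xml_text)
    found = set()
    for element in root.iter():
        source = element.get(id_attr)
        if source is None:
            continue
        # Edges expressed through the element hierarchy.
        for sub in element:
            if sub.get(id_attr) is not None:
                found.add((source, sub.get(id_attr)))
        # Edges expressed through a reference attribute.
        for target in element.get(ref_attr, "").split():
            found.add((source, target))
    return found


nested_edges = edges(NESTED)
reference_edges = edges(BY_REFERENCE)
```

Both calls yield the same edge set, although the two documents differ as text. A graph-based data model, or a canonical XML form, would make that equivalence explicit instead of leaving it to each application.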

The query language should be able to deal with graph structured data:

Navigation and restructuring semistructured graphs:
It should be able to navigate the graph (using path expressions) and restructure it. The path expressions need to be able to deal with semistructured data, i.e., provide wildcards and regular path expressions.
Query composition:
In order to be able to compose queries, the result of a query should be another graph.
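
A minimal sketch of such path expressions, assuming the graph is held as a dictionary from node ids to labeled outgoing edges. The step syntax (`_` for any single edge, `#` for any path of zero or more edges) and the function name `evaluate` are invented for this sketch and are not taken from an existing language:

```python
def evaluate(graph, start, path):
    """Evaluate a dotted path expression against a labeled graph.

    graph maps a node id to a list of (edge label, target node id) pairs.
    Steps: a plain label, "_" for any single edge, "#" for any path of
    zero or more edges. This step syntax is invented for the sketch.
    """
    current = {start}
    for step in path.split("."):
        if step == "#":
            # Reflexive transitive closure; the visited set copes with cycles.
            pending = list(current)
            while pending:
                node = pending.pop()
                for _, target in graph.get(node, []):
                    if target not in current:
                        current.add(target)
                        pending.append(target)
        else:
            current = {
                target
                for node in current
                for label, target in graph.get(node, [])
                if step == "_" or label == step
            }
    return current


# A small hypothetical graph with a cycle between p3 and c1.
graph = {
    "root": [("ADULTPERSON", "p3")],
    "p3": [("CHILD", "c1"), ("NAME", "n3")],
    "c1": [("PARENT", "p3"), ("NAME", "n1")],
}

names_of_children = evaluate(graph, "root", "ADULTPERSON.CHILD.NAME")
second_level = evaluate(graph, "root", "ADULTPERSON._")
all_names = evaluate(graph, "root", "#.NAME")
```

The sketch returns a set of nodes rather than a graph. For queries to compose, a real language would have to wrap such a result in a new graph.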

Other questions that need to be addressed are:

Querying attributes and subelements
Should there be a difference between attributes and subelements in terms of query syntax? Especially if both are used for expressing parts of the same relation, e.g.,
<ADULTPERSON oid="p1"> ... </>
<ADULTPERSON oid="p2"> ... </>
<ADULTPERSON oid="p3" CHILD="p1 p2">
  <CHILD oid="c1"> ... </>
  <CHILD oid="c2"> ... </>
</>
It is my belief that the query language should allow all three alternatives:
- query both (e.g., ADULTPERSON.CHILD returns p1 p2 c1 c2)
- query only the attributes (e.g., ADULTPERSON.@CHILD returns p1 p2)
- query only the subelements (e.g., ADULTPERSON.$CHILD returns c1 c2)
Preservation of input graph representation:
Does a query have to preserve the graph representation of the user input or always return a canonical linearization regardless of the input? Or is it user definable?
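
The three alternatives for querying CHILD listed above can be sketched with Python's standard `xml.etree.ElementTree` module. The example document is rewritten with explicit end tags, because a standard XML parser does not accept the abbreviated `</>` form, and it is wrapped in a hypothetical `DB` root element. The three function names are likewise assumptions made for illustration:

```python
import xml.etree.ElementTree as ET

# The example above, rewritten with explicit end tags and a hypothetical
# <DB> root element so that a standard XML parser accepts it.
DOCUMENT = """
<DB>
  <ADULTPERSON oid="p1"/>
  <ADULTPERSON oid="p2"/>
  <ADULTPERSON oid="p3" CHILD="p1 p2">
    <CHILD oid="c1"/>
    <CHILD oid="c2"/>
  </ADULTPERSON>
</DB>
"""


def child_attributes(person, name="CHILD"):
    """Object ids referenced through the attribute (ADULTPERSON.@CHILD)."""
    return person.get(name, "").split()


def child_subelements(person, name="CHILD"):
    """Object ids of nested subelements (ADULTPERSON.$CHILD)."""
    return [sub.get("oid") for sub in person.findall(name)]


def child_both(person, name="CHILD"):
    """Attribute references followed by subelements (ADULTPERSON.CHILD)."""
    return child_attributes(person, name) + child_subelements(person, name)


root = ET.fromstring(DOCUMENT)
p3 = root.find("ADULTPERSON[@oid='p3']")
```

For the element with oid p3, `child_both(p3)` gives p1 p2 c1 c2, `child_attributes(p3)` gives p1 p2, and `child_subelements(p3)` gives c1 c2, matching the three cases above. Listing attribute references before subelements in the combined case is a choice made for this sketch.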

Extensional vs. Intensional Order

Often, especially in the context of documents but also in data management contexts, the extensional order is important. Thus, the data model should be able to preserve the extensional order of the XML documents.

The query language should therefore allow the user to specify not only an intensional order (e.g., via an order by clause) but also the extensional order in the case of updates. It should also be able to preserve the extensional order when querying, if the user requires it on a query-by-query basis.
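
The distinction can be made concrete with a small sketch based on Python's standard `xml.etree.ElementTree` module. The `REPORT` document and its section titles are invented for the example:

```python
import xml.etree.ElementTree as ET

DOCUMENT = """
<REPORT>
  <SECTION title="Motivation"/>
  <SECTION title="Approach"/>
  <SECTION title="Conclusion"/>
</REPORT>
"""

root = ET.fromstring(DOCUMENT)

# Extensional order: the order in which the elements appear in the document.
extensional = [s.get("title") for s in root.findall("SECTION")]

# Intensional order: an order computed from the data, as an ORDER BY would do.
intensional = sorted(extensional)

# An update that respects extensional order must say where the element goes.
root.insert(1, ET.Element("SECTION", {"title": "Background"}))
after_update = [s.get("title") for s in root.findall("SECTION")]
```

Here the extensional order is Motivation, Approach, Conclusion, while sorting by title yields Approach, Conclusion, Motivation. The insert places the new section at an explicit position, which is the kind of positional information an update operation would need to accept.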

Physical Design: Structured vs. Semistructured vs. Unstructured Data

For some applications, and in order to exploit performance opportunities, the physical design of the data model should use as much structural information as possible:

If XML data conform to their DTD:
The data model should be able to utilize this information for an efficient physical design and be able to provide it to the query language for performance optimization.
If no DTD is present:
The XML data should still be queryable according to the semantic nesting of the object relationship graph and not just as a text blob. This preserves the simplicity of the query language and avoids inconsistencies in the query formulation.
Untagged data (CDATA):
The physical mapping should treat untagged data as a single text-blob object to which information retrieval operations can be applied.

Query Language Operations

I don't want to go into a detailed description of all the operations. Instead, from the database and information integration point of view, I would like to refer to the research in the area of semistructured information processing. In particular, the XML-QL proposal and Stanford's Lore and TSIMMIS projects are, in my opinion, a very good starting point. For the area of information retrieval, Lore has presented some ideas with nearness- and similarity-based query operators, but there are certainly other contributions from the document management world.

Besides the already mentioned points, it is, in my opinion, important that

the query language provides operations for both (bulk) queries and (bulk) updates,
the query language provides all of the following:
- object (graph) preserving operations:
  the result is an already existing object (i.e., the same object id). It should allow extension and projection of object relationships for the answer of the query.
- object (graph) generating operations:
  for example, to compose a new object out of other objects.
- value generating operations:
  While the two previous kinds of operations preserve closure, these operations return values and go beyond closure (i.e., they are not compositional). Examples of such operations are IR retrievals, where the result is a list of locations.
the query language is simple (high level of abstraction with clean semantics, no baroque syntax),
the execution is efficient (data model and query language allow standard and new optimization techniques),
the execution is scalable (in terms of execution time vs size of data/number of documents/number of sites).

Some Database Issues

The chosen QL and data model should support the following:

a transaction oriented framework for the execution
namescoping (currently a problem with XML element names)
views (either virtual or materialized)
constraints specification and triggers
