RDF Query Specification

December 3, 1998

Authors

Neel Sundaresan (IBM) <neel@almaden.ibm.com>

Abstract

This paper presents syntax and facilities for RDF Query. RDF query is a declarative syntax for selecting RDF resources that meet specified criteria.

Status

This is a technical contribution to the W3C Query Languages Workshop, Dec 3 and 4, 1998.

1. Introduction
2. Concepts
3. Examples
    3.1 Selecting all resources from a collection
    3.2 Selecting all resources with a given property
    3.3 Selecting all resources which satisfy a given condition
    3.4 Complex conditions
    3.5 Nested Queries
    3.6 Projection
    3.7 Aggregation
    3.8 Composing Results
    3.9 Grouping By Property Values
    3.10 Sorting Results
    3.11 Quantification
4. Vocabulary-Specific Inference
5. A DTD For RDF Query
6. References

1. Introduction

One of the major contributions of the Relational data model, introduced a little over 25 years ago by [Codd], was an accompanying declarative query language. In contrast to earlier query languages, [SQL] lets you specify what should be retrieved rather than how it should be retrieved. This freed the user from knowing how the data was stored. SQL also established the major components of a query language: a source over which the query executes; a selection condition; a specification of what is to be returned; and constructs to control how it should be presented.

[RDF] is the W3C's recommended framework for metadata. In fact, the underlying data model is a semantic network model that has surfaced several times, over the years, in slightly different forms for data and knowledge representation [Findler][Brachman][Chen]. In this paper we present a declarative query language for RDF. We hope this will be useful for other RDF-based efforts such as [P3P].

RDF Query needs to be more complex than SQL since the RDF data model is more complex than the Relational data model. Specifically, while a relational query executes over one or more tables each containing tuples with the same structure, an RDF query executes over an RDF container that may contain resources of different types, each with different properties. Values of properties, rather than being merely data, can be resources themselves. Finally, property values can be RDF containers.

At IBM we use RDF in the "Grand Central" family of intelligent search engines, of which [jCentral], which searches for Java-related resources, is an example. These search engines extract metadata about websites and encode it as a collection of RDF structures which are then queried in response to user requests. The query facilities are implemented using a number of embedded Java functions. This paper is our attempt to abstract those facilities and encode them in an XML-style, declarative syntax.

2. Concepts

An RDF Query (rdfquery) operates on a source container of resources and returns a result container of resources. The result container is always a subset of the source container and may be empty. Thus, RDF queries provide closure; they start with a container of RDF resources and they return a container of RDF resources. The result container can be the source for another RDF query.
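
The closure property can be pictured with a small Python sketch. The representation used here (a container as a list, a resource as a dictionary of property values) and the helper name run_query are illustrative assumptions of ours and are not part of RDF Query.

```python
# Hypothetical model: a container is a list of resources, and a resource
# is a dict mapping property names to values. These names are not RDF Query.

def run_query(source, condition=lambda resource: True):
    """Return the members of `source` that satisfy `condition`."""
    return [resource for resource in source if condition(resource)]

people = [
    {"name": "A", "Age": 55, "Project": "WebTechnologies"},
    {"name": "B", "Age": 42, "Project": "WebTechnologies"},
    {"name": "C", "Age": 61, "Project": "Databases"},
]

# Closure: the result of one query is a valid source for the next one.
web_people = run_query(people, lambda r: r.get("Project") == "WebTechnologies")
senior_web_people = run_query(web_people, lambda r: r.get("Age", 0) > 50)

print([r["name"] for r in senior_web_people])
```

Because every result is itself a container, the second call needs no special handling for the output of the first.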

Note that RDF queries are expressed using RDF descriptions (metadata). Since RDF has an XML syntax, and RDF Query is an RDF vocabulary, RDF Query also has an XML syntax. We use the namespace prefix rdfq for tag names used in RDF Query specifications.

RDF defines three kinds of resources and their descriptions:

RDF descriptions that describe resources that are defined elsewhere. These descriptions use the "about" attribute to describe these resources.
RDF descriptions that describe resources inline. These resources may be proxies for real resources. These descriptions use the "id" attribute.
RDF descriptions that are reified. These descriptions use the bagID attribute.

The RDF query language must be able to describe queries over these kinds of resources and their descriptions.

An rdfquery must specify a source collection and may specify a property name or a condition. If neither a property name nor a condition is specified, the query returns all resources in the source collection. If a property name is specified, the query returns all resources in the source collection that have the named property. If a condition is specified, the query returns all resources in the source collection for which the condition evaluates to True. Instead of returning the qualifying resources, rdfquery also allows us to create new resources from the qualifying resources with fewer properties. This enables us to build views on the source containers.
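
The three cases can be summarised in a short Python sketch. The function name select, its parameters, and the dictionary model of a resource are our own illustrative assumptions. Rejecting a call that gives both a property name and a condition is also an assumption, since the text only says that one or the other may be given.

```python
def select(source, property_name=None, condition=None):
    """Sketch of the three selection cases over a list of dict resources."""
    if property_name is not None and condition is not None:
        # Assumption: the two forms are treated as mutually exclusive here.
        raise ValueError("specify a property name or a condition, not both")
    if property_name is not None:
        # Case 2: keep the resources that have the named property.
        return [r for r in source if property_name in r]
    if condition is not None:
        # Case 3: keep the resources for which the condition is True.
        return [r for r in source if condition(r)]
    # Case 1: no property and no condition, so every resource qualifies.
    return list(source)

papers = [
    {"title": "paper1", "Year": 1997},
    {"title": "paper3", "Year": 1998},
    {"title": "paper7"},
]

everything = select(papers)
with_year = select(papers, property_name="Year")
recent = select(papers, condition=lambda r: r.get("Year", 0) >= 1998)

print(len(everything), len(with_year), len(recent))
```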

The latest [RDF] specification says at the end of section 2.1 "Property values can be other resources or they can be atomic; that is, simple strings or other primitive datatypes defined by XML." As far as we know, there are no primitive datatypes defined in XML except strings. However, there is ongoing work to do so. See [DCD]. We shall assume that primitive datatypes will be added to XML and RDF and thus we can define a richer set of query primitives than are possible with only strings. In this paper we use tags such as rdf:Integer to identify such datatypes.

3. Examples

3.1 Selecting all resources from a collection

Consider a collection of RDF resources. These resources may be identified by their URIs, IDs, or bagIDs. Since no property or condition is specified, the following query selects all resources in an explicitly specified collection of resources.

<rdfq:rdfquery>
  <rdfq:From>
    <rdf:Bag>
       <li resource="http://www.research.ibm.com/people/ashok/paper1.html"/>
       <li resource="http://www.research.ibm.com/people/ashok/paper3.html"/>
       <li resource="http://www.research.ibm.com/people/neel/paper1.html"/>
       <li resource="http://www.research.ibm.com/people/neel/paper7.html"/>
    </rdf:Bag>
  </rdfq:From>
</rdfq:rdfquery>

If there is only one resource to be selected from, the rdf:Bag may be omitted and the resource specified as an attribute of the rdfq:From element directly.

In the more common case, the query will specify a container that contains the resources. We introduce an attribute of rdfq:From called eachResource (similar to aboutEach in RDF) to query all the resources in a container.


<rdfq:rdfquery>
  <rdfq:From eachResource="papers"/>
</rdfq:rdfquery>

where the resources could be collected elsewhere in a container as



    <rdf:Bag bagID="papers">
       <li resource="http://www.research.ibm.com/people/ashok/paper1.html"/>
       <li resource="http://www.research.ibm.com/people/ashok/paper3.html"/>
       <li resource="http://www.research.ibm.com/people/neel/paper1.html"/>
       <li resource="http://www.research.ibm.com/people/neel/paper7.html"/>
    </rdf:Bag>

3.2 Selecting all resources with a given property

The following query selects all resources in the collection that have the property "ResearchPapers".


<rdfq:rdfquery>
  <rdfq:From eachResource="http://www.research.ibm.com/people/neel">
    <rdfq:Select>
      <rdfq:Property name="ResearchPapers"/>
    </rdfq:Select>
  </rdfq:From>
</rdfq:rdfquery>

3.3 Selecting all resources which satisfy a given condition

The following query selects all resources in the collection that have a property named Project whose value is the string "WebTechnologies". Here we assume the String data type is defined in RDF.

<rdfq:rdfquery>
  <rdfq:From eachResource="http://www.research.ibm.com/people/neel">
    <rdfq:Select>
      <rdfq:Condition>
        <rdfq:equals>
          <rdfq:Property name="Project" />
          <rdf:String>WebTechnologies</rdf:String>
        </rdfq:equals>
      </rdfq:Condition>
    </rdfq:Select>
  </rdfq:From>
</rdfq:rdfquery>

The value of the Project property may be a text string or it may be a reified property with predicate equal to Project and object equal to WebTechnologies. The query will select instances where the property is represented in either of these forms.

If the resource does not have the given property, the condition evaluates to false.

Clearly, we need to be able to test on conditions other than equals. The following query selects resources from a collection of people in IBM Research where the value of the age property is greater than 50.

<rdfq:rdfquery>
  <rdfq:From eachResource="http://www.research.ibm.com/people/">
  <rdfq:Select>
    <rdfq:Condition>
      <rdfq:greaterThan>
        <rdfq:Property name="Age" />
        <rdf:Integer>50</rdf:Integer>
      </rdfq:greaterThan>
    </rdfq:Condition>
  </rdfq:Select> 
  </rdfq:From>
</rdfq:rdfquery>
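
As a rough illustration of how such comparison conditions behave, including the rule that a missing property makes the condition false, consider the following Python sketch. The function compare, the OPERATORS table, and the sample data are hypothetical; only the operator names equals, greaterThan, and lessThan are taken from this paper.

```python
import operator

# Operator names follow the rdfq element names; the table itself is a sketch.
OPERATORS = {
    "equals": operator.eq,
    "greaterThan": operator.gt,
    "lessThan": operator.lt,
}

def compare(resource, op_name, property_name, literal):
    """Evaluate one comparison against a resource modelled as a dict."""
    if property_name not in resource:
        # A resource without the property never satisfies the condition.
        return False
    return OPERATORS[op_name](resource[property_name], literal)

people = [
    {"name": "A", "Age": 55},
    {"name": "B", "Age": 42},
    {"name": "C"},  # no Age property recorded
]

over_50 = [p["name"] for p in people if compare(p, "greaterThan", "Age", 50)]
print(over_50)
```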

3.4 Complex conditions

Complex conditions can be specified by using boolean operators in the Select tag. The following query selects all people resources in IBM Research who report to departments where the budget is greater than a million dollars and the department size is less than 10 people.

<rdfq:rdfquery>
  <rdfq:From eachResource="http://www.research.ibm.com/people">
    <rdfq:Select>
      <rdfq:Condition>
        <rdfq:and>
          <rdfq:greaterThan>
            <rdfq:Property name="Budget" />
            <rdf:Integer>1000000</rdf:Integer>
          </rdfq:greaterThan>
          <rdfq:lessThan>
            <rdfq:Property name="Size" />
            <rdf:Integer>10</rdf:Integer>
          </rdfq:lessThan>
        </rdfq:and>
      </rdfq:Condition>
    </rdfq:Select>
  </rdfq:From>
</rdfq:rdfquery>
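
One way to picture the evaluation of a complex condition is as a recursive walk over the condition tree. In the Python sketch below the condition is written as nested tuples that mirror the nesting of the XML elements; this encoding, the function holds, and the sample departments are assumptions made for illustration.

```python
import operator

COMPARISONS = {
    "equals": operator.eq,
    "greaterThan": operator.gt,
    "lessThan": operator.lt,
}

def holds(condition, resource):
    """Recursively evaluate a condition tree against one dict resource."""
    tag = condition[0]
    if tag == "and":
        return all(holds(child, resource) for child in condition[1:])
    if tag == "not":
        return not holds(condition[1], resource)
    # Otherwise the node is a comparison: (operator, property, literal).
    _, property_name, literal = condition
    if property_name not in resource:
        return False
    return COMPARISONS[tag](resource[property_name], literal)

departments = [
    {"name": "D1", "Budget": 2_000_000, "Size": 8},
    {"name": "D2", "Budget": 2_000_000, "Size": 25},
    {"name": "D3", "Budget": 500_000, "Size": 5},
    {"name": "D4", "Size": 5},  # no Budget property recorded
]

query = ("and", ("greaterThan", "Budget", 1_000_000), ("lessThan", "Size", 10))
selected = [d["name"] for d in departments if holds(query, d)]
rejected = [d["name"] for d in departments if holds(("not", query), d)]
print(selected, rejected)
```

Note that D4 lands in the negated result: its Budget comparison is false because the property is absent, so the conjunction is false and its negation is true.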

3.5 Nested Queries

If the value of the property is an RDF resource (inline or external) then the selection criterion may include constraints on the properties of the value resource. The following query selects all people in IBM Research who report to resources (departments) where the budget is over 1 million dollars. The syntax should be read as follows: find the resources in the source collection which have the BelongTo property; apply the condition to the value of that property for each of these resources; add those resources for which the condition returns True to the result set.

<rdfq:rdfquery>
  <rdfq:From eachResource="http://www.research.ibm.com/people/">
  <rdfq:Select>
    <rdfq:Property name="BelongTo" />
    <rdfq:Select>
      <rdfq:Condition>
        <rdfq:greaterThan>
          <rdfq:Property name="Budget" />
          <rdf:Integer>1000000</rdf:Integer>
        </rdfq:greaterThan>
      </rdfq:Condition>
    </rdfq:Select>
  </rdfq:Select>
 </rdfq:From>
</rdfq:rdfquery>
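
The reading given above can be sketched in Python as follows. The helper select_nested and the sample data are hypothetical; a resource-valued property is modelled as a nested dictionary.

```python
def select_nested(source, property_name, condition):
    """Keep resources whose `property_name` value is a resource that
    satisfies `condition`."""
    result = []
    for resource in source:
        value = resource.get(property_name)
        # Only resource-valued (dict) properties can be navigated into.
        if isinstance(value, dict) and condition(value):
            result.append(resource)
    return result

large_dept = {"name": "D1", "Budget": 2_000_000}
small_dept = {"name": "D2", "Budget": 400_000}

people = [
    {"name": "A", "BelongTo": large_dept},
    {"name": "B", "BelongTo": small_dept},
    {"name": "C"},  # no BelongTo property
]

selected = select_nested(
    people, "BelongTo", lambda dept: dept.get("Budget", 0) > 1_000_000
)
print([p["name"] for p in selected])
```

The result contains the people, not the departments: the inner condition only decides which members of the source collection qualify.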

Note that for every navigation across a property a new rdfq:Property element needs to be introduced. An abbreviation syntax makes this less complex. This abbreviation is similar to a path expression. One possible abbreviation is to use rdf:Seq. Suppose a property named "ResearchPapers" has a value which is an RDF description with a property named "Conference" whose value is an RDF description with a property named "Venue" whose value is an RDF description with a property named "Country". Now suppose that one wants to select, from all the publications that Neel has, those that appeared in foreign conferences. One could write this query, using the RDF abbreviation syntax, as:

<rdfq:rdfquery>
  <rdfq:From eachResource="http://www.research.ibm.com/people/neel">
    <rdfq:Select>
      <rdfq:Condition>
        <rdfq:not>
          <rdfq:equals>
            <rdfq:Property>
              <rdf:Seq>
                <li>ResearchPapers</li>
                <li>Conference</li>
                <li>Venue</li>
                <li>Country</li>
              </rdf:Seq>
            </rdfq:Property>
            <rdf:String>U.S.A.</rdf:String>
          </rdfq:equals>
        </rdfq:not>
      </rdfq:Condition>
    </rdfq:Select>
  </rdfq:From>
</rdfq:rdfquery>

An alternative linear syntax would use the / operator to indicate navigating over properties. This is similar to its use in [XQL Proposal]. Alternatively, the .. operator may be used to indicate navigation as in [OQL]. Using the / operator, the navigation over ResearchPapers, Conference, Venue and Country would be expressed as ResearchPapers/Conference/Venue/Country. We introduce a new attribute for rdfq:Property called path to indicate a path expression. With this, the above example could be written as:

<rdfq:rdfquery>
  <rdfq:From eachResource="http://www.research.ibm.com/people/neel">
    <rdfq:Select>
      <rdfq:Condition>
        <rdfq:not>
          <rdfq:equals>
            <rdfq:Property path="ResearchPapers/Conference/Venue/Country"/>
            <rdf:String>U.S.A.</rdf:String>
          </rdfq:equals>
        </rdfq:not>
      </rdfq:Condition>
    </rdfq:Select>
  </rdfq:From>
</rdfq:rdfquery>
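
A path expression of this kind can be evaluated by walking one property at a time and fanning out whenever a value is a container. The Python sketch below is one possible reading; the function follow, the sample data, and the rule that an equality test over several reached values succeeds if any one of them matches are all our assumptions, since the paper does not fix these details.

```python
def follow(resource, path):
    """Return every value reached from `resource` by walking `path`."""
    frontier = [resource]
    for step in path.split("/"):
        next_frontier = []
        for node in frontier:
            if not isinstance(node, dict) or step not in node:
                continue  # this branch has no such property; drop it
            value = node[step]
            # A container-valued property fans out to each of its members.
            next_frontier.extend(value if isinstance(value, list) else [value])
        frontier = next_frontier
    return frontier

neel = {
    "ResearchPapers": [
        {"title": "P1", "Conference": {"Venue": {"Country": "U.S.A."}}},
        {"title": "P2", "Conference": {"Venue": {"Country": "Japan"}}},
        {"title": "P3"},  # no conference recorded
    ]
}

countries = follow(neel, "ResearchPapers/Conference/Venue/Country")

# not(equals(path, "U.S.A.")): the inner equality is false when the path
# reaches nothing, so a paper without a recorded country also qualifies.
foreign = [
    paper["title"]
    for paper in neel["ResearchPapers"]
    if not any(c == "U.S.A." for c in follow(paper, "Conference/Venue/Country"))
]
print(countries, foreign)
```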

3.6 Projection

Sometimes it is useful, on selecting a set of resources that satisfy a condition, to build proxy sets of resources with only some of the properties. This requires building an inline RDF description or container with only the selected properties. To describe this we introduce an attribute of the Select clause called properties. A longer syntax for the same function in RDF form can be achieved by adding a properties element (or RDF property) to rdfq:Select whose value is an rdf:Bag listing all the properties on which the projection is done. We prefer the abbreviated syntax. The following example selects, from a set of people in IBM Research, those whose project is "WebTechnologies" and builds a description with properties "fullname" (name of the person) and "experience" (number of years in the project).

<rdfq:rdfquery>
  <rdfq:From eachResource="http://www.research.ibm.com/people/">
    <rdfq:Select properties="fullname experience">
      <rdfq:Condition>
        <rdfq:equals>
          <rdfq:Property name="Project" />
          <rdf:String>WebTechnologies</rdf:String>
        </rdfq:equals>
      </rdfq:Condition>
    </rdfq:Select>
  </rdfq:From>
</rdfq:rdfquery>
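
Projection can be sketched in Python as building a new, smaller dictionary for each qualifying resource. The function project and the sample people are hypothetical; the space-separated list mirrors the form of the properties attribute.

```python
def project(resources, properties):
    """Build proxy resources that keep only the listed properties.

    `properties` is a space-separated list of property names.
    """
    names = properties.split()
    return [
        {name: resource[name] for name in names if name in resource}
        for resource in resources
    ]

people = [
    {"fullname": "Person A", "experience": 3, "Project": "WebTechnologies", "Age": 40},
    {"fullname": "Person B", "experience": 7, "Project": "Databases", "Age": 51},
    {"fullname": "Person C", "Project": "WebTechnologies"},
]

selected = [p for p in people if p.get("Project") == "WebTechnologies"]
view = project(selected, "fullname experience")
print(view)
```

In this sketch a property that a resource lacks is simply left out of its proxy; the paper does not say whether such a resource should instead be dropped or given an empty value.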

3.7 Aggregation

Aggregation operations can be applied to RDF objects that are containers. Specifically, we may want to evaluate an aggregate function on the result sets of an rdfquery. To specify this we add an attribute called aggregate to the rdfq:Select element. The value of this attribute can only be the name of an aggregate function such as count, min, or max. Let us modify the previous query to find the count of all papers written by Neel in foreign conferences. In the example count is applied to all resources in the source container that satisfy the condition. The "*" means apply to all. In a later example, count is applied to a specific property.

<rdfq:rdfquery>
  <rdfq:From eachResource="http://www.research.ibm.com/people/neel">
    <rdfq:Select properties="count(*)">
      <rdfq:Condition>
        <rdfq:not>
          <rdfq:equals>
            <rdfq:Property path="ResearchPapers/Conference/Venue/Country"/>
            <rdf:String>U.S.A.</rdf:String>
          </rdfq:equals>
        </rdfq:not>
      </rdfq:Condition>
    </rdfq:Select>
  </rdfq:From>
</rdfq:rdfquery>
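
The difference between applying an aggregate to "*" and to a specific property can be sketched in Python as follows. The function aggregate, the AGGREGATES table, and the sample papers are illustrative assumptions; only the function names count, min, and max come from the text.

```python
AGGREGATES = {"count": len, "min": min, "max": max}

def aggregate(function_name, resources, property_name="*"):
    """Apply an aggregate to a result container.

    With "*" the aggregate sees the resources themselves (useful for count);
    otherwise it sees the values of the named property where present.
    """
    if property_name == "*":
        values = list(resources)
    else:
        values = [r[property_name] for r in resources if property_name in r]
    return AGGREGATES[function_name](values)

# Hypothetical result of the foreign-conference query.
foreign_papers = [
    {"title": "P1", "Year": 1996},
    {"title": "P2", "Year": 1998},
    {"title": "P3"},  # no Year recorded
]

total = aggregate("count", foreign_papers)
with_year = aggregate("count", foreign_papers, "Year")
latest = aggregate("max", foreign_papers, "Year")
print(total, with_year, latest)
```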

3.8 Composing Results

Sometimes we would like to perform algebraic set operations on the result sets. We support three kinds of operations: Union, Intersection, and Difference. For instance, the following query finds all papers written by Neel and all papers written by Ashok.

<rdfq:rdfquery>
  <rdfq:Union>
   <rdfq:From eachResource="http://www.research.ibm.com/people/neel">
    <rdfq:Select>
      <rdfq:Property name="ResearchPapers"/> 
    </rdfq:Select>
   </rdfq:From>
   <rdfq:From eachResource="http://www.research.ibm.com/people/Ashok">
    <rdfq:Select>
      <rdfq:Property name="ResearchPapers"/> 
    </rdfq:Select>
   </rdfq:From>
  </rdfq:Union>
</rdfq:rdfquery>

Duplicate elimination in the result RDF structures depends on the notion of equality of property values. If that is clearly specified, we can have the Union operator eliminate all duplicate occurrences of papers (this would prevent papers jointly authored by Neel and Ashok from occurring twice).

Aliasing

Set operations are allowed only on RDF structures that have the same set of properties, i.e. properties with the same names. In order to allow for equivalent properties with different names, aliases are supported. Suppose Neel refers to his papers as "ResearchPapers" on his web site, and Ashok refers to his papers as "Publications", and we want to perform a union on these. We introduce the alias "ResearchPapers" for the property name "Publications". We can also alias both the property names and give them a new name, such as "articles", in the union.

<rdfq:rdfquery>
  <rdfq:Union>
   <rdfq:From eachResource="http://www.research.ibm.com/people/neel">
    <rdfq:Select>
      <rdfq:Property name="ResearchPapers" /> 
    </rdfq:Select>
   </rdfq:From>
   <rdfq:From eachResource="http://www.research.ibm.com/people/Ashok">
    <rdfq:Select>
      <rdfq:Property name="Publications" alias="ResearchPapers"/> 
    </rdfq:Select>
   </rdfq:From>
  </rdfq:Union>
</rdfq:rdfquery>
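
A Python sketch of a union with aliasing and duplicate elimination follows. The helpers rename and union are hypothetical, the example.org URL stands in for a jointly authored paper and is invented for illustration, and equality of resources is taken to be plain value equality, which is only one of the possible notions of equality mentioned above.

```python
def rename(resource, aliases):
    """Return a copy of `resource` with property names replaced by aliases."""
    return {aliases.get(name, name): value for name, value in resource.items()}

def union(*containers):
    """Concatenate containers, keeping the first copy of equal resources."""
    result = []
    for container in containers:
        for resource in container:
            if resource not in result:
                result.append(resource)
    return result

neel = [
    {"ResearchPapers": "http://www.research.ibm.com/people/neel/paper1.html"},
    {"ResearchPapers": "http://example.org/joint-paper.html"},
]
ashok = [
    {"Publications": "http://www.research.ibm.com/people/ashok/paper1.html"},
    {"Publications": "http://example.org/joint-paper.html"},
]

# Alias Publications to ResearchPapers so both operands share one schema.
ashok_aliased = [rename(r, {"Publications": "ResearchPapers"}) for r in ashok]
combined = union(neel, ashok_aliased)
print(len(combined))
```

Without the alias step the two joint-paper entries would carry different property names and would not be recognised as duplicates.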

3.9 Grouping By Property Values

When doing aggregate operations one might want to build containers (collections) grouped by property values before performing aggregations. Suppose one would like to count the number of papers published by Neel every year. The result will be an RDF Description that describes "http://www.research.ibm.com/people/neel" as a container in which each member has two properties, "ResearchPapers/Conference/Year" and "count". This can be expressed as below.

<rdfq:rdfquery>
   <rdfq:From eachResource="http://www.research.ibm.com/people/neel">
    <rdfq:Select properties = "ResearchPapers/Conference/Year
                               count(ResearchPapers/Conference/Year)">
      <rdfq:Group>
        <rdfq:Property path="ResearchPapers/Conference/Year"/>          
      </rdfq:Group>	
    </rdfq:Select>
   </rdfq:From>
</rdfq:rdfquery>
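
Grouping followed by a count can be sketched in Python with a counter keyed on the grouping value. The function count_by and the flat sample data (each paper carrying its own Year) are simplifying assumptions; in the query above the year is reached through a path expression.

```python
from collections import Counter

def count_by(resources, property_name):
    """Group by the value of `property_name` and count each group.

    Returns one small resource per group, ordered by the grouping value.
    """
    counts = Counter(
        r[property_name] for r in resources if property_name in r
    )
    return [
        {property_name: value, "count": n}
        for value, n in sorted(counts.items())
    ]

papers = [
    {"title": "P1", "Year": 1997},
    {"title": "P2", "Year": 1998},
    {"title": "P3", "Year": 1998},
    {"title": "P4"},  # no Year recorded, so it joins no group
]

per_year = count_by(papers, "Year")
print(per_year)
```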

3.10 Sorting Results

Since the result of a query can be an RDF object which could possibly be an RDF container we may want to sort the results based upon property values and return an RDF sequence. We introduce a tag rdfq:Order for this purpose. To order by multiple properties this will contain an rdf:Seq with the sequence in which the ordering is to be done. To select all papers written by Neel, ordered by the year of publication and within the same year by month we would write:

<rdfq:rdfquery>
   <rdfq:From eachResource="http://www.research.ibm.com/people/neel">
    <rdfq:Select>
        <rdfq:Property name="ResearchPapers"/>
    </rdfq:Select>
    <rdfq:Order>
        <rdf:Seq>
           <rdfq:Property path="ResearchPapers/Year"/>
           <rdfq:Property path="ResearchPapers/Month"/>
        </rdf:Seq>
    </rdfq:Order>
   </rdfq:From>
</rdfq:rdfquery>
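
Ordering by a sequence of properties corresponds to sorting on a compound key. The Python sketch below assumes that every resource carries all of the ordering properties and that months are stored as integers; the function order_by and the data are hypothetical.

```python
def order_by(resources, property_names):
    """Sort by the listed properties, the first being the most significant."""
    return sorted(
        resources,
        key=lambda r: tuple(r[name] for name in property_names),
    )

papers = [
    {"title": "P1", "Year": 1998, "Month": 6},
    {"title": "P2", "Year": 1997, "Month": 11},
    {"title": "P3", "Year": 1998, "Month": 2},
]

ordered = order_by(papers, ["Year", "Month"])
print([p["title"] for p in ordered])
```

How resources that lack an ordering property should be placed is left open by the paper; this sketch would raise a KeyError for them.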

3.11 Quantification

To support queries based upon "for all" and "there exists" quantification we introduce the rdfq:Quantifier element, whose type attribute takes the value exists or forAll. These are used to test whether any or all members of a collection meet a given condition. Since quantifiers are boolean conditions, they can occur anywhere a condition clause can occur. We also introduce a variable var which ranges over all the members of the collection. The body of the quantifier has two expression elements: the first evaluates to a collection; the second, which is an rdfq:Condition, evaluates a boolean condition over each member of that collection. Thus, "for each x in S : cond(x)" translates to



  <rdfq:Quantifier type="forAll" var="x">
    <...> [ evaluates to a collection ]
    <rdfq:Condition>...</rdfq:Condition>
  </rdfq:Quantifier>

The var can be referenced in the condition body using var-ref. In the following example, we want to pick from a set of researchers those that have at least one publication in 1998. The first child element of the quantifier searches for a property "ResearchPapers" and that returns a collection. The predicate (second child element of the quantifier) is applied to each member in this collection by referring to the member through var-ref. In the example the predicate is applied to the "Year" property of each selected element. The predicate checks if the value is "1998". In this simple example the var and var-ref may appear extraneous but they are necessary to express nested queries and complex conditions with quantification.

<rdfq:rdfquery>
  <rdfq:From eachResource="Almaden_Researchers">
    <rdfq:Select>
      <rdfq:Condition>
        <rdfq:Quantifier type="exists" var="x">
          <rdfq:Property path="ResearchPapers"/>
          <rdfq:Condition>
            <rdfq:equals>
              <rdfq:Property var-ref="x" name="Year"/>
              <rdf:Integer>1998</rdf:Integer>
            </rdfq:equals>
          </rdfq:Condition>
        </rdfq:Quantifier>
      </rdfq:Condition>
    </rdfq:Select>
  </rdfq:From>
</rdfq:rdfquery>
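
The two quantifier types map naturally onto "any" and "all" over the collection produced by the first child element. The following Python sketch uses a hypothetical function quantify and invented sample data; the bound variable x of the query appears as the argument of the condition function.

```python
def quantify(kind, collection, condition):
    """Evaluate an "exists" or "forAll" quantifier over a collection."""
    if kind == "exists":
        return any(condition(x) for x in collection)
    if kind == "forAll":
        return all(condition(x) for x in collection)
    raise ValueError(f"unknown quantifier type: {kind!r}")

def published_in_1998(paper):
    # Plays the role of the condition that refers to the member via var-ref.
    return paper.get("Year") == 1998

researchers = [
    {"name": "A", "ResearchPapers": [{"Year": 1997}, {"Year": 1998}]},
    {"name": "B", "ResearchPapers": [{"Year": 1996}]},
    {"name": "C", "ResearchPapers": [{"Year": 1998}]},
]

some_1998 = [
    r["name"] for r in researchers
    if quantify("exists", r.get("ResearchPapers", []), published_in_1998)
]
only_1998 = [
    r["name"] for r in researchers
    if quantify("forAll", r.get("ResearchPapers", []), published_in_1998)
]
print(some_1998, only_1998)
```

In this sketch forAll over an empty collection is true and exists over an empty collection is false, following the usual logical convention; the paper does not state which behaviour is intended.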

4. Vocabulary-Specific Inference

RDF Query can be enriched by a number of vocabulary-specific abbreviations and inferences. The use of aliases for equivalent properties in different containers has already been discussed under "Aliasing" in "3.8 Composing Results". Another vocabulary-specific abbreviation is a single name for a chain of properties. For example "foreignPapers" may be used to refer to the chain ResearchPapers/Conference/Venue/Country where the value of the property is not U.S.A.

This is a fruitful area that needs more work.

5. A DTD For RDF Query

<!ENTITY % setOps "(rdfq:Union | rdfq:Intersection | rdfq:Difference)">

<!ELEMENT rdfq:rdfquery (%setOps; | rdfq:From)> 

<!ELEMENT rdfq:From (rdf:Bag?, rdfq:Select?, rdfq:Order?) >
<!ATTLIST rdfq:From
          eachResource CDATA #IMPLIED>

<!ELEMENT rdfq:Select
   ( (rdfq:Property | (rdfq:Condition, rdfq:Group?)), rdfq:Select? ) >
<!ATTLIST rdfq:Select
          properties NMTOKENS #IMPLIED>

<!-- One of name and path should be present. -->
<!ELEMENT rdfq:Property EMPTY>
<!ATTLIST rdfq:Property
          resource CDATA #IMPLIED
          name CDATA #IMPLIED
          path CDATA #IMPLIED
          var-ref NMTOKEN #IMPLIED
>

<!ELEMENT rdfq:Condition (rdfq:equals | rdfq:greaterThan | rdfq:lessThan |
rdfq:Quantifier | ... ) >

<!ELEMENT rdfq:Quantifier (rdfq:Property, rdfq:Condition) >
<!ATTLIST rdfq:Quantifier
          type (exists | forAll) #REQUIRED
          var NMTOKEN #IMPLIED>
<!ELEMENT rdfq:equals ANY>
<!ELEMENT rdfq:greaterThan ANY>
<!ELEMENT rdfq:lessThan ANY>
    ...

<!ELEMENT rdfq:Order (rdf:Seq | rdfq:Property)>

6. References

Brachman: R.J. Brachman and J.G. Schmolze, An Overview of the KL-ONE Knowledge Representation System. Cognitive Science, Vol. 9, No. 2, 1985, pp. 171-216.
Chen: P. P-S. Chen, The Entity-Relationship Model: Toward a Unified View of Data. ACM Transactions on Database Systems, Vol. 1, No. 1, 1976, pp. 9-36.
Codd: E.F. Codd, A Relational Model of Data for Large Shared Data Banks. Communications of the ACM, Vol. 13, No. 6, 1970, pp. 377-387
DCD: Document Content Description (DCD) Submission to W3C. See http://www.w3.org/TR/NOTE-dcd.
Findler: N.V. Findler (ed.), Associative Networks: Representation and Use of Knowledge by Computers, N.Y. Academic, 1979.
P3P: Platform for Privacy Preferences: P3P Project. See http://www.w3.org/P3P.
OQL: Object Query Language. See The Object Database Standard, ODMG 2.0, R.G.G. Cattell (ed.), Morgan Kaufmann, 1997.
SQL: SQL Standard. See http://www.jcc.com/sql_stnd.html.
RDF: RDF Model and Syntax. See http://www.w3.org/PICS/Member/NG/WD-rdf-syntax.
Unicode: Unicode Standard. See "The Unicode Standard, Version 2.0", Reading Mass., Addison-Wesley Developers Press, 1996
XML-Data: XML-Data. See http://www.w3.org/TR/1998/NOTE-XML-data-0105/.
XML Namespaces: Namespaces in XML. See http://www.w3.org/TR/WD-xml-names.
XQL Proposal: XQL Proposal from Microsoft.
jCentral: See http://www.ibm.com/java.
