This paper presents syntax and facilities for RDF Query. RDF Query is a declarative syntax for selecting RDF resources that meet specified criteria.
This is a technical contribution to the W3C Query Languages Workshop, Dec 3 and 4, 1998.
One of the major contributions of the Relational data model, introduced a little over 25 years ago by [Codd], was an accompanying declarative query language. In contrast to earlier query languages, [SQL] lets you specify what should be retrieved rather than how it should be retrieved. This freed the user from knowing how the data was stored. SQL also established the major components of a query language: a source over which the query executes; a selection condition; a specification of what is to be returned; and constructs to control how it should be presented.
[RDF] is the W3C's recommended framework for metadata. In fact, the underlying data model is a semantic network model that has surfaced several times, over the years, in slightly different forms for data and knowledge representation [Findler][Brachman][Chen]. In this paper we present a declarative query language for RDF. We hope this will be useful for other RDF-based efforts such as [P3P].
RDF Query needs to be more complex than SQL since the RDF data model is more complex than the Relational data model. Specifically, while a relational query executes over one or more tables, each containing tuples with the same structure, an RDF query executes over an RDF container that may contain resources of different types, each with different properties. Values of properties, rather than being merely data, can be resources themselves. Finally, property values can be RDF containers.
At IBM we use RDF in the "Grand Central" family of intelligent search engines. One example is [jCentral], which searches for Java-related resources. These search engines extract metadata about websites and encode it as a collection of RDF structures, which are then queried in response to user requests. The query facilities are implemented using a number of embedded Java functions. This paper is our attempt to abstract those facilities and encode them in an XML-style, declarative syntax.
An RDF Query (rdfquery) operates on a source container of resources and returns a result container of resources. The result container is always a subset of the source container and may be empty. Thus, RDF queries provide closure; they start with a container of RDF resources and they return a container of RDF resources. The result container can be the source for another RDF query.
Note that RDF queries are expressed using RDF descriptions (metadata). Since RDF has an XML syntax, and RDF Query is an RDF vocabulary, RDF Query also has an XML syntax. We use the name space rdfq for tag names used in RDF Query specifications.
RDF defines three kinds of resources and their descriptions:
An rdfquery must specify a source collection and may specify a property name or a condition. If neither a property name nor a condition is specified, the query returns all resources in the source collection. If a property name is specified, the query returns all resources in the source collection that have the named property. If a condition is specified, the query returns all resources in the source collection for which the condition evaluates to True. Instead of returning the qualifying resources themselves, rdfquery also allows us to create new resources from them with fewer properties. This enables us to build views on the source containers.
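The selection rules above can be sketched in Python. This is only an illustration of the semantics, not the paper's XML syntax: it assumes resources are modelled as dictionaries from property names to values, and the function `rdf_query`, the identifiers `p1` and `p2`, and the sample values are all hypothetical.

```python
from typing import Any, Callable, Dict, List, Optional

# Assumption: a resource is a dictionary mapping property names to values.
Resource = Dict[str, Any]


def rdf_query(
    source: List[Resource],
    property_name: Optional[str] = None,
    condition: Optional[Callable[[Resource], bool]] = None,
) -> List[Resource]:
    """Return the members of `source` that qualify, in source order."""
    result = []
    for resource in source:
        if property_name is not None and property_name not in resource:
            continue  # the resource lacks the named property
        if condition is not None and not condition(resource):
            continue  # the condition did not evaluate to True
        result.append(resource)
    return result


# Hypothetical sample data.
people = [
    {"id": "p1", "Project": "WebTechnologies", "age": 35},
    {"id": "p2", "age": 55},
]

everyone = rdf_query(people)                                # no criteria
with_project = rdf_query(people, property_name="Project")   # has the property
over_50 = rdf_query(people, condition=lambda r: r.get("age", 0) > 50)
```

Because the result is again a list of resources, it can be passed back in as the source of another query, which mirrors the closure property described earlier.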
The latest [RDF] specification says at the end of section 2.1: "Property values can be other resources or they can be atomic; that is, simple strings or other primitive datatypes defined by XML." As far as we know, XML defines no primitive datatypes other than strings, although there is ongoing work to do so; see [DCD]. We shall assume that primitive datatypes will be added to XML and RDF, and thus that we can define a richer set of query primitives than is possible with only strings. In this paper we use tags such as rdf:Integer to identify such datatypes.
Consider a collection of RDF resources. These resources may be identified by their URIs, IDs, or bagIDs. Since no property or condition is specified, the following query selects all resources in an explicitly specified collection of resources.
<rdfquery> |
If there is only one resource to be selected from, the rdf:Bag may be omitted and the resource specified directly as an attribute of the rdfq:From element.
In the more common case, the query will specify a container that contains the resources. We introduce an attribute of rdfq:From called eachResource (similar to aboutEach in RDF) to query all the resources in a container.
|
where the resources could be collected elsewhere in a container as
|
The following query selects all resources in the collection that have the property "ResearchPapers".
|
The following query selects all resources in the collection that have a property named Project whose value is the string "WebTechnologies". Here we assume the String data type is defined in RDF.
<rdfq:rdfquery> |
The value of the Project property may be a text string, or it may be a reified property with predicate equal to Project and object equal to WebTechnologies. The query will select instances where the property is represented in either of these forms.
If the resource does not have the given property, the condition evaluates to false.
Clearly, we need to be able to test conditions other than equals. The following query selects resources from a collection of people in IBM Research where the value of the age property is greater than 50.
<rdfq:rdfquery> |
Complex conditions can be specified by using boolean operators in the Select tag. The following query selects all people resources in IBM Research who report to departments where the budget is greater than a million dollars and the department size is less than 10 people.
<rdfq:rdfquery> |
If the value of the property is an RDF resource (inline or external), then the selection criterion may include constraints on the properties of the value resource. The following query selects all people in IBM Research who report to resources (departments) where the budget is over 1 million dollars. The syntax should be read as follows: find the resources in the source collection that have the BelongTo property; apply the condition to each of these resources; add those resources for which the condition returns True to the result set.
<rdfq:rdfquery> |
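A rough Python sketch of the reading given above, again assuming resources are modelled as dictionaries and that a resource-valued property simply holds another dictionary. The function `select_by_value_resource`, the department records, and the budget figures are hypothetical illustrations, not part of the rdfq vocabulary.

```python
from typing import Any, Callable, Dict, List

# Assumption: a resource is a dictionary mapping property names to values.
Resource = Dict[str, Any]


def select_by_value_resource(
    source: List[Resource],
    prop: str,
    condition: Callable[[Resource], bool],
) -> List[Resource]:
    """Keep resources whose `prop` value is a resource satisfying `condition`."""
    result = []
    for resource in source:
        value = resource.get(prop)
        # A missing property, or a value that is not a resource, fails the test.
        if isinstance(value, dict) and condition(value):
            result.append(resource)
    return result


# Hypothetical sample data.
research_dept = {"id": "dept-a", "budget": 2_000_000, "size": 8}
tools_dept = {"id": "dept-b", "budget": 500_000, "size": 25}
people = [
    {"id": "p1", "BelongTo": research_dept},
    {"id": "p2", "BelongTo": tools_dept},
    {"id": "p3"},  # has no BelongTo property
]

well_funded = select_by_value_resource(
    people, "BelongTo", lambda dept: dept.get("budget", 0) > 1_000_000
)
```

A compound condition such as "budget over a million and fewer than 10 people" would be an ordinary boolean expression inside the lambda.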
Note that for every navigation across a property a new rdfq:Property element needs to be introduced. An abbreviation syntax makes this less complex. This abbreviation is similar to a path-expression. One possible abbreviation is to use rdf:Seq. Suppose a property named "ResearchPaper" has a value which is an RDF description with a property named "Conference" whose value is an RDF description with a property named "Venue" whose value is an RDF description with a property named "Country". Now suppose that one wants to query, based upon all the publications that Neel has, those that were in foreign conferences. One could write this query, using the RDF abbreviation syntax, as:
<rdfq:rdfquery> |
An alternate linear syntax would use the / operator to indicate navigation over properties. This is similar to its use in [XQL Proposal]. Alternatively, the .. operator may be used to indicate navigation, as in [OQL]. Using the / operator, the navigation over ResearchPapers, Conference, Venue and Country would be expressed as ResearchPapers/Conference/Venue/Country. We introduce a new attribute for rdfq:Property called path to indicate a path expression. With this, the above example could be written as:
<rdfq:rdfquery> |
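To make the navigation concrete, here is a small Python sketch of how a /-separated path expression might be evaluated. It assumes resources are dictionaries and containers are lists; the helper `follow_path`, the paper titles, and the country values are hypothetical.

```python
from typing import Any, List


def follow_path(resource: Any, path: str) -> List[Any]:
    """Return every value reached by navigating `path` from `resource`."""
    current = [resource]
    for step in path.split("/"):
        reached = []
        for node in current:
            if not isinstance(node, dict) or step not in node:
                continue  # navigation stops where the property is absent
            value = node[step]
            # A container-valued property fans out to each of its members.
            reached.extend(value if isinstance(value, list) else [value])
        current = reached
    return current


# Hypothetical sample data.
neel = {
    "ResearchPapers": [
        {"title": "paper-1", "Conference": {"Venue": {"Country": "U.S.A"}}},
        {"title": "paper-2", "Conference": {"Venue": {"Country": "France"}}},
    ]
}

countries = follow_path(neel, "ResearchPapers/Conference/Venue/Country")

# Papers presented at foreign conferences, taking "foreign" to mean not U.S.A.
foreign = [
    paper["title"]
    for paper in neel["ResearchPapers"]
    if any(c != "U.S.A" for c in follow_path(paper, "Conference/Venue/Country"))
]
```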
Sometimes it is useful, on selecting a set of resources that satisfy a condition, to build proxy sets of resources with only some of the properties. This requires building an inline RDF description or container with only the selected properties. To describe this we introduce an attribute of the Select clause called properties. A longer syntax for the same function in RDF form can be achieved by adding a properties element (or RDF property) to rdfq:Select whose value is an rdf:Bag listing all the properties on which the projection is done. We prefer the abbreviated syntax. The following example selects, from a set of people in IBM Research, those whose project is "WebTechnologies" and builds a description with the properties "fullname" (name of the person) and "experience" (number of years in the project).
<rdfq:rdfquery> |
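The projection described above can be sketched as follows. This is a Python illustration of the semantics only, assuming dictionary-based resources; the function `select_properties` and the sample records are hypothetical.

```python
from typing import Any, Callable, Dict, List

# Assumption: a resource is a dictionary mapping property names to values.
Resource = Dict[str, Any]


def select_properties(
    source: List[Resource],
    condition: Callable[[Resource], bool],
    properties: List[str],
) -> List[Resource]:
    """Build new resources holding only `properties` of each qualifying one."""
    return [
        {name: resource[name] for name in properties if name in resource}
        for resource in source
        if condition(resource)
    ]


# Hypothetical sample data.
people = [
    {"fullname": "Person A", "Project": "WebTechnologies", "experience": 3, "age": 40},
    {"fullname": "Person B", "Project": "Databases", "experience": 7, "age": 52},
]

view = select_properties(
    people,
    condition=lambda r: r.get("Project") == "WebTechnologies",
    properties=["fullname", "experience"],
)
```

The resources in `view` are new objects rather than members of the source, which is what distinguishes a view from a plain selection.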
Aggregation operations can be applied to RDF objects that are containers. Specifically, we may want to evaluate an aggregate function on the result set of an rdfquery. To specify this we add an attribute called aggregate to the rdfq:Select element. The value of this attribute can only be the name of an aggregate function such as count, min, or max.
Let us modify the previous query to find the count of all papers written by Neel in foreign conferences. In the example, count is applied to all resources in the source container that satisfy the condition. The "*" means apply to all. In a later example, count is applied to a specific property.
<rdfq:rdfquery> |
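A minimal Python sketch of the aggregate step, assuming the query result is a list of dictionary-based resources. The function `aggregate`, its rule that only count accepts "*", and the sample papers are assumptions made for illustration.

```python
from typing import Any, Dict, List

# Assumption: a resource is a dictionary mapping property names to values.
Resource = Dict[str, Any]

AGGREGATES = {"count": len, "min": min, "max": max}


def aggregate(result: List[Resource], function: str, prop: str = "*") -> Any:
    """Apply a named aggregate to a result set, or to one property of it."""
    if prop == "*":
        if function != "count":
            # Assumption: whole resources have no ordering, so only count applies.
            raise ValueError("only count can be applied to '*'")
        return len(result)
    values = [resource[prop] for resource in result if prop in resource]
    return AGGREGATES[function](values)


# Hypothetical sample data: papers that already satisfied the condition.
foreign_papers = [
    {"title": "paper-1", "Year": 1997},
    {"title": "paper-2", "Year": 1998},
]

how_many = aggregate(foreign_papers, "count")
latest_year = aggregate(foreign_papers, "max", "Year")
```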
Sometimes we would like to perform algebraic set operations on the result sets. We support three kinds of operations: Union, Intersection, and Difference. For instance, the following query finds all papers written by Neel and all papers written by Ashok.
<rdfq:rdfquery> |
Duplicate elimination in the result RDF structures depends on the notion of equality of property values. If that is clearly specified, we can have the Union operator eliminate all duplicate occurrences of papers (this would prevent papers jointly authored by Neel and Ashok from occurring twice).
Set operations are allowed only on RDF structures that have the same set of properties, i.e. properties with the same names. In order to allow for equivalent properties with different names, aliases are supported. Suppose Ashok refers to his papers as "ResearchPapers" on his web site, Neel refers to his papers as "Publications", and we want to perform a union on these. We introduce an alias for the property name "Publications". We can also alias both property names and give them a new name, such as "articles", in the union.
<rdfq:rdfquery> |
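The interaction of aliases and Union can be sketched in Python. This assumes dictionary-based resources and, since the text leaves the notion of equality open, it assumes two resources are duplicates when all their properties and values are equal. The helpers `apply_aliases` and `union` and the paper titles are hypothetical.

```python
from typing import Any, Dict, List

# Assumption: a resource is a dictionary mapping property names to values.
Resource = Dict[str, Any]


def apply_aliases(resources: List[Resource], aliases: Dict[str, str]) -> List[Resource]:
    """Rename properties according to `aliases`; other names are kept."""
    return [
        {aliases.get(name, name): value for name, value in resource.items()}
        for resource in resources
    ]


def union(left: List[Resource], right: List[Resource]) -> List[Resource]:
    """Union with duplicates removed, using dictionary equality (an assumption)."""
    result: List[Resource] = []
    for resource in left + right:
        if resource not in result:
            result.append(resource)
    return result


# Hypothetical sample data: the same kind of property under two names.
ashok = [{"ResearchPapers": "paper-x"}, {"ResearchPapers": "joint-paper"}]
neel = [{"Publications": "paper-y"}, {"Publications": "joint-paper"}]

articles = union(
    apply_aliases(ashok, {"ResearchPapers": "articles"}),
    apply_aliases(neel, {"Publications": "articles"}),
)
```

Without the aliases the two inputs would have different property names, so the jointly authored paper would not be recognised as a duplicate.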
When doing aggregate operations one might want to build containers (collections) grouped by property values before performing aggregations. Suppose one would like to count the number of papers published by Neel each year. The result will be an RDF Description that describes "http://www.research.ibm.com/people/neel" as a container, with each member of the container describing two properties, "ResearchPapers/Conference/Year" and "count". This can be expressed as below.
<rdfq:rdfquery> |
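A Python sketch of grouping followed by count, assuming papers are dictionaries with a nested Conference resource. The function `count_by_year` and the sample papers are hypothetical; only the two result property names come from the text above.

```python
from collections import Counter
from typing import Any, Dict, List

# Assumption: a resource is a dictionary mapping property names to values.
Resource = Dict[str, Any]


def count_by_year(papers: List[Resource]) -> List[Resource]:
    """Group papers by Conference/Year and count the members of each group."""
    counts = Counter(
        paper["Conference"]["Year"]
        for paper in papers
        if "Year" in paper.get("Conference", {})
    )
    # One result member per group, carrying the two properties named in the text.
    return [
        {"ResearchPapers/Conference/Year": year, "count": n}
        for year, n in sorted(counts.items())
    ]


# Hypothetical sample data.
papers = [
    {"title": "paper-1", "Conference": {"Year": 1997}},
    {"title": "paper-2", "Conference": {"Year": 1998}},
    {"title": "paper-3", "Conference": {"Year": 1998}},
]

per_year = count_by_year(papers)
```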
Since the result of a query can be an RDF object, which could possibly be an RDF container, we may want to sort the results based upon property values and return an RDF sequence. We introduce a tag rdfq:Order for this purpose. To order by multiple properties, this tag contains an rdf:Seq giving the sequence in which the ordering is to be done. To select all papers written by Neel, ordered by the year of publication and, within the same year, by month, we would write:
<rdfq:rdfquery> |
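Ordering by a sequence of properties corresponds to sorting on a composite key. The Python sketch below assumes dictionary-based resources in which every resource has all the ordering properties; the function `order_by` and the sample papers are hypothetical.

```python
from typing import Any, Dict, List

# Assumption: a resource is a dictionary mapping property names to values.
Resource = Dict[str, Any]


def order_by(resources: List[Resource], properties: List[str]) -> List[Resource]:
    """Sort by the first property, breaking ties with each following one."""
    return sorted(resources, key=lambda r: tuple(r[p] for p in properties))


# Hypothetical sample data; months are numbers so that they sort correctly.
papers = [
    {"title": "paper-1", "Year": 1998, "Month": 6},
    {"title": "paper-2", "Year": 1997, "Month": 11},
    {"title": "paper-3", "Year": 1998, "Month": 2},
]

ordered = order_by(papers, ["Year", "Month"])
titles = [paper["title"] for paper in ordered]
```

The list passed as `properties` plays the role of the rdf:Seq: its order determines the priority of each ordering key.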
To support queries based on "for all" and "there exists" quantification, we introduce rdfq:quantifier together with the exists and forAll attributes. These are used to test whether any or all members of a collection meet a given condition. Since quantifiers are boolean conditions, they can occur anywhere a condition clause can occur. We also introduce a variable var which ranges over all the members of the collection. The body of the quantifier has two expression elements: the first evaluates to a collection; the second, which is an rdfq:Condition, evaluates a boolean condition over the first element. Thus, "for each x in S : cond(x)" translates to
|
The var can be referenced in the condition body using var-ref.
In the following example, we want to pick from a set of researchers those that have at least one publication in 1998. The first child element of the quantifier searches for a property "ResearchPapers", which returns a collection. The predicate (the second child element of the quantifier) is applied to each member of this collection by referring to the member through var-ref. In the example the predicate is applied to the "Year" property of each selected element and checks whether the value is "1998". In this simple example the var and var-ref may appear extraneous, but they are necessary to express nested queries and complex conditions with quantification.
<rdfq:rdfquery> |
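The two quantifiers map naturally onto Python's built-in `any` and `all`. The sketch below assumes dictionary-based resources; the helper names `exists` and `for_all`, the researcher identifiers, and the sample years are hypothetical. The lambda parameter plays the role of var, and its uses in the body play the role of var-ref.

```python
from typing import Any, Callable, Dict, Iterable, List

# Assumption: a resource is a dictionary mapping property names to values.
Resource = Dict[str, Any]


def exists(collection: Iterable[Any], condition: Callable[[Any], bool]) -> bool:
    """True if at least one member of the collection meets the condition."""
    return any(condition(member) for member in collection)


def for_all(collection: Iterable[Any], condition: Callable[[Any], bool]) -> bool:
    """True if every member of the collection meets the condition."""
    return all(condition(member) for member in collection)


# Hypothetical sample data.
researchers: List[Resource] = [
    {"id": "r1", "ResearchPapers": [{"Year": "1997"}, {"Year": "1998"}]},
    {"id": "r2", "ResearchPapers": [{"Year": "1996"}]},
    {"id": "r3", "ResearchPapers": [{"Year": "1998"}]},
]

# Researchers with at least one publication in 1998.
published_in_1998 = [
    r["id"]
    for r in researchers
    if exists(r.get("ResearchPapers", []), lambda paper: paper.get("Year") == "1998")
]

# Researchers whose publications are all from 1998.
only_1998 = [
    r["id"]
    for r in researchers
    if for_all(r.get("ResearchPapers", []), lambda paper: paper.get("Year") == "1998")
]
```

Note that `for_all` over an empty collection is True, which is the usual convention for universal quantification; the paper does not say which convention RDF Query would adopt.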
RDF Query can be enriched by a number of vocabulary-specific abbreviations and inferences. The use of alias for similar resources in different containers has already been discussed in "3.4 Complex conditions".
Another vocabulary-specific abbreviation is a single name for a chain of properties. For example, "foreignPapers" may be used to refer to the chain ResearchPapers/Conference/Venue/Country where the value of the property is not U.S.A.
This is a fruitful area that needs more work.
<!ENTITY % setOps "(rdfq:Union | rdfq:Intersection | rdfq:Difference)"> |