
18. Relational Database Design

The relational data model was introduced by Codd in 1970. It is the most widely
used data model, extended with the possibilities of the World Wide Web, because
of its simplicity and flexibility. The main idea of the relational model is that data
is organised in relational tables, where rows correspond to individual records and
columns to attributes. A relational schema consists of one or more relations and
their attribute sets. In the present chapter only schemata consisting of one relation
are considered for the sake of simplicity. In contrast to the mathematical concept
of relations, in a relational schema the order of the attributes is not important;
sets of attributes are always considered instead of lists. Every attribute has an
associated domain, that is, a set of elementary values that the attribute can take
values from. As an example, consider the following schema.
Employee(Name,Mother’s name,Social Security Number,Post,Salary)
The domain of attributes Name and Mother's name is the set of finite character
strings (more precisely its subset containing all possible names). The domain of
Social Security Number is the set of integers satisfying certain formal and parity
check requirements. The attribute Post can take values from the set {Director,
Section chief, System integrator, Programmer, Receptionist, Janitor, Handyman}.
An instance of a schema R is a relation r if its columns correspond to the
attributes of R and its rows contain values from the domains of attributes at the
attributes' positions. A typical row of a relation of the Employee schema could be
(John Brown,Camille Parker,184-83-2010,Programmer,$172,000)
There can be dependencies between different data of a relation. For example, in an
instance of the Employee schema the value of Social Security Number determines
all other values of a row. Similarly, the pair (Name,Mother's name) is a unique
identifier. Naturally, it may occur that a set of attributes does not determine all
attributes of a record uniquely, only some of them.
A relational schema has several integrity constraints attached. The most im-
portant kind of these is functional dependency. Let U and V be two sets of
attributes. V functionally depends on U , U → V in notation, means that when-
ever two records are identical in the attributes belonging to U , then they must agree
in the attributes belonging to V , as well. Throughout this chapter the attribute set
{A1 , A2 , . . . , Ak } is denoted by A1 A2 . . . Ak for the sake of convenience.
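The definition suggests a direct test on a concrete instance: two rows that agree on U must agree on V. A minimal Python sketch follows (rows are represented as dicts mapping attribute names to values; the representation, function name, and sample data are illustrative, not from the text):

```python
def satisfies(rows, U, V):
    """Return True iff the functional dependency U -> V holds in `rows`.

    `rows` is a list of dicts mapping attribute names to values;
    U and V are collections of attribute names.
    """
    seen = {}  # maps the U-projection of a row to its V-projection
    for row in rows:
        u_val = tuple(row[a] for a in sorted(U))
        v_val = tuple(row[a] for a in sorted(V))
        if seen.setdefault(u_val, v_val) != v_val:
            return False  # two rows agree on U but differ on V
    return True
```

For instance, on an Employee instance, satisfies(rows, {'SSN'}, {'Name', 'Post'}) tests whether the Social Security Number determines name and post.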

Example 18.1 Functional dependencies Consider the schema

R(Professor,Subject,Room,Student,Grade,Time) .

The meaning of an individual record is that a given student got a given grade of a given
subject that was taught by a given professor at a given time slot. The following functional
dependencies are satisfied.
Su→P: One subject is taught by one professor.
PT→R: A professor teaches in one room at a time.
StT→R: A student attends a lecture in one room at a time.
StT→Su: A student attends a lecture of one subject at a time.
SuSt→G: A student receives a unique final grade of a subject.

In Example 18.1 the attribute set StT uniquely determines the values of all other
attributes; furthermore, it is a minimal such set with respect to containment. Attribute
sets of this kind are called keys. If all attributes are functionally dependent on a set of
attributes X, then X is called a superkey. It is clear that every superkey contains
a key and that any set of attributes containing a superkey is also a superkey.

18.1. Functional dependencies


Some functional dependencies valid for a given relational schema are known already
in the design phase, others are consequences of these. The StT→P dependency
is implied by the StT→Su and Su→P dependencies in Example 18.1. Indeed, if
two records agree on attributes St and T, then they must have the same value in
attribute Su. Agreeing in Su and Su→P implies that the two records agree in P,
as well, thus StT→P holds.

Definition 18.1 Let R be a relational schema, F be a set of functional depen-
dencies over R. The functional dependency U → V is logically implied by F , in
notation F |= U → V , if each instance of R that satisfies all dependencies of F also
satisfies U → V . The closure of a set F of functional dependencies is the set F +
given by
F + = {U → V : F |= U → V } .

18.1.1. Armstrong-axioms
In order to determine keys, or to understand logical implication between functional
dependencies, it is necessary to know the closure F + of a set F of functional depen-
dencies, or for a given X → Z dependency it must be decidable whether it belongs
to F + . For this, inference rules are needed that tell which functional dependencies
follow from a given set. The Armstrong-axioms form a system of sound and
complete inference rules. A system of rules is sound if only valid functional
dependencies can be derived using it. It is complete if every dependency
X → Z that is logically implied by the set F is derivable from F using the inference
rules.

Armstrong-axioms
(A1) Reflexivity Y ⊆ X ⊆ R implies X → Y .
(A2) Augmentation If X → Y , then for arbitrary Z ⊆ R, XZ → Y Z holds.
(A3) Transitivity If X → Y and Y → Z hold, then X → Z holds, as well.

Example 18.2 Derivation by the Armstrong-axioms. Let R = ABCD and F = {A →
C, B → D}; then AB is a key:
1. A → C is given.
2. AB → ABC, augmenting 1. by (A2) with AB.
3. B → D is given.
4. ABC → ABCD, augmenting 3. by (A2) with ABC.
5. AB → ABCD, applying transitivity (A3) to 2. and 4.
Thus it is shown that AB is a superkey. That it is really a key follows from algorithm
Closure(R, F, X).

There are other valid inference rules besides (A1)–(A3). The next lemma lists some;
the proof is left to the Reader (Exercise 18.1-5).
Lemma 18.2
1. Union rule {X → Y, X → Z} |= X → Y Z.
2. Pseudo transitivity {X → Y, W Y → Z} |= XW → Y Z.
3. Decomposition If X → Y holds and Z ⊆ Y , then X → Z holds, as well.

The soundness of system (A1)–(A3) can be proven by easy induction on the length
of the derivation. The completeness will follow from the proof of correctness of
algorithm Closure(R, F, X) by the following lemma. Let X + denote the closure
of the set of attributes X ⊆ R with respect to the family of functional dependencies
F , that is X + = {A ∈ R : X → A follows from F by the Armstrong-axioms}.
Lemma 18.3 The functional dependency X → Y follows from the family of func-
tional dependencies F by the Armstrong-axioms iff Y ⊆ X + .
Proof Let Y = A1 A2 . . . An where Ai ’s are attributes, and assume that Y ⊆ X + .
X → Ai follows by the Armstrong-axioms for all i by the definition of X + . Applying
the union rule of Lemma 18.2 X → Y follows. On the other hand, assume that
X → Y can be derived by the Armstrong-axioms. By the decomposition rule of
Lemma 18.2 X → Ai follows by (A1)–(A3) for all i. Thus, Y ⊆ X + .

18.1.2. Closures
Calculation of closures is important in testing equivalence or logical implication
between systems of functional dependencies. The first idea could be that for a given
family F of functional dependencies, in order to decide whether F |= {X → Y }, it is
enough to calculate F + and check whether {X → Y } ∈ F + holds. However, the size
of F + could be exponential in the size of the input. Consider the family F of functional
dependencies given by

F = {A → B1 , A → B2 , . . . , A → Bn } .

F + consists of all functional dependencies of the form A → Y , where Y ⊆
{B1 , B2 , . . . , Bn }, thus |F + | = 2^n . Nevertheless, the closure X + of an attribute
set X with respect to F can be determined in time linear in the total length of func-
tional dependencies in F . The following is an algorithm that calculates the closure
X + of an attribute set X with respect to F . The input consists of the schema R,
that is, a finite set of attributes, a set F of functional dependencies defined over R,
and an attribute set X ⊆ R.

Closure(R,F ,X)
1 X (0) ← X
2 i←0
3 G←F  Functional dependencies not used yet.
4 repeat
5 X (i+1) ← X (i)
6 for all Y → Z in G
7 do if Y ⊆ X (i)
8 then X (i+1) ← X (i+1) ∪ Z
9 G ← G \ {Y → Z}
10 i←i+1
11 until X (i−1) = X (i)
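The pseudocode translates almost line by line into Python. In this sketch a dependency Y → Z is assumed to be given as a pair (Y, Z) of attribute sets, and the repeat–until loop becomes a while loop over the not-yet-used dependencies:

```python
def closure(X, F):
    """Compute the closure X+ of attribute set X with respect to F.

    F is a list of pairs (Y, Z) of attribute sets, standing for Y -> Z.
    """
    result = set(X)
    unused = list(F)                 # dependencies not used yet (the set G)
    changed = True
    while changed:
        changed = False
        for Y, Z in list(unused):
            if Y <= result:          # left-hand side already in the closure
                if not Z <= result:
                    result |= Z
                    changed = True
                unused.remove((Y, Z))
    return result
```

On Example 18.2, closure({'A', 'B'}, [({'A'}, {'C'}), ({'B'}, {'D'})]) returns all of ABCD, confirming that AB is a superkey.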

It is easy to see that the attributes that are put into any of the X (j) ’s by
Closure(R,F ,X) really belong to X + . The harder part of the correctness proof of
this algorithm is to show that each attribute belonging to X + will be put into some
of the X (j) ’s.

Theorem 18.4 Closure(R,F ,X) correctly calculates X + .

Proof First we prove by induction that if an attribute A is put into an X (j) during
Closure(R,F ,X), then A really belongs to X + .
Base case: j = 0. In this case A ∈ X and by reflexivity (A1) A ∈ X + .
Induction step: Let j > 0 and assume that X (j−1) ⊆ X + . A is put into X (j) because
there is a functional dependency Y → Z in F , where Y ⊆ X (j−1) and A ∈ Z. By
induction, Y ⊆ X + holds, which implies using Lemma 18.3 that X → Y holds, as
well. By transitivity (A3), X → Y and Y → Z imply X → Z. By reflexivity (A1)
and A ∈ Z, Z → A holds. Applying transitivity again, X → A is obtained, that is,
A ∈ X +.
On the other hand, we show that if A ∈ X + , then A is contained in the result
of Closure(R,F ,X). Suppose, to the contrary, that A ∈ X + but A ∉ X (i) , where
X (i) is the result of Closure(R,F ,X). By the stop condition in line 11 this means

X (i) = X (i+1) . An instance r of the schema R is constructed that satisfies every
functional dependency of F , but X → A does not hold in r if A ∉ X (i) . Let r be the
following two-rowed relation:

        attributes of X (i)    other attributes
        1 1 ... 1              1 1 ... 1
        1 1 ... 1              0 0 ... 0

Let us suppose that the above r violates a U → V functional dependency of F , that
is, U ⊆ X (i) , but V is not a subset of X (i) . However, in this case Closure(R,F ,X)
could not have stopped yet, since X (i) ≠ X (i+1) .
A ∈ X + implies using Lemma 18.3 that X → A follows from F by the
Armstrong-axioms. (A1)–(A3) is a sound system of inference rules, hence in ev-
ery instance that satisfies F , X → A must hold. However, the only way this could
happen in instance r is if A ∈ X (i) .

Let us observe that the relation instance r given in the proof above provides
the completeness proof for the Armstrong-axioms, as well. Indeed, the closure X +
calculated by Closure(R,F ,X) is the set of those attributes for which X → A
follows from F by the Armstrong-axioms. Meanwhile, for every other attribute B,
there exist two rows of r that agree on X, but differ in B, that is F |= X → B does
not hold.
The running time of Closure(R,F ,X) is O(n^2 ), where n is the length of
the input. Indeed, in the repeat – until loop of lines 4–11 every not yet used
dependency is checked, and the body of the loop is executed at most |R \ X| + 1
times, since it is started again only if X (i−1) ≠ X (i) , that is, a new attribute is
added to the closure of X. However, the running time can be reduced to linear with
appropriate bookkeeping.

1. For every yet unused W → Z dependency of F , it is kept track of how many
attributes of W are not yet included in the closure (i[W, Z]).

2. For every attribute A, those yet unused dependencies are kept in a doubly
linked list LA whose left side contains A.

3. Those not yet used dependencies W → Z are kept in a linked list J whose left
side W 's every attribute is contained in the closure already, that is, for which
i[W, Z] = 0.

It is assumed that the family of functional dependencies F is given as a set of
attribute pairs (W, Z), representing W → Z. The Linear-Closure(R,F ,X) algo-
rithm is a modification of Closure(R,F ,X) using the above bookkeeping, whose
running time is linear. R is the schema, F is the given family of functional depen-
dencies, and we are to determine the closure of attribute set X.

Algorithm Linear-Closure(R,F ,X) consists of two parts. In the initialisation
phase (lines 1–13) the lists are initialised. The loops of lines 2–5 and 6–8, respectively,
take O(∑_{(W,Z)∈F} |W |) time. The loop in lines 9–11 means O(|F |) steps. If the length
of the input is denoted by n, then this is O(n) steps altogether.

During the execution of lines 14–23, every functional dependency (W, Z) is ex-
amined at most once, when it is taken off from list J. Thus, lines 15–16 and 23 take
at most |F | steps. The running time of the loops in lines 17–22 can be estimated by
observing that the sum ∑ i[W, Z] is decreased by one in each execution, hence it
takes O(∑ i0 [W, Z]) steps, where i0 [W, Z] is the i[W, Z] value obtained in the ini-
tialisation phase. However, ∑ i0 [W, Z] ≤ ∑_{(W,Z)∈F} |W |, thus lines 14–23 also take
O(n) time in total.

Linear-Closure(R, F, X)
1  Initialisation phase.
2 for all (W, Z) ∈ F
3 do for all A ∈ W
4 do add (W, Z) to list LA
5 i[W, Z] ← 0
6 for all A ∈ R \ X
7 do for all (W, Z) of list LA
8 do i[W, Z] ← i[W, Z] + 1
9 for all (W, Z) ∈ F
10 do if i[W, Z] = 0
11 then add (W, Z) to list J
12 X+ ← X
13  End of initialisation phase.
14 while J is nonempty
15 do (W, Z) ← head(J)
16 delete (W, Z) from list J
17 for all A ∈ Z \ X +
18 do for all (W, Z) of list LA
19 do i[W, Z] ← i[W, Z] − 1
20 if i[W, Z] = 0
21 then add (W, Z) to list J
22 delete (W, Z) from list LA
23 X+ ← X+ ∪ Z
24 return X +
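The same bookkeeping can be sketched in Python without explicit doubly linked lists: a counter per dependency plays the role of i[W, Z], a map from attributes to dependency indices plays the role of the lists LA, and a FIFO queue plays the role of J (dependencies are again assumed to be pairs of attribute sets):

```python
from collections import deque

def linear_closure(X, F):
    """Linear-time closure of X w.r.t. F, given as a list of (W, Z) pairs."""
    count = []     # count[d] = number of attributes of W_d not yet in the closure
    by_attr = {}   # attribute A -> indices of dependencies with A on the left side
    for d, (W, Z) in enumerate(F):
        missing = [A for A in W if A not in X]
        count.append(len(missing))
        for A in missing:
            by_attr.setdefault(A, []).append(d)
    ready = deque(d for d, c in enumerate(count) if c == 0)  # the list J
    result = set(X)
    while ready:
        W, Z = F[ready.popleft()]
        for A in Z:
            if A not in result:
                result.add(A)
                for d in by_attr.get(A, []):   # A just entered the closure
                    count[d] -= 1
                    if count[d] == 0:
                        ready.append(d)
    return result
```

Each dependency is enqueued and dequeued at most once, and each counter decrement is charged to one attribute occurrence on a left side, matching the O(n) analysis above.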

18.1.3. Minimal cover

Algorithm Linear-Closure(R, F, X) can be used to test equivalence of systems of
dependencies. Let F and G be two families of functional dependencies. F and G are
said to be equivalent if exactly the same functional dependencies follow from both,
that is, F + = G+ . It is clear that it is enough to check for all functional dependencies
X → Y in F whether it belongs to G+ , and vice versa, for all W → Z in G, whether
it is in F + . Indeed, if one of these is not satisfied, say X → Y is not in G+ ,
then surely F + ≠ G+ . On the other hand, if all X → Y are in G+ , then for any
functional dependency U → V in F + a derivation from G can be obtained by
concatenating the derivations of the dependencies X → Y of F from G with the
derivation of U → V from F . In order to decide whether a dependency X → Y
from F is in G+ , it is enough to construct the closure X + (G) of attribute set X
with respect to G using Linear-Closure(R,G,X), then check whether Y ⊆ X + (G)
holds. The following special functional dependency system equivalent with F is useful.
Definition 18.5 The system of functional dependencies G is a minimal cover of
the family of functional dependencies F iff G is equivalent with F , and
1. functional dependencies of G are in the form X → A, where A is an attribute
and A ∉ X,
2. no functional dependency can be dropped from G, i.e., (G − {X → A})+ ⊊ G+ ,
3. the left sides of dependencies in G are minimal, that is, X → A ∈ G, Y ⊊ X
=⇒ ((G − {X → A}) ∪ {Y → A})+ ≠ G+ .

Every set of functional dependencies has a minimal cover; namely, algorithm
Minimal-Cover(R, F ) constructs one.

Minimal-Cover(R, F )
1 G←∅
2 for all X → Y ∈ F
3 do for all A ∈ Y − X
4 do G ← G ∪ X → A
5  Each right hand side consists of a single attribute.
6 for all X → A ∈ G
7 do while there exists B ∈ X : A ∈ (X − B)+ (G)
8 X ←X −B
9  Each left hand side is minimal.
10 for all X → A ∈ G
11 do if A ∈ X + (G − {X → A})
12 then G ← G − {X → A}
13  No redundant dependency exists.

After executing the loop of lines 2–4, the right hand side of each dependency in
G consists of a single attribute. The equality G+ = F + follows from the union rule
of Lemma 18.2 and the reflexivity axiom. Lines 6–8 minimise the left hand sides. In
line 11 it is checked whether a given functional dependency of G can be removed
without changing the closure. X + (G − {X → A}) is the closure of attribute set X
with respect to the family of functional dependencies G − {X → A}.
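The three phases of Minimal-Cover can be sketched in Python on top of a closure routine (repeated here so the sketch is self-contained; dependencies are represented as pairs of frozensets, an assumed convention). The left-hand-side minimisation uses a single pass per dependency, which suffices because the closure is monotone in its first argument:

```python
def closure(X, F):
    result = set(X)
    changed = True
    while changed:
        changed = False
        for Y, Z in F:
            if Y <= result and not Z <= result:
                result |= Z
                changed = True
    return result

def minimal_cover(F):
    """Compute a minimal cover of F, a list of (X, Y) pairs of frozensets."""
    # Phase 1: split right-hand sides into single attributes.
    G = [(frozenset(X), frozenset({A})) for X, Y in F for A in Y - X]
    # Phase 2: minimise left-hand sides.
    H = []
    for X, A in G:
        for B in sorted(X):
            if A <= closure(X - {B}, G):   # B is superfluous on the left
                X = X - {B}
        H.append((X, A))
    # Phase 3: drop redundant dependencies.
    result = list(H)
    for dep in H:
        rest = [d for d in result if d is not dep]
        X, A = dep
        if A <= closure(X, rest):          # derivable from the others
            result = rest
    return result
```

For F = {A → BC, B → C, AB → C} this yields {A → B, B → C}: the left side of AB → C shrinks to B, and the dependencies A → C and the duplicate B → C are then removed as redundant.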
Claim 18.6 Minimal-Cover(R, F ) calculates a minimal cover of F .
Proof It is enough to show that during execution of the loop in lines 10–12, no
functional dependency X → A is generated whose left hand side could be decreased.
Indeed, if an X → A dependency existed such that for some Y ⊊ X the dependency
Y → A ∈ G+ held, then Y → A ∈ G′+ would also hold, where G′ is the set of
dependencies considered when X → A is checked in lines 6–8. Since G ⊆ G′, we have
G+ ⊆ G′+ (see Exercise 18.1-1). Thus, X should have been decreased already during
execution of the loop in lines 6–8.

18.1.4. Keys
In database design it is important to identify those attribute sets that uniquely
determine the data in individual records.

Definition 18.7 Let (R, F ) be a relational schema. The set of attributes X ⊆ R is
called a superkey if X → R ∈ F + . A superkey X is called a key if it is minimal
with respect to containment, that is, no proper subset Y ⊊ X is a key.

The question is how the keys can be determined from (R, F ). What makes this
problem hard is that the number of keys can be a superexponential function of the
size of (R, F ). In particular, Yu and Johnson constructed a relational schema
where |F | = k, but the number of keys is k!. Békéssy and Demetrovics gave a
beautiful and simple proof of the fact that starting from k functional dependencies,
at most k! keys can be obtained. (This was independently proved by Osborne and
Tompa.)
The proof of Békéssy and Demetrovics is based on the operation ∗ they intro-
duced, which is defined for functional dependencies.

Definition 18.8 Let e1 = U → V and e2 = X → Y be two functional dependen-
cies. The binary operation ∗ is defined by

e1 ∗ e2 = U ∪ ((R − V ) ∩ X) → V ∪ Y .

Some properties of operation ∗ are listed; the proof is left to the Reader (Exer-
cise 18.1-3). Operation ∗ is associative; furthermore, it is idempotent in the sense
that if e = e1 ∗ e2 ∗ · · · ∗ ek and e′ = e ∗ ei for some 1 ≤ i ≤ k, then e′ = e.
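The operation is a one-liner once dependencies are represented as pairs of attribute sets over a fixed universe R (a sketch; the function name star is of course not from the text):

```python
def star(e1, e2, R):
    """Bekessy-Demetrovics composition of e1 = U -> V and e2 = X -> Y:
    e1 * e2 = U ∪ ((R − V) ∩ X) -> V ∪ Y, over attribute universe R."""
    (U, V), (X, Y) = e1, e2
    return (U | ((R - V) & X), V | Y)
```

The second assertion below checks associativity on one concrete triple; it is an illustration, not a proof.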

Claim 18.9 (Békéssy and Demetrovics). Let (R, F ) be a relational schema and let
F = {e1 , e2 , . . . , ek } be a listing of the functional dependencies. If X is a key, then
X → R = eπ1 ∗ eπ2 ∗ . . . ∗ eπs ∗ d, where (π1 , π2 , . . . , πs ) is an ordered subset of the
index set {1, 2, . . . , k}, and d is a trivial dependency in the form D → D.

Claim 18.9 bounds in some sense the possible sets of attributes in the search
for keys. The next claim gives lower and upper bounds for the keys.

Claim 18.10 Let (R, F ) be a relational schema and let F = {Ui → Vi : 1 ≤ i ≤ k}.
Let us assume without loss of generality that Ui ∩ Vi = ∅. Let U = ∪_{i=1}^k Ui and
V = ∪_{i=1}^k Vi . If K is a key in the schema (R, F ), then

HL = R − V ⊆ K ⊆ (R − V) ∪ U = HU .

The proof is not too hard; it is left as an exercise for the Reader (Exercise 18.1-4).
The algorithm List-Keys(R, F ) that lists the keys of the schema (R, F ) is based on
the bounds of Claim 18.10. The running time can be bounded by O(n!), but
one cannot expect any better, since listing the output needs that much time in the
worst case.

List-Keys(R, F )
1  Let U and V be as defined in Claim 18.10
2 if U ∩ V = ∅
3 then return R − V
4  R − V is the only key.
5 if (R − V)+ = R
6 then return R − V
7  R − V is the only key.
8 K←∅
9 for all permutations A1 , A2 , . . . Ah of the attributes of U ∩ V
10 do K ← (R − V) ∪ U
11 for i ← 1 to h
12 do Z ← K − Ai
13 if Z + = R
14 then K ← Z
15 K ← K ∪ {K}
16 return K
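A Python sketch of List-Keys follows (with the closure helper repeated for self-containment; dependencies are pairs of attribute sets with disjoint sides, as in Claim 18.10). Since different permutations can shrink HU to the same key, the results are collected in a set of frozensets:

```python
from itertools import permutations

def closure(X, F):
    result = set(X)
    changed = True
    while changed:
        changed = False
        for Y, Z in F:
            if Y <= result and not Z <= result:
                result |= Z
                changed = True
    return result

def list_keys(R, F):
    """List all keys of schema (R, F); F is a list of (U_i, V_i) pairs
    with U_i and V_i disjoint."""
    U = set().union(*[Ui for Ui, Vi in F])
    V = set().union(*[Vi for Ui, Vi in F])
    base = R - V                       # lower bound H_L, contained in every key
    if not (U & V) or closure(base, F) == R:
        return {frozenset(base)}       # R - V is the only key
    keys = set()
    for perm in permutations(sorted(U & V)):
        K = base | U                   # upper bound H_U
        for A in perm:                 # shrink while K stays a superkey
            if closure(K - {A}, F) == R:
                K = K - {A}
        keys.add(frozenset(K))
    return keys
```

For R = ABC with F = {A → B, B → A}, the two permutations of {A, B} yield the two keys AC and BC.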

Exercises
18.1-1 Let R be a relational schema and let F and G be families of functional
dependencies over R. Show that
a. F ⊆ F + .
b. (F + )+ = F + .
c. If F ⊆ G, then F + ⊆ G+ .
Formulate and prove similar properties of the closure X + – with respect to F –
of an attribute set X.
18.1-2 Derive the functional dependency AB → F from the set of dependencies
G = {AB → C, A → D, CD → EF } using Armstrong-axioms (A1)–(A3).
18.1-3 Show that operation ∗ is associative, furthermore if for functional depen-
dencies e1 , e2 , . . . , ek we have e = e1 ∗ e2 ∗ · · · ∗ ek and e0 = e ∗ ei for some 1 ≤ i ≤ k,
then e0 = e.
18.1-4 Prove Claim 18.10.
18.1-5 Prove the union, pseudo transitivity and decomposition rules of Lemma
18.2.

18.2. Decomposition of relational schemata

A decomposition of a relational schema R = {A1 , A2 , . . . , An } is a collection ρ =
{R1 , R2 , . . . , Rk } of subsets of R such that

R = R1 ∪ R2 ∪ · · · ∪ Rk .

The Ri 's need not be disjoint; in fact, in most applications they must not be. One
important motivation of decompositions is to avoid anomalies.

Example 18.3 Anomalies Consider the following schema

SUPPLIER-INFO(SNAME,ADDRESS,ITEM,PRICE)

This schema encompasses the following problems:

1. Redundancy. The address of a supplier is recorded with every item it supplies.
2. Possible inconsistency (update anomaly). As a consequence of redundancy, the
address of a supplier might be updated in some records and not in others, hence the
supplier would not have a unique address, even though it is expected to have one.
3. Insertion anomaly. The address of a supplier cannot be recorded if it does not
supply anything at the moment. One could try to use NULL values in attributes ITEM
and PRICE, but would it be remembered that they must be deleted when a supplied
item is entered for that supplier? A more serious problem is that SNAME and ITEM
together form a key of the schema, and the NULL values could make it impossible to
search by an index based on that key.
4. Deletion anomaly. This is the opposite of the above. If all items supplied by a
supplier are deleted, then as a side effect the address of the supplier is also lost.

All problems mentioned above are eliminated if schema SUPPLIER-INFO is replaced by
two sub-schemata:
SUPPLIER(SNAME,ADDRESS),
SUPPLIES(SNAME,ITEM,PRICE).
In this case each supplier's address is recorded only once, and it is not necessary that the
supplier supply an item in order for its address to be recorded. For the sake of convenience
the attributes are denoted by single characters: S (SNAME), A (ADDRESS), I (ITEM), P
(PRICE).

The question is whether it is correct to replace the schema SAIP by SA and SIP . Let
r be an instance of schema SAIP . It is natural to require that if SA and SIP
are used, then the relations belonging to them are obtained by projecting r to SA and
SIP , respectively, that is, rSA = πSA (r) and rSIP = πSIP (r). rSA and rSIP contain
the same information as r if r can be reconstructed using only rSA and rSIP . The
calculation of r from rSA and rSIP can be done by the natural join operator.
Definition 18.11 The natural join of relations ri of schemata Ri (i = 1, 2, . . . , n)
is the relation s belonging to the schema ∪_{i=1}^n Ri , which consists of all rows µ such
that for all i there exists a row νi of relation ri with µ[Ri ] = νi [Ri ]. In notation,
s = ⋈_{i=1}^n ri .

Example 18.4 Let R1 = AB, R2 = BC, r1 = {ab, a′ b′ , ab′′ } and r2 = {bc, bc′ , b′ c′′ }. The
natural join of r1 and r2 belongs to the schema R = ABC, and it is the relation r1 ⋈ r2 =
{abc, abc′ , a′ b′ c′′ }.
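For relations stored as lists of attribute-to-value dicts, the natural join can be sketched as a nested-loop join; this quadratic version is only meant to illustrate the definition:

```python
def natural_join(r1, r2):
    """Natural join of two relations given as lists of attribute->value dicts."""
    common = set(r1[0]) & set(r2[0]) if r1 and r2 else set()
    result = []
    for row1 in r1:
        for row2 in r2:
            # keep only pairs that agree on the common attributes
            if all(row1[a] == row2[a] for a in common):
                result.append({**row1, **row2})
    return result
```

Run on Example 18.4, it produces exactly the three rows abc, abc′ and a′b′c′′.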

If s is the natural join of rSA and rSIP , that is, s = rSA ⋈ rSIP , then πSA (s) =
rSA and πSIP (s) = rSIP by Lemma 18.12. If r ≠ s, then the original relation could
not be reconstructed knowing only rSA and rSIP .

18.2.1. Lossless join

Let ρ = {R1 , R2 , . . . , Rk } be a decomposition of schema R, furthermore let F be
a family of functional dependencies over R. The decomposition ρ is said to have the
lossless join property (with respect to F ) if every instance r of R that satisfies
F also satisfies
r = πR1 (r) ⋈ πR2 (r) ⋈ · · · ⋈ πRk (r) .
That is, relation r is the natural join of its projections to attribute sets Ri , i =
1, 2, . . . , k. For a decomposition ρ = {R1 , R2 , . . . , Rk }, let mρ denote the mapping
which assigns to relation r the relation mρ (r) = ⋈_{i=1}^k πRi (r). Thus, the lossless join
property with respect to a family of functional dependencies means that r = mρ (r)
for all instances r that satisfy F .

Lemma 18.12 Let ρ = {R1 , R2 , . . . , Rk } be a decomposition of schema R, and let
r be an arbitrary instance of R. Furthermore, let ri = πRi (r). Then

1. r ⊆ mρ (r).

2. If s = mρ (r), then πRi (s) = ri .

3. mρ (mρ (r)) = mρ (r).

The proof is left to the Reader (Exercise 18.2-7).

18.2.2. Checking the lossless join property

It is relatively easy to check whether a decomposition ρ = {R1 , R2 , . . . , Rk } of
schema R has the lossless join property. The essence of algorithm Join-Test(R, F, ρ)
is the following.
A k × n array T is constructed, whose column j corresponds to attribute Aj , while
row i corresponds to schema Ri . T [i, j] = 0 if Aj ∈ Ri , otherwise T [i, j] = i.
The following step is repeated until there is no more possible change in the array.
Consider a functional dependency X → Y from F . If a pair of rows i and j agree
in all attributes of X, then their values in attributes of Y are made equal. More
precisely, if one of the values in an attribute of Y is 0, then the other one is set to
0, as well, otherwise it is arbitrary which of the two values is set to be equal to the
other one. If a symbol is changed, then each of its occurrences in that column must
be changed accordingly. If at the end of this process there is an all 0 row in T , then
the decomposition has the lossless join property, otherwise, it is lossy.

Join-Test(R, F, ρ)
1  Initialisation phase.
2 for i ← 1 to |ρ|
3 do for j ← 1 to |R|
4 do if Aj ∈ Ri
5 then T [i, j] ← 0
6 else T [i, j] ← i
7  End of initialisation phase.
8 S←T
9 repeat
10 T ←S
11 for all {X → Y } ∈ F
12 do for i ← 1 to |ρ| − 1
13 do for j ← i + 1 to |ρ|
14 do if for all Ah in X (S[i, h] = S[j, h])
15 then Equate(i, j, S, Y )
16 until S = T
17 if there exist an all 0 row in S
18 then return “Lossless join”
19 else return “Lossy join”

Procedure Equate(i, j, S, Y ) makes the appropriate symbols equal.

Equate(i, j, S, Y )
1 for Al ∈ Y
2 do if S[i, l] · S[j, l] = 0
3 then
4 for d ← 1 to k
5 do if S[d, l] = S[i, l] ∨ S[d, l] = S[j, l]
6 then S[d, l] ← 0
7 else
8 for d ← 1 to k
9 do if S[d, l] = S[j, l]
10 then S[d, l] ← S[i, l]
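Join-Test and Equate can be condensed into one Python function; equating two symbols in a column is done by replacing every occurrence of both with the merged symbol, preferring 0. Row symbols are i + 1 rather than i so that 0 stays distinguished (a sketch over assumed input conventions):

```python
def join_test(R, F, rho):
    """Return True iff decomposition rho of schema R has the lossless join
    property w.r.t. F.  R is a list of attributes, rho a list of attribute
    sets, F a list of (X, Y) pairs of attribute sets."""
    col = {A: j for j, A in enumerate(R)}
    # T[i][j] = 0 if attribute R[j] belongs to rho[i], else the row symbol i+1.
    T = [[0 if A in Ri else i + 1 for A in R] for i, Ri in enumerate(rho)]
    changed = True
    while changed:
        changed = False
        for X, Y in F:
            for i in range(len(rho)):
                for j in range(i + 1, len(rho)):
                    if all(T[i][col[A]] == T[j][col[A]] for A in X):
                        for A in Y:                 # rows i, j agree on X
                            l = col[A]
                            a, b = T[i][l], T[j][l]
                            if a != b:
                                new = 0 if 0 in (a, b) else a
                                for row in T:       # equate a and b in column l
                                    if row[l] in (a, b):
                                        row[l] = new
                                changed = True
    return any(all(v == 0 for v in row) for row in T)
```

On the data of Example 18.5 the third row becomes all zeroes, so the function returns True.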

Example 18.5 Checking lossless join property Let R = ABCDE, R1 = AD, R2 = AB,
R3 = BE, R4 = CDE, R5 = AE, furthermore let the functional dependencies be {A →
C, B → C, C → D, DE → C, CE → A}. The initial array is shown in Figure 18.1(a).
Using A → C, values 1, 2, 5 in column C can be equated to 1. Then applying B → C, value 3
of column C can also be changed to 1. The result is shown in Figure 18.1(b). Now C → D
can be used to change values 2, 3, 5 of column D to 0. Then applying DE → C, the (only
nonzero) value 1 of column C can be set to 0. Finally, CE → A makes it possible to change
values 3 and 4 in column A to 0. The final result is shown in Figure 18.1(c).
The third row consists of only zeroes, thus the decomposition has the lossless join property.

A B C D E
0 1 1 0 1
0 0 2 2 2
3 0 3 3 0
4 4 0 0 0
0 5 5 5 0
(a)
A B C D E
0 1 1 0 1
0 0 1 2 2
3 0 1 3 0
4 4 0 0 0
0 5 1 5 0
(b)
A B C D E
0 1 0 0 1
0 0 0 2 2
0 0 0 0 0
0 4 0 0 0
0 5 0 0 0
(c)

Figure 18.1 Application of Join-test(R,F ,ρ).

It is clear that the running time of algorithm Join-Test(R,F ,ρ) is polynomial
in the length of the input. The important thing is that it uses only the schema,
not the instance r belonging to the schema. Since the size of an instance is larger
than the size of the schema by many orders of magnitude, the running time of an
algorithm using the schema only is negligible with respect to the time required by
an algorithm processing the stored data.

Theorem 18.13 Procedure Join-Test(R,F ,ρ) correctly determines whether a
given decomposition has the lossless join property.

Proof Let us assume first that the resulting array T contains no all-zero row. T itself
can be considered as a relational instance over the schema R. This relation satisfies
all functional dependencies from F , because the algorithm terminated only when there
was no more change in the table during checking of the functional dependencies. It is
true for the starting table that its projections to every Ri contain an all-zero row, and
this property does not change during the run of the algorithm, since a 0 is never
changed to another symbol. It follows that the natural join mρ (T ) contains the all-
zero row, that is, T ≠ mρ (T ). Thus the decomposition is lossy. The proof of the other
direction is only sketched.
Domain calculus (from logic) is used; the necessary definitions can be found in the
books of Abiteboul, Hull and Vianu, or Ullman, respectively. Imagine that variable
aj is written in place of zeroes, and bij is written in place of i's in column j, and
Join-Test(R,F ,ρ) is run in this setting. The resulting table contains row a1 a2 . . . an ,
which corresponds to the all-zero row. Every table can be viewed as a shorthand
notation for the following domain calculus expression

{a1 a2 . . . an | (∃b11 ) . . . (∃bkn ) (R(w1 ) ∧ · · · ∧ R(wk ))} , (18.1)

where wi is the ith row of T . If T is the starting table, then formula (18.1) defines mρ
exactly. As a justification note that for a relation r, mρ (r) contains the row a1 a2 . . . an
iff r contains for all i a row whose jth coordinate is aj if Aj is an attribute of Ri ,
and arbitrary values, represented by variables bil , in the other attributes.
Consider an arbitrary relation r belonging to schema R that satisfies the de-
pendencies of F . The modifications (equating symbols) of the table done by Join-
test(R,F ,ρ) do not change the set of rows obtained from r by (18.1), if the mod-
ifications are done in the formula, as well. Intuitively, this holds because only such
symbols are equated in (18.1) that must take equal values in any relation satisfying the
functional dependencies of F . The exact proof is omitted, since it
is quite tedious.
Since in the result table of Join-test(R,F ,ρ) the all a’s row occurs, the domain
calculus formula that belongs to this table is of the following form:

{a1 a2 . . . an | (∃b11 ) . . . (∃bkn ) (R(a1 a2 . . . an ) ∧ · · · )} . (18.2)

It is obvious that if (18.2) is applied to relation r belonging to schema R, then the
result will be a subset of r. However, if r satisfies the dependencies of F , then (18.2)
calculates mρ (r). According to Lemma 18.12, r ⊆ mρ (r) holds, thus if r satisfies F ,
then (18.2) gives back r exactly, so r = mρ (r), that is the decomposition has the
lossless join property.
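The chase-style procedure referenced above can be sketched in Python. This is a minimal illustration with my own function and variable names, not the book's exact pseudocode: the entry 0 plays the role of the distinguished symbol aj, while the entry i + 1 in row i plays the role of bij.

```python
def join_test(R, F, rho):
    """Chase test for the lossless join property.

    R: list of attributes, F: list of (lhs, rhs) pairs of attribute sets,
    rho: list of attribute sets forming the decomposition."""
    col = {a: j for j, a in enumerate(R)}
    # Row i gets the distinguished symbol 0 in the columns of rho[i]
    # and the unique symbol i + 1 (standing for b_ij) elsewhere.
    table = [[0 if a in Ri else i + 1 for a in R] for i, Ri in enumerate(rho)]
    changed = True
    while changed:
        changed = False
        for X, Y in F:
            for r1 in table:
                for r2 in table:
                    if all(r1[col[a]] == r2[col[a]] for a in X):
                        for a in Y:
                            j = col[a]
                            if r1[j] != r2[j]:
                                # Equate the two symbols of column j,
                                # preferring the distinguished symbol 0.
                                keep, drop = sorted((r1[j], r2[j]))
                                for row in table:
                                    if row[j] == drop:
                                        row[j] = keep
                                changed = True
    # Lossless iff some row became all distinguished symbols.
    return any(all(v == 0 for v in row) for row in table)
```

For instance, over R = ABC with F = {A → B} the decomposition (AB, AC) passes the test, while (AB, CD) over ABCD with F = {A → B, C → D} fails.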
Procedure Join-test(R,F ,ρ) can be used independently of the number of parts
occurring in the decomposition. The price of this generality is paid in the running
time requirement. However, if R is to be decomposed only into two parts, then
Closure(R,F ,X) or Linear-closure(R,F ,X) can be used to obtain the same
result faster, according to the next theorem.
Theorem 18.14 Let ρ = (R1 , R2 ) be a decomposition of R, furthermore let F be
a set of functional dependencies. Decomposition ρ has the lossless join property with
respect to F iff

(R1 ∩ R2 ) → (R1 − R2 ) or (R1 ∩ R2 ) → (R2 − R1 ) .

These dependencies need not be in F , it is enough if they are in F + .


Proof The starting table in procedure Join-test(R,F ,ρ) is the following:

R1 ∩ R2 R1 − R2 R2 − R1
row of R1 00 . . . 0 00 . . . 0 11 . . . 1 (18.3)
row of R2 00 . . . 0 22 . . . 2 00 . . . 0

It is not hard to see using induction on the number of steps done by Join-test(R,F ,ρ)
that if the algorithm changes both values of the column of an attribute
A to 0, then A ∈ (R1 ∩ R2 )+ . This is obviously true at the start. If at some time
the values of column A must be equated, then by lines 11–14 of the algorithm there
exists {X → Y } ∈ F such that the two rows of the table agree on X, and A ∈ Y .
By the induction assumption X ⊆ (R1 ∩ R2 )+ holds. Applying the Armstrong-axioms
(transitivity and reflexivity), A ∈ (R1 ∩ R2 )+ follows.
876 18. Relational Database Design
On the other hand, let us assume that A ∈ (R1 ∩ R2 )+ , that is (R1 ∩ R2 ) → A.
Then this functional dependency can be derived from F using the Armstrong-axioms.
By induction on the length of this derivation it can be seen that procedure Join-test(R,F ,ρ)
will equate the two values of column A, that is, set them to 0. Thus,
the row of R1 will be all 0 iff (R1 ∩ R2 ) → (R2 − R1 ); similarly, the row of R2 will
be all 0 iff (R1 ∩ R2 ) → (R1 − R2 ).
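Theorem 18.14 reduces the two-part case to two attribute-closure computations. A possible Python sketch follows; a simple quadratic closure routine is used in place of Linear-closure, and all names are mine.

```python
def closure(attrs, F):
    """Closure of the attribute set attrs with respect to the FD list F,
    where each dependency is a (lhs, rhs) pair of attribute sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for X, Y in F:
            if set(X) <= result and not set(Y) <= result:
                result |= set(Y)
                changed = True
    return result

def lossless_two_parts(R1, R2, F):
    """(R1, R2) has the lossless join property iff R1 ∩ R2 determines
    R1 - R2 or R2 - R1 in F+ (Theorem 18.14)."""
    c = closure(set(R1) & set(R2), F)
    return set(R1) - set(R2) <= c or set(R2) - set(R1) <= c
```

On Example 18.6 below this confirms that (CZ, SZ) is a lossless join decomposition of CSZ, since Z → C holds.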

18.2.3. Dependency preserving decompositions


The lossless join property is important so that a relation can be recovered from
its projections. In practice, it is usually not the relation r belonging to the underlying
schema R that is stored, but the relations ri = r[Ri ] for an appropriate decomposition
ρ = (R1 , R2 , . . . , Rk ), in order to avoid anomalies. The functional dependencies F
of schema R are integrity constraints of the database; relation r is consistent if
it satisfies all prescribed functional dependencies. When updates are executed during
the lifetime of the database, that is, rows are inserted into or deleted from the
projection relations, then it may happen that the natural join of the new projections
does not satisfy the functional dependencies of F . It would be too costly to
join the projected relations – and then project them again – after each update to
check the integrity constraints. However, the projection of the family of functional
dependencies F to an attribute set Z can be defined: πZ (F ) consists of those functional
dependencies {X → Y } ∈ F + , where XY ⊆ Z. After an update, if relation
ri is changed, then it is relatively easy to check whether πRi (F ) still holds. Thus, it
would be desirable if family F were a logical implication of the families of functional
dependencies πRi (F ), i = 1, 2, . . . , k. Let πρ (F ) = πR1 (F ) ∪ · · · ∪ πRk (F ).
Definition 18.15 The decomposition ρ is said to be dependency preserving, if
πρ (F )+ = F + .
Note that πρ (F ) ⊆ F + , hence πρ (F )+ ⊆ F + always holds. Consider the following
example.

Example 18.6 Let R = (City,Street,Zip code) be the underlying schema, furthermore


let F = {CS → Z, Z → C} be the functional dependencies. Let the decomposition ρ
be ρ = (CZ, SZ). This has the lossless join property by Theorem 18.14. πρ (F ) consists
of Z → C besides the trivial dependencies. Let R1 = CZ and R2 = SZ. Two rows are
inserted into each of the projections belonging to schemata R1 and R2 , respectively, so
that functional dependencies of the projections are satisfied:
R1 :  C           Z          R2 :  S              Z
      Fort Wayne  46805            Coliseum Blvd  46805
      Fort Wayne  46815            Coliseum Blvd  46815

In this case R1 and R2 satisfy the dependencies prescribed for them separately, however in
R1 1 R2 the dependency CS → Z does not hold.
It is also true that no decomposition of this schema preserves the dependency
CS → Z. Indeed, this is the only dependency that contains Z on the right hand side,
thus if it is to be preserved, then there has to be a subschema that contains C, S, Z, but then
the decomposition would not be proper. This will be considered again when decomposition
into normal forms is treated.
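The failure in Example 18.6 can be replayed directly. The following snippet (relations represented as lists of attribute-to-value dicts; helper names are mine) joins the two projections and checks the dependencies:

```python
def natural_join(r1, r2):
    """Natural join of two relations given as lists of attribute->value dicts."""
    common = set(r1[0]) & set(r2[0])
    return [{**t1, **t2} for t1 in r1 for t2 in r2
            if all(t1[a] == t2[a] for a in common)]

def satisfies(r, X, Y):
    """Does relation r satisfy the functional dependency X -> Y?"""
    return all(any(t1[a] != t2[a] for a in X) or all(t1[a] == t2[a] for a in Y)
               for t1 in r for t2 in r)

r1 = [{"C": "Fort Wayne", "Z": "46805"}, {"C": "Fort Wayne", "Z": "46815"}]
r2 = [{"S": "Coliseum Blvd", "Z": "46805"}, {"S": "Coliseum Blvd", "Z": "46815"}]

joined = natural_join(r1, r2)
# Z -> C holds in r1, but CS -> Z fails in the join of the two projections.
```

Both rows of the join share the same City and Street but have different Zip codes, which is exactly the violation described above.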

Note that it may happen that decomposition ρ preserves functional dependencies,
but does not have the lossless join property. Indeed, let R = ABCD,
F = {A → B, C → D}, and let the decomposition be ρ = (AB, CD).
Theoretically it is very simple to check whether a decomposition ρ =
(R1 , R2 , . . . Rk ) is dependency preserving. Just F + needs to be calculated, then
projections need to be taken, finally one should check whether the union of the pro-
jections is equivalent with F . The main problem with this approach is that even
calculating F + may need exponential time.
Nevertheless, the problem can be solved without explicitly determining F + . Let
G = πρ (F ). G will not be calculated, only its equivalence with F will be checked.
To this end, it must be decidable for every functional dependency {X → Y } ∈ F
whether X + , taken with respect to G, contains Y . The trick is that X + is
determined without full knowledge of G, by repeatedly applying to the closure the
effects of the projections of F onto the individual Ri 's. That is, the concept of the S-operation
on an attribute set Z is introduced, where S is another set of attributes:
Z is replaced by Z ∪ ((Z ∩ S)+ ∩ S), where the closure is taken with respect to F .
Thus, the closure of the part of Z that lies in S is taken with respect to F , then
from the resulting attributes those are added to Z that also belong to S.
It is clear that the running time of algorithm Preserve(ρ, F ) is polynomial
in the length of the input. More precisely, the outermost for loop is executed at
most once for each dependency in F (it may happen that it turns out earlier that
some dependency is not preserved). The body of the repeat–until loop in lines 3–7
requires a linear number of steps and is executed at most |R| times. Thus, the body of
the for loop needs quadratic time, so the total running time can be bounded by the
cube of the input length.

Preserve(ρ, F )
1 for all (X → Y ) ∈ F
2 do Z ← X
3 repeat
4 W ←Z
5 for i ← 1 to k
6 do Z ← Z ∪ (Linear-closure(R, F, Z ∩ Ri ) ∩ Ri )
7 until Z = W
8 if Y ⊈ Z
9 then return “Not dependency preserving”
10 return “Dependency preserving”
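A Python transcription of Preserve may look as follows; a simple closure routine stands in for Linear-closure, all names are mine, and dependencies are (lhs, rhs) pairs of attribute sets.

```python
def closure(attrs, F):
    """Closure of attrs with respect to the FD list F of (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for X, Y in F:
            if set(X) <= result and not set(Y) <= result:
                result |= set(Y)
                changed = True
    return result

def preserves(rho, F):
    """True iff the decomposition rho preserves the FD set F.  For each
    X -> Y in F the closure of X under G = pi_rho(F) is computed by
    repeated R_i-operations, without materialising G."""
    for X, Y in F:
        Z = set(X)
        while True:
            W = set(Z)
            for Ri in rho:
                # R_i-operation: Z <- Z ∪ ((Z ∩ R_i)+ ∩ R_i), closure w.r.t. F
                Z |= closure(Z & set(Ri), F) & set(Ri)
            if Z == W:
                break
        if not set(Y) <= Z:
            return False
    return True
```

On Example 18.7 it reports that ρ = (AB, BC, CD) preserves the cyclic dependencies, while on Example 18.6 it detects that CS → Z is lost.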

Example 18.7 Consider the schema R = ABCD, let the decomposition be ρ =
{AB, BC, CD}, and the dependencies be F = {A → B, B → C, C → D, D → A}. That
is, by the visible cycle of the dependencies, every attribute determines all others. Since D
and A do not occur together in the decomposition one might think that the dependency
D → A is not preserved, however this intuition is wrong. The reason is that during the
projection to AB, not only the dependency A → B is obtained, but B → A, as well, since
not F , but F + is projected. Similarly, C → B and D → C are obtained, as well, but
D → A is a logical implication of these by the transitivity of the Armstrong axioms. Thus
it is expected that Preserve(ρ, F ) claims that D → A is preserved.
Start from the attribute set Z = {D}. There are three possible operations: the AB-operation,
the BC-operation and the CD-operation. The first two obviously do not add
anything to {D}+ , since {D} ∩ {A, B} = {D} ∩ {B, C} = ∅, that is, the closure of the
empty set should be taken, which is empty (in the present example). However, using the
CD-operation:

Z = {D} ∪ (({D} ∩ {C, D})+ ∩ {C, D})
  = {D} ∪ ({D}+ ∩ {C, D})
  = {D} ∪ ({A, B, C, D} ∩ {C, D})
  = {C, D} .

In the next round using the BC-operation the actual Z = {C, D} is changed to Z =
{B, C, D}, finally applying the AB-operation on this, Z = {A, B, C, D} is obtained. This
cannot change, so procedure Preserve(ρ, F ) stops. Thus, with respect to the family of
functional dependencies

G = πAB (F ) ∪ πBC (F ) ∪ πCD (F ) ,

{D}+ = {A, B, C, D} holds, that is G |= D → A. It can be checked similarly that the other
dependencies of F are in G+ (in fact, in G).

Theorem 18.16 The procedure Preserve(ρ, F ) determines correctly whether the decomposition ρ is dependency preserving.

Proof It is enough to check for a single functional dependency X → Y whether
the procedure decides correctly if it is in G+ . When an attribute is added to
Z in lines 3–7, then functional dependencies from G are used, thus by the soundness
of the Armstrong-axioms, if Preserve(ρ, F ) claims that X → Y ∈ G+ , then it is
indeed so.
On the other hand, if X → Y ∈ G+ , then Linear-closure(R, F, X) (run with G
as input) adds the attributes of Y one-by-one to X. In every step when an attribute
is added, some functional dependency U → V of G is used. This dependency is in
one of the πRi (F )'s, since G is the union of these. An easy induction on the number
of functional dependencies used by procedure Linear-closure shows that
sooner or later U becomes a subset of Z, and then applying the Ri -operation all attributes
of V are added to Z.

18.2.4. Normal forms


The goal of transforming (decomposing) relational schemata into normal forms
is to avoid the anomalies described in the previous section. Normal forms of many
different strengths were introduced in the course of the evolution of database theory;
here only the Boyce–Codd normal form (BCNF), the third normal form (3NF) and the
fourth normal form (4NF) are treated in detail, since these are the most
important ones from a practical point of view.

Boyce-Codd normal form


Definition 18.17 Let R be a relational schema, F be a family of functional dependencies
over R. (R, F ) is said to be in Boyce–Codd normal form if X → A ∈ F +
and A ∉ X imply that X is a superkey.
The most important property of BCNF is that it eliminates redundancy. This is
based on the following theorem whose proof is left to the Reader as an exercise
(Exercise 18.2-8).
Theorem 18.18 Schema (R, F ) is in BCNF iff for arbitrary attribute A ∈ R
and key X ⊂ R there exists no Y ⊆ R for which X → Y ∈ F + ; Y → X ∉ F + ;
Y → A ∈ F + and A ∉ Y .
In other words, Theorem 18.18 states that “BCNF ⇐⇒ There is no transitive de-
pendence on keys”. Let us assume that a given schema is not in BCNF, for example
C → B and B → A hold, but B → C does not, then the same B value could occur
besides many different C values, but at each occasion the same A value would be
stored with it, which is redundant. Formulating somewhat differently, the meaning
of BCNF is that (only) using functional dependencies an attribute value in a row
cannot be predicted from other attribute values. Indeed, assume that there exists a
schema R, in which the value of an attribute can be determined using a functional
dependency by comparison of two rows. That is, there exists two rows that agree
on an attribute set X, differ on the set Y and the value of the remaining (unique)
attribute A can be determined in one of the rows from the value taken in the other
row.
X Y A
x y1 a
x y2 ?
If the value ? can be determined by a functional dependency, then this value can only
be a, the dependency is Z → A, where Z is an appropriate subset of X. However,
Z cannot be a superkey, since the two rows are distinct, thus R is not in BCNF.

3NF Although BCNF helps eliminate anomalies, it is not true that every
schema can be decomposed into subschemata in BCNF so that the decomposition is
dependency preserving. As it was shown in Example 18.6, no proper decomposition
of schema CSZ preserves the CS → Z dependency. At the same time, the schema
is clearly not in BCNF, because of the Z → C dependency.
Since dependency preserving is important because of consistency checking of

a database, it is practical to introduce a normal form into which every schema has
a dependency preserving decomposition, and which allows the minimum possible
redundancy. An attribute is called a prime attribute if it occurs in a key.
Definition 18.19 The schema (R, F ) is in third normal form, if whenever X →
A ∈ F + , then either X is a superkey, or A is a prime attribute.
The schema SAIP of Example 18.3 with the dependencies SI → P and S → A
is not in 3NF, since SI is the only key and so A is not a prime attribute. Thus,
functional dependency S → A violates the 3NF property.
3NF is clearly a weaker condition than BCNF, since “or A is a prime attribute”
occurs in the definition. The schema CSZ in Example 18.6 is trivially in 3NF,
because every attribute is prime, but it was already shown that it is not in BCNF.

Testing normal forms Theoretically every functional dependency in F + should
be checked whether it violates the conditions of BCNF or 3NF, and it is known that
F + can be exponentially large in the size of F . Nevertheless, it can be shown that if
the functional dependencies in F always have a single attribute on the right hand side,
then it is enough to check the violation of BCNF, or 3NF respectively,
for the dependencies of F . Indeed, let X → A ∈ F + be a dependency that violates the
appropriate condition, that is, X is not a superkey and, in the case of 3NF, A is not
prime. Note that X → A ∈ F + ⇐⇒ A ∈ X + . In the step when Closure(R, F, X) puts A
into X + (line 8) it uses a functional dependency Y → A from F with Y ⊆ X + and
A ∉ Y . This dependency is non-trivial and A is (still) not prime. Furthermore, if Y
were a superkey, then by R = Y + ⊆ (X + )+ = X + , X would also be a superkey.
Thus, the functional dependency Y → A from F violates the condition of the normal
form. The functional dependencies of F can easily be checked in polynomial time, since
it is enough to calculate the closure of the left hand side of each dependency. This
finishes checking for BCNF, because if the closure of each left hand side is R, then
the schema is in BCNF, otherwise a dependency is found that violates the condition.
In order to test 3NF it may be necessary to decide about an attribute whether it is
prime or not. However this problem is NP-complete, see Problem 18-4.
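When every dependency in F has a single attribute on the right hand side, the BCNF test above amounts to one closure computation per dependency. A hedged Python sketch (names mine):

```python
def closure(attrs, F):
    """Closure of attrs w.r.t. F, where F is a list of (lhs, A) pairs with a
    single attribute A on the right hand side."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for X, A in F:
            if set(X) <= result and A not in result:
                result.add(A)
                changed = True
    return result

def is_bcnf(R, F):
    """X -> A violates BCNF iff it is non-trivial and X is not a superkey;
    by the argument above it suffices to check the members of F."""
    return all(A in X or closure(X, F) == set(R) for X, A in F)
```

The schema SAIP with SI → P and S → A is rejected because of S → A. Testing 3NF would additionally need primality of attributes, which, as noted above, is NP-complete.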

Lossless join decomposition into BCNF Let (R, F ) be a relational schema
(where F is the set of functional dependencies). The schema is to be decomposed into a
union of subschemata R1 , R2 , . . . , Rk , such that the decomposition has the lossless
join property, and each Ri , endowed with the set of functional dependencies
πRi (F ), is in BCNF. The basic idea of the decomposition is simple:
• If (R, F ) is in BCNF, then we are done.
• If not, it is decomposed into two proper parts (R1 , R2 ), whose join is lossless.
• Repeat the above for R1 and R2 .
In order to see that this works one has to show two things:
• If (R, F ) is not in BCNF, then it has a lossless join decomposition into smaller
parts.
• If a part of a lossless join decomposition is further decomposed, then the new
decomposition has the lossless join property, as well.

Lemma 18.20 Let (R, F ) be a relational schema (where F is the set of functional
dependencies), ρ = (R1 , R2 , . . . , Rk ) be a lossless join decomposition of R. Further-
more, let σ = (S1 , S2 ) be a lossless join decomposition of R1 with respect to πR1 (F ).
Then (S1 , S2 , R2 , . . . , Rk ) is a lossless join decomposition of R.

The proof of Lemma 18.20 is based on the associativity of natural join. The details
are left to the Reader (Exercise 18.2-9).
This can be applied in a simple, but unfortunately exponential time algorithm
that decomposes a schema into subschemata in BCNF. The projections in
lines 4–5 of Naiv-BCNF(S, G) may be of exponential size in the length of the input.
In order to decompose schema (R, F ), the procedure must be called with parameters
R, F . Procedure Naiv-BCNF(S, G) is recursive, S is the actual schema with set of
functional dependencies G. It is assumed that the dependencies in G are of the form
X → A, where A is a single attribute.

Naiv-BCNF(S, G)
1 while there exists {X → A} ∈ G, that violates BCNF
2 do S1 ← {XA}
3 S2 ← S − A
4 G1 ← πS1 (G)
5 G2 ← πS2 (G)
6 return (Naiv-BCNF(S1 , G1 ), Naiv-BCNF(S2 , G2 ))
7 return S

However, if the algorithm is allowed to overdo things, that is, to decompose a
schema even if it is already in BCNF, then there is no need for projecting the
dependencies. The procedure is based on the following two lemmas.

Lemma 18.21

1. A schema of only two attributes is in BCNF.

2. If R is not in BCNF, then there exists two attributes A and B in R, such that
(R − AB) → A holds.

Proof If the schema consists of two attributes, R = AB, then there are at most two
possible non-trivial dependencies, A → B and B → A. It is clear that if one of
them holds, then the left hand side of the dependency is a key, so the dependency
does not violate the BCNF property. However, if none of the two holds, then BCNF
is trivially satisfied.
On the other hand, let us assume that the dependency X → A violates the
BCNF property. Then there must exist an attribute B ∈ R − (XA), since otherwise
X would be a superkey. For this B, (R − AB) → A holds.

Let us note that the converse of the second statement of Lemma 18.21 is not
true. It may happen that a schema R is in BCNF, but there are still two attributes
{A, B} that satisfy (R − AB) → A. Indeed, let R = ABC, F = {C → A, C → B}.
This schema is obviously in BCNF, nevertheless (R − AB) = C → A.
The main contribution of Lemma 18.21 is that the projections of functional
dependencies need not be calculated in order to check whether a schema obtained
during the procedure is in BCNF. It is enough to calculate (R − AB)+ for pairs
{A, B} of attributes, which can be done by Linear-closure(R, F, X) in linear
time, so the whole checking is polynomial (cubic) time. However, this requires a way
of calculating (R − AB)+ without actually projecting down the dependencies. The
next lemma is useful for this task.
Lemma 18.22 Let R2 ⊂ R1 ⊂ R and let F be the set of functional dependencies
of scheme R. Then
πR2 (πR1 (F )) = πR2 (F ) .
The proof is left for the Reader (Exercise 18.2-10). The method of lossless join BCNF
decomposition is as follows. Schema R is decomposed into two subschemata. One
is XA, which is in BCNF and satisfies X → A. The other subschema is R − A, hence
by Theorem 18.14 the decomposition has the lossless join property. This is applied
recursively to R − A, until a schema is obtained in which no two attributes A, B
satisfy (R − AB) → A; by the second statement of Lemma 18.21 such a schema is in
BCNF. The lossless join property of this recursively generated decomposition
is guaranteed by Lemma 18.20.

Polynomial-BCNF(R, F )
1 Z←R
2  Z is the schema that is not known to be in BCNF during the procedure.
3 ρ←∅
4 while there exist A, B in Z, such that A ∈ (Z − AB)+ and |Z| > 2
5 do Let A and B be such a pair
6 E←A
7 Y ←Z −B
8 while there exist C, D in Y , such that C ∈ (Y − CD)+
9 do Y ← Y − D
10 E←C
11 ρ ← ρ ∪ {Y }
12 Z ←Z −E
13 ρ ← ρ ∪ {Z}
14 return ρ

The running time of Polynomial-BCNF(R, F ) is polynomial, in fact, it can
be bounded by O(n5 ), as follows. During each execution of the loop in lines 4–12
the size of Z is decreased by at least one, so the loop body is executed at most n
times. (Z − AB)+ is calculated in line 4 for at most O(n2 ) pairs; each calculation can
be done in linear time using Linear-closure, which results in O(n3 ) steps for each
execution of the loop body. In lines 8–10 the size of Y is decreased in each iteration,
so during each execution of lines 3–12 they give at most n iterations. The condition of the
while command of line 8 is checked for O(n2 ) pairs of attributes, each check
done in linear time. The running time of the algorithm is dominated by the time
required by lines 8–10, which take n · n · O(n2 ) · O(n) = O(n5 ) steps altogether.
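A compact Python variant of this decomposition strategy follows the "overdoing" idea directly: while a subschema Z with more than two attributes has a pair A, B with A ∈ (Z − AB)+, it is split into ((Z − AB)A, Z − A), which is lossless by Theorem 18.14. This is a sketch with my own names, not the book's Polynomial-BCNF, and it does not optimise the number of closure computations. Dependencies have singleton right hand sides, and closures are taken with respect to the full F, as Lemma 18.22 justifies.

```python
def closure(attrs, F):
    """Closure of attrs w.r.t. F (list of (lhs, A) pairs, A a single attribute)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for X, A in F:
            if set(X) <= result and A not in result:
                result.add(A)
                changed = True
    return result

def bcnf_decompose(R, F):
    """Lossless-join decomposition of R into BCNF subschemata (Lemma 18.21)."""
    result, stack = [], [frozenset(R)]
    while stack:
        Z = stack.pop()
        pair = next(((A, B) for A in Z for B in Z - {A}
                     if A in closure(Z - {A, B}, F)),
                    None) if len(Z) > 2 else None
        if pair is None:
            result.append(Z)       # no violating pair: Z is in BCNF
        else:
            A, B = pair
            stack.append(Z - {B})  # this is (Z - AB) ∪ {A}
            stack.append(Z - {A})
    return result
```

On the schema CSZ with CS → Z and Z → C this produces the subschemata CZ and SZ, mirroring Example 18.6.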

Dependency preserving decomposition into 3NF We have already seen
that it is not always possible to decompose a schema into subschemata in BCNF
so that the decomposition is dependency preserving. Nevertheless, if only 3NF is
required then a decomposition can be given using Minimal-Cover(R, F ). Let R be
a relational schema and F be the set of functional dependencies. Using Minimal-Cover(R, F )
a minimal cover G of F is constructed. Let G = {X1 → A1 , X2 →
A2 , . . . , Xk → Ak }.
Theorem 18.23 The decomposition ρ = (X1 A1 , X2 A2 , . . . , Xk Ak ) is a dependency
preserving decomposition of R into subschemata in 3NF.
Proof Since G+ = F + and the functional dependency Xi → Ai is in πRi (F ), the
decomposition preserves every dependency of F . Let us suppose indirectly that the
schema Ri = Xi Ai is not in 3NF, that is, there exists a dependency U → B that
violates the conditions of 3NF. This means that the dependency is non-trivial,
U is not a superkey in Ri and B is not a prime attribute of Ri . There are two cases
possible. If B = Ai , then using that U is not a superkey, U ⊊ Xi follows. In this case
the functional dependency U → Ai contradicts the fact that Xi → Ai was a member of
the minimal cover, since its left hand side could be decreased. In the case when B ≠ Ai ,
B ∈ Xi holds. B is not prime in Ri , thus Xi is not a key, only a superkey. But then
Xi would contain a key Y such that Y ⊊ Xi . Furthermore, Y → Ai would hold,
as well, which contradicts the minimality of G, since the left hand side of Xi → Ai
could be decreased.

If the decomposition needs to have the lossless join property besides being de-
pendency preserving, then ρ given in Theorem 18.23 is to be extended by a key X of
R. Although it was seen before that it is not possible to list all keys in polynomial
time, one can be obtained in a simple greedy way, the details are left to the Reader
(Exercise 18.2-11).
Theorem 18.24 Let (R, F ) be a relational schema, and let G = {X1 → A1 , X2 →
A2 , . . . , Xk → Ak } be a minimal cover of F . Furthermore, let X be a key in (R, F ).
Then the decomposition τ = (X, X1 A1 , X2 A2 , . . . , Xk Ak ) is a lossless join and de-
pendency preserving decomposition of R into subschemata in 3NF.
Proof It was shown during the proof of Theorem 18.23 that the subschemata Ri =
Xi Ai are in 3NF for i = 1, 2, . . . , k. There cannot be a non-trivial dependency in the
subschema R0 = X, because if it were, then X would not be a key, only a superkey.
The lossless join property of τ is shown by the use of Join-test(R, G, ρ) proce-
dure. Note that it is enough to consider the minimal cover G of F . More precisely,
we show that the row corresponding to X in the table will be all 0 after running
Join-test(R, G, ρ). Let A1 , A2 , . . . , Am be the order of the attributes of R − X
as Closure(R,G,X) inserts them into X + . Since X is a key, every attribute of
R − X is taken during Closure(R,G,X). It will be shown by induction on i that
the element in the row of X and column of Ai is 0 after running Join-test(R, G, ρ).
The base case i = 0 is obvious. Let us suppose that the statement is true for i − 1
and consider when and why Ai is inserted into X + . In lines 6–8 of Closure(R,G,X)
a functional dependency Y → Ai is used where Y ⊆ X ∪ {A1 , A2 , . . . , Ai−1 }.
Then Y → Ai ∈ G, and Y Ai = Rj for some j. The rows corresponding to X and
Y Ai = Rj agree in the columns of Y (all 0 by the induction hypothesis), thus the
entries in the column of Ai are equated by Join-test(R, G, ρ). This value is 0 in the
row corresponding to Y Ai = Rj , thus it becomes 0 in the row of X, as well.
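The synthesis of Theorem 18.24 is easy to script once a minimal cover is available. The sketch below assumes G is already a minimal cover with singleton right hand sides (computing one is the job of Minimal-Cover, not shown here); a key is obtained by the greedy shrinking mentioned above (Exercise 18.2-11). Names are mine.

```python
def closure(attrs, G):
    """Closure of attrs w.r.t. G (list of (lhs, A) pairs, A a single attribute)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for X, A in G:
            if set(X) <= result and A not in result:
                result.add(A)
                changed = True
    return result

def find_key(R, G):
    """Greedily drop attributes while the remainder still determines all of R."""
    key = set(R)
    for a in list(R):
        if closure(key - {a}, G) == set(R):
            key.discard(a)
    return key

def synthesize_3nf(R, G):
    """Theorem 18.24: subschemata X_i A_i for the minimal cover G, extended by
    a key if no subschema already contains one."""
    rho = [frozenset(X) | {A} for X, A in G]
    key = frozenset(find_key(R, G))
    if not any(key <= S for S in rho):
        rho.append(key)
    return rho
```

For R = ABCD with the minimal cover {A → B, A → C}, the result consists of AB, AC and the key AD.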

It is interesting to note that although an arbitrary schema can be decomposed
into subschemata in 3NF in polynomial time, it is nevertheless NP-complete to decide
whether a given schema (R, F ) is in 3NF, see Problem 18-4. However, the BCNF
property can be decided in polynomial time. This difference is caused by the fact that in
order to decide the 3NF property one needs to decide about an attribute whether it is
prime, and this latter problem requires listing all keys of a schema.

18.2.5. Multivalued dependencies

Example 18.8 Besides functional dependencies, some other dependencies hold in Exam-
ple 18.1, as well. There can be several lectures of a subject in different times and rooms.
Part of an instance of the schema could be the following.
Professor Subject Room Student Grade Time
Caroline Doubtfire Analysis MA223 John Smith A− Monday 8–10
Caroline Doubtfire Analysis CS456 John Smith A− Wednesday 12–2
Caroline Doubtfire Analysis MA223 Ching Lee A+ Monday 8–10
Caroline Doubtfire Analysis CS456 Ching Lee A+ Wednesday 12–2
A set of values of the Time and Room attributes belongs to each given value of
Subject, and all other attribute values are repeated with these. The sets of attributes RT and
StG are independent, that is, their values occur in each combination.

The set of attributes Y is said to be multivalued dependent on the set of attributes
X, in notation X ↠ Y , if for every value on X, there exists a set of values on Y
that is not dependent in any way on the values taken in R − X − Y . The precise
definition is as follows.
Definition 18.25 The relational schema R satisfies the multivalued dependency
X ↠ Y , if for every relation r of schema R and arbitrary tuples t1 , t2 of
r that satisfy t1 [X] = t2 [X], there exist tuples t3 , t4 ∈ r such that
• t3 [XY ] = t1 [XY ]
• t3 [R − XY ] = t2 [R − XY ]
• t4 [XY ] = t2 [XY ]
• t4 [R − XY ] = t1 [R − XY ]
hold.1

1 It would be enough to require the existence of t3 , since the existence of t4 would follow. However,
the symmetry of multivalued dependency is more apparent in this way.

In Example 18.8 Su ↠ RT holds.
Remark 18.26 A functional dependency is an equality generating dependency, that
is, from the equality of two objects it deduces the equality of two other objects.
On the other hand, a multivalued dependency is a tuple generating dependency, that
is, the existence of two rows that agree somewhere implies the existence of some other
rows.
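Definition 18.25 can be checked mechanically on a relation instance. A small sketch follows (relations as lists of attribute-to-value dicts, names mine); by the symmetry noted in the footnote it suffices to look for t3 over all ordered pairs of tuples.

```python
def satisfies_mvd(r, R, X, Y):
    """Does relation r over attribute set R satisfy X ->> Y?"""
    X, Y = set(X), set(Y)
    rest = set(R) - X - Y
    for t1 in r:
        for t2 in r:
            if all(t1[a] == t2[a] for a in X):
                # t3 must combine the XY-part of t1 with the rest of t2
                if not any(all(t3[a] == t1[a] for a in X | Y) and
                           all(t3[a] == t2[a] for a in rest) for t3 in r):
                    return False
    return True
```

On a simplified version of Example 18.8 with attributes Su, R, T, St the dependency Su ↠ RT holds, and deleting a single row breaks it.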
There exists a sound and complete axiomatisation of multivalued dependencies similar
to the Armstrong-axioms of functional dependencies. Logical implication and
inference can be defined analogously. The multivalued dependency X ↠ Y is logically
implied by the set M of multivalued dependencies, in notation M |= X ↠ Y ,
if every relation that satisfies all dependencies of M also satisfies X ↠ Y .
Note that X → Y implies X ↠ Y . The rows t3 and t4 of Definition 18.25
can be chosen as t3 = t2 and t4 = t1 , respectively. Thus, functional dependencies
and multivalued dependencies admit a common axiomatisation. Besides the Armstrong-axioms
(A1)–(A3), five others are needed. Let R be a relational schema.
(A4) Complementation: {X ↠ Y } |= X ↠ (R − X − Y ).
(A5) Extension: If X ↠ Y holds, and V ⊆ W , then W X ↠ V Y .
(A6) Transitivity: {X ↠ Y, Y ↠ Z} |= X ↠ (Z − Y ).
(A7) {X → Y } |= X ↠ Y .
(A8) If X ↠ Y holds, Z ⊆ Y , and for some W disjoint from Y , W → Z
holds, then X → Z is true, as well.
Beeri, Fagin and Howard proved that (A1)–(A8) is a sound and complete system of
axioms for functional and multivalued dependencies together. The proof of soundness
is left for the Reader (Exercise 18.2-12), the proof of completeness exceeds the
level of this book. The rules of Lemma 18.2 are valid in exactly the same way as
when only functional dependencies were considered. Some further rules are listed in
the next proposition.
Claim 18.27 The following hold for multivalued dependencies.
1. Union rule: {X ↠ Y, X ↠ Z} |= X ↠ Y Z.
2. Pseudotransitivity: {X ↠ Y, W Y ↠ Z} |= W X ↠ (Z − W Y ).
3. Mixed pseudotransitivity: {X ↠ Y, XY → Z} |= X → (Z − Y ).
4. Decomposition rule for multivalued dependencies: if X ↠ Y and X ↠ Z
hold, then X ↠ (Y ∩ Z), X ↠ (Y − Z) and X ↠ (Z − Y ) hold, as well.

The proof of Claim 18.27 is left for the Reader (Exercise 18.2-13).

Dependency basis An important difference between functional dependencies and
multivalued dependencies is that X → Y immediately implies X → A for all A in Y ,
however X ↠ A is deduced by the decomposition rule for multivalued dependencies
from X ↠ Y only if there exists a set of attributes Z such that X ↠ Z and
Z ∩ Y = A, or Y − Z = A. Nevertheless, the following theorem is true.

Theorem 18.28 Let R be a relational schema, X ⊂ R be a set of attributes. Then
there exists a partition Y1 , Y2 , . . . , Yk of the set of attributes R − X such that for
Z ⊆ R − X the multivalued dependency X ↠ Z holds if and only if Z is the union
of some Yi 's.

Proof We start from the one-element partition W1 = R − X. This will be refined
successively, while keeping the property that X ↠ Wi holds for every Wi in the actual
partition. If X ↠ Z and Z is not a union of some of the Wi 's, then replace
every Wi such that neither Wi ∩ Z nor Wi − Z is empty by Wi ∩ Z and Wi − Z.
According to the decomposition rule of Claim 18.27, both X ↠ (Wi ∩ Z) and
X ↠ (Wi − Z) hold. Since R − X is finite, the refinement process terminates after
a finite number of steps, that is, for all Z such that X ↠ Z holds, Z is the union
of some blocks of the partition. In order to complete the proof one only needs to observe
that by the union rule of Claim 18.27, the union of some blocks of the
partition depends on X in a multivalued way.

Definition 18.29 The partition Y1 , Y2 , . . . , Yk constructed in Theorem 18.28 from
a set D of functional and multivalued dependencies is called the dependency basis
of X (with respect to D).

Example 18.9 Consider the familiar schema

R(Professor,Subject,Room,Student,Grade,Time)

of Examples 18.1 and 18.8. Su ↠ RT was shown in Example 18.8. By the complementation
rule Su ↠ PStG follows. Su → P is also known. This implies by axiom (A7) that Su ↠ P.
By the decomposition rule Su ↠ StG follows. It is easy to see that no other one-element
attribute set is determined by Su via multivalued dependency. Thus, the dependency basis
of Su is the partition {P, RT, StG}.

We would like to compute the set D+ of logical consequences of a given set D of
functional and multivalued dependencies. One possibility is to apply the axioms (A1)–(A8)
repeatedly to extend the set of dependencies, until no more extension is possible.
However, this could be an exponential time process in the size of D. One cannot
expect any better, since it was shown before that even D+ can be exponentially
larger than D. Nevertheless, in many applications it is not needed to compute the
whole set D+ , one only needs to decide whether a given functional dependency
X → Y or multivalued dependency X ↠ Y belongs to D+ or not. In order to decide
about a multivalued dependency X ↠ Y , it is enough to compute the dependency
basis of X, then to check whether Y − X can be written as a union of some blocks
of the partition. The following is true.

Theorem 18.30 (Beeri). In order to compute the dependency basis of a set of
attributes X with respect to a set of dependencies D, it is enough to consider the
following set M of multivalued dependencies:

1. all multivalued dependencies of D, and
2. for every X → Y in D the set of multivalued dependencies X ↠ A1 , X ↠ A2 ,
. . . , X ↠ Ak , where Y = A1 A2 . . . Ak , and the Ai ’s are single attributes.

The only thing left is to decide about functional dependencies based on the
dependency basis. Closure(R, F, X) works correctly only if multivalued dependencies
are not considered. The next theorem helps in this case.

Theorem 18.31 (Beeri). Let us assume that A ∉ X and the dependency basis of
X with respect to the set M of multivalued dependencies obtained in Theorem 18.30
is known. X → A holds if and only if

1. A forms a single element block in the partition of the dependency basis, and
2. there exists a set Y of attributes with A ∉ Y such that Y → Z is an element
of the originally given set of dependencies D and A ∈ Z.

Based on the observations above, the following polynomial time algorithm can be
given to compute the dependency basis of a set of attributes X.

Dependency-Basis(R, M, X)
1  S ← {R − X}  ▷ The collection of blocks of the dependency basis is S.
2  repeat
3    for all V ↠ W ∈ M
4      do if there exists Y ∈ S such that Y ∩ W ≠ ∅ ∧ Y ∩ V = ∅
5        then S ← (S − {Y }) ∪ {Y ∩ W, Y − W }
6  until S does not change
7  return S

It is immediate that if S changes in lines 3–5 of Dependency-Basis(R, M, X),
then some block of the partition is cut by the algorithm. This implies that the
running time is a polynomial function of the sizes of M and R. In particular, by
careful implementation one can make this polynomial O(|M| · |R|³), see Problem 18-5.
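Under the assumption that attribute sets are represented as Python sets and each multivalued dependency V ↠ W as a pair (V, W), the procedure above can be transcribed directly. This is a sketch, not the tuned O(|M| · |R|³) implementation of Problem 18-5; the extra check that Y − W is non-empty matches the refinement step of Theorem 18.28, which only cuts a block when both parts are non-empty.

```python
# Sketch of Dependency-Basis(R, M, X): start from the single block R - X and
# repeatedly cut blocks using the multivalued dependencies of M.
def dependency_basis(R, M, X):
    S = {frozenset(R - X)}            # line 1 of the pseudocode
    changed = True
    while changed:                    # repeat ... until S does not change
        changed = False
        for V, W in M:                # for all V ->> W in M
            for Y in list(S):
                # Y is cut iff it meets W, avoids V, and is properly split.
                if Y & W and not (Y & V) and Y - W:
                    S.remove(Y)
                    S.add(Y & W)
                    S.add(Y - W)
                    changed = True
    return S
```

On the data of Example 18.9 (M = {Su ↠ RT, Su ↠ P} after the conversion of Theorem 18.30) this returns the partition {P, RT, StG}.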

Fourth normal form 4NF The Boyce-Codd normal form can be generalised
to the case where multivalued dependencies are also considered besides functional
dependencies, and one needs to get rid of the redundancy caused by them.

Definition 18.32 Let R be a relational schema, D be a set of functional and
multivalued dependencies over R. R is in fourth normal form (4NF), if for every
multivalued dependency X ↠ Y ∈ D+ for which Y ⊈ X and R ≠ XY , it holds that
X is a superkey in R.

Observe that 4NF =⇒ BCNF. Indeed, if X → A violated the BCNF condition, then
A ∉ X, furthermore XA could not contain all attributes of R, because that would
imply that X is a superkey. However, X → A implies X ↠ A by (A8), which in
turn would violate the 4NF condition.

Schema R together with set of functional and multivalued dependencies D can
be decomposed into ρ = (R1 , R2 , . . . , Rk ), where each Ri is in 4NF and the
decomposition has the lossless join property. The method follows the same idea as the
decomposition into BCNF subschemata. If schema S is not in 4NF, then there exists
a multivalued dependency X ↠ Y in the projection of D onto S that violates the
4NF condition. That is, X is not a superkey in S, Y is neither empty nor a subset
of X, furthermore the union of X and Y is not S. It can be assumed without loss
of generality that X and Y are disjoint, since X ↠ (Y − X) is implied by X ↠ Y
using (A1), (A7) and the decomposition rule. In this case S can be replaced by
subschemata S1 = XY and S2 = S − Y , each having a smaller number of attributes
than S itself, thus the process terminates in finite time.
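The decomposition loop just described can be sketched as a recursion. The helper find_violation below is hypothetical and must be supplied by the caller; it hides the expensive part, namely finding a 4NF-violating X ↠ Y among the projected dependencies (or reporting None when the subschema is already in 4NF).

```python
# Recursive 4NF decomposition sketch. S is a set of attributes;
# find_violation(S) returns disjoint (X, Y) with X ->> Y violating 4NF in S,
# or None when S is already in 4NF.
def decompose_4nf(S, find_violation):
    v = find_violation(S)
    if v is None:
        return [S]                 # S is in 4NF, keep it as one subschema
    X, Y = v                       # X not a superkey, Y nonempty, XY != S
    return (decompose_4nf(X | Y, find_violation) +    # subschema S1 = XY
            decompose_4nf(S - Y, find_violation))     # subschema S2 = S - Y
```

For instance, on a Course–Teacher–Book style schema {C, T, B} where C ↠ T violates 4NF, the sketch splits the schema into CT and CB.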
Two things have to be dealt with in order to see that the process above is correct.

• Decomposition S1 , S2 has the lossless join property.
• How can the projected dependency set πS (D) be computed?

The first problem is answered by the following theorem.

Theorem 18.33 The decomposition ρ = (R1 , R2 ) of schema R has the lossless
join property with respect to a set of functional and multivalued dependencies D iff

(R1 ∩ R2 ) ↠ (R1 − R2 ).

Proof The decomposition ρ = (R1 , R2 ) of schema R has the lossless join property iff
for any relation r over the schema R that satisfies all dependencies from D the
following holds: if µ and ν are two tuples of r, then the tuple ϕ satisfying
ϕ[R1 ] = µ[R1 ] and ϕ[R2 ] = ν[R2 ], if it exists, is also contained in r. More
precisely, ϕ is the natural join of the projections of µ on R1 and of ν on R2 ,
respectively, which exists iff µ[R1 ∩ R2 ] = ν[R1 ∩ R2 ]. Thus the fact that ϕ is
always contained in r is equivalent to
(R1 ∩ R2 ) ↠ (R1 − R2 ).
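On a concrete relation, the criterion of Theorem 18.33 can be checked by brute force: project r onto R1 and R2, join the projections back, and compare with r. The sketch below models tuples as attribute → value dicts; it tests a given instance, not the logical implication from D.

```python
# Lossless join test for a binary decomposition on a concrete relation r.
def project(r, attrs):
    # Canonical form: a tuple of sorted (attribute, value) pairs.
    return {tuple(sorted((a, t[a]) for a in attrs)) for t in r}

def join_pair(p1, p2):
    # Natural join of two projected relations on their common attributes.
    out = set()
    for t1 in p1:
        for t2 in p2:
            d1, d2 = dict(t1), dict(t2)
            if all(d1[a] == d2[a] for a in d1.keys() & d2.keys()):
                out.add(tuple(sorted({**d1, **d2}.items())))
    return out

def is_lossless(r, R1, R2):
    return join_pair(project(r, R1), project(r, R2)) == project(r, R1 | R2)
```

A relation satisfying C ↠ T decomposes losslessly into CT and CB; removing one tuple breaks the multivalued dependency, and the join then regenerates a spurious tuple.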

To compute the projection πS (D) of the dependency set D one can use the
following theorem of Aho, Beeri and Ullman. πS (D) is the set of multivalued
dependencies that are logical implications of D and use attributes of S only.

Theorem 18.34 (Aho, Beeri and Ullman). πS (D) consists of the following dependencies:
• For all X → Y ∈ D+ , if X ⊆ S, then X → (Y ∩ S) ∈ πS (D).
• For all X ↠ Y ∈ D+ , if X ⊆ S, then X ↠ (Y ∩ S) ∈ πS (D).
Other dependencies cannot be derived from the fact that D holds in R.
Unfortunately this theorem does not help in computing the projected dependencies
in polynomial time, since even computing D+ could take exponential time. Thus, the
algorithm of 4NF decomposition is not polynomial either, because the 4NF condition
must be checked with respect to the projected dependencies in the subschemata.
This is in deep contrast with the case of BCNF decomposition. The reason is that
to check the BCNF condition one does not need to compute the projected dependencies;
only closures of attribute sets need to be considered, according to Lemma 18.21.

Exercises
18.2-1 Are the following inference rules sound?
a. If XW → Y and XY → Z, then X → (Z − W ).
b. If X ↠ Y and Y ↠ Z, then X ↠ Z.
c. If X ↠ Y and XY → Z, then X → Z.
18.2-2 Prove Theorem 18.30, that is, show the following. Let D be a set of functional
and multivalued dependencies, and let m(D) = {X ↠ Y : X ↠ Y ∈ D} ∪ {X ↠ A :
A ∈ Y for some X → Y ∈ D}. Then
a. D |= X → Y =⇒ m(D) |= X ↠ Y , and
b. D |= X ↠ Y ⇐⇒ m(D) |= X ↠ Y .
Hint. Use induction on the inference rules to prove b.
18.2-3 Consider the database of an investment firm, whose attributes are as follows:
B (stockbroker), O (office of stockbroker), I (investor), S (stock), A (amount
of stocks of the investor), D (dividend of the stock). The following functional
dependencies are valid: S → D, I → B, IS → A, B → O.
a. Determine a key of schema R = BOISAD.
b. How many keys are in schema R?
c. Give a lossless join decomposition of R into subschemata in BCNF.
d. Give a dependency preserving and lossless join decomposition of R into sub-
schemata in 3NF.
18.2-4 The schema R of Exercise 18.2-3 is decomposed into subschemata SD, IB,
ISA and BO. Does this decomposition have the lossless join property?
18.2-5 Assume that schema R of Exercise 18.2-3 is represented by ISA, IB, SD and
ISO subschemata. Give a minimal cover of the projections of dependencies given
in Exercise 18.2-3. Exhibit a minimal cover for the union of the sets of projected
dependencies. Is this decomposition dependency preserving?
18.2-6 Let the functional dependency S → D of Exercise 18.2-3 be replaced by
the multivalued dependency S ↠ D. That is, D represents the stock’s dividend
“history”.
a. Compute the dependency basis of I.
b. Compute the dependency basis of BS.
c. Give a decomposition of R into subschemata in 4NF.
18.2-7 Consider the decomposition ρ = {R1 , R2 , . . . , Rk } of schema R. Let ri =
πRi (r), furthermore mρ (r) = πR1 (r) ⋈ πR2 (r) ⋈ · · · ⋈ πRk (r). Prove:
a. r ⊆ mρ (r).
b. If s = mρ (r), then πRi (s) = ri .
c. mρ (mρ (r)) = mρ (r).
18.2-8 Prove that schema (R, F ) is in BCNF iff for arbitrary A ∈ R and key
X ⊂ R, it holds that there exists no Y ⊆ R for which X → Y ∈ F + ; Y → X ∉ F + ;
Y → A ∈ F + and A ∉ Y .
18.2-9 Prove Lemma 18.20.
18.2-10 Let us assume that R2 ⊂ R1 ⊂ R and the set of functional dependencies
of schema R is F . Prove that πR2 (πR1 (F )) = πR2 (F ).

18.2-11 Give a O(n2 ) running time algorithm to find a key of the relational schema
(R, F ). Hint. Use that R is superkey and each superkey contains a key. Try to drop
attributes from R one-by-one and check whether the remaining set is still a key.
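The idea of the hint can be sketched with the Closure routine of Section 18.1, restated here so the snippet is self-contained (FDs are (lhs, rhs) pairs of frozensets). This is a quadratic-ish sketch of the idea, not the tight O(n²) implementation asked for.

```python
def closure(F, X):
    # Attribute closure of X under the functional dependencies in F.
    C = set(X)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in F:
            if lhs <= C and not rhs <= C:
                C |= rhs
                changed = True
    return C

def find_key(R, F):
    # Start from the superkey R and drop attributes while it stays a superkey.
    K = set(R)
    for A in sorted(R):               # fixed order -> deterministic result
        if closure(F, K - {A}) == set(R):
            K -= {A}
    return K                          # what remains is a key
```

Each pass tests whether the remaining set still determines all of R; the attribute order decides which of the possibly many keys is returned.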
18.2-12 Prove that axioms (A1)–(A8) are sound for functional and multivalued
dependencies.
18.2-13 Derive the four inference rules of Proposition 18.27 from axioms (A1)–
(A8).

18.3. Generalised dependencies


Two dependencies are discussed in this section that generalise the previous ones,
but cannot be axiomatised with axioms similar to (A1)–(A8).

18.3.1. Join dependencies


Theorem 18.33 states that a multivalued dependency is equivalent to the lossless
join property of some decomposition of the schema into two parts. Its generalisation
is the join dependency.

Definition 18.35 Let R be a relational schema and let R = X1 ∪ X2 ∪ · · · ∪ Xk .
The relation r belonging to R is said to satisfy the join dependency

⋈[X1 , X2 , . . . , Xk ]

if
r = πX1 (r) ⋈ πX2 (r) ⋈ · · · ⋈ πXk (r) .
In this setting r satisfies the multivalued dependency X ↠ Y iff it satisfies the join
dependency ⋈[XY, X(R − Y )]. The join dependency ⋈[X1 , X2 , . . . , Xk ] expresses
that the decomposition ρ = (X1 , X2 , . . . , Xk ) has the lossless join property. One can
define the fifth normal form, 5NF.

Definition 18.36 The relational schema R is in fifth normal form (5NF), if it is
in 4NF and has no non-trivial join dependency.
The fifth normal form has theoretical significance primarily. The schemata used in
practice usually have primary keys. Using that the schema could be decomposed
into subschemata of two attributes each, where one of the attributes is a superkey
in every subschema.
Example 18.10 Consider the database of the clients of a bank (Clientnumber, Name,
Address, accountBalance). Here C is a unique identifier, thus the schema could
be decomposed into (CN, CA, CB), which has the lossless join property. However, it
is not worth doing so, since no storage space can be saved, furthermore no anomalies
are avoided with it.
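A join dependency on a concrete relation can again be checked by brute force: r satisfies ⋈[X1, . . . , Xk] iff the chain of natural joins of its projections returns exactly r. The sketch below (dict tuples, illustrative data) is exercised on the classic three-attribute relation that satisfies the ternary ⋈[AB, BC, CA] although, for example, the binary ⋈[AB, AC] fails on it.

```python
from functools import reduce

def project(r, attrs):
    # Canonical form: a tuple of sorted (attribute, value) pairs.
    return {tuple(sorted((a, t[a]) for a in attrs)) for t in r}

def natural_join(p1, p2):
    out = set()
    for t1 in p1:
        for t2 in p2:
            d1, d2 = dict(t1), dict(t2)
            if all(d1[a] == d2[a] for a in d1.keys() & d2.keys()):
                out.add(tuple(sorted({**d1, **d2}.items())))
    return out

def satisfies_jd(r, components):
    # r |= ⋈[X1, ..., Xk]  iff  joining the projections reproduces r exactly.
    joined = reduce(natural_join, (project(r, X) for X in components))
    return joined == project(r, set().union(*components))
```

For r = {(0,0,1), (0,1,0), (1,0,0), (0,0,0)} over ABC the ternary join dependency holds while the binary one does not, so this join dependency is not equivalent to any single multivalued dependency.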

There exists an axiomatisation of a dependency system if there is a finite set
of inference rules that is sound and complete, i.e. logical implication coincides with
being derivable by using the inference rules. For example, the Armstrong-axioms
give an axiomatisation of functional dependencies, while the set of rules (A1)–(A8)
plays the same role for functional and multivalued dependencies considered together.
Unfortunately, the following negative result is true.

Theorem 18.37 The family of join dependencies has no finite axiomatisation.

In contrast to the above, Abiteboul, Hull and Vianu show in their book that
the logical implication problem is algorithmically decidable for the family of
functional and join dependencies taken together. The complexity of the problem is
as follows.

Theorem 18.38
• It is NP-complete to decide whether a given join dependency is implied by another
given join dependency and a functional dependency.
• It is NP-hard to decide whether a given join dependency is implied by a given set
of multivalued dependencies.

18.3.2. Branching dependencies

A generalisation of functional dependencies is the family of branching dependencies.
Let us assume that A, B ⊂ R and there exist no q + 1 rows in relation r over
schema R such that they contain at most p distinct values in the columns of A, but
all q + 1 values are pairwise distinct in some column of B. Then B is said to be
(p, q)-dependent on A, in notation A −(p,q)→ B. In particular, A −(1,1)→ B holds
if and only if the functional dependency A → B holds.

Example 18.11 Consider the database of the trips of an international transport truck.
• One trip: four distinct countries.
• One country has at most five neighbours.
• There are 30 countries to be considered.
Let x1 , x2 , x3 , x4 be the attributes of the countries reached in a trip. In this case
xi −(1,1)→ xi+1 does not hold, however another dependency is valid:

xi −(1,5)→ xi+1 .

The storage space requirement of the database can be significantly reduced using these
dependencies. The range of each element of the original table consists of 30 values, names of
countries or some codes of them (5 bits each, at least). Let us store a little table (30×5×5 =
750 bits) that contains a numbering of the neighbours of each country, which assigns to them
the numbers 0,1,2,3,4 in some order. Now we can replace attribute x2 by these numbers
(x∗2 ), because the value of x1 gives the starting country and the value of x∗2 determines the
second country with the help of the little table. The same holds for the attribute x3 , but
we can decrease the number of possible values even further, if we give a table of numbering
the possible third countries for each x1 , x2 pair. In this case, the attribute x∗3 can take only
4 different values. The same holds for x4 , too. That is, while each element of the original
table could be encoded by 5 bits, now for the cost of two little auxiliary tables we could
decrease the length of the elements in the second column to 3 bits, and that of the elements
in the third and fourth columns to 2 bits.
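The arithmetic behind these savings can be checked mechanically: ceil(log2 n) bits distinguish n values, and all the numbers below come from the example.

```python
from math import ceil, log2

def bits(n):
    # number of bits needed to distinguish n values
    return ceil(log2(n))

assert bits(30) == 5        # a country name/code: 30 possible values
assert bits(5) == 3         # x2*: at most 5 neighbours, numbered 0..4
assert bits(4) == 2         # x3*, x4*: at most 4 possible next countries
assert 30 * 5 * 5 == 750    # little table: 30 countries x 5 neighbours x 5 bits
# a trip record shrinks from 4 x 5 = 20 bits to 5 + 3 + 2 + 2 = 12 bits
assert 4 * 5 == 20 and 5 + 3 + 2 + 2 == 12
```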

The (p, q)-closure of an attribute set X ⊂ R can be defined as

Cp,q (X) = {A ∈ R : X −(p,q)→ A} .

In particular, C1,1 (X) = X + . In the case of branching dependencies even such basic
questions are hard as whether there exists an Armstrong-relation for a given
family of dependencies.
Definition 18.39 Let R be a relational schema, F be a set of dependencies of some
dependency family F defined on R. A relation r over schema R is Armstrong-
relation for F , if the set of dependencies from F that r satisfies is exactly F , that
is F = {σ ∈ F : r |= σ}.
Armstrong proved that for an arbitrary set of functional dependencies F there exists
an Armstrong-relation for F + . The proof is based on the three properties of closures
of attribute sets with respect to F listed in Exercise 18.1-1. For branching
dependencies only the first two hold in general.
Lemma 18.40 Let 0 < p ≤ q, furthermore let R be a relational schema. For
X, Y ⊆ R one has
1. X ⊆ Cp,q (X) and
2. X ⊆ Y =⇒ Cp,q (X) ⊆ Cp,q (Y ).

There exist a mapping C : 2R → 2R and natural numbers p, q such that there is
no Armstrong-relation for C in the family of (p, q)-dependencies.
Grant and Minker investigated numerical dependencies, which are similar to
branching dependencies. For attribute sets X, Y ⊆ R the dependency X −k→ Y holds
in a relation r over schema R if for every tuple value taken on the set of attributes
X, there exist at most k distinct tuple values taken on Y . This condition is stronger
than that of X −(1,k)→ Y , since the latter only requires that in each column of Y
there are at most k values, independently of each other. That allows k^{|Y −X|}
different Y projections. Numerical dependencies were axiomatised in some special
cases; based on that, Katona showed that branching dependencies have no finite
axiomatisation. It is still an open problem whether logical implication is
algorithmically decidable amongst branching dependencies.

Exercises
18.3-1 Prove Theorem 18.38.
18.3-2 Prove Lemma 18.40.
18.3-3 Prove that if p = q, then Cp,p (Cp,p (X)) = Cp,p (X) holds, besides the two
properties of Lemma 18.40.
18.3-4 A mapping C : 2R → 2R is called a closure if it satisfies the two properties
of Lemma 18.40 and the third one of Exercise 18.3-3. Prove that if C : 2R → 2R
is a closure, and F is the family of dependencies defined by X → Y ⇐⇒ Y ⊆ C(X),
then there exists an Armstrong-relation for F in the family of (1, 1)-dependencies
(functional dependencies) and in the family of (2, 2)-dependencies, respectively.

18.3-5 Let C be the closure defined by

C(X) = X, if |X| < 2, and C(X) = R otherwise.

Prove that there exists no Armstrong-relation for C in the family of
(n, n)-dependencies, if n > 2.

Problems

18-1 External attributes

Maier calls attribute A an external attribute in the functional dependency X → Y
with respect to the family of dependencies F over schema R, if one of the following
two conditions holds:
1. (F − {X → Y }) ∪ {X → (Y − A)} |= X → Y , or
2. (F − {X → Y }) ∪ {(X − A) → Y } |= X → Y .

Design an O(n²) running time algorithm whose input is a schema (R, F ) and whose
output is a set of dependencies G equivalent to F that has no external attributes.
18-2 The order of the elimination steps in the construction of a minimal
cover is important
In the procedure Minimal-Cover(R, F ) the set of functional dependencies was
changed in two ways: either by dropping redundant dependencies, or by dropping
redundant attributes from the left hand sides of the dependencies. If the latter
method is used first, until there is no attribute left that can be dropped from some
left hand side, and only then the first method, then a minimal cover is indeed
obtained, according to Proposition 18.6. Prove that if the first method is applied
first and then the second, each until no further application is possible, then the
resulting set of dependencies is not necessarily a minimal cover of F .
18-3 BCNF subschema
Prove that the following problem is coNP-complete: Given a relational schema R with
set of functional dependencies F , furthermore S ⊂ R, decide whether (S, πS (F )) is
in BCNF.
18-4 3NF is hard to recognise
Let (R, F ) be a relational schema, where F is the system of functional dependencies.
The k size key problem is the following: given a natural number k, determine
whether there exists a key of size at most k.
The prime attribute problem is the following: for a given A ∈ R, determine
whether it is a prime attribute.
a. Prove that the k size key problem is NP-complete. Hint. Reduce the vertex
cover problem to the k size key problem.
b. Prove that the prime attribute problem is NP-complete by reducing the k size
key problem to it.

c. Prove that determining whether the relational schema (R, F ) is in 3NF is
NP-complete. Hint. Reduce the prime attribute problem to it.

18-5 Running time of Dependency-basis


Give an implementation of procedure Dependency-basis, whose running time is
O(|M | · |R|3 ).

Chapter Notes
The relational data model was introduced by Codd [7] in 1970. Functional
dependencies were treated in his paper of 1972 [?], their axiomatisation was completed
by Armstrong [?]. The logical implication problem for functional dependencies was
investigated by Beeri and Bernstein [4], furthermore by Maier [19]. Maier also treats
the possible definitions of minimal covers, their connections and the complexity
of their computation in that paper. Maier, Mendelzon and Sagiv found a method
to decide logical implications among general dependencies [20]. Beeri, Fagin and
Howard proved that the axiom system (A1)–(A8) is sound and complete for functional
and multivalued dependencies taken together [?]. Yu and Johnson [?] constructed
a relational schema where |F | = k and the number of keys is k!. Békéssy and
Demetrovics [6] gave a simple and beautiful proof for the statement that from k
functional dependencies at most k! keys can be obtained, thus Yu and Johnson’s
construction is extremal.
Armstrong-relations were introduced and studied by Fagin [?, 13], furthermore
by Beeri, Fagin, Dowd and Statman [5].
Multivalued dependencies were independently discovered by Zaniolo [?], Fagin
[12] and Delobel [8].
The necessity of normal forms was recognised by Codd while studying update
anomalies [?, ?]. The Boyce–Codd normal form was introduced in [?]. The definition
of the third normal form used in this chapter was given by Zaniolo [28]. Complexity
of decomposition into subschemata in certain normal forms was studied by Lucchesi
and Osborne [18], Beeri and Bernstein [4], furthermore Tsou and Fischer [26].
Theorems 18.30 and 18.31 are results of Beeri [3]. Theorem 18.34 is from a
paper of Aho, Beeri and Ullman [2].
Theorems 18.37 and 18.38 are from the book of Abiteboul, Hull and Vianu [1];
the non-existence of a finite axiomatisation of join dependencies is Petrov’s result [21].
Branching dependencies were introduced by Demetrovics, Katona and Sali, who
studied the existence of Armstrong-relations and the size of minimal
Armstrong-relations [9, 10, 11, 23]. Katona showed that there exists no finite
axiomatisation of branching dependencies (ICDT’92 Berlin, invited talk), but the
result was never published.
Possibilities of axiomatisation of numerical dependencies were investigated by
Grant and Minker [15, 16].
A good introduction to the concepts of this chapter can be found in the books of
Abiteboul, Hull and Vianu [1], Ullman [27] and Thalheim [24], respectively.
19. Query Rewriting in Relational
Databases

In the chapter “Relational database design” basic concepts of relational databases
were introduced, such as relational schema, relation, and instance. Databases were
studied from the designer's point of view; the main question was how to avoid
redundant data storage and various anomalies arising during the use of the database.
In the present chapter the schema is considered to be given and the focus is on
fast and efficient ways of answering user queries. First, basic (theoretical) query
languages and their connections are reviewed in Section 19.1.
In the second part of this chapter (Section 19.2) views are considered. Informally,
a view is nothing else but the result of a query. The use of views in query
efficiency, in providing physical data independence and in data integration is explained.
Finally, the third part of the present chapter (Section 19.3) introduces query
rewriting.

19.1. Queries
Consider the database of cinemas in Budapest. Assume that the schema consists of
three relations:

CinePest = {Film,Theater,Show} . (19.1)

The schemata of individual relations are as follows:

Film = {Title, Director, Actor} ,


Theater = {Theater, Address, Phone} , (19.2)
Show = {Theater, Title, Time} .
Possible values of instances of each relation are shown on Figure 19.1.
Typical user queries could be:
19.1 Who is the director of “Control”?
19.2 List the names and addresses of those theatres where Kurosawa films are played.
19.3 Give the names of directors who played a part in some of their films.

Film
Title      Director         Actor
Control    Antal, Nimród    Csányi, Sándor
Control    Antal, Nimród    Mucsi, Zoltán
Control    Antal, Nimród    Pindroch, Csaba
...        ...              ...
Rashomon   Akira Kurosawa   Toshiro Mifune
Rashomon   Akira Kurosawa   Machiko Kyo
Rashomon   Akira Kurosawa   Mori Masayuki

Theater
Theater      Address                  Phone
Bem          II., Margit Blvd. 5/b.   316-8708
Corvin       VIII., Corvin alley 1.   459-5050
Európa       VII., Rákóczi st. 82.    322-5419
Művész       VI., Teréz blvd. 30.     332-6726
...          ...                      ...
Uránia       VIII., Rákóczi st. 21.   486-3413
Vörösmarty   VIII., Üllői st. 4.      317-4542

Show
Theater   Title      Time
Bem       Rashomon   19:00
Bem       Rashomon   21:30
Uránia    Control    18:15
Művész    Rashomon   16:30
Művész    Control    17:00
...       ...        ...
Corvin    Control    10:15

Figure 19.1 The database CinePest.

These queries define a mapping from the relations of the database schema CinePest to
some other schema (in the present case to schemata of single relations). Formally,
query and query mapping should be distinguished. The former is a syntactic
concept, the latter is a mapping from the set of instances over the input schema
to the set of instances over the output schema that is determined by the query
according to some semantic interpretation. However, for both concepts the word
“query” is used for the sake of convenience; which one is meant will be clear from
the context.
Definition 19.1 Queries q1 and q2 over schema R are said to be equivalent, in
notation q1 ≡ q2 , if they have the same output schema and for every instance I over
schema R, q1 (I) = q2 (I) holds.
In the remaining of this chapter the most important query languages are reviewed.
The expressive powers of query languages need to be compared.
Definition 19.2 Let Q1 and Q2 be query languages (with appropriate semantics).
Q1 is dominated by Q2 (Q1 is weaker than Q2 ), in notation Q1 ⊑ Q2 , if for
every query q1 of Q1 there exists a query q2 ∈ Q2 such that q1 ≡ q2 . Q1 and Q2 are
equivalent, if Q1 ⊑ Q2 and Q1 ⊒ Q2 .

Example 19.1 Query. Consider Question 19.2. As a first attempt the following solution
is obtained:

if there exist in relations Film, Theater and Show tuples (xT , ”Akira Kurosawa”, xA ),
(xT h , xAd , xP ) and (xT h , xT , xT i ),
then put the tuple (Theater : xT h , Address : xAd ) into the output relation.

xT , xA , xT h , xAd , xP , xT i denote distinct variables that take their values from the
domains of the corresponding attributes, respectively. Using the same variable in
different tuples implicitly marks where identical values must occur.

19.1.1. Conjunctive queries

Conjunctive queries are the simplest kind of queries; they are the easiest to
handle and have the nicest properties. Three equivalent forms will be studied:
two of them are based on logic, the third one is of an algebraic nature. The name
comes from first order logic expressions that contain only existential quantifiers (∃)
and consist of atomic expressions connected with logical “and”, that is, conjunction.

Datalog – rule based queries The tuple (x1 , x2 , . . . , xm ) is called a free tuple
if the xi ’s are variables or constants. This is a generalisation of a tuple of a
relational instance. For example, the tuple (xT , ”Akira Kurosawa”, xA ) in Example 19.1
is a free tuple.
Definition 19.3 Let R be a relational database schema. Rule based conjunctive
query is an expression of the following form

ans(u) ← R1 (u1 ), R2 (u2 ), . . . , Rn (un ) , (19.3)

where n ≥ 0, R1 , R2 , . . . , Rn are relation names from R, ans is a relation name not


in R, u, u1 , u2 , . . . , un are free tuples. Every variable occurring in u must occur in
one of u1 , u2 , . . . , un , as well.
The rule based conjunctive query is also called a rule for the sake of simplicity.
ans(u) is the head of the rule, R1 (u1 ), R2 (u2 ), . . . , Rn (un ) is the body of the rule,

Ri (ui ) is called a (relational) atom. It is assumed that each variable of the head
also occurs in some atom of the body.
A rule can be considered as a tool that tells how we can deduce newer and
newer facts, that is, tuples, to include in the output relation. If the variables of
the rule can be assigned such values that each atom Ri (ui ) is true (that is, the
appropriate tuple is contained in the relation Ri ), then the tuple u is added to the
relation ans. Since all variables of the head occur in some atoms of the body, one
never has to consider infinite domains, since the variables can take values from the
actual instance queried. Formally, let I be an instance over relational schema R,
furthermore let q be the query given by rule (19.3). Let var(q) denote the set of
variables occurring in q, and let dom(I) denote the set of constants that occur in I.
The image of I under q is given by

q(I) = {ν(u) | ν : var(q) → dom(I) and ν(ui ) ∈ Ri , i = 1, 2, . . . , n} . (19.4)

An immediate way of calculating q(I) is to consider all possible valuations ν in


some order. There are more efficient algorithms, either by equivalent rewriting of
the query, or by using some indices.
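Formula (19.4) can be evaluated naively by trying every valuation ν : var(q) → dom(I). The sketch below does exactly that on a one-row-per-relation excerpt of CinePest; the encoding (atoms as (relation name, free tuple) pairs, variables as strings starting with 'x') is our own choice, not part of the formal definition.

```python
from itertools import product

def is_var(term):
    return isinstance(term, str) and term.startswith('x')

def evaluate(head, body, I):
    # Collect the variables of the body and the active domain of the instance.
    variables = sorted({t for _, u in body for t in u if is_var(t)})
    dom = sorted({c for rel in I.values() for row in rel for c in row})
    result = set()
    for values in product(dom, repeat=len(variables)):   # every valuation ν
        nu = dict(zip(variables, values))
        subst = lambda u: tuple(nu.get(t, t) for t in u)
        if all(subst(u) in I[rel] for rel, u in body):   # ν(ui) ∈ Ri for all i?
            result.add(subst(head))                      # keep ν(u)
    return result
```

On the Kurosawa rule of Example 19.1 this returns the theatre/address pairs; the exponential enumeration over dom(I)^|var(q)| is exactly why real systems rewrite the query or use indices instead.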
An important difference between atoms of the body and the head is that the relations
R1 , R2 , . . . , Rn are considered given, (physically) stored, while the relation ans
is not: it is thought to be calculated with the help of the rule. This justifies the
names: the Ri ’s are extensional relations and ans is an intensional relation.
Query q over schema R is monotone if for instances I and J over R, I ⊆ J
implies q(I) ⊆ q(J ). q is satisfiable if there exists an instance I such that
q(I) ≠ ∅. The proof of the next simple observation is left for the Reader
(Exercise 19.1-1).

Claim 19.4 Rule based queries are monotone and satisfiable.

Claim 19.4 shows the limitations of rule based queries. For example, the query
“Which theatres play only Kurosawa films?” is obviously not monotone, hence it cannot
be expressed by rules of the form (19.3).

Tableau queries. If the difference between variables and constants is not
considered, then the body of a rule can be viewed as an instance over the schema.
This leads to a tabular form of conjunctive queries that is most similar to the
visual queries (QBE: Query By Example) of the database management system Microsoft
Access.

Definition 19.5 A tableau over the schema R is a generalisation of an instance


over R, in the sense that variables may occur in the tuples besides constants. The
pair (T, u) is a tableau query if T is a tableau and u is a free tuple such that all
variables of u occur in T, as well. The free tuple u is the summary.

The summary row u of tableau query (T, u) shows which tuples form the result
of the query. The essence of the procedure is that the pattern given by tableau
T is searched for in the database, and if it is found then the tuple corresponding
to the summary is included in the output relation. More precisely, the mapping
ν : var(T) → dom(I) is an embedding of tableau (T, u) into instance I if ν(T) ⊆ I.
The output relation of tableau query (T, u) consists of all tuples ν(u) such that ν
is an embedding of tableau (T, u) into instance I.

Example 19.2 Tableau query. Let T be the following tableau.

Film      Title   Director            Actor
          xT      “Akira Kurosawa”    xA

Theater   Theater   Address   Phone
          xT h      xAd       xP

Show      Theater   Title   Time
          xT h      xT      xT i

The tableau query (T, ⟨Theater : xT h , Address : xAd ⟩) answers Question 19.2 of the
introduction.

The syntax of tableau queries is similar to that of rule based queries. It will
be useful later that conditions for one query to contain another one can be easily
formulated in the language of tableau queries.

Relational algebra∗ . A database consists of relations, and a relation is a set of
tuples. The result of a query is also a relation with a given attribute set. It is a
natural idea that the output of a query could be expressed by algebraic and other
operations on relations. The relational algebra∗ consists of the following operations.1
Selection: It is of form either σA=c or σA=B , where A and B are attributes
while c is a constant. The operation can be applied to all such relations R
that has attribute A (and B), and its result is relation ans that has the same
set of attributes as R has, and consists of all tuples that satisfy the selection
condition.
Projection: The form of the operation is πA1 ,A2 ,...,An , n ≥ 0, where the Ai ’s are
distinct attributes. It can be applied to all such relations whose attribute
set includes each Ai and its result is the relation ans that has attribute set
{A1 , A2 , . . . , An }:
ans = {t[A1 , A2 , . . . , An ] | t ∈ R} ,
that is, it consists of the restrictions of the tuples in R to the attribute set
{A1 , A2 , . . . , An }.
Natural join: This operation has been defined earlier in the chapter “Relational
database design”. Its notation is ⋈; its input consists of two (or more) relations
R1 , R2 , with attribute sets V1 , V2 , respectively. The attribute set of the output
relation is V1 ∪ V2 .
R1 ⋈ R2 = {t tuple over V1 ∪ V2 | ∃v ∈ R1 , ∃w ∈ R2 , t[V1 ] = v and t[V2 ] = w} .
Renaming: Attribute renaming is nothing else but an injective mapping from a
finite set of attributes U into the set of all attributes. Attribute renaming f can
be given by the list of pairs (A, f (A)), where A ≠ f (A), which is usually written
in the form A1 A2 . . . An → B1 B2 . . . Bn . The renaming operator δf maps
from inputs over U to outputs over f [U ]. If R is a relation over U , then

δf (R) = {v over f [U ] | ∃u ∈ R, v(f (A)) = u(A), ∀A ∈ U } .

1 The relational algebra∗ is the monotone part of the (full) relational algebra introduced later.

Relational algebra∗ queries are obtained by finitely many applications of the opera-
tions above from relational algebra base queries, which are
Input relation: R.
Single constant: {⟨A : a⟩}, where a is a constant, A is an attribute name.
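The four operations above can be sketched in executable form. The following Python sketch (Python is our own choice; the text prescribes no programming language) represents a relation as a list of rows, each row a dict from attribute names to values; all relation contents are invented sample data over the CinePest schema.

```python
# A sketch of the relational algebra* operations in Python. A relation
# is a list of rows; a row is a dict from attribute names to values.
# All relation contents below are invented sample data.

def select_eq(rel, attr, const):
    """sigma_{A=c}: keep the rows whose attribute equals the constant."""
    return [t for t in rel if t[attr] == const]

def project(rel, attrs):
    """pi_{A1,...,An}: restrict each row to the listed attributes."""
    out = []
    for t in rel:
        r = {a: t[a] for a in attrs}
        if r not in out:                 # relations are sets of tuples
            out.append(r)
    return out

def natural_join(r1, r2):
    """r1 |><| r2: merge rows agreeing on the common attributes.
    Assumes all rows of a relation share the same attribute set."""
    common = set(r1[0]) & set(r2[0]) if r1 and r2 else set()
    out = []
    for v in r1:
        for w in r2:
            if all(v[a] == w[a] for a in common):
                t = {**v, **w}
                if t not in out:
                    out.append(t)
    return out

def rename(rel, f):
    """delta_f: injectively rename attributes by the dict f."""
    return [{f.get(a, a): t[a] for a in t} for t in rel]

film = [{"Title": "Rashomon", "Director": "Akira Kurosawa", "Actor": "T. Mifune"}]
show = [{"Theater": "Odeon", "Title": "Rashomon", "Time": "19:00"}]
theater = [{"Theater": "Odeon", "Address": "Main street 1"}]

ans = project(
    natural_join(natural_join(select_eq(film, "Director", "Akira Kurosawa"),
                              show), theater),
    ["Theater", "Address"])
```

The expression computed in `ans` is the Kurosawa question of the introduction: select, join twice, then project.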

Example 19.3 Relational algebra∗ query. The question 19.2. of the introduction can be
expressed with relational algebra operations as follows.

πTheater,Address ((σDirector=“Akira Kurosawa” (Film) ⋈ Show) ⋈ Theater) .

The mapping given by a relational algebra∗ query can be easily defined via induction
on the operation tree. It is easy to see (Exercise 19.1-2) that non-satisfiable queries
can be given using relational algebra∗ . There exists no rule based or tableau query
equivalent with such a non-satisfiable query. Nevertheless, the following is true.
Theorem 19.6 Rule based queries, tableau queries and satisfiable relational
algebra∗ are equivalent query languages.
The proof of Theorem 19.6 consists of three main parts:
1. Rule based ≡ Tableau
2. Satisfiable relational algebra∗ ⊑ Rule based
3. Rule based ⊑ Satisfiable relational algebra∗

The first (easy) step is left to the Reader (Exercise 19.1-3). For the second step, it has
to be seen first, that the language of rule based queries is closed under composition.
More precisely, let R = {R1 , R2 , . . . , Rn } be a database, q be a query over R. If the
output relation of q is S1 , then in a subsequent query S1 can be used in the same
way as any extensional relation of R. Thus relation S2 can be defined, then with its
help relation S3 can be defined, and so on. Relations Si are intensional relations.
The conjunctive query program P is a list of rules

S1 (u1 ) ← body1
S2 (u2 ) ← body2
.. (19.5)
.
Sm (um ) ← bodym ,

where the Si ’s are pairwise distinct and not contained in R. In rule body bodyi only
relations R1 , R2 , . . . , Rn and S1 , S2 , . . . , Si−1 can occur. Sm is considered to be the
output relation of P ; its evaluation is done by computing the results of the rules
19.1. Queries 901

one-by-one in order. It is not hard to see that with appropriate renaming of the variables
P can be substituted by a single rule, as it is shown in the following example.

Example 19.4 Conjunctive query program. Let R = {Q, R}, and consider the following
conjunctive query program

S1 (x, z) ← Q(x, y), R(y, z, w)


S2 (x, y, z) ← S1 (x, w), R(w, y, v), S1 (v, z) (19.6)
S3 (x, z) ← S2 (x, u, v), Q(v, z) .

S2 can be written using Q and R only by the first two rules of (19.6)

S2 (x, y, z) ← Q(x, y1 ), R(y1 , w, w1 ), R(w, y, v), Q(v, y2 ), R(y2 , z, w2 ) . (19.7)

It is apparent that some variables had to be renamed to avoid unwanted interaction of


rule bodies. Substituting expression (19.7) into the third rule of (19.6) in place of S2 , and
appropriately renaming the variables

S3 (x, z) ← Q(x, y1 ), R(y1 , w, w1 ), R(w, u, v1 ), Q(v1 , y2 ), R(y2 , v, w2 ), Q(v, z). (19.8)

is obtained.

Thus it is enough to realise each single relational algebra∗ operation by an ap-


propriate rule.
P ⋈ Q: Let x⃗ denote the list of variables (and constants) corresponding to the
common attributes of P and Q, let y⃗ denote the variables (and constants) corresponding
to the attributes occurring only in P , while z⃗ denotes those corresponding
to Q’s own attributes. Then rule ans(x⃗, y⃗, z⃗) ← P (x⃗, y⃗), Q(x⃗, z⃗)
gives exactly relation P ⋈ Q.
σF (R): Assume that R = R(A1 , A2 , . . . , An ) and the selection condition F is of
form either Ai = a or Ai = Aj , where Ai , Aj are attributes and a is a constant. Then

ans(x1 , . . . , xi−1 , a, xi+1 , . . . , xn ) ← R(x1 , . . . , xi−1 , a, xi+1 , . . . , xn ) ,

respectively,

ans(x1 , . . . , xi−1 , y, xi+1 , . . . , xj−1 , y, xj+1 , . . . , xn ) ←


R(x1 , . . . , xi−1 , y, xi+1 , . . . , xj−1 , y, xj+1 , . . . , xn )

are the rules sought. The satisfiability of the relational algebra∗ query is used here.
Indeed, during composition of operations we never obtain an expression where
two distinct constants should be equated.
πAi1 ,Ai2 ,...,Aim (R): If R = R(A1 , A2 , . . . , An ), then

ans(xi1 , xi2 , . . . , xim ) ← R(x1 , x2 , . . . , xn )

works.
A1 A2 . . . An → B1 B2 . . . Bn : The renaming operation of relational algebra∗ can
be achieved by renaming the appropriate variables, as it was shown in Exam-
ple 19.4.

For the proof of the third step let us consider rule

ans(x⃗) ← R1 (x⃗1 ), R2 (x⃗2 ), . . . , Rn (x⃗n ) .        (19.9)

By renaming the attributes of relations Ri ’s, we may assume without loss of gener-
ality that all attribute names are distinct. Then R = R1 ⋈ R2 ⋈ · · · ⋈ Rn can be
constructed that is really a direct product, since the attribute names are distinct.
The constants and multiple occurrences of variables of rule (19.9) can be simulated
by appropriate selection operators. The final result is obtained by projecting to the
set of attributes corresponding to the variables of relation ans.

19.1.2. Extensions
Conjunctive queries form a class of query languages with many good properties.
However, the set of expressible questions is rather narrow. Consider the following.

19.4. List those pairs where one member directed the other member in a film,
and vice versa, the other member also directed the first in a film.
19.5. Which theatres show “La Dolce Vita” or “Rashomon”?
19.6. Which are those films of Hitchcock that Hitchcock did not play a part in?
19.7. List those films whose every actor played in some film of Fellini.
19.8. Let us recall the game “Chain-of-Actors”. The first player names an ac-
tor/actress, the next another one who played in some film together with the first
named. This is continued like that, always a new actor/actress has to be named
who played together with the previous one. The winner is the player who could
continue the chain last. List those actors/actresses who could be reached
by “Chain-of-Actors” starting with “Marcello Mastroianni”.

Equality atoms. Question 19.4. can be easily answered if equalities are also
allowed in rule bodies, besides relational atoms:

ans(y1 , y2 ) ← F ilm(x1 , y1 , z1 ), F ilm(x2 , y2 , z2 ), y1 = z2 , y2 = z1 . (19.10)

Allowing equalities raises two problems. First, the result of the query could become
infinite. For example, the rule based query

ans(x, y) ← R(x), y = z (19.11)

results in an infinite number of tuples, since variables y and z are not bound by
relation R, thus there can be an infinite number of evaluations that satisfy the rule
body. Hence, the concept of domain restricted query is introduced. Rule based
query q is domain restricted, if all variables that occur in the rule body also occur
in some relational atom.
The second problem is that equality atoms may cause the body of a rule to become

unsatisfiable, in contrast to Proposition 19.4. For example, query


ans(x) ← R(x), x = a, x = b (19.12)
is domain restricted, however if a and b are distinct constants, then the answer will
be empty. It is easy to check whether a rule based query with equality atoms is
satisfiable.

Satisfiable(q)
1 Compute the transitive closure of equalities of the body of q.
2 if Two distinct constants should be equal by transitivity
3 then return “Not satisfiable.”
4 else return “Satisfiable.”
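The transitive closure of line 1 can be maintained with a union-find structure, as the following Python sketch shows. The lexical convention (uppercase strings are variables, lowercase strings are constants) is an assumption of the sketch, not of the text.

```python
# A sketch of Satisfiable: close the equality atoms transitively with
# union-find and reject iff two distinct constants land in one class.
# Convention assumed here: variables are uppercase strings, constants
# are lowercase strings.

def is_constant(term):
    return term[0].islower()

def satisfiable(equalities):
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for s, t in equalities:
        rs, rt = find(s), find(t)
        if is_constant(rs) and is_constant(rt) and rs != rt:
            return False                    # two distinct constants equated
        if is_constant(rs):
            parent[rt] = rs                 # keep a constant as class root
        else:
            parent[rs] = rt

    return True

# the body of (19.12): x = a, x = b with distinct constants
print(satisfiable([("X", "a"), ("X", "b")]))    # False
```

Keeping a constant as the class representative makes the conflict test a single root comparison per equality atom.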

It is also true (Exercise 19.1-4) that if a rule based query q that contains equality
atoms is satisfiable, then there exists another rule based query q′ without equalities
that is equivalent with q.

Disjunction – union. The question 19.5. cannot be expressed with conjunctive


queries. However, if the union operator is added to relational algebra, then 19.5. can
be expressed in that extended relational algebra:
πTheater (σTitle=“La Dolce Vita” (Show) ∪ σTitle=“Rashomon” (Show)) .        (19.13)
Rule based queries are also capable of expressing question 19.5. if it is allowed that
the same relation is in the head of many distinct rules:
ans(xM ) ← Show(xT h , “La Dolce Vita”, xT i ) ,
(19.14)
ans(xM ) ← Show(xT h , “Rashomon”, xT i ) .
Non-recursive datalog program is a generalisation of this.
Definition 19.7 A non-recursive datalog program over schema R is a set of
rules
S1 (u1 ) ← body1
S2 (u2 ) ← body2
.. (19.15)
.
Sm (um ) ← bodym ,
where no relation of R occurs in a head, the same relation may occur in the head
of several rules, furthermore there exists an ordering r1 , r2 , . . . , rm of the rules such
that the relation in the head of ri does not occur in the body of any rule rj for j ≤ i.
The semantics of the non-recursive datalog program (19.15) is similar to the con-
junctive query program (19.5). The rules are evaluated in the order r1 , r2 , . . . , rm of
Definition 19.7, and if a relation occurs in more than one head then the union of the
sets of tuples given by those rules is taken.
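The union semantics for rules sharing a head can be sketched directly on (19.14); the Show tuples (Theater, Title, Time) below are invented sample data.

```python
# A sketch of the union semantics of (19.14): both rules share the
# head ans, so the answer is the union of the tuples produced by each
# rule. The Show tuples (Theater, Title, Time) are invented.

show = [
    ("Odeon", "La Dolce Vita", "19:00"),
    ("Rex", "Rashomon", "21:00"),
    ("Rex", "8 1/2", "17:00"),
]

def rule(title):
    # ans(xTh) <- Show(xTh, title, xTi)
    return {th for (th, ti, _) in show if ti == title}

ans = rule("La Dolce Vita") | rule("Rashomon")   # union of the two rules
# ans == {"Odeon", "Rex"}
```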
The union of tableau queries (Ti , u) i = 1, 2, . . . , n is denoted by
({T1 , T2 , . . . , Tn }, u). It is evaluated by individually computing the result of each
tableau query (Ti , u), then the union of them is taken. The following holds.

Theorem 19.8 The language of non-recursive datalog programs with unique out-
put relation and the relational algebra extended with the union operator are equiva-
lent.
The proof of Theorem 19.8 is similar to that of Theorem 19.6 and it is left to
the Reader (Exercise 19.1-5). Let us note that the expressive power of the union
of tableau queries is weaker. This is caused by the requirement of having the same
summary row for each tableau. For example, the non-recursive datalog program
query
ans(a) ←
(19.16)
ans(b) ←
cannot be realised as union of tableau queries.

Negation. The query 19.6. is obviously not monotone. Indeed, suppose that in
relation Film there exist tuples about Hitchcock’s film Psycho, for example (“Psy-
cho”,”A. Hitchcock”,”A. Perkins”), (“Psycho”,”A. Hitchcock”,”J. Leigh”), . . . , how-
ever, the tuple (“Psycho”,”A. Hitchcock”,”A. Hitchcock”) is not included. Then the
tuple (“Psycho”) occurs in the output of query 19.6. With some effort one can real-
ize however, that Hitchcock appears in the film Psycho, as “a man in cowboy hat”.
If the tuple (“Psycho”,”A. Hitchcock”,”A. Hitchcock”) is added to relation Film as
a consequence, then the instance of schema CinePest gets larger, but the output
of query 19.6. becomes smaller.
It is not too hard to see that the query languages discussed so far are monotone,
hence query 19.6. cannot be formulated with a non-recursive datalog program or with
some of its equivalents. Nevertheless, if the difference (−) operator is also added to
relational algebra, then it becomes capable of expressing queries of type 19.6. For
example,

πTitle (σDirector=“A. Hitchcock” (Film)) − πTitle (σActor=“A. Hitchcock” (Film)) (19.17)

realises exactly query 19.6. Hence, the (full) relational algebra consists of operations
{σ, π, ⋈, δ, ∪, −}. The importance of the relational algebra is shown by the fact that
Codd calls a query language Q relationally complete exactly if for every relational
algebra query q there exists q′ ∈ Q such that q ≡ q′ .
If negative literals, that is, atoms of the form ¬R(u), are also allowed in rule
bodies, then the obtained non-recursive datalog with negation, in notation nr-
datalog¬ , is relationally complete.
Definition 19.9 A non-recursive datalog¬ (nr-datalog¬ ) rule is of form

q: S(u) ← L1 , L2 , . . . , Ln , (19.18)

where S is a relation, u is a free tuple, and the Li ’s are literals, that is, expressions
of the form R(v) or ¬R(v), such that v is a free tuple, for i = 1, 2, . . . , n. S does not occur in
the body of the rule. The rule is domain restricted, if each variable x that occurs
in the rule also occurs in a positive literal (expression of the form R(v)) of the
body. Every nr-datalog¬ rule is considered domain restricted, unless it is specified
otherwise.

The semantics of rule (19.18) is as follows. Let R be a relational schema that contains
all relations occurring in the body of q, furthermore, let I be an instance over R.
The image of I under q is
q(I) = {ν(u)| ν is a valuation of the variables and for i = 1, 2, . . . , n
ν(ui ) ∈ I(Ri ), if Li = Ri (ui ) and (19.19)
ν(ui ) ∉ I(Ri ), if Li = ¬Ri (ui )}.
A nr-datalog¬ program over schema R is a collection of nr-datalog¬ rules
S1 (u1 ) ← body1
S2 (u2 ) ← body2
.. (19.20)
.
Sm (um ) ← bodym ,
where relations of schema R do not occur in heads of rules, the same relation may
appear in more than one rule head, furthermore there exists an ordering r1 , r2 , . . . , rm
of the rules such that the relation of the head of rule ri does not occur in the body
of any rule rj if j ≤ i.
The computation of the result of nr-datalog¬ program (19.20) applied to instance
I over schema R can be done in the same way as the evaluation of non-recursive
datalog program (19.15), with the difference that the individual nr-datalog¬ rules
should be interpreted according to (19.19).

Example 19.5 Nr-datalog¬ program. Let us assume that all films that are included in
relation Film have only one director. (It is not always true in real life!) The nr-datalog¬
rule

ans(x) ← Film(x, “A. Hitchcock”, z), ¬Film(x, “A. Hitchcock”, “A. Hitchcock”) (19.21)

expresses query 19.6. Query 19.7. is realised by the nr-datalog¬ program


Fellini-actor(z) ← Film(x, “F. Fellini”, z)
Not-the-answer(x) ← Film(x, y, z), ¬Fellini-actor(z) (19.22)
Answer(x) ← Film(x, y, z), ¬Not-the-answer(x) .
One has to be careful in writing nr-datalog¬ programs. If the first two rules of program
(19.22) were to be merged like in Example 19.4

Bad-not-ans(x) ← Film(x, y, z), ¬Film(x′ , “F. Fellini”, z), Film(x′ , “F. Fellini”, z′ )
Answer(x) ← Film(x, y, z), ¬Bad-not-ans(x),
(19.23)
then (19.23) answers the following query (assuming that all films have a unique director)

19.9. List all those films whose every actor played in each film of Fellini,
instead of query 19.7.
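Evaluating the nr-datalog¬ program (19.22) rule by rule can be sketched as follows; the Film tuples (Title, Director, Actor) are made-up sample data, and a negative literal is read as absence from the already computed relation.

```python
# A sketch of evaluating the nr-datalog^not program (19.22) rule by
# rule; a negative literal is read as absence from the already computed
# relation. The Film tuples (Title, Director, Actor) are invented.

film = [
    ("La Strada", "F. Fellini", "G. Masina"),
    ("8 1/2", "F. Fellini", "M. Mastroianni"),
    ("Divorce Italian Style", "P. Germi", "M. Mastroianni"),
    ("I Soliti Ignoti", "M. Monicelli", "V. Gassman"),
]

# Fellini-actor(z) <- Film(x, "F. Fellini", z)
fellini_actor = {z for (x, y, z) in film if y == "F. Fellini"}

# Not-the-answer(x) <- Film(x, y, z), not Fellini-actor(z)
not_the_answer = {x for (x, y, z) in film if z not in fellini_actor}

# Answer(x) <- Film(x, y, z), not Not-the-answer(x)
answer = {x for (x, y, z) in film if x not in not_the_answer}
# answer: every film whose every actor played in some film of Fellini
```

On this sample only "I Soliti Ignoti" has an actor outside Fellini-actor, so it alone is excluded from the answer.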

It is easy to see that every satisfiable nr-datalog¬ program that contains equality
atoms can be replaced by one without equalities. Furthermore the following propo-
sition is true, as well.

Claim 19.10 The satisfiable (full) relational algebra and the nr-datalog¬ programs
with single output relation are equivalent query languages.

Recursion. Query 19.8. cannot be formulated using the query languages in-
troduced so far. Some a priori information would be needed about how long a
chain-of-actors could be formed starting with a given actor/actress. Let us assume
that the maximum length of a chain-of-actors starting from “Marcello Mastroianni”
is 117. (It would be interesting to know the real value!) Then, the following non-
recursive datalog program gives the answer.
Film-partner(z1 , z2 ) ← Film(x, y, z1 ), Film(x, y, z2 ), z1 < z2 ²
Partial-answer1 (z) ← Film-partner(z, “Marcello Mastroianni”)
Partial-answer1 (z) ← Film-partner(“Marcello Mastroianni”, z)
Partial-answer2 (z) ← Film-partner(z, y), Partial-answer1 (y)
Partial-answer2 (z) ← Film-partner(y, z), Partial-answer1 (y)
.. ..
. .
(19.24)
Partial-answer117 (z) ← Film-partner(z, y), Partial-answer116 (y)
Partial-answer117 (z) ← Film-partner(y, z), Partial-answer116 (y)
Mastroianni-chain(z) ← Partial-answer1 (z)
Mastroianni-chain(z) ← Partial-answer2 (z)
.. ..
. .
Mastroianni-chain(z) ← Partial-answer117 (z)

It is much easier to express query 19.8. using recursion. In fact, the transitive
closure of the graph Film-partner needs to be calculated. For the sake of simplicity
the definition of Film-partner is changed a little (thus approximately doubling the
storage requirement).
Film-partner(z1 , z2 ) ← Film(x, y, z1 ), Film(x, y, z2 )
Chain-partner(x, y) ← Film-partner(x, y) (19.25)
Chain-partner(x, y) ← Film-partner(x, z), Chain-partner(z, y) .
The datalog program (19.25) is recursive, since the definition of relation Chain-
partner uses the relation itself. Let us suppose for a moment that this is meaningful,
then query 19.8. is answered by rule
Mastroianni-chain(x) ← Chain-partner(x, “Marcello Mastroianni”) .        (19.26)

Definition 19.11 The expression


R1 (u1 ) ← R2 (u2 ), R3 (u3 ), . . . , Rn (un ) (19.27)
is a datalog rule, if n ≥ 1, the Ri ’s are relation names, the ui ’s are free tuples of

2 Arbitrary comparison atoms can be used as well, similarly to equality atoms. Here z1 < z2
makes sure that all pairs occur at most once in the list.

appropriate length. Every variable of u1 has to occur in one of u2 , . . . , un , as well.


The head of the rule is R1 (u1 ), the body of the rule is R2 (u2 ), R3 (u3 ), . . . , Rn (un ). A
datalog program is a finite collection of rules of type (19.27). Let P be a datalog
program. A relation R occurring in P is extensional if it occurs only in rule
bodies, and it is intensional if it occurs in the head of some rule.

If ν is a valuation of the variables of rule (19.27), then R1 (ν(u1 )) ←


R2 (ν(u2 )), R3 (ν(u3 )), . . . , Rn (ν(un )) is a realisation of rule (19.27). The exten-
sional (database) schema of P consists of the extensional relations of P , in nota-
tion edb(P ). The intensional schema of P , in notation idb(P ) is defined similarly
as consisting of the intensional relations of P . Let sch(P ) = edb(P ) ∪ idb(P ). The
semantics of datalog program P is a mapping from the set of instances over edb(P )
to the set of instances over idb(P ). This can be defined proof theoretically, model
theoretically or as a fixpoint of some operator. This latter one is equivalent with the
first two, so to save space only the fixpoint theoretical definition is discussed.
There are no negative literals used in Definition 19.11. The main reason for this is
that recursion and negation together may be meaningless or contradictory. Never-
theless, sometimes negative atoms might be necessary. In those cases the semantics
of the program will be defined specially.

Fixpoint semantics. Let P be a datalog program, K be an instance over sch(P ).


A fact A, that is, a tuple consisting of constants, is an immediate consequence of
K and P , if either A ∈ K(R) for some relation R ∈ sch(P ), or A ← A1 , A2 , . . . , An
is a realisation of a rule in P and each Ai is in K. The immediate consequence
operator TP is a mapping from the set of instances over sch(P ) to itself. TP (K)
consists of all immediate consequences of K and P .

Claim 19.12 The immediate consequence operator TP is monotone.

Proof Let I and J be instances over sch(P ) such that I ⊆ J . Let A be a fact
of TP (I). If A ∈ I(R) for some relation R ∈ sch(P ), then A ∈ J (R) is implied by
I ⊆ J . On the other hand, if A ← A1 , A2 , . . . , An is a realisation of a rule in P and
each Ai is in I, then each Ai is in J as well, hence A ∈ TP (J ).

The definition of TP implies that K ⊆ TP (K). Using Claim 19.12 it follows
that
K ⊆ TP (K) ⊆ TP (TP (K)) ⊆ . . . . (19.28)

Theorem 19.13 For every instance I over schema sch(P ) there exists a unique
minimal instance K with I ⊆ K that is a fixpoint of TP , i.e. K = TP (K).

Proof Let TPi (I) denote the result of i consecutive applications of operator TP to I,
and let
K = ⋃_{i≥0} TPi (I) .        (19.29)

By the monotonicity of TP and (19.29) we have

TP (K) = ⋃_{i≥1} TPi (I) ⊆ ⋃_{i≥0} TPi (I) = K ⊆ TP (K) ,        (19.30)

that is K is a fixpoint. It is easy to see that every fixpoint that contains I, also
contains TPi (I) for all i = 1, 2, . . . , that is it contains K, as well.

Definition 19.14 The result of datalog program P on instance I over edb(P ) is


the unique minimal fixpoint of TP containing I, in notation P (I).

It can be seen, see Exercise 19.1-6, that the chain in (19.28) is finite, that is there
exists an n, such that TP (TPn (I)) = TPn (I). The naive evaluation of the result of the
datalog program is based on this observation.

Naiv-Datalog(P ,I)
1 K ← I
2 while TP (K) ≠ K
3 do K ← TP (K)
4 return K
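Naiv-Datalog can be sketched concretely on the transitive-closure program (19.25); the Film-partner edges below are invented sample data.

```python
# A sketch of Naiv-Datalog on the recursive program (19.25):
# apply the immediate consequence operator T_P until a fixpoint is
# reached, recomputing everything in every round. The Film-partner
# edges are invented sample data.

film_partner = {("A", "B"), ("B", "C"), ("C", "D")}

def t_p(chain):
    """One application of T_P to the Chain-partner relation."""
    out = set(chain)
    out |= film_partner                           # first rule of (19.25)
    out |= {(x, y) for (x, z) in film_partner     # recursive rule
                   for (z2, y) in chain if z2 == z}
    return out

chain = set()
while t_p(chain) != chain:
    chain = t_p(chain)
# chain is now the transitive closure of film_partner
```

Each round recomputes every known tuple, which is exactly the inefficiency discussed next.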

Procedure Naiv-Datalog is not optimal, of course, since every fact that be-
comes included in K is calculated again and again at every further execution of the
while loop.
The idea of Semi-naiv-datalog is that it tries to use only recently calculated
new facts in the while loop, as much as it is possible, thus avoiding recomputation
of known facts. Consider datalog program P with edb(P ) = R, and idb(P ) = T. For
a rule
S(u) ← R1 (v1 ), . . . , Rn (vn ), T1 (w1 ), . . . , Tm (wm ) (19.31)
of P where Rk ∈ R and Tj ∈ T, the following rules are constructed for j = 1, 2, . . . , m
and i ≥ 1

tempSi+1 (u) ← R1 (v1 ), . . . , Rn (vn ),
        T1i (w1 ), . . . , Tj−1i (wj−1 ), ∆iTj (wj ), Tj+1i−1 (wj+1 ), . . . , Tmi−1 (wm ) .
                                                                        (19.32)
Relation ∆iTj denotes the change of Tj in iteration i. The union of rules corresponding
to S in layer i is denoted by PSi , that is, rules of form (19.32) for tempSi+1 , j =
1, 2, . . . , m. Assume that the list of idb relations occurring in rules defining the idb
relation S is T1 , T2 , . . . , T` . Let

PSi (I, T1i−1 , . . . , T`i−1 , T1i , . . . , T`i , ∆iT1 , . . . , ∆iT` ) (19.33)

denote the set of facts (tuples) obtained by applying rules (19.32) to input instance
I and to idb relations Tji−1 , Tji , ∆iTj . The input instance I is the actual value of the
edb relations of P .

Semi-naiv-datalog(P ,I)
1 P 0 ← those rules of P whose body does not contain idb relation
2 for S ∈ idb(P )
3 do S 0 ← ∅
4 ∆1S ← P 0 (I)(S)
5 i ← 1
6 repeat
7 for S ∈ idb(P )
8  T1 , . . . , T` are the idb relations of the rules defining S.
9 do S i ← S i−1 ∪ ∆iS
10 ∆Si+1 ← PSi (I, T1i−1 , . . . , T`i−1 , T1i , . . . , T`i , ∆iT1 , . . . , ∆iT` ) − S i
11 i ← i+1
12 until ∆iS = ∅ for all S ∈ idb(P )
13 for S ∈ idb(P )
14 do S ← S i
15 return S

Theorem 19.15 Procedure Semi-naiv-datalog correctly computes the result of


program P on instance I.

Proof We will show by induction on i that after the ith execution of the loop of lines
6–12 the value of S i is TPi (I)(S), while ∆Si+1 is equal to TPi+1 (I)(S) − TPi (I)(S)
for arbitrary S ∈ idb(P ). TPi (I)(S) is the result obtained for S starting from I and
applying the immediate consequence operator TP i times.
For i = 0, line 4 calculates exactly TP (I)(S) for all S ∈ idb(P ).
In order to prove the induction step, one only needs to see that
PSi (I, T1i−1 , . . . , T`i−1 , T1i , . . . , T`i , ∆iT1 , . . . , ∆iT` ) ∪ S i is exactly equal to TPi+1 (I)(S),
since in lines 9–10 procedure Semi-naiv-datalog constructs S i and ∆Si+1 using
that. The value of S i is TPi (I)(S) by the induction hypothesis. Additional new
tuples are obtained only if, for some idb relation defining S, tuples are considered
that were constructed at the last application of TP , and these are in relations
∆iT1 , . . . , ∆iT` , also by the induction hypothesis.
The halting condition of line 12 means exactly that all relations S ∈ idb(P ) are
unchanged during the application of the immediate consequence operator TP , thus
the algorithm has found its minimal fixpoint. This latter one is exactly the result of
datalog program P on input instance I according to Definition 19.14.
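On the same transitive-closure program, the delta-based evaluation can be sketched as follows; this Python sketch loosely mirrors lines 6–12 of Semi-naiv-datalog for a single idb relation, with invented sample data.

```python
# A sketch of Semi-naiv-datalog on the transitive-closure program
# (19.25): the recursive rule is joined only against the tuples
# derived in the previous round (the delta), not the whole relation.

film_partner = {("A", "B"), ("B", "C"), ("C", "D")}

chain = set()
delta = set(film_partner)        # round 1: the non-recursive rule only
while delta:
    chain |= delta
    new = {(x, y) for (x, z) in film_partner      # join with the delta
                  for (z2, y) in delta if z2 == z}
    delta = new - chain          # keep only the genuinely new facts
# chain again holds the full transitive closure
```

Subtracting `chain` from `new` is the analogue of line 10: only facts not derived before survive into the next round.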

Procedure Semi-naiv-datalog eliminates a large amount of unnecessary calcu-


lations, nevertheless it is not optimal on some datalog programs (Exercise 19.1-7).
However, analysis of the datalog program and computation based on that can save
most of the unnecessary calculations.

Definition 19.16 Let P be a datalog program. The precedence graph of P is


the directed graph GP defined as follows. Its vertex set consists of the relations of
idb(P ), and (R, R′ ) is an arc for R, R′ ∈ idb(P ) if there exists a rule in P whose

head is R′ and whose body contains R. P is recursive if GP contains a directed
cycle. Relations R and R′ are mutually recursive if they belong to the same strongly
connected component of GP .

Being mutually recursive is an equivalence relation on the set of relations idb(P ).


The main idea of procedure Improved-Semi-Naiv-Datalog is that for a relation
R ∈ idb(P ) only those relations have to be computed “simultaneously” with R
that are in its equivalence class, all other relations defining R can be calculated “in
advance” and can be considered as edb relations.

Improved-Semi-Naiv-Datalog(P ,I)
1 Determine the equivalence classes of idb(P ) under mutual recursivity.
2 List the equivalence classes [R1 ], [R2 ], . . . , [Rn ]
according to a topological order of GP .
3  There exists no directed path from Rj to Ri in GP for all i < j.
4 for i ← 1 to n
5 do Use Semi-naiv-datalog to compute relations of [Ri ]
taking relations of [Rj ] as edb relations for j < i.

Lines 1–2 can be executed in time O(vGP + eGP ) using depth first search, where
vGP and eGP denote the number of vertices and edges of graph GP , respectively.
Proof of correctness of the procedure is left to the Reader (Exercise 19.1-8).
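The preprocessing of lines 1–2 can be sketched as follows: build the precedence graph, compute its strongly connected components (Kosaraju's method is used here, one possible choice), and emit the mutual-recursion classes in topological order. The rule shapes and relation names are invented.

```python
# A sketch of lines 1-2 of Improved-Semi-Naiv-Datalog: build the
# precedence graph of the idb relations and list its strongly connected
# components (the mutual-recursion classes) in topological order.
# Rules are (head, body-relations) pairs; all names are invented.

rules = [
    ("S", ["R", "T"]),   # R is edb; T occurs in a body whose head is S ...
    ("T", ["S"]),        # ... and S in a body whose head is T: a cycle
    ("U", ["S"]),
]
idb = {h for h, _ in rules}
succ = {r: set() for r in idb}
for head, body in rules:
    for r in body:
        if r in idb:
            succ[r].add(head)            # arc (r, head) of G_P

# Kosaraju: DFS finish order on G_P, then sweep the reverse graph
order, seen = [], set()
def dfs(v):
    seen.add(v)
    for w in succ[v]:
        if w not in seen:
            dfs(w)
    order.append(v)
for v in idb:
    if v not in seen:
        dfs(v)

pred = {r: {x for x in idb if r in succ[x]} for r in idb}
classes, seen = [], set()
for v in reversed(order):
    if v not in seen:
        stack, c = [v], set()
        while stack:
            x = stack.pop()
            if x not in seen:
                seen.add(x)
                c.add(x)
                stack.extend(pred[x] - seen)
        classes.append(c)
# classes lists {S, T} (mutually recursive) before {U}
```

The resulting order satisfies the invariant of line 3: no directed path leads from a later class to an earlier one.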

19.1.3. Complexity of query containment


In the present section we return to conjunctive queries. The costliest task in com-
puting the result of a query is to generate the natural join of relations, in particular
if there are no indexes available on the common attributes, since then only procedure
Full-Tuplewise-Join is applicable.

Full-Tuplewise-Join(R1 , R2 )
1 S ← ∅
2 for all u ∈ R1
3 do for all v ∈ R2
4 do if u and v can be joined
5 then S ← S ∪ {u 1 v}
6 return S

It is clear that the running time of Full-Tuplewise-Join is O(|R1 | × |R2 |).
Thus, the order in which a query is executed is important, since during computa-
tion natural joins of relations of various sizes must be calculated. In case of tableau
queries the Homomorphism Theorem gives a possibility of a query rewriting that
uses fewer joins than the original.
Let q1 , q2 be queries over schema R. q2 contains q1 , in notation q1 ⊑ q2 , if for all
instances I over schema R q1 (I) ⊆ q2 (I) holds. q1 ≡ q2 according to Definition 19.1
iff q1 ⊑ q2 and q2 ⊑ q1 . A generalisation of valuations will be needed. Substitution

is a mapping from the set of variables to the union of sets of variables and sets of
constants that is extended to constants as identity. Extension of substitution to free
tuples and tableaux can be defined naturally.
Definition 19.17 Let q = (T, u) and q′ = (T′ , u′ ) be two tableau queries over
schema R. Substitution θ is a homomorphism from q′ to q, if θ(T′ ) ⊆ T and
θ(u′ ) = u.

Theorem 19.18 (Homomorphism Theorem). Let q = (T, u) and q′ = (T′ , u′ ) be
two tableau queries over schema R. q ⊑ q′ if and only if there exists a homomor-
phism from q′ to q.

Proof Assume first that θ is a homomorphism from q′ to q, and let I be an instance
over schema R. Let w ∈ q(I). This holds exactly if there exists a valuation ν that
maps tableau T into I and ν(u) = w. It is easy to see that θ ◦ ν maps tableau T′
into I and θ ◦ ν(u′ ) = w, that is w ∈ q′ (I). Hence, w ∈ q(I) =⇒ w ∈ q′ (I), which
in turn is equivalent with q ⊑ q′ .
On the other hand, let us assume that q ⊑ q′ . The idea of the proof is that
both q and q′ are applied to the “instance” T. The output of q contains the free tuple
u, hence the output of q′ also contains u, that is, there exists an embedding θ of T′
into T that maps u′ to u. To formalise this argument the instance IT isomorphic to
T is constructed.
Let V be the set of variables occurring in T. For all x ∈ V let ax be a constant that
differs from all constants appearing in T or T′ , furthermore x ≠ x′ =⇒ ax ≠ ax′ .
Let µ be the valuation that maps x ∈ V to ax , furthermore let IT = µ(T). µ is a
bijection from V to µ(V ) and there are no constants of T appearing in µ(V ), hence
µ−1 is well defined on the constants occurring in IT .
It is clear that µ(u) ∈ q(IT ), thus using q ⊑ q′ , µ(u) ∈ q′ (IT ) is obtained. That
is, there exists a valuation ν that embeds tableau T′ into IT , such that ν(u′ ) = µ(u).
It is not hard to see that ν ◦ µ−1 is a homomorphism from q′ to q.

Query optimisation by tableau minimisation. According to Theorem 19.6


tableau queries and satisfiable relational algebra∗ (without subtraction) are equiva-
lent. The proof shows that the relational algebra expression equivalent with a tableau
query is of form πx⃗ (σF (R1 ⋈ R2 ⋈ · · · ⋈ Rk )), where k is the number of rows of the
tableau. It implies that if the number of joins is to be minimised, then the number
of rows of an equivalent tableau must be minimised.
The tableau query (T, u) is minimal, if there exists no tableau query (S, v)
that is equivalent with (T, u) and |S| < |T|, that is S has fewer rows. It may be
surprising, but it is true, that a minimal tableau query equivalent with (T, u) can
be obtained by simply dropping some rows from T.
Theorem 19.19 Let q = (T, u) be a tableau query. There exists a subset T′ of T,
such that query q′ = (T′ , u) is minimal and equivalent with q = (T, u).
Proof Let (S, v) be a minimal query equivalent with q. According to the Homomor-
phism Theorem there exist homomorphisms θ from q to (S, v), and λ from (S, v) to
q. Let T′ = θ ◦ λ(T). It is easy to check that (T′ , u) ≡ q and |T′ | ≤ |S|. But (S, v)
is minimal, hence (T′ , u) is minimal, as well.

Example 19.6 Application of tableau minimisation. Consider the relational algebra expres-
sion
q = πAB (σB=5 (R)) ⋈ πBC (πAB (R) ⋈ πAC (σB=5 (R)))        (19.34)
over the schema R of attribute set {A, B, C}. The tableau query corresponding to q is the
following tableau T:
R A B C
x 5 z1
x1 5 z2 (19.35)
x1 5 z
u x 5 z
Such a homomorphism is sought that maps some rows of T to some other rows of T, thus
sort of “folding” the tableau. The first row cannot be eliminated, since the homomorphism
is an identity on free tuple u, thus x must be mapped to itself. The situation is similar with
the third row, as the image of z is itself under any homomorphism. However the second
row can be eliminated by mapping x1 to x and z2 to z, respectively. Thus, the minimal
tableau equivalent with T consists of the first and third rows of T. Transforming back to
relational algebra expression,

πAB (σB=5 (R)) ⋈ πBC (σB=5 (R))        (19.36)

is obtained. Query (19.36) contains one fewer join operator than query (19.34).
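The homomorphism test behind this minimisation can be sketched by brute force, enumerating all substitutions of the variables of one tableau into the terms of the other (exponential, in line with the NP-hardness discussed next). The encoding below — uppercase strings as variables, everything else as constants — is an assumption of the sketch.

```python
# A brute-force homomorphism test for tableau queries: q ⊑ q' iff some
# substitution maps the rows of T' into rows of T and the summary u'
# to u. Uppercase strings are variables; other terms are constants.
from itertools import product

def is_var(t):
    return isinstance(t, str) and t.isupper()

def subst(row, theta):
    return tuple(theta.get(t, t) if is_var(t) else t for t in row)

def homomorphism_exists(t_src, u_src, t_dst, u_dst):
    """Is there theta with theta(t_src) a subset of t_dst, theta(u_src) = u_dst?"""
    vs = sorted({t for row in t_src + [u_src] for t in row if is_var(t)})
    terms = sorted({t for row in t_dst + [u_dst] for t in row}, key=str)
    rows_dst = set(t_dst)
    for choice in product(terms, repeat=len(vs)):
        theta = dict(zip(vs, choice))
        if subst(u_src, theta) == u_dst and \
           all(subst(r, theta) in rows_dst for r in t_src):
            return True
    return False

# the tableau (19.35) over attributes A, B, C, and its two-row minimisation
t = [("X", 5, "Z1"), ("X1", 5, "Z2"), ("X1", 5, "Z")]
u = ("X", 5, "Z")
t_min = [t[0], t[2]]                     # the second row dropped

hom_to_min = homomorphism_exists(t, u, t_min, u)
hom_back = homomorphism_exists(t_min, u, t, u)
# both are True, so the two tableau queries are equivalent
```

The folding of Example 19.6 corresponds to the substitution X1 → X, Z2 → Z found by the first call.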

The next theorem states that the question of tableau containment and equiva-
lence is NP-complete, hence tableau minimisation is an NP-hard problem.

Theorem 19.20 For given tableau queries q and q′ the following decision problems
are NP-complete:
19.10. q ⊑ q′ ?
19.11. q ≡ q′ ?
19.12. Assume that q′ is obtained from q by removing some free tuples. Is it true
then that q ≡ q′ ?

Proof The Exact-Cover problem will be reduced to the various tableau problems.
The input of Exact-Cover problem is a finite set X = {x1 , x2 , . . . , xn }, and a
collection of its subsets S = {S1 , S2 , . . . , Sm }. It has to be determined whether there
exists S′ ⊆ S, such that the subsets in S′ cover X exactly (that is, for all x ∈ X there
exists exactly one S ∈ S′ such that x ∈ S). Exact-Cover is known to be an
NP-complete problem.
Let E = (X, S) be an input of Exact-Cover. A construction is sketched that
produces a pair qE , qE′ of tableau queries from E in polynomial time. This construction
can be used to prove the various NP-completeness results.
Let the schema R consist of the pairwise distinct attributes
A1 , A2 , . . . , An , B1 , B2 , . . . , Bm . qE = (TE , t) and qE′ = (T′E , t) are tableau

queries over schema R such that the summary row of both of them is free tuple
t = hA1 : a1 , A2 : a2 , . . . , An : an i, where a1 , a2 , . . . , an are pairwise distinct
variables.
Let b1 , b2 , . . . , bm , c1 , c2 , . . . cm be another set of pairwise distinct variables.
Tableau TE consists of n rows, one corresponding to each element of X. ai stands in
the column of attribute Ai in the row of xi , while bj stands in the column of attribute
Bj for all such j that xi ∈ Sj holds. In the other positions of tableau TE pairwise
distinct new variables stand.
Similarly, T′E consists of m rows, one corresponding to each element of S. ai
stands in the column of attribute Ai in the row of Sj for all such i that xi ∈ Sj ,
furthermore cj′ stands in the column of attribute Bj′ for all j′ ≠ j. In the other
positions of tableau T′E pairwise distinct new variables stand.
The NP-completeness of problem 19.10. follows from the fact that X has an exact
cover using sets of S if and only if q′E ⊑ qE holds. The proof, and the proofs of the
NP-completeness of problems 19.11. and 19.12., are left to the Reader (Exercise 19.1-9).
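As an illustration, the tableau construction of the proof can be sketched in code. The sketch below is not from the text: rows are represented as dictionaries mapping attribute names to variable names, and the naming conventions (ai , bj , cj , fresh z-variables) follow the proof.

```python
from itertools import count

# Sketch of the construction of T_E and T'_E from an Exact-Cover input
# E = (X, S); the dictionary-based row representation is an assumption.
def build_tableaux(n, sets):
    """n = |X| with X = {x1, ..., xn}; sets = list of subsets of {1, ..., n}."""
    m = len(sets)
    fresh = (f"z{i}" for i in count())  # pairwise distinct new variables
    attrs = [f"A{i}" for i in range(1, n + 1)] + [f"B{j}" for j in range(1, m + 1)]

    # T_E: one row for each element x_i of X.
    t_e = []
    for i in range(1, n + 1):
        row = {a: next(fresh) for a in attrs}   # new variables everywhere ...
        row[f"A{i}"] = f"a{i}"                  # ... except a_i in column A_i
        for j, s in enumerate(sets, start=1):
            if i in s:                          # ... and b_j where x_i is in S_j
                row[f"B{j}"] = f"b{j}"
        t_e.append(row)

    # T'_E: one row for each set S_j of S.
    t_e_prime = []
    for j, s in enumerate(sets, start=1):
        row = {a: next(fresh) for a in attrs}
        for i in s:                             # a_i for every x_i in S_j
            row[f"A{i}"] = f"a{i}"
        for j2 in range(1, m + 1):
            if j2 != j:                         # c_{j'} for every j' != j
                row[f"B{j2}"] = f"c{j2}"
        t_e_prime.append(row)
    return t_e, t_e_prime

t_e, t_e_prime = build_tableaux(3, [{1, 2}, {3}])
print(t_e[0]["A1"], t_e[0]["B1"])              # a1 b1
print(t_e_prime[1]["A3"], t_e_prime[1]["B1"])  # a3 c1
```

The construction is clearly polynomial: both tableaux have at most max(n, m) rows over n + m attributes.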

Exercises
19.1-1 Prove Proposition 19.4, that is, that every rule based query q is monotone and
satisfiable. Hint. For the proof of satisfiability let K be the set of constants occurring
in query q, and let a ∉ K be another constant. For every relation schema Ri in rule
(19.3) construct all tuples (a1 , a2 , . . . , ar ), where ai ∈ K ∪ {a}, and r is the arity of
Ri . Let I be the instance obtained in this way. Prove that q(I) is nonempty.
19.1-2 Give a relational schema R and a relational algebra query q over schema
R whose result is empty for every instance over R.
19.1-3 Prove that the languages of rule based queries and tableau queries are
equivalent.
19.1-4 Prove that every rule based query q with equality atoms is either equivalent
to the empty query q ∅ , or there exists a rule based query q ′ without equality atoms
such that q ≡ q ′ . Give a polynomial time algorithm that determines for a rule based
query q with equality atoms whether q ≡ q ∅ holds, and if not, constructs a rule
based query q ′ without equality atoms such that q ≡ q ′ .
19.1-5 Prove Theorem 19.8 by generalising the proof idea of Theorem 19.6.
19.1-6 Let P be a datalog program, I an instance over edb(P ), and C(P, I) the
(finite) set of constants occurring in I and P . Let B(P, I) be the following instance
over sch(P ):
1. For every relation R of edb(P ) the fact R(u) is in B(P, I) iff it is in I; furthermore
2. for every relation R of idb(P ) every fact R(u) constructed from constants of
C(P, I) is in B(P, I).

Prove that
I ⊆ TP (I) ⊆ TP2 (I) ⊆ TP3 (I) ⊆ · · · ⊆ B(P, I) . (19.37)

19.1-7 Give an example of an input, that is, a datalog program P and instance I
over edb(P ), such that the same tuple is produced by distinct runs of the loop of
Semi-Naive-Datalog.
19.1-8 Prove that procedure Improved-Semi-Naive-Datalog stops in finite time
for all inputs and gives a correct result. Give an example of an input on which
Improved-Semi-Naive-Datalog computes fewer rows multiple times than
Semi-Naive-Datalog.
19.1-9
1. Prove that for the tableau queries qE = (TE , t) and q′E = (T′E , t) of the proof of
Theorem 19.20 there exists a homomorphism from (TE , t) to (T′E , t) if and
only if the Exact-Cover problem E = (X, S) has a solution.
2. Prove that the decision problems 19.11. and 19.12. are NP-complete.

19.2. Views
A database system architecture has three main levels:
• physical layer;
• logical layer;
• outer (user) layer.
The goal of the separation of levels is to achieve data independence and user
convenience. The three views in Figure 19.2 show possible user interfaces:
multirelational, universal relation and graphical interface.
The physical layer consists of the actually stored data files and the dense and
sparse indices built over them.
The separation of the logical layer from the physical layer makes it possible
for the user to concentrate on the logical dependencies of the data, which approx-
imates the image of the reality to be modelled better. The logical layer consists of
the database schema description together with the various integrity constraints, de-
pendencies. This the layer where the database administrators work with the system.
The connection between the physical layer and the logical layer is maintained by the
database engine.
The goal of the separation of the logical layer and the outer layer is that the
endusers can see the database according to their (narrow) needs and requirements.
For example, a very simple view of the outer layer of a bank database could be the
automatic teller machine, or a much more complex view could be the credit history
of a client for loan approval.

19.2.1. View as a result of a query


The question is how the views of the different layers can be given. If a query given
by a relational algebra expression is considered as a formula that will be applied to
relational instances, then a view is obtained. Datalog rules show the difference

Figure 19.2 The three levels of database architecture: View 1, View 2 and View 3 at the
outer layer, above the logical layer and, below it, the physical layer.

between views and relations well. The relations defined by rules are called
intensional, because these are the relations that do not have to exist on external
storage devices, that is, to exist extensionally, in contrast to the extensional relations.

Definition 19.21 An expression V given in some query language Q over schema
R is called a view.

Similarly to intensional relations, views can be used in the definition of queries or
other views as well.

Example 19.7 SQL view. Views in the database manipulation language SQL can be given
in the following way. Suppose that the only data of interest for us from schema CinePest is
where and when Kurosawa’s films are shown. The view KurosawaTimes is given by the
SQL command

KurosawaTimes
1 create view KurosawaTimes as
2 select Theater, Time
3 from Film, Show
4 where Film.Title=Show.Title and Film.Director="Akira Kurosawa"

Written in relational algebra it is as follows.

KurosawaTimes(Theater, Time) = πTheater,Time (Show ⋈ σDirector="Akira Kurosawa" (Film))

(19.38)

Finally, the same by a datalog rule is:

KurosawaTimes(xTh , xTi ) ← Show(xTh , xT , xTi ), Film(xT , "Akira Kurosawa", xA ) .

(19.39)
Line 2 of KurosawaTimes corresponds to the projection applied, line 3 marks which two
relations are to be joined, and the condition of line 4 shows that it is a natural join, not
a direct product.
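The view of Example 19.7 can be tried out directly. The sketch below uses SQLite from Python with simplified, assumed table layouts for Film and Show (not the full CinePest schema) and illustrative data.

```python
import sqlite3

# Minimal sketch of Example 19.7 in SQLite; Film(Title, Director) and
# Show(Theater, Title, Time) are simplified, assumed layouts.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Film (Title TEXT, Director TEXT);
    CREATE TABLE Show (Theater TEXT, Title TEXT, Time INTEGER);

    INSERT INTO Film VALUES ('Ran', 'Akira Kurosawa'),
                            ('Psycho', 'Alfred Hitchcock');
    INSERT INTO Show VALUES ('Corvin', 'Ran', 19),
                            ('Corvin', 'Psycho', 21),
                            ('Urania', 'Ran', 11);

    CREATE VIEW KurosawaTimes AS
        SELECT Theater, Time
        FROM Film, Show
        WHERE Film.Title = Show.Title
          AND Film.Director = 'Akira Kurosawa';
""")

# The view can now be queried like any stored relation.
rows = sorted(conn.execute("SELECT * FROM KurosawaTimes"))
print(rows)  # [('Corvin', 19), ('Urania', 11)]
```

Note that SQLite expands queries against KurosawaTimes into the defining query at evaluation time; no tuples of the view are stored.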

Having defined view V, it can be used in further queries or view definitions like
any other (extensional) relation.

Advantages of using views

• Automatic data hiding: data that is not part of the view used is not shown
to the user, thus the user cannot read or modify it without proper access
rights. So by providing access to the database through views, a simple but
effective security mechanism is created.
• Views provide simple “macro capabilities”. Using the view KurosawaTimes de-
fined in Example 19.7 it is easy to find those theatres where Kurosawa films are
shown in the morning:

KurosawaMorning(T heater) ← KurosawaTimes(T heater, xT i ), xT i < 12 .


(19.40)
Of course, the user could include the definition of KurosawaTimes in the code
directly; however, convenience considerations come first here, in close similarity
with macros.
• Views make it possible that the same data could be seen in different ways by
different users at the same time.
• Views provide logical data independence. The essence of logical data inde-
pendence is that users and their programs are protected from the structural
changes of the database schema. It can be achieved by defining the relations of
the schema before the structural change as views in the new schema.
• Views make controlled data input possible. The with check option clause of
the create view command serves this purpose in SQL.

Materialised view. Some views could be used in several different queries. In
these cases it can be useful if the tuples of the relation(s) defined by the view
need not be calculated again and again; instead, the output of the query defining
the view is stored and only read at further uses. Such a stored output is called a
materialised view.
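A minimal sketch of materialisation, using SQLite and illustrative names (SQLite has no built-in materialised views, so CREATE TABLE AS SELECT stands in for one): the output of the defining query is stored once, and later queries read the stored tuples.

```python
import sqlite3

# Sketch: "materialising" a view by storing the output of its defining
# query in a table; all table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Registered (Student TEXT, CNumber INTEGER, Semester TEXT);
    INSERT INTO Registered VALUES ('Ann', 520, 'Fall2001'),
                                  ('Bob', 120, 'Fall2001');

    -- Materialise the graduate registrations (C-number >= 400):
    CREATE TABLE GraduateReg AS
        SELECT Student, CNumber, Semester
        FROM Registered
        WHERE CNumber >= 400;
""")

# Further queries read the stored result directly, without recomputation.
print(conn.execute("SELECT Student FROM GraduateReg").fetchall())
# [('Ann',)]
```

Keeping such a stored output consistent under updates of the base relations is the view maintenance problem, which this sketch does not address.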

Exercises
19.2-1 Consider the following schema:
FilmStar(Name,Address,Gender,BirthDate)
FilmMogul(Name,Address,Certificate#,Assets)
Studio(Name,Address,PresidentCert#) .

Relation FilmMogul contains data of the big players in the film business (studio
presidents, producers, etc.). The attribute names speak for themselves; Certificate#
is the number of the certificate of the filmmogul, and PresidentCert# is the certificate
number of the president of the studio. Give the definitions of the following views
using datalog rules, relational algebra expressions, and SQL:

1. RichMogul: Lists the names, addresses, certificate numbers and assets of those
filmmoguls whose asset value is over 1 million dollars.

2. StudioPresident: Lists the names, addresses and certificate numbers of those
filmmoguls who are studio presidents as well.

3. MogulStar: Lists the names, addresses, certificate numbers and assets of those
people who are filmstars and filmmoguls at the same time.

19.2-2 Give the definitions of the following views over schema CinePest using
datalog rules, relational algebra expressions, and SQL:

1. Marilyn(Title): Lists the titles of Marilyn Monroe’s films.

2. CorvinInfo(Title,Time,Phone): Lists the titles and show times of films shown
in theatre Corvin, together with the phone number of the theatre.

19.3. Query rewriting


Answering queries using views, in other words query rewriting using views, has
become a problem attracting much attention and interest recently. The reason is its
applicability in a wide range of data manipulation problems: query optimisation,
providing physical data independence, data and information integration, and the
planning of data warehouses.
The problem is the following. Assume that query Q is given over schema R,
together with views V1 , V2 , . . . , Vn . Can one answer Q using only the results of the
views V1 , V2 , . . . , Vn ? Or, what is the largest set of tuples that can be determined
knowing the views? If both the views and the relations of the base schema can be
accessed, what is the cheapest way to compute the result of Q?

19.3.1. Motivation
Before query rewriting algorithms are studied in detail, some motivating examples
of applications are given. The following university database is used throughout this
section.

University = {Professor,Course,Teaches,Registered,Major,Affiliation,Advisor} .
(19.41)

The schemata of the individual relations are as follows:


Professor = {PName,Area}
Course = {C-Number,Title}
Teaches = {PName,C-Number,Semester,Evaluation}
Registered = {Student,C-Number,Semester} (19.42)
Major = {Student,Department}
Affiliation = {PName,Department}
Advisor = {PName,Student} .
It is supposed that professors, students and departments are uniquely identified by
their names. Tuples of relation Registered show which student took which course
in which semester, while Major shows which department a student chose as major
(for the sake of convenience it is assumed that each department offers one subject
as possible major).

Query optimisation. If the computation necessary to answer a query has been
performed in advance and the results are stored in some materialised view, then the
view can be used to speed up answering the query.
Consider the query that looks for pairs (Student,Title), where the student registered
for the given Ph.D.-level course and the course is taught by a professor of the
Database area (the C-number of graduate courses is at least 400, and the Ph.D.-level
courses are those with C-number at least 500).
val(xS , xT ) ← Teaches(xP , xC , xSe , y1 ), Professor(xP , “database”),
(19.43)
Registered(xS , xC , xSe ), Course(xC , xT ), xC ≥ 500 .
Suppose that the following materialised view is available that contains the
registration data of graduate courses.

Graduate(xS , xT , xC , xSe ) ← Registered(xS , xC , xSe ), Course(xC , xT ), xC ≥ 400 .

(19.44)
View Graduate can be used to answer query (19.43).
val(xS , xT ) ← Teaches(xP , xC , xSe , y1 ), Professor(xP , “database”),
(19.45)
Graduate(xS , xT , xC , xSe ), xC ≥ 500 .

It will be faster to compute (19.45) than (19.43), because the natural join of
relations Registered and Course has already been done by view Graduate;
furthermore, it filtered out the undergraduate courses (which make up the bulk of
registration data at most universities). It is worth noting that view Graduate could
be used even though it did not agree syntactically with any part of query (19.43).
On the other hand, it may happen that the original query can be answered
faster. If relations Registered and Course have an index on attribute C-Number, but
there exists no index built for Graduate, then it could be faster to answer query
(19.43) directly from the database relations. Thus, the real challenge is not only
to decide whether a materialised view could logically be used to answer some
query; a thorough cost analysis must also be done to determine when it is worth
using the existing views.
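The rewriting of (19.43) into (19.45) can be checked on a toy instance. The sketch below uses assumed, simplified table layouts and sample data; it evaluates the query once from the base relations and once from a materialised Graduate table and compares the answers.

```python
import sqlite3

# Assumed, simplified layouts of the University relations and sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Professor (PName TEXT, Area TEXT);
    CREATE TABLE Course (CNumber INTEGER, Title TEXT);
    CREATE TABLE Teaches (PName TEXT, CNumber INTEGER, Semester TEXT, Eval REAL);
    CREATE TABLE Registered (Student TEXT, CNumber INTEGER, Semester TEXT);

    INSERT INTO Professor VALUES ('Smith', 'database');
    INSERT INTO Course VALUES (520, 'Query Rewriting'), (120, 'Intro');
    INSERT INTO Teaches VALUES ('Smith', 520, 'Fall2001', NULL);
    INSERT INTO Registered VALUES ('Ann', 520, 'Fall2001'), ('Ann', 120, 'Fall2001');

    -- The materialised view (19.44): graduate registrations joined with Course.
    CREATE TABLE Graduate AS
        SELECT Student, Title, R.CNumber AS CNumber, R.Semester AS Semester
        FROM Registered R, Course C
        WHERE R.CNumber = C.CNumber AND R.CNumber >= 400;
""")

# Query (19.43), answered from the base relations.
direct = conn.execute("""
    SELECT R.Student, C.Title
    FROM Teaches T, Professor P, Registered R, Course C
    WHERE T.PName = P.PName AND P.Area = 'database'
      AND R.CNumber = T.CNumber AND R.Semester = T.Semester
      AND C.CNumber = R.CNumber AND R.CNumber >= 500
""").fetchall()

# The rewriting (19.45), answered using the materialised view.
rewritten = conn.execute("""
    SELECT G.Student, G.Title
    FROM Teaches T, Professor P, Graduate G
    WHERE T.PName = P.PName AND P.Area = 'database'
      AND G.CNumber = T.CNumber AND G.Semester = T.Semester
      AND G.CNumber >= 500
""").fetchall()

print(direct == rewritten, direct)  # True [('Ann', 'Query Rewriting')]
```

The rewritten form joins one fewer relation, since the Registered ⋈ Course join is precomputed in Graduate; whether it is actually cheaper depends on the available indexes, as discussed above.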

Physical data independence. One of the principles underlying modern
database systems is the separation between the logical view of data and the physical
view of data. With the exception of horizontal or vertical partitioning of relations
into multiple files, relational database systems are still largely based on a one-to-
one correspondence between relations in the schema and the files in which they are
stored. In object-oriented systems, maintaining the separation is necessary because
the logical schema contains significant redundancy, and does not correspond to a
good physical layout. Maintaining physical data independence becomes more crucial
in applications where the logical model is introduced as an intermediate level after
the physical representation has already been determined. This is common in storage
of XML data in relational databases, and in data integration. In fact, the Stored
System stores XML documents in a relational database, and uses views to describe
the mapping from XML into relations in the database.
To maintain physical data independence, a widely accepted method is to use
views over the logical schema as mechanism for describing the physical storage of
data. In particular, Tsatalos, Solomon and Ioannidis use GMAPs (Generalised Multi-
level Access Paths) to describe data storage.
A GMAP describes the physical organisation and the indexes of the storage
structure. The first clause of the GMAP (as) describes the actual data structure
used to store a set of tuples (e.g., a B+ -tree, hash index, etc.) The remaining clauses
describe the content of the structure, much like a view definition. The given and
select clauses describe the available attributes, where the given clause describes
the attributes on which the structure is indexed. The definition of the view, given
in the where clause uses infix notation over the conceptual model.
In the example shown in Figure 19.3, the GMAP G1 stores a set of pairs contain-
ing students and the departments in which they major, and these pairs are indexed
by a B+ -tree on attribute Student.name. The GMAP G2 stores an index from names
of students to the numbers of courses in which they are registered. The GMAP G3
stores an index from course numbers to departments whose majors are enrolled in
the course.
Given that the data is stored in structures described by the GMAPs, the question
that arises is how to use these structures to answer queries. Since the logical content
of the GMAPs are described by views, answering a query amounts to finding a way
of rewriting the query using these views. If there are multiple ways of answering the
query using the views, we would like to find the cheapest one. Note that in contrast
to the query optimisation context, we must use the views to answer the given query,
because all the data is stored in the GMAPs.
Consider the following query, which asks for names of students registered for
Ph.D.-level courses and the departments in which these students are majoring.

ans(Student,Department) ←Registered(Student,C-number,y),
Major(Student,Department), (19.46)
C-number ≥ 500 .

The query can be answered in two ways. First, since Student.name uniquely identifies
a student, we can take the join of G1 and G2, then apply the selection operator
Course.c-number ≥ 500, and finally a projection eliminates the unnecessary attributes.

def.gmap G1 as b+ -tree by
given Student.name
select Department
where Student major Department

def.gmap G2 as b+ -tree by
given Student.name
select Course.c-number
where Student registered Course

def.gmap G3 as b+ -tree by
given Course.c-number
select Department
where Student registered Course and Student major Department

Figure 19.3 GMAPs for the university domain.

The other execution plan could be to join G3 with G2 and select Course.c-number ≥
500. In fact, this solution may even be more efficient because G3 has an index on
the course number and therefore the intermediate joins may be much smaller.
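The first execution plan can be sketched over assumed in-memory GMAP contents, with each B+-tree index modelled as a Python dictionary; all data below is illustrative.

```python
# G1: Student.name -> departments majored in; G2: Student.name -> course
# numbers registered for (assumed sample contents of the GMAPs).
g1 = {"Ann": ["CS"], "Bob": ["Math"]}
g2 = {"Ann": [520, 120], "Bob": [300]}

# Plan 1 for query (19.46): join G1 with G2 on Student.name, select
# c-number >= 500, then project onto (Student, Department).
plan1 = sorted(set(
    (student, dept)
    for student, depts in g1.items()
    for dept in depts
    for cnum in g2.get(student, [])
    if cnum >= 500
))
print(plan1)  # [('Ann', 'CS')]
```

The alternative plan joining G3 with G2 proceeds analogously, but probes G3 directly with course numbers, so the selection on c-number is pushed through G3's index before any join is formed.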

Data integration. A data integration system (also known as a mediator
system) provides a uniform query interface to a multitude of autonomous
heterogeneous data sources. Prime examples of data integration applications include
enterprise integration, querying multiple sources on the World-Wide Web, and
integration of data from distributed scientific experiments.
To provide a uniform interface, a data integration system exposes to the user a
mediated schema. A mediated schema is a set of virtual relations, in the sense
that they are not stored anywhere. The mediated schema is designed manually for a
particular data integration application. To be able to answer queries, the system must
also contain a set of source descriptions. A description of a data source specifies
the contents of the source, the attributes that can be found in the source, and the
(integrity) constraints on the content of the source. A widely adopted approach for
specifying source descriptions is to describe the contents of a data source as a view
over the mediated schema. This approach facilitates the addition of new data sources
and the specification of constraints on the contents of sources.
In order to answer a query, a data integration system needs to translate a query
formulated on the mediated schema into one that refers directly to the schemata of
the data sources. Since the contents of the data sources are described as views, the
translation problem amounts to finding a way to answer a query using a set of views.
Consider as an example the case where the mediated schema exposed to the
user is schema University, except that the relations Teaches and Course have an
additional attribute identifying the university at which the course is being taught:
Course = {C-number,Title,Univ}
(19.47)
Teaches = {PName,C-number,Semester,Evaluation,Univ}

Suppose we have the following two data sources. The first source provides a listing of
all the courses entitled "Database Systems" taught anywhere and their instructors.
This source can be described by the following view definition:

DBcourse(Title,PName,C-number,Univ) ← Course(C-number,Title,Univ),
Teaches(PName,C-number,Semester,Evaluation,Univ),
Title = “Database Systems” .
(19.48)
The second source lists Ph.D.-level courses being taught at The Ohio State Univer-
sity (OSU), and is described by the following view definition:

OSUPhD(Title,PName,C-number,Univ) ← Course(C-number,Title,Univ),
Teaches(PName,C-number,Semester,Evaluation,Univ),
Univ = “OSU”, C-number ≥ 500 .
(19.49)
If we were to ask the data integration system who teaches courses titled "Database
Systems" at OSU, it would be able to answer the query by applying a selection on
the source DBcourse:

ans(PName) ← DBcourse(Title,PName,C-number,Univ), Univ = "OSU" .
(19.50)
On the other hand, suppose we ask for all the graduate-level courses (not just in
databases) being offered at OSU. Given that only these two sources are available, the
data integration system cannot find all tuples in the answer to the query. Instead,
the system can attempt to find the maximal set of tuples in the answer available from
the sources. In particular, the system can obtain graduate database courses at OSU
from the DBcourse source, and the Ph.D.-level courses at OSU from the OSUPhD
source. Hence, the following non-recursive datalog program gives the maximal set of
answers that can be obtained from the two sources:
ans(Title,C-number) ← DBcourse(Title,PName,C-number,Univ),
Univ = "OSU", C-number ≥ 400 (19.51)
ans(Title,C-number) ← OSUPhD(Title,PName,C-number,Univ) .

Note that courses that are neither Ph.D.-level courses nor database courses will not
be returned as answers. Whereas in the contexts of query optimisation and
maintaining physical data independence the focus is on finding a query expression
that is equivalent to the original query, here we attempt to find a query expression
that provides the maximal answers from the views.
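The union in (19.51) can be evaluated directly; the source contents below are assumed sample data, with tuples in the order (Title, PName, C-number, Univ).

```python
# Assumed sample contents of the two sources.
dbcourse = [("Database Systems", "Smith", 560, "OSU"),
            ("Database Systems", "Lee", 420, "CMU")]
osuphd = [("Logic", "Jones", 630, "OSU")]

# Maximally contained rewriting (19.51): graduate database courses at OSU
# from DBcourse, plus all Ph.D.-level OSU courses from OSUPhD.
ans = sorted(set(
    [(title, c) for (title, p, c, u) in dbcourse if u == "OSU" and c >= 400]
    + [(title, c) for (title, p, c, u) in osuphd]
))
print(ans)  # [('Database Systems', 560), ('Logic', 630)]
```

Graduate courses at OSU that are neither database courses nor Ph.D.-level, such as a 450-level logic course, are missing from both sources and hence from the answer, which is exactly why the rewriting is only maximally contained rather than equivalent.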

Semantic data caching. If the database is accessed via a client-server
architecture, the data cached at the client can be semantically modelled as results
of certain queries, rather than at the physical level as a set of data pages or tuples.
Hence,
deciding which data needs to be shipped from the server in order to answer a given
query requires an analysis of which parts of the query can be answered by the cached
views.

19.3.2. Complexity problems of query rewriting


In this section the theoretical complexity of query rewriting is studied. Mostly
conjunctive queries are considered. Minimal and complete rewritings will be
distinguished. It will be shown that if the query is conjunctive and the materialised
views are also given as results of conjunctive queries, then the rewriting problem is
NP-complete, assuming that neither the query nor the view definitions contain
comparison atoms. Conjunctive queries will be considered in rule-based form for
the sake of convenience.
Assume that query Q is given over schema R.
Definition 19.22 The conjunctive query Q′ is a rewriting of query Q using views
V = V1 , V2 , . . . , Vm , if
• Q and Q′ are equivalent, and
• Q′ contains one or more literals from V.
Q′ is said to be locally minimal if no literal can be removed from Q′ without
violating the equivalence. The rewriting is globally minimal, if there exists no
rewriting using a smaller number of literals. (The comparison atoms =, ≠, ≤, <
are not counted in the number of literals.)

Example 19.8 Query rewriting. Consider the following query Q and view V .
Q : q(X, U ) ← p(X, Y ), p0 (Y, Z), p1 (X, W ), p2 (W, U )
(19.52)
V : v(A, B) ← p(A, C), p0 (C, B), p1 (A, D) .
Q can be rewritten using V :

Q′ : q(X, U ) ← v(X, Z), p1 (X, W ), p2 (W, U ) . (19.53)

View V replaces the first two literals of query Q. Note that the view certainly satisfies the
third literal of the query as well. However, it cannot be removed from the rewriting, since
variable D does not occur in the head of V ; thus, if literal p1 were removed too, then
the natural join of p1 and p2 would not be enforced anymore.
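The equivalence claim of the example can be checked mechanically. The sketch below, a hypothetical helper not taken from the text, expands Q′ by replacing v(X, Z) with the body of V (with fresh variables C, D for the non-head variables of V) and then searches by brute force for a homomorphism from the body of Q into the expansion that fixes the head variables; by the Homomorphism Theorem this proves Q′ ⊑ Q.

```python
from itertools import product

# Hypothetical helper: decide containment by brute-force search for a
# homomorphism from the body of one rule into the body of another that
# fixes the shared head variables. Atoms are (predicate, args) pairs and
# every argument is treated as a variable.
def homomorphism(body1, body2, head_vars):
    vars1 = sorted({a for _, args in body1 for a in args})
    targets = sorted({a for _, args in body2 for a in args})
    atoms2 = set(body2)
    for image in product(targets, repeat=len(vars1)):
        h = dict(zip(vars1, image))
        if any(h.get(v, v) != v for v in head_vars):
            continue                      # head variables must stay fixed
        if all((p, tuple(h[a] for a in args)) in atoms2 for p, args in body1):
            return h
    return None

# Body of Q from (19.52).
q_body = [("p", ("X", "Y")), ("p0", ("Y", "Z")),
          ("p1", ("X", "W")), ("p2", ("W", "U"))]
# Expansion of Q' from (19.53): v(X, Z) replaced by the body of V, with
# C, D standing for the non-head variables of V.
q_prime_exp = [("p", ("X", "C")), ("p0", ("C", "Z")), ("p1", ("X", "D")),
               ("p1", ("X", "W")), ("p2", ("W", "U"))]

# A homomorphism from Q's body into the expansion proves Q' is contained in Q.
print(homomorphism(q_body, q_prime_exp, ("X", "U")) is not None)  # True
```

The reverse direction, Q ⊑ Q′, is checked symmetrically with a homomorphism from the expanded body into the body of Q (mapping C to Y and D to W); together the two directions give the claimed equivalence.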

Since in some of the applications, for example data integration or data warehouses,
the database relations are inaccessible and only the views can be accessed, the
concept of complete rewriting is introduced.
Definition 19.23 A rewriting Q′ of query Q using views V = V1 , V2 , . . . , Vm is
called a complete rewriting, if Q′ contains only literals of V and comparison
atoms.

Example 19.9 Complete rewriting. Assume that besides view V of Example 19.8 the
following view is given as well:

V2 : v2 (A, B) ← p1 (A, C), p2 (C, B), p0 (D, E) . (19.54)

A complete rewriting of query Q is:

Q′′ : q(X, U ) ← v(X, Z), v2 (X, U ) . (19.55)



It is important to see that this rewriting cannot be obtained step-by-step, first using only
V and then trying to incorporate V2 (or in the opposite order), since relation p0 of V2 does
not occur in Q′ . Thus, in order to find the complete rewriting, the use of the two views
must be considered in parallel, at the same time.

There is a close connection between finding a rewriting and the problem of query
containment. The latter was discussed for tableau queries in section 19.1.3.
Homomorphism between tableau queries can be defined for rule based queries as
well. The only difference is that it is not required in this section that a homomorphism
map the head of one rule to the head of the other. (The head of a rule corresponds
to the summary row of a tableau.) According to Theorem 19.20 it is NP-complete to
decide whether conjunctive query Q1 contains another conjunctive query Q2 . This
remains true in the case when Q2 may contain comparison atoms as well. However,
if both Q1 and Q2 may contain comparison atoms, then the existence of a
homomorphism from Q1 to Q2 is only a sufficient, but not necessary, condition for
the containment of queries, which is a Π_2^p-complete problem in that case. The
discussion of this latter complexity class is beyond the scope of this chapter, thus it
is omitted. The next proposition gives a necessary and sufficient condition for the
existence of a rewriting of query Q using view V .
Claim 19.24 Let Q and V be conjunctive queries that may contain comparison
atoms. There exists a rewriting of query Q using view V if and only if π∅ (Q) ⊆
π∅ (V ), that is, the projection of V to the empty attribute set contains that of Q.
Proof Observe that π∅ (Q) ⊆ π∅ (V ) is equivalent to the following proposition: if
the output of V is empty for some instance, then the same holds for the output of
Q as well.
Assume first that there exists a rewriting, that is, a rule equivalent to Q that
contains V in its body. If r is an instance such that the result of V is empty on it,
then every rule that includes V in its body gives an empty result over r, too.
In order to prove the other direction, assume that if the output of V is empty
for some instance, then the same holds for the output of Q as well. Let

Q : q(x̃) ← q1 (x̃), q2 (x̃), . . . , qm (x̃)
(19.56)
V : v(ã) ← v1 (ã), v2 (ã), . . . , vn (ã) .

Let ỹ be a list of variables disjoint from the variables of x̃. Then the query Q′ defined
by
Q′ : q ′ (x̃) ← q1 (x̃), q2 (x̃), . . . , qm (x̃), v1 (ỹ), v2 (ỹ), . . . , vn (ỹ) (19.57)
satisfies Q ≡ Q′ . It is clear that Q′ ⊆ Q. On the other hand, if there exists a valuation
of the variables of ỹ that satisfies the body of V over some instance r, then, fixing it,
every tuple in the output of Q is also obtained in the output of Q′ , by extending the
valuation of the variables of x̃ with the fixed valuation of the variables of ỹ.

As a corollary of Theorem 19.20 and Proposition 19.24 the following theorem is
obtained.

Theorem 19.25 Let Q be a conjunctive query that may contain comparison atoms,
and let V be a set of views. If the views in V are given by conjunctive queries that do
not contain comparison atoms, then it is NP-complete to decide whether there exists
a rewriting of Q using V.
The proof of Theorem 19.25 is left for the Reader (Exercise 19.3-1).
The proof of Proposition 19.24 uses new variables. However, according to the
next lemma, this is not necessary. Another important observation is that when a
locally minimal rewriting is sought, it is enough to consider a subset of the database
relations occurring in the original query; new database relations need not be
introduced.
Lemma 19.26 Let Q be a conjunctive query that does not contain comparison
atoms
Q : q(X̃) ← p1 (Ũ1 ), p2 (Ũ2 ), . . . , pn (Ũn ) , (19.58)
furthermore let V be a set of views that do not contain comparison atoms either.
1. If Q′ is a locally minimal rewriting of Q using V, then the set of database
literals in Q′ is isomorphic to a subset of the database literals occurring in Q.
2. If

q(X̃) ← p1 (Ũ1 ), p2 (Ũ2 ), . . . , pn (Ũn ), v1 (Ỹ1 ), v2 (Ỹ2 ), . . . , vk (Ỹk ) (19.59)

is a rewriting of Q using the views, then there exists a rewriting

q(X̃) ← p1 (Ũ1 ), p2 (Ũ2 ), . . . , pn (Ũn ), v1 (Ỹ′1 ), v2 (Ỹ′2 ), . . . , vk (Ỹ′k ) (19.60)

such that {Ỹ′1 ∪ · · · ∪ Ỹ′k } ⊆ {Ũ1 ∪ · · · ∪ Ũn }, that is, the rewriting does not
introduce new variables.

The details of the proof of Lemma 19.26 are left for the Reader (Exercise 19.3-2).
The next lemma is of fundamental importance: a minimal rewriting of Q using V
cannot increase the number of literals.
Lemma 19.27 Let Q be a conjunctive query and V a set of views given by
conjunctive queries, both without comparison atoms. If the body of Q contains p
literals and Q′ is a locally minimal rewriting of Q using V, then Q′ contains at most
p literals.
Proof Replacing the view literals of Q′ by their definitions, query Q′′ is obtained.
Let ϕ be a homomorphism from the body of Q to Q′′ . The existence of ϕ follows
from Q ≡ Q′′ by the Homomorphism Theorem (Theorem 19.18). Each of the literals
l1 , l2 , . . . , lp of the body of Q is mapped to at most one literal obtained from the
expansion of view definitions. If Q′ contains more than p view literals, then the
expansion of some view literal in the body of Q′′ is disjoint from the image of
ϕ. These view literals can be removed from the body of Q′ without changing the
equivalence.

Based on Lemma 19.27, the following theorem can be stated about the complexity
of minimal rewritings.

Theorem 19.28 Let Q be a conjunctive query and V a set of views given by
conjunctive queries, both without comparison atoms. Let the body of Q contain p
literals.

1. It is NP-complete to decide whether there exists a rewriting Q′ of Q using V
that uses at most k (≤ p) literals.

2. It is NP-complete to decide whether there exists a rewriting Q′ of Q using V
that uses at most k (≤ p) database literals.

3. It is NP-complete to decide whether there exists a complete rewriting of Q
using V.

Proof The first statement is proved; the proof of the other two is similar. According
to Lemmas 19.27 and 19.26, only such rewritings need to be considered that have
at most as many literals as the query itself, contain a subset of the literals of the
query, and do not introduce new variables. Such a rewriting and the homomorphisms
proving the equivalence can be tested in polynomial time, hence the problem is in
NP. In order to prove that it is NP-hard, Theorem 19.25 is used. For a given query
Q and view V let V ′ be the view whose head is the same as the head of V , but whose
body is the conjunction of the bodies of Q and V . It is easy to see that there exists
a rewriting using V ′ with a single literal if and only if there exists a rewriting (with
no restriction) using V .

19.3.3. Practical algorithms


In this section only complete rewritings are studied. This is not a real restriction,
since if database relations are also to be used, then views mirroring the database
relations one-to-one can be introduced. The concept of equivalent rewriting
introduced in Definition 19.22 is appropriate if the goal of the rewriting is query
optimisation or providing physical data independence. However, in the context of
data integration or data warehouses equivalent rewritings cannot be sought, since
not all necessary data may be available. Thus, the concept of maximally contained
rewriting is introduced, which, in contrast to equivalent rewritings, depends on the
query language used.

Definition 19.29 Let Q be a query, V a set of views, and L a query language.
Q′ is a maximally contained rewriting of Q with respect to L, if

1. Q′ is a query of language L using only views from V,
2. Q contains Q′,
3. if query Q1 ∈ L satisfies Q′ ⊑ Q1 ⊑ Q, then Q′ ≡ Q1 .



Query optimisation using materialised views. Before discussing how a
traditional optimiser can be modified in order to use materialised views instead of
database relations, it is necessary to survey when a view can be used to answer a
given query. Essentially, view V can be used to answer query Q if the intersection
of the sets of database relations in the body of V and in the body of Q is non-empty,
and some of the attributes selected by V are also selected by Q. Besides this, in case
of equivalent rewriting, if V contains comparison atoms for attributes that also occur
in Q, then the view must apply a logically equivalent or weaker condition than the
query. If a logically stronger condition is applied in the view, then it can still be part
of a (maximally) contained rewriting. This can be shown best via an example.
Consider the query Q over schema University that lists those (professor, student,
semester) triplets where the advisor of the student is the professor and in the given
semester the student registered for some course taught by the professor.

Q : q(Pname,Student,Semester) ←Registered(Student,C-number,Semester),
Advisor(Pname,Student),
Teaches(Pname, C-number,Semester, xE ),
Semester ≥ “Fall2000” .
(19.61)
View V1 below can be used to answer Q, since it uses the same join condition
for relations Registered and Teaches as Q, as shown by the variables of the same
name. Furthermore, V1 selects the attributes Student, PName, Semester, which are
necessary for properly joining with relation Advisor, and for the select clause of the
query. Finally, the predicate Semester > “Fall1999” is weaker than the predicate
Semester ≥ “Fall2000” of the query.

V1 : v1 (Student,PName,Semester) ←Teaches(PName,C-number,Semester, xE ),
Registered(Student,C-number,Semester),
Semester > “Fall1999” .
(19.62)
The following four views illustrate how minor modifications to V1 change the us-
ability in answering the query.

V2 : v2 (Student,Semester) ←Teaches(xP , C-number,Semester, xE ),


Registered(Student,C-number,Semester), (19.63)
Semester > “Fall1999” .

V3 : v3 (Student,PName,Semester) ←Teaches(PName, C-number, xS , xE ),


Registered(Student,C-number,Semester),
Semester > “Fall1999” .
(19.64)
V4 : v4 (Student,PName,Semester) ←Teaches(PName, C-number,Semester, xE ),
Advisor(PName, xSt ), Professor(PName, xA ),
Registered(Student,C-number,Semester),
Semester > “Fall1999” .
(19.65)

V5 : v5 (Student,PName,Semester) ←Teaches(PName, C-number,Semester, xE ),


Registered(Student,C-number,Semester),
Semester > “Fall2001” .
(19.66)
View V2 is similar to V1 , except that it does not select the attribute PName from
relation Teaches, which is needed for the join with the relation Advisor and for the
selection of the query. Hence, to use V2 in the rewriting, it has to be joined with
relation Teaches again. Still, if the join of relations Registered and Teaches is very
selective, then employing V2 may actually result in a more efficient query execution
plan.
In view V3 the join of relations Registered and Teaches is over the attribute
C-number only; the equality of variables Semester and xS is not required. Since variable
xS is not selected by V3 , the join predicate cannot be applied in the rewriting, and
therefore there is little gain in using V3 .
View V4 considers only those professors who have at least one area of research.
Hence, the view applies an additional condition that does not exist in the query,
and cannot be used in an equivalent rewriting unless union and negation are allowed
in the rewriting language. However, if there is an integrity constraint stating that
every professor has at least one area of research, then an optimiser should be able
to realise that V4 is usable.
Finally, view V5 applies a stronger predicate than the query, and is therefore
usable for a contained rewriting, but not for an equivalent rewriting of the query.
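The usability distinctions among V1–V5 hinge on comparing a view's range predicate with the query's. A minimal sketch, under illustrative assumptions (the function names are not from the text, predicates are simple lower bounds, and semester labels are assumed to be strings whose lexicographic order matches chronological order, which holds for the "FallYYYY" labels used here but not for mixed Fall/Spring labels):

```python
# Hypothetical helper: classify a view's lower-bound predicate on a shared
# attribute relative to the query's. A predicate "Semester op constant" is
# encoded as (op, constant) with op in {">", ">="}.

def as_bound(pred):
    op, value = pred
    # (value, True) encodes a strict bound: it excludes value itself,
    # so it is a stronger condition than (value, False).
    return (value, op == ">")

def classify(view_pred, query_pred):
    """'weaker-or-equal': the view admits every tuple the query needs,
    so it is usable in an equivalent rewriting;
    'stronger': usable only in a contained rewriting."""
    return ("weaker-or-equal"
            if as_bound(view_pred) <= as_bound(query_pred)
            else "stronger")

# V1's predicate versus the query's:
print(classify((">", "Fall1999"), (">=", "Fall2000")))   # weaker-or-equal
# V5's predicate versus the query's:
print(classify((">", "Fall2001"), (">=", "Fall2000")))   # stronger
```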

System-R style optimisation Before discussing the changes to traditional
optimisation, the principles underlying a System-R style optimiser are first recalled
briefly. System-R takes a bottom-up approach to building query execution
plans. In the first phase, it concentrates on plans of size 1, i.e., it chooses the best
access paths to every table mentioned in the query. In phase n, the algorithm considers
plans of size n, by combining plans obtained in the previous phases (of sizes k and
n − k). The algorithm terminates after constructing plans that cover all the relations
in the query. The efficiency of System-R stems from the fact that it partitions query
execution plans into equivalence classes, and only considers a single execution
plan for every equivalence class. Two plans are in the same equivalence class if they

• cover the same set of relations in the query (and therefore are also of the same
size), and
• produce the answer in the same interesting order.
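The bottom-up enumeration just described can be sketched as a small dynamic program over equivalence classes. The cost model and the treatment of interesting orders below are toy assumptions for illustration, not System-R's actual formulas, and all names are hypothetical:

```python
# Sketch of System-R style bottom-up plan enumeration. A plan class is
# keyed by (covered relations, interesting order); only the cheapest plan
# per class survives. access_cost maps relation -> {order: cost};
# join_cost is a caller-supplied toy cost function.

def enumerate_plans(relations, access_cost, join_cost):
    best = {}                                   # (frozenset, order) -> (cost, plan)
    for r in relations:                         # phase 1: size-1 plans
        for order, c in access_cost[r].items():
            key = (frozenset([r]), order)
            if key not in best or c < best[key][0]:
                best[key] = (c, r)
    for size in range(2, len(relations) + 1):   # phase n: combine smaller plans
        snapshot = list(best.items())
        for (s1, o1), (c1, p1) in snapshot:
            for (s2, o2), (c2, p2) in snapshot:
                if len(s1) + len(s2) != size or s1 & s2:
                    continue
                cost = c1 + c2 + join_cost(s1, s2)
                key = (s1 | s2, o1)             # assume the left order survives
                if key not in best or cost < best[key][0]:
                    best[key] = (cost, (p1, "JOIN", p2))
    complete = [v for (s, _), v in best.items() if len(s) == len(relations)]
    return min(complete, key=lambda t: t[0])    # cheapest complete plan

best_cost, best_plan = enumerate_plans(
    ["R", "S"], {"R": {"unordered": 1}, "S": {"unordered": 2}},
    lambda a, b: 1)
# best_cost == 4: access R (1) + access S (2) + one join (1)
```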
In our context, the query optimiser builds query execution plans by accessing a set
of views, rather than a set of database relations. Hence, in addition to the meta-data
that the query optimiser has about the materialised views (e.g., statistics, indexes),
the optimiser is also given as input the query expressions defining the views. The
additional issues that the optimiser needs to consider in the presence of materialised
views are as follows.

A. In the first iteration the algorithm needs to decide which views are relevant
to the query, according to the conditions illustrated above. The corresponding
step is trivial in a traditional optimiser: a relation is relevant to the query if it
is in the body of the query expression.

B. Since the query execution plans involve joins over views, rather than joins over
database relations, plans can no longer be neatly partitioned into equivalence
classes which can be explored in increasing size. This observation implies several
changes to the traditional algorithm:

1. Termination testing: the algorithm needs to distinguish partial query
execution plans of the query from complete query execution plans.
The enumeration of the possible join orders terminates when there are no
more unexplored partial plans. In contrast, in the traditional setting the
algorithm terminates after considering the equivalence classes that include
all the relations of the query.

2. Pruning of plans: a traditional optimiser compares pairs of plans
within one equivalence class and saves only the cheapest one for each class.
In our context, the query optimiser needs to compare any pair of
plans generated thus far. A plan p is pruned if there is another plan p′ that

(a) is cheaper than p, and

(b) has a greater or equal contribution to the query than p. Informally, a
plan p′ contributes more to the query than plan p if it covers more of
the relations in the query and selects more of the necessary attributes.

3. Combining partial plans: in the traditional setting, when two partial
plans are combined, the join predicates that involve both plans are explicit
in the query, and the enumeration algorithm need only consider the most
efficient way to apply these predicates. In our case, however, it may not
be obvious a priori which join predicate will yield a correct rewriting of
the query, since views are joined rather than database relations directly.
Hence, the enumeration algorithm needs to consider several alternative join
predicates. Fortunately, in practice, the number of join predicates that need
to be considered can be significantly pruned using meta-data about the
schema. For example, there is no point in trying to join a string attribute
with a numeric one. Furthermore, in some cases knowledge of integrity
constraints and the structure of the query can be used to reduce the number
of join predicates to be considered. Finally, after considering all the possible
join predicates, the optimiser also needs to check whether the resulting plan
is still a partial solution to the query.
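The dominance test of point 2 can be sketched directly; the plan representation below (a cost, a set of covered relations, a set of selected attributes) is an illustrative assumption, not the text's data structure:

```python
# Sketch of pruning rule 2: plan p is pruned if some plan p2 is no more
# expensive, covers at least the same relations, selects at least the same
# attributes, and is strictly better in at least one of these respects.
# A plan is a triple (cost, covered_relations, selected_attributes).

def dominates(p2, p):
    c2, r2, a2 = p2
    c, r, a = p
    return (c2 <= c and r <= r2 and a <= a2
            and (c2 < c or r < r2 or a < a2))

def prune(plans):
    return [p for p in plans
            if not any(dominates(p2, p) for p2 in plans)]
```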

The following table summarises the comparison of the traditional optimiser with
one that exploits materialised views.

Conventional optimiser:

Iteration 1
  a) Find all possible access paths.
  b) Compare their cost and keep the least expensive.
  c) If the query has one relation, stop.

Iteration 2
  For each query join:
  a) Consider joining the relevant access paths found in the previous
     iteration using all possible join methods.
  b) Compare the cost of the resulting join plans and keep the least
     expensive.
  c) If the query has two relations, stop.

Iteration 3
  ...

Optimiser using views:

Iteration 1
  a1) Find all views that are relevant to the query.
  a2) Distinguish between partial and complete solutions to the query.
  b) Compare all pairs of views. If one has neither greater contribution
     nor a lower cost than the other, prune it.
  c) If there are no partial solutions, stop.

Iteration 2
  a1) Consider joining all partial solutions found in the previous
      iteration using all possible equi-join methods and trying all
      possible subsets of join predicates.
  a2) Distinguish between partial and complete solutions to the query.
  b) If any newly generated solution is either not relevant to the query
     or dominated by another, prune it.
  c) If there are no partial solutions, stop.

Iteration 3
  ...

Another method of equivalent rewriting is the use of transformation rules. The common
theme in the works of that area is that replacing some part of the query with a view
is considered as just another transformation available to the optimiser. These methods
are not discussed in detail here.
The query optimisers discussed above were designed to handle cases where the
number of views is relatively small (i.e., comparable to the size of the database
schema), and cases where equivalent rewriting is required. In contrast, the context
of data integration requires consideration of a large number of views, since each data
source is described by one or more views. In addition, the view definitions
may contain many complex predicates, whose goal is to express fine-grained
distinctions between the contents of different data sources. Furthermore, in the context
of data integration it is often assumed that the views are not complete, i.e., they
may only contain a subset of the tuples satisfying their definition. In what follows,
some algorithms for answering queries using views are described that were developed
specifically for the context of data integration.

The Bucket Algorithm. The goal of the Bucket Algorithm is to reformulate
a user query posed on a mediated (virtual) schema into a query that refers
directly to the available sources. Both the query and the sources are described by
conjunctive queries that may include atoms of arithmetic comparison predicates.
The set of comparison atoms of query Q is denoted by C(Q).
Since the number of possible rewritings may be exponential in the size of the
query, the main idea underlying the bucket algorithm is that the number of query
rewritings that need to be considered can be drastically reduced if each
subgoal – the relational atoms of the query – is first considered in isolation, and it
is determined which views may be relevant to each subgoal.
The algorithm proceeds as follows. First, a bucket is created for each subgoal of
the query Q that is not in C(Q), containing the views that are relevant to answering
that particular subgoal. In the second step, all those conjunctive query rewritings are
considered that include one conjunct (view) from each bucket. For each rewriting V
obtained, it is checked whether it is semantically correct, that is, whether V ⊑ Q holds,
or whether it can be made semantically correct by adding comparison atoms. Finally,
the remaining plans are minimised by pruning redundant subgoals. Algorithm
Create-Bucket executes the first step described above. Its input is a set of source
descriptions V and a conjunctive query Q of the form

Q : Q(X̃) ← R1 (X̃1 ), R2 (X̃2 ), . . . , Rm (X̃m ), C(Q) . (19.67)

Create-Bucket(Q, V)
 1  for i ← 1 to m
 2    do Bucket[i] ← ∅
 3       for all V ∈ V
 4          ▷ V is of the form V : V (Ỹ ) ← S1 (Ỹ1 ), . . . , Sn (Ỹn ), C(V ).
 5          do for j ← 1 to n
 6                do if Ri = Sj
 7                      then let φ be a mapping defined on the variables
                           of V as follows:
 8                           if y is the k-th variable of Ỹj and y ∈ Ỹ
 9                              then φ(y) = xk , where xk is the k-th variable of X̃i
10                              else φ(y) is a new variable that
                                 does not appear in Q or V
11                           Q′ () ← R1 (X̃1 ), . . . , Rm (X̃m ), C(Q), S1 (φ(Ỹ1 )), . . . ,
                             Sn (φ(Ỹn )), φ(C(V ))
12                           if Satisfiable≥ (Q′ )
13                              then add φ(V ) to Bucket[i]
14  return Bucket

Procedure Satisfiable≥ is the extension of procedure Satisfiable described in Section
19.1.2 to the case when comparison atoms may occur besides equality atoms. The
only necessary change is that for every variable y occurring in comparison atoms it
must be checked whether all predicates involving y are simultaneously satisfiable.

The running time of Create-Bucket is a polynomial function of the sizes of Q and V.
Indeed, the kernel of the nested loops of lines 3 and 5 runs Σ_{V ∈V} |V | times for
each value of i. The commands of lines 6–13 require constant time, except for line 12,
and the condition of the if command in line 12 can be checked in polynomial time.
In order to prove the correctness of procedure Create-Bucket, one should
check under what conditions a view V is put in Bucket[i]. In line 6 it is checked
whether relation Ri appears as a subgoal in V . If not, then obviously V cannot give
usable information for subgoal Ri of Q. If Ri is a subgoal of V , then in lines 9–10
a mapping is constructed that, applied to the variables, establishes the correspondence
between subgoals Sj and Ri , in accordance with the relations occurring in the heads of
Q and V , respectively. Finally, in line 12 it is checked whether the comparison atoms
contradict the correspondence constructed.
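A runnable sketch of Create-Bucket for the comparison-free case (so the Satisfiable≥ test of line 12 is omitted); the data representation and all names are illustrative assumptions, not from the text:

```python
# Atoms are (relation, vars); a view is (name, head_vars, body_atoms).
# For each query subgoal R_i(x~_i), a view V is put into Bucket[i] once per
# body atom S_j(y~_j) with S_j = R_i, with the partial mapping phi applied
# to the view's head; non-distinguished head positions get fresh variables.
import itertools

_fresh = itertools.count()

def create_bucket(query_body, views):
    buckets = []
    for (r, xs) in query_body:
        bucket = []
        for (name, head, body) in views:
            for (s, ys) in body:
                if s != r or len(ys) != len(xs):
                    continue
                phi, ok = {}, True
                for y, x in zip(ys, xs):
                    if y in head and phi.setdefault(y, x) != x:
                        ok = False          # two positions demand different
                        break               # images for the same head variable
                if ok:
                    mapped = tuple(phi.get(v, f"_z{next(_fresh)}")
                                   for v in head)
                    bucket.append((name, mapped))
        buckets.append(bucket)
    return buckets

# The query and views of Example 19.10, in this encoding:
query_body = [("cite", ("x", "y")), ("cite", ("y", "x")),
              ("sameArea", ("x", "y"))]
views = [
    ("v1", ("a",), [("cite", ("a", "b")), ("cite", ("b", "a"))]),
    ("v2", ("c", "d"), [("sameArea", ("c", "d"))]),
    ("v3", ("f", "h"), [("cite", ("f", "g")), ("cite", ("g", "h")),
                        ("sameArea", ("f", "g"))]),
]
```

Applied to Example 19.10, entries for v1 and v3 land in the buckets of the cite subgoals, while the sameArea bucket receives entries for v2 and v3 only.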
In the second step, having constructed the buckets using Create-Bucket, the
bucket algorithm finds a set of conjunctive query rewritings, each of them being
a conjunctive query that includes one conjunct from every bucket. Each of these
conjunctive query rewritings represents one way of obtaining part of the answer to
Q from the views. The result of the bucket algorithm is defined to be the union of the
conjunctive query rewritings (since each of the rewritings may contribute different
tuples). A given conjunctive query Q′ is a conjunctive query rewriting, if

1. Q′ ⊑ Q, or

2. Q′ can be extended with comparison atoms so that the previous property
holds.
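The second step can be sketched as a product over the buckets, with the (in general NP-complete) containment test supplied as a parameter; the function names are illustrative assumptions:

```python
# Enumerate one view per bucket; keep the combinations that pass the
# containment test (adding comparison atoms, where needed, is folded into
# the supplied test). The union of the survivors is the algorithm's result.
from itertools import product

def bucket_rewritings(buckets, contained_in_query):
    result = []
    for combo in product(*buckets):
        candidate = list(combo)          # one conjunct from every bucket
        if contained_in_query(candidate):
            result.append(candidate)
    return result
```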

Example 19.10 Bucket algorithm. Consider the following query Q that lists those articles
x for which there exists another article y of the same area such that x and y mutually cite
each other. There are three views (sources) available: V1 , V2 , V3 .

Q(x) ← cite(x, y), cite(y, x), sameArea(x, y)


V1 (a) ← cite(a, b), cite(b, a)
(19.68)
V2 (c, d) ← sameArea(c, d)
V3 (f, h) ← cite(f, g), cite(g, h), sameArea(f, g) .

In the first step, applying Create-Bucket, the following buckets are constructed.

cite(x, y)        cite(y, x)        sameArea(x, y)
V1 (x)            V1 (x)            V2 (x)                    (19.69)
V3 (x)            V3 (x)            V3 (x)

In the second step the algorithm constructs a conjunctive query Q′ from each element of
the Cartesian product of the buckets, and checks whether Q′ is contained in Q. If so, it is
added to the answer.
In our case, it tries to match V1 with the other views, but no correct rewriting is
obtained this way. The reason is that b does not appear in the head of V1 , hence the join condition
of Q – variables x and y occur in relation sameArea, as well – cannot be applied. Then
rewritings containing V3 are considered, and it is recognised that by equating the variables
in the head of V3 a contained rewriting is obtained. Finally, the algorithm finds that a
rewriting is obtained by combining V3 and V2 , as well. This latter is redundant, as simple
checking shows, hence V2 can be pruned. Thus, the result of the bucket algorithm for query
(19.68) is the following (actually equivalent) rewriting

Q′ (x) ← V3 (x, x) . (19.70)

The strength of the bucket algorithm is that it exploits the predicates in the
query to significantly prune the number of candidate conjunctive rewritings that
need to be considered. Checking whether a view should belong to a bucket can
be done in time polynomial in the size of the query and view definition when the
predicates involved are arithmetic comparisons. Hence, if the data sources (i.e., the
views) are indeed distinguished by having different comparison predicates, then the
resulting buckets will be relatively small.
The main disadvantage of the bucket algorithm is that the Cartesian product
of the buckets may still be large. Furthermore, the second step of the algorithm
needs to perform a query containment test for every candidate rewriting, which is
NP-complete even when no comparison predicates are involved.

Inverse-rules algorithm. The Inverse-rules Algorithm is a procedure that can
be applied more generally than the Bucket Algorithm: it finds, in polynomial time, a
maximally contained rewriting for any query given by an arbitrary recursive datalog
program that does not contain negation.
The first question is whether, for a given datalog program P and set of conjunctive
queries V, there exists a datalog program Pv equivalent to P whose edb
relations are the relations v1 , v2 , . . . , vn of V. Unfortunately, this is algorithmically
undecidable. Surprisingly, the best, maximally contained rewriting can nevertheless be
constructed. In the case when there exists a datalog program Pv equivalent to P, the
algorithm finds it, since a maximally contained rewriting contains Pv , as well. This
seemingly contradicts the fact that the existence of an equivalent rewriting is
algorithmically undecidable; the resolution is that it is undecidable whether the result of
the Inverse-rules Algorithm is in fact equivalent to the original query.

Example 19.11 Equivalent rewriting. Consider the following datalog program P, where
the edb relations edge and black contain the edges and the black-coloured vertices of a
graph G, respectively.

P: q(X, Y ) ← edge(X, Z), edge(Z, Y ), black(Z)
(19.71)
q(X, Y ) ← edge(X, Z), black(Z), q(Z, Y ) .

It is easy to check that P lists the endpoints of those paths (more precisely, walks) of graph G
whose inner vertices are all black. Assume that only the following two views can be accessed.

v1 (X, Y ) ← edge(X, Y ), black(X)


(19.72)
v2 (X, Y ) ← edge(X, Y ), black(Y )

v1 stores the edges whose tail is black, while v2 stores those whose head is black. There exists
an equivalent rewriting Pv of datalog program P that uses only views v1 and v2 as edb
relations:

Pv : q(X, Y ) ← v2 (X, Z), v1 (Z, Y )
(19.73)
q(X, Y ) ← v2 (X, Z), q(Z, Y )

However, if only v1 or only v2 is accessible, then an equivalent rewriting is not possible, since
then only those paths are obtainable whose starting, or respectively ending, vertex is black.

In order to describe the Inverse-rules Algorithm, it is necessary to introduce the
Horn rule, which is a generalisation of the datalog rule. If
function symbols are also allowed in the free tuple ui of rule (19.27) in Definition
19.11, besides variables and constants, then a Horn rule is obtained. A logic
program is a collection of Horn rules. In this sense a logic program without function
symbols is a datalog program. The concepts of edb and idb can be defined for logic
programs in the same way as for datalog programs.
The Inverse-rules Algorithm consists of two steps. First, a logic program is
constructed that may contain function symbols. However, these will not occur in
recursive rules, thus in the second step they can be eliminated and the logic program
can be transformed into a datalog program.
Definition 19.30 The inverse v −1 of view v given by

v(X1 , . . . , Xm ) ← v1 (Ỹ1 ), . . . , vn (Ỹn ) (19.74)

is the following collection of Horn rules. A rule corresponds to every subgoal vi (Ỹi ),
and the body of each rule is the literal v(X1 , . . . , Xm ). The head of the rule
corresponding to vi (Ỹi ) is vi (Z̃i ), where Z̃i
is obtained from Ỹi by preserving the variables appearing in the head of rule (19.74),
while the function symbol fY (X1 , . . . , Xm ) is written in place of every variable Y not
appearing in the head. Distinct function symbols correspond to distinct variables. The
inverse of a set V of views is the set {v −1 : v ∈ V}, where distinct function symbols
occur in the inverses of distinct rules.
The idea behind the definition of inverses is that if a tuple (x1 , . . . , xm ) appears in the
view v for some constants x1 , . . . , xm , then there is a valuation of every variable y appearing
in the body that makes the body of the rule true. This “unknown” valuation is
denoted by the function symbol fY (X1 , . . . , Xm ).

Example 19.12 Inverse of views. Let V be the following collection of views.

v1 (X, Y ) ← edge(X, Z), edge(Z, W ), edge(W, Y )


(19.75)
v2 (X) ← edge(X, Z) .

Then V −1 consists of the following rules.

edge(X, f1,Z (X, Y )) ← v1 (X, Y )


edge(f1,Z (X, Y ), f1,W (X, Y )) ← v1 (X, Y )
(19.76)
edge(f1,W (X, Y ), Y ) ← v1 (X, Y )
edge(X, f2,Z (X)) ← v2 (X) .
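Definition 19.30 is mechanical enough to sketch directly; the data representation below (function terms as tuples, views as triples) is an illustrative assumption, not the text's notation:

```python
# A view is (name, head_vars, body_atoms). Each body atom yields one
# inverse rule, written as (head_atom, body_atom). A variable outside the
# view's head is replaced by the function term ("f", view_name, var,
# head_vars), standing for f_{view,var}(head_vars).

def inverse_rules(view):
    name, head, body = view
    rules = []
    for (rel, args) in body:
        new_args = tuple(a if a in head else ("f", name, a, head)
                         for a in args)
        rules.append(((rel, new_args), (name, head)))
    return rules

# The views of Example 19.12 in this encoding:
v1 = ("v1", ("X", "Y"),
      [("edge", ("X", "Z")), ("edge", ("Z", "W")), ("edge", ("W", "Y"))])
v2 = ("v2", ("X",), [("edge", ("X", "Z"))])
# v1 yields three inverse rules and v2 yields one, mirroring (19.76).
```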

Now, for given P and V, the maximally contained rewriting of datalog program P
using views V can easily be constructed.
First, those rules are deleted from P that contain an edb relation that does not
appear in the definition of any view from V. The rules of V −1 are added to the datalog
program P − obtained this way, thus forming the logic program (P − , V −1 ). Note that the
remaining edb relations of P are idb relations in the logic program (P − , V −1 ), since they

[Figure 19.4: the graph G, a directed path on the vertices a, b, c, d, e.]

[Figure 19.5: the graph G′, with vertices a, b, c, d, e and the new constants
f(a,c), f(b,d), f(c,e); each edge (x, y) of the view instance is replaced by
the edges x → f(x,y) and f(x,y) → y.]

appear in the heads of the rules of V −1 . The names of idb relations are arbitrary,
so they can be renamed so that their names do not coincide with the names of edb
relations of P. However, this is not done in the following example, for the sake of
better understanding.

Example 19.13 Logic program. Consider the following datalog program that calculates
the transitive closure of relation edge.

P: q(X, Y ) ← edge(X, Y )
(19.77)
q(X, Y ) ← edge(X, Z), q(Z, Y )

Assume that only the following materialised view is accessible, which stores the endpoints
of the paths of length two:

v(X, Y ) ← edge(X, Z), edge(Z, Y ) .

If only this view is usable, then the most that can be expected is listing the endpoints of
the paths of even length. Since the unique edb relation of datalog program P is edge, which
also appears in the definition of v, the logic program (P − , V −1 ) is obtained by adding the
rules of V −1 to P.

(P − , V −1 ) : q(X, Y ) ← edge(X, Y )
q(X, Y ) ← edge(X, Z), q(Z, Y )
(19.78)
edge(X, f (X, Y )) ← v(X, Y )
edge(f (X, Y ), Y ) ← v(X, Y ) .

Let the instance of the edb relation edge of datalog program P be the graph G shown in
Figure 19.4. Then (P − , V −1 ) introduces three new constants: f (a, c), f (b, d) and f (c, e). The
idb relation edge of logic program V −1 is the graph G′ shown in Figure 19.5. P − computes
the transitive closure of graph G′. Note that those pairs in the transitive closure that do not
contain any of the new constants are exactly the endpoints of the even-length paths of G.
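The computation of Example 19.13 can be reproduced with a naive bottom-up evaluation, representing the function terms f(X, Y) as Python tuples; the encoding and function name are illustrative assumptions:

```python
# Evaluate the logic program (19.78) on a view instance v, then apply the
# function-symbol filter: keep only tuples made of constants.

def even_path_endpoints(v):
    edge = set()
    for (x, y) in v:                   # the two inverse rules of the view
        edge.add((x, ("f", x, y)))
        edge.add((("f", x, y), y))
    q = set(edge)                      # q(X,Y) <- edge(X,Y)
    changed = True
    while changed:                     # q(X,Y) <- edge(X,Z), q(Z,Y)
        changed = False
        for (x, z) in edge:
            for (z2, y) in list(q):
                if z == z2 and (x, y) not in q:
                    q.add((x, y))
                    changed = True
    return {t for t in q               # drop tuples with function terms
            if not any(isinstance(c, tuple) for c in t)}

# The view instance for the graph G of Figure 19.4 (the path a-b-c-d-e):
result = even_path_endpoints({("a", "c"), ("b", "d"), ("c", "e")})
# result == {("a","c"), ("a","e"), ("b","d"), ("c","e")}: exactly the
# even-distance pairs on the path a -> b -> c -> d -> e
```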

The result of the logic program (P − , V −1 ) in Example 19.13 can be calculated by
procedure Naiv-Datalog, for example. However, for logic programs
in general it is not true that the algorithm terminates. Indeed, consider the logic program

q(X) ← p(X)
(19.79)
q(f (X)) ← q(X) .

If the edb relation p contains the constant a, then the output of the program is the
infinite sequence a, f (a), f (f (a)), f (f (f (a))), . . . In contrast, the output of
the logic program (P − , V −1 ) given by the Inverse-rules Algorithm is guaranteed to
be finite, thus the computation terminates in finite time.
Theorem 19.31 For an arbitrary datalog program P and set of conjunctive views V,
and for finite instances of the views, there exists a unique minimal fixpoint of the
logic program (P − , V −1 ); furthermore, procedures Naiv-Datalog and Semi-Naiv-
Datalog give this minimal fixpoint as output.
The essence of the proof of Theorem 19.31 is that function symbols are only introduced
by inverse rules, which are in turn not recursive, thus terms containing nested
function symbols are not produced. The details of the proof are left for the Reader
(Exercise 19.3-3).
Even if started from the edb relations of a database, the output of a logic program
may contain tuples that involve function symbols. Thus, a filter is introduced that
eliminates the unnecessary tuples. Let database D be the instance of the edb relations
of datalog program P. P(D)↓ denotes the set of those tuples from P(D) that do not
contain function symbols. Let P↓ denote the program that computes P(D)↓ for
a given instance D. The proof of the following theorem exceeds the scope of
the present chapter.
Theorem 19.32 For an arbitrary datalog program P and set of conjunctive views
V, the logic program (P − , V −1 )↓ is a maximally contained rewriting of P using V.
Furthermore, (P − , V −1 ) can be constructed in polynomial time in the sizes of P and
V.
The meaning of Theorem 19.32 is that the simple procedure of adding the inverses
of the view definitions to a datalog program results in a logic program that uses the
views as much as possible. It is easy to see that (P − , V −1 ) can be constructed in
polynomial time in the sizes of P and V, since a unique inverse rule must be
constructed for every subgoal vi ∈ V.
In order to completely solve the rewriting problem, however, a datalog program
needs to be produced that is equivalent to the logic program (P − , V −1 )↓. The
key to this is the observation that (P − , V −1 )↓ contains only finitely many function
symbols, and furthermore, during a bottom-up evaluation such as Naiv-Datalog and its
versions, nested function symbols are not produced. With proper bookkeeping the
appearance of function symbols can be kept track of, without actually producing the
tuples that contain them.
The transformation is done bottom-up as in procedure Naiv-Datalog. A
function symbol f (X1 , . . . , Xk ) appearing in an idb relation of V −1 is replaced by
the list of variables X1 , . . . , Xk . At the same time the name of the idb relation needs
to be marked, to remember that the list X1 , . . . , Xk belongs to the function symbol
f (X1 , . . . , Xk ). Thus, new “temporary” relation names are introduced. Consider the
rule

edge(X, f (X, Y )) ← v(X, Y ) (19.80)

of the logic program (19.78) in Example 19.13. It is replaced by the rule

edge⟨1,f(2,3)⟩ (X, X, Y ) ← v(X, Y ) (19.81)

The notation ⟨1, f (2, 3)⟩ means that the first argument of edge⟨1,f(2,3)⟩ is the same as
the first argument of edge, while the second and third arguments of edge⟨1,f(2,3)⟩ ,
together with the function symbol f , give the second argument of edge. If a function
symbol would become an argument of an idb relation of P − during the bottom-up
evaluation of (P − , V −1 ), then a new rule is added to the program with appropriately
marked relation names.

Example 19.14 Transformation of a logic program into a datalog program. The logic program
of Example 19.13 is transformed into the following datalog program by the procedure sketched
above. The different phases of the bottom-up execution of Naiv-Datalog are separated
by blank lines.

edge⟨1,f(2,3)⟩ (X, X, Y ) ← v(X, Y )
edge⟨f(1,2),3⟩ (X, Y, Y ) ← v(X, Y )

q⟨1,f(2,3)⟩ (X, Y1 , Y2 ) ← edge⟨1,f(2,3)⟩ (X, Y1 , Y2 )
q⟨f(1,2),3⟩ (X1 , X2 , Y ) ← edge⟨f(1,2),3⟩ (X1 , X2 , Y )

q(X, Y ) ← edge⟨1,f(2,3)⟩ (X, Z1 , Z2 ), q⟨f(1,2),3⟩ (Z1 , Z2 , Y )
q⟨f(1,2),f(3,4)⟩ (X1 , X2 , Y1 , Y2 ) ← edge⟨f(1,2),3⟩ (X1 , X2 , Z), q⟨1,f(2,3)⟩ (Z, Y1 , Y2 )

q⟨f(1,2),3⟩ (X1 , X2 , Y ) ← edge⟨f(1,2),3⟩ (X1 , X2 , Z), q(Z, Y )
q⟨1,f(2,3)⟩ (X, Y1 , Y2 ) ← edge⟨1,f(2,3)⟩ (X, Z1 , Z2 ), q⟨f(1,2),f(3,4)⟩ (Z1 , Z2 , Y1 , Y2 )
(19.82)

The datalog program obtained shows clearly which arguments could involve
function symbols in the original logic program. However, some rules containing function
symbols never give tuples without function symbols during the evaluation
of the output of the program.
Relation p is called significant, if in the precedence graph of Definition 19.16³
there exists a directed path from p to the output relation q of the program. If p is not
significant, then the tuples of p are not needed to compute the output of the program,
thus p can be eliminated from the program.
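Significance is a plain reachability question in the precedence graph; a sketch, with relation names abbreviated (qA, qB, qC, and the edge list below, are illustrative stand-ins for the marked relations of Example 19.14):

```python
# An edge (p, r) of the precedence graph means that p occurs in the body
# of a rule whose head is r. Relation p is significant iff the output
# relation is reachable from p.

def significant(edges, p, output="q"):
    seen, stack = set(), [p]
    while stack:
        r = stack.pop()
        if r == output:
            return True
        if r in seen:
            continue
        seen.add(r)
        stack.extend(h for (b, h) in edges if b == r)
    return False

# Abbreviated precedence edges of the program in Example 19.14:
# qA = q<1,f(2,3)>, qB = q<f(1,2),3>, qC = q<f(1,2),f(3,4)>
E = [("edgeA", "qA"), ("edgeB", "qB"), ("edgeA", "q"), ("qB", "q"),
     ("edgeB", "qC"), ("qA", "qC"), ("q", "qB"), ("qC", "qA")]
print(significant(E, "qB"))   # True
print(significant(E, "qA"))   # False: qA only reaches qC and back
```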

Example 19.15 Eliminating non-significant relations. In the precedence graph of the
datalog program obtained in Example 19.14 there exists no directed path from the relations
q⟨1,f(2,3)⟩ and q⟨f(1,2),f(3,4)⟩ to the output relation q of the program; thus they are not
significant, i.e., they can be eliminated together with the rules that involve them. The
following datalog program is obtained:

edge⟨1,f(2,3)⟩ (X, X, Y ) ← v(X, Y )
edge⟨f(1,2),3⟩ (X, Y, Y ) ← v(X, Y )
q⟨f(1,2),3⟩ (X1 , X2 , Y ) ← edge⟨f(1,2),3⟩ (X1 , X2 , Y ) (19.83)
q⟨f(1,2),3⟩ (X1 , X2 , Y ) ← edge⟨f(1,2),3⟩ (X1 , X2 , Z), q(Z, Y )
q(X, Y ) ← edge⟨1,f(2,3)⟩ (X, Z1 , Z2 ), q⟨f(1,2),3⟩ (Z1 , Z2 , Y ) .

One more simplification step can be performed, which does not decrease the
number of necessary derivations during the computation of the output, but avoids
redundant data copying. If p is a relation of the datalog program that is defined
by a single rule, which in turn contains a single relation in its body, then p can be
removed from the program and replaced by the relation of the body of the rule
defining p, having equated the variables accordingly.

³ Here the definition of the precedence graph needs to be extended to the edb relations of the datalog
program, as well.

Example 19.16 Avoiding unnecessary data copying. In Example 19.14 the relations
edge⟨1,f(2,3)⟩ and edge⟨f(1,2),3⟩ are each defined by a single rule, and these
two rules both have a single relation in their bodies. Hence, program (19.83) can be
simplified further.

q⟨f(1,2),3⟩ (X, Y, Y ) ← v(X, Y )
q⟨f(1,2),3⟩ (X, Z, Y ) ← v(X, Z), q(Z, Y ) (19.84)
q(X, Y ) ← v(X, Z), q⟨f(1,2),3⟩ (X, Z, Y ) .

The datalog program obtained by the two simplification steps above is denoted by
(P − , V −1 )datalog . It is clear that there exists a one-to-one correspondence between the
bottom-up evaluations of (P − , V −1 ) and (P − , V −1 )datalog . Since the function symbols
in (P − , V −1 )datalog are kept track of, it is guaranteed that the output instance obtained is in
fact the subset of those tuples of the output of (P − , V −1 ) that do not contain function
symbols.

Theorem 19.33 For an arbitrary datalog program P that does not contain negation,
and set of conjunctive views V, the logic program (P − , V −1 )↓ is equivalent to the
datalog program (P − , V −1 )datalog .

MiniCon. The main disadvantage of the Bucket Algorithm is that it considers
each of the subgoals in isolation, and therefore misses most of the interactions
between the subgoals of the views. Thus, the buckets may contain many
unusable views, and the second phase of the algorithm may become very expensive.
The advantage of the Inverse-rules Algorithm is its conceptual simplicity and
modularity. The inverses of the views must be computed only once; then they can
be applied to arbitrary queries given by datalog programs. On the other hand, much
of the computational advantage of exploiting the materialised views can be lost.
Using the rewriting produced by the algorithm for actually evaluating
queries from the views has a significant drawback, since it insists on recomputing the
extensions of the database relations.
The MiniCon algorithm addresses the limitations of the previous two algorithms.
The key idea underlying the algorithm is a change of perspective: instead
of building rewritings for each of the query subgoals, it considers how each of
the variables of the query can interact with the available views. The result is that
the second phase of MiniCon needs to consider drastically fewer combinations of
views. In the following we return to conjunctive queries, and for the sake of easier
understanding only views that contain no constants are considered.
The MiniCon algorithm starts out like the Bucket Algorithm, considering which
views contain subgoals that correspond to subgoals in the query. However, once the
algorithm finds a partial variable mapping from a subgoal g in the query to a subgoal
g1 in a view V , it changes perspective and looks at the variables in the query. The
algorithm considers the join predicates in the query – which are specified by multiple
occurrences of the same variable – and finds the minimal additional set of subgoals
that must be mapped to subgoals in V , given that g will be mapped to g1 . This set
of subgoals and mapping information is called a MiniCon Description (MCD).
In the second phase the MCDs are combined to produce query rewritings. The
construction of the MCDs makes the most expensive part of the Bucket Algorithm
obsolete, namely the checking of containment between the rewritings and the query,
because the rule generating the MCDs ensures that their join gives a correct
result.
For a given mapping τ : Var(Q) −→ Var(V ), subgoal g1 of view V is said to
cover a subgoal g of query Q if τ (g) = g1 . Here Var(Q) and Var(V ) denote
the sets of variables of the query and of the view, respectively. In order
to prove that a rewriting gives only tuples that belong to the output of the query,
a homomorphism must be exhibited from the query onto the rewriting. An MCD
can be considered as a part of such a homomorphism; hence these parts can later be
put together easily.
The rewriting of query Q is a union of conjunctive queries using the views.
Some of the variables may be equated in the heads of some of the views, as in the
equivalent rewriting (19.70) of Example 19.10. Thus, it is useful to introduce the
concept of head homomorphism. The mapping h : Var(V ) −→ Var(V ) is a head
homomorphism if it is the identity on variables that do not occur in the head of V ,
but it may equate variables of the head. For every variable x of the head of V , h(x)
also appears in the head of V ; furthermore, h(x) = h(h(x)). Now the exact definition
of an MCD can be given.

Definition 19.34 The quadruple C = (hC , V (Ỹ )C , ϕC , GC ) is a MiniCon De-
scription (MCD) for query Q over view V , where
• hC is a head homomorphism over V ,
• V (Ỹ )C is obtained from V by applying hC , that is, Ỹ = hC (Ã), where à is the
set of variables appearing in the head of V ,
• ϕC is a partial mapping from Var(Q) to hC (Var(V )),
• GC is a set of subgoals of Q that are covered by some subgoal of hC (V ) using
the mapping ϕC (note: not all such subgoals are necessarily included in GC ).

The procedure constructing MCDs is based on the following claim.

Claim 19.35 Let C be a MiniCon Description over view V for query Q. C can be
used for a non-redundant rewriting of Q if the following conditions hold:
C1. for every variable x that is in the head of Q and also in the domain of ϕC ,
ϕC (x) appears in the head of hC (V ); furthermore,
C2. if ϕC (y) does not appear in the head of hC (V ), then for every subgoal g
of Q that contains y,

1. every variable of g appears in the domain of ϕC , and

2. ϕC (g) ∈ hC (V ).
19.3. Query rewriting 939

Clause C1 is the same as in the Bucket Algorithm. Clause C2 means that if a
variable x takes part in a join predicate that is not enforced by the view, then x must
appear in the head of the view so that the join predicate can be applied by another
subgoal in the rewriting. The procedure Form-MCDs gives the usable MiniCon
Descriptions for a conjunctive query Q and a set of conjunctive views V.

Form-MCDs(Q, V)
1 C ← ∅
2 for each subgoal g of Q
3 do for V ∈ V
4 do for every subgoal v ∈ V
5 do Let h be the least restrictive head homomorphism on V ,
such that there exists a mapping ϕ with ϕ(g) = h(v).
6 if ϕ and h exist
7 then Add to C any new MCD C that can be constructed, where:
8 (a) ϕC (respectively, hC ) is an extension of ϕ (respectively, h),
9 (b) GC is the minimal subset of subgoals of Q such that
GC , ϕC and hC satisfy Claim 19.35, and
10 (c) it is not possible to extend ϕ and h to ϕ′C and h′C
such that (b) is satisfied, and G′C , as defined in (b),
is a subset of GC .
11 return C
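The loop structure of Form-MCDs can also be rendered as a simplified executable sketch. The Python version below is an illustration under stated simplifying assumptions, not the book's exact procedure: head homomorphisms are taken to be the identity, and an MCD is reduced to a triple of view name, mapping and covered subgoal indices. Clauses C1 and C2 of Claim 19.35 are applied directly.

```python
def unify(phi, g, vg):
    """Extend the partial mapping phi so that phi(g) = vg, or return None."""
    (p, args), (p1, args1) = g, vg
    if p != p1 or len(args) != len(args1):
        return None
    phi = dict(phi)
    for x, v in zip(args, args1):
        if phi.setdefault(x, v) != v:
            return None
    return phi

def form_mcds(query, views):
    """Queries and views are (head_vars, [(pred, args), ...]) pairs.
    Returns triples (view name, phi, covered subgoal indices)."""
    q_head, q_body = query
    mcds = []
    for i in range(len(q_body)):
        for vname, (v_head, v_body) in views.items():
            for vg in v_body:
                phi = unify({}, q_body[i], vg)
                if phi is None:
                    continue
                G, ok, changed = {i}, True, True
                while ok and changed:          # close G under clause C2
                    changed = False
                    for y, img in list(phi.items()):
                        if img in v_head:
                            continue
                        # img is not in the view head: every query subgoal
                        # containing y must also be covered by this view
                        for j, g in enumerate(q_body):
                            if y not in g[1] or j in G:
                                continue
                            ext = None
                            for vg2 in v_body:
                                ext = unify(phi, g, vg2)
                                if ext is not None:
                                    break
                            if ext is None:    # C2 cannot be satisfied
                                ok = False
                                break
                            phi, changed = ext, True
                            G.add(j)
                        if not ok:
                            break
                # clause C1: head variables of Q must map into the view head
                if ok and all(phi[x] in v_head for x in q_head if x in phi):
                    mcds.append((vname, phi, frozenset(G)))
    return mcds

# A small analogue of the V1/V2 discussion in the text: V1 is rejected
# because sameArea(x, y) cannot be covered, while V2 yields one MCD.
query = (("x", "y"),
         [("cite", ("x", "y")), ("sameArea", ("x", "y"))])
views = {"V1": (("a",), [("cite", ("a", "b"))]),
         "V2": (("c", "d"), [("sameArea", ("c", "d"))])}
print(form_mcds(query, views))
```

The while loop is the essential part: it mirrors step (b) of Form-MCDs by repeatedly adding the subgoals that clause C2 forces into GC, and it rejects the candidate (as in the V1 case) when one of those forced subgoals has no matching view subgoal.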

Consider again query (19.68) and the views of Example 19.10. Procedure Form-
MCDs considers subgoal cite(x, y) of the query first. It does not create an MCD
for view V1 , because clause C2 of Claim 19.35 would be violated. Indeed, the
condition would require that subgoal sameArea(x, y) also be covered by V1 using
the mapping ϕ(x) = a, ϕ(y) = b, since b is not in the head of V1 .4 For the same
reason, no MCD is created for V1 even when the other subgoals of the query
are considered. In a sense, the MiniCon Algorithm shifts some of the work done by
the combination step of the Bucket Algorithm to the phase of creating the MCDs
by using Form-MCDs. The following table shows the output of procedure Form-
MCDs.
V (Ỹ )       h               ϕ               G
V2 (c, d)    c → c, d → d    x → c, y → d    3          (19.85)
V3 (f, f )   f → f, h → f    x → f, y → f    1, 2, 3
Procedure Form-MCDs includes in GC only the minimal set of subgoals that
is necessary in order to satisfy Claim 19.35. This makes it possible for the
second phase of the MiniCon Algorithm to consider only combinations of
MCDs that cover pairwise disjoint subsets of the subgoals of the query.
Claim 19.36 Given a query Q, a set of views V, and the set of MCDs C for Q
over the views V, the only combinations of MCDs that can result in non-redundant
rewritings of Q are of the form C1 , . . . , Cl , where

C3. GC1 ∪ · · · ∪ GCl contains all the subgoals of Q, and
C4. GCi ∩ GCj = ∅ for every i ≠ j.

4 The case of ϕ(x) = b, ϕ(y) = a is similar.
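Clauses C3 and C4 say that the usable MCD combinations are exactly the partitions of the query's subgoals, so the combination phase is an exact-cover search. The following backtracking sketch illustrates this; the (name, covered-set) representation of an MCD is an assumption of this illustration, not the book's notation.

```python
def exact_covers(mcds, n_subgoals):
    """Yield every combination of MCDs whose covered sets partition the
    query subgoals {0, ..., n_subgoals - 1} (clauses C3 and C4)."""
    def search(uncovered, chosen):
        if not uncovered:
            yield list(chosen)
            return
        pivot = min(uncovered)             # next subgoal still to cover
        for m in mcds:
            name, covered = m
            # C4 (pairwise disjointness) plus progress on the pivot subgoal
            if pivot in covered and covered <= uncovered:
                chosen.append(m)
                yield from search(uncovered - covered, chosen)
                chosen.pop()
    yield from search(frozenset(range(n_subgoals)), [])

mcds = [("V2", frozenset({2})),
        ("V3", frozenset({0, 1, 2})),
        ("V4", frozenset({0, 1}))]
for combo in exact_covers(mcds, 3):
    print([name for name, _ in combo])   # ['V3'], then ['V4', 'V2']
```

Branching only on MCDs that cover the smallest still-uncovered subgoal guarantees that each partition is enumerated exactly once, which is the source of the "drastically reduced search space" mentioned below.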

The fact that only sets of MCDs that provide partitions of the subgoals of the
query need to be considered reduces the search space of the algorithm drastically.
In order to formulate procedure Combine-MCDs, another notation needs to be
introduced. The mapping ϕC of MCD C may map a set of variables of Q onto the
same variable of hC (V ). One element of this set is chosen arbitrarily as a representative,
with the only restriction that if the set contains variables from the head
of Q, then one of those is chosen. Let ECϕC (x) denote the representative
of the set containing x. In the following, the MiniCon Description C is considered extended
with ECϕC , as a quintuple (hC , V (Ỹ ), ϕC , GC , ECϕC ). If the MCDs
C1 , . . . , Ck are to be combined, and for some i ≠ j both ECϕCi (x) = ECϕCi (y) and
ECϕCj (y) = ECϕCj (z) hold, then in the conjunctive rewriting obtained by the join,
x, y and z will be mapped to the same variable. Let SC denote the equivalence
relation determined on the variables of Q by declaring two variables equivalent if they
are mapped onto the same variable by ϕC , that is, xSC y ⇐⇒ ECϕC (x) = ECϕC (y).
Let C be the set of MCDs obtained as the output of Form-MCDs.

Combine-MCDs(C)
1 Answer ← ∅
2 for {C1 , . . . , Cn } ⊆ C such that GC1 , . . . , GCn is a partition of the subgoals of Q
3 do Define a mapping Ψi on Ỹi as follows:
4 if there exists a variable x in Q such that ϕi (x) = y
5 then Ψi (y) = x
6 else Ψi (y) is a fresh copy of y
7 Let S be the transitive closure of SC1 ∪ · · · ∪ SCn
8 ▹ S is an equivalence relation on the variables of Q.
9 Choose a representative for each equivalence class of S.
10 Define mapping EC as follows:
11 if x ∈ V ar(Q)
12 then EC(x) is the representative of the equivalence class of x under S
13 else EC(x) = x
14 Let Q′ be given as Q′ (EC(X̃)) ← VC1 (EC(Ψ1 (Ỹ1 ))), . . . , VCn (EC(Ψn (Ỹn )))
15 Answer ← Answer ∪ {Q0 }
16 return Answer
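Lines 7–13 of Combine-MCDs merge the equivalence relations SC1 , . . . , SCn and pick one representative per class, preferring head variables of Q. One compact way to sketch this step is union-find; this rendering is an illustration's choice, not the book's pseudocode.

```python
def merge_classes(variable_pairs, head_vars):
    """Union-find over query variables: merge the given equivalent pairs
    and return a map from each variable to its class representative,
    preferring head variables of Q as representatives."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(x, y):
        rx, ry = find(x), find(y)
        if rx != ry:
            if ry in head_vars:             # keep a head variable as root
                rx, ry = ry, rx
            parent[ry] = rx

    for x, y in variable_pairs:
        union(x, y)
    return {x: find(x) for x in parent}

# x ~ y and y ~ z collapse into one class represented by the head variable x:
ec = merge_classes([("x", "y"), ("y", "z")], head_vars={"x"})
print(ec["z"])   # every class member maps to the head variable "x"
```

The transitive closure required in line 7 comes for free here: union-find maintains the merged relation incrementally, so no separate closure computation is needed.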

The following theorem summarises the properties of the MiniCon Algorithm.


Theorem 19.37 Given a conjunctive query Q and conjunctive views V, both with-
out comparison predicates and constants, the MiniCon Algorithm produces the union
of conjunctive queries that is a maximally contained rewriting of Q using V.
The complete proof of Theorem 19.37 is beyond the scope of the present chapter.
However, in Problem 19-1 the reader is asked to prove that the union of the conjunctive
Notes for Chapter 19 941

queries obtained as the output of Combine-MCDs is contained in Q.


It must be noted that the worst-case running times of the Bucket Algorithm, the
Inverse-rules Algorithm and the MiniCon Algorithm are the same:
O(nmM n ), where n is the number of subgoals in the query, m is the maximal
number of subgoals in a view, and M is the number of views. However, practical
test runs show that in the case of a large number of views (3–400 views) the MiniCon
Algorithm is significantly faster than the other two.

Exercises
19.3-1 Prove Theorem 19.25 using Proposition 19.24 and Theorem 19.20.
19.3-2 Prove the two statements of Lemma 19.26. Hint. For the first statement,
substitute the definitions of the views vi (Ỹi ) into Q′ . Minimise the obtained
query Q′′ using Theorem 19.19. For the second statement, use Proposition 19.24 to
prove that there exists a homomorphism hi from the body of the conjunctive query
defining view vi (Ỹi ) to the body of Q. Show that Ỹi′ = hi (Ỹi ) is a good choice.
19.3-3 Prove Theorem 19.31 using that datalog programs have unique minimal
fixpoint.

Problems

19-1 MiniCon is correct
Prove that the output of the MiniCon Algorithm is correct. Hint. It is enough to
show that Q′ ⊑ Q holds for each conjunctive query Q′ given in line 14 of
Combine-MCDs. For the latter, construct a homomorphism from Q to Q′ .

19-2 (P − , V −1 )↓ is correct
Prove that each tuple produced by the logic program (P − , V −1 )↓ is contained in the
output of P (this is part of the proof of Theorem 19.32). Hint. Let t be a tuple in the
output of (P − , V −1 ) that does not contain function symbols. Consider the derivation
tree of t. Its leaves are view literals, since the views are the extensional relations of
program (P − , V −1 ). If these leaves are removed from the tree, then the leaves of the
remaining tree are edb relations of P. Prove that the tree obtained is a derivation
tree of t in the datalog program P.

19-3 Datalog views


This problem tries to justify why only conjunctive views were considered. Let V be
a set of views and Q be a query. For a given instance I of the views, the tuple t is a
certain answer of query Q if t ∈ Q(D) holds for every database instance D
such that I ⊆ V(D).

a. Prove that if the views of V are given by datalog programs, and query Q is conjunc-
tive and may contain non-equality (≠) predicates, then the question whether
for a given instance I of the views the tuple t is a certain answer of Q is algorith-
mically undecidable. Hint. Reduce to this question the Post Correspondence

Problem, which is the following: given two sets of words {w1 , w2 , . . . , wn } and
{w1′ , w2′ , . . . , wn′ } over the alphabet {a, b}, the question is whether there exists
a sequence of indices i1 , i2 , . . . , ik (repetition allowed) such that

wi1 wi2 · · · wik = wi′1 wi′2 · · · wi′k . (19.86)

The Post Correspondence Problem is a well-known algorithmically undecidable
problem. Let the view V be given by the following datalog program:

V (0, 0) ← S(e, e, e)
V (X, Y ) ← V (X0 , Y0 ), S(X0 , X1 , α1 ), . . . , S(Xg−1 , X, αg ),
S(Y0 , Y1 , β1 ), . . . , S(Yh−1 , Y, βh )
(19.87)
where wi = α1 . . . αg and wi′ = β1 . . . βh ;
there is one such rule for every i ∈ {1, 2, . . . , n}
S(X, Y, Z) ← P (X, X, Y ), P (X, Y, Z) .

Furthermore, let Q be the following conjunctive query.

Q(c) ← P (X, Y, Z), P (X, Y, Z ′ ), Z ≠ Z ′ . (19.88)

Show that for the instance I of V that is given by I(V ) = {⟨e, e⟩} and I(S) = ∅,
the tuple ⟨c⟩ is a certain answer of query Q if and only if the Post Correspondence
Problem with sets {w1 , w2 , . . . , wn } and {w1′ , w2′ , . . . , wn′ } has no solution.
b. In contrast to the undecidability result of a., if V is a set of conjunctive views
and the query Q is given by a datalog program P, then it is easy to decide, for an
arbitrary tuple t, whether it is a certain answer of Q for a given view instance I.
Prove that the datalog program (P − , V −1 )datalog gives as output exactly the tuples
of the certain answer of Q.

Chapter Notes
There are several dimensions along which the treatments of the problem “answering
queries using views” can be classified. Figure 19.6 shows a taxonomy of this work.
The most significant distinction between the different works is whether their
goal is data integration or query optimisation and the maintenance of
physical data independence. The key difference between these two classes of works
is the output of the algorithm for answering queries using views. In the former
case, given a query Q and a set of views V, the goal of the algorithm is to produce
an expression Q′ that references the views and is either equivalent to or contained
in Q. In the latter case, the algorithm must go further and produce a (hopefully
optimal) query execution plan for answering Q using the views (and possibly the
database relations). Here the rewriting must be equivalent to Q in order to ensure
the correctness of the plan.
The similarity between these two bodies of work is that they are both concerned with
the core issue of whether a rewriting of a query is equivalent to or contained in the query.
Figure 19.6 A taxonomy of work on answering queries using views. The work divides
into cost-based rewriting (query optimisation and physical data independence),
comprising System-R style optimisers and the transformational approach, and
logical rewriting (data integration), comprising rewriting algorithms and query
answering algorithms (for complete or incomplete sources).

However, while logical correctness suffices in the data integration context, it does
not in the query optimisation context, where we also need to find the cheapest plan
using the views. The complication arises because the optimisation algorithms need to
consider views that do not contribute to the logical correctness of the rewriting but
do reduce the cost of the resulting plan. Hence, while the reasoning underlying the
algorithms in the data integration context is mostly logical, in the query optimisation
case it is both logical and cost-based. On the other hand, an aspect stressed in the data
integration context is the importance of dealing with a large number of views, which
correspond to data sources. In the context of query optimisation it is generally
assumed (though not always!) that the number of views is roughly comparable to the
size of the schema.
The works on query optimisation can be classified further into System-R style
optimisers and transformational optimisers. Examples of the former are the works of
Chaudhuri, Krishnamurty, Potomianos and Shim [?], and of Tsatalos, Solomon and
Ioannidis [25]. The papers of Florescu, Raschid and Valduriez [?]; Bello et al. [?];
Deutsch, Popa and Tannen [?]; Zaharioudakis et al. [?]; and Goldstein and Larson [?]
belong to the latter.
Rewriting algorithms in the data integration context are studied in works of
Yang and Larson [?]; Levy, Mendelzon, Sagiv and Srivastava [?]; Qian [?]; further-
more Lambrecht, Kambhampati and Gnanaprakasam [?]. The Bucket Algorithm
was introduced by Levy, Rajaraman and Ordille [?]. The Inverse-rules Algorithm
was invented by Duschka and Genesereth [?, ?]. The MiniCon Algorithm was developed
by Pottinger and Halevy [?, 22].
Query answering algorithms and the complexity of the problem are studied in
the papers of Abiteboul and Duschka [?]; Grahne and Mendelzon [?]; and Calvanese,
De Giacomo, Lenzerini and Vardi [?].
The STORED system was developed by Deutsch, Fernandez and Suciu [?]. Se-
mantic caching is discussed in the paper of Yang, Karlapalem and Li [?]. Extensions
of the rewriting problem are studied in [?, ?, ?, ?, ?].
Surveys of the area can be found in the works of Abiteboul [?], Florescu, Levy and
Mendelzon [14], Halevy [?, 17], and Ullman [?].
Research of the authors was (partially) supported by Hungarian National Re-
search Fund (OTKA) grants Nos. T034702, T037846T and T042706.
Bibliography

[1] S. Abiteboul, V. Vianu. Foundations of Databases. Addison-Wesley, 1995. 894


[2] A. Aho, C. Beeri, J. D. Ullman. The theory of joins in relational databases. ACM Transactions
on Database Systems, 4(3):297–314, 1979. 894
[3] C. Beeri. On the membership problem for functional and multivalued dependencies in rela-
tional databases. ACM Transactions on Database Systems, 5:241–259, 1980. 894
[4] C. Beeri, P. Bernstein. Computational problems related to the design of normal form relational
schemas. ACM Transactions on Database Systems, 4(1):30–59, 1979. 894
[5] C. Beeri, M. Dowd. On the structure of armstrong relations for functional dependencies.
Journal of ACM, 31(1):30–46, 1984. 894
[6] A. Békéssy, J. Demetrovics. Contribution to the theory of data base relations. Discrete Math-
ematics, 27(1):1–10, 1979. 894
[7] E. F. Codd. A relational model of large shared data banks. Communications of the ACM,
13(6):377–387, 1970. 894
[8] C. Delobel. Normalization and hierarchical dependencies in the relational data model. ACM
Transactions on Database Systems, 3(3):201–222, 1978. 894
[9] J. Demetrovics, Gy. O. H. Katona, A. Sali. Minimal representations of branching dependencies.
Discrete Applied Mathematics, 40:139–153, 1992. 894
[10] J. Demetrovics, Gy. O. H. Katona, A. Sali. Minimal representations of branching dependencies.
Acta Scientiarum Mathematicorum (Szeged), 60:213–223, 1995. 894
[11] J. Demetrovics, Gy. O. H. Katona, A. Sali. Design type problems motivated by database theory.
Journal of Statistical Planning and Inference, 72:149–164, 1998. 894
[12] R. Fagin. Multivalued dependencies and a new normal form for relational databases. ACM
Transactions on Database Systems, 2:262–278, 1977. 894
[13] R. Fagin. Horn clauses and database dependencies. Journal of ACM, 29(4):952–985, 1982.
894
[14] D. Florescu, A. Halevy, A. O. Mendelzon. Database techniques for the world-wide web: a
survey. SIGMOD Record, 27(3):59–74, 1998. 943
[15] J. Grant, J. Minker. Inferences for numerical dependencies. Theoretical Computer Science,
41:271–287, 1985. 894
[16] J. Grant, J. Minker. Normalization and axiomatization for numerical dependencies. Informa-
tion and Control, 65:1–17, 1985. 894
[17] A. Halevy. Answering queries using views: A survey. The VLDB Journal, 10:270–294, 2001.
943
[18] C. Lucchesi. Candidate keys for relations. Journal of Computer and System Sciences,
17(2):270–279, 1978. 894
[19] D. Maier. Minimum covers in the relational database model. Journal of the ACM, 27(4):664–
674, 1980. 894
[20] D. Maier, A. O. Mendelzon, Y. Sagiv. Testing implications of data dependencies. ACM Trans-
actions on Database Systems, 4(4):455–469, 1979. 894

[21] S. Petrov. Finite axiomatization of languages for representation of system properties. Infor-
mation Sciences, 47:339–372, 1989. 894
[22] R. Pottinger, A. Halevy. MiniCon: a scalable algorithm for answering queries using views.
The VLDB Journal, 10(2):182–198, 2001. 943
[23] A. Sali, Sr., A. Sali. Generalized dependencies in relational databases. Acta Cybernetica,
13:431–438, 1998. 894
[24] B. Thalheim. Dependencies in Relational Databases. B. G. Teubner, 1991. 894
[25] O. G. Tsatalos, M. C. Solomon, Y. Ioannidis. The GMAP: a versatile tool for physical data
independence. The VLDB Journal, 5(2):101–118, 1996. 943
[26] D. M. Tsou, P. C. Fischer. Decomposition of a relation scheme into Boyce–Codd normal form.
SIGACT News, 14(3):23–29, 1982. 894
[27] J. D. Ullman. Principles of Database and Knowledge Base Systems. Vol. 1. Computer Science
Press, 1989 (2nd edition). 894
[28] C. Zaniolo. A new normal form for the design of relational database schemata. ACM Trans-
actions on Database Systems, 7:489–499, 1982. 894

This bibliography is made by HBibTEX. First key of the sorting is the name of the
authors (first author, second author etc.), second key is the year of publication, third
key is the title of the document.
Underlining shows that the electronic version of the bibliography on the homepage
of the book contains a link to the corresponding address.
Index

This index uses the following conventions. Numbers are alphabetised as if spelled out; for
example, “2-3-4-tree" is indexed as if were “two-three-four-tree". When an entry refers to a place
other than the main text, the page number is followed by a tag: exe for exercise, exa for example,
fig for figure, pr for problem and fn for footnote.
The numbers of pages containing a definition are printed in italic font, e.g.
time complexity, 583 .

A data integration system, 920


anomaly, 871, 879 datalog
deletion, 871 non-recursive, 903
insertion, 871 with negation, 904
redundancy, 871 program, 907, 933
update, 871 precedence graph, 909
Armstrong-axioms, 863, 870exe, 878 recursive, 910
Armstrong-relation, 892 rule, 906, 933
atom decomposition
relational, 898 dependency preserving, 876
attribute, 862 dependency preserving into 3NF, 883
external, 893pr lossless join, 872
prime, 880, 893pr into BCNF, 880
axiomatisation, 890 dependency
branching, 891
equality generating, 885
B functional, 862
bucket, 930 equivalent families, 867
Bucket Algorithm, 930, 937 minimal cover of a family of, 868
join, 890
multivalued, 884
C numerical, 892
certain answer, 941pr tuple generating, 885
Closure, 865 dependency basis, 886
Closure, 865, 883, 892exe Dependency-Basis, 887
of a set of attributes, 870exe Dependency-basis, 894pr
of a set of functional dependencies, 863, depth first search, 910
864, 870exe domain, 862
of set of attributes, 864, 865 domain calculus, 874
Combine-MCDs, 940 domain restricted, 902
Create-Bucket, 930
E
D Equate, 873
database architecture Exact-cover, 912
layer
logical, 914
outer, 914 F
fact, 907
physical, 914 fixpoint, 907
data independence Form-MCDs, 939
logical, 916

free tuple, 897, 904 Naiv-Datalog, 908, 935


Full-Tuplewise-Join, 910 natural join, 871, 881, 899
normal form, 879
BCNF, 893pr
G BCNF, 879
Generalised Multi-level Access Path, 919 Boyce-Codd, 879
GMAP, 919 Boyce–Codd, 879
5NF, 890
4NF, 879, 887
H 3NF, 879, 880, 883, 893pr
head homomorphism, 938 nr-datalog¬ program, 905
homomorphism theorem, 910
Horn rule, 933
P
Polynomial-BCNF, 882
I Post Correspondence Problem, 942pr
image under a query, 898 precedence graph, 909, 936
immediate consequence, 907 Preserve, 877
operator, 907 projection, 899
Improved-Semi-Naiv-Datalog, 910, 914exe
inference rules, 863
complete, 863 Q
sound, 863 QBE, 898
infinite domain, 898 query, 896
instance, 862, 895 conjunctive, 910
integrity constraint, 862, 876, 914 domain restricted, 902
Inverse-rules Algorithm, 932, 937 program, 900
rule based, 897
subgoal, 930
J empty, 913exe
join equivalent, 897
natural, 899 homomorphism, 911, 923
Join-Test, 873 language, 895
Join-test, 874fig, 883 equivalent, 897
mapping, 896
monotone, 898, 913exe
K relational algebra, 913exe
key, 863, 869, 880 rewriting, 922
primary, 890 complete, 922
conjunctive, 931
equivalent, 922, 925
L globally minimal, 922
Linear-Closure, 867 locally minimal, 922
Linear-closure, 877, 882 maximally contained, 925
Linear-closure, 878, 882 minimal, 922
List-Keys, 870 rule based, 913exe
literal, 904 satisfiable, 898, 913exe
negative, 904 subgoal, 937
positive, 904 tableau, 898, 913exe
minimal, 911
logical implication, 863, 885 summary, 898
logic program, 933 variables of, 937
lossless join, 872 query language
relationally complete, 904
M query rewriting, 895–943
MCD, 938
mediator system, 920
Microsoft R
Access, 898 record, 862
MiniCon Description, 938 recursion, 906
MiniCon Description (MCD), 938 relation, 895
Minimal-Cover, 868 extensional, 898, 907, 915
Minimal-cover, 883 instance, 895, 896fig
Minimal-cover, 893pr intensional, 898, 900, 907, 915
mutually recursive, 910
virtual, 920
N relational
Naiv-BCNF, 881 schema, 862

decomposition of, 871 source description, 920


table, 862 SQL, 915
relational algebra∗ , 899 strongly connected component, 910
relational data bases, subgoal, 930
relational schema, 895 substitution, 910
renaming, 899 superkey, 863, 869, 879
rule, 897 System-R style optimiser, 927
body, 897
domain restricted, 904
head, 897 T
realisation, 907 transitive closure, 906
tuple
free, 897
S
Satisfiable, 903
schema V
extensional, 907 view, 895, 914
intensional, 907 inverse, 933
mediated, 920 materialised, 916
selection, 899
condition, 899
Semi-Naiv-Datalog, 935 X
Semi-naiv-datalog, 909, 914exe XML, 919
Name Index

This index uses the following conventions. If we know the full name of a cited person, then we
print it. If the cited person is not living, and we know the correct data, then we print also the year
of her/his birth and death.

A Gnanaprakasam, Senthil, 943


Abiteboul, Serge, 874, 891, 894, 943, 944 Goldstein, Jonathan, 943
Aho, Alfred V., 888, 894, 944 Grahne, Gösta, 943
Armstrong, William Ward, 878, 892, 894 Grant, John, 892, 894, 944

B H
Beeri, Catriel, 885, 886, 888, 894, 944 Halevy, Alon Y., 943, 944
Békéssy, András, 869, 894, 944 Howard, J. H., 885, 894
Bello, Randall G., 943 Hull, Richard, 874, 891, 894
Bernstein, P. A., 894, 944
Boyce, Raymond F., 879, 894
Buneman, Peter, 943 I
Ioannidis, Yannis E., 919, 943, 945
C
Calvanese, Diego, 943 J
Chaudhuri, Surajit, 943 Johnson, D. T., 869, 894
Cochrane, Roberta, 943
Codd, Edgar F. (1923–2003), 862, 879, 894,
904, 944 K
Kambhampati, Subbarao, 943
Karlapalem, Kamalakar, 943
D Katona, Gyula O. H., 892, 894, 944
De Giacomo, Giuseppe, 943 Krishnamurty, Ravi, 943
Delobel, C., 894, 944 Kwok, Cody T., 943
Demetrovics, János, 862, 869, 894, 895, 944
Deutsch, Alin, 943
Dias, Karl, 943 L
Dowd, M., 894, 944 Lambrecht, Eric, 943
Downing, Alan, 943 Lapis, George, 943
Duschka, Oliver M., 943 Larson, Per–Åke, 943
Lenzerini, Maurizio, 943
Levy, Alon Y., 943
F Li, Qing, 943
Fagin, R., 885, 894, 944 Lucchesi, C. L., 894, 944
Feenan, James J., 943
Fernandez, Mary, 943
Finnerty, James L., 943 M
Fischer, P. C., 894, 945 Maier, David, 893, 894, 944
Florescu, Daniela D., 943, 944 Mendelzon, Alberto O., 894, 943, 944
Friedman, Marc, 943 Minker, Jack, 892, 894, 944

G N
Genesereth, Michael R., 943 Norcott, William D., 943

O Tannen, Val, 943


Ordille, Joann J., 943 Thalheim, Bernhardt, 894, 945
Osborne, Sylvia L., 869 Tompa, Frank Wm., 869
Tsatalos, Odysseas G., 919, 943, 945
Tsou, D. M., 894, 945
P
Petrov, S. V., 894, 944
Pirahesh, Hamid, 943 U
Popa, Lucian, 943 Ullman, Jeffrey David, 874, 888, 894,
Potomianos, Spyros, 943 943–945
Pottinger, Rachel, 943, 945 Urata, Monica, 943

Q V
Qian, Xiaolei, 943 Valduriez, Patrick, 943
Vardi, Moshe Y., 943
Vianu, Victor, 874, 891, 894, 944
R
Rajaraman, Anand, 943
Raschid, Louiqa, 943 W
Weld, Daniel S., 943
Witkowski, Andrew, 943
S
Sagiv, Yehoshua, 894, 943, 944
Sali, Attila, 862, 894, 895, 944, 945 Y
Sali, Attila, Sr., 945 Yang, H. Z., 943
Shim, Kyuseok, 943 Yang, Jian, 943
Solomon, Marvin H., 919, 943, 945 Yu, C. T., 869, 894
Srivastava, Divesh, 943
Statman, R., 894
Suciu, Dan, 943 Z
Sun, Harry, 943 Zaharioudakis, Markos, 943
Zaniolo, C., 894, 945
Ziauddin, Mohamed, 943
T
