Relational Database Design: Records, Attributes, Relational Schema
The relational data model was introduced by Codd in 1970. Because of its simplicity
and flexibility it is the most widely used data model, nowadays extended with the
possibilities of the World Wide Web. The main idea of the relational model is that data
is organised in relational tables, where rows correspond to individual records and
columns to attributes. A relational schema consists of one or more relations and
their attribute sets. In the present chapter only schemata consisting of one relation
are considered for the sake of simplicity. In contrast to the mathematical concept
of relations, in a relational schema the order of the attributes is not important: sets
of attributes are always considered instead of lists. Every attribute has an associated
domain, that is a set of elementary values from which the attribute takes its values.
As an example, consider the following schema.
Employee(Name,Mother’s name,Social Security Number,Post,Salary)
The domain of attributes Name and Mother’s name is the set of finite charac-
ter strings (more precisely its subset containing all possible names). The domain
of Social Security Number is the set of integers satisfying certain formal and
parity check requirements. The attribute Post can take values from the set {Direc-
tor,Section chief,System integrator,Programmer,Receptionist,Janitor,Handyman}.
An instance of a schema R is a relation r if its columns correspond to the at-
tributes of R and its rows contain values from the domains of attributes at the
attributes’ positions. A typical row of a relation of the Employee schema could be
(John Brown,Camille Parker,184-83-2010,Programmer,$172,000)
There can be dependencies between different data of a relation. For example, in an
instance of the Employee schema the value of Social Security Number determines
all other values of a row. Similarly, the pair (Name,Mother’s name) is a unique
identifier. Naturally, it may occur that a set of attributes does not determine all
attributes of a record uniquely, only a subset of them.
A relational schema has several integrity constraints attached. The most im-
portant kind of these is functional dependency. Let U and V be two sets of
attributes. V functionally depends on U , U → V in notation, means that when-
ever two records are identical in the attributes belonging to U , then they must agree
in the attributes belonging to V , as well. Throughout this chapter the attribute set
{A1 , A2 , . . . , Ak } is denoted by A1 A2 . . . Ak for the sake of convenience.
Example 18.1 Consider the following schema:

R(Professor, Subject, Room, Student, Grade, Time) .
The meaning of an individual record is that a given student got a given grade of a given
subject that was taught by a given professor at a given time slot. The following functional
dependencies are satisfied, where the attributes are abbreviated by P, Su, R, St, G and T,
respectively.
Su→P: One subject is taught by one professor.
PT→R: A professor teaches in one room at a time.
StT→R: A student attends a lecture in one room at a time.
StT→Su: A student attends a lecture of one subject at a time.
SuSt→G: A student receives a unique final grade of a subject.
In Example 18.1 the attribute set StT uniquely determines the values of all other
attributes, furthermore it is a minimal such set with respect to containment. Attribute
sets of this kind are called keys. If all attributes are functionally dependent on a set of
attributes X, then X is called a superkey. It is clear that every superkey contains
a key and that any set of attributes containing a superkey is also a superkey.
18.1.1. Armstrong-axioms
In order to determine keys, or to understand logical implication between functional
dependencies, it is necessary to know the closure F + of a set F of functional depen-
dencies, or for a given X → Z dependency the question whether it belongs to F +
must be decidable. For this, inference rules are needed that tell which other functional
dependencies follow from a given set. The Armstrong-axioms form a sys-
tem of sound and complete inference rules. A system of rules is sound if only valid
functional dependencies can be derived using it. It is complete, if every dependency
X → Z that is logically implied by the set F is derivable from F using the inference
rules.
Armstrong-axioms
(A1) Reflexivity Y ⊆ X ⊆ R implies X → Y .
(A2) Augmentation If X → Y , then for arbitrary Z ⊆ R, XZ → Y Z holds.
(A3) Transitivity If X → Y and Y → Z hold, then X → Z holds, as well.
There are other valid inference rules besides (A1)–(A3). The next lemma lists some,
the proof is left to the Reader (Exercise 18.1-5).
Lemma 18.2
1. Union rule {X → Y, X → Z} |= X → Y Z.
2. Pseudo transitivity {X → Y, W Y → Z} |= XW → Y Z.
3. Decomposition If X → Y holds and Z ⊆ Y , then X → Z holds, as well.
The soundness of system (A1)–(A3) can be proven by easy induction on the length
of the derivation. The completeness will follow from the proof of correctness of
algorithm Closure(R, F, X) by the following lemma. Let X + denote the closure
of the set of attributes X ⊆ R with respect to the family of functional dependencies
F , that is X + = {A ∈ R : X → A follows from F by the Armstrong-axioms}.
Lemma 18.3 The functional dependency X → Y follows from the family of func-
tional dependencies F by the Armstrong-axioms iff Y ⊆ X + .
Proof Let Y = A1 A2 . . . An where Ai ’s are attributes, and assume that Y ⊆ X + .
X → Ai follows by the Armstrong-axioms for all i by the definition of X + . Applying
the union rule of Lemma 18.2 X → Y follows. On the other hand, assume that
X → Y can be derived by the Armstrong-axioms. By the decomposition rule of
Lemma 18.2 X → Ai follows by (A1)–(A3) for all i. Thus, Y ⊆ X + .
18.1.2. Closures
Calculation of closures is important in testing equivalence or logical implication
between systems of functional dependencies. The first idea could be that for a given
dependency the whole closure F + is computed and membership is checked in it. However,
F + can be exponentially large in the size of F . For example, consider

F = {A → B1 , A → B2 , . . . , A → Bn } .

Then F + contains A → Y for every subset Y of {B1 , B2 , . . . , Bn }, that is 2^n dependen-
cies. By Lemma 18.3 it is enough to compute the closure X + of the attribute set in
question, which is done by the following algorithm.
Closure(R,F ,X)
1 X (0) ← X
2 i←0
3 G←F Functional dependencies not used yet.
4 repeat
5 X (i+1) ← X (i)
6 for all Y → Z in G
7 do if Y ⊆ X (i)
8 then X (i+1) ← X (i+1) ∪ Z
9 G ← G \ {Y → Z}
10 i←i+1
11 until X (i−1) = X (i)
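As an illustration, here is a minimal, runnable Python sketch of the quadratic closure computation. The function name closure and the representation of F as a list of (left hand side, right hand side) pairs of attribute sets are assumptions of this sketch only; later sketches in this chapter reuse it.

def closure(F, X):
    """Naive computation of the attribute closure X+ with respect to F.

    F: list of (lhs, rhs) pairs, both sides sets of attributes; X: set of
    attributes.  Mirrors the repeat-until loop of Closure(R,F,X).
    """
    result = set(X)
    unused = list(F)                  # dependencies not used yet (G in the pseudocode)
    changed = True
    while changed:                    # repeat ... until no new attribute is added
        changed = False
        for lhs, rhs in list(unused):
            if set(lhs) <= result:    # the left hand side is already in the closure
                if not set(rhs) <= result:
                    result |= set(rhs)
                    changed = True
                unused.remove((lhs, rhs))
    return result

# The dependency set G of Exercise 18.1-2:
G = [({"A", "B"}, {"C"}), ({"A"}, {"D"}), ({"C", "D"}, {"E", "F"})]
print(closure(G, {"A", "B"}))   # {'A','B','C','D','E','F'}, hence AB -> F indeed holds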
It is easy to see that the attributes that are put into any of the X (j) ’s by
Closure(R,F ,X) really belong to X + . The harder part of the correctness proof of
this algorithm is to show that each attribute belonging to X + will be put into some
of the X (j) ’s.
Proof First we prove by induction that if an attribute A is put into an X (j) during
Closure(R,F ,X), then A really belongs to X + .
Base case: j = 0. In this case A ∈ X and by reflexivity (A1) A ∈ X + .
Induction step: Let j > 0 and assume that X (j−1) ⊆ X + . A is put into X (j) , because
there is a functional dependency Y → Z in F , where Y ⊆ X (j−1) and A ∈ Z. By
induction, Y ⊆ X + holds, which implies using Lemma 18.3 that X → Y holds, as
well. By transitivity (A3) X → Y and Y → Z implies X → Z. By reflexivity (A1)
and A ∈ Z, Z → A holds. Applying transitivity again, X → A is obtained, that is
A ∈ X +.
On the other hand, we show that if A ∈ X + , then A is contained in the result
of Closure(R,F ,X). Suppose on the contrary that A ∈ X + , but A ∉ X (i) , where
X (i) is the result of Closure(R,F ,X). By the stop condition in line 11 this means
Let us observe that the relation instance r given in the proof above provides
the completeness proof for the Armstrong-axioms, as well. Indeed, the closure X +
calculated by Closure(R,F ,X) is the set of those attributes for which X → A
follows from F by the Armstrong-axioms. Meanwhile, for every other attribute B,
there exist two rows of r that agree on X, but differ in B, that is F |= X → B does
not hold.
The running time of Closure(R,F ,X) is O(n2 ), where n is the length of
the input. Indeed, in the repeat – until loop of lines 4–11 every not yet used
dependency is checked, and the body of the loop is executed at most |R \ X| + 1
times, since it is started again only if X (i−1) ≠ X (i) , that is a new attribute is
added to the closure of X. However, the running time can be reduced to linear with
appropriate bookkeeping.
Procedure Linear-Closure(R,F ,X) given below uses the following bookkeeping.
1. For every functional dependency W → Z a counter i[W, Z] is kept, whose value is
   the number of attributes of W that are not yet known to belong to the closure.
2. For every attribute A the yet unused dependencies whose left hand side contains A
   are kept in a doubly linked list LA .
3. The not yet used dependencies W → Z whose left hand side W is already entirely
   contained in the closure, that is for which i[W, Z] = 0, are kept in a linked list J.
Lines 2–8 of the initialisation phase take O(Σ(W,Z)∈F |W |) time, and the loop in lines
9–11 means O(|F |) steps. If the length of the input is denoted by n, then this is O(n)
steps altogether.
During the execution of lines 14–23, every functional dependency (W, Z) is ex-
amined at most once, when it is taken off from list J. Thus, lines 15–16 and 23 take
at most |F | steps. The running time of the loops in lines 17–22 can be estimated by
observing that the sum Σ i[W, Z] is decreased by one in each execution, hence it
takes O(Σ i0 [W, Z]) steps, where i0 [W, Z] is the i[W, Z] value obtained in the ini-
tialisation phase. However, Σ i0 [W, Z] ≤ Σ(W,Z)∈F |W |, thus lines 14–23 also take
O(n) time in total.
Linear-Closure(R, F, X)
1 Initialisation phase.
2 for all (W, Z) ∈ F
3 do for all A ∈ W
4 do add (W, Z) to list LA
5 i[W, Z] ← 0
6 for all A ∈ R \ X
7 do for all (W, Z) of list LA
8 do i[W, Z] ← i[W, Z] + 1
9 for all (W, Z) ∈ F
10 do if i[W, Z] = 0
11 then add (W, Z) to list J
12 X+ ← X
13 End of initialisation phase.
14 while J is nonempty
15 do (W, Z) ← head(J)
16 delete (W, Z) from list J
17 for all A ∈ Z \ X +
18 do for all (W, Z) of list LA
19 do i[W, Z] ← i[W, Z] − 1
20 if i[W, Z] = 0
21 then add (W, Z) to list J
22 delete (W, Z) from list LA
23 X+ ← X+ ∪ Z
24 return X +
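A possible Python transcription of Linear-Closure is sketched below; the counter array and the per-attribute lists correspond to i[W, Z] and LA of the pseudocode. The concrete data representation is again an assumption of the illustration, and several later sketches call this linear_closure.

from collections import defaultdict, deque

def linear_closure(R, F, X):
    """Linear-time attribute closure, following Linear-Closure(R,F,X).

    R: set of all attributes, F: list of (W, Z) pairs of attribute sets,
    X: starting attribute set.
    """
    result = set(X)
    counter = []                       # counter[k] plays the role of i[W,Z]
    L = defaultdict(list)              # L[A]: indices of dependencies with A on the left
    J = deque()                        # dependencies whose left side is in the closure
    for k, (W, Z) in enumerate(F):
        counter.append(len(set(W) - result))
        for A in W:
            L[A].append(k)
        if counter[k] == 0:
            J.append(k)
    while J:
        k = J.popleft()
        W, Z = F[k]
        for A in set(Z) - result:      # attributes newly entering the closure
            result.add(A)
            for m in L[A]:             # decrease the counters of dependencies containing A
                counter[m] -= 1
                if counter[m] == 0:
                    J.append(m)
    return result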
18.1.3. Minimal cover

A minimal cover of the family of functional dependencies F is a family G that is
equivalent with F (G+ = F + ), in which every right hand side consists of a single
attribute, no attribute can be dropped from any left hand side without changing the
closure, and no dependency can be omitted. The following algorithm computes a
minimal cover of F .

Minimal-Cover(R, F )
1 G←∅
2 for all X → Y ∈ F
3 do for all A ∈ Y − X
4 do G ← G ∪ X → A
5 Each right hand side consists of a single attribute.
6 for all X → A ∈ G
7 do while there exists B ∈ X : A ∈ (X − B)+ (G)
8 X ←X −B
9 Each left hand side is minimal.
10 for all X → A ∈ G
11 do if A ∈ X + (G − {X → A})
12 then G ← G − {X → A}
13 No redundant dependency exists.
After executing the loop of lines 2–4, the right hand side of each dependency in
G consists of a single attribute. The equality G+ = F + follows from the union rule
of Lemma 18.2 and the reflexivity axiom. Lines 6–8 minimise the left hand sides. In
line 11 it is checked whether a given functional dependency of G can be removed
without changing the closure. X + (G − {X → A}) is the closure of attribute set X
with respect to the family of functional dependencies G − {X → A}.
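The following Python sketch of Minimal-Cover reuses the closure routine given in Subsection 18.1.2; treating F as a list of (left hand side, right hand side) pairs is again only an assumption of the illustration.

def minimal_cover(F):
    """A possible transcription of Minimal-Cover(R,F).

    F: list of (lhs, rhs) pairs of attribute sets.  The result is a list of
    (lhs, single attribute) pairs; closure() is the sketch given earlier.
    """
    def cl(deps, X):
        return closure([(l, {a}) for l, a in deps], set(X))

    # Lines 2-4: every right hand side becomes a single attribute.
    G = list(dict.fromkeys((frozenset(X), A) for X, Y in F for A in set(Y) - set(X)))

    # Lines 6-8: minimise the left hand sides.
    for i in range(len(G)):
        X, A = set(G[i][0]), G[i][1]
        for B in list(X):
            if A in cl(G, X - {B}):    # B is redundant on the left hand side
                X.discard(B)
                G[i] = (frozenset(X), A)

    # Lines 10-12: drop redundant dependencies.
    for dep in list(G):
        rest = [d for d in G if d != dep]
        if dep[1] in cl(rest, dep[0]):
            G = rest
    return G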
Claim 18.6 Minimal-Cover(R, F ) calculates a minimal cover of F .
Proof It is enough to show that during execution of the loop in lines 10–12, no
functional dependency X → A is generated whose left hand side could be decreased.
18.1.4. Keys
In database design it is important to identify those attribute sets that uniquely
determine the data in individual records.
The question is how the keys can be determined from (R, F ). What makes this
problem hard is that the number of keys can be a superexponential function of the
size of (R, F ). In particular, Yu and Johnson constructed a relational schema
where |F | = k, but the number of keys is k!. Békéssy and Demetrovics gave a
beautiful and simple proof of the fact that starting from k functional dependencies,
at most k! keys can be obtained. (This was independently proved by Osborne and
Tompa.)
The proof of Békéssy and Demetrovics is based on the operation ∗ they intro-
duced, which is defined for functional dependencies. Let e1 = U → V and e2 = X → Y
be two functional dependencies; then

e1 ∗ e2 = U ∪ ((R − V ) ∩ X) → V ∪ Y .

Some properties of operation ∗ are listed below; the proof is left to the Reader (Exer-
cise 18.1-3). Operation ∗ is associative, furthermore it is idempotent in the sense
that if e = e1 ∗ e2 ∗ · · · ∗ ek and e′ = e ∗ ei for some 1 ≤ i ≤ k, then e′ = e.
Claim 18.9 (Békéssy and Demetrovics). Let (R, F ) be a relational schema and let
F = {e1 , e2 , . . . , ek } be a listing of the functional dependencies. If X is a key, then
X → R = eπ1 ∗ eπ2 ∗ . . . ∗ eπs ∗ d, where (π1 , π2 , . . . , πs ) is an ordered subset of the
index set {1, 2, . . . , k}, and d is a trivial dependency in the form D → D.
Proposition 18.9 bounds in some sense the possible sets of attributes in the search
for keys. The next proposition gives lower and upper bounds for the keys.
Claim 18.10 Let (R, F ) be a relational schema and let F = {Ui → Vi : 1 ≤ i ≤ k}.
Let us assume without loss of generality that Ui ∩ Vi = ∅. Let U = U1 ∪ U2 ∪ · · · ∪ Uk
and V = V1 ∪ V2 ∪ · · · ∪ Vk . If K is a key in the schema (R, F ), then
HL = R − V ⊆ K ⊆ (R − V) ∪ U = HU .
The proof is not too hard, it is left as an exercise for the Reader (Exercise 18.1-4).
The algorithm List-Keys(R, F ) that lists the keys of the schema (R, F ) is based on
the bounds of Proposition 18.10. The running time can be bounded by O(n!), but
one cannot expect anything better, since listing the output may need that much time
in the worst case.
List-Keys(R, F )
1 Let U and V be as defined in Proposition 18.10
2 if U ∩ V = ∅
3 then return R − V
4 R − V is the only key.
5 if (R − V)+ = R
6 then return R − V
7 R − V is the only key.
8 𝒦 ← ∅
9 for all permutations A1 , A2 , . . . Ah of the attributes of U ∩ V
10 do K ← (R − V) ∪ U
11 for i ← 1 to h
12 do Z ← K − Ai
13 if Z + = R
14 then K ← Z
15 𝒦 ← 𝒦 ∪ {K}
16 return 𝒦
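A small Python sketch of List-Keys, built on linear_closure from Subsection 18.1.2, may be helpful; the chosen data representation is an assumption of the illustration. Note that, as discussed above, the output can be of size O(n!).

from itertools import permutations

def list_keys(R, F):
    """Enumerate the keys of (R, F), following List-Keys(R,F).

    R: set of attributes, F: list of (U_i, V_i) pairs with U_i and V_i disjoint.
    """
    U, V = set(), set()
    for Ui, Vi in F:
        U |= set(Ui)
        V |= set(Vi)
    HL = set(R) - V                            # lower bound for every key
    if not (U & V) or linear_closure(R, F, HL) == set(R):
        return {frozenset(HL)}                 # R - V is the only key
    keys = set()
    for perm in permutations(U & V):
        K = HL | U
        for A in perm:
            Z = K - {A}
            if linear_closure(R, F, Z) == set(R):
                K = Z                          # A can be dropped, K stays a superkey
        keys.add(frozenset(K))
    return keys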
Exercises
18.1-1 Let R be a relational schema and let F and G be families of functional
dependencies over R. Show that
a. F ⊆ F + .
b. (F + )+ = F + .
c. If F ⊆ G, then F + ⊆ G+ .
Formulate and prove similar properties of the closure X + – with respect to F –
of an attribute set X.
18.1-2 Derive the functional dependency AB → F from the set of dependencies
G = {AB → C, A → D, CD → EF } using Armstrong-axioms (A1)–(A3).
18.1-3 Show that operation ∗ is associative, furthermore if for functional depen-
dencies e1 , e2 , . . . , ek we have e = e1 ∗ e2 ∗ · · · ∗ ek and e′ = e ∗ ei for some 1 ≤ i ≤ k,
then e′ = e.
18.1-4 Prove Proposition 18.10.
18.1-5 Prove the union, pseudo transitivity and decomposition rules of Lemma
18.2.
18.2. Decomposition of relational schemata

A decomposition of the relational schema R is a collection ρ = (R1 , R2 , . . . , Rk ) of
subschemata (attribute sets) whose union is the whole schema, that is

R = R1 ∪ R2 ∪ · · · ∪ Rk .
The Ri ’s need not be disjoint, in fact in most applications they must not be. One
important motivation of decompositions is to avoid anomalies.
Example 18.3 Consider the schema

SUPPLIER-INFO(SNAME, ADDRESS, ITEM, PRICE) ,

abbreviated as SAIP by the initials of its attributes, with the functional dependencies
S → A and SI → P . The question is whether it is correct to replace the schema SAIP
by SA and SIP . Let r be an instance of schema SAIP . It is natural to require that if SA
and SIP are used, then the relations belonging to them are obtained by projecting r to
SA and SIP , respectively, that is rSA = πSA (r) and rSIP = πSIP (r). rSA and rSIP contain
the same information as r, if r can be reconstructed using only rSA and rSIP . The
calculation of r from rSA and rSIP can be done by the natural join operator.
Definition 18.11 The natural join of relations ri of schemata Ri (i = 1, 2, . . . , n)
is the relation s belonging to the schema R1 ∪ R2 ∪ · · · ∪ Rn that consists of all rows µ
such that for all i there exists a row νi of relation ri with µ[Ri ] = νi [Ri ]. In notation
s = r1 ⋈ r2 ⋈ · · · ⋈ rn .
Example 18.4 Let R1 = AB, R2 = BC, r1 = {ab, a′ b′ , ab′′ } and r2 = {bc, bc′ , b′ c′′ }. The
natural join of r1 and r2 belongs to the schema R = ABC, and it is the relation r1 ⋈ r2 =
{abc, abc′ , a′ b′ c′′ }.
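A tiny Python sketch of the natural join, replaying Example 18.4; representing a relation as a set of tuples together with a list of attribute names is an assumption of the illustration.

def natural_join(rel1, attrs1, rel2, attrs2):
    """Natural join of two relations given as sets of tuples."""
    common = [a for a in attrs1 if a in attrs2]
    extra = [a for a in attrs2 if a not in attrs1]
    result = set()
    for t1 in rel1:
        row1 = dict(zip(attrs1, t1))
        for t2 in rel2:
            row2 = dict(zip(attrs2, t2))
            if all(row1[a] == row2[a] for a in common):   # rows agree on the common attributes
                result.add(tuple(row1[a] for a in attrs1) +
                           tuple(row2[a] for a in extra))
    return list(attrs1) + extra, result

# Example 18.4: r1 over AB, r2 over BC
r1 = {("a", "b"), ("a'", "b'"), ("a", "b''")}
r2 = {("b", "c"), ("b", "c'"), ("b'", "c''")}
print(natural_join(r1, ["A", "B"], r2, ["B", "C"])[1])
# {('a', 'b', 'c'), ('a', 'b', "c'"), ("a'", "b'", "c''")}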
If s is the natural join of rSA and rSIP , that is s = rSA ⋈ rSIP , then πSA (s) =
rSA and πSIP (s) = rSIP by Lemma 18.12. If r ≠ s, then the original relation could
not be reconstructed knowing only rSA and rSIP .
For a decomposition ρ = (R1 , R2 , . . . , Rk ) let mρ (r) = πR1 (r) ⋈ πR2 (r) ⋈ · · · ⋈ πRk (r).
Among the basic properties of mρ (see also Exercise 18.2-7):
1. r ⊆ mρ (r).
Join-Test(R, F, ρ)
1 Initialisation phase.
2 for i ← 1 to |ρ|
3 do for j ← 1 to |R|
4 do if Aj ∈ Ri
5 then T [i, j] ← 0
6 else T [i, j] ← i
7 End of initialisation phase.
8 S←T
9 repeat
10 T ←S
11 for all {X → Y } ∈ F
12 do for i ← 1 to |ρ| − 1
13 do for j ← i + 1 to |ρ|
14 do if for all Ah in X (S[i, h] = S[j, h])
15 then Equate(i, j, S, Y )
16 until S = T
17 if there exists an all 0 row in S
18 then return “Lossless join”
19 else return “Lossy join”
Equate(i, j, S, Y )
1 for Al ∈ Y
2 do if S[i, l] · S[j, l] = 0
3 then
4 for d ← 1 to k
5 do if S[d, l] = S[i, l] ∨ S[d, l] = S[j, l]
6 then S[d, l] ← 0
7 else
8 for d ← 1 to k
9 do if S[d, l] = S[j, l]
10 then S[d, l] ← S[i, l]
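A compact Python sketch of the lossless join test (Join-Test together with Equate) is given below; the table is stored as a list of lists, 0 stands for the distinguished symbol and the value i+1 for the non-distinguished symbol of row i. The representation of F and ρ as lists of attribute sets is an assumption of the illustration.

def join_test(R, F, rho):
    """Lossless join test, following Join-Test(R,F,rho) and Equate(i,j,S,Y).

    R: list of attributes, F: list of (lhs, rhs) attribute-set pairs,
    rho: list of subschemata (attribute sets).  Returns True iff lossless.
    """
    k, col = len(rho), {A: j for j, A in enumerate(R)}
    S = [[0 if A in rho[i] else i + 1 for A in R] for i in range(k)]
    changed = True
    while changed:                                  # repeat ... until S = T
        changed = False
        for X, Y in F:
            for i in range(k):
                for j in range(i + 1, k):
                    if all(S[i][col[A]] == S[j][col[A]] for A in X):
                        for A in Y:                 # Equate(i, j, S, Y)
                            l = col[A]
                            a, b = S[i][l], S[j][l]
                            if a == b:
                                continue
                            new = 0 if 0 in (a, b) else a
                            for d in range(k):
                                if S[d][l] in (a, b):
                                    S[d][l] = new
                            changed = True
    return any(all(v == 0 for v in row) for row in S)

# Example 18.5 (worked out in detail right below):
R = list("ABCDE")
rho = [set("AD"), set("AB"), set("BE"), set("CDE"), set("AE")]
F = [(set("A"), set("C")), (set("B"), set("C")), (set("C"), set("D")),
     (set("DE"), set("C")), (set("CE"), set("A"))]
print(join_test(R, F, rho))   # True: the decomposition has the lossless join property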
Example 18.5 Checking lossless join property Let R = ABCDE, R1 = AD, R2 = AB,
R3 = BE, R4 = CDE, R5 = AE, furthermore let the functional dependencies be {A →
C, B → C, C → D, DE → C, CE → A}. The initial array is shown on Figure 18.1(a).
Using A → C values 1,2,5 in column C can be equated to 1. Then applying B → C value 3
of column C again can be changed to 1. The result is shown on Figure 18.1(b). Now C → D
can be used to change values 2,3,5 of column D to 0. Then applying DE → C (the only
nonzero) value 1 of column C can be set to 0. Finally, CE → A makes it possible to change
values 3 and 4 in column A to 0. The final result is shown on Figure 18.1(c).
The third row consists of only zeroes, thus the decomposition has the lossless join property.
A B C D E
0 1 1 0 1
0 0 2 2 2
3 0 3 3 0
4 4 0 0 0
0 5 5 5 0
(a)
A B C D E
0 1 1 0 1
0 0 1 2 2
3 0 1 3 0
4 4 0 0 0
0 5 1 5 0
(b)
A B C D E
0 1 0 0 1
0 0 0 2 2
0 0 0 0 0
0 4 0 0 0
0 5 0 0 0
(c)
Proof Let us assume first that the resulting array T contains no all zero row. T itself
can be considered as a relational instance over the schema R. This relation satisfies
all functional dependencies from F , because the algorithm finished since there was
no more change in the table during checking the functional dependencies. It is true
for the starting table that its projections to every Ri ’s contain an all zero row, and
this property does not change during the running of the algorithm, since a 0 is never
changed to another symbol. It follows that the natural join mρ (T ) contains the all
zero row, that is T ≠ mρ (T ). Thus the decomposition is lossy. The proof of the other
direction is only sketched.
For this, logic, more precisely domain calculus, is used. The necessary definitions can be found in the
books of Abiteboul, Hull and Vianu, or Ullman, respectively. Imagine that variable
aj is written in place of zeroes, and bij is written in place of i’s in column j, and
Join-test(R,F ,ρ) is run in this setting. The resulting table contains row a1 a2 . . . an ,
which corresponds to the all zero row. Every table can be viewed as a shorthand
notation for the following domain calculus expression
where wi is the ith row of T . If T is the starting table, then formula (18.1) defines mρ
exactly. As a justification note that for a relation r, mρ (r) contains the row a1 a2 . . . an
iff r contains for all i a row whose jth coordinate is aj if Aj is an attribute of Ri ,
and arbitrary values represented by variables bil in the other attributes.
Consider an arbitrary relation r belonging to schema R that satisfies the de-
pendencies of F . The modifications (equating symbols) of the table done by Join-
test(R,F ,ρ) do not change the set of rows obtained from r by (18.1), if the mod-
ifications are done in the formula, as well. Intuitively it can be seen from the fact
that only such symbols are equated in (18.1), that can only take equal values in a
relation satisfying the functional dependencies of F . The exact proof is omitted, since it
is quite tedious.
Since in the result table of Join-test(R,F ,ρ) the all a’s row occurs, the domain
calculus formula that belongs to this table is of the following form:
              R1 ∩ R2       R1 − R2       R2 − R1
row of R1     00 . . . 0     00 . . . 0     11 . . . 1                    (18.3)
row of R2     00 . . . 0     22 . . . 2     00 . . . 0
It is not hard to see using induction on the number of steps done by Join-
test(R,F ,ρ) that if the algorithm changes both values of the column of an attribute
A to 0, then A ∈ (R1 ∩ R2 )+ . This is obviously true at the start. If at some time
values of column A must be equated, then by lines 11–14 of the algorithm, there
exists {X → Y } ∈ F , such that the two rows of the table agree on X, and A ∈ Y .
By the induction assumption X ⊆ (R1 ∩ R2 )+ holds. Applying Armstrong-axioms
(transitivity and reflexivity), A ∈ (R1 ∩ R2 )+ follows.
On the other hand, let us assume that A ∈ (R1 ∩ R2 )+ , that is (R1 ∩ R2 ) → A.
Then this functional dependency can be derived from F using Armstrong-axioms.
By induction on the length of this derivation it can be seen that procedure Join-
test(R,F ,ρ) will equate the two values of column A, that is set them to 0. Thus,
the row of R1 will be all 0 iff (R1 ∩ R2 ) → (R2 − R1 ), similarly, the row of R2 will
be all 0 iff (R1 ∩ R2 ) → (R1 − R2 ).
In this case R1 and R2 satisfy the dependencies prescribed for them separately, however in
R1 ⋈ R2 the dependency CS → Z does not hold.
It is true as well, that none of the decompositions of this schema preserves the depen-
dency CS → Z. Indeed, this is the only dependency that contains Z on the right hand side,
thus if it is to be preserved, then there has to be a subschema that contains C, S, Z, but then
the decomposition would not be proper. This will be considered again when decomposition
into normal forms is treated.
Preserve(ρ, F )
1 for all (X → Y ) ∈ F
2 do Z ← X
3 repeat
4 W ←Z
5 for i ← 1 to k
6 do Z ← Z ∪ (Linear-closure(R, F, Z ∩ Ri ) ∩ Ri )
7 until Z = W
8 if Y 6⊆ Z
9 then return “Not dependency preserving”
10 return “Dependency preserving”
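A possible Python sketch of the dependency preservation test Preserve(ρ, F ), using linear_closure from Subsection 18.1.2:

def preserves_dependencies(R, F, rho):
    """Dependency preservation test, following Preserve(rho, F).

    R: set of attributes, F: list of (X, Y) pairs, rho: list of subschemata.
    """
    for X, Y in F:
        Z = set(X)
        while True:
            W = set(Z)
            for Ri in rho:                                   # one "Ri-operation" per subschema
                Z |= linear_closure(R, F, Z & set(Ri)) & set(Ri)
            if Z == W:                                       # Z did not change any more
                break
        if not set(Y) <= Z:
            return False                                     # X -> Y is not preserved
    return True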
In the next round using the BC-operation the actual Z = {C, D} is changed to Z =
{B, C, D}, finally applying the AB-operation on this, Z = {A, B, C, D} is obtained. This
cannot change, so procedure Preserve(ρ, F ) stops. Thus, with respect to the family of
functional dependencies
{D}+ = {A, B, C, D} holds, that is G |= D → A. It can be checked similarly that the other
dependencies of F are in G+ (in fact, in G).
3NF Although BCNF helps to eliminate anomalies, it is not true that every
schema can be decomposed into subschemata in BCNF so that the decomposition is
dependency preserving. As it was shown in Example 18.6, no proper decomposition
of schema CSZ preserves the CS → Z dependency. At the same time, the schema
is clearly not in BCNF, because of the Z → C dependency.
Since dependency preservation is important for consistency checking of
a database, it is practical to introduce a normal form such that every schema has a
dependency preserving decomposition into that form, and that allows the minimum
possible redundancy. An attribute is called a prime attribute, if it occurs in a key.
Definition 18.19 The schema (R, F ) is in third normal form, if whenever X →
A ∈ F + , then either X is a superkey, or A is a prime attribute.
The schema SAIP of Example 18.3 with the dependencies SI → P and S → A
is not in 3NF, since SI is the only key and so A is not a prime attribute. Thus,
functional dependency S → A violates the 3NF property.
3NF is clearly a weaker condition than BCNF, since “or A is a prime attribute”
occurs in the definition. The schema CSZ in Example 18.6 is trivially in 3NF,
because every attribute is prime, but it was already shown that it is not in BCNF.
Lemma 18.20 Let (R, F ) be a relational schema (where F is the set of functional
dependencies), ρ = (R1 , R2 , . . . , Rk ) be a lossless join decomposition of R. Further-
more, let σ = (S1 , S2 ) be a lossless join decomposition of R1 with respect to πR1 (F ).
Then (S1 , S2 , R2 , . . . , Rk ) is a lossless join decomposition of R.
The proof of Lemma 18.20 is based on the associativity of natural join. The details
are left to the Reader (Exercise 18.2-9).
This can be applied to obtain a simple, but unfortunately exponential time algorithm
that decomposes a schema into subschemata with the BCNF property. The projections in
lines 4–5 of Naiv-BCNF(S, G) may be of exponential size in the length of the input.
In order to decompose schema (R, F ), the procedure must be called with parameters
R, F . Procedure Naiv-BCNF(S, G) is recursive, S is the actual schema with set of
functional dependencies G. It is assumed that the dependencies in G are of the form
X → A, where A is a single attribute.
Naiv-BCNF(S, G)
1 while there exists {X → A} ∈ G, that violates BCNF
2 do S1 ← {XA}
3 S2 ← S − A
4 G1 ← πS1 (G)
5 G2 ← πS2 (G)
6 return (Naiv-BCNF(S1 , G1 ), Naiv-BCNF(S2 , G2 ))
7 return S
Lemma 18.21
1. A schema R consisting of at most two attributes is always in BCNF.
2. If R is not in BCNF, then there exist two attributes A and B in R, such that
(R − AB) → A holds.
Proof If the schema consists of two attributes, R = AB, then there are at most two
possible non-trivial dependencies, A → B and B → A. It is clear that if one of
them holds, then the left hand side of the dependency is a key, so the dependency
does not violate the BCNF property. However, if none of the two holds, then BCNF
is trivially satisfied.
On the other hand, let us assume that the dependency X → A violates the
BCNF property. Then there must exist an attribute B ∈ R − (XA), since otherwise
X would be a superkey. For this B, (R − AB) → A holds.
Let us note, that the converse of the second statement of Lemma 18.21 is not
true. It may happen that a schema R is in BCNF, but there are still two attributes
{A, B} that satisfy (R − AB) → A. Indeed, let R = ABC, F = {C → A, C → B}.
This schema is obviously in BCNF, nevertheless (R − AB) = C → A.
The main contribution of Lemma 18.21 is that the projections of functional
dependencies need not be calculated in order to check whether a schema obtained
during the procedure is in BCNF. It is enough to calculate (R − AB)+ for pairs
{A, B} of attributes, which can be done by Linear-closure(R, F, X) in linear
time, so the whole checking is polynomial (cubic) time. However, this requires a way
of calculating (R − AB)+ without actually projecting down the dependencies. The
next lemma is useful for this task.
Lemma 18.22 Let R2 ⊂ R1 ⊂ R and let F be the set of functional dependencies
of scheme R. Then
πR2 (πR1 (F )) = πR2 (F ) .
The proof is left for the Reader (Exercise 18.2-10). The method of lossless join BCNF
decomposition is as follows. Schema R is decomposed into two subschemata. One
is XA that is in BCNF, satisfying X → A. The other subschema is R − A, hence
by Theorem 18.14 the decomposition has the lossless join property. This is applied
recursively to R − A, until a schema is obtained that contains no pair of attributes
A, B with (R − AB) → A, and hence by Lemma 18.21 is in BCNF. The lossless join
property of this recursively generated decomposition is guaranteed by Lemma 18.20.
Polynomial-BCNF(R, F )
1 Z←R
2 Z is the schema that is not known to be in BCNF during the procedure.
3 ρ←∅
4 while there exist A, B in Z, such that A ∈ (Z − AB)+ and |Z| > 2
5 do Let A and B be such a pair
6 E←A
7 Y ←Z −B
8 while there exist C, D in Y , such that C ∈ (Y − CD)+
9 do Y ← Y − D
10 E←C
11 ρ ← ρ ∪ {Y }
12 Z ←Z −E
13 ρ ← ρ ∪ {Z}
14 return ρ
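A Python sketch of Polynomial-BCNF follows; it relies on linear_closure and on the fact (Lemma 18.22) that all closures may be taken with respect to F over the whole schema. The helper violating_pair and the data representation are assumptions of the illustration.

def polynomial_bcnf(R, F):
    """Lossless join BCNF decomposition, following Polynomial-BCNF(R,F)."""
    def violating_pair(S):
        # A pair A, B with A in (S - AB)+ witnesses that S is not in BCNF (Lemma 18.21).
        for A in S:
            for B in S:
                if A != B and A in linear_closure(R, F, S - {A, B}):
                    return A, B
        return None

    Z, rho = set(R), []
    while len(Z) > 2:
        pair = violating_pair(Z)
        if pair is None:
            break
        A, B = pair
        E, Y = A, Z - {B}
        inner = violating_pair(Y)
        while inner is not None:       # shrink Y until it is in BCNF
            C, D = inner
            Y -= {D}
            E = C
            inner = violating_pair(Y)
        rho.append(frozenset(Y))
        Z -= {E}
    rho.append(frozenset(Z))
    return rho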
command while of line 8 is checked for O(n2 ) pairs of attributes, each checking is
done in linear time. The running time of the algorithm is dominated by the time
required by lines 8–10 that take n · n · O(n2 ) · O(n) = O(n5 ) steps altogether.
If the decomposition needs to have the lossless join property besides being de-
pendency preserving, then ρ given in Theorem 18.23 is to be extended by a key X of
R. Although it was seen before that it is not possible to list all keys in polynomial
time, one can be obtained in a simple greedy way, the details are left to the Reader
(Exercise 18.2-11).
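A key of (R, F ) can indeed be found greedily in the spirit of Exercise 18.2-11; a minimal sketch using linear_closure:

def find_key(R, F):
    """Greedy computation of one key of (R, F), cf. Exercise 18.2-11.

    Starts from the superkey R and drops attributes one by one as long as the
    remaining set is still a superkey.
    """
    K = set(R)
    for A in sorted(R):                # any fixed order of the attributes works
        if linear_closure(R, F, K - {A}) == set(R):
            K.discard(A)
    return K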
Theorem 18.24 Let (R, F ) be a relational schema, and let G = {X1 → A1 , X2 →
A2 , . . . , Xk → Ak } be a minimal cover of F . Furthermore, let X be a key in (R, F ).
Then the decomposition τ = (X, X1 A1 , X2 A2 , . . . , Xk Ak ) is a lossless join and de-
pendency preserving decomposition of R into subschemata in 3NF.
Proof It was shown during the proof of Theorem 18.23 that the subschemata Ri =
Xi Ai are in 3NF for i = 1, 2, . . . , k. There cannot be a non-trivial dependency in the
subschema R0 = X, because if it were, then X would not be a key, only a superkey.
The lossless join property of τ is shown by the use of the Join-test(R, G, τ ) proce-
dure. Note that it is enough to consider the minimal cover G of F . More precisely,
we show that the row corresponding to X in the table will be all 0 after running
Join-test(R, G, ρ). Let A1 , A2 , . . . , Am be the order of the attributes of R − X
as Closure(R,G,X) inserts them into X + . Since X is a key, every attribute of
R − X is taken during Closure(R,G,X). It will be shown by induction on i that
Example 18.8 Besides functional dependencies, some other dependencies hold in Exam-
ple 18.1, as well. There can be several lectures of a subject in different times and rooms.
Part of an instance of the schema could be the following.
Professor Subject Room Student Grade Time
Caroline Doubtfire Analysis MA223 John Smith A− Monday 8–10
Caroline Doubtfire Analysis CS456 John Smith A− Wednesday 12–2
Caroline Doubtfire Analysis MA223 Ching Lee A+ Monday 8–10
Caroline Doubtfire Analysis CS456 Ching Lee A+ Wednesday 12–2
A set of values of the Time and Room attributes belongs to each given value of
Subject, and all other attribute values are repeated with these. The attribute sets RT and
StG are independent, that is, their values occur in each combination.
1 It would be enough to require the existence of t3 , since the existence of t4 would follow. However,
the symmetry of multivalued dependency is more apparent in this way.
Example 18.9 Let us determine the dependency basis of the attribute Su in the schema

R(Professor, Subject, Room, Student, Grade, Time)

of Examples 18.1 and 18.8. Su ↠ RT was shown in Example 18.8. By the complementation
rule Su ↠ PStG follows. Su → P is also known. This implies by axiom (A7) that Su ↠ P.
By the decomposition rule Su ↠ StG follows. It is easy to see that no other one-element
attribute set is determined by Su via multivalued dependency. Thus, the dependency basis
of Su is the partition {P, RT, StG}.
The only thing left is to decide about functional dependencies based on the depen-
dency basis. Closure(R,F ,X) works correctly only if multivalued dependencies are
not considered. The next theorem helps in this case.
Theorem 18.31 (Beeri). Let us assume that A 6∈ X and the dependency basis of
X with respect to the set M of multivalued dependencies obtained in Theorem 18.30
is known. X → A holds if and only if
1. A forms a single element block in the partition of the dependency basis, and
Based on the observations above, the following polynomial time algorithm can be
given to compute the dependency basis of a set of attributes X.
Dependency-Basis(R, M, X)
1 S ← {R − X} The collection of sets in the dependency basis is S.
2 repeat
3 for all V ↠ W ∈ M
4 do if there exists Y ∈ S such that Y ∩ W ≠ ∅ ∧ Y ∩ V = ∅
5 then S ← (S − {Y }) ∪ {Y ∩ W, Y − W }
6 until S does not change
7 return S
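A Python sketch of Dependency-Basis; the multivalued dependencies in M are represented as (V, W ) pairs of attribute sets, which is an assumption of the illustration. The extra guard on Y − W only prevents the sketch from looping on a block that is not genuinely split.

def dependency_basis(R, M, X):
    """Dependency basis of X, following Dependency-Basis(R,M,X).

    R: set of attributes, M: list of (V, W) pairs standing for V ->> W,
    X: set of attributes.  Returns the set of blocks of the partition.
    """
    S = {frozenset(set(R) - set(X))}
    changed = True
    while changed:                                   # until S does not change
        changed = False
        for V, W in M:
            V, W = set(V), set(W)
            for Y in list(S):
                if Y & W and Y - W and not (Y & V):  # Y is genuinely split by W
                    S.remove(Y)
                    S |= {frozenset(Y & W), frozenset(Y - W)}
                    changed = True
    return S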
Fourth normal form 4NF The Boyce-Codd normal form can be generalised
to the case where multivalued dependencies are also considered besides functional
dependencies, and one needs to get rid of the redundancy caused by them.
(R1 ∩ R2 ) ↠ (R1 − R2 ).
Proof The decomposition ρ = (R1 , R2 ) of schema R has the lossless join property iff
the following holds for any relation r over the schema R that satisfies all dependencies
from D: if µ and ν are two tuples of r, then the tuple ϕ satisfying ϕ[R1 ] = µ[R1 ]
and ϕ[R2 ] = ν[R2 ], whenever it exists, is also contained in r. More precisely, ϕ is the
natural join of the projections of µ on R1 and of ν on R2 , respectively, which exists iff
µ[R1 ∩ R2 ] = ν[R1 ∩ R2 ]. Thus the fact that ϕ is always contained in r is equivalent with
(R1 ∩ R2 ) ↠ (R1 − R2 ).
To compute the projection πS (D) of the dependency set D one can use the follow-
ing theorem of Aho, Beeri and Ullman. πS (D) is the set of multivalued dependencies
that are logical implications of D and use attributes of S only.
Theorem 18.34 (Aho, Beeri and Ullman). πS (D) consists of the following depen-
dencies:
• For all X → Y ∈ D+ , if X ⊆ S, then X → (Y ∩ S) ∈ πS (D).
• For all X ↠ Y ∈ D+ , if X ⊆ S, then X ↠ (Y ∩ S) ∈ πS (D).
Other dependencies cannot be derived from the fact that D holds in R.
Unfortunately this theorem does not help in computing the projected dependencies
in polynomial time, since even computing D+ could take exponential time. Thus, the
algorithm of 4NF decomposition is not polynomial either, because the 4NF condition
must be checked with respect to the projected dependencies in the subschemata.
This is in deep contrast with the case of BCNF decomposition. The reason is, that
to check BCNF condition one does not need to compute the projected dependencies,
Exercises
18.2-1 Are the following inference rules sound?
a. If XW → Y and XY → Z, then X → (Z − W ).
b. If X ↠ Y and Y ↠ Z, then X ↠ Z.
c. If X ↠ Y and XY → Z, then X → Z.
18.2-2 Prove Theorem 18.30, that is show the following. Let D be a set of functional
and multivalued dependencies, and let m(D) = {X ↠ Y : X ↠ Y ∈ D} ∪ {X ↠
A : A ∈ Y for some X → Y ∈ D}. Then
a. D |= X → Y =⇒ m(D) |= X ↠ Y , and
b. D |= X ↠ Y ⇐⇒ m(D) |= X ↠ Y .
Hint. Use induction on the inference rules to prove b.
18.2-3 Consider the database of an investment firm, whose attributes are as fol-
lows: B (stockbroker), O (office of stockbroker), I (investor), S (stock), A (amount
of stocks of the investor), D (dividend of the stock). The following functional de-
pendencies are valid: S → D, I → B, IS → A, B → O.
a. Determine a key of schema R = BOISAD.
b. How many keys are in schema R?
c. Give a lossless join decomposition of R into subschemata in BCNF.
d. Give a dependency preserving and lossless join decomposition of R into sub-
schemata in 3NF.
18.2-4 The schema R of Exercise 18.2-3 is decomposed into subschemata SD, IB,
ISA and BO. Does this decomposition have the lossless join property?
18.2-5 Assume that schema R of Exercise 18.2-3 is represented by ISA, IB, SD and
ISO subschemata. Give a minimal cover of the projections of dependencies given
in Exercise 18.2-3. Exhibit a minimal cover for the union of the sets of projected
dependencies. Is this decomposition dependency preserving?
18.2-6 Let the functional dependency S → D of Exercise 18.2-3 be replaced by
the multivalued dependency S ↠ D. That is, D represents the stock’s dividend
“history”.
a. Compute the dependency basis of I.
b. Compute the dependency basis of BS.
c. Give a decomposition of R into subschemata in 4NF.
18.2-7 Consider the decomposition ρ = {R1 , R2 , . . . , Rk } of schema R. Let ri =
πRi (r), furthermore mρ (r) = πR1 (r) ⋈ πR2 (r) ⋈ · · · ⋈ πRk (r). Prove:
a. r ⊆ mρ (r).
b. If s = mρ (r), then πRi (s) = ri .
c. mρ (mρ (r)) = mρ (r).
18.2-8 Prove that schema (R, F ) is in BCNF iff for arbitrary A ∈ R and key
X ⊂ R, it holds that there exists no Y ⊆ R, for which X → Y ∈ F + ; Y → X ∉ F + ;
Y → A ∈ F + and A ∉ Y .
18.2-9 Prove Lemma 18.20.
18.2-10 Let us assume that R2 ⊂ R1 ⊂ R and the set of functional dependencies
of schema R is F . Prove that πR2 (πR1 (F )) = πR2 (F ).
18.2-11 Give an O(n2 ) running time algorithm to find a key of the relational schema
(R, F ). Hint. Use the fact that R is a superkey and every superkey contains a key. Try
to drop attributes from R one-by-one and check whether the remaining set is still a
superkey.
18.2-12 Prove that axioms (A1)–(A8) are sound for functional and multivalued
dependencies.
18.2-13 Derive the four inference rules of Proposition 18.27 from axioms (A1)–
(A8).
The relation r belonging to schema R satisfies the join dependency

⋈ [X1 , X2 , . . . , Xk ]

if

r = πX1 (r) ⋈ πX2 (r) ⋈ · · · ⋈ πXk (r) .

In this setting r satisfies the multivalued dependency X ↠ Y iff it satisfies the join
dependency ⋈ [XY, X(R − Y )]. The join dependency ⋈ [X1 , X2 , . . . , Xk ] expresses
that the decomposition ρ = (X1 , X2 , . . . , Xk ) has the lossless join property. One can
define the fifth normal form, 5NF.
Definition 18.36 The relational schema R is in fifth normal form, if it is in
4NF and has no non-trivial join dependency.
The fifth normal form has primarily theoretical significance. The schemata used in
practice usually have primary keys. Using the key, the schema could be decomposed
into subschemata of two attributes each, where one of the attributes is a superkey
in every subschema.
Example 18.10 Consider the database of clients of a bank (Client-
number, Name, Address, accountBalance). Here C is a unique identifier, thus the schema could
be decomposed into (CN, CA, CB), which has the lossless join property. However, it is not
worth doing so, since no storage space can be saved, furthermore no anomalies are avoided
with it.
is the same for functional and multivalued dependencies considered together. Unfor-
tunately, the following negative result is true.
In contrast to the above, Abiteboul, Hull and Vianu show in their book that
the logical implication problem can be decided by an algorithm for the family of
functional and join dependencies taken together. The complexity of the problem is
as follows.
Theorem 18.38
Example 18.11 Consider the database of the trips of an international transport truck.
• One trip: four distinct countries.
• One country has at most five neighbours.
• There are 30 countries to be considered.
Let x1 , x2 , x3 , x4 be the attributes of the countries reached in a trip. In this case
xi −(1,1)→ xi+1 does not hold, however another dependency is valid:
xi −(1,5)→ xi+1 .
The storage space requirement of the database can be significantly reduced using these
dependencies. The range of each element of the original table consists of 30 values, names of
countries or some codes of them (5 bits each, at least). Let us store a little table (30×5×5 =
750 bits) that contains a numbering of the neighbours of each country, which assigns to them
the numbers 0,1,2,3,4 in some order. Now we can replace attribute x2 by these numbers
(x∗2 ), because the value of x1 gives the starting country and the value of x∗2 determines the
second country with the help of the little table. The same holds for the attribute x3 , but
we can decrease the number of possible values even further, if we give a table of numbering
the possible third countries for each x1 , x2 pair. In this case, the attribute x∗3 can take only
4 different values. The same holds for x4 , too. That is, while each element of the original
table could be encoded by 5 bits, now for the cost of two little auxiliary tables we could
decrease the length of the elements in the second column to 3 bits, and that of the elements
in the third and fourth columns to 2 bits.
There exist a mapping C : 2R → 2R and natural numbers p, q such that there exists
no Armstrong-relation for C in the family of (p, q)-dependencies.
Grant and Minker investigated numerical dependencies that are similar to
branching dependencies. For attribute sets X, Y ⊆ R the dependency X −k→ Y holds
in a relation r over schema R if for every tuple value taken on the set of attributes
X, there exist at most k distinct tuple values taken on Y . This condition is stronger
than that of X −(1,k)→ Y , since the latter only requires that in each column of Y there
are at most k values, independently of each other. That allows k |Y −X| different Y
projections. Numerical dependencies were axiomatised in some special cases; based
on that, Katona showed that branching dependencies have no finite axiomatisation.
It is still an open problem whether logical implication is algorithmically decidable
amongst branching dependencies.
Exercises
18.3-1 Prove Theorem 18.38.
18.3-2 Prove Lemma 18.40.
18.3-3 Prove that if p = q, then Cp,p (Cp,p (X)) = Cp,p (X) holds besides the two
properties of Lemma 18.40.
18.3-4 A C : 2R → 2R mapping is called a closure, if it satisfies the two properties
of Lemma 18.40 and the third one of Exercise 18.3-3. Prove that if C : 2R → 2R
is a closure, and F is the family of dependencies defined by X → Y ⇐⇒ Y ⊆ C(X),
then there exists an Armstrong-relation for F in the family of (1, 1)-dependencies
(functional dependencies) and in the family of (2, 2)-dependencies, respectively.
Problems
Design an O(n2 ) running time algorithm, whose input is schema (R, F ) and output
is a set of dependencies G equivalent with F that has no external attributes.
18-2 The order of the elimination steps in the construction of minimal
cover is important
In the procedure Minimal-Cover(R, F ) the set of functional dependencies was
changed in two ways: either by dropping redundant dependencies, or by dropping
redundant attributes from the left hand sides of the dependencies. If the latter
method is used first, until there is no attribute left that can be dropped from any
left hand side, and only then the first method, then a minimal cover is really obtained,
according to Proposition 18.6. Prove that if the first method is applied first, and then
the second, until there are no more possible applications, respectively, then the ob-
tained set of dependencies is not necessarily a minimal cover of F .
18-3 BCNF subschema
Prove that the following problem is coNP-complete: Given a relational schema R with
set of functional dependencies F , furthermore S ⊂ R, decide whether (S, πS (F )) is
in BCNF.
18-4 3NF is hard to recognise
Let (R, F ) be a relational schema, where F is the system of functional dependen-
cies.
The k size key problem is the following: given a natural number k, determine
whether there exists a key of size at most k.
The prime attribute problem is the following: for a given A ∈ R, determine
whether it is a prime attribute.
a. Prove that the k size key problem is NP-complete. Hint. Reduce the vertex
cover problem to the k size key problem.
b. Prove that the prime attribute problem is NP-complete by reducing the k size
key problem to it.
c. Prove that determining whether the relational schema (R, F ) is in 3NF is NP-
complete. Hint. Reduce the prime attribute problem to it.
Chapter Notes
The relational data model was introduced by Codd [7] in 1970. Functional depen-
dencies were treated in his paper of 1972 [?], their axiomatisation was completed
by Armstrong [?]. The logical implication problem for functional dependencies were
investigated by Beeri and Bernstein [4], furthermore by Maier [19]. Maier also treats
the possible definitions of minimal covers, their connections and the complexity
of their computations in that paper. Maier, Mendelzon and Sagiv found a method
to decide logical implications among general dependencies [20]. Beeri, Fagin and
Howard proved that axiom system (A1)–(A8) is sound and complete for functional
and multivalued dependencies taken together [?]. Yu and Johnson [?] constructed
a relational schema where |F | = k and the number of keys is k!. Békéssy and
Demetrovics [6] gave a simple and beautiful proof for the statement, that from k
functional dependencies at most k! keys can be obtained, thus Yu and Johnson’s
construction is extremal.
Armstrong-relations were introduced and studied by Fagin [?, 13], furthermore
by Beeri, Fagin, Dowd and Statman [5].
Multivalued dependencies were independently discovered by Zaniolo [?], Fagin
[12] and Delobel [8].
The necessity of normal forms was recognised by Codd while studying update
anomalies [?, ?]. The Boyce–Codd normal form was introduced in [?]. The definition
of the third normal form used in this chapter was given by Zaniolo [28]. Complexity
of decomposition into subschemata in certain normal forms was studied by Lucchesi
and Osborne [18], Beeri and Bernstein [4], furthermore Tsou and Fischer [26].
Theorems 18.30 and 18.31 are results of Beeri [3]. Theorem 18.34 is from a
paper of Aho, Beeri and Ullman [2].
Theorems 18.37 and 18.38 are from the book of Abiteboul, Hull and Vianu [1],
the non-existence of finite axiomatisation of join dependencies is Petrov’s result [21].
Branching dependencies were introduced by Demetrovics, Katona and Sali, they
studied existence of Armstrong-relations and the size of minimal Armstrong-relations
[9, 10, 11, 23]. Katona showed in an invited talk at ICDT’92 (Berlin) that there exists
no finite axiomatisation of branching dependencies, but the proof was never published.
Possibilities of axiomatisation of numerical dependencies were investigated by
Grant and Minker [15, 16].
A good introduction to the concepts of this chapter can be found in the books of
Abiteboul, Hull and Vianu [1], Ullman [27], furthermore Thalheim [24], respectively.
19. Query Rewriting in Relational
Databases
19.1. Queries
Consider the database of cinemas in Budapest. Assume that the schema consists of
three relations:
Film
Title Director Actor
Control Antal, Nimród Csányi, Sándor
Control Antal, Nimród Mucsi, Zoltán
Control Antal, Nimród Pindroch, Csaba
. . .                      . . .                      . . .
Rashomon Akira Kurosawa Toshiro Mifune
Rashomon Akira Kurosawa Machiko Kyo
Rashomon Akira Kurosawa Mori Masayuki
Theater
Theater Address Phone
Bem II., Margit Blvd. 5/b. 316-8708
Corvin VIII., Corvin alley 1. 459-5050
Európa VII., Rákóczi st. 82. 322-5419
Művész VI., Teréz blvd. 30. 332-6726
. . .                      . . .                      . . .
Uránia VIII., Rákóczi st. 21. 486-3413
Vörösmarty VIII., Üllői st. 4. 317-4542
Show
Theater Title Time
Bem Rashomon 19:00
Bem Rashomon 21:30
Uránia Control 18:15
Művész Rashomon 16:30
Művész Control 17:00
. . .                      . . .                      . . .
Corvin Control 10:15
These queries define a mapping from the relations of database schema CinePest to
some other schema (in the present case to schemata of single relations). Formally,
query and query mapping should be distinguished. The former is a syntactic
concept, the latter is a mapping from the set of instances over the input schema
to the set of instances over the output schema, that is determined by the query
according to some semantic interpretation. However, for both concepts the word
“query” is used for the sake of convenience, which one is meant will be clear from
the context.
Definition 19.1 Queries q1 and q2 over schema R are said to be equivalent, in
notation q1 ≡ q2 , if they have the same output schema and for every instance I over
schema R, q1 (I) = q2 (I) holds.
In the remainder of this chapter the most important query languages are reviewed.
The expressive powers of query languages need to be compared.
Definition 19.2 Let Q1 and Q2 be query languages (with appropriate semantics).
Q1 is dominated by Q2 ( Q1 is weaker than Q2 ), in notation Q1 ⊑ Q2 , if for
every query q1 of Q1 there exists a query q2 ∈ Q2 , such that q1 ≡ q2 . Q1 and Q2 are
equivalent, if Q1 ⊑ Q2 and Q1 ⊒ Q2 .
Example 19.1 Query. Consider Question 19.2. As a first try the next solution is obtained:

if there exist in relations Film, Theater and Show tuples (xT , ”Akira Kurosawa”, xA ),
(xT h , xAd , xP ) and (xT h , xT , xT i )
then put the tuple (Theater : xT h , Address : xAd ) into the output relation.

xT , xA , xT h , xAd , xP , xT i denote distinct variables that take their values from the domains
of the corresponding attributes, respectively. Using the same variable in different tuples
implicitly marks where identical values should stand.
Datalog – rule based queries The tuple (x1 , x2 , . . . , xm ) is called a free tuple
if the xi ’s are variables or constants. This is a generalisation of a tuple of a relational
instance. For example, the tuple (xT , ”Akira Kurosawa”, xA ) in Example 19.1 is a
free tuple.
Definition 19.3 Let R be a relational database schema. Rule based conjunctive
query is an expression of the following form

ans(u) ← R1 (u1 ), R2 (u2 ), . . . , Rn (un ) ,                                     (19.3)

where R1 , R2 , . . . , Rn are relation names of R, ans is a relation name not in R, and
u, u1 , u2 , . . . , un are free tuples. The atom ans(u) is called the head of the rule, the
list R1 (u1 ), R2 (u2 ), . . . , Rn (un ) is its body, and each Ri (ui ) is called a (relational)
atom. It is assumed that each variable of the head also occurs in some atom of the
body.
A rule can be considered as a tool that tells how we can deduce newer and
newer facts, that is tuples, to include in the output relation. If the variables of
the rule can be assigned such values that each atom Ri (ui ) is true (that is the
appropriate tuple is contained in the relation Ri ), then tuple u is added to the
relation ans. Since all variables of the head occur in some atoms of the body, one
never has to consider infinite domains, since the variables can take values from the
actual instance queried. Formally, let I be an instance over relational schema R,
furthermore let q be the query given by rule (19.3). Let var(q) denote the set of
variables occurring in q, and let dom(I) denote the set of constants that occur in I.
The image of I under q is given by

q(I) = {ν(u) | ν is a valuation of the variables of q with values in dom(I),
               such that ν(ui ) ∈ I(Ri ) for i = 1, 2, . . . , n} .
Proposition 19.4 shows the limitations of rule based queries. For example, the query
Which theatres play only Kurosawa films? is obviously not monotone, hence cannot
be expressed by rules of form (19.3).
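The semantics above can be turned into a naive evaluator in a few lines of Python. The encoding of relations as sets of tuples and of variables as strings starting with '?' is an assumption of this sketch, as is the toy fragment of the CinePest instance used for the test.

from itertools import product

def eval_rule(head, body, I):
    """Naive evaluation of the conjunctive query  ans(head) <- body  on instance I.

    head: free tuple; body: list of (relation name, free tuple) atoms;
    I: dict mapping relation names to sets of tuples.
    """
    def is_var(x):
        return isinstance(x, str) and x.startswith("?")

    answers = set()
    for choice in product(*(I[rel] for rel, _ in body)):   # one tuple per body atom
        valuation, ok = {}, True
        for (rel, free), row in zip(body, choice):
            for x, v in zip(free, row):
                if is_var(x):
                    if valuation.setdefault(x, v) != v:
                        ok = False          # the same variable got two different values
                elif x != v:
                    ok = False              # a constant of the rule does not match
        if ok:
            answers.add(tuple(valuation[x] if is_var(x) else x for x in head))
    return answers

# The rule of Example 19.1 on a toy fragment of the CinePest instance:
I = {"Film":    {("Rashomon", "Akira Kurosawa", "Toshiro Mifune")},
     "Theater": {("Bem", "II., Margit Blvd. 5/b.", "316-8708")},
     "Show":    {("Bem", "Rashomon", "19:00")}}
body = [("Film", ("?t", "Akira Kurosawa", "?a")),
        ("Theater", ("?th", "?ad", "?p")),
        ("Show", ("?th", "?t", "?ti"))]
print(eval_rule(("?th", "?ad"), body, I))   # {('Bem', 'II., Margit Blvd. 5/b.')}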
Tableau queries. If the difference between variables and constants is not con-
sidered, then the body of a rule can be viewed as an instance over the schema.
This leads to a tabular form of conjunctive queries that is most similar to the vi-
sual queries (QBE: Query By Example) of database management system Microsoft
Access.
The summary row u of tableau query (T, u) shows which tuples form the result
of the query. The essence of the procedure is that the pattern given by tableau
T is searched for in the database, and if it is found then the tuple corresponding to it
is included in the output relation. More precisely, the mapping ν : var(T) → dom(I)
is an embedding of tableau (T, u) into instance I, if ν(T) ⊆ I. The output relation
of tableau query (T, u) consists of all tuples ν(u) such that ν is an embedding of tableau
(T, u) into instance I.
The syntax of tableau queries is similar to that of rule based queries. It will
be useful later that conditions for one query to contain another one can be easily
formulated in the language of tableau queries.
1 The relational algebra∗ is the monotone part of the (full) relational algebra introduced later.
be given by the list of pairs (A, f (A)), where A ≠ f (A), which is usually written
in the form A1 A2 . . . An → B1 B2 . . . Bn . The renaming operator δf maps
from inputs over U to outputs over f [U ]. If R is a relation over U , then
Example 19.3 Relational algebra∗ query. The question 19.2. of the introduction can be
expressed with relational algebra operations as follows.
The mapping given by a relational algebra∗ query can be easily defined via induction
on the operation tree. It is easy to see (Exercise 19.1-2) that non-satisfiable queries
can be given using relational algebra∗ . There exist no rule based or tableau query
equivalent with such a non-satisfiable query. Nevertheless, the following is true.
Theorem 19.6 Rule based queries, tableau queries and satisfiable relational
algebra∗ are equivalent query languages.
The proof of Theorem 19.6 consists of three main parts:
1. Rule based ≡ Tableau
2. Satisfiable relational algebra∗ v Rule based
3. Rule based v Satisfiable relational algebra∗
The first (easy) step is left to the Reader (Exercise 19.1-3). For the second step, it has
to be seen first, that the language of rule based queries is closed under composition.
More precisely, let R = {R1 , R2 , . . . , Rn } be a database, q be a query over R. If the
output relation of q is S1 , then in a subsequent query S1 can be used in the same
way as any extensional relation of R. Thus relation S2 can be defined, then with its
help relation S3 can be defined, and so on. Relations Si are intensional relations.
The conjunctive query program P is a list of rules
S1 (u1 ) ← body1
S2 (u2 ) ← body2
.. (19.5)
.
Sm (um ) ← bodym ,
where Si ’s are pairwise distinct and not contained in R. In rule body bodyi only
relations R1 , R2 , . . . Rn and S1 , S2 , . . . , Si−1 can occur. Sm is considered to be the
output relation of P , and its evaluation is done by computing the results of the rules
one-by-one in order. It is not hard to see that with appropriate renaming of the variables
P can be substituted by a single rule, as it is shown in the following example.
Example 19.4 Conjunctive query program. Let R = {Q, R}, and consider the following
conjunctive query program
S2 can be written using Q and R only by the first two rules of (19.6)
is obtained.
respectively,
are the rules sought. The satisfiability of relational algebra∗ query is used here.
Indeed, during composition of operations we never obtain an expression where
two distinct constants should be equated.
πAi1 ,Ai2 ,...,Aim (R): If R = R(A1 , A2 , . . . , An ), then
works.
A1 A2 . . . An → B1 B2 . . . Bn : The renaming operation of relational algebra∗ can
be achieved by renaming the appropriate variables, as it was shown in Exam-
ple 19.4.
ans(x̄) ← R1 (x̄1 ), R2 (x̄2 ), . . . , Rn (x̄n ) .                                  (19.9)
By renaming the attributes of the relations Ri , we may assume without loss of gener-
ality that all attribute names are distinct. Then R = R1 ⋈ R2 ⋈ · · · ⋈ Rn can be
constructed, which is really a direct product, since the attribute names are distinct.
The constants and multiple occurrences of variables of rule (19.9) can be simulated
by appropriate selection operators. The final result is obtained by projecting to the
set of attributes corresponding to the variables of relation ans.
19.1.2. Extensions
Conjunctive queries are a class of query languages that has many good properties.
However, the set of expressible questions are rather narrow. Consider the following.
19.4. List those pairs where one member directed the other member in a film,
and vice versa, the other member also directed the first in a film.
19.5. Which theatres show “La Dolce Vita” or “Rashomon”?
19.6. Which are those films of Hitchcock that Hitchcock did not play a part in?
19.7. List those films whose every actor played in some film of Fellini.
19.8. Let us recall the game “Chain-of-Actors”. The first player names an ac-
tor/actress, the next another one who played in some film together with the first
named. This is continued like that, always a new actor/actress has to be named
who played together with the previous one. The winner is the player who can
extend the chain last. List those actors/actresses who can be reached
by “Chain-of-Actors” starting with “Marcello Mastroianni”.
Equality atoms. Question 19.4. can be easily answered if equalities are also
allowed in rule bodies, besides relational atoms:
Allowing equalities raises two problems. First, the result of the query could become
infinite. For example, the rule based query
results in an infinite number of tuples, since variables y and z are not bounded by
relation R, thus there can be an infinite number of evaluations that satisfy the rule
body. Hence, the concept of domain restricted query is introduced. Rule based
query q is domain restricted, if all variables that occur in the rule body also occur
in some relational atom.
The second problem is that equality atoms may cause the body of a rule to become
unsatisfiable, in which case the output of the query is empty on every instance. The
following procedure checks the satisfiability of a rule body with equality atoms.
Satisfiable(q)
1 Compute the transitive closure of equalities of the body of q.
2 if Two distinct constants should be equal by transitivity
3 then return “Not satisfiable.”
4 else return “Satisfiable.”
It is also true (Exercise 19.1-4) that if a rule based query q that contains equality
atoms is satisfiable, then there exists another rule based query q ′ without equalities
that is equivalent with q.
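The transitive closure of the equality atoms can be maintained with a small union–find structure; the following sketch only decides satisfiability, it does not build the equivalent equality-free rule. All names and the example body are assumptions of the illustration.

def satisfiable(constants, equalities):
    """Satisfiability of a rule body with equality atoms (procedure Satisfiable(q)).

    constants: set of constants occurring in the body; equalities: list of pairs
    required to be equal.  Returns False iff two distinct constants are forced
    to be equal by the transitive closure of the equalities.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]      # path halving
            x = parent[x]
        return x

    for x, y in equalities:
        rx, ry = find(x), find(y)
        if rx != ry:
            if rx in constants:                # prefer a constant as representative
                parent[ry] = rx
            else:
                parent[rx] = ry
    seen = {}                                  # representative -> constant found in its class
    for x in list(parent):
        if x in constants:
            r = find(x)
            if r in seen and seen[r] != x:
                return False                   # two distinct constants in one class
            seen[r] = x
    return True

# The body  R(x, y), x = "a", y = x, y = "b"  is not satisfiable:
print(satisfiable({"a", "b"}, [("x", "a"), ("y", "x"), ("y", "b")]))   # False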
Theorem 19.8 The language of non-recursive datalog programs with unique out-
put relation and the relational algebra extended with the union operator are equiva-
lent.
The proof of Theorem 19.8 is similar to that of Theorem 19.6 and it is left to
the Reader (Exercise 19.1-5). Let us note that the expressive power of the union
of tableau queries is weaker. This is caused by the requirement having the same
summary row for each tableau. For example, the non-recursive datalog program
query
ans(a) ←
(19.16)
ans(b) ←
cannot be realised as union of tableau queries.
Negation. The query 19.6. is obviously not monotone. Indeed, suppose that in
relation Film there exist tuples about Hitchcock’s film Psycho, for example (“Psy-
cho”,”A. Hitchcock”,”A. Perkins”), (“Psycho”,”A. Hitchcock”,”J. Leigh”), . . . , how-
ever, the tuple (“Psycho”,”A. Hitchcock”,”A. Hitchcock”) is not included. Then the
tuple (“Psycho”) occurs in the output of query 19.6. With some effort one can real-
ize however, that Hitchcock appears in the film Psycho, as “a man in cowboy hat”.
If the tuple (“Psycho”,”A. Hitchcock”,”A. Hitchcock”) is added to relation Film as
a consequence, then the instance of schema CinePest gets larger, but the output
of query 19.6. becomes smaller.
It is not too hard to see that the query languages discussed so far are monotone,
hence query 19.6. cannot be formulated with a non-recursive datalog program or with
one of its equivalents. Nevertheless, if the difference (−) operator is also added to
relational algebra, then it becomes capable of expressing queries of type 19.6. For
example,

πTitle (σDirector=”A. Hitchcock” (Film)) − πTitle (σDirector=”A. Hitchcock”∧Actor=”A. Hitchcock” (Film))

realises exactly query 19.6. Hence, the (full) relational algebra consists of operations
{σ, π, 1, δ, ∪, −}. The importance of the relational algebra is shown by the fact, that
Codd calls a query language Q relationally complete, exactly if for all relational
algebra query q there exists q 0 ∈ Q, such that q ≡ q 0 .
If negative literals, that is, atoms of the form ¬R(u), are also allowed in rule
bodies, then the obtained non-recursive datalog with negation, in notation nr-
datalog¬, is relationally complete.
Definition 19.9 A non-recursive datalog¬ (nr-datalog¬) rule is of the form
q : S(u) ← L1, L2, . . . , Ln ,                                           (19.18)
The semantics of rule (19.18) is as follows. Let R be a relational schema that contains
all relations occurring in the body of q, furthermore, let I be an instance over R.
The image of I under q is
q(I) = {ν(u) | ν is a valuation of the variables and for i = 1, 2, . . . , n:
               ν(ui) ∈ I(Ri) if Li = Ri(ui), and                          (19.19)
               ν(ui) ∉ I(Ri) if Li = ¬Ri(ui)} .
A nr-datalog¬ program over schema R is a collection of nr-datalog¬ rules
S1(u1) ← body1
S2(u2) ← body2
    ⋮                                                                     (19.20)
Sm(um) ← bodym ,
where relations of schema R do not occur in heads of rules, the same relation may
appear in more than one rule head, and furthermore there exists an ordering r1, r2, . . . , rm
of the rules such that the relation in the head of rule ri does not occur in the body
of any rule rj for j ≤ i.
The computation of the result of nr-datalog¬ program (19.20) applied to instance
I over schema R can be done in the same way as the evaluation of non-recursive
datalog program (19.15), with the difference that the individual nr-datalog¬ rules
should be interpreted according to (19.19).
Example 19.5 Nr-datalog¬ program. Let us assume that all films that are included in
relation Film have only one director. (This is not always true in real life!) The nr-datalog¬
rule
ans(x) ← Film(x, "A. Hitchcock", z), ¬Film(x, "A. Hitchcock", "A. Hitchcock")  (19.21)
answers query 19.6. However, if query 19.7. is formalised by the nr-datalog¬ program
Bad-not-ans(x) ← Film(x, y, z), ¬Film(x′, "F. Fellini", z), Film(x′, "F. Fellini", z′)
Answer(x) ← Film(x, y, z), ¬Bad-not-ans(x) ,
                                                                          (19.23)
then (19.23) answers the following query (assuming that all films have a unique director)
19.9. List all those films whose every actor played in each film of Fellini,
instead of query 19.7.
It is easy to see that every satisfiable nr-datalog¬ program that contains equality
atoms can be replaced by one without equalities. Furthermore the following propo-
sition is true, as well.
Claim 19.10 The satisfiable (full) relational algebra and the nr-datalog¬ programs
with single output relation are equivalent query languages.
Recursion. Query 19.8. cannot be formulated using the query languages in-
troduced so far. Some a priori information would be needed about how long a
chain-of-actors could be formed starting with a given actor/actress. Let us assume
that the maximum length of a chain-of-actors starting from “Marcello Mastroianni”
is 117. (It would be interesting to know the real value!) Then, the following non-
recursive datalog program gives the answer.
Film-partner(z1, z2) ← Film(x, y, z1), Film(x, y, z2), z1 < z2 ²
Partial-answer1(z) ← Film-partner(z, "Marcello Mastroianni")
Partial-answer1(z) ← Film-partner("Marcello Mastroianni", z)
Partial-answer2(z) ← Film-partner(z, y), Partial-answer1(y)
Partial-answer2(z) ← Film-partner(y, z), Partial-answer1(y)
    ⋮
Partial-answer117(z) ← Film-partner(z, y), Partial-answer116(y)           (19.24)
Partial-answer117(z) ← Film-partner(y, z), Partial-answer116(y)
Mastroianni-chain(z) ← Partial-answer1(z)
Mastroianni-chain(z) ← Partial-answer2(z)
    ⋮
Mastroianni-chain(z) ← Partial-answer117(z)
It is much easier to express query 19.8. using recursion. In fact, the transitive
closure of the graph Film-partner needs to be calculated. For the sake of simplicity
the definition of Film-partner is changed a little (thus approximately doubling the
storage requirement).
Film-partner(z1 , z2 ) ← Film(x, y, z1 ), Film(x, y, z2 )
Chain-partner(x, y) ← Film-partner(x, y) (19.25)
Chain-partner(x, y) ← Film-partner(x, z), Chain-partner(z, y) .
The datalog program (19.25) is recursive, since the definition of relation Chain-
partner uses the relation itself. Let us suppose for a moment that this is meaningful,
then query 19.8. is answered by rule
Mastroianni-chain(x) ← Chain-partner(x, “Marcello Mastroianni”) (19.26)
² Arbitrary comparison atoms can be used as well, similarly to equality atoms. Here z1 < z2 makes
sure that every pair occurs at most once in the list.
Proof Let I and J be instances over sch(P) such that I ⊆ J. Let A be a fact
of TP(I). If A ∈ I(R) for some relation R ∈ sch(P), then A ∈ J(R) is implied by
I ⊆ J. On the other hand, if A ← A1, A2, . . . , An is a realisation of a rule in P and
each Ai is in I, then each Ai is in J as well, hence the same realisation shows that A ∈ TP(J).
Theorem 19.13 For every instance I over schema sch(P) there exists a unique
minimal instance K with I ⊆ K that is a fixpoint of TP, i.e. K = TP(K).
Proof Let TP^i(I) denote the result of applying the operator TP to I consecutively i times, and let
    K = ⋃_{i=0}^{∞} TP^i(I) .                                             (19.29)
that is, K is a fixpoint. It is easy to see that every fixpoint that contains I also
contains TP^i(I) for all i = 1, 2, . . ., that is, it contains K as well.
It can be seen (Exercise 19.1-6) that the chain in (19.28) is finite, that is, there
exists an n such that TP(TP^n(I)) = TP^n(I). The naive evaluation of the result of the
datalog program is based on this observation.
Naiv-Datalog(P ,I)
1 K ← I
2 while TP(K) ≠ K
3 do K ← TP (K)
4 return K
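The following Python sketch mirrors Naiv-Datalog for the recursive program (19.25)–(19.26); the set-of-pairs encoding and the sample Film-partner facts are illustrative assumptions, not data from the text.

def immediate_consequence(film_partner, chain_partner):
    """One application of TP to the current idb instance."""
    new_chain = set(film_partner)                 # Chain-partner(x,y) <- Film-partner(x,y)
    for (x, z) in film_partner:                   # Chain-partner(x,y) <- Film-partner(x,z), Chain-partner(z,y)
        for (z2, y) in chain_partner:
            if z == z2:
                new_chain.add((x, y))
    return new_chain

def naive_datalog(film_partner):
    chain_partner = set()
    while True:
        new = immediate_consequence(film_partner, chain_partner)
        if new == chain_partner:                  # fixpoint reached
            return chain_partner
        chain_partner = new

# hypothetical, symmetric Film-partner instance (Film-partner is symmetric by (19.25))
fp = {("Marcello Mastroianni", "Sophia Loren"), ("Sophia Loren", "Marcello Mastroianni"),
      ("Sophia Loren", "Cary Grant"), ("Cary Grant", "Sophia Loren")}
chain = naive_datalog(fp)
# Mastroianni-chain(x) <- Chain-partner(x, "Marcello Mastroianni"), as in (19.26)
print(sorted(x for (x, y) in chain if y == "Marcello Mastroianni"))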
Procedure Naiv-Datalog is not optimal, of course, since every fact that be-
comes included in K is calculated again and again at every further execution of the
while loop.
The idea of Semi-naiv-datalog is that it tries to use in the while loop only the recently
calculated new facts, as much as possible, thus avoiding the recomputation
of known facts. Consider datalog program P with edb(P ) = R, and idb(P ) = T. For
a rule
S(u) ← R1 (v1 ), . . . , Rn (vn ), T1 (w1 ), . . . , Tm (wm ) (19.31)
of P where Rk ∈ R and Tj ∈ T, the following rules are constructed for j = 1, 2, . . . , m
and i ≥ 1
tempS^{i+1}(u) ← R1(v1), . . . , Rn(vn),
                 T1^i(w1), . . . , T_{j-1}^i(w_{j-1}), ΔTj^i(wj), T_{j+1}^{i-1}(w_{j+1}), . . . , Tm^{i-1}(wm) .
                                                                          (19.32)
Relation ΔTj^i denotes the change of Tj in iteration i. The union of the rules corresponding
to S in layer i is denoted by PS^i, that is, the rules of form (19.32) for tempS^{i+1}, j =
1, 2, . . . , m. Assume that the list of idb relations occurring in the rules defining the idb
relation S is T1, T2, . . . , Tℓ. Let
denote the set of facts (tuples) obtained by applying the rules (19.32) to the input instance
I and to the idb relations Tj^{i-1}, Tj^i, ΔTj^i. The input instance I is the actual value of the
edb relations of P.
Semi-naiv-datalog(P, I)
 1  P′ ← those rules of P whose body does not contain any idb relation
 2  for S ∈ idb(P)
 3     do S^0 ← ∅
 4        ΔS^1 ← P′(I)(S)
 5  i ← 1
 6  repeat
 7     for S ∈ idb(P)
 8        T1, . . . , Tℓ are the idb relations of the rules defining S.
 9        do S^i ← S^{i-1} ∪ ΔS^i
10           ΔS^{i+1} ← PS^i(I, T1^{i-1}, . . . , Tℓ^{i-1}, T1^i, . . . , Tℓ^i, ΔT1^i, . . . , ΔTℓ^i) − S^i
11     i ← i + 1
12  until ΔS^i = ∅ for all S ∈ idb(P)
13  for S ∈ idb(P)
14     do S ← S^i
15  return S
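For the special case of a transitive-closure program the semi-naive strategy can be sketched in a few lines of Python; the set-of-pairs encoding and the toy input are assumptions of this sketch. Only the tuples derived in the previous round (the delta) are joined with the edb relation in each iteration.

def semi_naive_closure(film_partner):
    chain = set(film_partner)        # result of the non-recursive rule; this is Delta^1
    delta = set(film_partner)
    while delta:
        # recursive rule: join the edb relation only with the freshly derived facts
        new = {(x, y) for (x, z) in film_partner for (z2, y) in delta if z == z2}
        delta = new - chain          # Delta^{i+1}: keep genuinely new tuples only
        chain |= delta
    return chain

fp = {("a", "b"), ("b", "c"), ("c", "d")}
print(sorted(semi_naive_closure(fp)))
# [('a','b'), ('a','c'), ('a','d'), ('b','c'), ('b','d'), ('c','d')]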
Proof We show by induction on i that after the i-th execution of the loop of lines 6–12
the value of S^i is TP^i(I)(S), while ΔS^{i+1} is equal to TP^{i+1}(I)(S) − TP^i(I)(S),
for arbitrary S ∈ idb(P). Here TP^i(I)(S) is the result obtained for S starting from I and
applying the immediate consequence operator TP i times.
For i = 0, line 4 calculates exactly TP(I)(S) for all S ∈ idb(P).
In order to prove the induction step, one only needs to see that
PS^i(I, T1^{i-1}, . . . , Tℓ^{i-1}, T1^i, . . . , Tℓ^i, ΔT1^i, . . . , ΔTℓ^i) ∪ S^i is exactly equal to TP^{i+1}(I)(S),
since in lines 9–10 procedure Semi-naiv-datalog constructs S^i and ΔS^{i+1} using
it. The value of S^i is TP^i(I)(S) by the induction hypothesis. Additional new
tuples are obtained only if, for some idb relation occurring in a rule defining S, tuples
constructed at the last application of TP are used, and these are contained in the relations
ΔT1^i, . . . , ΔTℓ^i, also by the induction hypothesis.
The halting condition of line 12 means exactly that all relations S ∈ idb(P ) are
unchanged during the application of the immediate consequence operator TP , thus
the algorithm has found its minimal fixpoint. This latter one is exactly the result of
datalog program P on input instance I according to Definition 19.14.
Improved-Semi-Naiv-Datalog(P ,I)
1 Determine the equivalence classes of idb(P ) under mutual recursivity.
2 List the equivalence classes [R1 ], [R2 ], . . . , [Rn ]
according to a topological order of GP .
3 There exists no directed path from Rj to Ri in GP for all i < j.
4 for i ← 1 to n
5 do Use Semi-naiv-datalog to compute relations of [Ri ]
taking relations of [Rj ] as edb relations for j < i.
Lines 1–2 can be executed in time O(v_{GP} + e_{GP}) using depth-first search, where
v_{GP} and e_{GP} denote the number of vertices and edges of graph GP, respectively.
The proof of correctness of the procedure is left to the Reader (Exercise 19.1-8).
Full-Tuplewise-Join(R1 , R2 )
1 S ← ∅
2 for all u ∈ R1
3 do for all v ∈ R2
4 do if u and v can be joined
5 then S ← S ∪ {u 1 v}
6 return S
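A Python sketch of Full-Tuplewise-Join is given below, assuming that tuples are represented as attribute–value dictionaries; the sample relations are hypothetical and only illustrate the joinability test.

def full_tuplewise_join(r1, r2):
    result = []
    for u in r1:
        for v in r2:
            common = u.keys() & v.keys()
            if all(u[a] == v[a] for a in common):   # u and v can be joined
                result.append({**u, **v})           # the joined tuple u ⋈ v
    return result

r1 = [{"Title": "Psycho", "Director": "A. Hitchcock"}]
r2 = [{"Title": "Psycho", "Theater": "Odeon"},
      {"Title": "Rashomon", "Theater": "Corvin"}]
print(full_tuplewise_join(r1, r2))
# [{'Title': 'Psycho', 'Director': 'A. Hitchcock', 'Theater': 'Odeon'}]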
A substitution is a mapping from the set of variables to the union of the sets of
variables and constants, extended to constants as the identity. The extension of a
substitution to free tuples and tableaux can be defined naturally.
Definition 19.17 Let q = (T, u) and q′ = (T′, u′) be two tableau queries over
schema R. Substitution θ is a homomorphism from q′ to q, if θ(T′) = T and
θ(u′) = u.
q. Let T′ = θ ◦ λ(T). It is easy to check that (T′, u) ≡ q and |T′| ≤ |S|. But (S, v)
is minimal, hence (T′, u) is minimal as well.
Example 19.6 Application of tableau minimisation. Consider the relational algebra expression
q = πAB(σB=5(R)) ⋈ πBC(πAB(R) ⋈ πAC(σB=5(R)))                             (19.34)
over the schema R of attribute set {A, B, C}. The tableau query corresponding to q is the
following tableau T:
      R    A    B    C
           x    5    z1
           x1   5    z2                                                   (19.35)
           x1   5    z
      u    x    5    z
A homomorphism is sought that maps some rows of T to other rows of T, thus
"folding" the tableau. The first row cannot be eliminated, since the homomorphism
is the identity on the free tuple u, thus x must be mapped to itself. The situation is similar with
the third row, as the image of z is itself under any homomorphism. However, the second
row can be eliminated by mapping x1 to x and z2 to z, respectively. Thus, the minimal
tableau equivalent with T consists of the first and third rows of T. Transforming back to
a relational algebra expression, query (19.36)
is obtained. Query (19.36) contains one join operator fewer than query (19.34).
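The folding argument of Example 19.6 can be automated by a brute-force search for homomorphisms; the following Python sketch uses an assumed encoding (variables as strings, constants as numbers) and reproduces the minimisation of tableau (19.35).

from itertools import product

def is_var(t):
    return isinstance(t, str)            # variables are strings in this toy encoding

def apply(theta, row):
    return tuple(theta.get(t, t) if is_var(t) else t for t in row)

def homomorphism_exists(src_rows, dst_rows, summary):
    """Is there a substitution, identity on the summary variables and on constants,
    mapping every row of src_rows onto some row of dst_rows?"""
    variables = sorted({t for row in src_rows for t in row if is_var(t)})
    fixed = {t for t in summary if is_var(t)}
    terms = sorted({t for row in dst_rows for t in row}, key=str)
    free = [v for v in variables if v not in fixed]
    for images in product(terms, repeat=len(free)):
        theta = dict(zip(free, images))
        theta.update({v: v for v in fixed})
        if all(apply(theta, row) in dst_rows for row in src_rows):
            return True
    return False

def minimise(rows, summary):
    rows, changed = set(rows), True
    while changed:
        changed = False
        for r in list(rows):
            smaller = rows - {r}
            if smaller and homomorphism_exists(rows, smaller, summary):
                rows, changed = smaller, True
                break
    return rows

T = {('x', 5, 'z1'), ('x1', 5, 'z2'), ('x1', 5, 'z')}   # tableau (19.35)
u = ('x', 5, 'z')                                       # summary row
print(minimise(T, u))   # only the first and third rows remain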
The next theorem states that the question of tableau containment and equiva-
lence is NP-complete, hence tableau minimisation is an NP-hard problem.
Theorem 19.20 For given tableau queries q and q′ the following decision problems
are NP-complete:
19.10. q ⊑ q′ ?
19.11. q ≡ q′ ?
19.12. Assume that q′ is obtained from q by removing some free tuples. Is it true
then that q ≡ q′ ?
Proof The Exact-Cover problem will be reduced to the various tableau problems.
The input of the Exact-Cover problem is a finite set X = {x1, x2, . . . , xn} and a
collection S = {S1, S2, . . . , Sm} of its subsets. It has to be determined whether there
exists S′ ⊆ S such that the subsets in S′ cover X exactly (that is, for all x ∈ X there
exists exactly one S ∈ S′ such that x ∈ S). Exact-Cover is known to be an
NP-complete problem.
Let E = (X, S) be an input of Exact-Cover. A construction is sketched that
produces a pair qE, q′E of tableau queries for E in polynomial time. This construction
can be used to prove the various NP-completeness results.
Let the schema R consist of the pairwise distinct attributes
A1, A2, . . . , An, B1, B2, . . . , Bm. qE = (TE, t) and q′E = (T′E, t) are tableau
queries over schema R such that the summary row of both of them is the free tuple
t = ⟨A1 : a1, A2 : a2, . . . , An : an⟩, where a1, a2, . . . , an are pairwise distinct
variables.
Let b1, b2, . . . , bm, c1, c2, . . . , cm be further pairwise distinct variables.
Tableau TE consists of n rows, one corresponding to each element of X. In the row
of xi, ai stands in the column of attribute Ai, while bj stands in the column of attribute Bj
for all j such that xi ∈ Sj. In the other positions of tableau TE pairwise distinct
new variables stand.
Similarly, T′E consists of m rows, one corresponding to each element of S. In the row of Sj,
ai stands in the column of attribute Ai for all i such that xi ∈ Sj,
furthermore cj′ stands in the column of attribute Bj′ for all j′ ≠ j. In the other positions
of tableau T′E pairwise distinct new variables stand.
The NP-completeness of problem 19.10. follows from the fact that X has an exact cover
using sets of S if and only if q′E ⊑ qE holds. The proof, and the proofs of the NP-
completeness of problems 19.11. and 19.12., are left to the Reader (Exercise 19.1-9).
Exercises
19.1-1 Prove Proposition 19.4, that is, that every rule based query q is monotone and
satisfiable. Hint. For the proof of satisfiability let K be the set of constants occurring
in query q, and let a ∉ K be another constant. For every relation schema Ri in rule
(19.3) construct all tuples (a1, a2, . . . , ar), where ai ∈ K ∪ {a} and r is the arity of
Ri. Let I be the instance obtained this way. Prove that q(I) is nonempty.
19.1-2 Give a relational schema R and a relational algebra query q over schema
R whose result is empty for every instance over R.
19.1-3 Prove that the languages of rule based queries and tableau queries are
equivalent.
19.1-4 Prove that every rule based query q with equality atoms is either equivalent
with the empty query q∅, or there exists a rule based query q′ without equality atoms
such that q ≡ q′. Give a polynomial time algorithm that determines for a rule based
query q with equality atoms whether q ≡ q∅ holds, and if not, then constructs a rule
based query q′ without equality atoms such that q ≡ q′.
19.1-5 Prove Theorem 19.8 by generalising the proof idea of Theorem 19.6.
19.1-6 Let P be a datalog program, I be an instance over edb(P), and let C(P, I) be the
(finite) set of constants occurring in I and P. Let B(P, I) be the following instance
over sch(P):
1. For every relation R of edb(P) the fact R(u) is in B(P, I) iff it is in I, further-
more
2. for every relation R of idb(P) every fact R(u) constructed from constants of
C(P, I) is in B(P, I).
Prove that
I ⊆ TP(I) ⊆ TP^2(I) ⊆ TP^3(I) ⊆ · · · ⊆ B(P, I).                          (19.37)
19.2. Views
A database system architecture has three main levels:
• physical layer;
• logical layer;
• outer (user) layer.
The goal of the separation of levels is to reach data independence and the convenience
of users. The three views on Figure 19.2 show possible user interfaces: multirelational,
universal relation and graphical interface.
The physical layer consists of the actually stored data files and the dense and
sparse indices built over them.
The separation of the logical layer from the physical layer makes it possible
for the user to concentrate on the logical dependencies of the data, which approx-
imates the image of the reality to be modelled better. The logical layer consists of
the database schema description together with the various integrity constraints, de-
pendencies. This is the layer where the database administrators work with the system.
The connection between the physical layer and the logical layer is maintained by the
database engine.
The goal of the separation of the logical layer and the outer layer is that the
end users can see the database according to their (narrow) needs and requirements.
For example, a very simple view of the outer layer of a bank database could be the
automatic teller machine, or a much more complex view could be the credit history
of a client for loan approval.
[Figure 19.2. The three levels of a database system: outer layer, logical layer, physical layer.]
between views and relations, as well. The relations defined by rules are called inten-
sional, because these are the relations that do not have to exist on external storage
devices, that is, to exist extensionally, in contrast to the extensional relations.
Definition 19.21 An expression V given in some query language Q over schema
R is called a view.
Example 19.7 SQL view. Views in the database manipulation language SQL can be given in
the following way. Suppose that the only data interesting for us from schema CinePest is
where and when Kurosawa's films are shown. The view KurosawaTimes is given by the
SQL command
KurosawaTimes
1 create view KurosawaTimes as
2 select Theater, Time
3 from Film, Show
4 where Film.Title=Show.Title and Film.Director="Akira Kurosawa"
Having defined view V, it can be used in further queries or view definitions like
any other (extensional) relation.
Exercises
19.2-1 Consider the following schema:
FilmStar(Name,Address,Gender,BirthDate)
FilmMogul(Name,Address,Certificate#,Assets)
Studio(Name,Address,PresidentCert#) .
Relation FilmMogul contains data of the big players of the film business (studio pres-
idents, producers, etc.). The attribute names speak for themselves; Certificate# is
the number of the certificate of the film mogul, PresidentCert# is the certificate
number of the president of the studio. Give the definitions of the following views
using datalog rules, relational algebra expressions, and SQL:
19.2-2 Give the definitions of the following views over schema CinePest using
datalog rules, relational algebra expressions, and SQL:
19.3.1. Motivation
Before query rewriting algorithms are studied in detail, some motivating examples
of applications are given. The following university database is used throughout this
section.
University = {Professor,Course,Teach,Registered,Major,Affiliation,Supervisor} .
(19.41)
It will be faster to compute (19.45) than to compute (19.43), because the natural
join of relations Registered and Course has already been done by view Graduate;
furthermore, it has filtered out the undergraduate courses (which make up the bulk of
the registration data at most universities). It is worth noting that view Graduate could be
used even though it did not agree syntactically with any part of query (19.43).
On the other hand, it may happen that the original query can be answered
faster. If relations Registered and Course have an index on attribute C-Number, but
there exists no index built for Graduate, then it could be faster to answer query
(19.43) directly from the database relations. Thus, the real challenge is not only
to decide whether a materialised view could logically be used to answer a query,
but a thorough cost analysis must also be done to decide when it is worth using
the existing views.
ans(Student,Department) ←Registered(Student,C-number,y),
Major(Student,Department), (19.46)
C-number ≥ 500 .
The query can be answered in two ways. First, since Student.name uniquely identifies
a student, we can take the join of G1 and G2, then apply the selection
Course.c-number ≥ 500, and finally a projection eliminates the unnecessary attributes.
def.gmap G1 as b+ -tree by
given Student.name
select Department
where Student major Department
def.gmap G2 as b+ -tree by
given Student.name
select Course.c-number
where Student registered Course
def.gmap G3 as b+ -tree by
given Course.c-number
select Department
where Student registered Course and Student major Department
The other execution plan could be to join G3 with G2 and select Course.c-number ≥
500. In fact, this solution may even be more efficient because G3 has an index on
the course number and therefore the intermediate joins may be much smaller.
Suppose we have the following two data sources. The first source provides a listing of
all the courses entitled "Database Systems" taught anywhere and their instructors.
This source can be described by the following view definition:
Note that courses that are neither Ph.D.-level courses nor database courses will not be
returned as answers. Whereas in the contexts of query optimisation and maintain-
ing physical data independence the focus is on finding a query expression that is
equivalent with the original query, here we attempt to find a query expression that
provides the maximal answers from the views.
Example 19.8 Query rewriting. Consider the following query Q and view V .
Q : q(X, U ) ← p(X, Y ), p0 (Y, Z), p1 (X, W ), p2 (W, U )
(19.52)
V : v(A, B) ← p(A, C), p0 (C, B), p1 (A, D) .
Q can be rewritten using V :
View V replaces the first two literals of query Q. Note that the view certainly satisfies the
third literal of the query as well. However, it cannot be removed from the rewriting, since
variable D does not occur in the head of V; thus, if the literal p1 were removed too, then
the natural join of p1 and p2 would not be enforced anymore.
Since in some of the applications the database relations are inaccessible and only the
views can be accessed, for example in the case of data integration or data warehouses,
the concept of complete rewriting is introduced.
Definition 19.23 A rewriting Q′ of query Q using views V = {V1, V2, . . . , Vm} is
called a complete rewriting, if Q′ contains only literals of V and comparison
atoms.
Example 19.9 Complete rewriting. Assume that besides view V of Example 19.8 the
following view is given as well:
It is important to see that this rewriting cannot be obtained step-by-step, first using only
V and then trying to incorporate V2 (or in the opposite order), since relation p0 of V2 does
not occur in Q′. Thus, in order to find the complete rewriting, the use of the two views must be
considered in parallel, at the same time.
There is a close connection between finding a rewriting and the problem of query
containment. This latter one was discussed for tableau queries in section 19.1.3. Ho-
momorphism between tableau queries can be defined for rule based queries, as well.
The only difference is that it is not required in this section that a homomorphism
maps the head of one rule to the head of the other. (The head of a rule corresponds
to the summary row of a tableau.) According to Theorem 19.20 it is NP-complete to
decide whether conjunctive query Q1 contains another conjunctive query Q2 . This
remains true in the case when Q2 may contain comparison atoms as well. However,
if both Q1 and Q2 may contain comparison atoms, then the existence of a homo-
morphism from Q1 to Q2 is only a sufficient, but not necessary, condition for the
containment of the queries, which is a Π2^p-complete problem in that case. The discussion
of this latter complexity class is beyond the scope of this chapter, thus it is omitted.
The next claim gives a necessary and sufficient condition for the existence of
a rewriting of query Q using view V.
Claim 19.24 Let Q and V be conjunctive queries that may contain comparison
atoms. There exists a rewriting of query Q using view V if and only if π∅(Q) ⊆
π∅(V), that is, the projection of V to the empty attribute set contains that of Q.
Proof Observe that π∅(Q) ⊆ π∅(V) is equivalent with the following statement: if
the output of V is empty for some instance, then the output of Q is empty for that
instance as well.
Assume first that there exists a rewriting, that is, a rule equivalent with Q that
contains V in its body. If r is an instance such that the result of V over r is empty,
then every rule that includes V in its body gives an empty result over r, too.
In order to prove the other direction, assume that whenever the output of V is empty
for some instance, the output of Q is empty for that instance as well. Let
Let ỹ be a list of variables disjoint from the variables of x̃. Then the query Q′ defined
by
Q′ : q′(x̃) ← q1(x̃), q2(x̃), . . . , qm(x̃), v1(ỹ), v2(ỹ), . . . , vn(ỹ)              (19.57)
satisfies Q ≡ Q′. It is clear that Q′ ⊆ Q. On the other hand, if some valuation of the
variables of x̃ yields a tuple in the output of Q over an instance r, then the output of V
over r is nonempty, hence there exists a valuation of the variables of ỹ that satisfies
the body of V over r; fixing this latter valuation, the same tuple is obtained in the
output of Q′.
Theorem 19.25 Let Q be a conjunctive query that may contain comparison atoms,
and let V be a set of views. If the views in V are given by conjunctive queries that do
not contain comparison atoms, then it is NP-complete to decide whether there exists
a rewriting of Q using V.
The proof of Theorem 19.25 is left for the Reader (Exercise 19.3-1).
The proof of Claim 19.24 uses new variables. However, according to the
next lemma, this is not necessary. Another important observation is that when a locally
minimal rewriting is sought, it is enough to consider a subset of the database relations
occurring in the original query; new database relations need not be introduced.
Lemma 19.26 Let Q be a conjunctive query that does not contain comparison
atoms
Q : q(X̃) ← p1 (Ũ1 ), p2 (Ũ2 ), . . . , pn (Ũn ) , (19.58)
furthermore let V be a set of views that do not contain comparison atoms either.
1. If Q′ is a locally minimal rewriting of Q using V, then the set of database
literals in Q′ is isomorphic to a subset of the database literals occurring in Q.
2. If
such that Ỹ′1 ∪ · · · ∪ Ỹ′k ⊆ Ũ1 ∪ · · · ∪ Ũn, that is, the rewriting does not
introduce new variables.
The details of the proof of Lemma 19.26 are left for the Reader (Exercise 19.3-2).
The next lemma is of fundamental importance: a minimal rewriting of Q using V
cannot increase the number of literals.
Lemma 19.27 Let Q be a conjunctive query and V a set of views given by conjunctive
queries, both without comparison atoms. If the body of Q contains p literals and Q′
is a locally minimal rewriting of Q using V, then Q′ contains at most p literals.
Proof Replacing the view literals of Q′ by their definitions, query Q″ is obtained.
Let ϕ be a homomorphism from the body of Q to Q″. The existence of ϕ follows
from Q ≡ Q″ by the Homomorphism Theorem (Theorem 19.18). Each of the literals
l1, l2, . . . , lp of the body of Q is mapped to at most one literal obtained from the
expansion of the view definitions. If Q′ contains more than p view literals, then the
expansion of some view literal in the body of Q″ is disjoint from the image of
ϕ. These view literals can be removed from the body of Q′ without changing the
equivalence.
Based on Lemma 19.27 the following theorem can be stated about complexity of
minimal rewritings.
Proof The first statement is proved; the proof of the other two is similar. According
to Lemmas 19.27 and 19.26, only such rewritings need to be considered that have
at most as many literals as the query itself, contain a subset of the literals of the
query and do not introduce new variables. Such a rewriting and the homomorphisms
proving the equivalence can be tested in polynomial time, hence the problem is in
NP. In order to prove that it is NP-hard, Theorem 19.25 is used. For a given query
Q and view V let V′ be the view whose head is the same as the head of V, but whose
body is the conjunction of the bodies of Q and V. It is easy to see that there exists
a rewriting using V′ with a single literal if and only if there exists a rewriting (with
no restriction) using V.
2. Q contains Q′,
Q : q(Pname,Student,Semester) ←Registered(Student,C-number,Semester),
Advisor(Pname,Student),
Teaches(Pname, C-number,Semester, xE ),
Semester ≥ “Fall2000” .
(19.61)
View V1 below can be used to answer Q, since it uses the same join condition
for relations Registered and Teaches as Q, as shown by the variables of the same
name. Furthermore, V1 selects the attributes Student, PName, Semester, which are nec-
essary in order to join properly with relation Advisor and for the select clause of the
query. Finally, the predicate Semester > "Fall1999" is weaker than the predicate
Semester ≥ "Fall2000" of the query.
V1 : v1 (Student,PName,Semester) ←Teaches(PName,C-number,Semester, xE ),
Registered(Student,C-number,Semester),
Semester > “Fall1999” .
(19.62)
The following four views illustrate how minor modifications to V1 change the us-
ability in answering the query.
• cover the same set of relations in the query (and therefore are also of the same
size), and
• produce the answer in the same interesting order.
In our context, the query optimiser builds query execution plans by accessing a set
of views, rather than a set of database relations. Hence, in addition to the meta-data
that the query optimiser has about the materialised views (e.g., statistics, indexes)
the optimiser is also given as input the query expressions defining the views. The
additional issues that the optimiser needs to consider in the presence of materialised
views are as follows.
A. In the first iteration the algorithm needs to decide which views are relevant
to the query according to the conditions illustrated above. The corresponding
B. Since the query execution plans involve joins over views, rather than joins over
database relations, plans can no longer be neatly partitioned into equivalence
classes which can be explored in increasing size. This observation implies several
changes to the traditional algorithm:
The following table summarises the comparison of the traditional optimiser versus
one that exploits materialised views.
Create-Bucket(Q,V)
1 for i ← 1 to m
2 do Bucket[i] ← ∅
3 for all V ∈ V
4 V is of the form V : V(Ỹ) ← S1(Ỹ1), . . . , Sn(Ỹn), C(V).
5 do for j ← 1 to n
6 if Ri = Sj
7 then let φ be a mapping defined on the variables
of V as follows:
8 if y is the k-th variable of Ỹj and y ∈ Ỹ
9 then φ(y) = xk, where xk is the k-th variable of X̃i
10 else φ(y) is a new variable that
does not appear in Q or V .
11 Q′() ← R1(X̃1), . . . , Rm(X̃m), C(Q), S1(φ(Ỹ1)), . . . ,
Sn(φ(Ỹn)), φ(C(V))
12 if Satisfiable(Q′)
13 then add φ(V ) to Bucket[i].
14 return Bucket
1. Q′ ⊑ Q, or
Example 19.10 Bucket algorithm. Consider the following query Q that lists those articles
x for which there exists another article y of the same area such that x and y mutually cite each
other. Three views (sources) are available: V1, V2, V3.
In the first step, applying Create-Bucket, the following buckets are constructed.
In the second step the algorithm constructs a conjunctive query Q′ from each element of
the Cartesian product of the buckets, and checks whether Q′ is contained in Q. If yes, it is
added to the answer.
In our case, it tries to match V1 with the other views; however, no correct answer is
obtained this way. The reason is that b does not appear in the head of V1, hence the join condition
of Q – variables x and y occur in relation sameArea as well – cannot be applied. Then
rewritings containing V3 are considered, and it is recognised that by equating the variables in the head
of V3 a contained rewriting is obtained. Finally, the algorithm finds that by combining V3
and V2 a rewriting is obtained as well. This latter is redundant, as simple checking shows,
that is, V2 can be pruned. Thus, the result of the bucket algorithm for query
(19.68) is the following (actually equivalent) rewriting
The strength of the bucket algorithm is that it exploits the predicates in the
query to prune significantly the number of candidate conjunctive rewritings that
need to be considered. Checking whether a view should belong to a bucket can
be done in time polynomial in the size of the query and view definition when the
predicates involved are arithmetic comparisons. Hence, if the data sources (i.e., the
views) are indeed distinguished by having different comparison predicates, then the
resulting buckets will be relatively small.
The main disadvantage of the bucket algorithm is that the Cartesian product
of the buckets may still be large. Furthermore, the second step of the algorithm
needs to perform a query containment test for every candidate rewriting, which is
NP-complete even when no comparison predicates are involved.
Example 19.11 Equivalent rewriting. Consider the following datalog program P, where
the edb relations edge and black contain the edges and the vertices coloured black of a graph G.
It is easy to check that P lists the endpoints of those paths (more precisely, walks) of graph G
whose inner vertices are all black. Assume that only the following two views can be accessed.
v1 stores the edges whose tail is black, while v2 stores those whose head is black. There exists
an equivalent rewriting Pv of datalog program P that uses only the views v1 and v2 as edb
relations:
Pv : q(X, Y ) ← v2 (X, Z), v1 (Z, Y )
(19.73)
q(X, Y ) ← v2 (X, Z), q(Z, Y )
However, if only v1 or only v2 is accessible, then an equivalent rewriting is not possible, since
only those paths are obtainable whose starting or, respectively, ending vertex is black.
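The behaviour of the rewriting Pv in (19.73) can be checked on a small instance with the following Python sketch; the toy graph, the encoding of the views v1 and v2 as sets of pairs, and the naive fixpoint evaluation are assumptions of the sketch.

def evaluate_Pv(edge, black):
    v1 = {(x, y) for (x, y) in edge if x in black}   # edges with black tail
    v2 = {(x, y) for (x, y) in edge if y in black}   # edges with black head
    # q(X,Y) <- v2(X,Z), v1(Z,Y)   and   q(X,Y) <- v2(X,Z), q(Z,Y)
    q = {(x, y) for (x, z) in v2 for (z2, y) in v1 if z == z2}
    while True:
        new = {(x, y) for (x, z) in v2 for (z2, y) in q if z == z2} - q
        if not new:
            break
        q |= new
    return q

# toy graph a -> b -> c -> d with b and c coloured black
edge = {("a", "b"), ("b", "c"), ("c", "d")}
black = {"b", "c"}
print(sorted(evaluate_Pv(edge, black)))
# [('a', 'c'), ('a', 'd'), ('b', 'd')] -- endpoints of walks whose inner vertices are black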
is the following collection of Horn rules. To every subgoal vi(Ỹi) there corresponds a rule
whose body is the single literal v(X1, . . . , Xm). The head of the rule is vi(Z̃i), where Z̃i
is obtained from Ỹi by preserving the variables appearing in the head of rule (19.74),
while the function symbol fY(X1, . . . , Xm) is written in place of every variable Y not
appearing in the head. Distinct function symbols correspond to distinct variables. The
inverse of a set V of views is the set {v⁻¹ : v ∈ V}, where distinct function symbols
occur in the inverses of distinct rules.
The idea of the definition of inverses is that if a tuple (x1, . . . , xm) appears in a view v
for some constants x1, . . . , xm, then there is a valuation of every variable Y of the body
that makes the body of the rule true. This "unknown" valuation is
denoted by the function symbol fY(X1, . . . , Xm).
[Figure 19.4 shows graph G on vertices a, b, c, d, e; Figure 19.5 shows graph G′.]
Note that relations of edb(P) appear in the heads of the rules of V⁻¹. The names of these idb relations are arbitrary,
so they can be renamed so that their names do not coincide with the names of the edb
relations of P. However, this is not done in the following example, for the sake of
better understanding.
Example 19.13 Logic program. Consider the following datalog program that calculates
the transitive closure of relation edge.
P: q(X, Y ) ← edge(X, Y )
(19.77)
q(X, Y ) ← edge(X, Z), q(Z, Y )
Assume that only the following materialised view is accessible, which stores the endpoints of
the paths of length two. If only this view is usable, then the most that can be expected is listing
the endpoints of paths of even length. Since the unique edb relation of datalog program P
is edge, which also appears in the definition of v, the logic program (P⁻, V⁻¹) is obtained by
adding the rules of V⁻¹ to P.
(P − , V −1 ) : q(X, Y ) ← edge(X, Y )
q(X, Y ) ← edge(X, Z), q(Z, Y )
(19.78)
edge(X, f (X, Y )) ← v(X, Y )
edge(f (X, Y ), Y ) ← v(X, Y ) .
Let the instance of the edb relation edge of datalog program P be the graph G shown on
Figure 19.4. Then (P⁻, V⁻¹) introduces three new constants, f(a, c), f(b, d) and f(c, e). The
idb relation edge of logic program V⁻¹ is the graph G′ shown on Figure 19.5. P⁻ computes
the transitive closure of graph G′. Note that those pairs in the transitive closure that do not
contain any of the new constants are exactly the endpoints of the even-length paths of G.
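The bottom-up evaluation of (19.78) and the effect of the ↓ filter can be illustrated with the following Python sketch; representing the function term f(x, y) as the tuple ('f', x, y) is an assumption of the sketch, and the view instance below is the one determined by the graph G of Figure 19.4.

def evaluate_inverse_program(v_instance):
    # inverse rules: edge(X, f(X,Y)) <- v(X,Y) and edge(f(X,Y), Y) <- v(X,Y)
    edge = set()
    for (x, y) in v_instance:
        edge.add((x, ('f', x, y)))
        edge.add((('f', x, y), y))
    # naive fixpoint of q(X,Y) <- edge(X,Y) and q(X,Y) <- edge(X,Z), q(Z,Y)
    q = set(edge)
    while True:
        new = {(x, y) for (x, z) in edge for (z2, y) in q if z == z2} - q
        if not new:
            break
        q |= new
    return q

def drop_functional(tuples):
    # the "down-arrow" filter: keep only tuples without function terms
    return {t for t in tuples if all(not isinstance(c, tuple) for c in t)}

v = {('a', 'c'), ('b', 'd'), ('c', 'e')}   # endpoints of paths of length two in G
print(sorted(drop_functional(evaluate_inverse_program(v))))
# [('a','c'), ('a','e'), ('b','d'), ('c','e')] -- endpoints of even-length paths of G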
In general, the output of a logic program containing function symbols may be infinite. Consider, for example, the logic program
q(X) ← p(X)
q(f(X)) ← q(X) .                                                          (19.79)
If the edb relation p contains the constant a, then the output of the program is the
infinite sequence a, f(a), f(f(a)), f(f(f(a))), . . . . In contrast to this, the output of
the logic program (P⁻, V⁻¹) given by the Inverse-rules Algorithm is guaranteed to
be finite, thus the computation terminates in finite time.
Theorem 19.31 For an arbitrary datalog program P and set of conjunctive views V,
and for finite instances of the views, there exists a unique minimal fixpoint of the
logic program (P⁻, V⁻¹); furthermore, procedures Naiv-Datalog and Semi-Naiv-
Datalog give this minimal fixpoint as output.
The essence of the proof of Theorem 19.31 is that function symbols are only intro-
duced by the inverse rules, which are in turn not recursive, thus terms containing nested
function symbols are not produced. The details of the proof are left for the Reader
(Exercise 19.3-3).
Even if started from the edb relations of a database, the output of a logic program
may contain tuples that have function symbols. Thus, a filter is introduced that
eliminates the unnecessary tuples. Let database D be the instance of the edb relations
of datalog program P. P(D)↓ denotes the set of those tuples of P(D) that do not
contain function symbols. Let P↓ denote the program that computes P(D)↓ for
a given instance D. The proof of the following theorem is beyond the scope of
the present chapter.
Theorem 19.32 For an arbitrary datalog program P and set of conjunctive views
V, the logic program (P⁻, V⁻¹)↓ is a maximally contained rewriting of P using V.
Furthermore, (P⁻, V⁻¹) can be constructed in time polynomial in the sizes of P and
V.
The meaning of Theorem 19.32 is that the simple procedure of adding the inverses
of the view definitions to a datalog program results in a logic program that uses the
views as much as possible. It is easy to see that (P⁻, V⁻¹) can be constructed in
time polynomial in the sizes of P and V, since for every subgoal vi of every view in V a single
inverse rule must be constructed.
In order to completely solve the rewriting problem, however, a datalog program
needs to be produced that is equivalent with the logic program (P⁻, V⁻¹)↓. The
key to this is the observation that (P⁻, V⁻¹)↓ contains only finitely many function
symbols, and furthermore, during a bottom-up evaluation like Naiv-Datalog and its
versions, nested function symbols are not produced. With proper bookkeeping the
appearance of function symbols can be kept track of, without actually producing the
tuples that contain them.
The transformation is done bottom-up as in procedure Naiv-Datalog. The
function symbol f(X1, . . . , Xk) appearing in an idb relation of V⁻¹ is replaced by
the list of variables X1, . . . , Xk. At the same time the name of the idb relation needs
to be marked, to remember that the list X1, . . . , Xk belongs to the function symbol
f(X1, . . . , Xk). Thus, new "temporary" relation names are introduced. Consider the
rule
edge(X, f(X, Y)) ← v(X, Y)                                                (19.80)
of the logic program (19.78) in Example 19.13. It is replaced by the rule
edge⟨1,f(2,3)⟩(X, X, Y) ← v(X, Y)                                         (19.81)
The notation ⟨1, f(2, 3)⟩ means that the first argument of edge⟨1,f(2,3)⟩ is the same as
the first argument of edge, while the second and third arguments of edge⟨1,f(2,3)⟩
together with the function symbol f give the second argument of edge. If a function
symbol becomes an argument of an idb relation of P⁻ during the bottom-up
evaluation of (P⁻, V⁻¹), then a new rule is added to the program with appropriately
marked relation names.
Example 19.14 Transformation of a logic program into a datalog program. The logic program
of Example 19.13 is transformed into the following datalog program by the procedure sketched
above. The different phases of the bottom-up execution of Naiv-Datalog are separated
by blank lines.

edge⟨1,f(2,3)⟩(X, X, Y) ← v(X, Y)
edge⟨f(1,2),3⟩(X, Y, Y) ← v(X, Y)

q⟨1,f(2,3)⟩(X, Y1, Y2) ← edge⟨1,f(2,3)⟩(X, Y1, Y2)
q⟨f(1,2),3⟩(X1, X2, Y) ← edge⟨f(1,2),3⟩(X1, X2, Y)

q(X, Y) ← edge⟨1,f(2,3)⟩(X, Z1, Z2), q⟨f(1,2),3⟩(Z1, Z2, Y)
q⟨f(1,2),f(3,4)⟩(X1, X2, Y1, Y2) ← edge⟨f(1,2),3⟩(X1, X2, Z), q⟨1,f(2,3)⟩(Z, Y1, Y2)
q⟨f(1,2),3⟩(X1, X2, Y) ← edge⟨f(1,2),3⟩(X1, X2, Z), q(Z, Y)
q⟨1,f(2,3)⟩(X, Y1, Y2) ← edge⟨1,f(2,3)⟩(X, Z1, Z2), q⟨f(1,2),f(3,4)⟩(Z1, Z2, Y1, Y2)
                                                                          (19.82)
The datalog program obtained shows clearly which arguments could involve
function symbols in the original logic program. However, some rules containing func-
tion symbols never produce tuples without function symbols during the evaluation
of the output of the program.
Relation p is called significant if in the precedence graph of Definition 19.16³
there exists a directed path from p to the output relation of the program. If p is not significant,
then the tuples of p are not needed to compute the output of the program, thus p
can be eliminated from the program.
One more simplification step can be performed, which does not decrease the
number of necessary derivations during the computation of the output, but avoids
redundant data copying. If p is a relation of the datalog program that is defined
by a single rule, which in turn contains a single relation in its body, then p can be
removed from the program and replaced by the relation of the body of the rule
defining p, having equated the variables accordingly.
³ Here the definition of the precedence graph needs to be extended to the edb relations of the datalog
program, as well.
Example 19.16 Avoiding unnecessary data copying. In Example 19.14 the relations
edge⟨1,f(2,3)⟩ and edge⟨f(1,2),3⟩ are each defined by a single rule, and furthermore these
two rules both have a single relation in their bodies. Hence, program (19.83) can be sim-
plified further.
The datalog program obtained by the two simplification steps above is denoted by
(P⁻, V⁻¹)datalog. It is clear that there exists a one-to-one correspondence between the
bottom-up evaluations of (P⁻, V⁻¹) and of (P⁻, V⁻¹)datalog. Since the function symbols
in (P⁻, V⁻¹)datalog are kept track of, the output instance obtained is indeed the
set of those tuples of the output of (P⁻, V⁻¹) that do not contain function
symbols.
Theorem 19.33 For an arbitrary datalog program P that does not contain negation,
and a set of conjunctive views V, the logic program (P⁻, V⁻¹)↓ is equivalent with the
datalog program (P⁻, V⁻¹)datalog.
occurrences of the same variable – and finds the minimal additional set of subgoals
that must be mapped to subgoals of V, given that g will be mapped to g1. This set
of subgoals and the mapping information is called a MiniCon Description (MCD).
In the second phase the MCDs are combined to produce query rewritings. The
construction of the MCDs makes the most expensive part of the Bucket Algorithm
obsolete, namely the checking of containment between the rewritings and the query,
because the rule generating the MCDs ensures that their join gives a correct
result.
For a given mapping τ : Var(Q) −→ Var(V), subgoal g1 of view V is said to
cover a subgoal g of query Q if τ(g) = g1. Var(Q) and Var(V)
denote the set of variables of the query and of the view, respectively. In order
to prove that a rewriting gives only tuples that belong to the output of the query,
a homomorphism must be exhibited from the query onto the rewriting. An MCD
can be considered as a part of such a homomorphism, hence these parts can easily be
put together later.
The rewriting of query Q is a union of conjunctive queries using the views.
Some of the variables may be equated in the heads of some of the views, as in the
equivalent rewriting (19.70) of Example 19.10. Thus, it is useful to introduce the
concept of head homomorphism. The mapping h : Var(V) −→ Var(V) is a head
homomorphism if it is the identity on the variables that do not occur in the head of V,
but it may equate variables of the head. For every variable x of the head of V, h(x)
also appears in the head of V, and furthermore h(x) = h(h(x)). Now, the exact definition
of an MCD can be given.
Claim 19.35 Let C be a MiniCon Description over view V for query Q. C can be
used for a non-redundant rewriting of Q if the following conditions hold:
C1. for every variable x that is in the head of Q and is in the domain of ϕC as
well, ϕC(x) appears in the head of hC(V); furthermore
C2. if ϕC(y) does not appear in the head of hC(V), then for every subgoal
of Q that contains y it holds that
Form-MCDs(Q, V)
1 C ← ∅
2 for each subgoal g of Q
3 do for V ∈ V
4 do for every subgoal v ∈ V
5 do Let h be the least restrictive head homomorphism on V ,
such that there exists a mapping ϕ with ϕ(g) = h(v).
6 if ϕ and h exist
7 then Add to C any new MCD C,
that can be constructed where:
8 (a) ϕC (respectively, hC ) is
an extension of ϕ (respectively, h),
9 (b) GC is the minimal subset of subgoals of Q such that
GC, ϕC and hC satisfy Claim 19.35, and
10 (c) It is not possible to extend ϕ and h to ϕ0C
and h0C such that (b) is satisfied,
and G0C as defined in (b), is a subset of GC .
11 return C
Consider again query (19.68) and the views of Example 19.10. Procedure Form-
MCDs considers subgoal cite(x, y) of the query first. It does not create an MCD
for view V1, because clause C2 of Claim 19.35 would be violated. Indeed, the
condition would require that subgoal sameArea(x, y) be also covered by V1 using
the mapping ϕ(x) = a, ϕ(y) = b, since b is not in the head of V1.⁴ For the same
reason, no MCD will be created for V1 even when the other subgoals of the query
are considered. In a sense, the MiniCon Algorithm shifts some of the work done by
the combination step of the Bucket Algorithm to the phase of creating the MCDs
by using Form-MCDs. The following table shows the output of procedure Form-
MCDs.
      V(Ỹ)         h                ϕ                 G
      V2(c, d)     c → c, d → d     x → c, y → d      3                   (19.85)
      V3(f, f)     f → f, h → f     x → f, y → f      1, 2, 3
Procedure Form-MCDs includes in GC only the minimal set of subgoals that
are necessary in order to satisfy Claim 19.35. This makes it possible for the
second phase of the MiniCon Algorithm to consider only combinations of
MCDs that cover pairwise disjoint subsets of the subgoals of the query.
Claim 19.36 Given a query Q, a set of views V, and the set of MCDs C for Q
over the views V, the only combinations of MCDs that can result in non-redundant
The fact that only such sets of MCDs need to be considered that provide partitions
of the subgoals in the query reduces the search space of the algorithm drastically.
In order to formulate procedure Combine-MCDs, further notation needs to be
introduced. The mapping ϕC of MCD C may map a set of variables of Q onto the
same variable of hC(V). One representative of this set is chosen arbitrarily,
with the only restriction that if there exist variables of the head
of Q in this set, then one of those is chosen. Let ECϕC(x) denote the representative
variable of the set containing x. In the following, the MiniCon Description C is considered extended
with ECϕC, as a quintuple (hC, V(Ỹ), ϕC, GC, ECϕC). If the MCDs
C1, . . . , Ck are to be combined, and for some i ≠ j both ECϕCi(x) = ECϕCi(y) and
ECϕCj(y) = ECϕCj(z) hold, then in the conjunctive rewriting obtained by the join,
x, y and z will be mapped to the same variable. Let SC denote the equivalence
relation determined on the variables of Q by letting two variables be equivalent if they
are mapped onto the same variable by ϕC, that is, x SC y ⇐⇒ ECϕC(x) = ECϕC(y).
Let C be the set of MCDs obtained as the output of Form-MCDs.
Combine-MCDs(C)
1 Answer ← ∅
2 for {C1 , . . . , Cn } ⊆ C such that GC1 , . . . , GCn is a partition of the subgoals of Q
3 do Define a mapping Ψi on Ỹi as follows:
4 if there exists a variable x in Q such that ϕi (x) = y
5 then Ψi (y) = x
6 else Ψi (y) is a fresh copy of y
7 Let S be the transitive closure of SC1 ∪ · · · ∪ SCn
8 S is an equivalence relation of variables of Q.
9 Choose a representative for each equivalence class of S.
10 Define mapping EC as follows:
11 if x ∈ V ar(Q)
12 then EC(x) is the representative of the equivalence class of x under S
13 else EC(x) = x
14 Let Q′ be given as Q′(EC(X̃)) ← VC1(EC(Ψ1(Ỹ1))), . . . , VCn(EC(Ψn(Ỹn)))
15 Answer ← Answer ∪ {Q′}
16 return Answer
Exercises
19.3-1 Prove Theorem 19.25 using Claim 19.24 and Theorem 19.20.
19.3-2 Prove the two statements of Lemma 19.26. Hint. For the first statement,
substitute the definitions of the views vi(Ỹi) in their place in Q′. Minimise the obtained
query Q″ using Theorem 19.19. For the second statement use Claim 19.24 to
prove that there exists a homomorphism hi from the body of the conjunctive query
defining view vi(Ỹi) to the body of Q. Show that Ỹ′i = hi(Ỹi) is a good choice.
19.3-3 Prove Theorem 19.31 using that datalog programs have unique minimal
fixpoint.
Problems
19-2 (P⁻, V⁻¹)↓ is correct
Prove that each tuple produced by the logic program (P⁻, V⁻¹)↓ is contained in the
output of P (part of the proof of Theorem 19.32). Hint. Let t be a tuple in the output
of (P⁻, V⁻¹) that does not contain function symbols. Consider the derivation tree of
t. Its leaves are view literals, since the views are the extensional relations of program (P⁻, V⁻¹).
If these leaves are removed from the tree, then the leaves of the remaining tree are
facts of the edb relations of P. Prove that the tree obtained is a derivation tree of t in datalog
program P.
a. Prove that if the views of V are given by datalog programs, and query Q is conjunc-
tive and may contain non-equality (≠) predicates, then the question whether
for a given instance I of the views a tuple t is a certain answer of Q is algorith-
mically undecidable. Hint. Reduce to this question the Post Correspondence
Problem, which is the following: given two sets of words {w1, w2, . . . , wn} and
{w′1, w′2, . . . , w′n} over the alphabet {a, b}, does there exist
a sequence of indices i1, i2, . . . , ik (repetition allowed) such that
V(0, 0) ← S(e, e, e)
V(X, Y) ← V(X0, Y0), S(X0, X1, α1), . . . , S(X_{g-1}, X, α_g),
          S(Y0, Y1, β1), . . . , S(Y_{h-1}, Y, β_h)
                                                                          (19.87)
is a rule for every i ∈ {1, 2, . . . , n}, where wi = α1 . . . α_g and w′i = β1 . . . β_h, and
S(X, Y, Z) ← P(X, X, Y), P(X, Y, Z) .
Show that for the instance I of V given by I(V) = {⟨e, e⟩} and I(S) = ∅,
the tuple ⟨c⟩ is a certain answer of query Q if and only if the Post Correspondence
Problem with sets {w1, w2, . . . , wn} and {w′1, w′2, . . . , w′n} has no solution.
b. In contrast to the undecidability result of a., if V is a set of conjunctive views
and query Q is given by a datalog program P, then it is easy to decide about an
arbitrary tuple t whether it is a certain answer of Q for a given view instance I.
Prove that the datalog program (P⁻, V⁻¹)datalog gives exactly the tuples of the
certain answer of Q as output.
Chapter Notes
There are several dimensions along which the treatments of the problem “answering
queries using views” can be classified. Figure 19.6 shows the taxonomy of the work.
The most significant distinction between the different works is whether their
goal is data integration or whether it is query optimisation and the maintenance of
physical data independence. The key difference between these two classes of works
is the output of the algorithm for answering queries using views. In the former
case, given a query Q and a set of views V, the goal of the algorithm is to produce
an expression Q′ that references the views and is either equivalent to or contained
in Q. In the latter case, the algorithm must go further and produce a (hopefully
optimal) query execution plan for answering Q using the views (and possibly the
database relations). Here the rewriting must be equivalent to Q in order to ensure
the correctness of the plan.
The similarity between these two bodies of work is that they are concerned with
the core issue of whether a rewriting of a query is equivalent or contained in the query.
However, while logical correctness suffices for the data integration context, it does
not in the query optimisation context where we also need to find the cheapest plan
using the views. The complication arises because the optimisation algorithms need to
consider views that do not contribute to the logical correctness of the rewriting, but
do reduce the cost of the resulting plan. Hence, while the reasoning underlying the
algorithms in the data integration context is mostly logical, in the query optimisation
case it is both logical and cost-based. On the other hand, an aspect stressed in data
integration context is the importance of dealing with a large number of views, which
correspond to data sources. In the context of query optimisation it is generally
assumed (not always!) that the number of views is roughly comparable to the size
of the schema.
The works on query optimisation can be classified further into System-R style
optimisers and transformational optimisers. Examples of the former are the works of
Chaudhuri, Krishnamurty, Potomianos and Shim [?] and of Tsatalos, Solomon and Ioan-
nidis [25]. The papers of Florescu, Raschid and Valduriez [?]; Bello et al. [?]; Deutsch,
Popa and Tannen [?]; Zaharioudakis et al. [?]; furthermore Goldstein and Larson [?]
belong to the latter.
Rewriting algorithms in the data integration context are studied in the works of
Yang and Larson [?]; Levy, Mendelzon, Sagiv and Srivastava [?]; Qian [?]; further-
more Lambrecht, Kambhampati and Gnanaprakasam [?]. The Bucket Algorithm
was introduced by Levy, Rajaraman and Ordille [?]. The Inverse-rules Algorithm was
invented by Duschka and Genesereth [?, ?]. The MiniCon Algorithm was developed
by Pottinger and Halevy [?, 22].
Query answering algorithms and the complexity of the problem is studied in
papers of Abiteboul and Duschka [?]; Grahne and Mendelzon [?]; furthermore Cal-
vanese, De Giacomo, Lenzerini and Vardi [?].
The STORED system was developed by Deutsch, Fernandez and Suciu [?]. Se-
mantic caching is discussed in the paper of Yang, Karlapalem and Li [?]. Extensions
of the rewriting problem are studied in [?, ?, ?, ?, ?].
Surveys of the area can be found in the works of Abiteboul [?], Florescu, Levy and
Mendelzon [14], Halevy [?, 17], and Ullman [?].
Research of the authors was (partially) supported by Hungarian National Re-
search Fund (OTKA) grants Nos. T034702, T037846T and T042706.
Bibliography
[21] S. Petrov. Finite axiomatization of languages for representation of system properties. Infor-
mation Sciences, 47:339–372, 1989. 894
[22] R. Pottinger, A. Halevy. MiniCon: A scalable algorithm for answering queries using views. The VLDB
Journal, 10(2):182–198, 2001. 943
[23] A. Sali, Sr., A. Sali. Generalized dependencies in relational databases. Acta Cybernetica,
13:431–438, 1998. 894
[24] B. Thalheim. Dependencies in Relational Databases. B. G. Teubner, 1991. 894
[25] O. G. Tsatalos, M. C. Solomon, Y. Ioannidis. The GMAP: a versatile tool for physical data
independence. The VLDB Journal, 5(2):101–118, 1996. 943
[26] D. M. Tsou, P. C. Fischer. Decomposition of a relation scheme into Boyce–Codd normal form.
SIGACT News, 14(3):23–29, 1982. 894
[27] J. D. Ullman. Principles of Database and Knowledge Base Systems. Vol. 1. Computer Science
Press, 1989 (2nd edition). 894
[28] C. Zaniolo. A new normal form for the design of relational database schemata. ACM Trans-
actions on Database Systems, 7:489–499, 1982. 894
This bibliography is made by HBibTEX. The first key of the sorting is the name of the
authors (first author, second author, etc.), the second key is the year of publication, and the third
key is the title of the document.
Underlining shows that the electronic version of the bibliography on the homepage
of the book contains a link to the corresponding address.
Index
This index uses the following conventions. Numbers are alphabetised as if spelled out; for
example, "2-3-4-tree" is indexed as if it were "two-three-four-tree". When an entry refers to a place
other than the main text, the page number is followed by a tag: exe for exercise, exa for example,
fig for figure, pr for problem and fn for footnote.
The numbers of pages containing a definition are printed in italic font, e.g.
time complexity, 583 .
Name Index
This index uses the following conventions. If we know the full name of a cited person, then we
print it. If the cited person is not living, and we know the correct data, then we print also the year
of her/his birth and death.

B
Beeri, Catriel, 885, 886, 888, 894, 944
Békéssy, András, 869, 894, 944
Bello, Randall G., 943
Bernstein, P. A., 894, 944
Boyce, Raymond F., 879, 894
Buneman, Peter, 943

C
Calvanese, Diego, 943
Chaudhuri, Surajit, 943
Cochrane, Roberta, 943
Codd, Edgar F. (1923–2003), 862, 879, 894, 904, 944

D
De Giacomo, Giuseppe, 943
Delobel, C., 894, 944
Demetrovics, János, 862, 869, 894, 895, 944
Deutsch, Alin, 943
Dias, Karl, 943
Dowd, M., 894, 944
Downing, Alan, 943
Duschka, Oliver M., 943

F
Fagin, R., 885, 894, 944
Feenan, James J., 943
Fernandez, Mary, 943
Finnerty, James L., 943
Fischer, P. C., 894, 945
Florescu, Daniela D., 943, 944
Friedman, Marc, 943

G
Genesereth, Michael R., 943

H
Halevy, Alon Y., 943, 944
Howard, J. H., 885, 894
Hull, Richard, 874, 891, 894

I
Ioannidis, Yannis E., 919, 943, 945

J
Johnson, D. T., 869, 894

K
Kambhampati, Subbarao, 943
Karlapalem, Kamalakar, 943
Katona, Gyula O. H., 892, 894, 944
Krishnamurty, Ravi, 943
Kwok, Cody T., 943

L
Lambrecht, Eric, 943
Lapis, George, 943
Larson, Per-Åke, 943
Lenzerini, Maurizio, 943
Levy, Alon Y., 943
Li, Qing, 943
Lucchesi, C. L., 894, 944

M
Maier, David, 893, 894, 944
Mendelzon, Alberto O., 894, 943, 944
Minker, Jack, 892, 894, 944

N
Norcott, William D., 943

Q
Qian, Xiaolei, 943

R
Rajaraman, Anand, 943
Raschid, Louiqa, 943

S
Sagiv, Yehoshua, 894, 943, 944
Sali, Attila, 862, 894, 895, 944, 945
Sali, Attila, Sr., 945
Shim, Kyuseok, 943
Solomon, Marvin H., 919, 943, 945
Srivastava, Divesh, 943
Statman, R., 894
Suciu, Dan, 943
Sun, Harry, 943

T

V
Valduriez, Patrick, 943
Vardi, Moshe Y., 943
Vianu, Victor, 874, 891, 894, 944

W
Weld, Daniel S., 943
Witkowski, Andrew, 943

Y
Yang, H. Z., 943
Yang, Jian, 943

Z
Zaharioudakis, Markos, 943
Zaniolo, C., 894, 945
Ziauddin, Mohamed, 943