Relational Algebra
Consider the following SQL to find which departments have had employees on the 'Further Accounting'
course.
SELECT DISTINCT dname
FROM department, course, empcourse, employee
WHERE cname = 'Further Accounting'
AND course.courseno = empcourse.courseno
AND empcourse.empno = employee.empno
AND employee.depno = department.depno;
The equivalent relational algebra is
PROJECT_{dname} (department JOIN_{depno = depno} (
PROJECT_{depno} (employee JOIN_{empno = empno} (
PROJECT_{empno} (empcourse JOIN_{courseno = courseno} (
PROJECT_{courseno} (SELECT_{cname = 'Further Accounting'} course)
))
))
))
Symbolic Notation
From the example, one can see that for complicated cases a large amount of the answer is formed from
operator names, such as PROJECT and JOIN. It is therefore commonplace to use symbolic notation to
represent the operators.
• SELECT -> σ (sigma)
• PROJECT -> π (pi)
• PRODUCT -> × (times)
• JOIN -> |×| (bow-tie, also written ⋈)
• UNION -> ∪ (cup)
• INTERSECTION -> ∩ (cap)
• DIFFERENCE -> - (minus)
• RENAME -> ρ (rho)
Usage
The symbolic operators are used as with the verbal ones. So, to find all employees in department 1:
SELECT_{depno = 1}(employee)
becomes σ_{depno = 1}(employee)
Conditions can be combined using ^ (AND) and v (OR). For example, to find all employees in department 1
called 'Smith':
SELECT_{depno = 1 ^ surname = 'Smith'}(employee)
becomes σ_{depno = 1 ^ surname = 'Smith'}(employee)
The symbolic notation lends itself to brevity. Even better, when the JOIN is a natural join, the
JOIN condition may be omitted from |×|. The earlier example resulted in:
PROJECT_{dname} (department JOIN_{depno = depno} (
PROJECT_{depno} (employee JOIN_{empno = empno} (
PROJECT_{empno} (empcourse JOIN_{courseno = courseno} (
PROJECT_{courseno} (SELECT_{cname = 'Further Accounting'} course)))))))
becomes
π_{dname} (department |×| (
π_{depno} (employee |×| (
π_{empno} (empcourse |×| (
π_{courseno} (σ_{cname = 'Further Accounting'} course) ))))))
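The nested expression above can be evaluated step by step. The following is a minimal Python sketch, not part of the original notes: the helper names (`row`, `select`, `project`, `natural_join`) and all sample rows are assumptions made for illustration. Relations are modelled as sets of rows, so duplicates disappear automatically, as they do with SELECT DISTINCT.

```python
# A row is a frozenset of (attribute, value) pairs; a relation is a set of rows.
def row(**attrs):
    return frozenset(attrs.items())

def select(pred, rel):
    """sigma: keep the rows for which the predicate holds."""
    return {t for t in rel if pred(dict(t))}

def project(attrs, rel):
    """pi: restrict every row to the named attributes."""
    return {frozenset((a, v) for a, v in t if a in attrs) for t in rel}

def natural_join(r, s):
    """|x|: combine rows that agree on all attribute names they share."""
    out = set()
    for t in r:
        for u in s:
            dt, du = dict(t), dict(u)
            if all(dt[a] == du[a] for a in dt.keys() & du.keys()):
                out.add(t | u)
    return out

# Hypothetical sample data (not taken from the notes).
course = {row(courseno=1, cname="Further Accounting"),
          row(courseno=2, cname="Basic Accounting")}
empcourse = {row(empno=10, courseno=1), row(empno=11, courseno=2)}
employee = {row(empno=10, depno=1), row(empno=11, depno=2)}
department = {row(depno=1, dname="Finance"), row(depno=2, dname="Sales")}

result = project({"dname"}, natural_join(department,
    project({"depno"}, natural_join(employee,
        project({"empno"}, natural_join(empcourse,
            project({"courseno"},
                select(lambda t: t["cname"] == "Further Accounting", course)
            )
        ))
    ))
))
print(result)
```

Each inner projection keeps only the attribute needed for the next join, which is why the intermediate relations stay small.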
Rename Operator
The rename operator returns an existing relation under a new name. ρ_A(B) is the relation B with its name
changed to A. For example, to find the employees in the same department as employee 3:
π_{emp2.surname, emp2.forenames} (
σ_{employee.empno = 3 ^ employee.depno = emp2.depno} (
employee × (ρ_{emp2} employee)
)
)
Derivable Operators
• Fundamental operators: σ, π, ×, ∪, -, ρ
• Derivable operators: |×|, ∩
A ∩ B ⇔ A - (A - B)
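The derivation of ∩ from the difference operator can be checked with Python's built-in sets. This is only an illustrative sketch: the values in A and B are arbitrary stand-ins for tuples of two union-compatible relations.

```python
# Arbitrary sample "relations"; each element stands in for a tuple.
A = {1, 2, 3, 4}
B = {3, 4, 5}

# A - B removes everything A shares with B; subtracting that from A
# leaves exactly the shared part.
derived_intersection = A - (A - B)
builtin_intersection = A & B

print(derived_intersection == builtin_intersection)
```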
Equivalence
A |×|_c B ⇔ π_{a1,a2,...,aN}(σ_c(A × B))
• where c is the join condition (e.g. A.a1 = B.a1),
• and a1,a2,...,aN are all the attributes of A and B without repetition.
c is called the join-condition, and is usually a comparison of a primary key with a foreign key. Where there are N
tables, there are usually N-1 join-conditions. In the case of a natural join, the conditions can be omitted, but
otherwise omitting a condition results in a Cartesian product (a common mistake to make).
Equivalences
The same relational algebraic expression can be written in many different ways. The order in which tuples
appear in relations is never significant.
• A × B ⇔ B × A
• A ∩ B ⇔ B ∩ A
• A ∪ B ⇔ B ∪ A
• (A - B) is not the same as (B - A)
• σ_{c1}(σ_{c2}(A)) ⇔ σ_{c2}(σ_{c1}(A)) ⇔ σ_{c1 ^ c2}(A)
• π_{a1}(A) ⇔ π_{a1}(π_{a1,etc}(A))
where etc represents any other attributes of A.
• many other equivalences exist.
While equivalent expressions always give the same result, some may be much easier to evaluate than others.
When any query is submitted to the DBMS, its query optimiser tries to find the most efficient equivalent
expression before evaluating it.
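The selection equivalences listed above can be tried out on a small relation. The sketch below is illustrative only: the `employee` rows, the `select` helper, and the two conditions are assumptions, not data from the notes.

```python
# Hypothetical employee relation as a list of dict rows.
employee = [
    {"empno": 1, "surname": "Smith", "depno": 1},
    {"empno": 2, "surname": "Jones", "depno": 1},
    {"empno": 3, "surname": "Smith", "depno": 2},
]

def select(pred, rel):
    """sigma: keep the rows for which the predicate holds."""
    return [t for t in rel if pred(t)]

def c1(t):
    return t["depno"] == 1

def c2(t):
    return t["surname"] == "Smith"

nested_c1_c2 = select(c1, select(c2, employee))   # sigma_c1(sigma_c2(A))
nested_c2_c1 = select(c2, select(c1, employee))   # sigma_c2(sigma_c1(A))
combined = select(lambda t: c1(t) and c2(t), employee)  # sigma_{c1 ^ c2}(A)

print(nested_c1_c2 == nested_c2_c1 == combined)
```

All three forms return the same rows; an optimiser is free to pick whichever ordering it expects to be cheapest.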
Comparing RA and SQL
• Relational algebra:
• is closed (the result of every expression is a relation)
• has a rigorous foundation
• has simple semantics
• is used for reasoning, query optimisation, etc.
• SQL:
• is a superset of relational algebra
• has convenient formatting features, etc.
• provides aggregate functions
• has complicated semantics
• is an end-user language.
Any relational language as powerful as relational algebra is called relationally complete.
A relationally complete language can perform all basic, meaningful operations on relations.
Since SQL is a superset of relational algebra, it is also relationally complete.
Relational algebra
Relational algebra, an offshoot of first-order logic (and of algebra of sets), deals with a set of finitary relations (see
also relation (database)) which is closed under certain operators. These operators operate on one or more relations to
yield a relation. Relational algebra is a part of computer science.
Introduction
Relational algebras received little attention until the publication of E.F. Codd's relational model of data in 1970. Codd
proposed such an algebra as a basis for database query languages. (See "Implementations" below.)
Relational algebra is essentially equivalent in expressive power to relational calculus (and thus first-order logic); this
result is known as Codd's theorem. Some care, however, has to be taken to avoid a mismatch that may arise between
the two languages since negation, applied to a formula of the calculus, constructs a formula that may be true on an
infinite set of possible tuples, while the difference operator of relational algebra always returns a finite result. To
overcome these difficulties, Codd restricted the operands of relational algebra to finite relations only and also proposed
restricted support for negation (NOT) and disjunction (OR). Analogous restrictions are found in many other logic-based
computer languages. Codd defined the term relational completeness to refer to a language that is complete with
respect to first-order predicate calculus apart from the restrictions he proposed. In practice the restrictions have no
adverse effect on the applicability of his relational algebra for database purposes.
Primitive operations
As in any algebra, some operators are primitive and the others, being definable in terms of the primitive ones, are
derived. It is useful if the choice of primitive operators parallels the usual choice of primitive logical operators. Although it
is well known that the usual choice in logic of AND, OR and NOT is somewhat arbitrary, Codd made a similar arbitrary
choice for his algebra.
The six primitive operators of Codd's algebra are the selection, the projection, the Cartesian product (also called
the cross product or cross join), the set union, the set difference, and the rename. (Actually, Codd omitted the rename,
but the compelling case for its inclusion was shown by the inventors of ISBL.) These six operators are fundamental in the
sense that none of them can be omitted without losing expressive power. Many other operators have been defined in
terms of these six. Among the most important are set intersection, division, and the natural join. In fact ISBL made a
compelling case for replacing the Cartesian product with the natural join, of which the Cartesian product is a degenerate
case.
Altogether, the operators of relational algebra have identical expressive power to that of domain relational
calculus or tuple relational calculus. However, for the reasons given in the Introduction above, relational algebra has
strictly less expressive power than that of first-order predicate calculus without function symbols. Relational algebra
actually corresponds to a subset of first-order logic that is Horn clauses without recursion and negation.
Set operators
Although three of the six basic operators are taken from set theory, there are additional constraints that are present in
their relational algebra counterparts: for set union and set difference, the two relations involved must be
union-compatible, that is, the two relations must have the same set of attributes. As set intersection can be defined in
terms of set difference, the two relations involved in set intersection must also be union-compatible.
The Cartesian product is defined differently from the one defined in set theory in the sense that tuples are considered to
be 'shallow' for the purposes of the operation. That is, unlike in set theory, where the Cartesian product of an n-tuple by
an m-tuple is a set of 2-tuples, the Cartesian product in relational algebra has the 2-tuple "flattened" into an (n+m)-tuple.
More formally, R × S is defined as follows:
R × S = {(r1, r2, ..., rn, s1, s2, ..., sm) | (r1, r2, ..., rn) ∈ R, (s1, s2, ..., sm) ∈ S}
As with the set-theoretic Cartesian product, the cardinality of the result is the product of the cardinalities of its
factors, i.e., |R × S| = |R| × |S|. In addition, for the Cartesian product to be defined, the two relations involved must
have disjoint headers, that is, they must not have a common attribute name.
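The flattening and the cardinality rule can be seen in a few lines of Python. This is a sketch under assumed sample data: R and S below are arbitrary relations with positional (unnamed) attributes.

```python
from itertools import product

# Arbitrary sample relations: R holds 2-tuples, S holds 1-tuples.
R = {(1, "a"), (2, "b")}
S = {("x",), ("y",), ("z",)}

# Set-theoretic product: pairs of tuples, e.g. ((1, "a"), ("x",)).
nested = set(product(R, S))

# Relational product: each pair is flattened into one (n+m)-tuple.
flattened = {r + s for r, s in product(R, S)}

print(sorted(flattened))
print(len(flattened) == len(R) * len(S))
```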
Projection (π)
A projection is a unary operation written as π_{a1,...,an}(R), where a1,...,an is a set of attribute names. The result of
such a projection is defined as the set that is obtained when all tuples in R are restricted to the set {a1,...,an}.
Selection (σ)
A generalized selection is a unary operation written as σ_φ(R), where φ is a propositional formula that consists
of atoms as allowed in the normal selection and the logical operators ∧ (and), ∨ (or) and ¬ (negation). This
selection selects all those tuples in R for which φ holds.
Rename (ρ)
A rename is a unary operation written as ρ_{a/b}(R), where the result is identical to R except that the b field in all
tuples is renamed to an a field. This is simply used to rename the attribute of a relation or the relation itself.
Natural join (⋈)
[Example tables: Employee (Name, EmpId, DeptName), Dept (DeptName, Manager), and Employee ⋈ Dept (Name, EmpId, DeptName, Manager)]
The natural join can also be used to define composition of relations. In category theory, the join is precisely the fiber
product.
The natural join is arguably one of the most important operators since it is the relational counterpart of logical AND. Note
carefully that if the same variable appears in each of two predicates that are connected by AND, then that variable
stands for the same thing and both appearances must always be substituted by the same value. In particular, natural join
allows the combination of relations that are associated by a foreign key. For example, in the above example a foreign
key probably holds from Employee.DeptName to Dept.DeptName and then the natural join
of Employee and Dept combines all employees with their departments. Note that this works because the foreign key
holds between attributes with the same name. If this is not the case such as in the foreign key
from Dept.manager to Emp.emp-number then we have to rename these columns before we take the natural join. Such a
join is sometimes also referred to as an equijoin (see θ-join).
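More formally, the natural join is defined as follows:
R ⋈ S = { t ∪ s : t ∈ R, s ∈ S, Fun(t ∪ s) }
where Fun is a predicate that is true for a relation r if and only if r is a function. It is usually required that R and S
have at least one common attribute, but if this constraint is omitted and R and S have no common attributes, then the
natural join becomes exactly the Cartesian product.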
The natural join can be simulated with Codd's primitives as follows. Assume that b1,...,bm are the attribute names
common to R and S, a1,...,an are the attribute names unique to R and c1,...,ck are the attribute names unique to S.
Furthermore assume that the attribute names d1,...,dm are neither in R nor in S. In a first step we can now rename the
common attribute names in S:
T := ρ_{d1/b1,...,dm/bm}(S)
Then we take the Cartesian product and select the tuples that are to be joined:
P := σ_{b1 = d1 ∧ ... ∧ bm = dm}(R × T)
Finally we take a projection to get rid of the renamed attributes:
U := π_{a1,...,an,b1,...,bm,c1,...,ck}(P)
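The simulation of the natural join with rename, Cartesian product, selection and projection can be sketched in Python. Everything named here is an assumption for illustration: the helper functions, the `_2` suffix used to build fresh attribute names, and the sample Employee and Dept rows.

```python
# A row is a frozenset of (attribute, value) pairs; a relation is a set of rows.
def row(**attrs):
    return frozenset(attrs.items())

def rename(mapping, rel):
    """rho: rename attributes according to an {old: new} mapping."""
    return {frozenset((mapping.get(a, a), v) for a, v in t) for t in rel}

def product(r, s):
    """Cartesian product; assumes the two headers are disjoint."""
    return {t | u for t in r for u in s}

def select(pred, rel):
    return {t for t in rel if pred(dict(t))}

def project(attrs, rel):
    return {frozenset((a, v) for a, v in t if a in attrs) for t in rel}

# Hypothetical sample data.
Employee = {row(Name="Harry", DeptName="Finance"),
            row(Name="Sally", DeptName="Sales")}
Dept = {row(DeptName="Sales", Manager="Harriet")}

common = {"DeptName"}                      # b1..bm
fresh = {b: b + "_2" for b in common}      # b_i -> d_i, names used nowhere else

T = rename(fresh, Dept)                    # step 1: rename common attributes in S
P = select(lambda t: all(t[b] == t[d] for b, d in fresh.items()),
           product(Employee, T))           # step 2: product, then select matches
U = project({"Name", "DeptName", "Manager"}, P)  # step 3: drop renamed attributes

print(U)
```

The rename step is what makes the headers disjoint, so the Cartesian product is well defined before the selection compares the original and renamed attributes.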
θ-join and equijoin
Consider tables Car and Boat which list models of cars and boats and their respective prices. Suppose a customer
wants to buy a car and a boat, but she doesn't want to spend more money for the boat than for the car. The θ-join on
the condition CarPrice ≥ BoatPrice produces a table with all the possible options. When the condition is equality
between attributes of the same name, such as Price, it may be specified as Price = Price or alternatively as (Price) itself.
[Tables: Car, Boat, and their θ-join on CarPrice ≥ BoatPrice]
If we want to combine tuples from two relations where the combination condition is not simply the equality of shared
attributes then it is convenient to have a more general form of join operator, which is the θ-join (or theta-join). The θ-join
is a binary operator that is written as R ⋈_{a θ b} S or R ⋈_{a θ v} S, where a and b are attribute names, θ is a binary
relation in the set {<, ≤, =, >, ≥}, v is a value constant, and R and S are relations. The result of this operation consists of
all combinations of tuples in R and S that satisfy the relation θ. The result of the θ-join is defined only if the headers
of S and R are disjoint, that is, do not contain a common attribute.
In terms of the fundamental operations, the θ-join is therefore:
R ⋈_θ S = σ_θ(R × S)
In case the operator θ is the equality operator (=) then this join is also called an equijoin.
Note, however, that a computer language that supports the natural join and rename operators does not need θ-join as
well, as this can be achieved by selection from the result of a natural join (which degenerates to Cartesian product when
there are no shared attributes).
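The Car and Boat scenario can be sketched as a selection over a Cartesian product. The model names and prices below are invented for illustration, as is the `theta_join` helper; only the condition CarPrice ≥ BoatPrice comes from the text.

```python
# Hypothetical sample data; attribute names are disjoint, as the theta-join requires.
Car = [
    {"CarModel": "CarA", "CarPrice": 20000},
    {"CarModel": "CarB", "CarPrice": 30000},
]
Boat = [
    {"BoatModel": "Boat1", "BoatPrice": 10000},
    {"BoatModel": "Boat2", "BoatPrice": 40000},
]

def theta_join(r, s, theta):
    """sigma_theta(R x S): pair every row of r with every row of s, keep matches."""
    return [{**t, **u} for t in r for u in s if theta(t, u)]

options = theta_join(Car, Boat, lambda t, u: t["CarPrice"] >= u["BoatPrice"])
for o in options:
    print(o["CarModel"], o["BoatModel"])
```

With this data only Boat1 is affordable alongside either car, so two combinations survive the selection.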
Semijoin (⋉)(⋊)
The semijoin is a join similar to the natural join and is written as R ⋉ S, where R and S are relations.[2] The result of
the semijoin is only the set of all tuples in R for which there is a tuple in S that is equal on their common attribute names.
For an example consider the tables Employee and Dept and their semijoin. More formally, the semijoin is defined as follows:
R ⋉ S = { t : t ∈ R, s ∈ S, Fun(t ∪ s) }
The semijoin can be simulated using the natural join as follows. If a1, ..., an are the attribute names of R, then
R ⋉ S = π_{a1,...,an}(R ⋈ S).
Since we can simulate the natural join with the basic operators it follows that this also holds for the semijoin.
Antijoin (▷)
The antijoin, written as R ▷ S where R and S are relations, is similar to the natural join, but the result of an antijoin is
only those tuples in R for which there is no tuple in S that is equal on their common attribute names.[3]
For an example consider the tables Employee and Dept and their antijoin:
[Tables: Employee, Dept, and Employee ▷ Dept. Surviving rows: Employee (Harry, 3415, Finance); Dept (Sales, Harriet); Employee ▷ Dept (Harry, 3415, Finance)]
The antijoin is formally defined as follows:
R ▷ S = { t : t ∈ R ∧ ¬∃s ∈ S : Fun(t ∪ s) }
or, in words, R ▷ S is the set of tuples t of R for which no tuple s of S satisfies Fun(t ∪ s).
The antijoin can also be defined as the complement of the semijoin, as follows:
R ▷ S = R - R ⋉ S
Given this, the antijoin is sometimes called the anti-semijoin, and the antijoin operator is sometimes written as the
semijoin symbol with a bar above it, instead of ▷.
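The semijoin and its complement, the antijoin, can be sketched together. The helper names and the sample rows below are assumptions for illustration; `matches` plays the role of the Fun(t ∪ s) test, checking that two rows agree on every attribute name they share.

```python
# A row is a frozenset of (attribute, value) pairs; a relation is a set of rows.
def row(**attrs):
    return frozenset(attrs.items())

def matches(t, u):
    """True when t and u agree on all common attribute names."""
    dt, du = dict(t), dict(u)
    return all(dt[a] == du[a] for a in dt.keys() & du.keys())

def semijoin(r, s):
    """Rows of r that have at least one matching row in s."""
    return {t for t in r if any(matches(t, u) for u in s)}

def antijoin(r, s):
    """Rows of r with no matching row in s, computed as R - (R semijoin S)."""
    return r - semijoin(r, s)

# Hypothetical sample data.
Employee = {row(Name="Harry", DeptName="Finance"),
            row(Name="Sally", DeptName="Sales")}
Dept = {row(DeptName="Sales", Manager="Harriet")}

print(semijoin(Employee, Dept))
print(antijoin(Employee, Dept))
```

Neither result contains any attribute of Dept: both operators only filter the left operand.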
Division (÷)
The division is a binary operation that is written as R ÷ S. The result consists of the restrictions of tuples in R to the
attribute names unique to R, i.e., in the header of R but not in the header of S, for which it holds that all their
combinations with tuples in S are present in R. For an example see the tables Completed, DBProject and their division:
Completed
Student  Task
Fred     Database1
Fred     Database2
Fred     Compiler1
Eugene   Database1
Eugene   Compiler1
Sara     Database1
Sara     Database2

DBProject
Task
Database1
Database2

Completed ÷ DBProject
Student
Fred
Sara
If DBProject contains all the tasks of the Database project then the result of the division above contains exactly all the
students that have completed both of the tasks in the Database project.
R ÷ S = { t[a1,...,an] : t ∈ R ∧ ∀s ∈ S ( (t[a1,...,an] ∪ s) ∈ R ) }
where {a1,...,an} is the set of attribute names unique to R and t[a1,...,an] is the restriction of t to this set. It is usually
required that the attribute names in the header of S are a subset of those of R because otherwise the result of the
operation will always be empty.
The simulation of the division with the basic operations is as follows. We assume that a1,...,an are the attribute names
unique to R and b1,...,bm are the attribute names of S. In the first step we project R on its unique attribute names and
construct all combinations with tuples in S:
T := π_{a1,...,an}(R) × S
In the prior example, T would represent a table in which every Student (because Student is the only attribute unique to
the Completed table) is combined with every given Task. So Eugene, for instance, would have two rows, Eugene ->
Database1 and Eugene -> Database2, in T.
In the next step we subtract R from T:
U := T - R
Note that in U we have the possible combinations that "could have" been in R, but weren't. So if we now take the
projection on the attribute names unique to R then we have the restrictions of the tuples in R for which not all
combinations with tuples in S were present in R:
V := π_{a1,...,an}(U)
So what remains to be done is to take the projection of R on its unique attribute names and subtract those in V:
W := π_{a1,...,an}(R) - V
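The four steps T, U, V and W can be traced with the Completed and DBProject tables from the example. The sketch below is one possible Python rendering: tuples are written positionally as (Student, Task), and the variable names simply mirror the steps in the text.

```python
# Completed(Student, Task) and DBProject(Task), as in the example tables.
Completed = {
    ("Fred", "Database1"), ("Fred", "Database2"), ("Fred", "Compiler1"),
    ("Eugene", "Database1"), ("Eugene", "Compiler1"),
    ("Sara", "Database1"), ("Sara", "Database2"),
}
DBProject = {"Database1", "Database2"}

# Projection of R on its unique attribute (Student).
students = {student for student, _ in Completed}

# T: every student combined with every task of the project.
T = {(student, task) for student in students for task in DBProject}

# U: combinations that "could have" been in Completed but are not.
U = T - Completed

# V: students with at least one missing combination.
V = {student for student, _ in U}

# W: the quotient, i.e. students who completed every task in DBProject.
W = students - V

print(sorted(W))
```

Eugene is removed because (Eugene, Database2) appears in U, which matches the description of T given above.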
Outer joins
Whereas the result of a join (or inner join) consists of tuples formed by combining matching tuples in the two operands,
an outer join contains those tuples and additionally some tuples formed by extending an unmatched tuple in one of the
operands by "fill" values for each of the attributes of the other operand.
The operators defined in this section assume the existence of a null value, ω, which we do not define, to be used for the
fill values. It should not be assumed that this is the NULL defined for the database language SQL, nor should it be
assumed that ω is a mark rather than a value, nor should it be assumed that the controversial three-valued logic is
introduced by it.
Three outer join operators are defined: left outer join, right outer join, and full outer join. (The word "outer" is sometimes
omitted.)
Left outer join (⟕)
The left outer join is written as R ⟕ S where R and S are relations.[4] The result of the left outer join is the set of all
combinations of tuples in R and S that are equal on their common attribute names, in addition (loosely speaking) to
tuples in R that have no matching tuples in S.
For an example consider the tables Employee and Dept and their left outer join.
In the resulting relation, tuples in R which have no matching tuple in S take a null value, ω, for the attributes that
are unique to S.
Since there are no tuples in Dept with a DeptName of Finance or Executive, ωs occur in the resulting relation where
tuples in Employee have a DeptName of Finance or Executive.
Let r1, r2, ..., rn be the attributes of the relation R and let {(ω, ..., ω)} be the singleton relation on the attributes that
are unique to the relation S (those that are not attributes of R). Then the left outer join can be described in terms of the
natural join (and hence using basic operators) as follows:
(R ⋈ S) ∪ ((R - π_{r1,r2,...,rn}(R ⋈ S)) × {(ω, ..., ω)})
Right outer join (⟖)
The right outer join behaves almost identically to the left outer join, but the roles of the tables are switched.
The right outer join of relations R and S is written as R ⟖ S.[5] The result of the right outer join is the set of all
combinations of tuples in R and S that are equal on their common attribute names, in addition to tuples in S that
have no matching tuples in R.
For example consider the tables Employee and Dept and their right outer join.
In the resulting relation, tuples in S which have no matching tuple in R take a null value, ω, for the attributes that
are unique to R.
Since there are no tuples in Employee with a DeptName of Production, ωs occur in the Name attribute of the
resulting relation where tuples in Dept have a DeptName of Production.
Let s1, s2, ..., sn be the attributes of the relation S and let {(ω, ..., ω)} be the singleton relation on the attributes that
are unique to the relation R (those that are not attributes of S). Then, as with the left outer join, the right outer join
can be simulated using the natural join as follows:
(R ⋈ S) ∪ ({(ω, ..., ω)} × (S - π_{s1,s2,...,sn}(R ⋈ S)))
Full outer join (⟗)
The outer join or full outer join in effect combines the results of the left and right outer joins.
The full outer join is written as R ⟗ S where R and S are relations.[6] The result of the full outer join is the set of all
combinations of tuples in R and S that are equal on their common attribute names, in addition to tuples in S that have no
matching tuples in R and tuples in R that have no matching tuples in S in their common attribute names.
For an example consider the tables Employee and Dept and their full outer join:
[Table: Employee ⟗ Dept. Surviving row: (ω, ω, Production, Charles)]
In the resulting relation, tuples in R which have no matching tuple in S take a null value, ω, for the attributes unique
to S. Tuples in S which have no matching tuple in R likewise take a null value, ω, for the attributes unique to R.
The full outer join can be simulated using the left and right outer joins (and hence the natural join and set union) as
follows:
R ⟗ S = (R ⟕ S) ∪ (R ⟖ S)
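The construction of the outer joins from the natural join, difference, product and union can be sketched in Python. Several things here are assumptions: the helper names, the sample rows, and the use of Python's `None` as a stand-in for the fill value ω (which, as noted above, should not be equated with SQL NULL).

```python
OMEGA = None  # stand-in for the fill value omega (an assumption of this sketch)

def row(**attrs):
    return frozenset(attrs.items())

def natural_join(r, s):
    out = set()
    for t in r:
        for u in s:
            dt, du = dict(t), dict(u)
            if all(dt[a] == du[a] for a in dt.keys() & du.keys()):
                out.add(t | u)
    return out

def left_outer(r, s, r_header, s_header):
    """(R join S) union ((R - pi_R(R join S)) x {(omega, ..., omega)})."""
    joined = natural_join(r, s)
    matched = {frozenset((a, v) for a, v in t if a in r_header) for t in joined}
    pad = frozenset((a, OMEGA) for a in s_header - r_header)
    return joined | {t | pad for t in r - matched}

def full_outer(r, s, r_header, s_header):
    """Union of the left outer join and the right outer join (roles swapped)."""
    return left_outer(r, s, r_header, s_header) | left_outer(s, r, s_header, r_header)

# Hypothetical sample data.
Employee = {row(Name="Harry", DeptName="Finance"),
            row(Name="Sally", DeptName="Sales")}
Dept = {row(DeptName="Sales", Manager="Harriet"),
        row(DeptName="Production", Manager="Charles")}

full = full_outer(Employee, Dept, {"Name", "DeptName"}, {"DeptName", "Manager"})
for t in full:
    print(dict(t))
```

Because rows are keyed by attribute name rather than position, the right outer join is obtained here simply by calling `left_outer` with the operands swapped.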
Operations for domain computations
The aggregation operation
There are five aggregate functions that are included with most databases. These operations are Sum, Count, Average,
Maximum and Minimum. In relational algebra the aggregation operation over a schema (A1, A2, ... An) is written as
follows:
_{G1, G2, ..., Gm} g _{f1(A1'), f2(A2'), ..., fk(Ak')} (r)
The attributes preceding the g are grouping attributes, which function like a "group by" clause in SQL. Then there are an
arbitrary number of aggregation functions applied to individual attributes. The operation is applied to an arbitrary
relation r. The grouping attributes are optional, and if they are not supplied, the aggregation functions are applied across
the entire relation to which the operation is applied.
Let's assume that we have a table named Account with three columns, namely Account_Number, Branch_Name and
Balance. We wish to find the maximum balance of each branch. This is accomplished by _{Branch_Name} g _{Max(Balance)}(Account).
To find the highest balance of all accounts regardless of branch, we could simply write g _{Max(Balance)}(Account).
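The two Account queries can be sketched with a small grouping helper. The account numbers, branch names and balances below are invented for illustration, and `aggregate` is a hypothetical helper that handles a single aggregate function; it is not a standard library call.

```python
from collections import defaultdict

# Hypothetical Account rows.
Account = [
    {"Account_Number": "A-101", "Branch_Name": "Downtown", "Balance": 500},
    {"Account_Number": "A-102", "Branch_Name": "Downtown", "Balance": 900},
    {"Account_Number": "A-201", "Branch_Name": "Uptown", "Balance": 700},
]

def aggregate(group_by, func, attr, rel):
    """Group rows on the group_by attributes and apply func to attr in each group."""
    groups = defaultdict(list)
    for t in rel:
        key = tuple(t[g] for g in group_by)
        groups[key].append(t[attr])
    return {key: func(values) for key, values in groups.items()}

# Maximum balance of each branch (grouping attribute: Branch_Name).
per_branch = aggregate(["Branch_Name"], max, "Balance", Account)

# No grouping attributes: one group covering the entire relation.
overall = aggregate([], max, "Balance", Account)

print(per_branch)
print(overall)
```

With an empty grouping list every row falls into the single group keyed by the empty tuple, which mirrors the rule that omitted grouping attributes apply the function across the whole relation.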
It can be proven that there is no relational algebra expression E(R) taking R as a variable argument which
produces R+, the transitive closure of a binary relation R. The proof is based on the fact that, given a relational
expression E for which it is claimed that E(R) = R+, where R is a variable, we can always find an instance r of R (and a
corresponding domain d) such that E(r) ≠ r+.[7]
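For contrast, a general-purpose language computes the transitive closure by repeating a join-and-union step until nothing new appears. The Python sketch below is illustrative; the function name and the sample edges are assumptions. The number of loop iterations depends on the data, which is the kind of unbounded repetition a single fixed algebra expression cannot provide.

```python
def transitive_closure(r):
    """Repeatedly compose the relation with itself until a fixpoint is reached."""
    closure = set(r)
    while True:
        # Compose: (a, b) and (b, d) yield (a, d).
        new_pairs = {(a, d)
                     for (a, b) in closure
                     for (c, d) in closure
                     if b == c} - closure
        if not new_pairs:
            return closure
        closure |= new_pairs

# Hypothetical binary relation: a chain 1 -> 2 -> 3 -> 4.
edges = {(1, 2), (2, 3), (3, 4)}
print(sorted(transitive_closure(edges)))
```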
In query optimization, our primary goal is to transform expression trees into equivalent expression trees in which the
average size of the relations yielded by subexpressions is smaller than it was before the optimization. Our secondary goal
is to try to form common subexpressions within a single query or, if more than one query is being evaluated at the
same time, across all of those queries. The rationale behind the second goal is that it is enough to compute common
subexpressions once, and the results can be used in all queries that contain that subexpression.
Rules about selection operators play the most important role in query optimization. Selection is an operator that very
effectively decreases the number of rows in its operand, so if we manage to move the selections in an expression tree
towards the leaves, the internal relations (yielded by subexpressions) will likely shrink.
Basic selection properties
Selection is idempotent (multiple applications of the same selection have no additional effect beyond the first one),
and commutative (the order selections are applied in has no effect on the eventual result).
1. σ_A(R) = σ_A(σ_A(R))
2. σ_A(σ_B(R)) = σ_B(σ_A(R))
A selection whose condition is a conjunction of simpler conditions is equivalent to a sequence of selections with
those same individual conditions, and selection whose condition is a disjunction is equivalent to a union of
selections. These identities can be used to merge selections so that fewer selections need to be evaluated, or to
split them so that the component selections may be moved or optimized separately.
1. σ_{A ∧ B}(R) = σ_A(σ_B(R)) = σ_B(σ_A(R))
2. σ_{A ∨ B}(R) = σ_A(R) ∪ σ_B(R)
Cross product is the costliest operator to evaluate. If the input relations have N and M rows, the result will
contain NM rows. Therefore it is very important to do our best to decrease the size of both operands before applying the
cross product operator.
This can be done effectively if the cross product is followed by a selection operator, e.g. σ_A(R × P). Considering the
definition of join, this is the most likely case. If the cross product is not followed by a selection operator, we can try to
push down a selection from higher levels of the expression tree using the other selection rules.
In the above case we break up condition A into conditions B, C and D using the split rules about complex selection
conditions, so that A = B ∧ C ∧ D and B only contains attributes from R, C contains attributes only
from P and D contains the part of A that contains attributes from both R and P. Note that B, C or D are possibly empty.
Then the following holds:
σ_A(R × P) = σ_{B ∧ C ∧ D}(R × P) = σ_D(σ_B(R) × σ_C(P))
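Pushing the single-relation parts of a condition below the cross product can be sketched as follows. The relations R and P, their attribute names, and the conditions B, C and D are all invented for illustration; B mentions only attributes of R, C only attributes of P, and D attributes of both.

```python
# Hypothetical relations with disjoint attribute names.
R = [{"r_id": i, "r_val": i % 3} for i in range(6)]
P = [{"p_id": j, "p_val": j % 2} for j in range(4)]

def B(t):  # attributes of R only
    return t["r_val"] == 0

def C(t):  # attributes of P only
    return t["p_val"] == 1

def D(t):  # attributes of both R and P
    return t["r_id"] >= t["p_id"]

def cross(r, s):
    return [{**t, **u} for t in r for u in s]

def select(pred, rel):
    return [t for t in rel if pred(t)]

# Naive plan: build the whole product, then apply B ^ C ^ D.
naive_input = cross(R, P)
naive = select(lambda t: B(t) and C(t) and D(t), naive_input)

# Optimized plan: shrink both operands first, then apply only D.
pushed_input = cross(select(B, R), select(C, P))
pushed = select(D, pushed_input)

print(len(naive_input), len(pushed_input), naive == pushed)
```

Both plans return the same rows, but the optimized plan builds a far smaller intermediate product.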
Selection is distributive over the set difference, intersection, and union operators. The following three rules are used to
push selection below set operations in the expression tree. Note that for the set difference and the intersection operators
it is possible to apply the selection operator to only one of the operands after the transformation. This can make sense in
cases where one of the operands is small and the overhead of evaluating the selection operator outweighs the benefits
of using a smaller relation as an operand.
1. σ_A(R - P) = σ_A(R) - σ_A(P) = σ_A(R) - P
2. σ_A(R ∩ P) = σ_A(R) ∩ σ_A(P) = σ_A(R) ∩ P = R ∩ σ_A(P)
3. σ_A(R ∪ P) = σ_A(R) ∪ σ_A(P)
Selection commutes with projection if and only if the fields referenced in the selection condition are a subset of the
fields in the projection:
π_{a1,...,an}(σ_A(R)) = σ_A(π_{a1,...,an}(R)), where the fields in A are a subset of {a1,...,an}
Performing selection before projection may be useful if the operand is a cross product or join. In
other cases, if the selection condition is relatively expensive to compute, moving selection outside the projection may
reduce the number of tuples which must be tested (since projection may produce fewer tuples due to the elimination of
duplicates resulting from elided fields).
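Projection
Basic projection properties
Projection is idempotent, so that a series of (valid) projections is equivalent to the outermost projection:
π_{a1,...,an}(π_{b1,...,bm}(R)) = π_{a1,...,an}(R), where {a1,...,an} ⊆ {b1,...,bm}
Projection does not distribute over intersection and set difference. Counterexamples are given by:
π_A({(A=a, B=b)} ∩ {(A=a, B=b')}) = ∅, whereas π_A({(A=a, B=b)}) ∩ π_A({(A=a, B=b')}) = {(A=a)}
and
π_A({(A=a, B=b)} - {(A=a, B=b')}) = {(A=a)}, whereas π_A({(A=a, B=b)}) - π_A({(A=a, B=b')}) = ∅
where b is assumed to be distinct from b'.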
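That projection fails to distribute over intersection and set difference can be confirmed with two one-row relations. The sketch below is illustrative: R and P are assumed to have attributes (A, B) stored positionally, and `proj_A` is a hypothetical helper that projects onto A.

```python
# Two single-row relations over (A, B) that agree on A but differ on B.
R = {("a", "b")}
P = {("a", "b2")}

def proj_A(rel):
    """pi_A: keep only the first attribute of each row."""
    return {(t[0],) for t in rel}

# Intersection: projecting after vs. before gives different results.
inter_then_project = proj_A(R & P)
project_then_inter = proj_A(R) & proj_A(P)

# Difference: the mismatch is the other way round.
diff_then_project = proj_A(R - P)
project_then_diff = proj_A(R) - proj_A(P)

print(inter_then_project, project_then_inter)
print(diff_then_project, project_then_diff)
```

In both cases the rows differ only in B, so dropping B before the set operation changes which rows count as equal.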
Rename
Basic rename properties
Successive renames of a variable can be collapsed into a single rename. Rename operations which have no variables in
common can be arbitrarily reordered with respect to one another, which can be exploited to make successive renames
adjacent so that they can be collapsed.
1. ρ_{a/b}(ρ_{b/c}(R)) = ρ_{a/c}(R)
2. ρ_{a/b}(ρ_{c/d}(R)) = ρ_{c/d}(ρ_{a/b}(R))