Query Optimization
CS 317/387
Query Evaluation
Problem: An SQL query is declarative: it does
not specify a query execution plan.
SELECT DISTINCT targetlist FROM
R1, …, Rn WHERE condition
is equivalent to:
π targetlist (σ condition (R1 × … × Rn))
Example
Query Optimizer
Equivalence Preserving
Transformation
To transform a relational expression into
another equivalent expression we need
transformation rules that preserve equivalence
Commutativity, Associativity of Joins
Join commutativity: R ⋈ S ≡ S ⋈ R
Used to reduce the cost of nested-loop evaluation strategies
(the smaller relation should be in the outer loop)
Join associativity: R ⋈ (S ⋈ T) ≡ (R ⋈ S) ⋈ T
Used to reduce the size of intermediate relations in the
computation of a multi-relational join:
first compute the join that yields the smaller intermediate
result
An N-way join has T(N) × N! different evaluation plans:
T(N) is the number of parenthesized expressions
N! is the number of permutations
The query optimizer cannot look at all plans. Hence it
does not necessarily produce an optimal plan.
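The growth of T(N) × N! can be checked numerically. The following is a minimal sketch, assuming T(N) is the Catalan number C(N−1), the standard count of ways to fully parenthesize N operands (the slide does not name it). The function names are illustrative only.

```python
from math import comb, factorial


def parenthesizations(n: int) -> int:
    """Ways to fully parenthesize an n-way join (Catalan number C(n-1))."""
    k = n - 1
    return comb(2 * k, k) // (k + 1)


def evaluation_plans(n: int) -> int:
    """T(N) x N!: tree shapes times orderings of the relations at the leaves."""
    return parenthesizations(n) * factorial(n)


# Print N, T(N), and T(N) x N! for a few join sizes.
for n in (2, 4, 6, 10):
    print(n, parenthesizations(n), evaluation_plans(n))
```

Even at N = 10 the count is in the billions, which is why an optimizer cannot enumerate every plan.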
Equivalence Example
σ C1 ∧ C2 ∧ C3 (R × S)
≡ σC1(σC2(σC3(R × S)))
≡ σC1(σC2(R) × σC3(S))   (valid when C2 involves only R and C3 involves only S)
≡ σC2(R) ⋈C1 σC3(S)
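The chain of equivalences can be tried on small data. This is a sketch under stated assumptions: the relations R(a, b) and S(c, d) and the three conditions are made up for illustration, with C2 referring only to R, C3 only to S, and C1 relating the two.

```python
from itertools import product

# Hypothetical toy relations: R(a, b) and S(c, d).
R = [(1, 10), (2, 20), (3, 30), (4, 40)]
S = [(1, "x"), (2, "y"), (3, "x"), (5, "x")]


def c1(r, s):
    """C1: join condition, uses attributes of both R and S."""
    return r[0] == s[0]


def c2(r):
    """C2: uses attributes of R only."""
    return r[1] >= 20


def c3(s):
    """C3: uses attributes of S only."""
    return s[1] == "x"


# sigma_{C1 and C2 and C3}(R x S): filter the full cross product.
unoptimized = {(r, s) for r, s in product(R, S) if c1(r, s) and c2(r) and c3(s)}

# sigma_C2(R) join_C1 sigma_C3(S): select first, then join the smaller inputs.
r_sel = [r for r in R if c2(r)]
s_sel = [s for s in S if c3(s)]
pushed = {(r, s) for r in r_sel for s in s_sel if c1(r, s)}

print(unoptimized == pushed)
print(len(R) * len(S), "pairs examined vs", len(r_sel) * len(s_sel))
```

Both expressions give the same relation, but the second examines fewer pairs because the selections shrink the inputs before they are combined.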
Query Tree
Tree structure that corresponds to a
relational algebra expression:
– A leaf node represents an input relation;
– An internal node represents a relation
obtained by applying one relational
operator to its child nodes
– The root relation represents the answer to
the query
– Two query trees are equivalent if their
root relations are the same
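A query tree can be written down as a small data structure. This is a minimal sketch, not any particular DBMS's representation: the `Node` class, the operator names, and the sample rows are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Row = dict


@dataclass
class Node:
    """A query-tree node: a leaf holds a relation, an internal node an operator."""
    op: str                                      # "relation", "select", "project", "join"
    children: List["Node"]
    rows: Optional[List[Row]] = None             # leaf only: the input relation
    pred: Optional[Callable[..., bool]] = None   # select or join condition
    attrs: Optional[List[str]] = None            # project: attributes to keep


def evaluate(node: Node) -> List[Row]:
    """Compute the relation a node represents from the relations of its children."""
    if node.op == "relation":
        return node.rows
    inputs = [evaluate(child) for child in node.children]
    if node.op == "select":
        return [row for row in inputs[0] if node.pred(row)]
    if node.op == "project":
        # Keeps duplicates; a DISTINCT step would be a separate operator.
        return [{a: row[a] for a in node.attrs} for row in inputs[0]]
    if node.op == "join":
        return [{**left, **right}
                for left in inputs[0] for right in inputs[1]
                if node.pred(left, right)]
    raise ValueError(f"unknown operator: {node.op}")


# Hypothetical sample data.
professor = Node("relation", [], rows=[
    {"Id": 1, "Name": "Ada", "DeptId": "CS"},
    {"Id": 2, "Name": "Bob", "DeptId": "EE"},
])
teaching = Node("relation", [], rows=[
    {"ProfId": 1, "Semester": "F2007"},
    {"ProfId": 2, "Semester": "F2007"},
])
join = Node("join", [professor, teaching],
            pred=lambda p, t: p["Id"] == t["ProfId"])
select = Node("select", [join],
              pred=lambda r: r["DeptId"] == "CS" and r["Semester"] == "F2007")
root = Node("project", [select], attrs=["Name"])

print(evaluate(root))
```

The root's relation is the answer to the query; two trees built this way are equivalent when `evaluate` returns the same relation for both on every input.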
Query Plan
Query Tree with specification of
algorithms for each operation.
– A query tree may have different
execution plans
– Some plans are more efficient to
execute than others.
• Two main issues:
– For a given query, what plans are
considered?
– How is the cost of a plan estimated?
• Ideally: want to find the best plan.
Practically: avoid the worst plans!
Cost - Example 1
SELECT P.Name
FROM Professor P, Teaching T
WHERE P.Id = T.ProfId -- join condition
AND P.DeptId = 'CS' AND T.Semester = 'F2007'
π Name (σ DeptId='CS' ∧ Semester='F2007' (Professor ⋈ Id=ProfId Teaching))
[Figure: master query execution plan (nothing pushed). From root to leaves: π Name; σ DeptId='CS' ∧ Semester='F2007'; ⋈ Id=ProfId; leaves Professor and Teaching.]
Metadata on Tables (in system
catalogue)
Estimating Cost - Example
Complete algorithm:
do join, write result to intermediate file on disk
read intermediate file, do select/project, write
final result
Problem: unnecessary I/O
Pipelining
Solution: use pipelining:
join and select/project act as co-routines,
operate as producer/consumer sharing a buffer in
main memory.
• When join fills the buffer, select/project filters it and
outputs the result
• The process is repeated until select/project has processed
the last output from join
Performing select/project adds no additional I/O cost
[Figure: join writes its intermediate output (the join result) into a main-memory buffer; select/project reads the buffer and writes the final result.]
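The producer/consumer arrangement can be sketched with Python generators, which behave as co-routines. This is an illustration under assumptions: the slide describes a shared buffer filled in batches, whereas a generator hands over one row at a time (in effect a buffer of size one), and the sample rows are made up.

```python
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, object]


def join(professors: List[Row], teachings: List[Row]) -> Iterator[Row]:
    """Producer: yields joined rows one at a time instead of writing a file."""
    for p in professors:
        for t in teachings:
            if p["Id"] == t["ProfId"]:
                yield {**p, **t}


def select_project(rows: Iterable[Row]) -> Iterator[object]:
    """Consumer: filters and projects each row as soon as join produces it."""
    for row in rows:
        if row["DeptId"] == "CS" and row["Semester"] == "F2007":
            yield row["Name"]


# Hypothetical sample data.
professors = [
    {"Id": 1, "Name": "Ada", "DeptId": "CS"},
    {"Id": 2, "Name": "Bob", "DeptId": "EE"},
]
teachings = [
    {"ProfId": 1, "Semester": "F2007"},
    {"ProfId": 1, "Semester": "S2008"},
    {"ProfId": 2, "Semester": "F2007"},
]

# The join result is never materialized as a whole; rows flow through memory.
names = list(select_project(join(professors, teachings)))
print(names)
```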
Estimating Cost - Example 1
Total cost:
4200 + (cost of outputting final
result)
Cost Example 2
SELECT P.Name
FROM Professor P, Teaching T
WHERE P.Id = T.ProfId AND
P.DeptId = 'CS' AND T.Semester = 'F2007'
π Name (σ Semester='F2007' (σ DeptId='CS' (Professor) ⋈ Id=ProfId Teaching))
[Figure: partially pushed plan (selection pushed to Professor). From root to leaves: π Name; σ Semester='F2007'; ⋈ Id=ProfId, with left input σ DeptId='CS' over Professor and right input Teaching.]
Cost Example 2 -- selection
Compute σDeptId='CS' (Professor) (to reduce the size of one
join table) using the clustered, 2-level B+ tree on DeptId.
There are 50 departments and 1000 professors; hence the weight
of DeptId is 20 (roughly 20 CS professors). These
rows are in ~4 consecutive pages in Professor.
• Cost = 4 (to get rows) + 2 (to search index) = 6
• Keep the resulting 4 pages in memory and pipe them to the next
step
[Figure: clustered index on DeptId pointing to the rows of Professor.]
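The selection cost is plain arithmetic and can be written out as a sketch. The numbers are the ones given in the example; the variable names are illustrative.

```python
professors = 1000        # rows in Professor (from the example)
departments = 50         # distinct DeptId values (from the example)
index_levels = 2         # levels of the clustered B+ tree on DeptId
pages_with_matches = 4   # consecutive pages holding the CS rows (from the example)

# Weight of DeptId: average number of rows per distinct value.
weight = professors // departments

# Clustered index: descend the tree once, then read the matching pages in sequence.
selection_cost = index_levels + pages_with_matches

print(weight, selection_cost)
```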
Cost Example 2 – join (cont’d)
Each professor matches ~10 Teaching rows.
Since there are 20 CS professors, about 200 Teaching
records match.
All index entries for a particular ProfId are in the
same bucket. Assume ~1.2 I/Os to get a
bucket.
• Cost = 1.2 × 20 (to fetch index entries for 20
CS professors) + 200 (to fetch Teaching
rows, since the hash index is unclustered) = 224
[Figure: hash index lookups at 1.2 I/Os each, followed by 20 × 10 Teaching row fetches.]
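The join cost for this plan can be reproduced the same way. This sketch uses the figures stated in the example; the variable names are illustrative.

```python
cs_professors = 20            # rows surviving the selection on DeptId
teaching_rows_per_prof = 10   # average matches per professor (from the example)
ios_per_bucket = 1.2          # assumed average I/Os to fetch one hash bucket

# One bucket lookup per CS professor to fetch the index entries.
index_cost = ios_per_bucket * cs_professors

# Unclustered index: each matching Teaching row may sit on a different page,
# so count one I/O per row.
row_fetch_cost = cs_professors * teaching_rows_per_prof

join_cost = index_cost + row_fetch_cost
print(round(join_cost))
```

The row fetches dominate: 200 of the 224 I/Os come from the index being unclustered.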
Cost Example 2 –
select/project
Estimating Output Size
It is important to estimate the size of the
output of a relational expression – size
serves as input to the next stage and
affects the choice of how the next stage will
be evaluated.
Size estimation uses the following
measures on a particular instance of R:
Tuples(R): number of tuples
Blocks(R): number of blocks
Values(R.A): number of distinct values of A
MaxVal(R.A): maximum value of A
MinVal(R.A): minimum value of A
Estimation of Reduction Factor
Assume that the reduction factors due
to the target list and the query condition are
independent.
Thus:
reduction(Query) = reduction(TargetList) × reduction(Condition)
Reduction Due to TargetList
reduction(TargetList) =
number-of-attributes(TargetList) / Σi number-of-attributes(Ri)
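The two reduction factors can be combined in a few lines. This is a sketch under assumptions: the denominator is taken to count the attributes of all relations R1, …, Rn in the query, the two factors are multiplied under the independence assumption stated earlier, and the numbers (including the 1% condition factor) are hypothetical.

```python
from typing import List


def reduction_target_list(target_attrs: int, attrs_per_relation: List[int]) -> float:
    """Fraction of the attributes of R1..Rn that survive the projection."""
    return target_attrs / sum(attrs_per_relation)


def reduction_query(target_reduction: float, condition_reduction: float) -> float:
    """Independence assumption: the two reduction factors multiply."""
    return target_reduction * condition_reduction


# Hypothetical numbers: 1 attribute kept out of relations with 3 and 5
# attributes, and a condition assumed to keep 1% of the rows.
r_target = reduction_target_list(1, [3, 5])
r_query = reduction_query(r_target, 0.01)
print(r_target, r_query)
```

Counting attributes treats every attribute as equally wide, so this is a rough estimate rather than an exact size.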
Choosing Query Execution Plan
Step 1: Choose a logical plan
Step 2: Reduce search space
Step 3: Use a heuristic search to
further reduce complexity
Step 2: Reduce Search Space
[Figure: three equivalent trees for joining A, B, C and D: the logical query execution plan, an equivalent query tree, and an equivalent left-deep query tree.]
Step 2 (cont’d)
Two issues:
Choose a particular shape of the tree
(like in the previous slide)
• The number of shapes equals the number of ways to
parenthesize an N-way join, which grows very rapidly
Choose a particular permutation of the
leaves
• E.g., 4! permutations of the leaves A, B, C, D
Step 2: Dealing With Associativity
Too many trees to evaluate: settle on a
particular shape: left-deep tree.
Used because it allows pipelining:
[Figure: pipeline of three joins. P1 joins A and B and outputs X; P2 joins X and C and outputs Y; P3 joins Y and D.]
Property: once a row of X has been output by
P1, it need not be output again (but C may have
to be processed several times in P2 for
successive portions of X)
Advantage: none of the intermediate relations
(X, Y) have to be completely materialized and
saved on disk.
• Important if one such relation is very large, but the
final result is small
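Once the shape is fixed to left-deep, only the order of the leaves is left to choose. The sketch below enumerates those orders for the four relations A, B, C, D; the `left_deep` helper and the textual `JOIN` notation are illustrative, not part of the slides.

```python
from itertools import permutations

relations = ["A", "B", "C", "D"]


def left_deep(order):
    """Build the left-deep expression (((A JOIN B) JOIN C) JOIN D) for a leaf order."""
    expr = order[0]
    for rel in order[1:]:
        expr = f"({expr} JOIN {rel})"
    return expr


# One left-deep plan per permutation of the leaves: N! plans in total.
plans = [left_deep(order) for order in permutations(relations)]
print(len(plans))
print(plans[0])
```

For four relations this leaves 4! = 24 candidates, compared with T(4) × 4! when every tree shape is allowed.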
Step 3: Heuristic Search
Index-Only Queries
A B+ tree index with search key attributes
A1, A2, …, An has stored in it the values of
these attributes for each row in the table.
Queries involving a prefix of the attribute list A1,
A2, …, An can be satisfied using only the index;
no access to the actual table is required.
Example: Transcript has a clustered B+
tree index on StudId. A frequently asked
query is one that requests all grades for a
given CrsCode.
Problem: Already have a clustered index on
StudId – cannot create another one (on
CrsCode)
Solution: Create an unclustered index on
(CrsCode, Grade)
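An index-only query can be observed with SQLite from Python's standard library. This is a sketch under assumptions: SQLite uses B-tree indexes rather than the B+ tree described here, the Transcript columns other than StudId, CrsCode and Grade are made up, and the index name is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Transcript (StudId INTEGER, CrsCode TEXT, Semester TEXT, Grade TEXT)"
)
# Index on (CrsCode, Grade): holds every value the query below needs.
conn.execute("CREATE INDEX TranscriptCrsGrade ON Transcript (CrsCode, Grade)")
conn.executemany(
    "INSERT INTO Transcript VALUES (?, ?, ?, ?)",
    [(1, "CS317", "F2007", "A"), (2, "CS317", "F2007", "B"), (1, "CS387", "S2008", "A")],
)

query = "SELECT Grade FROM Transcript WHERE CrsCode = ?"
grades = [row[0] for row in conn.execute(query, ("CS317",))]

# Ask SQLite how it would run the query; the last column describes the access path.
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query, ("CS317",))]
print(grades)
print(plan)
```

When the plan mentions a covering index, the query is answered from the index alone, without touching the rows of Transcript.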