Parametric Query Optimization
Sumit Ganguly
Department of Computer Science and Engineering
Indian Institute of Technology Kanpur
[email protected]

Proceedings of the 24th VLDB Conference, New York, USA, 1998
Abstract
Query optimizers normally compile queries
into one optimal plan by assuming complete
knowledge of all cost parameters such as selectivity and resource availability. The execution
of such plans could be sub-optimal when cost
parameters are either unknown at compile
time or change significantly between compile
time and runtime [Loh89, GrW89]. Parametric query optimization [INS+92, CG94, GK94]
optimizes a query into a number of candidate
plans, each optimal for some region of the parameter space. In this paper, we present parametric query optimization algorithms. Our
approach is based on the property that for
linear cost functions, each parametric optimal
plan is optimal in a convex polyhedral region
of the parameter space. This property is used
to optimize linear and non-linear cost functions. We also analyze the expected sizes of
the parametric optimal set of plans and the
number of plans produced by the Cole and
Graefe algorithm [CG94].
1 Introduction

Database queries are optimized based on cost models that calculate costs for query plans. The cost of a query plan depends on parameters such as base and intermediate relation cardinalities, predicate selectivities, available memory, disk bandwidth, processor speeds and the existence of access paths. The values of some of these parameters may change over time because of changes in the database state, the access paths and the execution environment. Moreover, estimating parameters for queries containing unbound variables is often not possible. The compilation of a query into a single static plan [GrW89] could therefore result in significantly sub-optimal executions. This issue has been pointed out by Lohman [Loh89] and by Graefe and Ward [GrW89]. Approaches towards this problem have been presented by Ioannidis et al. [INS+92], Cole and Graefe [CG94], Antoshenkov [Ant93], and by Krishnamurthy and this author [GK94].

1.1 The Parametric Query Optimization Problem

Given a query and a set of cost parameters, the parametric query optimization problem is to compute the parametric optimal set, that is, the set of plans each of which is optimal at some point of the parameter space, together with the region of the parameter space over which each such plan is optimal.
1.2 Review of Existing Solutions
Three techniques [CG94, INS+92, GK94] with varying degrees of generality have been proposed to solve the parametric query optimization problem. The most general technique is due to Cole and Graefe [CG94], to which we refer as the CG technique. In this approach, the cost of a plan p is modeled as an interval [l(p), u(p)], where l(p) is the lowest cost of p over the parameter space and u(p) is the highest cost of p over the parameter space. A partial order <CG is defined over the set of plans as follows: for plans p and q, p <CG q if the cost interval of p lies to the left of the cost interval of q, that is, u(p) < l(q). The CG technique computes the set of least plans with respect to <CG, that is, {p | for every plan q, not (q <CG p)}. There are two problems with the CG algorithm. First, it computes a superset of the parametric optimal set. As shown in Section 9.3, the expected number of plans generated by this technique can be much larger than the expected size of the parametric optimal set (for example, CG may yield √N or more plans when the expected size of the parametric optimal set is O(ln N), where N is the cardinality of the plan space). The second problem with the CG technique is that it does not include any mechanism to find the regions of optimality.
Krishnamurthy and this author [GK94] present an algorithm to compute the parametric optimal set for the specific case when the parameter is the ratio s of the load factors of two sites in a distributed database system. The cost of a plan is modeled as W_1 + s·W_2, where W_1 and W_2 quantify the work done by the plan at each of the two sites. This algorithm generates exactly the parametric optimal set together with the regions of optimality. An alternative formulation of this algorithm is presented in Section 4.
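Viewed geometrically (a restatement of the one-parameter cost model above, not an additional assumption), each plan p contributes a line cost(p, s) = W_1(p) + s·W_2(p) in the (s, cost) plane. The parametric optimal set then consists exactly of the plans whose lines appear on the lower envelope

    min over all plans p of ( W_1(p) + s·W_2(p) )

for the admissible values of s, and the regions of optimality are the intervals of s on which a given line attains this minimum.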
Ioannidis, Ng, Shim and Sellis [INS+92] present a randomization approach to this problem. This approach does not give guarantees about producing all parametric optimal plans, nor does it give any bounds on deviations from optimality.
1.3 Linear and Non-linear Cost Functions
If a query is the union of two subqueries, such that in each subquery there is a predicate with unknown selectivity, then the cost function of a simple optimizer may be of the form

    C(p, s, t) = x_0(p) + x_1(p)·s + x_2(p)·t

which is linear in the selectivities s and t. Consider instead a query in which two predicates with unknown selectivities (s and t respectively) are connected by an and operator. Then the cost function may be expressed (under simplifying conditions) as

    C(p, s, t) = x_0(p) + x_1(p)·s + x_2(p)·t + x_3(p)·s·t

where
    x_0(p) is the cost of that portion of p that does not depend on either selectivity,
    x_1(p)·s is the cost of that portion of p that depends only on s,
    x_2(p)·t is the cost of that portion of p that depends only on t, and
    x_3(p)·s·t is the cost of that portion of the plan that depends on both s and t.

This shows that non-linear cost functions arise naturally even in simple cost models. The cost function when available memory is a parameter can be expected to be piecewise linear.
1.4 Summary of Paper
This paper presents a solution technique for the parametric query optimization problem by considering the structure of the regions of optimality. First, it is shown that, for n-parameter linear cost functions, the parameter space can be decomposed into polyhedra, each polyhedron being the region of optimality for some plan in the parametric optimal set, and vice versa. We then consider the problem of computing the parametric optimal plans for single parameter (unary) and two parameter (binary) linear cost functions.

The algorithm for binary linear cost functions traverses the edges of each optimal region (a convex polygon) in counterclockwise order. This traversal is facilitated by two routines, neighbor and corder. Given a vertex of a polygon representing a region of optimality and the direction of an edge, neighbor yields the next vertex and the set of plan(s) optimal there. Given a set of equally optimal plans at a vertex, corder orders them angularly around that vertex.
Organization
2 Structure of Parameter Space for Linear Cost Functions

Polyhedral decomposition of the parameter value space

The cost of a plan p is modeled as a linear function of the parameters a_1, ..., a_n:

    C(p, a_1, ..., a_n) = x_0(p) + x_1(p)·a_1 + ... + x_n(p)·a_n

For a point u of the parameter space and a direction vector d, the restriction of this cost function to the ray u + a·d, a ≥ 0, is itself a unary linear cost function, denoted C_{u,d}:

    C(p, u + a·d) = C(p, u) + a·(y(p)·d)

where y(p) denotes the coefficient vector (x_1(p), ..., x_n(p)).
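The regions of optimality are convex polyhedra as a direct consequence of this linear form (a derivation that follows directly from the formula above): the region of optimality of a plan p is the set of parameter values at which p costs no more than any other plan,

    R(p) = { a : C(p, a) ≤ C(q, a) for every plan q }
         = intersection over plans q of { a : (x_0(p) − x_0(q)) + (x_1(p) − x_1(q))·a_1 + ... + (x_n(p) − x_n(q))·a_n ≤ 0 },

and each constraint is a linear inequality, i.e., a half-space, so R(p) is an intersection of half-spaces and hence a convex polyhedron.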
3 Angular Ordering of Iso-optimal Plans at a Vertex
Consider binary linear cost functions and the associated convex polygonal decomposition of the 2-dimensional parameter value space. Let v be a vertex of a polygon corresponding to R(p) for some p in the parametric optimal set. Thus v is the intersection point of the optimal regions of possibly more than one plan. Let pl be the set of plans iso-optimal at v. Choose a vector l of a very small magnitude δ, with one end fixed at v and initially aligned along a given direction d. If we rotate l clockwise through a full circle, then the other end of l successively passes through the optimal regions of the plans in pl. The circular order of plans so obtained is called the clockwise ordering of the plans that are iso-optimal at v with reference to d. Analogously, one can define the counterclockwise ordering. The idea is illustrated in Figure 1.
Figure 1: Clockwise ordering
Computing the clockwise ordering of iso-optimal plans
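The following Python sketch illustrates one direct way to obtain this ordering by probing plan costs on a small circle around the vertex; the routine cost(plan, point) and the probe radius delta are assumptions of the sketch, and it is not necessarily the procedure used in the paper.

    import math

    def clockwise_ordering(plans, cost, v, d, delta=1e-6, steps=3600):
        # Order the plans iso-optimal at vertex v in the clockwise order in which
        # their optimal regions are met by a short probe vector rotated clockwise,
        # starting from direction d.
        base = math.atan2(d[1], d[0])
        order = []
        for k in range(steps):
            # rotating clockwise means the angle decreases from the initial direction
            angle = base - 2.0 * math.pi * k / steps
            probe = (v[0] + delta * math.cos(angle), v[1] + delta * math.sin(angle))
            # the plan that is optimal at the probe point owns the region we are in
            best = min(plans, key=lambda p: cost(p, probe))
            if best not in order:
                order.append(best)
        return order

A more careful implementation would compare the gradients of the cost functions of the iso-optimal plans at v instead of probing numerically, but the probing version mirrors the definition of the clockwise ordering directly.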
4 Parametric query optimization for unary linear cost functions
procedure find-all-segments(a, b) {
    /* Input: the parameter s lies in [a, b].
       Output: the parametric optimal plans in [a, b]. */
    pa := optimize(a);
    pb := optimize(b);
    if (pa ∩ pb ≠ ∅)
        return [a, b], pa ∩ pb;
    ...
}

procedure find-segments(u, p, v, q) {
    ...
    rx := lowest(1, pc);
    ry := lowest(2, pc);
    return find-segments(u, p, s, rx) ∘ find-segments(s, ry, v, q);
} /* end of procedure find-segments */
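The recursive strategy suggested by this procedure can be sketched in Python as follows; the interfaces optimize(s), returning the set of plans optimal at parameter value s, and coeffs(p), returning (w1, w2) such that cost(p, s) = w1 + s·w2, are assumptions of the sketch rather than definitions taken from the paper.

    def find_all_segments(a, b, optimize, coeffs):
        # Return a list of (interval, plan) pairs covering [a, b], where each plan
        # is optimal throughout its interval.
        pa, pb = optimize(a), optimize(b)
        common = pa & pb
        if common:
            # some plan is optimal at both endpoints; since costs are linear in s,
            # it is optimal throughout [a, b]
            return [((a, b), next(iter(common)))]
        # pick one representative plan optimal at each endpoint and intersect their
        # cost lines; ties and parallel lines are ignored in this sketch
        p, q = next(iter(pa)), next(iter(pb))
        (w1p, w2p), (w1q, w2q) = coeffs(p), coeffs(q)
        s = (w1q - w1p) / (w2p - w2q)   # the two cost lines meet at parameter value s
        return find_all_segments(a, s, optimize, coeffs) + \
               find_all_segments(s, b, optimize, coeffs)

The number of recursive calls grows with the number of parametric optimal plans in [a, b] rather than with the size of the plan space.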
5 Computing neighboring plans along a direction for binary linear cost functions

5.1 Neighbors along a direction
Consider the convex polygonal partitioning of the parameter space induced by the parametric optimal set. Let p be a parametric optimal plan and let R(p) denote the polygon which is the region of optimality for p. Let u be a point in R(p) and let d be a direction vector designating a ray emanating from u, as shown in Figure 3.

The neighbors of p along d starting at u is the set of plans q such that the regions of optimality R(p), R(q) and the ray u + α·d, for scalar α > 0, intersect at a single point. By definition, the neighbors of p also include p. Figure 3 illustrates the three types of neighbors possible. In (a), the neighbor of p along d from u is the set {p, q}. In (b), the neighbor is the set of plans {p, q1, q2, q3}. In (c), the neighbor is the singleton set {p}. The function neighbor(p, u, d, β) is used to denote the set of plans q such that R(p), R(q) and the directed line segment u + α·d, for 0 < α ≤ β, intersect at a single point. Again, by definition, neighbor(p, u, d, β) includes p.

An alternative, equivalent formulation of neighbor is the following. The line segment u + α·d, for 0 ≤ α ≤ β, is a segment of the affine set. The parametric optimal set for the unary linear cost function C_{u,d}, defined in Section 2, is a sequence of plans p = p_1, p_2, ..., p_h and a sequence of values s_0 = 0 < s_1 < s_2 < ... < s_h = β such that p_i is optimal for the interval [s_{i-1}, s_i]. In this interpretation, neighbor(p, u, d, β) is the set of plans that are optimal at α = s_1 in the affine set or, equivalently, at u + s_1·d in the 2-dimensional parameter space. This formulation is used to design an algorithm for computing the neighbor function in Section 5.2.
5.2 Algorithm to compute neighbors
procedure neighbor(p, u, d, β) {
    pb := optimize(u + d·β);
    if (p ∈ pb) return (pb, u + d·β);
    q := lowest(1, u, d, pb);
    return neighbor-segment(p, u, d, q);
}

procedure neighbor-segment(p, u, d, q) {
    v := isopoint(u, d, p, q);
    pc := optimize(v);
    if (p ∈ pc) return (pc, v);
    rz := lowest(1, u, d, pc);
    return neighbor-segment(p, u, d, rz);
}
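For concreteness, the same iteration can be sketched in Python; the representation of a plan's cost by coefficients (x0, x1, x2) with cost x0 + x1·s + x2·t, and the helper optimize(point), are assumptions of the sketch, not interfaces defined in the paper.

    def isopoint(u, d, p, q, coeffs):
        # Value of alpha at which plans p and q have equal cost along the ray u + alpha*d.
        (p0, p1, p2), (q0, q1, q2) = coeffs(p), coeffs(q)
        # the cost difference along the ray is linear in alpha: base + alpha * slope
        base = (p0 - q0) + (p1 - q1) * u[0] + (p2 - q2) * u[1]
        slope = (p1 - q1) * d[0] + (p2 - q2) * d[1]
        return -base / slope

    def neighbor_segment(p, u, d, q, optimize, coeffs):
        # Shrink the candidate crossing point until p itself is optimal there; the set
        # of plans optimal at that point gives the neighbors of p along d from u.
        alpha = isopoint(u, d, p, q, coeffs)
        point = (u[0] + alpha * d[0], u[1] + alpha * d[1])
        pc = optimize(point)
        if p in pc:
            return pc, point
        # continue with the plan in pc whose cost line crosses that of p closest to u
        r = min(pc, key=lambda plan: isopoint(u, d, p, plan, coeffs))
        return neighbor_segment(p, u, d, r, optimize, coeffs)

The choice of the next candidate plan here (the earliest crossing with p along the ray) is one reasonable reading of the role played by lowest in the pseudocode above; degenerate cases such as parallel cost lines are ignored in the sketch.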
6 Computing parametric optimal plans for binary linear cost functions

procedure find-all-regions(B) {
    /* Input: B defines the boundary of the 2-dimensional parameter space. */
    v := first vertex of B;
    d := direction of first edge of B;
    p := least(2, v, d, optimize(v));
    insert(p, Visited);
    p.edgelist := new edge(v, UNKNOWN, d, NULLPLAN);
    enqueue(p);
    q := dequeue();
    while (q ≠ NULL) {
        find-region(q, B);
        q := dequeue();
    } /* end while */
}
Figure 6: Algorithm for the parametric query optimization problem for binary linear cost functions
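The overall control flow of Figures 6 and 7 can be summarized by the following Python sketch; the helpers optimize and region_edges are treated as black boxes with assumed signatures, so this is an illustration of the breadth-first traversal rather than a transcription of the figures.

    from collections import deque

    def find_all_regions(boundary, optimize, region_edges):
        # boundary     : polygon bounding the 2-dimensional parameter space,
        #                with methods first_vertex() and first_edge_direction()
        # optimize     : optimize(point) -> set of plans optimal at that point
        # region_edges : region_edges(plan, vertex, direction) -> (vertices of R(plan)
        #                in counterclockwise order, list of (new_plan, vertex, direction)
        #                triples discovered at those vertices); stands in for Figure 7
        v0 = boundary.first_vertex()
        d0 = boundary.first_edge_direction()
        start = next(iter(optimize(v0)))      # a plan optimal at the starting corner
        visited = {start}
        queue = deque([(start, v0, d0)])
        regions = {}
        while queue:
            plan, v, d = queue.popleft()
            vertices, discovered = region_edges(plan, v, d)
            regions[plan] = vertices
            for new_plan, new_vertex, new_dir in discovered:
                if new_plan not in visited:    # enqueue each newly found optimal plan once
                    visited.add(new_plan)
                    queue.append((new_plan, new_vertex, new_dir))
        return regions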
Consider a vertex u of R(p) and an edge direction d, and let q be a plan such that the regions of optimality R(p) and R(q) lie to the left and right respectively as we traverse from u along direction d (see Figure 5(a)). Using neighbor, we can find the coordinates of the next vertex v and the plans that are iso-optimal at v, that is, the plans whose regions of optimality share the vertex v. Using csort, we can find the direction vectors of each of the edges at v. The leftmost turn at v with respect to the incoming direction d or, equivalently, the first edge in clockwise order with respect to the direction -d, gives us the edge direction of R(p) which follows the edge direction d in the counterclockwise traversal of the edges of R(p).

Therefore, given a vertex u of an optimal region R(p) and an edge direction, we can traverse the edges of R(p) in counterclockwise order (i.e., by keeping the interior of R(p) to the left). Each vertex is visited exactly once. The algorithm, when applied to the polygonal partitioning shown in Figure 5(b), discovers vertices in the breadth-first order as numbered in the figure. It is assumed that we start at the vertex numbered 0 and initially move along the edge 0-10.

The following property of breadth-first search is used by the algorithm to avoid visiting a vertex twice. Suppose a set of edges e1, e2, ..., ek of the boundary of R(p) is visited before we start traversing R(p) (i.e., before p comes to the head of the queue). Then this set of edges forms a path, although the sequence of edges in the path may not be identical to the sequence in which they were visited.
Detailed Description of the Algorithm
procedure find-region(p, B) {
    /* Input: p is a parametric optimal plan. B defines the boundary of the
       parameter space. cv is the current vertex, iv the initial vertex, and
       "..." stands for don't-care values. */
    el := p.edgelist;
    (cv, iv, d, ...) := el.first();
    eno := 0;
    while (collinear(cv, iv, d) = FALSE) {       /* iv is not along d from cv */
        (eno, dist) := find-intersection(B, cv, d, eno);
        (nv, pl) := neighbor(p, cv, d, dist);
        m := ...;                                /* number of plans in pl */
        pa := pl.planarray();                    /* list of plans in pl */
        csort(m, pa, d, ...);
        pa[m] := NULLPLAN;
        for i := 1 to m - 1 do {
            if (IsVisited(pa[i]) = FALSE)
                enqueue(pa[i]);
            insert((nv, UNKNOWN, ...), Visited);
            ...
        }
        cv := nv;
        ...
    }
}

Figure 7: Algorithm to find the region of a parametric optimal plan for binary linear cost functions

7 Structure of parametric optimal set for non-linear cost functions

Consider binary cost functions of the form

    C(p, s, t) = x(p) + y(p)·s + z(p)·t + w(p)·s·t
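One elementary observation about this form (stated here for orientation, and not a summary of the paper's treatment): for any fixed value t = t_0 the cost reduces to a unary linear function of s,

    C(p, s, t_0) = (x(p) + z(p)·t_0) + (y(p) + w(p)·t_0)·s,

and symmetrically for fixed s, so along any axis-parallel line in the parameter space the machinery developed for linear cost functions applies unchanged.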
8 Computing parametric optimal set for non-linear cost functions
9 Expected Sizes

9.1 Unary linear cost functions
Theorem 4 If the costs of N plans are chosen independently from a 2-dimensional normal distribution, then E(h) = O(√(ln N)). Furthermore, E(h) < 2 + √(ln N) + O(...). (The relevant classical result on the convex hull of randomly chosen points is due to Rényi and Sulanke [RS63].)
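The geometric connection behind results of this kind can be stated briefly (a standard argument under the unary linear cost model C(p, s) = x_0(p) + s·x_1(p) of Section 2, not an additional claim of the paper): ignoring ties and the restriction of s to a bounded interval, a plan belongs to the parametric optimal set exactly when its cost line lies on the lower envelope of the N cost lines, which happens exactly when the point (x_1(p), x_0(p)) is a vertex of the lower convex hull of the N coefficient points. For points drawn from a normal distribution the expected number of such hull vertices grows only as O(√(ln N)), whereas the expected number of coordinate-wise undominated points among N random points in the plane is about ln N [BKS+78].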
E(h) is estimated under various probability distribution assumptions. Theorems 6 and 7 consider binary non-linear cost functions of the form likely to arise when the selectivities of two predicates are treated as parameters. Note that for uniform distributions, E(h) is O((ln N)^2), although the points are uniformly distributed in a four-dimensional space. The constant of the leading order term is small (less than 1/10). For N = 1 billion, E(h) can be calculated to be less than 40.
Theorems 8, 9 and 10 discuss the expected number of plans E(g) produced by the Cole and Graefe technique under assumptions similar to those made in Theorems 3 through 6. The values of E(g) lie in the range from √N to nearly N, whereas the value of E(h) lies in the range from √(ln N) to ln N. This shows that the Cole and Graefe technique could produce a large number of plans that are not members of the parametric optimal set of plans. The above result is a specific case of a general theorem by Bentley, Kung, Schkolnick and Thompson [BKS+78].
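The gap between E(g) and E(h) can also be seen in a small Monte Carlo sketch (Python; a synthetic cost model with cost x_0 + s·x_1, s in [0, 1], and uniformly drawn coefficients; an illustration, not an experiment reported in this paper). It counts the plans on the lower envelope, i.e., the parametric optimal set for a unary linear cost function, and the plans that survive interval domination in the style of the CG technique.

    import random

    def simulate(n_plans=10000, grid=500, seed=0):
        rng = random.Random(seed)
        # each plan has cost x0 + s * x1 with s in [0, 1]; coefficients drawn uniformly
        plans = [(rng.random(), rng.random()) for _ in range(n_plans)]

        # parametric optimal set: plans that are cheapest for some sampled value of s
        optimal = set()
        for k in range(grid + 1):
            s = k / grid
            optimal.add(min(range(n_plans), key=lambda i: plans[i][0] + s * plans[i][1]))

        # CG-style pruning: the cost interval of a plan is [x0, x0 + x1]; a plan is kept
        # unless some other plan's entire interval lies strictly below its own interval
        best_upper = min(x0 + x1 for x0, x1 in plans)
        cg_kept = [i for i, (x0, x1) in enumerate(plans) if x0 <= best_upper]

        print("parametric optimal plans:", len(optimal))
        print("plans kept by interval domination:", len(cg_kept))

    simulate()

With these (arbitrary) settings the interval-domination filter typically retains on the order of a hundred plans while the lower envelope contains only about a dozen, which is consistent with the √N versus ln N behaviour discussed above.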
9.2 Non-linear cost function in two variables

Theorem 7 If the costs of N plans are chosen independently from a four-dimensional normal distribution, then E(h) = O(ln N).

9.3 Analysis

Conclusions
The paper discusses parametric query optimization algorithms for unary and binary linear cost functions and for simple unary non-linear cost functions. The approach is based on the property that the regions of optimality for linear cost functions are polyhedral. The proposed query optimization algorithms traverse along the boundaries of these polyhedra. Simple probabilistic models are used to convince the reader that the size of the parametric optimal set is not large for one or two parameters.

The paper raises several issues. Extension of this idea to three and higher dimensions is an immediate one and may be found in [Gan98]. Another issue is to study the problem for piecewise linear cost functions, which arise naturally when available memory is considered as a parameter. A third question is to explore the approximate version of the parametric query optimization problem, in which a set of approximately optimal plans is generated such that, for each value in the parameter space, there is at least one plan in the approximate set whose cost is within a given factor (1 + ε) of the cost of the optimal plan at that value.
Acknowledgements
The author wishes to thank U. Maheshwara Rao and
V.G.V. Prasad for their assistance in drawing the figures and the anonymous referees for their comments.
The nomenclature of iso-cost lines and iso-optimal
plans was given by Ravi Krishnamurthy, HP Labs.
References
[Ant93] G. Antoshenkov, Dynamic Query Optimization in Rdb/VMS, Proceedings of the IEEE International Conference on Data Engineering, April 1993.

[BKS+78] J.L. Bentley, H.T. Kung, M. Schkolnick and C.D. Thompson, On the average number of maxima in a set of vectors, Journal of the ACM 25, 536-543 (1978).

[CG94] R. Cole and G. Graefe, Optimization of Dynamic Query Evaluation Plans, Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data, Minneapolis, Minnesota, USA.

[Gan98] S. Ganguly, 1998.

[GK94] S. Ganguly and R. Krishnamurthy, Parametric Query Optimization for Distributed Databases based on load conditions, Proceedings of the International Conference on Management of Data, 1994.

[GrW89] G. Graefe and K. Ward, Dynamic Query Evaluation Plans, Proceedings of the 1989 ACM SIGMOD International Conference on Management of Data.

[INS+92] Y.E. Ioannidis, R.T. Ng, K. Shim and T.K. Sellis, Parametric Query Processing, Proceedings of the 1992 International Conference on Very Large Databases.

[RS63] A. Rényi and R. Sulanke, Über die konvexe Hülle von n zufällig gewählten Punkten I, Z. Wahrscheinlichkeitstheorie 2, 75-84 (1963).

[SAC+79] P.G. Selinger, M.M. Astrahan, D.D. Chamberlin, R.A. Lorie and T.G. Price, Access Path Selection in a Relational Database Management System, Proceedings of the 1979 ACM SIGMOD International Conference on Management of Data.