Parametric Query Optimization



Design and Analysis of Parametric Query Optimization Algorithms.
CONFERENCE PAPER, JANUARY 1998
Source: DBLP


1 AUTHOR:
Sumit Ganguly
Indian Institute of Technology Kanpur

Design and Analysis of Parametric Query Optimization Algorithms

Sumit Ganguly
Department of Computer Science and Engineering
Indian Institute of Technology, Kanpur, India 208016
[email protected]

Proceedings of the 24th VLDB Conference, New York, USA, 1998

Abstract
Query optimizers normally compile queries into one optimal plan by assuming complete knowledge of all cost parameters such as selectivity and resource availability. The execution of such plans could be sub-optimal when cost parameters are either unknown at compile time or change significantly between compile time and runtime [Loh89, GrW89]. Parametric query optimization [INS+92, CG94, GK94] optimizes a query into a number of candidate plans, each optimal for some region of the parameter space. In this paper, we present parametric query optimization algorithms. Our approach is based on the property that for linear cost functions, each parametric optimal plan is optimal in a convex polyhedral region of the parameter space. This property is used to optimize linear and non-linear cost functions. We also analyze the expected sizes of the parametric optimal set of plans and the number of plans produced by the Cole and Graefe algorithm [CG94].

1 Introduction

Database queries are optimized based on cost models that calculate costs for query plans. The cost of a query plan depends on parameters such as base and intermediate relation cardinalities, predicate selectivities, available memory, disk bandwidth, processor speeds and existence of access paths. The values of some of these parameters may change over time because of change in the database state, access paths and the execution environment. Moreover, estimating parameters for queries containing unbound variables is often not possible. The compilation of a query into a single static plan [GrW89] could result in significantly sub-optimal executions. This issue has been pointed out by Lohman [Loh89] and by Graefe and Ward [GrW89]. Approaches towards this problem have been presented by Ioannidis et al. [INS+92], Cole and Graefe [CG94], Antoshenkov [Ant93] and by Krishnamurthy and this author [GK94].

1.1 The Parametric Query Optimization Problem

Let $s_1, s_2, \ldots, s_n$ denote n parameters, where each $s_i$ quantifies some cost parameter, such as selectivity, table sizes, available memory etc. The cost of a plan p chosen from the space of feasible plans for the query is a function of the n parameters and is denoted by $C(p, s_1, s_2, \ldots, s_n)$. For every legal value of the parameters, there is some plan that is optimal for that value. Given a query and n parameters, the parametric optimal set of plans is the set of plans, each member of which is optimal for some point in the n-dimensional parameter space. Mathematically, the parametric optimal set may be defined as

$\{\, p \mid p \text{ is optimal for a point in parameter space} \,\}$

For every legal value of the parameters $s_1, \ldots, s_n$, there is a plan in the parametric optimal set that is optimal for that value and vice-versa. The region of optimality for a plan p is denoted by R(p) and is the set defined as

$\{\, (s_1, \ldots, s_n) \mid p \text{ is optimal at } (s_1, \ldots, s_n) \,\}$

The problem of parametric query optimization is to find the parametric optimal set of plans and the region of optimality for each parametric optimal plan.
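The two definitions above can be made concrete with a small sketch. It is an illustration only, not one of the algorithms of this paper: the four plans, their cost coefficients and the unit-square parameter space are hypothetical, and the regions are approximated by sampling a grid rather than computed exactly.

```python
from itertools import product

# Hypothetical plans: name -> (x0, x1, x2), with cost x0 + x1*s1 + x2*s2.
PLANS = {
    "p1": (10.0, 1.0, 1.0),
    "p2": (4.0, 8.0, 2.0),
    "p3": (5.0, 2.0, 9.0),
    "p4": (12.0, 5.0, 5.0),  # costlier than p1 everywhere on the unit square
}


def cost(plan, point):
    """Linear cost of a plan at a parameter point (s1, s2)."""
    x0, *coeffs = PLANS[plan]
    return x0 + sum(c * s for c, s in zip(coeffs, point))


def optimal_plans(point, tol=1e-9):
    """All plans whose cost at the point is minimal (ties included)."""
    costs = {p: cost(p, point) for p in PLANS}
    best = min(costs.values())
    return {p for p, c in costs.items() if c <= best + tol}


def sampled_regions(steps=20):
    """Approximate R(p) by the grid points of [0,1]^2 at which p is optimal."""
    grid = [i / steps for i in range(steps + 1)]
    regions = {}
    for point in product(grid, repeat=2):
        for p in optimal_plans(point):
            regions.setdefault(p, []).append(point)
    return regions


regions = sampled_regions()
# Plans that are optimal at one or more sampled points.
parametric_optimal_set = set(regions)
```

With these assumed coefficients, `parametric_optimal_set` contains p1, p2 and p3 but not p4, since p4 is never the cheapest plan. A sampled grid can miss a plan whose region is smaller than the grid spacing, which is one reason an exact algorithm is of interest.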

1.2 Review of Existing Solutions

Three techniques [CG94, INS+92, GK94] with varying degrees of generality have been proposed to solve the parametric query optimization problem. The most general technique is due to Cole and Graefe [CG94], to which we refer as the CG technique. In this approach, the cost of a plan p is modeled as an interval $[l(p), u(p)]$, where $l(p)$ is the lowest cost of p over the parameter space and $u(p)$ is the highest cost of p over the parameter space. A partial order $<_{CG}$ is defined over the set of plans as follows. For plans p and q, $p <_{CG} q$ if the cost interval of p lies to the left of the cost interval of q, that is, $u(p) < l(q)$. The CG technique computes the set of least plans with respect to $<_{CG}$, that is, $\{\, p \mid \text{for all plans } q,\ \neg (q <_{CG} p) \,\}$. There are two problems with the CG algorithm. First, it computes a superset of the parametric optimal set. As shown in Section 9.3, the expected number of plans generated by this technique could be much larger than the expected size of the parametric optimal set (e.g., CG may yield $\sqrt{N}$ or more plans when the expected size of the parametric optimal set is of the order of $\ln N$, where N is the cardinality of the plan space). The second problem with the CG technique is that it does not include any mechanism to find the regions of optimality.

Krishnamurthy and this author [GK94] present an algorithm to compute the parametric optimal set for the specific case when the parameter is the ratio s of the load factors of two sites in a distributed database system. The cost of a plan is modeled as $W_1 + s \cdot W_2$, where $W_1$ and $W_2$ quantify the work done by the plan at each of the two sites. This algorithm generates exactly the parametric optimal set together with the regions of optimality. An alternative formulation of this algorithm is presented in Section 4.

Ioannidis, Ng, Shim and Sellis [INS+92] present a randomization approach to this problem. This approach does not give guarantees about producing all parametric optimal plans, nor does it give any bounds on deviations from optimality.
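The gap between the CG set and the parametric optimal set can be seen on a toy instance. The sketch below is a simplified illustration under stated assumptions, not the CG algorithm itself: the five plans and their unary linear costs x + y·s on [0, 1] are hypothetical, the cost interval of each plan is read off at the two endpoints (valid only because the costs are linear), and the exact set is found by brute force over pairwise iso-cost points.

```python
# Hypothetical plans: name -> (x, y), with cost x + y*s for s in [A, B].
PLANS = {
    "p1": (1.0, 10.0),
    "p2": (6.0, 1.0),
    "p3": (5.0, 4.0),
    "p4": (8.0, 0.5),
    "p5": (12.0, 1.0),
}
A, B = 0.0, 1.0


def cost(p, s):
    x, y = PLANS[p]
    return x + y * s


def interval(p):
    """Cost interval [l(p), u(p)]; a linear cost attains both at the endpoints."""
    ca, cb = cost(p, A), cost(p, B)
    return min(ca, cb), max(ca, cb)


def cg_set():
    """Plans p for which no plan q satisfies u(q) < l(p)."""
    iv = {p: interval(p) for p in PLANS}
    return {p for p in PLANS if not any(iv[q][1] < iv[p][0] for q in PLANS)}


def parametric_optimal_set(tol=1e-9):
    """Brute force: test endpoints, pairwise iso-cost points and midpoints."""
    points = {A, B}
    names = list(PLANS)
    for i, p in enumerate(names):
        for q in names[i + 1:]:
            (x1, y1), (x2, y2) = PLANS[p], PLANS[q]
            if y1 != y2:
                s = (x1 - x2) / (y2 - y1)
                if A < s < B:
                    points.add(s)
    pts = sorted(points)
    pts += [(u + v) / 2 for u, v in zip(pts, pts[1:])]
    result = set()
    for s in pts:
        best = min(cost(p, s) for p in PLANS)
        result |= {p for p in PLANS if cost(p, s) <= best + tol}
    return result
```

For these assumed plans, `cg_set()` keeps p1, p2 and p3, while `parametric_optimal_set()` contains only p1 and p2: the interval of p3 overlaps the others, so CG cannot discard it even though p3 is never the cheapest plan.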
1.3 Linear and Non-linear Cost functions

The cost $C(p, s_1, s_2, \ldots, s_n)$ of a plan as a function of the n parameters depends on the cost model followed by the optimizer. Consider a query in which the selectivity of a single predicate is unknown. Using a simple cost model based on a uniform distribution assumption of data values [SAC+79], $C(p, s)$ may be expressed as $C(p, s) = x_0(p) + x_1(p) \cdot s$, where
$x_0(p)$ is the cost of that portion of p that is independent of s, and
$x_1(p) \cdot s$ is the cost of that portion of p that is linearly dependent on s.

If a query is the union of two subqueries, such that in each subquery there is a predicate with unknown selectivity, then the cost function of a simple optimizer may be of the linear form

$C(p, s, t) = x_0(p) + x_1(p) \cdot s + x_2(p) \cdot t$

Consider a query in which two predicates with unknown selectivities (s and t respectively) are connected by an and operator. Then, the cost function may be expressed as (under simplifying conditions)

$C(p, s, t) = x_0(p) + x_1(p) \cdot s + x_2(p) \cdot t + x_3(p) \cdot s \cdot t$

where
$x_0(p)$ is the cost of that portion of p that does not depend on either selectivity,
$x_1(p) \cdot s$ is the cost of that portion of p that depends only on s,
$x_2(p) \cdot t$ is the cost of that portion of p that depends only on t, and
$x_3(p) \cdot s \cdot t$ is the cost of that portion of the plan that depends on both s and t.

This shows that non-linear functions arise naturally even in simple cost models. The cost function when memory is a parameter could be expected to be a piecewise linear function.
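A brief numeric check may help to see why the conjunctive form is non-linear although it is linear in each selectivity separately. The coefficient values below are hypothetical; the second difference of a function is zero along any line on which the function is linear.

```python
def binary_linear(x0, x1, x2, s, t):
    """Cost of the form x0 + x1*s + x2*t."""
    return x0 + x1 * s + x2 * t


def conjunctive(x0, x1, x2, x3, s, t):
    """Cost of the form x0 + x1*s + x2*t + x3*s*t."""
    return x0 + x1 * s + x2 * t + x3 * s * t


def second_difference(g, u, h=0.1):
    """g(u+h) - 2*g(u) + g(u-h); zero when g is linear around u."""
    return g(u + h) - 2 * g(u) + g(u - h)


coeffs = (5.0, 2.0, 3.0, 4.0)  # assumed (x0, x1, x2, x3)

# With t held fixed the conjunctive cost is linear in s.
along_s = second_difference(lambda s: conjunctive(*coeffs, s, 0.5), 0.5)

# Along the diagonal s = t the product term makes it quadratic.
along_diagonal = second_difference(lambda s: conjunctive(*coeffs, s, s), 0.5)

# The linear form stays linear along the diagonal as well.
linear_diagonal = second_difference(
    lambda s: binary_linear(5.0, 2.0, 3.0, s, s), 0.5
)
```

Here `along_s` and `linear_diagonal` vanish up to rounding, while `along_diagonal` is approximately 2·x3·h², that is, 0.08 for the assumed values.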
1.4 Summary of Paper

This paper presents a solution technique for the parametric query optimization problem by considering the structure of the regions of optimality. First, it is shown that, for n-parameter linear cost functions, the parameter space can be decomposed into polyhedrons, each polyhedron being the region of optimality for some plan in the parametric optimal set and vice versa. We then consider the problem of computing the parametric optimal plans for single parameter (unary) and two parameter (binary) linear cost functions.

The algorithm for binary linear cost functions traverses the edges of each optimal region (a convex polygon) in a counterclockwise order. This traversal is facilitated by two routines, neighbor and corder. Given a vertex of a polygon representing a region of optimality, and the direction of an edge, neighbor yields the next vertex and the set of plan(s) optimal there. Given a set of equally optimal plans at a vertex, corder orders these plans according to the sequence in which their regions would be encountered if we rotate a vector of very small magnitude in a clockwise direction starting along a specific direction. Solutions for non-linear cost functions are illustrated by embedding non-linear curves within the regions of optimality of a generalized linear cost function. We also present a probabilistic analysis of the expected sizes of the parametric optimal set. This analysis shows that for one and two parameters (a) the expected size of the parametric optimal set of plans is not very large and (b) the general approach of Cole and Graefe may give a large number of non-optimal plans.
1.5 Organization

The paper is organized as follows. Section 2 presents notations for cost functions. Section 3 discusses the structure of the regions of optimality and the subroutine ccorder. Section 4 discusses an algorithm for unary linear cost functions. Section 5 presents the neighbor subroutine and Section 6 discusses an algorithm for binary linear cost functions. Section 7 discusses an approach for non-linear cost functions by embedding curves or surfaces in a generalized linear space. Section 8 sketches an algorithm for specific unary non-linear cost functions. Section 9 is concerned with probabilistic analysis of the size of the parametric optimal set. Finally, we conclude in Section 10.

2 Structure of Parameter Space for Linear cost functions

In this section, we study some structural properties of the regions of optimality of parametric optimal plans. We also discuss the concepts of iso-cost planes for pairs of plans and the iso-optimal set of plans. Finally we consider the problem of angular ordering of plans at a vertex of an optimal region and present an algorithm for solving it.

The region of optimality of a parametric optimal plan p for an n-ary cost function parameterized by $s = (s_1, s_2, \ldots, s_n)$ is denoted by R(p) and is defined as $\{\, s \mid p \text{ is optimal at } s \,\}$.

2.1 Polyhedral decomposition of Parameter value space

Consider linear n-ary cost functions. The following lemma expresses a simple property of linear cost functions.

Lemma 1 Let plan p be optimal with respect to an n-ary linear cost function at points u and v of the n-dimensional parameter space. Then p is optimal for each point w that lies on the line segment joining u with v.

The above lemma immediately implies the following theorem.

Theorem 2 Let p be a parametric optimal plan with respect to an n-ary linear cost function. Then the region of optimality for the plan p is a convex polyhedron.

Theorem 2 implies that the parameter value space is partitioned into convex polyhedrons, each polyhedron being the region of optimality of a parametric optimal plan. The parametric query optimization problem for linear cost functions may be equivalently viewed as finding the polyhedral decomposition of the parameter space induced by the parametric optimal plans.

2.2 Iso-cost planes, iso-optimal plans and restrictions of cost functions

Consider n-ary linear cost functions. For any plans p and q, the set of points u in the n-dimensional parameter space such that p and q have the same cost at u is a plane and is called the p, q iso-cost plane. To see this, consider the equation $C(p, u) = C(q, u)$. Since C is a linear function in u, the equation represents a plane in n dimensions.

Let us consider a unary linear cost function with $C(p, u) = x_1 + y_1 \cdot u$ and $C(q, u) = x_2 + y_2 \cdot u$. The equation $C(p, u) = C(q, u)$ can be solved to give a point $u = (x_1 - x_2)/(y_2 - y_1)$. The p, q iso-cost plane for unary linear cost functions is a point and is called the p, q iso-cost point, or even more simply the p, q isopoint. Similarly, the p, q iso-cost plane for binary linear cost functions is a line and is referred to as the p, q iso-cost line or as the p, q iso-line.

Multiple plans could be equally optimal at a point. For example, at each vertex of the polyhedral decomposition which lies in the interior of the admissible parameter space, multiple plans have the same cost and are optimal. Such plans are called iso-optimal plans.

Consider an n-ary linear cost function of the form

$C(p, s_1, s_2, \ldots, s_n) = x_0(p) + x_1(p) \cdot s_1 + \cdots + x_n(p) \cdot s_n$

Given a set of plans pl, the function lowest(i, pl) returns the plan(s) with the least value of the ith coordinate $x_i$. Let u be an n-dimensional vector and d be a unit length n-dimensional vector. The set of points that can be expressed in the form $u + \alpha d$ for $\alpha \ge 0$ represents a ray emanating from u in the direction d. The cost of a plan p on this ray is a function of a single parameter $\alpha$ and may be expressed as

$C_{u,d}(p, \alpha) = C(p, u) + \alpha \cdot y(p) \cdot d^T$

where $y(p) = (x_1(p), \ldots, x_n(p))$ and $\cdot$ represents the vector dot product. The function $C_{u,d}$ is said to be the restriction of the original cost function C to the given ray defined by u and d. Given a set of plans pl, the function lowest(i, u, d, pl) returns the plan(s) with the least value in the ith coordinate of $C_{u,d}(p, \alpha)$.
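The isopoint formula and the restriction of a cost function to a ray are simple enough to write down directly. The following is a minimal sketch under assumed representations: a plan is a tuple of its cost coordinates (x0, x1, ..., xn), and the helper names `iso_point` and `restrict` are illustrative rather than taken from the paper.

```python
def cost(plan, u):
    """Linear cost x0 + x1*u1 + ... + xn*un of a plan (x0, x1, ..., xn)."""
    return plan[0] + sum(c * s for c, s in zip(plan[1:], u))


def iso_point(p, q):
    """Isopoint of two unary plans (x, y); None when the slopes are equal."""
    (x1, y1), (x2, y2) = p, q
    if y1 == y2:
        return None  # parallel cost lines never meet (or coincide everywhere)
    return (x1 - x2) / (y2 - y1)


def restrict(plan, u, d):
    """Restriction C_{u,d}: returns (intercept, slope) as a unary plan.

    The intercept is C(p, u) and the slope is the dot product y(p) . d.
    """
    slope = sum(c * di for c, di in zip(plan[1:], d))
    return (cost(plan, u), slope)


# Two unary plans whose costs cross at s = 1.5.
p, q = (2.0, 3.0), (5.0, 1.0)
s_star = iso_point(p, q)

# A binary plan restricted to the ray from u = (1, 1) in direction d = (1, 0).
plan = (1.0, 2.0, 3.0)
restricted = restrict(plan, (1.0, 1.0), (1.0, 0.0))
```

Because the restriction is itself a unary linear cost in α, `iso_point` can be applied to two restricted plans to find where they cost the same along the ray.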
2.3 Angular Ordering of Iso-optimal plans at a vertex

Consider binary linear cost functions and the associated convex polygonal decomposition of the 2-dimensional parameter value space. Let v be a vertex of a polygon corresponding to R(p) for some p in the parametric optimal set. Thus v is the intersection point of the optimal regions of possibly more than one plan. Let pl be the set of plans iso-optimal at v. Choose a vector l of a very small magnitude $\delta$, with one end fixed at v and initially aligned along a given direction d. If we rotate l clockwise a full circle, then the other end of l successively passes through the optimal regions of plans in pl. The circular order of plans so obtained is called the clockwise ordering of the plans that are iso-optimal at v with reference to d. Analogously, one can define the counterclockwise ordering. The idea is illustrated in Figure 1.

Figure 1: Angular ordering of iso-optimal plans in 2-dimensional parameter space (clockwise ordering)

2.4 Computing the clockwise ordering of iso-optimal plans

Consider binary linear cost functions with parameters s and t. Given a direction d, a vertex v and the set of iso-optimal plans pl at v, we design an algorithm to compute the ordering of plans in pl with reference to d. In addition, we assume that we are also given $p_1 \in pl$, which is the first plan in the clockwise ordering. Consider the $p_1$, q iso-line for any $q \in pl$, $q \ne p_1$. This line makes an angle $\theta$ with d measured clockwise from d. The plan $p_2$ that makes the least angle is the second plan in the clockwise ordering. The procedure can be repeated with $p_2$ assuming the role of $p_1$ and considering plans in the set $pl - \{p_1, p_2\}$, and so on.

The angle that an iso-line makes with a direction vector d is computed using simple vector algebra. First, the equation of the iso-line is found. Let p and q be plans with cost coordinates $(x_1, y_1, z_1)$ and $(x_2, y_2, z_2)$ respectively. The equation of the p, q iso-line is $x_1 - x_2 + s \cdot (y_1 - y_2) + t \cdot (z_1 - z_2) = 0$. We then find the angle that the line makes with the given direction vector d measured clockwise from d, using the following procedure.

For any line with equation $x + y \cdot s + z \cdot t = 0$, the direction of the line is given by the vector $e = (z, -y)$ or its negative $(-z, y)$. If e is counterclockwise from d, then the angle between the line $x + y \cdot s + z \cdot t = 0$ and the vector d is quantified by $e \cdot d / |d|$. Otherwise, the angle is quantified as $-e \cdot d / |d|$. Here, $\times$ and $\cdot$ should be interpreted as the vector cross and dot product operators respectively; the sign of the cross product of d and e tells whether e is counterclockwise from d.

The procedure for angular (clockwise) ordering of iso-optimal plans outlined above also produces the directions of the edges of the polygonal regions of optimality emanating at v. This procedure is called csort(), although we do not show the pseudo-code as that would essentially contain elementary vector algebra. Actually, it is not necessary to assume that the first plan $p_1$ in the clockwise ordering is known. However, we do not discuss this issue any further.
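The clockwise ordering can also be obtained directly from its definition in Section 2.3, by rotating a short vector around the vertex and recording which plan is cheapest at its tip. The sketch below does exactly that; it is not the csort procedure (which works from iso-line angles and needs no sampling), and the three plans, the vertex and the sampling resolution are assumptions chosen for illustration.

```python
import math


def cost(plan, s, t):
    """Binary linear cost x + y*s + z*t of a plan (x, y, z)."""
    x, y, z = plan
    return x + y * s + z * t


def clockwise_order(plans, v, d, delta=1e-3, samples=720):
    """Order iso-optimal plans by sweeping a vector of length delta around v.

    Sampling starts half a step clockwise of d so that a sample never falls
    exactly on the starting direction.
    """
    start = math.atan2(d[1], d[0])
    order = []
    for k in range(samples):
        angle = start - (k + 0.5) * 2 * math.pi / samples  # clockwise sweep
        s = v[0] + delta * math.cos(angle)
        t = v[1] + delta * math.sin(angle)
        best = min(plans, key=lambda name: cost(plans[name], s, t))
        if not order or order[-1] != best:
            order.append(best)
    if len(order) > 1 and order[0] == order[-1]:
        order.pop()  # the sweep started in the middle of one plan's region
    return order


# Hypothetical plans, all of cost 3 at the vertex v = (1, 1).
PLANS = {"p1": (3.0, 0.0, 0.0), "p2": (1.0, 2.0, 0.0), "p3": (1.0, 0.0, 2.0)}
ordering = clockwise_order(PLANS, (1.0, 1.0), (1.0, 0.0))
```

For the assumed plans, p1 is optimal where both parameters exceed 1, p2 where s is below both 1 and t, and p3 where t is below both 1 and s, so a clockwise sweep starting along d = (1, 0) meets p3, then p2, then p1.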

3 Parametric query optimization for unary linear cost functions

In this section, we present an optimization procedure for unary cost functions, that is, when the cost of a plan p is of the form $C(p, s) = x(p) + s \cdot y(p)$. We assume that $a \le s \le b$, where a and b are given constants. This is the case, for example, when the selectivity of a single predicate is a parameter. We assume that a traditional query optimizer function optimize(s) returns the set of iso-optimal plans at s.

The procedure used is the following. Let pa and pb denote the sets of iso-optimal plans at a and b respectively. If $pa \cap pb$ is non-empty, say it contains p, then p is optimal at points a and b and therefore for each point in the interval [a, b], allowing us to terminate. Otherwise, let p be the plan lowest(2, pa) and q be the plan lowest(1, pb). Let s be the p, q isopoint and pc be the set of iso-optimal plans at s. Now there are two cases: (i) either p and q are both in pc or (ii) neither p nor q is in pc. If case (i) holds, we terminate the search, since p is optimal at a and s and therefore in the interval [a, s] and, analogously, q is optimal in the interval [s, b]. If case (ii) holds, we partition [a, b] into intervals [a, s] and [s, b] and we recursively use this procedure on these intervals. The procedure is presented in Figure 2, where it is called find-all-segments. The algorithm is an alternative and easier formulation of the algorithm presented in [GK94].

The complexity of the algorithm is dominated by the calls to the optimize routine. If the number of parametric optimal plans is h, then the number of calls to optimize is $\max(2h - 1, 2)$ [GK94].

procedure find-all-segments(a, b) {
Input: Parameter s lies in [a, b].
Output: Parametric optimal plans in [a, b] and their regions of optimality expressed as segments [u, v].
    pa := optimize(a);          /* find optimal plan at left end */
    pb := optimize(b);          /* find optimal plan at right end */
    if (pa ∩ pb ≠ ∅)
        return [a, b], pa ∩ pb; /* every plan in pa ∩ pb is optimal for the entire range [a, b] */
    p := lowest(2, pa);         /* if cost = x + s·y, then p is the plan in pa with lowest cost on y */
    q := lowest(1, pb);         /* q is the plan in pb with lowest cost on x */
    return find-segments(a, p, b, q);
} /* end of procedure find-all-segments */

procedure find-segments(u, p, v, q) {
/* u ≤ v, p = lowest(2, optimize(u)) and q = lowest(1, optimize(v)) */
    s := iso-point(p, q);       /* find the p, q isopoint */
    pc := optimize(s);
    if p ∈ pc
        return ([u, s], p) o ([s, v], q);   /* o is the list concatenation operator */
    rx := lowest(1, pc);
    ry := lowest(2, pc);
    return find-segments(u, p, s, rx) o find-segments(s, ry, v, q);
} /* end of procedure find-segments */

Figure 2: Parametric query optimization algorithm for unary linear cost functions

4 Computing neighboring plans along a direction for binary linear cost functions

Suppose that the cost function for a plan p has the form $C(p) = x(p) + s \cdot y(p) + t \cdot z(p)$. We first define the set of neighboring plans along a direction. We then present an algorithm to compute this set.

4.1 Neighbors along a direction

Consider the convex polygonal partitioning of the parameter space induced by the parametric optimal set. Let p be a parametric optimal plan and let R(p) denote the polygon which is the region of optimality for p. Let u be a point in R(p) and let d be a direction vector designating a ray emanating from u, as shown in Figure 3.

Figure 3: Examples of neighbors (panels (a), (b) and (c))

The neighbors of p along d starting at u is the set of plans q such that the regions of optimality R(p), R(q) and the ray $u + \alpha \cdot d$, for scalar $\alpha > 0$, intersect at a single point. By definition, the neighbor of p also includes p. Figure 3 illustrates the three types of neighbors possible. In (a), the neighbor of p along d from u is the set $\{p, q\}$. In (b), the neighbor is the set of plans $\{p, q_1, q_2, q_3\}$. In (c), the neighbor is the singleton set $\{p\}$.

The function neighbor(p, u, d, β) is used to denote the set of plans q such that R(p), R(q) and the directed line segment $u + \alpha \cdot d$, for $0 < \alpha \le \beta$, intersect at a single point. Again, by definition, neighbor(p, u, d, β) includes p.

An alternative equivalent formulation of neighbor is the following. The line segment $u + \alpha \cdot d$, for $0 \le \alpha \le \beta$, is a segment of the affine set. The parametric optimal set for the unary linear cost function $C_{u,d}$, defined in Section 2, is a sequence of plans $p = p_1, p_2, \ldots, p_h$ and a sequence of values $s_0 = 0 < s_1 < s_2 < \cdots < s_h = \beta$ such that $p_i$ is optimal for the interval $[s_{i-1}, s_i]$. In this interpretation, neighbor(p, u, d, β) is the set of plans that are optimal at $\alpha = s_1$ in the affine set or, equivalently, at $u + s_1 \cdot d$ in the 2-dimensional parameter space. This formulation is used to design an algorithm for computing the neighbor function in Section 5.2.
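Since this formulation defines neighbor through the unary parametric optimal set along the ray, it may help to see the unary procedure of Figure 2 in runnable form. The sketch below follows the two procedures closely but rests on assumptions: plans are distinct tuples (x, y), optimize is a brute-force stand-in that scans an explicit list of hypothetical plans, ties are detected with a small tolerance, and a final merge step (not part of Figure 2) joins adjacent segments that belong to the same plan.

```python
TOL = 1e-9  # tolerance used to detect equally optimal plans


def cost(plan, s):
    x, y = plan
    return x + y * s


def make_optimizer(plans):
    """Brute-force stand-in for optimize(s) that also counts its calls."""
    calls = [0]

    def optimize(s):
        calls[0] += 1
        best = min(cost(p, s) for p in plans)
        return [p for p in plans if cost(p, s) <= best + TOL]

    return optimize, calls


def lowest(i, plans):
    """Plan with the least ith cost coordinate (1 = x, 2 = y)."""
    return min(plans, key=lambda p: p[i - 1])


def iso_point(p, q):
    return (p[0] - q[0]) / (q[1] - p[1])


def find_segments(u, p, v, q, optimize):
    s = iso_point(p, q)
    pc = optimize(s)
    if p in pc:  # p and q cost the same at s, so q is optimal there too
        return [((u, s), p), ((s, v), q)]
    rx, ry = lowest(1, pc), lowest(2, pc)
    return (find_segments(u, p, s, rx, optimize)
            + find_segments(s, ry, v, q, optimize))


def find_all_segments(a, b, optimize):
    pa, pb = optimize(a), optimize(b)
    common = [p for p in pa if p in pb]
    if common:
        return [((a, b), common[0])]
    return find_segments(a, lowest(2, pa), b, lowest(1, pb), optimize)


def merge(segments):
    """Join consecutive segments that report the same plan."""
    merged = []
    for (lo, hi), plan in segments:
        if merged and merged[-1][1] == plan:
            merged[-1] = ((merged[-1][0][0], hi), plan)
        else:
            merged.append(((lo, hi), plan))
    return merged


PLANS = [(1.0, 10.0), (5.0, 1.0), (3.0, 4.0), (5.0, 4.0)]  # hypothetical
optimize, calls = make_optimizer(PLANS)
segments = merge(find_all_segments(0.0, 1.0, optimize))
```

For the assumed plans the lower envelope consists of h = 3 plans with breakpoints at 1/3 and 2/3, and the run makes 5 calls to optimize, in agreement with the count max(2h - 1, 2) quoted above. Applying the same routine to restricted costs on [0, β] and reading off the first breakpoint gives the point at which the neighbors of p are found.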
4.2 Algorithm to compute neighbors

In this section, we present an algorithm to compute neighbor(p, u, d, β). We modify the procedure for finding the parametric optimal set for linear unary cost functions with two basic differences. First, the cost function for a plan p is given by $C_{u,d}(p, \alpha)$. Thus, the notion of iso-point is defined with respect to the affine cost function. Secondly, in the find-segments procedure (Figure 2) the input interval [u, v] is partitioned into two segments [u, s] and [s, v], and the procedure is recursively invoked on both segments. For the neighbor function, it is only necessary to recursively descend along the left partition [u, s].

The algorithm for neighbor(p, u, d, β) is given in Figure 4. The procedure uses a traditional optimizer function optimize(v), where v is a two-dimensional point, to return the plan(s) that are equally optimal at v. The function isopoint(u, d, p, q) returns the point $u + d \cdot \alpha$ where $C_{u,d}(p, \alpha) = C_{u,d}(q, \alpha)$.

procedure neighbor(p, u, d, β) {
Input: u and d are 2 dimensional vectors. p = lowest(2, u, d, optimize(u)).
Output: Find neighbor along u + d·α, 0 < α ≤ β. It returns the neighboring set of plans and the neighboring point.
    pb := optimize(u + d·β);
    if p ∈ pb return (pb, u + d·β);
    q := lowest(1, u, d, pb);
    return neighbor-segment(p, u, d, q);
}

procedure neighbor-segment(p, u, d, q) {
    v := isopoint(u, d, p, q);
    pc := optimize(v);
    if p ∈ pc return (pc, v);
    rx := lowest(1, u, d, pc);
    return neighbor-segment(p, u, d, rx);
}

Figure 4: An algorithm to compute the neighbor function

The complexity of neighbor(p, u, d, β) is derived from calls to the optimize function. Suppose that there are h optimal regions in the parametric optimal set for $C_{u,d}$ for $0 < \alpha \le \beta$, corresponding to the segment boundary points $0 = s_0 < s_1 < s_2 < \cdots < s_h = \beta$. At each call to neighbor-segment, a segment $[s_{i-1}, s_i]$ is chosen. If we assume that the probability of choosing any of the h segments is identical, then the expected number of calls to optimize is $H_h + O(1)$, where $H_k$ is the kth harmonic number. The expected value of h is $O(\ln N)$ (see Section 9), where N is the number of plans in the plan space. In this case, it can be proved that the expected number of calls that the routine neighbor makes to the routine optimize is $O(\ln(\ln N))$.

Figure 5: Illustrating the basic ideas in the find-all-regions procedure (panels (a) and (b))

5 Computing parametric optimal plans for binary linear cost functions

In this section, we describe an algorithm that solves the parametric query optimization problem when any plan p has a cost function of the form $x(p) + y(p) \cdot s + z(p) \cdot t$. The set of values that s and t can take is assumed to be defined by a convex polygon whose boundary B is specified by a list of edges in counterclockwise order. For example, the boundary B could define a rectangle in many practical cases. The two main subroutines used by the algorithm are csort and neighbor, discussed in Sections 3 and 5 respectively.
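The boundary B can be held in a very simple structure. The sketch below is one possible representation under stated assumptions, not the one used by the paper: B is a list of vertices in counterclockwise order, edges are consecutive vertex pairs, and the helper names are illustrative. It also checks the two properties the text requires, convexity and counterclockwise orientation.

```python
def rectangle(s_min, s_max, t_min, t_max):
    """Vertices of an axis-parallel rectangle in counterclockwise order."""
    return [(s_min, t_min), (s_max, t_min), (s_max, t_max), (s_min, t_max)]


def edges(vertices):
    """Consecutive vertex pairs, wrapping around to close the polygon."""
    n = len(vertices)
    return [(vertices[i], vertices[(i + 1) % n]) for i in range(n)]


def signed_area(vertices):
    """Shoelace formula; positive when the vertices run counterclockwise."""
    return 0.5 * sum(a[0] * b[1] - b[0] * a[1] for a, b in edges(vertices))


def is_convex_ccw(vertices):
    """True for a convex polygon listed in counterclockwise order."""
    if signed_area(vertices) <= 0:
        return False
    n = len(vertices)
    for i in range(n):
        a, b, c = vertices[i], vertices[(i + 1) % n], vertices[(i + 2) % n]
        cross = (b[0] - a[0]) * (c[1] - b[1]) - (b[1] - a[1]) * (c[0] - b[0])
        if cross < 0:  # a right turn breaks convexity or orientation
            return False
    return True


# Two selectivities, each ranging over [0, 1].
B = rectangle(0.0, 1.0, 0.0, 1.0)
```

Listing the boundary counterclockwise means the interior of the parameter space always lies to the left of an edge, which matches the convention used for traversing regions of optimality.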

5.1 Basic idea of the algorithm

The algorithm starts by traversing along the first edge of the boundary B. In general, it traverses the edges of the regions of optimality in a counterclockwise direction. Each region of optimality is completely traversed before another region is considered. During this traversal, new parametric optimal plans may be discovered and their regions are subsequently traversed. The regions of plans are considered in breadth-first order and the procedure terminates when the queue of plans corresponding to the breadth-first traversal is empty.

Suppose we are given a vertex u of the polygonal partitioning and a unit vector d parallel to an edge (u, v) of a polygon. Further, we are given plans p and q, such that their regions of optimality, R(p) and R(q), lie to the left and right respectively as we traverse from u along direction d (see Figure 5(a)). Using neighbor, we can find the coordinates of v and the plans that are iso-optimal at v, that is, plans whose regions of optimality share a vertex at v. Using csort, we can find the direction vectors of each of the edges at v. The leftmost turn at v with respect to the incoming direction d or, equivalently, the first edge in clockwise order with respect to the direction -d gives us the edge direction of R(p) which follows the edge direction d in the counterclockwise traversal of the edges of R(p).

Therefore, given a vertex u of an optimal region R(p) and an edge direction, we can traverse the edges of R(p) in a counterclockwise order (i.e., by keeping the interior of R(p) to the left). Each vertex is visited exactly once. The algorithm, when applied to the polygonal partitioning shown in Figure 5(b), discovers vertices in the breadth-first order as numbered in the figure. It is assumed that we start at the vertex numbered 0 and initially move along the edge 0-10.

The following property of breadth-first search is used by the algorithm to avoid visiting a vertex twice. Suppose a set of edges $e_1, e_2, \ldots, e_k$ of the boundary of R(p) is visited before we start traversing R(p) (i.e., before p comes to the head of the queue). Then, this set of edges forms a path, although the sequence of edges in the path may not be identical to the sequence in which the edges have been visited. The algorithm is presented in Figure 6. This procedure uses an auxiliary procedure find-region that traverses the region of optimality of a given plan. This procedure is presented in Figure 7.

procedure find-all-regions(B) {
Input: B defines the boundary of the 2 dimensional legal parameter space.
Output: Parametric optimal plans and their regions of optimality.
Initial State: Queue is empty. Edgelists of all plans are empty.
    v := first vertex of B;
    d := direction of first edge of B;
    p := lowest(2, v, d, optimize(v));
    insert(p, Visited);
    enqueue(p);
    insert-path(p.edgelist, new edge(v, UNKNOWN, d, NULLPLAN));
    q := dequeue();
    while (q ≠ NULL) {
        find-region(q, B);
        q := dequeue();
    } /* end while */
} /* end of procedure find-all-regions */

Figure 6: Algorithm for the parametric query optimization problem for binary linear cost functions
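When every plan can be enumerated, the regions produced by find-all-regions can be cross-checked by brute force. The sketch below is not the traversal algorithm of this section, which only needs calls to optimize: it swaps in half-plane clipping (in the style of Sutherland-Hodgman), using the fact that R(p) is the boundary polygon intersected with the half-planes on which p costs no more than each other plan. The three plans and the boundary are hypothetical.

```python
def clip(polygon, a, b, c, eps=1e-12):
    """Part of a convex polygon (counterclockwise) with a*s + b*t + c <= 0."""
    out = []
    n = len(polygon)
    for i in range(n):
        cur, nxt = polygon[i], polygon[(i + 1) % n]
        f_cur = a * cur[0] + b * cur[1] + c
        f_nxt = a * nxt[0] + b * nxt[1] + c
        if f_cur <= eps:
            out.append(cur)  # vertex lies inside or on the iso-line
        if (f_cur < -eps and f_nxt > eps) or (f_cur > eps and f_nxt < -eps):
            w = f_cur / (f_cur - f_nxt)  # where the edge crosses the iso-line
            out.append((cur[0] + w * (nxt[0] - cur[0]),
                        cur[1] + w * (nxt[1] - cur[1])))
    return out


def region(index, plans, boundary):
    """R(p) for plans[index]: the boundary clipped by C(p) <= C(q) for all q."""
    xp, yp, zp = plans[index]
    poly = list(boundary)
    for j, (xq, yq, zq) in enumerate(plans):
        if j != index and poly:
            poly = clip(poly, yp - yq, zp - zq, xp - xq)
    return poly


def area(polygon):
    """Shoelace area of a polygon given in counterclockwise order."""
    n = len(polygon)
    return 0.5 * sum(polygon[i][0] * polygon[(i + 1) % n][1]
                     - polygon[(i + 1) % n][0] * polygon[i][1]
                     for i in range(n))


# Hypothetical plans (x, y, z) with cost x + y*s + z*t, on the square [0,2]^2.
PLANS = [(3.0, 0.0, 0.0), (1.0, 2.0, 0.0), (1.0, 0.0, 2.0)]
BOUNDARY = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
regions = [region(i, PLANS, BOUNDARY) for i in range(len(PLANS))]
```

For these assumed plans the three regions have areas 1, 1.5 and 1.5, which sum to the area of the boundary square, as expected of a partition. Each region is a convex polygon, in line with Theorem 2.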

Detailed

Description

of the Algorithm

A plan p is represented by its cost coordinates


(z(p), u(p), z(p)). For each parametric plan p, a list
of edgesel is kept such that el traversed in sequenceis
a path and when traversed keeps the interior of R(p)
to its left. The @,er) pairs are kept in a global data
structure. Each member of el models an edge and is a
record with schema (JbmVertex, To Vertex, Direction,
Rplan). When a vertex is discovered, the ToVertex
field is unknown, although, the Direction is known
(this information is obtained from csort). Rplan is
the plan whose region of optimality lies to the right
if we traverse the edge. The plan to the left of the
edge is p itself and so is not stored. Rplan defaults to
NULLPLAN
if the edge is on the admissible space
boundary B.
The second global data structure kept is that of a
queue of plans to implement breadth-first traversal. A
set Visited of plans is also kept to remember if a plan
was discovered earlier.
The function find-all-regions returns the set of
(p,el) pairs that identifies the parametric optimal
plans and their optimal regions. The auxiliary function find-region(p, B) returns the sequence of edges
that define the region of optimality R(p).
The following simple functions are used by the
procedure. Given a list I, Z.first() and l.last() re
turn the first and last member of 1. The function
insert-puth(el,e)
inserts an edge e either at the beginning of the path el or at the end of the path el,
such that after insertion, el still remains a path. Due
to a property of breadth-first traversal as mentioned
in the previous section, the set of edges traversed for
any plan is always a path.
Given points f, t and a direction d, the function
coZZinear(f, t, d) is true if the vector (t - f) is parallel
to d (i.e., (t - f) x d = 0 and (t - f) . d = 1). Given
a sequenceof boundary edges B and a ray emanating
from v in the direction d, find-intersection (B, v, d, k)
finds the index of the edge in B that the ray intersects
using a sequential search algorithm. The parameter k
specifies that the sequential search starts the check at
edge Ic, proceeds along the list by wrapping around if
necessary.
The following invariant is maintained when findwgion(p, B) is called. Let (cv, nextv, d,rp) be the
last edge in the edge list of p. Then nextv is UNKNOWN and cv and d are known. For every other

procedure find-region(p, B)
Input: p is a parametric optimal plan. B defines the

boundary of the legal parameter space. A list of edges


called p.edgelist is associated with p.
Output: Finds the boundary of the convex polygonal
region for which p is optimal. This boundary is traced
in counterclockwise order.
1.
2.

el = p.edgelist;
-) = eZ.first();
(

6.
7.

en0 := 0;
while (cdinear(cv,
iv, d) = FALSE)
/* iv is not along d from cv*/
(eno, dist) :=
find-intersection(B,
cv, d, eno);
(nv,pl) :=
neighbor(p, cv, d, dist);

/* new vertex nv is discovered */


update el set
el.Zast().ToVertex
= nv;
9.
if (rplan # NULLPLAN)
insert-path(
10.
rplan.edgelist, new edge (
nv, cv, NOTNEEDED,p);
11.
m := pZ.Zength();

aP7 3) = w + l4.P) * 3 + 4P> * f(s)


where f is any smooth function of s. Generalizations
to higher dimensions will be easy to see.
The definition of the region of optimality for a parametric optimal plan p is R(p) = {s] p is optimal at s}.
However, R(p) may not be a convex set. We adopt the
following approach for non-linear cost functions. Let t
be an artificial parameter. Consider the cost function

/* number of plans in pZ */
pa := pl.planarray();

/* list of plans in pZ */
csMm,pa,db -4ph

13.
14.
15.

pa[m] := NULLPLAN;
for i := 1 to m - 1 do {
if (IsVisited(pa[i])
= FALSE)

16.

Since C(2) is a binary linear function, the regions of


optimality for parametric optimal plans for Cc21are
convex polygons. The polygons for Ct2) in the s, t
spacethat intersect with the curve t = f(s) correspond

/* pali] not yet visited */


17.

enqueue(pa[i]);

18.
(nv, UNKNOWN,

19.

to the parametric optimal regions for the function

Visited);

/* mark the plan as visited */


}
/*endofif...
*/
} /* end of for ... */

20.

21.
22.
23.
24.
25.

/* put in rear of queue */


initialize p.edgelist to
dl[i],pa[i + 11);
insert(pu[i],

cv *= nv*
rpin := Ipa[l];

set

where x(p) is the cost of that portion of p that does not depend on either selectivity, y(p)·s is the cost of that portion of p that depends only on s, z(p)·t is the cost of that portion of p that depends only on t, and w(p)·s·t is the cost of that portion of the plan that depends on both s and t.
In order to avoid working in three or higher dimensions, we study the structure of the parametric optimal set of plans when the cost function of a plan p is of the form:

Structure of parametric optimal set for non-linear cost functions

$C(p, s, t) = x(p) + y(p) \cdot s + z(p) \cdot t + w(p) \cdot s \cdot t$

(cv, -, d, rplan) = el.last();

/* cv is current (i.e., last known) vertex */

Non-linear cost functions arise naturally in parametric


query optimization. For example, let s and t be the
unknown selectivities of two predicates in a conjunctive query. Then, a plan p has a cost function of the
form

/* - stands for dont care value;
iv is initial vertex */

edge (u, v, d, p') in the edge list of p, all the above four attributes are known. Of course, if the direction d defines an edge along the boundary, then p' is set to NULLPLAN.

/* end of while .... */

C.

This is illustrated in Figure 8. In this figure, p1 is optimal for the segment of the curve from O to a, p2 is optimal for the segment from a to b, p3 is optimal from b to c, p4 from c to d and finally p1 again (due to non-convexity) is optimal from d to e.
The structure of the parametric optimal set for piece-wise linear functions, which arise when memory is a parameter, is not explored in the paper and is left for future work.
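
The non-convexity described above can be reproduced numerically. The following is only a minimal sketch and not the algorithm of this paper: it assumes f(s) = s^2, two hypothetical plans whose coefficients (x, y, z) are invented for illustration, and a plain grid scan over s in [0, 1]. Each unary cost is evaluated by placing s on the curve t = f(s) and applying the linear cost in the (s, t) space.

```python
def linear_cost(plan, s, t):
    """Linear cost x + y*s + z*t of a plan in the (s, t) space."""
    x, y, z = plan
    return x + y * s + z * t


def f(s):
    """Assumed smooth convex curve t = f(s) (an illustrative choice)."""
    return s * s


def optimal_segments(plans, steps=1000):
    """Scan s over [0, 1] and return maximal runs (name, s_start, s_end)
    on which a single plan is cheapest along the curve t = f(s)."""
    segments = []
    for k in range(steps + 1):
        s = k / steps
        best = min(plans, key=lambda name: linear_cost(plans[name], s, f(s)))
        if segments and segments[-1][0] == best:
            # Same plan as at the previous grid point: extend the run.
            segments[-1] = (best, segments[-1][1], s)
        else:
            segments.append((best, s, s))
    return segments


# Hypothetical coefficients (x, y, z), chosen only to exhibit the effect.
PLANS = {
    "p1": (0.0, 1.0, 0.0),  # cost s
    "p2": (0.2, 0.0, 1.0),  # cost 0.2 + f(s)
}

if __name__ == "__main__":
    for name, start, end in optimal_segments(PLANS):
        print(f"{name} is optimal for s in [{start:.3f}, {end:.3f}]")
```

With these coefficients the scan reports p1, then p2, then p1 again, so the set of s values at which p1 is optimal is not an interval, even though each plan's region is convex in the (s, t) space.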

} /* end of procedure find-region */

Figure 7: Algorithm to find the region of a parametric optimal plan for binary linear cost functions

7 Computing parametric optimal set for non-linear cost functions

Figure 8: Embedding non-linear cost functions in a larger linear space

In this section, we sketch an algorithm that computes the parametric optimal set and regions for unary cost functions of the form $C(p, s) = x(p) + s \cdot y(p) + f(s) \cdot z(p)$. We assume that f is a smooth convex curve (e.g., $f(s) = s^2$). We first conceptually embed the curve t = f(s) in the polygonal partitioning induced by the parametric optimal set for the generalized linear function $C^{(2)}(p, s, t) = x(p) + s \cdot y(p) + t \cdot z(p)$ (e.g., see Figure 9).
We will illustrate the idea informally by using Figure 9. Suppose we start at the origin O in Figure 9. Let p be the optimal plan at O along the direction of the tangent at O. If we now invoke the subroutine neighbor along the direction of the tangent to the curve at O, then we arrive at the point numbered 1, which is the first cross-over point along the tangential direction. We take the leftmost turn at 1, using a variant of the subroutine csort, and call neighbor again to obtain point 2. We repeat this process until, for some i, the line segment joining points i and i - 1 intersects the curve. Since the curve is assumed to be convex, an intersection point must exist. This gives us the intersection point 3 in Figure 9. We then restart the process by navigating along the tangential direction at 3. The path traced out by successive iterations of this procedure is shown in Figure 9.
The following details have not been specified for the sake of brevity. First, the line segment joining points i and i - 1 may intersect the curve in more than one point. Secondly, a point on the curve may also be a vertex of the polygonal partitioning of $C^{(2)}$. All these issues can be satisfactorily handled.

Expected

Figure 9: Illustration of algorithm used to compute the parametric optimal set for non-linear functions

γ is Euler's constant and is equal to 0.577.... The size of the parametric optimal set is denoted by h. The expected size of the parametric optimal set is denoted by E(h).
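
As a quick numerical check of the harmonic-number notation used in this section, the sketch below compares the directly summed harmonic number with the approximation ln n + γ. The function names are illustrative and are not taken from the paper; the O(1/n) error term is simply dropped.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler's constant


def harmonic_exact(n):
    """H_n computed directly as the sum of 1/i for i = 1..n."""
    return sum(1.0 / i for i in range(1, n + 1))


def harmonic_approx(n):
    """Leading terms ln n + gamma of H_n; the neglected error is O(1/n)."""
    return math.log(n) + EULER_GAMMA


if __name__ == "__main__":
    for n in (10, 1000, 10**6):
        print(n, harmonic_exact(n), harmonic_approx(n))
    # For N = 1 billion only the approximation is evaluated.
    print(10**9, harmonic_approx(10**9))
```

For N equal to 1 billion the approximation evaluates to roughly 21.3, in line with the value of about 21 used in the discussion of this section.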
We first consider the problem of determining E(h)
under simple probability distributions for the case of
unary linear functions. Most results in this area are either known from mathematics or from computational
geometry. We then consider the same problem for binary non-linear functions, of the kind that would be
expected when the parameters are unknown predicate
selectivities. The results in this area are not available
in the literature (to the best of the knowledge of this
author). We then consider the problem of determining
the expected size of set of plans produced by the Cole
and Graefe algorithm. The section concludes with a
discussion.
8.1 Unary Linear cost functions

The cost of a plan p is given by a function $C(p, s) = x(p) + s \cdot y(p)$, and equivalently, the cost of a plan may be thought of as a point (x(p), y(p)) in two-dimensional space.
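
To make the point view concrete, the following sketch computes the parametric optimal set for cost functions of this form by walking from one cross-over point to the next. It is a simple O(N h) illustration under stated assumptions, not the algorithm analyzed in this paper: the parameter range [s_lo, s_hi], the function name and the sample plans are all hypothetical.

```python
def parametric_optimal_set(plans, s_lo=0.0, s_hi=1.0):
    """Return [(plan_index, s_start, s_end), ...] for costs x + s*y.

    plans is a list of (x, y) points. The walk moves from s_lo to s_hi,
    jumping at each step to the plan with the nearest cross-over point.
    """
    def cost(i, s):
        x, y = plans[i]
        return x + s * y

    # Cheapest plan at s_lo; ties go to the smaller slope, which is the
    # plan that stays cheapest immediately to the right of s_lo.
    cur = min(range(len(plans)), key=lambda i: (cost(i, s_lo), plans[i][1]))
    s = s_lo
    regions = []
    while True:
        x_c, y_c = plans[cur]
        nxt = None  # (cross-over point, slope, plan index)
        for j, (x_j, y_j) in enumerate(plans):
            if y_j >= y_c:
                continue  # only a plan with a smaller slope can overtake cur
            cross = (x_j - x_c) / (y_c - y_j)
            cand = (cross, y_j, j)
            if nxt is None or cand < nxt:
                nxt = cand
        if nxt is None or nxt[0] >= s_hi:
            regions.append((cur, s, s_hi))
            return regions
        regions.append((cur, s, nxt[0]))
        cur, s = nxt[2], nxt[0]


if __name__ == "__main__":
    sample = [(0, 10), (2, 4), (5, 1), (4, 6), (9, 0)]  # hypothetical (x, y)
    for index, start, end in parametric_optimal_set(sample, 0.0, 2.0):
        print(f"plan {index} is optimal for s in [{start:.3f}, {end:.3f}]")
```

On the sample input with s in [0, 2], plans 0, 1 and 2 are reported in that order; plans 3 and 4 never become optimal in this range, so h = 3 for that instance.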
Theorem 3 If the costs of N plans are chosen uniformly and randomly from a rectangle, then $E(h) = O(\ln N)$.
E(h) < ;HN+2+0

>

The above result is attributed to Rényi and Sulanke [RS63].

Sizes

In this section, we analyze the expected sizes of the parametric optimal set of plans under simple probability distributions. The following notation is used in this section. The size of the set of all the plans for a given query is denoted by N. For any natural number n, $H_n$ denotes the nth harmonic number, defined as $\sum_{i=1}^{n} 1/i$. For large n, $H_n \approx \ln n + \gamma + O(1/n)$, where

Theorem 4 If the costs of N plans are chosen independently from a 2-dimensional normal distribution, then $E(h) = O(\sqrt{\ln N})$.
Furthermore
E(h) < 2+m+O

various probability distribution assumptions. Theorems 6 and 7 consider binary non-linear cost functions of the form likely to arise when the selectivities of two predicates are treated as parameters. Note that for uniform distributions, E(h) is $O((\ln N)^2)$ although points are uniformly distributed in a four-dimensional space. The constant for the leading order term is small (less than 1/10). For N = 1 billion, E(h) can be calculated to be less than 40.
Theorems 8, 9 and 10 discuss the expected number of plans E(g) produced by the Cole and Graefe technique under assumptions similar to those made in Theorems 3 through 6. The values of E(g) lie in the range from $\sqrt{N}$ to nearly N, whereas the value of E(h) lies in the range from $\sqrt{\ln N}$ to $\ln N$. This shows that the Cole and Graefe technique could produce a large number of plans that are not members of the parametric optimal set of plans.

This result is originally due to Raynaud [Ray70].


Theorem 5 If the costs of N plans are chosen independently from any set of continuous distributions, then $E(h) < H_N$.

The above result is a specific case of a general theorem by Bentley, Kung, Schkolnick and Thompson
[BKS+78].
8.2 Non-linear cost function in two variables

Assume that the cost function for a plan p is of the form
$C(p, s, t) = x(p) + s \cdot y(p) + t \cdot z(p) + s \cdot t \cdot w(p)$

The cost of each plan may be thought of as a point (x(p), y(p), z(p), w(p)) in a four-dimensional space.
Theorem 6 If the costs of N plans are chosen independently from a four-dimensional hypercube, then $E(h) = O((\ln N)^2)$.
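
The setting of Theorem 6 can be explored empirically with a brute-force sketch such as the one below. It is an assumption-laden illustration rather than a method from this paper: plans are drawn uniformly from the unit four-dimensional hypercube, both selectivities are assumed to range over [0, 1], and the parametric optimal set is only under-approximated by evaluating every plan on a finite grid. All function names are hypothetical.

```python
import random


def bilinear_cost(plan, s, t):
    """Cost x + s*y + t*z + s*t*w of a plan given as the point (x, y, z, w)."""
    x, y, z, w = plan
    return x + s * y + t * z + s * t * w


def grid_optimal_set(plans, steps=20):
    """Indices of plans that are cheapest at some grid point of [0, 1] x [0, 1].

    This under-approximates the parametric optimal set: a plan whose region
    of optimality falls entirely between grid points is missed.
    """
    found = set()
    for a in range(steps + 1):
        for b in range(steps + 1):
            s, t = a / steps, b / steps
            best = min(range(len(plans)),
                       key=lambda i: bilinear_cost(plans[i], s, t))
            found.add(best)
    return found


def average_grid_size(n_plans, trials=20, seed=0):
    """Average size of the grid estimate over random plan sets drawn
    uniformly from the unit four-dimensional hypercube."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        plans = [tuple(rng.random() for _ in range(4)) for _ in range(n_plans)]
        total += len(grid_optimal_set(plans))
    return total / trials


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, average_grid_size(n, trials=5))
```

Because the grid can miss small regions, the printed averages are lower bounds on the sampled values of h and should be read as a rough indication of growth only.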

The paper discusses parametric query optimization algorithms for unary and binary linear cost functions and for simple unary non-linear cost functions. The approach is based on the property that the regions of optimality for linear cost functions are polyhedral. The query optimization algorithms proposed traverse along the boundaries of the polyhedron. Simple probabilistic models are used to convince the reader that the size of the parametric optimal set is not large for one or two parameters.
The paper raises several issues. An extension of this idea to three and higher dimensions is an immediate one and may be found in [Gan98]. Another issue is to study the problem for piecewise linear cost functions that arise naturally when available memory is considered as a parameter. A third question is to explore the approximate version of the parametric query optimization problem in which a set of approximately optimal plans is generated, such that for each value in the parameter space, there is at least one plan in the approximate set whose cost is within a given factor (1 + ε) of the cost of the optimal plan at that value.

Theorem 7 If the cost of N plans are chosen independently from a four dimensional normal distribution, then E(h) = O(ln N).
8.3 Analysis of the Cole and Graefe algorithm

Let g represent the number of plans produced by the Cole and Graefe algorithm, that is, the number of least plans with respect to the partial order $<_{CG}$. The expected value of g is denoted by E(g). Consider unary linear cost functions (i.e., $C(p, s) = x(p) + s \cdot y(p)$). The cost of each plan is described as a point (x(p), y(p)) in two-dimensional space.
Theorem 8 If the costs of N plans are chosen uniformly and randomly within a rectangle, then $E(g) = (N\pi/2)^{1/2}$.
Theorem 9 If the costs of N plans are chosen independently from continuous distributions, then $E(g) \geq (N\pi/2)^{1/2}$.
Theorem 10 There exist normal distributions such that if the costs of N plans are chosen independently from these distributions, then $E(g) > N - O((N/\ln N)^{1/2})$.
8.4

Conclusions

Acknowledgements
The author wishes to thank U. Maheshwara Rao and
V.G.V. Prasad for their assistance in drawing the figures and the anonymous referees for their comments.
The nomenclature of iso-cost lines and iso-optimal
plans was given by Ravi Krishnamurthy, HP Labs.

Discussion

Theorems 3 through 7 depict the values of E(h) under various assumptions and show that it is not extremely large. For comparison purposes, let N be 1 billion. This gives $H_N \approx 21$. For single parameter linear cost functions, E(h) varies between 14 and 22 under the

References

[Ant93] G. Antoshenkov, Dynamic Query Optimization in Rdb/VMS, Proceedings of the IEEE International Conference on Data Engineering, Vienna, Austria, April 1993.

[BKS+78] J.L. Bentley, H.T. Kung, M. Schkolnick and C.D. Thompson, On the average number of maxima in a set of vectors, Journal of the ACM 25, 536-543 (1978).

[CG94] R. Cole and G. Graefe, Optimization of Dynamic Query Evaluation Plans, Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data, Minneapolis, Minnesota, USA.

[GrW89] G. Graefe and K. Ward, Dynamic Query Evaluation Plans, Proceedings of the 1989 ACM SIGMOD International Conference on Management of Data, Portland, Oregon, USA.

[GK94] S. Ganguly and R. Krishnamurthy, Parametric Query Optimization for Distributed Databases based on load conditions, Proceedings of the 1994 International Conference on Management of Data, Pune, India.

[Gan98] S. Ganguly, Parametric Query Optimization: A tutorial, Technical Report, IIT Kanpur, August 1998. Available from www.cse.iitk.ac.in (202.54.56.145).

[INS+92] Y.E. Ioannidis, R.T. Ng, K. Shim and T.K. Sellis, Parametric Query Processing, Proceedings of the 1992 International Conference on Very Large Databases, Vancouver, British Columbia, Canada.

[Loh89] G.M. Lohman, Is Query Optimization a Solved Problem?, Proceedings of Workshop on Database Query Optimization, G. Graefe (editor), Oregon Graduate Center Computer Science Technical Report 89-005.

[Ray70] H. Raynaud, Sur l'enveloppe convexe des nuages de points aléatoires dans R^n, Journal of Applied Probability, 7, 35-48 (1970).

[RS63] A. Rényi and R. Sulanke, Über die konvexe Hülle von n zufällig gewählten Punkten I, Z. Wahrschein., 2, 75-84 (1963).

[SAC+79] P.G. Selinger et al., Access path selection in relational database systems, Proceedings of 1979 ACM SIGMOD International Conference on Management of Data.