

Acta Informatica 1, 173-189 (1972)

© by Springer-Verlag 1972

Organization and Maintenance of Large Ordered Indexes


R. BAYER and E. MCCREIGHT
Received September 29, 1971

Summary. Organization and maintenance of an index for a dynamic random
access file is considered. It is assumed that the index must be kept on some pseudo
random access backup store like a disc or a drum. The index organization described
allows retrieval, insertion, and deletion of keys in time proportional to $\log_k I$ where $I$
is the size of the index and $k$ is a device dependent natural number such that the
performance of the scheme becomes near optimal. Storage utilization is at least 50%
but generally much higher. The pages of the index are organized in a special data
structure, so-called B-trees. The scheme is analyzed, performance bounds are obtained,
and a near optimal $k$ is computed. Experiments have been performed with indexes
up to 100 000 keys. An index of size 15 000 (100 000) can be maintained with an average
of 9 (at least 4) transactions per second on an IBM 360/44 with a 2311 disc.

1. Introduction
In this paper we consider the problem of organizing and maintaining an
index for a dynamically changing random access file. By an index we mean a
collection of index elements which are pairs $(x, \alpha)$ of fixed size physically adjacent
data items, namely a key $x$ and some associated information $\alpha$. The key $x$ identifies
a unique element in the index, the associated information is typically a pointer
to a record or a collection of records in a random access file. For this paper the
associated information is of no further interest.
We assume that the index itself is so voluminous that only rather small
parts of it can be kept in main store at one time. Thus the bulk of the index must
be kept on some backup store. The class of backup stores considered are pseudo
random access devices which have a rather long access or wait time (as opposed
to a true random access device like core store) and a rather high data rate once
the transmission of physically sequential data has been initiated. Typical pseudo
random access devices are: fixed and moving head discs, drums, and data cells.
Since the data file itself changes, it must be possible not only to search the
index and to retrieve elements, but also to delete and to insert keys (more
accurately, index elements) economically. The index organization described
in this paper always allows retrieval, insertion, and deletion of keys in time
proportional to $\log_k I$ or better, where $I$ is the size of the index, and $k$ is a device
dependent natural number which describes the page size such that the performance
of the maintenance and retrieval scheme becomes near optimal.
In more illustrative terms theoretical analysis and actual experiments show
that it is possible to maintain an index of size 15 000 with an average of 9 retrievals,
insertions, and deletions per second in real time on an IBM 360/44 with a 2311
disc as backup store. According to our theoretical analysis, it should be possible
to maintain an index of size 1 500 000 with at least two transactions per second
on such a configuration in real time.

The index is organized in pages of fixed size capable of holding up to 2k
keys, but pages need only be partially filled. Pages are the blocks of information
transferred between main store and backup store.
The pages themselves are the nodes of a rather specialized tree, a so-called
B-tree, described in the next section. In this paper these trees grow and contract
in only one way, namely nodes split off a brother, or two brothers are merged
or "catenated" into a single node. The splitting and catenation processes are
initiated at the leaves only and propagate toward the root. If the root node splits,
a new root must be introduced, and this is the only way in which the height
of the tree can increase. The opposite process occurs if the tree contracts.
There are, of course, many competitive schemes, e.g., hash-coding, for or-
ganizing an index. For a large class of applications the scheme presented in this
paper offers significant advantages over others:
i) Storage utilization is at least 50% at any time and should be considerably
better in the average.
ii) Storage is requested and released as the file grows and contracts. There
is no congestion problem or degradation of performance if the storage occupancy
is very high.
iii) The natural order of the keys is maintained and allows processing based
on that order like: find predecessors and successors; search the file sequentially
to answer queries; skip, delete, retrieve a number of records starting from a
given key.
iv) If retrievals, insertions, and deletions come in batches, very efficient
essentially sequential processing of the index is possible by presorting the trans-
actions on their keys and by using a simple prepaging algorithm.
Several other schemes try to solve the same or very similar problems. AVL-trees
described in [1] and [2] guarantee performance in time $\log_2 I$, but they
are suitable only for a one-level store. The schemes described in [3] and [4] do
not have logarithmic performance. The solution presented in this paper is new
and is related to those described in [1-4] only in the sense that the problem to
be solved is similar and that it uses a data organization involving tree structures.

2. B-Trees
Def. 2.1. Let $h \ge 0$ be an integer, $k$ a natural number. A directed tree $T$
is in the class $\tau(k, h)$ of B-trees if $T$ is either empty ($h = 0$) or has the following
properties:
i) Each path from the root to any leaf has the same length $h$, also called the
height of $T$, i.e., $h$ = number of nodes in path.
ii) Each node except the root and the leaves has at least $k + 1$ sons. The root
is a leaf or has at least two sons.
iii) Each node has at most $2k + 1$ sons.
Number of Nodes in B-Trees. Let $N_{\min}$ and $N_{\max}$ be the minimal and maximal
number of nodes in a B-tree $T \in \tau(k, h)$. Then

$$N_{\min} = 1 + 2\left((k+1)^0 + (k+1)^1 + \cdots + (k+1)^{h-2}\right) = 1 + \frac{2}{k}\left((k+1)^{h-1} - 1\right)$$

for $h \ge 2$. This also holds for $h = 1$. Similarly one obtains

$$N_{\max} = \sum_{i=0}^{h-1} (2k+1)^i = \frac{1}{2k}\left((2k+1)^h - 1\right); \quad h \ge 1.$$

Upper and lower bounds for the number $N(T)$ of nodes of $T \in \tau(k, h)$ are given by:

$$N(T) = 0 \quad \text{if } T \in \tau(k, 0); \qquad (2.1)$$

$$1 + \frac{2}{k}\left((k+1)^{h-1} - 1\right) \le N(T) \le \frac{1}{2k}\left((2k+1)^h - 1\right) \quad \text{otherwise.}$$

Note that the classes $\tau(k, h)$ need not be disjoint.
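
The closed forms for the node counts can be checked against the defining sums. The following is a minimal sketch in Python; the function names `min_nodes` and `max_nodes` are illustrative and do not appear in the paper.

```python
def min_nodes(k: int, h: int) -> int:
    """Closed form for N_min: root with two sons, all other inner nodes with k+1 sons."""
    if h == 0:
        return 0
    # (k+1)^(h-1) - 1 is divisible by k, so integer division is exact.
    return 1 + 2 * ((k + 1) ** (h - 1) - 1) // k


def max_nodes(k: int, h: int) -> int:
    """Closed form for N_max: every inner node has 2k+1 sons."""
    if h == 0:
        return 0
    # (2k+1)^h - 1 is divisible by 2k, so integer division is exact.
    return ((2 * k + 1) ** h - 1) // (2 * k)


def min_nodes_by_sum(k: int, h: int) -> int:
    """Defining sum for N_min: 1 + 2 * sum of (k+1)^i for i = 0 .. h-2."""
    return 0 if h == 0 else 1 + 2 * sum((k + 1) ** i for i in range(h - 1))


def max_nodes_by_sum(k: int, h: int) -> int:
    """Defining sum for N_max: sum of (2k+1)^i for i = 0 .. h-1."""
    return sum((2 * k + 1) ** i for i in range(h))
```

For example, with $k = 2$ and $h = 3$ the two functions give 9 and 31 nodes respectively.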

3. The Data Structure and Retrieval Algorithm


To repeat, the pages on which the index is stored are the nodes of a B-tree
$T \in \tau(k, h)$ and can hold up to $2k$ keys. In addition the data structure for the
index has the following properties:
i) Each page holds between $k$ and $2k$ keys (index elements) except the root
page which may hold between 1 and $2k$ keys.
ii) Let the number of keys on a page $P$, which is not a leaf, be $l$. Then $P$ has
$l + 1$ sons.
iii) Within each page $P$ the keys are sequential in increasing order: $x_1, x_2,
\ldots, x_l$; $k \le l \le 2k$ except for the root page for which $1 \le l \le 2k$. Furthermore,
$P$ contains $l + 1$ pointers $p_0, p_1, \ldots, p_l$ to the sons of $P$. On leaf pages these
pointers are undefined. Logically a page is then organized as shown in Fig. 1.

$p_0$ | $x_1$ $\alpha_1$ $p_1$ | $x_2$ $\alpha_2$ $p_2$ | ... | $x_l$ $\alpha_l$ $p_l$ | unused space
Fig. 1. Organization of a page

The $\alpha_i$ are the associated information in the index element $(x_i, \alpha_i)$. The triple
$(x_i, \alpha_i, p_i)$ or, omitting $\alpha_i$, the pair $(x_i, p_i)$ is also called an entry.
iv) Let $P(p_i)$ be the page to which $p_i$ points, let $K(p_i)$ be the set of keys on
the pages of that maximal subtree of which $P(p_i)$ is the root. Then for the B-trees
considered here the following conditions shall always hold:

$$(\forall y \in K(p_0))\,(y < x_1), \qquad (3.1)$$

$$(\forall y \in K(p_i))\,(x_i < y < x_{i+1}); \quad i = 1, 2, \ldots, l-1, \qquad (3.2)$$

$$(\forall y \in K(p_l))\,(x_l < y). \qquad (3.3)$$
Fig. 2 is an example of a B-tree in $\tau(2, 3)$ satisfying all the above conditions.
In the figure the $\alpha_i$ are not shown and the page pointers are represented graphically.
The boxes represent pages and the numbers outside are page numbers to
be used later.

Fig. 2. A data structure in $\tau(2, 3)$ for an index

Retrieval Algorithm. The flowchart in Fig. 3 is an algorithm for retrieving a
key $y$. Let $p$, $r$, $s$ be pointer variables which can also assume the value "undefined"
denoted as $u$. $r$ points to the root and is $u$ if the tree is empty, $s$ does not serve
any purpose for retrieval, but will be used in the insertion algorithm. Let $P(p)$
be the page to which $p$ is pointing, then $x_1, \ldots, x_l$ are the keys in $P(p)$ and $p_0, \ldots, p_l$
the page pointers in $P(p)$.
The retrieval algorithm is simple logically, but to program it for a computer
one would use an efficient technique, e.g., a binary search, to scan a page.
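
The retrieval procedure with a binary search inside each page can be sketched as follows. This is a minimal illustration in Python, not the authors' program: the in-memory `Page` class, the use of `None` for the undefined pointer $u$, and the returned fetch counter are all assumptions made for the example.

```python
from bisect import bisect_left
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Page:
    keys: List[int]                                   # x_1 .. x_l, increasing
    sons: List["Page"] = field(default_factory=list)  # p_0 .. p_l; empty on a leaf


def retrieve(root: Optional[Page], y: int) -> Tuple[bool, Optional[Page], int]:
    """Return (found, s, fetched).

    s is the last page scanned (None plays the role of u for an empty tree);
    fetched counts the pages scanned, which is at most the height h.
    """
    p, s, fetched = root, None, 0
    while p is not None:
        s = p
        fetched += 1
        i = bisect_left(p.keys, y)          # number of keys smaller than y
        if i < len(p.keys) and p.keys[i] == y:
            return True, s, fetched
        # y < x_1 gives i = 0 (follow p_0); x_i < y < x_{i+1} gives p_i;
        # y > x_l gives i = l (follow p_l). On a leaf the pointer is undefined.
        p = p.sons[i] if p.sons else None
    return False, s, fetched
```

On a failed search the returned `s` is the leaf in which `y` would have to be inserted, which is exactly the role the variable $s$ plays for the insertion algorithm.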
Cost of Retrieval. Let $h$ be the height of the page tree. Then at most $h$ pages
must be scanned and therefore fetched from backup store to retrieve a key $y$.
We will now derive bounds for $h$ for a given index of size $I$. The minimum and
maximum number $I_{\min}$ and $I_{\max}$ of keys in a B-tree of pages in $\tau(k, h)$ are:

$$I_{\min} = 1 + k\left(\frac{2}{k}\left((k+1)^{h-1} - 1\right)\right) = 2(k+1)^{h-1} - 1$$

$$I_{\max} = 2k\left(\frac{(2k+1)^h - 1}{2k}\right) = (2k+1)^h - 1.$$

This is immediate from (2.1) for $h \ge 1$. Thus we have as sharp bounds for the
height $h$:

$$\log_{2k+1}(I+1) \le h \le 1 + \log_{k+1}\left(\frac{I+1}{2}\right) \quad \text{for } I \ge 1, \qquad (3.4)$$

$$h = 0 \quad \text{for } I = 0.$$
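
The bounds on the height can be evaluated with integer arithmetic, which avoids rounding problems in the logarithms. The sketch below is illustrative only; the function names are not from the paper. For $k = 60$ the key counts agree with the index sizes listed in Fig. 9.

```python
from typing import Tuple


def min_keys(k: int, h: int) -> int:
    """I_min = 2 (k+1)^(h-1) - 1 for h >= 1, and 0 for the empty tree."""
    return 0 if h == 0 else 2 * (k + 1) ** (h - 1) - 1


def max_keys(k: int, h: int) -> int:
    """I_max = (2k+1)^h - 1."""
    return (2 * k + 1) ** h - 1


def height_range(k: int, size: int) -> Tuple[int, int]:
    """Smallest and largest height h with min_keys(k, h) <= size <= max_keys(k, h)."""
    if size == 0:
        return 0, 0
    h_lo = 1
    while max_keys(k, h_lo) < size:       # h >= log_{2k+1}(I + 1)
        h_lo += 1
    h_hi = h_lo
    while min_keys(k, h_hi + 1) <= size:  # h <= 1 + log_{k+1}((I + 1) / 2)
        h_hi += 1
    return h_lo, h_hi
```

With $k = 60$, an index of 15 000 keys can only have height 3, whereas an index of 10 000 keys may have height 2 or 3.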

4. Key Insertion
The algorithm in Fig. 4 inserts a single key $y$ into an index described in
Section 3. The variable $s$ is a page pointer set by the retrieval algorithm pointing
to the last page that was scanned or having the value $u$ if the page tree is empty.
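
1. Set $p \leftarrow r$, $s \leftarrow u$.
2. If $p = u$, stop: $y$ is not in the index.
3. Set $s \leftarrow p$.
4. If $y < x_1$, set $p \leftarrow p_0$ and go to 2.
5. If $\exists i\,(y = x_i)$, stop: $y$ is found.
6. If $\exists i\,(x_i < y < x_{i+1})$, set $p \leftarrow p_i$ and go to 2.
7. Otherwise set $p \leftarrow p_l$ and go to 2.
Fig. 3. Retrieval algorithm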

Splitting a Page. If a page $P$ in which an entry should be inserted is already
full, it will be split into two pages. Logically first insert the entry into the sequence
of entries in $P$ (which is assumed to be in main store) resulting in a sequence

$$p_0, (x_1, p_1), (x_2, p_2), \ldots, (x_{2k+1}, p_{2k+1}).$$

Now put the subsequence $p_0, (x_1, p_1), \ldots, (x_k, p_k)$ into $P$ and introduce a new
page $P'$ to contain the subsequence

$$p_{k+1}, (x_{k+2}, p_{k+2}), (x_{k+3}, p_{k+3}), \ldots, (x_{2k+1}, p_{2k+1}).$$

Let $Q$ be the father page of $P$. Insert the entry $(x_{k+1}, p')$, where $p'$ points to $P'$,
into $Q$. Thus $P'$ becomes a brother of $P$.
Inserting $(x_{k+1}, p')$ into $Q$ may, of course, cause $Q$ to split too, and so on,
possibly up to the root. If the splitting page $P$ is the root, then we introduce a
new root page $Q$ containing $p, (x_{k+1}, p')$ where $p$ points to $P$ and $p'$ to $P'$.
Note that this insertion process maps B-trees with parameter $k$ into B-trees
with parameter $k$, and preserves properties (3.1), (3.2), and (3.3).
To illustrate the insertion process, insertion of key 9 into the tree in Fig. 5
with parameter $k = 2$ results in the tree in Fig. 2.
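
The split step on its own can be sketched as a function on the overfull sequence of $2k+1$ keys and $2k+2$ pointers. This is a hedged illustration in Python; representing a page as a pair of lists and using `None` for the undefined pointers of a leaf are assumptions of the example, not part of the paper.

```python
from typing import List, Optional, Tuple

PageData = Tuple[List[int], Optional[List[object]]]  # (keys, son pointers or None)


def split_page(keys: List[int], sons: Optional[List[object]],
               k: int) -> Tuple[PageData, int, PageData]:
    """Split an overfull page into P, the key x_{k+1} for the father, and P'.

    keys holds x_1 .. x_{2k+1}; sons holds p_0 .. p_{2k+1}, or None on a leaf.
    """
    assert len(keys) == 2 * k + 1
    left_keys, middle, right_keys = keys[:k], keys[k], keys[k + 1:]
    if sons is None:                       # leaf: pointers are undefined
        return (left_keys, None), middle, (right_keys, None)
    assert len(sons) == 2 * k + 2
    # P keeps p_0 .. p_k, P' receives p_{k+1} .. p_{2k+1}.
    return (left_keys, sons[:k + 1]), middle, (right_keys, sons[k + 1:])
```

For $k = 2$, inserting key 9 into the full leaf (6 7 8 10) gives the sequence 6 7 8 9 10, which splits into (6 7) and (9 10), with key 8 moving up to the father.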

1. Apply the retrieval algorithm for key $y$.
2. If $y$ is found, stop.*
3. If $s = u$, the tree is empty: create a root page with $y$ and stop.
4. If $P(s)$ is full, apply the split page routine for $P(s)$; otherwise insert the
entry $(y, u)$ in $P(s)$.

* Key y is already in index, take appropriate action.

Fig. 4. Insertion algorithm

Leaf pages: (1 2 3 4) (6 7 8 10) (12 13 14 15) (17 18 19 20) (22 23 24 25)
Fig. 5. Index structure in $\tau(2, 2)$

5. Cost of Retrievals and Insertions


To analyze the cost of maintaining an index and retrieving keys we need
to know how many pages must be fetched from the backup store into main
store and how many pages must be written onto the backup store. For our analysis
we make the following assumption: Any page, whose content is examined or
modified during a single retrieval, insertion, or deletion of a key, is fetched or
paged out respectively exactly once. It will become clear during the course of
this paper that a paging area to hold $h + 1$ pages in main store is sufficient to do
this.
Any more powerful paging scheme, like e.g., keeping the root page permanently
locked in main store, will, of course, decrease the number of pages which must
be fetched or paged out. We will not, however, analyze such schemes, although
we have used them in our experiments.
Denote by $f_{\min}$ ($f_{\max}$) the minimal (maximal) number of pages fetched, and
by $w_{\min}$ ($w_{\max}$) the minimal (maximal) number of pages written.
Cost of Retrieval. From the retrieval algorithm it is clear that for retrieving
a single key we get

$$f_{\min} = 1; \quad f_{\max} = h; \quad w_{\min} = w_{\max} = 0.$$

Cost of Insertion. For inserting a single key the least work is required if no
page splitting occurs, then

$$f_{\min} = h; \quad w_{\min} = 1.$$

Most work is required if all pages in the retrieval path including the root page
split into two. Since the retrieval path contains $h$ pages and we have to write
a new root page, we get:

$$f_{\max} = h; \quad w_{\max} = 2h + 1.$$

Note that $h$ always denotes the height of the old tree. Although this worst bound
is sharp, it is not a good measure for the amount of work which must generally
be done for inserting one key.
If we consider an index in which keys are only retrieved or inserted, but no
keys are deleted, then we can derive a bound for the average amount of work
to be done for building an index of I keys as follows:
Each page split causes one (or two if the root page splits) new pages to be
created. Thus the number of page splits occurring in building an index of $I$ items
is bounded by $n(I) - 1$, where $n(I)$ is the number of pages in the tree. Since
each page has at least $k$ keys, except the root page which may have only 1, we
get: $n(I) \le \frac{I-1}{k} + 1$. Each single page split causes at most 2 additional pages
to be written. Thus the average number of pages written per single key insertion
due to page splitting is bounded by

$$\frac{2}{I}\,(n(I) - 1) \le \frac{2}{I} \cdot \frac{I-1}{k} < \frac{2}{k}.$$

A page split does not require any additional page retrievals. Thus in the average
for an index without deletions we get for a single insertion:

$$f_a = h; \quad w_a < 1 + \frac{2}{k}.$$
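
The counting argument can be checked empirically by building an index and counting splits. The following is a minimal in-memory sketch in Python of insertion without overflows; the classes `Page` and `BTree` and the counters `splits` and `pages` are illustrative names, and no backup store is modelled.

```python
from bisect import bisect_left


class Page:
    def __init__(self, keys=None, sons=None):
        self.keys = keys or []
        self.sons = sons or []            # empty list on a leaf


class BTree:
    def __init__(self, k):
        self.k, self.root, self.splits, self.pages = k, None, 0, 0

    def insert(self, y):
        if self.root is None:             # empty tree: create root page with y
            self.root, self.pages = Page([y]), 1
            return
        path, p = [], self.root
        while True:                       # retrieval, remembering the path
            i = bisect_left(p.keys, y)
            if i < len(p.keys) and p.keys[i] == y:
                return                    # key already in index
            path.append((p, i))
            if not p.sons:
                break
            p = p.sons[i]
        key, right = y, None
        while path:                       # insert, splitting toward the root
            p, i = path.pop()
            p.keys.insert(i, key)
            if right is not None:
                p.sons.insert(i + 1, right)
            if len(p.keys) <= 2 * self.k:
                return
            k = self.k                    # page holds 2k+1 keys: split it
            key = p.keys[k]
            right = Page(p.keys[k + 1:], p.sons[k + 1:])
            p.keys, p.sons = p.keys[:k], p.sons[:k + 1]
            self.splits += 1
            self.pages += 1
        self.root = Page([key], [p, right])   # the root itself was split
        self.pages += 1


def keys_in_order(p):
    """Keys of the subtree rooted at p in increasing order."""
    if p is None:
        return []
    if not p.sons:
        return list(p.keys)
    out = []
    for i, son in enumerate(p.sons):
        out += keys_in_order(son)
        if i < len(p.keys):
            out.append(p.keys[i])
    return out
```

After inserting $I$ distinct keys, `splits` never exceeds `pages - 1`, and `2 * splits / I` stays below $2/k$, in line with the bound derived above.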

6. Deletion Process
In a dynamically changing index it must be possible to delete keys. The
algorithm of Fig. 6 deletes one key $y$ from an index and maintains our data
structure properly. It first locates the key, say $y_i$. To maintain the data structure
properly, $y_i$ is deleted if it is on a leaf, otherwise it must be replaced by the
smallest key in the subtree whose root is $P(p_i)$. This smallest key is found by
going from $P(p_i)$ along the $p_0$ pointers to the leaf page, say $L$, and taking the
first key in $L$. Then this key, say $x_1$, is deleted from $L$. As a consequence $L$ may
contain fewer than $k$ keys and a catenation or underflow between $L$ and an
adjacent brother is performed.

Catenation. Two pages $P$ and $P'$ are called adjacent brothers if they have the
same father $Q$ and are pointed to by adjacent pointers in $Q$. $P$ and $P'$ can be
catenated, if together they have fewer than $2k$ keys, as follows: The three
pages of the form

Q:  ..., $(y_{j-1}, p)$, $(y_j, p')$, $(y_{j+1}, p_{j+1})$, ...
P:  $p_0$, $(x_1, p_1)$, ..., $(x_l, p_l)$
P': $p'_0$, $(x_{l+1}, p_{l+1})$, ...

can be replaced by two pages of the form:

Q:  ..., $(y_{j-1}, p)$, $(y_{j+1}, p_{j+1})$, ...
P:  $p_0$, $(x_1, p_1)$, ..., $(x_l, p_l)$, $(y_j, p'_0)$, $(x_{l+1}, p_{l+1})$, ...

As a consequence of deleting the entry $(y_j, p')$ from $Q$ it is now possible that $Q$
contains fewer than $k$ keys and special action must be taken for $Q$. This process
may propagate up to the root of the tree.

1. Apply the retrieval algorithm for $y$.
2. If $y$ is not found, stop.*
3. If $y$ is on a leaf page, delete $y$ from the leaf.
4. Otherwise retrieve pages down to the leaf along the $p_0$ pointers, replace $y$ by
the first key on the leaf page, and delete the first key on the leaf.
5. If necessary, perform catenations and underflow.

* The key to be deleted is not in index, take appropriate action.

Fig. 6. Deletion algorithm

Underflow. If the sum of the number of keys in $P$ and $P'$ is greater than $2k$,
then the keys in $P$ and $P'$ can be equally distributed, the process being called
an underflow, as follows:
Perform the catenation between $P$ and $P'$ resulting in too large a $P$. This
is possible since $P$ is in main store. Now split $P$ "in the middle" as described in
Section 4 with some obvious minor modifications.
Note that underflows do not propagate. $Q$ is modified, but the number of
keys in it is not changed.
To illustrate the deletion process consider the index in Fig. 2. Deleting key 9
results in the index in Fig. 5.
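
Catenation and underflow between two adjacent brothers can be sketched on the key sequences alone. The following Python fragment is an illustration under stated assumptions: pages are plain lists of keys, `sep` stands for the key $y_j$ that separates the brothers in the father $Q$, and every case that does not qualify for catenation is treated as an underflow. For pages that are not leaves the pointer lists would be concatenated and divided in the same way.

```python
from typing import List, Tuple, Union

Underflowed = Tuple[List[int], int, List[int]]


def catenate(p_keys: List[int], sep: int, q_keys: List[int]) -> List[int]:
    """Merge P, the separating key from Q, and P' into one page."""
    return p_keys + [sep] + q_keys


def underflow(p_keys: List[int], sep: int, q_keys: List[int]) -> Underflowed:
    """Catenate in main store, then split the result in the middle.

    Returns the new P, the new separating key for Q, and the new P'.
    """
    merged = catenate(p_keys, sep, q_keys)
    mid = len(merged) // 2
    return merged[:mid], merged[mid], merged[mid + 1:]


def rebalance(p_keys: List[int], sep: int, q_keys: List[int],
              k: int) -> Tuple[str, Union[List[int], Underflowed]]:
    """Choose catenation when the brothers hold fewer than 2k keys together."""
    if len(p_keys) + len(q_keys) < 2 * k:
        return "catenation", catenate(p_keys, sep, q_keys)
    return "underflow", underflow(p_keys, sep, q_keys)
```

With $k = 2$, a leaf reduced to the single key 10, its brother (6 7) and a separating key 8 are catenated into (6 7 8 10), the leaf shown in Fig. 5; the separating value 8 is inferred here from that leaf. A catenation removes one key from the father, whereas an underflow only replaces the separating key.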

7. Cost of Deletions
For a successful deletion, i.e., if the key $y$ to be deleted is in the index, the
least amount of work is required if no catenations or underflows are performed
and $y$ is in a leaf. This requires:

$$f_{\min} = h; \quad w_{\min} = 1.$$

If $y$ is not in a leaf and no catenations or underflows occur, then

$$f = h; \quad w = 2.$$

A maximal amount of work must be done if all but the first two pages in the
retrieval path are catenated, the son of the root in the retrieval path has an
underflow, and the root is modified. This requires:

$$f_{\max} = 2h - 1; \quad w_{\max} = h + 1.$$

As in the case of the insertion process the bounds obtained are sharp, but very
far apart and assumed rarely except in pathological examples. To obtain a more
useful measure for the average amount of work necessary to delete a key, let us
consider a "pure deletion process" during which all keys in an index $I$ are deleted,
but no keys are inserted.
Disregarding for the moment catenations and underflows we may get $f_1 = h$
and $w_1 = 2$ for each deletion at worst. But this is the best bound obtainable if
one considers an example in which keys are always deleted from the root page.
Each deletion causes at most one underflow, requiring $f_2 = 1$ additional
fetches and $w_2 = 2$ additional writes.
The total number of possible catenations is bounded by $n(I) - 1$, which is
at most $\frac{I-1}{k}$. Each catenation causes 1 additional fetch and 2 additional
writes, which results in an average

$$f_3 \le \frac{1}{I}\left(\frac{I-1}{k}\right) < \frac{1}{k}$$

$$w_3 \le \frac{2}{I}\left(\frac{I-1}{k}\right) < \frac{2}{k}.$$

Thus in the average we get:

$$f_a \le f_1 + f_2 + f_3 < h + 1 + \frac{1}{k}$$

$$w_a \le w_1 + w_2 + w_3 < 2 + 2 + \frac{2}{k} = 4 + \frac{2}{k}.$$

8. Page Overflow and Storage Utilization


In the scheme described so far utilization of backup store may be as low as
50% in extreme cases (disregarding the root page) if all pages contain only $k$
keys. This could be improved by avoiding certain page splits.
An overflow between two adjacent brother pages $P$ and $P'$ can be performed
as follows: Assume that a key must be inserted in $P$ and $P$ is already full, but $P'$
is not full. Then the key is inserted into the key-sequence in $P$ and an underflow
as described in Section 6 between the resulting sequence and $P'$ is performed.
This avoids the need to split $P$ into two pages. Thus a page will be split only if
both adjacent brothers are full, otherwise an overflow occurs.
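
An overflow can be sketched in the same style as an underflow. The fragment below is a minimal Python illustration, assuming that $P$ is the left brother, that pages are plain lists of keys, and that `sep` is the key separating $P$ and $P'$ in their father; the function name is hypothetical.

```python
from bisect import insort
from typing import List, Tuple


def insert_with_overflow(p_keys: List[int], sep: int, q_keys: List[int],
                         y: int, k: int) -> Tuple[List[int], int, List[int]]:
    """Insert y into the full page P and redistribute keys with brother P'.

    Returns the new P, the new separating key for the father, and the new P'.
    """
    assert len(p_keys) == 2 * k, "P must be full"
    assert len(q_keys) < 2 * k, "P' must not be full"
    assert y < sep, "y belongs in the left brother P"
    seq = list(p_keys)
    insort(seq, y)                         # key-sequence of P with y inserted
    merged = seq + [sep] + list(q_keys)    # catenate with P' in main store
    mid = len(merged) // 2                 # split "in the middle"
    return merged[:mid], merged[mid], merged[mid + 1:]
```

Since $P'$ holds at most $2k - 1$ keys, the merged sequence has at most $4k + 1$ keys, so neither resulting page exceeds $2k$ keys and no new page is created.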
In an index without deletions overflows will increase the storage utilization
in the worst cases to about 66%. If both insertions and deletions occur, then
the storage utilization may of course again be as low as 50%. For most practical
applications, however, storage utilization should be improved appreciably with
overflows.
One could, of course, consider a larger neighborhood of pages than just the
adjacent brothers as candidates for overflows, underflows, and catenations and
increase the minimal storage occupancy accordingly.
Bounds for the cost of insertions for a scheme with overflows are easily derived
as:

$$f_{\min} = h; \quad w_{\min} = 1;$$

$$f_{\max} = 3h - 2; \quad w_{\max} = 2h + 1.$$

For a pure insertion process one obtains as bounds for the average cost:

$$f_a < h + 2 + \frac{2}{k}; \quad w_a < 3 + \frac{2}{k}.$$

It is easy to construct examples in which each insertion causes an overflow,
thus these bounds cannot be improved very much without special assumptions
about the insertion process.

9. Maintenance Cost for Index with Insertions and Deletions


The main purpose of this paper is to develop a data structure which allows
economical maintenance of an index in which retrievals, insertions, and deletions
must be done in any order. We will now derive bounds on the processing cost
in such an environment.
The derivation of bounds for retrieval cost did not make any assumptions
about the order of insertions or deletions, so they are still valid. Also, the minimal
and maximal bounds for the cost of insertions and deletions were derived without
any such assumptions and are still valid. The bounds derived for the average
cost, however, are no longer valid if insertions and deletions are mixed.
The following example shows that the upper bounds for the average cost
cannot be improved appreciably over the upper bounds of the cost derived for
a single retrieval or deletion.

Example. Consider the trees $T_2$ in Fig. 2 and $T_5$ in Fig. 5. Deleting key 9 from
$T_2$ leads to $T_5$, and inserting key 9 in $T_5$ leads back to $T_2$. Consider a sequence
of alternating deletions and insertions of key 9 being applied starting with $T_2$.

Case 1. No page overflows, but only page splits occur:
i) Each deletion of key 9 from $T_2$ requires:
3 retrievals to locate key 9, namely pages 1, 2, 6.
1 retrieval of brother 5 of page 6 to find out that pages 5 and 6 can be
catenated.
2 pages, namely 5 and 2 are modified and must be written. Pages 6 and 3
are deleted from the tree $T_2$.
Thus $f = 5$ and $w = 2$. But $f = 5 = 2h - 1 = f_{\max}$ and $w = 2 = h - 1 = w_{\max} - 2$.
ii) Each insertion of key 9 into $T_5$ requires:
2 retrievals to locate slot for 9 in page 5.
5 pages must be written, namely 1, 2, 3, 5, 6.
Thus

$$f = 2 = h = f_{\max}$$

$$w = 5 = 2h + 1 = w_{\max}.$$

Case 2. Consider a scheme with page overflows.
i) Deletion of key 9 leads to the same results as in Case 1.
ii) Insertion of key 9 requires:
2 retrievals to locate slot for 9 on page 5.
2 retrievals of brothers 4 and 7 of 5 to find out that 5 must be split.
5 pages must be written as in Case 1.
Thus:

$$f = 4 = 3h - 2 = f_{\max},$$

$$w = 5 = 2h + 1 = w_{\max}.$$

Analogous examples can be constructed for arbitrary $h$ and $k$.

From the analysis it is clear that the performance of our scheme depends
on the actual sequence of insertions and deletions. The interference between
insertions and deletions may degrade the performance of the scheme as opposed
to doing insertions or deletions only. But even in the worst cases this interference
degrades the performance at most by a factor of 3.
It is an open question how important this interference is in any actual applications
and how relevant our worst case analysis is. Although the derivable cost
bounds are worse, the scheme with overflows performed better in our experiments
than the scheme without overflows.

10. Choice of k


The performance of our scheme depends on the parameter $k$. Thus care should
be taken in choosing $k$ to make the performance as good as possible.
To obtain a very rough approximation to the performance of the scheme we
make the following assumptions:

(1) Retrieval
(2) Insertion in index without deletions and without overflows
(3) Deletion in index without insertions, with or without overflows
(4) Insertion in index without deletions, but with overflow
(5) Insertion in index with deletions, without overflow
(6) Deletion in index with insertions, with or without overflows
(7) Insertion in index with deletion, with overflow

Case  min f  min w  average f (as derived in paper)  average w (as derived in paper)  max f     max w
(1)   1      0      $f \le h$                        $w = 0$                          $h$       0
(2)   $h$    1      $f = h$                          $w < 1 + 2/k$                    $h$       $2h+1$
(3)   $h$    1      $f < h + 1 + 1/k$                $w < 4 + 2/k$                    $2h-1$    $h+1$
(4)   $h$    1      $f < h + 2 + 2/k$                $w < 3 + 2/k$                    $3h-2$    $2h+1$
(5)   $h$    1      $f = h$                          $w \le 2h+1$                     $h$       $2h+1$
(6)   $h$    1      $f \le 2h-1$                     $h - 1 \le u \le h + 1$          $2h-1$    $h+1$
(7)   $h$    1      $f \le 3h-2$                     $w \le 2h+1$                     $3h-2$    $2h+1$

$f$ = number of pages fetched      $h$ = height of B-tree
$w$ = number of pages written      $k$ = parameter of B-tree of pages
$I$ = size of index set            $u$ = best upper bound obtainable for $w$
Fig. 7. Table of costs for a single retrieval, insertion, or deletion of a key

i) The time spent for each page which is written or fetched can be expressed
in the form:

$$\alpha + \beta(2k+1) + \gamma \ln(vk + 1)$$

$\alpha$ fixed time spent per page, e.g., average disc seek time plus fixed CPU
overhead, etc.
$\beta$ transfer time per page entry.
$\gamma$ constant for the logarithmic part of the time, e.g., for a binary search.
$v$ factor for average page occupancy, $1 \le v \le 2$.
We assume that modifying a page does not require moving keys within a
page, but that the necessary channel subcommands are generated to write a
page by concatenating several pieces of information in main store. This is the
reason for our assumption that fetching and writing a page takes the same time.
ii) The average number of pages fetched and written per single transaction
in an environment of mixed retrievals, insertions, and deletions is approximately
proportional (see Fig. 7) to $h$, say $\delta h$. The total time $T$ spent per transaction
can then be approximated by:

$$T \approx \delta h\,(\alpha + \beta(2k+1) + \gamma \ln(vk+1)).$$

Approximating $h$ itself by $h \approx \log_{vk+1}(I+1)$, where $I$ is the size of the index,
we get:

$$T \approx T' = \delta \log_{vk+1}(I+1)\,(\alpha + \beta(2k+1) + \gamma \ln(vk+1)).$$

Now one easily obtains the minimum of $T'$ if $k$ is chosen such that:

$$\frac{\alpha}{\beta} = \frac{2}{v}(vk+1)\ln(vk+1) - (2k+1) = f(k, v).$$


Neglecting CPU time, $k$ is a number which is characteristic for the device
used as backup store. To obtain a near optimal page size for our test examples
we assumed $\alpha = 50$ ms and $\beta = 90$ μs. According to the table in Fig. 8 an acceptable
choice should be $64 < k < 128$. For reasons of programming convenience we chose
$k = 60$ resulting in a page size of 120 entries.

k             f(k, 1)       f(k, 1.5)     f(k, 2)

2.00000E+00   1.59167E+00   2.39356E+00   3.04718E+00
4.00000E+00   7.09437E+00   9.16182E+00   1.07750E+01
8.00000E+00   2.25500E+01   2.74591E+01   3.11646E+01
1.60000E+01   6.33292E+01   7.42958E+01   8.23847E+01
3.20000E+01   1.65769E+02   1.89265E+02   2.06334E+02
6.40000E+01   4.13670E+02   4.62662E+02   4.97915E+02
1.28000E+02   9.96831E+02   1.09726E+03   1.16911E+03
2.56000E+02   2.33922E+03   2.54299E+03   2.68826E+03
5.12000E+02   5.37752E+03   5.78842E+03   6.08075E+03
1.02400E+03   1.21625E+04   1.29881E+04   1.35748E+04
2.04800E+03   2.71506E+04   2.88062E+04   2.99818E+04
4.09600E+03   5.99647E+04   6.32806E+04   6.56343E+04
8.19200E+03   1.31269E+05   1.37906E+05   1.42617E+05
1.63840E+04   2.85235E+05   2.98514E+05   3.07938E+05
3.27680E+04   6.15877E+05   6.42442E+05   6.61292E+05
6.55360E+04   1.32258E+06   1.37572E+06   1.41342E+06

Fig. 8. The function f(k, v) for optimal choice of k
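
The tabulated function can be recomputed and inverted numerically. The sketch below assumes that the minimum condition has the form $\alpha/\beta = \frac{2}{v}(vk+1)\ln(vk+1) - (2k+1) = f(k, v)$, which follows from setting the derivative of $T'$ with respect to $k$ to zero and which reproduces the entries of Fig. 8. The function names and the bisection routine are illustrative, not part of the paper.

```python
import math


def f(k: float, v: float) -> float:
    """f(k, v) = (2/v)(vk+1) ln(vk+1) - (2k+1), as tabulated in Fig. 8."""
    return 2.0 * (v * k + 1.0) / v * math.log(v * k + 1.0) - (2.0 * k + 1.0)


def optimal_k(alpha: float, beta: float, v: float) -> float:
    """Solve f(k, v) = alpha / beta for k by bisection.

    f is increasing in k for k > 0, since df/dk = 2 ln(vk+1) > 0.
    """
    target = alpha / beta
    lo, hi = 0.0, 1.0
    while f(hi, v) < target:               # bracket the solution
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if f(mid, v) < target:
            lo = mid
        else:
            hi = mid
    return hi
```

With $\alpha = 50$ ms and $\beta = 90$ μs the ratio $\alpha/\beta$ is about 556, and the solution lies between 64 and 128 for every occupancy factor $1 \le v \le 2$, consistent with the range read off the table.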

The size of the index which can be stored for $k = 60$ in a page tree of a certain
height can be seen from Fig. 9.

Height of     Minimum       Maximum
page tree     index size    index size

1             1             120
2             121           14 640
3             7 441         1 771 560
4             453 961       214 358 880

Fig. 9. Height of page tree and index size

11. Experimental Results
The algorithms presented here were programmed and their performance
measured during various experiments. The programs were run on an IBM 360/44
computer with a 2311 disc unit as a backup store. For the index element size
chosen (14 8-bit characters) and index size generally used (about 10 000 index
elements), the average access mechanism delay for this unit is about 50 ms,
after which information transfer takes place at the rate of about 90 μs per index
element. From these two parameters, our analysis predicts an optimal page
size (2k) on the order of 120 index elements.

The programming included a simple demand paging scheme to take advantage
of available core storage (about 1250 index elements' worth) and thus to
attempt to reduce the number of physical disc operations. In the following
section by virtual disc read we mean a request to the paging scheme that a certain
disc page be available in core; a virtual disc read will result in a physical disc
read only if there is no copy of the requested disc page already in the paging
area of core storage. A virtual disc write is defined analogously.
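
The distinction between virtual and physical disc reads can be illustrated with a small page cache. The paper does not state which replacement rule its paging scheme used; the sketch below assumes least-recently-used replacement, and the class name `PagingArea` and the `fetch` callback are hypothetical.

```python
from collections import OrderedDict
from typing import Callable, Hashable


class PagingArea:
    """Fixed number of page frames in core with LRU replacement (an assumption)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.frames: "OrderedDict[Hashable, object]" = OrderedDict()
        self.virtual_reads = 0
        self.physical_reads = 0

    def read(self, page_no: Hashable, fetch: Callable[[Hashable], object]) -> object:
        """Virtual read: touch the disc only if the page is not already in core."""
        self.virtual_reads += 1
        if page_no in self.frames:
            self.frames.move_to_end(page_no)     # mark as most recently used
            return self.frames[page_no]
        self.physical_reads += 1
        content = fetch(page_no)                 # physical disc read
        self.frames[page_no] = content
        if len(self.frames) > self.capacity:
            self.frames.popitem(last=False)      # evict least recently used page
        return content
```

Because every retrieval path starts at the root, the root page tends to stay in core under such a rule, which is one way the number of physical reads per transaction can fall below the number of virtual reads.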
At the time of this writing ten experiments had been performed. These ex-
periments were intended to give us an idea of what kind of performance to expect,
what kind of storage utilization to expect, and so forth. For us the specification
of an experiment consists of choosing
1) whether or not to permit overflows on insertion,
2) a number of index elements per page, and
3) a sequence of transactions to be made against an initially empty index.
At several points during the performance of an experiment certain performance
variables are recorded. From these the performance of the algorithms according
to various performance measures can be deduced; to wit
1) % storage utilization
2) average number of virtual disc reads/transaction
3) average number of physical disc reads/transaction
4) average number of virtual disc writes/insertion or deletion
5) average number of physical disc writes/insertion or deletion
6) average number of transactions/second.
We now summarize the experiments. Each experiment was divided into
several phases, and at the end of each of these the performance variables were
measured. Phases are denoted by numbers within parentheses.
E1: 25 elements/page, overflow permitted.
(1) 10 000 insertions sequential by key,
(2) 50 insertions, 50 retrievals, and 100 deletions uniformly random in
the key space.
E2: 120 elements/page; otherwise identical to E1.
E3: 250 elements/page; otherwise identical to E1.
E4: 120 elements/page, overflow permitted.
(1) 10 000 insertions sequential by key,
(2) 1000 retrievals uniformly random in key space,
(3) 10 000 sequential deletions.
E5: 120 elements/page, overflow not permitted.
(1) 5000 insertions uniformly random in key space,
(2) 1000 retrievals uniformly random in key space,
(3) 5000 deletions uniformly random in key space.
E6: Overflow permitted; otherwise identical to E5.
E7: 120 elements/page, overflow permitted.
(1) 5000 insertions sequential by key,
(2) 6000 each insertions, retrievals, and deletions uniformly random in
key space.
E8: 120 elements/page, overflow permitted.
(1) 15 000 insertions uniformly random in key space,
(2) 100 each insertions, deletions, and retrievals uniformly random in
key space.
E9: 250 elements/page; otherwise identical to E8.
E10: 120 elements/page, overflow permitted.
(1) 100 000 insertions sequential by key,
(2) 1000 each insertions, deletions, and retrievals uniformly random in
key space,
(3) 100 group retrievals uniformly random in key space, where a group is
a sequence of 100 consecutive keys (statistics on the basis of 10 000
transactions),
(4) 10 000 insertions sequential by key, to merge uniformly with the
elements inserted in phase (1).

           % Storage   VR/T*   PR/T    VW/I    PW/I    T/sec
           used                        or D    or D

E1 (1)     99.8        2.2     0       2.3     0.04    66.1
E1 (2)     91.5        4.4     1.62    2.7     1.5     6.6
E2 (1)     99.2        1.0     0       1.0     0.008   94.5
E2 (2)     87.3        2.5     1.15    1.3     1.1     6.7
E3 (1)     97.6        1.0     0       1.0     0.004   100.0
E3 (2)     84.7        2.4     1.08    1.3     1.1     5.2
E4 (1)     99.2        1.0     0       1.0     0.008   94.5
E4 (2)     99.2        2.0     --      --      --      19.5
E4 (3)     --          2.0     0.01    2.0     0       74.1
E5 (1)     67.1        1.0     0.55    1.0     0.56    17.0
E5 (2)     67.1        2.0     0.83    --      --      18.2
E5 (3)     --          4.0     0.68    2.2     0.65    12.4
E6 (1)     86.7        1.1     0.55    1.1     0.54    17.1
E6 (2)     86.7        2.0     0.79    --      --      24.3
E6 (3)     --          4.0     0.65    2.2     0.62    13.4
E7 (1)     96.9        1.0     0       1.0     0.008   111.9
E7 (2)     76.8        2.3     0.83    1.3     0.88    13.1
E8 (1)     84.5        1.3     0.87    1.3     0.85    10.1
E8 (2)     83.9        3.7     1.00    3.0     1.00    9.5
E9 (1)     86.4        1.1     0.84    1.0     0.82    8.5
E9 (2)     85.2        2.3     0.94    1.1     0.96    8.2
E10 (1)    99.8        1.9     0       1.9     0.008   91.7
E10 (2)    82.1        4.1     1.94    1.8     1.54    4.2
E10 (3)    82.1        4.0     0.03    --      --      75.7
E10 (4)    83.8        2.2     0.10    2.2     0.11    38.0

* This statistic is unnecessarily large for deletions, due to the way deletions were
programmed. To find the necessary number of virtual reads, for sequential deletions subtract
one from the number shown, and for random deletions subtract one and multiply the result
by about 0.5.

References
1. Adelson-Velskii, G. M., Landis, E. M.: An information organization algorithm.
DANSSSR, 146, 263-266 (1962).
2. Foster, C. C.: Information storage and retrieval using AVL trees. Proc. ACM
20th Nat'l. Conf. 192-205 (1965).
3. Landauer, W. I.: The balanced tree and its utilization in information retrieval.
IEEE Trans. on Electronic Computers, Vol. EC-12, No. 6, December 1963.
4. Sussenguth, E. H., Jr.: The use of tree structures for processing files. Comm.
ACM, 6, No. 5, May 1963.

Prof. Dr. R. Bayer
Dept. of Computer Science
Purdue University
Lafayette, Ind. 47907
U.S.A.

Dr. E. M. McCreight
Palo Alto Research Center
3180 Porter Drive
Palo Alto, Calif. 94304
U.S.A.