1972 Bayer Mccreight
© by Springer-Verlag 1972
1. Introduction
In this paper we consider the problem of organizing and maintaining an
index for a dynamically changing random access file. By an index we mean a
collection of index elements which are pairs $(x, \alpha)$ of fixed size physically adjacent
data items, namely a key $x$ and some associated information $\alpha$. The key $x$ identifies
a unique element in the index; the associated information is typically a pointer
to a record or a collection of records in a random access file. For this paper the
associated information is of no further interest.
We assume that the index itself is so voluminous that only rather small
parts of it can be kept in main store at one time. Thus the bulk of the index must
be kept on some backup store. The backup stores considered are pseudo
random access devices which have a rather long access or wait time (as opposed
to a true random access device like core store) and a rather high data rate once
the transmission of physically sequential data has been initiated. Typical pseudo
random access devices are: fixed and moving head discs, drums, and data cells.
Since the data file itself changes, it must be possible not only to search the
index and to retrieve elements, but also to delete and to insert keys (more
accurately, index elements) economically. The index organization described
in this paper always allows retrieval, insertion, and deletion of keys in time
proportional to $\log_k I$ or better, where $I$ is the size of the index, and $k$ is a device
dependent natural number which describes the page size such that the performance
of the maintenance and retrieval scheme becomes near optimal.
In more illustrative terms, theoretical analysis and actual experiments show
that it is possible to maintain an index of size 15 000 with an average of 9 retrievals,
insertions, and deletions per second in real time on an IBM 360/44 with a 2311
disc as backup store. According to our theoretical analysis, it should be possible
to maintain an index of size 1 500 000 with at least two transactions per second
on such a configuration in real time.
Acta Informatica, Vol. 1
R. Bayer and E. McCreight
2. B-Trees
Def. 2.1. Let $h \ge 0$ be an integer, $k$ a natural number. A directed tree $T$
is in the class $\tau(k, h)$ of B-trees if $T$ is either empty ($h = 0$) or has the following
properties:
i) Each path from the root to any leaf has the same length $h$, also called the
height of $T$, i.e., $h$ = number of nodes in path.
ii) Each node except the root and the leaves has at least $k + 1$ sons. The root
is a leaf or has at least two sons.
iii) Each node has at most $2k + 1$ sons.
Organization and Maintenance of Large Ordered Indexes

Number of Nodes in B-Trees. Let $N_{\min}$ and $N_{\max}$ be the minimal and maximal
number of nodes in a B-tree $T \in \tau(k, h)$. Then

$$N_{\min} = 1 + 2\left((k+1)^0 + (k+1)^1 + \dots + (k+1)^{h-2}\right) = 1 + \frac{2}{k}\left((k+1)^{h-1} - 1\right)$$

for $h \ge 2$. This also holds for $h = 1$. Similarly one obtains

$$N_{\max} = \sum_{i=0}^{h-1} (2k+1)^i = \frac{1}{2k}\left((2k+1)^h - 1\right); \quad h \ge 1.$$
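The two closed forms can be checked against the level-by-level sums they abbreviate. The following is a minimal sketch; the function names are illustrative and not taken from the paper.

```python
def n_min(k, h):
    """Fewest nodes of a B-tree in tau(k, h), h >= 1 (closed form)."""
    # 1 + (2/k) * ((k+1)**(h-1) - 1); the integer division is exact
    # because (k+1)**(h-1) - 1 is always a multiple of k.
    return 1 + 2 * ((k + 1) ** (h - 1) - 1) // k


def n_max(k, h):
    """Most nodes of a B-tree in tau(k, h), h >= 1 (closed form)."""
    # ((2k+1)**h - 1) / (2k); again the division is exact.
    return ((2 * k + 1) ** h - 1) // (2 * k)


def n_min_by_levels(k, h):
    # One root, two sons of the root, then k + 1 sons per node on each level.
    return 1 + sum(2 * (k + 1) ** i for i in range(h - 1))


def n_max_by_levels(k, h):
    # Every node, including the root, has 2k + 1 sons.
    return sum((2 * k + 1) ** i for i in range(h))
```

For k = 2 and h = 3 both forms give 9 nodes as the minimum and 31 as the maximum.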
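Upper and lower bounds for the number $N(T)$ of nodes of $T \in \tau(k, h)$ are given by:

$$N(T) = 0 \quad \text{if } h = 0; \qquad 1 + \frac{2}{k}\left((k+1)^{h-1} - 1\right) \le N(T) \le \frac{1}{2k}\left((2k+1)^h - 1\right) \quad \text{otherwise.} \tag{2.1}$$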
Fig. 1. Organization of a page (pointer $p_0$, then the entries $x_1, \alpha_1, p_1$ through $x_l, \alpha_l, p_l$, then unused space)
The $\alpha_i$ are the associated information in the index element $(x_i, \alpha_i)$. The triple
$(x_i, \alpha_i, p_i)$ or, omitting $\alpha_i$, the pair $(x_i, p_i)$ is also called an entry.
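iv) Let $P(p_i)$ be the page to which $p_i$ points, let $K(p_i)$ be the set of keys on
the pages of that maximal subtree of which $P(p_i)$ is the root. Then for the B-trees
considered here the following conditions shall always hold:

$$(\forall y \in K(p_0))\,(y < x_1), \tag{3.1}$$
$$(\forall y \in K(p_i))\,(x_i < y < x_{i+1}); \quad i = 1, 2, \dots, l - 1, \tag{3.2}$$
$$(\forall y \in K(p_l))\,(x_l < y). \tag{3.3}$$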
Fig. 2. A data structure in $\tau(2, 3)$ for an index
This is immediate from (2.1) for $h \ge 1$. Thus we have as sharp bounds for the
height $h$:

$$\log_{2k+1}(I + 1) \le h \le 1 + \log_{k+1}\left(\frac{I + 1}{2}\right) \quad \text{for } I \ge 1,$$
$$h = 0 \quad \text{for } I = 0.$$
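The bounds on the height can be evaluated without floating-point logarithms by comparing integer powers directly. This is a minimal sketch; `height_bounds` is an illustrative name, and the integer-power loops are an implementation choice rather than anything prescribed by the paper.

```python
def height_bounds(index_size, k):
    """Return (lowest, highest) height h possible for an index of I keys."""
    if index_size == 0:
        return (0, 0)
    # Lower bound: smallest h with (2k+1)**h >= I + 1.
    low = 1
    while (2 * k + 1) ** low < index_size + 1:
        low += 1
    # Upper bound: largest h with 2 * (k+1)**(h-1) <= I + 1,
    # which is the inequality h <= 1 + log_{k+1}((I+1)/2) rearranged.
    high = 1
    while 2 * (k + 1) ** high <= index_size + 1:
        high += 1
    return (low, high)
```

For example, with k = 2 an index of 24 keys may have height 2 or 3, while an index of 25 keys is forced to height 3.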
4. Key Insertion
The algorithm in Fig. 4 inserts a single key $y$ into an index described in
Section 3. The variable $s$ is a page pointer set by the retrieval algorithm; it points
to the last page that was scanned, or has the value $u$ if the page tree is empty.
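Fig. 3. Retrieval algorithm (flowchart). Steps recoverable from the figure: $p \leftarrow r$, $s \leftarrow u$; test $p = u$?; $s \leftarrow p$; if $y < x_1$ then $p \leftarrow p_0$; test $\exists i\,(y = x_i)$?; if $\exists i\,(x_i < y < x_{i+1})$ then $p \leftarrow p_i$; otherwise $p \leftarrow p_l$.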
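The retrieval loop of Fig. 3 can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' program: pages live in memory instead of on a backup store, `Page` and `retrieve` are hypothetical names, `None` stands in for the undefined pointer u, and the search within a page uses binary search from the standard `bisect` module.

```python
from bisect import bisect_left


class Page:
    def __init__(self, keys, ptrs=None):
        self.keys = list(keys)              # x_1 .. x_l in increasing order
        # ptrs[i] plays the role of p_i; a leaf has only undefined pointers.
        if ptrs is None:
            ptrs = [None] * (len(self.keys) + 1)
        self.ptrs = list(ptrs)


def retrieve(root, y):
    """Return (found, s), where s is the last page scanned (None if tree empty)."""
    p, s = root, None
    while p is not None:
        s = p
        i = bisect_left(p.keys, y)          # number of keys on the page below y
        if i < len(p.keys) and p.keys[i] == y:
            return True, s                  # y = x_{i+1} is on this page
        p = p.ptrs[i]                       # i = 0 follows p_0, i = l follows p_l
    return False, s
```

After an unsuccessful search, `s` is the leaf page on which the key would have to be inserted, which is exactly what the insertion algorithm needs.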
Inserting $(x_{k+1}, p')$ into $Q$ may, of course, cause $Q$ to split too, and so on,
possibly up to the root. If the splitting page $P$ is the root, then we introduce a
new root page $Q$ containing $p, (x_{k+1}, p')$ where $p$ points to $P$ and $p'$ to $P'$.
Note that this insertion process maps B-trees with parameter $k$ into B-trees
with parameter $k$, and preserves properties (3.1), (3.2), and (3.3).
To illustrate the insertion process, insertion of key 9 into the tree in Fig. 5
with parameter k = 2 results in the tree in Fig. 2.
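Fig. 4. Insertion algorithm (flowchart). Steps recoverable from the figure: apply the retrieval algorithm for key $y$; if $y$ is found, nothing is inserted; if $s = u$, the tree is empty and a root page with $y$ is created; otherwise, if $P(s)$ is full, the split page routine is applied for $P(s)$, else the entry $(y, u)$ is inserted in $P(s)$.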
Fig. 5. Index structure in $\tau(2, 2)$ (leaf pages: 1 2 3 4 | 6 7 8 10 | 12 13 14 15 | 17 18 19 20 | 22 23 24 25)
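The insertion process, including page splits that propagate toward the root, can be sketched as follows. This is a minimal in-memory illustration, not the authors' program: `Page`, `insert`, and `_insert` are hypothetical names, `None` stands in for the undefined pointer u, and recursion replaces the explicit retrieval path. A page that reaches 2k + 1 keys keeps its first k keys, passes the key $x_{k+1}$ up to its father, and hands its last k keys to a new page P'.

```python
from bisect import bisect_left


class Page:
    def __init__(self, keys, ptrs=None):
        self.keys = list(keys)
        if ptrs is None:                    # leaf: all pointers undefined
            ptrs = [None] * (len(self.keys) + 1)
        self.ptrs = list(ptrs)


def _insert(page, y, k):
    """Insert y below page; return None, or (x_{k+1}, P') if page was split."""
    i = bisect_left(page.keys, y)
    if i < len(page.keys) and page.keys[i] == y:
        return None                         # y is already in the index
    if page.ptrs[i] is None:
        entry = (y, None)                   # leaf: insert the entry (y, u)
    else:
        entry = _insert(page.ptrs[i], y, k)
        if entry is None:
            return None                     # no split reached this page
    page.keys.insert(i, entry[0])
    page.ptrs.insert(i + 1, entry[1])
    if len(page.keys) <= 2 * k:
        return None
    # The page now holds 2k + 1 keys: split it "in the middle".
    middle = page.keys[k]
    right = Page(page.keys[k + 1:], page.ptrs[k + 1:])
    page.keys, page.ptrs = page.keys[:k], page.ptrs[:k + 1]
    return middle, right


def insert(root, y, k):
    """Insert y and return the root page, which is new if the old root split."""
    if root is None:
        return Page([y])                    # empty tree: create root page with y
    split = _insert(root, y, k)
    if split is None:
        return root
    return Page([split[0]], [root, split[1]])


# Tree of Fig. 5 with k = 2; the root keys 5, 11, 16, 21 are an assumption,
# inferred as the integers missing between neighbouring leaf pages.
leaves = [Page(ks) for ks in ([1, 2, 3, 4], [6, 7, 8, 10], [12, 13, 14, 15],
                              [17, 18, 19, 20], [22, 23, 24, 25])]
tree = insert(Page([5, 11, 16, 21], leaves), 9, 2)
# tree.keys is now [11]; its two sons hold [5, 8] and [16, 21].
```

Inserting key 9 splits the leaf 6 7 8 10 and then the full root, so the height grows from 2 to 3, in line with the example given in the text.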
Cost of Insertion. For inserting a single key the least work is required if no
page splitting occurs, then

$$f_{\min} = h; \quad w_{\min} = 1.$$

Most work is required if all pages in the retrieval path including the root page
split into two. Since the retrieval path contains $h$ pages and we have to write
a new root page, we get:

$$f_{\max} = h; \quad w_{\max} = 2h + 1.$$

Note that $h$ always denotes the height of the old tree. Although this worst bound
is sharp, it is not a good measure for the amount of work which must generally
be done for inserting one key.
If we consider an index in which keys are only retrieved or inserted, but no
keys are deleted, then we can derive a bound for the average amount of work
to be done for building an index of I keys as follows:
Each page split causes one (or two if the root page splits) new pages to be
created. Thus the number of page splits occurring in building an index of $I$ items
is bounded by $n(I) - 1$, where $n(I)$ is the number of pages in the tree. Since
each page has at least $k$ keys, except the root page which may have only 1, we
get: $n(I) \le \frac{I - 1}{k} + 1$. Each single page split causes at most 2 additional pages
to be written. Thus the average number of pages written per single key insertion
due to page splitting is bounded by

$$(n(I) - 1) \cdot \frac{2}{I} < \frac{2}{k}.$$

A page split does not require any additional page retrievals. Thus on the average
for an index without deletions we get for a single insertion:

$$f_a = h; \quad w_a < 1 + \frac{2}{k}.$$
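To get a feeling for the size of the split term, the bound can be evaluated with exact fractions. This is a small numeric illustration; the function names are made up for this sketch, and the sample values (an index of 15 000 keys, k = 60, i.e. pages of 120 entries) are assumptions borrowed from the figures quoted elsewhere in the paper.

```python
from fractions import Fraction


def max_pages(index_size, k):
    """Bound n(I) <= (I - 1)/k + 1 on the number of pages, rounded down."""
    return (index_size - 1) // k + 1


def split_writes_per_insertion(index_size, k):
    """Bound (n(I) - 1) * 2 / I on split-induced page writes per inserted key."""
    return Fraction(2 * (max_pages(index_size, k) - 1), index_size)


bound = split_writes_per_insertion(15000, 60)   # Fraction(83, 2500) = 0.0332
```

Under these assumptions page splits add fewer than 2/k = 1/30 page writes per insertion on average, so the average write cost stays very close to one page per inserted key.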
6. Deletion Process
In a dynamically changing index it must be possible to delete keys. The
algorithm of Fig. 6 deletes one key $y$ from an index and maintains our data
structure properly. It first locates the key, say $y_i$. To maintain the data structure
properly, $y_i$ is deleted if it is on a leaf, otherwise it must be replaced by the
smallest key in the subtree whose root is $P(p_i)$. This smallest key is found by
going from $P(p_i)$ along the $p_0$ pointers to the leaf page, say $L$, and taking the
first key in $L$. Then this key, say $x_1$, is deleted from $L$. As a consequence $L$ may
contain fewer than $k$ keys and a catenation or underflow between $L$ and an
adjacent brother is performed.
Catenation. Two pages $P$ and $P'$ are called adjacent brothers if they have the
same father $Q$ and are pointed to by adjacent pointers in $Q$. $P$ and $P'$ can be
catenated, if together they have fewer than $2k$ keys, as follows: The three
pages $Q$, $P$, and $P'$, where $Q$ is of the form

    Q:  ..., (y_{j-1}, p), (y_j, p'), (y_{j+1}, p_{j+1}), ...

are replaced by two pages: $P$ takes up its own keys, the key $y_j$, and the keys of $P'$,
and $Q$ becomes

    Q:  ..., (y_{j-1}, p), (y_{j+1}, p_{j+1}), ...

As a consequence of deleting the entry $(y_j, p')$ from $Q$ it is now possible that $Q$
contains fewer than $k$ keys and special action must be taken for $Q$. This process
may propagate up to the root of the tree.
Underflow. If the sum of the number of keys in $P$ and $P'$ is $2k$ or greater
(so that the two pages cannot be catenated), then the keys in $P$ and $P'$ can be
equally distributed, the process being called an underflow, as follows:
Perform the catenation between $P$ and $P'$ resulting in too large a $P$. This
is possible since $P$ is in main store. Now split $P$ "in the middle" as described in
Section 4 with some obvious minor modifications.

Fig. 6. Deletion algorithm (flowchart). Steps recoverable from the figure: apply the retrieval algorithm for $y$; retrieve pages down to the leaf along $p_0$ pointers; replace $y$ by the first key on the leaf page; delete the first key on the leaf; if necessary, perform catenations and underflow.
Note that underflows do not propagate. Q is modified, but the number of
keys in it is not changed.
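Catenation and underflow between two adjacent brothers can be sketched on the same kind of in-memory pages. This is an illustration under stated assumptions: `Page`, `catenate`, `underflow`, and `repair` are hypothetical names, indices are 0-based (so `q.keys[j]` separates `q.ptrs[j]` from `q.ptrs[j + 1]`), and the choice between the two operations follows the rule that brothers with fewer than 2k keys together are catenated.

```python
class Page:
    def __init__(self, keys, ptrs=None):
        self.keys = list(keys)
        if ptrs is None:                    # leaf: all pointers undefined
            ptrs = [None] * (len(self.keys) + 1)
        self.ptrs = list(ptrs)


def catenate(q, j):
    """Merge brother q.ptrs[j + 1] into q.ptrs[j]; q loses one entry."""
    p, p2 = q.ptrs[j], q.ptrs[j + 1]
    p.keys += [q.keys[j]] + p2.keys         # separator key moves down into P
    p.ptrs += p2.ptrs
    del q.keys[j]
    del q.ptrs[j + 1]
    return p


def underflow(q, j):
    """Distribute the keys of two brothers evenly; q keeps its key count."""
    p, p2 = q.ptrs[j], q.ptrs[j + 1]
    keys = p.keys + [q.keys[j]] + p2.keys   # catenate in main store ...
    ptrs = p.ptrs + p2.ptrs
    mid = len(keys) // 2                    # ... then split "in the middle"
    p.keys, p.ptrs = keys[:mid], ptrs[:mid + 1]
    q.keys[j] = keys[mid]                   # new separator replaces the old one
    p2.keys, p2.ptrs = keys[mid + 1:], ptrs[mid + 1:]


def repair(q, j, k):
    """Restore the brothers q.ptrs[j], q.ptrs[j + 1] after one fell below k keys."""
    total = len(q.ptrs[j].keys) + len(q.ptrs[j + 1].keys)
    if total < 2 * k:
        catenate(q, j)                      # may leave q with fewer than k keys
    else:
        underflow(q, j)                     # never propagates upward
```

A caller would apply `repair` first at the leaf level and then, if the father itself drops below k keys, one level higher, replacing the root by its only son once the root has no keys left.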
To illustrate the deletion process consider the index in Fig. 2. Deleting key 9
results in the index in Fig. 5.
7. Cost of Deletions
For a successful deletion, i.e., if the key $y$ to be deleted is in the index, the
least amount of work is required if no catenations or underflows are performed
and $y$ is in a leaf. This requires:

$$f_{\min} = h; \quad w_{\min} = 1.$$

If $y$ is not in a leaf and no catenations or underflows are performed, deletion requires:

$$f = h; \quad w = 2.$$

A maximal amount of work must be done if all but the first two pages in the
retrieval path are catenated, the son of the root in the retrieval path has an
underflow, and the root is modified. This requires:

$$f_{\max} = 2h - 1; \quad w_{\max} = h + 1.$$
As in the case of the insertion process the bounds obtained are sharp, but very
far apart and attained only rarely except in pathological examples. To obtain a more
useful measure for the average amount of work necessary to delete a key, let us
consider a "pure deletion process" during which all keys in an index $I$ are deleted,
but no keys are inserted.
Disregarding for the moment catenations and underflows we may get $f_1 = h$
and $w_1 = 2$ for each deletion at worst. But this is the best bound obtainable if
one considers an example in which keys are always deleted from the root page.
Each deletion causes at most one underflow, requiring $f_2 = 1$ additional
fetch and $w_2 = 2$ additional writes.
The total number of possible catenations is bounded by $n(I) - 1$, which is
at most $\frac{I - 1}{k}$. Each catenation causes 1 additional fetch and 2 additional
writes, which results in an average

$$f_3 = \frac{1}{I}\left(\frac{I - 1}{k}\right) < \frac{1}{k},$$
$$w_3 = \frac{2}{I}\left(\frac{I - 1}{k}\right) < \frac{2}{k}.$$
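Adding the three contributions (the basic deletion, at most one underflow per deletion, and the averaged catenations) gives a bound on the average cost of a pure deletion process. The following worked sum assumes only that the three parts add; it agrees with the deletion averages listed in Fig. 7.

$$f_a \le f_1 + f_2 + f_3 < h + 1 + \frac{1}{k}, \qquad w_a \le w_1 + w_2 + w_3 < 2 + 2 + \frac{2}{k} = 4 + \frac{2}{k}.$$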
$$f_{\max} = 3h - 2; \quad w_{\max} = 2h + 1.$$

For a pure insertion process one obtains as bounds for the average cost:

$$f_a < h + 2 + \frac{2}{k}; \quad w_a < 3 + \frac{2}{k}.$$
Example. Consider the trees $T_2$ in Fig. 2 and $T_5$ in Fig. 5. Deleting key 9 from
$T_2$ leads to $T_5$, and inserting key 9 in $T_5$ leads back to $T_2$. Consider a sequence
of alternating deletions and insertions of key 9 being applied starting with $T_2$.
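Fig. 7. Costs per operation in pages fetched (f) and pages written (w)

Operation                 Average as derived in paper        Maximum
Retrieval                 f ≤ h,            w = 0
Insertion                 f = h,            w < 1 + 2/k      f = h,        w ≤ 2h + 1
Deletion                  f < h + 1 + 1/k,  w < 4 + 2/k      f ≤ 2h - 1,   w ≤ h + 1
Insertion with overflow   f < h + 2 + 2/k,  w < 3 + 2/k      f ≤ 3h - 2,   w ≤ 2h + 1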
i) The time spent for each page which is written or fetched can be expressed
in the form:

$$\alpha + \beta(2k + 1) + \gamma \ln(vk + 1)$$

$\alpha$  fixed time spent per page, e.g., average disc seek time plus fixed CPU
overhead, etc.
$\beta$  transfer time per page entry.
$\gamma$  constant for the logarithmic part of the time, e.g., for a binary search.
$v$  factor for average page occupancy, $1 \le v \le 2$.
We assume that modifying a page does not require moving keys within a
page, but that the necessary channel subcommands are generated to write a
page by concatenating several pieces of information in main store. This is the
reason for our assumption that fetching and writing a page takes the same time.
ii) The average number of pages fetched and written per single transaction
in an environment of mixed retrievals, insertions, and deletions is approximately
proportional (see Fig. 7) to $h$, say $\delta h$. The total time $T$ spent per transaction
can then be approximated by:

$$T \approx \delta h \left(\alpha + \beta(2k + 1) + \gamma \ln(vk + 1)\right).$$

Approximating $h$ itself by $h \approx \log_{vk+1}(I + 1)$, where $I$ is the size of the index,
we get:

$$T \approx T' = \delta \log_{vk+1}(I + 1)\left(\alpha + \beta(2k + 1) + \gamma \ln(vk + 1)\right).$$

Now one easily obtains the minimum of $T'$ if $k$ is chosen such that:

$$\frac{\alpha}{\beta} = \frac{2(vk + 1)}{v} \ln(vk + 1) - (2k + 1).$$
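The minimizing k can also be located numerically by evaluating T' over a range of page sizes. This is a sketch under stated assumptions: the function names and the search limit `k_max` are made up for illustration, $\gamma$ and $\delta$ default to values that do not influence the position of the minimum, and the sample inputs (a fixed delay of 50 ms and a transfer time of 90 µs per index element) are the device figures quoted in Section 11.

```python
import math


def transaction_time(k, index_size, alpha, beta, gamma=0.0, v=1.0, delta=1.0):
    """T' = delta * log_{vk+1}(I + 1) * (alpha + beta*(2k+1) + gamma*ln(vk+1))."""
    height = math.log(index_size + 1) / math.log(v * k + 1)
    page_time = alpha + beta * (2 * k + 1) + gamma * math.log(v * k + 1)
    return delta * height * page_time


def best_k(index_size, alpha, beta, gamma=0.0, v=1.0, k_max=1000):
    """Scan k = 1 .. k_max and return the k with the smallest T'."""
    return min(range(1, k_max + 1),
               key=lambda k: transaction_time(k, index_size, alpha, beta, gamma, v))


k_opt = best_k(index_size=10_000, alpha=50e-3, beta=90e-6)
```

With these inputs and v = 1 the scan settles at k = 80 or 81, that is, a page of roughly 160 entries, which is the same order of magnitude as the page size quoted in Section 11. The index size only scales T' by a constant factor and therefore does not move the minimum.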
Height h   Minimum index size   Maximum index size   (k = 60)
   1                      1                  120
   2                    121               14 640
   3                  7 441            1 771 560
   4                453 961          214 358 880
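The four rows above coincide with the extreme index sizes $2(k+1)^{h-1} - 1$ and $(2k+1)^h - 1$ for k = 60, which follow from the bounds on the height in Section 3. The sketch below recomputes them; `index_size_range` is an illustrative name, and the value k = 60 is inferred from the numbers rather than stated next to the table.

```python
def index_size_range(k, h):
    """Smallest and largest number of keys of a B-tree in tau(k, h), h >= 1."""
    smallest = 2 * (k + 1) ** (h - 1) - 1   # root with 1 key, k keys elsewhere
    largest = (2 * k + 1) ** h - 1          # every page holds 2k keys
    return smallest, largest


rows = [(h,) + index_size_range(60, h) for h in range(1, 5)]
```

Each element of `rows` is a triple (height, minimum index size, maximum index size).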
11. Experimental Results
The algorithms presented here were programmed and their performance
measured during various experiments. The programs were run on an IBM 360/44
computer with a 2311 disc unit as a backup store. For the index element size
chosen (14 8-bit characters) and index size generally used (about 10 000 index
elements), the average access mechanism delay for this unit is about 50 ms,
after which information transfer takes place at the rate of about 90 µs per index
element. From these two parameters, our analysis predicts an optimal page
size (2k) on the order of 120 index elements.
* This statistic is unnecessarily large for deletions, due to the way deletions were
programmed. To find the necessary number of virtual reads, for sequential deletions subtract
one from the number shown, and for random deletions subtract one and multiply the result
by about 0.5.