Algorithms and Data Structures
3 Recursive Algorithms
3.1 Introduction
3.2 When Not to Use Recursion
3.3 Two Examples of Recursive Programs
3.4 Backtracking Algorithms
3.5 The Eight Queens Problem
3.6 The Stable Marriage Problem
3.7 The Optimal Selection Problem
Exercises
4 Dynamic Information Structures
4.1 Recursive Data Types
4.2 Pointers
4.3 Linear Lists
4.3.1 Basic Operations
4.3.2 Ordered Lists and Reorganizing Lists
4.3.3 An Application: Topological Sorting
4.4 Tree Structures
4.4.1 Basic Concepts and Definitions
4.4.2 Basic Operations on Binary Trees
4.4.3 Tree Search and Insertion
4.4.4 Tree Deletion
4.4.5 Analysis of Tree Search and Insertion
4.5 Balanced Trees
4.5.1 Balanced Tree Insertion
4.5.2 Balanced Tree Deletion
4.6 Optimal Search Trees
4.7 B-Trees
4.7.1 Multiway B-Trees
4.7.2 Binary B-Trees
4.8 Priority Search Trees
Exercises
5 Key Transformations (Hashing)
5.1 Introduction
5.2 Choice of a Hash Function
5.3 Collision Handling
5.4 Analysis of Key Transformation
Exercises
Appendices
A The ASCII Character Set
B The Syntax of Oberon
Index
Preface
In recent years the subject of computer programming has been recognized as a discipline whose mastery
is fundamental and crucial to the success of many engineering projects and which is amenable to
scientific treatment and presentation. It has advanced from a craft to an academic discipline. The initial
outstanding contributions toward this development were made by E.W. Dijkstra and C.A.R. Hoare.
Dijkstra's Notes on Structured Programming [1] opened a new view of programming as a scientific
subject and intellectual challenge, and it coined the title for a "revolution" in programming. Hoare's
Axiomatic Basis of Computer Programming [2] showed in a lucid manner that programs are amenable to
an exacting analysis based on mathematical reasoning. Both these papers argue convincingly that many
programming errors can be prevented by making programmers aware of the methods and techniques
which they hitherto applied intuitively and often unconsciously. These papers focused their attention on
the aspects of composition and analysis of programs, or more explicitly, on the structure of algorithms
represented by program texts. Yet, it is abundantly clear that a systematic and scientific approach to
program construction primarily has a bearing in the case of large, complex programs which involve
complicated sets of data. Hence, a methodology of programming is also bound to include all aspects of
data structuring. Programs, after all, are concrete formulations of abstract algorithms based on particular
representations and structures of data. An outstanding contribution to bring order into the bewildering
variety of terminology and concepts on data structures was made by Hoare through his Notes on Data
Structuring [3]. It made clear that decisions about structuring data cannot be made without knowledge of
the algorithms applied to the data and that, vice versa, the structure and choice of algorithms often
depend strongly on the structure of the underlying data. In short, the subjects of program composition
and data structures are inseparably intertwined.
Yet, this book starts with a chapter on data structure for two reasons. First, one has an intuitive feeling
that data precede algorithms: you must have some objects before you can perform operations on them.
Second, and this is the more immediate reason, this book assumes that the reader is familiar with the
basic notions of computer programming. Traditionally and sensibly, however, introductory programming
courses concentrate on algorithms operating on relatively simple structures of data. Hence, an
introductory chapter on data structures seems appropriate.
Throughout the book, and particularly in Chap. 1, we follow the theory and terminology expounded by
Hoare and realized in the programming language Pascal [4]. The essence of this theory is that data in the
first instance represent abstractions of real phenomena and are preferably formulated as abstract
structures not necessarily realized in common programming languages. In the process of program
construction the data representation is gradually refined -- in step with the refinement of the algorithm --
to comply more and more with the constraints imposed by an available programming system [5]. We
therefore postulate a number of basic building principles of data structures, called the fundamental
structures. It is most important that they are constructs that are known to be quite easily implementable
on actual computers, for only in this case can they be considered the true elements of an actual data
representation, as the molecules emerging from the final step of refinements of the data description. They
are the record, the array (with fixed size), and the set. Not surprisingly, these basic building principles
correspond to mathematical notions that are fundamental as well.
A cornerstone of this theory of data structures is the distinction between fundamental and "advanced"
structures. The former are the molecules -- themselves built out of atoms -- that are the components of
the latter. Variables of a fundamental structure change only their value, but never their structure and
never the set of values they can assume. As a consequence, the size of the store they occupy remains
constant. "Advanced" structures, however, are characterized by their change of value and structure during
the execution of a program. More sophisticated techniques are therefore needed for their implementation.
The sequence appears as a hybrid in this classification. It certainly varies its length; but that change in
structure is of a trivial nature. Since the sequence plays a truly fundamental role in practically all
computer systems, its treatment is included in Chap. 1.
The second chapter treats sorting algorithms. It displays a variety of different methods, all serving the
same purpose. Mathematical analysis of some of these algorithms shows the advantages and
disadvantages of the methods, and it makes the programmer aware of the importance of analysis in the
choice of good solutions for a given problem. The partitioning into methods for sorting arrays and
methods for sorting files (often called internal and external sorting) exhibits the crucial influence of data
representation on the choice of applicable algorithms and on their complexity. The space allocated to
sorting would not be so large were it not for the fact that sorting constitutes an ideal vehicle for
illustrating so many principles of programming and situations occurring in most other applications. It
often seems that one could compose an entire programming course by selecting examples from sorting
only.
Another topic that is usually omitted in introductory programming courses but one that plays an
important role in the conception of many algorithmic solutions is recursion. Therefore, the third chapter
is devoted to recursive algorithms. Recursion is shown to be a generalization of repetition (iteration), and
as such it is an important and powerful concept in programming. In many programming tutorials, it is
unfortunately exemplified by cases in which simple iteration would suffice. Instead, Chap. 3 concentrates
on several examples of problems in which recursion allows for a most natural formulation of a solution,
whereas use of iteration would lead to obscure and cumbersome programs. The class of backtracking
algorithms emerges as an ideal application of recursion, but the most obvious candidates for the use of
recursion are algorithms operating on data whose structure is defined recursively. These cases are treated
in the last two chapters, for which the third chapter provides a welcome background.
Chapter 4 deals with dynamic data structures, i.e., with data that change their structure during the
execution of the program. It is shown that the recursive data structures are an important subclass of the
dynamic structures commonly used. Although a recursive definition is both natural and possible in these
cases, it is usually not used in practice. Instead, the mechanism used in its implementation is made
evident to the programmer by forcing him to use explicit reference or pointer variables. This book
follows this technique and reflects the present state of the art: Chapter 4 is devoted to programming with
pointers, to lists, trees and to examples involving even more complicated meshes of data. It presents what
is often (and somewhat inappropriately) called list processing. A fair amount of space is devoted to tree
organizations, and in particular to search trees. The chapter ends with a presentation of scatter tables, also
called "hash" codes, which are oftern preferred to search trees. This provides the possibility of comparing
two fundamentally different techniques for a frequently encountered application.
Programming is a constructive activity. How can a constructive, inventive activity be taught? One
method is to crystallize elementary composition principles out of many cases and exhibit them in a
systematic manner. But programming is a field of vast variety often involving complex intellectual
activities. The belief that it could ever be condensed into a sort of pure recipe teaching is mistaken. What
remains in our arsenal of teaching methods is the careful selection and presentation of master examples.
Naturally, we should not believe that every person is capable of gaining equally much from the study of
examples. It is the characteristic of this approach that much is left to the student, to his diligence and
intuition. This is particularly true of the relatively involved and long example programs. Their inclusion
in this book is not accidental. Longer programs are the prevalent case in practice, and they are much
more suitable for exhibiting that elusive but essential ingredient called style and orderly structure. They
are also meant to serve as exercises in the art of program reading, which too often is neglected in favor of
program writing. This is a primary motivation behind the inclusion of larger programs as examples in
their entirety. The reader is led through a gradual development of the program; he is given various
snapshots in the evolution of a program, whereby this development becomes manifest as a stepwise
refinement of the details. I consider it essential that programs are shown in final form with sufficient
attention to details, for in programming, the devil hides in the details. Although the mere presentation of
an algorithm's principle and its mathematical analysis may be stimulating and challenging to the
academic mind, it seems dishonest to the engineering practitioner. I have therefore strictly adhered to the
rule of presenting the final programs in a language in which they can actually be run on a computer.
Of course, this raises the problem of finding a form which at the same time is both machine executable
and sufficiently machine independent to be included in such a text. In this respect, neither widely used
languages nor abstract notations proved to be adequate. The language Pascal provides an appropriate
compromise; it had been developed with exactly this aim in mind, and it is therefore used throughout this
book. The programs can easily be understood by programmers who are familiar with some other high-
level language, such as ALGOL 60 or PL/1, because it is easy to understand the Pascal notation while
proceeding through the text. However, this is not to say that some preparation would not be beneficial. The
book Systematic Programming [6] provides an ideal background because it is also based on the Pascal
notation. The present book was, however, not intended as a manual on the language Pascal; there exist
more appropriate texts for this purpose [7].
This book is a condensation -- and at the same time an elaboration -- of several courses on programming
taught at the Federal Institute of Technology (ETH) at Zürich. I owe many ideas and views expressed in
this book to discussions with my collaborators at ETH. In particular, I wish to thank Mr. H. Sandmayr for
his careful reading of the manuscript, and Miss Heidi Theiler and my wife for their care and patience in
typing the text. I should also like to mention the stimulating influence provided by meetings of the
Working Groups 2.1 and 2.3 of IFIP, and particularly the many memorable arguments I had on these
occasions with E. W. Dijkstra and C.A.R. Hoare. Last but not least, ETH generously provided the
environment and the computing facilities without which the preparation of this text would have been
impossible.
Zürich, Aug. 1975 N. Wirth
1. In Structured Programming. O.-J. Dahl, E.W. Dijkstra, C.A.R. Hoare. F. Genuys, Ed. (New York;
Academic Press, 1972), pp. 1-82.
2. In Comm. ACM, 12, No. 10 (1969), 576-83.
3. In Structured Programming, pp. 83-174.
4. N. Wirth. The Programming Language Pascal. Acta Informatica, 1, No. 1 (1971), 35-63.
5. N. Wirth. Program Development by Stepwise Refinement. Comm. ACM, 14, No. 4 (1971), 221-27.
6. N. Wirth. Systematic Programming. (Englewood Cliffs, N.J. Prentice-Hall, Inc., 1973.)
7. K. Jensen and N. Wirth, PASCAL-User Manual and Report. (Berlin, Heidelberg, New York;
Springer-Verlag, 1974).
The entire fifth chapter of the first Edition has been omitted. It was felt that the subject of compiler
construction was somewhat isolated from the preceding chapters and would rather merit a more extensive
treatment in its own volume.
Finally, the appearance of the new Edition reflects a development that has profoundly influenced
publications in the last ten years: the use of computers and sophisticated algorithms to prepare and
automatically typeset documents. This book was edited and laid out by the author with the aid of a Lilith
computer and its document editor Lara. Without these tools, not only would the book become more
costly, but it would certainly not be finished yet.
Palo Alto, March 1985 N. Wirth
Notation
The following notations, adopted from publications of E.W. Dijkstra, are used in this book.
In logical expressions, the character & denotes conjunction and is pronounced as and. The character ~
denotes negation and is pronounced as not. Boldface A and E are used to denote the universal and
existential quantifiers. In the following formulas, the left part is the notation used and defined here in
terms of the right part. Note that the left parts avoid the use of the symbol "...", which appeals to the
reader's intuition.
Ai: m ≤ i < n : Pi   ≡   Pm & Pm+1 & ... & Pn-1
The Pi are predicates, and the formula asserts that for all indices i ranging from a given value m to, but
excluding, a value n, Pi holds.
Ei: m ≤ i < n : Pi   ≡   Pm or Pm+1 or ... or Pn-1
The Pi are predicates, and the formula asserts that for some index i ranging from a given value m to, but
excluding, a value n, Pi holds.
Si: m ≤ i < n : xi   =   xm + xm+1 + ... + xn-1
MIN i: m ≤ i < n : xi   =   minimum(xm, ... , xn-1)
MAX i: m ≤ i < n : xi   =   maximum(xm, ... , xn-1)
In this context, the significance of programming languages becomes apparent. A programming language
represents an abstract computer capable of interpreting the terms used in this language, which may embody
a certain level of abstraction from the objects used by the actual machine. Thus, the programmer who uses
such a higher-level language will be freed (and barred) from questions of number representation, if the
number is an elementary object in the realm of this language.
The importance of using a language that offers a convenient set of basic abstractions common to most
problems of data processing lies mainly in the area of reliability of the resulting programs. It is easier to
design a program based on reasoning with familiar notions of numbers, sets, sequences, and repetitions
than on bits, storage units, and jumps. Of course, an actual computer represents all data, whether numbers,
sets, or sequences, as a large mass of bits. But this is irrelevant to the programmer as long as he or she does
not have to worry about the details of representation of the chosen abstractions, and as long as he or she can
rest assured that the corresponding representation chosen by the computer (or compiler) is reasonable for
the stated purposes.
The closer the abstractions are to a given computer, the easier it is to make a representation choice for the
engineer or implementor of the language, and the higher is the probability that a single choice will be
suitable for all (or almost all) conceivable applications. This fact sets definite limits on the degree of
abstraction from a given real computer. For example, it would not make sense to include geometric objects
as basic data items in a general-purpose language, since their proper representation will, because of its
inherent complexity, be largely dependent on the operations to be applied to these objects. The nature and
frequency of these operations will, however, not be known to the designer of a general-purpose language
and its compiler, and any choice the designer makes may be inappropriate for some potential applications.
In this book these deliberations determine the choice of notation for the description of algorithms and their
data. Clearly, we wish to use familiar notions of mathematics, such as numbers, sets, sequences, and so on,
rather than computer-dependent entities such as bitstrings. But equally clearly we wish to use a notation for
which efficient compilers are known to exist. It is equally unwise to use a closely machine-oriented and
machine-dependent language, as it is unhelpful to describe computer programs in an abstract notation that
leaves problems of representation widely open. The programming language Pascal had been designed in an
attempt to find a compromise between these extremes, and the successor languages Modula-2 and Oberon
are the result of decades of experience [1-3]. Oberon retains Pascal's basic concepts and incorporates some
improvements and some extensions; it is used throughout this book [1-5]. It has been successfully
implemented on several computers, and it has been shown that the notation is sufficiently close to real
machines that the chosen features and their representations can be clearly explained. The language is also
sufficiently close to other languages, and hence the lessons taught here may equally well be applied in their
use.
The primary characteristics of the concept of type that is used throughout this text, and that is embodied in
the programming language Oberon, are the following [1-2]:
1. A data type determines the set of values to which a constant belongs, or which may be assumed by a
variable or an expression, or which may be generated by an operator or a function.
2. The type of a value denoted by a constant, variable, or expression may be derived from its form or its
declaration without the necessity of executing the computational process.
3. Each operator or function expects arguments of a fixed type and yields a result of a fixed type. If an
operator admits arguments of several types (e.g., + is used for addition of both integers and real
numbers), then the type of the result can be determined from specific language rules.
As a consequence, a compiler may use this information on types to check the legality of various constructs.
For example, the mistaken assignment of a Boolean (logical) value to an arithmetic variable may be
detected without executing the program. This kind of redundancy in the program text is extremely useful as
an aid in the development of programs, and it must be considered as the primary advantage of good high-
level languages over machine code (or symbolic assembly code). Evidently, the data will ultimately be
represented by a large number of binary digits, irrespective of whether or not the program had initially been
conceived in a high-level language using the concept of type or in a typeless assembly code. To the
computer, the store is a homogeneous mass of bits without apparent structure. But it is exactly this abstract
structure which alone enables human programmers to recognize meaning in the monotonous landscape
of a computer store.
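As a small illustration (a fragment constructed for this text, not taken from the original examples), the declared
types allow the compiler to reject an inconsistent assignment before the program is ever executed:
VAR b: BOOLEAN; n: INTEGER;
b := n > 10; (*legal: the comparison yields a value of type BOOLEAN*)
n := b (*illegal: a BOOLEAN value cannot be assigned to an INTEGER variable; the compiler rejects this statement*)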
The theory presented in this book and the programming language Oberon specify certain methods of
defining data types. In most cases new data types are defined in terms of previously defined data types.
Values of such a type are usually conglomerates of component values of the previously defined constituent
types, and they are said to be structured. If there is only one constituent type, that is, if all components are
of the same constituent type, then it is known as the base type. The number of distinct values belonging to a
type T is called its cardinality. The cardinality provides a measure for the amount of storage needed to
represent a variable x of the type T, denoted by x: T.
Since constituent types may again be structured, entire hierarchies of structures may be built up, but,
obviously, the ultimate components of a structure are atomic. Therefore, it is necessary that a notation is
provided to introduce such primitive, unstructured types as well. A straightforward method is that of
enumerating the values that are to constitute the type. For example, in a program concerned with plane
geometric figures, we may introduce a primitive type called shape, whose values may be denoted by the
identifiers rectangle, square, ellipse, circle. But apart from such programmer-defined types, there will have
to be some standard, predefined types. They usually include numbers and logical values. If an ordering
exists among the individual values, then the type is said to be ordered or scalar. In Oberon, all unstructured
types are ordered; in the case of explicit enumeration, the values are assumed to be ordered by their
enumeration sequence.
With this tool in hand, it is possible to define primitive types and to build conglomerates, structured types
up to an arbitrary degree of nesting. In practice, it is not sufficient to have only one general method of
combining constituent types into a structure. With due regard to practical problems of representation and
use, a general-purpose programming language must offer several methods of structuring. In a mathematical
sense, they are equivalent; they differ in the operators available to select components of these structures.
The basic structuring methods presented here are the array, the record, the set, and the sequence. More
complicated structures are not usually defined as static types, but are instead dynamically generated during
the execution of the program, when they may vary in size and shape. Such structures are the subject of
Chap. 4 and include lists, rings, trees, and general, finite graphs.
Variables and data types are introduced in a program in order to be used for computation. To this end, a set
of operators must be available. For each standard data type a programming language offers a certain set of
primitive, standard operators, and likewise with each structuring method a distinct operation and notation
for selecting a component. The task of composition of operations is often considered the heart of the art of
programming. However, it will become evident that the appropriate composition of data is equally
fundamental and essential.
The most important basic operators are comparison and assignment, i.e., the test for equality (and for order
in the case of ordered types), and the command to enforce equality. The fundamental difference between
these two operations is emphasized by the clear distinction in their denotation throughout this text.
Test for equality: x=y (an expression with value TRUE or FALSE)
Assignment to x: x := y (a statement making x equal to y)
These fundamental operators are defined for most data types, but it should be noted that their execution
may involve a substantial amount of computational effort, if the data are large and highly structured.
For the standard primitive data types, we postulate not only the availability of assignment and comparison,
but also a set of operators to create (compute) new values. Thus we introduce the standard operations of
arithmetic for numeric types and the elementary operators of propositional logic for logical values.
Type checking also guards against the inconsistent use of operators. For example, given the declaration of s above,
the statement s := s+1 would be meaningless.
If, however, we recall that enumerations are ordered, then it is sensible to introduce operators that generate
the successor and predecessor of their argument. We therefore postulate the following standard operators,
which assign to their argument its successor and predecessor respectively:
INC(x) DEC(x)
Many programming languages do not include an exponentiation operator. The following is an algorithm for
the fast computation of y = x^n, where n is a non-negative integer.
y := 1.0; i := n;
WHILE i > 0 DO (* x0^n = x^i * y *)
IF ODD(i) THEN y := y*x END ;
x := x*x; i := i DIV 2
END
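The loop is easily packaged as a function procedure. The following is a minimal sketch (the name FastPower is
chosen for this illustration only); x0 in the comment denotes the initial value of the parameter x, which may be
modified freely because it is a local copy:
PROCEDURE FastPower(x: REAL; n: INTEGER): REAL;
(*return x^n for a non-negative integer n, using on the order of log2(n) multiplications*)
VAR i: INTEGER; y: REAL;
BEGIN y := 1.0; i := n;
WHILE i > 0 DO (* x0^n = x^i * y *)
IF ODD(i) THEN y := y*x END ;
x := x*x; i := i DIV 2
END ;
RETURN y
END FastPower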
p      q      p & q   p OR q   ~p
TRUE   TRUE   TRUE    TRUE     FALSE
TRUE   FALSE  FALSE   TRUE     FALSE
FALSE  TRUE   FALSE   TRUE     TRUE
FALSE  FALSE  FALSE   FALSE    TRUE
Table 1.1 Boolean Operators.
The Boolean operators & (AND) and OR have an additional property in most programming languages,
which distinguishes them from other dyadic operators. Whereas, for example, the sum x+y is not defined, if
either x or y is undefined, the conjunction p&q is defined even if q is undefined, provided that p is FALSE.
This conditionality is an important and useful property. The exact definition of & and OR is therefore given
by the following equations:
p &q = if p then q else FALSE
p OR q = if p then TRUE else q
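This property is what allows the first operand to guard the evaluation of the second. A typical use (a small
sketch, assuming an array a with N elements and integer variables i and x) is the following search loop, in which
the test i < N protects the access a[i]:
i := 0;
WHILE (i < N) & (a[i] # x) DO INC(i) END
(*if & were evaluated unconditionally, a[i] would also be evaluated for i = N, i.e. outside the array bounds*)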
Constituents of array types may themselves be structured. An array variable whose components are again
arrays is called a matrix. For example,
M: ARRAY 10 OF Row
is an array consisting of ten components (rows), each consisting of four components of type REAL, and is
called a 10 × 4 matrix with real components. Selectors may be concatenated accordingly, such that Mij and
M[i][j] denote the j-th component of row Mi, which is the i-th component of M. This is usually abbreviated
as M[i, j] and in the same spirit the declaration
M: ARRAY 10 OF ARRAY 4 OF REAL
can be written more concisely as
M: ARRAY 10, 4 OF REAL.
If a certain operation has to be performed on all components of an array or on adjacent components of a
section of the array, then this fact may conveniently be emphasized by using the FOR satement, as shown
in the following examples for computing the sum and for finding the maximal element of an array declared
as
VAR a: ARRAY N OF INTEGER
sum := 0;
FOR i := 0 TO N-1 DO sum := a[i] + sum END
k := 0; max := a[0];
FOR i := 1 TO N-1 DO
IF max < a[i] THEN k := i; max := a[k] END
END.
In a further example, assume that a fraction f is represented in its decimal form with k-1 digits, i.e., by an
array d such that
f = Si: 0 ≤ i < k : di * 10^(-i)   or
f = d0 + 10^(-1)*d1 + 10^(-2)*d2 + … + 10^(-(k-1))*dk-1
Now assume that we wish to divide f by 2. This is done by repeating the familiar division operation for all
k-1 digits di, starting with i=1. It consists of dividing each digit by 2 taking into account a possible carry
from the previous position, and of retaining a possible remainder r for the next position:
r := 10*r + d[i]; d[i] := r DIV 2; r := r MOD 2
This algorithm is used to compute a table of negative powers of 2. The repetition of halving to compute 2^(-1),
2^(-2), ... , 2^(-N) is again appropriately expressed by a FOR statement, thus leading to a nesting of two FOR
statements.
PROCEDURE Power(VAR W: Texts.Writer; N: INTEGER);
(*compute decimal representation of negative powers of 2*)
VAR i, k, r: INTEGER;
d: ARRAY N OF INTEGER;
BEGIN
FOR k := 0 TO N-1 DO
Texts.Write(W, "."); r := 0;
FOR i := 0 TO k-1 DO
r := 10*r + d[i]; d[i] := r DIV 2; r := r MOD 2;
Texts.Write(W, CHR(d[i] + ORD("0")))
END ;
d[k] := 5; Texts.Write(W, "5"); Texts.WriteLn(W)
END
END Power.
The resulting output text for N = 10 is
.5
.25
.125
.0625
.03125
.015625
.0078125
.00390625
.001953125
.0009765625
Since the length of a sequence is not fixed in advance, a dynamic storage allocation scheme must be employed. All structures with variable size share this property,
which is so essential that we classify them as advanced structures in contrast to the fundamental structures
discussed so far.
What, then, causes us to place the discussion of sequences in this chapter on fundamental structures? The
primary reason is that the storage management strategy is sufficiently simple for sequences (in contrast to
other advanced structures), if we enforce a certain discipline in the use of sequences. In fact, under this
proviso the handling of storage can safely be delegated to a mechanism that can be guaranteed to be
reasonably effective. The secondary reason is that sequences are indeed ubiquitous in all computer
applications. This structure is prevalent in all cases where different kinds of storage media are involved, i.e.
where data are to be moved from one medium to another, such as from disk or tape to primary store or
vice-versa.
The discipline mentioned is the restraint to use sequential access only. By this we mean that a sequence is
inspected by strictly proceeding from one element to its immediate successor, and that it is generated by
repeatedly appending an element at its end. The immediate consequence is that elements are not directly
accessible, with the exception of the one element which currently is up for inspection. It is this accessing
discipline which fundamentally distinguishes sequences from arrays. As we shall see in Chapter 2, the
influence of an access discipline on programs is profound.
The advantage of adhering to sequential access which, after all, is a serious restriction, is the relative
simplicity of needed storage management. But even more important is the possibility to use effective
buffering techniques when moving data to or from secondary storage devices. Sequential access allows us
to feed streams of data through pipes between the different media. Buffering implies the collection of
sections of a stream in a buffer, and the subsequent shipment of the whole buffer content once the buffer is
filled. This results in very significantly more effective use of secondary storage. Given sequential access
only, the buffering mechanism is reasonably straightforward for all sequences and all media. It can
therefore safely be built into a system for general use, and the programmer need not be burdened by
incorporating it in the program. Such a system is usually called a file system, because the high-volume,
sequential access devices are used for permanent storage of (persistent) data, and they retain them even
when the computer is switched off. The unit of data on these media is commonly called a (sequential) file.
Here we will use the term file as a synonym for sequence.
There exist certain storage media in which the sequential access is indeed the only possible one. Among
them are evidently all kinds of tapes. But even on magnetic disks each recording track constitutes a storage
facility allowing only sequential access. Strictly sequential access is the primary characteristic of every
mechanically moving device and of some other ones as well.
It follows that it is appropriate to distinguish between the data structure, the sequence, on one hand, and
the mechanism to access elements on the other hand. The former is declared as a data structure, the latter
typically by the introduction of a record with associated operators, or, according to more modern
terminology, by a rider object. The distinction between data and mechanism declarations is also useful in
view of the fact that several access points may exist concurrently on one and the same sequence, each one
representing a sequential access at a (possibly) different location.
We summarize the essence of the foregoing as follows:
1. Arrays and records are random access structures. They are used when located in primary, random-access
store.
2. Sequences are used to access data on secondary, sequential-access stores, such as disks and tapes.
3. We distinguish between the declaration of a sequence variable, and that of an access mechanism located
at a certain position within the sequence.
Sequences, files, are typically large, dynamic data structures stored on a secondary storage device. Such a
device retains the data even if a program is terminated, or a computer is switched off. Therefore the
introduction of a file variable is a complex operation connecting the data on the external device with the
file variable in the program. We therefore define the type File in a separate module, whose definition
specifies the type together with its operators. We call this module Files and postulate that a sequence or file
variable must be explicitly initialized (opened) by calling an appropriate operator or function:
VAR f: File
f := Open(name)
where name identifies the file as recorded on the persistent data carrier. Some systems distinguish between
opening an existing file and opening a new file:
f := Old(name) f := New(name)
The disconnection between secondary storage and the file variable then must also be explicitly requested
by, for example, a call of Close(f).
Evidently, the set of operators must contain an operator for generating (writing) and one for inspecting
(reading) a sequence. We postulate that these operations apply not to a file directly, but to an object called a
rider, which itself is connected with a file (sequence), and which implements a certain access mechanism.
The sequential access discipline is guaranteed by a restrictive set of access operators (procedures).
A sequence is generated by appending elements at its end after having placed a rider on the file. Assuming
the declaration
VAR r: Rider
we position the rider r on the file f by the statement
Set(r, f, pos)
where pos = 0 designates the beginning of the file (sequence). A typical pattern for generating the sequence
is:
WHILE more DO compute next element x; Write(r, x) END
A sequence is inspected by first positioning a rider as shown above, and then proceeding from element to
element. A typical pattern for reading a sequence is:
Read(r, x);
WHILE ~r.eof DO process element x; Read(r, x) END
Evidently, a certain position is always associated with every rider. It is denoted by r.pos. Furthermore, we
postulate that a rider contain a predicate (flag) r.eof indicating whether a preceding read operation had
reached the sequence’s end. We can now postulate and describe informally the following set of primitive
operators:
1a. New(f, name) defines f to be the empty sequence.
1b. Old(f, name) defines f to be the sequence persistently stored with given name.
2. Set(r, f, pos) associates rider r with sequence f, and places it at position pos.
3. Write(r, x) places the element with value x in the sequence designated by rider r, and advances.
4. Read(r, x) assigns to x the value of the element designated by rider r, and advances.
5. Close(f) registers the written file f in the persistent store (flush buffers).
Note: Writing an element in a sequence is often a complex operation. However, mostly, files are created by
appending elements at the end.
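To illustrate how these operators cooperate, the following sketch copies the sequence stored under the name src to
a new sequence named dest, using one rider for reading and one for writing. It is merely an illustration of the
postulated operators in their functional form f := Old(name), assuming character elements; the procedure and its
parameter names are not part of the Files definition.
PROCEDURE CopyFile(src, dest: ARRAY OF CHAR);
VAR f, g: File; r, w: Rider; ch: CHAR;
BEGIN f := Old(src); g := New(dest); (*open the existing source, create the destination*)
Set(r, f, 0); Set(w, g, 0); (*place both riders at the beginning of their sequences*)
Read(r, ch);
WHILE ~r.eof DO Write(w, ch); Read(r, ch) END ;
Close(g) (*register the written sequence in the persistent store*)
END CopyFile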
In order to convey a more precise understanding of the sequencing operators, the following example of an
implementation is provided. It shows how they might be expressed if sequences were represented by
arrays. This example of an implementation intentionally builds upon concepts introduced and discussed
earlier, and it does not involve either buffering or sequential stores which, as mentioned above, make the
sequence concept truly necessary and attractive. Nevertheless, this example exhibits all the essential
characteristics of the primitive sequence operators, independently of how the sequences are represented in
store.
The operators are presented in terms of conventional procedures. This collection of definitions of types,
variables, and procedure headings (signatures) is called a definition. We assume that we are to deal with
sequences of characters, i.e. text files whose elements are of type CHAR. The declarations of File and
Rider are good examples of an application of record structures because, in addition to the field denoting the
array which represents the data, further fields are required to denote the current length and position, i.e. the
state of the rider.
DEFINITION Files;
TYPE File; (*sequence of characters*)
Rider = RECORD eof: BOOLEAN END ;
PROCEDURE New(VAR name: ARRAY OF CHAR): File;
PROCEDURE Old(VAR name: ARRAY OF CHAR): File;
PROCEDURE Close(VAR f: File);
PROCEDURE Set(VAR r: Rider; VAR f: File; pos: INTEGER);
PROCEDURE Write (VAR r: Rider; ch: CHAR);
PROCEDURE Read (VAR r: Rider; VAR ch: CHAR);
END Files.
A definition represents an abstraction. Here we are given the two data types, File and Rider, together with
their operations, but without further details revealing their actual representation in store. Of the operators,
declared as procedures, we see their headings only. This hiding of the details of implementation is
intentional. The concept is called information hiding. About riders we only learn that there is a property
called eof. This flag is set, if a read operation reaches the end of the file. The rider’s position is invisible,
and hence the rider’s invariant cannot be falsified by direct access. The invariant expresses the fact that the
position always lies within the limits given by the associated sequence. The invariant is established by
procedure Set, and required and maintained by procedures Read and Write.
The statements that implement the procedures and further, internal details of the data types, are specified
in a construct called module. Many representations of data and implementations of procedures are possible.
We chose the following as a simple example (with fixed maximal file length):
MODULE Files;
CONST MaxLength = 4096;
TYPE File = POINTER TO RECORD
len: INTEGER;
a: ARRAY MaxLength OF CHAR
END ;
Rider = RECORD (* 0 <= pos <= f.len <= MaxLength *)
f: File; pos: INTEGER; eof: BOOLEAN
END ;
PROCEDURE New(name: ARRAY OF CHAR): File;
VAR f: File;
BEGIN NEW(f); f.len := 0; (*directory operation omitted*)
RETURN f
END New;
PROCEDURE Old(name: ARRAY OF CHAR): File;
VAR f: File;
BEGIN NEW(f); f.len := 0; (*directory lookup omitted*)
RETURN f
END Old;
PROCEDURE Close(VAR f: File);
BEGIN
END Close;
Buffering has the desired decoupling effect, if the rates of producer and consumer are about the same on the average, but fluctuate at times. The
degree of decoupling grows with increasing buffer size.
We now turn to the question of how to represent a buffer, and shall for the time being assume that data
elements are deposited and fetched individually instead of in blocks. A buffer essentially constitutes a first-
in-first-out queue (fifo). If it is declared as an array, two index variables, say in and out, mark the positions
of the next location to be written into and to be read from. Ideally, such an array should have no index
bounds. A finite array is quite adequate, however, considering the fact that elements once fetched are no
longer relevant. Their location may well be re-used. This leads to the idea of the circular buffer. The
operations of depositing and fetching an element are expressed in the following module, which exports
these operations as procedures, but hides the buffer and its index variables - and thereby effectively the
buffering mechanism - from the client processes. This mechanism also involves a variable n counting the
number of elements currently in the buffer. If N denotes the size of the buffer, the condition 0 ≤ n ≤ N is an
obvious invariant. Therefore, the operation fetch must be guarded by the condition n > 0 (buffer non-
empty), and the operation deposit by the condition n < N (buffer non-full). Not meeting the former
condition must be regarded as a programming error, a violation of the latter as a failure of the suggested
implementation (buffer too small).
MODULE Buffer; (*implements circular buffers*)
CONST N = 1024; (*buffer size*)
VAR n, in, out: INTEGER;
buf: ARRAY N OF CHAR;
PROCEDURE deposit(x: CHAR);
BEGIN
IF n = N THEN HALT END ;
INC(n); buf[in] := x; in := (in + 1) MOD N
END deposit;
PROCEDURE fetch(VAR x: CHAR);
BEGIN
IF n = 0 THEN HALT END ;
DEC(n); x := buf[out]; out := (out + 1) MOD N
END fetch;
BEGIN n := 0; in := 0; out := 0
END Buffer.
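A client might use these operations as follows (a minimal sketch; it assumes that deposit and fetch are exported
and that the module has been imported, and the characters deposited are arbitrary). Since the single agent first
deposits and then fetches, the guards n < N and n > 0 are never violated here:
VAR ch: CHAR; i: INTEGER;
FOR i := 0 TO 99 DO Buffer.deposit(CHR(ORD("A") + i MOD 26)) END ; (*produce 100 characters*)
FOR i := 0 TO 99 DO Buffer.fetch(ch) END (*consume them again, in the same order*)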
This simple implementation of a buffer is acceptable only, if the procedures deposit and fetch are activated
by a single agent (once acting as a producer, once as a consumer). If, however, they are activated by
individual, concurrent processes, this scheme is too simplistic. The reason is that the attempt to deposit into
a full buffer, or the attempt to fetch from an empty buffer, are quite legitimate. The execution of these
actions will merely have to be delayed until the guarding conditions are established. Such delays essentially
constitute the necessary synchronization among concurrent processes. We may represent these delays
respectively by the statements
REPEAT UNTIL n < N
REPEAT UNTIL n > 0
which must be substituted for the two conditioned HALT statements.
Every signal s is associated with a guard (condition) Ps. If a process needs to be delayed until Ps is
established (by some other process), it must, before proceeding, wait for the signal s. This is to be
expressed by the statement Wait(s). If, on the other hand, a process establishes Ps, it thereupon signals this
fact by the statement Send(s). If Ps is the established precondition to every statement Send(s), then Ps can
be regarded as a postcondition of Wait(s).
DEFINITION Signals;
TYPE Signal;
PROCEDURE Wait(VAR s: Signal);
PROCEDURE Send(VAR s: Signal);
PROCEDURE Init(VAR s: Signal);
END Signals.
We are now able to express the buffer module in a form that functions properly when used by individual,
concurrent processes:
MODULE Buffer;
IMPORT Signals;
CONST N = 1024; (*buffer size*)
VAR n, in, out: INTEGER;
nonfull: Signals.Signal; (*n < N*)
nonempty: Signals.Signal; (*n > 0*)
buf: ARRAY N OF CHAR;
PROCEDURE deposit(x: CHAR);
BEGIN
IF n = N THEN Signals.Wait(nonfull) END ;
INC(n); buf[in] := x; in := (in + 1) MOD N;
IF n = 1 THEN Signals.Send(nonempty) END
END deposit;
PROCEDURE fetch(VAR x: CHAR);
BEGIN
IF n = 0 THEN Signals.Wait(nonempty) END ;
DEC(n); x := buf[out]; out := (out + 1) MOD N;
IF n = N-1 THEN Signals.Send(nonfull) END
END fetch;
BEGIN n := 0; in := 0; out := 0; Signals.Init(nonfull); Signals.Init(nonempty)
END Buffer.
An additional caveat must be made, however. The scheme fails miserably, if by coincidence both consumer
and producer (or two producers or two consumers) fetch the counter value n simultaneously for updating.
Unpredictably, its resulting value will be either n+1 or n-1, but not n. It is indeed necessary to protect the
processes from dangerous interference. In general, all operations that alter the values of shared variables
constitute potential pitfalls.
A sufficient (but not always necessary) condition is that all shared variables be declared local to a module
whose procedures are guaranteed to be executed under mutual exclusion. Such a module is called a monitor
[1-7]. The mutual exclusion provision guarantees that at any time at most one process is actively engaged
in executing a procedure of the monitor. Should another process be calling a procedure of the (same)
monitor, it will automatically be delayed until the first process has terminated its procedure.
Note: By actively engaged is meant that a process execute a statement other than a wait statement.
At last we return now to the problem where the producer or the consumer (or both) require the data to be
available in a certain block size. The following module is a variant of the one previously shown, assuming
a block size of Np data elements for the producer, and of Nc elements for the consumer. In these cases, the
buffer size N is usually chosen as a common multiple of Np and Nc. In order to emphasize the symmetry
between the operations of fetching and depositing data, the single counter n is now represented by two
counters, namely ne and nf. They specify the numbers of empty and filled buffer slots respectively. When
the consumer is idle, nf indicates the number of elements needed for the consumer to proceed; and when
the producer is waiting, ne specifies the number of elements needed for the producer to resume. (Therefore
ne+nf = N does not always hold).
MODULE Buffer;
IMPORT Signals;
CONST Np = 16; (*size of producer block*)
Nc = 128; (*size of consumer block*)
N = 1024; (*buffer size, common multiple of Np and Nc*)
VAR ne, nf: INTEGER;
in, out: INTEGER;
nonfull: Signals.Signal; (*ne >= 0*)
nonempty: Signals.Signal; (*nf >= 0*)
buf: ARRAY N OF CHAR;
PROCEDURE deposit(VAR x: ARRAY OF CHAR);
VAR i: INTEGER;
BEGIN ne := ne - Np;
IF ne < 0 THEN Signals.Wait(nonfull) END ;
FOR i := 0 TO Np-1 DO buf[in] := x[i]; INC(in) END ;
IF in = N THEN in := 0 END ;
nf := nf + Np;
IF nf >= 0 THEN Signals.Send(nonempty) END
END deposit;
PROCEDURE fetch(VAR x: ARRAY OF CHAR);
VAR i: INTEGER;
BEGIN nf := nf - Nc;
IF nf < 0 THEN Signals.Wait(nonempty) END ;
FOR i := 0 TO Nc-1 DO x[i] := buf[out]; INC(out) END;
IF out = N THEN out := 0 END ;
ne := ne + Nc;
IF ne >= 0 THEN Signals.Send(nonfull) END
END fetch;
BEGIN
ne := N; nf := 0; in := 0; out := 0; Signals.Init(nonfull); Signals.Init(nonempty)
END Buffer.
Data ultimately intended to be read by people consist of a sequence of characters, i.e., they form a text. This readability condition is responsible for yet another
complication incurred in most genuine input and output operations. Apart from the actual data transfer,
they also involve a transformation of representation. For example, numbers, usually considered as atomic
units and represented in binary form, need be transformed into readable, decimal notation. Structures need
to be represented in a suitable layout, whose generation is called formatting.
Whatever the transformation may be, the concept of the sequence is once again instrumental for a
considerable simplification of the task. The key is the observation that, if the data set can be considered as a
sequence of characters, the transformation of the sequence can be implemented as a sequence of (identical)
transformations of elements.
T(<s0, s1, ... , sn-1>) = <T(s0), T(s1), ... , T(sn-1)>
We shall briefly investigate the necessary operations for transforming representations of natural numbers
for input and output. The basis is that a number x represented by the sequence of decimal digits d = <dn-1,
... , d1, d0> has the value
x = Si: 0 ≤ i < n : di * 10^i
x = dn-1×10^(n-1) + dn-2×10^(n-2) + … + d1×10 + d0
x = ( … ((dn-1×10) + dn-2)×10 + … + d1)×10 + d0
Assume now that the sequence d is to be read and transformed, and the resulting numeric value to be
assigned to x. The simple algorithm terminates with the reading of the first character that is not a digit.
(Arithmetic overflow is not considered).
x := 0; Read(ch);
WHILE ("0" <= ch) & (ch <= "9") DO
x := 10*x + (ORD(ch) - ORD("0")); Read(ch)
END
In the case of output, the transformation is complicated by the fact that the decomposition of x into decimal
digits yields them in the reverse order. The least digit is generated first by computing x MOD 10. This
requires an intermediate buffer in the form of a first-in-last-out queue (stack). We represent it as an array d
with index i and obtain the following program:
i := 0;
REPEAT d[i] := x MOD 10; x := x DIV 10; INC(i)
UNTIL x = 0;
REPEAT DEC(i); Write(CHR(d[i] + ORD("0")))
UNTIL i = 0
Note: A consistent substitution of the constant 10 in these algorithms by a positive integer B will yield
number conversion routines to and from representations with base B. A frequently used case is B = 16
(hexadecimal), because the involved multiplications and divisions can be implemented by simple shifts of
the binary numbers.
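As an example of this substitution, the following sketch writes x (assumed to be non-negative) in hexadecimal
form, using the same buffer d and procedure Write as above; the only addition is the mapping of the digit values
10 to 15 onto the letters A to F:
i := 0;
REPEAT d[i] := x MOD 16; x := x DIV 16; INC(i) UNTIL x = 0;
REPEAT DEC(i);
IF d[i] < 10 THEN Write(CHR(d[i] + ORD("0"))) ELSE Write(CHR(d[i] - 10 + ORD("A"))) END
UNTIL i = 0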
Obviously, it should not be necessary to specify these ubiquitous operations in every program in full detail.
We therefore postulate a utility module that provides the most common, standard input and output
operations on numbers and strings. This module is referenced in most programs throughout this book, and
we call it Texts. It defines a type Text, Readers and Writers for Texts, and procedures for reading and
writing a character, an integer, a cardinal number, or a string.
Before we present the definition of module Texts, we point out an essential asymmetry between input and
output of texts. Whereas a text is generated by a sequence of calls of writing procedures, writing integers,
real numbers, strings etc., reading a text by a sequence of calls of reading procedures is questionable
practice. This is because we would rather read the next element without having to know its type in advance; we
wish to determine its type only after the item has been read. This leads us to the concept of a scanner which,
after each scan, allows us to inspect the type and value of the item read. A scanner acts like a rider in the case of
files. However, it imposes a certain syntax on the text to be read. We postulate a scanner for texts
consisting of a sequence of integers, real numbers, strings, names, and special characters given by the
following syntax specified in EBNF (Extended Backus Naur Form):
item = integer | RealNumber | identifier | string | SpecialChar.
integer = ["-"] digit {digit}.
RealNumber = ["-"] digit {digit} "." digit {digit} [("E" | "D") ["+" | "-"] digit {digit}].
identifier = letter {letter | digit}.
string = '"' {any character except quote} '"'.
SpecialChar = "!" | "?" | "@" | "#" | "$" | "%" | "^" | "&" | "+" | "-" | "*" | "/" | "\" | "|" | "(" | ")" | "[" |
"]" | "{" | "}" | "<" | ">" | "." | "," | ":" | ";" | "~".
Items are separated by blanks and/or line breaks.
DEFINITION Texts;
CONST Int = 1; Real = 2; Name = 3; Char = 4;
TYPE Text, Writer;
Reader = RECORD eot: BOOLEAN END ;
Scanner = RECORD class: INTEGER;
i: INTEGER;
x: REAL;
s: ARRAY 32 OF CHAR;
ch: CHAR;
nextCh: CHAR
END ;
PROCEDURE OpenReader(VAR r: Reader; t: Text; pos: INTEGER);
PROCEDURE OpenWriter(VAR w: Writer; t: Text; pos: INTEGER);
PROCEDURE OpenScanner(VAR s: Scanner; t: Text; pos: INTEGER);
PROCEDURE Read(VAR r: Reader; VAR ch: CHAR);
PROCEDURE Scan(VAR s: Scanner);
PROCEDURE Write(VAR w: Writer; ch: CHAR);
PROCEDURE WriteLn(VAR w: Writer); (*terminate line*)
PROCEDURE WriteString(VAR w: Writer; s: ARRAY OF CHAR);
PROCEDURE WriteInt(VAR w: Writer; x, n: INTEGER);
(*write integer x with (at least) n characters.
If n is greater than the number of digits needed,
blanks are added preceding the number*)
PROCEDURE WriteReal(VAR w: Writer; x: REAL);
PROCEDURE Close(VAR w: Writer);
END Texts.
Hence we postulate that after a call of Scan(S)
S.class = Int implies S.i is the integer read
S.class = Real implies S.x is the real number read
S.class = Name implies S.s is the identifier or string read
S.class = Char implies S.ch is the special character read
nextCh is the character immediately following the read item, possibly a blank.
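A typical use of the scanner is therefore a loop that scans items and dispatches on their class. The following
sketch sums all integers occurring in a text T; the text T and the use of the special character "~" as terminator
are assumptions made for this illustration only:
VAR S: Texts.Scanner; sum: INTEGER;
Texts.OpenScanner(S, T, 0); Texts.Scan(S); sum := 0;
WHILE ~((S.class = Texts.Char) & (S.ch = "~")) DO
IF S.class = Texts.Int THEN sum := sum + S.i END ;
Texts.Scan(S)
END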
1.9. Searching
The task of searching is one of the most frequent operations in computer programming. It also provides an
ideal ground for application of the data structures so far encountered. There exist several basic variations of
the theme of searching, and many different algorithms have been developed on this subject. The basic
assumption in the following presentations is that the collection of data, among which a given element is to
be searched, is fixed. We shall assume that this set of N elements is represented as an array, say as
a: ARRAY N OF Item
Typically, the type Item has a record structure with a field that acts as a key. The task then consists of
finding an element of a whose key field is equal to a given search argument x. The resulting index i,
satisfying a[i].key = x, then permits access to the other fields of the located element. Since we are here
interested in the task of searching only, and do not care about the data for which the element was searched
in the first place, we shall assume that the type Item consists of the key only, i.e., it is the key.
There is quite obviously no way to speed up a search, unless more information is available about the
searched data. It is well known that a search can be made much more effective, if the data are ordered.
Imagine, for example, a telephone directory in which the names were not alphabetically listed. It would be
utterly useless. We shall therefore present an algorithm which makes use of the knowledge that a is
ordered, i.e., of the condition
Ak: 1 ≤ k < N : a[k-1] ≤ a[k]
The key idea is to inspect an element picked at random, say a[m], and to compare it with the search argument
x. If it is equal to x, the search terminates; if it is less than x, we infer that all elements with indices less or
equal to m can be eliminated from further searches; and if it is greater than x, all with index greater or equal
to m can be eliminated. This results in the following algorithm called binary search; it uses two index
variables L and R marking the left and right ends of the section of a in which an element may still be
found.
L := 0; R := N-1; found := FALSE ;
WHILE (L ≤ R) & ~found DO
m := any value between L and R;
IF a[m] = x THEN found := TRUE
ELSIF a[m] < x THEN L := m+1
ELSE R := m-1
END
END
The loop invariant, i.e. the condition satisfied before each step, is
(L ≤ R) & (Ak : 0 ≤ k < L : a[k] < x) & (Ak : R < k < N : a[k] > x)
from which the result is derived as
found OR ((L > R) & (Ak : 0 ≤ k < L : a[k] < x) & (Ak : R < k < N : a[k] > x))
which implies
(a[m] = x) OR (Ak : 0 ≤ k < N : a[k] ≠ x)
The choice of m is apparently arbitrary in the sense that correctness does not depend on it. But it does
influence the algorithm's effectiveness. Clearly our goal must be to eliminate in each step as many elements
as possible from further searches, no matter what the outcome of the comparison is. The optimal solution is
to choose the middle element, because this eliminates half of the array in any case. As a result, the
maximum number of steps is log2N, rounded up to the nearest integer. Hence, this algorithm offers a drastic
improvement over linear search, where the expected number of comparisons is N/2.
The efficiency can be somewhat improved by interchanging the two if-clauses. Equality should be tested
second, because it occurs only once and causes termination. But more relevant is the question, whether -- as
in the case of linear search -- a solution could be found that allows a simpler condition for termination. We
indeed find such a faster algorithm, if we abandon the naive wish to terminate the search as soon as a match
is established. This seems unwise at first glance, but on closer inspection we realize that the gain in
efficiency at every step is greater than the loss incurred in comparing a few extra elements. Remember that
the number of steps is at most log N. The faster solution is based on the following invariant:
(Ak : 0 ≤ k < L : a[k] < x) & (Ak : R ≤ k < N : a[k] ≥ x)
and the search is continued until the two sections span the entire array.
L := 0; R := N;
WHILE L < R DO
m := (L+R) DIV 2;
IF a[m] < x THEN L := m+1 ELSE R := m END
END
The terminating condition is L ≥ R. Is it guaranteed to be reached? In order to establish this guarantee, we
must show that under all circumstances the difference R-L is diminished in each step. L < R holds at the
beginning of each step. The arithmetic mean m then satisfies L ≤ m < R. Hence, the difference is indeed
diminished by either assigning m+1 to L (increasing L) or m to R (decreasing R), and the repetition
terminates with L = R. However, the invariant and L = R do not yet establish a match. Certainly, if R = N,
no match exists. Otherwise we must take into consideration that the element a[R] had never been
compared. Hence, an additional test for equality a[R] = x is necessary. In contrast to the first solution, this
algorithm -- like linear search -- finds the matching element with the least index.
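Packaged as a function procedure, the faster search reads as follows. This is a sketch only; it assumes an ordered
global array a: ARRAY N OF INTEGER (taking the key itself as the item) and returns the least index i with a[i] = x,
or -1 if x does not occur in a. Note how the conditional conjunction guards the final access a[R]:
PROCEDURE Find(x: INTEGER): INTEGER;
VAR L, R, m, i: INTEGER;
BEGIN L := 0; R := N;
WHILE L < R DO
m := (L+R) DIV 2;
IF a[m] < x THEN L := m+1 ELSE R := m END
END ;
(*now L = R; a[R] is the only remaining candidate, provided R < N*)
IF (R < N) & (a[R] = x) THEN i := R ELSE i := -1 END ;
RETURN i
END Find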
Assuming that N may be fairly large and that the table is alphabetically ordered, we shall use a binary
search. Using the algorithms for binary search and string comparison developed above, we obtain the
following program segment.
L := 0; R := N;
WHILE L < R DO
m := (L+R) DIV 2; i := 0;
WHILE (T[m,i] = x[i]) & (x[i] # 0X) DO i := i+1 END ;
IF T[m,i] < x[i] THEN L := m+1 ELSE R := m END
END ;
IF R < N THEN i := 0;
WHILE (T[R,i] = x[i]) & (x[i] # 0X) DO i := i+1 END
END
(* (R < N) & (T[R,i] = x[i]) establish a match*)
UNTIL (j = M) OR (i = N-M)
The term j = M in the terminating condition indeed corresponds to the condition found, because it implies
P(i,M). The term i = N-M implies Q(N-M) and thereby the nonexistence of a match anywhere in the string.
If the iteration continues with j < M, then it must do so with si+j ≠ pj. This implies ~P(i,j), which implies
Q(i+1), which establishes Q(i) after the next incrementing of i.
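For reference, the straight search loop to which this reasoning applies may be sketched as follows, in the notation used above, with the text s of length N and the pattern p of length M:
i := -1;
REPEAT INC(i); j := 0;
WHILE (j < M) & (s[i+j] = p[j]) DO INC(j) END
UNTIL (j = M) OR (i = N-M)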
Analysis of straight string search. This algorithm operates quite effectively, if we can assume that a
mismatch between character pairs occurs after at most a few comparisons in the inner loop. This is likely to
be the case, if the cardinality of the item type is large. For text searches with a character set size of 128 we
may well assume that a mismatch occurs after inspecting 1 or 2 characters only. Nevertheless, the worst
case performance is rather alarming. Consider, for example, that the string consists of N-1 A's followed by a
single B, and that the pattern consists of M-1 A's followed by a B. Then in the order of N*M comparisons
are necessary to find the match at the end of the string. As we shall subsequently see, there fortunately exist
methods that drastically improve this worst case behaviour.
incrementing i and j by 1. They apparently neither represent a shift of the pattern to the right, nor do they
falsify Q(i-j), since the difference remains unchanged. But could they falsify P(i-j, j), the second factor of
the invariant? We notice that at this point the negation of the inner while clause holds, i.e. either j < 0 or si
= pj. The latter extends the partial match and establishes P(i-j, j+1). In the former case, we postulate that
P(i-j, j+1) hold as well. Hence, incrementing both i and j by 1 cannot falsify the invariant either. The only
other assignment left in the algorithm is j := D. We shall simply postulate that the value D always be such
that replacing j by D will maintain the invariant.
In order to find an appropriate expression for D, we must first understand the effect of the assignment.
Provided that D < j, it represents a shift of the pattern to the right by j-D positions. Naturally, we wish this
shift to be as large as possible, i.e., D to be as small as possible. This is illustrated by Fig. 1.10.
(Fig. 1.10: examples of partial matches and of the pattern shifts resulting from the assignment j := D; omitted.)
Since individual character comparisons now proceed from right to left, the following, slightly modified
versions of the predicates P and Q are more convenient.
P(i,j) = Ak: j ≤ k < M : si-j+k = pk
Q(i) = Ak: 0 ≤ k < i : ~P(k, 0)
These predicates are used in the following formulation of the BM-algorithm to denote the invariant
conditions.
i := M; j := M;
WHILE (j > 0) & (i <= N) DO
(* Q(i-M) *) j := M; k := i;
WHILE (j > 0) & (s[k-1] = p[j-1]) DO
(* P(k-j, j) & (k-j = i-M) *)
DEC(k); DEC(j)
END ;
i := i + d[s[i-1]]
END
The indices satisfy 0 < j < M and 0 < i,k < N. Therefore, termination with j = 0, together with P(k-j, j),
implies P(k, 0), i.e., a match at position k. Termination with j > 0 demands that i = N; hence Q(i-M) implies
Q(N-M), signalling that no match exists. Of course we still have to convince ourselves that Q(i-M) and
P(k-j, j) are indeed invariants of the two repetitions. They are trivially satisfied when repetition starts, since
Q(0) and P(x,M) are always true.
Let us first consider the effect of the two statements decrementing k and j. Q(i-M) is not affected, and,
since sk-1 = pj-1 had been established, P(k-j, j-1) holds as precondition, guaranteeing P(k-j, j) as
postcondition. If the inner loop terminates with j > 0, the fact that sk-1 ≠ p j-1 implies ~P(k-j, 0), since
~P(i, 0) = Ek: 0 ≤ k < M : si+k ≠ pk
Moreover, because k-j = i-M, Q(i-M) & ~P(k-j, 0) = Q(i+1-M), establishing a non-match at position i-M+1.
Next we must show that the statement i := i + ds[i-1] never falsifies the invariant. This is the case, provided
that before the assignment Q(i+ds[i-1]-M) is guaranteed. Since we know that Q(i+1-M) holds, it suffices to
establish ~P(i+h-M) for h = 2, 3, ... , ds[i-1]. We now recall that dx is defined as the distance of the rightmost
occurrence of x in the pattern from the end. This is formally expressed as
Ak: M-dx ≤ k < M-1 : pk ≠ x
Substituting si for x, we obtain
Ah: M-ds[i-1] ≤ h < M-1 : si-1 ≠ ph
Ah: 1 < h ≤ ds[i-1] : si-1 ≠ ph-M
Ah: 1 < h ≤ ds[i-1] : ~P(i+h-M)
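As an example, for the pattern p = ABC (M = 3) this definition yields dA = 2, dB = 1, and dx = 3 for every character x not occurring in the pattern (the last pattern character is excluded from consideration); a mismatch on such a character therefore allows the comparison position to advance by the full pattern length.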
The following program includes the presented, simplified Boyer-Moore strategy in a setting similar to that
of the preceding KMP-search program. Note as a detail that a repeat statement is used in the inner loop,
decrementing k and j before comparing s and p. This eliminates the -1 terms in the index expressions.
PROCEDURE Search(VAR s, p: ARRAY OF CHAR; m, n: INTEGER; VAR r: INTEGER);
(*search for pattern p of length m in text s of length n*)
(*if p is found, then r indicates the position in s, otherwise r = -1*)
VAR i, j, k: INTEGER;
d: ARRAY 128 OF INTEGER;
BEGIN
FOR i := 0 TO 127 DO d[i] := m END ;
FOR j := 0 TO m-2 DO d[ORD(p[j])] := m-j-1 END ;
i := m;
REPEAT j := m; k := i;
REPEAT DEC(k); DEC(j)
UNTIL (j < 0) OR (p[j] # s[k]);
i := i + d[ORD(s[i-1])]
UNTIL (j < 0) OR (i > n);
IF j < 0 THEN r := k+1 ELSE r := -1 END (*k+1 is the position of the match*)
END Search
Exercises
1.1. Assume that the cardinalities of the standard types INTEGER, REAL, and CHAR are denoted by cint,
creal, and cchar . What are the cardinalities of the following data types defined as examples in this
chapter: sex, weekday, row, alfa, complex, date, person?
1.2. Which are the instruction sequences (on your computer) for the following:
(a) Fetch and store operations for an element of packed records and arrays?
(b) Set operations, including the test for membership?
1.3. What are the reasons for defining certain sets of data as sequences instead of arrays?
1.4. Given is a railway timetable listing the daily services on several lines of a railway system. Find a
representation of these data in terms of arrays, records, or sequences, which is suitable for lookup of
arrival and departure times, given a certain station and desired direction of the train.
1.5. Given a text T in the form of a sequence and lists of a small number of words in the form of two arrays
A and B. Assume that words are short arrays of characters of a small and fixed maximum length. Write
a program that transforms the text T into a text S by replacing each occurrence of a word A i by its
corresponding word Bi.
1.6. Compare the following three versions of the binary search with the one presented in the text. Which of
the three programs are correct? Determine the relevant invariants. Which versions are more efficient?
We assume the following variables, and the constant N > 0:
VAR i, j, k, x: INTEGER;
a: ARRAY N OF INTEGER;
Program A:
i := 0; j := N-1;
REPEAT k := (i+j) DIV 2;
IF a[k] < x THEN i := k ELSE j := k END
UNTIL (a[k] = x) OR (i > j)
Program B:
i := 0; j := N-1;
REPEAT k := (i+j) DIV 2;
IF x < a[k] THEN j := k-1 END ;
2. SORTING
2.1. Introduction
The primary purpose of this chapter is to provide an extensive set of examples illustrating the use of the data
structures introduced in the preceding chapter and to show how the choice of structure for the underlying
data profoundly influences the algorithms that perform a given task. Sorting is also a good example to show
that such a task may be performed according to many different algorithms, each one having certain
advantages and disadvantages that have to be weighed against each other in the light of the particular
application.
Sorting is generally understood to be the process of rearranging a given set of objects in a specific order. The
purpose of sorting is to facilitate the later search for members of the sorted set. As such it is an almost
universally performed, fundamental activity. Objects are sorted in telephone books, in income tax files, in
tables of contents, in libraries, in dictionaries, in warehouses, and almost everywhere that stored objects have
to be searched and retrieved. Even small children are taught to put their things "in order", and they are
confronted with some sort of sorting long before they learn anything about arithmetic.
Hence, sorting is a relevant and essential activity, particularly in data processing. What else would be easier
to sort than data! Nevertheless, our primary interest in sorting is devoted to the even more fundamental
techniques used in the construction of algorithms. There are not many techniques that do not occur
somewhere in connection with sorting algorithms. In particular, sorting is an ideal subject to demonstrate a
great diversity of algorithms, all having the same purpose, many of them being optimal in some sense, and
most of them having advantages over others. It is therefore an ideal subject to demonstrate the necessity of
performance analysis of algorithms. The example of sorting is moreover well suited for showing how a very
significant gain in performance may be obtained by the development of sophisticated algorithms when
obvious methods are readily available.
The dependence of the choice of an algorithm on the structure of the data to be processed -- a ubiquitous
phenomenon -- is so profound in the case of sorting that sorting methods are generally classified into two
categories, namely, sorting of arrays and sorting of (sequential) files. The two classes are often called
internal and external sorting because arrays are stored in the fast, high-speed, random-access "internal" store
of computers and files are appropriate on the slower, but more spacious "external" stores based on
mechanically moving devices (disks and tapes). The importance of this distinction is obvious from the
example of sorting numbered cards. Structuring the cards as an array corresponds to laying them out in front
of the sorter so that each card is visible and individually accessible (see Fig. 2.1).
Structuring the cards as a file, however, implies that from each pile only the card on the top is visible (see
Fig. 2.2). Such a restriction will evidently have serious consequences on the sorting method to be used, but it
is unavoidable if the number of cards to be laid out is larger than the available table.
Before proceeding, we introduce some terminology and notation to be used throughout this chapter. If we are
given n items
a0, a1, ... , an-1
sorting consists of permuting these items into an array
ak0, ak1, ... , ak[n-1]
such that, given an ordering function f,
f(ak0) ≤ f(ak1) ≤ ... ≤ f(ak[n-1])
Ordinarily, the ordering function is not evaluated according to a specified rule of computation but is stored as
an explicit component (field) of each item. Its value is called the key of the item. As a consequence, the
record structure is particularly well suited to represent items and might for example be declared as follows:
TYPE Item = RECORD key: INTEGER;
(*other components declared here*)
END
The other components represent relevant data about the items in the collection; the key merely assumes the
purpose of identifying the items. As far as our sorting algorithms are concerned, however, the key is the only
relevant component, and there is no need to define any particular remaining components. In the following
discussions, we shall therefore discard any associated information and assume that the type Item be defined
as INTEGER. This choice of the key type is somewhat arbitrary. Evidently, any type on which a total
ordering relation is defined could be used just as well.
A sorting method is called stable if the relative order of items with equal keys remains unchanged by the
sorting process. Stability of sorting is often desirable, if items are already ordered (sorted) according to some
secondary keys, i.e., properties not reflected by the (primary) key itself.
This chapter is not to be regarded as a comprehensive survey of sorting techniques. Rather, some selected,
specific methods are exemplified in greater detail. For a thorough treatment of sorting, the interested reader
is referred to the excellent and comprehensive compendium by D. E. Knuth [2-7] (see also Lorin [2-10]).
Initial Keys: 44 55 12 42 94 18 06 67
i=1 44 55 12 42 94 18 06 67
i=2 12 44 55 42 94 18 06 67
i=3 12 42 44 55 94 18 06 67
i=4 12 42 44 55 94 18 06 67
i=5 12 18 42 44 55 94 06 67
i=6 06 12 18 42 44 55 94 67
i=7 06 12 18 42 44 55 67 94
Table 2.1 A Sample Process of Straight Insertion Sorting.
The process of sorting by insertion is shown in an example of eight numbers chosen at random (see Table
2.1). The algorithm of straight insertion is
FOR i := 1 TO n-1 DO
x := a[i];
insert x at the appropriate place in a0 ... ai
END
In the process of actually finding the appropriate place, it is convenient to alternate between comparisons and
moves, i.e., to let x sift down by comparing x with the next item aj, and either inserting x or moving aj to the
right and proceeding to the left. We note that there are two distinct conditions that may cause the termination
of the sifting down process:
1. An item aj is found with a key less than the key of x.
2. The left end of the destination sequence is reached.
PROCEDURE StraightInsertion;
VAR i, j: INTEGER; x: Item;
BEGIN
FOR i := 1 TO n-1 DO
x := a[i]; j := i;
WHILE (j > 0) & (x < a[j-1]) DO a[j] := a[j-1]; DEC(j) END ;
a[j] := x
END
END StraightInsertion
Analysis of straight insertion. The number Ci of key comparisons in the i-th sift is at most i-1, at least 1, and
-- assuming that all permutations of the n keys are equally probable -- i/2 in the average. The number Mi of
moves (assignments of items) is Ci + 2 (including the sentinel). Therefore, the total numbers of comparisons
and moves are
Cmin = n-1 Mmin = 3*(n-1)
Cave = (n2 + n - 2)/4 Mave = (n2 + 9n - 10)/4
Cmax = (n2 + n - 4)/4 Mmax = (n2 + 3n - 4)/2
The minimal numbers occur if the items are initially in order; the worst case occurs if the items are initially
in reverse order. In this sense, sorting by insertion exhibits a truly natural behavior. It is plain that the given
algorithm also describes a stable sorting process: it leaves the order of items with equal keys unchanged.
The algorithm of straight insertion is easily improved by noting that the destination sequence a0 ... ai-1, in
which the new item has to be inserted, is already ordered. Therefore, a faster method of determining the
insertion point can be used. The obvious choice is a binary search that samples the destination sequence in
the middle and continues bisecting until the insertion point is found. The modified sorting algorithm is called
binary insertion.
PROCEDURE BinaryInsertion(VAR a: ARRAY OF Item; n: INTEGER);
VAR i, j, m, L, R: INTEGER; x: Item;
BEGIN
FOR i := 1 TO n-1 DO
x := a[i]; L := 0; R := i;
WHILE L < R DO
m := (L+R) DIV 2;
IF a[m] <= x THEN L := m+1 ELSE R := m END
END ;
FOR j := i TO R+1 BY -1 DO a[j] := a[j-1] END ;
a[R] := x
END
END BinaryInsertion
Analysis of binary insertion. The insertion position is found if L = R. Thus, the search interval must in the
end be of length 1; and this involves halving the interval of length i log(i) times. Thus,
C = Si: 0 ≤ i ≤ n: log(i)
We approximate this sum by the integral
Int (0:n-1) log(x) dx = n*(log n - c) + c
where c = log e = 1/ln 2 = 1.44269... .
The number of comparisons is essentially independent of the initial order of the items. However, because of
the truncating character of the division involved in bisecting the search interval, the true number of
comparisons needed with i items may be up to 1 higher than expected. The nature of this bias is such that
insertion positions at the low end are on the average located slightly faster than those at the high end, thereby
favoring those cases in which the items are originally highly out of order. In fact, the minimum number of
comparisons is needed if the items are initially in reverse order and the maximum if they are already in order.
Hence, this is a case of unnatural behavior of a sorting algorithm. The number of comparisons is then
approximately
C ≈ n*(log n - log e ± 0.5)
Unfortunately, the improvement obtained by using a binary search method applies only to the number of
comparisons but not to the number of necessary moves. In fact, since moving items, i.e., keys and associated
information, is in general considerably more time-consuming than comparing two keys, the improvement is
by no means drastic: the important term M is still of the order n2. And, in fact, sorting the already sorted
array takes more time than does straight insertion with sequential search.
This example demonstrates that an "obvious improvement" often has much less drastic consequences than
one is first inclined to estimate and that in some cases (that do occur) the "improvement" may actually turn
out to be a deterioration. After all, sorting by insertion does not appear to be a very suitable method for
digital computers: insertion of an item with the subsequent shifting of an entire row of items by a single
position is uneconomical. One should expect better results from a method in which moves of items are only
performed upon single items and over longer distances. This idea leads to sorting by selection.
06 12 18 42 44 55 94 67
06 12 18 42 44 55 94 67
06 12 18 42 44 55 67 94
Table 2.2 A Sample Process of Straight Selection Sorting.
The algorithm is formulated as follows:
FOR i := 0 TO n-1 DO
assign the index of the least item of ai ... an-1 to k;
exchange ai with ak
END
This method, called straight selection, is in some sense the opposite of straight insertion: Straight insertion
considers in each step only the one next item of the source sequence and all items of the destination array to
find the insertion point; straight selection considers all items of the source array to find the one with the least
key and to be deposited as the one next item of the destination sequence.
PROCEDURE StraightSelection;
VAR i, j, k: INTEGER; x: Item;
BEGIN
FOR i := 0 TO n-2 DO
k := i; x := a[i];
FOR j := i+1 TO n-1 DO
IF a[j] < x THEN k := j; x := a[k] END
END ;
a[k] := a[i]; a[i] := x
END
END StraightSelection
Analysis of straight selection. Evidently, the number C of key comparisons is independent of the initial order
of keys. In this sense, this method may be said to behave less naturally than straight insertion. We obtain
C = (n2 - n)/2
The number M of moves is at least
Mmin = 3*(n-1)
in the case of initially ordered keys and at most
Mmax = n2/4 + 3*(n-1)
if initially the keys are in reverse order. In order to determine Mavg we make the following deliberations:
The algorithm scans the array, comparing each element with the minimal value so far detected and, if smaller
than that minimum, performs an assignment. The probability that the second element is less than the first, is
1/2; this is also the probability for a new assignment to the minimum. The chance for the third element to be
less than the first two is 1/3, and the chance of the fourth to be the smallest is 1/4, and so on. Therefore the
total expected number of moves is Hn-1, where Hn is the n-th harmonic number
Hn = 1 + 1/2 + 1/3 + ... + 1/n
Hn can be expressed as
Hn = ln(n) + g + 1/2n - 1/12n2 + ...
where g = 0.577216... is Euler's constant. For sufficiently large n, we may ignore the fractional terms and
therefore approximate the average number of assignments in the i th pass as
Fi = ln(i) + g + 1
The average number of moves Mavg in a selection sort is then the sum of Fi with i ranging from 1 to n.
Mavg = n*(g+1) + (Si: 1 ≤ i ≤ n: ln(i))
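Approximating this sum by the integral Int (1:n) ln(x) dx = n*ln(n) - n + 1 yields Mavg ≈ n*(ln(n) + g), i.e., the average number of moves also grows essentially with n*ln(n).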
12 18 42 44 55 67 94 06
is sorted by the improved Bubblesort in a single pass, but the array
94 06 12 18 42 44 55 67
requires seven passes for sorting. This unnatural asymmetry suggests a third improvement: alternating the
direction of consecutive passes. We appropriately call the resulting algorithm Shakersort. Its behavior is
illustrated in Table 2.4 by applying it to the same eight keys that were used in Table 2.3.
PROCEDURE ShakerSort;
VAR j, k, L, R: INTEGER; x: Item;
BEGIN L := 1; R := n-1; k := R;
REPEAT
FOR j := R TO L BY -1 DO
IF a[j-1] > a[j] THEN
x := a[j-1]; a[j-1] := a[j]; a[j] := x; k := j
END
END ;
L := k+1;
FOR j := L TO R BY +1 DO
IF a[j-1] > a[j] THEN
x := a[j-1]; a[j-1] := a[j]; a[j] := x; k := j
END
END ;
R := k-1
UNTIL L > R
END ShakerSort
L= 2 3 3 4 4
R= 8 8 7 7 4
dir = ↑ ↓ ↑ ↓ ↑
44 06 06 06 06
55 44 44 12 12
12 55 12 44 18
42 12 42 18 42
94 42 55 42 44
18 94 18 55 55
06 18 67 67 67
67 67 94 94 94
Table 2.4 An Example of Shakersort.
Analysis of Bubblesort and Shakersort. The number of comparisons in the straight exchange algorithm is
C = (n2 - n)/2
and the minimum, average, and maximum numbers of moves (assignments of items) are
Mmin = 0, Mavg = 3*(n2 - n)/4, Mmax = 3*(n2 - n)/2
The analysis of the improved methods, particularly that of Shakersort, is intricate. The least number of
comparisons is Cmin = n-1. For the improved Bubblesort, Knuth arrives at an average number of passes
proportional to n - k1*√n, and an average number of comparisons proportional to (n2 - n*(k2 + ln(n)))/2. But
we note that all improvements mentioned above do in no way affect the number of exchanges; they only
reduce the number of redundant double checks. Unfortunately, an exchange of two items is generally a more
costly operation than a comparison of keys; our clever improvements therefore have a much less profound
effect than one would intuitively expect.
This analysis shows that the exchange sort and its minor improvements are inferior to both the insertion and
the selection sorts; and in fact, the Bubblesort has hardly anything to recommend it except its catchy name.
The Shakersort algorithm is used with advantage in those cases in which it is known that the items are
already almost in order -- a rare case in practice.
It can be shown that the average distance that each of the n items has to travel during a sort is n/3 places.
This figure provides a clue in the search for improved, i.e. more effective sorting methods. All straight
sorting methods essentially move each item by one position in each elementary step. Therefore, they are
bound to require in the order n2 such steps. Any improvement must be based on the principle of moving
items over greater distances in single leaps.
Subsequently, three improved methods will be discussed, namely, one for each basic sorting method:
insertion, selection, and exchange.
The procedure is therefore developed without relying on a specific sequence of increments. The T
increments are denoted by h0, h1, ... , hT-1 with the conditions
hT-1 = 1, hi+1 < hi
The algorithm is described by the procedure ShellSort [2-11] for T = 4:
PROCEDURE ShellSort;
CONST T = 4;
VAR i, j, k, m, s: INTEGER;
x: Item;
h: ARRAY T OF INTEGER;
BEGIN h[0] := 9; h[1] := 5; h[2] := 3; h[3] := 1;
FOR m := 0 TO T-1 DO
k := h[m];
FOR i := k TO n-1 DO
x := a[i]; j := i-k;
WHILE (j >= 0) & (x < a[j]) DO a[j+k] := a[j]; j := j-k END ;
a[j+k] := x
END
END
END ShellSort
Analysis of Shellsort. The analysis of this algorithm poses some very difficult mathematical problems, many
of which have not yet been solved. In particular, it is not known which choice of increments yields the best
results. One surprising fact, however, is that they should not be multiples of each other. This will avoid the
phenomenon evident from the example given above in which each sorting pass combines two chains that
before had no interaction whatsoever. It is indeed desirable that interaction between various chains takes
place as often as possible, and the following theorem holds: If a k-sorted sequence is i-sorted, then it
remains k-sorted. Knuth [2-8] indicates evidence that a reasonable choice of increments is the sequence
(written in reverse order)
1, 4, 13, 40, 121, ...
where hk-1 = 3hk+1, ht = 1, and t = log3(n) - 1. He also recommends the sequence
1, 3, 7, 15, 31, ...
where hk-1 = 2hk+1, ht = 1, and t = log2(n) - 1. For the latter choice, mathematical analysis yields an effort
proportional to n^1.2 required for sorting n items with the Shellsort algorithm. Although this is a significant
improvement over n2, we will not expound further on this method, since even better algorithms are known.
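Incidentally, the increments of either sequence can be generated directly from its recurrence rather than tabulated by hand; for the second sequence, a possible sketch (using the declarations of procedure ShellSort above) is
h[T-1] := 1;
FOR m := T-2 TO 0 BY -1 DO h[m] := 2*h[m+1] + 1 END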
(Figures omitted: the array h0 ... h6 viewed as a binary tree, and heaps of the example keys illustrating the sift process.)
44 55 12 42 | 94 18 06 67
44 55 12 | 42 94 18 06 67
44 55 | 06 42 94 18 12 67
44 | 42 06 55 94 18 12 67
06 42 12 55 94 18 44 67
Table 2.6 Constructing a Heap.
Consequently, the process of generating a heap of n elements h0 ... hn-1 in situ is described as follows:
L := n DIV 2;
WHILE L > 0 DO DEC(L); sift(L, n-1) END
In order to obtain not only a partial, but a full ordering among the elements, n sift steps have to follow,
whereby after each step the next (least) item may be picked off the top of the heap. Once more, the question
arises about where to store the emerging top elements and whether or not an in situ sort would be possible.
Of course there is such a solution: In each step take the last component (say x) off the heap, store the top
element of the heap in the now free location of x, and let x sift down into its proper position. The necessary
n-1 steps are illustrated on the heap of Table 2.7. The process is described with the aid of the procedure sift
as follows:
R := n-1;
WHILE R > 0 DO
x := a[0]; a[0] := a[R]; a[R] := x;
DEC(R); sift(0, R)
END
06 42 12 55 94 18 44 67
12 42 18 55 94 67 44 | 06
18 42 44 55 94 67 | 12 06
42 55 44 67 94 | 18 12 06
44 55 94 67 | 42 18 12 06
55 67 94 | 44 42 18 12 06
67 94 | 55 44 42 18 12 06
94 | 67 55 44 42 18 12 06
Table 2.7 Example of a Heapsort Process.
The example of Table 2.7 shows that the resulting order is actually inverted. This, however, can easily be
remedied by changing the direction of the ordering relations in the sift procedure. This results in the
following procedure HeapSort. (Note that sift should actually be declared local to HeapSort.)
PROCEDURE sift(L, R: INTEGER);
VAR i, j: INTEGER; x: Item;
BEGIN i := L; j := 2*i+1; x := a[i];
IF (j < R) & (a[j] < a[j+1]) THEN j := j+1 END ;
WHILE (j <= R) & (x < a[j]) DO
a[i] := a[j]; i := j; j := 2*j+1;
IF (j < R) & (a[j] < a[j+1]) THEN j := j+1 END
END ;
a[i] := x
END sift;
PROCEDURE HeapSort;
VAR L, R: INTEGER; x: Item;
BEGIN L := n DIV 2; R := n-1;
WHILE L > 0 DO DEC(L); sift(L, R) END ;
WHILE R > 0 DO
x := a[0]; a[0] := a[R]; a[R] := x;
DEC(R); sift(L, R)
END
END HeapSort
Analysis of Heapsort. At first sight it is not evident that this method of sorting provides good results. After
all, the large items are first sifted to the left before finally being deposited at the far right. Indeed, the
procedure is not recommended for small numbers of items, such as shown in the example. However, for
large n, Heapsort is very efficient, and the larger n is, the better it becomes -- even compared to Shellsort.
In the worst case, there are n/2 sift steps necessary, sifting items through log(n/2), log(n/2 +1), ... , log(n-1)
positions, where the logarithm (to the base 2) is truncated to the next lower integer. Subsequently, the sorting
phase takes n-1 sifts, with at most log(n-1), log(n-2), ... , 1 moves. In addition, there are n-1 moves for
stashing the sifted item away at the right. This argument shows that Heapsort takes of the order of n×log(n)
steps even in the worst possible case. This excellent worst-case performance is one of the strongest qualities
of Heapsort.
It is not at all clear in which case the worst (or the best) performance can be expected. But generally
Heapsort seems to like initial sequences in which the items are more or less sorted in the inverse order, and
therefore it displays an unnatural behavior. The heap creation phase requires zero moves if the inverse order
is present. The average number of moves is approximately n/2 × log(n), and the deviations from this value
are relatively small.
of the array unless more complicated termination conditions were used. The simplicity of the conditions is
well worth the extra exchanges that occur relatively rarely in the average random case. A slight saving,
however, may be achieved by changing the clause controlling the exchange step to i < j instead of i ≤ j. But
this change must not be extended over the two statements
INC(i); DEC(j)
which therefore require a separate conditional clause. Confidence in the correctness of the partition algorithm
can be gained by verifying that the ordering relations are invariants of the repeat statement. Initially, with i =
0 and j = n-1, they are trivially true, and upon exit with i > j, they imply the desired result.
We now recall that our goal is not only to find partitions of the original array of items, but also to sort it.
However, it is only a small step from partitioning to sorting: after partitioning the array, apply the same
process to both partitions, then to the partitions of the partitions, and so on, until every partition consists of a
single item only. This recipe is described as follows. (Note that sort should actually be declared local to
Quicksort).
PROCEDURE sort(L, R: INTEGER);
VAR i, j: INTEGER; w, x: Item;
BEGIN i := L; j := R;
x := a[(L+R) DIV 2];
REPEAT
WHILE a[i] < x DO INC(i) END ;
WHILE x < a[j] DO DEC(j) END ;
IF i <= j THEN
w := a[i]; a[i] := a[j]; a[j] := w; i := i+1; j := j-1
END
UNTIL i > j;
IF L < j THEN sort(L, j) END ;
IF i < R THEN sort(i, R) END
END sort;
PROCEDURE QuickSort;
BEGIN sort(0, n-1)
END QuickSort
Procedure sort activates itself recursively. Such use of recursion in algorithms is a very powerful tool and
will be discussed further in Chap. 3. In some programming languages of older provenience, recursion is
disallowed for certain technical reasons. We will now show how this same algorithm can be expressed as a
non-recursive procedure. Obviously, the solution is to express recursion as an iteration, whereby a certain
amount of additional bookkeeping operations become necessary.
The key to an iterative solution lies in maintaining a list of partitioning requests that have yet to be
performed. After each step, two partitioning tasks arise. Only one of them can be attacked directly by the
subsequent iteration; the other one is stacked away on that list. It is, of course, essential that the list of
requests is obeyed in a specific sequence, namely, in reverse sequence. This implies that the first request
listed is the last one to be obeyed, and vice versa; the list is treated as a pulsating stack. In the following
nonrecursive version of Quicksort, each request is represented simply by a left and a right index specifying
the bounds of the partition to be further partitioned. Thus, we introduce two array variables low, high, used
as stacks with index s. The appropriate choice of the stack size M will be discussed during the analysis of
Quicksort.
PROCEDURE NonRecursiveQuickSort;
CONST M = 12;
VAR i, j, L, R, s: INTEGER; x, w: Item;
low, high: ARRAY M OF INTEGER; (*index stack*)
BEGIN s := 0; low[0] := 0; high[0] := n-1;
REPEAT (*take top request from stack*)
L := low[s]; R := high[s]; DEC(s);
REPEAT (*partition a[L] ... a[R]*)
performance considerably. It becomes evident that sorting on the basis of Quicksort is somewhat like a
gamble in which one should be aware of how much one may afford to lose if bad luck were to strike.
There is one important lesson to be learned from this experience; it concerns the programmer directly. What
are the consequences of the worst case behavior mentioned above on the performance of Quicksort? We have
realized that each split results in a right partition of only a single element; the request to sort this partition is
stacked for later execution. Consequently, the maximum number of requests, and therefore the total required
stack size, is n. This is, of course, totally unacceptable. (Note that we fare no better -- in fact even worse --
with the recursive version because a system allowing recursive activation of procedures will have to store the
values of local variables and parameters of all procedure activations automatically, and it will use an implicit
stack for this purpose.)
The remedy lies in stacking the sort request for the longer partition and in continuing directly with the further
partitioning of the smaller section. In this case, the size of the stack M can be limited to log n.
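For instance, sorting n = 1000000 items then requires a stack of no more than 20 entries, since log2(1000000) < 20.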
The change necessary is localized in the section setting up new requests. It now reads
IF j - L < R - i THEN
IF i < R THEN (*stack request for sorting right partition*)
INC(s); low[s] := i; high[s] := R
END ;
R := j (*continue sorting left partition*)
ELSE
IF L < j THEN (*stack request for sorting left partition*)
INC(s); low[s] := L; high[s] := j
END;
L := i (*continue sorting right partition*)
END
2. The chosen bound x was too large. The splitting operation has to be repeated on the partition aL ... aj
(see Fig. 2.10).
(Partition diagrams omitted.)
Again, there is hardly any advantage in using this algorithm, if the number of elements is small, say, fewer
than 10.
Actually, the splitting phases do not contribute to the sort since they do in no way permute the items; in a
sense they are unproductive, although they constitute half of all copying operations. They can be eliminated
altogether by combining the split and the merge phase. Instead of merging into a single sequence, the output
of the merge process is immediately redistributed onto two tapes, which constitute the sources of the
subsequent pass. In contrast to the previous two-phase merge sort, this method is called a single-phase merge
or a balanced merge. It is evidently superior because only half as many copying operations are necessary; the
price for this advantage is a fourth tape.
We shall develop a merge program in detail and initially let the data be represented as an array which,
however, is scanned in strictly sequential fashion. A later version of merge sort will then be based on the
sequence structure, allowing a comparison of the two programs and demonstrating the strong dependence of
the form of a program on the underlying representation of its data.
A single array may easily be used in place of two sequences, if it is regarded as double-ended. Instead of
merging from two source files, we may pick items off the two ends of the array. Thus, the general form of
the combined merge-split phase can be illustrated as shown in Fig. 2.12. The destination of the merged items
is switched after each ordered pair in the first pass, after each ordered quadruple in the second pass, etc., thus
evenly filling the two destination sequences, represented by the two ends of a single array. After each pass,
the two arrays interchange their roles, the source becomes the new destination, and vice versa.
(Fig. 2.12: items are merged from the two ends, indices i and j, of the source array and distributed alternately to the two ends of the destination array.)
After each p-tuple merge the destination is switched from the lower to the upper end of the destination array, or vice versa, to
guarantee equal distribution onto both destinations. If the destination of the merged items is the lower end of
the destination array, then the destination index is k, and k is incremented after each move of an item. If they
are to be moved to the upper end of the destination array, the destination index is L, and it is decremented
after each move. In order to simplify the actual merge statement, we choose the destination to be designated
by k at all times, switching the values of the variables k and L after each p-tuple merge, and denote the
increment to be used at all times by h, where h is either 1 or -1. These design discussions lead to the
following refinement:
h := 1; m := n; (*m = no. of items to be merged*)
REPEAT q := p; r := p; m := m - 2*p;
merge q items from i-source with r items from j-source.
destination index is k. increment k by h;
h := -h; exchange k and L
UNTIL m = 0
In the further refinement step the actual merge statement is to be formulated. Here we have to keep in mind
that the tail of the one subsequence which is left non-empty after the merge has to be appended to the output
sequence by simple copying operations.
WHILE (q > 0) & (r > 0) DO
IF a[i] < a[j] THEN
move an item from i-source to k-destination; advance i and k; q := q-1
ELSE
move an item from j-source to k-destination; advance j and k; r := r-1
END
END ;
copy tail of i-sequence; copy tail of j-sequence
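The two tail copies can be refined in the same manner as the merge itself; a possible formulation, using the source indices i and j, the destination index k, and the increment h introduced above, is
WHILE q > 0 DO a[k] := a[i]; k := k+h; i := i+1; q := q-1 END ;
WHILE r > 0 DO a[k] := a[j]; k := k+h; j := j-1; r := r-1 END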
After this further refinement of the tail copying operations, the program is laid out in complete detail. Before
writing it out in full, we wish to eliminate the restriction that n be a power of 2. Which parts of the algorithm
are affected by this relaxation of constraints? We easily convince ourselves that the best way to cope with the
more general situation is to adhere to the old method as long as possible. In this example this means that we
continue merging p-tuples until the remainders of the source sequences are of length less than p. The one and
only part that is influenced are the statements that determine the values of q and r, the lengths of the
sequences to be merged. The following four statements replace the three statements
q := p; r := p; m := m -2*p
and, as the reader should convince himself, they represent an effective implementation of the strategy
specified above; note that m denotes the total number of items in the two source sequences that remain to be
merged:
IF m >= p THEN q := p ELSE q := m END ;
m := m-q;
IF m >= p THEN r := p ELSE r := m END ;
m := m-r
In addition, in order to guarantee termination of the program, the condition p=n, which controls the outer
repetition, must be changed to p ≥ n. After these modifications, we may now proceed to describe the entire
algorithm in terms of a procedure operating on the global array a with 2n elements.
PROCEDURE StraightMerge;
VAR i, j, k, L, t: INTEGER; (*index range of a is 0 .. 2*n-1 *)
h, m, p, q, r: INTEGER; up: BOOLEAN;
BEGIN up := TRUE; p := 1;
REPEAT h := 1; m := n;
IF up THEN i := 0; j := n-1; k := n; L := 2*n-1
ELSE k := 0; L := n-1; i := n; j := 2*n-1
END ;
a maximal run or, for short, a run. A natural merge sort, therefore, merges (maximal) runs instead of
sequences of fixed, predetermined length. Runs have the property that if two sequences of n runs are merged,
a single sequence of exactly n runs emerges. Therefore, the total number of runs is halved in each pass, and
the number of required moves of items is in the worst case n*log(n), but in the average case it is even less.
The expected number of comparisons, however, is much larger because in addition to the comparisons
necessary for the selection of items, further comparisons are needed between consecutive items of each file
in order to determine the end of each run.
Our next programming exercise develops a natural merge algorithm in the same stepwise fashion that was
used to explain the straight merging algorithm. It employs the sequence structure (represented by files, see
Sect. 1.8) instead of the array, and it represents an unbalanced, two-phase, three-tape merge sort. We assume
that the file variable c represents the initial sequence of items. (Naturally, in actual data processing
applications, the initial data are first copied from the original source to c for reasons of safety.) a and b are
two auxiliary file variables. Each pass consists of a distribution phase that distributes runs equally from c to a
and b, and a merge phase that merges runs from a and b to c. This process is illustrated in Fig. 2.13.
(Fig. 2.13: each pass consists of a distribution phase that distributes the 1st, 2nd, ... , nth runs from c onto a and b, followed by a merge phase that merges runs from a and b back to c.)
Although procedure distribute supposedly outputs runs in equal numbers to the two files, the important
consequence is that the actual number of resulting runs on a and b may differ significantly. Our merge
procedure, however, only merges pairs of runs and terminates as soon as b is read, thereby losing the tail of
one of the sequences. Consider the following input data that are sorted (and truncated) in two subsequent
passes:
17 19 13 57 23 29 11 59 31 37 07 61 41 43 05 67 47 71 02 03
13 17 19 23 29 31 37 41 43 47 57 71 11 59
11 13 17 19 23 29 31 37 41 43 47 57 59 71
Table 2.12 Incorrect Result of Mergesort Program.
The example of this programming mistake is typical for many programming situations. The mistake is
caused by an oversight of one of the possible consequences of a presumably simple operation. It is also
typical in the sense that several ways of correcting the mistake are open and that one of them has to be chosen.
Often there exist two possibilities that differ in a very important, fundamental way:
1. We recognize that the operation of distribution is incorrectly programmed and does not satisfy the
requirement that the number of runs differ by at most 1. We stick to the original scheme of operation and
correct the faulty procedure accordingly.
2. We recognize that the correction of the faulty part involves far-reaching modifications, and we try to find
ways in which other parts of the algorithm may be changed to accommodate the currently incorrect part.
In general, the first path seems to be the safer, cleaner one, the more honest way, providing a fair degree of
immunity from later consequences of overlooked, intricate side effects. It is, therefore, the way toward a
solution that is generally recommended.
It is to be pointed out, however, that the second possibility should sometimes not be entirely ignored. It is
for this reason that we further elaborate on this example and illustrate a fix by modification of the merge
procedure rather than the distribution procedure, which is primarily at fault.
This implies that we leave the distribution scheme untouched and renounce the condition that runs be equally
distributed. This may result in a less than optimal performance. However, the worst-case performance
remains unchanged, and moreover, the case of highly unequal distribution is statistically very unlikely.
Efficiency considerations are therefore no serious argument against this solution.
If the condition of equal distribution of runs no longer exists, then the merge procedure has to be changed so
that, after reaching the end of one file, the entire tail of the remaining file is copied instead of at most one
run. This change is straightforward and is very simple in comparison with any change in the distribution
scheme. (The reader is urged to convince himself of the truth of this claim). The revised version of the merge
algorithm is shown below in the form of a function procedure:
PROCEDURE NaturalMerge(src: Files.File): Files.File;
VAR L: INTEGER; (*no. of runs merged*)
f0, f1, f2: Files.File;
r0, r1, r2: Runs.Rider;
PROCEDURE copyrun(VAR x, y: Runs.Rider);
BEGIN (*from x to y*)
REPEAT Runs.copy(x, y) UNTIL x.eor
END copyrun;
BEGIN Runs.Set(r2, src);
REPEAT f0 := Files.New("test0"); Files.Set(r0, f0, 0);
f1 := Files.New("test1"); Files.Set (r1, f1, 0);
(*distribute from r2 to r0 and r1*)
REPEAT copyrun(r2, r0);
IF ~r2.eof THEN copyrun(r2, r1) END
UNTIL r2.eof;
Runs.Set(r0, f0); Runs.Set(r1, f1);
f2 := Files.New(""); Files.Set(r2, f2, 0); L := 0;
R: Runs.Rider;
BEGIN Runs.Set(R, src); (*distribute initial runs from R to w[0] ... w[N-1]*)
j := 0; L := 0;
position riders w on files g;
REPEAT
copy one run from R to w[j];
INC(j); INC(L);
IF j = N THEN j := 0 END
UNTIL R.eof;
REPEAT (*merge from riders r to riders w*)
switch files g to riders r;
L := 0; j := 0; (*j = index of output file*)
REPEAT INC(L);
merge one run from inputs to w[j];
IF j < N-1 THEN INC(j) ELSE j := 0 END
UNTIL all inputs exhausted;
UNTIL L = 1
(*sorted file is with w[0]*)
END BalancedMerge.
Having associated a rider R with the source file, we now refine the statement for the initial distribution of
runs. Using the definition of copy, we replace copy one run from R to w[j] by:
REPEAT Runs.copy(R, w[j]) UNTIL R.eor
Copying a run terminates when either the first item of the next run is encountered or when the end of the
entire input file is reached.
In the actual sort algorithm, the following statements remain to be specified in more detail:
1. Position riders w on files g
2. Merge one run from inputs to wj
3. Switch files g to riders r
4. All inputs exhausted
First, we must accurately identify the current input sequences. Notably, the number of active inputs may be
less than N. Obviously, there can be at most as many sources as there are runs; the sort terminates as soon as
there is one single sequence left. This leaves open the possibility that at the initiation of the last sort pass
there are fewer than N runs. We therefore introduce a variable, say k1, to denote the actual number of inputs
used. We incorporate the initialization of k1 in the statement switch files as follows:
IF L < N THEN k1 := L ELSE k1 := N END ;
FOR i := 0 TO k1-1 DO Runs.Set(r[i], g[i]) END
Naturally, statement (2) is to decrement k1 whenever an input source ceases. Hence, predicate (4) may easily
be expressed by the relation k1 = 0. Statement (2), however, is more difficult to refine; it consists of the
repeated selection of the least key among the available sources and its subsequent transport to the
destination, i.e., the current output sequence. The process is further complicated by the necessity of
determining the end of each run. The end of a run may be reached because (1) the subsequent key is less than
the current key or (2) the end of the source is reached. In the latter case the source is eliminated by
decrementing k1; in the former case the run is closed by excluding the sequence from further selection of
items, but only until the creation of the current output run is completed. This makes it obvious that a second
variable, say k2, is needed to denote the number of sources actually available for the selection of the next
item. This value is initially set equal to k1 and is decremented whenever a run terminates because of condition
(1).
Unfortunately, the introduction of k2 is not sufficient. We need to know not only the number of files, but also
which files are still in actual use. An obvious solution is to use an array with Boolean components indicating
the availability of the files. We choose, however, a different method that leads to a more efficient selection
procedure which, after all, is the most frequently repeated part of the entire algorithm. Instead of using a
Boolean array, a file index map, say t, is introduced. This map is used so that t0 ... tk2-1 are the indices of the
available sequences. Thus statement (2) can be formulated as follows:
k2 := k1;
REPEAT select the minimal key, let t[m] be the sequence number on which it occurs;
Runs.copy(r[t[m]], w[j]);
IF r[t[m]].eof THEN eliminate sequence
ELSIF r[t[m]].eor THEN close run
END
UNTIL k2 = 0
Since the number of sequences will be fairly small for any practical purpose, the selection algorithm to be
specified in further detail in the next refinement step may as well be a straightforward linear search. The
statement eliminate sequence implies a decrease of k1 as well as k2 and also a reassignment of indices in the
map t. The statement close run merely decrements k2 and rearranges components of t accordingly. The
details are shown in the following procedure, being the last refinement. The statement switch sequences is
elaborated according to explanations given earlier.
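The two statements may, for instance, be refined as follows, with m denoting the position in t of the selected sequence and tx an auxiliary variable:
IF r[t[m]].eof THEN (*eliminate sequence*)
DEC(k1); DEC(k2); t[m] := t[k2]; t[k2] := t[k1]
ELSIF r[t[m]].eor THEN (*close run*)
DEC(k2); tx := t[m]; t[m] := t[k2]; t[k2] := tx
END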
(Fig. 2.14 Polyphase merge sort of 21 runs with 3 sequences)
f1   f2   f3
13    8    0
 5    0    8
 0    5    3
 3    2    0
 1    0    2
 0    1    1
 1    0    0
A second example shows the Polyphase method with 6 sequences. Let there initially be 16 runs on f1, 15 on
f2, 14 on f3, 12 on f4, and 8 on f5. In the first partial pass, 8 runs are merged onto f6. In the end, f2 contains
the sorted set of items (see Fig. 2.15).
f1 f2 f3 f4 f5 f6
16 15 14 12 8
8 7 6 4 0 8
4 3 2 0 4 4
2 1 0 2 2 2
1 0 1 1 1 1
0 1 0 0 0 0
In order to find an appropriate rule for distribution, however, we must know how actual and dummy runs are
merged. Clearly, the selection of a dummy run from sequence i means precisely that sequence i is ignored
during this merge, resulting in a merge from fewer than N-1 sources. Merging of a dummy run from all N-1
sources implies no actual merge operation, but instead the recording of the resulting dummy run on the output
sequence. From this we conclude that dummy runs should be distributed to the N-1 sequences as uniformly as
possible, since we are interested in active merges from as many sources as possible.
Let us forget dummy runs for a moment and consider the problem of distributing an unknown number of runs
onto N-1 sequences. It is plain that the Fibonacci numbers of order N-2 specifying the desired numbers of
runs on each source can be generated while the distribution progresses. Assuming, for example, N = 6 and
referring to Table 2.14, we start by distributing runs as indicated by the row with index L = 1 (1, 1, 1, 1, 1); if
there are more runs available, we proceed to the second row (2, 2, 2, 2, 1); if the source is still not exhausted,
the distribution proceeds according to the third row (4, 4, 4, 3, 2), and so on. We shall call the row index
level. Evidently, the larger the number of runs, the higher is the level of Fibonacci numbers which,
incidentally, is equal to the number of merge passes or switchings necessary for the subsequent sort. The
distribution algorithm can now be formulated in a first version as follows:
1. Let the distribution goal be the Fibonacci numbers of order N-2, level 1.
2. Distribute according to the set goal.
3. If the goal is reached, compute the next level of Fibonacci numbers; the difference between them and those
on the former level constitutes the new distribution goal. Return to step 2. If the goal cannot be reached
because the source is exhausted, terminate the distribution process.
The rules for calculating the next level of Fibonacci numbers are contained in their definition. We can thus
concentrate our attention on step 2, where, with a given goal, the subsequent runs are to be distributed one
after the other onto the N-1 output sequences. It is here where the dummy runs have to reappear in our
considerations.
Let us assume that when raising the level, we record the next goal by the differences di for i = 1 ... N-1, where
di denotes the number of runs to be put onto sequence i in this step. We can now assume that we immediately
put di dummy runs onto sequence i and then regard the subsequent distribution as the replacement of dummy
runs by actual runs, each time recording a replacement by subtracting 1 from the count di. Thus, di
indicates the number of dummy runs on sequence i when the source becomes empty.
It is not known which algorithm yields the optimal distribution, but the following has proved to be a very
good method. It is called horizontal distribution (cf. Knuth, Vol 3. p. 270), a term that can be understood by
imagining the runs as being piled up in the form of silos, as shown in Fig. 2.16 for N = 6, level 5 (cf. Table
2.14). In order to reach an equal distribution of remaining dummy runs as quickly as possible, their
replacement by actual runs reduces the size of the piles by picking off dummy runs on horizontal levels
proceeding from left to right. In this way, the runs are distributed onto the sequences as indicated by their
numbers as shown in Fig. 2.16.
(Fig. 2.16: horizontal distribution of runs 1 ... 32 onto the 5 output sequences for N = 6, level 5.)
We are now in a position to describe the algorithm in the form of a procedure called select, which is activated
each time a run has been copied and a new source is selected for the next run. We assume the existence of a
variable j denoting the index of the current destination sequence. ai and di denote the ideal and dummy
distribution numbers for sequence i.
j, level: INTEGER;
a, d: ARRAY N OF INTEGER;
These variables are initialized with the following values:
ai = 1, di = 1 for i = 0 ... N-2
aN-1 = 0, dN-1 = 0 dummy
j = 0, level = 0
Note that select is to compute the next row of Table 2.14, i.e., the values a1(L) ... aN-1(L) each time that the
level is increased. The next goal, i.e., the differences di = ai(L) - ai(L-1) are also computed at that time. The
indicated algorithm relies on the fact that the resulting di decrease with increasing index (descending stair in
Fig. 2.16). Note that the exception is the transition from level 0 to level 1; this algorithm must therefore be
used starting at level 1. Select ends by decrementing dj by 1; this operation stands for the replacement of a
dummy run on sequence j by an actual run.
PROCEDURE select;
VAR i, z: INTEGER;
BEGIN
IF d[j] < d[j+1] THEN INC(j)
ELSE
IF d[j] = 0 THEN
INC(level); z := a[0];
FOR i := 0 TO N-2 DO
d[i] := z + a[i+1] - a[i]; a[i] := z + a[i+1]
END
END ;
j := 0
END ;
DEC(d[j])
END select
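For example, with N = 6 and a = (1, 1, 1, 1, 1, 0), the first raising of the level yields z = 1, the new ideal numbers a = (2, 2, 2, 2, 1, 0), and the new goal d = (1, 1, 1, 1, 0, 0), i.e., exactly the differences between the first two rows of Table 2.14.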
Assuming the availability of a routine to copy a run from the source src with rider R onto fj with rider rj, we
can formulate the initial distribution phase as follows (assuming that the source contains at least one run):
REPEAT select; copyrun
UNTIL R.eof
Here, however, we must pause for a moment to recall the effect encountered in distributing runs in the
previously discussed natural merge algorithm: The fact that two runs consecutively arriving at the same
destination may merge into a single run, causes the assumed numbers of runs to be incorrect. By devising the
sort algorithm such that its correctness does not depend on the number of runs, this side effect can safely be
ignored. In the Polyphase Sort, however, we are particularly concerned about keeping track of the exact
number of runs on each file. Consequently, we cannot afford to overlook the effect of such a coincidental
merge. An additional complication of the distribution algorithm therefore cannot be avoided. It becomes
necessary to retain the keys of the last item of the last run on each sequence. Fortunately, our implementation
of Runs does exactly this. In the case of output sequences, f.first represents the item last written. A next
attempt to describe the distribution algorithm could therefore be
REPEAT select;
IF f[j].first <= f0.first THEN continue old run END ;
copyrun
UNTIL R.eof
The obvious mistake here lies in forgetting that f[j].first has only obtained a value after copying the first run.
A correct solution must therefore first distribute one run onto each of the N-1 destination sequences without
inspection of first. The remaining runs are distributed as follows:
WHILE ~R.eof DO
select;
IF r[j].first <= R.first THEN
copyrun;
IF R.eof THEN INC(d[j]) ELSE copyrun END
ELSE copyrun
END
END
Now we are finally in a position to tackle the main polyphase merge sort algorithm. Its principal structure is
similar to the main part of the N-way merge program: An outer loop whose body merges runs until the
sources are exhausted, an inner loop whose body merges a single run from each source, and an innermost
loop whose body selects the initial key and transmits the involved item to the target file. The principal
differences to balanced merging are the following:
1. Instead of N, there is only one output sequence in each pass.
2. Instead of switching N input and N output sequences after each pass, the sequences are rotated. This is
achieved by using a sequence index map t.
3. The number of input sequences varies from run to run; at the start of each run, it is determined from the
counts d i of dummy runs. If di > 0 for all i, then N-1 dummy runs are pseudo-merged into a single dummy
run by merely incrementing the count dN of the output sequence. Otherwise, one run is merged from all
sources with d i = 0, and di is decremented for all other sequences, indicating that one dummy run was taken
off. We denote the number of input sequences involved in a merge by k.
4. It is impossible to derive termination of a phase from the end-of-file status of the N-1st sequence, because more
merges might be necessary involving dummy runs from that source. Instead, the theoretically necessary
number of runs is determined from the coefficients ai. The coefficients ai were computed during the
distribution phase; they can now be recomputed backward.
The main part of the Polyphase Sort can now be formulated according to these rules, assuming that all N-1
sequences with initial runs are set to be read, and that the tape map is initially set to ti = i.
REPEAT (*merge from t[0] ... t[N-2] to t[N-1]*)
  z := a[N-2]; d[N-1] := 0;
  REPEAT k := 0; (*merge one run*)
    (*determine no. of active input sequences*)
    FOR i := 0 TO N-2 DO
      IF d[i] > 0 THEN DEC(d[i]) ELSE ta[k] := t[i]; INC(k) END
    END ;
    IF k = 0 THEN INC(d[N-1])
    ELSE merge one real run from t[0] ... t[k-1] to t[N-1]
    END ;
    DEC(z)
  UNTIL z = 0;
  Runs.Set(r[t[N-1]], f[t[N-1]]);
  rotate sequences in map t; compute a[i] for next level;
  DEC(level)
UNTIL level = 0
(*sorted output is f[t[0]]*)
The actual merge operation is almost identical with that of the N-way merge sort, the only difference being
that the sequence elimination algorithm is somewhat simpler. The rotation of the sequence index map and the
corresponding counts d i (and the down-level recomputation of the coefficients ai) is straightforward and can
be inspected in detail from Program 2.16, which represents the Polyphase algorithm in its entirety.
PROCEDURE select;
  VAR i, z: INTEGER;
BEGIN
  IF d[j] < d[j+1] THEN INC(j)
  ELSE
    IF d[j] = 0 THEN
      INC(level); z := a[0];
      FOR i := 0 TO N-2 DO
        d[i] := z + a[i+1] - a[i]; a[i] := z + a[i+1]
      END
    END ;
    j := 0
  END ;
  DEC(d[j])
END select;
    IF k = 0 THEN INC(d[N-1])
    ELSE (*merge one real run from t[0] ... t[k-1] to t[N-1]*)
      REPEAT mx := 0; min := r[ta[0]].first; i := 1;
        WHILE i < k DO
          x := r[ta[i]].first;
          IF x < min THEN min := x; mx := i END ;
          INC(i)
        END ;
        Runs.copy(r[ta[mx]], r[t[N-1]]);
        IF r[ta[mx]].eor THEN ta[mx] := ta[k-1]; DEC(k) END
      UNTIL k = 0
    END ;
    DEC(z)
  UNTIL z = 0;
  Runs.Set(r[t[N-1]], f[t[N-1]]); (*rotate sequences*)
  tn := t[N-1]; dn := d[N-1]; z := a[N-2];
  FOR i := N-1 TO 1 BY -1 DO
    t[i] := t[i-1]; d[i] := d[i-1]; a[i] := a[i-1] - z
  END ;
  t[0] := tn; d[0] := dn; a[0] := z; DEC(level)
UNTIL level = 0;
RETURN f[t[0]]
END Polyphase
would be no problem if in one (or both) of the programs the required procedure were called at a single place only; but instead, it is called at several places in both programs.
This situation is best reflected by the use of a coroutine (thread); it is suitable in those cases in which several
processes coexist. The most typical representative is the combination of a process that produces a stream of
information in distinct entities and a process that consumes this stream. This producer-consumer relationship
can be expressed in terms of two coroutines; one of them may well be the main program itself. The coroutine
may be considered as a process that contains one or more breakpoints. If such a breakpoint is encountered,
then control returns to the program that had activated the coroutine. Whenever the coroutine is called again,
execution is resumed at that breakpoint. In our example, we might consider Polyphase Sort as the main
program, calling upon copyrun, which is formulated as a coroutine. It consists of the main body of Program
2.17 in which each call of switch now represents a breakpoint. The test for end of file would then have to be
replaced systematically by a test of whether or not the coroutine had reached its endpoint.
PROCEDURE Distribute(src: Files.File): Files.File;
  CONST M = 16; mh = M DIV 2; (*heap size*)
  VAR L, R: INTEGER;
    x: INTEGER;
    dest: Files.File;
    r, w: Files.Rider;
    H: ARRAY M OF INTEGER; (*heap*)

  PROCEDURE sift(L, R: INTEGER);
    VAR i, j, x: INTEGER;
  BEGIN i := L; j := 2*L+1; x := H[i];
    IF (j < R) & (H[j] > H[j+1]) THEN INC(j) END ;
    WHILE (j <= R) & (x > H[j]) DO
      H[i] := H[j]; i := j; j := 2*j+1;
      IF (j < R) & (H[j] > H[j+1]) THEN INC(j) END
    END ;
    H[i] := x
  END sift;

BEGIN Files.Set(r, src, 0); dest := Files.New(""); Files.Set(w, dest, 0);
  (*step 1: fill upper half of heap*)
  L := M;
  REPEAT DEC(L); Files.ReadInt(r, H[L]) UNTIL L = mh;
  (*step 2: fill lower half of heap*)
  REPEAT DEC(L); Files.ReadInt(r, H[L]); sift(L, M-1) UNTIL L = 0;
  (*step 3: pass elements through heap*)
  L := M; Files.ReadInt(r, x);
  WHILE ~r.eof DO
    Files.WriteInt(w, H[0]);
    IF H[0] <= x THEN
      (*x belongs to same run*) H[0] := x; sift(0, L-1)
    ELSE (*start next run*)
      DEC(L); H[0] := H[L]; sift(0, L-1); H[L] := x;
      IF L < mh THEN sift(L, M-1) END ;
      IF L = 0 THEN (*heap full; start new run*) L := M END
    END ;
    Files.ReadInt(r, x)
  END ;
  (*step 4: flush lower half of heap*)
  R := M;
  REPEAT DEC(L); Files.WriteInt(w, H[0]);
    H[0] := H[L]; sift(0, L-1); DEC(R); H[L] := H[R];
    IF L < mh THEN sift(L, R-1) END
  UNTIL L = 0;
Analysis and conclusions. What performance can be expected from a Polyphase Sort with initial distribution
of runs by a Heapsort? We first discuss the improvement to be expected by introducing the heap.
In a sequence with randomly distributed keys the expected average length of runs is 2. What is this length
after the sequence has been funnelled through a heap of size m ? One is inclined to say m, but, fortunately,
the actual result of probabilistic analysis is much better, namely 2m (see Knuth, vol. 3, p. 254). Therefore, the
expected improvement factor is m.
An estimate of the performance of Polyphase can be gathered from Table 2.15, indicating the maximal
number of initial runs that can be sorted in a given number of partial passes (levels) with a given number N of
sequences. As an example, with six sequences and a heap of size m = 100, a file with up to 165’680’100
initial runs can be sorted within 10 partial passes. This is a remarkable performance.
Reviewing again the combination of Polyphase Sort and Heapsort, one cannot help but be amazed at the
complexity of this program. After all, it performs the same easily defined task of permuting a set of items as
is done by any of the short programs based on the straight array sorting principles. The moral of the entire
chapter may be taken as an exhibition of the following:
1. The intimate connection between algorithm and underlying data structure, and in particular the influence of
the latter on the former.
2. The sophistication by which the performance of a program can be improved, even when the available
structure for its data (sequence instead of array) is rather ill-suited for the task.
Exercises
2.1. Which of the algorithms given for straight insertion, binary insertion, straight selection, bubble sort,
shakersort, shellsort, heapsort, quicksort, and straight mergesort are stable sorting methods?
2.2. Would the algorithm for binary insertion still work correctly if L < R were replaced by L ≤ R in the
while clause? Would it still be correct if the statement L := m+1 were simplified to L := m? If not,
find sets of values a1 ... an upon which the altered program would fail.
2.3. Program and measure the execution time of the three straight sorting methods on your computer, and
find coefficients by which the factors C and M have to be multiplied to yield real time estimates.
2.4. Specify invariants for the repetitions in the three straight sorting algorithms.
2.5. Consider the following "obvious" version of the procedure Partition and find sets of values a0 ... an-1
for which this version fails:
i := 0; j := n-1; x := a[n DIV 2];
REPEAT
WHILE a[i] < x DO i := i+1 END ;
WHILE x < a[j] DO j := j-1 END ;
w := a[i]; a[i] := a[j]; a[j] := w
UNTIL i > j
2.6. Write a procedure that combines the Quicksort and Bubblesort algorithms as follows: Use Quicksort
to obtain (unsorted) partitions of length m (1 < m < n); then use Bubblesort to complete the task. Note
that the latter may sweep over the entire array of n elements, hence, minimizing the bookkeeping
effort. Find that value of m which minimizes the total sort time. Note: Clearly, the optimum value of
m will be quite small. It may therefore pay to let the Bubblesort sweep exactly m-1 times over the
array instead of including a last pass establishing the fact that no further exchange is necessary.
2.7. Perform the same experiment as in Exercise 2.6 with a straight selection sort instead of a Bubblesort.
Naturally, the selection sort cannot sweep over the whole array; therefore, the expected amount of
index handling is somewhat greater.
2.8. Write a recursive Quicksort algorithm according to the recipe that the sorting of the shorter partition
should be tackled before the sorting of the longer partition. Perform the former task by an iterative
statement, the latter by a recursive call. (Hence, your sort procedure will contain only one recursive
call instead of two.)
2.9. Find a permutation of the keys 1, 2, ... , n for which Quicksort displays its worst (best) behavior (n =
5, 6, 8).
2.10. Construct a natural merge program similar to the straight merge, operating on a double length array
from both ends inward; compare its performance with that of the procedure given in this text.
2.11. Note that in a (two-way) natural merge we do not blindly select the least value among the available
keys. Instead, upon encountering the end of a run, the tail of the other run is simply copied onto the
output sequence. For example, merging of
2, 4, 5, 1, 2, ...
3, 6, 8, 9, 7, ...
results in the sequence
2, 3, 4, 5, 6, 8, 9, 1, 2, ...
instead of
2, 3, 4, 5, 1, 2, 6, 8, 9, ...
which seems to be better ordered. What is the reason for this strategy?
2.12. A sorting method similar to the Polyphase is the so-called Cascade merge sort [2.1 and 2.9]. It uses a
different merge pattern. Given, for instance, six sequences T1, ... ,T6, the cascade merge, also starting
with a "perfect distribution" of runs on T1 ... T5, performs a five-way merge from T1 ... T5 onto T6
until T5 is empty, then (without involving T6) a four-way merge onto T5, then a three-way merge
onto T4, a two-way merge onto T3, and finally a copy operation from T1 onto T2. The next pass
operates in the same way starting with a five-way merge to T1, and so on. Although this scheme
seems to be inferior to Polyphase because at times it chooses to leave some sequences idle, and
because it involves simple copy operations, it surprisingly is superior to Polyphase for (very) large
files and for six or more sequences. Write a well structured program for the Cascade merge principle.
References
2-1. B. K. Betz and Carter. Proc. ACM National Conf. 14, (1959), Paper 14.
2-2. R.W. Floyd. Treesort (Algorithms 113 and 243). Comm. ACM, 5, No. 8, (1962), 434, and Comm.
ACM, 7, No. 12 (1964), 701.
2-3. R.L. Gilstad. Polyphase Merge Sorting - An Advanced Technique. Proc. AFIPS Eastern Jt. Comp.
Conf., 18, (1960), 143-48.
2-4. C.A.R. Hoare. Proof of a Program: FIND. Comm. ACM,13, No. 1, (1970), 39-45.
2-5. -------- Proof of a Recursive Program: Quicksort. Comp. J., 14, No. 4 (1971), 391-95.
2-6. -------- Quicksort. Comp. J., 5, No. 1 (1962), 10-15.
2-7. D.E. Knuth. The Art of Computer Programming. Vol. 3 (Reading, Mass.: Addison-Wesley, 1973).
2-8. -------- The Art of Computer Programming. Vol 3, pp. 86-95.
2-9. -------- The Art of Computer Programming. Vol 3, p. 289.
2-10. H. Lorin. A Guided Bibliography to Sorting. IBM Syst.J., 10, No. 3 (1971), 244-54.
2-11. D.L. Shell. A Highspeed Sorting Procedure. Comm. ACM, 2, No. 7 (1959), 30-32.
2-12. R.C. Singleton. An Efficient Algorithm for Sorting with Minimal Storage (Algorithm 347). Comm.
ACM, 12, No. 3 (1969), 185.
2-13. M. H. Van Emden. Increasing the Efficiency of Quicksort (Algorithm 402). Comm. ACM, 13, No. 9
(1970), 563-66, 693.
2-14. J.W.J. Williams. Heapsort (Algorithm 232). Comm. ACM, 7, No. 6 (1964), 347-48.
3 Recursive Algorithms
3.1. Introduction
An object is said to be recursive if it partially consists of, or is defined in terms of, itself. Recursion is
encountered not only in mathematics, but also in daily life. Who has never seen an advertising picture
which contains itself?
Recursion is a particularly powerful technique in mathematical definitions. A few familiar examples are
those of natural numbers, tree structures, and of certain functions:
1. Natural numbers:
(a) 0 is a natural number.
(b) the successor of a natural number is a natural number.
2. Tree structures
(a) O is a tree (called the empty tree).
(b) If t1 and t2 are trees, then the structures consisting of a node with two descendants t1 and t2 is
also a (binary) tree.
3. The factorial function f(n):
f(0) = 1
f(n) = n × f(n - 1) for n > 0
The power of recursion evidently lies in the possibility of defining an infinite set of objects by a finite
statement. In the same manner, an infinite number of computations can be described by a finite recursive
program, even if this program contains no explicit repetitions. Recursive algorithms, however, are
primarily appropriate when the problem to be solved, or the function to be computed, or the data
structure to be processed are already defined in recursive terms. In general, a recursive program P can be
expressed as a composition P of a sequence of statements S (not containing P) and P itself.
P ≡ P[S, P]
The necessary and sufficient tool for expressing programs recursively is the procedure or subroutine, for
it allows a statement to be given a name by which this statement may be invoked. If a procedure P
contains an explicit reference to itself, then it is said to be directly recursive; if P contains a reference to
another procedure Q, which contains a (direct or indirect) reference to P, then P is said to be indirectly
recursive. The use of recursion may therefore not be immediately apparent from the program text.
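Indirect recursion can be illustrated by a pair of procedures of the following kind (a sketch; the forward declaration PROCEDURE^ is the device of classic Oberon that lets P refer to Q before Q's body has been declared):
PROCEDURE^ Q(n: INTEGER); (*forward declaration*)

PROCEDURE P(n: INTEGER);
BEGIN
  IF n > 0 THEN Q(n-1) END (*P refers to Q*)
END P;

PROCEDURE Q(n: INTEGER);
BEGIN
  IF n > 0 THEN P(n-1) END (*Q refers back to P; hence P and Q are indirectly recursive*)
END Q;
The decreasing parameter n guarantees termination of both procedures.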
It is common to associate a set of local objects with a procedure, i.e., a set of variables, constants, types,
and procedures which are defined locally to this procedure and have no existence or meaning outside this
procedure. Each time such a procedure is activated recursively, a new set of local, bound variables is
created. Although they have the same names as their corresponding elements in the set local to the
previous instance of the procedure, their values are distinct, and any conflict in naming is avoided by the
rules of scope of identifiers: the identifiers always refer to the most recently created set of variables. The
same rule holds for procedure parameters, which by definition are bound to the procedure.
Like repetitive statements, recursive procedures introduce the possibility of non-terminating
computations, and thereby also the necessity of considering the problem of termination. A fundamental
requirement is evidently that the recursive calls of P are subjected to a condition B, which at some time
becomes false. The scheme for recursive algorithms may therefore be expressed more precisely by either
one of the following forms:
P ≡ IF B THEN P[S, P] END
P ≡ P[S, IF B THEN P END]
For repetitions, the basic technique of demonstrating termination consists of
1. defining a function f(x) (x shall be the set of variables), such that f(x) < 0 implies the terminating
condition (of the while or repeat clause), and
2. proving that f(x) decreases during each repetition step. f is called the variant of the repetition.
In the same manner, termination of a recursion can be proved by showing that each execution of P
decreases some f(x), and that f(x) < 0 implies ~B. A particularly evident way to ensure termination is to
associate a (value) parameter, say n, with P, and to recursively call P with n-1 as parameter value.
Substituting n > 0 for B then guarantees termination. This may be expressed by the following program
schemata:
P(n) ≡ IF n > 0 THEN P[S, P(n-1)] END
P(n) ≡ P[S, IF n > 0 THEN P(n-1) END]
In practical applications it is mandatory to show that the ultimate depth of recursion is not only finite, but
that it is actually quite small. The reason is that upon each recursive activation of a procedure P some
amount of storage is required to accommodate its variables. In addition to these local variables, the
current state of the computation must be recorded in order to be retrievable when the new activation of P
is terminated and the old one has to be resumed. We have already encountered this situation in the
development of the procedure Quicksort in Chap. 2. It was discovered that by naively composing the
program out of a statement that splits the n items into two partitions and of two recursive calls sorting the
two partitions, the depth of recursion may in the worst case approach n. By a clever reassessment of the
situation, it was possible to limit the depth to log(n). The difference between n and log(n) is sufficient to
convert a case highly inappropriate for recursion into one in which recursion is perfectly practical.
PROCEDURE P;
BEGIN
IF I < n THEN I := I + 1; F := I*F; P END
END P
A more frequently used, but essentially equivalent, form is the one given below. P is replaced by a
function procedure F, i.e., a procedure with which a resulting value is explicitly associated, and which
therefore may be used directly as a constituent of expressions. The variable F therefore becomes
superfluous; and the role of I is taken over by the explicit procedure parameter.
PROCEDURE F(I: INTEGER): INTEGER;
BEGIN
IF I > 0 THEN RETURN I * F(I - 1) ELSE RETURN 1 END
END F
It now is plain that in this example recursion can be replaced quite simply by iteration. This is expressed
by the program
I := 0; F := 1;
WHILE I < n DO I := I + 1; F := I*F END
In general, programs corresponding to the original schemata should be transcribed into one according to
the following schema:
P ≡ [x := x0; WHILE B DO S END]
There also exist more complicated recursive composition schemes that can and should be translated into
an iterative form. An example is the computation of the Fibonacci numbers which are defined by the
recurrence relation
fibn+1 = fibn + fibn-1 for n > 0
and fib1 = 1, fib0 = 0. A direct, naive transcription leads to the recursive program
PROCEDURE Fib(n: INTEGER): INTEGER;
BEGIN
IF n = 0 THEN RETURN 0
ELSIF n = 1 THEN RETURN 1
ELSE RETURN Fib(n-1) + Fib(n-2)
END
END Fib
Computation of fib n by a call Fib(n) causes this function procedure to be activated recursively. How
often? We notice that each call with n > 1 leads to 2 further calls, i.e., the total number of calls grows
exponentially (see Fig. 3.2). Such a program is clearly impractical.
But fortunately the Fibonacci numbers can be computed by an iterative scheme that avoids the
recomputation of the same values by use of auxiliary variables such that x = fib i and y = fib i-1.
i := 1; x := 1; y := 0;
WHILE i < n DO z := x; x := x + y; y := z; i := i + 1 END
Note: The assignments to x, y, z may be expressed by two assignments only without a need for the
auxiliary variable z: x := x + y; y := x - y.
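Packaged as a function procedure, the iterative scheme might read as follows (a sketch; the name FibIter and the explicit treatment of n = 0 are additions not found in the text):
PROCEDURE FibIter(n: INTEGER): INTEGER;
  VAR i, x, y, z: INTEGER;
BEGIN
  IF n = 0 THEN RETURN 0 END ;
  i := 1; x := 1; y := 0; (*x = fib i, y = fib i-1*)
  WHILE i < n DO z := x; x := x + y; y := z; INC(i) END ;
  RETURN x
END FibIter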
(Fig. 3.2: the call tree of Fib(5). H1, H2, H3: the Hilbert curves of orders 1, 2, and 3.)
by arrows pointing in the corresponding direction. Then the following recursion scheme emerges (see
Fig. 3.3).
A: D ←A ↓ A → B
B: C ↑ B →B ↓ A
C: B →C ↑ C← D
D: A ↓ D ←D ↑ C
Let us assume that for drawing line segments we have at our disposal a procedure line which moves a
drawing pen in a given direction by a given distance. For our convenience, we assume that the direction
be indicated by an integer parameter i as 45×i degrees. If the length of the unit line is denoted by u, the
procedure corresponding to the scheme A is readily expressed by using recursive activations of
analogously designed procedures B and D and of itself.
PROCEDURE A(i: INTEGER);
BEGIN
IF i > 0 THEN
D(i-1); line(4, u);
A(i-1); line(6, u);
A(i-1); line(0, u);
B(i-1)
END
END A
This procedure is initiated by the main program once for every Hilbert curve to be superimposed. The
main procedure determines the initial point of the curve, i.e., the initial coordinates of the pen denoted by
Px and Py, and the unit increment u. The square in which the curves are drawn is placed into the middle
of the page with given width and height. These parameters as well as the drawing procedure line are
taken from a module Draw. Note that this module retains the current position of the pen. Procedure
Hilbert draws the n Hilbert curves H1 ... Hn.
DEFINITION Draw;
CONST width = 1024; height = 800;
PROCEDURE Clear; (*clear drawing plane*)
PROCEDURE SetPen(x, y: INTEGER);
PROCEDURE line(dir, len: INTEGER);
(*draw line of length len in direction of dir*45 degrees; move pen accordingly*)
END Draw.
Procedure Hilbert uses the four auxiliary procedures A, B, C, and D recursively:
PROCEDURE A(i: INTEGER);
BEGIN
IF i > 0 THEN
D(i-1); Draw.line(4,u); A(i-1); Draw.line(6,u); A(i-1); Draw.line(0,u); B(i-1)
END
END A;
PROCEDURE B(i: INTEGER);
BEGIN
IF i > 0 THEN
C(i-1); Draw.line(2,u); B(i-1); Draw.line(0,u); B(i-1); Draw.line(6,u); A(i-1)
END
END B;
PROCEDURE C(i: INTEGER);
BEGIN
IF i > 0 THEN
B(i-1); Draw.line(0,u); C(i-1); Draw.line(2,u); C(i-1); Draw.line(4,u); D(i-1)
END
END C;
PROCEDURE D(i: INTEGER);
BEGIN
IF i > 0 THEN
A(i-1); Draw.line(6,u); D(i-1); Draw.line(4,u); D(i-1); Draw.line(2,u); C(i-1)
END
END D;
PROCEDURE Hilbert(n: INTEGER);
  CONST SquareSize = 512;
  VAR i, x0, y0, u: INTEGER;
BEGIN Draw.Clear;
  x0 := Draw.width DIV 2; y0 := Draw.height DIV 2;
  u := SquareSize; i := 0;
  REPEAT INC(i); u := u DIV 2;
    x0 := x0 + (u DIV 2); y0 := y0 + (u DIV 2); Draw.SetPen(x0, y0); A(i)
  UNTIL i = n
END Hilbert.
A similar but slightly more complex and aesthetically more sophisticated example is shown in Fig. 3.7.
This pattern is again obtained by superimposing several curves, two of which are shown in Fig. 3.6. S i is
called the Sierpinski curve of order i. What is its recursion scheme? One is tempted to single out the leaf
S 1 as a basic building block, possibly with one edge left off. But this does not lead to a solution. The
principal difference between Sierpinski curves and Hilbert curves is that Sierpinski curves are closed
(without crossovers). This implies that the basic recursion scheme must be an open curve and that the
four parts are connected by links not belonging to the recursion pattern itself. Indeed, these links consist
of the four straight lines in the outermost four corners, drawn with thicker lines in Fig. 3.6. They may be
regarded as belonging to a non-empty initial curve S0, which is a square standing on one corner. Now the
recursion schema is readily established. The four constituent patterns are again denoted by A, B, C, and
D, and the connecting lines are drawn explicitly. Notice that the four recursion patterns are indeed
identical except for 90 degree rotations.
The base pattern of the Sierpinski curves is
S: A ↘ B ↙ C ↖ D ↗
and the recursion patterns are (horizontal and vertical arrows denote lines of double length):
A: A ↘ B → D ↗ A
B: B ↙ C ↓ A ↘ B
C: C ↖ D ← B ↙ C
D: D ↗ A ↑ C ↖ D
If we use the same primitives for drawing as in the Hilbert curve example, the above recursion scheme is
transformed without difficulties into a (directly and indirectly) recursive algorithm.
PROCEDURE A(k: INTEGER);
BEGIN
IF k > 0 THEN
A(k-1); line(7, h); B(k-1); line(0, 2*h);
D(k-1); line(1, h); A(k-1)
END
END A
This procedure is derived from the first line of the recursion scheme. Procedures corresponding to the
patterns B, C, and D are derived analogously. The main program is composed according to the base
pattern. Its task is to set the initial values for the drawing coordinates and to determine the unit line
length h according to the size of the plane. The result of executing this program with n = 4 is shown in
Fig. 3.7.
The elegance of the use of recursion in these examples is obvious and convincing. The correctness of the
programs can readily be deduced from their structure and composition patterns. Moreover, the use of an
explicit (decreasing) level parameter guarantees termination since the depth of recursion cannot become
greater than n. In contrast to this recursive formulation, equivalent programs that avoid the explicit use of
recursion are extremely cumbersome and obscure. Trying to understand the programs shown in [3-3]
should easily convince the reader of this.
PROCEDURE A(k: INTEGER);
BEGIN
IF k > 0 THEN
A(k-1); Draw.line(7, h); B(k-1); Draw.line(0, 2*h); D(k-1); Draw.line(1, h); A(k-1)
END
END A;
PROCEDURE B(k: INTEGER);
BEGIN
IF k > 0 THEN
B(k-1); Draw.line(5, h); C(k-1); Draw.line(6, 2*h); A(k-1); Draw.line(7, h); B(k-1)
END
END B;
PROCEDURE C(k: INTEGER);
BEGIN
IF k > 0 THEN
C(k-1); Draw.line(3, h); D(k-1); Draw.line(4, 2*h); B(k-1); Draw.line(5, h); C(k-1)
END
END C;
PROCEDURE D(k: INTEGER);
BEGIN
IF k > 0 THEN
D(k-1); Draw.line(1, h); A(k-1); Draw.line(2, 2*h); C(k-1); Draw.line(3, h); D(k-1)
END
END D;
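The main procedure, composed according to the base pattern, might be sketched as follows by analogy with procedure Hilbert. This is a sketch only: it assumes the same module Draw, and the constant SquareSize as well as the choice of the initial pen position are illustrative.
PROCEDURE Sierpinski(n: INTEGER);
  CONST SquareSize = 512;
  VAR i, h, x0, y0: INTEGER;
BEGIN Draw.Clear;
  h := SquareSize DIV 4;
  x0 := Draw.width DIV 2; y0 := Draw.height DIV 2 + h;
  i := 0;
  REPEAT INC(i);
    x0 := x0 - h; h := h DIV 2; y0 := y0 + h; Draw.SetPen(x0, y0);
    A(i); Draw.line(7, h); (*link the four patterns by the diagonal corner lines*)
    B(i); Draw.line(5, h);
    C(i); Draw.line(3, h);
    D(i); Draw.line(1, h)
  UNTIL i = n
END Sierpinski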
 1 16  7 26 11 14
34 25 12 15  6 27
17  2 33  8 13 10
32 35 24 21 28  5
23 18  3 30  9 20
36 31 22 19  4 29
Table 3.1 Three Knights' Tours.
What abstractions can now be made from this example? Which pattern does it exhibit that is typical for this kind of problem-solving algorithm? What does it teach us? The characteristic feature is that steps toward the total solution are attempted and recorded, but may later be taken back and erased from the record when it is discovered that a step cannot possibly lead to the total solution, i.e., that it leads into a dead-end street. This action is called backtracking. The general pattern below is derived from
TryNextMove, assuming that the number of potential candidates in each step is finite.
PROCEDURE Try;
BEGIN initialize selection of candidates;
  REPEAT select next;
    IF acceptable THEN
      record it;
      IF solution incomplete THEN Try;
        IF not successful THEN cancel recording END
      END
    END
  UNTIL successful OR no more candidates
END Try
Actual programs may, of course, assume various derivative forms of this schema. A frequently
encountered pattern uses an explicit level parameter indicating the depth of recursion and allowing for a
simple termination condition. If, moreover, at each step the number of candidates to be investigated is
fixed, say m, then the derived schema shown below applies; it is to be invoked by the statement Try(1).
PROCEDURE Try(i: INTEGER);
  VAR k: INTEGER;
BEGIN k := 0;
  REPEAT k := k+1; select k-th candidate;
    IF acceptable THEN
      record it;
      IF i < n THEN Try(i+1);
        IF not successful THEN cancel recording END
      END
    END
  UNTIL successful OR (k = m)
END Try
The remainder of this chapter is devoted to the treatment of three more examples. They display various
incarnations of the abstract schema and are included as further illustrations of the appropriate use of
recursion.
PROCEDURE Try(i: INTEGER);
  VAR k: INTEGER;
BEGIN
  FOR k := 0 TO n-1 DO
    select k-th candidate;
    IF acceptable THEN
      record it;
      IF i < n THEN Try(i+1) ELSE output solution END ;
      cancel recording
    END
  END
END Try
Note that because of the simplification of the termination condition of the selection process to the single
term k = n, the repeat statement is appropriately replaced by a for statement. It comes as a surprise that
the search for all possible solutions is realized by a simpler program than the search for a single solution.
The extended algorithm to determine all 92 solutions of the eight queens problem is shown in Program
3.5. Actually, there are only 12 significantly differing solutions; our program does not recognize
symmetries. The 12 solutions generated first are listed in Table 3.2. The numbers n to the right indicate
the frequency of execution of the test for safe fields. Its average over all 92 solutions is 161.
PROCEDURE write;
VAR k: INTEGER;
BEGIN (*global writer W*)
FOR k := 0 TO 7 DO Texts.WriteInt(W, x[k], 4) END ;
Texts.WriteLn(W)
END write;
PROCEDURE Try(i: INTEGER);
VAR j: INTEGER;
BEGIN
FOR j := 0 TO 7 DO
IF a[j] & b[i+j] & c[i-j+7] THEN
x[i] := j;
a[j] := FALSE; b[i+j] := FALSE; c[i-j+7] := FALSE;
IF i < 7 THEN Try(i+1) ELSE write END ;
a[j] := TRUE; b[i+j] := TRUE; c[i-j+7] := TRUE
END
END
END Try;
PROCEDURE AllQueens;
VAR i: INTEGER;
BEGIN
FOR i := 0 TO 7 DO a[i] := TRUE END ;
FOR i := 0 TO 14 DO b[i] := TRUE; c[i] := TRUE END ;
Try(0)
END AllQueens.
x1 x2 x3 x4 x5 x6 x7 x8 n
1 5 8 6 3 7 2 4 876
1 6 8 3 7 4 2 5 264
1 7 4 6 8 2 5 3 200
1 7 5 8 2 4 6 3 136
2 4 6 8 3 1 7 5 504
2 5 7 1 3 8 6 4 400
2 5 7 4 1 8 6 3 072
2 6 1 7 4 8 3 5 280
2 6 8 3 1 4 7 5 240
2 7 3 6 8 5 1 4 264
2 7 5 8 1 4 6 3 160
2 8 6 1 3 5 7 4 336
Table 3.2 Twelve Solutions to the Eight Queens Problem.
Table 3.3 Sample Input Data for wmr and mwr
The result is represented by an array of women x, such that xm denotes the partner of man m. In order to
maintain symmetry between men and women, an additional array y is introduced, such that yw denotes
the partner of woman w.
VAR x, y: ARRAY n OF INTEGER
Actually, y is redundant, since it represents information that is already present through the existence of x.
In fact, the relations
x[y[w]] = w, y[x[m]] = m
hold for all m and w who are married. Thus, the value yw could be determined by a simple search of x;
the array y, however, clearly improves the efficiency of the algorithm. The information represented by x
and y is needed to determine stability of a proposed set of marriages. Since this set is constructed
stepwise by marrying individuals and testing stability after each proposed marriage, x and y are needed
even before all their components are defined. In order to keep track of defined components, we may
introduce Boolean arrays
singlem, singlew: ARRAY n OF BOOLEAN
with the meaning that singlemm implies that xm is defined, and singleww implies that yw is defined. An
inspection of the proposed algorithm, however, quickly reveals that the marital status of a man is
determined by the value m through the relation
~singlem[k] = k < m
This suggests that the array singlem be omitted; accordingly, we will simplify the name singlew to
single. These conventions lead to the refinement shown by the following procedure Try. The predicate
acceptable can be refined into the conjunction of single and stable, where stable is a function to be still
further elaborated.
PROCEDURE Try(m: man);
  VAR r: rank; w: woman;
BEGIN
  FOR r := 0 TO n-1 DO
    w := wmr[m,r];
    IF single[w] & stable THEN
      x[m] := w; y[w] := m; single[w] := FALSE;
      IF m < n-1 THEN Try(m+1) ELSE record set END ;
      single[w] := TRUE
    END
  END
END Try
At this point, the strong similarity of this solution with procedure AllQueens is still noticeable. The
crucial task is now the refinement of the algorithm to determine stability. Unfortunately, it is not possible
to represent stability by such a simple expression as the safety of a queen's position. The first detail that
should be kept in mind is that stability follows by definition from comparisons of ranks. The ranks of
men or women, however, are nowhere explicitly available in our collection of data established so far.
Surely, the rank of woman w in the mind of man m can be computed, but only by a costly search of w in
wmrm. Since the computation of stability is a very frequent operation, it is advisable to make this
information more directly accessible. To this end, we introduce the two matrices
rmw: ARRAY man, woman OF rank;
rwm: ARRAY woman, man OF rank
such that rmwm,w denotes woman w's rank in the preference list of man m, and rwmw,m denotes the rank
of man m in the list of w. It is plain that the values of these auxiliary arrays are constant and can initially
be determined from the values of wmr and mwr.
The process of determining the predicate stable now proceeds strictly according to its original definition.
Recall that we are trying the feasibility of marrying m and w, where w = wmrm,r , i.e., w is man m's r th
choice. Being optimistic, we first presume that stability still prevails, and then we set out to find possible
sources of trouble. Where could they be hidden? There are two symmetrical possibilities:
1. There might be a woman pw, preferred to w by m, who herself prefers m over her husband.
2. There might be a man pm, preferred to m by w, who himself prefers w over his wife.
Pursuing trouble source 1, we compare ranks rwmpw,m and rwmpw,y[pw] for all women preferred to w by m,
i.e. for all pw = wmrm,i such that i < r. We happen to know that all these candidate women are already
married because, were anyone of them still single, m would have picked her beforehand. The described
process can be formulated by a simple linear search; s denotes stability.
s := TRUE; i := 0;
WHILE (i < r) & s DO
pw := wmr[m,i]; INC(i);
IF ~single[pw] THEN s := rwm[pw,m] > rwm[pw, y[pw]] END
END
Hunting for trouble source 2, we must investigate all candidates pm who are preferred by w to their
current assignation m, i.e., all preferred men pm = mwrw,i such that i < rwmw,m. In analogy to tracing
trouble source 1, comparison between ranks rmwpm,w and rmwpm,x[pm] is necessary. We must be careful,
however, to omit comparisons involving xpm where pm is still single. The necessary safeguard is a test
pm < m, since we know that all men preceding m are already married.
The complete algorithm is shown in module Marriage. Table 3.4 specifies the nine computed stable
solutions from input data wmr and mwr given in Table 3.3.
PROCEDURE write; (*global writer W*)
VAR m: man; rm, rw: INTEGER;
BEGIN rm := 0; rw := 0;
FOR m := 0 TO n-1 DO
Texts.WriteInt(W, x[m], 4);
rm := rmw[m, x[m]] + rm; rw := rwm[x[m], m] + rw
END ;
Texts.WriteInt(W, rm, 8); Texts.WriteInt(W, rw, 4); Texts.WriteLn(W)
END write;
PROCEDURE stable(m, w, r: INTEGER): BOOLEAN;
VAR pm, pw, rank, i, lim: INTEGER;
S: BOOLEAN;
BEGIN S := TRUE; i := 0;
WHILE (i < r) & S DO
pw := wmr[m,i]; INC(i);
IF ~single[pw] THEN S := rwm[pw,m] > rwm[pw, y[pw]] END
END ;
i := 0; lim := rwm[w,m];
WHILE (i < lim) & S DO
pm := mwr[w,i]; INC(i);
IF pm < m THEN S := rmw[pm,w] > rmw[pm, x[pm]] END
END ;
RETURN S
END stable;
PROCEDURE Try(m: INTEGER);
VAR w, r: INTEGER;
BEGIN
FOR r := 0 TO n-1 DO w := wmr[m,r];
IF single[w] & stable(m,w,r) THEN
x[m] := w; y[w] := m; single[w] := FALSE;
IF m < n-1 THEN Try(m+1) ELSE write END ;
single[w] := TRUE
END
END
END Try;
PROCEDURE FindStableMarriages(VAR S: Texts.Scanner);
VAR m, w, r: INTEGER;
BEGIN
FOR m := 0 TO n-1 DO
FOR r := 0 TO n-1 DO Texts.Scan(S); wmr[m,r] := S.i; rmw[m, wmr[m,r]] := r END
END ;
FOR w := 0 TO n-1 DO
single[w] := TRUE;
FOR r := 0 TO n-1 DO Texts.Scan(S); mwr[w,r] := S.i; rwm[w, mwr[w,r]] := r END
END ;
Try(0)
END FindStableMarriages
END Marriage
This algorithm is based on a straightforward backtracking scheme. Its efficiency primarily depends on
the sophistication of the solution tree pruning scheme. A somewhat faster, but more complex and less
transparent algorithm has been presented by McVitie and Wilson [3-1 and 3-2], who also have extended
it to the case of sets (of men and women) of unequal size.
Algorithms of the kind of the last two examples, which generate all possible solutions to a problem
(given certain constraints), are often used to select one or several of the solutions that are optimal in
some sense. In the present example, one might, for instance, be interested in the solution that on the
average best satisfies the men, or the women, or everyone.
Notice that Table 3.4 indicates the sums of the ranks of all women in the preference lists of their
husbands, and the sums of the ranks of all men in the preference lists of their wives. These are the values
rm = S m: 1 ≤ m ≤ n: rmwm,x[m]
rw = S m: 1 ≤ m ≤ n: rwmx[m],m
x1 x2 x3 x4 x5 x6 x7 x8 rm rw c
1 7 4 3 8 1 5 2 6 16 32 21
2 2 4 3 8 1 5 7 6 22 27 449
3 2 4 3 1 7 5 8 6 31 20 59
4 6 4 3 8 1 5 7 2 26 22 62
5 6 4 3 1 7 5 8 2 35 15 47
6 6 3 4 8 1 5 7 2 29 20 143
7 6 3 4 1 7 5 8 2 38 13 47
8 3 6 4 8 1 5 7 2 34 18 758
9 3 6 4 1 7 5 8 2 43 11 34
c = number of evaluations of stability.
Solution 1 = male optimal solution; solution 9 = female optimal solution.
Table 3.4 Result of the Stable Marriage Problem.
The solution with the least value rm is called the male-optimal stable solution; the one with the smallest
rw is the female-optimal stable solution. It lies in the nature of the chosen search strategy that good
solutions from the men's point of view are generated first and the good solutions from the women's
perspective appear toward the end. In this sense, the algorithm is biased toward the male population. This
can quickly be changed by systematically interchanging the role of men and women, i.e., by merely
interchanging mwr with wmr and interchanging rmw with rwm.
We refrain from extending this program further and leave the incorporation of a search for an optimal
solution to the next and last example of a backtracking algorithm.
Weight 10 11 12 13 14 15 16 17 18 19 Tot
Value 18 20 17 19 25 21 27 23 25 24
10 * 18
20 * 27
30 * * 52
40 * * * 70
50 * * * * 84
60 * * * * * 99
70 * * * * * 115
80 * * * * * * 130
90 * * * * * * 139
100 * * * * * * * 157
110 * * * * * * * * 172
120 * * * * * * * * 183
Table 3.5 Sample Output from Optimal Selection Program.
This backtracking scheme with a limitation factor curtailing the growth of the potential search tree is also known as a branch-and-bound algorithm.
Exercises
3.1 (Towers of Hanoi). Given are three rods and n disks of different sizes. The disks can be stacked up
on the rods, thereby forming towers. Let the n disks initially be placed on rod A in the order of
decreasing size, as shown in Fig. 3.10 for n = 3. The task is to move the n disks from rod A to rod C
such that they are ordered in the original way. This has to be achieved under the constraints that
1. In each step exactly one disk is moved from one rod to another rod.
2. A disk may never be placed on top of a smaller disk.
3. Rod B may be used as an auxiliary store.
Find an algorithm that performs this task. Note that a tower may conveniently be considered as
consisting of the single disk at the top, and the tower consisting of the remaining disks. Describe the
algorithm as a recursive program.
References
3-1. D.G. McVitie and L.B. Wilson. The Stable Marriage Problem. Comm. ACM, 14, No. 7 (1971), 486-
92.
3-2. -------. Stable Marriage Assignment for Unequal Sets. Bit, 10, (1970), 295-309.
3-3. N. Wirth. Space Filling Curves, or How to Waste Time on a Plotter. Software - Practice and Experience, 1,
No. 4 (1971), 403-40.
3-4. N. Wirth. Program Development by Stepwise Refinement. Comm. ACM, 14, No. 4 (1971), 221-27.
Hence, every variable of type term consists of two components, namely, the tagfield t and, if t is true, the
field id, or of the field subex otherwise. Consider now, for example, the following four expressions:
1. x + y
2. x - (y * z)
3. (x + y) * (z - w)
4. (x/(y + z)) * w
These expressions may be visualized by the patterns in Fig. 4.1, which exhibit their nested, recursive
structure, and they determine the layout or mapping of these expressions onto a store.
(Fig. 4.1: the nested storage patterns of the four expressions. Fig. 4.2: a pedigree containing the persons Ted, Fred, Adam, Mary, and Eva.)
4.2. Pointers
The characteristic property of recursive structures which clearly distinguishes them from the fundamental
structures (arrays, records, sets) is their ability to vary in size. Hence, it is impossible to assign a fixed
amount of storage to a recursively defined structure, and as a consequence a compiler cannot associate
specific addresses to the components of such variables. The technique most commonly used to master this
problem involves dynamic allocation of storage, i.e., allocation of store to individual components at the
time when they come into existence during program execution, instead of at translation time. The
compiler then allocates a fixed amount of storage to hold the address of the dynamically allocated
component instead of the component itself. For instance, the pedigree illustrated in Fig. 4.2 would be
represented by individual -- quite possibly noncontiguous -- records, one for each person. These persons
are then linked by their addresses assigned to the respective father and mother fields. Graphically, this
situation is best expressed by the use of arrows or pointers (Fig. 4.3).
(Fig. 4.3: the pedigree represented with pointers. A pointer variable p is declared as p: POINTER TO T; the record it references is denoted p↑.)
In Chap. 3, we have seen that iteration is a special case of recursion, and that a call of a recursive
procedure P defined according to the following schema:
PROCEDURE P;
BEGIN
IF B THEN P0; P END
END
where P0 is a statement not involving P, is equivalent to and replaceable by the iterative statement
WHILE B DO P0 END
The analogies outlined in Table 4.1 reveal that a similar relationship holds between recursive data types
and the sequence. In fact, a recursive type defined according to the schema
TYPE T = RECORD
IF b: BOOLEAN THEN t0: T0; t: T END
END
where T0 is a type not involving T, is equivalent to and replaceable by a sequence of T0s.
The remainder of this chapter is devoted to the generation and manipulation of data structures whose
components are linked by explicit pointers. Structures with specific simple patterns are emphasized in
particular; recipes for handling more complex structures may be derived from those for manipulating
basic formations. These are the linear list or chained sequence -- the simplest case -- and trees. Our
preoccupation with these building blocks of data structuring does not imply that more involved structures
do not occur in practice. In fact, the following story appeared in a Zürich newspaper in July 1922 and is a
proof that irregularity may even occur in cases which usually serve as examples for regular structures,
such as (family) trees. The story tells of a man who laments the misery of his life in the following words:
I married a widow who had a grown-up daughter. My father, who visited us quite often, fell in love with
my step-daughter and married her. Hence, my father became my son-in-law, and my step-daughter
became my mother. Some months later, my wife gave birth to a son, who became the brother-in-law of my
father as well as my uncle. The wife of my father, that is my stepdaughter, also had a son. Thereby, I got a
brother and at the same time a grandson. My wife is my grandmother, since she is my mother's mother.
Hence, I am my wife's husband and at the same time her step-grandson; in other words, I am my own
grandfather.
WHILE p # NIL DO P(p); p := p.next END
It follows from the definitions of the while statement and of the linking structure that P is applied to all
elements of the list and to no other ones.
A very frequent operation performed is list searching for an element with a given key x. Unlike for
arrays, the search must here be purely sequential. The search terminates either if an element is found or if
the end of the list is reached. This is reflected by a logical conjunction consisting of two terms. Again, we
assume that the head of the list is designated by a pointer p.
WHILE (p # NIL) & (p.key # x) DO p := p.next END
p = NIL implies that p^ does not exist, and hence that the expression p.key # x is undefined. The order of
the two terms is therefore essential.
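The list fragments in this section presuppose an element type along the following lines (a sketch; the field names key, count, and next are those actually used by the procedures):
TYPE Word = POINTER TO WordDesc;
  WordDesc = RECORD
    key, count: INTEGER; (*search key and frequency count*)
    next: Word (*link to the successor element*)
  END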
this is to introduce a dummy element at the list head. The initializing statement root := NIL is accordingly
replaced by
NEW(root); root.next := NIL
Referring to Fig. 4.10, we determine the condition under which the scan continues to proceed to the next
element; it consists of two factors, namely,
(w1 # NIL) & (w1.key < x)
The resulting search procedure is:
PROCEDURE search(x: INTEGER; VAR root: Word);
VAR w1, w2, w3: Word;
BEGIN (*w2 # NIL*)
w2 := root; w1 := w2.next;
WHILE (w1 # NIL) & (w1.key < x) DO
w2 := w1; w1 := w2.next
END ;
(* (w1 = NIL) OR (w1.key >= x) *)
IF (w1 = NIL) OR (w1.key > x) THEN (*new entry*)
NEW(w3); w2.next := w3;
w3.key := x; w3.count := 1; w3.next := w1
ELSE INC(w1.count)
END
END search
In order to speed up the search, the continuation condition of the while statement can once again be
simplified by using a sentinel. This requires the initial presence of a dummy header as well as a sentinel at
the tail.
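Such a procedure might be sketched as follows. The sketch assumes a global variable sentinel of type Word and the initialization NEW(root); NEW(sentinel); root.next := sentinel, none of which appears in the text above.
PROCEDURE search(x: INTEGER; VAR root: Word);
  VAR w1, w2, w3: Word;
BEGIN
  sentinel.key := x; (*the sentinel at the tail bounds the scan*)
  w2 := root; w1 := w2.next; (*root designates the dummy header*)
  WHILE w1.key < x DO w2 := w1; w1 := w2.next END ;
  IF (w1 # sentinel) & (w1.key = x) THEN INC(w1.count)
  ELSE (*insert a new entry between w2 and w1*)
    NEW(w3); w3.key := x; w3.count := 1;
    w3.next := w1; w2.next := w3
  END
END search
Because sentinel.key = x, no test against NIL is needed in the while clause; a single comparison per element remains.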
It is now high time to ask what gain can be expected from ordered list search. Remembering that the
additional complexity incurred is small, one should not expect an overwhelming improvement.
Assume that all words in the text occur with equal frequency. In this case the gain through lexicographical
ordering is indeed also nil, once all words are listed, because the position of a word does not matter if
only the total of all access steps is significant and if all words have the same frequency of occurrence.
However, a gain is obtained whenever a new word is to be inserted. Instead of first scanning the entire
list, on the average only half the list is scanned. Hence, ordered list insertion pays off only if a
concordance is to be generated with many distinct words compared to their frequency of occurrence. The
preceding examples are therefore suitable primarily as programming exercises rather than for practical
applications.
The arrangement of data in a linked list is recommended when the number of elements is relatively small
(< 50), varies, and, moreover, when no information is given about their frequencies of access. A typical
example is the symbol table in compilers of programming languages. Each declaration causes the addition
of a new symbol, and upon exit from its scope of validity, it is deleted from the list. The use of simple
linked lists is appropriate for applications with relatively short programs. Even in this case a considerable
improvement in access method can be achieved by a very simple technique which is mentioned here again
primarily because it constitutes a pretty example for demonstrating the flexibilities of the linked list
structure.
A characteristic property of programs is that occurrences of the same identifier are very often clustered,
that is, one occurrence is often followed by one or more reoccurrences of the same word. This information
is an invitation to reorganize the list after each access by moving the word that was found to the top of the
list, thereby minimizing the length of the search path the next time it is sought. This method of access is
called list search with reordering, or -- somewhat pompously -- self-organizing list search. In presenting
the corresponding algorithm in the form of a procedure, we take advantage of our experience made so far
and introduce a sentinel right from the start. In fact, a sentinel not only speeds up the search, but in this
case it also simplifies the program. The list must initially not be empty, but contains the sentinel element
already. The initialization statements are
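A minimal form of these statements, assuming a global variable sentinel of the list element type, is
NEW(sentinel); root := sentinel (*the list initially consists of the sentinel alone*)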
END
END
END search
The improvement in this search method strongly depends on the degree of clustering in the input data. For
a given factor of clustering, the improvement will be more pronounced for large lists. To provide an idea
of how much gain can be expected, an empirical measurement was made by applying the above cross-
reference program to a short and a relatively long text and then comparing the methods of linear list
ordering and of list reorganization. The measured data are condensed into Table 4.2. Unfortunately, the
improvement is greatest when a different data organization is needed anyway. We will return to this
example in Sect. 4.4.
Test 1 Test 2
Number of distinct keys 53 582
Number of occurrences of keys 315 14341
Time for search with ordering 6207 3200622
Time for search with reordering 4529 681584
Improvement factor 1.37 4.70
Table 4.2 Comparison of List Search Methods.
7 9 1 2 4 6 3 5 8 10
Fig. 4.14. Linear arrangement of the partially ordered set of Fig. 4.13.
How do we proceed to find one of the possible linear orderings? The recipe is quite simple. We start by
choosing any item that is not preceded by another item (there must be at least one; otherwise a loop would
exist). This object is placed at the head of the resulting list and removed from the set S. The remaining set
is still partially ordered, and so the same algorithm can be applied again until the set is empty.
In order to describe this algorithm more rigorously, we must settle on a data structure and representation
of S and its ordering. The choice of this representation is determined by the operations to be performed,
particularly the operation of selecting elements with zero predecessors. Every item should therefore be
represented by three characteristics: its identification key, its set of successors, and a count of its
predecessors. Since the number n of elements in S is not given a priori, the set is conveniently organized
as a linked list. Consequently, an additional entry in the description of each item contains the link to the
next item in the list. We will assume that the keys are integers (but not necessarily the consecutive
integers from 1 to n). Analogously, the set of each item's successors is conveniently represented as a
linked list. Each element of the successor list is described by an identification and a link to the next item
on this list. If we call the descriptors of the main list, in which each item of S occurs exactly once, leaders,
and the descriptors of elements on the successor chains trailers, we obtain the following declarations of
data types:
TYPE Leader = POINTER TO LeaderDesc;
Trailer = POINTER TO TrailerDesc;
LeaderDesc = RECORD key, count: INTEGER;
trail: Trailer; next: Leader
END;
TrailerDesc = RECORD id: Leader; next: Trailer
END
Assume that the set S and its ordering relations are initially represented as a sequence of pairs of keys in
the input file. The input data for the example in Fig. 4.13 are shown below, in which the symbols 〈 are
added for the sake of clarity, symbolizing partial order:
1 〈 2   2 〈 4   4 〈 6   2 〈 10   4 〈 8   6 〈 3   1 〈 3
3 〈 5   5 〈 8   7 〈 5   7 〈 9   9 〈 4   9 〈 10
The first part of the topological sort program must read the input and transform the data into a list
structure. This is performed by successively reading a pair of keys x and y (x 〈 y). Let us denote the
pointers to their representations on the linked list of leaders by p and q. These records must be located by
a list search and, if not yet present, be inserted in the list. This task is performed by a function procedure
called find. Subsequently, a new entry is added in the list of trailers of x, along with an identification of y;
the count of predecessors of y is incremented by 1. This algorithm is called input phase. Figure 4.15
illustrates the data structure generated during processing the given input data. The function find(w) yields
the pointer to the list element with key w.
In the following piece of program we make use of text scanning, a feature of the Oberon system’s text concept. Instead of considering a text (file) as a sequence of characters, a text is considered as a sequence of tokens, which are identifiers, numbers, strings, and special characters (such as +, *, <, etc.). The
procedure Texts.Scan(S) scans the text, reading the next token. The scanner S plays the role of a text rider.
(*input phase*)
NEW(head); tail := head; z := 0; Texts.Scan(S);
WHILE S.class = Texts.Int DO
x := S.i; Texts.Scan(S); y := S.i; p := find(x); q := find(y);
NEW(t); t.id := q; t.next := p.trail;
p.trail := t; INC(q.count); Texts.Scan(S)
END
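The function find can be sketched as a linear scan of the leader list with the tail element acting as sentinel; this is a sketch rather than the program's own formulation, and it relies on the globals head, tail, and the counter z initialized above.
PROCEDURE find(w: INTEGER): Leader;
  VAR h: Leader;
BEGIN h := head; tail.key := w; (*sentinel*)
  WHILE h.key # w DO h := h.next END ;
  IF h = tail THEN (*w is new: the old tail becomes its leader, and a fresh tail is appended*)
    NEW(tail); INC(z);
    h.count := 0; h.trail := NIL; h.next := tail
  END ;
  RETURN h
END find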
(Fig. 4.15: the list structure generated from the given input; leader keys 1 2 4 6 10 8 3 5 7 9 with predecessor counts 0 1 2 1 2 2 2 2 0 1.)
A tree expressed by nested parentheses: (A(B(D(I),E(J,K,L)),C(F(O),G(M,N),H(P))))
If an element has no descendants, it is called a terminal node or a leaf; and an element that is not terminal
is an interior node. The number of (direct) descendants of an interior node is called its degree. The
maximum degree over all nodes is the degree of the tree. The number of branches or edges that have to be
traversed in order to proceed from the root to a node x is called the path length of x. The root has path
length 0, its direct descendants have path length 1, etc. In general, a node at level i has path length i. The
path length of a tree is defined as the sum of the path lengths of all its components. It is also called its
internal path length. The internal path length of the tree shown in Fig. 4.17, for instance, is 36. Evidently,
the average path length is
P int = ( Si: 1 ≤ i ≤ n: ni×i) / n
where n i is the number of nodes at level i. In order to define what is called the external path length, we
extend the tree by a special node wherever a subtree was missing in the original tree. In doing so, we
assume that all nodes are to have the same degree, namely the degree of the tree. Extending the tree in this
way therefore amounts to filling up empty branches, whereby the special nodes, of course, have no further
descendants. The tree of Fig. 4.17 extended with special nodes is shown in Fig. 4.19 in which the special
nodes are represented by squares. The external path length is now defined as the sum of the path lengths
over all special nodes. If the number of special nodes at level i is mi, then the average external path length
is
P ext = (Si: 1 ≤ i ≤ m: mi×i) / m
Of particular importance are the ordered trees of degree 2. They are called binary trees. We define an
ordered binary tree as a finite set of elements (nodes) which either is empty or consists of a root (node)
with two disjoint binary trees called the left and the right subtree of the root. In the following sections we
shall exclusively deal with binary trees, and we therefore shall use the word tree to mean ordered binary
tree. Trees with degree greater than 2 are called multiway trees and are discussed in Sect. 7 of this
chapter.
Familiar examples of binary trees are the family tree (pedigree) with a person's father and mother as
descendants (!), the history of a tennis tournament with each game being a node denoted by its winner and
the two previous games of the combatants as its descendants, or an arithmetic expression with dyadic
operators, with each operator denoting a branch node with its operands as subtrees (see Fig. 4.20).
Before investigating how trees might be used advantageously and how to perform operations on trees, we
give an example of how a tree may be constructed by a program. Assume that a tree is to be generated
containing nodes with the values of the nodes being n numbers read from an input file. In order to make
the problem more challenging, let the task be the construction of a tree with n nodes and minimal height.
In order to obtain a minimal height for a given number of nodes, one has to allocate the maximum
possible number of nodes of all levels except the lowest one. This can clearly be achieved by distributing
incoming nodes equally to the left and right at each node. This implies that we structure the tree for given
n as shown in Fig. 4.22, for n = 1, ... , 7.
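A recursive formulation of this construction might be sketched as follows. The declaration of Node (chosen to be consistent with the procedures locate below) and the passing of a file rider as a parameter are assumptions made to keep the sketch self-contained.
TYPE Node = POINTER TO RECORD
    key: INTEGER;
    left, right: Node
  END ;

PROCEDURE tree(n: INTEGER; VAR r: Files.Rider): Node;
  (*construct a perfectly balanced tree from the next n numbers read via r*)
  VAR new: Node; x, nl, nr: INTEGER;
BEGIN
  IF n = 0 THEN new := NIL
  ELSE
    nl := n DIV 2; nr := n - nl - 1; (*distribute the remaining n-1 nodes as evenly as possible*)
    Files.ReadInt(r, x); NEW(new); new.key := x;
    new.left := tree(nl, r); new.right := tree(nr, r)
  END ;
  RETURN new
END tree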
4.4.2. Basic Operations on Binary Trees
There are many tasks that may have to be performed on a tree structure; a common one is that of executing a given operation P on each element of the tree. P is then understood to be a parameter of the more general task of visiting all nodes or, as it is usually called, of tree traversal. If we consider the task as a single sequential process, then the individual nodes are visited in some specific order and may be considered as being laid out in a linear arrangement. In fact, the description of many algorithms is considerably facilitated if we can talk about processing the next element in the tree based on an underlying order. There are three principal orderings that emerge naturally from the structure of trees. Like the tree
structure itself, they are conveniently expressed in recursive terms. Referring to the binary tree in Fig.
4.24 in which R denotes the root and A and B denote the left and right subtrees, the three orderings are
1. Preorder: R, A, B (visit root before the subtrees)
2. Inorder: A, R, B
3. Postorder: A, B, R (visit root after the subtrees)
Binary trees are frequently used to represent a set of data whose elements are to be retrievable through a
unique key. If a tree is organized in such a way that for each node ti, all keys in the left subtree of ti are
less than the key of ti, and those in the right subtree are greater than the key of ti, then this tree is called a
search tree. In a search tree it is possible to locate an arbitrary key by starting at the root and proceeding
along a search path switching to a node's left or right subtree by a decision based on inspection of that
node's key only. As we have seen, n elements may be organized in a binary tree of a height as little as log
n. Therefore, a search among n items may be performed with as few as log n comparisons if the tree is
perfectly balanced. Obviously, the tree is a much more suitable form for organizing such a set of data than
the linear list used in the previous section. As this search follows a single path from the root to the desired
node, it can readily be programmed by iteration:
PROCEDURE locate(x: INTEGER; t: Node): Node;
BEGIN
WHILE (t # NIL) & (t.key # x) DO
IF t.key < x THEN t := t.right ELSE t := t.left END
END ;
RETURN t
END locate
The function locate(x, t) yields the value NIL, if no key with value x is found in the tree with root t. As in
the case of searching a list, the complexity of the termination condition suggests that a better solution may
exist, namely the use of a sentinel. This technique is equally applicable in the case of a tree. The use of
pointers makes it possible for all branches of the tree to terminate with the same sentinel. The resulting
structure is no longer a tree, but rather a tree with all leaves tied down by strings to a single anchor point
(Fig. 4.25). The sentinel may be considered as a common, shared representative of all external nodes by
which the original tree was extended (see Fig. 4.19):
PROCEDURE locate(x: INTEGER; t: Ptr): Ptr;
BEGIN s.key := x; (*sentinel*)
WHILE t.key # x DO
IF t.key < x THEN t := t.right ELSE t := t.left END
END ;
RETURN t
END locate
Note that in this case locate(x, t) yields the value s instead of NIL, i.e., the pointer to the sentinel, if no
key with value x is found in the tree with root t.
[Fig. 4.25: search tree with all leaves tied down to the sentinel]
programmed to be almost as efficient as the best array sorting methods known. But a few precautions
must be taken. After encountering a match, the new element must also be inserted. If the case x = p.key is
handled identically to the case x > p.key, then the algorithm represents a stable sorting method, i.e., items
with identical keys turn up in the same sequence when scanning the tree in normal order as when they
were inserted.
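A minimal sketch of such a tree search with insertion, recording repeated occurrences of a key in a count field (the node fields key, count, left, and right are assumed), is the following; for the stable sorting variant mentioned above, the case x = p.key would instead be treated like x > p.key, so that equal keys are inserted into the right subtree.

PROCEDURE search(x: INTEGER; VAR p: Node);
BEGIN
  IF p = NIL THEN (*x not in tree; insert new node*)
    NEW(p); p.key := x; p.count := 1; p.left := NIL; p.right := NIL
  ELSIF x < p.key THEN search(x, p.left)
  ELSIF x > p.key THEN search(x, p.right)
  ELSE INC(p.count)
  END
END search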
In general, there are better ways to sort, but in applications in which searching and sorting are both
needed, the tree search and insertion algorithm is strongly recommended. It is, in fact, very often applied
in compilers and in data banks to organize the objects to be stored and retrieved. An appropriate example
is the construction of a cross-reference index for a given text, an example that we had already used to
illustrate list generation.
Our task is to construct a program that (while reading a text and printing it after supplying consecutive
line numbers) collects all words of this text, thereby retaining the numbers of the lines in which each word
occurred. When this scan is terminated, a table is to be generated containing all collected words in
alphabetical order with lists of their occurrences.
Obviously, the search tree (also called a lexicographic tree) is a most suitable candidate for representing
the words encountered in the text. Each node now not only contains a word as key value, but it is also the
head of a list of line numbers. We shall call each recording of an occurrence an item. Hence, we
encounter both trees and linear lists in this example. The program consists of two main parts (see Program
4.5), namely, the scanning phase and the table printing phase. The latter is a straightforward application of
a tree traversal routine in which visiting each node implies the printing of the key value (word) and the
scanning of its associated list of line numbers (items). The following are further clarifications regarding
the Cross Reference Generator listed below. Table 4.4 shows the results of processing the text of the
preceding procedure search.
1. A word is considered as any sequence of letters and digits starting with a letter.
2. Since words may be of widely different lengths, the actual characters are stored in an array buffer,
and the tree nodes contain the index of the key's first character.
3. It is desirable that the line numbers be printed in ascending order in the cross-reference index.
Therefore, the item lists must be generated in the same order as they are scanned upon printing. This
requirement suggests the use of two pointers in each word node, one referring to the first, and one
referring to the last item on the list. We assume the existence of global writer W, and a variable
representing the current line number in the text.
CONST WordLen = 32;
TYPE Word = ARRAY WordLen OF CHAR;
Item = POINTER TO RECORD
lno: INTEGER; next: Item
END ;
Node = POINTER TO RECORD
key: Word;
first, last: Item; (*list*)
left, right: Node (*tree*)
END ;
VAR line: INTEGER;
PROCEDURE search(VAR w: Node; VAR a: Word);
  VAR q: Item;
BEGIN
  IF w = NIL THEN (*word not in tree; new entry, insert*)
    NEW(w); NEW(q); q.lno := line;
    COPY(a, w.key); w.first := q; w.last := q; w.left := NIL; w.right := NIL
  ELSIF w.key < a THEN search(w.right, a)
  ELSIF w.key > a THEN search(w.left, a)
  ELSE (*word already in tree; append a further item to its list of occurrences*)
    NEW(q); q.lno := line; w.last.next := q; w.last := q
  END
END search
BEGIN       1
ELSE        5
ELSIF       3  4
END         7  8
IF          2
INC         4
INTEGER     0
NEW         5
Node        0
PROCEDURE   0
THEN        2  3  4
VAR         0
count       4  6
insert      5
key         2  3  6
left        2  6
p           0  2  2  3  3  4  4  5  6  6  6  6
right       3  6
s           4  6  6
search      0  2  3
x           0  2  2  3  3  6
Table 4.4 Cross-reference index generated for the text of a search procedure.
END
END delete
The auxiliary, recursive procedure del is activated in case 3 only. It descends along the rightmost branch
of the left subtree of the element q^ to be deleted, and then it replaces the relevant information (key and
count) in q^ by the corresponding values of the rightmost component r^ of that left subtree, whereafter r^
may be disposed.
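Following this description, the complete deletion procedure might be sketched as follows (the node fields key, count, left, and right are assumed; the removed node is simply left to the garbage collector discussed below):

PROCEDURE delete(x: INTEGER; VAR p: Node);
  VAR q: Node;

  PROCEDURE del(VAR r: Node); (*descend along the rightmost branch of the left subtree*)
  BEGIN
    IF r.right # NIL THEN del(r.right)
    ELSE (*r^ is the rightmost node: move its contents into q^ and unlink it*)
      q.key := r.key; q.count := r.count;
      q := r; r := r.left
    END
  END del;

BEGIN
  IF p = NIL THEN (*key not in tree*)
  ELSIF x < p.key THEN delete(x, p.left)
  ELSIF x > p.key THEN delete(x, p.right)
  ELSE (*delete p^*) q := p;
    IF q.right = NIL THEN p := q.left
    ELSIF q.left = NIL THEN p := q.right
    ELSE del(q.left) (*case 3: neither descendant is NIL*)
    END
  END
END delete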
We note that we do not mention a procedure that would be the inverse of NEW, indicating that storage is
no longer needed and therefore disposable and reusable. It is generally assumed that a computer system
recognizes a disposable variable through the circumstance that no other variables are pointing to it, and
that it therefore can no longer be referenced. Such a system is called a garbage collector. It is not a
feature of a programming language, but rather of its implementations.
In order to illustrate the functioning of procedure delete, we refer to Fig. 4.28. Consider the tree (a); then
delete successively the nodes with keys 13, 15, 5, 10. The resulting trees are shown in Fig. 4.28 (b-e).
[Fig. 4.28: (a) the initial tree; (b)-(e) the trees after successive deletion of the keys 13, 15, 5, and 10]
Given are n distinct keys with values 1, 2, ... , n. Assume that they arrive in a random order. The
probability of the first key -- which notably becomes the root node -- having the value i is 1/n. Its left
subtree will eventually contain i-1 nodes, and its right subtree n-i nodes (see Fig. 4.29). Let the average
path length in the left subtree be denoted by ai-1, and that in the right subtree by an-i, again assuming
that all possible permutations of the remaining n-1 keys are equally likely. The average path length in a
tree with n nodes is the sum of the products of each node's level and its probability of access. If all nodes
are assumed to be searched with equal likelihood, then
an = (Si: 1 ≤ i ≤ n : pi) / n
where pi is the path length of node i.
[Fig. 4.29: the root with value i, its left subtree with i-1 nodes, and its right subtree with n-i nodes]
Averaging over all possible root values i leads to a recurrence for an whose solution involves the harmonic numbers Hn = 1 + 1/2 + ... + 1/n. With the approximation
Hn = g + ln n + 1/(12n²) + ...
where g is Euler's constant, we derive, for large n, the approximate value
an = 2 * (ln n + g - 1)
Since the average path length in the perfectly balanced tree is approximately
an' = log n - 1
we obtain, neglecting the constant terms which become insignificant for large n,
lim (an/an') = 2 * ln(n) / log(n) = 2 ln(2) = 1.386...
What does this result teach us? It tells us that by taking the pains of always constructing a perfectly
balanced tree instead of the random tree, we could -- always provided that all keys are looked up with
equal probability -- expect an average improvement in the search path length of at most 39%. Emphasis is
to be put on the word average, for the improvement may of course be very much greater in the unhappy
case in which the generated tree had completely degenerated into a list, which, however, is very unlikely
to occur. In this connection it is noteworthy that the expected average path length of the random tree
grows also strictly logarithmically with the number of its nodes, even though the worst case path length
grows linearly.
The figure of 39% imposes a limit on the amount of additional effort that may be spent profitably on any
kind of reorganization of the tree's structure upon insertion of keys. Naturally, the ratio between the
frequencies of access (retrieval) of nodes (information) and of insertion (update) significantly influences
the payoff limits of any such undertaking. The higher this ratio, the higher is the payoff of a
reorganization procedure. The 39% figure is low enough that in most applications improvements of the
straight tree insertion algorithm do not pay off unless the number of nodes and the access vs. insertion
ratio are large.
The optimum is of course reached if the tree is perfectly balanced for n = 2^k - 1. But what is the structure
of the worst AVL-balanced tree? In order to find the maximum height h of all balanced trees with n
nodes, let us consider a fixed height h and try to construct the balanced tree with the minimum number of
nodes. This strategy is recommended because, as in the case of the minimal height, the value can be
attained only for certain specific values of n. Let this tree of height h be denoted by Th. Clearly, T0 is the
empty tree, and T1 is the tree with a single node. In order to construct the tree Th for h > 1, we will
provide the root with two subtrees which again have a minimal number of nodes. Hence, the subtrees are
also T's. Evidently, one subtree must have height h-1, and the other is then allowed to have a height of
one less, i.e. h-2. Figure 4.30 shows the trees with height 2, 3, and 4. Since their composition principle
very strongly resembles that of Fibonacci numbers, they are called Fibonacci-trees (see Fig. 4.30). They
are defined as follows:
1. The empty tree is the Fibonacci-tree of height 0.
2. A single node is the Fibonacci-tree of height 1.
3. If Th-1 and Th-2 are Fibonacci-trees of heights h-1 and h-2, then
Th = <Th-1, x, Th-2> is a Fibonacci-tree.
4. No other trees are Fibonacci-trees.
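Counting nodes makes the resemblance explicit: if Nh denotes the number of nodes of Th, then N0 = 0, N1 = 1, and Nh = Nh-1 + 1 + Nh-2, so the node counts 0, 1, 2, 4, 7, 12, 20, ... grow essentially like the Fibonacci numbers, and the worst-case height of a balanced tree therefore grows at most logarithmically with its number of nodes.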
[Fig. 4.30: the Fibonacci-trees T2, T3, and T4 of heights 2, 3, and 4]
[Fig. 4.32: rebalancing after insertion, case 1 (single rotation) and case 2 (double rotation)]
by an insertion, resulting in an excessively high overhead. The other extreme is to attribute an explicitly
stored balance factor to every node. The definition of the type Node is then extended into
TYPE Node = POINTER TO RECORD
key, count, bal: INTEGER; (*bal = -1, 0, +1*)
left, right: Node
END
We shall subsequently interpret a node's balance factor as the height of its right subtree minus the height
of its left subtree, and we shall base the resulting algorithm on this node type. The process of node
insertion consists essentially of the following three consecutive parts:
1. Follow the search path until it is verified that the key is not already in the tree.
2. Insert the new node and determine the resulting balance factor.
3. Retreat along the search path and check the balance factor at each node. Rebalance if necessary.
Although this method involves some redundant checking (once balance is established, it need not be
checked on that node's ancestors), we shall first adhere to this evidently correct schema because it can be
implemented through a pure extension of the already established search and insertion procedures. This
procedure describes the search operation needed at each single node, and because of its recursive
formulation it can easily accommodate an additional operation on the way back along the search path. At
each step, information must be passed as to whether or not the height of the subtree (in which the
insertion had been performed) had increased. We therefore extend the procedure's parameter list by the
Boolean h with the meaning the subtree height has increased. Clearly, h must denote a variable parameter
since it is used to transmit a result.
Assume now that the process is returning to a node p^ from the left branch (see Fig. 4.32), with the
indication that it has increased its height. We now must distinguish between the three conditions
involving the subtree heights prior to insertion:
1. hL < hR, p.bal = +1, the previous imbalance at p has been equilibrated.
2. hL = hR, p.bal = 0, the weight is now slanted to the left.
3. hL > hR, p.bal = -1, rebalancing is necessary.
In the third case, inspection of the balance factor of the root of the left subtree (say, p1.bal) determines
whether case 1 or case 2 of Fig. 4.32 is present. If that node has also a higher left than right subtree, then
we have to deal with case 1, otherwise with case 2. (Convince yourself that a left subtree with a balance
factor equal to 0 at its root cannot occur in this case.) The rebalancing operations necessary are entirely
expressed as sequences of pointer reassignments. In fact, pointers are cyclically exchanged, resulting in
either a single or a double rotation of the two or three nodes involved. In addition to pointer rotation, the
respective node balance factors have to be updated. The details are shown in the search, insertion, and
rebalancing procedures.
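As an illustration of this schema, the left-branch part of such an insertion procedure might be sketched as follows; the right branch is its mirror image (cases RR and RL) and is elided here, and the names p1 and p2 for the subtree roots involved in the rotations are choices of this sketch only.

PROCEDURE search(x: INTEGER; VAR p: Node; VAR h: BOOLEAN);
  VAR p1, p2: Node; (*h = "the subtree with root p has increased in height"*)
BEGIN
  IF p = NIL THEN (*key not present; insert new node*)
    NEW(p); p.key := x; p.count := 1; p.left := NIL; p.right := NIL; p.bal := 0; h := TRUE
  ELSIF x < p.key THEN
    search(x, p.left, h);
    IF h THEN (*left branch has grown*)
      IF p.bal = 1 THEN p.bal := 0; h := FALSE (*condition 1: previous imbalance equilibrated*)
      ELSIF p.bal = 0 THEN p.bal := -1 (*condition 2: weight now slanted to the left*)
      ELSE (*p.bal = -1; condition 3: rebalance*)
        p1 := p.left;
        IF p1.bal = -1 THEN (*case 1: single LL rotation*)
          p.left := p1.right; p1.right := p;
          p.bal := 0; p := p1
        ELSE (*case 2: double LR rotation*)
          p2 := p1.right;
          p1.right := p2.left; p2.left := p1;
          p.left := p2.right; p2.right := p;
          IF p2.bal = -1 THEN p.bal := 1 ELSE p.bal := 0 END ;
          IF p2.bal = 1 THEN p1.bal := -1 ELSE p1.bal := 0 END ;
          p := p2
        END ;
        p.bal := 0; h := FALSE
      END
    END
  ELSIF x > p.key THEN
    search(x, p.right, h)
    (*here the mirror image of the left-branch treatment above must follow; omitted in this sketch*)
  ELSE INC(p.count)
  END
END search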
[Figure: a weighted search tree with keys k1 ... k4, key frequencies a1 ... a4, and frequencies b0 ... b4 of search arguments falling between the keys]
wii = bi   (0 ≤ i ≤ n)
wij = wi,j-1 + aj + bj   (0 ≤ i < j ≤ n)
pii = wii   (0 ≤ i ≤ n)
pij = wij + MIN k: i < k ≤ j : (pi,k-1 + pk,j)   (0 ≤ i < j ≤ n)
The last equation follows immediately from the definitions of P and of optimality. Since there are
approximately n²/2 values pij, and because its definition calls for a choice among all cases such that 0 < j-i
≤ n, the minimization operation will involve approximately n³/6 operations. Knuth pointed out that a
factor n can be saved by the following consideration, which alone makes this algorithm usable for
practical purposes.
Let rij be a value of k which achieves the minimum for pij. It is possible to limit the search for rij to a
much smaller interval, i.e., to reduce the number of the j-i evaluation steps. The key is the observation that
if we have found the root rij of the optimal subtree Tij, then neither extending the tree by adding a node at
the right, nor shrinking the tree by removing its leftmost node, can ever cause the optimal root to move to
the left. This is expressed by the relation
ri,j-1 ≤ rij ≤ ri+1,j
which limits the search for possible solutions for rij to the range ri,j-1 ... ri+1,j. This results in a total number
of elementary steps in the order of n².
We are now ready to construct the optimization algorithm in detail. We recall the following definitions,
which are based on optimal trees Tij consisting of nodes with keys ki+1 ... kj.
1. ai: the frequency of a search for ki.
2. bj: the frequency of a search argument x between kj and kj+1.
3. wij: the weight of Tij.
4. pij: the weighted path length of Tij.
5. rij: the index of the root of Tij.
We declare the following arrays:
a: ARRAY n+1 OF INTEGER; (*a[0] not used*)
b: ARRAY n+1 OF INTEGER;
p,w,r: ARRAY n+1, n+1 OF INTEGER;
Assume that the weights wij have been computed from a and b in a straightforward way. Now consider w
as the argument of the procedure OptTree to be developed and consider r as its result, because r describes
the tree structure completely. p may be considered an intermediate result. Starting out by considering the
smallest possible subtrees, namely those consisting of no nodes at all, we proceed to larger and larger
trees. Let us denote the width j-i of the subtree Tij by h. Then we can trivially determine the values pii for
all trees with h = 0 according to the definition of p ij.
FOR i := 0 TO n DO p[i,i] := b[i] END
In the case h = 1 we deal with trees consisting of a single node, which plainly is also the root (see Fig.
4.38).
FOR i := 0 TO n-1 DO
j := i+1; p[i,j] := w[i,j] + p[i,i] + p[j,j]; r[i,j] := j
END
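Accordingly, the general step for the larger subtrees (h = 2, ..., n) restricts the choice of the root index to the interval r[i,j-1] .. r[i+1,j] established above. A sketch of this step, with h, j, k, m, min, and x assumed to be additional local INTEGER variables, might read:

FOR h := 2 TO n DO
  FOR i := 0 TO n-h DO
    j := i+h;
    (*determine the root index m in r[i,j-1] .. r[i+1,j] minimizing p[i,k-1] + p[k,j]*)
    m := r[i,j-1]; min := p[i,m-1] + p[m,j];
    FOR k := m+1 TO r[i+1,j] DO
      x := p[i,k-1] + p[k,j];
      IF x < min THEN m := k; min := x END
    END ;
    p[i,j] := min + w[i,j]; r[i,j] := m
  END
END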
[Fig. 4.38: the optimal tree of a single node kj, with the weights bj-1, aj, and bj]
As an example, let us consider the following input data of a tree with 3 keys:
20 1 Albert 10 2 Ernst 1 5 Peter 1
b0 = 20
a1 = 1   key1 = Albert   b1 = 10
a2 = 2   key2 = Ernst    b2 = 1
a3 = 5   key3 = Peter    b3 = 1
The results of procedure Find are shown in Fig. 4.40 and demonstrate that the structures obtained for the
three cases may differ significantly. The total weight is 40, the pathlength of the balanced tree is 78, and
that of the optimal tree is 66.
Fig. 4.40. The 3 trees generated by the Optimal Tree procedure.
It is evident from this algorithm that the effort to determine the optimal structure is of the order of n2;
also, the amount of required storage is of the order n2. This is unacceptable if n is very large. Algorithms
with greater efficiency are therefore highly desirable. One of them is the algorithm developed by Hu and
Tucker [4-5] which requires only O(n) storage and O(n*log(n)) computations. However, it considers only
the case in which the key frequencies are zero, i.e., where only the unsuccessful search trials are
registered. Another algorithm, also requiring O(n) storage elements and O(n*log(n)) computations was
described by Walker and Gotlieb [4-7]. Instead of trying to find the optimum, this algorithm merely
promises to yield a nearly optimal tree. It can therefore be based on heuristic principles. The basic idea is
the following.
Consider the nodes (genuine and special nodes) being distributed on a linear scale, weighted by their
frequencies (or probabilities) of access. Then find the node which is closest to the center of gravity. This
node is called the centroid, and its index is
((Si: 1 ≤ i ≤ n : i*ai) + (Si: 1 ≤ i ≤ m : i*bi)) / W
rounded to the nearest integer. If all nodes have equal weight, then the root of the desired optimal tree
evidently coincides with the centroid. Otherwise -- so the reasoning goes -- it will in most cases be in the
close neighborhood of the centroid. A limited search is then used to find the local optimum, whereafter
this procedure is applied to the resulting two subtrees. The likelihood of the root lying very close to the
centroid grows with the size n of the tree. As soon as the subtrees have reached a manageable size, their
optimum can be determined by the above exact algorithm.
4.7. B-Trees
So far, we have restricted our discussion to trees in which every node has at most two descendants, i.e., to
binary trees. This is entirely satisfactory if, for instance, we wish to represent family relationships with a
preference to the pedigree view, in which every person is associated with his parents. After all, no one has
more than two parents. But what about someone who prefers the posterity view? He has to cope with the
fact that some people have more than two children, and his trees will contain nodes with many branches.
For lack of a better term, we shall call them multiway trees.
Of course, there is nothing special about such structures, and we have already encountered all the
programming and data definition facilities to cope with such situations. If, for instance, an absolute upper
limit on the number of children is given (which is admittedly a somewhat futuristic assumption), then one
may represent the children as an array component of the record representing a person. If the number of
children varies strongly among different persons, however, this may result in a poor utilization of
available storage. In this case it will be much more appropriate to arrange the offspring as a linear list,
with a pointer to the youngest (or eldest) offspring assigned to the parent. A possible type definition for
this case is the following, and a possible data structure is shown in Fig. 4.43.
TYPE Person = POINTER TO RECORD
name: alfa;
sibling, offspring: Person
END
[Fig. 4.43: a family tree represented by sibling and offspring pointers]
[Figure: a B-tree of order 2 with three levels and root key 25]
search trees, and it determines the method of searching an item with given key. Consider a page of the
form shown in Fig. 4.46 and a given search argument x. Assuming that the page has been moved into the
primary store, we may use conventional search methods among the keys k1 ... km. If m is sufficiently
large, one may use binary search; if it is rather small, an ordinary sequential search will do. (Note that the
time required for a search in main store is probably negligible compared to the time it takes to move the
page from secondary into primary store.) If the search is unsuccessful, we are in one of the following
situations:
1. ki < x < ki+1, for 1 < i < m. The search continues on page pi^.
2. km < x. The search continues on page pm^.
3. x < k1. The search continues on page p0^.
[Fig. 4.46: a B-tree page with m keys: p0 k1 p1 k2 p2 ... pm-1 km pm]
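A retrieval procedure following these three cases might be sketched as follows; it relies on the page representation used by the procedures listed further below (an item count m, a leftmost descendant p0, and an array e of entries with fields key and p), and the name retrieve is an assumption of this sketch.

PROCEDURE retrieve(x: INTEGER; a: Page): BOOLEAN;
  VAR i, L, R: INTEGER;
BEGIN
  IF a = NIL THEN RETURN FALSE (*x is not in the tree*)
  ELSE
    L := 0; R := a.m; (*binary search among the keys of page a*)
    WHILE L < R DO
      i := (L+R) DIV 2;
      IF x <= a.e[i].key THEN R := i ELSE L := i+1 END
    END ;
    IF (R < a.m) & (a.e[R].key = x) THEN RETURN TRUE (*x found on this page*)
    ELSIF R = 0 THEN RETURN retrieve(x, a.p0) (*case 3: x < k1*)
    ELSE RETURN retrieve(x, a.e[R-1].p) (*cases 1 and 2: descend to the right of kR*)
    END
  END
END retrieve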
[Fig. 4.47: insertion of key 22 into a B-tree of order 2, causing a page split (pages A, B, C, D)]
[Fig. 4.48: growth of a B-tree of order 2, snapshots (a)-(f)]
Of course, it may happen that there is no item left to be annexed since Q has already reached its minimal
size n. In this case the total number of items on pages P and Q is 2n-1; we may merge the two pages into
one, adding the middle item from the ancestor page of P and Q, and then entirely dispose of page Q. This
is exactly the inverse process of page splitting. The process may be visualized by considering the deletion
of key 22 in Fig. 4.47. Once again, the removal of the middle key in the ancestor page may cause its size
to drop below the permissible limit n, thereby requiring that further special action (either balancing or
merging) be undertaken at the next level. In the extreme case page merging may propagate all the way up
to the root. If the root is reduced to size 0, it is itself deleted, thereby causing a reduction in the height of
the B-tree. This is, in fact, the only way that a B-tree may shrink in height. Figure 4.49 shows the gradual
decay of the B-tree of Fig. 4.48 upon the sequential deletion of the keys
25 45 24; 38 32; 8 27 46 13 42; 5 22 18 26; 7 35 15;
The semicolons again mark the places where the snapshots are taken, namely where pages are being
eliminated. The similarity of its structure to that of balanced tree deletion is particularly noteworthy.
[Fig. 4.49: decay of the B-tree of Fig. 4.48 by successive deletions, snapshots (a)-(f)]
END
END
END insert;
PROCEDURE underflow(c, a: Page; s: INTEGER; VAR h: BOOLEAN);
(*a = underflowing page, c = ancestor page,
s = index of deleted entry in c*)
VAR b: Page;
i, k: INTEGER;
BEGIN (*h & (a.m = N-1) & (c.e[s-1].p = a) *)
IF s < c.m THEN (*b := page to the right of a*)
b := c.e[s].p; k := (b.m-N+1) DIV 2; (*k = nof items available on page b*)
a.e[N-1] := c.e[s]; a.e[N-1].p := b.p0;
IF k > 0 THEN (*balance by moving k-1 items from b to a*)
FOR i := 0 TO k-2 DO a.e[i+N] := b.e[i] END ;
c.e[s] := b.e[k-1]; b.p0 := c.e[s].p;
c.e[s].p := b; DEC(b.m, k);
FOR i := 0 TO b.m-1 DO b.e[i] := b.e[i+k] END ;
a.m := N-1+k; h := FALSE
ELSE (*merge pages a and b, discard b*)
FOR i := 0 TO N-1 DO a.e[i+N] := b.e[i] END ;
DEC(c.m);
FOR i := s TO c.m-1 DO c.e[i] := c.e[i+1] END ;
a.m := 2*N; h := c.m < N
END
ELSE (*b := page to the left of a*) DEC(s);
IF s = 0 THEN b := c.p0 ELSE b := c.e[s-1].p END ;
k := (b.m-N+1) DIV 2; (*k = nof items available on page b*)
IF k > 0 THEN
FOR i := N-2 TO 0 BY -1 DO a.e[i+k] := a.e[i] END ;
a.e[k-1] := c.e[s]; a.e[k-1].p := a.p0;
(*move k-1 items from b to a, one to c*) DEC(b.m, k);
FOR i := k-2 TO 0 BY -1 DO a.e[i] := b.e[i+b.m+1] END ;
c.e[s] := b.e[b.m]; a.p0 := c.e[s].p;
c.e[s].p := a; a.m := N-1+k; h := FALSE
ELSE (*merge pages a and b, discard a*)
c.e[s].p := a.p0; b.e[N] := c.e[s];
FOR i := 0 TO N-2 DO b.e[i+N+1] := a.e[i] END ;
b.m := 2*N; DEC(c.m); h := c.m < N
END
END
END underflow;
PROCEDURE delete(x: INTEGER; a: Page; VAR h: BOOLEAN);
(*search and delete key x in B-tree a; if a page underflow arises,
balance with adjacent page or merge; h := "page a is undersize"*)
VAR i, L, R: INTEGER; q: Page;
PROCEDURE del(p: Page; VAR h: BOOLEAN);
VAR k: INTEGER; q: Page; (*global a, R*)
BEGIN k := p.m-1; q := p.e[k].p;
IF q # NIL THEN del(q, h);
IF h THEN underflow(p, q, p.m, h) END
ELSE p.e[k].p := a.e[R].p; a.e[R] := p.e[k];
DEC(p.m); h := p.m < N
END
END del;
BEGIN
IF a # NIL THEN
L := 0; R := a.m; (*binary search*)
WHILE L < R DO
i := (L+R) DIV 2;
IF x <= a.e[i].key THEN R := i ELSE L := i+1 END
END ;
IF R = 0 THEN q := a.p0 ELSE q := a.e[R-1].p END ;
IF (R < a.m) & (a.e[R].key = x) THEN (*found*)
IF q = NIL THEN (*a is leaf page*)
DEC(a.m); h := a.m < N;
FOR i := R TO a.m-1 DO a.e[i] := a.e[i+1] END
ELSE del(q, h);
IF h THEN underflow(a, q, R, h) END
END
ELSE delete(x, q, h);
IF h THEN underflow(a, q, R, h) END
END
END
END delete;
PROCEDURE ShowTree(VAR W: Texts.Writer; p: Page; level: INTEGER);
VAR i: INTEGER;
BEGIN
IF p # NIL THEN
FOR i := 1 TO level DO Texts.Write(W, 9X) END ;
FOR i := 0 TO p.m-1 DO Texts.WriteInt(W, p.e[i].key, 4) END ;
Texts.WriteLn(W);
IF p.m > 0 THEN ShowTree(p.p0, level+1) END ;
FOR i := 0 TO p.m-1 DO ShowTree(p.e[i].p, level+1) END
END
END ShowTree;
Extensive analysis of B-tree performance has been undertaken and is reported in the referenced article
(Bayer and McCreight). In particular, it includes a treatment of the question of optimal page size, which
strongly depends on the characteristics of the storage and computing system available.
Variations of the B-tree scheme are discussed in Knuth, Vol. 3, pp. 476-479. The one notable observation
is that page splitting should be delayed in the same way that page merging is delayed, by first attempting
to balance neighboring pages. Apart from this, the suggested improvements seem to yield marginal gains.
A comprehensive survey of B-trees may be found in [4-8].
the item list as shown in Fig. 4.50. The B-tree node thereby loses its actual identity, and the items assume
the role of nodes in a regular binary tree. It remains necessary, however, to distinguish between pointers
to descendants (vertical) and pointers to siblings on the same page (horizontal). Since only the pointers to
the right may be horizontal, a single bit is sufficient to record this distinction. We therefore introduce the
Boolean field h with the meaning horizontal. The definition of a tree node based on this representation is
given below. It was suggested and investigated by R. Bayer [4-3] in 1971 and represents a search tree
organization guaranteeing p = 2*log(N) as maximum path length.
TYPE Node = POINTER TO RECORD
key: INTEGER;
...........
left, right: Node;
h: BOOLEAN (*right branch horizontal*)
END
[Figure: the four cases of node insertion into the pages of a binary B-tree, showing the resulting page splits]
symmetry evident. Note that whenever a subtree of node A without siblings grows, the root of the subtree
becomes the sibling of A. This case need not be considered any further.
[Figure: the four cases LL, LR, RR, RL of insertion into a symmetric binary B-tree]
2. Every pointer is either horizontal or vertical. There are no two consecutive horizontal pointers on any
search path.
3. All terminal nodes (nodes without descendants) appear at the same (terminal) level.
From this definition it follows that the longest search path is no longer than twice the height of the tree.
Since no SBB-tree with N nodes can have a height larger than log(N), it follows immediately that
2*log(N) is an upper bound on the search path length. In order to visualize how these trees grow, we refer
to Fig. 4.53. The lines represent snapshots taken during the insertion of the following sequences of keys,
where every semicolon marks a snapshot.
(1) 1 2; 3; 4 5 6; 7;
(2) 5 4; 3; 1 2 7 6;
(3) 6 2; 4; 1 7 3 5;
(4) 4 2 6; 1 7; 3 5;
[Fig. 4.53: insertion of the key sequences (1)-(4) into initially empty SBB-trees, with snapshots at the semicolons]
The recursive procedure search again follows the pattern of the basic binary tree insertion algorithm. A
third parameter h is added; it indicates whether or not the subtree with root p has changed, and it
corresponds directly to the parameter h of the B-tree search program. We must note, however, the
consequence of representing pages as linked lists: a page is traversed by either one or two calls of the
search procedure. We must distinguish between the case of a subtree (indicated by a vertical pointer) that
has grown and a sibling node (indicated by a horizontal pointer) that has obtained another sibling and
hence requires a page split. The problem is easily solved by introducing a three-valued h with the
following meanings:
1. h = 0: the subtree p requires no changes of the tree structure.
2. h = 1: node p has obtained a sibling.
3. h = 2: the subtree p has increased in height.
PROCEDURE search(VAR p: Node; x: INTEGER; VAR h: INTEGER);
VAR q, r: Node;
BEGIN (*h=0*)
IF p = NIL THEN (*insert new node*)
NEW(p); p.key := x; p.L := NIL; p.R := NIL; p.lh := FALSE; p.rh := FALSE; h := 2
ELSIF x < p.key THEN
search(p.L, x, h);
IF h > 0 THEN (*left branch has grown or received sibling*)
q := p.L;
IF p.lh THEN
h := 2; p.lh := FALSE;
IF q.lh THEN (*LL*)
p.L := q.R; q.lh := FALSE; q.R := p; p := q
ELSE (*q.rh, LR*)
r := q.R; q.R := r.L; q.rh := FALSE; r.L := p.L; p.L := r.R; r.R := p; p := r
END
ELSE DEC(h);
IF h = 1 THEN p.lh := TRUE END
END
END
ELSIF x > p.key THEN
search(p.R, x, h);
IF h > 0 THEN (*right branch has grown or received sibling*)
q := p.R;
IF p.rh THEN
h := 2; p.rh := FALSE;
IF q.rh THEN (*RR*)
p.R := q.L; q.rh := FALSE; q.L := p; p := q
ELSE (*q.lh, RL*)
r := q.L; q.L := r.R; q.lh := FALSE; r.R := p.R; p.R := r.L; r.L := p; p := r
END
ELSE DEC(h);
IF h = 1 THEN p.rh := TRUE END
END
END
END
END search;
Note that the actions to be taken for node rearrangement very strongly resemble those developed in the
AVL-balanced tree search algorithm. It is evident that all four cases can be implemented by simple
pointer rotations: single rotations in the LL and RR cases, double rotations in the LR and RL cases. In
fact, procedure search appears here slightly simpler than in the AVL case. Clearly, the SBB-tree scheme
emerges as an alternative to the AVL-balancing criterion. A performance comparison is therefore both
possible and desirable.
We refrain from involved mathematical analysis and concentrate on some basic differences. It can be
proven that the AVL-balanced trees are a subset of the SBB-trees. Hence, the class of the latter is larger.
It follows that their path length is on the average larger than in the AVL case. Note in this connection the
worst-case tree (4) in Fig. 4.53. On the other hand, node rearrangement is called for less frequently. The
balanced tree is therefore preferred in those applications in which key retrievals are much more frequent
than insertions (or deletions); if this quotient is moderate, the SBB-tree scheme may be preferred. It is
very difficult to say where the borderline lies. It strongly depends not only on the quotient between the
frequencies of retrieval and structural change, but also on the characteristics of an implementation. This is
particularly the case if the node records have a densely packed representation, and if therefore access to
fields involves part-word selection.
The SBB-tree has later found a rebirth under the name of red-black tree. The difference is that whereas in
the case of the symmetric, binary B-tree every node contains two h-fields indicating whether the
emanating pointers are horizontal, every node of the red-black tree contains a single h-field, indicating
whether the incoming pointer is horizontal. The name stems from the idea to color nodes with incoming
down-pointer black, and those with incoming horizontal pointer red. No two red nodes can immediately
follow each other on any path. Therefore, like in the cases of the BB- and SBB-trees, every search path is
at most twice as long as the height of the tree. There exists a canonical mapping from binary B-trees to
red-black trees.
on efforts involved in searching, inserting, or deleting elements can be assured. Although this had already
been the case for the ordinary, unbalanced search tree, the chances for good average behaviour are slim.
Even worse, maintenance operations can become rather unwieldy. Consider, for example, the tree of Fig.
4.54 (a). Insertion of a new node C whose coordinates force it to be inserted above and between A and B
requires a considerable effort transforming (a) into (b).
McCreight discovered a scheme, similar to balancing, that, at the expense of a more complicated insertion
and deletion operation, guarantees logarithmic time bounds for these operations. He calls that structure a
priority search tree [4-10]; in terms of our classification, however, it should be called a balanced priority
search tree. We refrain from discussing that structure, because the scheme is very intricate and in practice
hardly used. By considering a somewhat more restricted, but in practice no less relevant problem,
McCreight arrived at yet another tree structure, which shall be presented here in detail. Instead of
assuming that the search space be unbounded, he considered the data space to be delimited by a rectangle
with two sides open. We denote the limiting values of the x-coordinate by xmin and xmax.
In the scheme of the (unbalanced) priority search tree outlined above, each node p divides the plane into
two parts along the line x = p.x. All nodes of the left subtree lie to its left, all those in the right subtree to
its right. For the efficiency of searching this choice may be bad. Fortunately, we may choose the
dividing line differently. Let us associate with each node p an interval [p.L .. p.R), ranging over all x
values including p.L up to but excluding p.R. This shall be the interval within which the x-value of the
node may lie. Then we postulate that the left descendant (if any) must lie within the left half, the right
descendant within the right half of this interval. Hence, the dividing line is not p.x, but (p.L+p.R)/2. For
each descendant the interval is halved, thus limiting the height of the tree to log(xmax-xmin). This result
holds only if no two nodes have the same x-value, a condition which, however, is guaranteed by the
invariant (4.90). If we deal with integer coordinates, this limit is at most equal to the wordlength of the
computer used. Effectively, the search proceeds like a bisection or radix search, and therefore these trees
are called radix priority search trees [4-10]. They feature logarithmic bounds on the number of operations
required for searching, inserting, and deleting an element, and are governed by the following invariants
for each node p:
p.left ≠ NIL implies (p.L ≤ p.left.x < p.M) & (p.y ≤ p.left.y)
p.right ≠ NIL implies (p.M ≤ p.right.x < p.R) & (p.y ≤ p.right.y)
where
p.M = (p.L + p.R) DIV 2
p.left.L = p.L
p.left.R = p.M
p.right.L = p.M
p.right.R = p.R
for all nodes p, and root.L = xmin, root.R = xmax.
A decisive advantage of the radix scheme is that maintenance operations (preserving the invariants under
insertion and deletion) are confined to a single spine of the tree, because the dividing lines have fixed
values of x irrespective of the x-values of the inserted nodes.
Typical operations on priority search trees are insertion, deletion, finding an element with the least
(largest) value of x (or y) larger (smaller) than a given limit, and enumerating the points lying within a
given rectangle. Given below are procedures for inserting and enumerating. They are based on the
following type declarations:
TYPE Node = POINTER TO RECORD
x, y: INTEGER;
left, right: Node
END
Notice that the attributes xL and xR need not be recorded in the nodes themselves. They are rather
computed during each search. This, however, requires two additional parameters of the recursive
procedure insert. Their values for the first call (with p = root) are xmin and xmax respectively. Apart
from this, a search proceeds similarly to that of a regular search tree. If an empty node is encountered, the
element is inserted. If the node to be inserted has a y-value smaller than the one being inspected, the new
node is exchanged with the inspected node. Finally, the node is inserted in the left subtree, if its x-value
is less than the middle value of the interval, or the right subtree otherwise.
PROCEDURE insert(VAR p: Node; X, Y, xL, xR: INTEGER);
VAR xm, t: INTEGER;
BEGIN
IF p = NIL THEN (*not in tree, insert*)
NEW(p); p.x := X; p.y := Y; p.left := NIL; p.right := NIL
ELSIF p.x = X THEN (*found; don't insert*)
ELSE
IF p.y > Y THEN
t := p.x; p.x := X; X := t;
t := p.y; p.y := Y; Y := t
END ;
xm := (xL + xR) DIV 2;
IF X < xm THEN insert(p.left, X, Y, xL, xm)
ELSE insert(p.right, X, Y, xm, xR)
END
END
END insert
The task of enumerating all points x,y lying in a given rectangle, i.e. satisfying x0 ≤ x < x1 and y ≤ y1 is
accomplished by the following procedure enumerate. It calls a procedure report(x,y) for each point
found. Note that one side of the rectangle lies on the x-axis, i.e. the lower bound for y is 0. This
guarantees that enumeration requires at most O(log(N) + s) operations, where N is the cardinality of the
search space in x and s is the number of nodes enumerated.
PROCEDURE enumerate(p: Ptr; x0, x1, y, xL, xR: INTEGER);
VAR xm: INTEGER;
BEGIN
IF p # NIL THEN
IF (p.y <= y) & (x0 <= p.x) & (p.x < x1) THEN
report(p.x, p.y)
END ;
xm := (xL + xR) DIV 2;
IF x0 < xm THEN enumerate(p.left, x0, x1, y, xL, xm) END ;
IF xm < x1 THEN enumerate(p.right, x0, x1, y, xm, xR) END
END
END enumerate
Exercises
4.1. Let us introduce the notion of a recursive type, to be declared as
RECTYPE T = T0
and denoting the set of values defined by the type T0 enlarged by the single value NONE. The definition
of the type person, for example, could then be simplified to
RECTYPE person = RECORD name: Name;
father, mother: person
END
Which is the storage pattern of the recursive structure corresponding to Fig. 4.2? Presumably, an
implementation of such a feature would be based on a dynamic storage allocation scheme, and the fields
named father and mother in the above example would contain pointers generated automatically but hidden
from the programmer. What are the difficulties encountered in the realization of such a feature?
4.2. Define the data structure described in the last paragraph of Section 4.2 in terms of records and
pointers. Is it also possible to represent this family constellation in terms of recursive types as proposed
in the preceding exercise?
4.3. Assume that a first-in-first-out (fifo) queue Q with elements of type T0 is implemented as a linked
list. Define a module with a suitable data structure, procedures to insert and extract an element from Q,
and a function to test whether or not the queue is empty. The procedures should contain their own
mechanism for an economical reuse of storage.
4.4. Assume that the records of a linked list contain a key field of type INTEGER. Write a program to
sort the list in order of increasing value of the keys. Then construct a procedure to invert the list.
4.5. Circular lists (see Fig. 4.55) are usually set up with a so-called list header. What is the reason for
introducing such a header? Write procedures for the insertion, deletion, and search of an element
identified by a given key. Do this once assuming the existence of a header, once without header.
4.25. Write a program for the search, insertion, and deletion of keys in a binary B-tree. Use the node type
and the insertion scheme shown above for the binary B-tree.
4.26. Find a sequence of insertion keys which, starting from the empty symmetric binary B-tree, causes
the shown procedure to perform all four rebalancing acts (LL, LR, RR, RL) at least once. What is the
shortest such sequence?
4.27. Write a procedure for the deletion of elements in a symmetric binary B-tree. Then find a tree and a
short sequence of deletions causing all four rebalancing situations to occur at least once.
4.28. Formulate a data structure and procedures for the insertion and deletion of an element in a priority
search tree. The procedures must maintain the specified invariants. Compare their performance with that
of the radix priority search tree.
4.29. Design a module with the following procedures operating on radix priority search trees:
-- insert a point with coordinates x, y.
-- enumerate all points within a specified rectangle.
-- find the point with the least x-coordinate in a specified rectangle.
-- find the point with the largest y-coordinate within a specified rectangle.
-- enumerate all points lying within two (intersecting) rectangles.
References
4-1. G.M. Adelson-Velskii and E.M. Landis. Doklady Akademia Nauk SSSR, 146, (1962), 263-66; English
translation in Soviet Math, 3, 1259-63.
4-2. R. Bayer and E.M. McCreight. Organization and Maintenance of Large Ordered Indexes. Acta
Informatica, 1, No. 3 (1972), 173-89.
4-3. -----, Binary B-trees for Virtual memory. Proc. 1971 ACM SIGFIDET Workshop, San Diego, Nov.
1971, pp. 219-35.
4-4. -----, Symmetric Binary B-trees: Data Structure and Maintenance Algorithms. Acta Informatica, 1,
No. 4 (1972), 290-306.
4-5. T.C. Hu and A.C. Tucker. SIAM J. Applied Math, 21, No. 4 (1971) 514-32.
4-6. D. E. Knuth. Optimum Binary Search Trees. Acta Informatica, 1, No. 1 (1971), 14-25.
4-7. W.A. Walker and C.C. Gotlieb. A Top-down Algorithm for Constructing Nearly Optimal
Lexicographic Trees. in Graph Theory and Computing (New York: Academic Press, 1972), pp.
303-23.
4-8. D. Comer. The ubiquitous B-Tree. ACM Comp. Surveys, 11, 2 (June 1979), 121-137.
4-9. J. Vuillemin. A unifying look at data structures. Comm. ACM, 23, 4 (April 1980), 229-239.
4-10. E.M. McCreight. Priority search trees. SIAM J. of Comp. (May 1985)
identical indices, thus effectively causing a most uneven distribution. It is therefore particularly
recommended to let N be a prime number [5-2]. This has the consequence that a full division operation is
needed that cannot be replaced by a mere masking of binary digits, but this is no serious drawback on most
modern computers that feature a built-in division instruction.
Often, hash functions are used which consist of applying logical operations such as the exclusive or to some
parts of the key represented as a sequence of binary digits. These operations may be faster than division on
some computers, but they sometimes fail spectacularly to distribute the keys evenly over the range of
indices. We therefore refrain from discussing such methods in further detail.
with h0 = 0 and d0 = 1. This is called quadratic probing, and it essentially avoids primary clustering,
although practically no additional computations are required. A very slight disadvantage is that in probing
not all table entries are searched, that is, upon insertion one may not encounter a free slot although there are
some left. In fact, in quadratic probing at least half the table is visited if its size N is a prime number. This
assertion can be derived from the following deliberation. If the ith and the jth probes coincide upon the
same table entry, we can express this by the equation
i² MOD N = j² MOD N
(i² - j²) ≡ 0 (modulo N)
Splitting the difference into two factors, we obtain
(i + j)(i - j) ≡ 0 (modulo N)
and since i ≠ j, we realize that either i or j has to be at least N/2 in order to yield i+j = c*N, with c being an
integer. In practice, the drawback is of no importance, since having to perform N/2 secondary probes and
collision evasions is extremely rare and occurs only if the table is already almost full.
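A minimal sketch of this incremental computation of the probe sequence h0, h0+1, h0+4, h0+9, ... is given below; the table T of integer keys, the use of 0 as the free-slot marker, and the procedure name find are assumptions of the sketch and not part of the program discussed next.

CONST P = 997; (*table size, chosen to be a prime number*)
VAR T: ARRAY P OF INTEGER; (*T[i] = 0 marks a free slot*)

PROCEDURE find(x: INTEGER): INTEGER;
  VAR h, d: INTEGER; (*assumes x > 0 and that the table never becomes completely full*)
BEGIN
  h := x MOD P; d := 1;
  WHILE (T[h] # 0) & (T[h] # x) DO
    h := h + d; d := d + 2; (*increments 1, 3, 5, ... yield the offsets 1, 4, 9, ... from the first probe*)
    IF h >= P THEN h := h - P END
  END ;
  RETURN h (*index of x, or of the free slot where x may be inserted*)
END find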
As an application of the scatter storage technique, the cross-reference generator procedure shown in Sect.
4.4.3 is rewritten. The principal differences lie in the procedure search and in the replacement of the
pointer type Node by the global hash table of words T. The hash function H is the modulus of the table size;
quadratic probing was chosen for collision handling. Note that it is essential for good performance that the
table size be a prime number.
Although the method of key transformation is most effective in this case -- actually more efficient than tree
organizations -- it also has a disadvantage. After having scanned the text and collected the words, we
presumably wish to tabulate these words in alphabetical order. This is straightforward when using a tree
organization, because its very basis is the ordered search tree. It is not, however, when key transformations
are used. The full significance of the word hashing becomes apparent. Not only would the table printout
have to be preceded by a sort process (which is omitted here), but it even turns out to be advantageous to
keep track of inserted keys by linking them together explicitly in a list. Hence, the superior performance of
the hashing method considering the process of retrieval only is partly offset by additional operations
required to complete the full task of generating an ordered cross-reference index.
END
UNTIL found
END search;
PROCEDURE Tabulate(T: Table); (*uses global writer W*)
VAR i, k: INTEGER;
BEGIN
FOR k := 0 TO P-1 DO
IF T[k].key[0] # " " THEN
Texts.WriteString(W, T[k].key); Texts.Write(W, TAB);
FOR i := 0 TO T[k].n -1 DO Texts.WriteInt(W, T[k].lno[i], 4) END ;
Texts.WriteLn(W)
END
END
END Tabulate;
PROCEDURE CrossRef(VAR R: Texts.Reader);
VAR i: INTEGER; ch: CHAR; w: Word;
H: Table;
BEGIN NEW(H);
FOR i := 0 TO P-1 DO H[i].key[0] := " " END ; (*allocate and clear hash table*)
line := 0;
Texts.WriteInt(W, 0, 6); Texts.Write(W, TAB); Texts.Read(R, ch);
WHILE ~R.eot DO
IF ch = 0DX THEN (*line end*) Texts.WriteLn(W);
INC(line); Texts.WriteInt(W, line, 6); Texts.Write(W, 9X); Texts.Read(R, ch)
ELSIF ("A" <= ch) & (ch <= "Z") OR ("a" <= ch) & (ch <= "z") THEN
i := 0;
REPEAT
IF i < WordLen-1 THEN w[i] := ch; INC(i) END ;
Texts.Write(W, ch); Texts.Read(R, ch)
UNTIL (i = WordLen-1) OR ~(("A" <= ch) & (ch <= "Z")) &
~(("a" <= ch) & (ch <= "z")) & ~(("0" <= ch) & (ch <= "9"));
w[i] := 0X; (*string terminator*)
search(H, w)
ELSE Texts.Write(W, ch); Texts.Read(R, ch)
END (*IF*)
END ; (*WHILE*)
Texts.WriteLn(W); Texts.WriteLn(W); Tabulate(H)
END CrossRef
p1 = (n-k)/n
p2 = (k/n) × (n-k)/(n-1)
p3 = (k/n) × (k-1)/(n-1) × (n-k)/(n-2)
...
pi = (k/n) × (k-1)/(n-1) × (k-2)/(n-2) × ... × (n-k)/(n-(i-1))
The expected number E of probes required upon insertion of the (k+1)st key is therefore
Ek+1 = Si: 1 ≤ i ≤ k+1 : i × pi
     = 1 × (n-k)/n + 2 × (k/n) × (n-k)/(n-1) + ...
       + (k+1) × (k/n) × (k-1)/(n-1) × (k-2)/(n-2) × ... × 1/(n-(k-1))
     = (n+1) / (n-(k-1))
Since the number of probes required to insert an item is identical with the number of probes needed to
retrieve it, the result can be used to compute the average number E of probes needed to access a random
key in a table. Let the table size again be denoted by n, and let m be the number of keys actually in the
table. Then
E = (Sk: 1 ≤ k ≤ m: Ek) / m
= (n+1) × (Sk: 1 ≤ k ≤ m: 1/(n-k+2))/m
= (n+1) × (Hn+1 - Hn-m+1)
where H is the harmonic function. H can be approximated as Hn = ln(n) + g, where g is Euler's constant. If,
moreover, we substitute a for m/(n+1), we obtain
E = (ln(n+1) - ln(n-m+1))/a = ln((n+1)/(n-m+1))/a = -ln(1-a)/a
a is approximately the quotient of occupied and available locations, called the load factor; a = 0 implies an
empty table, a = n/(n+1) ≈ 1 a full table. The expected number E of probes to retrieve or insert a randomly
chosen key is listed in Table 5.1 as a function of the load factor. The numerical results are indeed
surprising, and they explain the exceptionally good performance of the key transformation method. Even if
a table is 90% full, on the average only 2.56 probes are necessary to either locate the key or to find an
empty location. Note in particular that this figure does not depend on the absolute number of keys present,
but only on the load factor.
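For example, a = 0.9 yields E = -ln(1 - 0.9)/0.9 = 2.30/0.9 ≈ 2.56, the value listed in the table.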
a E
0.1 1.05
0.25 1.15
0.5 1.39
0.75 1.85
0.9 2.56
0.95 3.15
0.99 4.66
Table 5.1 Expected number of probes as a function of the load factor.
The above analysis was based on the use of a collision handling method that spreads the keys uniformly
over the remaining locations. Methods used in practice yield slightly worse performance. Detailed analysis
for linear probing yields an expected number of probes as
E = (1 - a/2) / (1 - a)
Some numerical values for E(a) are listed in Table 5.2 [5-4]. The results obtained even for the poorest
method of collision handling are so good that there is a temptation to regard key transformation (hashing)
as the panacea for everything. This is particularly so because its performance is superior even to the most
sophisticated tree organization discussed, at least on the basis of comparison steps needed for retrieval and
insertion. It is therefore important to point out explicitly some of the drawbacks of hashing, even if they are
obvious upon unbiased consideration.
a E
0.1 1.06
0.25 1.17
0.5 1.50
0.75 2.50
0.9 5.50
0.95 10.50
Table 5.2 Expected number of probes for linear probing.
Certainly the major disadvantage over techniques using dynamic allocation is that the size of the table is
fixed and cannot be adjusted to actual demand. A fairly good a priori estimate of the number of data items
to be classified is therefore mandatory if either poor storage utilization or poor performance (or even table
overflow) is to be avoided. Even if the number of items is exactly known -- an extremely rare case -- the
desire for good performance dictates to dimension the table slightly (say 10%) too large.
The second major deficiency of scatter storage techniques becomes evident if keys are not only to be
inserted and retrieved, but if they are also to be deleted. Deletion of entries in a hash table is extremely
cumbersome unless direct chaining in a separate overflow area is used. It is thus fair to say that tree
organizations are still attractive, and actually to be preferred, if the volume of data is largely unknown, is
strongly variable, and at times even decreases.
Exercises
5.1. If the amount of information associated with each key is relatively large (compared to the key itself),
this information should not be stored in the hash table. Explain why and propose a scheme for representing
such a set of data.
5.2. Consider the proposal to solve the clustering problem by constructing overflow trees instead of
overflow lists, i.e., of organizing those keys that collided as tree structures. Hence, each entry of the scatter
(hash) table can be considered as the root of a (possibly empty) tree. Compare the expected performance of
this tree hashing method with that of open addressing.
5.3. Devise a scheme that performs insertions and deletions in a hash table using quadratic increments for
collision resolution. Compare this scheme experimentally with the straight binary tree organization by
applying random sequences of keys for insertion and deletion.
5.4. The primary drawback of the hash table technique is that the size of the table has to be fixed at a time
when the actual number of entries is not known. Assume that your computer system incorporates a
dynamic storage allocation mechanism that allows to obtain storage at any time. Hence, when the hash
table H is full (or nearly full), a larger table H' is generated, and all keys in H are transferred to H',
whereafter the store for H can be returned to the storage administration. This is called rehashing. Write a
program that performs a rehash of a table H of size n.
5.5. Very often keys are not integers but sequences of letters. These words may greatly vary in length, and
therefore they cannot conveniently and economically be stored in key fields of fixed size. Write a program
that operates with a hash table and variable length keys.
References
5-1. W.D. Maurer. An Improved Hash Code for Scatter Storage. Comm. ACM, 11, No. 1 (1968), 35-38.
5-2. R. Morris. Scatter Storage Techniques. Comm. ACM, 11, No. 1 (1968), 38-43.
5-3. W.W. Peterson. Addressing for Random-access Storage. IBM J. Res. & Dev., 1, (1957), 130-46.
5-4. G. Schay and W. Spruth. Analysis of a File Addressing Method. Comm. ACM, 5, No. 8 (1962), 459-62.