CS 314
whose skill at algorithms helped him create Mint.com, the online personal finance tool, which he recently
sold to Intuit for $170 million. New York Times, Dec. 6, 2009.
Acknowledgment: Many figures in these notes are taken from Wikipedia.
MapReduce slides are from code.google.com/edu/parallel/index.html .
All are licensed under creativecommons.org/licenses/by/2.5/.
Course Topics
Introduction: why we need clever data structures and
algorithms
Representation: how data, pointers, and structures
are represented in the computer
Performance and Big O: how much time and storage
are needed as data sizes become large
Some Lisp: (+ x 3)
Lists and Recursion; Stacks and Queues
Trees and Tree Recursion
Using Library Packages
Balanced Trees and Maps
Hashing and randomization; XOR
Priority Queues and Heaps
Sorting, Merge
Graphs
Map, Reduce, and MapReduce / Hadoop: data-intensive and parallel computation
Introduction
In this course, we will be interested in performance,
especially as the size of the problem increases.
Understanding performance allows us to select the right
data structure for an application.
An engineer can do for a dime
what any fool can do for a dollar.
Thus, engineer/fool = 10.
Fibonacci Numbers
Leonardo of Pisa, known as Fibonacci, introduced this
series to Western mathematics in 1202:
F(0) = 0
F(1) = 1
F(n) = F(n - 2) + F(n - 1), where n > 1

n    : 0  1  2  3  4  5  6  7   8   9   10  11  12   13   14   15
F(n) : 0  1  1  2  3  5  8  13  21  34  55  89  144  233  377  610
Fibonacci Functions
Let's look at two versions of this function:
(defun fib2 (n)
  (if (< n 2)
      n
      (+ (fib2 (- n 1))
         (fib2 (- n 2))) ) )
Testing Fibonacci
>(fib1 8)
21
>(fib2 8)
21
>(fib1 20)
6765
>(fib2 20)
6765
fib1 and fib2 appear to compute the correct result.
>(fib1 200)
280571172992510140037611932413038677189525
>(fib2 200)
...
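In Java (a sketch of mine, not code from the notes), the two strategies might look like this. The doubly recursive version recomputes the same subproblems exponentially many times, which is why a call like fib2(200) above never seems to finish, while a linear version carries the previous two values forward:

```java
public class Fib {
    // Doubly recursive, following the definition directly: O(2^n) time.
    static long fibNaive(int n) {
        if (n < 2) return n;
        return fibNaive(n - 1) + fibNaive(n - 2);
    }

    // Linear version: O(n) time, carrying the previous two values forward.
    // (F(200) overflows long; a BigInteger version is needed for such values.)
    static long fibLinear(int n) {
        long a = 0, b = 1;              // F(0) and F(1)
        for (int i = 0; i < n; i++) {
            long next = a + b;
            a = b;
            b = next;
        }
        return a;                       // F(n)
    }
}
```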
Rates of Growth
In Greek mythology, the Hydra was a many-headed
monster with an unfortunate property: every time you
cut off a head, the Hydra grew two more in its place.
Exponential Growth
Know your enemy. Sun Tzu, The Art of War
We want to be able to recognize problems or
computations that will be intractable, impossible to solve
in a reasonable amount of time.
Exponential growth of computation has several names in
computer science:
combinatoric explosion
NP-complete
O(2^n)
In real life, things that involve exponential growth usually
end badly:
Atomic bombs
Cancer
Population explosion
Big O
Big O (often pronounced order) is an abstract function
that describes how fast a function grows as the size of the
problem becomes large. The order given by Big O is a
least upper bound on the rate of growth.
We say that a function T(n) has order O(f(n)) if there
exist positive constants c and n0 such that:
T(n) <= c * f(n) when n >= n0.
For example, the function T(n) = 2n^2 + 5n + 100
has order O(n^2) because 2n^2 + 5n + 100 <= 3n^2 for
n >= 13.
We don't care about performance for small values of n,
since small problems are usually easy to solve. In fact,
we can make that a rule:
If the input is small, any algorithm is okay.
In such cases, the simplest (easiest to code) algorithm is
best.
Classes of Algorithms

f(n)       Name          Example
1          Constant      +
log(n)     Logarithmic   binary search
log^2(n)   Log-squared
n          Linear        max of array
n log(n)   Linearithmic  quicksort
n^2        Quadratic     selection sort
n^3        Cubic         matrix multiply
n^k        Polynomial
2^n        Exponential   knapsack problem
Metric prefixes and binary prefixes:

Symbol  10^n      Word        Prefix  Symbol  2^k
Y       10^24                 Yobi    Yi      2^80
Z       10^21                 Zebi    Zi      2^70
E       10^18                 Exbi    Ei      2^60
P       10^15                 Pebi    Pi      2^50
T       10^12     trillion    Tebi    Ti      2^40
G       10^9      billion     Gibi    Gi      2^30
M       10^6      million     Mebi    Mi      2^20
K       10^3      thousand    Kibi    Ki      2^10
m       10^-3
μ       10^-6
n       10^-9
p       10^-12
f       10^-15
a       10^-18
z       10^-21
y       10^-24
Log-log Example
Suppose that f(n) = 25 n^2.

n   f(n)
2   100
3   225
4   400
5   625
6   900

The log-log graph makes it obvious that f(n) is O(n^2):
if the slope of the line is k, the function is O(n^k).
Power of 2 Big O
When the problem size n doubles, the time for an O(n^k)
algorithm is multiplied by 2^k:

2^1     O(n)
2^1 +   O(n log(n))   (slightly more than doubles)
2^2     O(n^2)
2^3     O(n^3)

Example:

n        Time       Ratio
4000     0.01779
8000     0.06901    3.8785
16000    0.27598    3.9991
32000    1.10892    4.0180
64000    4.44222    4.0059

The ratio is approximately 4 = 2^2 each time n doubles, so
this algorithm is O(n^2).
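The same doubling test can be run on deterministic operation counts instead of wall-clock time. This sketch (class and method names are mine) counts the inner-loop steps of a quadratic algorithm and prints the ratio as n doubles, which approaches 2^2 = 4:

```java
public class DoublingTest {
    // Counts the comparisons of a quadratic doubly nested loop,
    // about n^2 / 2 of them; counts are deterministic, unlike timings.
    static long quadraticOps(int n) {
        long ops = 0;
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                ops++;                       // one comparison
        return ops;
    }

    public static void main(String[] args) {
        long prev = 0;
        for (int n = 1000; n <= 16000; n *= 2) {
            long ops = quadraticOps(n);
            if (prev != 0)                   // ratio approaches 2^2 = 4
                System.out.printf("n=%d ratio=%.4f%n", n, (double) ops / prev);
            prev = ops;
        }
    }
}
```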
Computation Model
We will assume that basic operations take constant time.
Later, we will consider some exceptions to this
assumption:
Some library functions may take more than O(1) time.
We will want to be conscious of the Big O of library
functions.
Memory access may be affected by paging and cache
behavior.
There may be hidden costs such as garbage collection.
Big O of Loops
A doubly nested loop whose inner loop runs i times on the
i-th outer iteration takes a total of:
1 + 2 + ... + n = n(n + 1)/2 = O(n^2) steps.
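As a sketch of the counting argument (names are mine), the inner loop below runs i times on the i-th outer iteration, so the total number of steps is exactly n(n+1)/2:

```java
public class LoopCount {
    // Inner loop body runs i times on the i-th outer iteration:
    // total = 1 + 2 + ... + n = n(n+1)/2, which is O(n^2).
    static long countSteps(int n) {
        long steps = 0;
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= i; j++)
                steps++;
        return steps;
    }
}
```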
Pointer / Reference
A pointer or reference is the memory address of the
beginning of a record or block of storage, which extends
through higher memory addresses. A pointer is typically
the size of a machine word: either 32 bits, which can
address 4 gigabytes of storage, or 64 bits (of which 48
bits are actually used on current hardware). (A byte is
8 bits.)
Boxed Number
References and ==
Integer i = 3;
Integer j = i;                    // i == j is true: same reference

Integer i = 127;
Integer j = 127;                  // true: small values share cached boxes

Integer i = 130;
Integer j = 130;                  // typically false: outside the guaranteed cache

Integer i = new Integer(100);
Integer j = new Integer(100);     // false: new always makes a new box

Integer i = new Integer(100);
Integer j = 100;                  // false: new box vs. cached box

Integer i = Integer.valueOf(100);
Integer j = 100;                  // true: both use the cached box

Integer i = new Integer(99);
i++;                              // unbox, add 1, re-box via valueOf
Integer j = 100;                  // i == j is true: both cached

Integer i = 127; i++;
Integer j = 127; j++;             // false: 128 is outside the cache
== vs. .equals()
For all reference types, including Integer etc., the
meaning of == and != is equality of pointer values, i.e.
same data address in memory.
Rule: To test equality of the contents or value of a
reference type, always use .equals().
Rule: To compare against null, use == or != .
Rule: To make an Integer from an int, use one of:
Integer myInteger = (Integer) myint;          // casting
Integer myInteger = myint;                    // auto-boxing
Integer myInteger = Integer.valueOf(myint);   // explicit
These can save storage:
new Integer() always makes a new box, costing 16
bytes and (eventually) garbage collection.
Integer.valueOf() will reuse a value in the
Integer cache if possible, costing 0 bytes.
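A short demonstration of these rules (a sketch; only behavior guaranteed by the language specification, the cache for -128..127, is asserted here):

```java
public class BoxingDemo {
    public static void main(String[] args) {
        Integer a = Integer.valueOf(100);   // from the Integer cache
        Integer b = 100;                    // auto-boxing also uses valueOf
        System.out.println(a == b);         // true: same cached box
        System.out.println(a.equals(b));    // true

        Integer c = Integer.valueOf(1000);  // outside the guaranteed cache:
        Integer d = 1000;                   // c == d is not reliable...
        System.out.println(c.equals(d));    // ...but .equals() is always true
    }
}
```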
Linked List
A linked list is one of the simplest data structures. A
linked list element is a record with two fields:
a link or pointer to the next element (null in Java or
nil in Lisp if there is no next element).
some contents, as needed for the application. The
contents could be a simple item such as a number, or
it could be a pointer to another linked list, or multiple
data items.
In Lisp, a list is written inside parentheses:

(list 'a 'b 'c)               ->  (a b c)
list("a", "b", "c")

(list (+ 2 3) (* 2 3))        ->  (5 6)

(cons 'a '())                 ->  (a)

(cons 'a '(b c))              ->  (a b c)
cons("a", list("b", "c"))

(first '(a b c))              ->  a
first(list("a", "b", "c"))

(rest '(a b c))               ->  (b c)
rest(list("a", "b", "c"))

rest returns the rest of a list after the first thing. Simply
move the left parenthesis to the right past the first thing:

(rest '(a b c))   ->  (b c)
(rest '(b c))     ->  (c)
(rest '(c))       ->  ()  =  null
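The Java versions of these examples use functions first, rest, cons, and list that are not in the standard library. A minimal sketch of the Cons class they assume (field and method details are my guesses, matched to the calls used in these notes):

```java
public class Cons {
    private Object car;   // contents of this element
    private Cons cdr;     // link to the next element, or null

    public Cons(Object first, Cons rest) { car = first; cdr = rest; }

    public static Cons cons(Object first, Cons rest) { return new Cons(first, rest); }
    public static Object first(Cons lst) { return lst == null ? null : lst.car; }
    public static Cons rest(Cons lst) { return lst == null ? null : lst.cdr; }
    public static void setrest(Cons lst, Cons rest) { lst.cdr = rest; }

    // list("a", "b", "c") builds (a b c), consing from the right end.
    public static Cons list(Object... items) {
        Cons result = null;
        for (int i = items.length - 1; i >= 0; i--)
            result = cons(items[i], result);
        return result;
    }
}
```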
Recursion
A recursive program calls itself as a subroutine.
Recursion allows one to write programs that are powerful,
yet simple and elegant. Often, a large problem can be
handled by a small program which:
1. Tests for a base case and computes the value for this
case directly.
2. Otherwise,
(a) calls itself recursively to do smaller parts of the
job,
(b) computes the answer in terms of the answers to
the smaller parts.
(defun factorial (n)
(if (<= n 0)
1
(* n (factorial (- n 1))) ) )
Rule: Make sure that each recursive call involves an
argument that is strictly smaller than the original;
otherwise, the program can get into an infinite loop.
A good method is to use a counter or data whose size
decreases with each call, and to stop at 0; this is an
example of a well-founded ordering.
(defun fn (lst)
(if (null lst)
; test for base case
baseanswer
; answer for base case
(some-combination-of
(something-about (first lst))
(fn (rest lst))) ) ) ; recursive call
public int fn (Cons lst) {
if ( lst == null )
return baseanswer;
else return someCombinationOf(
somethingAbout(first(lst)),
fn(rest(lst))); }
The recursive version is often short and elegant, but it
has a potential pitfall: it requires O(n) stack space on
the function call stack. Many languages do not provide
enough stack space for 1000 calls, but a linked list with
1000 elements is not unusual.
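As a concrete instance of the pattern (a sketch with its own minimal list cell, since the names above are placeholders), summing the numbers in a linked list: the base answer is 0, and the combination is addition:

```java
public class SumList {
    static class Node {                    // minimal linked-list cell
        int value; Node next;
        Node(int value, Node next) { this.value = value; this.next = next; }
    }

    // Follows the recursive pattern: base case first, then
    // combine the first element with the recursion on the rest.
    static int sum(Node lst) {
        if (lst == null)
            return 0;                      // base answer
        return lst.value + sum(lst.next);  // combine first with sum of rest
    }
}
```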
Reverse
reverse makes a new (constructive) reversed copy of a list:
(reverse '(a b c))  ->  (c b a)
Copying a List
Since reverse is constructive, we could copy a list by
reversing it twice:
public static Cons copy_list (Cons lst) {
return reverse(reverse(lst)); }
What is the Big O of this function?
We could criticize the efficiency of this function because
it creates O(n) garbage: the list produced by the first
reverse is unused and becomes garbage when the
function exits. However, the Big O of the function is
still O(n) + O(n) = O(n).
Append
append concatenates two lists to form a single list. The
first argument is copied; the second argument is reused
(shared).
(append '(a b c) '(d e))  ->  (a b c d e)
Iterative Append
An iterative version of append can copy the first list in
a loop, using O(1) stack space. For this function, it is
convenient to use the setrest function:
public static Cons append (Cons x, Cons y) {
Cons front = null;
Cons back = null;
Cons cell;
if ( x == null ) return y;
for ( ; x != null ; x = rest(x) ) {
cell = cons(first(x), null);
if ( front == null )
front = cell;
else setrest(back, cell);
back = cell; }
setrest(back, y);
return front; }
Nconc
nconc concatenates two lists to form a single list; instead
of copying the first list as append does, nconc modifies
the end of the first list to point to the second list.
(nconc (list 'a 'b 'c) '(d e))  ->  (a b c d e)
Nreverse
Destructive functions in Lisp often begin with n.
nreverse reverses a list in place by turning the pointers
around.
(nreverse (list 'a 'b 'c))  ->  (c b a)
Intersection
The intersection (written ) of two sets is the set of
elements that are members of both sets.
(intersection '(a b c) '(a c e))  ->  (c a)
Tail-Recursive Intersection
Union
(union '(a b c) '(a c e))  ->  (b a c e)
Set Difference
(set-difference '(a b c) '(a c e))  ->  (b)
Merge
To merge two sorted lists means to combine them into
a single list so that the combined list is sorted. We
will consider both constructive and destructive versions
of merge.
The general idea of a merge is to walk down two sorted
lists simultaneously, advancing down one or the other
based on comparison of the top values using the sorting
function.
In combining two sorted lists with a merge, we walk down
both lists, putting the smaller value into the output at
each step. Duplicates are retained in a merge.
(merge 'list '(3 7 9) '(1 3 4) #'<)  ->  (1 3 3 4 7 9)
Constructive Merge
The easiest way to understand this idiom is a simple
merge that constructs a new list as output.
public static Cons merj (Cons x, Cons y) {
if ( x == null )
return y;
else if ( y == null )
return x;
else if ( ((Comparable) first(x))
.compareTo(first(y)) < 0 )
return cons(first(x),
merj(rest(x), y));
else return cons(first(y),
merj(x, rest(y))); }
(defun merj (x y)
(if (null x)
y
(if (null y)
x
(if (< (first x) (first y))
(cons (first x)
(merj (rest x) y))
(cons (first y)
(merj x (rest y))) ) ) ) )
What is O()? Stack depth? Conses?
(defun dmerj (x y)
(let (front end)
(if (null x)
y
(if (null y)
x
(progn
(if (< (first x) (first y))
(progn (setq front x)
(setq x (rest x)))
(progn (setq front y)
(setq y (rest y))))
(setq end front)
(while (not (null x))
(if (or (null y)
(< (first x) (first y)))
(progn (setf (rest end) x)
(setq x (rest x)))
(progn (setf (rest end) y)
(setq y (rest y))))
(setq end (rest end)) )
(setf (rest end) y)
front))) ))
Comparison in Java
The various types of Java are compared in different ways:
The primitive types int, long, float and double
are compared using < and >. They cannot use
.compareTo() .
String uses .compareTo(); it cannot use < and >.
.compareTo() for String is case-sensitive.
An application object type can be given a
.compareTo() method, but that allows only one
way of sorting. We might want a case-insensitive
comparison for String.
A Comparator can be passed as a function argument,
and this allows a custom comparison method. The
Java library has a case-insensitive Comparator for
String.
Sometimes it is necessary to implement several versions
of the same method in order to use different methods of
comparison on its arguments.
Comparator in Java
Java does not allow a function to be passed as a function
argument, but there is a way to get around this restriction
by passing in an object that defines the desired function as
a method (sometimes called a functor). A Comparator
is a class of object that defines a method compare.
A comparison is basically a subtraction; the compare
method returns an int that gives the sign of the
subtraction (the value of the int does not matter).
If cmp is a Comparator, cmp.compare(x, y) will be:
  negative  if x < y
  zero      if x = y
  positive  if x > y
A simple comparator can simply subtract properties of
the objects:
public static void
mySort( AnyType[] a,
Comparator<? super AnyType> cmp) {...}
class MyOrder implements Comparator<MyObject> {
public int compare(MyObject x, MyObject y)
{ return ( x.property() - y.property() ); }}
Complex Comparator
A comparator can use if statements to compare two
objects using multiple criteria, e.g. year first, then month,
then day.
Comparator<MyDate>
cmp = new Comparator<MyDate>() {
public int compare(MyDate x, MyDate y) {
int ans = x.year - y.year;
if (ans != 0) return ans;
ans = x.month - y.month;
if (ans != 0) return ans;
return x.day - y.day; }};
int res = cmp.compare(datea, dateb);
Sometimes it is convenient for the comparator to simply
return a value such as 1 or -1 to indicate that one object
should be sorted before the other.
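A comparator in use (a sketch; the array contents are illustrative): pass it to the library sort to get a custom order, here the case-insensitive order mentioned earlier for String:

```java
import java.util.Arrays;

public class SortDemo {
    public static void main(String[] args) {
        String[] names = { "bob", "Alice", "CAROL" };

        // Default .compareTo() is case-sensitive:
        // all uppercase letters sort before lowercase ones.
        Arrays.sort(names);
        System.out.println(Arrays.toString(names));  // [Alice, CAROL, bob]

        // The library's case-insensitive Comparator for String:
        Arrays.sort(names, String.CASE_INSENSITIVE_ORDER);
        System.out.println(Arrays.toString(names));  // [Alice, bob, CAROL]
    }
}
```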
Dividing a List
We can find the midpoint of a list by keeping two pointers,
moving one by two steps and another by one step, O(n).
public static Cons midpoint (Cons lst) {
Cons current = lst;
Cons prev = current;
while ( lst != null && rest(lst) != null) {
lst = rest(rest(lst));
prev = current;
current = rest(current); };
return prev; }
(defun midpoint (lst)
(let (prev current)
(setq current lst)
(setq prev lst)
(while (and (not (null lst))
(not (null (rest lst))))
(setq lst (rest (rest lst)))
(setq prev current)
(setq current (rest current)) )
prev))
Sorting by Merge
A list of length 0 or 1 is already sorted. Otherwise, break
the list in half, sort the halves, and merge them.
public static Cons llmergesort (Cons lst) {
if ( lst == null || rest(lst) == null)
return lst;
else { Cons mid = midpoint(lst);
Cons half = rest(mid);
setrest(mid, null);
return dmerj( llmergesort(lst),
llmergesort(half)); } }
(defun llmergesort (lst)
(let (mid half)
(if (or (null lst) (null (rest lst)))
lst
(progn (setq mid (midpoint lst))
(setq half (rest mid))
(setf (rest mid) nil)
(dmerj (llmergesort lst)
(llmergesort half)) ) ) ))
What is O()? Stack depth? Conses?
Intersection by Merge
Merge Technique
The merge technique can be used to perform a variety of
operations on sequences of items (linked lists or arrays)
in O(n log(n)) time:
1. Sort both sequences: O(n log(n))
2. The merging function steps through the sequences
one element at a time, looking only at the two front
elements and comparing their keys.
3. The function performs some operation involving one
or both front elements, perhaps producing some
output.
4. The function may step down one list, the other list,
or both lists.
5. Examples: merge two sorted lists in order; set
intersection, union, or set difference; update a bank
account based on a transaction.
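A sketch of steps 2-4 for set intersection on sorted int arrays (names are mine): advance whichever side has the smaller front element, and emit when the fronts match:

```java
import java.util.ArrayList;
import java.util.List;

public class MergeIntersect {
    // Both inputs must be sorted ascending; the walk is O(n),
    // so the whole operation is O(n log(n)) including the sorts.
    static List<Integer> intersect(int[] x, int[] y) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < x.length && j < y.length) {
            if (x[i] < y[j]) i++;             // step down x
            else if (x[i] > y[j]) j++;        // step down y
            else { out.add(x[i]); i++; j++; } // match: emit, step down both
        }
        return out;
    }
}
```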
Association List
An association list or alist is a simple lookup table
or map: a linked list containing a key value and some
information associated with the key.
(assoc 'two '((one 1) (two 2) (three 3)))
-> (two 2)
public static Cons assoc(Object key, Cons lst) {
if ( lst == null )
return null;
else if ( key.equals(first((Cons) first(lst))) )
return ((Cons) first(lst));
else return assoc(key, rest(lst)); }
(defun assoc (key lst)
(if (null lst)
nil
(if (equal key (first (first lst)))
(first lst)
(assoc key (rest lst)) ) ) )
New items can be added to the association list using cons.
Adv: Simple code. Table is easily expanded.
Dis: O(n) lookup: Suitable only for small tables.
A stack can be implemented using a linked list:

(push item stack)                  ; push
(pop stack)                        ; pop

stack = cons(item, stack);         // push
item = first(stack);
stack = rest(stack);               // pop
(stack == null)                    // empty?
Sentinel Node
The push and pop operations on a stack both have side
effects on the pointer to the stack. If that pointer is a
variable, we cannot write push and pop as subroutines.
A common technique is to put an extra dummy node or
sentinel at the front of the list; the sentinel node points
to the actual list. Then we can write subroutines:
public static Cons
pushb (Cons sentinel, Object item) {
setrest(sentinel, cons(item,rest(sentinel)));
return sentinel; }
public static Object popb (Cons sentinel) {
Object item = first(rest(sentinel));
setrest(sentinel, rest(rest(sentinel)));
return item; }
(defun pushb (sentinel item)
(setf (rest sentinel)
(cons item (rest sentinel)) )
sentinel )
(defun popb (sentinel)
(let (item)
(setq item (first (rest sentinel)))
(setf (rest sentinel) (rest (rest sentinel)))
item ))
Arrays
An array is a contiguous group of elements, all of the
same type, indexed by an integer. In a sense, the array is
the most basic data structure, since the main memory of
a computer is essentially one big array.
The major advantage of an array is random access or
O(1) access to any element. (Paging and cache behavior
can add significantly to access time, but we will ignore
that.)
The major disadvantage of an array is rigidity:
an array cannot be expanded
two arrays cannot be combined
Storage may be wasted because an array is made
larger so it will have some extra space.
In Java, an array is an Object, so it is possible to expand
it in effect by making a new larger array, copying the
old array contents into the new array, and letting the old
array get garbage-collected.
Uses of Stacks
Stacks are used in many places in computer science:
Most programming languages keep variable values on
a runtime stack.
The SPARC architecture has a register file stack in
the CPU.
Operating systems keep the state of interrupted
processes on a stack.
A web browser keeps a stack of previously visited web
pages; this stack is popped when the Back button is
clicked.
Compilers use a stack when parsing programming
languages.
In general, a stack is needed to traverse a tree.
Balancing Parentheses
A stack is a good way to test whether parentheses are
balanced. An open paren must be matched by a close
paren of the same kind, and whatever is between the
parens must be balanced. ({[][][]}) is balanced, but
({[) is not.
public class charStack {
int n;
char[] stack;
public charStack()
{ n = 0;
stack = new char[100]; }
public void push(char c) {
stack[n++] = c; }
public char pop() {
return stack[--n]; }
public boolean empty() {
return ( n == 0 ); } }
This example illustrates that it is easy to roll your own
stack.
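Using a stack this way, the balance test pushes each open paren and, at each close paren, pops and checks that the kinds match. A sketch (method name is mine, and it uses the library ArrayDeque rather than the hand-rolled charStack):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class Balance {
    static boolean balanced(String s) {
        Deque<Character> stack = new ArrayDeque<>();
        for (char c : s.toCharArray()) {
            if (c == '(' || c == '[' || c == '{') {
                stack.push(c);                       // remember the open paren
            } else if (c == ')' || c == ']' || c == '}') {
                if (stack.isEmpty()) return false;   // close with no open
                char open = stack.pop();
                if ((c == ')' && open != '(') ||
                    (c == ']' && open != '[') ||
                    (c == '}' && open != '{'))
                    return false;                    // mismatched kinds
            }
        }
        return stack.isEmpty();                      // nothing left unclosed
    }
}
```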
A grammar for balanced sequences S:
S  ->  (S)  |  [S]  |  {S}  |  S S  |  ε
XML
XML, for Extensible Markup Language, allows users to
put tags around their data to describe what pieces of the
data mean.
<CATALOG>
<CD>
<TITLE>Empire Burlesque</TITLE>
<ARTIST>Bob Dylan</ARTIST>
<COUNTRY>USA</COUNTRY>
<COMPANY>Columbia</COMPANY>
<PRICE>10.90</PRICE>
<YEAR>1985</YEAR>
</CD>
<CD> ... </CD>
</CATALOG>
We can see that XML provides a hierarchical tree
structure for data. The task of checking the validity of an
XML file is essentially the same as checking for balanced
parentheses; XML simply allows the users to define their
own parenthesis names.
A stack can be implemented as a linked list or as an array:

Cons mystack = null;               // init to empty
mystack = cons(item, mystack);     // push
item = first(mystack);
mystack = rest(mystack);           // pop

Object[] mystack = new Object[100];
int mystackp = 0;                  // init to empty
mystack[mystackp++] = item;        // push
item = mystack[--mystackp];        // pop
Queues
A queue data structure implements waiting in line: items
are inserted (enqueued) at the end of the queue and
removed (dequeued) from the front. Sometimes the term
FIFO queue or just FIFO is used, for First-In First-Out.
A queue is a fair data structure: an entry in the queue
will eventually be removed and get service. (A stack, in
contrast, is unfair.)
Queues are frequently used in operating systems:
queues of processes that are ready to execute
queues of jobs to be printed
queues of packets that are ready to be transmitted
over a network
Filter Pattern
A filter is an important concept in CS. Just as a coffee
filter removes coffee grounds while letting liquid pass
through, a filter program removes items from a Collection
if they meet some condition:
static void filter(Collection<?> c) {
for (Iterator<?> it = c.iterator();
it.hasNext(); )
if ( condition(it.next()) )
it.remove(); }
This filter is destructive, removing items from the
collection if they satisfy the condition. One can also
write a constructive filter that makes a new collection,
without modifying the original collection.
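A constructive version might look like this (a sketch of mine; the condition is stood in by a predicate argument). It returns a new list of the items that do not satisfy the condition, leaving the original collection untouched:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.function.Predicate;

public class Filters {
    // Constructive filter: builds a new list of the items NOT matching
    // the condition, mirroring the destructive remove-if-matching version.
    static <T> List<T> filtered(Collection<T> c, Predicate<T> condition) {
        List<T> result = new ArrayList<>();
        for (T item : c)
            if (!condition.test(item))
                result.add(item);
        return result;
    }
}
```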
ArrayList
ArrayList provides a growable array implementation of
a List.
Advantages:
get and set are O(1)
add and remove at the end are O(1), so an
ArrayList makes a good implementation of a stack.
Disadvantages:
add and remove are O(n) for random positions: the
rest of the list has to be moved down to make room.
contains is O(n). contains uses the .equals()
method of the contents type.
Some space in the array is wasted, O(n), because the
array grows by a factor of 3/2 each time it is expanded.
clear is O(n), because clear replaces all existing
entries with null to allow those items to possibly be
garbage collected.
LinkedList
LinkedList implements a List as a doubly-linked list,
with both forward and backward pointers.
This provides a good implementation of a linked list,
stack, queue, and deque (often pronounced deck) or
double-ended queue.
Advantage:
getFirst, addFirst, removeFirst, getLast,
addLast, removeLast are O(1).
Disadvantages:
get, set, add and remove are O(n) for random
positions: it is necessary to step down the list to find
the correct position. However, a remove during an
iteration is O(1).
contains is O(n).
LinkedList seems more like an array than a true
linked list: there is no method equivalent to
setrest, so the structure of a LinkedList cannot
be changed. It is not possible to write the destructive
merge sort for linked lists that we saw earlier.
ListIterator
public interface ListIterator<AnyType>
extends Iterator<AnyType> {
boolean hasPrevious();
AnyType previous();
void add( AnyType x );
void set( AnyType newVal ); }
ListIterator extends the functionality of Iterator:
It is possible to move backwards through the list as
well as forwards.
add adds an element at the current position: O(1)
for LinkedList but O(n) for random positions in
ArrayList.
set sets the value of the last item seen: O(1) for both.
Trees
A tree is a kind of graph, composed of nodes and links,
such that:
A link is a directed pointer from one node to another.
One node, called the root, has no incoming links.
Each node, other than the root, has exactly one
incoming link from its parent.
Every node is reachable from the root.
A node can have any number of children. A node with
no children is called a leaf ; a node with children is an
interior node.
For example, the equation y = m x + b is represented as a tree:
(= y (+ (* m x) b))
Phylogenetic Trees
As new species evolve and branch off from ancestor
species, they retain most of the DNA of their ancestors.
DNA of different species can be compared to reconstruct
the phylogenetic tree.
Taxonomies as Trees
Taxonomy, from the Greek words taxis (order) and
nomos (law or science), was introduced to biology by Carl
Linnaeus in 1760. This 1866 figure is by Ernst Haeckel.
Ontologies as Trees
An ontology, from the Greek words ontos (of being) and
logia (science, study, theory), is a classification of the
concepts in a domain and relationships between them.
An ontology describes the kinds of things that exist in
our model of the world.
Organizations as Trees
Most human organizations are hierarchical and can be
represented as trees.
Nerves
Representations of Trees
Many different representations of trees are possible:
Binary tree: contents and left and right links.
First-child/next-sibling: contents, a link to the first
child, and a link to the next sibling.
Binary Tree
public class Tree {
private String contents;
private Tree left, right;
public Tree(String stuff, Tree lhs, Tree rhs)
{ contents = stuff;
left = lhs;
right = rhs; }
public String str() { return contents; } }
Implicit Tree
In some cases, it may be possible to compute the children
from the state or the location of the parent.
For example, given the board state of a game of checkers,
and knowing whose turn it is to move, it is possible to
compute all possible moves and thus all child states.
Findpath Example
(/usr bill course cop3212 fall06 grades)
Findpath Representation
We will use a Cons representation in which a symbol or
String is a file or directory name and a list is a directory
whose first element is the directory name and whose
rest is the contents.
findpath has two arguments:
dirtree is a directory tree, a list of items:
(item1 ... itemn). Each item is either:
a subdirectory: a list (name item1 ... itemn)
(fall06 prog2.r prog1.r grades)
a file name: prog2.r
path is a list of names.
(/usr bill course cop3212 fall06 grades)
What findpath does is to look through its list to find
some item that matches the first name in path:
If a subdirectory matches, go into that subdirectory.
If a file matches, done: return that entry.
If a file does not match, skip it; go to next entry.
1> (FINDPATH
((/USR (MARK (BOOK CH1.R CH2.R CH3.R)
(COURSE (COP3530 (FALL05 SYL.R) (SPR06 SYL.R)
(SUM06 SYL.R)))
(JUNK))
(ALEX (JUNK))
(BILL (WORK)
(COURSE (COP3212 (FALL05 GRADES PROG1.R PROG2.R)
(FALL06 PROG2.R PROG1.R GRADES)))))
(/USR BILL COURSE COP3212 FALL06 GRADES))
2> (FINDPATH
((MARK (BOOK CH1.R CH2.R CH3.R)
(COURSE (COP3530 (FALL05 SYL.R) (SPR06 SYL.R)
(SUM06 SYL.R)))
(JUNK))
(ALEX (JUNK))
(BILL (WORK)
(COURSE (COP3212 (FALL05 GRADES PROG1.R PROG2.R)
(FALL06 PROG2.R PROG1.R GRADES)))))
(BILL COURSE COP3212 FALL06 GRADES))
3> (FINDPATH
((ALEX (JUNK))
(BILL (WORK)
(COURSE (COP3212 (FALL05 GRADES PROG1.R PROG2.R)
(FALL06 PROG2.R PROG1.R GRADES)))))
(BILL COURSE COP3212 FALL06 GRADES))
4> (FINDPATH
((BILL (WORK)
(COURSE (COP3212 (FALL05 GRADES PROG1.R PROG2.R)
(FALL06 PROG2.R PROG1.R GRADES)))))
(BILL COURSE COP3212 FALL06 GRADES))
Example: a maze represented as a 2-D array
(* = wall, 0 = open space, C = goal):

     0 1 2 3 4 5 6 7 8 9
  1  * 0 0 0 0 * * * * *
  2  * 0 * * 0 * * * * *
  3  * * * * 0 * * * * *
  4  * * * * 0 0 0 0 0 0
  5  * * * * 0 * * * * *
  6  * * * * 0 0 0 0 0 *
  7  * * * * * * * * 0 *
  8  * * * * * * C 0 0 *
  9  * * * * * * * * * *
Tree Traversal
For some applications, we need to traverse an entire tree,
performing some action as we go. There are three basic
orders of processing:
Preorder: process the parent node before children.
Inorder: process one child, then the parent, then
the other child.
Postorder: process children first, then the parent.
Thus, the name of the order tells when the parent is
processed.
We will examine each of these with an example.
Preorder
In preorder, the parent node is processed before its
children.
Suppose that we want to print out a directory name, then
the contents of the directory. We will also indent to show
the depth. We assume a directory tree as shown earlier:
a directory is a list of the directory name followed by its
contents; a non-list is a file.
(defun printdir (dir level)
(spaces (* 2 level))
(if (symbolp dir)
(progn (prin1 dir)
(terpri))
(progn (prin1 (first dir))
(terpri)
(dolist (contents (rest dir))
(printdir contents
(+ level 1)) ) ) ) )
Preorder Example
>(printdir (first directory) 0)
/USR
  MARK
    BOOK
      CH1.R
      CH2.R
      CH3.R
    COURSE
      COP3530
        FALL05
          SYL.R
        SPR06
          SYL.R
        SUM06
          SYL.R
    JUNK
  ALEX
    JUNK
  BILL
    WORK
    COURSE
      COP3212
        FALL05
          GRADES
          PROG1.R
          PROG2.R
        FALL06
          PROG2.R
          PROG1.R
          GRADES
Inorder
There are several ways of writing arithmetic expressions;
these are closely related to the orders of tree traversal:
Prefix or Cambridge Polish, as in Lisp: (+ x y)
Infix, as in Java: x + y
Polish Postfix: x y +
An expression tree can be printed as infix by an inorder
traversal:
(defun op (x) (first x))       ; access functions
(defun lhs (x) (second x))     ; left-hand side
(defun rhs (x) (third x))      ; right-hand side
(defun infix (x)
  (if (consp x)
      (progn (princ "(")
             (infix (lhs x))
             (prin1 (op x))
             (infix (rhs x))
             (princ ")"))
      (prin1 x)) )

>(infix '(* (+ x y) z))
((X+Y)*Z)
Postorder
The Lisp function eval evaluates a symbolic expression.
We can write a version of eval using postorder traversal
of an expression tree with numeric leaf values. Postorder
follows the usual rule for evaluating function calls, i.e.,
arguments are evaluated before the function is called.
(defun myeval (x)
(if (numberp x)
x
(funcall (op x)
; execute the op
(myeval (lhs x))
(myeval (rhs x)) ) ) )
>(myeval '(* (+ 3 4) 5))
1> (MYEVAL (* (+ 3 4) 5))
2> (MYEVAL (+ 3 4))
3> (MYEVAL 3)
<3 (MYEVAL 3)
3> (MYEVAL 4)
<3 (MYEVAL 4)
<2 (MYEVAL 7)
2> (MYEVAL 5)
<2 (MYEVAL 5)
<1 (MYEVAL 35)
35
AVL Tree
An AVL Tree [3] is a binary tree that is approximately
height-balanced: left and right subtrees of any node differ
in height by at most 1.
Advantage: approximately O(log(n)) search and insert
time.
Disadvantage: complex code (120 - 200 lines).
http://www.cs.utexas.edu/users/novak/cgi/apserver.cgi
[3] G. M. Adelson-Velskii and E. M. Landis, Soviet Math. 3, 1259-1263, 1962; D. Knuth, The Art of
Computer Programming, vol. 3: Sorting and Searching, Addison-Wesley, 1973, section 6.2.3.
Tree Rotation
The basic idea upon which self-balancing trees are based
is tree rotation. Rotations change the height of subtrees
but do not affect the ordering of elements required for
binary search trees.
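A right rotation in code (a sketch with a minimal node type; field and method names are mine): the left child becomes the new root of the subtree, and its former right subtree is re-attached on the left of the old root. The in-order sequence of keys is unchanged:

```java
public class Rotation {
    static class Node {
        int key; Node left, right;
        Node(int key, Node left, Node right) {
            this.key = key; this.left = left; this.right = right;
        }
    }

    // Right rotation: subtree (y (x A B) C) becomes (x A (y B C)).
    static Node rotateRight(Node y) {
        Node x = y.left;
        y.left = x.right;   // subtree B moves across
        x.right = y;
        return x;           // x is the new root of this subtree
    }

    // In-order keys, to check that rotation preserves the ordering.
    static String inorder(Node t) {
        return t == null ? "" : inorder(t.left) + t.key + " " + inorder(t.right);
    }
}
```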
B-Tree
Suppose that a tree is too big to be kept in memory,
and thus must be kept on a disk. A disk has large
capacity (e.g. a terabyte, 10^12 bytes) but slow access
(e.g. 10 milliseconds). A computer can execute millions
of instructions in the time required for one disk access.
We would like to minimize the number of disk accesses
required to get to the data we want. One way to do this
is to use a tree with a very high branching factor.
Every interior node (except the root) has between
m/2 and m children. m may be large, e.g. 256, so
that an interior node fills a disk block.
Each path from the root to a leaf has the same length.
The interior nodes, containing keys and links, may be
a different type than the leaf nodes.
A link is a disk address.
The real data (e.g. customer record) is stored at the
leaves; this is sometimes called a B+ tree. Leaves will
have between l/2 and l data items.
B-Tree Implementation
Conceptually, an interior node is an array of n pointers
and n - 1 key values. A pointer is between the two key
values that bound the entries that it covers; we imagine
that the array is bounded by key values of -∞ and +∞.
(In practice, two arrays, pointers and key values, may be
used since they are of different types.)
Advantages of B-Trees
The desired record is found at a shallow depth (few
disk accesses). A tree with 256 keys per node can
index millions of records in 3 steps or 1 disk access
(keeping the root node and next level in memory).
In general, there are many more searches than
insertions.
Since a node can have a wide range of children, m/2
to m, an insert or delete will rarely go outside this
range. It is rarely necessary to rebalance the tree.
Inserting or deleting an item within a node is
O(blocksize), since on average half the block must
be moved, but this is fast compared to a disk access.
Rebalancing the tree on insertion is easy: if a node
would become over-full, break it into two half-nodes,
and insert the new key and pointer into its parent.
Or, if a leaf node becomes over-full, see if a neighbor
node can take some extra children.
In many cases, rebalancing the tree on deletion can
simply be ignored: it only wastes disk space, which is
cheap.
Quadtree
A quadtree is a tree in which each interior node has 4
descendants. Quadtrees are often used to represent
2-dimensional spatial data such as images or geographic
regions.
170
Image Quadtree
An image quadtree can represent an image more
compactly than a pixel representation if the image
contains homogeneous regions (which most real images
do). Even though the image is compressed, the value at
any point can be looked up quickly, in O(log n) time.
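The O(log n) point lookup can be sketched by descending one quadrant per level; the representation here (a leaf is an Integer region value, an interior node is an Object[4] of NW, NE, SW, SE quadrants) is an illustrative assumption, not the notes' data structure.

```java
// Sketch of looking up a pixel value in an image quadtree.
public class QuadtreeLookup {
    public static int lookup(Object node, int x, int y, int size) {
        while (node instanceof Object[]) {       // descend until a leaf
            size /= 2;
            int q = (y < size ? 0 : 2) + (x < size ? 0 : 1); // NW,NE,SW,SE
            if (x >= size) x -= size;            // coordinates within quadrant
            if (y >= size) y -= size;
            node = ((Object[]) node)[q];
        }
        return (Integer) node;                   // uniform region value
    }

    public static void main(String[] args) {
        // 2x2 image: only the NE pixel is 1
        Object tree = new Object[]{0, 1, 0, 0};
        System.out.println(lookup(tree, 1, 0, 2));  // 1
        System.out.println(lookup(tree, 0, 1, 2));  // 0
    }
}
```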
171
Intersection of Quadtrees
Quadtrees A and B can be efficiently intersected:
If A = 0 or B = 0, the result is 0.
If A = 1, the result is B.
If B = 1, the result is A.
Otherwise, make a new interior node whose values are
corresponding intersections of children of A and B.
Notice that the intersection can often reuse parts of the
input trees. Uses include:
Geospatial modeling, e.g. what is the intersection of
area of forecast rain with area of corn?
Graphics, games: view frustum culling
Collision detection
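The intersection rules above can be sketched directly; as before, the boolean-quadtree representation (Integer leaf 0/1, Object[4] interior node) and the name QuadtreeIntersect are assumptions for illustration.

```java
import java.util.Arrays;

// Sketch of intersecting two boolean quadtrees: a 0 leaf dominates,
// a 1 leaf yields the other tree (reusing its subtree), otherwise
// recurse on corresponding children.
public class QuadtreeIntersect {
    public static Object intersect(Object a, Object b) {
        if (a instanceof Integer) return ((Integer) a) == 0 ? 0 : b;
        if (b instanceof Integer) return ((Integer) b) == 0 ? 0 : a;
        Object[] ca = (Object[]) a, cb = (Object[]) b, r = new Object[4];
        for (int i = 0; i < 4; i++)
            r[i] = intersect(ca[i], cb[i]);
        return r;
    }

    public static void main(String[] args) {
        Object a = new Object[]{1, 0, 1, 1};
        Object b = new Object[]{1, 1, 0, 1};
        System.out.println(Arrays.toString((Object[]) intersect(a, b)));
        // [1, 0, 0, 1]
    }
}
```

Note how a result child is often just a pointer into one of the input trees, which is the structure sharing mentioned above.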
172
173
Uses of Quadtrees
Spatial indexing: O(log n) lookup of data values by
spatial position, while using less storage than an array.
Numerical algorithms: spatial algorithms can be
much more efficient by using lower-resolution data for
interactions that are far apart.
Graphics: reduce resolution except where user is
looking.
An octree is a 3-dimensional tree similar to a quadtree.
174
Be Extreme!
Sometimes an extreme solution to a problem may be best:
Buy enough memory to put everything in main
memory. A 32-bit PC can address 4 GB of memory,
or 200 bytes for every person in Texas.
Buy a lot of PCs and put part of the data on each
PC.
Use an SSN (9 digits = 1 Gig) as an array index to
index a big array stored on disk: no disk accesses to
find the disk address. Not all 9-digit numbers are valid
SSNs, so some disk space will be wasted; but who cares?
Buy a million PCs if that is what it takes to do your
computation.
175
Sparse Arrays
In a sparse array, most elements are zero or empty.
A two-dimensional array of dimension n takes O(n^2)
storage; that gets expensive fast as n increases. However,
a sparse array might have only O(n) nonzero elements.
What we would like to do is to store only the nonzero
elements in a lookup structure; if we look up an element
and don't find it, we can return zero.
For example, we could have an array of size n for the first
index; this could contain pointers to a linked list or tree
of values using the second index as the search key.
(defun sparse-aref (arr i j)
  (or (second (assoc j (aref arr i)))
      0))

(setq myarr (make-array '(10) :initial-contents
              '(nil nil ((3 77) (2 13)) nil ((5 23))
                nil nil ((2 47) (6 52)) nil nil)))
>(sparse-aref myarr 7 6)
52
>(sparse-aref myarr 7 4)
0
176
Binding Lists
A binding is a correspondence of a name and a value. In
Lisp, it is conventional to represent a binding as a cons:
(cons name value), e.g. (?X . 3); we will use a list,
(list name value), e.g. (?X 3) . We will use names
that begin with ? to denote variables.
A set of bindings is represented as a list, called an
association list, or alist for short. A new binding can
be added by:
(cons (list name value) binding-list )
A name can be looked up using assoc:
(assoc name binding-list )
(assoc '?y '((?x 3) (?y 4) (?z 5)))
= (?Y 4)
The value of the binding can be gotten using second:
(second (assoc '?y '((?x 3) (?y 4) (?z 5))))
= 4
180
Multiple Substitutions
The function (sublis alist form) makes multiple
substitutions simultaneously:
>(sublis '((rose peach) (smell taste))
         '(a rose by any other name
            would smell as sweet))
(A PEACH BY ANY OTHER NAME WOULD TASTE AS SWEET)
; substitute in z with bindings in alist
(defun sublis (alist z)
  (let (pair)
    (if (consp z)
        (cons (sublis alist (first z))
              (sublis alist (rest z)))
        (if (setq pair (assoc z alist))
            (second pair)
            z)) ))
181
Sublis in Java
public static Object
  sublis(Cons alist, Object tree) {
    if ( consp(tree) )
        return cons(sublis(alist, first((Cons) tree)),
                    (Cons) sublis(alist, rest((Cons) tree)));
    if ( tree == null ) return null;
    Cons pair = assoc(tree, alist);
    return ( pair == null ) ? tree
                            : second(pair); }
182
Tree Equality
It often is necessary to test whether two trees are equal,
even though they are in different memory locations. We
will say two trees are equal if:
the structures of the trees are the same
the leaf nodes are equal, using an appropriate test
(defun equal (pat inp)
  (if (consp pat)                        ; interior node?
      (and (consp inp)
           (equal (first pat) (first inp))
           (equal (rest pat) (rest inp)))
      (eql pat inp) ) )                  ; leaf node
>(equal '(+ a (* b c)) '(+ a (* b c)))
T
184
185
Tracing Equal
>(equal '(+ a (* b c)) '(+ a (* b c)))
1> (EQUAL (+ A (* B C)) (+ A (* B C)))
2> (EQUAL + +)
<2 (EQUAL T)
2> (EQUAL (A (* B C)) (A (* B C)))
3> (EQUAL A A)
<3 (EQUAL T)
3> (EQUAL ((* B C)) ((* B C)))
4> (EQUAL (* B C) (* B C))
5> (EQUAL * *)
<5 (EQUAL T)
5> (EQUAL (B C) (B C))
6> (EQUAL B B)
<6 (EQUAL T)
6> (EQUAL (C) (C))
7> (EQUAL C C)
<7 (EQUAL T)
7> (EQUAL NIL NIL)
<7 (EQUAL T)
<6 (EQUAL T)
<5 (EQUAL T)
<4 (EQUAL T)
4> (EQUAL NIL NIL)
<4 (EQUAL T)
<3 (EQUAL T)
<2 (EQUAL T)
<1 (EQUAL T)
T
186
Pattern Matching
Pattern matching is the inverse of substitution: it tests
to see whether an input is an instance of a pattern, and
if so, how it matches.
>(match '(go ?expletive yourself)
        '(go bleep yourself))
((?EXPLETIVE BLEEP) (T T))
189
Specifications of Match
Inputs: a pattern, pat, and an input, inp
Constants in the pattern must match the input
exactly.
Structure that is present in the pattern must also be
present in the input.
Variables in the pattern are symbols (strings) that
begin with ?
A variable can match anything, but it must do so
consistently.
The result of match is a list of bindings: null
indicates failure, not null indicates success.
The dummy binding (T T) is used to allow an empty
binding list that is not null.
190
Match Function
192
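The notes' match works on nested Lisp patterns; as a simplified sketch in Java, the idea can be shown on flat token sequences, where a token beginning with ? matches any one input token but must do so consistently. MatchDemo and the use of a Map for bindings are my assumptions, not the notes' implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified match: constants must match exactly, ?variables match any
// single token consistently. Returns the bindings, or null on failure.
public class MatchDemo {
    public static Map<String, String> match(String[] pat, String[] inp) {
        if (pat.length != inp.length) return null;   // structure must match
        Map<String, String> bindings = new HashMap<>();
        for (int i = 0; i < pat.length; i++) {
            if (pat[i].startsWith("?")) {
                String bound = bindings.get(pat[i]);
                if (bound == null) bindings.put(pat[i], inp[i]);
                else if (!bound.equals(inp[i])) return null; // inconsistent
            } else if (!pat[i].equals(inp[i])) return null;  // constant mismatch
        }
        return bindings;
    }

    public static void main(String[] args) {
        System.out.println(match(new String[]{"go", "?expletive", "yourself"},
                                 new String[]{"go", "bleep", "yourself"}));
    }
}
```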
Transformation by Patterns
Matching and substitution can be combined to transform
an input from a pattern-pair: a list of an input pattern
and an output pattern.
(defun transform (pattern-pair input)
(let (bindings)
(if (setq bindings
(match (first pattern-pair)
input))
(sublis bindings
(second pattern-pair))) ))
>(transform '((I aint got no ?x)
              (I do not have any ?x))
            '(I aint got no bananas))
(I DO NOT HAVE ANY BANANAS)
193
Transformation Patterns
Optimization:
(defpatterns 'opt
  '( ((+ ?x 0)            ?x)
     ((* ?x 0)            0)
     ((* ?x 1)            ?x)
     ((setq ?x (+ ?x 1))  (incf ?x)) ))
Language translation:
(defpatterns lisptojava
( ((aref ?x ?y)
((incf ?x)
((setq ?x ?y)
((+ ?x ?y)
((= ?x ?y)
((and ?x ?y)
((if ?c ?s1 ?s2)
194
Array as a Map
Suppose that:
The KeyType is an integer.
The size of the domain (largest possible key minus
smallest) is not too large.
In this case, the best map to use is an array:
O(1) and very fast.
very simple code: a[i]
no space required for keys (key is the array index).
When it is applicable, use an array in your programs
rather than a switch statement:
runs faster
uses less storage
better software engineering
200
switch (month) {
case 1: System.out.println("January"); break;
case 2: System.out.println("February"); break;
case 3: System.out.println("March"); break;
...
default: System.out.println("Invalid month."); break;
}
Better:
System.out.println(months[month]);
Sadly, the examples of bad code here are from Oracle's Java Tutorial web site.
201
Initializing Array
Java makes it easy to initialize an array with constants:
String[] months = {"", "January", "February",
"March", ... };
String[] days = {"Monday", "Tuesday",
"Wednesday", ... };
String[][] phone = {{"Vader",
"555-1234"},
{"Skywalker", "472-2123"},
... }
202
Key of a Map
Should you use a person's name as a key?
10,000 people named Wang Wang in Beijing.
100,000 people named Ivan Ivanov Ivanovich in
Russia.
18 people (both sexes) with the same Chinese name
at UT, 4 of them in CS and Computational Math.
Clearly, names do not make unique keys.
203
Hashing
We have seen that an array is the best way to implement
a map if the key type is integer and the range of possible
keys is not too large: access to an array is O(1).
What if the key type does not fit these criteria? The idea
of hashing is to use a hash function
h : KeyType -> integer
Convert a key to an integer so that it can be used as
an array index.
Randomize, in the sense that two different keys
usually produce different hash values, and hash values
are equally likely.
h should not be too expensive to compute.
204
Hash Function
We want a hash function to be easy to compute and to
avoid clustering , the hashing of many keys to the same
value.
One hash function that is usually good is to treat the key
(e.g., a string of characters) as an integer and find the
remainder modulo a prime p:
int hash = key % p;
Since this produces an integer from 0 to p - 1, we make
the array size p.
The Java .hashCode() for String is roughly:
h(s) = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
from Wikipedia.
205
Java .hashCode()
The Java .hashCode() returns int, which could be
positive or negative. To use the .hashCode() as an
array index, make it positive (e.g. with Math.abs())
and make it modulo the table size.
The requirements of a .hashCode() include:
1. When called on the same object, .hashCode() must
always return the same value during a given execution
of a program (but could be different on different
executions).
2. If two objects are .equals(), .hashCode() must
return the same value for both.
The .hashCode() that is inherited from Object may use
the memory address of an object as the hash code; this
will differ between executions. Therefore, it is preferable
to write your own .hashCode() for application data
structures.
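The String formula and the abs-mod indexing step above can be sketched together; HashDemo and its method names are illustrative.

```java
// The String.hashCode() polynomial, computed by Horner's rule, plus the
// usual way to turn a hash code into a table index.
public class HashDemo {
    public static int stringHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++)
            h = 31 * h + s.charAt(i);     // h = s[0]*31^(n-1) + ... + s[n-1]
        return h;
    }

    public static int index(Object key, int tableSize) {
        // make positive and in range; note Math.abs fails for
        // Integer.MIN_VALUE, so production code often masks the sign bit
        return Math.abs(key.hashCode() % tableSize);
    }

    public static void main(String[] args) {
        System.out.println(stringHash("abc"));   // 96354
        System.out.println("abc".hashCode());    // 96354, same value
        System.out.println(index("abc", 101));
    }
}
```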
206
Exclusive OR
An important Boolean function is the bitwise exclusive
or function ⊕:

a  b  a ⊕ b
0  0    0
0  1    1
1  0    1
1  1    0
207
Uses of Exclusive OR
Hashing: hash functions are closely related to
encryption.
Graphics: XORing a picture onto a background paints
the picture, but XORing it a second time erases it and
restores the background.
This is especially good for animation and for cursors,
which move across a background.
Linked lists: both forward and backward pointers can
be stored in the same space using next ⊕ previous.
Since we will be coming from either next or previous,
we can XOR it with the link value to get the other one.
RAID6: a disk that stores (a ⊕ b) can back up two
other disks, a and b. If a fails, b ⊕ (a ⊕ b) = a; if b
fails, a ⊕ (a ⊕ b) = b.
Wireless networking: if message (a ⊕ b) is broadcast,
those who know a can get b, and those who know b
can get a.
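The recovery identities above are easy to check in code:

```java
// XOR identities behind RAID parity and XOR-linked lists:
// b ^ (a ^ b) == a, and link ^ prev == next when link = prev ^ next.
public class XorDemo {
    public static void main(String[] args) {
        int a = 0b1100, b = 0b1010;
        int parity = a ^ b;                       // parity "disk"
        System.out.println((parity ^ b) == a);    // recover a: true
        System.out.println((parity ^ a) == b);    // recover b: true

        int prev = 1000, next = 2000;
        int link = prev ^ next;                   // one field, two pointers
        System.out.println((link ^ prev) == next);  // true
        System.out.println((link ^ next) == prev);  // true
    }
}
```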
208
210
Collisions
Even if the hash function is randomizing, it is inevitable
that sometimes different keys will hash to the same value;
this is called a collision.
The load factor λ is the ratio of hash table entries to
table size. Obviously, the higher λ is, the greater the
probability of a collision.
There are two basic ways to handle a collision:
Rehash the key using a different hash function.
The simplest method of rehashing is linear probing,
simply adding 1 to the previous hash value (modulo
table size). Other options are quadratic probing and
double hashing, using a different hash function. With
rehashing, access to the hash table is efficient if λ
is kept below 0.7; this wastes nearly half of the storage.
Let each table entry point to a bucket of entries, an
auxiliary data structure such as a linked list.
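Linear probing, the simplest rehashing method above, can be sketched as follows; LinearProbing is an illustrative toy (no deletion, and the table must never fill completely or the probe loop would not terminate).

```java
// Minimal hash table with linear probing: on a collision, try the
// next slot (mod table size) until an empty one is found.
public class LinearProbing {
    static final int SIZE = 11;                 // prime table size
    static Integer[] table = new Integer[SIZE];

    public static void insert(int key) {
        int i = Math.abs(key % SIZE);
        while (table[i] != null) i = (i + 1) % SIZE;  // probe forward
        table[i] = key;
    }

    public static boolean contains(int key) {
        int i = Math.abs(key % SIZE);
        while (table[i] != null) {              // stop at first empty slot
            if (table[i] == key) return true;
            i = (i + 1) % SIZE;
        }
        return false;
    }

    public static void main(String[] args) {
        insert(7); insert(18); insert(29);      // all hash to 7: probe to 8, 9
        System.out.println(contains(29));       // true
        System.out.println(contains(40));       // false
    }
}
```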
211
Rehashing
If a fixed-size hash table is used with some secondary
hash function to deal with collisions, the number of
comparisons will become larger as λ increases, with a
knee around λ = 0.7 .
When this happens, it will be necessary to expand the
array by:
making a new array, larger by a factor of 1.5 or 2.
rehashing all elements of the existing array into the
new array.
replacing the old array by the new one, letting the old
one be garbage-collected.
The rehashing process is O(n), but only has to be done
once every n times, so its amortized cost is O(1).
214
215
Extendible Hashing
Extendible hashing can be used when a table is too large
to fit in main memory and must be placed on disk.
Extendible hashing is similar to the idea of a B-tree.
Instead of having key ranges and disk addresses at the
top level, we can hash the key to get an index into a table
of disk addresses.
216
Uses of Hashing
There are many applications of hashing:
A compiler keeps a symbol table containing all the
names declared within a program, together with
information about the objects that are named.
A hash table can be used to map names to numbers.
This is useful in graph theory and in networking,
where a domain name is mapped to an IP address:
willie.cs.utexas.edu = 128.83.130.16
Programs that play games such as chess keep hash
tables of board positions that have been seen and
evaluated before. In general, hashing is a good way to
look up items that do not have a natural sort order.
The Rabin-Karp string search algorithm uses hashing
to tell whether strings of interest are present in
the text being searched. This algorithm is used in
plagiarism detection and DNA matching.
If an expensive function may be called repeatedly
with the same argument value, pairs of (argument,
result) can be saved in a hash table, with argument
as the key, and result can be reused. This is called
memoization.
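Memoization as described above is a few lines with a hash table; the expensive function here is a stand-in, and MemoDemo is an illustrative name.

```java
import java.util.HashMap;
import java.util.Map;

// Memoization: cache (argument, result) pairs of an expensive function
// in a hash table keyed on the argument, and reuse the saved result.
public class MemoDemo {
    static Map<Integer, Long> cache = new HashMap<>();
    static int calls = 0;

    static long slowSquare(int n) {    // stand-in for an expensive function
        calls++;
        return (long) n * n;
    }

    public static long memoSquare(int n) {
        return cache.computeIfAbsent(n, MemoDemo::slowSquare);
    }

    public static void main(String[] args) {
        memoSquare(12); memoSquare(12); memoSquare(12);
        System.out.println(calls);     // 1: computed once, reused twice
    }
}
```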
217
Randomization
Hashing is our first example of randomized algorithms,
since a hash function is a somewhat random function of
the key.
Randomized algorithms can be used to avoid clustering.
As an example, suppose that a table contains an equal
number of a and b. The worst case of searching for a
b with any predefined strategy is O(n); for example, a
linear search would be O(n) if all the a entries are at the
front.
If we use a randomized search, though, the expected time
is O(1) and the probability of O(n) search is very small.
The expected time forms a geometric series:
1/2 + 1/4 + 1/8 + 1/16 + ... = 1
Random delays for retry can prevent collisions.
Randomized choice of a communication path or server
can avoid failures.
MapReduce randomizes using hashing for load
balancing.
Spies randomize their route to work to avoid assassins.
218
Priority Queue
A priority queue is conceptually an array of queues,
indexed by priority. To remove an item from the priority
queue, the first item of the highest-priority non-empty
queue is removed and returned.
In an operating system, there often is a fixed and small
number of priorities. A high-priority process, such as
responding to a device interrupt, can interrupt a
lower-priority process such as a user program. The processes
that are ready to run are kept on a priority queue. The
high-priority processes are short but need fast service.
Whenever a process ceases execution, the operating
system removes the next highest-priority process from the
ready queue and runs it.
If we use an array of circular queues or two-pointer
queues, both insertion and removal are O(1).
We will assume that the highest-priority item is the one
with the lowest priority number. The operations on the
priority queue are insert and deleteMin. (We will
discuss a min queue; a max queue could be implemented
similarly.)
219
220
Binary Heap
A binary heap is a binary tree that is mapped onto an
array. Because of the mapping that is used, links between
nodes are easily computed without storing them.
A heap has two fundamental properties:
Structure Property: The heap is a complete
binary tree, meaning that all nodes are filled except
for the right-hand part of the bottom row.
Heap-Order Property: the key of each node is less
than or equal to the keys of its children (for a min
heap), so the minimum is at the root.
221
Mapping to Array
A heap is a gedanken tree, but it is actually stored in an
array.
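The standard array mapping can be sketched as index arithmetic: with the root at index 0, the children of node i are at 2i+1 and 2i+2, and its parent is at (i-1)/2 (some texts use 1-based indexing, with children at 2i and 2i+1). HeapIndexDemo is an illustrative name.

```java
// Links of a complete binary tree computed from array indices,
// so no pointers need to be stored.
public class HeapIndexDemo {
    public static int left(int i)   { return 2 * i + 1; }
    public static int right(int i)  { return 2 * i + 2; }
    public static int parent(int i) { return (i - 1) / 2; }

    public static void main(String[] args) {
        System.out.println(left(0));    // 1
        System.out.println(right(1));   // 4
        System.out.println(parent(4));  // 1
    }
}
```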
222
PriorityQueue in Java
The Java library provides a generic class
PriorityQueue:
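A minimal usage example: java.util.PriorityQueue is a min-heap by natural ordering, so add is insert and remove (or poll) is deleteMin.

```java
import java.util.PriorityQueue;

// Using the library PriorityQueue as a min queue.
public class PQDemo {
    public static void main(String[] args) {
        PriorityQueue<Integer> pq = new PriorityQueue<>();
        pq.add(32); pq.add(14); pq.add(48);   // insert
        System.out.println(pq.remove());      // 14: deleteMin
        System.out.println(pq.remove());      // 32
    }
}
```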
226
227
Sorting
Sorting a set of items into order is a very common task
in Computer Science.
There are a number of bad algorithms that sort in O(n^2)
time; we will not discuss those much. We have already
seen algorithms that do sorting in O(n log(n)) time, and
in fact it can be proved that that is the best possible for
sorting based on comparisons.
An internal sort is performed entirely in main memory.
If the set of items is too large to fit in memory, an external
sort using disk or other external storage can be done.
Sorting is based on comparison between items. In
general, complex data can be compared and sorted in
many ways. It is common to furnish a comparison
function to the sorting program:
>(sort '(32 29 62 75 48 14 80 98 28 19) '<)
(14 19 28 29 32 48 62 75 80 98)
Insertion Sort
Insertion sort is similar to the way people sort playing
cards: given some cards that are sorted and a new card
to be added, make a hole where the new card should go
and put the new card into the hole.
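The card-sorting idea above translates directly into code; InsertionSort is an illustrative sketch.

```java
import java.util.Arrays;

// Insertion sort: slide larger sorted elements right to make a hole,
// then drop the new element into the hole. O(n^2), in-place, stable.
public class InsertionSort {
    public static void sort(int[] a) {
        for (int i = 1; i < a.length; i++) {
            int x = a[i], j = i - 1;
            while (j >= 0 && a[j] > x) {    // make a hole for x
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = x;                   // put x into the hole
        }
    }

    public static void main(String[] args) {
        int[] a = {32, 29, 62, 75, 48, 14};
        sort(a);
        System.out.println(Arrays.toString(a)); // [14, 29, 32, 48, 62, 75]
    }
}
```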
229
230
Heapsort
We have already seen that a Heap can store a set of items
and return the smallest one, in O(log(n)) time per item.
Therefore, an easy way to sort would be to put all the
items into a heap, then remove items one at a time with
deleteMin and put them into the output array.
The problem with this easy method is that it uses an extra
array. However, we can do it using the original array:
1. Make the original array into a max heap. This can be
done in O(n) time.
2. For the size of the heap,
(a) Remove the largest item from the heap. This
makes the heap smaller, so that there is a hole
at the end of the heap.
(b) Put the largest item into the hole.
This gives a sort in O(n log(n)) time. Heapsort is in-place, but not stable.
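The two steps above (build a max heap in the array, then repeatedly move the largest item into the hole that opens at the end) can be sketched as follows; HeapSort and siftDown are illustrative names.

```java
import java.util.Arrays;

// In-place heapsort: build a max heap, then swap the root (largest)
// into the hole at the end as the heap shrinks.
public class HeapSort {
    static void siftDown(int[] a, int i, int size) {
        while (2 * i + 1 < size) {
            int child = 2 * i + 1;
            if (child + 1 < size && a[child + 1] > a[child]) child++;
            if (a[i] >= a[child]) break;         // heap order restored
            int t = a[i]; a[i] = a[child]; a[child] = t;
            i = child;
        }
    }

    public static void sort(int[] a) {
        for (int i = a.length / 2 - 1; i >= 0; i--)   // build max heap: O(n)
            siftDown(a, i, a.length);
        for (int end = a.length - 1; end > 0; end--) {
            int t = a[0]; a[0] = a[end]; a[end] = t;  // largest into the hole
            siftDown(a, 0, end);                      // restore smaller heap
        }
    }

    public static void main(String[] args) {
        int[] a = {32, 29, 62, 75, 48, 14};
        sort(a);
        System.out.println(Arrays.toString(a)); // [14, 29, 32, 48, 62, 75]
    }
}
```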
231
Merge Sort
We have already seen Merge Sort with linked lists. The
idea with arrays is the same: break the array in half, sort
the halves, and merge the halves into a sorted whole.
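A sketch of merge sort on arrays, using one temporary array for the merge step; MergeSort is illustrative.

```java
import java.util.Arrays;

// Merge sort: sort each half recursively, then merge the sorted halves.
// O(n log(n)) and stable, but needs O(n) extra space.
public class MergeSort {
    public static void sort(int[] a) {
        sort(a, new int[a.length], 0, a.length - 1);
    }

    static void sort(int[] a, int[] tmp, int lo, int hi) {
        if (lo >= hi) return;
        int mid = (lo + hi) / 2;
        sort(a, tmp, lo, mid);
        sort(a, tmp, mid + 1, hi);
        int i = lo, j = mid + 1, k = lo;     // merge the halves into tmp
        while (i <= mid && j <= hi)
            tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
        while (i <= mid) tmp[k++] = a[i++];
        while (j <= hi)  tmp[k++] = a[j++];
        System.arraycopy(tmp, lo, a, lo, hi - lo + 1);
    }

    public static void main(String[] args) {
        int[] a = {32, 29, 62, 75, 48, 14};
        sort(a);
        System.out.println(Arrays.toString(a)); // [14, 29, 32, 48, 62, 75]
    }
}
```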
232
Quicksort
As its name suggests, Quicksort is one of the better sort
algorithms; it is O(n log(n)) (though it can be O(n2) in
the worst case). In practice it is significantly faster than
other algorithms.
Quicksort is a divide-and-conquer algorithm that chooses
a pivot value, then partitions the array into two sections
with the pivot in its final position roughly in the middle:
elements <= pivot | pivot | elements > pivot
The outside sections are then sorted recursively.
The partitioning can be done in-place, making Quicksort
an in-place sort. Since partitioning is done in-place by
swapping elements, Quicksort is not stable.
237
Quicksort Code9
public static void quicksort(
          Integer[] a, int lo, int hi ) {
    int i=lo, j=hi; Integer h;
    Integer pivot = a[(lo+hi)/2];
    do                                    // partition
      { while (a[i] < pivot) i++;         // move i
        while (a[j] > pivot) j--;         // move j
        if (i <= j)
          { h = a[i]; a[i] = a[j]; a[j] = h;  // swap
            i++; j--; } }
    while (i <= j);
    if (lo < j) quicksort(a, lo, j);      // recursion
    if (i < hi) quicksort(a, i, hi); }

9This version of Quicksort is from H. W. Lang, Fachhochschule Flensburg,
http://www.iti.fh-flensburg.de/lang/algorithmen/sortieren/quick/quicken.htm
238
Partitioning

32 29 62 75 48 14 80 98 28 19   initial: i = lo = 0, j = hi = 9, pivot = 48
32 29 62 75 48 14 80 98 28 19   move i and j: i = 2, j = 9
32 29 19 75 48 14 80 98 28 62   swap; i = 3, j = 8
32 29 19 28 48 14 80 98 75 62   swap; i = 4, j = 7
32 29 19 28 48 14 80 98 75 62   move: i = 4, j = 5
32 29 19 28 14 48 80 98 75 62   swap; i = 6, j = 4: i > j, partition done
239
Quicksort Example

Quicksort:  lo 0 hi 9   [32, 29, 62, 75, 48, 14, 80, 98, 28, 19]
 Partition: i 5  j 4    [32, 29, 19, 28, 14, 48, 80, 98, 75, 62]
 Quicksort: lo 0 hi 4   [32, 29, 19, 28, 14]
  Partition: i 2 j 1    [14, 19, 29, 28, 32]
  Quicksort: lo 0 hi 1  [14, 19]
   Partition: i 1 j -1  [14, 19]
   sorted: lo 0 hi 1    [14, 19]
  Quicksort: lo 2 hi 4  [29, 28, 32]
   Partition: i 3 j 2   [28, 29, 32]
   Quicksort: lo 3 hi 4 [29, 32]
    Partition: i 4 j 2  [29, 32]
    sorted: lo 3 hi 4   [29, 32]
   sorted: lo 2 hi 4    [28, 29, 32]
  sorted: lo 0 hi 4     [14, 19, 28, 29, 32]
 Quicksort: lo 5 hi 9   [48, 80, 98, 75, 62]
  Partition: i 9 j 8    [48, 80, 62, 75, 98]
  Quicksort: lo 5 hi 8  [48, 80, 62, 75]
   Partition: i 8 j 7   [48, 75, 62, 80]
   Quicksort: lo 5 hi 7 [48, 75, 62]
    Partition: i 7 j 6  [48, 62, 75]
    Quicksort: lo 5 hi 6 [48, 62]
     Partition: i 6 j 4 [48, 62]
     sorted: lo 5 hi 6  [48, 62]
    sorted: lo 5 hi 7   [48, 62, 75]
   sorted: lo 5 hi 8    [48, 62, 75, 80]
  sorted: lo 5 hi 9     [48, 62, 75, 80, 98]
 sorted: lo 0 hi 9      [14, 19, 28, 29, 32, 48, 62, 75, 80, 98]
240
Quicksort Performance
Quicksort usually performs quite well; we want to avoid
the O(n2) worst case and keep it at O(n log(n)).
The choice of pivot is important; choosing the first
element is a very bad choice if the array is almost
sorted. Choosing the median of the first, middle, and
last elements makes it very unlikely that we will get a
bad choice.
IntroSort (introspective sort) changes from Quicksort
to a different sort for some cases:
Change to Insertion Sort when the size becomes small
(<= 20), where Insertion Sort may be more efficient.
Change to Heapsort after a certain depth of recursion,
which can protect against the unusual O(n2) worst
case.
Quicksort is easily parallelized, and no synchronization is
required: each subarray can be sorted independently.
241
Radix Sort
Radix sort is an old method that is worth knowing
because it can be used as an external sort for data sets
that are too large to fit in memory.
We will assume that we are sorting integers in decimal
notation; in modern practice, groups of bits would be
more sensible.
The idea of radix sort is to sort the input into bins based
on the lowest digit; then combine the bins in order and
sort on the next-highest digit, and so forth.
The bins can be mostly on external storage media, so that
the size of the data to be sorted can exceed the size of
memory.
The sorting process can also be parallelized.
The performance of radix sort is O(n * k) where k is
the key length. It makes sense to think of k as being
approximately log(n), but if there are many items for each
key, radix sort would be more efficient than O(n log(n)).
Radix sort is stable.
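An in-memory sketch of the bin-by-digit process on decimal digits (an external version would put the bins on disk); RadixSort is illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Radix sort: distribute into bins on the lowest digit, recombine the
// bins in order, then repeat on the next digit. Each pass is stable,
// so the order established by earlier digits is preserved.
public class RadixSort {
    public static void sort(int[] a, int digits) {
        for (int d = 0, div = 1; d < digits; d++, div *= 10) {
            List<List<Integer>> bins = new ArrayList<>();
            for (int b = 0; b < 10; b++) bins.add(new ArrayList<>());
            for (int x : a)
                bins.get((x / div) % 10).add(x);   // bin on current digit
            int k = 0;
            for (List<Integer> bin : bins)         // recombine in bin order
                for (int x : bin) a[k++] = x;
        }
    }

    public static void main(String[] args) {
        int[] a = {32, 29, 62, 75, 48, 14, 80, 98, 28, 19};
        sort(a, 2);
        System.out.println(Arrays.toString(a));
        // [14, 19, 28, 29, 32, 48, 62, 75, 80, 98]
    }
}
```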
242
Sorted into bins on lowest digit:
0: (80)   1: ()        2: (32 62)   3: ()     4: (14)
5: (75)   6: ()        7: ()        8: (48 98 28)   9: (29 19)

Sorted into bins on second digit:
0: ()     1: (14 19)   2: (28 29)   3: (32)   4: (48)
5: ()     6: (62)      7: (75)      8: (80)   9: (98)
244
Graphs
A graph G = (V, E) has a set of vertices or nodes V
and a set of edges or arcs or links E, where each edge
connects two vertices. We write this mathematically as
E ⊆ V × V, where × is called the Cartesian product of
two sets. We can write an edge as a pair (v1, v2), where
v1 and v2 are each a vertex.
245
Examples of Graphs
There are many examples of graphs:
The road network forms a graph, with cities as vertices
and roads as edges. The distance between cities can
be used as the cost of an edge.
The airline network is a graph, with airports as
vertices and airline flights as edges.
Communication networks such as the Internet:
computers and switches are nodes, connections
between them are links.
Social networks such as the graph of people who call
each other on the telephone, or friends on Facebook:
people are nodes and there are links to the people they
communicate with.
Distribution networks that model the flow of goods:
an oil terminal is a node, and an oil tanker or pipeline
is a link.
Biological networks model the interactions of
biological systems that communicate via messenger
molecules such as hormones.
246
247
Graph Representations
We want the internal representation of a graph to be one
that can be efficiently manipulated.
If the external representation of a node is a string, such
as a city name, we can use a Map or symbol table to map
it to an internal representation such as:
a node number, convenient to access arrays of
information about the node
a pointer to a node object.
A graph is called dense if |E| = O(|V|^2); if |E| is less,
the graph is called sparse.
All real-world graphs are sparse unless they are
small.
248
Adjacency List
In the adjacency list representation of a graph, each node
has a list of nodes that are adjacent to it, i.e. connected
by an edge. A linked list is a natural representation.
1: (5 2)
2: (3 5 1)
3: (4 2)
4: (6 3 5)
5: (4 2 1)
6: (4)
249
Adjacency Matrix
In the adjacency matrix representation of a graph, a
Boolean matrix contains a 1 in position (i, j) iff there
is a link from vi to vj , otherwise 0.
    1  2  3  4  5  6
1   0  1  0  0  1  0
2   1  0  1  0  1  0
3   0  1  0  1  0  0
4   0  0  1  0  1  1
5   1  1  0  1  0  0
6   0  0  0  1  0  0
250
Implicit Graphs
Some graphs must be represented implicitly because they
cannot be represented explicitly. For example, the graph
of all possible chess positions is larger than the number of
elementary particles in the universe. In such cases, only
part of the graph will be explicitly considered, such as
the chess positions that can be reached from the current
position in 7 moves or less.
251
Topological Sort
Some graphs specify an order in which things must
be done; a common example is the course prerequisite
structure of a university.
A topological sort orders the vertices of a directed acyclic
graph (DAG) so that if there is a path from vertex vi to
vj , vj comes after vi in the ordering. A topological sort is
not necessarily unique. An example of a topological sort
is a sequence of taking classes that is legal according to
the prerequisite structure.
An easy way to find a topological sort is:
initialize a queue to contain all vertices that have no
incoming arcs.
While the queue is not empty,
remove a vertex from the queue,
put it into the sort order
remove all of its arcs
If the target of an arc now has zero incoming arcs,
add the target to the queue.
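The queue-based method above (Kahn's algorithm) can be sketched on adjacency lists with in-degree counts; TopoSort is an illustrative name.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Topological sort of a DAG: start from vertices with no incoming arcs,
// remove each vertex's arcs, and enqueue targets whose in-degree drops to 0.
public class TopoSort {
    public static List<Integer> topoSort(List<List<Integer>> adj) {
        int n = adj.size();
        int[] indegree = new int[n];
        for (List<Integer> targets : adj)
            for (int t : targets) indegree[t]++;
        Queue<Integer> queue = new ArrayDeque<>();
        for (int v = 0; v < n; v++)
            if (indegree[v] == 0) queue.add(v);   // no incoming arcs
        List<Integer> order = new ArrayList<>();
        while (!queue.isEmpty()) {
            int v = queue.remove();
            order.add(v);                         // put v into the sort order
            for (int t : adj.get(v))              // remove v's arcs
                if (--indegree[t] == 0) queue.add(t);
        }
        return order;   // shorter than n if the graph had a cycle
    }

    public static void main(String[] args) {
        // prerequisites: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3
        List<List<Integer>> adj = List.of(
            List.of(1, 2), List.of(3), List.of(3), List.of());
        System.out.println(topoSort(adj));  // [0, 1, 2, 3]
    }
}
```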
252
253
PERT Chart
PERT, for Program Evaluation and Review Technique,
is a project management method using directed graphs.
254
Dijkstra's Algorithm
Dijkstra's algorithm finds the shortest path to all nodes
in a weighted graph from a specified starting node.
Dijkstra's algorithm is a good example of a greedy
algorithm, one that tries to follow the best-looking
possibility at each step.
The basic idea is simple:
Set the cost of the start node to 0, and all other nodes
to ∞.
Let the current node be the lowest-cost node that has
not yet been visited. Mark it as visited. For each edge
from the current node, if the sum of the cost of the
current node and the cost of the edge is less than the
cost of the destination node,
Update the cost of the destination node.
Set the parent of the destination node to be the
current node.
When we get done visiting all nodes, each node has a cost
and a path back to the start; we can reverse that to get
a forward path.
257
Dijkstra's Algorithm
public void dijkstra( Vertex s ) {
for ( Vertex v : vertices ) {
v.visited = false;
v.cost = 999999; }
s.cost = 0;
s.parent = null;
PriorityQueue<Vertex>
fringe = new PriorityQueue<Vertex>(20,
new Comparator<Vertex>() {
public int compare(Vertex i, Vertex j) {
return (i.cost - j.cost); }});
fringe.add(s);
while ( ! fringe.isEmpty() ) {
Vertex v = fringe.remove(); // lowest-cost
if ( ! v.visited )
{ v.visited = true;
for ( Edge e : v.edges )
{ int newcost = v.cost + e.cost;
if ( newcost < e.target.cost )
{ e.target.cost = newcost;
e.target.parent = v;
fringe.add(e.target); } } } } }
258
Prim's Algorithm
Prim's Algorithm for finding a minimum spanning tree is
similar to Dijkstra's Algorithm for shortest paths. The
basic idea is to start with a part of the tree (initially, one
node of the graph) and add the lowest-cost arc between
the existing tree and another node that is not part of the
tree.
As with Dijkstra's Algorithm, each node has a parent
node pointer and a cost, which is the least cost of an arc
connecting to the tree that has been found so far.
261
Prim's Algorithm
public void prim( Vertex s ) {
for ( Vertex v : vertices ) {
v.visited = false;
v.parent = null;
v.cost = 999999; }
s.cost = 0;
PriorityQueue<Vertex>
fringe = new PriorityQueue<Vertex>(20,
new Comparator<Vertex>() {
public int compare(Vertex i, Vertex j) {
return (i.cost - j.cost); }});
fringe.add(s);
while ( ! fringe.isEmpty() ) {
Vertex v = fringe.remove(); // lowest-cost
if ( ! v.visited )
{ v.visited = true;
for ( Edge e : v.edges )
{ if ( (! e.target.visited) &&
( e.cost < e.target.cost ) )
{ e.target.cost = e.cost;
e.target.parent = v;
fringe.add(e.target); } } } } }
262
263
Directed Search
Dijkstra's algorithm finds the shortest path to all nodes
of a graph from a given starting node. If the graph is
large and we only want a path to a single destination,
this is inefficient.
We might have some heuristic information that gives an
estimate of how close a given node is to the goal.
Using the heuristic, we can search the more promising
parts of the graph and ignore the rest.
264
Hill Climbing
A strategy for climbing a hill in a fog is to move upward.
A heuristic that estimates distance to the goal can be
used to guide a hill-climbing search. A discrete
depth-first search guided by such a heuristic is called greedy
best-first search; it can be very efficient. For example,
in route finding, hill climbing could be implemented by
selecting the next city that is closest to the goal.
Unfortunately, hill-climbing sometimes gets into trouble
on a local maximum:
265
Heuristic Search: A*
The A* algorithm uses both actual distance (as in
Dijkstra's algorithm) and a heuristic estimate of the
remaining distance. It is both very efficient and able to
overcome local maxima.
A* chooses the next node based on lowest estimated total
cost of a path through the node.
Estimated Total Cost f (n) = g(n) + h(n)
g(n) = Cost from Start to n [known]
h(n) = Cost from n to Goal [estimated]
Start --- g(n) ---> n --- h(n)? ---> Goal
A* Algorithm
The A* algorithm is similar to Dijkstra's algorithm,
except that it puts entries into the priority queue based
on the f value (estimated total cost) rather than g value
(lowest cost found so far).
Note that A* does not recognize that a node is a goal
until the node is removed from the priority queue.
Theorem: If the h function does not over-estimate the
distance to the goal, A* finds an optimum path.
A* will get the same result as Dijkstra's algorithm while
doing less work, since Dijkstra searches all nodes while A*
searches only nodes that may be on a path to the goal.
267
A* Algorithm Example
A* finds the shortest path to a single goal node from a
given start. A* will do less work than Dijkstra because it
focuses its search on the goal using the heuristic.
271
Name      Result                 Sort Criterion             Formula
Dijkstra  Shortest Path          Total cost                 d + e
          to all nodes           from start
          from start node        to this node
Prim      Minimum Spanning       Cost of                    e
          Tree to connect        connecting edge
          all nodes              to node
A*        Shortest Path          Estimated total            d + e + h
          to goal node           cost from start
          from start node        through this
                                 node to goal
272
Mapping
A mapping M : D → R specifies a correspondence
between elements of a domain D and a range R.
If each element of D maps to exactly one element of R,
and that element of R is mapped to only by that one element
of D, the mapping is one-to-one or injective .
If every element of R is mapped to by some element of
D, the mapping is onto or surjective .
A mapping that is both one-to-one and onto is bijective.
273
Implementation of Mapping
We have seen several ways in which a mapping can be
implemented:
A function such as sqrt maps from its argument to
the target value.
If the domain is a finite, compact set of integers, we
can store the target values in an array and look them
up quickly.
If the domain is a finite set, we can use a lookup table
such as an association list, TreeMap or HashMap.
If the domain is a finite set represented as an array or
linked list, we can create a corresponding array or list
of target values.
274
Functional Programming
A functional program is one in which:
all operations are performed by functions
a function does not modify its arguments or have
side-effects (such as printing, setting the value of a global
variable, writing to disk).
A subset of Lisp, with no destructive functions, is an
example of a functional language.
(defun hypotenuse (x y)
(sqrt (+ (expt x 2)
(expt y 2))) )
Functional programming is easily adapted to parallel
programming, since the program can be modeled as flow
of data through functions that could be on different
machines.
275
276
Computation as Simulation
It is useful to view computation as simulation, cf.:
isomorphism of semigroups.10
Given two semigroups G1 = [S, ∘] and G2 =
[T, ∗], an invertible function φ : S → T is said
to be an isomorphism between G1 and G2 if, for
every a and b in S, φ(a ∘ b) = φ(a) ∗ φ(b)
from which:
a ∘ b = φ⁻¹(φ(a) ∗ φ(b))
(defun string+ (x y)
  (princ-to-string              ; phi inverse
    (+                          ; + in model space
      (read-from-string x)      ; phi
      (read-from-string y))))   ; phi
Preparata, F. P. and Yeh, R. T., Introduction to Discrete Structures, Addison-Wesley, 1973, p. 129.
277
Mapping in Lisp
Lisp has several functions that compute mappings from
a linked list. The one we have seen is mapcar, which
makes a new list whose elements are obtained by applying
a specified function to each element (car or first) of the
input list(s).
>(defun square (x) (* x x))
>(mapcar 'square '(1 2 3 17))
(1 4 9 289)
>(mapcar '+ '(1 2 3 17) '(2 4 6 8))
(3 6 9 25)
>(mapcar '> '(1 2 3 17) '(2 4 6 8))
(NIL NIL NIL T)
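mapcar over a single list is itself a simple recursion on the list; a minimal one-list version (the name my-mapcar is hypothetical, and this sketch omits the library version's multiple-list handling) can be written as:

```lisp
;; A one-argument-list version of mapcar, written recursively:
;; apply fn to the first element, recurse on the rest.
(defun my-mapcar (fn lst)
  (if (null lst)
      '()
      (cons (funcall fn (first lst))
            (my-mapcar fn (rest lst)))))

(my-mapcar #'(lambda (x) (* x x)) '(1 2 3 17))   ; => (1 4 9 289)
```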
Mapcan
The Lisp function mapcan works much like mapcar, but
with a different way of gathering results:
The function called by mapcan returns a list of results
(perhaps an empty list).
mapcan concatenates the results; empty lists vanish.
(defun filter (lst predicate)
  (mapcan #'(lambda (item)
              (if (funcall predicate item)
                  (list item)
                  '()))
          lst) )

>(filter '(a 2 or 3 and 7) 'numberp)
()(2)()(3)()(7)
(2 3 7)
>(filter '(a 2 or 3 and 7) 'symbolp)
(a)()(or)()(and)()
(A OR AND)
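The effect of mapcan can be approximated as mapcar followed by concatenation of the result lists; the real mapcan splices destructively with nconc, while this sketch (hypothetical name my-mapcan) uses non-destructive append:

```lisp
;; mapcan ~= mapcar, then concatenate the result lists;
;; empty lists vanish in the concatenation.
(defun my-mapcan (fn lst)
  (apply #'append (mapcar fn lst)))

(my-mapcan #'(lambda (item)
               (if (numberp item) (list item) '()))
           '(a 2 or 3 and 7))    ; => (2 3 7)
```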
Reduce in Lisp
The function reduce applies a specified function to the
first two elements of a list, then to the result of the first
two and the third element, and so forth.
>(reduce '+ '(1 2 3 17))
23
>(reduce '* '(1 2 3 17))
102
reduce is what we need to process a result from mapcan:
>(reduce '+ (mapcan 'testforz
                    '(z m u l e z r u l e z)))
  (1)       (1)     (1)
= (reduce '+ '(1 1 1))
3
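testforz is not defined on this slide; a plausible definition consistent with the trace would emit (1) for each Z and the empty list otherwise. Note that it must return a fresh list via list rather than a quoted literal, since mapcan splices its results together destructively:

```lisp
;; Hypothetical testforz: emit a fresh (1) for each Z,
;; and the empty list (which vanishes) for anything else.
(defun testforz (item)
  (if (eq item 'z)
      (list 1)   ; fresh list: safe for mapcan's destructive nconc
      '()))

(reduce '+ (mapcan 'testforz '(z m u l e z r u l e z)))   ; => 3
```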
Simplified MapReduce
We think of the map function as taking a single input,
typically a String, and emitting zero or more outputs,
each of which is a (key, (value)) pair. For example, if
our program is counting occurrences of the word liberty,
the input "Give me liberty" would emit one output,
("liberty", ("1")).
As an example, consider the problem of finding the
nutritional content of a cheeseburger. Each component
has a variety of features such as calories, protein, etc.
MapReduce can add up the features individually.
We will present a simple version of MapReduce in Lisp
to introduce how it works.
Mapreduce in Lisp
(defun mapreduce (mapfn reducefn lst)
  (let (db keylist)
    (dolist (item lst)
      (dolist (resitem (funcall mapfn item))
        (or (setq keylist
                  (assoc (first resitem) db
                         :test 'equal))
            (push (setq keylist
                        (list (first resitem)))
                  db))
        (push (second resitem) (rest keylist)) ) )
    (mapcar #'(lambda (keylist)
                (list (first keylist)
                      (reduce reducefn
                              (rest keylist))))
            db) ))
>(mapreduce 'identity '+ '(((a 3) (b 2) (c 1))
                           ((b 7) (d 3) (c 5))))
((D 3) (C 6) (B 9) (A 3))
MapReduce Example
Hamburger Example
>(mapreduce 'nutrition '+
            '(hamburger bun cheese lettuce tomato mayo) t)
Mapping: HAMBURGER
Emitted: (CALORIES 80)
Emitted: (FAT 8)
Emitted: (PROTEIN 20)
Mapping: BUN
Emitted: (CALORIES 200)
Emitted: (CARBS 40)
Emitted: (PROTEIN 8)
Emitted: (FIBER 4)
Mapping: CHEESE
Emitted: (CALORIES 100)
Emitted: (FAT 15)
Emitted: (SODIUM 150)
Mapping: LETTUCE
Emitted: (CALORIES 10)
Emitted: (FIBER 2)
Mapping: TOMATO
Emitted: (CALORIES 20)
Emitted: (FIBER 2)
Mapping: MAYO
Emitted: (CALORIES 40)
Emitted: (FAT 5)
Emitted: (SODIUM 20)
Reducing: SODIUM (20 150) = 170
Reducing: FIBER (2 2 4) = 8
Reducing: CARBS (40) = 40
Reducing: PROTEIN (8 20) = 28
Reducing: FAT (5 15 8) = 28
Reducing: CALORIES (40 20 10 100 200 80) = 450
((SODIUM 170) (FIBER 8) (CARBS 40) (PROTEIN 28) (FAT 28)
(CALORIES 450))
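The nutrition map function and its data table are not shown in the notes; a hypothetical version consistent with the trace (minus the tracing output) could look up each food in a fixed table and emit its (feature value) pairs:

```lisp
;; Hypothetical nutrition data, matching the values in the trace.
(defvar *nutrition-table*
  '((hamburger (calories 80)  (fat 8)   (protein 20))
    (bun       (calories 200) (carbs 40) (protein 8) (fiber 4))
    (cheese    (calories 100) (fat 15)  (sodium 150))
    (lettuce   (calories 10)  (fiber 2))
    (tomato    (calories 20)  (fiber 2))
    (mayo      (calories 40)  (fat 5)   (sodium 20))))

;; Map function: food -> list of (feature value) pairs to emit.
(defun nutrition (food)
  (rest (assoc food *nutrition-table*)))
```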
PageRank
The PageRank algorithm used by Google expresses the
ranking of a web page in terms of two components:
a base value, (1 − d), usually 0.15
d · Σi PRi/ni , where PRi is the PageRank of a
page that links to this page, and ni is the number of
links from that page.
The PageRank values can be approximated by relaxation
by using this formula repeatedly within MapReduce.
Each page is initially given a PageRank of 1.0; the sum
of all values will always equal the number of pages.
Map: Share the love: each page distributes its
PageRank equally across the pages it links to.
Reduce: Each page sums the incoming values,
multiplies by 0.85, and adds 0.15.
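One relaxation pass can be sketched in Lisp. The link structure below (A links to B and C, B links to C, C links to A) is an assumption for illustration, chosen to match the three-page example in these notes:

```lisp
;; Hypothetical link structure: (page (pages-it-links-to)).
(defvar *links* '((a (b c)) (b (c)) (c (a))))

;; One relaxation pass of simplified PageRank.
;; ranks is an alist (page . pr).
(defun pagerank-step (ranks)
  (let ((incoming (mapcar #'(lambda (entry) (cons (first entry) 0.0))
                          *links*)))
    ;; Map: each page shares its PageRank equally among its links.
    (dolist (entry *links*)
      (let* ((page  (first entry))
             (outs  (second entry))
             (share (/ (cdr (assoc page ranks)) (length outs))))
        (dolist (target outs)
          (incf (cdr (assoc target incoming)) share))))
    ;; Reduce: sum incoming shares, multiply by d = 0.85, add 0.15.
    (mapcar #'(lambda (pair)
                (cons (car pair) (+ 0.15 (* 0.85 (cdr pair)))))
            incoming)))
```

Starting every page at 1.0, one step yields B ≈ 0.575 and C ≈ 1.425, matching the first iteration of the example.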
PageRank Example
[Table of successive relaxation values, one row per iteration:
page B converges from 1.00000000 toward about 0.644, and
page C from 1.00000000 toward about 1.192, over roughly a
dozen iterations.]
http://pr.efactory.de/e-pagerank-algorithm.shtml
Advanced Performance
The notions of Big O and single-algorithm performance
on a single CPU must be extended in order to understand
performance of programs on more complex computer
architectures. We also need to account for:
Disk access time
Network bandwidth and data communication time
Coordination of processes on separate machines
Congestion and bottlenecks as many computers or
many users want the same resource.
Buffering
Buffering is a technique used to match a small-but-steady
process (e.g. a program that reads or writes one line at a
time) to a large-block process (e.g. disk I/O).
Disk I/O has two problematic features:
A whole disk block (e.g. 4096 bytes) must be read or
written at a time.
Disk access is slow (e.g. 8 milliseconds).
An I/O buffer is an array, the same size as a disk block,
that is used to collect data. The application program
removes data from the block (or adds data to it) until the
block is empty (full), at which time a new block is read
from disk (written to disk).
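The write side of this scheme can be sketched in a few lines; the names are hypothetical and a list of blocks stands in for the disk:

```lisp
(defvar *block-size* 4)         ; tiny "disk block" for illustration
(defvar *buffer* '())           ; the I/O buffer being filled
(defvar *blocks-written* '())   ; stands in for blocks on disk

;; Collect one item; when the buffer holds a full block,
;; "write" the block and start a fresh buffer.
(defun buffered-write (item)
  (push item *buffer*)
  (when (>= (length *buffer*) *block-size*)
    (push (reverse *buffer*) *blocks-written*)  ; one block "written"
    (setq *buffer* '())))
```

Writing eight items produces two full blocks; the application never waits for the disk except at block boundaries.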
If there are R Reduce tasks, each Map task will have
R output buffers, one for each Reduce task. When an
output buffer becomes full, it is written to disk. When
the Map task is finished, it sends the file names of its R
files to the Master.
Load Balancing
Some data values are much more popular than others.
For example, there were 13 people on a class roster whose
names started with S, but only one K, and no Q or X.
If MapReduce assigned Reduce tasks based on key values,
some Reduce tasks might have large inputs and be too
slow, while other Reduce tasks might have too little work.
MapReduce performs load balancing by having a large
number R of Reduce tasks and using hashing to assign
data to Reduce tasks:
task = Hash(key) mod R
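In Common Lisp, this assignment could be sketched with the built-in hash function sxhash (the function name reduce-task is hypothetical):

```lisp
;; Assign a key to one of r Reduce tasks by hashing.
;; sxhash is Common Lisp's built-in hash function; the same key
;; always maps to the same task within a run.
(defun reduce-task (key r)
  (mod (sxhash key) r))

(reduce-task "liberty" 4)   ; some task number in 0..3
```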
This assigns many keys to the same Reduce task. The
Reduce task reads the files produced by all Map tasks for
its hash value (remote read over the network), sorts the
combined input by key value, and appends the value
lists before calling the application's Reduce function.
Algorithm Failure
If MapReduce detects that a worker has failed or is slow
on a Map task, it will restart redundant Map tasks to
process the same data.
If the redundant Map tasks also fail, the problem may be
that the data caused the algorithm to fail, rather than a
hardware failure.
MapReduce can restart the Map task, omitting the data
that caused the failure. This makes the output not quite
right, but for some tasks (e.g. average movie rating) it
may be acceptable.
Atomic Commit
In CS, the word atomic, from Greek words meaning
"not cut," describes an all-or-nothing process: either the
process finishes without interruption, or it does not
execute at all.
If multiple worker machines are working on the same data,
it is necessary to ensure that only one set of result data
is actually used.
An atomic commit, provided by the operating system
(and, ultimately, by CPU hardware), allows exactly one
result to be committed, or accepted for use. If other
workers produce the same result, those results will be
discarded.
In MapReduce, atomicity is provided by the file system.
When a Map worker finishes, it renames its temporary file
to the final name; if a file by that name already exists,
the renaming will fail.