Linear Algebra With Probability PDF
Linear Algebra With Probability PDF
Linear Algebra With Probability PDF
Observed quantities are functions defined on lists. One calls them also random variables.
Because observables can be added or subtracted, they can be treated as vectors. If the data
themselves are not numbers like the strings of a DNA, we can add and multiply numerical functions
on these data. The function X(x) for example could count the number of A terms in a genome
sequence x with letters A,G,C,T which abbreviate Adenin, Guanin, Cytosin and Tymin. It is a
fundamental and pretty modern insight that all mathematics can be described using algebras and
operators. We have mentioned that data are often related and organized in relational form and that
an array of data achieves this. Lets look at weather data accessed from http://www.nws.noaa.gov
on January 4th 2011, where one of the row coordinates is time. The different data vectors are
listed side by side and listed in form of a matrix.
Month
01
02
03
04
05
06
07
08
09
10
11
12
Year
2010
2010
2010
2010
2010
2010
2010
2010
2010
2010
2010
2010
Temperature
29.6
33.2
43.9
53.0
62.8
70.3
77.2
73.4
68.7
55.6
44.8
32.7
Precipitation
2.91
3.34
14.87
1.78
2.90
3.18
2.66
5.75
1.80
3.90
2.96
3.61
Wind
12.0
13.3
13.0
10.4
10.6
9.5
9.7
10.2
10.8
12.2
11.0
13.2
To illustrate how linear algebra enters, lets add up all the rows and divide by the number of rows.
This is called the average. We can also look at the average squre distance to the mean, which is
the variance. Its square root is called the standard deviation.
Month
6.5
Max
Year Temperature
2010 53.7667
Precipitation
4.13833
Wind
11.325
Goal
Describe the data
Model the data
Reduce the data
Manipulate the data
Using
Lists
Probability
Projections
Algebra
Example
Relational database, Adjacency matrix
Markov process, Filtering, Smoothing
Linear regression.
Fourier theory.
We have seen that a fundamental tool to organize data is the concept of a list. Mathematicians
call this a vector. Since data can be added and scaled, data can be treated as vectors. We can
also look at lists of lists. These are called matrices. Matrices are important because they allow
to describe relations and operations. Given a matrix, we can access the data using coordinates.
The entry (3, 4) for example is the forth element in the third row. Having data organized in lists,
one can manipulate them more easily. One can use arithmetic on entire lists or arrays. This is
what spreadsheets do.
TemperaturePrecipitation Wind
Max=
77
Max=
Max=
15
13
In the example data set, the average day was June 15th, 2010, the average temperature was 54
degrees Fahrenheit = 12 degrees Celsius, with an average of 4 inches of precipitation per month
and an average wind speed of 11 miles per hour. The following figure visualizes the data. You
see the unusually rainy March which had produced quite a bit of flooding. The rest is not so
surprising. The temperatures are of course higher in summer than in winter and there is more
wind in the cooler months. Since data often come with noise, one can simplify their model and
reduce the amount of information to describe them. When looking at a particular precipitation
data in Boston over a year, we have a lot of information which is not interesting. More interesting
is the global trend, the deviation from this global trend and correlations within neighboring days.
It is for example more likely to be rainy near a rainy day than near a sunny day. The data are not
given by a random process like a dice. It we can identify the part of the data which are randomly
chosen, how do we find the rule? This is a basic problem and can often be approached using linear
algebra. The expectation will tell the most likely value of the data and the standard deviation
tells us how noisy the data are. Already these notions have relations with geometry.
How do we model from a situation given only one data set? For example, given the DJI data, we
would like to have a good model which predicts the nature of the process in the future. We can
do this by looking for trends. Mathematically, this is done by data fitting and will be discussed
extensively in this course. An other task is to model the process using a linear algebra model.
Maybe it is the case that near a red pixel in a picture it is more likely to have a red pixel again
or that after a gain in the DJI, we are more likely to gain more.
In the movie Fermats room, some mathematicians are trapped in a giant press and have
to solve mathematical problems to stop the press from crushing them to death. In one of
the math problems, they get the data stream of 169 = 132 digits:
00000000000000011111111100011111111111001111111111100110
001000110011000100011001111101111100111100011110001111111
11000001010101000000110101100000011111110000000000000000. Can you solve the riddle?
Try to solve it yourself at first. If you need a hint, watch the clip
http://www.math.harvard.edu/ knill/mathmovies/text/fermat/riddle3.html
This is problem 2 in Section 1.3 of the script, where 200 is replaced by 100: Flip a coin 100
times. Record the number of times, heads appeared in the first 10 experiments and call this
n1 . Then call the number of times, heads appears in the next N = 10 experiments and call
it n2 . This produces 10 inters n1 , . . . , n10 . Find the mean
m = (n1 + n2 + . . . + nN )/N
of your data, then the sample variance
v = ((n1 m)2 + (n2 m)2 + . . . + (nN m)2 )/(N 1)
and finally the sample standard deviation = v. Remark: there are statistical reasons
that (N 1) is chosen and not N. It is called Bessels correction.
Lets illustrate how we can use lists of data to encode a traffic situation. Assume an airline services
the towns of Boston, New York and Los Angeles. It flies from New York to Los Angeles, from Los
Angeles to Boston and from Boston to New York as well as from New York to Boston. How can
we can compute the number of round trips of any length n in this network? Linea algebra helps:
define the 3 3 connection matrix A given below and compute the nth power of the matrix.
We will learn how to do that next week. In our case, there are 670976837021 different round trips
of length 100 starting from Boston.
3
The matrix which encodes the situation is the following:
BO NY
0
1
1
0
LA
1
0
BO
A=
NY
LA
0
1
0
To summarize, linear algebra enters in many different ways into data analysis. Lists and lists of
lists are fundamental ways to represent data. They will be called vectors and matrices. Linear
algebra is needed to find good models or to reduce data. Finally, even if we have a model, we
want to do computations efficiently.
2
4
3
P[ \ A] = 1 P[A].
P[A B] = P[A] + P[B] P[A B].
P[A] =
It is important that in any situation, we first find out what the laboratory is. This is often
the hardest task. Once the setup is fixed, one has a combinatorics or counting problem.
Examples:
We turn a wheel of fortune and assume it is fair in the sense that every angle range [a, b]
appears with probability (b a)/2. What is the chance that the wheel stops with an angle
between 30 and 90 degrees?
Answer: The laboratory here is the circle [0, 2). Every point in this circle is a possible
experiment. The event that the wheel stops between 30 and 90 degrees is the interval
[/6, /2]. Assuming that all angles are equally probable, the answer is 1/6.
Here are the conditions which need to be satisfied for the probability function P:
1. 0 P[A] 1 and P[] = 1.
S
P
2. Aj are disjoint events, then P[
j=1 Aj ] =
j=1 P[Aj ].
(1, 2)
(2, 2)
(3, 2)
(4, 2)
(5, 2)
(6, 2)
(1, 3)
(2, 3)
(3, 3)
(4, 3)
(5, 3)
(6, 3)
(1, 4)
(2, 4)
(3, 4)
(4, 4)
(5, 4)
(6, 4)
(1, 5)
(2, 5)
(3, 5)
(4, 5)
(5, 5)
(6, 5)
(1, 6)
(2, 6)
(3, 6)
(4, 6)
(5, 6)
(6, 6)
be the possible cases, then there are only 8 cases where the sum is smaller or equal to 8.
Lets look at all 2 2 matrices for which the entries are either 0 or 1. What is the probability
that such a matrix has a nonzero determinant det(A) = ad bc?
Answer: We have 16 different matrices. Our probability space is finite:
= {
"
"
0 0
0 0
1 0
0 0
# "
# "
0 0
0 1
1 0
0 1
# "
# "
0 0
1 0
1 0
1 0
# "
# "
0 0
1 1
1 0
1 1
# "
# "
0 1
0 0
1 1
0 0
# "
# "
0 1
0 1
1 1
0 1
# "
# "
0 1
1 0
1 1
1 0
# "
# "
0 1
1 1
1 1
}.
1 1
Now lets look at the event that the determinant is nonzero. It contains the following matrices:
A={
Here is a more precise list of conditions which need to be satisfied for events.
(1, 1)
(2, 1)
(3, 1)
(4, 1)
(5, 1)
(6, 1)
Lets look at the digits of . What is the probability that the digit 5 appears? Answer:
Also this is a strange example since the digits are not randomly generated. They are given
by nature. There is no randomness involved. Still, one observes that the digits behave like
a random number and that the number is normal: every digit appears with the same
frequency. This is independent of the base.
We throw a dice twice. What is the probability that the sum is larger than 5?
Answer: We can enumerate all possible cases in a matrix and get Let
This example is called Bertrands paradox. Assume we throw randomly a line into the
unit disc. What is the probability that its length is larger than the length of the inscribed
triangle?
Answer: Interestingly, the answer depends as we will see in the lecture.
Lets look at the DowJonesIndustrial average DJI from the start. What is the probability
that the index will double in the next 50 years?
Answer: This is a strange question because we have only one data set. How can we talk
about probability in this situation? One way is to see this graph as a sample of a larger
probability space. A simple model would be to fit the data with some polynomial, then add
random noise to it. The real DJI graph now looks very similar to a typical graph of those.
|A|
||
"
0 1
1 0
# "
0 1
1 1
# "
1 0
0 1
# "
1 0
1 1
# "
1 1
0 1
# "
1 1
}.
1 0
Lets pick 2 cards from a deck of 52 cards. What is the probability that we have 2 kings?
Answer: Our laboratory has 52 51 possible experiments. To count the number of
good cases, note that there are 4 3 = 12 possible ordered pairs of two kings. Therefore
12/(52 51) = 1/221 is the probability.
Some notation
Set theory in :
The intersection A B contains the elements which are in A and B.
The union A B contains the elements which are in A or B.
The complement Ac contains the elements in which are not in A.
The difference A \ B contains the elements which are in A but not in B.
The symmetric difference AB contains what is in A or B but not in both.
The empty set is the set which does not contain any elements.
The algebra A of events:
If is the laboratory, the set A of events is -algebra. It is a set of subsets of
in which one can perform countably many set theoretical operations and which
S
contains and . In this set one can perform countable unions j Aj for the
T
union of a sequence of sets A1 , A2 , . . . or countable intersections j Aj .
Do problem 5) in Chapter 2 of the text but with 100 instead of 1000. You choose a random
number from {1, . . . , 100 }, where each of the numbers have the same probability. Let
A denote the event that the number is divisible by 3 and B the event that the number
is divisible by 5. What is the probability P[A] of A, the probability P[B] of B and the
probability of P[A B]? Compute also the probability P[A B]/P[B] which we will call the
conditional probability next time. It is the probability that the number is divisible by 3
under the condition that the number is divisible by 5.
You choose randomly 6 cards from 52 and do not put the cards back. What is the probability
that you got all aces? Make sure you describe the probability space and the event A that
we have all aces.
2) The Kolmogorov axioms form a solid foundation of probability theory. This has only been achieved in the 20th century
(1931). Before that probability theory was a bit hazy. For
infinite probability spaces it is necessary to restrict the set of
all possible events. One can not take all subsets of an interval
for example. There is no probability measure P which would
work with all sets. There are just too many.
P[A|B] =
It is a formula for the conditional probability P[A|B] when we know the unconditional
probability of A and P[B|A], the likelihood. The formula immediately follows from the
fact that P[B|A] + P[B|Ac ] = P[B].
The conditional probability of an event A under the condition that the event B
takes place is denoted with P[A|B] and defined to be P[A B]/P[B].
We throw a coin 3 times. The first 2 times, we have seen head. What is the chance that we
get tail the 3th time?
Answer: The probability space consists of all words in the alphabet H, T of length 3.
These are = {HHH, HHT, HT H, HT T, T HH, T HT, T T H, T T T }. The event B is the
event that the first 2 cases were head. The event A is the event that the third dice is head.
While the formula followed directly from the definition of conditional probability, it is very
useful since it allows us to compute the conditional probability P[A|B] from the likelihoods
P[B|A], P[B|Ac ]. Here is an example:
Problem: Dave has two kids, one of them is a girl. What is the chance that the other is a
girl?
Intuitively one would here give the answer 1/2 because the second event looks independent
of the first. However, this initial intuition is misleading and the probability only 1/3.
Solution. We need to introduce the probability space of all possible events
= {BG, GB, BB, GG}
You are in the Monty-Hall game show and need to chose from three doors. Behind one door
is a car and behind the others are goats. The host knows what is behind the doors. After
you open the first door, he opens an other door with a goat. He asks you whether you want
to switch. Do you want to?
Answer: Yes, you definitely should switch. You double your chances to win a car:
No switching: The probability space is the set of all possible experiments = {1, 2, 3 }.
You choose a door and win with probability 1/3 . The opening of the host does not affect
any more your choice.
Switching: We have the same probability space. If you pick the car, then you lose because
the switch will turn this into a goat. If you choose a door with a goat, the host opens the
other door with the goat and you win. Since you win in two cases, the probability is 2/3 .
Also here, intuition can lead to conditional probability traps and suggest to have a
win probability 1/3 in general. Lets use the notion of conditional probability to give
an other correct argument: the intervention of the host has narrowed the laboratory to
= {12, 13, 21, 23, 31, 32 } where 21 for example means choosing first door 2 then door 1.
Assume the car is behind door 1 (the other cases are similar). The host, who we assume
always picks door 2 if you pick 1 with the car (the other case is similar) gives us the condition
B = {13, 21, 31 } because the cases 23 and 32 are not possible. The winning event is A =
{21, 31 }. The answer to the problem is the conditional probability P[A|B] = P[A B]/P[B]
= 2/3 .
Problem. The probability to die in a car accident in a 24 hour period is one in a million.
The probability to die in a car accident at night is one in two millions. At night there is
30% traffic. You hear that a relative of yours died in a car accident. What is the chance
that the accident took place at night?
Solution. Let B be the event to die in a car accident and A the event to drive at night.
We apply the Bayes formula. We know P[A B] = P[B|A] P[A] = (1/2000000) (3/10) =
3/20000000.
P[A|B] =
P[B|A] P[A]
.
P[B|A] + P[B|Ac ]
P[A B]
= (3/20000000)/(1/1000000) = 3/20 .
P[B]
Proof: Because the denominator is P[B] = nj=1 P[B|Aj ]P[Aj ], the Bayes rule just says
P[Ai |B]P[B] = P[B|Ai ]P[Ai ]. But these are by definition both P[Ai B].
Problem: A fair dice is rolled first. It gives a random number k from {1, 2, 3, 4, 5, 6 }.
Next, a fair coin is tossed k times. Assume, we know that all coins show heads, what is the
probability that the score of the dice was equal to 5?
Solution. Let B be the event that all coins are heads and let Aj be the event that the dice
showed the number j. The problem is to find P[A5 |B]. We know P[B|Aj ] = 2j . Because the
P
events Aj , j = 1, . . . , 6 are disjoint sets in which cover it, we have P[B] = 6j=1 P[B Aj ] =
P6
P6
j
j=1 P[B|Aj ]P[Aj ] =
j=1 2 /6 = (1/2 + 1/4 + 1/8 + 1/16 + 1/32 + 1/64)(1/6) = 21/128.
By Bayes rule,
2
(1/32)(1/6)
P[B|A5 ]P[A5 ]
=
,
=
P[A5 |B] = P6
21/128
63
( j=1 P[B|Aj ]P[Aj ])
0.5
Problem 2) in Chapter 3: if the probability that a student is sick at a given day is 1 percent
and the probability that a student has an exam at a given day is 5 percent. Suppose that
6 percent of the students with exams go to the infirmary. What is the probability that a
student in the infirmary has an exam on a given day?
Problem 5) in chapter 3: Suppose that A, B are subsets of a sample space with a probability
function P. We know that P[A] = 4/5 and P[B] = 3/5. Explain why P[B|A] is at least
1/2.
Solve the Monty Hall problem with 4 doors. There are 4 doors with 3 goats and 1 car. What
are the winning probabilities in the switching and no-switching cases? You can assume that
the host always opens the still closed goat closest to the car.
0.3
0.2
0.1
Figure: The conditional probabilities P [Aj |B] in the previous problem. The knowledge
that all coins show head makes it more likely to have thrown less coins.
Here is a concrete example: Assume the chance that the first kid is a girl is 60% and that
the probability to have a boy after a boy is 2/3 and the probability to have a girl after a
girl is 2/3 too. What is the probability that the second kid is a girl?
S olution. Let B1 be the event that the first kid is a boy and let B2 the event that the
first kid is a girl. Assume that for the first kid the probability to have a girl is 60%.
But that P[F irstgirl|Secondgirl] = 2/3 and P[F irstboy|Secondboy] = 2/3. What are the
probabilities that the first kid is a boy? This produces a system
2/3P[B1 ]
1/3P[B1 ]
+ y + z + u + v + w =3
y + z + u + v
=2
2 + 2
=4
6/10
4/10
The probabilities are 8/15, 7/15. There is still a slightly larger probability to have a girl.
This example is also at the heart of Markov processes.
+ 1/3P[B2 ] =
+ 2/3P[B2 ] =
Example Here is a toy example of a problem one has to solve for magnetic resonance
imaging (MRI). This technique makes use of the absorb and emission of energy in the radio
frequency range of the electromagnetic spectrum.
Assume we have 4 hydrogen atoms, whose nuclei are excited with energy intensity a, b, c, d.
We measure the spin echo in 4 different directions. 3 = a+b,7 = c+d,5 = a+c and 5 = b+d.
What is a, b, c, d? Solution: a = 2, b = 1, c = 3, d = 4. However, also a = 0, b = 3, c = 5, d =
2 solves the problem. This system has not a unique solution even so there are 4 equations
q
The are 6 variables and 3 equations. Since we have less equations then unknowns, we expect
infinitely many solutions. The system can be written as A~x = ~b, where
r
a
b
o
1 1 1 1 1 1
A=
1 1 1 1 1 1
0 0 2 2 0 0
and 4 unknowns.
c
and ~b =
2 .
4
Example. Assume we have two events B1 , B2 which cover the probability space. We do
not know their probabilities. We have two other events A1 , A2 from which we know P[Ai ]
and the conditional probabilities P[Ai |Bj ]. We get the system of equations.
P[A1 |B1 ]P[B1 ]
P[A2 |B1 ]P[B1 ]
P[A1 ]
P[A2 ]
a11
a12
a13
a14
a21
x11
x12
a24
a31
x21
x22
a34
a41
a42
a43
a44
The last example should show you that linear systems of equations also appear in data fitting
even so we do not fit with linear functions. The task is to find a parabola
y = ax2 + bx + c
through the points (1, 3), (2, 1) and (4, 9). We have to solve the system
a+b+c = 3
4a + 2b + c = 1
16a + 4b + c = 9
0
200 T1
200 T2 T3
0
400
The task is to compute the actual densities. We first write down the augmented matrix is
=3
=5
=9
1
1
x
3
1 1
x = y , ~b = 5 .
,~
1 2 5
z
9
A= 1
elimination steps.
+ u
y
z
y + z
=
+ v
=
+ w =
=
u + v + w =
1 0
1
0
0
0 0
The augmented matrix is matrix, where other column has been added. This column contains
the vector b. The
last column is often
separated with horizontal lines for clarity reasons.
1 1
1 | 3
B = 1 1 1 | 5 .
We will solve this equation using Gauss-Jordan
1 2 5 | 9
x
x +
1
0
0
0
1
0
1
0
0
1
0
0
1
0
1
3
5
9
8
9
0
0
1
0
0
1
0
0
1
1
0
1
0
1
1
0
0
1
1
1
3
5
9
9
9
Now subtract the 4th row from the last to get a row of zeros, then subtract the 4th row
from the first. This is already the row reduced echelon form.
The ith entry (A~x)i is the dot product of the ith row of A with ~x.
0
0
1
1
0
Remove the sum of the first three rows from the 4th, then change sign of the 4th row:
can be written in matrix form as A~x = ~b, where A is a matrix called coefficient matrix and
column vectors ~x and ~b.
1 0
1
0
1
0 0
3
5
9
8
9
imaging. In this technology, a scanner can measure averages of tissue densities along lines.
1 0
1
0
0
0 0
0
0
1
0
0
0 1 1 6
0 1
0
5
0 0
1
9
.
1 1
1
9
0 0
0
0
The first 4 columns have leading 1. The other 2 variables are free variables r, s. We write
the row reduced echelon form again as a system and get so the solution:
x
y
z
u
v
w
=
=
=
=
=
=
6 + r + s
5r
9s
9rs
r
s
There are infinitely many solutions. They are parametrized by 2 free variables.
Gauss-Jordan Elimination is a process, where successive subtraction of multiples of other
rows or scaling or swapping operations brings the matrix into reduced row echelon form.
The elimination process consists of three possible steps. They are called elementary row
operations:
The number of leading 1 in rref(A) is called the rank of A. It is an integer which we will
use later.
A remark to the history: The process appeared already in the Chinese manuscript Jiuzhang
Suanshu the Nine Chapters on the Mathematical art. The manuscript or textbook appeared around 200 BC in the Han dynasty. The German geodesist Wilhelm Jordan
(1842-1899) applied the Gauss-Jordan method to finding squared errors to work on surveying. (An other Jordan, the French Mathematician Camille Jordan (1838-1922) worked on
linear algebra topics also (Jordan form) and is often mistakenly credited with the GaussJordan process.) Gauss developed Gaussian elimination around 1800 and used it to solve
least squares problems in celestial mechanics and later in geodesic computations. In 1809,
Gauss published the book Theory of Motion of the Heavenly Bodies in which he used the
method for solving astronomical problems.
1
2
3
4
5
6
7
8
9
1
2
3
4
5
6
7
8
9
1
2
3
4
5
6
7
8
9
1
2
3
4
5
6
7
8
9
1
2
3
4
5
6
7
8
9
1
2
3
4
5
6
7
8
9
1
2
3
4
5
6
7
8
9
1
2
3
4
5
6
7
8
9
1
2
3
4
5
6
7
8
9
1
2
3
4
5
6
7
8
9
1
2
3
4
5
6
7
8
9
1
2
3
4
5
6
7
8
9
More
challenging
1 2 3 4 5
2 3 4 5 6
3 4 5 6 7
4 5 6 7 8
5 6 7 8 1
6 7 8 1 2
7 8 1 2 3
8 1 2 3 4
is
6
7
8
1
2
3
4
5
the question:
what is the rank of the following matrix?
7 8
8 1
1 2
2 3
.
3 4
4 5
5 6
6 7
1
2
3
4
5
6
7
8
9
1
2
3
4
5
6
7
8
9
1
2
3
4
5
6
7
8
9
1
2
3
4
5
6
7
8
9
1
2
3
4
5
6
7
8
9
Problem 10 In 1.2 of Bretscher: Find all solutions of the equations with paper and
pencil using Gauss-Jordan elimination. Show all your work.
1
2
3
4
5
6
7
8
9
1
2
3
4
5
6
7
8
9
1
2
3
4
5
6
7
8
9
=
=
=
=
4
4
3
11
1 3 0
0 0 1
are of the same type. How many types of 2 x 2 matrices in reduced row-echelon form
are there?
Problem 42 in 1.2 of Bretscher: The accompanying sketch represents a maze of one way
streets in a city in the United States. The traffic volume through certain blocks during
an hour has been measured. Suppose that the vehicles leaving the area during this
hour were exactly the same as those entering it. What can you say about the traffic
volume at the four locations indicated by a question mark? Can you figure out exactly
how much traffic there was on each block? If not, describe one possible scenario. For
each of the four locations, find the highest and the lowest possible traffic volume.
JFK
Dunster
400
300
?
?
320
?
300
100
250
Mt Auburn
?
120
Winthrop
150
a11 a12
a22
an1 an2
a
21
A=
w
~2 =
h
h
1 2 1
i
i
w
~ 3 = 1 1 3
The column
vectorsare
3
4
5
And bring it to so called row reduced echelon form. We write rref(A). In this form, we have
in each row with nonzero entries a so called leading one.
A matrix with one column is also called a column vector. The entries of a matrix
are denoted by aij , where i is the row number and j is the column number.
w
~ n
w
~ 1 ~x
|
w
x
~2 ~
x =
~
...
w
~ n ~x
x1
| | |
x2
A~x = ~v1 ~v2 ~vm
| | |
xn
= x1~
v1 + x2~v2 + + xm~vm = ~b .
=0
=0
=9
0
x
4 5
2 1
x=
y , ~b = 0 .
,~
9
z
1 1 3
3
A=
1
If we have n equations and n unknowns, it is most likely that we have exactly one solution.
But remember Murphys law If anything can go wrong, it will go wrong. It happens!
is equivalent to A~x = ~b, where A is a coefficient matrix and ~x and ~b are vectors.
How do we determine in which case we are? It is the rank of A and the rank of the augmented
matrix B = [A|b] as well as the number m of columns which determine everything:
If rank(A) = rank(B) = m: there is exactly 1 solution.
If rank(A) < rank(B): there are no solutions.
If rank(A) = rank(B) < m: there are many solutions.
Theorem. For any system of linear equations there are three possibilities:
Consistent with unique solution: Exactly one solution. There is a leading 1
in each column of A but none in the last column of the augmented matrix B.
Inconsistent with no solutions. There is a leading 1 in the last column of the
augmented matrix B.
Consistent with infinitely many solutions. There are columns of A without
leading 1.
A system of linear equations A~x = ~b with n equations and m unknowns is defined by the
n m matrix A and the vector ~b. The row reduced matrix rref(B) of the augmented matrix
B = [A|b] determines the number of solutions of the system Ax = b. The rank rank(A) of
a matrix A is the number of leading ones in rref(A).
There are two ways how we can look a system of linear equation. It is called the row
picture or column picture:
Row picture: each bi is the dot product of a row vector w
~ i with ~x.
w
~ 1
3 4 5
0 = x1 1 + x2 1 + x3 1 .
w
~ 2
A~x =
...
If we have a unique solution to A~x = ~b, then rref(A) is the matrix which has a leading 1 in
every column. This matrix is called the identity matrix.
4 5 | 0
2 1 | 0
.
1 1 3 | 9
B = 1
What is the probability that we have exactly one solution if we look at all n n matrices
with entries 1 or 0? You explore this in the homework in the 2 2 case. During the lecture
we look at the 3 3 case and higher, using a Monte Carlo simulation.
Find a cubic equation f (x) = ax3 + bx2 + cx + d = y such that the graph of f goes through
the 4 points
A = (1, 7), B = (1, 1), C = (2, 11), D = (2, 25) .
40
20
-3
-2
-1
-20
-40
-60
1
-4
-1
y
-1
w
0
-1
0
-2
In a Herb garden, the humidity of its soil has the property that at any given point the humidity is the sum
of the neighboring humidities. Samples are taken on
a hexagonal grid on 14 spots. The humidity at the
four locations x, y, z, w is unknown. Solve the equations
x = y+z+w+2
y=
x+w-3
using row reduction.
z=
x+w-1
w=
x+y+z-2
"
3 4
. This
1 5
y1
y3
vector. The map defines a line in space.
"
that the
0
, and
In order to find the matrix of a linear transformation, look at the image of the
standard vectors and use those to build the columns of the matrix.
1 0
| | |
A linear transformation T (x) = Ax with A = ~v1 ~v2 ~vn has the property
| | |
column vector ~v1 , ~vi , ~vn are the images of the standard vectors ~e1 = , ~ei =
~en =
.
1 1
2
.
1 1 3
2
Linear transformations are important in
Find the linear transformation, which reflects a vector at the line containing the vector
(1, 1, 1).
If there is a linear transformation S such that S(T ~x) = ~x for every ~x, then S is called
the inverse of T . We will discuss inverse transformations later in more detail.
A~x = ~b means to invert the linear transformation ~x 7 A~x. If the linear system has exactly
one solution, then an inverse exists. We will write ~x = A1~b and see that the inverse of a
linear transformation is again a linear transformation.
Otto Bretschers book contains as a motivation a code, where the encryption happens
with the linear map T (x, y) = (x + 3y, 2x + 5y). It is an variant of a Hill code. The map
has the inverse T 1 (x, y) = (5x + 3y, 2x y).
Assume we know, the other party uses a Bretscher code and can find out that T (1, 1) = (3, 5)
and T
" (2, 1) #= (7, 5). Can we reconstruct the code? The problem is to find the matrix
a b
A=
. It is useful to decode the Hill code in general. If ax+by = X and cx+dy = Y ,
c d
then x = (dX bY
" )/(ad# bc), y = (cX aY )/(ad bc). This is" a linear transformation
#
a b
d b
with matrix A =
and the corresponding matrix is A1 =
/(ad bc).
c d
c a
This is Problem 24-40 in Bretscher: Consider the circular face in the accompanying figure.
For each of the matrices A1 , ...A6 , draw a sketch showing the effect of the linear transformation"T (x) = Ax
# on this "face. #
"
#
"
#
"
#
0 1
2 0
0 1
1 0
1 0
A1 =
. A2 =
. A3 =
. A4 =
. A5 =
.
1 0
0 2
1 0
0 1
0 2
"
#
1 0
A6 =
.
0 1
This is problem 50 in Bretscher. A goldsmith uses a platinum alloy and a silver alloy to
make jewelry; the densities of these alloys are exactly 20 and 10 grams per cubic centimeter,
respectively.
a) King Hiero of Syracuse orders a crown from this goldsmith, with a total mass of 5 kilograms
(or 5,000 grams), with the stipulation that the platinum alloy must make up at least 90%
of the mass. The goldsmith delivers a beautiful piece, but the kings friend Archimedes has
doubts about its purity. While taking a bath, he comes up with a method to check the
composition of the crown (famously shouting Eureka! in the process, and running to the
kings palace naked). Submerging the crown in water, he finds its volume to be 370 cubic
centimeters. How much of each alloy went into this piece (by mass)? Is this goldsmith a
crook?
b) Find the matrix A that transforms the vector
"
totalmass
totalvolume
In the first week we have seen how to compute the mean and standard deviation of data.
a) Given some data (x1 , x2 , x3 , ..., x6 ). Is the transformation from R6 R which maps the
data to its mean m linear?
b) Is the map which assigns to the data the standard deviation a linear map? c) Is
the map which assigns to the data the difference (y1 , y2, ..., y6 ) defined by y1 = x1 , y2 =
x2 x1 , ..., y6 = x6 y5 linear? Find its matrix. d) Is the map which assigns to the data the
normalized data (x1 m, x2 m, ..., xn m) given by a linear transformation?
Projection
Shear transformations
A=
"
1 0
1 1
A=
A=
"
1 1
0 1
In general, shears are transformation in the plane with the property that there is a vector w
~ such
that T (w)
~ =w
~ and T (~x) ~x is a multiple of w
~ for all ~x. Shear transformations are invertible,
and are important in general because they are examples which can not be diagonalized.
"
1 0
0 0
A=
"
0 0
0 1
A
u is T (~x) = (~x ~u)~u with matrix A =
" projection #onto a line containing unit vector ~
u1 u1 u2 u1
.
u1 u2 u2 u2
Projections are also important in statistics. Projections are not invertible except if we project
onto the entire space. Projections also have the property that P 2 = P . If we do it twice, it
is the same transformation. If we combine a projection with a dilation, we get a rotation
dilation.
Rotation
Scaling transformations
A=
5
A=
"
2 0
0 2
A=
"
1/2 0
0 1/2
"
1 0
0 1
A
"
cos() sin()
sin() cos()
#=
One can also look at transformations which scale x differently then y and where A is a diagonal
matrix. Scaling transformations can also be written as A = I2 where I2 is the identity matrix.
They are also called dilations.
Rotation-Dilation
Reflection
A=
A
"
cos(2) sin(2)
sin(2) cos(2)
#=
A=
"
1 0
0 1
Any reflection at a line has the form of the matrix to the"left. A reflection at# a line containing
2u21 1 2u1u2
a unit vector ~u is T (~x) = 2(~x ~u)~u ~x with matrix A =
2u1 u2 2u22 1
Reflections have the property that they are their own inverse. If we combine a reflection with
a dilation, we get a reflection-dilation.
"
2 3
3 2
A=
"
a b
b a
6
A rotation
dilation is a composition of a rotation by angle arctan(y/x) and a dilation by a
factor x2 + y 2 .
If z = x + iy and w = a + ib and T (x, y) = (X, Y ), then X + iY = zw. So a rotation dilation
is tied to the process of the multiplication with a complex number.
Rotations in space
0 0 1
D=
1
1
1
3
1
1
1
1
3 1
3
0
0 0
1 1
1 1
0 0
0 0
B=
0
E=
0
0
3
1
0
0
1
1
0
0
1
3
0
0
1
1
C=
1
F =
1
1
1
1
1
1
0
0
1
1
1
0
0
1
1
1 1
1 1
1 1
0 0
0 0
1 1
1 1
b) The smiley face visible to the right is transformed with various linear
transformations represented by matrices A F . Find out which matrix
does which transformation:
Reflection at xy-plane
"
1 1
,
1 1 #
"
1 1
D=
,
0 1
A=
3 1 1
1
3
1 1 1
1 1 1
1 1 1
1 1 1
1 1 1
1 3
A=
1 1
To a reflection
at the xy-plane belongs the matrix A
=
1 0 0
ei . The
0 1 0 as can be seen by looking at the images of ~
0 0 1
picture to the right shows the linear algebra textbook reflected at
two different mirrors.
A-F
B=
E=
image
"
1 2
,
0 1 #
"
1 0
,
0 1
A-F
C=
F=
"
1 0
,
0 1 #
"
0 1
/2
1 0
image
A-F
image
1 0 0 0
0 1 0 0
This is homework 28 in Bretscher 2.2: Each of the linear transformations in parts (a) through
(e) corresponds to one and only one of the matrices A) through J). Match them up.
a) Scaling
What transformation in space do you get if you reflect first at the xy-plane, then rotate
around the z axes by 90 degrees (counterclockwise when watching in the direction of the
z-axes), and finally reflect at the x axes?
a) One of the following matrices can be composed with a dilation to become an orthogonal
projection onto a line. Which one?
"
b) Shear
"
0 0
B=
0 1
"
#
"
0.6 0.8
F =
G=
0.8 0.6
A=
"
2 1
0.6 0.8
C=
1 0
0.8 0.6
#
"
#
0.6 0.6
2 1
H=
0.8 0.8
1 2
e) Reflection
"
7
0
"
0
I=
1
D=
0
7#
0
0
"
1
3
"
0.8
J=
0.6
E=
0
1
#
0.6
0.8
If B is a 3 4 matrix, and A
1
1 3 5 7
3
B = 3 1 8 1 , A =
1
1 0 9 2
0
Pm
k=1
is
3
1
0
1
a 4 2 matrix then BA is a
1
1 3 5 7
3
, BA = 3 1 8 1
1
1 0 9 2
0
3
15 13
1
= 14 11 .
0
10 5
1
Matrix multiplication generalizes the common multiplication of numbers. We can write the
dot product between two vectors as a matrix product when writing the first vector as a
1n
matrix
(= row vector) and the second as a n 1 matrix (=column vector) like in
If A is a n n matrix and the system of linear equations Ax = y has a unique solution for all
y, we write x = A1 y. The inverse matrix can be computed using Gauss-Jordan elimination.
Lets see how this works.
Let 1n be the nn identity matrix. Start with [A|1n ] and perform Gauss-Jordan elimination.
Then
Bik Akj .
2 steps. In our case, it can go in three different ways back to the page itself.
Matrices help to solve combinatorial problems. One appears in the movie Good will hunting. For example, what does [A100 ] tell about the news distribution on a large network.
What does it mean if A100 has no zero entries?
1
0
Proof. The elimination process solves A~x = ~ei simultaneously. This leads to solutions ~vi
which are the columns of the inverse matrix A1 because A1 e~i = ~vi .
"
2 6 | 1 0
1 4 | 0 1
"
1 3 | 1/2 0
1 4 | 0 1
"
1 3 | 1/2 0
0 1 | 1/2 1
"
1 0 |
2
3
0 1 | 1/2 1
The inverse is A1 =
"
#
#
#
A | 12
.... | ...
.... | ...
12 | A1
2
3
.
1/2 1
1 0
1 1
"
d b
c a
Diagonal:
"
#
2 0
A=
0 3
"
a b
c d
is given by
/(ad bc).
A1 =
"
1 0
1 1
A1 =
"
1/2 0
0 1/3
Reflection:
"
#
cos(2) sin(2)
A=
sin(2) cos(2)
A1 = A =
Rotation:
"
#
cos()
sin()
A=
sin() cos()
A1 =
"
"
cos(2) sin(2)
sin(2) cos(2)
cos() sin()
sin() cos()
Rotation=Dilation:
"
#
a b
A=
b a
"
A1 =
a/r 2 b/r 2
, r 2 = a2 + b2
b/r 2 a/r 2
1 1 1
1
1
0
1 0 0
1 1
A=
1 1
1 1
1
1
0
0
0
1
0
0
0
0
1 1 1
1 1
1 1 0
A= 0
This is a system we will analyze more later. For now we are only interested in the algebra. Tom the cat moves each minute randomly from on spots 1,5,4 jumping to neighboring sites only. At the same time Jerry, the mouse, moves on spots 1,2,3, also jumping to neighboring sites. The possible position combinations (2, 5), (3, 4), (3, 1), (1, 4), (1, 1)
and transitions are encoded in a matrix
0
1/4
A = 1/4
1/4
1/4
1
0
0
0
0
1
0
0
0
0
1
0
0
0
0
0
0
0
0
1
This means for example that we can go from state (2, 5) with equal probability to all
the other states. In state (3, 4) or (3, 1) the pair (Tom,Jerry) moves back to (2, 5).
If state (1, 1) is reached, then Jerrys life ends and Tom goes to sleep there. We can
read off that the probability that Jerry gets eaten in one step as 1/4. Compute A4 .
The first column of this matrix gives the probability distribution after four steps. The
last entry of the first column gives the probability that Jerry got swallowed after 4 steps.
1
5
10
k
3nk
.
4n
10
pk (1 p)nk with p = 1/4 and interpret it of having heads
k
turn up k times if it appears with probability p and tails with probability 1 p.
We can write this as
In this lecture, we define random variables, the expectation, mean and standard deviation.
A random variable is a function X from the probability space to the real line with
the property that for every interval the set {X [a, b] } is an event.
If X(k) counts the number of 1 in a sequence of length n and each 1 occurs with a
probability p, then
!
n
P[X = k ] =
pk (1 p)nk .
k
There is nothing complicated about random variables. They are just functions on the laboratory
. The reason for the difficulty in understanding random variables is solely due to the name
variable. It is not a variable we solve for. It is just a function. It quantifies properties of
experiments. In any applications, the sets X [a, b] are automatically events. The last condition
in the definition is something we do not have to worry about in general.
If our probability space is finite, all subsets are events. In that case, any function on is a random
variable. In the case of continuous probability spaces like intervals, any piecewise continuous
function is a random variable. In general, any function which can be constructed with a sequence
of operations is a random variable.
0.25
0.20
We throw two dice and assign to each experiment the sum of the eyes when rolling two dice.
For example X[(1, 2)] = 3 or X[(4, 5)] = 9. This random variable takes values in the set
{2, 3, 4, . . . , 12}.
0.20
0.15
0.15
0.10
0.10
0.05
Assume is the set of all 10 letter sequences made of the four nucleotides G, C, A, T in a
string of DNA. An example is = (G, C, A, T, T, A, G, G, C, T ). Define X() as the number
of Guanin basis elements. In the particular sample just given, we have X() = 3.
Problem Assume X() is the number of Guanin basis elements in a sequence. What is the
probability of the event {X() = 2 }? Answer Our probability space has 410 = 1048576
elements. There are 38 cases, where the first two elements are G. There are 38 elements
where the first and third element is G, etc. For any pair, there are 38 sequences. We have
(109/2) = 45 possible ways to chose a pair from the 10. There are therefore 38 45 sequences
with exactly 2 amino acids G. This is the cardinality of the event A = {X() = 2 }. The
probability is |A|/|| = 45 38 /410 which is about 0.28.
For random variables taking finitely many values we can look at the probabilities
pj = P [X = cj ]. This collection of numbers is called a discrete probability
distribution of the random variable.
0.05
10
10
For a random variable X taking finitely many values, we define the expectation
P
as m = E[X] = x xP[X = x ]. Define the variance asqVar[X] = E[(X m)2 ] =
E[X 2 ] E[X]2 and the standard deviation as [X] = Var[X].
In the case of throwing a coin 10 times and head appears with probability p = 1/2 we have
E[X] = 0 P[X = 0] + 1 P[X = 1] + 2 P[X = 2] + 3 P[X = 3] + + 10 P[X = 10] .
We throw a dice 10 times and call! X() the number of times that heads shows up. We
10!
have P[X = k ] =
/210 . because we chose k elements from n = 10. This
k!(10 k)!
distribution is called the Binominal distribution on the set {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 }.
The average adds up to 10 p = 5, which is what we expect. We will see next time when
we discuss independence, how we can get this immediately. The variance is
Var[X] = (0 5)2 P[X = 0 ] + (1 5)2 P[X = 1 ] + + (10 5)2 P[X = 10 ] .
9
AB
AB
AB
8
9
7
8
AB
6
7
5
6
AB
AB
4
5
3
3
3
4
AB
Throw a vertical line randomly into the unit disc. Let X[ ] be the length of the segment
cut out from the circle. What is P [X > 1 ]?
Solution:
we need to hit the x axes in |x| < 1/ 3. Comparing lenghts gives the probability
is 1/ 3. We have assumed here that every interval [c, d] in the interval [1, 1] appears with
probability (d c)/2.
AB
All these examples so far, the random variable has taken only a discrete set of values. Here is
an example, where the random variable can take values in an interval. It is called a variable
with a continuous distribution.
It is 10p(1 p) = 10/4. Again, we will have to wait until next lecture to see how we can get
this without counting.
We look at the probability space of all 2 2 matrices, where the entries are either 1 or
1. Define the random variable X() = det(), where is one of the matrices. The
determinant is
"
#
a b
det(
= ad bc .
c d
Draw the probability distribution of this random variable and find the expectation as
well as the Variance and standard deviation.
In the previous example, we have seen again the Bertrand example, but because we insisted
on vertical sticks, the probability density was determined. The other two cases we have seen
produced different probability densities. A probability model always needs a probability
function P .
In the card game blackjack, each of the 52 cards is assigned a value. You see the
French card deck below in the picture. Numbered cards 2-10 have their natural
value, the picture cards jack, queen, and king count as 10, and aces are valued as
either 1 or 11. Draw the probability distribution of the random variable X which gives
the value of the card assuming that we assign to the hearts ace and diamond aces the
value 1 and to the club ace and spades ace the value 11. Find the mean the variance
and the standard deviation of the random variable X.
A LCD display with 100 pixels is described by a 10 10 matrix with entries 0 and 1.
Assume, that each of the pixels fails independently with probability p = 1/20 during
the year. Define the random variable X() as the number of dead pixels after a year.
a) What is the probability of the event P [X > 3], the probability that more than 3
pixels have died during the year?
b) What is the expected number of pixels failing during the year?
Let be the probability space obtained by throwing two dice. It has 36 elements. Let A be
the event that the first dice shows an odd number and let B be the event that the second
dice shows less than 3 eyes. The probability of A is 18/36 = 1/2 the probability of B is
12/36 = 1/3. The event A B consists of the cases {(1, 1), (1, 2), (3, 1), (3, 2), (5, 1), (5, 2) }
and has probability 1/6. The two events are independent.
If is a finite probability space where each experiment has the same probability, then
E[X] =
1 X
X()
||
(1)
xj P[X = xj ] ,
(2)
xj
where xj are the possible values of X. The later expression is the same but involves less terms.
In real life we often do not know the probability space. Or, the probability space
is so large that we have no way to enumerate it. The only thing we can access is
the distribution, the frequency with which data occur. Statistics helps to build a
model. Formula (1) is often not computable, but (2) is since we can build a model
with that distribution.
11
12
13
14
15
16
21
22
23
24
25
26
31
32
33
34
35
36
41
42
43
44
45
46
51
52
53
54
55
56
61
62
63
64
65
66
X = (1, 2, 3, 3, 4, 1, 1, 1, 2, 6)
To compute the expectation of X, write it as the result of a random variable X(1) =
1, X(2) = 2, X(3) = 3, ..., X(10) = 6 on a probability space of 10 elements. In this case,
E[X] = (1 + 2 + 3 + 3 + 4 + 1 + 1 + 1 + 2 + 6)/10 = 24/10. But we can look at these data
also differently and say P[X = 1] = 4/10, P[X = 2] = P[X = 3] = 2/10, P[X = 4] = P[X =
6] = 1/6. Now,
E[X] = 1 P[X = 1] + 2 P[X = 2] + 3 P[X = 3] + 4 P[X = 4] + 6 P[X = 6]
2
2
1
1
12
4
+2
+3
+4
+6
=
.
= 1
10
10
10
10
10
5
The first expression has 10 terms, the second 5. Not an impressive gain, but look at the
next example.
We throw 100 coins and let X denote the number of heads. Formula (1) involves 2100
terms. This is too many to sum over. The expression (2) however
100
X
kP[X = k] =
100
X
100
k
k=1
k=1
This follows from the definition P[A|B] = P[A B]/P[B] and P[A B] = P[A] P[B].
1
2100
Two random variables X, Y are called independent if for every x, y, the events
{X = x} and {Y = y} are independent.
has only 100 terms and sums up to 100 (1/2) = 50 because in general
n
1 X
k
n
2 k=0
n
k
n
.
2
5
!
n
n1
By the way, one can see this by writing out the factorials k
=n
. Summing
k
k
over the probability space is unmanageable. Even if we would have looked at 10 trillion
cases every millisecond since 14 billion years, we would not be through. But this is not an
obstacle. Despite the huge probability space, we have a simple model which tells us what
the probability is to have k heads.
If is the probability space of throwing two dice. Let X be the random variable which gives
the value of the first dice and Y the random variable which gives the value of the second
dice. Then X((a, b)) = a and Y ((a, b)) = b. The events X = x and Y = y are independent
because each has probability 1/6 and event {X = x, Y = y} has probability 1/36.
Two random variables X, Y are called uncorrelated, if E[XY ] = E[X] E[Y ].
Let X be the random variable which is 1 on the event A and zero everywhere else. Let Y
be the random variable which is 1 on the event B and zero everywhere else. Now E[X] =
0P[X = 0] + 1P[X = 1] = P[A]. Similarly P[Y ] = P[B]. and P[XY ] = P[A B] because
XY () = 1 only if is in A and in B.
Let X be the random variable on the probability space of two dice which gives the dice value
of the first dice. Let Y be the value of the second dice. These two random variables are
uncorrelated.
E[XY ] =
6 X
6
212
49
1 X
ij = [(1 + 2 + 3 + 4 + 5 + 6) (1 + 2 + 3 + 4 + 5 + 6)]/36 =
=
.
36 i=1 j=1
36
4
And here is the code which produces the mean and standard deviation of the first n
digits:
b=10; n=100;
s=IntegerDigits [ Floor [ Pi bn ] , b ] ; m=N[Sum[ s [ [ k ] ] , { k , n } ] / n ] ;
sigma=Sqrt [N[Sum[ ( s [ [ k ]] m) 2 , { k , n } ] / n ] ] ; {m, sigma }
Let be the probability space of throwing two dice. Let X denote the difference of the
two dice values and let Y be the sum. Find the correlation between these two random
variables.
Two random variables are uncorrelated if and only if their correlation is zero.
To see this, just multiply out E[(X E[X])(Y E[Y ])] = E[XY ]2E[X]E[X]+E[X]E[X] =
E[XY ] E[X]E[Y ].
If two random variables are independent, then they are uncorrelated.
Proof. Let {a1 , . . . , an } be the values of the variable X and {b1 , . . . , bn }"be the value of
1 xA
Let
the variable Y . For an event A we define the random variable 1A () =
0 x
/A
Pn
Pm
Ai = {X = ai } and Bj = {Y = bj }. We can write X = i=1 ai 1Ai , Y = j=1 bj 1Bj , where
the events Ai and Bj are independent. Because E[1Ai ] = P[Ai ] and E[1Bj ] = P[Bj ] we have
E[1Ai 1Bj ] = P[Ai ] P[Bj ]. This implies E[XY ] = E[X]E[Y ].
For uncorrelated random variables, we have Var[X + Y ] = Var[X] + Var[Y ].
To see this, subtract first the mean from X and Y . This does not change the variance but now
the random variables have mean 0. We have Var[X+Y ] = E[(X+Y )2 ] = E[X 2 +2XY +Y 2 ] =
E[X 2 ] + 2E[XY ] + E[Y 2 ].
Let X be the random variable of one single Bernoulli trial with P[X = 1] = p and P[X =
0] = 1 p. This implies E[X] = 0P[X = 0] + pP[X = 1] and
Var[X] = (0 p)2 P[X = 0] + (1 p)2 P[X = 1] = p2 (1 p) + (1 p)2 p = p(1 p) .
If we add n independent random variables of this type, then E[X1 + + Xn ] = np and
Var[X1 + ... + Xn ] = Var[X1 + + Xn ] = np(1 p).
You can check the above proof using E[f (X)] = j f (aj )E[Aj ] and E[g(X)] = j g(bj )E[Bj ].
It still remains true. The only thing which changes are the numbers f (ai ) and g(bj ). By choosing suitable functions we can assure that all events Ai = X = xi and Bj = Y = yj are independent.
P
Lets explain this in a very small example, where the probability space has only three elements. In
that case, random variables are vectors. We look at centered random variables, random variables
of zero mean so that the covariance
variables, meaning that X = b is the function on the probability space {1, 2, 3 } given by
c
f (1) = a, f (2) = b, f (3) = c. As you know from linear algebra books, it is more common to
write Xk instead of X(k). Lets state an almost too obvious relation between linear algebra and
probability theory because it is at the heart of the matter:
Vectors in Rn can be seen as random variables on the probability space {1, 2, ...., n }.
P
P
1 xA
We can write X = ni=1 ai 1Ai , Y = m
j=1 bj 1Bj ,
0 x
/A
where Ai = {X = ai } and Bj = {Y = bj } are independent. Because E[1Ai ] = P[Ai ] and
E[1Bj ] = P[Bj ] we have E[1Ai 1Bj ] = P[Ai ] P[Bj ]. Compare
E[XY ] = E[(
ai 1Ai )(
bj 1Bj )] =
ai bj E[1Ai 1Bj ] =
i,j
It is because of this relation that it makes sense to combine the two subjects of linear algebra and
probability theory. It is the reason why methods of linear algebra are immediately applicable to
probability theory. It also reinforces the picture given in the first lecture that data are vectors.
The expectation of data can be seen as the expectation of a random variable.
ai bj E[1Ai ]E[1Bj ] .
i,j
E[X]E[Y ] = E[(
ai 1Ai )]E[(
X
j
bj 1Bj )] = (
ai E[1Ai ])(
bj E[1Bj ]) =
Here are two random variables of zero mean: X = 3 and Y = 4 . They are
0
8
uncorrelated because their dot product E[XY ] = 3 4 + (3) 4 + 0 8 is zero. Are they
independent? No, the event A = {X = 3 } = {1 } and the event B = {Y = 4 } = {1, 2 } are
not independent. We have P[A] = 1/3, P[B] = 2/3 and P[AB] = 1/3. We can also see it as
follows: the random variables X 2 = [9, 9, 0] and Y 2 = [16, 16, 64] are no more uncorrelated:
E[X 2 Y 2 ] E[X 2 ]E[Y 2 ] = 31040 746496 is no more zero.
with
X
ai bj E[1Ai ]E[1Bj ] .
i,j
Cov[XY ]
.
[X][Y ]
Lets take the case of throwing two coins. The probability space is {HH, HT, T H, T T }.
1
1
The random variable that the first dice is 1 is X = . The random variable that
0
0
4
These random variables are independent. We can
0
center them to get centered random variables which are independent. [ Alert: the random
1
0
1 0
variables Y =
,
written down earlier are not independent, because the sets
0 1
0
1
A = {X = 1} and {Y = 1} are disjoint and P[A B] = P[A] P[B] does not hold. ]
Given any 0 1 data of length n. Let k be the number of ones. If p = k/n is the
mean, then the variance of the data is p(1 p).
2/3
1/3
E[X])(Y E[Y ])] is the dot product 1/3 2/3 is not zero. Interestingly enough
1/3
1/3
there are no nonconstant random variables on a probability space with three elements which
are independent.1
Proof. Here is the statisticians proof: n1 ni=1 (xi p)2 = n1 (k(1 p)2 + (n k)(0 p)2 ) =
(k 2kp + np2 )/n = p 2p + p2 = p2 p = p(1 p).
And here is the probabilists proof: since E[X 2 ] = E[X] we have Var[X] = E[X 2 ] E[X]2 =
E[X](1 E[X]) = p(1 p).
P
Parameter estimation
Parameter estimation is a central subject in statistics. We will look at it in the case of the
Binomial distribution. As you know, if we have a coin which shows heads with probability
p then the probability to have X = k heads in n coin tosses is
Independence depends on the coordinate system. Find two random variables X, Y such
that X, Y are independent but X 2Y, X + 2Y are not independent.
Assume you have a string X of n = 1000 numbers which takes the two values 0 and
P
a = 4. You compute the mean of these data p = (1/n) k X(k) and find p = 1/5. Can
you figure out the standard deviation of these data?
n
k
pk (1 p)nk .
Keep this distribution in mind. It is one of the most important distributions in probability
theory. Since this is the distribution of a sum of k random variables which are independent
Xk = 1 if kth coin is head and Xk = 0 if it is tail, we know the mean and standard deviation
of these variables E[X] = np and Var[X] = np(1 p).
1
This is true for finite probability spaces with prime || and uniform measure on it.
2
These data were obtained with IntegerDigits[P rime[100000], 2] which writes the 100000th prime p = 1299709
in binary form.
If T (x, y) = (cos()x sin()y, sin()x + cos()y) is a rotation in the plane, then the image
of T is the whole plane.
The averaging map T (x, y, z) = (x + y + z)/3 from R3 to R has as image the entire real axes
R.
The span of vectors ~v1 , . . . , ~vk in Rn is the set of all linear combinations c1~v1 +
. . . ck~vk .
1
2
3
4
5
1
2
3
4
5
1
2
3
4
5
x
1 0 0
x
1 1
2
3
4
5 5
A=
3
The kernel of T (x, y, z) = (x, y, 0) is the z-axes. Every vector (0, 0, z) is mapped to 0.
The kernel of a rotation in the plane consists only of the zero point.
The kernel of the averaging map consists of all vector (x, y, z) for which x + y + z = 0.
The kernel is a plane. In the language of random variables, the kernel of T consists of the
centered random variables.
kernel
domain
How do we compute the kernel? Just solve the linear system of equations A~x = ~0. Form
rref(A). For every column without leading 1 we can introduce a free variable si . If ~x is
P
the solution to A~xi = 0, where all sj are zero except si = 1, then ~x = j sj ~xj is a general
vector in the kernel.
image
codomain
3
6
9
2 6
0
5
1
0
1 3 0
0 1
. There are two pivot columns and one redundant column. The equation B~
x=0
0 0
0 0 0
is equivalent to the system x + 3y = 0, z = 0. After fixing z = 0, can chose y = t freely and
3
obtain from the first equation x = 3t. Therefore, the kernel consists of vectors t
1 .
0
How do we compute the image? If we are given a matrix for the transformation, then the
image is the span of the column vectors. But we do not need all of them in general.
1
0
1
0
1
0
1
0
A=
1
0
1
0
1
0
1
0
0
1
0
1
0
1
0
1
1
0
1
0
1
0
1
0
0
1
0
1
0
1
0
1
1
0
1
0
1
0
1
0
0
1
0
1
0
1
0
1
0 0
0
0
1
1 0
A=
0
0
0
1
0
4
0
1
0
3
0
1
0
2
0
6
0
1
0
3
0
0
0
1
0
4
0
0
0
1
0
0
0
0
0
1
1
Hu = 0
0
0
1
1
0
1
0
0
1
0
1
0
0
1
0
1
1
0
0
0
0
1
.
.
=
.
.
.
.
.
.
.
.
.
My =
v = My + f =
0 0 1 0 1
1 0 1 1 0
0 1 1 1 1
1
.
= . , Hv = 0
0
.
0 0 1
1 0 1
0 1 1
1
1
1
1
0
0
0
0
1
1
0
1
0
0
1
0
1
0
0
1
0
1
1
0
0
0
0
1
.
.
=
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
= . .
.
, f = . .
x1
.
.
x2
.
.
x4
.
x
.
.
x
3
.
5
x
Use P
=
to determine P e = P . = , P f = P . =
4
x6
.
x5
.
.
x7
.
x
.
.
6
x7
.
.
an error-free transmission (P u, P v) would give the right result back. Now
.
.
.
.
.
.
.
0 1 1
1 0 1
1 1 0
e and f :
.
Pu = ,
.
.
.
.
.
.
.
.
.
.
Choose a letter to get a pair of vectors (x, y). x = , y = . Use 1 + 1 = 0 in
.
.
Mx =
1
1
0
We work on the error correcting code as in the book (problem 53-54 in 3.1). Your task
is to do the encoding and decoding using the initials from your name and write in one
sentence using the terminology of image and kernel, what the essence of this error
correcting code is.
Step I) Encoding. To do so, we encode the letters of the alphabet by pairs of three
vectors containing zeros and ones:
A = (0, 0, 0, 1), (0, 0, 0, 1)
D = (0, 0, 0, 1), (0, 1, 0, 1)
G = (0, 0, 0, 1), (1, 0, 0, 1)
J = (0, 0, 0, 1), (1, 1, 0, 1)
M = (0, 0, 1, 0), (0, 0, 0, 1)
P = (0, 0, 1, 0), (0, 1, 0, 1)
S = (0, 0, 1, 0), (1, 0, 0, 1)
V = (0, 0, 1, 0), (1, 1, 0, 1)
Y = (0, 0, 1, 1), (1, 0, 0, 1)
! = (0, 0, 1, 1), (1, 0, 0, 1)
.
.
.
.
.
.
.
Find the image and kernel of the following Pascal triangle matrix:
0
1
0
1
0
1
0
1
u = Mx + e =
.
.
.
.
. In
.
Pv =
.
A set of random variables X1 , .., Xn which form a basis on a finite probability space
= {1, ..., n } describe everything. Every random variable X which we want to
compute can be expressed using these random variables.
The image and kernel of a transformation are linear spaces. This is an important example
since this is how we describe linear spaces, either as the image of a linear transformation
or the kernel of a linear transformations. Both are useful and they are somehow dual to
each other. The kernel is associated to row vectors because we are perpendicular to all
row vectors, the image is associated to column vectors because we are perpendicular to all
column vectors.
1 1 0 0
,
,
,
.
0 0 1 1
In general, one a finite probability space. A basis defines n random variables such that every
random variable X can be written as a linear combination of X1 , ..., Xn .
3
6
9
2 6
A=
3
0
5
1
0
Row reduction shows that the first and third vector span the space V . This is a basis for
V.
|
. . . ~vn
v1 , . . . , ~vn is a
is invertible if and only if ~
|
basis in Rn .
0
1
0
Two nonzero vectors in the plane form a basis if they are not parallel.
2 6 5
A=
,
,
.
3 9 1
A set B of vectors ~v1 , . . . , ~vm is called basis of a linear subspace X of Rn if they are
linear independent and if they span the space X. Linear independent means that
there are no nontrivial linear relations ai~v1 + . . . + am~vm = 0. Spanning the space
means that very vector ~v can be written as a linear combination ~v = a1~v1 +. . .+am~vm
of basis vectors.
1
echelon form is B = rref(A) = 0
0
0 0 1
kernel of A = 1 1 0 . Solution: In reduced row
1 1 1
1 0
0 1 . To determine a basis of the kernel we write
0 0
x
1
1
0
1
A set of n vectors ~v1 , ..., ~vn in Rn form a basis in Rn if and only if the matrix A
containing the vectors as column vectors is invertible.
0 1 1
0 2
1 1 1
A= 1
is invertible.
More generally, the pivot columns of an arbitrary matrix A form a basis for the image of A.
Since we represent linear spaces always as the kernel or image of a linear map, the problem
of finding a basis to a linear space is always the problem of finding a basis for the image or
finding a basis for the kernel of a matrix.
Find a basis for the image and kernel of the Chess matrix:
A=
1
0
1
0
1
0
1
0
0
1
0
1
0
1
0
1
1
0
1
0
1
0
1
0
0
1
0
1
0
1
0
1
1
0
1
0
1
0
1
0
0
1
0
1
0
1
0
1
1
0
1
0
1
0
1
0
0
1
0
1
0
1
0
1
Find a basis for the set of vectors perpendicular to the image of A, where A is the
Pascal matrix.
0 0 0 0 1 0 0 0 0
0 0 0 1 0 1 0 0 0
A= 0 0 1 0 2 0 1 0 0
.
0 1 0 3 0 3 0 1 0
1 0 4 0 6 0 4 0 1
1 3
6
9
4 1
A=
3
2
0
1
1
The matrix AT denotes the matrix where the columns and rows of A are switched. It
is called the transpose of the matrix A.
To prove the proposition, use the lemma in two ways. Because A spans and B is linearly independent, we have m n. Because B spans and A is linearly independent, we have n m.
Remember that X Rn is called a linear space if ~0 X and if X is closed under addition and
scalar multiplication. Examples are Rn , X = ker(A), X = im(A), or the row space of a matrix.
In order to describe linear spaces, we had the notion of a basis:
B = {~v1 , . . . , ~vn } X is a basis if two conditions are satisfied: B is linear independent meaning
that c1~v1 + ... + cn~vn = 0 implies c1 = . . . = cn = 0. Then B span X: ~v X then ~v =
a1~v1 + . . . + an~vn . The spanning condition for a basis assures that there are enough vectors
to represent any other vector, the linear independence condition assures that there are not too
many vectors. Every ~v X can be written uniquely as a sum ~v = a1~v1 + . . . + an~vn of basis
vectors.
The following theorem is also called the rank-nullety theorem because dim(im(A)) is the rank
and dim(ker(A))dim(ker(A)) is the nullety.
Fundamental theorem of linear algebra: Let A : Rm Rn be a linear map.
dim(ker(A)) + dim(im(A)) = m
There are n columns. dim(ker(A)) is the number of columns without leading 1, dim(im(A)) is the
number of columns with leading 1.
If A is an invertible n n matrix, then the dimension of the image is n and that the
dim(ker)(A) = 0.
0 1
0
,
,,
.
... ...
...
The dimension of {0 } is zero. The dimension of any line 1. The dimension of a plane is 2.
The dimension of the image of a matrix is the number of pivot columns. We can construct
a basis of the kernel and image of a linear transformation T (x) = Ax by forming B = rrefA.
The set of Pivot columns in A form a basis of the image of T .
A=
Given a basis A = {v1 , ..., vn } and a basis B = {w1 , ..., .wm } of X, then m = n.
Lemma: if q vectors w
~ 1 , ..., w
~ q span X and ~v1 , ..., ~vp are linearly independent in X, then q p.
Pq
3
6
9
12
15
18
21
24
27
4
8
12
16
20
24
28
32
36
5
10
15
20
25
30
35
40
45
6
12
18
24
30
36
42
48
54
7
14
21
28
35
42
49
56
63
8
16
24
32
40
48
56
64
72
9
18
27
36
45
54
63
72
81
Are there a 4 4 matrices A, B of ranks 3 and 1 such that ran(AB) = 0? Solution. Yes,
we can even find examples which are diagonal.
Is there 4 4 matrices A, B of rank 3 and 1 such that ran(AB) = 2? Solution. No, the
kernel of B is three dimensional by the rank-nullety theorem. But this also means the kernel
of AB is three dimensional (the same vectors are annihilated). But this implies that the
rank of AB can maximally be 1.
The rank of AB can not be larger than the rank of A or the rank of B.
The nullety of AB can not be smaller than the nullety of B.
aij w
~ j = ~vi . Now do Gauss
a
v1T
11 . . . a1q | ~
Jordan elimination of the augmented (p (q + n))-matrix to this system: . . . . . . . . . | . . . ,
ap1 . . . apq | ~
vpT
h
i
where ~viT is the vector ~vi written as a row vector. Each row of A of this A|b contains some
j=1
2
4
6
8
10
12
14
16
18
The dimension of the kernel of a matrix is the number of free variables. It is also called
nullety. A basis for the kernel is obtained by solving Bx = 0 and introducing free variables
for the redundant columns.
1
2
3
4
5
6
7
8
9
~ 1T + ... + bq w
~ qT
nonzero entry. We end up with a matrix, which contains a last row 0 ... 0 | b1 w
showing that b1 w
~ 1T + + bq w
~ qT = 0. Not all bj are zero because we had to eliminate some nonzero
the internet when searching for fractals. It assumes that X is a bounded set. You can pick up
this definition also in the Startreck movie (2009) when little Spock gets some math and ethics
lectures in school. It is the simplest definition and also called box counting dimension in the math
literature on earth.
Assume we can cover X with n = n(r) cubes of size r and not less. The fractal
dimension is defined as the limit
dim(X) =
log(n)
log(r)
as r 0.
For linear spaces X, the fractal dimension of X intersected with the unit cube agrees
with the usual dimension in linear algebra.
Proof. Take a basis B = {v1 , . . . , vm } in X. We can assume that this basis vectors are all
orthogonal and each vector has length 1. For given r > 0, place cubes at the lattice points
Pm
m
j with integer kj . This covers the intersection X with the unit cube with (C/r ) cubes
j=1 kj rv
We cover the unit interval [0, 1] with n = 1/r intervals of length r. Now,
dim(X) =
10
log(1/r)
=1.
log(r)
log(1/r 2 )
=2.
dim(X) =
log(r)
11
12
The Cantor set is obtained recursively by dividing intervals into 3 pieces and throwing
away the middle one. We can cover the Cantor set with n = 2k intervals of length r = 1/3k
so that
log(2k )
= log(2)/ log(3) .
dim(X) =
log(1/3k )
The Shirpinski carpet is constructed recursively by dividing a square in 9 equal squares
and throwing away the middle one, repeating this procedure with each of the squares etc.
At the kth step, we need n = 8k squares of length r = 1/3k to cover X. The dimension is
dim(X) =
log(8k )
= log(8)/ log(3) .
log(1/3k )
This is smaller than 2 = log(9)/ log(3) but larger than 1 = log(3)/ log(3).
x1 + 2x2 + 3x3 x4 + x5 = 0 .
b) Find a basis for the space spanned by the rows of the matrix
1 2 3 4 5
4 5 6
.
3 4 4 6 7
A= 2 3
Find a clever basis for the reflection of a light ray at the line x +"2y =
# 0. Solution:
"
# Use
1
2
one vector in the line and an other one perpendicular to it: ~v1 =
, ~v2 =
. We
2
1
"
#
"
#
1 0
1 2
achieved so B =
= S 1 AA with S =
.
0 1
2 1
... |
. . . ~vn
. It is called
... |
By definition, the matrix S is invertible: the linear independence of the column vectors implies S
has no kernel. By the rank-nullety theorem, the image is the entire space Rn .
c1
x1
S
~v
w
~ = [~v ]B
A
B
We write [~x]B =
x=
x = S([~x]B ).
. . . . If ~
. . . , we have ~
cn
xn
A~v S
The B-coordinates of ~x are obtained by applying S 1 to the coordinates of the standard basis:
[~x]B = S 1 (~x)
This just rephrases that S[~x]B = ~x. Remember the column picture. The left hand side is just
c1~v1 + + cn~vn where the vj are the column vectors of S.
3
{e1 , e2 , e3 } we have ~x = 4e1 2e2 + 3e3 .
If ~v1 =
"
1
2
and ~v2 =
"
3
, then S =
5
S 1~v =
"
"
1 3
. A vector ~v =
2 5
5 3
2 1
#"
6
9
"
3
3
"
6
9
"
1 1
0 1
"
2
3
and S 1
0
1
2
the basis ~v1 = 1 ~v2 = 2 ~v3 = 3 . Because T (~v1 ) = ~v1 = [~e1 ]B , T (~v2 ) = ~v2 =
2
3
0
1 0 0
"
"
1
1
with respect to the basis B = {~v1 =
, ~v2 =
}.
0
1
"
#
"
#
1 1
1
=
. Therefore [v]B = S 1~v =
. Indeed
0 1
3
1~v1 + 3~v2 = ~v .
If B = {v1 , . . . , vn } is a basis in Rn and T is a linear transformation on Rn , then the
B-matrix of T is
|
...
|
Bw
~
d
b
1
3
"
a b
c d
and let S =
"
0 1
, What is S 1 AS? Solution:
1 0
c
. Both the rows and columns have switched. This example shows that the matrices
a# "
#
4 3
2
are similar.
,
2 1
4
Find the B-matrix B of the linear transformation which is given in standard coordinates
as
"
#
"
#"
#
x
1 1
x
T
=
y
1 1
y
"
2
if B = {
1
# "
1
}.
2
2
2
the plane V by using a suitable coordinate system.
a) Find a basis which describes best the points in the following lattice: We aim to
describe the lattice points with integer coordinates (k, l).
b) Once you find the basis, draw all the points which have (x, y) coordinates in the disc
x2 + y 2 10
1
2
"
1
2
and
"
6
3
are orthogonal in R2 .
Proof: (~x + ~y ) (~x + ~y ) = ||~x||2 + ||~y||2 + 2~x ~y ||~x||2 + ||~y ||2 + 2||~x||||~y|| = (||~x|| + ||~y||)2 .
~v and w
~ are both orthogonal to the cross product ~v w
~ in R3 . The dot product between ~v
and ~v w
~ is the determinant
v1 v2 v3
det( v1 v2 v3 ) .
w1 w2 w3
~v is called a unit vector if its length is one: ||~v|| =
~v ~v = 1.
A set of vectors B = {~v1 , . . . , ~vn } is called orthogonal if they are pairwise orthogonal. They are called orthonormal if they are also unit vectors. A basis is called
an orthonormal basis if it is a basis which is orthonormal. For an orthonormal
basis, the matrix with entries Aij = ~vi ~vj is the unit matrix.
"
3 #
"
#
1 2
1 3
is the matrix
.
3 4
2 4
Ali Bik =
T T
Bki
Ail = (B T AT )kl .
1 2 3 .
Express the fact that ~x is in the kernel of a matrix A using orthogonality. Answer A~x = 0
means that w
~ k ~x = 0 for every row vector w
~ k of Rn . Therefore, the orthogonal complement
of the row space is the kernel of a matrix.
(AB)Tkl = (AB)lk =
To check this, take two vectors in the orthogonal complement. They satisfy ~v w
~ 1 = 0, ~v w
~2 =
0. Therefore, also ~v (w
~1 + w
~ 2 ) = 0.
Proof: The dot product of a linear relation a1~v1 + . . . + an~vn = 0 with ~vk gives ak~vk
~vk = ak ||~vk ||2 = 0 so that ak = 0. If we have n linear independent vectors in Rn , they
automatically span the space because the fundamental theorem of linear algebra shows that
the image has then dimension n.
A vector w
~ Rn is called orthogonal to a linear space V , if w
~ is orthogonal to
every vector ~v V . The orthogonal complement of a linear space V is the set
W of all vectors which are orthogonal to V .
~
x~
y
||~
x||||~
y||
~
y
cos() = ||~x~x||||~
[1, 1] is the statistical correlation of ~x and ~y if the vectors ~x, ~y
y||
represent data of zero mean.
A rotation is orthogonal.
A reflection is orthogonal.
During one of Thales (-624 BC to (-548 BC)) journeys to Egypt, he used a geometrical
trick to measure the height of the great pyramid. He measured the size of the shadow
of the pyramid. Using a stick, he found the relation between the length of the stick and
the length of its shadow. The same length ratio applies to the pyramid (orthogonal
triangles). Thales found also that triangles inscribed into a circle and having as the
base as the diameter must have a right angle.
The Pythagoreans (-572 until -507) were interested in the discovery that the squares of
a lengths of a triangle with two orthogonal sides would add up as a2 + b2 =
c2 . They
2. This
were puzzled in assigning a length
to
the
diagonal
of
the
unit
square,
which
is
number is irrational because 2 = p/q would imply that q 2 = 2p2 . While the prime
factorization of q 2 contains an even power of 2, the prime factorization of 2p2 contains
an odd power of 2.
a) Verify that if A, B are orthogonal matrices then their product A.B and B.A are
orthogonal matrices.
b) Verify that if A, B are orthogonal matrices, then their inverse is an orthogonal
matrix.
c) Verify that 1n is an orthogonal matrix.
These properties show that the space of n n orthogonal matrices form a group. It
is called O(n).
Eratosthenes (-274 until 194) realized that while the sun rays were orthogonal to the
ground in the town of Scene, this did no more do so at the town of Alexandria, where
they would hit the ground at 7.2 degrees). Because the distance was about 500 miles
and 7.2 is 1/50 of 360 degrees, he measured the circumference of the earth as 25000
miles - pretty close to the actual value 24874 miles.
Closely related to orthogonality is parallelism. Mathematicians tried for ages to
prove Euclids parallel axiom using other postulates of Euclid (-325 until -265). These
attempts had to fail because there are geometries in which parallel lines always meet
(like on the sphere) or geometries, where parallel lines never meet (the Poincare half
plane). Also these geometries can be studied using linear algebra. The geometry on the
sphere with rotations, the geometry on the half plane uses Mobius transformations,
2 2 matrices with determinant one.
The question whether the angles of a right triangle do always add up to 180 degrees became an issue when geometries where discovered, in which the measurement depends on
the position in space. Riemannian geometry, founded 150 years ago, is the foundation
of general relativity, a theory which describes gravity geometrically: the presence of
mass bends space-time, where the dot product can depend on space. Orthogonality
becomes relative. On a sphere for example, the three angles of a triangle are bigger
than 180+ . Space is curved.
In probability theory, the notion of independence or decorrelation is used. For
example, when throwing a dice, the number shown by the first dice is independent and
decorrelated from the number shown by the second dice. Decorrelation is identical to
orthogonality, when vectors are associated to the random variables. The correlation
coefficient between two vectors ~v , w
~ is defined as ~v w/(|~
~ v|w|).
~ It is the cosine of the
angle between these vectors.
Let v, w be two vectors in three dimensional space which both have length 1 and are perpendicular to each other. Now
P x = (v x)~v + (w x)w
~.
vectors. For example, if v = 1 / 3 and w = 1 / 6, then
1
2
#
1/3 1/6 "
1/2 1/2 0
1/3 1/3 1/ 3
T
= 1/2 1/2 0 .
P = AA = 1/ 3 1/6
1/ 6 1/ 6 2/ 6
0
0 0
1/ 3 2/ 6
(P x x) P x = P x P x x P x = P x x x P x = P x x x P x = 0 .
2
Remember that a vector is in the kernel of AT if and only if it is orthogonal to the rows of AT
and so to the columns of A. The kernel of AT is therefore the orthogonal complement of im(A)
for any matrix A:
For an orthogonal projection P there is a basis in which the matrix is diagonal and
contains only 0 and 1.
Proof. Chose a basis B of the kernel of P and a basis B of V , the image of P . Since for every
~v B1 , we have P v = 0 and for every ~v B2 , we have P v = v, the matrix of P in the basis
B1 B2 is diagonal.
If V is the image of a matrix A with trivial kernel, then the projection P onto V is
1 1 1
1 1
/3
1 1 1
A=
1
P x = A(AT A)1 AT x .
The matrix
Proof. is clear. On the other hand AT Av = 0 means that Av is in the kernel of AT . But since
the image of A is orthogonal to the kernel of AT , we have A~v = 0, which means ~v is in the kernel
of A.
The matrix
1 0 0
1 0
0 0 0
A=
0
Proof. Let y be the vector on V which is closest to Ax. Since y Ax is perpendicular to the
image of A, it must be in the kernel of AT . This means AT (y Ax) = 0. Now solve for x to get
the least square solution
x = (AT A)1 AT y .
The projection is Ax = A(AT A)1 AT y.
5
If V is a line containing the unit vector ~v then P x = v(v x), where is the dot product.
Writing this as a matrix product shows P x = AAT x where A is the n 1 matrix which
contains ~v as the column. If v is not a unit vector, we know from multivariable calculus that
P x = v(v x)/|v|2 . Since |v|2 = AT A we have P x = A(AT A)1 AT x.
How do we construct the matrix of an orthogonal projection? Lets look at an other example
1 0
0
. The orthogonal projection onto V = im(A) is ~b 7 A(AT A)1 AT ~b. We
0 1
"
#
1/5 2/5 0
5 0
T
T
1 T
have A A =
and A(A A) A = 2/5 4/5 0 .
2 1
0
0 1
0
2/5
~
For example, the projection of b = 1 is ~x = 4/5 and the distance to ~b is 1/ 5.
0
0
The point ~x is the point on V which is closest to ~b.
Let A = 2
Let A = . Problem: find the matrix of the orthogonal projection onto the image of A.
0
1
The image of A is a one-dimensional line spanned by the vector ~v = (1, 2, 0, 1). We calculate
AT A = 6. Then
A(AT A)1 AT =
1
2
0
1
2 0 1 /6 =
1
2
0
1
2
4
0
2
0
0
0
0
1
2
0
1
/6 .
ker(A )= im(A)
T
A (b-Ax)=0
V=im(A)
0 1
,
.
1 0
T -1 T
*
Ax=A(A
A) A b
0 0 1
B = { 0 , 1 , 0
}.
0 1 0
Ax *
Let A be a matrix with trivial kernel. Define the matrix P = A(AT A)1 AT .
a) Verify that we have P T = P .
b) Verify that we have P 2 = P .
For this problem, just use the basis properties of matrix algebra like (AB)T = B T AT .
x y
-1 1
1 2
2 -1
Given a system of linear equations Ax = b, the point x = (AT A)1 AT b is called the
least square solution of the system.
If A has no kernel, then the least square solution exists.
Proof. We know that if A has no kernel then the square matrix AT A has no kernel and is therefore
invertible.
x y
-1 8
0 8
1 4
2 16
If x is the least square solution of Ax = b then Ax is the closest point on the image of A to b. The
least square solution is the best solution of Ax = b we can find. Since P x = Ax, it is the closest
point to b on V . Our knowledge about kernel and the image of linear transformations helped us
to derive this.
Finding the best polynomial which passes through a set of points is a data fitting problem.
If we wanted to accommodate all data, the degree of the polynomial would become too
large. The fit would look too wiggly. Taking a smaller degree polynomial will not only be
more convenient but also give a better picture. Especially important is regression, the
fitting of data with linear polynomials.
4
The above pictures show 30 data points which are fitted best with polynomials of degree 1,
6, 11 and 16. The first linear fit maybe tells most about the trend of the data.
The simplest fitting problem is fitting by lines. This is called linear regression. Find the
best line y = ax + b which fits the data
Find the function y = f (x) = a cos(x) + b sin(x), which best fits the data
x y
0 1
1/2 3
1 7
Solution: We have to find the least square solution to the system of equations
1a + 0b = 1
0a + 1b = 3
1a + 0b = 7
1 0
1
A = 0 1 , ~b = 3 .
1 0
7
Now AT~b =
"
"
6
3
and AT A =
"
2 0
0 1
-2
-2
"
1/2 0
0 1
3
. The best fit is the function f (x) = 3 cos(x) + 3 sin(x) .
3
=
=
=
=
Find the function y = f (x) = ax2 + bx3 , which best fits the data
x y
-1 1
1 3
0 10
In other words, find the least square solution for the system of equations for the unknowns
a, b which aims to have all 4 data points (xi , yi ) on the circle. To get system of linear
equations Ax = b, plug in the data
lla + b
ab
2a
2a + 2b
Find the circle a(x2 + y 2 ) + b(x + y) = 1 which best fits the data
x y
0 1
-1 0
1 -1
1 1
-1
(x1 , y1)
(x2 , y2)
(x3 , y3)
(x4 , y4)
1
1
1
1.
=
=
=
=
(1, 2)
(1, 0)
(2, 1)
(0, 1)
1
1
0
2 2
A=
2
, b =
.
We get the least square solution with the usual formula. First compute
(AT A)1 =
"
and then
AT b =
3 2
2 5
"
6
2
/22
3
,
We have AT A =
Last time, we saw how the geometric formula P = A(A A) A for the projection on the image
of a matrix A allows us to fit data. Given a fitting problem, we write it as a system of linear
equations
Ax = b .
1
"
2 1
1 2
and AT b =
"
7
. We get the least square solution with the
5
formula
x = (AT A)1 AT b =
While this system is not solvable in general, we can look for the point on the image of A which
is closest to b. This is the best possible choice of a solution and called the least square
solution:
"
3
1
The best fit is the function f (x, y) = 3x2 + y 2 which produces an elliptic paraboloid.
The vector x = (AT A)1 AT b is the least square solution of the system Ax = b.
The most popular example of a data fitting problem is linear regression. Here we have data
points (xi , yi ) and want to find the best line y = ax + b which fits these data. But data fitting
can be done with any finite set of functions. Data fitting can be done in higher dimensions too.
We can for example look for the best surface fit through a given set of points (xi , yi , zi ) in space.
Also here, we find the least square solution of the corresponding system Ax = b which is obtained
by assuming all points to be on the surface.
0 1
2
0
, ~b = 4 .
1 1
3
A= 1
x y z
0 1 2
-1 0 4
1 -1 3
endowment in billions
5
7
18
25
27
We solved this example in class with linear regression. We saw that the best fit. With a
quadratic fit, we get the system A~x = ~b with
1 1 1
2 4
3 9
4 16
1 5 25
A=
1
~b = 18
.
25
27
Solution: We have to find the least square solution to the system of equations
a0 + b 1 = 2
a1 + b 0 = 4
a1 + b 1 = 3 .
a
21/5
The solution vector ~x = b = 277/35 which indicates strong linear growth but some
c
2/7
slow down.
Here is a problem on data analysis from a website. We collect some data from users but not
everybody fills in all the data
Person
Person
Person
Person
1 3
2 4
3 4 1
5 - 4 2
- -
3 9 8 - 5
5 7 - - -
- 6 2
1 9
- -
2 9
- 9
8 - -
It is difficult to do statistic with this. One possibility is to filter out all data from people who
do not fulfill a minimal requirement. Person 4 for example did not do the survey seriously
enough. We would throw this data away. Now, one could sort the data according to some
important row. Arter tha one could fit the data with a function f (x, y) of two variables.
This function could be used to fill in the missing data. After that, we would go and seek
correlations between different rows.
Whenever doing datareduction like this, one must always compare different scenarios
and investigate how much the outcome changes when changing the data.
The left picture shows a linear fit of the above data. The second picture shows a fit with
cubic functions.
The first 6 prime numbers 2, 3, 5, 7, 11 define the data points (1, 2), (2, 3), (3, 5), (5, 7), (6, 11)
in the plane. Find the best parabola of the form y = ax2 + c which fits these data.
Linear algebra
x1
P[B|A]P[A]
formula P[A|B] = P[B|A]+P[B|A
c] .
P[B|A
i ]P[Ai ]
rule P[Ai |B] = Pn P[B|A
.
j ]P[Aj ]
j=1
Permutations n! = n (n 1) (n 2) 2 1 possibilities.
a b
Rotation-Dilation A =
. Scale by a2 + b2 , rotate by arctan(b/a).
b a
"
#
a b
Reflection-Dilation A =
. Scale by a2 + b2 , reflect at line w, slope b/a.
b a
"
#
"
#
1 a
1 0
Horizontal and vertical shear ~x 7 A~x, A =
, ~x 7 A~x, A =
.
0 1
b 1
"
#
cos(2) sin(2)
Reflection about line ~x 7 A~x, A =
.
sin(2) cos(2)
#
"
u1 u1 u1 u2
.
Projection onto line containing unit vector u: A =
u2 u1 u2 u2
A random variable is a function from a probability space to the real line R. There are two
important classes of random variables:
1) For discrete random variables, the random variable X takes a discrete set of values. This
means that the random variable takes values xk and the probabilities P[X = xk ] = pk add up to
1.
2) For continuous random variables, there is a probability density function f (x) such that
R
R
P[X [a, b]] = ab f (x) dx and
f (x) dx = 1.
0.05
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Discrete distributions
1
is called the Poisson distribution. It is used to describe the number of radioactive decays
in a sample, the number of newborns with a certain defect. You show in the homework that
the mean and standard deviation is Poisson distribution is the most important distribution
on {0, 1, 2, 3, . . .}. It is a limiting case of Binomial distributions
We throw a dice and assume that each side appears with the same probability. The random
variable X which gives the number of eyes satisfies
P[X = k ] =
1
.
6
This is a discrete distribution called the uniform distribution on the set {1, 2, 3, 4, 5, 6 }.
0.4
k e
k!
An epidemiology example from Cliffs notes: the UHS sees X=10 pneumonia cases each
winter. Assuming independence and unchanged conditions, what is the probability of there
being 20 cases of pneumonia this winter? We use the Poisson distribution with = 10 to
see P[X = 20] = 1020 e10 /20! = 0.0018.
The Poisson distribution is the n limit of the binomial distribution if we chose
for each n the probability p such that = np is fixed.
0.3
P[X = k] =
pk (1 p)nk
n!
( )k (1 )nk
k!(n k)! n
n
n!
k
= ( k
)( )(1 )n (1 )k .
n (n k)! k!
n
n
=
0.1
n
k
Throw n coins for which head appears with probability p. Let X denote the number of
heads. This random variable takes values 0, 1, 2, . . . , n and
P[X = k ] =
This is the Binomial distribution we know.
n
k
pk (1 p)nk .
This is a product of four factors. The first factor n(n 1)...(n k + 1)/nk converges to 1.
Leave the second factor ak /k! as it is. The third factor converges to e by the definition
of the exponential. The last factor (1 n )k converges to 1 for n since k is kept fixed.
We see that P[X = k] k e /k!.
Continuous distributions
A random variable is called a uniform distribution if P[X [a, b]] = (b a) for all
0 a < b b. We can realize this random variable on the probability space = [a, b] with
the function X(x) = x, where P[I] is the length of an interval I. The uniform distribution
is the most natural distribution on a finite interval.
The random variable X on [0, 1] where P[[a, b]] = b a is given by X(x) = x. We have
We have f (x) = 2x because ab f (x) dx = x2 |ba = b2 a2 . The function f (x) is the probability
density function of the random variable.
R
The distribution on the positive real axis with the density function
f (x) =
(log(x)m)2
1
2
e
2x2
is called the log normal distribution with mean m. Examples of quantities which have
log normal distribution is the size of a living tissue like like length or height of a population
or the size of cities. An other example is the blood pressure of adult humans. A quantity
which has a log normal distribution is a quantity which has a logarithm which is normally
distributed.
A random variable with normal distribution with mean 1 and standard deviation 1 has the
probability density
1
2
f (x) = ex /2 .
2
This is an example of a continuous distribution. The normal distribution is the most natural
distribution on the real line.
2
(1 + x2 )2 .
k=0
P[X = k] = 1.
The most important distribution on the positive real line is the exponential distribution
f (x) = ex .
1
.
R 2
2
From 0 x exp(x) dx = 2/ , we get the variance
m=
In this lecture we look at more probability distributions and prove the fantastically useful Chebychevs theorem.
P[X [a, b] ] =
xf (x) dx =
f (x) dx
3
By definition F is a monotone function: F (b) F (a) for b a. One abbreviates the probability
density function with P DF and the distribution function with CDF which abbreviates cumulative
distribution function.
The most important distribution on the real line is the normal distribution
f (x) =
(xm)2
1
e 22 .
2
2
The most important distribution on a finite interval [a, b] is the uniform distribution
f (x) = 1[a,b]
1
,
ba
1 xI
.
0 x
/I
The following theorem is very important for estimation purposes. Despite the simplicity of
its proof it has a lot of applications:
Chebychev theorem If X is a random variable with finite variance, then
P[|X E[X]| c]
Var[X]
.
c2
Proof. The random variable Y = X E[X] has zero mean and the same variance. We need
]
only to show P[|Y | c] Var[Y
. Taking the expectation of the inequality
c2
c2 1{|Y |c} Y 2
1.0
gives
0.30
0.8
0.25
0.20
0.6
2000
The theorem also gives more meaning to the notion Variance as a measure for the deviation
from the mean. The following example is similar to the one section 11.6 of Cliffs notes:
4000
6000
8000
10 000
0.15
0.4
0.10
-2
0.2
0.05
A die is rolled 144 times. What is the probability to see 50 or more times the number 6
shows up? Let X be the random variable which counts the number of times, the number 6
appears. This random variable has a binomial distribution with p = 1/6 and n = 144. It has
the expectation E[X] = np = 144/6 = 24 and the variance Var[X] = np(1 p) = 20. Setting
c = (50 24) = 26 in Chebychev, we get P[|X 24| 26] !20/262 0.0296.... The chance
P
144
is smaller than 3 percent. The actual value 144
pk (1 p)144k 1.17 107 is
k=50
k
much smaller. Chebychev does not necessarily give good estimates, but it is a handy and
univeral rule of thumb.
-4
-10
Random numbers
10
10
The CDF
Finally, lets look at a practical application of the use of the cumulative distribution function.
It is the task to generate random variables with a given distribution:
The random variable X has a normal distribution with standard deviation 2 and mean 5.
Estimate the probability that |X 5| > 3.
Estimate the probability of the event X > 10 for a Poisson distributed random variable X
with mean 4.
Assume we want to generate random variables X with a given distribution function F . Then
Y = F (X) has the uniform distribution on [0, 1]. We can reverse this. If we want to produce random variables with a distribution function F , just take a random variable Y with
uniform distribution on [0, 1] and define X = F 1 (Y ). This random variable has the distribution function F because {X [a, b] } = {F 1 (Y ) [a, b] } = {Y F ([a, b]) } = {Y
[F (a), F (b)]} = F (b) F (a). We see that we need only to have a random number generator
which produces uniformly distributed random variables in [0, 1] to get a random number
generator for a given continuous distribution. A computer scientist implementing random
processes on the computer only needs to have access to a random number generator producing uniformly distributed random numbers. The later are provided in any programming
language which deserves this name.
a) Verify that (x) = tan(x) maps the interval [0, 1] onto the real line so that its inverse
F (y) = arctan(y)/ is a map from R to [0, 1].
1
b) Show that f = F (y) = 1 1+y
2.
c) Assume we have random numbers in [0, 1] handy and want to random variables which
have the probability
density f . How do we achieve this?
R
d) The mean
xf (x) dx does not exist as an indefinite integral but can be assigned the
RR
value 0 by taking
the limit R
xf (x) dx = 0 for R . Is it possible to assign a value to
R 2
the variance x f (x) dx?.
The probability distribution with density
1 1
1 + y2
-5
The PDF
X:=Random [ N o r m a l D i s t r i b u t i o n [ 0 , 1 ] ]
L i s t P l o t [ Table [ X, { 1 0 0 0 0 } ] , PlotRange>A l l ]
Why is the Cauchy distribution natural? As one can deduce from the homework, if you chose
a random point P on the unit circle, then the slope of the line OP has a Cauchy distribution.
Instead of the circle, we can take a rotationally
symmetric probability distribution like the
R R x2 y2
Gaussian with probability measure P [A] =
/ dxdy on the plane. Random
Ae
pointscan be written as (X, Y ) where both X, Y have the normal distribution with density
2
ex / . We have just shown
f=PDF[ C a u c h y D i s t r i b u t i o n [ 0 , 1 ] ] ;
S=P l o t [ f [ x ] , { x , 10 ,10} , PlotRange>Al l , F i l l i n g >Axis ]
f=CDF[ C h i S q u a r e D i s t r i b u t i o n [ 1 ] ] ;
S=P l o t [ f [ x ] , { x , 0 , 1 0 } , PlotRange>Al l , F i l l i n g >Bottom ]
If we take independent Gaussian random variables X, Y of zero mean and with the
same variance and form Z = X/Y , then the random variable Z has the Cauchy
distribution.
Now, it becomes clear why the distribution appears so often. Comparing quantities is often
done by looking at their ratio X/Y . Since the normal distribution is so prevalent, there is
no surprise, that the Cauchy distribution also appears so often in applications.
The 2 2 case
"
a b
c d
a
c
f
+ d
i
g
b
e
h
b
e
h
a
c
f + d
i
g
b
e
h
c
a
f d
g
i
b
e
h
c
a
f
d
i
g
b
e
h
a
c
f d
g
i
b
e
h
2 0 4
1 1
) = 2 4 = 2.
1 0 1
det( 1
d b
c a
ad bc
"
5 4
det(
) = 5 1 4 2 = 3.
2 1
We also see already that the determinant changes sign if we flip rows, that the determinant is
linear in each of the rows.
We can write the formula as a sum over all permutations of 1, 2. The first permutation = (1, 2)
gives the sum A1,(1) A2,pi(2) = A1,1 A2,2 = ad and the second permutation = (2, 1) gives the
sum A1,(1) A2,(2) = A1,2 A2,1 = bc. The second permutation has || upcrossing and the sign
P
(1)|| = 1. We can write the above formula as (1)|| A1(1) A2(2) .
"
The 3 3 case
a
c
b
d
"
a
c
b
d
P =
1
1
1
1
1
1
f
i
a b c
e f
g h i
A= d
det(A) = det(
0
0
0
0
0
13
0
0
5
0
0
0
0
0
0
0
11
0
0
3
0
0
0
0
2
0
0
0
0
0
0
0
7
0
0
) = 2 3 5 7 11 13 .
Answer: 60.
A=
3
0
0
0
0
0
2
4
0
0
0
0
3
2
5
0
0
0
0
0
0
0
0
1
0
0
0
1
0
0
0
0
0
0
1
0
0
0
7
0
1
0
0
0
1
0
0
0
Laplace expansion
1
This Laplace expansion is a convenient way to sum over all permutations. We group the permutations by taking first all the ones where the first entry is 1, then the one where the first entry is 2
etc. In that case we have a permutation of (n 1) elements. the sum over these entries produces
a determinant of a smaller matrix.
For each entry aj1 in the first column form the (n 1) (n 1) matrix Bj1 which does not contain
the first and jth row. The determinant of Bj1 is called a minor.
Laplace expansion det(A) = (1)1+1 A11 det(Bi1 ) + + (1)1+n An1 det(Bn1 )
0
8
0
3
0
0
0
0
0
0
0
2
7
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
5
0
0
0
1
0
0
0
0
0 7 0 0 0
0
0 0 0 1
4+1
0 1 0 0
3det
det(A)
+ (1)
0
0 0 5 0
0
2
2 0 0 0 0
= 8(2 7 1 5 1) + 3(2 7 0 5 1) = 560
2+1
= (1) 8det
0
The answer is 1.
0
0
0
0
0
1
0
0
0
1
0
0
0
0
0
0
0
1
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
1
0
1
0
0
0
0
0
0
0
0
0
1
0
0
7
0
0
0
0
0
0
0
0
0
0
0
0
5
0
0
0
1
0
0
0
4
3
0
0
6
0
0
4
1
4
0
Give
the#reason in terms of permutations why the determinant of a partitioned matrix
"
A 0
is the product det(A)det(B).
0 B
3 4 0 0
1 2 0 0
Example det(
) = 2 12 = 24.
0 0 4 2
0 0 2 2
3
0
0
0
0
0
0
0
1
2
0
0
A=
0
0
0
0
8
0
0
0
0
0
0
0
8
8
8
0
0
0
0
0
8
8
8
8
8
0
0
0
8
8
8
8
8
8
8
0
8
8
8
8
8
8
8
8
8
0
8
8
8
8
8
8
8
0
0
0
8
8
8
8
8
0
0
0
0
0
8
8
8
0
0
0
0
0
0
0
8
0
0
0
0
Hint. Do not compute too much. Investigate what happens with the determinant if you
switch two rows.
Proof. We have to show that the number of upcrossing changes by an odd number. Lets count
the number of upcrossings before and after the switch. Assume row a and c are switched. We look
at one pattern and assume that (a,b) be an entry on row a and (c,d) is an entry on row b. The
entry (a,b) changes the number of upcrossings to (c,d) by 1 (there is one upcrossing from (a,b) to
(c,d) before which is absent after).
For each entry (x,y) inside the rectangle (a,c) x (b,d), the number of upcrossings from and to
(x,y) changes by two. (there are two upcrossings to and from the orange squares before which are
absent after). For each entry outside the rectangle and different from (a,b),(c,d), the number of
upcrossings does not change.
...
det(
v1
...
A11
...
det(
(v + w)1
...
An1
...
det(
v1
...
. . . A1n
... ...
. . . vn
) + det(
... ...
. . . Ann
A12
...
(v + w)2
...
An2
A13
...
(v + w)3
...
An3
...
...
...
...
...
A11
...
v1
...
An1
. . . A1n
... ...
. . . vn
)
=
det(
... ...
. . . Ann
. . . A1n
... ...
. . . wn
)
... ...
. . . Ann
...
...
...
...
...
A1n
...
vn
...
Ann
det(
. . . A1n
... ...
. . . vn
. . . . . . ) = det(
. . . wn
... ...
. . . Ann
If c1 , ..., ck are the row reduction scale factors and m is the number of row swaps
during row reduction, then
) .
Row reduction
We immediately get from the above properties what happens if we do row reduction. Subtracting
a row from an other row does not change the determinant since by linearity we subtract the
determinant of a matrix with two equal rows. Swapping two rows changes the sign and scaling a
row scales the determinant.
) .
A13
...
v3
...
An3
It follows that if two rows are the same, then the determinant is zero.
det(A) =
(1)m
det(rref (A)) .
c1 ck
Since row reduction is fast, we can compute the determinant of a 20 20 matrix in a jiffy. It takes
about 400 operations and thats nothing for a computer.
. . . A1n
... ...
. . . wn
... ...
) .
. . . vn
... ...
. . . Ann
4 1
2
1
4 1
Row reduce.
1
2
6
1
1
2
3
7
0 2
2
0
0 0
det
0
5
3
0
3
6
4
4
2
As the name tells, determinants determine a lot about matrices. We can see from the
determinant whether the matrix is invertible.
An other reason is that determinants allow explicit formulas for the inverse of a matrix. We
might look at this next time. Next week we will see that determinants allow to define the
characteristic polynomial of a matrix whose roots are the important eigenvalues. In analysis,
the determinant appears in change of variable formulas:
We could use the Laplace expansion or see that there is only one pattern. The simplest way
however is to swap two rows to get an upper triangular matrix
1 2
2
0
0 0
det
0
3
5
3
0
4
6
2
4
= 24 .
A=
0
0
0
0
0
1
1
0
0
0
0
0
0
1
0
0
0
0
0
0
1
0
0
0
0
0
0
1
0
0
0
0
0
0
1
0
is 1 because swapping the last row to the first row gives the identity matrix. Alternatively
we could see that this permutation matrix has 5 upcrossings and that the determinant is
1.
Proof. Every upcrossing is a pair of entries Aij , Akl where k > i, l > j. If we look at the
transpose, this pair of entries appears again as an upcrossing. So, every summand in the
permutation definition of the determinant appears with the same sign also in the determinant
of the transpose.
3
3
3
3
3
3
2
3
3
3
3
3
0
2
3
3
3
3
0
0
2
3
3
3
0
0
0
2
3
3
0
0
0
0
2
3
3
1
1
0
0
0
1
0
0
0
0
0
1
1
3
0
0
0
2
2
2
4
1
1
2
2
2
1
4
1
2
2
2
0
0
4
det(A B) = det(A)det(B)
det(AT ) = det(A)
f (y) |det(Du1(y)|dy .
Product of matrices
Proof. One can bring the n n matrix [A|AB] into row reduced echelon form. Similar than the augmented matrix [A|b] was brought into the form [1|A1 b], we end up with
[1|A1 AB] = [1|B]. By looking at the n n matrix to the left during the Gauss-Jordan
elimination process, the determinant has changed by a factor det(A). We end up with a
matrix B which has determinant det(B). Therefore, det(AB) = det(A)det(B).
u(S)
One of the main reasons why determinants are interesting is because of the following property
Physicists are excited about determinants because summation over all possible paths is
used as a quantization method. The Feynmann path integral is a summation over a
suitable class of paths and leads to quantum mechanics. The relation with determinants
comes because each summand in a determinant can be interpreted as a contribution of a
path in a finite graph with n nodes.
The determinant of
f (x) dx =
Cramers rule
Geometry
Cramers rule: If Ai is the matrix, where the column ~vi of A is replaced by ~b, then
xi =
det(Ai )
det(A)
"
1
2
A=
"
1 1
2 1
and
"
1
1
is the determinant of
which is 4.
1 1 1
0
.
1 0 0
Solution: The determinant is 1. The volume is 1. The fact that the determinant is
negative reflects the fact that these three vectors form a left handed coordinate system.
"
1 1 1
det(A A) = det(
1 1 0
T
"
#
1 1
3 2
1
)=2.
) = det(
2 2
1 0
The area is therefore 2. If you have seen multivariable calculus, you could also have
computed
the area using the cross product. The area is the length of the cross product
xi vi , . . . , vn ] =
Gabriel Cramer was born in 1704 in Geneva. He worked on geometry and analysis until
his death at in 1752 during a trip to France. Cramer used the rule named after him in a
book Introduction `a lanalyse des lignes courbes algebraique, where he used the method
to solve systems of equations with 5 unknowns. According to a short biography of Cramer
by J.J OConnor and E F Robertson, the rule had been used already before by other mathematicians. Solving systems with Cramers formulas is slower than by Gaussian elimination.
But it is ueful for example if the matrix A or the vector b depends on a parameter t, and we
want to see how x depends on the parameter t. One can find explicit formulas for (d/dt)xi (t)
for example.
Cramers rule leads to an explicit formula for the inverse of a matrix inverse of a matrix:
Let Aij be the matrix where the ith row and the jth column is deleted. Bij =
(1)i+j det(Aji ) is called the classical adjoint or adjugate of A. The determinant
of the classical adjugate is called minor.
We can take also this as the definition of the volume. Note that AT A is a square matrix.
The
volume of a k dimensional parallelepiped defined by the vectors v1 , . . . , vk is
q
det(AT A).
det(Aji )
det(A)
B11
10.
2 3 1
14 21
2 4
8
has det(A) = 17 and we get A1 = 11
6 0 7 "
12 18
#
"
#
2 4
3 1
= (1)2 det
= 14. B12 = (1)3 det
= 21. B13
0 7
0 7
A= 5
"
5 4
6 7
"
5 2
6 0
"
"
2 1
6 7
2 3
6 0
#
#
10
3
/(17):
11
"
#
3 1
= (1)4 det
=
2 4
"
2 1
5 4
"
2 3
5 2
describes an electron in a periodic crystal, E is the energy and = 2/n. The electron can move
as a Bloch wave whenever the determinant is negative. These intervals form the spectrum of
the quantum mechanical system. A physicist is interested in the rate of change of f (E) or its
dependence on when E is fixed. .
The graph to the left shows the function E 7 log(|det(L EIn )|) in the case = 2 and n = 5.
In the energy intervals, where this function is zero, the electron can move, otherwise the crystal
is an insulator. The picture to the right shows the spectrum of the crystal depending on . It
is called the Hofstadter butterfly made popular in the book Godel, Escher Bach by Douglas
Hofstadter.
Random matrices
If the entries of a matrix are random variables with a continuous distribution, then
the determinant is nonzero with probability one.
If the entries of a matrix are random variables which have the property that P[X =
x] = p > 0 for some x, then there is a nonzero probability that the determinant is
zero.
Proof. We have with probability p2n that the first two rows have the same entry x.
What is the distribution of the determinant of a random matrix? These are questions which are
hard to analyze theoretically. Here is an experiment: we take random 100 100 matrices and look
at the distribution of the logarithm of the determinant.
100
80
60
40
20
172
174
176
178
180
182
184
Applications of determinants
In
solid state physics, one is interested in the function f (E) = det(L EIn ), where
L=
cos()
1
0
0
1
1
cos(2) 1
0
1
1
0
1 cos((n 1))
1
1
0
0
1
cos(n)
2 4 6
a) A =
5 5 5
1 3 2
3 2 8
determinants
4 5
8 10
5 4
7 4
4 9
1 2 2
a) A =
0 1 1
0 0 0
0 0 0
determinants
0 0
0 0
0 0
1 3
4 2
b) A =
2
1
0
0
0
1
1
0
0
0
4
1
2
0
0
4
2
1
3
0
2
3
1
1
4
1 6 10 1 15
8 17 1 29
0 3 8 12
0 0 4 9
0 0 0 0 5
b) A =
0
Find a 4 4 matrix A with entries 0, +1 and 1 for which the determinant is maximal.
Hint. Think about the volume. How do we get a maximal volume?
un+1 = un un1
1 1
1 0
#"
"
un
=
un1
"
#
1 1
system. Lets compute some orbits: A =
1 0
We see that A6 is the identity. Every initial vector is
original starting point.
with u0 = 0, u1 = 1. Because
un+1
un
"
"
0 1
1 0
A3 =
.
1 1
0 1
mapped after 6 iterations back to its
A2 =
Markets, population evolutions or ingredients in a chemical reaction are often nonlinear. A linear
description often can give a good approximation and solve the system explicitly. Eigenvectors and
eigenvalues provide us with the key to do so.
A nonzero vector v is called an eigenvector of a n n matrix A if Av = v for
some number . The later is called an eigenvalue of A.
We first look at real eigenvalues but also consider complex eigenvalues.
A rotation A in three dimensional space has an eigenvalue 1, with eigenvector spanning the
axes of rotation. This vector satisfies A~v = ~v .
Every
standard
w
~ i is an eigenvector if A is a diagonal matrix. For example,
"
#"
# basis
" vector
#
2 0
0
0
=3
.
0 3
1
1
For an orthogonal projection P onto a space V , every vector in V is an eigenvector to the
eigenvalue 1 and every vector perpendicular to V is an eigenvector to the eigenvalue 0.
The one-dimensional discrete dynamical system x 7 ax or xn+1 = axn has the solution
xn = an x0 . The value 1.0320 1000 = 1806.11 for example is the balance on a bank account
which had 1000 dollars 20 years ago if the interest rate was a constant 3 percent.
"
"
3
, A2~v =
1
"
5
. A3~v =
1
"
7
. A4~v =
1
"
9
1
etc.
The following example shows why eigenvalues and eigenvectors are so important:
9
4
"
1 2
1
. A~v =
. ~v =
0 1
1
Do you see a pattern?
A=
If ~v is an eigenvector with eigenvalue , then A~v = ~v , A2~v = A(A~v )) = A~v = A~v = 2~v
and more generally An~v = n~v .
For an eigenvector, we have a closed form solution for An~v. It is n~v .
10
The recursion
xn+1 = xn + xn1
with x0 = 0 and x1 = 1 produces the Fibonacci sequence
(1, 1, 2, 3, 5, 8, 13, 21, ...)
This can be computed with a discrete dynamical system because
"
xn+1
xn
"
1 1
1 0
#"
xn
xn1
in mathematics called the golden ratio. We have found our eigenvalues and eigenvectors.
Now find c1 , c2 such that
"
#
"
#
"
#
1
+
= c1
+ c2
0
1
1
"
A
.
G
Each cycle 1/3 of IOS users switch to Android and 2/3 stays. Also lets"assume that
# 1/2
2/3 1/2
of the Android OS users switch to IOS and 1/2 stay. The matrix A =
is a
1/3 1/2
Markov matrix. What customer ratio do we have in the limit? The matrix A has an
eigenvector (3/5, 2/5) which belongs to the eigenvalue 1.
Customers using Apple IOS and Google Android are represented by a vector
A~v = ~v
means that 60 to 40 percent is the final stable distribution.
The following fact motivates to find good methods to compute eigenvalues and eigenvectors.
If A~v1 = 1~v1 , A~v2 = 2~v2 and ~v = c1~v1 + c2~v2 , we have closed form solution
An~v = c1 n1 ~v1 + c2 n2 ~v2 .
Lets try this in the Fibonacci case. We will see next time how we find the eigenvalues and
eigenvectors:
12
1 1
1 0
#"
"
= An
"
1
0
Markov case
11
xn+1
xn
2
This leads to the
quadratic equation + + 1 = which has the solutions + = (1 + 5)/2
and = (1 5)/2. The number is one of the most famous and symmetric numbers
1
= n+
5
"
+
1
1
n
5
"
"
0.978 0.006
0.004 0.992
#
"
1
3
and
are eigenvec2
1
tors of A. Find the eigenvalues.
Check that
"
0 2
1 1
models
" the
# growth of a lilac bush. The vector
n
~v =
models the number of new branches
a
and #the number
"
"
#of old branches. Verify that
1
2
and
are eigenvectors of A. Find
1
1
the eigenvalues and find the close form solution starting with ~v = [2, 3]T .
Compare problem 50 in Chapter 7.1 of
Bretschers Book. Two interacting populations of hares and foxes can be modeled
by the discrete dynamical system vn+1 = Avn
with
"
#
4 2
A=
1 1
Find a closed form solutions
following
# in
" the #
"
100
h0
=
three cases: a) ~v0 =
.
f0
100
#
"
"
#
200
h0
=
b) ~v0 =
.
f0
100
#
"
"
#
600
h0
=
c) ~v0 =
.
500
f0
Proof1 One only has to show a polynomial p(z) = z n + an1 z n1 + + a1 z + a0 always has a root
z0 We can then factor out p(z) = (z z0 )g(z) where g(z) is a polynomial of degree (n 1) and
use induction in n. Assume now that in contrary the polynomial p has no root. Cauchys integral
theorem then tells
Z
2i
dz
=
6= 0 .
(1)
p(0)
|z|=r| zp(z)
On the other hand, for all r,
2 1
, the characteristic polynomial is
4 1
det(A I2 ) = det(
"
cos(t) sin(t)
For a rotation A =
the characteristic polynomial is 2 2 cos() + 1
sin(t) cos(t)
which has the roots cos() i sin() = ei .
Allowing complex eigenvalues is really a blessing. The structure is very simple:
"
"
2
1
) = 2 6 .
4
1
|z|=r|
dz
1
2
| 2rmax|z|=r
=
.
zp(z)
|zp(z)|
min|z|=r p(z)
(2)
|an1 |
|a0 |
n)
|z|
|z|
which goes to infinity for r . The two equations (1) and (2) form a contradiction. The
assumption that p has no root was therefore not possible.
If 1 , . . . , n are the eigenvalues of A, then
"
a b
, the characteristic polynomial is
c d
2 tr(A) + det(A) .
We can see this directly by writing out the determinant of the matrix A I2 . The trace is
important because it always appears in the characteristic polynomial, also if the matrix is
larger:
fA () = (1 )(2 ) . . . (n ) .
A=
"
3 7
5 5
Proof. The pattern, where all the entries are in the diagonal leads to a term (A11 )
(A22 )...(Ann ) which is (n ) + (A11 + ... + Ann )()n1 + ... The rest of this as well
as the other patterns only give us terms which are of order n2 or smaller.
How many eigenvalues do we have? For real eigenvalues, it depends. A rotation in the plane
with an angle different from 0 or has no real eigenvector. The eigenvalues are complex in
that case:
"
1
. We can also
1
read off the trace 8. Because the eigenvalues add up to 8 the other eigenvalue is 2. This
example seems special but it often occurs in textbooks. Try it out: what are the eigenvalues
of
"
#
11 100
A=
?
12 101
Because each row adds up to 10, this is an eigenvalue: you can check that
1
A. R. Schep. A Simple Complex Analysis and an Advanced Calculus Proof of the Fundamental theorem of
Algebra. Mathematical Monthly, 116, p 67-68, 2009
A=
1
0
0
0
0
2
2
0
0
0
3
3
3
0
0
4
4
4
4
0
5
5
5
5
5
How do we construct 2x2 matrices which have integer eigenvectors and integer eigenvalues?
Just take an integer matrix for which
" the
# row vectors have the same sum. Then this sum
1
is an eigenvalue to the eigenvector
. The other eigenvalue can be obtained by noticing
1
"
#
6 7
that the trace of the matrix is the sum of the eigenvalues. For example, the matrix
2 11
has the eigenvalue 13 and because the sum of the eigenvalues is 18 a second eigenvalue 5.
A matrix with nonnegative entries for which the sum of the columns entries add up
to 1 is called a Markov matrix.
Markov Matrices have an eigenvalue 1.
Proof. The eigenvalues of A and A are the same because they have the same characteristic
polynomial. The matrix AT has an eigenvector [1, 1, 1, 1, 1]T .
6
1/2 1/3
1/4
1/3
A = 1/4 1/3
If A =
"
1 1
0 1
.
4 1 0
A=
2
100 1
1
1
1
1 100 1
1
1
1
1 100 1
1
1
1
1 100 1
1
1
1
1 100
"
A 0
0 B
1/3 1/2
. Then [3/7, 4/7] is the equilibrium eigenvector to the eigenvalue 1.
2/3 1/2
M=1000;
A=Table [Random[ ] 1 / 2 , {M} , {M} ] ;
e=Eigenvalues [A ] ;
d=Table [ Min[ Table [ I f [ i==j , 1 0 , Abs [ e [ [ i ]] e [ [ j ] ] ] ] , { j ,M} ] ] , { i ,M} ] ;
a=Max[ d ] ; b=Min[ d ] ;
Graphics [ Table [ {Hue[ ( d [ [ j ]] a ) / ( ba ) ] ,
Point [ {Re[ e [ [ j ] ] ] , Im[ e [ [ j ] ] ] } ] } , { j ,M} ] ]
B=
Eigenvectors
101 2
3
4
5
1 102 3
4
5
1
2 103 4
5
1
2
3 104 5
1
2
3
4 105
This matrix is A + 100I5 where A is the matrix from the previous example. Note that if
Bv = v then (A + 100I5 )v = + 100)v so that A, B have the same eigenvectors and the
eigenvalues of B are 100, 100, 100, 100, 115.
Find the determinant of the previous matrix B. Solution: Since the determinant is the
product of the eigenvalues, the determinant is 1004 115.
The shear
"
Remember that the multiplicity with which an eigenvalue appears is called the algebraic multiplicity of :
The algebraic multiplicity is larger or equal than the geometric multiplicity.
Proof. Let be the eigenvalue. Assume it has geometric multiplicity m. If v1 , . . . , vm is a basis
of the eigenspace E form the matrix S which contains these vectors in the first m columns. Fill
the other columns arbitrarily. Now B = S 1 AS has the property that the first m columns are
e1 , .., em , where ei are the standard vectors. Because A and B are similar, they have the same
eigenvalues. Since B has m eigenvalues also A has this property and the algebraic multiplicity
is m.
You can remember this with an analogy: the geometric mean
equal to the algebraic mean (a + b)/2.
1 2
2
2
2
1 2
A=
1
3
3
3
3
3
4
4
4
4
4
5
5
5
5
5
This matrix has a large kernel. Row reduction indeed shows that the kernel is 4 dimensional.
Because the algebraic multiplicity is larger or equal than the geometric multiplicity there
are 4 eigenvalues 0. We can also immediately get the last eigenvalue from the trace 15. The
eigenvalues of A are 0, 0, 0, 0, 15.
is spanned by
"
1
0
1 1 1
1
has eigenvalue 1 with algebraic multiplicity 2 and the eigenvalue 0
0 0 1
with multiplicity
to the eigenvalue
1. Eigenvectors
The matrix 0 0
Proof. If all are different, there is one of them i which is different from 0. We use induction
with respect to n and assume the result is true for n 1. Assume that in contrary the eigenP
P
vectors are linearly dependent. We have vi = j6=i aj vj and i vi = Avi = A( j6=i aj vj ) =
P
P
j6=i aj j vj so that vi =
j6=i bj vj with bj = aj j /i . If the eigenvalues are different, then
P
P
P
aj 6= bj and by subtracting vi = j6=i aj vj from vi = j6=i bj vj , we get 0 = j6=i (bj aj )vj = 0.
Now (n1) eigenvectors of the n eigenvectors are linearly dependent. Now use the induction
assumption.
1 1
0 1
If all eigenvalues are different, then all eigenvectors are linearly independent and
all geometric and algebraic multiplicities are 1. The eigenvectors form then an
eigenbasis.
0 1
0 0
"
0 1
0
0
1 0
0
B=
0
1
0
0
0
0
1
0
v= 2 ,
v3
v4
The book of Lanville and Meyher of 2006 gives 8 billion. This was 5 years ago.
1 2 1
4 2
.
3 6 5
A= 2
and look at Bv = v.
Where are eigenvectors used: in class we will look at some applications: H
uckel theory, orbitals
of the Hydrogen atom and Page rank. In all these cases, the eigenvectors have immediate
interpretations. We will talk about page rank more when we deal with Markov processes.
0.7 0
0
100
St
Moritz lake n weeks
"
i
.
1
The eigenvectors are the same for every rotation-dilation matrix. With
A=
"
we have
Diagonalization
S 1 AS =
S 1 ASei S 1 Avi = S 1 i vi = i S 1 vi = i ei .
On the other hand if A is diagonalizable, then we have a matrix S for which S 1 AS = B is diagonal. The column vectors of S are eigenvectors because the kth column of the equation AS = BS
shows Avi = i vi .
has the eigenvalues 1 for which the geometric multiplicity is smaller than the algebraic one. This
matrix is not diagonalizable.
"
,S =
"
i i
1 1
a + ib
0
0
a ib
Functional calculus
Proof. If we have an eigenbasis, we have a coordinate transformation matrix S which contains the
eigenvectors vi as column vectors. To see that the matrix S 1 AS is diagonal, we check
Are all matrices diagonalizable? No! We need to have an eigenbasis and therefore that the
geometric multiplicities all agree with the algebraic multiplicities. We have seen that the shear
matrix
"
#
1 1
A=
0 1
a b
b a
"
2 3
What is A100 + A37 1 if A =
? The matrix has the eigenvalues 1 = 2 + 3 with
1 2
eigenvector "~v1= [ 3,
1] #and the eigenvalues 2 = 2 3. with eigenvector ~v2 = [ 3, 1].
3 3
and check S 1 AS = D is diagonal. Because B k = S 1 Ak S can
Form S =
1
1
easily be computed, we know A100 + A37 1 = S(B 100 + B 37 1)S 1 .
Establishing similarity
3
"
"
3 5
4 4
B=
are similar. Proof. They have the same
2 6
3 5
eigenvalues 8, 9 as you can see by inspecting the sum of rows and the trace. Both matrices
are therefore diagonalizable and similar to the matrix
Show that the matrices A =
"
8 0
0 9
Simple spectrum
If A and B have the same characteristic polynomial and diagonalizable, then they are
similar.
A matrix has simple spectrum, if all eigenvalues have algebraic multiplicity 1.
If A and B have the same eigenvalues but different geometric multiplicities, then they
are not similar.
Proof. Because the algebraic multiplicity is 1 for each eigenvalue and the geometric multiplicity
is always at least 1, we have an eigenvector for each eigenvalue and so n eigenvalues.
If A has an eigenvalue which is not an eigenvalue of B, then they are not similar.
Without proof we mention the following result which gives an if and only if result for similarity:
"
cos() sin()
sin() cos()
If A and B have the same eigenvalues with geometric multiplicities which agree and
the same holds for all powers Ak and B k , then A is similar to B.
Let x1 , x2 , x3 , x4 be the positions of the four phosphorus atoms (each of them is a 3-vector).
The inter-atomar forces bonding the atoms is modeled by springs. The first atom feels a
force x2 x1 + x3 x1 + x4 x1 and is accelerated in the same amount. Lets just chose
units so that the force is equal to the acceleration. Then
An application in chemistry
While quantum mechanics describes the motion of atoms in molecules, the vibrations can
be described classically, when treating the atoms as balls connected with springs. Such
approximations are necessary when dealing with large atoms, where quantum mechanical
computations would be too costly. Examples of simple molecules are white phosphorus P4 ,
which has tetrahedronal shape or methan CH4 the simplest organic compound or freon,
CF2 Cl2 which is used in refrigerants.
x1
x2
x3
x4
=
=
=
=
A=
1
, v3 =
, v4 =
are
What is the probability that an upper triangular 3 3 matrix with entries 0 and 1 is
diagonalizable?
0 1
A=
0 0
0
0
1
1
1 1
1
0
0 0
,B =
0
1
1
0
0
0
0
1
1 1
1
0
0 0
,C =
0
1
1
0
0
0
1
1
1 0 0
1
1
0 0 0
0 1
,D =
0 0
0
0
1
1
2 3
2
0
0 0
A=
0
Caffeine C8 H10 N4 O2
1
0
v1 = , v2 =
1
0
0
0
5
6
0
0
6
5
Aspirin C9 H8 O4
1
We grabbed the pdb Molecule files from http : //www.sci.ouc.bc.ca, translated them with povchem from
.pdb to .pov rendered them under Povray.
We assume that all random variables have a finite variance $\mathrm{Var}[X]$ and expectation $\mathrm{E}[X]$.

A sequence of random variables $X_1, X_2, \dots$ defines a random walk $S_n = \sum_{k=1}^n X_k$. The interpretation is that the $X_k$ are the individual steps: if we take $n$ steps, we reach $S_n$.

If the $X_i$ are random variables which take the values 0 and 1, where 1 is chosen with probability $p$, then $S_n$ has the binomial distribution and $\mathrm{E}[S_n] = np$. Since $\mathrm{E}[X] = p$, the law of large numbers is satisfied.

If the $X_i$ are random variables which take the values $-1$ and $1$ with equal probability $1/2$, then $S_n$ is a symmetric random walk. In this case, $S_n/n \to 0$.

Here is a strange paradox, called the martingale paradox. We try it out in class. Go into a casino and play the doubling strategy: bet 1 dollar; if you lose, double to 2 dollars; if you lose again, double to 4 dollars, and so on. The first time you win, stop and leave the casino. You have won 1 dollar: you lost maybe 4 times, paying $1+2+4+8 = 15$ dollars, but then won 16. The paradox is that the expected win is zero, and in actual casinos even negative. The usual resolution is that the longer you play, the more you win, but you also increase the chance of a huge loss, leading to a zero net win. This does not quite resolve the paradox, because in a casino where you are allowed to borrow arbitrary amounts and where no bet limit exists, you cannot lose.

Here is a typical trajectory of a random walk. We flip a coin: if it shows heads we go up, if it shows tails we go down.

Throw a die $n$ times and add up the total number $S_n$ of eyes. Estimate $S_n/n - \mathrm{E}[X]$ with experiments. Below is example code for Mathematica. How fast does the error decrease?
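A minimal version of such an experiment could look as follows (the choice of 10000 throws is arbitrary):

  data = RandomInteger[{1, 6}, 10000];       (* 10000 die throws *)
  means = Accumulate[data]/Range[10000.];    (* running averages S_n/n *)
  ListPlot[means - 3.5, Joined -> True]      (* the error S_n/n - E[X], with E[X] = 7/2 *)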
The following result is one of the three most important results in probability theory:

Law of large numbers. For almost all $\omega$, we have $S_n/n \to \mathrm{E}[X]$.
Here is the situation where the random variables are Cauchy distributed; the expectation is not defined. The left picture below shows this situation. What happens if we relax the assumption that the random variables are uncorrelated? The illustration to the right below shows an experiment where we take a periodic function $f(x)$ and an irrational number $\alpha$ and where $X_k(x) = f(x + k\alpha)$.
Proof. We only prove the weak law of large numbers, which deals with a weaker notion of convergence. We have $\mathrm{Var}[S_n/n] = n\mathrm{Var}[X]/n^2 = \mathrm{Var}[X]/n$, so that by Chebyshev's theorem

$\mathrm{P}[|S_n/n - \mathrm{E}[X]| \geq \epsilon] \leq \mathrm{Var}[X]/(n\epsilon^2) \to 0$

for $n \to \infty$. We see that the probability that $S_n/n$ deviates by a certain amount from the mean goes to zero as $n \to \infty$. The strong law would need half an hour for a careful proof.
It turns out that no randomness is necessary to establish the strong law of large numbers. It is enough to have ergodicity:

If $\Omega$ is the interval $[0,1]$ with measure $\mathrm{P}[[c,d]] = d-c$, then $T(x) = x + \alpha \mod 1$ is ergodic if $\alpha$ is irrational.
A real number is called normal to base 10 if in its decimal expansion, every digit appears with the same frequency $1/10$.

Almost every real number is normal.

The reason is that we can look at the $k$th digit of a number as the value of a random variable $X_k(\omega)$, where $\omega \in [0,1]$. These random variables are all independent and have the same distribution. For the digit 7 for example, look at the random variables

$Y_k(\omega) = \begin{cases} 1 & X_k(\omega) = 7 \\ 0 & \text{else} \end{cases}$

which have expectation $1/10$. The average $S_n(\omega)/n$, the fraction of digits 7 among the first $n$ digits of the decimal expansion of $\omega$, converges to $1/10$ by the law of large numbers. We can do that for any digit and therefore, almost all numbers are normal.
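One can watch this empirically. The digits of $\pi$, for example, look normal (normality of $\pi$ is only conjectured, so this is an illustration, not a proof):

  digits = First[RealDigits[N[Pi, 10000]]];   (* first 10000 decimal digits of Pi *)
  Sort[Tally[digits]]                          (* each digit appears close to 1000 times *)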
The limit

$\lim_{n \to \infty} \frac{1}{n} \sum_{k=1}^n f(x_k)$,

where the $x_k$ are IID random variables in $[a,b]$, is called the Monte-Carlo integral.
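As a sketch, here is the Monte-Carlo estimate of $\int_0^\pi \sin^2(x)\,dx$, which should come out close to $\pi/2$; the average of $f$ at random points is multiplied by the length $b-a$ of the interval:

  Pi Mean[Sin[RandomReal[{0, Pi}, 100000]]^2]   (* close to Pi/2 = 1.5708... *)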
Look at the first significant digit $X_n$ of the sequence $2^n$. For example, $X_5 = 3$ because $2^5 = 32$. To which number does $S_n/n$ converge? We know that $X_n$ has the Benford distribution $\mathrm{P}[X_n = k] = p_k$. [The actual process which generates the random variables is the irrational rotation $x \to x + \log_{10}(2) \mod 1$, which is ergodic since $\log_{10}(2)$ is irrational. You can therefore assume that the law of large numbers (Birkhoff's generalization) applies.]
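A quick check of the Benford distribution $p_k = \log_{10}(1 + 1/k)$ of these first digits (1000 terms is an arbitrary choice):

  Sort[Tally[Table[First[IntegerDigits[2^n]], {n, 1, 1000}]]]   (* observed counts *)
  Table[N[1000 Log10[1 + 1/k]], {k, 9}]                          (* Benford prediction *)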
By going through the proof of the weak law of large numbers, check: does the proof also work if the $X_n$ are only uncorrelated?

The Monte-Carlo integral is the same as the Riemann integral for continuous functions.
We can use this to compute areas of complicated regions. The following two lines evaluate the area of the Mandelbrot fractal using Monte-Carlo integration. The function $F$ is equal to 1 if the parameter value $c$ of the quadratic map $z \to z^2 + c$ is in the Mandelbrot set, and 0 else. It shoots 100000 random points and counts what fraction of the square of area 9 is covered by the set. Numerical experiments give values close to the actual value around 1.51... One could use more points to get more accurate estimates.
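A sketch of how the two lines could look; the escape bound of 100 iterations and the square $[-2,1] \times [-1.5,1.5]$ of area 9 are the assumptions here:

  F[c_] := If[Abs[NestWhile[#^2 + c &, 0., Abs[#] < 2 &, 1, 100]] < 2, 1, 0];
  9. Mean[Table[F[RandomReal[{-2, 1}] + I RandomReal[{-1.5, 1.5}]], {100000}]]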
If we add independent random variables and normalize them so that the mean is zero and the standard deviation is 1, then the distribution of the sum converges to the normal distribution. This central limit theorem explains why the normal distribution is so prevalent.

Given a random variable $X$ with expectation $m$ and standard deviation $\sigma$, define the normalized random variable $X^* = (X-m)/\sigma$.

The normalized random variable has mean 0 and standard deviation 1. The standard normal distribution mentioned above is an example. We have seen that $S_n/n$ converges to a definite number if the random variables are uncorrelated. We have also seen that the standard deviation of $S_n/n$ goes to zero.

A sequence $X_n$ of random variables converges in distribution to a random variable $X$ if for every trigonometric polynomial $f$, we have $\mathrm{E}[f(X_n)] \to \mathrm{E}[f(X)]$.

This means that $\mathrm{E}[\cos(tX_n)] \to \mathrm{E}[\cos(tX)]$ and $\mathrm{E}[\sin(tX_n)] \to \mathrm{E}[\sin(tX)]$ for every $t$. We can combine cos and sin to $\exp(itx) = \cos(tx) + i\sin(tx)$ and cover both at once by showing $\mathrm{E}[e^{itX_n}] \to \mathrm{E}[e^{itX}]$ for $n \to \infty$. So, checking the last statement for every $t$ is equivalent to checking convergence in distribution.

Proof. Let $X$ be an $N(0,1)$-distributed random variable. We can assume that the $X_k$ are already normalized, so that $S_n^* = S_n/\sqrt{n}$. We show $\mathrm{E}[e^{itS_n^*}] \to \mathrm{E}[e^{itX}]$ for any fixed $t$. Since any two of the random variables $X_k, X_l$ are independent,

$\mathrm{E}[e^{itS_n/\sqrt{n}}] = \mathrm{E}[e^{itX_1/\sqrt{n}}]^n = \left(1 - \frac{t^2}{2n} - \frac{i t^3 \mathrm{E}[X^3]}{3! \, n^{3/2}} + \cdots \right)^n$.

Using $e^{-t^2/2} = 1 - t^2/2 + \cdots$, we get

$\mathrm{E}[e^{itS_n/\sqrt{n}}] = \left(1 - \frac{t^2}{2n} + \frac{R_n}{n^{3/2}}\right)^n \to e^{-t^2/2}$.

The last step uses a Taylor remainder term $R_n/n^{3/2}$; it is here that the assumption $\mathrm{E}[X^3] < \infty$ has been used. The statement now follows from

$\mathrm{E}[e^{itX}] = (2\pi)^{-1/2} \int e^{itx} e^{-x^2/2}\,dx = e^{-t^2/2}$.
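A simulation illustrating the theorem; uniform random variables (mean $1/2$, variance $1/12$) are an arbitrary choice:

  n = 100; trials = 10000;
  sums = Total /@ RandomReal[{0, 1}, {trials, n}];       (* 10000 samples of S_n *)
  Histogram[(sums - n/2)/Sqrt[n/12.], Automatic, "PDF"]  (* close to the N(0,1) bell curve *)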
Statistical inference

For the following, see also Cliffs notes pages 89-93 (section 14.6): what is the probability that the average $S_n/n$ is within $\epsilon$ of the mean $\mathrm{E}[X]$?

The probability that $S_n/n$ deviates more than $R\sigma/\sqrt{n}$ from $\mathrm{E}[X]$ can for large $n$ be estimated by

$\frac{1}{\sqrt{2\pi}} \int_R^\infty e^{-x^2/2}\,dx$.

Proof. Let $p = \mathrm{E}[X]$ denote the mean of the $X_k$ and $\sigma$ the standard deviation. Denote by $X$ a random variable which has the standard normal distribution $N(0,1)$. We use the notation $X \sim Y$ if $X$ and $Y$ are close in distribution. By the central limit theorem,

$\frac{S_n - np}{\sigma\sqrt{n}} \sim X$.

Dividing numerator and denominator by $n$ gives $\frac{S_n/n - p}{\sigma/\sqrt{n}} \sim X$, so that

$\frac{S_n}{n} \sim p + \frac{\sigma}{\sqrt{n}} X$.

The term $\sigma/\sqrt{n}$ is called the standard error. The central limit theorem gives some insight into why the standard error is important. In scientific publications, the standard error should be displayed rather than the standard deviation. 1

For a random walk, we expect the distance from the starting point after $n$ steps to be of the order $\sqrt{n}$, because $S_n \sim X\sqrt{n}$. The assumption in the previous problem, that our squirrel is completely drunk, is the null hypothesis. Assume we observe the squirrel after 3 minutes at 20 feet from the original position. Since $S_{180}/\sqrt{180}$ is close to normal and $c = 20/\sqrt{180} = 1.49...$, we can estimate the P-value as

$\frac{1}{\sqrt{2\pi}} \int_{1.49}^\infty e^{-x^2/2}\,dx \approx 0.07$.

Since we know that the actual distribution is a binomial distribution, we could have computed the P-value exactly as $\sum_{k=100}^{180} \binom{180}{k} p^k (1-p)^{180-k} = 0.078$ with $p = 1/2$. A P-value smaller than 5 percent is called significant; we would then have to reject the null hypothesis and conclude that the squirrel is not drunk. In our case, the experiment was not significant.

The central limit theorem is so important because it gives us a tool to estimate the P-value. It is in general much better than the estimate given by Chebyshev.

1 Geoff Cumming, Fiona Fidler, and David L. Vaux: Error bars in experimental biology, Journal of Cell Biology 177(1), 2007, 7-11.
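Both numbers in the squirrel example can be checked in one line each; the exact computation uses $p = 1/2$:

  NIntegrate[Exp[-x^2/2]/Sqrt[2 Pi], {x, 20/Sqrt[180.], Infinity}]  (* normal approximation, about 0.068 *)
  N[Sum[Binomial[180, k] (1/2)^180, {k, 100, 180}]]                  (* exact binomial P-value, about 0.078 *)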
There are many popular children's games in which one has to throw a die and then move forward by the number of eyes shown. After $n$ rounds, we are at position $S_n$. If you play such a game, find an interval of positions such that you expect to be in that interval after $n$ rounds with probability at least $2/3$.
Remark: An example is the game Ladder, which kids play a lot in Switzerland: it is a random walk with drift 3.5. What makes the game exciting are occasional accelerations or setbacks; mathematically, it is a Markov process. If you hit certain fields, you get pushed ahead (sometimes significantly), but on other fields you can lose almost everything. The game stays interesting because even if you are ahead, you can still end up last, or you can trail behind the whole game and win in the end.
We play in a casino with the following version of the martingale strategy. We play all evening and bet one dollar on black until we reach our goal of winning 10 dollars. Assume each game lasts a minute. How long do we expect to wait until we can go home? (You can assume that the game is fair and that you win or lose with probability 1/2 in each game.)

Remark: this strategy appears frequently in movies, usually when characters are desperate. Examples are Run Lola Run, Casino Royale or The Hangover.
The example $A = \begin{pmatrix} 0 & 0 \\ 1 & 1 \end{pmatrix}$ shows that a Markov matrix can have zero eigenvalues and zero determinant.

The example $A = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$ shows that a Markov matrix can have negative eigenvalues and a negative determinant.

The matrix $A = \begin{pmatrix} 1/2 & 1/3 \\ 1/2 & 2/3 \end{pmatrix}$ is a Markov matrix.

Markov matrices are also called stochastic matrices. Many authors write the transpose of the matrix and apply the matrix to the right of a row vector. In linear algebra, we write $Ap$, where $p = [p_1, p_2, \dots, p_n]^T$ is a column vector. This is of course equivalent.
If all entries are positive and $A = \begin{pmatrix} a & b \\ 1-a & 1-b \end{pmatrix}$ is a $2 \times 2$ Markov matrix, then there is only one eigenvalue 1 and one eigenvalue smaller than 1 in absolute value.

Proof: we have seen that there is one eigenvalue 1 because $A^T$ has $[1,1]^T$ as an eigenvector. The trace of $A$ is $1 + a - b$, which is smaller than 2. Because the trace is the sum of the eigenvalues, the second eigenvalue $a - b$ is smaller than 1; since $0 < a, b < 1$, it is also larger than $-1$.

The example $A = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$ shows that without the positivity assumption, a Markov matrix can have several eigenvalues 1.
Let's call a vector with nonnegative entries $p_k$ for which all the $p_k$ add up to 1 a stochastic vector. For a stochastic matrix, every column is a stochastic vector. If $p$ is a stochastic vector and $v_1, \dots, v_n$ are the columns of a stochastic matrix $A$, then $Ap = p_1 v_1 + \dots + p_n v_n$ is again a stochastic vector, as it is a weighted average of stochastic vectors.

The example $A = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}$ shows that a Markov matrix can have complex eigenvalues and that Markov matrices can be orthogonal.

The following example shows that stochastic matrices do not need to be diagonalizable, not even over the complex numbers:
A Markov matrix $A$ always has an eigenvalue 1. All other eigenvalues are in absolute value smaller or equal to 1.

Proof. For the transpose matrix $A^T$, the entries in each row add up to 1. $A^T$ therefore has the eigenvector $[1, 1, \dots, 1]^T$ with eigenvalue 1. Because $A$ and $A^T$ have the same determinant, also $A - \lambda I_n$ and $A^T - \lambda I_n$ have the same determinant, so that the eigenvalues of $A$ and $A^T$ are the same. With $A^T$ having an eigenvalue 1, also $A$ has an eigenvalue 1.
Assume now that $v$ is an eigenvector with an eigenvalue $|\lambda| > 1$. Then $A^n v = \lambda^n v$ has exponentially growing length for $n \to \infty$. This implies that there is for large $n$ one coefficient $[A^n]_{ij}$ which is larger than 1. But $A^n$ is a stochastic matrix (see homework) and has all entries $\leq 1$. The assumption of an eigenvalue of absolute value larger than 1 can not be valid.
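A quick numerical check of this fact with a random $5 \times 5$ Markov matrix (the size is an arbitrary choice):

  cols = #/Total[#] & /@ RandomReal[{0, 1}, {5, 5}];  (* five random stochastic vectors *)
  A = Transpose[cols];                                 (* columns sum to 1 *)
  Sort[Abs[Eigenvalues[A]]]                            (* the largest is 1, the rest smaller *)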
A concrete example is the matrix

$A = \begin{pmatrix} 1/2 & 1/6 & 1/3 \\ 1/2 & 1/6 & 1/3 \\ 0 & 2/3 & 1/3 \end{pmatrix}$,

which is a stochastic matrix, even doubly stochastic: its transpose is stochastic too. Its row reduced echelon form is

$\begin{pmatrix} 1 & 0 & 1/2 \\ 0 & 1 & 1/2 \\ 0 & 0 & 0 \end{pmatrix}$,

so that it has a one-dimensional kernel. Its characteristic polynomial is $f_A(x) = x^2 - x^3$, which shows that the eigenvalues are $1, 0, 0$. The algebraic multiplicity of 0 is 2; the geometric multiplicity of 0 is 1. The matrix is not diagonalizable. 1

1 An example of this type appeared in http://mathoverflow.net/questions/51887/non-diagonalizable-doubly-stochastic-matrices
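These properties can be confirmed directly; the matrix below is the one given above, and the zero rows returned by Eigenvectors signal missing eigenvectors:

  A = {{1/2, 1/6, 1/3}, {1/2, 1/6, 1/3}, {0, 2/3, 1/3}};
  CharacteristicPolynomial[A, x]   (* x^2 - x^3 *)
  RowReduce[A]                     (* {{1,0,1/2},{0,1,1/2},{0,0,0}} *)
  MatrixRank[Eigenvectors[A]]      (* 2 < 3, so A is not diagonalizable *)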
[Figure: a transition graph on a few nodes with transition probabilities between 0.1 and 0.5, together with the corresponding Markov matrix $A$.]
Many games are Markov games. Let's look at a simple example of a mini monopoly, where no property is bought:

Let's have a simple monopoly game with 6 fields. We start at field 1 and throw a coin. If the coin shows heads, we move 2 fields forward. If the coin shows tails, we move back to field number 2. If you reach the end, you win a dollar; if you overshoot, you pay a fee of a dollar and move to the first field. Question: in the long run, do you win or lose, if $p_6 - p_5$ measures this win? Here $p = (p_1, p_2, p_3, p_4, p_5, p_6)$ is the stable equilibrium solution with eigenvalue 1 of the game.
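A sketch of the computation, under one possible reading of the rules (from fields 5 and 6, heads overshoots and returns to field 1):

  (* column j of A holds the probabilities of the moves from field j *)
  A = Transpose[{
      {0, 1/2, 1/2, 0, 0, 0},    (* from 1: tails -> 2, heads -> 3 *)
      {0, 1/2, 0, 1/2, 0, 0},    (* from 2 *)
      {0, 1/2, 0, 0, 1/2, 0},    (* from 3 *)
      {0, 1/2, 0, 0, 0, 1/2},    (* from 4 *)
      {1/2, 1/2, 0, 0, 0, 0},    (* from 5: heads overshoots -> 1 *)
      {1/2, 1/2, 0, 0, 0, 0}}];  (* from 6 *)
  p = First[NullSpace[A - IdentityMatrix[6]]];
  p = p/Total[p];                 (* the equilibrium with eigenvalue 1 *)
  p[[6]] - p[[5]]                 (* the long-run win rate *)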
Take the same example, but now throw a die in each round and move by the number of eyes shown. The transition matrix is now the $6 \times 6$ matrix

$A = \begin{pmatrix} 1/6 & \cdots & 1/6 \\ \vdots & & \vdots \\ 1/6 & \cdots & 1/6 \end{pmatrix}$

in which every entry equals $1/6$. In the homework, you will see that there is only one stable equilibrium now.
This is a second lecture on Markov processes. We want to see why the following result is true:

If all entries of a Markov matrix $A$ are positive, then $A$ has a unique equilibrium: there is only one eigenvalue 1, and all other eigenvalues are in absolute value smaller than 1.

The idea of the proof is that $A$ acts as a contraction on the set $X$ of stochastic vectors with positive entries, where the distance is the geodesic sphere distance. Such a map has a unique fixed point $v$ by Banach's fixed point theorem. This is the eigenvector $Av = v$ we were looking for. We have seen now that on $X$, there is only one eigenvector. Every other eigenvector $Aw = \lambda w$ must have a coordinate entry which is negative. Write $|w|$ for the vector with coordinates $|w_j|$. The computation

$|\lambda|\,|w_i| = |\lambda w_i| = \Big|\sum_j A_{ij} w_j\Big| \leq \sum_j A_{ij} |w_j| = (A|w|)_i$

shows that $|\lambda| L \leq L$, where $L$ is the length of $w$, because $A|w|$ is a vector with length smaller or equal to $L$. From $|\lambda| L \leq L$ with nonzero $L$ we get $|\lambda| \leq 1$. The $\leq$ which appears in the displayed formula is however a strict inequality for some $i$ if one of the coordinate entries is negative. Having established $|\lambda| < 1$, the proof is finished.
To illustrate the importance of the result, we look at how it is used in chaos theory and how it can be used by search engines to rank pages.
The matrix $A = \begin{pmatrix} 1/2 & 1/3 \\ 1/2 & 2/3 \end{pmatrix}$ is a Markov matrix for which all entries are positive. The eigenvalue 1 is unique because the sum of the eigenvalues, the trace $1/2 + 2/3$, is smaller than 2.
Let's give a brute force proof of the Perron-Frobenius theorem in the case of $3 \times 3$ matrices: such a matrix is of the form

$A = \begin{pmatrix} a & b & c \\ d & e & f \\ 1-a-d & 1-b-e & 1-c-f \end{pmatrix}$.
Remark. The theorem generalizes to situations considered in chaos theory, where products of random matrices are considered which all have the same distribution but which do not need to be independent. Given such a sequence of random matrices $A_k$, define $S_n = A_n A_{n-1} \cdots A_1$. This is a noncommutative analogue of the random walk $S_n = X_1 + \dots + X_n$ for usual random variables. But it is much more intricate, because matrices do not commute. Laws of large numbers are now more subtle.
Application: Chaos

The Lyapunov exponent of a random sequence of matrices is defined as

$\lim_{n \to \infty} \frac{1}{2n} \log \lambda(S_n^T S_n)$,

where $\lambda(B)$ denotes the largest eigenvalue of the matrix $B$.

Here is a prototype result in chaos theory, due to Anosov, for which the proof of Perron-Frobenius can be modified using different contractions. It can be seen as an example of a noncommutative law of large numbers:

If $A_k$ is a sequence of identically distributed random positive matrices of determinant 1, then the Lyapunov exponent is positive.
"
"
2 1
3 2
Let Ak be either
or
with probability 1/2. Since the matrices do not
1 1
1 1
commute, we can not determine the long term behavior of Sn so easily and laws of large
numbers do not apply. The Perron-Frobenius generalization above however shows that still,
Sn grows exponentially fast.
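A simulation estimating the Lyapunov exponent of this product; $n = 500$ is an arbitrary choice, and exact integer arithmetic avoids numerical overflow:

  A1 = {{2, 1}, {1, 1}}; A2 = {{3, 2}, {1, 1}};
  n = 500;
  S = Fold[Dot, IdentityMatrix[2], Table[RandomChoice[{A1, A2}], {n}]];
  Log[Max[Eigenvalues[N[Transpose[S].S, 20]]]]/(2 n)   (* comes out clearly positive *)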
A positive Lyapunov exponent is also called sensitive dependence on initial conditions for the system, or simply dubbed chaos. Nearby trajectories will deviate exponentially. Edward Lorenz, who studied models of the complicated equations which govern our weather about 50 years ago, stated this in a poetic way in 1972:

The flap of a butterfly's wing in Brazil can set off a tornado in Texas.
Unfortunately, mathematics is still quite weak when it comes to proving positive Lyapunov exponents if the system does not a priori feature positive matrices. There are cases which can be settled quite easily: for example, if the matrices $A_k$ are IID random matrices of determinant 1 whose eigenvalues do not have absolute value 1 with full probability, then the Lyapunov exponent is positive due to work of Furstenberg and others. In real systems, like the motion of our solar system or particles in a box, positive Lyapunov exponents are measured but can not be proven yet. Even for simple toy systems like $S_n = dT^n$, where $dT$ is the Jacobian of a map $T$ like $T(x,y) = (2x - y + c\sin(x), x)$ and $T^n$ is the $n$th iterate, things are unsettled: one measures $\lambda \approx \log(c/2)$ but is unable to prove it yet. For our real weather system, where the Navier-Stokes equations apply, one is even more helpless: one does not even know whether trajectories exist for all times. This existence problem would look like an esoteric ontological question if it were not for the fact that a one million dollar bounty is offered for its solution.

Application: PageRank

A set of nodes with connections is a graph. Any network can be described by a graph. The link structure of the web forms a graph, where the individual websites are the nodes and where there is an arrow from site $a_i$ to site $a_j$ if $a_i$ links to $a_j$. The adjacency matrix of this graph is called the web graph. If there are $n$ sites, then the adjacency matrix is an $n \times n$ matrix with entries $A_{ij} = 1$ if there exists a link from $a_j$ to $a_i$. If we divide each column by the number of 1's in that column, we obtain a Markov matrix $A$ which is called the normalized web matrix. Define also the matrix $E$ which satisfies $E_{ij} = 1/n$ for all $i, j$. The graduate students and later entrepreneurs Sergey Brin and Lawrence Page had in 1996 the following one billion dollar idea: rank the pages by the equilibrium vector of the Google matrix $G = dE + (1-d)A$, where $0 \leq d \leq 1$ is a damping factor. For a small network of three pages, this matrix has the form

$G = \frac{d}{3}\begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix} + (1-d)\begin{pmatrix} 0 & 0 & 1/2 \\ 1/2 & 0 & 1/2 \\ 1/2 & 1 & 0 \end{pmatrix}$.

The damping factor can look a bit mysterious. Brin and Page write:

"PageRank can be thought of as a model of user behavior. We assume there is a random surfer who is given a web page at random and keeps clicking on links, never hitting back, but eventually gets bored and starts on another random page. The probability that the random surfer visits a page is its PageRank. And, the d damping factor is the probability at each page the random surfer will get bored and request another random page. One important variation is to only add the damping factor d to a single page, or a group of pages. This allows for personalization and can make it nearly impossible to deliberately mislead the system in order to get a higher ranking. We have several other extensions to PageRank." 1

It is said now that PageRank is the world's largest matrix computation. The $n \times n$ matrix is huge: $n$ was 8.1 billion 5 years ago. 2
Verify that if a Markov matrix $A$ has the property that $A^2$ has only positive entries, then $A$ has a unique eigenvalue 1.

Determine the PageRank of the previous system, possibly using technology like Mathematica.
1 http://infolab.stanford.edu/~backrub/google.html
2 Amy Langville and Carl Meyer, Google's PageRank and Beyond, Princeton University Press, 2006.
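For the three-page example above, one possible computation looks as follows; the value $d = 0.15$ is an assumption (in the convention used here, $d$ is the probability of getting bored):

  d = 0.15;
  Anorm = {{0, 0, 1/2}, {1/2, 0, 1/2}, {1/2, 1, 0}};
  G = d ConstantArray[1/3, {3, 3}] + (1 - d) Anorm;
  Nest[G.# &, {1/3, 1/3, 1/3}, 200]   (* power iteration converges to the PageRank vector *)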
Symmetric matrices can always be diagonalized: if $A = A^T$, there is an orthogonal matrix $S$ such that $S^{-1}AS$ is diagonal. This result is called the spectral theorem. I present now an intuitive proof, which gives more insight into why the result is true. The linear algebra book of Bretscher has an inductive proof.

Proof. We have seen already that if all eigenvalues are different, there is an eigenbasis and diagonalization is possible. The eigenvectors are then all orthogonal, and $B = S^{-1}AS$ is diagonal, containing the eigenvalues. In general, we can change the matrix $A$ to $A_t = A + (C-A)t$, where $C$ is a symmetric matrix with pairwise different eigenvalues. Then the eigenvalues are different for all except finitely many $t$. The orthogonal matrices $S_t$ converge for $t \to 0$ to an orthogonal matrix $S$, and $S$ diagonalizes $A$.
Why could we not perturb a general matrix $A$ to a matrix $A_t$ with distinct eigenvalues, diagonalize $S_t^{-1} A_t S_t = B_t$, and pass to the limit? The problem is that $S_t$ might become singular for $t \to 0$.

The matrix $A = \begin{pmatrix} 1 & 1 \\ 0 & 1+t \end{pmatrix}$ has the eigenvalues $1, 1+t$ and is diagonalizable for $t > 0$, but not diagonalizable for $t = 0$. What happens with the diagonalization in the limit? Solution: Because the matrix is upper triangular, the eigenvalues are $1, 1+t$. The eigenvector to the eigenvalue 1 is $\begin{pmatrix} 1 \\ 0 \end{pmatrix}$. The eigenvector to the eigenvalue $1+t$ is $\begin{pmatrix} 1 \\ t \end{pmatrix}$. We see that in the limit $t \to 0$, the second eigenvector collides with the first one. For symmetric matrices, where the eigenvectors are always perpendicular to each other, such a collision can not happen.

The matrix $A = \begin{pmatrix} 3 & 4 \\ 4 & 3 \end{pmatrix}$ is symmetric. Its eigenvalues are 7 and $-1$, and the eigenvectors $[1,1]^T$ and $[1,-1]^T$ are indeed perpendicular.
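The collision of eigenvectors can be watched numerically; the values of $t$ below are arbitrary choices:

  Table[Eigenvectors[{{1., 1}, {0, 1 + t}}], {t, {0.1, 0.01, 0.001}}]
  (* the normalized eigenvector for 1+t approaches {1, 0}, the eigenvector for 1, as t -> 0 *)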
An example of a symmetric matrix which appears in solid state physics is the $n \times n$ matrix

$L = \begin{pmatrix}
\cos(\alpha) & 1 & 0 & \cdots & 0 & 1 \\
1 & \cos(2\alpha) & 1 & \cdots & 0 & 0 \\
0 & 1 & \cos(3\alpha) & \ddots & & \vdots \\
\vdots & & \ddots & \ddots & 1 & 0 \\
0 & 0 & & 1 & \cos((n-1)\alpha) & 1 \\
1 & 0 & \cdots & 0 & 1 & \cos(n\alpha)
\end{pmatrix}$.

For the following questions, give a reason why the statement is true or give a counterexample.
a) Is the sum of two symmetric matrices symmetric?
b) Is the product of two symmetric matrices symmetric?
c) Is the inverse of an invertible symmetric matrix symmetric?
d) If $B$ is an arbitrary $n \times m$ matrix, is $A = B^T B$ symmetric?
e) If $A$ is similar to $B$ and $A$ is symmetric, is $B$ symmetric?
f) If $A$ is similar to $B$ with an orthogonal coordinate change $S$ and $A$ is symmetric, is $B$ symmetric?
The matrix $L$ appears in models describing an electron in a periodic crystal. The eigenvalues form what one calls the spectrum of the matrix. A physicist is interested in it because it determines what conductivity properties the system has. This depends on $\alpha$.
Here is another matrix example:

$A = \begin{pmatrix}
10001 & 3 & 5 & 7 & 9 & 11 \\
1 & 10003 & 5 & 7 & 9 & 11 \\
1 & 3 & 10005 & 7 & 9 & 11 \\
1 & 3 & 5 & 10007 & 9 & 11 \\
1 & 3 & 5 & 7 & 10009 & 11 \\
1 & 3 & 5 & 7 & 9 & 10011
\end{pmatrix}$.

Note that $A - 10000 I_6$ has identical rows $[1,3,5,7,9,11]$ and therefore rank 1, so the eigenvalues of $A$ are $10000 + 36 = 10036$ and $10000$ with multiplicity 5.
The picture shows the eigenvalues of $L$ (with diagonal entries $\lambda\cos(k\alpha)$) for $\lambda = 2$ and large $n$. The vertical axis is $\alpha$, which runs from $\alpha = 0$ at the bottom to $\alpha = 2\pi$ at the top. Due to its nature, the picture is called the Hofstadter butterfly. It has been popularized in the book Gödel, Escher, Bach by Douglas Hofstadter.
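One can approximate a horizontal slice of the butterfly numerically; a sketch, in which $n = 120$ and the golden-ratio multiple of $2\pi$ are arbitrary choices and the diagonal uses $\lambda = 2$:

  n = 120; alpha = 2. Pi (Sqrt[5] - 1)/2;
  L = Table[Which[i == j, 2 Cos[i alpha],
       Abs[i - j] == 1 || Abs[i - j] == n - 1, 1., True, 0.], {i, n}, {j, n}];
  ListPlot[Sort[Eigenvalues[L]]]   (* the spectrum for this value of alpha *)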
Linear algebra

Please see the lecture 21 handout for the topics before the midterm.

Probability theory

A discrete random variable takes values $x_k$ in a discrete set, with probabilities $\mathrm{P}[X = x_k] = p_k$; a continuous random variable has a probability density $f$, with $\mathrm{P}[X \in [a,b]] = \int_a^b f(x)\,dx$. Chebyshev's theorem bounds the deviation from the mean: $\mathrm{P}[|X - \mathrm{E}[X]| \geq c] \leq \mathrm{Var}[X]/c^2$.