Linear Algebra With Probability PDF
Linear Algebra With Probability PDF
Linear Algebra With Probability PDF
Observed quantities are functions defined on lists. One calls them also random variables.
Because observables can be added or subtracted, they can be treated as vectors. If the data
themselves are not numbers like the strings of a DNA, we can add and multiply numerical functions
on these data. The function X(x) for example could count the number of A terms in a genome
sequence x with letters A,G,C,T which abbreviate Adenin, Guanin, Cytosin and Tymin. It is a
fundamental and pretty modern insight that all mathematics can be described using algebras and
operators. We have mentioned that data are often related and organized in relational form and that
an array of data achieves this. Lets look at weather data accessed from http://www.nws.noaa.gov
on January 4th 2011, where one of the row coordinates is time. The different data vectors are
listed side by side and listed in form of a matrix.
Month
01
02
03
04
05
06
07
08
09
10
11
12
Year
2010
2010
2010
2010
2010
2010
2010
2010
2010
2010
2010
2010
Temperature
29.6
33.2
43.9
53.0
62.8
70.3
77.2
73.4
68.7
55.6
44.8
32.7
Precipitation
2.91
3.34
14.87
1.78
2.90
3.18
2.66
5.75
1.80
3.90
2.96
3.61
Wind
12.0
13.3
13.0
10.4
10.6
9.5
9.7
10.2
10.8
12.2
11.0
13.2
To illustrate how linear algebra enters, lets add up all the rows and divide by the number of rows.
This is called the average. We can also look at the average squre distance to the mean, which is
the variance. Its square root is called the standard deviation.
Month
6.5
Max
Year Temperature
2010 53.7667
Precipitation
4.13833
Wind
11.325
Goal
Describe the data
Model the data
Reduce the data
Manipulate the data
Using
Lists
Probability
Projections
Algebra
Example
Relational database, Adjacency matrix
Markov process, Filtering, Smoothing
Linear regression.
Fourier theory.
We have seen that a fundamental tool to organize data is the concept of a list. Mathematicians
call this a vector. Since data can be added and scaled, data can be treated as vectors. We can
also look at lists of lists. These are called matrices. Matrices are important because they allow
to describe relations and operations. Given a matrix, we can access the data using coordinates.
The entry (3, 4) for example is the forth element in the third row. Having data organized in lists,
one can manipulate them more easily. One can use arithmetic on entire lists or arrays. This is
what spreadsheets do.
TemperaturePrecipitation Wind
Max=
77
Max=
Max=
15
13
In the example data set, the average day was June 15th, 2010, the average temperature was 54
degrees Fahrenheit = 12 degrees Celsius, with an average of 4 inches of precipitation per month
and an average wind speed of 11 miles per hour. The following figure visualizes the data. You
see the unusually rainy March which had produced quite a bit of flooding. The rest is not so
surprising. The temperatures are of course higher in summer than in winter and there is more
wind in the cooler months. Since data often come with noise, one can simplify their model and
reduce the amount of information to describe them. When looking at a particular precipitation
data in Boston over a year, we have a lot of information which is not interesting. More interesting
is the global trend, the deviation from this global trend and correlations within neighboring days.
It is for example more likely to be rainy near a rainy day than near a sunny day. The data are not
given by a random process like a dice. It we can identify the part of the data which are randomly
chosen, how do we find the rule? This is a basic problem and can often be approached using linear
algebra. The expectation will tell the most likely value of the data and the standard deviation
tells us how noisy the data are. Already these notions have relations with geometry.
How do we model from a situation given only one data set? For example, given the DJI data, we
would like to have a good model which predicts the nature of the process in the future. We can
do this by looking for trends. Mathematically, this is done by data fitting and will be discussed
extensively in this course. An other task is to model the process using a linear algebra model.
Maybe it is the case that near a red pixel in a picture it is more likely to have a red pixel again
or that after a gain in the DJI, we are more likely to gain more.
In the movie Fermats room, some mathematicians are trapped in a giant press and have
to solve mathematical problems to stop the press from crushing them to death. In one of
the math problems, they get the data stream of 169 = 132 digits:
00000000000000011111111100011111111111001111111111100110
001000110011000100011001111101111100111100011110001111111
11000001010101000000110101100000011111110000000000000000. Can you solve the riddle?
Try to solve it yourself at first. If you need a hint, watch the clip
http://www.math.harvard.edu/ knill/mathmovies/text/fermat/riddle3.html
This is problem 2 in Section 1.3 of the script, where 200 is replaced by 100: Flip a coin 100
times. Record the number of times, heads appeared in the first 10 experiments and call this
n1 . Then call the number of times, heads appears in the next N = 10 experiments and call
it n2 . This produces 10 inters n1 , . . . , n10 . Find the mean
m = (n1 + n2 + . . . + nN )/N
of your data, then the sample variance
v = ((n1 m)2 + (n2 m)2 + . . . + (nN m)2 )/(N 1)
and finally the sample standard deviation = v. Remark: there are statistical reasons
that (N 1) is chosen and not N. It is called Bessels correction.
Lets illustrate how we can use lists of data to encode a traffic situation. Assume an airline services
the towns of Boston, New York and Los Angeles. It flies from New York to Los Angeles, from Los
Angeles to Boston and from Boston to New York as well as from New York to Boston. How can
we can compute the number of round trips of any length n in this network? Linea algebra helps:
define the 3 3 connection matrix A given below and compute the nth power of the matrix.
We will learn how to do that next week. In our case, there are 670976837021 different round trips
of length 100 starting from Boston.
3
The matrix which encodes the situation is the following:
BO NY
0
1
1
0
LA
1
0
BO
A=
NY
LA
0
1
0
To summarize, linear algebra enters in many different ways into data analysis. Lists and lists of
lists are fundamental ways to represent data. They will be called vectors and matrices. Linear
algebra is needed to find good models or to reduce data. Finally, even if we have a model, we
want to do computations efficiently.
2
4
3
P[ \ A] = 1 P[A].
P[A B] = P[A] + P[B] P[A B].
P[A] =
It is important that in any situation, we first find out what the laboratory is. This is often
the hardest task. Once the setup is fixed, one has a combinatorics or counting problem.
Examples:
We turn a wheel of fortune and assume it is fair in the sense that every angle range [a, b]
appears with probability (b a)/2. What is the chance that the wheel stops with an angle
between 30 and 90 degrees?
Answer: The laboratory here is the circle [0, 2). Every point in this circle is a possible
experiment. The event that the wheel stops between 30 and 90 degrees is the interval
[/6, /2]. Assuming that all angles are equally probable, the answer is 1/6.
Here are the conditions which need to be satisfied for the probability function P:
1. 0 P[A] 1 and P[] = 1.
S
P
2. Aj are disjoint events, then P[
j=1 Aj ] =
j=1 P[Aj ].
(1, 2)
(2, 2)
(3, 2)
(4, 2)
(5, 2)
(6, 2)
(1, 3)
(2, 3)
(3, 3)
(4, 3)
(5, 3)
(6, 3)
(1, 4)
(2, 4)
(3, 4)
(4, 4)
(5, 4)
(6, 4)
(1, 5)
(2, 5)
(3, 5)
(4, 5)
(5, 5)
(6, 5)
(1, 6)
(2, 6)
(3, 6)
(4, 6)
(5, 6)
(6, 6)
be the possible cases, then there are only 8 cases where the sum is smaller or equal to 8.
Lets look at all 2 2 matrices for which the entries are either 0 or 1. What is the probability
that such a matrix has a nonzero determinant det(A) = ad bc?
Answer: We have 16 different matrices. Our probability space is finite:
= {
"
"
0 0
0 0
1 0
0 0
# "
# "
0 0
0 1
1 0
0 1
# "
# "
0 0
1 0
1 0
1 0
# "
# "
0 0
1 1
1 0
1 1
# "
# "
0 1
0 0
1 1
0 0
# "
# "
0 1
0 1
1 1
0 1
# "
# "
0 1
1 0
1 1
1 0
# "
# "
0 1
1 1
1 1
}.
1 1
Now lets look at the event that the determinant is nonzero. It contains the following matrices:
A={
Here is a more precise list of conditions which need to be satisfied for events.
(1, 1)
(2, 1)
(3, 1)
(4, 1)
(5, 1)
(6, 1)
Lets look at the digits of . What is the probability that the digit 5 appears? Answer:
Also this is a strange example since the digits are not randomly generated. They are given
by nature. There is no randomness involved. Still, one observes that the digits behave like
a random number and that the number is normal: every digit appears with the same
frequency. This is independent of the base.
We throw a dice twice. What is the probability that the sum is larger than 5?
Answer: We can enumerate all possible cases in a matrix and get Let
This example is called Bertrands paradox. Assume we throw randomly a line into the
unit disc. What is the probability that its length is larger than the length of the inscribed
triangle?
Answer: Interestingly, the answer depends as we will see in the lecture.
Lets look at the DowJonesIndustrial average DJI from the start. What is the probability
that the index will double in the next 50 years?
Answer: This is a strange question because we have only one data set. How can we talk
about probability in this situation? One way is to see this graph as a sample of a larger
probability space. A simple model would be to fit the data with some polynomial, then add
random noise to it. The real DJI graph now looks very similar to a typical graph of those.
|A|
||
"
0 1
1 0
# "
0 1
1 1
# "
1 0
0 1
# "
1 0
1 1
# "
1 1
0 1
# "
1 1
}.
1 0
Lets pick 2 cards from a deck of 52 cards. What is the probability that we have 2 kings?
Answer: Our laboratory has 52 51 possible experiments. To count the number of
good cases, note that there are 4 3 = 12 possible ordered pairs of two kings. Therefore
12/(52 51) = 1/221 is the probability.
Some notation
Set theory in :
The intersection A B contains the elements which are in A and B.
The union A B contains the elements which are in A or B.
The complement Ac contains the elements in which are not in A.
The difference A \ B contains the elements which are in A but not in B.
The symmetric difference AB contains what is in A or B but not in both.
The empty set is the set which does not contain any elements.
The algebra A of events:
If is the laboratory, the set A of events is -algebra. It is a set of subsets of
in which one can perform countably many set theoretical operations and which
S
contains and . In this set one can perform countable unions j Aj for the
T
union of a sequence of sets A1 , A2 , . . . or countable intersections j Aj .
Do problem 5) in Chapter 2 of the text but with 100 instead of 1000. You choose a random
number from {1, . . . , 100 }, where each of the numbers have the same probability. Let
A denote the event that the number is divisible by 3 and B the event that the number
is divisible by 5. What is the probability P[A] of A, the probability P[B] of B and the
probability of P[A B]? Compute also the probability P[A B]/P[B] which we will call the
conditional probability next time. It is the probability that the number is divisible by 3
under the condition that the number is divisible by 5.
You choose randomly 6 cards from 52 and do not put the cards back. What is the probability
that you got all aces? Make sure you describe the probability space and the event A that
we have all aces.
2) The Kolmogorov axioms form a solid foundation of probability theory. This has only been achieved in the 20th century
(1931). Before that probability theory was a bit hazy. For
infinite probability spaces it is necessary to restrict the set of
all possible events. One can not take all subsets of an interval
for example. There is no probability measure P which would
work with all sets. There are just too many.
P[A|B] =
It is a formula for the conditional probability P[A|B] when we know the unconditional
probability of A and P[B|A], the likelihood. The formula immediately follows from the
fact that P[B|A] + P[B|Ac ] = P[B].
The conditional probability of an event A under the condition that the event B
takes place is denoted with P[A|B] and defined to be P[A B]/P[B].
We throw a coin 3 times. The first 2 times, we have seen head. What is the chance that we
get tail the 3th time?
Answer: The probability space consists of all words in the alphabet H, T of length 3.
These are = {HHH, HHT, HT H, HT T, T HH, T HT, T T H, T T T }. The event B is the
event that the first 2 cases were head. The event A is the event that the third dice is head.
While the formula followed directly from the definition of conditional probability, it is very
useful since it allows us to compute the conditional probability P[A|B] from the likelihoods
P[B|A], P[B|Ac ]. Here is an example:
Problem: Dave has two kids, one of them is a girl. What is the chance that the other is a
girl?
Intuitively one would here give the answer 1/2 because the second event looks independent
of the first. However, this initial intuition is misleading and the probability only 1/3.
Solution. We need to introduce the probability space of all possible events
= {BG, GB, BB, GG}
You are in the Monty-Hall game show and need to chose from three doors. Behind one door
is a car and behind the others are goats. The host knows what is behind the doors. After
you open the first door, he opens an other door with a goat. He asks you whether you want
to switch. Do you want to?
Answer: Yes, you definitely should switch. You double your chances to win a car:
No switching: The probability space is the set of all possible experiments = {1, 2, 3 }.
You choose a door and win with probability 1/3 . The opening of the host does not affect
any more your choice.
Switching: We have the same probability space. If you pick the car, then you lose because
the switch will turn this into a goat. If you choose a door with a goat, the host opens the
other door with the goat and you win. Since you win in two cases, the probability is 2/3 .
Also here, intuition can lead to conditional probability traps and suggest to have a
win probability 1/3 in general. Lets use the notion of conditional probability to give
an other correct argument: the intervention of the host has narrowed the laboratory to
= {12, 13, 21, 23, 31, 32 } where 21 for example means choosing first door 2 then door 1.
Assume the car is behind door 1 (the other cases are similar). The host, who we assume
always picks door 2 if you pick 1 with the car (the other case is similar) gives us the condition
B = {13, 21, 31 } because the cases 23 and 32 are not possible. The winning event is A =
{21, 31 }. The answer to the problem is the conditional probability P[A|B] = P[A B]/P[B]
= 2/3 .
Problem. The probability to die in a car accident in a 24 hour period is one in a million.
The probability to die in a car accident at night is one in two millions. At night there is
30% traffic. You hear that a relative of yours died in a car accident. What is the chance
that the accident took place at night?
Solution. Let B be the event to die in a car accident and A the event to drive at night.
We apply the Bayes formula. We know P[A B] = P[B|A] P[A] = (1/2000000) (3/10) =
3/20000000.
P[A|B] =
P[B|A] P[A]
.
P[B|A] + P[B|Ac ]
P[A B]
= (3/20000000)/(1/1000000) = 3/20 .
P[B]
Proof: Because the denominator is P[B] = nj=1 P[B|Aj ]P[Aj ], the Bayes rule just says
P[Ai |B]P[B] = P[B|Ai ]P[Ai ]. But these are by definition both P[Ai B].
Problem: A fair dice is rolled first. It gives a random number k from {1, 2, 3, 4, 5, 6 }.
Next, a fair coin is tossed k times. Assume, we know that all coins show heads, what is the
probability that the score of the dice was equal to 5?
Solution. Let B be the event that all coins are heads and let Aj be the event that the dice
showed the number j. The problem is to find P[A5 |B]. We know P[B|Aj ] = 2j . Because the
P
events Aj , j = 1, . . . , 6 are disjoint sets in which cover it, we have P[B] = 6j=1 P[B Aj ] =
P6
P6
j
j=1 P[B|Aj ]P[Aj ] =
j=1 2 /6 = (1/2 + 1/4 + 1/8 + 1/16 + 1/32 + 1/64)(1/6) = 21/128.
By Bayes rule,
2
(1/32)(1/6)
P[B|A5 ]P[A5 ]
=
,
=
P[A5 |B] = P6
21/128
63
( j=1 P[B|Aj ]P[Aj ])
0.5
Problem 2) in Chapter 3: if the probability that a student is sick at a given day is 1 percent
and the probability that a student has an exam at a given day is 5 percent. Suppose that
6 percent of the students with exams go to the infirmary. What is the probability that a
student in the infirmary has an exam on a given day?
Problem 5) in chapter 3: Suppose that A, B are subsets of a sample space with a probability
function P. We know that P[A] = 4/5 and P[B] = 3/5. Explain why P[B|A] is at least
1/2.
Solve the Monty Hall problem with 4 doors. There are 4 doors with 3 goats and 1 car. What
are the winning probabilities in the switching and no-switching cases? You can assume that
the host always opens the still closed goat closest to the car.
0.3
0.2
0.1
Figure: The conditional probabilities P [Aj |B] in the previous problem. The knowledge
that all coins show head makes it more likely to have thrown less coins.
Here is a concrete example: Assume the chance that the first kid is a girl is 60% and that
the probability to have a boy after a boy is 2/3 and the probability to have a girl after a
girl is 2/3 too. What is the probability that the second kid is a girl?
S olution. Let B1 be the event that the first kid is a boy and let B2 the event that the
first kid is a girl. Assume that for the first kid the probability to have a girl is 60%.
But that P[F irstgirl|Secondgirl] = 2/3 and P[F irstboy|Secondboy] = 2/3. What are the
probabilities that the first kid is a boy? This produces a system
2/3P[B1 ]
1/3P[B1 ]
+ y + z + u + v + w =3
y + z + u + v
=2
2 + 2
=4
6/10
4/10
The probabilities are 8/15, 7/15. There is still a slightly larger probability to have a girl.
This example is also at the heart of Markov processes.
+ 1/3P[B2 ] =
+ 2/3P[B2 ] =
Example Here is a toy example of a problem one has to solve for magnetic resonance
imaging (MRI). This technique makes use of the absorb and emission of energy in the radio
frequency range of the electromagnetic spectrum.
Assume we have 4 hydrogen atoms, whose nuclei are excited with energy intensity a, b, c, d.
We measure the spin echo in 4 different directions. 3 = a+b,7 = c+d,5 = a+c and 5 = b+d.
What is a, b, c, d? Solution: a = 2, b = 1, c = 3, d = 4. However, also a = 0, b = 3, c = 5, d =
2 solves the problem. This system has not a unique solution even so there are 4 equations
q
The are 6 variables and 3 equations. Since we have less equations then unknowns, we expect
infinitely many solutions. The system can be written as A~x = ~b, where
r
a
b
o
1 1 1 1 1 1
A=
1 1 1 1 1 1
0 0 2 2 0 0
and 4 unknowns.
c
and ~b =
2 .
4
Example. Assume we have two events B1 , B2 which cover the probability space. We do
not know their probabilities. We have two other events A1 , A2 from which we know P[Ai ]
and the conditional probabilities P[Ai |Bj ]. We get the system of equations.
P[A1 |B1 ]P[B1 ]
P[A2 |B1 ]P[B1 ]
P[A1 ]
P[A2 ]
a11
a12
a13
a14
a21
x11
x12
a24
a31
x21
x22
a34
a41
a42
a43
a44
The last example should show you that linear systems of equations also appear in data fitting
even so we do not fit with linear functions. The task is to find a parabola
y = ax2 + bx + c
through the points (1, 3), (2, 1) and (4, 9). We have to solve the system
a+b+c = 3
4a + 2b + c = 1
16a + 4b + c = 9
0
200 T1
200 T2 T3
0
400
The task is to compute the actual densities. We first write down the augmented matrix is
=3
=5
=9
1
1
x
3
1 1
x = y , ~b = 5 .
,~
1 2 5
z
9
A= 1
elimination steps.
+ u
y
z
y + z
=
+ v
=
+ w =
=
u + v + w =
1 0
1
0
0
0 0
The augmented matrix is matrix, where other column has been added. This column contains
the vector b. The
last column is often
separated with horizontal lines for clarity reasons.
1 1
1 | 3
B = 1 1 1 | 5 .
We will solve this equation using Gauss-Jordan
1 2 5 | 9
x
x +
1
0
0
0
1
0
1
0
0
1
0
0
1
0
1
3
5
9
8
9
0
0
1
0
0
1
0
0
1
1
0
1
0
1
1
0
0
1
1
1
3
5
9
9
9
Now subtract the 4th row from the last to get a row of zeros, then subtract the 4th row
from the first. This is already the row reduced echelon form.
The ith entry (A~x)i is the dot product of the ith row of A with ~x.
0
0
1
1
0
Remove the sum of the first three rows from the 4th, then change sign of the 4th row:
can be written in matrix form as A~x = ~b, where A is a matrix called coefficient matrix and
column vectors ~x and ~b.
1 0
1
0
1
0 0
3
5
9
8
9
imaging. In this technology, a scanner can measure averages of tissue densities along lines.
1 0
1
0
0
0 0
0
0
1
0
0
0 1 1 6
0 1
0
5
0 0
1
9
.
1 1
1
9
0 0
0
0
The first 4 columns have leading 1. The other 2 variables are free variables r, s. We write
the row reduced echelon form again as a system and get so the solution:
x
y
z
u
v
w
=
=
=
=
=
=
6 + r + s
5r
9s
9rs
r
s
There are infinitely many solutions. They are parametrized by 2 free variables.
Gauss-Jordan Elimination is a process, where successive subtraction of multiples of other
rows or scaling or swapping operations brings the matrix into reduced row echelon form.
The elimination process consists of three possible steps. They are called elementary row
operations:
The number of leading 1 in rref(A) is called the rank of A. It is an integer which we will
use later.
A remark to the history: The process appeared already in the Chinese manuscript Jiuzhang
Suanshu the Nine Chapters on the Mathematical art. The manuscript or textbook appeared around 200 BC in the Han dynasty. The German geodesist Wilhelm Jordan
(1842-1899) applied the Gauss-Jordan method to finding squared errors to work on surveying. (An other Jordan, the French Mathematician Camille Jordan (1838-1922) worked on
linear algebra topics also (Jordan form) and is often mistakenly credited with the GaussJordan process.) Gauss developed Gaussian elimination around 1800 and used it to solve
least squares problems in celestial mechanics and later in geodesic computations. In 1809,
Gauss published the book Theory of Motion of the Heavenly Bodies in which he used the
method for solving astronomical problems.
1
2
3
4
5
6
7
8
9
1
2
3
4
5
6
7
8
9
1
2
3
4
5
6
7
8
9
1
2
3
4
5
6
7
8
9
1
2
3
4
5
6
7
8
9
1
2
3
4
5
6
7
8
9
1
2
3
4
5
6
7
8
9
1
2
3
4
5
6
7
8
9
1
2
3
4
5
6
7
8
9
1
2
3
4
5
6
7
8
9
1
2
3
4
5
6
7
8
9
1
2
3
4
5
6
7
8
9
More
challenging
1 2 3 4 5
2 3 4 5 6
3 4 5 6 7
4 5 6 7 8
5 6 7 8 1
6 7 8 1 2
7 8 1 2 3
8 1 2 3 4
is
6
7
8
1
2
3
4
5
the question:
what is the rank of the following matrix?
7 8
8 1
1 2
2 3
.
3 4
4 5
5 6
6 7
1
2
3
4
5
6
7
8
9
1
2
3
4
5
6
7
8
9
1
2
3
4
5
6
7
8
9
1
2
3
4
5
6
7
8
9
1
2
3
4
5
6
7
8
9
Problem 10 In 1.2 of Bretscher: Find all solutions of the equations with paper and
pencil using Gauss-Jordan elimination. Show all your work.
1
2
3
4
5
6
7
8
9
1
2
3
4
5
6
7
8
9
1
2
3
4
5
6
7
8
9
=
=
=
=
4
4
3
11
1 3 0
0 0 1
are of the same type. How many types of 2 x 2 matrices in reduced row-echelon form
are there?
Problem 42 in 1.2 of Bretscher: The accompanying sketch represents a maze of one way
streets in a city in the United States. The traffic volume through certain blocks during
an hour has been measured. Suppose that the vehicles leaving the area during this
hour were exactly the same as those entering it. What can you say about the traffic
volume at the four locations indicated by a question mark? Can you figure out exactly
how much traffic there was on each block? If not, describe one possible scenario. For
each of the four locations, find the highest and the lowest possible traffic volume.
JFK
Dunster
400
300
?
?
320
?
300
100
250
Mt Auburn
?
120
Winthrop
150
a11 a12
a22
an1 an2
a
21
A=
w
~2 =
h
h
1 2 1
i
i
w
~ 3 = 1 1 3
The column
vectorsare
3
4
5
And bring it to so called row reduced echelon form. We write rref(A). In this form, we have
in each row with nonzero entries a so called leading one.
A matrix with one column is also called a column vector. The entries of a matrix
are denoted by aij , where i is the row number and j is the column number.
w
~ n
w
~ 1 ~x
|
w
x
~2 ~
x =
~
...
w
~ n ~x
x1
| | |
x2
A~x = ~v1 ~v2 ~vm
| | |
xn
= x1~
v1 + x2~v2 + + xm~vm = ~b .
=0
=0
=9
0
x
4 5
2 1
x=
y , ~b = 0 .
,~
9
z
1 1 3
3
A=
1
If we have n equations and n unknowns, it is most likely that we have exactly one solution.
But remember Murphys law If anything can go wrong, it will go wrong. It happens!
is equivalent to A~x = ~b, where A is a coefficient matrix and ~x and ~b are vectors.
How do we determine in which case we are? It is the rank of A and the rank of the augmented
matrix B = [A|b] as well as the number m of columns which determine everything:
If rank(A) = rank(B) = m: there is exactly 1 solution.
If rank(A) < rank(B): there are no solutions.
If rank(A) = rank(B) < m: there are many solutions.
Theorem. For any system of linear equations there are three possibilities:
Consistent with unique solution: Exactly one solution. There is a leading 1
in each column of A but none in the last column of the augmented matrix B.
Inconsistent with no solutions. There is a leading 1 in the last column of the
augmented matrix B.
Consistent with infinitely many solutions. There are columns of A without
leading 1.
A system of linear equations A~x = ~b with n equations and m unknowns is defined by the
n m matrix A and the vector ~b. The row reduced matrix rref(B) of the augmented matrix
B = [A|b] determines the number of solutions of the system Ax = b. The rank rank(A) of
a matrix A is the number of leading ones in rref(A).
There are two ways how we can look a system of linear equation. It is called the row
picture or column picture:
Row picture: each bi is the dot product of a row vector w
~ i with ~x.
w
~ 1
3 4 5
0 = x1 1 + x2 1 + x3 1 .
w
~ 2
A~x =
...
If we have a unique solution to A~x = ~b, then rref(A) is the matrix which has a leading 1 in
every column. This matrix is called the identity matrix.
4 5 | 0
2 1 | 0
.
1 1 3 | 9
B = 1
What is the probability that we have exactly one solution if we look at all n n matrices
with entries 1 or 0? You explore this in the homework in the 2 2 case. During the lecture
we look at the 3 3 case and higher, using a Monte Carlo simulation.
Find a cubic equation f (x) = ax3 + bx2 + cx + d = y such that the graph of f goes through
the 4 points
A = (1, 7), B = (1, 1), C = (2, 11), D = (2, 25) .
40
20
-3
-2
-1
-20
-40
-60
1
-4
-1
y
-1
w
0
-1
0
-2
In a Herb garden, the humidity of its soil has the property that at any given point the humidity is the sum
of the neighboring humidities. Samples are taken on
a hexagonal grid on 14 spots. The humidity at the
four locations x, y, z, w is unknown. Solve the equations
x = y+z+w+2
y=
x+w-3
using row reduction.
z=
x+w-1
w=
x+y+z-2
"
3 4
. This
1 5
y1
y3
vector. The map defines a line in space.
"
that the
0
, and
In order to find the matrix of a linear transformation, look at the image of the
standard vectors and use those to build the columns of the matrix.
1 0
| | |
A linear transformation T (x) = Ax with A = ~v1 ~v2 ~vn has the property
| | |
column vector ~v1 , ~vi , ~vn are the images of the standard vectors ~e1 = , ~ei =
~en =
.
1 1
2
.
1 1 3
2
Linear transformations are important in
Find the linear transformation, which reflects a vector at the line containing the vector
(1, 1, 1).
If there is a linear transformation S such that S(T ~x) = ~x for every ~x, then S is called
the inverse of T . We will discuss inverse transformations later in more detail.
A~x = ~b means to invert the linear transformation ~x 7 A~x. If the linear system has exactly
one solution, then an inverse exists. We will write ~x = A1~b and see that the inverse of a
linear transformation is again a linear transformation.
Otto Bretschers book contains as a motivation a code, where the encryption happens
with the linear map T (x, y) = (x + 3y, 2x + 5y). It is an variant of a Hill code. The map
has the inverse T 1 (x, y) = (5x + 3y, 2x y).
Assume we know, the other party uses a Bretscher code and can find out that T (1, 1) = (3, 5)
and T
" (2, 1) #= (7, 5). Can we reconstruct the code? The problem is to find the matrix
a b
A=
. It is useful to decode the Hill code in general. If ax+by = X and cx+dy = Y ,
c d
then x = (dX bY
" )/(ad# bc), y = (cX aY )/(ad bc). This is" a linear transformation
#
a b
d b
with matrix A =
and the corresponding matrix is A1 =
/(ad bc).
c d
c a
This is Problem 24-40 in Bretscher: Consider the circular face in the accompanying figure.
For each of the matrices A1 , ...A6 , draw a sketch showing the effect of the linear transformation"T (x) = Ax
# on this "face. #
"
#
"
#
"
#
0 1
2 0
0 1
1 0
1 0
A1 =
. A2 =
. A3 =
. A4 =
. A5 =
.
1 0
0 2
1 0
0 1
0 2
"
#
1 0
A6 =
.
0 1
This is problem 50 in Bretscher. A goldsmith uses a platinum alloy and a silver alloy to
make jewelry; the densities of these alloys are exactly 20 and 10 grams per cubic centimeter,
respectively.
a) King Hiero of Syracuse orders a crown from this goldsmith, with a total mass of 5 kilograms
(or 5,000 grams), with the stipulation that the platinum alloy must make up at least 90%
of the mass. The goldsmith delivers a beautiful piece, but the kings friend Archimedes has
doubts about its purity. While taking a bath, he comes up with a method to check the
composition of the crown (famously shouting Eureka! in the process, and running to the
kings palace naked). Submerging the crown in water, he finds its volume to be 370 cubic
centimeters. How much of each alloy went into this piece (by mass)? Is this goldsmith a
crook?
b) Find the matrix A that transforms the vector
"
totalmass
totalvolume
In the first week we have seen how to compute the mean and standard deviation of data.
a) Given some data (x1 , x2 , x3 , ..., x6 ). Is the transformation from R6 R which maps the
data to its mean m linear?
b) Is the map which assigns to the data the standard deviation a linear map? c) Is
the map which assigns to the data the difference (y1 , y2, ..., y6 ) defined by y1 = x1 , y2 =
x2 x1 , ..., y6 = x6 y5 linear? Find its matrix. d) Is the map which assigns to the data the
normalized data (x1 m, x2 m, ..., xn m) given by a linear transformation?
Projection
Shear transformations
A=
"
1 0
1 1
A=
A=
"
1 1
0 1
In general, shears are transformation in the plane with the property that there is a vector w
~ such
that T (w)
~ =w
~ and T (~x) ~x is a multiple of w
~ for all ~x. Shear transformations are invertible,
and are important in general because they are examples which can not be diagonalized.
"
1 0
0 0
A=
"
0 0
0 1
A
u is T (~x) = (~x ~u)~u with matrix A =
" projection #onto a line containing unit vector ~
u1 u1 u2 u1
.
u1 u2 u2 u2
Projections are also important in statistics. Projections are not invertible except if we project
onto the entire space. Projections also have the property that P 2 = P . If we do it twice, it
is the same transformation. If we combine a projection with a dilation, we get a rotation
dilation.
Rotation
Scaling transformations
A=
5
A=
"
2 0
0 2
A=
"
1/2 0
0 1/2
"
1 0
0 1
A
"
cos() sin()
sin() cos()
#=
One can also look at transformations which scale x differently then y and where A is a diagonal
matrix. Scaling transformations can also be written as A = I2 where I2 is the identity matrix.
They are also called dilations.
Rotation-Dilation
Reflection
A=
A
"
cos(2) sin(2)
sin(2) cos(2)
#=
A=
"
1 0
0 1
Any reflection at a line has the form of the matrix to the"left. A reflection at# a line containing
2u21 1 2u1u2
a unit vector ~u is T (~x) = 2(~x ~u)~u ~x with matrix A =
2u1 u2 2u22 1
Reflections have the property that they are their own inverse. If we combine a reflection with
a dilation, we get a reflection-dilation.
"
2 3
3 2
A=
"
a b
b a
6
A rotation
dilation is a composition of a rotation by angle arctan(y/x) and a dilation by a
factor x2 + y 2 .
If z = x + iy and w = a + ib and T (x, y) = (X, Y ), then X + iY = zw. So a rotation dilation
is tied to the process of the multiplication with a complex number.
Rotations in space
0 0 1
D=
1
1
1
3
1
1
1
1
3 1
3
0
0 0
1 1
1 1
0 0
0 0
B=
0
E=
0
0
3
1
0
0
1
1
0
0
1
3
0
0
1
1
C=
1
F =
1
1
1
1
1
1
0
0
1
1
1
0
0
1
1
1 1
1 1
1 1
0 0
0 0
1 1
1 1
b) The smiley face visible to the right is transformed with various linear
transformations represented by matrices A F . Find out which matrix
does which transformation:
Reflection at xy-plane
"
1 1
,
1 1 #
"
1 1
D=
,
0 1
A=
3 1 1
1
3
1 1 1
1 1 1
1 1 1
1 1 1
1 1 1
1 3
A=
1 1
To a reflection
at the xy-plane belongs the matrix A
=
1 0 0
ei . The
0 1 0 as can be seen by looking at the images of ~
0 0 1
picture to the right shows the linear algebra textbook reflected at
two different mirrors.
A-F
B=
E=
image
"
1 2
,
0 1 #
"
1 0
,
0 1
A-F
C=
F=
"
1 0
,
0 1 #
"
0 1
/2
1 0
image
A-F
image
1 0 0 0
0 1 0 0
This is homework 28 in Bretscher 2.2: Each of the linear transformations in parts (a) through
(e) corresponds to one and only one of the matrices A) through J). Match them up.
a) Scaling
What transformation in space do you get if you reflect first at the xy-plane, then rotate
around the z axes by 90 degrees (counterclockwise when watching in the direction of the
z-axes), and finally reflect at the x axes?
a) One of the following matrices can be composed with a dilation to become an orthogonal
projection onto a line. Which one?
"
b) Shear
"
0 0
B=
0 1
"
#
"
0.6 0.8
F =
G=
0.8 0.6
A=
"
2 1
0.6 0.8
C=
1 0
0.8 0.6
#
"
#
0.6 0.6
2 1
H=
0.8 0.8
1 2
e) Reflection
"
7
0
"
0
I=
1
D=
0
7#
0
0
"
1
3
"
0.8
J=
0.6
E=
0
1
#
0.6
0.8
If B is a 3 4 matrix, and A
1
1 3 5 7
3
B = 3 1 8 1 , A =
1
1 0 9 2
0
Pm
k=1
is
3
1
0
1
a 4 2 matrix then BA is a
1
1 3 5 7
3
, BA = 3 1 8 1
1
1 0 9 2
0
3
15 13
1
= 14 11 .
0
10 5
1
Matrix multiplication generalizes the common multiplication of numbers. We can write the
dot product between two vectors as a matrix product when writing the first vector as a
1n
matrix
(= row vector) and the second as a n 1 matrix (=column vector) like in
If A is a n n matrix and the system of linear equations Ax = y has a unique solution for all
y, we write x = A1 y. The inverse matrix can be computed using Gauss-Jordan elimination.
Lets see how this works.
Let 1n be the nn identity matrix. Start with [A|1n ] and perform Gauss-Jordan elimination.
Then
Bik Akj .
2 steps. In our case, it can go in three different ways back to the page itself.
Matrices help to solve combinatorial problems. One appears in the movie Good will hunting. For example, what does [A100 ] tell about the news distribution on a large network.
What does it mean if A100 has no zero entries?
1
0
Proof. The elimination process solves A~x = ~ei simultaneously. This leads to solutions ~vi
which are the columns of the inverse matrix A1 because A1 e~i = ~vi .
"
2 6 | 1 0
1 4 | 0 1
"
1 3 | 1/2 0
1 4 | 0 1
"
1 3 | 1/2 0
0 1 | 1/2 1
"
1 0 |
2
3
0 1 | 1/2 1
The inverse is A1 =
"
#
#
#
A | 12
.... | ...
.... | ...
12 | A1
2
3
.
1/2 1
1 0
1 1
"
d b
c a
Diagonal:
"
#
2 0
A=
0 3
"
a b
c d
is given by
/(ad bc).
A1 =
"
1 0
1 1
A1 =
"
1/2 0
0 1/3
Reflection:
"
#
cos(2) sin(2)
A=
sin(2) cos(2)
A1 = A =
Rotation:
"
#
cos()
sin()
A=
sin() cos()
A1 =
"
"
cos(2) sin(2)
sin(2) cos(2)
cos() sin()
sin() cos()
Rotation=Dilation:
"
#
a b
A=
b a
"
A1 =
a/r 2 b/r 2
, r 2 = a2 + b2
b/r 2 a/r 2
1 1 1
1
1
0
1 0 0
1 1
A=
1 1
1 1
1
1
0
0
0
1
0
0
0
0
1 1 1
1 1
1 1 0
A= 0
This is a system we will analyze more later. For now we are only interested in the algebra. Tom the cat moves each minute randomly from on spots 1,5,4 jumping to neighboring sites only. At the same time Jerry, the mouse, moves on spots 1,2,3, also jumping to neighboring sites. The possible position combinations (2, 5), (3, 4), (3, 1), (1, 4), (1, 1)
and transitions are encoded in a matrix
0
1/4
A = 1/4
1/4
1/4
1
0
0
0
0
1
0
0
0
0
1
0
0
0
0
0
0
0
0
1
This means for example that we can go from state (2, 5) with equal probability to all
the other states. In state (3, 4) or (3, 1) the pair (Tom,Jerry) moves back to (2, 5).
If state (1, 1) is reached, then Jerrys life ends and Tom goes to sleep there. We can
read off that the probability that Jerry gets eaten in one step as 1/4. Compute A4 .
The first column of this matrix gives the probability distribution after four steps. The
last entry of the first column gives the probability that Jerry got swallowed after 4 steps.
1
5
10
k
3nk
.
4n
10
pk (1 p)nk with p = 1/4 and interpret it of having heads
k
turn up k times if it appears with probability p and tails with probability 1 p.
We can write this as
In this lecture, we define random variables, the expectation, mean and standard deviation.
A random variable is a function X from the probability space to the real line with
the property that for every interval the set {X [a, b] } is an event.
If X(k) counts the number of 1 in a sequence of length n and each 1 occurs with a
probability p, then
!
n
P[X = k ] =
pk (1 p)nk .
k
There is nothing complicated about random variables. They are just functions on the laboratory
. The reason for the difficulty in understanding random variables is solely due to the name
variable. It is not a variable we solve for. It is just a function. It quantifies properties of
experiments. In any applications, the sets X [a, b] are automatically events. The last condition
in the definition is something we do not have to worry about in general.
If our probability space is finite, all subsets are events. In that case, any function on is a random
variable. In the case of continuous probability spaces like intervals, any piecewise continuous
function is a random variable. In general, any function which can be constructed with a sequence
of operations is a random variable.
0.25
0.20
We throw two dice and assign to each experiment the sum of the eyes when rolling two dice.
For example X[(1, 2)] = 3 or X[(4, 5)] = 9. This random variable takes values in the set
{2, 3, 4, . . . , 12}.
0.20
0.15
0.15
0.10
0.10
0.05
Assume is the set of all 10 letter sequences made of the four nucleotides G, C, A, T in a
string of DNA. An example is = (G, C, A, T, T, A, G, G, C, T ). Define X() as the number
of Guanin basis elements. In the particular sample just given, we have X() = 3.
Problem Assume X() is the number of Guanin basis elements in a sequence. What is the
probability of the event {X() = 2 }? Answer Our probability space has 410 = 1048576
elements. There are 38 cases, where the first two elements are G. There are 38 elements
where the first and third element is G, etc. For any pair, there are 38 sequences. We have
(109/2) = 45 possible ways to chose a pair from the 10. There are therefore 38 45 sequences
with exactly 2 amino acids G. This is the cardinality of the event A = {X() = 2 }. The
probability is |A|/|| = 45 38 /410 which is about 0.28.
For random variables taking finitely many values we can look at the probabilities
pj = P [X = cj ]. This collection of numbers is called a discrete probability
distribution of the random variable.
0.05
10
10
For a random variable X taking finitely many values, we define the expectation
P
as m = E[X] = x xP[X = x ]. Define the variance asqVar[X] = E[(X m)2 ] =
E[X 2 ] E[X]2 and the standard deviation as [X] = Var[X].
In the case of throwing a coin 10 times and head appears with probability p = 1/2 we have
E[X] = 0 P[X = 0] + 1 P[X = 1] + 2 P[X = 2] + 3 P[X = 3] + + 10 P[X = 10] .
We throw a dice 10 times and call! X() the number of times that heads shows up. We
10!
have P[X = k ] =
/210 . because we chose k elements from n = 10. This
k!(10 k)!
distribution is called the Binominal distribution on the set {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 }.
The average adds up to 10 p = 5, which is what we expect. We will see next time when
we discuss independence, how we can get this immediately. The variance is
Var[X] = (0 5)2 P[X = 0 ] + (1 5)2 P[X = 1 ] + + (10 5)2 P[X = 10 ] .
9
AB
AB
AB
8
9
7
8
AB
6
7
5
6
AB
AB
4
5
3
3
3
4
AB
Throw a vertical line randomly into the unit disc. Let X[ ] be the length of the segment
cut out from the circle. What is P [X > 1 ]?
Solution:
we need to hit the x axes in |x| < 1/ 3. Comparing lenghts gives the probability
is 1/ 3. We have assumed here that every interval [c, d] in the interval [1, 1] appears with
probability (d c)/2.
AB
All these examples so far, the random variable has taken only a discrete set of values. Here is
an example, where the random variable can take values in an interval. It is called a variable
with a continuous distribution.
It is 10p(1 p) = 10/4. Again, we will have to wait until next lecture to see how we can get
this without counting.
We look at the probability space of all 2 2 matrices, where the entries are either 1 or
1. Define the random variable X() = det(), where is one of the matrices. The
determinant is
"
#
a b
det(
= ad bc .
c d
Draw the probability distribution of this random variable and find the expectation as
well as the Variance and standard deviation.
In the previous example, we have seen again the Bertrand example, but because we insisted
on vertical sticks, the probability density was determined. The other two cases we have seen
produced different probability densities. A probability model always needs a probability
function P .
In the card game blackjack, each of the 52 cards is assigned a value. You see the
French card deck below in the picture. Numbered cards 2-10 have their natural
value, the picture cards jack, queen, and king count as 10, and aces are valued as
either 1 or 11. Draw the probability distribution of the random variable X which gives
the value of the card assuming that we assign to the hearts ace and diamond aces the
value 1 and to the club ace and spades ace the value 11. Find the mean the variance
and the standard deviation of the random variable X.
A LCD display with 100 pixels is described by a 10 10 matrix with entries 0 and 1.
Assume, that each of the pixels fails independently with probability p = 1/20 during
the year. Define the random variable X() as the number of dead pixels after a year.
a) What is the probability of the event P [X > 3], the probability that more than 3
pixels have died during the year?
b) What is the expected number of pixels failing during the year?
Let be the probability space obtained by throwing two dice. It has 36 elements. Let A be
the event that the first dice shows an odd number and let B be the event that the second
dice shows less than 3 eyes. The probability of A is 18/36 = 1/2 the probability of B is
12/36 = 1/3. The event A B consists of the cases {(1, 1), (1, 2), (3, 1), (3, 2), (5, 1), (5, 2) }
and has probability 1/6. The two events are independent.
If is a finite probability space where each experiment has the same probability, then
E[X] =
1 X
X()
||
(1)
xj P[X = xj ] ,
(2)
xj
where xj are the possible values of X. The later expression is the same but involves less terms.
In real life we often do not know the probability space. Or, the probability space
is so large that we have no way to enumerate it. The only thing we can access is
the distribution, the frequency with which data occur. Statistics helps to build a
model. Formula (1) is often not computable, but (2) is since we can build a model
with that distribution.
11
12
13
14
15
16
21
22
23
24
25
26
31
32
33
34
35
36
41
42
43
44
45
46
51
52
53
54
55
56
61
62
63
64
65
66
X = (1, 2, 3, 3, 4, 1, 1, 1, 2, 6)
To compute the expectation of X, write it as the result of a random variable X(1) =
1, X(2) = 2, X(3) = 3, ..., X(10) = 6 on a probability space of 10 elements. In this case,
E[X] = (1 + 2 + 3 + 3 + 4 + 1 + 1 + 1 + 2 + 6)/10 = 24/10. But we can look at these data
also differently and say P[X = 1] = 4/10, P[X = 2] = P[X = 3] = 2/10, P[X = 4] = P[X =
6] = 1/6. Now,
E[X] = 1 P[X = 1] + 2 P[X = 2] + 3 P[X = 3] + 4 P[X = 4] + 6 P[X = 6]
2
2
1
1
12
4
+2
+3
+4
+6
=
.
= 1
10
10
10
10
10
5
The first expression has 10 terms, the second 5. Not an impressive gain, but look at the
next example.
We throw 100 coins and let X denote the number of heads. Formula (1) involves 2100
terms. This is too many to sum over. The expression (2) however
100
X
kP[X = k] =
100
X
100
k
k=1
k=1
This follows from the definition P[A|B] = P[A B]/P[B] and P[A B] = P[A] P[B].
1
2100
Two random variables X, Y are called independent if for every x, y, the events
{X = x} and {Y = y} are independent.
has only 100 terms and sums up to 100 (1/2) = 50 because in general
n
1 X
k
n
2 k=0
n
k
n
.
2
5
!
n
n1
By the way, one can see this by writing out the factorials k
=n
. Summing
k
k
over the probability space is unmanageable. Even if we would have looked at 10 trillion
cases every millisecond since 14 billion years, we would not be through. But this is not an
obstacle. Despite the huge probability space, we have a simple model which tells us what
the probability is to have k heads.
If is the probability space of throwing two dice. Let X be the random variable which gives
the value of the first dice and Y the random variable which gives the value of the second
dice. Then X((a, b)) = a and Y ((a, b)) = b. The events X = x and Y = y are independent
because each has probability 1/6 and event {X = x, Y = y} has probability 1/36.
Two random variables X, Y are called uncorrelated, if E[XY ] = E[X] E[Y ].
Let X be the random variable which is 1 on the event A and zero everywhere else. Let Y
be the random variable which is 1 on the event B and zero everywhere else. Now E[X] =
0P[X = 0] + 1P[X = 1] = P[A]. Similarly P[Y ] = P[B]. and P[XY ] = P[A B] because
XY () = 1 only if is in A and in B.
Let X be the random variable on the probability space of two dice which gives the dice value
of the first dice. Let Y be the value of the second dice. These two random variables are
uncorrelated.
E[XY ] =
6 X
6
212
49
1 X
ij = [(1 + 2 + 3 + 4 + 5 + 6) (1 + 2 + 3 + 4 + 5 + 6)]/36 =
=
.
36 i=1 j=1
36
4
And here is the code which produces the mean and standard deviation of the first n
digits:
b=10; n=100;
s=IntegerDigits [ Floor [ Pi bn ] , b ] ; m=N[Sum[ s [ [ k ] ] , { k , n } ] / n ] ;
sigma=Sqrt [N[Sum[ ( s [ [ k ]] m) 2 , { k , n } ] / n ] ] ; {m, sigma }
Let be the probability space of throwing two dice. Let X denote the difference of the
two dice values and let Y be the sum. Find the correlation between these two random
variables.
Two random variables are uncorrelated if and only if their correlation is zero.
To see this, just multiply out E[(X E[X])(Y E[Y ])] = E[XY ]2E[X]E[X]+E[X]E[X] =
E[XY ] E[X]E[Y ].
If two random variables are independent, then they are uncorrelated.
Proof. Let {a1 , . . . , an } be the values of the variable X and {b1 , . . . , bn }"be the value of
1 xA
Let
the variable Y . For an event A we define the random variable 1A () =
0 x
/A
Pn
Pm
Ai = {X = ai } and Bj = {Y = bj }. We can write X = i=1 ai 1Ai , Y = j=1 bj 1Bj , where
the events Ai and Bj are independent. Because E[1Ai ] = P[Ai ] and E[1Bj ] = P[Bj ] we have
E[1Ai 1Bj ] = P[Ai ] P[Bj ]. This implies E[XY ] = E[X]E[Y ].
For uncorrelated random variables, we have Var[X + Y ] = Var[X] + Var[Y ].
To see this, subtract first the mean from X and Y . This does not change the variance but now
the random variables have mean 0. We have Var[X+Y ] = E[(X+Y )2 ] = E[X 2 +2XY +Y 2 ] =
E[X 2 ] + 2E[XY ] + E[Y 2 ].
Let X be the random variable of one single Bernoulli trial with P[X = 1] = p and P[X =
0] = 1 p. This implies E[X] = 0P[X = 0] + pP[X = 1] and
Var[X] = (0 p)2 P[X = 0] + (1 p)2 P[X = 1] = p2 (1 p) + (1 p)2 p = p(1 p) .
If we add n independent random variables of this type, then E[X1 + + Xn ] = np and
Var[X1 + ... + Xn ] = Var[X1 + + Xn ] = np(1 p).
You can check the above proof using E[f (X)] = j f (aj )E[Aj ] and E[g(X)] = j g(bj )E[Bj ].
It still remains true. The only thing which changes are the numbers f (ai ) and g(bj ). By choosing suitable functions we can assure that all events Ai = X = xi and Bj = Y = yj are independent.
P
Lets explain this in a very small example, where the probability space has only three elements. In
that case, random variables are vectors. We look at centered random variables, random variables
of zero mean so that the covariance
variables, meaning that X = b is the function on the probability space {1, 2, 3 } given by
c
f (1) = a, f (2) = b, f (3) = c. As you know from linear algebra books, it is more common to
write Xk instead of X(k). Lets state an almost too obvious relation between linear algebra and
probability theory because it is at the heart of the matter:
Vectors in Rn can be seen as random variables on the probability space {1, 2, ...., n }.
P
P
1 xA
We can write X = ni=1 ai 1Ai , Y = m
j=1 bj 1Bj ,
0 x
/A
where Ai = {X = ai } and Bj = {Y = bj } are independent. Because E[1Ai ] = P[Ai ] and
E[1Bj ] = P[Bj ] we have E[1Ai 1Bj ] = P[Ai ] P[Bj ]. Compare
E[XY ] = E[(
ai 1Ai )(
bj 1Bj )] =
ai bj E[1Ai 1Bj ] =
i,j
It is because of this relation that it makes sense to combine the two subjects of linear algebra and
probability theory. It is the reason why methods of linear algebra are immediately applicable to
probability theory. It also reinforces the picture given in the first lecture that data are vectors.
The expectation of data can be seen as the expectation of a random variable.
ai bj E[1Ai ]E[1Bj ] .
i,j
E[X]E[Y ] = E[(
ai 1Ai )]E[(
X
j
bj 1Bj )] = (
ai E[1Ai ])(
bj E[1Bj ]) =
Here are two random variables of zero mean: X = 3 and Y = 4 . They are
0
8
uncorrelated because their dot product E[XY ] = 3 4 + (3) 4 + 0 8 is zero. Are they
independent? No, the event A = {X = 3 } = {1 } and the event B = {Y = 4 } = {1, 2 } are
not independent. We have P[A] = 1/3, P[B] = 2/3 and P[AB] = 1/3. We can also see it as
follows: the random variables X 2 = [9, 9, 0] and Y 2 = [16, 16, 64] are no more uncorrelated:
E[X 2 Y 2 ] E[X 2 ]E[Y 2 ] = 31040 746496 is no more zero.
with
X
ai bj E[1Ai ]E[1Bj ] .
i,j
Cov[XY ]
.
[X][Y ]
Lets take the case of throwing two coins. The probability space is {HH, HT, T H, T T }.
1
1
The random variable that the first dice is 1 is X = . The random variable that
0
0
4
These random variables are independent. We can
0
center them to get centered random variables which are independent. [ Alert: the random
1
0
1 0
variables Y =
,
written down earlier are not independent, because the sets
0 1
0
1
A = {X = 1} and {Y = 1} are disjoint and P[A B] = P[A] P[B] does not hold. ]
Given any 0 1 data of length n. Let k be the number of ones. If p = k/n is the
mean, then the variance of the data is p(1 p).
2/3
1/3
E[X])(Y E[Y ])] is the dot product 1/3 2/3 is not zero. Interestingly enough
1/3
1/3
there are no nonconstant random variables on a probability space with three elements which
are independent.1
Proof. Here is the statisticians proof: n1 ni=1 (xi p)2 = n1 (k(1 p)2 + (n k)(0 p)2 ) =
(k 2kp + np2 )/n = p 2p + p2 = p2 p = p(1 p).
And here is the probabilists proof: since E[X 2 ] = E[X] we have Var[X] = E[X 2 ] E[X]2 =
E[X](1 E[X]) = p(1 p).
P
Parameter estimation
Parameter estimation is a central subject in statistics. We will look at it in the case of the
Binomial distribution. As you know, if we have a coin which shows heads with probability
p then the probability to have X = k heads in n coin tosses is
Independence depends on the coordinate system. Find two random variables X, Y such
that X, Y are independent but X 2Y, X + 2Y are not independent.
Assume you have a string X of n = 1000 numbers which takes the two values 0 and
P
a = 4. You compute the mean of these data p = (1/n) k X(k) and find p = 1/5. Can
you figure out the standard deviation of these data?
n
k
pk (1 p)nk .
Keep this distribution in mind. It is one of the most important distributions in probability
theory. Since this is the distribution of a sum of k random variables which are independent
Xk = 1 if kth coin is head and Xk = 0 if it is tail, we know the mean and standard deviation
of these variables E[X] = np and Var[X] = np(1 p).
1
This is true for finite probability spaces with prime || and uniform measure on it.
2
These data were obtained with IntegerDigits[P rime[100000], 2] which writes the 100000th prime p = 1299709
in binary form.
If T (x, y) = (cos()x sin()y, sin()x + cos()y) is a rotation in the plane, then the image
of T is the whole plane.
The averaging map T (x, y, z) = (x + y + z)/3 from R3 to R has as image the entire real axes
R.
The span of vectors ~v1 , . . . , ~vk in Rn is the set of all linear combinations c1~v1 +
. . . ck~vk .
1
2
3
4
5
1
2
3
4
5
1
2
3
4
5
x
1 0 0
x
1 1
2
3
4
5 5
A=
3
The kernel of T (x, y, z) = (x, y, 0) is the z-axes. Every vector (0, 0, z) is mapped to 0.
The kernel of a rotation in the plane consists only of the zero point.
The kernel of the averaging map consists of all vector (x, y, z) for which x + y + z = 0.
The kernel is a plane. In the language of random variables, the kernel of T consists of the
centered random variables.
kernel
domain
How do we compute the kernel? Just solve the linear system of equations A~x = ~0. Form
rref(A). For every column without leading 1 we can introduce a free variable si . If ~x is
P
the solution to A~xi = 0, where all sj are zero except si = 1, then ~x = j sj ~xj is a general
vector in the kernel.
image
codomain
3
6
9
2 6
0
5
1
0
1 3 0
0 1
. There are two pivot columns and one redundant column. The equation B~
x=0
0 0
0 0 0
is equivalent to the system x + 3y = 0, z = 0. After fixing z = 0, can chose y = t freely and
3
obtain from the first equation x = 3t. Therefore, the kernel consists of vectors t
1 .
0
How do we compute the image? If we are given a matrix for the transformation, then the
image is the span of the column vectors. But we do not need all of them in general.
1
0
1
0
1
0
1
0
A=
1
0
1
0
1
0
1
0
0
1
0
1
0
1
0
1
1
0
1
0
1
0
1
0
0
1
0
1
0
1
0
1
1
0
1
0
1
0
1
0
0
1
0
1
0
1
0
1
0 0
0
0
1
1 0
A=
0
0
0
1
0
4
0
1
0
3
0
1
0
2
0
6
0
1
0
3
0
0
0
1
0
4
0
0
0
1
0
0
0
0
0
1
1
Hu = 0
0
0
1
1
0
1
0
0
1
0
1
0
0
1
0
1
1
0
0
0
0
1
.
.
=
.
.
.
.
.
.
.
.
.
My =
v = My + f =
0 0 1 0 1
1 0 1 1 0
0 1 1 1 1
1
.
= . , Hv = 0
0
.
0 0 1
1 0 1
0 1 1
1
1
1
1
0
0
0
0
1
1
0
1
0
0
1
0
1
0
0
1
0
1
1
0
0
0
0
1
.
.
=
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
= . .
.
, f = . .
x1
.
.
x2
.
.
x4
.
x
.
.
x
3
.
5
x
Use P
=
to determine P e = P . = , P f = P . =
4
x6
.
x5
.
.
x7
.
x
.
.
6
x7
.
.
an error-free transmission (P u, P v) would give the right result back. Now
.
.
.
.
.
.
.
0 1 1
1 0 1
1 1 0
e and f :
.
Pu = ,
.
.
.
.
.
.
.
.
.
.
Choose a letter to get a pair of vectors (x, y). x = , y = . Use 1 + 1 = 0 in
.
.
Mx =
1
1
0
We work on the error correcting code as in the book (problem 53-54 in 3.1). Your task
is to do the encoding and decoding using the initials from your name and write in one
sentence using the terminology of image and kernel, what the essence of this error
correcting code is.
Step I) Encoding. To do so, we encode the letters of the alphabet by pairs of three
vectors containing zeros and ones:
A = (0, 0, 0, 1), (0, 0, 0, 1)
D = (0, 0, 0, 1), (0, 1, 0, 1)
G = (0, 0, 0, 1), (1, 0, 0, 1)
J = (0, 0, 0, 1), (1, 1, 0, 1)
M = (0, 0, 1, 0), (0, 0, 0, 1)
P = (0, 0, 1, 0), (0, 1, 0, 1)
S = (0, 0, 1, 0), (1, 0, 0, 1)
V = (0, 0, 1, 0), (1, 1, 0, 1)
Y = (0, 0, 1, 1), (1, 0, 0, 1)
! = (0, 0, 1, 1), (1, 0, 0, 1)
.
.
.
.
.
.
.
Find the image and kernel of the following Pascal triangle matrix:
0
1
0
1
0
1
0
1
u = Mx + e =
.
.
.
.
. In
.
Pv =
.
A set of random variables X1 , .., Xn which form a basis on a finite probability space
= {1, ..., n } describe everything. Every random variable X which we want to
compute can be expressed using these random variables.
The image and kernel of a transformation are linear spaces. This is an important example
since this is how we describe linear spaces, either as the image of a linear transformation
or the kernel of a linear transformations. Both are useful and they are somehow dual to
each other. The kernel is associated to row vectors because we are perpendicular to all
row vectors, the image is associated to column vectors because we are perpendicular to all
column vectors.
1 1 0 0
,
,
,
.
0 0 1 1
In general, one a finite probability space. A basis defines n random variables such that every
random variable X can be written as a linear combination of X1 , ..., Xn .
3
6
9
2 6
A=
3
0
5
1
0
Row reduction shows that the first and third vector span the space V . This is a basis for
V.
|
. . . ~vn
v1 , . . . , ~vn is a
is invertible if and only if ~
|
basis in Rn .
0
1
0
Two nonzero vectors in the plane form a basis if they are not parallel.
2 6 5
A=
,
,
.
3 9 1
A set B of vectors ~v1 , . . . , ~vm is called basis of a linear subspace X of Rn if they are
linear independent and if they span the space X. Linear independent means that
there are no nontrivial linear relations ai~v1 + . . . + am~vm = 0. Spanning the space
means that very vector ~v can be written as a linear combination ~v = a1~v1 +. . .+am~vm
of basis vectors.
1
echelon form is B = rref(A) = 0
0
0 0 1
kernel of A = 1 1 0 . Solution: In reduced row
1 1 1
1 0
0 1 . To determine a basis of the kernel we write
0 0
x
1
1
0
1
A set of n vectors ~v1 , ..., ~vn in Rn form a basis in Rn if and only if the matrix A
containing the vectors as column vectors is invertible.
0 1 1
0 2
1 1 1
A= 1
is invertible.
More generally, the pivot columns of an arbitrary matrix A form a basis for the image of A.
Since we represent linear spaces always as the kernel or image of a linear map, the problem
of finding a basis to a linear space is always the problem of finding a basis for the image or
finding a basis for the kernel of a matrix.
Find a basis for the image and kernel of the Chess matrix:
A=
1
0
1
0
1
0
1
0
0
1
0
1
0
1
0
1
1
0
1
0
1
0
1
0
0
1
0
1
0
1
0
1
1
0
1
0
1
0
1
0
0
1
0
1
0
1
0
1
1
0
1
0
1
0
1
0
0
1
0
1
0
1
0
1
Find a basis for the set of vectors perpendicular to the image of A, where A is the
Pascal matrix.
0 0 0 0 1 0 0 0 0
0 0 0 1 0 1 0 0 0
A= 0 0 1 0 2 0 1 0 0
.
0 1 0 3 0 3 0 1 0
1 0 4 0 6 0 4 0 1
1 3
6
9
4 1
A=
3
2
0
1
1
The matrix AT denotes the matrix where the columns and rows of A are switched. It
is called the transpose of the matrix A.
To prove the proposition, use the lemma in two ways. Because A spans and B is linearly independent, we have m n. Because B spans and A is linearly independent, we have n m.
Remember that X Rn is called a linear space if ~0 X and if X is closed under addition and
scalar multiplication. Examples are Rn , X = ker(A), X = im(A), or the row space of a matrix.
In order to describe linear spaces, we had the notion of a basis:
B = {~v1 , . . . , ~vn } X is a basis if two conditions are satisfied: B is linear independent meaning
that c1~v1 + ... + cn~vn = 0 implies c1 = . . . = cn = 0. Then B span X: ~v X then ~v =
a1~v1 + . . . + an~vn . The spanning condition for a basis assures that there are enough vectors
to represent any other vector, the linear independence condition assures that there are not too
many vectors. Every ~v X can be written uniquely as a sum ~v = a1~v1 + . . . + an~vn of basis
vectors.
The following theorem is also called the rank-nullety theorem because dim(im(A)) is the rank
and dim(ker(A))dim(ker(A)) is the nullety.
Fundamental theorem of linear algebra: Let A : Rm Rn be a linear map.
dim(ker(A)) + dim(im(A)) = m
There are n columns. dim(ker(A)) is the number of columns without leading 1, dim(im(A)) is the
number of columns with leading 1.
If A is an invertible n n matrix, then the dimension of the image is n and that the
dim(ker)(A) = 0.
0 1
0
,
,,
.
... ...
...
The dimension of {0 } is zero. The dimension of any line 1. The dimension of a plane is 2.
The dimension of the image of a matrix is the number of pivot columns. We can construct
a basis of the kernel and image of a linear transformation T (x) = Ax by forming B = rrefA.
The set of Pivot columns in A form a basis of the image of T .
A=
Given a basis A = {v1 , ..., vn } and a basis B = {w1 , ..., .wm } of X, then m = n.
Lemma: if q vectors w
~ 1 , ..., w
~ q span X and ~v1 , ..., ~vp are linearly independent in X, then q p.
Pq
3
6
9
12
15
18
21
24
27
4
8
12
16
20
24
28
32
36
5
10
15
20
25
30
35
40
45
6
12
18
24
30
36
42
48
54
7
14
21
28
35
42
49
56
63
8
16
24
32
40
48
56
64
72
9
18
27
36
45
54
63
72
81
Are there a 4 4 matrices A, B of ranks 3 and 1 such that ran(AB) = 0? Solution. Yes,
we can even find examples which are diagonal.
Is there 4 4 matrices A, B of rank 3 and 1 such that ran(AB) = 2? Solution. No, the
kernel of B is three dimensional by the rank-nullety theorem. But this also means the kernel
of AB is three dimensional (the same vectors are annihilated). But this implies that the
rank of AB can maximally be 1.
The rank of AB can not be larger than the rank of A or the rank of B.
The nullety of AB can not be smaller than the nullety of B.
aij w
~ j = ~vi . Now do Gauss
a
v1T
11 . . . a1q | ~
Jordan elimination of the augmented (p (q + n))-matrix to this system: . . . . . . . . . | . . . ,
ap1 . . . apq | ~
vpT
h
i
where ~viT is the vector ~vi written as a row vector. Each row of A of this A|b contains some
j=1
2
4
6
8
10
12
14
16
18
The dimension of the kernel of a matrix is the number of free variables. It is also called
nullety. A basis for the kernel is obtained by solving Bx = 0 and introducing free variables
for the redundant columns.
1
2
3
4
5
6
7
8
9
~ 1T + ... + bq w
~ qT
nonzero entry. We end up with a matrix, which contains a last row 0 ... 0 | b1 w
showing that b1 w
~ 1T + + bq w
~ qT = 0. Not all bj are zero because we had to eliminate some nonzero
the internet when searching for fractals. It assumes that X is a bounded set. You can pick up
this definition also in the Startreck movie (2009) when little Spock gets some math and ethics
lectures in school. It is the simplest definition and also called box counting dimension in the math
literature on earth.
Assume we can cover X with n = n(r) cubes of size r and not less. The fractal
dimension is defined as the limit
dim(X) =
log(n)
log(r)
as r 0.
For linear spaces X, the fractal dimension of X intersected with the unit cube agrees
with the usual dimension in linear algebra.
Proof. Take a basis B = {v1 , . . . , vm } in X. We can assume that this basis vectors are all
orthogonal and each vector has length 1. For given r > 0, place cubes at the lattice points
Pm
m
j with integer kj . This covers the intersection X with the unit cube with (C/r ) cubes
j=1 kj rv
We cover the unit interval [0, 1] with n = 1/r intervals of length r. Now,
dim(X) =
10
log(1/r)
=1.
log(r)
log(1/r 2 )
=2.
dim(X) =
log(r)
11
12
The Cantor set is obtained recursively by dividing intervals into 3 pieces and throwing
away the middle one. We can cover the Cantor set with n = 2k intervals of length r = 1/3k
so that
log(2k )
= log(2)/ log(3) .
dim(X) =
log(1/3k )
The Shirpinski carpet is constructed recursively by dividing a square in 9 equal squares
and throwing away the middle one, repeating this procedure with each of the squares etc.
At the kth step, we need n = 8k squares of length r = 1/3k to cover X. The dimension is
dim(X) =
log(8k )
= log(8)/ log(3) .
log(1/3k )
This is smaller than 2 = log(9)/ log(3) but larger than 1 = log(3)/ log(3).
x1 + 2x2 + 3x3 x4 + x5 = 0 .
b) Find a basis for the space spanned by the rows of the matrix
1 2 3 4 5
4 5 6
.
3 4 4 6 7
A= 2 3
Find a clever basis for the reflection of a light ray at the line x +"2y =
# 0. Solution:
"
# Use
1
2
one vector in the line and an other one perpendicular to it: ~v1 =
, ~v2 =
. We
2
1
"
#
"
#
1 0
1 2
achieved so B =
= S 1 AA with S =
.
0 1
2 1
... |
. . . ~vn
. It is called
... |
By definition, the matrix S is invertible: the linear independence of the column vectors implies S
has no kernel. By the rank-nullety theorem, the image is the entire space Rn .
c1
x1
S
~v
w
~ = [~v ]B
A
B
We write [~x]B =
x=
x = S([~x]B ).
. . . . If ~
. . . , we have ~
cn
xn
A~v S
The B-coordinates of ~x are obtained by applying S 1 to the coordinates of the standard basis:
[~x]B = S 1 (~x)
This just rephrases that S[~x]B = ~x. Remember the column picture. The left hand side is just
c1~v1 + + cn~vn where the vj are the column vectors of S.
3
{e1 , e2 , e3 } we have ~x = 4e1 2e2 + 3e3 .
If ~v1 =
"
1
2
and ~v2 =
"
3
, then S =
5
S 1~v =
"
"
1 3
. A vector ~v =
2 5
5 3
2 1
#"
6
9
"
3
3
"
6
9
"
1 1
0 1
"
2
3
and S 1
0
1
2
the basis ~v1 = 1 ~v2 = 2 ~v3 = 3 . Because T (~v1 ) = ~v1 = [~e1 ]B , T (~v2 ) = ~v2 =
2
3
0
1 0 0
"
"
1
1
with respect to the basis B = {~v1 =
, ~v2 =
}.
0
1
"
#
"
#
1 1
1
=
. Therefore [v]B = S 1~v =
. Indeed
0 1
3
1~v1 + 3~v2 = ~v .
If B = {v1 , . . . , vn } is a basis in Rn and T is a linear transformation on Rn , then the
B-matrix of T is
|
...
|
Bw
~
d
b
1
3
"
a b
c d
and let S =
"
0 1
, What is S 1 AS? Solution:
1 0
c
. Both the rows and columns have switched. This example shows that the matrices
a# "
#
4 3
2
are similar.
,
2 1
4
Find the B-matrix B of the linear transformation which is given in standard coordinates
as
"
#
"
#"
#
x
1 1
x
T
=
y
1 1
y
"
2
if B = {
1
# "
1
}.
2
2
2
the plane V by using a suitable coordinate system.
a) Find a basis which describes best the points in the following lattice: We aim to
describe the lattice points with integer coordinates (k, l).
b) Once you find the basis, draw all the points which have (x, y) coordinates in the disc
x2 + y 2 10
1
2
"
1
2
and
"
6
3
are orthogonal in R2 .
Proof: (~x + ~y ) (~x + ~y ) = ||~x||2 + ||~y||2 + 2~x ~y ||~x||2 + ||~y ||2 + 2||~x||||~y|| = (||~x|| + ||~y||)2 .
~v and w
~ are both orthogonal to the cross product ~v w
~ in R3 . The dot product between ~v
and ~v w
~ is the determinant
v1 v2 v3
det( v1 v2 v3 ) .
w1 w2 w3
~v is called a unit vector if its length is one: ||~v|| =
~v ~v = 1.
A set of vectors B = {~v1 , . . . , ~vn } is called orthogonal if they are pairwise orthogonal. They are called orthonormal if they are also unit vectors. A basis is called
an orthonormal basis if it is a basis which is orthonormal. For an orthonormal
basis, the matrix with entries Aij = ~vi ~vj is the unit matrix.
"
3 #
"
#
1 2
1 3
is the matrix
.
3 4
2 4
Ali Bik =
T T
Bki
Ail = (B T AT )kl .
1 2 3 .
Express the fact that ~x is in the kernel of a matrix A using orthogonality. Answer A~x = 0
means that w
~ k ~x = 0 for every row vector w
~ k of Rn . Therefore, the orthogonal complement
of the row space is the kernel of a matrix.
(AB)Tkl = (AB)lk =
To check this, take two vectors in the orthogonal complement. They satisfy ~v w
~ 1 = 0, ~v w
~2 =
0. Therefore, also ~v (w
~1 + w
~ 2 ) = 0.
Proof: The dot product of a linear relation a1~v1 + . . . + an~vn = 0 with ~vk gives ak~vk
~vk = ak ||~vk ||2 = 0 so that ak = 0. If we have n linear independent vectors in Rn , they
automatically span the space because the fundamental theorem of linear algebra shows that
the image has then dimension n.
A vector w
~ Rn is called orthogonal to a linear space V , if w
~ is orthogonal to
every vector ~v V . The orthogonal complement of a linear space V is the set
W of all vectors which are orthogonal to V .
~
x~
y
||~
x||||~
y||
~
y
cos() = ||~x~x||||~
[1, 1] is the statistical correlation of ~x and ~y if the vectors ~x, ~y
y||
represent data of zero mean.
A rotation is orthogonal.
A reflection is orthogonal.
During one of Thales (-624 BC to (-548 BC)) journeys to Egypt, he used a geometrical
trick to measure the height of the great pyramid. He measured the size of the shadow
of the pyramid. Using a stick, he found the relation between the length of the stick and
the length of its shadow. The same length ratio applies to the pyramid (orthogonal
triangles). Thales found also that triangles inscribed into a circle and having as the
base as the diameter must have a right angle.
The Pythagoreans (-572 until -507) were interested in the discovery that the squares of
a lengths of a triangle with two orthogonal sides would add up as a2 + b2 =
c2 . They
2. This
were puzzled in assigning a length
to
the
diagonal
of
the
unit
square,
which
is
number is irrational because 2 = p/q would imply that q 2 = 2p2 . While the prime
factorization of q 2 contains an even power of 2, the prime factorization of 2p2 contains
an odd power of 2.
a) Verify that if A, B are orthogonal matrices then their product A.B and B.A are
orthogonal matrices.
b) Verify that if A, B are orthogonal matrices, then their inverse is an orthogonal
matrix.
c) Verify that 1n is an orthogonal matrix.
These properties show that the space of n n orthogonal matrices form a group. It
is called O(n).
Eratosthenes (-274 until 194) realized that while the sun rays were orthogonal to the
ground in the town of Scene, this did no more do so at the town of Alexandria, where
they would hit the ground at 7.2 degrees). Because the distance was about 500 miles
and 7.2 is 1/50 of 360 degrees, he measured the circumference of the earth as 25000
miles - pretty close to the actual value 24874 miles.
Closely related to orthogonality is parallelism. Mathematicians tried for ages to
prove Euclids parallel axiom using other postulates of Euclid (-325 until -265). These
attempts had to fail because there are geometries in which parallel lines always meet
(like on the sphere) or geometries, where parallel lines never meet (the Poincare half
plane). Also these geometries can be studied using linear algebra. The geometry on the
sphere with rotations, the geometry on the half plane uses Mobius transformations,
2 2 matrices with determinant one.
The question whether the angles of a right triangle do always add up to 180 degrees became an issue when geometries where discovered, in which the measurement depends on
the position in space. Riemannian geometry, founded 150 years ago, is the foundation
of general relativity, a theory which describes gravity geometrically: the presence of
mass bends space-time, where the dot product can depend on space. Orthogonality
becomes relative. On a sphere for example, the three angles of a triangle are bigger
than 180+ . Space is curved.
In probability theory, the notion of independence or decorrelation is used. For
example, when throwing a dice, the number shown by the first dice is independent and
decorrelated from the number shown by the second dice. Decorrelation is identical to
orthogonality, when vectors are associated to the random variables. The correlation
coefficient between two vectors ~v , w
~ is defined as ~v w/(|~
~ v|w|).
~ It is the cosine of the
angle between these vectors.
Let v, w be two vectors in three dimensional space which both have length 1 and are perpendicular to each other. Now
P x = (v x)~v + (w x)w
~.
vectors. For example, if v = 1 / 3 and w = 1 / 6, then
1
2
#
1/3 1/6 "
1/2 1/2 0
1/3 1/3 1/ 3
T
= 1/2 1/2 0 .
P = AA = 1/ 3 1/6
1/ 6 1/ 6 2/ 6
0
0 0
1/ 3 2/ 6
(P x x) P x = P x P x x P x = P x x x P x = P x x x P x = 0 .
2
Remember that a vector is in the kernel of AT if and only if it is orthogonal to the rows of AT
and so to the columns of A. The kernel of AT is therefore the orthogonal complement of im(A)
for any matrix A:
For an orthogonal projection P there is a basis in which the matrix is diagonal and
contains only 0 and 1.
Proof. Chose a basis B of the kernel of P and a basis B of V , the image of P . Since for every
~v B1 , we have P v = 0 and for every ~v B2 , we have P v = v, the matrix of P in the basis
B1 B2 is diagonal.
If V is the image of a matrix A with trivial kernel, then the projection P onto V is
1 1 1
1 1
/3
1 1 1
A=
1
P x = A(AT A)1 AT x .
The matrix
Proof. is clear. On the other hand AT Av = 0 means that Av is in the kernel of AT . But since
the image of A is orthogonal to the kernel of AT , we have A~v = 0, which means ~v is in the kernel
of A.
The matrix
1 0 0
1 0
0 0 0
A=
0
Proof. Let y be the vector on V which is closest to Ax. Since y Ax is perpendicular to the
image of A, it must be in the kernel of AT . This means AT (y Ax) = 0. Now solve for x to get
the least square solution
x = (AT A)1 AT y .
The projection is Ax = A(AT A)1 AT y.
5
If V is a line containing the unit vector ~v then P x = v(v x), where is the dot product.
Writing this as a matrix product shows P x = AAT x where A is the n 1 matrix which
contains ~v as the column. If v is not a unit vector, we know from multivariable calculus that
P x = v(v x)/|v|2 . Since |v|2 = AT A we have P x = A(AT A)1 AT x.
How do we construct the matrix of an orthogonal projection? Lets look at an other example
1 0
0
. The orthogonal projection onto V = im(A) is ~b 7 A(AT A)1 AT ~b. We
0 1
"
#
1/5 2/5 0
5 0
T
T
1 T
have A A =
and A(A A) A = 2/5 4/5 0 .
2 1
0
0 1
0
2/5
~
For example, the projection of b = 1 is ~x = 4/5 and the distance to ~b is 1/ 5.
0
0
The point ~x is the point on V which is closest to ~b.
Let A = 2
Let A = . Problem: find the matrix of the orthogonal projection onto the image of A.
0
1
The image of A is a one-dimensional line spanned by the vector ~v = (1, 2, 0, 1). We calculate
AT A = 6. Then
A(AT A)1 AT =
1
2
0
1
2 0 1 /6 =
1
2
0
1
2
4
0
2
0
0
0
0
1
2
0
1
/6 .
ker(A )= im(A)
T
A (b-Ax)=0
V=im(A)
0 1
,
.
1 0
T -1 T
*
Ax=A(A
A) A b
0 0 1
B = { 0 , 1 , 0
}.
0 1 0
Ax *
Let A be a matrix with trivial kernel. Define the matrix P = A(AT A)1 AT .
a) Verify that we have P T = P .
b) Verify that we have P 2 = P .
For this problem, just use the basis properties of matrix algebra like (AB)T = B T AT .
x y
-1 1
1 2
2 -1
Given a system of linear equations Ax = b, the point x = (AT A)1 AT b is called the
least square solution of the system.
If A has no kernel, then the least square solution exists.
Proof. We know that if A has no kernel then the square matrix AT A has no kernel and is therefore
invertible.
x y
-1 8
0 8
1 4
2 16
If x is the least square solution of Ax = b then Ax is the closest point on the image of A to b. The
least square solution is the best solution of Ax = b we can find. Since P x = Ax, it is the closest
point to b on V . Our knowledge about kernel and the image of linear transformations helped us
to derive this.
Finding the best polynomial which passes through a set of points is a data fitting problem.
If we wanted to accommodate all data, the degree of the polynomial would become too
large. The fit would look too wiggly. Taking a smaller degree polynomial will not only be
more convenient but also give a better picture. Especially important is regression, the
fitting of data with linear polynomials.
4
The above pictures show 30 data points which are fitted best with polynomials of degree 1,
6, 11 and 16. The first linear fit maybe tells most about the trend of the data.
The simplest fitting problem is fitting by lines. This is called linear regression. Find the
best line y = ax + b which fits the data
Find the function y = f (x) = a cos(x) + b sin(x), which best fits the data
x y
0 1
1/2 3
1 7
Solution: We have to find the least square solution to the system of equations
1a + 0b = 1
0a + 1b = 3
1a + 0b = 7
1 0
1
A = 0 1 , ~b = 3 .
1 0
7
Now AT~b =
"
"
6
3
and AT A =
"
2 0
0 1
-2
-2
"
1/2 0
0 1
3
. The best fit is the function f (x) = 3 cos(x) + 3 sin(x) .
3
=
=
=
=
Find the function y = f (x) = ax2 + bx3 , which best fits the data
x y
-1 1
1 3
0 10
In other words, find the least square solution for the system of equations for the unknowns
a, b which aims to have all 4 data points (xi , yi ) on the circle. To get system of linear
equations Ax = b, plug in the data
lla + b
ab
2a
2a + 2b
Find the circle a(x2 + y 2 ) + b(x + y) = 1 which best fits the data
x y
0 1
-1 0
1 -1
1 1
-1
(x1 , y1)
(x2 , y2)
(x3 , y3)
(x4 , y4)
1
1
1
1.
=
=
=
=
(1, 2)
(1, 0)
(2, 1)
(0, 1)
1
1
0
2 2
A=
2
, b =
.
We get the least square solution with the usual formula. First compute
(AT A)1 =
"
and then
AT b =
3 2
2 5
"
6
2
/22
3
,
We have AT A =
Last time, we saw how the geometric formula P = A(A A) A for the projection on the image
of a matrix A allows us to fit data. Given a fitting problem, we write it as a system of linear
equations
Ax = b .
1
"
2 1
1 2
and AT b =
"
7
. We get the least square solution with the
5
formula
x = (AT A)1 AT b =
While this system is not solvable in general, we can look for the point on the image of A which
is closest to b. This is the best possible choice of a solution and called the least square
solution:
"
3
1
The best fit is the function f (x, y) = 3x2 + y 2 which produces an elliptic paraboloid.
The vector x = (AT A)1 AT b is the least square solution of the system Ax = b.
The most popular example of a data fitting problem is linear regression. Here we have data
points (xi , yi ) and want to find the best line y = ax + b which fits these data. But data fitting
can be done with any finite set of functions. Data fitting can be done in higher dimensions too.
We can for example look for the best surface fit through a given set of points (xi , yi , zi ) in space.
Also here, we find the least square solution of the corresponding system Ax = b which is obtained
by assuming all points to be on the surface.
0 1
2
0
, ~b = 4 .
1 1
3
A= 1
x y z
0 1 2
-1 0 4
1 -1 3
endowment in billions
5
7
18
25
27
We solved this example in class with linear regression. We saw that the best fit. With a
quadratic fit, we get the system A~x = ~b with
1 1 1
2 4
3 9
4 16
1 5 25
A=
1
~b = 18
.
25
27
Solution: We have to find the least square solution to the system of equations
a0 + b 1 = 2
a1 + b 0 = 4
a1 + b 1 = 3 .
a
21/5
The solution vector ~x = b = 277/35 which indicates strong linear growth but some
c
2/7
slow down.
Here is a problem on data analysis from a website. We collect some data from users but not
everybody fills in all the data
Person
Person
Person
Person
1 3
2 4
3 4 1
5 - 4 2
- -
3 9 8 - 5
5 7 - - -
- 6 2
1 9
- -
2 9
- 9
8 - -
It is difficult to do statistic with this. One possibility is to filter out all data from people who
do not fulfill a minimal requirement. Person 4 for example did not do the survey seriously
enough. We would throw this data away. Now, one could sort the data according to some
important row. Arter tha one could fit the data with a function f (x, y) of two variables.
This function could be used to fill in the missing data. After that, we would go and seek
correlations between different rows.
Whenever doing datareduction like this, one must always compare different scenarios
and investigate how much the outcome changes when changing the data.
The left picture shows a linear fit of the above data. The second picture shows a fit with
cubic functions.
The first 6 prime numbers 2, 3, 5, 7, 11 define the data points (1, 2), (2, 3), (3, 5), (5, 7), (6, 11)
in the plane. Find the best parabola of the form y = ax2 + c which fits these data.
Linear algebra
x1
P[B|A]P[A]
formula P[A|B] = P[B|A]+P[B|A
c] .
P[B|A
i ]P[Ai ]
rule P[Ai |B] = Pn P[B|A
.
j ]P[Aj ]
j=1
Permutations n! = n (n 1) (n 2) 2 1 possibilities.
a b
Rotation-Dilation A =
. Scale by a2 + b2 , rotate by arctan(b/a).
b a
"
#
a b
Reflection-Dilation A =
. Scale by a2 + b2 , reflect at line w, slope b/a.
b a
"
#
"
#
1 a
1 0
Horizontal and vertical shear ~x 7 A~x, A =
, ~x 7 A~x, A =
.
0 1
b 1
"
#
cos(2) sin(2)
Reflection about line ~x 7 A~x, A =
.
sin(2) cos(2)
#
"
u1 u1 u1 u2
.
Projection onto line containing unit vector u: A =
u2 u1 u2 u2
A random variable is a function from a probability space to the real line R. There are two
important classes of random variables:
1) For discrete random variables, the random variable X takes a discrete set of values. This
means that the random variable takes values xk and the probabilities P[X = xk ] = pk add up to
1.
2) For continuous random variables, there is a probability density function f (x) such that
R
R
P[X [a, b]] = ab f (x) dx and
f (x) dx = 1.
0.05
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Discrete distributions
1
is called the Poisson distribution. It is used to describe the number of radioactive decays
in a sample, the number of newborns with a certain defect. You show in the homework that
the mean and standard deviation is Poisson distribution is the most important distribution
on {0, 1, 2, 3, . . .}. It is a limiting case of Binomial distributions
We throw a dice and assume that each side appears with the same probability. The random
variable X which gives the number of eyes satisfies
P[X = k ] =
1
.
6
This is a discrete distribution called the uniform distribution on the set {1, 2, 3, 4, 5, 6 }.
0.4
k e
k!
An epidemiology example from Cliffs notes: the UHS sees X=10 pneumonia cases each
winter. Assuming independence and unchanged conditions, what is the probability of there
being 20 cases of pneumonia this winter? We use the Poisson distribution with = 10 to
see P[X = 20] = 1020 e10 /20! = 0.0018.
The Poisson distribution is the n limit of the binomial distribution if we chose
for each n the probability p such that = np is fixed.
0.3
P[X = k] =
pk (1 p)nk
n!
( )k (1 )nk
k!(n k)! n
n
n!
k
= ( k
)( )(1 )n (1 )k .
n (n k)! k!
n
n
=
0.1
n
k
Throw n coins for which head appears with probability p. Let X denote the number of
heads. This random variable takes values 0, 1, 2, . . . , n and
P[X = k ] =
This is the Binomial distribution we know.
n
k
pk (1 p)nk .
This is a product of four factors. The first factor n(n 1)...(n k + 1)/nk converges to 1.
Leave the second factor ak /k! as it is. The third factor converges to e by the definition
of the exponential. The last factor (1 n )k converges to 1 for n since k is kept fixed.
We see that P[X = k] k e /k!.
Continuous distributions
A random variable is called a uniform distribution if P[X [a, b]] = (b a) for all
0 a < b b. We can realize this random variable on the probability space = [a, b] with
the function X(x) = x, where P[I] is the length of an interval I. The uniform distribution
is the most natural distribution on a finite interval.
The random variable X on [0, 1] where P[[a, b]] = b a is given by X(x) = x. We have
We have f (x) = 2x because ab f (x) dx = x2 |ba = b2 a2 . The function f (x) is the probability
density function of the random variable.
R
The distribution on the positive real axis with the density function
f (x) =
(log(x)m)2
1
2
e
2x2
is called the log normal distribution with mean m. Examples of quantities which have
log normal distribution is the size of a living tissue like like length or height of a population
or the size of cities. An other example is the blood pressure of adult humans. A quantity
which has a log normal distribution is a quantity which has a logarithm which is normally
distributed.
A random variable with normal distribution with mean 1 and standard deviation 1 has the
probability density
1
2
f (x) = ex /2 .
2
This is an example of a continuous distribution. The normal distribution is the most natural
distribution on the real line.
2
(1 + x2 )2 .
k=0
P[X = k] = 1.
The most important distribution on the positive real line is the exponential distribution
f (x) = ex .
1
.
R 2
2
From 0 x exp(x) dx = 2/ , we get the variance
m=
In this lecture we look at more probability distributions and prove the fantastically useful Chebychevs theorem.
P[X [a, b] ] =
xf (x) dx =
f (x) dx
3
By definition F is a monotone function: F (b) F (a) for b a. One abbreviates the probability
density function with P DF and the distribution function with CDF which abbreviates cumulative
distribution function.
The most important distribution on the real line is the normal distribution
f (x) =
(xm)2
1
e 22 .
2
2
The most important distribution on a finite interval [a, b] is the uniform distribution
f (x) = 1[a,b]
1
,
ba
1 xI
.
0 x
/I
The following theorem is very important for estimation purposes. Despite the simplicity of
its proof it has a lot of applications:
Chebychev theorem If X is a random variable with finite variance, then
P[|X E[X]| c]
Var[X]
.
c2
Proof. The random variable Y = X E[X] has zero mean and the same variance. We need
]
only to show P[|Y | c] Var[Y
. Taking the expectation of the inequality
c2
c2 1{|Y |c} Y 2
1.0
gives
0.30
0.8
0.25
0.20
0.6
2000
The theorem also gives more meaning to the notion Variance as a measure for the deviation
from the mean. The following example is similar to the one section 11.6 of Cliffs notes:
4000
6000
8000
10 000
0.15
0.4
0.10
-2
0.2
0.05
A die is rolled 144 times. What is the probability to see 50 or more times the number 6
shows up? Let X be the random variable which counts the number of times, the number 6
appears. This random variable has a binomial distribution with p = 1/6 and n = 144. It has
the expectation E[X] = np = 144/6 = 24 and the variance Var[X] = np(1 p) = 20. Setting
c = (50 24) = 26 in Chebychev, we get P[|X 24| 26] !20/262 0.0296.... The chance
P
144
is smaller than 3 percent. The actual value 144
pk (1 p)144k 1.17 107 is
k=50
k
much smaller. Chebychev does not necessarily give good estimates, but it is a handy and
univeral rule of thumb.
-4
-10
Random numbers
10
10
The CDF
Finally, lets look at a practical application of the use of the cumulative distribution function.
It is the task to generate random variables with a given distribution:
The random variable X has a normal distribution with standard deviation 2 and mean 5.
Estimate the probability that |X 5| > 3.
Estimate the probability of the event X > 10 for a Poisson distributed random variable X
with mean 4.
Assume we want to generate random variables X with a given distribution function F . Then
Y = F (X) has the uniform distribution on [0, 1]. We can reverse this. If we want to produce random variables with a distribution function F , just take a random variable Y with
uniform distribution on [0, 1] and define X = F 1 (Y ). This random variable has the distribution function F because {X [a, b] } = {F 1 (Y ) [a, b] } = {Y F ([a, b]) } = {Y
[F (a), F (b)]} = F (b) F (a). We see that we need only to have a random number generator
which produces uniformly distributed random variables in [0, 1] to get a random number
generator for a given continuous distribution. A computer scientist implementing random
processes on the computer only needs to have access to a random number generator producing uniformly distributed random numbers. The later are provided in any programming
language which deserves this name.
a) Verify that (x) = tan(x) maps the interval [0, 1] onto the real line so that its inverse
F (y) = arctan(y)/ is a map from R to [0, 1].
1
b) Show that f = F (y) = 1 1+y
2.
c) Assume we have random numbers in [0, 1] handy and want to random variables which
have the probability
density f . How do we achieve this?
R
d) The mean
xf (x) dx does not exist as an indefinite integral but can be assigned the
RR
value 0 by taking
the limit R
xf (x) dx = 0 for R . Is it possible to assign a value to
R 2
the variance x f (x) dx?.
The probability distribution with density
1 1
1 + y2
-5
The PDF
X:=Random [ N o r m a l D i s t r i b u t i o n [ 0 , 1 ] ]
L i s t P l o t [ Table [ X, { 1 0 0 0 0 } ] , PlotRange>A l l ]
Why is the Cauchy distribution natural? As one can deduce from the homework, if you chose
a random point P on the unit circle, then the slope of the line OP has a Cauchy distribution.
Instead of the circle, we can take a rotationally
symmetric probability distribution like the
R R x2 y2
Gaussian with probability measure P [A] =
/ dxdy on the plane. Random
Ae
pointscan be written as (X, Y ) where both X, Y have the normal distribution with density
2
ex / . We have just shown
f=PDF[ C a u c h y D i s t r i b u t i o n [ 0 , 1 ] ] ;
S=P l o t [ f [ x ] , { x , 10 ,10} , PlotRange>Al l , F i l l i n g >Axis ]
f=CDF[ C h i S q u a r e D i s t r i b u t i o n [ 1 ] ] ;
S=P l o t [ f [ x ] , { x , 0 , 1 0 } , PlotRange>Al l , F i l l i n g >Bottom ]
If we take independent Gaussian random variables X, Y of zero mean and with the
same variance and form Z = X/Y , then the random variable Z has the Cauchy
distribution.
Now, it becomes clear why the distribution appears so often. Comparing quantities is often
done by looking at their ratio X/Y . Since the normal distribution is so prevalent, there is
no surprise, that the Cauchy distribution also appears so often in applications.
The 2 2 case
"
a b
c d
a
c
f
+ d
i
g
b
e
h
b
e
h
a
c
f + d
i
g
b
e
h
c
a
f d
g
i
b
e
h
c
a
f
d
i
g
b
e
h
a
c
f d
g
i
b
e
h
2 0 4
1 1
) = 2 4 = 2.
1 0 1
det( 1
d b
c a
ad bc
"
5 4
det(
) = 5 1 4 2 = 3.
2 1
We also see already that the determinant changes sign if we flip rows, that the determinant is
linear in each of the rows.
We can write the formula as a sum over all permutations of 1, 2. The first permutation = (1, 2)
gives the sum A1,(1) A2,pi(2) = A1,1 A2,2 = ad and the second permutation = (2, 1) gives the
sum A1,(1) A2,(2) = A1,2 A2,1 = bc. The second permutation has || upcrossing and the sign
P
(1)|| = 1. We can write the above formula as (1)|| A1(1) A2(2) .
"
The 3 3 case
a
c
b
d
"
a
c
b
d
P =
1
1
1
1
1
1
f
i
a b c
e f
g h i
A= d
det(A) = det(
0
0
0
0
0
13
0
0
5
0
0
0
0
0
0
0
11
0
0
3
0
0
0
0
2
0
0
0
0
0
0
0
7
0
0
) = 2 3 5 7 11 13 .
Answer: 60.
A=
3
0
0
0
0
0
2
4
0
0
0
0
3
2
5
0
0
0
0
0
0
0
0
1
0
0
0
1
0
0
0
0
0
0
1
0
0
0
7
0
1
0
0
0
1
0
0
0
Laplace expansion
1
This Laplace expansion is a convenient way to sum over all permutations. We group the permutations by taking first all the ones where the first entry is 1, then the one where the first entry is 2
etc. In that case we have a permutation of (n 1) elements. the sum over these entries produces
a determinant of a smaller matrix.
For each entry aj1 in the first column form the (n 1) (n 1) matrix Bj1 which does not contain
the first and jth row. The determinant of Bj1 is called a minor.
Laplace expansion det(A) = (1)1+1 A11 det(Bi1 ) + + (1)1+n An1 det(Bn1 )
0
8
0
3
0
0
0
0
0
0
0
2
7
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
5
0
0
0
1
0
0
0
0
0 7 0 0 0
0
0 0 0 1
4+1
0 1 0 0
3det
det(A)
+ (1)
0
0 0 5 0
0
2
2 0 0 0 0
= 8(2 7 1 5 1) + 3(2 7 0 5 1) = 560
2+1
= (1) 8det
0
The answer is 1.
0
0
0
0
0
1
0
0
0
1
0
0
0
0
0
0
0
1
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
1
0
1
0
0
0
0
0
0
0
0
0
1
0
0
7
0
0
0
0
0
0
0
0
0
0
0
0
5
0
0
0
1
0
0
0
4
3
0
0
6
0
0
4
1
4
0
Give
the#reason in terms of permutations why the determinant of a partitioned matrix
"
A 0
is the product det(A)det(B).
0 B
3 4 0 0
1 2 0 0
Example det(
) = 2 12 = 24.
0 0 4 2
0 0 2 2
3
0
0
0
0
0
0
0
1
2
0
0
A=
0
0
0
0
8
0
0
0
0
0
0
0
8
8
8
0
0
0
0
0
8
8
8
8
8
0
0
0
8
8
8
8
8
8
8
0
8
8
8
8
8
8
8
8
8
0
8
8
8
8
8
8
8
0
0
0
8
8
8
8
8
0
0
0
0
0
8
8
8
0
0
0
0
0
0
0
8
0
0
0
0
Hint. Do not compute too much. Investigate what happens with the determinant if you
switch two rows.
Proof. We have to show that the number of upcrossing changes by an odd number. Lets count
the number of upcrossings before and after the switch. Assume row a and c are switched. We look
at one pattern and assume that (a,b) be an entry on row a and (c,d) is an entry on row b. The
entry (a,b) changes the number of upcrossings to (c,d) by 1 (there is one upcrossing from (a,b) to
(c,d) before which is absent after).
For each entry (x,y) inside the rectangle (a,c) x (b,d), the number of upcrossings from and to
(x,y) changes by two. (there are two upcrossings to and from the orange squares before which are
absent after). For each entry outside the rectangle and different from (a,b),(c,d), the number of
upcrossings does not change.
...
det(
v1
...
A11
...
det(
(v + w)1
...
An1
...
det(
v1
...
. . . A1n
... ...
. . . vn
) + det(
... ...
. . . Ann
A12
...
(v + w)2
...
An2
A13
...
(v + w)3
...
An3
...
...
...
...
...
A11
...
v1
...
An1
. . . A1n
... ...
. . . vn
)
=
det(
... ...
. . . Ann
. . . A1n
... ...
. . . wn
)
... ...
. . . Ann
...
...
...
...
...
A1n
...
vn
...
Ann
det(
. . . A1n
... ...
. . . vn
. . . . . . ) = det(
. . . wn
... ...
. . . Ann
If c1 , ..., ck are the row reduction scale factors and m is the number of row swaps
during row reduction, then
) .
Row reduction
We immediately get from the above properties what happens if we do row reduction. Subtracting
a row from an other row does not change the determinant since by linearity we subtract the
determinant of a matrix with two equal rows. Swapping two rows changes the sign and scaling a
row scales the determinant.
) .
A13
...
v3
...
An3
It follows that if two rows are the same, then the determinant is zero.
det(A) =
(1)m
det(rref (A)) .
c1 ck
Since row reduction is fast, we can compute the determinant of a 20 20 matrix in a jiffy. It takes
about 400 operations and thats nothing for a computer.
. . . A1n
... ...
. . . wn
... ...
) .
. . . vn
... ...
. . . Ann
4 1
2
1
4 1
Row reduce.
1
2
6
1
1
2
3
7
0 2
2
0
0 0
det
0
5
3
0
3
6
4
4
2
As the name tells, determinants determine a lot about matrices. We can see from the
determinant whether the matrix is invertible.
An other reason is that determinants allow explicit formulas for the inverse of a matrix. We
might look at this next time. Next week we will see that determinants allow to define the
characteristic polynomial of a matrix whose roots are the important eigenvalues. In analysis,
the determinant appears in change of variable formulas:
We could use the Laplace expansion or see that there is only one pattern. The simplest way
however is to swap two rows to get an upper triangular matrix
1 2
2
0
0 0
det
0
3
5
3
0
4
6
2
4
= 24 .
A=
0
0
0
0
0
1
1
0
0
0
0
0
0
1
0
0
0
0
0
0
1
0
0
0
0
0
0
1
0
0
0
0
0
0
1
0
is 1 because swapping the last row to the first row gives the identity matrix. Alternatively
we could see that this permutation matrix has 5 upcrossings and that the determinant is
1.
Proof. Every upcrossing is a pair of entries Aij , Akl where k > i, l > j. If we look at the
transpose, this pair of entries appears again as an upcrossing. So, every summand in the
permutation definition of the determinant appears with the same sign also in the determinant
of the transpose.
3
3
3
3
3
3
2
3
3
3
3
3
0
2
3
3
3
3
0
0
2
3
3
3
0
0
0
2
3
3
0
0
0
0
2
3
3
1
1
0
0
0
1
0
0
0
0
0
1
1
3
0
0
0
2
2
2
4
1
1
2
2
2
1
4
1
2
2
2
0
0
4
det(A B) = det(A)det(B)
det(AT ) = det(A)
f (y) |det(Du1(y)|dy .
Product of matrices
Proof. One can bring the n n matrix [A|AB] into row reduced echelon form. Similar than the augmented matrix [A|b] was brought into the form [1|A1 b], we end up with
[1|A1 AB] = [1|B]. By looking at the n n matrix to the left during the Gauss-Jordan
elimination process, the determinant has changed by a factor det(A). We end up with a
matrix B which has determinant det(B). Therefore, det(AB) = det(A)det(B).
u(S)
One of the main reasons why determinants are interesting is because of the following property
Physicists are excited about determinants because summation over all possible paths is
used as a quantization method. The Feynmann path integral is a summation over a
suitable class of paths and leads to quantum mechanics. The relation with determinants
comes because each summand in a determinant can be interpreted as a contribution of a
path in a finite graph with n nodes.
The determinant of
f (x) dx =
Cramers rule
Geometry
Cramers rule: If Ai is the matrix, where the column ~vi of A is replaced by ~b, then
xi =
det(Ai )
det(A)
"
1
2
A=
"
1 1
2 1
and
"
1
1
is the determinant of
which is 4.
1 1 1
0
.
1 0 0
Solution: The determinant is 1. The volume is 1. The fact that the determinant is
negative reflects the fact that these three vectors form a left handed coordinate system.
"
1 1 1
det(A A) = det(
1 1 0
T
"
#
1 1
3 2
1
)=2.
) = det(
2 2
1 0
The area is therefore 2. If you have seen multivariable calculus, you could also have
computed
the area using the cross product. The area is the length of the cross product
xi vi , . . . , vn ] =
Gabriel Cramer was born in 1704 in Geneva. He worked on geometry and analysis until
his death at in 1752 during a trip to France. Cramer used the rule named after him in a
book Introduction `a lanalyse des lignes courbes algebraique, where he used the method
to solve systems of equations with 5 unknowns. According to a short biography of Cramer
by J.J OConnor and E F Robertson, the rule had been used already before by other mathematicians. Solving systems with Cramers formulas is slower than by Gaussian elimination.
But it is ueful for example if the matrix A or the vector b depends on a parameter t, and we
want to see how x depends on the parameter t. One can find explicit formulas for (d/dt)xi (t)
for example.
Cramers rule leads to an explicit formula for the inverse of a matrix inverse of a matrix:
Let Aij be the matrix where the ith row and the jth column is deleted. Bij =
(1)i+j det(Aji ) is called the classical adjoint or adjugate of A. The determinant
of the classical adjugate is called minor.
We can take also this as the definition of the volume. Note that AT A is a square matrix.
The
volume of a k dimensional parallelepiped defined by the vectors v1 , . . . , vk is
q
det(AT A).
det(Aji )
det(A)
B11
10.
2 3 1
14 21
2 4
8
has det(A) = 17 and we get A1 = 11
6 0 7 "
12 18
#
"
#
2 4
3 1
= (1)2 det
= 14. B12 = (1)3 det
= 21. B13
0 7
0 7
A= 5
"
5 4
6 7
"
5 2
6 0
"
"
2 1
6 7
2 3
6 0
#
#
10
3
/(17):
11
"
#
3 1
= (1)4 det
=
2 4
"
2 1
5 4
"
2 3
5 2
describes an electron in a periodic crystal, E is the energy and = 2/n. The electron can move
as a Bloch wave whenever the determinant is negative. These intervals form the spectrum of
the quantum mechanical system. A physicist is interested in the rate of change of f (E) or its
dependence on when E is fixed. .
The graph to the left shows the function E 7 log(|det(L EIn )|) in the case = 2 and n = 5.
In the energy intervals, where this function is zero, the electron can move, otherwise the crystal
is an insulator. The picture to the right shows the spectrum of the crystal depending on . It
is called the Hofstadter butterfly made popular in the book Godel, Escher Bach by Douglas
Hofstadter.
Random matrices
If the entries of a matrix are random variables with a continuous distribution, then
the determinant is nonzero with probability one.
If the entries of a matrix are random variables which have the property that P[X =
x] = p > 0 for some x, then there is a nonzero probability that the determinant is
zero.
Proof. We have with probability p2n that the first two rows have the same entry x.
What is the distribution of the determinant of a random matrix? These are questions which are
hard to analyze theoretically. Here is an experiment: we take random 100 100 matrices and look
at the distribution of the logarithm of the determinant.
100
80
60
40
20
172
174
176
178
180
182
184
Applications of determinants
In
solid state physics, one is interested in the function f (E) = det(L EIn ), where
L=
cos()
1
0
0
1
1
cos(2) 1
0
1
1
0
1 cos((n 1))
1
1
0
0
1
cos(n)
2 4 6
a) A =
5 5 5
1 3 2
3 2 8
determinants
4 5
8 10
5 4
7 4
4 9
1 2 2
a) A =
0 1 1
0 0 0
0 0 0
determinants
0 0
0 0
0 0
1 3
4 2
b) A =
2
1
0
0
0
1
1
0
0
0
4
1
2
0
0
4
2
1
3
0
2
3
1
1
4
1 6 10 1 15
8 17 1 29
0 3 8 12
0 0 4 9
0 0 0 0 5
b) A =
0
Find a 4 4 matrix A with entries 0, +1 and 1 for which the determinant is maximal.
Hint. Think about the volume. How do we get a maximal volume?
un+1 = un un1
1 1
1 0
#"
"
un
=
un1
"
#
1 1
system. Lets compute some orbits: A =
1 0
We see that A6 is the identity. Every initial vector is
original starting point.
with u0 = 0, u1 = 1. Because
un+1
un
"
"
0 1
1 0
A3 =
.
1 1
0 1
mapped after 6 iterations back to its
A2 =
Markets, population evolutions or ingredients in a chemical reaction are often nonlinear. A linear
description often can give a good approximation and solve the system explicitly. Eigenvectors and
eigenvalues provide us with the key to do so.
A nonzero vector v is called an eigenvector of a n n matrix A if Av = v for
some number . The later is called an eigenvalue of A.
We first look at real eigenvalues but also consider complex eigenvalues.
A rotation A in three dimensional space has an eigenvalue 1, with eigenvector spanning the
axes of rotation. This vector satisfies A~v = ~v .
Every
standard
w
~ i is an eigenvector if A is a diagonal matrix. For example,
"
#"
# basis
" vector
#
2 0
0
0
=3
.
0 3
1
1
For an orthogonal projection P onto a space V , every vector in V is an eigenvector to the
eigenvalue 1 and every vector perpendicular to V is an eigenvector to the eigenvalue 0.
The one-dimensional discrete dynamical system x 7 ax or xn+1 = axn has the solution
xn = an x0 . The value 1.0320 1000 = 1806.11 for example is the balance on a bank account
which had 1000 dollars 20 years ago if the interest rate was a constant 3 percent.
"
"
3
, A2~v =
1
"
5
. A3~v =
1
"
7
. A4~v =
1
"
9
1
etc.
The following example shows why eigenvalues and eigenvectors are so important:
9
4
"
1 2
1
. A~v =
. ~v =
0 1
1
Do you see a pattern?
A=
If ~v is an eigenvector with eigenvalue , then A~v = ~v , A2~v = A(A~v )) = A~v = A~v = 2~v
and more generally An~v = n~v .
For an eigenvector, we have a closed form solution for An~v. It is n~v .
10
The recursion
xn+1 = xn + xn1
with x0 = 0 and x1 = 1 produces the Fibonacci sequence
(1, 1, 2, 3, 5, 8, 13, 21, ...)
This can be computed with a discrete dynamical system because
"
xn+1
xn
"
1 1
1 0
#"
xn
xn1
in mathematics called the golden ratio. We have found our eigenvalues and eigenvectors.
Now find c1 , c2 such that
"
#
"
#
"
#
1
+
= c1
+ c2
0
1
1
"
A
.
G
Each cycle 1/3 of IOS users switch to Android and 2/3 stays. Also lets"assume that
# 1/2
2/3 1/2
of the Android OS users switch to IOS and 1/2 stay. The matrix A =
is a
1/3 1/2
Markov matrix. What customer ratio do we have in the limit? The matrix A has an
eigenvector (3/5, 2/5) which belongs to the eigenvalue 1.
Customers using Apple IOS and Google Android are represented by a vector
A~v = ~v
means that 60 to 40 percent is the final stable distribution.
The following fact motivates to find good methods to compute eigenvalues and eigenvectors.
If A~v1 = 1~v1 , A~v2 = 2~v2 and ~v = c1~v1 + c2~v2 , we have closed form solution
An~v = c1 n1 ~v1 + c2 n2 ~v2 .
Lets try this in the Fibonacci case. We will see next time how we find the eigenvalues and
eigenvectors:
12
1 1
1 0
#"
"
= An
"
1
0
Markov case
11
xn+1
xn
2
This leads to the
quadratic equation + + 1 = which has the solutions + = (1 + 5)/2
and = (1 5)/2. The number is one of the most famous and symmetric numbers
1
= n+
5
"
+
1
1
n
5
"
"
0.978 0.006
0.004 0.992
#
"
1
3
and
are eigenvec2
1
tors of A. Find the eigenvalues.
Check that
"
0 2
1 1
models
" the
# growth of a lilac bush. The vector
n
~v =
models the number of new branches
a
and #the number
"
"
#of old branches. Verify that
1
2
and
are eigenvectors of A. Find
1
1
the eigenvalues and find the close form solution starting with ~v = [2, 3]T .
Compare problem 50 in Chapter 7.1 of
Bretschers Book. Two interacting populations of hares and foxes can be modeled
by the discrete dynamical system vn+1 = Avn
with
"
#
4 2
A=
1 1
Find a closed form solutions
following
# in
" the #
"
100
h0
=
three cases: a) ~v0 =
.
f0
100
#
"
"
#
200
h0
=
b) ~v0 =
.
f0
100
#
"
"
#
600
h0
=
c) ~v0 =
.
500
f0
Proof1 One only has to show a polynomial p(z) = z n + an1 z n1 + + a1 z + a0 always has a root
z0 We can then factor out p(z) = (z z0 )g(z) where g(z) is a polynomial of degree (n 1) and
use induction in n. Assume now that in contrary the polynomial p has no root. Cauchys integral
theorem then tells
Z
2i
dz
=
6= 0 .
(1)
p(0)
|z|=r| zp(z)
On the other hand, for all r,
2 1
, the characteristic polynomial is
4 1
det(A I2 ) = det(
"
cos(t) sin(t)
For a rotation A =
the characteristic polynomial is 2 2 cos() + 1
sin(t) cos(t)
which has the roots cos() i sin() = ei .
Allowing complex eigenvalues is really a blessing. The structure is very simple:
"
"
2
1
) = 2 6 .
4
1
|z|=r|
dz
1
2
| 2rmax|z|=r
=
.
zp(z)
|zp(z)|
min|z|=r p(z)
(2)
|an1 |
|a0 |
n)
|z|
|z|
which goes to infinity for r . The two equations (1) and (2) form a contradiction. The
assumption that p has no root was therefore not possible.
If 1 , . . . , n are the eigenvalues of A, then
"
a b
, the characteristic polynomial is
c d
2 tr(A) + det(A) .
We can see this directly by writing out the determinant of the matrix A I2 . The trace is
important because it always appears in the characteristic polynomial, also if the matrix is
larger:
fA () = (1 )(2 ) . . . (n ) .
A=
"
3 7
5 5
Proof. The pattern, where all the entries are in the diagonal leads to a term (A11 )
(A22 )...(Ann ) which is (n ) + (A11 + ... + Ann )()n1 + ... The rest of this as well
as the other patterns only give us terms which are of order n2 or smaller.
How many eigenvalues do we have? For real eigenvalues, it depends. A rotation in the plane
with an angle different from 0 or has no real eigenvector. The eigenvalues are complex in
that case:
"
1
. We can also
1
read off the trace 8. Because the eigenvalues add up to 8 the other eigenvalue is 2. This
example seems special but it often occurs in textbooks. Try it out: what are the eigenvalues
of
"
#
11 100
A=
?
12 101
Because each row adds up to 10, this is an eigenvalue: you can check that
1
A. R. Schep. A Simple Complex Analysis and an Advanced Calculus Proof of the Fundamental theorem of
Algebra. Mathematical Monthly, 116, p 67-68, 2009
A=
1
0
0
0
0
2
2
0
0
0
3
3
3
0
0
4
4
4
4
0
5
5
5
5
5
How do we construct 2x2 matrices which have integer eigenvectors and integer eigenvalues?
Just take an integer matrix for which
" the
# row vectors have the same sum. Then this sum
1
is an eigenvalue to the eigenvector
. The other eigenvalue can be obtained by noticing
1
"
#
6 7
that the trace of the matrix is the sum of the eigenvalues. For example, the matrix
2 11
has the eigenvalue 13 and because the sum of the eigenvalues is 18 a second eigenvalue 5.
A matrix with nonnegative entries for which the sum of the columns entries add up
to 1 is called a Markov matrix.
Markov Matrices have an eigenvalue 1.
Proof. The eigenvalues of A and A are the same because they have the same characteristic
polynomial. The matrix AT has an eigenvector [1, 1, 1, 1, 1]T .
6
1/2 1/3
1/4
1/3
A = 1/4 1/3
If A =
"
1 1
0 1
.
4 1 0
A=
2
100 1
1
1
1
1 100 1
1
1
1
1 100 1
1
1
1
1 100 1
1
1
1
1 100
"
A 0
0 B
1/3 1/2
. Then [3/7, 4/7] is the equilibrium eigenvector to the eigenvalue 1.
2/3 1/2
M=1000;
A=Table [Random[ ] 1 / 2 , {M} , {M} ] ;
e=Eigenvalues [A ] ;
d=Table [ Min[ Table [ I f [ i==j , 1 0 , Abs [ e [ [ i ]] e [ [ j ] ] ] ] , { j ,M} ] ] , { i ,M} ] ;
a=Max[ d ] ; b=Min[ d ] ;
Graphics [ Table [ {Hue[ ( d [ [ j ]] a ) / ( ba ) ] ,
Point [ {Re[ e [ [ j ] ] ] , Im[ e [ [ j ] ] ] } ] } , { j ,M} ] ]
B=
Eigenvectors
101 2
3
4
5
1 102 3
4
5
1
2 103 4
5
1
2
3 104 5
1
2
3
4 105
This matrix is A + 100I5 where A is the matrix from the previous example. Note that if
Bv = v then (A + 100I5 )v = + 100)v so that A, B have the same eigenvectors and the
eigenvalues of B are 100, 100, 100, 100, 115.
Find the determinant of the previous matrix B. Solution: Since the determinant is the
product of the eigenvalues, the determinant is 1004 115.
The shear
"
Remember that the multiplicity with which an eigenvalue appears is called the algebraic multiplicity of :
The algebraic multiplicity is larger or equal than the geometric multiplicity.
Proof. Let be the eigenvalue. Assume it has geometric multiplicity m. If v1 , . . . , vm is a basis
of the eigenspace E form the matrix S which contains these vectors in the first m columns. Fill
the other columns arbitrarily. Now B = S 1 AS has the property that the first m columns are
e1 , .., em , where ei are the standard vectors. Because A and B are similar, they have the same
eigenvalues. Since B has m eigenvalues also A has this property and the algebraic multiplicity
is m.
You can remember this with an analogy: the geometric mean
equal to the algebraic mean (a + b)/2.
1 2
2
2
2
1 2
A=
1
3
3
3
3
3
4
4
4
4
4
5
5
5
5
5
This matrix has a large kernel. Row reduction indeed shows that the kernel is 4 dimensional.
Because the algebraic multiplicity is larger or equal than the geometric multiplicity there
are 4 eigenvalues 0. We can also immediately get the last eigenvalue from the trace 15. The
eigenvalues of A are 0, 0, 0, 0, 15.
is spanned by
"
1
0
1 1 1
1
has eigenvalue 1 with algebraic multiplicity 2 and the eigenvalue 0
0 0 1
with multiplicity
to the eigenvalue
1. Eigenvectors
The matrix 0 0
Proof. If all are different, there is one of them i which is different from 0. We use induction
with respect to n and assume the result is true for n 1. Assume that in contrary the eigenP
P
vectors are linearly dependent. We have vi = j6=i aj vj and i vi = Avi = A( j6=i aj vj ) =
P
P
j6=i aj j vj so that vi =
j6=i bj vj with bj = aj j /i . If the eigenvalues are different, then
P
P
P
aj 6= bj and by subtracting vi = j6=i aj vj from vi = j6=i bj vj , we get 0 = j6=i (bj aj )vj = 0.
Now (n1) eigenvectors of the n eigenvectors are linearly dependent. Now use the induction
assumption.
1 1
0 1
If all eigenvalues are different, then all eigenvectors are linearly independent and
all geometric and algebraic multiplicities are 1. The eigenvectors form then an
eigenbasis.
0 1
0 0
"
0 1
0
0
1 0
0
B=
0
1
0
0
0
0
1
0
v= 2 ,
v3
v4
The book of Lanville and Meyher of 2006 gives 8 billion. This was 5 years ago.
1 2 1
4 2
.
3 6 5
A= 2
and look at Bv = v.
Where are eigenvectors used: in class we will look at some applications: H
uckel theory, orbitals
of the Hydrogen atom and Page rank. In all these cases, the eigenvectors have immediate
interpretations. We will talk about page rank more when we deal with Markov processes.
0.7 0
0
100
St
Moritz lake n weeks
"
i
.
1
The eigenvectors are the same for every rotation-dilation matrix. With
A=
"
we have
Diagonalization
S 1 AS =
S 1 ASei S 1 Avi = S 1 i vi = i S 1 vi = i ei .
On the other hand if A is diagonalizable, then we have a matrix S for which S 1 AS = B is diagonal. The column vectors of S are eigenvectors because the kth column of the equation AS = BS
shows Avi = i vi .
has the eigenvalues 1 for which the geometric multiplicity is smaller than the algebraic one. This
matrix is not diagonalizable.
"
,S =
"
i i
1 1
a + ib
0
0
a ib
Functional calculus
Proof. If we have an eigenbasis, we have a coordinate transformation matrix S which contains the
eigenvectors vi as column vectors. To see that the matrix S 1 AS is diagonal, we check
Are all matrices diagonalizable? No! We need to have an eigenbasis and therefore that the
geometric multiplicities all agree with the algebraic multiplicities. We have seen that the shear
matrix
"
#
1 1
A=
0 1
a b
b a
"
2 3
What is A100 + A37 1 if A =
? The matrix has the eigenvalues 1 = 2 + 3 with
1 2
eigenvector "~v1= [ 3,
1] #and the eigenvalues 2 = 2 3. with eigenvector ~v2 = [ 3, 1].
3 3
and check S 1 AS = D is diagonal. Because B k = S 1 Ak S can
Form S =
1
1
easily be computed, we know A100 + A37 1 = S(B 100 + B 37 1)S 1 .
Establishing similarity
3
"
"
3 5
4 4
B=
are similar. Proof. They have the same
2 6
3 5
eigenvalues 8, 9 as you can see by inspecting the sum of rows and the trace. Both matrices
are therefore diagonalizable and similar to the matrix
Show that the matrices A =
"
8 0
0 9
Simple spectrum
If A and B have the same characteristic polynomial and diagonalizable, then they are
similar.
A matrix has simple spectrum, if all eigenvalues have algebraic multiplicity 1.
If A and B have the same eigenvalues but different geometric multiplicities, then they
are not similar.
Proof. Because the algebraic multiplicity is 1 for each eigenvalue and the geometric multiplicity
is always at least 1, we have an eigenvector for each eigenvalue and so n eigenvalues.
If A has an eigenvalue which is not an eigenvalue of B, then they are not similar.
Without proof we mention the following result which gives an if and only if result for similarity:
"
cos() sin()
sin() cos()
If A and B have the same eigenvalues with geometric multiplicities which agree and
the same holds for all powers Ak and B k , then A is similar to B.
Let x1 , x2 , x3 , x4 be the positions of the four phosphorus atoms (each of them is a 3-vector).
The inter-atomar forces bonding the atoms is modeled by springs. The first atom feels a
force x2 x1 + x3 x1 + x4 x1 and is accelerated in the same amount. Lets just chose
units so that the force is equal to the acceleration. Then
An application in chemistry
While quantum mechanics describes the motion of atoms in molecules, the vibrations can
be described classically, when treating the atoms as balls connected with springs. Such
approximations are necessary when dealing with large atoms, where quantum mechanical
computations would be too costly. Examples of simple molecules are white phosphorus P4 ,
which has tetrahedronal shape or methan CH4 the simplest organic compound or freon,
CF2 Cl2 which is used in refrigerants.
x1
x2
x3
x4
=
=
=
=
A=
1
, v3 =
, v4 =
are
What is the probability that an upper triangular 3 3 matrix with entries 0 and 1 is
diagonalizable?
0 1
A=
0 0
0
0
1
1
1 1
1
0
0 0
,B =
0
1
1
0
0
0
0
1
1 1
1
0
0 0
,C =
0
1
1
0
0
0
1
1
1 0 0
1
1
0 0 0
0 1
,D =
0 0
0
0
1
1
2 3
2
0
0 0
A=
0
Caffeine C8 H10 N4 O2
1
0
v1 = , v2 =
1
0
0
0
5
6
0
0
6
5
Aspirin C9 H8 O4
1
We grabbed the pdb Molecule files from http : //www.sci.ouc.bc.ca, translated them with povchem from
.pdb to .pov rendered them under Povray.
We assume that all random variables have a finite variance $\mathrm{Var}[X]$ and expectation $\mathrm{E}[X]$.

A sequence of random variables $X_1, X_2, \dots$ defines a random walk $S_n = \sum_{k=1}^n X_k$. The interpretation is that the $X_k$ are the individual steps: if we take $n$ steps, we reach $S_n$.

If the $X_i$ are random variables which take the values 0 and 1, where 1 is chosen with probability $p$, then $S_n$ has the binomial distribution and $\mathrm{E}[S_n] = np$. Since $\mathrm{E}[X] = p$, the law of large numbers is satisfied.

If the $X_i$ are random variables which take the values $-1$ and $1$ with equal probability $1/2$, then $S_n$ is a symmetric random walk. In this case, $S_n/n \to 0$.

Here is a strange paradox, called the martingale paradox. We try it out in class. Go into a casino and play the doubling strategy: bet 1 dollar; if you lose, double to 2 dollars; if you lose again, double to 4 dollars, and so on. The first time you win, stop and leave the casino. You have won 1 dollar: you lost maybe 4 times, paying $1+2+4+8 = 15$ dollars, but then won 16. The paradox is that the expected win is zero, and in actual casinos even negative. The usual resolution is that the longer you play, the more you win, but you also increase the chance of a huge loss, leading to a zero net win. This does not quite resolve the paradox, because in a casino where you are allowed to borrow arbitrary amounts and where no bet limit exists, you cannot lose.

Here is a typical trajectory of a random walk. We flip a coin: if it shows heads we go up, if it shows tails we go down.

Throw a die $n$ times and add up the total number $S_n$ of eyes. Estimate $S_n/n - \mathrm{E}[X]$ with experiments. Below is example code for Mathematica. How fast does the error decrease?
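A minimal version of such an experiment could look as follows (the choice of 10000 throws is arbitrary):

  data = RandomInteger[{1, 6}, 10000];       (* 10000 die throws *)
  means = Accumulate[data]/Range[10000.];    (* running averages S_n/n *)
  ListPlot[means - 3.5, Joined -> True]      (* the error S_n/n - E[X], with E[X] = 7/2 *)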
The following result is one of the three most important results in probability theory:

Law of large numbers. For almost all $\omega$, we have $S_n/n \to \mathrm{E}[X]$.
Here is the situation where the random variables are Cauchy distributed; the expectation is not defined. The left picture below shows this situation. What happens if we relax the assumption that the random variables are uncorrelated? The illustration to the right below shows an experiment where we take a periodic function $f(x)$ and an irrational number $\alpha$ and where $X_k(x) = f(x + k\alpha)$.
Proof. We only prove the weak law of large numbers, which deals with a weaker notion of convergence. We have $\mathrm{Var}[S_n/n] = n\mathrm{Var}[X]/n^2 = \mathrm{Var}[X]/n$, so that by Chebyshev's theorem

$\mathrm{P}[|S_n/n - \mathrm{E}[X]| \geq \epsilon] \leq \mathrm{Var}[X]/(n\epsilon^2) \to 0$

for $n \to \infty$. We see that the probability that $S_n/n$ deviates by a certain amount from the mean goes to zero as $n \to \infty$. The strong law would need half an hour for a careful proof.
It turns out that no randomness is necessary to establish the strong law of large numbers. It is enough to have ergodicity:

If $\Omega$ is the interval $[0,1]$ with measure $\mathrm{P}[[c,d]] = d-c$, then $T(x) = x + \alpha \mod 1$ is ergodic if $\alpha$ is irrational.
A real number is called normal to base 10 if in its decimal expansion, every digit appears with the same frequency $1/10$.

Almost every real number is normal.

The reason is that we can look at the $k$th digit of a number as the value of a random variable $X_k(\omega)$, where $\omega \in [0,1]$. These random variables are all independent and have the same distribution. For the digit 7 for example, look at the random variables

$Y_k(\omega) = \begin{cases} 1 & X_k(\omega) = 7 \\ 0 & \text{else} \end{cases}$

which have expectation $1/10$. The average $S_n(\omega)/n$, the fraction of digits 7 among the first $n$ digits of the decimal expansion of $\omega$, converges to $1/10$ by the law of large numbers. We can do that for any digit and therefore, almost all numbers are normal.
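One can watch this empirically. The digits of $\pi$, for example, look normal (normality of $\pi$ is only conjectured, so this is an illustration, not a proof):

  digits = First[RealDigits[N[Pi, 10000]]];   (* first 10000 decimal digits of Pi *)
  Sort[Tally[digits]]                          (* each digit appears close to 1000 times *)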
The limit

$\lim_{n \to \infty} \frac{1}{n} \sum_{k=1}^n f(x_k)$,

where the $x_k$ are IID random variables in $[a,b]$, is called the Monte-Carlo integral.
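As a sketch, here is the Monte-Carlo estimate of $\int_0^\pi \sin^2(x)\,dx$, which should come out close to $\pi/2$; the average of $f$ at random points is multiplied by the length $b-a$ of the interval:

  Pi Mean[Sin[RandomReal[{0, Pi}, 100000]]^2]   (* close to Pi/2 = 1.5708... *)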
Look at the first significant digit $X_n$ of the sequence $2^n$. For example, $X_5 = 3$ because $2^5 = 32$. To which number does $S_n/n$ converge? We know that $X_n$ has the Benford distribution $\mathrm{P}[X_n = k] = p_k$. [The actual process which generates the random variables is the irrational rotation $x \to x + \log_{10}(2) \mod 1$, which is ergodic since $\log_{10}(2)$ is irrational. You can therefore assume that the law of large numbers (Birkhoff's generalization) applies.]
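A quick check of the Benford distribution $p_k = \log_{10}(1 + 1/k)$ of these first digits (1000 terms is an arbitrary choice):

  Sort[Tally[Table[First[IntegerDigits[2^n]], {n, 1, 1000}]]]   (* observed counts *)
  Table[N[1000 Log10[1 + 1/k]], {k, 9}]                          (* Benford prediction *)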
By going through the proof of the weak law of large numbers, check: does the proof also work if the $X_n$ are only uncorrelated?

The Monte-Carlo integral is the same as the Riemann integral for continuous functions.
We can use this to compute areas of complicated regions. The following two lines evaluate the area of the Mandelbrot fractal using Monte-Carlo integration. The function $F$ is equal to 1 if the parameter value $c$ of the quadratic map $z \to z^2 + c$ is in the Mandelbrot set, and 0 else. It shoots 100000 random points and counts what fraction of the square of area 9 is covered by the set. Numerical experiments give values close to the actual value around 1.51... One could use more points to get more accurate estimates.
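A sketch of how the two lines could look; the escape bound of 100 iterations and the square $[-2,1] \times [-1.5,1.5]$ of area 9 are the assumptions here:

  F[c_] := If[Abs[NestWhile[#^2 + c &, 0., Abs[#] < 2 &, 1, 100]] < 2, 1, 0];
  9. Mean[Table[F[RandomReal[{-2, 1}] + I RandomReal[{-1.5, 1.5}]], {100000}]]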
If we add independent random variables and normalize them so that the mean is zero and the standard deviation is 1, then the distribution of the sum converges to the normal distribution. This central limit theorem explains why the normal distribution is so prevalent.

Given a random variable $X$ with expectation $m$ and standard deviation $\sigma$, define the normalized random variable $X^* = (X-m)/\sigma$.

The normalized random variable has mean 0 and standard deviation 1. The standard normal distribution mentioned above is an example. We have seen that $S_n/n$ converges to a definite number if the random variables are uncorrelated. We have also seen that the standard deviation of $S_n/n$ goes to zero.

A sequence $X_n$ of random variables converges in distribution to a random variable $X$ if for every trigonometric polynomial $f$, we have $\mathrm{E}[f(X_n)] \to \mathrm{E}[f(X)]$.

This means that $\mathrm{E}[\cos(tX_n)] \to \mathrm{E}[\cos(tX)]$ and $\mathrm{E}[\sin(tX_n)] \to \mathrm{E}[\sin(tX)]$ for every $t$. We can combine cos and sin to $\exp(itx) = \cos(tx) + i\sin(tx)$ and cover both at once by showing $\mathrm{E}[e^{itX_n}] \to \mathrm{E}[e^{itX}]$ for $n \to \infty$. So, checking the last statement for every $t$ is equivalent to checking convergence in distribution.

Proof. Let $X$ be an $N(0,1)$-distributed random variable. We can assume that the $X_k$ are already normalized, so that $S_n^* = S_n/\sqrt{n}$. We show $\mathrm{E}[e^{itS_n^*}] \to \mathrm{E}[e^{itX}]$ for any fixed $t$. Since any two of the random variables $X_k, X_l$ are independent,

$\mathrm{E}[e^{itS_n/\sqrt{n}}] = \mathrm{E}[e^{itX_1/\sqrt{n}}]^n = \left(1 - \frac{t^2}{2n} - \frac{i t^3 \mathrm{E}[X^3]}{3! \, n^{3/2}} + \cdots \right)^n$.

Using $e^{-t^2/2} = 1 - t^2/2 + \cdots$, we get

$\mathrm{E}[e^{itS_n/\sqrt{n}}] = \left(1 - \frac{t^2}{2n} + \frac{R_n}{n^{3/2}}\right)^n \to e^{-t^2/2}$.

The last step uses a Taylor remainder term $R_n/n^{3/2}$; it is here that the assumption $\mathrm{E}[X^3] < \infty$ has been used. The statement now follows from

$\mathrm{E}[e^{itX}] = (2\pi)^{-1/2} \int e^{itx} e^{-x^2/2}\,dx = e^{-t^2/2}$.
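A simulation illustrating the theorem; uniform random variables (mean $1/2$, variance $1/12$) are an arbitrary choice:

  n = 100; trials = 10000;
  sums = Total /@ RandomReal[{0, 1}, {trials, n}];       (* 10000 samples of S_n *)
  Histogram[(sums - n/2)/Sqrt[n/12.], Automatic, "PDF"]  (* close to the N(0,1) bell curve *)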
Statistical inference

For the following, see also Cliffs notes pages 89-93 (section 14.6): what is the probability that the average $S_n/n$ is within $\epsilon$ of the mean $\mathrm{E}[X]$?

The probability that $S_n/n$ deviates more than $R\sigma/\sqrt{n}$ from $\mathrm{E}[X]$ can for large $n$ be estimated by

$\frac{1}{\sqrt{2\pi}} \int_R^\infty e^{-x^2/2}\,dx$.

Proof. Let $p = \mathrm{E}[X]$ denote the mean of the $X_k$ and $\sigma$ the standard deviation. Denote by $X$ a random variable which has the standard normal distribution $N(0,1)$. We use the notation $X \sim Y$ if $X$ and $Y$ are close in distribution. By the central limit theorem,

$\frac{S_n - np}{\sigma\sqrt{n}} \sim X$.

Dividing numerator and denominator by $n$ gives $\frac{S_n/n - p}{\sigma/\sqrt{n}} \sim X$, so that

$\frac{S_n}{n} \sim p + \frac{\sigma}{\sqrt{n}} X$.

The term $\sigma/\sqrt{n}$ is called the standard error. The central limit theorem gives some insight into why the standard error is important. In scientific publications, the standard error should be displayed rather than the standard deviation. 1

For a random walk, we expect the distance from the starting point after $n$ steps to be of the order $\sqrt{n}$, because $S_n \sim X\sqrt{n}$. The assumption in the previous problem, that our squirrel is completely drunk, is the null hypothesis. Assume we observe the squirrel after 3 minutes at 20 feet from the original position. Since $S_{180}/\sqrt{180}$ is close to normal and $c = 20/\sqrt{180} = 1.49...$, we can estimate the P-value as

$\frac{1}{\sqrt{2\pi}} \int_{1.49}^\infty e^{-x^2/2}\,dx \approx 0.07$.

Since we know that the actual distribution is a binomial distribution, we could have computed the P-value exactly as $\sum_{k=100}^{180} \binom{180}{k} p^k (1-p)^{180-k} = 0.078$ with $p = 1/2$. A P-value smaller than 5 percent is called significant; we would then have to reject the null hypothesis and conclude that the squirrel is not drunk. In our case, the experiment was not significant.

The central limit theorem is so important because it gives us a tool to estimate the P-value. It is in general much better than the estimate given by Chebyshev.

1 Geoff Cumming, Fiona Fidler, and David L. Vaux: Error bars in experimental biology, Journal of Cell Biology 177(1), 2007, 7-11.
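Both numbers in the squirrel example can be checked in one line each; the exact computation uses $p = 1/2$:

  NIntegrate[Exp[-x^2/2]/Sqrt[2 Pi], {x, 20/Sqrt[180.], Infinity}]  (* normal approximation, about 0.068 *)
  N[Sum[Binomial[180, k] (1/2)^180, {k, 100, 180}]]                  (* exact binomial P-value, about 0.078 *)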
There are many popular children's games in which one has to throw a die and then move forward by the number of eyes shown. After $n$ rounds, we are at position $S_n$. If you play such a game, find an interval of positions such that you expect to be in that interval after $n$ rounds with probability at least $2/3$.
Remark: An example is the game Ladder, which kids play a lot in Switzerland: it is a random walk with drift 3.5. What makes the game exciting are occasional accelerations or setbacks; mathematically, it is a Markov process. If you hit certain fields, you get pushed ahead (sometimes significantly), but on other fields you can lose almost everything. The game stays interesting because even if you are ahead, you can still end up last, or you can trail behind the whole game and win in the end.
We play in a casino with the following version of the martingale strategy. We play all evening and bet one dollar on black until we reach our goal of winning 10 dollars. Assume each game lasts a minute. How long do we expect to wait until we can go home? (You can assume that the game is fair and that you win or lose with probability 1/2 in each game.)

Remark: this strategy appears frequently in movies, usually when characters are desperate. Examples are Run Lola Run, Casino Royale or The Hangover.
The example $A = \begin{pmatrix} 0 & 0 \\ 1 & 1 \end{pmatrix}$ shows that a Markov matrix can have zero eigenvalues and zero determinant.

The example $A = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$ shows that a Markov matrix can have negative eigenvalues and a negative determinant.

The matrix $A = \begin{pmatrix} 1/2 & 1/3 \\ 1/2 & 2/3 \end{pmatrix}$ is a Markov matrix.

Markov matrices are also called stochastic matrices. Many authors write the transpose of the matrix and apply the matrix to the right of a row vector. In linear algebra, we write $Ap$, where $p = [p_1, p_2, \dots, p_n]^T$ is a column vector. This is of course equivalent.
If all entries are positive and $A = \begin{pmatrix} a & b \\ 1-a & 1-b \end{pmatrix}$ is a $2 \times 2$ Markov matrix, then there is only one eigenvalue 1 and one eigenvalue smaller than 1 in absolute value.

Proof: we have seen that there is one eigenvalue 1 because $A^T$ has $[1,1]^T$ as an eigenvector. The trace of $A$ is $1 + a - b$, which is smaller than 2. Because the trace is the sum of the eigenvalues, the second eigenvalue $a - b$ is smaller than 1; since $0 < a, b < 1$, it is also larger than $-1$.

The example $A = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$ shows that without the positivity assumption, a Markov matrix can have several eigenvalues 1.
Let's call a vector with nonnegative entries $p_k$ for which all the $p_k$ add up to 1 a stochastic vector. For a stochastic matrix, every column is a stochastic vector. If $p$ is a stochastic vector and $v_1, \dots, v_n$ are the columns of a stochastic matrix $A$, then $Ap = p_1 v_1 + \dots + p_n v_n$ is again a stochastic vector, as it is a weighted average of stochastic vectors.

The example $A = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}$ shows that a Markov matrix can have complex eigenvalues and that Markov matrices can be orthogonal.

The following example shows that stochastic matrices do not need to be diagonalizable, not even over the complex numbers:
A Markov matrix $A$ always has an eigenvalue 1. All other eigenvalues are in absolute value smaller or equal to 1.

Proof. For the transpose matrix $A^T$, the entries in each row add up to 1. $A^T$ therefore has the eigenvector $[1, 1, \dots, 1]^T$ with eigenvalue 1. Because $A$ and $A^T$ have the same determinant, also $A - \lambda I_n$ and $A^T - \lambda I_n$ have the same determinant, so that the eigenvalues of $A$ and $A^T$ are the same. With $A^T$ having an eigenvalue 1, also $A$ has an eigenvalue 1.
Assume now that $v$ is an eigenvector with an eigenvalue $|\lambda| > 1$. Then $A^n v = \lambda^n v$ has exponentially growing length for $n \to \infty$. This implies that there is for large $n$ one coefficient $[A^n]_{ij}$ which is larger than 1. But $A^n$ is a stochastic matrix (see homework) and has all entries $\leq 1$. The assumption of an eigenvalue of absolute value larger than 1 can not be valid.
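A quick numerical check of this fact with a random $5 \times 5$ Markov matrix (the size is an arbitrary choice):

  cols = #/Total[#] & /@ RandomReal[{0, 1}, {5, 5}];  (* five random stochastic vectors *)
  A = Transpose[cols];                                 (* columns sum to 1 *)
  Sort[Abs[Eigenvalues[A]]]                            (* the largest is 1, the rest smaller *)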
A concrete example is the matrix

$A = \begin{pmatrix} 1/2 & 1/6 & 1/3 \\ 1/2 & 1/6 & 1/3 \\ 0 & 2/3 & 1/3 \end{pmatrix}$,

which is a stochastic matrix, even doubly stochastic: its transpose is stochastic too. Its row reduced echelon form is

$\begin{pmatrix} 1 & 0 & 1/2 \\ 0 & 1 & 1/2 \\ 0 & 0 & 0 \end{pmatrix}$,

so that it has a one-dimensional kernel. Its characteristic polynomial is $f_A(x) = x^2 - x^3$, which shows that the eigenvalues are $1, 0, 0$. The algebraic multiplicity of 0 is 2; the geometric multiplicity of 0 is 1. The matrix is not diagonalizable. 1

1 An example of this type appeared in http://mathoverflow.net/questions/51887/non-diagonalizable-doubly-stochastic-matrices
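These properties can be confirmed directly; the matrix below is the one given above, and the zero rows returned by Eigenvectors signal missing eigenvectors:

  A = {{1/2, 1/6, 1/3}, {1/2, 1/6, 1/3}, {0, 2/3, 1/3}};
  CharacteristicPolynomial[A, x]   (* x^2 - x^3 *)
  RowReduce[A]                     (* {{1,0,1/2},{0,1,1/2},{0,0,0}} *)
  MatrixRank[Eigenvectors[A]]      (* 2 < 3, so A is not diagonalizable *)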
[Figure: a transition graph on a few nodes with transition probabilities between 0.1 and 0.5, together with the corresponding Markov matrix $A$.]
Many games are Markov games. Let's look at a simple example of a mini monopoly, where no property is bought:

Let's have a simple monopoly game with 6 fields. We start at field 1 and throw a coin. If the coin shows heads, we move 2 fields forward. If the coin shows tails, we move back to field number 2. If you reach the end, you win a dollar; if you overshoot, you pay a fee of a dollar and move to the first field. Question: in the long run, do you win or lose, if $p_6 - p_5$ measures this win? Here $p = (p_1, p_2, p_3, p_4, p_5, p_6)$ is the stable equilibrium solution with eigenvalue 1 of the game.
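A sketch of the computation, under one possible reading of the rules (from fields 5 and 6, heads overshoots and returns to field 1):

  (* column j of A holds the probabilities of the moves from field j *)
  A = Transpose[{
      {0, 1/2, 1/2, 0, 0, 0},    (* from 1: tails -> 2, heads -> 3 *)
      {0, 1/2, 0, 1/2, 0, 0},    (* from 2 *)
      {0, 1/2, 0, 0, 1/2, 0},    (* from 3 *)
      {0, 1/2, 0, 0, 0, 1/2},    (* from 4 *)
      {1/2, 1/2, 0, 0, 0, 0},    (* from 5: heads overshoots -> 1 *)
      {1/2, 1/2, 0, 0, 0, 0}}];  (* from 6 *)
  p = First[NullSpace[A - IdentityMatrix[6]]];
  p = p/Total[p];                 (* the equilibrium with eigenvalue 1 *)
  p[[6]] - p[[5]]                 (* the long-run win rate *)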
Take the same example, but now throw a die in each round and move by the number of eyes shown. The transition matrix is now the $6 \times 6$ matrix

$A = \begin{pmatrix} 1/6 & \cdots & 1/6 \\ \vdots & & \vdots \\ 1/6 & \cdots & 1/6 \end{pmatrix}$

in which every entry equals $1/6$. In the homework, you will see that there is only one stable equilibrium now.
This is a second lecture on Markov processes. We want to see why the following result is true:

If all entries of a Markov matrix $A$ are positive, then $A$ has a unique equilibrium: there is only one eigenvalue 1, and all other eigenvalues are in absolute value smaller than 1.

The idea of the proof is that $A$ acts as a contraction on the set $X$ of stochastic vectors with positive entries, where the distance is the geodesic sphere distance. Such a map has a unique fixed point $v$ by Banach's fixed point theorem. This is the eigenvector $Av = v$ we were looking for. We have seen now that on $X$, there is only one eigenvector. Every other eigenvector $Aw = \lambda w$ must have a coordinate entry which is negative. Write $|w|$ for the vector with coordinates $|w_j|$. The computation

$|\lambda|\,|w_i| = |\lambda w_i| = \Big|\sum_j A_{ij} w_j\Big| \leq \sum_j A_{ij} |w_j| = (A|w|)_i$

shows that $|\lambda| L \leq L$, where $L$ is the length of $w$, because $A|w|$ is a vector with length smaller or equal to $L$. From $|\lambda| L \leq L$ with nonzero $L$ we get $|\lambda| \leq 1$. The $\leq$ which appears in the displayed formula is however a strict inequality for some $i$ if one of the coordinate entries is negative. Having established $|\lambda| < 1$, the proof is finished.
To illustrate the importance of the result, we look at how it is used in chaos theory and how it can be used by search engines to rank pages.
The matrix $A = \begin{pmatrix} 1/2 & 1/3 \\ 1/2 & 2/3 \end{pmatrix}$ is a Markov matrix for which all entries are positive. The eigenvalue 1 is unique because the sum of the eigenvalues, the trace $1/2 + 2/3$, is smaller than 2.
Let's give a brute force proof of the Perron-Frobenius theorem in the case of $3 \times 3$ matrices: such a matrix is of the form

$A = \begin{pmatrix} a & b & c \\ d & e & f \\ 1-a-d & 1-b-e & 1-c-f \end{pmatrix}$.
Remark. The theorem generalizes to situations considered in chaos theory, where products of random matrices are considered which all have the same distribution but which do not need to be independent. Given such a sequence of random matrices $A_k$, define $S_n = A_n A_{n-1} \cdots A_1$. This is a noncommutative analogue of the random walk $S_n = X_1 + \dots + X_n$ for usual random variables. But it is much more intricate, because matrices do not commute. Laws of large numbers are now more subtle.
Application: Chaos

The Lyapunov exponent of a random sequence of matrices is defined as

$\lim_{n \to \infty} \frac{1}{2n} \log \lambda(S_n^T S_n)$,

where $\lambda(B)$ denotes the largest eigenvalue of the matrix $B$.

Here is a prototype result in chaos theory, due to Anosov, for which the proof of Perron-Frobenius can be modified using different contractions. It can be seen as an example of a noncommutative law of large numbers:

If $A_k$ is a sequence of identically distributed random positive matrices of determinant 1, then the Lyapunov exponent is positive.
"
"
2 1
3 2
Let Ak be either
or
with probability 1/2. Since the matrices do not
1 1
1 1
commute, we can not determine the long term behavior of Sn so easily and laws of large
numbers do not apply. The Perron-Frobenius generalization above however shows that still,
Sn grows exponentially fast.
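A simulation estimating the Lyapunov exponent of this product; $n = 500$ is an arbitrary choice, and exact integer arithmetic avoids numerical overflow:

  A1 = {{2, 1}, {1, 1}}; A2 = {{3, 2}, {1, 1}};
  n = 500;
  S = Fold[Dot, IdentityMatrix[2], Table[RandomChoice[{A1, A2}], {n}]];
  Log[Max[Eigenvalues[N[Transpose[S].S, 20]]]]/(2 n)   (* comes out clearly positive *)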
A positive Lyapunov exponent is also called sensitive dependence on initial conditions for the system, or simply dubbed chaos. Nearby trajectories will deviate exponentially. Edward Lorenz, who studied models of the complicated equations which govern our weather about 50 years ago, stated this in a poetic way in 1972:

The flap of a butterfly's wing in Brazil can set off a tornado in Texas.
Unfortunately, mathematics is still quite weak when it comes to proving positive Lyapunov exponents if the system does not a priori feature positive matrices. There are cases which can be settled quite easily: for example, if the matrices $A_k$ are IID random matrices of determinant 1 whose eigenvalues do not have absolute value 1 with full probability, then the Lyapunov exponent is positive due to work of Furstenberg and others. In real systems, like the motion of our solar system or particles in a box, positive Lyapunov exponents are measured but can not be proven yet. Even for simple toy systems like $S_n = dT^n$, where $dT$ is the Jacobian of a map $T$ like $T(x,y) = (2x - y + c\sin(x), x)$ and $T^n$ is the $n$th iterate, things are unsettled: one measures $\lambda \approx \log(c/2)$ but is unable to prove it yet. For our real weather system, where the Navier-Stokes equations apply, one is even more helpless: one does not even know whether trajectories exist for all times. This existence problem would look like an esoteric ontological question if it were not for the fact that a one million dollar bounty is offered for its solution.

Application: PageRank

A set of nodes with connections is a graph. Any network can be described by a graph. The link structure of the web forms a graph, where the individual websites are the nodes and where there is an arrow from site $a_i$ to site $a_j$ if $a_i$ links to $a_j$. The adjacency matrix of this graph is called the web graph. If there are $n$ sites, then the adjacency matrix is an $n \times n$ matrix with entries $A_{ij} = 1$ if there exists a link from $a_j$ to $a_i$. If we divide each column by the number of 1's in that column, we obtain a Markov matrix $A$ which is called the normalized web matrix. Define also the matrix $E$ which satisfies $E_{ij} = 1/n$ for all $i, j$. The graduate students and later entrepreneurs Sergey Brin and Lawrence Page had in 1996 the following one billion dollar idea: rank the pages by the equilibrium vector of the Google matrix $G = dE + (1-d)A$, where $0 \leq d \leq 1$ is a damping factor. For a small network of three pages, this matrix has the form

$G = \frac{d}{3}\begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix} + (1-d)\begin{pmatrix} 0 & 0 & 1/2 \\ 1/2 & 0 & 1/2 \\ 1/2 & 1 & 0 \end{pmatrix}$.

The damping factor can look a bit mysterious. Brin and Page write:

"PageRank can be thought of as a model of user behavior. We assume there is a random surfer who is given a web page at random and keeps clicking on links, never hitting back, but eventually gets bored and starts on another random page. The probability that the random surfer visits a page is its PageRank. And, the d damping factor is the probability at each page the random surfer will get bored and request another random page. One important variation is to only add the damping factor d to a single page, or a group of pages. This allows for personalization and can make it nearly impossible to deliberately mislead the system in order to get a higher ranking. We have several other extensions to PageRank." 1

It is said now that PageRank is the world's largest matrix computation. The $n \times n$ matrix is huge: $n$ was 8.1 billion 5 years ago. 2
Verify that if a Markov matrix $A$ has the property that $A^2$ has only positive entries, then $A$ has a unique eigenvalue 1.

Determine the PageRank of the previous system, possibly using technology like Mathematica.
1 http://infolab.stanford.edu/~backrub/google.html
2 Amy Langville and Carl Meyer, Google's PageRank and Beyond, Princeton University Press, 2006.
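For the three-page example above, one possible computation looks as follows; the value $d = 0.15$ is an assumption (in the convention used here, $d$ is the probability of getting bored):

  d = 0.15;
  Anorm = {{0, 0, 1/2}, {1/2, 0, 1/2}, {1/2, 1, 0}};
  G = d ConstantArray[1/3, {3, 3}] + (1 - d) Anorm;
  Nest[G.# &, {1/3, 1/3, 1/3}, 200]   (* power iteration converges to the PageRank vector *)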
Symmetric matrices can always be diagonalized: if $A = A^T$, there is an orthogonal matrix $S$ such that $S^{-1}AS$ is diagonal. This result is called the spectral theorem. I present now an intuitive proof, which gives more insight into why the result is true. The linear algebra book of Bretscher has an inductive proof.

Proof. We have seen already that if all eigenvalues are different, there is an eigenbasis and diagonalization is possible. The eigenvectors are then all orthogonal, and $B = S^{-1}AS$ is diagonal, containing the eigenvalues. In general, we can change the matrix $A$ to $A_t = A + (C-A)t$, where $C$ is a symmetric matrix with pairwise different eigenvalues. Then the eigenvalues are different for all except finitely many $t$. The orthogonal matrices $S_t$ converge for $t \to 0$ to an orthogonal matrix $S$, and $S$ diagonalizes $A$.
Why could we not perturb a general matrix $A$ to a matrix $A_t$ with distinct eigenvalues, diagonalize $S_t^{-1} A_t S_t = B_t$, and pass to the limit? The problem is that $S_t$ might become singular for $t \to 0$.

The matrix $A = \begin{pmatrix} 1 & 1 \\ 0 & 1+t \end{pmatrix}$ has the eigenvalues $1, 1+t$ and is diagonalizable for $t > 0$, but not diagonalizable for $t = 0$. What happens with the diagonalization in the limit? Solution: Because the matrix is upper triangular, the eigenvalues are $1, 1+t$. The eigenvector to the eigenvalue 1 is $\begin{pmatrix} 1 \\ 0 \end{pmatrix}$. The eigenvector to the eigenvalue $1+t$ is $\begin{pmatrix} 1 \\ t \end{pmatrix}$. We see that in the limit $t \to 0$, the second eigenvector collides with the first one. For symmetric matrices, where the eigenvectors are always perpendicular to each other, such a collision can not happen.

The matrix $A = \begin{pmatrix} 3 & 4 \\ 4 & 3 \end{pmatrix}$ is symmetric. Its eigenvalues are 7 and $-1$, and the eigenvectors $[1,1]^T$ and $[1,-1]^T$ are indeed perpendicular.
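The collision of eigenvectors can be watched numerically; the values of $t$ below are arbitrary choices:

  Table[Eigenvectors[{{1., 1}, {0, 1 + t}}], {t, {0.1, 0.01, 0.001}}]
  (* the normalized eigenvector for 1+t approaches {1, 0}, the eigenvector for 1, as t -> 0 *)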
An example of a symmetric matrix which appears in solid state physics is the $n \times n$ matrix

$L = \begin{pmatrix}
\cos(\alpha) & 1 & 0 & \cdots & 0 & 1 \\
1 & \cos(2\alpha) & 1 & \cdots & 0 & 0 \\
0 & 1 & \cos(3\alpha) & \ddots & & \vdots \\
\vdots & & \ddots & \ddots & 1 & 0 \\
0 & 0 & & 1 & \cos((n-1)\alpha) & 1 \\
1 & 0 & \cdots & 0 & 1 & \cos(n\alpha)
\end{pmatrix}$.

For the following questions, give a reason why the statement is true or give a counterexample.
a) Is the sum of two symmetric matrices symmetric?
b) Is the product of two symmetric matrices symmetric?
c) Is the inverse of an invertible symmetric matrix symmetric?
d) If $B$ is an arbitrary $n \times m$ matrix, is $A = B^T B$ symmetric?
e) If $A$ is similar to $B$ and $A$ is symmetric, is $B$ symmetric?
f) If $A$ is similar to $B$ with an orthogonal coordinate change $S$ and $A$ is symmetric, is $B$ symmetric?
The matrix $L$ appears in models describing an electron in a periodic crystal. The eigenvalues form what one calls the spectrum of the matrix. A physicist is interested in it because it determines what conductivity properties the system has. This depends on $\alpha$.
Here is another matrix example:

$A = \begin{pmatrix}
10001 & 3 & 5 & 7 & 9 & 11 \\
1 & 10003 & 5 & 7 & 9 & 11 \\
1 & 3 & 10005 & 7 & 9 & 11 \\
1 & 3 & 5 & 10007 & 9 & 11 \\
1 & 3 & 5 & 7 & 10009 & 11 \\
1 & 3 & 5 & 7 & 9 & 10011
\end{pmatrix}$.

Note that $A - 10000 I_6$ has identical rows $[1,3,5,7,9,11]$ and therefore rank 1, so the eigenvalues of $A$ are $10000 + 36 = 10036$ and $10000$ with multiplicity 5.
The picture shows the eigenvalues of $L$ (with diagonal entries $\lambda\cos(k\alpha)$) for $\lambda = 2$ and large $n$. The vertical axis is $\alpha$, which runs from $\alpha = 0$ at the bottom to $\alpha = 2\pi$ at the top. Due to its nature, the picture is called the Hofstadter butterfly. It has been popularized in the book Gödel, Escher, Bach by Douglas Hofstadter.
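One can approximate a horizontal slice of the butterfly numerically; a sketch, in which $n = 120$ and the golden-ratio multiple of $2\pi$ are arbitrary choices and the diagonal uses $\lambda = 2$:

  n = 120; alpha = 2. Pi (Sqrt[5] - 1)/2;
  L = Table[Which[i == j, 2 Cos[i alpha],
       Abs[i - j] == 1 || Abs[i - j] == n - 1, 1., True, 0.], {i, n}, {j, n}];
  ListPlot[Sort[Eigenvalues[L]]]   (* the spectrum for this value of alpha *)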
Linear algebra

Please see the lecture 21 handout for the topics before the midterm.

Probability theory

A discrete random variable takes values $x_k$ in a discrete set, with probabilities $\mathrm{P}[X = x_k] = p_k$; a continuous random variable has a probability density $f$, with $\mathrm{P}[X \in [a,b]] = \int_a^b f(x)\,dx$. Chebyshev's theorem bounds the deviation from the mean: $\mathrm{P}[|X - \mathrm{E}[X]| \geq c] \leq \mathrm{Var}[X]/c^2$.