Intro To Probability (Pattern Recognition)


Introduction to Probability

and Random Variables

Sarat Saharia
TU
Introduction
Why Learn Probability?
• Nothing in life is certain. In everything we do, we gauge the chances of successful outcomes, from business to medicine to the weather.
• A probability provides a quantitative description of the chances or likelihoods associated with various outcomes.
• It provides a bridge between descriptive and inferential statistics.

[Diagram: Probability reasons from a Population to a Sample; Statistics infers from a Sample back to the Population]
Probability model
• A probability model assumes that the variability in data is due to chance or random variability. For a random occurrence with a finite number of possible outcomes, a probability model can be defined by listing all possible outcomes and the probability that each one occurs.
• P(x) denotes the probability that a particular value x occurs.
• Examples:
1. Flipping a single unbiased coin. The outcomes are head and tail, and the corresponding probabilities are P(head) = ½ and P(tail) = ½.
2. Rolling a fair die. The outcomes are 1, 2, 3, 4, 5, and 6, and the probability of each outcome is 1/6.

Probability Estimate
• Methods for obtaining probability estimates:
– Frequentist approach
– Subjective approach
• Frequentist approach: the probability of an event is estimated by dividing the number of occurrences of the event by the number of trials, as in the sketch below.
– Easy to understand, but it may be difficult or impossible to obtain enough samples to get an estimate of the true probabilities.
– Also, this approach applies only to repeatable events, that is, events for which the probability is constant over all trials. This is often difficult to verify in the real world.
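A minimal sketch of the frequentist approach (plain Python, standard library only; the die-rolling setup is illustrative, not from the slides): the relative frequency of rolling a 6 approaches the true probability 1/6 as the number of trials grows.

```python
import random

def frequentist_estimate(num_trials: int) -> float:
    """Estimate P(rolling a 6) as (# occurrences) / (# trials)."""
    occurrences = sum(1 for _ in range(num_trials)
                      if random.randint(1, 6) == 6)
    return occurrences / num_trials

for n in (100, 10_000, 1_000_000):
    print(n, frequentist_estimate(n))   # approaches 1/6 ≈ 0.1667
```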

Probability Estimate
• Subjective approach: this is the only way to assign probability measures to outcomes of events that are not repeatable. For example, the probability that a certain candidate will win the next election from a constituency.
• The notion of a fair bet is one way to quantify the process of selecting a subjective probability.
– Let P be the probability of the event on which you are betting.
– W is the amount you win if the event occurs, and L is the amount you lose if the event does not occur.
– For a fair bet,
PW = (1 − P)L
or P = L / (L + W).
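As a small illustration (a hypothetical helper, not from the slides): odds of "win 300, lose 100" imply a subjective probability of 100/(100 + 300) = 0.25.

```python
def fair_bet_probability(win_amount: float, loss_amount: float) -> float:
    """Subjective probability implied by a fair bet: P = L / (L + W)."""
    return loss_amount / (loss_amount + win_amount)

print(fair_bet_probability(win_amount=300, loss_amount=100))  # 0.25
```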

Experiments and Events
• Experiment: a process for which the outcome is not known with certainty (examples on the next slide).
• Event: an outcome or combination of outcomes from a statistical experiment.
– The basic element to which probability is applied.
– When an experiment is performed, a particular event either happens, or it doesn't!
Experiments and Events
Experiment: rolling a fair six-sided die.
Events:
– Obtaining a 6
– Obtaining an odd number
Experiment: randomly choosing 10 transistors from a lot of 1000 new transistors.
Events:
– Finding more than three defective transistors
– Finding no defective transistors
Experiment: selecting a newborn child at a certain hospital.
Events:
– The weight of the selected newborn child is above 3 kg
– The newborn child is a girl

Probabilities of Events
• Sample space: the event containing all possible outcomes of a statistical experiment is called the sample space. Examples:
1. For experiment 1, the sample space consists of the numbers 1, 2, 3, 4, 5, and 6.
2. For experiment 2, the sample space (the number of defective transistors found) consists of the numbers 0, 1, 2, …, 10.
3. For experiment 3, the sample space consists of all numbers that represent the possible weights of a randomly selected newborn child.
• Venn diagrams can be used to visualize the
relationships among events.

[Venn diagrams: the complement of A (not A); the union A or B; the intersection A and B]
Joint Event
The event A and B is called a joint event.
Example:
– A: the newborn child is a girl
– B: weight of the newborn child is above 3 kg
– A and B: the newborn child is a girl and her weight
is above 3 kg

Mutually Exclusive Events
Two events A and B are called mutually exclusive if A and B cannot occur simultaneously.
A: observe an odd number when rolling a die
B: observe a 6
A and B are mutually exclusive.

[Venn diagram: two disjoint circles A and B; A and B are mutually exclusive]
Probabilities of Events
[Venn diagram partitioning the sample space into four regions: A and B, A and (not B), (not A) and B, (not A) and (not B)]
Conditional Probabilities
• Conditional probability: the conditional probability of A occurring, given that B has occurred, is denoted by P(A|B) (read as "P of A given B") and is given by
P(A|B) = P(A and B)/P(B) (1)
This conditional probability is not defined if P(B) = 0. Similarly,
P(B|A) = P(A and B)/P(A) (2)
These two expressions can be written as
P(A and B) = P(B)P(A|B) (4)
and P(A and B) = P(A)P(B|A) (5)
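A short simulation (a sketch, not from the slides) makes definition (1) concrete. For one roll of a fair die, let B = "the roll is odd" and A = "the roll is greater than 2"; then P(A|B) = P({3, 5})/P({1, 3, 5}) = 2/3.

```python
import random

trials = 1_000_000
count_B = count_A_and_B = 0
for _ in range(trials):
    roll = random.randint(1, 6)
    if roll % 2 == 1:            # event B: roll is odd
        count_B += 1
        if roll > 2:             # event A: roll is greater than 2
            count_A_and_B += 1

# Equation (1): P(A|B) = P(A and B) / P(B)
print(count_A_and_B / count_B)   # close to 2/3
```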

The Multiplication Rule
• Independent events: if P(A) does not depend on whether B has occurred, then event A is independent of event B, and P(A) = P(A|B). An important consequence of the definition of independence is the following multiplication rule (whenever A is independent of B):
P(A and B) = P(A|B)P(B) = P(A)P(B) (6)
Using Equation (6) in Equation (2), we get
P(B|A) = P(A and B)/P(A) = P(A)P(B)/P(A) = P(B) (7)
Therefore, if A is independent of B, then B is also independent of A.
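To see rule (6) in action (again only a simulation sketch): roll two fair dice; "first die shows 6" and "second die shows 6" are independent, so the joint probability should be (1/6)(1/6) = 1/36 ≈ 0.0278.

```python
import random

trials = 1_000_000
both = sum(1 for _ in range(trials)
           if random.randint(1, 6) == 6 and random.randint(1, 6) == 6)

# Equation (6): P(A and B) = P(A) * P(B) for independent events
print(both / trials)   # close to 1/36 ≈ 0.0278
```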

Random Variables
• A random variable is the outcome of a random process that outputs a numeric value. The output of a random variable is called a random number. An example of a random variable is the process of randomly choosing a sample from some population and measuring one of its features.
– Often denoted with capital alphabetic symbols (X, Y, etc.)
– A normal random variable may be denoted as X ~ N(μ, σ)
• The probability distribution of a random variable X tells us what values X can take and how to assign probabilities to those values.

Discrete random variable
• Discrete random variable: a random variable that can take on a finite number of possible values or a countably infinite number of values. A discrete random variable is described by its distribution function, which lists for each outcome x the probability P(x) of x. If x1, x2, …, xn are all possible outcomes, then
P(x1) + P(x2) + … + P(xn) = 1, with each P(xi) ≥ 0.
• Examples:
– number of pets owned (0, 1, 2, …)
– numerical day of the month (1, 2, …, 31)
– the total number of tails you get if you flip 100 coins

Discrete example: roll of a die
[Plot: uniform pmf with p(x) = 1/6 for each x = 1, 2, 3, 4, 5, 6]
Probability Distribution Function (PDF)
x   p(x)
1   p(x=1) = 1/6
2   p(x=2) = 1/6
3   p(x=3) = 1/6
4   p(x=4) = 1/6
5   p(x=5) = 1/6
6   p(x=6) = 1/6
Total: 1.0
Cumulative Distribution Function (CDF)
[Plot: staircase CDF rising by 1/6 at each x = 1, 2, …, 6, through 1/6, 1/3, 1/2, 2/3, 5/6, reaching 1.0 at x = 6]
Cumulative Distribution Function (CDF)
x   P(x≤A)
1   P(x≤1) = 1/6
2   P(x≤2) = 2/6
3   P(x≤3) = 3/6
4   P(x≤4) = 4/6
5   P(x≤5) = 5/6
6   P(x≤6) = 6/6
Examples
1. What's the probability that you roll a 3 or less?
P(x≤3) = 1/2
2. What's the probability that you roll a 5 or higher?
P(x≥5) = 1 − P(x≤4) = 1 − 2/3 = 1/3
The Binomial Distribution
• A fixed-length sequence of events where each event has exactly two possible outcomes can be modeled by a binomial distribution.
– One of the outcomes is generally called success and the other is called failure.
– The probability of success, denoted by θ, is the same for all trials.
– The total number of successes k obtained in n trials is called a binomial random variable.
• The distribution function is given by
P(k) = [n! / (k!(n − k)!)] θ^k (1 − θ)^(n−k), k = 0, 1, …, n (9)
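A minimal sketch of equation (9) in Python (standard library only): the probability of k = 2 successes (heads) in n = 10 flips of a fair coin (θ = 0.5), plus a check that the probabilities sum to 1.

```python
from math import comb

def binomial_pmf(k: int, n: int, theta: float) -> float:
    """Equation (9): P(k) = C(n, k) * theta^k * (1 - theta)^(n - k)."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

print(binomial_pmf(k=2, n=10, theta=0.5))                  # ≈ 0.0439
print(sum(binomial_pmf(k, 10, 0.5) for k in range(11)))    # 1.0
```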
The Poisson Distribution
• The Poisson distribution is used to model random variables that may have a countably infinite number of outcomes: for example, the number of automobiles arriving at a tollbooth, the number of phone calls received by a call center per hour, or the number of decay events per second from a radioactive source.
• The distribution function is given by
P(n) = e^(−λ) λ^n / n!, n = 0, 1, 2, … (10)
• The value P(n) is interpreted as the probability that exactly n events will occur in a fixed time interval, and λ is the average number of events that occur in that length of time.
• It is assumed that the events occur randomly and independently with a constant probability of occurring in any small time interval, and that two events cannot occur at exactly the same time.
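A minimal sketch of equation (10) (standard library only; the call-center numbers are illustrative): if a call center receives λ = 4 calls per hour on average, the probability of exactly n = 2 calls in an hour.

```python
from math import exp, factorial

def poisson_pmf(n: int, lam: float) -> float:
    """Equation (10): P(n) = e^(-lam) * lam^n / n!."""
    return exp(-lam) * lam**n / factorial(n)

print(poisson_pmf(n=2, lam=4))   # ≈ 0.1465
```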
The Poisson Distribution
• Examples
– the number of automobiles arriving at a tollbooth
– the number of calls coming per minute into a hotel for booking
– the number of meteorites greater than 1 meter in diameter that strike Earth in a year
– the number of patients arriving in an emergency room between 10 and 11 pm
Continuous Random Variable
• A continuous random variable is described by a probability density function. This function is used to obtain the probability that the value of a continuous random variable is in a given interval.
• If the random variable is x and its density function is p(x), then the probability that x falls in the interval [a, b] is
P(a ≤ x ≤ b) = ∫[a to b] p(x) dx (11)
• In the case of a continuous random variable, the total probability integrates to 1:
∫[−∞ to ∞] p(x) dx = 1 (12)
• Cumulative distribution:
C(a) = P(x ≤ a) = ∫[−∞ to a] p(x) dx
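A sketch of equations (11) and (12) by numerical integration (numpy assumed available; the density p(x) = 2x on [0, 1] is an illustrative choice, not from the slides).

```python
import numpy as np

# Example density: p(x) = 2x on [0, 1]
x = np.linspace(0.0, 1.0, 10_001)
p = 2 * x

print(np.trapz(p, x))                 # total area = 1, equation (12)
half = x <= 0.5
print(np.trapz(p[half], x[half]))     # P(0 <= x <= 0.5) = 0.25, equation (11)
```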
Probability Density Function (PDF)
• The probability function that accompanies a continuous random variable is a continuous mathematical function that integrates to 1.
• The probabilities associated with continuous functions are just areas under the curve (integrals!).
• Probabilities are given for a range of values, rather than a particular value.
Continuous Random Variable
• The uniform density:
p(x) = 0 if x < a
p(x) = 1/(b − a) if a ≤ x ≤ b (13)
p(x) = 0 if x > b
C(x) = 0 if x < a
C(x) = (x − a)/(b − a) if a ≤ x ≤ b (14)
C(x) = 1 if x > b
Continuous Random Variable
• The exponential density:
p(t) = βe^(−βt), t ≥ 0 (15)
C(t) = 1 − e^(−βt), t ≥ 0 (16)
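A quick numerical check of (15) and (16) (numpy assumed; β = 2 is an arbitrary illustrative value): integrating the density from 0 to t should reproduce the cumulative distribution.

```python
import numpy as np

beta = 2.0
t = np.linspace(0.0, 3.0, 10_001)
p = beta * np.exp(-beta * t)          # density, equation (15)

t0 = 1.0
mask = t <= t0
print(np.trapz(p[mask], t[mask]))     # ≈ 0.8647, integral of (15) up to t0
print(1 - np.exp(-beta * t0))         # C(1) = 1 - e^(-2) ≈ 0.8647, eq. (16)
```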
The Normal Density Function
f(x) = (1/(σ√(2π))) e^(−(x − μ)²/(2σ²))
This is a bell-shaped curve with different centers and spreads depending on μ and σ.
Note the constants: π = 3.14159… and e = 2.71828…
The Normal Density Function
• It's a probability function, so no matter what the values of μ and σ, it must integrate to 1!
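A one-off numeric check of this claim (numpy assumed; μ = 3 and σ = 2 chosen arbitrarily):

```python
import numpy as np

mu, sigma = 3.0, 2.0
x = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 100_001)
f = np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

print(np.trapz(f, x))   # ≈ 1.0 for any mu and sigma
```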
The Shape of Normal Density
The normal distribution is bell-shaped and symmetrical around μ.
[Plot: bell curve with 90 and 110 marked on either side of μ = 100]
Why symmetrical? Let μ = 100. For x = 110 and for x = 90, the deviation from the mean is ±10, so (x − μ)² = 100 in both cases and the density takes the same value at both points.
Normal Probability Density
• The expected value (also called the mean) E(X) (or μ) can be any number.
• The standard deviation σ can be any nonnegative number.
• The total area under every normal curve is 1.
• There are infinitely many normal distributions.
Normal Probability Density
[Plot: total area = 1; symmetric around μ]
The effects of μ and σ
How does the standard deviation affect the shape of f(x)?
[Plot: curves with σ = 2, σ = 3, σ = 4; a larger σ gives a flatter, wider curve]
How does the expected value affect the location of f(x)?
[Plot: curves with μ = 10, μ = 11, μ = 12; a larger μ shifts the curve to the right]
Statistical Measures
• Center of the data
– Mean
– Median
• Variation
– Range
– Quartiles
– Variance
– Standard Deviation
– Covariance
– Correlation
Mean or Average or Expectation
For n data values x1, x2, …, xn: mean = (x1 + x2 + … + xn)/n.
Mean or Average
[Scatter plot of the points (5,6), (6,5), (2,4), (3,4), (5,5), (5,3), (2,1), (4,2), (1,1), (1,2), (3,1), with their mean at (3.3636, 3.0909)]
Median (M)
• A resistant measure of the data’s center
• At least half of the ordered values are less
than or equal to the median value
• At least half of the ordered values are greater
than or equal to the median value
• If n is odd, the median is the middle ordered
value
• If n is even, the median is the average of the
two middle ordered values
Median (M)
Location of the median: L(M) = (n + 1)/2, where n = sample size.
Example: if 25 data values are recorded, the median would be the (25 + 1)/2 = 13th ordered value.
Median
• Example 1 data: 2 4 6
Median (M) = 4

• Example 2 data: 2 4 6 8
Median = 5 (average of 4 and 6)

• Example 3 data: 6 2 4
Median ≠ 2
(order the values: 2 4 6 , so Median = 4)
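These examples can be checked directly (numpy assumed available):

```python
import numpy as np

print(np.median([2, 4, 6]))      # 4.0
print(np.median([2, 4, 6, 8]))   # 5.0  (average of the two middle values)
print(np.median([6, 2, 4]))      # 4.0  (values are sorted internally)
```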

Comparing the Mean & Median
• Computation of the mean is easier.
• Finding the median in higher dimensions is much more complex.
• The mean is prone to noise.
• The mean and median of data from a symmetric distribution should be close together. The actual (true) mean and median of a symmetric distribution are exactly the same.
Spread or Variability
• If all values are the same, then they all equal the mean. There is no variability.
– E.g.: 2, 2, 2, 2, 2, 2; mean = 2
• Variability exists when some values are different from (above or below) the mean.
– E.g.: 10, 15, −20, −22, 30, 22
• We will discuss the following measures of spread: range, quartiles, variance, and standard deviation.
Range
• One way to measure spread is to give the smallest (minimum) and largest (maximum) values in the data set:
Range = max − min
– E.g.: 10, −2, −7, 22, 0, 11; Range = 22 − (−7) = 29
• The range is strongly affected by outliers.
Quartiles
• Three numbers which divide the ordered data into four equal-sized groups.
• Q1 has 25% of the data below it.
• Q2 has 50% of the data below it (the median).
• Q3 has 75% of the data below it.

Quartiles: Uniform Distribution
[Plot: Q1, Q2, Q3 marked on a uniform distribution]
Obtaining the Quartiles
• Order the data.
• For Q2, just find the median.
• For Q1, look at the lower half of the data
values, those to the left of the median
location; find the median of this lower half.
• For Q3, look at the upper half of the data
values, those to the right of the median
location; find the median of this upper half.
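One caveat: several conventions exist for computing quartiles; the following sketch (plain Python, with a hypothetical helper name `quartiles`) implements the median-of-halves rule described above.

```python
import statistics

def quartiles(data):
    """Q1/Q2/Q3 via the median-of-halves rule described above."""
    values = sorted(data)
    n = len(values)
    q2 = statistics.median(values)
    lower = values[: n // 2]            # values left of the median location
    upper = values[(n + 1) // 2 :]      # values right of the median location
    return statistics.median(lower), q2, statistics.median(upper)

print(quartiles([1, 2, 3, 4, 5, 6, 7]))   # (2, 4, 6)
```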

Variance and Standard Deviation
• Recall that variability exists when some values are different from (above or below) the mean.
• Each data value xi has an associated deviation from the mean: xi − mean.

Deviations
• What is a typical deviation from the mean? (standard deviation)
• Small values of this typical deviation indicate small variability in the data.
• Large values of this typical deviation indicate large variability in the data.
Variance
• Variance is the average squared deviation from the mean of a set of data. It is used to find the standard deviation.

[Figures: each data value's deviation from the mean is squared, the squared deviations are summed, and the sum is divided by the number of data points]

Variance Formula
variance = [(x1 − mean)² + (x2 − mean)² + … + (xn − mean)²] / n
(For a sample, the sum of squared deviations is commonly divided by n − 1 instead of n; the worked example below uses n − 1.)
Standard Deviation
standard deviation = square root of the variance
Variance and Standard Deviation
Metabolic rates of 7 men (cal./24hr.):
1792 1666 1362 1614 1460 1867 1439
Mean = 11,200/7 = 1600

Variance and Standard Deviation
Observations   Deviations            Squared deviations
1792           1792 − 1600 = 192     (192)² = 36,864
1666           1666 − 1600 = 66      (66)² = 4,356
1362           1362 − 1600 = −238    (−238)² = 56,644
1614           1614 − 1600 = 14      (14)² = 196
1460           1460 − 1600 = −140    (−140)² = 19,600
1867           1867 − 1600 = 267     (267)² = 71,289
1439           1439 − 1600 = −161    (−161)² = 25,921
               sum = 0               sum = 214,870

Variance and Standard Deviation
Dividing by n − 1 = 6 (the sample-variance convention): variance = 214,870/6 ≈ 35,811.67 and standard deviation = √35,811.67 ≈ 189.24 cal./24hr.
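A quick check of this computation (numpy assumed; ddof=1 selects the n − 1 convention assumed above):

```python
import numpy as np

rates = np.array([1792, 1666, 1362, 1614, 1460, 1867, 1439])

print(rates.mean())          # 1600.0
print(rates.var(ddof=1))     # ≈ 35811.67 (sum of squared deviations / (n - 1))
print(rates.std(ddof=1))     # ≈ 189.24
```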
Variance (2D)
[Figures: for two-dimensional data, the variance of each variable is computed separately along its own axis]
• Variance doesn't explore the relationship between the variables.
Covariance
covariance(x, y) = [(x1 − x̄)(y1 − ȳ) + (x2 − x̄)(y2 − ȳ) + … + (xn − x̄)(yn − ȳ)] / n
[Scatter plots: when large x values pair with large y values the covariance is positive (positive relation); when large x values pair with small y values it is negative (negative relation); when there is no pattern it is near zero (no relation)]
Covariance
Data points: (2, 1), (2, 2), (4, 3), (6, 1), (8, 3), (1, 5), (4, 6), (4, 7), (6, 3), (6, 5), (6, 6)
Mean: (4.4545, 3.8182)
Deviations from the mean, e.g. (2, 1) − (4.4545, 3.8182) = (−2.4545, −2.8182); across all 11 points the deviations sum to (0, 0).
Covariance Matrix
The variances and covariances of the variables are collected in a matrix:
[ var(x)     cov(x, y) ]
[ cov(x, y)  var(y)    ]
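For the example data above, the covariance matrix can be computed directly (numpy assumed; bias=True makes np.cov divide by n, matching the formula above; drop it to divide by n − 1).

```python
import numpy as np

x = np.array([2, 2, 4, 6, 8, 1, 4, 4, 6, 6, 6])
y = np.array([1, 2, 3, 1, 3, 5, 6, 7, 3, 5, 6])

print(x.mean(), y.mean())          # 4.4545..., 3.8182...
print(np.cov(x, y, bias=True))     # 2x2 covariance matrix (divides by n)
```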
Correlation
[Scatter plots: positive relation, negative relation, no relation]
• Covariance determines whether the relation is positive or negative, but it cannot measure the degree to which the variables are related.
• Correlation is another way to determine how two variables are related.
• In addition to whether variables are positively or negatively related, correlation also tells the degree to which the variables are related to each other.
Correlation
correlation(x, y) = covariance(x, y) / (σx σy); it always lies between −1 and +1.
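A sketch for the same example data (numpy assumed): np.corrcoef returns the correlation, i.e. the covariance normalized by both standard deviations.

```python
import numpy as np

x = np.array([2, 2, 4, 6, 8, 1, 4, 4, 6, 6, 6])
y = np.array([1, 2, 3, 1, 3, 5, 6, 7, 3, 5, 6])

# Pearson correlation: covariance / (std(x) * std(y))
print(np.corrcoef(x, y)[0, 1])
```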
Multivariate Gaussians (or "multinormal distribution" or "multivariate normal distribution")
Univariate case: single mean μ and variance σ²:
p(x) = (1/(σ√(2π))) e^(−(x − μ)²/(2σ²))
Multivariate case: vector of observations x, vector of means μ, and covariance matrix Σ:
p(x) = (1/((2π)^(d/2) |Σ|^(1/2))) e^(−(1/2)(x − μ)ᵀ Σ⁻¹ (x − μ))
where d is the dimension of x and |Σ| is the determinant of Σ.
Multivariate Gaussians
In both the univariate and multivariate densities above, the leading factors are normalization constants that do not depend on x; only the exponent depends on x, and the quadratic form inside it is positive.
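A sketch evaluating the multivariate density (scipy assumed available; the values of μ and Σ are arbitrary illustrations).

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

# Density of a 2-variate Gaussian at the point x = (1, 1)
print(multivariate_normal(mean=mu, cov=Sigma).pdf([1.0, 1.0]))
```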
The mean vector
μ = E[x] = (E[x1], E[x2], …, E[xd])ᵀ
Covariance of two random variables
• Recall that for two random variables xi, xj:
Cov(xi, xj) = E[(xi − μi)(xj − μj)]
The covariance matrix
Σ = E[(x − μ)(x − μ)ᵀ], where ᵀ is the transpose operator.
The diagonal entries are the variances: Var(xm) = Cov(xm, xm).
An example: 2-variate case
Take a diagonal covariance matrix Σ = diag(σ1², σ2²), whose determinant is |Σ| = σ1² σ2². The pdf of the multivariate Gaussian is then:
p(x1, x2) = (1/(2π σ1 σ2)) e^(−(1/2)[(x1 − μ1)²/σ1² + (x2 − μ2)²/σ2²])
An example: 2-variate case
This pdf factorizes into two independent Gaussians:
p(x1, x2) = N(x1; μ1, σ1²) · N(x2; μ2, σ2²)
They are independent!
Recall that in the general case independence implies uncorrelatedness, but uncorrelatedness does not necessarily imply independence. The multivariate Gaussian is a special case where uncorrelatedness implies independence as well.
Diagonal covariance matrix
If all the variables are independent from each other, the covariance matrix will be a diagonal one.
The reverse is also true for Gaussians: if the covariance matrix is diagonal, the variables are independent.
Diagonal matrix: a matrix where the off-diagonal terms are zero.
Gaussian Intuitions: Size of Σ
[Plots: μ = [0 0] with Σ = I (the identity matrix), Σ = 0.6I, and Σ = 2I]
As Σ becomes larger, the Gaussian becomes more spread out.
Gaussian Intuitions: Off-diagonal
[Plots: as the off-diagonal entries increase, there is more correlation between the value of x and the value of y]
Gaussian Intuitions: Off-diagonal and Diagonal
• Decreasing the off-diagonal entries (#1-2) reduces the correlation between the two dimensions.
• Increasing the variance of one dimension in the diagonal (#3) stretches the Gaussian along that dimension.
Choosing a probability distribution
• The nature of the data source may determine the type of density function in some cases.
• The histogram of the data may suggest a model.
• Any assumed distribution can be tested using the chi-squared test or the Kolmogorov-Smirnov test.
• The most frequently used continuous density is the normal density.
• Linear functions of a normally distributed feature are also normally distributed.
• Features that are not normally distributed can sometimes be converted to approximately normal by a suitable transformation.
• A popular graphical test for determining whether a data set is approximately normally distributed is based on the normal probability plot.
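As a sketch of these checks (scipy assumed available; the simulated sample is illustrative): a Kolmogorov-Smirnov test and the correlation from a normal probability (Q-Q) plot.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=2.0, size=500)

# Kolmogorov-Smirnov test against a normal with the sample's own mean/std
print(stats.kstest(sample, 'norm', args=(sample.mean(), sample.std())))

# Normal probability plot: points near a straight line suggest normality
(osm, osr), (slope, intercept, r) = stats.probplot(sample, dist='norm')
print(r)   # plot correlation; close to 1 for approximately normal data
```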
