Statistics Refresher
CS771: Introduction to Machine Learning
Purushottam Kar

Expectation of a Random Variable

The expectation of a random variable, also called its expected value, is the mean or average value that the random variable takes, and is defined as

$$\mathbb{E}[X] = \sum_{x \in \mathrm{supp}(X)} x \cdot \mathbb{P}[X = x]$$

Sometimes the notation used is just $\mathbb{E} X$, i.e., the brackets are omitted.

The name suggests that the r.v. is expected to take this value. There is some truth to this: if we sample from $X$, we are "likely" to get a value "close" to $\mathbb{E}[X]$. What "close" and "likely" mean are topics in a learning theory course (e.g. CS777).

However, the name can be misleading – be careful not to read too much into it. $\mathbb{E}[X]$ need not be the most likely value for $X$. In fact, there are r.v.s which can never take this value, i.e., $\mathbb{P}[X = \mathbb{E}[X]] = 0$.

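To make the definition concrete, here is a minimal Python sketch (the PMF values are made up for illustration) that computes an expectation directly from a PMF:

```python
# Expectation of a discrete r.v.: E[X] = sum over x of x * P[X = x]
pmf = {1: 0.5, 2: 0.25, 3: 0.25}  # hypothetical PMF over supp(X) = {1, 2, 3}

expectation = sum(x * p for x, p in pmf.items())
print(expectation)  # 1.75 -- note that X can never take the value 1.75
```

Note how this tiny example already shows the warning above: $\mathbb{P}[X = 1.75] = 0$ even though $\mathbb{E}[X] = 1.75$.
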
Rules of Expectation: Sum Rule

Linearity of Expectation: given two r.v.s $X$ and $Y$, no matter how they are defined, and no matter whether they are independent or not, we always have

$$\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]$$

Proof: Let $Z = X + Y$ be a new r.v. The only possible values for $Z$ are of the form $z = x + y$ where $x \in \mathrm{supp}(X)$ and $y \in \mathrm{supp}(Y)$. Thus, we have

$$\mathbb{E}[Z] = \sum_{x,y} (x + y)\,\mathbb{P}[X = x, Y = y] = \sum_x x \sum_y \mathbb{P}[X = x, Y = y] + \sum_y y \sum_x \mathbb{P}[X = x, Y = y] = \mathbb{E}[X] + \mathbb{E}[Y]$$

Note that even if there are multiple ways of getting a value $z$, all of them have been taken into account.

Note that the only result we used in our proof is the law of total probability (in the second-last step above), which always holds no matter which r.v.s we have.

Note: the same proof shows that $\mathbb{E}[X - Y] = \mathbb{E}[X] - \mathbb{E}[Y]$.

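A quick simulation sketch (illustrative, not from the slides) checking that linearity holds even for strongly dependent r.v.s:

```python
import random

# Linearity of expectation needs no independence: here Y = X^2 depends on X,
# yet E[X + Y] = E[X] + E[Y] still holds.
random.seed(0)
xs = [random.choice([-1, 0, 1, 2]) for _ in range(100_000)]
ys = [x ** 2 for x in xs]

mean = lambda v: sum(v) / len(v)
print(mean([x + y for x, y in zip(xs, ys)]))  # empirical E[X + Y]
print(mean(xs) + mean(ys))                    # matches up to sampling noise
```
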
Rules of Expectation: Scaling Rule

Given a r.v. $X$ and a constant $c$, define a new r.v. $Y = c \cdot X$, i.e., on any outcome $\omega$, $Y(\omega) = c \cdot X(\omega)$. Then

$$\mathbb{E}[c \cdot X] = c \cdot \mathbb{E}[X]$$

Proof: any value that $Y$ takes is $c \cdot x$ for some $x \in \mathrm{supp}(X)$. Thus, we get $\mathbb{E}[Y] = \sum_x c \cdot x \cdot \mathbb{P}[X = x] = c \cdot \mathbb{E}[X]$.

The expectation of a constant random variable is the constant itself: if we have a r.v. that always gives the same value $c$ no matter what the outcome, then $\mathbb{E}[X] = c$. For example, if we have a fair coin and create a r.v. that takes one value for heads and another for tails, then its expectation (since the coin is fair) is the average of those two values – clearly a constant that does not depend on the outcome of any toss.

For any r.v. $X$ and constant $c$, we always have

$$\mathbb{E}[X + c] = \mathbb{E}[X] + c$$

Proof: Create a dummy random variable $Y$ that always takes the value $c$. Note that $Y$ is a constant r.v. (does not depend on the outcome $\omega$), so $\mathbb{E}[Y] = c$. Linearity gives us $\mathbb{E}[X + c] = \mathbb{E}[X] + \mathbb{E}[Y] = \mathbb{E}[X] + c$.

Note: the notation is horrible here. In the expression $\mathbb{E}[X + c]$, $X$ and $c$ do not refer to two r.v.s or the same r.v. repeated. Instead, just read $c$ as a constant.

Rules of Expectation

Law of the Unconscious Statistician (LOTUS): helps calculate expectations for complicated random variables easily.

Suppose we have a random variable $X$ whose PMF we know. Suppose there is a weird function $g$ and we define a new random variable $Y = g(X)$. Can we calculate $\mathbb{E}[Y]$?

Calculating $\mathbb{E}[Y]$ directly would require us to first get hold of the PMF of $Y$ – difficult! LOTUS gives us a way to use the PMF of $X$ itself to calculate $\mathbb{E}[Y]$:

$$\mathbb{E}[g(X)] = \sum_{x \in \mathrm{supp}(X)} g(x) \cdot \mathbb{P}[X = x]$$

Proof: much the same way we proved linearity of expectation.

Works no matter what r.v. we have, no matter how complicated $g$ is. The function $g$ does need to satisfy some very easy conditions – all functions we will look at in this course will satisfy these conditions.

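A small sketch (with a made-up PMF) comparing LOTUS against the long way of first building the PMF of $Y = g(X)$:

```python
# LOTUS: E[g(X)] computed from the PMF of X alone,
# versus first constructing the PMF of Y = g(X). Both agree.
from collections import defaultdict

pmf_x = {-2: 0.2, -1: 0.3, 1: 0.3, 2: 0.2}  # hypothetical PMF
g = lambda x: x ** 2                         # a "weird" function

# LOTUS: no need for the PMF of Y
e_lotus = sum(g(x) * p for x, p in pmf_x.items())

# The long way: first build the PMF of Y = g(X)
pmf_y = defaultdict(float)
for x, p in pmf_x.items():
    pmf_y[g(x)] += p
e_direct = sum(y * p for y, p in pmf_y.items())

print(e_lotus, e_direct)  # 2.2 2.2
```
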
Rules of Expectation: Product Rule

If $X, Y$ are two independent random variables, then we have stronger results on them:

$$\mathbb{E}[X \cdot Y] = \mathbb{E}[X] \cdot \mathbb{E}[Y]$$

Proof: Let $Z = X \cdot Y$ be a new r.v. The only possible values for $Z$ are of the form $z = x \cdot y$ where $x \in \mathrm{supp}(X)$ and $y \in \mathrm{supp}(Y)$. Thus, we have $\mathbb{E}[Z] = \sum_{x,y} x \cdot y \cdot \mathbb{P}[X = x, Y = y]$. Note that even if there are multiple ways of getting a value $z$, all of them have been taken into account. Using independence, i.e., $\mathbb{P}[X = x, Y = y] = \mathbb{P}[X = x] \cdot \mathbb{P}[Y = y]$, gives us

$$\mathbb{E}[Z] = \sum_x x\,\mathbb{P}[X = x] \cdot \sum_y y\,\mathbb{P}[Y = y] = \mathbb{E}[X] \cdot \mathbb{E}[Y]$$

Warning: this result crucially uses independence – it may fail if $X, Y$ are not independent.

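A tiny exact counterexample (illustrative) for the warning above, taking $Y = X$ so the two r.v.s are as dependent as possible:

```python
# The product rule needs independence. With X uniform on {-1, +1} and Y = X:
# E[X] * E[Y] = 0, but E[X * Y] = E[X^2] = 1.
pmf_x = {-1: 0.5, 1: 0.5}

e_x = sum(x * p for x, p in pmf_x.items())        # 0.0
e_xy = sum(x * x * p for x, p in pmf_x.items())   # 1.0, since Y = X
print(e_xy, "vs", e_x * e_x)                      # 1.0 vs 0.0
```
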
Sample Mean

Suppose we have a r.v. $X$ and we sample it again and again, say $n$ times. E.g. we have a dice/coin and we throw/toss it again and again.

Make sure that the samples are independent of each other. For example, in the coin case, toss the coin fairly $n$ times – do not just toss it once and then blindly repeat the value of the first toss $n$ times.

Using the values obtained in these repeated samples, say $x_1, \ldots, x_n$, we can get a very good estimate of $\mathbb{E}[X]$ if $n$ is sufficiently large. This is called the sample mean, or sample expectation, or empirical mean:

$$\hat\mu = \frac{1}{n} \sum_{i=1}^n x_i$$

Interesting fact: $\hat\mu$ is the point which is the closest to all samples in terms of squared distance, i.e., $\hat\mu = \arg\min_c \sum_{i=1}^n (x_i - c)^2$ (Proof: use first-order optimality).

Interesting fact: even the mean itself satisfies the nice property $\mathbb{E}[X] = \arg\min_c \mathbb{E}[(X - c)^2]$.

Note: the sample mean can give answers that need careful analysis. Indeed! If we ask 1000 random Indians how many children they have, the sample mean might come out to be 2.35. However, no Indian can have 2.35 children, since the number of children has to be an integer! Yes, that is why we warned not to take the expectation/sample mean literally. All that your experiment tells you is that most Indians have around 2.35 children. Some may have many more (e.g. 7) or many fewer (e.g. 0), but they are usually rarer.

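A small simulation sketch (illustrative) of the sample mean converging to the expectation as $n$ grows:

```python
import random

# A fair die has E[X] = 3.5; the empirical mean approaches it as n grows.
random.seed(0)
for n in (10, 1_000, 100_000):
    xs = [random.randint(1, 6) for _ in range(n)]
    print(n, sum(xs) / n)  # gets closer to 3.5 as n grows
```
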
Mode of a Random Variable

The mode of a random variable is simply the value(s) that the r.v. takes with the highest probability:

$$\mathrm{mode}(X) = \arg\max_{x \in \mathrm{supp}(X)} \mathbb{P}[X = x]$$

Warning: a r.v. may have more than one mode value.

Given a number of samples $x_1, \ldots, x_n$ of $X$, we can define the empirical mode similarly – simply the value that appears most frequently in the samples:

$$\widehat{\mathrm{mode}} = \arg\max_v \sum_{i=1}^n \mathbb{I}\{x_i = v\}$$

Recall that $\mathbb{I}\{\text{blah}\} = 1$ if blah is true (or blah happens), else $\mathbb{I}\{\text{blah}\} = 0$ if blah does not happen or is false.

Note: the mode of a random variable (or even of samples) is always in $\mathrm{supp}(X)$, i.e., always a valid value that the r.v. can actually take (unlike the expectation).

Median of a Random Variable

The median of a random variable $X$ is a value $m$ that satisfies $\mathbb{P}[X \leq m] \geq \frac{1}{2}$ as well as $\mathbb{P}[X \geq m] \geq \frac{1}{2}$.

Often we talk about the median income of a country – this is a value such that half the population earns at least that much income.

The empirical median of a set of $n$ independent samples of a random variable is defined to be a value such that as many samples are greater than or equal to it as are less than or equal to it.

To find the empirical median, first arrange the samples in increasing order, i.e., $x_{(1)} \leq x_{(2)} \leq \ldots \leq x_{(n)}$. If $n$ is odd, then $\hat m = x_{((n+1)/2)}$. If $n$ is even, there may be (infinitely) many empirical medians, but we often take $\hat m = \frac{1}{2}\left(x_{(n/2)} + x_{(n/2+1)}\right)$.

The empirical median gives a good estimate of the median of the r.v. if $n$ is large.

Interesting Fact: the empirical median is the point which is the closest to all samples in terms of absolute distance, i.e., $\hat m \in \arg\min_c \sum_{i=1}^n |x_i - c|$ (Proof: in notes).

Interesting fact: even the median itself satisfies the nice property $\mathrm{median}(X) \in \arg\min_c \mathbb{E}\,|X - c|$.

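A brute-force sketch (illustrative) checking that the empirical median minimizes total absolute distance, and is unmoved by an outlier:

```python
# The empirical median minimizes sum_i |x_i - c| over c.
xs = [1, 2, 2, 3, 10]  # n odd, so the empirical median is x_(3) = 2

def total_abs_dist(c, xs):
    return sum(abs(x - c) for x in xs)

candidates = [v / 10 for v in range(0, 120)]  # grid over [0, 12)
best = min(candidates, key=lambda c: total_abs_dist(c, xs))
print(best)  # 2.0 -- the empirical median, despite the outlier at 10
```
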
Variance

Tells us how "spread out" the values that an r.v. takes are. Specifically, how far away from its expectation does the r.v. often take values.

For a random variable $X$ with expectation $\mu = \mathbb{E}[X]$, its variance, denoted as $\mathrm{Var}(X)$ or $\mathbb{V}[X]$ or often just as $\sigma^2$, can be defined as

$$\mathrm{Var}(X) = \mathbb{E}\left[(X - \mu)^2\right]$$

Can be simplified to obtain another (equivalent) definition:

$$\mathrm{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$$

Notice: $(X - \mu)^2 \geq 0$ for all outcomes, which means $\mathbb{E}[(X - \mu)^2] \geq 0$, which means that $\mathrm{Var}(X) \geq 0$ for all r.v.s. Also, $\mathbb{E}[X^2] \geq (\mathbb{E}[X])^2$ for all r.v.s.

Standard deviation: the square root of the variance (denoted $\sigma$).

Example

[Figure: two bar-chart PMFs $\mathbb{P}[X]$ over the support $X \in \{0, 1, \ldots, 6\}$, y-axis from 0 to 0.4. The second distribution has the same mean and median as the first one but is more "spread out", hence larger variance.]

Sample Variance

Given $n$ independent samples $x_1, \ldots, x_n$ of a random variable $X$, the empirical variance can be calculated in two (equivalent) ways. First find the empirical mean $\hat\mu = \frac{1}{n}\sum_{i=1}^n x_i$.

Method 1: Calculate $\frac{1}{n}\sum_{i=1}^n (x_i - \hat\mu)^2$.

Method 2: First calculate $\frac{1}{n}\sum_{i=1}^n x_i^2$ and then get $\frac{1}{n}\sum_{i=1}^n x_i^2 - \hat\mu^2$.

Both methods should give the same answer (unless overflow errors occur). Method 2 is preferred when data is not available all at once since it can be computed using running averages. Method 1 requires two passes over the data.

However, method 2 can be bad if $\frac{1}{n}\sum_i x_i^2$ and $\hat\mu^2$ are both very large and close. This is an effect called catastrophic cancellation. Basically, on computers, due to finite precision, two very large and very close numbers may be stored only approximately (to save space), so subtracting them can ignore exactly the digits that matter and cause us to get a wildly inaccurate difference.

As before, if $n$ is large, the empirical variance is a good estimate of $\mathrm{Var}(X)$.

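A sketch (with contrived numbers) showing catastrophic cancellation bite the one-pass formula:

```python
# The two (mathematically equivalent) variance formulas can disagree badly
# in floating point when the data has a huge mean relative to its spread.
xs = [1e9 + 0.1, 1e9 + 0.2, 1e9 + 0.3]  # tiny spread around a huge mean
n = len(xs)
mu = sum(xs) / n

var_m1 = sum((x - mu) ** 2 for x in xs) / n     # two-pass: accurate
var_m2 = sum(x * x for x in xs) / n - mu ** 2   # one-pass: cancellation!

print(var_m1)  # ~0.00667, the true variance of the offsets
print(var_m2)  # typically a garbage value (can even come out negative)
```
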
Covariance

If we have two r.v.s $X, Y$ then the covariance of these two r.v.s tells us how they behave in tandem:

$$\mathrm{Cov}(X, Y) = \mathbb{E}\left[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\right] = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y]$$

Example 1: let education level and income be two random variables defined on the sample space of all Indians – it is expected that if the education of a person is higher than the mean education level of all Indians, then their income should also be higher than the mean income level of all Indians.

Example 2: let age and sleeping hours be two different r.v.s – it is expected that if the age of a person is higher than the mean age of all Indians, then the person would sleep fewer than the average number of hours (since children typically sleep more and old people tend to sleep less).

Note that $\mathrm{Cov}(X, X) = \mathrm{Var}(X)$.

We can estimate covariance using samples too. Suppose we are given values of $X, Y$ on $n$ outcomes (i.e., we sampled $n$ outcomes and on each outcome recorded the pair $(x_i, y_i)$). Then the sample covariance can be computed in two ways. First calculate the empirical means $\hat\mu_X$ and $\hat\mu_Y$.

Method 1: Calculate $\frac{1}{n}\sum_{i=1}^n (x_i - \hat\mu_X)(y_i - \hat\mu_Y)$.

Method 2: First calculate $\frac{1}{n}\sum_{i=1}^n x_i y_i$ and then get $\frac{1}{n}\sum_{i=1}^n x_i y_i - \hat\mu_X \hat\mu_Y$.

Just as before, both methods always give the same answer. Method 2 is useful when data is not available all at once, but can be bad if the two terms are both very large in magnitude but close together.

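A short sketch (made-up data) computing the sample covariance both ways:

```python
# Sample covariance, Method 1 (centered) and Method 2 (one-pass).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]  # roughly ys ~ 2 * xs, so covariance is positive
n = len(xs)
mu_x, mu_y = sum(xs) / n, sum(ys) / n

cov_m1 = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / n
cov_m2 = sum(x * y for x, y in zip(xs, ys)) / n - mu_x * mu_y
print(cov_m1, cov_m2)  # identical (up to float rounding), and > 0
```
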
Rules of Variance

Suppose $a, b$ are any two constants and $X, Y$ are any two r.v.s. Then:

Constant Rule: $\mathrm{Var}(a) = 0$, i.e., if $X$ is a constant r.v. then $\mathrm{Var}(X) = 0$. Seems intuitive since a constant r.v. does not vary at all, i.e., zero variance.

Scaling Rule: $\mathrm{Var}(aX) = a^2\,\mathrm{Var}(X)$

Shift Rule: $\mathrm{Var}(X + b) = \mathrm{Var}(X)$, i.e., if $Y = X + b$ then $\mathrm{Var}(Y) = \mathrm{Var}(X)$. Shifting a random variable does not change its "spread". This is often used to deal with catastrophic cancellation, by shifting the data to make it smaller in magnitude while leaving the variance unchanged (see the sketch after this slide).

Sum Rule: $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y)$

Difference Rule: $\mathrm{Var}(X - Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) - 2\,\mathrm{Cov}(X, Y)$

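A sketch of the shift-rule fix for the catastrophic cancellation example from the sample variance slide (the choice of shifting by the first sample is just one convenient option):

```python
# Var(X - b) = Var(X), so shift by (roughly) the data's magnitude before
# using the one-pass formula.
xs = [1e9 + 0.1, 1e9 + 0.2, 1e9 + 0.3]
n = len(xs)

shifted = [x - xs[0] for x in xs]  # small numbers now; variance unchanged
mu = sum(shifted) / n
var = sum(v * v for v in shifted) / n - mu ** 2
print(var)  # ~0.00667, accurate even with the one-pass formula
```
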
Rules of Covariance

Suppose $a, b$ are any two constants and $X, Y$ are any two r.v.s. Then:

Constant Rule: $\mathrm{Cov}(X, a) = 0$

Symmetry Rule: $\mathrm{Cov}(X, Y) = \mathrm{Cov}(Y, X)$

Scaling Rule: $\mathrm{Cov}(aX, bY) = ab\,\mathrm{Cov}(X, Y)$

Shift Rule: $\mathrm{Cov}(X + a, Y + b) = \mathrm{Cov}(X, Y)$

If $X, Y$ are independent then $\mathrm{Cov}(X, Y) = 0$. Proof: $\mathrm{Cov}(X, Y) = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y] = 0$, where we applied the product rule for expectations above.

Corollary: If $X, Y$ are independent r.v.s, then $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$.

In books/papers, you may come across a term called correlation, which is a normalized version of covariance: $\rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$. For any two r.v.s $X, Y$, we always have $\rho(X, Y) \in [-1, 1]$. If $\mathrm{Cov}(X, Y) = 0$ then the two r.v.s are said to be uncorrelated. Note that if $X, Y$ are uncorrelated, then we also have $\mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y]$. Warning: independent r.v.s are always uncorrelated, but not all uncorrelated r.v.s need be independent. Correlation can be estimated using samples as well.

If $\mathrm{Cov}(X, Y) < 0$, this means that typically, whenever $X$ takes larger values than its own mean, $Y$ takes smaller values than its own mean and vice versa. If $\mathrm{Cov}(X, Y) > 0$, then this means that both r.v.s take values larger or smaller than their respective means together. $\mathrm{Cov}(X, Y) = 0$ means that typically, even if $X$ takes a value larger than its mean, $Y$ may take smaller or larger values than its own mean.

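An exact little example (illustrative) of the warning above – uncorrelated but not independent:

```python
# X uniform on {-1, 0, 1} and Y = X^2: Cov(X, Y) = 0, yet knowing X
# completely determines Y, so they are clearly not independent.
pmf_x = {-1: 1/3, 0: 1/3, 1: 1/3}

e_x = sum(x * p for x, p in pmf_x.items())           # E[X]   = 0
e_y = sum(x**2 * p for x, p in pmf_x.items())        # E[Y]   = 2/3
e_xy = sum(x * x**2 * p for x, p in pmf_x.items())   # E[XY]  = E[X^3] = 0
print(e_xy - e_x * e_y)  # Cov(X, Y) = 0.0, despite total dependence
```
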
Conditional Statistics

The notation is used to express how one quantity behaves when some other quantities are fixed to some given values.

These "other" quantities could be random variables themselves, or even constants. Sometimes we condition just to clarify exactly what those constants are.

For example, we could ask: what is the probability of me misclassifying a test data point if I use a model $\mathbf{w}$, i.e., $\mathbb{P}[\text{mistake} \mid \mathbf{w}]$? Here $\mathbf{w}$ is not a random variable (it could be in other settings, but here it is not).

We previously saw conditional probabilities. Let us see other quantities that can be defined conditionally.

Conditional Statistics

Conditional Expectation: $\mathbb{E}[X \mid Y = y] = \sum_x x \cdot \mathbb{P}[X = x \mid Y = y]$

Conditional Variance: $\mathrm{Var}(X \mid Y = y) = \mathbb{E}\left[(X - \mu_{X|y})^2 \mid Y = y\right]$, where we have $\mu_{X|y} = \mathbb{E}[X \mid Y = y]$

Conditional Covariance: $\mathrm{Cov}(X, Y \mid Z = z) = \mathbb{E}\left[(X - \mu_{X|z})(Y - \mu_{Y|z}) \mid Z = z\right]$, where $\mu_{X|z} = \mathbb{E}[X \mid Z = z]$ and $\mu_{Y|z} = \mathbb{E}[Y \mid Z = z]$

Conditional Mode: $\arg\max_x \mathbb{P}[X = x \mid Y = y]$

Similarly we can define the conditional median etc., but these are not very popular.

Note: these definitions do not require the r.v.s to be independent at all!

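A minimal sketch (with a made-up joint PMF) of computing a conditional expectation; the helper name is illustrative:

```python
# Conditional expectation from a small joint PMF: joint[(x, y)] = P[X=x, Y=y]
joint = {(0, 0): 0.3, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.4}

def cond_expectation_x(joint, y):
    """E[X | Y = y] = sum_x x * P[X = x | Y = y]."""
    p_y = sum(p for (x, yy), p in joint.items() if yy == y)
    return sum(x * p / p_y for (x, yy), p in joint.items() if yy == y)

print(cond_expectation_x(joint, 0))  # 0.1 / 0.4 = 0.25
print(cond_expectation_x(joint, 1))  # 0.4 / 0.6 ~ 0.667
```
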
Conditional Statistics

Rules of expectation (sum, scaling, LOTUS, product) all continue to hold even with conditioning, except that all expectations become conditional:

$$\mathbb{E}[X + Y \mid Z = z] = \mathbb{E}[X \mid Z = z] + \mathbb{E}[Y \mid Z = z]$$

If $X, Y$ are independent given $Z = z$, then $\mathbb{E}[XY \mid Z = z] = \mathbb{E}[X \mid Z = z] \cdot \mathbb{E}[Y \mid Z = z]$.

Rules of variance and covariance also continue to hold if we systematically condition all expressions involved in those rules.

Note: the conditioning must be the same everywhere, i.e., it may happen that $\mathbb{E}[X + Y \mid Z = z] \neq \mathbb{E}[X \mid Z = z'] + \mathbb{E}[Y \mid Z = z'']$ when $z' \neq z''$.

Statistics of Random Vectors

The expectation of a random vector is simply another vector (of the same dimension) of the expectations of the individual random variables:

$$\mathbb{E}[\mathbf{x}] = \left(\mathbb{E}[x_1], \mathbb{E}[x_2], \ldots, \mathbb{E}[x_d]\right)^\top$$

Linearity of expectation continues to hold: for any two vector r.v.s $\mathbf{x}, \mathbf{y}$ (not necessarily independent), $\mathbb{E}[\mathbf{x} + \mathbf{y}] = \mathbb{E}[\mathbf{x}] + \mathbb{E}[\mathbf{y}]$.

Scaling Rule: If $c$ is a constant then $\mathbb{E}[c \cdot \mathbf{x}] = c \cdot \mathbb{E}[\mathbf{x}]$.

Dot Product Rule: If $\mathbf{a}$ is a constant vector, then $\mathbb{E}[\mathbf{a}^\top \mathbf{x}] = \mathbf{a}^\top \mathbb{E}[\mathbf{x}]$.

Proof: $\mathbb{E}[\mathbf{a}^\top \mathbf{x}] = \mathbb{E}\left[\sum_i a_i x_i\right] = \sum_i a_i\,\mathbb{E}[x_i] = \mathbf{a}^\top \mathbb{E}[\mathbf{x}]$, using the sum and scaling rules.

Matrix Product Rule: If $A$ is a constant matrix then $\mathbb{E}[A\mathbf{x}] = A\,\mathbb{E}[\mathbf{x}]$.

Proof: Use the Dot Product Rule once per row of $A$.

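A quick NumPy sketch (illustrative values) checking the matrix product rule empirically:

```python
import numpy as np

# Checking E[Ax] = A E[x] by simulation.
rng = np.random.default_rng(0)
A = np.array([[1.0, 2.0], [0.0, -1.0]])                       # constant matrix
x = rng.normal(loc=[3.0, -1.0], scale=1.0, size=(100_000, 2))  # samples of x

lhs = (x @ A.T).mean(axis=0)   # empirical E[Ax]
rhs = A @ x.mean(axis=0)       # A times empirical E[x]
print(lhs, rhs)                # agree up to sampling noise
```
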
Statistics of Random Vectors

Mode is easy to define: $\mathrm{mode}(\mathbf{x}) = \arg\max_{\mathbf{v}} \mathbb{P}[\mathbf{x} = \mathbf{v}]$.

Median is not easy to define – there is no unique definition.

Definition 1: take the median of each coordinate separately.

Definition 2: the minimizer of total absolute distance (in this case the L1 norm): $\arg\min_{\mathbf{c}} \mathbb{E}\,\|\mathbf{x} - \mathbf{c}\|_1$.

Note: even here we still have $\mathbb{E}[\mathbf{x}] = \arg\min_{\mathbf{c}} \mathbb{E}\,\|\mathbf{x} - \mathbf{c}\|_2^2$.

Proof: Taking the derivative w.r.t. $\mathbf{c}$ and using first-order optimality does the trick.

Statistics of Random Vectors

If $\mathbf{x}$ is a vector, isn't $\mathbb{V}[\mathbf{x}]$ a matrix? What does $\mathbb{E}[\mathbf{x}\mathbf{x}^\top]$ even mean? Just as a random vector is a collection of random variables arranged as a 1D array, a random matrix is a collection of r.v.s arranged as a 2D array!

Since random vectors are a bunch of real-valued r.v.s, to specify the variance of this collection we need all pairwise covariances. This gives the covariance matrix $\Sigma$ with entries $\Sigma_{ij} = \mathrm{Cov}(x_i, x_j)$.

Another cute formula:

$$\Sigma = \mathbb{E}\left[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^\top\right] = \mathbb{E}[\mathbf{x}\mathbf{x}^\top] - \boldsymbol{\mu}\boldsymbol{\mu}^\top, \quad \text{where } \boldsymbol{\mu} = \mathbb{E}[\mathbf{x}]$$

Note that the $(i, j)$-th entry of the matrix $\mathbf{x}\mathbf{x}^\top$ is $x_i x_j$. Thus, the $(i, j)$-th entry of $\mathbb{E}[\mathbf{x}\mathbf{x}^\top]$ is $\mathbb{E}[x_i x_j]$.

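A NumPy sketch (illustrative parameters) checking the cute formula empirically:

```python
import numpy as np

# Sigma = E[x x^T] - mu mu^T, verified on samples.
rng = np.random.default_rng(0)
x = rng.multivariate_normal(mean=[1.0, -2.0],
                            cov=[[2.0, 0.5], [0.5, 1.0]], size=200_000)

mu = x.mean(axis=0)
second_moment = (x[:, :, None] * x[:, None, :]).mean(axis=0)  # E[x x^T]
sigma = second_moment - np.outer(mu, mu)
print(np.round(sigma, 2))  # close to [[2.0, 0.5], [0.5, 1.0]]
```
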
Useful Operations on Vector R.V.s

Can you prove that the covariance matrix of any random vector is always a PSD matrix?

If $\mathbf{x}, \mathbf{y}$ are two random vectors (not necessarily independent), then

$$\mathrm{Cov}(\mathbf{x}, \mathbf{y}) = \mathbb{E}\left[(\mathbf{x} - \boldsymbol{\mu}_{\mathbf{x}})(\mathbf{y} - \boldsymbol{\mu}_{\mathbf{y}})^\top\right],$$

where $\boldsymbol{\mu}_{\mathbf{x}} = \mathbb{E}[\mathbf{x}]$ and $\boldsymbol{\mu}_{\mathbf{y}} = \mathbb{E}[\mathbf{y}]$.

Dot Product Rule: If $\mathbf{a}$ is a constant vector, then $\mathrm{Var}(\mathbf{a}^\top \mathbf{x}) = \mathbf{a}^\top \Sigma\,\mathbf{a}$, where $\Sigma$ is the covariance matrix of $\mathbf{x}$.

Proof: $\mathrm{Var}(\mathbf{a}^\top \mathbf{x}) = \mathbb{E}\left[\left(\mathbf{a}^\top(\mathbf{x} - \boldsymbol{\mu})\right)^2\right] = \mathbb{E}\left[\mathbf{a}^\top(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^\top \mathbf{a}\right] = \mathbf{a}^\top \Sigma\,\mathbf{a}$

Matrix Product Rule: If $A$ is a constant matrix then $\mathrm{Var}(A\mathbf{x}) = A \Sigma A^\top$.

Proof: Try arguing similarly to the dot product rule.

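The dot product rule also answers the question above: $\mathbf{a}^\top \Sigma\,\mathbf{a} = \mathrm{Var}(\mathbf{a}^\top \mathbf{x}) \geq 0$ for every $\mathbf{a}$, which is exactly the definition of PSD. A small numerical sketch (illustrative):

```python
import numpy as np

# Empirical covariance matrices are PSD: a^T Sigma a = Var(a^T x) >= 0.
rng = np.random.default_rng(0)
x = rng.standard_normal((10_000, 3)) @ rng.standard_normal((3, 3))
sigma = np.cov(x, rowvar=False)

print(np.linalg.eigvalsh(sigma) >= -1e-10)  # all True: eigenvalues >= 0
a = rng.standard_normal(3)
print(a @ sigma @ a >= 0)                   # True for any direction a
```
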
Continuous Random Variables

These are r.v.s that can take infinitely many possible values that are not discrete but continuous, i.e., the support is $\mathbb{R}$ or some subset of $\mathbb{R}$.

The notion of a PMF, which tells us with what probability the r.v. takes this value or that value, does not make sense when the support is continuous.

Instead of a PMF, we use a PDF (probability density function) in such cases. To specify the probability distribution of $X$ we use a PDF $f_X$. The PDF takes a value in the support of the r.v. and spits out a non-negative number.

We do have an exact formula: in general, if the r.v. has a PDF $f_X$, then for any interval $[s, t]$ within its support, we have

$$\mathbb{P}\left[X \in [s, t]\right] = \int_s^t f_X(x)\,dx$$

Warning: $f_X(x)$ is very different from $\mathbb{P}[X = x]$. Note: $f_X$ can take values greater than 1 as well, but they must not be negative.

Interpretation: For any $x$, the value $f_X(x)$ tells us how likely $X$ is to take a value around $x$, i.e., for some teeny $\delta > 0$, we have $\mathbb{P}\left[X \in [x - \delta, x + \delta]\right] \approx 2\delta \cdot f_X(x)$. This approximation is good only if the interval is "small" – how small is "small" enough depends on the PDF.

Continuous R.V.s – the Rules Revisited

The PDF of a r.v. satisfies $f_X(x) \geq 0$ for all $x$ and $\int f_X(x)\,dx = 1$.

Expectation of a continuous r.v.: $\mathbb{E}[X] = \int x\,f_X(x)\,dx$

LOTUS: $\mathbb{E}[g(X)] = \int g(x)\,f_X(x)\,dx$

Variance of a continuous r.v.: $\mathrm{Var}(X) = \int (x - \mu)^2\,f_X(x)\,dx$, where $\mu = \mathbb{E}[X]$

Joint PDFs make sense too: e.g., $\mathbb{P}[X \in [a, b],\, Y \in [c, d]] = \int_a^b \int_c^d f_{XY}(x, y)\,dy\,dx$

They make sense even if $X$ is continuous and $Y$ is discrete and vice versa. Details of these constructions, however, are beyond the scope of CS771.

Continuous R.V.s – the Rules Revisited

Marginal probabilities continue to make sense: $f_X(x) = \int f_{XY}(x, y)\,dy$.

Conditional probabilities also make sense.

When both $X$ and $Y$ are continuous: $f_{Y|X}(y \mid x) = \frac{f_{XY}(x, y)}{f_X(x)}$.

Wait! If $X$ is continuous, even if $\mathbb{P}[X = x] = 0$, what is $f_{Y|X}(y \mid x)$? In general in such cases you would be right to suspect a divide-by-zero problem here. However, it is possible to still define such quantities using limits, or a powerful technique called the Radon-Nikodym derivative.

When $X$ is discrete but $Y$ is continuous, conditioning still makes sense in this case – details beyond CS771.

When $X$ is continuous but $Y$ is discrete, conditioning also makes sense in this case – details beyond CS771.

Conditional expectations and (co)variances are also defined similarly. Tricky to define in some cases as above – details beyond the scope of CS771.

Continuous R.V.s – the Rules Revisited

Rules of Probability: All rules – Sum, Product, Chain, Bayes, Complement, Union – continue to hold.

If $X, Y$ are independent continuous r.v.s then $f_{XY}(x, y) = f_X(x) \cdot f_Y(y)$.

For independent continuous r.v.s we continue to have $\mathbb{E}[XY] = \mathbb{E}[X] \cdot \mathbb{E}[Y]$.

Rules of Expectation: All rules – Linearity, Scaling, Product – still hold.

Rules of (co)Variance: All rules – Constant, Scaling, Shift, Sum – still hold.

Probability Distributions

 Some distributions popularly used in ML
 Discrete distributions: Bernoulli, Rademacher
 Continuous distributions: Uniform, Gaussian, Laplacian
 Some of their properties, such as mean, median etc.
 Notion of a parametric distribution

Bernoulli Distributions

These are probability distributions over the support $\{0, 1\}$.

Very useful in binary classification, as labels are often named $0$ and $1$.

Arguably the simplest of all distributions. The PMF of a r.v. with a Bernoulli distribution is uniquely specified by just specifying $p = \mathbb{P}[X = 1]$. Using the complement rule we automatically get $\mathbb{P}[X = 0] = 1 - p$.

$p$ is called the "success probability" or "bias". Do not confuse this with the bias of a linear model – not the same thing!

Mean: $\mathbb{E}[X] = p$

Mode: $1$ if $p > \frac{1}{2}$, $0$ if $p < \frac{1}{2}$, both $0$ and $1$ if $p = \frac{1}{2}$

Variance: $\mathrm{Var}(X) = p(1 - p)$

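A sampling sketch (illustrative) confirming the mean and variance formulas:

```python
import random

# Bernoulli(p) samples: empirical mean ~ p, empirical variance ~ p(1 - p).
random.seed(0)
p, n = 0.3, 100_000
xs = [1 if random.random() < p else 0 for _ in range(n)]

mean = sum(xs) / n
var = sum((x - mean) ** 2 for x in xs) / n
print(mean, var)  # ~0.3 and ~0.21 = 0.3 * 0.7
```
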
Rademacher Distributions

These are probability distributions over the support $\{-1, +1\}$.

Very similar to Bernoulli distributions, except that the support is different. If $X$ is distributed as Bernoulli then $2X - 1$ is distributed as Rademacher. If $X$ is distributed as Rademacher then $\frac{X + 1}{2}$ is distributed as Bernoulli.

Also an extremely simple distribution. The PMF of a r.v. with a Rademacher distribution is uniquely specified by just specifying $p = \mathbb{P}[X = +1]$. Using the complement rule we automatically get $\mathbb{P}[X = -1] = 1 - p$.

Often, papers refer to the Rademacher distribution only in the special case $p = \frac{1}{2}$.

Mean: $\mathbb{E}[X] = 2p - 1$ (Hint: use the scaling and sum rules for expectation)

Mode: $+1$ if $p > \frac{1}{2}$, $-1$ if $p < \frac{1}{2}$, both if $p = \frac{1}{2}$

Variance: $\mathrm{Var}(X) = 4p(1 - p)$ (Hint: use the scaling and shift rules for variance)

Uniform Distribution

Can be defined over any finite interval $[a, b]$. Let $X$ be a continuous r.v. with support $[a, b]$. Then $X$ is said to have a uniform distribution if its PDF is a constant function (uniform density), i.e.,

$$f_X(x) = \frac{1}{b - a} \text{ for } x \in [a, b], \qquad f_X(x) = 0 \text{ otherwise}$$

Recall that we commented that although we must have $f_X(x) \geq 0$, we need not have $f_X(x) \leq 1$. Note that in the uniform case, if we have $b - a < 1$ then indeed $f_X(x) > 1$, and it is perfectly fine.

[Figure: the PDF $f_X(x)$ is flat at height $\frac{1}{b - a}$ over $[a, b]$, with the mean $\frac{a + b}{2}$ at the center.]

Mean: $\mathbb{E}[X] = \frac{a + b}{2}$

Variance: $\mathrm{Var}(X) = \frac{(b - a)^2}{12}$

Note: the variance increases as $b - a$ increases, since the r.v. is more "spread out".

Notation: Often we use $\mathrm{Unif}[a, b]$ to denote the uniform distribution over $[a, b]$.

Gaussian (aka Normal) Distributions

Arguably one of the most popular of all probability distributions.

Models our intuitive assumption that in real life, data often takes values around its mean value, and it gets unlikely to witness extreme values. Fundamental results in probability theory – the law of large numbers and the central limit theorem – show that some form of this is indeed true.

Gaussian Distributions

Specifying a Bernoulli/Rademacher distribution took one number ($p$). Specifying a categorical distribution over $K$ elements takes $K - 1$ numbers. Specifying a Gaussian distribution over $\mathbb{R}$ requires two numbers:

$\mu$: must be a real number (may be negative or positive or even zero)

$\sigma^2$: must be a non-negative real number

Notation: the PDF for a Gaussian r.v. $X$, i.e., $f_X(x)$, is often written as $\mathcal{N}(x; \mu, \sigma^2)$ or simply as $\mathcal{N}(\mu, \sigma^2)$:

$$\mathcal{N}(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

Notice that even here we condition on constants (either using the $;$ or the $\mid$ symbol).

The notation is no accident – if the PDF of a r.v. is $\mathcal{N}(\mu, \sigma^2)$, then $\mathbb{E}[X] = \mu =$ Mode $=$ Median, as well as $\mathrm{Var}(X) = \sigma^2$. Requires a bit of integration to prove these results.

Indeed! Look at these two Gaussians. The one on the left seems more "spread-out" and the one on the right seems very "squeezed-in". This happened because the one on the left has a larger $\sigma^2$ than the one on the right.

Operations with Gaussians

Since integration can be a pain, here are some handy results about Gaussians. Let $X, Y$ be two independent r.v.s whose PDFs are Gaussian, i.e., $X \sim \mathcal{N}(\mu_X, \sigma_X^2)$ and $Y \sim \mathcal{N}(\mu_Y, \sigma_Y^2)$. Then we have:

Scaling Rule: If $a$ is a constant then $aX$ is also Gaussian: $aX \sim \mathcal{N}(a\mu_X, a^2\sigma_X^2)$

Sum Rule: $X + Y$ is also Gaussian: $X + Y \sim \mathcal{N}(\mu_X + \mu_Y, \sigma_X^2 + \sigma_Y^2)$

Shift Rule: If $b$ is a constant then $X + b$ is Gaussian: $X + b \sim \mathcal{N}(\mu_X + b, \sigma_X^2)$

Note: we can derive results such as $\mathbb{E}[aX] = a\mu_X$ and $\mathrm{Var}(aX) = a^2\sigma_X^2$ using rules we studied earlier. However, those rules do not assure us that $aX$ must be Gaussian (they just assure us that it is some r.v. with such-and-such mean and variance). It takes special analysis to show that $aX$, $X + Y$ etc. are Gaussian r.v.s too!

Tail Rule: it gets exponentially less likely that a Gaussian r.v. takes a value far from its mean, e.g., $\mathbb{P}[|X - \mu_X| > t \cdot \sigma_X] \leq 2\exp(-t^2/2)$. For $t = 5$, the actual probability is about $6 \times 10^{-7}$, i.e., less than one in a million (the "5-sigma rule"). As $\sigma_X \to 0$, the r.v. gets more and more concentrated around its mean.

The colloquial "68-95-99.7 rule" describes this more generally. Be careful that this rule applies only to the Gaussian distribution. A random variable sampled from some other distribution may very well violate this rule. People often cite the 68-95-99.7 rule to make real-life predictions. This is merely an approximation (possibly a good one, possibly a bad one) based on an assumption that the real-life distribution is approximately Gaussian.

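A simulation sketch (illustrative) checking the 68-95-99.7 numbers and the sum rule:

```python
import random

# Empirical check of P[|Z| <= k] for Z ~ N(0, 1), and of the sum rule.
random.seed(0)
n = 200_000
z = [random.gauss(0, 1) for _ in range(n)]

for k in (1, 2, 3):
    frac = sum(1 for v in z if abs(v) <= k) / n
    print(k, round(frac, 3))  # ~0.683, ~0.954, ~0.997

# Sum rule: X ~ N(0, 1), Y ~ N(1, 4)  =>  X + Y ~ N(1, 5)
# (random.gauss takes the standard deviation, hence sigma = 2 here)
s = [random.gauss(0, 1) + random.gauss(1, 2) for _ in range(n)]
m = sum(s) / n
print(m, sum((v - m) ** 2 for v in s) / n)  # ~1.0 and ~5.0
```
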
Gaussian Random Vector

As in the scalar case, the multivariate Gaussian requires just the mean and the covariance to be specified: $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$.

The special case $\boldsymbol{\mu} = \mathbf{0}$ and $\Sigma = I$ is called the standard Gaussian/Normal distribution.

However, $\mathcal{N}(\mathbf{x}; \mathbf{0}, I)$ is simply $\prod_i \mathcal{N}(x_i; 0, 1)$, i.e., we indeed have $f_{\mathbf{x}}(\mathbf{x}) = \prod_i f_{x_i}(x_i)$.

All coordinates of a standard Gaussian r.v. are independent!

Gaussian Random Vector

Given a (possibly non-standard) Gaussian vector $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$:

Every coordinate of $\mathbf{x}$ is a (real-valued) Gaussian r.v. – warning: the coordinates need not be independent if the Gaussian is non-standard. Consider any coordinate of the vector, say $x_i$: it is distributed as the Gaussian $\mathcal{N}(\mu_i, \Sigma_{ii})$.

The above holds true even if we condition on all other coordinates of $\mathbf{x}$: given values for all other coordinates, $x_i$ is still Gaussian. The expression is a bit complicated – refer to DFO Sec 6.5.1 (see the reference section on the course webpage).

If $\mathbf{a}$ is a constant vector, then $\mathbf{a}^\top \mathbf{x} \sim \mathcal{N}(\mathbf{a}^\top \boldsymbol{\mu}, \mathbf{a}^\top \Sigma\,\mathbf{a})$.

If $A$ is a constant matrix then $A\mathbf{x} \sim \mathcal{N}(A\boldsymbol{\mu}, A \Sigma A^\top)$.

Note: Just as before, we can derive results such as $\mathbb{E}[\mathbf{a}^\top \mathbf{x}] = \mathbf{a}^\top \boldsymbol{\mu}$ and $\mathrm{Var}(\mathbf{a}^\top \mathbf{x}) = \mathbf{a}^\top \Sigma\,\mathbf{a}$ using rules we studied earlier. However, those rules do not assure us that $\mathbf{a}^\top \mathbf{x}$ or $A\mathbf{x}$ must be Gaussian (they just assure us that these are some r.v./r.vec. with such-and-such mean and (co)variance). It takes a more detailed analysis to show that these are actually Gaussian.

Laplacian Distribution

Close cousins of Gaussian distributions, except that a Laplacian r.v. concentrates much more strongly around its mean than a Gaussian r.v.

Also requires two parameters $\mu, b$ to be specified; the PDF is

$$f_X(x) = \frac{1}{2b} \exp\left(-\frac{|x - \mu|}{b}\right)$$

Mean = Mode = Median: $\mu$. Variance: $2b^2$.

If $X$ is a r.v. with a Laplacian PDF with parameters $\mu, b$, then $aX + c$ (where $a, c$ are constants) is also a Laplacian r.v., but with parameters $a\mu + c$ and $|a| \cdot b$.

Parametric Distributions

Certain distributions are parametric – this means that there are a finite number of parameters that completely describe the distribution. This is similar to parametric models like LwP or linear models, where a finite number of parameters (e.g. a model vector and a bias value) describe the model fully. There exist non-parametric distributions too (beyond the scope of CS771).

Examples: Bernoulli/Rademacher ($p$), Uniform ($a, b$), Gaussian ($\mu, \sigma^2$), Laplacian ($\mu, b$).

Parametric distributions are extremely important for ML. We will next learn about algorithms that try to make realistic predictions by first learning a parametric distribution that mimics reality. This is done by learning the parameters of the distribution using data.

Statistics: apart from being the name of a branch of mathematics, the word "statistic" also refers to some quantity we calculate using samples, e.g. the sample mean, sample variance, sample mode, sample median.

A common usage of statistics is to estimate the parameters of the distribution that generated those samples. E.g., under some mild conditions, the sample mean/variance is a good estimate of the expectation/variance of the distribution that generated the samples.