Basic Statistics
Essential tools for data analysis
Outline
Theory:
• Probabilities:
– Probability measures, events, random variables, conditional probabilities, dependence, expectations, etc.
• Bayes rule
• Parameter estimation:
– Maximum Likelihood Estimation (MLE)
– Maximum a Posteriori (MAP)
Application:
Naive Bayes Classifier for
• Spam filtering
• “Mind reading” = fMRI data processing
What is the probability?
Probabilities
(Portraits: Thomas Bayes and Andrey Kolmogorov)
Probability
• Sample space, Events, σ-Algebras
• Axioms of probability, probability measures
– What defines a reasonable theory of uncertainty?
•Random variables:
– discrete, continuous random variables
• Joint probability distribution
• Conditional probabilities
• Expectations
• Independence, Conditional independence
Sample space
Def: A sample space Ω is the set of all possible
outcomes of a (conceptual or physical) random
experiment. (Ω can be finite or infinite.)
Examples:
− Ω may be the set of all possible outcomes of a die roll: {1, 2, 3, 4, 5, 6}

Examples: What is the probability that
− a randomly opened book shows an odd page number?
− a die roll gives a number < 4?
− a random person’s height X satisfies a < X < b?
Probability
Def: The probability P(A) that an event (subset) A happens is given by a
function that maps events to the interval [0, 1]. P(A) is also called
the probability measure of A.
Kolmogorov Axioms
1. Non-negativity: P(A) ≥ 0 for every event A
2. Normalization: P(Ω) = 1
3. σ-additivity: for pairwise disjoint events A1, A2, …,
   P(A1 ∪ A2 ∪ …) = P(A1) + P(A2) + …
Consequences:
P(∅) = 0,  P(Aᶜ) = 1 − P(A),  A ⊆ B ⇒ P(A) ≤ P(B),  P(A) ≤ 1
Venn Diagram
(Diagram: events A and B overlapping inside the sample space Ω)
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
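A quick check using the die-roll sample space from earlier, with A = “even number” and B = “number < 4”: P(A) = 3/6, P(B) = 3/6, P(A ∩ B) = P({2}) = 1/6, so P(A ∪ B) = 3/6 + 3/6 − 1/6 = 5/6.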
Random Variables
Def: A real-valued random variable is a function X: Ω → ℝ that maps the
outcome of a random experiment to a real number.
Examples: the number shown by a die roll; a random person’s height.
What discrete distributions do we know?
Discrete Distributions
• Bernoulli distribution: Ber(p)
  P(X = 1) = p,  P(X = 0) = 1 − p
Continuous Distribution
Def: A continuous probability distribution is one whose cumulative
distribution function is absolutely continuous.
(Example: height distributions in the USA and in Hungary)
Def: cumulative distribution function (CDF): F(x) = P(X ≤ x)
Def: probability density function (PDF): f(x) = dF(x)/dx
Properties: f(x) ≥ 0, and ∫ f(x) dx = 1 (integral over the whole real line);
P(a < X ≤ b) = F(b) − F(a) = ∫ from a to b of f(x) dx
Cumulative Distribution Function (CDF)
Def: F(x) = P(X ≤ x) = ∫ from −∞ to x of f(t) dt
Intuitively, one can think of f(x)dx as the probability of X falling within
the infinitesimal interval [x, x + dx].
Moments
Expectation (average value, mean, 1st moment):
E[X] = Σ x P(X = x)  (discrete case),  E[X] = ∫ x f(x) dx  (continuous case)
Warning!
Moments may not always exist!
Cauchy distribution: f(x) = 1 / (π(1 + x²))
For the mean to exist, the following integral would have to converge:
∫ |x| / (π(1 + x²)) dx, but it diverges.
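A quick way to see this numerically (a minimal sketch using NumPy’s standard Cauchy sampler): the running sample mean never settles, unlike for a distribution whose mean exists.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.standard_cauchy(1_000_000)

# Running mean after n samples: for a Gaussian this would converge,
# but for the Cauchy distribution it keeps being dragged around by
# extreme outliers, because E[X] does not exist.
for n in (10**2, 10**4, 10**6):
    print(n, samples[:n].mean())
```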
Uniform Distribution
PDF: f(x) = 1/(b − a) for x ∈ [a, b], and 0 otherwise
CDF: F(x) = 0 for x < a,  (x − a)/(b − a) for x ∈ [a, b],  1 for x > b
Multivariate (Joint) Distribution
We can generalize the above ideas from one dimension to any finite number of dimensions.
Discrete example (joint distribution of Headache and Flu):

              Flu     No Flu
Headache      1/80    7/80
No Headache   1/80    71/80
Multivariate Gaussian distribution
Multivariate CDF
http://www.moserware.com/2010/03/computing-your-skill.htm
Conditional Probability
P(X|Y) = fraction of worlds in which event Y is true where event X is also true:
P(X|Y) = P(X ∧ Y) / P(Y)
(Venn diagram: X ∧ Y is the overlap of the regions X and Y inside Ω.)
Example (joint distribution from the previous slide):
P(Headache) = 1/80 + 7/80 = 1/10, P(Flu ∧ Headache) = 1/80, so
P(Flu | Headache) = (1/80) / (8/80) = 1/8.
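The same computation as a minimal Python sketch (the joint table is the one above; the variable names are mine, for illustration):

```python
# Joint distribution P(Headache, Flu) from the table above.
joint = {
    ("headache", "flu"): 1/80,
    ("headache", "no_flu"): 7/80,
    ("no_headache", "flu"): 1/80,
    ("no_headache", "no_flu"): 71/80,
}

# Marginal: P(Headache) = sum over flu states.
p_headache = sum(p for (h, _), p in joint.items() if h == "headache")

# Conditional: P(Flu | Headache) = P(Flu ∧ Headache) / P(Headache).
p_flu_given_headache = joint[("headache", "flu")] / p_headache
print(p_flu_given_headache)  # 0.125 = 1/8
```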
Independence
Independent random variables: X and Y are independent iff
P(X = x, Y = y) = P(X = x) P(Y = y) for all x, y.
Equivalently, P(X|Y) = P(X): knowing Y tells us nothing about X.
Conditionally Independent
London taxi drivers: a survey found a positive and significant correlation
between the number of accidents and wearing coats. It concluded that coats
could hinder drivers’ movements and cause accidents, and a new law was
prepared to prohibit drivers from wearing coats when driving.
Finally, another study pointed out that people wear coats when it rains…
(Comic: xkcd.com)
Conditional Independence
Formally: X is conditionally independent of Y given Z iff
P(X | Y, Z) = P(X | Z)
Equivalent to:
P(X, Y | Z) = P(X | Z) P(Y | Z)
Bayes Rule
P(X | Y) = P(Y | X) P(X) / P(Y)
Chain Rule & Bayes Rule
Chain rule:
P(X, Y) = P(X | Y) P(Y), and in general
P(X1, …, Xn) = P(X1) P(X2 | X1) ⋯ P(Xn | X1, …, Xn−1)
Bayes rule:
P(X | Y) = P(Y | X) P(X) / P(Y)
AIDS test (Bayes rule)
Data:
• Approximately 0.1% of people are infected: P(infected) = 0.001
• The test detects all infections: P(positive | infected) = 1
• The test reports positive for 1% of healthy people: P(positive | healthy) = 0.01
By Bayes rule:
P(infected | positive) = (1 × 0.001) / (1 × 0.001 + 0.01 × 0.999) ≈ 0.091
Only 9%!…
Improving the diagnosis
Use a follow-up test!
• Test 2 reports positive for 90% of infections: P(positive2 | infected) = 0.9
• Test 2 reports positive for 5% of healthy people: P(positive2 | healthy) = 0.05
Assuming the two tests are conditionally independent given the disease status,
the 9.1% posterior from Test 1 becomes the prior for Test 2:
P(infected | both positive) = (0.9 × 0.091) / (0.9 × 0.091 + 0.05 × 0.909) ≈ 0.64
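A minimal sketch of both updates in Python (the numbers are the ones from the slides; the helper name is mine):

```python
def posterior(prior, p_pos_given_infected, p_pos_given_healthy):
    """One Bayes-rule update: P(infected | positive test)."""
    evidence = p_pos_given_infected * prior + p_pos_given_healthy * (1 - prior)
    return p_pos_given_infected * prior / evidence

p1 = posterior(0.001, 1.0, 0.01)  # after Test 1: ~0.091
p2 = posterior(p1, 0.9, 0.05)     # after Test 2: ~0.64
print(p1, p2)
```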
Example: e-mail header features for spam filtering
(date, recipient path, IP number, sender, DKIM signature, content type)

Delivered-To: [email protected]
Authentication-Results: mx.google.com; spf=neutral (google.com: 209.85.215.175 is neither permitted nor denied by best
guess record for domain of [email protected])
[email protected]; dkim=pass (test mode) [email protected]
Received: by eaal1 with SMTP id l1so15092746eaa.6
for <[email protected]>; Tue, 03 Jan 2012 14:17:51 -0800 (PST)
Received: by 10.205.135.18 with SMTP id ie18mr5325064bkc.72.1325629071362;
Received: by 10.52.88.179 with SMTP id bh19mr10729402vdb.38.1325629068795;
Tue, 03 Jan 2012 14:17:48 -0800 (PST)
Return-Path: <[email protected]>
Received: from mail-vx0-f179.google.com (mail-vx0-f179.google.com [209.85.220.179])
by mx.google.com with ESMTPS id dt4si11767074vdb.93.2012.01.03.14.17.48
(version=TLSv1/SSLv3 cipher=OTHER);
Tue, 03 Jan 2012 14:17:48 -0800 (PST)
Received-SPF: pass (google.com: domain of [email protected] designates 209.85.220.179 as permitted sender)
client-ip=209.85.220.179;
Received: by vcbf13 with SMTP id f13so11295098vcb.10
for <[email protected]>; Tue, 03 Jan 2012 14:17:48 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
d=googlemail.com; s=gamma;
h=mime-version:sender:date:x-google-sender-auth:message-id:subject
:from:to:content-type;
bh=WCbdZ5sXac25dpH02XcRyDOdts993hKwsAVXpGrFh0w=;
b=WK2B2+ExWnf/gvTkw6uUvKuP4XeoKnlJq3USYTm0RARK8dSFjyOQsIHeAP9Yssxp6O
7ngGoTzYqd+ZsyJfvQcLAWp1PCJhG8AMcnqWkx0NMeoFvIp2HQooZwxSOCx5ZRgY+7qX
uIbbdna4lUDXj6UFe16SpLDCkptd8OZ3gr7+o=
--f46d043c7af4b07e8d04b5a7113a
Content-Type: text/plain; charset=ISO-8859-1
Naïve Bayes Assumption
Naïve Bayes assumption: features X1 and X2 are conditionally
independent given the class label Y:
P(X1, X2 | Y) = P(X1 | Y) P(X2 | Y)
More generally:
P(X1, …, Xd | Y) = ∏ P(Xi | Y)   (product over i = 1..d)
Decision rule:
y* = argmax_y P(Y = y) ∏ P(Xi = xi | Y = y)
A Graphical Model
(Diagram: class node “spam” with arrows to feature nodes x1, x2, …, xn;
plate notation over i = 1..n.)
Naïve Bayes Algorithm for Discrete Features
Training data: n examples, each with d discrete features and a class label.
Maximum-likelihood estimates are relative frequencies (counts):
Prior: P(Y = y) = count(Y = y) / n
Likelihood: P(Xi = x | Y = y) = count(Xi = x, Y = y) / count(Y = y)
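A minimal sketch of these count-based estimates in Python (the dataset layout and function names are mine, for illustration):

```python
from collections import Counter, defaultdict

def train_nb(X, y):
    """Count-based MLE for a discrete Naive Bayes classifier.
    X: list of feature tuples, y: list of class labels."""
    n = len(y)
    class_counts = Counter(y)
    prior = {c: class_counts[c] / n for c in class_counts}
    # feature_counts[(i, x, c)] = count(Xi = x, Y = c)
    feature_counts = defaultdict(int)
    for features, c in zip(X, y):
        for i, x in enumerate(features):
            feature_counts[(i, x, c)] += 1
    likelihood = {k: v / class_counts[k[2]] for k, v in feature_counts.items()}
    return prior, likelihood

def predict_nb(features, prior, likelihood):
    """Decision rule: argmax over classes of P(y) * prod_i P(xi | y)."""
    scores = {}
    for c, p in prior.items():
        for i, x in enumerate(features):
            p *= likelihood.get((i, x, c), 0.0)  # unseen pair -> 0 (see next slide!)
        scores[c] = p
    return max(scores, key=scores.get)
```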
Subtlety: Insufficient Training Data
For example, if a feature value x never occurs together with class y in the
training set, then the estimate P(Xi = x | Y = y) = 0, and the whole product
P(Y = y) ∏ P(Xi = xi | Y = y) becomes 0, no matter what the other features say.
What now???
Parameter estimation:
MLE, MAP
Estimating Probabilities
Flipping a Coin
I have a coin; if I flip it, what’s the probability it will land heads up?
Data: D = {x1, …, xn}, xi ∈ {H, T}, drawn i.i.d. (independent, identically
distributed) from Ber(θ).
Likelihood: P(D | θ) = θ^nH (1 − θ)^nT, where nH and nT count heads and tails.
Maximum Likelihood Estimation
MLE: choose the θ that maximizes the probability of the observed data:
θ̂ = argmax_θ P(D | θ)
For the coin: θ̂ = nH / (nH + nT)
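The derivation is one line: maximize the log-likelihood and set its derivative to zero.
log P(D | θ) = nH log θ + nT log(1 − θ)
d/dθ log P(D | θ) = nH/θ − nT/(1 − θ) = 0  ⇒  θ̂_MLE = nH / (nH + nT)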
What about prior knowledge?
We know the coin is “close” to 50-50. What can we do now?
Bayesian Learning
• Use Bayes rule:
P(θ | D) = P(D | θ) P(θ) / P(D)
• Or equivalently:
P(θ | D) ∝ P(D | θ) P(θ)   (posterior ∝ likelihood × prior)
MAP Estimation for the Binomial Distribution
Coin flip problem: the likelihood is binomial,
P(D | θ) ∝ θ^nH (1 − θ)^nT
If the prior is a Beta distribution, P(θ) ∝ θ^(α−1) (1 − θ)^(β−1), then the
posterior is also Beta (the Beta prior is conjugate to the binomial likelihood):
P(θ | D) ∝ θ^(nH+α−1) (1 − θ)^(nT+β−1)
MAP estimate: θ̂ = (nH + α − 1) / (nH + nT + α + β − 2)
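A minimal numeric sketch comparing the two estimates (the prior parameters α = β = 5 are my choice, to encode “close to 50-50”):

```python
def coin_estimates(n_heads, n_tails, alpha=5, beta=5):
    """MLE vs. MAP with a Beta(alpha, beta) prior on the heads probability."""
    mle = n_heads / (n_heads + n_tails)
    map_ = (n_heads + alpha - 1) / (n_heads + n_tails + alpha + beta - 2)
    return mle, map_

# 3 flips, all heads: MLE says theta = 1.0, MAP stays near 0.5.
print(coin_estimates(3, 0))  # (1.0, 0.636...)
```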
MLE vs. MAP
Maximum likelihood estimation (MLE):
choose the value that maximizes the probability of the observed data,
θ̂ = argmax_θ P(D | θ)
Maximum a posteriori (MAP) estimation:
choose the value that is most probable given the observed data and the prior,
θ̂ = argmax_θ P(θ | D) = argmax_θ P(D | θ) P(θ)
What about continuous features?
Model each feature with a Gaussian:
N(x; µ, σ²) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²))
(Plots: Gaussian PDFs with mean µ = 0 and different variances σ².)
MLE for Gaussian Mean and Variance
Choose θ = (µ, σ²) that maximizes the probability of the observed data
D = {x1, …, xn}, drawn i.i.d. (independent, identically distributed):
P(D | µ, σ²) = ∏ N(xi; µ, σ²)   (product over i = 1..n)
MLE for Gaussian Mean and Variance
µ̂ = (1/n) Σ xi
σ̂² = (1/n) Σ (xi − µ̂)²
(Note the 1/n: the MLE of the variance is biased; the unbiased estimator divides by n − 1.)
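A minimal sketch checking these formulas on simulated data (NumPy; note that the mean over squared deviations gives the 1/n MLE variance):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=3.0, size=10_000)  # true mu = 2, sigma = 3

mu_hat = data.mean()                     # (1/n) * sum(x_i)
var_hat = ((data - mu_hat) ** 2).mean()  # (1/n) * sum((x_i - mu_hat)^2)
print(mu_hat, np.sqrt(var_hat))          # close to 2.0 and 3.0
```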
Case Study: Text Classification
• Classify e-mails
  – Y = {Spam, NotSpam}
• Classify news articles
  – Y = the topic of the article
Xi represents the i-th word in the document.
NB for Text Classification
P(X|Y) is huge!!!
– An article has at least 1000 words: X = {X1, …, X1000}
– Xi represents the i-th word in the document, i.e., the domain of Xi is the
  entire vocabulary, e.g., Webster’s Dictionary (or more):
  Xi ∈ {1, …, 50000} ⇒ K · 50000^1000 parameters…
Bag of words model
Typical additional assumption: position in the document doesn’t
matter: P(Xi = xi | Y = y) = P(Xk = xi | Y = y)
– “Bag of words” model: the order of words on the page is ignored
– Sounds really silly, but often works very well! ⇒ K · 50000 parameters
Bag of words approach
Count how many times each vocabulary word occurs in the document:

aardvark   0
about      2
all        2
Africa     1
apple      0
anxious    0
...
gas        1
...
oil        1
...
Zaire      0
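A minimal sketch of such a count vector in Python (the vocabulary and text are toy examples of mine):

```python
from collections import Counter

vocabulary = ["aardvark", "about", "all", "africa", "apple", "gas", "oil", "zaire"]
document = "all about oil all about gas in africa"

# Bag of words: word order is discarded, only per-word counts survive.
counts = Counter(document.split())
vector = [counts.get(word, 0) for word in vocabulary]
print(vector)  # [0, 2, 2, 1, 0, 1, 1, 0]
```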
Twenty Newsgroups Results
(Results figure not shown: Naïve Bayes accuracy on the 20 Newsgroups dataset.)

Gaussian Naïve Bayes
Different mean and variance for each class k and each feature (pixel) i:
P(Xi = x | Y = k) = N(x; µik, σik²)
Sometimes we assume the variance
• is independent of Y (i.e., σi),
• or independent of Xi (i.e., σk),
• or both (i.e., σ).
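A minimal sketch of the per-class Gaussian likelihood and decision rule (a hypothetical two-class, one-feature setup; the parameter values and names are mine):

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """N(x; mu, sigma^2): class-conditional likelihood of one feature."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# Per-class (mu_k, sigma_k^2), normally estimated from training data (here: made up).
params = {"tool": (4.0, 1.0), "building": (6.0, 2.0)}
prior = {"tool": 0.5, "building": 0.5}

x = 4.5  # one continuous feature value
scores = {k: prior[k] * gaussian_pdf(x, mu, s2) for k, (mu, s2) in params.items()}
print(max(scores, key=scores.get))  # class with the highest posterior score
```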
Example: GNB for Classifying Mental States
• ~1 mm resolution
• ~2 images per second
• 15,000 voxels/image
• non-invasive, safe
• measures the Blood Oxygen Level Dependent (BOLD) response
[Mitchell et al.]
Learned Naïve Bayes Models:
Means for P(BrainActivity | WordCategory)
Pairwise classification accuracy: 78–99%, 12 participants [Mitchell et al.]
(Figure: learned means for “Tool words” vs. “Building words”.)
What you should know…
Naïve Bayes classifier
• What’s the assumption?
• Why do we use it?
• How do we learn it?
• Why is Bayesian (MAP) estimation important?
Text classification
• Bag of words model
Gaussian NB
• Features are still conditionally independent
• Each feature has a Gaussian distribution given the class
Further reading
Manuscript (book chapters 1 and 2)
http://alex.smola.org/teaching/berkeley2012/slides/chapter1_2.pdf
ML Books
Statistics 101
A tiny bit of extra theory…
Feasible events = σ-algebra
Def: A σ-algebra over Ω is a collection of subsets of Ω that contains Ω itself
and is closed under complement and countable union.
Examples:
a. All subsets of Ω = {1,2,3}: {∅, {1}, {2}, {3}, {1,2}, {1,3}, {2,3}, {1,2,3}}
b. The Borel σ-algebra on ℝ: the smallest σ-algebra containing all open
   intervals (Borel sets)
Measure
Def: A measure µ assigns a value in [0, ∞] to each set in the σ-algebra,
with µ(∅) = 0 and σ-additivity: for pairwise disjoint sets A1, A2, …,
µ(A1 ∪ A2 ∪ …) = µ(A1) + µ(A2) + …
Consequence:
Monotonicity: A ⊆ B ⇒ µ(A) ≤ µ(B)
Important measures
Counting measure: µ(A) = the number of elements of A.
Borel measure: defined on the Borel sets of ℝ; assigns µ([a, b]) = b − a to every interval.
This is not a complete measure: there are Borel sets with measure zero
that have subsets which are not Borel measurable…
Lebesgue measure: the complete extension of the Borel measure, i.e., an extension
in which every subset of every null set is Lebesgue measurable (with measure zero).
Brain Teasers
These might be surprising:
• Construct an uncountable Lebesgue-measurable set with measure zero.
• Construct a set that is Lebesgue measurable but not Borel measurable.
• Prove that there are sets that are not Lebesgue measurable. We
  can’t ask for the probability of such an event!
• Construct a Borel null set that has a non-measurable subset.
The Banach-Tarski paradox (1924)
Given a solid ball in 3-dimensional space, there exists a decomposition of the ball into a
finite number of non-overlapping pieces (i.e., subsets), which can then be put back together
in a different way to yield two identical copies of the original ball.
The reassembly process involves only moving the pieces around and rotating them,
without changing their shape. However, the pieces themselves are not "solids" in the usual
sense, but infinite scatterings of points.
A stronger form of the theorem implies that given any two "reasonable" solid objects (such as a
small ball and a huge ball), either one can be reassembled into the other.
This is often stated colloquially as "a pea can be chopped up and reassembled into the Sun."
Tarski's circle-squaring problem (1925)
Is it possible to take a disc in the plane, cut it into finitely many
pieces, and reassemble the pieces so as to get a square of equal
area?