Chapter 1

Chapter 1: Defining and Collecting Data
Statistics is the science of collecting, organizing, analyzing, and interpreting
data to make decisions.
Descriptive Statistics: Involves organizing, summarizing, and displaying
data.
Descriptive statistics covers the methods for collecting data and for summarizing,
presenting, computing, and describing its distinguishing characteristics, so as to give
an overall picture of the subject under study.
Central tendency
Variation
Skewness
Inferential Statistics: Involves using sample data to draw conclusions about
a population.
Inferential statistics includes the methods for estimating population characteristics,
analyzing relationships between the phenomena under study, and predicting or making
decisions based on information gathered from sample observations.
Confidence interval
Hypothesis Testing
Regression
Types of data
Categorical (qualitative) variables take categories as their values such as
“yes”, “no”, or “blue”, “brown”, “green”.
Numerical (quantitative) variables have values that represent a counted or
measured quantity.
Discrete variables arise from a counting process.
Continuous variables arise from a measuring process.
Population: A population contains all the items or individuals of interest
that you seek to study.
population: all FPT students
Sample: A sample contains only a portion of a population of interest.
sample: 100 FPT students
Data: information coming from observations, counts, measurements, or responses.
Parameter: a numerical measurement describing some characteristic of a
population.
parameter: average parking time of all FPT students, population variance, population
standard deviation, population proportion,...
Statistic: a numerical measurement describing some characteristic of a
sample.
statistic: average parking time of 100 FPT students, sample variance,…
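The difference between a parameter and a statistic can be sketched in a few lines of Python. The parking times below are simulated purely for illustration (they are not real FPT data), and the population size of 5,000 is an assumption.

```python
import random
import statistics

# Hypothetical population: simulated parking times (minutes) for 5,000 students.
# These values are generated for illustration only; they are not real data.
rng = random.Random(42)
population = [rng.uniform(5, 120) for _ in range(5000)]

# Parameter: a number describing the whole population.
population_mean = statistics.mean(population)

# Statistic: a number describing a sample of 100 students from that population.
sample = rng.sample(population, 100)
sample_mean = statistics.mean(sample)

print(f"parameter (population mean): {population_mean:.2f}")
print(f"statistic (sample mean):     {sample_mean:.2f}")
```

The two means are usually close but not equal; a different sample would give a different statistic, while the parameter stays fixed.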
Sources of data
When you perform the activity that collects the data, you are using a
primary data source.
When the data collection part of these activities is done by someone else,
you are using a secondary data source.
Primary Sources: The data collector is the one using the data for analysis:
Data from a political survey.
Data collected from an experiment. (A treatment is applied to part of
a population and responses are observed.)
Observed data (A researcher observes and measures characteristics
of interest of part of a population.)
Secondary Sources: The person performing data analysis is not the data
collector:
Analyzing census data.
Examining data from print journals or data published on the internet.
Chapter 2: Organizing and Visualizing Variables

Big picture of Statistics
Chapter 3: Numerical Descriptive Measures

1. Measures of Central Tendency
Sample Mean = sum of values divided by the number of values.
X̄ = (X1 + X2 + ... + Xn) / n
n: sample size
580VN: Menu - 6 - 1 - Data - AC - OPTN - 2
570VN: Mode - 3 - 1 - Data - AC - Shift - 1 - 4 - 2
Sample Median: is the “middle” number (50% above, 50% below).
1. Rank the data set in increasing order.
2. Median position = (n + 1)/2 position in the ordered data.
   If the number of values is odd, the median is the middle number.
   If the number of values is even, the median is the average of the two middle numbers.
Sample Mode:
Value that occurs most often.
There may be no mode OR there may be several modes.
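As a quick check of the three measures above, here is a minimal sketch using Python's standard `statistics` module. The first data set is the one used for the variance example in these notes; the other lists are made up for illustration.

```python
import statistics

values = [23, 27, 26, 25, 40]

mean = statistics.mean(values)        # sum of values / n = 141 / 5
median = statistics.median(values)    # n is odd: the middle ranked value

even_values = [23, 25, 26, 27]
median_even = statistics.median(even_values)   # n is even: average of 25 and 26

# multimode returns every value tied for the highest count,
# so a data set can have several modes.
modes = statistics.multimode([1, 2, 2, 3, 3])

print(mean, median, median_even, modes)
```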
2. Measures of Variation and Shape

Sample Range = largest value - smallest value
Sample Variance: Average (approximately) of squared deviations of values from the mean.
S^2 = sum of (Xi - X̄)^2 / (n - 1)

570VN: Mode - 3 - 1 - Data - AC - Shift - 1 - 4 - 4 - ^2
23 27 26 25 40 → 45.7
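The 45.7 above can be reproduced step by step. This is a small sketch of the n - 1 formula; the variable names are my own.

```python
import statistics

values = [23, 27, 26, 25, 40]
n = len(values)
mean = sum(values) / n                               # 28.2

# Squared deviation of each value from the mean.
squared_deviations = [(x - mean) ** 2 for x in values]

# Divide by n - 1 (not n) for a sample variance.
sample_variance = sum(squared_deviations) / (n - 1)

print(round(sample_variance, 4))                     # agrees with the calculator result
print(statistics.variance(values))                   # library cross-check
```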
Sample Standard Deviation: The square root of the sample variance.
Sample Coefficient of Variation: Measures variation relative to the mean. Always
expressed as a percentage (%): CV = (S / X̄) * 100%
Z Score: Z = (X - X̄) / S
Z score > 3 or < -3 → X is an extreme value (outlier)
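A minimal sketch of the standard deviation, coefficient of variation, and Z scores for the same five values used in the variance example. The cutoff of 3 is the one given in these notes.

```python
import statistics

values = [23, 27, 26, 25, 40]

mean = statistics.mean(values)
s = statistics.stdev(values)          # sample standard deviation (n - 1 divisor)

cv_percent = s / mean * 100           # coefficient of variation, in %

# Z score of each value: how many standard deviations it lies from the mean.
z_scores = [(x - mean) / s for x in values]

# Values with |Z| > 3 are flagged as extreme.
extreme = [x for x, z in zip(values, z_scores) if abs(z) > 3]

print(f"S = {s:.4f}, CV = {cv_percent:.2f}%")
print([round(z, 2) for z in z_scores], extreme)
```

In this data set even the value 40 has a Z score below 3, so nothing is flagged.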

Shape of a Distribution (Skewness)
Mean = Median → symmetric
Mean < Median → left-skewed
Mean > Median → right-skewed
3. Exploring Numerical Variables
Box Plots
Quartiles split the ranked data into 4 segments with an equal number of
values per segment.
The first quartile, Q1, is the value for which 25% of the values are
smaller and 75% are larger.
Q2 is the same as the median (50% of the values are smaller and 50%
are larger).
Only 25% of the values are greater than the third quartile.
Find a quartile by determining the value in the appropriate position in the
ranked data, where:
First quartile position: Q1 = (n + 1)/4 ranked value.
Second quartile position: Q2 = (n + 1)/2 ranked value (the median).
Third quartile position: Q3 = 3(n + 1)/4 ranked value.
Here n is the number of observed values.
The Interquartile Range (IQR) is Q3 – Q1 and measures the spread in the
middle 50% of the data.
An outlier is an observation that is numerically distant from the rest of the
data:
X < Q1 - 1.5 * IQR OR X > Q3 + 1.5 * IQR.
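The quartile positions and the 1.5 * IQR rule can be sketched as follows. Python's `statistics.quantiles` with `method="exclusive"` uses the same k(n + 1)/4 positions; how to handle a fractional position is not stated in these notes, so the linear interpolation it applies is an assumption here (a textbook may round instead). The data values are made up.

```python
import statistics

def quartile_summary(values):
    """Return Q1, Q2, Q3, IQR and the outliers flagged by the 1.5 * IQR rule."""
    ranked = sorted(values)
    # method="exclusive" places quartile k at ranked position k * (n + 1) / 4
    # and interpolates linearly when that position is not a whole number.
    q1, q2, q3 = statistics.quantiles(ranked, n=4, method="exclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [x for x in ranked if x < lower_fence or x > upper_fence]
    return q1, q2, q3, iqr, outliers

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50]    # illustrative values, n = 11
q1, q2, q3, iqr, outliers = quartile_summary(data)
print(q1, q2, q3, iqr, outliers)
```

With n = 11 the positions are 3, 6, and 9, so Q1 = 3, Q2 = 6, Q3 = 9, IQR = 6, and the fences are -6 and 18; only 50 falls outside them.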
Five-number summary
4. Numerical Descriptive Measures for a Population
Population Mean (μ)

Population Variance & Population Standard Deviation
Population Variance (σ^2)
570VN: Mode - 3 - 1 - Data - AC - Shift - 1 - 4 - 3 - ^2
Population Standard Deviation (σ)
570VN: Mode - 3 - 1 - Data - AC - Shift - 1 - 4 - 3
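To see how the population formulas differ from the sample ones, the sketch below treats the five values from the sample variance example as if they were an entire population. That reinterpretation is an assumption made only for comparison.

```python
import statistics

values = [23, 27, 26, 25, 40]

# Population versions divide by N.
population_variance = statistics.pvariance(values)
population_sd = statistics.pstdev(values)

# Sample versions divide by n - 1, so they come out slightly larger.
sample_variance = statistics.variance(values)

print(round(population_variance, 4), round(population_sd, 4), round(sample_variance, 4))
```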
Population
Probability distribution table
The Empirical Rule
Approximately 68% of the values are within ±1 standard deviation
of the mean.
Approximately 95% of the values are within ±2 standard deviations
of the mean.
Approximately 99.7% of the values are within ±3 standard deviations
of the mean.
5. The Covariance and the Coefficient of Correlation
The Covariance: The covariance measures the direction of the linear
relationship between two numerical variables (X and Y). Its size depends on the
units of X and Y, so the coefficient of correlation is used to judge strength.
data: population → population covariance → cov(X, Y) = R * σx * σy
data: sample → sample covariance → cov(X, Y) = R * Sx * Sy
The Coefficient of Correlation: The coefficient of correlation measures the
relative strength of a linear relationship between two numerical variables (X and
Y).
negative/positive → sign of R
-1 <= R <= 1
R < 0: negative correlation, as X increases Y tends to decrease
R = 0: no linear relationship
R > 0: positive correlation, as X increases Y tends to increase
strong/weak → |R|, R^2
0 <= |R| <= 1
|R| ~ 0 → weak correlation
|R| ~ 1 → strong correlation
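The relation cov(X, Y) = R * Sx * Sy for sample data can be checked numerically. The x and y values below are made up for illustration, and the computation is written out by hand rather than relying on a library function.

```python
import math

x = [1, 2, 3, 4, 5]    # illustrative values
y = [2, 4, 5, 4, 5]
n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sample covariance: sum of products of deviations, divided by n - 1.
cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

# Sample standard deviations of X and Y.
s_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
s_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))

# Coefficient of correlation: covariance rescaled to lie in [-1, 1].
r = cov_xy / (s_x * s_y)

print(f"cov = {cov_xy:.4f}, r = {r:.4f}, r * Sx * Sy = {r * s_x * s_y:.4f}")
```

Here the covariance is 1.5 and R is about 0.77, a fairly strong positive linear relationship.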
Chapter 4: Basic Probability

I. Basic probability concepts
1. Random experiment:
is a mechanism that produces a definite outcome that cannot be predicted
with certainty.
Ex: Rolling a die. There are 6 possible outcomes {1, 2, 3, 4, 5, 6}.
However, none of the outcomes can be predicted with certainty. → Rolling a
die is a random experiment.
2. Random variable X: the number of dots showing when a die is rolled
Ch5: discrete random variables
Ch6: continuous random variables
3. Types of variables
Categorical variable: qualitative
Numeric variable: Discrete & Continuous
Discrete variables arise from a counting process.
(e.g., number of classes you are taking). Keyword: “the number of ...”; values such as 0, 1, 2, 3.
Continuous variables arise from a measuring process.
(e.g., your annual salary, or your weight). Keyword: “the amount of ...”; values such as
50.1 kg, 55.5 kg.
4. Sample space
is the collection of all possible outcomes.
Ex: Roll a die and record the number of dots. → S = {1, 2, 3, 4, 5, 6}
P(X = 7) = 0 (7 is not in S)
P(X < 10) = 1 (every outcome in S is below 10)
5. Event
is a subset of the sample space.
Ex: E1 = {6}, E2 = {even} = {2, 4, 6}
6. Probability
the numerical value representing the chance, likelihood, or possibility that a
certain event will occur (always between 0 and 1).
Ex: A standard deck has 52 cards.
P(spade) = number of spades / total number of cards = 13/52 = 1/4
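Sample spaces, events, and probabilities with equally likely outcomes map naturally onto Python sets. The sketch below is one way to write this; the helper name `probability` is my own.

```python
from fractions import Fraction
from itertools import product

def probability(event, sample_space):
    """Classical probability: favourable outcomes / all equally likely outcomes."""
    return Fraction(len(event & sample_space), len(sample_space))

# Rolling a die: the sample space and some events (subsets of it).
die = {1, 2, 3, 4, 5, 6}
even = {x for x in die if x % 2 == 0}
p_even = probability(even, die)
p_seven = probability({7}, die)                              # impossible event
p_below_ten = probability({x for x in die if x < 10}, die)   # certain event

# A standard 52-card deck: 13 ranks in each of 4 suits.
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["spades", "hearts", "diamonds", "clubs"]
deck = set(product(ranks, suits))
spades = {card for card in deck if card[1] == "spades"}
p_spade = probability(spades, deck)

print(p_even, p_seven, p_below_ten, p_spade)
```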
Exam tip: read the question → find the keyword → narrow down the chapter.
probability → Ch4, Ch5, Ch6, Ch7
II. Types of events
1. Impossible event and Certain event
Impossible event has no chance of occurring (probability = 0).
Certain event is sure to occur (probability = 1).
2. Simple event & Joint event
Simple event described by a single characteristic.
Ex: A day in January from all days in 2018.
Joint event described by two or more characteristics.
Ex: A day in January that is also a Wednesday from all days in 2018.
3. Complementary event
Complement of an event A (denoted A’): All outcomes that are not part of
event A.
Ex: All days from 2018 that are not in January.
4. Mutually exclusive events (Disjoint events)
Events A and B are said to be mutually exclusive if it is not possible that
both occur at the same time.
Ex: Toss of a coin.
Let A be the event that the coin lands on heads.
Let B be the event that the coin lands on tails.
→ In a single coin toss, events A and B are mutually exclusive.
P(A∩B) = 0
P(A∪B) = P(A) + P(B) - P(A∩B) = P(A) + P(B) - 0 = P(A) + P(B)
5. Independent events
Events A and B are said to be independent if the probability of B occurring
is unaffected by whether event A occurs.
Ex: Tossing a coin twice.
Let A be the event that the first coin toss lands on heads.
Let B be the event that the second coin toss lands on heads.
→ Clearly the result of the first coin toss does not affect the result of
the second coin toss.
→ Events A and B are independent.
P(A∩B) = P(A) * P(B)
P(A∪B) = P(A) + P(B) - P(A∩B) = P(A) + P(B) - P(A) * P(B)
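Both sets of rules can be verified by listing the outcomes of two fair coin tosses. This is a small sketch that assumes all four outcomes are equally likely.

```python
from fractions import Fraction
from itertools import product

# Sample space for two tosses: ('H','H'), ('H','T'), ('T','H'), ('T','T').
outcomes = set(product("HT", repeat=2))

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(outcomes))

# Independent events: A = first toss is heads, B = second toss is heads.
A = {o for o in outcomes if o[0] == "H"}
B = {o for o in outcomes if o[1] == "H"}
p_and = prob(A & B)
p_or = prob(A | B)
independent = p_and == prob(A) * prob(B)
addition_rule_holds = p_or == prob(A) + prob(B) - p_and

# Mutually exclusive events: C = first toss is heads, D = first toss is tails.
C = A
D = {o for o in outcomes if o[0] == "T"}
p_c_and_d = prob(C & D)       # no outcome is in both
p_c_or_d = prob(C | D)        # equals P(C) + P(D)

print(p_and, p_or, independent, addition_rule_holds, p_c_and_d, p_c_or_d)
```

C and D together also cover the whole sample space, so they are collectively exhaustive as well as mutually exclusive.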
6. Collectively exhaustive events
One of the events must occur. The set of events covers the entire sample
space.
P(A∪B) = 1
7. Events associated with OR: A∪B
is the event that consists of all outcomes that are contained in either of
the two events (or in both).
8. Events associated with AND: A∩B
is the event that consists of all outcomes that are contained in both
events.
9. Event E1 but not E2
III. Graph
1. Venn diagram
2. Contingency table
P(female|right-handed) = P(female and right-handed) / P(right-handed) =
(44/100) / (87/100) = 44/87
= |female and right-handed| / |right-handed| = 44/87
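The same conditional probability can be computed with exact fractions. Only the three counts that appear in the calculation above are used; the rest of the contingency table is not reproduced in these notes.

```python
from fractions import Fraction

total = 100                    # people in the table
female_and_right = 44          # female AND right-handed
right_handed = 87              # right-handed (any gender)

# Route 1: ratio of probabilities.
p_joint = Fraction(female_and_right, total)
p_right = Fraction(right_handed, total)
p_female_given_right = p_joint / p_right

# Route 2: ratio of counts inside the "right-handed" group.
by_counts = Fraction(female_and_right, right_handed)

print(p_female_given_right, by_counts)
```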
3. Decision tree
IV. General addition rule
V. Conditional probability keyword *if, *given that, *given

Chapter 5: Discrete Probability Distributions & Chapter 6: Continuous Probability Distributions

Discrete Random Variable
Definition: Discrete variables produce outcomes that come from a counting process.
Examples: Number of girls in a classroom. Number of blue marbles in a bag. Number of heads when flipping 5 coins. Number of typos on a page. Number of classes you are taking.
Distributions: Binomial distribution, Poisson distribution

Continuous Random Variable
Definition: Continuous variables produce outcomes that come from a measurement.
Examples: Height of boys in class. Weight of students in a class. Amount of lemonade in a jug. Time it takes to run a race. Lifetime of a battery.
Distributions: Uniform distribution, Normal distribution

Binomial distribution
Definition: X: the number of successes in n trials
Notation: X ~ B(n, π)
Mean & Variance: E(X) = μ = n * π (expected value = mean); V(X) = σ^2 = n * π * (1 - π)
Probability: P(X = x) = C(n, x) * π^x * (1 - π)^(n - x) → CASIO
Given probability, find value: N/A

Poisson distribution
Definition: X: the number of events in a given unit of time/distance/area/volume
λ: mean number of events in a given unit of time/distance/area/volume
λ: average …
Keyword: Poisson
Notation: X ~ P(λ)
Mean & Variance: E(X) = μ = λ; V(X) = σ^2 = λ
Probability: P(X = x) = e^(-λ) * λ^x / x! → CASIO
Given probability, find value: N/A

Uniform distribution
Keyword: Uniform
Notation: X ~ CU(a, b)
Mean & Variance: E(X) = μ = (a + b)/2; V(X) = σ^2 = (a - b)^2 / 12
Probability: P(X = x) = 0; P(c < X < d) = (d - c)/(b - a)
Given probability, find value: N/A

Normal distribution
Keyword: Normal
Notation: X ~ N(μ, σ^2)
Mean & Variance: E(X) = μ; V(X) = σ^2
Probability: P(X = x) = 0; P(c < X < d) → CASIO
Given probability, find value: P(X < x) = p, find x → CASIO
Chapter 5: Discrete Probability Distributions

1. Binomial distribution X: the number of successes in n trials
2. Poisson distribution X: the number of events in a given unit of
time/distance/area/volume
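The binomial and Poisson probability formulas can be sketched directly from their definitions. The parameter values used here (5 coin flips with π = 0.5, and λ = 3) are illustrative assumptions, and the function names are my own.

```python
from math import comb, exp, factorial

def binomial_pmf(x, n, pi):
    """P(X = x) for X ~ B(n, pi): C(n, x) * pi^x * (1 - pi)^(n - x)."""
    return comb(n, x) * pi ** x * (1 - pi) ** (n - x)

def poisson_pmf(x, lam):
    """P(X = x) for X ~ P(lam): e^(-lam) * lam^x / x!."""
    return exp(-lam) * lam ** x / factorial(x)

# Binomial: number of heads in n = 5 flips of a fair coin.
n, pi = 5, 0.5
p_two_heads = binomial_pmf(2, n, pi)
binomial_total = sum(binomial_pmf(x, n, pi) for x in range(n + 1))
binomial_mean = sum(x * binomial_pmf(x, n, pi) for x in range(n + 1))   # n * pi

# Poisson: an average of lam = 3 events per unit; probability of no events.
p_no_events = poisson_pmf(0, 3)

print(p_two_heads, binomial_total, binomial_mean, round(p_no_events, 6))
```

Summing x * P(X = x) over all x recovers the mean n * π, which is a handy sanity check when working by hand.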
Chapter 6: Continuous Probability Distributions

3. Uniform distribution
4. Normal distribution
Bell Shaped
Symmetrical
Mean, Median and Mode are Equal
Location is determined by the mean, μ.
Spread is determined by the standard deviation, σ.
The random variable has an infinite theoretical range: -∞ to +∞.
Chapter 6: The Normal Distribution and Other Continuous Distributions

Contents:
1. Continuous Random Variable
2. Normal Distribution
The Standardized Normal Distribution
Sigma (σ) indicates the spread of the distribution.
Whichever way μ shifts, the curve shifts the same way.
Find Normal Probabilities
Given a Normal Probability, find the X Value
The Empirical Rule
3. The Uniform Distribution
Properties of the Uniform Distribution: mean, variance, standard
deviation
Find uniform probabilities
1. Continuous Random Variable
Continuous variables produce outcomes that come from a measurement.
e.g. annual salary
weight, in kg.
thickness of an item.
time required to complete a task.
temperature of a solution.
2. Normal Distribution (mean, variance/standard deviation)
variance = (standard deviation)^2
X ~ N(mean, variance)
Bell Shaped
Symmetrical
Mean, Median and Mode are Equal
Location is determined by the mean, μ.
Spread is determined by the standard deviation, σ.
Range: The random variable has an infinite theoretical
range: -∞ to +∞.
The Standardized Normal Distribution (Also known as the “Z” distribution)
Z-score
Mean is 0.
Standard Deviation is 1. → Variance = 1^2 = 1
The Empirical Rule

μ ± 1σ covers about 68.26% of X’s.
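The empirical-rule percentages are areas under the standard normal curve, which can be checked with `statistics.NormalDist` from the Python standard library. The exact area within ±1σ is about 0.6827; figures read from a rounded Z table can differ in the last digit.

```python
from statistics import NormalDist

Z = NormalDist()    # standard normal: mean 0, standard deviation 1

# Area between -k and +k standard deviations, for k = 1, 2, 3.
coverage = {k: Z.cdf(k) - Z.cdf(-k) for k in (1, 2, 3)}

for k, share in coverage.items():
    print(f"within ±{k} standard deviations: {share:.2%}")
```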
Calculating Normal Probabilities
Probability is measured by the area under the curve.

The total area under the curve is 1.0, and the curve is symmetric, so half is
above the mean, half is below.
CASIO 570VN / CASIO 580VN
Continuous: P(X > 5) = P(X >= 5), because P(X = 5) = 0
Discrete (integer values): P(X > 5) = P(X >= 6)
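As an alternative to the calculator, normal probabilities can be sketched with `statistics.NormalDist`. The distribution N(100, 225) is a hypothetical example; note that `NormalDist` expects the standard deviation, not the variance.

```python
from statistics import NormalDist

# Hypothetical example: X ~ N(mean = 100, variance = 225), so sigma = 15.
mu, variance = 100, 225
X = NormalDist(mu=mu, sigma=variance ** 0.5)   # takes sigma, not variance

p_below = X.cdf(115)                  # P(X < 115): area to the left of 115
p_above = 1 - X.cdf(115)              # P(X > 115): total area is 1
p_between = X.cdf(115) - X.cdf(85)    # P(85 < X < 115)

print(round(p_below, 4), round(p_above, 4), round(p_between, 4))
```

Because X is continuous, using < or <= in these probabilities gives the same answer.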
Given a Normal Probability. Find the X Value
CASIO 570VN / CASIO 580VN
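Going the other way (given a probability p, find x with P(X < x) = p) uses the inverse of the cumulative distribution. The sketch below again assumes a hypothetical X with mean 100 and standard deviation 15, and p = 0.975.

```python
from statistics import NormalDist

mu, sigma = 100, 15          # hypothetical parameters
X = NormalDist(mu=mu, sigma=sigma)

p = 0.975
x = X.inv_cdf(p)             # the value x with P(X < x) = p

# Same answer via the Z distribution: find z first, then x = mu + z * sigma.
z = NormalDist().inv_cdf(p)
x_from_z = mu + z * sigma

print(round(z, 2), round(x, 2), round(x_from_z, 2))
```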
3. Uniform Distribution (a, b)

a: min
b: max
X ~ U (a,b)
Symmetrical
Also called a rectangular distribution
Range: Any value between the smallest and largest is equally likely.
Properties of the Uniform Distribution: mean, variance, standard
deviation
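The uniform-distribution properties can be written as three small functions. The formulas are the standard ones, mean (a + b)/2, variance (a - b)^2 / 12, and P(c < X < d) = (d - c)/(b - a); the values a = 2 and b = 6 and the function names are illustrative assumptions.

```python
import math

def uniform_mean(a, b):
    return (a + b) / 2

def uniform_variance(a, b):
    return (a - b) ** 2 / 12

def uniform_probability(c, d, a, b):
    """P(c < X < d) for X uniform on [a, b]; the interval is clipped to [a, b]."""
    low, high = max(c, a), min(d, b)
    return max(high - low, 0) / (b - a)

a, b = 2, 6     # min and max
mean = uniform_mean(a, b)
variance = uniform_variance(a, b)
sd = math.sqrt(variance)
p_middle = uniform_probability(3, 5, a, b)     # (5 - 3) / (6 - 2)
p_clipped = uniform_probability(0, 3, a, b)    # only the part from 2 to 3 counts

print(mean, round(variance, 4), round(sd, 4), p_middle, p_clipped)
```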
normal dist
1 probability density function (PDF)