Chapter 1


Chapter 1: Defining and Collecting Data

Statistics is the science of collecting, organizing, analyzing, and interpreting data to make decisions.
Descriptive Statistics: Involves organizing, summarizing, and displaying
data.
Descriptive statistics covers the methods for collecting data, summarizing, presenting, computing, and describing its distinguishing characteristics, so as to give an overall picture of the subject under study.
Central tendency
Variation
Skewness

Inferential Statistics: Involves using sample data to draw conclusions about a population.
Inferential statistics includes methods for estimating population characteristics, analyzing relationships between the phenomena under study, and making predictions or decisions based on information gathered from sample observations.
Confidence interval
Hypothesis Testing
Regression

Types of data
Categorical (qualitative) variables take categories as their values such as
“yes”, “no”, or “blue”, “brown”, “green”.
Numerical (quantitative) variables have values that represent a counted or
measured quantity.
Discrete variables arise from a counting process.
Continuous variables arise from a measuring process.
Population: A population contains all the items or individuals of interest
that you seek to study.
population: all FPT students
Sample: A sample contains only a portion of a population of interest.
sample: 100 FPT students

Data: consist of information coming from observations, counts, measurements, or responses.

Parameter: a numerical measurement describing some characteristics of a population.
parameter: average parking time of all FPT students, population variance, population
standard deviation, population proportion,...
Statistic: a numerical measurement describing some characteristics of a
sample.
statistic: average parking time of 100 FPT students, sample variance,…
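The parameter/statistic distinction can be sketched in a few lines of Python. The population below is simulated (the parking times, the population size of 5,000, and the random seed are all made up for illustration); only the idea of drawing 100 students from the population comes from the notes.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: parking times (minutes) for 5,000 students.
population = [random.uniform(1, 15) for _ in range(5000)]

# A sample is a portion of the population: 100 students chosen at random.
sample = random.sample(population, 100)

parameter = statistics.mean(population)  # describes the whole population
statistic = statistics.mean(sample)      # describes only the sample

print(f"population mean (parameter): {parameter:.2f}")
print(f"sample mean (statistic):     {statistic:.2f}")
```

The two means are usually close but not equal, which is why inferential statistics is needed to reason from the statistic back to the parameter.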

Sources of data
When you perform the activity that collects the data, you are using a
primary data source.
When the data collection part of these activities is done by someone else,
you are using a secondary data source.
Primary Sources: The data collector is the one using the data for analysis:
Data from a political survey.
Data collected from an experiment. (A treatment is applied to part of
a population and responses are observed.)
Observed data (A researcher observes and measures characteristics
of interest of part of a population.)
Secondary Sources: The person performing data analysis is not the data
collector:
Analyzing census data.
Examining data from print journals or data published on the internet.

Chapter 2: Organizing and Visualizing Variables


Big picture of Statistics

Chapter 3: Numerical Descriptive Measures


1. Measures of Central Tendency
Sample Mean = sum of values divided by the number of values: $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$
n: sample size

580VN: Menu - 6 - 1 - Data - AC - OPTN - 2


570VN: Mode - 3 - 1 - Data - AC - Shift - 1 - 4 - 2
Sample Median: is the “middle” number (50% above, 50% below).

1. Rank the data set in increasing order.
2. Median position = $\frac{n+1}{2}$ position in the ordered data.
   - If the number of values is odd, the median is the middle number.
   - If the number of values is even, the median is the average of the two middle numbers.
580VN: Menu - 6 - 1 - Data - AC - OPTN - 2
Sample Mode:
Value that occurs most often.
There may be no mode OR there may be several modes.
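As a cross-check on the calculator keystrokes, the three measures of central tendency can be computed with Python's standard `statistics` module. This is only a sketch; the data values are the five numbers used in the variance example later in these notes.

```python
import statistics

data = [23, 27, 26, 25, 40]

mean = statistics.mean(data)        # sum of values / n
median = statistics.median(data)    # middle value of the ordered data
modes = statistics.multimode(data)  # every value tied for most frequent

print(mean, median, modes)
```

Every value here occurs exactly once, so `multimode` returns all five values, which is the "no single mode" situation described above.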

2. Measures of Variation and Shape


Sample Range = largest value - smallest value

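Sample Variance: Average (approximately) of squared deviations of values from the mean: $S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}$, where $\bar{X}$ is the sample mean and n is the sample size.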

580VN: Menu - 6 - 1 - Data - AC - OPTN - 2


570VN: Mode - 3 - 1 - Data - AC - Shift - 1 - 4 - 4 - ^2
Example: 23 27 26 25 40 → sample variance = 45.7
Sample Standard Deviation: Is the square root of the variance.
580VN: Menu - 6 - 1 - Data - AC - OPTN - 2
570VN: Mode - 3 - 1 - Data - AC - Shift - 1 - 4 - 4
Sample Coefficient of Variation: Measures relative variation. Always expressed as a percentage (%): $CV = \frac{S}{\bar{X}} \cdot 100\%$
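The measures of variation can be verified by hand in Python. The sketch below reuses the example data from these notes and writes the formulas out explicitly (dividing by n - 1 for a sample); the variable names are illustrative.

```python
import math

data = [23, 27, 26, 25, 40]
n = len(data)
mean = sum(data) / n

data_range = max(data) - min(data)

# Sample variance: squared deviations from the mean, divided by n - 1.
variance = sum((x - mean) ** 2 for x in data) / (n - 1)

std_dev = math.sqrt(variance)      # sample standard deviation
cv_percent = std_dev / mean * 100  # coefficient of variation, in %

print(round(variance, 1))  # 45.7, matching the calculator result above
print(data_range, round(std_dev, 2), round(cv_percent, 1))
```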

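Z Score: $Z = \frac{X - \bar{X}}{S}$, the number of standard deviations that a value X lies from the mean.

Z score > 3 or < -3 → X is an extreme value (outlier).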
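A short sketch of the Z-score rule, again on the example data from these notes. It assumes the sample mean and sample standard deviation are used for standardizing.

```python
import statistics

data = [23, 27, 26, 25, 40]
mean = statistics.mean(data)
s = statistics.stdev(data)  # sample standard deviation

# Z score of each value: distance from the mean in standard deviations.
z_scores = [(x - mean) / s for x in data]

# Flag values whose Z score is above 3 or below -3.
extreme = [x for x, z in zip(data, z_scores) if abs(z) > 3]

for x, z in zip(data, z_scores):
    print(x, round(z, 2))
print("extreme values:", extreme)
```

Even the largest value, 40, has a Z score below 2 here, so nothing is flagged by this rule.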


Shape of a Distribution (Skewness)
Mean = Median → symmetric
Mean < Median → left-skewed
Mean > Median → right-skewed
3. Exploring Numerical Variables
Box Plots

Quartiles split the ranked data into 4 segments with an equal number of
values per segment.
The first quartile, Q1, is the value for which 25% of the values are
smaller and 75% are larger.
Q2 is the same as the median (50% of the values are smaller and 50%
are larger).
Only 25% of the values are greater than the third quartile.
Find a quartile by determining the value in the appropriate position in the
ranked data, where:
First quartile position: Q1 = $\frac{n+1}{4}$ ranked value.
Second quartile position: Q2 = $\frac{n+1}{2}$ ranked value (= Median).
Third quartile position: Q3 = $\frac{3(n+1)}{4}$ ranked value.
where n is the number of observed values.

The Interquartile Range (IQR) is Q3 - Q1 and measures the spread in the middle 50% of the data.

An outlier is an observation that is numerically distant from the rest of the data (< Q1 - 1.5 * IQR OR > Q3 + 1.5 * IQR).
Five-number summary
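The quartile positions, IQR, outlier fences, and five-number summary (smallest value, Q1, median, Q3, largest value) can be sketched with `statistics.quantiles`. Its `method="exclusive"` option uses the same k(n+1)/4 positions; as an assumption of this sketch, it interpolates linearly when a position is not a whole number, which may differ from a textbook rounding rule on some data sets.

```python
import statistics

data = [23, 27, 26, 25, 40]

# method="exclusive" places quartile k at position k*(n+1)/4 in the ranked
# data and interpolates linearly when that position is not a whole number.
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

five_number_summary = (min(data), q1, q2, q3, max(data))
print(five_number_summary, iqr, outliers)
```

For this data the positions are 1.5, 3, and 4.5, so Q1 and Q3 are each the average of two neighbouring ranked values.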
4. Numerical Descriptive Measures for a Population
Population Mean (μ)

580VN: Menu - 6 - 1 - Data - AC - OPTN - 2


570VN: Mode - 3 - 1 - Data - AC - Shift - 1 - 4 - 2
Population Variance & Population Standard Deviation
Population Variance
580VN: Menu - 6 - 1 - Data - AC - OPTN - 2
570VN: Mode - 3 - 1 - Data - AC - Shift - 1 - 4 - 3 - ^2
Population Standard Deviation
580VN: Menu - 6 - 1 - Data - AC - OPTN - 2
570VN: Mode - 3 - 1 - Data - AC - Shift - 1 - 4 - 3
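The only computational difference between the population and sample versions is the divisor (N versus n - 1). A minimal sketch with the `statistics` module, treating the example data from these notes first as a whole population and then as a sample:

```python
import statistics

data = [23, 27, 26, 25, 40]

pop_var = statistics.pvariance(data)   # divides by N
pop_sd = statistics.pstdev(data)
samp_var = statistics.variance(data)   # divides by n - 1
samp_sd = statistics.stdev(data)

print(pop_var, round(pop_sd, 3))
print(samp_var, round(samp_sd, 3))
```

The population variance is always the smaller of the two for the same data, since it divides the same sum of squares by a larger number.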
Population
Probability distribution table
The Empirical Rule
Approximately 68% of the values are within ±1 standard deviation of the mean.
Approximately 95% of the values are within ±2 standard deviations of the mean.
Approximately 99.7% of the values are within ±3 standard deviations of the mean.
5. The Covariance and the Coefficient of Correlation
The Covariance: The covariance measures the strength of the linear
relationship between two numerical variables (X and Y).

data: population → population covariance → cov(X, Y) = $R \cdot \sigma_X \cdot \sigma_Y$
data: sample → sample covariance → cov(X, Y) = $R \cdot S_X \cdot S_Y$
The Coefficient of Correlation: The coefficient of correlation (R) measures the relative strength of a linear relationship between two numerical variables (X and Y).
Direction (negative/positive) → sign of R, with -1 <= R <= 1:
R < 0: negative correlation, as X increases Y decreases
R = 0: no linear relationship
R > 0: positive correlation, as X increases Y increases

Strength (strong/weak) → |R| or R^2, with 0 <= |R| <= 1:
|R| ~ 0 → weak correlation
|R| ~ 1 → strong correlation
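The link between covariance and correlation, cov(X, Y) = R * Sx * Sy, can be checked numerically. The paired data below is invented purely for illustration, and the sample versions (dividing by n - 1) are assumed.

```python
import math

# Hypothetical paired sample, made up for this sketch.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Sample covariance: average (n - 1) product of paired deviations.
cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

s_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
s_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))

# Coefficient of correlation: covariance rescaled to lie in [-1, 1].
r = cov_xy / (s_x * s_y)

print(round(cov_xy, 2), round(r, 4))
```

Here R is positive and fairly close to 1, so Y tends to increase with X and the linear relationship is reasonably strong.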

Chapter 4: Basic Probability


I. Basic probability concepts
1. Random experiment:
is a mechanism that produces a definite outcome that cannot be predicted
with certainty.
Ex: Rolling a die. There are 6 possible outcomes {1, 2, 3, 4, 5, 6}.
However, none of the outcomes can be predicted with certainty. → Rolling a die is a random experiment.
2. Random variable X: the number of dots showing when a die is rolled
Chapter 5: discrete random variables
Chapter 6: continuous random variables
3. Types of variables
Categorical variable: qualitative
Numeric variable: Discrete & Continuous
Discrete variables arise from a counting process.
(e.g., number of classes you are taking). Keyword: "the number of ..." (0, 1, 2, 3, ...)
Continuous variables arise from a measuring process.
(e.g., your annual salary, or your weight). Keyword: "the amount of ..." (50.1 kg, 55.5 kg, ...)
4. Sample space
is the collection of all possible outcomes.
Ex: Roll a die and record the number of dots. → S = {1, 2, 3, 4, 5, 6}
P(X = 7) = 0 (7 is not in S)
P(X < 10) = 1 (every outcome in S is below 10)
5. Event
is a subset of the sample space.
Ex: E1 = {6}, E2 = {even} = {2, 4, 6}
6. Probability
the numerical value representing the chance, likelihood, or possibility that a
certain event will occur (always between 0 and 1).
Ex: a deck of 52 cards.
P(spade) = number of spades / total number of cards = 13 / 52 = 1/4
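The card example can be reproduced by listing the deck and counting. This is a sketch using exact fractions; the rank and suit labels are just illustrative names.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["spades", "hearts", "diamonds", "clubs"]

# The sample space: every (rank, suit) pair in a standard deck.
deck = list(product(ranks, suits))

# The event "draw a spade" is a subset of the sample space.
spades = [card for card in deck if card[1] == "spades"]

p_spade = Fraction(len(spades), len(deck))
print(len(deck), len(spades), p_spade)
```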
Read the question → find the keyword → narrow down the chapter.
probability → ch4, ch5, ch6, ch7
II. Types of events
1. Impossible event and Certain event
Impossible event has no chance of occurring (probability = 0).
Certain event is sure to occur (probability = 1).
2. Simple event & Joint event
Simple event described by a single characteristic.
Ex: A day in January from all days in 2018.
Joint event described by two or more characteristics.
Ex: A day in January that is also a Wednesday from all days in 2018.
3. Complementary event
Complement of an event A (denoted A’): All events that are not part of
event A.
Ex: All days from 2018 that are not in January.
4. Mutually exclusive events (Disjoint events)
Events A and B are said to be mutually exclusive if it is not possible that
both occur at the same time.
Ex: Toss of a coin.
Let A be the event that the coin lands on heads.
Let B be the event that the coin lands on tails.
→ In a single fair coin toss, events A and B are mutually exclusive.
P(A ∩ B) = 0
P(A ∪ B) = P(A) + P(B) - P(A ∩ B) = P(A) + P(B) - 0 = P(A) + P(B)
5. Independent events
Events A and B are said to be independent if the probability of B occurring
is unaffected by the occurrence of the event A happening.
Ex: Tossing a coin twice.
Let A be the event that the first coin toss lands on heads.
Let B be the event that the second coin toss lands on heads.
🡪 Clearly the result of the first coin toss does not affect the result of
the second coin toss.
🡪 Events A and B are independent.
P(A ∩ B) = P(A) * P(B)
P(A ∪ B) = P(A) + P(B) - P(A ∩ B) = P(A) + P(B) - P(A) * P(B)
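The two-toss example can be checked by enumerating the four equally likely outcomes. A minimal sketch, assuming a fair coin; `prob` is a helper defined here for illustration.

```python
from fractions import Fraction
from itertools import product

# All outcomes of tossing a fair coin twice: HH, HT, TH, TT.
outcomes = list(product("HT", repeat=2))

A = {o for o in outcomes if o[0] == "H"}  # first toss lands on heads
B = {o for o in outcomes if o[1] == "H"}  # second toss lands on heads

def prob(event):
    """Classical probability: favourable outcomes / all outcomes."""
    return Fraction(len(event), len(outcomes))

p_and = prob(A & B)  # both tosses are heads
p_or = prob(A | B)   # at least one toss is heads

print(p_and, prob(A) * prob(B))  # equal, so A and B are independent
print(p_or)
```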
6. Collectively exhaustive events
One of the events must occur. The set of events covers the entire sample
space.
P(A ∪ B) = 1
7. Events associated with OR: A ∪ B
is the event that consists of all outcomes that are contained in at least one of the two events.
8. Events associated with AND: A ∩ B
is the event that consists of all outcomes that are contained in both events.

9. Event E1 but not E2
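Events are subsets of the sample space, so OR, AND, complement, and "E1 but not E2" map directly onto Python set operations. The sketch below reuses the die example (E1 = {6}, E2 = {even}); `prob` is a helper defined here for illustration.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}  # sample space for one roll of a die
E1 = {6}
E2 = {2, 4, 6}          # "even"

def prob(event):
    """Classical probability: favourable outcomes / all outcomes in S."""
    return Fraction(len(event & S), len(S))

either = E1 | E2     # OR  (union)
both = E1 & E2       # AND (intersection)
not_e2 = S - E2      # complement of E2
e1_not_e2 = E1 - E2  # E1 but not E2 (empty here, since 6 is even)
e2_not_e1 = E2 - E1  # E2 but not E1

print(prob(either), prob(both), prob(not_e2))
print(e1_not_e2, e2_not_e1)

# General addition rule: P(E1 or E2) = P(E1) + P(E2) - P(E1 and E2)
print(prob(E1) + prob(E2) - prob(both))
```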

III. Graph
1. Venn diagram
2. Contingency table

P(female | right-handed) = P(female and right-handed) / P(right-handed) = (44/100) / (87/100) = 44/87
Equivalently, by counts: |female and right-handed| / |right-handed| = 44/87
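The conditional probability above can be reproduced with exact fractions. Only the three counts quoted in the notes (44, 87, and 100) are used; the rest of the contingency table is not needed for this calculation.

```python
from fractions import Fraction

total = 100                   # people in the table
right_handed = 87             # right-handed people
female_and_right_handed = 44  # people who are both

p_joint = Fraction(female_and_right_handed, total)
p_right_handed = Fraction(right_handed, total)

# P(female | right-handed) = P(female and right-handed) / P(right-handed)
p_conditional = p_joint / p_right_handed

print(p_conditional)  # 44/87
```

Dividing the two probabilities cancels the common total, which is why working directly with counts gives the same answer.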

3. Decision tree

IV. General addition rule

V. Conditional probability (keywords: "if", "given that", "given")


Chapter 5: Discrete Probability Distributions & Chapter 6: Continuous
Probability Distributions
| | Discrete Random Variable | Continuous Random Variable |
|---|---|---|
| Definition | Discrete variables produce outcomes that come from a counting process. | Continuous variables produce outcomes that come from a measurement. |
| Example | Number of girls in a classroom. Number of blue marbles in a bag. Number of heads when flipping 5 coins. Number of typos on a page. Number of classes you are taking. | Height of boys in class. Weight of students in a class. Amount of lemonade in a jug. Time it takes to run a race. Lifetime of a battery. |
| Distribution | Binomial distribution, Poisson distribution | Uniform distribution, Normal distribution |

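| | Binomial distribution | Poisson distribution | Uniform distribution | Normal distribution |
|---|---|---|---|---|
| Definition | X: the number of successes in n trials | X: the number of events in a given unit of time/distance/area/volume. $\lambda$: mean number of events in a given unit of time/distance/area/volume ($\lambda$: average …) | | |
| Keyword | | Poisson | Uniform | Normal |
| Notation | $X \sim B(n, \pi)$ | $X \sim P(\lambda)$ | $X \sim CU(a, b)$ | $X \sim N(\mu, \sigma^2)$ |
| Mean & Variance | $E(X) = \mu = n\pi$ (expected value = mean); $V(X) = \sigma^2 = n\pi(1-\pi)$ | $E(X) = \mu = \lambda$; $V(X) = \sigma^2 = \lambda$ | $E(X) = \mu = \frac{a+b}{2}$; $V(X) = \sigma^2 = \frac{(a-b)^2}{12}$ | $E(X) = \mu$; $V(X) = \sigma^2$ |
| Probability | $P(X = x) = C_n^x \pi^x (1-\pi)^{n-x}$ → CASIO | $P(X = x) = \frac{e^{-\lambda} \lambda^x}{x!}$ → CASIO | $P(X = x) = 0$; $P(c < X < d) = \frac{d-c}{b-a}$ | $P(X = x) = 0$; $P(c < X < d)$ → CASIO |
| Given probability, find value | N/A | N/A | N/A | $P(X < x) = p$, find x → CASIO |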

Chapter 5: Discrete Probability Distributions


1. Binomial distribution X: the number of successes in n trials
2. Poisson distribution X: the number of events in a given unit of
time/distance/area/volume
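The binomial and Poisson probability formulas are short enough to write out directly in Python as a cross-check on calculator results. The function names and the two example scenarios (5 fair coin flips; an average of 3 typos per page) are assumptions made for this sketch.

```python
import math

def binomial_pmf(x, n, pi):
    """P(X = x) for X ~ B(n, pi): x successes in n independent trials."""
    return math.comb(n, x) * pi**x * (1 - pi) ** (n - x)

def poisson_pmf(x, lam):
    """P(X = x) for X ~ P(lam): x events when the mean count is lam."""
    return math.exp(-lam) * lam**x / math.factorial(x)

# Hypothetical: exactly 2 heads in 5 flips of a fair coin.
p_two_heads = binomial_pmf(2, 5, 0.5)

# Hypothetical: no typos on a page when the average is 3 typos per page.
p_no_typos = poisson_pmf(0, 3)

print(p_two_heads)           # 0.3125
print(round(p_no_typos, 4))  # 0.0498
```

Cumulative probabilities such as P(X <= 2) follow by summing the function over x = 0, 1, 2.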

Chapter 6: Continuous Probability Distributions


3. Uniform distribution

4. Normal distribution
Bell Shaped
Symmetrical
Mean, Median and Mode are Equal
Location is determined by the mean, μ.
Spread is determined by the standard deviation, σ.
The random variable has an infinite theoretical range: -∞ to +∞.

Chapter 6: The Normal Distribution and Other Continuous Distributions


Contents:
1. Continuous Random Variable
2. Normal Distribution

The Standardized Normal Distribution


Sigma (σ) indicates the spread (dispersion).
Whichever direction mu (μ) shifts, the graph shifts the same way.
Find Normal Probabilities
Given a Normal Probability, find the X Value
The Empirical Rule
3. The Uniform Distribution
Properties of the Uniform Distribution: mean, variance, standard
deviation
Find uniform probabilities
1. Continuous Random Variable
Continuous variables produce outcomes that come from a measurement.
e.g. annual salary
weight, in kg.
thickness of an item.
time required to complete a task.
temperature of a solution.

2. Normal Distribution (mean, variance/standard deviation)
variance = (standard deviation)^2
X ~ N(mean, variance)

Bell Shaped
Symmetrical
Mean, Median and Mode are Equal
Location is determined by the mean, μ.
Spread is determined by the standard deviation, σ.
Range: The random variable has an infinite theoretical
range: -∞ to +∞.
The Standardized Normal Distribution (Also known as the “Z” distribution)

Z-score: $Z = \frac{X - \mu}{\sigma}$
Mean is 0.
Standard Deviation is 1. → Variance = 1^2 = 1

The Empirical Rule


μ ± 1σ covers about 68.26% of X’s.
μ ± 2σ covers about 95.44% of X’s.
μ ± 3σ covers about 99.73% of X’s.
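These coverage figures can be recomputed from the standard normal CDF with `statistics.NormalDist` (Python 3.8+). The exact values come out as roughly 68.27%, 95.45%, and 99.73%; the figures in the notes appear to be rounded from a printed Z table, which would explain the small difference in the last digit.

```python
from statistics import NormalDist

Z = NormalDist(mu=0, sigma=1)  # the standardized normal distribution

# Area under the curve between -k and +k standard deviations.
coverage = {k: Z.cdf(k) - Z.cdf(-k) for k in (1, 2, 3)}

for k, area in coverage.items():
    print(f"within ±{k} sd: {area:.4%}")
```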
Calculating Normal Probabilities

Probability is measured by the area under the curve.


The total area under the curve is 1.0, and the curve is symmetric, so half is
above the mean, half is below.
CASIO 570VN / CASIO 580VN
Continuous: P(X > 5) = P(X >= 5), because P(X = 5) = 0
Discrete (whole-number values): P(X > 5) = P(X >= 6)
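The area-under-the-curve calculation can be sketched with `statistics.NormalDist` as an alternative to the calculator. The distribution below (mean 50, standard deviation 10) and the bounds 40 and 60 are hypothetical. Note that `NormalDist` takes the standard deviation, not the variance, as its second argument.

```python
from statistics import NormalDist

X = NormalDist(mu=50, sigma=10)  # hypothetical X ~ N(50, 10^2)
Z = NormalDist()                 # standardized normal: mean 0, sd 1

c, d = 40, 60

# P(c < X < d) is the area under the curve between c and d.
p_between = X.cdf(d) - X.cdf(c)

# Standardizing d gives the same cumulative probability on the Z scale.
z_d = (d - X.mean) / X.stdev
p_below_d = Z.cdf(z_d)

print(round(p_between, 4))
print(round(X.cdf(d), 4), round(p_below_d, 4))
```

Because the curve is symmetric with total area 1, `X.cdf(50)` is 0.5: half of the area lies below the mean.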
Given a Normal Probability. Find the X Value
CASIO 570VN / CASIO 580VN
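Going from a probability back to an X value is the inverse of the CDF. A minimal sketch with `NormalDist.inv_cdf`, using the same kind of hypothetical distribution (mean 50, standard deviation 10) and a made-up target probability of 0.975:

```python
from statistics import NormalDist

X = NormalDist(mu=50, sigma=10)  # hypothetical X ~ N(50, 10^2)

p = 0.975               # given: P(X < x) = p
x_value = X.inv_cdf(p)  # find x

# Equivalent route: find the Z value first, then convert back to X.
z_value = NormalDist().inv_cdf(p)
x_from_z = X.mean + z_value * X.stdev

print(round(x_value, 2), round(z_value, 2))
```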

3. Uniform Distribution (a, b)


a: min
b: max
X ~ U (a,b)

Symmetrical
Also called a rectangular distribution
Range: Any value between the smallest and largest is equally likely.

Properties of the Uniform Distribution: mean, variance, standard deviation
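For X ~ U(a, b), the mean is (a + b)/2, the variance is (b - a)^2/12, and P(c < X < d) = (d - c)/(b - a) for a <= c < d <= b. The sketch below uses hypothetical endpoints a = 2 and b = 6; `uniform_prob` is an illustrative helper, not part of any library.

```python
import math

a, b = 2.0, 6.0  # hypothetical min and max

mean = (a + b) / 2
variance = (b - a) ** 2 / 12
std_dev = math.sqrt(variance)

def uniform_prob(c, d):
    """P(c < X < d) for X ~ U(a, b), clipping the interval to [a, b]."""
    lo, hi = max(a, c), min(b, d)
    return max(hi - lo, 0.0) / (b - a)

p = uniform_prob(3, 5)  # (5 - 3) / (6 - 2)

print(mean, round(variance, 4), round(std_dev, 4), p)
```

Every interval of the same width inside [a, b] gets the same probability, which is what "equally likely" means for a continuous variable.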
normal dist
1 probability density function (PDF)