Statistics For Data Science
Definition
Statistics is the science, or a branch of mathematics, that involves collecting,
classifying, analyzing, interpreting, and presenting numerical facts and data. It is
especially handy when dealing with populations too numerous and extensive for
specific, detailed measurements. Statistics are crucial for drawing general
conclusions relating to a dataset from a data sample.
Types of Statistics
There are two types of Statistics:
1. Descriptive Statistics
2. Inferential Statistics
Types of Data
Data can be of four types: Discrete, Continuous, Nominal, and Ordinal. For example, ‘Age’ is Discrete,
‘Height’ is Continuous, ‘Sex’ is Nominal and ‘Academic Performance’ is Ordinal data
Sample Data & Population Data
A population is the entire group that you want to draw conclusions about.
A sample is the specific group that you will collect data from. The size of the
sample is always less than the total size of the population.
Sampling Techniques:
• Simple Random Sampling
• Systematic Sampling
• Stratified Sampling
• Cluster Sampling
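As an illustration, here is a minimal Python sketch of the first two techniques; the population of 100 member IDs and the sample size of 10 are made-up values:

```python
import random

population = list(range(1, 101))  # hypothetical population of 100 member IDs

# Simple random sampling: every member has an equal chance of selection
random.seed(42)  # fixed seed so the example is reproducible
simple_random = random.sample(population, 10)

# Systematic sampling: pick every k-th member after a random start
k = len(population) // 10          # sampling interval
start = random.randrange(k)        # random starting point within the first interval
systematic = population[start::k]  # every k-th member from the start onward

print(len(simple_random), len(systematic))  # both samples contain 10 members
```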
Descriptive Statistics
• Descriptive statistics describe, show, and summarize the basic features of a
dataset found in a given study, presented in a summary that describes
the data sample and its measurements. It helps analysts to understand the data
better.
• Descriptive statistics represent the available data sample and do not include
theories, inferences, probabilities, or conclusions. That’s a job for inferential
statistics.
Topics under descriptive statistics:
1. Measures of central tendency
2. Measures of variability
3. Distribution (Also Called Frequency Distribution)
Let’s take these topics one at a time
Measures of Central Tendency
There are three fundamental concepts under this topic:
1. Mean
2. Median
3. Mode
Mean
The mean of a set of data is the sum of all the values divided by the number of values. Unlike the median, the mean is sensitive to outliers: a single extreme value can pull it noticeably toward itself.
[Figure: the red vertical line represents the mean before adding the outlier; the green vertical line represents the mean after adding the outlier.]
Median
The median of a set of data is the middlemost number or centre value in the set.
The median is also the number that is halfway into the set.
To find the median, the data should first be arranged in order from least to greatest (or
greatest to least) value.
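All three measures are available in Python’s standard statistics module; a quick sketch with a made-up dataset:

```python
import statistics

data = [2, 3, 3, 5, 7, 10]  # hypothetical dataset

print(statistics.mean(data))    # (2+3+3+5+7+10)/6 = 5
print(statistics.median(data))  # middle of the sorted data: (3+5)/2 = 4
print(statistics.mode(data))    # most frequent value: 3
```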
Variance
• Population variance: when you have collected data from every member of the population that you’re
interested in, you can get an exact value for population variance.
• The population variance formula looks like this:
σ² = Σᵢ₌₁ᴺ (Xᵢ − μ)² / N
N: The number of values in the population, μ: Population mean
Variance Contd.
Sample variance: s² = Σᵢ₌₁ⁿ (Xᵢ − x̄)² / (n − 1)
n: The number of values in the sample, x̄: Sample mean
• With samples, we use n – 1 in the formula because using n would give us a biased
estimate that consistently underestimates variability. The sample variance would
tend to be lower than the real variance of the population.
• Dividing by n − 1 instead of n makes the computed variance slightly larger, giving you an
unbiased estimate of variability: it is better to overestimate rather than
underestimate variability in samples.
Variance Example
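As a worked sketch in Python (the dataset is hypothetical), the standard statistics module computes both versions directly:

```python
import statistics

data = [46, 69, 32, 60, 52, 41]  # hypothetical dataset, mean = 50

# Population variance: divide the squared deviations by N
pop_var = statistics.pvariance(data)   # 886 / 6

# Sample variance: divide by n - 1 (Bessel's correction)
samp_var = statistics.variance(data)   # 886 / 5 = 177.2

print(pop_var, samp_var)  # the sample variance is larger than the population variance
```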
Standard Deviation
A standard deviation (or σ) is a measure of how dispersed the data is in relation to the
mean. Low standard deviation means data are clustered around the mean, and high
standard deviation indicates data are more spread out.
s = √( Σᵢ₌₁ⁿ (Xᵢ − x̄)² / (n − 1) ) (Sample SD)
σ = √( Σᵢ₌₁ᴺ (Xᵢ − μ)² / N ) (Population SD)
• Importance of SD: Standard deviation is a useful measure of spread for normal
distributions. In normal distributions, data is symmetrically distributed with
no skew. Most values cluster around a central region, with values tapering off as
they go further away from the center. The standard deviation tells you how
spread out from the center of the distribution your data is on average.
• Many scientific variables follow normal distributions, including height,
standardized test scores, and job satisfaction ratings.
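A quick sketch of both formulas in Python, reusing the standard statistics module (the dataset is hypothetical):

```python
import statistics

data = [46, 69, 32, 60, 52, 41]  # hypothetical dataset

samp_sd = statistics.stdev(data)   # square root of the sample variance
pop_sd = statistics.pstdev(data)   # square root of the population variance

# The SD is the square root of the variance, back in the data's own units
print(round(samp_sd, 2), round(pop_sd, 2))
```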
SD Contd.
The standard deviation reflects the
dispersion of the distribution. The curve with
the lowest standard deviation has a high
peak and a small spread, while the curve
with the highest standard deviation is more
flat and widespread.
PDF: f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²)) (Normal distribution)
Distribution Contd.
Log Normal Distribution: In probability theory, a log-normal (or lognormal)
distribution is a continuous probability distribution of a random variable whose
logarithm is normally distributed. Thus, if the random variable X is log-normally
distributed, then Y = ln(X) has a normal distribution.
PDF: f(x) = (1 / (xσ√(2π))) · e^(−(ln x − μ)² / (2σ²)), for x > 0
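A small simulated check of this relationship (the parameters µ = 0, σ = 1 and the sample size are arbitrary choices): drawing log-normal values and taking their logarithm should recover a roughly normal sample with the underlying mean and SD.

```python
import math
import random

random.seed(0)  # reproducible hypothetical simulation

# Draw 10,000 log-normal values with underlying normal parameters mu=0, sigma=1
xs = [random.lognormvariate(0, 1) for _ in range(10_000)]

# Taking the natural log of each value should give a normally distributed sample
ys = [math.log(x) for x in xs]

mean_y = sum(ys) / len(ys)
sd_y = (sum((y - mean_y) ** 2 for y in ys) / (len(ys) - 1)) ** 0.5

print(round(mean_y, 1), round(sd_y, 1))  # close to mu = 0 and sigma = 1
```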
Distribution Contd.
Right Skewed Distribution (Positive skewness): Right skewed distributions occur
when the long tail is on the right side of the distribution. Analysts also refer to
them as positively skewed.
Left Skewed Distribution (Negative skewness): Left skewed distributions occur
when the long tail is on the left side of the distribution. Statisticians also refer to
them as negatively skewed.
We can describe the sampling distribution of the mean using this notation:
X̄ ~ N(µ, σ/√n)
Where:
• X̄ is the sampling distribution of the sample means
• ~ means “follows the distribution”
• N is the normal distribution
• µ is the mean of the population
• σ is the standard deviation of the population
• n is the sample size
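A small simulation sketch of this result (the uniform(0, 1) population and the sample size n = 25 are arbitrary choices): the SD of many sample means should come out close to σ/√n.

```python
import random
import statistics

random.seed(1)  # reproducible hypothetical simulation

n = 25          # sample size
trials = 2_000  # number of repeated samples

# Draw many samples from a uniform(0, 1) population and record each sample mean
sample_means = [
    statistics.mean(random.random() for _ in range(n)) for _ in range(trials)
]

sigma = (1 / 12) ** 0.5            # population SD of uniform(0, 1), about 0.289
standard_error = sigma / n ** 0.5  # sigma / sqrt(n), predicted spread of the means

print(round(statistics.stdev(sample_means), 3), round(standard_error, 3))
```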
Z-test
• A z test is conducted on a population that follows a normal distribution with
independent data points and has a sample size that is greater than or equal to
30. It is used to check whether the means of two populations are equal to each
other when the population variance is known. The null hypothesis of a z test can
be rejected if the z test statistic is statistically significant when compared with the
critical value.
Left Tailed Test:
• Null Hypothesis: H0 : μ=μ0
• Alternate Hypothesis: H1 : μ<μ0
• Decision Criteria: If the z statistic < z critical value then reject the null hypothesis.
Let’s take an example of left tailed Z test.
Example 1: An online medicine shop claims that the mean delivery time for medicines is
less than 120 minutes with a standard deviation of 30 minutes. Is there enough evidence
to support this claim at a 0.05 significance level if 49 orders were examined with a mean of
100 minutes?
Solution: As the sample size is 49 and population standard deviation is known, this is an
example of a left-tailed one-sample z test.
H0 : μ=120
H1 : μ<120
From the z table, the critical value = -1.645 (at α = 0.05). A negative sign is used as this is a
left tailed test.
z = (x̄ − μ) / (σ/√n)
x̄ = 100, μ = 120, n = 49, σ = 30
z = (100 − 120) / (30/√49) = −20 / 4.29 ≈ −4.67
As −4.67 < −1.645, the null hypothesis is rejected and it is concluded that there is
enough evidence to support the medicine shop's claim.
Answer: Reject the null hypothesis
Right Tailed Test:
• Null Hypothesis: H0: μ=μ0
• Alternate Hypothesis: H1 : μ>μ0
• Decision Criteria: If the z statistic > z critical value then reject the null hypothesis.
Example 2: A teacher claims that the mean score of students in his class is greater
than 82 with a standard deviation of 20. If a sample of 81 students was selected
with a mean score of 90 then check if there is enough evidence to support this
claim at a 0.05 significance level.
Solution: As the sample size is 81 and population standard deviation is known, this
is an example of a right-tailed one-sample z test.
H0: μ=82
H1: μ>82
From the z table, the critical value = 1.645 (at α=0.05)
z = (x̄ − μ) / (σ/√n)
x̄ = 90, μ = 82, n = 81, σ = 20
z = (90 − 82) / (20/√81) = 8 / 2.22 = 3.6
As 3.6 > 1.645, the null hypothesis is rejected and there is enough evidence to support the teacher's claim.
Answer: Reject the null hypothesis
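Both z-test examples can be verified with a short Python sketch; NormalDist from the standard library supplies the critical value, so no external packages are assumed:

```python
from statistics import NormalDist

def one_sample_z(x_bar, mu, sigma, n):
    """Return the z statistic for a one-sample z test."""
    return (x_bar - mu) / (sigma / n ** 0.5)

alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha)  # one-tailed critical value, about 1.645

# Example 1 (left-tailed): delivery times
z1 = one_sample_z(100, 120, 30, 49)
print(round(z1, 2), z1 < -z_crit)  # reject H0 if z < -critical value

# Example 2 (right-tailed): test scores
z2 = one_sample_z(90, 82, 20, 81)
print(round(z2, 2), z2 > z_crit)   # reject H0 if z > critical value
```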
Chi-Square Test
The chi-square statistic compares observed and expected frequencies:
Χ² = Σ (O − E)² / E
Where:
• Χ2 is the chi-square test statistic
• Σ is the summation operator (it means “take the sum of”)
• O is the observed frequency
• E is the expected frequency
The larger the difference between the observations and the expectations (O − E in
the equation), the bigger the chi-square will be. To decide whether the difference
is big enough to be statistically significant, you compare the chi-square value to a
critical value.
Example:
After weeks of hard work, your dog food experiment is complete and you compile
your data in a table:
Observed and expected frequencies of dogs’ flavor choices
Flavor Observed Expected
Garlic Blast 22 25
Blueberry Delight 30 25
Minty Munch 23 25
Would you conclude that the frequencies of the dogs' flavor choices are in different
proportions? (Significance level = 0.05)
Solution:
Null hypothesis (H0): The dog population chooses the three flavors in equal
proportions (p1 = p2 = p3).
Alternative hypothesis (Ha): The dog population does not choose the three flavors
in equal proportions.
Degree of freedom (df) = number of groups – 1 = 3 – 1 = 2
For a test of significance at α = .05 and df = 2, the Χ2 critical value is 5.99.
Let’s calculate the chi-square statistic:
Χ² = (22−25)²/25 + (30−25)²/25 + (23−25)²/25 = (9 + 25 + 4)/25 = 1.52
Also, p-value = 0.4677 (we can find this with any p-value calculator available on the
internet)
The Χ² value is less than the critical value (1.52 < 5.99). Therefore, we should not
reject the null hypothesis that the dog population chooses the three flavors in
equal proportions. There is no significant difference between the observed and
expected flavor choice distribution (p > 0.05). This suggests that the dog food
flavors are equally popular in the dog population.
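The same calculation as a Python sketch. Note that the closed-form p-value e^(−Χ²/2) used below holds only for df = 2; for other degrees of freedom you would need a chi-square CDF (e.g. from scipy.stats):

```python
import math

observed = [22, 30, 23]
expected = [25, 25, 25]

# Chi-square statistic: sum of (O - E)^2 / E over all categories
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

df = len(observed) - 1         # 3 groups -> 2 degrees of freedom
p_value = math.exp(-chi2 / 2)  # exact upper-tail p-value, valid for df = 2 only

print(round(chi2, 2), round(p_value, 4))  # 1.52 and 0.4677
```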
ANOVA (Analysis of Variance)
An ANOVA test is a statistical test used to determine if there is a statistically
significant difference between two or more categorical groups by testing for
differences of means using variance.
Assumptions Of ANOVA
• The assumptions of the ANOVA test are the same as the general assumptions for
any parametric test:
• An ANOVA can only be conducted if there is no relationship between the
subjects in each sample. This means that subjects in the first group cannot also
be in the second group (e.g., independent samples/between groups).
• The different groups/levels must have equal sample sizes.
• An ANOVA can only be conducted if the dependent variable is normally
distributed so that the middle scores are the most frequent and the extreme
scores are the least frequent.
• Population variances must be equal (i.e., homoscedastic). Homogeneity of
variance means that the deviation of scores (measured by the range or standard
deviation, for example) is similar between populations.
Types Of ANOVA Tests:
• One way ANOVA
• Two way ANOVA
One way ANOVA:
A one-way ANOVA (analysis of variance) has one categorical independent variable
(also known as a factor) and a normally distributed continuous (i.e., interval or ratio
level) dependent variable.
The independent variable divides cases into two or more mutually exclusive levels,
categories, or groups.
The one-way ANOVA tests for differences in the means of the dependent variable,
broken down by the levels of the independent variable.
An example of a one-way ANOVA includes testing a therapeutic intervention (CBT,
medication, placebo) on the incidence of depression in a clinical sample.
Note: Both the One-Way ANOVA and the Independent Samples t-Test can compare
the means for two groups. However, only the One-Way ANOVA can compare the
means across three or more groups.
For example, with n = 30 data points divided into k = 3 groups:
Error degrees of freedom (dfe) = n − k, where n is the total number of data points
and k is the number of groups
dfe = 30 − 3 = 27
Degrees of freedom between the groups (dfb) = k − 1, where k is the number of
groups
dfb = 3 − 1 = 2
Total degrees of freedom (dft) = n − 1 = dfe + dfb = 29
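A minimal one-way ANOVA sketch using scipy's f_oneway (assuming scipy is available); the three groups of 10 hypothetical scores match the n = 30, k = 3 degrees-of-freedom example:

```python
from scipy import stats

# Hypothetical scores for three independent groups (10 observations each)
group_a = [85, 86, 88, 75, 78, 94, 98, 79, 71, 80]
group_b = [91, 92, 93, 85, 87, 84, 82, 88, 95, 96]
group_c = [79, 78, 88, 94, 92, 85, 83, 85, 82, 81]

# One-way ANOVA: test whether the three group means are equal
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)

print(round(f_stat, 2), round(p_value, 4))
```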
F-test
An F-test compares the variances of two samples. The steps to find the F-test critical value at a specific alpha level (or significance level), α,
are as follows:
• Find the degrees of freedom of the first sample. This is done by subtracting 1 from the
first sample size. Thus, x = n1−1.
• Determine the degrees of freedom of the second sample by subtracting 1 from the
sample size. This gives y = n2−1.
• If it is a right-tailed test, the critical value is looked up at significance level α. For a
left-tailed test, use 1 − α. For a two-tailed test, use α / 2.
• The F table is used to find the critical value at the required alpha level.
• The intersection of the x column and the y row in the f table will give the f test critical
value.
Let’s take an example to understand the F-test
Example: A bank has a head office in Delhi and a branch in Mumbai. There are long
customer queues at one office, while customer queues are short at the other. The
Operations Manager of the bank wonders if the number of customers at one branch is
more variable than at the other. He carries out a research study of
customers.
The variance of Delhi head office customers is 31, and that for the Mumbai branch is 20.
The sample size for the Delhi head office is 11, and that for the Mumbai branch is 21. Carry
out a two-tailed F-test with a level of significance of 10%.
Solution:
• Step 1: Null Hypothesis H0: σ1² = σ2²
• Alternate Hypothesis Ha: σ1² ≠ σ2²
• Step 2: F statistic = F Value = σ1² / σ2² = 31/20 = 1.55
• Step 3: df1 = n1 – 1 = 11-1 = 10
• df2 = n2 – 1 = 21-1 = 20
• Step 4: Since it is a two-tailed test, alpha level = 0.10/2 = 0.05. The F value from the F
Table with degrees of freedom as 10 and 20 is 2.348.
• Step 5: Since the F statistic (1.55) is lesser than the table value obtained (2.348), we
cannot reject the null hypothesis.
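The same F-test can be sketched in Python, with scipy (assumed available) supplying the critical value:

```python
from scipy import stats

var1, var2 = 31, 20  # sample variances (Delhi, Mumbai)
n1, n2 = 11, 21      # sample sizes

f_stat = var1 / var2        # F statistic = 1.55
df1, df2 = n1 - 1, n2 - 1   # 10 and 20 degrees of freedom

alpha = 0.10
f_crit = stats.f.ppf(1 - alpha / 2, df1, df2)  # two-tailed: look up at alpha/2

print(round(f_stat, 2), round(f_crit, 3))
print(f_stat > f_crit)  # False: cannot reject H0
```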
Regression Analysis
Regression analysis is a set of statistical methods used for the estimation of
relationships between a dependent variable and one or more independent
variables. It can be utilized to assess the strength of the relationship between
variables and for modeling the future relationship between them.
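As a small illustration of estimating such a relationship, simple linear regression has closed-form least-squares formulas (the data points below are made up and happen to lie exactly on a line):

```python
# Hypothetical paired observations (x = independent, y = dependent variable)
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]  # here y follows y = 2x + 1 exactly

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares estimates: slope = cov(x, y) / var(x), intercept from the means
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_y - slope * mean_x

print(slope, intercept)  # 2.0 and 1.0
```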
Bayes Theorem:
Bayes theorem gives the probability of an event based on prior knowledge of
conditions related to the event: P(A/B) = P(B/A) * P(A) / P(B).
Let’s understand this with an example. Suppose one of three bags A, B and C is
chosen at random (each with probability 1/3), and a ball is drawn from it. The
probability of drawing a red ball is 2/5 from bag A, 1/5 from bag B and 3/5 from bag C.
A) What is the probability of getting a red ball given that A is chosen?
Answer: P(R/A) = 2/5
B) What is the probability of getting a red ball from bag A?
Answer: P(A∩R) = P(A) * P(R/A) = 1/3 * 2/5 = 2/15
C) What is the probability of getting a red ball?
Answer: P(R) = P(A∩R) + P(B∩R) + P(C∩R)
D) What is the conditional probability that bag A is chosen given that red ball is
drawn? (This is a Bayes theory problem)
Answer: P(A/R) = P(A∩R) / [P(A∩R) + P(B∩R) + P(C∩R)]
= P(A)·P(R/A) / [P(A)·P(R/A) + P(B)·P(R/B) + P(C)·P(R/C)]
P(A/R) = P(A)·P(R/A) / P(R) (Bayes theorem)
P(A/R) = (1/3 · 2/5) / [(1/3 · 2/5) + (1/3 · 1/5) + (1/3 · 3/5)] = (2/15) / (6/15) = 1/3
So the conditional probability that bag A is chosen given that red ball is drawn, is
1/3.
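The whole calculation can be sketched in a few lines of Python (the priors and likelihoods are taken from the example above):

```python
# Priors P(bag) and likelihoods P(red | bag) from the three-bag example
priors = {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}
likelihoods = {"A": 2 / 5, "B": 1 / 5, "C": 3 / 5}

# Total probability of red: P(R) = sum of P(bag) * P(red | bag) over all bags
p_red = sum(priors[b] * likelihoods[b] for b in priors)

# Bayes theorem: P(A | red) = P(A) * P(red | A) / P(red)
posterior_a = priors["A"] * likelihoods["A"] / p_red

print(round(p_red, 2), round(posterior_a, 4))  # 0.4 and 0.3333
```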