Data Analytics TB


Lecture 4: Central Tendency and Dispersion

Dr. A. Ramesh
Department of Management Studies

1
Lecture objectives

• Central tendency
• Measures of Dispersion

2
Measures of Central Tendency

• Measures of central tendency yield information about “particular places or locations in a group of numbers”
• A single number to describe the characteristics of a set of data

3
Summary statistics

• Central tendency or measures of location
  – Arithmetic mean
  – Weighted mean
  – Median
  – Percentile
• Dispersion
  – Skewness
  – Kurtosis
  – Range
  – Interquartile range
  – Variance
  – Standard score
  – Coefficient of variation

4
Arithmetic Mean
• Commonly called ‘the mean’
• It is the average of a group of numbers
• Applicable for interval and ratio data
• Not applicable for nominal or ordinal data
• Affected by each value in the data set, including extreme values
• Computed by summing all values in the data set and dividing the sum by
the number of values in the data set

5
Population Mean

\mu = \frac{\sum X}{N} = \frac{X_1 + X_2 + X_3 + \cdots + X_N}{N} = \frac{24 + 13 + 19 + 26 + 11}{5} = \frac{93}{5} = 18.6

6
Sample Mean

\bar{X} = \frac{\sum X}{n} = \frac{X_1 + X_2 + X_3 + \cdots + X_n}{n} = \frac{57 + 86 + 42 + 38 + 90 + 66}{6} = \frac{379}{6} = 63.167

7
Mean of Grouped Data
• Weighted average of class midpoints
• Class frequencies are the weights

\mu = \frac{\sum fM}{\sum f} = \frac{\sum fM}{N} = \frac{f_1 M_1 + f_2 M_2 + f_3 M_3 + \cdots + f_i M_i}{f_1 + f_2 + f_3 + \cdots + f_i}

8
Calculation of Grouped Mean
Class Interval Frequency(f) Class Midpoint(M) fM
20-under 30 6 25 150
30-under 40 18 35 630
40-under 50 11 45 495
50-under 60 11 55 605
60-under 70 3 65 195
70-under 80 1 75 75
50 2150
\mu = \frac{\sum fM}{\sum f} = \frac{2150}{50} = 43.0

9
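A quick way to reproduce this calculation in Python, sketched with pandas (the column names here are our own):

```python
import pandas as pd

# Class midpoints (M) and frequencies (f) from the table above
grouped = pd.DataFrame({
    "midpoint": [25, 35, 45, 55, 65, 75],
    "freq":     [6, 18, 11, 11, 3, 1],
})

# Grouped mean = sum(f*M) / sum(f)
mean = (grouped["freq"] * grouped["midpoint"]).sum() / grouped["freq"].sum()
print(mean)  # 43.0
```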
Weighted Average

• Sometimes we wish to average numbers, but we want to assign more importance, or weight, to some of the numbers
• The average you need is the weighted average


Formula for Weighted Average
\text{Weighted Average} = \frac{\sum xw}{\sum w}

where x is a data value and w is the weight assigned to that data value; the sum is taken over all data values.
Example
Suppose your midterm test score is 83 and your final exam score is 95.
Using weights of 40% for the midterm and 60% for the final exam, compute
the weighted average of your scores. If the minimum average for an A is
90, will you earn an A?

\text{Weighted Average} = \frac{83(0.40) + 95(0.60)}{0.40 + 0.60} = \frac{33.2 + 57}{1} = 90.2

You will earn an A!
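The same result with NumPy, which handles the division by the sum of the weights (a minimal sketch):

```python
import numpy as np

scores  = [83, 95]        # midterm, final
weights = [0.40, 0.60]

# np.average divides the weighted sum by the sum of the weights
print(np.average(scores, weights=weights))  # 90.2
```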
Median
• Middle value in an ordered array of numbers

• Applicable for ordinal, interval, and ratio data

• Not applicable for nominal data

• Unaffected by extremely large and extremely small values

13
Median: Computational Procedure
• First Procedure
– Arrange the observations in an ordered array
– If there is an odd number of terms, the median is the middle term of the
ordered array
– If there is an even number of terms, the median is the average of the
middle two terms
• Second Procedure
– The median’s position in an ordered array is given by (n+1)/2.

14
Median: Example with an Odd Number of Terms

Ordered Array
3 4 5 7 8 9 11 14 15 16 16 17 19 19 20 21 22
• There are 17 terms in the ordered array.
• Position of median = (n+1)/2 = (17+1)/2 = 9
• The median is the 9th term, 15.
• If the 22 is replaced by 100, the median is 15.
• If the 3 is replaced by -103, the median is 15.

15
Median: Example with an Even Number of Terms
Ordered Array
3 4 5 7 8 9 11 14 15 16 16 17 19 19 20 21

• There are 16 terms in the ordered array

• Position of median = (n+1)/2 = (16+1)/2 = 8.5

• The median is between the 8th and 9th terms, 14.5

• If the 21 is replaced by 100, the median is 14.5

• If the 3 is replaced by -88, the median is 14.5

16
Median of Grouped Data
\text{Median} = L + \frac{\dfrac{N}{2} - cf_p}{f_{med}} \cdot W

where:
L = the lower limit of the median class
cf_p = cumulative frequency of the class preceding the median class
f_{med} = frequency of the median class
W = width of the median class
N = total of frequencies

17
Median of Grouped Data -- Example
Class Interval   Frequency   Cumulative Frequency
20-under 30      6           6
30-under 40      18          24
40-under 50      11          35
50-under 60      11          46
60-under 70      3           49
70-under 80      1           50
N = 50

M_d = L + \frac{\dfrac{N}{2} - cf_p}{f_{med}} \cdot W = 40 + \frac{\dfrac{50}{2} - 24}{11} \cdot 10 = 40.909

18
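The grouped-median formula translates directly into Python (a sketch using the values read off the table above):

```python
# Grouped median: Md = L + ((N/2 - cfp) / fmed) * W
L, W = 40, 10          # lower limit and width of the median class (40-under 50)
N = 50                 # total frequency
cfp = 24               # cumulative frequency of the class preceding the median class
fmed = 11              # frequency of the median class

md = L + ((N / 2 - cfp) / fmed) * W
print(round(md, 3))    # 40.909
```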
Mode

• The most frequently occurring value in a data set

• Applicable to all levels of data measurement (nominal, ordinal, interval,


and ratio)

• Bimodal -- Data sets that have two modes

• Multimodal -- Data sets that contain more than two modes

19
Mode -- Example

• The mode is 44
• There are more 44s than any other value

35 41 44 45
37 41 44 46
37 43 44 46
39 43 44 46
40 43 44 46
40 43 45 48

20
Mode of Grouped Data
• Midpoint of the modal class
• Modal class has the greatest frequency

Class Interval   Frequency
20-under 30      6
30-under 40      18
40-under 50      11
50-under 60      11
60-under 70      3
70-under 80      1

\text{Mode} = L_{Mo} + \left(\frac{d_1}{d_1 + d_2}\right) w = 30 + \left(\frac{12}{12 + 7}\right) 10 = 36.31

21
Percentiles
• Measures of central tendency that divide a group of data into 100 parts

• Example: 90th percentile indicates that at most 90% of the data lie
below it, and at least 10% of the data lie above it

• The median and the 50th percentile have the same value

• Applicable for ordinal, interval, and ratio data

• Not applicable for nominal data

23
Percentiles: Computational Procedure
• Organize the data into an ascending ordered array
• Calculate the Pth percentile location: i = \frac{P}{100}\, n
• Determine the percentile’s location and its value.

• If i is a whole number, the percentile is the average of the values at the i and (i+1) positions
• If i is not a whole number, the percentile is at the position given by the whole-number part of (i + 1) in the ordered array

24
Percentiles: Example
• Raw Data: 14, 12, 19, 23, 5, 13, 28, 17
• Ordered Array: 5, 12, 13, 14, 17, 19, 23, 28
• Location of 30th percentile: i = \frac{30}{100}(8) = 2.4

• The location index, i, is not a whole number; i + 1 = 2.4 + 1 = 3.4; the whole-number portion is 3; the 30th percentile is at the 3rd location of the array; the 30th percentile is 13.

25
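The textbook rule can be implemented directly in Python; note that np.percentile's default linear interpolation gives a slightly different answer (a sketch, assuming NumPy):

```python
import numpy as np

data = [14, 12, 19, 23, 5, 13, 28, 17]

def percentile_textbook(x, p):
    """Textbook rule: i = (p/100)*n; if i is whole, average the values at
    positions i and i+1; otherwise take position int(i)+1 (1-based)."""
    x = np.sort(x)
    i = p / 100 * len(x)
    if i == int(i):
        return (x[int(i) - 1] + x[int(i)]) / 2
    return x[int(i)]   # 0-based index int(i) == 1-based position int(i)+1

print(percentile_textbook(data, 30))   # 13
print(np.percentile(data, 30))         # 13.1 -- numpy's default linear interpolation
```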
Dispersion

• Measures of variability describe the spread or the dispersion of a set of data
• Reliability of measure of central tendency
• To compare dispersion of various samples

26
Variability

[Figure: two cash-flow distributions with the same mean, one with no variability and one with variability]

27
Measures of Variability or dispersion
Common Measures of Variability
• Range
• Inter-quartile range
• Mean Absolute Deviation
• Variance
• Standard Deviation
• Z scores
• Coefficient of Variation

28
Range – ungrouped data

• The difference between the largest and the smallest values in a set of data
• Simple to compute
• Ignores all data points except the two extremes
• Example (for the data below): Range = Largest − Smallest = 48 − 35 = 13

35 41 44 45
37 41 44 46
37 43 44 46
39 43 44 46
40 43 44 46
40 43 45 48

29
Quartiles
• Measures of central tendency that divide a group of data into four subgroups

• Q1: 25% of the data set is below the first quartile

• Q2: 50% of the data set is below the second quartile

• Q3: 75% of the data set is below the third quartile

• Q1 is equal to the 25th percentile

• Q2 is located at 50th percentile and equals the median

• Q3 is equal to the 75th percentile

• Quartile values are not necessarily members of the data set

30
Quartiles

Q1 Q2 Q3

25% 25% 25% 25%

31
Quartiles: Example
• Ordered array: 106, 109, 114, 116, 121, 122, 125, 129
• Q1: i = \frac{25}{100}(8) = 2, \quad Q_1 = \frac{109 + 114}{2} = 111.5

• Q2: i = \frac{50}{100}(8) = 4, \quad Q_2 = \frac{116 + 121}{2} = 118.5

• Q3: i = \frac{75}{100}(8) = 6, \quad Q_3 = \frac{122 + 125}{2} = 123.5

32
Interquartile Range

• Range of values between the first and third quartiles


• Range of the “middle half”
• Less influenced by extremes

\text{Interquartile Range} = Q_3 - Q_1

33
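A sketch of the quartile and IQR computation with NumPy (the method= keyword needs NumPy 1.22+; "midpoint" happens to match the slide's rule for this data set):

```python
import numpy as np

data = [106, 109, 114, 116, 121, 122, 125, 129]

# "midpoint" averages the two middle values, matching the slide here
q1, q2, q3 = np.percentile(data, [25, 50, 75], method="midpoint")
print(q1, q2, q3)        # 111.5 118.5 123.5
print("IQR =", q3 - q1)  # 12.0
```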
Deviation from the Mean

• Data set: 5, 9, 16, 17, 18


• Mean: \mu = \frac{\sum X}{N} = \frac{65}{5} = 13
• Deviations from the mean: -8, -4, 3, 4, 5

[Figure: the deviations −8, −4, +3, +4, +5 shown on a number line from 0 to 20, centered on the mean of 13]


34
Mean Absolute Deviation

• Average of the absolute deviations from the mean

X     X − μ   |X − μ|
5     −8      +8
9     −4      +4
16    +3      +3
17    +4      +4
18    +5      +5
      0       24

M.A.D. = \frac{\sum |X - \mu|}{N} = \frac{24}{5} = 4.8

35
Population Variance
• Average of the squared deviations from the arithmetic mean

X     X − μ   (X − μ)²
5     −8      64
9     −4      16
16    +3      9
17    +4      16
18    +5      25
      0       130

\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{130}{5} = 26.0

36
Population Standard Deviation
• Square root of the variance

X     X − μ   (X − μ)²
5     −8      64
9     −4      16
16    +3      9
17    +4      16
18    +5      25
      0       130

\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{130}{5} = 26.0

\sigma = \sqrt{\sigma^2} = \sqrt{26.0} = 5.1

37
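A minimal NumPy sketch of the last few slides (deviations, M.A.D., population variance and standard deviation) on the same data:

```python
import numpy as np

x = np.array([5, 9, 16, 17, 18])
mu = x.mean()                     # 13.0

mad = np.abs(x - mu).mean()       # 24/5 = 4.8
var = ((x - mu) ** 2).mean()      # 130/5 = 26.0 (divide by N: population)
std = np.sqrt(var)                # ~5.1

print(mad, var, std)              # 4.8 26.0 5.099...
```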
Sample Variance
• Average of the squared deviations from the arithmetic mean

X       X − X̄    (X − X̄)²
2,398   625      390,625
1,844   71       5,041
1,539   −234     54,756
1,311   −462     213,444
7,092   0        663,866

S^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{663{,}866}{3} = 221{,}288.67

38
Sample Standard Deviation
• Square root of the sample variance

X       X − X̄    (X − X̄)²
2,398   625      390,625
1,844   71       5,041
1,539   −234     54,756
1,311   −462     213,444
7,092   0        663,866

S^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{663{,}866}{3} = 221{,}288.67

S = \sqrt{S^2} = \sqrt{221{,}288.67} = 470.41

39
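The sample versions only change the divisor; NumPy exposes this through the ddof argument (a sketch on the same data):

```python
import numpy as np

x = np.array([2398, 1844, 1539, 1311])

# ddof=1 divides by n-1 (sample); ddof=0 (the default) divides by N (population)
print(x.var(ddof=1))   # 221288.666...
print(x.std(ddof=1))   # 470.41...
```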
Uses of Standard Deviation
• Indicator of financial risk
• Quality Control
– construction of quality control charts
– process capability studies
• Comparing populations
– household incomes in two cities
– employee absenteeism at two plants

40
Standard Deviation as an Indicator of Financial Risk

Annualized Rate of Return

Financial Security   μ     σ
A                    15%   3%
B                    15%   7%

41
Lecture 5: Central Tendency and Dispersion- II

Dr. A. Ramesh
Department of Management Studies

1
The Empirical Rule… If the histogram is bell shaped

• Approximately 68% of all observations fall within one standard deviation of the mean
• Approximately 95% of all observations fall within two standard deviations of the mean
• Approximately 99.7% of all observations fall within three standard deviations of the mean

2
Empirical Rule

• Data are normally distributed (or approximately normal)

Distance from the Mean   Percentage of Values Falling Within Distance
μ ± 1σ                   68
μ ± 2σ                   95
μ ± 3σ                   99.7
3
Chebysheff’s Theorem…Not often used because interval is very wide.

• A more general interpretation of the standard deviation is derived from Chebysheff’s Theorem, which applies to all shapes of histograms (not just bell shaped)
• The proportion of observations in any sample that lie within k standard deviations of the mean is at least 1 − 1/k², for k > 1

For k = 2 (say), the theorem states that at least 3/4 of all observations lie within 2 standard deviations of the mean. This is a “lower bound” compared to the Empirical Rule’s approximation (95%).

41
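A small simulation illustrating the bound (a sketch; the exponential distribution is our own choice of a non-bell-shaped example):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=100_000)   # decidedly not bell shaped

k = 2
within = np.abs(x - x.mean()) <= k * x.std()
# Chebysheff only guarantees at least 1 - 1/k**2 = 0.75; the observed
# proportion here is around 0.95
print(within.mean())
```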
Coefficient of Variation

• Ratio of the standard deviation to the mean, expressed as a percentage
• Measurement of relative dispersion

C.V. = \frac{\sigma}{\mu} \times 100

5
Coefficient of Variation
\mu_1 = 29, \ \sigma_1 = 4.6: \quad C.V._1 = \frac{\sigma_1}{\mu_1} \times 100 = \frac{4.6}{29} \times 100 = 15.86

\mu_2 = 84, \ \sigma_2 = 10: \quad C.V._2 = \frac{\sigma_2}{\mu_2} \times 100 = \frac{10}{84} \times 100 = 11.90

6
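The same comparison in Python (a sketch with a small helper function of our own):

```python
def cv(std, mean):
    """Coefficient of variation as a percentage."""
    return std / mean * 100

print(round(cv(4.6, 29), 2))   # 15.86
print(round(cv(10, 84), 2))    # 11.9
```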
Variance and Standard Deviation
of Grouped Data

Population:
\sigma^2 = \frac{\sum f (M - \mu)^2}{N}, \qquad \sigma = \sqrt{\sigma^2}

Sample:
S^2 = \frac{\sum f (M - \bar{X})^2}{n - 1}, \qquad S = \sqrt{S^2}
7
Population Variance and Standard Deviation of
Grouped Data (μ = 43)

M M  M 


2 2
Class Interval f M fM f

20-under 30 6 25 150 -18 324 1944


30-under 40 18 35 630 -8 64 1152
40-under 50 11 45 495 2 4 44
50-under 60 11 55 605 12 144 1584
60-under 70 3 65 195 22 484 1452
70-under 80 1 75 75 32 1024 1024
50 2150 7200

M    2
2
 f 7200
 144  12

2
   144
N 50
8
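A NumPy sketch of the grouped-data computation above:

```python
import numpy as np

f = np.array([6, 18, 11, 11, 3, 1])     # class frequencies
M = np.array([25, 35, 45, 55, 65, 75])  # class midpoints

mu = (f * M).sum() / f.sum()            # 43.0
var = (f * (M - mu) ** 2).sum() / f.sum()
print(var, np.sqrt(var))                # 144.0 12.0
```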
Measures of Shape
• Skewness
– Absence of symmetry
– Extreme values in one side of a distribution
• Kurtosis
Peakedness of a distribution
– Leptokurtic: high and thin
– Mesokurtic: normal shape
– Platykurtic: flat and spread out
• Box and Whisker Plots
– Graphic display of a distribution
– Reveals skewness

9
Skewness

[Figure: negatively skewed, symmetric (not skewed), and positively skewed distributions]

10
Skewness..
The skewness of a distribution is measured by comparing the relative positions
of the mean, median and mode.
• Distribution is symmetrical
• Mean = Median = Mode
• Distribution skewed right
• Median lies between mode and mean, and mode is less than mean
• Distribution skewed left
• Median lies between mode and mean, and mode is greater than
mean

11
Skewness

[Figure: negatively skewed: mean < median < mode; symmetric: mean = median = mode; positively skewed: mode < median < mean]

12
Coefficient of Skewness

• Summary measure for skewness

S = \frac{3(\mu - M_d)}{\sigma}

• If S < 0, the distribution is negatively skewed (skewed to the left)

• If S = 0, the distribution is symmetric (not skewed)

• If S > 0, the distribution is positively skewed (skewed to the right)

13
Coefficient of Skewness
\mu_1 = 23, \quad M_{d1} = 26, \quad \sigma_1 = 12.3: \qquad S_1 = \frac{3(\mu_1 - M_{d1})}{\sigma_1} = \frac{3(23 - 26)}{12.3} = -0.73

\mu_2 = 26, \quad M_{d2} = 26, \quad \sigma_2 = 12.3: \qquad S_2 = \frac{3(26 - 26)}{12.3} = 0

\mu_3 = 29, \quad M_{d3} = 26, \quad \sigma_3 = 12.3: \qquad S_3 = \frac{3(29 - 26)}{12.3} = +0.73
14
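Pearson's coefficient of skewness is a one-liner in Python (a sketch with our own helper):

```python
def pearson_skew(mean, median, std):
    """Pearson's coefficient of skewness: S = 3*(mean - median) / std."""
    return 3 * (mean - median) / std

for mu in (23, 26, 29):
    print(round(pearson_skew(mu, 26, 12.3), 2))   # -0.73, 0.0, 0.73
```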
Kurtosis
• Peakedness of a distribution
– Leptokurtic: high and thin
– Mesokurtic: normal in shape
– Platykurtic: flat and spread out

[Figure: leptokurtic (high and thin), mesokurtic (normal shape), and platykurtic (flat and spread out) curves]

15
Box and Whisker Plot

• Five specific values are used:

– Median, Q2

– First quartile, Q1

– Third quartile, Q3

– Minimum value in the data set

– Maximum value in the data set

16
Box and Whisker Plot

Minimum Q1 Q2 Q3 Maximum

17
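A sketch of drawing one with matplotlib (the data re-uses the earlier percentile example):

```python
import matplotlib.pyplot as plt

data = [5, 12, 13, 14, 17, 19, 23, 28]

# The box shows Q1, the median (Q2) and Q3; the whiskers reach toward
# the minimum and maximum values
plt.boxplot(data, vert=False)
plt.title("Box and Whisker Plot")
plt.show()
```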
Skewness: Box and Whisker Plots, and Coefficient of
Skewness
[Figure: box-and-whisker plots with coefficient of skewness S < 0 (negatively skewed), S = 0 (symmetric), and S > 0 (positively skewed)]

18
THANK YOU

19
Data Analytics with Python
Lecture 1: Introduction to data analytics

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT
IIT ROORKEE

1
Objective of the course
• The principal focus of this course is to introduce conceptual understanding using simple and practical examples rather than a repetitive, point-and-click mentality
• This course should make you comfortable using analytics in your career and your life
• You will know how to work with real data; you may learn many different methodologies, but choosing the right methodology is what matters

2
Objective of the course Contd…

• The danger in using quantitative methods does not generally lie in the inability to perform the calculations
• The real threat is a lack of fundamental understanding of:
  – Why to use a particular technique or procedure
  – How to use it correctly, and
  – How to correctly interpret the results

3
Learning objectives
1. Define data and its importance
2. Define data analytics and its types
3. Explain why analytics is important in today’s business environment
4. Explain how statistics, analytics and data science are interrelated
5. Why python?
6. Explain the four different levels of Data:
– Nominal
– Ordinal
– Interval and
– Ratio

4
1. Define Data and its importance

• Variable, Measurement and Data

• What is generating so much data?

• How does data add value to the business?

• Why is data important?

5
1.1 Variable, Measurement and Data

• Variable – a characteristic of any entity being studied that is capable of taking on different values

• Measurement – when a standard process is used to assign numbers to particular attributes or characteristics of a variable

• Data – recorded measurements

6
1.2 What is generating so much data?

• Data can be generated by:
  – Humans,
  – Machines, or
  – Human–machine combinations
• It can be generated anywhere that information is generated and stored, in structured or unstructured formats

7
1.3 How does data add value to business?

[Diagram: from the data warehouse, business value is created along two paths: development of data products (algorithmic solutions in production, marketing and sales, e.g. recommendation engines) and discovery of data insight (quantitative data analysis to help steer strategic business decisions)]

Source: https://datajobs.com/

8
Data Products

9
1.4 Why is data important?

• Data helps in making better decisions
• Data helps in solving problems by finding the reasons for underperformance
• Data helps one evaluate performance
• Data helps one improve processes
• Data helps one understand consumers and the market

10
2. Define data analytics and its types
• Define data analytics

• Why analytics is important?

• Data analysis

• Data analytics vs. Data analysis

• Types of Data analytics

11
2.1. Define data analytics

• Analytics is defined as “the scientific process of transforming data into insights for making better decisions”
• Analytics is the use of data, information technology, statistical analysis, quantitative methods, and mathematical or computer-based models to help managers gain improved insight about their business operations and make better, fact-based decisions – James Evans
• Analysis = Analytics ?

12
2.2 Why analytics is important?

• Opportunity abounds for the use of analytics and big data


such as:
1. Determining credit risk
2. Developing new medicines
3. Finding more efficient ways to deliver products and services
4. Preventing fraud
5. Uncovering cyber threats
6. Retaining the most valuable customers

13
2.3 Data analysis
• Data analysis is the process of examining, transforming, and
arranging raw data in a specific way to generate useful
information from it
• Data analysis allows for the evaluation of data through
analytical and logical reasoning to lead to some sort of
outcome or conclusion in some context
• Data analysis is a multi-faceted process that involves a
number of steps, approaches, and diverse techniques

14
2.4 Data analytics vs. Data analysis: Analysis

Analysis looks at the past: it explains how and why things happened

15
2.4 Data analytics vs. Data analysis: Analytics

Analytics looks at the future: it explores potential future events

16
2.4 Data analytics vs. Data analysis

Analytics = Qualitative (intuition + analysis) + Quantitative (formulas + algorithms)

17
Analysis

Quantitative analysis = qualitative data + the story of how, say, sales decreased last summer: it explains how and why the story ends the way it did

18
Analysis ≠ Analytics
Data Analysis ≠ Data Analytics
Business Analysis ≠ Business Analytics

19
2.5 Classification of Data analytics

Based on the phase of workflow and the kind of analysis required, there are
four major types of data analytics.

• Descriptive analytics

• Diagnostic analytics

• Predictive analytics

• Prescriptive analytics

20
Classification of Data analytics

https://www.governanceanalytics.org/knowledge-
base/Main_Tools/Data_classification_and_analysis

21
Descriptive Analytics
• Descriptive Analytics is the conventional form of Business Intelligence and data analysis
• It seeks to provide a depiction or “summary view” of facts and figures in an understandable format
• It either informs or prepares data for further analysis
• Descriptive analysis or statistics can summarize raw data and convert it into a form that can be easily understood by humans
• It can describe in detail an event that has occurred in the past

22
Example
Common examples of Descriptive Analytics are company reports that simply provide a historic review, like:
• Data Queries
• Reports
• Descriptive Statistics
• Data Visualization
• Data dashboard

Source: https://www.linkedin.com/learning/478e9692-d13d-338f-907e-d76f0724d773

23
Diagnostic analytics

• Diagnostic Analytics is a form of advanced analytics which examines data or content to answer the question “Why did it happen?”
• Diagnostic analytical tools aid an analyst to dig deeper into an issue so that they can arrive at the source of a problem
• In a structured business environment, tools for both descriptive and diagnostic analytics go in parallel

24
Example

• It uses techniques such as:

1. Data Discovery

2. Data Mining

3. Correlations

25
Predictive analytics

• Predictive analytics helps to forecast trends based on current events
• Predicting the probability of an event happening in the future, or estimating the accurate time it will happen, can be determined with the help of predictive analytical models
• Many different but co-dependent variables are analysed to predict a trend in this type of analysis

26
Source: https://www.logianalytics.com/wp-content/uploads/2017/11/predictive-1.png

27
Example

• Set of techniques that use model constructed from past data to predict
the future or ascertain impact of one variable on another:
1. Linear regression
2. Time series analysis and forecasting
3. Data mining

Source: https://bigdata-madesimple.com/5-examples-predictive-analytics-travel-industry/

28
Prescriptive analytics

• Set of techniques to indicate the best course of action


• It tells what decision to make to optimize the outcome
• The goal of prescriptive analytics is to enable:
1. Quality improvements
2. Service enhancements
3. Cost reductions and
4. Increasing productivity

29
Prescriptive analytics: Example

• Optimization Model
• Simulation
• Decision Analysis

30
3. Explain why analytics is important

• Demand for Data Analytics


• Element of data Analytics

31
3. Explain why analytics is important

[Figure: search trends comparing “Data Scientist” against “Statistician, Operations Researcher”]

32
https://timesofindia.indiatimes.com/india/Data-scientists-earning-more-than-
CAs-engineers/articleshow/52171064.cms

33
3.1 Demand for Data Analytics

http://timesofindia.indiatimes.com/articleshow/52171064.cms?utm_source=
contentofinterest&utm_medium=text&utm_campaign=cppst

34
3.2 Element of data Analytics

35
4. Data analyst and Data scientist

• The requisite skill set

• Difference between Data analyst and Data Scientist

36
4.1 The requisite skill set

[Venn diagram: Data Science sits at the intersection of mathematics expertise, technology/hacking skills, and business & strategy acumen]
37
4.2 Difference between Data analyst and Data Scientist

[Diagram: a spectrum from Business Administration to Data Science. The Analyst carries domain-specific responsibility (e.g. marketing analyst, financial analyst); both roles share data exploration, analysis and insight; the Data Scientist adds advanced algorithms, machine learning, and data product engineering]

Source:https://datajobs.com/

40
5. Why python?

Features
• Simple and easy to learn
• Freeware and Open source
• Interpreted
• Dynamically Typed
• Extensible
• Embedded
• Extensive library

41
5. Why python?

Usability
• Desktop and web applications
• Database applications
• Networking applications
• Data analysis (Data Science)
• Machine learning
• IoT and AI applications
• Games

42
Companies using Python

43
Why Jupyter NoteBook?

Why?
• Client – Server Application
• Edit code on web browser
• Easy in documentation
• Easy in demonstration
• User- friendly Interface

44
6. Explain the four different levels of Data
• Types of Variables
• Levels of Data Measurement
• Compare the four different levels of Data:
Nominal
Ordinal
Interval and
Ratio
• Usage Potential of Various Levels of Data
• Data Level, Operations, and Statistical Methods

45
6.1 Types of Variables

Data
├─ Categorical (defined categories). Examples: Marital Status, Political Party, Eye Color
└─ Numerical
   ├─ Discrete (counted items). Examples: Number of Children, Defects per hour
   └─ Continuous (measured characteristics). Examples: Weight, Voltage
6.2 Levels of Data Measurement

• Nominal — Lowest level of measurement


• Ordinal
• Interval
• Ratio — Highest level of measurement

47
6.3.1 Nominal

• A nominal scale classifies data into distinct categories in which no ranking is implied
• Example: Gender, Marital Status

48
6.3.2 Ordinal scale

• An ordinal scale classifies data into distinct categories in which ranking is implied
• Example:
– Product satisfaction  Satisfied, Neutral, Unsatisfied
– Faculty rank  Professor, Associate Professor, Assistant Professor
– Student Grades  A, B, C, D, F

49
6.3.3. Interval scale

• An interval scale is an ordered scale in which the difference between measurements is a meaningful quantity but the measurements do not have a true zero point.
• Example
– Temperature in Fahrenheit and Celsius
– Year

50
6.3.4 Ratio scale

• A ratio scale is an ordered scale in which the difference between the measurements is a meaningful quantity and the measurements have a true zero point.
• Example
– Weight
– Age
– Salary

51
6.4 Usage Potential of Various
Levels of Data
Ratio
Interval
Ordinal

Nominal

52
6.5 Impact of choice of measurement scale

Data Level   Meaningful Operations                                    Statistical Methods
Nominal      Classifying and Counting                                 Nonparametric
Ordinal      All of the above plus Ranking                            Nonparametric
Interval     All of the above plus Addition, Subtraction              Parametric
Ratio        All of the above plus Multiplication and Division        Parametric

53
Thank You

54
Welcome to
TA Live Session 1

NPTEL | DATA ANALYTICS


WITH PYTHON
29-01-2022
Ritwiz Kamal
PhD (Prime Minister’s Research Fellow), CSE, IIT Madras
Ritwiz Kamal | IIT Madras 1
Let’s Get Started …

Ritwiz Kamal | IIT Madras 2


Question 1

Ritwiz Kamal | IIT Madras 3


Question 1

http://makemeanalyst.com/explore-your-data-range-interquartile-range-and-box-plot/

Ritwiz Kamal | IIT Madras 4


Question 2

Ritwiz Kamal | IIT Madras 5


Question 2

[1, 2, 3, 4, 5, [6, 7, 8, 9]]

[1, 2, 3, 4, 5, 6, 7, 8, 9]

Ritwiz Kamal | IIT Madras 6
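The quiz slide itself is not reproduced here, but the two outputs shown above are consistent with the difference between list.append and list.extend (a sketch):

```python
a = [1, 2, 3, 4, 5]
a.append([6, 7, 8, 9])    # appends the whole list as a single element
print(a)                  # [1, 2, 3, 4, 5, [6, 7, 8, 9]]

b = [1, 2, 3, 4, 5]
b.extend([6, 7, 8, 9])    # appends each element individually
print(b)                  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```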


Question 3

Ritwiz Kamal | IIT Madras 7


Question 3

X = 10        # value of X = 10
X = X + 10    # value of X = 20
X = X - 5     # value of X = 15

Ritwiz Kamal | IIT Madras 8


Question 4

Ritwiz Kamal | IIT Madras 9


Question 4

Are we really modifying


x and y ?

NO !

Ritwiz Kamal | IIT Madras 10


Question 5

Ritwiz Kamal | IIT Madras 11


Question 5

https://www.slideshare.net/amritswaroop1/mbm-106

Ritwiz Kamal | IIT Madras 12


Question 6

Ritwiz Kamal | IIT Madras 13


Question 6

\bar{x} = \frac{\sum x_i}{n}

Ritwiz Kamal | IIT Madras 14


Question 7

Ritwiz Kamal | IIT Madras 15


Question 7

https://www.kindpng.com/imgv/Txxxxb_mu-greek-alphabet-
letter-greek-mu-hd-png/

Ritwiz Kamal | IIT Madras 16


Question 8

Ritwiz Kamal | IIT Madras 17


Question 8

http://makemeanalyst.com/explore-your-data-range-interquartile-
range-and-box-plot/

Ritwiz Kamal | IIT Madras 18


Question 9

Ritwiz Kamal | IIT Madras 19


Question 9

https://www.managedfuturesinvesting.com/what-is-skewness/

Ritwiz Kamal | IIT Madras 20


Question 10

Ritwiz Kamal | IIT Madras 21


Question 10

If 𝜎 = 0, 𝜎 2 = 0

If 𝜎 = 2, 𝜎 2 = 4

If 𝜎 = 0.5, 𝜎 2 = 0.25

Ritwiz Kamal | IIT Madras 22


Acknowledgments
 Prof. A Ramesh | IIT Roorkee
Data Analytics with Python | NPTEL
 NPTEL Team
 PMRF Team
 Department of CSE, IIT Madras

THANK YOU!

Ritwiz Kamal | IIT Madras 23


Data Analytics with Python
Lecture 2: Python – Fundamentals

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT
IIT ROORKEE

1
Learning objectives
1. Installing Python
2. Fundamentals of Python
3. Data Visualisation

2
Python Installation Process

Installation Process –

Step 1: Type https://www.anaconda.com at the address bar of web


browser.
Step 2: Click on download button
Step 3: Download python 3.7 version for windows OS
Step 4: Double click on file to run the application
Step 5: Follow the instructions until completion of installation process

3
Python Installation Process

Installation Process –

Step 1: Type https://www.anaconda.com at the address bar of web browser.

4
Python Installation Process

Step 2: Click on download button

5
Python Installation Process

Step 3: Download python 3.7 version for windows OS

6
Python Installation Process

Step 4: Double click on file to run the application

7
Python Installation Process

8
Python Installation Process

9
Python Installation Process

10
Python Installation Process

11
Python Installation Process

12
Python Installation Process

13
Python Installation Process

14
Python Installation Process

15
Python Installation Process

16
Why Jupyter NoteBook?

Why?
• Edit code on web browser
• Easy in documentation
• Easy in demonstration
• User- friendly Interface

17
Python and Jupyter

Python Programming Language Jupyter Application

Software Package contains both


python and jupyter application

18
19
About Jupyter NoteBook

Cell -> Access using Enter Key

20
About Jupyter NoteBook

Input Field -> Green color indicates edit mode


Blue color indicates command mode

21
About Jupyter NoteBook

-> It contains documentation


-> Text not executed as code

22
About Jupyter Notebook

• Command mode allow to edit notebook as whole


• To close edit mode (Press Escape key)
• Execution (Three ways)
o Ctrl +Enter (Output field can not be modified)
o Shift +Enter (Output field is modified)
o Run button on Jupyter interface

• Comment line is written preceding with # symbol.

23
About Jupyter Notebook

• Important shortcut keys

o A -> To create cell above


o B -> To create cell below
o D + D -> For deleting cell
o M -> For markdown cell
o Y -> For code cell

24
Fundamentals of Python

• Loading a simple delimited data file


• Counting how many rows and columns were loaded
• Determining which type of data was loaded
• Looking at different parts of the data by subsetting rows
and columns

25
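The following slides show screenshots of the code; here is a minimal sketch of the same workflow, assuming the Gapminder file is saved locally as the tab-delimited 'gapminder.tsv':

```python
import pandas as pd

# Load a simple delimited data file
df = pd.read_csv('gapminder.tsv', sep='\t')

print(df.head())     # first 5 rows
print(df.shape)      # (number of rows, number of columns)
print(df.columns)    # column names
print(df.dtypes)     # the dtype of each column
print(df.info())     # fuller summary of the data
```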
26
Loading a simple delimited data file

Data Source: www.github.com/jennybc/gapminder.

27
28
• head method shows us only the first 5 rows

29
Get the number of rows and columns

30
get column names

31
get the dtype of each column

32
Pandas Types Versus Python Types

33
get more information about data

34
Looking at Columns, Rows, and Cells

• # get the country column and save it to its own variable

35
# show the first 5 observations

36
# show the last 5 observations

37
# Looking at country, continent, and year

38
39
Data Analytics with Python
Lecture 3: Python – Fundamentals - II

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT
IIT ROORKEE

1
Looking at Columns, Rows, and Cells

• Subset Rows by Index Label: loc

2
get the first row

• Python counts from 0

3
• # get the 100th row
# Python counts from 0

4
• get the last row

5
Subsetting Multiple Rows

• # select the first, 100th, and 1000th rows

6
Subset Rows by Row Number: iloc

• # get the 2nd row

7
• get the 100th row

8
• # using -1 to get the last row

9
With iloc, we can pass in the -1 to get the last row—something we couldn’t do with loc.

10
• # get the first, 100th, and 1000th rows

11
Subsetting Columns

• The Python slicing syntax uses a colon, :


• If we have just a colon, the attribute refers to everything.
• So, if we just want to get the first column using the loc or iloc syntax,
we can write something like df.loc[:, [columns]] to subset the column(s).

12
• # subset columns with loc
# note the position of the colon
# it is used to select all rows

13
14
• # subset columns with iloc
• # iloc will allow us to use integers
• # -1 will select the last column

15
Subsetting Columns by Range

• # create a range of integers from 0 to 4 inclusive

16
• # subset the dataframe with the range

17
Subsetting Rows and Columns

• # using loc

18
• # using iloc

19
Subsetting Multiple Rows and Columns

• #get the 1st, 100th, and 1000th rows


# from the 1st, 4th, and 6th columns

20
• if we use the column names directly,
# it makes the code a bit easier to read
# note now we have to use loc, instead of iloc

21
22
23
Grouped Means

• # For each year in our data, what was the average life
expectancy?
# To answer this question,
# we need to split our data into parts by year;
# then we get the 'lifeExp' column and calculate the mean

24
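A sketch of the split-apply-combine step described above (again assuming the Gapminder dataframe):

```python
# For each year, the average life expectancy
avg_life_by_year = df.groupby('year')['lifeExp'].mean()
print(avg_life_by_year.head())

# reset_index "flattens" the grouped result back into a dataframe
flat = avg_life_by_year.reset_index()
```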
25
26
• If you need to “flatten” the dataframe, you can use the
reset_index method.

27
Grouped Frequency Counts

• use the nunique to get counts of unique values on a Pandas Series.

28
Basic Plot

29
30
Visual Representation of the Data
• Histogram -- vertical bar chart of frequencies
• Frequency Polygon -- line graph of frequencies
• Ogive -- line graph of cumulative frequencies
• Pie Chart -- proportional representation for categories of a whole
• Stem and Leaf Plot
• Pareto Chart
• Scatter Plot

31
Methods of visual presentation of data

• Table

1st Qtr 2nd Qtr 3rd Qtr 4th Qtr


East 20.4 27.4 90 20.4
West 30.6 38.6 34.6 31.6
North 45.9 46.9 45 43.9

32
Methods of visual presentation of data

• Graphs
[Figure: grouped column chart of East/West/North sales by quarter]

33
Methods of visual presentation of data

• Pie chart

[Figure: pie chart of sales by quarter: 1st Qtr, 2nd Qtr, 3rd Qtr, 4th Qtr]

34
Methods of visual presentation of data
• Multiple bar chart

[Figure: horizontal multiple bar chart of East/West/North sales by quarter]

35
Methods of visual presentation of data

• Simple pictogram

[Figure: simple pictogram of East/West/North sales by quarter]

36
Frequency distributions

• Frequency tables

Observation Table
Class Interval Frequency Cumulative Frequency
< 20 13 13
<40 18 31
<60 25 56
<80 15 71
<100 9 80

37
Frequency diagrams
[Figures: frequency bar chart and cumulative frequency (ogive) chart for the class intervals < 20, < 40, < 60, < 80, < 100]

38
Histogram

Class Interval   Frequency
20-under 30      6
30-under 40      18
40-under 50      11
50-under 60      11
60-under 70      3
70-under 80      1

[Figure: histogram, frequency vs. years]

39
Histogram Construction

Class Interval   Frequency
20-under 30      6
30-under 40      18
40-under 50      11
50-under 60      11
60-under 70      3
70-under 80      1

[Figure: histogram built from the frequency table, frequency vs. years]

40
Frequency Polygon

Class Interval   Frequency
20-under 30      6
30-under 40      18
40-under 50      11
50-under 60      11
60-under 70      3
70-under 80      1

[Figure: frequency polygon, frequency vs. years]

41
Ogive

Class Interval   Cumulative Frequency
20-under 30      6
30-under 40      24
40-under 50      35
50-under 60      46
60-under 70      49
70-under 80      50

[Figure: ogive (line graph of cumulative frequencies) vs. years]

42
Relative Frequency Ogive

Class Interval   Cumulative Relative Frequency
20-under 30      .12
30-under 40      .48
40-under 50      .70
50-under 60      .92
60-under 70      .98
70-under 80      1.00

[Figure: relative frequency ogive vs. years]

43
Pareto Chart
[Figure: Pareto chart of defect causes (Poor Wiring, Short in Coil, Defective Plug, Other) with frequency bars and a cumulative-percentage line]

44
Scatter Plot

Registered Vehicles (1000's)   Gasoline Sales (1000's of Gallons)
5                              60
15                             120
9                              90
15                             140
7                              60

[Figure: scatter plot of gasoline sales vs. registered vehicles]

45
Principles of Excellent Graphs
• The graph should not distort the data
• The graph should not contain unnecessary adornments (sometimes
referred to as chart junk)
• The scale on the vertical axis should begin at zero
• All axes should be properly labeled
• The graph should contain a title
• The simplest possible graph should be used for a given set of data
Graphical Errors: Chart Junk

Bad Presentation  Good Presentation

Minimum Wage: 1960: $1.00, 1970: $1.60, 1980: $3.10, 1990: $3.80

[Figure: the same minimum-wage data drawn with chart junk (bad) and as a plain bar chart (good)]
Graphical Errors:
Compressing the Vertical Axis

Bad Presentation  Good Presentation


[Figure: quarterly sales drawn with a compressed vertical axis (bad) vs. an appropriate scale (good)]
Graphical Errors: No Zero Point on the Vertical Axis

Bad Presentation vs. Good Presentation

[Figure: the first six months of monthly sales drawn without a zero point on the vertical axis (bad) vs. with the axis starting at zero (good)]


Welcome to
TA Live Session 2

NPTEL | DATA ANALYTICS


WITH PYTHON
05-02-2022
Ritwiz Kamal
PhD (Prime Minister’s Research Fellow), CSE, IIT Madras
Ritwiz Kamal | IIT Madras 1
Let’s Get Started …

Ritwiz Kamal | IIT Madras 2


Question 1
1)

Ritwiz Kamal | IIT Madras 3


Question 1
1)

P(‘H’) = P(‘T’) = 0.5

H, T

{H,T}
https://towardsdatascience.com/what-is-expected-value-
4815bdbd84de

Ritwiz Kamal | IIT Madras 4


Question 2
2)

Ritwiz Kamal | IIT Madras 5


Question 2
2)

Sample space for two tosses (1st toss, 2nd toss): {HH, HT, TH, TT}
Ritwiz Kamal | IIT Madras 6
Question 3
3)

Ritwiz Kamal | IIT Madras 7


Question 3
3)

https://www.scribbr.com/statistics/standard-normal-distribution/

Ritwiz Kamal | IIT Madras 8


Question 4

4)

Ritwiz Kamal | IIT Madras 9


Question 4
4)

f(x) = \frac{\lambda^x e^{-\lambda}}{x!}

https://en.wikipedia.org/wiki/Poisson_distribution

Ritwiz Kamal | IIT Madras 10


Question 5
5)

Ritwiz Kamal | IIT Madras 11


Question 5
5)

https://en.wikipedia.org/wiki/Hypergeometric_distribution

Ritwiz Kamal | IIT Madras 12


Question 6
6)A bag contains one ball which could either be GREEN or RED. You
take another red ball and put it in this pouch. You now close your eyes
and pull out a ball from the pouch. It turns out to be red. What is the
probability that the original ball in the pouch was red?

Ritwiz Kamal | IIT Madras 13


Question 6

14
Ritwiz Kamal | IIT Madras
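The answer slide is an image; for reference, the standard Bayes computation (assuming the original ball is equally likely to be green or red) gives:

P(\text{orig red} \mid \text{drew red}) = \frac{P(\text{drew red} \mid \text{orig red})\, P(\text{orig red})}{P(\text{drew red})} = \frac{1 \cdot \frac{1}{2}}{1 \cdot \frac{1}{2} + \frac{1}{2} \cdot \frac{1}{2}} = \frac{2}{3}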
Question 7
7) Suppose you have a biased coin i.e. P(H) ≠ P(T).
How will you use it to make unbiased decision.

Charles Barsotti :https://www.cartoonstock.com/directory/t/toss_a_coin.asp

Ritwiz Kamal | IIT Madras 15


Question 7

16
Ritwiz Kamal | IIT Madras
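The answer slide is an image; the standard approach is von Neumann's trick, sketched here in Python:

```python
import random

def biased_flip(p_heads=0.7):
    return 'H' if random.random() < p_heads else 'T'

def fair_flip():
    """Von Neumann's trick: flip the biased coin twice; HT -> 'H', TH -> 'T',
    HH/TT -> flip again. P(HT) = P(TH) = p*(1-p), so the result is unbiased."""
    while True:
        a, b = biased_flip(), biased_flip()
        if a != b:
            return a

print(fair_flip())
```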
Question 8
8) Suppose you have data about the progression of the number of cases
of Covid-19 in some country during the second wave. What kind of
visual representation would you prefer to use for this data ?
Violin Plot

Pie Chart

Ogive
Box Plot

Ritwiz Kamal | IIT Madras 17


Question 8
8) Suppose you have data about the progression of the number of cases
of Covid-19 in some country during the second wave. What kind of
visual representation would you prefer to use for this data ?

Violin Plot

Pie Chart

Ogive
Box Plot

Ritwiz Kamal | IIT Madras 18


https://en.wikipedia.org/wiki/Ogive_(statistics)
Question 9
9) Suppose you have data about the preference of TV shows among
viewers on an OTT platform. What kind of visual representation would
you prefer to use for this data ?
Violin Plot

Pie Chart

Ogive
Box Plot

Ritwiz Kamal | IIT Madras 19


Question 9
9) Suppose you have data about the preference of TV shows among
viewers on an OTT platform. What kind of visual representation would
you prefer to use for this data ?
Violin Plot TV Show A

Pie Chart
TV Show D
Ogive
Box Plot
TV Show B

*fictional data

TV Show C
Ritwiz Kamal | IIT Madras 20
Question 10
10) Suppose there are 10 rabbits in a race.
Let R1 and R2 be two of the rabbits.
Let A be the event that R1 wins the race.
Let B be the event that R2 wins the race.

Are A and B independent events ? (Assume all rabbits are equally likely to win)

Independent

Not Independent

Ritwiz Kamal | IIT Madras 21


Question 10
10) Suppose there are 10 rabbits in a race.
Let R1 and R2 be two of the rabbits.
Let A be the event that R1 wins the race.
Let B be the event that R2 wins the race.

Are A and B independent events ? (Assume all rabbits are equally likely to win)

Not Independent: P(A) = \frac{1}{10}, \text{ but } P(A \mid B) = 0, \text{ so } P(A) \neq P(A \mid B)

Ritwiz Kamal | IIT Madras 22


Introduction To Probability, 2nd edition, by
Dimitri P. Bertsekas and John N. Tsitsiklis

Ritwiz Kamal | IIT Madras 23


Acknowledgments
 Prof. A Ramesh | IIT Roorkee
Data Analytics with Python | NPTEL
 NPTEL Team
 PMRF Team
 Department of CSE, IIT Madras

THANK YOU!

Ritwiz Kamal | IIT Madras 24


Lecture 6: Introduction to Probability

Dr. A. Ramesh
Department of Management Studies

1
Lecture objectives

• Comprehend the different ways of assigning probability
• Understand and apply marginal, union, joint, and conditional probabilities
• Solve problems using the laws of probability, including the laws of addition, multiplication and conditional probability
• Revise probabilities using Bayes’ rule

2
Probability

• Probability is the numerical measure of the likelihood that an event will occur.

• The probability of any event must be between 0 and 1, inclusively


– 0 ≤ P(A) ≤ 1 for any event A.

• The sum of the probabilities of all mutually exclusive and collectively exhaustive events is 1.
  – P(A) + P(B) + P(C) = 1
  – A, B, and C are mutually exclusive and collectively exhaustive

3
Range of Probability

1 Certain

.5

0 Impossible

4
Methods of Assigning Probabilities

• Classical method of assigning probability (rules and laws)

• Relative frequency of occurrence (cumulated historical data)

• Subjective Probability (personal intuition or reasoning)

5
Classical Probability

• Number of outcomes leading to the event divided by the total number of outcomes possible
• Each outcome is equally likely
• Determined a priori -- before performing the experiment
• Applicable to games of chance
• Objective -- everyone correctly using the method assigns an identical
probability

6
Classical Probability

P(E) = \frac{n_e}{N}

where:
N = total number of outcomes
n_e = number of outcomes in E

7
Relative Frequency Probability

• Based on historical data

• Computed after performing the experiment

• Number of times an event occurred divided by the number of trials

• Objective -- everyone correctly using the method assigns an identical probability

8
Relative Frequency Probability

P( E )  ne
N
Where:
N  total number of trials
n e
 number of outcomes
producing E

9
Subjective Probability

• Comes from a person’s intuition or reasoning


• Subjective -- different individuals may (correctly) assign different numeric
probabilities to the same event
• Degree of belief
• Useful for unique (single-trial) experiments
– New product introduction
– Initial public offering of common stock
– Site selection decisions
– Sporting events

10
Probability - Terminology

• Experiment
• Event
• Elementary Events
• Sample Space
• Unions and Intersections
• Mutually Exclusive Events
• Independent Events
• Collectively Exhaustive Events
• Complementary Events

11
Experiment, Trial, Elementary Event, Event
• Experiment: a process that produces outcomes
– More than one possible outcome
– Only one outcome per trial
• Trial: one repetition of the process
• Elementary Event: cannot be decomposed or broken down into other
events
• Event: an outcome of an experiment
– may be an elementary event, or
– may be an aggregate of elementary events
– usually represented by an uppercase letter, e.g., A, E1

12
An Example Experiment
• Experiment: randomly select, without replacement, two families from the residents of Tiny Town
• Elementary Event: the sample includes families A and C
• Event: each family in the sample has children in the household
• Event: the sample families own a total of four automobiles

Tiny Town Population
Family   Children in Household   Number of Automobiles
A        Yes                     3
B        Yes                     2
C        No                      1
D        Yes                     2

13
Sample Space

• The set of all elementary events for an experiment


• Methods for describing a sample space
– roster or listing
– tree diagram
– set builder notation
– Venn diagram

14
Sample Space: Roster Example

• Experiment: randomly select, without replacement, two families from the residents of Tiny Town
• Each ordered pair in the sample space is an elementary event, for example (D,C)

Family   Children in Household   Number of Automobiles
A        Yes                     3
B        Yes                     2
C        No                      1
D        Yes                     2

Listing of Sample Space:
(A,B), (A,C), (A,D),
(B,A), (B,C), (B,D),
(C,A), (C,B), (C,D),
(D,A), (D,B), (D,C)

15
Sample Space: Tree Diagram for Random Sample of Two
Families

16
Sample Space: Set Notation for Random Sample of Two
Families
• S = {(x,y) | x is the family selected on the first draw, and y is the family
selected on the second draw}
• Concise description of large sample spaces

17
Sample Space
• Useful for discussion of general principles and concepts

Listing of Sample Space


Venn Diagram
(A,B), (A,C), (A,D),
(B,A), (B,C), (B,D),
(C,A), (C,B), (C,D),
(D,A), (D,B), (D,C)

18
Union of Sets

• The union of two sets contains an instance of each element of the two sets.

X = \{1, 4, 7, 9\}, \quad Y = \{2, 3, 4, 5, 6\}: \quad X \cup Y = \{1, 2, 3, 4, 5, 6, 7, 9\}

C = \{IBM, DEC, Apple\}, \quad F = \{Apple, Grape, Lime\}: \quad C \cup F = \{IBM, DEC, Apple, Grape, Lime\}
19
Intersection of Sets

• The intersection of two sets contains only those elements common to the two sets.

X = \{1, 4, 7, 9\}, \quad Y = \{2, 3, 4, 5, 6\}: \quad X \cap Y = \{4\}

C = \{IBM, DEC, Apple\}, \quad F = \{Apple, Grape, Lime\}: \quad C \cap F = \{Apple\}
20
Mutually Exclusive Events
• Events with no common outcomes
• Occurrence of one event precludes the occurrence of the other event

C = \{IBM, DEC, Apple\}, \quad F = \{Grape, Lime\}: \quad C \cap F = \varnothing

X = \{1, 7, 9\}, \quad Y = \{2, 3, 4, 5, 6\}: \quad X \cap Y = \varnothing, \quad P(X \cap Y) = 0
21
Independent Events

• Occurrence of one event does not affect the occurrence or nonoccurrence of the other event
• The conditional probability of X given Y is equal to the marginal probability of X
• The conditional probability of Y given X is equal to the marginal probability of Y

P(X \mid Y) = P(X) \quad \text{and} \quad P(Y \mid X) = P(Y)

22
Collectively Exhaustive Events

• Contains all elementary events for an experiment

E1 E2 E3

Sample Space with three collectively exhaustive events

23
Complementary Events

• All elementary events not in the event ‘A’ are in its complementary event.

P(\text{Sample Space}) = 1

P(\bar{A}) = 1 - P(A)

24
Counting the Possibilities

• mn Rule
• Sampling from a Population with Replacement
• Combinations: Sampling from a Population without Replacement

25
mn Rule

• If an operation can be done m ways and a second operation can be done n ways, then there are mn ways for the two operations to occur in order.
• This rule is easily extended to k stages, with the number of ways equal to n_1 \cdot n_2 \cdot n_3 \cdots n_k
• Example: Toss two coins. The total number of simple events is 2 × 2 = 4

26
Sampling from a Population with Replacement

• A tray contains 1,000 individual tax returns. If 3 returns are randomly


selected with replacement from the tray, how many possible samples are
there?
• (N)n = (1,000)3 = 1,000,000,000

27
Combinations

• A tray contains 1,000 individual tax returns. If 3 returns are randomly


selected without replacement from the tray, how many possible samples
are there?

N N! 1000!
    166,167,00 0
 n  n!( N  n)! 3!(1000  3)!

28
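Python's standard library can check both counts (a sketch):

```python
import math

print(math.comb(1000, 3))   # 166167000 samples without replacement
print(1000 ** 3)            # 1000000000 ordered samples with replacement
```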
Four Types of Probability
Marginal: P(X), the probability of X occurring
Union: P(X \cup Y), the probability of X or Y occurring
Joint: P(X \cap Y), the probability of X and Y occurring
Conditional: P(X \mid Y), the probability of X occurring given that Y has occurred
29
General Law of Addition

P(X \cup Y) = P(X) + P(Y) - P(X \cap Y)

30
Design for improving productivity?

31
Problem
• A company conducted a survey for the American Society of Interior
Designers in which workers were asked which changes in office design
would increase productivity.
• Respondents were allowed to answer more than one type of design
change.

Reducing noise would increase 70 %


productivity
More storage space would 67 %
increase productivity

32
Problem

• If one of the survey respondents was randomly selected and asked what
office design changes would increase worker productivity,
– what is the probability that this person would select reducing noise or
more storage space?

33
Solution

• Let N represent the event “reducing noise.”


• Let S represent the event “more storage/ filing space.”
• The probability of a person responding with N or S can be symbolized
statistically as a union probability by using the law of addition.

34
General Law of Addition -- Example
P(N \cup S) = P(N) + P(S) - P(N \cap S)

P(N) = .70, \quad P(S) = .67, \quad P(N \cap S) = .56

P(N \cup S) = .70 + .67 - .56 = 0.81

35
Office Design Problem
Probability Matrix

Increase
Storage Space
Yes No Total
Noise Yes .56 .14 .70
Reduction
No .11 .19 .30
Total .67 .33 1.00

36
Joint Probability Using a Contingency Table
Event
Event B1 B2 Total

A1 P(A1 and B1) P(A1 and B2) P(A1)

A2 P(A2 and B1) P(A2 and B2) P(A2)

Total P(B1) P(B2) 1

Joint Probabilities Marginal (Simple) Probabilities


37
Office Design Problem - Probability Matrix
Increase
Storage Space

Yes No Total
Noise Yes .56 .14 .70
Reduction
No .11 .19 .30
Total .67 .33 1.00

P(N \cup S) = P(N) + P(S) - P(N \cap S) = .70 + .67 - .56 = .81

38
Law of Conditional Probability

P(X \mid Y) = \frac{P(X \cap Y)}{P(Y)}
39
Office Design Problem

40
Problem

• A company data reveal that 155 employees worked one of four types of
positions.
• Shown here again is the raw values matrix (also called a contingency table)
with the frequency counts for each category and for subtotals and totals
containing a breakdown of these employees by type of position and by
sex.

41
Contingency Table

42
Solution

• If an employee of the company is selected randomly, what is the probability that the employee is female or a professional worker?

43
Problem

• Shown here are the raw values matrix and corresponding probability
matrix for the results of a national survey of 200 executives who were
asked to identify the geographic locale of their company and their
company’s industry type.
• The executives were only allowed to select one locale and one industry
type.

44
Data Analytics with Python
Lecture 9: Probability Distributions-II

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT
IIT ROORKEE

1
Some Special Distributions
• Discrete
– Binomial
– Poisson
– Hyper geometric
• Continuous
– Uniform
– Exponential
– Normal

2
Binomial Distribution

• Let us consider the purchase decisions of the next three customers who
enter a store.

• On the basis of past experience, the store manager estimates the


probability that any one customer will make a purchase is .30.

• What is the probability that two of the next three customers will make a
purchase?

3
Tree diagram for the Martin clothing store problem

4
Trial Outcomes

5
Graphical representation of the probability distribution
for the number of customers making a purchase
x    P(x)
0    0.7 × 0.7 × 0.7 = 0.343
1    0.3×0.7×0.7 + 0.7×0.3×0.7 + 0.7×0.7×0.3 = 0.441
2    0.189
3    0.027

6
Binomial Distribution - Assumptions
• Experiment involves n identical trials
• Each trial has exactly two possible outcomes: success and failure
• Each trial is independent of the previous trials
• p is the probability of a success on any one trial
q = (1-p) is the probability of a failure on any one trial
• p and q are constant throughout the experiment
• X is the number of successes in the n trials

7
Binomial Distribution

• Probability n! X n X
P( X )  p q for 0  X  n
function X ! n  X !

• Mean
value   n p
• Variance and
standard  2
 n pq
deviation    2
 n pq

8
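A sketch of these probabilities with scipy.stats, reproducing the store example and the table lookup:

```python
from scipy import stats

# P(2 of the next 3 customers purchase), p = 0.30
print(stats.binom.pmf(2, n=3, p=0.3))    # 0.189

# Binomial table check: n = 10, x = 3, p = 0.40
print(stats.binom.pmf(3, n=10, p=0.4))   # 0.2150...
```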
Binomial Table
SELECTED VALUES FROM THE BINOMIAL PROBABILITY TABLE
EXAMPLE: n = 10, x = 3, p = .40; f (3) = .2150

9
Mean and Variance
• Suppose that for the next month the Clothing Store forecasts 1000
customers will enter the store.
• What is the expected number of customers who will make a purchase?
• The answer is μ = np = (1000)(.3) = 300.
• For the next 1000 customers entering the store, the variance and standard deviation for the number of customers who will make a purchase are σ² = npq = (1000)(.3)(.7) = 210 and σ = √210 = 14.49

10
Poisson Distribution

• Describes discrete occurrences over a continuum or interval


• A discrete distribution
• Describes rare events
• Each occurrence is independent of any other occurrence.
• The number of occurrences in each interval can vary from zero to infinity.
• The expected number of occurrences must hold constant throughout the
experiment.

11
Poisson Distribution: Applications
• Arrivals at queuing systems
– airports -- people, airplanes, automobiles, baggage
– banks -- people, automobiles, loan applications
– computer file servers -- read and write operations

• Defects in manufactured goods


– number of defects per 1,000 feet of extruded copper wire
– number of blemishes per square foot of painted surface
– number of errors per typed page

12
Poisson Distribution
• Probability function:

P(X) = \frac{\lambda^X e^{-\lambda}}{X!} \quad \text{for } X = 0, 1, 2, 3, \ldots

where:
λ = long-run average
e = 2.718282… (the base of natural logarithms)

• Mean value: λ    • Variance: λ    • Standard deviation: √λ
13
Poisson Distribution: Example
λ = 3.2 customers / 4 minutes

Case 1: X = 10 customers / 8 minutes. Adjusted λ = 6.4 customers / 8 minutes:
P(X = 10) = \frac{6.4^{10}\, e^{-6.4}}{10!} = 0.0528

Case 2: X = 6 customers / 8 minutes. Adjusted λ = 6.4 customers / 8 minutes:
P(X = 6) = \frac{6.4^{6}\, e^{-6.4}}{6!} = 0.1586

14
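The same two probabilities with scipy.stats (a sketch):

```python
from scipy import stats

# lambda = 6.4 customers per 8-minute interval
print(stats.poisson.pmf(10, mu=6.4))   # 0.0528...
print(stats.poisson.pmf(6, mu=6.4))    # 0.1586...
```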
Poisson Probability Table
Example: μ = 10, x = 5; f (5) = .0378

15
The Hypergeometric Distribution

• The binomial distribution is applicable when selecting from a finite population with replacement or from an infinite population without replacement.
• The hypergeometric distribution is applicable when selecting from a finite population without replacement.
Hyper Geometric Distribution
• Sampling without replacement from a finite population

• The number of objects in the population is denoted N.

• Each trial has exactly two possible outcomes, success and failure.

• Trials are not independent

• X is the number of successes in the n trials

• The binomial is an acceptable approximation, if N/10 > n Otherwise it is not.

17
Hypergeometric Distribution
• Probability function (N = population size, n = sample size, A = number of successes in the population, x = number of successes in the sample):

P(x) = \frac{\binom{A}{x}\binom{N-A}{n-x}}{\binom{N}{n}}

• Mean value: \mu = \frac{An}{N}

• Variance and standard deviation: \sigma^2 = \frac{A(N-A)\,n(N-n)}{N^2(N-1)}, \quad \sigma = \sqrt{\sigma^2}
18
The Hypergeometric Distribution Example
• Three different computers are checked from the 10 in the department. 4 of the 10 computers have illegal software loaded.
• What is the probability that 2 of the 3 selected computers have illegal software loaded?
• So, N = 10, n = 3, A = 4, x = 2

P(X = 2) = \frac{\binom{4}{2}\binom{6}{1}}{\binom{10}{3}} = \frac{(6)(6)}{120} = 0.3
• The probability that 2 of the 3 selected computers have illegal
software loaded is .30, or 30%.
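A sketch with scipy.stats (note scipy's parameter names: M = population size, n = successes in the population, N = sample size):

```python
from scipy import stats

# N = 10 computers, A = 4 with illegal software, sample of 3
print(stats.hypergeom.pmf(2, M=10, n=4, N=3))   # 0.3
```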
Continuous Probability Distributions

• A continuous random variable is a variable that can assume any value on a continuum (can assume an uncountable number of values)
– thickness of an item
– time required to complete a task
– temperature of a solution
– height
• These can potentially take on any value, depending only on the ability to
measure precisely and accurately.
Continuous Distributions

• Uniform
• Normal
• Exponential
The Uniform Distribution

• The uniform distribution is a probability distribution that has equal probabilities for all possible outcomes of the random variable
• Because of its shape it is also called a rectangular distribution


Uniform Distribution
f(x) = \begin{cases} \dfrac{1}{b-a} & \text{for } a \le x \le b \\ 0 & \text{for all other values} \end{cases}

[Figure: rectangular density f(x) of height 1/(b − a) over [a, b]; total area = 1]
Uniform Distribution: Mean and Standard Deviation

Mean: \mu = \frac{a+b}{2}

Standard deviation: \sigma = \frac{b-a}{\sqrt{12}}
The Uniform Distribution

Example: Uniform probability distribution over the range 2 ≤ X ≤ 6:

f(X) = \frac{1}{6-2} = .25 \quad \text{for } 2 \le X \le 6

\mu = \frac{a+b}{2} = \frac{2+6}{2} = 4

\sigma = \sqrt{\frac{(b-a)^2}{12}} = \sqrt{\frac{(6-2)^2}{12}} = 1.1547
Uniform Distribution Example
f(x) = \begin{cases} \dfrac{1}{47-41} = \dfrac{1}{6} & \text{for } 41 \le x \le 47 \\ 0 & \text{for all other values} \end{cases}

[Figure: rectangular density of height 1/6 over [41, 47]; total area = 1]
Uniform Distribution: Mean and Standard Deviation

Mean: \mu = \frac{a+b}{2} = \frac{41+47}{2} = \frac{88}{2} = 44

Standard deviation: \sigma = \frac{b-a}{\sqrt{12}} = \frac{47-41}{3.464} = 1.732
Uniform Distribution Probability

x x1
P ( x1  X  x2)  2
ba 45  42 1

47  41 2
f (x)
45  42 1
P( 42  X  45)  
47  41 2 Area
= 0.5

41 42 45 47 x
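A sketch of the same calculations with scipy.stats.uniform (loc = a, scale = b − a):

```python
from scipy import stats

u = stats.uniform(loc=41, scale=6)   # uniform on [41, 47]
print(u.mean(), u.std())             # 44.0 1.732...
print(u.cdf(45) - u.cdf(42))         # 0.5
```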
Example : Uniform Distribution

• Consider the random variable x representing the flight time of an airplane traveling from Delhi to Mumbai.
• Suppose the flight time can be any value in the interval from 120 minutes to 140 minutes.
• Because the random variable x can assume any value in that interval, x is a continuous rather than a discrete random variable

29
Example : Uniform Distribution contd….

• Let us assume that sufficient actual flight data are available to conclude
that the probability of a flight time within any 1-minute interval is the
same as the probability of a flight time within any other 1-minute interval
contained in the larger interval from 120 to 140 minutes.

• With every 1-minute interval being equally likely, the random variable x is
said to have a uniform probability distribution.

30
Uniform Probability Distribution for Flight time

31
Probability of a flight time between 120 and 130
minutes

32
Exponential Probability Distribution
• The exponential probability distribution is useful in describing the time it
takes to complete a task.
• The exponential random variables can be used to describe:

– Time required to complete a questionnaire
– Distance between major defects in a highway
– Time between vehicle arrivals at a toll booth
Exponential Probability Distribution

• Density function:

f(x) = \frac{1}{\mu}\, e^{-x/\mu} \quad \text{for } x > 0, \ \mu > 0

where: μ = mean, e = 2.71828
Exponential Probability Distribution

• Suppose that x represents the loading time for a truck at a loading dock and follows such a distribution.
• If the mean, or average, loading time is 15 minutes (μ = 15), the appropriate probability density function for x is f(x) = \frac{1}{15}\, e^{-x/15}
Exponential Distribution for the loading Dock Example
Exponential Probability Distribution
• Cumulative probabilities:

P(x \le x_0) = 1 - e^{-x_0/\mu}

where: x₀ = some specific value of x
Example: Exponential Probability Distribution

• The time between arrivals of cars at a Petrol pump follows an exponential

probability distribution with a mean time between arrivals of 3 minutes.

• The Petrol pump owner would like to know the probability that the time

between two successive arrivals will be 2 minutes or less.


Example: Petrol Pump Problem

P(x ≤ 2) = 1 − 2.71828^(−2/3) = 1 − .5134 = .4866

[Figure: exponential density f(x) with mean 3; the shaded area to the left of x = 2 is .4866. Horizontal axis: Time Between Successive Arrivals (mins.)]
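The same probability can be checked with SciPy (a sketch; scipy.stats.expon takes the mean as its scale parameter):

    from scipy.stats import expon

    # Time between arrivals: exponential with mean 3 minutes
    X = expon(scale=3)

    print(X.cdf(2))   # P(x <= 2) = 1 - exp(-2/3) ≈ 0.4866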
Relationship between the Poisson and Exponential
Distributions
The Poisson distribution
provides an appropriate description
of the number of occurrences
per interval

The exponential distribution


provides an appropriate description
of the length of the interval
between occurrences
Mean of Poisson and Mean of Exponential Distributions

• Because the average number of arrivals is 10 cars per hour, the average time between cars arriving is 1/10 hour = 6 minutes (the exponential mean is the reciprocal of the Poisson mean)
42
The Normal Distribution: Properties

• ‘Bell Shaped’
• Symmetrical
• Mean, Median and Mode are equal
• Location is characterized by the mean, μ
• Spread is characterized by the standard deviation, σ
• The random variable has an infinite theoretical range: −∞ to +∞

[Figure: bell-shaped density f(X) centered at μ with spread σ; Mean = Median = Mode]
The Normal Distribution: Density Function
The formula for the normal probability density function is

f(X) = (1/(σ√(2π))) e^(−(1/2)((X − μ)/σ)²)

Where e = the mathematical constant approximated by 2.71828
      π = the mathematical constant approximated by 3.14159
      μ = the population mean
      σ = the population standard deviation
      X = any value of the continuous variable
The Normal Distribution: Shape

By varying the parameters μ and σ, we obtain different normal distributions
Data Analytics with Python
Lecture 10: Probability Distributions-III

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT
IIT ROORKEE

1
The Normal Distribution: Properties

• ‘Bell Shaped’
• Symmetrical
• Mean, Median and Mode are equal
• Location is characterized by the mean, μ
• Spread is characterized by the standard deviation, σ
• The random variable has an infinite theoretical range: −∞ to +∞

[Figure: bell-shaped density f(X) centered at μ with spread σ; Mean = Median = Mode]
The Normal Distribution: Density Function
The formula for the normal probability density function is

f(X) = (1/(σ√(2π))) e^(−(1/2)((X − μ)/σ)²)

Where e = the mathematical constant approximated by 2.71828
      π = the mathematical constant approximated by 3.14159
      μ = the population mean
      σ = the population standard deviation
      X = any value of the continuous variable
The Normal Distribution: Shape

By varying the parameters μ and σ, we obtain different normal distributions
The Normal Distribution: Shape

Changing μ shifts the distribution left or right.
Changing σ increases or decreases the spread.

[Figure: normal curves f(X) illustrating a shift in μ and a change in σ]
The Standardized Normal Distribution

• Any normal distribution (with any mean and standard deviation


combination) can be transformed into the standardized normal
distribution (Z).

• Need to transform X units into Z units.

• The standardized normal distribution has a mean of 0 and a standard


deviation of 1.
The Standardized Normal Distribution

• Translate from X to the standardized normal (the “Z” distribution) by subtracting the mean of X and dividing by its standard deviation:

Z = (X − μ) / σ
The Standardized Normal Distribution: Density
Function

• The formula for the standardized normal probability density function is

f(Z) = (1/√(2π)) e^(−Z²/2)
Where e = the mathematical constant approximated by 2.71828
π = the mathematical constant approximated by 3.14159
Z = any value of the standardized normal distribution
The Standardized Normal Distribution: Shape
• Also known as the “Z” distribution
• Mean is 0
• Standard Deviation is 1

[Figure: standard normal density f(Z) centered at 0]

Values above the mean have positive Z-values, values below the mean have negative Z-values
The Standardized Normal Distribution: Example

• If X is distributed normally with mean of 100 and standard deviation of 50, the Z value for X = 200 is

Z = (X − μ)/σ = (200 − 100)/50 = 2.0
• This says that X = 200 is two standard deviations (2 increments of 50
units) above the mean of 100.
The Standardized Normal Distribution: Example

100 200 X (μ = 100, σ = 50)


0 2.0 Z (μ = 0, σ = 1)

Note that the distribution is the same, only the scale has changed. We
can express the problem in original units (X) or in standardized units (Z)
Normal Probabilities

Probability is measured by the area under the curve

[Figure: the area under the curve f(X) between a and b represents P(a ≤ X ≤ b)]

(Note that the probability of any individual value is zero)
Normal Probabilities

The total area under the curve is 1.0, and the curve is symmetric,
so half is above the mean, half is below.

P(−∞ < X ≤ μ) = 0.5
P(μ ≤ X < ∞) = 0.5
P(−∞ < X < ∞) = 1.0
Normal Probability Tables

Example:
P(Z < 2.00) = .9772

[Figure: standard normal curve with area .9772 to the left of Z = 2.00]
Normal Probability Tables

The row shows the value of Z to the first decimal point; the column gives the value of Z to the second decimal point. The value within the table gives the probability from Z = −∞ up to the desired Z value.

Z     0.00   0.01   0.02   …
0.0
0.1
.
.
2.0   .9772

P(Z < 2.00) = .9772
Finding Normal Probability
Procedure

To find P(a < X < b) when X is distributed normally:

• Draw the normal curve for the problem in terms of X.

• Translate X-values to Z-values.

• Use the Standardized Normal Table.


Finding Normal Probability: Example
• Let X represent the time it takes (in seconds) to download an image file
from the internet.
• Suppose X is normal with mean 8.0 and standard deviation 5.0
• Find P(X < 8.6)

[Figure: normal curve with mean 8.0; the area below X = 8.6 is shaded]
Finding Normal Probability: Example

• Suppose X is normal with mean 8.0 and standard deviation 5.0. Find P(X < 8.6).

Z = (X − μ)/σ = (8.6 − 8.0)/5.0 = 0.12

[Figure: the X scale (μ = 8, σ = 5) and the corresponding Z scale (μ = 0, σ = 1); X = 8.6 maps to Z = 0.12, so P(X < 8.6) = P(Z < 0.12)]

Finding Normal Probability: Example
Standardized Normal Probability Table (Portion)

Z     .00    .01    .02
0.0  .5000  .5040  .5080
0.1  .5398  .5438  .5478
0.2  .5793  .5832  .5871
0.3  .6179  .6217  .6255

P(X < 8.6) = P(Z < 0.12) = .5478


Finding Normal Probability: Example

• Find P(X > 8.6)…

P(X > 8.6) = P(Z > 0.12) = 1.0 − P(Z ≤ 0.12) = 1.0 − .5478 = .4522

[Figure: standard normal curve; area .5478 to the left of Z = 0.12 and area .4522 to the right]
Finding Normal Probability: Between Two Values

• Suppose X is normal with mean 8.0 and standard deviation 5.0. Find P(8 < X < 8.6)

Calculate Z-values:

Z = (X − μ)/σ = (8 − 8)/5 = 0
Z = (X − μ)/σ = (8.6 − 8)/5 = 0.12

So P(8 < X < 8.6) = P(0 < Z < 0.12)
Finding Normal Probability
Between Two Values

P(8 < X < 8.6) = P(0 < Z < 0.12)
              = P(Z < 0.12) − P(Z ≤ 0)
              = .5478 − .5000 = .0478

Standardized Normal Probability Table (Portion)

Z     .00    .01    .02
0.0  .5000  .5040  .5080
0.1  .5398  .5438  .5478
0.2  .5793  .5832  .5871
0.3  .6179  .6217  .6255
Given Normal Probability: Find the X Value

• Let X represent the time it takes (in seconds) to download an image file
from the internet.
• Suppose X is normal with mean 8.0 and standard deviation 5.0
• Find X such that 20% of download times are less than X.

[Figure: the lower .2000 tail of the distribution; the unknown X, and the corresponding Z, cuts off the lowest 20% of download times]
Given Normal Probability, Find the X Value

• First, find the Z value that corresponds to the known probability using the table.

Z     ….    .03    .04    .05
-0.9  ….  .1762  .1736  .1711
-0.8  ….  .2033  .2005  .1977
-0.7  ….  .2327  .2296  .2266

The cumulative probability closest to .2000 is .2005, so Z = −0.84.
Given Normal Probability,
Find the X Value

• Second, convert the Z value to X units using the following formula:

X = μ + Zσ = 8.0 + (−0.84)(5.0) = 3.80

So 20% of the download times from the distribution with mean 8.0 and standard deviation 5.0 are less than 3.80 seconds.
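Both directions of this calculation can be reproduced with SciPy's normal distribution (a sketch; norm.cdf gives probabilities, norm.ppf is the inverse):

    from scipy.stats import norm

    # Download time: normal with mean 8.0 and std dev 5.0
    X = norm(loc=8.0, scale=5.0)

    print(X.cdf(8.6))             # P(X < 8.6)      ≈ 0.5478
    print(1 - X.cdf(8.6))         # P(X > 8.6)      ≈ 0.4522
    print(X.cdf(8.6) - X.cdf(8))  # P(8 < X < 8.6)  ≈ 0.0478
    print(X.ppf(0.20))            # 20th percentile ≈ 3.79 seconds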
Assessing Normality
• It is important to evaluate how well the data set is approximated by a normal
distribution.
• Normally distributed data should approximate the theoretical normal
distribution:
– The normal distribution is bell shaped (symmetrical) where the mean is
equal to the median.
– The empirical rule applies to the normal distribution.
– The interquartile range of a normal distribution is 1.33 standard deviations.
Assessing Normality
• Construct charts or graphs
– For small- or moderate-sized data sets, do the stem-and-leaf display and box-and-whisker plot look symmetric?
– For large data sets, does the histogram or polygon appear bell-
shaped?
• Compute descriptive summary measures
– Do the mean, median and mode have similar values?
– Is the interquartile range approximately 1.33 σ?
– Is the range approximately 6 σ?
Assessing Normality

• Observe the distribution of the data set


– Do approximately 2/3 of the observations lie within mean ± 1 standard
deviation?
– Do approximately 80% of the observations lie within mean ± 1.28
standard deviations?
– Do approximately 95% of the observations lie within mean ± 2 standard
deviations?
Z Table
Second Decimal Place in Z
Z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09

0.00 0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359
0.10 0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753
0.20 0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141
0.30 0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517

0.90 0.3159 0.3186 0.3212 0.3238 0.3264 0.3289 0.3315 0.3340 0.3365 0.3389
1.00 0.3413 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577 0.3599 0.3621
1.10 0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.3770 0.3790 0.3810 0.3830
1.20 0.3849 0.3869 0.3888 0.3907 0.3925 0.3944 0.3962 0.3980 0.3997 0.4015

2.00 0.4772 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808 0.4812 0.4817

3.00 0.4987 0.4987 0.4987 0.4988 0.4988 0.4989 0.4989 0.4989 0.4990 0.4990
3.40 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4998
3.50 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998
Table Lookup of a
Standard Normal Probability

P(0 ≤ Z ≤ 1) = 0.3413

Z 0.00 0.01 0.02

0.00 0.0000 0.0040 0.0080


0.10 0.0398 0.0438 0.0478
0.20 0.0793 0.0832 0.0871

1.00 0.3413 0.3438 0.3461

1.10 0.3643 0.3665 0.3686


1.20 0.3849 0.3869 0.3888
Applying the Z Formula

X is normally distributed with μ = 485 and σ = 105

P(485 ≤ X ≤ 600) = P(0 ≤ Z ≤ 1.10) = .3643

For X = 485:  Z = (X − μ)/σ = (485 − 485)/105 = 0
For X = 600:  Z = (X − μ)/σ = (600 − 485)/105 = 1.10

Z     0.00    0.01    0.02
0.00  0.0000  0.0040  0.0080
0.10  0.0398  0.0438  0.0478
1.00  0.3413  0.3438  0.3461
1.10  0.3643  0.3665  0.3686
1.20  0.3849  0.3869  0.3888
32
33
Thank You

34
Distribution of Sample Mean, Proportion, and Variance
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT
IIT ROORKEE

1
2
Acceptance Intervals
Goal: determine a range within which sample means are likely to occur, given a
population mean and variance
• By the Central Limit Theorem, we know that the distribution of X̄ is approximately normal if n is large enough, with mean μ and standard deviation σ_X̄ = σ/√n
• Let z_α/2 be the z-value that leaves area α/2 in the upper tail of the normal distribution (i.e., the interval −z_α/2 to z_α/2 encloses probability 1 − α)
• Then

μ ± z_α/2 σ_X̄

is the interval that includes X̄ with probability 1 − α
3
Sampling Distributions of Sample Proportions

Sampling
Distributions

Sampling Sampling Sampling


Distribution of Distribution of Distribution of
Sample Sample Sample
Mean Proportion Variance

4
Sampling Distributions of Sample Proportions
P = the proportion of the population having some characteristic
• Sample proportion (p̂) provides an estimate of P:

p̂ = X/n = (number of items in the sample having the characteristic of interest) / (sample size)

• 0 ≤ p̂ ≤ 1
• p̂ has a binomial distribution, but can be approximated by a normal distribution when nP(1 − P) > 5

5
Sampling Distribution of p̂

• Normal approximation:

Properties:  E(p̂) = P   (where P = population proportion)

and  σ²_p̂ = Var(X/n) = P(1 − P)/n

[Figure: the sampling distribution of p̂ is approximately normal]

6
7
Z-Value for Proportions

Standardize p̂ to a Z value with the formula:

Z = (p̂ − P)/σ_p̂ = (p̂ − P) / √(P(1 − P)/n)

8
Example

• If the true proportion of voters who support Proposition A is


P = .4, what is the probability that a sample of size 200 yields
a sample proportion between .40 and .45?
• i.e.:
if P = .4 and n = 200, what is
P(.40 ≤ p̂ ≤ .45) ?

9
Example (continued)

• if P = .4 and n = 200, what is


P(.40 ≤ p̂ ≤ .45) ?

Find:  σ_p̂ = √(P(1 − P)/n) = √(.4(1 − .4)/200) = .03464

Convert to standard normal:

P(.40 ≤ p̂ ≤ .45) = P((.40 − .40)/.03464 ≤ Z ≤ (.45 − .40)/.03464) = P(0 ≤ Z ≤ 1.44)

10
Example
(continued)

if P = .4 and n = 200, what is P(.40 ≤ p̂ ≤ .45) ?


Use standard normal table: P(0 ≤ Z ≤ 1.44) = .4251
[Figure: the sampling distribution of p̂ standardized to the normal distribution; the area between .40 and .45 (Z = 0 to 1.44) is .4251]
11
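A sketch of the same computation in Python (the small difference from .4251 comes from table rounding of Z = 1.44):

    import numpy as np
    from scipy.stats import norm

    P, n = 0.4, 200
    se = np.sqrt(P * (1 - P) / n)        # 0.03464

    prob = norm.cdf(0.45, loc=P, scale=se) - norm.cdf(0.40, loc=P, scale=se)
    print(prob)                          # ≈ 0.4255 (table with Z = 1.44 gives .4251)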
Sampling Distributions of Sample Variance

Sampling
Distributions

Sampling Sampling Sampling


Distribution of Distribution of Distribution of
Sample Sample Sample
Mean Proportion Variance

12
Sample Variance
• Let x1, x2, . . . , xn be a random sample from a population. The
sample variance is
s² = (1/(n − 1)) Σ (xᵢ − x̄)²

• the square root of the sample variance is called the sample


standard deviation
• the sample variance is different for different random samples from
the same population

13
Sampling Distribution of Sample Variances
• The sampling distribution of s² has mean σ²:  E(s²) = σ²

• If the population distribution is normal, then

(n − 1)s² / σ²

has a χ² distribution with n − 1 degrees of freedom

14
15
The Chi-square Distribution

• The chi-square distribution is a family of distributions, depending on degrees of freedom: d.f. = n − 1

[Figure: three χ² densities for increasing degrees of freedom, becoming flatter and more symmetric as d.f. grows]
16
Degrees of Freedom (df)
Idea: Number of observations that are free to vary after sample
mean has been calculated
Example: Suppose the mean of 3 numbers is 8.0

Let X1 = 7 and X2 = 8. If the mean of these three values is 8.0, then X3 must be 9 (i.e., X3 is not free to vary).

Here, n = 3, so degrees of freedom = n − 1 = 3 − 1 = 2
(2 values can be any numbers, but the third is not free to vary for a given mean)
17
Chi-square Example

• A commercial freezer must hold a selected temperature with little


variation. Specifications call for a standard deviation of no more than 4
degrees (a variance of 16 degrees2).
• A sample of 14 freezers is to be tested
• What is the upper limit (K) for the sample variance such that the
probability of exceeding this limit, given that the population standard
deviation is 4, is less than 0.05?

18
Finding the Chi-square Value
χ² = (n − 1)s² / σ²  is chi-square distributed with (n − 1) = 13 degrees of freedom

• Use the chi-square distribution with area 0.05 in the upper tail:

χ²₁₃ = 22.36  (α = .05 and 14 − 1 = 13 d.f.)

[Figure: χ² density with 13 d.f.; the upper-tail area beyond 22.36 is α = .05]
19
Chi-square Example
(continued)

213 = 22.36 (α = .05 and 14 – 1 = 13 d.f.)


 (n  1)s 2 2 
P(s  K)  P
2
 χ13   0.05
So:  16 
(n  1)K
or  22.36 (where n = 14)
16

(22.36)(16)
so K  27.52
(14  1)

If s2 from the sample of size n = 14 is greater than 27.52, there is


strong evidence to suggest the population variance exceeds 16.
20
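A sketch of this calculation with SciPy's chi-square distribution:

    from scipy.stats import chi2

    n, sigma2 = 14, 16
    df = n - 1

    chi_crit = chi2.ppf(0.95, df)      # upper-tail 0.05 critical value ≈ 22.36
    K = chi_crit * sigma2 / df         # ≈ 27.52
    print(chi_crit, K)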
Summary
• Introduced sampling distributions
• Described the sampling distribution of sample means
– For normal populations
– Using the Central Limit Theorem
• Described the sampling distribution of sample proportions
• Introduced the chi-square distribution
• Examined sampling distributions for sample variances
• Calculated probabilities using sampling distributions
21
Thank You

22
Confidence Interval Estimation: Single
Population
Dr. A. Ramesh
Department of Management Studies
IIT ROORKEE

1
Goals
After completing this lecture, you should be able to:
• Distinguish between a point estimate and a confidence interval estimate
• Construct and interpret a confidence interval estimate for a single
population mean using both the Z and t distributions
• Form and interpret a confidence interval estimate for a single population
proportion
• Create confidence interval estimates for the variance of a normal
population

2
Confidence Intervals
• Confidence Intervals for the Population Mean, μ
– when Population Variance σ2 is Known
– when Population Variance σ2 is Unknown
• Confidence Intervals for the Population Proportion, p̂ (large samples)
• Confidence interval estimates for the variance of a normal population

3
Definitions

• An estimator of a population parameter is


– a random variable that depends on sample information . . .
– whose value provides an approximation to this unknown parameter

• A specific value of that random variable is called an estimate

4
Point and Interval Estimates

• A point estimate is a single number,


• a confidence interval provides additional information about
variability

[Diagram: a confidence interval runs from the lower confidence limit to the upper confidence limit, with the point estimate at its center; the distance between the limits is the width of the confidence interval]
5
Point Estimates

We can estimate a Population Parameter … with a Sample Statistic (a Point Estimate):

Parameter: Mean μ → Point estimate: x̄
Parameter: Proportion P → Point estimate: p̂

6
Unbiasedness

• A point estimator θ̂ is said to be an unbiased estimator of the parameter θ if the expected value, or mean, of the sampling distribution of θ̂ is θ:

E(θ̂) = θ

• Examples:
– The sample mean x̄ is an unbiased estimator of μ
– The sample variance s² is an unbiased estimator of σ²
– The sample proportion p̂ is an unbiased estimator of P

7
Unbiasedness
(continued)
• θ̂1 is an unbiased estimator, θ̂2 is biased:

[Figure: the sampling distribution of θ̂1 is centered at θ; the sampling distribution of θ̂2 is centered away from θ]
8
Bias

• Let θ̂ be an estimator of θ

• The bias in θ̂ is defined as the difference between its mean and θ:

Bias(θ̂) = E(θ̂) − θ

• The bias of an unbiased estimator is 0

9
Most Efficient Estimator
• Suppose there are several unbiased estimators of θ
• The most efficient estimator, or minimum variance unbiased estimator, of θ is the unbiased estimator with the smallest variance
• Let θ̂1 and θ̂2 be two unbiased estimators of θ, based on the same number of sample observations. Then,
– θ̂1 is said to be more efficient than θ̂2 if Var(θ̂1) < Var(θ̂2)
– The relative efficiency of θ̂1 with respect to θ̂2 is the ratio of their variances:

Relative Efficiency = Var(θ̂2) / Var(θ̂1)

10
Confidence Intervals

• How much uncertainty is associated with a point estimate of a population


parameter?

• An interval estimate provides more information about a population


characteristic than does a point estimate

• Such interval estimates are called confidence intervals

11
Confidence Interval Estimate

• An interval gives a range of values:


– Takes into consideration variation in sample statistics from sample to
sample
– Based on observation from 1 sample
– Gives information about closeness to unknown population
parameters
– Stated in terms of level of confidence
• Can never be 100% confident

12
Confidence Interval and Confidence Level

• If P(a < θ < b) = 1 − α, then the interval from a to b is called a 100(1 − α)% confidence interval of θ.
• The quantity (1 − α) is called the confidence level of the interval (α between 0 and 1)

– In repeated samples of the population, the true value of the parameter θ would be contained in 100(1 − α)% of intervals calculated this way.
– The confidence interval calculated in this manner is written as a < θ < b with 100(1 − α)% confidence

13
Estimation Process

[Diagram: from a population whose mean μ is unknown, a random sample gives x̄ = 50; conclusion: “I am 95% confident that μ is between 40 & 60.”]
14
Confidence Level, (1-)
(continued)
• Suppose confidence level = 95%
• Also written (1 - ) = 0.95
• A relative frequency interpretation:
– From repeated samples, 95% of all the confidence intervals that can
be constructed will contain the unknown true parameter
• A specific interval either will contain or will not contain the true
parameter

15
General Formula

• The general formula for all confidence intervals is:

Point Estimate ± (Reliability Factor)(Standard Error)

• The value of the reliability factor depends on the desired level of confidence

16
Confidence Intervals

Confidence
Intervals

Population Population Population


Mean Proportion Variance

σ2 Known σ2 Unknown

17
Confidence Interval for μ (σ2 Known)
• Assumptions
– Population variance σ² is known
– Population is normally distributed
– If population is not normal, use large sample
• Confidence interval estimate:

x̄ − z_α/2 σ/√n ≤ μ ≤ x̄ + z_α/2 σ/√n

(where z_α/2 is the normal distribution value for a probability of α/2 in each tail)

18
Margin of Error
• The confidence interval

x̄ − z_α/2 σ/√n ≤ μ ≤ x̄ + z_α/2 σ/√n

• can also be written as x̄ ± ME, where ME is called the margin of error:

ME = z_α/2 σ/√n

19
Reducing the Margin of Error

ME = z_α/2 σ/√n

The margin of error can be reduced if

• the population standard deviation can be reduced (σ↓)

• the sample size is increased (n↑)

• the confidence level is decreased, (1 − α)↓

20
Finding the Reliability Factor, z_α/2

• Consider a 95% confidence interval: 1 − α = .95, so α/2 = .025 in each tail

Z units: z = −1.96 to z = 1.96
X units: from the lower confidence limit to the upper confidence limit, centered at the point estimate

Find z.025 = 1.96 from the standard normal distribution table

21
Common Levels of Confidence

• Commonly used confidence levels are 90%, 95%, and 99%

Confidence Level   Confidence Coefficient, 1 − α   Z_α/2 value
80%                .80                             1.28
90%                .90                             1.645
95%                .95                             1.96
98%                .98                             2.33
99%                .99                             2.58
99.8%              .998                            3.08
99.9%              .999                            3.27
22
Intervals and Level of Confidence
Sampling Distribution of the Mean: area α/2 in each tail and 1 − α in the middle

Intervals extend from LCL = x̄ − z σ/√n to UCL = x̄ + z σ/√n

[Figure: confidence intervals from repeated samples x̄1, x̄2, …; 100(1 − α)% of intervals constructed this way contain μ, and 100(α)% do not]
23
Example

• A sample of 11 circuits from a large normal population has a mean


resistance of 2.20 ohms. We know from past testing that the population
standard deviation is 0.35 ohms.

• Determine a 95% confidence interval for the true mean resistance of the
population.

24
Example
(continued)

• A sample of 11 circuits from a large normal population has a mean resistance


of 2.20 ohms. We know from past testing that the population standard
deviation is .35 ohms.
• Solution:

x̄ ± z σ/√n = 2.20 ± 1.96(.35/√11) = 2.20 ± .2068

1.9932 ≤ μ ≤ 2.4068

25
Interpretation

• We are 95% confident that the true mean resistance is


between 1.9932 and 2.4068 ohms
• Although the true mean may or may not be in this interval,
95% of intervals formed in this manner will contain the true
mean

26
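A sketch of this interval in Python (taking the z critical value from SciPy rather than a table):

    import numpy as np
    from scipy.stats import norm

    xbar, sigma, n = 2.20, 0.35, 11
    z = norm.ppf(0.975)                 # 1.96 for 95% confidence

    me = z * sigma / np.sqrt(n)         # margin of error ≈ 0.2068
    print(xbar - me, xbar + me)         # ≈ (1.9932, 2.4068)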
Confidence Intervals

Confidence
Intervals

Population Population Population


Mean Proportion Variance

σ2 Known σ2 Unknown

27
Confidence Interval Estimation: Single
Population-II
Dr. A. Ramesh
Department of Management Studies
IIT ROORKEE

1
Student’s t Distribution
• Consider a random sample of n observations
– with mean x̄ and standard deviation s
– from a normally distributed population with mean μ

• Then the variable

t = (x̄ − μ) / (s/√n)

follows the Student’s t distribution with (n − 1) degrees of freedom

2
Confidence Interval for μ (σ2 Unknown)

• If the population standard deviation σ is unknown, we can substitute


the sample standard deviation, s
• This introduces extra uncertainty, since s is variable from sample to
sample
• So we use the t distribution instead of the normal distribution

3
Confidence Interval for μ (σ Unknown)
(continued)
• Assumptions
– Population standard deviation is unknown
– Population is normally distributed
– If population is not normal, use large sample
• Use Student’s t Distribution
• Confidence Interval Estimate:

x̄ − t_{n−1,α/2} s/√n ≤ μ ≤ x̄ + t_{n−1,α/2} s/√n

where t_{n−1,α/2} is the critical value of the t distribution with n − 1 d.f. and an area of α/2 in each tail

4
Margin of Error
• The confidence interval

x̄ − t_{n−1,α/2} s/√n ≤ μ ≤ x̄ + t_{n−1,α/2} s/√n

• can also be written as x̄ ± ME, where ME is called the margin of error:

ME = t_{n−1,α/2} s/√n

5
Student’s t Distribution

• The t is a family of distributions


• The t value depends on degrees of freedom (d.f.)
– Number of observations that are free to vary after sample mean has
been calculated
d.f. = n - 1

6
Student’s t Distribution
Note: t → Z as n increases

t-distributions are bell-shaped and symmetric, but have ‘fatter’ tails than the normal.

[Figure: standard normal (t with df = ∞) compared with t (df = 13) and t (df = 5); smaller df gives heavier tails]
7
Student’s t Table

Let: n = 3, so df = n − 1 = 2, α = .10, α/2 = .05

Upper Tail Area
df    .10    .05    .025
1   3.078  6.314  12.706
2   1.886  2.920   4.303
3   1.638  2.353   3.182

The body of the table contains t values, not probabilities: t_{2,.05} = 2.920
8
t distribution values
With comparison to the Z value

Confidence Level   t (10 d.f.)   t (20 d.f.)   t (30 d.f.)   Z
.80                1.372         1.325         1.310         1.282
.90                1.812         1.725         1.697         1.645
.95                2.228         2.086         2.042         1.960
.99                3.169         2.845         2.750         2.576

Note: t → Z as n increases
9
Example

A random sample of n = 25 has x̄ = 50 and s = 8. Form a 95% confidence interval for μ.

– d.f. = n − 1 = 24, so t_{n−1,α/2} = t_{24,.025} = 2.0639

The confidence interval is

x̄ − t_{n−1,α/2} s/√n ≤ μ ≤ x̄ + t_{n−1,α/2} s/√n
50 − (2.0639)(8/√25) ≤ μ ≤ 50 + (2.0639)(8/√25)
46.698 ≤ μ ≤ 53.302
10
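The same interval, sketched with SciPy's t distribution:

    import numpy as np
    from scipy.stats import t

    xbar, s, n = 50, 8, 25
    t_crit = t.ppf(0.975, n - 1)        # ≈ 2.0639

    me = t_crit * s / np.sqrt(n)
    print(xbar - me, xbar + me)         # ≈ (46.698, 53.302)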
Confidence Intervals

Confidence
Intervals

Population Population Population


Mean Proportion Variance

σ2 Known σ2 Unknown

11
Confidence Intervals for the
Population Proportion

• An interval estimate for the population proportion ( P ) can


be calculated by adding an allowance for uncertainty to the
sample proportion ( p̂ )

12
Confidence Intervals for the Population
Proportion, p
(continued)

• Recall that the distribution of the sample proportion is approximately normal if the sample size is large, with standard deviation

σ_P = √(P(1 − P)/n)

• We will estimate this with sample data:

√(p̂(1 − p̂)/n)

13
Confidence Interval Endpoints

• Upper and lower confidence limits for the population proportion are calculated with the formula

p̂ − z_α/2 √(p̂(1 − p̂)/n) ≤ P ≤ p̂ + z_α/2 √(p̂(1 − p̂)/n)

• where
– z_α/2 is the standard normal value for the level of confidence desired
– p̂ is the sample proportion
– n is the sample size
– nP(1 − P) > 5

14
Example

• A random sample of 100 people shows that 25 are left-


handed.
• Form a 95% confidence interval for the true proportion of
left-handers

15
Example (continued)

• A random sample of 100 people shows that 25 are left-handed. Form a 95% confidence interval for the true proportion of left-handers.

p̂ − z_α/2 √(p̂(1 − p̂)/n) ≤ P ≤ p̂ + z_α/2 √(p̂(1 − p̂)/n)

25/100 − 1.96 √(.25(.75)/100) ≤ P ≤ 25/100 + 1.96 √(.25(.75)/100)

0.1651 ≤ P ≤ 0.3349

16
Interpretation

• We are 95% confident that the true percentage of left-handers in the


population is between
16.51% and 33.49%.

• Although the interval from 0.1651 to 0.3349 may or may not contain the true
proportion, 95% of intervals formed from samples of size 100 in this manner
will contain the true proportion.

17
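A sketch of this proportion interval in Python:

    import numpy as np
    from scipy.stats import norm

    x, n = 25, 100
    p_hat = x / n
    z = norm.ppf(0.975)

    me = z * np.sqrt(p_hat * (1 - p_hat) / n)
    print(p_hat - me, p_hat + me)       # ≈ (0.1651, 0.3349)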
Confidence Intervals

Confidence
Intervals

Population Population Population


Mean Proportion Variance

σ2 Known σ2 Unknown

18
Confidence Intervals for the Population
Variance

 Goal: Form a confidence interval for the population variance, σ2

• The confidence interval is based on the sample variance, s2

• Assumed: the population is normally distributed

19
Confidence Intervals for the Population Variance
(continued)

The random variable

χ²_{n−1} = (n − 1)s² / σ²

follows a chi-square distribution with (n − 1) degrees of freedom

20
Confidence Intervals for the Population Variance

The 100(1 − α)% confidence interval for the population variance is

(n − 1)s² / χ²_{n−1, α/2} ≤ σ² ≤ (n − 1)s² / χ²_{n−1, 1−α/2}

21
Example

You are testing the speed of a batch of computer processors. You


collect the following data (in Mhz):

Sample size 17
Sample mean 3004
Sample std dev 74

Assume the population is normal. Determine the 95% confidence interval for σ²

22
Finding the Chi-square Values

• n = 17 so the chi-square distribution has (n − 1) = 16 degrees of freedom
• α = 0.05, so use the chi-square values with area 0.025 in each tail:

χ²_{n−1, α/2} = χ²_{16, 0.025} = 28.85
χ²_{n−1, 1−α/2} = χ²_{16, 0.975} = 6.91

[Figure: χ²₁₆ density with area .025 below 6.91 and area .025 above 28.85]

23
Calculating the Confidence Limits

• The 95% confidence interval is

(n − 1)s² / χ²_{n−1, α/2} ≤ σ² ≤ (n − 1)s² / χ²_{n−1, 1−α/2}

(17 − 1)(74)² / 28.85 ≤ σ² ≤ (17 − 1)(74)² / 6.91

3037 ≤ σ² ≤ 12683

Converting to standard deviation, we are 95% confident that the population standard deviation of CPU speed is between 55.1 and 112.6 Mhz

24
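A sketch of the variance interval with SciPy:

    import numpy as np
    from scipy.stats import chi2

    n, s = 17, 74
    df = n - 1

    lower = df * s**2 / chi2.ppf(0.975, df)   # ≈ 3037
    upper = df * s**2 / chi2.ppf(0.025, df)   # ≈ 12683
    print(lower, upper)
    print(np.sqrt(lower), np.sqrt(upper))     # ≈ (55.1, 112.6) Mhz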
Finite Populations

• If the sample size is more than 5% of the population size (and


sampling is without replacement) then a finite population correction
factor must be used when calculating the standard error

25
Finite Population Correction Factor

• Suppose sampling is without replacement and the sample size is large


relative to the population size
• Assume the population size is large enough to apply the central limit
theorem
• Apply the finite population correction factor when estimating the
population variance

Nn
finite population correction factor 
N 1

26
Estimating the Population Mean

• Let a simple random sample of size n be taken from a population


of N members with mean μ
• The sample mean is an unbiased estimator of the population mean μ
• The point estimate is:

x̄ = (1/n) Σ xᵢ

27
Finite Populations: Mean

• If the sample size is more than 5% of the population size, an unbiased estimator for the variance of the sample mean is

σ̂²_x̄ = (s²/n) × (N − n)/(N − 1)

• So the 100(1 − α)% confidence interval for the population mean is

x̄ − t_{n−1,α/2} σ̂_x̄ ≤ μ ≤ x̄ + t_{n−1,α/2} σ̂_x̄

28
Estimating the Population Proportion

• Let the true population proportion be P


• Let p̂ be the sample proportion from n observations from a simple
random sample
• The sample proportion, p̂ , is an unbiased estimator of the population
proportion, P

29
Finite Populations: Proportion

• If the sample size is more than 5% of the population size, an unbiased estimator for the variance of the population proportion is

σ̂²_p̂ = (p̂(1 − p̂)/n) × (N − n)/(N − 1)

• So the 100(1 − α)% confidence interval for the population proportion is

p̂ − z_α/2 σ̂_p̂ ≤ P ≤ p̂ + z_α/2 σ̂_p̂

30
Lecture Summary
• Introduced the concept of confidence intervals
• Discussed point estimates
• Developed confidence interval estimates
• Created confidence interval estimates for the mean (σ2
known)
• Introduced the Student’s t distribution
• Determined confidence interval estimates for the mean (σ2
unknown)

31
Lecture Summary
(continued)
• Created confidence interval estimates for the proportion
• Created confidence interval estimates for the variance of a normal
population
• Applied the finite population correction factor to form confidence
intervals when the sample size is not small relative to the population size

32
Thank You

34
Welcome to
TA Live Session 3

NPTEL | DATA ANALYTICS


WITH PYTHON
12-02-2022
Ritwiz Kamal
PhD (Prime Minister’s Research Fellow), CSE, IIT Madras
Let’s Get Started …



To discuss before we start:

 Correction from Last Session (Biased vs Unbiased Coin)

 Recommendation for Books on Data Analytics



Question 1



Question 1



Question 2
2) What kind of sampling are we doing below ?



Question 2



Question 3
3) What kind of sampling are we doing below ?



Question 3



Question 4
4) What kind of sampling are we doing below ?



Question 4



Question 5
5) What kind of sampling are we doing below ?



Question 5

GOOD OR BAD ?



Question 6
6)



Question 6
6)



Question 7



Question 7

P(X = 1) = (λ¹/1!) e^(−λ) = (2.5¹/1!) e^(−2.5)


Question 8

Find the probability that they will sell at most 2 cars.



Question 8

Find the probability that they will sell at most 2 cars.

P(X ≤ 2) = (λ⁰/0!) e^(−λ) + (λ¹/1!) e^(−λ) + (λ²/2!) e^(−λ)

*Shift to Colab for demo
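A minimal sketch of the Colab demo, assuming λ = 2.5 (the mean used in Question 7; the mean for this question is not recoverable from the slide):

    from scipy.stats import poisson

    lam = 2.5                      # assumed mean number of cars sold

    print(poisson.pmf(1, lam))     # P(X = 1)  ≈ 0.2052
    print(poisson.cdf(2, lam))     # P(X <= 2) ≈ 0.5438 (sum of pmf at 0, 1, 2)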


Question 9
9) A basket contains 10 rotten tomatoes. One fresh tomato is thrown into the
basket by mistake. You pick a tomato at random and if it is rotten you throw it out
and pick again.

What is the probability that you will get the fresh tomato in the first trial ?

A) 1/11
B) 1/10
C) 1
D) 9/10



Question 9
9) A basket contains 10 rotten tomatoes. One fresh tomato is thrown into the
basket by mistake. You pick a tomato at random and if it is rotten you throw it out
and pick again.

What is the probability that you will get the fresh tomato in the first trial ?

A) 1/11
B) 1/10
C) 1
D) 9/10



Question 10
10) A basket contains 10 rotten tomatoes. One fresh tomato is thrown into the
basket by mistake. You pick a tomato at random and if it is rotten you throw it out
and pick again.

What is the probability that you will get the fresh tomato in the third trial ?

A) 1/11
B) 1/10
C) 1
D) 9/10



Question 10
10) A basket contains 10 rotten tomatoes. One fresh tomato is thrown into the
basket by mistake. You pick a tomato at random and if it is rotten you throw it out
and pick again.

What is the probability that you will get the fresh tomato in the third trial ?

A) 1/11
B) 1/10 10 9 1
C) 1 ∗ ∗
D) 9/10 11 10 9

Ritwiz Kamal | IIT Madras 23


Acknowledgments
 Prof. A Ramesh | IIT Roorkee
Data Analytics with Python | NPTEL
 NPTEL Team
 PMRF Team
 Department of CSE, IIT Madras

THANK YOU!



Lecture 11: Sampling and Sampling Distribution
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES
IIT ROORKEE

1
Lecture Objectives
After completing this lecture, you should be able to:
• Describe a simple random sample and why sampling is important
• Explain the difference between descriptive and inferential statistics
• Define the concept of a sampling distribution
• Determine the mean and standard deviation for the sampling distribution of the sample mean, x̄

2
Lecture Objectives

• Describe the Central Limit Theorem and its importance


• Determine the mean and standard deviation for the sampling distribution of the sample proportion, p̂
• Describe sampling distributions of sample variances

3
Descriptive vs Inferential Statistics

• Descriptive statistics
– Collecting, presenting, and describing data
• Inferential statistics
– Drawing conclusions and/or making decisions concerning a population
based only on sample data

4
Populations and Samples

• A Population is the set of all items or individuals of interest


– Examples: All likely voters in the next election
All parts produced today
All sales receipts for November

• A Sample is a subset of the population


– Examples: 1000 voters selected at random for interview
A few parts selected for destructive testing
Random receipts selected for audit

5
Population vs. Sample

[Figure: the population is all letters a through z; the sample is the subset {b, c, g, i, n, o, r, u, y}]

6
Why Sample?
• Less time consuming than a census
• Less costly to administer than a census
• It is possible to obtain statistical results of a sufficiently high precision
based on samples.
• Because the research process is sometimes destructive, the sample can
save product
• If accessing the population is impossible; sampling is the only option

7
Reasons for Taking a Census

• Eliminate the possibility that a random sample is not representative of the


population

• The person authorizing the study is uncomfortable with sample


information
Random Versus Nonrandom Sampling
• Random sampling
• Every unit of the population has the same probability of being included in the
sample.
• A chance mechanism is used in the selection process.
• Eliminates bias in the selection process
• Also known as probability sampling
• Nonrandom Sampling
• Every unit of the population does not have the same probability of being
included in the sample.
• Open the selection bias
• Not appropriate data collection methods for most statistical methods
• Also known as non-probability sampling
Random Sampling Techniques

• Simple Random Sample

• Stratified Random Sample

– Proportionate

– Disproportionate

• Systematic Random Sample

• Cluster (or Area) Sampling


Simple Random Samples

• Every object in the population has an equal chance of being selected


• Objects are selected independently
• Samples can be obtained from a table of random numbers or computer
random number generators
• A simple random sample is the ideal against which other sample methods
are compared

11
Simple Random Sample:
Numbered Population Frame

01 Andhra Pradesh      11 Madhya Pradesh
02 Himachal Pradesh    12 Uttar Pradesh
03 Gujarat             13 Bihar
04 Maharashtra         14 Rajasthan
05 Nagaland            15 J & K
06 Goa                 16 Tamil Nadu
07 West Bengal         17 Karnataka
08 Haryana             18 Kerala
09 Punjab              19 Orissa
10 Delhi               20 Manipur
Simple Random Sampling:
Random Number Table

9 9 4 3 7 8 7 9 6 1 4 5 7 3 7 3 7 5 5 2 9 7 9 6 9 3 9 0 9 4 3 4 4 7 5 3 1 6 1 8
5 0 6 5 6 0 0 1 2 7 6 8 3 6 7 6 6 8 8 2 0 8 1 5 6 8 0 0 1 6 7 8 2 2 4 5 8 3 2 6
8 0 8 8 0 6 3 1 7 1 4 2 8 7 7 6 6 8 3 5 6 0 5 1 5 7 0 2 9 6 5 0 0 2 6 4 5 5 8 7
8 6 4 2 0 4 0 8 5 3 5 3 7 9 8 8 9 4 5 4 6 8 1 3 0 9 1 2 5 3 8 8 1 0 4 7 4 3 1 9
6 0 0 9 7 8 6 4 3 6 0 1 8 6 9 4 7 7 5 8 8 9 5 3 5 9 9 4 0 0 4 8 2 6 8 3 0 6 0 6
5 2 5 8 7 7 1 9 6 5 8 5 4 5 3 4 6 8 3 4 0 0 9 9 1 9 9 7 2 9 7 6 9 4 8 1 5 9 4 1
8 9 1 5 5 9 0 5 5 3 9 0 6 8 9 4 8 6 3 7 0 7 9 5 5 4 7 0 6 2 7 1 1 8 2 6 4 4 9 3
Simple Random Sample:
Sample Members

01 Andhra Pradesh      11 Madhya Pradesh
02 Himachal Pradesh    12 Uttar Pradesh
03 Gujarat             13 Bihar
04 Maharashtra         14 Rajasthan
05 Nagaland            15 J & K
06 Goa                 16 Tamil Nadu
07 West Bengal         17 Karnataka
08 Haryana             18 Kerala
09 Punjab              19 Orissa
10 Delhi               20 Manipur

• N = 20
• n=4
Stratified Random Sample

• Population is divided into non-overlapping subpopulations called strata


• A random sample is selected from each stratum
• Potential for reducing sampling error
• Proportionate -- the percentage of these sample taken from each stratum
is proportionate to the percentage that each stratum is within the
population
• Disproportionate -- proportions of the strata within the sample are
different than the proportions of the strata within the population
Stratified Random Sample:
Population of FM Radio Listeners
Stratified by Age

20–30 years old (homogeneous within, i.e., alike)
30–40 years old (homogeneous within, i.e., alike)
40–50 years old (homogeneous within, i.e., alike)

Heterogeneous (different) between strata
Systematic Sampling
• Convenient and relatively easy to administer
• Population elements are an ordered sequence (at least, conceptually)
• The first sample element is selected randomly from the first k population elements
• Thereafter, sample elements are selected at a constant interval, k, from the ordered sequence frame

k = N/n,  where:
  n = sample size
  N = population size
  k = size of selection interval
Systematic Sampling: Example

• Purchase orders for the previous fiscal year are serialized 1 to 10,000 (N =
10,000).
• A sample of fifty (n = 50) purchases orders is needed for an audit.
• k = 10,000/50 = 200
• First sample element randomly selected from the first 200 purchase
orders. Assume the 45th purchase order was selected.
• Subsequent sample elements: 245, 445, 645, . . .
Cluster Sampling

• Population is divided into non-overlapping clusters or areas

• Each cluster is a miniature of the population.

• A subset of the clusters is selected randomly for the sample.

• If the number of elements in the subset of clusters is larger than the


desired value of n, these clusters may be subdivided to form a new
set of clusters and subjected to a random selection process.
Cluster Sampling
 Advantages
• More convenient for geographically dispersed populations
• Reduced travel costs to contact sample elements
• Simplified administration of the survey
• Unavailability of sampling frame prohibits using other random
sampling methods
 Disadvantages
• Statistically less efficient when the cluster elements are similar
• Costs and problems of statistical analysis are greater than for simple
random sampling
Nonrandom Sampling
• Convenience Sampling: Sample elements are selected for the convenience
of the researcher

• Judgment Sampling: Sample elements are selected by the judgment of the


researcher

• Quota Sampling: Sample elements are selected until the quota controls are
satisfied

• Snowball Sampling: Survey subjects are selected based on referral from


other survey respondents
Errors
• Data from nonrandom samples are not appropriate for analysis by inferential
statistical methods.
• Sampling Error occurs when the sample is not representative of the
population
• Non-sampling Errors
• Missing Data, Recording, Data Entry, and Analysis Errors
• Poorly conceived concepts , unclear definitions, and defective questionnaires
• Response errors occur when people do not know, will not say, or overstate in their answers
Sampling Distribution of x̄

Proper analysis and interpretation of a sample statistic requires knowledge of its distribution.

[Diagram: select a random sample from the population (parameter μ), calculate x̄ to estimate μ, then use the process of inferential statistics to draw conclusions about the parameter from the statistic]
Inferential Statistics

• Making statements about a population by examining sample results

[Diagram: sample statistics (known) → inference → population parameters (unknown, but can be estimated from sample evidence)]

24
Inferential Statistics
Drawing conclusions and/or making decisions concerning a
population based on sample results.

• Estimation
– e.g., Estimate the population mean weight
using the sample mean weight
• Hypothesis Testing
– e.g., Use sample evidence to test the claim
that the population mean weight is 120
pounds

25
Sampling Distributions

• A sampling distribution is a distribution of all of the possible values of a


statistic for a given size sample selected from a population

26
Types of sampling distributions

Sampling
Distributions

Sampling Sampling Sampling


Distribution of Distribution of Distribution of
Sample Sample Sample
Mean Proportion Variance

27
Sampling Distributions of Sample Means

Sampling
Distributions

Sampling Sampling Sampling


Distribution of Distribution of Distribution of
Sample Sample Proportion Sample Variance
Mean

28
Developing a Sampling Distribution

• Assume there is a population … A B C D

• Population size N=4


• Random variable, X,
is age of individuals
• Values of X:
18, 20, 22, 24 (years)

29
Developing a Sampling Distribution
(continued)

Summary Measures for the Population Distribution:

μ = ΣXᵢ/N = (18 + 20 + 22 + 24)/4 = 21

σ = √(Σ(Xᵢ − μ)²/N) = 2.236

[Figure: uniform distribution of the four ages A, B, C, D; P(x) = .25 at each of 18, 20, 22, 24]

30
Developing a Sampling Distribution
(continued)
Now consider all possible samples of size n = 2

16 possible samples (sampling with replacement):
1st Obs \ 2nd Obs:   18      20      22      24
18                 18,18   18,20   18,22   18,24
20                 20,18   20,20   20,22   20,24
22                 22,18   22,20   22,22   22,24
24                 24,18   24,20   24,22   24,24

16 Sample Means:
1st Obs \ 2nd Obs:   18   20   22   24
18                   18   19   20   21
20                   19   20   21   22
22                   20   21   22   23
24                   21   22   23   24

31
Developing a Sampling Distribution
(continued)

• Sampling Distribution of All Sample Means (from the 16 sample means above)

[Figure: histogram of P(X̄) over the values 18, 19, 20, 21, 22, 23, 24 — no longer uniform]
32
Developing a Sampling Distribution
(continued)

• Summary Measures of this Sampling Distribution:

E(X̄) = ΣX̄ᵢ/N = (18 + 19 + 21 + … + 24)/16 = 21 = μ

σ_X̄ = √(Σ(X̄ᵢ − μ)²/N)
    = √(((18 − 21)² + (19 − 21)² + … + (24 − 21)²)/16) = 1.58

33
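A sketch that enumerates the 16 samples and verifies these summary measures:

    import numpy as np
    from itertools import product

    population = [18, 20, 22, 24]

    # All 16 samples of size 2, drawn with replacement
    means = [np.mean(s) for s in product(population, repeat=2)]

    print(np.mean(means))          # 21.0  (equals the population mean)
    print(np.std(means))           # ≈ 1.58 (standard error of the mean)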
Comparing the Population with its Sampling
Distribution
Population (N = 4): μ = 21, σ = 2.236
Sample Means Distribution (n = 2): μ_X̄ = 21, σ_X̄ = 1.58

[Figure: the flat population distribution P(X) of A, B, C, D next to the peaked distribution of sample means P(X̄)]
34
1,800 Randomly Selected Values
from an Exponential Distribution

[Figure: histogram of the 1,800 values; frequency on the vertical axis, X from 0 to 10 — strongly right-skewed]
Means of 60 Samples (n = 2)
from an Exponential Distribution

[Figure: histogram of the 60 sample means, x̄ from 0 to 4 — still skewed, but less so than the population]
Means of 60 Samples (n = 5)
from an Exponential Distribution
[Figure: histogram of the 60 sample means, x̄ from 0 to 4 — closer to symmetric]
Means of 60 Samples (n = 30)
from an Exponential Distribution
[Figure: histogram of the 60 sample means, x̄ from 0 to 3 — approximately normal]
1,800 Randomly Selected Values
from a Uniform Distribution

[Figure: histogram of the 1,800 values; roughly flat frequencies from 0 to 5]
Means of 60 Samples (n = 2)
from a Uniform Distribution

[Figure: histogram of the 60 sample means, x̄ from 1.00 to 4.25 — mound-shaped]
Means of 60 Samples (n = 5)
from a Uniform Distribution

[Figure: histogram of the 60 sample means — more concentrated around the center]
Means of 60 Samples (n = 30)
from a Uniform Distribution
[Figure: histogram of the 60 sample means — tightly concentrated and approximately normal]
Expected Value of Sample Mean

• Let X1, X2, . . . Xn represent a random sample from a population

• The sample mean value of these observations is defined as

X̄ = (1/n) Σ Xᵢ

43
Standard Error of the Mean
• Different samples of the same size from the same population will yield
different sample means
• A measure of the variability in the mean from sample to sample is given by
the Standard Error of the Mean:

σ_X̄ = σ/√n
• Note that the standard error of the mean decreases as the sample size
increases

44
If sample values are not independent
(continued)

• If the sample size n is not a small fraction of the population size N,


then individual sample members are not distributed independently
of one another
• Thus, observations are not selected independently
• A correction is made to account for this:

Var(X̄) = (σ²/n) × (N − n)/(N − 1)   or   σ_X̄ = (σ/√n) √((N − n)/(N − 1))

45
If the Population is Normal

• If a population is normal with mean μ and standard deviation σ, the sampling distribution of X̄ is also normally distributed with

μ_X̄ = μ  and  σ_X̄ = σ/√n

• If the sample size n is not a small fraction of the population size N, then

μ_X̄ = μ  and  σ_X̄ = (σ/√n) √((N − n)/(N − 1))

46
Z-value for Sampling Distribution of the Mean

• Z-value for the sampling distribution of X̄:

Z = (X̄ − μ)/σ_X̄

where: X̄ = sample mean
       μ = population mean
       σ_X̄ = standard error of the mean

47
Sampling Distribution Properties

μ_X̄ = μ  (i.e., x̄ is unbiased)

[Figure: the normal population distribution and the normal sampling distribution have the same mean μ]
48
Sampling Distribution Properties

• For sampling with replacement: as n increases, σ_x̄ decreases

[Figure: a larger sample size gives a narrower sampling distribution around μ]
49
If the Population is not Normal- Central Limit Theorem
We can apply the Central Limit Theorem:

– Even if the population is not normal,
– sample means from the population will be approximately normal as long as the sample size is large enough.

Properties of the sampling distribution:

μ_x̄ = μ  and  σ_x̄ = σ/√n
50
Central Limit Theorem

As the sample size n gets large enough, the sampling distribution becomes almost normal regardless of the shape of the population
51
If the Population is not Normal
(continued)
Sampling distribution properties:

Central Tendency:  μ_x̄ = μ
Variation:  σ_x̄ = σ/√n

[Figure: a non-normal population distribution; the sampling distribution becomes normal as n increases, and narrows for larger sample sizes]
52
How Large is Large Enough?

• For most distributions, n > 25 will give a sampling distribution that is


nearly normal
• For normal population distributions, the sampling distribution of the mean
is always normally distributed

53
Example

• Suppose a large population has mean μ = 8 and standard deviation σ = 3.


Suppose a random sample of size n = 36 is selected.

• What is the probability that the sample mean is between 7.8 and 8.2?

54
Example

Solution:
• Even if the population is not normally distributed, the central limit
theorem can be used (n > 25)
• … so the sampling distribution of x is approximately normal
• … with mean μx = 8
• …and standard deviation
σ_x̄ = σ/√n = 3/√36 = 0.5
55
Example (continued)
Solution (continued):

P(7.8 ≤ X̄ ≤ 8.2) = P((7.8 − 8)/(3/√36) ≤ Z ≤ (8.2 − 8)/(3/√36))
                 = P(−0.4 ≤ Z ≤ 0.4) = 0.3108

[Figure: the sampling distribution of X̄ between 7.8 and 8.2 standardizes to the standard normal between −0.4 and 0.4; each half contributes .1554]
56
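A sketch of this probability in Python (note that the standardized bounds come out to ±0.4):

    import numpy as np
    from scipy.stats import norm

    mu, sigma, n = 8, 3, 36
    se = sigma / np.sqrt(n)             # 0.5

    prob = norm.cdf(8.2, mu, se) - norm.cdf(7.8, mu, se)
    print(prob)                         # P(7.8 <= x-bar <= 8.2) ≈ 0.3108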
Errors in Hypothesis Testing

Dr. A. Ramesh
Department of Management Studies
Indian Institute of Technology Roorkee

1
Example

• We are interested in burning rate of a solid propellant used to power aircrew escape systems

• Burning rate is a random variable that can be described by a probability distribution

• Suppose our interest focus on mean burning rate

• Ho: µ = 50 centimeters per second

• H1: µ ≠ 50 centimeters per second

Reference: Applied statistics and probability for engineers, Douglas C. Montgomery, George C. Runger, John Wiley &
Sons, 2007

2
Value of the null hypothesis

• The value of the null hypothesis can be obtained by

– Past experience or knowledge of the process, or even from the previous tests or experiments

– From some theory or model regarding the process under study

– From external consideration, such as design or engineering specifications, or from contractual

obligations

3
Note: for this example n = 10
Note: for this example we will assume σ = 2.5

4
Type I Error

• The true mean burning rate of the propellant could be equal to 50 centimeters per second

• However, for the randomly selected propellant specimens that are tested, we could observe a value of the test statistic x̄ that falls into the critical region (rejection region).

• We would then reject the null hypothesis Ho in favor of the alternate H1, in fact, Ho is really true

• This type of wrong conclusion is called a type I error

5
Type I Error

• Rejecting the null hypothesis Ho when


it is true is defined as a type I error

6
Type II Error

• Now suppose the true mean burning rate is different from 50 centimeters per second, yet the sample

mean x falls in the acceptance region

• In this case we would fail to reject Ho when it is false

• This type of wrong conclusion is called a type II error

7
Type II Error

• Failing to reject the null


hypothesis when it is false is
defined as a type II error

8
Type I and Type II Errors

                  H0 is correct                         H0 is incorrect
H0 is accepted    correct decision                      Type II error (β): incorrect acceptance
H0 is rejected    Type I error (α): incorrect           correct decision
                  rejection

9
Type I error

• In the propellant burning rate example, a type I error will occur when either x̄ > 51.5 or x̄ < 48.5 when the true mean burning rate is µ = 50 centimeters per second

• Suppose the standard deviation of burning rate is σ = 2.5 centimeters per second and n = 10

• Probability distribution: µ = 50, standard error = 2.5/√10 = 0.79

• The Type I error is

α = P(x̄ < 48.5 when µ = 50) + P(x̄ > 51.5 when µ = 50)

10
[Figure: sampling distribution of x̄ under µ = 50 (standard error 0.79 = 2.5/√10, which is where that number comes from) with both tails shaded; we will reject the null hypothesis (µ = 50) if our sample mean falls in either of these two regions]

11
12
Type I error

• Type I error = 0.057434

• This implies that 5.7 % of all random samples would lead to rejection of the hypothesis Ho: µ=50

centimeters per second.

• We can reduce the type I error by widening the acceptance region. If we make the critical values 48 and 52, the value of alpha is 0.0114 (adding 0.0057 and 0.0057).

• Keeping the critical values at 48.5 and 51.5, increasing the sample size to 16 gives alpha = 0.0164.

13
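A sketch of the α computation with SciPy (norm.sf is the upper-tail probability):

    import numpy as np
    from scipy.stats import norm

    mu, sigma, n = 50, 2.5, 10
    se = sigma / np.sqrt(n)                             # ≈ 0.79

    alpha = norm.cdf(48.5, mu, se) + norm.sf(51.5, mu, se)
    print(alpha)                                        # ≈ 0.0574

    # Widening the acceptance region to (48, 52) shrinks alpha:
    print(norm.cdf(48, mu, se) + norm.sf(52, mu, se))   # ≈ 0.0114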
TYPE II ERROR

14
The pink area is
the probability
of a Type II error
if the actual mean
is 52.

15
Type II Error

• A Type II error will be committed if the sample mean x̄ falls between 48.5 and 51.5 (the critical region boundaries) when µ = 52:

β = P(48.5 ≤ x̄ ≤ 51.5 when µ = 52) = 0.2643

• When µ = 50.5:  β = 0.8923

16
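The same β values, sketched in Python (small differences from the slide come from table rounding):

    import numpy as np
    from scipy.stats import norm

    sigma, n = 2.5, 10
    se = sigma / np.sqrt(n)

    def beta(true_mu, lo=48.5, hi=51.5):
        # P(accepting H0) = P(lo <= x-bar <= hi) when the true mean is true_mu
        return norm.cdf(hi, true_mu, se) - norm.cdf(lo, true_mu, se)

    print(beta(52))     # ≈ 0.264
    print(beta(50.5))   # ≈ 0.891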
17
18
Computing the
probability of a type II
error may be the most
difficult concept

19
For constant n, increasing the acceptance region (hence decreasing α) increases β.

Increasing n can decrease both types of errors.

20
Type I & II Errors Have an Inverse Relationship

If you reduce the probability of one error, the other


one increases so that everything else is unchanged.

21
Factors Affecting Type II Error

• True value of population parameter
– β increases when the difference between the hypothesized parameter and its true value decreases
• Significance level
– β increases when α decreases
• Population standard deviation
– β increases when σ increases
• Sample size
– β increases when n decreases
22
How to Choose between Type I and Type II Errors

• Choice depends on the cost of the errors

• Choose smaller Type I Error when the cost of rejecting the maintained hypothesis is high

– A criminal trial: convicting an innocent person

• Choose larger Type I Error when you have an interest in changing the status quo

23
Calculating the probability of Type II Error

Ho: µ = 8.3
H1: µ < 8.3

Determine the probability of Type II error if µ = 7.4 at 5% significance level. σ = 3.1 and n = 60.

24
Solution:

The critical value is x̄_c = 8.3 − 1.645(3.1/√60) = 7.6417. An error will be made when x̄ ≥ 7.6417 (i.e., when Z ≥ −1.645 under H0), for then we fail to reject Ho.

β = P(x̄ ≥ 7.6417 when µ = 7.4) = 0.2729
25
Solving for Type II Errors:
Example

Ho: µ ≥ 12
Ha: µ < 12
α = .05, z_c = −1.645

x̄_c = µ − 1.645 σ/√n = 12 − (1.645)(0.10/√60) = 11.979

Rejection region: if x̄ < 11.979, reject Ho.
Non-rejection region: if x̄ ≥ 11.979, do not reject Ho.

26
Type II Error for Example with µ = 11.99 Kg

[Figure: when Ho is true (µ = 12), the rejection tail has α = .05 and 95% of decisions are correct; when Ho is false (µ = 11.99), β = .8023 (Type II error) and only 19.77% of decisions are correct]
27
28
Type II Error for Demonstration with µ = 11.96 Kg

[Figure: when Ho is true (µ = 12), α = .05; when Ho is false (µ = 11.96), β = .0708 (Type II error) and 92.92% of decisions are correct — a larger gap between the true and hypothesized means shrinks β]
29
30
Hypothesis Testing and Decision Making

• We have illustrated hypothesis testing applications referred to as significance tests

• In the tests, we compared the p-value to a controlled probability of a Type I error, α, which is called the level of significance for the test

• With a significance test, we control the probability of making the Type I error, but
not the Type II error
• We recommended the conclusion “do not reject H0” rather than “accept H0”
because the latter puts us at risk of making a Type II error

31
Hypothesis Testing and Decision Making

• With the conclusion “do not reject H0”, the statistical evidence is considered inconclusive

• Usually this is an indication to postpone a decision until further research and testing is
undertaken
• In many decision-making situations the decision maker may want, and in some cases may be
forced, to take action with both the conclusion “do not reject H0 “and the conclusion “reject
H0.”

• In such situations, it is recommended that the hypothesis-testing procedure be extended to


include consideration of making a Type II error

32
Power of a test

• The mean response time for a random sample of 40 food orders is 13.25 minutes
• The population standard deviation is believed to be 3.2
minutes.
• The restaurant owner wants to perform a hypothesis test,
with  =0.05 level of significance, to determine whether the
service goal of 12 minutes or less is being achieved.

33
Calculating the Probability of a Type II Error

Hypotheses are: H0: µ ≤ 12 and Ha: µ > 12

Rejection rule is: Reject H0 if z > 1.645

Value of the sample mean that identifies the rejection region:
x̄ = 12 + 1.645(3.2/√40) = 12.8323

We will accept H0 when x̄ < 12.8323

34
Calculating the Probability of a Type II Error

Probabilities that the sample mean will be in the acceptance region:

Value of µ     z        β       1 − β (power)
14.0         -2.31    .0104    .9896
13.6         -1.52    .0643    .9357
13.2         -0.73    .2327    .7673
12.8323       0.00    .5000    .5000
12.8          0.06    .5239    .4761
12.4          0.85    .8023    .1977
12.0001       1.645   .9500    .0500

35
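A sketch that reproduces this table and the power curve values:

    import numpy as np
    from scipy.stats import norm

    sigma, n, x_crit = 3.2, 40, 12.8323
    se = sigma / np.sqrt(n)

    for mu in [14.0, 13.6, 13.2, 12.8323, 12.8, 12.4, 12.0001]:
        beta = norm.cdf(x_crit, mu, se)   # P(accepting H0) when the true mean is mu
        print(mu, round(beta, 4), round(1 - beta, 4))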
36
Power of the Test

• The probability of correctly rejecting H0 when it is false is called the power of the test.

• For any particular value of µ, the power is 1 − β.

• We can show graphically the power associated with each value of µ; such a graph is called a power curve.

37
Power Curve

[Figure: power curve — probability of correctly rejecting H0 (vertical axis, 0.00 to 1.00) plotted against µ (horizontal axis, 11.5 to 14.5) over the region where H0 is false]
38
Thank You

39
Hypothesis Testing
Class Objectives

• Developing Null and Alternative Hypotheses

• Type I and Type II Errors- Explanation

• Population Mean: Sigma Known

• Population Mean: Sigma Unknown

• Population Proportion
Hypothesis Testing

• Hypothesis testing can be used to determine whether a statement about


the value of a population parameter should or should not be rejected.

• The null hypothesis, denoted by H0 , is a tentative assumption about a


population parameter

• The alternative hypothesis, denoted by Ha, is the opposite of what is stated


in the null hypothesis

• The hypothesis testing procedure uses data from a sample to test the two
competing statements indicated by H0 and Ha.
Developing Null and Alternative Hypotheses

• It is not always obvious how the null and alternative hypotheses should be
formulated

• Care must be taken to structure the hypotheses appropriately so that the test
conclusion provides the information the researcher wants

• The context of the situation is very important in determining how the hypotheses
should be stated

• In some cases it is easier to identify the alternative hypothesis first. In other


cases the null is easier

• Correct hypothesis formulation will take practice


Developing Null and Alternative Hypotheses

Alternative Hypothesis as a Research Hypothesis


•Many applications of hypothesis testing involve an attempt to gather evidence in support of a research
hypothesis

• In such cases, it is often best to begin with the alternative hypothesis and make it the conclusion that
the researcher hopes to support

• The conclusion that the research hypothesis is true is made if the sample data provide sufficient
evidence to show that the null hypothesis can be rejected
Developing Null and Alternative Hypotheses

Alternative Hypothesis as a Research Hypothesis

• Example: A new manufacturing method is believed to be better than the current method.

• Alternative Hypothesis:

– The new manufacturing method is better.

• Null Hypothesis:

– The new method is no better than the old method.


Developing Null and Alternative Hypotheses

• Alternative Hypothesis as a Research Hypothesis

• Example: A new bonus plan is developed in an attempt to increase sales

• Alternative Hypothesis:

– The new bonus plan increases sales

• Null Hypothesis:

– The new bonus plan does not increase sales


Developing Null and Alternative Hypotheses

• Alternative Hypothesis as a Research Hypothesis

• Example:

– A new drug is developed with the goal of lowering Cholesterol-level more


than the existing drug

• Alternative Hypothesis:

– The new drug lowers Cholesterol-level more than the existing drug

• Null Hypothesis:

– The new drug does not lower Cholesterol-level more than the existing
drug
Developing Null and Alternative Hypotheses

• Null Hypothesis as an assumption to be challenged

• We might begin with a belief or assumption that a statement about the value of a population
parameter is true

• We then using a hypothesis test to challenge the assumption and determine if there is statistical
evidence to conclude that the assumption is incorrect

• In these situations, it is helpful to develop the null hypothesis first


Developing Null and Alternative Hypotheses

• Null Hypothesis as an Assumption to be Challenged

• Example:

– The label on a milk bottle states that it contains 1000 ml

• Null Hypothesis:

– The label is correct. µ ≥ 1000 ml

• Alternative Hypothesis:

– The label is incorrect. µ < 1000 ml


Null and Alternative Hypotheses about a Population Mean 

• The equality part of the hypotheses always appears in the null hypothesis

• In general, a hypothesis test about the value of a population mean  must take one of the following
three forms (where 0 is the hypothesized value of the population mean)

One-tailed (lower-tail): H0: µ ≥ µ0, Ha: µ < µ0
One-tailed (upper-tail): H0: µ ≤ µ0, Ha: µ > µ0
Two-tailed: H0: µ = µ0, Ha: µ ≠ µ0
Null and Alternative Hypotheses
• A major hospital in Chennai provides
one of the most comprehensive
emergency medical services in the
world
• Operating in a multiple hospital
system with approximately 10 mobile
medical units, the service goal is to
respond to medical emergencies with
a mean time of 8 minutes or less
• The director of medical services
wants to formulate a hypothesis test
that could use a sample of
emergency response times to
determine whether or not the
service goal of 8 minutes or less is
being achieved.
Null and Alternative Hypotheses

H0: µ ≤ 8 — The emergency service is meeting the response goal; no follow-up action is necessary.

Ha: µ > 8 — The emergency service is not meeting the response goal; appropriate follow-up action is necessary.

where: µ = mean response time for the population of medical emergency requests
Type I Error

• Because hypothesis tests are based on sample data, we must allow for the
possibility of errors

• A Type I error is rejecting H0 when it is true

• The probability of making a Type I error when the null hypothesis is true is called the level of significance

• Applications of hypothesis testing that only control the Type I error are
often called significance tests
Type II Error

• A Type II error is accepting H0 when it is false.

• It is difficult to control for the probability of making a Type II error.

• Statisticians avoid the risk of making a Type II error by using “do not reject H0” and not “accept H0”.
Type I and Type II Errors

Population Condition

Conclusion                    H0 True (µ ≤ 8)     H0 False (µ > 8)
Accept H0 (Conclude µ ≤ 8)    Correct Decision    Type II Error
Reject H0 (Conclude µ > 8)    Type I Error        Correct Decision
Three Approaches for Hypothesis Testing

• P- Value

• Critical Value

• Confidence Interval Value


p-Value Approach to One-Tailed Hypothesis Testing

• The p-value is the probability, computed using the test statistic, that measures the support (or lack of

support) provided by the sample for the null hypothesis

• If the p-value is less than or equal to the level of significance  , the value of the test statistic is in the

rejection region

• Reject H0 if the p-value < 


Lower-Tailed Test About a Population Mean: s Known

p-Value Approach: p-value < α, so reject H0.

[Figure: sampling distribution of z for a lower-tailed test with α = .10; the p-value is the area to the left of z = −1.46, which is smaller than the area α to the left of −zα = −1.28.]
Upper-Tailed Test About a Population Mean :s Known

p-Value Approach: p-value < α, so reject H0.

[Figure: sampling distribution of z for an upper-tailed test with α = .04; the p-value is the area to the right of z = 2.29, which is smaller than the area α to the right of zα = 1.75.]
Critical Value Approach to One-Tailed Hypothesis Testing
• The test statistic z has a standard
normal probability distribution.
• We can use the standard normal
probability distribution table to
find the z-value with an area of 
in the lower (or upper) tail of the
distribution.
• The value of the test statistic that
established the boundary of the
rejection region is called the
critical value for the test.
• The rejection rule is:
Lower tail: Reject H0 if z < -z
Upper tail: Reject H0 if z > z
Lower-Tailed Test About a Population Mean: s Known

Critical Value Approach

[Figure: sampling distribution of z; reject H0 where z < −zα = −1.28 (lower-tail area α), otherwise do not reject H0.]
Upper-Tailed Test About a Population Mean: s Known
Critical Value Approach

[Figure: sampling distribution of z; reject H0 where z > zα = 1.645 (upper-tail area α), otherwise do not reject H0.]
Steps of Hypothesis Testing – P value approach

• Step 1. Develop the null and alternative hypotheses.

• Step 2. Specify the level of significance α.

• Step 3. Collect the sample data and compute the test statistic.

• p-Value Approach

• Step 4. Use the value of the test statistic to compute the p-value.

• Step 5. Reject H0 if p-value < α.


Steps of Hypothesis Testing

Critical Value Approach

• Step 4. Use the level of significance α to determine the critical value and the rejection rule.

• Step 5. Use the value of the test statistic and the rejection rule to determine whether to reject H0.


Hypothesis Testing

1
Class Objectives

• Population Mean: Sigma Known –Example

2
One-Tailed Tests About a Population Mean: s Known

• Example: The mean response time for a random sample of 30 pizza deliveries is 32 minutes.
• The population standard deviation is believed to be 10 minutes.
• The pizza delivery service's director wants to perform a hypothesis test, with α = 0.05 level of significance, to determine whether the service goal of 30 minutes or less is being achieved.

3
Given Values

• Sample • Population
• Sample mean = 32 Min • a =0.05
• Sample size = 30 • Population mean = 30 Min

4
p -Value Approach

5
One-Tailed Tests About a Population Mean:
s Known
1. Develop the hypotheses: H0: µ ≤ 30, Ha: µ > 30
2. Specify the level of significance: α = .05
3. Compute the value of the test statistic:

z = (x̄ − µ0) / (σ/√n) = (32 − 30) / (10/√30) = 1.09

6
7
One-Tailed Tests About a Population Mean: s Known

p –Value Approach
4. Compute the p –value.

For z = 1.09, p-value = P(Z ≥ 1.09) = 1 − .8621 = 0.137

5. Determine whether to reject H0.

• Because p-value = 0.137 > α = .05, we do not reject H0.

• There is not sufficient statistical evidence to infer that the pizza delivery service is not meeting the response goal of 30 minutes.
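
A minimal Python sketch of this test (scipy's normal CDF gives the exact tail area; z differs from the slide only in rounding):

from scipy import stats
import math

x_bar, mu0, sigma, n = 32, 30, 10, 30
z = (x_bar - mu0) / (sigma / math.sqrt(n))
p_value = 1 - stats.norm.cdf(z)          # upper-tail area
print(round(z, 2), round(p_value, 3))    # z ≈ 1.10, p ≈ 0.137 -> do not reject H0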

8
One-Tailed Tests About a Population Mean: s Known
p –Value Approach

[Figure: sampling distribution of z for the upper-tailed test; the p-value 0.137 is the area to the right of z = 1.09, which exceeds α = .05, the area to the right of zα = 1.645.]

9
Critical Value Approach

10
One-Tailed Tests About a Population Mean: s Known

Critical Value Approach


4. Determine the critical value and rejection rule.

– For α = .05, z.05 = 1.645

– Reject H0 if z > 1.645

5. Determine whether to reject H0.

– Because z = 1.09 < 1.645, we do not reject H0.

11
p-Value Approach to Two-Tailed Hypothesis Testing

12
Compute the p-value using the following three steps:

1. Compute the value of the test statistic z.

2. If z is in the upper tail (z > 0), find the area under the standard normal curve to the right of z; if z is in the lower tail (z < 0), find the area to the left of z.

3. Double the tail area obtained in step 2 to obtain the p-value.

The rejection rule:

Reject H0 if the p-value < α.

13
Critical Value Approach to Two-Tailed Hypothesis Testing

• The critical values will occur in both the lower and upper tails of the standard normal curve.

• Use the standard normal probability distribution table to find za/2 (the z-value with an area of a/2
in the upper tail of the distribution).

• The rejection rule is:

Reject H0 if z < -za/2 or z > za/2.

14
Two-Tailed Tests About a Population Mean:
s Known

• Example: Milk Carton


• Assume that a sample of 30 milk cartons provides a sample mean of 505 ml.
• The population standard deviation is believed to be 10 ml.
• Perform a hypothesis test, at the 0.03 level of significance, of whether the population mean is 500 ml, to help determine whether the filling process should continue operating or be stopped and corrected.

15
Given Values

• Sample • Population
• Sample size = 30 • Population mean = 500 ml
• Sample mean = 505 ml • Standard deviation = 10 ml
• Significance level 0.03

16
p –Value approach

17
Two-Tailed Tests About a Population Mean:
s Known
1. Determine the hypotheses: H0: µ = 500, Ha: µ ≠ 500
2. Specify the level of significance: α = .03
3. Compute the value of the test statistic:

z = (x̄ − µ0) / (σ/√n) = (505 − 500) / (10/√30) = 2.74

18
19
Two-Tailed Tests About a Population Mean:
s Known
p –Value Approach
4. Compute the p –value.
– For z = 2.74, p-value = 2(1 − .9969) = .0062

5. Determine whether to reject H0.


– Because p–value = .0062 < a = .03, we reject H0.

There is sufficient statistical evidence to infer that the null hypothesis is false (i.e., the mean filling quantity is not 500 ml).

20
Two-Tailed Tests About a Population Mean: s Known

p-Value Approach

[Figure: two-tailed test; each tail area beyond z = ±2.74 contributes ½ p-value = .0031 (total p-value = .0062), compared with α/2 = .015 in each tail beyond ±zα/2 = ±2.17.]

21
Critical Value Approach

22
Two-Tailed Tests About a Population Mean :s Known

• Critical Value Approach


4. Determine the critical value and rejection rule: for α/2 = .03/2 = .015, z.015 = 2.17

Reject H0 if z < -2.17 or z > 2.17

5. Determine whether to reject H0.

Because 2.74 > 2.17, we reject H0.

There is sufficient statistical evidence to infer that the null hypothesis is not true

23
24
Two-Tailed Tests About a Population Mean :s Known

Critical Value Approach


[Figure: sampling distribution of z = (x̄ − µ0)/(σ/√n) = (505 − 500)/(10/√30) = 2.74; reject H0 if z < −2.17 or z > 2.17 (α/2 = .015 in each tail), otherwise do not reject H0.]

25
Confidence Interval Approach

26
Confidence Interval Approach to
Two-Tailed Tests About a Population Mean

• Select a simple random sample from the population and use the value of the sample mean to
develop the confidence interval for the population mean .

• If the confidence interval contains the hypothesized value 500, do not reject H0.

• Otherwise, reject H0.

• Actually, H0 should be rejected if 0 happens to be equal to one of the end points of the confidence
interval.

27
Confidence Interval Approach to Two-Tailed Tests About a Population Mean
The 97% confidence interval for µ is

x̄ ± zα/2(σ/√n) = 505 ± 2.17(10/√30) = 505 ± 3.9619, i.e. (501.0381, 508.9619)

Because the hypothesized value for the population mean, µ0 = 500 ml, is not in this interval, the hypothesis-testing conclusion is that the null hypothesis, H0: µ = 500, is rejected.
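
A minimal sketch of this interval in Python:

from scipy import stats
import math

x_bar, mu0, sigma, n, alpha = 505, 500, 10, 30, 0.03
margin = stats.norm.ppf(1 - alpha / 2) * sigma / math.sqrt(n)
ci = (x_bar - margin, x_bar + margin)
print(ci)                            # ≈ (501.04, 508.96)
print(mu0 < ci[0] or mu0 > ci[1])    # True -> 500 lies outside, so reject H0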

28
Thanks

29
Hypothesis Testing: Two sample test

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT
IIT ROORKEE

1
Hypothesis Testing about the Difference in Two
Sample Means
[Figure: independent samples of sizes n1 and n2 are drawn from population 1 (mean µ1) and population 2 (mean µ2); the sample means x̄1 and x̄2 are computed, and inference is based on the difference x̄1 − x̄2.]

2
Two Sample Tests
Two Sample Tests

Population Population
Means, Means, Population Population
Independent Dependent Proportions Variances
Samples Samples
Examples:
Group 1 vs. Same group before Proportion 1 vs. Variance 1 vs.
independent vs. after treatment Proportion 2 Variance 2
Group 2

3
Difference Between Two Means

Population means, independent samples:

σ12 and σ22 known — the test statistic is a z value

σ12 and σ22 unknown (assumed equal or assumed unequal) — the test statistic is a value from the Student's t distribution
4
σ12 and σ22 Known

Population means, Assumptions:


independent samples
 Samples are randomly and
independently drawn

σ12 and σ22 known  both population distributions


are normal
σ12 and σ22 unknown
 Population variances are
known

5
σ12 and σ22 Known

When σ1² and σ2² are known and both populations are normal, the variance of X̄1 − X̄2 is

σ²(X̄1 − X̄2) = σ1²/n1 + σ2²/n2

…and the random variable

Z = [ (x̄1 − x̄2) − (µ1 − µ2) ] / √( σ1²/n1 + σ2²/n2 )

has a standard normal distribution.
6
Test Statistic, σ12 and σ22 Known

Population means, independent samples, σ1² and σ2² known.

H0: µ1 − µ2 = D0

The test statistic for µ1 − µ2 is:

z = [ (x̄1 − x̄2) − D0 ] / √( σ1²/n1 + σ2²/n2 )

7
Hypothesis Tests for Two Population Means

Two Population Means, Independent Samples

Lower-tail test: Upper-tail test: Two-tail test:


H0: μ1  μ2 H0: μ1 ≤ μ2 H0: μ1 = μ2
H1: μ1 < μ2 H1: μ1 > μ2 H1: μ1 ≠ μ2
i.e., i.e., i.e.,
H0: μ1 – μ2  0 H0: μ1 – μ2 ≤ 0 H0: μ1 – μ2 = 0
H1: μ1 – μ2 < 0 H1: μ1 – μ2 > 0 H1: μ1 – μ2 ≠ 0

8
Decision Rules

Lower-tail: Reject H0 if z < −zα
Upper-tail: Reject H0 if z > zα
Two-tail: Reject H0 if z < −zα/2 or z > zα/2

9
Hypothesis Testing about the Difference in Two
Sample Means

[Figure: the difference X̄1 − X̄2 has mean µ1 − µ2 and standard error σ(X̄1 − X̄2) = √( σ1²/n1 + σ2²/n2 ).]

10
Sampling Distribution of x1  x2

• Expected Value: E(x̄1 − x̄2) = µ1 − µ2

• Standard Deviation (Standard Error): σ(x̄1 − x̄2) = √( σ1²/n1 + σ2²/n2 )

where: σ1 = standard deviation of population 1
σ2 = standard deviation of population 2
n1 = sample size from population 1
n2 = sample size from population 2

11
Interval Estimation of 1 - 2:  1 and  2 Known
• Interval Estimate: x̄1 − x̄2 ± zα/2 √( σ1²/n1 + σ2²/n2 )

where: 1 − α is the confidence coefficient

12
Problem ( 1 and  2 Known)
• A product developer is interested in reducing the drying time of a primer paint.
• Two formulations of the paint are tested; formulation 1 is the standard chemistry, and
formulation 2 has a new drying ingredient that should reduce the drying time.
• From experience, it is known that the standard deviation of drying time is 8 minutes, and this
inherent variability should be unaffected by the addition of the new ingredient.
• Ten specimens are painted with formulation 1, and another 10 specimens are painted with
formulation 2; the 20 specimens are painted in random order.
• The two-sample average drying times are 𝑥1 = 121 minutes and 𝑥2 = 112 minutes,
respectively.
• What conclusions can the product developer draw about the effectiveness of the new
ingredient, using alpha = 0.05?
Source: Applied Probability and statistics for Engineers by Douglas C. Montgomery and George C. Runger John Wiley, 3rd Ed. 2003

13
Problem ( 1 and  2 Known)

14
Problem ( 1 and  2 Known)

15
Problem ( 1 and  2 Known)
z = [ (121 − 112) − 0 ] / √( 8²(1/10 + 1/10) ) = 2.52

With α = .05 the critical value is 1.645, so z = 2.52 falls in the rejection region.

Decision: Reject H0 at α = 0.05
Conclusion: There is evidence of a difference in means.

16
Problem ( 1 and  2 Known)

17
Problem ( 1 and  2 Known)

18
σ12 and σ22 Unknown, Assumed Equal

Population means, independent samples; σ1² and σ2² unknown but assumed equal.

Assumptions:
• Samples are randomly and independently drawn
• Populations are normally distributed
• Population variances are unknown but assumed equal

19
σ12 and σ22 Unknown, Assumed Equal

• The population variances are assumed equal, so use the two sample
standard deviations and pool them to estimate σ

• use a t value with (n1 + n2 – 2) degrees of freedom

20
Test Statistic, σ12 and σ22 Unknown, Equal
The test statistic for
μ1 – μ2 is:

t = [ (x̄1 − x̄2) − (μ1 − μ2) ] / √( sp²/n1 + sp²/n2 )

where t has (n1 + n2 − 2) d.f., and

sp² = [ (n1 − 1)s1² + (n2 − 1)s2² ] / (n1 + n2 − 2)

21
Decision Rules


22
Decision Rules

23
σ12 and σ22 Unknown, Assumed equal
• Two catalysts are being analyzed to determine how they affect the mean yield of a chemical process.
• Specifically, catalyst 1 is currently in use, but catalyst 2 is acceptable.
• Since catalyst 2 is cheaper, it should be adopted, provided it does not change the process yield.
• A test is run in the pilot plant and results in the data shown in the table.
• Is there any difference between the mean yields? Use α = 0.05, and assume equal variances.

Observation Number   Catalyst 1   Catalyst 2
1                    91.50        89.19
2                    94.18        90.95
3                    92.18        90.46
4                    95.39        93.21
5                    91.79        97.19
6                    89.07        97.04
7                    94.72        91.07
8                    89.21        92.75
x̄1 = 92.255          x̄2 = 92.733
s1 = 2.39            s2 = 2.98
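
A sketch of this pooled-variance test with scipy (ttest_ind with equal_var=True pools the two sample variances):

from scipy import stats

catalyst1 = [91.50, 94.18, 92.18, 95.39, 91.79, 89.07, 94.72, 89.21]
catalyst2 = [89.19, 90.95, 90.46, 93.21, 97.19, 97.04, 91.07, 92.75]

t_stat, p_value = stats.ttest_ind(catalyst1, catalyst2, equal_var=True)
print(round(t_stat, 2), round(p_value, 3))   # t ≈ -0.35, p ≈ 0.73 -> do not reject H0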
24
σ12 and σ22 Unknown, Assumed equal

25
σ12 and σ22 Unknown, Assumed equal

26
σ12 and σ22 Unknown, Assumed equal

27
σ12 and σ22 Unknown, Assumed equal

28
Thank You

29
Hypothesis Testing-III

1
Tests About a Population Mean:s Unknown
• Test Statistic

t = (x̄ − µ0) / (s/√n)

This test statistic has a t distribution with n − 1 degrees of freedom.

2
Tests About a Population Mean:s Unknown

Rejection Rule: p-Value Approach

Reject H0 if p-value < α

Rejection Rule: Critical Value Approach
H0: µ ≥ µ0 — Reject H0 if t < −tα
H0: µ ≤ µ0 — Reject H0 if t > tα
H0: µ = µ0 — Reject H0 if t < −tα/2 or t > tα/2

3
4
One-Tailed Test About a Population Mean: s Unknown
Example: Ice Cream Demand
• In an ice cream parlor at IIT Roorkee, the following data represent the number of ice-creams sold over 20 days
• Test the hypothesis H0: µ ≥ 10 against Ha: µ < 10
• Use α = .05 to test the hypothesis.

Day   No. of Ice-creams Sold   Day   No. of Ice-creams Sold
1     13                       11    12
2     8                        12    11
3     10                       13    11
4     10                       14    12
5     8                        15    10
6     9                        16    12
7     10                       17    7
8     11                       18    10
9     6                        19    11
10    8                        20    8
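
A sketch of this test with scipy, reading the hypotheses as H0: µ ≥ 10 vs Ha: µ < 10 (lower tail), per the deck's convention that the equality part sits in H0:

from scipy import stats

sold = [13, 8, 10, 10, 8, 9, 10, 11, 6, 8,
        12, 11, 11, 12, 10, 12, 7, 10, 11, 8]

t_stat, p_value = stats.ttest_1samp(sold, popmean=10, alternative='less')
print(round(t_stat, 2), round(p_value, 3))
# t ≈ -0.36, p ≈ 0.36 -> do not reject H0 at alpha = .05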

5
Given Data

6
7
One-Tailed Test About a Population Mean:
s Unknown

[Figure: t distribution for the lower-tailed test; reject H0 if t falls in the lower-tail rejection region, otherwise do not reject H0.]

8
Hypothesis Testing – proportion

9
Null and Alternative Hypotheses: Population Proportion

• The equality part of the hypotheses always appears in the null hypothesis.

• In general, a hypothesis test about the value of a population proportion p must take one of the
following three forms (where p0 is the hypothesized value of the population proportion).

One-tailed (lower tail): H0: p ≥ p0, Ha: p < p0
One-tailed (upper tail): H0: p ≤ p0, Ha: p > p0
Two-tailed: H0: p = p0, Ha: p ≠ p0

10
Tests About a Population Proportion
Test Statistic

z = (p̄ − p0) / σp̄,  where σp̄ = √( p0(1 − p0)/n )

assuming np > 5 and n(1 − p) > 5

11
Tests About a Population Proportion
Rejection Rule: p-Value Approach
Reject H0 if p-value < α

Rejection Rule: Critical Value Approach
H0: p ≤ p0 — Reject H0 if z > zα
H0: p ≥ p0 — Reject H0 if z < −zα
H0: p = p0 — Reject H0 if z < −zα/2 or z > zα/2

12
Two-Tailed Test About a Population Proportion
Example: City Traffic Police

For a New Year’s week, the City


Traffic Police claimed that 50% of the
accidents would be caused by drunk
driving.

A sample of 120 accidents showed


that 67 were caused by drunk driving.
Use these data to test the Traffic
Police’s claim with  = .05.

13
p –Value Approach

14
Two-Tailed Test About a Population Proportion

H 0 : p  .5
1. Determine the hypotheses.
H a : p  .5

2. Specify the level of significance.  = .05

3. Compute the value of the test statistic.

p0 (1  p0 ) .5(1  .5)
sp    .045644
n 120
p  p0 (67 /120)  .5
z   1.28
sp .045644
15
Two-Tailed Test About a Population Proportion

4. Compute the p -value.

For z = 1.28, cumulative probability = .8997 p–value = 2(1 - .8997) = .2006

5. Determine whether to reject H0.

Because p-value = .2006 > α = .05, we cannot reject H0.
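
A minimal Python sketch of this two-tailed proportion test (the exact p-value differs slightly from the slide's table lookup of .2006):

from scipy import stats
import math

p0, n, x = 0.5, 120, 67
p_hat = x / n
se = math.sqrt(p0 * (1 - p0) / n)        # standard error under H0
z = (p_hat - p0) / se
p_value = 2 * (1 - stats.norm.cdf(abs(z)))
print(round(z, 2), round(p_value, 3))    # z ≈ 1.28, p ≈ 0.20 -> cannot reject H0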

16
17
Critical Value Approach

18
Two-Tailed Test About a Population Proportion

4. Determine the critical value and rejection rule.

For /2 = .05/2 = .025, z.025 = 1.96

Reject H0 if z < -1.96 or z > 1.96

5. Determine whether to reject H0.

Because −1.96 < z = 1.28 < 1.96, we cannot reject H0.

19
ANOVA

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Sample Size Calculation


• One Way ANOVA – Introduction

2
Determining Sample Size when Estimating 
X 
• Z formula Z

n

• Error of Estimation (tolerable error) E  X 


• Estimated Sample Size Z    Z  
2 2 2

n 2
 
2

 E 
2
E
1
• Estimated  
4
range

3
Example: Sample Size when Estimating 

E  1,   4
90% confidence  Z  1.645

Z 
2 2

n 2
2
E
2 2
(1645
. ) (4)
 2
1
 43.30 or 44

4
Example
E  2, range  25
95% confidence  Z  196
.
1  1
estimated  : range     25  6.25
4  4

Z
2 2

n 2
E
2 2
(196
. ) (6.25)
 2
2
 37.52 or 38
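
A small Python sketch covering both examples above (the sample size is always rounded up):

import math
from scipy import stats

def n_for_mean(E, sigma, conf):
    z = stats.norm.ppf(1 - (1 - conf) / 2)     # two-sided z for the confidence level
    return math.ceil(z**2 * sigma**2 / E**2)

print(n_for_mean(E=1, sigma=4, conf=0.90))       # 44
print(n_for_mean(E=2, sigma=25 / 4, conf=0.95))  # 38 (sigma estimated as range/4)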
5
Determining Sample Size when Estimating P

• Z formula pP
Z
PQ
n
• Error of Estimation (tolerable error) E  pP
2

n Z PQ
• Estimated Sample Size E
2

6
Example
E  0.03
98% Confidence  Z  2.33
estimated P  0.40
Q  1  P  0.60

n
Z PQ 2
E
(2.33)  0.40 0.60
2


.003 2

 1,447.7 or 1,448
7
Determining Sample Size when Estimating P
with No Prior Information
P     PQ
0.5   0.25
0.4   0.24
0.3   0.21
0.2   0.16
0.1   0.09

[Chart: required n versus P for Z = 1.96, E = 0.05; n is largest at P = 0.5.]

With no prior estimate of P, use P = 0.5:  n = Z²(1/4) / E²
8
Example
E  0.05
90% Confidence  Z  1645
.
with no prior estimate of P, use P  0.50
Q  1  P  0.50
2

n
Z PQ
2
E
. )  0.50 0.50
2
(1645

.05 2

 270.6 or 271
9
Why ANOVA?
• We could compare the means, one by one using t-tests for difference of
means.
• Problem: each test carries its own Type I error.
• Across m pairwise comparisons, the total Type I error is 1 − (1 − α)^m.
• For example, if there are 5 means and you use α = .05, you must make m = 10 two-by-two comparisons.
• Thus, the total Type I error is 1 − (.95)^10, which is .4012.
• That is, 40% of the time you will reject the null hypothesis of equal
means in favor of the alternative!

10
Hypothesis Testing With Categorical Data

• Chi Square tests can be viewed as a generalization of Z tests of


proportions
• Analysis of Variance (ANOVA) can be viewed as a generalization of t-
tests: a comparison of differences of means across more than 2
groups.
• Like Chi Square, if there are only two groups, the two analyses will
produce identical results – thus a t-test or ANOVA can be used with 2
groups

11
Production Process inputs and outputs

12
Application of quality-engineering techniques and
the systematic reduction of process variability

13
Effect of Teaching Methodology
Group 1 Group 2 Group 3
Black Board Case Presentation PPT

4 2 2

3 4 1

2 6 3

x̄1 = (4 + 3 + 2)/3 = 3
x̄2 = (2 + 4 + 6)/3 = 4
x̄3 = (2 + 1 + 3)/3 = 2

x̄ = (4 + 3 + 2 + 2 + 4 + 6 + 2 + 1 + 3)/9 = 3

SST = (4−3)² + (3−3)² + (2−3)² + (2−3)² + (4−3)² + (6−3)² + (2−3)² + (1−3)² + (3−3)²
    = 1 + 0 + 1 + 1 + 1 + 9 + 1 + 4 + 0 = 18
SSB = 3(3−3)² + 3(4−3)² + 3(2−3)² = 0 + 3 + 3 = 6
SSE = (4−3)² + (3−3)² + (2−3)² + (2−4)² + (4−4)² + (6−4)² + (2−2)² + (1−2)² + (3−2)²
    = 1 + 0 + 1 + 4 + 0 + 4 + 0 + 1 + 1 = 12

15
ANOVA
Source of Variation SS df MS F P-value F crit
Between Groups 6 2 3 1.5 0.296296 5.143253
Within Groups 12 6 2

Total 18 8
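
A one-line scipy sketch that reproduces this F and p-value:

from scipy import stats

blackboard = [4, 3, 2]
case_pres  = [2, 4, 6]
ppt        = [2, 1, 3]

f_stat, p_value = stats.f_oneway(blackboard, case_pres, ppt)
print(round(f_stat, 2), round(p_value, 3))   # F = 1.5, p ≈ 0.296 -> do not reject H0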

16
Thank You

17
ANOVA

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Effect of Teaching Methodology
Group 1 Group 2 Group 3
Black Board Case Presentation PPT

4 2 2

3 4 1

2 6 3
ANOVA with Python

3
Pandas.melt command

• pd.melt allows you to 'unpivot' data from a 'wide format' into a 'long format', with each row representing a single data point (see the sketch below).
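
A minimal sketch of the melt step on the teaching-methodology table (the column names here are illustrative, not from the original notebook):

import pandas as pd

wide = pd.DataFrame({'BlackBoard': [4, 3, 2],
                     'CasePresentation': [2, 4, 6],
                     'PPT': [2, 1, 3]})

long = pd.melt(wide, var_name='method', value_name='score')
print(long)   # 9 rows: one (method, score) pair per observation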

4
Jupyter code

5
6
Transforming table

7
8
Analysis of Variance: A Conceptual Overview

• Analysis of Variance (ANOVA) can be used to test for the equality of three
or more population means

• Data obtained from observational or experimental studies can be used for


the analysis

• We want to use the sample results to test the following hypotheses:


H0: 1=2=3=. . . = k
Ha: Not all population means are equal

9
Analysis of Variance: A Conceptual Overview

H0: 1=2=3=. . .= k
Ha: Not all population means are equal

• If H0 is rejected, we cannot conclude that all population means are equal


• Rejecting H0 means that at least two population means have different
values

10
Analysis of Variance: A Conceptual Overview

Assumptions for Analysis of Variance

• For each population, the response (dependent) variable is normally


distributed

• The variance of the response variable, denoted σ², is the same for all of the populations

• The observations must be independent

11
Analysis of Variance: A Conceptual Overview
• Sampling Distribution of 𝑥 Given H0 is True

Sample means are close


together because there is only
one sampling distribution
when H0 is true.
2
 
2
x
n

x2  x1 x3
Analysis of Variance: A Conceptual Overview
• Sampling Distribution of 𝑥 Given H0 is False
Sample means come from different sampling distributions and are not as close together when H0 is false.

[Figure: three separate sampling distributions centered at µ1, µ2, µ3, with x̄1, x̄2, x̄3 spread apart.]
Analysis of Variance (ANOVA)

One-Way Two-Way
ANOVA ANOVA

F-test Interaction
Effects
Tukey-
Kramer
test
General ANOVA Setting

• Investigator controls one or more factors of interest


– Each factor contains two or more levels
– Levels can be numerical or categorical
– Different levels produce different groups
– Think of the groups as populations
• Observe effects on the dependent variable
– Are the groups the same?
• Experimental design: the plan used to collect the data
Completely Randomized Design

• Experimental units (subjects) are assigned randomly to the


different levels (groups)
– Subjects are assumed homogeneous
• Only one factor or independent variable
– With two or more levels (groups)
• Analyzed by one-factor analysis of variance (one-way ANOVA)
Analysis of Variance and the Completely
Randomized Design
• Between-Treatments Estimate of Population Variance

• Within-Treatments Estimate of Population Variance

• Comparing the Variance Estimates: The F Test

• ANOVA Table

17
Analysis of Variance and the Completely
Randomized Design

H0: 1=2=3=. . .= k
Ha: Not all population means are equal

where
𝑗 = mean of the 𝑗𝑡ℎ population

18
Analysis of Variance and the Completely
Randomized Design

H0: 1=2=3=. . .= k
Ha: Not all population means are equal
• Assume that a simple random sample of size 𝑛𝑗 has been selected from
each of the k populations or treatments. For the resulting sample data, let
𝑥𝑖𝑗 = value of observation ifor treatment j
𝑛𝑗 = number of observations for treatment j
𝑥𝑗 = sample mean for treatment j
𝑠𝑗2 = sample variance for treatment j
𝑠𝑗 = sample standard deviation for treatment j
19
Between-Treatments Estimate of Population Variance σ²
• The estimate of σ² based on the variation of the sample means is called the mean square due to treatments and is denoted by MSTR.

MSTR = SSTR / (k − 1),  where SSTR = Σⱼ nⱼ(x̄ⱼ − x̄)²

The numerator SSTR is called the sum of squares due to treatments; the denominator k − 1 is the degrees of freedom associated with SSTR.

20
Between-Treatments Estimate of Population Variance σ²
• Mean Square due to Treatments (MSTR)

MSTR = Σⱼ nⱼ(x̄ⱼ − x̄)² / (k − 1)
Where:
k = number of groups
nj = sample size from group j
𝑥𝑗 = sample mean from group j
𝑥 = grand mean (mean of all data values)

21
Within-Treatments Estimate of Population Variance σ²
• The estimate of σ² based on the variation of the sample observations within each sample is called the mean square error and is denoted by MSE.

MSE = SSE / (nT − k),  where SSE = Σⱼ (nⱼ − 1)sⱼ²

The numerator SSE is called the sum of squares due to error; the denominator nT − k is the degrees of freedom associated with SSE.
22
Within-Treatments Estimate of Population Variance σ²
• Mean Square Error (MSE)

MSE = Σⱼ (nⱼ − 1)sⱼ² / (nT − k)

Where:
k = number of groups
nⱼ = number of observations for treatment j
sⱼ² = sample variance for treatment j

23
Comparing the Variance Estimates: The F Test

• If the null hypothesis is true and the ANOVA assumptions are valid, the
sampling distribution of MSTR/MSE is an F distribution with MSTR d.f
equal to k - 1 and MSE d.f. equal to nT - k.

• If the means of the k populations are not equal, the value of MSTR/MSE will be inflated because MSTR overestimates σ²

• Hence, we will reject H0 if the resulting value of MSTR/MSE appears to be


too large to have been selected at random from the appropriate F
distribution

24
Comparing the Variance Estimates: The F Test

[Figure: sampling distribution of MSTR/MSE under H0; do not reject H0 for values below the critical F value, reject H0 for values in the upper tail beyond it (area α).]
ANOVA Table for a Completely Randomized Design
Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square            F
Treatments            SSTR             k − 1                MSTR = SSTR/(k − 1)    MSTR/MSE
Error                 SSE              nT − k               MSE = SSE/(nT − k)
Total                 SST              nT − 1

SST is partitioned into SSTR and SSE; SST's degrees of freedom (d.f.) are partitioned into SSTR's d.f. and SSE's d.f.
ANOVA Table for a Completely Randomized Design

• SST divided by its degrees of freedom nT – 1 is the overall sample variance


that would be obtained if we treated the entire set of observations as one
data set.
• With the entire data set as one sample, the formula for computing the
total sum of squares, SST, is:

SST = Σⱼ Σᵢ (xᵢⱼ − x̄)² = SSTR + SSE

27
ANOVA Table for a Completely Randomized Design

• ANOVA can be viewed as the process of partitioning the total sum of


squares and the degrees of freedom into their corresponding sources:
treatments and error

• Dividing the sum of squares by the appropriate degrees of freedom


provides the variance estimates and the F value used to test the
hypothesis of equal population means.

28
Test for the Equality of k Population Means

• Hypotheses
H0: 123...k
Ha: Not all population means are equal

• Test Statistic
F = MSTR / MSE

29
Test for the Equality of k Population Means

p-Value Approach: Reject H0 if p-value < α
Critical Value Approach: Reject H0 if F > Fα

where the value of Fα is based on an F distribution with k − 1 numerator d.f. and nT − k denominator d.f.

30
Thank You

31
Hypothesis Testing: Two sample test

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT
IIT ROORKEE

1
σ12 and σ22 Unknown, Assumed Unequal

Population means, independent samples; σ1² and σ2² unknown and assumed unequal.

Assumptions:
• Samples are randomly and independently drawn
• Populations are normally distributed
• Population variances are unknown and assumed unequal

2
σ12 and σ22 Unknown: Assumed Unequal

Forming interval estimates:
• The population variances are assumed unequal, so a pooled variance is not appropriate
• Use a t value with ν degrees of freedom, where

ν = ( s1²/n1 + s2²/n2 )² / [ (s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1) ]

3
Test Statistic: σ12 and σ22 Unknown, Unequal
The test statistic for µ1 − µ2 (σ1² and σ2² unknown, assumed unequal) is:

t = [ (x̄1 − x̄2) − D0 ] / √( s1²/n1 + s2²/n2 )

where t has ν degrees of freedom:

ν = ( s1²/n1 + s2²/n2 )² / [ (s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1) ]

4
Problem:Test Statistic: σ12 and σ22 Unknown, Unequal
Metro Phoenix Rural Arizona
• Arsenic concentration in public Phoenix, 3 Rimrock, 48
drinking water supplies is a Chandler, 7 Goodyear, 44
potential health risk. Gilbert, 25 New River, 40
• An article in the Arizona Republic Glendale, 10 Apachie Junction, 38
(Sunday, May 27, 2001) reported Mesa, 15 Buckeye, 33
drinking water arsenic Paradise Valley, 6 Nogales, 21
concentrations in parts per billion Peoria, 12 Black Canyon City, 20
(ppb) for 10 metropolitan Phoenix Scottsdale, 25 Sedona, 12
communities and 10 communities Tempe, 15 Payson, 1
in rural Arizona. Sun City, 7 Casa Grande, 18
• The data as shown:
x̄1 = 12.5    x̄2 = 27.5
s1 = 7.63    s2 = 15.3
5
Problem:Test Statistic: σ12 and σ22 Unknown, Unequal

• We wish to determine if there is any difference in mean arsenic concentrations between metropolitan Phoenix communities and communities in rural Arizona.

6
Problem:Test Statistic: σ12 and σ22 Unknown, Unequal

7
Problem:Test Statistic: σ12 and σ22 Unknown, Unequal

ν = ( s1²/n1 + s2²/n2 )² / [ (s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1) ]
  = ( 7.63²/10 + 15.3²/10 )² / [ (7.63²/10)²/9 + (15.3²/10)²/9 ] = 13.2 ≈ 13

8
Problem:Test Statistic: σ12 and σ22 Unknown, Unequal
t = ( 12.5 − 27.5 − 0 ) / √( 7.63²/10 + 15.3²/10 ) = −2.77

Two-tailed rejection regions (α = .05, 13 d.f.): reject H0 if t < −2.160 or t > 2.160.

Decision: Reject H0 at α = 0.05
Conclusion: There is evidence of a difference in means.

9
Problem:Test Statistic: σ12 and σ22 Unknown, Unequal

• Reject the null hypothesis.


• There is evidence to conclude that mean arsenic concentration in the
drinking water in rural Arizona is different from the mean arsenic
concentration in metropolitan Phoenix drinking water.
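
A sketch of this Welch (unequal-variance) test with scipy, using the data listed above:

from scipy import stats

phoenix = [3, 7, 25, 10, 15, 6, 12, 25, 15, 7]
rural   = [48, 44, 40, 38, 33, 21, 20, 12, 1, 18]

t_stat, p_value = stats.ttest_ind(phoenix, rural, equal_var=False)
print(round(t_stat, 2), round(p_value, 4))   # t ≈ -2.77, p ≈ 0.016 -> reject H0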

10
Problem:Test Statistic: σ12 and σ22 Unknown, Unequal

11
Dependent Samples
Tests Means of 2 Related Populations
– Paired or matched samples
– Repeated measures (before/after)
– Use difference between paired values:

di = xi - yi
• Assumptions:
– Both Populations Are Normally Distributed

12
Test Statistic: Dependent Samples

The test statistic for the mean difference is a t value, with


n – 1 degrees of freedom:

t = (d̄ − D0) / (sd / √n),  where d̄ = Σdᵢ / n

D0 = hypothesized mean difference


sd = sample standard dev. of differences
n = the sample size (number of pairs)
13
Decision Rules: Dependent Samples

Lower-tail test: Upper-tail test: Two-tail test:


H0: μ1 – μ2  0 H0: μ1 – μ2 ≤ 0 H0: μ1 – μ2 = 0
H1: μ1 – μ2 < 0 H1: μ1 – μ2 > 0 H1: μ1 – μ2 ≠ 0

14
Decision Rules: Dependent Samples

Lower-tail: Reject H0 if t < −t(n−1, α)
Upper-tail: Reject H0 if t > t(n−1, α)
Two-tail: Reject H0 if t < −t(n−1, α/2) or t > t(n−1, α/2)

where t = (d̄ − D0) / (sd/√n) has n − 1 d.f.

15
Dependent Samples: Example
• An article in the Journal of Strain Analysis (1983, Vol. 18, No. 2) compares
several methods for predicting the shear strength for steel plate girders.
• Data for two of these methods, the Karlsruhe and Lehigh procedures,
when applied to nine specific girders, are shown in Table .
• We wish to determine whether there is any difference (on the average)
between the two methods.

16
Table : Strength Predictions for Nine Steel Plate Girders
(Predicted Load/Observed Load)
Girder Karlsruhe Method Lehigh Method Difference dj
S11 1.186 1.061 0.119
S21 1.151 0.992 0.159
S31 1.322 1.063 0.259
S41 1.339 1.062 0.277
S51 1.200 1.065 0.138
S21 1.402 1.178 0.224
S22 1.365 1.037 0.328
S23 1.537 1.086 0.451
S24 1.559 1.052 0.507

17
Inferences About the Difference Between Two
Population Means: Matched Samples

18
Inferences About the Difference Between Two Population Means:
Matched Samples

19
we conclude that the strength prediction methods yield different results.

20
21
Inferences About the Difference Between
Two Population Proportions

• Inferences About the Difference Between Two Population Proportion


Inferences About the Difference Between
Two Population Proportions
• Interval Estimation of p1 - p2

• Hypothesis Tests About p1 - p2


Sampling Distribution of p1- p2

• Expected Value
E ( p1  p2 )  p1  p2

• Standard Deviation (Standard Error)


p1 (1  p1 ) p2 (1  p2 )
 p1  p2  
n1 n2

where: n1 = size of sample taken from population 1


n2 = size of sample taken from population 2
Sampling Distribution of p1- p2

• If the sample sizes are large, the sampling distribution of


p1- p2 can be approximated by a normal probability distribution.

• The sample sizes are sufficiently large if all of these conditions are met:
n1p1 ≥ 5, n1(1 − p1) ≥ 5, n2p2 ≥ 5, n2(1 − p2) ≥ 5
Sampling Distribution of p1- p2

p1 (1  p1 ) p2 (1  p2 )
 p1  p2  
n1 n2

p1  p2
Interval Estimation of p1 - p2

• Interval Estimate:

p̄1 − p̄2 ± zα/2 √( p̄1(1 − p̄1)/n1 + p̄2(1 − p̄2)/n2 )
Point Estimator of the Difference Between Two Population
Proportions
• p1 = proportion of the population of households "aware" of the product after the new campaign
• p2 = proportion of the population of households "aware" of the product before the new campaign
• p̄1 = sample proportion of households "aware" of the product after the new campaign
• p̄2 = sample proportion of households "aware" of the product before the new campaign

p̄1 − p̄2 = 120/250 − 60/150 = .48 − .40 = .08
Hypothesis Tests about p1 - p2

• Hypothesis

We focus on tests involving no difference between the two population


proportions (i.e. p1 = p2)

H 0 : p1  p2  0 H 0 : p1  p2  0 H 0 : p1  p2  0
H a : p1  p2  0 H a : p1  p2  0 H a : p1  p2  0
Left-tailed Right-tailed Two-tailed
Hypothesis Tests about p1 - p2

• Standard Error of p̄1 − p̄2 when p1 = p2 = p:

σ(p̄1 − p̄2) = √( p̄(1 − p̄)(1/n1 + 1/n2) )

• Pooled Estimator of p when p1 = p2 = p:

p̄ = (n1p̄1 + n2p̄2) / (n1 + n2)
Hypothesis Tests about p1 - p2

• Test Statistic

z = (p̄1 − p̄2) / √( p̄(1 − p̄)(1/n1 + 1/n2) )
Problem: Hypothesis Tests about p1 - p2
• Extracts of St. John’s Wort are widely used to treat depression.
• An article in the April 18, 2001 issue of the Journal of the American Medical
Association (“Effectiveness of St. John’s Worton Major Depression: A
Randomized Controlled Trial”) compared the efficacy of a standard extract
of St. John’s Wort with a placebo in 200 outpatients diagnosed with major
depression.
• Patients were randomly assigned to two groups; one group received the St.
John’s Wort, and the other received the placebo.
• After eight weeks, 19 of the placebo-treated patients showed
improvement, whereas 27 of those treated with St. John’s Wort improved.
• Is there any reason to believe that St. John’s Wort is effective in treating
major depression? Use 0.05.
Problem: Hypothesis Tests about p1 - p2
Problem: Hypothesis Tests about p1 - p2

8. Conclusions: Since z0 = 1.35 does not exceed z0.025 = 1.96, we cannot reject the null hypothesis. The p-value is P ≅ 0.177. There is insufficient evidence to support the claim that St. John's Wort is effective in treating major depression.
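
A sketch of this two-proportion test in Python, assuming the 200 outpatients were split evenly between groups (n1 = n2 = 100), which matches the slide's z0 ≈ 1.35:

from scipy import stats
import math

x1, n1, x2, n2 = 27, 100, 19, 100     # improved on extract vs. placebo
p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)        # pooled estimate under H0: p1 = p2
se = math.sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))
z = (p1 - p2) / se
p_value = 2 * (1 - stats.norm.cdf(abs(z)))
print(round(z, 2), round(p_value, 3))   # z ≈ 1.34, p ≈ 0.18 -> cannot reject H0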

34
35
Thank You

36
Hypothesis Testing: Two sample test

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT
IIT ROORKEE

1
Agenda

• Comparing two population variances


• Choosing z or t test
• Sample size

2
Hypothesis Tests for Two Variances
Goal: Test hypotheses about two population variances.

Lower-tail test: H0: σ1² ≥ σ2², H1: σ1² < σ2²
Upper-tail test: H0: σ1² ≤ σ2², H1: σ1² > σ2²
Two-tail test: H0: σ1² = σ2², H1: σ1² ≠ σ2²

The F test statistic is used; the two populations are assumed to be independent and normally distributed.

3
Hypothesis Tests for Two Variances

The random variable

F = (s1²/σ1²) / (s2²/σ2²)

has an F distribution with (n1 − 1) numerator degrees of freedom and (n2 − 1) denominator degrees of freedom. Denote an F value with ν1 numerator and ν2 denominator degrees of freedom by Fν1,ν2.

4
Test Statistic

The test statistic for a hypothesis test about two population variances is

F = s1² / s2²

where F has (n1 − 1) numerator degrees of freedom and (n2 − 1) denominator degrees of freedom.

5
Decision Rules: Two Variances
Upper-tail test: H0: σ1² ≤ σ2², H1: σ1² > σ2² — reject H0 if F > Fα
Two-tail test: H0: σ1² = σ2², H1: σ1² ≠ σ2² — reject H0 if F > Fα/2 or F < F1−α/2

6
Problem
• A company manufactures impellers for use in jet-turbine engines.
• One of the operations involves grinding a particular surface finish on a
titanium alloy component.
• Two different grinding processes can be used, and both processes can produce
parts at identical mean surface roughness.
• The manufacturing engineer would like to select the process having the least
variability in surface roughness.
• A random sample of n1 =11 parts from the first process results in a sample
standard deviation s1 = 5.1 micro inches, and a random sample of n2 = 16
parts from the second process results in a sample standard deviation of s2 =
4.7 micro inches.
• We will find a 90% confidence interval on the ratio of the two standard
deviations.

7
Problem
• Form the hypothesis test:
H0: σ12 = σ22 (there is no difference between variances)
H1: σ12 ≠ σ22 (there is a difference between variances)
● Find the F critical values for α/2 = .10/2 = .05:
Degrees of Freedom:
• Numerator
• n1 – 1 = 11 – 1 = 10 d.f.
• Denominator:
• n2 – 1 = 16 – 1 = 15 d.f.

8
Problem

• Assuming that the two processes are independent and that surface
roughness is normally distributed

9
10
Problem

• f0.95,15,10 = 1 /f0.05,10,15 = 1/2.54 = 0.39


• Since this confidence interval includes unity, we cannot claim that the
standard deviations of surface roughness for the two processes are
different at the 90% level of confidence.
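
A sketch of this 90% confidence interval on σ1²/σ2² in Python (two-sided F interval with the same (n2 − 1, n1 − 1) degrees of freedom as the slide):

from scipy import stats

s1, n1 = 5.1, 11      # process 1: sample std dev and size
s2, n2 = 4.7, 16      # process 2
alpha = 0.10

ratio = s1**2 / s2**2
f_low  = stats.f.ppf(alpha / 2, n2 - 1, n1 - 1)       # f(0.95, 15, 10) ≈ 0.39
f_high = stats.f.ppf(1 - alpha / 2, n2 - 1, n1 - 1)   # f(0.05, 15, 10) ≈ 2.85

ci = (ratio * f_low, ratio * f_high)
print(tuple(round(v, 2) for v in ci))   # ≈ (0.46, 3.35); includes 1, so no
                                        # significant difference in variances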

11
12
F Test example:

13
Z Vs t

          σ known    σ unknown
n ≤ 30    Z-test     t-test
n > 30    Z-test     Z-test (using the sample standard deviation)
Determining the Sample Size for a Hypothesis Test About a Population
Mean
H0: µ ≤ µ0    Ha: µ > µ0

[Figure: sampling distribution of x̄ when H0 is true (µ = µ0), with rejection region beyond the critical value c determined by α; and sampling distribution of x̄ when H0 is false (µa > µ0), with Type II error probability β to the left of c.]
Determining the Sample Size for a Hypothesis Test About a Population
Mean

where
z = z value providing an area of  in the tail
zb = z value providing an area of b in the tail
= population standard deviation
m0 = value of the population mean in H0
ma = value of the population mean used for the
Type II error

Note: In a two-tailed hypothesis test, use z /2 not z


Determining the Sample Size for a Hypothesis Test About a Population Mean

• Let’s assume that the manufacturing company makes the following statements about the
allowable probabilities for the Type I and Type II errors:

• If the mean diameter is µ = 12 mm, I am willing to risk an α = .05 probability of rejecting H0.

• If the mean diameter is 0.75 mm over the specification (µ = 12.75), I am willing to risk a β = .10 probability of not rejecting H0.
Determining the Sample Size for a Hypothesis Test About a Population Mean

α = .05, β = .10
zα = 1.645, zβ = 1.28
µ0 = 12, µa = 12.75
σ = 3.2
19
Thank You

20
Post Hoc Analysis(Tukey’s test)
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES
IIT ROORKEE

1
Lecture Objectives
After completing this lecture, you should be able to:
• Use Tukey’s test and LSD Test to identify specific differences between
means

2
Designing engineering experiments

• Experimental design methods are also useful in engineering design


activities, where new products are developed and existing ones are
improved
• By using designed experiments, engineers can determine which subset of
the process variables has the greatest influence on process performance

3
Designing engineering experiments

• The results of an experiment can lead to


1. Improved process yield
2. Reduced variability in the process and closer conformance to nominal
or target requirements
3. Reduced design and development time
4. Reduced cost of operation

4
Designing engineering experiments

• Every experiment involves a sequence of activities:


1. Conjecture—the original hypothesis that motivates the experiment
2. Experiment—the test performed to investigate the conjecture
3. Analysis—the statistical analysis of the data from the experiment
4. Conclusion—what has been learned about the original conjecture
from the experiment. Often the experiment will lead to a revised
conjecture, and a new experiment, and so forth

5
The completely randomized single-factor experiment
example
• A manufacturer of paper that is used for making
grocery bags is interested in improving the tensile
strength of the product
• Product engineer thinks that tensile strength is a
function of the hardwood concentration in the
pulp and that the range of hardwood
concentrations of practical interest is between 5
and 20%.

6
The completely randomized single-factor experiment
example
• A team of engineers responsible for the study decides to investigate four
levels of hardwood concentration: 5%, 10%, 15%, and 20%.
• They decide to make up six test specimens at each concentration level,
using a pilot plant.
• All 24 specimens are tested on a laboratory tensile tester, in random order.
The data from this experiment are shown in Table

7
The completely randomized single-factor experiment
example
• Tensile Strength of Paper (psi)
Hardwood Observations Total Avg
Concentration (%) 1 2 3 4 5 6

5 7 8 15 11 9 10 60 10.00
10 12 17 13 18 19 15 94 15.67
15 14 18 19 17 16 18 102 17.00
20 19 25 22 23 18 20 127 21.17
383 15.96

8
The completely randomized single-factor experiment
example

9
Typical Data for Single Factor Experiment

Treatment   Observations             Totals   Averages
1           y11  y12  ...  y1n       y1.      ȳ1.
2           y21  y22  ...  y2n       y2.      ȳ2.
...         ...                      ...      ...
a           ya1  ya2  ...  yan       ya.      ȳa.
                                     y..      ȳ..

10
Sum of Squares

Total sum of squares: SST = Σᵢ Σⱼ (yᵢⱼ − ȳ..)²

Treatment sum of squares: SS_Treatments = n Σᵢ (ȳᵢ. − ȳ..)²

Error sum of squares: SSE = Σᵢ Σⱼ (yᵢⱼ − ȳᵢ.)²

11
ANOVA with Equal Sample Sizes

SST = Σᵢ Σⱼ yᵢⱼ² − y..²/N

SS_Treatments = (1/n) Σᵢ yᵢ.² − y..²/N

N = an = (number of treatments) × (sample size per treatment) = total number of observations

12
ANOVA with unequal Sample Sizes

SST = Σᵢ Σⱼ yᵢⱼ² − y..²/N

SS_Treatments = Σᵢ ( yᵢ.² / nᵢ ) − y..²/N

N = Σᵢ nᵢ = total number of observations

13
Problem: Analysis of variance

• Consider the paper tensile strength experiment described.


• We can use the analysis of variance to test the hypothesis that different
hardwood concentrations do not affect the mean tensile strength of the
paper.
• The hypotheses are
• H0: τ1 = τ2 = τ3 = τ4 = 0
• H1: τᵢ ≠ 0 for at least one i

14
Problem: Analysis of variance

• We will use α = 0.01.


• The sums of squares for the analysis of variance are computed are as
follows:

15
ANOVA Table

Sources of Variation   Sum of Squares   Degrees of Freedom   Mean Square      F
Treatments             SS_Treatments    a − 1                MS_Treatments    MS_Treatments / MSE
Error                  SSE              a(n − 1)             MSE
Total                  SST              an − 1

16
Problem: Analysis of variance

• The ANOVA is summarized as follows:

Source of Variation      Sum of Squares   Degrees of freedom   Mean Square   F0     P-value
Hardwood concentration   382.79           3                    127.6         19.6   3.59 E-6
Error                    130.17           20                   6.51
Total                    512.96           23

17
Problem: Analysis of variance

• Since f0.01,3,20 = 4.94, we reject H0 and conclude that hardwood


concentration in the pulp significantly affects the mean strength of the
paper

18
Problem: Analysis of variance

19
Jupyter code

20
Jupyter code

21
Jupyter code

22
Jupyter code

23
Multiple Comparisons Following the ANOVA

• When the null hypothesis is rejected in the ANOVA, we know that some of
the treatment or factor level means are different
• ANOVA doesn’t identify which means are different
• Methods for investigating this issue are called multiple comparisons
methods

24
Fisher’s least significant difference (LSD) method

• The Fisher LSD method compares all pairs of means with the null hypotheses H0: µᵢ = µⱼ (for all i ≠ j) using the t-statistic

t0 = (ȳᵢ. − ȳⱼ.) / √( 2MSE / n )

25
Fisher’s least significant difference (LSD) method

• Assuming a two-sided alternative hypothesis, the pair of means µᵢ and µⱼ would be declared significantly different if

|ȳᵢ. − ȳⱼ.| > LSD

where LSD, the least significant difference, is

LSD = t(α/2, a(n−1)) √( 2MSE / n )

26
Fisher’s least significant difference (LSD) method

• If the sample sizes are different in each treatment, the LSD is defined as

LSD = t(α/2, N−a) √( MSE (1/nᵢ + 1/nⱼ) )

27
Problem : LSD method

• We will apply the Fisher LSD method to the hardwood concentration experiment. There are a = 4 means, n = 6, MSE = 6.51, and t0.025,20 = 2.086. The treatment means are ȳ1. = 10.00, ȳ2. = 15.67, ȳ3. = 17.00, and ȳ4. = 21.17.

28
Problem : LSD method

• The value of LSD is:


2MS E 2(6.51)
LSD  t0.025,20  2.086  3.07
n 6

• Therefore, any pair of treatment averages that differs by more than 3.07
implies that the corresponding pair of treatment means are different.

29
Jupyter code

30
Problem : LSD method

• The comparisons among the observed treatment averages are as follows:

|ȳ1. − ȳ2.| = |10.00 − 15.67| = 5.67 > 3.07
|ȳ1. − ȳ3.| = |10.00 − 17.00| = 7.00 > 3.07
|ȳ1. − ȳ4.| = |10.00 − 21.17| = 11.17 > 3.07
|ȳ2. − ȳ3.| = |15.67 − 17.00| = 1.33 < 3.07
|ȳ2. − ȳ4.| = |15.67 − 21.17| = 5.50 > 3.07
|ȳ3. − ȳ4.| = |17.00 − 21.17| = 4.17 > 3.07

All pairs of treatment means differ significantly except the 10% and 15% concentrations.
31
The Tukey-Kramer Test for Post Hoc analysis

• Tells which population means are significantly different


• Done after rejection of equal means in ANOVA
• Allows pair-wise comparisons
• Compare absolute mean differences with critical range

32
The Tukey-Kramer Test for Post Hoc analysis

• Determine whether there is any significant difference between the means

• e.g., is μ1 = μ2 ≠ μ3?

[Figure: three population distributions with μ1 = μ2 and μ3 shifted to the right.]

33
Tukey-Kramer Critical Range

MSW  1 1 
Critical Range  QU 
2  n j n j' 

where:
QU = Value from Studentized Range
Distribution with c and n - c degrees of freedom for
the desired level of a
MSW = Mean Square Within
nj and nj’ = Sample sizes from groups j and j’
34
Problem: Tukey- Kramer test

• Tensile Strength of Paper (psi)


Hardwood Observations Total Avg
Concentratio 1 2 3 4 5 6
n (%)
5 7 8 15 11 9 10 60 10.00
10 12 17 13 18 19 15 94 15.67
15 14 18 19 17 16 18 102 17.00
20 19 25 22 23 18 20 127 21.17
383 15.96

35
The Tukey-Kramer Procedure
1. Compute absolute mean differences:

x1  x 2  10.00  15.67  5.67


x1  x 3  10.00  17.00  7
x 2  x 3  15.67  17.00  1.33
x1  x 4  10.00  21.17  11.17
x 2  x 4  15.67  21.17  5.5
x 3  x 4  17.00  21.17  4.17

36
The Tukey-Kramer Procedure

2. Find the QU value from the table with c = 4 and (n – c) = (24 – 4) = 20


degrees of freedom for the desired level of α (α = .05 used here):

QU  3.96

37
• Q table: The critical values
for q corresponding to
alpha = .05 (top) and
alpha = .01 (bottom)

38
The Tukey-Kramer Procedure

Source of Variation      Sum of Squares   Degrees of freedom   Mean Square   F0     P-value
Hardwood concentration   382.79           3                    127.6         19.6   3.59 E-6
Error                    130.17           20                   6.51
Total                    512.96           23

39
The Tukey-Kramer Procedure
3. Compute Critical Range:
MSW  1 1  6.51  1 1 
Critical Range  Q U     3.96     4.124
2  n j n j'  2 6 6

4. Compare: x1  x 2  10.00  15.67  5.67


x1  x 3  10.00  17.00  7
x 2  x 3  15.67  17.00  1.33
x1  x 4  10.00  21.17  11.17
x 2  x 4  15.67  21.17  5.5
x 3  x 4  17.00  21.17  4.17

40
The Tukey-Kramer Procedure
5. Other than |x̄2 − x̄3|, all of the absolute mean differences are greater than the critical range. Therefore there is a significant difference between each pair of means, except the 10% and 15% concentrations, at the 5% level of significance.
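
A sketch of the same comparison with statsmodels' built-in Tukey HSD routine, using the tensile-strength data:

import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

strength = [7, 8, 15, 11, 9, 10,      # 5% hardwood
            12, 17, 13, 18, 19, 15,   # 10%
            14, 18, 19, 17, 16, 18,   # 15%
            19, 25, 22, 23, 18, 20]   # 20%
concentration = ['5%'] * 6 + ['10%'] * 6 + ['15%'] * 6 + ['20%'] * 6

result = pairwise_tukeyhsd(strength, concentration, alpha=0.05)
print(result)   # every pair differs significantly except 10% vs 15%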

41
Jupyter code

42
Problem 2

• The following table shows the observed tensile strength (lb/in²) of cloth samples having different weight percentages of cotton.
• Check whether the weight percentage of cotton plays any role in the tensile strength (lb/in²) of the cloth.

43
Problem 2

Weight Observed tensile strength (lb/in square) Total Average


Percentage
of cotton

1 2 3 4 5
15 7 7 15 11 9 49 9.8
20 12 17 12 18 18 77 15.4
25 14 18 18 19 19 88 17.6
30 19 25 22 19 23 108 21.6
35 7 10 11 15 11 54 10.8
Grand total = 376    Grand mean = 15.04

44
• SSA = 5(9.8 − 15.04)² + 5(15.4 − 15.04)² + 5(17.6 − 15.04)² + 5(21.6 − 15.04)² + 5(10.8 − 15.04)² = 475.76
SST = 636.96
SSE = 636.96 − 475.76 = 161.20

Sources of variation       Sum of squares   Degrees of freedom   Mean square   F-value
Cotton weight percentage   475.76           4                    118.94        14.76
Error                      161.20           20                   8.06
Total                      636.96           24

45
Problem 2

• When α = .05, F(0.05, 4, 20) = 2.87
• Since 14.76 > 2.87, reject H0

46
• Q table: The critical values
for q corresponding to
alpha = .05 (top) and
alpha = .01 (bottom)

47
Problem 2

Tα = qα(c, n − c) √( MSE / n )

α = 0.05
q0.05(5, 20) = 4.23

T0.05 = 4.23 √( 8.06 / 5 ) = 5.37

48
Problem 2

Any pair of treatment averages that differ in absolute value by more than 5.37 would imply that the corresponding pair of population means are significantly different.

49
Problem 2
|ȳ1. − ȳ2.| = |9.8 − 15.4| = 5.6*
|ȳ1. − ȳ3.| = |9.8 − 17.6| = 7.8*
|ȳ1. − ȳ4.| = |9.8 − 21.6| = 11.8*
|ȳ1. − ȳ5.| = |9.8 − 10.8| = 1.0
|ȳ2. − ȳ3.| = |15.4 − 17.6| = 2.2
|ȳ2. − ȳ4.| = |15.4 − 21.6| = 6.2*
|ȳ2. − ȳ5.| = |15.4 − 10.8| = 4.6
|ȳ3. − ȳ4.| = |17.6 − 21.6| = 4.0
|ȳ3. − ȳ5.| = |17.6 − 10.8| = 6.8*
|ȳ4. − ȳ5.| = |21.6 − 10.8| = 10.8*

Starred values indicate pairs of means that are significantly different.

50
Jupyter code

51
Jupyter Code

52
Thank you

53
Two Way ANOVA

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT
IIT ROORKEE

1
Learning objectives

• Design and conduct engineering experiments involving several factors


using the factorial design approach
• Understand how the ANOVA is used to analyze the data from these
experiments
• Know how to use the two-level series of factorial designs

2
Factorial Experiment
• A factorial experiment is an experimental design that allows simultaneous
conclusions about two or more factors.
• The term factorial is used because the experimental conditions include all
possible combinations of the factors.
• The effect of a factor is defined as the change in response produced by a
change in the level of the factor. It is called a main effect because it refers to
the primary factors in the study
• For example, for a levels of factor A and b levels of factor B, the experiment
will involve collecting data on ab treatment combinations.
• Factorial experiments are the only way to discover interactions between
variables.

3
Factorial Experiment

[Figures: factor-level plots for a factorial experiment with no interaction (parallel lines) and with interaction (non-parallel lines).]

4
Two-factor Factorial Experiments

• The simplest type of factorial experiment involves only two factors, say, A
and B.
• There are a levels of factor A and b levels of factor B.
• This two-factor factorial is shown in next table .
• The experiment has n replicates, and each replicate contains all ab
treatment combinations.

5
Two-factor Factorial Experiments

Data Arrangement for a Two-Factor Factorial Design

6
Two-factor Factorial Experiments

• The observation in the ijth cell for the kth replicate is denoted by yijk
• In performing the experiment, the abn observations would be run in
random order.
• Thus, like the single factor experiment, the two-factor factorial is a
completely randomized design.

7
Example

• As an illustration of a two-factor factorial experiment, we will consider a


study involving the Common Admission test (CAT), a standardized test
used by graduate schools of business to evaluate an applicant’s ability to
pursue a graduate program in that field.
• Scores on the CAT range from 200 to 800, with higher scores implying
higher aptitude.

8
Three CAT preparation programs.

• In an attempt to improve students’ performance on the CAT, a major


university is considering offering the following three CAT preparation
programs.
1. A three-hour review session covering the types of questions generally
asked on the CAT.
2. A one-day program covering relevant exam material, along with the taking
and grading of a sample exam.
3. An intensive 10-week course involving the identification of each student’s
weaknesses and the setting up of individualized programs for
improvement.

9
Factor - 1 , 3 treatment

• One factor in this study is the CAT preparation program, which has three
treatments:
– Three-hour review,
– One-day program, and
– 10-week course.
• Before selecting the preparation program to adopt, further study will be
conducted to determine how the proposed programs affect CAT scores.

10
Factor 2 : 3 Treatment
• The CAT is usually taken by students from three colleges:
• the College of Business,
• the College of Engineering, and
• the College of Arts and Sciences.
• Therefore, a second factor of interest in the experiment is whether a
student’s undergraduate college affects the CAT score.
• This second factor, undergraduate college, also has three treatments:
– Business,
– Engineering, and
– Arts and sciences.

11
Nine Treatment Combinations for The Two-factor CAT
Experiment

Factor A: Factor B: College


Preparation Business Engineering Arts and sciences
Program
Three-hour review 1 2 3
One-day program 4 5 6
10-Week course 7 8 9

12
Replication

• In experimental design terminology, the sample size of two for each


treatment combination indicates that we have two replications.

13
CAT SCORES FOR THE TWO-FACTOR EXPERIMENT

Factor A: Factor B: College


Preparation Business Engineering Arts and sciences
Program
Three-hour review 500 540 480
580 460 400
One-day program 460 560 420
540 620 480
10-Week course 560 600 480
600 580 410

14
The analysis of variance computations answers
the following questions.
• Main effect (factor A): Do the preparation programs differ in terms of
effect on CAT scores?
• Main effect (factor B): Do the undergraduate colleges differ in terms of
effect on CAT scores?
• Interaction effect (factors A and B): Do students in some colleges do
better on one type of preparation program whereas others do better on a
different type of preparation program?

15
Interaction

• The term interaction refers to a new effect that we can now study because
we used a factorial experiment.
• If the interaction effect has a significant impact on the CAT scores, we can
conclude that the effect of the type of preparation program depends on
the undergraduate college.

16
ANOVA Table for the Two-factor Factorial Experiment
with r Replications
Sources of    Sum of    Degrees of       Mean Square                    F          P-value
Variation     Squares   Freedom
Factor A      SSA       a − 1            MSA = SSA/(a − 1)              MSA/MSE
Factor B      SSB       b − 1            MSB = SSB/(b − 1)              MSB/MSE
Interaction   SSAB      (a − 1)(b − 1)   MSAB = SSAB/[(a − 1)(b − 1)]   MSAB/MSE
Error         SSE       ab(r − 1)        MSE = SSE/[ab(r − 1)]
Total         SST       nT − 1
17
Abbreviation

18
ANOVA Procedure

• The ANOVA procedure for the two-factor factorial experiment requires us


to partition the sum of squares total (SST) into four groups:
– sum of squares for factor A (SSA),
– sum of squares for factor B (SSB),
– sum of squares for interaction (SSAB), and
– sum of squares due to error (SSE).
• The formula for this partitioning follows.

19
Computations and Conclusions

20
CAT Summary Data for The Two-factor Experiment
Factor A:            Factor B: College                                              Row totals
Preparation Program  Business             Engineering          Arts and Sciences
Three-hour review    500, 580 (x̄11=540)   540, 460 (x̄12=500)   480, 400 (x̄13=440)   2960
One-day program      460, 540 (x̄21=500)   560, 620 (x̄22=590)   420, 480 (x̄23=450)   3080
10-week course       560, 600 (x̄31=580)   600, 580 (x̄32=590)   480, 410 (x̄33=445)   3230
Column totals        3240                 3360                 2670                 Overall total = 9270; x̄ = 9270/18 = 515
21
CAT Summary Data for The Two-factor Experiment

• Factor A means
  x̄1. = 493.33
  x̄2. = 513.33
  x̄3. = 538.33

• Factor B means
  x̄.1 = 540
  x̄.2 = 560
  x̄.3 = 445
22
CAT Example:

23
CAT Example:

24
CAT Example:

25
CAT Example:

26
CAT Example:

27
ANOVA Table for the CAT two-factor design

Sources of Sum of Degrees of Mean Square F P- value


Variation Squares Freedom

Factor A 6100 2 3050 1.38 0.299

Factor B 45300 2 22650 10.27 0.005

Interaction 11200 4 2800 1.27 0.350

Error 19850 9 2206

Total 82450 17

28
Jupyter Code
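A minimal sketch of this two-way ANOVA with statsmodels, entering the CAT scores from the table shown earlier (the column and level names below are chosen here, not taken from the slides):

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

data = pd.DataFrame({
    'program': ['three_hour']*6 + ['one_day']*6 + ['ten_week']*6,
    'college': ['business', 'business', 'engineering', 'engineering',
                'arts_sci', 'arts_sci']*3,
    'score':   [500, 580, 540, 460, 480, 400,    # three-hour review
                460, 540, 560, 620, 420, 480,    # one-day program
                560, 600, 600, 580, 480, 410]})  # 10-week course

# C() marks categorical factors; the ':' term is the A x B interaction.
model = ols('score ~ C(program) + C(college) + C(program):C(college)',
            data=data).fit()
print(sm.stats.anova_lm(model, typ=2))   # SSA = 6100, SSB = 45300, SSAB = 11200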

29
Jupyter code

30
Jupyter Code

31
Thank You

32
REGRESSION
Linear Regression
Dr. Ramesh Anbanandam
DEPARTMENT of Management Studies

1
Simple Linear Regression

• Simple Linear Regression Model


• Least Squares Method
• Coefficient of Determination
• Model Assumptions
• Testing for Significance
• Using the Estimated Regression Equation for Estimation and
Prediction
Empirical Models
• Many problems in engineering and science involve exploring the
relationships between two or more variables

• Regression analysis is a statistical technique that is very useful for these


types of problems

• This model can also be used for process optimization, such as finding the
level of temperature that maximizes yield, or for process control purposes

3
Empirical Models Example

• As an illustration, consider the data in the table.
• In this table y is the purity of oxygen produced in a chemical distillation
  process, and x is the percentage of hydrocarbons that are present in the
  main condenser of the distillation unit.

Hydrocarbon level (X)   Purity (Y)
0.99                    90.01
1.02                    89.05
1.15                    91.43
1.29                    93.74
1.46                    96.73
1.36                    94.45
0.87                    87.59
1.23                    91.77
1.55                    99.42
1.40                    93.65

4
Using python for plotting the data
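A minimal sketch of the plotting step with matplotlib, using the purity data from the table above:

import matplotlib.pyplot as plt

x = [0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40]   # hydrocarbon level
y = [90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65]  # purity

plt.scatter(x, y)
plt.xlabel('Hydrocarbon level (%)')
plt.ylabel('Oxygen purity (%)')
plt.title('Purity vs hydrocarbon level')
plt.show()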

5
Simple Linear Regression Model
• The equation that describes how y is related to x and
an error term is called the regression model.
• The simple linear regression model is:

y = b0 + b1x +e
where:
b0 and b1 are called parameters of the model,
e is a random variable called the error term.
Simple Linear Regression Equation
The simple linear regression equation is:
E(y) = b0 + b1x

• Graph of the regression equation is a straight line.


• b0 is the y intercept of the regression line.
• b1 is the slope of the regression line.
• E(y) is the expected value of y for a given x value.
Simple Linear Regression Equation
Positive Linear Relationship

[Graph: E(y) versus x, a regression line with intercept b0 and positive slope b1]
Simple Linear Regression Equation
Negative Linear Relationship
[Graph: E(y) versus x, a regression line with intercept b0 and negative slope b1]
Simple Linear Regression Equation
No Relationship
[Graph: E(y) versus x, a horizontal regression line with intercept b0 and slope b1 = 0]
Estimated Simple Linear Regression Equation

 The estimated simple linear regression equation

  ŷ = b0 + b1x

• The graph is called the estimated regression line.
• b0 is the y intercept of the line.
• b1 is the slope of the line.
• ŷ is the estimated value of y for a given x value.
Least Squares Method
• Least Squares Criterion

  min Σ(yi − ŷi)²

where:
  yi = observed value of the dependent variable for the ith observation
  ŷi = estimated value of the dependent variable for the ith observation
Estimation Process
[Estimation process diagram:
 Regression model y = b0 + b1x + e with unknown parameters b0 and b1, and
 regression equation E(y) = b0 + b1x  →  sample data (x1, y1), ..., (xn, yn)
 →  sample statistics b0 and b1  →  estimated regression equation
 ŷ = b0 + b1x, where the sample statistics b0 and b1 provide estimates of the
 unknown parameters.]
14
Squared Error (SE) = [y1 − (mx1 + b)]² + [y2 − (mx2 + b)]² + ... + [yn − (mxn + b)]²

  = y1² − 2y1(mx1 + b) + (mx1 + b)²
  + y2² − 2y2(mx2 + b) + (mx2 + b)²
  + ...
  + yn² − 2yn(mxn + b) + (mxn + b)²

  = y1² − 2x1y1m − 2y1b + m²x1² + 2mx1b + b²
  + y2² − 2x2y2m − 2y2b + m²x2² + 2mx2b + b²
  + ...
  + yn² − 2xnynm − 2ynb + m²xn² + 2mxnb + b²

15
  = (y1² + y2² + ... + yn²)
  − 2m(x1y1 + x2y2 + ... + xnyn)
  − 2b(y1 + y2 + ... + yn)
  + m²(x1² + x2² + ... + xn²)
  + 2mb(x1 + x2 + ... + xn)
  + (b² + b² + ... + b²)

  = n·avg(y²) − 2mn·avg(xy) − 2bn·avg(y) + m²n·avg(x²) + 2mbn·avg(x) + nb²

(avg(·) denotes the mean of the quantity over the n observations,
e.g., avg(xy) = Σxiyi/n.)

16
SE = n·avg(y²) − 2mn·avg(xy) − 2bn·avg(y) + m²n·avg(x²) + 2mbn·avg(x) + nb²

Setting the partial derivative with respect to m equal to zero:

  ∂(SE)/∂m = −2n·avg(xy) + 2mn·avg(x²) + 2bn·avg(x) = 0

  ⇒ −avg(xy) + m·avg(x²) + b·avg(x) = 0

  ⇒ m·avg(x²) + b·avg(x) = avg(xy)

  ⇒ m·[avg(x²)/avg(x)] + b = avg(xy)/avg(x)

So the least-squares line passes through the point (avg(x²)/avg(x), avg(xy)/avg(x)).
17
SE = n·avg(y²) − 2mn·avg(xy) − 2bn·avg(y) + m²n·avg(x²) + 2mbn·avg(x) + nb²

Setting the partial derivative with respect to b equal to zero:

  ∂(SE)/∂b = −2n·avg(y) + 2mn·avg(x) + 2nb = 0

  ⇒ −ȳ + m·x̄ + b = 0

  ⇒ ȳ = m·x̄ + b

So the least-squares line also passes through the point (x̄, ȳ).

18
19
20
21
Least Squares Method

• Slope for the Estimated Regression Equation

  b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
REGRESSION
Linear Regression-II
Dr. Ramesh Anbanandam
DEPARTMENT of Management Studies

1
Least Squares Method

• Slope for the Estimated Regression Equation

  b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²

2
Sum of squares and sum of cross-products
Sxx = Σ(xi − x̄)²            (i = 1, ..., n)

Syy = Σ(yi − ȳ)²            (i = 1, ..., n)

Sxy = Σ(xi − x̄)(yi − ȳ)     (i = 1, ..., n)

3
Sum of squares and sum of cross-products

Slope(m) = Sxy / Sxx

SSE = error sum of squares = Syy − (Sxy)²/Sxx

4
Least Squares Method
y-Intercept for the Estimated Regression Equation

  b0 = ȳ − b1·x̄

where:
  xi = value of the independent variable for the ith observation
  yi = value of the dependent variable for the ith observation
  x̄ = mean value of the independent variable
  ȳ = mean value of the dependent variable
  n = total number of observations

5
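The slope and intercept formulas above can be checked numerically. A minimal sketch with NumPy, using small made-up x and y values (illustrative only, not from the slides):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

Sxx = np.sum((x - x.mean())**2)               # Σ(xi − x̄)²
Sxy = np.sum((x - x.mean())*(y - y.mean()))   # Σ(xi − x̄)(yi − ȳ)

b1 = Sxy / Sxx                 # slope
b0 = y.mean() - b1 * x.mean()  # intercept: the line passes through (x̄, ȳ)
print(f'y-hat = {b0:.3f} + {b1:.3f} x')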
Simple Linear Regression

Deviation from the estimated regression model

6
Simple Linear Regression

Example: Auto Sales


An Auto company periodically has a special week-long sale.
As part of the advertising campaign runs one or more television
commercials during the weekend preceding the sale.
Data from a sample of 5 previous sales are shown on the next slide.

7
Simple Linear Regression
Example: Auto Sales

Number of Number of
TV Ads Cars Sold
1 14
3 24
2 18
1 17
3 27

8
Estimated Regression Equation

Slope for the Estimated Regression Equation

  b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² = 20/4 = 5

y-Intercept for the Estimated Regression Equation

  b0 = ȳ − b1·x̄ = 20 − 5(2) = 10

Estimated Regression Equation

  ŷ = 10 + 5x

9
Scatter Diagram and Trend Line
[Scatter diagram of TV Ads (x-axis, 0 to 4) against Cars Sold (y-axis, 0 to 30),
with the fitted trend line y = 5x + 10.]

10
Jupyter Code
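A minimal sketch of fitting this regression with statsmodels (the variable names are chosen here); the estimates should match the hand computation above, b0 = 10 and b1 = 5:

import numpy as np
import statsmodels.api as sm

ads  = np.array([1, 3, 2, 1, 3])        # number of TV ads (x)
cars = np.array([14, 24, 18, 17, 27])   # number of cars sold (y)

X = sm.add_constant(ads)                # adds the intercept column
model = sm.OLS(cars, X).fit()
print(model.params)                     # close to [10, 5]: ŷ = 10 + 5x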

11
Jupyter Code

12
Jupyter code

13
Example Problem- II

• The data in the file hardness.xls provide measurements on the hardness and
tensile strength for 35 specimens of die-cast aluminum.
• It is believed that hardness (measured in Rockwell E units) can be used to
predict tensile strength (measured in thousands of pounds per square inch).
a. Construct a scatter plot.
b. Assuming a linear relationship, use the least-squares method to find the
regression coefficients b0 and b1.
c. Interpret the meaning of the slope, b1, in this problem.
d. Predict the mean tensile strength for die-cast aluminum that has a hardness of
30 Rockwell E units.

14
Tensile strength Hardness
53 29.31
70.2 34.86
84.3 36.82
55.3 30.12
78.5 34.02
63.5 30.82
71.4 35.4
53.4 31.26
82.5 32.18
67.3 33.42
69.5 37.69
73 34.88
55.7 24.66
85.8 34.76
95.4 38.02
51.1 25.68
74.4 25.81
54.1 26.46
77.8 28.67
52.4 24.64
69.1 25.77
53.5 23.69
64.3 28.65
82.7 32.38
55.7 23.21
70.5 34
87.5 34.47
50.7 29.25
72.3 28.71
59.5 29.83
71.3 29.25
52.7 27.99
76.5 31.85
63.7 27.65
69.2 31.7

15
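A minimal sketch for this problem, assuming the table above has been saved as 'hardness.xls' with columns 'Tensile strength' and 'Hardness' (the file and column names are assumptions):

import pandas as pd
import statsmodels.api as sm

df = pd.read_excel('hardness.xls')

# b. Least-squares fit: hardness predicts tensile strength
X = sm.add_constant(df['Hardness'])
model = sm.OLS(df['Tensile strength'], X).fit()
print(model.params)                         # b0 (intercept) and b1 (slope)

# d. Predicted mean tensile strength at 30 Rockwell E units
b0, b1 = model.params
print('prediction at hardness 30:', b0 + b1 * 30)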
16
Thank You

17
REGRESSION
Linear Regression-III
Dr. Ramesh Anbanandam
DEPARTMENT of Management Studies

1
Learning Objectives

• Understanding Coefficient of Determination


• Test statistical hypotheses and construct confidence intervals on
regression model parameters

2
3
Coefficient of Determination
• Relationship Among SST, SSR, SSE

  SST = SSR + SSE

  Σ(yi − ȳ)² = Σ(ŷi − ȳ)² + Σ(yi − ŷi)²

  Syy = (Sxy²/Sxx) + (Syy − Sxy²/Sxx)
where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error
Coefficient of Determination

 The coefficient of determination is:


r2 = SSR/SST
where:
SSR = sum of squares due to regression
SST = total sum of squares
Coefficient of Determination

r2 = SSR/SST = 100/114 = .8772


The regression relationship is very strong; 88% of the
variability in the number of cars sold can be explained by the
linear relationship between the number of TV ads and the
number of cars sold.
Jupyter code
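A minimal sketch of the r² computation for the auto sales data; the SST, SSE, and SSR values below follow from ŷ = 10 + 5x:

import numpy as np

ads  = np.array([1, 3, 2, 1, 3])
cars = np.array([14, 24, 18, 17, 27])

y_hat = 10 + 5*ads                        # estimated regression equation

SST = np.sum((cars - cars.mean())**2)     # total sum of squares      (= 114)
SSE = np.sum((cars - y_hat)**2)           # error sum of squares      (= 14)
SSR = SST - SSE                           # regression sum of squares (= 100)
print('r2 =', SSR / SST)                  # ≈ .8772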

7
Sample Correlation Coefficient

  rxy = (sign of b1)·√(coefficient of determination)

  rxy = (sign of b1)·√r²

  ŷ = b0 + b1x

where:
  b1 = the slope of the estimated regression equation
Sample Correlation Coefficient

  rxy = (sign of b1)·√r²

The sign of b1 in the equation ŷ = 10 + 5x is "+".

  rxy = +√.8772 = +.9366
Assumptions About the Error Term e

1. The error e is a random variable with mean of zero.
2. The variance of e, denoted by σ², is the same for all values of the
   independent variable.
3. The values of e are independent.
4. The error e is a normally distributed random variable.
Testing for Significance

• To test for a significant regression relationship, we must conduct a


hypothesis test to determine whether the value of b1 is zero.
• Two tests are commonly used:

t Test and F Test

• Both the t test and F test require an estimate of σ², the variance of e in
  the regression model.

12
Estimate of σ
• The mean square error (MSE) provides the estimate of σ²; the notation s²
  is also used.

  s² = MSE = SSE/(n − 2)

where:
  SSE = Σ(yi − ŷi)² = Σ(yi − b0 − b1xi)²
Testing for Significance
• An Estimate of σ
• To estimate σ we take the square root of s².
• The resulting s is called the standard error of the estimate.

  s = √MSE = √(SSE/(n − 2))
Testing for Significance

se = standard error of the estimate

   = √(SSE/(n − 2)) = √[(Syy − Sxy²/Sxx)/(n − 2)]

15
Testing for Significance: t Test
• Hypotheses

  H0: β1 = 0
  Ha: β1 ≠ 0

• Test Statistic

  t = b1/sb1
Case 1

H0: β1 = 0

In this case the null hypothesis is not rejected.

17
Case 2

Ha: β1 ≠ 0

In this case the null hypothesis is rejected.

18
The Standard Deviation of the Regression Slope
• The standard error of the regression slope coefficient (b1) is estimated by

  sb1 = sε / √Σ(x − x̄)²  =  sε / √[Σx² − (Σx)²/n]

where:
  sb1 = estimate of the standard error of the least squares slope
  sε = √(SSE/(n − 2)) = sample standard error of the estimate
Testing for Significance: t Test
 Rejection Rule
  Reject H0 if p-value < α
  or t < −tα/2 or t > tα/2
where:
  tα/2 is based on a t distribution with n − 2 degrees of freedom
Testing for Significance: t Test
1. Determine the hypotheses.            H0: β1 = 0
                                        Ha: β1 ≠ 0

2. Specify the level of significance.   α = .05

3. Select the test statistic.           t = b1/sb1

4. State the rejection rule.            Reject H0 if p-value < .05
                                        or |t| > 3.182 (with 3 degrees of freedom)
Testing for Significance: t Test
5. Compute the value of the test statistic.

   t = b1/sb1 = 5/1.08 = 4.63

6. Determine whether to reject H0.

   t = 4.541 provides an area of .01 in the upper tail. Hence, the two-tailed
   p-value is less than .02. (Also, t = 4.63 > 3.182.) We can reject H0.
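A minimal sketch of this t test in Python; statsmodels reports b1, its standard error, the t statistic, and the two-tailed p-value directly:

import numpy as np
import statsmodels.api as sm

ads  = np.array([1, 3, 2, 1, 3])
cars = np.array([14, 24, 18, 17, 27])

model = sm.OLS(cars, sm.add_constant(ads)).fit()
print(model.tvalues[1])   # t = b1/sb1 ≈ 4.63
print(model.pvalues[1])   # two-tailed p-value with n − 2 = 3 df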
Hypothesis Tests for the Slope
of the Regression Model
b b
t 
1

b
1

H 0: 1
0 S b

H :b
1
1
0 where: S 
S e
b
SSXX
b
H 0: 1
0
S 
SSE
e
n2
H :b 0
1
1   X
2

SSXX  X 2

b
H 0: 0 n
1
b  the hypothesized slope
H :b
1
0
df  n  2
1
1
Confidence Interval for b1

 We can use a 95% confidence interval for b1 to test


the hypotheses just used in the t test.
 H0 is rejected if the hypothesized value of b1 is not
included in the confidence interval for b1.
Confidence Interval for b1
• The form of a confidence interval for b1 is:

    b1 ± tα/2·sb1

  where b1 is the point estimator and tα/2·sb1 is the margin of error;
  tα/2 is the t value providing an area of α/2 in the upper tail of a
  t distribution with n − 2 degrees of freedom.
Confidence Interval for b1
• Rejection Rule
Reject H0 if 0 is not included in the confidence interval for b1.
• 95% Confidence Interval for b1
b1  t / 2 sb = 5 +/- 3.182(1.08) = 5 +/- 3.44
1

or 1.56 to 8.44
• Conclusion
0 is not included in the confidence interval.
Reject H0
Testing for Significance: F Test
• Hypotheses
  H0: β1 = 0
  Ha: β1 ≠ 0

• Test Statistic
  F = MSR/MSE
F-Test for Significance

• F Test statistic:

  F = MSR/MSE

where:
  MSR = SSR/k
  MSE = SSE/(n − k − 1)

where F follows an F distribution with k numerator degrees of freedom


and (n - k - 1) denominator degrees of freedom
(k = the number of independent variables in the regression model)
Testing for Significance: F Test
• Rejection Rule
  Reject H0 if p-value < α or F > Fα
where:
  Fα is based on an F distribution with 1 degree of freedom in the numerator
  and n − 2 degrees of freedom in the denominator
Testing for Significance: F Test
1. Determine the hypotheses.            H0: β1 = 0
                                        Ha: β1 ≠ 0

2. Specify the level of significance.   α = .05

3. Select the test statistic.           F = MSR/MSE

4. State the rejection rule.            Reject H0 if p-value < .05
                                        or F > 10.13 (with 1 d.f. in the
                                        numerator and 3 d.f. in the denominator)
Jupyter Code
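A minimal sketch of the overall F test for the auto sales regression; the same quantities are available from the fitted statsmodels results:

import numpy as np
import statsmodels.api as sm

ads  = np.array([1, 3, 2, 1, 3])
cars = np.array([14, 24, 18, 17, 27])

model = sm.OLS(cars, sm.add_constant(ads)).fit()
print(model.fvalue)    # F = MSR/MSE = 100/4.667 ≈ 21.43
print(model.f_pvalue)  # p-value of the overall F test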

31
Jupyter code

32
Testing for Significance: F Test
5. Compute the value of the test statistic.
F = MSR/MSE = 100/4.667 = 21.43
6. Determine whether to reject H0.
   F.025 = 17.44 cuts off an area of .025 in the upper tail (1 and 3 d.f.).
   Since F = 21.43 > 17.44, the p-value is less than .025, which is below
   α = .05. Hence, we reject H0.
The statistical evidence is sufficient to conclude that we have a significant
relationship between the number of TV ads aired and the number of cars
sold.
Some Cautions about the
Interpretation of Significance Tests

• Rejecting H0: b1 = 0 and concluding that the relationship


between x and y is significant does not enable us to
conclude that a cause-and-effect relationship is present
between x and y.
• Just because we are able to reject H0: b1 = 0 and demonstrate
statistical significance does not enable us to conclude that there
is a linear relationship between x and y.
Thank You

35
RBD

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT
IIT ROORKEE

1
Learning Objectives

• Estimate variance components in an experiment involving random factors

• Understand the blocking principle and how it is used to isolate the effect
of nuisance factors

• Design and conduct experiments involving the randomized complete block


design

2
Randomized Block Design

• A completely randomized design (CRD) is useful when the experimental


units are homogeneous

• If the experimental units are heterogeneous, blocking is often used to


form homogeneous groups

3
Why RBD?

• A problem can arise whenever differences due to extraneous factors (ones


not considered in the experiment) cause the MSE term in this ratio to
become large.
• In such cases, the F value in equation can become small, signaling no
difference among treatment means when in fact such a difference exists.

4
Randomized block design

• Experimental studies in business often involve experimental units that are


highly heterogeneous; as a result, randomized block designs are often
employed.
• Blocking in experimental design is similar to stratification in sampling.

5
Randomized block design

• Its purpose is to control some of the extraneous sources of variation by


removing such variation from the MSE term.
• This design tends to provide a better estimate of the true error variance
and leads to a more powerful hypothesis test in terms of the ability to
detect differences among treatment means.

6
Air Traffic Controller Stress Test
• A study measuring the fatigue and stress of
air traffic controllers resulted in proposals
for modification and redesign of the
controller’s work station
• After consideration of several designs for
the work station, three specific alternatives
are selected as having the best potential
for reducing controller stress
• The key question is: To what extent do the
three alternatives differ in terms of their
effect on controller stress?

7
Air Traffic Controller Stress Test
• In a completely randomized design, a random sample of controllers would be
assigned to each work station alternative.
• However, controllers are believed to differ substantially in their ability to
handle stressful situations.
• What is high stress to one controller might be only moderate or even low
stress to another.
• Hence, when considering the within-group source of variation (MSE), we must
realize that this variation includes both random error and error due to
individual controller differences.
• In fact, managers expected controller variability to be a major contributor to
the MSE term.

8
A randomized block design for the air traffic controller
stress test
Treatments
System A System B System C
Controller 1 15 15 18
Controller 2 14 14 14
Controller 3 10 11 15
Blocks
Controller 4 13 12 17
Controller 5 16 13 16
Controller 6 13 13 13

9
Solving this example using ANOVA in python

10
Solving this example using ANOVA in python

11
Summary of stress data for the air traffic controller stress test
Blocks          System A   System B   System C   Block total   Block means
Controller 1    15         15         18         48            x̄1. = 16
Controller 2    14         14         14         42            x̄2. = 14
Controller 3    10         11         15         36            x̄3. = 12
Controller 4    13         12         17         42            x̄4. = 14
Controller 5    16         13         16         45            x̄5. = 15
Controller 6    13         13         13         39            x̄6. = 13
Column totals   81         78         93         252           overall mean x̄ = 252/18 = 14

12
Summary of stress data for the air traffic controller stress test

• Treatment means

x.1 = 81/6 =13.5


x.2 = 78/6 =13
x.3 = 93/6 =15.5

13
ANOVA TABLE FOR THE RANDOMIZED BLOCK DESIGN WITH k
TREATMENTS AND b BLOCKS

Sources of    Sum of    Degrees of       Mean Square                  F          P-value
Variation     Squares   Freedom
Treatments    SSTR      k − 1            MSTR = SSTR/(k − 1)          MSTR/MSE
Blocks        SSBL      b − 1            MSBL = SSBL/(b − 1)
Error         SSE       (k − 1)(b − 1)   MSE = SSE/[(k − 1)(b − 1)]
Total         SST       nT − 1

14
RBD Problem

15
RBD Problem

16
RBD Problem

17
ANOVA table for the air traffic controller stress test

Sources of    Sum of    Degrees of   Mean Square   F                 P-value
Variation     Squares   Freedom
Treatments    21        2            10.5          10.5/1.9 = 5.53   0.024
Blocks        30        5            6.0
Error         19        10           1.9
Total         70        17

Reject the null hypothesis


18
Solving RBD example using python
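A minimal sketch of the randomized block ANOVA for the controller stress data; the column names below are chosen here, the slides show only the table:

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

data = pd.DataFrame({
    'controller': list(range(1, 7))*3,
    'system': ['A']*6 + ['B']*6 + ['C']*6,
    'stress': [15, 14, 10, 13, 16, 13,    # System A
               15, 14, 11, 12, 13, 13,    # System B
               18, 14, 15, 17, 16, 13]})  # System C

# Treat both the treatment (system) and the block (controller) as factors.
model = ols('stress ~ C(system) + C(controller)', data=data).fit()
print(sm.stats.anova_lm(model, typ=2))   # SSTR = 21, SSBL = 30, SSE = 19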

19
Solving RBD example using python

20
Conclusion

• Finally, note that the ANOVA table shown in Table provides an F value to
test for treatment effects but not for blocks.
• The reason is that the experiment was designed to test a single factor—
work station design.
• The blocking based on individual stress differences was conducted to
remove such variation from the MSE term.
• However, the study was not designed to test specifically for individual
differences in stress.

21
Problem 2: RBD

• An experiment was performed to determine the effect of four different


chemicals on the strength of a fabric.
• These chemicals are used as part of the permanent press finishing
process.
• Five fabric samples were selected, and a randomized complete block
design was run by testing each chemical type once in random order on
each fabric sample.
• The data are shown in Table.
• We will test for differences in means using an ANOVA with alpha = 0.01.

22
Problem 2: RBD

• Table: Fabric Strength Data—Randomized Complete Block Design

23
Anova using jupyter

24
Problem 2: RBD

• The sums of squares for the analysis of variance are computed as follows:

25
Problem 2: RBD

26
Problem 2: RBD
• Analysis of Variance for the Randomized Complete Block Experiment
Sources of Variation          Sum of    Degrees of   Mean     F       P-value
                              Squares   Freedom      Square
Chemical types (Treatments)   18.04     3            6.01     75.13   4.79 × 10⁻⁸
Fabric samples (Blocks)       6.69      4            1.67
Error                         0.96      12           0.08
Total                         25.69     19

27
Conclusion

• The ANOVA is summarized in the previous table


• Since f0 = 75.13 > f0.01,3,12 = 5.95 (the P-value is 4.79 × 10⁻⁸), we
  conclude that there is a significant difference among the chemical types so
  far as their effect on strength is concerned.

28
Python code for problem 2

29
Python code for problem 2

30
Python code for problem 2

31
Categorical Variable Regression

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Purpose of this lecture is to show how categorical variables are handled in


regression analysis.
• To illustrate the use and interpretation of a categorical independent
variable, we will consider two problems
• Demo on python

2
What are dummy variables?
• Dummy variables, also called indicator variables allow us to include
categorical data (like Gender) in regression models

• A dummy variable can take only 2 values, 0 (absence of a category) and 1


(presence of a category)

3
Example 1: Problem / Background
• Johnson Filtration, Inc., provides maintenance service for
water-filtration systems.
• Customers contact Johnson with requests for
maintenance service on their water-filtration systems
• To estimate the service time and the service cost,
Johnson’s managers want to predict the repair time
necessary for each maintenance request
• Hence, repair time in hours is the dependent variable
• Repair time is believed to be related to two factors,
– the number of months since the last maintenance service
– the type of repair problem (mechanical or electrical).
Source: Statistics for Business & Economics, David R. Anderson, Dennis J. Sweeney, Thomas A. Williams, Jeffrey D. Camm, James J. Cochran, Cengage Learning,2013

4
Data for the Johnson filtration example

service call months_since_last_service type_of_repair repair_time_in_hours


1 2 electrical 2.9
2 6 mechanical 3
3 8 electrical 4.8
4 3 mechanical 1.8
5 2 electrical 2.9
6 7 electrical 4.9
7 9 mechanical 4.2
8 8 mechanical 4.8
9 4 electrical 4.4
10 6 electrical 4.5

5
6
Linear Regression

7
OLS Summary

8
Linear regression

9
Normal probability plot

10
Creating dummies
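A minimal sketch of creating the repair-type dummy and fitting the model for the Johnson Filtration data shown earlier (the DataFrame is built inline here; column names follow the earlier table):

import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    'months_since_last_service': [2, 6, 8, 3, 2, 7, 9, 8, 4, 6],
    'type_of_repair': ['electrical', 'mechanical', 'electrical', 'mechanical',
                       'electrical', 'electrical', 'mechanical', 'mechanical',
                       'electrical', 'electrical'],
    'repair_time_in_hours': [2.9, 3, 4.8, 1.8, 2.9, 4.9, 4.2, 4.8, 4.4, 4.5]})

# x2 = 1 for electrical, 0 for mechanical
df['x2'] = (df['type_of_repair'] == 'electrical').astype(int)

X = sm.add_constant(df[['months_since_last_service', 'x2']])
model = sm.OLS(df['repair_time_in_hours'], X).fit()
print(model.params)   # b2 ≈ 1.26: electrical repairs take about 1.26 hours longer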

11
DATA FOR THE JOHNSON FILTRATION EXAMPLE WITH TYPE OF REPAIR INDICATED BY A
DUMMY VARIABLE (x2 = 0 FOR MECHANICAL; x2 = 1 FOR ELECTRICAL)

12
Adding dummies to table

13
OLS Summary

14
Dummy regression

15
Interpreting the Parameters

Equation 1

Equation 2

16
Interpreting the Parameters
• Comparing equations, we see that the mean repair time is a linear

function of x1 for both mechanical and electrical repairs.

• The slope of both equations is b1, but the y-intercept differs.

• The y-intercept is b0 in equation 1 for mechanical repairs and (b0 +b2) in

equation 2 for electrical repairs.

17
Interpreting the Parameters

• The interpretation of b2 is that it indicates the difference between the


mean repair time for an electrical repair and the mean repair time for a
mechanical repair.
• If b2 is positive, the mean repair time for an electrical repair will be
greater than that for a mechanical repair; if b2 is negative, the mean
repair time for an electrical repair will be less than that for a mechanical
repair.
• Finally, if b2 = 0, there is no difference in the mean repair time between
electrical and mechanical repairs and the type of repair is not related to
the repair time.

18
Interpreting the Parameters

• In effect, the use of a dummy variable for type of repair provides two
estimated regression equations that can be used to predict the repair
time, one corresponding to mechanical repairs and one corresponding to
electrical repairs.
• In addition, with b2= 1.26, we learn that, on average, electrical repairs
require 1.26 hours longer than mechanical repairs.

19
Interpreting the Parameters

20
More Complex Categorical Variables

• A categorical variable with k levels must be modeled using k - 1 dummy


variables.
• Care must be taken in defining and interpreting the dummy variables.

21
Example 2: Problem / Background

• The manager of a small sales


force wants to know whether
average monthly salary is
different for males and females in
the sales force.
• He obtains data on monthly
salary and experience (in months)
for each of the 9 employees as
shown on the next slide.

22
Data
Employee Salary Gender Experience

1 7.5 Male 6

2 8.6 Male 10

3 9.1 Male 12

4 10.3 Male 18

5 13 Male 30

6 6.2 Female 5

7 8.7 Female 13

8 9.4 Female 15

9 9.8 Female 21
24
25
26
27
28
Creating a dummy variable for gender
• Categorical data is included in
regression analysis by using Employee Salary Gender
dummy variables 1 7.5 0
2 8.6 0
3 9.1 0
• For example, we can assign a
value of 0 for males and 1 for 4 10.3 0
females in our data so that a 5 13 0
MR model can be developed 6 6.2 1
7 8.7 1
8 9.4 1
9 9.8 1
30
31
More on the intercept and slope

• The value of the intercept, 9.70, is the average salary for males (as we
coded gender=1 for females and 0 for males)

• The value of the slope, -1.175, tells us that the average female salary is
  lower than the average male salary by 1.175

32
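A minimal sketch reproducing the gender-dummy regression on the salary data above (salary in $000; gender coded 1 = female, 0 = male as in the slides):

import numpy as np
import statsmodels.api as sm

salary = np.array([7.5, 8.6, 9.1, 10.3, 13, 6.2, 8.7, 9.4, 9.8])
gender = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1])

model = sm.OLS(salary, sm.add_constant(gender)).fit()
print(model.params)   # intercept ≈ 9.70 (male mean), slope ≈ −1.175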
33
What would have happened if we had used 0 for females and
1 for males in our data? Would our results be any different?

34
Male = 1, female = 0

• Not really – With coding as above, the intercept would change to


8.525 (the average female salary), the slope for gender would still
be 1.175, but now it would have a positive sign (reflecting that
average male salary is higher than average female salary by 1.175).
Predicted salaries from the model for males / females would not
change no matter how dummy variable is coded

35
More on dummy variables

• For gender, we had only 2 categories – female and male – thus we


used a single 0/1 variable for this

• When there are more than 2 categories, the number of dummy


variables that should be used equals the number of categories
minus 1

• No. of Dummy Variables = No. of levels -1

36
Example: Salary vs. Job Grade

• In this example, the categorical variable job grade has 3 levels:
  1 (lowest grade), 2, and 3 (highest job grade).

Employee   Job Grade   Salary ($000)
1          1           7.5
2          3           8.6
3          2           9.1
4          3           10.3
5          3           13
6          1           6.2
7          2           8.7
8          2           9.4
9          3           9.8

37
Representing 3-level Job Grade using dummy variables
Job_1 and Job_2
Dummy Variables

Employee's Job Grade   Job_1   Job_2
1                      1       0
2                      0       1
3                      0       0

Job Grade 3 is the reference category

38
Data file with dummy variables for job grade
Job
Employee Grade Salary Job_1 Job_2
1 1 7.5 1 0
2 3 8.6 0 0
3 2 9.1 0 1
4 3 10.3 0 0
5 3 13 0 0
6 1 6.2 1 0
7 2 8.7 0 1
8 2 9.4 0 1
9 3 9.8 0 0

39
Thank You

40
Estimation, Prediction of Regression Model Residual
Analysis: Validating Model Assumptions - I

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Point Estimation
• Interval Estimation
• Confidence Interval for the Mean Value of y
• Prediction Interval for an Individual Value of y

2
Problem

• Data were collected from a sample of 10 Ice cream vendors located near
college campuses.

• For the ith observation or restaurant in the sample, xi is the size of the
student population (in thousands) and yi is the quarterly sales (in
thousands of dollars).

• The values of xi and yi for the 10 restaurants in the sample are summarized
in Table
3
Data
Student Population Sales
Restaurant (1000) (1000)
1 2 58
2 6 105
3 8 88
4 8 118
5 12 117
6 16 137
7 20 157
8 20 169
9 22 149
10 26 202
4
Python code for scatter plot

5
Python code for scatter plot

6
Python code for regression Equation

7
Python code for regression Equation

8
Python code for regression

• In the Ice cream vendor example, the estimated regression equation 60 +


5x provides an estimate of the relationship between the size of the
student population x and quarterly sales y.

9
10
Point Estimate
• We can use the estimated regression equation to develop a point estimate
of the mean value of y for a particular value of x or to predict an individual
value of y corresponding to a given value of x.

• For instance, suppose a manager want a point estimate of the mean


quarterly sales for all restaurants located near college campuses with
10,000 students.

11
Point estimate

• Using the estimated regression equation 60 +5x, we see that for x 10 (or
10,000 students), 60 + 5(10) = 110.
• Thus, a point estimate of the mean quarterly sales for all restaurants
located near campuses with 10,000 students is $110,000.

12
Point estimate

• Now suppose the manager want to predict sales for an individual


restaurant located near College, with 10,000 students.
• In this case we are not interested in the mean value for all restaurants
located near campuses with 10,000 students;
• We are just interested in predicting quarterly sales for one individual
restaurant.
• As it turns out, the point estimate for an individual value of y is the same
as the point estimate for the mean value of y.
• Hence, we would predict quarterly sales of 60 + 5(10) = 110 or $110,000
for this one restaurant.
13
Plot at mean value of x and y

14
Confidence Interval Estimation

• Confidence interval, is an interval estimate of the mean value of y for a


given value of x.
• Prediction interval, is used whenever we want an interval estimate of an
individual value of y for a given value of x.
• The point estimate of the mean value of y is the same as the point
estimate of an individual value of y.
• The margin of error is larger for a prediction interval.

15
Confidence Interval Estimation

xp = the particular or given value of the independent variable x
yp = the value of the dependent variable y corresponding to the given xp
E(yp) = the mean or expected value of the dependent variable y corresponding
        to the given xp
ŷp = b0 + b1xp = the point estimate of E(yp) when x = xp

For xp = 10:  ŷp = 60 + 5(10) = 110

16
Confidence Interval Estimation

In general, we cannot expect ŷp to equal E(yp) exactly.

If we want to make an inference about how close ŷp is to the true mean value
E(yp), we will have to estimate the variance of ŷp.

The formula for estimating the variance of ŷp given xp is denoted by s²ŷp.

17
Confidence Interval Estimation

18
Confidence Intervals for the Mean sales y at given values of
student population x

19
Python Code
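A minimal sketch of the confidence and prediction intervals for the restaurant data, using statsmodels' get_prediction (variable names chosen here):

import numpy as np
import statsmodels.api as sm

students = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26])   # x, in 1000s
sales    = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202])

model = sm.OLS(sales, sm.add_constant(students)).fit()

pred = model.get_prediction(np.array([[1.0, 10.0]]))   # constant + xp = 10
print(pred.predicted_mean)                   # point estimate: 110
print(pred.conf_int(alpha=0.05))             # 95% confidence interval for E(yp)
print(pred.conf_int(obs=True, alpha=0.05))   # 95% prediction interval for yp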

20
Special Case
The estimated standard deviation of ŷp is smallest when xp = x̄, since then
the quantity xp − x̄ = 0.

21
Prediction Interval for an Individual Value of y

• Instead of estimating the mean value of sales for all restaurants located
near campuses with 10,000 students, we want to estimate the sales for an
individual restaurant located near a particular College with 10,000
students.
(1) The variance of individual y values about the mean E(yp), an estimate of
    which is given by s².
(2) The variance associated with using ŷp to estimate E(yp), an estimate of
    which is given by s²ŷp.

22
Prediction Interval for an Individual Value of y

23
Prediction Interval for an Individual Value of y

24
Prediction Interval for an Individual Value of y

25
Prediction Interval for an Individual Value of y

26
Confidence intervals vs prediction intervals

• Confidence intervals and prediction intervals show the precision of the


regression results.
• Narrower intervals provide a higher degree of precision

27
Python Code for Prediction Interval

28
Python Code

29
Python Code

30
Thank You

31
Estimation, Prediction of Regression Model Residual
Analysis: Validating Model Assumptions - II

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Understanding different types of residual analysis


• Plotting residual plots using python

2
Residual Analysis: Validating Model Assumptions

• Residual analysis is the primary tool for determining whether the assumed
regression model is appropriate

3
Assumptions about the error term ε

4
Importance of the Assumptions

• These assumptions provide the theoretical basis for the t test and the F
test used to determine whether the relationship between x and y is
significant, and for the confidence and prediction interval estimates
• If the assumptions about the error term ε appear questionable, the
hypothesis tests about the significance of the regression relationship and
the interval estimation results may not be valid.

5
Residuals for Ice cream parlours

Source: Statistics for Business & Economics, David R. Anderson, Dennis J. Sweeney, Thomas A. Williams, Jeffrey D. Camm, James J. Cochran, Cengage Learning,2013

6
Residual analysis is based on an examination of graphical plots

• A plot of the residuals against values of the independent variable x
• A plot of residuals against the predicted values of the dependent variable ŷ
• A standardized residual plot
• A normal probability plot

7
Residual Plot Against x

8
Residual Plot Against x

9
Assumption: the variance is the same for all values of x

• The residual plot should give an


overall impression of a horizontal
band of points

10
Violation of Assumption:
The variance of ‘e’ is not the same for all values of x

• Assumption of a constant
variance of ‘e’ is violated
• If variability about the regression
line is greater for larger values of
x

11
Assumed regression model is not an adequate
representation

A curvilinear regression model or


multiple regression model
should be considered.

12
Residual Plot Against ŷ

• The pattern of this residual plot is the


same as the pattern of the residual plot
against the independent variable x.
• It is not a pattern that would lead us to
question the model assumptions.

13
Residual Plot Against ŷ

• For simple linear regression, both the residual plot against x and the
  residual plot against ŷ provide the same pattern.
• For multiple regression analysis, the residual plot against ŷ is more widely
  used because of the presence of more than one independent variable.

14
Standardized Residuals

• Many of the residual plots provided by computer software packages use a


standardized version of the residuals.
• A random variable is standardized by subtracting its mean and dividing the
result by its standard deviation.
• With the least squares method, the mean of the residuals is zero.
• Thus, simply dividing each residual by its standard deviation provides the
standardized residual

15
Python Code
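A minimal sketch of the residual plots for the restaurant data, with standardized residuals obtained from statsmodels' influence methods:

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

students = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26])
sales    = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202])

model = sm.OLS(sales, sm.add_constant(students)).fit()

# Residuals against x: should look like a horizontal band around zero
plt.scatter(students, model.resid)
plt.axhline(0)
plt.xlabel('student population (1000s)')
plt.ylabel('residual')
plt.show()

# Internally studentized (standardized) residuals
influence = model.get_influence()
print(influence.resid_studentized_internal)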

16
17
Python Code

18
Standardized Residuals

19
Computation of standardized residuals for Icecream parlors

20
Computation of standardized residuals for Icecream parlors

21
Plot of The Standardized Residuals Against The Independent
Variable x

22
Plot of The Standardized Residuals Against The Independent
Variable x

23
Studentized residual

• The standardized residual plot can provide insight about the assumption
that the error ‘e’ term has a normal distribution.
• If this assumption is satisfied, the distribution of the standardized
residuals should appear to come from a standard normal probability
distribution.

24
Studentized residual

• Thus, when looking at a standardized residual plot, we should expect to


see approximately 95% of the standardized residuals between -2 and 2.
• We see in Figure that for the Armand’s example all standardized residuals
are between -2 and 2.
• Therefore, on the basis of the standardized residuals, this plot gives us no
reason to question the assumption that ‘e’ has a normal distribution.

25
Normal Probability Plot

• Another approach for determining the validity of the assumption that the
error term has a normal distribution is the normal probability plot.
• To show how a normal probability plot is developed, we introduce the
concept of normal scores.

26
Normal Probability Plot

• Suppose 10 values are selected randomly from a normal probability


distribution with a mean of zero and a standard deviation of one, and that
the sampling process is repeated over and over with the values in each
sample of 10 ordered from smallest to largest.
• For now, let us consider only the smallest value in each sample.
• The random variable representing the smallest value obtained in repeated
sampling is called the first-order statistic.

27
Normal Probability Plot

28
Normal Probability Plot

29
Normal scores and ordered standardized residuals for
Armand’s pizza parlors

30
Normal Probability Plot

• If the normality assumption is satisfied, the smallest standardized residual


should be close to the smallest normal score, the next smallest
standardized residual should be close to the next smallest normal score,
and so on.
• If we were to develop a plot with the normal scores on the horizontal axis
and the corresponding standardized residuals on the vertical axis, the
plotted points should cluster closely around a 45-degree line passing
through the origin if the standardized residuals are approximately
normally distributed.
• Such a plot is referred to as a normal probability plot.

31
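A minimal sketch of a normal probability plot of the standardized residuals, using scipy's probplot; points near the 45-degree line support the normality assumption:

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import statsmodels.api as sm

students = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26])
sales    = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202])

model = sm.OLS(sales, sm.add_constant(students)).fit()
std_resid = model.get_influence().resid_studentized_internal

stats.probplot(std_resid, dist='norm', plot=plt)
plt.show()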
Normal probability plot for Ice Cream parlors

32
33
Thank You

34
MULTIPLE REGRESSION MODEL - I

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Multiple regression model


• Least squares method
• Multiple coefficient of determination
• Model assumptions
• Testing for significance F-Test, t-Test

2
Multiple regression model

3
The estimation process For multiple regression

4
Simple vs multiple regression

• In simple linear regression, b0 and b1 were the sample statistics used to
  estimate the parameters β0 and β1.
• Multiple regression parallels this statistical inference process, with
  b0, b1, b2, . . . , bp denoting the sample statistics used to estimate the
  parameters β0, β1, β2, . . . , βp.

5
Least Squares Method

6
Least Squares Method

7
An Example: Trucking Company

• As an illustration of multiple regression analysis, we will consider a


problem faced by the Trucking Company.
• A major portion of business involves deliveries throughout its local area.
• To develop better work schedules, the managers want to estimate the
total daily travel time for their drivers.

Source: Statistics for Business and Economics, 2012, Anderson

8
PRELIMINARY DATA FOR BUTLER TRUCKING

9
Using python import data

10
Using python import data

11
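A minimal sketch of importing and fitting the trucking data, assuming a file 'trucking.csv' with columns 'x1' (miles traveled), 'x2' (number of deliveries), and 'y' (travel time in hours); all of these names are assumptions, since the slides show the table only as an image:

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv('trucking.csv')

X = sm.add_constant(df[['x1', 'x2']])
model = sm.OLS(df['y'], X).fit()
print(model.summary())   # b0, b1, b2, R², adjusted R², F test, t tests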
Scatter Diagram Of Preliminary Data For Trucking x1

12
Scatter Diagram Of Preliminary Data For Trucking x2

13
Scatter Diagram For x1 and x2

14
Linear regression Vs. multiple regression model

• Linear regression

15
Linear regression Vs. multiple regression model

16
Linear regression Vs. Multiple regression model

• Multiple regression

17
Linear regression Vs. Multiple regression model

18
Multiple Coefficient of Determination

19
Multiple Coefficient of Determination for linear model

20
Multiple Coefficient of Determination for Multiple regression
model

21
Multiple Coefficient of Determination

22
Multiple Coefficient of Determination

• Adding independent variables causes the prediction errors to become


smaller, thus reducing the sum of squares due to error, SSE.
• Because SSR = SST- SSE, when SSE becomes smaller, SSR becomes larger,
causing R2 = SSR/SST to increase.
• Many analysts prefer adjusting R2 for the number of independent variables
to avoid overestimating the impact of adding an independent variable on
the amount of variability explained by the estimated regression equation.

23
Adjusted Multiple Coefficient of Determination
n = number of observations
p = denoting the number of independent variables

24
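The adjustment can be computed directly from R², n, and p, using the standard formula adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1). A minimal sketch (the example numbers are illustrative, not from the slides):

def adjusted_r2(r2: float, n: int, p: int) -> float:
    """n = number of observations, p = number of independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Illustrative values: R² = 0.90 with n = 10 observations and p = 2 predictors
print(adjusted_r2(0.90, 10, 2))   # ≈ 0.8714, slightly below the raw R²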
OLS Summary

25
Adjusted Multiple Coefficient Vs Multiple Coefficient

• If a variable is added to the model, R2 becomes larger even if the variable


added is not statistically significant.
• The adjusted multiple coefficient of determination compensates for the
number of independent variables in the model.

26
Adjusted Multiple Coefficient Vs Multiple Coefficient

• If the value of R2 is small and the model contains a large number of


independent variables, the adjusted coefficient of determination can take
a negative value

27
Model Assumptions

28
Assumption about error term

1. The error term e is a random variable with mean or expected value of zero;
   E(e) = 0.
   Implication: For given values of x1, x2, …, xp, the expected, or average,
   value of y is given by

     E(y) = β0 + β1x1 + β2x2 + . . . + βpxp

   – This equation represents the average of all possible values of y that
     might occur for the given values of x1, x2, …, xp.

29
Assumption about error term

30
Graph of the regression equation for multiple regression
analysis with two independent variables

31
Response variable and response surface

• In regression analysis, the term response variable is often used in place of


the term dependent variable.
• Furthermore, since the multiple regression equation generates a plane or
surface, its graph is called a response surface.

32
Thank You

33
MULTIPLE REGRESSION MODEL-II

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Testing for significance


– F Test
– t Test
• Python Demo for multiple regression

2
Testing for Significance

• The F test is used to determine whether a significant relationship exists


between the dependent variable and the set of all the independent
variables; we will refer to the F test as the test for overall significance.
• If the F test shows an overall significance, the t test is used to determine
whether each of the individual independent variables is significant.
• A separate t test is conducted for each of the independent variables in the
model; we refer to each of these t tests as a test for individual significance.

3
F Test

4
F test significance

5
F test significance

6
F Test

7
ANOVA table

8
t Test for individual significance

9
t Test for individual significance

10
t Test for individual significance

11
Regression Approach
to ANOVA
Regression Approach to ANOVA
• Three different assembly methods, referred to as methods A, B, and C, have been
proposed.
• Managers at Chemitech want to determine which assembly method can produce
the greatest number of filtration systems per week

A B C
58 58 48
64 69 57
55 71 59
66 64 47
67 68 49
ANOVA
Anova: Single Factor

SUMMARY
Groups Count Sum Average Variance
A 5 310 62 27.5
B 5 330 66 26.5
C 5 260 52 31

ANOVA
Source of
Variation SS df MS F P-value F crit
Between
Groups 520 2 260 9.176471 0.003818 3.885294
Within
Groups 340 12 28.33333

Total 860 14
Dummy variables for the chemitech experiment
Dummy variables for the chemitech experiment

• If we are interested in the expected value of the number of units


assembled per week for an employee who uses method C, our procedure
for assigning numerical values to the dummy variables would result in
setting A = B= 0.
• The multiple regression equation then reduces to
Dummy variables for the chemitech experiment

• For method A the values of the dummy variables are A = 1 and B = 0, and

• For method B we set A = 0 and B = 1, and


SUMMARY OUTPUT

Regression Statistics
Multiple R 0.777593186
R Square 0.604651163
Adjusted R Square 0.53875969
Standard Error 5.322906474
Observations 15

ANOVA
df SS MS F Significance F
Regression 2 520 260 9.176471 0.003818412
Residual 12 340 28.33333
Total 14 860

Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 52 2.380476143 21.84437 4.97E-11 46.81338804 57.18661196 46.81338804 57.18661196
A 10 3.366501646 2.970443 0.011692 2.665023022 17.33497698 2.665023022 17.33497698
B 14 3.366501646 4.15862 0.001326 6.665023022 21.33497698 6.665023022 21.33497698
Estimation of E(y)

• b0 = 52
• b1= 10
• b2 = 14
Assembly Method Estimation of E(y)
A b0+b1 = 52+10=62
B b0+b2 = 52 +14 = 66
C 52
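A minimal sketch of this regression approach to ANOVA for the Chemitech data, with dummies A and B and method C as the reference category:

import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    'units':  [58, 64, 55, 66, 67,    # method A
               58, 69, 71, 64, 68,    # method B
               48, 57, 59, 47, 49],   # method C
    'method': ['A']*5 + ['B']*5 + ['C']*5})

df['A'] = (df['method'] == 'A').astype(int)
df['B'] = (df['method'] == 'B').astype(int)

X = sm.add_constant(df[['A', 'B']])
model = sm.OLS(df['units'], X).fit()
print(model.params)   # intercept = 52 (mean of C), A = 10, B = 14
print(model.fvalue)   # F ≈ 9.18, matching the single-factor ANOVA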
Testing the significance
Thank You

21
Linear Regression Model Vs Logistic Regression Model

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Comparison of Linear Regression model and Logistic regression model

2
Estimating the relationship

Linear regression model:
• Y1 = X1 + X2 + … + Xn
• Where Y1 = continuous data; independent variables = nonmetric and metric

Logistic regression model:
• Y1 = X1 + X2 + … + Xn
• Where Y1 = binary nonmetric; independent variables = nonmetric and metric

3
Graphical representation
• Linear regression • Logistic regression

4
Correspondence of Primary Elements of Model Fit
Linear Regression:
• Total sum of squares
• Error sum of squares
• F test of model fit
• Coefficient of determination (R²)
• Regression sum of squares

Logistic Regression:
• −2LL of base model
• −2LL of proposed model
• Chi-square test of −2LL difference
• Pseudo R² measures
• Difference of −2LL for base and proposed models

5
Objective of logistic regression

• Logistic regression is identical to discriminant analysis in terms of the basic


objectives it can address
• Logistic regression is best suited to address two research objectives:
– Identifying the independent variables that impact group membership
in the dependent variable
– Establishing a classification system based on the logistic model for
determining group membership

6
The fundamental difference

• Logistic regression differs from linear regression, in being specifically


designed to predict the probability of an event occurring (ie., the
probability of an observation being in the group coded 1)
• Although probability values are metric measures, there are fundamental
differences between linear regression and logistic regression

7
Log likelihood

• Measure used in logistic regression to represent the lack of predictive


fit
• Even though this method does not use the least squares procedure in
model estimation, as is done in linear regression, the likelihood value is
similar to the sum of squared error in regression analysis

8
Logistic vs discriminant

• Logistic regression may be preferred for two reasons


• First, discriminant analysis relies on strictly meeting the assumptions of
– Multivariate normality and equal variance
– Covariance matrices across groups
– Assumptions that are not met in many situations
• Logistic regression does not face these strict assumptions and is much
more robust when these assumptions are not met, making its application
appropriate in many situations

9
Logistic vs discriminant

• Second, even if the assumptions are met, many researchers prefer logistic
regression because it is similar to multiple regression
• It has straightforward statistical tests, similar approaches to incorporating
metric and nonmetric variables and nonlinear effects, and a wide range of
diagnostics
• Logistic regression is equivalent to two-group discriminant analysis and
may be more suitable in many situations

10
Logistic vs discriminant : Sample size

• One factor that distinguishes logistic regression from the other techniques
is its use of maximum likelihood (MLE) as the estimation technique
• MLE requires larger samples such that, all things being equal, logistic
regression will require a larger sample size than multiple regression
• As for discriminant analysis, there are considerations on the minimum
group size as well

11
Logistic vs discriminant : Sample size

• The recommended sample size for each group is at least 10 observations


per estimated parameter
• This is much greater than multiple regression, which had a minimum of
five observations per parameter, and that was for the overall sample, not
the sample size for each group, as seen with logistic regression

12
Determination of coefficients

Linear regression:
  r² = SSR/SST
  where:
    SSR = sum of squares due to regression
    SST = total sum of squares

Logistic regression:
  R²Logit = [−2LLnull − (−2LLmodel)] / (−2LLnull)
  where:
    LL = log likelihood
    −2LLnull = −2LL of base model
    −2LLmodel = −2LL of proposed model

13
Determination of coefficients
• Linear regression • Logistic regression

14
Testing for overall significance

Linear regression:
• F-test of model fit
• F = MSR/MSE

Logistic Regression:
• G-test of model fit
• G = −2 ln(likelihood without the variable / likelihood with the variable)

15
Testing for overall significance
• Linear regression • Logistic regression

16
Testing for significance
Linear regression: t-test

  t = (b1 − β1) / sb1

  where:
    sb1 = se / √SSxx
    se  = √(SSE/(n − 2))
    SSxx = Σx² − (Σx)²/n
    β1 = the hypothesized slope
    df = n − 2

Logistic regression: Wald test

17
Testing for significance
• Linear regression • Logistic regression

18
Model Estimation fit

• The basic measure of how well the maximum likelihood estimation


procedure fits is the likelihood value, similar to the sums of squares values
used in multiple regression
• Logistic regression measures model estimation fit with the value of -2
times the log of the likelihood value, referred to as -2LL or -2 log likelihood
• The minimum value for -2LL is 0, which corresponds to a perfect fit
(likelihood = 1 and -2LL is then 0)

19
Model Estimation fit

• The lower the -2LL value, the better the fit of the model
• The -2LL value can be used to compare equations for the change in fit

20
Between Model Comparison

• The likelihood value can be compared between equations to assess the


difference in predictive fit from one equation to another, with statistical
tests for the significance of these differences
• The basic approach follows three steps:

21
Step 1 : Estimate a null model

• The first step is to calculate a null model, which acts as the baseline for
making comparisons of improvement in model fit.
• The most common null model is one without any independent variables,
which is similar to calculating the total sum of squares using only the
mean in linear regression.
• The logic behind this form of null model is that it can act as a baseline
against which any model containing independent variables can be
compared.

22
Step 2: Estimate the proposed model

• This model contains the independent variables to be included in the


logistic regression model.
• This model fit will improve on the null model and result in a lower −2LL
  value.
• Any number of proposed models can be estimated

23
Step 3: Assess -2LL difference:

• The final step is to assess the statistical significance of the -2LL value
between the two models (null model versus proposed model).
• If the statistical tests support significant differences, then we can state
that the set of independent variable(s) in the proposed model is
significant in improving model estimation fit.

24
Between model comparison

Linear regression:
• SSE = Σ(yi − ŷi)²

Logistic Regression:
• −2LL of proposed model

25
Between model comparison

Linear Regression:
• SSR = Σ(ŷi − ȳ)²
• SSR = SST − SSE

Logistic regression:
• Difference between log likelihoods
• = (−2LLnull) − (−2LLmodel)

26
Normality of Residual (Error)
Linear regression:
• Residuals are normally distributed.
• Linear regression assumes that residuals are approximately equal for all
  predicted dependent variable values.

Logistic regression:
• Residuals are binomially distributed.
• Logistic regression does not need residuals to be equal for each level of
  the predicted dependent variable values.

27
Estimation Methods

• Linear regression is based on least squares estimation: regression
  coefficients are chosen so as to minimize the sum of the squared distances
  of each observed response to its fitted value.
• Logistic regression is based on Maximum Likelihood Estimation (MLE):
  coefficients are chosen so as to maximize the probability of Y given X
  (the likelihood). With MLE, the computer uses different "iterations" in
  which it tries different solutions until it gets the maximum likelihood
  estimates.

28
Interpretation

Coefficients of linear regression are interpreted as:
• Keeping all other independent variables constant, how much the dependent
  variable is expected to increase/decrease with a unit increase in the
  independent variable.

In logistic regression, we interpret odds ratios as:
• The effect of a one-unit change in X on the predicted odds ratio, with the
  other variables in the model held constant.

29
THANK YOU

30
LOGISTIC REGRESSION - I

Dr A. RAMESH
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Building Logistic regression Model


• Python Demo on Logistic Regression

2
Application
• In many regression applications the dependent variable may only assume
two discrete values.
• For instance, a bank might like to develop an estimated regression
equation for predicting whether a person will be approved for a credit
card or not
• The dependent variable can be coded as y =1 if the bank approves the
request for a credit card and y = 0 if the bank rejects the request for a
credit card.
• Using logistic regression we can estimate the probability that the bank
will approve the request for a credit card given a particular set of values
for the chosen independent variables.

3
Example

• Let us consider an application of logistic regression involving a direct mail


promotion being used by Simmons Stores.
• Simmons owns and operates a national chain of women’s apparel stores.
• Five thousand copies of an expensive four-color sales catalog have been
printed, and each catalog includes a coupon that provides a $50 discount
on purchases of $200 or more.
• The catalogs are expensive and Simmons would like to send them to only
those customers who have the highest probability of using the coupon.

Sources: Statistics for Business and Economics,11th Edition by David R. Anderson (Author), Dennis J.
Sweeney (Author), Thomas A. Williams (Author)

4
Variables

• Management thinks that annual spending at Simmons Stores and whether


a customer has a Simmons credit card are two variables that might be
helpful in predicting whether a customer who receives the catalog will use
the coupon.
• Simmons conducted a pilot study using a random sample of 50 Simmons
credit card customers and 50 other customers who do not have a
Simmons credit card.
• Simmons sent the catalog to each of the 100 customers selected.
• At the end of a test period, Simmons noted whether or not the customer used
  the coupon.

5
Data (10 customer out of 100)
Customer Spending Card Coupon
1 2.291 1 0
2 3.215 1 0
3 2.135 1 0
4 3.924 0 0
5 2.528 1 0
6 2.473 0 1
7 2.384 0 0
8 7.076 0 0
9 1.182 1 1
10 3.345 0 0

6
Explanation of Variables

• The amount each customer spent last year at Simmons is shown in


thousands of dollars and the credit card information has been coded as 1
if the customer has a Simmons credit card and 0 if not.
• In the Coupon column, a 1 is recorded if the sampled customer used the
coupon and 0 if not.

7
Logistic Regression Equation

• If the two values of the dependent variable y are coded as 0 or 1, the


value of E( y) in equation given below provides the probability that y = 1
given a particular set of values for the independent variables x1, x2, . . . , xp.

8
Logistic Regression Equation

• Because of the interpretation of E( y) as a probability, the logistic


regression equation is often written as follows
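
E(y) = P(y = 1 | x1, x2, . . . , xp)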

9
10
Logistic regression equation for β0 and β1

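E(y) = e^(β0 + β1x) / (1 + e^(β0 + β1x))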
11
Logistic regression equation for β0 and β1

• Note that the graph is S-shaped.
• The value of E(y) ranges from 0 to 1, with the value of E(y) gradually approaching 1 as the value of x becomes larger and the value of E(y) approaching 0 as the value of x becomes smaller.
• Note also that the values of E(y), representing probability, increase fairly rapidly as x increases from 2 to 3.
• The fact that the values of E(y) range from 0 to 1 and that the curve is S-shaped makes the equation (slide 11) ideally suited to model the probability that the dependent variable is equal to 1.

12
Estimating the Logistic Regression Equation
• In simple linear and multiple regression the least squares method is used to compute b0, b1, . . . , bp as estimates of the model parameters (β0, β1, . . . , βp).
• The nonlinear form of the logistic regression equation makes the method of
computing estimates more complex
• We will use computer software to provide the estimates.
• The estimated logistic regression equation is
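
ŷ = e^(b0 + b1x1 + . . . + bpxp) / (1 + e^(b0 + b1x1 + . . . + bpxp))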

Here, y hat provides an estimate of the probability that y = 1, given a particular


set of values for the independent variables.
13
Python Code for Logistic Regression
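
A minimal sketch of what this demo can look like, assuming the pilot-study data are in a file simmons.csv with columns Spending, Card and Coupon (the file and column names are assumptions):

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv('simmons.csv')                # hypothetical file name

X = sm.add_constant(df[['Spending', 'Card']])  # add the intercept term
y = df['Coupon']

model = sm.Logit(y, X).fit()                   # fitted by maximum likelihood
print(model.summary())

# Estimated P(y = 1) at spending = $2000, without and with the Simmons card
new = pd.DataFrame({'const': [1.0, 1.0], 'Spending': [2, 2], 'Card': [0, 1]})
print(model.predict(new))                      # ≈ .1880 and .4099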

14
Variables

15
Managerial Use

• P(y = 1 | x1 = 2, x2 = 0) = .1880

• P(y = 1 | x1 = 2, x2 = 1) = .4099

• Probabilities indicate that for customers with annual spending of $2000 the presence of
a Simmons credit card increases the probability of using the coupon

16
Managerial Use

• It appears that the probability of using the coupon is much higher for
customers with a Simmons credit card.

17
Testing for Significance

18
G Statistics

• The test for overall significance is based upon the value of a G test
statistic.
• If the null hypothesis is true, the sampling distribution of G follows a chi-
square distribution with degrees of freedom equal to the number of
independent variables in the model.

19
20
G Statistics

G = 2(−60.487 − (−67.301)) = 13.628

• The value of G is 13.628, its degrees of freedom are 2, and its p-value is 0.001.
• Thus, at any level of significance α ≥ .001, we would reject the null hypothesis and conclude that the overall model is significant.

21
Thank You

22
LOGISTIC REGRESSION - II

Dr A. RAMESH
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Testing the significance of Logistic regression coefficients


• Python Demo on Logistic Regression

2
Chi sq. value of G- Statistic

3
z test- Wald Test

• z test can be used to determine


whether each of the individual
independent variables is making a
significant contribution to the
overall model

4
Strategies
• Suppose Simmons wants to send the
promotional catalog only to
customers who have a 0.40 or higher
probability of using the coupon.
• Customers who have a Simmons
credit card: Send the catalog to every
customer who spent $2000 or more
last year.
• Customers who do not have a
Simmons credit card: Send the
catalog to every customer who spent
$6000 or more last year.

5
Interpreting the Logistic Regression Equation

6
Odds ratio

• The odds ratio measures the impact on the odds of a one-unit increase in
only one of the independent variables.
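
In terms of probabilities,

odds = P(y = 1 | x1, . . . , xp) / (1 − P(y = 1 | x1, . . . , xp))

Estimated odds ratio = odds1 / odds0

where odds1 are the odds with the chosen independent variable increased by one unit and odds0 are the odds at its original value.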

7
Interpretation

• For example, suppose we want to compare the odds of using the coupon
for customers who spend $2000 annually and have a Simmons credit card
(x1= 2 and x2 = 1) to the odds of using the coupon for customers who
spend $2000 annually and do not have a Simmons credit card (x1= 2 and
x2 = 0).
• We are interested in interpreting the effect of a one-unit increase in the
independent variable x2.

8
Odds ratio

9
Odds ratio – Interpretation

• The estimated odds in favor of using the coupon for customers who spent
$2000 last year and have a Simmons credit card are 3 times greater than
the estimated odds in favor of using the coupon for customers who spent
$2000 last year and do not have a Simmons credit card.

10
Odds ratio – Interpretation
• The odds ratio for each independent variable is computed while holding all the
other independent variables constant.
• But it does not matter what constant values are used for the other independent
variables.
• For instance, if we computed the odds ratio for the Simmons credit card
variable (x2) using $3000, instead of $2000, as the value for the annual
spending variable (x1), we would still obtain the same value for the estimated
odds ratio (3.00).
• Thus, we can conclude that the estimated odds of using the coupon for
customers who have a Simmons credit card are 3 times greater than the
estimated odds of using the coupon for customers who do not have a Simmons
credit card.

11
Relationship between the odds ratio and the coefficients of
the independent variables
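
Estimated odds ratio for variable xi = e^(bi)

With the fitted Simmons coefficients (b1 ≈ 0.342 for spending and b2 ≈ 1.099 for the card variable, consistent with the probabilities quoted earlier), e^(1.099) ≈ 3.00 for the credit card variable.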

12
Effect of a change of more than one unit in Odd Ratio

• The odds ratio for an independent variable represents the change in the
odds for a one unit change in the independent variable holding all the
other independent variables constant.
• Suppose that we want to consider the effect of a change of more than one
unit, say c units.
• For instance, suppose in the Simmons example that we want to compare
the odds of using the coupon for customers who spend $5000 annually (x1
= 5) to the odds of using the coupon for customers who spend $2000
annually (x1 = 2).
• In this case c = 5 − 2 = 3 and the corresponding estimated odds ratio is

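Estimated odds ratio = e^(c·b1) = e^(3(0.342)) = e^(1.026) ≈ 2.79

(using the fitted spending coefficient b1 ≈ 0.342)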
13
Effect of a change of more than one unit in Odd Ratio

• This result indicates that the estimated odds of using the coupon for
customers who spend $5000 annually is 2.79 times greater than the
estimated odds of using the coupon for customers who spend $2000
annually.
• In other words, the estimated odds ratio for an increase of $3000 in
annual spending is 2.79

14
Logit Transformation

• An interesting relationship can be observed between the odds in favor of y


= 1 and the exponent for ‘e’ in the logistic regression equation
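
ln(odds in favor of y = 1) = β0 + β1x1 + β2x2 + . . . + βpxp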

• This equation shows that the natural logarithm of the odds in favor of y =
1 is a linear function of the independent variables.
• This linear function is called the logit; we use g(x1, x2, . . . , xp) to denote it.

15
Estimated Logit Regression Equation

16
17
G vs Z

• Because of the unique relationship between the estimated coefficients in


the model and the corresponding odds ratios, the overall test for
significance based upon the G statistic is also a test of overall significance
for the odds ratios.
• In addition, the z test for the individual significance of a model parameter
also provides a statistical test of significance for the corresponding odds
ratio.

18
Thank You

19
Maximum Likelihood Estimation - I

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• This lecture will provide intuition behind MLE using theory and examples.

2
Maximum Likelihood Estimation

• The method of maximum likelihood was first introduced by R. A.


Fisher, a geneticist and statistician, in the 1920s.
• Most statisticians recommend this method, at least when the
sample size is large, since the resulting estimators have certain
desirable efficiency properties
• Maximum likelihood estimation (MLE) is a method for finding the density function that is most likely to have generated the data.
• MLE requires one to make a distributional assumption first.

3
An intuitive view on likelihood

 = −2,  2 = 1
 = 0,  2 = 1

 = 0,  2 = 4

4
Maximum Likelihood Estimation: Problem
• A sample of ten new bike helmets manufactured by a certain company is
obtained. Upon testing, it is found that the first, third, and tenth helmets
are flawed, whereas the others are not.
• Let p = P(flawed helmet), i.e., p is the proportion of all such helmets that
are flawed.
• Define (Bernoulli) random variables X1, X2, . . . , X10 by
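
Xi = 1 if the ith helmet is flawed, and Xi = 0 otherwise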

Source: Probability and Statistics for Engineering and the Sciences, Jay L Devore, 8th Ed, Cengage

5
Maximum Likelihood Estimation: Problem

• Then for the obtained sample, X1 = X3 = X10 = 1 and the other seven Xi’s are
all zero
• The probability mass function of any particular Xi is f(xi; p) = p^(xi) (1 − p)^(1−xi), which becomes p if xi = 1 and 1 − p when xi = 0
• Now suppose that the conditions of various helmets are independent of
one another
• This implies that the Xi’s are independent, so their joint probability mass
function is the product of the individual pmf’s.

6
Maximum Likelihood Estimation: Binomial Distribution

• Joint pmf evaluated at the observed Xi’s is


f(x1, . . . , x10; p) = p(1 − p)p . . . p = p³(1 − p)⁷   (1)

• Suppose that p = .25. Then the probability of observing the sample that we actually obtained is (.25)³(.75)⁷ = .002086.
• If instead p = .50, then this probability is (.50)³(.50)⁷ = .000977.
• For what value of p is the obtained sample most likely to have occurred?
• That is, for what value of p is the joint pmf (eq 1) as large as it can be?
• What value of p maximizes (eq 1)?

7
Maximum Likelihood Estimation: Binomial Distribution
• Figure shows a graph of the likelihood (eq 1) as a function of p.
• It appears that the graph reaches its peak at p = .3, the proportion of flawed helmets in the sample.

Graph of the likelihood (joint pmf) (eq 1)

8
Graph of the natural logarithm of the likelihood

• Figure shows a graph of the


natural logarithm of (eq 1)
• Since ln[g(u)] is a strictly
increasing function of g(u),
finding u to maximize the
function g(u) is the same as
finding u to maximize ln[g(u)].

9
Maximum Likelihood Estimation: Binomial Distribution

• We can verify our visual impression by using calculus to find the value of p
that maximizes (eq 1).
• Working with the natural log of the joint pmf is often easier than working
with the joint pmf itself, since the joint pmf is typically a product so its
logarithm will be a sum.
• Here ln[f(x1, . . . , x10; p)] = ln[p³(1 − p)⁷] = 3 ln(p) + 7 ln(1 − p)

10
Maximum Likelihood Estimation: Binomial Distribution

Thus
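
d/dp [3 ln(p) + 7 ln(1 − p)] = 3/p − 7/(1 − p)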

11
Interpretation
• Equating this derivative to 0 and solving for p gives
3(1 – p) = 7p, from which 3 = 10p and so p = 3/10 = .30 as conjectured

• That is, our point estimate is p = .30.


• It is called the maximum likelihood estimate because it is the parameter
value that maximizes the likelihood (joint pmf) of the observed sample
• In general, the second derivative should be examined to make sure a maximum has been obtained, but here this is obvious from the figure

12
Maximum Likelihood Estimation: Binomial Distribution

• Suppose that rather than being told the condition of every helmet, we had
only been informed that three of the ten were flawed.
• Then we would have the observed value of a binomial random variable X =
the number of flawed helmets.
• The pmf of X is b(x; 10, p) = (10 choose x) p^x (1 − p)^(10−x). For x = 3, this becomes (10 choose 3) p³(1 − p)⁷ = 120 p³(1 − p)⁷.
• The binomial coefficient is irrelevant to the maximization, so again p̂ = 0.30.

13
Maximum Likelihood Function Definition
• Let X1, X2, …, Xn have joint pmf or pdf

f(x1, x2, …, xn; θ1, …, θm)   (a)

• where the parameters θ1, …, θm have unknown values. When x1, …, xn are the observed sample values and (a) is regarded as a function of θ1, …, θm, it is called the likelihood function.
• The maximum likelihood estimates (mle's) θ̂1, …, θ̂m are those values of the θi's that maximize the likelihood function, so that

f(x1, x2, …, xn; θ̂1, …, θ̂m) ≥ f(x1, x2, …, xn; θ1, …, θm) for all θ1, …, θm

• When the Xi's are substituted in place of the xi's, the maximum likelihood estimators result.

14
Interpretation
• The likelihood function tells us how likely the observed sample is as a
function of the possible parameter values.
• Maximizing the likelihood gives the parameter values for which the
observed sample is most likely to have been generated—that is, the
parameter values that “agree most closely” with the observed data.

15
Estimation of Poisson Parameter
• Suppose we have data generated from a Poisson distribution. We want to
estimate the parameter of the distribution
• The probability of observing a particular value is P(X; λ) = e^(−λ) λ^X / X!
• The joint likelihood is obtained by multiplying the individual probabilities together:

P(X1, X2, …, Xn; λ) = (e^(−λ) λ^(X1) / X1!) · (e^(−λ) λ^(X2) / X2!) ⋯ (e^(−λ) λ^(Xn) / Xn!)

L(λ; X) ∝ ∏i e^(−λ) λ^(Xi) = e^(−nλ) λ^(nX̄)

16
Estimation of Poisson Parameter
• Note in the likelihood function the factorials have disappeared.
• This is because they provide a constant that does not influence the
relative likelihood of different values of the parameter
• It is usual to work with the log likelihood rather than the likelihood.
• Note that maximising the log likelihood is equivalent to maximising the likelihood.
• Take the natural log of the likelihood function:

L(λ; X) = e^(−nλ) λ^(nX̄)
ℓ(λ; X) = −nλ + nX̄ log λ

• Find where the derivative of the log likelihood is zero:

dℓ/dλ = −n + nX̄/λ = 0  ⟹  λ̂ = X̄

• Note that here the MLE is the same as the moment estimator.

17
Estimation of exponential distribution Parameter
• Suppose X1, X2, . . . , Xn is a random sample from an exponential distribution with parameter λ. Because of independence, the likelihood function is a product of the individual pdf's:

f(x1, . . . , xn; λ) = (λe^(−λx1)) ⋯ (λe^(−λxn)) = λ^n e^(−λΣxi)

• The natural logarithm of the likelihood function is

ln[f(x1, . . . , xn; λ)] = n ln(λ) − λΣxi


18
Estimation of exponential distribution Parameter

• Equating (d/d)[ln(likelihood)] to zero results in


n/ – xi = 0, or  = n/xi =

• Thus the MLE is

19
Estimation of parameters of Normal Distribution
• Let X1, . . . , Xn be a random sample from a normal distribution.
• The likelihood function is

f(x1, . . . , xn; μ, σ²) = (1/(2πσ²))^(n/2) · e^(−Σ(xi − μ)²/(2σ²))

• so

ln[f(x1, . . . , xn; μ, σ²)] = −(n/2) ln(2πσ²) − Σ(xi − μ)²/(2σ²)

20
Estimation of parameters of normal distribution
• To find the maximizing values of μ and σ², we must take the partial derivatives of ln(f) with respect to μ and σ², equate them to zero, and solve the resulting two equations.

• Omitting the details, the resulting MLE's are

μ̂ = X̄   and   σ̂² = Σ(Xi − X̄)²/n

• The MLE of σ² is not the unbiased estimator, so two different principles of estimation (unbiasedness and maximum likelihood) yield two different estimators

21
Thank you

22
Maximum Likelihood Estimation-II

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• This lecture will provide an understanding of the intuition behind MLE using theory and examples.

2
Example1: Estimation of parameters of normal distribution

• Let us explain the basic idea of MLE using simple problems.
• Let us make the assumption that the variable x follows a normal distribution.
• The density function of a normal distribution with mean m and variance σ² is given by:

f(x) = (1/√(2πσ²)) · e^(−(x − m)²/(2σ²))

Id  x
1   1
2   4
3   5
4   6
5   9

3
Example 1: Estimation of parameters of normal distribution

• The data is plotted on a horizontal line


• Think which distribution, either A or B, is more likely to have generated
the data?

4
Interpretation

• The answer to this question is A, because the data are clustered around the center of distribution A, but not around the center of distribution B
• This example illustrates that, by looking at the data, it is possible to find the distribution that is most likely to have generated the data
• Now, I will explain exactly how to find the distribution in practice

5
The illustration of the estimation procedure

• MLE starts with computing the likelihood contribution of each observation


• The likelihood contribution is the height of the density function.
• We use Li to denote the likelihood contribution of ith observation.

6
Graphical illustration of likelihood contribution

7
The illustration of the estimation procedure

• Then, you multiply the likelihood contributions of all the observations. This is called the likelihood function. We use the notation L.

• Likelihood function: L = ∏(i=1 to n) Li (this notation means you multiply from i = 1 through n)

• In our example, n = 5

8
The illustration of the estimation procedure

• In our example, the likelihood function looks like:
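
L(m, σ) = ∏(i=1 to 5) (1/√(2πσ²)) · e^(−(xi − m)²/(2σ²)),   xi ∈ {1, 4, 5, 6, 9}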

• The likelihood function depends on mean m and variance σ²

9
The illustration of the estimation procedure

• The values of mean m and σ that maximise the likelihood function are found.
• The values of mean m and σ obtained this way are called the maximum likelihood estimators of mean m and σ.
• Most MLE problems cannot be solved 'by hand'; thus, you need to write an iterative procedure to solve them on a computer.

10
Method of Least-squares vs MLE
Model for the expectation (fixed part of the model):

E[Yi] = β0 + β1xi

Residuals: ri = yi − E[Yi]

The method of least-squares:
Find the values for the parameters (β0 and β1) that make the sum of the squared residuals (Σri²) as small as possible.

Can only be used when the error term is normal (residuals are assumed to be drawn from a normal distribution):

Yi = β0 + β1xi + εi, where εi ~ N(0, σ)
Method of Least-squares vs MLE
Model for the expectation (fixed part of the model):

E[Yi] = β0 + β1xi

Residuals: ri = yi − E[Yi]

The maximum likelihood method is


more general!

- Can be applied to models with any


probability distribution
Estimation of Regression Parameter

• We are interested in estimating a model like this:
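
y = β0 + β1x + u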

• Estimating such a model can be done using MLE

13
Estimation of Regression Parameters

• Suppose that we have the following data and we are interested in estimating the model:

y = β0 + β1x + u

• Let us make an assumption that u follows the normal distribution with mean 0 and variance σ²

Id  Y   X
1   2   1
2   6   4
3   7   5
4   9   6
5   15  9

14
Estimation of Regression Parameters

• We can write the model as:

u = y − β0 − β1x

• This means that y − β0 − β1x follows the normal distribution with mean 0 and variance σ²
• The likelihood contribution of each data point is the height of the density function at the data point (y − β0 − β1x)

15
Estimation of Regression Parameters

• The likelihood contribution of the 2nd observation (the data point Y = 6, X = 4) in this example is given by:

L2 = (1/√(2πσ²)) · e^(−(6 − β0 − 4β1)²/(2σ²))

16
Estimation of Regression Parameters

• Then the likelihood function is given by

L(β0, β1, σ) = ∏(i=1 to 5) Li = L1 × L2 × L3 × L4 × L5

= (1/√(2πσ²)) e^(−(2 − β0 − β1)²/(2σ²)) × (1/√(2πσ²)) e^(−(6 − β0 − 4β1)²/(2σ²))
× (1/√(2πσ²)) e^(−(7 − β0 − 5β1)²/(2σ²)) × (1/√(2πσ²)) e^(−(9 − β0 − 6β1)²/(2σ²))
× (1/√(2πσ²)) e^(−(15 − β0 − 9β1)²/(2σ²))

(Data: (Y, X) = (2, 1), (6, 4), (7, 5), (9, 6), (15, 9))

• The likelihood function is a function of β0, β1 and σ.

17
Estimation of Regression Parameters

• You choose the values of β0, β1 and σ that maximize the likelihood function.

18
Python Demo for MLE
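
A minimal sketch of such an iterative procedure, maximizing the normal log-likelihood for the five observations of Example 1 with scipy's optimizer (starting values are arbitrary):

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

x = np.array([1, 4, 5, 6, 9])                 # data from Example 1

def neg_log_likelihood(params):
    m, log_sigma = params                     # optimize log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    return -np.sum(norm.logpdf(x, loc=m, scale=sigma))

res = minimize(neg_log_likelihood, x0=[0.0, 0.0])
m_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(m_hat, sigma_hat)                       # ≈ 5.0 and ≈ 2.61 (the ML estimates)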

19
20
21
Parameter estimation by MLE

22
Parameter estimation by MLE

23
Example 2

24
25
26
27
Thank you

28
Performance of Logistic Model-III

Dr A. RAMESH
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda
Python demo for accuracy prediction in a logistic regression model using the receiver operating characteristic (ROC) curve

2
Sensitivity and Specificity

• To check what type of error we are making, we use two parameters:

1. Sensitivity = tp/(tp + fn) — True Positive Rate (tpr)

2. Specificity = tn/(tn + fp) — True Negative Rate (tnr)

3
Specificity and Sensitivity Relationship with Threshold
Threshold (Lower): Sensitivity (↑), Specificity (↓)
Threshold (Higher): Sensitivity (↓), Specificity (↑)

Which threshold value should be chosen?

4
Measuring Accuracy, Specificity and Sensitivity

5
ROC Curve for Training dataset

6
ROC Curve for Test data set

7
Threshold value selection

• The outcome of a logistic regression model is a probability.

• Selecting a good threshold value is often challenging.
• Threshold values on the ROC curve:
Threshold = 1 → TPR = 0, FPR = 0
Threshold = 0 → TPR = 1, FPR = 1

• Threshold values are often selected based on which errors are better.

8
Accuracy checking for different threshold values

9
Accuracy checking for different threshold values

10
Accuracy checking for different threshold values

11
Accuracy checking for different threshold values

12
Calculating Optimal Threshold Value
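
A minimal sketch of this calculation, assuming y_test holds the true labels and y_prob the fitted probabilities from the demo; maximizing Youden's J statistic (TPR − FPR) is one common rule, not the only one:

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, thresholds = roc_curve(y_test, y_prob)
print(roc_auc_score(y_test, y_prob))   # area under the ROC curve

best = np.argmax(tpr - fpr)            # Youden's J = TPR - FPR
print(thresholds[best])                # candidate optimal threshold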

13
Optimal Threshold Value in ROC Curve

14
Classification Report using Optimal Threshold Value

15
Thank You

16
Confusion matrix and ROC - I

Dr A. RAMESH
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Confusion matrix
• Receiver operating characteristic (ROC) curve

2
Why Evaluate?

• Multiple methods are available to classify or predict


• For each method, multiple choices are available for settings
• To choose best model, need to assess each model’s performance

3
Accuracy Measures (Classification)

Misclassification error
• Error = classifying a record as belonging to one class when it belongs to
another class.

• Error rate = percent of misclassified records out of the total records in the
validation data

4
Confusion Matrix

Classification Confusion Matrix


Predicted Class
Actual Class 1 0
1 201 85
0 25 2689

201 1’s correctly classified as “1”


85 1’s incorrectly classified as “0”
25 0’s incorrectly classified as “1”
2689 0’s correctly classified as “0”

5
Error Rate
Classification Confusion Matrix
Predicted Class
Actual Class 1 0
1 201 85
0 25 2689

Overall error rate = (25+85)/3000 = 3.67%


Accuracy = 1 – err = (201+2689)/3000 = 96.33%
If multiple classes, error rate is:
(sum of misclassified records)/(total records)
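
A minimal sketch of these calculations, using small illustrative vectors rather than the 3000-record validation data:

import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = np.array([1, 1, 0, 0, 1, 0])       # illustrative actual classes
y_pred = np.array([1, 0, 0, 0, 1, 1])       # illustrative predicted classes

cm = confusion_matrix(y_true, y_pred)       # rows = actual, columns = predicted
err = 1 - accuracy_score(y_true, y_pred)    # overall error rate
print(cm, err)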

6
Cutoff for classification
Most algorithms classify via a 2-step process:
For each record,
1. Compute probability of belonging to class “1”
2. Compare to cutoff value, and classify accordingly

• Default cutoff value is 0.50


If >= 0.50, classify as “1”
If < 0.50, classify as “0”
• Can use different cutoff values
• Typically, error rate is lowest for cutoff = 0.50

7
Cutoff Table
Actual Class Prob. of "1" Actual Class Prob. of "1"
1 0.996 1 0.506
1 0.988 0 0.471
1 0.984 0 0.337
1 0.980 1 0.218
1 0.948 0 0.199
1 0.889 0 0.149
1 0.848 0 0.048
0 0.762 0 0.038
1 0.707 0 0.025
1 0.681 0 0.022
1 0.656 0 0.016
0 0.622 0 0.004

• If cutoff is 0.50: 11 records are classified as “1”


• If cutoff is 0.80: seven records are classified as “1”
8
Confusion Matrix for Different Cutoffs
Cut off Prob.Val. for Success (Updatable) 0.25

Classification Confusion Matrix


Predicted Class

Actual Class owner non-owner

owner 11 1
non-owner 4 8

Cut off Prob.Val. for Success (Updatable) 0.75

Classification Confusion Matrix


Predicted Class

Actual Class owner non-owner

owner 7 5
non-owner 1 11

9
Compute Outcome Measures

10
When One Class is More Important
In many cases it is more important to identify members of one class

– Tax fraud
– Credit default
– Response to promotional offer
– Detecting electronic network intrusion
– Predicting delayed flights

In such cases, we are willing to tolerate greater overall error, in return


for better identifying the important class for further attention

11
ROC curves

• ROC = Receiver Operating Characteristic


• Started in electronic signal detection theory (1940s - 1950s)
• Has become very popular in biomedical applications, particularly
radiology and imaging
• Also used in machine learning applications to assess classifiers
• Can be used to compare tests/procedures
ROC curves: simplest case

• Consider diagnostic test for a disease


• Test has 2 possible outcomes:
– ‘positive’ = suggesting presence of disease
– ‘negative’

• An individual can test either positive or negative for the disease


ROC Analysis
• True Positives = Test states you have the disease when you do have the
disease
• True Negatives = Test states you do not have the disease when you do not
have the disease
• False Positives = Test states you have the disease when you do not have
the disease
• False Negatives = Test states you do not have the disease when you do have the disease
Specific Example

[Figure: overlapping score distributions for patients with the disease and patients without the disease along the Test Result axis; a threshold divides patients called "negative" (left) from patients called "positive" (right).]
Some definitions ...
[Figures: the same pair of distributions, highlighting in turn the regions for True Positives, False Positives, True Negatives, and False Negatives on either side of the threshold.]
Moving the Threshold: right

[Figure: the threshold moved to the right along the Test Result axis — the "−" region grows and the "+" region shrinks.]
Moving the Threshold: left

[Figure: the threshold moved to the left along the Test Result axis — the "−" region shrinks and the "+" region grows.]
Threshold Value
• The outcome of a logistic regression model is a probability
• Often, we want to make a binary prediction
• We can do this using a threshold value t
• If P(y = 1) ≥ t, predict positive
• If P(y = 1) < t, predict negative
• What value should we pick for t?

23
Threshold Value
• Often selected based on which errors are “better”
• If t is large, predict positive rarely (when P(y=1) is large)
– More errors where we say negative, but it is actually positive
– Detects patients who are negative
• If t is small, predict negative rarely (when P(y=1) is small)
– More errors where we say positive, but it is actually negative
– Detects all patients who are positive
• With no preference between the errors, select t = 0.5
– Predicts the more likely outcome

24
Selecting a Threshold Value

• Compare actual outcomes to predicted outcomes using a confusion matrix


(classification matrix)

25
True disease state vs. Test result
                        Test negative              Test positive
No disease (D = 0)      ☺ correct (specificity)    X  Type I error, α (False +)
Disease (D = 1)         X  Type II error, β        ☺ correct (Power = 1 − β;
                           (False −)                  sensitivity)

Classification matrix: Meaning of each cell

27
Alternate Accuracy Measures

If “C1” is the important class,


Sensitivity = % of “C1” class correctly classified
Sensitivity = n1,1 / (n1,0+ n1,1 )
Specificity = % of “C0” class correctly classified
Specificity = n0,0 / (n0,0+ n0,1 )
False positive rate = % of predicted “C1’s” that were not “C1’s”
False negative rate = % of predicted “C0’s” that were not “C0’s”

28
Receiver Operator Characteristic (ROC) Curve

• True positive rate (sensitivity) on y-axis


– Proportion of positives labelled as positive
• False positive rate (1 − specificity) on x-axis
– Proportion of negatives labelled as positive
• Low Threshold
– Low specificity
– High sensitivity

29
Selecting a Threshold using ROC

• Captures all thresholds simultaneously


• High threshold
– High specificity
– Low sensitivity
• Low Threshold
– Low specificity
– High sensitivity

30
Thank You

31
Confusion Matrix and ROC-II

Dr A. RAMESH
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Receiver operating characteristic (ROC) curve


• Optimum threshold value

2
ROC analysis
• True Positive Fraction
– TPF = TP / (TP+FN)
– also called sensitivity
– true abnormals called abnormal by the
observer
• False Positive Fraction
– FPF = FP / (FP+TN)
• Specificity = TN / (TN+FP)
– True normals called normal by the observer
– FPF = 1 - specificity
Evaluating classifiers (via
their ROC curves)

Classifier A can’t
distinguish between
normal and abnormal.

B is better but makes


some mistakes.

C makes very few


mistakes.
“Perfect”
means no
false positives
and no false
negatives.
ROC analysis
• ROC = receiver operator/operating characteristic/curve
Area Under the ROC Curve (AUC)

7
Area Under the ROC Curve (AUC)

• What is a good AUC?


– Maximum of 1 (perfect prediction)

8
Area Under the ROC Curve (AUC)

• What is a good AUC?


• Maximum of 1 (perfect
prediction)
• Minimum of 0.5
(just guessing)

9
Selecting a Threshold using ROC

• Choose best threshold for best trade off


– cost of failing to detect positives
– costs of raising false alarms

10
ROC Plot
• A typical look of ROC plot with few points in it is shown in the following
figure.

• Note that the four corner points are the four extreme cases of classifiers

11
Interpretation of Different Points in ROC Plot
• The four points (A, B, C, and D)
• A: TPR = 1, FPR = 0, the ideal model, i.e., the perfect
classifier, no false results
• B: TPR = 0, FPR = 1, the worst classifier, not able to
predict a single instance
• C: TPR = 0, FPR = 0, the model predicts every instance
to be a Negative class, i.e., it is an ultra-conservative
classifier
• D: TPR = 1, FPR = 1, the model predicts every instance
to be a Positive class, i.e., it is an ultra-liberal classifier

12
Interpretation of Different Points in ROC Plot
• Let us interpret the different points in the ROC
plot.
• The points on the upper diagonal region
• All points that lie in the upper-diagonal region correspond to "good" classifiers, since their TPRs are higher than their FPRs
• Here, X is better than Z as X has higher TPR and
lower FPR than Z.
• If we compare X and Y, neither classifier is superior
to the other

13
Interpretation of Different Points in ROC Plot

• Let us interpret the different points in the ROC


plot.
• The points on the lower diagonal region
– The lower-diagonal triangle corresponds to classifiers that are worse than random classifiers
– For a classifier that is worse than random guessing, simply reversing its predictions gives good results
– W′(0.2, 0.4) is the better version of W(0.4, 0.2); W′ is a mirror reflection of W

14
Tuning a Classifier through ROC Plot
• Using ROC plot, we can compare two or more
classifiers by their TPR and FPR values and this
plot also depicts the trade-off between TPR
and FPR of a classifier.
• Examining ROC curves can give insights into
the best way of tuning parameters of
classifier.
• For example, in the curve C2, the result is
degraded after the point P.
• Similarly for the observation C1, beyond Q the
settings are not acceptable.

15
Comparing Classifiers trough ROC Plot
• We can use the concept of “area under
curve” (AUC) as a better method to
compare two or more classifiers.
• If a model is perfect, then its AUC = 1.
• If a model simply performs random
guessing, then its AUC = 0.5
• A model that is strictly better than another would have a larger value of AUC than the other.
• Here, C3 is best, and C2 is better than C1
as AUC(C3)>AUC(C2)>AUC(C1).

16
ROC curve
[Figure: ROC curve — True Positive Rate (sensitivity), 0–100%, on the y-axis vs. False Positive Rate (1 − specificity), 0–100%, on the x-axis.]
ROC curve comparison

[Figures: ROC curve comparison — a good test bows toward the top-left corner; a poor test stays near the diagonal. Axes: True Positive Rate vs. False Positive Rate.]
ROC curve extremes
Best test: the two score distributions don't overlap at all (the ROC curve passes through the top-left corner).
Worst test: the distributions overlap completely (the ROC curve follows the diagonal).

[Figures: the corresponding ROC curves, True Positive Rate vs. False Positive Rate.]
ROC curve extremes

20
Typical ROC

21
ROC curve extremes

22
Example

• Let us consider an application of logistic regression involving a direct mail


promotion being used by Simmons Stores.
• Simmons owns and operates a national chain of women’s apparel stores.
• Five thousand copies of an expensive four-color sales catalog have been
printed, and each catalog includes a coupon that provides a $50 discount
on purchases of $200 or more.
• The catalogs are expensive and Simmons would like to send them to only
those customers who have the highest probability of using the coupon.

Sources: Statistics for Business and Economics,11th Edition by David R. Anderson (Author), Dennis J.
Sweeney (Author), Thomas A. Williams (Author)

23
Variables

• Management thinks that annual spending at Simmons Stores and whether


a customer has a Simmons credit card are two variables that might be
helpful in predicting whether a customer who receives the catalog will use
the coupon.
• Simmons conducted a pilot study using a random sample of 50 Simmons
credit card customers and 50 other customers who do not have a
Simmons credit card.
• Simmons sent the catalog to each of the 100 customers selected.
• At the end of a test period, Simmons noted whether or not the customer used the coupon.

24
Data (10 customer out of 100)
Customer Spending Card Coupon
1 2.291 1 0
2 3.215 1 0
3 2.135 1 0
4 3.924 0 0
5 2.528 1 0
6 2.473 0 1
7 2.384 0 0
8 7.076 0 0
9 1.182 1 1
10 3.345 0 0

25
Explanation of Variables

• The amount each customer spent last year at Simmons is shown in


thousands of dollars and the credit card information has been coded as 1
if the customer has a Simmons credit card and 0 if not.
• In the Coupon column, a 1 is recorded if the sampled customer used the
coupon and 0 if not.

26
Loading data file and get some statistical detail

27
Method’s description

• Dataframe.describe(): This method is used to get basic statistical details


such as central tendency, dispersion and shape of dataset’s distribution.

• Numpy.unique(): This method gives unique values in particular column.

• Series.value_counts(): Returns object containing counts of unique values.

• ravel(): It will return one dimensional array with all the input array
elements.
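
A minimal sketch of these methods in use, assuming the Simmons data are in simmons.csv (an assumed file name):

import numpy as np
import pandas as pd

df = pd.read_csv('simmons.csv')             # hypothetical file name

print(df.describe())                        # basic statistical details
print(np.unique(df['Card']))                # unique values in one column
print(df['Coupon'].value_counts())          # counts of unique values
print(df[['Coupon']].values.ravel())        # flatten to a one-dimensional array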

28
Split dataset into training and testing sets

29
Building the model and predicting values
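
A minimal sketch of the split, the model fit and the predictions, continuing from the DataFrame above; the 75/25 split ratio and random_state are assumptions:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = df[['Spending', 'Card']]
y = df['Coupon']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
y_pred = clf.predict(X_test)                # class labels at the default 0.5 cutoff
y_prob = clf.predict_proba(X_test)[:, 1]    # P(y = 1) for each test customer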

30
Calculate probability of predicting data values

31
Summary for logistic model

32
Accuracy Checking

• By using accuracy_score function.


• By using confusion matrix

Predicted (0) Predicted (1)


Actual (0) True Negative(tn) False Positive(fp)
Actual (1) False Negative(fn) True Positive(tp)

33
Calculating Accuracy Score using Confusion Matrix

34
Generating Classification Report

• Recall gives us an idea


about when it’s actually
yes, how often does it
predict yes.
• Precision tells us about
when it predicts yes, how
often is it correct
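
A minimal sketch, reusing y_test and y_pred from the split above:

from sklearn.metrics import classification_report, confusion_matrix

print(confusion_matrix(y_test, y_pred))       # tn, fp / fn, tp layout
print(classification_report(y_test, y_pred))  # precision, recall and f1 per class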

35
Interpreting Classification Report

• Precision = tp / (tp + fp)

• Accuracy = (tp + tn) / (tp + tn + fp + fn)


Predicted (0) Predicted (1)
Actual (0) tn fp
• Recall= tp / (tp + fn)
Actual (1) fn tp

36
Thank You

37
Regression Analysis Model Building - I

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Introduction

• Model building is the process of developing an estimated regression


equation that describes the relationship between a dependent variable
and one or more independent variables.
• The major issues in model building are finding the proper functional form
of the relationship and selecting the independent variables to be included
in the model.

2
General Linear Regression Model
• Suppose we collected data for one dependent variable y and k
independent variables x1,x2, . . . , xk.
• Objective is to use these data to develop an estimated regression equation
that provides the best relationship between the dependent and
independent variables.

• zj (where j =1, 2, . . . , p) is a function of x1, x2, . . . , xk (the variables for


which data are collected).
• In some cases, each zj may be a function of only one x variable.

3
Simple first-order model with one predictor variable

4
Modelling Curvilinear Relationships

• To illustrate, let us consider the problem facing Reynolds, Inc., a


manufacturer of industrial scales and laboratory equipment.
• Managers at Reynolds want to investigate the relationship between length
of employment of their salespeople and the number of electronic
laboratory scales sold.
• Table in the next slide gives the number of scales sold by 15 randomly
selected salespeople for the most recent sales period and the number of
months each salesperson has been employed by the firm.

Sources: Statistics for Business and Economics,11th Edition by David R. Anderson (Author), Dennis J.
Sweeney (Author), Thomas A. Williams (Author)

5
Data
Scales Months
Sold Employed
275 41
296 106
317 76
376 104
162 22
150 12
367 85
308 111
189 40
235 51
83 9
112 12
67 6
325 56
189 19

6
Importing libraries and table

7
SCATTER DIAGRAM FOR THE REYNOLDS EXAMPLE

8
Python code for the Reynolds example: first-order model

9
First-order regression equation

10
Standardized residual plot for the Reynolds example: first-
order model

11
Standardized residual plot for the Reynolds example: first-
order model

12
Need for curvilinear relationship

• Although the computer output shows that the relationship is significant (


p-value .000) and that a linear relationship explains a high percentage of
the variability in sales (R-sq 78.1%), the standardized residual plot
suggests that a curvilinear relationship is needed.

13
Second-order model with one predictor variable

• Set Z1= x1 and Z2 = X2

14
New Data set

• The data for the MonthsSq independent variable is obtained by squaring


the values of Months.

15
Python output for the Reynolds example:
second-order model

16
Second-order regression model

17
Standardized residual plot for the Reynolds example:
second-order model

18
Interpretation second order model

• Figure corresponding standardized residual plot shows that the previous


curvilinear pattern has been removed.
• At the .05 level of significance, the computer output shows that the
overall model is significant ( p-value for the F test is 0.000)
• Note also that the p-value corresponding to the t-ratio for MonthsSq ( p-
value .002) is less than .05
• Hence we can conclude that adding MonthsSq to the model involving
Months is significant.
• With an R-sq(adj) value of 88.6%, we should be pleased with the fit
provided by this estimated regression equation.
19
Meaning of linearity in GLM

• In multiple regression analysis the word linear in the term “general linear
model” refers only to the fact that b0, b1, . . . , bp all have exponents of b1
• It does not imply that the relationship between y and the xi’s is linear.
• Indeed, we have seen one example of how equation general linear model
can be used to model a curvilinear relationship.

20
Thank you

21
Regression Analysis Model Building (Interaction)- II

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Incorporating Interaction of the independent variable to the regression


model
• Python demo

2
Interaction
• If the original data set consists of observations for y and two independent
variables x1 and x2, we can develop a second-order model with two predictor
variables by setting z1 = x1, z2= x2, z3=x12 , z4=x22 , and z5 = x1x2 in the general
linear model of equation
• The model obtained is

• In this second-order model, the variable z5 = x1x2 is added to account for the
potential effects of the two variables acting together.
• This type of effect is called interaction.

3
Example – Interaction

• A company introduces a new shampoo product.


• Two factors believed to have the most influence on sales are unit selling
price and advertising expenditure.
• To investigate the effects of these two variables on sales, prices of $2.00,
$2.50, and $3.00 were paired with advertising expenditures of $50,000
and $100,000 in 24 test markets.

Source: Statistics for Business and Economics,11th Edition by David R.


Anderson (Author), Dennis J. Sweeney (Author), Thomas A. Williams (Author)

4
Advertising
Expenditure Sales
Price ($1000s) (1000s)
2 50 478
2.5 50 373
3 50 335
2 50 473
2.5 50 358
3 50 329
2 50 456
2.5 50 360
3 50 322
2 50 437
2.5 50 365
3 50 342
2 100 810
2.5 100 653
3 100 345
2 100 832
2.5 100 641
3 100 372
2 100 800
2.5 100 620
3 100 390
2 100 790
2.5 100 670
3 100 393

5
MEAN UNIT SALES (1000s)

6
Interpretation of interaction

• Note that the sample mean sales corresponding to a price of $2.00 and an
advertising expenditure of $50,000 is 461,000, and the sample mean sales
corresponding to a price of $2.00 and an advertising expenditure of
$100,000 is 808,000.
• Hence, with price held constant at $2.00, the difference in mean sales
between advertising expenditures of $50,000 and $100,000 is 808,000 -
461,000 = 347,000 units.

7
Interpretation of interaction

• When the price of the product is $2.50, the difference in mean sales is
646,000 -364,000 = 282,000 units.
• Finally, when the price is $3.00, the difference in mean sales is 375,000 -
332,000 = 43,000 units.
• Clearly, the difference in mean sales between advertising expenditures of
$50,000 and $100,000 depends on the price of the product.
• In other words, at higher selling prices, the effect of increased advertising
expenditure diminishes.
• These observations provide evidence of interaction between the price and
advertising expenditure variables.
8
Importing Data

9
Mean unit sales (1000s) as a function of selling price

10
Mean unit sales (1000s) as a function of Advertising
Expenditure($1000s)

11
Need for study the interaction between variable

• When interaction between two variables is present, we cannot study the


effect of one variable on the response y independently of the other
variable.
• In other words, meaningful conclusions can be developed only if we
consider the joint effect that both variables have on the response.

12
Estimated regression equation, a general linear model
involving three independent variables (z1, z2, and z3)

13
Interaction variable

• The data for the PriceAdv independent variable is obtained by multiplying


each value of Price times the corresponding value of AdvExp.

14
New Model

15
New Model

16
Interpretation

• Because the model is significant ( p-value for the F test is 0.000) and the p-
value corresponding to the t test for PriceAdv is 0.000, we conclude that
interaction is significant given the linear effect of the price of the product
and the advertising expenditure.
• Thus, the regression results show that the effect of advertising xpenditure
on sales depends on the price.

17
Transformations Involving the Dependent Variable

Miles per
Gallon Weight
28.7 2289
29.2 2113
34.2 2180
27.9 2448
33.3 2026
26.4 2702
23.9 2657
30.5 2106
18.1 3226
19.5 3213
14.3 3607
20.9 2888

18
Importing data

19
Scatter diagram

20
Model 1

21
Standardized residual plot corresponding to the first-order
model.

22
Standardized residual plot corresponding to the first-order
model

23
Model 2

24
Residual plot for model 2

25
Residual plot of model 2

26
• The miles-per-gallon estimate is obtained by finding the number whose
natural logarithm is 3.2675.
• Using a calculator with an exponential function, or raising e to the power
3.2675, we obtain 26.2 miles per gallon.

27
Nonlinear Models That Are Intrinsically Linear

28
Thank You

29
χ² Test of Independence - I

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• To understand 2 Test of Independence

2
χ² Test of Independence

• It is used to analyze the frequencies of two variables with multiple


categories to determine whether the two variables are independent.
• Qualitative Variables
• Nominal Data

3
χ² Test of Independence: Investment Example
• In which region of the country do you reside?
A. Northeast B. Midwest C. South D. West
• Which type of financial investment are you most likely to make today?
E. Stocks F. Bonds G. Treasury bills

Contingency Table           Type of Financial Investment
                            E      F      G
Geographic   A                            o13     nA
Region       B                                    nB
             C                                    nC
             D                                    nD
                            nE     nF     nG      N
4
χ² Test of Independence: Investment Example

If A and F are independent: P(A) = nA/N and P(F) = nF/N

P(A ∩ F) = P(A) · P(F) = (nA/N)(nF/N)

eAF = N · P(A ∩ F) = N(nA/N)(nF/N) = nA·nF/N

Contingency Table           Type of Financial Investment
                            E      F      G
Geographic   A                     e12            nA
Region       B                                    nB
             C                                    nC
             D                                    nD
                            nE     nF     nG      N
5
χ² Test of Independence: Formulas

Expected frequencies:

eij = (ni)(nj)/N

where: i = the row
j = the column
ni = the total of row i
nj = the total of column j
N = the total of all frequencies

6
χ² Test of Independence: Formulas

Calculated (observed) value:

χ² = Σ (fo − fe)²/fe

df = (r − 1)(c − 1)

where: r = the number of rows
c = the number of columns

7
Example for Independence

8
χ² Test of Independence

Ho : Type of gasoline is
independent of income
Ha : Type of gasoline is not
independent of income

9
χ² Test of Independence

Type of Gasoline (r = 4, c = 3)
Income               Regular   Premium   Extra Premium
Less than $30,000
$30,000 to $49,999
$50,000 to $99,000
At least $100,000

10
χ² Test of Independence: Gasoline Preference Versus Income Category

α = .01
df = (r − 1)(c − 1) = (4 − 1)(3 − 1) = 6

χ²(.01, 6) = 16.812

If χ²Cal > 16.812, reject Ho.
If χ²Cal ≤ 16.812, do not reject Ho.

11
Python code
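
A minimal sketch of this test in Python, using the observed gasoline-preference frequencies shown on the next slide:

import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[ 85, 16,  6],
                     [102, 27, 13],
                     [ 36, 22, 15],
                     [ 15, 23, 25]])

chi2_stat, p, dof, expected = chi2_contingency(observed)
print(chi2_stat, dof, p)     # ≈ 70.78 with dof = 6 and p far below .01
print(expected)              # matches the hand-computed expected frequencies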

12
Gasoline Preference Versus Income Category:
Observed Frequencies

Type of Gasoline
Income               Regular   Premium   Extra Premium   Total
Less than $30,000       85        16           6          107
$30,000 to $49,999     102        27          13          142
$50,000 to $99,000      36        22          15           73
At least $100,000       15        23          25           63
Total                  238        88          59          385

13
Gasoline Preference Versus Income Category: Expected
Frequencies

eij = (ni)(nj)/N

e11 = (107)(238)/385 = 66.15
e12 = (107)(88)/385 = 24.46
e13 = (107)(59)/385 = 16.40

Type of Gasoline (expected frequencies in parentheses)
Income               Regular        Premium       Extra Premium   Total
Less than $30,000    (66.15)  85    (24.46) 16    (16.40)  6       107
$30,000 to $49,999   (87.78) 102    (32.46) 27    (21.76) 13       142
$50,000 to $99,000   (45.13)  36    (16.69) 22    (11.19) 15        73
At least $100,000    (38.95)  15    (14.40) 23    (9.65)  25        63
Total                 238            88            59              385
14
Gasoline Preference Versus Income Category: 2
Calculation

(f o −f f e)
2

 = 
2

(85 −6666 .15 ) + (16 −2424 .46) + (6 −16.40) +


2 2 2

= .15 .46 16.40


(102 87
− 87.78) + (27 −3232 .46) + (13 − 21.76) +
2 2 2

.78 .46 21.76


(36 −454513 . )+ (22 −1616 .69 ) + (15 −1119. )+
2 2 2

.13 .69 11.19


(15 −3838 .95) + (23 −1414 .40) + (25 − 9.65)
2 2 2

.95 .40 9.65


= 7075
.
15
Gasoline Preference Versus Income Category:
Conclusion

df = 6, α = 0.01
Critical value: χ²(.01, 6) = 16.812 (non-rejection region below 16.812)

χ²Cal = 70.78 > 16.812, so reject Ho.

16
Contingency Tables

Contingency Tables
• Useful in situations involving multiple population proportions
• Used to classify sample observations according to two or more
characteristics
• Also called a cross-classification table.

17
Contingency Table Example

Hand Preference vs. Gender


Dominant Hand: Left vs. Right
Gender: Male vs. Female

• 2 categories for each variable, so the table is called a 2 x 2 table

• Suppose we examine a sample of 300 college students

18
Contingency Table Example

Sample results organized in a contingency table:

Sample size n = 300:
• 120 Females, 12 were left handed
• 180 Males, 24 were left handed

                    Gender
Hand Preference   Female   Male   Total
Left                 12      24     36
Right               108     156    264
Total               120     180    300


19
Contingency Table Example

H0: π1 = π2 (Proportion of females who are left handed is equal to the


proportion of males who are left handed)
H1: π1 ≠ π2 (The two proportions are not the same; hand preference is not independent of gender)

• If H0 is true, then the proportion of left-handed females should be the


same as the proportion of left-handed males.
• The two proportions above should be the same as the proportion of left-
handed people overall.

20
The Chi-Square Test Statistic

The Chi-square test statistic is:

χ² = Σ(all cells) (fo − fe)²/fe

where:
fo = observed frequency in a particular cell
fe = expected frequency in a particular cell if H0 is true

χ² for the 2 x 2 case has 1 degree of freedom
Assumed: each cell in the contingency table has expected frequency of at least 5
21
The Chi-Square Test Statistic

The 2 test statistic approximately follows a chi-square


distribution with one degree of freedom

Decision Rule:
If 2 > 2U, reject H0,
otherwise, do not reject 
H0
0 Do not Reject H0 
reject H0 2U
22
Observed vs. Expected Frequencies
Gender
Hand
Female Male
Preference

Observed = 12 Observed = 24
Left 36
Expected = 14.4 Expected = 21.6

Observed = 108 Observed = 156


Right 264
Expected = 105.6 Expected = 158.4

120 180 300


The Chi-Square Test Statistic
Gender
Hand
Female Male
Preference

Observed = 12 Observed = 24
Left 36
Expected = 14.4 Expected = 21.6

Observed = 108 Observed = 156


Right 264
Expected = 105.6 Expected = 158.4
120 180 300
The test statistic is:

χ² = Σ(all cells) (fo − fe)²/fe
= (12 − 14.4)²/14.4 + (108 − 105.6)²/105.6 + (24 − 21.6)²/21.6 + (156 − 158.4)²/158.4
= 0.7576
24
The Chi-Square Test Statistic

The test statistic is  2 = 0.7576 , U2 with 1 d.f. = 3.841


Decision Rule:
If 2 > 3.841, reject H0, otherwise, do not
reject H0
Here,
2 = 0..7576 < 2U = 3.841,
=.05
so you do not reject H0 and
conclude that there is
0 Do not Reject H0  
insufficient evidence that the
reject H0
2U=3.841 two proportions are different.

25
χ² Test for the Differences Among More Than Two Proportions

• Extend the χ² test to the case with more than two independent populations:

H0: π1 = π2 = … = πc
H1: Not all of the πj are equal (j = 1, 2, …, c)

26
The Chi-Square Test Statistic

The Chi-square test statistic is:

χ² = Σ(all cells) (fo − fe)²/fe

where:
• fo = observed frequency in a particular cell of the 2 x c table
• fe = expected frequency in a particular cell if H0 is true
• χ² for the 2 x c case has (2 − 1)(c − 1) = c − 1 degrees of freedom

Assumed: each cell in the contingency table has expected frequency of at least 5
27
χ² Test with More Than Two Proportions: Example

The sharing of patient records is a controversial issue in health care. A survey


of 500 respondents asked whether they objected to their records being
shared by insurance companies, by pharmacies, and by medical researchers.
The results are summarized on the following table:

28
χ² Test with More Than Two Proportions: Example
Organization
Object to Insurance Pharmacies Medical
Record Companies Researchers
Sharing

Yes 410 295 335

No 90 205 165
χ² Test with More Than Two Proportions: Example
Organization
Object to Insurance Pharmacies Medical Row Sum
Record Companies Researchers
Sharing

Yes 410 295 335 1040

No 90 205 165 460

Column Sum   500   500   500   1500
χ² Test with More Than Two Proportions: Example

The overall proportion is:

p̄ = (X1 + X2 + ... + Xc)/(n1 + n2 + ... + nc) = (410 + 295 + 335)/(500 + 500 + 500) = 1040/1500 = 0.6933

Organization
Object to Record Insurance Pharmacies Medical
Sharing Companies Researchers

Yes fo = 410 fo = 295 fo = 335


fe = 346.667 fe = 346.667 fe = 346.667
No fo = 90 fo = 205 fo = 165
fe = 153.333 fe = 153.333 fe = 153.333
χ² Test with More Than Two Proportions: Example

Per-cell contributions (fo − fe)²/fe:

Object to Record Sharing   Insurance Companies   Pharmacies   Medical Researchers
Yes                             11.571              7.700          0.3926
No                              26.159             17.409          0.888

The Chi-square test statistic is:

χ² = Σ(all cells) (fo − fe)²/fe = 64.1196
χ² Test with More Than Two Proportions: Example
H0: π1 = π2 = π3
H1: Not all of the πj are equal (j = 1, 2, 3)

Decision Rule: If χ² > χ²U, reject H0; otherwise, do not reject H0.
χ²U = 5.991 is from the chi-square distribution with 2 degrees of freedom.

Conclusion: Since 64.1196 > 5.991, you reject H0 and conclude that at least one proportion of respondents who object to their records being shared is different across the three organizations

33
Thank You

34
χ² Test of Independence - II

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Using Python to test the independence of variables


• Understanding goodness of fit test for Poisson

2
Example

• Record of 50 students studying in ABN School is taken at random, the first


10 entries are like this:

res_num aa pe sm ae r g c
1 99 19 1 2 0 0 1
2 46 12 0 0 0 0 0
3 57 15 1 1 0 0 0
4 94 18 2 2 1 1 1
5 82 13 2 1 1 1 1
6 59 12 0 0 2 0 0
7 61 12 1 2 0 0 0
8 29 9 0 0 1 1 0
9 36 13 1 1 0 0 0
10 91 16 2 2 1 1 0

3
Example

Here :
• res_num = registration no.
• aa= academic ability
• pe = parent education
• sm = student motivation
• r = religion
• g = gender

4
Python code

5
Hypothesis

• Test the hypothesis that “gender and student motivation” are


independent

6
Python code

7
Observed values
Gender Student motivation
0 1 2 Row Sum
(Disagree ) (Not (Agree)
decided )

0 (Male) 10 13 6 29

1(Female ) 4 9 8 21

Column 14 22 14 50
Sum

8
Expected frequency (contingency table)

Gender    Student motivation
          0                   1       2
0         29×14/50 = 8.12     12.76   8.12
1         5.88                9.24    5.88

9
Frequency Table

Gender Student motivation


0 1 2

0 fo = 10 fo = 13 fo = 6
fe = 8.12 fe =12.76 fe =8.12
1 fo = 4 fo = 9 fo = 8
fe =5.88 fe =9.24 fe =5.88

10
Chi sq. calculation

χ² = Σ (fo − fe)²/fe

= 0.435 + 0.005 + 0.554 + 0.601 + 0.006 + 0.764
= 2.365

11
Python code
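
A minimal sketch of the same test with scipy, using the observed gender-by-motivation counts:

import numpy as np
from scipy.stats import chi2_contingency

# rows: gender (0 = male, 1 = female); columns: motivation (0, 1, 2)
observed = np.array([[10, 13, 6],
                     [ 4,  9, 8]])

chi2_stat, p, dof, expected = chi2_contingency(observed)
print(chi2_stat, dof, p)     # ≈ 2.365 with dof = 2
print(expected)              # matches the expected-frequency table above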

12
Python code

Degrees of
freedom =
(2-1)*(3-1)

13
Python code

Contingency
table

14
χ² Goodness of Fit Test

15
χ² Goodness-of-Fit Test

• The χ² goodness-of-fit test compares expected (theoretical) frequencies of categories from a population distribution to the observed (actual) frequencies from a distribution to determine whether there is a difference between what was expected and what was observed

16
χ² Goodness-of-Fit Test

χ² = Σ (fo − fe)²/fe

df = k − 1 − p

where: fo = frequency of observed values
fe = frequency of expected values
k = number of categories
p = number of parameters estimated from the sample data

17
Goodness of Fit Test: Poisson Distribution
1. Set up the null and alternative hypotheses.
H0: Population has a Poisson probability distribution
Ha: Population does not have a Poisson distribution

2. Select a random sample and


• Record the observed frequency fi for each value of the Poisson
random variable.
• Compute the mean number of occurrences μ.

3. Compute the expected frequency of occurrences ei


for each value of the Poisson random variable.

18
Goodness of Fit Test: Poisson Distribution

4. Compute the value of the test statistic:

χ² = Σ(i=1 to k) (fi − ei)²/ei

where:
fi = observed frequency for category i
ei = expected frequency for category i
k = number of categories

19
Goodness of Fit Test: Poisson Distribution
5. Rejection rule:
p-value approach: Reject H0 if p-value < α
Critical value approach: Reject H0 if χ² ≥ χ²α
where α is the significance level and there are k − 2 degrees of freedom

20
Goodness of Fit Test: Poisson Distribution
• Example: Parking Garage

In studying the need for an additional entrance to a city parking garage, a consultant has recommended an analysis approach that is applicable only in situations where the number of cars entering during a specified time period follows a Poisson distribution.

21
Goodness of Fit Test: Poisson Distribution
A random sample of 100 one- minute time intervals resulted in the
customer arrivals listed below. A statistical test must be conducted to
see if the assumption of a Poisson distribution is reasonable.

# Arrivals 0 1 2 3 4 5 6 7 8 9 10 11 12
Frequency 0 1 4 10 14 20 12 12 9 8 6 3 1

22
Goodness of Fit Test: Poisson Distribution

• Hypotheses
H0: Number of cars entering the garage during
a one-minute interval is Poisson distributed

Ha: Number of cars entering the garage during a


one-minute interval is not Poisson distributed

23
Python Code

24
Goodness of Fit Test: Poisson Distribution

• Estimate of the Poisson Probability Function

Total Arrivals = 0(0) + 1(1) + 2(4) + . . . + 12(1) = 600
Estimate of μ = 600/100 = 6
Total Time Periods = 100
Hence,

f(x) = 6^x e^(−6) / x!

25
Goodness of Fit Test: Poisson Distribution
• Expected Frequencies

x f (x ) nf (x ) x f (x ) nf (x )
0 .0025 .25 7 .1377 13.77
1 .0149 1.49 8 .1033 10.33
2 .0446 4.46 9 .0688 6.88
3 .0892 8.92 10 .0413 4.13
4 .1339 13.39 11 .0225 2.25
5 .1606 16.06 12+ .0201 2.01
6 .1606 16.06 Total 1.0000 100.00

26
Python code

27
Python code

28
Goodness of Fit Test: Poisson Distribution
• Observed and Expected Frequencies
i fi ei fi - ei
0 or 1 or 2 5 6.20 -1.20
3 10 8.92 1.08
4 14 13.39 0.61
5 20 16.06 3.94
6 12 16.06 -4.06
7 12 13.77 -1.77
8 9 10.33 -1.33
9 8 6.88 1.12
10 or more 10 8.39 1.61
29
Python code

30
Goodness of Fit Test: Poisson Distribution
• Rejection Rule
With α = .05 and k − p − 1 = 9 − 1 − 1 = 7 d.f. (where k = number of categories and p = number of population parameters estimated), χ²(.05) = 14.067
Reject H0 if p-value < .05 or χ² > 14.067.
• Test Statistic

χ² = (−1.20)²/6.20 + (1.08)²/8.92 + . . . + (1.61)²/8.39 = 3.268
6.20 8.92 8.39

31
Python code
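
A minimal sketch of the calculation, using the grouped observed and expected frequencies from the previous slides:

import numpy as np
from scipy.stats import chi2

f_obs = np.array([5, 10, 14, 20, 12, 12, 9, 8, 10])
f_exp = np.array([6.20, 8.92, 13.39, 16.06, 16.06, 13.77, 10.33, 6.88, 8.39])

stat = np.sum((f_obs - f_exp) ** 2 / f_exp)   # chi-square statistic
print(stat)                                   # ≈ 3.27 (3.268 with the lecture's rounding)
print(chi2.ppf(0.95, df=7))                   # critical value ≈ 14.067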

32
Goodness of Fit Test: Poisson Distribution

df = 7, α = 0.05
Critical value = 14.067 (non-rejection region below 14.067)

χ²Cal = 3.268 < 14.067, so do not reject Ho.

33
Thank You

34
Cluster analysis: Introduction - I

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Understanding cluster analysis and its purpose


• Introduction to types of data and how to handle them

2
Cluster Analysis

• Cluster analysis is the art of finding


groups in data
• In cluster analysis basically, one wants to
form groups in such a way that objects in
the same group are similar to each other,
whereas objects in different groups are
as dissimilar as possible

3
Cluster analysis

• The classification of similar objects into


groups is an important human activity, this is
part of the learning process
• i.e. A child learns to distinguish between cats
and dogs, between tables and chairs,
between men and women, by means of
continuously improving subconscious
classification schemes
• This explains why cluster analysis is often
considered as a branch of pattern recognition
and artificial intelligence

4
Example

• Lets illustrate with the help of an example:


• It is a plot of twelve objects, on which two variables were measured. For
instance, the weight of an object might be displayed on the vertical axis
and its height on the horizontal one

5
Example

• Because this example contains only two variables, we can investigate it by merely looking

at the plot

• In this small data set there are clearly two distinct groups of objects

• Such groups are called clusters, and to discover them is the aim of cluster analysis

6
Cluster and discriminant analysis

Cluster Analysis:
• Cluster Analysis is an unsupervised classification technique in the sense that it is applied to a dataset where patterns want to be discovered (i.e. groups of individuals or variables want to be found)
• No prior knowledge is needed for this grouping, and it is sensitive to several decisions that have to be taken (similarity/dissimilarity measures, clustering method, ...)

Discriminant Analysis:
• Discriminant Analysis (DA) is a statistical technique used to build a prediction model that classifies objects from a dataset depending on the features observed on them. In this case, the dependent variable is the grouping variable, which identifies to which group an object belongs
• This grouping variable should be known at the beginning, for the function to be built up. Sometimes DA is considered a supervised tool, as there is a previously known classification for the elements of the dataset

7
Cluster analysis and discriminant analysis

• Cluster analysis can be used not only to identify a structure already present in the data, but also to impose a structure on a more or less homogeneous data set that has to be split up in a “fair” way, for instance when dividing a country into telephone areas
• Cluster analysis is quite different from discriminant analysis in that it actually establishes the groups, whereas discriminant analysis assigns objects to groups that were defined in advance
Telephone area code for USA

8
Types of data and how to handle them

• Let us take an example: there are n objects to be clustered, which may be persons, flowers, words, countries, or anything
• Clustering algorithms typically operate on either of two input structures:
– The first represents the objects by means of p measurements or attributes, such as height, weight, sex, color, and so on
– These measurements can be arranged in an n-by-p matrix, where the rows correspond to the objects and the columns to the attributes

9
Example
(Figure: an n-by-p data matrix, with rows corresponding to objects and columns to attributes)

10
Types of data and how to handle them

• The second structure is a collection of proximities that must be available for all pairs of objects
• These proximities make up an n-by-n table, which is called a one-mode matrix because the row and column entities are the same set of objects
• We shall consider two types of proximities, namely dissimilarities (which measure how far away two objects are from each other) and similarities (which measure how much they resemble each other)

11
Type of data

• Interval-Scaled Variables
• In this situation the n objects are characterized by p continuous
measurements
• These values are positive or negative real numbers, such as height, weight,
temperature, age, cost, ..., which follow a linear scale
• For instance, the time interval between 1900 and 1910 was equal in length
to that between 1960 and 1970

Time scale in years

12
Type of data

• Also, it takes the same amount of energy to heat an object from −16.4°C to −12.4°C as to heat it from 35.2°C to 39.2°C
• In general it is required that intervals keep the same importance throughout the scale

13
Interval-Scaled Variables

• These measurements can be organized in an n-by-p matrix, where the rows correspond to the objects (or cases) and the columns correspond to the variables.
• When the fth measurement of the ith object is denoted by x_if (where i = 1, . . . , n and f = 1, . . . , p), this matrix looks like:

x_11  x_12  . . .  x_1p
x_21  x_22  . . .  x_2p
 .     .            .
x_n1  x_n2  . . .  x_np

14
Interval-Scaled Variables

• For example, take eight people, with their weight (in kilograms) and height (in centimetres)
• In this situation, n = 8 and p = 2.

Table 1:
Person   Weight (kg)   Height (cm)
A        15            95
B        49            156
C        13            95
D        45            160
E        85            178
F        66            176
G        12            90
H        10            78

15
Figure 1: Scatter plot of the eight persons, with weight (kg) on the horizontal axis and height (cm) on the vertical axis. The points form two groups: the small children (A, C, G, H) in the lower left and the adults (B, D, E, F) in the upper right.

16
Interval-Scaled Variables

• The units on the vertical axis are drawn to the same size as those on the horizontal axis, even though they represent different physical concepts
• The plot contains two obvious clusters, which can in this case be interpreted easily: the one consists of small children and the other of adults
• However, other variables might have led to a completely different clustering
• For instance, measuring the concentration of certain natural hormones might have yielded a clear-cut partition into male and female persons

17
Interval-Scaled Variables
• Let us now consider the effect of changing measurement units.
• If the weight and height of the subjects had been expressed in pounds and inches, the results would have looked quite different.
• A pound equals 0.4536 kg and an inch is 2.54 cm
• Therefore, Table 2 contains larger numbers in the column of weights and smaller numbers in the column of heights.

Table 2:
Person   Weight (lb)   Height (in)
A        33.1          37.4
B        108           61.4
C        28.7          37.4
D        99.2          63
E        187.4         70
F        145.5         69.3
G        26.5          35.4
H        22            30.7

18
Figure 2: The same eight persons, with weight (lb) on the horizontal axis and height (in) on the vertical axis; the plot looks much flatter than Figure 1.

19
Interpretation
• Although plotting essentially the same data as Figure 1, Figure 2 looks
much flatter
• In this figure, the relative importance of the variable “weight” is much
larger than in Figure 1
• As a consequence, the two clusters are not as nicely separated as in Figure
1 because in this particular example the height of a person gives a better
indication of adulthood than his or her weight. If height had been
expressed in feet (1 ft = 30.48 cm), the plot would become flatter still and
the variable “weight” would be rather dominant
• In some applications, changing the measurement units may even lead one
to see a very different clustering structure

20
Standardizing the data

• To avoid this dependence on the choice of measurement units, one has the option of standardizing the data
• This converts the original measurements to unitless variables
• First one calculates the mean value of variable f, given by:

m_f = (x_1f + x_2f + . . . + x_nf) / n,   for each f = 1, . . . , p

21
Standardizing the data

• Then one computes a measure of the dispersion or “spread” of this fth variable
• Generally, we use the standard deviation for this purpose

22
Standardizing the data

• However, this measure is affected very much by the presence of outlying values
• For instance, suppose that one of the x_if has been wrongly recorded, so that it is much too large
• In this case std_f will be unduly inflated, because x_if − m_f is squared
• Hartigan (1975, p. 299) notes that one needs a dispersion measure that is not too sensitive to outliers
• Therefore, we will use the mean absolute deviation

s_f = (|x_1f − m_f| + |x_2f − m_f| + . . . + |x_nf − m_f|) / n

where the contribution of each measurement x_if is proportional to the absolute value |x_if − m_f|
23
Standardizing the data

• Let us assume that s_f is nonzero (otherwise variable f is constant over all objects and must be removed)
• Then the standardized measurements, sometimes called z-scores, are defined by:

z_if = (x_if − m_f) / s_f

• They are unitless because both the numerator and the denominator are expressed in the same units
• By construction, the z_if have mean value zero and their mean absolute deviation is equal to 1

24
Standardizing the data

• When applying standardization, one forgets about the original data and
uses the new data matrix in all subsequent computations
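A minimal numpy sketch (not the lecture's own code) of this standardization, using the mean absolute deviation s_f as the denominator:

import numpy as np

def standardize(X):
    # z_if = (x_if - m_f) / s_f, with s_f the mean absolute deviation
    m = X.mean(axis=0)
    s = np.abs(X - m).mean(axis=0)
    return (X - m) / s

# Weight (kg) and height (cm) of the eight persons of Table 1
X = np.array([[15, 95], [49, 156], [13, 95], [45, 160],
              [85, 178], [66, 176], [12, 90], [10, 78]], dtype=float)
Z = standardize(X)
print(Z.mean(axis=0))          # ~0 for both variables
print(np.abs(Z).mean(axis=0))  # ~1 for both variables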

25
Detecting outlier

• The advantage of using s_f rather than std_f in the denominator of the z-score formula is that s_f will not be blown up as much in the case of an outlying x_if
• Hence the corresponding z_if will still be noticeable, so the ith object can be recognized as an outlier by the clustering algorithm, which will typically put it in a separate cluster

26
Standardizing the data
• The preceding description might convey the impression that
standardization would be beneficial in all situations.
• However, it is merely an option that may or may not be useful in a given
application
• Sometimes the variables have an absolute meaning, and should not be
standardized
• For instance, it may happen that several variables are expressed in the
same units, so they should not be divided by different sf
• Often standardization dampens a clustering structure by reducing the
large effects because the variables with a big contribution are divided by a
large sf

27
Thank you

28
Cluster analysis: Part - II

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Explain the effect of standardization (with the help of an example)
• Different ways of computing distances between the objects

2
Example

• Let's take four persons A, B, C, D with the following age and height:

Table 1:
Person   Age (yr)   Height (cm)
A        35         190
B        40         190
C        35         160
D        40         160

Figure 1: Scatter plot of age (horizontal axis) against height (vertical axis); A and B lie at height 190 cm, C and D at height 160 cm, forming two groups.

Finding Groups in Data: An Introduction to Cluster Analysis
Author(s): Leonard Kaufman, Peter J. Rousseeuw
March 1990, John Wiley & Sons, Inc.

3
Example

• In Figure 1 we can see two distinct clusters


• Let us standardize the data of Table 1
• The mean age equals m1 = 37.5 and the mean absolute deviation of the
first variable works out to be s1 = (2.5 + 2.5 + 2.5 + 2.5)/4 = 2.5
• Therefore, standardization converts age 40 to +1 ((40 − 37.5)/2.5 = 1) and age 35 to −1 ((35 − 37.5)/2.5 = −1)
• Analogously, m2 = 175 cm and s2 = (15 + 15 + 15 + 15)/4 = 15 cm, so 190
cm is standardized to +1 and 160 cm to - 1

4
Example
• The resulting data matrix, which is unitless, is given in Table 2
• Note that the new averages are zero and that the mean deviations equal 1

• Table 2
Person Variable 1 Variable 2
A 1 1
B -1 1
C 1 -1
D -1 -1

• Even when the data are converted to very strange units standardization will always yield
the same numbers

5
Example

• Plotting the values of Table 2 in Figure 2 does not give a very exciting result
• Figure 2 shows no clustering structure, because the four points lie at the vertices of a square
• One could say that there are four clusters, each consisting of a single point, or that there is only one big cluster containing four points
• Here standardizing is no solution

Figure 2: The standardized data; the four points (±1, ±1) lie at the corners of a square.

6
Choice of measurement (Units)- Merits and demerits

• The choice of measurement units gives rise to relative weights of the variables
• Expressing a variable in smaller units will lead to a larger range for that
variable, which will then have a large effect on the resulting structure
• On the other hand, by standardizing one attempts to give all variables an
equal weight, in the hope of achieving objectivity
• As such, it may be used by a practitioner who possesses no prior
knowledge

7
Choice of measurement- Merits and demerits

• However, it may well be that some variables are intrinsically more important than others in a particular application, and then the assignment of weights should be based on subject-matter knowledge
• On the other hand, there have been attempts to devise clustering
techniques that are independent of the scale of the variables

8
Distances computation between the objects
• The next step is to compute distances between the objects, in order to quantify their degree of dissimilarity
• It is necessary to have a distance for each pair of objects i and j.
• The most popular choice is the Euclidean distance:

d(i, j) = sqrt( (x_i1 − x_j1)² + (x_i2 − x_j2)² + . . . + (x_ip − x_jp)² )

• When the data are being standardized, one has to replace all x by z in this expression
• This formula corresponds to the true geometrical distance between the points with coordinates (x_i1, . . . , x_ip) and (x_j1, . . . , x_jp)

9
Example

• Let us consider the special case with p = 2 (Figure 3)
• The figure shows two points with coordinates (x_i1, x_i2) and (x_j1, x_j2)
• It is clear that the actual distance between objects i and j is given by the length of the hypotenuse of the triangle, yielding the expression in the previous slide by virtue of Pythagoras’ theorem
Figure 3: Illustration of the Euclidean distance formula

10
Distances computation between the objects

• Another well-known metric is the city block or Manhattan distance, defined by:

d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + . . . + |x_ip − x_jp|

11
Interpretation

• Suppose you live in a city where the streets are all north-south or east-
west, and hence perpendicular to each other
• Let Figure 3 be part of a street map of such a city, where the streets are
portrayed as vertical and horizontal lines

12
Interpretation

• Then the actual distance you would have to travel by car to get from location i to location j would total |x_i1 − x_j1| + |x_i2 − x_j2|

• This would be the shortest length among all possible paths from i to j

• Only a bird could fly straight from point i to point j, thereby covering the

Euclidean distance between these points

13
Mathematical Requirements of a Distance Function
• Both the Euclidean metric and the Manhattan metric satisfy the following
mathematical requirements of a distance function, for all objects i, j, and h:
• (D1) d(i, j) ≥ 0
• (D2) d(i, i) = 0
• (D3) d(i, j) = d(j, i)
• (D4) d(i, j) ≤ d(i, h) + d(h, j)
• Condition (D1) merely states that distances are nonnegative numbers and (D2) says
that the distance of an object to itself is zero
• Axiom (D3) is the symmetry of the distance function
• The triangle inequality (D4) looks a little bit more complicated, but is necessary to allow
a geometrical interpretation
• It says essentially that going directly from i to j is shorter than making a detour over
object h

14
Distances computation between the objects

• Note that d(i, j) = 0 does not necessarily imply that i = j, because it can very well
happen that two different objects have the same measurements for the
variables under study
• However, the triangle inequality implies that i and j will then have the
same distance to any other object h, because d(i, h) ≤ d(i, j) + d( j, h) = d(j,
h) and at the same time d( j, h) ≤ d( j, i) + d(i, h) = d(i, h), which together
imply that d(i, h) = d(j, h)

15
Minkowski distance

• A generalization of both the Euclidean and the Manhattan metric is the Minkowski distance, given by:

d(i, j) = ( |x_i1 − x_j1|^q + |x_i2 − x_j2|^q + . . . + |x_ip − x_jp|^q )^(1/q)

where q is any real number larger than or equal to 1 (written q here to avoid confusion with p, the number of variables)

• This is also called the L_q metric, with the Euclidean (q = 2) and the Manhattan (q = 1) as special cases

16
Example for Calculation of Euclidean and Manhattan Distance

• Let x1 = (1, 2) and x2 = (3, 5) represent two objects as in the given figure. The Euclidean distance between the two is sqrt(2² + 3²) = 3.61. The Manhattan distance between the two is 2 + 3 = 5.

Figure: 4
17
n- by- n Matrix
• For example, the Euclidean distances between the objects of the following table can be arranged as shown on the next slide:

Person   Weight (kg)   Height (cm)
A        15            95
B        49            156
C        13            95
D        45            160
E        85            178
F        66            176
G        12            90
H        10            78

• Euclidean distance between B and E:
• ((49 − 85)² + (156 − 178)²)^½ = 42.2

18
n- by- n Matrix
A B C D E F G H

A
B
C
D
E
F
G
H
19
Interpretation

• The distance between object B and object E can be located at the


intersection of the fifth row and the second column, yielding 42.2
• The same number can also be found at the intersection of the second row
and the fifth column, because the distance between B and E is equal to
the distance between E and B
• Therefore, a distance matrix is always symmetric
• Moreover, note that the entries on the main diagonal are always zero,
because the distance of an object to itself has to be zero

20
Distance matrix
• It would suffice to write down only the lower triangular half of the
distance matrix
A B C D E F G
B
C
D
E
F
G
H

21
Selection of variables

• It should be noted that a variable not containing any relevant information (say, the telephone number of each person) is worse than useless, because it will make the clustering less apparent.
• The occurrence of several such “trash variables” will kill the whole clustering, because they yield a lot of random terms in the distances, thereby hiding the useful information provided by the other variables.
• Therefore, such non-informative variables must be given a zero weight in the analysis, which amounts to deleting them

22
Selection of variables

• The selection of “good” variables is a nontrivial task and may involve quite
some trial and error (in addition to subject-matter knowledge and
common sense)
• In this respect, cluster analysis may be considered an exploratory
technique

23
Thank you

24
χ² Goodness of Fit Test

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Python demo for testing GOF for Poisson distribution


• Understanding goodness of fit test for:
– Uniform
– Normal
• Python demo for testing GOF for uniform and normal distribution

2
Goodness of fit for Uniform Distribution
Month Litres
• Milk Sales Data January 1,610
February 1,585
March 1,649
April 1,590
May 1,540
June 1,397
July 1,410
August 1,350
September 1,495
October 1,564
November 1,602
December 1,655
18,447

3
Hypotheses and Decision Rules
Ho: The monthly figures for milk sales are uniformly distributed
Ha: The monthly figures for milk sales are not uniformly distributed

α = .01
df = k − 1 − p = 12 − 1 − 0 = 11
χ²_.01,11 = 24.725

If χ²_Cal > 24.725, reject Ho.
If χ²_Cal ≤ 24.725, do not reject Ho.

4
Python code
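A minimal sketch, assuming scipy is available; scipy.stats.chisquare defaults to equal expected frequencies, which is exactly the uniform hypothesis:

from scipy.stats import chisquare

observed = [1610, 1585, 1649, 1590, 1540, 1397,
            1410, 1350, 1495, 1564, 1602, 1655]
stat, p_value = chisquare(observed)      # expected = 18447/12 in each month
print(round(stat, 2), p_value)           # ~74.37, p-value far below .01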

5
Calculations
Month       fo       fe         (fo − fe)²/fe
January     1,610    1,537.25   3.44
February    1,585    1,537.25   1.48
March       1,649    1,537.25   8.12
April       1,590    1,537.25   1.81
May         1,540    1,537.25   0.00
June        1,397    1,537.25   12.80
July        1,410    1,537.25   10.53
August      1,350    1,537.25   22.81
September   1,495    1,537.25   1.16
October     1,564    1,537.25   0.47
November    1,602    1,537.25   2.73
December    1,655    1,537.25   9.02
Total       18,447   18,447.00  74.37

fe = 18447 / 12 = 1537.25
χ²_Cal = 74.37
6
Python code

7
Conclusion
(Figure: chi-square distribution with df = 11; the rejection region of area 0.01 lies to the right of the critical value 24.725)

χ²_Cal = 74.37 > 24.725, reject Ho.

8
Goodness of Fit Test: Normal Distribution
1. Set up the null and alternative hypotheses.
2. Select a random sample and
a. Compute the mean and standard deviation.
b. Define intervals of values so that the expected frequency is at least 5 for
each interval.
c. For each interval record the observed frequencies
3. Compute the expected frequency, ei , for each interval.

9
Goodness of Fit Test: Normal Distribution
4. Compute the value of the test statistic:

χ² = Σ_{i=1}^{k} (f_i − e_i)² / e_i

5. Reject H0 if χ² > χ²_α (where α is the significance level and there are k − 3 degrees of freedom)

10
Normal Distribution Goodness of Fit Test
• Example: IQL Computers

IQL Computers manufactures and sells a general purpose microcomputer. As part of a study to evaluate sales personnel, management wants to determine, at the α = 0.05 significance level, if the annual sales volume (number of units sold by a salesperson) follows a normal probability distribution.

11
Normal Distribution Goodness of Fit Test

A simple random sample of 30 of the salespeople was taken and their numbers of units sold are below.

33 43 44 45 52 52 56 58 63 64
64 65 66 68 70 72 73 73 74 75
83 84 85 86 91 92 94 98 102 105
(mean = 71, sample standard deviation = 18.54)

12
Python code
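A minimal numpy sketch of step 2a, computing the sample mean and standard deviation:

import numpy as np

sales = [33, 43, 44, 45, 52, 52, 56, 58, 63, 64,
         64, 65, 66, 68, 70, 72, 73, 73, 74, 75,
         83, 84, 85, 86, 91, 92, 94, 98, 102, 105]
print(np.mean(sales))            # 71.0
print(np.std(sales, ddof=1))     # ~18.54 (sample standard deviation)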

13
Normal Distribution Goodness of Fit Test
• Hypotheses
H0: The population of number of units sold has a normal distribution with mean 71 and standard deviation 18.54
Ha: The population of number of units sold does not have a normal distribution with mean 71 and standard deviation 18.54

14
Normal Distribution Goodness of Fit Test
• Interval Definition

To satisfy the requirement of an expected frequency of at least 5 in each interval, we will divide the normal distribution into 30/5 = 6 equal-probability intervals.

15
Normal Distribution Goodness of Fit Test
• Interval Definition

Each interval has an area (probability) of 1.00/6 = .1667. With mean 71 and standard deviation 18.54, the interval boundaries are:

71 − .97(18.54) = 53.02
71 − .43(18.54) = 63.03
71
71 + .43(18.54) = 78.97
71 + .97(18.54) = 88.98
16
Python code
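A minimal sketch, assuming scipy, of how the six equal-probability interval boundaries can be computed (using exact z-values rather than the rounded .43 and .97, so the cut-offs differ slightly from the slide):

from scipy.stats import norm

mean, std = 71, 18.54
cuts = [norm.ppf(i / 6, loc=mean, scale=std) for i in range(1, 6)]
print([round(c, 2) for c in cuts])   # ~[53.06, 63.01, 71.0, 78.99, 88.94]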

17
Normal Distribution Goodness of Fit Test
• Observed and Expected Frequencies
i fi ei f i - ei
Less than 53.02 6 5 1
53.02 to 63.03 3 5 -2
63.03 to 71.00 6 5 1
71.00 to 78.97 5 5 0
78.97 to 88.98 4 5 -1
More than 88.98 6 5 1
Total 30 30

18
Python code
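A minimal sketch of steps 4 and 5 for the interval counts above:

from scipy.stats import chi2

f_obs = [6, 3, 6, 5, 4, 6]
f_exp = [5, 5, 5, 5, 5, 5]
stat = sum((o - e) ** 2 / e for o, e in zip(f_obs, f_exp))
df = 6 - 2 - 1                       # k - 3: two parameters were estimated
print(stat, chi2.sf(stat, df))       # 1.6, p-value ~0.66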

19
Normal Distribution Goodness of Fit Test
• Rejection Rule
With α = .05 and k − p − 1 = 6 − 2 − 1 = 3 d.f. (where k = number of categories and p = number of population parameters estimated), χ²_.05 = 7.815
Reject H0 if p-value < .05 or χ² > 7.815.

• Test Statistic

χ² = (1)²/5 + (−2)²/5 + (1)²/5 + (0)²/5 + (−1)²/5 + (1)²/5 = 1.600

Since χ² = 1.600 ≤ 7.815, we do not reject H0.

20
Thank you

21
Cluster analysis: Part - III

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Clustering analysis part III

2
Agenda

• Handling missing data


• Calculation of similarity and dissimilarity matrix

3
Handling missing data

• It often happens that not all measurements are actually available, so there
are some “holes” in the data matrix
• Such an absent measurement is called a missing value and it may have
several causes
• The value of the measurement may have been lost or it may not have
been recorded at all by oversight or lack of time

4
Handling missing data

• Sometimes the information is simply not available, for example the


birthdate of a foundling, or the patient may not remember whether he or
she ever had the measles, or it may be impossible to measure the desired
quantity due to the malfunctioning of some instrument
• In certain instances the question does not apply (such as the colour of hair
of a bald person) or there may be more than one possible answer (when
two experimenters obtain very different results)

5
Handling missing data

• How can we handle a data set with missing values?


• In a matrix we indicate the absent measurements by means of some code
• If there exists an object in the data set for which all measurements are
missing, there is really no information on this object so it has to be deleted
• Analogously, a variable consisting exclusively of missing values has to be
removed too

6
Handling missing data

• If the data are standardized, the mean value m_f of the fth variable is calculated by making use of the present values only
• The same goes for s_f: in the denominator, we must replace n by the number of non-missing values for that variable
• Of course, only terms for which the corresponding x_if is not missing are included

7
Handling missing data

• In the computation of distances (based on either the x_if or the z_if), similar precautions must be taken
• When calculating the distances d(i, j), only those variables are considered in the sum for which the measurements for both objects are present; subsequently the sum is multiplied by p and divided by the actual number of terms (in the case of Euclidean distances this is done before taking the square root)
• Such a procedure only makes sense when the variables are thought of as having the same weight (for instance, this can be done after standardization)

8
Handling missing data

• When computing these distances, one might come across a pair of objects
that do not have any common measured variables, so their distance
cannot be computed by means of the above mentioned approach.
• Several remedies are possible: One could remove either object or one
could fill in some average distance value based on the rest of the data
• Or by replacing all missing xif by the mean mf of that variable; then all
distances can be computed
• Applying any of these methods, one finally possesses a “full” set of
distances

9
Dissimilarities

• The entries of an n-by-n matrix may be Euclidean or Manhattan distances


• However, there are many other possibilities, so we no longer speak of
distances but of dissimilarities (or dissimilarity coefficients)
• Basically, dissimilarities are non-negative numbers d( i, j) that are small
(close to zero) when i and j are “near” to each other and that become
large when i and j are very different
• We shall usually assume that dissimilarities are symmetric and that the
dissimilarity of an object to itself is zero, but in general the triangle
inequality does not hold

10
Dissimilarities

• Dissimilarities can be obtained in several ways.


• Often they can be computed from variables that are binary, nominal,
ordinal, interval, or a combination of these
• Also, dissimilarities can be simple subjective ratings of how much certain
objects differ from each other, from the point of view of one or more
observers
• This kind of data is typical in the social sciences and in marketing

11
Example

• Fourteen postgraduate economics students (coming from different parts of the world) were asked to indicate the subjective dissimilarities between 11 scientific disciplines.
• All of them had to fill in a matrix like Table 4, where the dissimilarities had to be given as integer numbers on a scale from 0 (identical) to 10 (very different)
• The actual entries of the table on the next slide are the averages of the values given by the students

12
Example
• It appears that the smallest dissimilarity is perceived between
mathematics and computer science (1.43 ), whereas the most remote
fields were psychology and astronomy (9.36)

13
Dissimilarities

• If one wants to perform a cluster analysis on a set of variables that have been observed in some population, there are other measures of dissimilarity
• For instance, one can compute the (parametric) Pearson product-moment correlation coefficient between the variables f and g, or alternatively the (non-parametric) Spearman correlation

14
Dissimilarities

• Both coefficients lie between - 1 and + 1 and do not depend on the choice
of measurement units
• The main difference between them is that the Pearson coefficient looks
for a linear relation between the variables f and g, whereas the Spearman
coefficient searches for a monotone relation

15
Dissimilarities
• Correlation coefficients are useful for clustering purposes because they
measure the extent to which two variables are related
• Correlation coefficients, whether parametric or nonparametric, can be converted to dissimilarities d(f, g), for instance by setting:

d(f, g) = (1 − R(f, g)) / 2

• With this formula, variables with a high positive correlation receive a dissimilarity coefficient close to zero, whereas variables with a strongly negative correlation will be considered very dissimilar

16
Similarities

• The more objects i and j are alike (or close), the larger s(i, j) becomes
• Such a similarity s(i, j) typically takes on values between 0 and 1, where 0 means that i and j are not similar at all and 1 reflects maximal similarity
• Values in between 0 and 1 indicate various degrees of resemblance
• Often it is assumed that the following conditions hold:

0 ≤ s(i, j) ≤ 1,   s(i, i) = 1,   s(i, j) = s(j, i)
17
Similarities

• For all objects i and j , the numbers s(i, j) can be arranged in an n-by-n
matrix ,which is then called a similarity matrix
• Both similarity and dissimilarity matrices are generally referred to as
proximity matrices, or sometimes as resemblance
• In order to define similarities between variables, we can again resort to
the Pearson or the Spearman correlation coefficient
• However, neither correlation measure can be used directly as a similarity
coefficient because they also take on negative values

18
Similarities
• Some transformation is in order to bring the coefficients into the zero-one range
• There are essentially two ways to do this, depending on the meaning of the data and the purpose of the application
• If variables with a strong negative correlation are considered to be very different because they are oriented in the opposite direction (like mileage and weight of a set of cars), then it is best to take something like the following:

s(f, g) = (1 + R(f, g)) / 2

which yields s(f, g) = 0 whenever R(f, g) = −1.

19
Similarities

• There are situations in which variables with a strong negative correlation should be grouped, because they measure essentially the same thing
• For instance, this happens if one wants to reduce the number of variables in a regression data set by selecting one variable from each cluster
• In that case it is better to use a formula like

s(f, g) = |R(f, g)|

which yields s(f, g) = 1 when R(f, g) = −1

20
Similarities

• Suppose the data consist of a similarity matrix but one wants to apply a
clustering algorithm designed for dissimilarities
• Then it is necessary to transform the similarities into dissimilarities
• The larger the similarity s(i, j) between i and j, the smaller their
dissimilarity d(i, j) should be
• Therefore, we need a decreasing transformation, such as:

d(i, j) = 1 − s(i, j)

21
Binary Variables

• A contingency table for binary variables:

                Object j
                 1       0       sum
Object i    1    q       r       q + r
            0    s       t       s + t
    sum          q + s   r + t   p
22
Dissimilarity between two binary variables

• q → the number of variables that equal 1 for both objects i and j
• r → the number of variables that equal 1 for object i but 0 for object j
• s → the number of variables that equal 0 for object i but 1 for object j
• t → the number of variables that equal 0 for both objects i and j
• The total number of variables is p, where p = q + r + s + t.

23
Symmetric Binary Dissimilarity

d(i, j) = (r + s) / (q + r + s + t)
24
Asymmetric binary variable
• A binary variable is asymmetric if the outcomes of the states are not
equally important, such as the positive and negative outcomes of a
disease test.
• By convention, we shall code the most important outcome, which is
usually the rarest one, by 1 (e.g., HIV positive) and the other by 0 (e.g., HIV
negative).
• Given two asymmetric binary variables, the agreement of two 1s (a
positive match) is then considered more significant than that of two 0s (a
negative match).
• Therefore, such binary variables are often considered “monary” (as if
having one state).

25
Asymmetric binary dissimilarity (the negative matches t are ignored):

d(i, j) = (r + s) / (q + r + s)
26
Jaccard coefficient (similarity for asymmetric binary variables):

sim(i, j) = q / (q + r + s) = 1 − d(i, j)
27
Dissimilarity between binary variables

28
Dissimilarity between Jack and Mary
Jack

Mary    1    0
1       2    1
0       0    3

d(Jack, Mary) = (r + s)/(q + r + s) = (0 + 1)/(2 + 0 + 1) = 0.33

29
Dissimilarity between Jack and Jim

Jim     1    0
Jack 1  1    1
     0  1    3

d(Jack, Jim) = (1 + 1)/(1 + 1 + 1) = 0.67

30
Dissimilarity between Jim and Mary

Jim      1    0
Mary 1   1    2
     0   1    2

d(Jim, Mary) = (1 + 2)/(1 + 1 + 2) = 0.75

31
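A minimal Python sketch of the asymmetric binary dissimilarity d = (r + s)/(q + r + s); the 0/1 vectors below are hypothetical but consistent with the three contingency tables above:

def asym_binary_d(a, b):
    q = sum(x == 1 and y == 1 for x, y in zip(a, b))
    r = sum(x == 1 and y == 0 for x, y in zip(a, b))
    s = sum(x == 0 and y == 1 for x, y in zip(a, b))
    return (r + s) / (q + r + s)

jack = [1, 0, 1, 0, 0, 0]   # hypothetical six binary attributes
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]
print(round(asym_binary_d(jack, mary), 2))   # 0.33
print(round(asym_binary_d(jack, jim), 2))    # 0.67
print(round(asym_binary_d(jim, mary), 2))    # 0.75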
Thank you

32
Cluster analysis: Part - IV

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• How to handle the following types of variables :


– Interval scale variable
– Binary variables
– Categorical Variables
– Ordinal Variables
– Ratio-Scaled Variables
– Variables of mixed type

Categorical Variables
• A categorical variable is a generalization of the binary variable in that it
can take on more than two states
• For example, map color is a categorical variable that may have, say, five
states: red, yellow, green, purple, and blue

3
Categorical Variables

• Let the number of states of a categorical variable be M


• The states can be denoted by letters, symbols, or a set of integers, such as
1, 2,..., M
• Notice that such integers are used just for data handling and do not
represent any specific ordering

4
Categorical Variables

• “How is dissimilarity computed between objects described by categorical variables?”
Categorical Variables

• The dissimilarity between two objects i and j can be computed based on the ratio of mismatches:

d(i, j) = (p − m) / p

where m is the number of matches (i.e., the number of variables for which i and j are in the same state), and p is the total number of variables
• Weights can be assigned to increase the effect of m or to assign greater weight to the matches in variables having a larger number of states

6
Dissimilarity between categorical variables
• Suppose that we have the sample data shown in the table below, where each object is described by test-1 (categorical), test-2 (ordinal), and test-3 (ratio-scaled)
• Let only the object-identifier and the variable (or attribute) test-1 be available, which is categorical

Object identifier   test-1    test-2       test-3
1                   code-A    excellent    445
2                   code-B    fair         22
3                   code-C    good         164
4                   code-A    excellent    1210

Finding Groups in Data: An Introduction to Cluster Analysis
Author(s): Leonard Kaufman, Peter J. Rousseeuw
March 1990, John Wiley & Sons, Inc.

7
Dissimilarity matrix

     1   2   3   4
1    0
2    1   0
3    1   1   0
4    0   1   1   0

8
Dissimilarity between categorical variables

• Since here we have one categorical variable, test-1, we set p = 1 in the equation

d(i, j) = (p − m) / p

so that d(i, j) evaluates to 0 if objects i and j match, and 1 if the objects differ
• Thus, we get d(2,1) = (1 − 0)/1 = 1 and d(4,1) = (1 − 1)/1 = 0

9
Ordinal Variables

• A discrete ordinal variable resembles a categorical variable, except that the M states of the ordinal value are ordered in a meaningful sequence
• Ordinal variables are very useful for registering subjective assessments of qualities that cannot be measured objectively
• For example, professional ranks are often enumerated in a sequential order, such as Assistant, Associate, and Full Professor
• A continuous ordinal variable looks like a set of continuous data of an unknown scale; that is, the relative ordering of the values is essential but their actual magnitude is not

10
Ordinal Variables

• For example, the relative ranking in a particular sport (e.g., gold, silver,
bronze) is often more essential than the actual values of a particular
measure
• Ordinal variables may also be obtained from the discretization of interval-
scaled quantities by splitting the value range into a finite number of
classes
• The values of an ordinal variable can be mapped to ranks

11
Dissimilarity computation

• The treatment of ordinal variables is quite similar to that of interval-scaled variables when computing the dissimilarity between objects
• Suppose that f is a variable from a set of ordinal variables describing n objects
• The dissimilarity computation with respect to f involves the following steps:
• The value of f for the ith object is x_if, and f has M_f ordered states, representing the ranking 1, . . . , M_f
• Replace each x_if by its corresponding rank, r_if ∈ {1, . . . , M_f}

12
Standardization of ordinal variable

• Since each ordinal variable can have a different number of states, it is often necessary to map the range of each variable onto [0.0, 1.0] so that each variable has equal weight.
• This can be achieved by replacing the rank r_if of the ith object in the fth variable by:

z_if = (r_if − 1) / (M_f − 1)

14
Dissimilarity computation

• Dissimilarity can then be computed using any of the distance measures


described earlier (like that for interval data)

15
Example

• Suppose that we have the sample data of the table shown earlier, except that this time only the object-identifier and the continuous ordinal variable, test-2, are available
• There are three states for test-2, namely fair, good, and excellent, that is M_f = 3

16
Example

• For step 1, if we replace each value for test-2 by its rank, the four objects
are assigned the ranks 3, 1, 2, and 3, respectively

• Step 2 normalizes the ranking by mapping rank 1 to 0.0, rank 2 to 0.5, and
rank 3 to 1.0
• For step 3, we can use, say, the Euclidean distance, which results in the
following dissimilarity matrix:

17
Dissimilarity computation

Object → rank → normalized value:  1 → 3 → 1.0;  2 → 1 → 0.0;  3 → 2 → 0.5;  4 → 3 → 1.0

     1     2     3     4
1    0
2    1.0   0
3    0.5   0.5   0
4    0     1.0   0.5   0

18
Ratio-Scaled Variables

• A ratio-scaled variable makes a positive measurement on a nonlinear scale, such as an exponential scale, approximately following the formula

x = A e^(Bt)   or   x = A e^(−Bt)

where A and B are positive constants, and t typically represents time
• Common examples include the growth of a bacteria population or the decay of a radioactive element

19
Computing the dissimilarity between objects
• There are three methods to handle ratio-scaled variables for computing
the dissimilarity between objects:
1. Treat ratio-scaled variables like interval-scaled variables
– This, however, is not usually a good choice since it is likely that the
scale may be distorted
2. Apply logarithmic transformation to a ratio-scaled variable f having value
xif for object i by using the formula yif = log(xi f)
– The yif values can be treated as interval valued, Notice that for some
ratio-scaled variables, log-log or other transformations may be
applied, depending on the variable’s definition and the application

20
Computing the dissimilarity between objects

3. Treat xif as continuous ordinal data and treat their ranks as interval-valued

• The latter two methods are the most effective, although the choice of
method used may depend on the given application

21
Example

• This time, we have the sample data of the same table, except that only the object-identifier and the ratio-scaled variable, test-3, are available

22
Example

• Let's try a logarithmic transformation
• Taking the log of test-3 results in the values 2.65, 1.34, 2.21, and 3.08 for the objects 1 to 4, respectively
• Using the Euclidean distance on the transformed values, we obtain the following dissimilarity matrix:

     1     2     3     4
1    0
2    1.31  0
3    0.44  0.87  0
4    0.43  1.74  0.87  0

23
Variables of Mixed Types

• So far we have discussed how to compute the dissimilarity between


objects described by variables of the same type, where these types may
be either interval-scaled, symmetric binary, asymmetric binary,
categorical, ordinal, or ratio-scaled
• However, in many real databases, objects are described by a mixture of
variable types

24
Variables of Mixed Types

• In general, a database can contain all of the six variable types listed above
• “So, how can we compute the dissimilarity between objects of mixed
variable types?”
• One approach is to group each kind of variable together, performing a
separate cluster analysis for each variable type
– This is feasible if these analyses derive compatible results
– However, in real applications, it is unlikely that a separate cluster
analysis per variable type will generate compatible results

25
Variables of Mixed Types

• A more preferable approach is to process all variable types together,


performing a single cluster analysis
• One such technique combines the different variables into a single
dissimilarity matrix, bringing all of the meaningful variables onto a
common scale of the interval [0.0,1.0]

26
Variables of Mixed Types
• Suppose that the data set contains p variables of mixed type
• The dissimilarity d(i, j) between objects i and j is defined as

d(i, j) = ( Σ_{f=1}^{p} δ_ij^(f) d_ij^(f) ) / ( Σ_{f=1}^{p} δ_ij^(f) )

where the indicator δ_ij^(f) = 0 if either
– x_if or x_jf is missing (i.e., there is no measurement of variable f for object i or object j), or
– x_if = x_jf = 0 and variable f is asymmetric binary;
otherwise, δ_ij^(f) = 1

27
Variables of Mixed Types

• The contribution of variable f to the dissimilarity between i and j, that is, d_ij^(f), is computed dependent on its type:
• If f is interval-based:

d_ij^(f) = |x_if − x_jf| / (max_h x_hf − min_h x_hf)

where h runs over all non-missing objects for variable f
• If f is binary or categorical: d_ij^(f) = 0 if x_if = x_jf; otherwise d_ij^(f) = 1

28
Variables of Mixed Types

• If f is ordinal: compute the ranks r_if and z_if = (r_if − 1) / (M_f − 1), and treat z_if as interval-scaled
• If f is ratio-scaled: either perform a logarithmic transformation and treat the transformed data as interval-scaled; or treat f as continuous ordinal data, compute r_if and z_if, and then treat z_if as interval-scaled
• The above steps are identical to what we have already seen for each of the
individual variable types

29
Variables of Mixed Types

• The only difference is for interval-based variables, where here we


normalize so that the values map to the interval [0.0,1.0]
• Thus, the dissimilarity between objects can be computed even when the
variables describing the objects are of different types

30
Thank you

31
Cluster analysis: Part - V

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Dissimilarity matrix for mixed type variables


• Python demo for computing different types of distances
• Python demo for computing distance matrix for interval scaled data
Example

• Consider the data given in the table used earlier and compute a dissimilarity matrix for the objects of the table
• Now we will consider all of the variables, which are of different types

3
Example
• The procedures we followed for test-1 (which is categorical) and test-2 (which is ordinal) are the same as outlined above for processing variables of mixed types
• For the categorical variable: d_ij^(f) = 0 if x_if = x_jf, otherwise 1
• For the ordinal variable: replace each value by its rank r_if and compute z_if = (r_if − 1)/(M_f − 1)
• For the interval-scaled variable: d_ij^(f) = |x_if − x_jf| / (max_h x_hf − min_h x_hf)

4
Normalizing the interval scale data

• First, however, we need to complete some work for test-3 (which is ratio-scaled)
• We have already applied a logarithmic transformation to its values
• Based on the transformed values of 2.65, 1.34, 2.21, and 3.08 obtained for the objects 1 to 4, respectively, we let max_h x_h = 3.08 and min_h x_h = 1.34
• We then normalize the values in the dissimilarity matrix obtained for the ratio-scaled data by dividing each one by (3.08 − 1.34) = 1.74

5
Dissimilarity matrix for test-3

• This results in the following dissimilarity matrix for test-3:

Object identifier   Ratio-scaled data (x)   log(x)
1                   445                     2.65
2                   22                      1.34
3                   164                     2.21
4                   1210                    3.08

     1     2     3     4
1    0
2    0.75  0
3    0.25  0.50  0
4    0.25  1.00  0.50  0

• For objects 1 and 2: (2.65 − 1.34)/(3.08 − 1.34) = 0.75

6
dissimilarity matrices for the three variables

• We can now use the dissimilarity matrices for the three variables in our computation of the equation given earlier
• For example, we get d(2,1) = (1(1) + 1(1) + 1(0.75)) / 3 = 0.92
• The three matrices combined are the dissimilarity matrix for the categorical variable, the dissimilarity matrix for the ordinal variable, and the normalized dissimilarity matrix for the ratio-scaled data

7
Example

• The resulting dissimilarity matrix obtained for the data described by the three variables of mixed types is:

     1     2     3     4
1    0
2    0.92  0
3    0.58  0.67  0
4    0.08  1.00  0.67  0

8
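A minimal Python sketch of this mixed-type computation; the categorical labels code-A/B/C are assumptions consistent with the matrices above:

import math

test1 = ["code-A", "code-B", "code-C", "code-A"]       # categorical (assumed labels)
rank = [3, 1, 2, 3]                                    # ordinal ranks, Mf = 3
z = [(r - 1) / (3 - 1) for r in rank]                  # 1.0, 0.0, 0.5, 1.0
logx = [math.log10(v) for v in (445, 22, 164, 1210)]   # ratio-scaled, log-transformed
rng = max(logx) - min(logx)                            # 3.08 - 1.34 = 1.74

def d(i, j):
    d_cat = 0.0 if test1[i] == test1[j] else 1.0
    d_ord = abs(z[i] - z[j])
    d_rat = abs(logx[i] - logx[j]) / rng
    return (d_cat + d_ord + d_rat) / 3                 # all three weights are 1

print(round(d(1, 0), 2))   # 0.92, as computed above
print(round(d(3, 0), 2))   # 0.08, objects 1 and 4 are most similar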
Interpretation

• If we go back and look at Table of given data, we can intuitively guess that
objects 1 and 4 are the most similar, based on their values for test-1 and
test-2
• This is confirmed by the dissimilarity matrix, where d(4,1) is the lowest
value for any pair of different objects
• Similarly, the matrix indicates that objects 2 and 4 are the least similar

9
Distance Measurement using python - Euclidean Distance :
Python Demo for Euclidean Distance
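A minimal sketch, assuming scipy is installed, reproducing the earlier example with x1 = (1, 2) and x2 = (3, 5):

from scipy.spatial import distance

x1, x2 = (1, 2), (3, 5)
print(distance.euclidean(x1, x2))   # 3.605..., i.e. sqrt(2² + 3²)
print(distance.cityblock(x1, x2))   # 5, the Manhattan distance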
Distance Measurement using python – Minkowski Distance :

• p = 1: Manhattan distance
• p = 2: Euclidean distance
Python Demo for Minkowski Distance
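A minimal sketch showing that the Minkowski distance reduces to the Manhattan distance at p = 1 and to the Euclidean distance at p = 2:

from scipy.spatial import distance

x1, x2 = (1, 2), (3, 5)
print(distance.minkowski(x1, x2, p=1))   # 5.0 (Manhattan)
print(distance.minkowski(x1, x2, p=2))   # 3.605... (Euclidean)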
Dissimilarity matrix
Distance matrix calculation for Interval-Scaled Variables

• For example :
Person Weight(Kg) Height(cm)
• Take eight people, the weight (in A 15 95
kilograms) and the height (in centimetres) B 49 156
• In this situation, n = 8 and p = 2. C 13 95
D 45 160
E 85 178
F 66 176
G 12 90
H 10 78

15
Distance matrix calculation using Python
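A minimal sketch, assuming scipy, computing the full 8-by-8 Euclidean distance matrix for the weight/height data above:

import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[15, 95], [49, 156], [13, 95], [45, 160],
              [85, 178], [66, 176], [12, 90], [10, 78]], dtype=float)
D = squareform(pdist(X, metric="euclidean"))   # symmetric, zero diagonal
print(np.round(D, 1))                          # e.g. D[1, 4] = 42.2 (B to E)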
Thank You

19
K- Means Clustering

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Classification of clustering methods


• Partitioning method: K – means clustering

2
Classification of Clustering Methods

Clustering Methods
• Partitioning: K-Means, k-Medoids
• Hierarchical: Agglomerative
3
Which Clustering Algorithm to Choose

• The choice of a clustering algorithm depends on:
– the type of data available
– the particular purpose
• It is permissible to try several algorithms on the same data, because
cluster analysis is mostly used as a descriptive or exploratory tool

4
Partitioning Method
Given -
• a data set of n objects
• k, the number of clusters
• A partitioning algorithm organizes the objects into k partitions (k ≤ n),
where each partition represents a cluster.
• The clusters are formed to optimize an objective partitioning criterion
• Objective partitioning criterion such as a dissimilarity function based on
distance
• Therefore, the objects within a cluster are “similar,” whereas the objects
of different clusters are “dissimilar” in terms of the data set attributes.

5
Partitioning Method

• Partitioning methods are applied if one wants to classify the objects into k
clusters, where k is fixed.

6
K-Means Method

• It is a centroid based technique


• The k-means algorithm takes the input parameter, k, and partitions a set
of n objects into k clusters
• So that the resulting intra-cluster similarity is high but the inter-cluster
similarity is low
• Cluster similarity is measured in regard to the mean value of the objects in
a cluster, which can be viewed as the cluster’s centroid or center of gravity

7
Working Principle of K-Means Algorithm

8
Working Principle of K-Means Algorithm

• First it randomly selects k of the objects, each of which initially represents a cluster mean or center
• For each of the remaining objects, an object is assigned to the cluster to
which it is the most similar, based on the distance between the object and
the cluster mean
• It then computes the new mean for each cluster
• This process iterates until the criterion function converges

9
Working Principle of K-Means Algorithm
• Criterion function:

E = Σ_{i=1}^{k} Σ_{p∈C_i} |p − m_i|²

where
– E is the sum of the square error for all objects in the data set;
– p is the point in space representing a given object;
– m_i is the mean of cluster C_i (both p and m_i are multidimensional).
• For each object in each cluster, the distance from the object to its cluster
center is squared, and the distances are summed.
• This criterion tries to make the resulting k clusters as compact and as separate
as possible.

10
K=3

11
K-Means Clustering Algorithm

Algorithm: k-means. The k-means algorithm for partitioning, where each cluster’s center is represented by the mean value of the objects in the cluster.
• Input:
k: the number of clusters,
D: a data set containing n objects.
• Output: A set of k clusters.

12
K-Means Clustering Method

• Method:
(1) arbitrarily choose k objects from D as the initial cluster centers;
(2) repeat
(3) (re)assign each object to the cluster to which the object is the most
similar, based on the mean value of the objects in the cluster;
(4) update the cluster means, i.e., calculate the mean value of the objects for
each cluster;
(5) until no change;

13
K-Means clustering example

Individual Variable 1 Variable 2


1 1.0 1.0
2 1.5 2.0
3 3.0 4.0
4 5.0 7.0
5 3.5 5.0
6 4.5 5.0
7 3.5 4.5

14
K-Means clustering example

(Figure: scatter plot of the seven individuals, with Variable 1 on the horizontal axis and Variable 2 on the vertical axis)

15
K-Means clustering example

• Initialization: randomly choose two centroids (k = 2) for the two clusters. In this case the two centroids are:

Cluster   Var1   Var2
K1        1.0    1.0
K2        3.0    4.0

• Calculate the Euclidean distance using:

Distance[(x1, y1), (x2, y2)] = sqrt((x2 − x1)² + (y2 − y1)²)

16
K-Means clustering example

Distance of k1 from k1 (1.0, 1.0) = sqrt((1.0 − 1.0)² + (1.0 − 1.0)²) = 0
Distance of k1 from k2 (3.0, 4.0) = sqrt((3.0 − 1.0)² + (4.0 − 1.0)²) = 3.61
Distance of k2 from k2 (3.0, 4.0) = sqrt((3.0 − 3.0)² + (4.0 − 4.0)²) = 0

Centroid   K1     K2     Cluster Assignment
K1         0      3.61   k1
K2         3.61   0      k2

17
At K = 2

(Figure: scatter plot of the seven individuals with the two initial centroids at (1.0, 1.0) and (3.0, 4.0))

18
K-Means clustering example

• Calculate the Euclidean distance for the next data point (1.5, 2.0):

Distance from cluster 1 = sqrt((1.5 − 1.0)² + (2.0 − 1.0)²) = 1.12
Distance from cluster 2 = sqrt((1.5 − 3.0)² + (2.0 − 4.0)²) = 2.5

Dataset      Cluster 1   Cluster 2   Assignment
(1.5, 2.0)   1.12        2.5         k1

19
(Figure: scatter plot; point (1.5, 2.0) joins cluster k1)

20
K-Means clustering example

• Update the cluster centroid

Cluster Var1 Var2


K1 (1.0 + 1.5)/2 = 1.25 (1.0 + 2.0)/2 = 1.5
K2 3.0 4.0

21
K-Means clustering example

• Calculate the Euclidean distance for the next data point (5.0, 7.0):

Distance from cluster 1 = sqrt((5.0 − 1.25)² + (7.0 − 1.5)²) = 6.66
Distance from cluster 2 = sqrt((5.0 − 3.0)² + (7.0 − 4.0)²) = 3.61

Dataset      Cluster 1   Cluster 2   Assignment
(5.0, 7.0)   6.66        3.61        k2

22
(Figure: scatter plot; point (5.0, 7.0) joins cluster k2)

23
K-Means clustering example

• Update the cluster centroid

Cluster Var1 Var2


K1 1.25 1.5
K2 (3.0 + 5.0)/2 = 4 (4.0 + 7.0)/2 =5.5

24
K-Means clustering example

• Calculate the Euclidean distance for the next data point (3.5, 5.0):

Distance from cluster 1 = sqrt((3.5 − 1.25)² + (5.0 − 1.5)²) = 4.16
Distance from cluster 2 = sqrt((3.5 − 4.0)² + (5.0 − 5.5)²) = 0.71

Dataset      Cluster 1   Cluster 2   Assignment
(3.5, 5.0)   4.16        0.71        k2

25
(Figure: scatter plot; point (3.5, 5.0) joins cluster k2)

26
K-Means clustering example

• Update the cluster centroid

Cluster Var1 Var2


K1 1.25 1.5
K2 (3.0+5.0+ 3.5)/3 = (4.0+7.0 + 5.0)/3 =
3.83 5.33

27
K-Means clustering example

• Calculate the Euclidean distance for the next data point (4.5, 5.0):

Distance from cluster 1 = sqrt((4.5 − 1.25)² + (5.0 − 1.5)²) = 4.78
Distance from cluster 2 = sqrt((4.5 − 3.83)² + (5.0 − 5.33)²) = 0.75

Dataset      Cluster 1   Cluster 2   Assignment
(4.5, 5.0)   4.78        0.75        k2

28
(Figure: scatter plot; point (4.5, 5.0) joins cluster k2)

29
K-Means clustering example

• Update the cluster centroid

Cluster Var1 Var2


K1 1.25 1.5
K2 (3.0+5.0+3.5+4.5)/4= 4.00 (4.0+7.0+5.0+5.0)/4= 5.25

30
K-Means clustering example

• Calculate the Euclidean distance for the last data point (3.5, 4.5):

Distance from cluster 1 = sqrt((3.5 − 1.25)² + (4.5 − 1.5)²) = 3.75
Distance from cluster 2 = sqrt((3.5 − 4.0)² + (4.5 − 5.25)²) = 0.90

Dataset      Cluster 1   Cluster 2   Assignment
(3.5, 4.5)   3.75        0.90        k2

31
(Figure: scatter plot; point (3.5, 4.5) joins cluster k2, giving the final clusters {1, 2} and {3, 4, 5, 6, 7})

32
K-Means clustering example

• Update the cluster centroid

Cluster Var1 Var2


K1 1.25 1.5
K2 (3.0+5.0+3.5+4.5+3.5)/5= 3.9 (4.0+7.0+5.0+5.0+4.5)/5= 5.1

33
K-Means clustering example
Individual Variable 1 Variable 2 Assignment
1 1.0 1.0 1
2 1.5 2.0 1
3 3.0 4.0 2
4 5.0 7.0 2
5 3.5 5.0 2
6 4.5 5.0 2
7 3.5 4.5 2

34
Python code for K- Means Clustering
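A minimal sketch, assuming scikit-learn is installed, running k-means with k = 2 on the seven individuals of the worked example:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
              [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)            # individuals 1-2 in one cluster, 3-7 in the other
print(km.cluster_centers_)   # ~(1.25, 1.5) and (3.9, 5.1), as derived above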

35
Python code for K- Means Clustering

36
Python code

37
Thank you

38
Hierarchical method of clustering - I

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Introduction to Hierarchical clustering


• Partitioning Vs. Hierarchical

2
Introduction

• A hierarchical method creates a hierarchical decomposition of the given set of data objects
• A hierarchical clustering method works by grouping data objects into a
tree of clusters
• A hierarchical method can be classified as being either agglomerative or
divisive, based on how the hierarchical decomposition is formed
• The agglomerative approach, also called the bottom-up approach, starts
with each object forming a separate group

3
Introduction

• It successively merges the objects or groups that are close to one another,
until all of the groups are merged into one (the topmost level of the
hierarchy), or until a termination condition holds
• The divisive approach, also called the top-down approach, starts with all
of the objects in the same cluster
• In each successive iteration, a cluster is split up into smaller clusters, until
eventually each object is in one cluster, or until a termination condition
holds

4
Introduction

• Hierarchical methods suffer from the fact that once a step (merge or split)
is done, it can never be undone

• This rigidity is useful in that it leads to smaller computation costs by not


having to worry about a combinatorial number of different choices

• However, such techniques cannot correct erroneous decisions

5
Agglomerative and Divisive Hierarchical Clustering

Agglomerative:
• This bottom-up strategy starts by placing each object in its own cluster and then merges these atomic clusters into larger and larger clusters, until all of the objects are in a single cluster or until certain termination conditions are satisfied
• Most hierarchical clustering methods belong to this category

Divisive:
• This top-down strategy does the reverse of agglomerative hierarchical clustering by starting with all objects in one cluster
• It subdivides the cluster into smaller and smaller pieces, until each object forms a cluster on its own or until it satisfies certain termination conditions, such as a desired number of clusters being obtained or the diameter of each cluster being within a certain threshold

6
Agglomerative versus divisive hierarchical clustering

Figure: 1 Agglomerative and divisive hierarchical clustering on data objects{a,b,c,d,e}


7
Interpretation

• Figure 1 shows the application of AGNES (AGglomerative NESting), an agglomerative hierarchical clustering method, and DIANA (DIvisive
ANAlysis), a divisive hierarchical clustering method, to a data set of five
objects, {a,b,c,d,e}
• Initially, AGNES places each object into a cluster of its own
• The clusters are then merged step-by-step according to some criterion
• Let’s say for example, clusters C1 and C2 may be merged if an object in C1
and an object in C2 form the minimum Euclidean distance between any
two objects from different clusters

8
Interpretation

• This is a single-linkage approach in that each cluster is represented by all of the objects in the cluster, and the similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters
• The cluster merging process repeats until all of the objects are eventually
merged to form one cluster

9
Interpretation

• In DIANA, all of the objects are used to form one initial cluster
• The cluster is split according to some principle, such as the maximum
Euclidean distance between the closest neighboring objects in the cluster
• The cluster splitting process repeats until, eventually, each new cluster
contains only a single object
• In either agglomerative or divisive hierarchical clustering, the user can
specify the desired number of clusters as a termination condition

10
Dendrogram

Figure 2: Dendrogram representation for hierarchical clustering of data objects{a,b,c,d,e}


11
Dendrogram

• A tree structure called a dendrogram is commonly used to represent the process of hierarchical clustering
• It shows how objects are grouped together step by step
• Figure: 2 shows a dendrogram for the five objects presented in Figure:1 ,
where l =0 shows the five objects as singleton clusters at level 0
• At l =1, objects a and b are grouped together to form the first cluster, and
they stay together at all subsequent levels

12
Dendrogram

• We can also use a vertical axis to show the similarity scale between
clusters
• For example, when the similarity of two groups of objects, {a,b} and
{c,d,e}, is roughly 0.16, they are merged together to form a single cluster

13
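A minimal sketch, assuming scipy and matplotlib are installed, of single-linkage (nearest-neighbor) clustering and its dendrogram for five hypothetical points a-e:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[0.0, 0.0], [0.3, 0.1],               # a and b lie close together
              [2.0, 2.0], [2.2, 2.1], [2.1, 2.4]])  # c, d, e form a second group
Z = linkage(X, method="single")                     # merge by minimum distance
dendrogram(Z, labels=["a", "b", "c", "d", "e"])
plt.show()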
Measures for distance between clusters

• Four widely used measures for distance between clusters are as follows, where |p − p′| is the distance between two objects or points p and p′, m_i is the mean for cluster C_i, and n_i is the number of objects in C_i:
• Minimum distance: dmin(C_i, C_j) = min_{p∈C_i, p′∈C_j} |p − p′|
• Maximum distance: dmax(C_i, C_j) = max_{p∈C_i, p′∈C_j} |p − p′|
• Mean distance: dmean(C_i, C_j) = |m_i − m_j|
• Average distance: davg(C_i, C_j) = (1 / (n_i n_j)) Σ_{p∈C_i} Σ_{p′∈C_j} |p − p′|

14
Measures for distance between clusters

• When an algorithm uses the minimum distance, dmin(C_i, C_j), to measure the distance between clusters, it is sometimes called a nearest-neighbor clustering algorithm
• Moreover, if the clustering process is terminated when the distance
between nearest clusters exceeds an arbitrary threshold, it is called a
single-linkage algorithm
• If we view the data points as nodes of a graph, with edges forming a path
between the nodes in a cluster, then the merging of two clusters, Ci and Cj,
corresponds to adding an edge between the nearest pair of nodes in Ci
and Cj

15
Measures for distance between clusters
• Because edges linking clusters always go between distinct clusters, the
resulting graph will generate a tree
• Thus, an agglomerative hierarchical clustering algorithm that uses the
minimum distance measure is also called a minimal spanning tree
algorithm
• When an algorithm uses the maximum distance, dmax(Ci,Cj), to measure
the distance between clusters, it is sometimes called a farthest-neighbor
clustering algorithm
• If the clustering process is terminated when the maximum distance
between nearest clusters exceeds an arbitrary threshold, it is called a
complete-linkage algorithm

16
Measures for distance between clusters

• By viewing data points as nodes of a graph, with edges linking nodes, we can think of each cluster as a complete subgraph, that is, with edges connecting all of the nodes in the clusters
• The distance between two clusters is determined by the most distant
nodes in the two clusters
• Farthest-neighbor algorithms tend to keep the increase in the diameter of the clusters at each iteration as small as possible
• If the true clusters are rather compact and approximately equal in size, the
method will produce high-quality clusters
• Otherwise, the clusters produced can be meaningless
17
Choice of measurement

• The above minimum and maximum measures represent two extremes in measuring the distance between clusters
• They tend to be overly sensitive to outliers or noisy data
• The use of mean or average distance is a compromise between the
minimum and maximum distances and overcomes the outlier sensitivity
problem
• Whereas the mean distance is the simplest to compute, the average
distance is advantageous in that it can handle categorical as well as
numeric data

18
Illustration

Representation of some definitions of inter-cluster dissimilarity: (a) group average, (b) nearest neighbor, (c) furthest neighbor

19
Illustration

Figure: some types of clusters: (a) ball-shaped, (b) elongated,
(c) compact but not well separated

20
Difficulties with hierarchical clustering

• The hierarchical clustering method, though simple, often encounters
difficulties regarding the selection of merge or split points
• Such a decision is critical because once a group of objects is merged or
split, the process at the next step will operate on the newly generated
clusters
• It will neither undo what was done previously nor perform object
swapping between clusters

21
Difficulties with hierarchical clustering

• Thus merge or split decisions, if not well chosen at some step, may lead to
low-quality clusters
• Moreover, the method does not scale well, because each decision to
merge or split requires the examination and evaluation of a good number
of objects or clusters
• One way to improve the clustering quality of hierarchical methods is to
integrate hierarchical clustering with other clustering techniques, resulting
in multiple-phase clustering

22
Partitioning Vs. Hierarchical

23
K-means versus hierarchical clustering

24
K means versus hierarchical clustering

K-means clustering:
• Non-hierarchical methods such as k-means use a pre-specified number of
clusters; the method assigns records to clusters to find mutually exclusive
clusters of roughly spherical shape based on distance
• In this case, one can use the mean or median as a cluster centre to
represent each cluster

Hierarchical clustering:
• Hierarchical methods can be either agglomerative or divisive
• Agglomerative methods begin with 'n' clusters and sequentially merge
similar clusters until a single cluster is obtained

25
K means versus hierarchical clustering
K-means clustering:
• These methods are generally less computationally intensive and are
therefore preferred with very large datasets

Hierarchical clustering:
• Divisive methods work in the opposite direction, starting with one cluster
that includes all records
• Hierarchical methods are especially useful when the goal is to arrange the
clusters into a natural hierarchy

26
K means versus hierarchical clustering
K-means clustering:
• A partitioning (k-means) clustering is simply a division of the set of data
objects into non-overlapping subsets (clusters) such that each data object
is in exactly one subset

Hierarchical clustering:
• A hierarchical clustering is a set of nested clusters that are organized as a
tree

27
K means versus hierarchical clustering
Figure: un-nested clusters vs. nested clusters

Ashok, A.R., Prabhakar, C.R. and Dyaneshwar, P.A., Comparative Study on Hierarchical and Partitioning Data Mining Methods.

28
K means versus hierarchical clustering

• Hierarchical clustering does not assume a particular value of 'k', as needed
by k-means clustering
• The generated tree may correspond to a meaningful taxonomy
• Only a distance or "proximity" matrix is needed to compute the
hierarchical clustering

(Figure: proximity matrix)

29
K means versus hierarchical clustering
K-means clustering:
• In k-means clustering, since one starts with a random choice of clusters,
the results produced by running the algorithm multiple times might differ
• K-means is found to work well when the shape of the clusters is
hyper-spherical (like a circle in 2D, a sphere in 3D)

Hierarchical clustering:
• Results are reproducible in hierarchical clustering
• Hierarchical clustering does not work as well as k-means when the shape
of the clusters is hyper-spherical

30
K means versus hierarchical clustering
K-means clustering:
• K-means clustering requires prior knowledge of K, i.e., the number of
clusters into which one wants to divide the data

Hierarchical clustering:
• In hierarchical clustering one can stop at whatever number of clusters one
finds appropriate by interpreting the dendrogram

31
K means versus hierarchical clustering

https://stepupanalytics.com/difference-between-k-means-clustering-and-hierarchical-clustering/

32
Hierarchical clustering
Advantages
• Ease of handling of any forms of similarity or distance
• Consequently, applicability to any attributes types

33
Limitations of Hierarchical Clustering

• Hierarchical clustering requires the computation and storage of an n×n
distance matrix. For very large datasets, this can be expensive and slow
• The hierarchical algorithm makes only one pass through the data. This
means that records that are allocated incorrectly early in the process
cannot be reallocated subsequently
• Hierarchical clustering also tends to have low stability. Reordering data or
dropping a few records can lead to a different solution

34
Limitations of Hierarchical Clustering
• With respect to the choice of distance
between clusters, single and complete
linkage are robust to changes in the
distance metric (e.g., Euclidean, statistical
distance) as long as the relative ordering is
kept.
• In contrast, average linkage is more
influenced by the choice of distance metric,
and might lead to completely different
clusters when the metric is changed
• Hierarchical clustering is sensitive to outliers

35
Average-linkage clustering

• Compromise between Single and Complete Link

• Strengths
– Less susceptible to noise and outliers

• Limitations
– Biased towards globular clusters

36
Distance between two clusters

• Ward’s distance between clusters Ci and Cj is the difference between the total
within cluster sum of squares for the two clusters separately, and the within
cluster sum of squares resulting from merging the two clusters in cluster Cij

Dw(Ci, Cj) = Σ x∈Ci (x − ri)² + Σ x∈Cj (x − rj)² − Σ x∈Cij (x − rij)²

• ri: centroid of Ci
• rj: centroid of Cj
• rij: centroid of Cij
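
A small NumPy sketch of this quantity (a sketch under the slide's definition; ss() is an illustrative helper for the within-cluster sum of squares):

import numpy as np

def ss(C):
    # Sum of squared distances of the points in C to their centroid
    r = C.mean(axis=0)
    return ((C - r) ** 2).sum()

def ward_distance(Ci, Cj):
    Cij = np.vstack([Ci, Cj])             # the merged cluster Cij
    # Sign convention as on this slide; many references use the opposite
    # sign, ss(Cij) - ss(Ci) - ss(Cj), i.e. the increase in within-cluster
    # sum of squares caused by the merge
    return ss(Ci) + ss(Cj) - ss(Cij)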
37
Ward’s distance for clusters

• Similar to group average and centroid distance

• Less susceptible to noise and outliers

• Biased towards globular clusters

• Hierarchical analogue of k-means


– Can be used to initialize k-means

38
Hierarchical Clustering: Comparison
Figure: comparison of the hierarchies produced on the same points by
single linkage, complete linkage, group average, and Ward's method

39
K- means clustering

Advantages:
• The centre of mass can be found efficiently by finding the mean value of
each co-ordinate
• This leads to an efficient algorithm to compute the new centroids with a
single scan of the data

Disadvantages:
• K-means has problems when clusters are of differing sizes, densities, or
non-globular shapes, and when the data contains outliers

40
Similarity

• Two most popular methods: hierarchical agglomerative clustering and k-
means clustering
• In both cases, we need to define two types of distances: the distance
between two records and the distance between two clusters
• In both cases, there is a variety of metrics that can be used

41
Thank You

42
Measures of Attribute Selection

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Measures of attribute selection using


– Information Gain
– Gain ratio
– Gini Index

2
Example

• The following Table presents a training set, D, of class-labeled tuples
randomly selected from the AllElectronics customer database

Han, J., Pei, J. and Kamber, M., 2011. Data mining: concepts and
techniques. Elsevier.

3
Example

• In this example, each attribute is discrete-valued


• Continuous-valued attributes have been generalized
• The class label attribute, buys computer, has two distinct values (namely,
{yes, no}); therefore, there are two distinct classes (that is, m = 2)
• Let class C1 correspond to ‘yes’ and class C2 correspond to ‘no’.
• There are nine tuples of class ‘yes’ and five tuples of class ‘no’.
• A (root) node N is created for the tuples in D

4
Expected information needed to classify a tuple in D

• To find the splitting criterion for these tuples, we must compute the
information gain of each attribute
• Let us consider Class: buys computer as the decision criterion for D
• Calculate information:
• Info(D) = -py log2 (py) - pn log2 (pn)
• Where py is the probability of 'yes' and pn is the probability of 'no'
• Info(D) = -(9/14) log2 (9/14) - (5/14) log2 (5/14) = 0.940 bits
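
A quick check of this quantity in Python (a minimal sketch; the function name is illustrative, and both classes are assumed non-empty):

from math import log2

def info(n_yes, n_no):
    # Expected information (entropy) for a two-class data set
    n = n_yes + n_no
    py, pn = n_yes / n, n_no / n
    return -py * log2(py) - pn * log2(pn)

print(info(9, 5))   # 0.940 bits for the 9 'yes' / 5 'no' tuples in D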

5
Calculation of entropy for 'Youth'

• Age can be:


– youth
– Middle_aged
– Senior
• Youth

Youth Class: buys computer


Yes 2
No 3

6
Calculation of entropy for 'Youth'

• Calculate Entropy for youth:
• Entropy youth = -(2/5) log2 (2/5) - (3/5) log2 (3/5) = 0.971

• Middle_aged

middle Class: buys computer


Yes 4
No 0

7
Calculation of entropy for 'Middle Age'
• Calculate Entropy for middle_aged:
• Entropy middle_aged = -(4/4) log2 (4/4) - (0/4) log2 (0/4) = 0
(taking 0 log2 0 = 0)

• For Senior

Senior Class: buys computer


Yes 3
No 2

8
Calculate Entropy for senior

Calculate Entropy for senior:
Entropy senior = -(3/5) log2 (3/5) - (2/5) log2 (2/5) = 0.971

9
The expected information needed to classify a tuple in D
according to age

The expected information needed to classify a tuple in D if the tuples are
partitioned according to age is

Info age (D) = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.694 bits
10
Calculation information Gain of Age

• Gain of Age:
• Gain(age) = Info(D) - Info age (D) = 0.940 - 0.694 = 0.246 bits
11
Calculation information Gain of Income

• Calculation of gain for income:
• Income can be:
– High
– Medium
– Low

12
Calculate Entropy for high

• High :
High Class: buys computer
Yes 2
No 2

• Calculate Entropy for high:


= -(2/4)log2(2/4) - (2/4)log2(2/4)

13
Calculate Entropy for ‘medium’

• Medium:
Medium Class: buys computer
Yes 4
No 2

• Calculate Entropy for Medium:


= -(4/6)log2(4/6) - (2/6)log2(2/6)

14
Calculate Entropy for ‘low’

• Low :
Low Class: buys computer
No 1
Yes 3

• Calculate Entropy for Low:


= -(1/4)log2(1/4) - (3/4)log2(3/4)

15
Gain of income

• The expected information needed to classify a tuple in D if the tuples are
partitioned according to income is:
• Info income (D) = (4/14) ( -(2/4)log2(2/4) - (2/4)log2(2/4)) +
(6/14) ( -(4/6)log2(4/6) - (2/6)log2(2/6)) +
(4/14) (-(1/4)log2(1/4) - (3/4)log2(3/4))
= 0.911
Gain of income : Info(D) - Info income (D)
= 0.94 – 0.911 = 0.029

16
Calculation of gain for student

• Calculation of gain for student


• Student can be:
– Yes
– No

17
Calculate Entropy for No

• No :
No Class: buys computer
Yes 3
No 4

• Calculate Entropy for No:


= -(3/7)log2(3/7) - (4/7)log2(4/7)

18
Calculate Entropy for ‘Yes’

• Yes :
Yes Class: buys computer
Yes 6
No 1

• Calculate Entropy for Yes:


= -(6/7)log2(6/7) - (1/7)log2(1/7)

19
Gain of student

• The expected information needed to classify a tuple in D if the tuples are
partitioned according to student is:
• Info Student (D) = (7/14) (-(3/7)log2(3/7) - (4/7)log2(4/7)) +
(7/14) (-(6/7)log2(6/7) - (1/7)log2(1/7))
=0.789
• Gain(student) :
Info(D) - Info student (D)
= 0.94 – 0.789 = 0.151

20
Calculation of gain for credit rating

• Calculation of gain for credit rating


• Credit rating can be:
– Fair
– Excellent

21
Calculate Entropy for Fair
• Fair :
Fair Class: buys computer
Yes 6
No 2

• Calculate Entropy for Fair:


= -(6/8)log2(6/8) - (2/8)log2(2/8)

22
Calculate Entropy for Excellent

• Excellent :
Excellent Class: buys computer
Yes 3
No 3

• Calculate Entropy for Excellent:


= -(3/6)log2(3/6) - (3/6)log2(3/6)

23
Gain for credit rating

• The expected information needed to classify a tuple in D if the tuples are
partitioned according to Credit rating is:
• Info Credit rating (D) = (8/14) (-(6/8)log2(6/8) - (2/8)log2(2/8)) +
(6/14) (-(3/6)log2(3/6) - (3/6)log2(3/6))
=0.892
• Gain for credit rating :
Info(D) - Info Credit rating (D)
= 0.94 – 0.892 = 0.048

24
Independent variable Information gain
Age 0.246
Income 0.029
Student 0.151
Credit_rating 0.048
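
A short Python sketch that reproduces this table from the per-value class counts listed on the preceding slides (the dictionary layout is illustrative):

from math import log2

def info(counts):
    # Expected information for a list of class counts, e.g. [9, 5]
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

D = [9, 5]  # 'yes' and 'no' tuples in the full training set
partitions = {
    'age':           [[2, 3], [4, 0], [3, 2]],   # youth, middle_aged, senior
    'income':        [[2, 2], [4, 2], [3, 1]],   # high, medium, low
    'student':       [[3, 4], [6, 1]],           # no, yes
    'credit_rating': [[6, 2], [3, 3]],           # fair, excellent
}
n = sum(D)
for attr, parts in partitions.items():
    info_attr = sum(sum(p) / n * info(p) for p in parts)
    print(attr, round(info(D) - info_attr, 3))
# age 0.246, income 0.029, student 0.152, credit_rating 0.048
# (student differs from the table's 0.151 only through intermediate rounding)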

25
Selection of root classifier

• Because age has the highest information gain among the attributes, it is
selected as the splitting attribute
• Node N is labelled with age, and branches are grown for each of the
attribute’s values
• The tuples are then partitioned accordingly
• Notice that the tuples falling into the partition for age = middle aged all
belong to the same class
• Because they all belong to class “yes,” a leaf should therefore be created
at the end of this branch and labelled with “yes.”

26
Decision tree

27
Decision tree

• The final decision tree returned by the algorithm is shown in Figure

28
Thank You

29
Classification and Regression Trees (CART - I)

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Introduction to Classification and Regression Trees


• Attribute selection measures – Introduction

2
Introduction

• Classification is one form of data analysis that can be used to extract
models describing important data classes or to predict future data trends
• Classification predicts categorical (discrete, unordered) labels whereas
Regression analysis is a statistical methodology that is most often used for
numeric (continuous) prediction
• For example, we can build a classification model to categorize bank loan
applications as either safe or risky
• A regression model is used to predict expenditures in dollars of potential
customers on computer equipment given their income and occupation

3
Problem Description for Illustration

Han, J., Pei, J. and Kamber, M., 2011. Data mining: concepts and
techniques. Elsevier.

4
Root Node, Internal Node, Child Node
• A decision tree uses a tree structure to represent a number of possible
decision paths and an outcome for each path
• A decision tree consists of a root node, internal nodes and leaf nodes
• The topmost node in a tree is the root node or parent node; it represents
the entire sample population
• An internal node (non-leaf node) denotes a test on an attribute; each
branch represents an outcome of the test
• A leaf node (or terminal node or child node) holds a class label; it cannot
be further split

(Figure annotations: root node or parent node, internal node, child node)

5
Decision Tree Introduction

• A decision tree for the concept buys_computer, indicating whether a


customer at All Electronics is likely to purchase a computer

• Each internal (non-leaf) node


represents a test on an attribute
• Each leaf node represents a class
(either buys_computer = yes or
buys computer = no).

Figure 1.1 : Decision Tree


6
CART Introduction

• CART is a supervised learning technique

• CART adopts a greedy (i.e., non-backtracking) approach in which decision
trees are constructed in a top-down recursive divide-and-conquer manner

• It is a very interpretable model

7
Decision Tree Algorithm
Input:
• Data partition, D, which is a set of
training tuples and their associated
class labels;
• Attribute list, the set of candidate
attributes;
• Attribute selection method, a
procedure to determine the splitting
criterion that “best” partitions the data
tuples into individual classes. This
criterion consists of a splitting attribute
and, possibly, either a split point or
splitting subset.
Output: A decision tree

8
Decision Tree Algorithm

• The algorithm is called with three parameters: D, attribute list, and
Attribute selection method
• D is defined as a data partition. Initially, it is the complete set of training
tuples and their associated class labels
• The parameter attribute list is a list of attributes or independent variables
which are describing the tuples
• Attribute selection method specifies a heuristic procedure for selecting
the attribute that “best” discriminates the given tuples according to class

9
Decision Tree Algorithm

• This procedure employs an attribute selection measure, such as
information gain, gain ratio or the Gini index.
• Whether the tree is strictly binary is generally driven by the attribute
selection measure
• Some attribute selection measures, such as the Gini index, enforce the
resulting tree to be binary. Others, like information gain, do not, therein
allowing multiway splits (i.e., two or more branches to be grown from a
node).

10
Decision Tree Method

N-Node
C- Class
D- tuples in training data set

11
Decision Tree Method step 1 to 6
• The tree starts as a single node, N,
representing the training tuples in D (step
1).
• If the tuples in D are all of the same class,
then node N becomes a leaf and is
labelled with that class (steps 2 and 3)
• Steps 4 and 5 are terminating conditions
• Otherwise, the algorithm calls Attribute
selection method to determine the
splitting criterion
• The splitting criterion (like Gini) tells us
which attribute to test at node N by
determining the “best” way to separate
or partition the tuples in D into individual
classes (step 6)

12
Decision Tree Method - Step 7 - 11
• The splitting criterion indicates the splitting
attribute and may also indicate either a
split-point or a splitting subset
• The splitting criterion is determined so
that, ideally, the resulting partitions at each
branch are as “pure” as possible. A
partition is pure if all of the tuples in it
belong to the same class.
• The node N is labelled with the splitting
criterion, which serves as a test at the node
(step 7).
• A branch is grown from node N for each of
the outcomes of the splitting criterion.
• The tuples in D are partitioned accordingly
(steps 10 to 11)

13
Three possibilities for partitioning tuples based on the
splitting criterion
• There are three possible scenarios, as illustrated in Figure (a), (b) and (c).
• Let A be the splitting attribute. A has ‘v’ distinct values,{a1,a2,...,av}, based
on the training data
• If A is discrete-valued in figure (a), then one branch is grown for each
known value of A.

Figure (a)
14
Three possibilities for partitioning tuples based on the
splitting criterion
• If A is continuous-valued in figure (b), then two branches are grown,
corresponding to A ≤ split point and A > split point.
• Where split point is the split-point returned by Attribute selection method
as part of the splitting criterion.

Figure (b)

15
Three possibilities for partitioning tuples based on the
splitting criterion
• If A is discrete-valued and a binary tree must be produced, then the test is
of the form A ∈ 𝑆𝐴 , where 𝑆𝐴 is the splitting subset for A.

Figure (c)

16
Decision Tree Method – termination condition

• The algorithm uses the same process recursively to form a decision tree
for the tuples at each resulting partition, 𝐷𝑗 , of D (step 14).

• The recursive partitioning stops only when any one of the following
terminating conditions is true:

1. All of the tuples in partition D (represented at node N) belong to the
same class (steps 2 and 3), or

17
Decision Tree Method – termination condition
2. There are no remaining attributes on which the tuples may be further
partitioned (step4).
• In this case, majority voting is employed(step 5).
• This involves converting node N into a leaf and labelling it with the most
common class in D.
• Alternatively, the class distribution of the node tuples may be stored.
3. There are no tuples for a given branch, that is, a partition Dj is empty (step
12).
• In this case, a leaf is created with the majority class in D (step 13).
• The resulting decision tree is returned (step 15).

18
Attribute Selection Measures

• Attribute selection measures are also known as splitting rules because


they determine how the tuples at a given node are to be split
• It is a heuristic approach for selecting the splitting criterion that “best”
separates a given data partition, D, of class-labeled training tuples into
individual classes
• The attribute selection measure provides a ranking for each attribute
describing the given training tuples
• The attribute having the best score for the measure is chosen as the
splitting attribute for the given tuples

19
Attribute Selection Measures
• If the splitting attribute is continuous-valued or if we are restricted to binary
trees then, respectively, either a ‘split point’ or a ‘splitting subset’ must also be
determined as part of the splitting criterion

• There are three popular attribute selection measures


– information gain,
– gain ratio, and
– Gini index

• CART algorithm uses information gain and Gini index measure for attribute
selection

20
Attribute Selection Measures

21
Information Gain

• This measure studied the value or “information content” of messages


• The attribute with the highest information gain is chosen as the splitting
attribute for node
• This attribute minimizes the information needed to classify the tuples in
the resulting partitions and reflects the least randomness or “impurity” in
these partitions
• This approach minimizes the expected number of tests needed to classify
a given tuple

22
Information Gain-Entropy Measure
• The expected information needed to classify a
tuple in D is given by

Info(D) = - Σ (i = 1 to m) pi log2 (pi)

• Where pi is the probability that an arbitrary
tuple in D belongs to class Ci and is estimated
by |Ci,D| / |D|.
• A log function to the base 2 is used, because
the information is encoded in bits
• Info(D) (or Entropy of D )is just the average
amount of information needed to identify the
class label of a tuple in D

23
Attribute Selection Measures

• It is quite likely that the partitions will be impure (e.g., where a partition
may contain a collection of tuples from different classes rather than from
a single class).
• How much more information would we still need (after the partitioning) in
order to arrive at an exact classification?
• This amount is measured by

Info A (D) = Σ (j = 1 to v) (|Dj| / |D|) × Info(Dj)

• The term |Dj| / |D| acts as the weight of the jth partition. Info A (D) is
the expected information required to classify a tuple from D based on the
partitioning by A.
24
Information Gain

• The smaller the expected information (still) required, the greater the
purity of the partitions
• Information gain is defined as the difference between the original
information requirement (i.e., based on just the proportion of classes) and
the new requirement (i.e., obtained after partitioning on A). That is,

Gain(A) = Info(D) - Info A (D)

• The attribute A with the highest information gain, Gain(A), is chosen as
the splitting attribute at node N.

25
Gini Index
• The Gini index is used to measure the
impurity of D, a data partition or set
of training tuples, as

Gini(D) = 1 - Σ (i = 1 to m) pi²

• Where pi is the probability that a tuple
in D belongs to class Ci and is
estimated by |Ci,D| / |D|.
• The sum is computed over m classes.
• The Gini index considers a binary split
for each attribute

26
Gini Index

• When considering a binary split, we compute a weighted sum of the
impurity of each resulting partition
• For example, if a binary split on A partitions D into D1 and D2, the Gini
index of D given that partitioning is

Gini A (D) = (|D1| / |D|) Gini(D1) + (|D2| / |D|) Gini(D2)
• For each attribute, each of the possible binary splits is considered


• For a discrete-valued attribute, the subset that gives the minimum Gini
index for that attribute is selected as its splitting subset

27
Gini Index
• For continuous-valued attributes, each possible split-point must be considered
• The strategy is similar where the midpoint between each pair of (sorted)
adjacent values is taken as a possible split-point.
• For a possible split-point of A, 𝐷1 is the set of tuples in D satisfying A ≤ split
point, and 𝐷2 is the set of tuples in D satisfying A > split point.
• The reduction in impurity that would be incurred by a binary split on a
discrete- or continuous-valued attribute A is

ΔGini(A) = Gini(D) - Gini A (D)
• The attribute that maximizes the reduction in impurity (or, equivalently, has
the minimum Gini index) is selected as the splitting attribute

28
Which attribute selection measure is the best?

• All measures have some bias.


• The time complexity of decision tree generally increases exponentially
with tree height
• Hence, measures that tend to produce shallower trees (e.g., with
multiway rather than binary splits, and that favour more balanced splits)
may be preferred.
• However, some studies have found that shallow trees tend to have a large
number of leaves and higher error rates
• Several comparative studies suggest that no one attribute selection
measure is significantly superior to the others.
29
Tree Pruning

• When a decision tree is built, many of the branches will reflect anomalies
in the training data due to noise or outliers
• Tree pruning use statistical measures to remove the least reliable
branches
• Pruned trees tend to be smaller and less complex and, thus, easier to
comprehend
• They are usually faster and better at correctly classifying independent test
data than unpruned trees

30
How does Tree Pruning Work?

• There are two common approaches to tree pruning: pre-pruning and post-
pruning.
• In the pre-pruning approach, a tree is “pruned” by halting its construction
early (e.g., by deciding not to further split or partition the subset of
training tuples at a given node).
• When constructing a tree, measures such as statistical significance,
information gain, Gini index can be used to assess the goodness of a split.

31
How does Tree Pruning Work?

• The post-pruning approach removes subtrees from a "fully grown" tree
• A subtree at a given node is pruned by removing its branches and
replacing it with a leaf
• The leaf is labelled with the most frequent class among the subtree being
replaced
• For example, consider the subtree at node "A3?" in the unpruned tree of
Figure 1.2
• The most common class within this subtree is "class B"
• In the pruned version of the tree, the subtree in question is pruned by
replacing it with the leaf "class B"
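
The later demos use scikit-learn, whose DecisionTreeClassifier offers a form of post-pruning through cost-complexity pruning; a hedged sketch (this is scikit-learn's mechanism, not the exact subtree-replacement procedure described above):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
# A larger ccp_alpha prunes more aggressively, collapsing weak subtrees into leaves
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)
print(full_tree.get_n_leaves(), pruned_tree.get_n_leaves())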

32
How does Tree Pruning Work?

Figure: 1.2 An unpruned decision tree and a post-pruned decision tree

33
THANK YOU

34
Attribute selection Measures in CART : II

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Attribute selection measures:


– Gain Value
– Gain ratio
– Gini Index

2
Gain Ratio

• The information gain measure is biased toward tests with many outcomes
• That is, it prefers to select attributes having a large number of values
• For example, consider an attribute that acts as a unique identifier, such as
product ID.
• A split on product ID would result in a large number of partitions (as many
as there are values), each one containing just one tuple

3
Gain Ratio

• Because each partition is pure, the information required to classify data set
D based on this partitioning would be Info product_ID (D) = 0
• Information Gain = Info(D) - Info product_ID (D), which is maximal
• Therefore, the information gained by partitioning on this attribute is
maximal
• Clearly, such a partitioning is useless for classification
• Gain ratio is an extension to information gain which attempts to
overcome this bias

4
Split information
• It applies a kind of normalization to information gain using a "split
information" value defined analogously with Info(D) as:

SplitInfo A (D) = - Σ (j = 1 to v) (|Dj| / |D|) log2 (|Dj| / |D|)

• Dj = the jth partition
• D = the data set
• This value represents the potential information generated by splitting the
training data set, D, into v partitions, corresponding to the v outcomes of a
test on attribute A

5
Gain ratio

• Gain ratio differs from information gain, which measures the information
with respect to classification that is acquired based on the same
partitioning
• The gain ratio is defined as

GainRatio(A) = Gain(A) / SplitInfo A (D)

• The attribute with the maximum gain ratio is selected as the splitting
attribute

6
Gain Ratio example

• Consider the previous example for the computation of the gain ratio for
the attribute income
• A test on income splits the
data of the following Table into
three partitions, namely low,
medium, and high, containing
four, six, and four
tuples, respectively
Han, J., Pei, J. and Kamber, M., 2011. Data mining: concepts and
techniques. Elsevier.

7
Calculate Entropy for high

• High :
High Class: buys computer
Yes 2
No 2

• Calculate Entropy for high:


= -(2/4)log2(2/4) - (2/4)log2(2/4)

8
Calculate Entropy for ‘medium’

• Medium:
Medium Class: buys computer
Yes 4
No 2

• Calculate Entropy for Medium:


= -(4/6)log2(4/6) - (2/6)log2(2/6)

9
Calculate Entropy for ‘low’

• Low :

Low Class: buys computer


Yes 3
No 1

• Calculate Entropy for Low:


= - (3/4)log2(3/4) -(1/4)log2(1/4)

10
Calculate Entropy for buying class D

• Calculate information:
• Info(D) = -py log2 (py) - pn log2 (pn)
• Where py is the probability of 'yes' and pn is the probability of 'no'
• Info(D) = -(9/14) log2 (9/14) - (5/14) log2 (5/14) = 0.940 bits

11
Gain of income

• The expected information needed to classify a tuple in D if the tuples are
partitioned according to income is:
• Info income (D) = (4/14) ( -(2/4)log2(2/4) - (2/4)log2(2/4)) +
(6/14) ( -(4/6)log2(4/6) - (2/6)log2(2/6)) +
(4/14) (-(1/4)log2(1/4) - (3/4)log2(3/4))
= 0.911 bits
Gain of income : Info(D) - Info income (D)
= 0.94 – 0.911 = 0.029

12
Gain-Ratio(income)

• Calculation of split information:

SplitInfo income (D) = -(4/14) log2 (4/14) - (6/14) log2 (6/14) - (4/14) log2 (4/14)
= 1.557

• Therefore, Gain-Ratio(income) = 0.029 / 1.557 = 0.019
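
A small Python check of this computation (a sketch; 4, 6 and 4 are the income partition sizes given earlier):

from math import log2

def split_info(sizes):
    # Potential information generated by splitting D into these partitions
    n = sum(sizes)
    return -sum(s / n * log2(s / n) for s in sizes)

si = split_info([4, 6, 4])       # low, medium, high
print(round(si, 3))              # 1.557
print(round(0.029 / si, 3))      # gain ratio for income, about 0.019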

13
Interpretation

• Further, we calculate the same for the remaining 3 criteria (age, student,
credit rating)
• The one with the maximum gain ratio value results in the maximum
reduction in impurity of the tuples in D and is returned as the splitting
criterion

14
15
Decision tree using Gini index

• Let's take the induction of a decision tree using the Gini index
• Let D be the training data of
the following table

Han, J., Pei, J. and Kamber, M., 2011. Data mining: concepts and
techniques. Elsevier.

16
Example

• In this example, each attribute is discrete-valued


• Continuous-valued attributes have been generalized
• The class label attribute, buys computer, has two distinct values (namely,
{yes, no}); therefore, there are two distinct classes (that is, m = 2)
• Let class C1 correspond to ‘yes’ and class C2 correspond to ‘no’.
• There are nine tuples of class ‘yes’ and five tuples of class ‘no’.
• A (root) node N is created for the tuples in D

17
Calculation of Gini(D)

• We first use the following Equation for Gini index to compute the impurity
of D:

Gini(D) = 1 - (9/14)² - (5/14)² = 0.459
18
Gini index for income attribute

• Let's calculate the Gini index for the income attribute


• To find the splitting criterion for the tuples in D, we need to compute the
Gini index for each attribute
• Let’s start with the attribute income and consider each of the possible
splitting subsets
• Income has three possible values, namely {low, medium, high}, then the
possible subsets are {low, medium, high}, {low, medium}, {low, high},
{medium, high}, {low}, {medium}, {high}, and {}
• Power set and empty set will not be used for splitting

19
Gini index for income attribute

• Consider the subset {low, medium}
• This would result in 10 tuples
in partition D1 satisfying the
condition “income ∈{low,
medium}”
• The remaining four tuples of D
(high) would be assigned to
partition D2

20
Tuples in partition D1

• Low + Medium:
Yes: 3 + 4 = 7
No: 1 + 2 = 3

21
Tuples in partition D2

• High :
High Class: buys computer
Yes 2
No 2

22
Gini index for income attribute

• The Gini index value computed based on this partitioning is

Gini income ∈{low, medium} (D)
= (10/14) (1 - (7/10)² - (3/10)²) + (4/14) (1 - (2/4)² - (2/4)²)
= 0.443 = Gini income ∈{high}

23
Gini index for income attribute

• Consider the subset {high, medium}
• This would result in 10 tuples
in partition D1 satisfying the
condition “income ∈{high,
medium}”
• The remaining four tuples of
D (low) would be assigned to
partition D2

24
Tuples in partition D1

• High + Medium:
Medium Class: buys computer
+ high
Yes 2+4
No 2+2

25
Tuples in partition D2

• Low :

Low Class: buys computer


No 1
Yes 3

26
Gini index for income attribute

• The Gini index value computed based on this partitioning is

Gini income ∈{high, medium} (D)
= (10/14) (1 - (6/10)² - (4/10)²) + (4/14) (1 - (1/4)² - (3/4)²)
= 0.45 = Gini income ∈{low}

27
Gini index for income attribute

• Consider the subset {high, low}
• This would result in 8 tuples
in partition D1 satisfying the
condition “income ∈{high,
low}”
• The remaining six tuples of D
(medium) would be assigned
to partition D2

28
Tuples in partition D1

• High + low:
high + Class: buys computer
low
Yes 2+3
No 2+1

29
Tuples in partition D2

• Medium:

Medium Class: buys computer


No 2
Yes 4

30
Gini index for income attribute

• The Gini index value computed based on this partitioning is

Gini income ∈{high, low} (D)
= (8/14) (1 - (5/8)² - (3/8)²) + (6/14) (1 - (2/6)² - (4/6)²)
= 0.458 = Gini income ∈{medium}

31
Gini Index values

Gini Index values


Gini income ∈{high, low} 0.458
Gini income ∈{high, medium} 0.45
Gini income ∈{medium, low} 0.443

32
Interpretation

• The best binary split for attribute income is on {medium, low} (or {high})
because it minimizes the Gini index
• The splitting subset {medium, low} therefore gives the minimum Gini index
for attribute income
• Reduction in impurity = 0.459 − 0.443 = 0.016
• Further, we calculate the same for the remaining 3 criteria (age, student,
credit rating)
• The one with the minimum Gini index value results in the maximum
reduction in impurity of the tuples in D and is returned as the splitting
criterion
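
A short Python sketch reproducing the three income splits (class counts taken from the slides; the helper names are illustrative):

def gini(counts):
    # Gini impurity for a list of class counts, e.g. [7, 3]
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gini_split(d1, d2):
    # Weighted Gini index of a binary split into partitions d1 and d2
    n = sum(d1) + sum(d2)
    return sum(d1) / n * gini(d1) + sum(d2) / n * gini(d2)

print(round(gini_split([7, 3], [2, 2]), 3))  # {low, medium} | {high}: 0.443
print(round(gini_split([6, 4], [3, 1]), 3))  # {high, medium} | {low}: 0.45
print(round(gini_split([5, 3], [4, 2]), 3))  # {high, low} | {medium}: 0.458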
33
34
Thank You

35
Classification and Regression Trees (CART – III)

Dr A. RAMESH
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

Python demo for CART model -


• Visualizing Decision Tree
• Interpretation of CART model

2
Example

Problem Description-

Han, J., Pei, J. and Kamber, M., 2011. Data mining: concepts and
techniques. Elsevier.

3
Import Relevant Libraries and Loading Data File

4
Methods used in Data Encoding

• LabelEncoder(): This method is used to normalize labels. It can also be
used to transform non-numerical labels to numerical labels.

• fit_transform(): This method is used for fitting the label encoder and
returning the encoded labels.
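
The demo's code appears only as screenshots in the original slides; a minimal sketch of the encoding step (the column values follow the AllElectronics example, the frame itself is illustrative):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'age': ['youth', 'middle_aged', 'senior', 'youth'],
    'buys_computer': ['no', 'yes', 'yes', 'no'],
})
le = LabelEncoder()
for col in df.columns:
    # fit_transform() fits the encoder and returns integer-coded labels
    df[col] = le.fit_transform(df[col])
print(df)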

5
Data Encoding Procedure

6
Data Encoding

7
Structuring Dataframe

drop(): This is used to remove rows or columns by specifying label names
and the corresponding axis, or by specifying index or column names directly.
and corresponding axis or by specifying directly index or column names.

8
Independent and Dependent Variables Selection

9
Build the Decision Tree Model without Splitting

10
Visualizing Decision Tree

11
Decision Tree Visualization

12
Interpretation of the CART Output

13
Calculation of Gini(D)

• We first use the following Equation for Gini index to compute the impurity
of D:

Gini(D) = 1 - (9/14)² - (5/14)² = 0.459
14
Income Attribute

• Low, Medium, High


• Option 1: {Low, Medium}, {High}
• Option 2 : {High, Medium}, {low}
• Option 3 : {High, Low}, {Medium}

15
Tuples in partition D1

• Low + Medium:
Yes: 3 + 4 = 7
No: 1 + 2 = 3

16
Tuples in partition D2

• High :
High Class: buys computer
Yes 2
No 2

17
Gini index for income attribute

• The Gini index value computed based on this partitioning is

Gini income ∈{low, medium} (D)
= (10/14) (1 - (7/10)² - (3/10)²) + (4/14) (1 - (2/4)² - (2/4)²)
= 0.443 = Gini income ∈{high}

18
Gini index for income attribute

• The Gini index value computed based on this partitioning is

Gini income ∈{high, medium} (D)
= (10/14) (1 - (6/10)² - (4/10)²) + (4/14) (1 - (3/4)² - (1/4)²)
= 0.45 = Gini income ∈{low}

19
Gini index for income attribute

• The Gini index value computed based on this partitioning is

Gini income ∈{high, low} (D)
= (8/14) (1 - (5/8)² - (3/8)²) + (6/14) (1 - (2/6)² - (4/6)²)
= 0.458 = Gini income ∈{medium}

20
Gini index for income attribute
• Gini income ∈{low, medium}
= 0.443 = Gini income ∈{high}
• Gini income ∈{high, medium}
= 0.45 = Gini income ∈{low}
• Gini income ∈{high, low}
= 0.458 = Gini income ∈{medium}

21
Gini index for Age attribute

• The Gini index value computed based on this partitioning is


Gini Age ∈{Youth, middle_aged}
= 0.457 = Gini Age ∈{senior}
Gini Age ∈{Youth, Senior}
= 0.357 = Gini Age∈{middle_aged}
Gini Age ∈{senior, middle_aged}
= 0.393 = Gini Age ∈{Youth}

22
Gini index for student attribute

• The Gini index value computed based on this partitioning is

Gini student ∈{Yes, No}
= (7/14) (1 - (6/7)² - (1/7)²) + (7/14) (1 - (3/7)² - (4/7)²)
= 0.367

23
Gini index for credit_rating attribute

• The Gini index value computed based on this partitioning is

Gini credit_rating ∈{fair, excellent}
= (8/14) (1 - (6/8)² - (2/8)²) + (6/14) (1 - (3/6)² - (3/6)²)
= 0.428

24
Choosing the root node
The attribute with the minimum Gini score will be taken, i.e., Age
(Gini Age ∈{Youth, Senior} = 0.357 = Gini Age ∈{middle_aged})

Attribute Gini score
Age 0.357
Income 0.443
Student 0.367
Credit_rating 0.428

(Figure: partial tree with root Age and branches 'Youth, senior' and 'Middle age')

25
Gini index for different attributes for sample of 10
• After separating the 4 samples belonging to middle age, 10 samples remain:

26
Gini index for different attributes for sample of 10

• Gini(D) = 1 - (5/10)² - (5/10)² = 0.5
• Gini Age = 0.48
• Gini Credit Rating = 0.41
• Gini Student = 0.32
• Gini Income = 0.375
• Take student as the next node as it has the minimum Gini score

27
Drawing CART

(Figure: partial tree. Root: Age, with branches 'Middle age' and 'Youth, senior';
the 'Youth, senior' branch leads to a Student node whose 'yes' and 'no'
branches are still to be resolved (???))

28
For branch Student = No
• Omit the marked rows (data entries) belonging to either Age =
middle_aged or Student = Yes
• A total of 5 rows remain

29
Gini index for different attributes For branch Student = No

• Gini(D) = 1 - (4/5)² - (1/5)² = 0.32
• Gini Age = 0.2
• Gini Credit Rating = 0.267
• Gini Student = 0.32
• Gini Income = 0.267
• Take age as the next node as it has the minimum Gini score

30
Drawing CART

(Figure: partial tree. Root: Age; 'Youth, senior' leads to Student; the
'Student = no' branch now splits further on Age, whose branches are still
to be resolved (???), and the 'Student = yes' branch is still unresolved (???))
31
For branch Student = Yes
• Omit the marked rows (data entries) belonging to either Age =
middle_aged or Student = No
• A total of 5 rows remain

32
Gini index for different attributes for branch Student = Yes

• Gini(D) = 1 - (4/5)² - (1/5)² = 0.32
• Gini Age = 0.267
• Gini Credit Rating = 0.2
• Gini Student = 0.32
• Gini Income = 0.267
• Take credit rating as the next node as it has the minimum Gini score

33
Drawing CART

(Figure: partial tree. Root: Age; 'Youth, senior' leads to Student;
'Student = yes' splits on Credit_rating and 'Student = no' splits on Age;
the leaf labels are still to be resolved (???))

34
Coding scheme
Age: Youth = 2, Middle Age = 0, Senior = 1
Student: Yes = 1, No = 0
Income: High = 0, Low = 1, Medium = 2
Credit rating: Fair = 1, Excellent = 0
Buys computer (class): Yes = 1, No = 0
35
Decision tree

• Repeat the splitting process until we obtain all the leaf nodes; the final
output is shown in the figure

(Figure: final decision tree; each node is annotated with the values of the
dependent variable, the numbers of 'yes' and 'no', and the sample size;
the splits involve Age {Youth, Senior | Middle_age}, Student,
Credit_rating {Excellent | Fair}, and Income {High, Low | Medium})

36
Splitting Dataset

• train_test_split(): This method is used for splitting a dataset into training
and testing data subsets.

37
Build the Decision Tree Model
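
This step, too, is shown as screenshots in the original slides; a minimal scikit-learn sketch of building, evaluating and visualizing the model (the data here is a random stand-in for the encoded AllElectronics attributes):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(100, 4))   # toy encoded attribute columns
y = rng.integers(0, 2, size=100)        # toy encoded class labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
model = DecisionTreeClassifier(criterion='gini')  # CART-style splitting
model.fit(X_train, y_train)
print('Accuracy:', accuracy_score(y_test, model.predict(X_test)))

plot_tree(model, filled=True)   # visualize the fitted tree
plt.show()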

38
Evaluating the Model

39
Visualizing Decision Tree

40
Decision Tree Visualization

41
Thank You

42
Hierarchical method of clustering- II

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Agglomerative hierarchical algorithm


• Python demo

2
Example for Hierarchical Agglomerative Clustering (HAC)
• A data set consisting of seven objects for which two variables were
measured.
Object Variable 1 Variable 2
1 2.00 2.00
2 5.50 4.00
3 5.00 5.00
4 1.50 2.50
5 1.00 1.00
6 7.00 5.00
7 5.75 6.50
3
Scatter plot

4
Example for HAC

• Calculate the Euclidean distance and create the distance matrix.

Distance((x1, y1), (x2, y2)) = √((x2 − x1)² + (y2 − y1)²)

Distance(1,2) = √((5.50 − 2.00)² + (4.00 − 2.00)²) = 4.03
Distance(1,3) = √((5.00 − 2.00)² + (5.00 − 2.00)²) = 4.24
Distance(1,4) = √((1.50 − 2.00)² + (2.50 − 2.00)²) = 0.71

5
Example for HAC

Distance(1,5) = √((1.00 − 2.00)² + (1.00 − 2.00)²) = 1.41
Distance(1,6) = √((7.00 − 2.00)² + (5.00 − 2.00)²) = 5.83
Distance(1,7) = √((5.75 − 2.00)² + (6.50 − 2.00)²) = 5.86

6
Example for HAC

Distance(2,3) = √((5.00 − 5.50)² + (5.00 − 4.00)²) = 1.12
Distance(2,4) = √((1.50 − 5.50)² + (2.50 − 4.00)²) = 4.27
Distance(2,5) = √((1.00 − 5.50)² + (1.00 − 4.00)²) = 5.41
Distance(2,6) = √((7.00 − 5.50)² + (5.00 − 4.00)²) = 1.80

7
Example for HAC

Distance(2,7) = √((5.75 − 5.50)² + (6.50 − 4.00)²) = 2.51
Distance(3,4) = √((1.50 − 5.00)² + (2.50 − 5.00)²) = 4.30
Distance(3,5) = √((1.00 − 5.00)² + (1.00 − 5.00)²) = 5.66
Distance(3,6) = √((7.00 − 5.00)² + (5.00 − 5.00)²) = 2.00

8
Example for HAC

Distance(3,7) = √((5.75 − 5.00)² + (6.50 − 5.00)²) = 1.68
Distance(4,5) = √((1.00 − 1.50)² + (1.00 − 2.50)²) = 1.58
Distance(4,6) = √((7.00 − 1.50)² + (5.00 − 2.50)²) = 6.04
Distance(4,7) = √((5.75 − 1.50)² + (6.50 − 2.50)²) = 5.84

9
Example for HAC

Distance(5,6) = √((7.00 − 1.00)² + (5.00 − 1.00)²) = 7.21
Distance(5,7) = √((5.75 − 1.00)² + (6.50 − 1.00)²) = 7.27
Distance(6,7) = √((5.75 − 7.00)² + (6.50 − 5.00)²) = 1.95

10
Distance Matrix
• The distance matrix is-
1 2 3 4 5 6 7
1 0.0
2 4.0 0.0
3 4.2 1.1 0.0
4 0.7 4.3 4.3 0.0
5 1.4 5.4 5.7 1.6 0.0
6 5.8 1.8 2.0 6.0 7.2 0.0
7 5.9 2.5 1.7 5.8 7.3 2.0 0.0

11
Example for HAC
• Select the minimum element to build the first cluster:
1 2 3 4 5 6 7
1 0.0
2 4.0 0.0
3 4.2 1.1 0.0
4 0.7 4.3 4.3 0.0
5 1.4 5.4 5.7 1.6 0.0
6 5.8 1.8 2.0 6.0 7.2 0.0
7 5.9 2.5 1.7 5.8 7.3 2.0 0.0

12
Example for HAC

13
Example for HAC

• Recalculate distances to update the distance matrix

- MIN[dist(1,4), 2] = MIN(dist(1,2), dist(4,2)) = MIN(4.0, 4.3) = 4.0
- MIN[dist(1,4), 3] = MIN(dist(1,3), dist(4,3)) = MIN(4.2, 4.3) = 4.2
- MIN[dist(1,4), 5] = MIN(dist(1,5), dist(4,5)) = MIN(1.4, 1.6) = 1.4
- MIN[dist(1,4), 6] = MIN(dist(1,6), dist(4,6)) = MIN(5.8, 6.0) = 5.8
- MIN[dist(1,4), 7] = MIN(dist(1,7), dist(4,7)) = MIN(5.9, 5.8) = 5.8

14
Example for HAC

• Updated distance matrix for the cluster (1, 4)


1,4 2 3 5 6 7
1,4 0.0
2 4.0 0.0
3 4.2 1.1 0.0
5 1.4 5.4 5.7 0.0
6 5.8 1.8 2.0 7.2 0.0
7 5.8 2.5 1.7 7.3 2.0 0.0

15
Example for HAC

• Select the minimum element to build the next cluster:


1,4 2 3 5 6 7
1,4 0.0
2 4.0 0.0
3 4.2 1.1 0.0
5 1.4 5.4 5.7 0.0
6 5.8 1.8 2.0 7.2 0.0
7 5.8 2.5 1.7 7.3 2.0 0.0

16
Example for HAC

17
Example for HAC
• Recalculate distances to update the distance matrix

- MIN[dist(2,3), (1,4)] = MIN(dist(2,(1,4)), dist(3,(1,4))) = MIN(4.0, 4.2) = 4.0
- MIN[dist(2,3), 5] = MIN(dist(2,5), dist(3,5)) = MIN(5.4, 5.7) = 5.4
- MIN[dist(2,3), 6] = MIN(dist(2,6), dist(3,6)) = MIN(1.8, 2.0) = 1.8
- MIN[dist(2,3), 7] = MIN(dist(2,7), dist(3,7)) = MIN(2.5, 1.7) = 1.7

18
Example for HAC

• Updated distance matrix for the cluster (2, 3)


1,4 2,3 5 6 7
1,4 0.0
2,3 4.0 0.0
5 1.4 5.4 0.0
6 5.8 1.8 7.2 0.0
7 5.8 1.7 7.3 2.0 0.0

19
Example for HAC

• Select the minimum element to build the next cluster:


1,4 2,3 5 6 7
1,4 0.0
2,3 4.0 0.0
5 1.4 5.4 0.0
6 5.8 1.8 7.2 0.0
7 5.8 1.7 7.3 2.0 0.0

20
Example for HAC

21
Example for HAC

• Recalculate distances to update the distance matrix

- MIN[dist((1,4),5), (2,3)] = MIN(dist((1,4),(2,3)), dist(5,(2,3))) = MIN(4.0, 5.4) = 4.0
- MIN[dist((1,4),5), 6] = MIN(dist((1,4),6), dist(5,6)) = MIN(5.8, 7.2) = 5.8
- MIN[dist((1,4),5), 7] = MIN(dist((1,4),7), dist(5,7)) = MIN(5.8, 7.3) = 5.8

22
Example for HAC

• Updated distance matrix for the cluster ((1,4), 5)

1,4,5 2,3 6 7
1,4,5 0.0
2,3 4.0 0.0
6 5.8 1.8 0.0
7 5.8 1.7 2.0 0.0

23
Example for HAC

• Select the minimum element to build the next cluster:

1,4,5 2,3 6 7
1,4,5 0.0
2,3 4.0 0.0
6 5.8 1.8 0.0
7 5.8 1.7 2.0 0.0

24
Example for HAC

25
Example for HAC

• Recalculate distances to update the distance matrix

- MIN[dist((2,3),7), (1,4,5)] = MIN(dist((2,3),(1,4,5)), dist(7,(1,4,5))) = MIN(4.0, 5.8) = 4.0
- MIN[dist((2,3),7), 6] = MIN(dist((2,3),6), dist(7,6)) = MIN(1.8, 2.0) = 1.8

26
Example for HAC

• Updated distance matrix for the cluster ((2,3), 7)

1,4,5 2,3,7 6
1,4,5 0.0
2,3,7 4.0 0.0
6 5.8 1.8 0.0

27
Example for HAC

• Select the minimum element to build the next cluster:

1,4,5 2,3,7 6
1,4,5 0.0
2,3,7 4.0 0.0
6 5.8 1.8 0.0

28
Example for HAC

29
Example for HAC

• Recalculate distances to update the distance matrix

- MIN[dist((2,3,7),6), (1,4,5)] = MIN(dist((2,3,7),(1,4,5)), dist(6,(1,4,5))) = MIN(4.0, 5.8) = 4.0

30
Example for HAC

• Updated distance matrix for the cluster ((2,3,7), 6)

1,4,5 2,3,7,6

1,4,5 0.0

2,3,7,6 4.0 0.0

31
Example for HAC

32
Python demo for HAC
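
The demo itself appears as screenshots in the original slides; a minimal sketch of the same single-linkage HAC in Python (a sketch, assuming SciPy and matplotlib are available):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# The seven objects (Variable 1, Variable 2) from the worked example
X = np.array([[2.00, 2.00], [5.50, 4.00], [5.00, 5.00], [1.50, 2.50],
              [1.00, 1.00], [7.00, 5.00], [5.75, 6.50]])

Z = linkage(X, method='single')   # single linkage = minimum distance
dendrogram(Z, labels=['1', '2', '3', '4', '5', '6', '7'])
plt.show()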

33
Python demo for HAC

34
Python demo for HAC

35
Python demo for HAC

36
Python demo for HAC

37
THANK YOU

38
