Data Analytics
Dr. A. Ramesh
Department of Management Studies
Lecture objectives
• Measures of central tendency
• Measures of dispersion
2
Measures of Central Tendency
3
Summary statistics
4
Arithmetic Mean
• Commonly called ‘the mean’
• It is the average of a group of numbers
• Applicable for interval and ratio data
• Not applicable for nominal or ordinal data
• Affected by each value in the data set, including extreme values
• Computed by summing all values in the data set and dividing the sum by
the number of values in the data set
5
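A minimal Python sketch of this computation (Python is the course's language; the data are taken from the population-mean slide that follows):

values = [24, 13, 19, 26, 11]
mean = sum(values) / len(values)   # sum all values, divide by the count
print(mean)                        # 18.6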
Population Mean

$$\mu = \frac{\sum X}{N} = \frac{X_1 + X_2 + X_3 + \cdots + X_N}{N} = \frac{24 + 13 + 19 + 26 + 11}{5} = \frac{93}{5} = 18.6$$
Sample Mean

$$\bar{X} = \frac{\sum X}{n} = \frac{X_1 + X_2 + X_3 + \cdots + X_n}{n} = \frac{57 + 86 + 42 + 38 + 90 + 66}{6} = \frac{379}{6} = 63.167$$
Mean of Grouped Data
• Weighted average of class midpoints
• Class frequencies are the weights
$$\mu = \frac{\sum fM}{\sum f} = \frac{\sum fM}{N} = \frac{f_1 M_1 + f_2 M_2 + f_3 M_3 + \cdots + f_i M_i}{f_1 + f_2 + f_3 + \cdots + f_i}$$
Calculation of Grouped Mean
Class Interval Frequency(f) Class Midpoint(M) fM
20-under 30 6 25 150
30-under 40 18 35 630
40-under 50 11 45 495
50-under 60 11 55 605
60-under 70 3 65 195
70-under 80 1 75 75
50 2150
$$\mu = \frac{\sum fM}{\sum f} = \frac{2150}{50} = 43.0$$
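The grouped mean above can be reproduced in Python; a minimal sketch using the table's midpoints and frequencies:

midpoints = [25, 35, 45, 55, 65, 75]
freqs = [6, 18, 11, 11, 3, 1]
# weighted average of class midpoints, with class frequencies as the weights
print(sum(f * m for f, m in zip(freqs, midpoints)) / sum(freqs))  # 43.0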
Weighted Average

$$\bar{x}_w = \frac{\sum xw}{\sum w}$$

where x is a data value and w is the weight assigned to that data value. The sum is taken over all data values.

Example
Suppose your midterm test score is 83 and your final exam score is 95. Using weights of 40% for the midterm and 60% for the final exam, compute the weighted average of your scores. If the minimum average for an A is 90, will you earn an A?

$$\text{Weighted Average} = \frac{83(0.40) + 95(0.60)}{0.40 + 0.60} = \frac{33.2 + 57}{1} = 90.2$$

You will earn an A!
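The same weighted average with NumPy; a one-line sketch:

import numpy as np
print(np.average([83, 95], weights=[0.40, 0.60]))  # 90.2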
Median
• Middle value in an ordered array of numbers
13
Median: Computational Procedure
• First Procedure
– Arrange the observations in an ordered array
– If there is an odd number of terms, the median is the middle term of the
ordered array
– If there is an even number of terms, the median is the average of the
middle two terms
• Second Procedure
– The median’s position in an ordered array is given by (n+1)/2.
14
Median: Example with an Odd Number of Terms
Ordered Array
3 4 5 7 8 9 11 14 15 16 16 17 19 19 20 21 22
• There are 17 terms in the ordered array.
• Position of median = (n+1)/2 = (17+1)/2 = 9
• The median is the 9th term, 15.
• If the 22 is replaced by 100, the median is 15.
• If the 3 is replaced by -103, the median is 15.
15
Median: Example with an Even Number of Terms
Ordered Array
3 4 5 7 8 9 11 14 15 16 16 17 19 19 20 21
• There are 16 terms in the ordered array.
• Position of median = (n+1)/2 = (16+1)/2 = 8.5
• The median is the average of the 8th and 9th terms: (14 + 15)/2 = 14.5
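Both cases can be checked with NumPy; a minimal sketch using the two ordered arrays above:

import numpy as np
odd = [3, 4, 5, 7, 8, 9, 11, 14, 15, 16, 16, 17, 19, 19, 20, 21, 22]
even = [3, 4, 5, 7, 8, 9, 11, 14, 15, 16, 16, 17, 19, 19, 20, 21]
print(np.median(odd))   # 15.0, the middle term
print(np.median(even))  # 14.5, the average of the two middle terms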
Median of Grouped Data
$$\text{Median} = L + \frac{\dfrac{N}{2} - cf_p}{f_{med}} \cdot W$$
Where :
L the lower limit of the median class
cfp = cumulative frequency of class preceding the median class
fmed = frequency of the median class
W = width of the median class
N = total of frequencies
17
Median of Grouped Data -- Example

Class Interval   Frequency   Cumulative Frequency
20-under 30      6           6
30-under 40      18          24
40-under 50      11          35
50-under 60      11          46
60-under 70      3           49
70-under 80      1           50
N = 50

$$Md = L + \frac{\dfrac{N}{2} - cf_p}{f_{med}} \cdot W = 40 + \frac{\dfrac{50}{2} - 24}{11} \cdot 10 = 40.909$$
Mode
• The most frequently occurring value in a data set
Mode -- Example
• The mode is 44
• There are more 44s 35 41 44 45
37 43 44 46
39 43 44 46
40 43 44 46
40 43 45 48
20
Mode of Grouped Data
• Midpoint of the modal class
• Modal class has the greatest frequency
21
22
Percentiles
• Measures of location that divide a group of data into 100 parts
• Example: the 90th percentile indicates that at most 90% of the data lie below it, and at least 10% of the data lie above it
• The median and the 50th percentile have the same value
23
Percentiles: Computational Procedure
• Organize the data into an ascending ordered array
• Calculate the pth percentile location: i = (P/100)(n)
• Determine the percentile's location and its value.
24
Percentiles: Example
• Raw Data: 14, 12, 19, 23, 5, 13, 28, 17
• Ordered Array: 5, 12, 13, 14, 17, 19, 23, 28
• Location of 30th percentile: i = (30/100)(8) = 2.4
• Since i is not a whole number, round up: the 30th percentile is the 3rd term, 13.
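A sketch with NumPy. Percentile conventions differ, so NumPy's default linear interpolation does not exactly match the textbook round-up rule (the method argument requires NumPy 1.22+):

import numpy as np
data = [5, 12, 13, 14, 17, 19, 23, 28]
print(np.percentile(data, 30))                  # 13.1 (linear interpolation)
print(np.percentile(data, 30, method='lower'))  # 13, matching the slide's rule here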
Dispersion
26
Variability
27
Measures of Variability or dispersion
Common Measures of Variability
• Range
• Inter-quartile range
• Mean Absolute Deviation
• Variance
• Standard Deviation
• Z scores
• Coefficient of Variation
28
Range – ungrouped data
• Range = highest value − lowest value
• For the data above: Range = 48 − 35 = 13
Quartiles
• Measures of location that divide a group of data into four subgroups
• Q1, Q2, and Q3 mark the boundaries of the four subgroups
Quartiles: Example
• Ordered array: 106, 109, 114, 116, 121, 122, 125, 129
• Q1: i = (25/100)(8) = 2, so Q1 = (109 + 114)/2 = 111.5
• Q2: i = (50/100)(8) = 4, so Q2 = (116 + 121)/2 = 118.5
• Q3: i = (75/100)(8) = 6, so Q3 = (122 + 125)/2 = 123.5
Interquartile Range
• Interquartile Range = Q3 − Q1
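A NumPy sketch of the quartiles and IQR for the example data (method='midpoint' matches the slide's averaging rule for this data set):

import numpy as np
data = [106, 109, 114, 116, 121, 122, 125, 129]
q1, q2, q3 = np.percentile(data, [25, 50, 75], method='midpoint')
print(q1, q2, q3)   # 111.5 118.5 123.5
print(q3 - q1)      # interquartile range = 12.0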
Deviation from the Mean
• Data: 5, 9, 16, 17, 18 (mean μ = 13)
• Deviations from the mean: −8, −4, +3, +4, +5
Mean Absolute Deviation

X     X − μ    |X − μ|
5     −8       8
9     −4       4
16    +3       3
17    +4       4
18    +5       5
Σ     0        24

$$M.A.D. = \frac{\sum |X - \mu|}{N} = \frac{24}{5} = 4.8$$
Population Variance
• Average of the squared deviations from the arithmetic mean

X     X − μ    (X − μ)²
5     −8       64
9     −4       16
16    +3       9
17    +4       16
18    +5       25
Σ     0        130

$$\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{130}{5} = 26.0$$
Population Standard Deviation
• Square root of the variance

$$\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{130}{5} = 26.0$$
$$\sigma = \sqrt{\sigma^2} = \sqrt{26.0} = 5.1$$
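A NumPy sketch of the population variance and standard deviation (NumPy divides by N by default):

import numpy as np
x = np.array([5, 9, 16, 17, 18])
print(x.var())   # population variance: 26.0
print(x.std())   # population standard deviation: ~5.1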
Sample Variance
• Average of the squared deviations from the sample mean, dividing by n − 1

X       X − X̄    (X − X̄)²
2,398   625      390,625
1,844   71       5,041
1,539   −234     54,756
1,311   −462     213,444
7,092   0        663,866

$$S^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{663{,}866}{3} = 221{,}288.67$$
Sample Standard Deviation
• Square root of the sample variance

$$S = \sqrt{S^2} = \sqrt{221{,}288.67} = 470.41$$
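The same sample statistics in NumPy; ddof=1 makes NumPy divide by n − 1:

import numpy as np
x = np.array([2398, 1844, 1539, 1311])
print(x.var(ddof=1))   # sample variance: ~221288.67
print(x.std(ddof=1))   # sample standard deviation: ~470.41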
Uses of Standard Deviation
• Indicator of financial risk
• Quality Control
– construction of quality control charts
– process capability studies
• Comparing populations
– household incomes in two cities
– employee absenteeism at two plants
40
Standard Deviation as an Indicator of Financial Risk
A 15% 3%
B 15% 7%
41
Lecture 5: Central Tendency and Dispersion- II
Dr. A. Ramesh
Department of Management Studies
1
The Empirical Rule… If the histogram is bell shaped
2
Empirical Rule

Distance from the Mean    Proportion of Values Falling Within
μ ± 1σ                    68%
μ ± 2σ                    95%
μ ± 3σ                    99.7%
Chebysheff's Theorem… Not often used because the interval is very wide.
• For any population, at least 1 − 1/k² of the observations lie within k standard deviations of the mean (k > 1).
Coefficient of Variation
• Ratio of the standard deviation to the mean, expressed as a percentage:

$$C.V. = \frac{\sigma}{\mu} \cdot 100$$
Coefficient of Variation -- Example

μ₁ = 29, σ₁ = 4.6        μ₂ = 84, σ₂ = 10

$$C.V._1 = \frac{\sigma_1}{\mu_1} \cdot 100 = \frac{4.6}{29} \cdot 100 = 15.86$$
$$C.V._2 = \frac{\sigma_2}{\mu_2} \cdot 100 = \frac{10}{84} \cdot 100 = 11.90$$
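A tiny sketch showing how the C.V. lets us compare variability across different scales:

def cv(sigma, mu):
    # coefficient of variation, as a percentage of the mean
    return sigma / mu * 100

print(cv(4.6, 29))  # ~15.86
print(cv(10, 84))   # ~11.90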
Variance and Standard Deviation of Grouped Data

Population:
$$\sigma^2 = \frac{\sum f(M - \mu)^2}{N}, \qquad \sigma = \sqrt{\sigma^2}$$

Sample:
$$S^2 = \frac{\sum f(M - \bar{X})^2}{n - 1}, \qquad S = \sqrt{S^2}$$
Population Variance and Standard Deviation of Grouped Data (μ = 43)

$$\sigma^2 = \frac{\sum f(M - \mu)^2}{N} = \frac{7200}{50} = 144, \qquad \sigma = \sqrt{144} = 12$$
Measures of Shape
• Skewness
– Absence of symmetry
– Extreme values in one side of a distribution
• Kurtosis
– Peakedness of a distribution
– Leptokurtic: high and thin
– Mesokurtic: normal shape
– Platykurtic: flat and spread out
• Box and Whisker Plots
– Graphic display of a distribution
– Reveals skewness
9
Skewness
10
Skewness..
The skewness of a distribution is measured by comparing the relative positions
of the mean, median and mode.
• Distribution is symmetrical
• Mean = Median = Mode
• Distribution skewed right
• Median lies between mode and mean, and mode is less than mean
• Distribution skewed left
• Median lies between mode and mean, and mode is greater than
mean
11
Skewness
12
Coefficient of Skewness
3 Md
S
• If S < 0, the distribution is negatively skewed (skewed to the left)
13
Coefficient of Skewness -- Example

μ₁ = 23, Md₁ = 26, σ₁ = 12.3:  S₁ = 3(23 − 26)/12.3 = −0.73
μ₂ = 26, Md₂ = 26, σ₂ = 12.3:  S₂ = 3(26 − 26)/12.3 = 0
μ₃ = 29, Md₃ = 26, σ₃ = 12.3:  S₃ = 3(29 − 26)/12.3 = +0.73
[Curves illustrating leptokurtic, mesokurtic, and platykurtic distributions]
15
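A SciPy sketch of the two shape measures just defined, using the small data set from the dispersion slides:

from scipy.stats import skew, kurtosis
data = [5, 9, 16, 17, 18]
print(skew(data))      # negative here, i.e. skewed left
print(kurtosis(data))  # excess kurtosis: ~0 mesokurtic, >0 leptokurtic, <0 platykurtic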
Box and Whisker Plot
– Median, Q2
– First quartile, Q1
– Third quartile, Q3
16
Box and Whisker Plot
Minimum Q1 Q2 Q3 Maximum
17
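A minimal matplotlib sketch of a box and whisker plot, reusing the quartiles example data:

import matplotlib.pyplot as plt
data = [106, 109, 114, 116, 121, 122, 125, 129]
plt.boxplot(data)  # box spans Q1-Q3 with a line at the median Q2;
plt.show()         # whiskers extend toward the minimum and maximum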
Skewness: Box and Whisker Plots, and Coefficient of Skewness
[Box plots for symmetric (S = 0), right-skewed (S > 0), and left-skewed (S < 0) distributions]
THANK YOU
19
Data Analytics with Python
Lecture 1: Introduction to data analytics
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT
IIT ROORKEE
1
Objective of the course
• The principal focus of this course is to build conceptual understanding through simple, practical examples rather than a repetitive, point-and-click mentality
• This course should make you comfortable using analytics in your career and your life
• You will learn how to work with real data and many different methodologies; choosing the right methodology is the important part
Objective of the course Contd…
3
Learning objectives
1. Define data and its importance
2. Define data analytics and its types
3. Explain why analytics is important in today’s business environment
4. Explain how statistics, analytics and data science are interrelated
5. Why python?
6. Explain the four different levels of Data:
– Nominal
– Ordinal
– Interval and
– Ratio
4
1. Define Data and its importance
5
1.1 Variable, Measurement and Data
6
1.2 What is generating so much data?
7
1.3 How data add value to business?
Data warehouse
Business value
Source:https://datajobs.com/
8
Data Products
9
1.4 Why is data important?
10
2. Define data analytics and its types
• Define data analytics
• Data analysis
11
2.1. Define data analytics
12
2.2 Why is analytics important?
13
2.3 Data analysis
• Data analysis is the process of examining, transforming, and
arranging raw data in a specific way to generate useful
information from it
• Data analysis allows for the evaluation of data through
analytical and logical reasoning to lead to some sort of
outcome or conclusion in some context
• Data analysis is a multi-faceted process that involves a
number of steps, approaches, and diverse techniques
14
2.4 Data analytics vs. Data analysis
• Analysis looks at the past: it explains how and why something happened
• Analytics looks at the future
Analytics
• Qualitative analytics = intuition + analysis
• Quantitative analytics = formulas + algorithms

Analysis
• Qualitative analysis = data + an account of, for example, how the sale decreased last summer
Analysis ≠ Analytics
Data analysis ≠ Data analytics
19
2.5 Classification of Data analytics
Based on the phase of workflow and the kind of analysis required, there are
four major types of data analytics.
• Descriptive analytics
• Diagnostic analytics
• Predictive analytics
• Prescriptive analytics
20
Classification of Data analytics
https://www.governanceanalytics.org/knowledge-
base/Main_Tools/Data_classification_and_analysis
21
Descriptive Analytics
• Descriptive analytics is the conventional form of Business Intelligence and data analysis
• It seeks to provide a depiction or "summary view" of facts and figures in an understandable format
• It either informs or prepares data for further analysis
• Descriptive analysis or statistics can summarize raw data and convert it into a form that can be easily understood by humans
• It can describe in detail an event that has occurred in the past
Example
Common examples of descriptive analytics are company reports that simply provide a historic review, such as:
• Data Queries
• Reports
• Descriptive Statistics
• Data Visualization
• Data dashboard
Source: https://www.linkedin.com/learning/478e9692-d13d-338f-907e-d76f0724d773
23
Diagnostic analytics
24
Example
1. Data Discovery
2. Data Mining
3. Correlations
25
Predictive analytics
26
Source: https://www.logianalytics.com/wp-content/uploads/2017/11/predictive-1.png
27
Example
• A set of techniques that use models constructed from past data to predict the future or to ascertain the impact of one variable on another:
1. Linear regression
2. Time series analysis and forecasting
3. Data mining
Source: https://bigdata-madesimple.com/5-examples-predictive-analytics-travel-industry/
28
Prescriptive analytics
29
Prescriptive analytics: Example
• Optimization Model
• Simulation
• Decision Analysis
30
3. Explain why analytics is important
31
3. Explain why analytics is important
[Search-trend comparison: "Data Scientist" vs. "Statistician" and "Operations Researcher"]
32
https://timesofindia.indiatimes.com/india/Data-scientists-earning-more-than-
CAs-engineers/articleshow/52171064.cms
33
3.1 Demand for Data Analytics
http://timesofindia.indiatimes.com/articleshow/52171064.cms?utm_source=
contentofinterest&utm_medium=text&utm_campaign=cppst
34
3.2 Element of data Analytics
35
4. Data analyst and Data scientist
36
4.1 The requisite skill set
Data science sits at the intersection of:
• Mathematics expertise
• Technology / hacking skill
• Business and strategy acumen
4.2 Difference between Data Analyst and Data Scientist
• Data Analyst: business administration background; domain-specific responsibility, for example marketing analyst, financial analyst, etc.
• Data Scientist: advanced algorithms and machine learning
Source: https://datajobs.com/
5. Why python?
Features
• Simple and easy to learn
• Freeware and Open source
• Interpreted
• Dynamically Typed
• Extensible
• Embeddable
• Extensive library
41
5. Why python?
Usability
• Desktop and web applications
• Database applications
• Networking applications
• Data analysis (Data Science)
• Machine learning
• IoT and AI applications
• Games
42
Companies using Python
43
Why Jupyter NoteBook?
Why?
• Client – Server Application
• Edit code on web browser
• Easy in documentation
• Easy in demonstration
• User-friendly interface
44
6. Explain the four different levels of Data
• Types of Variables
• Levels of Data Measurement
• Compare the four different levels of Data:
Nominal
Ordinal
Interval and
Ratio
• Usage Potential of Various Levels of Data
• Data Level, Operations, and Statistical Methods
45
6.1 Types of Variables

Data
• Categorical (defined categories) — Examples: marital status, political party, eye color
• Numerical
– Discrete (counted items) — Examples: number of children, defects per hour
– Continuous (measured characteristics) — Examples: weight, voltage
6.2 Levels of Data Measurement
47
6.3.1 Nominal
48
6.3.2 Ordinal scale
49
6.3.3. Interval scale
50
6.3.4 Ratio scale
51
6.4 Usage Potential of Various
Levels of Data
Ratio
Interval
Ordinal
Nominal
52
6.5 Impact of choice of measurement scale

Data Level | Meaningful Operations | Statistical Methods
Thank You
54
Welcome to
TA Live Session 1
http://makemeanalyst.com/explore-your-data-range-interquartile-range-and-box-plot/
[1, 2, 3, 4, 5, 6, 7, 8, 9]
X = 10      → value of X = 10
X = X + 10  → value of X = 20
X = X − 5   → value of X = 15
NO !
https://www.slideshare.net/amritswaroop1/mbm-106
$$\bar{x} = \frac{\sum x_i}{n}$$
https://www.kindpng.com/imgv/Txxxxb_mu-greek-alphabet-
letter-greek-mu-hd-png/
http://makemeanalyst.com/explore-your-data-range-interquartile-
range-and-box-plot/
https://www.managedfuturesinvesting.com/what-is-skewness/
If σ = 0, σ² = 0
If σ = 2, σ² = 4
If σ = 0.5, σ² = 0.25
THANK YOU!
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT
IIT ROORKEE
1
Learning objectives
1. Installing Python
2. Fundamentals of Python
3. Data Visualisation
2
Python Installation Process
[A sequence of slides walks through the installer screenshots step by step.]
Why Jupyter NoteBook?
Why?
• Edit code on web browser
• Easy in documentation
• Easy in demonstration
• User- friendly Interface
17
Python and Jupyter
18
About Jupyter Notebook
[A sequence of slides demonstrates the Jupyter Notebook interface via screenshots.]
Fundamentals of Python
25
26
Loading a simple delimited data file
27
28
• head method shows us only the first 5 rows
29
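The original slides show these pandas commands as screenshots. A minimal sketch of the idea, assuming the tab-delimited Gapminder file used in the lecture (the file name/path is an assumption):

import pandas as pd

# load a simple delimited data file
df = pd.read_csv('gapminder.tsv', sep='\t')
print(df.head())  # head shows only the first 5 rows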
Get the number of rows and columns
30
get column names
31
get the dtype of each column
32
Pandas Types Versus Python Types
33
get more information about data
34
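Continuing the sketch above, the inspection commands these slides describe:

print(df.shape)    # the number of rows and columns
print(df.columns)  # the column names
print(df.dtypes)   # the dtype of each column
df.info()          # more information about the data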
Looking at Columns, Rows, and Cells
35
# show the first 5 observations
36
# show the last 5 observations
37
# Looking at country, continent, and year
38
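A sketch of the column subsetting described here, continuing with the df loaded earlier:

subset = df[['country', 'continent', 'year']]  # looking at three columns
print(subset.head())  # the first 5 observations
print(subset.tail())  # the last 5 observations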
39
Data Analytics with Python
Lecture 3: Python – Fundamentals - II
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT
IIT ROORKEE
1
Looking at Columns, Rows, and Cells
2
get the first row
3
• # get the 100th row
# Python counts from 0
4
• get the last row
5
Subsetting Multiple Rows
6
Subset Rows by Row Number: iloc
7
• get the 100th row
8
• # using -1 to get the last row
9
With iloc, we can pass in the -1 to get the last row—something we couldn’t do with loc.
10
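A sketch of the row selection these slides walk through, again assuming the Gapminder dataframe df:

print(df.loc[0])    # get the first row (loc uses index labels; Python counts from 0)
print(df.loc[99])   # get the 100th row
print(df.iloc[-1])  # -1 gets the last row with iloc; loc cannot do this
print(df.iloc[[0, 99, 999]])  # the first, 100th, and 1000th rows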
• # get the first, 100th, and 1000th rows
11
Subsetting Columns
12
• # subset columns with loc
# note the position of the colon
# it is used to select all rows
13
14
• # subset columns with iloc
• # iloc will allow us to use integers
• # -1 will select the last column
15
Subsetting Columns by Range
16
• # subset the dataframe with the range
17
Subsetting Rows and Columns
• # using loc
18
• # using iloc
19
Subsetting Multiple Rows and Columns
20
• if we use the column names directly,
# it makes the code a bit easier to read
# note now we have to use loc, instead of iloc
21
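A sketch of the column and row-and-column subsetting described above ('lifeExp' is one of the Gapminder columns used later in this lecture):

subset = df.loc[:, ['year', 'lifeExp']]   # loc with names; the colon selects all rows
subset = df.iloc[:, [2, 3, -1]]           # iloc with integers; -1 is the last column
subset = df.iloc[:, list(range(0, 3))]    # subsetting columns by a range
# rows and columns together; column names make the code easier to read
print(df.loc[[0, 99, 999], ['country', 'lifeExp']])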
22
23
Grouped Means
• # For each year in our data, what was the average life
expectancy?
# To answer this question,
# we need to split our data into parts by year;
# then we get the 'lifeExp' column and calculate the mean
24
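A sketch of the grouped mean described in the comments above:

# split the data by year, take the 'lifeExp' column, calculate the mean
print(df.groupby('year')['lifeExp'].mean())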
25
26
• If you need to “flatten” the dataframe, you can use the
reset_index method.
27
Grouped Frequency Counts
28
Basic Plot
29
30
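A minimal plotting sketch consistent with the grouped means just computed:

import matplotlib.pyplot as plt
df.groupby('year')['lifeExp'].mean().plot()  # line plot of yearly averages
plt.show()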
Visual Representation of the Data
• Histogram -- vertical bar chart of frequencies
• Frequency Polygon -- line graph of frequencies
• Ogive -- line graph of cumulative frequencies
• Pie Chart -- proportional representation for categories of a whole
• Stem and Leaf Plot
• Pareto Chart
• Scatter Plot
31
Methods of visual presentation of data
• Table
32
Methods of visual presentation of data
• Graphs
[Column chart of quarterly values (1st–4th Qtr) for the East, West, and North regions]
Methods of visual presentation of data
• Pie chart
[Pie chart of the shares of the four quarters]
Methods of visual presentation of data
• Multiple bar chart
[Horizontal bar chart of quarterly values]
Methods of visual presentation of data
• Simple pictogram
[Pictogram of quarterly values for the East, West, and North regions]
Frequency distributions
• Frequency tables
Observation Table
Class Interval Frequency Cumulative Frequency
< 20 13 13
<40 18 31
<60 25 56
<80 15 71
<100 9 80
37
Frequency diagrams
[Line graphs of the frequency distribution (13, 18, 25, 15, 9) and the cumulative frequency distribution (13, 31, 56, 71, 80) over the classes < 20, < 40, < 60, < 80, < 100]
Histogram

Class Interval   Frequency
20-under 30      6
30-under 40      18
40-under 50      11
50-under 60      11
60-under 70      3
70-under 80      1

[Histogram: years (0–80) on the horizontal axis, frequency on the vertical axis]
Histogram Construction
[The histogram is built from the same frequency table: bars of height 6, 18, 11, 11, 3, 1 over the class intervals, with years on the horizontal axis]
Frequency Polygon

Class Interval   Frequency
20-under 30      6
30-under 40      18
40-under 50      11
50-under 60      11
60-under 70      3
70-under 80      1

[Line graph connecting the frequencies plotted at the class midpoints]
Ogive

Class Interval   Cumulative Frequency
20-under 30      6
30-under 40      24
40-under 50      35
50-under 60      46
60-under 70      49
70-under 80      50

[Line graph of cumulative frequency against years]
Relative Frequency Ogive
[Line graph of cumulative relative frequencies]
Pareto Chart
[Bar chart of defect causes (Poor Wiring, Short in Coil, Defective Plug, Other) in decreasing order of frequency, with a cumulative-percentage line rising from 0% to 100%]
Scatter Plot

Registered Vehicles (1000's)   Gasoline Sales (1000's of Gallons)
5                              60
15                             120
9                              90
15                             140
7                              60

[Scatter plot of gasoline sales against registered vehicles]
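A minimal matplotlib sketch of this scatter plot, using the table above:

import matplotlib.pyplot as plt
vehicles = [5, 15, 9, 15, 7]     # registered vehicles (1000's)
sales = [60, 120, 90, 140, 60]   # gasoline sales (1000's of gallons)
plt.scatter(vehicles, sales)
plt.xlabel("Registered Vehicles (1000's)")
plt.ylabel("Gasoline Sales (1000's of Gallons)")
plt.show()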
Principles of Excellent Graphs
• The graph should not distort the data
• The graph should not contain unnecessary adornments (sometimes
referred to as chart junk)
• The scale on the vertical axis should begin at zero
• All axes should be properly labeled
• The graph should contain a title
• The simplest possible graph should be used for a given set of data
Graphical Errors: Chart Junk
[Two quarterly charts: one cluttered with decorative "chart junk", one plain]
Graphical Errors: No Zero Point on the Vertical Axis
• Bad presentation: a monthly sales chart whose vertical axis starts at 36
• Good presentation: the same chart with the vertical axis starting at zero
Sample space of a single coin toss: {H, T}
https://towardsdatascience.com/what-is-expected-value-4815bdbd84de
[Tree diagram for two tosses: 1st toss H or T, 2nd toss H or T, giving outcomes HH, HT, TH, TT]
Question 3
3) [Standard normal distribution] https://www.scribbr.com/statistics/standard-normal-distribution/
4) Poisson distribution: $f(x) = \frac{\lambda^x e^{-\lambda}}{x!}$ (https://en.wikipedia.org/wiki/Poisson_distribution)
https://en.wikipedia.org/wiki/Hypergeometric_distribution
Question 7
7) Suppose you have a biased coin, i.e., P(H) ≠ P(T).
How will you use it to make an unbiased decision?
Question 8
8) Suppose you have data about the progression of the number of cases
of Covid-19 in some country during the second wave. What kind of
visual representation would you prefer to use for this data ?
• Violin Plot
• Pie Chart
• Ogive
• Box Plot

[A related question shows a pie chart of viewership shares for TV Shows A, B, C, and D (fictional data)]
Question 10
10) Suppose there are 10 rabbits in a race.
Let R1 and R2 be two of the rabbits.
Let A be the event that R1 wins the race.
Let B be the event that R2 wins the race.
Are A and B independent events? (Assume all rabbits are equally likely to win.)

Not independent: P(A) = 1/10 but P(A|B) = 0, so knowing that B occurred changes the probability of A.
THANK YOU!
Dr. A. Ramesh
Department of Management Studies
1
Lecture objectives
2
Probability
• Probability is the numerical measure of the likelihood that an event will occur.
3
Range of Probability
• Probability lies between 0 and 1: 0 = impossible, .5 = equally likely, 1 = certain
Methods of Assigning Probabilities
5
Classical Probability
6
Classical Probability

$$P(E) = \frac{n_e}{N}$$

Where:
N = total number of outcomes
n_e = number of outcomes in E
Relative Frequency Probability
8
Relative Frequency Probability
P( E ) ne
N
Where:
N total number of trials
n e
number of outcomes
producing E
9
Subjective Probability
10
Probability - Terminology
• Experiment
• Event
• Elementary Events
• Sample Space
• Unions and Intersections
• Mutually Exclusive Events
• Independent Events
• Collectively Exhaustive Events
• Complementary Events
11
Experiment, Trial, Elementary Event, Event
• Experiment: a process that produces outcomes
– More than one possible outcome
– Only one outcome per trial
• Trial: one repetition of the process
• Elementary Event: cannot be decomposed or broken down into other
events
• Event: an outcome of an experiment
– may be an elementary event, or
– may be an aggregate of elementary events
– usually represented by an uppercase letter, e.g., A, E1
12
An Example Experiment
• Experiment: randomly select, without replacement, two families from the residents of Tiny Town
• Elementary Event: the sample includes families A and C
• Event: each family in the sample has children in the household
• Event: the sample families own a total of four automobiles

Tiny Town Population
Family   Children in Household   Number of Automobiles
A        Yes                     3
B        Yes                     2
C        No                      1
D        Yes                     2
Sample Space
14
Sample Space: Roster Example
15
Sample Space: Tree Diagram for Random Sample of Two
Families
16
Sample Space: Set Notation for Random Sample of Two
Families
• S = {(x,y) | x is the family selected on the first draw, and y is the family
selected on the second draw}
• Concise description of large sample spaces
17
Sample Space
• Useful for discussion of general principles and concepts
18
Union of Sets
• The union of two sets contains an instance of each element of the two sets.
X = {1, 4, 7, 9}
Y = {2, 3, 4, 5, 6}
X ∪ Y = {1, 2, 3, 4, 5, 6, 7, 9}
Intersection of Sets
• The intersection of two sets contains only those elements common to the two sets.
X = {1, 4, 7, 9}
Y = {2, 3, 4, 5, 6}
X ∩ Y = {4}
Collectively Exhaustive Events
• E1, E2, and E3 together contain all elementary events for the experiment
Complementary Events
• All elementary events not in the event A are in its complementary event Ā.

$$P(\text{Sample Space}) = 1, \qquad P(\bar{A}) = 1 - P(A)$$
Counting the Possibilities
• mn Rule
• Sampling from a Population with Replacement
• Combinations: Sampling from a Population without Replacement
25
mn Rule
26
Sampling from a Population with Replacement
27
Combinations

$$\binom{N}{n} = \frac{N!}{n!(N-n)!} = \frac{1000!}{3!(1000-3)!} = 166{,}167{,}000$$
Four Types of Probability
• Marginal, P(X): the probability of X occurring
• Union, P(X ∪ Y): the probability of X or Y occurring
• Joint, P(X ∩ Y): the probability of X and Y occurring
• Conditional, P(X | Y): the probability of X occurring given that Y has occurred
General Law of Addition

$$P(X \cup Y) = P(X) + P(Y) - P(X \cap Y)$$
Design for improving productivity?
31
Problem
• A company conducted a survey for the American Society of Interior
Designers in which workers were asked which changes in office design
would increase productivity.
• Respondents were allowed to answer more than one type of design
change.
32
Problem
• If one of the survey respondents was randomly selected and asked what
office design changes would increase worker productivity,
– what is the probability that this person would select reducing noise or
more storage space?
33
Solution
34
General Law of Addition -- Example

P(N) = .70, P(S) = .67, P(N ∩ S) = .56

$$P(N \cup S) = P(N) + P(S) - P(N \cap S) = .70 + .67 - .56 = 0.81$$
Office Design Problem: Probability Matrix

                       Increase Storage Space
                       Yes     No      Total
Noise Reduction  Yes   .56     .14     .70
                 No    .11     .19     .30
Total                  .67     .33     1.00
Joint Probability Using a Contingency Table

                       Increase Storage Space
                       Yes     No      Total
Noise Reduction  Yes   .56     .14     .70
                 No    .11     .19     .30
Total                  .67     .33     1.00

$$P(N \cup S) = P(N) + P(S) - P(N \cap S) = .70 + .67 - .56 = .81$$
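A tiny sketch verifying the addition law (and previewing the conditional law below) with the matrix values:

p_n, p_s, p_n_and_s = 0.70, 0.67, 0.56
print(p_n + p_s - p_n_and_s)  # P(N or S) = 0.81
print(p_n_and_s / p_s)        # P(N | S) = P(N and S)/P(S) ~ 0.836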
Law of Conditional Probability

$$P(X \mid Y) = \frac{P(X \cap Y)}{P(Y)}$$
Office Design Problem
40
Problem
• A company data reveal that 155 employees worked one of four types of
positions.
• Shown here again is the raw values matrix (also called a contingency table)
with the frequency counts for each category and for subtotals and totals
containing a breakdown of these employees by type of position and by
sex.
41
Contingency Table
42
Solution
43
Problem
• Shown here are the raw values matrix and corresponding probability
matrix for the results of a national survey of 200 executives who were
asked to identify the geographic locale of their company and their
company’s industry type.
• The executives were only allowed to select one locale and one industry
type.
44
Data Analytics with Python
Lecture 9: Probability Distributions-II
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT
IIT ROORKEE
1
Some Special Distributions
• Discrete
– Binomial
– Poisson
– Hyper geometric
• Continuous
– Uniform
– Exponential
– Normal
2
Binomial Distribution
• Let us consider the purchase decisions of the next three customers who
enter a store.
• What is the probability that two of the next three customers will make a
purchase?
3
Tree diagram for the Martin clothing store problem
4
Trial Outcomes
5
Graphical representation of the probability distribution for the number of customers making a purchase

x    P(x)
0    0.7 × 0.7 × 0.7 = 0.343
1    0.3×0.7×0.7 + 0.7×0.3×0.7 + 0.7×0.7×0.3 = 0.441
2    0.189
3    0.027
Binomial Distribution: Assumptions
• Experiment involves n identical trials
• Each trial has exactly two possible outcomes: success and failure
• Each trial is independent of the previous trials
• p is the probability of a success on any one trial
q = (1-p) is the probability of a failure on any one trial
• p and q are constant throughout the experiment
• X is the number of successes in the n trials
7
Binomial Distribution

• Probability function:
$$P(X) = \frac{n!}{X!(n-X)!}\, p^X q^{n-X} \quad \text{for } 0 \le X \le n$$

• Mean value: $\mu = np$
• Variance and standard deviation: $\sigma^2 = npq$, $\sigma = \sqrt{npq}$
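A SciPy sketch reproducing the purchase example above and the table lookup on the next slide:

from scipy.stats import binom
# P(exactly 2 of the next 3 customers purchase), p = 0.3
print(binom.pmf(2, n=3, p=0.3))    # 0.189
# the slide's table example: n = 10, x = 3, p = 0.40
print(binom.pmf(3, n=10, p=0.40))  # ~0.2150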
Binomial Table
SELECTED VALUES FROM THE BINOMIAL PROBABILITY TABLE
EXAMPLE: n = 10, x = 3, p = .40; f (3) = .2150
9
Mean and Variance
• Suppose that for the next month the Clothing Store forecasts 1000
customers will enter the store.
• What is the expected number of customers who will make a purchase?
• The answer is μ = np = (1000)(.3) = 300.
• For the next 1000 customers entering the store, the variance and standard deviation for the number of customers who will make a purchase are σ² = npq = (1000)(.3)(.7) = 210 and σ = √210 ≈ 14.49
Poisson Distribution
11
Poisson Distribution: Applications
• Arrivals at queuing systems
– airports -- people, airplanes, automobiles, baggage
– banks -- people, automobiles, loan applications
– computer file servers -- read and write operations
12
Poisson Distribution

• Probability function:
$$P(X) = \frac{\lambda^X e^{-\lambda}}{X!} \quad \text{for } X = 0, 1, 2, \ldots$$
Poisson Distribution: Example

With λ = 6.4:
$$P(X = 10) = \frac{6.4^{10} e^{-6.4}}{10!}, \qquad P(X = 6) = \frac{6.4^{6} e^{-6.4}}{6!}$$
Poisson Probability Table
Example: μ = 10, x = 5; f (5) = .0378
15
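A one-line SciPy sketch of the table lookup shown above:

from scipy.stats import poisson
print(poisson.pmf(5, mu=10))  # ~0.0378, matching the table example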
The Hypergeometric Distribution
• Each trial has exactly two possible outcomes, success and failure.
17
Hypergeometric Distribution

• Probability function:
$$P(x) = \frac{\binom{A}{x}\binom{N-A}{n-x}}{\binom{N}{n}}$$
where N is the population size, n the sample size, A the number of successes in the population, and x the number of successes in the sample.

• Mean value:
$$\mu = \frac{An}{N}$$

• Variance and standard deviation:
$$\sigma^2 = \frac{A(N-A)\,n(N-n)}{N^2(N-1)}, \qquad \sigma = \sqrt{\sigma^2}$$
The Hypergeometric Distribution -- Example
• Three computers are checked from the 10 in the department. 4 of the 10 computers have illegal software loaded.
• What is the probability that 2 of the 3 selected computers have illegal software loaded?
• So, N = 10, n = 3, A = 4, x = 2

$$P(x = 2) = \frac{\binom{4}{2}\binom{6}{1}}{\binom{10}{3}} = \frac{(6)(6)}{120} = 0.3$$

• The probability that 2 of the 3 selected computers have illegal software loaded is .30, or 30%.
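The same result with SciPy; note scipy.stats.hypergeom names its parameters differently from the slide (M = population size, n = successes in the population, N = sample size):

from scipy.stats import hypergeom
print(hypergeom.pmf(2, M=10, n=4, N=3))  # 0.3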
Continuous Probability Distributions
• Uniform
• Normal
• Exponential
The Uniform Distribution

$$f(x) = \begin{cases} \dfrac{1}{b-a} & \text{for } a \le x \le b \\ 0 & \text{for all other values} \end{cases}$$

[Rectangular density of height 1/(b − a) over [a, b]; total area = 1]
Uniform Distribution: Mean and Standard Deviation

$$\mu = \frac{a+b}{2}, \qquad \sigma = \frac{b-a}{\sqrt{12}}$$
The Uniform Distribution -- Example

f(X) = 1/(6 − 2) = .25 for 2 ≤ X ≤ 6
μ = (a + b)/2 = (2 + 6)/2 = 4
σ = (b − a)/√12 = (6 − 2)/√12 = 1.1547
Uniform Distribution Example

f(x) = 1/(47 − 41) = 1/6 for 41 ≤ x ≤ 47, and 0 for all other values

Mean: μ = (a + b)/2 = (41 + 47)/2 = 88/2 = 44

Probability of an interval:
$$P(x_1 \le X \le x_2) = \frac{x_2 - x_1}{b - a}$$
$$P(42 \le X \le 45) = \frac{45 - 42}{47 - 41} = \frac{1}{2}$$
Example : Uniform Distribution
• Suppose the flight time can be any value in the interval from 120 minutes
to 140 minutes.
• Because the random variable x can assume any value in that interval, x is a
continuous rather than a discrete random variable
29
Example : Uniform Distribution contd….
• Let us assume that sufficient actual flight data are available to conclude
that the probability of a flight time within any 1-minute interval is the
same as the probability of a flight time within any other 1-minute interval
contained in the larger interval from 120 to 140 minutes.
• With every 1-minute interval being equally likely, the random variable x is
said to have a uniform probability distribution.
30
Uniform Probability Distribution for Flight time
31
Probability of a flight time between 120 and 130
minutes
32
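A SciPy sketch of the flight-time probability just described:

from scipy.stats import uniform
flight = uniform(loc=120, scale=20)       # uniform on [120, 140]; scale is the width
print(flight.cdf(130) - flight.cdf(120))  # P(120 <= x <= 130) = 0.5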
Exponential Probability Distribution
• The exponential probability distribution is useful in describing the time it
takes to complete a task.
• The exponential random variable can be used to describe, for example, the time between arrivals at a service facility
• Density function:
$$f(x) = \frac{1}{\mu} e^{-x/\mu} \quad \text{for } x > 0,\ \mu > 0$$
where: μ = mean, e = 2.71828
Exponential Probability Distribution
• Suppose that x represents the loading time for a truck at loading dock and
follows such a distribution.
• If the mean, or average, loading time is 15 minutes (μ = 15), the appropriate probability density function for x is f(x) = (1/15) e^(−x/15)
[Graph: exponential density for the loading dock example]
Exponential Probability Distribution
• Cumulative probabilities:
$$P(x \le x_0) = 1 - e^{-x_0/\mu}$$
where x₀ = some specific value of x
Example: Exponential Probability Distribution
• The petrol pump owner would like to know the probability that the time between two successive arrivals will be 2 minutes or less
• Because the average number of arrivals is 10 cars per hour, the average time between cars arriving is μ = 60/10 = 6 minutes
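A quick sketch of this calculation, assuming the 2-minute interval of interest stated above:

import math
mu, x0 = 6.0, 2.0
print(1 - math.exp(-x0 / mu))  # P(x <= 2) ~ 0.2835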
The Normal Distribution: Properties
• 'Bell shaped' and symmetrical
• Mean, median and mode are equal
• Location is characterized by the mean, μ
• Spread is characterized by the standard deviation, σ
• The random variable has an infinite theoretical range: −∞ to +∞
The Normal Distribution: Density Function
The formula for the normal probability density function is

$$f(X) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^2}$$

Where e = the mathematical constant approximated by 2.71828
π = the mathematical constant approximated by 3.14159
μ = the population mean
σ = the population standard deviation
X = any value of the continuous variable
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT
IIT ROORKEE
1
The Normal Distribution: Shape
• Changing σ increases or decreases the spread.
[Bell curve centered at μ; a larger σ gives a wider, flatter curve]
The Standardized Normal Distribution

$$Z = \frac{X - \mu}{\sigma}$$
The Standardized Normal Distribution: Density Function
[Standard normal curve f(Z), centered at 0]
Values above the mean have positive Z-values; values below the mean have negative Z-values.
The Standardized Normal Distribution: Example

$$Z = \frac{X - \mu}{\sigma} = \frac{200 - 100}{50} = 2.0$$

• This says that X = 200 is two standard deviations (2 increments of 50 units) above the mean of 100.
The Standardized Normal Distribution: Example
Note that the distribution is the same, only the scale has changed. We
can express the problem in original units (X) or in standardized units (Z)
Normal Probabilities
[The probability P(a ≤ X ≤ b) is the area under the density f(X) between a and b]
Normal Probabilities
The total area under the curve is 1.0, and the curve is symmetric, so half is above the mean, half is below:

$$P(-\infty < X \le \mu) = 0.5, \qquad P(\mu \le X < \infty) = 0.5, \qquad P(-\infty < X < \infty) = 1.0$$
Normal Probability Tables
Example:
P(Z < 2.00) = .9772
.9772
0 2.00 Z
Finding Normal Probability: Example
• Suppose X is normal with mean 8.0 and standard deviation 5.0. Find P(X < 8.6).

$$Z = \frac{X - \mu}{\sigma} = \frac{8.6 - 8.0}{5.0} = 0.12$$

[Standardizing maps X = 8.6 on the X scale (μ = 8, σ = 5) to Z = 0.12 on the Z scale (μ = 0, σ = 1)]
Finding Normal Probability: Between Two Values
Calculate Z-values for both endpoints:

$$Z = \frac{8 - 8}{5} = 0, \qquad Z = \frac{8.6 - 8}{5} = 0.12$$

P(8 < X < 8.6) = P(0 < Z < 0.12)
Given Normal Probability, Find the X Value: Example
• Let X represent the time it takes (in seconds) to download an image file from the internet.
• Suppose X is normal with mean 8.0 and standard deviation 5.0.
• Find X such that 20% of download times are less than X.
[Normal curve with lower-tail area .2000 to the left of the unknown X]
Given Normal Probability, Find the X Value
X μ Zσ
8.0 (0.84)5.0
3.80
So 20% of the download times from the distribution with mean 8.0
and standard deviation 5.0 are less than 3.80 seconds.
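A SciPy sketch reproducing all three normal calculations above:

from scipy.stats import norm
x = norm(loc=8.0, scale=5.0)
print(x.cdf(8.6))               # P(X < 8.6) ~ 0.548
print(x.cdf(8.6) - x.cdf(8.0))  # P(8 < X < 8.6) ~ 0.048
print(x.ppf(0.20))              # the X with 20% of values below it ~ 3.79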
Assessing Normality
• It is important to evaluate how well the data set is approximated by a normal
distribution.
• Normally distributed data should approximate the theoretical normal
distribution:
– The normal distribution is bell shaped (symmetrical) where the mean is
equal to the median.
– The empirical rule applies to the normal distribution.
– The interquartile range of a normal distribution is 1.33 standard deviations.
Assessing Normality
• Construct charts or graphs
– For small- or moderate-sized data sets, do stem-and-leaf display
and box-and-whisker plot look symmetric?
– For large data sets, does the histogram or polygon appear bell-
shaped?
• Compute descriptive summary measures
– Do the mean, median and mode have similar values?
– Is the interquartile range approximately 1.33 σ?
– Is the range approximately 6 σ?
Table Lookup of a Standard Normal Probability
(Rows give Z to one decimal; columns give the second decimal of Z, .00 through .09; entries are the area between 0 and Z)

0.00 0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359
0.10 0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753
0.20 0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141
0.30 0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517
0.90 0.3159 0.3186 0.3212 0.3238 0.3264 0.3289 0.3315 0.3340 0.3365 0.3389
1.00 0.3413 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577 0.3599 0.3621
1.10 0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.3770 0.3790 0.3810 0.3830
1.20 0.3849 0.3869 0.3888 0.3907 0.3925 0.3944 0.3962 0.3980 0.3997 0.4015
2.00 0.4772 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808 0.4812 0.4817
3.00 0.4987 0.4987 0.4987 0.4988 0.4988 0.4989 0.4989 0.4989 0.4990 0.4990
3.40 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4998
3.50 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998

Example: P(0 ≤ Z ≤ 1) = 0.3413
Distribution of Sample Mean, proportion,
and variance
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT
IIT ROORKEE
1
2
Acceptance Intervals
Goal: determine a range within which sample means are likely to occur, given a population mean and variance

• By the Central Limit Theorem, we know that the distribution of X̄ is approximately normal if n is large enough, with mean μ and standard deviation σ_X̄ = σ/√n
• Let z_{α/2} be the z-value that leaves area α/2 in the upper tail of the normal distribution (i.e., the interval −z_{α/2} to z_{α/2} encloses probability 1 − α)
• Then

$$\mu \pm z_{\alpha/2}\,\sigma_{\bar{X}}$$

is the interval that includes X̄ with probability 1 − α
Sampling Distributions of Sample Proportions

P = the proportion of the population having some characteristic
• The sample proportion p̂ provides an estimate of P

Sampling Distribution of p̂
• Normal approximation for large samples

Properties:
$$E(\hat{p}) = P, \qquad \sigma_{\hat{p}}^2 = \mathrm{Var}\!\left(\frac{X}{n}\right) = \frac{P(1-P)}{n}$$
(where P = population proportion)
7
Z-Value for Proportions

$$Z = \frac{\hat{p} - P}{\sigma_{\hat{p}}} = \frac{\hat{p} - P}{\sqrt{\dfrac{P(1-P)}{n}}}$$
Example
[Worked example: a sample proportion is standardized with the Z-value above and the probability .4251 is read from the standard normal table]

Sampling Distributions
Sample Variance
• Let x1, x2, . . . , xn be a random sample from a population. The
sample variance is
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
Sampling Distribution of Sample Variances
• The sampling distribution of s2 has mean σ2
E(s2 ) σ 2
14
15
The Chi-square Distribution
[Chi-square density curves for different degrees of freedom]
Degrees of Freedom (df)
Idea: Number of observations that are free to vary after sample
mean has been calculated
Example: Suppose the mean of 3 numbers is 8.0.
Let X1 = 7 and X2 = 8. What is X3?
If the mean of these three values is 8.0, then X3 must be 9 (i.e., X3 is not free to vary).
18
Finding the Chi-square Value

$$\chi^2 = \frac{(n-1)s^2}{\sigma^2}$$

is chi-square distributed with (n − 1) = 13 degrees of freedom.

• Use the chi-square distribution with area 0.05 in the upper tail:
χ²₁₃ = 22.36 (α = .05 and 14 − 1 = 13 d.f.)
Chi-square Example (continued)

$$K = \frac{(22.36)(16)}{(14-1)} = 27.52$$
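A one-line SciPy sketch of the critical-value lookup used above:

from scipy.stats import chi2
print(chi2.ppf(0.95, df=13))  # upper 5% critical value ~ 22.36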
Confidence Interval Estimation: Single
Population
Dr. A. Ramesh
Department of Management Studies
IIT ROORKEE
1
Goals
After completing this lecture, you should be able to:
• Distinguish between a point estimate and a confidence interval estimate
• Construct and interpret a confidence interval estimate for a single
population mean using both the Z and t distributions
• Form and interpret a confidence interval estimate for a single population
proportion
• Create confidence interval estimates for the variance of a normal
population
2
Confidence Intervals
• Confidence Intervals for the Population Mean, μ
– when Population Variance σ2 is Known
– when Population Variance σ2 is Unknown
• Confidence Intervals for the Population Proportion, p̂ (large samples)
• Confidence interval estimates for the variance of a normal population
3
Definitions
4
Point and Interval Estimates
[Diagram: a point estimate sits at the center; the lower and upper confidence limits bound the width of the confidence interval]
Point Estimates

Parameter         Point Estimate
Mean, μ           x̄
Proportion, P     p̂
Unbiasedness
E(θˆ ) θ
• Examples:
– The sample mean x is an unbiased estimator of μ
– The sample variance s2 is an unbiased estimator of σ2
– The sample proportion p̂ is an unbiased estimator of P
7
Unbiasedness
(continued)
• θ̂₁ is an unbiased estimator; θ̂₂ is biased:
[Sampling distributions of θ̂₁ (centered at θ) and θ̂₂ (centered away from θ)]
Bias
• Let θ̂ be an estimator of θ. The bias of θ̂ is defined as

$$\text{Bias}(\hat{\theta}) = E(\hat{\theta}) - \theta$$

• The bias of an unbiased estimator is 0
Most Efficient Estimator
• Suppose there are several unbiased estimators of θ
• The most efficient estimator, or minimum variance unbiased estimator, of θ is the unbiased estimator with the smallest variance
• Let θ̂₁ and θ̂₂ be two unbiased estimators of θ, based on the same number of sample observations. Then,
– θ̂₁ is said to be more efficient than θ̂₂ if Var(θ̂₁) < Var(θ̂₂)
Confidence Intervals
11
Confidence Interval Estimate
12
Confidence Interval and Confidence Level
13
Estimation Process
Sample
14
Confidence Level, (1 − α)
(continued)
• Suppose confidence level = 95%
• Also written (1 − α) = 0.95
• A relative frequency interpretation:
– From repeated samples, 95% of all the confidence intervals that can
be constructed will contain the unknown true parameter
• A specific interval either will contain or will not contain the true
parameter
15
General Formula
• The general formula for all confidence intervals is:
Point Estimate ± (Reliability Factor)(Standard Error)
• The value of the reliability factor depends on the desired level of confidence
Confidence Intervals
Confidence
Intervals
σ2 Known σ2 Unknown
17
Confidence Interval for μ (σ2 Known)
• Assumptions
– Population variance σ2 is known
– Population is normally distributed
– If population is not normal, use large sample
• Confidence interval estimate:

$$\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$

(where z_{α/2} is the normal distribution value for a probability of α/2 in each tail)
18
Margin of Error
• The confidence interval

$$\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$

can also be written as x̄ ± ME, where ME is the margin of error:

$$ME = z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$
Reducing the Margin of Error

$$ME = z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$

The margin of error can be reduced if:
• the population standard deviation σ is reduced
• the sample size n is increased
• the confidence level (1 − α) is decreased
Finding the Reliability Factor, z_{α/2}
• Consider a 95% confidence interval: 1 − α = .95, so the area in each tail is α/2 = .025

Confidence Level   Confidence Coefficient (1 − α)   z_{α/2} value
80%                .80                              1.28
90%                .90                              1.645
95%                .95                              1.96
98%                .98                              2.33
99%                .99                              2.58
99.8%              .998                             3.08
99.9%              .999                             3.27
Intervals and Level of Confidence
[Sampling distribution of the mean: repeated samples yield intervals from LCL = x̄ − z σ/√n to UCL = x̄ + z σ/√n; 100(1 − α)% of intervals constructed this way contain μ, and 100(α)% do not]
Example
• A sample of 11 circuits from a large normal population has a mean resistance of 2.20 ohms. We know from past testing that the population standard deviation is 0.35 ohms.
• Determine a 95% confidence interval for the true mean resistance of the population.
Example (continued)

$$2.20 \pm .2068 \quad\Rightarrow\quad 1.9932 \le \mu \le 2.4068$$
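A sketch of this z-interval in SciPy, using the sample values stated in the example:

import math
from scipy.stats import norm

xbar, sigma, n = 2.20, 0.35, 11
z = norm.ppf(0.975)            # 1.96 for 95% confidence
me = z * sigma / math.sqrt(n)  # margin of error ~ 0.2068
print(xbar - me, xbar + me)    # ~1.9932, 2.4068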
Interpretation
• We are 95% confident that the true mean resistance is between 1.9932 and 2.4068 ohms
• Although the true mean may or may not be in this interval, 95% of intervals formed in this manner will contain the true mean
Confidence Intervals
Confidence
Intervals
σ2 Known σ2 Unknown
27
Confidence Interval Estimation: Single
Population-II
Dr. A. Ramesh
Department of Management Studies
IIT ROORKEE
1
Student’s t Distribution
• Consider a random sample of n observations
– with mean x and standard deviation s
– from a normally distributed population with mean μ
2
Confidence Interval for μ (σ2 Unknown)
3
Confidence Interval for μ (σ Unknown)
(continued)
• Assumptions
– Population standard deviation is unknown
– Population is normally distributed
– If population is not normal, use large sample
• Use Student's t Distribution
• Confidence interval estimate:

$$\bar{x} - t_{n-1,\alpha/2}\frac{S}{\sqrt{n}} \le \mu \le \bar{x} + t_{n-1,\alpha/2}\frac{S}{\sqrt{n}}$$

where t_{n-1,α/2} is the critical value of the t distribution with n − 1 d.f. and an area of α/2 in each tail
4
Margin of Error
• The confidence interval

$$\bar{x} - t_{n-1,\alpha/2}\frac{S}{\sqrt{n}} \le \mu \le \bar{x} + t_{n-1,\alpha/2}\frac{S}{\sqrt{n}}$$

can be written as x̄ ± ME, with margin of error:

$$ME = t_{n-1,\alpha/2}\frac{S}{\sqrt{n}}$$
Student’s t Distribution
6
Student's t Distribution
Note: t → Z as n increases
[t-distributions are bell-shaped and symmetric but have 'fatter' tails than the normal; the t density for df = 5 is flatter than for df = 13, and with df = ∞ it coincides with the standard normal]
Student's t Table
Confidence Level | t (10 d.f.) | t (20 d.f.) | t (30 d.f.) | Z
[Critical values shrink toward the Z value as the degrees of freedom increase]
Note: t → Z as n increases
Example
Confidence
Intervals
σ2 Known σ2 Unknown
11
Confidence Intervals for the
Population Proportion
12
Confidence Intervals for the Population Proportion, p (continued)
• The standard deviation of the sample proportion is

$$\sigma_{\hat{p}} = \sqrt{\frac{P(1-P)}{n}}$$

• We will estimate this with sample data:

$$\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
Confidence Interval Endpoints
• Upper and lower confidence limits for the population proportion are calculated with the formula

$$\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \le P \le \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

• where
– z_{α/2} is the standard normal value for the level of confidence desired
– p̂ is the sample proportion
– n is the sample size
– nP(1 − P) > 5
Example
• A random sample of 100 people shows that 25 are left-handed.
• Form a 95% confidence interval for the true proportion of left-handers.

Example (continued)
$$\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \le P \le \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
$$\frac{25}{100} - 1.96\sqrt{\frac{.25(.75)}{100}} \le P \le \frac{25}{100} + 1.96\sqrt{\frac{.25(.75)}{100}}$$
$$0.1651 \le P \le 0.3349$$
Interpretation
• Although the interval from 0.1651 to 0.3349 may or may not contain the true
proportion, 95% of intervals formed from samples of size 100 in this manner
will contain the true proportion.
17
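A sketch of the proportion interval above:

import math
from scipy.stats import norm

p_hat, n = 25 / 100, 100
me = norm.ppf(0.975) * math.sqrt(p_hat * (1 - p_hat) / n)
print(p_hat - me, p_hat + me)  # ~0.1651, 0.3349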
Confidence Intervals
Confidence
Intervals
σ2 Known σ2 Unknown
18
Confidence Intervals for the Population
Variance
19
Confidence Intervals for the Population Variance
(continued)
20
Confidence Intervals for the Population Variance

$$\frac{(n-1)s^2}{\chi^2_{n-1,\,\alpha/2}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{n-1,\,1-\alpha/2}}$$
Example
• Sample size n = 17, sample mean = 3004, sample standard deviation s = 74
• Construct a 95% confidence interval for the population variance of CPU speed
Finding the Chi-square Values
• n = 17, so use 16 degrees of freedom with α/2 = .025 in each tail:
χ²₁₆,₀.₉₇₅ = 6.91 and χ²₁₆,₀.₀₂₅ = 28.85

Calculating the Confidence Limits

$$\frac{(17-1)(74)^2}{28.85} \le \sigma^2 \le \frac{(17-1)(74)^2}{6.91}$$
$$3037 \le \sigma^2 \le 12683$$

Converting to standard deviation, we are 95% confident that the population standard deviation of CPU speed is between 55.1 and 112.6 MHz
24
Finite Populations
25
Finite Population Correction Factor
$$\sqrt{\frac{N-n}{N-1}} = \text{finite population correction factor}$$
26
Estimating the Population Mean
27
Finite Populations: Mean
• The estimated variance of the sample mean is

$$\hat{\sigma}_{\bar{x}}^2 = \frac{s^2}{n}\cdot\frac{N-n}{N-1}$$

• So the 100(1 − α)% confidence interval for the population mean is

$$\bar{x} - t_{n-1,\alpha/2}\,\hat{\sigma}_{\bar{x}} \le \mu \le \bar{x} + t_{n-1,\alpha/2}\,\hat{\sigma}_{\bar{x}}$$
Estimating the Population Proportion
29
Finite Populations: Proportion

$$\hat{p} - z_{\alpha/2}\,\hat{\sigma}_{\hat{p}} \le P \le \hat{p} + z_{\alpha/2}\,\hat{\sigma}_{\hat{p}}$$
Lecture Summary
• Introduced the concept of confidence intervals
• Discussed point estimates
• Developed confidence interval estimates
• Created confidence interval estimates for the mean (σ2
known)
• Introduced the Student’s t distribution
• Determined confidence interval estimates for the mean (σ2
unknown)
31
Lecture Summary
(continued)
• Created confidence interval estimates for the proportion
• Created confidence interval estimates for the variance of a normal
population
• Applied the finite population correction factor to form confidence
intervals when the sample size is not small relative to the population size
32
Summary
• Introduced sampling distributions
• Described the sampling distribution of sample means
– For normal populations
– Using the Central Limit Theorem
• Described the sampling distribution of sample proportions
• Introduced the chi-square distribution
• Examined sampling distributions for sample variances
• Calculated probabilities using sampling distributions
33
Thank You
34
Welcome to
TA Live Session 3
GOOD OR BAD ?

Poisson calculations with λ = 2.5:
$$P(X = 1) = \frac{\lambda^1 e^{-\lambda}}{1!} = \frac{2.5^1 e^{-2.5}}{1!}$$
$$P(X \le 2) = \frac{\lambda^0 e^{-\lambda}}{0!} + \frac{\lambda^1 e^{-\lambda}}{1!} + \frac{\lambda^2 e^{-\lambda}}{2!}$$
What is the probability that you will get the fresh tomato in the first trial?
A) 1/11   B) 1/10   C) 1   D) 9/10

What is the probability that you will get the fresh tomato in the third trial?
A) 1/11   B) 1/10   C) 1   D) 9/10

$$\frac{10}{11} \times \frac{9}{10} \times \frac{1}{9} = \frac{1}{11}$$
THANK YOU!
1
Lecture Objectives
After completing this lecture, you should be able to:
• Describe a simple random sample and why sampling is important
• Explain the difference between descriptive and inferential statistics
• Define the concept of a sampling distribution
• Determine the mean and standard deviation for the sampling distribution of the sample mean, x̄
2
Lecture Objectives
3
Descriptive vs Inferential Statistics
• Descriptive statistics
– Collecting, presenting, and describing data
• Inferential statistics
– Drawing conclusions and/or making decisions concerning a population
based only on sample data
4
Populations and Samples
5
Population vs. Sample
• Population: all items of interest (a b c d e f g h i j k l m n o p q r s t u v w x y z)
• Sample: a subset drawn from the population (b c g i n o r u y)
Why Sample?
• Less time consuming than a census
• Less costly to administer than a census
• It is possible to obtain statistical results of a sufficiently high precision
based on samples.
• Because the research process is sometimes destructive, the sample can
save product
• If accessing the population is impossible; sampling is the only option
7
Reasons for Taking a Census

Random Sampling Techniques
• Simple random sampling
• Stratified random sampling
– Proportionate
– Disproportionate
• Systematic sampling
• Cluster (or area) sampling
Simple Random Sample:
Numbered Population Frame
9 9 4 3 7 8 7 9 6 1 4 5 7 3 7 3 7 5 5 2 9 7 9 6 9 3 9 0 9 4 3 4 4 7 5 3 1 6 1 8
5 0 6 5 6 0 0 1 2 7 6 8 3 6 7 6 6 8 8 2 0 8 1 5 6 8 0 0 1 6 7 8 2 2 4 5 8 3 2 6
8 0 8 8 0 6 3 1 7 1 4 2 8 7 7 6 6 8 3 5 6 0 5 1 5 7 0 2 9 6 5 0 0 2 6 4 5 5 8 7
8 6 4 2 0 4 0 8 5 3 5 3 7 9 8 8 9 4 5 4 6 8 1 3 0 9 1 2 5 3 8 8 1 0 4 7 4 3 1 9
6 0 0 9 7 8 6 4 3 6 0 1 8 6 9 4 7 7 5 8 8 9 5 3 5 9 9 4 0 0 4 8 2 6 8 3 0 6 0 6
5 2 5 8 7 7 1 9 6 5 8 5 4 5 3 4 6 8 3 4 0 0 9 9 1 9 9 7 2 9 7 6 9 4 8 1 5 9 4 1
8 9 1 5 5 9 0 5 5 3 9 0 6 8 9 4 8 6 3 7 0 7 9 5 5 4 7 0 6 2 7 1 1 8 2 6 4 4 9 3
Simple Random Sample:
Sample Members
• N = 20
• n=4
Stratified Random Sample
• 20–30 years old (homogeneous/alike within the stratum)
• 30–40 years old (homogeneous/alike within the stratum)
• 40–50 years old (homogeneous/alike within the stratum)
• Strata are heterogeneous (different) between one another
Systematic Sampling
• Convenient and relatively easy to administer
• Population elements are an ordered sequence (at least, conceptually)
• The first sample element is selected randomly from the first k population elements

k = N/n, where:
n = sample size
N = population size
k = size of selection interval

Example:
• Purchase orders for the previous fiscal year are serialized 1 to 10,000 (N = 10,000).
• A sample of fifty (n = 50) purchase orders is needed for an audit.
• k = 10,000/50 = 200
• The first sample element is randomly selected from the first 200 purchase orders. Assume the 45th purchase order was selected.
• Subsequent sample elements: 245, 445, 645, . . .
Cluster Sampling
• Quota Sampling: Sample elements are selected until the quota controls are
satisfied
Process of Inferential Statistics
• From the population (parameter μ), select a random sample.
• Calculate x̄ (the statistic) from the sample to estimate μ.

Inferential Statistics
[Diagram: conclusions about the population are drawn from the sample]
Inferential Statistics
Drawing conclusions and/or making decisions concerning a
population based on sample results.
• Estimation
– e.g., Estimate the population mean weight
using the sample mean weight
• Hypothesis Testing
– e.g., Use sample evidence to test the claim
that the population mean weight is 120
pounds
25
Sampling Distributions
26
Types of sampling distributions
• Distributions of sample means, sample proportions, and sample variances

Sampling Distributions of Sample Means
Developing a Sampling Distribution
29
Developing a Sampling Distribution (continued)
• Population of N = 4 values (A, B, C, D): 18, 20, 22, 24

$$\mu = \frac{\sum X_i}{N} = \frac{18 + 20 + 22 + 24}{4} = 21$$
$$\sigma = \sqrt{\frac{\sum (X_i - \mu)^2}{N}} = 2.236$$

[Uniform distribution: P(x) = .25 for each of the values 18, 20, 22, 24]
Developing a Sampling Distribution (continued)
Now consider all possible samples of size n = 2 (16 possible samples, sampling with replacement):

Samples (1st Obs \ 2nd Obs):
      18      20      22      24
18    18,18   18,20   18,22   18,24
20    20,18   20,20   20,22   20,24
22    22,18   22,20   22,22   22,24
24    24,18   24,20   24,22   24,24

Sample means:
      18   20   22   24
18    18   19   20   21
20    19   20   21   22
22    20   21   22   23
24    21   22   23   24
Developing a Sampling Distribution (continued)

$$E(\bar{X}) = \frac{\sum \bar{X}_i}{16} = \frac{18 + 19 + 21 + \cdots + 24}{16} = 21 = \mu$$

$$\sigma_{\bar{X}} = \sqrt{\frac{\sum (\bar{X}_i - \mu)^2}{16}} = \sqrt{\frac{(18-21)^2 + (19-21)^2 + \cdots + (24-21)^2}{16}} = 1.58$$
Comparing the Population with its Sampling Distribution

Population (N = 4): μ = 21, σ = 2.236
Sample means (n = 2): μ_X̄ = 21, σ_X̄ = 1.58

[Bar charts: the population distribution is uniform across 18, 20, 22, 24; the distribution of the sample means is peaked at 21]
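A short sketch enumerating all 16 samples above and confirming both properties:

import itertools
import statistics

population = [18, 20, 22, 24]
# all 16 samples of size n = 2, drawn with replacement
means = [sum(s) / 2 for s in itertools.product(population, repeat=2)]
print(statistics.mean(means))    # 21.0, equal to the population mean
print(statistics.pstdev(means))  # ~1.58, smaller than sigma = 2.236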
[Simulation figures: 1,800 randomly selected values from an exponential distribution form a right-skewed histogram. Histograms of the means of 60 samples of sizes n = 2, n = 5, and n = 30 become progressively more symmetric and bell-shaped as n increases.]
[Simulation figures: 1,800 randomly selected values from a uniform distribution form a flat histogram. Histograms of the means of 60 samples of sizes n = 2, n = 5, and n = 30 concentrate around the population mean and look increasingly normal as n increases.]
Expected Value of Sample Mean

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad E(\bar{X}) = \mu$$
Standard Error of the Mean
• Different samples of the same size from the same population will yield
different sample means
• A measure of the variability in the mean from sample to sample is given by
the Standard Error of the Mean:
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$
• Note that the standard error of the mean decreases as the sample size
increases
44
If sample values are not independent (continued)

$$\mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n}\cdot\frac{N-n}{N-1} \qquad \text{or} \qquad \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}}$$
If the Population is Normal

$$\mu_{\bar{X}} = \mu \qquad \text{and} \qquad \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$

• If the sample size n is not small relative to the population size N, then

$$\mu_{\bar{X}} = \mu \qquad \text{and} \qquad \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}}$$
Z-value for Sampling Distribution of the Mean

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$
Sampling Distribution Properties
• μ_x̄ = μ (i.e., x̄ is unbiased)
• For a normal population distribution, the sampling distribution of x̄ is also normal, with the same mean
• As n increases, σ_x̄ decreases: a larger sample size gives a tighter distribution around μ
If the Population is not Normal: Central Limit Theorem
We can apply the Central Limit Theorem:

$$\mu_{\bar{x}} = \mu \qquad \text{and} \qquad \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$

Central Limit Theorem: as the sample size n gets large enough, the sampling distribution of the mean becomes almost normal regardless of the shape of the population.
If the Population is not Normal (continued)
Sampling distribution properties:
• Central tendency: μ_x̄ = μ
• Variation: σ_x̄ = σ/√n
[The sampling distribution becomes normal as n increases; a larger sample size gives a smaller spread]
How Large is Large Enough?
53
Example
• Suppose a population has mean μ = 8 and standard deviation σ = 3, and a random sample of size n = 36 is selected.
• What is the probability that the sample mean is between 7.8 and 8.2?
Example
Solution:
• Even if the population is not normally distributed, the central limit
theorem can be used (n > 25)
• … so the sampling distribution of x is approximately normal
• … with mean μx = 8
• …and standard deviation
σ 3
σx 0.5
n 36
55
Example (continued)
Solution (continued):

$$P(7.8 \le \bar{X} \le 8.2) = P\!\left(\frac{7.8 - 8}{3/\sqrt{36}} \le Z \le \frac{8.2 - 8}{3/\sqrt{36}}\right) = P(-0.4 \le Z \le 0.4) = 0.3108$$

[Standardizing maps the sampling distribution of X̄ (mean 8) onto the standard normal distribution (mean 0); the shaded central area gives the probability]
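A quick SciPy check of this calculation:

import math
from scipy.stats import norm

mu, sigma, n = 8, 3, 36
se = sigma / math.sqrt(n)  # standard error = 0.5
print(norm.cdf(8.2, mu, se) - norm.cdf(7.8, mu, se))  # ~0.3108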
Errors in Hypothesis Testing
Dr. A. Ramesh
Department of Management Studies
Indian Institute of Technology Roorkee
1
Example
• We are interested in burning rate of a solid propellant used to power aircrew escape systems
Reference: Applied statistics and probability for engineers, Douglas C. Montgomery, George C. Runger, John Wiley &
Sons, 2007
2
Value of the null hypothesis
The value of the parameter specified in the null hypothesis usually comes from one of three sources:
– Past experience or knowledge of the process, or even from previous tests or experiments
– A theory or model regarding the process
– External considerations, such as design or engineering specifications, or contractual obligations
Note: for this example n=10
4
Type I Error
• The true mean burning rate of the propellant could be equal to 50 centimeters per second
• However, for the randomly selected propellant specimens that are tested, we could observe a value of the test statistic x̄ that falls into the critical region
• We would then reject the null hypothesis H0 in favor of the alternative H1 when, in fact, H0 is really true
Type I Error
6
Type II Error
• Now suppose the true mean burning rate is different from 50 centimeters per second, yet the sample mean x̄ falls into the acceptance region; we would then fail to reject H0 even though it is false
Type II Error
8
Type I and Type II Errors

Decision             H0 is correct        H0 is incorrect
Fail to reject H0    Correct decision     Type II error (β)
Reject H0            Type I error (α)     Correct decision
Type I error
• In the propellant burning rate example, a Type I error will occur when either x̄ > 51.5 or x̄ < 48.5
• Suppose the standard deviation of burning rate is σ = 2.5 centimeters per second and n = 10
• The Type I error probability is

$$\alpha = P(\bar{x} < 48.5 \text{ when } \mu = 50) + P(\bar{x} > 51.5 \text{ when } \mu = 50)$$

• We will reject the null hypothesis (μ = 50) if our sample mean falls in either of these two regions
Type I error
• This implies that 5.7% of all random samples would lead to rejection of the hypothesis H0: μ = 50 when the true mean burning rate is really 50 centimeters per second
• We can reduce the Type I error by widening the acceptance region: if we make the critical values 48 and 52, the value of α decreases
Type II Error
[Figure: the shaded area under the sampling distribution is the probability of a Type II error if the actual mean is 52]
Type II Error
• A Type II error will be committed if the sample mean x̄ falls between 48.5 and 51.5 (the acceptance region) when μ ≠ 50
• When μ = 52: β = 0.2643
• When μ = 50.5: β = 0.8923
17
18
Computing the probability of a Type II error may be the most difficult concept.
For constant n, increasing the acceptance region (hence decreasing α) increases β.
20
Type I & II Errors Have an Inverse Relationship
21
Factors Affecting Type II Error
• β depends on α, σ, the true value of the mean, and the sample size n
22
How to Choose between Type I and Type II Errors
• Choose smaller Type I Error when the cost of rejecting the maintained hypothesis is high
• Choose larger Type I Error when you have an interest in changing the status quo
23
Calculating the probability of Type II Error
Ho: µ = 8.3
H1: µ < 8.3
Determine the probability of Type II error if µ = 7.4 at 5% significance level. σ = 3.1 and n = 60.
24
Solution:
A Type II error will be made when z ≥ -1.645 (the sample mean falls above the critical value), for then we fail to reject H0 even though μ = 7.4.
β = 0.2729
25
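A minimal scipy sketch of this Type II error calculation:

```python
# beta for H0: mu = 8.3 vs H1: mu < 8.3 when the true mean is 7.4 (sigma = 3.1, n = 60, alpha = .05)
from scipy import stats
import math

mu0, mu_true, sigma, n, alpha = 8.3, 7.4, 3.1, 60, 0.05
se = sigma / math.sqrt(n)
x_crit = mu0 + stats.norm.ppf(alpha) * se              # reject H0 when x-bar < 7.642
beta = 1 - stats.norm.cdf((x_crit - mu_true) / se)     # P(fail to reject | mu = 7.4)
print(round(beta, 4))                                  # 0.2729
```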
Solving for Type II Errors:
Example
H0: μ ≥ 12    Ha: μ < 12    α = .05    σ = 0.10    n = 60
Critical value: x̄c = 12 - 1.645 (0.10/√60) = 11.979    (zc = -1.645)
Rejection region: if x̄ < 11.979, reject H0
Non-rejection region: if x̄ ≥ 11.979, do not reject H0
26
Type II Error for Example with μ = 11.99 kg
H0 is false. β = P(x̄ ≥ 11.979 | μ = 11.99) = .8023 (Type II error); the probability of a correct decision (power) is 19.77%.
[Figure: sampling distribution of x̄ under μ = 11.99 with the non-rejection region shaded]
27
28
Type II Error for Demonstration with μ = 11.96 kg
H0 is false. β = P(x̄ ≥ 11.979 | μ = 11.96) = .0708 (Type II error); the probability of a correct decision (power) is 92.92%.
[Figure: sampling distribution of x̄ under μ = 11.96 with the non-rejection region shaded]
29
30
Hypothesis Testing and Decision Making
• In the tests, we compared the p-value to a controlled probability of a Type I error, α, which is called the level of significance for the test
• With a significance test, we control the probability of making the Type I error, but
not the Type II error
• We recommended the conclusion “do not reject H0” rather than “accept H0”
because the latter puts us at risk of making a Type II error
31
Hypothesis Testing and Decision Making
• With the conclusion “do not reject H0”, the statistical evidence is considered inconclusive
• Usually this is an indication to postpone a decision until further research and testing is
undertaken
• In many decision-making situations the decision maker may want, and in some cases may be
forced, to take action with both the conclusion “do not reject H0 “and the conclusion “reject
H0.”
32
Power of a test
33
Calculating the Probability of a Type II Error
34
Calculating the Probability of a Type II Error
μ         z       β        1 - β (power)
14.0     -2.31    .0104    .9896
13.6     -1.52    .0643    .9357
13.2     -0.73    .2327    .7673
12.8323   0.00    .5000    .5000
12.8      0.06    .5239    .4761
12.4      0.85    .8023    .1977
12.0001   1.645   .9500    .0500
35
36
Power of the Test
• The probability of correctly rejecting H0 when it is false is called the power of the test.
37
Power Curve
[Figure: power curve; the probability of correctly rejecting H0 when it is false (1 - β) plotted against the true population mean from 11.5 to 14.5, rising from about .05 near μ = 12 toward 1.0 near μ = 14]
38
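The β and power columns above can be recomputed directly. This sketch assumes σ = 3.2 and n = 40, values consistent with the critical value 12.8323 = 12 + 1.645·σ/√n:

```python
# Reproducing the beta / power table for the upper-tail test H0: mu <= 12
from scipy import stats
import math

sigma, n = 3.2, 40                                     # assumed; they reproduce x_crit = 12.8323
se = sigma / math.sqrt(n)
x_crit = 12 + stats.norm.ppf(0.95) * se
for mu in [14.0, 13.6, 13.2, 12.8323, 12.8, 12.4, 12.0001]:
    beta = stats.norm.cdf((x_crit - mu) / se)          # P(Type II error) when the true mean is mu
    print(mu, round(beta, 4), round(1 - beta, 4))      # small differences from the table are rounding
```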
Thank You
39
Hypothesis Testing
Class Objectives
• Population Proportion
Hypothesis Testing
• The hypothesis testing procedure uses data from a sample to test the two
competing statements indicated by H0 and Ha.
Developing Null and Alternative Hypotheses
• It is not always obvious how the null and alternative hypotheses should be
formulated
• Care must be taken to structure the hypotheses appropriately so that the test
conclusion provides the information the researcher wants
• The context of the situation is very important in determining how the hypotheses
should be stated
• In such cases, it is often best to begin with the alternative hypothesis and make it the conclusion that
the researcher hopes to support
• The conclusion that the research hypothesis is true is made if the sample data provide sufficient
evidence to show that the null hypothesis can be rejected
Developing Null and Alternative Hypotheses
• Example: A new manufacturing method is believed to be better than the current method.
• Alternative Hypothesis:
– The new method is better than the current method
• Null Hypothesis:
– The new method is no better than the current method
• Example:
• Alternative Hypothesis:
– The new drug lowers Cholesterol-level more than the existing drug
• Null Hypothesis:
– The new drug does not lower Cholesterol-level more than the existing
drug
Developing Null and Alternative Hypotheses
• We might begin with a belief or assumption that a statement about the value of a population parameter is true
• We then use a hypothesis test to challenge the assumption and determine whether there is statistical evidence to conclude that the assumption is incorrect
• Example:
• Null Hypothesis:
• Alternative Hypothesis:
• The equality part of the hypotheses always appears in the null hypothesis
• In general, a hypothesis test about the value of a population mean μ must take one of the following three forms (where μ0 is the hypothesized value of the population mean):
H0: μ ≥ μ0, Ha: μ < μ0 (lower tail);  H0: μ ≤ μ0, Ha: μ > μ0 (upper tail);  H0: μ = μ0, Ha: μ ≠ μ0 (two-tailed)
• Because hypothesis tests are based on sample data, we must allow for the
possibility of errors
• The probability of making a Type I error when the null hypothesis is true is called the level of significance
• Applications of hypothesis testing that only control the Type I error are
often called significance tests
Type II Error
• Statisticians avoid the risk of making a Type II error by using “do not reject H0” and not “accept H0”.
Type I and Type II Errors
                             Population Condition
                             H0 True (μ ≤ 8)      H0 False (μ > 8)
Accept H0 (conclude μ ≤ 8)   Correct decision     Type II error
Reject H0 (conclude μ > 8)   Type I error         Correct decision
Three Approaches for Hypothesis Testing
• p-Value
• Critical Value
• Confidence Interval
• The p-value is a probability, computed using the test statistic, that measures the support (or lack of support) provided by the sample for the null hypothesis
• If the p-value is less than or equal to the level of significance α, the value of the test statistic is in the
rejection region
[Figure: lower-tailed test with α = .10; the p-value (.0721) is the area under the sampling distribution to the left of the test statistic z = -1.46; critical value -zα = -1.28]
p-Value Approach
Upper-Tailed Test About a Population Mean: σ Known
[Figure: the p-value (.0110) is the area to the right of the test statistic z = 2.29; critical value zα = 1.75]
p-Value Approach
Critical Value Approach to One-Tailed Hypothesis Testing
• The test statistic z has a standard normal probability distribution.
• We can use the standard normal probability distribution table to find the z-value with an area of α in the lower (or upper) tail of the distribution.
• The value of the test statistic that establishes the boundary of the rejection region is called the critical value for the test.
• The rejection rule is:
  Lower tail: Reject H0 if z < -zα
  Upper tail: Reject H0 if z > zα
Lower-Tailed Test About a Population Mean: σ Known
[Figure: sampling distribution of z; reject H0 in the lower tail beyond -zα = -1.28 (α = .10), do not reject otherwise]
Upper-Tailed Test About a Population Mean: σ Known
Critical Value Approach
[Figure: sampling distribution of z; reject H0 in the upper tail beyond zα = 1.645 (α = .05), do not reject otherwise]
Steps of Hypothesis Testing
• Step 1. Develop the null and alternative hypotheses.
• Step 2. Specify the level of significance α.
• Step 3. Collect the sample data and compute the test statistic.
p-Value Approach
• Step 4. Use the value of the test statistic to compute the p-value.
• Step 5. Reject H0 if p-value ≤ α.
Critical Value Approach
• Step 4. Use the level of significance to determine the critical value and the rejection rule.
• Step 5. Use the value of the test statistic and the rejection rule to determine whether to reject H0.
1
Class Objectives
2
One-Tailed Tests About a Population Mean: σ Known
3
Given Values
• Sample: sample mean = 32 min, sample size = 30
• Population: hypothesized mean μ0 = 30 min, σ = 10 min known
• Significance level: α = 0.05
4
p -Value Approach
5
One-Tailed Tests About a Population Mean: σ Known
1. Develop the hypotheses:  H0: μ ≤ 30   Ha: μ > 30
2. Specify the level of significance:  α = .05
3. Compute the value of the test statistic:
   z = (x̄ - μ0)/(σ/√n) = (32 - 30)/(10/√30) = 1.09
6
7
One-Tailed Tests About a Population Mean: σ Known
p-Value Approach
4. Compute the p-value: for z = 1.09, p-value = .1379
5. Because p-value = .1379 > α = .05, do not reject H0.
• There is not sufficient statistical evidence to infer that the pizza delivery service is not meeting the response goal of 30 minutes.
8
One-Tailed Tests About a Population Mean: σ Known
p-Value Approach
[Figure: sampling distribution with α = .05; the p-value (.137) is the area to the right of z = 1.09; critical value zα = 1.645]
9
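A minimal Python sketch of the same test:

```python
# One-tailed z-test: x-bar = 32, mu0 = 30, sigma = 10, n = 30, alpha = .05
from scipy import stats
import math

xbar, mu0, sigma, n, alpha = 32, 30, 10, 30, 0.05
z = (xbar - mu0) / (sigma / math.sqrt(n))                # about 1.09
p_value = 1 - stats.norm.cdf(z)                          # upper-tail area, about 0.137
print(round(z, 2), round(p_value, 4), p_value <= alpha)  # do not reject H0
```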
Critical Value Approach
10
One-Tailed Tests About a Population Mean: σ Known
4. Determine the critical value and rejection rule: with α = .05, z.05 = 1.645; reject H0 if z ≥ 1.645.
5. Because 1.09 < 1.645, do not reject H0.
11
p-Value Approach to Two-Tailed Hypothesis Testing
12
Compute the p-value using the following steps:
1. Compute the value of the test statistic z.
2. If z is in the upper tail (z > 0), find the area under the standard normal curve to the right of z.
3. If z is in the lower tail (z < 0), find the area under the standard normal curve to the left of z.
4. Double the tail area obtained in step 2 or 3 to obtain the p-value.
13
Critical Value Approach to Two-Tailed Hypothesis Testing
• The critical values will occur in both the lower and upper tails of the standard normal curve.
• Use the standard normal probability distribution table to find zα/2 (the z-value with an area of α/2 in the upper tail of the distribution).
14
Two-Tailed Tests About a Population Mean: σ Known
15
Given Values
• Sample: sample size = 30, sample mean = 505 ml
• Population: hypothesized mean μ0 = 500 ml, standard deviation σ = 10 ml
• Significance level: α = 0.03
16
p –Value approach
17
Two-Tailed Tests About a Population Mean: σ Known
1. Determine the hypotheses:  H0: μ = 500   Ha: μ ≠ 500
2. Specify the level of significance:  α = .03
3. Compute the value of the test statistic:
   z = (x̄ - μ0)/(σ/√n) = (505 - 500)/(10/√30) = 2.74
18
19
Two-Tailed Tests About a Population Mean: σ Known
p-Value Approach
4. Compute the p-value: for z = 2.74, p-value = 2(1 - .9969) ≈ .0061
Because p-value = .0061 ≤ α = .03, reject H0: there is sufficient statistical evidence to infer that the mean filling quantity is not 500 ml.
20
Two-Tailed Tests About a Population Mean: σ Known
p-Value Approach
[Figure: two-tailed test; half the p-value (.0031) lies beyond each of z = -2.74 and z = 2.74; critical values ±zα/2 = ±2.17 with α/2 = .015 in each tail]
21
Critical Value Approach
22
Two-Tailed Tests About a Population Mean: σ Known
4. Determine the critical values and rejection rule: with α/2 = .015, zα/2 = 2.17; reject H0 if z ≤ -2.17 or z ≥ 2.17.
5. Because z = 2.74 ≥ 2.17, reject H0: there is sufficient statistical evidence to infer that the null hypothesis is not true.
23
24
Two-Tailed Tests About a Population Mean: σ Known
[Figure: rejection regions beyond ±2.17; the test statistic z = 2.74 falls in the upper rejection region]
25
Confidence Interval Approach
26
Confidence Interval Approach to
Two-Tailed Tests About a Population Mean
• Select a simple random sample from the population and use the value of the sample mean to
develop the confidence interval for the population mean .
• If the confidence interval contains the hypothesized value 500, do not reject H0.
• Actually, H0 should be rejected if μ0 happens to be equal to one of the end points of the confidence interval.
27
Confidence Interval Approach to Two-Tailed Tests About a Population Mean
The 97% confidence interval for μ is
x̄ ± zα/2 (σ/√n) = 505 ± 2.17 (10/√30) = 505 ± 3.9619 = (501.0381, 508.9619)
Because the hypothesized value for the population mean, μ0 = 500 ml, is not in this interval, the hypothesis-testing conclusion is that the null hypothesis, H0: μ = 500, is rejected.
28
Thanks
29
Hypothesis Testing: Two sample test
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT
IIT ROORKEE
1
Hypothesis Testing about the Difference in Two
Sample Means
[Figure: samples of size n1 and n2 are drawn from Population 1 and Population 2; the statistic of interest is the difference in sample means x̄1 - x̄2]
2
Two Sample Tests
• Population means, independent samples (e.g., Group 1 vs. an independent Group 2)
• Population means, dependent samples (e.g., the same group before vs. after treatment)
• Population proportions (Proportion 1 vs. Proportion 2)
• Population variances (Variance 1 vs. Variance 2)
3
Difference Between Two Means
Population means,
independent samples
5
σ12 and σ22 Known
7
Hypothesis Tests for Two Population Means
8
Decision Rules
[Figure: rejection regions for lower-tail (area α in the left tail), upper-tail (area α in the right tail), and two-tail (area α/2 in each tail) tests]
9
Hypothesis Testing about the Difference in Two
Sample Means
E(x̄1 - x̄2) = μ1 - μ2
σ(x̄1 - x̄2) = √(σ1²/n1 + σ2²/n2)
z = ((x̄1 - x̄2) - (μ1 - μ2)) / √(σ1²/n1 + σ2²/n2)
10
Sampling Distribution of x̄1 - x̄2
• Expected value:  E(x̄1 - x̄2) = μ1 - μ2
11
Interval Estimation of μ1 - μ2: σ1 and σ2 Known
• Interval estimate:  (x̄1 - x̄2) ± zα/2 √(σ1²/n1 + σ2²/n2)
12
Problem (σ1 and σ2 Known)
• A product developer is interested in reducing the drying time of a primer paint.
• Two formulations of the paint are tested; formulation 1 is the standard chemistry, and
formulation 2 has a new drying ingredient that should reduce the drying time.
• From experience, it is known that the standard deviation of drying time is 8 minutes, and this
inherent variability should be unaffected by the addition of the new ingredient.
• Ten specimens are painted with formulation 1, and another 10 specimens are painted with
formulation 2; the 20 specimens are painted in random order.
• The two-sample average drying times are 𝑥1 = 121 minutes and 𝑥2 = 112 minutes,
respectively.
• What conclusions can the product developer draw about the effectiveness of the new
ingredient, using alpha = 0.05?
Source: Applied Probability and statistics for Engineers by Douglas C. Montgomery and George C. Runger John Wiley, 3rd Ed. 2003
13
Problem (σ1 and σ2 Known)
14
Problem (σ1 and σ2 Known)
15
Problem (σ1 and σ2 Known)
z = ((x̄1 - x̄2) - 0) / √(σ²(1/n1 + 1/n2)) = (121 - 112) / √(8²(1/10 + 1/10)) = 2.52
Rejection region: with α = .05 (upper-tail test), reject H0 if z > 1.645
Decision: 2.52 > 1.645, so reject H0 at α = 0.05
Conclusion: There is evidence of a difference in means.
16
Problem (σ1 and σ2 Known)
17
Problem (σ1 and σ2 Known)
18
σ12 and σ22 Unknown, Assumed Equal
19
σ12 and σ22 Unknown, Assumed Equal
• The population variances are assumed equal, so use the two sample
standard deviations and pool them to estimate σ
20
Test Statistic, σ12 and σ22 Unknown, Equal
The test statistic for μ1 - μ2 is:
t = ((x̄1 - x̄2) - (μ1 - μ2)) / √(sp²/n1 + sp²/n2)
where the pooled variance is
sp² = ((n1 - 1)s1² + (n2 - 1)s2²) / (n1 + n2 - 2)
with n1 + n2 - 2 degrees of freedom
Decision Rules
Lower tail: H0: μ1 - μ2 ≥ 0, Ha: μ1 - μ2 < 0.  Upper tail: H0: μ1 - μ2 ≤ 0, Ha: μ1 - μ2 > 0.  Two tail: H0: μ1 - μ2 = 0, Ha: μ1 - μ2 ≠ 0.
22
Decision Rules
23
σ12 and σ22 Unknown, Assumed equal
• Two catalysts are being analyzed to determine how they affect the mean yield of a chemical process.
• Specifically, catalyst 1 is currently in use, but catalyst 2 is acceptable.
• Since catalyst 2 is cheaper, it should be adopted, providing it does not change the process yield.
• A test is run in the pilot plant and results in the data shown in the table.
• Is there any difference between the mean yields? Use α = 0.05, and assume equal variances.

Observation Number   Catalyst 1   Catalyst 2
1                    91.50        89.19
2                    94.18        90.95
3                    92.18        90.46
4                    95.39        93.21
5                    91.79        97.19
6                    89.07        97.04
7                    94.72        91.07
8                    89.21        92.75
x̄1 = 92.255   x̄2 = 92.733
s1 = 2.39     s2 = 2.98
24
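scipy performs the pooled test directly (equal_var=True pools the two sample variances); a minimal sketch for the catalyst data:

```python
from scipy import stats

cat1 = [91.50, 94.18, 92.18, 95.39, 91.79, 89.07, 94.72, 89.21]
cat2 = [89.19, 90.95, 90.46, 93.21, 97.19, 97.04, 91.07, 92.75]
t_stat, p_value = stats.ttest_ind(cat1, cat2, equal_var=True)   # pooled two-sample t-test
print(round(t_stat, 2), round(p_value, 3))   # t about -0.35, p about 0.73: do not reject H0
```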
σ12 and σ22 Unknown, Assumed equal
25
σ12 and σ22 Unknown, Assumed equal
26
σ12 and σ22 Unknown, Assumed equal
27
σ12 and σ22 Unknown, Assumed equal
28
Thank You
29
Hypothesis Testing-III
1
Tests About a Population Mean: σ Unknown
• Test statistic:  t = (x̄ - μ0)/(s/√n)  with n - 1 degrees of freedom
2
Tests About a Population Mean: σ Unknown
3
4
One-Tailed Test About a Population Mean: σ Unknown
Example: Ice Cream Demand
• In an ice cream parlor at IIT Roorkee, the following data represent the number of ice creams sold in 20 days
• Test the hypothesis H0: μ < 10
• Use α = .05 to test the hypothesis.

Day  Ice creams sold    Day  Ice creams sold
1    13                 11   12
2    8                  12   11
3    10                 13   11
4    10                 14   12
5    8                  15   10
6    9                  16   12
7    10                 17   7
8    11                 18   10
9    6                  19   11
10   8                  20   8
5
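A minimal sketch of the test in scipy, assuming the alternative of interest is Ha: μ > 10:

```python
from scipy import stats

sales = [13, 8, 10, 10, 8, 9, 10, 11, 6, 8,
         12, 11, 11, 12, 10, 12, 7, 10, 11, 8]
t_stat, p_two_sided = stats.ttest_1samp(sales, popmean=10)
# convert the two-sided p-value to an upper-tail p-value
p_upper = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
print(round(t_stat, 2), round(p_upper, 3))   # x-bar = 9.85 < 10, so H0 is not rejected
```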
Given Data
6
7
One-Tailed Test About a Population Mean: σ Unknown
[Figure: t distribution showing the "Reject H0" region beyond the critical value and the "Do Not Reject H0" region around 0]
8
Hypothesis Testing – proportion
9
Null and Alternative Hypotheses: Population Proportion
• The equality part of the hypotheses always appears in the null hypothesis.
• In general, a hypothesis test about the value of a population proportion p must take one of the following three forms (where p0 is the hypothesized value of the population proportion):
One-tailed (lower tail): H0: p ≥ p0, Ha: p < p0
One-tailed (upper tail): H0: p ≤ p0, Ha: p > p0
Two-tailed: H0: p = p0, Ha: p ≠ p0
10
Tests About a Population Proportion
Test statistic:  z = (p̄ - p0)/σp̄
where:  σp̄ = √( p0(1 - p0)/n )
11
Tests About a Population Proportion
Rejection Rule: p-Value Approach
Reject H0 if p-value ≤ α
Rejection Rule: Critical Value Approach
H0: p ≤ p0: reject H0 if z ≥ zα.  H0: p ≥ p0: reject H0 if z ≤ -zα.  H0: p = p0: reject H0 if |z| ≥ zα/2.
12
Two-Tailed Test About a Population Proportion
Example: City Traffic Police
13
p –Value Approach
14
Two-Tailed Test About a Population Proportion
1. Determine the hypotheses:  H0: p = .5   Ha: p ≠ .5
2. Compute the standard error:  σp̄ = √( p0(1 - p0)/n ) = √( .5(1 - .5)/120 ) = .045644
3. Compute the test statistic:  z = (p̄ - p0)/σp̄ = ((67/120) - .5)/.045644 = 1.28
15
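A minimal sketch of the same proportion test using the normal approximation:

```python
from scipy import stats
import math

x, n, p0 = 67, 120, 0.5
p_hat = x / n
se = math.sqrt(p0 * (1 - p0) / n)            # 0.045644
z = (p_hat - p0) / se                        # about 1.28
p_value = 2 * (1 - stats.norm.cdf(abs(z)))   # two-tailed, about 0.20
print(round(z, 2), round(p_value, 4))
```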
Two-Tailed Test About a Population Proportion
16
17
Critical Value Approach
18
Two-Tailed Test About a Population Proportion
Because -1.96 < z = 1.278 < 1.96, we cannot reject H0.
19
ANOVA
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES
1
Agenda
2
Determining Sample Size when Estimating μ
• Z formula:  z = (x̄ - μ)/(σ/√n)
• Error of estimation (tolerable error):  E = x̄ - μ
• Estimated sample size:  n = z²σ²/E²
• Estimated σ ≈ range/4
3
Example: Sample Size when Estimating
E = 1, σ = 4
90% confidence: z = 1.645
n = z²σ²/E² = (1.645)²(4)²/1² = 43.30, round up to 44
4
Example
E = 2, range = 25
95% confidence: z = 1.96
Estimated σ: range/4 = 25/4 = 6.25
n = z²σ²/E² = (1.96)²(6.25)²/2² = 37.52, round up to 38
5
Determining Sample Size when Estimating P
• Z formula:  z = (p̄ - P)/√(PQ/n)
• Error of estimation (tolerable error):  E = p̄ - P
• Estimated sample size:  n = z²PQ/E²
6
Example
E = 0.03
98% confidence: z = 2.33
Estimated P = 0.40, Q = 1 - P = 0.60
n = z²PQ/E² = (2.33)²(0.40)(0.60)/(0.03)² = 1,447.7, round up to 1,448
7
Determining Sample Size when Estimating P
with No Prior Information
P     PQ
0.5   0.25
0.4   0.24
0.3   0.21
0.2   0.16
0.1   0.09
[Figure: required sample size n versus P for Z = 1.96 and E = 0.05; n peaks at P = 0.5]
With no prior information, use P = 0.5:  n = z²(1/4)/E²
8
Example
E = 0.05
90% confidence: z = 1.645
With no prior estimate of P, use P = 0.50, Q = 1 - P = 0.50
n = z²PQ/E² = (1.645)²(0.50)(0.50)/(0.05)² = 270.6, round up to 271
9
Why ANOVA?
• We could compare the means one by one using t-tests for the difference of means.
• Problem: each test carries its own Type I error.
• The total Type I error is 1 - (1 - α)^c, where c is the number of pairwise comparisons.
• For example, if there are 5 means and you use α = .05, you must make 10 two-by-two comparisons.
• Thus, the overall Type I error is 1 - (.95)^10, which is about .4013.
• That is, about 40% of the time you will reject the null hypothesis of equal means in favor of the alternative even when it is true!
10
Hypothesis Testing With Categorical Data
11
Production Process inputs and outputs
12
Application of quality-engineering techniques and
the systematic reduction of process variability
13
Effect of Teaching Methodology
Group 1 Group 2 Group 3
Black Board Case Presentation PPT
4 2 2
3 4 1
2 6 3
x̄1 = (4 + 3 + 2)/3 = 3
x̄2 = (2 + 4 + 6)/3 = 4
x̄3 = (2 + 1 + 3)/3 = 2
Grand mean:  x̄ = (4 + 3 + 2 + 2 + 4 + 6 + 2 + 1 + 3)/9 = 3
SST = (4-3)² + (3-3)² + (2-3)² + (2-3)² + (4-3)² + (6-3)² + (2-3)² + (1-3)² + (3-3)²
    = 1 + 0 + 1 + 1 + 1 + 9 + 1 + 4 + 0 = 18
SSB = 3(3-3)² + 3(4-3)² + 3(2-3)² = 0 + 3 + 3 = 6
SSE = (4-3)² + (3-3)² + (2-3)² + (2-4)² + (4-4)² + (6-4)² + (2-2)² + (1-2)² + (3-2)²
    = 1 + 0 + 1 + 4 + 0 + 4 + 0 + 1 + 1 = 12
15
ANOVA
Source of Variation SS df MS F P-value F crit
Between Groups 6 2 3 1.5 0.296296 5.143253
Within Groups 12 6 2
Total 18 8
16
Thank You
17
ANOVA
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES
1
Effect of Teaching Methodology
Group 1 Group 2 Group 3
Black Board Case Presentation PPT
4 2 2
3 4 1
2 6 3
ANOVA with Python
3
pandas.melt command
• pd.melt allows you to 'unpivot' data from a 'wide' format into a 'long' format, with each row representing a single data point.
4
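A minimal sketch of the reshape plus a one-way ANOVA on the teaching-methodology data (column names are illustrative):

```python
import pandas as pd
from scipy import stats

wide = pd.DataFrame({'Black Board': [4, 3, 2],
                     'Case Presentation': [2, 4, 6],
                     'PPT': [2, 1, 3]})
long = pd.melt(wide, var_name='method', value_name='score')   # one row per data point
groups = [g['score'].values for _, g in long.groupby('method')]
f_stat, p_value = stats.f_oneway(*groups)
print(round(f_stat, 2), round(p_value, 4))   # F = 1.5, p about 0.296, matching the ANOVA table
```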
Jupyter code
5
6
Transforming table
7
8
Analysis of Variance: A Conceptual Overview
• Analysis of Variance (ANOVA) can be used to test for the equality of three
or more population means
9
Analysis of Variance: A Conceptual Overview
H0: μ1 = μ2 = μ3 = . . . = μk
Ha: Not all population means are equal
10
Analysis of Variance: A Conceptual Overview
• The variance of the response variable, denoted σ², is the same for all of the populations
11
Analysis of Variance: A Conceptual Overview
• Sampling Distribution of x̄ Given H0 is True
[Figure: the sample means x̄1, x̄2, x̄3 all come from the same sampling distribution and are close together]
Analysis of Variance: A Conceptual Overview
• Sampling Distribution of x̄ Given H0 is False
[Figure: sample means come from different sampling distributions centered at μ1, μ2, μ3 and are not as close together when H0 is false]
Analysis of Variance (ANOVA)
• One-Way ANOVA: F-test, Tukey-Kramer test
• Two-Way ANOVA: interaction effects
General ANOVA Setting
• ANOVA Table
17
Analysis of Variance and the Completely
Randomized Design
H0: μ1 = μ2 = μ3 = . . . = μk
Ha: Not all population means are equal
where
𝑗 = mean of the 𝑗𝑡ℎ population
18
Analysis of Variance and the Completely
Randomized Design
H0: μ1 = μ2 = μ3 = . . . = μk
Ha: Not all population means are equal
• Assume that a simple random sample of size 𝑛𝑗 has been selected from
each of the k populations or treatments. For the resulting sample data, let
𝑥𝑖𝑗 = value of observation i for treatment j
𝑛𝑗 = number of observations for treatment j
𝑥𝑗 = sample mean for treatment j
𝑠𝑗2 = sample variance for treatment j
𝑠𝑗 = sample standard deviation for treatment j
19
Between-Treatments Estimate of Population Variance σ²
• The estimate of σ² based on the variation of the sample means is called the mean square due to treatments, denoted MSTR:
MSTR = SSTR/(k - 1),  where  SSTR = Σ(j=1..k) nj (x̄j - x̄)²
The numerator SSTR is the sum of squares due to treatments; the denominator k - 1 is the degrees of freedom associated with SSTR.
20
Between-Treatments Estimate of Population Variance σ²
• Mean Square due to Treatments (MSTR)
MSTR = Σ(j=1..k) nj (x̄j - x̄)² / (k - 1)
where:
k = number of groups
nj = sample size from group j
x̄j = sample mean from group j
x̄ = grand mean (mean of all data values)
21
Within-Treatments Estimate of Population Variance σ²
• The estimate of σ² based on the variation of the sample observations within each sample is called the mean square error, denoted MSE:
MSE = SSE/(nT - k),  where  SSE = Σ(j=1..k) (nj - 1) sj²
The numerator SSE is the sum of squares due to error; the denominator nT - k is the degrees of freedom associated with SSE.
22
Within-Treatments Estimate of Population Variance σ²
• Mean Square Error (MSE)
MSE = Σ(j=1..k) (nj - 1) sj² / (nT - k)
where:
k = number of groups
nj = number of observations for treatment j
sj² = sample variance for treatment j
23
Comparing the Variance Estimates: The F Test
• If the null hypothesis is true and the ANOVA assumptions are valid, the sampling distribution of MSTR/MSE is an F distribution with numerator d.f. equal to k - 1 and denominator d.f. equal to nT - k.
• If the means of the k populations are not equal, the value of MSTR/MSE will be inflated because MSTR overestimates σ².
24
Comparing the Variance Estimates: The F Test
[Figure: sampling distribution of MSTR/MSE under H0; do not reject H0 for values below the critical value Fα, reject H0 for values above it]
ANOVA Table for a Completely Randomized Design
Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square           F
Treatments            SSTR             k - 1                MSTR = SSTR/(k - 1)   MSTR/MSE
Error                 SSE              nT - k               MSE = SSE/(nT - k)
Total                 SST              nT - 1

SST is partitioned into SSTR and SSE; SST's degrees of freedom (d.f.) are partitioned into SSTR's d.f. and SSE's d.f.
ANOVA Table for a Completely Randomized Design
SST = Σ(j=1..k) Σ(i=1..nj) (xij - x̄)²
27
ANOVA Table for a Completely Randomized Design
28
Test for the Equality of k Population Means
• Hypotheses
H0: μ1 = μ2 = μ3 = . . . = μk
Ha: Not all population means are equal
• Test Statistic:  F = MSTR/MSE
29
Test for the Equality of k Population Means
• Rejection rule: reject H0 if p-value ≤ α, or if F ≥ Fα with k - 1 and nT - k degrees of freedom
30
Thank You
31
Hypothesis Testing: Two sample test
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT
IIT ROORKEE
1
σ12 and σ22 Unknown, Assumed Unequal
2
σ12 and σ22 Unknown: Assumed Unequal
• When the population variances cannot be assumed equal, use the separate sample variances; the test statistic and its degrees of freedom are given on the next slide
3
Test Statistic: σ12 and σ22 Unknown, Unequal
The test statistic for μ1 - μ2 (σ1² and σ2² unknown, assumed unequal) is:
t = ((x̄1 - x̄2) - D0) / √(s1²/n1 + s2²/n2)
where t has degrees of freedom:
v = (s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1 - 1) + (s2²/n2)²/(n2 - 1) ]
4
Problem:Test Statistic: σ12 and σ22 Unknown, Unequal
• Arsenic concentration in public drinking water supplies is a potential health risk.
• An article in the Arizona Republic (Sunday, May 27, 2001) reported drinking water arsenic concentrations in parts per billion (ppb) for 10 metropolitan Phoenix communities and 10 communities in rural Arizona.
• The data are shown below:

Metro Phoenix: Phoenix 3, Chandler 7, Gilbert 25, Glendale 10, Mesa 15, Paradise Valley 6, Peoria 12, Scottsdale 25, Tempe 15, Sun City 7   (x̄1 = 12.5, s1 = 7.63)
Rural Arizona: Rimrock 48, Goodyear 44, New River 40, Apachie Junction 38, Buckeye 33, Nogales 21, Black Canyon City 20, Sedona 12, Payson 1, Casa Grande 18   (x̄2 = 27.5, s2 = 15.3)
5
Problem:Test Statistic: σ12 and σ22 Unknown, Unequal
6
Problem:Test Statistic: σ12 and σ22 Unknown, Unequal
7
Problem:Test Statistic: σ12 and σ22 Unknown, Unequal
v = (s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1 - 1) + (s2²/n2)²/(n2 - 1) ]
  = (7.63²/10 + 15.3²/10)² / [ (7.63²/10)²/9 + (15.3²/10)²/9 ] = 13.2 ≈ 13
8
Problem:Test Statistic: σ12 and σ22 Unknown, Unequal
Reject H0 if t < -2.160 or t > 2.160 (α/2 = .025, df = 13)
t = ((12.5 - 27.5) - 0) / √(7.63²/10 + 15.3²/10) = -2.77
Decision: Reject H0 at α = 0.05
Conclusion: There is evidence of a difference in means.
9
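A minimal scipy sketch of the same Welch test:

```python
from scipy import stats

metro = [3, 7, 25, 10, 15, 6, 12, 25, 15, 7]
rural = [48, 44, 40, 38, 33, 21, 20, 12, 1, 18]
t_stat, p_value = stats.ttest_ind(metro, rural, equal_var=False)   # Welch: unequal variances
print(round(t_stat, 2), round(p_value, 4))   # t about -2.77, p about .016: reject H0
```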
Problem:Test Statistic: σ12 and σ22 Unknown, Unequal
10
Problem:Test Statistic: σ12 and σ22 Unknown, Unequal
11
Dependent Samples
Tests Means of 2 Related Populations
– Paired or matched samples
– Repeated measures (before/after)
– Use difference between paired values:
di = xi - yi
• Assumptions:
– Both Populations Are Normally Distributed
12
Test Statistic: Dependent Samples
t = (d̄ - D0) / (sd/√n)
where:  di = xi - yi,  d̄ = Σdi/n,  and the test has n - 1 degrees of freedom
14
Decision Rules: Dependent Samples
[Figure: rejection regions for lower-tail (α), upper-tail (α), and two-tail (α/2 each side) tests]
15
Dependent Samples: Example
• An article in the Journal of Strain Analysis (1983, Vol. 18, No. 2) compares
several methods for predicting the shear strength for steel plate girders.
• Data for two of these methods, the Karlsruhe and Lehigh procedures,
when applied to nine specific girders, are shown in Table .
• We wish to determine whether there is any difference (on the average)
between the two methods.
16
Table : Strength Predictions for Nine Steel Plate Girders
(Predicted Load/Observed Load)
Girder Karlsruhe Method Lehigh Method Difference dj
S11 1.186 1.061 0.119
S21 1.151 0.992 0.159
S31 1.322 1.063 0.259
S41 1.339 1.062 0.277
S51 1.200 1.065 0.138
S21 1.402 1.178 0.224
S22 1.365 1.037 0.328
S23 1.537 1.086 0.451
S24 1.559 1.052 0.507
17
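A minimal scipy sketch of the paired test on the girder data:

```python
from scipy import stats

karlsruhe = [1.186, 1.151, 1.322, 1.339, 1.200, 1.402, 1.365, 1.537, 1.559]
lehigh    = [1.061, 0.992, 1.063, 1.062, 1.065, 1.178, 1.037, 1.086, 1.052]
t_stat, p_value = stats.ttest_rel(karlsruhe, lehigh)   # paired (dependent-samples) t-test
print(round(t_stat, 2), round(p_value, 5))             # t about 6.1, p < .001: the methods differ
```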
Inferences About the Difference Between Two
Population Means: Matched Samples
18
Inferences About the Difference Between Two Population Means:
Matched Samples
19
we conclude that the strength prediction methods yield different results.
20
21
Inferences About the Difference Between
Two Population Proportions
• Expected value:  E(p̄1 - p̄2) = p1 - p2
• Standard error:  σ(p̄1 - p̄2) = √( p1(1 - p1)/n1 + p2(1 - p2)/n2 )
Interval Estimation of p1 - p2
• Interval estimate:  (p̄1 - p̄2) ± zα/2 √( p̄1(1 - p̄1)/n1 + p̄2(1 - p̄2)/n2 )
Point Estimator of the Difference Between Two Population
Proportions
• p1 = proportion of the population of households "aware" of the product after the new campaign
• p2 = proportion of the population of households "aware" of the product before the new campaign
• p̄1 = sample proportion of households "aware" of the product after the new campaign
• p̄2 = sample proportion of households "aware" of the product before the new campaign
Point estimate:  p̄1 - p̄2 = 120/250 - 60/150 = .48 - .40 = .08
Hypothesis Tests about p1 - p2
• Hypothesis
H0: p1 - p2 ≥ 0     H0: p1 - p2 ≤ 0     H0: p1 - p2 = 0
Ha: p1 - p2 < 0     Ha: p1 - p2 > 0     Ha: p1 - p2 ≠ 0
Left-tailed         Right-tailed        Two-tailed
Hypothesis Tests about p1 - p2
• Standard error of p̄1 - p̄2 when p1 = p2 = p:
σ(p̄1 - p̄2) = √( p(1 - p)(1/n1 + 1/n2) )
• Pooled estimator of p:
p̄ = (n1p̄1 + n2p̄2)/(n1 + n2)
Hypothesis Tests about p1 - p2
• Test statistic:
z = (p̄1 - p̄2) / √( p̄(1 - p̄)(1/n1 + 1/n2) )
Problem: Hypothesis Tests about p1 - p2
• Extracts of St. John’s Wort are widely used to treat depression.
• An article in the April 18, 2001 issue of the Journal of the American Medical
Association (“Effectiveness of St. John’s Worton Major Depression: A
Randomized Controlled Trial”) compared the efficacy of a standard extract
of St. John’s Wort with a placebo in 200 outpatients diagnosed with major
depression.
• Patients were randomly assigned to two groups; one group received the St.
John’s Wort, and the other received the placebo.
• After eight weeks, 19 of the placebo-treated patients showed
improvement, whereas 27 of those treated with St. John’s Wort improved.
• Is there any reason to believe that St. John’s Wort is effective in treating
major depression? Use 0.05.
Problem: Hypothesis Tests about p1 - p2
Problem: Hypothesis Tests about p1 - p2
8. Conclusions: Since z0 = 1.35 does not exceed z0.025 = 1.96, we cannot reject the null hypothesis. The p-value is P ≅ 0.177. There is insufficient evidence to support the claim that St. John's Wort is effective in treating major depression.
34
35
Thank You
36
Hypothesis Testing: Two sample test
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT
IIT ROORKEE
1
Agenda
2
Hypothesis Tests for Two Variances
Goal: Test hypotheses about two population variances
Tests for Two Population Variances: F test statistic
Lower-tail test:  H0: σ1² ≥ σ2²   H1: σ1² < σ2²
Upper-tail test:  H0: σ1² ≤ σ2²   H1: σ1² > σ2²
Two-tail test:    H0: σ1² = σ2²   H1: σ1² ≠ σ2²
The two populations are assumed to be independent and normally distributed.
3
Hypothesis Tests for Two Variances
4
Test Statistic
F = s1²/s2², which follows an F distribution with n1 - 1 numerator and n2 - 1 denominator degrees of freedom
5
Decision Rules: Two Variances
Upper-tail test:  H0: σ1² ≤ σ2², H1: σ1² > σ2²; reject H0 if F > Fα
Two-tail test:  H0: σ1² = σ2², H1: σ1² ≠ σ2²; reject H0 if F > Fα/2 (put the larger sample variance in the numerator)
6
Problem
• A company manufactures impellers for use in jet-turbine engines.
• One of the operations involves grinding a particular surface finish on a
titanium alloy component.
• Two different grinding processes can be used, and both processes can produce
parts at identical mean surface roughness.
• The manufacturing engineer would like to select the process having the least
variability in surface roughness.
• A random sample of n1 =11 parts from the first process results in a sample
standard deviation s1 = 5.1 micro inches, and a random sample of n2 = 16
parts from the second process results in a sample standard deviation of s2 =
4.7 micro inches.
• We will find a 90% confidence interval on the ratio of the two standard
deviations.
7
Problem
• Form the hypothesis test:
H0: σ12 = σ22 (there is no difference between variances)
H1: σ12 ≠ σ22 (there is a difference between variances)
● Find the F critical values for α/2 = .10/2 = .05:
Degrees of Freedom:
• Numerator
• n1 – 1 = 11 – 1 = 10 d.f.
• Denominator:
• n2 – 1 = 16 – 1 = 15 d.f.
8
Problem
• Assuming that the two processes are independent and that surface
roughness is normally distributed
9
10
Problem
11
12
F Test example:
13
Z vs t
• σ known: use the Z-test
• σ unknown: use the t-test (with n - 1 degrees of freedom; the distinction matters most when n ≤ 30)
[Figure: sampling distributions of x̄ under μ0 and under μa > μ0; c is the critical value separating the regions, with Type I error probability α under μ0 and Type II error probability β under μa]
Determining the Sample Size for a Hypothesis Test About a Population Mean
n = (zα + zβ)² σ² / (μ0 - μa)²
where
zα = z value providing an area of α in the tail
zβ = z value providing an area of β in the tail
σ = population standard deviation
μ0 = value of the population mean in H0
μa = value of the population mean used for the Type II error
• Let’s assume that the manufacturing company makes the following statements about the
allowable probabilities for the Type I and Type II errors:
• If the mean diameter is μ = 12 mm, I am willing to risk an α = .05 probability of rejecting H0.
• If the mean diameter is 0.75 mm over the specification (μ = 12.75), I am willing to risk a β = .10 probability of not rejecting H0.
Determining the Sample Size for a Hypothesis Test About a Population Mean
α = .05, β = .10
zα = 1.645, zβ = 1.28
μ0 = 12, μa = 12.75, σ = 3.2
n = (zα + zβ)² σ² / (μ0 - μa)² = (1.645 + 1.28)² (3.2)² / (12 - 12.75)² = 155.75, round up to 156
19
Thank You
20
Post Hoc Analysis(Tukey’s test)
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES
IIT ROORKEE
1
Lecture Objectives
After completing this lecture, you should be able to:
• Use Tukey’s test and LSD Test to identify specific differences between
means
2
Designing engineering experiments
3
Designing engineering experiments
4
Designing engineering experiments
5
The completely randomized single-factor experiment
example
• A manufacturer of paper that is used for making
grocery bags is interested in improving the tensile
strength of the product
• Product engineer thinks that tensile strength is a
function of the hardwood concentration in the
pulp and that the range of hardwood
concentrations of practical interest is between 5
and 20%.
6
The completely randomized single-factor experiment
example
• A team of engineers responsible for the study decides to investigate four
levels of hardwood concentration: 5%, 10%, 15%, and 20%.
• They decide to make up six test specimens at each concentration level,
using a pilot plant.
• All 24 specimens are tested on a laboratory tensile tester, in random order.
The data from this experiment are shown in Table
7
The completely randomized single-factor experiment
example
• Tensile Strength of Paper (psi)
Hardwood Observations Total Avg
Concentration (%) 1 2 3 4 5 6
5 7 8 15 11 9 10 60 10.00
10 12 17 13 18 19 15 94 15.67
15 14 18 19 17 16 18 102 17.00
20 19 25 22 23 18 20 127 21.17
383 15.96
8
The completely randomized single-factor experiment
example
9
Typical Data for Single Factor Experiment
10
Sum of Squares
Total sum of squares:      SST = Σ(i=1..a) Σ(j=1..n) (yij - ȳ..)²
Treatment sum of squares:  SSTreatments = n Σ(i=1..a) (ȳi. - ȳ..)²
Error sum of squares:      SSE = Σ(i=1..a) Σ(j=1..n) (yij - ȳi.)²
11
ANOVA with Equal Sample Sizes
SST = Σ(i=1..a) Σ(j=1..n) yij² - y..²/N
SSTreatments = (1/n) Σ(i=1..a) yi.² - y..²/N
12
ANOVA with unequal Sample Sizes
SST = Σ(i=1..a) Σ(j=1..ni) yij² - y..²/N
SSTreatments = Σ(i=1..a) yi.²/ni - y..²/N
13
Problem: Analysis of variance
14
Problem: Analysis of variance
15
ANOVA Table
16
Problem: Analysis of variance
17
Problem: Analysis of variance
18
Problem: Analysis of variance
19
Jupyter code
20
Jupyter code
21
Jupyter code
22
Jupyter code
23
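A minimal sketch of what these Jupyter slides likely compute, using statsmodels on the tensile-strength data:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

data = pd.DataFrame({
    'conc': ['5%'] * 6 + ['10%'] * 6 + ['15%'] * 6 + ['20%'] * 6,
    'strength': [7, 8, 15, 11, 9, 10,
                 12, 17, 13, 18, 19, 15,
                 14, 18, 19, 17, 16, 18,
                 19, 25, 22, 23, 18, 20]})
model = ols('strength ~ C(conc)', data=data).fit()
print(sm.stats.anova_lm(model))   # F about 19.6 with p well below .01: concentrations differ
```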
Multiple Comparisons Following the ANOVA
• When the null hypothesis is rejected in the ANOVA, we know that some of
the treatment or factor level means are different
• ANOVA doesn’t identify which means are different
• Methods for investigating this issue are called multiple comparisons
methods
24
Fisher’s least significant difference (LSD) method
• The Fisher LSD method compares all pairs of means with the null hypotheses H0: μi = μj (for all i ≠ j) using the t-statistic
t0 = (ȳi. - ȳj.) / √(2MSE/n)
25
Fisher’s least significant difference (LSD) method
The pair of means μi and μj is declared different if |ȳi. - ȳj.| > LSD,
where LSD, the least significant difference, is
LSD = t(α/2, a(n-1)) √(2MSE/n)
26
Fisher’s least significant difference (LSD) method
• If the sample sizes are different in each treatment, the LSD is defined as
LSD = t(α/2, N-a) √( MSE (1/ni + 1/nj) )
27
Problem : LSD method
28
Problem : LSD method
• Therefore, any pair of treatment averages that differs by more than 3.07
implies that the corresponding pair of treatment means are different.
29
Jupyter code
30
Problem : LSD method
31
The Tukey-Kramer Test for Post Hoc analysis
32
The Tukey-Kramer Test for Post Hoc analysis
[Figure: after rejecting H0, the post hoc comparison can show, for example, μ1 = μ2 while μ3 differs]
33
Tukey-Kramer Critical Range
Critical Range = QU √( (MSW/2)(1/nj + 1/nj') )
where:
QU = Value from Studentized Range
Distribution with c and n - c degrees of freedom for
the desired level of α
MSW = Mean Square Within
nj and nj’ = Sample sizes from groups j and j’
34
Problem: Tukey- Kramer test
35
The Tukey-Kramer Procedure
1. Compute absolute mean differences:
36
The Tukey-Kramer Procedure
QU 3.96
37
• Q table: The critical values
for q corresponding to
alpha = .05 (top) and
alpha = .01 (bottom)
38
The Tukey-Kramer Procedure
39
The Tukey-Kramer Procedure
3. Compute Critical Range:
Critical Range = QU √( (MSW/2)(1/nj + 1/nj') ) = 3.96 √( (6.51/2)(1/6 + 1/6) ) = 4.124
40
The Tukey-Kramer Procedure
5. Other than x̄2 vs. x̄3, all of the absolute mean differences are greater than the critical range. Therefore there is a significant difference between each pair of means, except the 10% and 15% concentrations, at the 5% level of significance.
41
Jupyter code
42
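A minimal statsmodels sketch of Tukey's test on the same tensile-strength data:

```python
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

data = pd.DataFrame({
    'conc': ['5%'] * 6 + ['10%'] * 6 + ['15%'] * 6 + ['20%'] * 6,
    'strength': [7, 8, 15, 11, 9, 10,
                 12, 17, 13, 18, 19, 15,
                 14, 18, 19, 17, 16, 18,
                 19, 25, 22, 23, 18, 20]})
result = pairwise_tukeyhsd(data['strength'], data['conc'], alpha=0.05)
print(result)   # every pair differs except 10% vs 15%, as found above
```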
Problem 2
43
Problem 2
Level   Observations (1-5)     Total   Mean
15      7, 7, 15, 11, 9        49      9.8
20      12, 17, 12, 18, 18     77      15.4
25      14, 18, 18, 19, 19     88      17.6
30      19, 25, 22, 19, 23     108     21.6
35      7, 10, 11, 15, 11      54      10.8
Grand total = 376, grand mean = 15.04
44
• SSA = 5(9.8 - 15.04)² + 5(15.4 - 15.04)² + 5(17.6 - 15.04)² + 5(21.6 - 15.04)² + 5(10.8 - 15.04)² = 475.76
SST = 636.96
SSE = 636.96 - 475.76 = 161.20
45
Problem 2
46
• Q table: The critical values
for q corresponding to
alpha = .05 (top) and
alpha = .01 (bottom)
47
Problem 2
Tα = qα(c, n - c) √(MSE/n),  with α = 0.05
48
Problem 2
49
Problem 2
|ȳ1. - ȳ2.| = |9.8 - 15.4| = 5.6*
|ȳ1. - ȳ3.| = |9.8 - 17.6| = 7.8*
|ȳ1. - ȳ4.| = |9.8 - 21.6| = 11.8*
|ȳ1. - ȳ5.| = |9.8 - 10.8| = 1.0
|ȳ2. - ȳ3.| = |15.4 - 17.6| = 2.2
|ȳ2. - ȳ4.| = |15.4 - 21.6| = 6.2*
|ȳ2. - ȳ5.| = |15.4 - 10.8| = 4.6
|ȳ3. - ȳ4.| = |17.6 - 21.6| = 4.0
|ȳ3. - ȳ5.| = |17.6 - 10.8| = 6.8*
|ȳ4. - ȳ5.| = |21.6 - 10.8| = 10.8*
Starred values indicate pairs of means that are significantly different.
50
Jupyter code
51
Jupyter Code
52
Thank you
53
Two Way ANOVA
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT
IIT ROORKEE
1
Learning objectives
2
Factorial Experiment
• A factorial experiment is an experimental design that allows simultaneous
conclusions about two or more factors.
• The term factorial is used because the experimental conditions include all
possible combinations of the factors.
• The effect of a factor is defined as the change in response produced by a
change in the level of the factor. It is called a main effect because it refers to
the primary factors in the study
• For example, for a levels of factor A and b levels of factor B, the experiment
will involve collecting data on ab treatment combinations.
• Factorial experiments are the only way to discover interactions between
variables.
3
Factorial Experiment
4
Two-factor Factorial Experiments
• The simplest type of factorial experiment involves only two factors, say, A
and B.
• There are a levels of factor A and b levels of factor B.
• This two-factor factorial is shown in next table .
• The experiment has n replicates, and each replicate contains all ab
treatment combinations.
5
Two-factor Factorial Experiments
6
Two-factor Factorial Experiments
• The observation in the ijth cell for the kth replicate is denoted by yijk
• In performing the experiment, the abn observations would be run in
random order.
• Thus, like the single factor experiment, the two-factor factorial is a
completely randomized design.
7
Example
8
Three CAT preparation programs.
9
Factor - 1 , 3 treatment
• One factor in this study is the CAT preparation program, which has three
treatments:
– Three-hour review,
– One-day program, and
– 10-week course.
• Before selecting the preparation program to adopt, further study will be
conducted to determine how the proposed programs affect CAT scores.
10
Factor 2 : 3 Treatment
• The CAT is usually taken by students from three colleges:
• the College of Business,
• the College of Engineering, and
• the College of Arts and Sciences.
• Therefore, a second factor of interest in the experiment is whether a
student’s undergraduate college affects the CAT score.
• This second factor, undergraduate college, also has three treatments:
– Business,
– Engineering, and
– Arts and sciences.
11
Nine Treatment Combinations for The Two-factor CAT
Experiment
12
Replication
13
CAT SCORES FOR THE TWO-FACTOR EXPERIMENT
14
The analysis of variance computations answers
the following questions.
• Main effect (factor A): Do the preparation programs differ in terms of
effect on CAT scores?
• Main effect (factor B): Do the undergraduate colleges differ in terms of
effect on CAT scores?
• Interaction effect (factors A and B): Do students in some colleges do
better on one type of preparation program whereas others do better on a
different type of preparation program?
15
Interaction
• The term interaction refers to a new effect that we can now study because
we used a factorial experiment.
• If the interaction effect has a significant impact on the CAT scores, we can
conclude that the effect of the type of preparation program depends on
the undergraduate college.
16
ANOVA Table for the Two-factor Factorial Experiment
with r Replications
Sources of Variation   Sum of Squares   Degrees of Freedom   Mean Square                    F
Factor A               SSA              a - 1                MSA = SSA/(a - 1)              MSA/MSE
Factor B               SSB              b - 1                MSB = SSB/(b - 1)              MSB/MSE
Interaction            SSAB             (a - 1)(b - 1)       MSAB = SSAB/((a - 1)(b - 1))   MSAB/MSE
Error                  SSE              ab(r - 1)            MSE = SSE/(ab(r - 1))
Total                  SST              nT - 1
17
Abbreviation
18
ANOVA Procedure
19
Computations and Conclusions
20
CAT Summary Data for The Two-factor Experiment
Factor A: Preparation Program × Factor B: College (two scores per cell, cell means shown)
Three-hour review:  Business 500, 580 (x̄11 = 540);  Engineering 540, 460 (x̄12 = 500);  Arts and Sciences 480, 400 (x̄13 = 440);  row total 2960
One-day program:    Business 460, 540 (x̄21 = 500);  Engineering 560, 620 (x̄22 = 590);  Arts and Sciences 420, 480 (x̄23 = 450);  row total 3080
10-week course:     Business 560, 600 (x̄31 = 580);  Engineering 600, 580 (x̄32 = 590);  Arts and Sciences 480, 410 (x̄33 = 445);  row total 3230
Column totals: 3240, 3360, 2670;  overall total = 9270;  overall mean x̄ = 515
21
CAT Summary Data for The Two-factor Experiment
• Factor A means:  x̄1. = 493.33,  x̄2. = 513.33,  x̄3. = 538.33
• Factor B means:  x̄.1 = 540,  x̄.2 = 560,  x̄.3 = 445
22
CAT Example:
23
CAT Example:
24
CAT Example:
25
CAT Example:
26
CAT Example:
27
ANOVA Table for the CAT two-factor design
Total: SST = 82450 with nT - 1 = 17 degrees of freedom
28
Jupyter Code
29
Jupyter code
30
Jupyter Code
31
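A minimal statsmodels sketch of the two-factor ANOVA with interaction for the CAT data (factor labels are illustrative):

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.DataFrame({
    'program': ['3hr'] * 6 + ['1day'] * 6 + ['10wk'] * 6,
    'college': ['Bus', 'Bus', 'Eng', 'Eng', 'Arts', 'Arts'] * 3,
    'score': [500, 580, 540, 460, 480, 400,
              460, 540, 560, 620, 420, 480,
              560, 600, 600, 580, 480, 410]})
model = ols('score ~ C(program) * C(college)', data=df).fit()
print(sm.stats.anova_lm(model))   # rows for program, college, their interaction, and residual
```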
Thank You
32
REGRESSION
Linear Regression
Dr. Ramesh Anbanandam
DEPARTMENT of Management Studies
1
Simple Linear Regression
• This model can also be used for process optimization, such as finding the
level of temperature that maximizes yield, or for process control purposes
3
Empirical Models Example
4
Using python for plotting the data
5
Simple Linear Regression Model
• The equation that describes how y is related to x and
an error term is called the regression model.
• The simple linear regression model is:
y = β0 + β1x + ε
where:
β0 and β1 are called parameters of the model,
ε is a random variable called the error term.
Simple Linear Regression Equation
The simple linear regression equation is:
E(y) = β0 + β1x
[Figure: positive linear relationship; the line has intercept β0 and positive slope β1]
Simple Linear Regression Equation
Negative Linear Relationship
E(y) = β0 + β1x
[Figure: line with intercept β0 and negative slope β1]
Simple Linear Regression Equation
No Relationship
E(y) = β0 + β1x
[Figure: horizontal regression line with intercept β0 and slope β1 = 0]
Estimated Simple Linear Regression Equation
ŷ = b0 + b1x
Least squares criterion:  min Σ(yi - ŷi)²
where:
yi = observed value of the dependent variable
for the ith observation
^
yi = estimated value of the dependent variable
for the ith observation
Estimation Process
Regression model: y = β0 + β1x + ε, with unknown parameters β0, β1
Regression equation: E(y) = β0 + β1x
Sample data (x1, y1), . . ., (xn, yn) yield sample statistics b0, b1
Estimated regression equation: ŷ = b0 + b1x, where b0 and b1 provide estimates of β0 and β1
14
Squared Error (SE) = [y1 - (mx1 + b)]² + [y2 - (mx2 + b)]² + . . . + [yn - (mxn + b)]²
15
= (y1² + y2² + . . . + yn²)
- 2m(x1y1 + x2y2 + . . . + xnyn)
- 2b(y1 + y2 + . . . + yn)
+ m²(x1² + x2² + . . . + xn²)
+ 2mb(x1 + x2 + . . . + xn)
+ (b² + b² + . . . + b²)
16
Writing each sum as n times a mean:
SE = n·mean(y²) - 2mn·mean(xy) - 2bn·mean(y) + m²n·mean(x²) + 2mbn·mean(x) + nb²
Setting the partial derivative with respect to m to zero:
∂(SE)/∂m = -2n·mean(xy) + 2mn·mean(x²) + 2bn·mean(x) = 0
so m·mean(x²) + b·mean(x) = mean(xy),
which means the least-squares line passes through the point ( mean(x²)/mean(x), mean(xy)/mean(x) )
17
Setting the partial derivative with respect to b to zero:
∂(SE)/∂b = -2n·mean(y) + 2mn·mean(x) + 2nb = 0
so mean(y) = m·mean(x) + b, i.e. b = ȳ - m x̄,
which means the least-squares line also passes through the point (x̄, ȳ)
18
19
20
21
Least Squares Method
Slope:  b1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²
REGRESSION
Linear Regression-II
Dr. Ramesh Anbanandam
DEPARTMENT of Management Studies
1
Least Squares Method
Slope:  b1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²
2
Sum of squares and sum of cross-products
Sxx = Σ(i=1..n) (xi - x̄)²
Syy = Σ(i=1..n) (yi - ȳ)²
Sxy = Σ(i=1..n) (xi - x̄)(yi - ȳ)
3
Sum of squares and sum of cross-products
Slope:  m = Sxy/Sxx
Error sum of squares:  SSE = Syy - Sxy²/Sxx
4
Least Squares Method
y-intercept for the estimated regression equation:  b0 = ȳ - b1x̄
where:
xi = value of the independent variable for the ith observation
yi = value of the dependent variable for the ith observation
x̄ = mean value of the independent variable
ȳ = mean value of the dependent variable
n = total number of observations
5
Simple Linear Regression
6
Simple Linear Regression
7
Simple Linear Regression
Example: Auto Sales
Number of Number of
TV Ads Cars Sold
1 14
3 24
2 18
1 17
3 27
8
Estimated Regression Equation
For the auto sales data: x̄ = 2, ȳ = 20, Σ(xi - x̄)(yi - ȳ) = 20, Σ(xi - x̄)² = 4, so
b1 = 20/4 = 5,  b0 = 20 - 5(2) = 10,  and  ŷ = 10 + 5x
9
Scatter Diagram and Trend Line
[Figure: scatter diagram of cars sold (y, 0 to 30) versus TV ads (x, 0 to 4) with the trend line y = 5x + 10]
10
Jupyter Code
11
Jupyter Code
12
Jupyter code
13
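A minimal statsmodels sketch that reproduces the fitted line:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({'ads': [1, 3, 2, 1, 3], 'cars': [14, 24, 18, 17, 27]})
fit = smf.ols('cars ~ ads', data=df).fit()
print(fit.params)      # Intercept 10.0, slope 5.0: y-hat = 10 + 5x
print(fit.rsquared)    # about 0.877
```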
Example Problem- II
• The data in the file hardness.xls provide measurements on the hardness and
tensile strength for 35 specimens of die-cast aluminum.
• It is believed that hardness (measured in Rockwell E units) can be used to
predict tensile strength (measured in thousands of pounds per square inch).
a. Construct a scatter plot.
b. Assuming a linear relationship, use the least-squares method to find the
regression coefficients b 0 and b 1.
c. Interpret the meaning of the slope, b1, in this problem.
d. Predict the mean tensile strength for die-cast aluminum that has a hardness of
30 Rockwell E units.
14
Tensile strength Hardness
53 29.31
70.2 34.86
84.3 36.82
55.3 30.12
78.5 34.02
63.5 30.82
71.4 35.4
53.4 31.26
82.5 32.18
67.3 33.42
69.5 37.69
73 34.88
55.7 24.66
85.8 34.76
95.4 38.02
51.1 25.68
74.4 25.81
54.1 26.46
77.8 28.67
52.4 24.64
69.1 25.77
53.5 23.69
64.3 28.65
82.7 32.38
55.7 23.21
70.5 34
87.5 34.47
50.7 29.25
72.3 28.71
59.5 29.83
71.3 29.25
52.7 27.99
76.5 31.85
63.7 27.65
69.2 31.7
15
16
Thank You
17
REGRESSION
Linear Regression-III
Dr. Ramesh Anbanandam
DEPARTMENT of Management Studies
1
Learning Objectives
2
3
Coefficient of Determination
• Relationship among SST, SSR, SSE:  SST = SSR + SSE
Σ(yi - ȳ)² = Σ(ŷi - ȳ)² + Σ(yi - ŷi)²
In Sxx/Syy/Sxy notation:  SSR = Sxy²/Sxx,  so  r² = SSR/SST = Sxy² / (Sxx·Syy)
where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error
Coefficient of Determination
7
Coefficient of Determination
ŷ = b0 + b1x
where:
b1 = the slope of the estimated regression
equation
Sample Correlation Coefficient
rxy = (sign of b1) √r²
rxy = +.9366
Assumptions About the Error Term e
12
Estimate of s
• An Estimate of s
The mean square error (MSE) provides the estimate
of s 2, and the notation s2 is also used.
s² = MSE = SSE/(n - 2)
where:
SSE = Σ(yi - ŷi)² = Σ(yi - b0 - b1xi)²
Testing for Significance
• An Estimate of s
• To estimate s we take the square root of s 2.
• The resulting s is called the standard error of
the estimate.
s = √MSE = √( SSE/(n - 2) )
Testing for Significance
s² = SSE/(n - 2) = (Syy - Sxy²/Sxx)/(n - 2)
Testing for Significance: t Test
• Hypotheses
H0: β1 = 0
Ha: β1 ≠ 0
• Test Statistic
t = b1/s(b1)
Case 1:  H0: β1 = 0 (no linear relationship between x and y)
17
Case 2:  Ha: β1 ≠ 0 (a linear relationship exists between x and y)
18
The Standard Deviation of the Regression Slope
• The standard error of the regression slope coefficient (b1) is
estimated by
s(b1) = sε / √Σ(x - x̄)²  =  sε / √( Σx² - (Σx)²/n )
where:
s(b1) = estimate of the standard error of the least squares slope
sε = √( SSE/(n - 2) ) = sample standard error of the estimate
Testing for Significance: t Test
Rejection rule:  Reject H0 if p-value < α, or if t < -tα/2 or t > tα/2
where tα/2 is based on a t distribution with n - 2 degrees of freedom
Testing for Significance: t Test
1. Determine the hypotheses:  H0: β1 = 0   Ha: β1 ≠ 0
2. Compute the test statistic:
t = (b1 - β1,0) / s(b1),  with df = n - 2,  where β1,0 is the hypothesized slope (here 0)
s(b1) = se / √SSxx,  se = √( SSE/(n - 2) ),  SSxx = Σx² - (Σx)²/n
Confidence Interval for b1
b1 ± tα/2 s(b1) = 5 ± 3.182(1.08), or 1.56 to 8.44
• Conclusion
0 is not included in the confidence interval.
Reject H0
Testing for Significance: F Test
• Hypotheses
H0: β1 = 0
Ha: β1 ≠ 0
• Test Statistic
F = MSR/MSE
F-Test for Significance
31
Jupyter code
32
Testing for Significance: F Test
5. Compute the value of the test statistic.
F = MSR/MSE = 100/4.667 = 21.43
6. Determine whether to reject H0.
F.025 = 17.44 (1 numerator and 3 denominator d.f.) cuts off an area of .025 in the upper tail. Thus, the p-value corresponding to F = 21.43 is less than 2(.025) = .05. Hence, we reject H0.
The statistical evidence is sufficient to conclude that we have a significant
relationship between the number of TV ads aired and the number of cars
sold.
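The same fit also yields the t and F statistics; a minimal sketch:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({'ads': [1, 3, 2, 1, 3], 'cars': [14, 24, 18, 17, 27]})
fit = smf.ols('cars ~ ads', data=df).fit()
print(round(fit.tvalues['ads'], 2), round(fit.pvalues['ads'], 4))   # t about 4.63, p about .019
print(round(fit.fvalue, 2))                                         # F = t^2, about 21.43
```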
Some Cautions about the
Interpretation of Significance Tests
35
RBD
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT
IIT ROORKEE
1
Learning Objectives
• Understand the blocking principle and how it is used to isolate the effect
of nuisance factors
2
Randomized Block Design
3
Why RBD?
4
Randomized block design
5
Randomized block design
6
Air Traffic Controller Stress Test
• A study measuring the fatigue and stress of
air traffic controllers resulted in proposals
for modification and redesign of the
controller’s work station
• After consideration of several designs for
the work station, three specific alternatives
are selected as having the best potential
for reducing controller stress
• The key question is: To what extent do the
three alternatives differ in terms of their
effect on controller stress?
7
Air Traffic Controller Stress Test
• In a completely randomized design, a random sample of controllers would be
assigned to each work station alternative.
• However, controllers are believed to differ substantially in their ability to
handle stressful situations.
• What is high stress to one controller might be only moderate or even low
stress to another.
• Hence, when considering the within-group source of variation (MSE), we must
realize that this variation includes both random error and error due to
individual controller differences.
• In fact, managers expected controller variability to be a major contributor to
the MSE term.
8
A randomized block design for the air traffic controller
stress test
                Treatments
                System A   System B   System C
Controller 1    15         15         18
Controller 2    14         14         14
Controller 3    10         11         15
Controller 4    13         12         17
Controller 5    16         13         16
Controller 6    13         13         13
(The six controllers are the blocks.)
9
Solving this example using ANOVA in python
10
Solving this example using ANOVA in python
11
Summary of stress data for the air traffic controller stress test
Treatments      System A   System B   System C   Block total   Block mean
Controller 1    15         15         18         48            x̄1. = 16
Controller 2    14         14         14         42            x̄2. = 14
Controller 3    10         11         15         36            x̄3. = 12
Controller 4    13         12         17         42            x̄4. = 14
Controller 5    16         13         16         45            x̄5. = 15
Controller 6    13         13         13         39            x̄6. = 13
Column totals   81         78         93         252           x̄ = 252/18 = 14
12
Summary of stress data for the air traffic controller stress test
• Treatment means:  x̄.1 = 81/6 = 13.5,  x̄.2 = 78/6 = 13.0,  x̄.3 = 93/6 = 15.5
13
ANOVA TABLE FOR THE RANDOMIZED BLOCK DESIGN WITH k TREATMENTS AND b BLOCKS
Treatments: SSTR, k - 1 d.f., MSTR = SSTR/(k - 1), F = MSTR/MSE
Blocks: SSBL, b - 1 d.f., MSBL = SSBL/(b - 1)
Error: SSE, (k - 1)(b - 1) d.f., MSE = SSE/((k - 1)(b - 1))
Total: SST, nT - 1 d.f.
14
RBD Problem
15
RBD Problem
16
RBD Problem
17
ANOVA table for the air traffic controller stress test
19
Solving RBD example using python
20
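A minimal statsmodels sketch of the randomized block ANOVA (treatment and block both entered as factors):

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.DataFrame({
    'stress': [15, 15, 18, 14, 14, 14, 10, 11, 15,
               13, 12, 17, 16, 13, 16, 13, 13, 13],
    'system': ['A', 'B', 'C'] * 6,
    'controller': [c for c in ['1', '2', '3', '4', '5', '6'] for _ in range(3)]})
model = ols('stress ~ C(system) + C(controller)', data=df).fit()
print(sm.stats.anova_lm(model))   # treatment F about 5.5 on (2, 10) d.f.
```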
Conclusion
• Finally, note that the ANOVA table shown above provides an F value to test for treatment effects but not for blocks.
• The reason is that the experiment was designed to test a single factor—
work station design.
• The blocking based on individual stress differences was conducted to
remove such variation from the MSE term.
• However, the study was not designed to test specifically for individual
differences in stress.
21
Problem 2: RBD
22
Problem 2: RBD
23
Anova using jupyter
24
Problem 2: RBD
• The sums of squares for the analysis of variance are computed as follows:
25
Problem 2: RBD
26
Problem 2: RBD
• Analysis of Variance for the Randomized Complete Block Experiment
Sources of Sum of Degrees of Mean F P- value
Variation Squares Freedom Square
27
Conclusion
28
Python code for problem 2
29
Python code for problem 2
30
Python code for problem 2
31
Categorical Variable Regression
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES
1
Agenda
2
What are dummy variables?
• Dummy variables, also called indicator variables allow us to include
categorical data (like Gender) in regression models
3
Example 1: Problem / Background
• Johnson Filtration, Inc., provides maintenance service for
water-filtration systems.
• Customers contact Johnson with requests for
maintenance service on their water-filtration systems
• To estimate the service time and the service cost,
Johnson’s managers want to predict the repair time
necessary for each maintenance request
• Hence, repair time in hours is the dependent variable
• Repair time is believed to be related to two factors,
– the number of months since the last maintenance service
– the type of repair problem (mechanical or electrical).
Source: Statistics for Business & Economics, David R. Anderson, Dennis J. Sweeney, Thomas A. Williams, Jeffrey D. Camm, James J. Cochran, Cengage Learning,2013
4
Data for the Johnson filtration example
5
6
Linear Regression
7
OLS Summary
8
Linear regression
9
Normal probability plot
10
Creating dummies
11
DATA FOR THE JOHNSON FILTRATION EXAMPLE WITH TYPE OF REPAIR INDICATED BY A DUMMY VARIABLE (x2 = 0 for mechanical; x2 = 1 for electrical)
12
Adding dummies to table
13
OLS Summary
14
Dummy regression
15
Interpreting the Parameters
Equation 1 (mechanical, x2 = 0):  E(y) = β0 + β1x1
Equation 2 (electrical, x2 = 1):  E(y) = (β0 + β2) + β1x1
16
Interpreting the Parameters
• Comparing the two equations, we see that the mean repair time is a linear function of x1 for both mechanical and electrical repairs; the slope β1 is the same, but the intercepts differ by β2
Interpreting the Parameters
18
Interpreting the Parameters
• In effect, the use of a dummy variable for type of repair provides two
estimated regression equations that can be used to predict the repair
time, one corresponding to mechanical repairs and one corresponding to
electrical repairs.
• In addition, with b2= 1.26, we learn that, on average, electrical repairs
require 1.26 hours longer than mechanical repairs.
19
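A minimal sketch of the dummy-variable regression. The data values below follow the ASW textbook example this deck cites; treat them as illustrative:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    'months': [2, 6, 8, 3, 2, 7, 9, 8, 4, 6],
    'type':   ['electrical', 'mechanical', 'electrical', 'mechanical', 'electrical',
               'electrical', 'mechanical', 'mechanical', 'electrical', 'electrical'],
    'hours':  [2.9, 3.0, 4.8, 1.8, 2.9, 4.9, 4.2, 4.8, 4.4, 4.5]})
df['x2'] = (df['type'] == 'electrical').astype(int)   # 0 = mechanical, 1 = electrical
fit = smf.ols('hours ~ months + x2', data=df).fit()
print(fit.params)   # the x2 coefficient is about 1.26 hours, as interpreted above
```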
Interpreting the Parameters
20
More Complex Categorical Variables
21
Example 2: Problem / Background
22
Data
Employee Salary Gender Experience
1 7.5 Male 6
2 8.6 Male 10
3 9.1 Male 12
4 10.3 Male 18
5 13 Male 30
6 6.2 Female 5
7 8.7 Female 13
8 9.4 Female 15
9 9.8 Female 21
24
25
26
27
28
Creating a dummy variable for gender
• Categorical data is included in regression analysis by using dummy variables
• For example, we can assign a value of 0 for males and 1 for females in our data so that a multiple regression model can be developed

Employee   Salary   Gender
1          7.5      0
2          8.6      0
3          9.1      0
4          10.3     0
5          13       0
6          6.2      1
7          8.7      1
8          9.4      1
9          9.8      1
30
31
More on the intercept and slope
• The value of the intercept, 9.70, is the average salary for males (as we
coded gender=1 for females and 0 for males)
• The value of the slope, -1.175, tells us that the average females salary is
lower than the average male salary by 1.175
32
33
What would have happened if we had used 0 for females and
1 for males in our data? Would our results be any different?
34
Male = 1, female = 0
35
More on dummy variables
36
Example: Salary vs. Job Grade
37
Representing 3-level Job Grade using dummy variables Job_1 and Job_2

Employee's Job Grade   Job_1   Job_2
1                      1       0
2                      0       1
3                      0       0
38
Data file with dummy variables for job grade
Job
Employee Grade Salary Job_1 Job_2
1 1 7.5 1 0
2 3 8.6 0 0
3 2 9.1 0 1
4 3 10.3 0 0
5 3 13 0 0
6 1 6.2 1 0
7 2 8.7 0 1
8 2 9.4 0 1
9 3 9.8 0 0
39
Thank You
40
Estimation, Prediction of Regression Model Residual
Analysis: Validating Model Assumptions - I
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES
1
Agenda
• Point Estimation
• Interval Estimation
• Confidence Interval for the Mean Value of y
• Prediction Interval for an Individual Value of y
2
Problem
• Data were collected from a sample of 10 Ice cream vendors located near
college campuses.
• For the ith observation or restaurant in the sample, xi is the size of the
student population (in thousands) and yi is the quarterly sales (in
thousands of dollars).
• The values of xi and yi for the 10 restaurants in the sample are summarized
in Table
3
Data
Student Population Sales
Restaurant (1000) (1000)
1 2 58
2 6 105
3 8 88
4 8 118
5 12 117
6 16 137
7 20 157
8 20 169
9 22 149
10 26 202
4
Python code for scatter plot
5
Python code for scatter plot
6
Python code for regression Equation
7
Python code for regression Equation
8
Python code for regression
9
10
Point Estimate
• We can use the estimated regression equation to develop a point estimate
of the mean value of y for a particular value of x or to predict an individual
value of y corresponding to a given value of x.
11
Point estimate
• Using the estimated regression equation ŷ = 60 + 5x, we see that for x = 10 (10,000 students), ŷ = 60 + 5(10) = 110.
• Thus, a point estimate of the mean quarterly sales for all restaurants located near campuses with 10,000 students is $110,000.
12
Point estimate
14
Confidence Interval Estimation
15
Confidence Interval Estimation
At xp = 10:  ŷp = 60 + 5(10) = 110.
16
Confidence Interval Estimation
In general, we cannot expect ŷp to equal E(yp) exactly.
If we want to make an inference about how close ŷp is to the true mean value E(yp), we have to estimate the variance of ŷp.
The formula for estimating the variance of ŷp given xp, denoted s²(ŷp), is
s²(ŷp) = s² [ 1/n + (xp - x̄)²/Σ(xi - x̄)² ]
17
Confidence Interval Estimation
18
Confidence Intervals for the Mean sales y at given values of
student population x
19
Python Code
20
Special Case
The estimated standard deviation of ŷp is smallest when xp = x̄, since then the quantity (xp - x̄)² = 0.
21
Prediction Interval for an Individual Value of y
• Instead of estimating the mean value of sales for all restaurants located
near campuses with 10,000 students, we want to estimate the sales for an
individual restaurant located near a particular College with 10,000
students.
The variance of prediction has two components:
(1) the variance of individual y values about the mean E(yp), estimated by s²
(2) the variance associated with using ŷp to estimate E(yp), estimated by s²(ŷp)
22
Prediction Interval for an Individual Value of y
23
Prediction Interval for an Individual Value of y
24
Prediction Interval for an Individual Value of y
25
Prediction Interval for an Individual Value of y
26
Confidence intervals vs prediction intervals
27
Python Code for Prediction Interval
28
Python Code
29
Python Code
30
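A minimal statsmodels sketch producing both intervals at x = 10 for the restaurant data:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({'pop': [2, 6, 8, 8, 12, 16, 20, 20, 22, 26],
                   'sales': [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]})
fit = smf.ols('sales ~ pop', data=df).fit()
pred = fit.get_prediction(pd.DataFrame({'pop': [10]}))
print(pred.summary_frame(alpha=0.05))
# mean_ci_* is the confidence interval for E(y); obs_ci_* is the wider prediction interval
```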
Thank You
31
Estimation, Prediction of Regression Model Residual
Analysis: Validating Model Assumptions - II
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES
1
Agenda
2
Residual Analysis: Validating Model Assumptions
• Residual analysis is the primary tool for determining whether the assumed
regression model is appropriate
3
Assumptions about the error term ε
4
Importance of the Assumptions
• These assumptions provide the theoretical basis for the t test and the F
test used to determine whether the relationship between x and y is
significant, and for the confidence and prediction interval estimates
• If the assumptions about the error term appear questionable, the
hypothesis tests about the significance of the regression relationship and
the interval estimation results may not be valid.
5
Residuals for Ice cream parlours
Source: Statistics for Business & Economics, David R. Anderson, Dennis J. Sweeney, Thomas A. Williams, Jeffrey D. Camm, James J. Cochran, Cengage Learning,2013
6
Residual analysis is based on an examination of graphical plots
7
Residual Plot Against x
8
Residual Plot Against x
9
Assumption: the variance is the same for all values of x
10
Violation of Assumption:
The variance of ‘e’ is not the same for all values of x
• Assumption of a constant
variance of ‘e’ is violated
• If variability about the regression
line is greater for larger values of
x
11
Assumed regression model is not an adequate
representation
12
Residual Plot Against ŷ
13
Residual Plot Against ŷ
14
Standardized Residuals
15
Python Code
16
17
Python Code
18
Standardized Residuals
19
Computation of standardized residuals for Icecream parlors
20
Computation of standardized residuals for Icecream parlors
21
Plot of The Standardized Residuals Against The Independent
Variable x
22
Plot of The Standardized Residuals Against The Independent
Variable x
23
Studentized residual
• The standardized residual plot can provide insight about the assumption
that the error term ‘e’ has a normal distribution.
• If this assumption is satisfied, the distribution of the standardized
residuals should appear to come from a standard normal probability
distribution.
24
Studentized residual
25
Normal Probability Plot
• Another approach for determining the validity of the assumption that the
error term has a normal distribution is the normal probability plot.
• To show how a normal probability plot is developed, we introduce the
concept of normal scores.
26
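A sketch of a normal probability plot of the standardized residuals (scipy's probplot computes the normal scores; data assumed as in the earlier sketches):

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence

x = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26])
y = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202])
model = sm.OLS(y, sm.add_constant(x)).fit()
std_resid = OLSInfluence(model).resid_studentized_internal

# Points falling near the reference line support the normal-error assumption.
stats.probplot(std_resid, dist="norm", plot=plt)
plt.show()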
Normal Probability Plot
27
Normal Probability Plot
28
Normal Probability Plot
29
Normal scores and ordered standardized residuals for
Armand’s pizza parlors
30
Normal Probability Plot
31
Normal probability plot for Ice Cream parlors
32
33
Thank You
34
MULTIPLE REGRESSION MODEL - I
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES
1
Agenda
2
Multiple regression model
3
The estimation process For multiple regression
4
Simple vs multiple regression
5
Least Squares Method
6
Least Squares Method
7
An Example: Trucking Company
8
PRELIMINARY DATA FOR BUTLER TRUCKING
9
Using python import data
10
Using python import data
11
Scatter Diagram Of Preliminary Data For Trucking x1
12
Scatter Diagram Of Preliminary Data For Trucking x2
13
Scatter Diagram For x1 and x2
14
Linear regression Vs. multiple regression model
• Linear regression
15
Linear regression Vs. multiple regression model
16
Linear regression Vs. Multiple regression model
• Multiple regression
17
Linear regression Vs. Multiple regression model
18
Multiple Coefficient of Determination
19
Multiple Coefficient of Determination for linear model
20
Multiple Coefficient of Determination for Multiple regression
model
21
Multiple Coefficient of Determination
22
Multiple Coefficient of Determination
23
Adjusted Multiple Coefficient of Determination
n = number of observations
p = the number of independent variables
24
OLS Summary
25
Adjusted Multiple Coefficient Vs Multiple Coefficient
26
Adjusted Multiple Coefficient Vs Multiple Coefficient
27
Model Assumptions
28
Assumption about error term
29
Assumption about error term
30
Graph of the regression equation for multiple regression
analysis with two independent variables
31
Response variable and response surface
32
Thank You
33
MULTIPLE REGRESSION MODEL-II
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES
1
Agenda
2
Testing for Significance
3
F Test
4
F test significance
5
F test significance
6
F Test
7
ANOVA table
8
t Test for individual significance
9
t Test for individual significance
10
t Test for individual significance
11
Regression Approach
to ANOVA
Regression Approach to ANOVA
• Three different assembly methods, referred to as methods A, B, and C, have been
proposed.
• Managers at Chemitech want to determine which assembly method can produce
the greatest number of filtration systems per week
A B C
58 58 48
64 69 57
55 71 59
66 64 47
67 68 49
ANOVA
Anova: Single Factor
SUMMARY
Groups Count Sum Average Variance
A 5 310 62 27.5
B 5 330 66 26.5
C 5 260 52 31
ANOVA
Source of Variation   SS    df   MS         F          P-value    F crit
Between Groups        520   2    260        9.176471   0.003818   3.885294
Within Groups         340   12   28.33333
Total                 860   14
Dummy variables for the chemitech experiment
Dummy variables for the chemitech experiment
• For method A the values of the dummy variables are A = 1 and B = 0; for method B, A = 0 and B = 1; and for method C, A = 0 and B = 0.
Regression Statistics
Multiple R 0.777593186
R Square 0.604651163
Adjusted R Square 0.53875969
Standard Error 5.322906474
Observations 15
ANOVA
df SS MS F Significance F
Regression 2 520 260 9.176471 0.003818412
Residual 12 340 28.33333
Total 14 860
Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 52 2.380476143 21.84437 4.97E-11 46.81338804 57.18661196 46.81338804 57.18661196
A 10 3.366501646 2.970443 0.011692 2.665023022 17.33497698 2.665023022 17.33497698
B 14 3.366501646 4.15862 0.001326 6.665023022 21.33497698 6.665023022 21.33497698
Estimation of E(y)
• b0 = 52
• b1= 10
• b2 = 14
Assembly Method Estimation of E(y)
A b0+b1 = 52+10=62
B b0+b2 = 52 +14 = 66
C 52
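A sketch reproducing these estimates with statsmodels (data from the Chemitech table above; method C is the baseline with A = 0, B = 0):

import numpy as np
import statsmodels.api as sm

# Weekly output under assembly methods A, B, C (5 observations each)
systems = np.array([58, 64, 55, 66, 67,    # method A
                    58, 69, 71, 64, 68,    # method B
                    48, 57, 59, 47, 49])   # method C
A = np.repeat([1, 0, 0], 5)   # dummy: 1 for method A
B = np.repeat([0, 1, 0], 5)   # dummy: 1 for method B

fit = sm.OLS(systems, sm.add_constant(np.column_stack([A, B]))).fit()
print(fit.params)                 # ~ [52, 10, 14] = [b0, b1, b2]
print(fit.fvalue, fit.f_pvalue)   # F ~ 9.18, p ~ 0.0038, matching the ANOVA table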
Testing the significance
Thank You
21
Linear Regression Model Vs Logistic Regression Model
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES
1
Agenda
2
Estimating the relationship
3
Graphical representation
• Linear regression • Logistic regression
4
Correspondence of Primary Elements of Model Fit
Linear Regression Logistic Regression
• Total sum of squares • -2LL of base model
• Error sum of squares • -2LL of proposed model
• F test of model fit • Chi-square test of -2LL difference
• Coefficient of determination (R2) • Pseudo R2 measures
• Regression sum of squares • Difference of -2LL for base and
proposed models
5
Objective of logistic regression
6
The fundamental difference
7
Log likelihood
8
Logistic vs discriminant
9
Logistic vs discriminant
• Second, even if the assumptions are met, many researchers prefer logistic
regression because it is similar to multiple regression
• It has straightforward statistical tests, similar approaches to incorporating
metric and nonmetric variables and nonlinear effects, and a wide range of
diagnostics
• Logistic regression is equivalent to two-group discriminant analysis and
may be more suitable in many situations
10
Logistic vs discriminant : Sample size
• One factor that distinguishes logistic regression from the other techniques
is its use of maximum likelihood (MLE) as the estimation technique
• MLE requires larger samples such that, all things being equal, logistic
regression will require a larger sample size than multiple regression
• As for discriminant analysis, there are considerations on the minimum
group size as well
11
Logistic vs discriminant : Sample size
12
Determination of coefficients
13
Determination of coefficients
• Linear regression • Logistic regression
14
Testing for overall significance
15
Testing for overall significance
• Linear regression • Logistic regression
16
Testing for significance
• Linear regression: t-test        • Logistic regression: Wald test

t = (b1 − β1) / s_b1, with df = n − 2

where:
s_b1 = s_e / √(SS_XX)
s_e = √(SSE / (n − 2))
SS_XX = ΣX² − (ΣX)² / n
β1 = the hypothesized slope
17
Testing for significance
• Linear regression • Logistic regression
18
Model Estimation fit
19
Model Estimation fit
• The lower the -2LL value, the better the fit of the model
• The -2LL value can be used to compare equations for the change in fit
20
Between Model Comparison
21
Step 1 : Estimate a null model
• The first step is to calculate a null model, which acts as the baseline for
making comparisons of improvement in model fit.
• The most common null model is one without any independent variables,
which is similar to calculating the total sum of squares using only the
mean in linear regression.
• The logic behind this form of null model is that it can act as a baseline
against which any model containing independent variables can be
compared.
22
Step 2: Estimate the proposed model
23
Step 3: Assess -2LL difference:
• The final step is to assess the statistical significance of the -2LL value
between the two models (null model versus proposed model).
• If the statistical tests support significant differences, then we can state
that the set of independent variable(s) in the proposed model is
significant in improving model estimation fit.
24
Between model comparison
25
Between model comparison
26
Normality of Residual (Error)
Linear regression Logistic regression
• Normally distributed • Binomially distributed
• Linear regression assumes that • Logistic regression does not need
residuals are approximately equal residuals to be equal for each
for all predicted dependent level of the predicted dependent
variable values variable values
27
Estimation Methods
28
Interpretation
29
THANK YOU
30
LOGISTIC REGRESSION - I
Dr A. RAMESH
DEPARTMENT OF MANAGEMENT STUDIES
1
Agenda
2
Application
• In many regression applications the dependent variable may only assume
two discrete values.
• For instance, a bank might like to develop an estimated regression
equation for predicting whether a person will be approved for a credit
card or not
• The dependent variable can be coded as y =1 if the bank approves the
request for a credit card and y = 0 if the bank rejects the request for a
credit card.
• Using logistic regression we can estimate the probability that the bank
will approve the request for a credit card given a particular set of values
for the chosen independent variables.
3
Example
Source: Statistics for Business and Economics, 11th Edition, by David R. Anderson, Dennis J. Sweeney, and Thomas A. Williams
4
Variables
5
Data (10 customer out of 100)
Customer Spending Card Coupon
1 2.291 1 0
2 3.215 1 0
3 2.135 1 0
4 3.924 0 0
5 2.528 1 0
6 2.473 0 1
7 2.384 0 0
8 7.076 0 0
9 1.182 1 1
10 3.345 0 0
6
Explanation of Variables
7
Logistic Regression Equation
8
Logistic Regression Equation
9
10
Logistic regression equation for β0 and β1
11
Logistic regression equation for β0 and β1
12
Estimating the Logistic Regression Equation
• In simple linear and multiple regression the least squares method is used to
compute b0, b1, . . . , bp as estimates of the model parameters (β0, β1, . . . , βp).
• The nonlinear form of the logistic regression equation makes the method of
computing estimates more complex
• We will use computer software to provide the estimates.
• The estimated logistic regression equation is
14
Variables
15
Managerial Use
• P(y = 1 | x1 = 2, x2 = 0) = .1880
• Probabilities indicate that for customers with annual spending of $2000 the presence of
a Simmons credit card increases the probability of using the coupon
16
Managerial Use
• It appears that the probability of using the coupon is much higher for
customers with a Simmons credit card.
17
Testing for Significance
18
G Statistics
• The test for overall significance is based upon the value of a G test
statistic.
• If the null hypothesis is true, the sampling distribution of G follows a chi-
square distribution with degrees of freedom equal to the number of
independent variables in the model.
19
20
G Statistics
• The value of G is 13.628, its degrees of freedom are 2, and its p-value is
0.001.
• Thus, at any level of significance α >= .001, we would reject the null
hypothesis and conclude that the overall model is significant.
21
Thank You
22
LOGISTIC REGRESSION - II
Dr A. RAMESH
DEPARTMENT OF MANAGEMENT STUDIES
1
Agenda
2
Chi-square value of the G-statistic
3
z test- Wald Test
4
Strategies
• Suppose Simmons wants to send the
promotional catalog only to
customers who have a 0.40 or higher
probability of using the coupon.
• Customers who have a Simmons
credit card: Send the catalog to every
customer who spent $2000 or more
last year.
• Customers who do not have a
Simmons credit card: Send the
catalog to every customer who spent
$6000 or more last year.
5
Interpreting the Logistic Regression Equation
6
Odds ratio
• The odds ratio measures the impact on the odds of a one-unit increase in
only one of the independent variables.
7
Interpretation
• For example, suppose we want to compare the odds of using the coupon
for customers who spend $2000 annually and have a Simmons credit card
(x1= 2 and x2 = 1) to the odds of using the coupon for customers who
spend $2000 annually and do not have a Simmons credit card (x1= 2 and
x2 = 0).
• We are interested in interpreting the effect of a one-unit increase in the
independent variable x2.
8
Odds ratio
9
Odds ratio – Interpretation
• The estimated odds in favor of using the coupon for customers who spent
$2000 last year and have a Simmons credit card are 3 times greater than
the estimated odds in favor of using the coupon for customers who spent
$2000 last year and do not have a Simmons credit card.
10
Odds ratio – Interpretation
• The odds ratio for each independent variable is computed while holding all the
other independent variables constant.
• But it does not matter what constant values are used for the other independent
variables.
• For instance, if we computed the odds ratio for the Simmons credit card
variable (x2) using $3000, instead of $2000, as the value for the annual
spending variable (x1), we would still obtain the same value for the estimated
odds ratio (3.00).
• Thus, we can conclude that the estimated odds of using the coupon for
customers who have a Simmons credit card are 3 times greater than the
estimated odds of using the coupon for customers who do not have a Simmons
credit card.
11
Relationship between the odds ratio and the coefficients of
the independent variables
12
Effect of a change of more than one unit on the Odds Ratio
• The odds ratio for an independent variable represents the change in the
odds for a one unit change in the independent variable holding all the
other independent variables constant.
• Suppose that we want to consider the effect of a change of more than one
unit, say c units.
• For instance, suppose in the Simmons example that we want to compare
the odds of using the coupon for customers who spend $5000 annually (x1
= 5) to the odds of using the coupon for customers who spend $2000
annually (x1 = 2).
• In this case c = 5- 2 = 3 and the corresponding estimated odds ratio is
13
Effect of a change of more than one unit on the Odds Ratio
• This result indicates that the estimated odds of using the coupon for
customers who spend $5000 annually is 2.79 times greater than the
estimated odds of using the coupon for customers who spend $2000
annually.
• In other words, the estimated odds ratio for an increase of $3000 in
annual spending is 2.79
14
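A small sketch of the odds-ratio arithmetic. The coefficient values below are the textbook's estimates for this example (assumed here, not re-derived): b1 = 0.3416 for spending and b2 = 1.0987 for the card dummy.

import math

b1, b2 = 0.3416, 1.0987      # assumed estimated logistic coefficients
print(math.exp(b2))          # ~3.00: odds ratio for a one-unit change in x2 (card)
c = 5 - 2                    # a $3000 increase in annual spending
print(math.exp(c * b1))      # ~2.79: odds ratio for a c-unit change in x1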
Logit Transformation
• This equation shows that the natural logarithm of the odds in favor of y =
1 is a linear function of the independent variables.
• This linear function is called the logit; we write g(x1, x2, . . . , xp) to denote it.
15
Estimated Logit Regression Equation
16
17
G vs Z
18
Thank You
19
Maximum Likelihood Estimation - I
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES
1
Agenda
• This lecture will provide intuition behind MLE using theory and
examples.
2
Maximum Likelihood Estimation
3
An intuitive view on likelihood
μ = −2, σ² = 1
μ = 0, σ² = 1
μ = 0, σ² = 4
4
Maximum Likelihood Estimation: Problem
• A sample of ten new bike helmets manufactured by a certain company is
obtained. Upon testing, it is found that the first, third, and tenth helmets
are flawed, whereas the others are not.
• Let p = P(flawed helmet), i.e., p is the proportion of all such helmets that
are flawed.
• Define (Bernoulli) random variables X1, X2, . . . , X10 by
Source: Probability and Statistics for Engineering and the Sciences, Jay L Devore, 8th Ed, Cengage
5
Maximum Likelihood Estimation: Problem
• Then for the obtained sample, X1 = X3 = X10 = 1 and the other seven Xi’s are
all zero
• The probability mass function of any particular Xi is f(xi; p) = p^xi (1 − p)^(1−xi),
which becomes p if xi = 1 and 1 − p when xi = 0
• Now suppose that the conditions of various helmets are independent of
one another
• This implies that the Xi’s are independent, so their joint probability mass
function is the product of the individual pmf’s.
6
Maximum Likelihood Estimation: Binomial Distribution
• Suppose that p = .25. Then the probability of observing the sample that
we actually obtained is (.25)3(.75)7 = .002086.
• If instead p = .50, then this probability is (.50)3(.50)7 = .000977.
• For what value of p is the obtained sample most likely to have occurred?
• That is, for what value of p is the joint pmf (eq 1) as large as it can be?
• What value of p maximizes (eq 1)
7
Maximum Likelihood Estimation: Binomial Distribution
• Figure shows a graph of the likelihood (eq 1) as a function of p.
• It appears that the graph reaches its peak above p = .3 = the proportion of
flawed helmets in the sample.
8
Graph of the natural logarithm of the likelihood
9
Maximum Likelihood Estimation: Binomial Distribution
• We can verify our visual impression by using calculus to find the value of p
that maximizes (eq 1).
• Working with the natural log of the joint pmf is often easier than working
with the joint pmf itself, since the joint pmf is typically a product so its
logarithm will be a sum.
• Here ln[ f (x1, . . . , x10; p)] = ln[p3(1 – p)7]
• = 3ln(p) + 7ln(1 – p)
10
Maximum Likelihood Estimation: Binomial Distribution
Thus
11
Interpretation
• Equating this derivative to 0 and solving for p gives
3(1 – p) = 7p, from which 3 = 10p and so p = 3/10 = .30 as conjectured
12
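A sketch verifying the calculus result numerically, by maximizing ln L(p) = 3 ln p + 7 ln(1 − p) with scipy:

import numpy as np
from scipy.optimize import minimize_scalar

# Negative log-likelihood for 3 flawed helmets out of 10
def neg_log_lik(p):
    return -(3 * np.log(p) + 7 * np.log(1 - p))

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x)   # ~0.30, matching p = 3/10 from the calculus argument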
Maximum Likelihood Estimation: Binomial Distribution
• Suppose that rather than being told the condition of every helmet, we had
only been informed that three of the ten were flawed.
• Then we would have the observed value of a binomial random variable X =
the number of flawed helmets.
• The pmf of X is b(x; 10, p) = (10 choose x) p^x (1 − p)^(10−x). For x = 3, this becomes
(10 choose 3) p³(1 − p)⁷.
• The binomial coefficient is irrelevant to the maximization, so again p̂ = 0.30.
13
Maximum Likelihood Function Definition
• Let 𝑋1 , 𝑋2 ,…, 𝑋𝑛 have joint pmf or pdf
𝑓(𝑥1 , 𝑥2 , … , 𝑥𝑛 ; 𝜃1 , … , 𝜃𝑚 ) (a)
• Where the parameters 𝜃1 , … , 𝜃𝑚 have unknown values. When 𝑥1 , … , 𝑥𝑛 are the observed
sample values and (a) is regarded as a function of 𝜃1 , … , 𝜃𝑚 , it is called the likelihood
function.
• The maximum likelihood estimates (MLEs) θ̂1, …, θ̂m are those values of the θi's that maximize
the likelihood function, so that f(x1, …, xn; θ̂1, …, θ̂m) ≥ f(x1, …, xn; θ1, …, θm) for all θ1, …, θm
• When the Xi's are substituted in place of the xi's, the maximum likelihood estimators result.
14
Interpretation
• The likelihood function tells us how likely the observed sample is as a
function of the possible parameter values.
• Maximizing the likelihood gives the parameter values for which the
observed sample is most likely to have been generated—that is, the
parameter values that “agree most closely” with the observed data.
15
Estimation of Poisson Parameter
• Suppose we have data generated from a Poisson distribution. We want to
estimate the parameter λ of the distribution
• The probability of observing a particular value is P(X; λ) = e^(−λ) λ^X / X!
• The joint likelihood is formed by multiplying the individual probabilities together:
P(X1, X2, …, Xn; λ) = (e^(−λ) λ^X1 / X1!) (e^(−λ) λ^X2 / X2!) ⋯ (e^(−λ) λ^Xn / Xn!)
L(λ; X) ∝ e^(−nλ) λ^(ΣXi) = e^(−nλ) λ^(nX̄)
16
Estimation of Poisson Parameter
• Note in the likelihood function the factorials have disappeared.
• This is because they provide a constant that does not influence the
relative likelihood of different values of the parameter
• It is usual to work with the log likelihood rather than the likelihood.
• Note that maximising the log likelihood is equivalent to maximising the likelihood.
Take the natural log of the likelihood function:
L(λ; X) = e^(−nλ) λ^(nX̄)
ℓ(λ; X) = −nλ + nX̄ log λ
Find where the derivative of the log likelihood is zero:
dℓ/dλ = −n + nX̄/λ = 0  ⟹  λ̂ = X̄
Note that here the MLE is the same as the moment estimator.
17
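A sketch checking λ̂ = X̄ numerically on simulated Poisson data (the factorial terms are dropped, since they do not depend on λ):

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.poisson(lam=4.0, size=200)   # simulated data for the illustration

def neg_log_lik(lam):
    # log L = -n*lambda + (sum of x_i) * log(lambda), constants dropped
    return -(-lam * x.size + x.sum() * np.log(lam))

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 20.0), method="bounded")
print(res.x, x.mean())   # the numerical MLE agrees with the sample mean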
Estimation of exponential distribution Parameter
• Suppose X1, X2, . . . , Xn is a random sample from an exponential
distribution with parameter . Because of independence, the likelihood
function is a product of the individual pdf’s:
19
Estimation of parameters of Normal Distribution
• Let X1, . . . , Xn be a random sample from a normal distribution.
• The likelihood function is
• so
20
Estimation of parameters of normal distribution
• To find the maximizing values of μ and σ², we must take the partial derivatives
of ln(f) with respect to μ and σ², equate them to zero, and solve the resulting
two equations.
21
Thank you
22
Maximum Likelihood Estimation-II
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES
1
Agenda
• This lecture will provide an understanding of the intuition behind MLE using
theory and examples.
2
Example1: Estimation of parameters of normal distribution
3
Example 1: Estimation of parameters of normal distribution
4
Interpretation
• The answer to this question is A, because the data are clustered around the
center of distribution A, but not around the center of distribution B
• This example illustrates that, by looking at the data, it is possible to find the
distribution that is most likely to have generated the data
• Now, I will explain exactly how to find that distribution in practice
5
The illustration of the estimation procedure
6
Graphical illustration of likelihood contribution
7
The illustration of the estimation procedure
• Then, you multiply the likelihood contributions of all the observations. this
is called the likelihood function. We use the notation L
n
• Likelihood function L= Li
i =1 This notation means you
multiply from i= 1 through n
• In our example, n= 5
8
The illustration of the estimation procedure
9
The illustration of the estimation procedure
• The values of the mean m and σ that maximize the likelihood function are found.
• The values of m and σ obtained this way are called the maximum likelihood
estimators of m and σ
• Most MLE problems cannot be solved ‘by hand’; you need to write an
iterative procedure to solve them on a computer
10
Method of Least-squares vs MLE
Model for the expectation (fixed part of the model): E[Yi] = β0 + β1 xi
Residuals: ri = yi − E[Yi]
The method of least-squares: find the values for the parameters (β0 and β1) that
make the sum of the squared residuals (Σ ri²) as small as possible.
Can only be used when the error term is normal (residuals are assumed to be drawn
from a normal distribution):
Yi = β0 + β1 xi + εi, where εi ~ N(0, σ)
Method of Least-squares vs MLE
Model for the expectation (fixed part of the model): E[Yi] = β0 + β1 xi
Residuals: ri = yi − E[Yi]
13
Estimation of Regression Parameters
14
Estimation of Regression Parameters
15
Estimation of Regression Parameters
16
Estimation of Regression Parameters
17
Estimation of Regression Parameters
18
Python Demo for MLE
19
20
21
Parameter estimation by MLE
22
Parameter estimation by MLE
23
Example 2
24
25
26
27
Thank you
28
Performance of Logistic Model-III
Dr A. RAMESH
DEPARTMENT OF MANAGEMENT STUDIES
1
Agenda
Python demo for accuracy prediction in a logistic regression model using the
receiver operating characteristic (ROC) curve
2
Sensitivity and Specificity
• For checking, what type of error we are making; we use two parameters-
3
Specificity and Sensitivity Relationship with Threshold
Threshold (Lower): Sensitivity (↑), Specificity (↓)
Threshold (Higher): Sensitivity (↓), Specificity (↑)
4
Measuring Accuracy, Specificity and Sensitivity
5
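A sketch of the basic computations with scikit-learn (illustrative labels, not the course data):

import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])   # assumed example labels
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])   # assumed predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print((tp + tn) / (tp + tn + fp + fn))  # accuracy
print(tp / (tp + fn))                   # sensitivity (true positive rate)
print(tn / (tn + fp))                   # specificity (true negative rate)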
ROC Curve for Training dataset
6
ROC Curve for Test data set
7
Threshold value selection
• Threshold values are often selected based on which errors are “better.”
8
Accuracy checking for different threshold values
9
Accuracy checking for different threshold values
10
Accuracy checking for different threshold values
11
Accuracy checking for different threshold values
12
Calculating Optimal Threshold Value
13
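A sketch of one common way to pick the optimal threshold from the ROC curve, maximizing Youden's J = TPR − FPR (illustrative arrays; in the demo these would come from the fitted logistic model):

import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])                     # assumed labels
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.6, 0.9, 0.3])  # assumed scores

fpr, tpr, thresholds = roc_curve(y_true, y_prob)
best = np.argmax(tpr - fpr)   # index maximizing Youden's J
print(thresholds[best])       # candidate optimal cutoff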
Optimal Threshold Value in ROC Curve
14
Classification Report using Optimal Threshold Value
15
Thank You
16
Confusion matrix and ROC - I
Dr A. RAMESH
DEPARTMENT OF MANAGEMENT STUDIES
1
Agenda
• Confusion matrix
• Receiver operating characteristics curve
2
Why Evaluate?
3
Accuracy Measures (Classification)
Misclassification error
• Error = classifying a record as belonging to one class when it belongs to
another class.
• Error rate = percent of misclassified records out of the total records in the
validation data
4
Confusion Matrix
5
Error Rate
Classification Confusion Matrix
Predicted Class
Actual Class 1 0
1 201 85
0 25 2689
6
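The arithmetic for this matrix, as a quick sketch:

# Error rate for the confusion matrix above:
# 201 + 2689 records classified correctly, 85 + 25 misclassified.
errors = 85 + 25
total = 201 + 85 + 25 + 2689
print(errors / total)   # ~0.0367, i.e., about 3.7% of records misclassified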
Cutoff for classification
Most algorithms classify via a 2-step process:
For each record,
1. Compute probability of belonging to class “1”
2. Compare to cutoff value, and classify accordingly
7
Cutoff Table
Actual Class Prob. of "1" Actual Class Prob. of "1"
1 0.996 1 0.506
1 0.988 0 0.471
1 0.984 0 0.337
1 0.980 1 0.218
1 0.948 0 0.199
1 0.889 0 0.149
1 0.848 0 0.048
0 0.762 0 0.038
1 0.707 0 0.025
1 0.681 0 0.022
1 0.656 0 0.016
0 0.622 0 0.004
Classification confusion matrices at two cutoff values (owner vs non-owner):
At a lower cutoff: actual owner → 11 classified owner, 1 non-owner; actual non-owner → 4 owner, 8 non-owner
At a higher cutoff: actual owner → 7 classified owner, 5 non-owner; actual non-owner → 1 owner, 11 non-owner
9
Compute Outcome Measures
10
When One Class is More Important
In many cases it is more important to identify members of one class
– Tax fraud
– Credit default
– Response to promotional offer
– Detecting electronic network intrusion
– Predicting delayed flights
11
ROC curves
[Figure: overlapping distributions of the test result for people with and without the disease; a threshold on the test result separates ‘‘−’’ from ‘‘+’’ calls.]
Some definitions ...
[Figure: the threshold partitions the two distributions into true positives, true negatives, and false negatives.]
Moving the Threshold: left
[Figure: shifting the threshold to the left changes the ‘‘−’’/‘‘+’’ split, trading false negatives for false positives.]
23
Threshold Value
• Often selected based on which errors are “better”
• If t is large, predict positive rarely (when P(y=1) is large)
– More errors where we say negative , but it is actually positive
– Detects patients who are negative
• If t is small, predict negative rarely (when P(y=1) is small)
– More errors where we say positive, but it is actually negative
– Detects all patients who are positive
• With no preference between the errors, select t = 0.5
– Predicts the more likely outcome
24
Selecting a Threshold Value
25
True disease state vs. Test result
Test result:        not rejected                        rejected/accepted
No disease (D = 0): ☺ specificity                       ✗ Type I error (α, false positive)
Disease (D = 1):    ✗ Type II error (β, false negative)  ☺ Power = 1 − β; sensitivity
Classification matrix: Meaning of each cell
27
Alternate Accuracy Measures
28
Receiver Operator Characteristic (ROC) Curve
29
Selecting a Threshold using ROC
30
Thank You
31
Confusion Matrix and ROC-II
Dr A. RAMESH
DEPARTMENT OF MANAGEMENT STUDIES
1
Agenda
2
ROC analysis
• True Positive Fraction
– TPF = TP / (TP+FN)
– also called sensitivity
– true abnormals called abnormal by the
observer
• False Positive Fraction
– FPF = FP / (FP+TN)
• Specificity = TN / (TN+FP)
– True normals called normal by the observer
– FPF = 1 - specificity
Evaluating classifiers (via
their ROC curves)
Classifier A can’t
distinguish between
normal and abnormal.
7
Area Under the ROC Curve (AUC)
8
Area Under the ROC Curve (AUC)
9
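A sketch of the AUC computation with scikit-learn (illustrative labels and scores, not the course data):

from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]                      # assumed labels
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.6, 0.9, 0.3]  # assumed scores
print(roc_auc_score(y_true, y_prob))   # 1.0 = perfect, 0.5 = random guessing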
Selecting a Threshold using ROC
10
ROC Plot
• A typical look of ROC plot with few points in it is shown in the following
figure.
• Note the four cornered points are the four extreme cases of classifiers
11
Interpretation of Different Points in ROC Plot
• The four points (A, B, C, and D)
• A: TPR = 1, FPR = 0, the ideal model, i.e., the perfect
classifier, no false results
• B: TPR = 0, FPR = 1, the worst classifier, unable to
predict a single instance correctly
• C: TPR = 0, FPR = 0, the model predicts every instance
to be a Negative class, i.e., it is an ultra-conservative
classifier
• D: TPR = 1, FPR = 1, the model predicts every instance
to be a Positive class, i.e., it is an ultra-liberal classifier
12
Interpretation of Different Points in ROC Plot
• Let us interpret the different points in the ROC
plot.
• The points on the upper diagonal region
• All points that reside in the upper-diagonal region
correspond to “good” classifiers, since their TPRs are
higher than their FPRs
• Here, X is better than Z as X has higher TPR and
lower FPR than Z.
• If we compare X and Y, neither classifier is superior
to the other
13
Interpretation of Different Points in ROC Plot
14
Tuning a Classifier through ROC Plot
• Using ROC plot, we can compare two or more
classifiers by their TPR and FPR values and this
plot also depicts the trade-off between TPR
and FPR of a classifier.
• Examining ROC curves can give insights into
the best way of tuning parameters of
classifier.
• For example, in the curve C2, the result is
degraded after the point P.
• Similarly for the observation C1, beyond Q the
settings are not acceptable.
15
Comparing Classifiers through ROC Plot
• We can use the concept of “area under
curve” (AUC) as a better method to
compare two or more classifiers.
• If a model is perfect, then its AUC = 1.
• If a model simply performs random
guessing, then its AUC = 0.5
• A model that is strictly better than other,
would have a larger value of AUC than the
other.
• Here, C3 is best, and C2 is better than C1
as AUC(C3)>AUC(C2)>AUC(C1).
16
ROC curve
[Figure: ROC curve with False Positive Rate (1 − specificity), 0–100%, on the horizontal axis and True Positive Rate, 0–100%, on the vertical axis.]
ROC curve comparison
[Figure: several ROC curves drawn on the same True Positive Rate vs False Positive Rate axes for comparison.]
20
Typical ROC
21
ROC curve extremes
22
Example
Source: Statistics for Business and Economics, 11th Edition, by David R. Anderson, Dennis J. Sweeney, and Thomas A. Williams
23
Variables
24
Data (10 customer out of 100)
Customer Spending Card Coupon
1 2.291 1 0
2 3.215 1 0
3 2.135 1 0
4 3.924 0 0
5 2.528 1 0
6 2.473 0 1
7 2.384 0 0
8 7.076 0 0
9 1.182 1 1
10 3.345 0 0
25
Explanation of Variables
26
Loading the data file and getting some statistical details
27
Method’s description
• ravel(): returns a one-dimensional array containing all the elements of the
input array.
28
Split dataset into training and testing sets
29
Building the model and predicting values
30
Calculate probability of predicting data values
31
Summary for logistic model
32
Accuracy Checking
33
Calculating Accuracy Score using Confusion Matrix
34
Generating Classification Report
35
Interpreting Classification Report
36
Thank You
37
Regression Analysis Model Building - I
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES
1
Introduction
2
General Linear Regression Model
• Suppose we collected data for one dependent variable y and k
independent variables x1,x2, . . . , xk.
• Objective is to use these data to develop an estimated regression equation
that provides the best relationship between the dependent and
independent variables.
3
Simple first-order model with one predictor variable
4
Modelling Curvilinear Relationships
Source: Statistics for Business and Economics, 11th Edition, by David R. Anderson, Dennis J. Sweeney, and Thomas A. Williams
5
Data
Scales Months
Sold Employed
275 41
296 106
317 76
376 104
162 22
150 12
367 85
308 111
189 40
235 51
83 9
112 12
67 6
325 56
189 19
6
Importing libraries and table
7
SCATTER DIAGRAM FOR THE REYNOLDS EXAMPLE
8
Python code for the Reynolds example: first-order model
9
First-order regression equation
10
Standardized residual plot for the Reynolds example: first-
order model
11
Standardized residual plot for the Reynolds example: first-
order model
12
Need for curvilinear relationship
13
Second-order model with one predictor variable
14
New Data set
15
Python output for the Reynolds example:
second-order model
16
Second-order regression model
17
Standardized residual plot for the Reynolds example:
second-order model
18
Interpretation second order model
• In multiple regression analysis the word linear in the term “general linear
model” refers only to the fact that the parameters β0, β1, . . . , βp all have exponents of 1
• It does not imply that the relationship between y and the xi’s is linear.
• Indeed, we have seen one example of how equation general linear model
can be used to model a curvilinear relationship.
20
Thank you
21
Regression Analysis Model Building (Interaction)- II
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES
1
Agenda
2
Interaction
• If the original data set consists of observations for y and two independent
variables x1 and x2, we can develop a second-order model with two predictor
variables by setting z1 = x1, z2 = x2, z3 = x1², z4 = x2², and z5 = x1x2 in the general
linear model equation
• The model obtained is
• In this second-order model, the variable z5 = x1x2 is added to account for the
potential effects of the two variables acting together.
• This type of effect is called interaction.
3
Example – Interaction
4
Advertising
Expenditure Sales
Price ($1000s) (1000s)
2 50 478
2.5 50 373
3 50 335
2 50 473
2.5 50 358
3 50 329
2 50 456
2.5 50 360
3 50 322
2 50 437
2.5 50 365
3 50 342
2 100 810
2.5 100 653
3 100 345
2 100 832
2.5 100 641
3 100 372
2 100 800
2.5 100 620
3 100 390
2 100 790
2.5 100 670
3 100 393
5
MEAN UNIT SALES (1000s)
6
Interpretation of interaction
• Note that the sample mean sales corresponding to a price of $2.00 and an
advertising expenditure of $50,000 is 461,000, and the sample mean sales
corresponding to a price of $2.00 and an advertising expenditure of
$100,000 is 808,000.
• Hence, with price held constant at $2.00, the difference in mean sales
between advertising expenditures of $50,000 and $100,000 is 808,000 -
461,000 = 347,000 units.
7
Interpretation of interaction
• When the price of the product is $2.50, the difference in mean sales is
646,000 -364,000 = 282,000 units.
• Finally, when the price is $3.00, the difference in mean sales is 375,000 -
332,000 = 43,000 units.
• Clearly, the difference in mean sales between advertising expenditures of
$50,000 and $100,000 depends on the price of the product.
• In other words, at higher selling prices, the effect of increased advertising
expenditure diminishes.
• These observations provide evidence of interaction between the price and
advertising expenditure variables.
8
Importing Data
9
Mean unit sales (1000s) as a function of selling price
10
Mean unit sales (1000s) as a function of Advertising
Expenditure($1000s)
11
Need for study the interaction between variable
12
Estimated regression equation, a general linear model
involving three independent variables (z1, z2, and z3)
13
Interaction variable
14
New Model
15
New Model
16
Interpretation
• Because the model is significant ( p-value for the F test is 0.000) and the p-
value corresponding to the t test for PriceAdv is 0.000, we conclude that
interaction is significant given the linear effect of the price of the product
and the advertising expenditure.
• Thus, the regression results show that the effect of advertising expenditure
on sales depends on the price.
17
Transformations Involving the Dependent Variable
Miles per
Gallon Weight
28.7 2289
29.2 2113
34.2 2180
27.9 2448
33.3 2026
26.4 2702
23.9 2657
30.5 2106
18.1 3226
19.5 3213
14.3 3607
20.9 2888
18
Importing data
19
Scatter diagram
20
Model 1
21
Standardized residual plot corresponding to the first-order
model.
22
Standardized residual plot corresponding to the first-order
model
23
Model 2
24
Residual plot for model 2
25
Residual plot of model 2
26
• The miles-per-gallon estimate is obtained by finding the number whose
natural logarithm is 3.2675.
• Using a calculator with an exponential function, or raising e to the power
3.2675, we obtain 26.2 miles per gallon.
27
Nonlinear Models That Are Intrinsically Linear
28
Thank You
29
χ² Test of Independence - I
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES
1
Agenda
2
χ² Test of Independence
3
χ² Test of Independence: Investment Example
• In which region of the country do you reside?
A. Northeast B. Midwest C. South D. West
• Which type of financial investment are you most likely to make today?
E. Stocks F. Bonds G. Treasury bills
Contingency Table — rows: Geographic Region (A, B, C, D); columns: Type of Financial Investment (E, F, G). Cell O13 is the observed frequency for region A and investment G; row totals nA–nD, column totals nE–nG, grand total N.
4
χ² Test of Independence: Investment Example
If A and F are independent: P(A) = nA / N, P(F) = nF / N, and
P(A ∩ F) = P(A) P(F) = (nA / N)(nF / N)
The expected frequency is then e_AF = N · P(A ∩ F) = N (nA / N)(nF / N) = nA nF / N
Contingency Table — rows: Geographic Region (A, B, C, D); columns: Type of Financial Investment (E, F, G). Cell e12 is the expected frequency for region A and investment F; row totals nA–nD, column totals nE–nG, grand total N.
5
χ² Test of Independence: Formulas
Expected frequencies: e_ij = (n_i)(n_j) / N
where: i = the row
j = the column
n_i = the total of row i
n_j = the total of column j
N = the total of all frequencies
6
χ² Test of Independence: Formulas
Calculated (observed) χ² = Σ (f_o − f_e)² / f_e
where df = (r − 1)(c − 1)
7
Example for Independence
8
χ² Test of Independence
Ho : Type of gasoline is
independent of income
Ha : Type of gasoline is not
independent of income
9
χ² Test of Independence
Type of Gasoline (r = 4, c = 3)
Income                Regular   Premium   Extra Premium
Less than $30,000
$30,000 to $49,999
$50,000 to $99,000
At least $100,000
10
χ² Test of Independence: Gasoline Preference Versus Income Category
α = .01
df = (r − 1)(c − 1) = (4 − 1)(3 − 1) = 6
χ²(.01, 6) = 16.812
If χ²_cal ≥ 16.812, reject Ho.
If χ²_cal < 16.812, do not reject Ho.
11
Python code
12
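A sketch of this test with scipy (the observed frequencies are those shown on the next slide):

import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[85, 16, 6],
                     [102, 27, 13],
                     [36, 22, 15],
                     [15, 23, 25]])
chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, dof, p)   # chi2 ~ 70.78, dof = 6, p << .01 -> reject Ho
print(expected)       # matches the hand-computed e_ij values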
Gasoline Preference Versus Income Category:
Observed Frequencies
Type of Gasoline
Income                Regular   Premium   Extra Premium   Total
Less than $30,000     85        16        6               107
$30,000 to $49,999    102       27        13              142
$50,000 to $99,000    36        22        15              73
At least $100,000     15        23        25              63
Total                 238       88        59              385
13
Gasoline Preference Versus Income Category: Expected Frequencies
e_ij = (n_i)(n_j) / N
e11 = (107)(238) / 385 = 66.15
e12 = (107)(88) / 385 = 24.46
e13 = (107)(59) / 385 = 16.40

Income                Regular       Premium      Extra Premium   Total
Less than $30,000     85 (66.15)    16 (24.46)   6 (16.40)       107
$30,000 to $49,999    102 (87.78)   27 (32.46)   13 (21.76)      142
$50,000 to $99,000    36 (45.13)    22 (16.69)   15 (11.19)      73
At least $100,000     15 (38.95)    23 (14.40)   25 (9.65)       63
Total                 238           88           59              385
14
Gasoline Preference Versus Income Category: χ² Calculation
χ² = Σ (f_o − f_e)² / f_e = 70.78
df = 6, α = 0.01; non-rejection region below 16.812
Since χ²_cal = 70.78 ≥ 16.812, reject Ho.
16
Contingency Tables
Contingency Tables
• Useful in situations involving multiple population proportions
• Used to classify sample observations according to two or more
characteristics
• Also called a cross-classification table.
17
Contingency Table Example
18
Contingency Table Example
Sample size n = 300: 120 females, 12 of whom were left handed; 180 males, 24 of whom were left handed.

Hand Preference   Female   Male   Total
Left              12       24     36
Right             108      156    264
20
The Chi-Square Test Statistic
χ² = Σ over all cells (f_o − f_e)² / f_e
where:
f_o = observed frequency in a particular cell
f_e = expected frequency in a particular cell if H0 is true
Decision Rule: If χ² > χ²_U, reject H0; otherwise, do not reject H0
(χ²_U is the upper-tail critical value)
22
Observed vs. Expected Frequencies

Hand Preference   Female                             Male                               Total
Left              Observed = 12, Expected = 14.4     Observed = 24, Expected = 21.6     36
Right             Observed = 108, Expected = 105.6   Observed = 156, Expected = 158.4   264
25
χ² Test for the Differences Among More Than Two Proportions
• Extend the χ² test to the case with more than two independent
populations:
H0: π1 = π2 = … = πc
H1: Not all of the πj are equal (j = 1, 2, …, c)
26
The Chi-Square Test Statistic
Assumed: each cell in the contingency table has expected frequency of at least 5
27
χ² Test with More Than Two Proportions: Example
28
χ² Test with More Than Two Proportions: Example

Object to Record Sharing   Insurance Companies   Pharmacies   Medical Researchers   Row Sum
Yes                        410                   295          335                   1040
No                         90                    205          165                   460

χ² Test with More Than Two Proportions: Example
Cell contributions (f_o − f_e)² / f_e:

Object to Record Sharing   Insurance Companies   Pharmacies   Medical Researchers
Yes                        11.571                7.700        0.3926
No                         26.159                17.409       0.888

The Chi-square test statistic is: χ² = Σ over all cells (f_o − f_e)² / f_e = 64.1196
χ² Test with More Than Two Proportions: Example
H0: π1 = π2 = π3
H1: Not all of the πj are equal (j = 1, 2, 3)
Conclusion: Since 64.1196 > 5.991, you reject H0 and conclude that at
least one proportion of respondents who object to their records being shared
is different across the three organizations
33
Thank You
34
χ² Test of Independence - II
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES
1
Agenda
2
Example
res_num aa pe sm ae r g c
1 99 19 1 2 0 0 1
2 46 12 0 0 0 0 0
3 57 15 1 1 0 0 0
4 94 18 2 2 1 1 1
5 82 13 2 1 1 1 1
6 59 12 0 0 2 0 0
7 61 12 1 2 0 0 0
8 29 9 0 0 1 1 0
9 36 13 1 1 0 0 0
10 91 16 2 2 1 1 0
3
Example
Here:
• res_num = registration no.
• aa = academic ability
• pe = parent education
• sm = student motivation
• r = religion
• g = gender
Python code
5
Hypothesis
6
Python code
7
Observed values
Gender        Student motivation
              0 (Disagree)   1 (Not decided)   2 (Agree)   Row Sum
0 (Male)      10             13                6           29
1 (Female)    4              9                 8           21
Column Sum    14             22                14          50
8
Expected frequency (contingency table)
9
Frequency Table
0 (Male):    fo = 10, fe = 8.12   fo = 13, fe = 12.76   fo = 6, fe = 8.12
1 (Female):  fo = 4, fe = 5.88    fo = 9, fe = 9.24     fo = 8, fe = 5.88
10
Chi-square calculation
χ² = Σ (f_o − f_e)² / f_e
   = 0.435 + 0.005 + 0.554 + 0.601 + 0.006 + 0.764
   = 2.365
11
Python code
12
Python code
Degrees of
freedom =
(2-1)*(3-1)
13
Python code
Contingency
table
14
χ² Goodness of Fit Test
15
χ² Goodness-of-Fit Test
16
χ² Goodness-of-Fit Test
χ² = Σ (f_o − f_e)² / f_e
df = k − 1 − p
where: f_o = frequency of observed values
k = number of categories
p = number of parameters estimated from the sample data
17
Goodness of Fit Test: Poisson Distribution
1. Set up the null and alternative hypotheses.
H0: Population has a Poisson probability distribution
Ha: Population does not have a Poisson distribution
18
Goodness of Fit Test: Poisson Distribution
where:
fi = observed frequency for category i
ei = expected frequency for category i
k = number of categories
19
Goodness of Fit Test: Poisson Distribution
5. Rejection rule:
p-value approach: Reject H0 if p-value < α
20
Goodness of Fit Test: Poisson Distribution
• Example: Parking Garage
21
Goodness of Fit Test: Poisson Distribution
A random sample of 100 one- minute time intervals resulted in the
customer arrivals listed below. A statistical test must be conducted to
see if the assumption of a Poisson distribution is reasonable.
# Arrivals 0 1 2 3 4 5 6 7 8 9 10 11 12
Frequency 0 1 4 10 14 20 12 12 9 8 6 3 1
22
Goodness of Fit Test: Poisson Distribution
• Hypotheses
H0: Number of cars entering the garage during
a one-minute interval is Poisson distributed
23
Python Code
24
Goodness of Fit Test: Poisson Distribution
f(x) = 6^x e^(−6) / x!
25
Goodness of Fit Test: Poisson Distribution
• Expected Frequencies
x f (x ) nf (x ) x f (x ) nf (x )
0 .0025 .25 7 .1377 13.77
1 .0149 1.49 8 .1033 10.33
2 .0446 4.46 9 .0688 6.88
3 .0892 8.92 10 .0413 4.13
4 .1339 13.39 11 .0225 2.25
5 .1606 16.06 12+ .0201 2.01
6 .1606 16.06 Total 1.0000 100.00
26
Python code
27
Python code
28
Goodness of Fit Test: Poisson Distribution
• Observed and Expected Frequencies
i fi ei fi - ei
0 or 1 or 2 5 6.20 -1.20
3 10 8.92 1.08
4 14 13.39 0.61
5 20 16.06 3.94
6 12 16.06 -4.06
7 12 13.77 -1.77
8 9 10.33 -1.33
9 8 6.88 1.12
10 or more 10 8.39 1.61
29
Python code
30
Goodness of Fit Test: Poisson Distribution
• Rejection Rule
With α = .05 and k − p − 1 = 9 − 1 − 1 = 7 d.f. (where k = number of categories
and p = number of population parameters estimated), χ²(.05, 7) = 14.067
Reject H0 if p-value < .05 or χ² > 14.067.
• Test Statistic
χ² = (−1.20)² / 6.20 + (1.08)² / 8.92 + ... + (1.61)² / 8.39 = 3.268
31
Python code
32
Goodness of Fit Test: Poisson Distribution
df = 7, α = 0.05; non-rejection region below 14.067
Since χ²_cal = 3.268 < 14.067, do not reject Ho.
33
Thank You
34
Cluster analysis: Introduction - I
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES
1
Agenda
2
Cluster Analysis
3
Cluster analysis
4
Example
5
Example
• Because this example contains only two variables, we can investigate it by merely looking
at the plot
• In this small data set there are clearly two distinct groups of objects
• Such groups are called clusters, and to discover them is the aim of cluster analysis
6
Cluster and discriminant analysis
7
Cluster analysis and discriminant analysis
8
Types of data and how to handle them
9
Example
Attributes
Objects
10
Types of data and how to handle them
11
Type of data
• Interval-Scaled Variables
• In this situation the n objects are characterized by p continuous
measurements
• These values are positive or negative real numbers, such as height, weight,
temperature, age, cost, ..., which follow a linear scale
• For instance, the time interval between 1900 and 1910 was equal in length
to that between 1960 and 1970
12
Type of data
13
Interval-Scaled Variables
14
Interval-Scaled Variables
• For example :
Person Weight(Kg) Height(cm)
• Take eight people, the weight (in A 15 95
kilograms) and the height (in centimetres) B 49 156
• In this situation, n = 8 and p = 2. C 13 95
D 45 160
E 85 178
F 66 176
G 12 90
H 10 78
Table :1
15
Figure 1
[Figure: scatter plot of the eight persons A–H; weight in kg on the horizontal axis (0–100) and height in cm on the vertical axis (0–200), showing two clusters.]
Interval-Scaled Variables
• The units on the vertical axis are drawn to the same size as those on the horizontal axis, even though they measure different quantities
• The plot contains two obvious clusters, which can in this case be interpreted easily: one consists of children and the other of adults
• For instance, measuring the concentration of certain natural hormones might have yielded a similar grouping
17
Interval-Scaled Variables
• Let us now consider the effect of changing measurement units.
• If weight and height of the subjects had been expressed in pounds and inches, the results would have looked quite different.
• A pound equals 0.4536 kg and an inch is 2.54 cm
• Therefore, Table 2 contains larger numbers in the column of weights and smaller numbers in the column of heights.

Table 2
Person   Weight (lb)   Height (in)
A        33.1          37.4
B        108           61.4
C        28.7          37.4
D        99.2          63
E        187.4         70
F        145.5         69.3
G        26.5          35.4
H        22            30.7

Figure 2
[Figure: scatter plot of the same eight persons with weight in lb (horizontal axis, 0–200) and height in inches (vertical axis, 0–100); the plot looks much flatter than Figure 1.]
19
Interpretation
• Although plotting essentially the same data as Figure 1, Figure 2 looks
much flatter
• In this figure, the relative importance of the variable “weight” is much
larger than in Figure 1
• As a consequence, the two clusters are not as nicely separated as in Figure
1 because in this particular example the height of a person gives a better
indication of adulthood than his or her weight. If height had been
expressed in feet (1 ft = 30.48 cm), the plot would become flatter still and
the variable “weight” would be rather dominant
• In some applications, changing the measurement units may even lead one
to see a very different clustering structure
20
Standardizing the data
21
Standardizing the data
22
Standardizing the data
24
Standardizing the data
• When applying standardization, one forgets about the original data and
uses the new data matrix in all subsequent computations
25
Detecting outlier
26
Standardizing the data
• The preceding description might convey the impression that
standardization would be beneficial in all situations.
• However, it is merely an option that may or may not be useful in a given
application
• Sometimes the variables have an absolute meaning, and should not be
standardized
• For instance, it may happen that several variables are expressed in the
same units, so they should not be divided by different s_f
• Often standardization dampens a clustering structure by reducing the
large effects, because the variables with a big contribution are divided by a
large s_f
27
Thank you
28
Cluster analysis: Part - II
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES
1
Agenda
2
Example
• Let's take four persons A, B, C, D with the following age and height:

Table 1
Person   Age (yr)   Height (cm)
A        35         190
B        40         190
C        35         160
D        40         160

Figure 1
[Figure: scatter plot of age (10–50 yr) vs height (150–200 cm); A and B at height 190, C and D at height 160.]

Source: Finding Groups in Data: An Introduction to Cluster Analysis, Leonard Kaufman and Peter J. Rousseeuw, John Wiley & Sons, March 1990
3
Example
4
Example
• The resulting data matrix, which is unitless, is given in Table 2
• Note that the new averages are zero and that the mean deviations equal 1
• Table 2
Person Variable 1 Variable 2
A 1 1
B -1 1
C 1 -1
D -1 -1
• Even when the data are converted to very strange units, standardization will always yield
the same numbers
5
Example
6
Choice of measurement (Units)- Merits and demerits
7
Choice of measurement- Merits and demerits
8
Distances computation between the objects
• The next step is to compute distances between the objects, in order to
quantify their degree of dissimilarity
• It is necessary to have a distance for each pair of objects i and j
• The most popular choice is the Euclidean distance:
d(i, j) = √((x_i1 − x_j1)² + (x_i2 − x_j2)² + … + (x_ip − x_jp)²)
• When the data are being standardized, one has to replace all x by z in this
expression
• This formula corresponds to the true geometrical distance between the
points with coordinates (x_i1, …, x_ip) and (x_j1, …, x_jp)
9
Example
10
Distances computation between the objects
11
Interpretation
• Suppose you live in a city where the streets are all north-south or east-
west, and hence perpendicular to each other
• Let Figure 3 be part of a street map of such a city, where the streets are
portrayed as vertical and horizontal lines
12
Interpretation
• Then the actual distance you would have to travel by car to get from point i
to point j is the Manhattan distance |x_i1 − x_j1| + |x_i2 − x_j2|
• This would be the shortest length among all possible paths from i to j
• Only a bird could fly straight from point i to point j, thereby covering the
Euclidean distance between the two points
Mathematical Requirements of a Distance Function
• Both the Euclidean metric and the Manhattan metric satisfy the following
mathematical requirements of a distance function, for all objects i, j, and h:
• (D1) d(i, j) ≥ 0
• (D2) d(i, i) = 0
• (D3) d(i, j) = d(j, i)
• (D4) d(i, j) ≤ d(i, h) + d(h, j)
• Condition (D1) merely states that distances are nonnegative numbers and (D2) says
that the distance of an object to itself is zero
• Axiom (D3) is the symmetry of the distance function
• The triangle inequality (D4) looks a little bit more complicated, but is necessary to allow
a geometrical interpretation
• It says essentially that going directly from i to j is shorter than making a detour over
object h
14
Distances computation between the objects
• Note that d(i, j) = 0 does not necessarily imply that i = j, because it can very well
happen that two different objects have the same measurements for the
variables under study
• However, the triangle inequality implies that i and j will then have the
same distance to any other object h, because d(i, h) ≤ d(i, j) + d( j, h) = d(j,
h) and at the same time d( j, h) ≤ d( j, i) + d(i, h) = d(i, h), which together
imply that d(i, h) = d(j, h)
15
Minkowski distance
16
Example for Calculation of Euclidean and Manhattan Distance
• Let x1 = (1, 2) and x2 = (3, 5) represent two objects as in the given Figure.
The Euclidean distance between the two is √(2² + 3²) = 3.61. The
Manhattan distance between the two is 2 + 3 = 5.
Figure: 4
17
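The same calculation as a small numpy sketch:

import numpy as np

x1 = np.array([1, 2])
x2 = np.array([3, 5])
print(np.sqrt(np.sum((x1 - x2) ** 2)))   # Euclidean: sqrt(2^2 + 3^2) ~ 3.61
print(np.sum(np.abs(x1 - x2)))           # Manhattan: 2 + 3 = 5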
n- by- n Matrix
• For example, the Euclidean distances between the objects of the following
table can be obtained as on the next slide:

Person   Weight (kg)   Height (cm)
A        15            95
B        49            156
C        13            95
D        45            160
E        85            178
F        66            176
G        12            90
H        10            78

• Euclidean distance between B and E:
√((49 − 85)² + (156 − 178)²) = 42.2
18
n- by- n Matrix
A B C D E F G H
A
B
C
D
E
F
G
H
19
Interpretation
20
Distance matrix
• It would suffice to write down only the lower triangular half of the
distance matrix
A B C D E F G
B
C
D
E
F
G
H
21
Selection of variables
22
Selection of variables
• The selection of “good” variables is a nontrivial task and may involve quite
some trial and error (in addition to subject-matter knowledge and
common sense)
• In this respect, cluster analysis may be considered an exploratory
technique
23
Thank you
24
χ² Goodness of Fit Test
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES
1
Agenda
2
Goodness of fit for Uniform Distribution
Month Litres
• Milk Sales Data January 1,610
February 1,585
March 1,649
April 1,590
May 1,540
June 1,397
July 1,410
August 1,350
September 1,495
October 1,564
November 1,602
December 1,655
18,447
3
Hypotheses and Decision Rules
Ho: The monthly milk figures for milk sales are uniformly distributed
Ha: The monthly milk figures for milk sales are not uniformly distributed
α = .01
df = k − 1 − p = 12 − 1 − 0 = 11
χ²(.01, 11) = 24.725
If χ²_cal ≥ 24.725, reject Ho.
If χ²_cal < 24.725, do not reject Ho.
4
Python code
5
Calculations
Month       fo       fe          (fo − fe)²/fe
January     1,610    1,537.25    3.44
February    1,585    1,537.25    1.48
March       1,649    1,537.25    8.12
April       1,590    1,537.25    1.81
May         1,540    1,537.25    0.00
June        1,397    1,537.25    12.80
July        1,410    1,537.25    10.53
August      1,350    1,537.25    22.81
September   1,495    1,537.25    1.16
October     1,564    1,537.25    0.47
November    1,602    1,537.25    2.73
December    1,655    1,537.25    9.02
Total       18,447   18,447.00   74.38

fe = 18,447 / 12 = 1,537.25
χ²_cal = 74.37
6
Python code
7
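A sketch of the milk-sales test with scipy (chisquare defaults to uniform expected frequencies, i.e. the mean count):

import numpy as np
from scipy.stats import chisquare

litres = np.array([1610, 1585, 1649, 1590, 1540, 1397,
                   1410, 1350, 1495, 1564, 1602, 1655])
stat, p = chisquare(litres)   # expected defaults to the mean, 1537.25
print(stat, p)                # stat ~ 74.4 with df = 11; p << .01 -> reject Ho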
Conclusion
df = 11
Since χ²_cal = 74.37 ≥ 24.725, reject Ho.
8
Goodness of Fit Test: Normal Distribution
1. Set up the null and alternative hypotheses.
2. Select a random sample and
a. Compute the mean and standard deviation.
b. Define intervals of values so that the expected frequency is at least 5 for
each interval.
c. For each interval record the observed frequencies
3. Compute the expected frequency, ei , for each interval.
9
Goodness of Fit Test: Normal Distribution
4. Compute the value of the test statistic.
χ² = Σ(i=1 to k) (f_i − e_i)² / e_i
5. Reject H0 if χ² ≥ χ²_α (upper-tail critical value with k − p − 1 degrees of freedom)
10
Normal Distribution Goodness of Fit Test
• Example: IQL Computers
11
Normal Distribution Goodness of Fit Test
33 43 44 45 52 52 56 58 63 64
64 65 66 68 70 72 73 73 74 75
83 84 85 86 91 92 94 98 102 105
(mean = 71, standard deviation = 18.23)
12
Python code
13
Normal Distribution Goodness of Fit Test
• Hypotheses
H0: The population of number of units sold
has a normal distribution with mean 71
and standard deviation 18.23
14
Normal Distribution Goodness of Fit Test
• Interval Definition
15
Normal Distribution Goodness of Fit Test
• Interval Definition
Areas
= 1.00/6
= .1667
17
Normal Distribution Goodness of Fit Test
• Observed and Expected Frequencies
i fi ei f i - ei
Less than 53.02 6 5 1
53.02 to 63.03 3 5 -2
63.03 to 71.00 6 5 1
71.00 to 78.97 5 5 0
78.97 to 88.98 4 5 -1
More than 88.98 6 5 1
Total 30 30
18
Python code
19
Normal Distribution Goodness of Fit Test
• Rejection Rule
With α = .05 and k − p − 1 = 6 − 2 − 1 = 3 d.f. (where k = number of categories
and p = number of population parameters estimated), χ²(.05) = 7.815
• Test Statistic
χ² = (1)²/5 + (−2)²/5 + (1)²/5 + (0)²/5 + (−1)²/5 + (1)²/5 = 1.600
20
Thank you
21
Cluster analysis: Part - III
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES
1
Clustering analysis part III
2
Agenda
3
Handling missing data
• It often happens that not all measurements are actually available, so there
are some “holes” in the data matrix
• Such an absent measurement is called a missing value and it may have
several causes
• The value of the measurement may have been lost or it may not have
been recorded at all by oversight or lack of time
4
Handling missing data
5
Handling missing data
6
Handling missing data
• If the data are standardized, the mean value m_f of the f-th variable is
calculated by making use of the present values only
• The same goes for s_f
7
Handling missing data
• In the computation of distances (based on either the xi, or the zi,) similar
precautions must be taken
• When calculating the distances d(i, j), only those variables are considered
in the sum for which the measurements for both objects are present
subsequently the sum is multiplied by p and divided by the actual number
of terms (in the case of Euclidean distances this is done before taking the
square root)
• Such a procedure only makes sense when the variables are thought of as
having the same weight (for instance, this can be done after
standardization)
8
Handling missing data
• When computing these distances, one might come across a pair of objects
that do not have any common measured variables, so their distance
cannot be computed by means of the above mentioned approach.
• Several remedies are possible: One could remove either object or one
could fill in some average distance value based on the rest of the data
• Or by replacing all missing xif by the mean mf of that variable; then all
distances can be computed
• Applying any of these methods, one finally possesses a “full” set of
distances
9
Dissimilarities
10
Dissimilarities
11
Example
12
Example
• It appears that the smallest dissimilarity is perceived between
mathematics and computer science (1.43 ), whereas the most remote
fields were psychology and astronomy (9.36)
13
Dissimilarities
14
Dissimilarities
• Both coefficients lie between - 1 and + 1 and do not depend on the choice
of measurement units
• The main difference between them is that the Pearson coefficient looks
for a linear relation between the variables f and g, whereas the Spearman
coefficient searches for a monotone relation
15
Dissimilarities
• Correlation coefficients are useful for clustering purposes because they
measure the extent to which two variables are related
• Correlation coefficients, whether parametric or nonparametric, can be
converted to dissimilarities d( f, g), for instance by setting
16
Similarities
• The more objects i and j are alike (or close), the larger s(i, j) becomes
• Such a similarity s(i, j) typically takes on values between 0 and 1, where 0
means that i and j are not similar at all and 1 reflects maximal similarity
• Values in between 0 and 1 indicate various degrees of resemblance
• Often it is assumed that the following conditions hold:
17
Similarities
• For all objects i and j , the numbers s(i, j) can be arranged in an n-by-n
matrix ,which is then called a similarity matrix
• Both similarity and dissimilarity matrices are generally referred to as
proximity matrices, or sometimes as resemblance
• In order to define similarities between variables, we can again resort to
the Pearson or the Spearman correlation coefficient
• However, neither correlation measure can be used directly as a similarity
coefficient because they also take on negative values
18
Similarities
• Some transformation is in order to bring the coefficients into the zero-one
range
• There are essentially two ways to do this, depending on the meaning of the
data and the purpose of the application
• If variables with a strong negative correlation are considered to be very
different because they are oriented in the opposite direction (like mileage and
weight of a set of cars), then it is best to take something like the following:
19
Similarities
20
Similarities
• Suppose the data consist of a similarity matrix but one wants to apply a
clustering algorithm designed for dissimilarities
• Then it is necessary to transform the similarities into dissimilarities
• The larger the similarity s(i, j) between i and j, the smaller their
dissimilarity d(i, j) should be
• Therefore, we need a decreasing transformation, such as
21
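Both directions can be sketched in a few lines (the specific forms below are common textbook choices, assumed here because the slides' own formulas are not shown):

# similarity from a correlation rho
s_opposite = lambda rho: (1 + rho) / 2  # negatively correlated variables count as very different
s_strength = lambda rho: abs(rho)       # only the strength of the relation counts

# a decreasing transformation from similarity to dissimilarity
d = lambda s: 1 - s

print(d(s_opposite(-0.9)), d(s_strength(-0.9)))  # 0.95 vs 0.1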
Binary Variables
22
Dissimilarity between two binary variables
23
Symmetric Binary Dissimilarity
24
Asymmetric binary variable
• A binary variable is asymmetric if the outcomes of the states are not
equally important, such as the positive and negative outcomes of a
disease test.
• By convention, we shall code the most important outcome, which is
usually the rarest one, by 1 (e.g., HIV positive) and the other by 0 (e.g., HIV
negative).
• Given two asymmetric binary variables, the agreement of two 1s (a
positive match) is then considered more significant than that of two 0s (a
negative match).
• Therefore, such binary variables are often considered “monary” (as if
having one state).
25
Asymmetric binary dissimilarity
26
Jaccard coefficient
27
Dissimilarity between binary variables
28
Dissimilarity between Jack and Mary

         Jack
Mary     1    0
1        2    1
0        0    3

d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
29
Dissimilarity between Jack and Jim

         Jim
Jack     1    0
1        1    1
0        1    3

d(Jack, Jim) = (1 + 1) / (1 + 1 + 1) = 0.67
30
Dissimilarity between Jim and Mary

         Jim
Mary     1    0
1        1    2
0        1    2

d(Jim, Mary) = (1 + 2) / (1 + 1 + 2) = 0.75
31
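The three values can be reproduced with a short sketch (the binary vectors below are reconstructed so that their contingency tables match the ones above; they are assumptions, not the lecture's own data):

def asymmetric_binary_d(x, y):
    # d = (b + c) / (a + b + c); negative matches (both 0) are ignored
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    return (b + c) / (a + b + c)

jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

print(round(asymmetric_binary_d(jack, mary), 2))  # 0.33
print(round(asymmetric_binary_d(jack, jim), 2))   # 0.67
print(round(asymmetric_binary_d(jim, mary), 2))   # 0.75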
Thank you
32
Cluster analysis: Part - IV
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES
1
Agenda
Categorical Variables
• A categorical variable is a generalization of the binary variable in that it
can take on more than two states
• For example, map color is a categorical variable that may have, say, five
states: red, yellow, green, purple, and blue
3
Categorical Variables
4
Categorical Variables
where m is the number of matches (i.e., the number of variables for
which i and j are in the same state), and p is the total number of
variables; the dissimilarity is then d(i, j) = (p − m) / p
Weights can be assigned to increase the effect of m or to assign greater
weight to the matches in variables having a larger number of states
6
Dissimilarity between categorical variables
• Suppose that we have the sample data shown in the table
• Only the object identifier and the variable (or attribute) test-1 are
available; test-1 is categorical
7
Dissimilarity matrix
1 2 3 4
1
2
3
4
8
Dissimilarity between categorical variables
So d(i, j) evaluates to 0 if objects i and j match, and to 1 if the objects
differ
• Thus, we get d(2,1) = (1 − 0)/1 = 1 and d(4,1) = (1 − 1)/1 = 0
9
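A minimal sketch of this simple matching computation (the test-1 codes below are the textbook example's values and should be read as an assumption):

def categorical_d(i_states, j_states):
    # d(i, j) = (p - m) / p, with m = number of matching variables
    p = len(i_states)
    m = sum(1 for a, b in zip(i_states, j_states) if a == b)
    return (p - m) / p

test1 = ["code A", "code B", "code C", "code A"]  # objects 1..4, so p = 1
print(categorical_d([test1[1]], [test1[0]]))  # d(2,1) = 1.0
print(categorical_d([test1[3]], [test1[0]]))  # d(4,1) = 0.0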
Ordinal Variables
10
Ordinal Variables
• For example, the relative ranking in a particular sport (e.g., gold, silver,
bronze) is often more essential than the actual values of a particular
measure
• Ordinal variables may also be obtained from the discretization of interval-
scaled quantities by splitting the value range into a finite number of
classes
• The values of an ordinal variable can be mapped to ranks
11
Dissimilarity computation
12
Dissimilarity computation
A B C D E F G
B
C
D
E
F
G
H
13
Standardization of ordinal variable
14
Dissimilarity computation
15
Example
16
Example
• For step 1, if we replace each value for test-2 by its rank, the four objects
are assigned the ranks 3, 1, 2, and 3, respectively
• Step 2 normalizes the ranking by mapping rank 1 to 0.0, rank 2 to 0.5, and
rank 3 to 1.0
• For step 3, we can use, say, the Euclidean distance, which results in the
following dissimilarity matrix:
17
Dissimilarity computation
Object 1 → rank 3 → z = 1.0
Object 2 → rank 1 → z = 0.0
Object 3 → rank 2 → z = 0.5
Object 4 → rank 3 → z = 1.0
18
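The three steps can be sketched as follows (a sketch, not the lecture's own code):

import numpy as np
from scipy.spatial.distance import pdist, squareform

ranks = np.array([3, 1, 2, 3], dtype=float)  # step 1: test-2 values mapped to ranks
Mf = 3                                       # number of ordered states
z = (ranks - 1) / (Mf - 1)                   # step 2: normalize to [0, 1]
print(z)                                     # [1.  0.  0.5 1. ]

# step 3: treat z as interval-scaled and use the Euclidean distance
print(squareform(pdist(z.reshape(-1, 1))))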
Ratio-Scaled Variables
19
Computing the dissimilarity between objects
• There are three methods to handle ratio-scaled variables for computing
the dissimilarity between objects:
1. Treat ratio-scaled variables like interval-scaled variables
– This, however, is not usually a good choice since it is likely that the
scale may be distorted
2. Apply a logarithmic transformation to a ratio-scaled variable f having value
x_if for object i by using the formula y_if = log(x_if)
– The y_if values can be treated as interval-valued. Notice that for some
ratio-scaled variables, log-log or other transformations may be
applied, depending on the variable's definition and the application
20
Computing the dissimilarity between objects
3. Treat x_if as continuous ordinal data and treat their ranks as interval-valued
• The latter two methods are the most effective, although the choice of
method used may depend on the given application
21
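A short sketch of method 2 (base-10 logarithms are assumed here because they reproduce the transformed values 2.65, 1.34, 2.21, and 3.08 quoted later for test-3; the raw values are chosen to match):

import numpy as np
from scipy.spatial.distance import pdist, squareform

x = np.array([445.0, 22.0, 164.0, 1210.0])  # test-3 values for objects 1..4
y = np.log10(x)                             # y_if = log(x_if)
print(np.round(y, 2))                       # [2.65 1.34 2.21 3.08]

# the y values can now be treated as interval-scaled
print(np.round(squareform(pdist(y.reshape(-1, 1))), 2))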
Example
22
Example
23
Variables of Mixed Types
24
Variables of Mixed Types
• In general, a database can contain all of the six variable types listed above
• “So, how can we compute the dissimilarity between objects of mixed
variable types?”
• One approach is to group each kind of variable together, performing a
separate cluster analysis for each variable type
– This is feasible if these analyses derive compatible results
– However, in real applications, it is unlikely that a separate cluster
analysis per variable type will generate compatible results
25
Variables of Mixed Types
26
Variables of Mixed Types
• Suppose that the data set contains p variables of mixed type
• The dissimilarity d(i, j) between objects i and j is defined as
27
Variables of Mixed Types
28
Variables of Mixed Types
• If f is ordinal: compute the ranks r_if and z_if = (r_if − 1) / (M_f − 1), and
treat z_if as interval-scaled
• If f is ratio-scaled: either perform a logarithmic transformation and treat
the transformed data as interval-scaled; or treat f as continuous ordinal
data, compute r_if and z_if, and then treat z_if as interval-scaled
• The above steps are identical to what we have already seen for each of the
individual variable types
29
Variables of Mixed Types
30
Thank you
31
Cluster analysis: Part - V
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES
1
Agenda
3
Example
• The procedures we followed for test-1 (which is categorical) and test-2
(which is ordinal) are the same as outlined above for processing variables
of mixed types
• For categorical variable -
4
Normalizing the interval scale data
• First, however, we need to complete some work for test-3 (which is ratio-
scaled)
• We have already applied a logarithmic transformation to its values
• Based on the transformed values of 2.65, 1.34, 2.21, and 3.08 obtained for
objects 1 to 4, respectively, we let max_h x_h = 3.08 and min_h x_h = 1.34
• We then normalize the values in the dissimilarity matrix obtained in the
earlier ratio-scaled example by dividing each one by (3.08 − 1.34) = 1.74
5
Dissimilarity matrix for test-3
6
Dissimilarity matrices for the three variables
• We can now use the dissimilarity matrices for the three variables in our
computation of Equation
• For example, we get d(2,1) = (1(1) + 1(1) + 1(0.75)) / 3 = 0.92
7
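The combined computation can be sketched as follows (equal weights and fully observed variables are assumed, so d(i, j) is simply the average of the three per-variable dissimilarities):

import numpy as np

vals = [2.65, 1.34, 2.21, 3.08]  # log-transformed test-3 values
d_test1 = np.array([[0, 1, 1, 0], [1, 0, 1, 1],
                    [1, 1, 0, 1], [0, 1, 1, 0]], dtype=float)
d_test2 = np.abs(np.subtract.outer([1.0, 0.0, 0.5, 1.0], [1.0, 0.0, 0.5, 1.0]))
d_test3 = np.round(np.abs(np.subtract.outer(vals, vals)) / 1.74, 2)

d_mixed = (d_test1 + d_test2 + d_test3) / 3  # average over the p = 3 variables
print(round(d_mixed[1, 0], 2))               # d(2,1) = 0.92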
Example
• The resulting dissimilarity matrix obtained for the data described by the
three variables of mixed types is:
8
Interpretation
• If we go back and look at Table of given data, we can intuitively guess that
objects 1 and 4 are the most similar, based on their values for test-1 and
test-2
• This is confirmed by the dissimilarity matrix, where d(4,1) is the lowest
value for any pair of different objects
• Similarly, the matrix indicates that objects 2 and 4 are the least similar
9
Distance Measurement using python - Euclidean Distance :
Python Demo for Euclidean Distance
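The demo itself appears as a screenshot in the slides; a minimal stand-in (the two example points are assumptions) could be:

import numpy as np
from scipy.spatial.distance import euclidean

a = np.array([2.0, 2.0])
b = np.array([5.5, 4.0])
print(euclidean(a, b))                # SciPy helper, ~4.03
print(np.sqrt(np.sum((a - b) ** 2)))  # the same result computed by hand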
Distance Measurement using python – Minkowski Distance :
• p = 1: Manhattan distance
• p = 2: Euclidean distance
Python Demo for Minkowski Distance
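Again the slide shows a screenshot; a minimal stand-in with the same assumed points:

from scipy.spatial.distance import minkowski

a, b = [2.0, 2.0], [5.5, 4.0]
print(minkowski(a, b, p=1))  # p = 1: Manhattan distance, 5.5
print(minkowski(a, b, p=2))  # p = 2: Euclidean distance, ~4.03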
Dissimilarity matrix
Distance matrix calculation for Interval-Scaled Variables
• For example: take eight people, with the weight (in kilograms) and the
height (in centimetres) measured for each
• In this situation, n = 8 and p = 2

Person   Weight (kg)   Height (cm)
A        15            95
B        49            156
C        13            95
D        45            160
E        85            178
F        66            176
G        12            90
H        10            78
15
Distance matrix calculation using Python
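A sketch of how this matrix can be computed (SciPy's pdist is one way to do it; the lecture's own code is shown only as a screenshot):

import numpy as np
from scipy.spatial.distance import pdist, squareform

# weight (kg) and height (cm) for persons A..H
X = np.array([[15, 95], [49, 156], [13, 95], [45, 160],
              [85, 178], [66, 176], [12, 90], [10, 78]], dtype=float)

D = squareform(pdist(X, metric="euclidean"))
print(np.round(D, 1))  # 8 x 8 symmetric matrix with a zero diagonal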
Thank You
19
K- Means Clustering
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES
1
Agenda
2
Classification of Clustering Methods
[Diagram: clustering methods classified into partitioning methods and hierarchical methods]
3
Which Clustering Algorithm to Choose
4
Partitioning Method
Given -
• a data set of n objects
• k, the number of clusters
• A partitioning algorithm organizes the objects into k partitions (k ≤ n),
where each partition represents a cluster
• The clusters are formed so as to optimize an objective partitioning
criterion, such as a dissimilarity function based on distance
• Therefore, the objects within a cluster are “similar,” whereas the objects
of different clusters are “dissimilar” in terms of the data set attributes
5
Partitioning Method
• Partitioning methods are applied if one wants to classify the objects into k
clusters, where k is fixed.
6
K-Means Method
7
Working Principle of K-Means Algorithm
8
Working Principle of K-Means Algorithm
9
Working Principle of K-Means Algorithm
• Criterion function:
E = Σ_{i=1}^{k} Σ_{p ∈ C_i} |p − m_i|²
where
– E is the sum of the squared error for all objects in the data set
– p is the point in space representing a given object
– m_i is the mean of cluster C_i (both p and m_i are multidimensional)
• For each object in each cluster, the distance from the object to its cluster
center is squared, and the distances are summed
• This criterion tries to make the resulting k clusters as compact and as separate
as possible
10
K=3
11
K-Means Clustering Algorithm
12
K-Means Clustering Method
• Method:
(1) arbitrarily choose k objects from D as the initial cluster centers;
(2) repeat
(3) (re)assign each object to the cluster to which the object is the most
similar, based on the mean value of the objects in the cluster;
(4) update the cluster means, i.e., calculate the mean value of the objects for
each cluster;
(5) until no change;
13
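A minimal NumPy sketch of these steps (an illustration under simple assumptions, not the lecture's own code):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]  # (1) arbitrary initial centers
    for _ in range(n_iter):                            # (2) repeat
        # (3) (re)assign each object to the nearest cluster mean
        labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
        # (4) update the cluster means (keep the old center if a cluster empties)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):          # (5) until no change
            break
        centers = new_centers
    return labels, centers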
K-Means clustering example
14
K-Means clustering example
[Scatter plot: the seven individuals plotted with Variable 1 on the x-axis (0–6) and Variable 2 on the y-axis (0–8)]
15
K-Means clustering example
16
K-Means clustering example
17
At K = 2
[Scatter plot: the seven individuals with the current cluster assignment at K = 2]
18
K-Means clustering example

Individual   Variable 1   Variable 2
1            1.0          1.0
2            1.5          2.0
3            3.0          4.0
4            5.0          7.0
5            3.5          5.0
6            4.5          5.0
7            3.5          4.5

• Calculate the Euclidean distance for the next data point (1.5, 2.0):
Distance from cluster 1 = √((1.5 − 1.0)² + (2.0 − 1.0)²) = 1.12
Distance from cluster 2 = √((1.5 − 3.0)² + (2.0 − 4.0)²) = 2.5

Data point   Distance to cluster 1   Distance to cluster 2   Assignment
(1.5, 2.0)   1.12                    2.5                     Cluster 1
19
[Scatter plot: cluster assignments after point (1.5, 2.0) joins cluster 1]
20
K-Means clustering example
21
K-Means clustering example
• Calculate the Euclidean distance for the next data point (5.0, 7.0):
Distance from cluster 1 = √((5.0 − 1.25)² + (7.0 − 1.5)²) = 6.66
Distance from cluster 2 = √((5.0 − 3.0)² + (7.0 − 4.0)²) = 3.61

Data point   Distance to cluster 1   Distance to cluster 2   Assignment
(5.0, 7.0)   6.66                    3.61                    Cluster 2
22
[Scatter plot: cluster assignments after point (5.0, 7.0) joins cluster 2]
23
K-Means clustering example
24
K-Means clustering example
• Calculate the Euclidean distance for the next data point (3.5, 5.0):
Distance from cluster 1 = √((3.5 − 1.25)² + (5.0 − 1.5)²) = 4.16
Distance from cluster 2 = √((3.5 − 4.0)² + (5.0 − 5.5)²) = 0.71

Data point   Distance to cluster 1   Distance to cluster 2   Assignment
(3.5, 5.0)   4.16                    0.71                    Cluster 2
25
[Scatter plot: cluster assignments after point (3.5, 5.0) joins cluster 2]
26
K-Means clustering example
27
K-Means clustering example
• Calculate the Euclidean distance for the next data point (4.5, 5.0):

Data point   Distance to cluster 1   Distance to cluster 2   Assignment
(4.5, 5.0)   4.78                    0.75                    Cluster 2
28
[Scatter plot: cluster assignments after point (4.5, 5.0) joins cluster 2]
29
K-Means clustering example
30
K-Means clustering example
• Calculate the Euclidean distance for the next data point (3.5, 4.5):

Data point   Distance to cluster 1   Distance to cluster 2   Assignment
(3.5, 4.5)   3.75                    0.86                    Cluster 2
31
[Scatter plot: final cluster assignments, with individuals 1–2 in cluster 1 and 3–7 in cluster 2]
32
K-Means clustering example
33
K-Means clustering example
Individual Variable 1 Variable 2 Assignment
1 1.0 1.0 1
2 1.5 2.0 1
3 3.0 4.0 2
4 5.0 7.0 2
5 3.5 5.0 2
6 4.5 5.0 2
7 3.5 4.5 2
34
Python code for K- Means Clustering
35
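The slides show the code as screenshots; a minimal scikit-learn stand-in for this example could be:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
              [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # individuals 1-2 vs 3-7 (label numbering may differ)
print(km.cluster_centers_)  # approximately (1.25, 1.5) and (3.9, 5.1)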
Python code for K- Means Clustering
36
Python code
37
Thank you
38
Hierarchical method of clustering - I
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES
1
Agenda
2
Introduction
3
Introduction
• It successively merges the objects or groups that are close to one another,
until all of the groups are merged into one (the topmost level of the
hierarchy), or until a termination condition holds
• The divisive approach, also called the top-down approach, starts with all
of the objects in the same cluster
• In each successive iteration, a cluster is split up into smaller clusters, until
eventually each object is in one cluster, or until a termination condition
holds
4
Introduction
• Hierarchical methods suffer from the fact that once a step (merge or split)
is done, it can never be undone
5
Agglomerative and Divisive Hierarchical Clustering
6
Agglomerative versus divisive hierarchical clustering
8
Interpretation
9
Interpretation
• In DIANA, all of the objects are used to form one initial cluster
• The cluster is split according to some principle, such as the maximum
Euclidean distance between the closest neighboring objects in the cluster
• The cluster splitting process repeats until, eventually, each new cluster
contains only a single object
• In either agglomerative or divisive hierarchical clustering, the user can
specify the desired number of clusters as a termination condition
10
Dendrogram
12
Dendrogram
• We can also use a vertical axis to show the similarity scale between
clusters
• For example, when the similarity of two groups of objects, {a,b} and
{c,d,e}, is roughly 0.16, they are merged together to form a single cluster
13
Measures for distance between clusters
• Four widely used measures for the distance between clusters are as follows,
where |p − p′| is the distance between two objects or points p and p′, m_i is
the mean of cluster C_i, and n_i is the number of objects in C_i
• Minimum distance: d_min(C_i, C_j) = min_{p ∈ C_i, p′ ∈ C_j} |p − p′|
• Maximum distance: d_max(C_i, C_j) = max_{p ∈ C_i, p′ ∈ C_j} |p − p′|
• Mean distance: d_mean(C_i, C_j) = |m_i − m_j|
• Average distance: d_avg(C_i, C_j) = (1 / (n_i n_j)) Σ_{p ∈ C_i} Σ_{p′ ∈ C_j} |p − p′|
14
Measures for distance between clusters
15
Measures for distance between clusters
• Because edges linking clusters always go between distinct clusters, the
resulting graph will generate a tree
• Thus, an agglomerative hierarchical clustering algorithm that uses the
minimum distance measure is also called a minimal spanning tree
algorithm
• When an algorithm uses the maximum distance, dmax(Ci,Cj), to measure
the distance between clusters, it is sometimes called a farthest-neighbor
clustering algorithm
• If the clustering process is terminated when the maximum distance
between nearest clusters exceeds an arbitrary threshold, it is called a
complete-linkage algorithm
16
Measures for distance between clusters
18
Illustration
[Figure: representation of some definitions of inter-cluster dissimilarity: (a) group average, (b) nearest neighbour, (c) furthest neighbour]
19
20
Difficulties with hierarchical clustering
21
Difficulties with hierarchical clustering
• Thus merge or split decisions, if not well chosen at some step, may lead to
low-quality clusters
• Moreover, the method does not scale well, because each decision to
merge or split requires the examination and evaluation of a good number
of objects or clusters
• One way to improve the clustering quality of hierarchical methods is to
integrate hierarchical clustering with other clustering techniques, resulting
in multiple-phase clustering
22
Partitioning Vs. Hierarchical
23
K-means versus hierarchical clustering
24
K means versus hierarchical clustering
25
K means versus hierarchical clustering
K-Means clustering:
• These methods are generally less computationally intensive and are
therefore preferred with very large datasets
Hierarchical clustering:
• Divisive methods work in the opposite direction, starting with one cluster
that includes all records
• Hierarchical methods are especially useful when the goal is to arrange the
clusters into a natural hierarchy
26
K means versus hierarchical clustering
K-Means clustering:
• A partitioning (K-means) clustering is simply a division of the set of data
objects into non-overlapping subsets (clusters) such that each data object
is in exactly one subset
Hierarchical clustering:
• A hierarchical clustering is a set of nested clusters that are organized as a
tree
27
K means versus hierarchical clustering
[Figure: an un-nested (partitional) clustering vs a nested (hierarchical) clustering]
Ashok, A.R., Prabhakar, C.R. and Dyaneshwar, P.A., Comparative Study on Hierarchical and Partitioning Data Mining Methods.
28
K means versus hierarchical clustering
[Figure: dendrogram with the corresponding proximity matrix]
29
K means versus hierarchical clustering
K-Means clustering:
• In K-means clustering, since one starts with a random choice of cluster
centers, the results produced by running the algorithm multiple times
might differ
• K-means is found to work well when the shape of the clusters is
hyper-spherical (like a circle in 2D or a sphere in 3D)
Hierarchical clustering:
• Results are reproducible in hierarchical clustering
• Hierarchical clustering does not work as well as K-means when the shape
of the clusters is hyper-spherical
30
K means versus hierarchical clustering
K-Means clustering:
• K-means clustering requires prior knowledge of K, i.e., the number of
clusters into which one wants to divide the data
Hierarchical clustering:
• In hierarchical clustering one can stop at whatever number of clusters one
finds appropriate, by interpreting the dendrogram
31
K means versus hierarchical clustering
https://stepupanalytics.com/difference-between-k-means-clustering-and-hierarchical-clustering/
32
Hierarchical clustering
Advantages
• Ease of handling any form of similarity or distance
• Consequently, applicability to any attributes types
33
Limitations of Hierarchical Clustering
34
Limitations of Hierarchical Clustering
• With respect to the choice of distance between clusters, single and
complete linkage are robust to changes in the distance metric (e.g.,
Euclidean, statistical distance) as long as the relative ordering is kept
• In contrast, average linkage is more influenced by the choice of distance
metric, and might lead to completely different clusters when the metric is
changed
• Hierarchical clustering is sensitive to outliers
35
Average-linkage clustering
• Strengths
– Less susceptible to noise and outliers
• Limitations
– Biased towards globular clusters
36
Distance between two clusters
• Ward's distance between clusters C_i and C_j is the difference between the total
within-cluster sum of squares for the two clusters separately, and the within-
cluster sum of squares resulting from merging the two clusters into cluster C_ij:
D_w(C_i, C_j) = Σ_{x ∈ C_i} (x − r_i)² + Σ_{x ∈ C_j} (x − r_j)² − Σ_{x ∈ C_ij} (x − r_ij)²
• r_i: centroid of C_i
• r_j: centroid of C_j
• r_ij: centroid of C_ij
37
Ward’s distance for clusters
38
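A small numerical sketch of this definition (the cluster values and the sign convention, merged minus separate, are assumptions for illustration):

import numpy as np

def ess(C):
    # within-cluster sum of squares around the centroid
    C = np.asarray(C, float)
    return np.sum((C - C.mean(axis=0)) ** 2)

def ward_distance(Ci, Cj):
    # change in total within-cluster sum of squares caused by merging
    Cij = np.vstack([Ci, Cj])
    return ess(Cij) - (ess(Ci) + ess(Cj))  # >= 0 in this form

print(ward_distance([[1.0, 1.0], [1.5, 2.0]], [[3.0, 4.0], [3.5, 5.0]]))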
Hierarchical Clustering: Comparison
[Figure: dendrograms comparing single linkage, complete linkage, group average, and Ward's method on the same six points]
39
K- means clustering
Advantages:
• The center of mass can be found efficiently by finding the mean value of
each coordinate
• This leads to an efficient algorithm to compute the new centroids with a
single scan of the data
Disadvantages:
• K-means has problems when clusters are of differing sizes, densities, or
non-globular shapes, and when the data contains outliers
40
Similarity
41
Thank You
42
Measures of Attribute Selection
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES
1
Agenda
2
Example
Han, J., Pei, J. and Kamber, M., 2011. Data mining: concepts and
techniques. Elsevier.
3
Example
4
Expected information needed to classify a tuple in D
• To find the splitting criterion for these tuples, we must compute the
information gain of each attribute
• Let us consider the class attribute buys_computer as the decision criterion for D
• Calculate the information: Info(D) = −p_y log2(p_y) − p_n log2(p_n)
• where p_y is the probability of 'yes' and p_n is the probability of 'no'
5
Calculation of entropy for ‘ Youth’
6
Calculation of entropy for ‘ Youth’
7
Calculation of entropy for ‘ Middle Age’
• Entropy for middle_aged = 0, since all four middle_aged tuples belong to
class 'yes'
• For senior:
8
Calculate Entropy for senior
9
The expected information needed to classify a tuple in D
according to age
10
Calculation information Gain of Age
• Gain of Age:
11
Calculation information Gain of Income
12
Calculate Entropy for high
• High :
High Class: buys computer
Yes 2
No 2
13
Calculate Entropy for ‘medium’
• Medium:
Medium Class: buys computer
Yes 4
No 2
14
Calculate Entropy for ‘low’
• Low :
Low Class: buys computer
No 1
Yes 3
15
Gain of income
16
Calculation of gain for student
17
Calculate Entropy for No
• No :
No Class: buys computer
Yes 3
No 4
18
Calculate Entropy for ‘Yes’
• Yes :
Yes Class: buys computer
Yes 6
No 1
19
Gain of student
20
Calculation of gain for credit rating
21
Calculate Entropy for Fair
• Fair :
Fair Class: buys computer
Yes 6
No 2
22
Calculate Entropy for Excellent
• Excellent :
Excellent Class: buys computer
Yes 3
No 3
23
Gain for credit rating
24
Independent variable Information gain
Age 0.246
Income 0.029
Student 0.151
Credit_rating 0.048
25
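The whole table can be reproduced with a short sketch (the class counts per partition are those used in this lecture; the tiny differences against the quoted 0.246 and 0.151 come from intermediate rounding on the slides):

import numpy as np

def entropy(counts):
    # Info = -sum p_i log2(p_i) over the class counts
    p = np.asarray(counts, float)
    p = p[p > 0] / p.sum()
    return -(p * np.log2(p)).sum()

def gain(partitions, total):
    # Gain(A) = Info(D) - sum (|Dj|/|D|) Info(Dj)
    n = sum(sum(part) for part in partitions)
    info_a = sum(sum(part) / n * entropy(part) for part in partitions)
    return entropy(total) - info_a

total = [9, 5]                         # 9 'yes', 5 'no' in D
splits = {
    "age": [[2, 3], [4, 0], [3, 2]],   # youth, middle_aged, senior
    "income": [[2, 2], [4, 2], [3, 1]],
    "student": [[6, 1], [3, 4]],
    "credit_rating": [[6, 2], [3, 3]],
}
for name, parts in splits.items():
    print(name, round(gain(parts, total), 3))
# age 0.247, income 0.029, student 0.152, credit_rating 0.048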
Selection of root classifier
• Because age has the highest information gain among the attributes, it is
selected as the splitting attribute
• Node N is labelled with age, and branches are grown for each of the
attribute’s values
• The tuples are then partitioned accordingly
• Notice that the tuples falling into the partition for age = middle aged all
belong to the same class
• Because they all belong to class “yes,” a leaf should therefore be created
at the end of this branch and labelled with “yes.”
26
Decision tree
27
Decision tree
28
Thank You
29
Classification and Regression Trees (CART - I)
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES
1
Agenda
2
Introduction
3
Problem Description for Illustration
Han, J., Pei, J. and Kamber, M., 2011. Data mining: concepts and
techniques. Elsevier.
4
Root Node, Internal Node, Child Node
• A decision tree uses a tree structure to represent a number of possible
decision paths and an outcome for each path
• A decision tree consists of a root node, internal nodes, and leaf nodes
• The topmost node in a tree is the root node or parent node; it represents
the entire sample population
• An internal node (non-leaf node) denotes a test on an attribute; each
branch represents an outcome of the test
• A leaf node (terminal node or child node) holds a class label; it cannot be
further split

[Figure: tree diagram labelling the root (parent) node, internal nodes, and child (leaf) nodes]
5
Decision Tree Introduction
7
Decision Tree Algorithm
Input:
• Data partition, D, which is a set of
training tuples and their associated
class labels;
• Attribute list, the set of candidate
attributes;
• Attribute selection method, a
procedure to determine the splitting
criterion that “best” partitions the data
tuples into individual classes. This
criterion consists of a splitting attribute
and, possibly, either a split point or
splitting subset.
Output: A decision tree
8
Decision Tree Algorithm
9
Decision Tree Algorithm
10
Decision Tree Method
N-Node
C- Class
D- tuples in training data set
11
Decision Tree Method step 1 to 6
• The tree starts as a single node, N,
representing the training tuples in D (step
1).
• If the tuples in D are all of the same class,
then node N becomes a leaf and is
labelled with that class (steps 2 and 3)
• Steps 4 and 5 are terminating conditions
• Otherwise, the algorithm calls Attribute
selection method to determine the
splitting criterion
• The splitting criterion (like Gini) tells us
which attribute to test at node N by
determining the “best” way to separate
or partition the tuples in D into individual
classes (step 6)
12
Decision Tree Method - Step 7 - 11
• The splitting criterion indicates the splitting
attribute and may also indicate either a
split-point or a splitting subset
• The splitting criterion is determined so
that, ideally, the resulting partitions at each
branch are as “pure” as possible. A
partition is pure if all of the tuples in it
belong to the same class.
• The node N is labelled with the splitting
criterion, which serves as a test at the node
(step 7).
• A branch is grown from node N for each of
the outcomes of the splitting criterion.
• The tuples in D are partitioned accordingly
(steps 10 to 11)
13
Three possibilities for partitioning tuples based on the
splitting criterion
• There are three possible scenarios, as illustrated in Figure (a), (b) and (c).
• Let A be the splitting attribute. A has ‘v’ distinct values,{a1,a2,...,av}, based
on the training data
• If A is discrete-valued in figure (a), then one branch is grown for each
known value of A.
Figure (a)
14
Three possibilities for partitioning tuples based on the
splitting criterion
• If A is continuous-valued in figure (b), then two branches are grown,
corresponding to A ≤ split point and A > split point.
• Where split point is the split-point returned by Attribute selection method
as part of the splitting criterion.
Figure (b)
15
Three possibilities for partitioning tuples based on the
splitting criterion
• If A is discrete-valued and a binary tree must be produced, then the test is
of the form A ∈ 𝑆𝐴 , where 𝑆𝐴 is the splitting subset for A.
Figure (c)
16
Decision Tree Method – termination condition
• The algorithm uses the same process recursively to form a decision tree
for the tuples at each resulting partition, 𝐷𝑗 , of D (step 14).
17
Decision Tree Method – termination condition
1. All of the tuples in partition D belong to the same class (steps 2 and 3).
2. There are no remaining attributes on which the tuples may be further
partitioned (step4).
• In this case, majority voting is employed(step 5).
• This involves converting node N into a leaf and labelling it with the most
common class in D.
• Alternatively, the class distribution of the node tuples may be stored.
3. There are no tuples for a given branch, that is, a partition Dj is empty (step
12).
• In this case, a leaf is created with the majority class in D (step 13).
• The resulting decision tree is returned (step 15).
18
Attribute Selection Measures
19
Attribute Selection Measures
• If the splitting attribute is continuous-valued or if we are restricted to binary
trees then, respectively, either a ‘split point’ or a ‘splitting subset’ must also be
determined as part of the splitting criterion
• CART algorithm uses information gain and Gini index measure for attribute
selection
20
Attribute Selection Measures
21
Information Gain
22
Information Gain-Entropy Measure
• The expected information needed to classify a
tuple in D is given by
Info(D) = −Σ_{i=1}^{m} p_i log2(p_i)
where p_i is the probability that an arbitrary tuple in D belongs to class C_i
23
Attribute Selection Measures
• It is quite likely that the partitions will be impure (e.g., where a partition
may contain a collection of tuples from different classes rather than from
a single class).
• How much more information would we still need (after the partitioning) in
order to arrive at an exact classification?
• This amount is measured by
Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Info(D_j)
• The term |D_j| / |D| acts as the weight of the j-th partition. Info_A(D) is
the expected information required to classify a tuple from D based on the
partitioning by A.
24
Information Gain
• The smaller the expected information (still) required, the greater the
purity of the partitions
• Information gain is defined as the difference between the original
information requirement (i.e., based on just the proportion of classes) and
the new requirement (i.e., obtained after partitioning on A). That is,
Gain(A) = Info(D) − Info_A(D)
25
Gini Index
• The Gini index measures the impurity of D,
a data partition or set of training tuples, as
Gini(D) = 1 − Σ_{i=1}^{m} p_i²
26
Gini Index
27
Gini Index
• For continuous-valued attributes, each possible split-point must be considered
• The strategy is analogous: the midpoint between each pair of (sorted)
adjacent values is taken as a possible split-point.
• For a possible split-point of A, 𝐷1 is the set of tuples in D satisfying A ≤ split
point, and 𝐷2 is the set of tuples in D satisfying A > split point.
• The reduction in impurity that would be incurred by a binary split on a
discrete- or continuous-valued attribute A is
ΔGini(A) = Gini(D) − Gini_A(D)
• The attribute that maximizes the reduction in impurity (or, equivalently, has
the minimum Gini index) is selected as the splitting attribute
28
Which attribute selection measure is the best?
• When a decision tree is built, many of the branches will reflect anomalies
in the training data due to noise or outliers
• Tree pruning use statistical measures to remove the least reliable
branches
• Pruned trees tend to be smaller and less complex and, thus, easier to
comprehend
• They are usually faster and better at correctly classifying independent test
data than unpruned trees
30
How does Tree Pruning Work?
• There are two common approaches to tree pruning: pre-pruning and post-
pruning.
• In the pre-pruning approach, a tree is “pruned” by halting its construction
early (e.g., by deciding not to further split or partition the subset of
training tuples at a given node).
• When constructing a tree, measures such as statistical significance,
information gain, Gini index can be used to assess the goodness of a split.
31
How does Tree Pruning Work?
32
How does Tree Pruning Work?
33
THANK YOU
34
Attribute selection Measures in CART : II
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES
1
Agenda
2
Gain Ratio
• The information gain measure is biased toward tests with many outcomes
• That is, it prefers to select attributes having a large number of values
• For example, consider an attribute that acts as a unique identifier, such as
product ID.
• A split on product ID would result in a large number of partitions (as many
as there are values), each one containing just one tuple
3
Gain Ratio
4
Split information
• It applies a kind of normalization to information gain using a “split
information” value defined analogously with Info(D) as:
SplitInfo_A(D) = −Σ_{j=1}^{v} (|D_j| / |D|) × log2(|D_j| / |D|)
5
Gain ratio
• Gain ratio differs from information gain, which measures the information
with respect to classification that is acquired based on the same
partitioning
• The gain ratio is defined as GainRatio(A) = Gain(A) / SplitInfo_A(D)
• The attribute with the maximum gain ratio is selected as the splitting
attribute
6
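A short sketch of the computation for income (income splits D into 4 high, 6 medium, and 4 low tuples; the gain value 0.029 is taken from the earlier lecture):

import numpy as np

def split_info(sizes):
    # SplitInfo_A(D) = -sum (|Dj|/|D|) log2(|Dj|/|D|)
    p = np.asarray(sizes, float) / sum(sizes)
    return -(p * np.log2(p)).sum()

si = split_info([4, 6, 4])
print(round(si, 3))          # ~1.557
print(round(0.029 / si, 3))  # GainRatio(income) ~ 0.019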
Gain Ratio example
7
Calculate Entropy for high
• High :
High Class: buys computer
Yes 2
No 2
8
Calculate Entropy for ‘medium’
• Medium:
Medium Class: buys computer
Yes 4
No 2
9
Calculate Entropy for ‘low’
• Low :
10
Calculate Entropy for buying class D
• Calculate the information: Info(D) = −p_y log2(p_y) − p_n log2(p_n)
• where p_y is the probability of 'yes' and p_n is the probability of 'no'
11
Gain of income
12
Gain-Ratio(income)
13
Interpretation
• Further, we calculate the same for the remaining 3 criteria (age, student,
credit rating)
• The one with the maximum gain ratio value will result in the maximum
reduction in impurity of the tuples in D and is returned as the splitting
criterion
14
15
Decision tree using Gini index
Han, J., Pei, J. and Kamber, M., 2011. Data mining: concepts and
techniques. Elsevier.
16
Example
17
Calculation of Gini(D)
• We first use the equation for the Gini index to compute the impurity of D:
Gini(D) = 1 − (9/14)² − (5/14)² = 0.459
18
Gini index for income attribute
19
Gini index for income attribute
20
Tuples in partition D1
• Low + Medium:
Low + Medium   Class: buys_computer
Yes            3 + 4 = 7
No             1 + 2 = 3
21
Tuples in partition D2
• High :
High Class: buys computer
Yes 2
No 2
22
Gini index for income attribute
23
Gini index for income attribute
24
Tuples in partition D1
• High + Medium:
High + Medium   Class: buys_computer
Yes             2 + 4 = 6
No              2 + 2 = 4
25
Tuples in partition D2
• Low :
26
Gini index for income attribute
27
Gini index for income attribute
28
Tuples in partition D1
• High + Low:
High + Low   Class: buys_computer
Yes          2 + 3 = 5
No           2 + 1 = 3
29
Tuples in partition D2
• Medium:
30
Gini index for income attribute
31
Gini Index values
32
Interpretation
• The best binary split for attribute income is on {medium, low} (or,
equivalently, {high}) because it minimizes the Gini index
• The splitting subset {medium, low} therefore gives the minimum Gini index
for attribute income
• Reduction in impurity = 0.459 − 0.443 = 0.016
• Further, we calculate the same for the remaining 3 criteria (age, student,
credit rating)
• The one with the minimum Gini index value will result in the maximum
reduction in impurity of the tuples in D and is returned as the splitting
criterion
33
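The worked numbers above can be checked with a short sketch:

import numpy as np

def gini(counts):
    # Gini(D) = 1 - sum p_i^2 over the class counts
    p = np.asarray(counts, float)
    p = p / p.sum()
    return 1 - np.sum(p ** 2)

print(round(gini([9, 5]), 3))  # Gini(D) = 0.459

# binary split on income: D1 = {low, medium} (7 yes, 3 no), D2 = {high} (2 yes, 2 no)
g = 10 / 14 * gini([7, 3]) + 4 / 14 * gini([2, 2])
print(round(g, 3))                 # 0.443
print(round(gini([9, 5]) - g, 3))  # reduction in impurity = 0.016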
34
Thank You
35
Classification and Regression Trees (CART – III)
Dr A. RAMESH
DEPARTMENT OF MANAGEMENT STUDIES
1
Agenda
2
Example
Problem Description-
Han, J., Pei, J. and Kamber, M., 2011. Data mining: concepts and
techniques. Elsevier.
3
Import Relevant Libraries and Loading Data File
4
Methods used in Data Encoding
• fit_transform(): this method fits the label encoder and returns the
encoded labels
5
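A minimal sketch of this encoding step (the data-frame contents and column names are assumptions; the slide's own file is not shown):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "age": ["youth", "youth", "middle_aged", "senior"],
    "student": ["no", "yes", "no", "yes"],
    "buys_computer": ["no", "no", "yes", "yes"],
})

le = LabelEncoder()
for col in df.columns:
    # fit_transform() fits the encoder on the column and returns integer codes
    df[col] = le.fit_transform(df[col])
print(df)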
Data Encoding Procedure
6
Data Encoding
7
Structuring Dataframe
8
Independent and Dependent Variables Selection
9
Build the Decision Tree Model without Splitting
10
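A stand-in for the screenshot (the 14 buys_computer tuples are encoded here with the coding scheme given later in this lecture, so the exact integers are an assumption):

from sklearn.tree import DecisionTreeClassifier

# columns = [age, income, student, credit_rating]
X = [[2, 0, 0, 1], [2, 0, 0, 0], [0, 0, 0, 1], [1, 2, 0, 1],
     [1, 1, 1, 1], [1, 1, 1, 0], [0, 1, 1, 0], [2, 2, 0, 1],
     [2, 1, 1, 1], [1, 2, 1, 1], [2, 2, 1, 0], [0, 2, 0, 0],
     [0, 0, 1, 1], [1, 2, 0, 0]]
y = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0]  # buys_computer

clf = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
print(clf.get_depth(), clf.score(X, y))  # fitted on all 14 tuples, no split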
Visualizing Decision Tree
11
Decision Tree Visualization
12
Interpretation of the CART Output
13
Calculation of Gini(D)
• We first use the following Equation for Gini index to compute the impurity
of D:
14
Income Attribute
15
Tuples in partition D1
• Low + Medium:
Low + Class: buys computer
Medium
Yes 3+4 =7
No 1+ 2 = 3
16
Tuples in partition D2
• High :
High Class: buys computer
Yes 2
No 2
17
Gini index for income attribute
18
Gini index for income attribute
19
Gini index for income attribute
20
Gini index for income attribute
• Gini_income ∈ {low, medium}(D) = Gini_income ∈ {high}(D) = 0.443
• Gini_income ∈ {high, medium}(D) = Gini_income ∈ {low}(D) = 0.450
• Gini_income ∈ {high, low}(D) = Gini_income ∈ {medium}(D) = 0.458
21
Gini index for Age attribute
22
Gini index for student attribute
23
Gini index for credit_rating attribute
24
Choosing the root node
The attribute with minimum Gini score will be taken, i.e. Age (Gini Age ∈{Youth, Senior} =
0.357 = Gini Age∈{middle_aged} )
25
Gini index for different attributes for sample of 10
• After separating the 4 samples belonging to middle_aged, 10 samples remain:
26
Gini index for different attributes for sample of 10
27
Drawing cart
[Partial tree: root node 'Age'; the middle_aged branch ends in a 'yes' leaf; the youth/senior branch splits on 'Student', with both child nodes still undetermined]
28
For branch Student = No
• Omit the marked rows (data entries) belonging either to Age =
middle_aged or to Student = Yes
• A total of 5 rows remain
29
Gini index for different attributes For branch Student = No
30
Drawing cart
[Partial tree: root 'Age'; middle_aged → leaf 'yes'; youth/senior → 'Student'; the Student = No child is now split on 'Age', while the Student = Yes child is still undetermined]
31
For branch Student = Yes
• Omit the marked rows (data entries) belonging either to Age =
middle_aged or to Student = No
• A total of 5 rows remain
32
Gini index for different attributes for branch Student = Yes
33
Drawing cart
[Tree: root 'Age'; middle_aged → leaf 'yes'; youth/senior → 'Student'; Student = Yes → 'Credit_rating'; Student = No → 'Age']
34
Coding scheme
Age: Youth = 2, Middle Age = 0, Senior = 1
Student: Yes = 1, No = 0
Income: High = 0, Low = 1, Medium = 2
Credit rating: Fair = 1, Excellent = 0
Buys computer (class): Yes = 1, No = 0
35
Decision tree
• Repeat the splitting process until we obtain all the leaf nodes; the final output:
[Figure: final decision tree (decision classifier), annotated with the values of the dependent variable (yes/no counts) and the sample size at each node, with branch labels such as middle_aged vs youth/senior and excellent vs fair]
36
Splitting Dataset
37
Build the Decision Tree Model
38
Evaluating the Model
39
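A sketch of the split-fit-evaluate sequence (the 70/30 split and the random seeds are assumptions; the encoded data are the same as in the earlier sketch):

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# columns = [age, income, student, credit_rating]; y = buys_computer
X = [[2, 0, 0, 1], [2, 0, 0, 0], [0, 0, 0, 1], [1, 2, 0, 1],
     [1, 1, 1, 1], [1, 1, 1, 0], [0, 1, 1, 0], [2, 2, 0, 1],
     [2, 1, 1, 1], [1, 2, 1, 1], [2, 2, 1, 0], [0, 2, 0, 0],
     [0, 0, 1, 1], [1, 2, 0, 0]]
y = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=1)
clf = DecisionTreeClassifier(criterion="gini", random_state=0)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))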
Visualizing Decision Tree
40
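One way to reproduce the visualization shown on the slide (continuing the previous sketch's clf; the feature and class names are assumptions):

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(10, 6))
plot_tree(clf, feature_names=["age", "income", "student", "credit_rating"],
          class_names=["no", "yes"], filled=True)
plt.show()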
Decision Tree Visualization
41
Thank You
42
Hierarchical method of clustering- II
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES
1
Agenda
2
Example for Hierarchical Agglomerative Clustering (HAC)
• A data set consisting of seven objects for which two variables were
measured.
Object Variable 1 Variable 2
1 2.00 2.00
2 5.50 4.00
3 5.00 5.00
4 1.50 2.50
5 1.00 1.00
6 7.00 5.00
7 5.75 6.50
3
Scatter plot
4
5
Distance (1,5)
(2.00, 2.00), (1.00, 1.00): √((1.00 − 2.00)² + (1.00 − 2.00)²) = 1.41
Distance (1,6)
(2.00, 2.00), (7.00, 5.00): √((7.00 − 2.00)² + (5.00 − 2.00)²) = 5.83
Distance (1,7)
(2.00, 2.00), (5.75, 6.50): √((5.75 − 2.00)² + (6.50 − 2.00)²) = 5.86
6
Distance (2,3)
(5.50, 4.00), (5.00, 5.00): √((5.00 − 5.50)² + (5.00 − 4.00)²) = 1.12
Distance (2,4)
(5.50, 4.00), (1.50, 2.50): √((1.50 − 5.50)² + (2.50 − 4.00)²) = 4.27
Distance (2,5)
(5.50, 4.00), (1.00, 1.00): √((1.00 − 5.50)² + (1.00 − 4.00)²) = 5.41
Distance (2,6)
(5.50, 4.00), (7.00, 5.00): √((7.00 − 5.50)² + (5.00 − 4.00)²) = 1.80
7
Distance (2,7)
(5.50, 4.00), (5.75, 6.50): √((5.75 − 5.50)² + (6.50 − 4.00)²) = 2.51
Distance (3,4)
(5.00, 5.00), (1.50, 2.50): √((1.50 − 5.00)² + (2.50 − 5.00)²) = 4.30
Distance (3,5)
(5.00, 5.00), (1.00, 1.00): √((1.00 − 5.00)² + (1.00 − 5.00)²) = 5.66
Distance (3,6)
(5.00, 5.00), (7.00, 5.00): √((7.00 − 5.00)² + (5.00 − 5.00)²) = 2.00
8
Distance (3,7)
(5.00, 5.00), (5.75, 6.50): √((5.75 − 5.00)² + (6.50 − 5.00)²) = 1.68
Distance (4,5)
(1.50, 2.50), (1.00, 1.00): √((1.00 − 1.50)² + (1.00 − 2.50)²) = 1.58
Distance (4,6)
(1.50, 2.50), (7.00, 5.00): √((7.00 − 1.50)² + (5.00 − 2.50)²) = 6.04
Distance (4,7)
(1.50, 2.50), (5.75, 6.50): √((5.75 − 1.50)² + (6.50 − 2.50)²) = 5.84
9
Distance (5,6)
(1.00, 1.00), (7.00, 5.00): √((7.00 − 1.00)² + (5.00 − 1.00)²) = 7.21
Distance (5,7)
(1.00, 1.00), (5.75, 6.50): √((5.75 − 1.00)² + (6.50 − 1.00)²) = 7.27
Distance (6,7)
(7.00, 5.00), (5.75, 6.50): √((5.75 − 7.00)² + (6.50 − 5.00)²) = 1.95
10
Distance Matrix
• The distance matrix is-
1 2 3 4 5 6 7
1 0.0
2 4.0 0.0
3 4.2 1.1 0.0
4 0.7 4.3 4.3 0.0
5 1.4 5.4 5.7 1.6 0.0
6 5.8 1.8 2.0 6.0 7.2 0.0
7 5.9 2.5 1.7 5.8 7.3 2.0 0.0
11
Example for HAC
• Select the minimum element to build the first cluster: the smallest entry is d(1,4) = 0.7, so objects 1 and 4 are merged first
1 2 3 4 5 6 7
1 0.0
2 4.0 0.0
3 4.2 1.1 0.0
4 0.7 4.3 4.3 0.0
5 1.4 5.4 5.7 1.6 0.0
6 5.8 1.8 2.0 6.0 7.2 0.0
7 5.9 2.5 1.7 5.8 7.3 2.0 0.0
12
Example for HAC
13
Example for HAC
14
Example for HAC
15
Example for HAC
16
Example for HAC
17
Example for HAC
• Recalculate distances to update the distance matrix (the minimum distance
between merged members is used, i.e., single linkage):

      (1,4)  2    3    5    6    7
(1,4) 0.0
2     4.0   0.0
3     4.2   1.1  0.0
5     1.4   5.4  5.7  0.0
6     5.8   1.8  2.0  7.2  0.0
7     5.8   2.5  1.7  7.3  2.0  0.0
18
Example for HAC
19
Example for HAC
20
Example for HAC
21
Example for HAC
22
Example for HAC
1,4,5 2,3 6 7
1,4,5 0.0
2,3 4.0 0.0
6 5.8 1.8 0.0
7 5.8 1.7 2.0 0.0
23
Example for HAC
• The minimum remaining entry is d((2,3), 7) = 1.7, so cluster {2,3} and object 7 are merged next
24
Example for HAC
25
Example for HAC
26
Example for HAC
1,4,5 2,3,7 6
1,4,5 0.0
2,3,7 4.0 0.0
6 5.8 1.8 0.0
27
Example for HAC
• The minimum remaining entry is d((2,3,7), 6) = 1.8, so object 6 is merged into {2,3,7}
28
Example for HAC
29
Example for HAC
30
Example for HAC
          (1,4,5)  (2,3,7,6)
(1,4,5)   0.0
(2,3,7,6) 4.0      0.0
• The final merge joins {1,4,5} and {2,3,7,6} at distance 4.0, completing the hierarchy
31
Example for HAC
32
Python demo for HAC
33
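The demo appears as screenshots; a SciPy stand-in that reproduces the merge order derived by hand (single linkage is assumed, which matches the minimum-distance updates above):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[2.00, 2.00], [5.50, 4.00], [5.00, 5.00], [1.50, 2.50],
              [1.00, 1.00], [7.00, 5.00], [5.75, 6.50]])

Z = linkage(X, method="single", metric="euclidean")
print(np.round(Z, 2))  # merges: (1,4), (2,3), +5, +7, +6, then all at 4.0

dendrogram(Z, labels=[str(i) for i in range(1, 8)])
plt.show()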
Python demo for HAC
34
Python demo for HAC
35
Python demo for HAC
36
Python demo for HAC
37
THANK YOU
38