Data Analytics TB


Lecture 4: Central Tendency and Dispersion

Dr. A. Ramesh
Department of Management Studies

1
Lecture objectives

• Central tendency
• Measures of Dispersion

2
Measures of Central Tendency

• Measures of central tendency yield information about “particular places or locations in a group of numbers”
• A single number to describe the characteristics of a set of data

3
Summary statistics

• Central tendency or measures of location
  – Arithmetic mean
  – Weighted mean
  – Median
  – Percentile
• Dispersion
  – Skewness
  – Kurtosis
  – Range
  – Interquartile range
  – Variance
  – Standard score
  – Coefficient of variation

4
Arithmetic Mean
• Commonly called ‘the mean’
• It is the average of a group of numbers
• Applicable for interval and ratio data
• Not applicable for nominal or ordinal data
• Affected by each value in the data set, including extreme values
• Computed by summing all values in the data set and dividing the sum by
the number of values in the data set

5
Population Mean

\mu = \frac{\sum X}{N} = \frac{X_1 + X_2 + X_3 + \cdots + X_N}{N} = \frac{24 + 13 + 19 + 26 + 11}{5} = \frac{93}{5} = 18.6

6
Sample Mean

\bar{X} = \frac{\sum X}{n} = \frac{X_1 + X_2 + X_3 + \cdots + X_n}{n} = \frac{57 + 86 + 42 + 38 + 90 + 66}{6} = \frac{379}{6} = 63.167

7
Mean of Grouped Data
• Weighted average of class midpoints
• Class frequencies are the weights

\mu = \frac{\sum fM}{\sum f} = \frac{\sum fM}{N} = \frac{f_1 M_1 + f_2 M_2 + f_3 M_3 + \cdots + f_i M_i}{f_1 + f_2 + f_3 + \cdots + f_i}

8
Calculation of Grouped Mean
Class Interval Frequency(f) Class Midpoint(M) fM
20-under 30 6 25 150
30-under 40 18 35 630
40-under 50 11 45 495
50-under 60 11 55 605
60-under 70 3 65 195
70-under 80 1 75 75
50 2150
\mu = \frac{\sum fM}{\sum f} = \frac{2150}{50} = 43.0

9
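A quick way to reproduce this calculation in Python, sketched with pandas (the column names here are our own):

```python
import pandas as pd

# Class midpoints (M) and frequencies (f) from the table above
grouped = pd.DataFrame({
    "midpoint": [25, 35, 45, 55, 65, 75],
    "freq":     [6, 18, 11, 11, 3, 1],
})

# Grouped mean = sum(f*M) / sum(f)
mean = (grouped["freq"] * grouped["midpoint"]).sum() / grouped["freq"].sum()
print(mean)  # 43.0
```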
Weighted Average

• Sometimes we wish to average numbers, but we want to assign more importance, or weight, to some of the numbers
• The average you need is the weighted average


Formula for Weighted Average
\text{Weighted Average} = \frac{\sum xw}{\sum w}

where x is a data value and w is the weight assigned to that data value; the sum is taken over all data values.
Example
Suppose your midterm test score is 83 and your final exam score is 95.
Using weights of 40% for the midterm and 60% for the final exam, compute
the weighted average of your scores. If the minimum average for an A is
90, will you earn an A?

\text{Weighted Average} = \frac{83(0.40) + 95(0.60)}{0.40 + 0.60} = \frac{33.2 + 57}{1} = 90.2

You will earn an A!
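The same result with NumPy, which handles the division by the sum of the weights (a minimal sketch):

```python
import numpy as np

scores  = [83, 95]        # midterm, final
weights = [0.40, 0.60]

# np.average divides the weighted sum by the sum of the weights
print(np.average(scores, weights=weights))  # 90.2
```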
Median
• Middle value in an ordered array of numbers

• Applicable for ordinal, interval, and ratio data

• Not applicable for nominal data

• Unaffected by extremely large and extremely small values

13
Median: Computational Procedure
• First Procedure
– Arrange the observations in an ordered array
– If there is an odd number of terms, the median is the middle term of the
ordered array
– If there is an even number of terms, the median is the average of the
middle two terms
• Second Procedure
– The median’s position in an ordered array is given by (n+1)/2.

14
Median: Example with an Odd Number of Terms

Ordered Array
3 4 5 7 8 9 11 14 15 16 16 17 19 19 20 21 22
• There are 17 terms in the ordered array.
• Position of median = (n+1)/2 = (17+1)/2 = 9
• The median is the 9th term, 15.
• If the 22 is replaced by 100, the median is 15.
• If the 3 is replaced by -103, the median is 15.

15
Median: Example with an Even Number of Terms
Ordered Array
3 4 5 7 8 9 11 14 15 16 16 17 19 19 20 21

• There are 16 terms in the ordered array

• Position of median = (n+1)/2 = (16+1)/2 = 8.5

• The median is between the 8th and 9th terms, 14.5

• If the 21 is replaced by 100, the median is 14.5

• If the 3 is replaced by -88, the median is 14.5

16
Median of Grouped Data
\text{Median} = L + \frac{\dfrac{N}{2} - cf_p}{f_{med}} \cdot W

where:
L = the lower limit of the median class
cf_p = cumulative frequency of the class preceding the median class
f_{med} = frequency of the median class
W = width of the median class
N = total of frequencies

17
Median of Grouped Data -- Example
Class Interval   Frequency   Cumulative Frequency
20-under 30      6           6
30-under 40      18          24
40-under 50      11          35
50-under 60      11          46
60-under 70      3           49
70-under 80      1           50
N = 50

M_d = L + \frac{\dfrac{N}{2} - cf_p}{f_{med}} \cdot W = 40 + \frac{\dfrac{50}{2} - 24}{11} \cdot 10 = 40.909

18
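The grouped-median formula translates directly into Python (a sketch using the values read off the table above):

```python
# Grouped median: Md = L + ((N/2 - cfp) / fmed) * W
L, W = 40, 10          # lower limit and width of the median class (40-under 50)
N = 50                 # total frequency
cfp = 24               # cumulative frequency of the class preceding the median class
fmed = 11              # frequency of the median class

md = L + ((N / 2 - cfp) / fmed) * W
print(round(md, 3))    # 40.909
```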
Mode

• The most frequently occurring value in a data set

• Applicable to all levels of data measurement (nominal, ordinal, interval,


and ratio)

• Bimodal -- Data sets that have two modes

• Multimodal -- Data sets that contain more than two modes

19
Mode -- Example

• The mode is 44
• There are more 44s than any other value

35 41 44 45
37 41 44 46
37 43 44 46
39 43 44 46
40 43 44 46
40 43 45 48

20
Mode of Grouped Data
• Midpoint of the modal class
• Modal class has the greatest frequency

Class Interval   Frequency
20-under 30      6
30-under 40      18
40-under 50      11
50-under 60      11
60-under 70      3
70-under 80      1

\text{Mode} = L_{Mo} + \left(\frac{d_1}{d_1 + d_2}\right) w = 30 + \left(\frac{12}{12 + 7}\right) 10 = 36.31

21
Percentiles
• Measures of central tendency that divide a group of data into 100 parts

• Example: 90th percentile indicates that at most 90% of the data lie
below it, and at least 10% of the data lie above it

• The median and the 50th percentile have the same value

• Applicable for ordinal, interval, and ratio data

• Not applicable for nominal data

23
Percentiles: Computational Procedure
• Organize the data into an ascending ordered array
• Calculate the Pth percentile location: i = \frac{P}{100}\, n
• Determine the percentile’s location and its value.

• If i is a whole number, the percentile is the average of the values at the i and (i+1) positions
• If i is not a whole number, the percentile is at the position given by the whole-number part of (i + 1) in the ordered array

24
Percentiles: Example
• Raw Data: 14, 12, 19, 23, 5, 13, 28, 17
• Ordered Array: 5, 12, 13, 14, 17, 19, 23, 28
• Location of 30th percentile: i = \frac{30}{100}(8) = 2.4

• The location index, i, is not a whole number; i + 1 = 2.4 + 1 = 3.4; the whole-number portion is 3; the 30th percentile is at the 3rd location of the array; the 30th percentile is 13.

25
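The textbook rule can be implemented directly in Python; note that np.percentile's default linear interpolation gives a slightly different answer (a sketch, assuming NumPy):

```python
import numpy as np

data = [14, 12, 19, 23, 5, 13, 28, 17]

def percentile_textbook(x, p):
    """Textbook rule: i = (p/100)*n; if i is whole, average the values at
    positions i and i+1; otherwise take position int(i)+1 (1-based)."""
    x = np.sort(x)
    i = p / 100 * len(x)
    if i == int(i):
        return (x[int(i) - 1] + x[int(i)]) / 2
    return x[int(i)]   # 0-based index int(i) == 1-based position int(i)+1

print(percentile_textbook(data, 30))   # 13
print(np.percentile(data, 30))         # 13.1 -- numpy's default linear interpolation
```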
Dispersion

• Measures of variability describe the spread or the dispersion of a set of data
• Reliability of measure of central tendency
• To compare dispersion of various samples

26
Variability

[Figure: two cash-flow distributions with the same mean, one with no variability and one with variability]

27
Measures of Variability or dispersion
Common Measures of Variability
• Range
• Inter-quartile range
• Mean Absolute Deviation
• Variance
• Standard Deviation
• Z scores
• Coefficient of Variation

28
Range – ungrouped data

• The difference between the largest and the smallest values in a set of data
• Simple to compute
• Ignores all data points except the two extremes
• Example (for the data below): Range = Largest − Smallest = 48 − 35 = 13

35 41 44 45
37 41 44 46
37 43 44 46
39 43 44 46
40 43 44 46
40 43 45 48

29
Quartiles
• Measures of central tendency that divide a group of data into four subgroups

• Q1: 25% of the data set is below the first quartile

• Q2: 50% of the data set is below the second quartile

• Q3: 75% of the data set is below the third quartile

• Q1 is equal to the 25th percentile

• Q2 is located at 50th percentile and equals the median

• Q3 is equal to the 75th percentile

• Quartile values are not necessarily members of the data set

30
Quartiles

Q1 Q2 Q3

25% 25% 25% 25%

31
Quartiles: Example
• Ordered array: 106, 109, 114, 116, 121, 122, 125, 129
• Q1: i = \frac{25}{100}(8) = 2, \quad Q_1 = \frac{109 + 114}{2} = 111.5

• Q2: i = \frac{50}{100}(8) = 4, \quad Q_2 = \frac{116 + 121}{2} = 118.5

• Q3: i = \frac{75}{100}(8) = 6, \quad Q_3 = \frac{122 + 125}{2} = 123.5

32
Interquartile Range

• Range of values between the first and third quartiles


• Range of the “middle half”
• Less influenced by extremes

\text{Interquartile Range} = Q_3 - Q_1

33
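A sketch of the quartile and IQR computation with NumPy (the method= keyword needs NumPy 1.22+; "midpoint" happens to match the slide's rule for this data set):

```python
import numpy as np

data = [106, 109, 114, 116, 121, 122, 125, 129]

# "midpoint" averages the two middle values, matching the slide here
q1, q2, q3 = np.percentile(data, [25, 50, 75], method="midpoint")
print(q1, q2, q3)        # 111.5 118.5 123.5
print("IQR =", q3 - q1)  # 12.0
```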
Deviation from the Mean

• Data set: 5, 9, 16, 17, 18


• Mean: \mu = \frac{\sum X}{N} = \frac{65}{5} = 13
• Deviations from the mean: -8, -4, 3, 4, 5

[Figure: the deviations −8, −4, +3, +4, +5 shown on a number line from 0 to 20, centered on the mean of 13]


34
Mean Absolute Deviation

• Average of the absolute deviations from the mean

X     X − μ   |X − μ|
5     −8      +8
9     −4      +4
16    +3      +3
17    +4      +4
18    +5      +5
      0       24

M.A.D. = \frac{\sum |X - \mu|}{N} = \frac{24}{5} = 4.8

35
Population Variance
• Average of the squared deviations from the arithmetic mean

X     X − μ   (X − μ)²
5     −8      64
9     −4      16
16    +3      9
17    +4      16
18    +5      25
      0       130

\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{130}{5} = 26.0

36
Population Standard Deviation
• Square root of the variance

X     X − μ   (X − μ)²
5     −8      64
9     −4      16
16    +3      9
17    +4      16
18    +5      25
      0       130

\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{130}{5} = 26.0

\sigma = \sqrt{\sigma^2} = \sqrt{26.0} = 5.1

37
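A minimal NumPy sketch of the last few slides (deviations, M.A.D., population variance and standard deviation) on the same data:

```python
import numpy as np

x = np.array([5, 9, 16, 17, 18])
mu = x.mean()                     # 13.0

mad = np.abs(x - mu).mean()       # 24/5 = 4.8
var = ((x - mu) ** 2).mean()      # 130/5 = 26.0 (divide by N: population)
std = np.sqrt(var)                # ~5.1

print(mad, var, std)              # 4.8 26.0 5.099...
```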
Sample Variance
• Average of the squared deviations from the arithmetic mean

X       X − X̄    (X − X̄)²
2,398   625      390,625
1,844   71       5,041
1,539   −234     54,756
1,311   −462     213,444
7,092   0        663,866

S^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{663{,}866}{3} = 221{,}288.67

38
Sample Standard Deviation
• Square root of the sample variance

X       X − X̄    (X − X̄)²
2,398   625      390,625
1,844   71       5,041
1,539   −234     54,756
1,311   −462     213,444
7,092   0        663,866

S^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{663{,}866}{3} = 221{,}288.67

S = \sqrt{S^2} = \sqrt{221{,}288.67} = 470.41

39
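The sample versions only change the divisor; NumPy exposes this through the ddof argument (a sketch on the same data):

```python
import numpy as np

x = np.array([2398, 1844, 1539, 1311])

# ddof=1 divides by n-1 (sample); ddof=0 (the default) divides by N (population)
print(x.var(ddof=1))   # 221288.666...
print(x.std(ddof=1))   # 470.41...
```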
Uses of Standard Deviation
• Indicator of financial risk
• Quality Control
– construction of quality control charts
– process capability studies
• Comparing populations
– household incomes in two cities
– employee absenteeism at two plants

40
Standard Deviation as an Indicator of Financial Risk

Annualized Rate of Return

Financial Security   μ     σ
A                    15%   3%
B                    15%   7%

41
Lecture 5: Central Tendency and Dispersion- II

Dr. A. Ramesh
Department of Management Studies

1
The Empirical Rule… If the histogram is bell shaped

• Approximately 68% of all observations fall within one standard deviation of the mean
• Approximately 95% of all observations fall within two standard deviations of the mean
• Approximately 99.7% of all observations fall within three standard deviations of the mean

2
Empirical Rule

• Data are normally distributed (or approximately normal)

Distance from the Mean   Percentage of Values Falling Within Distance
μ ± 1σ                   68
μ ± 2σ                   95
μ ± 3σ                   99.7
3
Chebysheff’s Theorem…Not often used because interval is very wide.

• A more general interpretation of the standard deviation is derived from Chebysheff’s Theorem, which applies to all shapes of histograms (not just bell shaped)
• The proportion of observations in any sample that lie within k standard deviations of the mean is at least 1 − 1/k², for k > 1

For k = 2 (say), the theorem states that at least 3/4 of all observations lie within 2 standard deviations of the mean. This is a “lower bound” compared to the Empirical Rule’s approximation (95%).

41
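A small simulation illustrating the bound (a sketch; the exponential distribution is our own choice of a non-bell-shaped example):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=100_000)   # decidedly not bell shaped

k = 2
within = np.abs(x - x.mean()) <= k * x.std()
# Chebysheff only guarantees at least 1 - 1/k**2 = 0.75; the observed
# proportion here is around 0.95
print(within.mean())
```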
Coefficient of Variation

• Ratio of the standard deviation to the mean, expressed as a percentage
• Measurement of relative dispersion

C.V. = \frac{\sigma}{\mu} \times 100

5
Coefficient of Variation
\mu_1 = 29, \ \sigma_1 = 4.6: \quad C.V._1 = \frac{\sigma_1}{\mu_1} \times 100 = \frac{4.6}{29} \times 100 = 15.86

\mu_2 = 84, \ \sigma_2 = 10: \quad C.V._2 = \frac{\sigma_2}{\mu_2} \times 100 = \frac{10}{84} \times 100 = 11.90

6
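The same comparison in Python (a sketch with a small helper function of our own):

```python
def cv(std, mean):
    """Coefficient of variation as a percentage."""
    return std / mean * 100

print(round(cv(4.6, 29), 2))   # 15.86
print(round(cv(10, 84), 2))    # 11.9
```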
Variance and Standard Deviation
of Grouped Data

Population:
\sigma^2 = \frac{\sum f (M - \mu)^2}{N}, \qquad \sigma = \sqrt{\sigma^2}

Sample:
S^2 = \frac{\sum f (M - \bar{X})^2}{n - 1}, \qquad S = \sqrt{S^2}
7
Population Variance and Standard Deviation of
Grouped Data (μ = 43)

M M  M 


2 2
Class Interval f M fM f

20-under 30 6 25 150 -18 324 1944


30-under 40 18 35 630 -8 64 1152
40-under 50 11 45 495 2 4 44
50-under 60 11 55 605 12 144 1584
60-under 70 3 65 195 22 484 1452
70-under 80 1 75 75 32 1024 1024
50 2150 7200

M    2
2
 f 7200
 144  12

2
   144
N 50
8
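A NumPy sketch of the grouped-data computation above:

```python
import numpy as np

f = np.array([6, 18, 11, 11, 3, 1])     # class frequencies
M = np.array([25, 35, 45, 55, 65, 75])  # class midpoints

mu = (f * M).sum() / f.sum()            # 43.0
var = (f * (M - mu) ** 2).sum() / f.sum()
print(var, np.sqrt(var))                # 144.0 12.0
```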
Measures of Shape
• Skewness
– Absence of symmetry
– Extreme values in one side of a distribution
• Kurtosis
Peakedness of a distribution
– Leptokurtic: high and thin
– Mesokurtic: normal shape
– Platykurtic: flat and spread out
• Box and Whisker Plots
– Graphic display of a distribution
– Reveals skewness

9
Skewness

[Figure: negatively skewed, symmetric (not skewed), and positively skewed distributions]

10
Skewness..
The skewness of a distribution is measured by comparing the relative positions
of the mean, median and mode.
• Distribution is symmetrical
• Mean = Median = Mode
• Distribution skewed right
• Median lies between mode and mean, and mode is less than mean
• Distribution skewed left
• Median lies between mode and mean, and mode is greater than
mean

11
Skewness

[Figure: negatively skewed: mean < median < mode; symmetric: mean = median = mode; positively skewed: mode < median < mean]

12
Coefficient of Skewness

• Summary measure for skewness

S = \frac{3(\mu - M_d)}{\sigma}

• If S < 0, the distribution is negatively skewed (skewed to the left)

• If S = 0, the distribution is symmetric (not skewed)

• If S > 0, the distribution is positively skewed (skewed to the right)

13
Coefficient of Skewness
\mu_1 = 23, \quad M_{d1} = 26, \quad \sigma_1 = 12.3: \qquad S_1 = \frac{3(\mu_1 - M_{d1})}{\sigma_1} = \frac{3(23 - 26)}{12.3} = -0.73

\mu_2 = 26, \quad M_{d2} = 26, \quad \sigma_2 = 12.3: \qquad S_2 = \frac{3(26 - 26)}{12.3} = 0

\mu_3 = 29, \quad M_{d3} = 26, \quad \sigma_3 = 12.3: \qquad S_3 = \frac{3(29 - 26)}{12.3} = +0.73
14
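Pearson's coefficient of skewness is a one-liner in Python (a sketch with our own helper):

```python
def pearson_skew(mean, median, std):
    """Pearson's coefficient of skewness: S = 3*(mean - median) / std."""
    return 3 * (mean - median) / std

for mu in (23, 26, 29):
    print(round(pearson_skew(mu, 26, 12.3), 2))   # -0.73, 0.0, 0.73
```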
Kurtosis
• Peakedness of a distribution
– Leptokurtic: high and thin
– Mesokurtic: normal in shape
– Platykurtic: flat and spread out

[Figure: leptokurtic (high and thin), mesokurtic (normal shape), and platykurtic (flat and spread out) curves]

15
Box and Whisker Plot

• Five specific values are used:

– Median, Q2

– First quartile, Q1

– Third quartile, Q3

– Minimum value in the data set

– Maximum value in the data set

16
Box and Whisker Plot

Minimum Q1 Q2 Q3 Maximum

17
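A sketch of drawing one with matplotlib (the data re-uses the earlier percentile example):

```python
import matplotlib.pyplot as plt

data = [5, 12, 13, 14, 17, 19, 23, 28]

# The box shows Q1, the median (Q2) and Q3; the whiskers reach toward
# the minimum and maximum values
plt.boxplot(data, vert=False)
plt.title("Box and Whisker Plot")
plt.show()
```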
Skewness: Box and Whisker Plots, and Coefficient of
Skewness
[Figure: box-and-whisker plots with coefficient of skewness S < 0 (negatively skewed), S = 0 (symmetric), and S > 0 (positively skewed)]

18
THANK YOU

19
Data Analytics with Python
Lecture 1: Introduction to data analytics

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT
IIT ROORKEE

1
Objective of the course
• The principal focus of this course is to introduce conceptual understanding using simple and practical examples rather than a repetitive, point-and-click mentality
• This course should make you comfortable using analytics in your career and your life
• You will know how to work with real data; you may learn many different methodologies, but choosing the right methodology is what matters

2
Objective of the course Contd…

• The danger in using quantitative methods does not generally lie in the inability to perform the calculations
• The real threat is a lack of fundamental understanding of:
  – Why to use a particular technique or procedure
  – How to use it correctly, and
  – How to correctly interpret the results

3
Learning objectives
1. Define data and its importance
2. Define data analytics and its types
3. Explain why analytics is important in today’s business environment
4. Explain how statistics, analytics and data science are interrelated
5. Why python?
6. Explain the four different levels of Data:
– Nominal
– Ordinal
– Interval and
– Ratio

4
1. Define Data and its importance

• Variable, Measurement and Data

• What is generating so much data?

• How does data add value to the business?

• Why is data important?

5
1.1 Variable, Measurement and Data

• Variable – a characteristic of any entity being studied that is capable of taking on different values

• Measurement – when a standard process is used to assign numbers to particular attributes or characteristics of a variable

• Data – recorded measurements

6
1.2 What is generating so much data?

• Data can be generated by:
  – Humans,
  – Machines, or
  – Human–machine combinations
• It can be generated anywhere that information is generated and stored, in structured or unstructured formats

7
1.3 How does data add value to business?

[Diagram: from the data warehouse, business value is created along two paths: development of data products (algorithmic solutions in production, marketing and sales, e.g. recommendation engines) and discovery of data insight (quantitative data analysis to help steer strategic business decisions)]

Source: https://datajobs.com/

8
Data Products

9
1.4 Why is data important?

• Data helps in making better decisions
• Data helps in solving problems by finding the reasons for underperformance
• Data helps one evaluate performance
• Data helps one improve processes
• Data helps one understand consumers and the market

10
2. Define data analytics and its types
• Define data analytics

• Why analytics is important?

• Data analysis

• Data analytics vs. Data analysis

• Types of Data analytics

11
2.1. Define data analytics

• Analytics is defined as “the scientific process of transforming data into insights for making better decisions”
• Analytics is the use of data, information technology, statistical analysis, quantitative methods, and mathematical or computer-based models to help managers gain improved insight about their business operations and make better, fact-based decisions – James Evans
• Analysis = Analytics ?

12
2.2 Why analytics is important?

• Opportunity abounds for the use of analytics and big data


such as:
1. Determining credit risk
2. Developing new medicines
3. Finding more efficient ways to deliver products and services
4. Preventing fraud
5. Uncovering cyber threats
6. Retaining the most valuable customers

13
2.3 Data analysis
• Data analysis is the process of examining, transforming, and
arranging raw data in a specific way to generate useful
information from it
• Data analysis allows for the evaluation of data through
analytical and logical reasoning to lead to some sort of
outcome or conclusion in some context
• Data analysis is a multi-faceted process that involves a
number of steps, approaches, and diverse techniques

14
2.4 Data analytics vs. Data analysis: Analysis

Analysis looks at the past: it explains how and why things happened

15
2.4 Data analytics vs. Data analysis: Analytics

Analytics looks at the future: it explores potential future events

16
2.4 Data analytics vs. Data analysis

Analytics = Qualitative (intuition + analysis) + Quantitative (formulas + algorithms)

17
Analysis

Quantitative analysis = qualitative data + the story of how, say, sales decreased last summer: it explains how and why the story ends the way it did

18
Analysis ≠ Analytics
Data Analysis ≠ Data Analytics
Business Analysis ≠ Business Analytics

19
2.5 Classification of Data analytics

Based on the phase of workflow and the kind of analysis required, there are
four major types of data analytics.

• Descriptive analytics

• Diagnostic analytics

• Predictive analytics

• Prescriptive analytics

20
Classification of Data analytics

https://www.governanceanalytics.org/knowledge-
base/Main_Tools/Data_classification_and_analysis

21
Descriptive Analytics
• Descriptive Analytics is the conventional form of Business Intelligence and data analysis
• It seeks to provide a depiction or “summary view” of facts and figures in an understandable format
• It either informs or prepares data for further analysis
• Descriptive analysis or statistics can summarize raw data and convert it into a form that can be easily understood by humans
• It can describe in detail an event that has occurred in the past

22
Example
Common examples of Descriptive Analytics are company reports that simply provide a historic review, like:
• Data Queries
• Reports
• Descriptive Statistics
• Data Visualization
• Data dashboard

Source: https://www.linkedin.com/learning/478e9692-d13d-338f-907e-d76f0724d773

23
Diagnostic analytics

• Diagnostic Analytics is a form of advanced analytics which examines data or content to answer the question “Why did it happen?”
• Diagnostic analytical tools aid an analyst to dig deeper into an issue so that they can arrive at the source of a problem
• In a structured business environment, tools for both descriptive and diagnostic analytics go in parallel

24
Example

• It uses techniques such as:

1. Data Discovery

2. Data Mining

3. Correlations

25
Predictive analytics

• Predictive analytics helps to forecast trends based on current events
• Predicting the probability of an event happening in the future, or estimating the accurate time it will happen, can be determined with the help of predictive analytical models
• Many different but co-dependent variables are analysed to predict a trend in this type of analysis

26
Source: https://www.logianalytics.com/wp-content/uploads/2017/11/predictive-1.png

27
Example

• Set of techniques that use model constructed from past data to predict
the future or ascertain impact of one variable on another:
1. Linear regression
2. Time series analysis and forecasting
3. Data mining

Source: https://bigdata-madesimple.com/5-examples-predictive-analytics-travel-industry/

28
Prescriptive analytics

• Set of techniques to indicate the best course of action


• It tells what decision to make to optimize the outcome
• The goal of prescriptive analytics is to enable:
1. Quality improvements
2. Service enhancements
3. Cost reductions and
4. Increasing productivity

29
Prescriptive analytics: Example

• Optimization Model
• Simulation
• Decision Analysis

30
3. Explain why analytics is important

• Demand for Data Analytics


• Element of data Analytics

31
3. Explain why analytics is important

[Figure: search trends comparing “Data Scientist” against “Statistician, Operations Researcher”]

32
https://timesofindia.indiatimes.com/india/Data-scientists-earning-more-than-
CAs-engineers/articleshow/52171064.cms

33
3.1 Demand for Data Analytics

http://timesofindia.indiatimes.com/articleshow/52171064.cms?utm_source=
contentofinterest&utm_medium=text&utm_campaign=cppst

34
3.2 Element of data Analytics

35
4. Data analyst and Data scientist

• The requisite skill set

• Difference between Data analyst and Data Scientist

36
4.1 The requisite skill set

[Venn diagram: Data Science sits at the intersection of mathematics expertise, technology/hacking skills, and business & strategy acumen]
37
4.2 Difference between Data analyst and Data Scientist

[Diagram: a spectrum from Business Administration to Data Science. The Analyst carries domain-specific responsibility (e.g. marketing analyst, financial analyst); both roles share data exploration, analysis and insight; the Data Scientist adds advanced algorithms, machine learning, and data product engineering]

Source:https://datajobs.com/

40
5. Why python?

Features
• Simple and easy to learn
• Freeware and Open source
• Interpreted
• Dynamically Typed
• Extensible
• Embedded
• Extensive library

41
5. Why python?

Usability
• Desktop and web applications
• Database applications
• Networking applications
• Data analysis (Data Science)
• Machine learning
• IoT and AI applications
• Games

42
Companies using Python

43
Why Jupyter NoteBook?

Why?
• Client – Server Application
• Edit code on web browser
• Easy in documentation
• Easy in demonstration
• User- friendly Interface

44
6. Explain the four different levels of Data
• Types of Variables
• Levels of Data Measurement
• Compare the four different levels of Data:
Nominal
Ordinal
Interval and
Ratio
• Usage Potential of Various Levels of Data
• Data Level, Operations, and Statistical Methods

45
6.1 Types of Variables

Data
├─ Categorical (defined categories). Examples: Marital Status, Political Party, Eye Color
└─ Numerical
   ├─ Discrete (counted items). Examples: Number of Children, Defects per hour
   └─ Continuous (measured characteristics). Examples: Weight, Voltage
6.2 Levels of Data Measurement

• Nominal — Lowest level of measurement


• Ordinal
• Interval
• Ratio — Highest level of measurement

47
6.3.1 Nominal

• A nominal scale classifies data into distinct categories in which no ranking is implied
• Example: Gender, Marital Status

48
6.3.2 Ordinal scale

• An ordinal scale classifies data into distinct categories in which ranking is implied
• Example:
– Product satisfaction  Satisfied, Neutral, Unsatisfied
– Faculty rank  Professor, Associate Professor, Assistant Professor
– Student Grades  A, B, C, D, F

49
6.3.3. Interval scale

• An interval scale is an ordered scale in which the difference between measurements is a meaningful quantity but the measurements do not have a true zero point.
• Example
– Temperature in Fahrenheit and Celsius
– Year

50
6.3.4 Ratio scale

• A ratio scale is an ordered scale in which the difference between the measurements is a meaningful quantity and the measurements have a true zero point.
• Example
– Weight
– Age
– Salary

51
6.4 Usage Potential of Various
Levels of Data
Ratio
Interval
Ordinal

Nominal

52
6.5 Impact of choice of measurement scale

Data Level   Meaningful Operations                                    Statistical Methods
Nominal      Classifying and Counting                                 Nonparametric
Ordinal      All of the above plus Ranking                            Nonparametric
Interval     All of the above plus Addition, Subtraction              Parametric
Ratio        All of the above plus Multiplication and Division        Parametric

53
Thank You

54
Welcome to
TA Live Session 1

NPTEL | DATA ANALYTICS


WITH PYTHON
29-01-2022
Ritwiz Kamal
PhD (Prime Minister’s Research Fellow), CSE, IIT Madras
Ritwiz Kamal | IIT Madras 1
Let’s Get Started …

Ritwiz Kamal | IIT Madras 2


Question 1

Ritwiz Kamal | IIT Madras 3


Question 1

http://makemeanalyst.com/explore-your-data-range-interquartile-range-and-box-plot/

Ritwiz Kamal | IIT Madras 4


Question 2

Ritwiz Kamal | IIT Madras 5


Question 2

[1, 2, 3, 4, 5, [6, 7, 8, 9]]

[1, 2, 3, 4, 5, 6, 7, 8, 9]

Ritwiz Kamal | IIT Madras 6
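The quiz slide itself is not reproduced here, but the two outputs shown above are consistent with the difference between list.append and list.extend (a sketch):

```python
a = [1, 2, 3, 4, 5]
a.append([6, 7, 8, 9])    # appends the whole list as a single element
print(a)                  # [1, 2, 3, 4, 5, [6, 7, 8, 9]]

b = [1, 2, 3, 4, 5]
b.extend([6, 7, 8, 9])    # appends each element individually
print(b)                  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```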


Question 3

Ritwiz Kamal | IIT Madras 7


Question 3

X = 10        # value of X = 10
X = X + 10    # value of X = 20
X = X - 5     # value of X = 15

Ritwiz Kamal | IIT Madras 8


Question 4

Ritwiz Kamal | IIT Madras 9


Question 4

Are we really modifying


x and y ?

NO !

Ritwiz Kamal | IIT Madras 10


Question 5

Ritwiz Kamal | IIT Madras 11


Question 5

https://www.slideshare.net/amritswaroop1/mbm-106

Ritwiz Kamal | IIT Madras 12


Question 6

Ritwiz Kamal | IIT Madras 13


Question 6

\bar{x} = \frac{\sum x_i}{n}

Ritwiz Kamal | IIT Madras 14


Question 7

Ritwiz Kamal | IIT Madras 15


Question 7

https://www.kindpng.com/imgv/Txxxxb_mu-greek-alphabet-
letter-greek-mu-hd-png/

Ritwiz Kamal | IIT Madras 16


Question 8

Ritwiz Kamal | IIT Madras 17


Question 8

http://makemeanalyst.com/explore-your-data-range-interquartile-
range-and-box-plot/

Ritwiz Kamal | IIT Madras 18


Question 9

Ritwiz Kamal | IIT Madras 19


Question 9

https://www.managedfuturesinvesting.com/what-is-skewness/

Ritwiz Kamal | IIT Madras 20


Question 10

Ritwiz Kamal | IIT Madras 21


Question 10

If 𝜎 = 0, 𝜎 2 = 0

If 𝜎 = 2, 𝜎 2 = 4

If 𝜎 = 0.5, 𝜎 2 = 0.25

Ritwiz Kamal | IIT Madras 22


Acknowledgments
 Prof. A Ramesh | IIT Roorkee
Data Analytics with Python | NPTEL
 NPTEL Team
 PMRF Team
 Department of CSE, IIT Madras

THANK YOU!

Ritwiz Kamal | IIT Madras 23


Data Analytics with Python
Lecture 2: Python – Fundamentals

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT
IIT ROORKEE

1
Learning objectives
1. Installing Python
2. Fundamentals of Python
3. Data Visualisation

2
Python Installation Process

Installation Process –

Step 1: Type https://www.anaconda.com at the address bar of web


browser.
Step 2: Click on download button
Step 3: Download python 3.7 version for windows OS
Step 4: Double click on file to run the application
Step 5: Follow the instructions until completion of installation process

3
Python Installation Process

Installation Process –

Step 1: Type https://www.anaconda.com at the address bar of web browser.

4
Python Installation Process

Step 2: Click on download button

5
Python Installation Process

Step 3: Download python 3.7 version for windows OS

6
Python Installation Process

Step 4: Double click on file to run the application

7
Python Installation Process

8
Python Installation Process

9
Python Installation Process

10
Python Installation Process

11
Python Installation Process

12
Python Installation Process

13
Python Installation Process

14
Python Installation Process

15
Python Installation Process

16
Why Jupyter NoteBook?

Why?
• Edit code on web browser
• Easy in documentation
• Easy in demonstration
• User- friendly Interface

17
Python and Jupyter

Python Programming Language Jupyter Application

Software Package contains both


python and jupyter application

18
19
About Jupyter NoteBook

Cell -> Access using Enter Key

20
About Jupyter NoteBook

Input Field -> Green color indicates edit mode


Blue color indicates command mode

21
About Jupyter NoteBook

-> It contains documentation


-> Text not executed as code

22
About Jupyter Notebook

• Command mode allow to edit notebook as whole


• To close edit mode (Press Escape key)
• Execution (Three ways)
o Ctrl +Enter (Output field can not be modified)
o Shift +Enter (Output field is modified)
o Run button on Jupyter interface

• Comment line is written preceding with # symbol.

23
About Jupyter Notebook

• Important shortcut keys

o A -> To create cell above


o B -> To create cell below
o D + D -> For deleting cell
o M -> For markdown cell
o Y -> For code cell

24
Fundamentals of Python

• Loading a simple delimited data file


• Counting how many rows and columns were loaded
• Determining which type of data was loaded
• Looking at different parts of the data by subsetting rows
and columns

25
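The following slides show screenshots of the code; here is a minimal sketch of the same workflow, assuming the Gapminder file is saved locally as the tab-delimited 'gapminder.tsv':

```python
import pandas as pd

# Load a simple delimited data file
df = pd.read_csv('gapminder.tsv', sep='\t')

print(df.head())     # first 5 rows
print(df.shape)      # (number of rows, number of columns)
print(df.columns)    # column names
print(df.dtypes)     # the dtype of each column
print(df.info())     # fuller summary of the data
```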
26
Loading a simple delimited data file

Data Source: www.github.com/jennybc/gapminder.

27
28
• head method shows us only the first 5 rows

29
Get the number of rows and columns

30
get column names

31
get the dtype of each column

32
Pandas Types Versus Python Types

33
get more information about data

34
Looking at Columns, Rows, and Cells

• # get the country column and save it to its own variable

35
# show the first 5 observations

36
# show the last 5 observations

37
# Looking at country, continent, and year

38
39
Data Analytics with Python
Lecture 3: Python – Fundamentals - II

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT
IIT ROORKEE

1
Looking at Columns, Rows, and Cells

• Subset Rows by Index Label: loc

2
get the first row

• Python counts from 0

3
• # get the 100th row
# Python counts from 0

4
• get the last row

5
Subsetting Multiple Rows

• # select the first, 100th, and 1000th rows

6
Subset Rows by Row Number: iloc

• # get the 2nd row

7
• get the 100th row

8
• # using -1 to get the last row

9
With iloc, we can pass in the -1 to get the last row—something we couldn’t do with loc.

10
• # get the first, 100th, and 1000th rows

11
Subsetting Columns

• The Python slicing syntax uses a colon, :


• If we have just a colon, the attribute refers to everything.
• So, if we just want to get the first column using the loc or iloc syntax,
we can write something like df.loc[:, [columns]] to subset the column(s).

12
• # subset columns with loc
# note the position of the colon
# it is used to select all rows

13
14
• # subset columns with iloc
• # iloc will allow us to use integers
• # -1 will select the last column

15
Subsetting Columns by Range

• # create a range of integers from 0 to 4 inclusive

16
• # subset the dataframe with the range

17
Subsetting Rows and Columns

• # using loc

18
• # using iloc

19
Subsetting Multiple Rows and Columns

• #get the 1st, 100th, and 1000th rows


# from the 1st, 4th, and 6th columns

20
• if we use the column names directly,
# it makes the code a bit easier to read
# note now we have to use loc, instead of iloc

21
22
23
Grouped Means

• # For each year in our data, what was the average life
expectancy?
# To answer this question,
# we need to split our data into parts by year;
# then we get the 'lifeExp' column and calculate the mean

24
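A sketch of the split-apply-combine step described above (again assuming the Gapminder dataframe):

```python
# For each year, the average life expectancy
avg_life_by_year = df.groupby('year')['lifeExp'].mean()
print(avg_life_by_year.head())

# reset_index "flattens" the grouped result back into a dataframe
flat = avg_life_by_year.reset_index()
```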
25
26
• If you need to “flatten” the dataframe, you can use the
reset_index method.

27
Grouped Frequency Counts

• use the nunique to get counts of unique values on a Pandas Series.

28
Basic Plot

29
30
Visual Representation of the Data
• Histogram -- vertical bar chart of frequencies
• Frequency Polygon -- line graph of frequencies
• Ogive -- line graph of cumulative frequencies
• Pie Chart -- proportional representation for categories of a whole
• Stem and Leaf Plot
• Pareto Chart
• Scatter Plot

31
Methods of visual presentation of data

• Table

1st Qtr 2nd Qtr 3rd Qtr 4th Qtr


East 20.4 27.4 90 20.4
West 30.6 38.6 34.6 31.6
North 45.9 46.9 45 43.9

32
Methods of visual presentation of data

• Graphs
[Figure: grouped column chart of East/West/North sales by quarter]

33
Methods of visual presentation of data

• Pie chart

[Figure: pie chart of sales by quarter: 1st Qtr, 2nd Qtr, 3rd Qtr, 4th Qtr]

34
Methods of visual presentation of data
• Multiple bar chart

[Figure: horizontal multiple bar chart of East/West/North sales by quarter]

35
Methods of visual presentation of data

• Simple pictogram

[Figure: simple pictogram of East/West/North sales by quarter]

36
Frequency distributions

• Frequency tables

Observation Table
Class Interval Frequency Cumulative Frequency
< 20 13 13
<40 18 31
<60 25 56
<80 15 71
<100 9 80

37
Frequency diagrams
[Figures: frequency bar chart and cumulative frequency (ogive) chart for the class intervals < 20, < 40, < 60, < 80, < 100]

38
Histogram

Class Interval   Frequency
20-under 30      6
30-under 40      18
40-under 50      11
50-under 60      11
60-under 70      3
70-under 80      1

[Figure: histogram, frequency vs. years]

39
Histogram Construction

Class Interval   Frequency
20-under 30      6
30-under 40      18
40-under 50      11
50-under 60      11
60-under 70      3
70-under 80      1

[Figure: histogram built from the frequency table, frequency vs. years]

40
Frequency Polygon

Class Interval   Frequency
20-under 30      6
30-under 40      18
40-under 50      11
50-under 60      11
60-under 70      3
70-under 80      1

[Figure: frequency polygon, frequency vs. years]

41
Ogive

Class Interval   Cumulative Frequency
20-under 30      6
30-under 40      24
40-under 50      35
50-under 60      46
60-under 70      49
70-under 80      50

[Figure: ogive (line graph of cumulative frequencies) vs. years]

42
Relative Frequency Ogive

Class Interval   Cumulative Relative Frequency
20-under 30      .12
30-under 40      .48
40-under 50      .70
50-under 60      .92
60-under 70      .98
70-under 80      1.00

[Figure: relative frequency ogive vs. years]

43
Pareto Chart
[Figure: Pareto chart of defect causes (Poor Wiring, Short in Coil, Defective Plug, Other) with frequency bars and a cumulative-percentage line]

44
Scatter Plot

Registered Vehicles (1000's)   Gasoline Sales (1000's of Gallons)
5                              60
15                             120
9                              90
15                             140
7                              60

[Figure: scatter plot of gasoline sales vs. registered vehicles]

45
Principles of Excellent Graphs
• The graph should not distort the data
• The graph should not contain unnecessary adornments (sometimes
referred to as chart junk)
• The scale on the vertical axis should begin at zero
• All axes should be properly labeled
• The graph should contain a title
• The simplest possible graph should be used for a given set of data
Graphical Errors: Chart Junk

Bad Presentation  Good Presentation

Minimum Wage: 1960: $1.00, 1970: $1.60, 1980: $3.10, 1990: $3.80

[Figure: the same minimum-wage data drawn with chart junk (bad) and as a plain bar chart (good)]
Graphical Errors:
Compressing the Vertical Axis

Bad Presentation  Good Presentation


[Figure: quarterly sales drawn with a compressed vertical axis (bad) vs. an appropriate scale (good)]
Graphical Errors: No Zero Point on the Vertical Axis

Bad Presentation vs. Good Presentation

[Figure: the first six months of monthly sales drawn without a zero point on the vertical axis (bad) vs. with the axis starting at zero (good)]


Welcome to
TA Live Session 2

NPTEL | DATA ANALYTICS


WITH PYTHON
05-02-2022
Ritwiz Kamal
PhD (Prime Minister’s Research Fellow), CSE, IIT Madras
Ritwiz Kamal | IIT Madras 1
Let’s Get Started …

Ritwiz Kamal | IIT Madras 2


Question 1
1)

Ritwiz Kamal | IIT Madras 3


Question 1
1)

P(‘H’) = P(‘T’) = 0.5

H, T

{H,T}
https://towardsdatascience.com/what-is-expected-value-
4815bdbd84de

Ritwiz Kamal | IIT Madras 4


Question 2
2)

Ritwiz Kamal | IIT Madras 5


Question 2
2)

Sample space for two tosses (1st toss, 2nd toss): {HH, HT, TH, TT}
Ritwiz Kamal | IIT Madras 6
Question 3
3)

Ritwiz Kamal | IIT Madras 7


Question 3
3)

https://www.scribbr.com/statistics/standard-normal-distribution/

Ritwiz Kamal | IIT Madras 8


Question 4

4)

Ritwiz Kamal | IIT Madras 9


Question 4
4)

f(x) = \frac{\lambda^x e^{-\lambda}}{x!}

https://en.wikipedia.org/wiki/Poisson_distribution

Ritwiz Kamal | IIT Madras 10


Question 5
5)

Ritwiz Kamal | IIT Madras 11


Question 5
5)

https://en.wikipedia.org/wiki/Hypergeometric_distribution

Ritwiz Kamal | IIT Madras 12


Question 6
6)A bag contains one ball which could either be GREEN or RED. You
take another red ball and put it in this pouch. You now close your eyes
and pull out a ball from the pouch. It turns out to be red. What is the
probability that the original ball in the pouch was red?

Ritwiz Kamal | IIT Madras 13


Question 6

14
Ritwiz Kamal | IIT Madras
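The answer slide is an image; for reference, the standard Bayes computation (assuming the original ball is equally likely to be green or red) gives:

P(\text{orig red} \mid \text{drew red}) = \frac{P(\text{drew red} \mid \text{orig red})\, P(\text{orig red})}{P(\text{drew red})} = \frac{1 \cdot \frac{1}{2}}{1 \cdot \frac{1}{2} + \frac{1}{2} \cdot \frac{1}{2}} = \frac{2}{3}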
Question 7
7) Suppose you have a biased coin i.e. P(H) ≠ P(T).
How will you use it to make unbiased decision.

Charles Barsotti :https://www.cartoonstock.com/directory/t/toss_a_coin.asp

Ritwiz Kamal | IIT Madras 15


Question 7

16
Ritwiz Kamal | IIT Madras
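The answer slide is an image; the standard approach is von Neumann's trick, sketched here in Python:

```python
import random

def biased_flip(p_heads=0.7):
    return 'H' if random.random() < p_heads else 'T'

def fair_flip():
    """Von Neumann's trick: flip the biased coin twice; HT -> 'H', TH -> 'T',
    HH/TT -> flip again. P(HT) = P(TH) = p*(1-p), so the result is unbiased."""
    while True:
        a, b = biased_flip(), biased_flip()
        if a != b:
            return a

print(fair_flip())
```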
Question 8
8) Suppose you have data about the progression of the number of cases
of Covid-19 in some country during the second wave. What kind of
visual representation would you prefer to use for this data ?
Violin Plot

Pie Chart

Ogive
Box Plot

Ritwiz Kamal | IIT Madras 17


Question 8
8) Suppose you have data about the progression of the number of cases
of Covid-19 in some country during the second wave. What kind of
visual representation would you prefer to use for this data ?

Violin Plot

Pie Chart

Ogive
Box Plot

Ritwiz Kamal | IIT Madras 18


https://en.wikipedia.org/wiki/Ogive_(statistics)
Question 9
9) Suppose you have data about the preference of TV shows among
viewers on an OTT platform. What kind of visual representation would
you prefer to use for this data ?
Violin Plot

Pie Chart

Ogive
Box Plot

Ritwiz Kamal | IIT Madras 19


Question 9
9) Suppose you have data about the preference of TV shows among
viewers on an OTT platform. What kind of visual representation would
you prefer to use for this data ?
Violin Plot TV Show A

Pie Chart
TV Show D
Ogive
Box Plot
TV Show B

*fictional data

TV Show C
Ritwiz Kamal | IIT Madras 20
Question 10
10) Suppose there are 10 rabbits in a race.
Let R1 and R2 be two of the rabbits.
Let A be the event that R1 wins the race.
Let B be the event that R2 wins the race.

Are A and B independent events ? (Assume all rabbits are equally likely to win)

Independent

Not Independent

Ritwiz Kamal | IIT Madras 21


Question 10
10) Suppose there are 10 rabbits in a race.
Let R1 and R2 be two of the rabbits.
Let A be the event that R1 wins the race.
Let B be the event that R2 wins the race.

Are A and B independent events ? (Assume all rabbits are equally likely to win)

Not Independent: P(A) = \frac{1}{10}, \text{ but } P(A \mid B) = 0, \text{ so } P(A) \neq P(A \mid B)

Ritwiz Kamal | IIT Madras 22


Introduction To Probability, 2nd edition, by
Dimitri P. Bertsekas and John N. Tsitsiklis

Ritwiz Kamal | IIT Madras 23


Acknowledgments
 Prof. A Ramesh | IIT Roorkee
Data Analytics with Python | NPTEL
 NPTEL Team
 PMRF Team
 Department of CSE, IIT Madras

THANK YOU!

Ritwiz Kamal | IIT Madras 24


Lecture 6: Introduction to Probability

Dr. A. Ramesh
Department of Management Studies

1
Lecture objectives

• Comprehend the different ways of assigning probability
• Understand and apply marginal, union, joint, and conditional probabilities
• Solve problems using the laws of probability, including the laws of addition, multiplication and conditional probability
• Revise probabilities using Bayes’ rule

2
Probability

• Probability is the numerical measure of the likelihood that an event will occur.

• The probability of any event must be between 0 and 1, inclusively


– 0 ≤ P(A) ≤ 1 for any event A.

• The sum of the probabilities of all mutually exclusive and collectively exhaustive events is 1.
  – P(A) + P(B) + P(C) = 1
  – A, B, and C are mutually exclusive and collectively exhaustive

3
Range of Probability

1 Certain

.5

0 Impossible

4
Methods of Assigning Probabilities

• Classical method of assigning probability (rules and laws)

• Relative frequency of occurrence (cumulated historical data)

• Subjective Probability (personal intuition or reasoning)

5
Classical Probability

• Number of outcomes leading to the event divided by the total number of outcomes possible
• Each outcome is equally likely
• Determined a priori -- before performing the experiment
• Applicable to games of chance
• Objective -- everyone correctly using the method assigns an identical
probability

6
Classical Probability

P(E) = \frac{n_e}{N}

where:
N = total number of outcomes
n_e = number of outcomes in E

7
Relative Frequency Probability

• Based on historical data

• Computed after performing the experiment

• Number of times an event occurred divided by the number of trials

• Objective -- everyone correctly using the method assigns an identical probability

8
Relative Frequency Probability

P( E )  ne
N
Where:
N  total number of trials
n e
 number of outcomes
producing E

9
Subjective Probability

• Comes from a person’s intuition or reasoning


• Subjective -- different individuals may (correctly) assign different numeric
probabilities to the same event
• Degree of belief
• Useful for unique (single-trial) experiments
– New product introduction
– Initial public offering of common stock
– Site selection decisions
– Sporting events

10
Probability - Terminology

• Experiment
• Event
• Elementary Events
• Sample Space
• Unions and Intersections
• Mutually Exclusive Events
• Independent Events
• Collectively Exhaustive Events
• Complementary Events

11
Experiment, Trial, Elementary Event, Event
• Experiment: a process that produces outcomes
– More than one possible outcome
– Only one outcome per trial
• Trial: one repetition of the process
• Elementary Event: cannot be decomposed or broken down into other
events
• Event: an outcome of an experiment
– may be an elementary event, or
– may be an aggregate of elementary events
– usually represented by an uppercase letter, e.g., A, E1

12
An Example Experiment
• Experiment: randomly select, without replacement, two families from the residents of Tiny Town
• Elementary Event: the sample includes families A and C
• Event: each family in the sample has children in the household
• Event: the sample families own a total of four automobiles

Tiny Town Population
Family   Children in Household   Number of Automobiles
A        Yes                     3
B        Yes                     2
C        No                      1
D        Yes                     2

13
Sample Space

• The set of all elementary events for an experiment


• Methods for describing a sample space
– roster or listing
– tree diagram
– set builder notation
– Venn diagram

14
Sample Space: Roster Example

• Experiment: randomly select, without replacement, two families from the residents of Tiny Town
• Each ordered pair in the sample space is an elementary event, for example (D,C)

Family   Children in Household   Number of Automobiles
A        Yes                     3
B        Yes                     2
C        No                      1
D        Yes                     2

Listing of Sample Space:
(A,B), (A,C), (A,D),
(B,A), (B,C), (B,D),
(C,A), (C,B), (C,D),
(D,A), (D,B), (D,C)

15
Sample Space: Tree Diagram for Random Sample of Two
Families

16
Sample Space: Set Notation for Random Sample of Two
Families
• S = {(x,y) | x is the family selected on the first draw, and y is the family
selected on the second draw}
• Concise description of large sample spaces

17
Sample Space
• Useful for discussion of general principles and concepts

Listing of Sample Space


Venn Diagram
(A,B), (A,C), (A,D),
(B,A), (B,C), (B,D),
(C,A), (C,B), (C,D),
(D,A), (D,B), (D,C)

18
Union of Sets

• The union of two sets contains an instance of each element of the two sets.

X = \{1, 4, 7, 9\}, \quad Y = \{2, 3, 4, 5, 6\}: \quad X \cup Y = \{1, 2, 3, 4, 5, 6, 7, 9\}

C = \{IBM, DEC, Apple\}, \quad F = \{Apple, Grape, Lime\}: \quad C \cup F = \{IBM, DEC, Apple, Grape, Lime\}
19
Intersection of Sets

• The intersection of two sets contains only those elements common to the two sets.

X = \{1, 4, 7, 9\}, \quad Y = \{2, 3, 4, 5, 6\}: \quad X \cap Y = \{4\}

C = \{IBM, DEC, Apple\}, \quad F = \{Apple, Grape, Lime\}: \quad C \cap F = \{Apple\}
20
Mutually Exclusive Events
• Events with no common outcomes
• Occurrence of one event precludes the occurrence of the other event

C = \{IBM, DEC, Apple\}, \quad F = \{Grape, Lime\}: \quad C \cap F = \varnothing

X = \{1, 7, 9\}, \quad Y = \{2, 3, 4, 5, 6\}: \quad X \cap Y = \varnothing, \quad P(X \cap Y) = 0
21
Independent Events

• Occurrence of one event does not affect the occurrence or nonoccurrence of the other event
• The conditional probability of X given Y is equal to the marginal probability of X
• The conditional probability of Y given X is equal to the marginal probability of Y

P(X \mid Y) = P(X) \quad \text{and} \quad P(Y \mid X) = P(Y)

22
Collectively Exhaustive Events

• Contains all elementary events for an experiment

E1 E2 E3

Sample Space with three collectively exhaustive events

23
Complementary Events

• All elementary events not in the event ‘A’ are in its complementary event.

P(\text{Sample Space}) = 1

P(\bar{A}) = 1 - P(A)

24
Counting the Possibilities

• mn Rule
• Sampling from a Population with Replacement
• Combinations: Sampling from a Population without Replacement

25
mn Rule

• If an operation can be done m ways and a second operation can be done n ways, then there are mn ways for the two operations to occur in order.
• This rule is easily extended to k stages, with the number of ways equal to n_1 \cdot n_2 \cdot n_3 \cdots n_k
• Example: Toss two coins. The total number of simple events is 2 × 2 = 4

26
Sampling from a Population with Replacement

• A tray contains 1,000 individual tax returns. If 3 returns are randomly


selected with replacement from the tray, how many possible samples are
there?
• (N)n = (1,000)3 = 1,000,000,000

27
Combinations

• A tray contains 1,000 individual tax returns. If 3 returns are randomly


selected without replacement from the tray, how many possible samples
are there?

N N! 1000!
    166,167,00 0
 n  n!( N  n)! 3!(1000  3)!

28
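Python's standard library can check both counts (a sketch):

```python
import math

print(math.comb(1000, 3))   # 166167000 samples without replacement
print(1000 ** 3)            # 1000000000 ordered samples with replacement
```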
Four Types of Probability
Marginal: P(X), the probability of X occurring
Union: P(X \cup Y), the probability of X or Y occurring
Joint: P(X \cap Y), the probability of X and Y occurring
Conditional: P(X \mid Y), the probability of X occurring given that Y has occurred
29
General Law of Addition

P(X \cup Y) = P(X) + P(Y) - P(X \cap Y)

30
Design for improving productivity?

31
Problem
• A company conducted a survey for the American Society of Interior
Designers in which workers were asked which changes in office design
would increase productivity.
• Respondents were allowed to answer more than one type of design
change.

Reducing noise would increase 70 %


productivity
More storage space would 67 %
increase productivity

32
Problem

• If one of the survey respondents was randomly selected and asked what
office design changes would increase worker productivity,
– what is the probability that this person would select reducing noise or
more storage space?

33
Solution

• Let N represent the event “reducing noise.”


• Let S represent the event “more storage/ filing space.”
• The probability of a person responding with N or S can be symbolized
statistically as a union probability by using the law of addition.

34
General Law of Addition -- Example
P(N \cup S) = P(N) + P(S) - P(N \cap S)

P(N) = .70, \quad P(S) = .67, \quad P(N \cap S) = .56

P(N \cup S) = .70 + .67 - .56 = 0.81

35
Office Design Problem
Probability Matrix

Increase
Storage Space
Yes No Total
Noise Yes .56 .14 .70
Reduction
No .11 .19 .30
Total .67 .33 1.00

36
Joint Probability Using a Contingency Table
Event
Event B1 B2 Total

A1 P(A1 and B1) P(A1 and B2) P(A1)

A2 P(A2 and B1) P(A2 and B2) P(A2)

Total P(B1) P(B2) 1

Joint Probabilities Marginal (Simple) Probabilities


37
Office Design Problem - Probability Matrix
Increase
Storage Space

Yes No Total
Noise Yes .56 .14 .70
Reduction
No .11 .19 .30
Total .67 .33 1.00

P(N \cup S) = P(N) + P(S) - P(N \cap S) = .70 + .67 - .56 = .81

38
Law of Conditional Probability

P(X \mid Y) = \frac{P(X \cap Y)}{P(Y)}
39
Office Design Problem

40
Problem

• A company data reveal that 155 employees worked one of four types of
positions.
• Shown here again is the raw values matrix (also called a contingency table)
with the frequency counts for each category and for subtotals and totals
containing a breakdown of these employees by type of position and by
sex.

41
Contingency Table

42
Solution

• If an employee of the company is selected randomly, what is the probability that the employee is female or a professional worker?

43
Problem

• Shown here are the raw values matrix and corresponding probability
matrix for the results of a national survey of 200 executives who were
asked to identify the geographic locale of their company and their
company’s industry type.
• The executives were only allowed to select one locale and one industry
type.

44
Data Analytics with Python
Lecture 9: Probability Distributions-II

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT
IIT ROORKEE

1
Some Special Distributions
• Discrete
– Binomial
– Poisson
– Hyper geometric
• Continuous
– Uniform
– Exponential
– Normal

2
Binomial Distribution

• Let us consider the purchase decisions of the next three customers who
enter a store.

• On the basis of past experience, the store manager estimates the


probability that any one customer will make a purchase is .30.

• What is the probability that two of the next three customers will make a
purchase?

3
Tree diagram for the Martin clothing store problem

4
Trial Outcomes

5
Graphical representation of the probability distribution
for the number of customers making a purchase
x    P(x)
0    0.7 × 0.7 × 0.7 = 0.343
1    0.3×0.7×0.7 + 0.7×0.3×0.7 + 0.7×0.7×0.3 = 0.441
2    0.189
3    0.027

6
Binomial Distribution - Assumptions
• Experiment involves n identical trials
• Each trial has exactly two possible outcomes: success and failure
• Each trial is independent of the previous trials
• p is the probability of a success on any one trial
q = (1-p) is the probability of a failure on any one trial
• p and q are constant throughout the experiment
• X is the number of successes in the n trials

7
Binomial Distribution

• Probability n! X n X
P( X )  p q for 0  X  n
function X ! n  X !

• Mean
value   n p
• Variance and
standard  2
 n pq
deviation    2
 n pq

8
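A sketch of these probabilities with scipy.stats, reproducing the store example and the table lookup:

```python
from scipy import stats

# P(2 of the next 3 customers purchase), p = 0.30
print(stats.binom.pmf(2, n=3, p=0.3))    # 0.189

# Binomial table check: n = 10, x = 3, p = 0.40
print(stats.binom.pmf(3, n=10, p=0.4))   # 0.2150...
```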
Binomial Table
SELECTED VALUES FROM THE BINOMIAL PROBABILITY TABLE
EXAMPLE: n = 10, x = 3, p = .40; f (3) = .2150

9
Mean and Variance
• Suppose that for the next month the Clothing Store forecasts 1000
customers will enter the store.
• What is the expected number of customers who will make a purchase?
• The answer is μ = np = (1000)(.3) = 300.
• For the next 1000 customers entering the store, the variance and standard deviation for the number of customers who will make a purchase are σ² = npq = (1000)(.3)(.7) = 210 and σ = √210 = 14.49

10
Poisson Distribution

• Describes discrete occurrences over a continuum or interval


• A discrete distribution
• Describes rare events
• Each occurrence is independent of any other occurrence.
• The number of occurrences in each interval can vary from zero to infinity.
• The expected number of occurrences must hold constant throughout the
experiment.

11
Poisson Distribution: Applications
• Arrivals at queuing systems
– airports -- people, airplanes, automobiles, baggage
– banks -- people, automobiles, loan applications
– computer file servers -- read and write operations

• Defects in manufactured goods


– number of defects per 1,000 feet of extruded copper wire
– number of blemishes per square foot of painted surface
– number of errors per typed page

12
Poisson Distribution
• Probability function:

P(X) = \frac{\lambda^X e^{-\lambda}}{X!} \quad \text{for } X = 0, 1, 2, 3, \ldots

where:
λ = long-run average
e = 2.718282… (the base of natural logarithms)

• Mean value: λ    • Variance: λ    • Standard deviation: √λ
13
Poisson Distribution: Example
λ = 3.2 customers / 4 minutes

Case 1: X = 10 customers / 8 minutes. Adjusted λ = 6.4 customers / 8 minutes:
P(X = 10) = \frac{6.4^{10}\, e^{-6.4}}{10!} = 0.0528

Case 2: X = 6 customers / 8 minutes. Adjusted λ = 6.4 customers / 8 minutes:
P(X = 6) = \frac{6.4^{6}\, e^{-6.4}}{6!} = 0.1586

14
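The same two probabilities with scipy.stats (a sketch):

```python
from scipy import stats

# lambda = 6.4 customers per 8-minute interval
print(stats.poisson.pmf(10, mu=6.4))   # 0.0528...
print(stats.poisson.pmf(6, mu=6.4))    # 0.1586...
```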
Poisson Probability Table
Example: μ = 10, x = 5; f (5) = .0378

15
The Hypergeometric Distribution

• The binomial distribution is applicable when selecting from a finite population with replacement or from an infinite population without replacement.
• The hypergeometric distribution is applicable when selecting from a finite population without replacement.
Hyper Geometric Distribution
• Sampling without replacement from a finite population

• The number of objects in the population is denoted N.

• Each trial has exactly two possible outcomes, success and failure.

• Trials are not independent

• X is the number of successes in the n trials

• The binomial is an acceptable approximation, if N/10 > n Otherwise it is not.

17
Hypergeometric Distribution
• Probability function (N = population size, n = sample size, A = number of successes in the population, x = number of successes in the sample):

P(x) = \frac{\binom{A}{x}\binom{N-A}{n-x}}{\binom{N}{n}}

• Mean value: \mu = \frac{An}{N}

• Variance and standard deviation: \sigma^2 = \frac{A(N-A)\,n(N-n)}{N^2(N-1)}, \quad \sigma = \sqrt{\sigma^2}
18
The Hypergeometric Distribution Example
• Three different computers are checked from the 10 in the department. 4 of the 10 computers have illegal software loaded.
• What is the probability that 2 of the 3 selected computers have illegal software loaded?
• So, N = 10, n = 3, A = 4, x = 2

P(X = 2) = \frac{\binom{4}{2}\binom{6}{1}}{\binom{10}{3}} = \frac{(6)(6)}{120} = 0.3
• The probability that 2 of the 3 selected computers have illegal
software loaded is .30, or 30%.
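A sketch with scipy.stats (note scipy's parameter names: M = population size, n = successes in the population, N = sample size):

```python
from scipy import stats

# N = 10 computers, A = 4 with illegal software, sample of 3
print(stats.hypergeom.pmf(2, M=10, n=4, N=3))   # 0.3
```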
Continuous Probability Distributions

• A continuous random variable is a variable that can assume any value on a continuum (can assume an uncountable number of values)
– thickness of an item
– time required to complete a task
– temperature of a solution
– height
• These can potentially take on any value, depending only on the ability to
measure precisely and accurately.
Continuous Distributions

• Uniform
• Normal
• Exponential
The Uniform Distribution

• The uniform distribution is a probability distribution that has equal probabilities for all possible outcomes of the random variable
• Because of its shape it is also called a rectangular distribution


Uniform Distribution
f(x) = \begin{cases} \dfrac{1}{b-a} & \text{for } a \le x \le b \\ 0 & \text{for all other values} \end{cases}

[Figure: rectangular density f(x) of height 1/(b − a) over [a, b]; total area = 1]
Uniform Distribution: Mean and Standard Deviation

Mean: \mu = \frac{a+b}{2}

Standard deviation: \sigma = \frac{b-a}{\sqrt{12}}
The Uniform Distribution

Example: Uniform probability distribution over the range 2 ≤ X ≤ 6:

f(X) = \frac{1}{6-2} = .25 \quad \text{for } 2 \le X \le 6

\mu = \frac{a+b}{2} = \frac{2+6}{2} = 4

\sigma = \sqrt{\frac{(b-a)^2}{12}} = \sqrt{\frac{(6-2)^2}{12}} = 1.1547
Uniform Distribution Example
f(x) = \begin{cases} \dfrac{1}{47-41} = \dfrac{1}{6} & \text{for } 41 \le x \le 47 \\ 0 & \text{for all other values} \end{cases}

[Figure: rectangular density of height 1/6 over [41, 47]; total area = 1]
Uniform Distribution: Mean and Standard Deviation

Mean: \mu = \frac{a+b}{2} = \frac{41+47}{2} = \frac{88}{2} = 44

Standard deviation: \sigma = \frac{b-a}{\sqrt{12}} = \frac{47-41}{3.464} = 1.732
Uniform Distribution Probability

x x1
P ( x1  X  x2)  2
ba 45  42 1

47  41 2
f (x)
45  42 1
P( 42  X  45)  
47  41 2 Area
= 0.5

41 42 45 47 x
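A sketch of the same calculations with scipy.stats.uniform (loc = a, scale = b − a):

```python
from scipy import stats

u = stats.uniform(loc=41, scale=6)   # uniform on [41, 47]
print(u.mean(), u.std())             # 44.0 1.732...
print(u.cdf(45) - u.cdf(42))         # 0.5
```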
Example : Uniform Distribution

• Consider the random variable x representing the flight time of an airplane traveling from Delhi to Mumbai.
• Suppose the flight time can be any value in the interval from 120 minutes to 140 minutes.
• Because the random variable x can assume any value in that interval, x is a continuous rather than a discrete random variable

29
Example : Uniform Distribution contd….

• Let us assume that sufficient actual flight data are available to conclude
that the probability of a flight time within any 1-minute interval is the
same as the probability of a flight time within any other 1-minute interval
contained in the larger interval from 120 to 140 minutes.

• With every 1-minute interval being equally likely, the random variable x is
said to have a uniform probability distribution.

30
Uniform Probability Distribution for Flight time

31
Probability of a flight time between 120 and 130
minutes

32
Exponential Probability Distribution
• The exponential probability distribution is useful in describing the time it
takes to complete a task.
• The exponential random variables can be used to describe:

– Time required to complete a questionnaire
– Distance between major defects in a highway
– Time between vehicle arrivals at a toll booth
Exponential Probability Distribution

• Density function:

f(x) = \frac{1}{\mu}\, e^{-x/\mu} \quad \text{for } x > 0, \ \mu > 0

where: μ = mean, e = 2.71828
Exponential Probability Distribution

• Suppose that x represents the loading time for a truck at a loading dock and follows such a distribution.
• If the mean, or average, loading time is 15 minutes (μ = 15), the appropriate probability density function for x is f(x) = \frac{1}{15}\, e^{-x/15}
Exponential Distribution for the loading Dock Example
Exponential Probability Distribution
• Cumulative probabilities:

P(x \le x_0) = 1 - e^{-x_0/\mu}

where: x₀ = some specific value of x
Example: Exponential Probability Distribution

• The time between arrivals of cars at a Petrol pump follows an exponential

probability distribution with a mean time between arrivals of 3 minutes.

• The Petrol pump owner would like to know the probability that the time

between two successive arrivals will be 2 minutes or less.


Example: Petrol Pump Problem

P(x ≤ 2) = 1 − 2.71828^(−2/3) = 1 − .5134 = .4866

[Figure: exponential density f(x) with mean 3; the shaded area to the left of x = 2 is .4866. Horizontal axis: Time Between Successive Arrivals (mins.)]
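The same probability can be checked with SciPy (a sketch; scipy.stats.expon takes the mean as its scale parameter):

    from scipy.stats import expon

    # Time between arrivals: exponential with mean 3 minutes
    X = expon(scale=3)

    print(X.cdf(2))   # P(x <= 2) = 1 - exp(-2/3) ≈ 0.4866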
Relationship between the Poisson and Exponential
Distributions
The Poisson distribution
provides an appropriate description
of the number of occurrences
per interval

The exponential distribution


provides an appropriate description
of the length of the interval
between occurrences
Mean of Poisson and Mean of Exponential Distributions

• Because the average number of arrivals is 10 cars per hour, the average time between cars arriving is 1/10 hour = 6 minutes (the exponential mean is the reciprocal of the Poisson mean)
42
The Normal Distribution: Properties

• ‘Bell Shaped’
• Symmetrical
• Mean, Median and Mode are equal
• Location is characterized by the mean, μ
• Spread is characterized by the standard deviation, σ
• The random variable has an infinite theoretical range: −∞ to +∞

[Figure: bell-shaped density f(X) centered at μ with spread σ; Mean = Median = Mode]
The Normal Distribution: Density Function
The formula for the normal probability density function is

f(X) = (1/(σ√(2π))) e^(−(1/2)((X − μ)/σ)²)

Where e = the mathematical constant approximated by 2.71828
      π = the mathematical constant approximated by 3.14159
      μ = the population mean
      σ = the population standard deviation
      X = any value of the continuous variable
The Normal Distribution: Shape

By varying the parameters μ and σ, we obtain different normal distributions
Data Analytics with Python
Lecture 10: Probability Distributions-III

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT
IIT ROORKEE

1
The Normal Distribution: Properties

• ‘Bell Shaped’
• Symmetrical
• Mean, Median and Mode are equal
• Location is characterized by the mean, μ
• Spread is characterized by the standard deviation, σ
• The random variable has an infinite theoretical range: −∞ to +∞

[Figure: bell-shaped density f(X) centered at μ with spread σ; Mean = Median = Mode]
The Normal Distribution: Density Function
The formula for the normal probability density function is

f(X) = (1/(σ√(2π))) e^(−(1/2)((X − μ)/σ)²)

Where e = the mathematical constant approximated by 2.71828
      π = the mathematical constant approximated by 3.14159
      μ = the population mean
      σ = the population standard deviation
      X = any value of the continuous variable
The Normal Distribution: Shape

By varying the parameters μ and σ, we obtain different normal distributions
The Normal Distribution: Shape

Changing μ shifts the distribution left or right.
Changing σ increases or decreases the spread.

[Figure: normal curves f(X) illustrating a shift in μ and a change in σ]
The Standardized Normal Distribution

• Any normal distribution (with any mean and standard deviation


combination) can be transformed into the standardized normal
distribution (Z).

• Need to transform X units into Z units.

• The standardized normal distribution has a mean of 0 and a standard


deviation of 1.
The Standardized Normal Distribution

• Translate from X to the standardized normal (the “Z” distribution) by subtracting the mean of X and dividing by its standard deviation:

Z = (X − μ) / σ
The Standardized Normal Distribution: Density
Function

• The formula for the standardized normal probability density function is

f(Z) = (1/√(2π)) e^(−Z²/2)
Where e = the mathematical constant approximated by 2.71828
π = the mathematical constant approximated by 3.14159
Z = any value of the standardized normal distribution
The Standardized Normal Distribution: Shape
• Also known as the “Z” distribution
• Mean is 0
• Standard Deviation is 1

[Figure: standard normal density f(Z) centered at 0]

Values above the mean have positive Z-values, values below the mean have negative Z-values
The Standardized Normal Distribution: Example

• If X is distributed normally with mean of 100 and standard deviation of 50, the Z value for X = 200 is

Z = (X − μ)/σ = (200 − 100)/50 = 2.0
• This says that X = 200 is two standard deviations (2 increments of 50
units) above the mean of 100.
The Standardized Normal Distribution: Example

100 200 X (μ = 100, σ = 50)


0 2.0 Z (μ = 0, σ = 1)

Note that the distribution is the same, only the scale has changed. We
can express the problem in original units (X) or in standardized units (Z)
Normal Probabilities

Probability is measured by the area under the curve

[Figure: the area under the curve f(X) between a and b represents P(a ≤ X ≤ b)]

(Note that the probability of any individual value is zero)
Normal Probabilities

The total area under the curve is 1.0, and the curve is symmetric,
so half is above the mean, half is below.

P(−∞ < X ≤ μ) = 0.5
P(μ ≤ X < ∞) = 0.5
P(−∞ < X < ∞) = 1.0
Normal Probability Tables

Example:
P(Z < 2.00) = .9772

[Figure: standard normal curve with area .9772 to the left of Z = 2.00]
Normal Probability Tables

The row shows the value of Z to the first decimal point; the column gives the value of Z to the second decimal point. The value within the table gives the probability from Z = −∞ up to the desired Z value.

Z     0.00   0.01   0.02   …
0.0
0.1
.
.
2.0   .9772

P(Z < 2.00) = .9772
Finding Normal Probability
Procedure

To find P(a < X < b) when X is distributed normally:

• Draw the normal curve for the problem in terms of X.

• Translate X-values to Z-values.

• Use the Standardized Normal Table.


Finding Normal Probability: Example
• Let X represent the time it takes (in seconds) to download an image file
from the internet.
• Suppose X is normal with mean 8.0 and standard deviation 5.0
• Find P(X < 8.6)

[Figure: normal curve with mean 8.0; the area below X = 8.6 is shaded]
Finding Normal Probability: Example

• Suppose X is normal with mean 8.0 and standard deviation 5.0. Find P(X < 8.6).

Z = (X − μ)/σ = (8.6 − 8.0)/5.0 = 0.12

[Figure: the X scale (μ = 8, σ = 5) and the corresponding Z scale (μ = 0, σ = 1); X = 8.6 maps to Z = 0.12, so P(X < 8.6) = P(Z < 0.12)]

Finding Normal Probability: Example
Standardized Normal Probability Table (Portion)

Z     .00    .01    .02
0.0  .5000  .5040  .5080
0.1  .5398  .5438  .5478
0.2  .5793  .5832  .5871
0.3  .6179  .6217  .6255

P(X < 8.6) = P(Z < 0.12) = .5478


Finding Normal Probability: Example

• Find P(X > 8.6)…

P(X > 8.6) = P(Z > 0.12) = 1.0 − P(Z ≤ 0.12) = 1.0 − .5478 = .4522

[Figure: standard normal curve; area .5478 to the left of Z = 0.12 and area .4522 to the right]
Finding Normal Probability: Between Two Values

• Suppose X is normal with mean 8.0 and standard deviation 5.0. Find P(8 < X < 8.6)

Calculate Z-values:

Z = (X − μ)/σ = (8 − 8)/5 = 0
Z = (X − μ)/σ = (8.6 − 8)/5 = 0.12

So P(8 < X < 8.6) = P(0 < Z < 0.12)
Finding Normal Probability
Between Two Values

P(8 < X < 8.6) = P(0 < Z < 0.12)
              = P(Z < 0.12) − P(Z ≤ 0)
              = .5478 − .5000 = .0478

Standardized Normal Probability Table (Portion)

Z     .00    .01    .02
0.0  .5000  .5040  .5080
0.1  .5398  .5438  .5478
0.2  .5793  .5832  .5871
0.3  .6179  .6217  .6255
Given Normal Probability: Find the X Value

• Let X represent the time it takes (in seconds) to download an image file
from the internet.
• Suppose X is normal with mean 8.0 and standard deviation 5.0
• Find X such that 20% of download times are less than X.

[Figure: the lower .2000 tail of the distribution; the unknown X, and the corresponding Z, cuts off the lowest 20% of download times]
Given Normal Probability, Find the X Value

• First, find the Z value that corresponds to the known probability using the table.

Z     ….    .03    .04    .05
-0.9  ….  .1762  .1736  .1711
-0.8  ….  .2033  .2005  .1977
-0.7  ….  .2327  .2296  .2266

The cumulative probability closest to .2000 is .2005, so Z = −0.84.
Given Normal Probability,
Find the X Value

• Second, convert the Z value to X units using the following formula:

X = μ + Zσ = 8.0 + (−0.84)(5.0) = 3.80

So 20% of the download times from the distribution with mean 8.0 and standard deviation 5.0 are less than 3.80 seconds.
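Both directions of this calculation can be reproduced with SciPy's normal distribution (a sketch; norm.cdf gives probabilities, norm.ppf is the inverse):

    from scipy.stats import norm

    # Download time: normal with mean 8.0 and std dev 5.0
    X = norm(loc=8.0, scale=5.0)

    print(X.cdf(8.6))             # P(X < 8.6)      ≈ 0.5478
    print(1 - X.cdf(8.6))         # P(X > 8.6)      ≈ 0.4522
    print(X.cdf(8.6) - X.cdf(8))  # P(8 < X < 8.6)  ≈ 0.0478
    print(X.ppf(0.20))            # 20th percentile ≈ 3.79 seconds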
Assessing Normality
• It is important to evaluate how well the data set is approximated by a normal
distribution.
• Normally distributed data should approximate the theoretical normal
distribution:
– The normal distribution is bell shaped (symmetrical) where the mean is
equal to the median.
– The empirical rule applies to the normal distribution.
– The interquartile range of a normal distribution is 1.33 standard deviations.
Assessing Normality
• Construct charts or graphs
– For small- or moderate-sized data sets, do the stem-and-leaf display and box-and-whisker plot look symmetric?
– For large data sets, does the histogram or polygon appear bell-
shaped?
• Compute descriptive summary measures
– Do the mean, median and mode have similar values?
– Is the interquartile range approximately 1.33 σ?
– Is the range approximately 6 σ?
Assessing Normality

• Observe the distribution of the data set


– Do approximately 2/3 of the observations lie within mean ± 1 standard
deviation?
– Do approximately 80% of the observations lie within mean ± 1.28
standard deviations?
– Do approximately 95% of the observations lie within mean ± 2 standard
deviations?
Z Table
Second Decimal Place in Z
Z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09

0.00 0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359
0.10 0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753
0.20 0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141
0.30 0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517

0.90 0.3159 0.3186 0.3212 0.3238 0.3264 0.3289 0.3315 0.3340 0.3365 0.3389
1.00 0.3413 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577 0.3599 0.3621
1.10 0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.3770 0.3790 0.3810 0.3830
1.20 0.3849 0.3869 0.3888 0.3907 0.3925 0.3944 0.3962 0.3980 0.3997 0.4015

2.00 0.4772 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808 0.4812 0.4817

3.00 0.4987 0.4987 0.4987 0.4988 0.4988 0.4989 0.4989 0.4989 0.4990 0.4990
3.40 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4998
3.50 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998
Table Lookup of a
Standard Normal Probability

P(0 ≤ Z ≤ 1) = 0.3413

Z 0.00 0.01 0.02

0.00 0.0000 0.0040 0.0080


0.10 0.0398 0.0438 0.0478
0.20 0.0793 0.0832 0.0871

1.00 0.3413 0.3438 0.3461

1.10 0.3643 0.3665 0.3686


1.20 0.3849 0.3869 0.3888
Applying the Z Formula

X is normally distributed with μ = 485 and σ = 105

P(485 ≤ X ≤ 600) = P(0 ≤ Z ≤ 1.10) = .3643

For X = 485:  Z = (X − μ)/σ = (485 − 485)/105 = 0
For X = 600:  Z = (X − μ)/σ = (600 − 485)/105 = 1.10

Z     0.00    0.01    0.02
0.00  0.0000  0.0040  0.0080
0.10  0.0398  0.0438  0.0478
1.00  0.3413  0.3438  0.3461
1.10  0.3643  0.3665  0.3686
1.20  0.3849  0.3869  0.3888
32
33
Thank You

34
Distribution of Sample Mean, Proportion, and Variance
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT
IIT ROORKEE

1
2
Acceptance Intervals
Goal: determine a range within which sample means are likely to occur, given a
population mean and variance
• By the Central Limit Theorem, we know that the distribution of X̄ is approximately normal if n is large enough, with mean μ and standard deviation σ_X̄ = σ/√n
• Let z_α/2 be the z-value that leaves area α/2 in the upper tail of the normal distribution (i.e., the interval −z_α/2 to z_α/2 encloses probability 1 − α)
• Then

μ ± z_α/2 σ_X̄

is the interval that includes X̄ with probability 1 − α
3
Sampling Distributions of Sample Proportions

Sampling
Distributions

Sampling Sampling Sampling


Distribution of Distribution of Distribution of
Sample Sample Sample
Mean Proportion Variance

4
Sampling Distributions of Sample Proportions
P = the proportion of the population having some characteristic
• Sample proportion (p̂) provides an estimate of P:

p̂ = X/n = (number of items in the sample having the characteristic of interest) / (sample size)

• 0 ≤ p̂ ≤ 1
• p̂ has a binomial distribution, but can be approximated by a normal distribution when nP(1 − P) > 5

5
Sampling Distribution of p̂

• Normal approximation:

Properties:  E(p̂) = P   (where P = population proportion)

and  σ²_p̂ = Var(X/n) = P(1 − P)/n

[Figure: the sampling distribution of p̂ is approximately normal]

6
7
Z-Value for Proportions

Standardize p̂ to a Z value with the formula:

Z = (p̂ − P)/σ_p̂ = (p̂ − P) / √(P(1 − P)/n)

8
Example

• If the true proportion of voters who support Proposition A is


P = .4, what is the probability that a sample of size 200 yields
a sample proportion between .40 and .45?
• i.e.:
if P = .4 and n = 200, what is
P(.40 ≤ p̂ ≤ .45) ?

9
Example (continued)

• if P = .4 and n = 200, what is


P(.40 ≤ p̂ ≤ .45) ?

Find:  σ_p̂ = √(P(1 − P)/n) = √(.4(1 − .4)/200) = .03464

Convert to standard normal:

P(.40 ≤ p̂ ≤ .45) = P((.40 − .40)/.03464 ≤ Z ≤ (.45 − .40)/.03464) = P(0 ≤ Z ≤ 1.44)

10
Example
(continued)

if P = .4 and n = 200, what is P(.40 ≤ p̂ ≤ .45) ?


Use standard normal table: P(0 ≤ Z ≤ 1.44) = .4251
[Figure: the sampling distribution of p̂ standardized to the normal distribution; the area between .40 and .45 (Z = 0 to 1.44) is .4251]
11
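A sketch of the same computation in Python (the small difference from .4251 comes from table rounding of Z = 1.44):

    import numpy as np
    from scipy.stats import norm

    P, n = 0.4, 200
    se = np.sqrt(P * (1 - P) / n)        # 0.03464

    prob = norm.cdf(0.45, loc=P, scale=se) - norm.cdf(0.40, loc=P, scale=se)
    print(prob)                          # ≈ 0.4255 (table with Z = 1.44 gives .4251)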
Sampling Distributions of Sample Variance

Sampling
Distributions

Sampling Sampling Sampling


Distribution of Distribution of Distribution of
Sample Sample Sample
Mean Proportion Variance

12
Sample Variance
• Let x1, x2, . . . , xn be a random sample from a population. The
sample variance is
s² = (1/(n − 1)) Σ (xᵢ − x̄)²

• the square root of the sample variance is called the sample


standard deviation
• the sample variance is different for different random samples from
the same population

13
Sampling Distribution of Sample Variances
• The sampling distribution of s² has mean σ²:  E(s²) = σ²

• If the population distribution is normal, then

(n − 1)s² / σ²

has a χ² distribution with n − 1 degrees of freedom

14
15
The Chi-square Distribution

• The chi-square distribution is a family of distributions, depending on degrees of freedom: d.f. = n − 1

[Figure: three χ² densities for increasing degrees of freedom, becoming flatter and more symmetric as d.f. grows]
16
Degrees of Freedom (df)
Idea: Number of observations that are free to vary after sample
mean has been calculated
Example: Suppose the mean of 3 numbers is 8.0

Let X1 = 7 and X2 = 8. If the mean of these three values is 8.0, then X3 must be 9 (i.e., X3 is not free to vary).

Here, n = 3, so degrees of freedom = n − 1 = 3 − 1 = 2
(2 values can be any numbers, but the third is not free to vary for a given mean)
17
Chi-square Example

• A commercial freezer must hold a selected temperature with little


variation. Specifications call for a standard deviation of no more than 4
degrees (a variance of 16 degrees2).
• A sample of 14 freezers is to be tested
• What is the upper limit (K) for the sample variance such that the
probability of exceeding this limit, given that the population standard
deviation is 4, is less than 0.05?

18
Finding the Chi-square Value
χ² = (n − 1)s² / σ²  is chi-square distributed with (n − 1) = 13 degrees of freedom

• Use the chi-square distribution with area 0.05 in the upper tail:

χ²₁₃ = 22.36  (α = .05 and 14 − 1 = 13 d.f.)

[Figure: χ² density with 13 d.f.; the upper-tail area beyond 22.36 is α = .05]
19
Chi-square Example
(continued)

213 = 22.36 (α = .05 and 14 – 1 = 13 d.f.)


 (n  1)s 2 2 
P(s  K)  P
2
 χ13   0.05
So:  16 
(n  1)K
or  22.36 (where n = 14)
16

(22.36)(16)
so K  27.52
(14  1)

If s2 from the sample of size n = 14 is greater than 27.52, there is


strong evidence to suggest the population variance exceeds 16.
20
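A sketch of this calculation with SciPy's chi-square distribution:

    from scipy.stats import chi2

    n, sigma2 = 14, 16
    df = n - 1

    chi_crit = chi2.ppf(0.95, df)      # upper-tail 0.05 critical value ≈ 22.36
    K = chi_crit * sigma2 / df         # ≈ 27.52
    print(chi_crit, K)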
Summary
• Introduced sampling distributions
• Described the sampling distribution of sample means
– For normal populations
– Using the Central Limit Theorem
• Described the sampling distribution of sample proportions
• Introduced the chi-square distribution
• Examined sampling distributions for sample variances
• Calculated probabilities using sampling distributions
21
Thank You

22
Confidence Interval Estimation: Single
Population
Dr. A. Ramesh
Department of Management Studies
IIT ROORKEE

1
Goals
After completing this lecture, you should be able to:
• Distinguish between a point estimate and a confidence interval estimate
• Construct and interpret a confidence interval estimate for a single
population mean using both the Z and t distributions
• Form and interpret a confidence interval estimate for a single population
proportion
• Create confidence interval estimates for the variance of a normal
population

2
Confidence Intervals
• Confidence Intervals for the Population Mean, μ
– when Population Variance σ2 is Known
– when Population Variance σ2 is Unknown
• Confidence Intervals for the Population Proportion, p̂ (large samples)
• Confidence interval estimates for the variance of a normal population

3
Definitions

• An estimator of a population parameter is


– a random variable that depends on sample information . . .
– whose value provides an approximation to this unknown parameter

• A specific value of that random variable is called an estimate

4
Point and Interval Estimates

• A point estimate is a single number,


• a confidence interval provides additional information about
variability

[Diagram: a confidence interval runs from the lower confidence limit to the upper confidence limit, with the point estimate at its center; the distance between the limits is the width of the confidence interval]
5
Point Estimates

We can estimate a Population Parameter … with a Sample Statistic (a Point Estimate):

Parameter: Mean μ → Point estimate: x̄
Parameter: Proportion P → Point estimate: p̂

6
Unbiasedness

• A point estimator θ̂ is said to be an unbiased estimator of the parameter θ if the expected value, or mean, of the sampling distribution of θ̂ is θ:

E(θ̂) = θ

• Examples:
– The sample mean x̄ is an unbiased estimator of μ
– The sample variance s² is an unbiased estimator of σ²
– The sample proportion p̂ is an unbiased estimator of P

7
Unbiasedness
(continued)
• θ̂1 is an unbiased estimator, θ̂2 is biased:

[Figure: the sampling distribution of θ̂1 is centered at θ; the sampling distribution of θ̂2 is centered away from θ]
8
Bias

• Let θ̂ be an estimator of θ

• The bias in θ̂ is defined as the difference between its mean and θ:

Bias(θ̂) = E(θ̂) − θ

• The bias of an unbiased estimator is 0

9
Most Efficient Estimator
• Suppose there are several unbiased estimators of θ
• The most efficient estimator, or minimum variance unbiased estimator, of θ is the unbiased estimator with the smallest variance
• Let θ̂1 and θ̂2 be two unbiased estimators of θ, based on the same number of sample observations. Then,
– θ̂1 is said to be more efficient than θ̂2 if Var(θ̂1) < Var(θ̂2)
– The relative efficiency of θ̂1 with respect to θ̂2 is the ratio of their variances:

Relative Efficiency = Var(θ̂2) / Var(θ̂1)

10
Confidence Intervals

• How much uncertainty is associated with a point estimate of a population


parameter?

• An interval estimate provides more information about a population


characteristic than does a point estimate

• Such interval estimates are called confidence intervals

11
Confidence Interval Estimate

• An interval gives a range of values:


– Takes into consideration variation in sample statistics from sample to
sample
– Based on observation from 1 sample
– Gives information about closeness to unknown population
parameters
– Stated in terms of level of confidence
• Can never be 100% confident

12
Confidence Interval and Confidence Level

• If P(a < θ < b) = 1 − α, then the interval from a to b is called a 100(1 − α)% confidence interval of θ.
• The quantity (1 − α) is called the confidence level of the interval (α between 0 and 1)

– In repeated samples of the population, the true value of the parameter θ would be contained in 100(1 − α)% of intervals calculated this way.
– The confidence interval calculated in this manner is written as a < θ < b with 100(1 − α)% confidence

13
Estimation Process

[Diagram: from a population whose mean μ is unknown, a random sample gives x̄ = 50; conclusion: “I am 95% confident that μ is between 40 & 60.”]
14
Confidence Level, (1-)
(continued)
• Suppose confidence level = 95%
• Also written (1 - ) = 0.95
• A relative frequency interpretation:
– From repeated samples, 95% of all the confidence intervals that can
be constructed will contain the unknown true parameter
• A specific interval either will contain or will not contain the true
parameter

15
General Formula

• The general formula for all confidence intervals is:

Point Estimate ± (Reliability Factor)(Standard Error)

• The value of the reliability factor depends on the desired level of confidence

16
Confidence Intervals

Confidence
Intervals

Population Population Population


Mean Proportion Variance

σ2 Known σ2 Unknown

17
Confidence Interval for μ (σ2 Known)
• Assumptions
– Population variance σ² is known
– Population is normally distributed
– If population is not normal, use large sample
• Confidence interval estimate:

x̄ − z_α/2 σ/√n ≤ μ ≤ x̄ + z_α/2 σ/√n

(where z_α/2 is the normal distribution value for a probability of α/2 in each tail)

18
Margin of Error
• The confidence interval

x̄ − z_α/2 σ/√n ≤ μ ≤ x̄ + z_α/2 σ/√n

• can also be written as x̄ ± ME, where ME is called the margin of error:

ME = z_α/2 σ/√n

19
Reducing the Margin of Error

ME = z_α/2 σ/√n

The margin of error can be reduced if

• the population standard deviation can be reduced (σ↓)

• the sample size is increased (n↑)

• the confidence level is decreased, (1 − α)↓

20
Finding the Reliability Factor, z_α/2

• Consider a 95% confidence interval: 1 − α = .95, so α/2 = .025 in each tail

Z units: z = −1.96 to z = 1.96
X units: from the lower confidence limit to the upper confidence limit, centered at the point estimate

Find z.025 = 1.96 from the standard normal distribution table

21
Common Levels of Confidence

• Commonly used confidence levels are 90%, 95%, and 99%

Confidence Level   Confidence Coefficient, 1 − α   Z_α/2 value
80%                .80                             1.28
90%                .90                             1.645
95%                .95                             1.96
98%                .98                             2.33
99%                .99                             2.58
99.8%              .998                            3.08
99.9%              .999                            3.27
22
Intervals and Level of Confidence
Sampling Distribution of the Mean: area α/2 in each tail and 1 − α in the middle

Intervals extend from LCL = x̄ − z σ/√n to UCL = x̄ + z σ/√n

[Figure: confidence intervals from repeated samples x̄1, x̄2, …; 100(1 − α)% of intervals constructed this way contain μ, and 100(α)% do not]
23
Example

• A sample of 11 circuits from a large normal population has a mean


resistance of 2.20 ohms. We know from past testing that the population
standard deviation is 0.35 ohms.

• Determine a 95% confidence interval for the true mean resistance of the
population.

24
Example
(continued)

• A sample of 11 circuits from a large normal population has a mean resistance


of 2.20 ohms. We know from past testing that the population standard
deviation is .35 ohms.
• Solution:

x̄ ± z σ/√n = 2.20 ± 1.96(.35/√11) = 2.20 ± .2068

1.9932 ≤ μ ≤ 2.4068

25
Interpretation

• We are 95% confident that the true mean resistance is


between 1.9932 and 2.4068 ohms
• Although the true mean may or may not be in this interval,
95% of intervals formed in this manner will contain the true
mean

26
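A sketch of this interval in Python (taking the z critical value from SciPy rather than a table):

    import numpy as np
    from scipy.stats import norm

    xbar, sigma, n = 2.20, 0.35, 11
    z = norm.ppf(0.975)                 # 1.96 for 95% confidence

    me = z * sigma / np.sqrt(n)         # margin of error ≈ 0.2068
    print(xbar - me, xbar + me)         # ≈ (1.9932, 2.4068)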
Confidence Intervals

Confidence
Intervals

Population Population Population


Mean Proportion Variance

σ2 Known σ2 Unknown

27
Confidence Interval Estimation: Single
Population-II
Dr. A. Ramesh
Department of Management Studies
IIT ROORKEE

1
Student’s t Distribution
• Consider a random sample of n observations
– with mean x̄ and standard deviation s
– from a normally distributed population with mean μ

• Then the variable

t = (x̄ − μ) / (s/√n)

follows the Student’s t distribution with (n − 1) degrees of freedom

2
Confidence Interval for μ (σ2 Unknown)

• If the population standard deviation σ is unknown, we can substitute


the sample standard deviation, s
• This introduces extra uncertainty, since s is variable from sample to
sample
• So we use the t distribution instead of the normal distribution

3
Confidence Interval for μ (σ Unknown)
(continued)
• Assumptions
– Population standard deviation is unknown
– Population is normally distributed
– If population is not normal, use large sample
• Use Student’s t Distribution
• Confidence Interval Estimate:

x̄ − t_{n−1,α/2} s/√n ≤ μ ≤ x̄ + t_{n−1,α/2} s/√n

where t_{n−1,α/2} is the critical value of the t distribution with n − 1 d.f. and an area of α/2 in each tail

4
Margin of Error
• The confidence interval

x̄ − t_{n−1,α/2} s/√n ≤ μ ≤ x̄ + t_{n−1,α/2} s/√n

• can also be written as x̄ ± ME, where ME is called the margin of error:

ME = t_{n−1,α/2} s/√n

5
Student’s t Distribution

• The t is a family of distributions


• The t value depends on degrees of freedom (d.f.)
– Number of observations that are free to vary after sample mean has
been calculated
d.f. = n - 1

6
Student’s t Distribution
Note: t → Z as n increases

t-distributions are bell-shaped and symmetric, but have ‘fatter’ tails than the normal.

[Figure: standard normal (t with df = ∞) compared with t (df = 13) and t (df = 5); smaller df gives heavier tails]
7
Student’s t Table

Let: n = 3, so df = n − 1 = 2, α = .10, α/2 = .05

Upper Tail Area
df    .10    .05    .025
1   3.078  6.314  12.706
2   1.886  2.920   4.303
3   1.638  2.353   3.182

The body of the table contains t values, not probabilities: t_{2,.05} = 2.920
8
t distribution values
With comparison to the Z value

Confidence Level   t (10 d.f.)   t (20 d.f.)   t (30 d.f.)   Z
.80                1.372         1.325         1.310         1.282
.90                1.812         1.725         1.697         1.645
.95                2.228         2.086         2.042         1.960
.99                3.169         2.845         2.750         2.576

Note: t → Z as n increases
9
Example

A random sample of n = 25 has x̄ = 50 and s = 8. Form a 95% confidence interval for μ.

– d.f. = n − 1 = 24, so t_{n−1,α/2} = t_{24,.025} = 2.0639

The confidence interval is

x̄ − t_{n−1,α/2} s/√n ≤ μ ≤ x̄ + t_{n−1,α/2} s/√n
50 − (2.0639)(8/√25) ≤ μ ≤ 50 + (2.0639)(8/√25)
46.698 ≤ μ ≤ 53.302
10
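The same interval, sketched with SciPy's t distribution:

    import numpy as np
    from scipy.stats import t

    xbar, s, n = 50, 8, 25
    t_crit = t.ppf(0.975, n - 1)        # ≈ 2.0639

    me = t_crit * s / np.sqrt(n)
    print(xbar - me, xbar + me)         # ≈ (46.698, 53.302)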
Confidence Intervals

Confidence
Intervals

Population Population Population


Mean Proportion Variance

σ2 Known σ2 Unknown

11
Confidence Intervals for the
Population Proportion

• An interval estimate for the population proportion ( P ) can


be calculated by adding an allowance for uncertainty to the
sample proportion ( p̂ )

12
Confidence Intervals for the Population
Proportion, p
(continued)

• Recall that the distribution of the sample proportion is approximately normal if the sample size is large, with standard deviation

σ_P = √(P(1 − P)/n)

• We will estimate this with sample data:

√(p̂(1 − p̂)/n)

13
Confidence Interval Endpoints

• Upper and lower confidence limits for the population proportion are calculated with the formula

p̂ − z_α/2 √(p̂(1 − p̂)/n) ≤ P ≤ p̂ + z_α/2 √(p̂(1 − p̂)/n)

• where
– z_α/2 is the standard normal value for the level of confidence desired
– p̂ is the sample proportion
– n is the sample size
– nP(1 − P) > 5

14
Example

• A random sample of 100 people shows that 25 are left-


handed.
• Form a 95% confidence interval for the true proportion of
left-handers

15
Example (continued)

• A random sample of 100 people shows that 25 are left-handed. Form a 95% confidence interval for the true proportion of left-handers.

p̂ − z_α/2 √(p̂(1 − p̂)/n) ≤ P ≤ p̂ + z_α/2 √(p̂(1 − p̂)/n)

25/100 − 1.96 √(.25(.75)/100) ≤ P ≤ 25/100 + 1.96 √(.25(.75)/100)

0.1651 ≤ P ≤ 0.3349

16
Interpretation

• We are 95% confident that the true percentage of left-handers in the


population is between
16.51% and 33.49%.

• Although the interval from 0.1651 to 0.3349 may or may not contain the true
proportion, 95% of intervals formed from samples of size 100 in this manner
will contain the true proportion.

17
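A sketch of this proportion interval in Python:

    import numpy as np
    from scipy.stats import norm

    x, n = 25, 100
    p_hat = x / n
    z = norm.ppf(0.975)

    me = z * np.sqrt(p_hat * (1 - p_hat) / n)
    print(p_hat - me, p_hat + me)       # ≈ (0.1651, 0.3349)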
Confidence Intervals

Confidence
Intervals

Population Population Population


Mean Proportion Variance

σ2 Known σ2 Unknown

18
Confidence Intervals for the Population
Variance

 Goal: Form a confidence interval for the population variance, σ2

• The confidence interval is based on the sample variance, s2

• Assumed: the population is normally distributed

19
Confidence Intervals for the Population Variance
(continued)

The random variable

χ²_{n−1} = (n − 1)s² / σ²

follows a chi-square distribution with (n − 1) degrees of freedom

20
Confidence Intervals for the Population Variance

The 100(1 − α)% confidence interval for the population variance is

(n − 1)s² / χ²_{n−1, α/2} ≤ σ² ≤ (n − 1)s² / χ²_{n−1, 1−α/2}

21
Example

You are testing the speed of a batch of computer processors. You


collect the following data (in Mhz):

Sample size 17
Sample mean 3004
Sample std dev 74

Assume the population is normal. Determine the 95% confidence interval for σ²

22
Finding the Chi-square Values

• n = 17 so the chi-square distribution has (n − 1) = 16 degrees of freedom
• α = 0.05, so use the chi-square values with area 0.025 in each tail:

χ²_{n−1, α/2} = χ²_{16, 0.025} = 28.85
χ²_{n−1, 1−α/2} = χ²_{16, 0.975} = 6.91

[Figure: χ²₁₆ density with area .025 below 6.91 and area .025 above 28.85]

23
Calculating the Confidence Limits

• The 95% confidence interval is

(n − 1)s² / χ²_{n−1, α/2} ≤ σ² ≤ (n − 1)s² / χ²_{n−1, 1−α/2}

(17 − 1)(74)² / 28.85 ≤ σ² ≤ (17 − 1)(74)² / 6.91

3037 ≤ σ² ≤ 12683

Converting to standard deviation, we are 95% confident that the population standard deviation of CPU speed is between 55.1 and 112.6 Mhz

24
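A sketch of the variance interval with SciPy:

    import numpy as np
    from scipy.stats import chi2

    n, s = 17, 74
    df = n - 1

    lower = df * s**2 / chi2.ppf(0.975, df)   # ≈ 3037
    upper = df * s**2 / chi2.ppf(0.025, df)   # ≈ 12683
    print(lower, upper)
    print(np.sqrt(lower), np.sqrt(upper))     # ≈ (55.1, 112.6) Mhz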
Finite Populations

• If the sample size is more than 5% of the population size (and


sampling is without replacement) then a finite population correction
factor must be used when calculating the standard error

25
Finite Population Correction Factor

• Suppose sampling is without replacement and the sample size is large


relative to the population size
• Assume the population size is large enough to apply the central limit
theorem
• Apply the finite population correction factor when estimating the
population variance

Nn
finite population correction factor 
N 1

26
Estimating the Population Mean

• Let a simple random sample of size n be taken from a population


of N members with mean μ
• The sample mean is an unbiased estimator of the population mean μ
• The point estimate is:

x̄ = (1/n) Σ xᵢ

27
Finite Populations: Mean

• If the sample size is more than 5% of the population size, an unbiased estimator for the variance of the sample mean is

σ̂²_x̄ = (s²/n) × (N − n)/(N − 1)

• So the 100(1 − α)% confidence interval for the population mean is

x̄ − t_{n−1,α/2} σ̂_x̄ ≤ μ ≤ x̄ + t_{n−1,α/2} σ̂_x̄

28
Estimating the Population Proportion

• Let the true population proportion be P


• Let p̂ be the sample proportion from n observations from a simple
random sample
• The sample proportion, p̂ , is an unbiased estimator of the population
proportion, P

29
Finite Populations: Proportion

• If the sample size is more than 5% of the population size, an unbiased estimator for the variance of the population proportion is

σ̂²_p̂ = (p̂(1 − p̂)/n) × (N − n)/(N − 1)

• So the 100(1 − α)% confidence interval for the population proportion is

p̂ − z_α/2 σ̂_p̂ ≤ P ≤ p̂ + z_α/2 σ̂_p̂

30
Lecture Summary
• Introduced the concept of confidence intervals
• Discussed point estimates
• Developed confidence interval estimates
• Created confidence interval estimates for the mean (σ2
known)
• Introduced the Student’s t distribution
• Determined confidence interval estimates for the mean (σ2
unknown)

31
Lecture Summary
(continued)
• Created confidence interval estimates for the proportion
• Created confidence interval estimates for the variance of a normal
population
• Applied the finite population correction factor to form confidence
intervals when the sample size is not small relative to the population size

32
Thank You

34
Welcome to
TA Live Session 3

NPTEL | DATA ANALYTICS


WITH PYTHON
12-02-2022
Ritwiz Kamal
PhD (Prime Minister’s Research Fellow), CSE, IIT Madras
Let’s Get Started …



To discuss before we start:

 Correction from Last Session (Biased vs Unbiased Coin)

 Recommendation for Books on Data Analytics



Question 1



Question 1



Question 2
2) What kind of sampling are we doing below ?



Question 2



Question 3
3) What kind of sampling are we doing below ?



Question 3



Question 4
4) What kind of sampling are we doing below ?



Question 4



Question 5
5) What kind of sampling are we doing below ?



Question 5

GOOD OR BAD ?



Question 6
6)



Question 6
6)



Question 7



Question 7

P(X = 1) = (λ¹/1!) e^(−λ) = (2.5¹/1!) e^(−2.5)


Question 8

Find the probability that they will sell at most 2 cars.



Question 8

Find the probability that they will sell at most 2 cars.

P(X ≤ 2) = (λ⁰/0!) e^(−λ) + (λ¹/1!) e^(−λ) + (λ²/2!) e^(−λ)

*Shift to Colab for demo
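A minimal sketch of the Colab demo, assuming λ = 2.5 (the mean used in Question 7; the mean for this question is not recoverable from the slide):

    from scipy.stats import poisson

    lam = 2.5                      # assumed mean number of cars sold

    print(poisson.pmf(1, lam))     # P(X = 1)  ≈ 0.2052
    print(poisson.cdf(2, lam))     # P(X <= 2) ≈ 0.5438 (sum of pmf at 0, 1, 2)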


Question 9
9) A basket contains 10 rotten tomatoes. One fresh tomato is thrown into the
basket by mistake. You pick a tomato at random and if it is rotten you throw it out
and pick again.

What is the probability that you will get the fresh tomato in the first trial ?

A) 1/11
B) 1/10
C) 1
D) 9/10



Question 9
9) A basket contains 10 rotten tomatoes. One fresh tomato is thrown into the
basket by mistake. You pick a tomato at random and if it is rotten you throw it out
and pick again.

What is the probability that you will get the fresh tomato in the first trial ?

A) 1/11
B) 1/10
C) 1
D) 9/10



Question 10
10) A basket contains 10 rotten tomatoes. One fresh tomato is thrown into the
basket by mistake. You pick a tomato at random and if it is rotten you throw it out
and pick again.

What is the probability that you will get the fresh tomato in the third trial ?

A) 1/11
B) 1/10
C) 1
D) 9/10



Question 10
10) A basket contains 10 rotten tomatoes. One fresh tomato is thrown into the
basket by mistake. You pick a tomato at random and if it is rotten you throw it out
and pick again.

What is the probability that you will get the fresh tomato in the third trial ?

A) 1/11
B) 1/10 10 9 1
C) 1 ∗ ∗
D) 9/10 11 10 9

Ritwiz Kamal | IIT Madras 23


Acknowledgments
 Prof. A Ramesh | IIT Roorkee
Data Analytics with Python | NPTEL
 NPTEL Team
 PMRF Team
 Department of CSE, IIT Madras

THANK YOU!



Lecture 11: Sampling and Sampling Distribution
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES
IIT ROORKEE

1
Lecture Objectives
After completing this lecture, you should be able to:
• Describe a simple random sample and why sampling is important
• Explain the difference between descriptive and inferential statistics
• Define the concept of a sampling distribution
• Determine the mean and standard deviation for the sampling distribution of the sample mean, x̄

2
Lecture Objectives

• Describe the Central Limit Theorem and its importance


• Determine the mean and standard deviation for the sampling distribution of the sample proportion, p̂
• Describe sampling distributions of sample variances

3
Descriptive vs Inferential Statistics

• Descriptive statistics
– Collecting, presenting, and describing data
• Inferential statistics
– Drawing conclusions and/or making decisions concerning a population
based only on sample data

4
Populations and Samples

• A Population is the set of all items or individuals of interest


– Examples: All likely voters in the next election
All parts produced today
All sales receipts for November

• A Sample is a subset of the population


– Examples: 1000 voters selected at random for interview
A few parts selected for destructive testing
Random receipts selected for audit

5
Population vs. Sample

[Figure: the population is all letters a through z; the sample is the subset {b, c, g, i, n, o, r, u, y}]

6
Why Sample?
• Less time consuming than a census
• Less costly to administer than a census
• It is possible to obtain statistical results of a sufficiently high precision
based on samples.
• Because the research process is sometimes destructive, the sample can
save product
• If accessing the population is impossible; sampling is the only option

7
Reasons for Taking a Census

• Eliminate the possibility that a random sample is not representative of the


population

• The person authorizing the study is uncomfortable with sample


information
Random Versus Nonrandom Sampling
• Random sampling
• Every unit of the population has the same probability of being included in the
sample.
• A chance mechanism is used in the selection process.
• Eliminates bias in the selection process
• Also known as probability sampling
• Nonrandom Sampling
• Every unit of the population does not have the same probability of being
included in the sample.
• Open the selection bias
• Not appropriate data collection methods for most statistical methods
• Also known as non-probability sampling
Random Sampling Techniques

• Simple Random Sample

• Stratified Random Sample

– Proportionate

– Disproportionate

• Systematic Random Sample

• Cluster (or Area) Sampling


Simple Random Samples

• Every object in the population has an equal chance of being selected


• Objects are selected independently
• Samples can be obtained from a table of random numbers or computer
random number generators
• A simple random sample is the ideal against which other sample methods
are compared

11
Simple Random Sample:
Numbered Population Frame

01 Andhra Pradesh      11 Madhya Pradesh
02 Himachal Pradesh    12 Uttar Pradesh
03 Gujarat             13 Bihar
04 Maharashtra         14 Rajasthan
05 Nagaland            15 J & K
06 Goa                 16 Tamil Nadu
07 West Bengal         17 Karnataka
08 Haryana             18 Kerala
09 Punjab              19 Orissa
10 Delhi               20 Manipur
Simple Random Sampling:
Random Number Table

9 9 4 3 7 8 7 9 6 1 4 5 7 3 7 3 7 5 5 2 9 7 9 6 9 3 9 0 9 4 3 4 4 7 5 3 1 6 1 8
5 0 6 5 6 0 0 1 2 7 6 8 3 6 7 6 6 8 8 2 0 8 1 5 6 8 0 0 1 6 7 8 2 2 4 5 8 3 2 6
8 0 8 8 0 6 3 1 7 1 4 2 8 7 7 6 6 8 3 5 6 0 5 1 5 7 0 2 9 6 5 0 0 2 6 4 5 5 8 7
8 6 4 2 0 4 0 8 5 3 5 3 7 9 8 8 9 4 5 4 6 8 1 3 0 9 1 2 5 3 8 8 1 0 4 7 4 3 1 9
6 0 0 9 7 8 6 4 3 6 0 1 8 6 9 4 7 7 5 8 8 9 5 3 5 9 9 4 0 0 4 8 2 6 8 3 0 6 0 6
5 2 5 8 7 7 1 9 6 5 8 5 4 5 3 4 6 8 3 4 0 0 9 9 1 9 9 7 2 9 7 6 9 4 8 1 5 9 4 1
8 9 1 5 5 9 0 5 5 3 9 0 6 8 9 4 8 6 3 7 0 7 9 5 5 4 7 0 6 2 7 1 1 8 2 6 4 4 9 3
Simple Random Sample:
Sample Members

01 Andhra Pradesh      11 Madhya Pradesh
02 Himachal Pradesh    12 Uttar Pradesh
03 Gujarat             13 Bihar
04 Maharashtra         14 Rajasthan
05 Nagaland            15 J & K
06 Goa                 16 Tamil Nadu
07 West Bengal         17 Karnataka
08 Haryana             18 Kerala
09 Punjab              19 Orissa
10 Delhi               20 Manipur

• N = 20
• n=4
Stratified Random Sample

• Population is divided into non-overlapping subpopulations called strata


• A random sample is selected from each stratum
• Potential for reducing sampling error
• Proportionate -- the percentage of these sample taken from each stratum
is proportionate to the percentage that each stratum is within the
population
• Disproportionate -- proportions of the strata within the sample are
different than the proportions of the strata within the population
Stratified Random Sample:
Population of FM Radio Listeners
Stratified by Age

20–30 years old (homogeneous within, i.e., alike)
30–40 years old (homogeneous within, i.e., alike)
40–50 years old (homogeneous within, i.e., alike)

Heterogeneous (different) between strata
Systematic Sampling
• Convenient and relatively easy to administer
• Population elements are an ordered sequence (at least, conceptually)
• The first sample element is selected randomly from the first k population elements
• Thereafter, sample elements are selected at a constant interval, k, from the ordered sequence frame

k = N/n,  where:
  n = sample size
  N = population size
  k = size of selection interval
Systematic Sampling: Example

• Purchase orders for the previous fiscal year are serialized 1 to 10,000 (N =
10,000).
• A sample of fifty (n = 50) purchases orders is needed for an audit.
• k = 10,000/50 = 200
• First sample element randomly selected from the first 200 purchase
orders. Assume the 45th purchase order was selected.
• Subsequent sample elements: 245, 445, 645, . . .
Cluster Sampling

• Population is divided into non-overlapping clusters or areas

• Each cluster is a miniature of the population.

• A subset of the clusters is selected randomly for the sample.

• If the number of elements in the subset of clusters is larger than the


desired value of n, these clusters may be subdivided to form a new
set of clusters and subjected to a random selection process.
Cluster Sampling
 Advantages
• More convenient for geographically dispersed populations
• Reduced travel costs to contact sample elements
• Simplified administration of the survey
• Unavailability of sampling frame prohibits using other random
sampling methods
 Disadvantages
• Statistically less efficient when the cluster elements are similar
• Costs and problems of statistical analysis are greater than for simple
random sampling
Nonrandom Sampling
• Convenience Sampling: Sample elements are selected for the convenience
of the researcher

• Judgment Sampling: Sample elements are selected by the judgment of the


researcher

• Quota Sampling: Sample elements are selected until the quota controls are
satisfied

• Snowball Sampling: Survey subjects are selected based on referral from


other survey respondents
Errors
• Data from nonrandom samples are not appropriate for analysis by inferential
statistical methods.
• Sampling Error occurs when the sample is not representative of the
population
• Non-sampling Errors
• Missing Data, Recording, Data Entry, and Analysis Errors
• Poorly conceived concepts , unclear definitions, and defective questionnaires
• Response errors occur when people do not know, will not say, or overstate in their answers
Sampling Distribution of x̄

Proper analysis and interpretation of a sample statistic requires knowledge of its distribution.

[Diagram: select a random sample from the population (parameter μ), calculate x̄ to estimate μ, then use the process of inferential statistics to draw conclusions about the parameter from the statistic]
Inferential Statistics

• Making statements about a population by examining sample results

[Diagram: sample statistics (known) → inference → population parameters (unknown, but can be estimated from sample evidence)]

24
Inferential Statistics
Drawing conclusions and/or making decisions concerning a
population based on sample results.

• Estimation
– e.g., Estimate the population mean weight
using the sample mean weight
• Hypothesis Testing
– e.g., Use sample evidence to test the claim
that the population mean weight is 120
pounds

25
Sampling Distributions

• A sampling distribution is a distribution of all of the possible values of a


statistic for a given size sample selected from a population

26
Types of sampling distributions

Sampling
Distributions

Sampling Sampling Sampling


Distribution of Distribution of Distribution of
Sample Sample Sample
Mean Proportion Variance

27
Sampling Distributions of Sample Means

Sampling
Distributions

Sampling Sampling Sampling


Distribution of Distribution of Distribution of
Sample Sample Proportion Sample Variance
Mean

28
Developing a Sampling Distribution

• Assume there is a population … A B C D

• Population size N=4


• Random variable, X,
is age of individuals
• Values of X:
18, 20, 22, 24 (years)

29
Developing a Sampling Distribution
(continued)

Summary Measures for the Population Distribution:

μ = ΣXᵢ/N = (18 + 20 + 22 + 24)/4 = 21

σ = √(Σ(Xᵢ − μ)²/N) = 2.236

[Figure: uniform distribution of the four ages A, B, C, D; P(x) = .25 at each of 18, 20, 22, 24]

30
Developing a Sampling Distribution
(continued)
Now consider all possible samples of size n = 2

16 possible samples (sampling with replacement):
1st Obs \ 2nd Obs:   18      20      22      24
18                 18,18   18,20   18,22   18,24
20                 20,18   20,20   20,22   20,24
22                 22,18   22,20   22,22   22,24
24                 24,18   24,20   24,22   24,24

16 Sample Means:
1st Obs \ 2nd Obs:   18   20   22   24
18                   18   19   20   21
20                   19   20   21   22
22                   20   21   22   23
24                   21   22   23   24

31
Developing a Sampling Distribution
(continued)

• Sampling Distribution of All Sample Means (from the 16 sample means above)

[Figure: histogram of P(X̄) over the values 18, 19, 20, 21, 22, 23, 24 — no longer uniform]
32
Developing a Sampling Distribution
(continued)

• Summary Measures of this Sampling Distribution:

E(X̄) = ΣX̄ᵢ/N = (18 + 19 + 21 + … + 24)/16 = 21 = μ

σ_X̄ = √(Σ(X̄ᵢ − μ)²/N)
    = √(((18 − 21)² + (19 − 21)² + … + (24 − 21)²)/16) = 1.58

33
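A sketch that enumerates the 16 samples and verifies these summary measures:

    import numpy as np
    from itertools import product

    population = [18, 20, 22, 24]

    # All 16 samples of size 2, drawn with replacement
    means = [np.mean(s) for s in product(population, repeat=2)]

    print(np.mean(means))          # 21.0  (equals the population mean)
    print(np.std(means))           # ≈ 1.58 (standard error of the mean)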
Comparing the Population with its Sampling
Distribution
Population (N = 4): μ = 21, σ = 2.236
Sample Means Distribution (n = 2): μ_X̄ = 21, σ_X̄ = 1.58

[Figure: the flat population distribution P(X) of A, B, C, D next to the peaked distribution of sample means P(X̄)]
34
1,800 Randomly Selected Values
from an Exponential Distribution

[Figure: histogram of the 1,800 values; frequency on the vertical axis, X from 0 to 10 — strongly right-skewed]
Means of 60 Samples (n = 2)
from an Exponential Distribution

[Figure: histogram of the 60 sample means, x̄ from 0 to 4 — still skewed, but less so than the population]
Means of 60 Samples (n = 5)
from an Exponential Distribution
[Figure: histogram of the 60 sample means, x̄ from 0 to 4 — closer to symmetric]
Means of 60 Samples (n = 30)
from an Exponential Distribution
[Figure: histogram of the 60 sample means, x̄ from 0 to 3 — approximately normal]
1,800 Randomly Selected Values
from a Uniform Distribution

[Figure: histogram of the 1,800 values; roughly flat frequencies from 0 to 5]
Means of 60 Samples (n = 2)
from a Uniform Distribution

[Figure: histogram of the 60 sample means, x̄ from 1.00 to 4.25 — mound-shaped]
Means of 60 Samples (n = 5)
from a Uniform Distribution

[Figure: histogram of the 60 sample means — more concentrated around the center]
Means of 60 Samples (n = 30)
from a Uniform Distribution
[Figure: histogram of the 60 sample means — tightly concentrated and approximately normal]
Expected Value of Sample Mean

• Let X1, X2, . . . Xn represent a random sample from a population

• The sample mean value of these observations is defined as

X̄ = (1/n) Σ Xᵢ

43
Standard Error of the Mean
• Different samples of the same size from the same population will yield
different sample means
• A measure of the variability in the mean from sample to sample is given by
the Standard Error of the Mean:

σ_X̄ = σ/√n
• Note that the standard error of the mean decreases as the sample size
increases

44
If sample values are not independent
(continued)

• If the sample size n is not a small fraction of the population size N,


then individual sample members are not distributed independently
of one another
• Thus, observations are not selected independently
• A correction is made to account for this:

Var(X̄) = (σ²/n) × (N − n)/(N − 1)   or   σ_X̄ = (σ/√n) √((N − n)/(N − 1))

45
If the Population is Normal

• If a population is normal with mean μ and standard deviation σ, the sampling distribution of X̄ is also normally distributed with

μ_X̄ = μ  and  σ_X̄ = σ/√n

• If the sample size n is not a small fraction of the population size N, then

μ_X̄ = μ  and  σ_X̄ = (σ/√n) √((N − n)/(N − 1))

46
Z-value for Sampling Distribution of the Mean

• Z-value for the sampling distribution of X̄:

Z = (X̄ − μ)/σ_X̄

where: X̄ = sample mean
       μ = population mean
       σ_X̄ = standard error of the mean

47
Sampling Distribution Properties

μ_X̄ = μ  (i.e., x̄ is unbiased)

[Figure: the normal population distribution and the normal sampling distribution have the same mean μ]
48
Sampling Distribution Properties

• For sampling with replacement: as n increases, σ_x̄ decreases

[Figure: a larger sample size gives a narrower sampling distribution around μ]
49
If the Population is not Normal- Central Limit Theorem
We can apply the Central Limit Theorem:

– Even if the population is not normal,
– sample means from the population will be approximately normal as long as the sample size is large enough.

Properties of the sampling distribution:

μ_x̄ = μ  and  σ_x̄ = σ/√n
50
Central Limit Theorem

As the sample size n gets large enough, the sampling distribution becomes almost normal regardless of the shape of the population
51
If the Population is not Normal
(continued)
Sampling distribution properties:

Central Tendency:  μ_x̄ = μ
Variation:  σ_x̄ = σ/√n

[Figure: a non-normal population distribution; the sampling distribution becomes normal as n increases, and narrows for larger sample sizes]
52
How Large is Large Enough?

• For most distributions, n > 25 will give a sampling distribution that is


nearly normal
• For normal population distributions, the sampling distribution of the mean
is always normally distributed

53
Example

• Suppose a large population has mean μ = 8 and standard deviation σ = 3.


Suppose a random sample of size n = 36 is selected.

• What is the probability that the sample mean is between 7.8 and 8.2?

54
Example

Solution:
• Even if the population is not normally distributed, the central limit
theorem can be used (n > 25)
• … so the sampling distribution of x is approximately normal
• … with mean μx = 8
• …and standard deviation
σ_x̄ = σ/√n = 3/√36 = 0.5
55
Example (continued)
Solution (continued):

P(7.8 ≤ X̄ ≤ 8.2) = P((7.8 − 8)/(3/√36) ≤ Z ≤ (8.2 − 8)/(3/√36))
                 = P(−0.4 ≤ Z ≤ 0.4) = 0.3108

[Figure: the sampling distribution of X̄ between 7.8 and 8.2 standardizes to the standard normal between −0.4 and 0.4; each half contributes .1554]
56
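A sketch of this probability in Python (note that the standardized bounds come out to ±0.4):

    import numpy as np
    from scipy.stats import norm

    mu, sigma, n = 8, 3, 36
    se = sigma / np.sqrt(n)             # 0.5

    prob = norm.cdf(8.2, mu, se) - norm.cdf(7.8, mu, se)
    print(prob)                         # P(7.8 <= x-bar <= 8.2) ≈ 0.3108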
Errors in Hypothesis Testing

Dr. A. Ramesh
Department of Management Studies
Indian Institute of Technology Roorkee

1
Example

• We are interested in burning rate of a solid propellant used to power aircrew escape systems

• Burning rate is a random variable that can be described by a probability distribution

• Suppose our interest focus on mean burning rate

• Ho: µ = 50 centimeters per second

• H1: µ ≠ 50 centimeters per second

Reference: Applied statistics and probability for engineers, Douglas C. Montgomery, George C. Runger, John Wiley &
Sons, 2007

2
Value of the null hypothesis

• The value of the null hypothesis can be obtained by

– Past experience or knowledge of the process, or even from the previous tests or experiments

– From some theory or model regarding the process under study

– From external consideration, such as design or engineering specifications, or from contractual

obligations

3
Note: for this example n = 10
Note: for this example we will assume σ = 2.5

4
Type I Error

• The true mean burning rate of the propellant could be equal to 50 centimeters per second

• However, for the randomly selected propellant specimens that are tested, we could observe a value of the test statistic x̄ that falls into the critical region (rejection region).

• We would then reject the null hypothesis Ho in favor of the alternate H1, in fact, Ho is really true

• This type of wrong conclusion is called a type I error

5
Type I Error

• Rejecting the null hypothesis Ho when


it is true is defined as a type I error

6
Type II Error

• Now suppose the true mean burning rate is different from 50 centimeters per second, yet the sample

mean x falls in the acceptance region

• In this case we would fail to reject Ho when it is false

• This type of wrong conclusion is called a type II error

7
Type II Error

• Failing to reject the null


hypothesis when it is false is
defined as a type II error

8
Type I and Type II Errors

                  H0 is correct                         H0 is incorrect
H0 is accepted    correct decision                      Type II error (β): incorrect acceptance
H0 is rejected    Type I error (α): incorrect           correct decision
                  rejection

9
Type I error

• In the propellant burning rate example, a type I error will occur when either x̄ > 51.5 or x̄ < 48.5 when the true mean burning rate is µ = 50 centimeters per second

• Suppose the standard deviation of burning rate is σ = 2.5 centimeters per second and n = 10

• Probability distribution: µ = 50, standard error = 2.5/√10 = 0.79

• The Type I error is

α = P(x̄ < 48.5 when µ = 50) + P(x̄ > 51.5 when µ = 50)

10
[Figure: sampling distribution of x̄ under µ = 50 (standard error 0.79 = 2.5/√10, which is where that number comes from) with both tails shaded; we will reject the null hypothesis (µ = 50) if our sample mean falls in either of these two regions]

11
12
Type I error

• Type I error = 0.057434

• This implies that 5.7 % of all random samples would lead to rejection of the hypothesis Ho: µ=50

centimeters per second.

• We can reduce the type I error by widening the acceptance region. If we make the critical values 48 and 52, the value of alpha is 0.0114 (adding 0.0057 and 0.0057).

• Keeping the critical values at 48.5 and 51.5, increasing the sample size to 16 gives alpha = 0.0164.

13
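A sketch of the α computation with SciPy (norm.sf is the upper-tail probability):

    import numpy as np
    from scipy.stats import norm

    mu, sigma, n = 50, 2.5, 10
    se = sigma / np.sqrt(n)                             # ≈ 0.79

    alpha = norm.cdf(48.5, mu, se) + norm.sf(51.5, mu, se)
    print(alpha)                                        # ≈ 0.0574

    # Widening the acceptance region to (48, 52) shrinks alpha:
    print(norm.cdf(48, mu, se) + norm.sf(52, mu, se))   # ≈ 0.0114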
TYPE II ERROR

14
The pink area is
the probability
of a Type II error
if the actual mean
is 52.

15
Type II Error

• A Type II error will be committed if the sample mean x̄ falls between 48.5 and 51.5 (the critical region boundaries) when µ = 52:

β = P(48.5 ≤ x̄ ≤ 51.5 when µ = 52) = 0.2643

• When µ = 50.5:  β = 0.8923

16
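The same β values, sketched in Python (small differences from the slide come from table rounding):

    import numpy as np
    from scipy.stats import norm

    sigma, n = 2.5, 10
    se = sigma / np.sqrt(n)

    def beta(true_mu, lo=48.5, hi=51.5):
        # P(accepting H0) = P(lo <= x-bar <= hi) when the true mean is true_mu
        return norm.cdf(hi, true_mu, se) - norm.cdf(lo, true_mu, se)

    print(beta(52))     # ≈ 0.264
    print(beta(50.5))   # ≈ 0.891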
17
18
Computing the
probability of a type II
error may be the most
difficult concept

19
For constant n, increasing the acceptance region (hence decreasing α) increases β.

Increasing n can decrease both types of errors.

20
Type I & II Errors Have an Inverse Relationship

If you reduce the probability of one error, the other


one increases so that everything else is unchanged.

21
Factors Affecting Type II Error

• True value of population parameter
– β increases when the difference between the hypothesized parameter and its true value decreases
• Significance level
– β increases when α decreases
• Population standard deviation
– β increases when σ increases
• Sample size
– β increases when n decreases
22
How to Choose between Type I and Type II Errors

• Choice depends on the cost of the errors

• Choose smaller Type I Error when the cost of rejecting the maintained hypothesis is high

– A criminal trial: convicting an innocent person

• Choose larger Type I Error when you have an interest in changing the status quo

23
Calculating the probability of Type II Error

Ho: µ = 8.3
H1: µ < 8.3

Determine the probability of Type II error if µ = 7.4 at 5% significance level. σ = 3.1 and n = 60.

24
Solution:

The critical value is x̄_c = 8.3 − 1.645(3.1/√60) = 7.6417. An error will be made when x̄ ≥ 7.6417 (i.e., when Z ≥ −1.645 under H0), for then we fail to reject Ho.

β = P(x̄ ≥ 7.6417 when µ = 7.4) = 0.2729
25
Solving for Type II Errors:
Example

Ho: µ ≥ 12
Ha: µ < 12
α = .05, z_c = −1.645

x̄_c = µ − 1.645 σ/√n = 12 − (1.645)(0.10/√60) = 11.979

Rejection region: if x̄ < 11.979, reject Ho.
Non-rejection region: if x̄ ≥ 11.979, do not reject Ho.

26
Type II Error for Example with µ = 11.99 Kg

[Figure: when Ho is true (µ = 12), the rejection tail has α = .05 and 95% of decisions are correct; when Ho is false (µ = 11.99), β = .8023 (Type II error) and only 19.77% of decisions are correct]
27
28
Type II Error for Demonstration with µ = 11.96 Kg

[Figure: when Ho is true (µ = 12), α = .05; when Ho is false (µ = 11.96), β = .0708 (Type II error) and 92.92% of decisions are correct — a larger gap between the true and hypothesized means shrinks β]
29
30
Hypothesis Testing and Decision Making

• We have illustrated hypothesis testing applications referred to as significance tests

• In the tests, we compared the p-value to a controlled probability of a Type I error, α, which is called the level of significance for the test

• With a significance test, we control the probability of making the Type I error, but
not the Type II error
• We recommended the conclusion “do not reject H0” rather than “accept H0”
because the latter puts us at risk of making a Type II error

31
Hypothesis Testing and Decision Making

• With the conclusion “do not reject H0”, the statistical evidence is considered inconclusive

• Usually this is an indication to postpone a decision until further research and testing is
undertaken
• In many decision-making situations the decision maker may want, and in some cases may be
forced, to take action with both the conclusion “do not reject H0 “and the conclusion “reject
H0.”

• In such situations, it is recommended that the hypothesis-testing procedure be extended to


include consideration of making a Type II error

32
Power of a test

• The mean response time for a random sample of 40 food orders is 13.25 minutes
• The population standard deviation is believed to be 3.2
minutes.
• The restaurant owner wants to perform a hypothesis test,
with  =0.05 level of significance, to determine whether the
service goal of 12 minutes or less is being achieved.

33
Calculating the Probability of a Type II Error

Hypotheses are: H0: µ ≤ 12 and Ha: µ > 12

Rejection rule is: Reject H0 if z > 1.645

Value of the sample mean that identifies the rejection region:
x̄ = 12 + 1.645(3.2/√40) = 12.8323

We will accept H0 when x̄ < 12.8323

34
Calculating the Probability of a Type II Error

Probabilities that the sample mean will be in the acceptance region:

Value of µ     z        β       1 − β (power)
14.0         -2.31    .0104    .9896
13.6         -1.52    .0643    .9357
13.2         -0.73    .2327    .7673
12.8323       0.00    .5000    .5000
12.8          0.06    .5239    .4761
12.4          0.85    .8023    .1977
12.0001       1.645   .9500    .0500

35
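A sketch that reproduces this table and the power curve values:

    import numpy as np
    from scipy.stats import norm

    sigma, n, x_crit = 3.2, 40, 12.8323
    se = sigma / np.sqrt(n)

    for mu in [14.0, 13.6, 13.2, 12.8323, 12.8, 12.4, 12.0001]:
        beta = norm.cdf(x_crit, mu, se)   # P(accepting H0) when the true mean is mu
        print(mu, round(beta, 4), round(1 - beta, 4))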
36
Power of the Test

• The probability of correctly rejecting H0 when it is false is called the power of the test.

• For any particular value of µ, the power is 1 − β.

• We can show graphically the power associated with each value of µ; such a graph is called a power curve.

37
Power Curve

[Figure: power curve — probability of correctly rejecting H0 (vertical axis, 0.00 to 1.00) plotted against µ (horizontal axis, 11.5 to 14.5) over the region where H0 is false]
38
Thank You

39
Hypothesis Testing
Class Objectives

• Developing Null and Alternative Hypotheses

• Type I and Type II Errors- Explanation

• Population Mean: Sigma Known

• Population Mean: Sigma Unknown

• Population Proportion
Hypothesis Testing

• Hypothesis testing can be used to determine whether a statement about


the value of a population parameter should or should not be rejected.

• The null hypothesis, denoted by H0 , is a tentative assumption about a


population parameter

• The alternative hypothesis, denoted by Ha, is the opposite of what is stated


in the null hypothesis

• The hypothesis testing procedure uses data from a sample to test the two
competing statements indicated by H0 and Ha.
Developing Null and Alternative Hypotheses

• It is not always obvious how the null and alternative hypotheses should be
formulated

• Care must be taken to structure the hypotheses appropriately so that the test
conclusion provides the information the researcher wants

• The context of the situation is very important in determining how the hypotheses
should be stated

• In some cases it is easier to identify the alternative hypothesis first. In other


cases the null is easier

• Correct hypothesis formulation will take practice


Developing Null and Alternative Hypotheses

Alternative Hypothesis as a Research Hypothesis


•Many applications of hypothesis testing involve an attempt to gather evidence in support of a research
hypothesis

• In such cases, it is often best to begin with the alternative hypothesis and make it the conclusion that
the researcher hopes to support

• The conclusion that the research hypothesis is true is made if the sample data provide sufficient
evidence to show that the null hypothesis can be rejected
Developing Null and Alternative Hypotheses

Alternative Hypothesis as a Research Hypothesis

• Example: A new manufacturing method is believed to be better than the current method.

• Alternative Hypothesis:

– The new manufacturing method is better.

• Null Hypothesis:

– The new method is no better than the old method.


Developing Null and Alternative Hypotheses

• Alternative Hypothesis as a Research Hypothesis

• Example: A new bonus plan is developed in an attempt to increase sales

• Alternative Hypothesis:

– The new bonus plan increases sales

• Null Hypothesis:

– The new bonus plan does not increase sales


Developing Null and Alternative Hypotheses

• Alternative Hypothesis as a Research Hypothesis

• Example:

– A new drug is developed with the goal of lowering Cholesterol-level more


than the existing drug

• Alternative Hypothesis:

– The new drug lowers Cholesterol-level more than the existing drug

• Null Hypothesis:

– The new drug does not lower Cholesterol-level more than the existing
drug
Developing Null and Alternative Hypotheses

• Null Hypothesis as an assumption to be challenged

• We might begin with a belief or assumption that a statement about the value of a population
parameter is true

• We then using a hypothesis test to challenge the assumption and determine if there is statistical
evidence to conclude that the assumption is incorrect

• In these situations, it is helpful to develop the null hypothesis first


Developing Null and Alternative Hypotheses

• Null Hypothesis as an Assumption to be Challenged

• Example:

– The label on a milk bottle states that it contains 1000 ml

• Null Hypothesis:

– The label is correct. µ ≥ 1000 ml

• Alternative Hypothesis:

– The label is incorrect. µ < 1000 ml


Null and Alternative Hypotheses about a Population Mean 

• The equality part of the hypotheses always appears in the null hypothesis

• In general, a hypothesis test about the value of a population mean  must take one of the following
three forms (where 0 is the hypothesized value of the population mean)

One-tailed (lower-tail): H0: µ ≥ µ0, Ha: µ < µ0
One-tailed (upper-tail): H0: µ ≤ µ0, Ha: µ > µ0
Two-tailed: H0: µ = µ0, Ha: µ ≠ µ0
Null and Alternative Hypotheses
• A major hospital in Chennai provides
one of the most comprehensive
emergency medical services in the
world
• Operating in a multiple hospital
system with approximately 10 mobile
medical units, the service goal is to
respond to medical emergencies with
a mean time of 8 minutes or less
• The director of medical services
wants to formulate a hypothesis test
that could use a sample of
emergency response times to
determine whether or not the
service goal of 8 minutes or less is
being achieved.
Null and Alternative Hypotheses

H0: µ ≤ 8 — The emergency service is meeting the response goal; no follow-up action is necessary.

Ha: µ > 8 — The emergency service is not meeting the response goal; appropriate follow-up action is necessary.

where: µ = mean response time for the population of medical emergency requests
Type I Error

• Because hypothesis tests are based on sample data, we must allow for the
possibility of errors

• A Type I error is rejecting H0 when it is true

• The probability of making a Type I error when the null hypothesis is true is called the level of significance

• Applications of hypothesis testing that only control the Type I error are
often called significance tests
Type II Error

• A Type II error is accepting H0 when it is false.

• It is difficult to control for the probability of making a Type II error.

• Statisticians avoid the risk of making a Type II error by using “do not reject H0” and not “accept H0”.
Type I and Type II Errors

Population Condition

Conclusion                    H0 True (µ ≤ 8)     H0 False (µ > 8)
Accept H0 (Conclude µ ≤ 8)    Correct Decision    Type II Error
Reject H0 (Conclude µ > 8)    Type I Error        Correct Decision
Three Approaches for Hypothesis Testing

• P- Value

• Critical Value

• Confidence Interval Value


p-Value Approach to One-Tailed Hypothesis Testing

• The p-value is the probability, computed using the test statistic, that measures the support (or lack of

support) provided by the sample for the null hypothesis

• If the p-value is less than or equal to the level of significance  , the value of the test statistic is in the

rejection region

• Reject H0 if the p-value < 


Lower-Tailed Test About a Population Mean: s Known

p-Value Approach: p-value < α, so reject H0.

[Figure: sampling distribution of z for a lower-tailed test with α = .10; the p-value is the area to the left of z = −1.46, which is smaller than the area α to the left of −zα = −1.28.]
Upper-Tailed Test About a Population Mean :s Known

p-Value Approach: p-value < α, so reject H0.

[Figure: sampling distribution of z for an upper-tailed test with α = .04; the p-value is the area to the right of z = 2.29, which is smaller than the area α to the right of zα = 1.75.]
Critical Value Approach to One-Tailed Hypothesis Testing
• The test statistic z has a standard
normal probability distribution.
• We can use the standard normal
probability distribution table to
find the z-value with an area of 
in the lower (or upper) tail of the
distribution.
• The value of the test statistic that
established the boundary of the
rejection region is called the
critical value for the test.
• The rejection rule is:
Lower tail: Reject H0 if z < -z
Upper tail: Reject H0 if z > z
Lower-Tailed Test About a Population Mean: s Known

Critical Value Approach

[Figure: sampling distribution of z; reject H0 where z < −zα = −1.28 (lower-tail area α), otherwise do not reject H0.]
Upper-Tailed Test About a Population Mean: s Known
Critical Value Approach

[Figure: sampling distribution of z; reject H0 where z > zα = 1.645 (upper-tail area α), otherwise do not reject H0.]
Steps of Hypothesis Testing – P value approach

• Step 1. Develop the null and alternative hypotheses.

• Step 2. Specify the level of significance α.

• Step 3. Collect the sample data and compute the test statistic.

• p-Value Approach

• Step 4. Use the value of the test statistic to compute the p-value.

• Step 5. Reject H0 if p-value < α.


Steps of Hypothesis Testing

Critical Value Approach

• Step 4. Use the level of significance α to determine the critical value and the rejection rule.

• Step 5. Use the value of the test statistic and the rejection rule to determine whether to reject H0.


Hypothesis Testing

1
Class Objectives

• Population Mean: Sigma Known –Example

2
One-Tailed Tests About a Population Mean: s Known

• Example: The mean response time for a random sample of 30 pizza deliveries is 32 minutes.
• The population standard deviation is believed to be 10 minutes.
• The pizza delivery service's director wants to perform a hypothesis test, with α = 0.05 level of significance, to determine whether the service goal of 30 minutes or less is being achieved.

3
Given Values

• Sample • Population
• Sample mean = 32 Min • a =0.05
• Sample size = 30 • Population mean = 30 Min

4
p -Value Approach

5
One-Tailed Tests About a Population Mean:
s Known
1. Develop the hypotheses: H0: µ ≤ 30, Ha: µ > 30
2. Specify the level of significance: α = .05
3. Compute the value of the test statistic:

z = (x̄ − µ0) / (σ/√n) = (32 − 30) / (10/√30) = 1.09

6
7
One-Tailed Tests About a Population Mean: s Known

p –Value Approach
4. Compute the p –value.

For z = 1.09, p-value = P(Z ≥ 1.09) = 1 − .8621 = 0.137

5. Determine whether to reject H0.

• Because p-value = 0.137 > α = .05, we do not reject H0.

• There is not sufficient statistical evidence to infer that the pizza delivery service is not meeting the response goal of 30 minutes.
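
A minimal Python sketch of this test (scipy's normal CDF gives the exact tail area; z differs from the slide only in rounding):

from scipy import stats
import math

x_bar, mu0, sigma, n = 32, 30, 10, 30
z = (x_bar - mu0) / (sigma / math.sqrt(n))
p_value = 1 - stats.norm.cdf(z)          # upper-tail area
print(round(z, 2), round(p_value, 3))    # z ≈ 1.10, p ≈ 0.137 -> do not reject H0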

8
One-Tailed Tests About a Population Mean: s Known
p –Value Approach

[Figure: sampling distribution of z for the upper-tailed test; the p-value 0.137 is the area to the right of z = 1.09, which exceeds α = .05, the area to the right of zα = 1.645.]

9
Critical Value Approach

10
One-Tailed Tests About a Population Mean: s Known

Critical Value Approach


4. Determine the critical value and rejection rule.

– For α = .05, z.05 = 1.645

– Reject H0 if z > 1.645

5. Determine whether to reject H0.

– Because z = 1.09 < 1.645, we do not reject H0.

11
p-Value Approach to Two-Tailed Hypothesis Testing

12
Compute the p-value using the following three steps:

1. Compute the value of the test statistic z.

2. If z is in the upper tail (z > 0), find the area under the standard normal curve to the right of z; if z is in the lower tail (z < 0), find the area to the left of z.

3. Double the tail area obtained in step 2 to obtain the p-value.

The rejection rule:

Reject H0 if the p-value < α.

13
Critical Value Approach to Two-Tailed Hypothesis Testing

• The critical values will occur in both the lower and upper tails of the standard normal curve.

• Use the standard normal probability distribution table to find za/2 (the z-value with an area of a/2
in the upper tail of the distribution).

• The rejection rule is:

Reject H0 if z < -za/2 or z > za/2.

14
Two-Tailed Tests About a Population Mean:
s Known

• Example: Milk Carton


• Assume that a sample of 30 milk cartons provides a sample mean of 505 ml.
• The population standard deviation is believed to be 10 ml.
• Perform a hypothesis test, at the 0.03 level of significance, of whether the population mean is 500 ml, to help determine whether the filling process should continue operating or be stopped and corrected.

15
Given Values

• Sample • Population
• Sample size = 30 • Population mean = 500 ml
• Sample mean = 505 ml • Standard deviation = 10 ml
• Significance level 0.03

16
p –Value approach

17
Two-Tailed Tests About a Population Mean:
s Known
1. Determine the hypotheses: H0: µ = 500, Ha: µ ≠ 500
2. Specify the level of significance: α = .03
3. Compute the value of the test statistic:

z = (x̄ − µ0) / (σ/√n) = (505 − 500) / (10/√30) = 2.74

18
19
Two-Tailed Tests About a Population Mean:
s Known
p –Value Approach
4. Compute the p –value.
– For z = 2.74, p-value = 2(1 − .9969) = .0062

5. Determine whether to reject H0.


– Because p–value = .0062 < a = .03, we reject H0.

There is sufficient statistical evidence to infer that the null hypothesis is false (i.e., the mean filling quantity is not 500 ml).

20
Two-Tailed Tests About a Population Mean: s Known

p-Value Approach

[Figure: two-tailed test; each tail area beyond z = ±2.74 contributes ½ p-value = .0031 (total p-value = .0062), compared with α/2 = .015 in each tail beyond ±zα/2 = ±2.17.]

21
Critical Value Approach

22
Two-Tailed Tests About a Population Mean :s Known

• Critical Value Approach


4. Determine the critical value and rejection rule: for α/2 = .03/2 = .015, z.015 = 2.17

Reject H0 if z < -2.17 or z > 2.17

5. Determine whether to reject H0.

Because 2.74 > 2.17, we reject H0.

There is sufficient statistical evidence to infer that the null hypothesis is not true

23
24
Two-Tailed Tests About a Population Mean :s Known

Critical Value Approach


[Figure: sampling distribution of z = (x̄ − µ0)/(σ/√n) = (505 − 500)/(10/√30) = 2.74; reject H0 if z < −2.17 or z > 2.17 (α/2 = .015 in each tail), otherwise do not reject H0.]

25
Confidence Interval Approach

26
Confidence Interval Approach to
Two-Tailed Tests About a Population Mean

• Select a simple random sample from the population and use the value of the sample mean to
develop the confidence interval for the population mean .

• If the confidence interval contains the hypothesized value 500, do not reject H0.

• Otherwise, reject H0.

• Actually, H0 should be rejected if 0 happens to be equal to one of the end points of the confidence
interval.

27
Confidence Interval Approach to Two-Tailed Tests About a Population Mean
The 97% confidence interval for µ is

x̄ ± zα/2(σ/√n) = 505 ± 2.17(10/√30) = 505 ± 3.9619, i.e. (501.0381, 508.9619)

Because the hypothesized value for the population mean, µ0 = 500 ml, is not in this interval, the hypothesis-testing conclusion is that the null hypothesis, H0: µ = 500, is rejected.
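
A minimal sketch of this interval in Python:

from scipy import stats
import math

x_bar, mu0, sigma, n, alpha = 505, 500, 10, 30, 0.03
margin = stats.norm.ppf(1 - alpha / 2) * sigma / math.sqrt(n)
ci = (x_bar - margin, x_bar + margin)
print(ci)                            # ≈ (501.04, 508.96)
print(mu0 < ci[0] or mu0 > ci[1])    # True -> 500 lies outside, so reject H0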

28
Thanks

29
Hypothesis Testing: Two sample test

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT
IIT ROORKEE

1
Hypothesis Testing about the Difference in Two
Sample Means
[Figure: independent samples of sizes n1 and n2 are drawn from population 1 (mean µ1) and population 2 (mean µ2); the sample means x̄1 and x̄2 are computed, and inference is based on the difference x̄1 − x̄2.]

2
Two Sample Tests
Two Sample Tests

Population Population
Means, Means, Population Population
Independent Dependent Proportions Variances
Samples Samples
Examples:
Group 1 vs. Same group before Proportion 1 vs. Variance 1 vs.
independent vs. after treatment Proportion 2 Variance 2
Group 2

3
Difference Between Two Means

Population means, independent samples:

σ12 and σ22 known — the test statistic is a z value

σ12 and σ22 unknown (assumed equal or assumed unequal) — the test statistic is a value from the Student's t distribution
4
σ12 and σ22 Known

Population means, Assumptions:


independent samples
 Samples are randomly and
independently drawn

σ12 and σ22 known  both population distributions


are normal
σ12 and σ22 unknown
 Population variances are
known

5
σ12 and σ22 Known

When σ1² and σ2² are known and both populations are normal, the variance of X̄1 − X̄2 is

σ²(X̄1 − X̄2) = σ1²/n1 + σ2²/n2

…and the random variable

Z = [ (x̄1 − x̄2) − (µ1 − µ2) ] / √( σ1²/n1 + σ2²/n2 )

has a standard normal distribution.
6
Test Statistic, σ12 and σ22 Known

Population means, independent samples, σ1² and σ2² known.

H0: µ1 − µ2 = D0

The test statistic for µ1 − µ2 is:

z = [ (x̄1 − x̄2) − D0 ] / √( σ1²/n1 + σ2²/n2 )

7
Hypothesis Tests for Two Population Means

Two Population Means, Independent Samples

Lower-tail test: Upper-tail test: Two-tail test:


H0: μ1  μ2 H0: μ1 ≤ μ2 H0: μ1 = μ2
H1: μ1 < μ2 H1: μ1 > μ2 H1: μ1 ≠ μ2
i.e., i.e., i.e.,
H0: μ1 – μ2  0 H0: μ1 – μ2 ≤ 0 H0: μ1 – μ2 = 0
H1: μ1 – μ2 < 0 H1: μ1 – μ2 > 0 H1: μ1 – μ2 ≠ 0

8
Decision Rules

Lower-tail: Reject H0 if z < −zα
Upper-tail: Reject H0 if z > zα
Two-tail: Reject H0 if z < −zα/2 or z > zα/2

9
Hypothesis Testing about the Difference in Two
Sample Means

[Figure: the difference X̄1 − X̄2 has mean µ1 − µ2 and standard error σ(X̄1 − X̄2) = √( σ1²/n1 + σ2²/n2 ).]

10
Sampling Distribution of x1  x2

• Expected Value: E(x̄1 − x̄2) = µ1 − µ2

• Standard Deviation (Standard Error): σ(x̄1 − x̄2) = √( σ1²/n1 + σ2²/n2 )

where: σ1 = standard deviation of population 1
σ2 = standard deviation of population 2
n1 = sample size from population 1
n2 = sample size from population 2

11
Interval Estimation of 1 - 2:  1 and  2 Known
• Interval Estimate: x̄1 − x̄2 ± zα/2 √( σ1²/n1 + σ2²/n2 )

where: 1 − α is the confidence coefficient

12
Problem ( 1 and  2 Known)
• A product developer is interested in reducing the drying time of a primer paint.
• Two formulations of the paint are tested; formulation 1 is the standard chemistry, and
formulation 2 has a new drying ingredient that should reduce the drying time.
• From experience, it is known that the standard deviation of drying time is 8 minutes, and this
inherent variability should be unaffected by the addition of the new ingredient.
• Ten specimens are painted with formulation 1, and another 10 specimens are painted with
formulation 2; the 20 specimens are painted in random order.
• The two-sample average drying times are 𝑥1 = 121 minutes and 𝑥2 = 112 minutes,
respectively.
• What conclusions can the product developer draw about the effectiveness of the new
ingredient, using alpha = 0.05?
Source: Applied Probability and statistics for Engineers by Douglas C. Montgomery and George C. Runger John Wiley, 3rd Ed. 2003

13
Problem ( 1 and  2 Known)

14
Problem ( 1 and  2 Known)

15
Problem ( 1 and  2 Known)
z = [ (121 − 112) − 0 ] / √( 8²(1/10 + 1/10) ) = 2.52

With α = .05 the critical value is 1.645, so z = 2.52 falls in the rejection region.

Decision: Reject H0 at α = 0.05
Conclusion: There is evidence of a difference in means.

16
Problem ( 1 and  2 Known)

17
Problem ( 1 and  2 Known)

18
σ12 and σ22 Unknown, Assumed Equal

Population means, independent samples; σ1² and σ2² unknown but assumed equal.

Assumptions:
• Samples are randomly and independently drawn
• Populations are normally distributed
• Population variances are unknown but assumed equal

19
σ12 and σ22 Unknown, Assumed Equal

• The population variances are assumed equal, so use the two sample
standard deviations and pool them to estimate σ

• use a t value with (n1 + n2 – 2) degrees of freedom

20
Test Statistic, σ12 and σ22 Unknown, Equal
The test statistic for
μ1 – μ2 is:

t = [ (x̄1 − x̄2) − (μ1 − μ2) ] / √( sp²/n1 + sp²/n2 )

where t has (n1 + n2 − 2) d.f., and

sp² = [ (n1 − 1)s1² + (n2 − 1)s2² ] / (n1 + n2 − 2)

21
Decision Rules


22
Decision Rules

23
σ12 and σ22 Unknown, Assumed equal
• Two catalysts are being analyzed to determine how they affect the mean yield of a chemical process.
• Specifically, catalyst 1 is currently in use, but catalyst 2 is acceptable.
• Since catalyst 2 is cheaper, it should be adopted, provided it does not change the process yield.
• A test is run in the pilot plant and results in the data shown in the table.
• Is there any difference between the mean yields? Use α = 0.05, and assume equal variances.

Observation Number   Catalyst 1   Catalyst 2
1                    91.50        89.19
2                    94.18        90.95
3                    92.18        90.46
4                    95.39        93.21
5                    91.79        97.19
6                    89.07        97.04
7                    94.72        91.07
8                    89.21        92.75
x̄1 = 92.255          x̄2 = 92.733
s1 = 2.39            s2 = 2.98
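
A sketch of this pooled-variance test with scipy (ttest_ind with equal_var=True pools the two sample variances):

from scipy import stats

catalyst1 = [91.50, 94.18, 92.18, 95.39, 91.79, 89.07, 94.72, 89.21]
catalyst2 = [89.19, 90.95, 90.46, 93.21, 97.19, 97.04, 91.07, 92.75]

t_stat, p_value = stats.ttest_ind(catalyst1, catalyst2, equal_var=True)
print(round(t_stat, 2), round(p_value, 3))   # t ≈ -0.35, p ≈ 0.73 -> do not reject H0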
24
σ12 and σ22 Unknown, Assumed equal

25
σ12 and σ22 Unknown, Assumed equal

26
σ12 and σ22 Unknown, Assumed equal

27
σ12 and σ22 Unknown, Assumed equal

28
Thank You

29
Hypothesis Testing-III

1
Tests About a Population Mean:s Unknown
• Test Statistic

t = (x̄ − µ0) / (s/√n)

This test statistic has a t distribution with n − 1 degrees of freedom.

2
Tests About a Population Mean:s Unknown

Rejection Rule: p-Value Approach

Reject H0 if p-value < α

Rejection Rule: Critical Value Approach
H0: µ ≥ µ0 — Reject H0 if t < −tα
H0: µ ≤ µ0 — Reject H0 if t > tα
H0: µ = µ0 — Reject H0 if t < −tα/2 or t > tα/2

3
4
One-Tailed Test About a Population Mean: s Unknown
Example: Ice Cream Demand
• In an ice cream parlor at IIT Roorkee, the following data represent the number of ice-creams sold over 20 days
• Test the hypothesis H0: µ ≥ 10 against Ha: µ < 10
• Use α = .05 to test the hypothesis.

Day   No. of Ice-creams Sold   Day   No. of Ice-creams Sold
1     13                       11    12
2     8                        12    11
3     10                       13    11
4     10                       14    12
5     8                        15    10
6     9                        16    12
7     10                       17    7
8     11                       18    10
9     6                        19    11
10    8                        20    8
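
A sketch of this test with scipy, reading the hypotheses as H0: µ ≥ 10 vs Ha: µ < 10 (lower tail), per the deck's convention that the equality part sits in H0:

from scipy import stats

sold = [13, 8, 10, 10, 8, 9, 10, 11, 6, 8,
        12, 11, 11, 12, 10, 12, 7, 10, 11, 8]

t_stat, p_value = stats.ttest_1samp(sold, popmean=10, alternative='less')
print(round(t_stat, 2), round(p_value, 3))
# t ≈ -0.36, p ≈ 0.36 -> do not reject H0 at alpha = .05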

5
Given Data

6
7
One-Tailed Test About a Population Mean:
s Unknown

[Figure: t distribution for the lower-tailed test; reject H0 if t falls in the lower-tail rejection region, otherwise do not reject H0.]

8
Hypothesis Testing – proportion

9
Null and Alternative Hypotheses: Population Proportion

• The equality part of the hypotheses always appears in the null hypothesis.

• In general, a hypothesis test about the value of a population proportion p must take one of the
following three forms (where p0 is the hypothesized value of the population proportion).

One-tailed (lower tail): H0: p ≥ p0, Ha: p < p0
One-tailed (upper tail): H0: p ≤ p0, Ha: p > p0
Two-tailed: H0: p = p0, Ha: p ≠ p0

10
Tests About a Population Proportion
Test Statistic

z = (p̄ − p0) / σp̄,  where σp̄ = √( p0(1 − p0)/n )

assuming np > 5 and n(1 − p) > 5

11
Tests About a Population Proportion
Rejection Rule: p-Value Approach
Reject H0 if p-value < α

Rejection Rule: Critical Value Approach
H0: p ≤ p0 — Reject H0 if z > zα
H0: p ≥ p0 — Reject H0 if z < −zα
H0: p = p0 — Reject H0 if z < −zα/2 or z > zα/2

12
Two-Tailed Test About a Population Proportion
Example: City Traffic Police

For a New Year’s week, the City


Traffic Police claimed that 50% of the
accidents would be caused by drunk
driving.

A sample of 120 accidents showed


that 67 were caused by drunk driving.
Use these data to test the Traffic
Police’s claim with  = .05.

13
p –Value Approach

14
Two-Tailed Test About a Population Proportion

H 0 : p  .5
1. Determine the hypotheses.
H a : p  .5

2. Specify the level of significance.  = .05

3. Compute the value of the test statistic.

p0 (1  p0 ) .5(1  .5)
sp    .045644
n 120
p  p0 (67 /120)  .5
z   1.28
sp .045644
15
Two-Tailed Test About a Population Proportion

4. Compute the p -value.

For z = 1.28, cumulative probability = .8997 p–value = 2(1 - .8997) = .2006

5. Determine whether to reject H0.

Because p-value = .2006 > α = .05, we cannot reject H0.
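
A minimal Python sketch of this two-tailed proportion test (the exact p-value differs slightly from the slide's table lookup of .2006):

from scipy import stats
import math

p0, n, x = 0.5, 120, 67
p_hat = x / n
se = math.sqrt(p0 * (1 - p0) / n)        # standard error under H0
z = (p_hat - p0) / se
p_value = 2 * (1 - stats.norm.cdf(abs(z)))
print(round(z, 2), round(p_value, 3))    # z ≈ 1.28, p ≈ 0.20 -> cannot reject H0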

16
17
Critical Value Approach

18
Two-Tailed Test About a Population Proportion

4. Determine the critical value and rejection rule.

For /2 = .05/2 = .025, z.025 = 1.96

Reject H0 if z < -1.96 or z > 1.96

5. Determine whether to reject H0.

Because −1.96 < z = 1.28 < 1.96, we cannot reject H0.

19
ANOVA

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Sample Size Calculation


• One Way ANOVA – Introduction

2
Determining Sample Size when Estimating 
X 
• Z formula Z

n

• Error of Estimation (tolerable error) E  X 


• Estimated Sample Size Z    Z  
2 2 2

n 2
 
2

 E 
2
E
1
• Estimated  
4
range

3
Example: Sample Size when Estimating 

E  1,   4
90% confidence  Z  1.645

Z 
2 2

n 2
2
E
2 2
(1645
. ) (4)
 2
1
 43.30 or 44

4
Example
E  2, range  25
95% confidence  Z  196
.
1  1
estimated  : range     25  6.25
4  4

Z
2 2

n 2
E
2 2
(196
. ) (6.25)
 2
2
 37.52 or 38
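
A small Python sketch covering both examples above (the sample size is always rounded up):

import math
from scipy import stats

def n_for_mean(E, sigma, conf):
    z = stats.norm.ppf(1 - (1 - conf) / 2)     # two-sided z for the confidence level
    return math.ceil(z**2 * sigma**2 / E**2)

print(n_for_mean(E=1, sigma=4, conf=0.90))       # 44
print(n_for_mean(E=2, sigma=25 / 4, conf=0.95))  # 38 (sigma estimated as range/4)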
5
Determining Sample Size when Estimating P

• Z formula pP
Z
PQ
n
• Error of Estimation (tolerable error) E  pP
2

n Z PQ
• Estimated Sample Size E
2

6
Example
E  0.03
98% Confidence  Z  2.33
estimated P  0.40
Q  1  P  0.60

n
Z PQ 2
E
(2.33)  0.40 0.60
2


.003 2

 1,447.7 or 1,448
7
Determining Sample Size when Estimating P
with No Prior Information
P     PQ
0.5   0.25
0.4   0.24
0.3   0.21
0.2   0.16
0.1   0.09

[Chart: required n versus P for Z = 1.96, E = 0.05; n is largest at P = 0.5.]

With no prior estimate of P, use P = 0.5:  n = Z²(1/4) / E²
8
Example
E  0.05
90% Confidence  Z  1645
.
with no prior estimate of P, use P  0.50
Q  1  P  0.50
2

n
Z PQ
2
E
. )  0.50 0.50
2
(1645

.05 2

 270.6 or 271
9
Why ANOVA?
• We could compare the means, one by one using t-tests for difference of
means.
• Problem: each test carries its own Type I error.
• Across m pairwise comparisons, the total Type I error is 1 − (1 − α)^m.
• For example, if there are 5 means and you use α = .05, you must make m = 10 two-by-two comparisons.
• Thus, the total Type I error is 1 − (.95)^10, which is .4012.
• That is, 40% of the time you will reject the null hypothesis of equal
means in favor of the alternative!

10
Hypothesis Testing With Categorical Data

• Chi Square tests can be viewed as a generalization of Z tests of


proportions
• Analysis of Variance (ANOVA) can be viewed as a generalization of t-
tests: a comparison of differences of means across more than 2
groups.
• Like Chi Square, if there are only two groups, the two analyses will
produce identical results – thus a t-test or ANOVA can be used with 2
groups

11
Production Process inputs and outputs

12
Application of quality-engineering techniques and
the systematic reduction of process variability

13
Effect of Teaching Methodology
Group 1 Group 2 Group 3
Black Board Case Presentation PPT

4 2 2

3 4 1

2 6 3

x̄1 = (4 + 3 + 2)/3 = 3
x̄2 = (2 + 4 + 6)/3 = 4
x̄3 = (2 + 1 + 3)/3 = 2

x̄ = (4 + 3 + 2 + 2 + 4 + 6 + 2 + 1 + 3)/9 = 3

SST = (4−3)² + (3−3)² + (2−3)² + (2−3)² + (4−3)² + (6−3)² + (2−3)² + (1−3)² + (3−3)²
    = 1 + 0 + 1 + 1 + 1 + 9 + 1 + 4 + 0 = 18
SSB = 3(3−3)² + 3(4−3)² + 3(2−3)² = 0 + 3 + 3 = 6
SSE = (4−3)² + (3−3)² + (2−3)² + (2−4)² + (4−4)² + (6−4)² + (2−2)² + (1−2)² + (3−2)²
    = 1 + 0 + 1 + 4 + 0 + 4 + 0 + 1 + 1 = 12

15
ANOVA
Source of Variation SS df MS F P-value F crit
Between Groups 6 2 3 1.5 0.296296 5.143253
Within Groups 12 6 2

Total 18 8
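
A one-line scipy sketch that reproduces this F and p-value:

from scipy import stats

blackboard = [4, 3, 2]
case_pres  = [2, 4, 6]
ppt        = [2, 1, 3]

f_stat, p_value = stats.f_oneway(blackboard, case_pres, ppt)
print(round(f_stat, 2), round(p_value, 3))   # F = 1.5, p ≈ 0.296 -> do not reject H0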

16
Thank You

17
ANOVA

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Effect of Teaching Methodology
Group 1 Group 2 Group 3
Black Board Case Presentation PPT

4 2 2

3 4 1

2 6 3
ANOVA with Python

3
Pandas.melt command

• pd.melt allows you to 'unpivot' data from a 'wide format' into a 'long format', with each row representing a single data point (see the sketch below).
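
A minimal sketch of the melt step on the teaching-methodology table (the column names here are illustrative, not from the original notebook):

import pandas as pd

wide = pd.DataFrame({'BlackBoard': [4, 3, 2],
                     'CasePresentation': [2, 4, 6],
                     'PPT': [2, 1, 3]})

long = pd.melt(wide, var_name='method', value_name='score')
print(long)   # 9 rows: one (method, score) pair per observation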

4
Jupyter code

5
6
Transforming table

7
8
Analysis of Variance: A Conceptual Overview

• Analysis of Variance (ANOVA) can be used to test for the equality of three
or more population means

• Data obtained from observational or experimental studies can be used for


the analysis

• We want to use the sample results to test the following hypotheses:


H0: 1=2=3=. . . = k
Ha: Not all population means are equal

9
Analysis of Variance: A Conceptual Overview

H0: 1=2=3=. . .= k
Ha: Not all population means are equal

• If H0 is rejected, we cannot conclude that all population means are equal


• Rejecting H0 means that at least two population means have different
values

10
Analysis of Variance: A Conceptual Overview

Assumptions for Analysis of Variance

• For each population, the response (dependent) variable is normally


distributed

• The variance of the response variable, denoted σ², is the same for all of the populations

• The observations must be independent

11
Analysis of Variance: A Conceptual Overview
• Sampling Distribution of 𝑥 Given H0 is True

Sample means are close


together because there is only
one sampling distribution
when H0 is true.
2
 
2
x
n

x2  x1 x3
Analysis of Variance: A Conceptual Overview
• Sampling Distribution of 𝑥 Given H0 is False
Sample means come from different sampling distributions and are not as close together when H0 is false.

[Figure: three separate sampling distributions centered at µ1, µ2, µ3, with x̄1, x̄2, x̄3 spread apart.]
Analysis of Variance (ANOVA)

One-Way Two-Way
ANOVA ANOVA

F-test Interaction
Effects
Tukey-
Kramer
test
General ANOVA Setting

• Investigator controls one or more factors of interest


– Each factor contains two or more levels
– Levels can be numerical or categorical
– Different levels produce different groups
– Think of the groups as populations
• Observe effects on the dependent variable
– Are the groups the same?
• Experimental design: the plan used to collect the data
Completely Randomized Design

• Experimental units (subjects) are assigned randomly to the


different levels (groups)
– Subjects are assumed homogeneous
• Only one factor or independent variable
– With two or more levels (groups)
• Analyzed by one-factor analysis of variance (one-way ANOVA)
Analysis of Variance and the Completely
Randomized Design
• Between-Treatments Estimate of Population Variance

• Within-Treatments Estimate of Population Variance

• Comparing the Variance Estimates: The F Test

• ANOVA Table

17
Analysis of Variance and the Completely
Randomized Design

H0: 1=2=3=. . .= k
Ha: Not all population means are equal

where
𝑗 = mean of the 𝑗𝑡ℎ population

18
Analysis of Variance and the Completely
Randomized Design

H0: 1=2=3=. . .= k
Ha: Not all population means are equal
• Assume that a simple random sample of size 𝑛𝑗 has been selected from
each of the k populations or treatments. For the resulting sample data, let
𝑥𝑖𝑗 = value of observation ifor treatment j
𝑛𝑗 = number of observations for treatment j
𝑥𝑗 = sample mean for treatment j
𝑠𝑗2 = sample variance for treatment j
𝑠𝑗 = sample standard deviation for treatment j
19
Between-Treatments Estimate of Population Variance σ²
• The estimate of σ² based on the variation of the sample means is called the mean square due to treatments and is denoted by MSTR.

MSTR = SSTR / (k − 1),  where SSTR = Σⱼ nⱼ(x̄ⱼ − x̄)²

The numerator SSTR is called the sum of squares due to treatments; the denominator k − 1 is the degrees of freedom associated with SSTR.

20
Between-Treatments Estimate of Population Variance σ²
• Mean Square due to Treatments (MSTR)

MSTR = Σⱼ nⱼ(x̄ⱼ − x̄)² / (k − 1)
Where:
k = number of groups
nj = sample size from group j
𝑥𝑗 = sample mean from group j
𝑥 = grand mean (mean of all data values)

21
Within-Treatments Estimate of Population Variance σ²
• The estimate of σ² based on the variation of the sample observations within each sample is called the mean square error and is denoted by MSE.

MSE = SSE / (nT − k),  where SSE = Σⱼ (nⱼ − 1)sⱼ²

The numerator SSE is called the sum of squares due to error; the denominator nT − k is the degrees of freedom associated with SSE.
22
Within-Treatments Estimate of Population Variance σ²
• Mean Square Error (MSE)

MSE = Σⱼ (nⱼ − 1)sⱼ² / (nT − k)

Where:
k = number of groups
nⱼ = number of observations for treatment j
sⱼ² = sample variance for treatment j

23
Comparing the Variance Estimates: The F Test

• If the null hypothesis is true and the ANOVA assumptions are valid, the
sampling distribution of MSTR/MSE is an F distribution with MSTR d.f
equal to k - 1 and MSE d.f. equal to nT - k.

• If the means of the k populations are not equal, the value of MSTR/MSE will be inflated because MSTR overestimates σ²

• Hence, we will reject H0 if the resulting value of MSTR/MSE appears to be


too large to have been selected at random from the appropriate F
distribution

24
Comparing the Variance Estimates: The F Test

[Figure: sampling distribution of MSTR/MSE under H0; do not reject H0 for values below the critical F value, reject H0 for values in the upper tail beyond it (area α).]
ANOVA Table for a Completely Randomized Design
Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square            F
Treatments            SSTR             k − 1                MSTR = SSTR/(k − 1)    MSTR/MSE
Error                 SSE              nT − k               MSE = SSE/(nT − k)
Total                 SST              nT − 1

SST is partitioned into SSTR and SSE; SST's degrees of freedom (d.f.) are partitioned into SSTR's d.f. and SSE's d.f.
ANOVA Table for a Completely Randomized Design

• SST divided by its degrees of freedom nT – 1 is the overall sample variance


that would be obtained if we treated the entire set of observations as one
data set.
• With the entire data set as one sample, the formula for computing the
total sum of squares, SST, is:

SST = Σⱼ Σᵢ (xᵢⱼ − x̄)² = SSTR + SSE

27
ANOVA Table for a Completely Randomized Design

• ANOVA can be viewed as the process of partitioning the total sum of


squares and the degrees of freedom into their corresponding sources:
treatments and error

• Dividing the sum of squares by the appropriate degrees of freedom


provides the variance estimates and the F value used to test the
hypothesis of equal population means.

28
Test for the Equality of k Population Means

• Hypotheses
H0: 123...k
Ha: Not all population means are equal

• Test Statistic
F = MSTR / MSE

29
Test for the Equality of k Population Means

p-Value Approach: Reject H0 if p-value < α
Critical Value Approach: Reject H0 if F > Fα

where the value of Fα is based on an F distribution with k − 1 numerator d.f. and nT − k denominator d.f.

30
Thank You

31
Hypothesis Testing: Two sample test

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT
IIT ROORKEE

1
σ12 and σ22 Unknown, Assumed Unequal

Population means, independent samples; σ1² and σ2² unknown and assumed unequal.

Assumptions:
• Samples are randomly and independently drawn
• Populations are normally distributed
• Population variances are unknown and assumed unequal

2
σ12 and σ22 Unknown: Assumed Unequal

Forming interval estimates:
• The population variances are assumed unequal, so a pooled variance is not appropriate
• Use a t value with ν degrees of freedom, where

ν = ( s1²/n1 + s2²/n2 )² / [ (s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1) ]

3
Test Statistic: σ12 and σ22 Unknown, Unequal
The test statistic for µ1 − µ2 (σ1² and σ2² unknown, assumed unequal) is:

t = [ (x̄1 − x̄2) − D0 ] / √( s1²/n1 + s2²/n2 )

where t has ν degrees of freedom:

ν = ( s1²/n1 + s2²/n2 )² / [ (s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1) ]

4
Problem:Test Statistic: σ12 and σ22 Unknown, Unequal
Metro Phoenix Rural Arizona
• Arsenic concentration in public Phoenix, 3 Rimrock, 48
drinking water supplies is a Chandler, 7 Goodyear, 44
potential health risk. Gilbert, 25 New River, 40
• An article in the Arizona Republic Glendale, 10 Apachie Junction, 38
(Sunday, May 27, 2001) reported Mesa, 15 Buckeye, 33
drinking water arsenic Paradise Valley, 6 Nogales, 21
concentrations in parts per billion Peoria, 12 Black Canyon City, 20
(ppb) for 10 metropolitan Phoenix Scottsdale, 25 Sedona, 12
communities and 10 communities Tempe, 15 Payson, 1
in rural Arizona. Sun City, 7 Casa Grande, 18
• The data as shown:
x̄1 = 12.5    x̄2 = 27.5
s1 = 7.63    s2 = 15.3
5
Problem:Test Statistic: σ12 and σ22 Unknown, Unequal

• We wish to determine if there is any difference in mean arsenic concentrations between metropolitan Phoenix communities and communities in rural Arizona.

6
Problem:Test Statistic: σ12 and σ22 Unknown, Unequal

7
Problem:Test Statistic: σ12 and σ22 Unknown, Unequal

ν = ( s1²/n1 + s2²/n2 )² / [ (s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1) ]
  = ( 7.63²/10 + 15.3²/10 )² / [ (7.63²/10)²/9 + (15.3²/10)²/9 ] = 13.2 ≈ 13

8
Problem:Test Statistic: σ12 and σ22 Unknown, Unequal
t = ( 12.5 − 27.5 − 0 ) / √( 7.63²/10 + 15.3²/10 ) = −2.77

Two-tailed rejection regions (α = .05, 13 d.f.): reject H0 if t < −2.160 or t > 2.160.

Decision: Reject H0 at α = 0.05
Conclusion: There is evidence of a difference in means.

9
Problem:Test Statistic: σ12 and σ22 Unknown, Unequal

• Reject the null hypothesis.


• There is evidence to conclude that mean arsenic concentration in the
drinking water in rural Arizona is different from the mean arsenic
concentration in metropolitan Phoenix drinking water.
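
A sketch of this Welch (unequal-variance) test with scipy, using the data listed above:

from scipy import stats

phoenix = [3, 7, 25, 10, 15, 6, 12, 25, 15, 7]
rural   = [48, 44, 40, 38, 33, 21, 20, 12, 1, 18]

t_stat, p_value = stats.ttest_ind(phoenix, rural, equal_var=False)
print(round(t_stat, 2), round(p_value, 4))   # t ≈ -2.77, p ≈ 0.016 -> reject H0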

10
Problem:Test Statistic: σ12 and σ22 Unknown, Unequal

11
Dependent Samples
Tests Means of 2 Related Populations
– Paired or matched samples
– Repeated measures (before/after)
– Use difference between paired values:

di = xi - yi
• Assumptions:
– Both Populations Are Normally Distributed

12
Test Statistic: Dependent Samples

The test statistic for the mean difference is a t value, with


n – 1 degrees of freedom:

t = (d̄ − D0) / (sd / √n),  where d̄ = Σdᵢ / n

D0 = hypothesized mean difference


sd = sample standard dev. of differences
n = the sample size (number of pairs)
13
Decision Rules: Dependent Samples

Lower-tail test: Upper-tail test: Two-tail test:


H0: μ1 – μ2  0 H0: μ1 – μ2 ≤ 0 H0: μ1 – μ2 = 0
H1: μ1 – μ2 < 0 H1: μ1 – μ2 > 0 H1: μ1 – μ2 ≠ 0

14
Decision Rules: Dependent Samples

Lower-tail: Reject H0 if t < −t(n−1, α)
Upper-tail: Reject H0 if t > t(n−1, α)
Two-tail: Reject H0 if t < −t(n−1, α/2) or t > t(n−1, α/2)

where t = (d̄ − D0) / (sd/√n) has n − 1 d.f.

15
Dependent Samples: Example
• An article in the Journal of Strain Analysis (1983, Vol. 18, No. 2) compares
several methods for predicting the shear strength for steel plate girders.
• Data for two of these methods, the Karlsruhe and Lehigh procedures,
when applied to nine specific girders, are shown in Table .
• We wish to determine whether there is any difference (on the average)
between the two methods.

16
Table : Strength Predictions for Nine Steel Plate Girders
(Predicted Load/Observed Load)
Girder Karlsruhe Method Lehigh Method Difference dj
S11 1.186 1.061 0.119
S21 1.151 0.992 0.159
S31 1.322 1.063 0.259
S41 1.339 1.062 0.277
S51 1.200 1.065 0.138
S21 1.402 1.178 0.224
S22 1.365 1.037 0.328
S23 1.537 1.086 0.451
S24 1.559 1.052 0.507

17
Inferences About the Difference Between Two
Population Means: Matched Samples

18
Inferences About the Difference Between Two Population Means:
Matched Samples

19
we conclude that the strength prediction methods yield different results.

20
21
Inferences About the Difference Between
Two Population Proportions

• Inferences About the Difference Between Two Population Proportion


Inferences About the Difference Between
Two Population Proportions
• Interval Estimation of p1 - p2

• Hypothesis Tests About p1 - p2


Sampling Distribution of p1- p2

• Expected Value
E ( p1  p2 )  p1  p2

• Standard Deviation (Standard Error)


p1 (1  p1 ) p2 (1  p2 )
 p1  p2  
n1 n2

where: n1 = size of sample taken from population 1


n2 = size of sample taken from population 2
Sampling Distribution of p1- p2

• If the sample sizes are large, the sampling distribution of


p1- p2 can be approximated by a normal probability distribution.

• The sample sizes are sufficiently large if all of these conditions are met:
n1p1 ≥ 5, n1(1 − p1) ≥ 5, n2p2 ≥ 5, n2(1 − p2) ≥ 5
Sampling Distribution of p1- p2

p1 (1  p1 ) p2 (1  p2 )
 p1  p2  
n1 n2

p1  p2
Interval Estimation of p1 - p2

• Interval Estimate:

p̄1 − p̄2 ± zα/2 √( p̄1(1 − p̄1)/n1 + p̄2(1 − p̄2)/n2 )
Point Estimator of the Difference Between Two Population
Proportions
• p1 = proportion of the population of households "aware" of the product after the new campaign
• p2 = proportion of the population of households "aware" of the product before the new campaign
• p̄1 = sample proportion of households "aware" of the product after the new campaign
• p̄2 = sample proportion of households "aware" of the product before the new campaign

p̄1 − p̄2 = 120/250 − 60/150 = .48 − .40 = .08
Hypothesis Tests about p1 - p2

• Hypothesis

We focus on tests involving no difference between the two population


proportions (i.e. p1 = p2)

H 0 : p1  p2  0 H 0 : p1  p2  0 H 0 : p1  p2  0
H a : p1  p2  0 H a : p1  p2  0 H a : p1  p2  0
Left-tailed Right-tailed Two-tailed
Hypothesis Tests about p1 - p2

• Standard Error of p̄1 − p̄2 when p1 = p2 = p:

σ(p̄1 − p̄2) = √( p̄(1 − p̄)(1/n1 + 1/n2) )

• Pooled Estimator of p when p1 = p2 = p:

p̄ = (n1p̄1 + n2p̄2) / (n1 + n2)
Hypothesis Tests about p1 - p2

• Test Statistic

z = (p̄1 − p̄2) / √( p̄(1 − p̄)(1/n1 + 1/n2) )
Problem: Hypothesis Tests about p1 - p2
• Extracts of St. John’s Wort are widely used to treat depression.
• An article in the April 18, 2001 issue of the Journal of the American Medical
Association (“Effectiveness of St. John’s Worton Major Depression: A
Randomized Controlled Trial”) compared the efficacy of a standard extract
of St. John’s Wort with a placebo in 200 outpatients diagnosed with major
depression.
• Patients were randomly assigned to two groups; one group received the St.
John’s Wort, and the other received the placebo.
• After eight weeks, 19 of the placebo-treated patients showed
improvement, whereas 27 of those treated with St. John’s Wort improved.
• Is there any reason to believe that St. John’s Wort is effective in treating
major depression? Use 0.05.
Problem: Hypothesis Tests about p1 - p2
Problem: Hypothesis Tests about p1 - p2

8. Conclusions: Since z0 = 1.35 does not exceed z0.025 = 1.96, we cannot reject the null hypothesis. The p-value is P ≅ 0.177. There is insufficient evidence to support the claim that St. John's Wort is effective in treating major depression.
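
A sketch of this two-proportion test in Python, assuming the 200 outpatients were split evenly between groups (n1 = n2 = 100), which matches the slide's z0 ≈ 1.35:

from scipy import stats
import math

x1, n1, x2, n2 = 27, 100, 19, 100     # improved on extract vs. placebo
p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)        # pooled estimate under H0: p1 = p2
se = math.sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))
z = (p1 - p2) / se
p_value = 2 * (1 - stats.norm.cdf(abs(z)))
print(round(z, 2), round(p_value, 3))   # z ≈ 1.34, p ≈ 0.18 -> cannot reject H0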

34
35
Thank You

36
Hypothesis Testing: Two sample test

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT
IIT ROORKEE

1
Agenda

• Comparing two population variances


• Choosing z or t test
• Sample size

2
Hypothesis Tests for Two Variances
Goal: Test hypotheses about two population variances.

Lower-tail test: H0: σ1² ≥ σ2², H1: σ1² < σ2²
Upper-tail test: H0: σ1² ≤ σ2², H1: σ1² > σ2²
Two-tail test: H0: σ1² = σ2², H1: σ1² ≠ σ2²

The F test statistic is used; the two populations are assumed to be independent and normally distributed.

3
Hypothesis Tests for Two Variances

The random variable

F = (s1²/σ1²) / (s2²/σ2²)

has an F distribution with (n1 − 1) numerator degrees of freedom and (n2 − 1) denominator degrees of freedom. Denote an F value with ν1 numerator and ν2 denominator degrees of freedom by Fν1,ν2.

4
Test Statistic

The test statistic for a hypothesis test about two population variances is

F = s1² / s2²

where F has (n1 − 1) numerator degrees of freedom and (n2 − 1) denominator degrees of freedom.

5
Decision Rules: Two Variances
Upper-tail test: H0: σ1² ≤ σ2², H1: σ1² > σ2² — reject H0 if F > Fα
Two-tail test: H0: σ1² = σ2², H1: σ1² ≠ σ2² — reject H0 if F > Fα/2 or F < F1−α/2

6
Problem
• A company manufactures impellers for use in jet-turbine engines.
• One of the operations involves grinding a particular surface finish on a
titanium alloy component.
• Two different grinding processes can be used, and both processes can produce
parts at identical mean surface roughness.
• The manufacturing engineer would like to select the process having the least
variability in surface roughness.
• A random sample of n1 =11 parts from the first process results in a sample
standard deviation s1 = 5.1 micro inches, and a random sample of n2 = 16
parts from the second process results in a sample standard deviation of s2 =
4.7 micro inches.
• We will find a 90% confidence interval on the ratio of the two standard
deviations.

7
Problem
• Form the hypothesis test:
H0: σ12 = σ22 (there is no difference between variances)
H1: σ12 ≠ σ22 (there is a difference between variances)
● Find the F critical values for α/2 = .10/2 = .05:
Degrees of Freedom:
• Numerator
• n1 – 1 = 11 – 1 = 10 d.f.
• Denominator:
• n2 – 1 = 16 – 1 = 15 d.f.

8
Problem

• Assuming that the two processes are independent and that surface
roughness is normally distributed

9
10
Problem

• f0.95,15,10 = 1 /f0.05,10,15 = 1/2.54 = 0.39


• Since this confidence interval includes unity, we cannot claim that the
standard deviations of surface roughness for the two processes are
different at the 90% level of confidence.
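
A sketch of this 90% confidence interval on σ1²/σ2² in Python (two-sided F interval with the same (n2 − 1, n1 − 1) degrees of freedom as the slide):

from scipy import stats

s1, n1 = 5.1, 11      # process 1: sample std dev and size
s2, n2 = 4.7, 16      # process 2
alpha = 0.10

ratio = s1**2 / s2**2
f_low  = stats.f.ppf(alpha / 2, n2 - 1, n1 - 1)       # f(0.95, 15, 10) ≈ 0.39
f_high = stats.f.ppf(1 - alpha / 2, n2 - 1, n1 - 1)   # f(0.05, 15, 10) ≈ 2.85

ci = (ratio * f_low, ratio * f_high)
print(tuple(round(v, 2) for v in ci))   # ≈ (0.46, 3.35); includes 1, so no
                                        # significant difference in variances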

11
12
F Test example:

13
Z Vs t

          σ known    σ unknown
n ≤ 30    Z-test     t-test
n > 30    Z-test     Z-test (using the sample standard deviation)
Determining the Sample Size for a Hypothesis Test About a Population
Mean
H0: µ ≤ µ0    Ha: µ > µ0

[Figure: sampling distribution of x̄ when H0 is true (µ = µ0), with rejection region beyond the critical value c determined by α; and sampling distribution of x̄ when H0 is false (µa > µ0), with Type II error probability β to the left of c.]
Determining the Sample Size for a Hypothesis Test About a Population
Mean

where
z = z value providing an area of  in the tail
zb = z value providing an area of b in the tail
= population standard deviation
m0 = value of the population mean in H0
ma = value of the population mean used for the
Type II error

Note: In a two-tailed hypothesis test, use z /2 not z


Determining the Sample Size for a Hypothesis Test About a Population Mean

• Let’s assume that the manufacturing company makes the following statements about the
allowable probabilities for the Type I and Type II errors:

• If the mean diameter is µ = 12 mm, I am willing to risk an α = .05 probability of rejecting H0.

• If the mean diameter is 0.75 mm over the specification (µ = 12.75), I am willing to risk a β = .10 probability of not rejecting H0.
Determining the Sample Size for a Hypothesis Test About a Population Mean

α = .05, β = .10
zα = 1.645, zβ = 1.28
µ0 = 12, µa = 12.75
σ = 3.2
19
Thank You

20
Post Hoc Analysis(Tukey’s test)
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES
IIT ROORKEE

1
Lecture Objectives
After completing this lecture, you should be able to:
• Use Tukey’s test and LSD Test to identify specific differences between
means

2
Designing engineering experiments

• Experimental design methods are also useful in engineering design


activities, where new products are developed and existing ones are
improved
• By using designed experiments, engineers can determine which subset of
the process variables has the greatest influence on process performance

3
Designing engineering experiments

• The results of an experiment can lead to


1. Improved process yield
2. Reduced variability in the process and closer conformance to nominal
or target requirements
3. Reduced design and development time
4. Reduced cost of operation

4
Designing engineering experiments

• Every experiment involves a sequence of activities:


1. Conjecture—the original hypothesis that motivates the experiment
2. Experiment—the test performed to investigate the conjecture
3. Analysis—the statistical analysis of the data from the experiment
4. Conclusion—what has been learned about the original conjecture
from the experiment. Often the experiment will lead to a revised
conjecture, and a new experiment, and so forth

5
The completely randomized single-factor experiment
example
• A manufacturer of paper that is used for making
grocery bags is interested in improving the tensile
strength of the product
• Product engineer thinks that tensile strength is a
function of the hardwood concentration in the
pulp and that the range of hardwood
concentrations of practical interest is between 5
and 20%.

6
The completely randomized single-factor experiment
example
• A team of engineers responsible for the study decides to investigate four
levels of hardwood concentration: 5%, 10%, 15%, and 20%.
• They decide to make up six test specimens at each concentration level,
using a pilot plant.
• All 24 specimens are tested on a laboratory tensile tester, in random order.
The data from this experiment are shown in Table

7
The completely randomized single-factor experiment
example
• Tensile Strength of Paper (psi)
Hardwood Observations Total Avg
Concentration (%) 1 2 3 4 5 6

5 7 8 15 11 9 10 60 10.00
10 12 17 13 18 19 15 94 15.67
15 14 18 19 17 16 18 102 17.00
20 19 25 22 23 18 20 127 21.17
383 15.96

8
The completely randomized single-factor experiment
example

9
Typical Data for Single Factor Experiment

Treatment   Observations             Totals   Averages
1           y11  y12  ...  y1n       y1.      ȳ1.
2           y21  y22  ...  y2n       y2.      ȳ2.
...         ...                      ...      ...
a           ya1  ya2  ...  yan       ya.      ȳa.
                                     y..      ȳ..

10
Sum of Squares

Total sum of squares: SST = Σᵢ Σⱼ (yᵢⱼ − ȳ..)²

Treatment sum of squares: SS_Treatments = n Σᵢ (ȳᵢ. − ȳ..)²

Error sum of squares: SSE = Σᵢ Σⱼ (yᵢⱼ − ȳᵢ.)²

11
ANOVA with Equal Sample Sizes

SST = Σᵢ Σⱼ yᵢⱼ² − y..²/N

SS_Treatments = (1/n) Σᵢ yᵢ.² − y..²/N

N = an = (number of treatments) × (sample size per treatment) = total number of observations

12
ANOVA with unequal Sample Sizes

SST = Σᵢ Σⱼ yᵢⱼ² − y..²/N

SS_Treatments = Σᵢ ( yᵢ.² / nᵢ ) − y..²/N

N = Σᵢ nᵢ = total number of observations

13
Problem: Analysis of variance

• Consider the paper tensile strength experiment described.


• We can use the analysis of variance to test the hypothesis that different
hardwood concentrations do not affect the mean tensile strength of the
paper.
• The hypotheses are
• H0: τ1 = τ2 = τ3 = τ4 = 0
• H1: τᵢ ≠ 0 for at least one i

14
Problem: Analysis of variance

• We will use α = 0.01.


• The sums of squares for the analysis of variance are computed are as
follows:

15
ANOVA Table

Sources of Variation   Sum of Squares   Degrees of Freedom   Mean Square      F
Treatments             SS_Treatments    a − 1                MS_Treatments    MS_Treatments / MSE
Error                  SSE              a(n − 1)             MSE
Total                  SST              an − 1

16
Problem: Analysis of variance

• The ANOVA is summarized as follows:

Source of Variation      Sum of Squares   Degrees of freedom   Mean Square   F0     P-value
Hardwood concentration   382.79           3                    127.6         19.6   3.59 E-6
Error                    130.17           20                   6.51
Total                    512.96           23

17
Problem: Analysis of variance

• Since f0.01,3,20 = 4.94, we reject H0 and conclude that hardwood


concentration in the pulp significantly affects the mean strength of the
paper

18
Problem: Analysis of variance

19
Jupyter code

20
Jupyter code

21
Jupyter code

22
Jupyter code

23
Multiple Comparisons Following the ANOVA

• When the null hypothesis is rejected in the ANOVA, we know that some of
the treatment or factor level means are different
• ANOVA doesn’t identify which means are different
• Methods for investigating this issue are called multiple comparisons
methods

24
Fisher’s least significant difference (LSD) method

• The Fisher LSD method compares all pairs of means with the null hypotheses H0: µᵢ = µⱼ (for all i ≠ j) using the t-statistic

t0 = (ȳᵢ. − ȳⱼ.) / √( 2MSE / n )

25
Fisher’s least significant difference (LSD) method

• Assuming a two-sided alternative hypothesis, the pair of means µᵢ and µⱼ would be declared significantly different if

|ȳᵢ. − ȳⱼ.| > LSD

where LSD, the least significant difference, is

LSD = t(α/2, a(n−1)) √( 2MSE / n )

26
Fisher’s least significant difference (LSD) method

• If the sample sizes are different in each treatment, the LSD is defined as

LSD = t(α/2, N−a) √( MSE (1/nᵢ + 1/nⱼ) )

27
Problem : LSD method

• We will apply the Fisher LSD method to the hardwood concentration experiment. There are a = 4 means, n = 6, MSE = 6.51, and t0.025,20 = 2.086. The treatment means are ȳ1. = 10.00, ȳ2. = 15.67, ȳ3. = 17.00, and ȳ4. = 21.17.

28
Problem : LSD method

• The value of LSD is:


2MS E 2(6.51)
LSD  t0.025,20  2.086  3.07
n 6

• Therefore, any pair of treatment averages that differs by more than 3.07
implies that the corresponding pair of treatment means are different.

29
Jupyter code

30
Problem : LSD method

• The comparisons among the observed treatment averages are as follows:

|ȳ1. − ȳ2.| = |10.00 − 15.67| = 5.67 > 3.07
|ȳ1. − ȳ3.| = |10.00 − 17.00| = 7.00 > 3.07
|ȳ1. − ȳ4.| = |10.00 − 21.17| = 11.17 > 3.07
|ȳ2. − ȳ3.| = |15.67 − 17.00| = 1.33 < 3.07
|ȳ2. − ȳ4.| = |15.67 − 21.17| = 5.50 > 3.07
|ȳ3. − ȳ4.| = |17.00 − 21.17| = 4.17 > 3.07

All pairs of treatment means differ significantly except the 10% and 15% concentrations.
31
The Tukey-Kramer Test for Post Hoc analysis

• Tells which population means are significantly different


• Done after rejection of equal means in ANOVA
• Allows pair-wise comparisons
• Compare absolute mean differences with critical range

32
The Tukey-Kramer Test for Post Hoc analysis

• Determine whether there is any significant difference between the means

• e.g., is μ1 = μ2 ≠ μ3?

[Figure: three population distributions with μ1 = μ2 and μ3 shifted to the right.]

33
Tukey-Kramer Critical Range

MSW  1 1 
Critical Range  QU 
2  n j n j' 

where:
QU = Value from Studentized Range
Distribution with c and n - c degrees of freedom for
the desired level of a
MSW = Mean Square Within
nj and nj’ = Sample sizes from groups j and j’
34
Problem: Tukey- Kramer test

• Tensile Strength of Paper (psi)


Hardwood Observations Total Avg
Concentratio 1 2 3 4 5 6
n (%)
5 7 8 15 11 9 10 60 10.00
10 12 17 13 18 19 15 94 15.67
15 14 18 19 17 16 18 102 17.00
20 19 25 22 23 18 20 127 21.17
383 15.96

35
The Tukey-Kramer Procedure
1. Compute absolute mean differences:

x1  x 2  10.00  15.67  5.67


x1  x 3  10.00  17.00  7
x 2  x 3  15.67  17.00  1.33
x1  x 4  10.00  21.17  11.17
x 2  x 4  15.67  21.17  5.5
x 3  x 4  17.00  21.17  4.17

36
The Tukey-Kramer Procedure

2. Find the QU value from the table with c = 4 and (n – c) = (24 – 4) = 20


degrees of freedom for the desired level of α (α = .05 used here):

QU  3.96

37
• Q table: The critical values
for q corresponding to
alpha = .05 (top) and
alpha = .01 (bottom)

38
The Tukey-Kramer Procedure

Source of Variation      Sum of Squares   Degrees of freedom   Mean Square   F0     P-value
Hardwood concentration   382.79           3                    127.6         19.6   3.59 E-6
Error                    130.17           20                   6.51
Total                    512.96           23

39
The Tukey-Kramer Procedure
3. Compute Critical Range:
MSW  1 1  6.51  1 1 
Critical Range  Q U     3.96     4.124
2  n j n j'  2 6 6

4. Compare: x1  x 2  10.00  15.67  5.67


x1  x 3  10.00  17.00  7
x 2  x 3  15.67  17.00  1.33
x1  x 4  10.00  21.17  11.17
x 2  x 4  15.67  21.17  5.5
x 3  x 4  17.00  21.17  4.17

40
The Tukey-Kramer Procedure
5. Other than |x̄2 − x̄3|, all of the absolute mean differences are greater than the critical range. Therefore there is a significant difference between each pair of means, except the 10% and 15% concentrations, at the 5% level of significance.
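
A sketch of the same comparison with statsmodels' built-in Tukey HSD routine, using the tensile-strength data:

import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

strength = [7, 8, 15, 11, 9, 10,      # 5% hardwood
            12, 17, 13, 18, 19, 15,   # 10%
            14, 18, 19, 17, 16, 18,   # 15%
            19, 25, 22, 23, 18, 20]   # 20%
concentration = ['5%'] * 6 + ['10%'] * 6 + ['15%'] * 6 + ['20%'] * 6

result = pairwise_tukeyhsd(strength, concentration, alpha=0.05)
print(result)   # every pair differs significantly except 10% vs 15%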

41
Jupyter code

42
Problem 2

• The following table shows the observed tensile strength (lb/in²) of cloth samples having different weight percentages of cotton.
• Check whether the weight percentage of cotton plays any role in the tensile strength (lb/in²) of the cloth.

43
Problem 2

Weight Observed tensile strength (lb/in square) Total Average


Percentage
of cotton

1 2 3 4 5
15 7 7 15 11 9 49 9.8
20 12 17 12 18 18 77 15.4
25 14 18 18 19 19 88 17.6
30 19 25 22 19 23 108 21.6
35 7 10 11 15 11 54 10.8
Grand total = 376    Grand mean = 15.04

44
• SSA = 5(9.8 − 15.04)² + 5(15.4 − 15.04)² + 5(17.6 − 15.04)² + 5(21.6 − 15.04)² + 5(10.8 − 15.04)² = 475.76
SST = 636.96
SSE = 636.96 − 475.76 = 161.20

Sources of variation       Sum of squares   Degrees of freedom   Mean square   F-value
Cotton weight percentage   475.76           4                    118.94        14.76
Error                      161.20           20                   8.06
Total                      636.96           24

45
Problem 2

• When α = .05, F(0.05, 4, 20) = 2.87
• Since 14.76 > 2.87, reject H0

46
• Q table: The critical values
for q corresponding to
alpha = .05 (top) and
alpha = .01 (bottom)

47
Problem 2

Tα = qα(c, n − c) √( MSE / n )

α = 0.05
q0.05(5, 20) = 4.23

T0.05 = 4.23 √( 8.06 / 5 ) = 5.37

48
Problem 2

Any pair of treatment averages that differ in absolute value by more than 5.37 would imply that the corresponding pair of population means are significantly different.

49
Problem 2
|ȳ1. − ȳ2.| = |9.8 − 15.4| = 5.6*
|ȳ1. − ȳ3.| = |9.8 − 17.6| = 7.8*
|ȳ1. − ȳ4.| = |9.8 − 21.6| = 11.8*
|ȳ1. − ȳ5.| = |9.8 − 10.8| = 1.0
|ȳ2. − ȳ3.| = |15.4 − 17.6| = 2.2
|ȳ2. − ȳ4.| = |15.4 − 21.6| = 6.2*
|ȳ2. − ȳ5.| = |15.4 − 10.8| = 4.6
|ȳ3. − ȳ4.| = |17.6 − 21.6| = 4.0
|ȳ3. − ȳ5.| = |17.6 − 10.8| = 6.8*
|ȳ4. − ȳ5.| = |21.6 − 10.8| = 10.8*

Starred values indicate pairs of means that are significantly different.

50
Jupyter code

51
Jupyter Code

52
Thank you

53
Two Way ANOVA

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT
IIT ROORKEE

1
Learning objectives

• Design and conduct engineering experiments involving several factors


using the factorial design approach
• Understand how the ANOVA is used to analyze the data from these
experiments
• Know how to use the two-level series of factorial designs

2
Factorial Experiment
• A factorial experiment is an experimental design that allows simultaneous
conclusions about two or more factors.
• The term factorial is used because the experimental conditions include all
possible combinations of the factors.
• The effect of a factor is defined as the change in response produced by a
change in the level of the factor. It is called a main effect because it refers to
the primary factors in the study
• For example, for a levels of factor A and b levels of factor B, the experiment
will involve collecting data on ab treatment combinations.
• Factorial experiments are the only way to discover interactions between
variables.

3
Factorial Experiment

[Figures: factor-level plots for a factorial experiment with no interaction (parallel lines) and with interaction (non-parallel lines).]

4
Two-factor Factorial Experiments

• The simplest type of factorial experiment involves only two factors, say, A
and B.
• There are a levels of factor A and b levels of factor B.
• This two-factor factorial is shown in next table .
• The experiment has n replicates, and each replicate contains all ab
treatment combinations.

5
Two-factor Factorial Experiments

Data Arrangement for a Two-Factor Factorial Design

6
Two-factor Factorial Experiments

• The observation in the ijth cell for the kth replicate is denoted by yijk
• In performing the experiment, the abn observations would be run in
random order.
• Thus, like the single factor experiment, the two-factor factorial is a
completely randomized design.

7
Example

• As an illustration of a two-factor factorial experiment, we will consider a


study involving the Common Admission test (CAT), a standardized test
used by graduate schools of business to evaluate an applicant’s ability to
pursue a graduate program in that field.
• Scores on the CAT range from 200 to 800, with higher scores implying
higher aptitude.

8
Three CAT preparation programs.

• In an attempt to improve students’ performance on the CAT, a major


university is considering offering the following three CAT preparation
programs.
1. A three-hour review session covering the types of questions generally
asked on the CAT.
2. A one-day program covering relevant exam material, along with the taking
and grading of a sample exam.
3. An intensive 10-week course involving the identification of each student’s
weaknesses and the setting up of individualized programs for
improvement.

9
Factor - 1 , 3 treatment

• One factor in this study is the CAT preparation program, which has three
treatments:
– Three-hour review,
– One-day program, and
– 10-week course.
• Before selecting the preparation program to adopt, further study will be
conducted to determine how the proposed programs affect CAT scores.

10
Factor 2 : 3 Treatment
• The CAT is usually taken by students from three colleges:
• the College of Business,
• the College of Engineering, and
• the College of Arts and Sciences.
• Therefore, a second factor of interest in the experiment is whether a
student’s undergraduate college affects the CAT score.
• This second factor, undergraduate college, also has three treatments:
– Business,
– Engineering, and
– Arts and sciences.

11
Nine Treatment Combinations for The Two-factor CAT
Experiment

Factor A: Factor B: College


Preparation Business Engineering Arts and sciences
Program
Three-hour review 1 2 3
One-day program 4 5 6
10-Week course 7 8 9

12
Replication

• In experimental design terminology, the sample size of two for each


treatment combination indicates that we have two replications.

13
CAT SCORES FOR THE TWO-FACTOR EXPERIMENT

Factor A: Factor B: College


Preparation Business Engineering Arts and sciences
Program
Three-hour review 500 540 480
580 460 400
One-day program 460 560 420
540 620 480
10-Week course 560 600 480
600 580 410

14
The analysis of variance computations answers
the following questions.
• Main effect (factor A): Do the preparation programs differ in terms of
effect on CAT scores?
• Main effect (factor B): Do the undergraduate colleges differ in terms of
effect on CAT scores?
• Interaction effect (factors A and B): Do students in some colleges do
better on one type of preparation program whereas others do better on a
different type of preparation program?

15
Interaction

• The term interaction refers to a new effect that we can now study because
we used a factorial experiment.
• If the interaction effect has a significant impact on the CAT scores, we can
conclude that the effect of the type of preparation program depends on
the undergraduate college.

16
ANOVA Table for the Two-factor Factorial Experiment
with r Replications
Sources of    Sum of    Degrees of       Mean Square                    F          P-value
Variation     Squares   Freedom
Factor A      SSA       a − 1            MSA = SSA/(a − 1)              MSA/MSE
Factor B      SSB       b − 1            MSB = SSB/(b − 1)              MSB/MSE
Interaction   SSAB      (a − 1)(b − 1)   MSAB = SSAB/[(a − 1)(b − 1)]   MSAB/MSE
Error         SSE       ab(r − 1)        MSE = SSE/[ab(r − 1)]
Total         SST       nT − 1
17
Abbreviation

18
ANOVA Procedure

• The ANOVA procedure for the two-factor factorial experiment requires us


to partition the sum of squares total (SST) into four groups:
– sum of squares for factor A (SSA),
– sum of squares for factor B (SSB),
– sum of squares for interaction (SSAB), and
– sum of squares due to error (SSE).
• The formula for this partitioning follows.

19
Computations and Conclusions

20
CAT Summary Data for The Two-factor Experiment
Factor A:            Factor B: College                                              Row totals
Preparation Program  Business             Engineering          Arts and Sciences
Three-hour review    500, 580 (x̄11=540)   540, 460 (x̄12=500)   480, 400 (x̄13=440)   2960
One-day program      460, 540 (x̄21=500)   560, 620 (x̄22=590)   420, 480 (x̄23=450)   3080
10-week course       560, 600 (x̄31=580)   600, 580 (x̄32=590)   480, 410 (x̄33=445)   3230
Column totals        3240                 3360                 2670                 Overall total = 9270; x̄ = 9270/18 = 515
21
CAT Summary Data for The Two-factor Experiment

• Factor A means
  x̄1. = 493.33
  x̄2. = 513.33
  x̄3. = 538.33

• Factor B means
  x̄.1 = 540
  x̄.2 = 560
  x̄.3 = 445
22
CAT Example:

23
CAT Example:

24
CAT Example:

25
CAT Example:

26
CAT Example:

27
ANOVA Table for the CAT two-factor design

Sources of Sum of Degrees of Mean Square F P- value


Variation Squares Freedom

Factor A 6100 2 3050 1.38 0.299

Factor B 45300 2 22650 10.27 0.005

Interaction 11200 4 2800 1.27 0.350

Error 19850 9 2206

Total 82450 17

28
Jupyter Code
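A minimal sketch of this two-way ANOVA with statsmodels, entering the CAT scores from the table shown earlier (the column and level names below are chosen here, not taken from the slides):

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

data = pd.DataFrame({
    'program': ['three_hour']*6 + ['one_day']*6 + ['ten_week']*6,
    'college': ['business', 'business', 'engineering', 'engineering',
                'arts_sci', 'arts_sci']*3,
    'score':   [500, 580, 540, 460, 480, 400,    # three-hour review
                460, 540, 560, 620, 420, 480,    # one-day program
                560, 600, 600, 580, 480, 410]})  # 10-week course

# C() marks categorical factors; the ':' term is the A x B interaction.
model = ols('score ~ C(program) + C(college) + C(program):C(college)',
            data=data).fit()
print(sm.stats.anova_lm(model, typ=2))   # SSA = 6100, SSB = 45300, SSAB = 11200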

29
Jupyter code

30
Jupyter Code

31
Thank You

32
REGRESSION
Linear Regression
Dr. Ramesh Anbanandam
DEPARTMENT of Management Studies

1
Simple Linear Regression

• Simple Linear Regression Model


• Least Squares Method
• Coefficient of Determination
• Model Assumptions
• Testing for Significance
• Using the Estimated Regression Equation for Estimation and
Prediction
Empirical Models
• Many problems in engineering and science involve exploring the
relationships between two or more variables

• Regression analysis is a statistical technique that is very useful for these


types of problems

• This model can also be used for process optimization, such as finding the
level of temperature that maximizes yield, or for process control purposes

3
Empirical Models Example

• As an illustration, consider the data in the table.
• In this table y is the purity of oxygen produced in a chemical distillation
  process, and x is the percentage of hydrocarbons that are present in the
  main condenser of the distillation unit.

Hydrocarbon level (X)   Purity (Y)
0.99                    90.01
1.02                    89.05
1.15                    91.43
1.29                    93.74
1.46                    96.73
1.36                    94.45
0.87                    87.59
1.23                    91.77
1.55                    99.42
1.40                    93.65

4
Using python for plotting the data
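A minimal sketch of the plotting step with matplotlib, using the purity data from the table above:

import matplotlib.pyplot as plt

x = [0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40]   # hydrocarbon level
y = [90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65]  # purity

plt.scatter(x, y)
plt.xlabel('Hydrocarbon level (%)')
plt.ylabel('Oxygen purity (%)')
plt.title('Purity vs hydrocarbon level')
plt.show()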

5
Simple Linear Regression Model
• The equation that describes how y is related to x and
an error term is called the regression model.
• The simple linear regression model is:

y = b0 + b1x +e
where:
b0 and b1 are called parameters of the model,
e is a random variable called the error term.
Simple Linear Regression Equation
The simple linear regression equation is:
E(y) = b0 + b1x

• Graph of the regression equation is a straight line.


• b0 is the y intercept of the regression line.
• b1 is the slope of the regression line.
• E(y) is the expected value of y for a given x value.
Simple Linear Regression Equation
Positive Linear Relationship

[Graph: E(y) versus x, a regression line with intercept b0 and positive slope b1]
Simple Linear Regression Equation
Negative Linear Relationship
[Graph: E(y) versus x, a regression line with intercept b0 and negative slope b1]
Simple Linear Regression Equation
No Relationship
[Graph: E(y) versus x, a horizontal regression line with intercept b0 and slope b1 = 0]
Estimated Simple Linear Regression Equation

 The estimated simple linear regression equation

  ŷ = b0 + b1x

• The graph is called the estimated regression line.
• b0 is the y intercept of the line.
• b1 is the slope of the line.
• ŷ is the estimated value of y for a given x value.
Least Squares Method
• Least Squares Criterion

  min Σ(yi − ŷi)²

where:
  yi = observed value of the dependent variable for the ith observation
  ŷi = estimated value of the dependent variable for the ith observation
Estimation Process
[Estimation process diagram:
 Regression model y = b0 + b1x + e with unknown parameters b0 and b1, and
 regression equation E(y) = b0 + b1x  →  sample data (x1, y1), ..., (xn, yn)
 →  sample statistics b0 and b1  →  estimated regression equation
 ŷ = b0 + b1x, where the sample statistics b0 and b1 provide estimates of the
 unknown parameters.]
14
Squared Error (SE) = [y1 − (mx1 + b)]² + [y2 − (mx2 + b)]² + ... + [yn − (mxn + b)]²

  = y1² − 2y1(mx1 + b) + (mx1 + b)²
  + y2² − 2y2(mx2 + b) + (mx2 + b)²
  + ...
  + yn² − 2yn(mxn + b) + (mxn + b)²

  = y1² − 2x1y1m − 2y1b + m²x1² + 2mx1b + b²
  + y2² − 2x2y2m − 2y2b + m²x2² + 2mx2b + b²
  + ...
  + yn² − 2xnynm − 2ynb + m²xn² + 2mxnb + b²

15
  = (y1² + y2² + ... + yn²)
  − 2m(x1y1 + x2y2 + ... + xnyn)
  − 2b(y1 + y2 + ... + yn)
  + m²(x1² + x2² + ... + xn²)
  + 2mb(x1 + x2 + ... + xn)
  + (b² + b² + ... + b²)

  = n·avg(y²) − 2mn·avg(xy) − 2bn·avg(y) + m²n·avg(x²) + 2mbn·avg(x) + nb²

(avg(·) denotes the mean of the quantity over the n observations,
e.g., avg(xy) = Σxiyi/n.)

16
SE = n·avg(y²) − 2mn·avg(xy) − 2bn·avg(y) + m²n·avg(x²) + 2mbn·avg(x) + nb²

Setting the partial derivative with respect to m equal to zero:

  ∂(SE)/∂m = −2n·avg(xy) + 2mn·avg(x²) + 2bn·avg(x) = 0

  ⇒ −avg(xy) + m·avg(x²) + b·avg(x) = 0

  ⇒ m·avg(x²) + b·avg(x) = avg(xy)

  ⇒ m·[avg(x²)/avg(x)] + b = avg(xy)/avg(x)

So the least-squares line passes through the point (avg(x²)/avg(x), avg(xy)/avg(x)).
17
SE = n·avg(y²) − 2mn·avg(xy) − 2bn·avg(y) + m²n·avg(x²) + 2mbn·avg(x) + nb²

Setting the partial derivative with respect to b equal to zero:

  ∂(SE)/∂b = −2n·avg(y) + 2mn·avg(x) + 2nb = 0

  ⇒ −ȳ + m·x̄ + b = 0

  ⇒ ȳ = m·x̄ + b

So the least-squares line also passes through the point (x̄, ȳ).

18
19
20
21
Least Squares Method

• Slope for the Estimated Regression Equation

  b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
REGRESSION
Linear Regression-II
Dr. Ramesh Anbanandam
DEPARTMENT of Management Studies

1
Least Squares Method

• Slope for the Estimated Regression Equation

  b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²

2
Sum of squares and sum of cross-products
Sxx = Σ(xi − x̄)²            (i = 1, ..., n)

Syy = Σ(yi − ȳ)²            (i = 1, ..., n)

Sxy = Σ(xi − x̄)(yi − ȳ)     (i = 1, ..., n)

3
Sum of squares and sum of cross-products

Slope(m) = Sxy / Sxx

SSE = error sum of squares = Syy − (Sxy)²/Sxx

4
Least Squares Method
y-Intercept for the Estimated Regression Equation

  b0 = ȳ − b1·x̄

where:
  xi = value of the independent variable for the ith observation
  yi = value of the dependent variable for the ith observation
  x̄ = mean value of the independent variable
  ȳ = mean value of the dependent variable
  n = total number of observations

5
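The slope and intercept formulas above can be checked numerically. A minimal sketch with NumPy, using small made-up x and y values (illustrative only, not from the slides):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

Sxx = np.sum((x - x.mean())**2)               # Σ(xi − x̄)²
Sxy = np.sum((x - x.mean())*(y - y.mean()))   # Σ(xi − x̄)(yi − ȳ)

b1 = Sxy / Sxx                 # slope
b0 = y.mean() - b1 * x.mean()  # intercept: the line passes through (x̄, ȳ)
print(f'y-hat = {b0:.3f} + {b1:.3f} x')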
Simple Linear Regression

Deviation from the estimated regression model

6
Simple Linear Regression

Example: Auto Sales


An Auto company periodically has a special week-long sale.
As part of the advertising campaign runs one or more television
commercials during the weekend preceding the sale.
Data from a sample of 5 previous sales are shown on the next slide.

7
Simple Linear Regression
Example: Auto Sales

Number of Number of
TV Ads Cars Sold
1 14
3 24
2 18
1 17
3 27

8
Estimated Regression Equation

Slope for the Estimated Regression Equation

  b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² = 20/4 = 5

y-Intercept for the Estimated Regression Equation

  b0 = ȳ − b1·x̄ = 20 − 5(2) = 10

Estimated Regression Equation

  ŷ = 10 + 5x

9
Scatter Diagram and Trend Line
[Scatter diagram of TV Ads (x-axis, 0 to 4) against Cars Sold (y-axis, 0 to 30),
with the fitted trend line y = 5x + 10.]

10
Jupyter Code
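A minimal sketch of fitting this regression with statsmodels (the variable names are chosen here); the estimates should match the hand computation above, b0 = 10 and b1 = 5:

import numpy as np
import statsmodels.api as sm

ads  = np.array([1, 3, 2, 1, 3])        # number of TV ads (x)
cars = np.array([14, 24, 18, 17, 27])   # number of cars sold (y)

X = sm.add_constant(ads)                # adds the intercept column
model = sm.OLS(cars, X).fit()
print(model.params)                     # close to [10, 5]: ŷ = 10 + 5x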

11
Jupyter Code

12
Jupyter code

13
Example Problem- II

• The data in the file hardness.xls provide measurements on the hardness and
tensile strength for 35 specimens of die-cast aluminum.
• It is believed that hardness (measured in Rockwell E units) can be used to
predict tensile strength (measured in thousands of pounds per square inch).
a. Construct a scatter plot.
b. Assuming a linear relationship, use the least-squares method to find the
regression coefficients b0 and b1.
c. Interpret the meaning of the slope, b1, in this problem.
d. Predict the mean tensile strength for die-cast aluminum that has a hardness of
30 Rockwell E units.

14
Tensile strength Hardness
53 29.31
70.2 34.86
84.3 36.82
55.3 30.12
78.5 34.02
63.5 30.82
71.4 35.4
53.4 31.26
82.5 32.18
67.3 33.42
69.5 37.69
73 34.88
55.7 24.66
85.8 34.76
95.4 38.02
51.1 25.68
74.4 25.81
54.1 26.46
77.8 28.67
52.4 24.64
69.1 25.77
53.5 23.69
64.3 28.65
82.7 32.38
55.7 23.21
70.5 34
87.5 34.47
50.7 29.25
72.3 28.71
59.5 29.83
71.3 29.25
52.7 27.99
76.5 31.85
63.7 27.65
69.2 31.7

15
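A minimal sketch for this problem, assuming the table above has been saved as 'hardness.xls' with columns 'Tensile strength' and 'Hardness' (the file and column names are assumptions):

import pandas as pd
import statsmodels.api as sm

df = pd.read_excel('hardness.xls')

# b. Least-squares fit: hardness predicts tensile strength
X = sm.add_constant(df['Hardness'])
model = sm.OLS(df['Tensile strength'], X).fit()
print(model.params)                         # b0 (intercept) and b1 (slope)

# d. Predicted mean tensile strength at 30 Rockwell E units
b0, b1 = model.params
print('prediction at hardness 30:', b0 + b1 * 30)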
16
Thank You

17
REGRESSION
Linear Regression-III
Dr. Ramesh Anbanandam
DEPARTMENT of Management Studies

1
Learning Objectives

• Understanding Coefficient of Determination


• Test statistical hypotheses and construct confidence intervals on
regression model parameters

2
3
Coefficient of Determination
• Relationship Among SST, SSR, SSE

  SST = SSR + SSE

  Σ(yi − ȳ)² = Σ(ŷi − ȳ)² + Σ(yi − ŷi)²

  Syy = (Sxy²/Sxx) + (Syy − Sxy²/Sxx)
where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error
Coefficient of Determination

 The coefficient of determination is:


r2 = SSR/SST
where:
SSR = sum of squares due to regression
SST = total sum of squares
Coefficient of Determination

r2 = SSR/SST = 100/114 = .8772


The regression relationship is very strong; 88% of the
variability in the number of cars sold can be explained by the
linear relationship between the number of TV ads and the
number of cars sold.
Jupyter code
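A minimal sketch of the r² computation for the auto sales data; the SST, SSE, and SSR values below follow from ŷ = 10 + 5x:

import numpy as np

ads  = np.array([1, 3, 2, 1, 3])
cars = np.array([14, 24, 18, 17, 27])

y_hat = 10 + 5*ads                        # estimated regression equation

SST = np.sum((cars - cars.mean())**2)     # total sum of squares      (= 114)
SSE = np.sum((cars - y_hat)**2)           # error sum of squares      (= 14)
SSR = SST - SSE                           # regression sum of squares (= 100)
print('r2 =', SSR / SST)                  # ≈ .8772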

7
Sample Correlation Coefficient

  rxy = (sign of b1)·√(coefficient of determination)

  rxy = (sign of b1)·√r²

  ŷ = b0 + b1x

where:
  b1 = the slope of the estimated regression equation
Sample Correlation Coefficient

  rxy = (sign of b1)·√r²

The sign of b1 in the equation ŷ = 10 + 5x is "+".

  rxy = +√.8772 = +.9366
Assumptions About the Error Term e

1. The error e is a random variable with mean of zero.
2. The variance of e, denoted by σ², is the same for all values of the
   independent variable.
3. The values of e are independent.
4. The error e is a normally distributed random variable.
Testing for Significance

• To test for a significant regression relationship, we must conduct a


hypothesis test to determine whether the value of b1 is zero.
• Two tests are commonly used:

t Test and F Test

• Both the t test and F test require an estimate of σ², the variance of e in
  the regression model.

12
Estimate of σ
• The mean square error (MSE) provides the estimate of σ²; the notation s²
  is also used.

  s² = MSE = SSE/(n − 2)

where:
  SSE = Σ(yi − ŷi)² = Σ(yi − b0 − b1xi)²
Testing for Significance
• An Estimate of σ
• To estimate σ we take the square root of s².
• The resulting s is called the standard error of the estimate.

  s = √MSE = √(SSE/(n − 2))
Testing for Significance

se = standard error of the estimate

   = √(SSE/(n − 2)) = √[(Syy − Sxy²/Sxx)/(n − 2)]

15
Testing for Significance: t Test
• Hypotheses

  H0: β1 = 0
  Ha: β1 ≠ 0

• Test Statistic

  t = b1/sb1
Case 1

H0: β1 = 0

In this case the null hypothesis is not rejected.

17
Case 2

Ha: β1 ≠ 0

In this case the null hypothesis is rejected.

18
The Standard Deviation of the Regression Slope
• The standard error of the regression slope coefficient (b1) is estimated by

  sb1 = sε / √Σ(x − x̄)²  =  sε / √[Σx² − (Σx)²/n]

where:
  sb1 = estimate of the standard error of the least squares slope
  sε = √(SSE/(n − 2)) = sample standard error of the estimate
Testing for Significance: t Test
 Rejection Rule
  Reject H0 if p-value < α
  or t < −tα/2 or t > tα/2
where:
  tα/2 is based on a t distribution with n − 2 degrees of freedom
Testing for Significance: t Test
1. Determine the hypotheses.            H0: β1 = 0
                                        Ha: β1 ≠ 0

2. Specify the level of significance.   α = .05

3. Select the test statistic.           t = b1/sb1

4. State the rejection rule.            Reject H0 if p-value < .05
                                        or |t| > 3.182 (with 3 degrees of freedom)
Testing for Significance: t Test
5. Compute the value of the test statistic.

   t = b1/sb1 = 5/1.08 = 4.63

6. Determine whether to reject H0.

   t = 4.541 provides an area of .01 in the upper tail. Hence, the two-tailed
   p-value is less than .02. (Also, t = 4.63 > 3.182.) We can reject H0.
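A minimal sketch of this t test in Python; statsmodels reports b1, its standard error, the t statistic, and the two-tailed p-value directly:

import numpy as np
import statsmodels.api as sm

ads  = np.array([1, 3, 2, 1, 3])
cars = np.array([14, 24, 18, 17, 27])

model = sm.OLS(cars, sm.add_constant(ads)).fit()
print(model.tvalues[1])   # t = b1/sb1 ≈ 4.63
print(model.pvalues[1])   # two-tailed p-value with n − 2 = 3 df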
Hypothesis Tests for the Slope
of the Regression Model
b b
t 
1

b
1

H 0: 1
0 S b

H :b
1
1
0 where: S 
S e
b
SSXX
b
H 0: 1
0
S 
SSE
e
n2
H :b 0
1
1   X
2

SSXX  X 2

b
H 0: 0 n
1
b  the hypothesized slope
H :b
1
0
df  n  2
1
1
Confidence Interval for b1

 We can use a 95% confidence interval for b1 to test


the hypotheses just used in the t test.
 H0 is rejected if the hypothesized value of b1 is not
included in the confidence interval for b1.
Confidence Interval for b1
• The form of a confidence interval for b1 is:

    b1 ± tα/2·sb1

  where b1 is the point estimator and tα/2·sb1 is the margin of error;
  tα/2 is the t value providing an area of α/2 in the upper tail of a
  t distribution with n − 2 degrees of freedom.
Confidence Interval for b1
• Rejection Rule
Reject H0 if 0 is not included in the confidence interval for b1.
• 95% Confidence Interval for b1
b1  t / 2 sb = 5 +/- 3.182(1.08) = 5 +/- 3.44
1

or 1.56 to 8.44
• Conclusion
0 is not included in the confidence interval.
Reject H0
Testing for Significance: F Test
• Hypotheses
  H0: β1 = 0
  Ha: β1 ≠ 0

• Test Statistic
  F = MSR/MSE
F-Test for Significance

• F Test statistic:

  F = MSR/MSE

where:
  MSR = SSR/k
  MSE = SSE/(n − k − 1)

where F follows an F distribution with k numerator degrees of freedom


and (n - k - 1) denominator degrees of freedom
(k = the number of independent variables in the regression model)
Testing for Significance: F Test
• Rejection Rule
  Reject H0 if p-value < α or F > Fα
where:
  Fα is based on an F distribution with 1 degree of freedom in the numerator
  and n − 2 degrees of freedom in the denominator
Testing for Significance: F Test
1. Determine the hypotheses.            H0: β1 = 0
                                        Ha: β1 ≠ 0

2. Specify the level of significance.   α = .05

3. Select the test statistic.           F = MSR/MSE

4. State the rejection rule.            Reject H0 if p-value < .05
                                        or F > 10.13 (with 1 d.f. in the
                                        numerator and 3 d.f. in the denominator)
Jupyter Code
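A minimal sketch of the overall F test for the auto sales regression; the same quantities are available from the fitted statsmodels results:

import numpy as np
import statsmodels.api as sm

ads  = np.array([1, 3, 2, 1, 3])
cars = np.array([14, 24, 18, 17, 27])

model = sm.OLS(cars, sm.add_constant(ads)).fit()
print(model.fvalue)    # F = MSR/MSE = 100/4.667 ≈ 21.43
print(model.f_pvalue)  # p-value of the overall F test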

31
Jupyter code

32
Testing for Significance: F Test
5. Compute the value of the test statistic.
F = MSR/MSE = 100/4.667 = 21.43
6. Determine whether to reject H0.
   F.025 = 17.44 cuts off an area of .025 in the upper tail (1 and 3 d.f.).
   Since F = 21.43 > 17.44, the p-value is less than .025, which is below
   α = .05. Hence, we reject H0.
The statistical evidence is sufficient to conclude that we have a significant
relationship between the number of TV ads aired and the number of cars
sold.
Some Cautions about the
Interpretation of Significance Tests

• Rejecting H0: b1 = 0 and concluding that the relationship


between x and y is significant does not enable us to
conclude that a cause-and-effect relationship is present
between x and y.
• Just because we are able to reject H0: b1 = 0 and demonstrate
statistical significance does not enable us to conclude that there
is a linear relationship between x and y.
Thank You

35
RBD

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT
IIT ROORKEE

1
Learning Objectives

• Estimate variance components in an experiment involving random factors

• Understand the blocking principle and how it is used to isolate the effect
of nuisance factors

• Design and conduct experiments involving the randomized complete block


design

2
Randomized Block Design

• A completely randomized design (CRD) is useful when the experimental


units are homogeneous

• If the experimental units are heterogeneous, blocking is often used to


form homogeneous groups

3
Why RBD?

• A problem can arise whenever differences due to extraneous factors (ones


not considered in the experiment) cause the MSE term in this ratio to
become large.
• In such cases, the F value in equation can become small, signaling no
difference among treatment means when in fact such a difference exists.

4
Randomized block design

• Experimental studies in business often involve experimental units that are


highly heterogeneous; as a result, randomized block designs are often
employed.
• Blocking in experimental design is similar to stratification in sampling.

5
Randomized block design

• Its purpose is to control some of the extraneous sources of variation by


removing such variation from the MSE term.
• This design tends to provide a better estimate of the true error variance
and leads to a more powerful hypothesis test in terms of the ability to
detect differences among treatment means.

6
Air Traffic Controller Stress Test
• A study measuring the fatigue and stress of
air traffic controllers resulted in proposals
for modification and redesign of the
controller’s work station
• After consideration of several designs for
the work station, three specific alternatives
are selected as having the best potential
for reducing controller stress
• The key question is: To what extent do the
three alternatives differ in terms of their
effect on controller stress?

7
Air Traffic Controller Stress Test
• In a completely randomized design, a random sample of controllers would be
assigned to each work station alternative.
• However, controllers are believed to differ substantially in their ability to
handle stressful situations.
• What is high stress to one controller might be only moderate or even low
stress to another.
• Hence, when considering the within-group source of variation (MSE), we must
realize that this variation includes both random error and error due to
individual controller differences.
• In fact, managers expected controller variability to be a major contributor to
the MSE term.

8
A randomized block design for the air traffic controller
stress test
Treatments
System A System B System C
Controller 1 15 15 18
Controller 2 14 14 14
Controller 3 10 11 15
Blocks
Controller 4 13 12 17
Controller 5 16 13 16
Controller 6 13 13 13

9
Solving this example using ANOVA in python

10
Solving this example using ANOVA in python

11
Summary of stress data for the air traffic controller stress test
Blocks          System A   System B   System C   Block total   Block means
Controller 1    15         15         18         48            x̄1. = 16
Controller 2    14         14         14         42            x̄2. = 14
Controller 3    10         11         15         36            x̄3. = 12
Controller 4    13         12         17         42            x̄4. = 14
Controller 5    16         13         16         45            x̄5. = 15
Controller 6    13         13         13         39            x̄6. = 13
Column totals   81         78         93         252           overall mean x̄ = 252/18 = 14

12
Summary of stress data for the air traffic controller stress test

• Treatment means

x.1 = 81/6 =13.5


x.2 = 78/6 =13
x.3 = 93/6 =15.5

13
ANOVA TABLE FOR THE RANDOMIZED BLOCK DESIGN WITH k
TREATMENTS AND b BLOCKS

Sources of    Sum of    Degrees of       Mean Square                  F          P-value
Variation     Squares   Freedom
Treatments    SSTR      k − 1            MSTR = SSTR/(k − 1)          MSTR/MSE
Blocks        SSBL      b − 1            MSBL = SSBL/(b − 1)
Error         SSE       (k − 1)(b − 1)   MSE = SSE/[(k − 1)(b − 1)]
Total         SST       nT − 1

14
RBD Problem

15
RBD Problem

16
RBD Problem

17
ANOVA table for the air traffic controller stress test

Sources of    Sum of    Degrees of   Mean Square   F                 P-value
Variation     Squares   Freedom
Treatments    21        2            10.5          10.5/1.9 = 5.53   0.024
Blocks        30        5            6.0
Error         19        10           1.9
Total         70        17

Reject the null hypothesis


18
Solving RBD example using python
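A minimal sketch of the randomized block ANOVA for the controller stress data; the column names below are chosen here, the slides show only the table:

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

data = pd.DataFrame({
    'controller': list(range(1, 7))*3,
    'system': ['A']*6 + ['B']*6 + ['C']*6,
    'stress': [15, 14, 10, 13, 16, 13,    # System A
               15, 14, 11, 12, 13, 13,    # System B
               18, 14, 15, 17, 16, 13]})  # System C

# Treat both the treatment (system) and the block (controller) as factors.
model = ols('stress ~ C(system) + C(controller)', data=data).fit()
print(sm.stats.anova_lm(model, typ=2))   # SSTR = 21, SSBL = 30, SSE = 19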

19
Solving RBD example using python

20
Conclusion

• Finally, note that the ANOVA table shown in Table provides an F value to
test for treatment effects but not for blocks.
• The reason is that the experiment was designed to test a single factor—
work station design.
• The blocking based on individual stress differences was conducted to
remove such variation from the MSE term.
• However, the study was not designed to test specifically for individual
differences in stress.

21
Problem 2: RBD

• An experiment was performed to determine the effect of four different


chemicals on the strength of a fabric.
• These chemicals are used as part of the permanent press finishing
process.
• Five fabric samples were selected, and a randomized complete block
design was run by testing each chemical type once in random order on
each fabric sample.
• The data are shown in Table.
• We will test for differences in means using an ANOVA with alpha = 0.01.

22
Problem 2: RBD

• Table: Fabric Strength Data—Randomized Complete Block Design

23
Anova using jupyter

24
Problem 2: RBD

• The sums of squares for the analysis of variance are computed as follows:

25
Problem 2: RBD

26
Problem 2: RBD
• Analysis of Variance for the Randomized Complete Block Experiment
Sources of Variation          Sum of    Degrees of   Mean     F       P-value
                              Squares   Freedom      Square
Chemical types (Treatments)   18.04     3            6.01     75.13   4.79 × 10⁻⁸
Fabric samples (Blocks)       6.69      4            1.67
Error                         0.96      12           0.08
Total                         25.69     19

27
Conclusion

• The ANOVA is summarized in the previous table


• Since f0 = 75.13 > f0.01,3,12 = 5.95 (the P-value is 4.79 × 10⁻⁸), we
  conclude that there is a significant difference among the chemical types so
  far as their effect on strength is concerned.

28
Python code for problem 2

29
Python code for problem 2

30
Python code for problem 2

31
Categorical Variable Regression

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Purpose of this lecture is to show how categorical variables are handled in


regression analysis.
• To illustrate the use and interpretation of a categorical independent
variable, we will consider two problems
• Demo on python

2
What are dummy variables?
• Dummy variables, also called indicator variables allow us to include
categorical data (like Gender) in regression models

• A dummy variable can take only 2 values, 0 (absence of a category) and 1


(presence of a category)

3
Example 1: Problem / Background
• Johnson Filtration, Inc., provides maintenance service for
water-filtration systems.
• Customers contact Johnson with requests for
maintenance service on their water-filtration systems
• To estimate the service time and the service cost,
Johnson’s managers want to predict the repair time
necessary for each maintenance request
• Hence, repair time in hours is the dependent variable
• Repair time is believed to be related to two factors,
– the number of months since the last maintenance service
– the type of repair problem (mechanical or electrical).
Source: Statistics for Business & Economics, David R. Anderson, Dennis J. Sweeney, Thomas A. Williams, Jeffrey D. Camm, James J. Cochran, Cengage Learning,2013

4
Data for the Johnson filtration example

service call months_since_last_service type_of_repair repair_time_in_hours


1 2 electrical 2.9
2 6 mechanical 3
3 8 electrical 4.8
4 3 mechanical 1.8
5 2 electrical 2.9
6 7 electrical 4.9
7 9 mechanical 4.2
8 8 mechanical 4.8
9 4 electrical 4.4
10 6 electrical 4.5

5
6
Linear Regression

7
OLS Summary

8
Linear regression

9
Normal probability plot

10
Creating dummies
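A minimal sketch of creating the repair-type dummy and fitting the model for the Johnson Filtration data shown earlier (the DataFrame is built inline here; column names follow the earlier table):

import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    'months_since_last_service': [2, 6, 8, 3, 2, 7, 9, 8, 4, 6],
    'type_of_repair': ['electrical', 'mechanical', 'electrical', 'mechanical',
                       'electrical', 'electrical', 'mechanical', 'mechanical',
                       'electrical', 'electrical'],
    'repair_time_in_hours': [2.9, 3, 4.8, 1.8, 2.9, 4.9, 4.2, 4.8, 4.4, 4.5]})

# x2 = 1 for electrical, 0 for mechanical
df['x2'] = (df['type_of_repair'] == 'electrical').astype(int)

X = sm.add_constant(df[['months_since_last_service', 'x2']])
model = sm.OLS(df['repair_time_in_hours'], X).fit()
print(model.params)   # b2 ≈ 1.26: electrical repairs take about 1.26 hours longer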

11
DATA FOR THE JOHNSON FILTRATION EXAMPLE WITH TYPE OF REPAIR INDICATED BY A
DUMMY VARIABLE (x2 = 0 FOR MECHANICAL; x2 = 1 FOR ELECTRICAL)

12
Adding dummies to table

13
OLS Summary

14
Dummy regression

15
Interpreting the Parameters

Equation 1

Equation 2

16
Interpreting the Parameters
• Comparing equations, we see that the mean repair time is a linear

function of x1 for both mechanical and electrical repairs.

• The slope of both equations is b1, but the y-intercept differs.

• The y-intercept is b0 in equation 1 for mechanical repairs and (b0 +b2) in

equation 2 for electrical repairs.

17
Interpreting the Parameters

• The interpretation of b2 is that it indicates the difference between the


mean repair time for an electrical repair and the mean repair time for a
mechanical repair.
• If b2 is positive, the mean repair time for an electrical repair will be
greater than that for a mechanical repair; if b2 is negative, the mean
repair time for an electrical repair will be less than that for a mechanical
repair.
• Finally, if b2 = 0, there is no difference in the mean repair time between
electrical and mechanical repairs and the type of repair is not related to
the repair time.

18
Interpreting the Parameters

• In effect, the use of a dummy variable for type of repair provides two
estimated regression equations that can be used to predict the repair
time, one corresponding to mechanical repairs and one corresponding to
electrical repairs.
• In addition, with b2= 1.26, we learn that, on average, electrical repairs
require 1.26 hours longer than mechanical repairs.

19
Interpreting the Parameters

20
More Complex Categorical Variables

• A categorical variable with k levels must be modeled using k - 1 dummy


variables.
• Care must be taken in defining and interpreting the dummy variables.

21
Example 2: Problem / Background

• The manager of a small sales


force wants to know whether
average monthly salary is
different for males and females in
the sales force.
• He obtains data on monthly
salary and experience (in months)
for each of the 9 employees as
shown on the next slide.

22
Data
Employee Salary Gender Experience

1 7.5 Male 6

2 8.6 Male 10

3 9.1 Male 12

4 10.3 Male 18

5 13 Male 30

6 6.2 Female 5

7 8.7 Female 13

8 9.4 Female 15

9 9.8 Female 21
24
25
26
27
28
Creating a dummy variable for gender
• Categorical data is included in
regression analysis by using Employee Salary Gender
dummy variables 1 7.5 0
2 8.6 0
3 9.1 0
• For example, we can assign a
value of 0 for males and 1 for 4 10.3 0
females in our data so that a 5 13 0
MR model can be developed 6 6.2 1
7 8.7 1
8 9.4 1
9 9.8 1
30
31
More on the intercept and slope

• The value of the intercept, 9.70, is the average salary for males (as we
coded gender=1 for females and 0 for males)

• The value of the slope, -1.175, tells us that the average female salary is
  lower than the average male salary by 1.175

32
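A minimal sketch reproducing the gender-dummy regression on the salary data above (salary in $000; gender coded 1 = female, 0 = male as in the slides):

import numpy as np
import statsmodels.api as sm

salary = np.array([7.5, 8.6, 9.1, 10.3, 13, 6.2, 8.7, 9.4, 9.8])
gender = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1])

model = sm.OLS(salary, sm.add_constant(gender)).fit()
print(model.params)   # intercept ≈ 9.70 (male mean), slope ≈ −1.175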
33
What would have happened if we had used 0 for females and
1 for males in our data? Would our results be any different?

34
Male = 1, female = 0

• Not really – With coding as above, the intercept would change to


8.525 (the average female salary), the slope for gender would still
be 1.175, but now it would have a positive sign (reflecting that
average male salary is higher than average female salary by 1.175).
Predicted salaries from the model for males / females would not
change no matter how dummy variable is coded

35
More on dummy variables

• For gender, we had only 2 categories – female and male – thus we


used a single 0/1 variable for this

• When there are more than 2 categories, the number of dummy


variables that should be used equals the number of categories
minus 1

• No. of Dummy Variables = No. of levels -1

36
Example: Salary vs. Job Grade

• In this example, the categorical variable job grade has 3 levels:
  1 (lowest grade), 2, and 3 (highest job grade).

Employee   Job Grade   Salary ($000)
1          1           7.5
2          3           8.6
3          2           9.1
4          3           10.3
5          3           13
6          1           6.2
7          2           8.7
8          2           9.4
9          3           9.8

37
Representing 3-level Job Grade using dummy variables
Job_1 and Job_2
Dummy Variables

Employee's Job Grade   Job_1   Job_2
1                      1       0
2                      0       1
3                      0       0

Job Grade 3 is the reference category

38
Data file with dummy variables for job grade
Job
Employee Grade Salary Job_1 Job_2
1 1 7.5 1 0
2 3 8.6 0 0
3 2 9.1 0 1
4 3 10.3 0 0
5 3 13 0 0
6 1 6.2 1 0
7 2 8.7 0 1
8 2 9.4 0 1
9 3 9.8 0 0

39
Thank You

40
Estimation, Prediction of Regression Model Residual
Analysis: Validating Model Assumptions - I

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Point Estimation
• Interval Estimation
• Confidence Interval for the Mean Value of y
• Prediction Interval for an Individual Value of y

2
Problem

• Data were collected from a sample of 10 Ice cream vendors located near
college campuses.

• For the ith observation or restaurant in the sample, xi is the size of the
student population (in thousands) and yi is the quarterly sales (in
thousands of dollars).

• The values of xi and yi for the 10 restaurants in the sample are summarized
in Table
3
Data
Student Population Sales
Restaurant (1000) (1000)
1 2 58
2 6 105
3 8 88
4 8 118
5 12 117
6 16 137
7 20 157
8 20 169
9 22 149
10 26 202
4
Python code for scatter plot

5
Python code for scatter plot

6
Python code for regression Equation

7
Python code for regression Equation

8
Python code for regression

• In the Ice cream vendor example, the estimated regression equation 60 +


5x provides an estimate of the relationship between the size of the
student population x and quarterly sales y.

9
10
Point Estimate
• We can use the estimated regression equation to develop a point estimate
of the mean value of y for a particular value of x or to predict an individual
value of y corresponding to a given value of x.

• For instance, suppose a manager want a point estimate of the mean


quarterly sales for all restaurants located near college campuses with
10,000 students.

11
Point estimate

• Using the estimated regression equation 60 +5x, we see that for x 10 (or
10,000 students), 60 + 5(10) = 110.
• Thus, a point estimate of the mean quarterly sales for all restaurants
located near campuses with 10,000 students is $110,000.

12
Point estimate

• Now suppose the manager want to predict sales for an individual


restaurant located near College, with 10,000 students.
• In this case we are not interested in the mean value for all restaurants
located near campuses with 10,000 students;
• We are just interested in predicting quarterly sales for one individual
restaurant.
• As it turns out, the point estimate for an individual value of y is the same
as the point estimate for the mean value of y.
• Hence, we would predict quarterly sales of 60 + 5(10) = 110 or $110,000
for this one restaurant.
13
Plot at mean value of x and y

14
Confidence Interval Estimation

• Confidence interval, is an interval estimate of the mean value of y for a


given value of x.
• Prediction interval, is used whenever we want an interval estimate of an
individual value of y for a given value of x.
• The point estimate of the mean value of y is the same as the point
estimate of an individual value of y.
• The margin of error is larger for a prediction interval.

15
Confidence Interval Estimation

xp = the particular or given value of the independent variable x
yp = the value of the dependent variable y corresponding to the given xp
E(yp) = the mean or expected value of the dependent variable y corresponding
        to the given xp
ŷp = b0 + b1xp = the point estimate of E(yp) when x = xp

For xp = 10:  ŷp = 60 + 5(10) = 110

16
Confidence Interval Estimation

In general, we cannot expect ŷp to equal E(yp) exactly.

If we want to make an inference about how close ŷp is to the true mean value
E(yp), we will have to estimate the variance of ŷp.

The formula for estimating the variance of ŷp given xp is denoted by s²ŷp.

17
Confidence Interval Estimation

18
Confidence Intervals for the Mean sales y at given values of
student population x

19
Python Code
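A minimal sketch of the confidence and prediction intervals for the restaurant data, using statsmodels' get_prediction (variable names chosen here):

import numpy as np
import statsmodels.api as sm

students = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26])   # x, in 1000s
sales    = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202])

model = sm.OLS(sales, sm.add_constant(students)).fit()

pred = model.get_prediction(np.array([[1.0, 10.0]]))   # constant + xp = 10
print(pred.predicted_mean)                   # point estimate: 110
print(pred.conf_int(alpha=0.05))             # 95% confidence interval for E(yp)
print(pred.conf_int(obs=True, alpha=0.05))   # 95% prediction interval for yp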

20
Special Case
The estimated standard deviation of ŷp is smallest when xp = x̄, since then
the quantity xp − x̄ = 0.

21
Prediction Interval for an Individual Value of y

• Instead of estimating the mean value of sales for all restaurants located
near campuses with 10,000 students, we want to estimate the sales for an
individual restaurant located near a particular College with 10,000
students.
(1) The variance of individual y values about the mean E(yp), an estimate of
    which is given by s².
(2) The variance associated with using ŷp to estimate E(yp), an estimate of
    which is given by s²ŷp.

22
Prediction Interval for an Individual Value of y

23
Prediction Interval for an Individual Value of y

24
Prediction Interval for an Individual Value of y

25
Prediction Interval for an Individual Value of y

26
Confidence intervals vs prediction intervals

• Confidence intervals and prediction intervals show the precision of the


regression results.
• Narrower intervals provide a higher degree of precision

27
Python Code for Prediction Interval

28
Python Code

29
Python Code

30
Thank You

31
Estimation, Prediction of Regression Model Residual
Analysis: Validating Model Assumptions - II

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Understanding different types of residual analysis


• Plotting residual plots using python

2
Residual Analysis: Validating Model Assumptions

• Residual analysis is the primary tool for determining whether the assumed
regression model is appropriate

3
Assumptions about the error term ε

4
Importance of the Assumptions

• These assumptions provide the theoretical basis for the t test and the F
test used to determine whether the relationship between x and y is
significant, and for the confidence and prediction interval estimates
• If the assumptions about the error term ε appear questionable, the
hypothesis tests about the significance of the regression relationship and
the interval estimation results may not be valid.

5
Residuals for Ice cream parlours

Source: Statistics for Business & Economics, David R. Anderson, Dennis J. Sweeney, Thomas A. Williams, Jeffrey D. Camm, James J. Cochran, Cengage Learning,2013

6
Residual analysis is based on an examination of graphical plots

• A plot of the residuals against values of the independent variable x
• A plot of residuals against the predicted values of the dependent variable ŷ
• A standardized residual plot
• A normal probability plot

7
Residual Plot Against x

8
Residual Plot Against x

9
Assumption: the variance is the same for all values of x

• The residual plot should give an


overall impression of a horizontal
band of points

10
Violation of Assumption:
The variance of ‘e’ is not the same for all values of x

• Assumption of a constant
variance of ‘e’ is violated
• If variability about the regression
line is greater for larger values of
x

11
Assumed regression model is not an adequate
representation

A curvilinear regression model or


multiple regression model
should be considered.

12
Residual Plot Against ŷ

• The pattern of this residual plot is the


same as the pattern of the residual plot
against the independent variable x.
• It is not a pattern that would lead us to
question the model assumptions.

13
Residual Plot Against ŷ

• For simple linear regression, both the residual plot against x and the
  residual plot against ŷ provide the same pattern.
• For multiple regression analysis, the residual plot against ŷ is more widely
  used because of the presence of more than one independent variable.

14
Standardized Residuals

• Many of the residual plots provided by computer software packages use a


standardized version of the residuals.
• A random variable is standardized by subtracting its mean and dividing the
result by its standard deviation.
• With the least squares method, the mean of the residuals is zero.
• Thus, simply dividing each residual by its standard deviation provides the
standardized residual

15
Python Code
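A minimal sketch of the residual plots for the restaurant data, with standardized residuals obtained from statsmodels' influence methods:

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

students = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26])
sales    = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202])

model = sm.OLS(sales, sm.add_constant(students)).fit()

# Residuals against x: should look like a horizontal band around zero
plt.scatter(students, model.resid)
plt.axhline(0)
plt.xlabel('student population (1000s)')
plt.ylabel('residual')
plt.show()

# Internally studentized (standardized) residuals
influence = model.get_influence()
print(influence.resid_studentized_internal)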

16
17
Python Code

18
Standardized Residuals

19
Computation of standardized residuals for Icecream parlors

20
Computation of standardized residuals for Icecream parlors

21
Plot of The Standardized Residuals Against The Independent
Variable x

22
Plot of The Standardized Residuals Against The Independent
Variable x

23
Studentized residual

• The standardized residual plot can provide insight about the assumption
that the error ‘e’ term has a normal distribution.
• If this assumption is satisfied, the distribution of the standardized
residuals should appear to come from a standard normal probability
distribution.

24
Studentized residual

• Thus, when looking at a standardized residual plot, we should expect to


see approximately 95% of the standardized residuals between -2 and 2.
• We see in Figure that for the Armand’s example all standardized residuals
are between -2 and 2.
• Therefore, on the basis of the standardized residuals, this plot gives us no
reason to question the assumption that ‘e’ has a normal distribution.

25
Normal Probability Plot

• Another approach for determining the validity of the assumption that the
error term has a normal distribution is the normal probability plot.
• To show how a normal probability plot is developed, we introduce the
concept of normal scores.

26
Normal Probability Plot

• Suppose 10 values are selected randomly from a normal probability


distribution with a mean of zero and a standard deviation of one, and that
the sampling process is repeated over and over with the values in each
sample of 10 ordered from smallest to largest.
• For now, let us consider only the smallest value in each sample.
• The random variable representing the smallest value obtained in repeated
sampling is called the first-order statistic.

27
Normal Probability Plot

28
Normal Probability Plot

29
Normal scores and ordered standardized residuals for
Armand’s pizza parlors

30
Normal Probability Plot

• If the normality assumption is satisfied, the smallest standardized residual


should be close to the smallest normal score, the next smallest
standardized residual should be close to the next smallest normal score,
and so on.
• If we were to develop a plot with the normal scores on the horizontal axis
and the corresponding standardized residuals on the vertical axis, the
plotted points should cluster closely around a 45-degree line passing
through the origin if the standardized residuals are approximately
normally distributed.
• Such a plot is referred to as a normal probability plot.

31
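A minimal sketch of a normal probability plot of the standardized residuals, using scipy's probplot; points near the 45-degree line support the normality assumption:

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import statsmodels.api as sm

students = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26])
sales    = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202])

model = sm.OLS(sales, sm.add_constant(students)).fit()
std_resid = model.get_influence().resid_studentized_internal

stats.probplot(std_resid, dist='norm', plot=plt)
plt.show()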
Normal probability plot for Ice Cream parlors

32
33
Thank You

34
MULTIPLE REGRESSION MODEL - I

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Multiple regression model


• Least squares method
• Multiple coefficient of determination
• Model assumptions
• Testing for significance F-Test, t-Test

2
Multiple regression model

3
The estimation process For multiple regression

4
Simple vs multiple regression

• In simple linear regression, b0 and b1 were the sample statistics used to
  estimate the parameters β0 and β1.
• Multiple regression parallels this statistical inference process, with
  b0, b1, b2, . . . , bp denoting the sample statistics used to estimate the
  parameters β0, β1, β2, . . . , βp.

5
Least Squares Method

6
Least Squares Method

7
An Example: Trucking Company

• As an illustration of multiple regression analysis, we will consider a


problem faced by the Trucking Company.
• A major portion of business involves deliveries throughout its local area.
• To develop better work schedules, the managers want to estimate the
total daily travel time for their drivers.

Source: Statistics for Business and Economics, 2012, Anderson

8
PRELIMINARY DATA FOR BUTLER TRUCKING

9
Using python import data

10
Using python import data

11
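A minimal sketch of importing and fitting the trucking data, assuming a file 'trucking.csv' with columns 'x1' (miles traveled), 'x2' (number of deliveries), and 'y' (travel time in hours); all of these names are assumptions, since the slides show the table only as an image:

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv('trucking.csv')

X = sm.add_constant(df[['x1', 'x2']])
model = sm.OLS(df['y'], X).fit()
print(model.summary())   # b0, b1, b2, R², adjusted R², F test, t tests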
Scatter Diagram Of Preliminary Data For Trucking x1

12
Scatter Diagram Of Preliminary Data For Trucking x2

13
Scatter Diagram For x1 and x2

14
Linear regression Vs. multiple regression model

• Linear regression

15
Linear regression Vs. multiple regression model

16
Linear regression Vs. Multiple regression model

• Multiple regression

17
Linear regression Vs. Multiple regression model

18
Multiple Coefficient of Determination

19
Multiple Coefficient of Determination for linear model

20
Multiple Coefficient of Determination for Multiple regression
model

21
Multiple Coefficient of Determination

22
Multiple Coefficient of Determination

• Adding independent variables causes the prediction errors to become


smaller, thus reducing the sum of squares due to error, SSE.
• Because SSR = SST- SSE, when SSE becomes smaller, SSR becomes larger,
causing R2 = SSR/SST to increase.
• Many analysts prefer adjusting R2 for the number of independent variables
to avoid overestimating the impact of adding an independent variable on
the amount of variability explained by the estimated regression equation.

23
Adjusted Multiple Coefficient of Determination
n = number of observations
p = denoting the number of independent variables

24
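The adjustment can be computed directly from R², n, and p, using the standard formula adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1). A minimal sketch (the example numbers are illustrative, not from the slides):

def adjusted_r2(r2: float, n: int, p: int) -> float:
    """n = number of observations, p = number of independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Illustrative values: R² = 0.90 with n = 10 observations and p = 2 predictors
print(adjusted_r2(0.90, 10, 2))   # ≈ 0.8714, slightly below the raw R²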
OLS Summary

25
Adjusted Multiple Coefficient Vs Multiple Coefficient

• If a variable is added to the model, R2 becomes larger even if the variable


added is not statistically significant.
• The adjusted multiple coefficient of determination compensates for the
number of independent variables in the model.

26
Adjusted Multiple Coefficient Vs Multiple Coefficient

• If the value of R2 is small and the model contains a large number of


independent variables, the adjusted coefficient of determination can take
a negative value

27
Model Assumptions

28
Assumption about error term

1. The error term e is a random variable with mean or expected value of zero;
   E(e) = 0.
   Implication: For given values of x1, x2, …, xp, the expected, or average,
   value of y is given by

     E(y) = β0 + β1x1 + β2x2 + . . . + βpxp

   – This equation represents the average of all possible values of y that
     might occur for the given values of x1, x2, …, xp.

29
Assumption about error term

30
Graph of the regression equation for multiple regression
analysis with two independent variables

31
Response variable and response surface

• In regression analysis, the term response variable is often used in place of


the term dependent variable.
• Furthermore, since the multiple regression equation generates a plane or
surface, its graph is called a response surface.

32
Thank You

33
MULTIPLE REGRESSION MODEL-II

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Testing for significance


– F Test
– t Test
• Python Demo for multiple regression

2
Testing for Significance

• The F test is used to determine whether a significant relationship exists


between the dependent variable and the set of all the independent
variables; we will refer to the F test as the test for overall significance.
• If the F test shows an overall significance, the t test is used to determine
whether each of the individual independent variables is significant.
• A separate t test is conducted for each of the independent variables in the
model; we refer to each of these t tests as a test for individual significance.

3
F Test

4
F test significance

5
F test significance

6
F Test

7
ANOVA table

8
t Test for individual significance

9
t Test for individual significance

10
t Test for individual significance

11
Regression Approach
to ANOVA
Regression Approach to ANOVA
• Three different assembly methods, referred to as methods A, B, and C, have been
proposed.
• Managers at Chemitech want to determine which assembly method can produce
the greatest number of filtration systems per week

A B C
58 58 48
64 69 57
55 71 59
66 64 47
67 68 49
ANOVA
Anova: Single Factor

SUMMARY
Groups Count Sum Average Variance
A 5 310 62 27.5
B 5 330 66 26.5
C 5 260 52 31

ANOVA
Source of
Variation SS df MS F P-value F crit
Between
Groups 520 2 260 9.176471 0.003818 3.885294
Within
Groups 340 12 28.33333

Total 860 14
Dummy variables for the chemitech experiment
Dummy variables for the chemitech experiment

• If we are interested in the expected value of the number of units


assembled per week for an employee who uses method C, our procedure
for assigning numerical values to the dummy variables would result in
setting A = B= 0.
• The multiple regression equation then reduces to
Dummy variables for the chemitech experiment

• For method A the values of the dummy variables are A = 1 and B = 0, and

• For method B we set A = 0 and B = 1, and


SUMMARY OUTPUT

Regression Statistics
Multiple R 0.777593186
R Square 0.604651163
Adjusted R Square 0.53875969
Standard Error 5.322906474
Observations 15

ANOVA
df SS MS F Significance F
Regression 2 520 260 9.176471 0.003818412
Residual 12 340 28.33333
Total 14 860

Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 52 2.380476143 21.84437 4.97E-11 46.81338804 57.18661196 46.81338804 57.18661196
A 10 3.366501646 2.970443 0.011692 2.665023022 17.33497698 2.665023022 17.33497698
B 14 3.366501646 4.15862 0.001326 6.665023022 21.33497698 6.665023022 21.33497698
Estimation of E(y)

• b0 = 52
• b1= 10
• b2 = 14
Assembly Method Estimation of E(y)
A b0+b1 = 52+10=62
B b0+b2 = 52 +14 = 66
C 52
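A minimal sketch of this regression approach to ANOVA for the Chemitech data, with dummies A and B and method C as the reference category:

import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    'units':  [58, 64, 55, 66, 67,    # method A
               58, 69, 71, 64, 68,    # method B
               48, 57, 59, 47, 49],   # method C
    'method': ['A']*5 + ['B']*5 + ['C']*5})

df['A'] = (df['method'] == 'A').astype(int)
df['B'] = (df['method'] == 'B').astype(int)

X = sm.add_constant(df[['A', 'B']])
model = sm.OLS(df['units'], X).fit()
print(model.params)   # intercept = 52 (mean of C), A = 10, B = 14
print(model.fvalue)   # F ≈ 9.18, matching the single-factor ANOVA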
Testing the significance
Thank You

21
Linear Regression Model Vs Logistic Regression Model

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Comparison of Linear Regression model and Logistic regression model

2
Estimating the relationship

Linear regression model:
• Y1 = X1 + X2 + … + Xn
• Where Y1 = continuous data; independent variables = nonmetric and metric

Logistic regression model:
• Y1 = X1 + X2 + … + Xn
• Where Y1 = binary nonmetric; independent variables = nonmetric and metric

3
Graphical representation
• Linear regression • Logistic regression

4
Correspondence of Primary Elements of Model Fit
Linear Regression:
• Total sum of squares
• Error sum of squares
• F test of model fit
• Coefficient of determination (R²)
• Regression sum of squares

Logistic Regression:
• −2LL of base model
• −2LL of proposed model
• Chi-square test of −2LL difference
• Pseudo R² measures
• Difference of −2LL for base and proposed models

5
Objective of logistic regression

• Logistic regression is identical to discriminant analysis in terms of the basic


objectives it can address
• Logistic regression is best suited to address two research objectives:
– Identifying the independent variables that impact group membership
in the dependent variable
– Establishing a classification system based on the logistic model for
determining group membership

6
The fundamental difference

• Logistic regression differs from linear regression, in being specifically


designed to predict the probability of an event occurring (ie., the
probability of an observation being in the group coded 1)
• Although probability values are metric measures, there are fundamental
differences between linear regression and logistic regression

7
Log likelihood

• Measure used in logistic regression to represent the lack of predictive


fit
• Even though this method does not use the least squares procedure in
model estimation, as is done in linear regression, the likelihood value is
similar to the sum of squared error in regression analysis

8
Logistic vs discriminant

• Logistic regression may be preferred for two reasons


• First, discriminant analysis relies on strictly meeting the assumptions of
– Multivariate normality and equal variance
– Covariance matrices across groups
– Assumptions that are not met in many situations
• Logistic regression does not face these strict assumptions and is much
more robust when these assumptions are not met, making its application
appropriate in many situations

9
Logistic vs discriminant

• Second, even if the assumptions are met, many researchers prefer logistic
regression because it is similar to multiple regression
• It has straightforward statistical tests, similar approaches to incorporating
metric and nonmetric variables and nonlinear effects, and a wide range of
diagnostics
• Logistic regression is equivalent to two-group discriminant analysis and
may be more suitable in many situations

10
Logistic vs discriminant : Sample size

• One factor that distinguishes logistic regression from the other techniques
is its use of maximum likelihood (MLE) as the estimation technique
• MLE requires larger samples such that, all things being equal, logistic
regression will require a larger sample size than multiple regression
• As for discriminant analysis, there are considerations on the minimum
group size as well

11
Logistic vs discriminant : Sample size

• The recommended sample size for each group is at least 10 observations


per estimated parameter
• This is much greater than multiple regression, which had a minimum of
five observations per parameter, and that was for the overall sample, not
the sample size for each group, as seen with logistic regression

12
Determination of coefficients

Linear regression:
  r² = SSR/SST
  where:
    SSR = sum of squares due to regression
    SST = total sum of squares

Logistic regression:
  R²Logit = [−2LLnull − (−2LLmodel)] / (−2LLnull)
  where:
    LL = log likelihood
    −2LLnull = −2LL of base model
    −2LLmodel = −2LL of proposed model

13
Determination of coefficients
• Linear regression • Logistic regression

14
Testing for overall significance

Linear regression:
• F-test of model fit
• F = MSR/MSE

Logistic Regression:
• G-test of model fit
• G = −2 ln(likelihood without the variable / likelihood with the variable)

15
Testing for overall significance
• Linear regression • Logistic regression

16
Testing for significance
Linear regression: t-test

  t = (b1 − β1) / sb1

  where:
    sb1 = se / √SSxx
    se  = √(SSE/(n − 2))
    SSxx = Σx² − (Σx)²/n
    β1 = the hypothesized slope
    df = n − 2

Logistic regression: Wald test

17
Testing for significance
• Linear regression • Logistic regression

18
Model Estimation fit

• The basic measure of how well the maximum likelihood estimation


procedure fits is the likelihood value, similar to the sums of squares values
used in multiple regression
• Logistic regression measures model estimation fit with the value of -2
times the log of the likelihood value, referred to as -2LL or -2 log likelihood
• The minimum value for -2LL is 0, which corresponds to a perfect fit
(likelihood = 1 and -2LL is then 0)

19
Model Estimation fit

• The lower the -2LL value, the better the fit of the model
• The -2LL value can be used to compare equations for the change in fit

20
Between Model Comparison

• The likelihood value can be compared between equations to assess the


difference in predictive fit from one equation to another, with statistical
tests for the significance of these differences
• The basic approach follows three steps:

21
Step 1 : Estimate a null model

• The first step is to calculate a null model, which acts as the baseline for
making comparisons of improvement in model fit.
• The most common null model is one without any independent variables,
which is similar to calculating the total sum of squares using only the
mean in linear regression.
• The logic behind this form of null model is that it can act as a baseline
against which any model containing independent variables can be
compared.

22
Step 2: Estimate the proposed model

• This model contains the independent variables to be included in the


logistic regression model.
• This model fit will improve on the null model and result in a lower −2LL
  value.
• Any number of proposed models can be estimated

23
Step 3: Assess -2LL difference:

• The final step is to assess the statistical significance of the -2LL value
between the two models (null model versus proposed model).
• If the statistical tests support significant differences, then we can state
that the set of independent variable(s) in the proposed model is
significant in improving model estimation fit.

24
Between model comparison

Linear regression:
• SSE = Σ(yi − ŷi)²

Logistic Regression:
• −2LL of proposed model

25
Between model comparison

Linear Regression:
• SSR = Σ(ŷi − ȳ)²
• SSR = SST − SSE

Logistic regression:
• Difference between log likelihoods
• = (−2LLnull) − (−2LLmodel)

26
Normality of Residual (Error)
Linear regression:
• Residuals are normally distributed.
• Linear regression assumes that residuals are approximately equal for all
  predicted dependent variable values.

Logistic regression:
• Residuals are binomially distributed.
• Logistic regression does not need residuals to be equal for each level of
  the predicted dependent variable values.

27
Estimation Methods

• Linear regression is based on least squares estimation: regression
  coefficients are chosen so as to minimize the sum of the squared distances
  of each observed response to its fitted value.
• Logistic regression is based on Maximum Likelihood Estimation (MLE):
  coefficients are chosen so as to maximize the probability of Y given X
  (the likelihood). With MLE, the computer uses different "iterations" in
  which it tries different solutions until it gets the maximum likelihood
  estimates.

28
Interpretation

Coefficients of linear regression are interpreted as:
• Keeping all other independent variables constant, how much the dependent
  variable is expected to increase/decrease with a unit increase in the
  independent variable.

In logistic regression, we interpret odds ratios as:
• The effect of a one-unit change in X on the predicted odds ratio, with the
  other variables in the model held constant.

29
THANK YOU

30
LOGISTIC REGRESSION - I

Dr A. RAMESH
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Building Logistic regression Model


• Python Demo on Logistic Regression

2
Application
• In many regression applications the dependent variable may only assume
two discrete values.
• For instance, a bank might like to develop an estimated regression
equation for predicting whether a person will be approved for a credit
card or not
• The dependent variable can be coded as y =1 if the bank approves the
request for a credit card and y = 0 if the bank rejects the request for a
credit card.
• Using logistic regression we can estimate the probability that the bank
will approve the request for a credit card given a particular set of values
for the chosen independent variables.

3
Example

• Let us consider an application of logistic regression involving a direct mail


promotion being used by Simmons Stores.
• Simmons owns and operates a national chain of women’s apparel stores.
• Five thousand copies of an expensive four-color sales catalog have been
printed, and each catalog includes a coupon that provides a $50 discount
on purchases of $200 or more.
• The catalogs are expensive and Simmons would like to send them to only
those customers who have the highest probability of using the coupon.

Sources: Statistics for Business and Economics,11th Edition by David R. Anderson (Author), Dennis J.
Sweeney (Author), Thomas A. Williams (Author)

4
Variables

• Management thinks that annual spending at Simmons Stores and whether


a customer has a Simmons credit card are two variables that might be
helpful in predicting whether a customer who receives the catalog will use
the coupon.
• Simmons conducted a pilot study using a random sample of 50 Simmons
credit card customers and 50 other customers who do not have a
Simmons credit card.
• Simmons sent the catalog to each of the 100 customers selected.
• At the end of a test period, Simmons noted whether or not the customer used
  the coupon.

5
Data (10 customer out of 100)
Customer Spending Card Coupon
1 2.291 1 0
2 3.215 1 0
3 2.135 1 0
4 3.924 0 0
5 2.528 1 0
6 2.473 0 1
7 2.384 0 0
8 7.076 0 0
9 1.182 1 1
10 3.345 0 0

6
Explanation of Variables

• The amount each customer spent last year at Simmons is shown in


thousands of dollars and the credit card information has been coded as 1
if the customer has a Simmons credit card and 0 if not.
• In the Coupon column, a 1 is recorded if the sampled customer used the
coupon and 0 if not.

7
Logistic Regression Equation

• If the two values of the dependent variable y are coded as 0 or 1, the


value of E( y) in equation given below provides the probability that y = 1
given a particular set of values for the independent variables x1, x2, . . . , xp.

8
Logistic Regression Equation

• Because of the interpretation of E( y) as a probability, the logistic


regression equation is often written as follows
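
E(y) = P(y = 1 | x1, x2, . . . , xp)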

9
10
Logistic regression equation for β0 and β1

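E(y) = e^(β0 + β1x) / (1 + e^(β0 + β1x))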
11
Logistic regression equation for β0 and β1

• Note that the graph is S-shaped.
• The value of E(y) ranges from 0 to 1, with the value of E(y) gradually approaching 1 as the value of x becomes larger and the value of E(y) approaching 0 as the value of x becomes smaller.
• Note also that the values of E(y), representing probability, increase fairly rapidly as x increases from 2 to 3.
• The fact that the values of E(y) range from 0 to 1 and that the curve is S-shaped makes the equation (slide 11) ideally suited to model the probability that the dependent variable is equal to 1.

12
Estimating the Logistic Regression Equation
• In simple linear and multiple regression the least squares method is used to compute b0, b1, . . . , bp as estimates of the model parameters (β0, β1, . . . , βp).
• The nonlinear form of the logistic regression equation makes the method of
computing estimates more complex
• We will use computer software to provide the estimates.
• The estimated logistic regression equation is
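
ŷ = e^(b0 + b1x1 + . . . + bpxp) / (1 + e^(b0 + b1x1 + . . . + bpxp))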

Here, y hat provides an estimate of the probability that y = 1, given a particular


set of values for the independent variables.
13
Python Code for Logistic Regression
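
A minimal sketch of what this demo can look like, assuming the pilot-study data are in a file simmons.csv with columns Spending, Card and Coupon (the file and column names are assumptions):

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv('simmons.csv')                # hypothetical file name

X = sm.add_constant(df[['Spending', 'Card']])  # add the intercept term
y = df['Coupon']

model = sm.Logit(y, X).fit()                   # fitted by maximum likelihood
print(model.summary())

# Estimated P(y = 1) at spending = $2000, without and with the Simmons card
new = pd.DataFrame({'const': [1.0, 1.0], 'Spending': [2, 2], 'Card': [0, 1]})
print(model.predict(new))                      # ≈ .1880 and .4099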

14
Variables

15
Managerial Use

• P(y = 1 | x1 = 2, x2 = 0) = .1880

• P(y = 1 | x1 = 2, x2 = 1) = .4099

• Probabilities indicate that for customers with annual spending of $2000 the presence of
a Simmons credit card increases the probability of using the coupon

16
Managerial Use

• It appears that the probability of using the coupon is much higher for
customers with a Simmons credit card.

17
Testing for Significance

18
G Statistics

• The test for overall significance is based upon the value of a G test
statistic.
• If the null hypothesis is true, the sampling distribution of G follows a chi-
square distribution with degrees of freedom equal to the number of
independent variables in the model.

19
20
G Statistics

G = 2(−60.487 − (−67.301)) = 13.628

• The value of G is 13.628, its degrees of freedom are 2, and its p-value is 0.001.
• Thus, at any level of significance α ≥ .001, we would reject the null hypothesis and conclude that the overall model is significant.

21
Thank You

22
LOGISTIC REGRESSION - II

Dr A. RAMESH
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Testing the significance of Logistic regression coefficients


• Python Demo on Logistic Regression

2
Chi sq. value of G- Statistic

3
z test- Wald Test

• z test can be used to determine


whether each of the individual
independent variables is making a
significant contribution to the
overall model

4
Strategies
• Suppose Simmons wants to send the
promotional catalog only to
customers who have a 0.40 or higher
probability of using the coupon.
• Customers who have a Simmons
credit card: Send the catalog to every
customer who spent $2000 or more
last year.
• Customers who do not have a
Simmons credit card: Send the
catalog to every customer who spent
$6000 or more last year.

5
Interpreting the Logistic Regression Equation

6
Odds ratio

• The odds ratio measures the impact on the odds of a one-unit increase in
only one of the independent variables.
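
In terms of probabilities,

odds = P(y = 1 | x1, . . . , xp) / (1 − P(y = 1 | x1, . . . , xp))

Estimated odds ratio = odds1 / odds0

where odds1 are the odds with the chosen independent variable increased by one unit and odds0 are the odds at its original value.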

7
Interpretation

• For example, suppose we want to compare the odds of using the coupon
for customers who spend $2000 annually and have a Simmons credit card
(x1= 2 and x2 = 1) to the odds of using the coupon for customers who
spend $2000 annually and do not have a Simmons credit card (x1= 2 and
x2 = 0).
• We are interested in interpreting the effect of a one-unit increase in the
independent variable x2.

8
Odds ratio

9
Odds ratio – Interpretation

• The estimated odds in favor of using the coupon for customers who spent
$2000 last year and have a Simmons credit card are 3 times greater than
the estimated odds in favor of using the coupon for customers who spent
$2000 last year and do not have a Simmons credit card.

10
Odds ratio – Interpretation
• The odds ratio for each independent variable is computed while holding all the
other independent variables constant.
• But it does not matter what constant values are used for the other independent
variables.
• For instance, if we computed the odds ratio for the Simmons credit card
variable (x2) using $3000, instead of $2000, as the value for the annual
spending variable (x1), we would still obtain the same value for the estimated
odds ratio (3.00).
• Thus, we can conclude that the estimated odds of using the coupon for
customers who have a Simmons credit card are 3 times greater than the
estimated odds of using the coupon for customers who do not have a Simmons
credit card.

11
Relationship between the odds ratio and the coefficients of
the independent variables
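
Estimated odds ratio for variable xi = e^(bi)

With the fitted Simmons coefficients (b1 ≈ 0.342 for spending and b2 ≈ 1.099 for the card variable, consistent with the probabilities quoted earlier), e^(1.099) ≈ 3.00 for the credit card variable.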

12
Effect of a change of more than one unit in Odd Ratio

• The odds ratio for an independent variable represents the change in the
odds for a one unit change in the independent variable holding all the
other independent variables constant.
• Suppose that we want to consider the effect of a change of more than one
unit, say c units.
• For instance, suppose in the Simmons example that we want to compare
the odds of using the coupon for customers who spend $5000 annually (x1
= 5) to the odds of using the coupon for customers who spend $2000
annually (x1 = 2).
• In this case c = 5 − 2 = 3 and the corresponding estimated odds ratio is

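Estimated odds ratio = e^(c·b1) = e^(3(0.342)) = e^(1.026) ≈ 2.79

(using the fitted spending coefficient b1 ≈ 0.342)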
13
Effect of a change of more than one unit in Odd Ratio

• This result indicates that the estimated odds of using the coupon for
customers who spend $5000 annually is 2.79 times greater than the
estimated odds of using the coupon for customers who spend $2000
annually.
• In other words, the estimated odds ratio for an increase of $3000 in
annual spending is 2.79

14
Logit Transformation

• An interesting relationship can be observed between the odds in favor of y


= 1 and the exponent for ‘e’ in the logistic regression equation
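
ln(odds in favor of y = 1) = β0 + β1x1 + β2x2 + . . . + βpxp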

• This equation shows that the natural logarithm of the odds in favor of y =
1 is a linear function of the independent variables.
• This linear function is called the logit; we use g(x1, x2, . . . , xp) to denote it.

15
Estimated Logit Regression Equation

16
17
G vs Z

• Because of the unique relationship between the estimated coefficients in


the model and the corresponding odds ratios, the overall test for
significance based upon the G statistic is also a test of overall significance
for the odds ratios.
• In addition, the z test for the individual significance of a model parameter
also provides a statistical test of significance for the corresponding odds
ratio.

18
Thank You

19
Maximum Likelihood Estimation - I

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• This lecture will provide intuition behind MLE using theory and examples.

2
Maximum Likelihood Estimation

• The method of maximum likelihood was first introduced by R. A.


Fisher, a geneticist and statistician, in the 1920s.
• Most statisticians recommend this method, at least when the
sample size is large, since the resulting estimators have certain
desirable efficiency properties
• Maximum likelihood estimation (MLE) is a method for finding the density function that is most likely to have generated the data.
• MLE requires one to make a distributional assumption first.

3
An intuitive view on likelihood

 = −2,  2 = 1
 = 0,  2 = 1

 = 0,  2 = 4

4
Maximum Likelihood Estimation: Problem
• A sample of ten new bike helmets manufactured by a certain company is
obtained. Upon testing, it is found that the first, third, and tenth helmets
are flawed, whereas the others are not.
• Let p = P(flawed helmet), i.e., p is the proportion of all such helmets that
are flawed.
• Define (Bernoulli) random variables X1, X2, . . . , X10 by
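
Xi = 1 if the ith helmet is flawed, and Xi = 0 otherwise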

Source: Probability and Statistics for Engineering and the Sciences, Jay L Devore, 8th Ed, Cengage

5
Maximum Likelihood Estimation: Problem

• Then for the obtained sample, X1 = X3 = X10 = 1 and the other seven Xi’s are
all zero
• The probability mass function of any particular Xi is f(xi; p) = p^(xi) (1 − p)^(1−xi), which becomes p if xi = 1 and 1 − p when xi = 0
• Now suppose that the conditions of various helmets are independent of
one another
• This implies that the Xi’s are independent, so their joint probability mass
function is the product of the individual pmf’s.

6
Maximum Likelihood Estimation: Binomial Distribution

• Joint pmf evaluated at the observed Xi’s is


f(x1, . . . , x10; p) = p(1 − p)p . . . p = p³(1 − p)⁷   (1)

• Suppose that p = .25. Then the probability of observing the sample that we actually obtained is (.25)³(.75)⁷ = .002086.
• If instead p = .50, then this probability is (.50)³(.50)⁷ = .000977.
• For what value of p is the obtained sample most likely to have occurred?
• That is, for what value of p is the joint pmf (eq 1) as large as it can be?
• What value of p maximizes (eq 1)?

7
Maximum Likelihood Estimation: Binomial Distribution
• Figure shows a graph of the likelihood (eq 1) as a function of p.
• It appears that the graph reaches its peak at p = .3, the proportion of flawed helmets in the sample.

Graph of the likelihood (joint pmf) (eq 1)

8
Graph of the natural logarithm of the likelihood

• Figure shows a graph of the


natural logarithm of (eq 1)
• Since ln[g(u)] is a strictly
increasing function of g(u),
finding u to maximize the
function g(u) is the same as
finding u to maximize ln[g(u)].

9
Maximum Likelihood Estimation: Binomial Distribution

• We can verify our visual impression by using calculus to find the value of p
that maximizes (eq 1).
• Working with the natural log of the joint pmf is often easier than working
with the joint pmf itself, since the joint pmf is typically a product so its
logarithm will be a sum.
• Here ln[f(x1, . . . , x10; p)] = ln[p³(1 − p)⁷] = 3 ln(p) + 7 ln(1 − p)

10
Maximum Likelihood Estimation: Binomial Distribution

Thus
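
d/dp [3 ln(p) + 7 ln(1 − p)] = 3/p − 7/(1 − p)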

11
Interpretation
• Equating this derivative to 0 and solving for p gives
3(1 – p) = 7p, from which 3 = 10p and so p = 3/10 = .30 as conjectured

• That is, our point estimate is p = .30.


• It is called the maximum likelihood estimate because it is the parameter
value that maximizes the likelihood (joint pmf) of the observed sample
• In general, the second derivative should be examined to make sure a maximum has been obtained, but here this is obvious from the figure

12
Maximum Likelihood Estimation: Binomial Distribution

• Suppose that rather than being told the condition of every helmet, we had
only been informed that three of the ten were flawed.
• Then we would have the observed value of a binomial random variable X =
the number of flawed helmets.
• The pmf of X is b(x; 10, p) = (10 choose x) p^x (1 − p)^(10−x). For x = 3, this becomes (10 choose 3) p³(1 − p)⁷ = 120 p³(1 − p)⁷.
• The binomial coefficient is irrelevant to the maximization, so again p̂ = 0.30.

13
Maximum Likelihood Function Definition
• Let X1, X2, …, Xn have joint pmf or pdf

f(x1, x2, …, xn; θ1, …, θm)   (a)

• where the parameters θ1, …, θm have unknown values. When x1, …, xn are the observed sample values and (a) is regarded as a function of θ1, …, θm, it is called the likelihood function.
• The maximum likelihood estimates (mle's) θ̂1, …, θ̂m are those values of the θi's that maximize the likelihood function, so that

f(x1, x2, …, xn; θ̂1, …, θ̂m) ≥ f(x1, x2, …, xn; θ1, …, θm) for all θ1, …, θm

• When the Xi's are substituted in place of the xi's, the maximum likelihood estimators result.

14
Interpretation
• The likelihood function tells us how likely the observed sample is as a
function of the possible parameter values.
• Maximizing the likelihood gives the parameter values for which the
observed sample is most likely to have been generated—that is, the
parameter values that “agree most closely” with the observed data.

15
Estimation of Poisson Parameter
• Suppose we have data generated from a Poisson distribution. We want to
estimate the parameter of the distribution
• The probability of observing a particular value is P(X; λ) = e^(−λ) λ^X / X!
• The joint likelihood is obtained by multiplying the individual probabilities together:

P(X1, X2, …, Xn; λ) = (e^(−λ) λ^(X1) / X1!) · (e^(−λ) λ^(X2) / X2!) ⋯ (e^(−λ) λ^(Xn) / Xn!)

L(λ; X) ∝ ∏i e^(−λ) λ^(Xi) = e^(−nλ) λ^(nX̄)

16
Estimation of Poisson Parameter
• Note in the likelihood function the factorials have disappeared.
• This is because they provide a constant that does not influence the
relative likelihood of different values of the parameter
• It is usual to work with the log likelihood rather than the likelihood.
• Note that maximising the log likelihood is equivalent to maximising the likelihood.
• Take the natural log of the likelihood function:

L(λ; X) = e^(−nλ) λ^(nX̄)
ℓ(λ; X) = −nλ + nX̄ log λ

• Find where the derivative of the log likelihood is zero:

dℓ/dλ = −n + nX̄/λ = 0  ⟹  λ̂ = X̄

• Note that here the MLE is the same as the moment estimator.

17
Estimation of exponential distribution Parameter
• Suppose X1, X2, . . . , Xn is a random sample from an exponential distribution with parameter λ. Because of independence, the likelihood function is a product of the individual pdf's:

f(x1, . . . , xn; λ) = (λe^(−λx1)) ⋯ (λe^(−λxn)) = λ^n e^(−λΣxi)

• The natural logarithm of the likelihood function is

ln[f(x1, . . . , xn; λ)] = n ln(λ) − λΣxi


18
Estimation of exponential distribution Parameter

• Equating (d/d)[ln(likelihood)] to zero results in


n/ – xi = 0, or  = n/xi =

• Thus the MLE is

19
Estimation of parameters of Normal Distribution
• Let X1, . . . , Xn be a random sample from a normal distribution.
• The likelihood function is

f(x1, . . . , xn; μ, σ²) = (1/(2πσ²))^(n/2) · e^(−Σ(xi − μ)²/(2σ²))

• so

ln[f(x1, . . . , xn; μ, σ²)] = −(n/2) ln(2πσ²) − Σ(xi − μ)²/(2σ²)

20
Estimation of parameters of normal distribution
• To find the maximizing values of μ and σ², we must take the partial derivatives of ln(f) with respect to μ and σ², equate them to zero, and solve the resulting two equations.

• Omitting the details, the resulting MLE's are

μ̂ = X̄   and   σ̂² = Σ(Xi − X̄)²/n

• The MLE of σ² is not the unbiased estimator, so two different principles of estimation (unbiasedness and maximum likelihood) yield two different estimators

21
Thank you

22
Maximum Likelihood Estimation-II

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• This lecture will provide an understanding of the intuition behind MLE using theory and examples.

2
Example1: Estimation of parameters of normal distribution

• Let us explain the basic idea of MLE using simple problems.
• Let us make the assumption that the variable x follows a normal distribution.
• The density function of a normal distribution with mean m and variance σ² is given by:

f(x) = (1/√(2πσ²)) · e^(−(x − m)²/(2σ²))

Id  x
1   1
2   4
3   5
4   6
5   9

3
Example 1: Estimation of parameters of normal distribution

• The data is plotted on a horizontal line


• Think which distribution, either A or B, is more likely to have generated
the data?

4
Interpretation

• The answer to this question is A, because the data are clustered around the center of distribution A, but not around the center of distribution B
• This example illustrates that, by looking at the data, it is possible to find the distribution that is most likely to have generated the data
• Now, I will explain exactly how to find the distribution in practice

5
The illustration of the estimation procedure

• MLE starts with computing the likelihood contribution of each observation


• The likelihood contribution is the height of the density function.
• We use Li to denote the likelihood contribution of ith observation.

6
Graphical illustration of likelihood contribution

7
The illustration of the estimation procedure

• Then, you multiply the likelihood contributions of all the observations. This is called the likelihood function. We use the notation L.

• Likelihood function: L = ∏(i=1 to n) Li (this notation means you multiply from i = 1 through n)

• In our example, n = 5

8
The illustration of the estimation procedure

• In our example, the likelihood function looks like:
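
L(m, σ) = ∏(i=1 to 5) (1/√(2πσ²)) · e^(−(xi − m)²/(2σ²)),   xi ∈ {1, 4, 5, 6, 9}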

• The likelihood function depends on mean m and variance σ²

9
The illustration of the estimation procedure

• The values of mean m and σ that maximise the likelihood function are found.
• The values of mean m and σ obtained this way are called the maximum likelihood estimators of mean m and σ.
• Most MLE problems cannot be solved 'by hand'; thus, you need to write an iterative procedure to solve them on a computer.

10
Method of Least-squares vs MLE
Model for the expectation (fixed part of the model):

E[Yi] = β0 + β1xi

Residuals: ri = yi − E[Yi]

The method of least-squares:
Find the values for the parameters (β0 and β1) that make the sum of the squared residuals (Σri²) as small as possible.

Can only be used when the error term is normal (residuals are assumed to be drawn from a normal distribution):

Yi = β0 + β1xi + εi, where εi ~ N(0, σ)
Method of Least-squares vs MLE
Model for the expectation (fixed part of the model):

E[Yi] = β0 + β1xi

Residuals: ri = yi − E[Yi]

The maximum likelihood method is


more general!

- Can be applied to models with any


probability distribution
Estimation of Regression Parameter

• We are interested in estimating a model like this:
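
y = β0 + β1x + u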

• Estimating such a model can be done using MLE

13
Estimation of Regression Parameters

• Suppose that we have the following data and we are interested in estimating the model:

y = β0 + β1x + u

• Let us make an assumption that u follows the normal distribution with mean 0 and variance σ²

Id  Y   X
1   2   1
2   6   4
3   7   5
4   9   6
5   15  9

14
Estimation of Regression Parameters

• We can write the model as:

u = y − β0 − β1x

• This means that y − β0 − β1x follows the normal distribution with mean 0 and variance σ²
• The likelihood contribution of each data point is the height of the density function at the data point (y − β0 − β1x)

15
Estimation of Regression Parameters

• The likelihood contribution of the 2nd observation (the data point Y = 6, X = 4) in this example is given by:

L2 = (1/√(2πσ²)) · e^(−(6 − β0 − 4β1)²/(2σ²))

16
Estimation of Regression Parameters

• Then the likelihood function is given by

L(β0, β1, σ) = ∏(i=1 to 5) Li = L1 × L2 × L3 × L4 × L5

= (1/√(2πσ²)) e^(−(2 − β0 − β1)²/(2σ²)) × (1/√(2πσ²)) e^(−(6 − β0 − 4β1)²/(2σ²))
× (1/√(2πσ²)) e^(−(7 − β0 − 5β1)²/(2σ²)) × (1/√(2πσ²)) e^(−(9 − β0 − 6β1)²/(2σ²))
× (1/√(2πσ²)) e^(−(15 − β0 − 9β1)²/(2σ²))

(Data: (Y, X) = (2, 1), (6, 4), (7, 5), (9, 6), (15, 9))

• The likelihood function is a function of β0, β1 and σ.

17
Estimation of Regression Parameters

• You choose the values of β0, β1 and σ that maximize the likelihood function.

18
Python Demo for MLE
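
A minimal sketch of such an iterative procedure, maximizing the normal log-likelihood for the five observations of Example 1 with scipy's optimizer (starting values are arbitrary):

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

x = np.array([1, 4, 5, 6, 9])                 # data from Example 1

def neg_log_likelihood(params):
    m, log_sigma = params                     # optimize log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    return -np.sum(norm.logpdf(x, loc=m, scale=sigma))

res = minimize(neg_log_likelihood, x0=[0.0, 0.0])
m_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(m_hat, sigma_hat)                       # ≈ 5.0 and ≈ 2.61 (the ML estimates)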

19
20
21
Parameter estimation by MLE

22
Parameter estimation by MLE

23
Example 2

24
25
26
27
Thank you

28
Performance of Logistic Model-III

Dr A. RAMESH
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda
Python demo for accuracy prediction in a logistic regression model using the receiver operating characteristic (ROC) curve

2
Sensitivity and Specificity

• To check what type of error we are making, we use two parameters:

1. Sensitivity = tp/(tp + fn) — True Positive Rate (tpr)

2. Specificity = tn/(tn + fp) — True Negative Rate (tnr)

3
Specificity and Sensitivity Relationship with Threshold
Threshold (Lower): Sensitivity (↑), Specificity (↓)
Threshold (Higher): Sensitivity (↓), Specificity (↑)

Which threshold value should be chosen?

4
Measuring Accuracy, Specificity and Sensitivity

5
ROC Curve for Training dataset

6
ROC Curve for Test data set

7
Threshold value selection

• The outcome of a logistic regression model is a probability.

• Selecting a good threshold value is often challenging.
• Threshold values on the ROC curve:
Threshold = 1 → TPR = 0, FPR = 0
Threshold = 0 → TPR = 1, FPR = 1

• Threshold values are often selected based on which errors are better.

8
Accuracy checking for different threshold values

9
Accuracy checking for different threshold values

10
Accuracy checking for different threshold values

11
Accuracy checking for different threshold values

12
Calculating Optimal Threshold Value
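
A minimal sketch of this calculation, assuming y_test holds the true labels and y_prob the fitted probabilities from the demo; maximizing Youden's J statistic (TPR − FPR) is one common rule, not the only one:

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, thresholds = roc_curve(y_test, y_prob)
print(roc_auc_score(y_test, y_prob))   # area under the ROC curve

best = np.argmax(tpr - fpr)            # Youden's J = TPR - FPR
print(thresholds[best])                # candidate optimal threshold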

13
Optimal Threshold Value in ROC Curve

14
Classification Report using Optimal Threshold Value

15
Thank You

16
Confusion matrix and ROC - I

Dr A. RAMESH
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Confusion matrix
• Receiver operating characteristic (ROC) curve

2
Why Evaluate?

• Multiple methods are available to classify or predict


• For each method, multiple choices are available for settings
• To choose best model, need to assess each model’s performance

3
Accuracy Measures (Classification)

Misclassification error
• Error = classifying a record as belonging to one class when it belongs to
another class.

• Error rate = percent of misclassified records out of the total records in the
validation data

4
Confusion Matrix

Classification Confusion Matrix


Predicted Class
Actual Class 1 0
1 201 85
0 25 2689

201 1’s correctly classified as “1”


85 1’s incorrectly classified as “0”
25 0’s incorrectly classified as “1”
2689 0’s correctly classified as “0”

5
Error Rate
Classification Confusion Matrix
Predicted Class
Actual Class 1 0
1 201 85
0 25 2689

Overall error rate = (25+85)/3000 = 3.67%


Accuracy = 1 – err = (201+2689)/3000 = 96.33%
If multiple classes, error rate is:
(sum of misclassified records)/(total records)
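
A minimal sketch of these calculations, using small illustrative vectors rather than the 3000-record validation data:

import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = np.array([1, 1, 0, 0, 1, 0])       # illustrative actual classes
y_pred = np.array([1, 0, 0, 0, 1, 1])       # illustrative predicted classes

cm = confusion_matrix(y_true, y_pred)       # rows = actual, columns = predicted
err = 1 - accuracy_score(y_true, y_pred)    # overall error rate
print(cm, err)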

6
Cutoff for classification
Most algorithms classify via a 2-step process:
For each record,
1. Compute probability of belonging to class “1”
2. Compare to cutoff value, and classify accordingly

• Default cutoff value is 0.50


If >= 0.50, classify as “1”
If < 0.50, classify as “0”
• Can use different cutoff values
• Typically, error rate is lowest for cutoff = 0.50

7
Cutoff Table
Actual Class Prob. of "1" Actual Class Prob. of "1"
1 0.996 1 0.506
1 0.988 0 0.471
1 0.984 0 0.337
1 0.980 1 0.218
1 0.948 0 0.199
1 0.889 0 0.149
1 0.848 0 0.048
0 0.762 0 0.038
1 0.707 0 0.025
1 0.681 0 0.022
1 0.656 0 0.016
0 0.622 0 0.004

• If cutoff is 0.50: 11 records are classified as “1”


• If cutoff is 0.80: seven records are classified as “1”
8
Confusion Matrix for Different Cutoffs
Cut off Prob.Val. for Success (Updatable) 0.25

Classification Confusion Matrix


Predicted Class

Actual Class owner non-owner

owner 11 1
non-owner 4 8

Cut off Prob.Val. for Success (Updatable) 0.75

Classification Confusion Matrix


Predicted Class

Actual Class owner non-owner

owner 7 5
non-owner 1 11

9
Compute Outcome Measures

10
When One Class is More Important
In many cases it is more important to identify members of one class

– Tax fraud
– Credit default
– Response to promotional offer
– Detecting electronic network intrusion
– Predicting delayed flights

In such cases, we are willing to tolerate greater overall error, in return


for better identifying the important class for further attention

11
ROC curves

• ROC = Receiver Operating Characteristic


• Started in electronic signal detection theory (1940s - 1950s)
• Has become very popular in biomedical applications, particularly
radiology and imaging
• Also used in machine learning applications to assess classifiers
• Can be used to compare tests/procedures
ROC curves: simplest case

• Consider diagnostic test for a disease


• Test has 2 possible outcomes:
– ‘positive’ = suggesting presence of disease
– ‘negative’

• An individual can test either positive or negative for the disease


ROC Analysis
• True Positives = Test states you have the disease when you do have the
disease
• True Negatives = Test states you do not have the disease when you do not
have the disease
• False Positives = Test states you have the disease when you do not have
the disease
• False Negatives = Test states you do not have the disease when you do have the disease
Specific Example

[Figure: overlapping score distributions for patients with the disease and patients without the disease along the Test Result axis; a threshold divides patients called "negative" (left) from patients called "positive" (right).]
Some definitions ...
[Figures: the same pair of distributions, highlighting in turn the regions for True Positives, False Positives, True Negatives, and False Negatives on either side of the threshold.]
Moving the Threshold: right

[Figure: the threshold moved to the right along the Test Result axis — the "−" region grows and the "+" region shrinks.]
Moving the Threshold: left

[Figure: the threshold moved to the left along the Test Result axis — the "−" region shrinks and the "+" region grows.]
Threshold Value
• The outcome of a logistic regression model is a probability
• Often, we want to make a binary prediction
• We can do this using a threshold value t
• If P(y = 1) ≥ t, predict positive
• If P(y = 1) < t, predict negative
• What value should we pick for t?

23
Threshold Value
• Often selected based on which errors are “better”
• If t is large, predict positive rarely (when P(y=1) is large)
– More errors where we say negative, but it is actually positive
– Detects patients who are negative
• If t is small, predict negative rarely (when P(y=1) is small)
– More errors where we say positive, but it is actually negative
– Detects all patients who are positive
• With no preference between the errors, select t = 0.5
– Predicts the more likely outcome

24
Selecting a Threshold Value

• Compare actual outcomes to predicted outcomes using a confusion matrix


(classification matrix)

25
True disease state vs. Test result
                        Test negative              Test positive
No disease (D = 0)      ☺ correct (specificity)    X  Type I error, α (False +)
Disease (D = 1)         X  Type II error, β        ☺ correct (Power = 1 − β;
                           (False −)                  sensitivity)

Classification matrix: Meaning of each cell

27
Alternate Accuracy Measures

If “C1” is the important class,


Sensitivity = % of “C1” class correctly classified
Sensitivity = n1,1 / (n1,0+ n1,1 )
Specificity = % of “C0” class correctly classified
Specificity = n0,0 / (n0,0+ n0,1 )
False positive rate = % of predicted “C1’s” that were not “C1’s”
False negative rate = % of predicted “C0’s” that were not “C0’s”

28
Receiver Operator Characteristic (ROC) Curve

• True positive rate (sensitivity) on y-axis


– Proportion of positives labelled as positive
• False positive rate (1 − specificity) on x-axis
– Proportion of negatives labelled as positive
• Low Threshold
– Low specificity
– High sensitivity

29
Selecting a Threshold using ROC

• Captures all thresholds simultaneously


• High threshold
– High specificity
– Low sensitivity
• Low Threshold
– Low specificity
– High sensitivity

30
Thank You

31
Confusion Matrix and ROC-II

Dr A. RAMESH
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Receiver operating characteristic (ROC) curve


• Optimum threshold value

2
ROC analysis
• True Positive Fraction
– TPF = TP / (TP+FN)
– also called sensitivity
– true abnormals called abnormal by the
observer
• False Positive Fraction
– FPF = FP / (FP+TN)
• Specificity = TN / (TN+FP)
– True normals called normal by the observer
– FPF = 1 - specificity
Evaluating classifiers (via
their ROC curves)

Classifier A can’t
distinguish between
normal and abnormal.

B is better but makes


some mistakes.

C makes very few


mistakes.
“Perfect”
means no
false positives
and no false
negatives.
ROC analysis
• ROC = receiver operator/operating characteristic/curve
Area Under the ROC Curve (AUC)

7
Area Under the ROC Curve (AUC)

• What is a good AUC?


– Maximum of 1 (perfect prediction)

8
Area Under the ROC Curve (AUC)

• What is a good AUC?


• Maximum of 1 (perfect
prediction)
• Minimum of 0.5
(just guessing)

9
Selecting a Threshold using ROC

• Choose best threshold for best trade off


– cost of failing to detect positives
– costs of raising false alarms

10
ROC Plot
• A typical look of ROC plot with few points in it is shown in the following
figure.

• Note that the four corner points are the four extreme cases of classifiers

11
Interpretation of Different Points in ROC Plot
• The four points (A, B, C, and D)
• A: TPR = 1, FPR = 0, the ideal model, i.e., the perfect
classifier, no false results
• B: TPR = 0, FPR = 1, the worst classifier, not able to
predict a single instance
• C: TPR = 0, FPR = 0, the model predicts every instance
to be a Negative class, i.e., it is an ultra-conservative
classifier
• D: TPR = 1, FPR = 1, the model predicts every instance
to be a Positive class, i.e., it is an ultra-liberal classifier

12
Interpretation of Different Points in ROC Plot
• Let us interpret the different points in the ROC
plot.
• The points on the upper diagonal region
• All points that lie in the upper-diagonal region correspond to "good" classifiers, since their TPRs are higher than their FPRs
• Here, X is better than Z as X has higher TPR and
lower FPR than Z.
• If we compare X and Y, neither classifier is superior
to the other

13
Interpretation of Different Points in ROC Plot

• Let us interpret the different points in the ROC


plot.
• The points on the lower diagonal region
– The lower-diagonal triangle corresponds to classifiers that are worse than random classifiers
– For a classifier that is worse than random guessing, simply reversing its predictions gives good results
– W′(0.2, 0.4) is the better version of W(0.4, 0.2); W′ is a mirror reflection of W

14
Tuning a Classifier through ROC Plot
• Using ROC plot, we can compare two or more
classifiers by their TPR and FPR values and this
plot also depicts the trade-off between TPR
and FPR of a classifier.
• Examining ROC curves can give insights into
the best way of tuning parameters of
classifier.
• For example, in the curve C2, the result is
degraded after the point P.
• Similarly for the observation C1, beyond Q the
settings are not acceptable.

15
Comparing Classifiers trough ROC Plot
• We can use the concept of “area under
curve” (AUC) as a better method to
compare two or more classifiers.
• If a model is perfect, then its AUC = 1.
• If a model simply performs random
guessing, then its AUC = 0.5
• A model that is strictly better than another would have a larger value of AUC than the other.
• Here, C3 is best, and C2 is better than C1
as AUC(C3)>AUC(C2)>AUC(C1).

16
ROC curve
[Figure: ROC curve — True Positive Rate (sensitivity), 0–100%, on the y-axis vs. False Positive Rate (1 − specificity), 0–100%, on the x-axis.]
ROC curve comparison

[Figures: ROC curve comparison — a good test bows toward the top-left corner; a poor test stays near the diagonal. Axes: True Positive Rate vs. False Positive Rate.]
ROC curve extremes
Best test: the two score distributions don't overlap at all (the ROC curve passes through the top-left corner).
Worst test: the distributions overlap completely (the ROC curve follows the diagonal).

[Figures: the corresponding ROC curves, True Positive Rate vs. False Positive Rate.]
ROC curve extremes

20
Typical ROC

21
ROC curve extremes

22
Example

• Let us consider an application of logistic regression involving a direct mail


promotion being used by Simmons Stores.
• Simmons owns and operates a national chain of women’s apparel stores.
• Five thousand copies of an expensive four-color sales catalog have been
printed, and each catalog includes a coupon that provides a $50 discount
on purchases of $200 or more.
• The catalogs are expensive and Simmons would like to send them to only
those customers who have the highest probability of using the coupon.

Sources: Statistics for Business and Economics,11th Edition by David R. Anderson (Author), Dennis J.
Sweeney (Author), Thomas A. Williams (Author)

23
Variables

• Management thinks that annual spending at Simmons Stores and whether


a customer has a Simmons credit card are two variables that might be
helpful in predicting whether a customer who receives the catalog will use
the coupon.
• Simmons conducted a pilot study using a random sample of 50 Simmons
credit card customers and 50 other customers who do not have a
Simmons credit card.
• Simmons sent the catalog to each of the 100 customers selected.
• At the end of a test period, Simmons noted whether or not the customer used the coupon.

24
Data (10 customer out of 100)
Customer Spending Card Coupon
1 2.291 1 0
2 3.215 1 0
3 2.135 1 0
4 3.924 0 0
5 2.528 1 0
6 2.473 0 1
7 2.384 0 0
8 7.076 0 0
9 1.182 1 1
10 3.345 0 0

25
Explanation of Variables

• The amount each customer spent last year at Simmons is shown in


thousands of dollars and the credit card information has been coded as 1
if the customer has a Simmons credit card and 0 if not.
• In the Coupon column, a 1 is recorded if the sampled customer used the
coupon and 0 if not.

26
Loading data file and get some statistical detail

27
Method’s description

• Dataframe.describe(): This method is used to get basic statistical details


such as central tendency, dispersion and shape of dataset’s distribution.

• Numpy.unique(): This method gives unique values in particular column.

• Series.value_counts(): Returns object containing counts of unique values.

• ravel(): It will return one dimensional array with all the input array
elements.
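
A minimal sketch of these methods in use, assuming the Simmons data are in simmons.csv (an assumed file name):

import numpy as np
import pandas as pd

df = pd.read_csv('simmons.csv')             # hypothetical file name

print(df.describe())                        # basic statistical details
print(np.unique(df['Card']))                # unique values in one column
print(df['Coupon'].value_counts())          # counts of unique values
print(df[['Coupon']].values.ravel())        # flatten to a one-dimensional array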

28
Split dataset into training and testing sets

29
Building the model and predicting values
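
A minimal sketch of the split, the model fit and the predictions, continuing from the DataFrame above; the 75/25 split ratio and random_state are assumptions:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = df[['Spending', 'Card']]
y = df['Coupon']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
y_pred = clf.predict(X_test)                # class labels at the default 0.5 cutoff
y_prob = clf.predict_proba(X_test)[:, 1]    # P(y = 1) for each test customer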

30
Calculate probability of predicting data values

31
Summary for logistic model

32
Accuracy Checking

• By using accuracy_score function.


• By using confusion matrix

Predicted (0) Predicted (1)


Actual (0) True Negative(tn) False Positive(fp)
Actual (1) False Negative(fn) True Positive(tp)

33
Calculating Accuracy Score using Confusion Matrix

34
Generating Classification Report

• Recall gives us an idea


about when it’s actually
yes, how often does it
predict yes.
• Precision tells us about
when it predicts yes, how
often is it correct
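
A minimal sketch, reusing y_test and y_pred from the split above:

from sklearn.metrics import classification_report, confusion_matrix

print(confusion_matrix(y_test, y_pred))       # tn, fp / fn, tp layout
print(classification_report(y_test, y_pred))  # precision, recall and f1 per class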

35
Interpreting Classification Report

• Precision = tp / (tp + fp)

• Accuracy = (tp + tn) / (tp + tn + fp + fn)


Predicted (0) Predicted (1)
Actual (0) tn fp
• Recall= tp / (tp + fn)
Actual (1) fn tp

36
Thank You

37
Regression Analysis Model Building - I

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Introduction

• Model building is the process of developing an estimated regression


equation that describes the relationship between a dependent variable
and one or more independent variables.
• The major issues in model building are finding the proper functional form
of the relationship and selecting the independent variables to be included
in the model.

2
General Linear Regression Model
• Suppose we collected data for one dependent variable y and k
independent variables x1,x2, . . . , xk.
• Objective is to use these data to develop an estimated regression equation
that provides the best relationship between the dependent and
independent variables.

• zj (where j =1, 2, . . . , p) is a function of x1, x2, . . . , xk (the variables for


which data are collected).
• In some cases, each zj may be a function of only one x variable.

3
Simple first-order model with one predictor variable

4
Modelling Curvilinear Relationships

• To illustrate, let us consider the problem facing Reynolds, Inc., a


manufacturer of industrial scales and laboratory equipment.
• Managers at Reynolds want to investigate the relationship between length
of employment of their salespeople and the number of electronic
laboratory scales sold.
• Table in the next slide gives the number of scales sold by 15 randomly
selected salespeople for the most recent sales period and the number of
months each salesperson has been employed by the firm.

Sources: Statistics for Business and Economics,11th Edition by David R. Anderson (Author), Dennis J.
Sweeney (Author), Thomas A. Williams (Author)

5
Data
Scales Months
Sold Employed
275 41
296 106
317 76
376 104
162 22
150 12
367 85
308 111
189 40
235 51
83 9
112 12
67 6
325 56
189 19

6
Importing libraries and table

7
SCATTER DIAGRAM FOR THE REYNOLDS EXAMPLE

8
Python code for the Reynolds example: first-order model

9
First-order regression equation

10
Standardized residual plot for the Reynolds example: first-
order model

11
Standardized residual plot for the Reynolds example: first-
order model

12
Need for curvilinear relationship

• Although the computer output shows that the relationship is significant (


p-value .000) and that a linear relationship explains a high percentage of
the variability in sales (R-sq 78.1%), the standardized residual plot
suggests that a curvilinear relationship is needed.

13
Second-order model with one predictor variable

• Set Z1= x1 and Z2 = X2

14
New Data set

• The data for the MonthsSq independent variable is obtained by squaring


the values of Months.

15
Python output for the Reynolds example:
second-order model

16
Second-order regression model

17
Standardized residual plot for the Reynolds example:
second-order model

18
Interpretation second order model

• Figure corresponding standardized residual plot shows that the previous


curvilinear pattern has been removed.
• At the .05 level of significance, the computer output shows that the
overall model is significant ( p-value for the F test is 0.000)
• Note also that the p-value corresponding to the t-ratio for MonthsSq ( p-
value .002) is less than .05
• Hence we can conclude that adding MonthsSq to the model involving
Months is significant.
• With an R-sq(adj) value of 88.6%, we should be pleased with the fit
provided by this estimated regression equation.
19
Meaning of linearity in GLM

• In multiple regression analysis the word linear in the term “general linear
model” refers only to the fact that b0, b1, . . . , bp all have exponents of b1
• It does not imply that the relationship between y and the xi’s is linear.
• Indeed, we have seen one example of how equation general linear model
can be used to model a curvilinear relationship.

20
Thank you

21
Regression Analysis Model Building (Interaction)- II

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Incorporating Interaction of the independent variable to the regression


model
• Python demo

2
Interaction
• If the original data set consists of observations for y and two independent
variables x1 and x2, we can develop a second-order model with two predictor
variables by setting z1 = x1, z2= x2, z3=x12 , z4=x22 , and z5 = x1x2 in the general
linear model of equation
• The model obtained is

• In this second-order model, the variable z5 = x1x2 is added to account for the
potential effects of the two variables acting together.
• This type of effect is called interaction.

3
Example – Interaction

• A company introduces a new shampoo product.


• Two factors believed to have the most influence on sales are unit selling
price and advertising expenditure.
• To investigate the effects of these two variables on sales, prices of $2.00,
$2.50, and $3.00 were paired with advertising expenditures of $50,000
and $100,000 in 24 test markets.

Source: Statistics for Business and Economics,11th Edition by David R.


Anderson (Author), Dennis J. Sweeney (Author), Thomas A. Williams (Author)

4
Advertising
Expenditure Sales
Price ($1000s) (1000s)
2 50 478
2.5 50 373
3 50 335
2 50 473
2.5 50 358
3 50 329
2 50 456
2.5 50 360
3 50 322
2 50 437
2.5 50 365
3 50 342
2 100 810
2.5 100 653
3 100 345
2 100 832
2.5 100 641
3 100 372
2 100 800
2.5 100 620
3 100 390
2 100 790
2.5 100 670
3 100 393

5
MEAN UNIT SALES (1000s)

6
Interpretation of interaction

• Note that the sample mean sales corresponding to a price of $2.00 and an
advertising expenditure of $50,000 is 461,000, and the sample mean sales
corresponding to a price of $2.00 and an advertising expenditure of
$100,000 is 808,000.
• Hence, with price held constant at $2.00, the difference in mean sales
between advertising expenditures of $50,000 and $100,000 is 808,000 -
461,000 = 347,000 units.

7
Interpretation of interaction

• When the price of the product is $2.50, the difference in mean sales is
646,000 -364,000 = 282,000 units.
• Finally, when the price is $3.00, the difference in mean sales is 375,000 -
332,000 = 43,000 units.
• Clearly, the difference in mean sales between advertising expenditures of
$50,000 and $100,000 depends on the price of the product.
• In other words, at higher selling prices, the effect of increased advertising
expenditure diminishes.
• These observations provide evidence of interaction between the price and
advertising expenditure variables.
8
Importing Data

9
Mean unit sales (1000s) as a function of selling price

10
Mean unit sales (1000s) as a function of Advertising
Expenditure($1000s)

11
Need for study the interaction between variable

• When interaction between two variables is present, we cannot study the


effect of one variable on the response y independently of the other
variable.
• In other words, meaningful conclusions can be developed only if we
consider the joint effect that both variables have on the response.

12
Estimated regression equation, a general linear model
involving three independent variables (z1, z2, and z3)

13
Interaction variable

• The data for the PriceAdv independent variable is obtained by multiplying


each value of Price times the corresponding value of AdvExp.

14
New Model

15
New Model

16
Interpretation

• Because the model is significant ( p-value for the F test is 0.000) and the p-
value corresponding to the t test for PriceAdv is 0.000, we conclude that
interaction is significant given the linear effect of the price of the product
and the advertising expenditure.
• Thus, the regression results show that the effect of advertising xpenditure
on sales depends on the price.

17
Transformations Involving the Dependent Variable

Miles per
Gallon Weight
28.7 2289
29.2 2113
34.2 2180
27.9 2448
33.3 2026
26.4 2702
23.9 2657
30.5 2106
18.1 3226
19.5 3213
14.3 3607
20.9 2888

18
Importing data

19
Scatter diagram

20
Model 1

21
Standardized residual plot corresponding to the first-order
model.

22
Standardized residual plot corresponding to the first-order
model

23
Model 2

24
Residual plot for model 2

25
Residual plot of model 2

26
• The miles-per-gallon estimate is obtained by finding the number whose
natural logarithm is 3.2675.
• Using a calculator with an exponential function, or raising e to the power
3.2675, we obtain 26.2 miles per gallon.

27
Nonlinear Models That Are Intrinsically Linear

28
Thank You

29
χ² Test of Independence - I

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• To understand 2 Test of Independence

2
χ² Test of Independence

• It is used to analyze the frequencies of two variables with multiple


categories to determine whether the two variables are independent.
• Qualitative Variables
• Nominal Data

3
χ² Test of Independence: Investment Example
• In which region of the country do you reside?
A. Northeast B. Midwest C. South D. West
• Which type of financial investment are you most likely to make today?
E. Stocks F. Bonds G. Treasury bills

Contingency Table           Type of Financial Investment
                            E      F      G
Geographic   A                            o13     nA
Region       B                                    nB
             C                                    nC
             D                                    nD
                            nE     nF     nG      N
4
χ² Test of Independence: Investment Example

If A and F are independent: P(A) = nA/N and P(F) = nF/N

P(A ∩ F) = P(A) · P(F) = (nA/N)(nF/N)

eAF = N · P(A ∩ F) = N(nA/N)(nF/N) = nA·nF/N

Contingency Table           Type of Financial Investment
                            E      F      G
Geographic   A                     e12            nA
Region       B                                    nB
             C                                    nC
             D                                    nD
                            nE     nF     nG      N
5
χ² Test of Independence: Formulas

Expected frequencies:

eij = (ni)(nj)/N

where: i = the row
j = the column
ni = the total of row i
nj = the total of column j
N = the total of all frequencies

6
χ² Test of Independence: Formulas

Calculated (observed) value:

χ² = Σ (fo − fe)²/fe

df = (r − 1)(c − 1)

where: r = the number of rows
c = the number of columns

7
Example for Independence

8
χ² Test of Independence

Ho : Type of gasoline is
independent of income
Ha : Type of gasoline is not
independent of income

9
χ² Test of Independence

Type of Gasoline (r = 4, c = 3)
Income               Regular   Premium   Extra Premium
Less than $30,000
$30,000 to $49,999
$50,000 to $99,000
At least $100,000

10
χ² Test of Independence: Gasoline Preference Versus Income Category

α = .01
df = (r − 1)(c − 1) = (4 − 1)(3 − 1) = 6

χ²(.01, 6) = 16.812

If χ²Cal > 16.812, reject Ho.
If χ²Cal ≤ 16.812, do not reject Ho.

11
Python code
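
A minimal sketch of this test in Python, using the observed gasoline-preference frequencies shown on the next slide:

import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[ 85, 16,  6],
                     [102, 27, 13],
                     [ 36, 22, 15],
                     [ 15, 23, 25]])

chi2_stat, p, dof, expected = chi2_contingency(observed)
print(chi2_stat, dof, p)     # ≈ 70.78 with dof = 6 and p far below .01
print(expected)              # matches the hand-computed expected frequencies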

12
Gasoline Preference Versus Income Category:
Observed Frequencies

Type of Gasoline
Income               Regular   Premium   Extra Premium   Total
Less than $30,000       85        16           6          107
$30,000 to $49,999     102        27          13          142
$50,000 to $99,000      36        22          15           73
At least $100,000       15        23          25           63
Total                  238        88          59          385

13
Gasoline Preference Versus Income Category: Expected
Frequencies

eij = (ni)(nj)/N

e11 = (107)(238)/385 = 66.15
e12 = (107)(88)/385 = 24.46
e13 = (107)(59)/385 = 16.40

Type of Gasoline (expected frequencies in parentheses)
Income               Regular        Premium       Extra Premium   Total
Less than $30,000    (66.15)  85    (24.46) 16    (16.40)  6       107
$30,000 to $49,999   (87.78) 102    (32.46) 27    (21.76) 13       142
$50,000 to $99,000   (45.13)  36    (16.69) 22    (11.19) 15        73
At least $100,000    (38.95)  15    (14.40) 23    (9.65)  25        63
Total                 238            88            59              385
14
Gasoline Preference Versus Income Category: 2
Calculation

(f o −f f e)
2

 = 
2

(85 −6666 .15 ) + (16 −2424 .46) + (6 −16.40) +


2 2 2

= .15 .46 16.40


(102 87
− 87.78) + (27 −3232 .46) + (13 − 21.76) +
2 2 2

.78 .46 21.76


(36 −454513 . )+ (22 −1616 .69 ) + (15 −1119. )+
2 2 2

.13 .69 11.19


(15 −3838 .95) + (23 −1414 .40) + (25 − 9.65)
2 2 2

.95 .40 9.65


= 7075
.
15
Gasoline Preference Versus Income Category:
Conclusion

df = 6, α = 0.01
Critical value: χ²(.01, 6) = 16.812 (non-rejection region below 16.812)

χ²Cal = 70.78 > 16.812, so reject Ho.

16
Contingency Tables

Contingency Tables
• Useful in situations involving multiple population proportions
• Used to classify sample observations according to two or more
characteristics
• Also called a cross-classification table.

17
Contingency Table Example

Hand Preference vs. Gender


Dominant Hand: Left vs. Right
Gender: Male vs. Female

• 2 categories for each variable, so the table is called a 2 x 2 table

• Suppose we examine a sample of 300 college students

18
Contingency Table Example

Sample results organized in a contingency table:

Sample size n = 300:
• 120 Females, 12 were left handed
• 180 Males, 24 were left handed

                    Gender
Hand Preference   Female   Male   Total
Left                 12      24     36
Right               108     156    264
Total               120     180    300


19
Contingency Table Example

H0: π1 = π2 (Proportion of females who are left handed is equal to the


proportion of males who are left handed)
H1: π1 ≠ π2 (The two proportions are not the same; hand preference is not independent of gender)

• If H0 is true, then the proportion of left-handed females should be the


same as the proportion of left-handed males.
• The two proportions above should be the same as the proportion of left-
handed people overall.

20
The Chi-Square Test Statistic

The Chi-square test statistic is:

χ² = Σ(all cells) (fo − fe)²/fe

where:
fo = observed frequency in a particular cell
fe = expected frequency in a particular cell if H0 is true

χ² for the 2 x 2 case has 1 degree of freedom
Assumed: each cell in the contingency table has expected frequency of at least 5
21
The Chi-Square Test Statistic

The 2 test statistic approximately follows a chi-square


distribution with one degree of freedom

Decision Rule:
If 2 > 2U, reject H0,
otherwise, do not reject 
H0
0 Do not Reject H0 
reject H0 2U
22
Observed vs. Expected Frequencies
Gender
Hand
Female Male
Preference

Observed = 12 Observed = 24
Left 36
Expected = 14.4 Expected = 21.6

Observed = 108 Observed = 156


Right 264
Expected = 105.6 Expected = 158.4

120 180 300


The Chi-Square Test Statistic
Gender
Hand
Female Male
Preference

Observed = 12 Observed = 24
Left 36
Expected = 14.4 Expected = 21.6

Observed = 108 Observed = 156


Right 264
Expected = 105.6 Expected = 158.4
120 180 300
The test statistic is:

χ² = Σ(all cells) (fo − fe)²/fe
= (12 − 14.4)²/14.4 + (108 − 105.6)²/105.6 + (24 − 21.6)²/21.6 + (156 − 158.4)²/158.4
= 0.7576
24
The Chi-Square Test Statistic

The test statistic is  2 = 0.7576 , U2 with 1 d.f. = 3.841


Decision Rule:
If 2 > 3.841, reject H0, otherwise, do not
reject H0
Here,
2 = 0..7576 < 2U = 3.841,
=.05
so you do not reject H0 and
conclude that there is
0 Do not Reject H0  
insufficient evidence that the
reject H0
2U=3.841 two proportions are different.

25
χ² Test for the Differences Among More Than Two Proportions

• Extend the χ² test to the case with more than two independent populations:

H0: π1 = π2 = … = πc
H1: Not all of the πj are equal (j = 1, 2, …, c)

26
The Chi-Square Test Statistic

The Chi-square test statistic is:

χ² = Σ(all cells) (fo − fe)²/fe

where:
• fo = observed frequency in a particular cell of the 2 x c table
• fe = expected frequency in a particular cell if H0 is true
• χ² for the 2 x c case has (2 − 1)(c − 1) = c − 1 degrees of freedom

Assumed: each cell in the contingency table has expected frequency of at least 5
27
χ² Test with More Than Two Proportions: Example

The sharing of patient records is a controversial issue in health care. A survey


of 500 respondents asked whether they objected to their records being
shared by insurance companies, by pharmacies, and by medical researchers.
The results are summarized on the following table:

28
χ² Test with More Than Two Proportions: Example
Organization
Object to Insurance Pharmacies Medical
Record Companies Researchers
Sharing

Yes 410 295 335

No 90 205 165
χ² Test with More Than Two Proportions: Example
Organization
Object to Insurance Pharmacies Medical Row Sum
Record Companies Researchers
Sharing

Yes 410 295 335 1040

No 90 205 165 460

Column Sum   500   500   500   1500
χ² Test with More Than Two Proportions: Example

The overall proportion is:

p̄ = (X1 + X2 + ... + Xc)/(n1 + n2 + ... + nc) = (410 + 295 + 335)/(500 + 500 + 500) = 1040/1500 = 0.6933

Organization
Object to Record Insurance Pharmacies Medical
Sharing Companies Researchers

Yes fo = 410 fo = 295 fo = 335


fe = 346.667 fe = 346.667 fe = 346.667
No fo = 90 fo = 205 fo = 165
fe = 153.333 fe = 153.333 fe = 153.333
χ² Test with More Than Two Proportions: Example

Per-cell contributions (fo − fe)²/fe:

Object to Record Sharing   Insurance Companies   Pharmacies   Medical Researchers
Yes                             11.571              7.700          0.3926
No                              26.159             17.409          0.888

The Chi-square test statistic is:

χ² = Σ(all cells) (fo − fe)²/fe = 64.1196
χ² Test with More Than Two Proportions: Example
H0: π1 = π2 = π3
H1: Not all of the πj are equal (j = 1, 2, 3)

Decision Rule: If χ² > χ²U, reject H0; otherwise, do not reject H0.
χ²U = 5.991 is from the chi-square distribution with 2 degrees of freedom.

Conclusion: Since 64.1196 > 5.991, you reject H0 and conclude that at least one proportion of respondents who object to their records being shared is different across the three organizations

33
Thank You

34
χ² Test of Independence - II

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Using Python to test the independence of variables


• Understanding goodness of fit test for Poisson

2
Example

• Record of 50 students studying in ABN School is taken at random, the first


10 entries are like this:

res_num aa pe sm ae r g c
1 99 19 1 2 0 0 1
2 46 12 0 0 0 0 0
3 57 15 1 1 0 0 0
4 94 18 2 2 1 1 1
5 82 13 2 1 1 1 1
6 59 12 0 0 2 0 0
7 61 12 1 2 0 0 0
8 29 9 0 0 1 1 0
9 36 13 1 1 0 0 0
10 91 16 2 2 1 1 0

3
Example

Here :
• res_num = registration no.
• aa= academic ability
• pe = parent education
• sm = student motivation
• r = religion
• g = gender

4
Python code

5
Hypothesis

• Test the hypothesis that “gender and student motivation” are


independent

6
Python code

7
Observed values
Gender Student motivation
0 1 2 Row Sum
(Disagree ) (Not (Agree)
decided )

0 (Male) 10 13 6 29

1(Female ) 4 9 8 21

Column 14 22 14 50
Sum

8
Expected frequency (contingency table)

Gender    Student motivation
          0                   1       2
0         29×14/50 = 8.12     12.76   8.12
1         5.88                9.24    5.88

9
Frequency Table

Gender Student motivation


0 1 2

0 fo = 10 fo = 13 fo = 6
fe = 8.12 fe =12.76 fe =8.12
1 fo = 4 fo = 9 fo = 8
fe =5.88 fe =9.24 fe =5.88

10
Chi sq. calculation

χ² = Σ (fo − fe)²/fe

= 0.435 + 0.005 + 0.554 + 0.601 + 0.006 + 0.764
= 2.365

11
Python code
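
A minimal sketch of the same test with scipy, using the observed gender-by-motivation counts:

import numpy as np
from scipy.stats import chi2_contingency

# rows: gender (0 = male, 1 = female); columns: motivation (0, 1, 2)
observed = np.array([[10, 13, 6],
                     [ 4,  9, 8]])

chi2_stat, p, dof, expected = chi2_contingency(observed)
print(chi2_stat, dof, p)     # ≈ 2.365 with dof = 2
print(expected)              # matches the expected-frequency table above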

12
Python code

Degrees of
freedom =
(2-1)*(3-1)

13
Python code

Contingency
table

14
χ² Goodness of Fit Test

15
χ² Goodness-of-Fit Test

• The χ² goodness-of-fit test compares expected (theoretical) frequencies of categories from a population distribution to the observed (actual) frequencies from a distribution to determine whether there is a difference between what was expected and what was observed

16
χ² Goodness-of-Fit Test

χ² = Σ (fo − fe)²/fe

df = k − 1 − p

where: fo = frequency of observed values
fe = frequency of expected values
k = number of categories
p = number of parameters estimated from the sample data

17
Goodness of Fit Test: Poisson Distribution
1. Set up the null and alternative hypotheses.
H0: Population has a Poisson probability distribution
Ha: Population does not have a Poisson distribution

2. Select a random sample and


• Record the observed frequency fi for each value of the Poisson
random variable.
• Compute the mean number of occurrences μ.

3. Compute the expected frequency of occurrences ei


for each value of the Poisson random variable.

18
Goodness of Fit Test: Poisson Distribution

4. Compute the value of the test statistic:

χ² = Σ(i=1 to k) (fi − ei)²/ei

where:
fi = observed frequency for category i
ei = expected frequency for category i
k = number of categories

19
Goodness of Fit Test: Poisson Distribution
5. Rejection rule:
p-value approach: Reject H0 if p-value < α
Critical value approach: Reject H0 if χ² ≥ χ²α
where α is the significance level and there are k − 2 degrees of freedom

20
Goodness of Fit Test: Poisson Distribution
• Example: Parking Garage

In studying the need for an additional entrance to a city parking garage, a consultant has recommended an analysis approach that is applicable only in situations where the number of cars entering during a specified time period follows a Poisson distribution.

21
Goodness of Fit Test: Poisson Distribution
A random sample of 100 one- minute time intervals resulted in the
customer arrivals listed below. A statistical test must be conducted to
see if the assumption of a Poisson distribution is reasonable.

# Arrivals 0 1 2 3 4 5 6 7 8 9 10 11 12
Frequency 0 1 4 10 14 20 12 12 9 8 6 3 1

22
Goodness of Fit Test: Poisson Distribution

• Hypotheses
H0: Number of cars entering the garage during
a one-minute interval is Poisson distributed

Ha: Number of cars entering the garage during a


one-minute interval is not Poisson distributed

23
Python Code

24
Goodness of Fit Test: Poisson Distribution

• Estimate of the Poisson Probability Function

Total Arrivals = 0(0) + 1(1) + 2(4) + . . . + 12(1) = 600
Estimate of μ = 600/100 = 6
Total Time Periods = 100
Hence,

f(x) = 6^x e^(−6) / x!

25
Goodness of Fit Test: Poisson Distribution
• Expected Frequencies

x f (x ) nf (x ) x f (x ) nf (x )
0 .0025 .25 7 .1377 13.77
1 .0149 1.49 8 .1033 10.33
2 .0446 4.46 9 .0688 6.88
3 .0892 8.92 10 .0413 4.13
4 .1339 13.39 11 .0225 2.25
5 .1606 16.06 12+ .0201 2.01
6 .1606 16.06 Total 1.0000 100.00

26
Python code

27
Python code

28
Goodness of Fit Test: Poisson Distribution
• Observed and Expected Frequencies
i fi ei fi - ei
0 or 1 or 2 5 6.20 -1.20
3 10 8.92 1.08
4 14 13.39 0.61
5 20 16.06 3.94
6 12 16.06 -4.06
7 12 13.77 -1.77
8 9 10.33 -1.33
9 8 6.88 1.12
10 or more 10 8.39 1.61
29
Python code

30
Goodness of Fit Test: Poisson Distribution
• Rejection Rule
With α = .05 and k − p − 1 = 9 − 1 − 1 = 7 d.f. (where k = number of categories and p = number of population parameters estimated), χ²(.05) = 14.067
Reject H0 if p-value < .05 or χ² > 14.067.
• Test Statistic

χ² = (−1.20)²/6.20 + (1.08)²/8.92 + . . . + (1.61)²/8.39 = 3.268
6.20 8.92 8.39

31
Python code
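
A minimal sketch of the calculation, using the grouped observed and expected frequencies from the previous slides:

import numpy as np
from scipy.stats import chi2

f_obs = np.array([5, 10, 14, 20, 12, 12, 9, 8, 10])
f_exp = np.array([6.20, 8.92, 13.39, 16.06, 16.06, 13.77, 10.33, 6.88, 8.39])

stat = np.sum((f_obs - f_exp) ** 2 / f_exp)   # chi-square statistic
print(stat)                                   # ≈ 3.27 (3.268 with the lecture's rounding)
print(chi2.ppf(0.95, df=7))                   # critical value ≈ 14.067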

32
Goodness of Fit Test: Poisson Distribution

df = 7, α = 0.05
Critical value = 14.067 (non-rejection region below 14.067)

χ²Cal = 3.268 < 14.067, so do not reject Ho.

33
Thank You

34
Cluster analysis: Introduction - I

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Understanding cluster analysis and its purpose


• Introduction to types of data and how to handle them

2
Cluster Analysis

• Cluster analysis is the art of finding


groups in data
• In cluster analysis basically, one wants to
form groups in such a way that objects in
the same group are similar to each other,
whereas objects in different groups are
as dissimilar as possible

3
Cluster analysis

• The classification of similar objects into


groups is an important human activity, this is
part of the learning process
• i.e. A child learns to distinguish between cats
and dogs, between tables and chairs,
between men and women, by means of
continuously improving subconscious
classification schemes
• This explains why cluster analysis is often
considered as a branch of pattern recognition
and artificial intelligence

4
Example

• Lets illustrate with the help of an example:


• It is a plot of twelve objects, on which two variables were measured. For
instance, the weight of an object might be displayed on the vertical axis
and its height on the horizontal one

5
Example

• Because this example contains only two variables, we can investigate it by merely looking

at the plot

• In this small data set there are clearly two distinct groups of objects

• Such groups are called clusters, and to discover them is the aim of cluster analysis

6
Cluster and discriminant analysis

Cluster Analysis:
• Cluster Analysis is an unsupervised classification technique in the sense that it is applied to a dataset where patterns want to be discovered (i.e. groups of individuals or variables want to be found)
• No prior knowledge is needed for this grouping, and it is sensitive to several decisions that have to be taken (similarity/dissimilarity measures, clustering method, ...)

Discriminant Analysis:
• Discriminant Analysis (DA) is a statistical technique used to build a prediction model that classifies objects from a dataset depending on the features observed on them. In this case, the dependent variable is the grouping variable, which identifies to which group an object belongs
• This grouping variable should be known at the beginning, for the function to be built up. Sometimes DA is considered a supervised tool, as there is a previously known classification for the elements of the dataset

7
Cluster analysis and discriminant analysis

• Cluster analysis can be used not only to identify a structure already present in the data, but also to impose a structure on a more or less homogeneous data set that has to be split up in a “fair” way, for instance when dividing a country into telephone areas
• Cluster analysis is quite different from discriminant analysis in that it actually establishes the groups, whereas discriminant analysis assigns objects to groups that were defined in advance
Telephone area code for USA

8
Types of data and how to handle them

• Let us take an example: there are n objects to be clustered, which may be persons, flowers, words, countries, or anything
• Clustering algorithms typically operate on either of two input structures:
– The first represents the objects by means of p measurements or attributes, such as height, weight, sex, color, and so on
– These measurements can be arranged in an n-by-p matrix, where the rows correspond to the objects and the columns to the attributes

9
Example
(Figure: an n-by-p data matrix, with rows corresponding to objects and columns to attributes)

10
Types of data and how to handle them

• The second structure is a collection of proximities that must be available for all pairs of objects
• These proximities make up an n-by-n table, which is called a one-mode matrix because the row and column entities are the same set of objects
• We shall consider two types of proximities, namely dissimilarities (which measure how far away two objects are from each other) and similarities (which measure how much they resemble each other)

11
Type of data

• Interval-Scaled Variables
• In this situation the n objects are characterized by p continuous
measurements
• These values are positive or negative real numbers, such as height, weight,
temperature, age, cost, ..., which follow a linear scale
• For instance, the time interval between 1900 and 1910 was equal in length
to that between 1960 and 1970

Time scale in years

12
Type of data

• Also, it takes the same amount of energy to heat an object from −16.4°C to −12.4°C as to heat it from 35.2°C to 39.2°C
• In general it is required that intervals keep the same importance throughout the scale

13
Interval-Scaled Variables

• These measurements can be organized in an n-by-p matrix, where the rows correspond to the objects (or cases) and the columns correspond to the variables.
• When the fth measurement of the ith object is denoted by x_if (where i = 1, . . . , n and f = 1, . . . , p), this matrix looks like:

x_11  x_12  . . .  x_1p
x_21  x_22  . . .  x_2p
 .     .            .
x_n1  x_n2  . . .  x_np

14
Interval-Scaled Variables

• For example, take eight people, with their weight (in kilograms) and height (in centimetres)
• In this situation, n = 8 and p = 2.

Table 1:
Person   Weight (kg)   Height (cm)
A        15            95
B        49            156
C        13            95
D        45            160
E        85            178
F        66            176
G        12            90
H        10            78

15
Figure 1: Scatter plot of the eight persons, with weight (kg) on the horizontal axis and height (cm) on the vertical axis. The points form two groups: the small children (A, C, G, H) in the lower left and the adults (B, D, E, F) in the upper right.

16
Interval-Scaled Variables

• The units on the vertical axis are drawn to the same size as those on the horizontal axis, even though they represent different physical concepts
• The plot contains two obvious clusters, which can in this case be interpreted easily: the one consists of small children and the other of adults
• However, other variables might have led to a completely different clustering
• For instance, measuring the concentration of certain natural hormones might have yielded a clear-cut partition into male and female persons

17
Interval-Scaled Variables
• Let us now consider the effect of changing measurement units.
• If the weight and height of the subjects had been expressed in pounds and inches, the results would have looked quite different.
• A pound equals 0.4536 kg and an inch is 2.54 cm
• Therefore, Table 2 contains larger numbers in the column of weights and smaller numbers in the column of heights.

Table 2:
Person   Weight (lb)   Height (in)
A        33.1          37.4
B        108           61.4
C        28.7          37.4
D        99.2          63
E        187.4         70
F        145.5         69.3
G        26.5          35.4
H        22            30.7

18
Figure 2: The same eight persons, with weight (lb) on the horizontal axis and height (in) on the vertical axis; the plot looks much flatter than Figure 1.

19
Interpretation
• Although plotting essentially the same data as Figure 1, Figure 2 looks
much flatter
• In this figure, the relative importance of the variable “weight” is much
larger than in Figure 1
• As a consequence, the two clusters are not as nicely separated as in Figure
1 because in this particular example the height of a person gives a better
indication of adulthood than his or her weight. If height had been
expressed in feet (1 ft = 30.48 cm), the plot would become flatter still and
the variable “weight” would be rather dominant
• In some applications, changing the measurement units may even lead one
to see a very different clustering structure

20
Standardizing the data

• To avoid this dependence on the choice of measurement units, one has the option of standardizing the data
• This converts the original measurements to unitless variables
• First one calculates the mean value of variable f, given by:

m_f = (x_1f + x_2f + . . . + x_nf) / n,   for each f = 1, . . . , p

21
Standardizing the data

• Then one computes a measure of the dispersion or “spread” of this fth variable
• Generally, we use the standard deviation for this purpose

22
Standardizing the data

• However, this measure is affected very much by the presence of outlying values
• For instance, suppose that one of the x_if has been wrongly recorded, so that it is much too large
• In this case std_f will be unduly inflated, because x_if − m_f is squared
• Hartigan (1975, p. 299) notes that one needs a dispersion measure that is not too sensitive to outliers
• Therefore, we will use the mean absolute deviation

s_f = (|x_1f − m_f| + |x_2f − m_f| + . . . + |x_nf − m_f|) / n

where the contribution of each measurement x_if is proportional to the absolute value |x_if − m_f|
23
Standardizing the data

• Let us assume that s_f is nonzero (otherwise variable f is constant over all objects and must be removed)
• Then the standardized measurements, sometimes called z-scores, are defined by:

z_if = (x_if − m_f) / s_f

• They are unitless because both the numerator and the denominator are expressed in the same units
• By construction, the z_if have mean value zero and their mean absolute deviation is equal to 1

24
Standardizing the data

• When applying standardization, one forgets about the original data and
uses the new data matrix in all subsequent computations
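A minimal numpy sketch (not the lecture's own code) of this standardization, using the mean absolute deviation s_f as the denominator:

import numpy as np

def standardize(X):
    # z_if = (x_if - m_f) / s_f, with s_f the mean absolute deviation
    m = X.mean(axis=0)
    s = np.abs(X - m).mean(axis=0)
    return (X - m) / s

# Weight (kg) and height (cm) of the eight persons of Table 1
X = np.array([[15, 95], [49, 156], [13, 95], [45, 160],
              [85, 178], [66, 176], [12, 90], [10, 78]], dtype=float)
Z = standardize(X)
print(Z.mean(axis=0))          # ~0 for both variables
print(np.abs(Z).mean(axis=0))  # ~1 for both variables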

25
Detecting outlier

• The advantage of using s_f rather than std_f in the denominator of the z-score formula is that s_f will not be blown up as much in the case of an outlying x_if
• Hence the corresponding z_if will still be noticeable, so the ith object can be recognized as an outlier by the clustering algorithm, which will typically put it in a separate cluster

26
Standardizing the data
• The preceding description might convey the impression that
standardization would be beneficial in all situations.
• However, it is merely an option that may or may not be useful in a given
application
• Sometimes the variables have an absolute meaning, and should not be
standardized
• For instance, it may happen that several variables are expressed in the
same units, so they should not be divided by different sf
• Often standardization dampens a clustering structure by reducing the
large effects because the variables with a big contribution are divided by a
large sf

27
Thank you

28
Cluster analysis: Part - II

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Explain the effect of standardization (with the help of an example)
• Different ways of computing distances between the objects

2
Example

• Let's take four persons A, B, C, D with the following age and height:

Table 1:
Person   Age (yr)   Height (cm)
A        35         190
B        40         190
C        35         160
D        40         160

Figure 1: Scatter plot of age (horizontal axis) against height (vertical axis); A and B lie at height 190 cm, C and D at height 160 cm, forming two groups.

Finding Groups in Data: An Introduction to Cluster Analysis
Author(s): Leonard Kaufman, Peter J. Rousseeuw
March 1990, John Wiley & Sons, Inc.

3
Example

• In Figure 1 we can see two distinct clusters


• Let us standardize the data of Table 1
• The mean age equals m1 = 37.5 and the mean absolute deviation of the
first variable works out to be s1 = (2.5 + 2.5 + 2.5 + 2.5)/4 = 2.5
• Therefore, standardization converts age 40 to +1 ((40 − 37.5)/2.5 = 1) and age 35 to −1 ((35 − 37.5)/2.5 = −1)
• Analogously, m2 = 175 cm and s2 = (15 + 15 + 15 + 15)/4 = 15 cm, so 190
cm is standardized to +1 and 160 cm to - 1

4
Example
• The resulting data matrix, which is unitless, is given in Table 2
• Note that the new averages are zero and that the mean deviations equal 1

• Table 2
Person Variable 1 Variable 2
A 1 1
B -1 1
C 1 -1
D -1 -1

• Even when the data are converted to very strange units standardization will always yield
the same numbers

5
Example

• Plotting the values of Table 2 in Figure 2 does not give a very exciting result
• Figure 2 shows no clustering structure, because the four points lie at the vertices of a square
• One could say that there are four clusters, each consisting of a single point, or that there is only one big cluster containing four points
• Here standardizing is no solution

Figure 2: The standardized data; the four points (±1, ±1) lie at the corners of a square.

6
Choice of measurement (Units)- Merits and demerits

• The choice of measurement units gives rise to relative weights of the variables
• Expressing a variable in smaller units will lead to a larger range for that
variable, which will then have a large effect on the resulting structure
• On the other hand, by standardizing one attempts to give all variables an
equal weight, in the hope of achieving objectivity
• As such, it may be used by a practitioner who possesses no prior
knowledge

7
Choice of measurement- Merits and demerits

• However, it may well be that some variables are intrinsically more important than others in a particular application, and then the assignment of weights should be based on subject-matter knowledge
• On the other hand, there have been attempts to devise clustering
techniques that are independent of the scale of the variables

8
Distances computation between the objects
• The next step is to compute distances between the objects, in order to quantify their degree of dissimilarity
• It is necessary to have a distance for each pair of objects i and j.
• The most popular choice is the Euclidean distance:

d(i, j) = sqrt( (x_i1 − x_j1)² + (x_i2 − x_j2)² + . . . + (x_ip − x_jp)² )

• When the data are being standardized, one has to replace all x by z in this expression
• This formula corresponds to the true geometrical distance between the points with coordinates (x_i1, . . . , x_ip) and (x_j1, . . . , x_jp)

9
Example

• Let us consider the special case with p = 2 (Figure 3)
• The figure shows two points with coordinates (x_i1, x_i2) and (x_j1, x_j2)
• It is clear that the actual distance between objects i and j is given by the length of the hypotenuse of the triangle, yielding the expression in the previous slide by virtue of Pythagoras’ theorem
Figure 3: Illustration of the Euclidean distance formula

10
Distances computation between the objects

• Another well-known metric is the city block or Manhattan distance, defined by:

d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + . . . + |x_ip − x_jp|

11
Interpretation

• Suppose you live in a city where the streets are all north-south or east-
west, and hence perpendicular to each other
• Let Figure 3 be part of a street map of such a city, where the streets are
portrayed as vertical and horizontal lines

12
Interpretation

• Then the actual distance you would have to travel by car to get from location i to location j would total |x_i1 − x_j1| + |x_i2 − x_j2|

• This would be the shortest length among all possible paths from i to j

• Only a bird could fly straight from point i to point j, thereby covering the

Euclidean distance between these points

13
Mathematical Requirements of a Distance Function
• Both the Euclidean metric and the Manhattan metric satisfy the following
mathematical requirements of a distance function, for all objects i, j, and h:
• (D1) d(i, j) ≥ 0
• (D2) d(i, i) = 0
• (D3) d(i, j) = d(j, i)
• (D4) d(i, j) ≤ d(i, h) + d(h, j)
• Condition (D1) merely states that distances are nonnegative numbers and (D2) says
that the distance of an object to itself is zero
• Axiom (D3) is the symmetry of the distance function
• The triangle inequality (D4) looks a little bit more complicated, but is necessary to allow
a geometrical interpretation
• It says essentially that going directly from i to j is shorter than making a detour over
object h

14
Distances computation between the objects

• Note that d(i, j) = 0 does not necessarily imply that i = j, because it can very well
happen that two different objects have the same measurements for the
variables under study
• However, the triangle inequality implies that i and j will then have the
same distance to any other object h, because d(i, h) ≤ d(i, j) + d( j, h) = d(j,
h) and at the same time d( j, h) ≤ d( j, i) + d(i, h) = d(i, h), which together
imply that d(i, h) = d(j, h)

15
Minkowski distance

• A generalization of both the Euclidean and the Manhattan metric is the Minkowski distance, given by:

d(i, j) = ( |x_i1 − x_j1|^q + |x_i2 − x_j2|^q + . . . + |x_ip − x_jp|^q )^(1/q)

where q is any real number larger than or equal to 1 (written q here to avoid confusion with p, the number of variables)

• This is also called the L_q metric, with the Euclidean (q = 2) and the Manhattan (q = 1) as special cases

16
Example for Calculation of Euclidean and Manhattan Distance

• Let x1 = (1, 2) and x2 = (3, 5) represent two objects as in the given figure. The Euclidean distance between the two is sqrt(2² + 3²) = 3.61. The Manhattan distance between the two is 2 + 3 = 5.

Figure: 4
17
n- by- n Matrix
• For example, the Euclidean distances between the objects of the following table can be arranged as shown on the next slide:

Person   Weight (kg)   Height (cm)
A        15            95
B        49            156
C        13            95
D        45            160
E        85            178
F        66            176
G        12            90
H        10            78

• Euclidean distance between B and E:
• ((49 − 85)² + (156 − 178)²)^½ = 42.2

18
n- by- n Matrix
A B C D E F G H

A
B
C
D
E
F
G
H
19
Interpretation

• The distance between object B and object E can be located at the


intersection of the fifth row and the second column, yielding 42.2
• The same number can also be found at the intersection of the second row
and the fifth column, because the distance between B and E is equal to
the distance between E and B
• Therefore, a distance matrix is always symmetric
• Moreover, note that the entries on the main diagonal are always zero,
because the distance of an object to itself has to be zero

20
Distance matrix
• It would suffice to write down only the lower triangular half of the
distance matrix
A B C D E F G
B
C
D
E
F
G
H

21
Selection of variables

• It should be noted that a variable not containing any relevant information (say, the telephone number of each person) is worse than useless, because it will make the clustering less apparent.
• The occurrence of several such “trash variables” will kill the whole clustering, because they yield a lot of random terms in the distances, thereby hiding the useful information provided by the other variables.
• Therefore, such non-informative variables must be given a zero weight in the analysis, which amounts to deleting them

22
Selection of variables

• The selection of “good” variables is a nontrivial task and may involve quite
some trial and error (in addition to subject-matter knowledge and
common sense)
• In this respect, cluster analysis may be considered an exploratory
technique

23
Thank you

24
χ² Goodness of Fit Test

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Python demo for testing GOF for Poisson distribution


• Understanding goodness of fit test for:
– Uniform
– Normal
• Python demo for testing GOF for uniform and normal distribution

2
Goodness of fit for Uniform Distribution
Month Litres
• Milk Sales Data January 1,610
February 1,585
March 1,649
April 1,590
May 1,540
June 1,397
July 1,410
August 1,350
September 1,495
October 1,564
November 1,602
December 1,655
18,447

3
Hypotheses and Decision Rules
Ho: The monthly figures for milk sales are uniformly distributed
Ha: The monthly figures for milk sales are not uniformly distributed

α = .01
df = k − 1 − p = 12 − 1 − 0 = 11
χ²_.01,11 = 24.725

If χ²_Cal > 24.725, reject Ho.
If χ²_Cal ≤ 24.725, do not reject Ho.

4
Python code
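A minimal sketch, assuming scipy is available; scipy.stats.chisquare defaults to equal expected frequencies, which is exactly the uniform hypothesis:

from scipy.stats import chisquare

observed = [1610, 1585, 1649, 1590, 1540, 1397,
            1410, 1350, 1495, 1564, 1602, 1655]
stat, p_value = chisquare(observed)      # expected = 18447/12 in each month
print(round(stat, 2), p_value)           # ~74.37, p-value far below .01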

5
Calculations
Month       fo       fe         (fo − fe)²/fe
January     1,610    1,537.25   3.44
February    1,585    1,537.25   1.48
March       1,649    1,537.25   8.12
April       1,590    1,537.25   1.81
May         1,540    1,537.25   0.00
June        1,397    1,537.25   12.80
July        1,410    1,537.25   10.53
August      1,350    1,537.25   22.81
September   1,495    1,537.25   1.16
October     1,564    1,537.25   0.47
November    1,602    1,537.25   2.73
December    1,655    1,537.25   9.02
Total       18,447   18,447.00  74.37

fe = 18447 / 12 = 1537.25
χ²_Cal = 74.37
6
Python code

7
Conclusion
(Figure: chi-square distribution with df = 11; the rejection region of area 0.01 lies to the right of the critical value 24.725)

χ²_Cal = 74.37 > 24.725, reject Ho.

8
Goodness of Fit Test: Normal Distribution
1. Set up the null and alternative hypotheses.
2. Select a random sample and
a. Compute the mean and standard deviation.
b. Define intervals of values so that the expected frequency is at least 5 for
each interval.
c. For each interval record the observed frequencies
3. Compute the expected frequency, ei , for each interval.

9
Goodness of Fit Test: Normal Distribution
4. Compute the value of the test statistic:

χ² = Σ_{i=1}^{k} (f_i − e_i)² / e_i

5. Reject H0 if χ² > χ²_α (where α is the significance level and there are k − 3 degrees of freedom)

10
Normal Distribution Goodness of Fit Test
• Example: IQL Computers

IQL Computers manufactures and sells a general purpose microcomputer. As part of a study to evaluate sales personnel, management wants to determine, at the α = 0.05 significance level, if the annual sales volume (number of units sold by a salesperson) follows a normal probability distribution.

11
Normal Distribution Goodness of Fit Test

A simple random sample of 30 of the salespeople was taken and their numbers of units sold are below.

33 43 44 45 52 52 56 58 63 64
64 65 66 68 70 72 73 73 74 75
83 84 85 86 91 92 94 98 102 105
(mean = 71, sample standard deviation = 18.54)

12
Python code
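A minimal numpy sketch of step 2a, computing the sample mean and standard deviation:

import numpy as np

sales = [33, 43, 44, 45, 52, 52, 56, 58, 63, 64,
         64, 65, 66, 68, 70, 72, 73, 73, 74, 75,
         83, 84, 85, 86, 91, 92, 94, 98, 102, 105]
print(np.mean(sales))            # 71.0
print(np.std(sales, ddof=1))     # ~18.54 (sample standard deviation)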

13
Normal Distribution Goodness of Fit Test
• Hypotheses
H0: The population of number of units sold has a normal distribution with mean 71 and standard deviation 18.54
Ha: The population of number of units sold does not have a normal distribution with mean 71 and standard deviation 18.54

14
Normal Distribution Goodness of Fit Test
• Interval Definition

To satisfy the requirement of an expected frequency of at least 5 in each interval, we will divide the normal distribution into 30/5 = 6 equal-probability intervals.

15
Normal Distribution Goodness of Fit Test
• Interval Definition

Each interval has an area (probability) of 1.00/6 = .1667. With mean 71 and standard deviation 18.54, the interval boundaries are:

71 − .97(18.54) = 53.02
71 − .43(18.54) = 63.03
71
71 + .43(18.54) = 78.97
71 + .97(18.54) = 88.98
16
Python code
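A minimal sketch, assuming scipy, of how the six equal-probability interval boundaries can be computed (using exact z-values rather than the rounded .43 and .97, so the cut-offs differ slightly from the slide):

from scipy.stats import norm

mean, std = 71, 18.54
cuts = [norm.ppf(i / 6, loc=mean, scale=std) for i in range(1, 6)]
print([round(c, 2) for c in cuts])   # ~[53.06, 63.01, 71.0, 78.99, 88.94]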

17
Normal Distribution Goodness of Fit Test
• Observed and Expected Frequencies
i fi ei f i - ei
Less than 53.02 6 5 1
53.02 to 63.03 3 5 -2
63.03 to 71.00 6 5 1
71.00 to 78.97 5 5 0
78.97 to 88.98 4 5 -1
More than 88.98 6 5 1
Total 30 30

18
Python code
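A minimal sketch of steps 4 and 5 for the interval counts above:

from scipy.stats import chi2

f_obs = [6, 3, 6, 5, 4, 6]
f_exp = [5, 5, 5, 5, 5, 5]
stat = sum((o - e) ** 2 / e for o, e in zip(f_obs, f_exp))
df = 6 - 2 - 1                       # k - 3: two parameters were estimated
print(stat, chi2.sf(stat, df))       # 1.6, p-value ~0.66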

19
Normal Distribution Goodness of Fit Test
• Rejection Rule
With α = .05 and k − p − 1 = 6 − 2 − 1 = 3 d.f. (where k = number of categories and p = number of population parameters estimated), χ²_.05 = 7.815
Reject H0 if p-value < .05 or χ² > 7.815.

• Test Statistic

χ² = (1)²/5 + (−2)²/5 + (1)²/5 + (0)²/5 + (−1)²/5 + (1)²/5 = 1.600

Since χ² = 1.600 ≤ 7.815, we do not reject H0.

20
Thank you

21
Cluster analysis: Part - III

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Clustering analysis part III

2
Agenda

• Handling missing data


• Calculation of similarity and dissimilarity matrix

3
Handling missing data

• It often happens that not all measurements are actually available, so there
are some “holes” in the data matrix
• Such an absent measurement is called a missing value and it may have
several causes
• The value of the measurement may have been lost or it may not have
been recorded at all by oversight or lack of time

4
Handling missing data

• Sometimes the information is simply not available, for example the


birthdate of a foundling, or the patient may not remember whether he or
she ever had the measles, or it may be impossible to measure the desired
quantity due to the malfunctioning of some instrument
• In certain instances the question does not apply (such as the colour of hair
of a bald person) or there may be more than one possible answer (when
two experimenters obtain very different results)

5
Handling missing data

• How can we handle a data set with missing values?


• In a matrix we indicate the absent measurements by means of some code
• If there exists an object in the data set for which all measurements are
missing, there is really no information on this object so it has to be deleted
• Analogously, a variable consisting exclusively of missing values has to be
removed too

6
Handling missing data

• If the data are standardized, the mean value m_f of the fth variable is calculated by making use of the present values only
• The same goes for s_f: in the denominator, we must replace n by the number of non-missing values for that variable
• Of course, only terms for which the corresponding x_if is not missing are included

7
Handling missing data

• In the computation of distances (based on either the x_if or the z_if), similar precautions must be taken
• When calculating the distances d(i, j), only those variables are considered in the sum for which the measurements for both objects are present; subsequently the sum is multiplied by p and divided by the actual number of terms (in the case of Euclidean distances this is done before taking the square root)
• Such a procedure only makes sense when the variables are thought of as having the same weight (for instance, this can be done after standardization)

8
Handling missing data

• When computing these distances, one might come across a pair of objects
that do not have any common measured variables, so their distance
cannot be computed by means of the above mentioned approach.
• Several remedies are possible: One could remove either object or one
could fill in some average distance value based on the rest of the data
• Or by replacing all missing xif by the mean mf of that variable; then all
distances can be computed
• Applying any of these methods, one finally possesses a “full” set of
distances

9
Dissimilarities

• The entries of an n-by-n matrix may be Euclidean or Manhattan distances


• However, there are many other possibilities, so we no longer speak of
distances but of dissimilarities (or dissimilarity coefficients)
• Basically, dissimilarities are non-negative numbers d( i, j) that are small
(close to zero) when i and j are “near” to each other and that become
large when i and j are very different
• We shall usually assume that dissimilarities are symmetric and that the
dissimilarity of an object to itself is zero, but in general the triangle
inequality does not hold

10
Dissimilarities

• Dissimilarities can be obtained in several ways.


• Often they can be computed from variables that are binary, nominal,
ordinal, interval, or a combination of these
• Also, dissimilarities can be simple subjective ratings of how much certain
objects differ from each other, from the point of view of one or more
observers
• This kind of data is typical in the social sciences and in marketing

11
Example

• Fourteen postgraduate economics students (coming from different parts of the world) were asked to indicate the subjective dissimilarities between 11 scientific disciplines.
• All of them had to fill in a matrix like Table 4, where the dissimilarities had to be given as integer numbers on a scale from 0 (identical) to 10 (very different)
• The actual entries of the table on the next slide are the averages of the values given by the students

12
Example
• It appears that the smallest dissimilarity is perceived between
mathematics and computer science (1.43 ), whereas the most remote
fields were psychology and astronomy (9.36)

13
Dissimilarities

• If one wants to perform a cluster analysis on a set of variables that have been observed in some population, there are other measures of dissimilarity
• For instance, one can compute the (parametric) Pearson product-moment correlation coefficient between the variables f and g, or alternatively the (non-parametric) Spearman correlation

14
Dissimilarities

• Both coefficients lie between - 1 and + 1 and do not depend on the choice
of measurement units
• The main difference between them is that the Pearson coefficient looks
for a linear relation between the variables f and g, whereas the Spearman
coefficient searches for a monotone relation

15
Dissimilarities
• Correlation coefficients are useful for clustering purposes because they
measure the extent to which two variables are related
• Correlation coefficients, whether parametric or nonparametric, can be converted to dissimilarities d(f, g), for instance by setting:

d(f, g) = (1 − R(f, g)) / 2

• With this formula, variables with a high positive correlation receive a dissimilarity coefficient close to zero, whereas variables with a strongly negative correlation will be considered very dissimilar

16
Similarities

• The more objects i and j are alike (or close), the larger s(i, j) becomes
• Such a similarity s(i, j) typically takes on values between 0 and 1, where 0 means that i and j are not similar at all and 1 reflects maximal similarity
• Values in between 0 and 1 indicate various degrees of resemblance
• Often it is assumed that the following conditions hold:

0 ≤ s(i, j) ≤ 1,   s(i, i) = 1,   s(i, j) = s(j, i)
17
Similarities

• For all objects i and j , the numbers s(i, j) can be arranged in an n-by-n
matrix ,which is then called a similarity matrix
• Both similarity and dissimilarity matrices are generally referred to as
proximity matrices, or sometimes as resemblance
• In order to define similarities between variables, we can again resort to
the Pearson or the Spearman correlation coefficient
• However, neither correlation measure can be used directly as a similarity
coefficient because they also take on negative values

18
Similarities
• Some transformation is in order to bring the coefficients into the zero-one range
• There are essentially two ways to do this, depending on the meaning of the data and the purpose of the application
• If variables with a strong negative correlation are considered to be very different because they are oriented in the opposite direction (like mileage and weight of a set of cars), then it is best to take something like the following:

s(f, g) = (1 + R(f, g)) / 2

which yields s(f, g) = 0 whenever R(f, g) = −1.

19
Similarities

• There are situations in which variables with a strong negative correlation should be grouped, because they measure essentially the same thing
• For instance, this happens if one wants to reduce the number of variables in a regression data set by selecting one variable from each cluster
• In that case it is better to use a formula like

s(f, g) = |R(f, g)|

which yields s(f, g) = 1 when R(f, g) = −1

20
Similarities

• Suppose the data consist of a similarity matrix but one wants to apply a
clustering algorithm designed for dissimilarities
• Then it is necessary to transform the similarities into dissimilarities
• The larger the similarity s(i, j) between i and j, the smaller their
dissimilarity d(i, j) should be
• Therefore, we need a decreasing transformation, such as:

d(i, j) = 1 − s(i, j)

21
Binary Variables

• A contingency table for binary variables:

                Object j
                 1       0       sum
Object i    1    q       r       q + r
            0    s       t       s + t
    sum          q + s   r + t   p
22
Dissimilarity between two binary variables

• q → the number of variables that equal 1 for both objects i and j
• r → the number of variables that equal 1 for object i but 0 for object j
• s → the number of variables that equal 0 for object i but 1 for object j
• t → the number of variables that equal 0 for both objects i and j
• The total number of variables is p, where p = q + r + s + t.

23
Symmetric Binary Dissimilarity

d(i, j) = (r + s) / (q + r + s + t)
24
Asymmetric binary variable
• A binary variable is asymmetric if the outcomes of the states are not
equally important, such as the positive and negative outcomes of a
disease test.
• By convention, we shall code the most important outcome, which is
usually the rarest one, by 1 (e.g., HIV positive) and the other by 0 (e.g., HIV
negative).
• Given two asymmetric binary variables, the agreement of two 1s (a
positive match) is then considered more significant than that of two 0s (a
negative match).
• Therefore, such binary variables are often considered “monary” (as if
having one state).

25
Asymmetric binary dissimilarity (the negative matches t are ignored):

d(i, j) = (r + s) / (q + r + s)
26
Jaccard coefficient (similarity for asymmetric binary variables):

sim(i, j) = q / (q + r + s) = 1 − d(i, j)
27
Dissimilarity between binary variables

28
Dissimilarity between Jack and Mary
Jack

Mary    1    0
1       2    1
0       0    3

d(Jack, Mary) = (r + s)/(q + r + s) = (0 + 1)/(2 + 0 + 1) = 0.33

29
Dissimilarity between Jack and Jim

Jim     1    0
Jack 1  1    1
     0  1    3

d(Jack, Jim) = (1 + 1)/(1 + 1 + 1) = 0.67

30
Dissimilarity between Jim and Mary

Jim      1    0
Mary 1   1    2
     0   1    2

d(Jim, Mary) = (1 + 2)/(1 + 1 + 2) = 0.75

31
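A minimal Python sketch of the asymmetric binary dissimilarity d = (r + s)/(q + r + s); the 0/1 vectors below are hypothetical but consistent with the three contingency tables above:

def asym_binary_d(a, b):
    q = sum(x == 1 and y == 1 for x, y in zip(a, b))
    r = sum(x == 1 and y == 0 for x, y in zip(a, b))
    s = sum(x == 0 and y == 1 for x, y in zip(a, b))
    return (r + s) / (q + r + s)

jack = [1, 0, 1, 0, 0, 0]   # hypothetical six binary attributes
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]
print(round(asym_binary_d(jack, mary), 2))   # 0.33
print(round(asym_binary_d(jack, jim), 2))    # 0.67
print(round(asym_binary_d(jim, mary), 2))    # 0.75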
Thank you

32
Cluster analysis: Part - IV

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• How to handle the following types of variables :


– Interval scale variable
– Binary variables
– Categorical Variables
– Ordinal Variables
– Ratio-Scaled Variables
– Variables of mixed type

Categorical Variables
• A categorical variable is a generalization of the binary variable in that it
can take on more than two states
• For example, map color is a categorical variable that may have, say, five
states: red, yellow, green, purple, and blue

3
Categorical Variables

• Let the number of states of a categorical variable be M


• The states can be denoted by letters, symbols, or a set of integers, such as
1, 2,..., M
• Notice that such integers are used just for data handling and do not
represent any specific ordering

4
Categorical Variables

• “How is dissimilarity computed between objects described by categorical variables?”
Categorical Variables

• The dissimilarity between two objects i and j can be computed based on the ratio of mismatches:

d(i, j) = (p − m) / p

where m is the number of matches (i.e., the number of variables for which i and j are in the same state), and p is the total number of variables
• Weights can be assigned to increase the effect of m or to assign greater weight to the matches in variables having a larger number of states

6
Dissimilarity between categorical variables
• Suppose that we have the sample data shown in the table below, where each object is described by test-1 (categorical), test-2 (ordinal), and test-3 (ratio-scaled)
• Let only the object-identifier and the variable (or attribute) test-1 be available, which is categorical

Object identifier   test-1    test-2       test-3
1                   code-A    excellent    445
2                   code-B    fair         22
3                   code-C    good         164
4                   code-A    excellent    1210

Finding Groups in Data: An Introduction to Cluster Analysis
Author(s): Leonard Kaufman, Peter J. Rousseeuw
March 1990, John Wiley & Sons, Inc.

7
Dissimilarity matrix

     1   2   3   4
1    0
2    1   0
3    1   1   0
4    0   1   1   0

8
Dissimilarity between categorical variables

• Since here we have one categorical variable, test-1, we set p = 1 in the equation

d(i, j) = (p − m) / p

so that d(i, j) evaluates to 0 if objects i and j match, and 1 if the objects differ
• Thus, we get d(2,1) = (1 − 0)/1 = 1 and d(4,1) = (1 − 1)/1 = 0

9
Ordinal Variables

• A discrete ordinal variable resembles a categorical variable, except that the M states of the ordinal value are ordered in a meaningful sequence
• Ordinal variables are very useful for registering subjective assessments of qualities that cannot be measured objectively
• For example, professional ranks are often enumerated in a sequential order, such as Assistant, Associate, and Full Professor
• A continuous ordinal variable looks like a set of continuous data of an unknown scale; that is, the relative ordering of the values is essential but their actual magnitude is not

10
Ordinal Variables

• For example, the relative ranking in a particular sport (e.g., gold, silver,
bronze) is often more essential than the actual values of a particular
measure
• Ordinal variables may also be obtained from the discretization of interval-
scaled quantities by splitting the value range into a finite number of
classes
• The values of an ordinal variable can be mapped to ranks

11
Dissimilarity computation

• The treatment of ordinal variables is quite similar to that of interval-scaled variables when computing the dissimilarity between objects
• Suppose that f is a variable from a set of ordinal variables describing n objects
• The dissimilarity computation with respect to f involves the following steps:
• The value of f for the ith object is x_if, and f has M_f ordered states, representing the ranking 1, . . . , M_f
• Replace each x_if by its corresponding rank, r_if ∈ {1, . . . , M_f}

12
Standardization of ordinal variable

• Since each ordinal variable can have a different number of states, it is often necessary to map the range of each variable onto [0.0, 1.0] so that each variable has equal weight.
• This can be achieved by replacing the rank r_if of the ith object in the fth variable by:

z_if = (r_if − 1) / (M_f − 1)

14
Dissimilarity computation

• Dissimilarity can then be computed using any of the distance measures


described earlier (like that for interval data)

15
Example

• Suppose that we have the sample data of the table shown earlier, except that this time only the object-identifier and the continuous ordinal variable, test-2, are available
• There are three states for test-2, namely fair, good, and excellent, that is M_f = 3

16
Example

• For step 1, if we replace each value for test-2 by its rank, the four objects
are assigned the ranks 3, 1, 2, and 3, respectively

• Step 2 normalizes the ranking by mapping rank 1 to 0.0, rank 2 to 0.5, and
rank 3 to 1.0
• For step 3, we can use, say, the Euclidean distance, which results in the
following dissimilarity matrix:

17
Dissimilarity computation

Object → rank → normalized value:  1 → 3 → 1.0;  2 → 1 → 0.0;  3 → 2 → 0.5;  4 → 3 → 1.0

     1     2     3     4
1    0
2    1.0   0
3    0.5   0.5   0
4    0     1.0   0.5   0

18
Ratio-Scaled Variables

• A ratio-scaled variable makes a positive measurement on a nonlinear scale, such as an exponential scale, approximately following the formula

x = A e^(Bt)   or   x = A e^(−Bt)

where A and B are positive constants, and t typically represents time
• Common examples include the growth of a bacteria population or the decay of a radioactive element

19
Computing the dissimilarity between objects
• There are three methods to handle ratio-scaled variables for computing
the dissimilarity between objects:
1. Treat ratio-scaled variables like interval-scaled variables
– This, however, is not usually a good choice since it is likely that the
scale may be distorted
2. Apply logarithmic transformation to a ratio-scaled variable f having value
xif for object i by using the formula yif = log(xi f)
– The yif values can be treated as interval valued, Notice that for some
ratio-scaled variables, log-log or other transformations may be
applied, depending on the variable’s definition and the application

20
Computing the dissimilarity between objects

3. Treat xif as continuous ordinal data and treat their ranks as interval-valued

• The latter two methods are the most effective, although the choice of
method used may depend on the given application

21
Example

• This time, we have the sample data of the same table, except that only the object-identifier and the ratio-scaled variable, test-3, are available

22
Example

• Let's try a logarithmic transformation
• Taking the log of test-3 results in the values 2.65, 1.34, 2.21, and 3.08 for the objects 1 to 4, respectively
• Using the Euclidean distance on the transformed values, we obtain the following dissimilarity matrix:

     1     2     3     4
1    0
2    1.31  0
3    0.44  0.87  0
4    0.43  1.74  0.87  0

23
Variables of Mixed Types

• So far we have discussed how to compute the dissimilarity between


objects described by variables of the same type, where these types may
be either interval-scaled, symmetric binary, asymmetric binary,
categorical, ordinal, or ratio-scaled
• However, in many real databases, objects are described by a mixture of
variable types

24
Variables of Mixed Types

• In general, a database can contain all of the six variable types listed above
• “So, how can we compute the dissimilarity between objects of mixed
variable types?”
• One approach is to group each kind of variable together, performing a
separate cluster analysis for each variable type
– This is feasible if these analyses derive compatible results
– However, in real applications, it is unlikely that a separate cluster
analysis per variable type will generate compatible results

25
Variables of Mixed Types

• A more preferable approach is to process all variable types together,


performing a single cluster analysis
• One such technique combines the different variables into a single
dissimilarity matrix, bringing all of the meaningful variables onto a
common scale of the interval [0.0,1.0]

26
Variables of Mixed Types
• Suppose that the data set contains p variables of mixed type
• The dissimilarity d(i, j) between objects i and j is defined as

d(i, j) = ( Σ_{f=1}^{p} δ_ij^(f) d_ij^(f) ) / ( Σ_{f=1}^{p} δ_ij^(f) )

where the indicator δ_ij^(f) = 0 if either
– x_if or x_jf is missing (i.e., there is no measurement of variable f for object i or object j), or
– x_if = x_jf = 0 and variable f is asymmetric binary;
otherwise, δ_ij^(f) = 1

27
Variables of Mixed Types

• The contribution of variable f to the dissimilarity between i and j, that is, d_ij^(f), is computed dependent on its type:
• If f is interval-based:

d_ij^(f) = |x_if − x_jf| / (max_h x_hf − min_h x_hf)

where h runs over all non-missing objects for variable f
• If f is binary or categorical: d_ij^(f) = 0 if x_if = x_jf; otherwise d_ij^(f) = 1

28
Variables of Mixed Types

• If f is ordinal: compute the ranks r_if and z_if = (r_if − 1) / (M_f − 1), and treat z_if as interval-scaled
• If f is ratio-scaled: either perform a logarithmic transformation and treat the transformed data as interval-scaled; or treat f as continuous ordinal data, compute r_if and z_if, and then treat z_if as interval-scaled
• The above steps are identical to what we have already seen for each of the
individual variable types

29
Variables of Mixed Types

• The only difference is for interval-based variables, where here we


normalize so that the values map to the interval [0.0,1.0]
• Thus, the dissimilarity between objects can be computed even when the
variables describing the objects are of different types

30
Thank you

31
Cluster analysis: Part - V

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Dissimilarity matrix for mixed type variables


• Python demo for computing different types of distances
• Python demo for computing distance matrix for interval scaled data
Example

• Consider the data given in the table used earlier and compute a dissimilarity matrix for the objects of the table
• Now we will consider all of the variables, which are of different types

3
Example
• The procedures we followed for test-1 (which is categorical) and test-2 (which is ordinal) are the same as outlined above for processing variables of mixed types
• For the categorical variable: d_ij^(f) = 0 if x_if = x_jf, otherwise 1
• For the ordinal variable: replace each value by its rank r_if and compute z_if = (r_if − 1)/(M_f − 1)
• For the interval-scaled variable: d_ij^(f) = |x_if − x_jf| / (max_h x_hf − min_h x_hf)

4
Normalizing the interval scale data

• First, however, we need to complete some work for test-3 (which is ratio-scaled)
• We have already applied a logarithmic transformation to its values
• Based on the transformed values of 2.65, 1.34, 2.21, and 3.08 obtained for the objects 1 to 4, respectively, we let max_h x_h = 3.08 and min_h x_h = 1.34
• We then normalize the values in the dissimilarity matrix obtained for the ratio-scaled data by dividing each one by (3.08 − 1.34) = 1.74

5
Dissimilarity matrix for test-3

• This results in the following dissimilarity matrix for test-3:

Object identifier   Ratio-scaled data (x)   log(x)
1                   445                     2.65
2                   22                      1.34
3                   164                     2.21
4                   1210                    3.08

     1     2     3     4
1    0
2    0.75  0
3    0.25  0.50  0
4    0.25  1.00  0.50  0

• For objects 1 and 2: (2.65 − 1.34)/(3.08 − 1.34) = 0.75

6
dissimilarity matrices for the three variables

• We can now use the dissimilarity matrices for the three variables in our computation of the equation given earlier
• For example, we get d(2,1) = (1(1) + 1(1) + 1(0.75)) / 3 = 0.92
• The three matrices combined are the dissimilarity matrix for the categorical variable, the dissimilarity matrix for the ordinal variable, and the normalized dissimilarity matrix for the ratio-scaled data

7
Example

• The resulting dissimilarity matrix obtained for the data described by the three variables of mixed types is:

     1     2     3     4
1    0
2    0.92  0
3    0.58  0.67  0
4    0.08  1.00  0.67  0

8
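A minimal Python sketch of this mixed-type computation; the categorical labels code-A/B/C are assumptions consistent with the matrices above:

import math

test1 = ["code-A", "code-B", "code-C", "code-A"]       # categorical (assumed labels)
rank = [3, 1, 2, 3]                                    # ordinal ranks, Mf = 3
z = [(r - 1) / (3 - 1) for r in rank]                  # 1.0, 0.0, 0.5, 1.0
logx = [math.log10(v) for v in (445, 22, 164, 1210)]   # ratio-scaled, log-transformed
rng = max(logx) - min(logx)                            # 3.08 - 1.34 = 1.74

def d(i, j):
    d_cat = 0.0 if test1[i] == test1[j] else 1.0
    d_ord = abs(z[i] - z[j])
    d_rat = abs(logx[i] - logx[j]) / rng
    return (d_cat + d_ord + d_rat) / 3                 # all three weights are 1

print(round(d(1, 0), 2))   # 0.92, as computed above
print(round(d(3, 0), 2))   # 0.08, objects 1 and 4 are most similar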
Interpretation

• If we go back and look at Table of given data, we can intuitively guess that
objects 1 and 4 are the most similar, based on their values for test-1 and
test-2
• This is confirmed by the dissimilarity matrix, where d(4,1) is the lowest
value for any pair of different objects
• Similarly, the matrix indicates that objects 2 and 4 are the least similar

9
Distance Measurement using python - Euclidean Distance :
Python Demo for Euclidean Distance
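A minimal sketch, assuming scipy is installed, reproducing the earlier example with x1 = (1, 2) and x2 = (3, 5):

from scipy.spatial import distance

x1, x2 = (1, 2), (3, 5)
print(distance.euclidean(x1, x2))   # 3.605..., i.e. sqrt(2² + 3²)
print(distance.cityblock(x1, x2))   # 5, the Manhattan distance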
Distance Measurement using python – Minkowski Distance :

• p = 1: Manhattan distance
• p = 2: Euclidean distance
Python Demo for Minkowski Distance
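A minimal sketch showing that the Minkowski distance reduces to the Manhattan distance at p = 1 and to the Euclidean distance at p = 2:

from scipy.spatial import distance

x1, x2 = (1, 2), (3, 5)
print(distance.minkowski(x1, x2, p=1))   # 5.0 (Manhattan)
print(distance.minkowski(x1, x2, p=2))   # 3.605... (Euclidean)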
Dissimilarity matrix
Distance matrix calculation for Interval-Scaled Variables

• For example :
Person Weight(Kg) Height(cm)
• Take eight people, the weight (in A 15 95
kilograms) and the height (in centimetres) B 49 156
• In this situation, n = 8 and p = 2. C 13 95
D 45 160
E 85 178
F 66 176
G 12 90
H 10 78

15
Distance matrix calculation using Python
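A minimal sketch, assuming scipy, computing the full 8-by-8 Euclidean distance matrix for the weight/height data above:

import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[15, 95], [49, 156], [13, 95], [45, 160],
              [85, 178], [66, 176], [12, 90], [10, 78]], dtype=float)
D = squareform(pdist(X, metric="euclidean"))   # symmetric, zero diagonal
print(np.round(D, 1))                          # e.g. D[1, 4] = 42.2 (B to E)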
Thank You

19
K- Means Clustering

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Classification of clustering methods


• Partitioning method: K – means clustering

2
Classification of Clustering Methods

Clustering Methods
• Partitioning: K-Means, k-Medoids
• Hierarchical: Agglomerative
3
Which Clustering Algorithm to Choose

• The choice of a clustering algorithm depends on:
– the type of data available
– the particular purpose
• It is permissible to try several algorithms on the same data, because
cluster analysis is mostly used as a descriptive or exploratory tool

4
Partitioning Method
Given -
• a data set of n objects
• k, the number of clusters
• A partitioning algorithm organizes the objects into k partitions (k ≤ n),
where each partition represents a cluster.
• The clusters are formed to optimize an objective partitioning criterion
• Objective partitioning criterion such as a dissimilarity function based on
distance
• Therefore, the objects within a cluster are “similar,” whereas the objects
of different clusters are “dissimilar” in terms of the data set attributes.

5
Partitioning Method

• Partitioning methods are applied if one wants to classify the objects into k
clusters, where k is fixed.

6
K-Means Method

• It is a centroid based technique


• The k-means algorithm takes the input parameter, k, and partitions a set
of n objects into k clusters
• So that the resulting intra-cluster similarity is high but the inter-cluster
similarity is low
• Cluster similarity is measured in regard to the mean value of the objects in
a cluster, which can be viewed as the cluster’s centroid or center of gravity

7
Working Principle of K-Means Algorithm

8
Working Principle of K-Means Algorithm

• First it randomly selects k of the objects, each of which initially represents a cluster mean or center
• For each of the remaining objects, an object is assigned to the cluster to
which it is the most similar, based on the distance between the object and
the cluster mean
• It then computes the new mean for each cluster
• This process iterates until the criterion function converges

9
Working Principle of K-Means Algorithm
• Criterion function:

E = Σ_{i=1}^{k} Σ_{p∈C_i} |p − m_i|²

where
– E is the sum of the square error for all objects in the data set;
– p is the point in space representing a given object;
– m_i is the mean of cluster C_i (both p and m_i are multidimensional).
• For each object in each cluster, the distance from the object to its cluster
center is squared, and the distances are summed.
• This criterion tries to make the resulting k clusters as compact and as separate
as possible.

10
K=3

11
K-Means Clustering Algorithm

Algorithm: k-means. The k-means algorithm for partitioning, where each cluster’s center is represented by the mean value of the objects in the cluster.
• Input:
k: the number of clusters,
D: a data set containing n objects.
• Output: A set of k clusters.

12
K-Means Clustering Method

• Method:
(1) arbitrarily choose k objects from D as the initial cluster centers;
(2) repeat
(3) (re)assign each object to the cluster to which the object is the most
similar, based on the mean value of the objects in the cluster;
(4) update the cluster means, i.e., calculate the mean value of the objects for
each cluster;
(5) until no change;

13
K-Means clustering example

Individual Variable 1 Variable 2


1 1.0 1.0
2 1.5 2.0
3 3.0 4.0
4 5.0 7.0
5 3.5 5.0
6 4.5 5.0
7 3.5 4.5

14
K-Means clustering example

(Figure: scatter plot of the seven individuals, with Variable 1 on the horizontal axis and Variable 2 on the vertical axis)

15
K-Means clustering example

• Initialization: randomly choose two centroids (k = 2) for the two clusters. In this case the two centroids are:

Cluster   Var1   Var2
K1        1.0    1.0
K2        3.0    4.0

• Calculate the Euclidean distance using:

Distance[(x1, y1), (x2, y2)] = sqrt((x2 − x1)² + (y2 − y1)²)

16
K-Means clustering example

Distance of k1 from k1 (1.0, 1.0) = sqrt((1.0 − 1.0)² + (1.0 − 1.0)²) = 0
Distance of k1 from k2 (3.0, 4.0) = sqrt((3.0 − 1.0)² + (4.0 − 1.0)²) = 3.61
Distance of k2 from k2 (3.0, 4.0) = sqrt((3.0 − 3.0)² + (4.0 − 4.0)²) = 0

Centroid   K1     K2     Cluster Assignment
K1         0      3.61   k1
K2         3.61   0      k2

17
At K = 2

(Figure: scatter plot of the seven individuals with the two initial centroids at (1.0, 1.0) and (3.0, 4.0))

18
K-Means clustering example

• Calculate the Euclidean distance for the next data point (1.5, 2.0):

Distance from cluster 1 = sqrt((1.5 − 1.0)² + (2.0 − 1.0)²) = 1.12
Distance from cluster 2 = sqrt((1.5 − 3.0)² + (2.0 − 4.0)²) = 2.5

Dataset      Cluster 1   Cluster 2   Assignment
(1.5, 2.0)   1.12        2.5         k1

19
(Figure: scatter plot; point (1.5, 2.0) joins cluster k1)

20
K-Means clustering example

• Update the cluster centroid

Cluster Var1 Var2


K1 (1.0 + 1.5)/2 = 1.25 (1.0 + 2.0)/2 = 1.5
K2 3.0 4.0

21
K-Means clustering example

• Calculate the Euclidean distance for the next data point (5.0, 7.0):

Distance from cluster 1 = sqrt((5.0 − 1.25)² + (7.0 − 1.5)²) = 6.66
Distance from cluster 2 = sqrt((5.0 − 3.0)² + (7.0 − 4.0)²) = 3.61

Dataset      Cluster 1   Cluster 2   Assignment
(5.0, 7.0)   6.66        3.61        k2

22
(Figure: scatter plot; point (5.0, 7.0) joins cluster k2)

23
K-Means clustering example

• Update the cluster centroid

Cluster Var1 Var2


K1 1.25 1.5
K2 (3.0 + 5.0)/2 = 4 (4.0 + 7.0)/2 =5.5

24
K-Means clustering example

• Calculate the Euclidean distance for the next data point (3.5, 5.0):

Distance from cluster 1 = sqrt((3.5 − 1.25)² + (5.0 − 1.5)²) = 4.16
Distance from cluster 2 = sqrt((3.5 − 4.0)² + (5.0 − 5.5)²) = 0.71

Dataset      Cluster 1   Cluster 2   Assignment
(3.5, 5.0)   4.16        0.71        k2

25
(Figure: scatter plot; point (3.5, 5.0) joins cluster k2)

26
K-Means clustering example

• Update the cluster centroid

Cluster Var1 Var2


K1 1.25 1.5
K2 (3.0+5.0+ 3.5)/3 = (4.0+7.0 + 5.0)/3 =
3.83 5.33

27
K-Means clustering example

• Calculate the Euclidean distance for the next data point (4.5, 5.0):

Distance from cluster 1 = sqrt((4.5 − 1.25)² + (5.0 − 1.5)²) = 4.78
Distance from cluster 2 = sqrt((4.5 − 3.83)² + (5.0 − 5.33)²) = 0.75

Dataset      Cluster 1   Cluster 2   Assignment
(4.5, 5.0)   4.78        0.75        k2

28
(Figure: scatter plot; point (4.5, 5.0) joins cluster k2)

29
K-Means clustering example

• Update the cluster centroid

Cluster Var1 Var2


K1 1.25 1.5
K2 (3.0+5.0+3.5+4.5)/4= 4.00 (4.0+7.0+5.0+5.0)/4= 5.25

30
K-Means clustering example

• Calculate the Euclidean distance for the last data point (3.5, 4.5):

Distance from cluster 1 = sqrt((3.5 − 1.25)² + (4.5 − 1.5)²) = 3.75
Distance from cluster 2 = sqrt((3.5 − 4.0)² + (4.5 − 5.25)²) = 0.90

Dataset      Cluster 1   Cluster 2   Assignment
(3.5, 4.5)   3.75        0.90        k2

31
(Figure: scatter plot; point (3.5, 4.5) joins cluster k2, giving the final clusters {1, 2} and {3, 4, 5, 6, 7})

32
K-Means clustering example

• Update the cluster centroid

Cluster Var1 Var2


K1 1.25 1.5
K2 (3.0+5.0+3.5+4.5+3.5)/5= 3.9 (4.0+7.0+5.0+5.0+4.5)/5= 5.1

33
K-Means clustering example
Individual Variable 1 Variable 2 Assignment
1 1.0 1.0 1
2 1.5 2.0 1
3 3.0 4.0 2
4 5.0 7.0 2
5 3.5 5.0 2
6 4.5 5.0 2
7 3.5 4.5 2

34
Python code for K- Means Clustering
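A minimal sketch, assuming scikit-learn is installed, running k-means with k = 2 on the seven individuals of the worked example:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
              [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)            # individuals 1-2 in one cluster, 3-7 in the other
print(km.cluster_centers_)   # ~(1.25, 1.5) and (3.9, 5.1), as derived above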

35
Python code for K- Means Clustering

36
Python code

37
Thank you

38
Hierarchical method of clustering - I

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Introduction to Hierarchical clustering


• Partitioning Vs. Hierarchical

2
Introduction

• A hierarchical method creates a hierarchical decomposition of the given set of data objects
• A hierarchical clustering method works by grouping data objects into a
tree of clusters
• A hierarchical method can be classified as being either agglomerative or
divisive, based on how the hierarchical decomposition is formed
• The agglomerative approach, also called the bottom-up approach, starts
with each object forming a separate group

3
Introduction

• It successively merges the objects or groups that are close to one another,
until all of the groups are merged into one (the topmost level of the
hierarchy), or until a termination condition holds
• The divisive approach, also called the top-down approach, starts with all
of the objects in the same cluster
• In each successive iteration, a cluster is split up into smaller clusters, until
eventually each object is in one cluster, or until a termination condition
holds

4
Introduction

• Hierarchical methods suffer from the fact that once a step (merge or split)
is done, it can never be undone

• This rigidity is useful in that it leads to smaller computation costs by not


having to worry about a combinatorial number of different choices

• However, such techniques cannot correct erroneous decisions

5
Agglomerative and Divisive Hierarchical Clustering

Agglomerative:
• This bottom-up strategy starts by placing each object in its own cluster and then merges these atomic clusters into larger and larger clusters, until all of the objects are in a single cluster or until certain termination conditions are satisfied
• Most hierarchical clustering methods belong to this category

Divisive:
• This top-down strategy does the reverse of agglomerative hierarchical clustering by starting with all objects in one cluster
• It subdivides the cluster into smaller and smaller pieces, until each object forms a cluster on its own or until it satisfies certain termination conditions, such as a desired number of clusters being obtained or the diameter of each cluster being within a certain threshold

6
Agglomerative versus divisive hierarchical clustering

Figure: 1 Agglomerative and divisive hierarchical clustering on data objects{a,b,c,d,e}


7
Interpretation

• Figure 1 shows the application of AGNES (AGglomerative NESting), an agglomerative hierarchical clustering method, and DIANA (DIvisive
ANAlysis), a divisive hierarchical clustering method, to a data set of five
objects, {a,b,c,d,e}
• Initially, AGNES places each object into a cluster of its own
• The clusters are then merged step-by-step according to some criterion
• Let’s say for example, clusters C1 and C2 may be merged if an object in C1
and an object in C2 form the minimum Euclidean distance between any
two objects from different clusters

8
Interpretation

• This is a single-linkage approach in that each cluster is represented by all of the objects in the cluster, and the similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters
• The cluster merging process repeats until all of the objects are eventually
merged to form one cluster

9
Interpretation

• In DIANA, all of the objects are used to form one initial cluster
• The cluster is split according to some principle, such as the maximum
Euclidean distance between the closest neighboring objects in the cluster
• The cluster splitting process repeats until, eventually, each new cluster
contains only a single object
• In either agglomerative or divisive hierarchical clustering, the user can
specify the desired number of clusters as a termination condition

10
Dendrogram

Figure 2: Dendrogram representation for hierarchical clustering of data objects{a,b,c,d,e}


11
Dendrogram

• A tree structure called a dendrogram is commonly used to represent the process of hierarchical clustering
• It shows how objects are grouped together step by step
• Figure: 2 shows a dendrogram for the five objects presented in Figure:1 ,
where l =0 shows the five objects as singleton clusters at level 0
• At l =1, objects a and b are grouped together to form the first cluster, and
they stay together at all subsequent levels

12
Dendrogram

• We can also use a vertical axis to show the similarity scale between
clusters
• For example, when the similarity of two groups of objects, {a,b} and
{c,d,e}, is roughly 0.16, they are merged together to form a single cluster

13
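A minimal sketch, assuming scipy and matplotlib are installed, of single-linkage (nearest-neighbor) clustering and its dendrogram for five hypothetical points a-e:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[0.0, 0.0], [0.3, 0.1],               # a and b lie close together
              [2.0, 2.0], [2.2, 2.1], [2.1, 2.4]])  # c, d, e form a second group
Z = linkage(X, method="single")                     # merge by minimum distance
dendrogram(Z, labels=["a", "b", "c", "d", "e"])
plt.show()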
Measures for distance between clusters

• Four widely used measures for distance between clusters are as follows, where |p − p′| is the distance between two objects or points p and p′, m_i is the mean for cluster C_i, and n_i is the number of objects in C_i:
• Minimum distance: dmin(C_i, C_j) = min_{p∈C_i, p′∈C_j} |p − p′|
• Maximum distance: dmax(C_i, C_j) = max_{p∈C_i, p′∈C_j} |p − p′|
• Mean distance: dmean(C_i, C_j) = |m_i − m_j|
• Average distance: davg(C_i, C_j) = (1 / (n_i n_j)) Σ_{p∈C_i} Σ_{p′∈C_j} |p − p′|

14
Measures for distance between clusters

• When an algorithm uses the minimum distance, dmin(C_i, C_j), to measure the distance between clusters, it is sometimes called a nearest-neighbor clustering algorithm
• Moreover, if the clustering process is terminated when the distance
between nearest clusters exceeds an arbitrary threshold, it is called a
single-linkage algorithm
• If we view the data points as nodes of a graph, with edges forming a path
between the nodes in a cluster, then the merging of two clusters, Ci and Cj,
corresponds to adding an edge between the nearest pair of nodes in Ci
and Cj

15
Measures for distance between clusters
• Because edges linking clusters always go between distinct clusters, the
resulting graph will generate a tree
• Thus, an agglomerative hierarchical clustering algorithm that uses the
minimum distance measure is also called a minimal spanning tree
algorithm
• When an algorithm uses the maximum distance, dmax(Ci,Cj), to measure
the distance between clusters, it is sometimes called a farthest-neighbor
clustering algorithm
• If the clustering process is terminated when the maximum distance
between nearest clusters exceeds an arbitrary threshold, it is called a
complete-linkage algorithm

16
Measures for distance between clusters

• By viewing data points as nodes of a graph, with edges linking nodes, we can think of each cluster as a complete subgraph, that is, with edges connecting all of the nodes in the clusters
• The distance between two clusters is determined by the most distant
nodes in the two clusters
• Farthest-neighbor algorithms tend to keep the increase in the diameter of the clusters at each iteration as small as possible
• If the true clusters are rather compact and approximately equal in size, the
method will produce high-quality clusters
• Otherwise, the clusters produced can be meaningless
17
Choice of measurement

• The above minimum and maximum measures represent two extremes in measuring the distance between clusters
• They tend to be overly sensitive to outliers or noisy data
• The use of mean or average distance is a compromise between the
minimum and maximum distances and overcomes the outlier sensitivity
problem
• Whereas the mean distance is the simplest to compute, the average
distance is advantageous in that it can handle categorical as well as
numeric data

18
Illustration

Representation of some definitions of inter-cluster dissimilarity: (a) group average, (b) nearest neighbor, (c) furthest neighbor

19
Illustration

Figure: some types of clusters: (a) ball-shaped, (b) elongated,
(c) compact but not well separated

20
Difficulties with hierarchical clustering

• The hierarchical clustering method, though simple, often encounters
difficulties regarding the selection of merge or split points
• Such a decision is critical because once a group of objects is merged or
split, the process at the next step will operate on the newly generated
clusters
• It will neither undo what was done previously nor perform object
swapping between clusters

21
Difficulties with hierarchical clustering

• Thus merge or split decisions, if not well chosen at some step, may lead to
low-quality clusters
• Moreover, the method does not scale well, because each decision to
merge or split requires the examination and evaluation of a good number
of objects or clusters
• One way to improve the clustering quality of hierarchical methods is to
integrate hierarchical clustering with other clustering techniques, resulting
in multiple-phase clustering

22
Partitioning Vs. Hierarchical

23
K-means versus hierarchical clustering

24
K means versus hierarchical clustering

K-means clustering:
• Non-hierarchical methods such as k-means use a pre-specified number of
clusters; the method assigns records to clusters to find mutually exclusive
clusters of roughly spherical shape based on distance
• In this case, one can use the mean or median as a cluster centre to
represent each cluster

Hierarchical clustering:
• Hierarchical methods can be either agglomerative or divisive
• Agglomerative methods begin with 'n' clusters and sequentially merge
similar clusters until a single cluster is obtained

25
K means versus hierarchical clustering
K-means clustering:
• These methods are generally less computationally intensive and are
therefore preferred with very large datasets

Hierarchical clustering:
• Divisive methods work in the opposite direction, starting with one cluster
that includes all records
• Hierarchical methods are especially useful when the goal is to arrange the
clusters into a natural hierarchy

26
K means versus hierarchical clustering
K-means clustering:
• A partitioning (k-means) clustering is simply a division of the set of data
objects into non-overlapping subsets (clusters) such that each data object
is in exactly one subset

Hierarchical clustering:
• A hierarchical clustering is a set of nested clusters that are organized as a
tree

27
K means versus hierarchical clustering
Figure: un-nested clusters vs. nested clusters

Ashok, A.R., Prabhakar, C.R. and Dyaneshwar, P.A., Comparative Study on Hierarchical and Partitioning Data Mining Methods.

28
K means versus hierarchical clustering

• Hierarchical clustering does not assume a particular value of 'k', as needed
by k-means clustering
• The generated tree may correspond to a meaningful taxonomy
• Only a distance or "proximity" matrix is needed to compute the
hierarchical clustering

(Figure: proximity matrix)

29
K means versus hierarchical clustering
K-means clustering:
• In k-means clustering, since one starts with a random choice of clusters,
the results produced by running the algorithm multiple times might differ
• K-means is found to work well when the shape of the clusters is
hyper-spherical (like a circle in 2D, a sphere in 3D)

Hierarchical clustering:
• Results are reproducible in hierarchical clustering
• Hierarchical clustering does not work as well as k-means when the shape
of the clusters is hyper-spherical

30
K means versus hierarchical clustering
K-means clustering:
• K-means clustering requires prior knowledge of K, i.e., the number of
clusters into which one wants to divide the data

Hierarchical clustering:
• In hierarchical clustering one can stop at whatever number of clusters one
finds appropriate by interpreting the dendrogram

31
K means versus hierarchical clustering

https://stepupanalytics.com/difference-between-k-means-clustering-and-hierarchical-clustering/

32
Hierarchical clustering
Advantages
• Ease of handling of any forms of similarity or distance
• Consequently, applicability to any attributes types

33
Limitations of Hierarchical Clustering

• Hierarchical clustering requires the computation and storage of an n×n
distance matrix. For very large datasets, this can be expensive and slow
• The hierarchical algorithm makes only one pass through the data. This
means that records that are allocated incorrectly early in the process
cannot be reallocated subsequently
• Hierarchical clustering also tends to have low stability. Reordering data or
dropping a few records can lead to a different solution

34
Limitations of Hierarchical Clustering
• With respect to the choice of distance
between clusters, single and complete
linkage are robust to changes in the
distance metric (e.g., Euclidean, statistical
distance) as long as the relative ordering is
kept.
• In contrast, average linkage is more
influenced by the choice of distance metric,
and might lead to completely different
clusters when the metric is changed
• Hierarchical clustering is sensitive to outliers

35
Average-linkage clustering

• Compromise between Single and Complete Link

• Strengths
– Less susceptible to noise and outliers

• Limitations
– Biased towards globular clusters

36
Distance between two clusters

• Ward’s distance between clusters Ci and Cj is the difference between the total
within cluster sum of squares for the two clusters separately, and the within
cluster sum of squares resulting from merging the two clusters in cluster Cij

Dw(Ci, Cj) = Σ x∈Ci (x − ri)² + Σ x∈Cj (x − rj)² − Σ x∈Cij (x − rij)²

• ri: centroid of Ci
• rj: centroid of Cj
• rij: centroid of Cij
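
A small NumPy sketch of this quantity (a sketch under the slide's definition; ss() is an illustrative helper for the within-cluster sum of squares):

import numpy as np

def ss(C):
    # Sum of squared distances of the points in C to their centroid
    r = C.mean(axis=0)
    return ((C - r) ** 2).sum()

def ward_distance(Ci, Cj):
    Cij = np.vstack([Ci, Cj])             # the merged cluster Cij
    # Sign convention as on this slide; many references use the opposite
    # sign, ss(Cij) - ss(Ci) - ss(Cj), i.e. the increase in within-cluster
    # sum of squares caused by the merge
    return ss(Ci) + ss(Cj) - ss(Cij)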
37
Ward’s distance for clusters

• Similar to group average and centroid distance

• Less susceptible to noise and outliers

• Biased towards globular clusters

• Hierarchical analogue of k-means


– Can be used to initialize k-means

38
Hierarchical Clustering: Comparison
Figure: comparison of the hierarchies produced on the same points by
single linkage, complete linkage, group average, and Ward's method

39
K- means clustering

Advantages:
• The centre of mass can be found efficiently by finding the mean value of
each co-ordinate
• This leads to an efficient algorithm to compute the new centroids with a
single scan of the data

Disadvantages:
• K-means has problems when clusters are of differing sizes, densities, or
non-globular shapes, and when the data contains outliers

40
Similarity

• Two most popular methods: hierarchical agglomerative clustering and k-
means clustering
• In both cases, we need to define two types of distances: the distance
between two records and the distance between two clusters
• In both cases, there is a variety of metrics that can be used

41
Thank You

42
Measures of Attribute Selection

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Measures of attribute selection using


– Information Gain
– Gain ratio
– Gini Index

2
Example

• The following Table presents a training set, D, of class-labeled tuples
randomly selected from the AllElectronics customer database

Han, J., Pei, J. and Kamber, M., 2011. Data mining: concepts and
techniques. Elsevier.

3
Example

• In this example, each attribute is discrete-valued


• Continuous-valued attributes have been generalized
• The class label attribute, buys computer, has two distinct values (namely,
{yes, no}); therefore, there are two distinct classes (that is, m = 2)
• Let class C1 correspond to ‘yes’ and class C2 correspond to ‘no’.
• There are nine tuples of class ‘yes’ and five tuples of class ‘no’.
• A (root) node N is created for the tuples in D

4
Expected information needed to classify a tuple in D

• To find the splitting criterion for these tuples, we must compute the
information gain of each attribute
• Let us consider Class: buys computer as the decision criterion for D
• Calculate information:
• Info(D) = -py log2 (py) - pn log2 (pn)
• Where py is the probability of 'yes' and pn is the probability of 'no'
• Info(D) = -(9/14) log2 (9/14) - (5/14) log2 (5/14) = 0.940 bits
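
A quick check of this quantity in Python (a minimal sketch; the function name is illustrative, and both classes are assumed non-empty):

from math import log2

def info(n_yes, n_no):
    # Expected information (entropy) for a two-class data set
    n = n_yes + n_no
    py, pn = n_yes / n, n_no / n
    return -py * log2(py) - pn * log2(pn)

print(info(9, 5))   # 0.940 bits for the 9 'yes' / 5 'no' tuples in D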

5
Calculation of entropy for 'Youth'

• Age can be:


– youth
– Middle_aged
– Senior
• Youth

Youth Class: buys computer


Yes 2
No 3

6
Calculation of entropy for 'Youth'

• Calculate Entropy for youth:
• Entropy youth = -(2/5) log2 (2/5) - (3/5) log2 (3/5) = 0.971

• Middle_aged

middle Class: buys computer


Yes 4
No 0

7
Calculation of entropy for 'Middle Age'
• Calculate Entropy for middle_aged:
• Entropy middle_aged = -(4/4) log2 (4/4) - (0/4) log2 (0/4) = 0
(taking 0 log2 0 = 0)

• For Senior

Senior Class: buys computer


Yes 3
No 2

8
Calculate Entropy for senior

Calculate Entropy for senior:
Entropy senior = -(3/5) log2 (3/5) - (2/5) log2 (2/5) = 0.971

9
The expected information needed to classify a tuple in D
according to age

The expected information needed to classify a tuple in D if the tuples are
partitioned according to age is

Info age (D) = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.694 bits
10
Calculation information Gain of Age

• Gain of Age:
• Gain(age) = Info(D) - Info age (D) = 0.940 - 0.694 = 0.246 bits
11
Calculation information Gain of Income

• Calculation of gain for income:
• Income can be:
– High
– Medium
– Low

12
Calculate Entropy for high

• High :
High Class: buys computer
Yes 2
No 2

• Calculate Entropy for high:


= -(2/4)log2(2/4) - (2/4)log2(2/4)

13
Calculate Entropy for ‘medium’

• Medium:
Medium Class: buys computer
Yes 4
No 2

• Calculate Entropy for Medium:


= -(4/6)log2(4/6) - (2/6)log2(2/6)

14
Calculate Entropy for ‘low’

• Low :
Low Class: buys computer
No 1
Yes 3

• Calculate Entropy for Low:


= -(1/4)log2(1/4) - (3/4)log2(3/4)

15
Gain of income

• The expected information needed to classify a tuple in D if the tuples are
partitioned according to income is:
• Info income (D) = (4/14) ( -(2/4)log2(2/4) - (2/4)log2(2/4)) +
(6/14) ( -(4/6)log2(4/6) - (2/6)log2(2/6)) +
(4/14) (-(1/4)log2(1/4) - (3/4)log2(3/4))
= 0.911
Gain of income : Info(D) - Info income (D)
= 0.94 – 0.911 = 0.029

16
Calculation of gain for student

• Calculation of gain for student


• Student can be:
– Yes
– No

17
Calculate Entropy for No

• No :
No Class: buys computer
Yes 3
No 4

• Calculate Entropy for No:


= -(3/7)log2(3/7) - (4/7)log2(4/7)

18
Calculate Entropy for ‘Yes’

• Yes :
Yes Class: buys computer
Yes 6
No 1

• Calculate Entropy for Yes:


= -(6/7)log2(6/7) - (1/7)log2(1/7)

19
Gain of student

• The expected information needed to classify a tuple in D if the tuples are
partitioned according to student is:
• Info Student (D) = (7/14) (-(3/7)log2(3/7) - (4/7)log2(4/7)) +
(7/14) (-(6/7)log2(6/7) - (1/7)log2(1/7))
=0.789
• Gain(student) :
Info(D) - Info student (D)
= 0.94 – 0.789 = 0.151

20
Calculation of gain for credit rating

• Calculation of gain for credit rating


• Credit rating can be:
– Fair
– Excellent

21
Calculate Entropy for Fair
• Fair :
Fair Class: buys computer
Yes 6
No 2

• Calculate Entropy for Fair:


= -(6/8)log2(6/8) - (2/8)log2(2/8)

22
Calculate Entropy for Excellent

• Excellent :
Excellent Class: buys computer
Yes 3
No 3

• Calculate Entropy for Excellent:


= -(3/6)log2(3/6) - (3/6)log2(3/6)

23
Gain for credit rating

• The expected information needed to classify a tuple in D if the tuples are
partitioned according to Credit rating is:
• Info Credit rating (D) = (8/14) (-(6/8)log2(6/8) - (2/8)log2(2/8)) +
(6/14) (-(3/6)log2(3/6) - (3/6)log2(3/6))
=0.892
• Gain for credit rating :
Info(D) - Info Credit rating (D)
= 0.94 – 0.892 = 0.048

24
Independent variable Information gain
Age 0.246
Income 0.029
Student 0.151
Credit_rating 0.048
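
A short Python sketch that reproduces this table from the per-value class counts listed on the preceding slides (the dictionary layout is illustrative):

from math import log2

def info(counts):
    # Expected information for a list of class counts, e.g. [9, 5]
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

D = [9, 5]  # 'yes' and 'no' tuples in the full training set
partitions = {
    'age':           [[2, 3], [4, 0], [3, 2]],   # youth, middle_aged, senior
    'income':        [[2, 2], [4, 2], [3, 1]],   # high, medium, low
    'student':       [[3, 4], [6, 1]],           # no, yes
    'credit_rating': [[6, 2], [3, 3]],           # fair, excellent
}
n = sum(D)
for attr, parts in partitions.items():
    info_attr = sum(sum(p) / n * info(p) for p in parts)
    print(attr, round(info(D) - info_attr, 3))
# age 0.246, income 0.029, student 0.152, credit_rating 0.048
# (student differs from the table's 0.151 only through intermediate rounding)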

25
Selection of root classifier

• Because age has the highest information gain among the attributes, it is
selected as the splitting attribute
• Node N is labelled with age, and branches are grown for each of the
attribute’s values
• The tuples are then partitioned accordingly
• Notice that the tuples falling into the partition for age = middle aged all
belong to the same class
• Because they all belong to class “yes,” a leaf should therefore be created
at the end of this branch and labelled with “yes.”

26
Decision tree

27
Decision tree

• The final decision tree returned by the algorithm is shown in Figure

28
Thank You

29
Classification and Regression Trees (CART - I)

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Introduction to Classification and Regression Trees


• Attribute selection measures – Introduction

2
Introduction

• Classification is one form of data analysis that can be used to extract
models describing important data classes or to predict future data trends
• Classification predicts categorical (discrete, unordered) labels whereas
Regression analysis is a statistical methodology that is most often used for
numeric (continuous) prediction
• For example, we can build a classification model to categorize bank loan
applications as either safe or risky
• A regression model is used to predict expenditures in dollars of potential
customers on computer equipment given their income and occupation

3
Problem Description for Illustration

Han, J., Pei, J. and Kamber, M., 2011. Data mining: concepts and
techniques. Elsevier.

4
Root Node, Internal Node, Child Node
• A decision tree uses a tree structure to represent a number of possible
decision paths and an outcome for each path
• A decision tree consists of a root node, internal nodes and leaf nodes
• The topmost node in a tree is the root node or parent node; it represents
the entire sample population
• An internal node (non-leaf node) denotes a test on an attribute; each
branch represents an outcome of the test
• A leaf node (or terminal node or child node) holds a class label; it cannot
be further split

(Figure annotations: root node or parent node, internal node, child node)

5
Decision Tree Introduction

• A decision tree for the concept buys_computer, indicating whether a


customer at All Electronics is likely to purchase a computer

• Each internal (non-leaf) node


represents a test on an attribute
• Each leaf node represents a class
(either buys_computer = yes or
buys computer = no).

Figure 1.1 : Decision Tree


6
CART Introduction

• CART is a supervised learning technique

• CART adopts a greedy (i.e., non-backtracking) approach in which decision
trees are constructed in a top-down recursive divide-and-conquer manner

• It is a very interpretable model

7
Decision Tree Algorithm
Input:
• Data partition, D, which is a set of
training tuples and their associated
class labels;
• Attribute list, the set of candidate
attributes;
• Attribute selection method, a
procedure to determine the splitting
criterion that “best” partitions the data
tuples into individual classes. This
criterion consists of a splitting attribute
and, possibly, either a split point or
splitting subset.
Output: A decision tree

8
Decision Tree Algorithm

• The algorithm is called with three parameters: D, attribute list, and
Attribute selection method
• D is defined as a data partition. Initially, it is the complete set of training
tuples and their associated class labels
• The parameter attribute list is a list of attributes or independent variables
which are describing the tuples
• Attribute selection method specifies a heuristic procedure for selecting
the attribute that “best” discriminates the given tuples according to class

9
Decision Tree Algorithm

• This procedure employs an attribute selection measure, such as
information gain, gain ratio or the Gini index.
• Whether the tree is strictly binary is generally driven by the attribute
selection measure
• Some attribute selection measures, such as the Gini index, enforce the
resulting tree to be binary. Others, like information gain, do not, therein
allowing multiway splits (i.e., two or more branches to be grown from a
node).

10
Decision Tree Method

N-Node
C- Class
D- tuples in training data set

11
Decision Tree Method step 1 to 6
• The tree starts as a single node, N,
representing the training tuples in D (step
1).
• If the tuples in D are all of the same class,
then node N becomes a leaf and is
labelled with that class (steps 2 and 3)
• Steps 4 and 5 are terminating conditions
• Otherwise, the algorithm calls Attribute
selection method to determine the
splitting criterion
• The splitting criterion (like Gini) tells us
which attribute to test at node N by
determining the “best” way to separate
or partition the tuples in D into individual
classes (step 6)

12
Decision Tree Method - Step 7 - 11
• The splitting criterion indicates the splitting
attribute and may also indicate either a
split-point or a splitting subset
• The splitting criterion is determined so
that, ideally, the resulting partitions at each
branch are as “pure” as possible. A
partition is pure if all of the tuples in it
belong to the same class.
• The node N is labelled with the splitting
criterion, which serves as a test at the node
(step 7).
• A branch is grown from node N for each of
the outcomes of the splitting criterion.
• The tuples in D are partitioned accordingly
(steps 10 to 11)

13
Three possibilities for partitioning tuples based on the
splitting criterion
• There are three possible scenarios, as illustrated in Figure (a), (b) and (c).
• Let A be the splitting attribute. A has ‘v’ distinct values,{a1,a2,...,av}, based
on the training data
• If A is discrete-valued in figure (a), then one branch is grown for each
known value of A.

Figure (a)
14
Three possibilities for partitioning tuples based on the
splitting criterion
• If A is continuous-valued in figure (b), then two branches are grown,
corresponding to A ≤ split point and A > split point.
• Where split point is the split-point returned by Attribute selection method
as part of the splitting criterion.

Figure (b)

15
Three possibilities for partitioning tuples based on the
splitting criterion
• If A is discrete-valued and a binary tree must be produced, then the test is
of the form A ∈ 𝑆𝐴 , where 𝑆𝐴 is the splitting subset for A.

Figure (c)

16
Decision Tree Method – termination condition

• The algorithm uses the same process recursively to form a decision tree
for the tuples at each resulting partition, 𝐷𝑗 , of D (step 14).

• The recursive partitioning stops only when any one of the following
terminating conditions is true:

1. All of the tuples in partition D (represented at node N) belong to the
same class (steps 2 and 3), or

17
Decision Tree Method – termination condition
2. There are no remaining attributes on which the tuples may be further
partitioned (step4).
• In this case, majority voting is employed(step 5).
• This involves converting node N into a leaf and labelling it with the most
common class in D.
• Alternatively, the class distribution of the node tuples may be stored.
3. There are no tuples for a given branch, that is, a partition Dj is empty (step
12).
• In this case, a leaf is created with the majority class in D (step 13).
• The resulting decision tree is returned (step 15).

18
Attribute Selection Measures

• Attribute selection measures are also known as splitting rules because


they determine how the tuples at a given node are to be split
• It is a heuristic approach for selecting the splitting criterion that “best”
separates a given data partition, D, of class-labeled training tuples into
individual classes
• The attribute selection measure provides a ranking for each attribute
describing the given training tuples
• The attribute having the best score for the measure is chosen as the
splitting attribute for the given tuples

19
Attribute Selection Measures
• If the splitting attribute is continuous-valued or if we are restricted to binary
trees then, respectively, either a ‘split point’ or a ‘splitting subset’ must also be
determined as part of the splitting criterion

• There are three popular attribute selection measures


– information gain,
– gain ratio, and
– Gini index

• CART algorithm uses information gain and Gini index measure for attribute
selection

20
Attribute Selection Measures

21
Information Gain

• This measure studied the value or “information content” of messages


• The attribute with the highest information gain is chosen as the splitting
attribute for node
• This attribute minimizes the information needed to classify the tuples in
the resulting partitions and reflects the least randomness or “impurity” in
these partitions
• This approach minimizes the expected number of tests needed to classify
a given tuple

22
Information Gain-Entropy Measure
• The expected information needed to classify a
tuple in D is given by

Info(D) = - Σ (i = 1 to m) pi log2 (pi)

• Where pi is the probability that an arbitrary
tuple in D belongs to class Ci and is estimated
by |Ci,D| / |D|.
• A log function to the base 2 is used, because
the information is encoded in bits
• Info(D) (or Entropy of D )is just the average
amount of information needed to identify the
class label of a tuple in D

23
Attribute Selection Measures

• It is quite likely that the partitions will be impure (e.g., where a partition
may contain a collection of tuples from different classes rather than from
a single class).
• How much more information would we still need (after the partitioning) in
order to arrive at an exact classification?
• This amount is measured by

Info A (D) = Σ (j = 1 to v) (|Dj| / |D|) × Info(Dj)

• The term |Dj| / |D| acts as the weight of the jth partition. Info A (D) is
the expected information required to classify a tuple from D based on the
partitioning by A.
24
Information Gain

• The smaller the expected information (still) required, the greater the
purity of the partitions
• Information gain is defined as the difference between the original
information requirement (i.e., based on just the proportion of classes) and
the new requirement (i.e., obtained after partitioning on A). That is,

Gain(A) = Info(D) - Info A (D)

• The attribute A with the highest information gain, Gain(A), is chosen as
the splitting attribute at node N.

25
Gini Index
• The Gini index is used to measure the
impurity of D, a data partition or set
of training tuples, as

Gini(D) = 1 - Σ (i = 1 to m) pi²

• Where pi is the probability that a tuple
in D belongs to class Ci and is
estimated by |Ci,D| / |D|.
• The sum is computed over m classes.
• The Gini index considers a binary split
for each attribute

26
Gini Index

• When considering a binary split, we compute a weighted sum of the
impurity of each resulting partition
• For example, if a binary split on A partitions D into D1 and D2, the Gini
index of D given that partitioning is

Gini A (D) = (|D1| / |D|) Gini(D1) + (|D2| / |D|) Gini(D2)
• For each attribute, each of the possible binary splits is considered


• For a discrete-valued attribute, the subset that gives the minimum Gini
index for that attribute is selected as its splitting subset

27
Gini Index
• For continuous-valued attributes, each possible split-point must be considered
• The strategy is similar where the midpoint between each pair of (sorted)
adjacent values is taken as a possible split-point.
• For a possible split-point of A, 𝐷1 is the set of tuples in D satisfying A ≤ split
point, and 𝐷2 is the set of tuples in D satisfying A > split point.
• The reduction in impurity that would be incurred by a binary split on a
discrete- or continuous-valued attribute A is

ΔGini(A) = Gini(D) - Gini A (D)
• The attribute that maximizes the reduction in impurity (or, equivalently, has
the minimum Gini index) is selected as the splitting attribute

28
Which attribute selection measure is the best?

• All measures have some bias.


• The time complexity of decision tree generally increases exponentially
with tree height
• Hence, measures that tend to produce shallower trees (e.g., with
multiway rather than binary splits, and that favour more balanced splits)
may be preferred.
• However, some studies have found that shallow trees tend to have a large
number of leaves and higher error rates
• Several comparative studies suggest that no one attribute selection
measure is significantly superior to the others.
29
Tree Pruning

• When a decision tree is built, many of the branches will reflect anomalies
in the training data due to noise or outliers
• Tree pruning use statistical measures to remove the least reliable
branches
• Pruned trees tend to be smaller and less complex and, thus, easier to
comprehend
• They are usually faster and better at correctly classifying independent test
data than unpruned trees

30
How does Tree Pruning Work?

• There are two common approaches to tree pruning: pre-pruning and post-
pruning.
• In the pre-pruning approach, a tree is “pruned” by halting its construction
early (e.g., by deciding not to further split or partition the subset of
training tuples at a given node).
• When constructing a tree, measures such as statistical significance,
information gain, Gini index can be used to assess the goodness of a split.

31
How does Tree Pruning Work?

• The post-pruning approach removes subtrees from a "fully grown" tree
• A subtree at a given node is pruned by removing its branches and
replacing it with a leaf
• The leaf is labelled with the most frequent class among the subtree being
replaced
• For example, consider the subtree at node "A3?" in the unpruned tree of
Figure 1.2
• The most common class within this subtree is "class B"
• In the pruned version of the tree, the subtree in question is pruned by
replacing it with the leaf "class B"
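
The later demos use scikit-learn, whose DecisionTreeClassifier offers a form of post-pruning through cost-complexity pruning; a hedged sketch (this is scikit-learn's mechanism, not the exact subtree-replacement procedure described above):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
# A larger ccp_alpha prunes more aggressively, collapsing weak subtrees into leaves
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)
print(full_tree.get_n_leaves(), pruned_tree.get_n_leaves())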

32
How does Tree Pruning Work?

Figure: 1.2 An unpruned decision tree and a post-pruned decision tree

33
THANK YOU

34
Attribute selection Measures in CART : II

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Attribute selection measures:


– Gain Value
– Gain ratio
– Gini Index

2
Gain Ratio

• The information gain measure is biased toward tests with many outcomes
• That is, it prefers to select attributes having a large number of values
• For example, consider an attribute that acts as a unique identifier, such as
product ID.
• A split on product ID would result in a large number of partitions (as many
as there are values), each one containing just one tuple

3
Gain Ratio

• Because each partition is pure, the information required to classify data set
D based on this partitioning would be Info product_ID (D) = 0
• Information Gain = Info(D) - Info product_ID (D), which is maximal
• Therefore, the information gained by partitioning on this attribute is
maximal
• Clearly, such a partitioning is useless for classification
• Gain ratio is an extension to information gain which attempts to
overcome this bias

4
Split information
• It applies a kind of normalization to information gain using a "split
information" value defined analogously with Info(D) as:

SplitInfo A (D) = - Σ (j = 1 to v) (|Dj| / |D|) log2 (|Dj| / |D|)

• Dj = the jth partition
• D = the data set
• This value represents the potential information generated by splitting the
training data set, D, into v partitions, corresponding to the v outcomes of a
test on attribute A

5
Gain ratio

• Gain ratio differs from information gain, which measures the information
with respect to classification that is acquired based on the same
partitioning
• The gain ratio is defined as

GainRatio(A) = Gain(A) / SplitInfo A (D)

• The attribute with the maximum gain ratio is selected as the splitting
attribute

6
Gain Ratio example

• Consider the previous example for the computation of the gain ratio for
the attribute income
• A test on income splits the
data of the following Table into
three partitions, namely low,
medium, and high, containing
four, six, and four
tuples, respectively
Han, J., Pei, J. and Kamber, M., 2011. Data mining: concepts and
techniques. Elsevier.

7
Calculate Entropy for high

• High :
High Class: buys computer
Yes 2
No 2

• Calculate Entropy for high:


= -(2/4)log2(2/4) - (2/4)log2(2/4)

8
Calculate Entropy for ‘medium’

• Medium:
Medium Class: buys computer
Yes 4
No 2

• Calculate Entropy for Medium:


= -(4/6)log2(4/6) - (2/6)log2(2/6)

9
Calculate Entropy for ‘low’

• Low :

Low Class: buys computer


Yes 3
No 1

• Calculate Entropy for Low:


= - (3/4)log2(3/4) -(1/4)log2(1/4)

10
Calculate Entropy for buying class D

• Calculate information:
• Info(D) = -py log2 (py) - pn log2 (pn)
• Where py is the probability of 'yes' and pn is the probability of 'no'
• Info(D) = -(9/14) log2 (9/14) - (5/14) log2 (5/14) = 0.940 bits

11
Gain of income

• The expected information needed to classify a tuple in D if the tuples are
partitioned according to income is:
• Info income (D) = (4/14) ( -(2/4)log2(2/4) - (2/4)log2(2/4)) +
(6/14) ( -(4/6)log2(4/6) - (2/6)log2(2/6)) +
(4/14) (-(1/4)log2(1/4) - (3/4)log2(3/4))
= 0.911 bits
Gain of income : Info(D) - Info income (D)
= 0.94 – 0.911 = 0.029

12
Gain-Ratio(income)

• Calculation of split information:

SplitInfo income (D) = -(4/14) log2 (4/14) - (6/14) log2 (6/14) - (4/14) log2 (4/14)
= 1.557

• Therefore, Gain-Ratio(income) = 0.029 / 1.557 = 0.019
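
A small Python check of this computation (a sketch; 4, 6 and 4 are the income partition sizes given earlier):

from math import log2

def split_info(sizes):
    # Potential information generated by splitting D into these partitions
    n = sum(sizes)
    return -sum(s / n * log2(s / n) for s in sizes)

si = split_info([4, 6, 4])       # low, medium, high
print(round(si, 3))              # 1.557
print(round(0.029 / si, 3))      # gain ratio for income, about 0.019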

13
Interpretation

• Further, we calculate the same for the remaining 3 criteria (age, student,
credit rating)
• The one with the maximum gain ratio value results in the maximum
reduction in impurity of the tuples in D and is returned as the splitting
criterion

14
15
Decision tree using Gini index

• Let's take the induction of a decision tree using the Gini index
• Let D be the training data of
the following table

Han, J., Pei, J. and Kamber, M., 2011. Data mining: concepts and
techniques. Elsevier.

16
Example

• In this example, each attribute is discrete-valued


• Continuous-valued attributes have been generalized
• The class label attribute, buys computer, has two distinct values (namely,
{yes, no}); therefore, there are two distinct classes (that is, m = 2)
• Let class C1 correspond to ‘yes’ and class C2 correspond to ‘no’.
• There are nine tuples of class ‘yes’ and five tuples of class ‘no’.
• A (root) node N is created for the tuples in D

17
Calculation of Gini(D)

• We first use the following Equation for Gini index to compute the impurity
of D:

Gini(D) = 1 - (9/14)² - (5/14)² = 0.459
18
Gini index for income attribute

• Let's calculate the Gini index for the income attribute


• To find the splitting criterion for the tuples in D, we need to compute the
Gini index for each attribute
• Let’s start with the attribute income and consider each of the possible
splitting subsets
• Income has three possible values, namely {low, medium, high}, then the
possible subsets are {low, medium, high}, {low, medium}, {low, high},
{medium, high}, {low}, {medium}, {high}, and {}
• Power set and empty set will not be used for splitting

19
Gini index for income attribute

• Consider the subset {low, medium}
• This would result in 10 tuples
in partition D1 satisfying the
condition “income ∈{low,
medium}”
• The remaining four tuples of D
(high) would be assigned to
partition D2

20
Tuples in partition D1

• Low + Medium:
Yes: 3 + 4 = 7
No: 1 + 2 = 3

21
Tuples in partition D2

• High :
High Class: buys computer
Yes 2
No 2

22
Gini index for income attribute

• The Gini index value computed based on this partitioning is

Gini income ∈{low, medium} (D)
= (10/14) (1 - (7/10)² - (3/10)²) + (4/14) (1 - (2/4)² - (2/4)²)
= 0.443 = Gini income ∈{high}

23
Gini index for income attribute

• Consider the subset {high, medium}
• This would result in 10 tuples
in partition D1 satisfying the
condition “income ∈{high,
medium}”
• The remaining four tuples of
D (low) would be assigned to
partition D2

24
Tuples in partition D1

• High + Medium:
Medium Class: buys computer
+ high
Yes 2+4
No 2+2

25
Tuples in partition D2

• Low :

Low Class: buys computer


No 1
Yes 3

26
Gini index for income attribute

• The Gini index value computed based on this partitioning is

Gini income ∈{high, medium} (D)
= (10/14) (1 - (6/10)² - (4/10)²) + (4/14) (1 - (1/4)² - (3/4)²)
= 0.45 = Gini income ∈{low}

27
Gini index for income attribute

• Consider the subset {high, low}
• This would result in 8 tuples
in partition D1 satisfying the
condition “income ∈{high,
low}”
• The remaining six tuples of D
(medium) would be assigned
to partition D2

28
Tuples in partition D1

• High + low:
high + Class: buys computer
low
Yes 2+3
No 2+1

29
Tuples in partition D2

• Medium:

Medium Class: buys computer


No 2
Yes 4

30
Gini index for income attribute

• The Gini index value computed based on this partitioning is

Gini income ∈{high, low} (D)
= (8/14) (1 - (5/8)² - (3/8)²) + (6/14) (1 - (2/6)² - (4/6)²)
= 0.458 = Gini income ∈{medium}

31
Gini Index values

Gini Index values


Gini income ∈{high, low} 0.458
Gini income ∈{high, medium} 0.45
Gini income ∈{medium, low} 0.443

32
Interpretation

• The best binary split for attribute income is on {medium, low} (or {high})
because it minimizes the Gini index
• The splitting subset {medium, low} therefore gives the minimum Gini index
for attribute income
• Reduction in impurity = 0.459 − 0.443 = 0.016
• Further, we calculate the same for the remaining 3 criteria (age, student,
credit rating)
• The one with the minimum Gini index value results in the maximum
reduction in impurity of the tuples in D and is returned as the splitting
criterion
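
A short Python sketch reproducing the three income splits (class counts taken from the slides; the helper names are illustrative):

def gini(counts):
    # Gini impurity for a list of class counts, e.g. [7, 3]
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gini_split(d1, d2):
    # Weighted Gini index of a binary split into partitions d1 and d2
    n = sum(d1) + sum(d2)
    return sum(d1) / n * gini(d1) + sum(d2) / n * gini(d2)

print(round(gini_split([7, 3], [2, 2]), 3))  # {low, medium} | {high}: 0.443
print(round(gini_split([6, 4], [3, 1]), 3))  # {high, medium} | {low}: 0.45
print(round(gini_split([5, 3], [4, 2]), 3))  # {high, low} | {medium}: 0.458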
33
34
Thank You

35
Classification and Regression Trees (CART – III)

Dr A. RAMESH
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

Python demo for CART model -


• Visualizing Decision Tree
• Interpretation of CART model

2
Example

Problem Description-

Han, J., Pei, J. and Kamber, M., 2011. Data mining: concepts and
techniques. Elsevier.

3
Import Relevant Libraries and Loading Data File

4
Methods used in Data Encoding

• LabelEncoder(): This method is used to normalize labels. It can also be
used to transform non-numerical labels to numerical labels.

• fit_transform(): This method is used for fitting the label encoder and
returning the encoded labels.
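
The demo's code appears only as screenshots in the original slides; a minimal sketch of the encoding step (the column values follow the AllElectronics example, the frame itself is illustrative):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'age': ['youth', 'middle_aged', 'senior', 'youth'],
    'buys_computer': ['no', 'yes', 'yes', 'no'],
})
le = LabelEncoder()
for col in df.columns:
    # fit_transform() fits the encoder and returns integer-coded labels
    df[col] = le.fit_transform(df[col])
print(df)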

5
Data Encoding Procedure

6
Data Encoding

7
Structuring Dataframe

drop(): This is used to remove rows or columns by specifying label names
and the corresponding axis, or by specifying index or column names directly.
and corresponding axis or by specifying directly index or column names.

8
Independent and Dependent Variables Selection

9
Build the Decision Tree Model without Splitting

10
Visualizing Decision Tree

11
Decision Tree Visualization

12
Interpretation of the CART Output

13
Calculation of Gini(D)

• We first use the following Equation for Gini index to compute the impurity
of D:

Gini(D) = 1 - (9/14)² - (5/14)² = 0.459
14
Income Attribute

• Low, Medium, High


• Option 1: {Low, Medium}, {High}
• Option 2 : {High, Medium}, {low}
• Option 3 : {High, Low}, {Medium}

15
Tuples in partition D1

• Low + Medium:
Yes: 3 + 4 = 7
No: 1 + 2 = 3

16
Tuples in partition D2

• High :
High Class: buys computer
Yes 2
No 2

17
Gini index for income attribute

• The Gini index value computed based on this partitioning is

Gini income ∈{low, medium} (D)
= (10/14) (1 - (7/10)² - (3/10)²) + (4/14) (1 - (2/4)² - (2/4)²)
= 0.443 = Gini income ∈{high}

18
Gini index for income attribute

• The Gini index value computed based on this partitioning is

Gini income ∈{high, medium} (D)
= (10/14) (1 - (6/10)² - (4/10)²) + (4/14) (1 - (3/4)² - (1/4)²)
= 0.45 = Gini income ∈{low}

19
Gini index for income attribute

• The Gini index value computed based on this partitioning is

Gini income ∈{high, low} (D)
= (8/14) (1 - (5/8)² - (3/8)²) + (6/14) (1 - (2/6)² - (4/6)²)
= 0.458 = Gini income ∈{medium}

20
Gini index for income attribute
• Gini income ∈{low, medium}
= 0.443 = Gini income ∈{high}
• Gini income ∈{high, medium}
= 0.45 = Gini income ∈{low}
• Gini income ∈{high, low}
= 0.458 = Gini income ∈{medium}

21
Gini index for Age attribute

• The Gini index value computed based on this partitioning is


Gini Age ∈{Youth, middle_aged}
= 0.457 = Gini Age ∈{senior}
Gini Age ∈{Youth, Senior}
= 0.357 = Gini Age∈{middle_aged}
Gini Age ∈{senior, middle_aged}
= 0.393 = Gini Age ∈{Youth}

22
Gini index for student attribute

• The Gini index value computed based on this partitioning is

Gini student ∈{Yes, No}
= (7/14) (1 - (6/7)² - (1/7)²) + (7/14) (1 - (3/7)² - (4/7)²)
= 0.367

23
Gini index for credit_rating attribute

• The Gini index value computed based on this partitioning is

Gini credit_rating ∈{fair, excellent}
= (8/14) (1 - (6/8)² - (2/8)²) + (6/14) (1 - (3/6)² - (3/6)²)
= 0.428

24
Choosing the root node
The attribute with the minimum Gini score will be taken, i.e., Age
(Gini Age ∈{Youth, Senior} = 0.357 = Gini Age ∈{middle_aged})

Attribute Gini score
Age 0.357
Income 0.443
Student 0.367
Credit_rating 0.428

(Figure: partial tree with root Age and branches 'Youth, senior' and 'Middle age')

25
Gini index for different attributes for sample of 10
• After separating the 4 samples belonging to middle age, 10 samples remain:

26
Gini index for different attributes for sample of 10

• Gini(D) = 1 - (5/10)² - (5/10)² = 0.5
• Gini Age = 0.48
• Gini Credit Rating = 0.41
• Gini Student = 0.32
• Gini Income = 0.375
• Take student as the next node as it has the minimum Gini score

27
Drawing CART

(Figure: partial tree. Root: Age, with branches 'Middle age' and 'Youth, senior';
the 'Youth, senior' branch leads to a Student node whose 'yes' and 'no'
branches are still to be resolved (???))

28
For branch Student = No
• Omit the marked rows (data entries) belonging to either Age =
middle_aged or Student = Yes
• A total of 5 rows remain

29
Gini index for different attributes For branch Student = No

• Gini(D) = 1 - (4/5)² - (1/5)² = 0.32
• Gini Age = 0.2
• Gini Credit Rating = 0.267
• Gini Student = 0.32
• Gini Income = 0.267
• Take age as the next node as it has the minimum Gini score

30
Drawing CART

(Figure: partial tree. Root: Age; 'Youth, senior' leads to Student; the
'Student = no' branch now splits further on Age, whose branches are still
to be resolved (???), and the 'Student = yes' branch is still unresolved (???))
31
For branch Student = Yes
• Omit the marked rows (data entries) belonging to either Age =
middle_aged or Student = No
• A total of 5 rows remain

32
Gini index for different attributes for branch Student = Yes

• Gini(D) = 1 - (4/5)² - (1/5)² = 0.32
• Gini Age = 0.267
• Gini Credit Rating = 0.2
• Gini Student = 0.32
• Gini Income = 0.267
• Take credit rating as the next node as it has the minimum Gini score

33
Drawing CART

(Figure: partial tree. Root: Age; 'Youth, senior' leads to Student;
'Student = yes' splits on Credit_rating and 'Student = no' splits on Age;
the leaf labels are still to be resolved (???))

34
Coding scheme
Age: Youth = 2, Middle Age = 0, Senior = 1
Student: Yes = 1, No = 0
Income: High = 0, Low = 1, Medium = 2
Credit rating: Fair = 1, Excellent = 0
Buys computer (class): Yes = 1, No = 0
35
Decision tree

• Repeat the splitting process until we obtain all the leaf nodes; the final
output is shown in the figure

(Figure: final decision tree; each node is annotated with the values of the
dependent variable, the numbers of 'yes' and 'no', and the sample size;
the splits involve Age {Youth, Senior | Middle_age}, Student,
Credit_rating {Excellent | Fair}, and Income {High, Low | Medium})

36
Splitting Dataset

• train_test_split(): This method is used for splitting a dataset into training
and testing data subsets.

37
Build the Decision Tree Model
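
This step, too, is shown as screenshots in the original slides; a minimal scikit-learn sketch of building, evaluating and visualizing the model (the data here is a random stand-in for the encoded AllElectronics attributes):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(100, 4))   # toy encoded attribute columns
y = rng.integers(0, 2, size=100)        # toy encoded class labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
model = DecisionTreeClassifier(criterion='gini')  # CART-style splitting
model.fit(X_train, y_train)
print('Accuracy:', accuracy_score(y_test, model.predict(X_test)))

plot_tree(model, filled=True)   # visualize the fitted tree
plt.show()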

38
Evaluating the Model

39
Visualizing Decision Tree

40
Decision Tree Visualization

41
Thank You

42
Hierarchical method of clustering- II

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Agglomerative hierarchical algorithm


• Python demo

2
Example for Hierarchical Agglomerative Clustering (HAC)
• A data set consisting of seven objects for which two variables were
measured.
Object Variable 1 Variable 2
1 2.00 2.00
2 5.50 4.00
3 5.00 5.00
4 1.50 2.50
5 1.00 1.00
6 7.00 5.00
7 5.75 6.50
3
Scatter plot

4
Example for HAC

• Calculate the Euclidean distance and create the distance matrix.

Distance((x1, y1), (x2, y2)) = √((x2 − x1)² + (y2 − y1)²)

Distance(1,2) = √((5.50 − 2.00)² + (4.00 − 2.00)²) = 4.03
Distance(1,3) = √((5.00 − 2.00)² + (5.00 − 2.00)²) = 4.24
Distance(1,4) = √((1.50 − 2.00)² + (2.50 − 2.00)²) = 0.71

5
Example for HAC

Distance(1,5) = √((1.00 − 2.00)² + (1.00 − 2.00)²) = 1.41
Distance(1,6) = √((7.00 − 2.00)² + (5.00 − 2.00)²) = 5.83
Distance(1,7) = √((5.75 − 2.00)² + (6.50 − 2.00)²) = 5.86

6
Example for HAC

Distance(2,3) = √((5.00 − 5.50)² + (5.00 − 4.00)²) = 1.12
Distance(2,4) = √((1.50 − 5.50)² + (2.50 − 4.00)²) = 4.27
Distance(2,5) = √((1.00 − 5.50)² + (1.00 − 4.00)²) = 5.41
Distance(2,6) = √((7.00 − 5.50)² + (5.00 − 4.00)²) = 1.80

7
Example for HAC

Distance(2,7) = √((5.75 − 5.50)² + (6.50 − 4.00)²) = 2.51
Distance(3,4) = √((1.50 − 5.00)² + (2.50 − 5.00)²) = 4.30
Distance(3,5) = √((1.00 − 5.00)² + (1.00 − 5.00)²) = 5.66
Distance(3,6) = √((7.00 − 5.00)² + (5.00 − 5.00)²) = 2.00

8
Example for HAC

Distance(3,7) = √((5.75 − 5.00)² + (6.50 − 5.00)²) = 1.68
Distance(4,5) = √((1.00 − 1.50)² + (1.00 − 2.50)²) = 1.58
Distance(4,6) = √((7.00 − 1.50)² + (5.00 − 2.50)²) = 6.04
Distance(4,7) = √((5.75 − 1.50)² + (6.50 − 2.50)²) = 5.84

9
Example for HAC

Distance(5,6) = √((7.00 − 1.00)² + (5.00 − 1.00)²) = 7.21
Distance(5,7) = √((5.75 − 1.00)² + (6.50 − 1.00)²) = 7.27
Distance(6,7) = √((5.75 − 7.00)² + (6.50 − 5.00)²) = 1.95

10
Distance Matrix
• The distance matrix is-
1 2 3 4 5 6 7
1 0.0
2 4.0 0.0
3 4.2 1.1 0.0
4 0.7 4.3 4.3 0.0
5 1.4 5.4 5.7 1.6 0.0
6 5.8 1.8 2.0 6.0 7.2 0.0
7 5.9 2.5 1.7 5.8 7.3 2.0 0.0

11
Example for HAC
• Select the minimum element to build the first cluster:
1 2 3 4 5 6 7
1 0.0
2 4.0 0.0
3 4.2 1.1 0.0
4 0.7 4.3 4.3 0.0
5 1.4 5.4 5.7 1.6 0.0
6 5.8 1.8 2.0 6.0 7.2 0.0
7 5.9 2.5 1.7 5.8 7.3 2.0 0.0

12
Example for HAC

13
Example for HAC

• Recalculate distances to update the distance matrix

- MIN[dist(1,4), 2] = MIN(dist(1,2), dist(4,2)) = MIN(4.0, 4.3) = 4.0
- MIN[dist(1,4), 3] = MIN(dist(1,3), dist(4,3)) = MIN(4.2, 4.3) = 4.2
- MIN[dist(1,4), 5] = MIN(dist(1,5), dist(4,5)) = MIN(1.4, 1.6) = 1.4
- MIN[dist(1,4), 6] = MIN(dist(1,6), dist(4,6)) = MIN(5.8, 6.0) = 5.8
- MIN[dist(1,4), 7] = MIN(dist(1,7), dist(4,7)) = MIN(5.9, 5.8) = 5.8

14
Example for HAC

• Updated distance matrix for the cluster (1, 4)


1,4 2 3 5 6 7
1,4 0.0
2 4.0 0.0
3 4.2 1.1 0.0
5 1.4 5.4 5.7 0.0
6 5.8 1.8 2.0 7.2 0.0
7 5.8 2.5 1.7 7.3 2.0 0.0

15
Example for HAC

• Select the minimum element to build the next cluster:


1,4 2 3 5 6 7
1,4 0.0
2 4.0 0.0
3 4.2 1.1 0.0
5 1.4 5.4 5.7 0.0
6 5.8 1.8 2.0 7.2 0.0
7 5.8 2.5 1.7 7.3 2.0 0.0

16
Example for HAC

17
Example for HAC
• Recalculate distances to update the distance matrix

- MIN[dist(2,3), (1,4)] = MIN(dist(2,(1,4)), dist(3,(1,4))) = MIN(4.0, 4.2) = 4.0
- MIN[dist(2,3), 5] = MIN(dist(2,5), dist(3,5)) = MIN(5.4, 5.7) = 5.4
- MIN[dist(2,3), 6] = MIN(dist(2,6), dist(3,6)) = MIN(1.8, 2.0) = 1.8
- MIN[dist(2,3), 7] = MIN(dist(2,7), dist(3,7)) = MIN(2.5, 1.7) = 1.7

18
Example for HAC

• Updated distance matrix for the cluster (2, 3)


1,4 2,3 5 6 7
1,4 0.0
2,3 4.0 0.0
5 1.4 5.4 0.0
6 5.8 1.8 7.2 0.0
7 5.8 1.7 7.3 2.0 0.0

19
Example for HAC

• Select the minimum element to build the next cluster:


1,4 2,3 5 6 7
1,4 0.0
2,3 4.0 0.0
5 1.4 5.4 0.0
6 5.8 1.8 7.2 0.0
7 5.8 1.7 7.3 2.0 0.0

20
Example for HAC

21
Example for HAC

• Recalculate distances to update the distance matrix

- MIN[dist((1,4),5), (2,3)] = MIN(dist((1,4),(2,3)), dist(5,(2,3))) = MIN(4.0, 5.4) = 4.0
- MIN[dist((1,4),5), 6] = MIN(dist((1,4),6), dist(5,6)) = MIN(5.8, 7.2) = 5.8
- MIN[dist((1,4),5), 7] = MIN(dist((1,4),7), dist(5,7)) = MIN(5.8, 7.3) = 5.8

22
Example for HAC

• Updated distance matrix for the cluster ((1,4), 5)

1,4,5 2,3 6 7
1,4,5 0.0
2,3 4.0 0.0
6 5.8 1.8 0.0
7 5.8 1.7 2.0 0.0

23
Example for HAC

• Select the minimum element to build the next cluster:

1,4,5 2,3 6 7
1,4,5 0.0
2,3 4.0 0.0
6 5.8 1.8 0.0
7 5.8 1.7 2.0 0.0

24
Example for HAC

25
Example for HAC

• Recalculate distances to update the distance matrix

- MIN[dist((2,3),7), (1,4,5)] = MIN(dist((2,3),(1,4,5)), dist(7,(1,4,5))) = MIN(4.0, 5.8) = 4.0
- MIN[dist((2,3),7), 6] = MIN(dist((2,3),6), dist(7,6)) = MIN(1.8, 2.0) = 1.8

26
Example for HAC

• Updated distance matrix for the cluster ((2,3), 7)

1,4,5 2,3,7 6
1,4,5 0.0
2,3,7 4.0 0.0
6 5.8 1.8 0.0

27
Example for HAC

• Select the minimum element to build the next cluster:

1,4,5 2,3,7 6
1,4,5 0.0
2,3,7 4.0 0.0
6 5.8 1.8 0.0

28
Example for HAC

29
Example for HAC

• Recalculate distances to update the distance matrix

- MIN[dist((2,3,7),6), (1,4,5)] = MIN(dist((2,3,7),(1,4,5)), dist(6,(1,4,5))) = MIN(4.0, 5.8) = 4.0

30
Example for HAC

• Updated distance matrix for the cluster ((2,3,7), 6)

1,4,5 2,3,7,6

1,4,5 0.0

2,3,7,6 4.0 0.0

31
Example for HAC

32
Python demo for HAC
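
The demo itself appears as screenshots in the original slides; a minimal sketch of the same single-linkage HAC in Python (a sketch, assuming SciPy and matplotlib are available):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# The seven objects (Variable 1, Variable 2) from the worked example
X = np.array([[2.00, 2.00], [5.50, 4.00], [5.00, 5.00], [1.50, 2.50],
              [1.00, 1.00], [7.00, 5.00], [5.75, 6.50]])

Z = linkage(X, method='single')   # single linkage = minimum distance
dendrogram(Z, labels=['1', '2', '3', '4', '5', '6', '7'])
plt.show()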

33
Python demo for HAC

34
Python demo for HAC

35
Python demo for HAC

36
Python demo for HAC

37
THANK YOU

38
