
Module 1 Statistical Inference

This document outlines the course content and structure for a statistical inference course taught by Dr. Basheer Ahmad Samim. The course will cover topics such as descriptive statistics, probability distributions, sampling theory, confidence intervals, hypothesis testing, regression, and more. Students are expected to attend at least 80% of lectures and will be assessed through quizzes, assignments, exams, and class participation. SPSS software will be used for workshops and analysis. Recommended textbooks are also provided.

Uploaded by

Rahim Gangwani
Copyright
© Attribution Non-Commercial (BY-NC)

Statistical Inference
Dr. Basheer Ahmad Samim

06:13 PM 1
Course Outline
1. Review of Descriptive Statistics and SPSS
2. Random Variable and Mathematical Expectation
3. Discrete Probability Distributions (Binomial, Poisson)
4. Continuous Probability Distribution (Normal)
5. Sampling Theory
6. Confidence Intervals
7. Hypothesis Testing
8. Goodness of Fit
9. Regression and Correlation with ANOVA
10. Multiple Regression
11. All the topics will be SPSS oriented

Recommended Readings (Books)
Introduction to Statistics, Walpole, R. E., 3rd Edition (2000)
Statistical Methods for Practice and Research by Ajai S. Gaur and Sanjaya S. Gaur

Attendance Policy
16-Weeks Teaching
16-Lectures (32-Attendance)
Roll call twice: once before the break and once after the break
At least 80% (24) attendance is compulsory to be eligible for the Final Examination
No Roll Call after First Ten(5) minutes

Mode of Teaching
Lecture
SPSS Workshop
Discussion Session

Mode of Assessment
Quizzes (15%)
Assignments (15%)
Class Performance (5%)
Mid Term Test (25%)
Final Examination (40%)

Questionnaire

Variable
A characteristic or property that varies from individual to individual.

Constant
A characteristic or property that does not change from individual to individual.

Types of Variables
Qualitative
Quantitative
  Discrete
  Continuous

Nominal Scale
Variable categories are mutually
exclusive and exhaustive.
Variable categories have no
logical order.
Examples: Eye Color, Hair Color, Gender.

Ordinal Scale
Data categories are mutually
exclusive and exhaustive.
Data classifications are ranked or
ordered according to the
particular trait they possess.
Example: Level of Knowledge about SPSS

Interval Scale
• Data categories are mutually exclusive and exhaustive.
• Data classifications are ranked or ordered according to the particular trait they possess.
• Equal differences in the characteristic are represented by equal differences in the measurements.
Examples: Temperature, Shoe Size and IQ scores

Ratio Scale
• Data categories are mutually exclusive and exhaustive.
• Data classifications are ranked or ordered according to the particular trait they possess.
• Equal differences in the characteristic are represented by equal differences in the measurements.
• The zero point is the absence of the characteristic.
Examples: Height, Weight, Distance.

Measurement Scales

Data
The information collected
for any kind of investigation.
Usually Numerical but can
be Qualitative.

Primary Data
The initial material collected
during the research process.
The information collected
directly from the respondent.
Methods: Personal Investigation, Through Investigator, Through Questionnaire, Through Local Sources, Through Telephone

Secondary Data
The information collected and processed by people other than the researcher.
Sources include Government Organizations and Semi-Government Organizations.

Data Collection
Any of the following methods may be
adopted:
(a) Personal interview
(b) Direct observation
(c) Mail interview (internet interview)
(d) Telephone interview
What are the pros and cons of each?

Data management
Office Editing,
Post Coding,
Data entry and Verification.

Data Organization and Analysis
• Preparing data for analysis,
• Extracting descriptive measures from the data,
• Using advanced statistical techniques to analyze the data and draw inferences from it.

Measures of Central Tendency

Arithmetic Mean
Quantiles
(Median, Quartiles, Deciles, Percentiles)
Mode

Arithmetic Mean
A value obtained by dividing the sum of all the observations by their number.

$$\text{Arithmetic Mean} = \frac{\text{Sum of all the observations}}{\text{Number of the observations}}$$

If X1, X2, …, Xn are n observations of a variable X, then

$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n} = \frac{\sum_{i=1}^{n} X_i}{n}$$

Arithmetic Mean
The marks obtained by 8 students are:

67 72 68 70 65 68 75 63
$$\bar{X} = \frac{67 + 72 + \cdots + 63}{8} = \frac{548}{8} = 68.5 \text{ Marks}$$
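The calculation above can be checked with a few lines of Python. This is only a minimal sketch; the function name `arithmetic_mean` is chosen here for illustration and is not part of the course material.

```python
# Marks obtained by the 8 students, as listed on the slide.
marks = [67, 72, 68, 70, 65, 68, 75, 63]


def arithmetic_mean(values):
    """Return the sum of the observations divided by their number."""
    return sum(values) / len(values)


mean_marks = arithmetic_mean(marks)
print(sum(marks), mean_marks)  # 548 68.5
```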

Quantiles
For individual observations or a discrete frequency distribution, the ith quartile, jth decile and kth percentile are located in the array (or discrete frequency distribution) by the following relations:

$$Q_i = \frac{i(n+1)}{4}\text{th observation in the distribution}, \quad i = 1, 2, 3$$

$$D_j = \frac{j(n+1)}{10}\text{th observation in the distribution}, \quad j = 1, 2, \ldots, 9$$

$$P_k = \frac{k(n+1)}{100}\text{th observation in the distribution}, \quad k = 1, 2, \ldots, 99$$

Quartiles
The weekly TV Watching times (Hours):
25 41 27 32 43 66 35 31 15 5
34 26 32 37 16 30 38 30 20 21

The array of the above data is given below:

5 15 16 20 21 25 26 27 30 30
31 32 32 34 35 37 38 41 43 66

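Quartiles
$$\begin{aligned}
Q_1 &= \frac{1(20+1)}{4}\text{th observation in the distribution} \\
    &= 5.25\text{th observation in the distribution} \\
    &= 5\text{th obs.} + 0.25\{6\text{th obs.} - 5\text{th obs.}\} \\
    &= 21 + 0.25\{25 - 21\} = 22.0 \text{ Hours}
\end{aligned}$$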

Quartiles
$$\begin{aligned}
Q_2 &= \frac{2(20+1)}{4}\text{th observation in the distribution} \\
    &= 10.50\text{th observation in the distribution} \\
    &= 10\text{th obs.} + 0.50\{11\text{th obs.} - 10\text{th obs.}\} \\
    &= 30 + 0.50\{31 - 30\} = 30.5 \text{ Hours}
\end{aligned}$$

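The (n + 1) location rule with linear interpolation can be sketched in Python as follows. The function name `quantile_by_position` and its arguments are illustrative assumptions, and the data are the 20 ordered TV watching times from the array on the slide. Statistical packages may use other interpolation rules, so their quartiles can differ slightly from these.

```python
# Ordered array of the 20 weekly TV watching times (hours) from the slide.
tv_hours = [5, 15, 16, 20, 21, 25, 26, 27, 30, 30,
            31, 32, 32, 34, 35, 37, 38, 41, 43, 66]


def quantile_by_position(data, part, parts):
    """Quantile located at the part*(n+1)/parts-th ordered observation.

    part=1, parts=4 gives Q1; part=3, parts=10 gives D3; and so on.
    Fractional positions are resolved by linear interpolation.
    """
    ordered = sorted(data)
    n = len(ordered)
    position = part * (n + 1) / parts   # 1-based position, may be fractional
    whole = int(position)               # e.g. 5 for position 5.25
    fraction = position - whole         # e.g. 0.25 for position 5.25
    if whole < 1:                       # position before the first value
        return float(ordered[0])
    if whole >= n:                      # position at or beyond the last value
        return float(ordered[-1])
    lower = ordered[whole - 1]
    upper = ordered[whole]
    return lower + fraction * (upper - lower)


q1 = quantile_by_position(tv_hours, 1, 4)
q2 = quantile_by_position(tv_hours, 2, 4)
q3 = quantile_by_position(tv_hours, 3, 4)
print(q1, q2, q3)  # 22.0 30.5 36.5
```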
Quantiles

Mode
The mode is the value that occurs most frequently in a set of data, that is, the value that occurs the maximum number of times in a sequence of observations.

Mode
The total automobile sales (in millions) in
the United States for the last 14 years.

9.0 8.2 8.0 9.1 10.3 11.0 11.5


10.3 10.5 9.8 9.3 8.2 8.2 8.5

Mode = 8.2 million
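As a quick check of this example, the mode can be found by counting how often each value occurs. The sketch below uses a hypothetical helper named `modes`, which returns every value tied for the highest frequency, since a data set may have more than one mode.

```python
from collections import Counter

# Total automobile sales (in millions) for the 14 years on the slide.
sales = [9.0, 8.2, 8.0, 9.1, 10.3, 11.0, 11.5,
         10.3, 10.5, 9.8, 9.3, 8.2, 8.2, 8.5]


def modes(values):
    """Return all values that occur the maximum number of times."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]


print(modes(sales))  # [8.2]
```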

Measures of variation quantify how much the values of a data set differ from one another; they are measures of the spread of the values in the data.

Absolute Measures of Dispersion
• Range
• Quartile Deviation
• Mean (Average) Deviation
• Variance and Standard Deviation

Relative Measures of Dispersion
• Coefficient of Range
• Coefficient of Quartile Deviation
• Coefficient of Mean Deviation
• Coefficient of Variation (CV)

Range
Difference between the largest
and the smallest observations

$$\text{Range} = X_{\text{Largest}} - X_{\text{Smallest}}$$

Disadvantages of the Range
Ignores the way in which data are distributed

[Figure: two dot plots on the axis 7 to 12 with differently distributed points; both have Range = 12 - 7 = 5]

Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
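The sensitivity of the range to a single extreme value is easy to reproduce. The sketch below rebuilds the two lists from the slide; the helper name `value_range` is an assumption made for illustration.

```python
def value_range(values):
    """Largest observation minus smallest observation."""
    return max(values) - min(values)


# The two data sets from the slide differ only in their last value.
without_outlier = [1] * 11 + [2] * 8 + [3] * 4 + [4, 5]
with_outlier = [1] * 11 + [2] * 8 + [3] * 4 + [4, 120]

print(value_range(without_outlier))  # 4
print(value_range(with_outlier))     # 119
```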

Inter-quartile Range (IQR)

Inter-quartile range = 3rd quartile - 1st quartile = Q3 - Q1

IQR is not affected by outliers, because it depends only on the middle 50% of the data.

Inter-quartile Range
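[Figure: number line divided at the quartiles; each of the four segments holds 25% of the observations]
X minimum = 12, Q1 = 30, Median (Q2) = 45, Q3 = 57, X maximum = 70

Inter-quartile Range (IQR) = 57 - 30 = 27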

The Mean (absolute) Deviation
Mean Deviation is the average of the absolute deviations taken from the mean value.

X        (X - X̄)    |X - X̄|
8           3          3
5           0          0
2          -3          3
Total       0          6

$$MD = \frac{\sum |x - \bar{x}|}{n} = \frac{6}{3} = 2$$
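A minimal Python sketch of the mean deviation for the three values above (8, 5, 2); the function name `mean_deviation` is illustrative only.

```python
def mean_deviation(values):
    """Average of the absolute deviations taken from the arithmetic mean."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)


data = [8, 5, 2]
md = mean_deviation(data)
print(md)  # 2.0
```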

Variance
Variance is the average of the squared deviations taken from the mean value.

X (cm)    (X - Mean)^2    X^2
4             36           16
6             16           36
9              1           81
12             4          144
13             9          169
16            36          256
Total 60     102          702

$$\text{(i)}\quad S^2 = \frac{\sum (x - \bar{x})^2}{n} = \frac{102}{6} = 17 \text{ cm}^2$$

$$\text{(ii)}\quad S^2 = \frac{\sum X^2}{n} - \left(\frac{\sum X}{n}\right)^2 = \frac{702}{6} - \left(\frac{60}{6}\right)^2 = 117 - 100 = 17 \text{ cm}^2$$

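Both variance formulas can be verified on the six measurements from the table (4, 6, 9, 12, 13, 16 cm). The sketch below divides by n, as the slide does, so it is the population form of the variance; the function names are assumptions made for illustration.

```python
heights_cm = [4, 6, 9, 12, 13, 16]


def variance_definition(values):
    """Formula (i): average of squared deviations from the mean (divisor n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def variance_shortcut(values):
    """Formula (ii): mean of the squares minus the square of the mean."""
    n = len(values)
    return sum(x ** 2 for x in values) / n - (sum(values) / n) ** 2


var_i = variance_definition(heights_cm)
var_ii = variance_shortcut(heights_cm)
print(var_i, var_ii)  # 17.0 17.0
```

The standard deviation is the square root of either result, about 4.12 cm here.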
Comparing Standard Deviations
[Figure: three dot plots drawn on the same axis from 11 to 21]
Data A: Mean = 15.5, S = 3.338
Data B: Mean = 15.5, S = 0.926
Data C: Mean = 15.5, S = 4.567
• The smaller the standard deviation, the more tightly clustered the scores are around the mean.
• The larger the standard deviation, the more spread out the scores are from the mean.

Relative Measures of Variation
$$\text{Coefficient of Range} = \frac{X_{\text{Largest}} - X_{\text{Smallest}}}{X_{\text{Largest}} + X_{\text{Smallest}}}$$

$$\text{Coefficient of Quartile Deviation} = \frac{Q_3 - Q_1}{Q_3 + Q_1}$$

$$\text{Coefficient of Mean Deviation} = \frac{MD}{\text{Mean}}$$

Coefficient of Variation (CV)
$$CV = \left(\frac{S}{\bar{X}}\right) \times 100\%$$

Can be used to compare two or more sets of data measured in different units, or in the same units but with different average size.

Use of Coefficient of Variation
Stock A:
Average price last year = $50
Standard deviation = $5
CV_A = (S / X̄) × 100% = ($5 / $50) × 100% = 10%

Stock B:
Average price last year = $100
Standard deviation = $5
CV_B = (S / X̄) × 100% = ($5 / $100) × 100% = 5%

Both stocks have the same standard deviation, but stock B is less variable relative to its price.

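The stock comparison can be reproduced with a short function. This is a sketch only; `coefficient_of_variation` is a name chosen here, and the inputs are the average prices and standard deviations quoted on the slide.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return 100.0 * std_dev / mean


cv_a = coefficient_of_variation(std_dev=5.0, mean=50.0)    # Stock A
cv_b = coefficient_of_variation(std_dev=5.0, mean=100.0)   # Stock B
print(cv_a, cv_b)  # 10.0 5.0
```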
Appropriate Choice of Measure
of Variability
If data are symmetric, with no serious
outliers, use range and standard
deviation.
If data are skewed, and/or have serious
outliers, use IQR.
If comparing variation across two data sets, use the coefficient of variation (CV).

Five Number Summary
The five number summary of a data set consists of the
minimum value, the first quartile, the second quartile, the
third quartile and the maximum value written in that order:
Min, Q1, Q2, Q3, Max.

From the three quartiles we can obtain a measure of central


tendency (the median, Q2) and measures of variation of the
two middle quarters of the distribution, Q2-Q1 for the
second quarter and Q3-Q2 for the third quarter.

Five Number Summary
The weekly TV viewing times (in hours).

25 41 27 32 43 66 35 31 15 5
34 26 32 37 16 30 38 30 20 21

The array of the above data is given below:

5 15 16 20 21 25 26 27 30 30
31 32 32 34 35 37 38 41 43 66

Five Number Summary
Location of Q1: $\frac{1(20+1)}{4}$th obs. in the data = 5.25th obs.
Value of Q1: 5th obs. + 0.25{6th obs. - 5th obs.} = 21 + 0.25{25 - 21} = 22.0 Hrs

Location of Q2: $\frac{2(20+1)}{4}$th obs. in the data = 10.50th obs.
Value of Q2: 10th obs. + 0.50{11th obs. - 10th obs.} = 30 + 0.50{31 - 30} = 30.5 Hrs

Location of Q3: $\frac{3(20+1)}{4}$th obs. in the data = 15.75th obs.
Value of Q3: 15th obs. + 0.75{16th obs. - 15th obs.} = 35 + 0.75{37 - 35} = 36.5 Hrs

Minimum value = 5.0, Maximum value = 66.0


Box and Whisker Diagram
A box and whisker diagram, or box-plot, is a graphical means of displaying the five number summary of a set of data. In a box-plot the first quartile is placed at the lower hinge and the third quartile is placed at the upper hinge. The median is placed in between these two hinges. The two lines emanating from the box are called whiskers. The box and whisker diagram was introduced by Professor John W. Tukey.

Construction of Box-Plot
1. Start the box from Q1 and end at Q3
2. Within the box draw a line to represent Q2
3. Draw the lower whisker from Min. Value up to Q1
4. Draw the upper whisker from Q3 up to Max. Value
[Figure: vertical box plot labelled, from bottom to top, Min Value, Q1, Q2, Q3, Max Value]

Construction of Box-Plot
1. Q1 = 22.0, Q3 = 36.5
2. Q2 = 30.5
3. Minimum Value = 5.0
4. Maximum Value = 66.0
[Figure: box plot of the TV viewing data on a vertical axis from 0 to 70]

Interpretation of Box-Plot
Box-Whisker Plot is useful to identify
• Maximum and Minimum Values in the data
• Median of the data
• IQR = Q3 - Q1; a lengthy box indicates more variability in the data
• Shape of the data, from the position of the line within the box
  Line at the center of the box: Symmetrical
  Line above the center of the box: Negatively skewed
  Line below the center of the box: Positively skewed
• Detection of Outliers in the data
[Figure: box plot on a vertical axis from 0 to 70]

Outliers
An outlier is a value that falls well outside the overall pattern of the data. It might be
• the result of a measurement or recording error,
• a member from a different population,
• simply an unusual extreme value.

An extreme value need not be an outlier; it might, instead, be an indication of skewness.

Inner and Outer Fences
If Q1 = 22.0, Q2 = 30.5 and Q3 = 36.5, then IQR = Q3 - Q1 = 14.5.

Inner Fences:
$$\text{Lower Inner Fence} = Q_1 - 1.5(\text{IQR}) = 0.25$$
$$\text{Upper Inner Fence} = Q_3 + 1.5(\text{IQR}) = 58.25$$

Outer Fences:
$$\text{Lower Outer Fence} = Q_1 - 3(\text{IQR}) = -21.5$$
$$\text{Upper Outer Fence} = Q_3 + 3(\text{IQR}) = 80.0$$
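The fence calculation, and the usual way the fences are used to flag outliers, can be sketched in Python. The function names are assumptions made for illustration; the quartiles are the values quoted above, and the data are the ordered TV viewing times from the earlier slides. Values between an inner and an outer fence are treated as mild (suspected) outliers and values beyond an outer fence as sure outliers.

```python
def fences(q1, q3):
    """Return (lower_outer, lower_inner, upper_inner, upper_outer)."""
    iqr = q3 - q1
    return (q1 - 3.0 * iqr, q1 - 1.5 * iqr, q3 + 1.5 * iqr, q3 + 3.0 * iqr)


def classify_outliers(data, q1, q3):
    """Split the data into mild outliers and sure outliers."""
    lower_outer, lower_inner, upper_inner, upper_outer = fences(q1, q3)
    mild, sure = [], []
    for x in data:
        if x < lower_outer or x > upper_outer:
            sure.append(x)          # outside the outer fences
        elif x < lower_inner or x > upper_inner:
            mild.append(x)          # between inner and outer fences
    return mild, sure


tv_hours = [5, 15, 16, 20, 21, 25, 26, 27, 30, 30,
            31, 32, 32, 34, 35, 37, 38, 41, 43, 66]

print(fences(22.0, 36.5))                        # (-21.5, 0.25, 58.25, 80.0)
print(classify_outliers(tv_hours, 22.0, 36.5))   # ([66], [])
```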

Identification of the Outliers
1. The values that lie within the inner fences are normal values.
2. The values that lie outside the inner fences but inside the outer fences are possible/suspected/mild outliers.
3. The values that lie outside the outer fences are sure outliers.

Plot each suspected outlier with an asterisk and each sure outlier with a hollow dot.
[Figure: box plot of the TV viewing data on a vertical axis from 0 to 80; only 66 is a mild outlier, marked with an asterisk]

Uses of Box and Whisker Diagram

Box plots are especially suitable for comparing two or more data sets. In such a situation the box plots are constructed on the same scale.
[Figure: side-by-side box plots for Female and Male]

Standardized Variable
A variable that has mean "0" and variance "1" is called a standardized variable.
Values of a standardized variable are called standard scores.
Standard scores are unit-less.
Construction:
$$Z = \frac{\text{Variable} - \text{Mean of Variable}}{\text{Standard Deviation of Variable}}$$

Standardized Variable
X        (X - X̄)^2    Z          (Z - Z̄)^2
3           25        -1.3624     1.8561
6            4        -0.5450     0.2970
11           9         0.8174     0.6682
12          16         1.0899     1.1879
Total 32    54         0          4.009

$$\bar{X} = \frac{\sum X}{n} = \frac{32}{4} = 8, \qquad S_x^2 = \frac{54}{4} = 13.5, \qquad S_x = 3.67$$

$$Z = \frac{X - \bar{X}}{S_x} = \frac{X - 8}{3.67}$$

$$\bar{Z} = \frac{\sum Z}{n} = \frac{0}{4} = 0, \qquad S_z^2 = \frac{4.009}{4} \approx 1$$

Variable Z has mean "0" and variance "1", so Z is a standard variable.

Standard score at X = 11 is
$$Z = \frac{X - \bar{X}}{S_x} = \frac{11 - 8}{3.67} = 0.8174$$

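Standardization of the four values in the table (3, 6, 11, 12) can be sketched as follows. The helper `standardize` is a hypothetical name, and it uses the divisor n for the variance, as the slide does. Because the code keeps the unrounded standard deviation (about 3.6742) rather than 3.67, its z-scores differ from the table in the third decimal place.

```python
import math


def standardize(values):
    """Return z-scores: (value - mean) / standard deviation (divisor n)."""
    n = len(values)
    mean = sum(values) / n
    std_dev = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    return [(x - mean) / std_dev for x in values]


x_values = [3, 6, 11, 12]
z_scores = standardize(x_values)

z_mean = sum(z_scores) / len(z_scores)
z_variance = sum((z - z_mean) ** 2 for z in z_scores) / len(z_scores)

print([round(z, 4) for z in z_scores])
print(round(z_mean, 10), round(z_variance, 10))
```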
Performance evaluation by z-scores
The industry in which sales rep Mr. Atif works has mean annual sales = $2,500 and standard deviation = $500.
The industry in which sales rep Mr. Asad works has mean annual sales = $4,800 and standard deviation = $600.
Last year Mr. Atif's sales were $4,000 and Mr. Asad's sales were $6,000.
Which of the representatives would you hire if you have one sales position to fill?

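Performance evaluation by z-scores
Sales rep. Atif (subscript B): industry mean = $2,500, S_B = $500, sales X_B = $4,000
Sales rep. Asad (subscript P): industry mean = $4,800, S_P = $600, sales X_P = $6,000

$$Z_B = \frac{X_B - \bar{X}_B}{S_B} = \frac{4{,}000 - 2{,}500}{500} = 3$$

$$Z_P = \frac{X_P - \bar{X}_P}{S_P} = \frac{6{,}000 - 4{,}800}{600} = 2$$

Mr. Atif is the better choice: his sales are 3 standard deviations above his industry mean, against 2 for Mr. Asad.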
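The comparison of the two sales representatives can be checked numerically. The function name `z_score` is illustrative; the figures are the industry means, standard deviations and individual sales given in the problem.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations by which value lies above the mean."""
    return (value - mean) / std_dev


z_atif = z_score(value=4000, mean=2500, std_dev=500)
z_asad = z_score(value=6000, mean=4800, std_dev=600)
print(z_atif, z_asad)  # 3.0 2.0

# The representative with the larger z-score performed better relative
# to his own industry.
better = "Atif" if z_atif > z_asad else "Asad"
print(better)  # Atif
```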


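The Empirical Rule
For bell-shaped (approximately normal) data:
$\bar{X} \pm 1S$ contains about 68% of values
$\bar{X} \pm 2S$ contains about 95% of values
$\bar{X} \pm 3S$ contains about 99.7% of values
[Figure: bell curve centred at $\bar{X}$ with the three intervals marked 68%, 95% and 99.7%]
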
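The three percentages in the empirical rule come from the normal distribution, and they can be recovered with Python's standard library. This sketch assumes an exactly normal variable, so real data will only match approximately; `coverage_within` is a name chosen here.

```python
from statistics import NormalDist


def coverage_within(k):
    """Probability that a normal variable lies within k standard
    deviations of its mean."""
    standard_normal = NormalDist(mu=0.0, sigma=1.0)
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(100 * coverage_within(k), 1))
# 1 68.3
# 2 95.4
# 3 99.7
```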
Measures of Skewness
A distribution in which the values equidistant
from the centre have equal frequencies is defined
to be symmetrical and any departure from
symmetry is called skewness.

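For a symmetrical distribution:
1. Length of Right Tail = Length of Left Tail
2. Mean = Median = Mode
3. Sk = 0, where the coefficient of skewness Sk may be computed as
a) Sk = (Mean - Mode)/SD
b) Sk = (Q3 - 2Q2 + Q1)/(Q3 - Q1)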
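Both skewness coefficients are simple to compute. In the sketch below the function names are assumptions, the quartiles are those of the TV viewing data used earlier in these slides (Q1 = 22.0, Q2 = 30.5, Q3 = 36.5), and the inputs to the mean-mode version are hypothetical numbers used only to show the call.

```python
def skewness_mean_mode(mean, mode, std_dev):
    """Formula (a): Sk = (Mean - Mode) / SD."""
    return (mean - mode) / std_dev


def skewness_quartiles(q1, q2, q3):
    """Formula (b): Sk = (Q3 - 2*Q2 + Q1) / (Q3 - Q1)."""
    return (q3 - 2 * q2 + q1) / (q3 - q1)


# Quartiles of the TV viewing data from the earlier slides.
sk_quartile = skewness_quartiles(22.0, 30.5, 36.5)
print(round(sk_quartile, 4))  # -0.1724

# Hypothetical summary values, for illustration only.
sk_mode = skewness_mean_mode(mean=10.0, mode=8.0, std_dev=4.0)
print(sk_mode)  # 0.5
```

A negative value indicates negative skewness and a positive value indicates positive skewness. The quartile coefficient describes only the middle half of the data, so it is not influenced by the extreme value 66.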

Measures of Skewness
A distribution is positively skewed, if the observations
tend to concentrate more at the lower end of the possible
values of the variable than the upper end. A positively
skewed frequency curve has a longer tail on the right
hand side

1. Length of Right Tail > Length of Left Tail
2. Mean > Median > Mode
3. Sk > 0

Measures of Skewness
A distribution is negatively skewed if the observations tend to concentrate more at the upper end of the possible values of the variable than the lower end. A negatively skewed frequency curve has a longer tail on the left side.

1. Length of Right Tail < Length of Left Tail
2. Mean < Median < Mode
3. Sk < 0

Measures of Kurtosis
Kurtosis is the degree of peakedness or flatness of a unimodal (single humped) distribution.
• When the values of a variable are highly concentrated around the mode, the peak of the curve becomes relatively high; the curve is Leptokurtic.
• When the values of a variable have low concentration around the mode, the peak of the curve becomes relatively flat; the curve is Platykurtic.
• A curve that is neither very peaked nor very flat-topped is taken as the basis for comparison and is called Mesokurtic/Normal.

Measures of Kurtosis

Measures of Kurtosis
$$\text{Coefficient of Kurtosis} = \frac{n \sum (X - \bar{X})^4}{\left[\sum (X - \bar{X})^2\right]^2}$$

1. If Coefficient of Kurtosis > 3, the curve is Leptokurtic.
2. If Coefficient of Kurtosis = 3, the curve is Mesokurtic.
3. If Coefficient of Kurtosis < 3, the curve is Platykurtic.
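The coefficient and the three-way classification can be sketched in Python. The function names are assumptions, and the data are the six measurements from the variance example earlier in these slides (4, 6, 9, 12, 13, 16), reused here purely for illustration. With real data the coefficient is rarely exactly 3, so an analyst may prefer to compare against 3 with some tolerance.

```python
def kurtosis_coefficient(values):
    """n * sum of fourth-power deviations / (sum of squared deviations)^2."""
    n = len(values)
    mean = sum(values) / n
    sum_sq = sum((x - mean) ** 2 for x in values)
    sum_fourth = sum((x - mean) ** 4 for x in values)
    return n * sum_fourth / sum_sq ** 2


def classify_kurtosis(coefficient):
    """Apply the three rules: > 3, = 3, < 3."""
    if coefficient > 3:
        return "Leptokurtic"
    if coefficient < 3:
        return "Platykurtic"
    return "Mesokurtic"


data = [4, 6, 9, 12, 13, 16]
k = kurtosis_coefficient(data)
print(round(k, 3), classify_kurtosis(k))  # 1.699 Platykurtic
```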

