Data Analytics
ANINDITA MANDAL
BARNALI CHAUDHURI
JAYANTA KUMAR DAS
CHAPTER 01
INTRODUCTION TO DATA SCIENCE
The population is the set of entities under study. It is a collection of people, items, or events
about which you want to make inferences.
Sample: A sample is a subset of people, items, or events from a larger population that you collect and analyze in order to make inferences about that population. To represent the population well, a sample should be randomly collected and adequately large.
Mean (Arithmetic): The mean (or average) is the most popular and well-known measure of central tendency.
Advantage: An important property of the mean is that it includes every value in your data set as part of the calculation. In addition, the mean is the only measure of central tendency where the sum of the deviations of each value from the mean is always zero. It can be used with both discrete and continuous data, although it is most often used with continuous data.
Disadvantage: The mean has one main disadvantage: it is particularly susceptible to the influence of outliers.
Measure: The mean is equal to the sum of all the values in the data set divided by the number of values in the data set. So, if we have n values x1, x2, ..., xn, the sample mean, usually denoted by x̄, is x̄ = (x1 + x2 + ... + xn) / n.

Best measure of central tendency by type of variable:
Type of Variable                Best measure of central tendency
Nominal                         Mode
Ordinal                         Median
Interval/Ratio (not skewed)     Mean
Interval/Ratio (skewed)         Median
Median: The median is the middle score for a set of data that has been arranged in order of magnitude. The median is less affected by outliers and skewed data.
Mode: The mode is the most frequent score in our data set. On a histogram it is represented by the highest bar.
Standard Deviation: The standard deviation is a measure of the spread of scores within a set of data. We can estimate the population standard deviation from a sample standard deviation. These two standard deviations - sample and population - are calculated differently.
The sample standard deviation formula is:
s = √[ Σ(X − x̄)² / (n − 1) ]
where s = sample standard deviation, X = each score, x̄ = sample mean, and n = number of scores.

Variance: Variance measures the variability (volatility) from an average or mean; since volatility is a measure of risk, the variance statistic can help determine the risk of an investment.
Importance of variance: Use variance to see how individual numbers relate to each other within a data set. A drawback of variance is that it gives added weight to numbers far from the mean (outliers).
The formula for the variance in a population is:
σ² = Σ(X − μ)² / N
(Diagram: a random sample is drawn from the population; sample statistics are used to estimate the population parameters.)
SEM = σ/√n
The common levels of confidence and their associated alpha levels and z quantiles:
(1 − α)100%    α      z1−α/2
90%            .10    1.64
95%            .05    1.96
99%            .01    2.58

Example: Suppose a population with σ = 15 and unknown mean μ. A random sample of 10 observations is taken from this population, and we observe the following values: {21, 42, 5, 11, 30, 50, 28, 27, 24, 52}. Based on these 10 observations, x̄ = ?, SEM = ?, and a 95% CI for μ = ?
Sample Size Requirements for estimating μ
Here m represents the desired margin of error and n the required sample size.
Problem 1
Which of the following statements is true?
I. When the margin of error is small, the confidence level is high.
II. When the margin of error is small, the confidence level is low.
III. A confidence interval is a type of point estimate.
IV. A population mean is an example of a point estimate.
(A) I only (B) II only (C) III only (D) IV only (E) None of the above.

Problem 2
Given standard deviation σ = 15 (as in the example above), we want to estimate μ with 95% confidence.
i) What sample size is required to achieve a margin of error of 5?
ii) What sample size is required to achieve a margin of error of 2.5?
Solution
From the equation above, n = (z1−α/2 σ / m)².
i) n = (1.96 × 15 / 5)² = 34.6, rounded up to 35.
ii) n = (1.96 × 15 / 2.5)² = 138.3, rounded up to 139.
Estimating p with confidence
Sampling distribution of the proportion: the proportion for a sample is p̂ = (number of successes in the sample) / n.
In large samples, the sampling distribution of p̂ is approximately normal with a mean of p and standard error of the proportion SEP = √(pq/n).
Example 1: A survey of 57 individuals reveals 17 smokers. Use the npq rule to determine the suitability of the method, then estimate the 95% CI for p.
Example 2: Out of 2673 people surveyed, 170 have risk factor X. We want to determine the population prevalence of the risk factor with 95% confidence.
Sample size examples:
Example 1: We want to sample a population and estimate the prevalence of smoking with 95% confidence. How large a sample is needed to achieve a margin of error of 0.05 if we assume the prevalence of smoking is roughly 30%?
Example 2: How large a sample is needed to shrink the margin of error to 0.03?
5: Introduction to Estimation
Contents
Acronyms and symbols
Statistical inference
  Estimating μ with confidence
    Sampling distribution of the mean
    Confidence interval for μ when σ is known beforehand
    Sample size requirements for estimating μ with confidence
  Estimating p with confidence
    Sampling distribution of the proportion
    Confidence interval for p
    Sample size requirement for estimating p with confidence
Statistical inference takes two forms:
Estimation
Null hypothesis tests of significance (NHTS)
Both estimation and NHTS are used to infer parameters. A parameter is a statistical constant that describes a feature of a phenomenon, population, pmf, or pdf.
Point estimates are single points that are used to infer parameters directly. For example, the sample mean x̄ is the point estimator of the population mean μ.
Notice the use of different symbols to distinguish estimators and parameters. More importantly, point estimates and parameters represent fundamentally different things:
Point estimates are calculated from the data; parameters are not.
Point estimates vary from study to study; parameters do not.
Point estimates are random variables; parameters are constants.
Without going into too much detail, the SDM (sampling distribution of the mean) reveals that:
x̄ is an unbiased estimate of μ;
the SDM tends to be normal (Gaussian) when the population is normal or when the sample is adequately large;
the standard deviation of the SDM is equal to σ/√n. This statistic, called the standard error of the mean (SEM), predicts how closely the x̄s in the SDM are likely to cluster around the value of μ and is a reflection of the precision of x̄ as an estimate of μ:
SEM = σ/√n
Note that this formula is based on σ and not on the sample standard deviation s. Recall that σ is NOT calculated from the data; it is derived from an external source. Also note that the SEM is inversely proportional to the square root of n: each time we quadruple n, the SEM is cut in half. This is called the square root law - the precision of the mean is inversely proportional to the square root of the sample size.
To gain further insight into μ, we surround the point estimate with a margin of error:
x̄ ± margin of error
This forms a confidence interval (CI). The lower end of the confidence interval is the lower confidence limit (LCL). The upper end is the upper confidence limit (UCL).
Note: The margin of error is the plus-or-minus wiggle-room drawn around the point estimate; it is equal to half the confidence interval length.
Let (1 − α)100% represent the confidence level of a confidence interval. The α (alpha) level represents the lack of confidence and is the chance the researcher is willing to take in not capturing the value of the parameter.
x̄ ± (z1−α/2)(SEM)
The z1−α/2 in this formula is the z quantile associated with a 1 − α level of confidence. The reason we use z1−α/2 instead of z1−α in this formula is that the random error (imprecision) is split between underestimates (left tail of the SDM) and overestimates (right tail of the SDM). The confidence level 1 − α area lies between −z1−α/2 and z1−α/2:
(1 − α)100%    α      z1−α/2
90%            .10    1.64
95%            .05    1.96
99%            .01    2.58
Numerical example, 90% CI for μ. Suppose we have a sample of n = 10 with SEM = 4.30 and x̄ = 29.0. The z quantile for 90% confidence is z1−.10/2 = z.95 = 1.64, and the 90% CI for μ = 29.0 ± (1.64)(4.30) = 29.0 ± 7.1 = (21.9, 36.1). We use this inference to address the population mean μ and NOT the sample mean x̄. Note that the margin of error for this estimate is 7.1.
Numerical example, 95% CI for μ. The z quantile for 95% confidence is z1−.05/2 = z.975 = 1.96. The 95% CI for μ = 29.0 ± (1.96)(4.30) = 29.0 ± 8.4 = (20.6, 37.4). Note that the margin of error for this estimate is 8.4.
Numerical example, 99% CI for μ. Using the same data, α = .01 for 99% confidence and the 99% CI for μ = 29.0 ± (2.58)(4.30) = 29.0 ± 11.1 = (17.9, 40.1). Note that the margin of error for this estimate is 11.1.
Comparing the confidence interval lengths (UCL − LCL) of the three intervals just calculated: the confidence interval length grows as the level of confidence increases from 90% to 95% to 99%. This is because there is a trade-off between confidence and margin of error. You can achieve a smaller margin of error if you are willing to pay the price of less confidence. Therefore, as Dr. Evil might say, 95% is pretty standard.
Numerical example. Suppose a population has σ = 15 (not calculated, but known ahead of time) and unknown mean μ. We take a random sample of 10 observations from this population and observe the following values: {21, 42, 5, 11, 30, 50, 28, 27, 24, 52}. Based on these 10 observations, x̄ = 29.0, SEM = 15/√10 = 4.74, and a 95% CI for μ = 29.0 ± (1.96)(4.74) = 29.0 ± 9.3 = (19.7, 38.3).
Interpretation notes:
The margin of error (m) is the plus-or-minus value surrounding the estimate. In this case m = 9.3.
We use these confidence intervals to address potential locations of the population mean μ, NOT the sample mean x̄.
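A minimal Python sketch (not from the slides) that reproduces this worked example, assuming σ = 15 is known and z = 1.96 for 95% confidence:

import math

data = [21, 42, 5, 11, 30, 50, 28, 27, 24, 52]
sigma = 15                            # population SD, known from an external source
n = len(data)

xbar = sum(data) / n                  # point estimate of mu -> 29.0
sem = sigma / math.sqrt(n)            # SEM = sigma / sqrt(n) -> 4.74
z = 1.96                              # z quantile for 95% confidence
moe = z * sem                         # margin of error -> 9.3
print(f"xbar = {xbar:.1f}, SEM = {sem:.2f}, "
      f"95% CI = ({xbar - moe:.1f}, {xbar + moe:.1f})")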
One of the questions we often face is: How much data should be collected? Collecting too much data is a waste of time and money. Also, by collecting fewer data points we can devote more time and energy to making each measurement accurate. However, collecting too little data renders our estimate too imprecise to be useful.
To address the question of sample size requirements, let m represent the desired margin of error of an estimate. This is equivalent to half the ultimate confidence interval length.
Note that the margin of error m = (z1−α/2)(σ/√n). Solving this equation for n yields:
n = (z1−α/2)² σ² / m²
We always round results from this formula up to the next integer to ensure that we have a margin of error no greater than m.
Note that determining the sample size requirement for estimating μ with a given level of confidence requires specification of the z quantile based on the desired level of confidence (z1−α/2), the population standard deviation (σ), and the desired margin of error (m).
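A short sketch of this sample-size formula in Python; the helper name sample_size_mean is ours, and σ = 15 follows the earlier example:

import math

def sample_size_mean(z, sigma, m):
    """n = (z * sigma / m)^2, rounded up to the next integer."""
    return math.ceil((z * sigma / m) ** 2)

print(sample_size_mean(1.96, 15, 5))    # margin of error 5   -> 35
print(sample_size_mean(1.96, 15, 2.5))  # margin of error 2.5 -> 139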
In samples that are large, the sampling distribution of p̂ is approximately normal with a mean of p and standard error of the proportion SEP = √(pq/n), where q = 1 − p. The SEP quantifies the precision of the sample proportion as an estimate of parameter p.
This approach should be used only in samples that are large.(a) Use this rule to determine whether the sample is large enough: if npq ≥ 5, proceed with this method. (Call this the npq rule.)
The confidence interval is:
p̂ ± (z1−α/2)(SEP)
where the estimated SEP = √(p̂q̂/n).
(a) A more precise formula that can be used in small samples is provided in a future chapter.
Step 1. Review the research question and identify the parameter. Read the research question. Verify that we have a single sample that addresses a binomial proportion (p).
Step 2. Point estimate. Calculate the sample proportion (p̂) as the point estimate of the parameter.
Step 3. Confidence interval. Compute p̂ ± (z1−α/2)(SEP).
Step 4. Interpret the results. In plain language, report the proportion and the variable it addresses. Report the confidence interval, being clear about what population is being addressed. Reported results should be rounded as appropriate for the reader.
Illustration
Of 2673 people surveyed, 170 have risk factor X. We want to determine the population prevalence of the risk factor with 95% confidence.
Step 1. Prevalence is the proportion of individuals with a binary trait. Therefore we wish to estimate parameter p.
Steps 2 and 3. p̂ = 170/2673 = 0.064; SEP = √(0.064 × 0.936 / 2673) = 0.0047; 95% CI = 0.064 ± (1.96)(0.0047) = (0.054, 0.073).
Step 4. The prevalence in the sample was 6.4%. The prevalence in the population is between 5.4% and 7.3% with 95% confidence.
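The same illustration as a Python sketch (the variable names are ours); it checks the npq rule and reproduces the 95% CI:

import math

x, n = 170, 2673                      # successes, sample size
p_hat = x / n                         # 0.064
q_hat = 1 - p_hat
assert n * p_hat * q_hat >= 5         # npq rule: normal approximation is OK
sep = math.sqrt(p_hat * q_hat / n)    # estimated SE of the proportion
moe = 1.96 * sep                      # 95% margin of error
print(f"p_hat = {p_hat:.3f}, 95% CI = "
      f"({p_hat - moe:.3f}, {p_hat + moe:.3f})")   # (0.054, 0.073)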
In planning a study, we want to collect enough data to estimate p with adequate precision. Earlier in the chapter we determined the sample size requirements for estimating μ with confidence. We apply a similar method to determine the sample size requirements for estimating p.
Let m represent the margin of error. This provides the wiggle room around p̂ for our confidence interval and is equal to half the confidence interval length. To achieve margin of error m:
n = (z1−α/2)² p*q* / m²
where p* represents an educated guess for the proportion and q* = 1 − p*.
Numeric example: We want to sample a population and calculate a 95% confidence interval for the prevalence of smoking. How large a sample is needed to achieve a margin of error of 0.05 if we assume the prevalence of smoking is roughly 30%? Here n = (1.96)²(0.30)(0.70)/(0.05)² = 322.7, rounded up to 323.
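A sketch of the proportion sample-size formula; sample_size_proportion is a hypothetical helper name, not from the slides:

import math

def sample_size_proportion(z, p_star, m):
    """n = z^2 * p* * q* / m^2, rounded up."""
    return math.ceil(z ** 2 * p_star * (1 - p_star) / m ** 2)

print(sample_size_proportion(1.96, 0.30, 0.05))  # margin 0.05 -> 323
print(sample_size_proportion(1.96, 0.30, 0.03))  # margin 0.03 -> 897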
Four elements associated with DOE:
TERMINOLOGY
Randomization: randomize the order of the experimental runs to satisfy the statistical requirement of independence of observations. Randomization results in averaging out the effects of extraneous factors that may be present, minimizing the risk of these factors affecting the experimental results.
Blocking: a known extraneous (blocking) factor is identified and its effect on the experimental results is minimized.
Replication: repeating the experiment without changing any factor.
After all tests are performed, a series of graphs is constructed showing how the response variable is affected by varying each factor with all other factors held constant.
(Worked-example residue: ANOVA table with Error df = 6, SS = 600, MSS = 100; Total df = N − 1 = 7, SS = 1050; F-critical(1,6) = 5.9874. Conclusion: there is no difference in sale value with respect to Salesman or Sales Region.)
Analysis of Variance
ONE-WAY ANALYSIS OF VARIANCE: used to test the hypothesis that the means of several populations are equal.
A hypothesis is an assumption about a population. There are two types of hypotheses:
Null Hypothesis (H0) - the preferred assumption about a population.
Alternate Hypothesis (H1) - the opposite of the null hypothesis.
Example: A production line has 7 fill needles and you wish to assess whether or not the average fill is the same for all 7 needles. Experiment: sample 20 fills from each of the 7 needles and test at the 5% level of significance.
H0: μ1 = μ2 = μ3 = μ4 = μ5 = μ6 = μ7
Example:
H0 = There is no significant difference between companies in terms of quarterly averages of EPS.
H1 = There is a significant difference for at least a pair of companies in terms of quarterly averages of EPS.
Decision rule: if F-ratio < F-critical, accept H0 (reject H1); otherwise reject H0.
CRD
A completely randomized design (CRD) studies the effects of a primary factor with other factors not taken into consideration in the design of the experiment. It falls within the category of true randomization and is the simplest form of design. In a completely randomized design, subjects are assigned to the various groups at random, without the involvement of any judgment. Completely randomized designs are analyzed by one-way ANOVA.

Quarterly EPS data for two companies:
Quarter   C1   C2
Q1        12   16
Q2         8   18
Q3        16   10
Q4        19   11

Sums of squares, where a is the number of treatments, n is the number of replications under each treatment, Y.. is the sum of all Yij values over i and j, Y.j is the sum of the Yij values over i for a given j, and N is the total number of observations in the experiment:
SS total     = Σi Σj Yij² − Y..²/N
SS treatment = Σj Y.j²/n − Y..²/N
SS error     = SS total − SS treatment
Disadvantage of CRD
Unrestricted randomization means that units that receive one treatment may be inherently different from units that receive other treatments. Any variation in units shows up in the experimental error sum of squares. Unless the units are very similar, a CRD will have larger experimental error than other designs.

Problem: Alpha Engineering Ltd is facing a quality problem in terms of the surface finish of components which are machined on 4 different machines (P, Q, R and S). The company has selected operators from 4 different grades of employees to machine the components: A1, A2, A3 and A4 from grade A; B1, B2, B3 and B4 from grade B; C1, C2, C3 and C4 from grade C; and D1, D2, D3 and D4 from grade D, for allotment to the different machines during the different weeks (W1, W2, W3 and W4) of the month of experimentation. The sixteen operators are randomly assigned to the combinations of machine and week as shown in the tables below.
Latin Square Design
Latin square designs allow for two blocking factors. In other words, these designs are used to simultaneously control (or eliminate) two sources of nuisance variability.

CRBD (Machines):
Week   P        Q        R        S
W1     A1(23)   B1(30)   C3(25)   D1(28)
W2     B2(20)   C2(29)   A2(34)   D4(34)
W3     A3(45)   D3(40)   B3(30)   C4(45)
W4     C1(35)   B4(20)   D2(40)   A4(34)

Grade of Operators:
Week   A    B    C    D
W1     23   30   25   28
W2     34   20   29   34
W3     45   30   45   40
W4     34   20   35   40

An intermediate allocation considered (defects noted on the slide: some weeks have no operator of grades C and B, others no operator of grade D, so it is no improvement over the CRBD):
Week   P        Q        R        S
W1     A4(34)   B4(20)   C4(45)   D4(34)
W2     A2(34)   B2(20)   C2(29)   D4(34)
W3     A3(45)   B3(30)   C4(45)   D2(40)
W4     A4(34)   B4(20)   C1(35)   D3(40)

LSD (Machines):
Week   P        Q        R        S
W1     A1(23)   B1(30)   C4(45)   D1(28)
W2     B2(20)   C2(29)   D4(34)   A2(34)
W3     C3(25)   D3(40)   A3(45)   B3(30)
W4     D2(40)   A4(34)   B4(20)   C1(35)
A stock market analyst wants to study the impact of the type of company on quarterly averages of EPS. He collected EPS data of four different companies during the last financial year from 'Capital Market', summarized below:

C1   C2   C3   C4
12   16   25   13
 8   18   15    8
16   10   22   20
19   11    9    5
Column totals:
55   55   71   46

SS Error = SS Total − SS Treatment = 478.4375 − 81.1875 = 397.25
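As a check (not part of the slides), the sums of squares and the one-way ANOVA F-test can be computed in Python with SciPy:

from scipy import stats

groups = {
    "C1": [12, 8, 16, 19],
    "C2": [16, 18, 10, 11],
    "C3": [25, 15, 22, 9],
    "C4": [13, 8, 20, 5],
}
all_vals = [v for g in groups.values() for v in g]
N = len(all_vals)
cf = sum(all_vals) ** 2 / N                        # correction factor Y..^2 / N

ss_total = sum(v * v for v in all_vals) - cf       # 478.4375
ss_treat = sum(sum(g) ** 2 / len(g) for g in groups.values()) - cf  # 81.1875
ss_error = ss_total - ss_treat                     # 397.25
print(ss_total, ss_treat, ss_error)

f, p = stats.f_oneway(*groups.values())            # the same F-test directly
print(f, p)                                        # F = (81.1875/3)/(397.25/12)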
FACTORIAL (2^k) DESIGNS
(Diagram: the 2² design square with corners (1), a, b, ab, axes running from Low (La) to High (Ha) for factor A and Low to High (Hb) for factor B; and the 2³ design cube with additional corners c, ac, bc, abc.)
As k gets large, the number of runs (and hence the sample size) grows rapidly:
k    # of runs
2    4
3    8
4    16
5    32
6    64
7    128
8    256
9    512
10   1024

For a 2² design, writing (1), a, b, ab for the treatment totals:
Average effect of A = [ {(a − (1))/2} + {(ab − b)/2} ] / 2
Average effect of B = [ {(b − (1))/2} + {(ab − a)/2} ] / 2

For a 2³ design:
Average effect of A = [ {(a − (1))/2} + {(ac − c)/2} + {(abc − bc)/2} + {(ab − b)/2} ] / 2
Average effect of B = [ {(b − (1))/2} + {(ab − a)/2} + {(abc − ac)/2} + {(bc − c)/2} ] / 2
Average effect of C = [ {(c − (1))/2} + {(ac − a)/2} + {(abc − ab)/2} + {(bc − b)/2} ] / 2
Data arranged in Yates' (standard) order for a 2³ design:
Run    A   B   C   AB   AC   BC   ABC
(1)    −   −   −   +    +    +    −
a      +   −   −   −    −    +    +
b      −   +   −   −    +    −    +
ab     +   +   −   +    −    −    −
c      −   −   +   +    −    −    +
ac     +   −   +   −    +    −    −
bc     −   +   +   −    −    +    −
abc    +   +   +   +    +    +    +

Worked 2² example: A stock market analyst wants to study the impact of the type of company (factor A) and time period (factor B) on the quarterly averages of earnings per share (EPS). He collected quarterly EPS data of two different companies during the last two financial years from 'Capital Market', summarized below (two replicates R1, R2 per cell):

Type of Company (A)   Year (B) 1(−)   Year (B) 2(+)
1 (−)                 12, 18          16, 15
2 (+)                 16, 19          10, 11

Run    A   B   AB   R1   R2   Y (total)
(1)    −   −   +    12   18   12+18 = 30
a      +   −   −    16   19   35
b      −   +   −    16   15   31
ab     +   +   +    10   11   21

Cont A  = (35 + 21) − (30 + 31) = −5
Cont B  = (31 + 21) − (30 + 35) = −13
Cont AB = (30 + 21) − (31 + 35) = −15

SS-A  = [Cont A]²  / [2^k × replicates] = (−5)²/8  = 3.125
SS-B  = [Cont B]²  / [2^k × replicates] = (−13)²/8 = 21.125
SS-AB = [Cont AB]² / [2^k × replicates] = (−15)²/8 = 28.125
SS-Total = (12² + 18² + ... + 10² + 11²) − (117²/8) = 75.875

ANOVA for the 2^k factorial design using Yates' algorithm:
Run    A   B   AB   R1   R2   Y    C1           C2             Contrast   SS
(1)    −   −   +    12   18   30   30+35 = 65   65+52 = 117    Total
a      +   −   −    16   19   35   31+21 = 52   5−10 = −5      A          (−5)²/8 = 3.125
b      −   +   −    16   15   31   35−30 = 5    52−65 = −13    B          (−13)²/8 = 21.125
ab     +   +   +    10   11   21   21−31 = −10  −10−5 = −15    AB         (−15)²/8 = 28.125

ANOVA table:
Source           DF            SS       MSS             F-ratio                F-crit
Company          2−1 = 1       3.125    3.125/1         3.125/5.875 = 0.532    7.71
Year             2−1 = 1       21.125   21.125/1        21.125/5.875 = 3.596   7.71
Company × Year   1×1 = 1       28.125   28.125/1        28.125/5.875 = 4.787   7.71
Error            7−1−1−1 = 4   23.5     23.5/4 = 5.875
Total            N−1 = 7       75.875

Since every F-ratio is below the critical value F(1,4) = 7.71 at the 5% level, none of the effects is significant.
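A small Python sketch of Yates' algorithm for this 2² example; it reproduces the contrasts −5, −13, −15 and their sums of squares:

totals = [30, 35, 31, 21]      # treatment totals in standard order (1), a, b, ab
n_rep, k = 2, 2

col = totals
for _ in range(k):             # k passes: pairwise sums, then pairwise differences
    sums = [col[i] + col[i + 1] for i in range(0, len(col), 2)]
    diffs = [col[i + 1] - col[i] for i in range(0, len(col), 2)]
    col = sums + diffs

# col = [grand total, contrast A, contrast B, contrast AB]
for name, c in zip(["A", "B", "AB"], col[1:]):
    ss = c ** 2 / (n_rep * 2 ** k)          # SS = contrast^2 / (n * 2^k)
    print(name, c, ss)   # A: -5, 3.125; B: -13, 21.125; AB: -15, 28.125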
Problem: The surface finish of a product produced in a machine shop is suspected to be affected by a factor Operator and another factor Shift. The data of this experiment, with two replications in the different treatment combinations, are summarized below. Perform ANOVA using the General Linear Model and the 2ⁿ factorial design with Yates' algorithm, and check the significance of the components of the related model at α = 0.05.

Operator (A)   Shift (B) 1   Shift (B) 2
1              65, 70        20, 40
2              30, 35        50, 40

Problem: A company is keen to assess the contribution of its employees, on a 0-10 scale, in terms of value addition to its business operations. The UG qualification, sex, and work experience of the employees are considered to be the factors. The corresponding ratings of the employees are shown below:

Work Experience (A)   UG degree (B): Engineering      UG degree (B): Others
                      Sex (C): Male   Sex (C): Female Sex (C): Male   Sex (C): Female
< 3 years             8, 7            2, 6            4, 8            2, 4
>= 3 years            9, 9            4, 9            7, 8            5, 6

Ans: F-values as computed; F-crit = 5.32.
Factor A and Factor B both have an effect on the response variable, but the effect Factor B has may depend on the level at which Factor A is set (for example, B may matter only when A is set at the high level). This is called interaction, and it basically means that the effect one factor has on a response depends on the level you set the other factors at. Interactions can be major problems in a DOE if you fail to account for them when designing your experiment.
CHAPTER 04
STATISTICAL QUALITY CONTROL
Why control quality?
Controlling and improving quality has become an important business strategy for manufacturers, distributors, transportation companies, financial services organizations, health care providers, and government agencies. Quality is a competitive advantage: a business that can delight customers by improving and controlling quality can dominate its competitors.
Definition of Quality
Quality means fitness for use. There are two general aspects of fitness for use:
quality of design [addressed by Design of Experiments];
quality of conformance - how well the product or service conforms to the specifications required by the design [addressed by Statistical Quality Control].
Quality is inversely proportional to variability. This implies that as variability in the important characteristics of a product decreases, the quality of the product increases. Quality improvement is the reduction of variability in processes and products.
The largest allowable value for a quality characteristic is called the upper specification limit (USL), and the smallest allowable value is called the lower specification limit (LSL).
Why Statistical Quality Control?
Since variability can only be described in statistical terms, statistical methods play a central role in quality improvement efforts. Data on quality characteristics are classified as either attributes or variables data.
Variation exists in all processes. Variation can be categorized as either:
Common or random causes of variation - causes that we cannot identify; unavoidable, e.g. slight differences in process variables like diameter, weight, service time, temperature.
Assignable causes of variation - causes that can be identified and eliminated: poor employee training, a worn tool, a machine needing repair.
Statistical quality control (SQC) is the term used to describe the set of statistical tools used by quality professionals. SQC encompasses three broad categories:
1. Statistical process control (SPC): involves inspecting the output from a process; quality characteristics are measured and charted; helps identify in-process variations.
2. Descriptive statistics: include the mean, standard deviation, and range.
3. Acceptance sampling: used to randomly inspect a batch of goods to determine acceptance or rejection; does not help to catch in-process problems.
This model represents manufacturing or service processes.
Statistical Methods for Quality Control and Improvement:
Statistical process control [an online tool].
Design of experiments [an offline tool]; often used during development activities and the early stages of manufacturing.
Acceptance sampling; done at the incoming raw materials or components point, or at final production.
Example: a process in a financial institution that processes automobile loan applications.
15-1.2: Statistical Process Control
15-2: Introduction to Control Charts
A control chart is one of the primary techniques of statistical process control (SPC). Methods for looking for sequences or nonrandom patterns can be applied to control charts as an aid in detecting out-of-control conditions.
[Source: John Wiley & Sons, Inc., Applied Statistics and Probability for Engineers, by Montgomery and Runger.]
Problems in Process
An assignable cause can result in many different types of shifts in the process parameters:
The mean could shift instantaneously to a new value and remain there (this is sometimes called a sustained shift);
or it could shift abruptly, but the assignable cause could be short-lived and the mean could then return to its nominal or in-control value;
or the assignable cause could result in a steady drift or trend in the value of the mean.
The process mean may be OK while the process standard deviation or range increases or decreases; hence the X-bar and R, or X-bar and S, chart combinations.

X-bar and R or S Control Charts
UCL = μ + 3σ/√n
UWL = μ + 2σ/√n
LWL = μ − 2σ/√n
LCL = μ − 3σ/√n

The constants D3 and D4 (used for the R-chart limits) are tabulated for various values of n. There is a well-known relationship between the range of a sample from a normal distribution and the standard deviation of that distribution: the random variable W = R/σ is called the relative range. The parameters of the distribution of W are a function of the sample size n; the mean of W is d2. Control limits for the S chart are defined analogously from tabulated constants.
The value of σ used in the control limits depends on the method you use to estimate it. We will look at three methods for estimating σ from subgroup data:
Average of the subgroup ranges (most used);
Average of the subgroup standard deviations;
Pooled standard deviation.
A control chart can indicate an out-of-control condition even though no single point plots outside the control limits, if the pattern of the plotted points exhibits nonrandom or systematic behavior: for example, when the plotted points tend to fall near or slightly outside the control limits, with relatively few points near the center line.
Cyclic patterns
Cause: environmental changes such as temperature, operator fatigue, regular rotation of operators and/or machines, or fluctuation in voltage or pressure, etc.
A mixture pattern can also occur when output product from several sources (such as parallel machines) is fed into a common stream, which is then sampled for process monitoring purposes.
Cp = (USL − LSL) / 6σ
where σ is estimated from the average range:
σ̂ = R̄ / d2
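A hedged Python sketch tying these pieces together: estimating σ as R̄/d2 from subgroup data, then computing X-bar/R limits and Cp. The measurements and specification limits below are hypothetical; d2, D3, D4 are the standard table constants for subgroups of size 5:

import numpy as np

subgroups = np.array([          # hypothetical measurements, 5 per subgroup
    [74.0, 74.1, 73.9, 74.2, 73.8],
    [74.1, 74.0, 74.2, 73.9, 74.0],
    [73.9, 74.2, 74.1, 74.0, 73.8],
])
d2, D3, D4 = 2.326, 0.0, 2.114  # table constants for n = 5

xbar_bar = subgroups.mean()                   # center line of the X-bar chart
r_bar = np.ptp(subgroups, axis=1).mean()      # average subgroup range
sigma_hat = r_bar / d2                        # estimate of process sigma

n = subgroups.shape[1]
ucl_x = xbar_bar + 3 * sigma_hat / np.sqrt(n)
lcl_x = xbar_bar - 3 * sigma_hat / np.sqrt(n)
ucl_r, lcl_r = D4 * r_bar, D3 * r_bar         # R-chart limits

USL, LSL = 74.5, 73.5                         # hypothetical spec limits
cp = (USL - LSL) / (6 * sigma_hat)
print(ucl_x, lcl_x, ucl_r, lcl_r, cp)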
Cp is valuable in measuring process capability. However, it has one shortcoming: it assumes that process variability is centered on the specification range. Unfortunately, this is not always the case.
Control Chart for Attributes
Control charts for attributes are used to measure quality characteristics that are counted rather than measured.
Attributes are discrete in nature and entail simple yes-or-no
decisions.
For example, this could be the number of non-functioning light
bulbs, the proportion of broken eggs in a carton, number of
complaints issued.
C-charts count the actual number of defects.
For example, we can count the number of complaints from
customers in a month
P-charts are used to measure the proportion of items in a
sample that are defective.
Examples are the proportion of broken cookies in a batch
P-Chart
The center line is computed as the average proportion defective in the population, p̄. This is obtained by taking a number of samples of observations at random and computing the average value of p across all samples.
Problems
(Chart: an example control chart plotting the sample VALUE for samples 1 through 10 against the center line (CL) and the limits UCL, LCL, UWL, and ULL; the vertical axis runs from 0 to 0.4.)
The Shewhart Control Chart for Individual
Measurements
There are many situations in which the sample size used for process
monitoring is n = 1; that is, the sample consists of an individual unit.
In process plants, such as paper making, measurements on some
parameter such as coating thickness across the roll will differ very little
and produce a standard deviation that is much too small if the objective is
to control coating thickness along the roll.
In many applications of the individuals control chart, we use the moving range of two successive observations as the basis for estimating the process variability.
If preventing an error costs Rs 1, it will cost Rs 10 to detect the error in-house and Rs 100 if it is detected by the customer.
Setting Six Sigma targets
In a product-related industry, the customer or buyer can define certain specifications, which help identify and quantify parameters for Six Sigma implementation. In the service industry, however, output being intangible, the company has to set its own targets by identifying all the key characteristics of its service and identifying the process measures that have a direct impact on these key characteristics. Typically, the Six Sigma implementation strategy would suggest that the company takes the following steps:
Identify the CTQ (Critical to Quality) that is the most significant;
Identify the root cause;
Design a solution that would address this root cause;
Implement this solution;
Verify the effect of the solution by conducting audits at regular intervals;
Improve the process if needed.

Six Sigma in the Accounts department
The management notices that most of the customer complaints are related to vouchers handled by the company. To reduce the number of customer complaints, the company should aim to reduce the number of errors in the vouchers. The CTQs identified in the accounts department are errors pertaining to Amount, Tax, Code and Date. The team needs to find out the defects per million opportunities. For this they have to analyze sample vouchers, selected randomly. The team inspected 1000 vouchers and found a total of 120 defects.
Therefore, defects per voucher = 120/1000 = 0.12,
and defects per CTQ (opportunity) = 0.12/4 = 0.03.
Expressed in terms of defects per million opportunities, this becomes 30,000 ppm.
The table assumes a 1.5 sigma shift because processes tend to exhibit instability of
that magnitude over time.
Area: Call Center
Customer quote: 'I consistently wait too long to speak to a representative.'
CTQ measure: time on hold (seconds).
CTQ specification: less than 60 seconds from call connection to the automated response system.
Defect: calls with hold time equal to or greater than 60 seconds.
Unit: call. Opportunity: 1 per call.
Calculate Sigma:
Defects: 263 calls; Units: 21,501 calls; Opportunities: 1 per call.
Defects per opportunity = 263/21,501 = 0.0122 (about 12,232 defects per million opportunities).
Sigma level: 3.75. (For comparison, this corresponds to Cpk = 1.25, which is below the Six Sigma standard of Cpk = 2 and 3.4 defective ppm.)
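A sketch of the sigma-level arithmetic in Python, applying the 1.5-sigma shift noted above:

from scipy.stats import norm

defects, units, opportunities = 263, 21_501, 1
dpo = defects / (units * opportunities)      # 0.0122
dpmo = dpo * 1_000_000                       # ~12,232 ppm
sigma_level = norm.ppf(1 - dpo) + 1.5        # ~3.75 with the 1.5-sigma shift
print(dpmo, sigma_level)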
CHAPTER 05
QUANTITATIVE TECHNIQUES USED IN ADVANCED DECISION MAKING
Multi-criteria Decision Making and the Analytic Hierarchy Process
Multiple criteria decision making (MCDM) refers to making decisions in the presence of multiple non-commensurable and conflicting criteria, different units of measurement among the criteria, and the presence of quite different alternatives. MCDM problems are common in everyday life. MCDM analysis has some unique characteristics, such as the presence of multiple conflicting criteria. In a personal context, a house or a car one buys may be characterised in terms of price, size, style, safety, comfort, etc. In a business context, MCDM problems are more complicated and usually of large scale.
The Analytic Hierarchy Process (AHP) decomposes a complex MCDM problem into a system of hierarchies. The final step in the AHP deals with the structure of an m×n matrix (where m is the number of alternatives and n is the number of criteria). The matrix is constructed using the relative importance of the alternatives in terms of each criterion. AHP deals with complex problems which involve the consideration of multiple criteria and alternatives simultaneously.
AHP is based on the pairwise comparison method: any process of comparing entities in pairs to judge which entity is preferred, or has a greater amount of some quantitative property, or whether the two entities are identical. A paired comparison is usually a method to compare one entity with another of a similar status; such comparisons are usually made on the grounds of overall performance.
Prof. Thomas L. Saaty (1980) originally developed the AHP to enable decision making in situations characterized by multiple attributes and alternatives. The steps are:
1. Develop a hierarchy of factors impacting the final decision. This is known as the AHP decision model. The last level of the hierarchy contains the alternatives (here, the three candidates).
2. Elicit pairwise comparisons between the factors using inputs from users/managers.
3. Evaluate relative importance weights at each level of the hierarchy.
4. Combine the relative importance weights to obtain an overall ranking of the alternatives.
While comparing two criteria, we follow the simple rule recommended by Saaty (1980): while comparing two attributes X and Y, we assign values based on the relative preference of the decision maker (Table 1). To fill the lower triangular matrix, we use the reciprocals of the upper-triangle values.
Table 1: Scale used for pairwise comparison (Saaty's 1-9 scale).
Step 5. Compute the average of the values found in step 4. Let λmax be this average.
Step 6. Compute the consistency index (CI), which is defined as CI = (λmax − n) / (n − 1).
Compute the random index, RI, using the ratio RI = 1.98(n − 2)/n, or take it from the standard table below.
The consistency ratio is CR = CI / RI. Accept the matrix if CR is less than 0.10: the degree of consistency is then satisfactory, and the decision maker's comparisons are probably consistent enough to be useful.

Standard Random Index (RI) by number of alternatives:
n (alternatives)   3      4     5      6      7      8
RI                 0.58   0.9   1.12   1.24   1.32   1.41

Example: A company decided to outsource some parts of their product. Three different companies submitted tenders for the required parts. Three factors are important to select the best fit: cost, reliability of the product, and delivery time of the orders. The prices offered are as follows:
ABC - 100/- per gross
XYZ - 80/- per gross
PQR - 144/- per gross
(1 gross = 12 dozen = 144)

Criteria: Cost, Reliability, Delivery Time. Alternatives under each criterion: ABC, XYZ, PQR.

In terms of price, the companies are compared as follows: XYZ is moderately preferred to ABC and very strongly preferred to PQR, whereas ABC is strongly to very strongly preferred to PQR.
Since XYZ is moderately preferred to ABC, ABC's entry in the XYZ row is 3 and XYZ's entry in the ABC row is 1/3.
Since XYZ is very strongly preferred to PQR, PQR's entry in the XYZ row is 7 and XYZ's entry in the PQR row is 1/7.
Since ABC is strongly to very strongly preferred to PQR, PQR's entry in the ABC row is 6 and ABC's entry in the PQR row is 1/6.
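A sketch (not from the slides) that builds this cost-criterion comparison matrix in Python and computes the priority weights, λmax, and the consistency ratio:

import numpy as np

# Rows/columns in the order ABC, XYZ, PQR, per the comparisons above
A = np.array([[1,   1/3, 6],
              [3,   1,   7],
              [1/6, 1/7, 1]])

eigvals, eigvecs = np.linalg.eig(A)
i = np.argmax(eigvals.real)
lam_max = eigvals.real[i]           # principal eigenvalue
w = np.abs(eigvecs[:, i].real)
w = w / w.sum()                     # priority vector (weights)

n = A.shape[0]
CI = (lam_max - n) / (n - 1)        # consistency index
RI = 0.58                           # random index for n = 3 (table above)
CR = CI / RI                        # accept the matrix if CR < 0.10
print(w, lam_max, CR)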
CHAPTER 05
QUANTITATIVE TECHNIQUES USED IN ADVANCED DECISION MAKING
PIVOT TABLE AND OPTIMIZATION USING SOLVER
Exercises:
3. Compare the monthly selling performance for each salesperson.
4. Draw a pivot chart showing the monthly regional selling status. Change the chart according to product sales.
5. Open student.xlsx. Display the month-wise sum of scores for all subjects and their grand total.
6. Display the highest score for each student.
7. Display the pivot chart for the students' monthly scores.
Exercise: The Data worksheet in the Groceriespt.xlsx file contains more than 900 rows of sales data. Each row contains the number of units sold and the revenue of a product at a store, as well as the month and year of the sale. The product group (fruit, milk, cereal, or ice cream) is also included. You would like to see a breakdown of sales during each year for each product group and product at each store. You would also like to be able to show this breakdown during any subset of months in a given year (for example, what the sales were from January through June). Determine the following using the groceries worksheet.

Exercise: From information in a random sample of 925 people, I know the gender, the age, and the amount these people spent on travel last year. How can I use this data to determine how gender and age influence a person's travel expenditures? What can I conclude about the type of person to whom I should mail the brochure? To understand this data, you need to break it down as follows:
Average amount spent on travel by gender;
Average amount spent on travel for each age group;
Average amount spent on travel by gender for each age group.
Exercise: A bank processes checks seven days a week. The number of workers needed each day to process checks varies; for example, 13 workers are needed on Tuesday, 15 workers are needed on Wednesday, and so on. All bank employees work five consecutive days. Find the minimum number of employees that the bank can have and still meet its labour requirements, based on the following data:

Day worker starts   Mon   Tue   Wed   Thu   Fri   Sat   Sun
Monday              1     1     1     1     1     0     0
Tuesday             0     1     1     1     1     1     0
Wednesday           0     0     1     1     1     1     1
Thursday            1     0     0     1     1     1     1
Friday              1     1     0     0     1     1     1
Saturday            1     1     1     0     0     1     1
Sunday              1     1     1     1     0     0     1
Number needed       17    13    15    17    9     9     12

(The number starting on each day is the set of decision variables; Solver must make the 'number working' on each day >= the 'number needed'.)

Note: In a different (product-mix) model, when you click Solve you may see the message 'Solver could not find a feasible solution'. This message does not mean that you made a mistake in your model but, rather, that with limited resources, you can't meet demand for all products.
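Instead of Excel Solver, the same scheduling model can be sketched as an integer program with SciPy (requires SciPy 1.9+ for the integrality option of linprog):

import numpy as np
from scipy.optimize import linprog

# cover[d][j] = 1 if someone starting on day d works on day j (5 consecutive days)
cover = np.array([[1, 1, 1, 1, 1, 0, 0],    # starts Monday
                  [0, 1, 1, 1, 1, 1, 0],    # starts Tuesday
                  [0, 0, 1, 1, 1, 1, 1],    # starts Wednesday
                  [1, 0, 0, 1, 1, 1, 1],    # starts Thursday
                  [1, 1, 0, 0, 1, 1, 1],    # starts Friday
                  [1, 1, 1, 0, 0, 1, 1],    # starts Saturday
                  [1, 1, 1, 1, 0, 0, 1]])   # starts Sunday
need = np.array([17, 13, 15, 17, 9, 9, 12]) # Mon..Sun requirements

# minimize total starters subject to cover.T @ x >= need, x integer >= 0
res = linprog(c=np.ones(7),
              A_ub=-cover.T, b_ub=-need,
              integrality=np.ones(7), method="highs")
print(res.x, res.fun)    # starters per day, minimum total employees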
CHAPTER 06
DATA ANALYSIS USING MS EXCEL
Advanced Statistical Applications
Excel has various built-in functions that allow you to perform all sorts of statistical calculations. Data analysis is the process used to get results from raw data that can be used to make decisions.
* Numerical summaries
* Measures of location
* Measures of variability
Hypothesis Testing
Data > Data Analysis > t-Test: Two Sample...
H0 = null hypothesis: there is no significant difference.
H1 = alternative hypothesis: there is a significant difference.
As a rule of thumb, you can use the equal-variances test if the ratio of the sample variances is < 3.
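A Python sketch of the same test with SciPy; sample b is hypothetical, and the equal-variance flag follows the rule of thumb above:

import numpy as np
from scipy import stats

a = np.array([21, 42, 5, 11, 30, 50, 28, 27, 24, 52])   # sample 1 (data reused from earlier)
b = np.array([25, 35, 18, 22, 40, 45, 30, 28, 26, 41])  # hypothetical sample 2

va, vb = a.var(ddof=1), b.var(ddof=1)
ratio = max(va, vb) / min(va, vb)                        # variance ratio
t, p = stats.ttest_ind(a, b, equal_var=(ratio < 3))
print(t, p)   # reject H0 (no significant difference) if p < 0.05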
Correlation
Correlation is the extent to which variables in two different data series tend to move together (or apart) over time. A correlation coefficient describes the strength of the correlation between two series; its values lie in the range [-1.0, 1.0].
SUMMARY OUTPUT - Regression Statistics: the regression analysis in the previous example has Multiple R = 0.97.
Data analysts would like to estimate the probabilities of uncertain events accurately. Monte Carlo simulation enables you to model situations that present uncertainty and then play them out on a computer thousands of times.
Many companies use Monte Carlo simulation as an important part of their decision-making process:
General Motors, P&G, and Pfizer use simulation to estimate both the average return and the risk factor of new products.
Lilly uses simulation to determine the optimal plant capacity for each drug.
P&G uses simulation to model and optimally hedge foreign exchange risk.
Organizations use Monte Carlo simulation for forecasting net income, predicting structural and purchasing costs, and determining their susceptibility to different kinds of risks.
Financial planners use Monte Carlo simulation to determine optimal investment strategies for their clients' retirement.
Note: The term Monte Carlo simulation comes from the computer simulations performed during the 1930s and 1940s to estimate the probability that the chain reaction needed for an atom bomb to detonate would work successfully. The scientists involved in this work were big fans of gambling, so they gave the simulations the code name Monte Carlo.
The basic recipe: estimate the probabilities of future events; assign random number ranges to the percentages (probabilities); obtain random numbers; use the random numbers to simulate events.
Each random number is equally likely to assume any value between 0 and 1. Thus, around 25 percent of the time you should get a number less than or equal to 0.25; around 10 percent of the time you should get a number that is at least 0.90; and so on.
An Inventory Control Example: Foslins Housewares
Simulation can be used for models in which the question is 'How much of this should we do?' We will now use an inventory control model to provide an illustration of simulation.
THE OMELET PAN PROMOTION: HOW MANY PANS TO ORDER?
In Foslins, certain sections of the housewares department have just suffered their second consecutive bad year. Due to competition, the gourmet cooking, glassware, stainless flatware, and contemporary dishes sections of Foslins are not generating enough revenue to justify the amount of floor space. To fight back, the chief buyer reorganized the sections that are in trouble to create a 'store within a store'. With these changes, plus the store's reputation for quality and service, she feels that Foslins can effectively compete.
An 'International Dining Month' promotion will be featured in October to introduce the new facility. Five specially made articles (each from a different country) will be featured on sale: for example, a copper omelet pan from France, a set of 12 long-stem wine glasses from Spain, etc.
All items must be ordered 6 months in advance. Any unsold items after October will be sold to a discount chain at a reduced price. If they run out, a more expensive item from the regular line will be substituted.
Consider the special omelet pans:
Buying price: $22.00
Selling price: $35.00
Discounted price: $15.00 (at the end of October)
Regular pans:
Buying price: $32.00
Normal selling price: $65.00
Selling price if substituted: $35.00
For example, suppose you order 1000 pans and the demand turns out to be 1100 pans. In this situation, you would be 100 pans short and would have to buy 100 regular pans and sell them at the sale price in order to make up the deficit. The net profit would be:
$35(1100) − $32(100) − $22(1000) = $13,300
In general, let y = number of pans ordered and D = demand. Then for D > y:
Profit = 35D − 32(D − y) − 22y = 3D + 10y
Now, without knowing the demand for this special product, how many pans should be ordered in advance?
In another scenario, suppose you order 1000 pans and the demand turns out to be 200 pans. In this situation, you would have an excess of 800 pans and would have to sell the additional pans at $15 each and take a loss: $35(200) + $15(800) − $22(1000) = −$3,000. In general, for D ≤ y: Profit = 35D + 15(y − D) − 22y = 20D − 7y.
The spreadsheet that follows assumes an order of 11 omelet pans and a random demand of 8 (i.e., y = 11 and D = 8).
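A Monte Carlo sketch of this model in Python rather than Crystal Ball; the discrete demand distribution below is hypothetical, standing in for the custom distribution entered in the spreadsheet:

import random

def profit(y, d):
    """Net profit for order quantity y and demand d (formulas above)."""
    if d > y:                      # shortage: substitute regular pans
        return 3 * d + 10 * y
    return 20 * d - 7 * y          # excess: unsold pans discounted to $15

demand_dist = {8: 0.2, 9: 0.2, 10: 0.3, 11: 0.2, 12: 0.1}  # hypothetical

def avg_profit(y, trials=10_000):
    demands = random.choices(list(demand_dist),
                             weights=demand_dist.values(), k=trials)
    return sum(profit(y, d) for d in demands) / trials

for y in range(8, 13):
    print(y, avg_profit(y))        # pick the order quantity with the best average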
Now, click on cell B5. Next, click on the Define Assumptions icon, choose Custom distribution, and click OK.
Now, enter the spreadsheet cell range where the discrete distribution was placed and click OK. After clicking OK, the distribution will be displayed.
To determine the expected profit through the use of the simulations, click on cell B11 and then click on the Define Forecast icon. In order to use simulation to calculate the average profit, first generate a number of trials, setting y = 11. The profit that results on any given trial depends on the value of demand generated on that trial; the average profit over all trials is the expected profit. To do this, click on the Run Preferences icon and change the Maximum Number of Trials to 500. Click OK.
If not already selected, choose Large as the window size and When Stopped (Faster). Click OK. Next, click on the Start Simulation icon; after Crystal Ball has run the 500 iterations, a results dialog will appear. To look at the statistics from the simulation, go to the Crystal Ball View menu and choose Statistics.
Expected Value versus Order Quantity. To calculate the true expected profit using the spreadsheet and Crystal Ball, simply enter each demand in cell B5 (one at a time), run the simulation, and then record the average profit. These average profits are then multiplied by their respective probabilities; the sum of these values gives the true expected profit.
Simulated versus Expected Profits. For any particular order quantity, the average profit generated by the spreadsheet simulator does not equal the true expected profit. The implication of this fact for the process of making a decision is interesting: based on the maximum profit, your decision would be to order 10 pans according to the true expected profit, or 11 pans according to the simulated average profit.
The previous example illustrates that simulation, in general, is not guaranteed to achieve optimality. A simple way to increase the likelihood of achieving optimality is to increase the number of trials. With simulation, your decision may be wrongly identified if care is not taken to simulate a sufficient number of trials. In a real problem you would not both calculate the true expected profit and use simulation to calculate an average profit.
RECAPITULATION
To summarize:
1. A spreadsheet simulator takes parameters and decisions as inputs and yields a performance measure(s) as output.
2. Each iteration of the spreadsheet simulator will generally yield a different value for the performance measure.
Missing data may be due to:
equipment malfunction;
data inconsistent with other recorded data and thus deleted;
data not entered due to misunderstanding;
certain data not considered important at the time of entry;
failure to register the history or changes of the data.
How to handle missing data?
Fill in the missing value manually: tedious + infeasible?
Use a global constant to fill in the missing value: e.g., 'unknown', a new class?!
Use the attribute mean to fill in the missing value.
Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula or a decision tree.
Noisy Data
Noise: random error or variance in a measured variable. Incorrect attribute values may be due to:
faulty data collection instruments;
data entry problems;
data transmission problems;
technology limitations;
inconsistency in naming conventions.
Other data problems which require data cleaning: duplicate records, incomplete data, inconsistent data.
How to Handle Noisy Data?
Binning method: first sort the data and partition it into (equi-depth) bins; then smooth by bin means, bin medians, or bin boundaries.
Clustering: detect and remove outliers.
Combined computer and human inspection: detect suspicious values and have a human check them.
Regression: smooth by fitting the data to regression functions.
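A sketch of equi-depth binning with smoothing by bin means and by bin boundaries (the nine data values are illustrative):

data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
depth = 3
bins = [data[i:i + depth] for i in range(0, len(data), depth)]

# smoothing by bin means: every value becomes its bin's mean
by_means = [[round(sum(b) / len(b), 1)] * len(b) for b in bins]
# smoothing by bin boundaries: every value snaps to the nearest bin edge
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]
print(bins)       # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(by_means)   # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(by_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]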
Data Transformation: Normalization
Min-max normalization:
v' = (v − minA) / (maxA − minA)
Z-score normalization:
v' = (v − meanA) / stand_devA
Normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1.

Data Preprocessing (outline):
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
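The three normalization formulas as a Python sketch (the sample vector is illustrative):

import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

min_max = (v - v.min()) / (v.max() - v.min())   # maps values onto [0, 1]
z_score = (v - v.mean()) / v.std(ddof=1)        # mean 0, unit variance

j = 0                                           # decimal scaling: smallest j
while np.abs(v / 10 ** j).max() >= 1:           # such that max(|v'|) < 1
    j += 1
decimal = v / 10 ** j

print(min_max, z_score, decimal, sep="\n")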
Data Reduction Strategies
A warehouse may store terabytes of data: complex data analysis and mining may take a very long time to run on the complete data set.
Data Cube Aggregation: the lowest level of a data cube holds the aggregated data for an individual entity of interest.
The steps in the knowledge discovery process:
Data Cleaning - in this step, noise and inconsistent data are removed.
Data Integration - in this step, multiple data sources are combined.
Data Selection - in this step, data relevant to the analysis task are retrieved from the database.
Data Transformation - in this step, data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.
Data Mining - in this step, intelligent methods are applied in order to extract data patterns.
Pattern Evaluation - in this step, data patterns are evaluated.
Knowledge Presentation - in this step, knowledge is represented.

Data Integration: a data preprocessing technique that merges data from multiple heterogeneous data sources into a coherent data store. Data integration may involve inconsistent data and therefore needs data cleaning.
Data Cleaning: a technique applied to remove noisy data and correct inconsistencies in the data. Data cleaning involves transformations to correct wrong data; it is performed as a data preprocessing step while preparing the data for a data warehouse.
Data Selection: the process in which data relevant to the analysis task are retrieved from the database. Sometimes data transformation and consolidation are performed before the data selection process.
Clusters: a cluster is a group of similar objects. Cluster analysis forms groups of objects that are very similar to each other but highly different from the objects in other clusters.
Data Transformation: data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.

(Figure: the knowledge discovery process pipeline - Databases -> Data Cleaning and Data Integration -> Preprocessed Data -> Selection -> Task-relevant Data -> Data Transformation -> Data Mining -> Knowledge Interpretation. Data mining is the core of the knowledge discovery process.)