Selected Topics in Assurance Related Technologies
START 2002-2, E-ASSUME
Volume 8, Number 2
A publication of the DoD Reliability Analysis Center

Statistical Assumptions of an Exponential Distribution

Table of Contents
• Introduction
• Putting the Problem in Perspective
• Statistical Assumptions and their Implications
• Practical Methods to Verify Exponential Assumptions
• Summary
• Bibliography
• About the Author
• Other START Sheets Available

Introduction
This START sheet discusses some empirical and practical methods for checking and verifying the statistical assumptions of an Exponential distribution, and presents several numerical and graphical examples showing how these methods are used. Most statistical methods (of parametric statistics) assume an underlying distribution in deriving their results. (Methods that do not assume an underlying distribution are called non-parametric, or distribution-free, and will be the topic of a separate paper.)

Whenever we assume that the data follow a specific distribution, we also assume risk. If the assumption is invalid, then the confidence levels of the confidence intervals (or of the hypothesis tests) will be incorrect. The consequences of assuming the wrong distribution may prove very costly. The way to deal with this problem is to check distribution assumptions carefully, using the practical methods discussed in this paper.

There are two approaches to checking distribution assumptions. One is to use Goodness of Fit (GoF) tests: numerically involved, theoretical tests such as the Chi-Square, Anderson-Darling, and Kolmogorov-Smirnov. These are all based on complex statistical theory and usually require lengthy calculations, which in turn call for specialized software that is not always readily available.

Alternatively, there are many practical procedures that are easy to understand and implement. They are based on intuitive and graphical properties of the distribution that we wish to assess, and can thus be used to check and validate these distribution assumptions. The implementation and interpretation of such procedures, for the important case of the Exponential distribution, so prevalent in quality and reliability theory and practice, are discussed in this paper.

Also addressed in this START sheet are some problems associated with checking the Exponential distribution assumption. First, a numerical example illustrates the seriousness of this problem. Then, additional numerical and graphical examples show how to implement such distribution checks and how to deal with related problems.

Putting the Problem in Perspective
Assume that we are given the task of estimating the mean life of a device. We may provide a simple point estimator, the sample mean, which by itself does not convey much useful information. Or we may provide a more informative estimator: the confidence interval (CI). This latter estimator consists of two values, the CI upper and lower limits, such that the unknown mean life, m, will be in this range with a prescribed coverage probability (1 - α). For example, we say that the life of a device is between 90 and 110 hours with probability 0.95 (or that there is a 95% chance that the interval 90 to 110 covers the device's true mean life, m).

The accuracy of CI estimators depends on the quality and quantity of the available data. However, we also need a statistical model that is consistent with and appropriate for the data. For example, to establish a CI for the mean life of a device we need, in addition to sufficiently good test data, to know or assume a statistical distribution (e.g., Normal, Exponential, Weibull) that actually fits these data and this problem.

Every parametric statistical model is based upon certain assumptions that must be met for it to hold true. In our discussion, and for the sake of illustration, consider only two possibilities: that the distribution of the lives (times to failure) is Normally distributed (Figure 1) and that it is
Exponentially distributed (Figure 2). The figures were obtained using 2000 data points, generated from each of these two distributions, with the same mean = 100 (and, for the Normal, with a Standard Deviation of 20).

Figure 1. Normal Distribution of Times to Failure (histogram of frequency vs. times to failure)

Figure 2. Exponential Distribution of Times to Failure (histogram of frequency vs. times to failure)

There are practical consequences of the data fitting one or the other of these two different distributions. "Normal lives" are symmetric about 100 and concentrated in the range of 40 to 160 (three standard deviations on each side of the mean, which comprises 99% of the population). "Exponential lives", on the other hand, are right-skewed, with a relatively large proportion of device lives much smaller than 40 units and a small proportion of device lives larger than 200 units.

To highlight the consequences of choosing the wrong distribution, consider a sample of n = 10 data points (Table 1). We will obtain a 95% CI for the mean of these data, using two different distribution assumptions: Exponential and Normal.

Table 1. Small Sample Data Set
5.950   119.077   366.074   155.848   30.534
20.615   15.135     3.590   103.713  120.859

The statistic "sample average", x̄ = 94.14, will follow a different sampling distribution according to whether the Normal or the Exponential distribution is assumed for the population. Hence, the data will be processed twice, each time using a different "formula". This, in turn, will produce two different CIs that will exhibit different confidence probabilities.

Normal Assumption. If the original device lives are assumed Normally distributed (with σ = 20), the 95% CI for the device mean life m, based on the Normal distribution, is:

(81.7, 106.5)

Exponential Assumption. If, however, the device lives are assumed Exponential, then the 95% CI for the mean life θ, based on the Exponential, is:

(55.11, 196.3)

We leave the details of obtaining these two specific statistics or "formulas" for another paper.

Since in reality the ten data points come from the Exponential, only the CI (55.11, 196.3) is correct, and its coverage probability (95%) is the one prescribed. Had we erroneously assumed Normality, the CI obtained under this assumption, for this small sample, would have been incorrect. Moreover, its true coverage probability (confidence) would be unknown, and every policy derived under such an unknown probability is at risk.

This example illustrates and underlines how important it is to establish the validity (or at least the strong plausibility) of the underlying statistical distribution of the data.
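The two intervals above can be reproduced with standard textbook constructions. The paper defers its "formulas" to another paper, so the sketch below is an assumption about what was used: a known-sigma Normal interval and the chi-square pivot for the Exponential mean. A short Monte Carlo run also illustrates the "unknown coverage" point, here using a t-interval to stand in for Normal theory when σ is not given.

```python
import math
import random

from scipy import stats

# Table 1 data (n = 10)
data = [5.950, 119.077, 366.074, 155.848, 30.534,
        20.615, 15.135, 3.590, 103.713, 120.859]
n = len(data)
xbar = sum(data) / n                      # sample average, ~94.14

# Normal assumption (sigma = 20 known): xbar +/- z * sigma / sqrt(n)
z = stats.norm.ppf(0.975)
half = z * 20 / math.sqrt(n)
ci_normal = (xbar - half, xbar + half)    # ~(81.7, 106.5)

# Exponential assumption: 2*n*xbar/theta is chi-square with 2n d.f.,
# giving the CI (2*n*xbar / chi2_upper, 2*n*xbar / chi2_lower)
chi_hi = stats.chi2.ppf(0.975, 2 * n)
chi_lo = stats.chi2.ppf(0.025, 2 * n)
ci_exp = (2 * n * xbar / chi_hi, 2 * n * xbar / chi_lo)  # ~(55.1, 196.3)

# Monte Carlo check of coverage when the data really are Exponential
# with mean 100: the Exponential interval covers ~95% of the time,
# while the Normal-theory (t) interval falls short of its nominal 95%.
random.seed(1)
reps = 20_000
tq = stats.t.ppf(0.975, n - 1)
cover_norm = cover_exp = 0
for _ in range(reps):
    sample = [random.expovariate(1 / 100) for _ in range(n)]
    m = sum(sample) / n
    s = math.sqrt(sum((x - m) ** 2 for x in sample) / (n - 1))
    h = tq * s / math.sqrt(n)
    cover_norm += (m - h <= 100 <= m + h)
    cover_exp += (2 * n * m / chi_hi <= 100 <= 2 * n * m / chi_lo)

cover_norm /= reps   # noticeably below 0.95
cover_exp /= reps    # ~0.95, as prescribed
```

The chi-square construction reproduces (55.1, 196.3) because, for Exponential lives, twice the total time on test divided by the true mean follows a chi-square distribution with 2n degrees of freedom.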
Statistical Assumptions and their Implications
Every statistical model has its own "assumptions" that have to be verified and met in order to provide valid results. In the Exponential case, the CI for the mean life of a device requires two "assumptions": that the lives of the tested devices are (1) independent and (2) Exponentially distributed. These two statistical assumptions must be met (and verified) for the corresponding CI to cover the true mean with the prescribed probability. But if the data do not follow the assumed distribution, the CI coverage probability (or its confidence) may be totally different from the one prescribed.

Fortunately, the assumptions for all distribution models (e.g., Normal, Exponential, Weibull, Lognormal, etc.) have practical and useful implications. Hence, having some background information about a device may help us assess its life distribution.

A case in question occurs with the assumption that the distribution of the lives of a device is Exponential. An implication of the Exponential is that the device failure rate is constant. In practice, the presence of a constant failure rate may be confirmed
from observing the times between failures of a process where failures occur at random times.

In general, if we observe any process composed of events that occur at random times (say lightning strikes, coal mine accidents, earthquakes, fires, etc.), the times between these events will be Exponentially distributed. The probability of occurrence of the next event is independent of the occurrence time of the past event. As a result, phrases such as "old is as good as new" have a valid meaning. [It is important to note that although failures may occur at random times, they do not occur for "no reason." Every failure has an underlying cause.]

In what follows, we will use statistical properties derived from these Exponential distribution implications to validate the Exponential assumption.
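The "old is as good as new" (memoryless) property can be illustrated numerically: for Exponential lives, the probability of surviving an additional 100 hours is the same for a brand-new unit as for one that has already survived 50 hours. The simulation below is our own sketch, not part of the paper, and uses only the Python standard library.

```python
import random

random.seed(42)
mean_life = 100.0
n = 200_000
# random.expovariate takes the rate (1/mean) as its argument
lives = [random.expovariate(1 / mean_life) for _ in range(n)]

# P(X > 100): a new unit survives 100 hours
p_new = sum(x > 100 for x in lives) / n

# P(X > 150 | X > 50): a 50-hour-old unit survives 100 more hours
survivors = [x for x in lives if x > 50]
p_old = sum(x > 150 for x in survivors) / len(survivors)

# Both fractions are close to exp(-100/100) = 0.368:
# the surviving units are statistically "as good as new"
```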

Practical Methods to Verify Exponential Assumptions
Several empirical and practical methods can be used to establish the validity of the Exponential distribution. We will illustrate the process of validating the Exponential assumptions using the life test data in Table 2. This larger sample (n = 45) was generated following the same process used to generate the previous, smaller sample (n = 10) presented in Table 1.

Table 2. Large Sample Life Data Set
12.411    58.526    46.684    49.022    77.084     7.400
21.491    28.637    16.263    53.533    93.241    43.911
33.771    78.954   399.071   102.947   118.077    61.894
72.435   108.561    46.252    40.479    95.291    10.291
27.668   116.729   149.432    59.067   199.458    45.771
272.005   60.266   233.254    87.592   137.149    50.668
89.601   313.879   150.011   173.580   220.413   182.737
6.171    162.792    82.273

In this data set, two distribution assumptions need to be verified or assessed: (1) that the data are independent and (2) that they are identically distributed as an Exponential.

The assumption of independence implies that randomization (sampling) of the population of devices (and of other influencing factors) must be performed before placing them on test. For example, device operators, times of operation, weather conditions, locations of devices in warehouses, etc., should be randomly selected so that they become representative of these characteristics and of the environment in which the devices will normally be operated.

Having knowledge about the product and its testing procedure can help in assessing that the observations are independent and representative of the population from which they come, and this establishes the first of the two distribution assumptions.

To assess the exponentiality of the data, we use several informal methods based on the properties of the Exponential distribution. They are practical for the engineer because they are largely intuitive and easy to implement.

To assess the data in Table 2 using this more practical approach, we first obtain their descriptive statistics (Table 3). Then, we analyze and plot the raw data in several ways, to check (empirically but efficiently) whether the Exponential assumption holds.

Table 3. Descriptive Statistics of Data in Table 2
Variable    n    Mean   Median   Std. Dev.
Exp. Data   45   99.9   77.1     85.6

Here the Mean is the average of the data, and the Standard Deviation is the square root of:

S² = Σ(xᵢ - x̄)² / (n - 1)

It is worth noticing that the sample mean and standard deviation are computed the same way irrespective of the underlying distribution. What change are the statistical properties of these values, a fact that can be used to help identify the distribution in question.

There are a number of useful and easy-to-implement procedures, based on well-known statistical properties of the Exponential distribution, that help us informally assess this assumption. These properties are summarized in Table 4.

Table 4. Some Properties of the Exponential Distribution
1. The theoretical mean and standard deviation are equal¹; hence, the sample mean and standard deviation should be close.
2. The histogram should show that the distribution is right-skewed (Median < Mean).
3. A plot of Cumulative Failures vs. Cumulative Time should be close to linear.
4. The regression slope of Cum-Failures vs. Cum-Time should be close to the failure rate.
5. A plot of Cum-Rate vs. Cum-Failures should decrease and then stabilize at the failure rate level.
6. Plots of the Exponential probabilities and of the Exponential scores should also be close to linear.

¹ Although the Exponential is a one-parameter distribution, it does have a standard deviation; the Cauchy is a well-known example of a distribution that does not.

First, from the descriptive statistics in Table 3, we verify that the Mean (99.9) and the Standard Deviation (85.6) are close, and that the Median (77.1) is smaller than the Mean. This agrees with Exponential distribution Properties No. 1 and No. 2 from Table 4.
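The Table 3 figures can be checked directly. A quick sketch with Python's standard statistics module (Property No. 1: mean close to standard deviation; Property No. 2: median smaller than mean):

```python
import statistics

# Table 2 life data (n = 45)
data = [12.411, 58.526, 46.684, 49.022, 77.084, 7.400,
        21.491, 28.637, 16.263, 53.533, 93.241, 43.911,
        33.771, 78.954, 399.071, 102.947, 118.077, 61.894,
        72.435, 108.561, 46.252, 40.479, 95.291, 10.291,
        27.668, 116.729, 149.432, 59.067, 199.458, 45.771,
        272.005, 60.266, 233.254, 87.592, 137.149, 50.668,
        89.601, 313.879, 150.011, 173.580, 220.413, 182.737,
        6.171, 162.792, 82.273]

mean = statistics.mean(data)      # ~99.9
median = statistics.median(data)  # 77.084, i.e., ~77.1
sd = statistics.stdev(data)       # ~85.6 (sample form, divides by n - 1)

# Property 1: mean and standard deviation close; Property 2: median < mean
```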
The theoretical Exponential standard deviation σ is equal to the mean. Hence, one standard deviation above and below the mean yields a range of population values from 0 to 2σ, which comprises the majority (86%) of the population values (see Figure 3). For reference, in the Normal distribution one standard deviation above and below the mean comprises only 68% of the population. The proportions of sample points falling in these ranges should be commensurate with these percentages, and provide an indication of which distribution the sample comes from (especially in large samples).

Figure 3. Histogram is Skewed to the Right, as in an Exponential Distribution (Property No. 2). (Frequency vs. cumulative failure times; the Median falls below the Mean.)

If we regress the Cumulative Failures on the Cumulative Test Time (Table 5), the result is a straight line (Figure 4) whose slope (0.00931) is close to the true failure rate (0.01) (Property No. 4).

Table 5. Cum-Fail vs. Cum-Time Regression Analysis
Predictor   Coeff.      Std. Dev.   T       P
Constant    5.8048      0.5702      10.18   0.000
Cum-Time    0.0093135   0.0002478   37.59   0.000

The regression equation is:

Cum-Fail = 5.80 + 0.00931 Cum-Time
S = 2.283   R-Sq = 97.0%

Figure 4. Plot of Cum-Failures vs. Cum-Time is Close to Linear, as Expected in an Exponential Distribution (Property No. 3). (Cumulative number of failures vs. cumulative test time.)

Notice how the regression in Table 5 is significant, as shown by the large T (= 37.59) test value for the regression coefficient and by the large Index of Fit (R² = 0.97). Both results suggest that a linear regression with slope equal to the failure rate is plausible.

The Cumulative Failure Rate (at failure I) is obtained by dividing the Cumulative Failures (up to I) by the Cumulative Test Time (at failure I), for I = 2, …, n. When the Cumulative Failure Rate is plotted vs. the Cumulative Failures, it soon stabilizes to a constant value (the true Failure Rate = 0.01), as expected in the case of the Exponential distribution (Property No. 5); see Figure 5.

Figure 5. Plot of Cum-Rate vs. Cum-Failures Stabilizes (flat) and Converges to the Exponential Failure Rate (Close to the Value 0.01) (Property No. 5). (Failure rate vs. cumulative number of failures.)
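Properties No. 3 to No. 5 can be checked without regression software. The sketch below rests on our reading of the construction, which the paper does not spell out: the Table 2 values, taken in the order listed, are treated as successive times between failures, so the cumulative test time is their running sum.

```python
# Table 2 values, read row by row, taken as successive times between failures
times = [12.411, 58.526, 46.684, 49.022, 77.084, 7.400,
         21.491, 28.637, 16.263, 53.533, 93.241, 43.911,
         33.771, 78.954, 399.071, 102.947, 118.077, 61.894,
         72.435, 108.561, 46.252, 40.479, 95.291, 10.291,
         27.668, 116.729, 149.432, 59.067, 199.458, 45.771,
         272.005, 60.266, 233.254, 87.592, 137.149, 50.668,
         89.601, 313.879, 150.011, 173.580, 220.413, 182.737,
         6.171, 162.792, 82.273]

n = len(times)
cum_time, total = [], 0.0
for t in times:
    total += t
    cum_time.append(total)
cum_fail = list(range(1, n + 1))

# Property 4: least-squares slope of Cum-Fail on Cum-Time ~ failure rate
mx = sum(cum_time) / n
my = sum(cum_fail) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(cum_time, cum_fail))
         / sum((x - mx) ** 2 for x in cum_time))

# Property 5: cumulative failure rate = failures / test time,
# which should settle near 1/mean = 0.01
rates = [f / t for f, t in zip(cum_fail, cum_time)]
```

The final cumulative rate is 45 failures over roughly 4497 hours of test time, i.e., very close to the true rate of 0.01, regardless of the ordering assumption.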
The Probability Plot is one where the Exponential probability (PI) is plotted vs. I/(n + 1), where I is the data sequence order, i.e., I = 1, …, 45. Each PI is obtained by calculating the Exponential probability of the corresponding sorted failure data point XI, using the sample mean (see Figure 6). For example, the first sorted (smallest) data point is 6.17 and the sample average is 99.9:

P99.9(6.17) = 1 - exp(-6.17/99.9) = 1 - 0.94 = 0.06

which is then plotted against the corresponding I/(n + 1) value: 1/46 = 0.0217, and so on, until all other sorted sample elements, I = 1, …, 45, have been considered.

The Exponential scores XI are the percentiles corresponding to the values I/(n + 1), for I = 1, …, n, calculated under the Exponential distribution (assuming the sample mean). For the same example, the first I/(n + 1) is 1/46 = 0.0217 and the sample average is 99.9. Then:

P99.9(XI) = 1 - exp(-XI/99.9)  ⟹  XI = -99.9 ln(1 - P99.9(XI))
where:

P99.9(XI) ≈ I/(n + 1)

Figure 6. Plot of the Exponential Probability (PI) vs. I/(n + 1), I = 1, …, n, is Close to Linear, as Expected When the Data Come from an Exponential Distribution (Property No. 6)

Substituting I/(n + 1) for I = 1 into the above formula, we get the first Exponential score:

XI = -99.9 ln(1 - 1/46) = -99.9 ln(0.9783) = -99.9 × (-0.022) = 2.2

The scores are then plotted vs. their corresponding sorted real data values (in the case above, 2.2 is plotted against 6.17, the smallest data point). When the data come from an Exponential distribution, this plot is close to a straight line (Property No. 6); see Figure 7.

Figure 7. Plot of the Exponential Scores vs. the Sorted Real Data Values

All of the preceding empirical results support the plausibility of the assumption of Exponentiality of the given life data. If, at this point, a stronger case for the validity of the Exponential distribution is required, then a number of theoretical GoF tests can be carried out with the data.

A final comment about distribution assumptions and engineering work is due. In practice, engineers do not solve ideal problems but real and complicated ones, whose settings are not perfect. Consequently, some statistical assumptions may not be met. This does not, however, necessarily preclude the use of statistical procedures.

In such cases, some assumptions may have to be relaxed, and some of the inferences (results) may have to be interpreted with care and used with special caution. The best criteria for establishing such relaxations and interpretations (e.g., which assumptions can be relaxed and by how much) often come from a thorough knowledge of the underlying engineering and statistical theories, from extensive professional experience, and from a deep understanding of the specific processes under consideration.

Summary
This START sheet discussed the important problem of (empirically) assessing the Exponential distribution assumptions. Several numerical and graphical examples were provided, together with some related theoretical and practical issues, and some background information and references for further reading.

Other, very important, reliability analysis topics were mentioned in this paper. Due to their complexity, these will be treated in more detail in separate, forthcoming START sheets.

Bibliography
1. Coppola, A., Practical Statistical Tools for Reliability Engineers, RAC, 1999.
2. Romeu, J.L. and C. Grethlein, A Practical Guide to Statistical Analysis of Material Property Data, AMPTIAC, 2000.
3. Sadlon, R.J., Mechanical Applications in Reliability Engineering, RAC, 1993.
4. Kececioglu, D. (Editor), Reliability and Life Testing Handbook (Vols. 1 & 2), Prentice Hall, NJ, 1993.
About the Author
Dr. Jorge Luis Romeu has over thirty years of statistical and operations research experience in consulting, research, and teaching. He was a consultant for the petrochemical, construction, and agricultural industries. Dr. Romeu has also worked in statistical and simulation modeling and in data analysis of software and hardware reliability, software engineering, and ecological problems.

Dr. Romeu has taught undergraduate and graduate statistics, operations research, and computer science in several American and foreign universities. He teaches short, intensive professional training courses. He is currently an Adjunct Professor of Statistics and Operations Research for Syracuse University and a Practicing Faculty of that school's Institute for Manufacturing Enterprises.

For his work in education and research and for his publications and presentations, Dr. Romeu has been elected Chartered Statistician Fellow of the Royal Statistical Society, Full Member of the Operations Research Society of America, and Fellow of the Institute of Statisticians.

Romeu has received several international grants and awards, including a Fulbright Senior Lectureship and a Speaker Specialist Grant from the Department of State, in Mexico. He has extensive experience in international assignments in Spain and Latin America and is fluent in Spanish, English, and French.

Romeu is a senior technical advisor for reliability and advanced information technology research with IIT Research Institute (IITRI). Since joining IITRI in 1998, Romeu has provided consulting for several statistical and operations research projects. He has written a State of the Art Report on Statistical Analysis of Materials Data, designed and taught a three-day intensive statistics course for practicing engineers, and written a series of articles on statistics and data analysis for the AMPTIAC Newsletter and RAC Journal.

Other START Sheets Available
Many Selected Topics in Assurance Related Technologies (START) sheets have been published on subjects of interest in reliability, maintainability, quality, and supportability. START sheets are available on-line in their entirety at <http://rac.iitri.org/DATA/START>.

For further information on RAC START Sheets contact the:

Reliability Analysis Center
201 Mill Street
Rome, NY 13440-6916
Toll Free: (888) RAC-USER
Fax: (315) 337-9932

or visit our web site at:

<http://rac.iitri.org>

About the Reliability Analysis Center


The Reliability Analysis Center is a Department of Defense Information Analysis Center (IAC). RAC serves as a government
and industry focal point for efforts to improve the reliability, maintainability, supportability and quality of manufactured com-
ponents and systems. To this end, RAC collects, analyzes, archives in computerized databases, and publishes data concerning
the quality and reliability of equipments and systems, as well as the microcircuit, discrete semiconductor, and electromechani-
cal and mechanical components that comprise them. RAC also evaluates and publishes information on engineering techniques
and methods. Information is distributed through data compilations, application guides, data products and programs on comput-
er media, public and private training courses, and consulting services. Located in Rome, NY, the Reliability Analysis Center is
sponsored by the Defense Technical Information Center (DTIC). Since its inception in 1968, the RAC has been operated by IIT
Research Institute (IITRI). Technical management of the RAC is provided by the U.S. Air Force's Research Laboratory
Information Directorate (formerly Rome Laboratory).
