Calculating and Using Basic Statistics: Standard Practice For
1. Scope
1.1 This practice covers methods and equations for computing and presenting basic descriptive statistics using a set of
sample data containing a single variable or two variables. This
practice includes simple descriptive statistics for variable data,
tabular and graphical methods for variable data, and methods
for summarizing simple attribute data. Some interpretation and
guidance for use is also included.
2. Referenced Documents
2.1 ASTM Standards:2
E178 Practice for Dealing With Outlying Observations
E456 Terminology Relating to Quality and Statistics
E2282 Guide for Defining the Test Result of a Test Method
2.2 ISO Standards:3
ISO 3534-1 Statistics - Vocabulary and Symbols, Part 1:
Probability and General Statistical Terms
ISO 3534-2 Statistics - Vocabulary and Symbols, Part 2:
Applied Statistics
3. Terminology
3.1 Definitions:
1
This practice is under the jurisdiction of ASTM Committee E11 on Quality and
Statistics and is the direct responsibility of Subcommittee E11.10 on Sampling /
Statistics.
Current edition approved June 1, 2014. Published January 2015. Originally
approved in 2007. Last previous edition approved in 2014 as E2586-13. DOI:
10.1520/E2586-14.
2
For referenced ASTM standards, visit the ASTM website, www.astm.org, or
contact ASTM Customer Service at [email protected]. For Annual Book of ASTM
Standards volume information, refer to the standards Document Summary page on
the ASTM website.
3
Available from American National Standards Institute (ANSI), 25 W. 43rd St.,
4th Floor, New York, NY 10036, http://www.ansi.org.
Copyright ASTM International, 100 Barr Harbor Drive, PO Box C700, West Conshohocken, PA 19428-2959. United States
E2586 14
$$ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{(n - 1)\, s_x s_y} $$
(1)
(2)
4.8 While the methods described in this practice may be
used to summarize any set of observations, the results obtained
by using them may be of little value from the standpoint of
interpretation unless the data quality is acceptable and satisfies
certain requirements. To be useful for inductive generalization,
any sample of observations that is treated as a single group for
presentation purposes must represent a series of measurements,
all made under essentially the same test conditions, on a
material or product, all of which have been produced under
essentially the same conditions. When these criteria are met,
we are minimizing the danger of mixing two or more distinctly
different sets of data.
4.8.1 If a given collection of data consists of two or more
samples collected under different test conditions or representing material produced under different conditions (that is,
different populations), it should be considered as two or more
separate subgroups of observations, each to be treated independently in a data analysis program. Merging of such
subgroups, representing significantly different conditions, may
lead to a presentation that will be of little practical value.
Briefly, any sample of observations to which these methods are
applied should be homogeneous or, in the case of a process,
have originated from a process in a state of statistical control.
(3)
5. Characteristics of Populations
5.1 A population is the totality of a set of items under
consideration. Populations may be finite or unlimited in size
and may be existing or continuing to emerge as, for example,
in a process. For continuous variables, X, representing an
essentially unlimited population or a process, the population is
mathematically characterized by a probability density function,
f(x). The density function visually describes the shape of the
distribution as for example in Fig. 1. Mathematically, the only
requirements of a density function are that its ordinates be all
positive and that the total area under the curve be equal to 1.
5.1.1 Area under the density function curve is equivalent to
probability for the variable X. The probability that X shall occur
between any two values, say s and t, is given by the area under
the curve bounded by the two given values of s and t. This is
expressed mathematically as a definite integral over the density
function between s and t:
$$ P(s < X \le t) = \int_{s}^{t} f(x)\,dx \qquad (4) $$
$$ \mu = \int_{-\infty}^{\infty} x\, f(x)\,dx \qquad (5) $$
$$ \sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx \qquad (6) $$
4
In the same way a straight line, y = mx + b, has parameters referred to as the
slope, m, and y-intercept, b. Once these parameters are known, the line is completely
known and may be drawn precisely.
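Interval      Area
μ ± 1σ        0.68270
μ ± 2σ        0.95450
μ ± 3σ        0.99730
μ ± 4σ        0.99994

Area          Interval
0.50          μ ± 0.674σ
0.90          μ ± 1.645σ
0.95          μ ± 1.960σ
0.99          μ ± 2.576σ

$$ \gamma_1 = \frac{1}{\sigma^3}\int_{-\infty}^{\infty} (x - \mu)^3 f(x)\,dx \qquad (7) $$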
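The tabulated areas and multipliers for the normal distribution can be checked numerically. The following Python sketch assumes the standard library class `statistics.NormalDist` (Python 3.8 or later); the helper names `area_within` and `half_width` are illustrative and are not part of this practice.

```python
from statistics import NormalDist

# Standard normal distribution (mean 0, standard deviation 1).
STD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def area_within(k: float) -> float:
    """Area under the normal density between mu - k*sigma and mu + k*sigma."""
    return STD_NORMAL.cdf(k) - STD_NORMAL.cdf(-k)


def half_width(area: float) -> float:
    """Multiplier z such that mu +/- z*sigma encloses the given central area."""
    return STD_NORMAL.inv_cdf(0.5 + area / 2.0)


if __name__ == "__main__":
    for k in (1, 2, 3, 4):
        print(f"mu +/- {k} sigma: area = {area_within(k):.5f}")
    for a in (0.50, 0.90, 0.95, 0.99):
        print(f"area {a:.2f}: mu +/- {half_width(a):.3f} sigma")
```

Because the area is obtained as a difference of cumulative probabilities, the same two calls give the probability for any interval (s, t], not only symmetric ones.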
Distribution Form        Skewness    Kurtosis
Normal                   0           0
Exponential              2           6
Uniform                  0           -1.2
Poisson^A                1/√λ        1/λ
Student's t^B            0           6/(ν - 4)
Weibull^C, β = 3.6       0           -0.28
Weibull, β = 0.5         6.62        84.72
Weibull, β = 50.0        -1          1.9

$$ \gamma_2 = \frac{1}{\sigma^4}\int_{-\infty}^{\infty} (x - \mu)^4 f(x)\,dx - 3 \qquad (8) $$
5
The boldface numbers in parentheses refer to a list of references at the end of
this standard.
TABLE 3 Values of the Constant, d2, for Converting the Sample Range into an Estimate of Standard Deviation^A

n     d2        n     d2        n     d2
2     1.128     7     2.704     12    3.258
3     1.693     8     2.847     13    3.336
4     2.059     9     2.970     14    3.407
5     2.326     10    3.078     15    3.472
6     2.534     11    3.173     16    3.532

A Source: ASTM Manual on Presentation of Data and Control Chart Analysis (2).

$$ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad (9) $$
(10)
(11)
$$ R = x_{(n)} - x_{(1)} \qquad (12) $$
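As an illustration of how the d2 constant is used, the sketch below divides the sample range by the tabulated d2 for the sample size. The dictionary `D2` copies the Table 3 values; the function names are hypothetical, and the estimate is only meaningful under the usual assumption of a small sample from a normal population.

```python
# d2 constants copied from Table 3, keyed by sample size n.
D2 = {
    2: 1.128, 3: 1.693, 4: 2.059, 5: 2.326, 6: 2.534,
    7: 2.704, 8: 2.847, 9: 2.970, 10: 3.078, 11: 3.173,
    12: 3.258, 13: 3.336, 14: 3.407, 15: 3.472, 16: 3.532,
}


def sample_range(values):
    """Range R = largest order statistic minus smallest order statistic."""
    return max(values) - min(values)


def sigma_from_range(values):
    """Estimate the standard deviation as R / d2 for sample sizes 2 to 16."""
    n = len(values)
    if n not in D2:
        raise ValueError("d2 is tabulated here only for n = 2 through 16")
    return sample_range(values) / D2[n]


if __name__ == "__main__":
    data = [578, 572, 570, 568, 572, 570, 570, 572, 576, 584]
    print(round(sigma_from_range(data), 3))
```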
6
Several alternatives to the mean rank equation i/(n + 1) are available (3),
including the median rank and Kaplan-Meier methods. An equation for the exact
median rank is available but is computationally intensive. The Benard approximation equation to the median rank, (i - 0.3)/(n + 0.4), is widely used. The modified
Kaplan-Meier equation is (i - 0.5)/n.
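The three plotting-position rules named in this footnote can be compared side by side. This is a minimal sketch; the function names are illustrative only.

```python
def mean_rank(i: int, n: int) -> float:
    """Mean rank plotting position i / (n + 1)."""
    return i / (n + 1)


def benard_median_rank(i: int, n: int) -> float:
    """Benard approximation to the median rank, (i - 0.3) / (n + 0.4)."""
    return (i - 0.3) / (n + 0.4)


def modified_kaplan_meier(i: int, n: int) -> float:
    """Modified Kaplan-Meier plotting position (i - 0.5) / n."""
    return (i - 0.5) / n


if __name__ == "__main__":
    n = 10
    for i in range(1, n + 1):
        print(i,
              round(mean_rank(i, n), 4),
              round(benard_median_rank(i, n), 4),
              round(modified_kaplan_meier(i, n), 4))
```

All three rules place the ith order statistic strictly between 0 and 1, so the inverse distribution function can be evaluated at every point.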
$$ IQR = Q_3 - Q_1 \qquad (13) $$
6.10.1 The IQR is sometimes used as an alternative estimator of the standard deviation by dividing by an appropriate
constant. This is particularly true when several outlying observations are present and may be inflating the ordinary calculation of the standard deviation. The dividing constant will
depend on the type of distribution being used. For example, in
a normal distribution, the IQR spans about 1.35 standard deviations, so dividing the sample IQR by 1.35 gives an
estimate of the standard deviation when a normal distribution
is used.
$$ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}{n(n - 1)} \qquad (14) $$
$$ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \qquad (15) $$
$$ Z_i = \frac{x_i - \bar{x}}{s} \qquad (16) $$
$$ |Z| \le (n - 1)/\sqrt{n} \qquad (17) $$

TABLE 4 Maximum Z-Scores Attainable for a Selected Sample Size, n

n     Z(n)
3     1.155
5     1.789
10    2.846
11    3.015
15    3.615
18    4.007

7
These equations are algebraic equivalents, but the second form may be subject
to round off error.
8
When the denominator of the sample variance is taken as n instead of n - 1, the
square root of this quantity is called the root mean squared deviation (RMS).
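The bound on the largest attainable Z-score and the IQR-based estimate of the standard deviation can be sketched as follows. The function names are hypothetical. The divisor is computed from the standard normal quartiles (about 1.349), which the text rounds to 1.35.

```python
import math
from statistics import NormalDist, mean, stdev


def z_scores(values):
    """Z-score of each observation using the sample mean and the n - 1 standard deviation."""
    m, s = mean(values), stdev(values)
    return [(x - m) / s for x in values]


def max_abs_z(n: int) -> float:
    """Largest absolute Z-score attainable in a sample of size n."""
    return (n - 1) / math.sqrt(n)


def sigma_from_iqr(q1: float, q3: float) -> float:
    """Normal-theory estimate of sigma: IQR divided by the IQR of a standard normal."""
    std = NormalDist()
    width = std.inv_cdf(0.75) - std.inv_cdf(0.25)  # about 1.349
    return (q3 - q1) / width


if __name__ == "__main__":
    # One extreme value among identical values attains the bound exactly.
    print(max(z_scores([0, 0, 0, 0, 10])), max_abs_z(5))
```

The extreme sample in the usage lines shows why the bound exists: a single outlier inflates the standard deviation that is used to score it.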
6.17.1.1 More generally, when we have to estimate k
parameters, we lose k degrees of freedom. In simple linear
regression where there are n pairs of data (xi, yi) and the
problem is to fit a linear model of the form y = mx + b through
the data, there are two parameters (m and b) that must be
estimated, and we effectively lose 2 degrees of freedom when
calculating the residual variance. The concept is further extended to multiple regression where there are k parameters that
must be estimated and to other types of statistical methods
where parameters must be estimated.
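$$ g_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{n s^3}, \quad g_2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{n s^4} - 3 \qquad (18) $$
$$ k_1 = \bar{x}, \quad k_2 = s^2, \quad k_3 = \frac{n\sum_{i=1}^{n} (x_i - \bar{x})^3}{(n - 1)(n - 2)} \qquad (19) $$
$$ k_4 = \frac{n(n + 1)\sum_{i=1}^{n} (x_i - \bar{x})^4}{(n - 1)(n - 2)(n - 3)} - \frac{3\left(\sum_{i=1}^{n} (x_i - \bar{x})^2\right)^2}{(n - 2)(n - 3)} \qquad (20) $$
$$ g_1 \approx k_3 / k_2^{3/2}, \quad g_2 \approx k_4 / k_2^{2} \qquad (21) $$
$$ t = \frac{\bar{x} - \mu}{s/\sqrt{n}} \qquad (22) $$
$$ q = \frac{(n - 1)s^2}{\sigma^2} \qquad (23) $$
$$ F_{(n_1 - 1,\; n_2 - 1)} = \frac{s_1^2}{s_2^2} \qquad (24) $$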
Both degrees of freedom are required to use the F distribution. It is common to specify one as associated with the
numerator and one as associated with the denominator. If the
two populations being sampled have differing standard
deviations, say σ1 for population 1 and σ2 for population 2,
then the F ratio above is multiplied by σ2²/σ1². The F distribution
is used to construct confidence intervals for a ratio of two
variances, and in hypothesis testing associated with designed
9
For example, an F distribution having four degrees of freedom in the
denominator always has a theoretical skewness of 0, yet this distribution is not
symmetric. Also, see Ref. (5), Chapter 27, for further discussion.
TABLE 5 Commonly Required Statistics and Their Standard Errors: Data Is of the Variable Type and Population Is Normal

NOTE 1: For skewness and kurtosis,^A the range for the sample size is
n = 5 through 1000. The constant c4 is a function of the sample size n and
is widely available in tables. Alternatively, this approximate equation may
be used. See Table 7 and Ref. (5).

Statistic: Mean
$$ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad se(\bar{x}) = \frac{s}{\sqrt{n}} $$
Statistic: Variance
$$ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}, \qquad se(s^2) = \sqrt{\frac{2s^4}{n - 1}} $$
Statistic: Standard Deviation
$$ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}, \qquad se(s) = s\sqrt{1 - c_4^2} \approx \frac{s\sqrt{8n - 7}}{4n - 3} $$
Statistic: Skewness, g1 = k3 / k2^1.5, let v = ln(n)
$$ \ln(se) = 0.54 - 0.3718v - 0.01144v^2 $$
Statistic: Kurtosis, g2 = k4 / s^4, let v = ln(n)
$$ \ln(se) = 1.641 - 0.6752v + 0.05498v^2 - 0.004492v^3 $$

A
The standard error equations for these statistics were determined using a Monte
Carlo simulation.

$$ \hat{p} = \frac{1}{n}\sum_{i=1}^{n} X_i \qquad (25) $$
(P
p 5
i51
(26)
$$ se(\bar{x}) = \frac{\sigma}{\sqrt{n}} \approx \frac{s}{\sqrt{n}} = \frac{2.8}{\sqrt{20}} = 0.63 \qquad (27) $$
6.19.1.1 Here the quantity σ represents the unknown population standard deviation, s is the sample standard deviation
and estimates σ, and n is the sample size. In this example, the
estimated standard error of the mean is approximately 0.63.
6.19.2 Any standard error calculation or equation will
typically be a function of the sample size (as it is for the mean)
as well as other items such as the kind of distribution being
sampled. Tables 5 and 6 contain a short list of commonly
required statistics along with associated standard errors.
6.19.3 Many other equations for finding or approximating
the standard error for a given statistic are available in the
literature. When a statistic is complicated to the point at which
a closed-form solution or even an approximate equation may
be very difficult to find, computer-intensive methodology can
be used. Monte Carlo simulation methods are very useful for
such purposes. In particular, the technique known as a parametric bootstrap (6) uses the original data to generate many
new samples (the so-called bootstrap samples) each of the
same size n as the original sample. For each bootstrap sample,
the statistic of interest is again calculated and saved to a file.
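A bootstrap estimate of a standard error can be sketched in a few lines. The version below is the nonparametric (resampling) variant, which draws each bootstrap sample with replacement from the original data; it is offered as an assumption-laden illustration and is not the specific procedure of Ref. (6). The function name, the number of replicates, and the seed are hypothetical choices.

```python
import random
from statistics import mean, stdev


def bootstrap_se(data, statistic, n_boot=2000, seed=12345):
    """Standard deviation of a statistic over bootstrap samples of the same size n."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(data)
    replicates = []
    for _ in range(n_boot):
        # Resample n observations with replacement from the original data.
        resample = [rng.choice(data) for _ in range(n)]
        replicates.append(statistic(resample))
    return stdev(replicates)


data = [578, 572, 570, 568, 572, 570, 570, 572, 576, 584]

if __name__ == "__main__":
    print("bootstrap se of mean:", round(bootstrap_se(data, mean), 3))
    print("s / sqrt(n):         ", round(stdev(data) / len(data) ** 0.5, 3))
```

For the mean, the bootstrap value should land near s/√n, which gives a quick sanity check before applying the same function to a statistic with no closed-form standard error.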
6.19 Standard Error Concept: When a statistic is calculated from a set of sample data there is usually some population
parameter that is of interest and for which the statistic or some
simple function thereof serves as the estimate of the parameter. We know that when a second sample is taken, we will not
get the same result as the first sample provided. This is because
the sample values are different every time a sample is taken.
Different sample values will necessarily give us different
values for the statistic. A statistic is a random variable subject
to variation in repeated sampling. The standard error of the
statistic is the standard deviation of the statistic in repeated
sampling.
6.19.1 In using or reporting any statistic, it is good practice
to also report a standard error for that statistic. This gives the
user some idea of the uncertainty in the results being stated.
The quantity θ̂ is a statistic, the estimator of the unknown
parameter θ; se(θ̂) is an estimate of the standard error of θ̂; and
the multiplier z1-α/2 is the 1 - α/2 quantile selected from the
standard normal distribution (5.3) for a (1 - α) two-sided
confidence interval. For example, when 95 % confidence level
is used (α = 0.05), z0.975 = 1.960; when 99 % confidence level
is used, z0.995 = 2.576.
6.20.3 To construct a confidence interval for an unknown
proportion, p, using the observed sample proportion p̂ from a
sample of size n, the general approximate Eq 28 may be used
with the standard error as specified in Table 6. For the
approximation to be adequate, np̂ and n(1 - p̂) should be 5 or
more. The equation for this interval is:
$$ \hat{p} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad se(\hat{p}) = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n - 1}} $$
ox
i51
ses d 5
$$ \hat{p} \pm z_{1-\alpha/2}\sqrt{\hat{p}(1 - \hat{p})/(n - 1)} $$
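The approximate interval for a proportion can be sketched as follows, using the n - 1 divisor shown in the equation above and the adequacy check on np̂ and n(1 - p̂). The function name and the example counts (20 nonconforming items in 100) are hypothetical.

```python
import math
from statistics import NormalDist


def proportion_ci(successes: int, n: int, confidence: float = 0.95):
    """Approximate two-sided confidence interval for a proportion."""
    p_hat = successes / n
    if n * p_hat < 5 or n * (1 - p_hat) < 5:
        raise ValueError("normal approximation not adequate: need n*p and n*(1-p) >= 5")
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)       # for example 1.960 for 95 %
    se = math.sqrt(p_hat * (1.0 - p_hat) / (n - 1))   # standard error with n - 1 divisor
    return p_hat - z * se, p_hat + z * se


if __name__ == "__main__":
    low, high = proportion_ci(20, 100)
    print(round(low, 4), round(high, 4))
```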
n      c4          n      c4          n      c4
2      0.797885    11     0.975350    25     0.989640
3      0.886227    12     0.977559    30     0.991418
4      0.921318    13     0.979406    35     0.992675
5      0.939986    14     0.980971    40     0.993611
6      0.951533    15     0.982316    45     0.994335
7      0.959369    16     0.983484    50     0.994911
8      0.965030    17     0.984506    75     0.996627
9      0.969311    18     0.985410    100    0.997478
10     0.972659    19     0.986214    150    0.998324
                   20     0.986934    200    0.998745
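The tabulated c4 values can be reproduced from the gamma function. The sketch below assumes the standard closed form c4 = sqrt(2/(n - 1)) Γ(n/2)/Γ((n - 1)/2), which is not stated in this section, and also shows the common approximation 4(n - 1)/(4n - 3). Both function names are illustrative.

```python
import math


def c4(n: int) -> float:
    """Exact c4 for sample size n, computed with log-gamma for numerical stability."""
    if n < 2:
        raise ValueError("c4 requires n >= 2")
    log_ratio = math.lgamma(n / 2.0) - math.lgamma((n - 1) / 2.0)
    return math.sqrt(2.0 / (n - 1)) * math.exp(log_ratio)


def c4_approx(n: int) -> float:
    """Approximation to c4: 4(n - 1) / (4n - 3)."""
    return 4.0 * (n - 1) / (4.0 * n - 3.0)


if __name__ == "__main__":
    for n in (2, 5, 10, 25, 200):
        print(n, round(c4(n), 6), round(c4_approx(n), 6))
```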
$$ \bar{x} \pm t_{1-\alpha/2,\,df}\; s/\sqrt{n} $$
(30)
(29)
or 28.6.
6.20.6 Procedures for calculating confidence intervals from
sample data are available in textbooks and in the literature for
parameters of a variety of distribution functions and for a
variety of scenarios (for example, single parameter, difference
between two parameters, ratio of two parameters, etc.). Widely
available published tables are used to construct confidence
intervals for cases involving the binomial, Poisson, exponential
and normal distributions. For the common cases as well as
others, tables of Student's t, the chi-square and F distributions
are required for construction of the interval. Generally, the
coverage probability depends on the correctness of the assumed distribution from which the data have arisen.
(28)
6.21.4 Example 1: The tensile strength of a certain type of material
exhibits a sample mean and standard deviation of 17,580 and 795 lbs,
respectively, from a sample of n = 7 observations. This characteristic has historically been shown to be
normally distributed. A two-sided 95 % prediction interval for
the tensile strength of the next observation is calculated from
Eq 28 and Eq 29. For n = 7, use 6 degrees of freedom and a
quantile level of 1 - 0.05/2 = 0.975. A standard table of
Student's t values shows that t0.975 = 2.447. The corresponding
prediction interval is:
6.21 Prediction-Type Intervals for a Normal Distribution: It may sometimes be the case that we have a
sample of n observations from a normal distribution and we
want to construct an interval that would contain one or more
future observations with some stated confidence C. Such
intervals are called prediction intervals.
6.21.1 Two-Sided Prediction Intervals for a Single Future
Value From a Normal Population: A prediction interval for a
single future observation, y, from a normal population is
constructed using a sample of n observations from a normal
distribution and provides the limits within which the future
value is expected to fall with some confidence C = 1 - α. We
can have both single-sided and double-sided limits. Let y be the
future value. The prediction limits for the two-sided interval for
the future value are PL ≤ y ≤ PU. Equations for these limits are:
$$ P_L = \bar{x} - t_{1-\alpha/2}\, s\sqrt{1 + 1/n} \qquad (31) $$
$$ P_U = \bar{x} + t_{1-\alpha/2}\, s\sqrt{1 + 1/n} \qquad (32) $$
(33)
(34)
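The two-sided prediction limits can be evaluated directly for the tensile strength example in 6.21.4 (mean 17,580, s = 795, n = 7, t0.975 = 2.447 with 6 degrees of freedom). In this sketch the t quantile is passed in as a table value rather than computed, and the function name is hypothetical.

```python
import math


def prediction_interval(xbar: float, s: float, n: int, t_quantile: float):
    """Two-sided prediction limits for one future value from a normal population."""
    half_width = t_quantile * s * math.sqrt(1.0 + 1.0 / n)
    return xbar - half_width, xbar + half_width


# Values taken from the tensile strength example; t is the tabled 0.975 quantile, 6 df.
lower, upper = prediction_interval(xbar=17580.0, s=795.0, n=7, t_quantile=2.447)

if __name__ == "__main__":
    print(round(lower, 1), round(upper, 1))
```

The factor sqrt(1 + 1/n) is what distinguishes this interval from a confidence interval for the mean: it accounts for the variation of the future observation itself as well as the uncertainty in the sample mean.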
6.21.3.1 The degrees of freedom remain n - 1. The modification of the quantile level is an application of the Bonferroni
inequality (see Ref. (7)). Many variations on the theme of
prediction intervals are possible. Note that the interval methodology in this section should not be used unless the underlying distribution is normal and stable. For further information
on this topic, see Refs. (7, 8, or 9).
however, any probability estimate will also be a function of the
data quality (resolution) and quantity.
8.1.2 The second purpose for constructing a histogram is to
assess the general shape of the distribution from which the
sample originated. Here the analysis is mostly visual. The
histogram may suggest both questions and answers. For
example, has the data originated from a symmetrical distribution? Might there be any outliers among our data?
FIG. 7 Bearing Life Data: Illustration of Sample Comparison Using Boxplots, Sample Size, n = 30 Each Group
average. Note that the rth order statistic is also the 100r/
(n + 1)th sample percentile. In a quantile-quantile plot, the
quantiles from one sample are plotted against the corresponding quantiles of another sample. With two samples of equal
size, the order statistics from one sample are plotted against the
order statistics of the second sample. If both samples are
exactly the same, then the resulting plot will be a straight line
with slope 1 and y-intercept 0. If the mean of one sample
(plotted on the horizontal axis) is shifted to the right, say k
units, but otherwise the samples are exactly the same, the
resulting plot would be a line of slope 1 and y-intercept -k. A
slope less than 1 would indicate that the sample plotted as the
horizontal coordinate has more variability than the sample
plotted as the vertical coordinate. In this manner, fundamental
differences between the two samples may be discerned graphically.
8.5.3 When the sample sizes are not equal, we use the
smaller sample size to determine the quantiles that are to be
plotted. Let two samples be denoted through the variables X
and Y; further, let the smaller sample size, n, belong to X, and
the larger sample size, m, belong to Y. The n order statistics of
the variable X determine the quantiles to be used. These are
quantiles of orders r/(n + 1) for r = 1, 2, ..., n. To find the
associated quantiles of the same orders from the sample of Y
values, use the method outlined in 6.8. Using this method, two
sets of n sample quantiles are determined and may be plotted
in the manner described previously.
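The pairing of quantiles for samples of unequal size can be sketched as below. The text of 6.8 is not reproduced in this section, so the `quantile` helper assumes the rank-interpolation rule at rank p(m + 1) that the percentile example in Section 9 follows; both function names are hypothetical.

```python
def quantile(sorted_values, p):
    """Sample quantile of order p by linear interpolation at rank p * (m + 1)."""
    m = len(sorted_values)
    rank = p * (m + 1)
    k = int(rank)        # integer part of the rank
    frac = rank - k      # fractional part used for interpolation
    if k < 1:
        return sorted_values[0]
    if k >= m:
        return sorted_values[-1]
    return sorted_values[k - 1] + frac * (sorted_values[k] - sorted_values[k - 1])


def qq_pairs(x, y):
    """Points for a q-q plot; x must be the smaller (or equal-size) sample."""
    if len(x) > len(y):
        raise ValueError("pass the smaller sample as x")
    xs, ys = sorted(x), sorted(y)
    n = len(xs)
    # The n order statistics of x are quantiles of order r / (n + 1).
    return [(xs[r - 1], quantile(ys, r / (n + 1))) for r in range(1, n + 1)]


if __name__ == "__main__":
    print(qq_pairs([1, 2, 3], [1, 2, 3, 4, 5, 6, 7]))
```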
8.5.4 Probability Plots: To prepare and use a probability
plot, a distribution must be assumed for the variable being
studied. Important cases of distributions that are used for this
purpose include the normal, log-normal, exponential, Weibull,
and extreme value distributions. In most cases, the special
probability paper needed for each distribution is readily available or construction is available in a wide variety of software
packages. The utility of a probability plot lies in the property
that the sample data will generally plot as a straight line given
that the assumed distribution is true. For this reason, the probability plot is
frequently used as an informal, graphical hypothesis test that the sample arose
from the assumed distribution.10 The underlying theory will be illustrated using the normal distribution.
Illustrations appear in the section on examples.
8.5.5 Normal Distribution Case: Given a sample of n
observations assumed to come from a normal distribution with
unknown mean and standard deviation (μ and σ), let the
variable be Y and the order statistics be y(1), y(2), ..., y(n). Plot
the order statistics y(i) against the inverse standard normal
distribution function, Φ^-1(p), evaluated at p = i/(n + 1), where
i = 1, 2, 3, ..., n. This is because i/(n + 1) is the expected fraction
of a population lying below the order statistic y(i) in any sample
of size n. The resulting relationship is:
$$ y_{(i)} = \sigma\,\Phi^{-1}\!\left(\frac{i}{n + 1}\right) + \mu $$
570
568
572
570
570
572
576
584
(35)
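The coordinates of a normal probability plot follow directly from the relationship above. This sketch uses `statistics.NormalDist().inv_cdf` for Φ^-1 and fits a least squares line through the points, so that the slope and intercept give rough graphical estimates of σ and μ. The function names are hypothetical and the data are the ten observations used in the Section 9 example.

```python
from statistics import NormalDist, mean


def normal_plot_points(values):
    """Pairs (z_i, y_(i)) with z_i the inverse normal of i / (n + 1)."""
    ys = sorted(values)
    n = len(ys)
    std = NormalDist()
    zs = [std.inv_cdf(i / (n + 1)) for i in range(1, n + 1)]
    return list(zip(zs, ys))


def fit_line(points):
    """Least squares slope and intercept of y on z."""
    zs = [z for z, _ in points]
    ys = [y for _, y in points]
    zbar, ybar = mean(zs), mean(ys)
    szz = sum((z - zbar) ** 2 for z in zs)
    szy = sum((z - zbar) * (y - ybar) for z, y in points)
    slope = szy / szz
    return slope, ybar - slope * zbar


data = [578, 572, 570, 568, 572, 570, 570, 572, 576, 584]
points = normal_plot_points(data)
slope, intercept = fit_line(points)

if __name__ == "__main__":
    print("graphical estimate of sigma:", round(slope, 2))
    print("graphical estimate of mu:   ", round(intercept, 2))
```

A marked curve in the plotted points, rather than the values of the fitted slope and intercept, is what signals departure from the assumed normal distribution.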
Item    X       X²
1       578     334,084
2       572     327,184
3       570     324,900
4       568     322,624
5       572     327,184
6       570     324,900
7       570     324,900
8       572     327,184
9       576     331,776
10      584     341,056
Sum     5732    3,285,792
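$$ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{5732}{10} = 573.2 \qquad (36) $$
$$ s^2 = \frac{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}{n(n - 1)} \qquad (37) $$
$$ s^2 = \frac{10(3{,}285{,}792) - 5732^2}{10(9)} = 23.29 \qquad (38) $$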
9. Examples
9.1 Example 1: Calculation of descriptive statistics (Table 8).
9.1.1 Mean, variance, and standard deviation calculation.
Refer to Table 9.
(39)
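The arithmetic of this example can be reproduced with a short script. The data are the ten observations listed in Table 9. The `percentile` helper is an assumption: it implements the rank-interpolation rule at rank p(n + 1) that the 75th percentile calculation in this example follows.

```python
import math

x = [578, 572, 570, 568, 572, 570, 570, 572, 576, 584]  # observations from Table 9
n = len(x)

sum_x = sum(x)                      # sum of the observations
sum_x2 = sum(v * v for v in x)      # sum of the squared observations
mean_x = sum_x / n

# Definitional form and computational (shortcut) form of the sample variance.
var_def = sum((v - mean_x) ** 2 for v in x) / (n - 1)
var_short = (n * sum_x2 - sum_x ** 2) / (n * (n - 1))
s = math.sqrt(var_def)


def percentile(values, p):
    """Empirical percentile by interpolating at rank p * (n + 1)."""
    ordered = sorted(values)
    rank = p * (len(ordered) + 1)
    k = int(rank)
    frac = rank - k
    if k < 1:
        return ordered[0]
    if k >= len(ordered):
        return ordered[-1]
    return ordered[k - 1] + frac * (ordered[k] - ordered[k - 1])


q3 = percentile(x, 0.75)

if __name__ == "__main__":
    print(mean_x, round(var_short, 2), round(s, 3), q3)
```

With integer data the shortcut form is exact, but for data with many significant digits the definitional form is usually the safer choice because the shortcut subtracts two large, nearly equal numbers.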
10
Formal methods for testing the hypothesis that the data arise from the assumed
distribution are available. Such tests include the Anderson-Darling, the Shapiro-Wilk, and a chi-square test among others.
Item    x       z-score
1       578     0.99464
2       572     -0.24866
3       570     -0.66309
4       568     -1.07753
5       572     -0.24866
6       570     -0.66309
7       570     -0.66309
8       572     -0.24866
9       576     0.58021
10      584     2.23794

TABLE 11 Strength of 270 Bricks of a Typical Brand, psi^A
860
920
1200
850
920
1090
830
1040
1510
740
1150
1000
1140
1030
700
920
860
950
1020
1300
890
1080
910
870
810
1010
740
1070
1020
1170
960
1180
800
1240
1020
1030
690
1070
820
1230
830
1100
830
1010
860
1400
920
800
1050
1070
1130
1000
730
1360
is 0.25. This gives for the 75th percentile the eighth order
statistic plus 25 % of the distance between the eighth and ninth
order statistic. This is:
$$ Q_3 = 576 + 0.25(578 - 576) = 576.5 \qquad (40) $$
1320
1100
830
920
1070
700
880
1080
1060
1230
860
720
1080
960
860
1100
990
880
750
970
1030
970
1100
970
1070
1190
1080
830
1390
920
1020
740
860
1290
820
990
1020
820
1180
950
1220
1020
850
1230
1150
850
1110
800
710
880
1330
1090
930
910
820
1250
1100
940
1630
910
870
1040
840
1020
1100
800
990
870
660
1080
890
970
1070
800
1060
960
870
910
1100
1180
860
1380
830
1120
1090
880
1010
870
1030
1100
890
580
1350
900
1100
1380
630
780
1400
1010
780
1140
890
1240
1260
1140
900
890
1040
1480
890
1310
670
1170
1340
980
940
1060
840
1170
570
800
1180
980
940
1000
920
650
1610
1180
980
830
460
1080
1000
960
820
1170
2010
790
1130
1260
860
1080
700
820
1180
760
1090
1010
710
1000
880
1010
780
940
1010
940
890
970
1150
950
1000
1150
270
1330
1150
800
840
1240
1110
990
1060
970
790
1040
780
760
910
990
870
1180
1190
1050
730
1030
860
1100
810
1360
980
1160
890
1100
970
1050
850
1070
880
1060
950
1380
1380
1030
900
1150
730
1240
1190
980
1120
860
980
1110
900
1270
11
The uncertainty considered here is only related to the significant digits of the
reported data and does not include other sources of uncertainty such as measurement
error.
TABLE 12 Frequency Distribution of Brick Strength Data (Table 11)

Lower    Upper    Freq.    Rel. Freq.    Cum. Freq.    Cum. Rel. Freq.
255      355      1        0.0037        1             0.0037
355      455      0        0.0000        1             0.0037
455      555      1        0.0037        2             0.0074
555      655      4        0.0148        6             0.0222
655      755      16       0.0593        22            0.0815
755      855      37       0.1370        59            0.2185
855      955      56       0.2074        115           0.4259
955      1055     55       0.2037        170           0.6296
1055     1155     50       0.1852        220           0.8148
1155     1255     25       0.0926        245           0.9074
1255     1355     11       0.0407        256           0.9482
1355     1455     9        0.0333        265           0.9815
1455     1555     2        0.0074        267           0.9889
1555     1655     2        0.0074        269           0.9963
1655     1755     0        0.0000        269           0.9963
1755     1855     0        0.0000        269           0.9963
1855     1955     0        0.0000        269           0.9963
1955     2055     1        0.0037        270           1.0000
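A frequency distribution of this form can be built from equal-width bins. The sketch below uses the Table 12 bin layout (lower boundary 255, width 100, 18 bins) and, to stay short, only the first ten brick strengths from Table 11. The function name and row layout are hypothetical.

```python
def frequency_table(values, lower, width, n_bins):
    """Rows of (lower, upper, freq, rel freq, cum freq, cum rel freq) for equal-width bins."""
    counts = [0] * n_bins
    for v in values:
        k = int((v - lower) // width)   # index of the bin with lower <= v < upper
        if not 0 <= k < n_bins:
            raise ValueError(f"value {v} falls outside the bins")
        counts[k] += 1
    total = len(values)
    rows, cumulative = [], 0
    for k, count in enumerate(counts):
        cumulative += count
        rows.append((lower + k * width, lower + (k + 1) * width,
                     count, count / total, cumulative, cumulative / total))
    return rows


# First ten values of the brick strength data, used only to keep the sketch short.
sample = [860, 920, 1200, 850, 920, 1090, 830, 1040, 1510, 740]
rows = frequency_table(sample, lower=255, width=100, n_bins=18)

if __name__ == "__main__":
    for row in rows:
        print(row)
```

Bin boundaries ending in 5 cannot coincide with data recorded to the nearest 10, so no observation falls on a boundary and the counting rule is unambiguous.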
10.1.2 The practitioner typically wants to see if a relationship exists between X and Y. In theory, many different types of
relationships can occur between X and Y. The most common is
a simple linear relationship of the form Y = α + βX + ε, where
α and β are model coefficients and ε is a random error term
representing variation in the observed value of Y at given X,
and is assumed to have a mean of 0 and some unknown
standard deviation σ. A statistical analysis that seeks to
determine a linear relationship between a dependent variable,
Y, and a single independent variable, X, is called simple linear
regression. In this type of analysis it is assumed that the error
structure is normally distributed with mean 0 and some
are the estimates of α and β, respectively. The ith observed
values of X and Y are denoted as xi and yi. The estimate of Y at
X = xi is written ŷi = a + bxi. The hat notation over the yi
variable denotes that this is the estimated mean or predicted
value of Y for a given x.
10.2.1 The least squares best fitting line is one that minimizes the sum of the squared deviations from the line to the
observed yi values. Note that these are vertical distances.
Analytically, this sum of squared deviations is of the form:
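$$ S(a, b) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2 \qquad (41) $$
$$ S_{XX} = (n - 1)s_x^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 \qquad (42) $$
$$ S_{YY} = (n - 1)s_y^2 = \sum_{i=1}^{n} (y_i - \bar{y})^2 \qquad (43) $$
$$ S_{XY} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} (x_i - \bar{x})\, y_i \qquad (44) $$
sion because $\sum_{i=1}^{n} (x_i - \bar{x})\,\bar{y} = 0$.
$$ b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})\, y_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{S_{XY}}{S_{XX}} \qquad (45) $$
(46)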
10.2 Method of Least Squares: The methodology considered in this standard and used to estimate the model parameters
α and β is called the method of least squares. The form of the
best fitting line will be denoted as Y = a + bX, where a and b
$$ w_i = \frac{x_i - \bar{x}}{S_{XX}} \qquad (47) $$
12
The normal distribution of the error structure is not required to fit the linear
model to the data but is required for performing standard model analysis such as
residual analysis, confidence and prediction intervals and statistical inference on the
model parameters.
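The least squares estimates can be computed from the sums S_XX and S_XY. The sketch below uses the weld diameter and shear strength data of Table 13. The intercept is taken as a = ȳ - b x̄, which is the standard least squares result and is assumed here because that equation is not legible in this section. The function name is hypothetical.

```python
# Weld diameter (x) and shear strength (y) from Table 13.
x = [190, 200, 209, 215, 215, 215, 230, 250, 265, 250]
y = [680, 800, 780, 885, 975, 1025, 1100, 1030, 1175, 1300]


def least_squares(x, y):
    """Return intercept a, slope b, S_XX and S_XY for the line Y = a + bX."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * yi for xi, yi in zip(x, y))
    b = sxy / sxx            # slope, S_XY / S_XX
    a = ybar - b * xbar      # intercept (standard result, assumed here)
    return a, b, sxx, sxy


a, b, sxx, sxy = least_squares(x, y)

if __name__ == "__main__":
    print("b =", round(b, 3), " a =", round(a, 3))
    print("SXX =", round(sxx, 1), " SXY =", round(sxy, 1))
```

Written with the weights w_i of Eq 47, the slope is simply the weighted sum of the y values, which is why its standard error depends only on the error standard deviation and S_XX.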
TABLE 13 Weld Diameter (x) and Shear Strength (y)

i            xi         yi            di = xi - yi    xi - x̄    (xi - x̄)yi
1            190        680           -490.0          -33.9      -23,052.0
2            200        800           -600.0          -23.9      -19,120.0
3            209        780           -571.0          -14.9      -11,622.0
4            215        885           -670.0          -8.9       -7,876.5
5            215        975           -760.0          -8.9       -8,677.5
6            215        1025          -810.0          -8.9       -9,122.5
7            230        1100          -870.0          6.1        6,710.0
8            250        1030          -780.0          26.1       26,883.0
9            265        1175          -910.0          41.1       48,292.5
10           250        1300          -1050.0         26.1       33,930.0
average      223.9      975.0
stdev (S)    24.196     191.645       170.987
S²           585.433    36,727.778    29,236.544

Parameter estimates
b      6.898
a      -569.468
SXX    5,268.900
SYY    330,550.000
SXY    36,345.000
$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n - 1)\, s_x s_y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})\, y_i}{(n - 1)\, s_x s_y} \qquad (48) $$
$$ r = \frac{36{,}345}{(10 - 1)(24.196)(191.645)} = 0.871 $$
10.4.1 An alternative formula for r uses the standard deviation of the paired differences (di = yi - xi). Note that it does not
matter in what order we calculate these differences. Either di =
yi - xi or di = xi - yi will give the same result:
$$ r = \frac{s_x^2 + s_y^2 - s_d^2}{2 s_x s_y} \qquad (49) $$
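Both forms of the sample correlation coefficient can be checked against each other. The sketch below uses the Table 13 data; the product form follows Eq 48 and the paired-difference form uses the standard deviations of x, y, and d. The function names are hypothetical.

```python
from statistics import mean, stdev

# Weld diameter (x) and shear strength (y) from Table 13.
x = [190, 200, 209, 215, 215, 215, 230, 250, 265, 250]
y = [680, 800, 780, 885, 975, 1025, 1100, 1030, 1175, 1300]


def corr_products(x, y):
    """Correlation from the sum of cross products of deviations."""
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return sxy / ((n - 1) * stdev(x) * stdev(y))


def corr_differences(x, y):
    """Correlation from the standard deviation of the paired differences."""
    d = [yi - xi for xi, yi in zip(x, y)]   # the sign convention does not matter
    sx, sy, sd = stdev(x), stdev(y), stdev(d)
    return (sx ** 2 + sy ** 2 - sd ** 2) / (2 * sx * sy)


r1 = corr_products(x, y)
r2 = corr_differences(x, y)

if __name__ == "__main__":
    print(round(r1, 3), round(r2, 3))
```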
TABLE 14 Calculate the Estimate of σ

i      yi      ŷi         yi - ŷi     (yi - ŷi)²
1      680     741.16     -61.16      3,740.18
2      800     810.14     -10.14      102.76
3      780     872.22     -92.22      8,504.42
4      885     913.61     -28.61      818.39
5      975     913.61     61.39       3,769.03
6      1025    913.61     111.39      12,408.27
7      1100    1017.08    82.92       6,876.07
8      1030    1155.04    -125.04     15,634.61
9      1175    1258.51    -83.51      6,973.86
10     1300    1155.04    144.96      21,013.86
SUM                                   79,841.31
σ̂                                    99.9

$$ \hat{\sigma} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2}} \qquad (50) $$
$$ \hat{\sigma} = \sqrt{\frac{S_{YY} - b S_{XY}}{n - 2}} \qquad (51) $$
$$ se(b) = \frac{\hat{\sigma}}{\sqrt{S_{XX}}} \qquad (52) $$
and:
$$ se(a) = \hat{\sigma}\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{XX}}} \qquad (53) $$
Standard errors for the slope and intercept for the example
are:
$$ se(b) = \frac{99.9}{\sqrt{5268.9}} = 1.376 $$
$$ se(a) = 99.9\sqrt{\frac{1}{10} + \frac{223.9^2}{5268.9}} = 309.8 $$
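The residual standard deviation and the standard errors of the slope and intercept can be reproduced from the Table 13 data. This is a minimal sketch; the variable names are illustrative, and the intercept is again taken as a = ȳ - b x̄.

```python
import math

# Weld diameter (x) and shear strength (y) from Table 13.
x = [190, 200, 209, 215, 215, 215, 230, 250, 265, 250]
y = [680, 800, 780, 885, 975, 1025, 1100, 1030, 1175, 1300]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b = sxy / sxx
a = ybar - b * xbar

# Sum of squared residuals, computed directly and by the shortcut S_YY - b * S_XY.
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
sigma_hat = math.sqrt(sse / (n - 2))
sigma_alt = math.sqrt((syy - b * sxy) / (n - 2))

se_b = sigma_hat / math.sqrt(sxx)
se_a = sigma_hat * math.sqrt(1.0 / n + xbar ** 2 / sxx)

if __name__ == "__main__":
    print(round(sigma_hat, 1), round(se_b, 3), round(se_a, 1))
```

The divisor n - 2 reflects the two estimated parameters, a and b, consistent with the discussion of degrees of freedom in 6.17.1.1.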
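(54)
$$ \hat{y}_0 \pm t_{1-\alpha/2,\,n-2}\;\hat{\sigma}\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{(n - 1)s_x^2}} \qquad (55) $$
$$ -569.468 + 6.898(215) \pm 2.306(99.9)\sqrt{\frac{1}{10} + \frac{(215 - 223.9)^2}{(10 - 1)(24.196)^2}} = 913.6 \pm 78.13 $$
$$ \hat{y}_0 \pm t_{1-\alpha/2,\,n-2}\;\hat{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{(n - 1)s_x^2}} \qquad (56) $$
$$ -569.468 + 6.898(215) \pm 2.306(99.9)\sqrt{1 + \frac{1}{10} + \frac{(215 - 223.9)^2}{(10 - 1)(24.196)^2}} = 913.6 \pm 243.26 $$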
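The two interval half-widths at x0 = 215 can be recomputed from the Table 13 data. In this sketch the t quantile 2.306 (0.975 quantile, 8 degrees of freedom) is entered as the table value used in the worked calculation. Reading the first form as an interval for the mean response and the second as an interval for a single future observation is the usual interpretation and is stated here as an assumption. The function name is hypothetical.

```python
import math

# Weld diameter (x) and shear strength (y) from Table 13.
x = [190, 200, 209, 215, 215, 215, 230, 250, 265, 250]
y = [680, 800, 780, 885, 975, 1025, 1100, 1030, 1175, 1300]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)      # equals (n - 1) * s_x**2
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b = sxy / sxx
a = ybar - b * xbar
sigma_hat = math.sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))


def intervals_at(x0: float, t_quantile: float):
    """Fitted value and half-widths for the mean response and for one future value."""
    y0 = a + b * x0
    core = 1.0 / n + (x0 - xbar) ** 2 / sxx
    half_mean = t_quantile * sigma_hat * math.sqrt(core)
    half_future = t_quantile * sigma_hat * math.sqrt(1.0 + core)
    return y0, half_mean, half_future


y0, half_mean, half_future = intervals_at(215, 2.306)

if __name__ == "__main__":
    print(round(y0, 1), round(half_mean, 2), round(half_future, 2))
```

Both half-widths grow as x0 moves away from x̄, so predictions near the ends of the observed x range are less precise than those near the centre.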
11. Keywords
11.1 bivariate; boxplot; correlation; dot plot; empirical percentile; frequency distribution; histogram; kurtosis; least
squares; mean; median; mid range; ogive; order statistic;
population parameter; prediction; probability plot; q-q plot;
range; regression; sample statistic; skewness; standard deviation; standard error; variance
REFERENCES
(1) Johnson, N.L., and Kotz, S., eds., Encyclopedia of Statistical
Sciences, Vol 4, s.v. Kurtosis, Wiley-Interscience, 1983.
(2) Manual on Presentation of Data and Control Chart Analysis, Seventh
Edition, ASTM International, West Conshohocken, PA, 2002.
(3) Hyndman, R.J., and Fan, Y., Sample Quantiles in Statistical
Packages, The American Statistician, Vol 50, 1996, pp. 361-365.
(4) Shiffler, R.E., Maximum Z Scores and Outliers, The American
Statistician, Vol 42, No. 1, February 1988, pp. 79-80.
(5) Duncan, A.J., Quality Control and Industrial Statistics, Fifth Edition,
Irwin, Homewood, IL, 1986.
(6) Efron, B.Y., The Jackknife, Bootstrap and Other Resampling Plans,
Regional Conference Series in Applied Mathematics, No. 38, SIAM,
1982.
(7) Hahn, G., and Meeker, W., Statistical Intervals, A Guide for
Practitioners, John Wiley & Sons, 1991.
(8) Whitmore, G.A., Prediction Limits for a Univariate Normal
Observation, The American Statistician, Vol 40, No. 2, 1986, pp.
141-143.
(9) Hahn, G.J., Finding an Interval for next Observation from a Normal
Distribution, Journal of Quality Technology, Vol 1, No. 3, 1969, pp.
168-171.
(10) Wand, M.P., Data-Based Choice of Histogram Bin Width, The
American Statistician, Vol 51, No. 1, February 1997, pp. 59-64.
ASTM International takes no position respecting the validity of any patent rights asserted in connection with any item mentioned
in this standard. Users of this standard are expressly advised that determination of the validity of any such patent rights, and the risk
of infringement of such rights, are entirely their own responsibility.
This standard is subject to revision at any time by the responsible technical committee and must be reviewed every five years and
if not revised, either reapproved or withdrawn. Your comments are invited either for revision of this standard or for additional standards
and should be addressed to ASTM International Headquarters. Your comments will receive careful consideration at a meeting of the
responsible technical committee, which you may attend. If you feel that your comments have not received a fair hearing you should
make your views known to the ASTM Committee on Standards, at the address shown below.
This standard is copyrighted by ASTM International, 100 Barr Harbor Drive, PO Box C700, West Conshohocken, PA 19428-2959,
United States. Individual reprints (single or multiple copies) of this standard may be obtained by contacting ASTM at the above
address or at 610-832-9585 (phone), 610-832-9555 (fax), or [email protected] (e-mail); or through the ASTM website
(www.astm.org). Permission rights to photocopy the standard may also be secured from the Copyright Clearance Center, 222
Rosewood Drive, Danvers, MA 01923, Tel: (978) 646-2600; http://www.copyright.com/