2000 (Bonate) Analysis of Pretest-Posttest Designs (Full Book)
Analysis of Pretest-Posttest Designs

Peter L. Bonate
Bonate, Peter L.
Analysis of pretest-posttest designs / Peter L. Bonate.
p. cm.
Includes bibliographical references and index.
ISBN 1-58488-173-9 (alk. paper)
1. Experimental design. I. Title.
QA279 .B66 2000
001.4′34—dc21 00-027509
CIP
This book contains information obtained from authentic and highly regarded sources. Reprinted material
is quoted with permission, and sources are indicated. A wide variety of references are listed. Reasonable
efforts have been made to publish reliable data and information, but the author and the publisher cannot
assume responsibility for the validity of all materials or for the consequences of their use.
Neither this book nor any part may be reproduced or transmitted in any form or by any means, electronic
or mechanical, including photocopying, microfilming, and recording, or by any information storage or
retrieval system, without prior permission in writing from the publisher.
The consent of CRC Press LLC does not extend to copying for general distribution, for promotion, for
creating new works, or for resale. Specific permission must be obtained in writing from CRC Press LLC
for such copying.
Direct all inquiries to CRC Press LLC, 2000 N.W. Corporate Blvd., Boca Raton, Florida 33431.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are
used only for identification and explanation, without intent to infringe.
INTRODUCTION
[Figure 1.1: two panels of individual Oral-Cecal Transit Time (min) plotted against Treatment]
Figure 1.1: The top plot shows individual oral-cecal transit times in 14
subjects administered (1) intravenous placebo and oral placebo, (2)
intravenous morphine and oral placebo, or (3) intravenous morphine and oral
methylnaltrexone. The bottom plot shows what the same data would look like
if the subjects in each group were different individuals. Data redrawn from
Yuan, C.-S., Foss, J.F., Osinski, J., Toledano, A., Roizen, M.F., and Moss, J.,
The safety and efficacy of oral methylnaltrexone in preventing morphine-induced
delay in oral-cecal transit time, Clinical Pharmacology and Therapeutics, 61,
467, 1997. With permission.
[Figure 1.2: two normal curves separated by ∆ = µt − µr, each with common standard deviation σ]
Figure 1.2: Plot showing the difference between two independent group
means, ∆, expressed as an effect size difference of 1.7. Legend: µr, mean of
reference group; µt, mean of test group; σ, common standard deviation.
[Figure 1.3: two normal curves centered at µr and µt, showing the rejection regions α/2, the region β, and Power = 1 − β]
Figure 1.3: Plot showing the elements of power analysis in comparing the
mean difference between two groups, assuming a 2-tailed t-test. Legend: α,
the probability of rejecting the null hypothesis given that it is true; β, the
probability of not rejecting the null hypothesis given that it is false.
[Figure 1.4: two panels; top: Apparent Effect Size vs. correlation (0.0 to 1.0) for ES = 0.25, 0.50, 1.0, 1.5, and 2.0; bottom: percent change in effect size vs. correlation]
Figure 1.4: Plots showing apparent effect size as a function of the correlation
between pretest and posttest (top) and percent change in effect size as a
function of correlation between pretest and posttest (bottom). Data were
generated using Eq. (1.5). ES is the effect size in the absence of within-subject
variation and was treated as a fixed value. Incorporating correlation between
pretest and posttest into an analysis may dramatically improve its statistical
power.
[Figure: frequency histograms of Control and Treatment group scores plotted by score midpoint]
[Figure 1.7: individual scores (20 to 200) plotted at Pretest and Posttest]
Figure 1.7: Representative tilted line-segment plot. Each circle in the pretest
and posttest subgroups represents one individual.
Summary
• Pretest-posttest experimental designs are quite common in biology and
psychology.
• The defining characteristic of a pretest-posttest design (for purposes of
this book) is that a baseline or basal measurement of the variable of
interest is made prior to randomization into treatment groups and
application of the treatment of interest.
◊ There is temporal distance between collection of the pretest and
posttest measurements.
MEASUREMENT CONCEPTS
What is Validity?
We tend to take for granted that the things we are measuring are really the
things we want to measure. This may sound confusing, but consider for a
moment what an actual measurement is. A measurement attempts to quantify
through some observable response from a measurement instrument or device
an underlying, unobservable concept. Consider these examples:
• The measurement of blood pressure using a blood pressure cuff: True
blood pressure is unobservable, but we measure it through a pressure
transducer because there is a linear relationship between the value
reported by the transducer and the pressure device used to calibrate the
instrument.
• The measurement of a personality trait using the Minnesota Multiphasic
Personality Inventory (MMPI-II): Clearly, personality is an unobservable
concept, but there may exist a positive relationship between personality
constructs and certain items on the test.
• Measurement of an analyte in an analytical assay: In an analytical assay,
we can never actually measure the analyte of interest. However, there
may be a positive relationship between the absorption characteristics of
the analyte and its concentration in the matrix of interest. Thus by
monitoring a particular wavelength of the UV spectrum, the concentration
of analyte may be related to the absorbance at that wavelength and by
comparison to samples with known concentrations of analyte, the
concentration in the unknown sample may be interpolated.
In each of these examples, it is the concept or surrogate that is measured, not
the actual “thing” that we desire to measure.
In the physical and life sciences, we tend to think that the relationship
between the unobservable and the observable is a strong one. But a natural
question arises: to what extent does a particular response represent the
particular construct or unobservable variable we are interested in? That is the
nature of validity: a measurement is valid if it measures what it is supposed to
measure. It would make little sense to measure a person’s cholesterol level
using a ruler or quantify a person’s IQ using a bathroom weight scale. These
are all invalid measuring devices for those purposes. But validity is not an all-or-none
proposition. It is a matter of degree, with some measuring instruments being
more valid than others. The scale used in a doctor’s office is probably more
accurate and valid than the bathroom scale used in our home.
Validation of a measuring instrument is not done per se. What is done is
the validation of the measuring instrument in relation to what it is being used
for. A ruler is perfectly valid for the measure of length, but invalid for the
measure of weight. By definition, a measuring device is valid if it has no
systematic error; the true scores T are then distributed about the population
mean µ with inter-subject variance σS² (in mathematical shorthand, T ~ (µ, σS²)).
The situation where the true scores are known is impossible to achieve
because all measurements are subject to some degree of error. Two types of
errors occur during the measuring process. If a measurement consistently
overestimates or underestimates the true value of a variable, then we say that
systematic error is occurring. Systematic error affects the validity of an
instrument. If, however, there is no systematic tendency to either
underestimate or overestimate the true value, then random error is occurring.
Writing each of two measurements on the same experimental unit as the sum of
the true score and a random error, X1i = Si + R1i and X2i = Si + R2i, the
cross-products of the deviations from the mean sum to

∑_{i=1}^{n} (X1i − µ)(X2i − µ) = ∑_{i=1}^{n} (Si − µ)² + ∑_{i=1}^{n} R1i R2i + ∑_{i=1}^{n} R1i(Si − µ) + ∑_{i=1}^{n} R2i(Si − µ).   (2.16)

Assuming that the errors are random, i.e.,

∑_{i=1}^{n} R1i = ∑_{i=1}^{n} R2i = 0,

and both uncorrelated with each other and with the true scores, then

∑_{i=1}^{n} (X1i − µ)(X2i − µ) = ∑_{i=1}^{n} (Si − µ)².   (2.17)
The test-retest correlation between the two measurements is then

G = ∑_{i=1}^{n} (X1i − µ)(X2i − µ) / (nσ²).   (2.18)
Recognizing that the variance of the true scores is

σS² = ∑_{i=1}^{n} (Si − µ)² / n = Var(S),   (2.19)

substituting into Eq. (2.18) and simplifying gives

G = ∑_{i=1}^{n} (Si − µ)² / (nσ²),   (2.20)

which is equivalent to

G = σS²/σ² = σS²/(σS² + σR²).   (2.21)
G is defined as the test-retest reliability or reliability coefficient between two
measurements. G will be used interchangeably with ρ, the test-retest
correlation, throughout this book. From Eq. (2.21) it can be seen that the test-
retest reliability between two measurements made on the same experimental
unit is the proportion of the “true subject” or inter-subject variance that is
contained in the observed total variance. G will always be positive and
bounded on the interval (0,1) because the observed variance in the
denominator will always be greater than or equal to the inter-subject variance.
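Eq. (2.21) is easy to verify by simulation. The sketch below is illustrative only, with variances chosen so that G = 0.80 (the value used in Figure 2.1): true scores S receive independent random errors on two occasions, and the correlation between the resulting pretest and posttest recovers G.

```python
import random

def corr(a, b):
    # Pearson correlation, computed directly
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def simulate_reliability(n=20000, var_s=125.0, var_r=31.25, seed=7):
    """Pairs X1 = S + R1, X2 = S + R2; their correlation estimates
    G = var_s / (var_s + var_r), Eq. (2.21)."""
    rng = random.Random(seed)
    s  = [rng.gauss(100.0, var_s ** 0.5) for _ in range(n)]   # true scores
    x1 = [si + rng.gauss(0.0, var_r ** 0.5) for si in s]      # pretest
    x2 = [si + rng.gauss(0.0, var_r ** 0.5) for si in s]      # posttest
    return corr(x1, x2)

g_theory = 125.0 / (125.0 + 31.25)   # 0.80, as in Figure 2.1
g_sim = simulate_reliability()
```

With 20,000 simulated subjects the estimated correlation falls very close to the theoretical reliability of 0.80.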
[Figure 2.1: scatter plot of Posttest Scores vs. Pretest Scores, both on a 60 to 140 scale]
Figure 2.1: One-thousand normally distributed scores with mean 100 and
variance 125. The test-retest reliability between pretest and posttest is 0.80.
The solid line is the least squares linear fit to the data.
[Figure 2.3: Difference Scores (−30 to 20) vs. Pretest Scores (60 to 140), with least squares line]
Figure 2.3: The difference between posttest and pretest scores against each
subject’s baseline score. Original data are in Figure 2.1. The solid line is the
least squares regression and the negative slope is a hallmark of regression
towards the mean. Subjects whose baseline scores are less than the mean will
tend to show positive changes from baseline, whereas subjects whose baseline
scores are greater than the mean will tend to show negative changes from
baseline. This effect is independent of any treatment effect that might occur
between measurements of the pretest and posttest.
Regression towards the mean does not occur because of some underlying
biological or physical property common to the subjects being measured; it is
solely a statistical phenomenon and is due entirely to the properties of
conditional expectation. Conditional expectation is the expectation given that
some other event has already occurred. When both X and Y are normally
distributed, the conditional expectation of Y, given an observed value of x is
E(Y|X = x) = µY + G·(σY/σX)·(x − µX)   (2.24)
where G is the reliability coefficient (note that here G is calculated from the
correlation between observed pretest and posttest scores upon repeated
measurements of the same individual), σY is the standard deviation of Y, σX is
the standard deviation of X, µX is the mean of X, and µY is the mean of Y.
When µX = µY = µ and σX = σY = σ, this simplifies to E(Y|X = x) = µ + G(x − µ).
*
The QTc interval is one of the variables which an electrocardiogram
produces and is an index of cardiac repolarization.
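The conditional expectation in Eq. (2.24) can be checked by simulation. In this illustrative sketch (parameter values chosen to mirror Figure 2.1: mean 100, true-score variance 125, G = 0.80, and a cutoff of 115 that is an arbitrary choice), subjects whose pretest is well above the mean have a posttest mean pulled back toward µ, even though no treatment is applied.

```python
import random

def simulate_rtm(n=50000, mu=100.0, var_s=125.0, g=0.80, cutoff=115.0, seed=3):
    """Pairs X = S + R1, Y = S + R2 with reliability g = var_s/(var_s + var_r).
    Returns (mean pretest, mean posttest) among subjects with pretest > cutoff."""
    var_r = var_s * (1.0 - g) / g            # error variance implied by g
    rng = random.Random(seed)
    highs = []
    for _ in range(n):
        s = rng.gauss(mu, var_s ** 0.5)
        x = s + rng.gauss(0.0, var_r ** 0.5)  # pretest
        y = s + rng.gauss(0.0, var_r ** 0.5)  # posttest
        if x > cutoff:
            highs.append((x, y))
    pre = sum(h[0] for h in highs) / len(highs)
    post = sum(h[1] for h in highs) / len(highs)
    return pre, post
```

For the selected subgroup, the posttest mean lands near µ + G(x̄ − µ), below the subgroup's pretest mean but still above the population mean: pure regression towards the mean.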
[Figure: Coefficient of Reliability (0.70 to 0.95) vs. Number of Visits (0 to 12), plotted for m = 1, 3, and 5]
p(e1 ≤ X̄n ≤ e2 | X̄m) = Φ((e2 − µp)/√vp) − Φ((e1 − µp)/√vp)   (2.29)

where e1 and e2 are the lower and upper limits, respectively, of acceptability
into the study, Φ(.) is the cumulative distribution function of the standard
normal distribution,

µp = µn + (m/n)·[(1 + (n−1)G)/(1 + (m−1)G)]·(X̄m − µm),   (2.30)

vp = [1 − (m/n)·(1 + (n−1)G)/(1 + (m−1)G)]·[(1 + (n−1)G)/n]·σ²,   (2.31)
µn is the population mean of n measurements, and µ m is the population mean
of m measurements (Moye et al., 1996). Eq. (2.29) can be solved iteratively to
find critical values for X m given a predefined level of probability set by the
investigator. The difficulty with using Eq. (2.29) is the reliance on having to
know a priori the population mean, the population standard deviation, and the
population correlation.
As an example, using the data and example provided by Moye et al.
(1996), what is the probability that an individual can enroll in the CARE trial
after a single trial? Assume that the population mean and standard deviation
for the first and second measurements are 137.3 ± 29.4 and 135.8 ± 29.3
mg/dL, respectively, and the test-retest correlation between measurements is
0.79. The probability that an individual will be eligible for the study is given
by Figure 2.5. With virtual certainty, someone who has an initial LDL
cholesterol of less than 90 mg/dL or greater than 210 mg/dL will not be
eligible for the study. If the researcher wishes to be 90% confident that a
subject will be eligible for the study, then taking the upper and lower 5%
interval of the distribution of probabilities will suffice as cut-off values. In
this instance, the upper and lower cut-off values were approximately 95 and
203 mg/dL.
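Eqs. (2.29) to (2.31) can be sketched in a few lines. The eligibility limits e1 and e2 are not stated in this excerpt, so the values below (115 and 174 mg/dL) are illustrative assumptions; the means, standard deviation, and correlation are the CARE values quoted above, with µn taken as the average of the two measurement means.

```python
import math

def phi(z):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_eligible(xbar_m, m, n, e1, e2, mu_m, mu_n, sigma, g):
    """Eq. (2.29): P(e1 <= mean of n measurements <= e2 | observed mean of m)."""
    ratio = (1.0 + (n - 1) * g) / (1.0 + (m - 1) * g)
    mu_p = mu_n + (m / n) * ratio * (xbar_m - mu_m)                        # Eq. (2.30)
    v_p = (1.0 - (m / n) * ratio) * (1.0 + (n - 1) * g) / n * sigma ** 2  # Eq. (2.31)
    sd = math.sqrt(v_p)
    return phi((e2 - mu_p) / sd) - phi((e1 - mu_p) / sd)

# CARE-like numbers from the text; e1 and e2 are assumed for illustration,
# and mu_n = (137.3 + 135.8) / 2
p_mid = prob_eligible(137.3, m=1, n=2, e1=115.0, e2=174.0,
                      mu_m=137.3, mu_n=136.55, sigma=29.4, g=0.79)
p_low = prob_eligible(90.0, m=1, n=2, e1=115.0, e2=174.0,
                      mu_m=137.3, mu_n=136.55, sigma=29.4, g=0.79)
```

A first measurement near the population mean yields a high probability of eligibility, while a first measurement of 90 mg/dL yields a probability near zero, in line with the cut-off behavior described in the text.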
[Figure 2.5: Probability of Enrollment (0.0 to 0.9) vs. First Measurement (LDL Cholesterol, 80 to 220 mg/dL)]
Yadj = Y − (G − 1)·(σY/σX)·(X − µ)   (2.32)

where G is the test-retest reliability coefficient, σY and σX are the standard
deviations for the posttest (sY) and pretest (sX), respectively, and µ is the
average pretest score (Chuang-Stein, 1993; Chuang-Stein and Tong, 1997).
The second term in Eq. (2.32), (G − 1)(σY/σX)(X − µ), reflects the amount of
regression towards the mean. It can be seen from Eq. (2.32) that when the
posttest score is greater than the mean, the adjusted score will be less than the
observed value. Conversely, when the observed posttest score is below the
mean, the adjusted posttest score will be above the mean. The net effect of
this is that the observed and adjusted scores will be highly correlated and the
adjusted posttest scores will have the same mean as the observed posttest
scores, but will have a variance slightly greater than the observed posttest
scores. This is illustrated in the top of Figure 2.6, which includes plots of the
adjusted posttest scores against the observed posttest scores from Figure 2.1.
The line in the figure shows the line of unity. The adjusted posttest scores
have a mean of 100 and a variance of 124 compared to a mean of 100 and a
variance of 120 for the observed scores. The bottom figure plots the amount
of regression towards the mean as a function of the observed posttest score.
There is a direct linear relationship between observed posttest scores and the
amount of regression towards the mean. The maximal amount of regression
towards the mean was 6.5; thus, at most, 6.5 units of an observed posttest score
were due to regression towards the mean, not a treatment effect. Other methods
are available for correcting for regression towards the mean, but are more
technically complex. The reader is referred to Lin and Hughes (1997) for a
more thorough exposition of these techniques.
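Eq. (2.32) is straightforward to apply. The data below are made up for illustration; the two properties discussed above, that the adjusted scores keep the posttest mean while gaining variance, are checked at the end.

```python
import statistics

def adjust_posttest(x, y, g):
    """Eq. (2.32): Yadj = Y - (G - 1)(sY/sX)(X - mu), with mu, sX, sY
    estimated from the samples."""
    mu = statistics.mean(x)
    sx = statistics.stdev(x)
    sy = statistics.stdev(y)
    return [yi - (g - 1.0) * (sy / sx) * (xi - mu) for xi, yi in zip(x, y)]

pretest  = [90.0, 95.0, 100.0, 105.0, 110.0]   # illustrative scores only
posttest = [92.0, 96.0, 100.0, 104.0, 108.0]
adjusted = adjust_posttest(pretest, posttest, g=0.80)
```

Because G < 1, the adjustment pushes each posttest score away from the mean in proportion to how far the subject's pretest sat from µ, which leaves the mean unchanged but inflates the variance, as described for Figure 2.6.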
Another useful post-hoc procedure is to use analysis of covariance
(Chapter 5). As will be seen in Chapter 5, analysis of covariance explicitly
takes into account the dependence of the posttest scores on the pretest scores,
as well as controlling regression towards the mean, whereas the other
statistical methods which will be discussed herein do not.
[Figure 2.6: Adjusted Posttest Scores (60 to 130) vs. observed posttest scores (top); Amount of Regression Towards the Mean (−10 to 8) vs. observed posttest scores (bottom)]
Figure 2.6: Adjusted posttest scores plotted against original posttest scores
(top) and the amount of regression towards the mean against baseline scores
(bottom). Adjusted posttest scores are highly correlated (solid line is the line
of unity) and have the same mean as unadjusted posttest scores, but have a
larger variance. The bottom plot shows that, at most, 6.5 units of the observed
posttest score was due to regression towards the mean.
γ̂ = a/ρ̂   (2.43)

where µ and σ are estimated from the sample estimates of the mean and
standard deviation of all n pretest scores, and

a = ∑_{i=1}^{n} (Xi − µ)(Yi − µ) / ∑_{i=1}^{n} (Xi − µ)²   (2.44)
and ρ̂ is the same sign as a. Note that Eq. (2.42)-(2.44) are calculated using
only the first n samples, but that the estimates for µ and σ are calculated using
all n+m samples.
                        Administer Pretest
                        No          Yes
Administer     No       Group 1     Group 2
Treatment      Yes      Group 3     Group 4
(Group 2), and the last group is given neither the pretest nor the treatment
intervention (Group 1). All four groups are given the posttest, which are then
analyzed using an ANOVA. A statistically significant interaction effect is
evidence of pretest sensitization. The reader is referred to Kirk (1982) or
Peterson (1985) for details on factorial designs.
As with any experimental design, certain assumptions must be made for
the ANOVA to be valid. First, subjects are chosen at random from the
population at large and then randomly assigned to one group, the assumption
being that any subject in any group is similar in characteristics to any subject
in another group. However, given the nature of the experimental design it is
impossible to test this assumption because only half the subjects are
administered the pretest. Second, the data are normally distributed with
constant variance and each subject’s scores are independent of the other
subject’s scores. This assumption is seen with many of the statistical tests
which have been used up to this point.
A couple of points relating to this topic must be made. First, in this
design the pretest cannot be treated as a continuous variable; it is a qualitative
variable with two levels (“Yes” or “No”) depending on whether or not the
subject had a pretest measurement. Second, whereas in future chapters it will
be advocated to include the pretest as a covariate in the linear model, this
cannot be done here because only some of the subjects actually take the
pretest. In fact, this is the biggest drawback of the design: only the posttest
scores can be analyzed, and any within-subject information is lost in the
process. Third, the Solomon four-group design can be expanded to include
additional treatments. For example, a 2 × 3 design could have two active
treatment levels and a no-treatment level, a 2 × 4 design would have three active
treatment levels and a no-treatment level, etc. Of course, the disadvantage of
the factorial design is that as the number of treatment levels increases there is a
geometric increase in the required number of groups and subjects, thereby
possibly making the experiment prohibitive.
In factorial experimental designs, simple effects refer to the differences
between the levels of the factors and main effects refer to the average of the
simple effects. The difference in main effects for each level is referred to as
                     Administer Pretest
                     No     Yes    Simple Effect    Main Effect    Interaction
Administer     No    20     22          2               13             22
Treatment      Yes   26     50         24
Simple Effect         6     28
Main Effect          17
Interaction          22
Note: The dependent variable is the posttest score. Simple effects are
the difference between factor levels. Main effects are the average of the
simple effects, and the interaction is the difference of the simple effects.
Pretest sensitization is reflected in dissimilar simple effects, i.e., a
nonzero interaction.
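The arithmetic behind the table is easy to reproduce. A minimal sketch using the four cell means (20, 22, 26, 50) from the table:

```python
# cells of the 2x2 design: keys are (treatment, pretest) levels
cell = {("no", "no"): 20, ("no", "yes"): 22,
        ("yes", "no"): 26, ("yes", "yes"): 50}

# simple effects of the pretest at each treatment level
simple_no  = cell[("no", "yes")]  - cell[("no", "no")]    # pretest effect, no treatment
simple_yes = cell[("yes", "yes")] - cell[("yes", "no")]   # pretest effect, with treatment

main_effect = (simple_no + simple_yes) / 2   # average of the simple effects
interaction = simple_yes - simple_no         # difference of the simple effects
```

A nonzero interaction (here 24 − 2 = 22) is the signature of pretest sensitization: the pretest changes the apparent size of the treatment effect.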
[Figure 2.7: two interaction plots of Posttest Score (15 to 55) vs. pretest administration (No/Yes); left panel lines labeled Pretest - Yes and Pretest - No, right panel lines labeled Treatment - Yes and Treatment - No]
Figure 2.7: Plot demonstrating pretest sensitization using the data in Table
2.5. A differential treatment effect is observed depending on whether subjects
were exposed to the pretest prior to the treatment intervention.
Summary
• No measurement has perfect reliability. Every measurement has some
degree of error associated with it, both systematic and random in nature.
• Compounded with the inherent degree of error associated with every
measurement is that regression towards the mean occurs whenever a
subject is measured on at least two occasions.
• Regression towards the mean is independent of any treatment effects that
may be applied between collection of the pretest and subsequent
measurements.
◊ When an individual's pretest score is very different from the
mean pretest score, regression towards the mean becomes an
important issue and may bias the estimation and/or detection of
treatment effects.
◊ This is particularly the case when subjects are enrolled in a study
based on their pretest score. In these subjects, the effect that
regression towards the mean will have on the estimation of the
treatment effect will probably be quite substantial.
• It is often assumed that the pretest and treatment effect do not interact, but
that may not necessarily be the case.
• Careful examination of the main effect plots should be used to reveal the
presence of a treatment by pretest interaction.
• If a researcher believes that pretest sensitization may occur, the best
method to control for it is by the appropriate experimental design.
◊ If, however, an experiment suggests that pretest sensitization has
occurred, it is possible to estimate the treatment effect using a
modified linear model where the treatment effect is a function of
the pretest score.
DIFFERENCE SCORES
σ̂∆² = ∑_{i=1}^{n} (∆i − ∆̄)² / (n − 1)   (3.11)

where ∆̄ is the average difference score,

∆̄ = ∑_{i=1}^{n} ∆i / n.   (3.12)
Notice that

∆̄ = ∑_{i=1}^{n} (Yi − Xi)/n = ∑_{i=1}^{n} Yi/n − ∑_{i=1}^{n} Xi/n = Ȳ − X̄.   (3.13)
Thus the average difference is equal to the difference of the averages. The t-
statistic is then computed as
where X and Y are the mean pretest and posttest scores, respectively. The t-
statistic in Eq. (3.23) has a Student’s t-distribution with n-1 degrees of
freedom. Rejection of the null hypothesis indicates that a difference exists
between pretest and posttest scores.
It can be shown mathematically that Eq. (3.23) is equivalent to Eq. (3.14).
To see this numerically, consider the data in Table 3.1. The authors measured
the degree of platelet aggregation in 11 individuals before and after cigarette
smoking and wished to determine if smoking increases the degree of platelet
aggregation. The mean difference was 10.27 with a standard deviation of
7.98. The standard deviation of the before (X) and after (Y) data was 15.61 and 18.30,
respectively, and using Eq. (3.22) an estimate of the correlation coefficient
between the before and after data was 0.901. Using Eq. (3.14), the t-test
statistic is
t = ∆̄/√(σ̂∆²/n) = 10.27/√(63.62/11) = 4.27.
The critical one-sided Student’s t-value with 10 degrees of freedom is 1.812.
Because the authors phrased the null hypothesis in terms of directionality (one-
sided), it may be concluded that smoking increases the degree of platelet
aggregation (because the average difference scores was greater than 0) at α =
0.05. Using Eq. (3.23), the t statistic, denoted t(τ) to differentiate it from t, is
t(τ) = (52.45 − 42.18) / √[(1/11)(18.30² + 15.61² − 2(0.901)(18.30)(15.61))] = 4.27,

which is equivalent to Eq. (3.14). Thus the null hypothesis of no difference,
Ho: µ1 − µ2 = 0 against Ha: µ1 − µ2 ≠ 0, was rejected.
Table 3.1: Maximum Percent Aggregation

Before    After    Difference
  25        27         2
  25        29         4
  27        37        10
  44        56        12
  30        46        16
  67        82        15
  53        57         4
  53        80        27
  52        61         9
  60        59        -1
  28        43        15

Mean                  42.18    52.45    10.27
Standard Deviation    15.61    18.30     7.98
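The hand computation above is easy to reproduce with the Table 3.1 data. A short standard-library Python sketch (the book's own analyses use SAS):

```python
import math
import statistics

before = [25, 25, 27, 44, 30, 67, 53, 53, 52, 60, 28]
after  = [27, 29, 37, 56, 46, 82, 57, 80, 61, 59, 43]

d = [a - b for a, b in zip(after, before)]     # difference scores
d_bar = statistics.mean(d)                     # 10.27
var_d = statistics.variance(d)                 # 63.62 (n - 1 in the denominator)
t = d_bar / math.sqrt(var_d / len(d))          # paired t statistic, 4.27
```

The statistic exceeds the one-sided critical value of 1.812 on 10 degrees of freedom, matching the conclusion in the text.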
                                          Signed
Before    After    Difference    Rank     Rank
  25        27         2          2         2
  25        29         4          3.5       3.5
  27        37        10          6         6
  44        56        12          7         7
  30        46        16         10        10
  67        82        15          8.5       8.5
  53        57         4          3.5       3.5
  53        80        27         11        11
  52        61         9          5         5
  60        59        -1          1        -1
  28        43        15          8.5       8.5

n = 11
T+ = 2 + 3.5 + 6 + 7 + 10 + 8.5 + 3.5 + 11 + 5 + 8.5 = 65
T- = abs(-1) = 1
T0.05(2), 11 = 10
Since T- is less than or equal to T0.05(2), 11, reject Ho.
where ∆1 and ∆ 2 are the mean difference scores for Groups 1 and 2,
respectively, n1 and n2 are the number of subjects in Groups 1 and 2,
respectively, and s 2p is the pooled variance estimate,
s_p² = (SS1 + SS2)/(n1 + n2 − 2) = [∑_{i=1}^{n1} (∆1i − ∆̄1)² + ∑_{i=1}^{n2} (∆2i − ∆̄2)²] / (n1 + n2 − 2),   (3.31)

where ∆1i and ∆2i are the ith subject's difference scores in Groups 1 and 2,
respectively. If the observed T is greater than Student's tα(2),n1+n2−2, the null
hypothesis of no treatment effect is rejected.
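The pooled-variance comparison of two groups of difference scores can be sketched as follows; the statistic uses the standard two-sample t form with the pooled variance of Eq. (3.31), and the data in the accompanying check are made up for illustration.

```python
import math
import statistics

def two_group_diff_t(d1, d2):
    """Two-sample t on difference scores, pooled variance per Eq. (3.31);
    degrees of freedom are n1 + n2 - 2."""
    n1, n2 = len(d1), len(d2)
    m1, m2 = statistics.mean(d1), statistics.mean(d2)
    ss1 = sum((d - m1) ** 2 for d in d1)
    ss2 = sum((d - m2) ** 2 for d in d2)
    sp2 = (ss1 + ss2) / (n1 + n2 - 2)
    t = (m1 - m2) / math.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))
    return t, n1 + n2 - 2
```

The observed t is then compared with the two-tailed Student's t critical value on n1 + n2 − 2 degrees of freedom.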
As an example consider the data in Table 3.3. Subjects participated in a
study in which they received a placebo drug capsule and the next day received
a sedative capsule. Each subject’s psychomotor performance was assessed 8
hr after drug administration on each day using the digit-symbol substitution
test (DSST), a subtest of the Wechsler Adult Intelligence Scale-Revised. The
Females                  11      46      44      -2
                         12      56      52      -4
                         13      81      81       0
                         16      56      59       3
                         17      70      69      -1
                         18      70      68      -2
Mean                            63.2    62.2    -1.00
S.D.                            12.7    13.2     2.37
t-test                         -2.63   -2.90     0.305
Degrees of Freedom               11      11      11
p-value                        0.0249  0.0144   0.7664
Rank-transform t-test          -2.496  -3.261   0.2772
p-value                        0.0297  0.0076   0.7868
DSST is a pen-and-paper test in which a unique symbol is associated with each
digit in the key at the top of the test. The score is the number of symbols
correctly drawn in 90 sec. A decrease in score is indicative of psychomotor
impairment. Such a long time interval between drug administration and
assessment of psychomotor performance may be relevant for drugs that are
used as sleeping agents. A patient’s psychomotor performance the morning
after drug administration may be an issue if the patient will be driving to work
or performing some other task that requires coordination. The researchers were
[Table fragment: summary statistics for the Video Intervention, Literature Intervention, and Control Group in an education study]
and when the pretest and posttest are measured using the same measuring
device and have equal reliability, Gx, then
Gd = (G − ρxy)/(1 − ρxy),   G > ρxy, ρxy ≠ 1.   (3.34)
Obviously when the correlation between scores is equal to their individual
reliabilities, G = ρxy, the reliability of the difference scores is equal to 0 and
completely unreliable. What might not seem as obvious is that for any G and
ρxy, Gd is always less than G (Figure 3.1). Thus difference scores are said to
be unreliable. This argument has stood for almost a quarter of a century
before being challenged.
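Eq. (3.34) is a one-line function, and its two key properties, that Gd collapses to 0 when ρxy reaches G and that Gd always falls below G, are easy to check directly:

```python
def diff_score_reliability(g, rho_xy):
    """Eq. (3.34): reliability of difference scores when pretest and posttest
    share reliability G and correlate rho_xy (requires G > rho_xy)."""
    return (g - rho_xy) / (1.0 - rho_xy)

# example values: G = 0.8 and rho_xy = 0.5 give Gd = 0.6, below G
gd = diff_score_reliability(0.8, 0.5)
```

Scanning a grid of (G, ρxy) pairs reproduces the fan of curves in Figure 3.1: the larger the pretest-posttest correlation, the lower the reliability of the difference.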
Recently, however, researchers have begun to question this dogma for a
number of reasons. Foremost, if the analysis of data using difference scores
results in a significant finding, does the unreliability of difference scores imply
the finding is invalid? Of course not. Williams and Zimmerman (1996) have
made a quite persuasive argument which says that although difference scores
are unreliable, the difference in reliability between the individual pretest and
posttest reliabilities can be small. They state that the dogma of unreliability of
[Figure 3.1: Reliability of Difference Scores (0.0 to 0.9) vs. Reliability of Measuring Device (Gx, 0.0 to 1.0), plotted for ρxy = 0.01, 0.25, 0.50, 0.75, and 0.90]
difference scores is always based on the worst case scenario; that those authors
who argue that difference scores are unreliable use assumptions that “show the
reliability of differences in the most unfavorable light” (Williams and
Zimmerman, 1996). Under the assumptions made in developing Eq. (3.34), the
reliability of difference scores will be at a minimum. It is not that Williams
and Zimmerman (1996) are arguing that difference scores are reliable, they are
arguing that it is wrong to simply conclude that difference scores will be
unreliable under any and all conditions. They also argue that it is too
restrictive to assume that the standard deviations of the pretest and posttest
scores are always equal, especially after a treatment intervention.
The reliability of difference scores must be viewed as a
composite function of all its components, not as a simplification. When
viewed as a whole, the usefulness of difference scores becomes more apparent.
β2 = m4/m2²   (3.36)

where

mk = ∑_{i=1}^{n} (Xi − X̄)^k / n

and n is the sample size. Compute
Y = √b1 · √[(n + 1)(n + 3)/(6(n − 2))]   (3.37)

β2(√b1) = 3(n² + 27n − 70)(n + 1)(n + 3) / [(n − 2)(n + 5)(n + 7)(n + 9)]   (3.38)

W² = −1 + √[2β2(√b1) − 1]   (3.39)

δ = 1/√(Ln W)   (3.40)

θ = √[2/(W² − 1)]   (3.41)

Z1 = δ · Ln[Y/θ + √((Y/θ)² + 1)].   (3.42)
Under the null hypothesis that the sample comes from a normal distribution, Z1
is approximately normally distributed. Eq. (3.37) to (3.42) are based on the
skewness of the samples. The other half of the omnibus test is based on the
kurtosis. Compute
√β1(β2) = [6(n² − 5n + 2)/((n + 7)(n + 9))] · √[6(n + 3)(n + 5)/(n(n − 2)(n − 3))]   (3.46)

A = 6 + [8/√β1(β2)]·[2/√β1(β2) + √(1 + 4/β1(β2))]   (3.47)

Z2 = [(1 − 2/(9A)) − ((1 − 2/A)/(1 + q·√(2/(A − 4))))^(1/3)] / √(2/(9A)).   (3.48)
Under the null hypothesis that the sample comes from a normal distribution, Z2
is approximately normally distributed. The omnibus test statistic is then
calculated as
K² = Z1² + Z2².   (3.49)
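Eqs. (3.36) to (3.49) can be assembled into a working omnibus statistic. The standardized kurtosis q comes from Eqs. (3.43) to (3.45), which are not in this excerpt; the code below fills it in with the usual Anscombe-Glynn standardization of b2, so treat that part as an assumption. The formulas require a moderate sample size (roughly n > 8).

```python
import math

def _cbrt(v):
    # real cube root (handles negative arguments)
    return math.copysign(abs(v) ** (1.0 / 3.0), v)

def omnibus_k2(x):
    """Omnibus normality statistic K^2 = Z1^2 + Z2^2, per Eqs. (3.36)-(3.49)."""
    n = len(x)
    xbar = sum(x) / n
    m2 = sum((v - xbar) ** 2 for v in x) / n
    m3 = sum((v - xbar) ** 3 for v in x) / n
    m4 = sum((v - xbar) ** 4 for v in x) / n
    sqrt_b1 = m3 / m2 ** 1.5              # sample skewness
    b2 = m4 / m2 ** 2                     # sample kurtosis, Eq. (3.36)

    # skewness half, Eqs. (3.37)-(3.42)
    Y = sqrt_b1 * math.sqrt((n + 1) * (n + 3) / (6.0 * (n - 2)))
    beta2 = (3.0 * (n ** 2 + 27 * n - 70) * (n + 1) * (n + 3)
             / ((n - 2) * (n + 5) * (n + 7) * (n + 9)))
    w2 = -1.0 + math.sqrt(2.0 * beta2 - 1.0)
    delta = 1.0 / math.sqrt(math.log(math.sqrt(w2)))
    theta = math.sqrt(2.0 / (w2 - 1.0))
    z1 = delta * math.log(Y / theta + math.sqrt((Y / theta) ** 2 + 1.0))

    # kurtosis half, Eqs. (3.46)-(3.48); q is the standardized b2
    # (Eqs. (3.43)-(3.45), filled in here by assumption)
    e_b2 = 3.0 * (n - 1) / (n + 1)
    v_b2 = 24.0 * n * (n - 2) * (n - 3) / ((n + 1) ** 2 * (n + 3) * (n + 5))
    q = (b2 - e_b2) / math.sqrt(v_b2)
    sb = (6.0 * (n ** 2 - 5 * n + 2) / ((n + 7) * (n + 9))
          * math.sqrt(6.0 * (n + 3) * (n + 5) / (n * (n - 2) * (n - 3))))
    A = 6.0 + 8.0 / sb * (2.0 / sb + math.sqrt(1.0 + 4.0 / sb ** 2))
    z2 = ((1.0 - 2.0 / (9.0 * A)
           - _cbrt((1.0 - 2.0 / A) / (1.0 + q * math.sqrt(2.0 / (A - 4.0)))))
          / math.sqrt(2.0 / (9.0 * A)))
    return z1, z2, z1 ** 2 + z2 ** 2
```

For a perfectly symmetric sample, Z1 is exactly zero, and a flat (platykurtic) sample drives Z2 negative; K² is then referred to a chi-squared distribution with 2 degrees of freedom.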
where σy-x, σx and σy are the standard deviations of the difference, pretest and
posttest scores, respectively, and σ2x is the variance of the pretest scores
(Ghiselli, Campbell, and Zeddeck, 1981). When measurements are made
using the same instrument and σx = σy, then Eq. (3.50) may be simplified to
ρx,y−x = σx²(G − 1)/(σx σy−x).   (3.51)
Only when the denominator of Eq. (3.51) is large compared to the numerator
and the test-retest correlation between pretest and posttest scores is very close
to one does the correlation between pretest scores and difference scores
become close to 0. Because the numerator in Eq. (3.50) is the difference
between the pretest/posttest covariance and the variance of the pretest scores,
it is common for ρx,y − x to be negative because the variance of the pretest
scores is often greater than the pretest/posttest covariance. As we will see in
later chapters (Analysis of Covariance), a researcher may take advantage of
the correlation between pretest and difference scores by using the pretest score
as a covariate in an analysis of covariance while using the difference score as
the dependent variable.
The problem with the correlation between difference and pretest scores is
that regression towards the mean significantly influences the value of the
corresponding difference score. Explicit in the development of the paired
samples t-test is that regression towards the mean is not occurring. For the t-
test to be valid when significant regression towards the mean is occurring,
adjusted posttest scores, as given in Eq. (2.32), should be used instead of raw
posttest scores. In the case where subjects are enrolled in a study on the basis
of their pretest scores, the influence of regression towards the mean becomes
magnified. If µ is known then an alternative solution to the t-test problem is
presented by Mee and Chua (1991). They presented a regression-based test to
be used when regression towards the mean is occurring to a significant extent.
Let Zi = Xi − µ . Assuming Zi = zi is fixed then
Y = µ + τ + ρz + e   (3.52)

where e is random error with mean 0 and variance σ². As can be seen, Eq.
(3.52) indicates that Y and z are related in a linear manner with the slope equal
to the test-retest correlation coefficient between ∆ and z and intercept equal to
the sum of the population mean and treatment effect. Hence, the null
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Prob>F
Model 1 422.09877 422.09877 20.797 0.0038
Error 6 121.77623 20.29604
Total 7 543.87500
Parameter Estimates
Parameter Standard T for H0:
Variable DF Parameter Error Parameter=0 Prob > |T|
INTERCEPT 1 79.95905 4.580258 17.457 0.0001
Z 1 1.111152 0.243653 4.560 0.0038
Summary
• Difference scores:
1. Attempt to remove the influence of the pretest score on the
posttest score but often fail in this regard because difference
scores are usually negatively correlated with pretest scores.
2. Require that the pretest and posttest be collected with the same
measuring device and have the same units.
3. Are easy to interpret and easily analyzed. This is the main
reason they are used.
4. Are usually less reliable than, and only rarely as reliable as, the
individual component scores.
• Nonparametric counterparts to the parametric statistical methods
presented have only slightly less power when the assumptions of the
parametric test are met and greater power when the assumptions of the
parametric test are violated.
◊ It is suggested that most analyses be done using nonparametric
methods unless one is confident in the assumptions of the
parametric test.
• When the marginal distributions of the pretest and posttest are normally
distributed, the distribution of their difference scores will be normally
distributed.
• Beware of the impact of regression towards the mean in the analysis of
difference scores and use analysis of covariance or corrected difference
scores when necessary.
E(C) = E(Y/X) − 1.    (4.3)
Thus the expected value of a change score simplifies to finding the expected
value of the ratio of Y to X. The expected value of Y/X can be written as a
first-order Taylor series expansion
E(Y/X) = µy/µx + (1/µx²)·[σx²·(µy/µx) − ρσxσy].    (4.4)
Because E(Y/X) does not equal µy/µx, Eq. (4.1) is biased, i.e., the average
value of the estimate is not equal to the average value of the quantity being
estimated. The second term in Eq. (4.4) is the degree of bias in the
proportional change score, which depends on several factors, including the
means and standard deviations of the pretest and posttest and the reliability
of the measuring device.
Using the rules of expectation, the variance of Eq. (4.1) can be written as
Var[(Y − X)/X] = (1/µx²)·[σx²·(µy²/µx²) + σy² − 2ρσxσy·(µy/µx)].    (4.5)
Thus the variance of the proportional change score is dependent on the same
factors that affect the expectation of the change score. It should be stressed
that the same results hold for when change is expressed as a percentage. In
this case, Eq. (4.4) and Eq. (4.5) are expressed as
E(C%) = 100·[µy/µx + (1/µx²)·(σx²·(µy/µx) − ρσxσy) − 1]    (4.6)

and

Var[(Y − X)/X × 100%] = (100²/µx²)·[σx²·(µy²/µx²) + σy² − 2ρσxσy·(µy/µx)],    (4.7)
respectively.
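The size of the bias term in Eq. (4.4) is easy to check numerically. The sketch below, in Python (the book's own listings use SAS, and every parameter value here is hypothetical), compares the simulated mean of Y/X to the Taylor approximation:

```python
import math
import random

def taylor_expected_ratio(mu_x, mu_y, sd_x, sd_y, rho):
    """Approximation to E(Y/X) from Eq. (4.4)."""
    return mu_y / mu_x + (sd_x ** 2 * mu_y / mu_x - rho * sd_x * sd_y) / mu_x ** 2

def simulate_mean_ratio(mu_x, mu_y, sd_x, sd_y, rho, n, seed=1):
    """Average of Y/X over n correlated normal (X, Y) pairs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(0, 1)
        w = rng.gauss(0, 1)
        y = rho * x + w * math.sqrt(1 - rho ** 2)  # correlated standard normals
        total += (mu_y + sd_y * y) / (mu_x + sd_x * x)
    return total / n

approx = taylor_expected_ratio(100, 110, 10, 10, 0.7)  # 1.104
observed = simulate_mean_ratio(100, 110, 10, 10, 0.7, 200_000)
```

With these values the approximation gives 1.104 rather than µy/µx = 1.10, and the simulated mean falls close to it, illustrating the bias of the proportional change score.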
Tornqvist, Vartia, and Vartia (1985) have shown that nearly every indicator
of relative change can be expressed as a function of Y/X alone. Hence, the
change function can be expressed as an alternate function dependent solely on
Y/X. Formally, there exists a function H, such that
C(X, Y) = H(Y/X) = C(Y/X, 1)    (4.8)
with properties:
1. H(Y/X) = 0 if Y/X = 1.
2. H(Y/X) > 0 if Y/X > 1.
3. H(Y/X) < 0 if Y/X < 1.
4. H is a continuous increasing function of its argument Y/X.
5. H(aY/(aX)) = H(Y/X) for a > 0, so that C(aX, aY) = C(X, Y), i.e., the
change function is invariant to a common rescaling of X and Y.
Given that the expected value of Y/X is biased, any relative change
function that can be expressed as a function H will have some degree of bias
due to the expected value of Y/X. Table 4.1 shows a variety of other relative
change functions and the function of Y/X to which each maps.
TABLE 4.1

Function                                  Mapping
(X2 − X1)/X1                              X2/X1 − 1
(X2 − X1)/X2                              1 − X1/X2
(X2 − X1)/[(X2 + X1)/2]                   (X2/X1 − 1)/[(1 + X2/X1)/2]
(X2 − X1)/√(X2·X1)                        (X2/X1 − 1)/√(X2/X1)
(X2 − X1)/[2/(1/X1 + 1/X2)]               (X2/X1 − 1)·(1 + X1/X2)/2
(X2 − X1)/min(X2, X1)                     (X2/X1 − 1)/min(1, X2/X1)
(X2 − X1)/max(X2, X1)                     (X2/X1 − 1)/max(1, X2/X1)
(X2 − X1)/K(X1, X2),                      (X2/X1 − 1)/K(1, X2/X1)
    where K is any mean of X1 and X2
Figure 4.1: Plot of percent change from baseline against baseline scores from
the data in Figure 2.1. The solid line is the least-squares linear regression line
through the data. The negative correlation is evidence that regression towards
the mean occurs even with percent change scores.
Figure 4.2: Plot of adjusted percent change scores against their baseline scores
using Eq. (4.12). Compare to Figure 4.1. There is no correlation between
baseline scores and adjusted posttest scores, indicating that regression towards
the mean has been corrected.
R = geo(X)² · Σij[(C%ij − C%i)/100]² / Σij(Dij − Di)²    (4.13)
i = 1, 2, ..., k; j = 1, 2, ..., n, where geo(X) is the geometric mean of the pretest
scores,
geo(X) = (X1X2···Xn)^(1/n) = (Πi=1..n Xi)^(1/n)    (4.14)
C%i is the ith group’s average percent change score, and Di is the ith group’s
average difference score. If R is greater than one, use difference scores,
otherwise use percent change scores. Although this test statistic was
developed for normal populations, using computer simulation, Kaiser (1989)
indicated that it works well with populations exhibiting substantial positive
*
Berry (1990) refers to this as a symmetrized change score.
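Kaiser's rule, Eqs. (4.13) and (4.14), can be sketched as follows (Python for illustration, since the book's own listings are in SAS; the data and group labels in the test case are hypothetical):

```python
import math

def kaiser_R(pretest, posttest, groups):
    """Kaiser's (1989) R statistic, Eq. (4.13): within-group variation of
    percent change scores (rescaled by the geometric mean of the pretest)
    relative to that of difference scores. groups[i] is subject i's label."""
    n = len(pretest)
    geo = math.prod(pretest) ** (1.0 / n)                  # Eq. (4.14)
    pct = [100.0 * (y - x) / x for x, y in zip(pretest, posttest)]
    diff = [y - x for x, y in zip(pretest, posttest)]
    ss_pct = ss_diff = 0.0
    for g in set(groups):
        idx = [i for i, lab in enumerate(groups) if lab == g]
        pbar = sum(pct[i] for i in idx) / len(idx)
        dbar = sum(diff[i] for i in idx) / len(idx)
        ss_pct += sum(((pct[i] - pbar) / 100.0) ** 2 for i in idx)
        ss_diff += sum((diff[i] - dbar) ** 2 for i in idx)
    return geo ** 2 * ss_pct / ss_diff
```

An R greater than one favors difference scores; an R less than one favors percent change scores.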
[Figure panels: Brouwers and Mour modified proportional change score plotted
against observed pretest score (top); frequency histogram of the modified
proportional change scores (bottom).]
then the optimal value of c is the value of c which minimizes g. The value of c
can then be found using either a numerical optimization routine or a one-
dimensional grid search.
As an example, consider the artificial data presented in Table 4.2. In this
hypothetical example, subjects are administered either a drug or placebo and
asked whether they feel tired or awake using a visual analog scale. The scale
ranges from “tired” with a score of -50 to “wide awake” with a score of 50.
Thus the interval presented to each subject is [-50, 50], a range of 100 units.
In this example positive, negative, and zero scores are all present. If c were
less than or equal to 14, some shifted scores would still be less than or equal
to 0. Therefore, c is constrained to
be greater than 14. Figure 4.4 plots g as a function of c using a grid search
technique. It can be seen that the optimal value of c was 22.8. This resulted in
a skewness of -0.152 and a kurtosis of -3.00. The bottom plot in Figure 4.4 is
the resultant distribution of the modified log ratio scores. The log-ratio scores
can now be used as statistics in either a t-test or analysis of variance. The SAS
code used to obtain g is given in the Appendix.
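Because Eq. (4.19) is not reproduced in this excerpt, the sketch below simply assumes g penalizes non-normality as the sum of squared sample skewness and squared excess kurtosis of the log-ratio scores; the grid-search logic, not the particular form of g, is the point (Python; the book's code is SAS, and the scores below are hypothetical VAS data):

```python
import math

def moments_penalty(scores):
    """Hypothetical g: squared sample skewness plus squared excess kurtosis."""
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((s - mean) ** 2 for s in scores) / n
    m3 = sum((s - mean) ** 3 for s in scores) / n
    m4 = sum((s - mean) ** 4 for s in scores) / n
    skew = m3 / m2 ** 1.5
    exkurt = m4 / m2 ** 2 - 3.0
    return skew ** 2 + exkurt ** 2

def grid_search_c(pretest, posttest, c_lo, c_hi, step=0.1):
    """One-dimensional grid search for the constant c minimizing g over
    the log-ratio scores Ln(posttest + c) - Ln(pretest + c)."""
    best_c, best_g = None, float("inf")
    c = c_lo
    while c <= c_hi:
        scores = [math.log(y + c) - math.log(x + c)
                  for x, y in zip(pretest, posttest)]
        g = moments_penalty(scores)
        if g < best_g:
            best_c, best_g = c, g
        c += step
    return best_c, best_g

# Hypothetical VAS scores; c_lo exceeds 14 so all shifted scores are positive.
c_opt, g_min = grid_search_c([-10, 0, 5, 12], [-5, 3, 8, 14], 15.0, 30.0, 0.5)
```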
[Figure 4.4 panel annotations: minimum of g = 1.46 at c = 22.8; axes:
Function g vs. Constant c (top), Frequency vs. Ln(X1 + c) − Ln(X2 + c)
(bottom).]
Figure 4.4: Here, g (Eq. 4.19) is seen as a function of constant c using the
log-ratio score with adjustment as proposed by Berry (1987). The minimum
value of g is the optimal value of the constant to add to each score. The
bottom plot shows the distribution of the log-ratio scores after log-
transformation. After adding a value of 22.8 to each value and using the log-
ratio score as the primary metric, the data are normally distributed (bottom).
[Figure panels: % of simulations declaring normality plotted against
test-retest correlation.]
Summary
• Relative change scores are commonly found in the literature, both as
descriptive statistics and in hypothesis testing.
ANALYSIS OF COVARIANCE
Recall in Chapter 1 that many researchers feel that hypothesis testing for
baseline comparability is redundant because the very act of hypothesis testing
assures that some percentage of studies will be declared baseline non-
comparable. It is generally recognized, however, that if baseline non-
comparability among groups is suspected then analysis of covariance
(ANCOVA) is the recommended method of analysis (Senn, 1994; Overall,
1993). Also, recall from Chapter 2 that when regression towards the mean
significantly influences the measurement of the posttest, then ANCOVA is the
recommended method of choice. For these and other reasons, many
researchers advocate ANCOVA as the method of choice for pretest-posttest
data analysis.
Parametric ANCOVA
For k groups with j subjects in each group, the simple linear model for
ANCOVA is

Yij = µ + τi + βw(Xij − X..) + eij    (5.1)

where Yij is the posttest score for the jth subject in the ith group, µ is the
population grand mean, τi is the ith group's treatment effect, Xij is the pretest
score for the jth subject in the ith group, βw is the common linear regression
coefficient, eij is random error, and X.. is the grand mean pretest score.
Crager (1987) has
shown that in the case of pretest-posttest designs, the between-groups
regression coefficient is a function of the between subject, σS2 , and within
subject error variances, σe2 ,
βw = σS² / (σS² + σe²)    (5.2)
such that βw is always constrained to the interval (0,1). It is readily apparent
that βw is the test-retest reliability coefficient and it can now be seen why
ANCOVA is the method of choice when regression towards the mean
significantly influences the posttest scores. ANCOVA explicitly models the
influence of the pretest scores on the posttest scores by inclusion of a term for
regression towards the mean in the linear model.
Compared to a completely randomized design where
Yij = µ + τi + eij (5.3)
it can be seen that ANCOVA splits the error term into two components: one
due to regression of the pretest scores on the posttest scores and another due to
unexplained or residual variation. Hence, the mean square error term is
smaller in ANCOVA than in a completely randomized design and provides a
better estimate of the unexplained variation in the model when βw is
significantly different than 0. An alternative way of viewing ANCOVA may
be that ANCOVA proceeds as a two-step process. In the first step, ANCOVA
adjusts the posttest scores by removing the influence of the pretest scores, such
that
Y(adj)ij = Yij − βw(Xij − X..) = µ + τi + eij    (5.4)

where Y(adj)ij is the adjusted posttest score. The second step is to perform an
analysis of variance on the adjusted posttest scores. It can be shown that when
βw is at its upper limit, i.e., β w = 1 , both ANCOVA and ANOVA using
difference scores will have similar sum of squares values in the denominator of
the F-ratio, but ANOVA on difference scores will be slightly more powerful
due to a single lost degree of freedom associated with the regression
component in the ANCOVA. However, when βw ≠ 1 (as is usually the case
when the pretest and posttest are in fact correlated), the error term in the
ANCOVA is related to that of the ANOVA by

σ²e,ancova = σ²e,anova·(1 − ρ²)·[1 + 1/(dfe − 2)]    (5.5)

where dfe is the degrees of freedom associated with σ²e,anova. Normally the
degrees of freedom are sufficiently large that the reduction in the error term
depends primarily on the correlation coefficient between pretest and posttest
scores, with a higher correlation resulting in a smaller mean square error.
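The two-step view of ANCOVA in Eq. (5.4) can be sketched directly; in this sketch βw is estimated as the pooled within-group slope of posttest on pretest (Python rather than the book's SAS, and the data in the test case are hypothetical):

```python
def adjusted_posttests(pretest, posttest, groups):
    """Step 1 of ANCOVA viewed as a two-step process, Eq. (5.4):
    remove the pretest's influence from each posttest score."""
    # Pooled within-group slope estimate of beta_w
    sxy = sxx = 0.0
    for g in set(groups):
        idx = [i for i, lab in enumerate(groups) if lab == g]
        xbar = sum(pretest[i] for i in idx) / len(idx)
        ybar = sum(posttest[i] for i in idx) / len(idx)
        sxy += sum((pretest[i] - xbar) * (posttest[i] - ybar) for i in idx)
        sxx += sum((pretest[i] - xbar) ** 2 for i in idx)
    beta_w = sxy / sxx
    grand = sum(pretest) / len(pretest)
    # Step 2 would be an ordinary ANOVA on these adjusted scores.
    return [y - beta_w * (x - grand) for x, y in zip(pretest, posttest)]
```

When the posttest is an exact linear function of the pretest within groups, the adjusted scores collapse to a constant, as expected.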
The estimated treatment effect is

τ̂i = (X2i. − X2..) − βw(X1i. − X1..)    (5.6)

where X2i. is the ith group's mean posttest score, i = 1, 2, ..., k, X2.. is
the grand mean posttest score, X1i. is the ith group's mean pretest score,
and X1.. is the grand mean pretest score. If an ANCOVA is performed on the
difference scores, the estimated treatment effect, τ̃i, is given by

τ̃i = (di. − d..) − β̃w(X1i. − X1..).    (5.7)
E(Y) = (1 + β′0)·X + β′1·X².    (5.9)

Thus the expected posttest score is a quadratic function of the pretest score.
This function will be concave up when β′1 > 0 and concave down when β′1 < 0.
When β′1 < 0 the extremum, X*, can be found at

X* = −(1 + β′0)/(2·β′1).    (5.10)
Figure 5.1 demonstrates the relationship in Eq. (5.9) for both positive and
negative β1. Because the expected value function is dependent on β1, two
possible interpretations arise. One interpretation is that when β1 > 0, there is a
positive relationship between pretest and posttest for all pretest scores. The
other is that when pretest scores are less than the extrema and β1 < 0 then there
[Figure 5.1: expected posttest score plotted against pretest score for
β1′ > 0 and β1′ < 0.]
Figure 5.2: Plot showing how group by pretest interaction appears when
posttest scores are plotted against pretest scores. No interaction would appear
as parallel lines and a constant difference between groups for a given pretest
score. Interaction appears as diverging lines and non-constant differences for
a given pretest score. Shown in the plot is that the difference between the two
groups is 7 units when the pretest score was about 10, but the difference is 12
when the pretest score was about 18.
Group    Posttest    Pretest    Rank(Posttest)    Rank(Pretest)
1 16 26 1 7
60 10 5 3
82 42 7 11
126 49 11 13
137 55 12 14
2 44 21 4 6
67 28 6 8
87 5 8 2
100 12 9 4
142 58 13 15
3 17 1 2 1
28 19 3 5
105 41 10 10
149 48 14 12
160 35 15 9
Courtesy of Conover, W.J. and Iman, R.L., Analysis of covariance
using the rank transformation, Biometrics 38, 715-724, 1982. With
permission from the International Biometrics Society.
TABLE 5.3
Test                                            F-value    Degrees of freedom    p-value
Heterogeneity of variance
(Treatment by group interaction) 1.25 2, 9 0.333
Test for treatment effect using
parametric ANCOVA without a
treatment by group interaction term 0.74 2,11 0.498
Quade’s test for treatment effect 1.30 2,12 0.307
Test for treatment effect using
parametric ANCOVA on ranks without
a treatment by group interaction term 1.30 2,11 0.312
Error-in-variables ANCOVA
Another assumption often violated in ANCOVA is the assumption of
error-free measurement of the pretest scores. As mentioned in the second
chapter, no measurement is error free; all measurements have some degree of
random or systematic error. Often in the case of pretest-posttest data, the degree
of random error is the same for both the pretest and the posttest. It might not
be apparent that the error term in the linear model
Yij = µ + τi + βw(Xij − X..) + eij,   i = 1, 2, ..., k    (5.13)
represents the random component of Yij; it does not reflect measurement error
in Xij. In the case where significant error is made in the measurement of the
pretest the linear model must be modified by inclusion of an additional error
term, exij, to reflect this error
Yij = µ + τi + βw(Xij − X..) + exij + eij,   i = 1, 2, ..., k.    (5.14)
Rz(i) = Ψ[(Ri − 1/3)/(n + 1/3)]   for Tukey    (5.16)
where Rz(i) is the rank normal score, Ri is the rank of X, n is the number of
non-missing observations of X, and Ψ [.] is the inverse cumulative normal
distribution function. Note that SAS can provide these values as part of the
PROC RANK procedure using the NORMAL= option. Focus will primarily
center on Blom’s transformation because it has been shown to fit better to the
normal distribution than Tukey’s transformation. As an example, consider the
data set {27, 31, 45, 67, 81} with ranks {1, 2, 3, 4, 5}. The normalized rank
scores using Blom’s transformation are {-1.18, -0.50, 0, 0.50, 1.18}. For any
vector of scores, normalized rank scores center the ranks at 0 and force them
to be symmetrical around 0.
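The worked example above can be reproduced with Python's standard library, using the usual Blom formula Ψ[(Ri − 3/8)/(n + 1/4)] (the 3/8 and 1/4 constants are the standard Blom values; Eq. (5.15) itself is not reproduced in this excerpt):

```python
from statistics import NormalDist

def blom_scores(x):
    """Normalized rank scores via Blom's transformation."""
    n = len(x)
    ranks = {v: r for r, v in enumerate(sorted(x), start=1)}  # assumes no ties
    inv = NormalDist().inv_cdf  # the inverse cumulative normal function
    return [inv((ranks[v] - 0.375) / (n + 0.25)) for v in x]

scores = blom_scores([27, 31, 45, 67, 81])
# Rounded to two decimals: [-1.18, -0.50, 0.00, 0.50, 1.18]
```

This matches SAS's PROC RANK with the NORMAL=BLOM option.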
Knoke (1991) used Monte Carlo simulation to investigate the
performance of the usual parametric ANCOVA based on the raw data and the
nonparametric alternatives under the error-in-variables model. He examined
the influence of various measurement error distributions: normal, uniform,
double exponential, log-normal, Cauchy, etc. with varying degrees of
skewness and kurtosis on power and Type I error rate. The results indicated
that an increase in the error variance in the measurement of the pretest
decreased the power of both the parametric and nonparametric tests. When the
error variance was not large and the data were normally distributed, there was
little loss of power using the normal rank scores transformation compared to
parametric ANCOVA. Rank transformed scores tended to have less power
than both normal rank transformed scores and parametric ANCOVA. When
the measurement error was non-normally distributed, the normal rank scores
had much greater efficiency than the other methods, although as the sample
size increased or as the separation between groups increased, the efficiency of
using rank transformed scores approached the efficiency of using normal rank
transformed scores. Knoke (1991) concluded that normal rank scores
Other Violations
If a curvilinear relationship exists between pretest and posttest scores, the
general ANCOVA linear model may be modified to include higher order
polynomial terms
Yij = µ + τi + β1(Xij − X..) + β2(Xij − X..)² + eij    (5.17)
where β1 and β2 are the first order and second order common regression
coefficients. If the relationship between pretest and posttest scores is
nonlinear, then perhaps nonlinear mixed effects models may be needed. The
reader is referred to an excellent text by Davidian and Giltinan (1995) on
practical use of these nonlinear models.
where

Ŷij = µ̂ + τ̂i + β̂w(Xij − X..)    (5.19)
bisquare:  w = [1 − (u/4.685)²]²  for |u| ≤ 4.685,  w = 0  for |u| > 4.685    (5.21)

where w is the weight and u denotes the scaled residual

ui = ei / [(1/0.6745)·median{|ei|}]    (5.22)
and ei is the ith residual. The constant 1.345 in the Huber function and 4.685
in the bisquare function are called tuning constants. The denominator in Eq.
(5.22) is referred to as the median absolute deviation estimator. Seber (1977)
recommends that two iterations be used in computing the parameter estimates
under IRWLS due to convergence difficulties on successive iterations.
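The weighting pieces used in IRWLS can be sketched as follows (Python). The bisquare weight follows Eq. (5.21) and the scale estimate follows Eq. (5.22); the Huber weight is written in its standard form with the 1.345 tuning constant mentioned in the text, since its equation is not reproduced in this excerpt:

```python
def mad_scale(residuals):
    """Median absolute deviation estimator, the denominator of Eq. (5.22)."""
    abs_res = sorted(abs(e) for e in residuals)
    n = len(abs_res)
    med = (abs_res[n // 2] if n % 2 else
           0.5 * (abs_res[n // 2 - 1] + abs_res[n // 2]))
    return med / 0.6745

def huber_weight(u, k=1.345):
    """Standard Huber weight: full weight near 0, downweighted tails."""
    return 1.0 if abs(u) <= k else k / abs(u)

def bisquare_weight(u, k=4.685):
    """Bisquare weight, Eq. (5.21): smooth downweighting, zero beyond k."""
    return (1 - (u / k) ** 2) ** 2 if abs(u) <= k else 0.0
```

Each IRWLS iteration scales the residuals by `mad_scale`, converts them to weights, and refits the ANCOVA by weighted least squares.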
Using Monte Carlo simulation, Birch and Myers (1982) compared the
efficiency of ordinary least squares (OLS) to the bisquare and Huber functions
in ANCOVA. The authors concluded that the bisquare function was superior
Group 1 Group 2
Pretest Posttest Pretest Posttest
72 74 91 112
74 76 75 106
76 84 74 104
70 67 83 96
71 79 81 19
80 72 72 99
79 69 75 109
77 81 77 98
69 60 81 106
82 72
TABLE 5.5
Sum of Mean
Method Source Squares DF Square F-Ratio Prob>F
OLS* Regression 1.4 1 1.4 0.0 0.956
Treatment 1845.7 1 1845.7 4.2 0.057
Error 7057.0 16 441.1
Total 9134.1 18
* outlier still contained in data set
OLS** Regression 58.5 1 58.5 1.4 0.251
Treatment 3359.1 1 3359.1 81.9 0.001
Error 615.4 15 41.0
Total 4767.8 17
** outlier deleted from analysis
Huber Regression 1.1 1 1.1 0.1 0.796
Treatment 1286.9 1 1286.9 84.6 0.001
Error 243.3 16 15.2
Total 1541.4 18
bisquare Regression 1.4 1 1.4 0.6 0.470
Treatment 559.9 1 559.9 245.3 0.001
Error 13.7 6 2.28
Total 648.4 9
The authors used Monte Carlo simulation to examine whether this method
would lead to biased estimates of the true treatment effect and of the
significance of these effects. On the basis of their results, they concluded that
alternate ranks assignment did not lead to biased estimates of treatment effects
and that the precision of these estimates was slightly less than those obtained
from random assignment to groups. They also concluded that the power of the
F-test in ANCOVA was not changed when subjects were nonrandomly
assigned to treatment groups.
It was not the aim of the authors to show that the alternate ranks design
was superior to random assignment to treatment groups. Nevertheless, their
method appears to have numerous advantages over randomization. Because
their algorithm is fixed, assignment to treatment groups removes any
prejudices or biases the experimenter may have in assigning subjects to
treatment groups. The authors also suggest that “violation of (the)
Summary
• Analysis of covariance is a popular method used to analyze pretest-
posttest data.
• ANCOVA implicitly takes into account regression towards the mean.
• Often the assumptions of ANCOVA are violated. The most common
violations include:
1. Use of a parallel line ANCOVA model when the regression
slopes between groups are not equal.
2. Use of ANCOVA when the regression slope is not linear.
3. When the pretest score is measured with significant error.
4. When outliers are present which significantly influence the
estimation of the regression coefficient.
• Glass, Peckham, and Sanders (1972) present a thorough overview of the
consequences of violating the assumptions of ANOVA and ANCOVA.
The effect of violating these assumptions on the power and Type I error
rate of a statistical test will depend on the severity of the violation.
• One solution to most violations has been the use of some type of rank
transform, either to the posttest or to both the pretest and posttest.
◊ Rank-transform ANCOVA has similar power and Type I error
rate to parametric ANCOVA when the assumptions of parametric
ANCOVA are met and it has superior power when the
assumptions of parametric ANCOVA are violated.
◊ The use of nonparametric, rank-transformed ANCOVA in the
analysis of pretest-posttest data should be highly encouraged.
◊ One caveat to the rank transformation is that all Monte Carlo
simulations which tout its usage have been based on the results of
a single violation of the ANCOVA assumptions. There has been
no systematic research examining what to do when multiple
irregularities occur. In this case, one may have to choose the
lesser of two evils.
BLOCKING TECHNIQUES
where (ατ)ij is the ijth interaction effect for the jth block and treatment level i
with mean 0 and variance σ²ατ. In the case where the treatment by blocks
interaction is significant, it is recommended that PROC MIXED in SAS be
used instead of PROC GLM because PROC MIXED computes the exact
expected mean square, whereas PROC GLM does not. It has been shown that
when the treatment by block interaction term is statistically significant, this is
equivalent to violation of the assumption of homogeneity of regression
coefficients in ANCOVA.
Feldt (1958) concluded that when the sample size was sufficiently large,
blocking should be preferred over ANCOVA in most educational and
psychological studies because rarely is the correlation between pretest and
posttest large enough to take advantage of the greater power of ANCOVA.
Also, ANCOVA has more assumptions, which might be more difficult to meet,
than blocking. Feldt (1958) states that “dependence on the accuracy of the
assumed regression model (in ANCOVA) constitutes a severe restriction on
the usefulness of covariance techniques. The absence of any regression
assumptions in the (randomized block) design…represents a considerable
argument in its favor, especially in instances (where) the number of degrees of
freedom are fairly large.” However, when the sample size is small ANCOVA
should be preferred because small sample sizes do not lend themselves readily
to blocked experimental design.
Feldt makes some good points, but overstates the case for using blocking
techniques over ANCOVA. First, contrary to his statement above, randomized
block designs do have regression assumptions, although those assumptions are
less restrictive than with ANCOVA. Second, a high correlation (greater than
0.6) between pretest and posttest is more common than Feldt thinks. In
choosing between blocking and ANCOVA, verification of the assumptions
each method makes should be done before making any decisions based on
which method has greater statistical power because if the assumptions of the
method are violated, the validity of the method's conclusions may be in
question.
where df is the error term degrees of freedom, σ2 is the mean square error
term and the subscripts 'b' and 'crd' refer to the blocked design and the
completely randomized design, respectively. The ratio of degrees of freedom
in Eq. (6.5) is a correction term that reflects the different degrees of freedom
between the models. Relative efficiency may be viewed as the number of
replications collected on the same subject that must be made if the completely
randomized design is used instead of the randomized block design.
Alternatively, relative efficiency may be viewed as the ratio of the number of
subjects required by the unblocked design to achieve the same power as the
blocked design under similar testing conditions. By varying the number of
blocks from 2 to q (q ≤ total number of observations), an iterative estimate of
Summary
• Blocking is less restrictive than ANCOVA because it does not depend on
a linear relationship between the pretest and posttest scores.
Figure 6.1: Relative efficiency of post-hoc blocking (solid square) and mean
square error (solid circle) as a function of block size. Maximum efficiency is
reached using two blocks, whereas minimum mean square error remains
relatively constant throughout.
Figure 7.1: Cortisol levels for 13 patients with endogenous depression and 20
normal controls plotted as a tilted-line segment plot (top) and difference scores
for patients and controls plotted as a whisker plot (bottom). Data reprinted
from Asnis, G.M., Lemus, C.Z., and Halbreich, U., The desipramine cortisol
test−a selective noradrenergic challenge (relationship to other cortisol tests in
depressives and normals), Psychopharmacology Bulletin, 22, 571-578, 1986.
With permission.
Sum of Mean
Source DF Squares Square F-ratio Prob > F
Treatment 1 0.30 0.30 0.01 0.91
Subject(Treatment) 31 650.94 21.00 3.87 <0.01
Time 1 163.94 163.94 30.25 <0.01
Treatment by time 1 32.54 32.54 6.00 0.02
Error 31 168.01 5.42
Total 65 1057.43
where all factors are treated as in Eq. (7.1) and γ(Xj − µ) is the dependence
of the posttest score on the pretest score. An excellent example of this
approach is given in Wolfinger (1997) in the repeated measures analysis of
systolic blood pressure data from a clinical trial studying the effect of various
medications on hypertension. The reader is also referred to Littell et al. (1996)
for further details on the SAS code and the use of PROC MIXED to analyze
this type of data.
Figure 7.2: Drug X plasma levels over time after oral administration of drug
X in 14 volunteers before (top) and after (bottom) rifampin pretreatment.
AUCTA = [ Σ(i=1 to n−1) 0.5·(ti+1 − ti)·(Yi+1 + Yi) ] / (tn − t1)    (7.13)

where AUCTA is the AUC normalized to the total time interval, or time-averaged
AUC. With baseline correction (AUCTA-BC),

AUCTA-BC = [ Σ(i=1 to n−1) 0.5·(ti+1 − ti)·(Yi+1 + Yi) ] / (tn − t1) − Y0.    (7.14)
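Eqs. (7.13) and (7.14) are simply the trapezoidal rule normalized by the total sampling interval; a minimal sketch (Python; the times and concentrations in the example are hypothetical):

```python
def time_averaged_auc(t, y, baseline=None):
    """Time-averaged AUC by the trapezoidal rule, Eq. (7.13); subtracting a
    baseline value Y0 gives the baseline-corrected version, Eq. (7.14)."""
    n = len(t)
    auc = sum(0.5 * (t[i + 1] - t[i]) * (y[i + 1] + y[i]) for i in range(n - 1))
    auc_ta = auc / (t[-1] - t[0])
    return auc_ta if baseline is None else auc_ta - baseline

# Hypothetical sampling times (h) and concentrations (ng/mL)
auc = time_averaged_auc([0.0, 1.0, 2.0], [0.0, 2.0, 4.0])  # -> 2.0
```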
Figure 7.3: Figure demonstrating the concept of area under the curve. Data
are plasma drug X concentrations over time for a single subject. The shaded
area represents the area under the curve.
Figure 7.4: Tilted line-segment plot for the area under the curve (AUC) data
in Figure 7.2. Rifampin pretreatment showed a clear negative effect on AUC.
This plot shows that what was once a multivariate problem (Figure 7.2) can be
reduced to a univariate one (Figure 7.4) using an appropriate summary
measure.
n   m   W=0      W=1      W=2      W=3      W=4      W=5      W=6
2 2 0.3333 0.1667 0.1667
2 3 0.2000 0.2000 0.1000 0.1000
2 4 0.2000 0.1333 0.1333 0.0667 0.0667
2 5 0.1429 0.1429 0.0952 0.0952 0.0476 0.0476
2 6 0.1429 0.1071 0.1071 0.0714 0.0714 0.0357 0.0357
3 3 0.3000 0.2000 0.1000 0.0500
3 4 0.2571 0.1714 0.1143 0.0571 0.0286
3 5 0.2143 0.1607 0.1071 0.0714 0.0357 0.0179
3 6 0.1905 0.1429 0.1071 0.0714 0.0476 0.0238 0.0119
4 4 0.3143 0.1857 0.1000 0.0429 0.0143
4 5 0.2698 0.1746 0.1032 0.0556 0.0238 0.0079
4 6 0.2381 0.1619 0.1048 0.0619 0.0333 0.0143 0.0048
Note: A more extensive table can be found in Donahue (1997).
Reprinted from Donahue, R.M.J., A summary statistic for measuring
change from baseline, Journal of Biopharmaceutical Statistics, 7, 287-
289, 1997 by courtesy of Marcel Dekker, Inc.
pretest scores as the covariate. It was concluded that the W-statistic is more
resistant to outliers than ANCOVA. However, the W-statistic is still quite new
and its validity has not been rigorously challenged. It does appear to be a
useful summary measure for repeated measures design and its use should not
be discouraged. In summary, many times it is possible to transform what was
originally a multivariate problem into a univariate problem by using either
AUC, Donahue’s W-statistic, maximal effect, or some other transformation.
Once transformed, the data may be analyzed by methods presented in earlier
chapters.
Summary
• Use of the F-test for treatment effect in the analysis of simple pretest-
posttest data using a repeated measures analysis of variance results in a
biased estimate of the true treatment effect.
• The appropriate F-test in a repeated measures analysis of variance on
pretest-posttest data is the treatment by time interaction.
• Multiple posttest measurements after the treatment intervention is a true
repeated measures design and can be analyzed by either a traditional
Choosing a statistical test to analyze your data is one of the most difficult
decisions a researcher makes in the experimental process because choosing the
wrong test may lead to the wrong conclusion. If a test is chosen with
inherently low statistical power, a Type II error can result, i.e., failure to reject
the null hypothesis when it is in fact false. On the other hand, choosing a test that
is sensitive to the validity of its assumptions may result in inaccurate p-values
when those assumptions are incorrect. The ideal test is one that is uniformly
most powerful and is not sensitive to minor violations of its underlying
assumptions. It would be nice to be able to say, always use test X for the
analysis of pretest-posttest data. Unfortunately, no such test exists. The tests
that have been presented in previous chapters may be best under different
conditions depending on the observed data. The rest of this chapter will be
devoted to picking the right test for your analysis. It should be pointed out
that the choice of statistical test should be stated and defined prior to doing the
experiment.
Y = ρX + W·√(1 − ρ²).    (8.1)
This process is repeated n times for the desired sample size. If the computer
program that is being used does not generate normal random variates then the
method of Box and Muller (1958) can be used. Box and Muller’s (1958)
method depends on the ability of the computer to generate uniformly
distributed random variables, which most computer languages readily do.
Their method is as follows: if A and B are uniform random variates on the
interval (0, 1) then X and W will be normally distributed with mean 0 and
variance 1, where
X = cos(2πB)·√(−2·ln(A))    (8.2)
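Eqs. (8.1) and (8.2) combine into a small generator of correlated pretest-posttest pairs. The sketch below (Python) uses Box and Muller's method even though Python's random module supplies normal variates directly:

```python
import math
import random

def box_muller(rng):
    """One standard normal variate from uniform variates, Eq. (8.2)."""
    a = 1.0 - rng.random()  # force a into (0, 1] so log(a) is defined
    b = rng.random()
    return math.cos(2.0 * math.pi * b) * math.sqrt(-2.0 * math.log(a))

def correlated_pair(rng, rho):
    """A pair of standard normals with correlation rho, Eq. (8.1)."""
    x = box_muller(rng)
    w = box_muller(rng)
    return x, rho * x + w * math.sqrt(1.0 - rho ** 2)
```

Repeating `correlated_pair` n times gives the simulated pretest and posttest samples for one group.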
Figure 8.1a: Percent of simulations which rejected the null hypothesis (Ho: µ1
= µ2 = µ3) when pretest and posttest scores were normally distributed with
equal variance and equal sample sizes per group.
Figure 8.1b: Percent of simulations which rejected the null hypothesis (Ho: µ1
= µ2 = µ3) when pretest and posttest scores were normally distributed with
equal variance and equal sample sizes per group.
[Panel label: Effect Size = 1.0; % of simulations rejecting Ho plotted against
test-retest correlation.]
Figure 8.2a: Percent of simulations which rejected the null hypothesis when
pretest and posttest scores were normally distributed with equal variance but
unequal sample sizes per group. Five, 10, and 15 subjects were randomized to
the control, group 2, and group 3, respectively. Groups 2 and 3 received
active treatment interventions.
Figure 8.2b: Percent of simulations which rejected the null hypothesis when
pretest and posttest scores were normally distributed with equal variance but
unequal sample sizes per group. Five, 10, and 15 subjects were randomized to
the control, group 2, and group 3, respectively. Groups 2 and 3 received
active treatment interventions.
[Panel label: Effect Size = 0; % of simulations rejecting Ho plotted against
test-retest correlation.]
Figure 8.3a: Percent of simulations which rejected the null hypothesis when
pretest and posttest scores were normally distributed with equal variance and
equal sample sizes per group. However, both the pretest and posttest had a
constant degree of bias of +25% in the measurement.
[Panel label: τ = 1.5; % of simulations rejecting Ho plotted against
test-retest correlation.]
Figure 8.3b: Percent of simulations which rejected the null hypothesis when
pretest and posttest scores were normally distributed with equal variance and
equal sample sizes per group. However, both the pretest and posttest had a
constant degree of bias of +25% in the measurement.
Figure 8.4a: Percent of simulations which rejected the null hypothesis when
pretest and posttest samples were normally distributed but had unequal
variances. The pretest mean was 100 with a variance of 100, while the
variance of the posttest was 400. Ten subjects were randomized to the control
group, group 2, and group 3, respectively. Groups 2 and 3 received active
treatment interventions.
Figure 8.4b: Percent of simulations which rejected the null hypothesis when
pretest and posttest samples were normally distributed but had unequal
variances. The pretest mean was 100 with a variance of 100, while the
variance of the posttest was 400. Ten subjects were randomized to the control
group, group 2, and group 3, respectively. Groups 2 and 3 received active
treatment interventions.
Figure 8.5a: Percent of simulations which rejected the null hypothesis when
pretest and posttest scores were normally distributed and had equal sample
sizes per group. Subjects were stratified on the basis of their pretest scores
and randomized to blocks of different sizes (randomized block design) or
randomized using the alternating block design. Ten subjects were randomized
to the control group, group 2, and group 3, respectively. Groups 2 and 3
received active treatment interventions.
Figure 8.5b: Percent of simulations which rejected the null hypothesis when
pretest and posttest scores were normally distributed and had equal sample
sizes per group. Subjects were stratified on the basis of their pretest scores
and randomized to blocks of different sizes (randomized block design) or
randomized using the alternating block design. Ten subjects were randomized
to the control group, group 2, and group 3, respectively. Groups 2 and 3
received active treatment interventions.
Figure 8.6a: Probability density functions for the normal, truncated, and log-
normal distributions.
their nominal value, regardless of the underlying distribution. All the non-
normal distributions studied had power curves similar in shape to that of the
normal distribution, but with power ranging from slightly to greatly less than
the power seen with the normal distribution. As in the previous simulations,
among all the distributions studied, ANCOVA methods were slightly more
powerful than ANOVA methods.
When the distribution was decidedly bimodal (mixed normals), the power
of the ANCOVA and ANOVA methods was significantly greater than the
power of the RBD method. In general, the ANCOVA methods had greater
power than the ANOVA methods. The most powerful test was ANCOVA on
ranked normal scores and the least powerful was analysis of posttest scores
only. The Type I error rate of all the methods was near the nominal value,
though slightly low overall.
When the distribution of the pretest and posttest scores was truncated
normal, the power curve was almost identical to the power curve for the
normal distribution. The most powerful tests were the ANCOVA methods,
while the least powerful was analysis of posttest scores only. ANOVA
methods were similar in power to the RBD method. Overall, parametric
ANCOVA using pretest scores as the covariate was the most powerful test
used. Nonparametric ANCOVA methods, while having less power than
Figure 8.7a: Percent of simulations which rejected the null hypothesis when
pretest and posttest scores were distributed as normal or truncated normal.
Fifty subjects were randomized to the control group, group 2, and group 3,
respectively. Groups 2 and 3 received active treatment interventions. The
correlation between pretest and posttest scores was set at 0.60 ± 10%.
Figure 8.7b: Percent of simulations which rejected the null hypothesis when
pretest and posttest scores were log-normally distributed or a mixture of
normal distributions. Fifty subjects were randomized to the control group,
group 2, and group 3, respectively. Groups 2 and 3 received active treatment
interventions. The correlation between pretest and posttest scores was set at
0.60 ± 10%.
Summary
• Each of the statistical tests presented in this book has a slightly different
interpretation.
• In presenting the results of an analysis, one may wish to perform the
analysis using a single statistical method but present the data in an
easy-to-understand manner.
• In general, ANCOVA methods have greater power than ANOVA and
post-hoc blocking methods.
• Among the ANCOVA methods, it is best to use a nonparametric method
for most analyses. These methods significantly improve the power of
ANCOVA when its assumptions are violated, and they have only slightly
lower power when those assumptions are met.
• If subjects are randomized to treatment groups based on pretest scores,
then the ARD should be used over the RBD.
RANDOMIZATION TESTS
With the fall in the price of personal computers and the concurrent increase
in their processing speed, randomization tests have begun to gain popularity
among statisticians. Their value lies in their wide applicability, particularly
to problems for which no parametric solution exists. Specific types of
randomization tests are given names such as the bootstrap or jackknife, while
others are simply called permutation tests. Surprisingly, scientists in other
fields have largely failed to grasp and make use of them. But just what are
permutation or randomization tests? These are generic names for computer-
intensive methods that generate the sampling distribution, the standard error,
or the p-value of a test statistic.
Good (1994) and Noreen (1989) provide two good introductory texts on
permutation tests and resampling-based methods in general. Good (1994)
gives a very simple outline for how to perform a permutation test:
1. Analyze the problem.
2. Choose a test statistic.
3. Compute the test statistic for the observed data.
4. Rearrange (permute) the observations and recompute the test statistic
for the rearranged data. Repeat until you obtain all possible
permutations.
5. Accept or reject the null hypothesis using the permutation distribution
as a guide.
These steps provide the basic outline for how to do permutation tests. The
remainder of this section outlines how to use randomization methods with
some of the statistical tests that have been presented in previous chapters.
nPnk = (n1 + n2 + n3 + ... + nk)! / [(n1)!(n2)!(n3)!...(nk)!]        (9.2)
where ( )! refers to the factorial function. Figure 9.1 plots the number of
combinations in the two- and three-group case, varying the number of
observations per group. As can be seen from the figure, the number of
possible permutations increases to astronomical levels as the number of
observations per group increases ever so slightly. For two groups with 15
independent subjects per group, the number of permutations is about
1.5 × 10^8 (Ludbrook and Dudley, 1998). For paired data, the number
becomes 32,768. When the number of possible combinations grows too large
to handle on a personal computer, it is necessary to reduce the computational
burden by taking a small but representative sample of all possible
combinations. Such a Monte Carlo approach is called a randomization test,
and the resulting p-value is an estimate of the true, exact p-value. Thus
randomization tests represent a generalization of permutation tests, and their
p-values are an approximation to the true p-value. For purposes of this book,
we will focus on randomization tests since they are simpler to compute and,
when done correctly, are just as valid as permutation tests.
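Eq. (9.2) is easy to check numerically; the short Python sketch below (a hypothetical helper, since the book's code is SAS) reproduces the counts quoted from Ludbrook and Dudley (1998):

```python
from math import factorial

def n_arrangements(group_sizes):
    """Number of distinct ways to divide the pooled subjects into
    groups of the given sizes -- Eq. (9.2)."""
    total = factorial(sum(group_sizes))
    for size in group_sizes:
        total //= factorial(size)
    return total

print(n_arrangements([15, 15]))   # 155117520, i.e., about 1.5 x 10^8
print(2 ** 15)                    # 32768 rearrangements for 15 pairs
```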
The utility of randomization or permutation tests should now be apparent.
Foremost, a researcher is not limited to test statistics whose sampling
distributions are known. For example, if the sampling distribution were quite
skewed, a test statistic such as the difference in modes or the difference in
medians may be needed. Further, suppose the distribution of the groups was
bimodal and, for whatever reason, the researcher wanted to determine
whether the ratio of the intra-group modes was equal between groups.
Certainly, there is no parametric test statistic that could provide an answer.
Randomization tests could.
The disadvantage of permutation tests is devising the algorithm to test all
possible combinations. For this reason, randomization tests are often used
Figure 9.1: The number of possible combinations in the two- (solid circles)
and three-group (open circles) case varying the number of independent
observations per group.
over permutation tests. The central question in randomization tests is how
many iterations must be done to adequately simulate the sampling distribution
of the test statistic. A useful tool is to plot the computed p-value at each
iteration of the simulation and observe how the p-value converges to its final
estimate. Often researchers use general rules of thumb and nice round
numbers for how many iterations to perform; it is common to see in the
literature cases where the researcher used 1000 or 10,000 iterations without
any rationale for doing so. Certain computer packages, such as StatXact
(Cytel Corp., Cambridge, MA), make use of permutation tests to generate
p-values, thus allowing the researcher to focus on the question, not on the
programming needed to get the answer.
In order to do more difficult analyses for which there may be no parametric
solution, it is necessary to first understand how to solve problems where a
parametric solution does exist. Then, by comparing the results from a known
statistical test to the results from a randomization test, the reader should feel
more comfortable applying randomization tests to statistics with unknown
sampling distributions. To that end, randomization tests for analysis of
variance, analysis of covariance, and repeated measures analysis of variance
will be presented, the idea being that at the end of the chapter the reader will
be able to apply these methods to more complex problems.
Figure 9.2 (flowchart): Define the total number of iterations; set the
iteration counter to 1; randomize or permute the data to generate
pseudovalues; recompute the test statistic; store the resampled test statistic
in a vector; increment the counter and repeat.
mean squares was not known, what would the p-value be? Randomization
tests could be used to answer this. The data were shuffled 10,000 times and
the F-value from each analysis of variance recalculated. Of the 10,000
iterations, 126 F-values were greater than or equal to the observed F-value of
5.40, leading to a p-value of 0.0126 (= 126/10,000).
Table 9.2 presents the output from this analysis. Figure 9.3 plots the
computed p-value as a function of the iteration number. It is seen from the
figure that with less than 2000 iterations the p-value was wildly erratic,
varying from 0 to 0.028; between 1000 and 3000 iterations the calculated
p-value was less variable, varying between 0.012 and 0.019; and as the
number of iterations exceeded 4000, the p-value stabilized to its final value.
This example demonstrates the need for a sufficient number of iterations in
the randomization process to ensure stability and validity of the computed
p-value.
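The shuffling procedure just described can be sketched in Python (a simplified stand-in for the book's SAS programs). The difference scores below are taken from the appendix data set; the F-value obtained here need not match the Table 9.2 analysis:

```python
import random

def f_statistic(groups):
    """One-way ANOVA F-ratio computed from scratch."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def randomization_anova(groups, iterations=10_000, seed=1):
    """Shuffle the pooled scores, reassign them to groups of the
    original sizes, and count how often the recomputed F meets or
    exceeds the observed F (Monte Carlo randomization test)."""
    rng = random.Random(seed)
    observed = f_statistic(groups)
    sizes = [len(g) for g in groups]
    pooled = [x for g in groups for x in g]
    hits = 0
    for _ in range(iterations):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for s in sizes:
            shuffled.append(pooled[start:start + s])
            start += s
        if f_statistic(shuffled) >= observed:
            hits += 1
    return observed, hits / iterations

# Difference scores (posttest - pretest) from the appendix data set.
diffs = [[2, -2, -17, 0, 1, 12, -4, 3, -1, -1],
         [3, 4, 4, -6, 4, 5, -4, 3, 3, 6, 4],
         [7, 3, 4, 6, 4, 1, 3, 9, 6]]
obs_f, p_hat = randomization_anova(diffs)
```

Tracking `hits / (iteration + 1)` inside the loop would reproduce the convergence plot of Figure 9.3.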
Analysis of Covariance
The basic linear model for analysis of covariance is
Yij = µ + τj + β(Xij − µ) + eij        (9.3)
Figure 9.3: p-value as a function of iteration number. The data in Table 9.2
were analyzed using a randomization test on the difference scores. The
pseudocode for the analysis is given in Figure 9.2 and the actual SAS code is
given in the Appendix.
Y*ij = Ŷij + e*ij        (9.4)
where Ŷij is the ith predicted value in the jth group and e*ij is the ith permuted
residual in the jth group. The new permuted observations Y*ij are then
submitted to analysis of covariance and the F-values recomputed. Let the
recomputed F-values equal F*. This process is repeated many times until a
sampling distribution for F* is generated. The p-value for the observed F-
value is the number of resampled F*-values greater than or equal to the
observed F-value divided by the total number of resampling simulations. The
SAS code to perform this analysis is given in the Appendix.
Permutation of residuals and calculation of the resulting F-value will not
eliminate one very important assumption of the analysis of covariance − that
the between-group regression coefficients are equal. However, if the between-
group regression coefficients are equal for the observed data, they will also be
equal for the permuted data. Thus it is unnecessary to check for equal
regression coefficients with each permutation.
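The mechanics of Eq. (9.4) can be illustrated with a stripped-down Python sketch using a single covariate and ordinary least squares. This shows only the resampling step, not the full ANCOVA (the book's implementation uses SAS PROC IML):

```python
import random

def fit_line(x, y):
    """Ordinary least-squares intercept and slope for one covariate."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return my - slope * mx, slope

def permute_residuals(x, y, rng):
    """One draw of Eq. (9.4): Y* = Yhat + e*, where e* is a random
    permutation of the observed residuals."""
    b0, b1 = fit_line(x, y)
    yhat = [b0 + b1 * xi for xi in x]
    resid = [yi - yh for yi, yh in zip(y, yhat)]
    rng.shuffle(resid)                       # permute the residuals
    return [yh + e for yh, e in zip(yhat, resid)]
```

Repeatedly calling `permute_residuals`, refitting the model, and collecting the F-values would build the F* sampling distribution described above.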
As an example, consider the sexual harassment data in Table 3.4. Both
pretest and posttest were transformed to ranks prior to ANCOVA. The results
of resampling the residuals with 1000 permutations are given in Table 9.3. The
observed F-test for treatment effect was 4.19 with a corresponding p-value of
0.0181. The resampled p-value was 0.0210, an insignificant difference. It can
be seen that resampling gives a very close approximation to the p-value in the
parametric case.
Summary
• Randomization tests are not a panacea for all problems inherent in certain
tests.
EQUALITY OF VARIANCE
W = 72·S / [k(k+1)n(n+1)(2n+1)] − 9(k+1)n(n+1) / [2(2n+1)]        (10.10)

B = 3A − 2 + 72(3n^4 + 6n^3 − 3n + 1) / [7n^2(n+1)^2(2n+1)^2]     (10.12)

v = (k−1)A^3 / B^2        (10.13)

Q = A(W − J + 1)/B + v.        (10.14)
Reject Ho when Q > Qcrit, where Qcrit is the critical value associated with the
1−α quantile of a chi-square distribution with v degrees of freedom, v being
rounded to the nearest integer.
As an example of the method, three groups of subjects were generated
with ten subjects per group. Two of the groups were distributed as X~N(10,1)
and one group was generated as X~N(10,5). The data are presented in Table
10.1. The following values were calculated using Eq. (10.9)-(10.14): S =
39794, W = 9.07, A = 0.8291, B = 0.5566, v = 3.67, Q = 14.218, and p(Q, 4)
= 0.0026. Thus the null hypothesis was rejected and it was concluded that at
least one of the group variances was not equal to the others. One disadvantage
of the Q method is that no data can be missing; if any data are missing, that
entire subject must be deleted from the analysis. Another disadvantage is that
it fails to identify which time period has unequal variance compared to the
others, or whether two or more time periods have unequal variance compared
to the rest. Nevertheless, it is still a useful test for detecting heteroscedasticity
in the data set at hand.
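The worked example can be verified numerically. In the Python sketch below (an illustrative check, not part of the book's SAS code), S, k, n, and the reported A = 0.8291 are taken from the text; the computed W, B, and v then agree with the quoted 9.07, 0.5566, and 3.67 to rounding:

```python
# Worked example: S = 39794 from Table 10.1, k = 3 groups, n = 10
# subjects per group; A = 0.8291 is taken from the text (its defining
# equation is not reproduced here).
S, k, n, A = 39794, 3, 10, 0.8291

W = 72 * S / (k * (k + 1) * n * (n + 1) * (2 * n + 1)) \
    - 9 * (k + 1) * n * (n + 1) / (2 * (2 * n + 1))          # Eq. (10.10)
B = 3 * A - 2 + 72 * (3 * n**4 + 6 * n**3 - 3 * n + 1) \
    / (7 * n**2 * (n + 1)**2 * (2 * n + 1)**2)               # Eq. (10.12)
v = (k - 1) * A**3 / B**2                                    # Eq. (10.13)

print(W, B, v)
```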
Summary
• Methods were presented for null hypothesis testing for equality of the
pretest and posttest variance.
• Until further research is done using the different methods under varying
conditions, it would probably be wise to use the Pitman-Morgan method
modified using the Spearman rank correlation coefficient.
Rank Transformation
Subject   R(Z1)  R(Z2)  R(Z3)  R(D)   R(Z1)·R(D)  R(Z2)·R(D)  R(Z3)·R(D)
   1        1      2      3      6         6           12          18
   2        1      2      3      1         1            2           3
   3        2      1      3     10        20           10          30
   4        2      3      1      2         4            6           2
   5        2      1      3      8        16            8          24
   6        1      3      2      4         4           12           8
   7        1      2      3      3         3            6           9
   8        2      1      3      5        10            5          15
   9        2      1      3      7        14            7          21
  10        2      1      3      9        18            9          27
Sum of columns                            96           77         157
Sum of columns squared                  9216         5929       24649
Grand sum of columns squared (S)       39794
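As a check on the table, the column sums and the grand sum S can be recomputed directly from the ranks; a minimal Python sketch:

```python
# Rows of the table: (R(Z1), R(Z2), R(Z3), R(D)) for each subject.
rows = [(1, 2, 3, 6), (1, 2, 3, 1), (2, 1, 3, 10), (2, 3, 1, 2),
        (2, 1, 3, 8), (1, 3, 2, 4), (1, 2, 3, 3), (2, 1, 3, 5),
        (2, 1, 3, 7), (2, 1, 3, 9)]

# Products R(Zi) * R(D), summed down each of the three columns.
col_sums = [sum(r[i] * r[3] for r in rows) for i in range(3)]
S = sum(c ** 2 for c in col_sums)   # grand sum of squared column sums
print(col_sums, S)                  # [96, 77, 157] 39794
```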
Asnis, G.M., Lemus, C.Z., and Halbreich, U., The desipramine cortisol test - a
selective noradrenergic challenge (relationship to other cortisol tests in
depressives and normals), Psychopharmacol. Bull., 22, 571, 1986.
Austin, H.A., III, Muenz, L.R., Joyce, K.M., Antonovych, T.A., Kullick,
M.E., Klippel, J.H., Decker, J.L., and Balow, J.W., Prognostic factors in
lupus nephritis, Am. J. of Med., 75, 382, 1983.
Birch, J. and Myers, R.H., Robust analysis of covariance, Biometrics, 38, 699,
1982.
Blair, R.C. and Higgins, J.J., Comparison of the power of the paired samples t-
test to that of Wilcoxon’s signed-ranks test under various population shapes,
Psychol. Bull., 97, 119, 1985.
Blom, G., Statistical Estimates and Transformed Beta Variables, John Wiley &
Sons, New York, 1958.
Bonett, D.G., On post-hoc blocking, Educ. and Psychol. Meas., 42, 35, 1982.
Bowles, S.K., Reeves, R.A., Cardozo, L., and Edwards, D.J., Evaluation of
the pharmacokinetic and pharmacodynamic interaction between quinidine and
nifedipine, J. of Clin. Pharmacol. 33, 727, 1993.
Brouwers, P. and Mohr, E., A metric for the evaluation of change in clinical
trials, Clin. Neuropharmacol., 12, 129, 1989.
Burdick, W.P., Ben-David, M.F., Swisher, L., Becher, J., Magee, D.,
McNamara, R., and Zwanger, M., Reliability of performance-based clinical
skill assessment of emergency room medicine residents, Acad. Emerg. Med., 3,
1119, 1996.
Carmines, E.G. and Zeller, R.A., Reliability and Validity, Sage Publications,
Newbury Park, CA, 1979.
Chen, S. and Cox, C., Use of baseline data for estimation of treatment effects in
the presence of regression to the mean, Biometrics, 48, 593, 1992.
Chesher, A., Non-normal variation and regression toward the mean, Stat. Meth.
Med. Res., 6, 147, 1997.
Chuang-Stein, C., The regression fallacy, Drug Inf. J., 27, 1213, 1993.
Cochran, W.G., Analysis of covariance: its nature and uses, Biometrics, 13,
261, 1957.
Cohen, J., Statistical Power Analysis for the Behavioral Sciences, Lawrence
Erlbaum Associates, Hillsdale, NJ, 1988.
Cronbach, L.J. and Furby, L., How should we measure change-or should we?,
Psychol. Bull., 74, 68, 1970.
D’Agostino, R.B., Belanger, A., and D’Agostino, R.B., Jr., A suggestion for
using powerful and informative tests of normality, Am. Stat., 44, 316, 1990.
Davis, C.E., The effect of regression to the mean in epidemiologic and clinical
studies, Am. J. of Epidemiol., 104, 493, 1976.
Dawson, J.D., Comparing treatment groups on the basis of slopes, area under
the curves, and other summary measures, Drug Inf. J., 28, 723, 1994.
DeGracie, J.S. and Fuller, W.A., Estimation slope and analysis of covariance
when the concomitant variable is measured with error, J. of the Am. Stat. Assoc.,
67, 930, 1972.
de Mey, C. and Erb, K.A., Usefulness, usability, and quality criteria for
noninvasive methods in cardiovascular pharmacology, J. Clin. Pharmacol., 27,
11S, 1997.
Diggle, P.J., Liang, K.-Y., and Zeger, S.L., Analysis of Longitudinal Data,
Clarendon Press, Oxford, 1995.
Enas, G.G., Enas, N.H., Spradlin, C.T., Wilson, M.G., and Wiltse, C.G.,
Baseline comparability in clinical trials: prevention of “poststudy anxiety,” Drug
Inf. J., 24, 541, 1990.
Fries, J.F., Porta, K., and Liang, M.H., Marginal benefit of renal biopsy in
systemic lupus erythematosus, Arch. of Int. Med., 138, 1386, 1978.
Frison, L.J. and Pocock, S.J., Repeated measures in clinical trials: analysis
using mean summary statistics and its implication for design, Stat. in Med., 11,
1685, 1992.
Gail, M.H., Tan, W.Y., and Piantadosi, S., Tests for no treatment effect in
randomized clinical trials, Biometrika, 75, 57, 1988.
George, V., Johnson, W.D., Shahane, A., and Nick, T.G., Testing for
treatment effect in the presence of regression towards the mean, Biometrics, 53,
49, 1997.
Ghiselli, E.E., Campbell, J.P., and Zedeck, S., Measurement Theory for the
Behavioral Sciences, W.H. Freeman, San Francisco, 1981.
Graham, J.R., The MMPI: A Practical Guide, 2nd ed., Oxford University Press,
New York, 1987.
Huck, S.W. and McLean, R.A., Using a repeated measures ANOVA to analyze
the data from a pretest-posttest design: a potentially confusing task, Psychol.
Bull., 82, 4, 1975.
Jaccard, J., Turriso, R., and Wan, C.K., Interaction Effects in Multiple
Regression, Sage Publications, Newbury Park, CA, 1990.
Kaiser, L., Adjusting for baseline: change or percentage change?, Stat. in Med.,
8, 1183, 1989.
Labouvie, E.W., The concept of change and regression toward the mean,
Psychol. Bull., 92, 251, 1982.
Lacey, L.F., O’Keene, O.N., Pritchard, J.F., and Bye, A., Common
noncompartmental pharmacokinetic variables: are they normally or log-normally
distributed?, J. of Biopharm. Stat., 7, 171, 1997.
Lin, H.M. and Hughes, M.D., Adjusting for regression toward the mean when
variables are normally distributed, Stat. Meth. in Med. Res., 6, 129, 1997.
Littell, R.C., Milliken, G.A., Stroup, W.W., and Wolfinger, R.D., SAS System
for Mixed Models, SAS Institute, Inc., Cary, NC, 1996.
Lord, F.M. and Novick, M.R., Statistical Theories of Mental Test Scores,
Addison-Wesley, Reading, MA, 1968.
Ludbrook, J. and Dudley, H., Why permutation tests are superior to t and F
tests in biomedical literature, Am. Stat., 52, 127, 1998.
Maxwell, S.E., Delaney, H.D., and Dill, C.A., Another look at ANCOVA
versus blocking, Psychol. Bull., 95, 136, 1984.
McCulloch, C.E., Tests for equality of variances with paired data, Commun. in
Stat. Ser. − Theory and Meth., 16, 1377, 1987.
McDonald, C.J., Mazzuca, S.A., and McCabe, G.P., How much placebo
‘effect’ is really statistical regression?, Stat. in Med., 2, 417, 1983.
McNeil, D., On graphing paired data, Am. Stat., 46, 307, 1992.
Menegazzi, J.J., Davis, E.A., Sucov, A.N., and Paris, P.N., Reliability of the
Glasgow Coma Scale when used by emergency physicians and paramedics, J.
Trauma, 34, 46, 1993.
Morgan, W.A., A test for the significance of the difference between two
variances in a sample from a normal bivariate population, Biometrika, 31, 9,
1939.
Moye, L.A., Davis, B.R., Sacks, F., Cole, T., Brown, L., and Hawkins, C.M.,
Decision rules for predicting future lipid values in screening for a cholesterol
reduction clinical trial, Controlled Clin. Trials, 17, 536, 1996.
Nesselroade, J.R., Stigler, S.M., and Baltes, P.B., Regression toward the mean
and the study of change, Psychol. Bull., 88, 622, 1980.
Neter, J., Kutner, M.H., Nachtsheim, C.J., and Wasserman, W., Applied
Linear Statistical Models, Irwin, Chicago, 1996.
Olejnik, S.F. and Algina, J., Parametric ANCOVA and the rank transform
ANCOVA when the data are conditionally non-normal and heteroscedastic, J.
Educ. Stat., 9, 129, 1984.
Overall, J.E., Letter to the editor: the use of inadequate corrections for baseline
imbalance remains a serious problem, J. Biopharm. Stat., 3, 271, 1993.
Overall, J.E. and Ashby, B., Baseline corrections in experimental and quasi-
experimental clinical trials, Neuropsychopharmacol., 4, 273, 1991.
Pocock, S.J., Clinical Trials: A Practical Approach, John Wiley & Sons, New
York, 1983.
Puri, M.L. and Sen, P.K., Analysis of covariance based on general rank scores,
Ann. of Math. Stat., 40, 610, 1969.
Quade, D., Rank analysis of covariance, J. Am. Stat. Assoc., 62, 1187, 1967.
Raboud, J.M., Montaner, J.S.G., Rae, S., Conway, B., Singer, J., and
Scheter, M.T., Issues in the design and trials of therapies for HIV infected
individuals with plasma RNA level as an outcome, J. of Infect. Dis., 175, 576,
1996.
Rosenbaum, P.R., Exploratory plots for paired data, Am. Stat., 43, 108, 1989.
SAS Institute, SAS/STAT Users Guide, Version 6, SAS Institute, Cary, NC,
1990.
Seaman, S.L., Algina, J., and Olejnik, S.F., Type I error probabilities and
power of the rank and parametric ANCOVA procedures, J. of Educ. Stat., 10,
345, 1985.
Seber, G.A.F., Linear Regression Analysis, John Wiley & Sons, New York,
1977.
Senn, S., Testing for baseline differences in clinical trials, Stat. in Med., 13,
1715, 1994.
Senn, S.J. and Brown, R.A., Estimating treatment effects in clinical trials
subject to regression to the mean, Biometrics, 51, 555, 1985.
Shapiro, S.S. and Wilk, M.B., An analysis of variance test for normality
(complete samples), Biometrika, 52, 591, 1965.
Snow, W.G., Tierney, M.C., Zorzitto, M.L., Fisher, R.H., and Reid, D.W.,
WAIS-R test-retest reliability in a normal elderly sample, J. of Clinical Exp.
Neuropsychol., 11, 423, 1989.
Solomon, R.L., An extension of control group design, Psychol. Bull., 46, 137,
1949.
Stigler, S.M., Regression toward the mean, historically considered, Stat. Meth.
in Med. Res., 6, 103, 1997.
Suissa, S., Levinton, C., and Esdaile, J.M., Modeling percentage change: a
potential linear mirage, J. of Clin. Epidemiol., 42, 843, 1989.
Tornqvist, L., Vartia, P., and Vartia, Y.O., How should relative change be
measured?, Am. Stat., 39, 43, 1985.
Tukey, J.W., The future of data analysis, Ann. of Math. Stat., 33, 22, 1962.
van der Ent, C.K. and Mulder, P., Improvement in tidal breathing pattern
analysis in children with asthma by on-line automatic data processing, Eur.
Respiratory J., 9, 1306, 1996.
Wainer, H., Adjusting for differential rates: Lord’s paradox again, Psychol.
Bull., 109, 147, 1991.
West, S.M., Herd, J.A., Ballantyne, C.M., Pownall, H.J., Simpson, S.,
Gould, L., and Gotto, A.M., The Lipoprotein and Coronary Atherosclerosis
Study (LCAS): design, methods, and baseline data of a trial of fluvastatin in
patients without severe hypercholesteremia, Controlled Clin. Trials, 17, 550,
1996.
Whiting-O’Keefe, Q., Henke, J.E., Shearn, M.A., Hopper, J., Biava, C.G.,
and Epstein, W.V., The information content from renal biopsy in systemic lupus
erythematosus, Ann. of Int. Med., 96, 718, 1982.
Wilder, J., Adrenalin and the law of initial values, Exp. and Med. Surg., 15, 47,
1957.
Williams, R.H. and Zimmerman, D.W., Are simple gain scores obsolete?,
Appl. Psychol. Meas., 20, 59, 1996.
Wolfinger, R.D., An example of using mixed models and PROC MIXED for
longitudinal data, J. of Biopharm. Stat., 7, 481, 1997.
Yuan, C.-S., Foss, J.F., Osinski, J., Toledano, A., Roizen, M.F., and Moss,
J., The safety and efficacy of oral methylnaltrexone in preventing morphine-
induced delay in oral cecal-transit time, Clin. Pharmacol. and Therap., 61, 467,
1997.
Zar, J.H., Biostatistical Analysis, 2nd ed., Prentice-Hall, Englewood Cliffs, NJ,
1984.
SAS Code
/**********************************************************
* SAS Code to Do:
* Iteratively Reweighted ANCOVA
***********************************************************/
%MACRO IRWLS;
proc sort;
by resid;
proc iml;
use resid;
read all;
absresid = abs(resid);
if nrow(resid)/2 = int(nrow(resid)/2) then
mad = (absresid(nrow(resid)/2) +
absresid(nrow(resid)/2 + 1))/(2 * 0.6745);
else
mad = absresid(nrow(resid)/2 + 1)/0.6745;
u = resid/mad;
huberwgt = j(nrow(resid), 1, 1);
bisqr = j(nrow(resid), 1, 0);
do j = 1 to nrow(resid);
if abs(u(j)) > 1.345 then
huberwgt(j) = 1.345/abs(u(j));
if abs(u(j)) <= 4.685 then
bisqr(j) = (1 - (u(j)/4.685)**2)**2;
end;
create data2 var {trt pretest posttest huberwgt bisqr};
append var {trt pretest posttest huberwgt bisqr};
run;
%MEND IRWLS;
%MACRO BISQRGLM;
proc glm data=data2;
class trt;
model posttest = trt pretest;
weight bisqr;
output out=resid r=resid;
title1 'GLM with Bisquare Function Weights';
run;
%MEND BISQRGLM;
%MACRO HUBERGLM;
proc glm data=data2;
class trt;
model posttest = trt pretest;
weight huberwgt;
output out=resid r=resid;
title1 'GLM with HUBER Function Weights';
run;
%MEND HUBERGLM;
data original;
input trt pretest posttest;
cards;
0 72 74
/***********************************************************
* SAS Code to do:
* ANCOVA WHEN THE WITHIN-GROUP
* REGRESSION COEFFICIENTS ARE UNEQUAL
**********************************************************/
data rawdata;
input group posttest pretest;
cards;
1 16 26
1 60 10
1 82 42
1 126 49
1 137 55
2 44 21
2 67 28
2 87 5
2 100 12
2 142 58
3 17 1
/**********************************************************
*
* SAS Code to :
* COMPUTE G FOR Ln TRANSFORMATION
*
***********************************************************/
dm 'clear log';
dm 'clear list';
data one;
input x1 x2;
cards;
-10 25
5 28
7 45
-8 32
/**********************************************************
*
* SAS Code to:
* PROCEDURE TO RESAMPLE WITHIN K-GROUPS (ANOVA Type)
*
***********************************************************/
options ls=75;
dm 'clear log';
dm 'clear list';
data data;
input trt subject pretest posttest;
diff = posttest - pretest;
cards;
1 1 75 77
1 2 68 66
1 3 82 65
1 4 76 76
1 5 73 74
1 6 78 90
1 7 72 68
1 8 75 78
1 9 80 79
1 10 76 75
2 11 76 79
2 12 75 79
2 13 80 84
2 14 79 73
2 15 77 81
2 16 68 73
2 17 76 72
2 18 76 79
2 19 82 85
2 20 69 75
2 21 73 77
3 22 74 81
3 23 76 79
3 24 69 73
3 25 73 79
3 26 77 81
3 27 75 76
3 28 71 74
3 29 72 81
3 30 69 75
;
/**********************************************************
*
* OPTIMIZED SAS Code to:
* PROCEDURE TO RESAMPLE WITHIN K-GROUPS (ANOVA Type)
* ONE-WAY ANALYSIS OF VARIANCE
* DESIGN MATRIX HARD CODED WITHIN PROC IML
***********************************************************/
data data;
input trt subject pretest posttest;
diff = posttest - pretest;
pchange = (posttest - pretest)/pretest * 100;
cards;
1 1 75 77
1 2 68 66
1 3 82 65
1 4 76 76
1 5 73 74
1 6 78 90
1 7 72 68
1 8 75 78
1 9 80 79
1 10 76 75
2 11 76 79
2 12 75 79
2 13 80 84
2 14 79 73
2 15 77 81
2 16 68 73
2 17 76 72
2 18 76 79
2 19 82 85
2 20 69 75
2 21 73 77
3 22 74 81
3 23 76 79
3 24 69 73
3 25 73 79
3 26 77 81
3 27 75 76
3 28 71 74
data pvalue;
set fvalues;
if f >= obsf then output;
proc means data=pvalue n;
title3 'Number of Simulations Greater than Observed F';
var f;
run;
proc format;
value sequence 1='RT' 2='TR';
run;
data one;
infile rmdata;
input sequence subject day period trt cs baseline;
diff = cs - baseline;
%MACRO RESAMPLE;
proc sort data=one; by day;
proc iml;
seed1 = time();
use one var _all_;
read all;
/***********************************************
modify the next uncommented line by changing
times = day to times = VARIABLE
to reflect the variable that is being repeated
************************************************/
times = day;
/***********************************************
modify the next uncommented line by changing
newy = diff to newy = VARIABLE
to reflect the dependent variable
************************************************/
newy = diff;
timevar = unique(times)`;
do i = 1 to nrow(timevar);
ti = choose(times ^= timevar(i) , times, .);
y = .;
do k = 1 to nrow(ti);
if ti(k) ^= . then y = y//newy(k);
end;
y = y(2:nrow(y));
seed1 = time();
u = rank(ranuni(J(nrow(y), 1, seed1)));
y = y(|u,|);
/*************************************
modify the next uncommented line by changing
if i = 1 then diff = y;
else diff = diff//y;
to
if i = 1 then diff = VARIABLE;
else diff = diff//VARIABLE;
to reflect the dependent variable
****************************************/
if i = 1 then diff = y;
else diff = diff//y;
create newdata var {sequence subject trt period
day diff};
append var {sequence subject trt period day diff};
run;
%MEND RESAMPLE;
%MACRO SIM;
data fvalues; f =.; id = .;
/*******************************************
change the counter (i) to the number
of desired simulations
********************************************/
%do i = 1 %to 1200;
%let id = &i;
%resample;
%stat;
data ftrt&i;
set outstat;
id = &i;
if _type_ = 'SS3' and _source_ = 'TRT'
then ftrtrs = f;
if _type_ = 'SS3' and _source_ = 'TRT'
then output;
drop _type_ _source_ df prob _name_ ss f;
data fday&i;
set outstat;
id = &i;
if _type_ = 'SS3' and _source_ = 'DAY'
then fdayrs = f;
if _type_ = 'SS3' and _source_ = 'DAY'
then output;
drop _type_ _source_ df prob _name_ ss f;
proc sort data=ftrt&i; by id;
proc sort data=fday&i; by id;
data f&i;
merge ftrt&i fday&i;
by id;
data fvalues;
merge fvalues f&i;
by id;
if id ^= . then output;
drop f;
proc datasets;
delete f&i ftrt&i fday&i;
%end;
%MEND SIM;
/***************************************************
Resampling Macro for ANCOVA
*****************************************************/
%MACRO RESAMPLE;
proc iml;
use resdata var _all_;
read all;
n = nrow(e);
seed1 = time();
u = ranuni(J(n, 1, seed1));
i = int(n*u + j(n,1,1));
newe = e(|i,|);
newy = yhat + newe;
create newdata var {trt newy rankpre};
append var {trt newy rankpre};
proc glm data=newdata outstat=outstat noprint;
class trt;
model newy = rankpre trt;
run;
%MEND RESAMPLE;
%MACRO SIM;
%do i = 1 %to 1000;
%if &i = 250 %then dm 'clear log';
%if &i = 500 %then dm 'clear log';
%if &i = 750 %then dm 'clear log';
%let id = &i;
%resample;
data f&i;
set outstat;
id = &i;
resampf = f;
if _type_ = 'SS3' and _source_ = 'TRT'
then output;
drop _type_ _source_ df prob _name_ ss;
data fvalues;
merge fvalues f&i;
by id;
if id ^= . then output;
drop f;
proc datasets;
delete f&i;
%end;
%MEND SIM;