Clinical Epidemiology and Biostatistics A Primer For Clinical Investigators and Decision Makers
Kramer
Clinical Epidemiology
and Biostatistics
A Primer for Clinical Investigators
and Decision-Makers
Library of Congress Cataloging-in-Publication Data. Kramer, Michael S., 1948- Clinical epidemiology and biostatistics / Michael S. Kramer. p. cm. Includes index.
ISBN-13:978-3-642-64814-4 (U.S.)
1. Epidemiology - Research - Methodology. 2. Epidemiology - Statistical methods. 3. Biometry. I. Title. [DNLM: 1. Biometry - methods. 2. Epidemiologic Methods. 3. Research Design. WA 950 K89c] RA652.K73 1988 614.4'028 dc 19 DNLM/DLC
This work is subject to copyright. All rights are reserved, whether the whole or part of the
material is concerned, specifically the rights of translation, reprinting, re-use of illustrations,
recitation, broadcasting, reproduction on microfilms or in other ways, and storage in data
banks. Duplication of this publication or parts thereof is only permitted under the provisions
of the German Copyright Law of September 9, 1965, in its version of June 24, 1985, and a
copyright fee must always be paid. Violations fall under the prosecution act of the German
Copyright Law.
The use of registered names, trademarks, etc. in this publication does not imply, even in the
absence of a specific statement, that such names are exempt from the relevant protective laws
and regulations and therefore free for general use.
Product Liability: The publisher can give no guarantee for information about drug dosage
and application thereof contained in this book. In every individual case the respective user
must check its accuracy by consulting other pharmaceutical literature.
Typesetting: Appl, Wemding
2123/3145-543210 - Printed on acid-free paper
Preface
Acknowledgements
Although the scope, content, and intended audience of this text differ from those of previously available texts, little of the material presented can be considered original. In addition to the many colleagues whose direct help and encouragement are acknowledged
below, I owe a great debt to many authors of other texts of epidemiology or biostatistics and to numerous teachers and colleagues
with whom I have come in contact over the years.
I wish to thank, first and foremost, Dr. Alvan R. Feinstein, who
first "turned me on" to clinical epidemiology as a viable academic
discipline. I not only cut my epidemiologic teeth with Dr. Feinstein
but have continued to benefit from his support and encouragement
in the years since leaving his tutelage.
I have also learned a great deal from collaborative teaching with
other faculty in the McGill Department of Epidemiology and Biostatistics. Primary among these have been Drs. Tom Hutchinson and
David Lane, but the list also includes Drs. John Hoey, Robert Oseasohn, and Walter Spitzer.
Several colleagues gave me extremely helpful suggestions on previous drafts of this text. They include Drs. Jean-François Boivin,
F. Sessions Cole III, Erica Eason, James Hanley, David Lane, Abby
Lippman, John McDowell, I. Barry Pless, and Stanley Shapiro. Drs.
William Fraser, Tom Hutchinson, Paul Kramer, Sammy Suissa, and
Sholom Wacholder also provided helpful advice on specific items.
I cannot adequately acknowledge the peerless secretarial work of
Mrs. Laurie Tesseris. Without her patience and thoroughness, this
book could never have been completed, even in this era of word processors. Ms. Lenora Naimark and Ms. Tiziana Bruni provided additional secretarial assistance. Many thanks are also due to Mr. Phillip
Dakin, Ms. Artemis Karabelas, and Ms. Jennifer Morrison for preparing the graphs and figures.
Lastly, and possibly most importantly, I wish to thank my wife
Claire and son Eric, whose support and encouragement are so valuable to me in all my work.
To whatever extent this text succeeds in its goal, the above-named persons deserve much of the credit. Any remaining inaccuracies or lack of clarity are entirely my own.
Michael S. Kramer
Table of Contents
Part I
Chapter 1: Introduction
1.1 The Compatibility of the Clinical and Epidemiologic Approaches
1.2 Clinical Epidemiology: Main Areas of Interest
1.3 Historical Roots
1.4 Current and Future Relevance: Controversial Questions and Unproven Hypotheses
Chapter 2: Measurement
2.1 Types of Variables and Measurement Scales
2.2 Sources of Variation in a Measurement
2.3 Properties of Measurement
2.4 "Hard" vs "Soft" Data
2.5 Consequences of Erroneous Measurement
2.6 Sources of Data
Chapter 3: Rates
3.1 What is a Rate?
3.2 Prevalence and Incidence Rates
3.3 Stratification and Adjustment of Rates
3.4 Concluding Remarks
Chapter 4: Epidemiologic Research Design: an Overview
4.1 The Research Objective: Descriptive vs Analytic Studies
4.2 Exposure and Outcome
4.3 The Three Axes of Epidemiologic Research Design
4.4 Concluding Remarks
Chapter 5: Analytic Bias
5.1 Validity and Reproducibility of Exposure-Outcome Associations
5.2 Internal and External Validity
5.3
5.4
5.5
5.6
5.7
Chapter 6:
6.1
6.2
6.3
6.4
6.5
Chapter 7: Clinical Trials
7.1 Research Design Components
7.2 Assignment of Exposure (Treatment)
7.3 Blinding in Clinical Trials
7.4 Analysis of Results
7.5 Interpretation of Results
7.6 Ethical Considerations
7.7 Advantages and Disadvantages of Clinical Trials
Chapter 8: Case-Control Studies
8.1 Introduction
8.2 Research Design Components
8.3 Analysis of Results
8.4 Bias Assessment and Control
8.5 Advantages and Disadvantages of Case-Control Studies
Chapter 9: Cross-Sectional Studies
9.1 Introduction
9.2 Research Design Components
9.3 Analysis of Results
9.4 Bias Assessment and Control
9.5 "Pseudo-Cohort" Cross-Sectional Studies
9.6 Advantages, Disadvantages, and Uses of Cross-Sectional Studies
Part II: Biostatistics
Part III: Special Topics
Part I
Chapter 1: Introduction
To avoid displays of male chauvinism or unwieldy prose (he/she, his/her), I have tried to vary the
use of masculine and feminine pronouns in this text.
cannot afford to act indecisively. Consequently, she views (albeit unconsciously) the
range of choices as right ones and wrong ones.
The epidemiologist is more comfortable with shades of gray. Since he is not
obliged to make decisions for individual patients, he can live with uncertainty. He
focuses on improving the health of populations and is less interested in the outcome
of individuals within these populations; he prefers to think probabilistically. If 80% of
patients like Mr. Jones have been shown to improve with surgery vs 60% with medical therapy, the epidemiologist would have no trouble advocating surgery for Mr.
Jones. Mr. Jones' physician, however, may find it difficult to recommend a course of
therapy that may not be unequivocally best for her patient.
It is important to emphasize that some fundamental clinical facts can be
observed only in groups. No amount of pathophysiologic, mechanistic reasoning
would reveal that the sex ratio at birth is not 50: 50 but 51.5: 48.5, or that males
have a higher overall mortality rate than females. Predicting the sex of an individual
newborn or whether a given woman will outlive a given man is subject to considerable error. But the sex ratio among the next 1000 newborns or the comparative mortality of a large representative group of men and women can be predicted within a
fairly narrow range.
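The contrast between predicting for an individual and predicting for a group can be put in numbers. The sketch below is an illustration, not part of the original argument: it assumes that births are independent, that each has a 0.515 probability of being male, and that the normal approximation to the binomial is adequate. The function name `proportion_interval` and the multiplier 1.96 (for an approximate 95% range) are choices made here.

```python
import math

def proportion_interval(p, n, z=1.96):
    """Approximate range expected to contain the observed proportion in n births.

    Uses the normal approximation to the binomial; z=1.96 gives about 95%.
    """
    se = math.sqrt(p * (1 - p) / n)  # standard error of the observed proportion
    return p - z * se, p + z * se

P_MALE = 0.515  # proportion of male births quoted in the text

for n in (10, 1000, 100_000):
    low, high = proportion_interval(P_MALE, n)
    print(f"n={n:>6}: proportion male between {low:.3f} and {high:.3f}")
```

Under these assumptions, the proportion of males among 1000 newborns would be expected to fall between roughly 0.48 and 0.55, and the range narrows further as the group grows.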
In the past, these two different approaches, individualized and mechanistic on
the one hand, group-oriented and probabilistic on the other, have had little in common. Unlike the laboratory-based sciences, epidemiology tended to remain far from
the bedside. Epidemiologists concerned themselves almost exclusively with investigating the etiology of infectious and chronic diseases, and clinicians consequently
found epidemiology to be of little relevance to their roles as caretakers and decisionmakers. Medical, dental, or nursing school courses in epidemiology were viewed
with an attitude ranging from indifference to contempt.
Recently, however, the essential compatibility and mutual benefit of the two
approaches have become more evident, and this has given rise to the term "clinical
epidemiology." Although all epidemiology is clinical in a broad sense, since it concerns disease and other health-related phenomena, "classical epidemiology" has
usually concerned itself with disease etiology. Clinical epidemiologists also study
etiology, but are equally interested in diagnosis, prognosis, therapy, prevention,
evaluation of health care services, and analysis of risks and benefits.
Nonetheless, the distinction between clinical and classical epidemiology should
not be overemphasized. The important point is that epidemiology and biostatistics
are now recognized by clinical investigators as essential in the design and analysis of
research and by practicing clinicians as useful in patient care and in interpreting and
appraising the medical literature. These areas are receiving increased attention in
clinical curricula, and postgraduate courses and seminars are in great demand by
practitioners and researchers alike.
Happily, this marriage of the epidemiologic and individual clinical approaches is
proving fruitful not only to patients, their caretakers, and researchers, but also to the
disciplines themselves. Inferences based on statistical associations can lead to new
avenues of laboratory investigation. For example, knowledge that exposure to a certain solvent in the work place is associated with an increased risk of liver cancer can
lead to animal and in vitro experiments aiming to determine its mechanism of carcinogenesis. Conversely, knowledge of underlying mechanisms can suggest novel
diagnostic or therapeutic options whose clinical utility will ultimately depend on evidence from epidemiologic studies. A new drug shown to be a potent vasodilator in
dogs will need to be tested in well-designed clinical trials in hypertensive patients to
assess its efficacy and safety in lowering blood pressure and preventing stroke, heart
attack, blindness, or kidney failure.
1.2.1 Etiology
What are the causes of coronary artery disease (CAD)? Most of what we know
about this condition derives from long-term, population-based epidemiologic
studies. For example, in the well-known Framingham study [1], a two-thirds sample
of the 30- to 60-year-old population of that Massachusetts town was examined at
the inception of the study and periodically thereafter to identify sociodemographic
and clinical risk factors for CAD. As a result of this and other similar studies, it is
now widely acknowledged that smoking, hypertension, high blood cholesterol levels, insufficient exercise, and a high-stress (so-called type-A) personality significantly increase the risk of heart attack.
1.2.2 Diagnosis
How is CAD diagnosed? A variety of invasive and noninvasive diagnostic tests have
been developed in an attempt to assess the anatomic state of the coronary arteries,
the derangement in blood supply to the heart muscle, and the resulting tissue damage. These include blood tests, roentgenographic studies, electrocardiograms (at rest
and during exercise), and radioisotopic tracer uptakes. Before such tests achieve
wide application, they should be subjected to appropriate epidemiologic study to
ascertain their ability to discriminate accurately between individuals with and without CAD or its sequelae.
1.2.3 Prognosis
What is the likelihood that Mr. Jones will still be alive in 5 years? Epidemiologic
inquiry has made substantial contribution to our understanding of those clinical,
demographic, and psychosocial variables in CAD patients that are significantly
related to future morbidity and mortality. These prognostic factors are analogous to
the risk factors discussed above in reference to etiology but include, in addition, various indicators of the extent and severity of the underlying disease in question. Some
prognostic factors are causally related to the outcome (morbidity or mortality) of
interest; others serve merely as markers of the underlying disease or other causal factors. Significant prognostic factors for Mr. Jones might include his age, the fact that
he has significant obstruction of two of his three major coronary vessels, and the
results of his postinfarction electrocardiogram exercise test. Fortunately, prognosis is
a dynamic, rather than a static, process that can be influenced by treatment and prevention. In other words, therapeutic and preventive interventions can themselves be
prognostic factors.
1.2.4 Treatment
1.2.5 Prevention
How can CAD be prevented? Some epidemiologists distinguish here between primary prevention (preventing the disease from developing in the first place) and secondary prevention (preventing progression or complication of disease already present). Unfortunately for Mr. Jones, primary prevention is no longer an option.
Perhaps, had intervention been attempted when he was a young man, he might have
been prevailed upon to stop smoking, improve his diet, get more exercise, and seek
treatment for his hypertension (high blood pressure). Although the evidence is not
clear-cut, most epidemiologic studies suggest that such changes can be effective in
lowering the risk of developing CAD. In fact, many "experts" believe that recent
changes in smoking, eating, and exercise behavior and improved control of hypertension are responsible for the clearly perceptible decline in morbidity and mortality
from CAD in North America. As for Mr. Jones, he may benefit from the secondary
preventive efficacy of such changes, as well as (possibly) from taking aspirin or other
anticlotting drugs.
association with a contaminated water supply [16], several decades before acceptance of the germ theory of disease and demonstration of the cholera Vibrio.
The current century has seen the extension of epidemiologic principles and techniques to the study of a variety of diseases, treatments, and preventive measures. In
1920, Joseph Goldberger carried out a community trial of diet in the treatment of
pellagra, thus demonstrating it to be a nutritional, rather than an infectious, disease
[17]. This was long before the biochemical demonstration of the vitamin involved
(nicotinic acid) and the understanding of its importance in intermediary metabolism.
In 1941, N. M. Gregg, an astute Australian ophthalmologist, recognized the association between certain congenital deformities and maternal rubella (German measles)
infection early in pregnancy [18].
In more recent decades we have had the trials of poliomyelitis vaccines [19], the
observational studies by the U. S. Public Health Service [20] and subsequent community trials by the New York State Department of Health [21] demonstrating that
fluoride in drinking water protects against dental caries, the recognition by Doll and
Hill of the strong association between cigarette smoking and lung cancer [22], and
the Framingham and other studies of risk factors for the development of cardiovascular disease [1]. Epidemiology has not abandoned its historical role in establishing
the etiology of presumed infectious disease, however, and in the past few years,
epidemiologists have been instrumental in discovering the causal agent in Legionnaire's disease, the relationship between tampon use and toxic shock syndrome, and
the importance of aspirin as a cofactor in causing Reye's syndrome in children with
influenza or chicken pox. Much of what we know now about AIDS (acquired
immunodeficiency syndrome) is based on epidemiological data obtained well before
the recent discovery of the responsible human immunodeficiency virus (HIV). The
efficacy of future vaccines and other preventive measures for AIDS will also require
evaluation by epidemiologic studies.
interpret and apply published research to the best advantage of their patients. The
goal of this volume is to help the "doers" of clinical research in improving the scientific quality of their investigation, and the "users" of research in developing their
skills of appraisal and application.
References
1. Dawber TR (1980) The Framingham study: the epidemiology of atherosclerotic disease. Harvard University Press, Cambridge
2. Vineberg A, Walker J (1964) The surgical treatment of coronary artery heart disease by internal
mammary implantation: report of 140 cases followed up to thirteen years. Dis Chest 45:
190-206
3. Urschel HC, Razzuk MA, Miller ER, Nathan MJ, Ginsberg RJ, Paulson DL (1970) Direct and
indirect myocardial neovascularization: follow-up and appraisal. Surgery 68: 1087-1100
4. Sethi GK, Scott SM, Takaro T (1973) Myocardial revascularization by internal thoracic arterial
implants: long-term follow-up. Chest 97: 97-105
5. Hill JD, Holdstock G, Hampton JR (1977) Comparison of mortality of patients with heart
attacks admitted to a coronary care unit and an ordinary medical ward. Br Med J 2: 81-83
6. Fletcher RH, Fletcher SW, Wagner EH (1982) Clinical epidemiology - the essentials. Williams
and Wilkins, Baltimore
7. Sackett DL, Haynes RB, Tugwell P (1985) Clinical epidemiology: a basic science for clinical
medicine. Little, Brown, Boston
8. Hippocrates (1939) The genuine works of Hippocrates. Williams and Wilkins, Baltimore
9. Dewhurst K (1966) Dr. Thomas Sydenham (1624-1689). University of California Press, Berkeley
10. Wilcox WF (ed) (1937) Natural and political observations made upon the bills of mortality by
John Graunt. Johns Hopkins Press reprint, Baltimore
11. Lind J (1793) A treatise of the scurvy. Sands, Murray, and Cochran, Edinburgh
12. Jenner E (1910) Vaccination against smallpox. In: Eliot SW (ed) Scientific papers. Collier, New
York, pp 153-231
13. Bollet AJ (1973) Pierre Louis: the numerical method and the foundation of quantitative medicine. Am J Med Sci 266: 92-101
14. Louis PC-A (1836) Researches on the effects of bloodletting in some inflammatory diseases, and
on the influence of tartarized antimony and vesication in pneumonitis. Hilliard, Gray, Boston
15. Farr W (1975) Vital statistics: a memorial volume of selections from the reports and writings of
William Farr. New York Academy of Medicine, Metuchen, NJ
16. Snow J (1936) Snow on cholera. The Commonwealth Fund, New York
17. Goldberger J (1964) Goldberger on pellagra. Louisiana State University Press, Baton Rouge
18. Gregg NM (1941) Congenital cataract following German measles in the mother. Trans Ophthalmol Soc Austr 3: 35-46
19. Francis T, Korns RF, Voight RB, Boisen M, Hemphill FM, Napier JA, Tolchinsky E (1955) An
evaluation of the 1954 poliomyelitis vaccine trials: summary report. Am J Public Health 45
[part II Suppl]: 1-630
20. Dean HT, Arnold FA, Elvove E (1942) Domestic water and dental caries. V. Additional studies
of the relation of fluoride domestic waters to dental caries experience in 4425 white children,
aged 12 to 14 years, of 13 cities in 4 states. Public Health Rep 57: 1155-1179
21. Ast DB, Schlesinger ER (1956) The conclusion of a ten-year study of water fluoridation. Am J
Public Health 46: 265-271
22. Doll R, Hill AB (1952) A study of the aetiology of carcinoma of the lung. Br Med J 1271-1286
Chapter 2: Measurement
[Fig. 2.1: figure not reproduced; recoverable labels: Bias, Chance, Value of Measurement, True Value]
Properties of Measurement
may reflect either (temporal) variation in the underlying biologic attribute or random measurement error (due to method and/or observers). Many other terms exist
for the property of reproducibility, and this is a major source of confusion among
persons encountering these concepts for the first time. The most commonly encountered term for this measurement property is reliability, but the term is misleading,
since it seems unwise to rely on a measurement that may be invalid, merely because
it is reproducible. Statisticians prefer the word precision, which unfortunately can be
confused in its normal English usage with measurement detail. Perhaps the best
word is consistency, but this has not achieved general acceptance. The important
thing here, however, is the concept: the extent to which the same answer is obtained
when the measurement is repeated.
As shown in Fig. 2.1, a measurement may be highly reproducible but biased, and
therefore invalid (measurement A in the figure). It may be biased systematically
upward or downward, e.g., an incorrectly calibrated serum glucose autoanalyzer
that reproducibly gives values 30 mg/dl above the true concentration. Or it may be
consistently biased toward a given value; a broken watch, for example, will reproducibly give the same time but will be valid only twice a day.
As also shown in Fig. 2.1, a measurement may be poorly reproducible but unbiased (measurement B). When such a measurement is taken with several replications,
the average value of the replicates may have fairly good validity. This is common
practice in epidemiologic studies for variables such as height and blood pressure,
which are subject to considerable (random) intra- and interobserver error.
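The difference between the two kinds of measurement can be illustrated with a small simulation. This is a sketch under assumptions made here, not data from the text: the true glucose value of 100 mg/dl, the error spreads of 1 and 15 mg/dl, the 25 replicates, and the normally distributed random error are all hypothetical. Only the 30 mg/dl bias echoes the autoanalyzer example above.

```python
import random
import statistics

TRUE_GLUCOSE = 100.0  # mg/dl, hypothetical true concentration

def measure(bias, sd, n, rng):
    """Return n replicate readings with a fixed bias and random error of spread sd."""
    return [TRUE_GLUCOSE + bias + rng.gauss(0, sd) for _ in range(n)]

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Measurement A: highly reproducible but biased upward by 30 mg/dl.
a = measure(bias=30.0, sd=1.0, n=25, rng=rng)
# Measurement B: unbiased but poorly reproducible.
b = measure(bias=0.0, sd=15.0, n=25, rng=rng)

for name, readings in (("A", a), ("B", b)):
    print(name,
          "mean:", round(statistics.mean(readings), 1),
          "SD:", round(statistics.stdev(readings), 1))
```

With these assumed settings, the mean of A should stay near 130 mg/dl however many replicates are taken, whereas the mean of B should lie much closer to 100 mg/dl than a typical single B reading does. Averaging replicates helps with random error but does nothing for bias.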
The final measurement property of interest is detail. The detail of a measurement
is equivalent to the amount of information provided. For continuous variables, this
usually means the number of "significant figures" or decimal places. For categorical
variables, detail refers to the number of categories contained in the scale.
Ideally, a measurement should be sufficiently, but not excessively, detailed.
Detail should be sufficient to distinguish individuals or groups with true differences
in the entity of interest, but it should not be excessive, in the sense that measured
differences are of no biological importance. Furthermore, the detail of a measurement should not exceed its validity and reproducibility. Serum glucose concentration, for example, is usually measured to the nearest mg/dl. Measurement to the
nearest 100 mg/dl would be insufficient to distinguish normal subjects from those
with hypoglycemia, on the one hand, or diabetes, on the other. Even measurement
to the nearest 10 mg/dl might be insufficient to document improvement or deterioration in diabetic control after changes in insulin dosage. Conversely, measurement
to the nearest 0.1 mg/dl would probably be excessive, since changes of this magnitude have no known clinical significance and since existing technology for measurement does not yield this degree of validity or reproducibility.
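The effect of too little detail can be seen by recording the same readings at different levels of rounding. The glucose values below (60, 90, and 140 mg/dl, and a change from 152 to 147 mg/dl) are hypothetical numbers chosen for illustration, and `record` is a helper name introduced here.

```python
def record(value_mg_dl, detail_mg_dl):
    """Round a reading to the nearest multiple of detail_mg_dl."""
    return detail_mg_dl * round(value_mg_dl / detail_mg_dl)

# Hypothetical readings: one low, one in the usual range, one high.
readings = {"low": 60, "usual": 90, "high": 140}
for detail in (100, 10, 1):
    recorded = {label: record(value, detail) for label, value in readings.items()}
    print(f"nearest {detail:>3} mg/dl: {recorded}")

# A hypothetical 5 mg/dl change after an adjustment in insulin dosage.
before, after = 152, 147
print("nearest 10:", record(before, 10), "->", record(after, 10))
print("nearest  1:", record(before, 1), "->", record(after, 1))
```

At the nearest 100 mg/dl all three readings are recorded as 100, and at the nearest 10 mg/dl the change from 152 to 147 disappears, since both are recorded as 150.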
To illustrate the same concept using categorical variables, consider the question
"How would you describe your mood today?" The range of responses (scale of
categories) to a question of this type is often given according to what is called a
Likert format, e.g., depressed, neutral, or happy. Such a 3-point scale might be
insufficient to distinguish mildly depressed from suicidal patients, however, and
expansion to five categories (e.g., severely depressed, mildly depressed, neutral,
slightly happy, very happy) is probably preferable. A further increase in the number
of categories, on the other hand, may exceed the respondent's ability to characterize
his or her mood.
makes little or no difference whether the error is systematic or random, and besides,
we usually have no way of finding out. If a woman participating in a "hypertension
screening clinic" in a local shopping center has her blood pressure erroneously
recorded as 160/100 instead of her usual true pressure of 130/80, it matters little
whether the reason is an insufficiently wide blood pressure cuff (bias), an inexperienced person taking the reading (random measurement variation), or the fact that
she is under some stress because the time has expired on her parking meter (biologic
variation). Regardless of the reason for the error, she may be labeled as "hypertensive" and suffer all the worries attendant upon receiving such a diagnosis, at least
until such time as she is rechecked when calm by her own physician using a proper
cuff.
When groups instead of individuals are considered, however, the situation is
quite different. Variability (poor reproducibility), in the absence of bias, should not
change the average group value, since any individual measurement is as likely to be too high as too low with respect to its true value. In the
absence of bias, therefore, the average measurement for a group (if sufficiently
large) will be valid even if many of the individual measurements from which it
derives are not. To use engineering parlance, the "signal" may be correct despite
considerable "noise." If adequate numbers of subjects are studied (to improve the
signal-to-noise ratio), a valid measure of the group average will be revealed even in
the presence of considerable random measurement error. When the individual measurements are biased, however, the group signal will also be erroneous, despite
inclusion of a large sample of study subjects.
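The signal-to-noise argument can be written as simple arithmetic. The sketch below assumes that the random errors of different subjects are independent and share a common standard deviation, so that the random part of the error in a group average shrinks as that standard deviation divided by the square root of the number of subjects. The figures (a random error SD of 10 mm Hg and a 5 mm Hg bias) and the function name `group_average_error` are hypothetical.

```python
import math

def group_average_error(bias, random_sd, n_subjects):
    """Split the error of a group average into its systematic and random parts.

    bias       -- systematic error shared by every measurement
    random_sd  -- standard deviation of each measurement's random error
    n_subjects -- number of independent measurements averaged
    Returns (systematic error, standard error of the average).
    """
    return bias, random_sd / math.sqrt(n_subjects)

# Hypothetical blood pressure readings: random error SD of 10 mm Hg,
# with and without a 5 mm Hg bias (for example, from a cuff that is too narrow).
for n in (1, 25, 400):
    for bias in (0.0, 5.0):
        systematic, random_part = group_average_error(bias, 10.0, n)
        print(f"n={n:>3}  bias={systematic:.1f}  random error (SE)={random_part:.2f}")
```

With 400 subjects the random part falls to 0.5 mm Hg, but the 5 mm Hg bias is unchanged: enlarging the sample improves the noise, not the signal.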
Random measurement error can nonetheless have deleterious consequences
when one is seeking associations or correlations between two measured variables in
a group of subjects. In these situations, random errors in the individual measurements will lead to an analytic bias by diminishing the extent of association or correlation between the two. Say, for example, that we wish to study the correlation
between weight and systolic blood pressure. Poorly reproducible (i.e., randomly
erroneous) but unbiased measurements of weight and/or blood pressure might
reveal valid group averages for each of these variables, but the correlation between
the two would be reduced below its true value. Similarly, in a study of a possible
association between smoking and myocardial infarction (MI), random errors in
classifying study subjects as to their smoking status and/or diagnosis (MI or no MI)
will tend to reduce (bias) the measure of association between the two. The type of
analytic bias that occurs in the statistical relationship between variables as a result of
errors in measuring those two variables is called information bias and will be discussed in greater detail in Chapter 5.
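A worked example may make the attenuation concrete. The sketch below uses hypothetical counts for the smoking and MI illustration and takes the odds ratio as the measure of association, which is an assumption made here since the text does not name one. It also assumes that smokers and nonsmokers are each classified correctly 80% of the time, with the same error rates among subjects with and without MI. The function names are introduced for illustration only.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table.

    a, b -- smokers and nonsmokers among subjects with MI
    c, d -- smokers and nonsmokers among subjects without MI
    """
    return (a * d) / (b * c)

def misclassify(exposed, unexposed, sensitivity, specificity):
    """Expected counts classified as exposed/unexposed after random classification error."""
    seen_exposed = exposed * sensitivity + unexposed * (1 - specificity)
    seen_unexposed = exposed * (1 - sensitivity) + unexposed * specificity
    return seen_exposed, seen_unexposed

# Hypothetical true counts (smokers, nonsmokers).
with_mi = (60, 40)
without_mi = (40, 60)
true_or = odds_ratio(*with_mi, *without_mi)

# The same error rates apply in both groups (errors unrelated to MI status).
obs_with_mi = misclassify(*with_mi, sensitivity=0.8, specificity=0.8)
obs_without_mi = misclassify(*without_mi, sensitivity=0.8, specificity=0.8)
observed_or = odds_ratio(*obs_with_mi, *obs_without_mi)

print(f"true OR = {true_or:.2f}, observed OR = {observed_or:.2f}")
```

Under these assumptions the true odds ratio of 2.25 is observed as about 1.62: the association is weakened but not reversed, and no association is created where none exists.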
In summary, then, poorly reproducible measurements are more tolerable in
epidemiologic research than in the assessment of individual patients for clinical purposes. The effects of random measurement errors can be overcome, in part, by
increasing the number of subjects measured, and the distortions that such random errors
produce in statistical relationships between variables will generally lead to conservative
inferences. Thus, even "sloppy" measurements should not, in the absence of bias,
create false statistical associations where none exist. Depending on one's point of
view, this built-in conservatism can be considered either beneficial (preventing the
too-ready acceptance of new findings) or harmful (hindering scientific progress).
Sources of Data
Clinical observations include the elements of a medical history, physical examination, and laboratory data that are obtained in the clinical care of patients. They may
be either primary or secondary, the latter usually being obtained from existing medical records.
The quality of data obtained from medical histories depends on how the questions are asked, and therefore on language, understanding, alertness, and other
characteristics of both the history taker and patient. These factors can affect either
the reproducibility or validity of data obtained by history. As reviewed by Koran,
interobserver agreement is often poor when two or more observers obtain a medical
history from the same patient [8]. Furthermore, if the observer obtaining the history
has a preconceived notion or hypothesis in mind, the resulting measurement is susceptible to bias, especially when that observer is not "blind" to the characteristics of
the patient whose history is being taken.
Suppose, for example, we wish to know whether our patient, Mrs. Jones, has
experienced hemoptysis (coughing up blood) within the past year. Here are two
(admittedly extreme) ways of asking her:
1. Mrs. Jones, it is not at all uncommon for people with a bad cough or cold to
notice, on occasion, the appearance of small flecks of blood in their phlegm. Has
this happened to you at any time during the past year?
2. Mrs. Jones, you haven't been so unfortunate, so obviously ill, so utterly doomed
as to have coughed up blood in the past year, have you?
Physical examination depends on the skill, training, experience, and mental state
of the examiner, and many of the measurements obtained are thus somewhat subjective (e.g., the presence or absence of liver enlargement). As with history taking,
interobserver agreement has been shown to be poor for a variety of aspects of the
physical examination [8]. Here, too, the use of nonblind examiners increases the
potential for biased measurement.
Laboratory data are usually more valid and reproducible than those obtained by
history and physical examination, but they depend on the quality control utilized by
the clinical laboratory or X-ray facility. As reviewed by Koran, interobserver agreement is not as high as one may be led to believe by the impressive technologic
advances in recent years [8]. In general, the greater the potential for subjectivity
("clinical judgment") in obtaining the laboratory measurement, the poorer the
reproducibility and the greater the opportunity for bias.
When clinical observations are planned and carried out by a study's investigators
(i.e., the data source is primary), the reproducibility and validity of the measurements can be improved by adequately training and blinding the observers (clinicians
and laboratory personnel) and by providing objective, operational criteria for performing and recording the measurements. When the data come from secondary
sources, the recording of the observations cannot (by definition) be controlled by
the investigators, but medical record abstractors should be provided with operational rules and criteria for extracting their observations and should be kept blind, as
far as possible, to the study's principal hypotheses. Medical records may have the
additional problem, of course, that data may be missing or insufficiently detailed for
use in the study.
2.6.2 Questionnaires and Interviews
Since questionnaires and interviews are often highly structured and designed by
investigators for a specific study, the resulting data are usually primary. These data
sources share some of the same characteristics as the medical history. As with history
taking, responses depend on how questions are asked. Similarly, subjects are
generally believed to overreport minor symptoms and underreport bad habits.
Self-administered questionnaires suffer from three additional problems. First,
since many questionnaires designed for a specific study utilize a fixed format (i. e.,
the range of possible responses to each question is limited to those printed on the
questionnaire), they occasionally provide insufficient or excessive detail. The pretesting of such questionnaires prior to use in an actual research study is thus essential to improve the quality of data obtained therefrom. Second, inconsistent
responses cannot be resolved, unless provision is made for follow-up contact by
mail, telephone, or personal visit. Third, since self-administered questionnaires are
usually sent by mail, nonresponse can be a major problem. Many people simply do
not return questionnaires sent to them in the mail. Even with repeated mailings,
response rates above 80% are unusual. Of even greater concern, those who do
return the questionnaire may differ in important ways from those who do not, thus
leading to a potential for bias due to nonresponse.
The data presented in Table 2.1 are taken from a study by Burgess and Tierney
[9]. In 1968, short questionnaires concerning (among other items) cigarette smoking
were mailed to 1184 licensed physicians in Rhode Island. The first mailing produced
a 70.7% response; 21.7% of the respondents admitted to being current cigarette
smokers. A second mailing netted an additional 189 (16.0%) respondents, 26.5% of
whom reported smoking. When a sample of the 158 (13.3%) remaining nonrespondents (or their families or friends) were approached in person, it was found that
45.5% were smokers. In other words, nonrespondents were about twice as likely to
smoke as respondents.

Table 2.1. Smoking habits among 1184 Rhode Island physicians [9]

Subjects              Number   % of Total   % Smoking
Respondents
  First mailing          837       70.7        21.7
  Second mailing         189       16.0        26.5
  Total                 1026       86.7        22.6
Nonrespondents           158       13.3        45.5
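The size of the bias in Table 2.1 can be checked with a little arithmetic. The Python sketch below is illustrative only: the names are hypothetical, and it assumes that the 45.5% found in the sampled nonrespondents applies to all 158 of them. It weights each group's smoking percentage by the size of the group.

```python
# Figures from Table 2.1: (number of physicians, % smoking).
groups = {
    "first mailing": (837, 21.7),
    "second mailing": (189, 26.5),
    # Assumption: the rate found in the sampled nonrespondents holds for all 158.
    "nonrespondents": (158, 45.5),
}

def pooled_percent(names):
    """Size-weighted smoking percentage across the named groups."""
    total = sum(groups[n][0] for n in names)
    smokers = sum(groups[n][0] * groups[n][1] / 100 for n in names)
    return 100 * smokers / total

respondents_only = pooled_percent(["first mailing", "second mailing"])
everyone = pooled_percent(list(groups))

print(f"Respondents only:    {respondents_only:.1f}%")
print(f"All 1184 physicians: {everyone:.1f}%")
```

Under that assumption, the respondents-only figure would understate smoking among all 1184 physicians by about three percentage points.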
It should be emphasized, however, that nonresponse does not always lead to bias
[10]; data based on low response rates can indeed be valid. The problem is that the
characteristics of the nonrespondents are usually unknown, and the potential for
bias is unassessable and therefore capable of undermining the findings of a study.
Personal interviews, either by telephone or direct questioning (often in the
home), have several advantages over mailed, self-administered questionnaires. The
response rate is often higher when the study subject can meet, or at least talk to, the
person asking the question. People are far more likely to throw away or ignore a
written questionnaire received by mail from an investigator or study group they have
never met or talked to than to refuse to answer questions asked by telephone or in
person by someone who adequately introduces him- or herself. Another advantage
of the personal interview is that inconsistencies between two or more responses can
be resolved by the interviewer. One disadvantage of personal interviews relative to
self-administered questionnaires is their potential for systematic measurement bias.
This can be minimized by thoughtful a priori structuring of the interviews and by
careful training, periodic quality control, and "blinding" (to preselected characteristics of study subjects and the study hypothesis) of interviewers.
2.6.3 Reportable Diseases and Disease Registries
This general category of (secondary) data source includes school and industry
records of baseline and periodic physical examinations and of absenteeism, health
records of the armed forces and the Veterans Administration, and data from insurance programs. Although the data from such sources are often conveniently computerized and of high quality, the major limitation concerns generalizability to persons
outside of the specific group from which the data derive. Military data are highly
nonrepresentative with respect to age and sex, employers and life insurance companies are likely to exclude persons with significant illness, and prepaid health insurance plans (e.g., Kaiser Permanente in the Western United States and the Health
Insurance Plan of Greater New York) underrepresent the economically disadvantaged.
Nonetheless, the quality and size of these data bases have facilitated a number of
important epidemiologic studies. For example, data on height and weight routinely
collected by the Metropolitan Life Insurance Company have been useful in understanding the relationship between obesity and life expectancy. One of the best of
these data sources has been the Mayo Clinic and the Olmsted County Medical
Group, which provide medical care for the vast majority of residents in the Rochester, Minnesota, area. Although underrepresentative of poor and minority groups,
the exploitation of this data source has contributed to our understanding of the natural history of several chronic diseases.
When their completeness and validity can be assured, as is the case in most industrialized countries, national or other population-based statistics can be valuable secondary data sources for epidemiologic investigation. The census is taken every
10 years in the United States and Canada and includes data on age, sex, race, education, and socioeconomic status. Although illegal immigrants are uncounted and
certain other groups (e.g., infants, racial minorities, and vagrants) are undercounted, the census often provides the best source of denominator data for many of
the epidemiologic rates that will be discussed in the next chapter.
Vital statistics consist of population-based data bearing on births, deaths, marriages, and divorces. Birth and death certificates provide fairly valid data for counting numbers of births and deaths, except in remote areas where such events occur
without contact with hospitals or medical care personnel. In addition to the fact of
the birth, birth certificates also include useful information concerning the parents'
race and education, the mother's pregnancy history and use of prenatal care, and
evidence of (obvious) congenital anomalies. Death certificates include data on the
age, sex, marital status, and occupation of the deceased.
The main problem with death certificates concerns the cause of death, because it
depends on the attribution of cause by the attending physician. One problem is that
socially undesirable causes of death, such as suicide and alcohol or drug abuse, are
systematically underreported. But even more importantly, death certificates require
the physician to specify an underlying cause. Not only does this require a judgment
by the physician as to which of several diseases the patient may have been suffering
from was the underlying fatal one, but the cited cause is not usually changed by the
results of autopsy or other data that may subsequently become available. Fortunately, although published mortality statistics are based on the single underlying
cause of death, data concerning other conditions listed on the death certificate are
also entered into the computerized data base and are thus available to investigators
having access to that data base.
In an attempt at international standardization, most countries adhere to the classification codes established by the International Classification of Diseases (ICD),
which is now in its ninth revision. Secular changes in causes of deaths, however, may
be confounded by changes in nosology. For example, the disease "dropsy" (swelling
of the ankles) has, in this century, been replaced by more specific causes of death,
such as heart, liver, or kidney failure. Important changes in diagnostic technology
have also resulted in some artificial changes. The recent drop in the number of
deaths from stomach cancer, for example, is probably at least partly attributable to a
previous tendency to label any abdominal mass or tumor as stomach cancer. It is
now known that most of these masses are caused by cancer of the colon, ovary, pancreas, or other intra-abdominal organs. Conversely, the death rate for hypertension
increased about tenfold in English and Welsh men 45-54 years old between 1930
and 1950. This increase did not reflect a true increase in either the occurrence
or fatality of hypertension, but rather the increasing availability and use of the
sphygmomanometer and the recognition of the role of hypertension in causing
fatal heart disease and stroke. Finally, geographic differences in terminology may
also lead to spurious differences in mortality. The same chronic obstructive lung
disease may be called emphysema in the United States and bronchitis in the United
Kingdom.
Abortion (fetal death) rates are worthy of special comment. Many early pregnancies go unrecognized, and requirements vary as to stage of gestation at which
registration is required. Because of these factors, as well as the obvious difficulties in
determining cause of death in many cases, data concerning fetal deaths (which
require a distinct certificate form) are of notoriously poor quality and completeness.
Because of the legal requirements for registration of births and deaths, population-based data concerning fertility and mortality are both more complete and of
higher quality than data concerning morbidity. For countries like the United Kingdom, where the National Health Service assigns each individual to one general
practitioner, physicians' records can serve as a base for collecting morbidity data. In
the United States, such data have been produced by the National Center for Health
Statistics (NCHS) over the past 30 years by a series of interviews and examinations
(the Health Interview Surveys and Health and Nutrition Examination Surveys) of
random samples of the U. S. population. These data are supplemented by sampling
physicians' offices (the National Ambulatory Care Survey).
The NCHS data are limited by the fact that the surveys are cross-sectional in
nature, i. e., they measure only morbidity present at the time of the interview, examination, or physician visit. Furthermore, the size of the samples studied is insufficient
to study infrequent diseases. Nonetheless, they have provided a valuable source of
national data concerning anthropometric measurements and nutritional status,
minor illnesses and disabilities, and utilization of health care services.
Perhaps the best population-based data sources are the extensive data linkage
networks in the Scandinavian countries. In Sweden, for example, each person has a
unique identification number assigned at birth. Information about birth, employment, health, and death is stored in computer data banks accessible through this
number. Individuals listed in birth defects registries and cancer registries can also be
identified through this number, and linkage to other data bases is readily achieved.
The availability of such information for virtually the entire population is an invaluable resource for epidemiologic investigation.
2.6.6 Sources of Data: Concluding Remarks
An important distinction to be made concerning the various sources of data discussed above relates to whether group data that are aggregated for presentation can
be disaggregated to obtain data on the individual members of that group. When a
single variable is considered in isolation, this distinction is of little importance, since,
as we have seen, the group average for that variable should be valid if the individual
measurements are unbiased. When the main interest is in the possible association
between two variables, however, aggregate data can lead to a spurious association
that would not be found on analysis of the same two variables presented by individual subjects.
For example, consider the relationship between death from colon cancer and
dietary fiber intake. Analyses based on aggregated vital statistics and food consumption data by country have revealed a very tight inverse association: the higher a
country's per capita fiber intake, the lower its colon cancer mortality [11]. An obvious conclusion is that eating dietary fiber protects against colon cancer. This conclusion might be false, however. It is possible that on an individual basis, no association
exists between fiber intake and the development of colon cancer. In other words,
within countries having a given per capita fiber intake, individuals consuming a
high-fiber diet might be just as likely to develop colon cancer as those consuming a
low-fiber diet. The spurious association derived from the country-by-country
(aggregated) analysis might then be explained, for example, by the fact that high-fiber foods are consumed (for cultural or climatic reasons) in countries lacking
heavy industry, and that it is the industrial air pollution in other countries that leads
to higher colon cancer death rates, rather than any protective effect of fiber in the
nonindustrial countries.
The false inference that can result from analysis of aggregate, rather than individual, data is called an ecological fallacy. Data on individuals are always to be preferred to aggregated data on groups when investigating statistical associations
between variables.
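The ecological fallacy can be made concrete with a small numerical sketch. The Python example below uses entirely hypothetical data (three invented countries, four individuals each, and an arbitrary "risk score" standing in for colon cancer risk). The individual values are built so that fiber intake and risk are unrelated within every country, while the country averages line up perfectly.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical country averages: (mean fiber intake, mean risk score).
country_means = {"A": (10, 30), "B": (20, 20), "C": (30, 10)}

# Individual deviations from the country average, chosen so that fiber
# and risk are uncorrelated among individuals of the same country.
fiber_dev = [-2, -2, 2, 2]
risk_dev = [-1, 1, -1, 1]

individuals = {
    c: ([f + d for d in fiber_dev], [r + d for d in risk_dev])
    for c, (f, r) in country_means.items()
}

# Aggregate analysis: one data point per country.
agg_r = pearson([f for f, _ in country_means.values()],
                [r for _, r in country_means.values()])

# Individual-level analysis, carried out separately within each country.
within_r = {c: pearson(fx, ry) for c, (fx, ry) in individuals.items()}

print("country-level r:", round(agg_r, 2))
print("within-country r:", {c: round(r, 2) for c, r in within_r.items()})
```

In this constructed example the country-level correlation is perfectly inverse, yet within each country fiber intake tells us nothing about an individual's risk.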
Finally, I should re-emphasize here another important distinction among data
sources: the distinction between primary and secondary data. Because secondary
data are collected largely for general documentation and descriptive purposes, key
data items may be missing or inadequately detailed to answer specific questions or
test specific hypotheses. Furthermore, as we saw in Section 2.4, poorly reproducible measurements tend to reduce the magnitude of statistical associations between measured variables. For these reasons, many epidemiologic studies make
use of primary data collected by the study's investigators. Secondary and primary
sources can often be profitably combined, however, as when follow-up data are
obtained from patients with a specific tumor who are identified from a cancer
registry.
Whether individual or aggregated, primary or secondary, data concerning death,
illness, recovery, and other discrete (i. e., categorical) events are usually expressed as
epidemiologic rates. The definition and interpretation of these rates is the focus of
the next chapter.
References
1. Feinstein AR (1972) The need for humanized science in evaluating medication. Lancet 2: 421-423
2. Yerushalmy J (1969) The statistical assessment of the variability in observer perception and description of roentgenographic pulmonary shadows. Radiol Clin North Am 7: 381-392
3. Feinstein AR, Gelfman NA, Yesner R (1970) Observer variability in the histopathologic diagnosis of lung cancer. Am Rev Respir Dis 101: 671-684
4. Karch FE, Smith CL, Kerzner B, Mazzullo JM, Weintraub M, Lasagna L (1976) Adverse drug reactions - a matter of opinion. Clin Pharmacol Ther 19: 489-492
5. Koch-Weser J, Sellers EM, Zacest R (1977) The ambiguity of adverse drug reactions. Eur J Clin Pharmacol 11: 75-78
6. Kramer MS, Leventhal JM, Hutchinson TA, Feinstein AR (1979) An algorithm for the operational assessment of adverse drug reactions. I. Background, description, and instructions for use. JAMA 242: 623-632
7. Hutchinson TA, Leventhal JM, Kramer MS, Karch FE, Lippman AG, Feinstein AR (1979) An algorithm for the operational assessment of adverse drug reactions. II. Demonstration of reproducibility and validity. JAMA 242: 633-638
8. Koran LM (1975) The reliability of clinical methods, data and judgments. N Engl J Med 293: 642-646, 695-701
9. Burgess AM, Tierney JT (1970) Bias due to nonresponse in a mail survey of Rhode Island physicians' smoking habits - 1968. N Engl J Med 282: 908
10. Siemiatycki J, Campbell S (1984) Nonresponse bias and early versus all responders in mail and telephone surveys. Am J Epidemiol 120: 291-301
11. Armstrong B, Doll R (1975) Environmental factors and cancer incidence and mortality in different countries, with special reference to dietary practices. Int J Cancer 15: 617-631
Chapter 3: Rates
The importance of denominators can be illustrated further by analogy with batting averages. A baseball player's batting average is defined as the number of hits he
obtains divided by his number of opportunities, i. e., appearances at bat, and is represented by a rate (to three decimal places) between 0 and 1. The professional baseball
leagues award separate trophies for the player getting the most hits (counts) and the
player achieving the highest average (rate). They are usually not the same player,
however, since batters at the beginning of the lineup invariably get considerably
more at bats, and thus a greater number of opportunities for hits. Their numerators
are higher because their denominators are higher, but their average may be somewhat lower than those of players further down in the lineup, who have fewer at bats
but a higher rate of success.
In constructing rates, the nature of the relationship between the numerator and
denominator is of crucial importance. There are two main requirements:
1. The individuals counted in the numerator must be members of the group represented by the denominator. If we were interested in the rate of skin-test positivity
for tuberculosis (TB) in a given community, the community census would provide
the data for the denominator. Consequently, transients or recent immigrants not
counted in the census should not appear in the numerator. Similarly, if the numerator is restricted to certain characteristics, the denominator should be similarly
restricted. Rates restricted in this manner are called specific rates. Rates may be
specified by age, sex, race/ethnic origin, or any other attribute of interest. The
rate of TB skin-test positivity among white men 20-34 years of age is an example
of a race-, sex-, and age-specific rate.
2. All "members" of the denominator group should be eligible to have the attribute
or to experience the event counted in the numerator. In constructing uterine cancer rates, for example, women with prior hysterectomies and men should be
removed from the denominator. (That this requirement is sometimes violated,
however, is illustrated by the crude birth rate, in which the numerator is the number of live births, and the denominator is the total population rather than the
number of women of child-bearing age.)
Occasionally, the sources of data for the numerator and denominator are different,
and the requirements for constructing a rate are violated. Such measures are more
properly called ratios, rather than rates, although the latter term is often loosely (and
incorrectly) applied. For example, the annual maternal mortality "rate" of a population is defined as the number of deaths due to pregnancy, labor, or delivery divided
by the number of live births occurring in that population during a given year. The
true denominator is falsely lowered by excluding spontaneous and induced abortions, as well as unrecognized pregnancies, and is falsely (although only slightly)
inflated by twin and triplet births.
I conclude this section with a semantic warning. As is unfortunately the case
with many epidemiologic terms, rate can convey different meanings. For some
epidemiologists, the notion of rate implies change over time, or slope, and they prefer to restrict the use of the word to this context [1]. Since rate has achieved such
wide acceptance, however, both within and outside the field of epidemiology, I shall
continue to use the term in the traditional and more general sense discussed above.
The concept of change over time is nonetheless of great relevance to the measurement of rates - so much so, in fact, that two different types of rates are used.
One, called prevalence, is a static measure of rate at a single point in time. The other,
incidence, is dynamic and measures the rate at which some attribute or event develops over a specific period of time. The distinction between prevalence and incidence
is of fundamental importance in epidemiology and will be the focus of the following
section.
per specified
time period
Incidence is thus a measure of frequency over time. It refers to change in status over
a specified period, e. g., monthly incidence or annual incidence. Despite these clear
differences between prevalence and incidence, clinicians and clinical investigators
commonly confuse the two. In particular, "incidence" is often used as a generic term
for "frequency," e.g., "The incidence [sic] of retinopathy among insulin-dependent
diabetics at our medical center is 15%," or "The incidence [sic] of hepatic adenomas
(benign liver tumors) in rats killed at 1 month was 4%." "Incidence" should be
reserved for describing the frequency of newly occurring characteristics and should
always be expressed as a function of time.
A problem arises in measuring the incidence of attributes or events that are transient and recurrent. For such characteristics, a choice must be made between the
proportion of individuals in a group who develop one or more episodes within a
$P \propto I \times D$
i. e., prevalence is proportional to the product of incidence and average duration.
The average duration of any characteristic is dependent on two primary determinants: (a) its mortality and (b) its rate of disappearance (either spontaneously or in
response to some treatment or other intervention).
A disease with a high incidence may therefore have a low prevalence if it is of
short duration, e. g., the common cold or lung cancer. Conversely, a disease of low
incidence can attain a high prevalence if it is incurable but nonfatal. In fact, medical
treatment can (paradoxically) increase the prevalence of disease. A good example is
end-stage kidney disease, where the availability of dialysis and transplantation has
turned a previously rapidly fatal disease into a chronic illness with a prevalence that
continues to rise.
Figure 3.1 depicts the experience of 40 subjects followed up for 1 year for the
development of a (hypothetical) disease lasting 1 month. The distinction between
incidence and prevalence can be clearly seen if we examine the situation at
6 months. The incidence of the disease over the first 6 months is 10/40, or 25%. The
prevalence at 6 months, however, is 0%, i. e., none of the 40 subjects has the disease
at that time. It is also apparent that if the average disease duration were, say, 1 year
instead of 1 month, the prevalence would increase markedly. At 12 months, the
prevalence would then be 20/40, or 50%, rather than 0%.
In groups or populations in which the occurrence of attributes or events is stable,
$P \propto I \times D$ becomes
$P = I \times D$
"Stability" here means that incidence and average duration remain constant over
time. Thus, if the incidence of a certain characteristic remains stable, and no change
occurs in its rate of disappearance, its prevalence will also remain unchanged. In
such situations, any two of these three quantities, if known, can be used to calculate
the third. This is frequently the case for nonfatal, chronic illnesses such as arthritis
29
9 10
11
12
Time (Months)
Fig.3.t. Forty subjects followed up for 1 year for the development of a hypothetical disease with
I-month duration
and asthma. It is also true for fatal diseases (e.g., certain cancers) for which no
effective treatment is available.
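The relation between prevalence, incidence, and average duration lends itself to a short numerical sketch. The Python example below takes the stable-population relation P = I × D at face value; the function names and the figures are hypothetical and chosen only to show how any two of the quantities give the third.

```python
def prevalence(incidence_per_year, mean_duration_years):
    """Prevalence (a proportion) under stable conditions: P = I x D."""
    return incidence_per_year * mean_duration_years

def mean_duration(prevalence_proportion, incidence_per_year):
    """Average duration in years, rearranging P = I x D to D = P / I."""
    return prevalence_proportion / incidence_per_year

# Hypothetical chronic illness: 2 new cases per 1000 persons per year,
# lasting 10 years on average.
p = prevalence(2 / 1000, 10)
print(f"prevalence: {p:.1%}")

# Going the other way: the same prevalence and incidence recover the duration.
d = mean_duration(p, 2 / 1000)
print(f"average duration: {d:.1f} years")
```

Incidence and duration must be expressed in the same time unit (here, years) for the product to be meaningful.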
In general, all individuals in a study group who develop the attribute or event
during the follow-up period are placed in the numerator, even if they were not
members of the group at the beginning of the period. For the denominator, the average number of group members during the period is usually used. If changes in group
membership occur evenly throughout the period, the number at mid-period will
serve adequately as the denominator. Thus, an incidence rate for a given calendar
year would use the group membership (e.g., population) as of July 1 of that year for
the denominator.
When gains and losses for a dynamic group occur irregularly during the follow-up period, however, a different denominator is often used. The duration of follow-up of each individual in the group is summed to yield a total number of person-durations. Person-durations (e.g., person-months or person-years) would then substitute for persons in the denominator, and the specification of time period is then
no longer required. Incidence rates using person-durations as denominators are also
called incidence density rates [2] and have special properties, to be discussed later in
the text.
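A person-duration denominator can be sketched in a few lines of Python. The follow-up records below are hypothetical, and the sketch assumes (a common convention, not stated in the text) that an individual stops contributing follow-up time once the event has occurred.

```python
# Hypothetical records: (months of follow-up, developed the event?)
follow_up = [
    (12, False),
    (6, True),    # event at 6 months; follow-up assumed to end there
    (3, False),   # left the group after 3 months
    (9, True),
    (12, False),
    (6, False),   # joined the group at mid-year
]

def incidence_density(records, months_per_unit=12):
    """New events divided by total person-time (person-years by default)."""
    events = sum(1 for _, had_event in records if had_event)
    person_time = sum(months for months, _ in records) / months_per_unit
    return events / person_time

rate = incidence_density(follow_up)
print(f"{rate:.2f} events per person-year")
```

Because the denominator already carries the time dimension, the result is read as events per person-year rather than as a proportion of the group.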
Even incidence densities, however, assume the equivalence of equal units of follow-up, i. e., that ten individuals followed for 1 year are equivalent to one individual
followed for 10 years. For attributes or events with a long latent period (the period
between exposure to a cause and the appearance of an effect), such an assumption
can lead to an erroneous measure of incidence. For example, a group's adoption of
a certain exercise or diet regimen may require many years before resulting in any
subsequent reduction in cardiovascular mortality. If most individuals are followed
up for only a year or two after beginning the diet, no beneficial effect may be seen,
even if tens of thousands of individuals participate in the study. In other words, a
large number of individuals followed up for a short period of time will lead to an
underestimate of the true incidence. Adjustment of incidence for differential durations of follow-up is accomplished by means of life-table techniques. These will be
taken up in Chapter 18.
3.2.3 Incidence Rates: Specific Examples
There are several attributes or events of particular interest to epidemiologists, and
their incidence rates carry special names. The death rate (or mortality rate) is the
number of individuals in a group who die within a given number of person-years of
follow-up. When restricted to deaths caused by a specific disease, the incidence is
referred to as the disease-specific death (or mortality) rate. This is to be distinguished
from the disease-specific case fatality rate. Both rates have the same numerator, i. e.,
the number of individuals who die of the disease within the given period. The
denominators are entirely different, however. For disease-specific mortality, the
denominator consists of the total person-years of follow-up, whereas for case fatality, it is restricted to the number of individuals in the group who are affected by the
disease.
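The difference between the two denominators is easy to see with numbers. The following Python sketch uses purely hypothetical figures for a group followed up for 1 year; only the variable names and values are invented, while the two definitions follow the text.

```python
# Hypothetical one-year follow-up (all numbers are illustrative).
person_years = 10_000      # total follow-up contributed by the whole group
cases = 50                 # individuals in the group affected by the disease
deaths_from_disease = 10   # shared numerator for both rates

# Disease-specific mortality: deaths per person-year in the whole group.
mortality_rate = deaths_from_disease / person_years

# Case fatality: deaths among those affected by the disease.
case_fatality = deaths_from_disease / cases

print(f"disease-specific mortality: {mortality_rate * 1000:.1f} per 1000 person-years")
print(f"case fatality: {case_fatality:.0%}")
```

The same ten deaths thus describe a disease that is rare as a cause of death in the group but frequently fatal among those who have it.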
When the time period at risk for development of a given attribute or event is
limited, the incidence may be expressed as an attack rate without specifying the
duration of time during which cases developed. The attack rate is of particular use
The incidence rate of an attribute or event is the frequency measure of choice when
interest focuses on the cause of that attribute or event. Because causal factors operate
prior to the development of the effects they cause, causal reasoning is enhanced by
knowing that individuals are free of a characteristic (effect) before being exposed to
the causal factor under suspicion. Furthermore, rapid recovery or death from the
characteristic of interest will prevent its detection if only prevalence is known. In
order to ensure detection of all new cases, however, calculation of incidence
requires measurement of the characteristic in all individuals within a group at the
beginning of follow-up and systematic assessment of its occurrence until the end of
follow-up.
Table 3.1. Annual incidence rates used in vital statistics
Columns: Name | Numerator | Denominator | Expressed
Denominator (surviving entries): Mid-year population; Mid-year population of women 15-44 years; Mid-year population; Number of live births (a)
Expressed (surviving entries): per 1000; per 1000; per 1000; per 100000; per 1000; per 1000; per 1000
(a) The definitions of "live birth" and "fetal death" are far from uniform. Some U. S. states, for example, require that a newborn or fetus weigh ≥ 500 g to count as either a live birth or fetal death, respectively [3].
This measure is more appropriately called a ratio, rather than a rate (see p. 26).
Prevalence, on the other hand, is much easier to calculate, since it requires only
one measurement of individuals in a group at a single point in time. No follow-up or
repeat measurements are required. Furthermore, prevalence is quite useful for
describing the extent or "burden" of an attribute in a given community, clinic, etc.
Prevalence is therefore of great importance from a public health perspective, since
health care services are often distributed according to need, i. e., existing health and
disease status. For example, conditions like arthritis and heart disease usually consume far greater resources than do more commonly occurring but shorter-lived diseases like the common cold or viral gastroenteritis.
Finally, although not generally appropriate for making causal inferences, prevalence rates can occasionally be useful in suggesting hypotheses when incidence rates
are unavailable. Comparison of the prevalence of cardiovascular disease among different types of societies, for example, might give rise to etiologic clues based on differences in diet, physical activity, or other characteristics of those societies.
One peculiar kind of rate, called the period prevalence rate, represents a hybrid of
incidence and prevalence. It is defined as the proportion of individuals in a fixed
group who either have a given characteristic at the beginning of a specified period or
develop it during the period. It is thus the sum of the initial prevalence (also called
the point prevalence) and the subsequent incidence. But since period prevalence has
neither the etiologic advantages of incidence nor the public health utility of prevalence, it is little used in modern epidemiology.
Consider the example shown in Table 3.2, which compares the annual death
rates in two (hypothetical) small U. S. communities, one a northeastern industrial
town (Millville), the other a sun-belt retirement colony (Sunnyvale). The overall
crude death rates in the two communities are shown in the last row of the table (columns 4 and 7). Contrary to what we might expect, Sunnyvale appears to be a considerably more lethal habitat than Millville, with an annual death rate twice as high
(23.8 vs 11.0 per 1000 per year). A closer examination of the individual rows corresponding to different age groups or strata, however, reveals just the opposite. In
each age stratum, the death rate in Sunnyvale is in fact lower than that in Millville.
The discrepancy is caused by an age distribution that is quite different in the two
communities, with Sunnyvale having a much older age structure (columns 2 and 5);
74% of the Millville population is under 45, compared with only 28% in Sunnyvale.
Age is thus a confounding factor here. It is unequally distributed between the two
groups and is independently related to the attribute of interest (death).
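The reversal described above can be verified directly from the figures in Table 3.2. The Python sketch below (function and variable names are illustrative) computes the crude and age-specific death rates for both communities.

```python
# Table 3.2: (population, deaths) by age stratum.
strata = ["0-14", "15-29", "30-44", "45-59", "60-74", ">=75"]
millville = [(500, 2), (2000, 8), (2000, 12), (1000, 10), (500, 20), (100, 15)]
sunnyvale = [(400, 1), (300, 1), (1000, 5), (2000, 18), (2000, 70), (400, 50)]

def per_1000(deaths, population):
    """Deaths per 1000 population."""
    return 1000 * deaths / population

def crude_rate(town):
    """Overall deaths per 1000, ignoring age."""
    return per_1000(sum(d for _, d in town), sum(p for p, _ in town))

def stratum_rates(town):
    """Age-specific deaths per 1000, one value per stratum."""
    return [per_1000(d, p) for p, d in town]

print(f"crude: Millville {crude_rate(millville):.1f}, "
      f"Sunnyvale {crude_rate(sunnyvale):.1f}")
for s, m, v in zip(strata, stratum_rates(millville), stratum_rates(sunnyvale)):
    print(f"{s:>6}: Millville {m:6.1f}  Sunnyvale {v:6.1f}")
```

The crude rate is higher in Sunnyvale, yet every age-specific rate is lower there, which is the signature of confounding by age.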
Stratification of rates is accomplished by comparing the stratum-specific, rather
than overall, rates and is one method of eliminating bias due to confounding. We
might, however, prefer some overall measure that combines the data from all strata
without reintroducing bias. This is especially important when the stratum-specific
rates reveal a mixed picture, with some rates higher in one group and other rates
higher in the second. There are two frequently used methods for this overall type of
adjustment: direct and indirect standardization.
For direct standardization, the observed stratum-specific rates in the two groups
are applied to a third ("standard") group or population with known stratum structure. In Table 3.3, the age-specific death rates in Millville and Sunnyvale are applied
to a standard population of 12000 with an age distribution as shown in column 2.
For each age stratum in column 1, the number of persons from the standard population in each stratum (column 2) is multiplied by the age-specific death rate for Millville (column 3) and Sunnyvale (column 5) to yield the number of deaths (columns 4
and 6, respectively) that would be expected if each community had the age structure
of the standard population. The total number of "expected" deaths for each community (last row, columns 4 and 6) is then divided by the total population of 12000 to
yield the standardized death rates shown in the last row, columns 3 and 5. In conformity with the stratum-specific rates, the overall standardized rates reveal a higher annual death rate in Millville (19.6 vs 16.8 per 1000).

                 Millville                             Sunnyvale
Age stratum  Population  Deaths  Deaths/1000   Population  Deaths  Deaths/1000
(1)          (2)         (3)     (4)           (5)         (6)     (7)
0-14            500         2       4             400         1       2.5
15-29          2000         8       4             300         1       3.3
30-44          2000        12       6            1000         5       5
45-59          1000        10      10            2000        18       9
60-74           500        20      40            2000        70      35
≥75             100        15     150             400        50     125
Total          6100        67      11.0          6100       145      23.8

Strata under 45 contain 74% of the Millville population and 28% of the Sunnyvale population; strata 45 and over contain 26% and 72%.

Table 3.3. Direct standardization of annual death rates shown in Table 3.2

                            Millville                   Sunnyvale
Age stratum  Standard    Deaths/1000  "Expected"    Deaths/1000  "Expected"
             population               deaths                     deaths
(1)          (2)         (3)          (4)           (5)          (6)
0-14            500         4            2             2.5          1.25
15-29          2500         4           10             3.3          8.25
30-44          3000         6           18             5           15
45-59          3000        10           30             9           27
60-74          2500        40          100            35           87.5
≥75             500       150           75           125           62.5
Total         12000        19.6        235            16.8        201.5
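
The arithmetic of Table 3.3 can be reproduced in a few lines. This is a minimal sketch, assuming the stratum-specific rates and standard population given in the table; the function name is an illustrative choice.

```python
# Standard population by age stratum (Table 3.3, column 2).
standard_pop = [500, 2500, 3000, 3000, 2500, 500]

# Age-specific death rates per 1000 per year (Table 3.3, columns 3 and 5).
millville_rates = [4, 4, 6, 10, 40, 150]
sunnyvale_rates = [2.5, 3.3, 5, 9, 35, 125]


def direct_standardization(rates_per_1000, standard_population):
    """Apply a group's stratum-specific rates to the standard population.

    Returns the expected deaths per stratum, their total, and the
    directly standardized rate per 1000.
    """
    expected = [n * r / 1000
                for n, r in zip(standard_population, rates_per_1000)]
    total_expected = sum(expected)
    std_rate = 1000 * total_expected / sum(standard_population)
    return expected, total_expected, std_rate


exp_m, total_m, std_m = direct_standardization(millville_rates, standard_pop)
exp_s, total_s, std_s = direct_standardization(sunnyvale_rates, standard_pop)

print(f"Millville: {total_m:.1f} expected deaths, {std_m:.1f} per 1000")
print(f"Sunnyvale: {total_s:.1f} expected deaths, {std_s:.1f} per 1000")
```

Because both communities are weighted by the same age structure, the remaining difference in rates cannot be attributed to age.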
Note that the choice of "standard" population is arbitrary. Generally speaking,
the standard population should reflect the age distribution of the population to
which one wishes to generalize the results. In the above example, the standardized
rates tell us the deaths we could expect if the standard population of 12000 lived in
Millville or Sunnyvale. A better standard might have been the entire U.S. population, as based on the most recent census. When no standard is available, the groups
being compared are often combined into a single "standard." If the groups are of
markedly unequal size, however, the larger group will have an undue influence on
the adjusted overall rates.
Indirect standardization is used in one or more of the following circumstances:
1. When small numbers lead to potentially unstable (i.e., poorly reproducible) stratum-specific rates.
2. When the stratum structure (number of individuals in each stratum) of the standard population is unknown.
3. When the overall death rates and stratum structures are known for the compared
groups but their stratum-specific rates are unknown.
                 Millville                            Sunnyvale
Age stratum  Population  Expected  Observed   Population  Expected  Observed
                         deaths    deaths                 deaths    deaths
0-14            500         1.5       2           400        1.2       1
15-29          2000         8         8           300        1.2       1
30-44          2000        10        12          1000        5         5
45-59          1000        10        10          2000       20        18
60-74           500        19        20          2000       76        70
≥75             100        14        15           400       56        50
Total          6100        62.5      67          6100      159.4     145

$$\text{Standardized mortality ratio (SMR)} = \frac{\text{observed deaths}}{\text{expected deaths}}$$

For Millville, $\text{SMR} = 67 / 62.5 = 1.072$
For Sunnyvale, $\text{SMR} = 145 / 159.4 = 0.910$

$$\text{Indirectly standardized death rate} = \text{SMR} \times \text{death rate in standard population}$$

For Millville, standardized rate $= 1.072 \times 18.5 = 19.8$ per 1000
For Sunnyvale, standardized rate $= 0.910 \times 18.5 = 16.8$ per 1000

a Note that the overall death rate in the standard population must be known. It cannot be derived
from the stratum-specific death rates without one also knowing the population in each stratum.
not required in making this calculation. Only the total is necessary.] Finally, each
community's SMR is multiplied by the overall death rate in the standard population
to obtain the indirectly standardized death rate. Once again, we see that the standardized rate is lower in Sunnyvale (16.8 vs 19.8 per 1000) despite its higher
observed overall crude (unstandardized) rate.
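
The indirect method can be sketched as follows. One assumption should be stated plainly: the standard population's stratum-specific rates do not appear in the text as reproduced here, so the values below are back-calculated from the tabulated expected deaths and populations (for example, 1.5 expected deaths among 500 persons implies 3 per 1000). The overall standard rate of 18.5 per 1000 is taken from the text; the function name is illustrative.

```python
# Assumption: stratum-specific rates per 1000 in the standard population,
# back-calculated as expected deaths / population from the table.
standard_rates = [3, 4, 5, 10, 38, 140]
standard_overall_rate = 18.5  # per 1000, as given in the text

millville_pop = [500, 2000, 2000, 1000, 500, 100]
sunnyvale_pop = [400, 300, 1000, 2000, 2000, 400]
millville_observed = 67   # total observed deaths
sunnyvale_observed = 145


def indirect_standardization(population, observed_deaths,
                             std_rates_per_1000, std_overall_rate):
    """Apply the standard's stratum rates to the study group's age structure.

    Returns total expected deaths, the SMR, and the indirectly
    standardized rate per 1000.
    """
    expected = sum(n * r / 1000
                   for n, r in zip(population, std_rates_per_1000))
    smr = observed_deaths / expected
    return expected, smr, smr * std_overall_rate


exp_m, smr_m, rate_m = indirect_standardization(
    millville_pop, millville_observed, standard_rates, standard_overall_rate)
exp_s, smr_s, rate_s = indirect_standardization(
    sunnyvale_pop, sunnyvale_observed, standard_rates, standard_overall_rate)

print(f"Millville: expected {exp_m:.1f}, SMR {smr_m:.3f}, rate {rate_m:.1f}")
print(f"Sunnyvale: expected {exp_s:.1f}, SMR {smr_s:.3f}, rate {rate_s:.1f}")
```

Only the total number of observed deaths in each community enters the calculation; the stratum-specific observed deaths are not needed.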
In a way, the two methods of standardization are mirror images of one another.
With the direct method, we calculate the number of "expected" deaths in the standard population based on its age distribution and the study group's stratum-specific
death rates. With the indirect method, we calculate the number of "expected" deaths
in the study group based on its age distribution and the standard population's stratum-specific death rates.
Since age affects many of the attributes and events of interest to epidemiologists,
it is often a confounding factor when rates are compared. Depending on the attributes or events measured and the characteristics of the groups compared, other variables may have an equal or greater potential for confounding. For example, a comparison of death rates from lung cancer in asbestos vs coal miners should standardize by cigarette smoking status unless there is good reason to believe that the two
groups of miners have similar smoking habits. The choice of variables by which to
standardize thus depends on existing biologic and clinical knowledge (e. g., of the
relationship between cigarette smoking and lung cancer), and on how well those
variables can be measured.
References
1. Kleinbaum DG, Kupper LL, Morgenstern H (1980) Epidemiologic research: principles and
quantitative methods. Lifetime Learning Publications, Belmont, CA, pp 96-116
2. Miettinen O (1976) Estimability and estimation in case-referent studies. Am J Epidemiol 103:
226-235
3. Wilson AL, Fenton LJ, Munson DP (1986) State reporting of live births of newborns weighing
less than 500 grams: impact on neonatal mortality rates. Pediatrics 78: 850-854
agent or surgical procedure. Finally, it may be a change (for the better or worse) in
quality of life associated with a certain treatment.
Exposure and outcome each can be measured on a continuous or categorical
scale. The quantitative expression of the exposure-outcome association will depend
on whether both are continuous, one categorical and the other continuous, or both
categorical. It will further depend on whether categorical exposures or outcomes are
dichotomous or polychotomous.
Regardless of the measurement scale, the study of an association between exposure and outcome depends on the presence of variation in both factors. If all study
subjects have the same exposure, for example, measurement of their outcomes
becomes a descriptive rather than an analytic study. A comparison of outcomes in
two or more groups with different exposures would be considered analytic, since the
association between exposure (group membership) and outcome can be assessed.
Finally, multiple exposures and outcomes can be investigated within the context
of a single research study. For example, an investigator who wishes to study the
therapeutic efficacy of a new cancer chemotherapeutic agent may be interested in
studying the effect of that agent on survival, tumor size, relief of pain, and quality of
life. All of these would be important outcome measures. Similarly, in studying possible causes (sometimes called risk factors) of a community outbreak of diarrhea,
numerous food and water exposures might be investigated. Although efficient in
practice and clinically sensible, such multiple testing for exposure-outcome associations creates certain problems for statistical inference, as we shall see in Chapter 12.
them forward in time (i. e., cohort directionality) to the development of the outcome. Similarly, in a case-control study, subjects can be selected without regard to
exposure or outcome, classified by outcome, and then questioned about prior exposure to an agent or treatment of interest. These would not be statistically efficient
strategies, however, when exposure (in cohort studies) or outcome (in case-control
studies) is rare. Although cross-sectional studies often use this type of sample selection, their statistical efficiency can often be improved by selecting a sample either by
exposure or outcome, depending on which is rarer in the target population.
Regardless of which criteria are used to select study subjects, those subjects
should be representative of their counterparts in the target population. In other
words, when a sample is selected by exposure, sample subjects should be representative of those members of the target population having the studied levels of exposure.
When selected by outcome, sample cases and controls should be representative of
cases and controls in the target population. Finally, when "other" criteria are used,
sample subjects should be representative of the overall target population. If study
subjects are truly representative of their counterparts in the target population, then
the results of the study can be safely extended to that population. If the sample is not
known to be representative, the main concerns are the potential for sample distortion bias (discussed in Chapter 5) and uncertainty as to whom the study results may
be applied.
One way of ensuring representativeness is by random sampling. In simple random
sampling, each member of the target population has an equal probability of being
selected for the study, and that probability depends only on chance, i. e., a random
event, and not on either the investigator or the subject. The usual way this is
achieved is by obtaining a list of persons in the target population and then using a
table (or computer-generated list) of random numbers to assign a number to each
person.
An example of a random number table is contained in Appendix Table A.1. The
table can be entered at any point, e. g., at the beginning or by pointing while "blindfolded." The investigator then continues through the table, either down the columns
or across the rows, assigning successive numbers to the next person on the list. The
"rules" for sample selection should be established beforehand and are based on the
size of the desired sample. If a 50% simple random sample (of the target population)
is needed, odd- or even-numbered persons could be selected. For a 25% sample,
those persons whose number is evenly divisible by 4 could be chosen. A similar
procedure can be used for any fixed fraction (e.g., 1/10, 1/30, 1/100). If a specific
number of sample subjects is desired, e.g., 137, then the subjects with the 137 highest (or lowest) numbers can be chosen. The main requirement is that the method of
selection is decided before entering the table, so that neither the subject nor the
investigator can exert any influence on the choice.
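
With a computer-generated list of random numbers, the procedure described above might be sketched as follows. This is a minimal illustration, not a prescribed method: the population list, the seed, and the function name are hypothetical, and the selection rule (keep the persons with the lowest numbers) is fixed before any numbers are drawn.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Assign a random number to each person and keep the n lowest.

    Every person has the same chance of selection, and neither the
    investigator nor the subject influences the choice.
    """
    rng = random.Random(seed)
    numbered = [(rng.random(), person) for person in population]
    numbered.sort(key=lambda pair: pair[0])
    return [person for _, person in numbered[:n]]


# Hypothetical list of 1000 persons in the target population.
target_population = [f"person_{i:04d}" for i in range(1, 1001)]

# Rule fixed beforehand: the 137 persons with the lowest random numbers.
sample = simple_random_sample(target_population, 137, seed=42)
print(len(sample), sample[:3])
```

Python's standard library offers `random.sample(population, k)`, which draws a sample of fixed size without replacement in a single call; the longer version above is written out only to mirror the table-based procedure.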
In stratified random sampling, individuals from certain clinical or sociodemographic subgroups (strata) are selected more frequently. This strategy is often used
to ensure that the study sample is representative of the target population with
respect to subgroup (stratum) membership, e.g., race, sex, marital status. It is also
essential in examining results separately in those subgroups that, owing to their small
size, may require oversampling to provide more stable (reproducible) estimates.
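
Oversampling of a small stratum can be sketched in the same way. The population, stratum labels, and sampling fractions below are hypothetical and chosen only to illustrate the idea.

```python
import random
from collections import defaultdict


def stratified_random_sample(people, stratum_of, fractions, seed=None):
    """Draw a simple random sample within each stratum.

    `fractions` maps each stratum to its sampling fraction, so a small
    stratum can be sampled more heavily than a large one.
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for person in people:
        by_stratum[stratum_of(person)].append(person)
    return {
        stratum: rng.sample(members, round(fractions[stratum] * len(members)))
        for stratum, members in by_stratum.items()
    }


# Hypothetical target population: 900 persons in a large stratum, 100 in a small one.
people = [(i, "large") for i in range(900)] + [(i, "small") for i in range(900, 1000)]

# Hypothetical rule: sample 10% of the large stratum but 50% of the small one.
sample = stratified_random_sample(
    people, stratum_of=lambda p: p[1],
    fractions={"large": 0.10, "small": 0.50}, seed=1)

print({stratum: len(chosen) for stratum, chosen in sample.items()})
```

Because the sampling fractions differ, any estimate for the whole target population would need to weight each stratum by its true share of that population.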
Finally, clustered random sampling involves the random selection of natural
Table 4.1. Classification of epidemiologic research designs defined by directionality and sample selection

                     Sample selection
Directionality     Exposure   Outcome   Other
Cohort             A          -         B
Case-control       -          C         D
Cross-sectional    E          F         G
The following examples constitute seven different ways of examining the same basic
research question:
Does occupational exposure to asbestos increase the risk of subsequent lung cancer?
Each example listed in Table 4.1 (as indicated by its letter A to G) illustrates one of
the seven basic designs defined by directionality and sample selection. Each of these
seven basic combinations can incorporate historical, concurrent, or mixed timing,
thus yielding a total of 21 different research designs. These are illustrated in Fig. 4.1,
which represents a 3 × 3 × 3 cube with two "tunnels" indicating the six impossible
combinations (sample selection cannot be by outcome in cohort studies or by exposure in case-control studies). The figure also indicates the seven basic designs illustrated in the examples.
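
The count of designs can be verified by enumeration. The sketch below simply lists the combinations of the three axes and removes the two impossible pairings named in the text; the labels are illustrative strings.

```python
from itertools import product

directionality = ["cohort", "case-control", "cross-sectional"]
sample_selection = ["exposure", "outcome", "other"]
timing = ["historical", "concurrent", "mixed"]

# Sample selection cannot be by outcome in cohort studies
# or by exposure in case-control studies.
impossible = {("cohort", "outcome"), ("case-control", "exposure")}

basic_designs = [(d, s) for d, s in product(directionality, sample_selection)
                 if (d, s) not in impossible]
all_designs = [(d, s, t) for (d, s) in basic_designs for t in timing]

cube_cells = len(directionality) * len(sample_selection) * len(timing)

print(len(basic_designs), "basic designs")
print(len(all_designs), "designs once timing is included")
print(cube_cells - len(all_designs), "impossible cells in the cube")
```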
Fig. 4.1. The research design cube. The two tunnels represent the six impossible combinations of
directionality and timing. Capital letters refer to the seven basic designs listed in Table 4.1 and illustrated in the text
A. Cohort study with sample selection by exposure. In this type of study, a group of
workers who were exposed to asbestos over 30-40 years might be followed up for
development of lung cancer and compared with a group of workers who were not
exposed to asbestos.
B. Cohort study with "other" sample selection. This study would be similar to A,
except that instead of sampling exposed and nonexposed workers, we might select
all workers in a given plant, determine their cumulative exposure, and then follow
them all up for subsequent development of lung cancer.
C. Case-control study with sample selection by outcome. Workers who have developed
lung cancer are compared with a group of those who have not for a history of prior
exposure to asbestos.
D. Case-control with "other" sample selection. This is similar to C, except that instead
of choosing groups of cases and controls, all workers in a given plant are selected
for study. Lung cancer status is determined, and workers with and without disease
are compared for their history of prior asbestos exposure. This design would be
inefficient relative to C, since very few workers would be expected to have lung cancer at any given point in time.
E. Cross-sectional study with sample selection by exposure. Workers with and without
exposure to asbestos are compared for the simultaneous presence or absence of lung
cancer. This design would share the same inefficiency as D but would have the additional cart-vs-horse causality inference problem of not knowing whether the exposure occurred at a biologically relevant time in the past (cf. "latent period").
G. Cross-sectional study with "other" sample selection. All workers in a given plant are
classified simultaneously by asbestos exposure and lung cancer status. This design
would share the same inefficiency as D and E and the same causality inference problem as E and F.
4.3.6 "Prospective" and "Retrospective" Studies
"Prospective" and "retrospective" are two of the most familiar, and most confusing,
terms used in describing epidemiologic research designs. These terms have been
applied to all three of the methodologic aspects (axes) discussed above. For example,
"prospective" has been interpreted by various authors as indicating forward directionality, sample selection by exposure, or concurrent timing. Conversely, "retrospective" has been used to indicate backward directionality, sample selection by outcome, or historical timing. Some authors have even gone so far as to use a combined
nomenclature, leading to such semantic difficulties as "historical prospective"
studies. Furthermore, these two terms have also been used to indicate an important
aspect of the statistical analysis, namely, whether the hypotheses tested were enunciated prior to data analysis ("prospective") or were "generated" by the data analyzed, i. e., arose post hoc ("retrospective"). Thus, as shown in Table 4.2, there
appear to be at least four current usages of these two terms. To prevent what would
otherwise be inevitable confusion, I will avoid the terms entirely.

                     "Prospective"   "Retrospective"
Directionality       forward         backward
Sample selection     by exposure     by outcome
Timing               concurrent      historical
Hypothesis testing   a priori        post hoc
Before discussing these designs in further detail, we need to consider a methodologic issue relevant to all epidemiologic research, regardless of design: analytic bias.
The types, sources, and control of analytic bias are the focus of the following chapter.
Reference
1. Kramer MS, Boivin J-F (1987) Toward an "unconfounded" classification of epidemiologic
research design. J Chronic Dis 40: 683-688
Analytic Bias
ple is unrepresentative of the target population with respect to the joint distribution of exposure
and outcome.
2. Information bias: estimate of exposure-outcome association is biased as a result of error in measurement of exposure or outcome.
3. Confounding bias: estimate of exposure-outcome association is biased by one or more variables
associated both with exposure and, independently of exposure, with outcome.
4. Reverse causality bias: estimate of exposure-outcome association is unbiased in magnitude but
biased in the inferred direction of causality, because the study outcome actually preceded and
caused the exposure.
internal validity is generally accepted. The main controversy concerns its external
validity. Are the results also applicable to nonveterans? To women? To younger or
older patients? To patients with lower or higher diastolic blood pressures? To those
who already have complications? To those who are less compliant with treatment?
These questions are difficult or impossible to answer from the study, and subsequent
studies of antihypertensive therapy have been required to provide such answers.
External validity, although useful conceptually, is often difficult to evaluate,
since the degree to which results valid in one population can be generalized to
another depends on clinical judgment and other factors beyond the realm of
research design or statistical analysis. Therefore, the remainder of this chapter will
concern internal validity and, in particular, sources of analytic bias and strategies to
reduce it. Proper statistical inference, the second requirement for internal validity,
will be the focus of Chapters 12-15.
The sources of analytic bias can be classified into four broad categories: (a) sample distortion bias, (b) information bias, (c) confounding bias, and (d) reverse causality
("cart-vs-horse") bias. They are summarized in Table 5.1 and will be discussed in
turn.
Many authors use the term selection bias to refer to what I have called sample distortion bias [2].
But "selection bias" has also been used by some epidemiologists to indicate the confounding effect
that can occur when study subjects (or their families, physicians, or other proxies) select their own
exposure. To avoid confusion, I will use the term "exposure selection bias" to refer to this type of
confounding (see Section 5.5).
[Figure: stages between the target population and the study sample. Labels: target population; persons in investigator's region; referral; identification; contact; participation; (follow-up).]
tigator may be unable to identify all patients referred to his center, he may fail to
contact some of those he can identify, and a sizeable number of those he does contact may not agree to participate.
It is important to point out, however, that nonrepresentativeness does not necessarily lead to bias. In particular, the association between exposure and outcome will
be biased only when the sample distortion is differential with respect to exposure and
outcome. If sample selection is by outcome, for example, bias will be introduced
only if, for a given outcome status, the exposures in subjects who are studied are different from those who are not. Suppose that we wish to use a case-control design to
investigate the association between cigarette smoking and lung cancer. Even if our
cases (lung cancer patients) and controls are not representative of all cases and controls in the target population, no bias will occur in the estimate of the smoking-lung
cancer association unless the cases (or controls) in the sample are either more or less
likely to have a history of cigarette smoking than those in the target population.
In cohort studies, sample distortion can occur owing to geographic maldistribution; referral patterns; selective identification or contact of potential subjects; selective participation (response); or death, withdrawal from the study, or other loss to
follow-up (see Fig. 5.2). If subjects who die, withdraw, move away, or refuse to
respond before the outcome is determined are different with respect to their exposure-outcome relationship than those remaining in the study, bias will be introduced.
For example, in a cohort study of the relationship between radiation exposure and
subsequent leukemia, if many subjects with heavy exposure move away from the
study site and are particularly likely to develop leukemia, the true magnitude of the
association will be underestimated.
Case-control and cross-sectional studies share the same sources of sample distortion bias as cohort studies. The absence of follow-up in case-control and cross-sectional designs, however, means that death, withdrawal, moving away, and other
losses are "hidden," in the sense that they have already occurred by the time the
study samples are selected. Since the outcome status is already determined at the
time the study is begun, any distortion, and, hence, bias arising therefrom, has
already occurred.
A specific type of sample distortion bias can occur in case-control or cross-sectional studies carried out in a referral setting. Consider, for example, the association
between two factors (usually two diseases, one of which can be thought of as the
"exposure" and the other, the "outcome," i. e., one disease that is hypothesized to
cause, or predispose to, the other), each of which is subject to referral. The coincidence of (positive association between) the two factors will then be falsely elevated.
Persons with both factors have a higher probability of being referred into the study
center than those with either factor alone, since they have two "chances" of referral
instead of just one. For example, if 50% of patients with disease A and 50% of patients with disease B are referred, those with A or B alone will each have a 50%
chance of referral, whereas those with both will have a 75% chance (50% referred
for disease A and 50% of the remainder for disease B). Thus, among referred patients selected into a study sample, the proportion with both diseases will be higher
than the proportion in the community. This problem was originally described by
Berkson in case-control studies carried out in a hospital setting, and the resulting
bias is often referred to as Berkson's bias [3, 4].
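
The referral arithmetic can be made concrete with a small calculation. The 50% referral probabilities come from the example above; the community counts are hypothetical and are chosen so that the two diseases are independent in the community (each has 10% prevalence among 10000 persons, so 1% have both).

```python
def referral_probability(*probabilities):
    """Chance of referral through at least one of several independent routes."""
    not_referred = 1.0
    for p in probabilities:
        not_referred *= (1 - p)
    return 1 - not_referred


p_a = p_b = 0.5
p_both = referral_probability(p_a, p_b)

# Hypothetical community counts (diseases A and B independent).
community = {"A only": 900, "B only": 900, "both": 100}

# Expected numbers reaching the referral center.
referred = {
    "A only": community["A only"] * p_a,
    "B only": community["B only"] * p_b,
    "both": community["both"] * p_both,
}


def share_with_b_among_a(counts):
    """Proportion of persons with disease A who also have disease B."""
    return counts["both"] / (counts["both"] + counts["A only"])


print(f"Referral probability with both diseases: {p_both:.2f}")
print(f"B among A, community: {share_with_b_among_a(community):.3f}")
print(f"B among A, referred:  {share_with_b_among_a(referred):.3f}")
```

Although the diseases are unrelated in this hypothetical community, disease B appears more common among referred patients with disease A than it is among all persons with disease A.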
What can be done about sample distortion bias? Once the study is completed
and the data are obtained, the problem may be beyond repair. The exposure and
outcome status are unknown, of course, for members of the target population who
were not selected or were lost to follow-up. Otherwise they would have been
included. Unless at least the relative probabilities of initial inclusion and (for cohort
studies) subsequent loss are known for all combinations of exposure and outcome,
no adjustment can be made. Since specific data from previous studies concerning
inclusion and loss as a function of exposure and outcome are not generally available,
the best the investigator can do is to estimate the magnitude of the potential bias and
mitigate his inferences accordingly.
Sometimes, even the maximum possible bias would not affect a study's overall
result. In the previously cited Veterans Administration trial of antihypertensive therapy [1], the investigators ruled out possible sample distortion bias due to study dropouts by assuming a "worst case" scenario (complications in all dropouts from the
active treatment group but in none of the dropouts from the control group). Even if
this unlikely possibility had occurred, the results would still have favored the active
treatment group.
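
A "worst case" analysis of this kind is easy to sketch. The counts below are hypothetical and are not those of the Veterans Administration trial; they serve only to show the logic of assigning the least favorable outcome to dropouts from the active treatment group and the most favorable outcome to dropouts from the control group.

```python
# Hypothetical counts, invented for illustration only.
active = {"enrolled": 100, "completed": 90, "events": 5}
control = {"enrolled": 100, "completed": 88, "events": 30}


def observed_rate(group):
    """Complication rate among subjects who completed follow-up."""
    return group["events"] / group["completed"]


def worst_case_for_active(active_group, control_group):
    """Assume every active-group dropout had a complication and no control dropout did."""
    active_dropouts = active_group["enrolled"] - active_group["completed"]
    active_rate = (active_group["events"] + active_dropouts) / active_group["enrolled"]
    control_rate = control_group["events"] / control_group["enrolled"]
    return active_rate, control_rate


wc_active, wc_control = worst_case_for_active(active, control)

print(f"Observed:   active {observed_rate(active):.3f}, control {observed_rate(control):.3f}")
print(f"Worst case: active {wc_active:.3f}, control {wc_control:.3f}")
```

If the active treatment still shows the lower rate under this assumption, losses to follow-up cannot by themselves explain the result.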
As elsewhere in medicine, however, prevention is preferable to cure. The best
way to avoid bias due to sample distortion is to strive for random (or otherwise representative) sampling when study groups are assembled and, in cohort studies, to
minimize losses due to dropouts, nonresponse, or incomplete follow-up.
larly likely to occur when study subjects and/or observers are aware of (i. e., are not
"blind" to) the research hypothesis and the subjects' exposure or outcome status.
Differentially biased measurement can also occur whenever outcome detection
procedures vary with exposure. This bias (which is also called detection bias) may
result from more frequent or thorough surveillance (in cohort studies) or from the
more frequent use of diagnostic tests and is particularly likely to occur when the
outcome can occur "silently," i. e., when detailed examination or special tests may be
required to detect it. In a cohort study comparing rates of subsequent breast cancer
in users and nonusers of oral contraceptives, for example, more frequent physical
examinations or roentgenography (mammograms) might occur in users, who
require regular contact with their gynecologists in order to renew their oral contraceptive prescriptions. Since many early breast cancers can be detected only by careful examination or roentgenography, this source of information bias could create a
false association between contraceptive use and breast cancer.
As with sample distortion bias, little can be done about information bias once the
study data are collected. Unless the direction and magnitude of the measurement
errors are known for different exposure-outcome combinations (e. g., differences in
surveillance and detection), the best the investigator can do is estimate the effect of
the potential bias and moderate his inferences accordingly. Partial control for detection bias can sometimes be achieved, however, by stratification or multivariate statistical control for the frequency or intensity of surveillance or diagnostic testing.
On the other hand, much can be done in the design and execution stages to minimize information bias. Procedures for surveillance (in cohort studies) and detection
of the outcome should be standardized, established before the study begins, and
maintained until completion. Measurements of exposure and outcome should be
carried out by trained observers using pretested methods, so that the reproducibility
and validity of the measurements are maximized. Study subjects and observers
should, whenever scientifically feasible and ethically defensible, be kept "blind" to
the research hypothesis (the hypothesized exposure-outcome association). Observers should also be blind to the subjects' exposure (for cohort studies) or outcome (for case-control studies) status. In summary, information bias can be minimized by
standardizing detection procedures, maintaining a high quality of individual measurements, and adequately blinding subjects and observers.
Confounding Bias
2. Study subjects or their families, physicians, or other proxies select their own exposure, and the motive or reason for selection is associated with outcome (exposure
selection bias). Exposure selection bias is, in fact, a special case of susceptibility bias.
For example, an observational cohort study comparing maternal attachment behavior in breast-feeding and bottle-feeding mothers is likely to be confounded if mothers who select breast-feeding differ in important psychological ways that influence
their behavior toward their infants.
Observational studies of treatment effects are highly prone to this type of bias,
because the clinical indications for certain treatments may be strongly related to the
outcome, independent of treatment. Miettinen refers to this as confounding by indication [5]. As an example, many new and experimental, but toxic, cancer chemotherapeutic agents are given only to patients with advanced disease who have been
resistant to all conventional treatments. An observational study would likely reveal
that patients treated with such a drug were more likely to die. The result is confounded, however, by the selection of patients who were likely to die anyway as
treatment subjects.
3. The exposure is accompanied by other agents or maneuvers that can affect ("contaminate") the outcome (accompaniment bias or contamination bias). This source of
confounding is particularly likely to occur in studies of treatment. Patients receiving
a promising new treatment may receive more attention, better nursing care, and better general supportive therapy than those receiving the "standard" treatment, and
these accompaniments may be responsible for a more favorable outcome in the former group.
5.5.3 Control for Confounding Bias
Confounding can be controlled in either the research design or data analysis phase.
In the design, the best way of eliminating susceptibility and exposure selection
biases, where feasible, is to use an experimental design (i. e., clinical trial) and to randomly assign exposure to study subjects. Although, as will be discussed further in
Chapter 7, randomization does not guarantee that these sources of confounding will
not occur, it renders the possibility far less likely.
When a randomized clinical trial is infeasible, susceptibility and (to some extent)
exposure selection bias can be reduced by restriction of the study sample according
to certain characteristics (e.g., exclusion of men or nonsmokers) or by matching
members of the compared groups according to the potentially confounding variables. Matching can be accomplished either within study groups (i. e., equalize the
average value or distribution of the confounders within each group) or with individuals (each subject in one group is matched to one or more subjects in the comparison group). Both approaches result in bias reduction. The choice between the two
will depend on the type of confounding variable, i. e., continuous vs categorical, and
if categorical, on the number of categories.
Even if no control for susceptibility or exposure selection biases is incorporated
in the study design, the investigator should attempt to measure factors that could
potentially confound the exposure-outcome association. These factors can then be
controlled for later in the analysis. Furthermore, to control for contamination bias in
concurrent cohort studies, study subjects and their care-givers should be blind,
where feasible, to both the study hypothesis and the subjects' exposure status.
Control for confounding in the data analysis stage can be accomplished in
several, nonmutually exclusive ways:
1. Restriction in the analysis is, of course, automatic if restriction was incorporated
in the design. If not part of the design, a restricted analysis (e. g., analyzing only the
results in women or nonsmokers, rather than in all study subjects) will "waste" data
in subjects not meeting the restriction criteria. This not only reduces the sample size
and, therefore, the reproducibility of the resulting estimate of the exposure-outcome
association, but also distorts the original study sample so that it no longer represents
the original target population.
2. Matching should be used in the analysis if it was used in the design; otherwise, the
resulting estimate of the exposure-disease association will be less reproducible.
Matching in the analysis in a study without a matched design carries the same
"waste" and sample size penalties as mentioned for restriction.
3. Stratification and standardization accomplish the same goal as restriction but do
not waste data, because all study subjects can be included in one or another stratum.
These methods were illustrated in Chapter 3 in comparing death rates in Millville
and Sunnyvale. Community of residence is the "exposure" variable here, and death
is the outcome. The overall death rates favor Millville but are confounded by age
(the population of Sunnyvale being much older). The age stratum-specific and overall adjusted rates control for confounding bias and demonstrate that Sunnyvale, in
fact, has a lower death rate for similarly aged persons. Stratification and standardization will be illustrated further in Chapters 6 and 8.
One of the disadvantages of these methods is that simultaneous control of
several confounders requires a separate stratum for the combination of each confounder with every other. This not only becomes computationally unwieldy, but also
results in small (and thus poorly reproducible) numbers in each stratum.
4. Multivariate statistical techniques can be used to control simultaneously for several
confounding variables. Modern computer technology and readily available statistical
software packages enable calculations of an unbiased estimate of exposure-outcome
association after adjustment for the association of each confounder with exposure, outcome, and other confounders. Although multivariate statistical techniques are largely beyond the scope of this text, they will be discussed briefly in Chapters 13-15.
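
The disadvantage noted for stratification, namely that the number of strata multiplies with each added confounder, can be seen with a short calculation. The confounders and their numbers of categories below are hypothetical, apart from the six age strata and the community size of 6100 used in Chapter 3.

```python
from math import prod

# Hypothetical confounders and the number of categories of each.
confounder_levels = {"age stratum": 6, "sex": 2, "smoking status": 3}

# One stratum is needed for every combination of categories.
n_strata = prod(confounder_levels.values())

# With a community of 6100 persons, the average stratum is small.
average_per_stratum = 6100 / n_strata

print(f"{n_strata} strata, about {average_per_stratum:.0f} persons per stratum")
```

Each further confounder multiplies the number of strata again, so the numbers within strata shrink quickly and the stratum-specific rates become poorly reproducible.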
Concluding Remarks
References
1. Veterans Administration Cooperative Study Group on Antihypertensive Agents (1967) Effects of
treatment on morbidity in hypertension: results in patients with diastolic blood pressures averaging 115 through 129 mmHg. JAMA 202: 186-192
2. Kleinbaum DG, Morgenstern H, Kupper LL (1981) Selection bias in epidemiologic studies. Am
J Epidemiol 113: 452-463
3. Berkson J (1946) Limitations of the application of fourfold table analysis to hospital data. Biometr
Bull 2: 47-53
4. Walter SD (1980) Berkson's bias and its control in epidemiologic studies. J Chronic Dis 33:
721-725
5. Miettinen OS (1983) The need for randomization in the study of intended effects. Stat Med 2:
267-271
Analytic cohort studies can be either experimental (exposure assigned by the investigator) or observational (exposure arises naturally, is selected by the subject, or is
prescribed by the subject's clinician). Experimental cohort studies, which are usually
called clinical trials, have achieved such widespread importance¹ that we will delay
our discussion of them until Chapter 7. This chapter will be limited to a consideration of observational cohort studies.
6.1.2 Sample Selection (Assembling the Cohort)
As discussed in Chapter 4, two methods are available for sample selection in cohort
studies: by exposure status or by "other" criteria. If by exposure, the study sample
should be representative of exposure groups in the target population. If by "other"
criteria, the sample should be representative of the entire target population. A
nonrepresentative sample makes it difficult to judge the population of individuals to
whom the study's results apply. Of the two choices, sample selection by exposure is
usually preferable when exposure is rare in the target population, in order to have
enough exposed subjects in the sample to provide statistically meaningful results. For
example, studying the carcinogenic effects of exposure to an unusual environmental
toxin can best be achieved by finding as many exposed subjects as feasible, rather
than by choosing a "convenience" or random sample of the entire target population.
Sample selection by exposure implies the use of a discrete number of exposure
categories, i. e., a dichotomous or ordinal measure of exposure. When the exposure
¹ In fact, when the term "cohort study" is otherwise unspecified, it usually refers to an observational, rather than an experimental, design. In accordance with this practice, we shall, in the remainder of this text, use "cohort study" for the observational design, and "clinical trial" for the experimental one.
The baseline state of a cohort consists of the characteristics of its members before
exposure. It includes geographic, sociodemographic (age, sex, race, marital status,
socioeconomic status), and clinical attributes. Knowledge of the true baseline state is
possible only when the cohort is assembled before exposure begins. In practice,
however, cohorts are often assembled when exposure has already occurred, or at
least begun. The investigator must then endeavor to define the baseline state that
existed before onset of exposure to ensure that the study subjects' attributes are not
in themselves caused by exposure. This poses no problem, of course, for "permanent" attributes like race and sex, but may be quite difficult for various health and
disease states (other than the outcome under study) that might be affected by exposure.
The main importance of an adequate description of the baseline state is the control for confounding bias, particularly the bias (susceptibility bias) that can occur in
estimating the exposure-outcome association when the underlying susceptibility for
developing the outcome is associated with exposure, e. g., is different in exposed vs
nonexposed subjects. As explained in Chapter 5, this source of confounding can be
controlled in either the design or analysis. Failure to describe the baseline state adequately and to take the necessary design or analytic precautions can lead to analytic
bias and an internally invalid inference regarding the exposure-outcome association.
6.1.4 Exposure
ity inferences can often be strengthened by finding a graded response (a dose-response effect) according to exposure. This can be achieved by using three or more
ordinal categories of exposure and demonstrating a monotonically increasing or
decreasing outcome response with higher categories of exposure. It can also be
achieved with continuous exposure measures by demonstrating an important positive or negative association with outcome.
The exposure may be brief and occur only once (so-called point exposure), e.g.,
the atomic bomb explosions in Hiroshima and Nagasaki. It may consist of repeated
episodes of brief exposure (recurrent exposure), such as habitual drunken driving. Or
it may be continuous (chronic exposure), such as an infant's exposure to toxic lead
paint in the home or the daily administration of estrogen to a postmenopausal
woman.
The potency of the exposure must be clinically appropriate for the exposure-outcome association under investigation [1]. Studying the effect of an exposure that is
too weak may result in a negative result (i. e., no exposure-outcome association) that
may not represent the true biological consequences of degrees of exposure that
commonly occur in the "real world." Conversely, the dramatic effect of an exposure
that is too potent may have little clinical relevance to real-world consequences of
commonly occurring exposure.
Potency includes both the quantity and quality of the exposure. Quantity refers
to how much of the agent, maneuver, or treatment is received, i. e., the dosage and
duration of exposure. Drugs that are administered in doses that are too low or too
high may yield results of little relevance for clinical practice, as may those administered for too brief or too long a period of time. The quality of the exposure refers to
how well, i. e., with what degree of skill, it is administered. An inexpert surgeon,
psychotherapist, or physiotherapist may produce bad results even if the procedure
performed is potentially efficacious.
The timing of exposure is also important. Intrauterine rubella infection leads to
severe congenital malformations when it occurs early in the first trimester of pregnancy but has few if any fetal consequences when it occurs later in gestation.
When study subjects (or their clinicians) select their exposure, the opportunity
arises for confounding due to exposure selection bias. This is particularly likely to
occur when the exposure is a clinical treatment, and the reason for selecting the
treatment is itself associated with the study outcome. Such confounding by indication, which was discussed in Chapter 5, is one of the reasons why clinical trials are
usually preferable to observational cohort studies in investigating clinical treatments.
Another source of confounding associated with exposure is contamination bias.
If other agents or maneuvers with independent effects on the study outcome are also
associated with exposure, the effects of the study exposure will be contaminated by
those of the accompanying exposures. This is particularly likely to occur in studies
of treatment. For example, patients with coronary heart disease who receive a "promising" new drug may do better than those not receiving the drug, not because the
drug is efficacious, but because they are subjected to intensive monitoring that
allows early detection and treatment of cardiac arrhythmias (rhythm disturbances).
6.1.5 Follow-Up
association will be biased (detection bias). This is particularly relevant for outcomes
that are "silent" and require physical examination or special diagnostic tests. Avoidance of detection bias can best be achieved by ensuring that the frequency and
extent of surveillance are standardized in the design protocol and are followed and
maintained irrespective of exposure status.
6.1.6 Outcome
The outcome is the effect that the investigator suspects may be caused by exposure.
As discussed in Chapter 4, it is usually the change (occurrence, disappearance,
improvement, or relief) in some health or disease state. In many cohort studies,
several outcomes are investigated simultaneously. In particular, it is often clinically
important to measure "soft" as well as "hard" outcomes, especially in studying the
effects of treatment. As indicated in Chapter 2, there has been a tendency in much
clinical research to emphasize easily quantifiable outcomes, even if they do not best
reflect the results of treatment. In a study of the effects of a new cancer chemotherapeutic agent, for example, pain and quality of life may be even more important to
patients than duration of survival or the size of the tumor.
Measurement of the outcome in cohort studies provides the greatest opportunity
for information bias. Although random measurement errors (for either exposure or
outcome) will generally reduce the extent of exposure-outcome association, systematic errors in measuring outcome that vary according to exposure can create an
exposure-outcome association in the study sample when none in fact exists in the
target population (see Chapters 2 and 5). This is particularly likely to arise when
observers of a "soft" (subjective) outcome are aware of both the association under
investigation and the subjects' exposure status. Adequate blinding of observers is
thus necessary to avoid this source of information bias. As indicated above, avoiding
the assessment of "soft" outcomes is not a satisfactory solution.
The quantitative expression of the exposure-outcome association is the subject
of the following section. Since the expression and analysis of the results of a cohort
study depend on the measurement scales in which exposure and outcome are
expressed, the discussion will be organized accordingly.
6.2 Analysis of Results
6.2.1 Exposure: Categorical
Outcome: Continuous
The main result of interest in these studies is the mean ($\bar{x}$) of the outcome variable in
each of the exposure groups ($\bar{x}_1$ for exposure group 1, $\bar{x}_2$ for exposure group 2). The
larger the difference in mean outcomes ($\bar{x}_2 - \bar{x}_1$) between exposure groups, the
greater the exposure-outcome association.
Let us take as an example a hypothetical cohort study of the effect of exposure
to asbestos on cardiopulmonary fitness as reflected by the time taken to run a 100-m
race. The study subjects are 100 exposed male asbestos miners and 100 nonexposed
healthy men of similar age. The exposed and nonexposed cohorts had the same
mean times for running the 100-m race before exposure occurred in the miners, and
both groups have been observed for 10 years without loss to follow-up. After
10 years, the mean time taken to run the 100 m in the exposed group is 14.4 sec,
compared with 12.2 sec in the nonexposed (control) group. The difference of 2.2 sec
is the magnitude of the effect of exposure on outcome and expresses the degree of
association of the outcome with the exposure. Another way of expressing the results
is a percent change from the control mean:
$$\text{percent change} = \frac{\bar{x}_2 - \bar{x}_1}{\bar{x}_1} \times 100 = \frac{14.4 - 12.2}{12.2} \times 100 = 18.0\%,$$
i. e., this type and extent of exposure to asbestos made the men 18.0% slower in the
100-m race.
If we had had three exposure groups (nonexposed, lightly exposed, and heavily
exposed), we would have three means ($\bar{x}_1$, $\bar{x}_2$, $\bar{x}_3$) to compare instead of just two.
Each exposure group could then be compared with the control (nonexposed) group
to see whether its mean differed. A monotonically graded dose-response effect
(e.g., 12.2 sec in the nonexposed, 13.5 sec in the lightly exposed, and 15.6 sec in the
heavily exposed) would strengthen the inference of causality between exposure and
outcome (discussed further in Chapter 19).
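The arithmetic of this section can be sketched in a few lines of Python. This is only an illustration using the means quoted above; the function names (mean_difference, percent_change) are ours and are not part of any statistical package.

```python
# Mean 100-m times (seconds) quoted in the text for the two-group example.
EXPOSED_MEAN = 14.4      # asbestos miners after 10 years
CONTROL_MEAN = 12.2      # nonexposed men after 10 years


def mean_difference(exposed_mean, control_mean):
    """Absolute difference in mean outcome between exposure groups."""
    return exposed_mean - control_mean


def percent_change(exposed_mean, control_mean):
    """Difference expressed as a percentage of the control mean."""
    return 100.0 * (exposed_mean - control_mean) / control_mean


two_group_diff = mean_difference(EXPOSED_MEAN, CONTROL_MEAN)
two_group_pct = percent_change(EXPOSED_MEAN, CONTROL_MEAN)

# Three ordered exposure categories from the dose-response example.
graded_means = [12.2, 13.5, 15.6]  # nonexposed, lightly exposed, heavily exposed

# A monotonically graded response means each mean exceeds the previous one.
is_monotonic = all(low < high for low, high in zip(graded_means, graded_means[1:]))

print(round(two_group_diff, 1), round(two_group_pct, 1), is_monotonic)
```

A check of this kind only describes the sample means; whether the differences exceed what chance alone would produce is a separate question, taken up in later chapters.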
6.2.2 Exposure: Continuous
Outcome: Continuous
To illustrate this situation, we will use the same example for outcome as in Section
6.2.1 (time to run 100 m), but the exposure of interest will now be cigarette smoking, expressed on a continuous scale as the number of cigarettes smoked per day.
Although the investigator could choose to categorize the continuous exposure measure (e.g., 0, 1-5, 6-10, ≥ 11 cigarettes/day) and then express and analyze the
results as was shown in Section 6.2.1, a more direct approach involves the use of linear regression and correlation. This approach assesses the extent to which a unit
change (increase or decrease) in exposure (number of cigarettes smoked) is accompanied by a corresponding change (in the same or opposite direction) in outcome
(time to run 100 m). Linear regression and correlation will be discussed in detail in
Chapter 15.
6.2.3 Exposure: Continuous
Outcome: Categorical
Such a situation might occur, for example, if we wished to study the relationship
between cigarette smoking, expressed as number of cigarettes per day, and myocardial infarction (heart attack) expressed dichotomously as present or absent. This
type of study is the inverse of the study described in Section 6.2.1, in which the
mean outcomes were compared between two exposure groups. One approach to the
analysis of results of this type of study would be to compare the mean exposures in
the two outcome groups, those with and without myocardial infarction. Such an
analysis, however, ignores the forward directionality inherent in a cohort study. It
bears a closer resemblance to an analysis that may be used in case-control (backwardly directed) studies. Comparing exposures among subjects experiencing different outcomes is a rather indirect way of telling us what we really want to know, i. e.,
a comparison of outcomes among subjects experiencing different exposures. Thus, it
would be preferable to take advantage of the forward directionality of a cohort
study by categorizing exposure and then analyzing the results in the fashion demonstrated in the following section.
6.2.4 Exposure: Categorical
Outcome: Categorical
Many cohort studies utilize this format, the simplest example of which is a dichotomous exposure and a dichotomous outcome. The general case is illustrated in Table 6.1, and a hypothetical example is shown in Table 6.2. The latter compares the rate of myocardial infarction (heart attack) in 200 smoking and 200 nonsmoking men followed up for 20 years (without losses). Tables such as 6.1 and 6.2, which characterize groups simultaneously by two dichotomous variables, are known as 2 x 2 (two-by-two), or fourfold, tables. The totals to the right are called the row totals, those on the bottom are the column totals, and the total on the bottom right is the grand total (i.e., the total study sample).

Table 6.1. Two-by-two table for analyzing results of a cohort study with dichotomous exposure and outcome

                     Outcome
                     +        0        Total
Exposed              a        b        a+b
Nonexposed           c        d        c+d
Total                a+c      b+d      N=a+b+c+d

+, presence of outcome; 0, absence of outcome

Relative risk (RR) = $\dfrac{a/(a+b)}{c/(c+d)}$   (Eq. 6.1)
Attributable risk (AR) = $\dfrac{a}{a+b} - \dfrac{c}{c+d}$   (Eq. 6.2)

The rate of MI among the smokers is 32/200, or 16%, compared with 15/200,
or 7.5%, among the nonsmoking controls. Note that these rates are, in essence, incidence rates, since they express the occurrence of new events over a specified period
of time. Thus, the incidence of MI among smokers is 16% per 20 years, or 0.8% per
year. If the two cohorts (exposed and nonexposed) are fixed (i.e., no members are
added during the period of follow-up), then the incidence of the outcome is equivalent to an individual member's risk, or probability, of developing the outcome during
the study period. When the cohorts are dynamic (i. e., members are added during
follow-up), the term incidence density (see Chapter 3) is probably preferable,
although "risk" is often used loosely for dynamic cohorts as well.

Table 6.2. Cohort study of 200 smokers and 200 nonsmokers (controls) for occurrence of myocardial infarction (MI)

                     MI       No MI    Total
Smokers              32       168      200
Nonsmokers           15       185      200
Total                47       353      400

Risk of MI in smokers = 32/200 = 16%
Risk of MI in nonsmokers = 15/200 = 7.5%
Relative risk (RR) = (32/200) / (15/200) = 2.13
Attributable risk (AR) = 32/200 - 15/200 = 17/200 = 8.5%

There are various ways in which the two risks or incidence rates can be compared. The two most common are their ratio and their difference. These are called the relative risk (also called the risk ratio) and the attributable risk respectively, where

$$\text{relative risk (RR)} = \frac{\text{risk in exposed}}{\text{risk in nonexposed}} = \frac{a/(a+b)}{c/(c+d)} \tag{6.1}$$

and

$$\text{attributable risk (AR)} = \text{risk in exposed} - \text{risk in nonexposed} = \frac{a}{a+b} - \frac{c}{c+d} \tag{6.2}$$

The relative risk can take on any value ≥ 0. RR = 1 indicates no exposure-outcome
association (thus, 1 is often called the null value). Values between 0 and 1 indicate a
negative association, i. e., exposure protects against the outcome. Values above 1
indicate a positive association, i. e., exposure increases the risk of the outcome.
For our smoking-MI example, the relative risk is

$$\frac{32/200}{15/200}, \text{ or } 2.13.$$

The attributable risk is

$$\frac{32}{200} - \frac{15}{200} = \frac{17}{200}, \text{ or } 8.5\%.$$

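For readers who prefer to verify such figures by computer, Eqs. 6.1 and 6.2 can be sketched as follows. The cell labels a, b, c, d follow Table 6.1 and the counts are those of the smoking-MI example; the function names are our own and purely illustrative.

```python
def relative_risk(a, b, c, d):
    """Eq. 6.1: risk in exposed divided by risk in nonexposed."""
    return (a / (a + b)) / (c / (c + d))


def attributable_risk(a, b, c, d):
    """Eq. 6.2: risk in exposed minus risk in nonexposed."""
    return a / (a + b) - c / (c + d)


# Smoking-MI example: a = smokers with MI, b = smokers without MI,
# c = nonsmokers with MI, d = nonsmokers without MI.
a, b, c, d = 32, 168, 15, 185

rr = relative_risk(a, b, c, d)
ar = attributable_risk(a, b, c, d)

print(f"RR = {rr:.2f}")          # ratio of the two risks
print(f"AR = {100 * ar:.1f}%")   # difference of the two risks, as a percentage
```

Both measures are computed from the same four cells; they differ only in whether the two risks are divided or subtracted.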
Both the relative risk and the attributable risk provide useful information, but their
interpretations are quite different. In general, the relative risk provides the best estimate of the strength or magnitude of the exposure-outcome association and is
therefore useful for making causal inferences. The attributable risk is more useful
for public health purposes, since it indicates the frequency with which the outcome
can be attributed to exposure in the sample studied and, by extension, to the target
population of interest.
The contrast between RR and AR is illustrated in Table 6.3, which compares the two measures obtained from two cohorts, nonsmokers and heavy (> 25 cigarettes/day) smokers among British male physicians from 1951 to 1961 [3]. When relative risks are compared, the relationship between cigarette smoking and lung cancer (RR = 32.43) appears stronger than that between smoking and cardiovascular disease (RR = 1.36). A comparison of attributable risks, however, gives quite a different impression (AR = 2.20 per 1000 per year for lung cancer vs 2.61 per 1000 per year for cardiovascular disease).

Table 6.3. Comparison of deaths from selected causes associated with heavy cigarette smoking by British male physicians

Cause of death            Death rate(a) in   Death rate(a) in   Relative   Attributable
                          nonsmokers         heavy smokers      risk       risk(a)
Lung cancer               0.07               2.27               32.43      2.20
Cardiovascular disease    7.32               9.93               1.36       2.61

(a) Per 1000 per year

Given a constant relative risk, attributable risk rises with the incidence of the
outcome in the nonexposed group. The results for lung cancer and cardiovascular
disease shown in Table 6.3 are thus explained by the higher "natural" (i. e., in the
nonexposed) annual incidence of cardiovascular death (7.32 per 1000) compared
with lung cancer death (0.07 per 1000).
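The dependence of attributable risk on the incidence in the nonexposed follows from simple algebra: since the risk in the exposed equals RR times the risk in the nonexposed, AR = (risk in nonexposed) x (RR - 1). A brief sketch using the rates of Table 6.3 (per 1000 per year) illustrates this; the variable names are ours.

```python
# Death rates per 1000 per year from Table 6.3:
# cause -> (rate in nonsmokers, rate in heavy smokers)
table_6_3 = {
    "lung cancer": (0.07, 2.27),
    "cardiovascular disease": (7.32, 9.93),
}

results = {}
for cause, (rate_nonexposed, rate_exposed) in table_6_3.items():
    rr = rate_exposed / rate_nonexposed
    ar = rate_exposed - rate_nonexposed
    # Same attributable risk, rewritten in terms of baseline rate and RR.
    ar_from_rr = rate_nonexposed * (rr - 1.0)
    results[cause] = (rr, ar, ar_from_rr)

for cause, (rr, ar, ar_from_rr) in results.items():
    print(f"{cause}: RR = {rr:.2f}, AR = {ar:.2f}, baseline x (RR - 1) = {ar_from_rr:.2f}")
```

A very large RR applied to a rare outcome can therefore yield a smaller AR than a modest RR applied to a common one, which is the pattern seen in the table.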
An additional measure is sometimes used to indicate the impact of exposure on outcome in the target population from which the study sample derives. It is called the etiologic fraction (EF) and measures the proportion of all cases of outcome in the target population that are attributable to exposure. Alternatively, the EF can be interpreted as the proportion of cases of the outcome that would disappear if exposure were eliminated in the target population. It is defined as follows:

$$\text{EF} = \frac{r_T - r_{\bar{E}}}{r_T} \tag{6.3}$$

where $r_T$ is the risk of the outcome in the total target population and $r_{\bar{E}}$ is the corresponding risk in those who are unexposed. An algebraically equivalent form that is easier to calculate is:

$$\text{EF} = \frac{P_E(\text{RR} - 1)}{P_E(\text{RR} - 1) + 1} \tag{6.4}$$

where RR is the relative risk and $P_E$ is the prevalence of exposure in the target population [4]. If the relative risk is assumed to remain constant from one population to
another, the etiologic fraction is useful in comparing the proportion of outcome
attributable to exposure in settings with different prevalences of exposure. For
example, if maternal smoking doubles the risk of giving birth to an intrauterine
growth-retarded (IUGR) infant, one-third of the IUGR rate can be attributed to
maternal smoking in a population in which half the women smoke during pregnancy
(EF=
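The one-third figure in the maternal smoking example can be checked against Eq. 6.4, and the equivalence of Eqs. 6.3 and 6.4 can be confirmed numerically. The sketch below is ours; the baseline IUGR risk of 10% in nonsmokers is an arbitrary assumption introduced only to exercise Eq. 6.3 and does not come from the text.

```python
def etiologic_fraction(prevalence_exposed, relative_risk):
    """Eq. 6.4: EF from prevalence of exposure and relative risk."""
    excess = prevalence_exposed * (relative_risk - 1.0)
    return excess / (excess + 1.0)


def etiologic_fraction_from_risks(risk_total, risk_unexposed):
    """Eq. 6.3: EF from the total-population and unexposed risks."""
    return (risk_total - risk_unexposed) / risk_total


# Example from the text: RR = 2, half the women smoke during pregnancy.
prevalence, rr = 0.5, 2.0
ef = etiologic_fraction(prevalence, rr)

# Hypothetical baseline risk in the unexposed (assumption, not from the text).
risk_unexposed = 0.10
risk_exposed = rr * risk_unexposed
risk_total = prevalence * risk_exposed + (1.0 - prevalence) * risk_unexposed
ef_check = etiologic_fraction_from_risks(risk_total, risk_unexposed)

print(f"EF (Eq. 6.4) = {ef:.3f}, EF (Eq. 6.3) = {ef_check:.3f}")
```

The result does not depend on the baseline risk chosen, because that risk cancels when Eq. 6.3 is rewritten in terms of RR and the prevalence of exposure.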
mation of the risk. When group gains and losses occur irregularly during the follow-up period, however, it is preferable to use person-durations in the denominator. The
resulting rate is called an incidence density (ID) [5]. Although many investigators
use IDs to calculate relative and attributable risks, the corresponding indexes are
probably better referred to as the incidence density ratio (IDR) and incidence density
difference (IDD) respectively.
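As a rough sketch of these indexes, the incidence density divides the number of new events by the person-time at risk, and the IDR and IDD are the ratio and difference of two such densities. The counts and person-years below are hypothetical, invented solely for illustration, and the function name is ours.

```python
def incidence_density(events, person_years):
    """New outcome events per person-year of follow-up."""
    return events / person_years


# Hypothetical data (assumption, not from the text).
exposed_events, exposed_person_years = 30, 2400.0
nonexposed_events, nonexposed_person_years = 12, 3000.0

id_exposed = incidence_density(exposed_events, exposed_person_years)
id_nonexposed = incidence_density(nonexposed_events, nonexposed_person_years)

idr = id_exposed / id_nonexposed   # incidence density ratio
idd = id_exposed - id_nonexposed   # incidence density difference

print(f"ID exposed    = {1000 * id_exposed:.1f} per 1000 person-years")
print(f"ID nonexposed = {1000 * id_nonexposed:.1f} per 1000 person-years")
print(f"IDR = {idr:.3f}, IDD = {1000 * idd:.1f} per 1000 person-years")
```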
Another approach to the problem of unequal duration of follow-up is necessary,
however, whenever equivalence of person-durations cannot be assumed. The existence of a prolonged latent period between exposure and outcome (particularly common with carcinogens and cancer) means that 100 subjects followed up for 1 year
may not yield the same number of outcome events as ten subjects followed up for
10 years, even though both cohorts contribute 100 person-years to the denominator.
In such cases, risk calculations need to be adjusted for differential duration of follow-up. The technique involved is called life-table (or survival) analysis, and this will
be taken up in Chapter 18.
Finally, to illustrate how a polychotomous ordinal measurement of exposure can
demonstrate a dose-response effect of exposure on outcome, let us re-examine the
smoking-MI question. Instead of dichotomizing study subjects as either smokers or
nonsmokers, we shall now classify them according to a three-category ordinal scale
as nonsmokers, light smokers (1-5 cigarettes/day), or heavy smokers (≥ 6 cigarettes/day). The (hypothetical) results for the 400 study subjects are shown in
Table 6.4. We have assumed that the 200 subjects classified as smokers in Table 6.2
distribute themselves equally between "light" and "heavy." The relative risk (RR)
and attributable risk (AR) in the two smoking groups are calculated using the risk in
the nonsmoking group as a "base," and the graded response is evident
(RR= 1.60 among light smokers and 2.67 among heavy smokers).
When exposure is measured on a continuous scale (e.g., number of cigarettes
smoked per day), classification into three or more ordinal categories, as demonstrated in Table 6.4, enables risks to be assessed as a function of exposure. Furthermore, such a classification still permits the demonstration of a dose-response effect of
exposure. This is the procedure often used for continuous exposures and categorical
outcomes (see Section 6.2.3).
Regardless of how the sample estimate of exposure-outcome association is
expressed, we must also concern ourselves with its internal validity. Internal validity
requires adequate control for analytic bias and sufficient reproducibility of the sample estimate of the exposure effect, which depends on the extent of variability within
the exposure groups and on the size of the sample. Assessment of the role of chance
(sampling variation) in producing an observed exposure-outcome association will be
discussed in detail in Chapters 12-15. The assessment and control of analytic bias is
the focus of the following section.

Table 6.4. Cohort study of myocardial infarction and cigarette smoking using three-category ordinal scale of exposure

                     MI       No MI    Total
Heavy smokers        20       80       100
Light smokers        12       88       100
Nonsmokers           15       185      200
Total                47       353      400

Risk of MI in nonsmokers = 15/200 = 7.5%
Risk of MI in light smokers = 12/100 = 12%
  RR = (12/100) / (15/200) = 1.60
  AR = 12/100 - 15/200 = 9/200 = 4.5%
Risk of MI in heavy smokers = 20/100 = 20%
  RR = (20/100) / (15/200) = 2.67
  AR = 20/100 - 15/200 = 25/200 = 12.5%

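The calculations of Table 6.4 generalize to any number of ordered exposure categories: each category's risk is compared with that of the reference (nonexposed) category. A minimal sketch, with variable names of our own choosing, is:

```python
# (category, MI cases, group size), ordered from lowest to highest exposure.
groups = [
    ("nonsmokers", 15, 200),
    ("light smokers", 12, 100),
    ("heavy smokers", 20, 100),
]

# The nonsmokers serve as the "base" for both RR and AR.
baseline_risk = groups[0][1] / groups[0][2]

summary = []
for name, cases, total in groups:
    risk = cases / total
    summary.append((name, risk, risk / baseline_risk, risk - baseline_risk))

relative_risks = [row[2] for row in summary]

# A graded (dose-response) effect: RR rises with each higher category.
graded = all(low < high for low, high in zip(relative_risks, relative_risks[1:]))

for name, risk, rr, ar in summary:
    print(f"{name}: risk = {100 * risk:.1f}%, RR = {rr:.2f}, AR = {100 * ar:.1f}%")
print("graded response:", graded)
```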
As to information bias, random measurement errors will bias the exposure-outcome association (i. e., RR) toward unity (the null value), as will systematically
biased measurements that are biased in the same direction, irrespective of exposure
and outcome. Nonblind observation of the outcome, however, can bias the RR
away from 1 when observers are aware of both the association under study and the
exposure status of the study subjects. When surveillance (detection) differs by exposure status, systematic information bias can also occur. Consequently, the best protection against information bias in cohort studies is in their design. Measurements
should be of proven reproducibility and validity and should be performed by observers who are blind to the subjects' exposure status. Detection bias can be minimized
by standardizing both the frequency and content (e.g., examinations, special diagnostic tests) of all follow-up procedures, to ensure that they occur independently of
exposure.
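One way to see why outcome errors that are unrelated to exposure pull the RR toward 1 is to apply the same imperfect outcome measurement to both exposure groups. In the sketch below, the true risks are those of the smoking-MI example (Table 6.2); the sensitivity and specificity of outcome detection (80% and 95%) are hypothetical values chosen only for illustration, and the function name is ours.

```python
def observed_risk(true_risk, sensitivity, specificity):
    """Expected proportion classified as having the outcome when the
    measurement misses some true cases and falsely labels some noncases."""
    true_positives = sensitivity * true_risk
    false_positives = (1.0 - specificity) * (1.0 - true_risk)
    return true_positives + false_positives


# True risks from the smoking-MI example.
true_risk_exposed = 32 / 200
true_risk_nonexposed = 15 / 200

# Hypothetical measurement properties, identical in both exposure groups.
sensitivity, specificity = 0.80, 0.95

obs_exposed = observed_risk(true_risk_exposed, sensitivity, specificity)
obs_nonexposed = observed_risk(true_risk_nonexposed, sensitivity, specificity)

true_rr = true_risk_exposed / true_risk_nonexposed
observed_rr = obs_exposed / obs_nonexposed

print(f"true RR = {true_rr:.2f}, observed RR = {observed_rr:.2f}")
```

Under these assumptions the observed RR lies between 1 and the true RR. Had the error rates differed by exposure status, the observed RR could have moved in either direction.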
Sample distortion bias can occur in assembling the cohort as a result of nonrepresentative sample selection from the target population. It may also occur during
follow-up if losses occur preferentially in some exposure-outcome combinations or
if the duration of follow-up varies according to exposure and is independently
related to the outcome. It can be guarded against by using a sample selection procedure that ensures representativeness of the target population, by standardizing follow-up procedures, and by minimizing losses.
Confounding bias may result from exposure selection, unequal (by exposure)
susceptibility at baseline, or exposure contamination.
Exposure selection bias can be controlled for only to the extent that the reasons
for subjects' (or their clinicians') choice of exposure can be reproducibly and validly
measured. In that (unusual) case, confounding from this source can be dealt with, in
either the design or analysis stage, as any other sort of susceptibility bias (see below).
The reasons for choosing a certain exposure are often unknown, however. Even if
appreciated in a general way, the reasons often involve subtle psychological or motivational factors that are difficult to measure. This is perhaps the major reason why
experimental studies, particularly randomized clinical trials, are often preferable to
observational cohort studies, since even unmeasurable factors are unlikely to be
associated with exposure if exposure is assigned on a random basis.
Measurable differences in susceptibility that vary according to exposure can be
controlled at either the design or the analysis stage. When sample selection is by
exposure, the resulting exposure groups can be matched, during the design, according to the suspected confounding susceptibility factors. When matching is included
in the design, the analysis should (as discussed in Chapter 5) take account of the
matching. When exposure is dichotomous, the outcome is continuous, and the
matching is by pairs, paired tests of group means can be performed, as will be shown
in Chapter 13.
When both exposure and outcome are dichotomous and the matching is by pairs, the results can be expressed in a matched 2 x 2 table (Table 6.5). This table superficially resembles the ordinary (unmatched) table (Table 6.1), but each of the four cells of the table represents the results of matched pairs rather than individual subjects. Cells a and d represent those matched pairs in which both the exposed and the nonexposed members develop the same outcome. In a pairs, both develop the outcome; in d pairs, neither does. Cells b and c represent those matched pairs in which the members experience opposite results. In b pairs, only the nonexposed member develops the outcome; in c pairs, only the exposed member does.

Table 6.5 (matched 2 x 2 table; each cell is a number of matched pairs)

                             Exposed member
                             +        0        Total
Nonexposed member    +       a        b        a+b
                     0       c        d        c+d
Total                        a+c      b+d      N=a+b+c+d

The matched-pair relative risk (RRmatched) is calculated as:

$$\text{RR}_{\text{matched}} = \frac{a + c}{a + b} \tag{6.5}$$

and the matched-pair attributable risk (ARmatched) as:

$$\text{AR}_{\text{matched}} = \frac{c - b}{N} \tag{6.6}$$

The method is illustrated in Table 6.6 for our smoking-MI example. The 400 study
subjects now consist of 200 matched pairs. For seven pairs, both the smoker and the
nonsmoker developed an MI, and for 150 pairs, neither did. In 14 pairs, only the
nonsmoker developed an MI, whereas in 29 pairs, only the smoker did. The
matched-pair relative risk and attributable risk are 1.71 and 7.5% respectively.
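The matched-pair figures just quoted can be reproduced from Eqs. 6.5 and 6.6. In this sketch the cell labels follow the definitions given for the matched table (a: both members develop the outcome; b: only the nonexposed member; c: only the exposed member; d: neither); the function names are ours.

```python
def matched_relative_risk(a, b, c, d):
    """Eq. 6.5: pairs with an exposed case over pairs with a nonexposed case."""
    return (a + c) / (a + b)


def matched_attributable_risk(a, b, c, d):
    """Eq. 6.6: difference in discordant pairs over the number of pairs."""
    n_pairs = a + b + c + d
    return (c - b) / n_pairs


# Smoking-MI example: 200 matched pairs.
a, b, c, d = 7, 14, 29, 150

rr_matched = matched_relative_risk(a, b, c, d)
ar_matched = matched_attributable_risk(a, b, c, d)

print(f"RRmatched = {rr_matched:.2f}, ARmatched = {100 * ar_matched:.1f}%")
```

Note that the concordant pairs in which neither member develops the outcome (d) enter the calculation only through the total number of pairs.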

Table 6.6. Matched-pair analysis for smoking and myocardial infarction (MI)

                             Smokers
                             MI       No MI    Total
Nonsmokers    MI             7        14       21
              No MI          29       150      179
Total                        36       164      200

RRmatched = (7+29)/(7+14) = 36/21 = 1.71
ARmatched = (29-14)/200 = 15/200 = 7.5%

Confounding bias due to exposure contamination is best dealt with at the design stage by blinding study subjects, their care-givers, and observers of the outcome to both the association under study (if feasible and ethical) and the subjects' exposure status. But contamination bias and other sources of confounding can also be controlled for in the analysis, assuming that differences in contaminating exposures and in susceptibility are measurable and have been formally assessed. Several multivariate statistical techniques are available for dealing with multiple confounders, depending on the measurement scales used for expressing exposure and outcome. These will be mentioned briefly in Chapters 13-15.
When exposure, outcome, and confounding variables are all categorical and the
number of confounding variables is small, stratification is usually the control procedure of choice. When a standard population exists (or can be created), then the rate
of outcome in each exposure group can be standardized, using either the direct or
indirect methods described in Chapter 3.
A more commonly used approach is the Mantel-Haenszel procedure [6], in which
the results from each stratum are weighted approximately according to the sample
size of the stratum to yield an overall relative risk. This procedure does not depend
on any standard population. When both exposure and outcome are dichotomous,
the Mantel-Haenszel relative risk (RRMH) is defined as follows:
$$\text{RR}_{\text{MH}} = \frac{\sum_i a_i(c_i + d_i)/N_i}{\sum_i c_i(a_i + b_i)/N_i} \tag{6.7}$$

where $a_i$ = the number of subjects in the ith stratum who are positive for both exposure and outcome
$b_i$ = the number of subjects in the ith stratum who are positive for exposure but negative for outcome
$c_i$ = the number of subjects in the ith stratum who are negative for exposure but positive for outcome
$d_i$ = the number of subjects in the ith stratum who are negative for both exposure and outcome
and $N_i$ = the total number of subjects in the ith stratum
The expressions are then summed (L:) over all strata to arrive at the numerator and
denominator.

Table 6.7. Success (S) and failure (F) for two medical treatments (T1 and T2): control for confounding (by sex) using Mantel-Haenszel procedure

A. Overall ("crude") results
                     S        F        Total
T1                   40       60       100
T2                   60       40       100
Total                100      100      200

B. Results stratified by sex
Women                S        F        Total
T1                   24       3        27
T2                   58       30       88
Total                82       33       115

Men                  S        F        Total
T1                   16       57       73
T2                   2        10       12
Total                18       67       85

The procedures and calculations are illustrated in Table 6.7. The overall results of an observational cohort study comparing success (S) and failure (F) rates with two treatments (T1 and T2) are shown in 6.7A. T1 clearly appears to be the less efficacious treatment, with a 40% vs 60% success rate, or a "relative success rate" (analogous to relative risk) of $\frac{40/100}{60/100} = 0.67$.
The overall crude results are confounded by sex, however. Women have a much higher success rate than men, irrespective of treatment, and women are less likely to receive T1. Since sex does not lie on the causal path between treatment and outcome, it fulfills all three criteria for a confounding variable. The stratified analysis shows the clear superiority of T1 in both men and women. Although the absolute rates of success are lower in men for both treatments, the relative success rates (T1 relative to T2) are similar in both sexes, i.e., 1.35 and 1.32. The Mantel-Haenszel analysis combines the stratum-specific results to yield an unconfounded overall result, with a relative success rate (RRMH = 1.34) intermediate between the two sex-specific rates.
Fortunately, such extreme examples, in which the crude result is opposite in
direction to the adjusted (unconfounded) one, are rare. This type of situation is
often referred to as Simpson's paradox. More commonly, the crude exposure-outcome association is biased upwards or downwards to a lesser degree. As we shall see
in Chapter 14, however, a small bias can sometimes spell the difference between statistical significance and nonsignificance. The Mantel-Haenszel procedure protects
against such an eventuality and can be used for any number of strata.
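Eq. 6.7 is straightforward to program. The following sketch applies it to the sex-stratified treatment example of Table 6.7, treating T1 as the "exposure" and success as the "outcome." Two single-digit cells (3 failures on T1 among women, 2 successes on T2 among men) are taken here as implied by the row and column totals; the function names are ours.

```python
def relative_risk(a, b, c, d):
    """Ordinary (unstratified) relative risk from one 2 x 2 table."""
    return (a / (a + b)) / (c / (c + d))


def mantel_haenszel_rr(strata):
    """Eq. 6.7: Mantel-Haenszel relative risk over a list of (a, b, c, d)."""
    numerator = sum(a * (c + d) / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(c * (a + b) / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator


# Each stratum: (T1 success, T1 failure, T2 success, T2 failure).
women = (24, 3, 58, 30)
men = (16, 57, 2, 10)
strata = [women, men]

# Crude table: add the strata together cell by cell.
crude_cells = tuple(sum(cells) for cells in zip(*strata))

crude_rr = relative_risk(*crude_cells)
rr_women = relative_risk(*women)
rr_men = relative_risk(*men)
rr_mh = mantel_haenszel_rr(strata)

print(f"crude = {crude_rr:.2f}, women = {rr_women:.2f}, "
      f"men = {rr_men:.2f}, Mantel-Haenszel = {rr_mh:.2f}")
```

The crude ratio lies below 1 while both stratum-specific ratios and the Mantel-Haenszel summary lie above 1, which is the reversal described in the text.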

Table 6.8. Effect modification (by age) in cohort study of smoking and myocardial infarction (MI)

A. Overall results
                     MI       No MI    Total
Smokers              32       168      200
Nonsmokers           15       185      200
Total                47       353      400

B. Results stratified by age
Younger (≤ 50 years)
                     MI       No MI    Total
Smokers              14       126      140
Nonsmokers           10       130      140
Total                24       256      280

Older (> 50 years)
                     MI       No MI    Total
Smokers              18       42       60
Nonsmokers           5        55       60
Total                23       97       120

the overall crude measure. Although Tables 6.7 and 6.8 illustrate "pure" confounding and "pure" effect modification respectively, the two phenomena are not mutually exclusive and may coexist.
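The contrast between the two age strata in Table 6.8 can be made concrete with a brief Python sketch. It is offered only as an illustration. The function and variable names are hypothetical, and the count of 5 MIs among older nonsmokers is inferred from the table's marginal totals.

```python
def risk_ratio(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Relative risk of the outcome in exposed versus unexposed subjects."""
    return (exposed_cases / exposed_total) / (unexposed_cases / unexposed_total)


# (MI in smokers, smokers, MI in nonsmokers, nonsmokers) from Table 6.8.
# The older nonsmoker MI count (5) is inferred from the margins.
table_6_8 = {
    "overall": (32, 200, 15, 200),
    "younger": (14, 140, 10, 140),
    "older": (18, 60, 5, 60),
}

rr = {name: risk_ratio(*cells) for name, cells in table_6_8.items()}

for name, value in rr.items():
    print(f"{name}: relative risk = {value:.2f}")
```

The stratum-specific relative risks differ markedly from each other, and the crude value lies between them. This pattern is the opposite of Table 6.7, where the stratum-specific values agree with each other but not with the crude one.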
When two or more exposure variables are positively associated with outcome,
their presence in combination will usually produce a greater effect than any will
alone. This combined effect is called synergism. In our example of Table 6.8, older
age can be considered a kind of "exposure," and the combined effects of smoking
and old age appear to be synergistic. The statistical demonstration of such a com-
References
1. Feinstein AR (1970) Clinical biostatistics. III. The architecture of clinical research. Clin Pharma-
Assignment of treatment defines the method of sample selection for clinical trials.
Sample selection is by exposure, and two or more groups defined by the treatments
are compared for the development of the study outcome (Fig.7.1).
7.1.2 Baseline State
Issues concerning the baseline state are similar to those in observational cohort
studies, except that the study sample must, by definition, be assembled before exposure. In other words, the timing of a clinical trial is always concurrent. This is an
advantage over historical and mixed-timing observational studies, in which the sample is assembled when exposure has already occurred and in which it may not be
known, therefore, whether some subject characteristics might be a result, rather
than a cause, of exposure. Depending on the method used for assigning treatment,
baseline susceptibility factors may assume greater or lesser importance. As we shall
see, random assignment markedly reduces the potential for confounding due to susceptibility bias, and particularly for confounding by indication. For this reason, the
randomized clinical trial, or RCT, has become the design of choice in comparing
two or more clinical treatments.

Fig. 7.1. The classical clinical trial design comparing two treatments (A and B). Asterisk, Randomization (or other mode of treatment assignment). [Diagram: study sample → * → treatment A or treatment B → outcome]
7.1.3 Exposure (Treatment)
Because it serves as the basis of sample selection in clinical trials, exposure is measured on a categorical scale, with each category representing an exposure (treatment) group. When a new treatment is compared with an existing standard treatment, the new one is often referred to as the experimental treatment and the standard
as the control treatment. An inactive control treatment that is indistinguishable
(regarding appearance, sensation, smell, and taste) from the experimental treatment
is called a placebo. As with observational cohort studies, ordinal treatment groups
(e.g., placebo, low-dose treatment, high-dose treatment) permit evaluation of a
dose-response effect.
Issues of treatment potency are similar to those discussed for observational
studies but with one important difference. Because treatments are assigned by the
study investigators, many clinical trials have attempted to standardize treatment
comparisons by using a rigid, fixed-dose treatment schedule. Good clinicians usually
adjust the dosage regimen of a given treatment in individual patients in order to
maximize benefits and minimize side effects. When fixed regimens dictated by the
trial protocol prevent this flexibility, treatment potency may be insufficient to result
in clinical benefit [1]. The result may be a false inference that a treatment is not efficacious. Consider, for example, the University Group Diabetes Program (UGDP)
clinical trial comparing various treatments for diabetes [2]. Critics have suggested
that the ineffectiveness of the two oral agents (tolbutamide and phenformin) might
be explained by the fact that dosage was not optimized, i. e., not tied to attempts to
control the blood or urine glucose concentration [3].
It is often a good idea to include, in the trial protocol, the measurement of variables that reflect the potency of the treatments studied. Such variables are called
intermediate outcomes and permit the investigator to assess, for example, whether the
treatment produced the physiologic effect required to achieve the study outcome. In
the UGDP trial, for instance, measurement of the blood glucose concentration
would have revealed whether the agents administered had actually produced the
desired decrease. Failure to lower blood glucose would indicate inadequate potency.
Even if the potency of the assigned treatment is adequate, the treatment actually
received by the study subjects may be too weak to affect the outcome because of
poor compliance. A negative trial result may occur because a treatment is not taken,
rather than because it is not efficacious. A drug cannot be expected to produce its
desired clinical effect if it is not taken or is taken in inadequate dosage. It is generally advisable to measure treatment compliance in clinical trials, either directly or
indirectly, e. g., by periodic "pill counts" or urine testing for presence of the study
drug or some inert but easily detectable "marker," such as riboflavin. The importance of compliance should be stressed at the time of subject enrollment and periodically during treatment. When only "super compliers" are studied, however, the trial
results may be poorly generalizable to the "real" clinical world. In any case, mea-
Clinical Trials
Table 7.1. Comparison of treatment groups (active vs placebo) created by randomization, VA antihypertensive trial [4]

                                   Active            Placebo
Characteristic                   (n)     (%)       (n)     (%)
Race
  White                           31    42.5        35    50.0
  Black                           42    57.5        35    50.0
[characteristic label missing]
  [category label missing]        23    31.5        19    27.1
  [category label missing]        49    67.1        48    68.6
  [category label missing]         1     1.4         3     4.3
Cardiac symptoms
  None                            52    71.2        48    68.6
  Present                         21    28.8        22    31.4
[characteristic label missing]
  [category label missing]        44    60.3        39    55.7
  [category label missing]        29    39.7        31    44.3
Diabetes
  Absent                          65    89.0        65    92.9
  Present                          8    11.0         5     7.1
Total randomized                  73   100          70   100

The most common method of randomization makes use of published tables or computer-generated lists of random numbers. Instructions for using such tables (see
Appendix Table A.1) or lists are identical to those for random sampling and were
discussed in Chapter 4. If equal numbers are desired in both of two treatment
groups, the random numbers corresponding to each subject can be arranged in
numerical order. The first half of the group will then receive treatment A, the second treatment B. When three treatments are involved, the ordered subjects are
divided into thirds, and so on.
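The equal-allocation procedure just described (attach a random number to each subject, sort, then split the ordered list) might be sketched in Python as follows. This is a minimal illustration, not a validated allocation system. The function name `allocate_equal`, its parameters, and the use of Python's `random` module in place of a published random number table are all assumptions made for the example.

```python
import random


def allocate_equal(subject_ids, treatments=("A", "B"), seed=None):
    """Assign subjects to equal-sized groups by ordering random numbers."""
    rng = random.Random(seed)
    # Attach one random number to each subject, then sort by that number.
    keyed = sorted((rng.random(), sid) for sid in subject_ids)
    n = len(keyed)
    k = len(treatments)
    assignment = {}
    for rank, (_, sid) in enumerate(keyed):
        # First n/k ranked subjects get the first treatment, and so on.
        assignment[sid] = treatments[rank * k // n]
    return assignment


two_arm = allocate_equal(range(1, 21), seed=1)
three_arm = allocate_equal(range(1, 10), treatments=("A", "B", "C"), seed=1)

print(sorted(two_arm.items()))
```

Fixing the seed makes the schedule reproducible, which is convenient for auditing. In an actual trial the schedule would have to be concealed from those enrolling subjects.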
Once the randomization schedule has been devised, the assigned treatments
must, of course, be communicated to the clinicians who administer them. One of the
best and most frequent methods for accomplishing this involves the use of opaque
envelopes that must be opened to reveal the assigned treatment. Upon enrollment of
each subject, the next envelope in numbered sequence is opened to determine the
treatment for that subject. When the treatments involve look-alike tablets, capsules,
or liquids, another method involves sequentially numbering each bottle or package.
The treatment corresponding to each number is obtained from the random number
table, and the code remains unknown to the personnel dispensing the treatments.
Coin flips, dice, or playing cards can also be used to randomize treatment
assignment, but such methods are used far less frequently than random numbers.
Regardless of which method of randomization (and communication) is used, that
method should be indicated whenever the trial is described and reported. The mere
use of the term "random assignment" is insufficient, because some authors have used
the term "random" in a rather loose sense to indicate "without any pre-established
order." As we have seen, however, assignment must be truly random to be immune
from bias.
7.2.3 Stratified and Blocked Randomization
Unfortunately, even true randomization of treatment assignment does not guarantee
that confounding will not occur. The only guarantee is that confounding factors will
distribute themselves randomly. Randomly does not mean evenly. Ten coin flips do
not guarantee 5 heads and 5 tails, or even that the result will not be more extreme
than 4 and 6, or even 3 and 7. The chance occurrence of 0 heads and 10 tails
is unlikely, but not impossible. In fact, the probability can be calculated as
P = (1/2)^10 = 0.000977. Similarly, random assignment can occasionally result in the
uneven distribution of important confounding variables. Since a chance occurrence
of 1 out of 20 (i.e., P = 0.05) is the usual threshold for establishing "statistical significance," a statistically significant difference in the distribution of a given confounding variable will occur, on average, once out of every 20 randomizations.
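The coin-flip probabilities mentioned above follow from the binomial distribution and can be verified with a few lines of Python. This is a sketch for checking the arithmetic only. The variable names are hypothetical, and the final line assumes that 20 baseline variables are statistically independent, which real patient characteristics seldom are.

```python
from math import comb


def prob_heads(k, n=10):
    """Probability of exactly k heads in n flips of a fair coin."""
    return comb(n, k) / 2**n


# 0 heads and 10 tails: (1/2)^10
p_no_heads = prob_heads(0)

# An exactly even 5-5 split
p_even_split = prob_heads(5)

# More extreme than 4 and 6, i.e., 3 or fewer heads or 7 or more heads
p_beyond_4_6 = 2 * sum(prob_heads(k) for k in range(0, 4))

# More extreme than 3 and 7, i.e., 2 or fewer heads or 8 or more heads
p_beyond_3_7 = 2 * sum(prob_heads(k) for k in range(0, 3))

# Chance that at least 1 of 20 independent variables differs at P < 0.05
p_any_of_20 = 1 - 0.95**20

print(p_no_heads, p_even_split, p_beyond_4_6, p_beyond_3_7, p_any_of_20)
```

Even ten fair flips produce a split of 7 to 3 or worse about one time in three, which illustrates why random does not mean even.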
To protect against possible bias by the chance maldistribution of one or more
important potential confounding factors, some trials use a stratified randomization in
which subjects are first assigned to a stratum defined by the confounder(s). A separate randomization is then carried out for each stratum. This procedure is analogous
to stratified random sampling (see Section 4.3.2).
To maximize statistical efficiency by ensuring that approximately equal numbers
of subjects receive each study treatment, randomization is occasionally carried out
within blocks of specified size. For example, in a two-arm RCT, randomization by
blocks of ten will ensure that for every ten subjects enrolled, five will receive each of
the two treatments. Blocking is often particularly helpful in the setting of multiple
strata by preventing large within-stratum imbalances in treatment assignment.
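Randomization within blocks can be sketched in Python as below. The sketch assumes permuted blocks of fixed size in which every treatment appears equally often. The function name `blocked_randomization` and its parameters are hypothetical, and the example is illustrative rather than a production allocation routine.

```python
import random


def blocked_randomization(n_subjects, block_size=10, treatments=("A", "B"),
                          seed=None):
    """Return a treatment schedule built from randomly permuted blocks."""
    if block_size % len(treatments) != 0:
        raise ValueError("block size must be a multiple of the number of arms")
    rng = random.Random(seed)
    per_arm = block_size // len(treatments)
    schedule = []
    while len(schedule) < n_subjects:
        # Each block contains every treatment the same number of times.
        block = list(treatments) * per_arm
        rng.shuffle(block)
        schedule.extend(block)
    return schedule[:n_subjects]


schedule = blocked_randomization(40, block_size=10, seed=7)
blocks = [schedule[i:i + 10] for i in range(0, 40, 10)]

print(["".join(block) for block in blocks])
```

For a stratified design, one such schedule would be generated separately for each stratum (for example, with a different seed per stratum), so that treatment numbers stay balanced within every stratum as enrollment proceeds.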
7.2.4 Individual vs Group Randomization
Randomization of individual subjects is entirely appropriate for the classic drug-efficacy trial. In such a trial, subjects are treated individually, treatment groups remain
distinct, and an unbiased comparison of drug vs placebo, or drug A vs drug B, is
thus likely. For some types of treatment, however, random assignment by individuals
can actually be detrimental, because interaction among subjects may lead to systematic errors in classifying the treatment actually received (i. e., the treatments
actually received will be more similar than those allocated), and hence a biased comparison. Psychosocial, educational, and health care service interventions are particularly prone to this problem, since subjects are likely to interact with one another
between administration of the intervention and measurement of the outcome. For
such trials, treatment assignment by hospital room or ward, school, or geographic
region may be preferable to randomization of individuals [1].
Group randomization appears preferable whenever relatively closed, naturally
formed groups are capable of modifying the treatment allocated to individuals
within those groups. For example, of the dozen or so controlled clinical trials assessing the effect of early maternal-infant contact on subsequent maternal attachment
behavior (so-called bonding), most randomized individual women, rather than
entire postpartum wards. Thus, mothers receiving different treatments (early contact
vs usual "routine") were housed on the same ward, and often in the same room.
Communication among these mothers might well be expected to reduce the difference between the treatments actually received and thereby reduce the difference in
outcome [5]. Randomization by group (in this example, postpartum ward) can avoid
this source of (information) bias.
In the clinical trial designs we have considered thus far, a group of subjects receiving
a given treatment is compared with other groups receiving one or more different
treatments. Such a design is called a parallel design, because the study groups receive
their respective treatments simultaneously, i. e., in parallel. In a crossover design,
however, each study subject receives each treatment in series by "crossing over" in
sequence from one to the other. For example, patients with asthma might each
receive, in sequence, two different treatment regimens to see which of the two treatments is more efficacious among the group as a whole.
Crossover trials have a major advantage over parallel trials in statistical efficiency, in that a given treatment difference is demonstrable with fewer subjects.
There are two reasons for this: (a) each subject receives both treatments (in a two-treatment comparison) and thus "counts" twice, and (b) variability in treatment
response due to individual subject characteristics is eliminated and the "signal-to-noise ratio" thus enhanced. Proper conduct of crossover trials, however, requires
randomization of treatment sequence (A,B vs B,A), time-dependent (rather than
outcome-dependent) crossover of treatments, and elimination of (or control for)
carry-over effects from the first treatment [8].
The statistical analysis of crossover trials is similar to that of matched pairs and
will be discussed in Chapters 13 and 14.
Analysis of Results
Blinding of study subjects is necessary to protect against contamination (confounding) of the true treatment effect by the so-called placebo effect. The placebo
effect is the nonspecific effect that any treatment can have on the outcome, especially when the subject believes it to be efficacious. The main reason for providing
look-alike, feel-alike, smell-alike, taste-alike placebo treatments when comparing an
experimental treatment with no treatment is to facilitate subject blinding.
Blinding of observers is also necessary to prevent the information bias that would
occur if the outcomes were determined by observers who are aware of the treatment
received. As discussed previously, such awareness can influence the outcome assessment, either consciously or unconsciously, especially if the outcome is subjective.
When a clinical trial incorporates blinding of both the subjects and the observers,
the trial is said to be double-blind. When care-givers other than the observers are
aware of treatment status, however, even double blinding is insufficient, because the
study treatment can be contaminated (confounded) by doctors, nurses, physical
therapists, or others who may alter the quantity or quality of their care according to
the study treatment received. Failure to protect against this source of bias was one of
the major defects in the maternal-infant "bonding" trials alluded to earlier [5].
Unfortunately, blinding may be infeasible for some treatments. This is obviously
true for most surgical procedures but also pertains to many behavioral (e.g., exercise
vs no exercise in a trial to prevent myocardial infarction) and health care (e. g., care
by nurse practitioner vs physician) interventions. Furthermore, unblinding can arise
owing to differences in side effects that occur with different treatments, even if those
treatments originally seem indistinguishable. This is especially likely to create a bias
when the control treatment involves a placebo. One strategy for measuring potential
bias due to unblinding involves asking subjects, after they have completed treatment,
to guess whether they received the active treatment or placebo; bias should be suspected whenever treatment effects appear only in subjects who are unblinded. A
good example of the use of this strategy was one of the RCTs of vitamin C in the
prevention and treatment of the common cold [9]. Unblinded subjects who received
vitamin C reported a shorter duration and lesser severity of colds; in subjects who
remained blind, no such differences were found.
One of the major problems likely to arise during a clinical trial is that subjects either
may not have received or complied with the treatment to which they were assigned
or, having received it initially, may have switched to another. Another problem is
that subjects may withdraw from participation in the trial before treatment or follow-up are complete. These realities of clinical research create major problems in
interpreting the trial's results.
Such problems have led to two different ways of analyzing the results of a clinical trial and, consequently, to two different interpretations. Which of the two
approaches is taken depends on whether one is interested in treatment efficacy or
treatment effectiveness.
Efficacy refers to the potential effect of treatment under optimal circumstances,
i. e., whether treatment can have an effect on outcome. Thus, an analysis for efficacy
would compare subjects according to the treatment actually received (rather than
the one assigned) and would exclude subjects who complied poorly, those who
switched over (and thus received both treatments), and those who withdrew during
the trial. An analysis for efficacy is also called an explanatory trial analysis [10-12].
Effectiveness refers to the actual effect of treatment in the "real world" of people
who comply poorly, change treatment (owing to unsatisfactory results or side effects
of initial treatment), or become lost to follow-up - i.e., whether treatment does have
an effect on outcome. An effectiveness analysis compares all subjects according to
their original, assigned treatment and thus includes poor compliers, switch-overs,
and withdrawals. Other terms for effectiveness analysis include pragmatic, management, and intention-to-treat trial analysis [10-12].
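The difference between the two analyses can be illustrated with a small Python sketch. The records below are entirely hypothetical and were invented for this example; they do not come from any trial. The field names and the `success_rates` helper are likewise assumptions. The sketch also assumes that the outcome is known for every subject, including those who withdrew.

```python
from collections import defaultdict


def subjects(assigned, status, success, n):
    """Build n identical hypothetical subject records."""
    return [{"assigned": assigned, "status": status, "success": success}
            for _ in range(n)]


# Hypothetical data: 10 subjects assigned to each of treatments A and B.
records = (
    subjects("A", "complied", True, 5)
    + subjects("A", "complied", False, 1)
    + subjects("A", "poor complier", False, 2)
    + subjects("A", "switched", True, 1)
    + subjects("A", "withdrew", False, 1)
    + subjects("B", "complied", True, 4)
    + subjects("B", "complied", False, 5)
    + subjects("B", "poor complier", False, 1)
)


def success_rates(rows):
    """Proportion of successes within each assigned treatment group."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for row in rows:
        totals[row["assigned"]] += 1
        successes[row["assigned"]] += int(row["success"])
    return {arm: successes[arm] / totals[arm] for arm in totals}


# Effectiveness (intention-to-treat): everyone, by assigned treatment.
effectiveness = success_rates(records)

# Efficacy (explanatory): only subjects who complied with the assigned
# treatment; poor compliers, switch-overs, and withdrawals are excluded.
efficacy = success_rates([r for r in records if r["status"] == "complied"])

print("effectiveness:", effectiveness)
print("efficacy:", efficacy)
```

In these invented data, treatment A looks considerably better in the efficacy analysis than in the effectiveness analysis, because the subjects excluded from the former are mostly those in whom treatment failed.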
Both efficacy and effectiveness may be important. Efficacy is usually of primary
interest to biologists, physiologists, and (to some extent) pharmacologists, i. e., to
those interested in biologic potency and mechanisms of action. It is also a sine qua non of effectiveness, since a treatment that cannot work under optimal circumstances will not work in clinical practice. Treatments can be efficacious without
being effective, however, and it is effectiveness that is of primary concern to patients
and clinicians. Nonetheless, efficacious but ineffective treatments may still be useful
for certain patients (e.g., those without side effects and those with good compliance).

¹ It is important to emphasize that some variables are bound to distribute themselves asymmetrically
by treatment. This will occur, on average, with 1 of every 20 variables examined. But control for
all such asymmetrically distributed variables is unnecessary, since random differences between
treatment groups are already taken into account in calculating the probability of an erroneous statistical inference. Thus, in-depth searches for such variables are not indicated. Control for confounding, whether by stratified randomization or by adjustment in the analysis, is necessary only
for those few potential confounders identified a priori as important candidates.
7.5.2 Selective Subject Participation
To an even greater extent than in observational studies, subjects (or their clinicians)
may be unwilling to participate in clinical trials, and especially in RCTs. Treatment
assignment by the study's investigators means that neither the subject nor his or her
clinician can choose, and many subjects therefore decline to participate. Not only
does this result in smaller numbers of participants, with consequent statistical limitations, but those who participate may be quite different from those who do not. The
trial's external validity, or generalizability, thus may not extend beyond the narrow
confines of a highly selective target population [1].
A common example of this type of problem arises whenever low-risk or high-risk patients are preferentially enrolled in a trial. A treatment comparison in one of
these risk strata may not be generalizable to the other. In an RCT of a new cancer
chemotherapeutic agent, for example, only patients with advanced disease resistant
to conventional treatment may be enrolled. Failure of the agent among such high-risk cases does not indicate whether it is efficacious in lower-risk patients without
previous treatment. The problem is even more insidious whenever motivational differences responsible for selective participation can also affect the study outcome,
because the nature of these differences may not be known and, even if appreciated
qualitatively, may be difficult to measure.
Trial investigators should keep track of, and include in all reports, both the numbers and relevant characteristics of all participants and nonparticipants. Such characteristics include any sociodemographic or clinical factors that can affect the outcome. Unless participation rates are exceptionally high (80%-90%), investigators
should compare participants and nonparticipants and indicate characteristics of the
target population to whom the results appear to apply.
7.5.3 The Hawthorne Effect
The Hawthorne effect refers to the tendency of study participation per se to affect
outcome. The term originated in studies carried out in the 1920s at the Hawthorne
Works of the Western Electric Company in Chicago. A variety of interventions
(e. g., changing the light intensity) were used in an attempt to improve workers' productivity, but the investigators found that productivity increased regardless of what
intervention was introduced. Although a Hawthorne effect can arise in any study
concerned with behavioral outcomes or outcomes that can be influenced by behavioral changes, RCTs carry with them the sense of uncertainty and risk (randomization), which may be more potent behavior modifiers than mere observation.
Ethical Considerations
treatment preferences in clinical practice on more than a single study, it may be difficult for a subject or clinician to accept a 50-50 chance of receiving a new treatment when a previous trial, or even two or three previous trials, have shown the
treatment to be superior to the existing "standard."
Nonetheless, many thoughtful (and ethical) scientists make the opposite argument. According to them, it is more ethical to allow subjects to receive a standard
treatment before convincing evidence favors a new one than to allow unproven,
potentially harmful and costly therapies to be adopted prematurely. We need only
adduce the now-abandoned practices of purging and blood-letting to remind us that
the annals of medical history are replete with ineffective and even harmful remedies
staunchly defended over long periods by the best clinicians of their time. The difficulty comes in defining the threshold for "convincing" and "proven," which is likely
to be lower for clinicians wishing to do the most good and least harm by their
patients than for researchers wishing to establish scientific "truth."
Although these ethical dilemmas are not easily resolved, the recent insistence, in
many developed societies, on informed consent and institutional review boards has
provided important ethical safeguards. Informed consent is required by most research
funding agencies and clinical journals whenever studies involve human experimentation and usually consists of a full disclosure of trial procedures, including randomization and blinding. Informed consent means that the subject is given a chance to
ask questions, is under no pressure or obligation to participate, and is informed that
his or her care will not suffer if he or she declines participation or later decides to
withdraw. A signed statement of informed consent is often required. Despite these
requirements, however, the complexities of modern medicine and other clinical disciplines, as well as those of trial design, often prevent the consenting subject from
being fully informed.
Institutional review boards (IRBs), also called human investigation committees,
usually consist of committees composed of lay persons, clinicians, administrators,
lawyers, and ethicists and are based within a hospital, research institute, or academic
institution. Their purpose is to review trial (and other human study) protocols, to
ensure the protection and ethical treatment of study subjects, and to suggest appropriate changes in informed consent procedures or study design. The IRB may also
request that the trial's data be analyzed periodically by an outside statistician or
committee, so that a clear difference in treatment efficacy is recognized promptly,
further enrollment is halted, and subsequent patients can receive the better treatment. IRB approval is often required by the institution at which the trial will be carried out, as well as by the agency providing the funding.
Variations in the classic RCT design (Fig. 7.1) have been proposed in an effort to
overcome some of its inherent ethical difficulties. Zelen has proposed a design
(Fig. 7.2) in which subjects are randomized to "consent not sought" vs "consent
sought" groups. Those in the former group receive the existing standard treatment,
while those in the latter are offered the new experimental treatment, with the standard treatment given if they decline [15]. This design retains the scientific benefits of
randomization while allowing those subjects in the "consent sought" arm to choose
whether or not they want the new treatment. In addition to preventing blinding,
however, such a design also creates problems of statistical efficiency and interpretation, since the groups should be compared according to the randomization arms,
and the "consent sought" arm contains subjects receiving both treatments, i.e., the
effect of the new treatment is contaminated with that of the standard.

Fig. 7.2. Alternative clinical trial design comparing new (A) and conventional (B) treatments. Asterisk, Randomization. [Diagram: study sample → * → "consent not sought" (treatment B) or "consent sought; offered treatment A"; subjects in the latter arm either accept (treatment A) or decline (treatment B) → outcome]

Fig. 7.3. Sequential analysis of clinical trial comparing two treatments (A and B). [Plot: horizontal axis, number of pairwise preferences; boundary lines mark superiority of A, superiority of B, and no important difference]
The sequential design differs from the classic design in that sample sizes are not
fixed in advance but are determined by the cumulative trial results [16]. The results
are analyzed in successive pairs in which the two subjects are assigned different
treatments until either one of the two treatments is shown to be statistically superior
or it becomes clear that the difference between the two is small enough to ignore.
The method of analysis is illustrated in Fig. 7.3. The outcome in the first subject randomized to receive treatment A is compared with that of the first subject randomized to treatment B. A "step" is then taken off the line of equality (the x axis, corresponding to 0 excess preferences) toward the treatment favored in the first pairwise
comparison. Subsequent pairs are compared similarly until the path defined by the
cumulative "steps" crosses one of the lines indicating either superiority of A, superiority of B, or no important difference. Sequential designs were originally devised in
World War II to minimize the sample size necessary to demonstrate a treatment difference, but they also have ethical advantages, because the trial is stopped as soon as
the results are clear. Their advantages are limited, however, to treatments of short
duration (so that many subjects are not enrolled needlessly while the results of previous enrollees are awaited). Furthermore, smaller trials, while beneficial in some
respects, do not convey as much information as larger trials, both because the results
may not be as widely applicable (generalizable) and because limited numbers may
prevent analysis of smaller subgroups or detection of rare adverse outcomes of treatment.
References
1. Kramer MS, Shapiro SH (1984) Scientific challenges in the application of clinical trials. JAMA
252: 2739-2745
2. University Group Diabetes Program (1970) A study of the effects of hypoglycemic agents on
vascular complications in patients with adult-onset diabetes. I. Design, methods, and baseline
results. II. Mortality results. Diabetes 19 [Suppl 2]: 747-830
3. Feinstein AR (1971) Clinical biostatistics. VIII. An analytic appraisal of the University Group
Diabetes Program (UGDP) study. Clin Pharmacol Ther 12: 167-191
4. Veterans Administration Cooperative Study Group on Antihypertensive Agents (1967) Effects of
treatment on morbidity in hypertension: results in patients with diastolic blood pressures averaging 115 through 129 mmHg. JAMA 202: 116-122
5. Thomson ME, Kramer MS (1984) Methodological standards for controlled clinical trials of
early contact and maternal-infant behavior. Pediatrics 73: 294-300
6. Cornfield J, Mitchell S (1969) Selected risk factors in coronary disease: possible intervention
effects. Arch Environ Health 19: 387-394
7. Buck C, Donner A (1982) The design of controlled experiments in the evaluation of nontherapeutic interventions. J Chronic Dis 35: 531-538
8. Louis TA, Lavori PW, Bailar JC, Polansky M (1984) Crossover and self-controlled designs in
clinical research. N Engl J Med 310: 24-31
9. Karlowski TR, Chalmers TC, Frenkel LD et al. (1975) Ascorbic acid for the common cold: a
prophylactic and therapeutic trial. JAMA 231: 1038-1042
10. Schwartz D, Lellouch J (1967) Explanatory and pragmatic attitudes in therapeutic trials. J
Chronic Dis 20: 637-648
11. Sackett DL, Gent M (1979) Controversy in counting and attributing events in clinical trials. N
Engl J Med 301: 1410-1412
12. Louis TA, Shapiro SH (1983) Critical issues in the conduct and interpretation of clinical trials.
Annu Rev Public Health 43: 25-46
13. Schafer A (1982) The ethics of the randomized clinical trial. N Engl J Med 307: 719-724
14. Lebacqz K (1983) Ethical aspects of clinical trials. In: Shapiro SH, Louis TA (eds) Clinical trials:
issues and approaches. Marcel Dekker, New York, pp 81-98
15. Zelen M (1979) A new design for randomized clinical trials. N Engl J Med 300: 1242-1245
16. Armitage P (1971) Statistical methods in medical research. Blackwell Scientific Publications,
Oxford, pp 415-425
17. Fletcher RH, Fletcher SW (1979) Clinical research in general medical journals. N Engl J Med
301: 180-183
8.1 Introduction
In cohort studies (and clinical trials), subjects are followed in a forward direction
from exposure to outcome. Inferential reasoning is from cause to effect.
In case-control studies, on the other hand, we start with the outcome and ask or
find out about prior exposure. The directionality is backward, and the reasoning is
inductive, from effect to cause. In some ways, therefore, case-control studies can be
thought of as the chronological and logical inverse of cohort studies. Feinstein has
coined the term trohoc (cohort spelled backwards) to illustrate this relationship [1].
Another frequently encountered synonym is case-referent study. The generally
accepted term case-control study derives from the usual dichotomous categorization
of outcome as present (cases) or absent (controls).
In case-control studies, prior exposure status is ascertained after the study subjects
have been assembled. Thus, the study sample may be selected from the target population by either outcome status or other criteria. If the outcome is rare, random sampling or other mode of sample selection is far less efficient than selection by outcome, because many more subjects would be required to provide statistically
meaningful results.
Regardless of which sample selection method is used, study subjects are usually
classified (i. e., categorized) according to their outcome status. In the majority of
case-control studies, outcome is dichotomously assessed as either present or absent,
and subjects are classified as cases or controls respectively. (The terms diseased and
nondiseased have also been applied to these two outcome categories but are best
reserved for true diseases or other adverse health outcomes.)
Because the outcome has already occurred (among the cases) when study subjects are sampled, opportunities abound for sample distortion bias in case-control
studies. Differential surveillance or selective loss to follow-up, which can usually be
guarded against in the design and execution of cohort studies, have already
occurred in case-control studies when the study sample is assembled. Sample distortion bias will be discussed in some detail in Section 8.6.
Case-Control Studies
Perhaps the best control group would consist of a representative sample of subjects free of the outcome who would have been included as cases if they had developed the outcome [2]. Defining and locating such a sample, however, may be difficult.
8.2.4 Exposure
be used in the context of a single study to investigate several types of exposure-outcome causal models, multiple hypotheses increase the possibility of chance associations being declared "statistically significant" (see Chapter 12). In any case, the various definitions or criteria of exposure should be specified a priori to permit the
testing, rather than the mere generation, of etiologic hypotheses.
Ascertainment of exposure provides the main source of information bias in case-control studies. This will be discussed, along with other aspects of bias assessment
and control, in Section 8.4.
The presentation and analysis of the results depend on the types of measurement
scales in which exposure and outcome are expressed. As mentioned earlier, outcome
is usually expressed categorically. A study of the relationship between the two continuous variables of infant birth weight and maternal cigarette smoking during gestation, however, would still qualify as a case-control study, if the directionality of
the study was from outcome to exposure. The linear correlation between the average number of cigarettes smoked per day by the mothers and their infants' birth
weights would nicely reflect the degree of exposure-outcome association. This type
of case-control study is unusual, however, and would be similar (except for
unknown losses to follow-up) to a cohort study in which pregnant women were followed forward to delivery.
When the outcome is continuous and the exposure is categorical, the results can
be analyzed by comparing the mean outcomes in each exposure group. Although
such an analysis appears similar to a comparison of mean outcomes in cohort studies,
the exposure groups defined in a case-control study are not representative of exposure groups in the target population, and their corresponding means are difficult to
interpret.
The use of ordinal outcomes is also rare in case-control studies but permits the
assessment of dose-response effects. One recent example is a study of the relationship between adolescent adiposity (fatness) and a history of having been breast-fed
as an infant [6]. Subjects were classified by outcome as either obese, overweight, or
normal based on their weight-for-height and skinfold thicknesses. Their mothers
were then interviewed about the type of feeding (breast vs bottle) the subjects
received as newborns, and the major analysis was a comparison of breast feeding
rates among the three outcome groups of normal, overweight, and obese subjects.
Categorical outcomes and continuous exposures yield a comparison of mean
exposure. While the information conveyed by such a comparison does provide a test
of exposure-outcome association, the result is difficult to interpret in the usual
sequence of causal inference, because our primary interest is not the average level of
exposure preceding a given effect. Rather, the major inference concerns the effect of
a particular level of exposure in the target population. Since, for a continuous exposure
variable, the distributions of exposures in cases and controls are likely to overlap, and since the distribution of exposure in the target population will be a mix of
the case and control distributions, no inference can be made concerning the effect of
a given level of exposure merely by comparing the mean exposures in cases and controls.
The usual method of analyzing the results of a case-control study uses dichotomous outcomes and categorical (usually dichotomous) exposures. Continuous exposures can be categorized to permit the use of this strategy. As we shall see, such a
method does indeed provide a good estimate of the effect of exposure in the target
population. When both outcome and exposure are dichotomous, the results can be
displayed in a 2 x 2 table (see Table 8.1). At first glance, the table appears identical
to the cohort study 2 x 2 table illustrated in Table 6.1. The cases and controls correspond to the presence and absence of the outcome respectively. The difference is in
the directionality of the research design: backward from outcome to exposure in the
case-control study, forward from exposure to outcome in the cohort study.
Table 8.1. Two-by-two table for analyzing results of a case-control study with dichotomous outcome and exposure

                 Cases (O)    Controls (Ō)    Total
Exposed              a             b           a+b
Nonexposed           c             d           c+d
Total               a+c           b+d          N = a+b+c+d

Rate of exposure in cases = a/(a+c); rate of exposure in controls = b/(b+d)
Exposure odds ratio (ORE) = (a/c)/(b/d) = ad/bc

The backward directionality imposes a backward method of analysis. Thus, we
cannot compare the "rate" of cases in the exposed and nonexposed subjects, a/(a+b) vs c/(c+d), because the exposure groups were formed by the exposure histories ascertained in the study, not by sampling from the target population. Only the outcome
groups (cases and controls) are representative of the target population, and thus the
major comparison is the rate of exposure in cases vs controls: a/(a+c) vs b/(b+d).
Although such a comparison in itself provides a valid test of the exposure-outcome
association, it shares the same difficulty in interpretation as a comparison (for continuous exposures) of mean exposures. That is, it does not allow a direct inference
about the effects or risks of exposure in the target population. As we shall see, the
use of odds, instead of rates, will allow us to estimate the relative risk of exposure in
the target population.
8.3.2 Odds and Odds Ratios
As in horse racing or other forms of betting, the odds of a given event is the ratio of
the probability of its occurrence to the probability of its nonoccurrence. In horse
racing, the odds are usually given as the odds against a given horse's winning the
race. Thus, 5:1 odds indicates that the horse is five times more likely to lose than to
win; hypothetically, if the race were to be run six times, the horse would lose five
and win one. Conversely, the odds in favor of the horse's winning is 1:5. Since
ratios can also be expressed as fractions, these odds can be expressed as 1/5.
Similarly, in Table 8.1, the odds of exposure in the cases can be expressed as a/c
and that in controls as b/d. We can also form a ratio of these two odds, called the
exposure odds ratio (ORE):
$$\mathrm{OR}_E = \frac{a/c}{b/d} = \frac{ad}{bc} \qquad (8.1)$$
Why do we compare the odds, rather than the rates, of exposure in cases and
controls? As we have seen, the risks of outcome in exposed and nonexposed subjects, which would provide direct information about the relative risk of exposure in
the target population, cannot be directly derived from the data supplied by a case-control study. In fact, the "rates" of cases in the two exposure groups are uninterpretable quantities that reflect the proportion of cases and controls sampled from the
target population rather than the true risks of the outcome among the exposed and
nonexposed in that population. The reason for using odds is that the odds ratio is a
fairly good estimate of the true relative risk of exposure in the target population, provided the outcome is rare. The algebraic proof of this assertion forms the basis of the
following section. Readers not interested in seeing this proof may skip to Section 8.3.4 without loss of continuity.
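The arithmetic of odds and odds ratios can be sketched in a few lines of Python. This is an illustration only: the function names are assumptions of the sketch, and the counts are borrowed from the tea-drinking example that appears later in this chapter (Table 8.6).

```python
from fractions import Fraction

def odds(probability):
    """Odds in favor of an event: P(event) / P(no event)."""
    return probability / (1 - probability)

def exposure_odds_ratio(a, b, c, d):
    """Exposure odds ratio for a 2 x 2 table laid out as in Table 8.1.

    a = exposed cases,    b = exposed controls,
    c = nonexposed cases, d = nonexposed controls.
    """
    odds_in_cases = Fraction(a, c)       # odds of exposure among cases
    odds_in_controls = Fraction(b, d)    # odds of exposure among controls
    return odds_in_cases / odds_in_controls  # algebraically equal to ad/bc

# A horse expected to win one race in six has odds of 1:5 in favor of winning.
horse_odds = odds(Fraction(1, 6))

# Counts from the tea-drinking example (Table 8.6).
or_e = exposure_odds_ratio(400, 333, 100, 167)

print("odds in favor of the horse:", horse_odds)
print("exposure odds ratio:", round(float(or_e), 3))
```

Exact fractions are used here only to make the identity (a/c)/(b/d) = ad/bc visible without rounding; ordinary floating-point division would serve equally well.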
8.3.3 The Relationship Between the Sample Odds Ratio
in a Case-Control Study and the Target Population Relative Risk
The best way to illustrate this relationship is by demonstrating the results in the
entire target population (i. e., without sampling), and then seeing how a cohort
study and a case-control study arrive at similar expressions for the relative risk of
exposure in the target population. I will use capital letters to indicate the population
values, lowercase letters to indicate values obtained in the cohort study, and lowercase letters with primes (') for values in the case-control study.
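Table 8.2. Relationship between the relative risk and the odds ratio: target population

                 Outcome    No outcome    Total
Exposed             A           B          A+B
Nonexposed          C           D          C+D
Total              A+C         B+D         N = A+B+C+D

Risk of outcome in the exposed = A/(A+B); risk in the nonexposed = C/(C+D)
Relative risk (RR) = [A/(A+B)] / [C/(C+D)]
If the outcome is rare, A/(A+B) ≈ A/B and C/(C+D) ≈ C/D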
We can also calculate odds and odds ratios in the target population. The odds of
exposure among subjects with the outcome is A/C, the odds among those without
the outcome is B/D, and the ORE = (A/C)/(B/D) = AD/BC. We can also consider the odds of
developing the outcome among the exposed (A/B) and the nonexposed (C/D), as
well as the outcome odds ratio (ORo):
$$\mathrm{OR}_O = \frac{A/B}{C/D} = \frac{AD}{BC} \qquad (8.3)$$
In the cohort study (Table 8.3), a/b = A/B. That
is, the odds of outcome among the exposed is the same as in the target population.
Similarly, c/d = C/D for the nonexposed. In the sample, the exposure odds ratio (ORE)
can be calculated as (a/c)/(b/d) = ad/bc, and the outcome odds ratio (ORo) as (a/b)/(c/d) = ad/bc. But, since a/b = A/B and c/d = C/D, ad/bc = AD/BC.

Table 8.3. Relationship between the relative risk and the odds ratio: cohort study

                 Outcome    No outcome    Total
Exposed             a           b          a+b
Nonexposed          c           d          c+d
Total              a+c         b+d         N = a+b+c+d

Odds ratio (ORE or ORo) = ad/bc
Since the sample is representative of exposed and nonexposed subjects in the target population, a/b = A/B and c/d = C/D.
Thus ad/bc = AD/BC
(a'd'/b'c' = AD/BC.) In a case-control study without sample distortion, therefore, the sample odds ratio is equivalent to the odds ratio and estimated relative risk in the target
population.
Table 8.4. Relationship between the relative risk and the odds ratio: case-control study

                 Cases (O)    Controls (Ō)    Total
Exposed              a'            b'          a'+b'
Nonexposed           c'            d'          c'+d'
Total              a'+c'         b'+d'         N = a'+b'+c'+d'

ORE = (a'/c')/(b'/d') = a'd'/b'c'
Since the sample is representative of cases and controls in the target population, a'/c' = A/C and b'/d' = B/D.
Thus a'd'/b'c' = AD/BC

As we have seen, if the outcome is rare, all these expressions will be very close to
the true relative risk. What is meant by "rare"? The rarer the outcome, of course,
the closer the odds ratio will approximate the true relative risk. In general, if 10% or
less of the target population develop the outcome during the period of follow-up,
the approximation is fairly good [7]. In fact, many authors use the term "relative
risk" rather loosely to indicate the estimated relative risk or odds ratio determined in
a case-control study. It is probably better, however, to restrict the term "relative risk"
to the true relative risk determined in a cohort study. (Miettinen has shown that if
incident cases are used, and controls are selected periodically over the same duration of study as the cases, the odds ratio is actually equivalent to the incidence density ratio (see Section 6.2.4) without requiring the rare disease assumption [8].)
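How quickly the approximation degrades as the outcome becomes more common can be explored numerically. The following sketch holds a hypothetical relative risk fixed at 2 and varies the risk in the nonexposed; both the chosen risks and the function name are assumptions made for illustration, not values from the text.

```python
def odds_ratio_from_risks(risk_exposed, risk_nonexposed):
    """Outcome odds ratio implied by the risks in the exposed and nonexposed."""
    odds_exposed = risk_exposed / (1 - risk_exposed)
    odds_nonexposed = risk_nonexposed / (1 - risk_nonexposed)
    return odds_exposed / odds_nonexposed

TRUE_RR = 2.0  # hypothetical relative risk, held constant throughout

for risk_nonexposed in (0.001, 0.01, 0.05, 0.10, 0.25):
    risk_exposed = TRUE_RR * risk_nonexposed
    or_value = odds_ratio_from_risks(risk_exposed, risk_nonexposed)
    print(f"risk in nonexposed = {risk_nonexposed:5.3f}   "
          f"RR = {TRUE_RR:.3f}   OR = {or_value:.3f}")
```

With very low risks the odds ratio is nearly indistinguishable from the relative risk; as the risks rise, the odds ratio moves progressively further from 1 than the relative risk does.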
8.3.4 An Illustrative Example
To illustrate the algebraic concepts discussed in the previous section, let us consider
a hypothetical example. (Three decimal places are retained to demonstrate the magnitude of the "error" in using the odds ratio as an estimate of the relative risk.) Let
us imagine that newly published laboratory experiments demonstrate that rats who
are fed tea with their regular diets have an increased incidence of renal (kidney)
cancer. Because tea consumption represents such a widespread exposure in humans,
we decide to mount an epidemiologic study to test the hypothesis that tea drinkers
have an increased risk for developing renal cancer.
Because the rats did not develop their renal cancers until late in life, because we
know that carcinogenesis is a process that may require years or even decades, and
because renal cancer is a rather rare disease, we decide that a case-control study
would be the most feasible approach to this question. Table 8.5 shows the hypothetical results that we would have obtained had an entire birth cohort of 300000 from a
given community been followed to age 60. For simplicity, we shall assume a fixed
population without migration or loss to follow-up and without other causes of mortality before age 60. Two out of three individuals in this population are tea drinkers.
The relative risk of renal cancer in tea drinkers is 2.000, and the attributable risk
due to tea consumption is 0.001, or 1 per 1000. The exposure odds ratio and outcome odds ratio are both 2.002, which is very close to the true relative risk (2.000).
This is exactly what we would expect, because renal cancer is a rare disease (cumulative incidence through age 60 =
Table 8.5. Tea drinking and renal cancer: target population

                     RC       No RC      Total
Tea drinkers         400     199600     200000
Nontea drinkers      100      99900     100000
Total                500     299500     300000
Table 8.6. Tea drinking and renal cancer: case-control study

                     RC       No RC      Total
Tea drinkers         400       333        733
Nontea drinkers      100       167        267
Total                500       500       1000
(The slight difference, 2.006 vs 2.002, occurred because our sampling resulted in a
need to round off the proportion of tea drinkers to 333 of 500, or 0.6660 for the
proportion of exposed controls. In the target population, however, 199600 of
299500, or 0.6664 of the controls were exposed.)
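The figures quoted for this example can be checked with a short calculation. The sketch below is illustrative only; the function and variable names are assumptions, while the counts are those of Tables 8.5 and 8.6.

```python
def relative_risk(a, b, c, d):
    """Risk ratio; meaningful for a whole population or a cohort sample."""
    return (a / (a + b)) / (c / (c + d))

def odds_ratio(a, b, c, d):
    """Cross-product ratio ad/bc."""
    return (a * d) / (b * c)

# Table 8.5: the entire birth cohort (a, b = tea drinkers; c, d = nontea drinkers).
population = dict(a=400, b=199600, c=100, d=99900)
# Table 8.6: all 500 cases plus a sample of 500 controls.
sample = dict(a=400, b=333, c=100, d=167)

rr_population = relative_risk(**population)
attributable_risk = 400 / 200000 - 100 / 100000
or_population = odds_ratio(**population)
or_sample = odds_ratio(**sample)

# Number of exposed controls expected in a perfectly proportional sample of 500.
expected_exposed_controls = 500 * 199600 / 299500

print(f"RR (population)  = {rr_population:.3f}")
print(f"AR (population)  = {attributable_risk:.3f}")
print(f"OR (population)  = {or_population:.3f}")
print(f"OR (sample)      = {or_sample:.3f}")
print(f"expected exposed controls = {expected_exposed_controls:.1f}")
```

The expected number of exposed controls is not a whole number, which is why the sample odds ratio differs from the population odds ratio in the third decimal place.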
As alluded to earlier, there is one extremely important trap to be aware of when
analyzing a 2 x 2 table from a case-control study. The unwary "cohort-prone"
reader may be tempted to calculate a relative risk directly from such a table. For
example, in Table 8.6 some persons might naively calculate a "risk" in the exposed as
400/733, or 546 per 1000. This is obviously a ridiculously high risk, nowhere near
the 2 per 1000 in the target population. Similarly, they would calculate the "risk" in
the nonexposed as 100/267, or 375 per 1000, which is also ridiculous. Such persons
would then derive a direct "relative risk" as (400/733)/(100/267), or 1.457. This is quite different from the population value of 2.000, but not as ridiculous as the individual
"risks."
This entire procedure, of course, is totally incorrect. In a case-control study we
do not begin with a representative sample of exposed and nonexposed individuals in
the target population. Instead, we begin with a representative sample of subjects
with and without the outcome (cases and controls). The only true rates or risks we
can calculate, therefore, are the rates of exposure in cases and controls, not the rates
of outcome in the exposed and nonexposed. (We use odds instead of rates, however,
to derive an estimate of the true relative risk.)
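One way to see why the "cohort-style" calculation is meaningless is to note that its result depends on how many controls the investigator happens to sample, whereas the odds ratio does not. The sketch below compares the design of Table 8.6 with a hypothetical design using four controls per case and the same proportion of exposed controls; the second design and all names are assumptions introduced for illustration.

```python
def naive_relative_risk(a, b, c, d):
    """The mistaken cohort-style 'relative risk' applied to case-control counts."""
    return (a / (a + b)) / (c / (c + d))

def odds_ratio(a, b, c, d):
    """Cross-product ratio ad/bc."""
    return (a * d) / (b * c)

exposed_cases, nonexposed_cases = 400, 100

designs = {
    "1 control per case (Table 8.6)": (333, 167),
    "4 controls per case (hypothetical)": (1332, 668),
}

results = {}
for label, (exposed_controls, nonexposed_controls) in designs.items():
    counts = (exposed_cases, exposed_controls, nonexposed_cases, nonexposed_controls)
    results[label] = (naive_relative_risk(*counts), odds_ratio(*counts))
    print(f"{label}: naive 'RR' = {results[label][0]:.3f}, OR = {results[label][1]:.3f}")
```

The naive figure shifts when the number of controls changes, because it merely reflects the ratio of cases to controls chosen by the investigator. The odds ratio is unaffected.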
8.3.5 Analysis of Ordinal Exposure Categories
Table 8.7. Tea drinking and renal cancer: case-control study with ordinal exposure

Tea drinking              Cases (RC)    Controls (no RC)    Total
Heavy (≥ 3 cups/day)         190              115            305
Light (< 3 cups/day)         210              218            428
None                         100              167            267
Total                        500              500           1000

OR (light vs none) = (210)(167)/(218)(100) = 1.61
OR (heavy vs none) = (190)(167)/(115)(100) = 2.76
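The odds ratios accompanying Table 8.7 can be reproduced by comparing each level of tea drinking with a common reference category. The sketch below assumes that nondrinkers serve as the reference, which is the choice that yields the 1.61 and 2.76 shown with the table; the function name and data layout are illustrative.

```python
# Each entry: exposure level -> (cases, controls), taken from Table 8.7.
table_8_7 = {
    "none": (100, 167),
    "light (<3 cups/day)": (210, 218),
    "heavy (>=3 cups/day)": (190, 115),
}

def odds_ratios_vs_reference(table, reference):
    """Odds ratio of each exposure level relative to the reference level."""
    ref_cases, ref_controls = table[reference]
    return {
        level: (cases * ref_controls) / (controls * ref_cases)
        for level, (cases, controls) in table.items()
    }

ordinal_ors = odds_ratios_vs_reference(table_8_7, reference="none")
for level, value in ordinal_ors.items():
    print(f"{level:22s} OR = {value:.2f}")
```

An odds ratio that rises steadily across ordered exposure levels, as it does here, is the pattern expected of a dose-response relationship.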
The odds ratio will rarely equal exactly 1, even in the absence of true risk or
protection. In particular, small increases or decreases from 1 may occur by chance;
this is especially true if the sample size is small. In Chapter 14 we will see how the
odds ratio can be tested for statistical significance, i. e., how to assess whether its difference from 1 could have occurred by chance.
8.3.7 Calculating Etiologic Fractions from Case-Control Studies
The etiologic fraction (EF) can be derived in an entirely analogous fashion to cohort
studies (see Eq. 6.3):
$$\mathrm{EF} = \frac{P_E(\mathrm{OR}-1)}{P_E(\mathrm{OR}-1)+1} \qquad (8.4)$$
where P_E is the prevalence of exposure in the target population and OR is the odds
ratio determined in the case-control study. The rate of exposure in the control
group, b/(b+d), can be used to estimate P_E under the assumption that this rate will be
fairly close to the rate of exposure in the overall target population. For the tea
drinking and renal cancer example (Table 8.6),
$$\mathrm{EF} = \frac{(333/500)(2.01-1)}{(333/500)(2.01-1)+1} = 0.40$$
In other words, 40% of the cases of renal cancer in our (hypothetical) target population can be attributed to tea drinking.
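The same calculation can be written as a small function. This is a sketch of Eq. 8.4 only; the function name is an assumption, and the inputs are the values used in the worked example.

```python
def etiologic_fraction(prevalence_of_exposure, odds_ratio):
    """Etiologic fraction (Eq. 8.4), with the odds ratio standing in for the relative risk."""
    excess = prevalence_of_exposure * (odds_ratio - 1)
    return excess / (excess + 1)

# Prevalence of exposure estimated from the controls of Table 8.6.
p_e = 333 / 500
ef = etiologic_fraction(p_e, odds_ratio=2.01)
print(f"EF = {ef:.2f}")
```

Note that the function returns 0 when the odds ratio is 1, as it should: an exposure with no effect accounts for none of the cases.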
Because case-control studies begin with the outcome, many of the sources of sample
distortion bias are "hidden," in the sense that they have already occurred when the
case and control subjects are assembled for study, rather than occurring during the
course of follow-up. Thus mortality, migration, and referral that differ according to
both exposure and outcome will lead to a biased sample, and unless the investigator
knows the pattern of these differences, she can neither assess the magnitude of the
bias nor protect against it in the design or analysis.
In cohort studies, unequal surveillance can lead to information bias through
biased detection of the outcome, but standardized examination and testing procedures, as well as appropriate blinding, can be incorporated into the design to minimize this bias. In case-control studies, however, the outcome has already been
detected, and any systematic difference in ascertainment of outcome according to
exposure will result in a biased sample for which the investigator has no opportunity
for control or reduction.
Referral is also a potential source of sample distortion bias in case-control
studies, because the outcome is often the very reason for referrals that may, independently, be associated with exposure. This is most likely to occur if the cases are
referred by a clinician who also prescribes the exposure agent. For example, if cases
are referred by a gynecologist, whereas controls are not, cases may be more likely to
be taking estrogen than controls, even if there is no true association between
estrogen exposure and the study outcome.
Finally, as discussed in Chapter 5, case-control studies are prone to Berkson's
bias, a form of sample distortion bias that occurs when both the exposure factor and
the outcome are causes for referral [9, 10]. This is particularly likely to occur when
the exposure factor is itself a disease or other adverse health state. The exposure-outcome association becomes falsely inflated because subjects with both exposure
and outcome are more likely to be referred (they can be referred for either the
exposure condition or the outcome), and therefore included in the study sample,
than subjects with either or neither. For example, a hospital clinic-based case-control
study of hypertension (high blood pressure) as a risk factor for breast cancer may
reveal a false association (or an association of inflated magnitude) merely because
patients with both conditions have a "double" chance of being referred (selected).
Sample distortion bias is best reduced by preventive planning in the study design.
Use of incident outcomes as cases will reduce bias due to differential follow-up by
including fatal cases. Choosing cases and controls from the same referral source will
remove one source of referral bias but will not affect Berkson's bias. Control for the
latter requires information about the rates of referral for both the exposure and outcome conditions, which unfortunately is rarely available to the investigator. Finally,
detection bias can be reduced by ensuring that cases and controls had the same
opportunity for detection of the outcome. For example, a study of the risk of gallstones in patients taking cholesterol-lowering drugs might sample cases and controls
by selecting patients with positive and negative ultrasound studies respectively.
8.4.2 Information Bias
Case-control studies are prone to information (misclassification) bias in the ascertainment of exposure, especially when exposure history is obtained directly from the
study subjects, rather than from their medical records. The reason is that, unlike
cohort studies, such case-control studies require the subjects to remember accurately
their past exposures. Nondifferential (between cases and controls) errors in recall
will result in random misclassification of exposure (i. e., "noise") and thus bias the
exposure-outcome association toward a null result (OR= 1).
Differential recall bias is a graver concern, however. Theoretically, cases might
be more likely to remember exposure than controls. It is often argued, for example,
that the mother of a baby who has just been born with a severe congenital anomaly
is more likely to search her memory for a history of past exposure to a drug or a
potential toxin than a woman who has just given birth to a perfectly healthy baby.
Empirical verification of the existence, frequency, and magnitude of this bias is
sorely lacking, however [11].
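The contrasting effects of nondifferential and differential recall can be illustrated with expected counts. In the sketch below, the true exposure counts are those of Table 8.6, while the recall accuracies (the probability that a truly exposed subject reports exposure, and the probability that a truly nonexposed subject reports none) are entirely hypothetical values chosen for illustration.

```python
def reported_exposed(true_exposed, true_nonexposed, sensitivity, specificity):
    """Expected number of subjects who report exposure, given imperfect recall."""
    return true_exposed * sensitivity + true_nonexposed * (1 - specificity)

def observed_odds_ratio(cases, controls, recall_cases, recall_controls):
    """Odds ratio computed from reported, rather than true, exposure.

    cases, controls: (truly exposed, truly nonexposed) counts.
    recall_cases, recall_controls: (sensitivity, specificity) of recall.
    """
    a = reported_exposed(*cases, *recall_cases)
    b = reported_exposed(*controls, *recall_controls)
    c = sum(cases) - a
    d = sum(controls) - b
    return (a * d) / (b * c)

cases, controls = (400, 100), (333, 167)   # true exposure, Table 8.6

perfect = observed_odds_ratio(cases, controls, (1.0, 1.0), (1.0, 1.0))
# Same imperfect recall in both groups (hypothetical values).
nondifferential = observed_odds_ratio(cases, controls, (0.8, 0.9), (0.8, 0.9))
# Cases recall true exposure better than controls do (hypothetical values).
differential = observed_odds_ratio(cases, controls, (0.95, 0.9), (0.8, 0.9))

print(f"perfect recall:          OR = {perfect:.2f}")
print(f"nondifferential errors:  OR = {nondifferential:.2f}")
print(f"differential recall:     OR = {differential:.2f}")
```

Under these assumed values, equal errors in both groups pull the odds ratio toward 1, whereas better recall among cases pushes it above its true value.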
Observers are another potent source of information bias in case-control studies.
Knowledge of the case vs control status on the part of the person obtaining the
exposure history (either from medical records or by personal interview) can affect
the diligence with which positive or negative histories are obtained. If an investigator
is under a strong impression that a certain exposure is associated with a given outcome, she may press (even if unconsciously) cases much harder for their recollection
of exposure than she will controls.
Misclassification of outcome can occur when prevalent outcomes are selected as
cases and the outcome is transient (may be cured or may resolve on its own). Subjects who experienced the outcome in the past but are free of it at the time of study
will then be misclassified as controls. This is best avoided by selecting incident outcomes as cases.
Misclassification of exposure can be reduced considerably by establishing a priori criteria of exposure, by incorporating standardized methods for stimulating
memory in all subjects (cases and controls), and by blinding observers to both the
case vs control status of the subjects and (if possible) the exposure-outcome association under study. The latter can be accomplished by inquiring about prior exposure
to a number of factors, with the study factor thus "hidden" among the rest.
8.4.3 Confounding Bias
Confounding bias arises whenever factors associated with exposure are independently associated with outcome (providing that such factors do not lie on the causal
path from exposure to outcome). The principal sources of confounding in case-control studies are (a) exposure-associated differences in background variables with
independent effects on outcome (susceptibility bias), and (b) exposure accompaniments with independent effects on outcome (contamination bias). Such sociodemographic and clinical factors as age, sex, and socioeconomic status, disease severity,
and comorbidity (the coexistence of other diseases or conditions) are the types of
baseline susceptibility factors that are particularly likely to confound the exposure-outcome association in case-control studies. Examples of accompaniments would
include associated toxic exposures (e. g., cigarette smoking among asbestos miners)
and medical treatments (e.g., radiation and chemotherapy).
As with cohort studies, confounding in case-control studies can be controlled at
either the design or the analysis stage. Design features include restriction and
matching (see Chapter 5). Restriction tends to limit the target population to which
the results of the study may be applied. Matching is a powerful strategy for controlling confounding, provided the number of factors and levels of those factors are
small enough to ensure "matchability" of all (or most) of the cases. As previously
mentioned, matched designs should receive matched analyses to enhance statistical
efficiency.
A matched analysis in case-control studies is similar to the analysis of matched
cohort studies with dichotomous exposure and outcome (see Tables 6.5 and 6.6).
Table 8.8 shows a matched analysis from a hypothetical case-control study of breast
feeding as a possible protective factor against subsequent gastroenteritis (intestinal
infection) in the first year of life in 100 pairs (200 total subjects) of infants matched
for age, sex, and socioeconomic status. The matched odds ratio (ORmatched) is
defined as the ratio of the numbers of the two kinds of pairs discordant for exposure history (case exposed and control not exposed vs case not exposed and control exposed), i. e.,
b/c = 9/26 = 0.35. (The OR < 1 here indicates a protective effect of breast feeding.) The
reader is referred to other sources for analytic strategies pertaining to matched triplets, quadruplets, or variable numbers of controls per case [12, 13].
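A matched-pair odds ratio can be computed directly from a list of pairs, which makes it plain that only the discordant pairs contribute. In the sketch below the 9 and 26 discordant pairs come from the example; the way the remaining 65 concordant pairs are divided between "both breast-fed" and "neither breast-fed" is an assumption of the sketch and has no influence on the result.

```python
def matched_odds_ratio(pairs):
    """Matched-pair odds ratio.

    pairs: list of (case_exposed, control_exposed) booleans, one per matched pair.
    """
    case_only = sum(1 for case, control in pairs if case and not control)
    control_only = sum(1 for case, control in pairs if control and not case)
    return case_only / control_only

pairs = (
    [(True, False)] * 9      # case breast-fed, control not
    + [(False, True)] * 26   # control breast-fed, case not
    + [(True, True)] * 6     # concordant pairs (split assumed)
    + [(False, False)] * 59  # concordant pairs (split assumed)
)

or_matched = matched_odds_ratio(pairs)
print(f"{len(pairs)} pairs, matched OR = {or_matched:.2f}")
```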
Table 8.8. Breast feeding and gastroenteritis case-control study: matched-pair analysis

                                Cases
                           BF        Not BF
Controls     BF             6          26
             Not BF         9          59
Total pairs = 100

b = pairs in which only the case was breast-fed (BF) = 9
c = pairs in which only the control was breast-fed = 26
ORmatched = b/c = 9/26 = 0.35

A stratified (Mantel-Haenszel) analysis [14] is another powerful analytic tool for
controlling confounding and requires no adjustments in the design, providing that
potentially confounding variables are recognized and measured reproducibly and
validly. The analytic method is analogous to that shown for cohort studies (see
Table 6.7):
$$\mathrm{OR}_{MH} = \frac{\sum a_i d_i / N_i}{\sum b_i c_i / N_i} \qquad (8.5)$$
Table 8.9. Tea drinking and renal cancer: case-control study stratified by smoking status

Smokers
                     Cases (RC)    Controls (no RC)    Total
Tea drinkers            350               80            430
Nontea drinkers          75               20             95
Total                   425              100            525

ORsmokers = ad/bc = (350)(20)/(80)(75) = 1.17

Nonsmokers
                     Cases (RC)    Controls (no RC)    Total
Tea drinkers             50              253            303
Nontea drinkers          25              147            172
Total                    75              400            475

ORnonsmokers = ad/bc = (50)(147)/(253)(25) = 1.16

ORMH = Σ(a_i d_i/N_i) / Σ(b_i c_i/N_i) = [(350)(20)/525 + (50)(147)/475] / [(80)(75)/525 + (253)(25)/475] = 1.16
Table 8.6 and suggested that tea drinkers have double the risk (ORcrude=2.01) of
developing renal cancer that nontea drinkers have. We suspect, however, that the
effect may be confounded by cigarette smoking, since tea drinkers are more likely to
smoke than nontea drinkers, and cigarette smoking is an independent risk factor for
renal cancer. Table 8.9 shows the Mantel-Haenszel analysis when the data are stratified by smoking status (smokers vs nonsmokers).
The tea drinkers are indeed more likely to be smokers (430 of 733) than the
nontea drinkers (95 of 267). The stratum-specific odds ratios are similar in both
smokers and nonsmokers (1.17 vs 1.16), indicating no effect modification (interaction) by smoking status and little, if any, remaining association between tea drinking
and renal cancer. The Mantel-Haenszel odds ratio (ORMH) is the same, of course,
since it is merely a weighted average of the two stratum-specific ORs.
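The stratified analysis can be reproduced with a few lines of Python. The sketch below implements Eq. 8.5 directly; the function names and data layout are assumptions, and the counts are those of Table 8.9.

```python
def odds_ratio(a, b, c, d):
    """Cross-product ratio ad/bc for a single 2 x 2 table."""
    return (a * d) / (b * c)

def mantel_haenszel_or(tables):
    """Mantel-Haenszel summary odds ratio (Eq. 8.5) over a list of (a, b, c, d) strata."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return numerator / denominator

# (a, b, c, d) = exposed cases, exposed controls, nonexposed cases, nonexposed controls
strata = {
    "smokers": (350, 80, 75, 20),
    "nonsmokers": (50, 253, 25, 147),
}

# Collapsing the strata cell by cell recovers the crude table (Table 8.6).
crude_table = tuple(sum(cells) for cells in zip(*strata.values()))
or_crude = odds_ratio(*crude_table)
or_by_stratum = {name: odds_ratio(*table) for name, table in strata.items()}
or_mh = mantel_haenszel_or(list(strata.values()))

print(f"crude OR = {or_crude:.2f}")
for name, value in or_by_stratum.items():
    print(f"OR in {name} = {value:.2f}")
print(f"Mantel-Haenszel OR = {or_mh:.2f}")
```

The crude odds ratio of about 2 falls to about 1.16 within each smoking stratum and in the Mantel-Haenszel summary, which is the signature of confounding without effect modification.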
The final approach to controlling confounding uses multivariate statistical
adjustment techniques, which permit assessment of the exposure-outcome association while simultaneously adjusting for any number of confounding or interacting
variables. The technique usually employed is multiple logistic regression. Although a
full discussion of the method is beyond the scope of this text, it will be mentioned
briefly in Chapter 14.
Table 8.10. Advantages and disadvantages of case-control studies (vs cohort studies)
A. Advantages
1. Statistically more efficient when outcomes are rare
2. Quicker when outcomes are delayed
3. Less costly
B. Disadvantages
1. Enhanced potential for sample distortion
2. Exposure ascertainment more prone to error and bias
References
1. Feinstein AR (1973) Clinical biostatistics. XX. The epidemiologic trohoc, the ablative risk ratio,
9.1 Introduction
In cohort studies, subjects are followed in a forward direction from exposure to outcome, and inferential reasoning is from cause to effect. In case-control studies, subjects are investigated in a backward direction from outcome to exposure; inference
is from effect to cause. In cross-sectional studies, the exposure and outcome are
both determined at the same point, or cross section, in time [1]. (Hence, another
name for this design is prevalence study.) Cross-sectional studies share many of the
features of case-control studies. They carry an additional disadvantage, however;
since exposure is ascertained at the same point in time as the outcome, the investigator cannot be certain that exposure preceded outcome. As we shall see, this disadvantage has important implications for causal inference.
Cross-Sectional Studies
Exposure is measured at the same point in time as outcome. (As explained in Chapter 4, the "same point in time" is an approximation.) Since exposure and outcome
have usually been present for some time prior to the study, the investigator cannot
be certain that exposure preceded outcome. Consequently, any inference that exposure caused outcome rests on the unknown true temporal sequence of events.
When the exposure variable is a genetic, anatomic, or otherwise permanent
attribute, the issue of temporal sequence becomes less problematic for the investigator. Thus race, sex, blood type, or glucose-6-phosphate dehydrogenase (G-6-PD)
genotype, for example, can usually be assumed to precede the study outcome. For
these kinds of exposures, cross-sectional studies are equivalent to case-control
studies.
This is also true whenever exposure determined at a particular point in time is a
valid proxy for exposure occurring in the past. To the extent that dietary, smoking,
or drug-taking practices measured at one point in time accurately reflect such practices within a time range consistent with the latent period for the outcome, the
results of a cross-sectional study should be similar to the results of a case-control
study. Since such practices often change over time, however, and may even change
in response to the study outcome (i. e., as effect rather than cause), use of the cross-sectional design is best suited to outcomes with short latent periods.
ratio.
When sample selection is by outcome, analysis is similar to that used for case-control studies. Outcome is usually dichotomized (case vs control), and odds ratios
can be calculated. The classification of outcome status as "case" vs "control" does
not render the design truly case-control, however, since the exposure ascertained in
cross-sectional studies is simultaneous with, rather than prior to, the outcome. As
noted earlier, these two designs do in fact become equivalent when the exposure
variables are permanent characteristics or when the latent period from exposure to
outcome is very short. As with case-control studies, the outcome must be rare for
the odds ratio to be a reasonable estimate of the relative risk (see Section 8.5.3).
When sample selection is by random, representative, or other criteria, the results
of a cross-sectional study can be analyzed using either of the above strategies
(cohort or case-control). In general, the cohort approach is preferred, because it
usually permits a direct comparison of means or rates in groups defined by exposure, as well as calculation of true relative risks when the outcome is dichotomous.
An example of this design is insurance company life tables; such tables list the
number of people surviving for 1 year at each year of age. These data are then used
to calculate the chances of dying or living for specific periods of time after any given
age. The problem is that this approach assumes that risk factors, medical care, and
other aspects of public health are also constant, so that current mortality trends will
remain unchanged for decades to come. The assumption is of course untrue. With
general improvements in health in developed societies over the past few decades,
persons at any age today will (on average) live longer than did those of the same age
50 years ago. Since life insurance premiums are based on the current mortality experience of earlier birth cohorts, the insurance companies benefit from the lower
future mortality of later birth cohorts.
This example illustrates the effect that a given age cohort (i. e., those persons
born at the same calendar time) can have on the cross-sectional age distribution of a
clinical attribute: the so-called cohort effect. Another example of a cohort effect is the
apparent deterioration in measured intelligence with age demonstrated in several
cross-sectional studies [2]. More recent cohort studies, however, have shown that
intelligence does not diminish with aging. Since intelligence tests reflect education
and since successive generations have received more and better education, the elderly appear less intelligent than the young at any cross section in time. Cohort
effects thus represent a confounding bias of calendar time on age. Such a bias can be
discovered (and thus removed) only by analyzing the data longitudinally by age
cohort, a technique known as cohort analysis. Consider once again our example of
age and intelligence. Repeated intelligence testing of the same individuals within
each age cohort would show no decrease in scores over time, whereas members of
earlier birth cohorts would have lower scores at any given age than members of later
cohorts tested at the same age.
Studies
The major advantages of cross-sectional studies (see Table 9.1) are their rapidity and
low cost, compared with cohort studies, and their relative freedom (vis-à-vis case-control studies) from faulty or biased memory. Since exposure and outcome are
both ascertained at a single point in time, no follow-up is required. Data can therefore be obtained quickly and at little expense to the investigators. Furthermore, the
potential for information bias in ascertaining exposure is less than in case-control
studies, since subjects do not have to rely on their memory of past exposure. If
observers are adequately blinded, contemporaneous exposure is likely to be measured reproducibly and validly.
Perhaps the main contribution of the cross-sectional design is in descriptive,
rather than analytic, epidemiologic studies. Disease or other clinical phenomena can
be classified by person (age, sex, race, ethnicity, socioeconomic status), place
(nation, region, province, city, neighborhood, dwelling), or time. Cross-sectional
studies are also useful for describing the clinical spectrum (symptoms, signs, laboratory test results, pathologic findings) of a given disease entity. For example, an
investigator might carry out a cross-sectional study of a large defined group of diabetic patients to describe the proportion with retinal, renal, cardiac, or peripheral
vascular complications. Much of what we know about the varied clinical manifestations of many diseases, especially rare diseases, is based on such descriptive cross-sectional "case series." Furthermore, as discussed in Chapter 3, ascertaining the
prevalence of a variety of diseases and conditions is of great importance to public
health personnel in making their decisions about allocation of resources and targets
for preventive or other intervention strategies.
Cross-sectional designs also have a role in analytic studies. Because they can be
done quickly and inexpensively, cross-sectional studies can often provide the first
clue to an exposure-outcome association, which can serve as a stimulus for more
definitive cohort or case-control studies. In addition, in situations involving permanent exposure characteristics, short latent periods, or exposure measures that are
valid proxies for past exposures, cross-sectional and case-control studies become
equivalent. In such situations, cross-sectional studies have the advantage of being
less prone to random error and bias in measurement of exposure.
The major disadvantages of cross-sectional studies are their frequent inability to
distinguish cause from effect and their potential for sample distortion bias. The latter problem is one shared by case-control studies. The problem of distinguishing the
horse from the cart, i. e., whether exposure preceded outcome or vice versa, is
unique to cross-sectional studies and constitutes their major limitation in analytic
research. Unless the exposure variable is a permanent attribute or the latent period is
very short, causality inferences are rather tenuous. The importance of temporal
sequence in causal reasoning will be discussed further in Chapter 19.
References
1. Kramer MS, Boivin J-F (1987) Toward an "unconfounded" classification of epidemiologic
research design. J Chronic Dis 40: 683-688
2. Susser MW (1969) Aging and the field of public health. In: Riley MW, Riley JW, Johnson M
(eds) Aging and society. Russell Sage Foundation, New York, pp 137-146
Part II
Biostatistics
10.1 Variables
In Chapter 2 I defined the different types (scales) of epidemiologic variables and
discussed principles of their measurement. In particular, I classified variables as
either continuous or categorical, subdividing categorical variables into dichotomous
vs polychotomous and further subdividing polychotomous variables as either nominal or ordinal. This framework will be retained in our discussion of statistical analysis.
Introduction to Statistics
selected) might, just by "the luck of the draw," have a mean or rate that differs considerably from that of the entire population. Consequently, small samples from the
same population are likely to exhibit considerable sampling variation. On the other
hand, repeated large samples would yield sample means or rates very close to the
population value and, therefore, to each other. Thus, sampling variation is inversely
related to the sample size.
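The inverse relation between sampling variation and sample size can be illustrated by simulation. The sketch below draws repeated random samples of two sizes from a hypothetical population and compares how widely the sample means scatter. The population values, sample sizes, number of repeats, and function name are all assumptions chosen for illustration.

```python
import random
import statistics


def spread_of_sample_means(population, sample_size, repeats, rng):
    """Standard deviation of the means of `repeats` random samples."""
    means = [statistics.fmean(rng.sample(population, sample_size))
             for _ in range(repeats)]
    return statistics.stdev(means)


rng = random.Random(0)  # fixed seed so that the run is repeatable

# Hypothetical population of 10,000 measurements (mean near 100, SD near 15)
population = [rng.gauss(100, 15) for _ in range(10_000)]

spread_small = spread_of_sample_means(population, 10, 500, rng)
spread_large = spread_of_sample_means(population, 1000, 500, rng)

print("scatter of means, n = 10:  ", round(spread_small, 2))
print("scatter of means, n = 1000:", round(spread_large, 2))
```

The means of the small samples scatter far more widely than the means of the large samples, which cluster near the population value.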
Many people use the term "parameter" as a synonym for "variable" or "factor." Although this is
common in everyday parlance, I will avoid it in this text and restrict the use of "parameter" to its
accepted statistical meaning.
As outlined in Chapter 10, the major aim of descriptive statistics is to condense and
summarize a set of measurements on a large number of individuals. Suppose we
wished to describe the variable "age" (in completed years) in 250 patients who
underwent cholecystectomy (gallbladder removal) at City Hospital during a
6-month period. Merely listing the 250 patients with their corresponding ages would
convey very little useful information, because the number of individual measurements makes it difficult to discern any overall patterns in the data. In other words, it
is difficult to see the forest for the trees.
Making sense out of so many numbers requires that the data be summarized.
Perhaps the most informative method for summarizing and displaying a set of measurements for a continuous variable is by constructing a frequency distribution. This
is accomplished by categorizing the continuous data (i. e., breaking down the range
of observed values into a series of successive categories) and counting the number of
study subjects whose measurements fall within each category. When proportions or
percentages of the total group are given instead of counts, the resulting distribution
is called a relative frequency distribution.
Once a frequency distribution has been constructed, it can be displayed in either
tabular or graphic form. Table 11.1 summarizes both the frequency and relative frequency (percentage of total) distributions by age of the 250 postcholecystectomy
patients described above. Figure 11.1 is the corresponding histogram, or bar graph, in
which frequency (ordinate on left) and proportions (ordinate on right) are represented by the areas of the respective bars.
Table 11.1. Age distribution of 250 postcholecystectomy patients

Age (completed years)    Patients (n)    Patients (%)
16-20                          2              0.8
21-25                          2              0.8
26-30                          5              2.0
31-35                          9              3.6
36-40                         17              6.8
41-45                         31             12.4
46-50                         83             33.2
51-55                         46             18.4
56-60                         35             14.0
61-65                         20              8.0
Total                        250            100

Fig. 11.1. Age histogram for 250 postcholecystectomy patients [abscissa: Age (Years); left ordinate: number of patients; right ordinate: proportion of patients]

Several guidelines should be kept in mind in constructing frequency distributions
and histograms:
1. The number of categories should be sufficient, but not excessive, relative to the
total number of measurements. If too many categories are used, little data reduction (summary) is achieved; if too few are used, important information may be
obscured.
2. Overlapping categories must be avoided, i.e., the limits (cutoff boundaries) for
each category must be mutually exclusive. (For example, in a frequency distribution of systolic blood pressure measurements that included categories of 100-110
and 110-120 mmHg, one would not know in which of the two categories to
place a subject with a systolic pressure of 110 mmHg.)
3. Although not an essential requirement, interpretation is aided by the use of equal
category intervals (upper minus lower limits) and by the avoidance of open-ended
intervals (e.g., ≥ 140 or 140+ mmHg for systolic blood pressure).
Histograms provide a method for adjusting for unequal intervals in a frequency distribution. Because there are only nine total patients in the three youngest age categories in our example, it might seem advisable to "collapse" them into a single category, 16-30 years. In that case, however, the height of the corresponding histogram
bar should be 3 (or .012), rather than 9 (or .036), so that the total area of the bar
remains proportional to the overall frequency (or proportion) for the enlarged category. The area under a single bar spanning ages 16-30 would be (3)(15) = 45, which
is the same total area as the sum of the areas for the first three bars in Fig. 11.1, i. e.,
(2)(5) + (2)(5) + (5)(5). Similarly, for the relative frequency distribution, (.012)
(15) =.18 = (.008)(5) + (.008)(5) + (.020)(5).
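The area argument above can be verified with a few lines of Python. This is a minimal sketch; the function name and the choice of 5 years as the reference interval width are assumptions made for the illustration, while the counts come from Table 11.1.

```python
def bar_height(count, interval_width, unit_width=5):
    """Bar height expressed as patients per `unit_width` years of age."""
    return count * unit_width / interval_width


counts = [2, 2, 5]              # ages 16-20, 21-25, 26-30 (Table 11.1)
collapsed_count = sum(counts)   # 9 patients aged 16-30

# Height and area of the single collapsed bar (15-year interval)
collapsed_height = bar_height(collapsed_count, 15)
collapsed_area = collapsed_height * 15

# Total area of the three original bars (5-year intervals)
original_area = sum(bar_height(c, 5) * 5 for c in counts)

print(collapsed_height, collapsed_area, original_area)
```

The collapsed bar has a height of 3 and an area of 45, the same area as the three original bars combined.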
In addition to tabular and graphic methods, continuous variables can often be summarized using simple statistics that describe the frequency distribution without actually displaying it. In the interest of parsimony (data reduction), we attempt to
describe the essential attributes of a distribution using the fewest possible descriptors. Three major attributes of the distribution are usually described: central tendency, shape, and spread.
Three measures are in common use for describing central tendency: the mean,
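the median, and the mode. The sample mean or average ($\bar{x}$) is defined as follows:

\[ \bar{x} = \frac{\sum x_i}{n} \tag{11.1} \]

where $x_i$ is the value for the $i$th subject in the sample
and $n$ is the sample size.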
The median is the midmost value of the distribution, i. e., the value for which 50% of
the group have higher values and 50% have lower values. It is calculated by rank
ordering (from lowest to highest) the values and then determining the value corresponding to the middle rank, i. e., the rank order $(n + 1)/2$. Thus, if the group contains
an odd number of subjects, the median will be the value of the subject with the middle rank. If the group contains an even number of subjects, the median will fall halfway between the values of the two midmost subjects.
The mode is the most common single value, i. e., the peak of the frequency distribution. It is the least used of the three measures of central tendency because it is not
readily manipulated mathematically.
The calculation of each of the three central tendency descriptors is illustrated
below for the following serum creatinine measurements (in mg/ dl) in a group of
15 patients, arranged here in ascending order:
0.3, 0.6, 0.6, 0.7, 0.8, 0.8, 0.8, 0.9, 1.0, 1.0, 1.1, 1.3, 1.4, 1.6, 2.1
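\[ \text{mean} = \frac{\sum x_i}{n} = \frac{15.0}{15} = 1.0 \text{ mg/dl} \]

\[ \text{median} = \left(\frac{n+1}{2}\right)\text{th value} = \left(\frac{15+1}{2}\right)\text{th value} = 8\text{th value} = 0.9 \text{ mg/dl} \]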
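The three descriptors for the 15 serum creatinine values listed above can be checked with Python's standard `statistics` module. This is only a sketch of the arithmetic; the variable names are assumptions.

```python
import statistics

# Serum creatinine measurements (mg/dl) for the 15 patients, in ascending order
creatinine = [0.3, 0.6, 0.6, 0.7, 0.8, 0.8, 0.8, 0.9,
              1.0, 1.0, 1.1, 1.3, 1.4, 1.6, 2.1]

mean = statistics.mean(creatinine)      # sum of the values divided by n
median = statistics.median(creatinine)  # value with rank (n + 1) / 2 = 8
mode = statistics.mode(creatinine)      # most common single value

print(round(mean, 2), median, mode)
```

For these data the mean is 1.0 mg/dl, the median is the 8th ranked value (0.9 mg/dl), and the mode is 0.8 mg/dl, which occurs three times.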
Continuous Variables
such as length of hospital stay. A distribution is skewed to the left when the mean is
lower than the median and the left tail is longer than the right. An example is the
age distribution in many developing countries, where high birth rates and short life
expectancy result in a distribution skewed toward younger ages.
Three types of statistics are commonly used to describe the spread of a frequency distribution: range, percentile ranges, and standard deviation. The range is
the interval between the lowest and highest value in the distribution (0.3-2.1 mg/dl
in the above example of serum creatinine). A percentile range is an interval between
two specified percentile points. The inner 90 percentile range includes all values
between the 5th and 95th percentiles; the inner quartile range includes those between
the 25th and 75th percentiles.
To calculate a percentile point, we rank the values from lowest to highest and
number each observation 1, 2, 3, ..., n. The pth percentile is the value corresponding
to the $\left[\frac{(n+1)p}{100}\right]$th rank. Thus, the median is equivalent to the 50th percentile
point, since $\frac{n+1}{2} = \frac{(n+1)50}{100}$.
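The rank rule for percentile points can be sketched in Python as follows. The function name is hypothetical, and the linear interpolation used when the rank is not a whole number is an assumption added for the illustration; the text itself specifies only the rank (n + 1)p/100.

```python
import math


def percentile_point(values, p):
    """Value at rank (n + 1) * p / 100 among the ordered observations."""
    ordered = sorted(values)
    n = len(ordered)
    rank = (n + 1) * p / 100
    if rank < 1 or rank > n:
        raise ValueError("percentile falls outside the observed ranks")
    lower = math.floor(rank)
    fraction = rank - lower
    if fraction == 0:
        return ordered[lower - 1]   # ranks are 1-based, list indices 0-based
    # Assumption: interpolate linearly between the two neighboring ranks
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


creatinine = [0.3, 0.6, 0.6, 0.7, 0.8, 0.8, 0.8, 0.9,
              1.0, 1.0, 1.1, 1.3, 1.4, 1.6, 2.1]

median = percentile_point(creatinine, 50)   # rank 8
q25 = percentile_point(creatinine, 25)      # rank 4
q75 = percentile_point(creatinine, 75)      # rank 12
print(median, q25, q75)
```

For the serum creatinine data, the 25th and 75th percentile points fall at ranks 4 and 12, so the inner quartile range is 0.7-1.3 mg/dl.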
Perhaps the most useful way of describing the spread of a distribution is to indicate the average or typical "distance" between the individual measurements and the
center of the distribution. For example, one could subtract each individual value
from the group mean. How could the resulting deviations (differences) then be summarized? Obviously, the average deviation would be zero, since the positive and
negative differences around the mean would cancel. The average absolute value of
the deviation would do nicely, but absolute values are difficult to manipulate mathematically. Squaring the deviations and then taking the average of the squared deviations also accomplishes the goal of eliminating the plus and minus signs. This is the
basis of the formula for calculating a sample variance.
For a sample, the variance is denoted by $s^2$ and is defined as follows:

\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \tag{11.2} \]

where $x_i$ is the value for the $i$th subject in the sample, $\bar{x}$ the sample mean, $\sum$ the
Greek letter sigma indicating a summation over all $x_i$'s, and $n$ the sample size. The
quantity $n - 1$ is called the number of degrees of freedom. (For a given mean $\bar{x}$, $n - 1$
$x_i$'s are considered "free" or independent. The $n$th value of $x_i$ is determined by $\bar{x}$ and
the previous $n - 1$ $x_i$'s.) The formula uses $n - 1$ instead of $n$ because the variance of
the sample calculated in this way better approximates the variance of its source population.
Because the variance is based on squared deviations, it obviously does not represent the average or expected deviation of an individual from the sample. A better
representation is achieved by taking the square root of the variance. The resulting
quantity is called the standard deviation (abbreviated SD). The standard deviation
for a sample is denoted by $s$ and is defined as:

\[ s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} \tag{11.3} \]
The standard deviation appears no easier to compute than the average absolute
value of the deviations. Its justification lies in a computationally simpler formula that
is algebraically equivalent to Eq. 11.3:

\[ s = \sqrt{\frac{\sum x_i^2 - \dfrac{\left(\sum x_i\right)^2}{n}}{n - 1}} \tag{11.4} \]
In other words, one sums the squares of each value, subtracts the quotient of the
squared sum of the values divided by the sample size, divides the resulting difference
by the degrees of freedom, and then takes the square root.
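The equivalence of the definitional and computational routes to the standard deviation can be checked on the serum creatinine data used earlier in this chapter. The sketch below follows the verbal recipe given above; the variable names are assumptions.

```python
import math

creatinine = [0.3, 0.6, 0.6, 0.7, 0.8, 0.8, 0.8, 0.9,
              1.0, 1.0, 1.1, 1.3, 1.4, 1.6, 2.1]   # mg/dl
n = len(creatinine)
mean = sum(creatinine) / n

# Definitional route: average squared deviation (n - 1 in the denominator)
squared_deviations = sum((x - mean) ** 2 for x in creatinine)
sd_definition = math.sqrt(squared_deviations / (n - 1))

# Computational route: sum of squares minus (squared sum) / n
sum_x = sum(creatinine)
sum_x_squared = sum(x * x for x in creatinine)
sd_shortcut = math.sqrt((sum_x_squared - sum_x ** 2 / n) / (n - 1))

print(round(sd_definition, 3), round(sd_shortcut, 3))
```

Both routes give a standard deviation of about 0.45 mg/dl; any difference between them is rounding error in the floating-point arithmetic.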
The sample standard deviation can also be expressed as a proportion, or percentage, of the mean value. This entity
($s/\bar{x}$ or $100\,s/\bar{x}$ %) is called the coefficient of variation
(abbreviated CV). It is useful for describing measurement variation, since its
value is independent of the measurement units used. For example, the standard deviation of a set of height measurements will differ according to whether height is measured in inches or centimeters, whereas the coefficient of variation will be the same.
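The unit independence of the coefficient of variation is easy to demonstrate. The sketch below uses a handful of hypothetical heights (the values and names are assumptions for illustration) and expresses them in both inches and centimeters.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the sample mean."""
    return statistics.stdev(values) / statistics.mean(values)


# Hypothetical height measurements (assumed values)
heights_in = [60, 62, 65, 68, 70, 72]
heights_cm = [h * 2.54 for h in heights_in]

sd_in = statistics.stdev(heights_in)
sd_cm = statistics.stdev(heights_cm)
cv_in = coefficient_of_variation(heights_in)
cv_cm = coefficient_of_variation(heights_cm)

print("SD:", round(sd_in, 2), "in vs", round(sd_cm, 2), "cm")
print("CV:", round(cv_in, 4), "vs", round(cv_cm, 4))
```

The standard deviation in centimeters is 2.54 times the standard deviation in inches, whereas the two coefficients of variation agree.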
The range and percentile ranges can be used for describing the spread of any
frequency distribution, regardless of its shape. The choice of which percentile range
to report depends on the shape of the distribution. For example, the inner quartile
range would poorly describe the spread of a bimodal distribution in which the two
modes were widely separated. The standard deviation is best reserved for data that
are distributed fairly symmetrically around the group mean (i. e., data with a nonskewed distribution), because it is affected by extreme (very high or very low) values. It is most appropriate when the distribution is what statisticians call normal. We
shall have more to say about the normal distribution in Section 11.1.3.
There is one other sample statistic that is often erroneously used in the clinical
literature as a descriptor of spread: the standard error of the mean (SEM). It is
defined as:

\[ \text{SEM} = \frac{s}{\sqrt{n}} \tag{11.5} \]
Because the SEM decreases with increasing sample size, however, it is not a good
descriptor of the spread of a frequency distribution, despite its popularity. A large
sample with a high standard deviation (i. e., a wide spread) may have a small standard error. Since the SEM is always smaller than the SD, it gives the impression that
the spread of the data is less than it really is. Consequently, it may be favored by
authors who wish to minimize, rather than summarize, the variability of their data.
The use of so-called "error bars" (defined by 1 SEM) above and below mean values displayed on a graph is a common example of this practice.
The SEM is actually the standard deviation of a distribution of means obtained
in repeated sampling from a source population. As we shall see in Chapter 13, it is
important in making statistical inferences based on sample means. As a descriptor,
however, it should be avoided.
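A short calculation shows why the SEM says little about spread. In the sketch below the standard deviation is held fixed at a hypothetical value of 10 units while the sample size grows; the numbers and the function name are assumptions chosen for illustration.

```python
import math


def standard_error_of_mean(sd, n):
    """SEM = s / sqrt(n)."""
    return sd / math.sqrt(n)


sd = 10.0  # hypothetical sample standard deviation, assumed equal in every sample

for n in (25, 100, 400):
    print("n =", n, " SD =", sd, " SEM =", standard_error_of_mean(sd, n))
```

The spread of the data (SD) is identical in all three cases, yet the SEM falls from 2 to 1 to 0.5 as the sample size increases.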
The normal distribution (the familiar "bell-shaped" curve) is the most important distribution in statistics. There are several reasons for this. First, although it is a theoretical distribution based on an infinitely large population (this is called a probability
distribution), it describes the empirical distribution of certain measurements, such as
height and weight, performed on subjects from actual populations. It also closely fits
the distribution of repeated measurements obtained from the same individual (random measurement variation or error). In addition, the normal distribution serves as
the basis of statistical inference for means (Chapter 13). Its theoretical basis and
mathematical properties were first investigated by de Moivre, Laplace, and Gauss.
In honor of the latter, the normal distribution is also called the Gaussian distribution.
It must be emphasized that the term "normal" to describe this distribution has absolutely nothing to do with the usual clinical connotation of the word, indicating
absence of disease or other adverse condition. We shall return to this point in Chapter 16.
The most important property of the normal distribution is that it is completely
specified by two population parameters: the mean ($\mu$) and standard deviation ($\sigma$). As
shown in Fig. 11.2, the proportions of values lying within SD intervals of the mean
are as follows:
68.3% lie within 1 SD from the mean (i. e., $\mu \pm \sigma$)
95.4% lie within 2 SD from the mean (i. e., $\mu \pm 2\sigma$)
99.7% lie within 3 SD from the mean (i. e., $\mu \pm 3\sigma$)
For any probability distribution, the proportion (or percentage) of values lying
within the interval defined by any two values of x is equivalent to the area under the
curve subtended by those two values. The area under the entire curve is equal to 1
(or 100%), since the entire population is defined by the curve. The area under the
curve above the median is 0.5. The area under any segment of the curve is also
equivalent to the probability that any member of the group, chosen at random, will
have a value of x lying between the two values defining that segment. Thus, the
probability that any individual member of a population whose values are normally
distributed will have a value within one, two, or three standard deviations from the
group mean is 0.683, 0.954, and 0.997, respectively.
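These three probabilities can be reproduced without a z-table. The sketch below relies on the standard identity that, for a normal distribution, the area within k standard deviations of the mean equals erf(k / sqrt(2)); the function name is an assumption.

```python
import math


def area_within(k):
    """Area under a normal curve within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print("within", k, "SD:", round(area_within(k), 3))
```

The computed areas round to 0.683, 0.954, and 0.997, in agreement with the proportions quoted above.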
Although I have tried to avoid excessive use of algebraic symbols, a certain minimum is required
for clarity and economy of expression. In particular, formulas for statistical tests written out in
text would be quite unwieldy. This text will follow the usual convention of using small Roman letters to indicate sample statistics and small Greek letters for the corresponding population parameters. Here is a table summarizing the symbols introduced thus far:

                           Sample       Population
Mean                       $\bar{x}$    $\mu$
Standard deviation (SD)    $s$          $\sigma$
Variance                   $s^2$        $\sigma^2$
Sample size                $n$          (usually infinite)
[Figure: normal curve, with the intervals $\mu \pm \sigma$, $\mu \pm 2\sigma$, and $\mu \pm 3\sigma$ marked along the abscissa]
Since each normal curve is specified by its mean and standard deviation, different normal curves may differ from one another in either or both parameters
(Fig. 11.3). Curves A and B have the same mean but different standard deviations.
Curves B and C have the same standard deviation but different means. Curves A and
C differ on both parameters. These differences can be eliminated, however, by
transforming any normal distribution to a single standard normal distribution (also
called the z-distribution). The transformation is called a z-transformation (or z-score)
and is achieved by taking the x value, subtracting the population mean, and dividing
by the standard deviation:

\[ z = \frac{x - \mu}{\sigma} \tag{11.6} \]
\[ \frac{x - \mu}{\sigma} \]
You should also be aware that some z-tables show the area between the two tails, i. e., between
- z and + z. This is just 1 minus the area in the two tails. All z-tables should contain a description
of which areas are tabulated.
Fig. 11.4. Diastolic blood pressure in adolescent boys: finding the upper 10% "cutoff" [abscissa: 70 and 82.8 mmHg; z-score: +1.28; upper-tail area: .100]

Fig. 11.5. Diastolic blood pressure in adolescent boys: probability of randomly choosing a boy with
a value between 55 and 95 mmHg
The corresponding areas in the tails beyond these two z values are:
0.067 for $z_1$
0.006 for $z_2$
The area between these two values is 1- (0.067 + 0.006) = 0.927. Thus, the probability is 0.927.
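The same tail areas can be obtained with Python's `statistics.NormalDist` in place of a printed z-table. The sketch assumes a mean of 70 mmHg and a standard deviation of 10 mmHg; these values are not stated in the surviving text and are inferred from the figures (82.8 mmHg at z = +1.28 and 89.6 mmHg at z = +1.96). The function and variable names are likewise assumptions.

```python
from statistics import NormalDist

# Assumed population parameters (inferred from Figs. 11.4 and 11.6)
MEAN_DBP = 70.0   # mmHg
SD_DBP = 10.0     # mmHg


def z_score(x, mean, sd):
    """z-transformation: (x - mean) / sd."""
    return (x - mean) / sd


z1 = z_score(55, MEAN_DBP, SD_DBP)   # lower limit
z2 = z_score(95, MEAN_DBP, SD_DBP)   # upper limit

standard_normal = NormalDist()            # mean 0, SD 1
lower_tail = standard_normal.cdf(z1)      # area below z1
upper_tail = 1 - standard_normal.cdf(z2)  # area above z2
between = 1 - (lower_tail + upper_tail)

print(z1, z2, round(lower_tail, 3), round(upper_tail, 3), round(between, 3))
```

Under these assumptions the z-scores are -1.5 and +2.5, and the computed areas round to 0.067, 0.006, and 0.927.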
Finally, to illustrate the use of the two-tailed z-tables, let us calculate the inner
95 percentile range of this distribution (Fig. 11.6). The total area in the upper and
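lower tails must be 0.05, which corresponds to z = 1.96 in the two-tailed table. Thus,
the lower and upper z-scores will be -1.96 and +1.96, respectively. We then solve
for the corresponding x's:

\[ x_1 = \mu - 1.96\sigma = 70 - (1.96)(10) = 50.4 \text{ mmHg} \]

\[ x_2 = \mu + 1.96\sigma = 70 + (1.96)(10) = 89.6 \text{ mmHg} \]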
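The limits of the inner 95 percentile range can also be computed directly. As before, the mean of 70 mmHg and standard deviation of 10 mmHg are assumptions inferred from the figures rather than values stated in the surviving text, and the variable names are hypothetical.

```python
from statistics import NormalDist

# Assumed population parameters (inferred from Figs. 11.4 and 11.6)
MEAN_DBP = 70.0   # mmHg
SD_DBP = 10.0     # mmHg

# z-score that leaves 0.025 in each tail (0.05 in the two tails combined)
z = NormalDist().inv_cdf(0.975)

x_lower = MEAN_DBP - z * SD_DBP
x_upper = MEAN_DBP + z * SD_DBP

print(round(z, 2), round(x_lower, 1), round(x_upper, 1))
```

Solving z = (x - mean)/SD for x at z = -1.96 and z = +1.96 gives limits of about 50.4 and 89.6 mmHg under these assumptions.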
Fig. 11.6. Diastolic blood pressure in adolescent boys: finding the inner 95 percentile range [abscissa: 70 and 89.6 mmHg; z-score: +1.96]

Categorical Variables
[...] 15/250 [...], 25/250 [...], 70/250 (28%) with less pain, and the same 140/250 (56%) with no pain. The sum of the proportions must equal 1, and that of the percentages, 100%.
When many categories are involved, the use of a table, histogram, or pie chart
can often aid the reader in appreciating the relative magnitudes of the proportions in
each category. The principles for constructing a table or histogram are the same as
those discussed in Section 11.1.1, since continuous variables must first be categorized in order to use these methods. The pie chart achieves the same effect by dividing a circle into "slices" whose size corresponds to the proportion or percent in each
category. The size of each slice can be determined by calculating the angle formed
by the two "edges" of the slice; each percent = 360°/100 = 3.6°. For our postoperative pain study, the angles would be 6%, 10%, 28%, and 56% of 360°, or 21.6°,
36.0°, 100.8°, and 201.6°, respectively, for the four pain categories (see Fig. 11.7).

Fig. 11.7. Pain outcome in 250 postcholecystectomy patients [pie chart; surviving slice labels: More pain, No change, No pain]
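The slice angles follow from a single multiplication, as the short Python sketch below shows. The percentages are those quoted in the text for the four pain categories; the variable names are assumptions.

```python
# Percentages in the four pain categories, as quoted in the text
percentages = [6, 10, 28, 56]

# Each percent corresponds to 360 / 100 = 3.6 degrees of the circle
angles = [p * 360 / 100 for p in percentages]

for p, angle in zip(percentages, angles):
    print(p, "% ->", round(angle, 1), "degrees")

total_angle = sum(angles)
```

The four angles add up to the full 360 degrees of the circle, just as the percentages add up to 100.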
\[ \frac{n!}{t!\,(n-t)!}\,\pi^{t}(1-\pi)^{n-t} \]

where $n!$ = $n$ factorial [$= n(n-1)(n-2) \cdots 1$]
$t!$ = $t$ factorial
$(n-t)!$ = $(n-t)$ factorial
and $0! = 1$
Fig. 11.8. The binomial distribution: expected numbers (and their probabilities) of affected children
in families of ten children with one parent having Huntington's chorea [ordinate: probability, .05 to .25; abscissa: 0 to 10 affected children]
To illustrate, Fig. 11.8 depicts the binomial probability distribution for the expected
number of affected children (t) in families of ten children (n= 10) in which one parent has Huntington's chorea, a fatal degenerative brain disorder. Because the inheritance pattern of Huntington's chorea is autosomal dominant, the probability of any
one child eventually developing the disease is 0.5 ($\pi$ = 0.5). Thus, the probability of
having exactly three affected children in a family of ten children is:

\[ \frac{10!}{3!\,7!}\,(0.5)^{3}(0.5)^{7} = 120\,(0.5)^{10} = 0.117 \]
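The binomial probabilities for this example can be computed with Python's `math.comb`. This is a minimal sketch; the function and variable names are assumptions.

```python
from math import comb


def binomial_probability(t, n, pi):
    """Probability of exactly t affected children among n, each with risk pi."""
    return comb(n, t) * pi ** t * (1 - pi) ** (n - t)


n_children = 10
risk = 0.5   # autosomal dominant inheritance from one affected parent

p_three = binomial_probability(3, n_children, risk)
distribution = [binomial_probability(t, n_children, risk)
                for t in range(n_children + 1)]

print(round(p_three, 3))
print([round(p, 3) for p in distribution])
```

The probability of exactly three affected children is 120/1024, or about 0.117, and the probabilities for 0 through 10 affected children sum to 1.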
References
1. Ingelfinger JA, Mosteller F, Thibodeau LA, Ware JH (1983) Biostatistics in clinical medicine.
MacMillan, New York, pp 150-156
2. Bailar JC, Ederer F (1964) Significance factors for the ratio of a Poisson variable to its expectation. Biometrics 20: 639-643
research either because she thinks an exposure-outcome association exists, or because she or others are suspicious enough that it might exist to make such a study
worthwhile. Ho, however, is an artificial "straw man" that provides a reference for
examining the departure of the data actually obtained from the data that would be
expected under Ho. For our smoking-MI example, the null hypothesis is that cigarette smoking is not a risk factor (i. e., does not alter the risk) for subsequent MI in
the target population.
On occasion, the null hypothesis can be similar to the research hypothesis if the
researcher believes that there is no exposure-outcome association in the target population. In general, however, the research and null hypotheses are entirely different
and need to be kept separate. Once this distinction is clear, the testing of Ho then
becomes the basis for assessing the "statistical significance" of an association
observed in the study sample.
We begin testing the null hypothesis by assuming it to be true, i. e., that no exposure-outcome association exists in the target population from which the study sample is
(hypothetically) randomly selected. We then calculate the probability under that
assumption of obtaining, by chance alone, a degree of association between exposure
and outcome at least as strong as that observed in the sample. In other words, we
calculate the probability of obtaining such an association by chance if the study sample had been randomly chosen from a target population with no such association.
This probability is called the P value. If P is less than a certain amount (by convention, 0.05), we consider Ho to be so unlikely that we reject it.
To recapitulate, we start out with the assumption that exposure and outcome are
not associated (i.e., that Ho is true). If, under that assumption, the probability of
obtaining, by chance, an exposure-outcome association at least as large as the one
observed is very small, we then reject our initial assumption (Ho). Rejecting Ho
means that we infer that the study sample is not a random sample from a target population in which Ho is true, but rather from a different target population in which
exposure and outcome are associated.
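The logic of "at least as extreme as observed, assuming Ho" can be made concrete with a deliberately simple calculation. The sketch below is not a test described in this text; it assumes a hypothetical study of 10 independent observations, each of which would favor the exposed group with probability 0.5 if Ho were true, and in which 9 actually did so. The function name and the numbers are assumptions for illustration.

```python
from math import comb


def p_value_at_least(observed, n, pi_null=0.5):
    """Probability, under Ho, of `observed` or more favorable results out of n."""
    return sum(comb(n, t) * pi_null ** t * (1 - pi_null) ** (n - t)
               for t in range(observed, n + 1))


# Hypothetical data: 9 of 10 observations favor the exposed group
p_value = p_value_at_least(9, 10)
reject_null = p_value < 0.05   # conventional threshold

print(round(p_value, 4), reject_null)
```

Under Ho, a result of 9 or 10 out of 10 would arise by chance only 11 times in 1024 (P of about 0.011), so Ho would be rejected at the conventional threshold. Note that this is the probability of the data given Ho, not the probability that Ho is true.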
Many investigators erroneously interpret the P value as the probability that the
null hypothesis is true, which is actually the probability the investigator would like
to know. Unfortunately, the frequentist approach to hypothesis testing is conditional
on (i.e., assumes) the truth of the null hypothesis. The P value thus provides a very
indirect measure of the probability that Ho is true. It represents the plausibility of the
data given Ho, not the plausibility of Ho given the data. I will have more to say
about the indirectness of this approach (and discuss an alternative approach) in Section 12.4.
The P value threshold for rejection of the null hypothesis should be established a
priori. This threshold is called the α-level and, as indicated above, it is conventionally set at 0.05. We reject Ho if the probability of obtaining the observed or more
extreme results by chance, under the assumption that Ho is true, is less than 0.05.
Conversely, we are unwilling to reject Ho (i. e., we do not consider it sufficiently
unlikely) if P> 0.05.
There is nothing "magic" about 0.05. It has come to be the accepted α-level for
most studies in the medical and scientific literature. The difference, however,
between a P value of 0.04 and 0.06 is very small; yet, this small difference can affect
whether a scientific manuscript is accepted or rejected for publication or whether its
results are believed or not. The sensible scientist will keep these artificial distinctions
in their proper place and will not discard results if the P value is above 0.05, or automatically accept them as proven merely because the P value is below 0.05.
12.2.2 Type I Error
Even if P < 0.05, we may be wrong by rejecting Ho, but we consider that the probability of being wrong is acceptably low. A P value of 0.05 simply means that the
results obtained in the study sample could have occurred by chance 5% of the time
when the null hypothesis is true for the target population. Once out of every
20 samples, on average, rejecting a null hypothesis when P= 0.05 will result in an
error. In other words, we will be rejecting the null hypothesis when, in fact, the null
hypothesis is true. This type of error (erroneous rejection of the null hypothesis) is
called a Type I error [1,2]. Whenever we reject the null hypothesis, we run the risk
of a Type I error. The lower the P value, the lower the risk. When we reject the null
hypothesis with a P value of 0.001, we have only one chance in a thousand of making a Type I error.
Because clinical investigation is usually expensive and time consuming, studies
are often used to answer several questions at once, that is, to test several hypotheses.
Interventions may be compared for multiple outcomes, or a variety of clinical,
sociodemographic, or treatment factors may be examined for their effects on one or
more outcomes. When multiple tests of significance are performed, some significant exposure-outcome associations are likely to arise merely by chance. In fact,
for every 20 independent tests of Ho, one (on average) will result in statistical
significance just by chance. If 100 tests are carried out, and ten are associated
with P values < 0.05, it is impossible to know which of the ten are mere chance findings and which represent "truly" significant associations. Similarly, in the usual situation of a single study on a single sample, there is no way to be certain whether
an observed association represents a true (for the target population) finding or a
Type I error.
To protect against a plethora of Type I errors, some statisticians advocate dividing the threshold α-level for rejecting Ho by the number of tests performed. Because
many of the outcomes are associated with one another, however, the probability of
their joint occurrence is usually greater than the product of their individual probabilities (i. e., they are not statistically independent). Thus, such a procedure may be
overly conservative; it tends to attribute true differences to chance. At the very least,
however, the investigator should indicate the number of tests performed in addition
to the number achieving statistical significance and should modify his inferences
accordingly.
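The arithmetic behind the multiple-testing problem can be sketched as follows. The calculation assumes that all tests are independent and that Ho is true for every one of them, which (as noted above) is rarely the case in practice; the function names are assumptions.

```python
def chance_of_any_false_positive(tests, alpha=0.05):
    """Probability of at least one Type I error among independent tests of true Ho."""
    return 1 - (1 - alpha) ** tests


def divided_alpha(tests, alpha=0.05):
    """Threshold obtained by dividing the alpha-level by the number of tests."""
    return alpha / tests


tests = 20
expected_false_positives = tests * 0.05
any_false_positive = chance_of_any_false_positive(tests)
stricter_alpha = divided_alpha(tests)
any_false_positive_strict = chance_of_any_false_positive(tests, stricter_alpha)

print(expected_false_positives, round(any_false_positive, 3),
      stricter_alpha, round(any_false_positive_strict, 3))
```

With 20 independent tests, one false positive is expected on average and the chance of at least one is about 0.64. Dividing the α-level by 20 brings the overall chance back below 0.05, at the cost of a much stricter threshold for each individual test.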
Multiple testing becomes an even greater problem when the research hypothesis
arises post hoc, i. e., after the data are collected, rather than a priori. When observed
data are used to generate hypotheses for statistical testing, the calculated P values
do not accurately reflect the true probability of an exposure-outcome association
occurring by chance. After all, it is virtually certain (P= 1) that some association will
occur by chance. But betting on a horse after a race is not usually rewarded at the
ticket window. Similarly, performing a statistical test of significance on an observed
association because it "looks interesting" will result in significant P values that bear
no relationship to the chance occurrence (given Ho) of an association hypothesized
a priori.
12.2.3 Statistical Significance vs Clinical Importance
Regardless of whether we are correct or not in rejecting the null hypothesis, a statistically significant exposure-outcome association may or may not be clinically important. For example, suppose we wished to test the hypothesis that consumption of a
new infant formula leads to a reduction in the serum sodium concentration. We
might then compare the mean serum sodium concentration in a group of babies fed
this formula with that in a group of babies fed a standard commercial formula. If the
results were 139 and 140 mEq/1 respectively, the association between the new formula and a lower serum sodium would be clinically trivial, despite the fact that with
a large sample size such a difference might be statistically significant. As will be discussed in Section 12.3, the reverse situation can also arise; a clinically important
association may not achieve statistical significance.
Thus, the clinical importance of an observed association or difference is a clinical, not a statistical, decision [3]. A clinical investigator should never let a consultant
or journal editor convince him that a very low (highly significant) P value can compensate for a difference too small to be useful.
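A small numerical sketch may make the point concrete. The 1 mEq/l difference comes from the example above; the standard deviation of 3 mEq/l, the group sizes, and the function name are assumptions chosen only for illustration, and the P value uses a simple normal approximation for the difference between two group means.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(diff: float, sd: float, n_per_group: int) -> float:
    """Normal-approximation P value for a difference between two group means."""
    se = sd * sqrt(2.0 / n_per_group)  # standard error of the difference
    z = abs(diff) / se
    return 2.0 * (1.0 - NormalDist().cdf(z))


diff = 140.0 - 139.0  # mEq/l, the difference in the formula example
sd = 3.0              # assumed SD of serum sodium in each group (hypothetical)

for n in (20, 200, 2000):
    # The same trivial difference becomes "significant" once n is large enough.
    print(n, round(two_sided_p(diff, sd, n), 4))
```

The clinical meaning of a 1 mEq/l difference is the same in every row; only the P value changes with the sample size.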
12.2.4 Directional vs Nondirectional Testing of Ho
I have already discussed the important distinction between the research hypothesis
and the null hypothesis. I have also indicated that the research hypothesis can be put
in the form of a statement or a question. What I shall now consider is the directionality or nondirectionality of the research hypothesis, and what that implies in terms
of testing for statistical significance, i.e., testing of Ho.
In directional (or unidirectional) hypothesis testing, the research hypothesis
implies not only that there is an association between exposure and outcome but also
indicates the direction (positive or negative) of that association. In our smoking-MI
example, the research hypothesis is that smoking increases the risk of myocardial
infarction. In other words, we suspect that smoking might either increase the risk of
MI or have no effect, but we have no suspicion whatsoever that it protects against
MI. A directional test of Ho would thus be appropriate.
In nondirectional (or bidirectional) hypothesis testing, the investigator may have
no a priori knowledge of the direction of the association under study. Consider, for
Table 12.1. The two errors of hypothesis testing

                            "Truth"
Inference              Ho false         Ho true
Reject Ho              Correct          Type I error
Do not reject Ho       Type II error    Correct
error, the probability of which is equal to the P value. When we do not reject Ho,
Ho might still be untrue. In other words, exposure and outcome might indeed be
associated in the underlying target population. The erroneous inference that Ho is
true when it is not is called a Type II error.
Thus, the decision either to reject or not reject Ho is an inference, an inference
that may be correct or incorrect. Depending on which inference we make, we are at
risk for committing either a Type I or a Type II error. We are never at risk for both
types of erroneous inference. These relationships are illustrated in the 2 × 2 table
shown in Table 12.1.
The actual probability of a Type II error is signified by the Greek letter β (hence
the alternative term, "beta error"). In the design phase of a study, β can be calculated by constructing an alternative hypothesis, HA, which postulates a clinically
important degree of association in the target population. The alternative hypothesis
is usually directional and is often the same as the research hypothesis. β is the probability of failing to detect, by chance, a degree of association at least as large as the
degree specified by HA. "Detect" here indicates rejection of the null hypothesis, so β
can also be defined as the probability of not rejecting Ho when HA is in fact true.
1 - β is called the statistical power and is the probability of detecting the specified
association in the study sample, i.e., the probability of rejecting Ho when HA is true.
If a researcher wants to show that exposure and outcome are not associated to a
clinically important degree (i. e., the research hypothesis is similar to the null
hypothesis), the probability of failing to detect the degree of association specified by
HA should be very low. Ho can never be "proved," but the lower the plausibility of
HA, the higher the statistical power and the greater the assurance that exposure and
outcome are not associated to a clinically important degree. Minimizing the potential for Type II error is essential to avoid missing such an association.
In designing a study, the probability of a Type II error is determined by three
factors. β will be higher (and 1 - β correspondingly lower):
1. The smaller the hypothesized degree of association under HA
2. The smaller the sample size n
3. The larger the sample variance s² (for continuous variables)
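The three factors listed above can be seen at work in a rough numerical sketch. The formula used here, power ≈ Φ(δ/SE - zα) with SE = σ√(2/n), is a standard normal approximation for a one-sided comparison of two group means and is an assumption of this sketch, not a formula taken from the text; the function name and all numerical inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(delta: float, sigma: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power (1 - beta) of a one-sided, two-group comparison of means."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)  # critical value for the one-sided test
    se = sigma * sqrt(2.0 / n_per_group)         # standard error of the difference
    return NormalDist().cdf(delta / se - z_alpha)


base = approx_power(delta=0.5, sigma=1.0, n_per_group=30)
print("baseline power:           ", round(base, 3))
print("smaller difference (HA):  ", round(approx_power(0.25, 1.0, 30), 3))
print("smaller sample size:      ", round(approx_power(0.5, 1.0, 15), 3))
print("larger variance:          ", round(approx_power(0.5, 2.0, 30), 3))
```

Each of the three changes lowers the power below the baseline value, i.e., raises β.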
Bayesian inference takes its name from Rev. Thomas Bayes, an eighteenth-century clergyman and mathematician who discovered an important relationship
between conditional probabilities and expressed this relationship in what is now
referred to as Bayes' theorem. Bayesian inference has two features that make it an
attractive alternative to the frequentist approach. First, instead of the observed sample data being referred to either Ho or HA, the data are simultaneously examined for
their consistency with both Ho and HA [4]. The probability (called the likelihood)
of obtaining the exact degree of observed exposure-outcome association given (i.e.,
conditional on) HA is compared with the likelihood of those data given Ho by forming what is called a likelihood ratio, or LR:
$$\mathrm{LR} = \frac{P(\text{observed association} \mid H_A)}{P(\text{observed association} \mid H_0)} \tag{12.1}$$
where the vertical lines preceding HA and Ho are read as "given" or "conditional
on."
The second attractive feature of the Bayesian approach is that the likelihood
ratio is combined with the ratio of the prior probabilities of HA and Ho (i. e., before
the study was carried out and the data were obtained) to yield the ratio of the posterior (after the data) probabilities. As discussed in Chapter 8, the ratio of a probability
to its complement is called an odds. Thus, given that either HA or Ho is true, the
ratio of their probabilities, P(HA)/P(Ho), is the odds of HA. According to Bayes' theorem,
the posterior odds of HA (which is what the investigator really wants to know) is the
product of the prior odds and the likelihood ratio:
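$$\frac{P_{\text{posterior}}(H_A)}{P_{\text{posterior}}(H_0)} = \frac{P_{\text{prior}}(H_A)}{P_{\text{prior}}(H_0)} \times \frac{P(\text{observed association} \mid H_A)}{P(\text{observed association} \mid H_0)} \tag{12.2}$$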
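As a rough illustration of how Eq. 12.2 is used, the following sketch multiplies a prior odds by a likelihood ratio and converts the result back to a probability. The prior probability of 0.10 and the likelihood ratio of 5 are hypothetical values chosen only for the example, as are the function names.

```python
def posterior_odds(prior_odds: float, likelihood_ratio: float) -> float:
    """Eq. 12.2: posterior odds of HA = prior odds x likelihood ratio."""
    return prior_odds * likelihood_ratio


def odds_to_probability(odds: float) -> float:
    """Convert an odds back to a probability (odds = p / (1 - p))."""
    return odds / (1.0 + odds)


prior_probability_ha = 0.10  # hypothetical belief in HA before the study
likelihood_ratio = 5.0       # hypothetical: data 5 times as likely under HA as under Ho

prior = prior_probability_ha / (1.0 - prior_probability_ha)
posterior = posterior_odds(prior, likelihood_ratio)

print(round(prior, 3))                           # prior odds of HA
print(round(posterior, 3))                       # posterior odds of HA
print(round(odds_to_probability(posterior), 3))  # posterior probability of HA
```

Under these assumed inputs the data raise the probability of HA from 0.10 to roughly 0.36: a substantial change in belief, but not a reversal of it.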
If I tossed a coin five times and got five successive "heads," few people would
conclude that the coin was unbalanced, despite the significant "P value" of
(1/2)^5 = 1/32 = 0.03. Most would attribute the occurrence to chance, that is, a run
of good (or bad) luck. Similarly, a single study that comes up with a very surprising
finding should make the investigator or clinician alter his view of the world (at least
to the extent of being less surprised if a similar result occurs in a subsequent study)
but should not turn it upside down.
Since scientific inference inevitably involves incorporating new observations with
prior beliefs, the only thing new about Bayesian inference is that the prior beliefs are
made explicit and require quantification. Many researchers are uncomfortable with
being forced to quantify their uncertainty, and this may be why the frequentist
approach to statistical inference still predominates. Recent application of Bayesian
principles to other aspects of clinical epidemiology, however, such as the interpretation of diagnostic tests (see Chapter 16), has resulted in an increased awareness and
appreciation of the Bayesian approach [5]. It would not be surprising to see this
approach applied increasingly in the future by clinical epidemiologists and statisticians.
References
1. Colton T (1974) Statistics in medicine. Little, Brown, Boston, pp 112-125
2. Feinstein AR (1975) Clinical biostatistics. XXXIV. The other side of statistical significance:
alpha, beta, delta, and the calculation of sample size. Clin Pharmacol Ther 18: 491-505
3. Feinstein AR (1973) Clinical biostatistics. XXIII. The role of randomization in sampling, testing,
allocation, and credulous idolatry (part 2). Clin Pharmacol Ther 14: 898-915
4. Miettinen OS (1985) Theoretical epidemiology: principles of occurrence research in medicine.
Wiley, New York, pp 107-128
5. Browner WS, Newman TB (1987) Are all significant P values created equal? The analogy
between diagnostic tests and clinical research. JAMA 257: 2459-2463
Suppose, hypothetically, we chose a random sample of n subjects from some infinitely large population of known mean μ and standard deviation σ, determined the
mean (x̄) of that sample, replaced the same subjects back into the source population,
then chose another random sample of the same size n, and repeated this process
over and over again. What distribution would the repeated sample x̄'s have? It turns
out that if n is large enough, then the x̄'s form a normal distribution, regardless of
the distribution of the source population. The mean of this normal sampling distribution of x̄'s is μ, the population mean; its standard deviation (called the standard error
the investigator can determine how likely it is (i.e., the probability, or P value) that
his sample originated from a source population with the same mean as the theoretical sampling distribution. This is equivalent to the probability that the sample mean
observed (x̄) would occur in random sampling from a source population with a
given mean μ and standard deviation σ.
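The repeated-sampling thought experiment described above can be imitated by simulation. The sketch below is only an illustration under assumed settings: the source population is taken to be exponential (markedly non-normal) with mean 1 and SD 1, and the sample size, number of repetitions, and random seed are arbitrary choices.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(12345)  # fixed seed so that the run is reproducible

n = 50              # size of each random sample
repeats = 5000      # number of repeated samples
mu = sigma = 1.0    # an exponential population with rate 1 has mean 1 and SD 1

# Draw a sample, record its mean, and repeat the process many times.
sample_means = [
    sum(random.expovariate(1.0) for _ in range(n)) / n
    for _ in range(repeats)
]

print(round(mean(sample_means), 3))   # close to the population mean mu
print(round(stdev(sample_means), 3))  # close to sigma / sqrt(n)
print(round(sigma / sqrt(n), 3))      # the theoretical standard error of the mean
```

Although the individual values are far from normally distributed, the simulated sample means cluster around μ with a spread close to σ/√n.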
$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$$
In other words, the observed sample mean x̄ is compared with its expected (under
Ho) value μ by dividing its deviation from μ by its standard ("expected") deviation.
The only difference in this critical ratio from the z-score introduced in Chapter 11 is
that the probability distribution here is a sampling distribution of means (x̄'s), rather
than a distribution of individual values (x's). Thus, the SD of the distribution is σ/√n
rather than σ. Since this sampling distribution is normal (Gaussian) according to the
Central Limit Theorem, the same z-tables can be used to interpret the resulting values of z (the critical ratio) and to calculate P values.
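The critical ratio and its P value can be computed directly, as in the following sketch. The numbers are illustrative assumptions: the sample mean, reference mean, and sample size are borrowed from the hemoglobin example used later in the chapter, and the population SD of 1.17 g/dl is assumed to be known purely for the sake of the example.

```python
from math import sqrt
from statistics import NormalDist


def z_ratio(sample_mean: float, mu: float, sigma: float, n: int) -> float:
    """Critical ratio: deviation of the sample mean from mu, in standard errors."""
    return (sample_mean - mu) / (sigma / sqrt(n))


# Assumed inputs: sample mean 15.3 g/dl, reference mean 14.7 g/dl,
# population SD taken as 1.17 g/dl (hypothetical), n = 30.
z = z_ratio(sample_mean=15.3, mu=14.7, sigma=1.17, n=30)
one_sided_p = 1.0 - NormalDist().cdf(z)  # area in the upper tail beyond z

print(round(z, 3))
print(round(one_sided_p, 4))
```

The standard normal distribution from the Python standard library plays the role of the z-table here.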
13.1.2 The t-Distribution
Unfortunately, the use of the standard normal z-distribution to test the statistical
significance of x̄, given a known μ, also depends on knowing σ, the population SD.
When σ is unknown, a different probability distribution, the t-distribution, is
required to test the significance of x̄. The t-distribution was discovered by William
S. Gosset, a statistician working at the Guinness Brewery in the early years of this
century who published his observations under the pseudonym of "Student."
The t-distribution differs from the z-distribution in that, although its mean is the
same (namely, μ), it uses the sample standard deviation s as an estimate of σ in the
SEM. It is based on the following critical ratio:
$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$
Like the z-distribution, the t-distribution is bell shaped. Its two "tails," however, are
higher than the tails of the normal distribution. Thus, the calculated P values (which
correspond to the area under the curve of the tails) are higher, i.e., less significant,
for a given difference between x̄ and μ. Like the z-distribution, the t-distribution
depends on the requirements of the Central Limit Theorem. The assumption that
the underlying population distribution (of x values) is normal is particularly important for small samples.
The value of t is interpreted using the t-distribution with n - 1 degrees of freedom,
and P values can be calculated accordingly. Unlike the z-distribution, there is a different t-distribution according to the number of degrees of freedom. For small
samples, the difference from the z-distribution is quite marked. For large samples
(n ≥ 30), the t-distribution becomes extremely close to the z-distribution, and the
latter can be used for making inferences.
Although there is a different t-distribution for each different number of degrees
of freedom (df), it would be extremely cumbersome to have a separate t-table for
each df. t-Tables, therefore, provide the various important P values (0.10, 0.05, 0.01,
0.001) in the columns. The minimum values of t necessary to yield those P values are
listed in the rows according to the number of degrees of freedom. Such a table is
Appendix Table A.4.
$$\frac{\bar{x} - \mu}{s/\sqrt{n}}$$

$$\bar{x} \pm t\left(\frac{s}{\sqrt{n}}\right) = 14.7 \pm 0.6\ \text{g/dl} = 14.1\ \text{to}\ 15.3\ \text{g/dl}$$
If we wished to be 99%, instead of 95%, "confident" about μ, we would use the t
required for P = 0.01 in Eq. 13.1. For the above example, t = 2.921, and the resulting
confidence interval is:
$$14.7 \pm 2.921\left(\frac{s}{\sqrt{n}}\right) = 14.7 \pm 0.8\ \text{g/dl} = 13.9\ \text{to}\ 15.5\ \text{g/dl}$$
When n ≥ 30, we can use the standard normal (z-) distribution to calculate confidence intervals. We merely substitute z for t in Eq. 13.1. The "critical" values of z for
95% and 99% confidence are 1.96 and 2.58 respectively.
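The two intervals quoted above can be reproduced with a few lines of arithmetic. The sketch assumes that the worked example refers to the sea-level group described later in the chapter (n = 17, mean 14.7 g/dl, s = 1.12 g/dl); the value t = 2.921 is quoted in the text, whereas t = 2.120 is the standard tabulated value for 16 degrees of freedom at a two-sided P of 0.05 and is supplied here as an assumption. The function name is hypothetical.

```python
from math import sqrt


def mean_confidence_interval(sample_mean: float, s: float, n: int, t_crit: float):
    """Confidence interval for a mean: sample mean +/- t * s / sqrt(n)."""
    half_width = t_crit * s / sqrt(n)
    return sample_mean - half_width, sample_mean + half_width


# Assumed inputs: sea-level group with n = 17, mean 14.7 g/dl, s = 1.12 g/dl.
ci95 = mean_confidence_interval(14.7, 1.12, 17, t_crit=2.120)  # 16 df, two-sided P = 0.05
ci99 = mean_confidence_interval(14.7, 1.12, 17, t_crit=2.921)  # 16 df, two-sided P = 0.01

print([round(limit, 1) for limit in ci95])
print([round(limit, 1) for limit in ci99])
```

Under these assumptions the rounded limits agree with the intervals given in the text; for n ≥ 30 the same function can be called with 1.96 or 2.58 in place of t.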
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \tag{13.2}$$
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{15.3 - 14.7}{1.17/\sqrt{30}} = 2.809$$
By consulting the t-table at n - 1 = 29 degrees of freedom, we can see that the one-sided P value lies between 0.0005 and 0.005. (If we had used z instead of t, the one-sided P value would have been 0.002. This is perfectly consistent with the results of
the t-test, thus illustrating the equivalence of z and t with large sample sizes.¹) We
therefore reject the null hypothesis that our study sample is a random sample of the
reference (sea-level) population and conclude that it derives from a different population having a higher mean hemoglobin concentration, i.e., that μ > μ0.
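The one-sample calculation can be checked with a short sketch. The inputs are those of the example (sample mean 15.3, reference mean 14.7, s = 1.17, n = 30); the critical values are standard t-table entries for 29 degrees of freedom and are supplied here as assumptions rather than copied from Appendix Table A.4. The function name is hypothetical.

```python
from math import sqrt


def one_sample_t(sample_mean: float, mu0: float, s: float, n: int) -> float:
    """Eq. 13.2: t ratio for a sample mean against a reference mean mu0."""
    return (sample_mean - mu0) / (s / sqrt(n))


t = one_sample_t(sample_mean=15.3, mu0=14.7, s=1.17, n=30)
df = 30 - 1

# Standard one-sided critical values of t for 29 degrees of freedom (assumed table entries).
t_crit_one_sided_05 = 1.699
t_crit_one_sided_01 = 2.462

print(round(t, 3), df)
print(t > t_crit_one_sided_05)  # significant at the one-sided 0.05 level
print(t > t_crit_one_sided_01)  # significant at the one-sided 0.01 level
```

The calculated t exceeds both tabulated values, in keeping with the decision to reject Ho in a one-sided test.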
The most common use of the t-distribution is in the significance testing of two independent sample means. If one were randomly to choose two simultaneous samples
of sizes n₁ and n₂ from two infinitely large (hypothetical) source populations,
replace the samples, choose two new samples of the same size, replace them, and so
on, the differences (d's) between the two sample means x̄₁ and x̄₂ would be normally distributed, provided the source populations and sample sizes did not grossly
violate the requirements of the Central Limit Theorem.
In the "real world" of clinical and epidemiologic research, an investigator has a
study sample selected (not necessarily randomly) to be representative of some target
population. In a cohort study, for example, the two groups being compared are
defined by their exposure status (e.g., exposed vs nonexposed, treatment A vs treatment B), and their outcomes are believed to be representative of similarly exposed
members of the target population. The investigator wants to know if the observed
difference in the outcome means of the two study groups, d = x̄₁ - x̄₂, is "statistically significant."
¹ When the standard deviation σ of the reference population is known, z can be used even if n is small.
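$$z = \frac{d - \delta}{\mathrm{SE}(d)} = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{\mathrm{SE}(\bar{x}_1 - \bar{x}_2)} = \frac{\bar{x}_1 - \bar{x}_2}{\mathrm{SE}(\bar{x}_1 - \bar{x}_2)}$$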
The only remaining difficulty is the calculation of SE(x̄₁ - x̄₂). Now, the variance of
a difference between two independent variables equals the sum of the variances of
the two variables. Since the variance of the sampling distribution of means is σ²/n,
the variance of the sampling distribution of a difference between two means is:
$$\mathrm{Var}(\bar{x}_1 - \bar{x}_2) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$$
The standard error of the difference is the standard deviation of this sampling distribution, i.e., the square root of the variance:
$$\mathrm{SE}(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
Since we do not know σ₁ and σ₂, the respective source population SDs, we estimate
the standard error of the difference using the SDs of the two study (exposure)
groups:
$$\widehat{\mathrm{SE}}(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$
where the "^" (read "hat") indicates an estimate. The use of s₁ and s₂ instead of σ₁
and σ₂ obliges us once again, for small samples, to use the t-, rather than the z-, distribution. Thus
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\widehat{\mathrm{SE}}(\bar{x}_1 - \bar{x}_2)}$$
One more complication remains. Unfortunately, the conventional t-test for two
independent means is based on a null hypothesis that assumes σ₁² = σ₂², as well as
μ₁ = μ₂. Usually this is not a problem; unless the difference in means is large, the
variances often tend to be similar. To provide the best estimate of σ² (which, under
our assumption, = σ₁² = σ₂²), we pool the two sample variances by weighting them
according to their respective degrees of freedom. We then compute the pooled variance (sₚ²) as follows:
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{(n_1 - 1) + (n_2 - 1)} = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} \tag{13.3}$$
Then
$$\mathrm{SE}(\bar{x}_1 - \bar{x}_2) = \sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$
Finally, we have:
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \tag{13.4}$$
Equation 13.4 is called Student's (after Gosset) t-test of two independent sample
means. The corresponding P value is determined by interpreting the value of t at
n₁ + n₂ - 2 degrees of freedom.
To illustrate, let us once again consider the effect of altitude on the serum hemoglobin concentration. Let us assume that no applicable reference population value
exists, and that we randomly select a group of 17 healthy adult men living at sea
level and 30 similar men living at 2000-2500 m above sea level. These are the same
groups considered earlier in this section, and their respective means and standard
deviations are as follows:
x̄₁ = 14.7 g/dl    s₁ = 1.12 g/dl
x̄₂ = 15.3 g/dl    s₂ = 1.17 g/dl
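$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \frac{14.7 - 15.3}{\sqrt{1.33\left(\dfrac{1}{17} + \dfrac{1}{30}\right)}} = -1.714$$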
(The negative sign is a consequence of our labeling the sea-level group as group 1; t
would have been +1.714 had the labeling been reversed.) Consulting the t-table at
17 + 30 - 2 = 45 degrees of freedom, the one-sided P value corresponding to
t = 1.714 is P < 0.05. Consequently, we reject Ho and conclude that the groups are
not random samples from source populations with the same mean hemoglobin concentration. We infer that μ₂ > μ₁ and, therefore, that high altitude results in a rise in
serum hemoglobin. Of course, such a cross-sectional study cannot exclude the possibility that the elevated hemoglobin concentration in fact preceded the second
group's living at high altitude. In fact, the statistical significance of the difference
relates only to the role of chance in producing the study results. Any analytic bias or
other problem in study design would, naturally, invalidate our inference.
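The pooled-variance calculation for this example can be sketched as follows. The group sizes, means, and SDs are those given above; the function names are hypothetical. Because the sketch carries the unrounded pooled variance through the calculation, its t differs from the text's -1.714 in the third decimal place.

```python
from math import sqrt


def pooled_variance(s1: float, n1: int, s2: float, n2: int) -> float:
    """Eq. 13.3: sample variances weighted by their degrees of freedom."""
    return ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)


def two_sample_t(mean1, s1, n1, mean2, s2, n2):
    """Eq. 13.4: Student's t for two independent sample means, with its df."""
    sp2 = pooled_variance(s1, n1, s2, n2)
    se = sqrt(sp2 * (1.0 / n1 + 1.0 / n2))  # standard error of the difference
    return (mean1 - mean2) / se, n1 + n2 - 2


# Group 1: sea level (n = 17); group 2: high altitude (n = 30).
sp2 = pooled_variance(1.12, 17, 1.17, 30)
t, df = two_sample_t(14.7, 1.12, 17, 15.3, 1.17, 30)

print(round(sp2, 2))  # pooled variance, about 1.33
print(round(t, 2), df)
```

Reversing the order of the two groups in the call changes only the sign of t.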
It is worth re-emphasizing that the significant difference observed above is based
on a one-sided test of significance. If the t of 1.714 had been interpreted in a two-sided fashion, the resulting P value would have exceeded 0.05, and we would have
been unable to reject Ho. This illustrates the importance of stating a research
hypothesis in directional terms, if appropriate. Had we carried out a two-sided test,
we would have failed to reject Ho, and the probability of a Type II error would have
been considerable. Stated in another way, the statistical power to exclude a clinically
important difference would have been unsatisfactorily low.
Another important point can be gleaned by comparing the results of this test
with the one-sample t-test shown in Section 13.2.2. The high-altitude group is the
same in both examples: n = 30, x̄ = 15.3, and s = 1.17. In the one-sample test, however, this group is compared with a known reference population mean (14.7),
whereas in the two-sample test, the same group is compared with a group of 17 subjects with the same mean. The corresponding values of t in the two tests are 2.809
and 1.714 respectively.
Assuming the reference population has the same SD (= 1.12) as the 17 sea-level
study subjects, we could, in fact, carry out a "mock" two-sample test, assuming an
infinite sample size in the reference population (n₁ = ∞). The "pooled variance"
then becomes, essentially, the variance of the population = 1.12² = 1.25. Since
1/n₁ = 0, the value of t becomes:
$$t = \frac{14.7 - 15.3}{\sqrt{1.25\left(\dfrac{1}{30}\right)}} = -2.939,$$
i.e., a result very close to that obtained in the one-sample test (with the sign
reversed). The result is quite different from that of the "real" two-sample test, however, owing to the marked increase in sample size in going from a study group of
17 subjects to an infinitely large population, and the consequent reduction in the
standard error term in the denominator. This once again illustrates the critical
importance of sample size in achieving the statistical power to detect a difference.
Although significance testing using the t-test is the most commonly encountered
approach to statistical inference for comparing two means, estimating a confidence
interval around an observed sample difference is often more helpful. No null
hypothesis is required, and the resulting inference therefore allows greater flexibility
than the all-or-none decision about whether or not to reject Ho. The confidence
interval is calculated as follows:
$$\delta = (\bar{x}_1 - \bar{x}_2) \pm t\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} \tag{13.5}$$
where t is the two-tailed t value required for the 100(1 - α)% confidence interval at
n₁ + n₂ - 2 degrees of freedom.
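For our example, the 95% confidence interval (α = 0.05) around the observed
difference in hemoglobin concentration is:
$$\begin{aligned}
\delta &= (14.7 - 15.3) \pm 2.016\sqrt{1.33\left(\frac{1}{17} + \frac{1}{30}\right)} \\
&= -0.6 \pm 2.016(0.350) \\
&= -0.6 \pm 0.7\ \text{g/dl} \\
&= -1.3\ \text{to}\ +0.1\ \text{g/dl}
\end{aligned}$$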
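The same interval can be obtained with a short sketch. The pooled variance (1.33), the group sizes, and the t value of 2.016 are the figures used in the text; the function name is hypothetical.

```python
from math import sqrt


def difference_ci(mean1: float, mean2: float, pooled_var: float,
                  n1: int, n2: int, t_crit: float):
    """Confidence interval around an observed difference between two means."""
    diff = mean1 - mean2
    half_width = t_crit * sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return diff - half_width, diff + half_width


low, high = difference_ci(14.7, 15.3, 1.33, 17, 30, t_crit=2.016)

print(round(low, 1), round(high, 1))
print(low < 0.0 < high)  # the interval includes 0
```

Because the interval includes 0, a two-sided test at the 0.05 level would not reject Ho, consistent with the earlier remark about interpreting t = 1.714 in a two-sided fashion.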
13.2.4 Inferences Based on a Difference Between Two Paired Sample Means
When the two means arise from a study of matched pairs, the paired t-test is a statistically more efficient technique than the t-test for independent sample means.² Statistical efficiency refers to the power to detect (i.e., demonstrate the statistical significance of) a difference with a given (fixed) sample size. The more efficient the
technique, the smaller the sample required to detect a given difference, and the
smaller the difference that can be detected with a given sample size.
As we discussed in Chapter 5, a matched-pair analysis is one method of reducing
confounding. When comparing outcome in two exposure groups, pairwise matching
renders each pair as similar as possible concerning variables independently (of exposure) associated with the study outcome, so that any difference in outcome is more
likely to be attributable to exposure, rather than to potential confounders. In addition to this reduction in confounding bias, matching for variables that are independently associated with outcome, even if they are not differentially associated with
exposure (and therefore not confounding), succeeds in reducing the variability in
the outcome due to such variables. Although the difference in means is unaffected,
the standard error of the difference is thereby reduced, thus raising the value of t
and lowering the corresponding P value. Statistical power is increased because variability is reduced, without increasing the sample size; i.e., the analysis is more efficient.
² In fact, the matching creates variables that are no longer independent, thus violating an underlying assumption of the t-test for independent sample means.
A matched-pair analysis of means is appropriate whenever (a) each subject from
one exposure group is matched to a subject from the other exposure group, or (b)
the same subject receives each of the two study exposures. In our example of the
effect of high altitude on the serum hemoglobin concentration, the first type would
be exemplified by 30 sea-level subjects, each matched (paired) with a high-altitude
subject by such variables as age, race, and smoking habits. The second type of pairing would be typified by a crossover clinical trial in which 30 subjects had their
serum hemoglobin measured both at sea level and after living for several weeks at
high altitude. Differential treatment of paired organs represents another example of
this second (self-pairing) type of study, e.g., the use of two different topical antiglaucoma agents in the two eyes of patients with bilateral disease.
To carry out the paired t-test, the investigator merely calculates the difference
(retaining the plus or minus sign) for each matched pair (dᵢ = x₁ᵢ - x₂ᵢ, where i represents each of n successive pairs). By computing the mean difference (d̄) and the
estimated standard error of the difference (s_d/√n), the null hypothesis (δ = 0) can
be tested. Thus:
$$t = \frac{\bar{d}}{s_d/\sqrt{n}} \tag{13.6}$$
The calculated value of t is then interpreted by comparing it with a reference t-distribution with n - 1 degrees of freedom. Note that n here is the number of pairs, not
the total number of subjects or observations. Despite the reduced number of degrees
of freedom, the reduction in variability (s_d) achieved by pairing will usually result in
a marked improvement in statistical efficiency.
This will be illustrated once again using our example of high altitude and hemoglobin concentration. We shall use the example of self-pairing by showing the results
of a hypothetical crossover trial in which six healthy men have their serum hemoglobin concentration measured both at sea level and at 2000-2500 m. To control for
possible temporal effects, we randomize the sequence of the two exposures. Thus,
some of the six will first be tested at sea level, while the others will begin at high altitude. To allow time for physiologic response, we have them live at their respective
altitudes for several weeks before testing. The results are summarized in Table 13.1
and have been chosen to resemble those in the independent sample t-test seen in
Section 13.2.3: x̄₁ = 14.7 g/dl, s₁ = 1.12 g/dl, x̄₂ = 15.3 g/dl, and s₂ = 1.17 g/dl.
The mean difference is -0.6 g/dl, thus illustrating the mathematical equivalence
between the mean of the differences and the difference in the means
(14.7 - 15.3 = -0.6 g/dl). The SD of the differences is only 0.28, i.e., only one
fourth of the two group SDs. Using Eq. 13.6:
$$t = \frac{\bar{d}}{s_d/\sqrt{n}} = \frac{-0.6}{0.28/\sqrt{6}} = -5.249$$
Table 13.1. Results of a crossover trial of the effect of high altitude on serum hemoglobin (Hb)
concentration (in g/dl)

Subject no.    Hb at sea level    Hb at high altitude    Difference
(i)            (x1i)              (x2i)                  (di = x1i - x2i)
1              13.8               14.1                   -0.3
2              16.1               16.4                   -0.3
3              14.6               15.1                   -0.5
4              15.0               15.8                   -0.8
5              13.1               13.8                   -0.7
6              15.6               16.6                   -1.0

Σdi = -3.6
d̄ = -3.6/6 = -0.6
t = d̄/(s_d/√n) = -0.6/(0.28/√6) = -5.249
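The paired calculation can be reproduced from the six pairs of values in Table 13.1, as in the sketch below. The variable names are hypothetical. Note that the text rounds the SD of the differences to 0.28 before dividing; carrying the unrounded SD (about 0.283) through the calculation gives a t of about -5.20 rather than -5.249, a difference due to rounding alone.

```python
from math import sqrt
from statistics import mean, stdev

# Hemoglobin values (g/dl) for the six subjects of Table 13.1.
sea_level = [13.8, 16.1, 14.6, 15.0, 13.1, 15.6]
high_altitude = [14.1, 16.4, 15.1, 15.8, 13.8, 16.6]

# Within-pair differences, retaining the sign (sea level minus high altitude).
differences = [x1 - x2 for x1, x2 in zip(sea_level, high_altitude)]

n = len(differences)         # number of pairs, not number of observations
d_bar = mean(differences)    # mean difference
s_d = stdev(differences)     # sample SD of the differences
t = d_bar / (s_d / sqrt(n))  # Eq. 13.6, interpreted at n - 1 degrees of freedom

print(round(d_bar, 2), round(s_d, 2), round(t, 2), n - 1)
```

The same mean difference of -0.6 g/dl that gave t = -1.714 with 47 unpaired subjects yields a far larger t with only six pairs, because pairing removes the between-subject variability.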
From Eq. 13.8, it can be seen that the higher the statistical power, or the greater the
variability (σ), or the smaller the difference the investigator wishes to detect, the
greater the required sample size.
To illustrate the use of Eq. 13.8, we return to our example of serum hemoglobin
in healthy men living at sea level vs high altitude. Let us assume that a difference (δ)
of 0.5 g/dl is clinically important. We want to be sure to have a sufficient number of
study subjects to render such a difference statistically significant at P ≤ 0.05 or, if the
difference is smaller, be 80% sure (β = 0.20) that the true difference in the source
population is not ≥ 0.5 g/dl; zβ is thus 0.84. Since we plan a one-sided test of Ho, zα
is 1.65. We estimate σ from our previous studies as the square root of the pooled
variance (sₚ² = 1.33), or 1.15 g/dl.
Then
Since N must be divided equally between the two exposure groups, and since we
cannot study fractions of an individual, we would need 29 subjects in each group.
(We probably should anticipate several "dropouts" and thus plan to enroll 32 or 35
subjects in each group.)
In general, a far greater (often two-fold or more) sample size is required to protect against both Type II and Type I errors than to protect against Type I error
(demonstrate "statistical significance") alone. The temptation to ignore Type II error
is thus strong, especially when patients are involved, because the calculated sample
sizes are smaller and therefore easier to achieve at a single center over a reasonable
period of time. Despite its attractions, however, such a practice is perilous for the
investigator, because she may well find herself unable to reject either Ho or HA.
Consider the example of a clinical trial of arterioplasty (surgical arterial repair)
vs medical (drug) therapy in patients with hypertension caused by renal artery stenosis (narrowing). Suppose the principal investigator specifies 10 mmHg in diastolic
blood pressure reduction as a clinically important difference worth detecting. She
estimates the standard deviation and, ignoring Type II error (i.e., leaving zβ out of
Eq. 13.8), calculates her required sample size.
But suppose when the study is actually carried out with the calculated sample
size, the results show a 9-mmHg difference favoring surgery over medical therapy.
Because the sample size was calculated based on a 10-mmHg difference, the
9-mmHg difference will not be statistically significant. The investigator may not
consider the 9-mmHg difference clinically important, but how sure can she be that
the true difference in the treatments is not 10 mmHg or even larger? Not very sure,
unfortunately. So she is left in a situation where she can infer neither that there is a
clinically important difference nor that there is not. The danger of this Scylla and
Charybdis can be avoided only by considering Type II error (i.e., including zβ) in
the sample size calculation.
Many investigators faced with the above results (d = 9 mmHg; P > 0.05) would
be tempted to enroll additional patients in the study in an effort to achieve statistical
significance. There are two problems with such an approach, however. First,
repeated significance testing increases the risk of detecting a significant difference
arising solely by chance, i.e., of committing a Type I error. If results are repeatedly
analyzed, the P value calculated from the test will underestimate the true risk of a
Type I error (see the discussion of multiple significance tests in Chapter 12). Second,
if the null hypothesis is in fact true, subsequent results may show a difference smaller
than 9 mmHg, and the difference may fail to achieve statistical significance despite
the larger sample size.
This example reveals one of the problems with the hypothesis-testing approach
to data analysis: it is based on "dichotomous" thinking. The investigator must
choose between Ho and HA, even if the data are not very compatible with either.
The use of confidence intervals, however, is often a more helpful approach to statistical inference. No Ho or HA need be postulated. Instead, the confidence interval
indicates the range of differences in the target population compatible with the difference observed in the study sample. In the above example, the confidence interval
around the observed difference of 9 mmHg would include both 0 and 10 (i.e., neither Ho nor HA could be rejected with confidence). But it would be centered at 9,
with an upper bound considerably higher than 10 and a lower bound just below 0.
significance. In the unpaired test, called the Mann-Whitney U-test, the two groups
are combined and ranks are assigned (the lowest value gets a rank of 1). Each member of both groups is then compared one by one with every member of the other
group, and a "winner" is declared for each comparison. The total number of wins in
each group (called the "U statistic") is then calculated and is interpreted by referring
to the number that would be expected under the null hypothesis that the wins were
distributed by chance. For two groups of sample sizes n₁ and n₂, the chance-expected value of U is n₁n₂/2. When the sample size is so large as to make this one-by-one comparison unwieldy, the two values of U can be calculated by determining
the sums of the ranks (R₁ and R₂) in the two groups. The two values of U are then:
(13.9)
(13.10)
To determine the P value, the smaller of the two values of U is referred to the tabulated values (see Appendix Table A.5) required for the usual thresholds of P (0.10,
0.05, 0.01, 0.001) with different sample sizes n₁ and n₂. The smaller the observed U
compared with the chance-expected value, the lower the P value.
To illustrate, the results of a hypothetical study comparing length of hospitalization in stroke victims receiving or not receiving physical therapy (PT) are shown in
Table 13.2. One subject from each group had a hospitalization lasting 55 days. Since
these values occupy the eighth and ninth ranks, each subject receives the tied ranking of 8.5. The V statistic can be calculated for the PT group as follows. The first
PT subject wins three head-to-head comparisons with members of the non-PT
group (members 2, 4, and 5) and loses the rest; the result is thus three wins. The second PT subject wins one and loses seven. The third has three wins, one tie (non-PT
group member 8), and four losses. (Total wins = 3.5, since each tie counts as half a
win). The fourth PT subject has 0 wins; the fifth, six wins; the sixth, four wins; the
seventh, three wins; and the eighth, four wins. The total number of wins among the
eight PT subjects (= VI) is thus 3 + 1 + 3.5 + 0 + 6 + 4 + 3 + 4 = 24.5 wins. The total
number of possible wins is 8 X 8 = 64, whereas the chance-expected number is 32.
The reader may verify that the number of wins in the non-PT group is V 2 = 39.5.
The same results for V could be obtained using the sums of the ranks RI and R 2,
i.e., Eqs. 13.9 and 13.10:
\[ U_1 = n_1 n_2 + \frac{n_2(n_2+1)}{2} - R_2 = (8)(8) + \frac{(8)(9)}{2} - 75.5 = 24.5 \]
Table 13.2. The effect of physical therapy (PT) on length of hospitalization (in days) in stroke victims: Mann-Whitney U-test

Subject   PT group                     Non-PT group
no.       (days)   Rank    "Wins"      (days)   Rank    "Wins"
1           41      6       3           118     16       8
2           23      3       1            15      2       1
3           55      8.5     3.5          84     13       7
4           12      1       0            38      5       2
5           91     14       6            33      4       2
6           68     11       4            79     12       7
7           47      7       3            94     15       8
8           65     10       4            55      8.5     4.5

          R1 = 60.5        U1 = 24.5   R2 = 75.5        U2 = 39.5
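The two routes to U described for Table 13.2 can be checked with a short script. This is only a sketch using the table's data: the function names are illustrative, and the rank-sum shortcut is written in one common form (U1 = n1n2 + n2(n2+1)/2 - R2), which is an assumption chosen to agree with the "wins" convention used in the example.

```python
# Lengths of hospitalization (days) copied from Table 13.2.
pt = [41, 23, 55, 12, 91, 68, 47, 65]
non_pt = [118, 15, 84, 38, 33, 79, 94, 55]


def count_wins(group, other):
    """Head-to-head wins for `group`: the longer stay wins, a tie counts 0.5."""
    wins = 0.0
    for x in group:
        for y in other:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins


def midranks(values):
    """Return {value: rank}, giving tied values the average of their ranks."""
    ordered = sorted(values)
    ranks = {}
    for v in set(ordered):
        first = ordered.index(v) + 1        # 1-based rank of the first copy
        copies = ordered.count(v)
        ranks[v] = first + (copies - 1) / 2
    return ranks


n1, n2 = len(pt), len(non_pt)
rank_of = midranks(pt + non_pt)
r1 = sum(rank_of[x] for x in pt)        # rank sum, PT group
r2 = sum(rank_of[x] for x in non_pt)    # rank sum, non-PT group

u1 = count_wins(pt, non_pt)             # one-by-one comparison
u2 = count_wins(non_pt, pt)

# Rank-sum shortcut (assumed form, see the note above).
u1_ranks = n1 * n2 + n2 * (n2 + 1) / 2 - r2
u2_ranks = n1 * n2 + n1 * (n1 + 1) / 2 - r1

print(r1, r2, u1, u2, u1_ranks, u2_ranks)
```

Both routes give U1 = 24.5 and U2 = 39.5, and the two values always sum to n1n2 = 64.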
Table 13.3. The effect of physical therapy (PT) on length of hospitalization (in days) in stroke victims: matched-pair analysis (Wilcoxon signed rank test)

Pair no.   PT group (days)   Non-PT group (days)   Difference   Rank
1               41                  33                 + 8       2.5
2               23                  38                 -15       4
3               55                  79                 -24       6
4               12                  15                 - 3       1
5               91                 118                 -27       8
6               68                  94                 -26       7
7               47                  55                 - 8       2.5
8               65                  84                 -19       5
erable, considering the small sample size and the fact that four of the five highest
rankings are found in the non-PT group.
In the paired nonparametric test of two means, called the Wilcoxon signed rank
test, the differences within the matched pairs are ranked with the sign (+ or -)
of the difference ignored, assigning the rank 1 to the smallest difference. The sum
of the ranks with positive signs is then compared with the sum of the ranks with
negative signs. Under the null hypothesis, these sums should be equal, and the actual
results can be referred to the distribution of sums around a median of 0 that would
be expected by chance. These are tabulated according to the number of matched
pairs and the sum of ranks required to achieve a given P value (see Appendix
Table A.6).
To illustrate using our physical therapy example, the data of Table 13.2 have
been rearranged as matched pairs in Table 13.3. Each PT subject has been matched
with a non-PT subject for age, sex, severity of stroke, and co-morbidity (the presence or absence of cardiovascular or other serious diseases in addition to the stroke).
The matching is intended to reduce any bias due to these potential confounders, as
well as to reduce other sources of variation in the length of hospitalization. The second and third ranked differences (pair numbers 1 and 7) are tied at 8, and thus each
receives a ranking of 2.5. Seven of the eight differences are negative, and the sum of
the negative ranks is 4 + 6 + 1 + 8 + 7 + 2.5 + 5 = 33.5, while the sum of the positive
ranks is 2.5. As can be seen in Appendix Table A.6, for eight pairs a sum of ranks of
≤ 3 or ≥ 33 is required for a P value of 0.05. Since 2.5 ≤ 3 and 33.5 ≥ 33, we reject
the null hypothesis. This significant result once again demonstrates the enhanced
statistical efficiency of the paired approach.
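The signed rank sums for the matched pairs of Table 13.3 can be reproduced as follows. This is a minimal sketch: the helper name is illustrative, and it assumes no pair has a difference of exactly zero (such pairs would need separate handling).

```python
# Matched pairs from Table 13.3 (days of hospitalization).
pt = [41, 23, 55, 12, 91, 68, 47, 65]
non_pt = [33, 38, 79, 15, 118, 94, 55, 84]

diffs = [a - b for a, b in zip(pt, non_pt)]
abs_sorted = sorted(abs(d) for d in diffs)


def midrank(value):
    """Rank of an absolute difference, averaging the ranks of tied values."""
    first = abs_sorted.index(value) + 1
    return first + (abs_sorted.count(value) - 1) / 2


positive_sum = sum(midrank(abs(d)) for d in diffs if d > 0)
negative_sum = sum(midrank(abs(d)) for d in diffs if d < 0)

print(diffs)
print(positive_sum, negative_sum)
```

The two sums (2.5 and 33.5) add up to n(n + 1)/2 = 36 for n = 8 pairs, which is a quick check on the ranking.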
Another approach to the analysis of these data involves merely examining the
signs of the differences for each matched pair. Since, under H0, the probability of a
positive (or negative) difference for any matched pair is 0.5, the probability of seven
or more (i.e., seven or eight) negative differences out of eight pairs is
\[ \frac{8!}{7!\,1!}(0.5)^8 + \frac{8!}{8!\,0!}(0.5)^8 = 0.035, \]
paired t-test could be used to test for a significant difference. A second approach
would be to stratify all study patients according to age (e.g., < 20, 21-30, 31-50,
and > 50 years) and compare the stratum-specific mean depression scores in the two
treatment groups.
Perhaps the most convenient strategy, in this day of prepackaged computer programs, is to adjust the group means according to the outcome each subject would
have if he had the mean value of the confounder. This adjustment assumes a linear
correlation (see Chapter 15) between the confounder and the outcome (age and
post-treatment depression scores, respectively, in our example). This procedure is
called analysis of covariance (ANCOVA) or covariate adjustment and can be used for
any number of continuous and dichotomous categorical variables [2-4]. It can be
combined with an assessment of two (or more) study effects by using two- (or more)
way ANCOVA.
Whenever extraneous variables (i.e., variables other than exposure or outcome)
are being considered, it is important to distinguish effect modifiers from confounders. As discussed in Chapter 5, effect modifiers do not bias the overall estimate of
exposure-outcome association. Instead, the magnitude of the estimate differs with
different values of the effect modifiers. In that case, reporting and testing a single
difference in means for the entire study sample is rather uninformative (i. e., it hides
important information), even if unbiased. Suppose, for example, that in our randomized clinical trial of a new antidepressant drug, the drug is not efficacious in
patients with bipolar depression (manic-depressive disease) but is extremely efficacious in those with unipolar depression. Assuming the unipolar and bipolar patients
are distributed evenly in the drug and placebo group, the results should be tested
and reported for the two subgroups separately. (If the possibility of this difference in
efficacy is appreciated in the design stage, the investigator would do better to restrict
the trial to unipolar patients.)
References
1. Ratcliffe JF (1968) The effect on the t distribution of non-normality in the sampled population.
p1 = 7/240 = 0.029          p2 = 21/260 = 0.081

Table 14.1

              Infection   No infection   Total
Antibiotic         7           233         240
Placebo           21           239         260
Total             28           472         500
Table 14.2

              Outcome        Outcome
              present        absent         Total
Exposed       a              b              a + b (= r1)
Nonexposed    c              d              c + d (= r2)
Total         a + c (= c1)   b + d (= c2)   N = r1 + r2 = c1 + c2 = a + b + c + d

r1, row total for 1st row; r2, row total for 2nd row; c1, column total for 1st column; c2, column
total for 2nd column.
observed in the study sample arose by chance, assuming that the sample was randomly selected from the target population. Stated in terms of the two proportions,
the null hypothesis states that the source populations of which the two exposure
groups represent random samples have equal outcome rates, i.e., H0: π1 = π2, with
the π's corresponding to the p's in the study sample. The H0 of no association indicates that the columns should be statistically independent of the rows. We thus calculate the frequency with which we would expect (under H0) subjects to fall into each
of the four cells of the 2 × 2 table. If the observed cell frequencies differ sufficiently
from the frequencies expected under H0, we reject H0 and conclude that the columns and rows are not independent, i.e., that they are associated, in the target population.
How do we calculate the expected cell frequencies? The probability that two
independent events will both occur is the product of their individual probabilities.
Under H0, the probability of a subject being in a given row is ri/N, the row total
divided by the total sample size. Similarly, the probability of a subject being in a
given column is cj/N. Thus, under H0, the probability of being in a given cell (i.e., a
given row and a given column) is (ri/N)(cj/N). The expected frequency for that cell (Eij)
is then simply the probability of being in that cell times the total sample size:
\[ E_{ij} = \left(\frac{r_i c_j}{N^2}\right)(N) = \frac{r_i c_j}{N} \tag{14.1} \]
In each cell of the table we then have both an observed (Oij) and an expected (Eij)
frequency. Now we require a statistical method for comparing the Oij's with the Eij's
that will guide us in our inference to reject, or not reject, the null hypothesis. The
usual method for carrying out this comparison is the χ² test.
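The expected-frequency rule (row total times column total, divided by N) can be sketched in a few lines. The cell counts are those of the wound infection trial (7, 233, 21, 239); the variable names are illustrative.

```python
# Observed 2 x 2 table from the wound infection trial:
# rows = antibiotic, placebo; columns = infection, no infection.
observed = [[7, 233],
            [21, 239]]

row_totals = [sum(row) for row in observed]            # r_i
col_totals = [sum(col) for col in zip(*observed)]      # c_j
n = sum(row_totals)                                    # N

# Expected frequency for each cell: E_ij = r_i * c_j / N
expected = [[r * c / n for c in col_totals] for r in row_totals]

for row in expected:
    print([round(e, 2) for e in row])
```

Note that the expected frequencies reproduce the observed marginal totals, as they must under fixed marginals.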
It can be calculated by computing the expected frequency (Eij) for each cell, subtracting it from the observed frequency (Oij) in the table, squaring the resulting difference, dividing by the expected frequency, and then summing this ratio over all
four cells in the table.
Because this method of calculating χ² can be computationally unwieldy, several
algebraically equivalent formulas may be preferable. Using our customary a, b, c, d
notation to depict the four cells of the 2 × 2 table, as shown in Table 14.2,
(Oij - Eij)²/Eij for the first cell is:
\[ \frac{[a - (a+b)(a+c)/N]^2}{(a+b)(a+c)/N} \]
If we repeat this for each of the four cells and then sum the algebraic terms, we end
up with the following formula:
\[ \chi^2 = \frac{(ad-bc)^2 N}{(a+b)(c+d)(a+c)(b+d)} \tag{14.3} \]
(The denominator can be seen to be the product of the two row totals and the two
column totals.)
If the data are not already displayed in a 2 × 2 table, the easiest way to calculate
χ² is to compare the two proportions, p1 and p2 (each the number of subjects with the outcome divided by the group size, n1 or n2), directly:
(14.4)
Equation 14.4 is probably the easiest of the χ² formulas to compute using a hand-held calculator.
To illustrate the mathematical equivalence of Eqs. 14.2-14.4, let us calculate χ²
for our postlaparotomy wound infection trial (Table 14.1). To use Eq. 14.2, we must
first calculate the expected frequencies (Eij's) for each of the four cells according to
Eq. 14.1. These are shown in Table 14.3. Next, using Eq. 14.2,
\[ \chi^2 = \sum \frac{(O_{ij}-E_{ij})^2}{E_{ij}} = \frac{(7-13.44)^2}{13.44} + \frac{(233-226.56)^2}{226.56} + \frac{(21-14.56)^2}{14.56} + \frac{(239-245.44)^2}{245.44} = 3.09 + 0.18 + 2.85 + 0.17 = 6.29 \]
Table 14.3 (expected frequencies, Eij)

              Infection   No infection   Total
Antibiotic      13.44        226.56        240
Placebo         14.56        245.44        260
Total           28           472           500
Using Eq. 14.3,
\[ \chi^2 = \frac{(ad-bc)^2 N}{(a+b)(c+d)(a+c)(b+d)} = \frac{[(7)(239)-(233)(21)]^2(500)}{(240)(260)(28)(472)} = 6.29 \]
Using Eq. 14.4 and the two native proportions, p1 = 7/240 and p2 = 21/260,
Thus, the three equations yield precisely the same result for the value of χ². Regardless of which method is used to calculate χ², however, we need to know how to
interpret the value calculated. In other words, how can we determine a P value from
a given value of χ²? This is discussed in the following section.
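The agreement between the cell-by-cell sum and the a, b, c, d shortcut can be verified numerically for the wound infection data. The sketch below also includes a third route through the two proportions, using the standard pooled-proportion form (p1 - p2)² / [p(1 - p)(1/n1 + 1/n2)]; that form is an assumption for illustration and is not necessarily the text's Eq. 14.4.

```python
# Cells of the wound infection table in a, b, c, d notation.
a, b, c, d = 7, 233, 21, 239
n = a + b + c + d
rows = [a + b, c + d]
cols = [a + c, b + d]
obs = [[a, b], [c, d]]

# Route 1: sum of (O - E)^2 / E over the four cells.
chi2_cells = 0.0
for i in range(2):
    for j in range(2):
        e = rows[i] * cols[j] / n
        chi2_cells += (obs[i][j] - e) ** 2 / e

# Route 2: shortcut formula (Eq. 14.3).
chi2_short = (a * d - b * c) ** 2 * n / (rows[0] * rows[1] * cols[0] * cols[1])

# Route 3 (assumed form): pooled-proportion comparison of p1 and p2.
n1, n2 = rows
p1, p2 = a / n1, c / n2
p_pooled = (a + c) / n
chi2_props = (p1 - p2) ** 2 / (p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))

print(round(chi2_cells, 2), round(chi2_short, 2), round(chi2_props, 2))
```

All three routes give 6.29 for these data.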
14.2.3 The χ² Test: Statistical Inferences
On inspection of Eq. 14.2, it is evident from the squared term in the numerator that
χ² is always ≥ 0. The minimum value of χ² = 0, which is the value obtained when
Oij = Eij (observed = expected) for each cell of the table. The maximum value will
depend on the sample size, since larger numbers in each cell will permit greater
absolute values for Oij - Eij, and hence for (Oij - Eij)². The empirical frequency distribution for χ² is discrete; only certain values of χ² are possible, depending on the
specific Oij's and Eij's for each cell. With larger sample sizes, however, the discrete
distribution is closely approximated by the smooth curve representing the theoretical
probability distribution for the source population.
Since large values of χ² indicate a large deviation of observed from expected, the
higher the χ², the less likely it is that the study sample represents a random sample
              Infection   No infection   Total
Antibiotic                                 240
Placebo                                    260
Total             28           472         500
Now let us see what happens when we are given the value of 21 in the left lower
cell. Because of the fixed marginals, this 21 automatically determines the values of
the three other cells, yielding the entire Table 14.1:
              Infection   No infection   Total
Antibiotic         7           233         240
Placebo           21           239         260
Total             28           472         500
The general formula for determining the number of degrees of freedom is:
\[ df = (r-1)(c-1) \tag{14.5} \]
where r = the number of rows
c = the number of columns
value of χ² is given for a given df and the usual "threshold" P values (0.10, 0.05,
0.01, 0.001). In this regard, it is set up like the t-table. A representative χ² table is
provided in Appendix Table A.7.
As we have mentioned, the P value determined from a χ² test corresponds to the
area in the upper tail of the smoothed χ² probability distribution. But since
(Eij - Oij)², and hence χ², is always ≥ 0 regardless of whether p1 > p2 or p2 > p1, i.e.,
regardless of which proportion is larger, the χ² test is inherently two-sided. (In other
words, there is no equivalent to the negative value of t obtained when x̄2 > x̄1. See
Eq. 13.3.) In comparing two proportions, therefore, the P value obtained from a χ²
test represents the probability of obtaining a difference in the two proportions at least as large as that observed (p1 - p2) by chance under H0, regardless of whether p1 > p2 or p2 > p1.
Although one-sided P values are rarely reported for χ² tests, a research hypothesis
that was stated a priori as clearly unidirectional (π1 > π2 or π2 > π1) and was subsequently supported by the data could justify a one-sided test. The one-sided P value
is obtained by dividing the tabulated (two-sided) P value by 2.
To illustrate the use of the χ² table, let us determine the P value corresponding
to the χ² of 6.29 obtained in our wound infection trial. At 1 df, a χ² of 6.29 yields a
(two-sided) P value between P = 0.05 and P = 0.01. By convention, this is sufficiently
small to reject the null hypothesis, and, if the design and execution are adequate to
exclude analytic bias as an explanation for the findings, we conclude that the study
antibiotic is indeed efficacious in reducing postlaparotomy wound infection.
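Where a printed table gives only threshold values, the upper-tail area for 1 df can also be computed directly. The sketch below relies on the fact that a χ² variable with 1 df is the square of a standard normal deviate, so its upper tail equals erfc(sqrt(χ²/2)); the function name is illustrative, and the shortcut applies to 1 df only.

```python
from math import erfc, sqrt


def p_value_1df(chi2):
    """Upper-tail (two-sided) P value for a chi-square statistic with 1 df."""
    return erfc(sqrt(chi2 / 2))


p_two_sided = p_value_1df(6.29)     # wound infection trial
p_one_sided = p_two_sided / 2       # only for an a priori directional hypothesis
p_at_threshold = p_value_1df(3.84)  # conventional 0.05 threshold

print(round(p_two_sided, 3), round(p_one_sided, 3), round(p_at_threshold, 3))
```

The computed two-sided value falls between 0.01 and 0.05, in agreement with the table lookup.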
14.2.4 The Continuity Correction for Small Samples
As mentioned above, the (theoretical) χ² probability distribution is a smooth, continuous curve. Observed frequencies, however, are discrete and so, therefore, are the
possible calculated values of χ² from any study. When N is very large, many more
values are possible for the Oij's and Eij's (and thus for χ²), and the frequency distribution of possible χ² values begins to approach the smooth, theoretical probability
distribution. For example, the wound infection rate in our antibiotic-treated patients
of 7 out of 240 might represent theoretically any number from 6½ to 7½, i.e., a similar group of 2400 antibiotic-treated patients could have a rate anywhere from 65 to
75 out of 2400.
When N is small, many statisticians advocate the use of a continuity correction to
compensate for the fact that the discrete possible values are not closely approximated by the continuous theoretical distribution. In 1934, Yates proposed subtracting
½ from the absolute value of each Oij - Eij to provide a better approximation. The
resulting χ² with continuity correction (χc²) is defined as follows:
\[ \chi_c^2 = \sum \frac{\left(|O_{ij}-E_{ij}| - \tfrac{1}{2}\right)^2}{E_{ij}} \tag{14.6} \]
\[ \chi_c^2 = \frac{\left(|ad-bc| - N/2\right)^2 N}{(a+b)(c+d)(a+c)(b+d)} \tag{14.7} \]
The continuity correction is used only for comparing two proportions (i.e., for
2 × 2 tables), and χc² is interpreted at 1 df in the same way as the uncorrected χ². The
continuity correction results in smaller values for χ², and resulting statistical inferences will thus be more conservative. In other words, H0 is less likely to be rejected.
The lower risk of Type I error must, as always, be balanced against a greater risk of
Type II error. For large samples, the continuity correction is probably unnecessary,
but for small samples the P values calculated using χc² are closer to the exact probability (see following section) obtained using a pure stochastic (chance-generated)
model.
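To see how much the correction changes the result, the corrected statistic can be computed for the wound infection data in both the cell-by-cell form and the a, b, c, d form. This is a sketch with illustrative variable names; the data are those of the wound infection trial.

```python
a, b, c, d = 7, 233, 21, 239
n = a + b + c + d
rows = [a + b, c + d]
cols = [a + c, b + d]
obs = [[a, b], [c, d]]
margins_product = rows[0] * rows[1] * cols[0] * cols[1]

# Uncorrected chi-square, for comparison.
chi2 = (a * d - b * c) ** 2 * n / margins_product

# Corrected, shortcut form: (|ad - bc| - N/2)^2 N / product of marginals.
chi2_c = (abs(a * d - b * c) - n / 2) ** 2 * n / margins_product

# Corrected, cell-by-cell form: subtract 1/2 from each |O - E| before squaring.
chi2_c_cells = 0.0
for i in range(2):
    for j in range(2):
        e = rows[i] * cols[j] / n
        chi2_c_cells += (abs(obs[i][j] - e) - 0.5) ** 2 / e

print(round(chi2, 2), round(chi2_c, 2), round(chi2_c_cells, 2))
```

With N = 500 the corrected value (about 5.35) is smaller than the uncorrected 6.29 but still exceeds 3.84, so the inference at the 0.05 level is unchanged in this example.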
14.2.5 The Fisher Exact Test
When the expected cell frequency in one or more cells of a 2 × 2 table is below 5,
the smoothed χ² probability distribution, even with the continuity correction, does
not provide a sufficiently accurate approximation of the true P value. In such cases,
many statisticians recommend using the Fisher exact test. The Fisher test is based on
the hypergeometric distribution, which is produced when two independent binomials
(i.e., the two sample proportions) are inserted in a 2 x 2 table with fixed marginal
totals. The test provides the probability of obtaining, by chance, an association
between the columns and rows at least as large as the one observed, under the null
hypothesis of no association and the condition of fixed marginals.
Given a hypergeometric distribution, the probability of the observed cell frequencies a, b, c, and d, given the row totals r1 and r2 and the column totals c1 and c2,
is:
\[ P = \frac{r_1!\, r_2!\, c_1!\, c_2!}{N!\, a!\, b!\, c!\, d!} \tag{14.8} \]
This formula, however, provides only the probability of the table obtained. As with
the t and χ² tests, we are usually interested in calculating the probability (P value)
of getting the results obtained or results more deviant from H0, i.e., the area
in the entire "tail" of the probability distribution of tables beyond the observed
one.
To compute the P value for the Fisher exact test, we construct further 2 x 2
tables having more extreme cell frequencies than the observed one, while keeping
the marginal totals constant. We then calculate the probability associated with each
of these tables, using Eq.14.8, and add these probabilities to that of the observed
table. The resulting sum is the exact P value. Because the "more extreme" tables are
those deviating from Ho in the same direction as the observed table, the calculated P
value is, by definition, one-sided. To get the two-sided P value, the result is usually
multiplied by 2.
As an example, consider the evaluation of a new cancer chemotherapeutic regimen in the treatment of advanced acute lymphoblastic leukemia (ALL) in children.
In a randomized trial of 27 patients, 12 received the new treatment and 15 the existing standard regimen, and the results are shown in Table 14.4 in terms of "successes"
(here defined as elimination of tumor cells from the bone marrow after treatment)
and failures. The success rate in children receiving the new regimen is 6/12, or 50%,
compared with 2/15, or 13%, in those receiving the standard regimen.

Table 14.4. Results of a clinical trial of two chemotherapy regimens for advanced acute lymphoblastic leukemia (ALL) in children

                    Success   Failure   Total
New regimen            6         6        12
Standard regimen       2        13        15
Total                  8        19        27
                    Success   Failure   Total
New regimen            7         5        12
Standard regimen       1        14        15
Total                  8        19        27

                    Success   Failure   Total
New regimen            8         4        12
Standard regimen       0        15        15
Total                  8        19        27
The probabilities for these tables are calculated as follows, using the expression
\[ \frac{r_1!\, r_2!\, c_1!\, c_2!}{N!} = 2.821 \times 10^{14} \]
\[ P = \frac{2.821 \times 10^{14}}{a!\, b!\, c!\, d!} \]
\[ \chi^2 = \frac{(ad-bc)^2 N}{(a+b)(c+d)(a+c)(b+d)} = \frac{[(6)(13)-(6)(2)]^2(27)}{(12)(15)(8)(19)} = 4.299, \]
which corresponds to P < 0.05, and we would have erroneously rejected the null
hypothesis.
Although the calculations required for the Fisher exact test are fairly straightforward using hand-held calculators and factorial tables, the construction of additional
2 x 2 tables, and the computation of the probability for each, can entail considerable
time and effort when the results are less extreme than those in our example. Fortunately, most standard statistical software packages include the Fisher exact test in
their "menus," and thus this calculational inconvenience pertains only to data analyzed by hand.
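For readers without a packaged routine, the exact calculation for the ALL trial can be sketched directly from Eq. 14.8. The observed table is (a, b, c, d) = (6, 6, 2, 13); the two more extreme tables listed below are the ones implied by holding the marginal totals fixed, and the function name is illustrative. The two-sided value follows the doubling convention described above.

```python
from math import factorial


def table_probability(a, b, c, d):
    """Eq. 14.8: probability of one 2 x 2 table, given fixed marginal totals."""
    r1, r2, c1, c2 = a + b, c + d, a + c, b + d
    n = a + b + c + d
    numerator = factorial(r1) * factorial(r2) * factorial(c1) * factorial(c2)
    denominator = (factorial(n) * factorial(a) * factorial(b)
                   * factorial(c) * factorial(d))
    return numerator / denominator


# Observed table first, then the tables more extreme in the same direction.
tables = [(6, 6, 2, 13), (7, 5, 1, 14), (8, 4, 0, 15)]
probabilities = [table_probability(*t) for t in tables]

p_one_sided = sum(probabilities)
p_two_sided = 2 * p_one_sided

print([round(p, 4) for p in probabilities])
print(round(p_one_sided, 4), round(p_two_sided, 4))
```

The doubled P value is above 0.05, whereas the uncorrected χ² for the same table falls below that threshold, which is the discrepancy the example is meant to show.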
\[ \chi^2_{\text{McNemar}} = \frac{(b-c)^2}{b+c} \tag{14.9} \]
\[ \chi^2_{c,\,\text{McNemar}} = \frac{(|b-c| - 1)^2}{b+c} \tag{14.10} \]
The value of χ² depends only on the observed frequencies in the two "discordant"
cells b (nonexposed pair member with the outcome, exposed member without the
outcome) and c (nonexposed member without the outcome, exposed member with
the outcome). It is interpreted in the same way as the usual χ² (Appendix Table A.7)
with one degree of freedom.
I shall illustrate the calculation of McNemar's χ² using the data of Table 6.6, which
shows the results of a cohort study of myocardial infarction (MI) in 200 smoking
and 200 nonsmoking men matched by age, blood pressure, and serum cholesterol
concentration. Of the 200 matched pairs, both members experienced an MI in seven
pairs, neither member in 150 pairs, only the nonsmoking member in 14, and only
the smoking member in 29. Thus,
\[ \chi^2_{\text{McNemar}} = \frac{(14-29)^2}{14+29} = \frac{(-15)^2}{43} = 5.233, \]
which corresponds to a P value < 0.05. The null hypothesis is therefore rejected,
and we conclude that the smokers are indeed at greater risk for subsequent MI.
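The matched-pair calculation uses only the two discordant cells, as the following sketch shows for the smoking and MI data (b = 14, c = 29). The continuity-corrected version is included for comparison; variable names are illustrative.

```python
# Discordant pairs from the matched cohort study of smoking and MI.
b = 14   # only the nonsmoking member had an MI
c = 29   # only the smoking member had an MI

chi2_mcnemar = (b - c) ** 2 / (b + c)                 # Eq. 14.9
chi2_mcnemar_c = (abs(b - c) - 1) ** 2 / (b + c)      # Eq. 14.10, corrected

print(round(chi2_mcnemar, 3), round(chi2_mcnemar_c, 3))
```

Both values exceed 3.84, so the conclusion at the 0.05 level does not depend on the correction here.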
14.2.7 Testing the Statistical Significance of the Relative Risk and Odds Ratio
As discussed in Chapters 6 and 7, relative risks or odds ratios above 1 indicate that
exposure is associated with an increased risk of developing the study outcome. Conversely, relative risks or odds ratios less than 1 indicate that exposure protects
against development of the outcome. It is rare to obtain an RR or OR of exactly 1,
however, and values less than or greater than 1 may well occur by chance even if the
null hypothesis is true.
By performing a χ² test (matched or unmatched) on the same 2 × 2 table used to
generate the RR or OR, the statistical significance of the observed value of RR or
OR is automatically assessed. If RR (or OR) > 1 and χ² ≥ 3.84, then exposure is
associated with a significantly increased risk of the outcome. If RR (or OR) < 1 and
χ² ≥ 3.84, the exposure is associated with a significantly decreased risk of, i.e., protection against, developing the outcome.
Another approach to testing the significance of an observed RR or OR is to construct a confidence interval (usually 95% or 99%) around the observed value. If the
interval excludes 1, then the RR or OR is declared statistically significant at the chosen level (α) of significance. There are several methods for calculating the confidence interval, but the easiest computationally is that proposed by Miettinen [1]:
\[ \mathrm{CI} = RR^{\,(1 \pm z_\alpha/\chi)} \quad \text{or} \quad OR^{\,(1 \pm z_\alpha/\chi)} \tag{14.11} \]
where zα is the two-sided z value corresponding to the chosen width of the confidence interval (1.96 for 95% and 2.58 for 99%), and χ is the square root of the
observed value of χ².
For example, the relative risk of MI in smokers vs nonsmokers based on the
nonmatched cohort study whose results are shown in Table 6.2 is 2.13. The calculated value of χ² is 6.968,
which corresponds to a P value < 0.01. Thus, we conclude that the relative risk of
2.13 is significantly greater than 1.
Using the confidence interval approach of Eq. 14.11,
\[ 95\%\ \mathrm{CI} = RR^{\,(1 \pm 1.96/\chi)} = 2.13^{\,(1 \pm 1.96/\sqrt{6.968})} = 1.21 \text{ to } 3.73 \]
Since this interval excludes 1, we again conclude that the relative risk of 2.13 is statistically significant.
The confidence interval derived from Eq. 14.11 is often called "test-based,"
because it is based on the calculated value of χ². This fact also leads to another marginal advantage (besides computational ease) of the test-based method, namely, that
conclusions about statistical significance are always identical to those achieved using
the χ² test. The major disadvantage of the method is that it yields narrower confidence limits than those obtained using more exact (and computationally far more
difficult) methods. The computational disadvantages of the other methods have
been largely overcome by many statistical software packages, which provide standard errors for RRs and ORs that can be used to estimate confidence intervals.
Readers interested in exploring these other methods are referred to Fleiss [2] and
Kleinbaum et al. [3].
\[ \chi^2_{\mathrm{MH}} = \frac{\left[\cdots + \dfrac{(16)(10)-(57)(2)}{85}\right]^2}{\dfrac{(27)(88)(82)(33)}{(114)(115)^2} + \dfrac{(73)(12)(18)(67)}{(84)(85)^2}} = \frac{27.974}{6.005} = 4.658, \]
which corresponds to a P value < 0.05.
We thus should reject the null hypothesis and conclude that the higher success rate
of T1 observed in the sample did not arise by chance from a target population in
which T1 and T2 are of equal efficacy.
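The stratified calculation can be sketched as below, using the form implied by the surviving arithmetic: the squared sum of (ad - bc)/N over strata, divided by the sum of r1 r2 c1 c2 / [(N - 1)N²]. One caution: the cell counts of the 115-subject stratum do not appear in the text as it stands. The values used here (24, 3, 58, 30) are back-calculated from that stratum's marginal totals (27, 88, 82, 33) and the reported numerator of 27.974, so they should be read as an assumption rather than as the original data.

```python
# Each stratum is (a, b, c, d).
# First stratum: ASSUMED cells, back-calculated from its marginal totals.
# Second stratum: cells as they appear in the worked calculation.
strata = [(24, 3, 58, 30), (16, 57, 2, 10)]

numerator_sum = 0.0
variance_sum = 0.0
for a, b, c, d in strata:
    n = a + b + c + d
    numerator_sum += (a * d - b * c) / n
    variance_sum += (a + b) * (c + d) * (a + c) * (b + d) / ((n - 1) * n ** 2)

chi2_mh = numerator_sum ** 2 / variance_sum

print(round(numerator_sum ** 2, 3), round(variance_sum, 3), round(chi2_mh, 3))
```

With these inputs the numerator, denominator, and ratio agree with the reported 27.974, 6.005, and 4.658 to within rounding.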
The Mantel-Haenszel procedure just described is the most appropriate and
widely used technique for controlling for a small number of categorical (or categorized) confounding factors. It can be readily appreciated, however, that as the number of confounding factors increases, the computation becomes unwieldy (when
done by hand). Furthermore, there may be some loss of control when continuous
confounding variables are arbitrarily categorized.
Two multivariate statistical techniques are commonly used to adjust for multiple
confounding variables: discriminant function analysis and multiple logistic regression.
Both techniques provide simultaneous control for any number and combination of
continuous and categorical confounders; both can be used for cohort, case-control,
or cross-sectional designs; and both are commonly available in many standard statistical software packages. Logistic regression is usually preferred over discriminant
function analysis, because the latter depends, to some extent, on the assumption of
normally distributed predictor variables (exposure and confounders) in the source
populations. Logistic regression has the further advantage that the resulting coefficient for each factor (exposure and all potential confounders) is the natural logarithm of the odds ratio for that factor's association with the study outcome. Most
computer software packages also provide standard errors for the logistic coefficients
that permit estimation of confidence intervals around the odds ratios. Both discriminant function analysis and multiple logistic regression are beyond the scope of this
text, but the interested reader will find excellent discussions in several references
[5-8].
As was discussed for differences in means, effect modification must be distinguished from confounding. If the difference between two proportions (or the relative risk or odds ratio) differs substantially in two or more subgroups of the study
sample, reporting a single difference for the overall study sample will hide relevant
information, even though no bias is introduced. For example, in comparing the dif-
\[ \bar{x} = p, \quad \text{where } p \text{ is the proportion of successes} = \frac{\text{number of successes}}{\text{number of subjects}} \]
\[ SD = \sqrt{pq}, \quad \text{where } q = 1 - p \]
Using these formulas for the mean and SD, z- and t-tests can be constructed just as
if the data were continuous. This is considered legitimate as long as the number of
expected (under H0) successes and failures, i.e., np and nq, are both ≥ 5, because
then the binomial distribution (representing the exact probability for any given proportion) can be closely approximated by a t- or normal (z-) distribution. Formulas
for these z- and t-tests are given in several standard texts [2, 9].
In fact, there is a direct mathematical relationship between the normal approximation of the binomial distribution and the χ² distribution (at 1 df): z² = χ². For example, for
P = 0.05, a z value of 1.96 is required. If we square 1.96, the result is 3.84, which is
the χ² value necessary for P = 0.05. Similarly, t² = χ² at an infinite number of degrees
of freedom.
When the data are continuous to begin with, they can be converted to categorical data by choosing a "cutoff" point to define the two categories. We can then
carry out a χ² test on the resulting proportions instead of a t-test on the native continuous data. For example, instead of doing a t-test for measuring a difference in
systolic blood pressure between two groups, we could dichotomize the variable into
"normal" (< 140 mmHg) vs "hypertensive" (≥ 140 mmHg). There are three problems
with this approach, however:
1. Transforming continuous to categorical data involves some "waste." We are substituting a "lower-order" scale, and this is statistically inefficient (i. e., has less statistical power).
2. There is a greater opportunity for misclassification. A pressure measurement that
is "off" by 1 mmHg will have very little impact on a mean or a t-test, but it could
result in a change in category, and hence a change in the group proportion and χ²
test, if a true systolic pressure of 139 is measured as 140.
3. The result of the statistical analysis depends on the choice of the cutoff point, thus
leading to a potential for bias. In other words, the χ² test could lead to a different
conclusion from the t-test, depending on where we draw the category boundaries.
It is thus essential that these boundaries be decided a priori, i.e., before calculation of the χ² test, so that the investigator is not at liberty to pick a cutoff point
that optimizes his chances for demonstrating statistical significance.
\[ SE = \sqrt{pq/n} \]
\[ \sqrt{pq} \]
\[ \sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}} \]
For our postlaparotomy wound infection trial (Table 14.1), the 95% CI
\[ = -0.052 \pm 1.96\sqrt{\frac{(0.029)(0.971)}{240} + \frac{(0.081)(0.919)}{260}} = -0.052 \pm 0.039 = -0.091 \text{ to } -0.013 \]
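The interval for the difference in infection rates can be recomputed from the raw counts (7/240 and 21/260). This sketch carries unrounded proportions through the calculation, so the upper limit differs slightly in the third decimal place from a hand calculation that rounds at each step; variable names are illustrative.

```python
from math import sqrt

n1, n2 = 240, 260
p1, p2 = 7 / n1, 21 / n2          # infection rates: antibiotic, placebo

diff = p1 - p2
se_diff = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

z = 1.96                          # two-sided 95% interval
lower = diff - z * se_diff
upper = diff + z * se_diff

print(round(diff, 3), round(z * se_diff, 3), round(lower, 3), round(upper, 3))
```

Since the whole interval lies below 0, it excludes the null value of no difference, consistent with the χ² result for the same table.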
Assuming an α-level of 0.05, the investigator must specify three additional components: π1 and π2, the proportions he estimates in the hypothetical source populations, such that π1 - π2 represents the minimum threshold for a clinically important
difference; and 1 - β, the statistical power he wishes to ensure that a difference as
large as π1 - π2 will be detected. Assuming two study groups of equal size (n1 = n2),
the total required sample size (N) is then:
\[ N = \frac{2\left[z_\alpha\sqrt{2\bar{\pi}(1-\bar{\pi})} + z_\beta\sqrt{\pi_1(1-\pi_1)+\pi_2(1-\pi_2)}\right]^2}{(\pi_1-\pi_2)^2} \tag{14.14} \]
where π̄ = (π1 + π2)/2 and zα and zβ are the z values corresponding to the chosen levels
of α and β. As an alternative to this equation, Fleiss [2] has published sample sizes
(for each group, i.e., n1 or n2) in the form of tables for specified values of π1, π2, α,
and 1 - β.
A "shortcut," approximate equation is the following:
(14.15)
As an example, suppose we were planning a case-control study of parental divorce
before a child's tenth birthday as a risk factor for adolescent suicide. Suppose the
divorce rate in the overall population is known to be 30%. Assuming an equal number of cases and controls, we wish to be 90% (1 - β = 0.90) certain of detecting a
divorce rate of 40% among the parents of cases. (Note that this is based on a directional research hypothesis, and one-sided z values are indicated.) Thus:
zα (one-sided) = 1.65 for α = 0.05
zβ (one-sided) = 1.28 for β = 0.10
π1 = 0.30
π2 = 0.40
π̄ = 0.35
Using Eq. 14.14,
\[ N = \frac{2\left[1.65\sqrt{2(0.35)(0.65)} + 1.28\sqrt{(0.30)(0.70)+(0.40)(0.60)}\right]^2}{(0.30-0.40)^2} = 777.5 \]
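The sample-size arithmetic for the divorce and suicide example can be checked with a small function. This is a sketch of the calculation as worked above (total N for two equal groups, with π̄ taken as the mean of π1 and π2); the function and argument names are illustrative.

```python
from math import ceil, sqrt


def total_sample_size(pi1, pi2, z_alpha, z_beta):
    """Total N (both groups combined) for comparing two proportions."""
    pi_bar = (pi1 + pi2) / 2
    bracket = (z_alpha * sqrt(2 * pi_bar * (1 - pi_bar))
               + z_beta * sqrt(pi1 * (1 - pi1) + pi2 * (1 - pi2)))
    return 2 * bracket ** 2 / (pi1 - pi2) ** 2


n_total = total_sample_size(0.30, 0.40, z_alpha=1.65, z_beta=1.28)
n_per_group = ceil(n_total / 2)   # round up to whole subjects per group

print(round(n_total, 1), n_per_group)
```

Rounding up gives 389 subjects per group, i.e., 389 cases and 389 controls.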
can be used to solve for π2. For our adolescent suicide example, the 2 × 2 table would
look like this:
              Suicides        Controls
Divorce       π2n             π1n = 0.30n
No divorce    (1 - π2)n       (1 - π1)n = 0.70n
Total         n               n
If we chose 1.5 as a clinically important OR, we could then solve for π2 as follows:
\[ 1.5 = \frac{\pi_2(0.70)}{(0.30)(1-\pi_2)} \]
\[ 0.45(1-\pi_2) = 0.70\pi_2 \]
\[ 0.45 - 0.45\pi_2 = 0.70\pi_2 \]
\[ 0.45 = 1.15\pi_2 \]
\[ \pi_2 = \frac{0.45}{1.15} = 0.39 \]
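The algebra above generalizes to any control exposure rate and target odds ratio: solving OR = π2(1 - π1) / [π1(1 - π2)] for π2 gives π2 = OR·π1 / (1 - π1 + OR·π1). A brief sketch, with an illustrative function name:

```python
def exposure_rate_in_cases(pi1, odds_ratio):
    """Exposure proportion among cases implied by pi1 (controls) and a target OR."""
    return odds_ratio * pi1 / (1 - pi1 + odds_ratio * pi1)


pi1 = 0.30
pi2 = exposure_rate_in_cases(pi1, 1.5)

# Plugging pi2 back into the odds ratio should return the target value.
implied_or = pi2 * (1 - pi1) / (pi1 * (1 - pi2))

print(round(pi2, 2), round(implied_or, 2))
```

The resulting π2 of about 0.39 can then be used alongside π1 = 0.30 in the sample size equation.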
Suppose we know the proportion π0 of some reference population. We have a study
sample measured on the same variable with proportion p. We want to know the
probability that a difference at least as large as p - π0 could have arisen by chance,
under the null hypothesis that the sample was randomly selected from the reference
population (H0: π = π0). If the probability is less than 0.05, we will reject H0, and
conclude that the observed difference p - π0 is statistically significant. This procedure is exactly analogous to testing the significance of a single sample mean from a
known population mean (see Section 13.2.2).
To accomplish this test of H0, we merely calculate, for each of the two categories of the variable, the frequencies expected under H0 and then compute χ² as
Σ (Oij - Eij)²/Eij over the two "cells." This χ² test for a single proportion is thus similar
to the usual χ² test, except that it uses only two observed and two expected frequencies, instead of the four seen with the 2 × 2 table for comparing two proportions.
To illustrate, suppose the prevalence of hypertension among white men in the
United States is known to be 15%. An investigator believes that industrial effluent
from a certain factory may be contaminating the water supply of a nearby town, and
that some of the components of this effluent are capable of raising the blood pressure. In a random sample of 100 white men from this town, 21 are found to be
hypertensive. Is the prevalence of hypertension truly elevated, or could the difference (21% vs 15%) have arisen by chance?
The observed frequencies of hypertensive and nonhypertensive are 21 and 79,
respectively, compared with the expected frequencies of 15 and 85. Then
\[ \chi^2 = \frac{(21-15)^2}{15} + \frac{(79-85)^2}{85} = 2.40 + 0.42 = 2.82, \]
which, using the conventional two-sided P value, corresponds to P > 0.05. Therefore, we should not reject the null hypothesis.
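The single-proportion test for the hypertension example can be sketched as follows. Expected frequencies come from the reference prevalence of 15% applied to the sample of 100; the 3.84 threshold is the 1 df critical value for P = 0.05, and variable names are illustrative.

```python
n = 100
pi0 = 0.15                      # reference prevalence of hypertension
observed = [21, n - 21]         # hypertensive, nonhypertensive
expected = [n * pi0, n * (1 - pi0)]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
significant = chi2 >= 3.84      # 1 df, two-sided P = 0.05

print(round(chi2, 2), significant)
```

The same loop extends unchanged to a goodness-of-fit test with more than two categories, with degrees of freedom equal to the number of categories minus 1.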
This procedure of comparing an observed proportion with an expected one can
also be used to test theoretical models. The procedure is called "testing for goodness
of fit" and can be expanded to an entire distribution (instead of just two) of
observed and expected frequencies. If the difference between the observed distribution and that expected under the model is small, as indicated by a nonsignificant
value of χ², the model is thereby "supported" (provided the sample size is adequate
to protect against an important Type II error).
For example, suppose we believed a given disease to be inherited as an autosomal recessive. This hypothesis could be tested by observing the frequency of the
disease among full siblings of patients known to have the disease. In the absence of
any bias of ascertainment, selective abortion or mortality, etc., the proportion of
affected siblings should be 0.25. If a large sample of siblings could be studied, a test
of the difference between the observed and expected rate should provide a good test
of the theory. If carriers could also be identified, then the distribution of normal,
carrier, and diseased frequencies could be tested against the expected ratio of 1 : 2 : 1
(0.25, 0.50, 0.25).
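A goodness-of-fit test against the 1 : 2 : 1 ratio might look like the following sketch. The sibling counts are hypothetical and chosen only for illustration; they do not come from the text. With three categories the statistic is interpreted at 2 df, for which the upper-tail probability has the simple closed form exp(−χ²/2).

```python
import math

def chi_square_against_model(observed, proportions):
    """Pearson chi-square of observed counts against model proportions."""
    n = sum(observed)
    expected = [n * p for p in proportions]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def p_value_2df(chi2):
    """Upper-tail P value for a chi-square statistic with 2 degrees of freedom."""
    return math.exp(-chi2 / 2.0)

# HYPOTHETICAL counts of normal, carrier, and diseased siblings (illustration only).
hypothetical_counts = [30, 45, 25]
model = [0.25, 0.50, 0.25]  # expected 1 : 2 : 1 ratio

chi2_model = chi_square_against_model(hypothetical_counts, model)
p_model = p_value_2df(chi2_model)
print(f"chi-square = {chi2_model:.2f} at 2 df, P = {p_model:.3f}")
```

A nonsignificant result of this kind "supports" the model only if the sample is large enough to protect against an important Type II error.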
Estimating π from p
When we have a single sample from a target population of interest, we may wish to
calculate the range in which the population rate is likely to fall. This parametric estimation of π from p is analogous to estimating μ from x̄ (see Section 13.2.1) and is
the activity usually engaged in by opinion pollsters; a certain number of subjects are
(preferably randomly) sampled for their political preference or their opinion about a
prominent issue, and the result is expressed as a 95% or other confidence interval
(often around a percentage, rather than a proportion).
To estimate such a confidence interval, we rely on the normal approximation of
the binomial distribution. As discussed in Section 14.2.10, the standard error of a
proportion = $\sqrt{pq/n}$, where q = 1 − p.
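As a rough sketch of the normal approximation, the interval is the sample proportion plus or minus a standard normal deviate times the standard error. The figures used here (21 hypertensive men in a sample of 100, from the earlier example) are an assumption made for illustration, as is the function name; the text's own worked example is not reproduced.

```python
import math

def proportion_confidence_interval(events, n, z=1.96):
    """Normal-approximation confidence interval for a population proportion."""
    p = events / n
    q = 1 - p
    standard_error = math.sqrt(p * q / n)
    return p - z * standard_error, p + z * standard_error

# Illustration only: 21 of 100 sampled men found to be hypertensive.
low, high = proportion_confidence_interval(21, 100)
print(f"95% confidence interval: {low:.3f} to {high:.3f}")
```

The approximation is reasonable only when the sample is large and p is not too close to 0 or 1.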
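$$\chi^2_L = \frac{N\left(N\sum t_i w_i - T\sum n_i w_i\right)^2}{T(N-T)\left[N\sum n_i w_i^2 - \left(\sum n_i w_i\right)^2\right]} \qquad (14.18)$$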
where n_i is the number of subjects in the ith exposure category, t_i is the number of
subjects within the ith category who experience the "target" outcome, w_i is the
"weight" (or "score") assigned to the ith category, N is the total number of subjects,
and T is the total number who experience the outcome. Although somewhat arbitrary, the w_i's are usually assigned whole integers with equal intervals symmetrical
around 0. Thus, for three ordinal groups, the weights would be −1, 0, and +1; for
four groups, −3, −1, +1, and +3; and so on. The value of χ²_L is then interpreted
at 1 df.
To illustrate, let us calculate χ²_L for the data shown in Table 14.5, which summarizes the results of a cohort study in which children with otitis media (middle-ear
infection) were treated with oral amoxicillin in either the dosage range recommended by the manufacturer, a dosage above that recommended ("high dose"), or a
dosage below the recommended dose ("low dose"). The children were followed for
the duration of their 10-day course of treatment for the occurrence of diarrhea, a
well-known side effect of oral amoxicillin. The ordinary χ² test yields a χ² value of
5.53, which at 2 df is not statistically significant. The χ² for linear trend (χ²_L) of 5.16
at 1 df, however, yields a P value < 0.05, indicating a significant dose-response relationship. Failure to consider the ordinal nature of the exposure variable in the analysis would thus have led to a loss of statistical efficiency.
Another approach to analyzing these data without losing the ordinal nature of
the exposure would be to compare the ranks of exposures (dosage ranges) in the two
outcome groups (children with and without diarrhea) using the Mann-Whitney U-test (see Section 13.4). Although such a procedure would be tantamount to analyzing the data like a case-control study, the test would be an appropriate test of association between exposure and outcome.
Table 14.5. Diarrhea occurring in children with otitis media treated with three different dosages of
amoxicillin (see text)

               High dose     Recommended dose   Low dose      Total
               (w1 = +1)     (w2 = 0)           (w3 = −1)
Diarrhea       12 (= t1)     13 (= t2)           4 (= t3)      29 (= T)
No diarrhea    38            87                 46            171 (= N − T)
Total          50 (= n1)     100 (= n2)         50 (= n3)     200 (= N)

Using Eq. 14.17:
$$\chi^2 = \left[\frac{12^2}{50} + \frac{13^2}{100} + \frac{4^2}{50} - \frac{29^2}{200}\right]\left[\frac{200^2}{(29)(171)}\right] = 5.53$$
At 2 df, P > 0.05.
$$\sum t_i w_i = (12)(+1) + (13)(0) + (4)(-1) = +8$$
$$\sum n_i w_i = (50)(+1) + (100)(0) + (50)(-1) = 0$$
$$\sum n_i w_i^2 = (50)(+1) + (100)(0) + (50)(+1) = +100$$
Using Eq. 14.18:
$$\chi^2_L = \frac{200[(200)(8) - (29)(0)]^2}{(29)(171)[(200)(100) - (0)^2]} = 5.16$$
At 1 df, P < 0.05.
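The two calculations for Table 14.5 can be checked with a short Python sketch. The function names are hypothetical, and the overall χ² is coded in the form used in the worked calculation (Eq. 14.17 itself is stated earlier in the chapter), so treat this as an illustration rather than a definitive implementation.

```python
def chi_square_overall(t, n):
    """Overall chi-square for a 2 x k table, in the form of the worked example."""
    N, T = sum(n), sum(t)
    bracket = sum(ti ** 2 / ni for ti, ni in zip(t, n)) - T ** 2 / N
    return bracket * N ** 2 / (T * (N - T))

def chi_square_trend(t, n, w):
    """Chi-square for linear trend (Eq. 14.18), interpreted at 1 df."""
    N, T = sum(n), sum(t)
    sum_tw = sum(ti * wi for ti, wi in zip(t, w))
    sum_nw = sum(ni * wi for ni, wi in zip(n, w))
    sum_nw2 = sum(ni * wi ** 2 for ni, wi in zip(n, w))
    numerator = N * (N * sum_tw - T * sum_nw) ** 2
    denominator = T * (N - T) * (N * sum_nw2 - sum_nw ** 2)
    return numerator / denominator

# Table 14.5: high, recommended, and low dose
t = [12, 13, 4]      # children with diarrhea in each dosage group
n = [50, 100, 50]    # children treated in each dosage group
w = [+1, 0, -1]      # weights assigned to the ordinal dosage groups

chi2_overall = chi_square_overall(t, n)
chi2_trend = chi_square_trend(t, n, w)
print(f"overall chi-square = {chi2_overall:.2f} (2 df)")
print(f"trend chi-square   = {chi2_trend:.2f} (1 df)")
```

Because the trend statistic concentrates the test on a single degree of freedom, it can reach significance where the overall test does not.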
When the axes of the table are reversed, i.e., the outcome is ordinal and the
exposure is dichotomous (a 2 × c table), the Mann-Whitney test is even more appropriate [12]. The χ² for linear trend can be used here too, however, provided the
analyst makes the following "substitutions":
1. t_i now becomes the number of subjects in the index exposure group in each outcome category
2. n_i now becomes the total number of subjects in each outcome category
3. w_i now applies to the weights assigned to the outcome categories
(r − 1)(c − 1) degrees of freedom. A significant value for χ², however, will indicate
only an overall tendency for deviation of observed frequencies from those expected
under Ho. It will not indicate which exposure and outcome categories are most
responsible for the overall association.
Although visual inspection of the r × c table can often provide an impression,
that impression can be confirmed by partitioning the overall table into smaller tables
such that the degrees of freedom among all these "subtables" sum to (r − 1)(c − 1).
The χ² value in each subtable can then be assessed for statistical significance. Armitage [11] provides a good discussion of this procedure.
When exposure and outcome are both ordinal, the most appropriate test of association involves a test of linear correlation using ranks. Rank correlation will be discussed, along with other forms of linear correlation and regression, in Chapter 15.
References
1. Miettinen OS (1974) Simple interval-estimation of risk ratio. Am J Epidemiol 100: 515-516
2. Fleiss JL (1981) Statistical methods for rates and proportions, 2nd edn. John Wiley & Sons, New York, pp 23-24, 71-75, 260-280
3. Kleinbaum DG, Kupper LL, Morgenstern H (1982) Epidemiologic research: principles and quantitative methods. Lifetime Learning Publications, Belmont, CA, pp 296-307, 332
4. Mantel N, Haenszel W (1959) Statistical aspects of the analysis of data from retrospective studies of disease. JNCI 22: 719-748
5. Kleinbaum DG, Kupper LL (1978) Applied regression analysis and other multivariable methods. Duxbury, North Scituate, MA, pp 414-446
6. Anderson S, Auquier A, Hauck WW, Oakes D, Vandaele W, Weisberg HI (1980) Statistical methods for comparative studies: techniques for bias reduction. Wiley, New York, pp 161-177
7. Breslow NE, Day NE (1980) Statistical methods in cancer research, vol. 1. The analysis of case-control studies. International Agency for Research on Cancer, Lyon, pp 191-279
8. Hanley JA (1983) Appropriate uses of multivariate analysis. Annu Rev Public Health 4: 155-180
9. Colton T (1974) Statistics in medicine. Little, Brown, Boston, pp 153-174
10. Schlesselman JJ (1982) Case-control studies: design, conduct, analysis. Oxford University Press, New York, pp 293-313
11. Armitage P (1971) Statistical methods in medical research. Blackwell Scientific, Oxford, pp 353-368
12. Moses LE, Emerson JD, Hosseini H (1984) Analyzing data from ordered categories. N Engl J Med 311: 442-448
The main objective in most epidemiologic studies is the measurement of the association between exposure and outcome. Chapter 13 focused on testing the difference in
outcome measured on a continuous scale in two (or more) exposure groups. The
difference between the group means reflects the extent of association between the
continuous outcome and the categorical exposure, and the corresponding P value
derived from a t- or z-test indicates the probability of obtaining the observed or
stronger degree of association by chance under the null hypothesis. In Chapter 14,
both outcome and exposure were categorical (often dichotomous). The usual measures of association between a dichotomous exposure and a dichotomous outcome
are the difference in proportions in those subjects with and without the outcome in
the two exposure groups and the relative risk (RR) for cohort studies and the odds
ratio (OR) for case-control studies. These measures are then usually tested for statistical significance using the χ² test or Fisher exact test.
How do we measure the association between exposure and outcome when both
are continuous variables? One of the variables (usually exposure) could be dichotomized, and the means of the other variable (outcome) in the groups defined by
that dichotomy could be compared by means of a t- or z-test. But as we saw in
Chapter 14, categorization of continuous data may be statistically inefficient. The
initial strategy for measuring the association between two continuous variables is
usually to examine the extent to which the relationship between the two can be
described by a straight line, i. e., the extent of their linear correlation.
Linear correlation measures the degree to which an increase in one of the variables is associated with a proportional increase or decrease in the second variable.
Consider a cross-sectional study of the effect of impairment in renal function on the
hemoglobin concentration. The investigator hypothesizes that progressive decrements in renal function (as measured by rises in the serum creatinine concentration)
will be associated with a proportional fall in hemoglobin. The data on ten patients
with chronic renal failure are shown in Table 15.1, and the corresponding scatter diagram is displayed in Fig. 15.1. If every point fell exactly on a straight line, the two
variables would be said to be perfectly correlated.
Table 15.1. Serum creatinine and hemoglobin concentrations in ten subjects with chronic renal failure

Subject no.   Creatinine (mg/dl)   Hemoglobin (g/dl)
 1            4.1                   9.0
 2            2.8                   9.5
 3            6.5                   8.4
 4            2.4                  11.7
 5            3.7                  10.8
 6            8.0                   8.2
 7            5.3                   8.8
 8            7.9                   8.9
 9            4.4                  10.1
10            3.2                  11.5

[Fig. 15.1. Scatter diagram of hemoglobin concentration (y-axis) against serum creatinine concentration (x-axis) for the ten subjects in Table 15.1]

It should be emphasized that linear correlation is strongly influenced by a few
extreme values of the two variables whose correlation is being assessed. Suppose, for
example, the scatter diagram shows that most of the data points lie in a circle (indicating no correlation) but that a few lie bunched together in an area on a "diagonal"
from, i.e., above or below and to the left or right of, the circle representing the
majority of the points. The linear correlation may then be fairly high and may be a
poor summary of the relationship between the two variables. A good rule of thumb,
therefore, is to plot the data and examine them visually before reporting the linear
correlation.
15.1.2 Dependent and Nondependent Correlation
In examining the relationship between exposure and outcome, we are testing the
research hypothesis that the outcome depends on exposure. In our creatinine-hemoglobin example, hemoglobin is being tested for its dependence on renal function (as
measured by the serum creatinine concentration). Even though the study is cross-sectional,
the temporal relationship between the two seems clear on biologic
grounds. Renal failure may cause anemia, but anemia does not usually (short of
massive hemolysis) cause renal failure. Serum creatinine is the exposure variable,
and hemoglobin is the outcome.
In the parlance of linear correlation and regression, exposure is usually called
the independent variable, and outcome the dependent variable. In our example, it is
as if the creatinine were "allowed" to vary independently, and the hemoglobin then
depended on the observed value of creatinine. By convention, the independent variable is usually represented by x and the dependent variable by y. Thus, in Fig. 15.1,
the serum creatinine concentration is indicated by the x-axis (abscissa), and hemoglobin by the y-axis (ordinate).
By way of contrast to the obviously dependent relationship between hemoglobin
and creatinine, consider the relationship between blood urea nitrogen (BUN) and
creatinine concentrations. The two are usually highly positively correlated, because
they represent two different tests of renal function, even though other factors (e. g.,
state of hydration for BUN and muscle mass for creatinine) prevent the correlation
from being perfect. In a cross-sectional study of these two variables, it would be difficult indeed to label one as "exposure" and the other as "outcome." Because both
depend on renal function and neither depends on the other, their relationship with
each other is nondependent. In a graphical display of the relationship, either one
could be represented by the y-axis.
The decision that a relationship is dependent or nondependent thus arises from
clinical reasoning, not from statistical inference. Because most clinical and epidemiologic studies are based on a hypothesized association between exposure and outcome, our major focus here will be on dependent relationships.
15.1.3 Measuring the Extent of Linear Correlation
The Pearson correlation coefficient, which is abbreviated by the letter r, is a descriptive
statistic indicating the extent of linear correlation between two continuous variables.
It is defined mathematically as follows:
$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} \qquad (15.1)$$
where x and y are the two continuous (independent and dependent, respectively)
variables being correlated on each of the i study subjects.
The correlation coefficient r can range in value from - 1 to + 1, with 0 representing no correlation, -1 a perfect inverse correlation (negatively sloping line),
and + 1 a perfect positive correlation (positively sloping line) between the two variables. For our hemoglobin-creatinine example, r= - 0.779, indicating a strong
inverse correlation.
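Equation 15.1 translates directly into code. The sketch below is illustrative: the data are the creatinine and hemoglobin values as transcribed from Table 15.1 (treat the transcription as an assumption if your copy of the table differs), and the function name is hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient, computed as in Eq. 15.1."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sum_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sum_xx = sum((xi - mean_x) ** 2 for xi in x)
    sum_yy = sum((yi - mean_y) ** 2 for yi in y)
    return sum_xy / math.sqrt(sum_xx * sum_yy)

# Values transcribed from Table 15.1 (subjects 1 to 10)
creatinine = [4.1, 2.8, 6.5, 2.4, 3.7, 8.0, 5.3, 7.9, 4.4, 3.2]       # mg/dl
hemoglobin = [9.0, 9.5, 8.4, 11.7, 10.8, 8.2, 8.8, 8.9, 10.1, 11.5]   # g/dl

r = pearson_r(creatinine, hemoglobin)
print(f"r = {r:.3f}, r squared = {r ** 2:.3f}")
```

The sign of r reflects the direction of the relationship; its square is the quantity interpreted as explained variance.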
Since the linear correlation between two variables is rarely perfect (i. e., r rarely
equals + 1 or -1), we are often interested in measuring the extent to which the
relationship between the two is explained by a straight line. To do this, we make use
of a concept known as explained variance. As will be recalled from Chapter 11, the
total variance of any continuous variable is the square of its standard deviation and
is a measure of the spread of a group's set of values around its mean.
We can interpret r in these terms by measuring the proportion of total variance
in one (usually the dependent) variable that is due to its linear relationship with the
other (independent variable). Using our example of hemoglobin and creatinine, we
can thus divide the variance in hemoglobin into two components: (a) that component due to the linear relationship between hemoglobin and creatinine and (b) that
component due to sampling (random) variation or other sources. It can be shown
that r² equals the proportion of variance in either variable due to its linear correlation with the other. In our example, r = −0.779, and thus r² = 0.607. Our interpretation of this value of r² is that approximately 61% of the variance in hemoglobin is
"accounted for" by the serum creatinine.
The interpretation of r requires some further discussion. The correlation
between two continuous variables x and y, as expressed by the correlation coefficient, refers to the degree of linear relationship between x and y. Now, there might
be a very close relationship between the two variables but one that is not well
described by a straight line. In that case, linear correlation might be very poor,
despite the close mathematical relationship.
For example, consider the following equation:
$$y = 1 + (x - 3)^2$$
which is graphically depicted in Fig. 15.2. In this example, y is a perfect quadratic
function of x (all points lie on the curve), but not a linear one (at least, not over the
range of x's shown in the figure). Despite the obvious closeness of the relationship,
r = 0.
[Fig. 15.2. Graph of y = 1 + (x − 3)²]
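A quick numerical check makes the point concrete. The x values below (0 through 6, symmetrical around 3) are an assumption chosen for illustration; the exact range plotted in Fig. 15.2 is not reproduced here. The result r = 0 depends on that symmetry.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient (Eq. 15.1)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sum_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sum_xx = sum((xi - mean_x) ** 2 for xi in x)
    sum_yy = sum((yi - mean_y) ** 2 for yi in y)
    return sum_xy / math.sqrt(sum_xx * sum_yy)

# ASSUMED range: x values placed symmetrically around x = 3
x_values = [0, 1, 2, 3, 4, 5, 6]
y_values = [1 + (x - 3) ** 2 for x in x_values]  # perfect quadratic relationship

r_quadratic = pearson_r(x_values, y_values)
print(f"r = {r_quadratic:.3f}")
```

Every point lies exactly on the curve, yet the positive and negative cross-products cancel, so the linear correlation vanishes.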
Linear Regression
$$\hat{y} = a + bx$$
where ŷ (y "hat") indicates the fitted estimate of y based on a, b, and x. This is the
general equation for a straight line, in which a is the intercept (the average value of y
when x = 0) and b is the slope (the average change in y per unit change in x).
Another name for b is the regression coefficient.
Mathematically, a and b can be computed as follows:
$$b = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} \qquad (15.2)$$
$$a = \bar{y} - b\bar{x} \qquad (15.3)$$
$$\hat{y} = 12.042 - 0.487x$$
This means that, on average, for every increase in serum creatinine concentration of
1 mg/ dl, the decrease in hemoglobin concentration is 0.487 g/ dl, at least over the
range of measurements shown in Fig. 15.1. (It is hazardous to extrapolate the linear
relationship between x and y beyond the observed measured ranges of x and y.)
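The slope and intercept of Eqs. 15.2 and 15.3 can be computed as in the following sketch. The data are the values as transcribed from Table 15.1 (an assumption if your copy differs), and the function name is hypothetical.

```python
def linear_regression(x, y):
    """Least-squares intercept a and slope b for the regression of y on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sum_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sum_xx = sum((xi - mean_x) ** 2 for xi in x)
    b = sum_xy / sum_xx          # Eq. 15.2
    a = mean_y - b * mean_x      # Eq. 15.3
    return a, b

# Values transcribed from Table 15.1 (subjects 1 to 10)
creatinine = [4.1, 2.8, 6.5, 2.4, 3.7, 8.0, 5.3, 7.9, 4.4, 3.2]       # x, mg/dl
hemoglobin = [9.0, 9.5, 8.4, 11.7, 10.8, 8.2, 8.8, 8.9, 10.1, 11.5]   # y, g/dl

a, b = linear_regression(creatinine, hemoglobin)
print(f"fitted line: y-hat = {a:.3f} + ({b:.3f})x")

# Fitted hemoglobin at a creatinine of 5.0 mg/dl, which lies inside the observed range
fitted_at_5 = a + b * 5.0
print(f"fitted hemoglobin at 5.0 mg/dl: {fitted_at_5:.2f} g/dl")
```

Predictions are only sensible for x values within the observed range of creatinine concentrations.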
As discussed in Section 15.1.2, the relationship between hemoglobin and creatinine is a biologically dependent one, in the sense that we believe that renal function
(as measured by the serum creatinine) affects the hemoglobin concentration, rather
than the reverse. It is for this reason that we have regressed hemoglobin (y) on creatinine (x). This follows the usual convention of regressing the depe'ndent variable (y)
on the independent variable (x).
Mathematically, however, we could regress x on y:
$$x = a' + b'y$$
where the "primes" are used to indicate the intercept and slope of the "inverted
regression" and are computed as follows:
$$b' = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (y_i - \bar{y})^2} \qquad (15.4)$$
$$a' = \bar{x} - b'\bar{y} \qquad (15.5)$$
$$x = 16.900 - 1.246y$$
Fig. 15.3. Regression of hemoglobin concentration (y) on serum creatinine concentration (x) and
vice versa
The two different regression lines are illustrated in Fig. 15.3. As we have seen,
the regression of y on x is the biologically sensible one reflecting our hypothesis
regarding exposure (renal failure) and outcome (anemia). The other regression line,
though mathematically correct, is biologically nonsensical. The choice would have
been more difficult, however, in our example of BUN and creatinine. In that
(nondependent) case, either regression line would have been appropriate for displaying the relationship.
Statistical Inference
r and b
r = 1, b = 0.5
is represented by a perfect straight line. In other words, r = +1 for all three. The
slope b differs considerably among the three, however.
The invariable nature of r, irrespective of which variable is regressed on which, is
easily appreciated from the mathematical relationship between r, on the one hand,
and b and b', on the other. As can be seen from inspection of Eqs. (15.1), (15.2), and
(15.4):
$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}, \quad \text{so that} \quad r^2 = b\,b' \qquad (15.6)$$
Equation 15.6 indicates that b is not the mere inverse of b' (that is, b ≠ 1/b'), unless
the linear relationship between x and y is perfect (i.e., r = +1 or −1).
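The relationship among r, b, and b' can be verified numerically. This sketch uses the values as transcribed from Table 15.1 (an assumption if your copy differs) and checks that the product of the two slopes equals r², which is why b equals 1/b' only when r is +1 or −1.

```python
import math

# Values transcribed from Table 15.1 (subjects 1 to 10)
x = [4.1, 2.8, 6.5, 2.4, 3.7, 8.0, 5.3, 7.9, 4.4, 3.2]        # creatinine, mg/dl
y = [9.0, 9.5, 8.4, 11.7, 10.8, 8.2, 8.8, 8.9, 10.1, 11.5]    # hemoglobin, g/dl

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)
sum_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sum_xx = sum((xi - mean_x) ** 2 for xi in x)
sum_yy = sum((yi - mean_y) ** 2 for yi in y)

b = sum_xy / sum_xx                      # slope of y on x (Eq. 15.2)
b_prime = sum_xy / sum_yy                # slope of x on y (Eq. 15.4)
a_prime = mean_x - b_prime * mean_y      # intercept of x on y (Eq. 15.5)
r = sum_xy / math.sqrt(sum_xx * sum_yy)  # Eq. 15.1

print(f"b = {b:.3f}, b' = {b_prime:.3f}, a' = {a_prime:.3f}")
print(f"b * b' = {b * b_prime:.4f}, r squared = {r ** 2:.4f}")
print(f"1 / b' = {1 / b_prime:.3f} (differs from b unless r = +1 or -1)")
```

Note that r is the same whichever variable is regressed on the other, while the two slopes differ.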
As we have seen, r and b are descriptive statistics that describe different aspects of
the linear relationship between two continuous variables, x and y. When r = 0 or
b = 0, there is no linear correlation or linear dependence between x and y.
When values are obtained that differ from 0, however, we need to ask ourselves
whether such a difference is statistically significant; i. e., what is the probability that
a difference as large or larger would arise by chance under a null hypothesis of no
correlation?
Since r and b are calculated on a sample, we must turn once again to hypothesis
testing to provide statistical inferences about the linear relationship between x and y
in the target population of which the study subjects are a (hypothetically) random
sample. We test the statistical significance of r or b by postulating the null hypothesis
that ρ (the population correlation coefficient) or β (the population regression coefficient) is equal to 0.
We can do a t-test on the value of b or r in the study sample, and derive the
probability (P value) of obtaining this value of b or r in a random sample from a
population in which ρ = 0 and β = 0. As it turns out, the formula for t is the same for
testing either b or r. This makes sense, since the two are mathematically related and
since no correlation means no regression (and vice versa).
The formula is as follows:
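$$t = r\sqrt{\frac{n-2}{1-r^2}} \qquad (15.7)$$
$$t = -0.779\sqrt{\frac{16-2}{1-(-0.779)^2}} = -4.658$$
$$\tfrac{1}{2}\ln\left(\frac{1+r}{1-r}\right) \qquad (15.8)$$
which is normally distributed when the sample size is sufficiently large (n ≥ 20).
Then the 100(1 − α)% confidence interval for the corresponding population parameter can be computed as follows:
$$\tfrac{1}{2}\ln\left(\frac{1+\rho}{1-\rho}\right) = \tfrac{1}{2}\ln\left(\frac{1+r}{1-r}\right) \pm \frac{z_\alpha}{\sqrt{n-3}} \qquad (15.9)$$
$$\tfrac{1}{2}\ln\left(\frac{1+\rho}{1-\rho}\right) = \tfrac{1}{2}\ln\left(\frac{1-0.779}{1+0.779}\right) \pm \frac{1.96}{\sqrt{16-3}}$$
$$= \tfrac{1}{2}\ln(0.124) \pm 0.544$$
$$= -1.043 \pm 0.544$$
$$= -1.587 \text{ to } -0.499$$
Hence
$$\ln\left(\frac{1+\rho}{1-\rho}\right) = -3.174 \text{ to } -0.998$$
$$\frac{1+\rho}{1-\rho} = 0.042 \text{ to } 0.369$$
$$1 + \rho = (0.042)(1-\rho) \text{ to } (0.369)(1-\rho)$$
$$\rho = (0.042)(1-\rho) - 1 \text{ to } (0.369)(1-\rho) - 1$$
$$= -0.042\rho - 0.958 \text{ to } -0.369\rho - 0.631$$
Therefore, at the two confidence limits, 1.042ρ = −0.958 and 1.369ρ = −0.631,
or ρ = −0.919 and −0.461,
and the 95% confidence interval is −0.919 to −0.461.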
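The t-test of Eq. 15.7 and the confidence interval of Eqs. 15.8 and 15.9 can be sketched as follows. The function names are hypothetical. The sample size is left as a parameter because the printed arithmetic uses n = 16, whereas Table 15.1 lists ten subjects; the sketch shows both. Small differences from the printed figures may arise from rounding of intermediate values.

```python
import math

def t_statistic(r, n):
    """t for testing a population correlation of 0 (Eq. 15.7), at n - 2 df."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

def correlation_confidence_interval(r, n, z_alpha=1.96):
    """Confidence interval for the population correlation (Eqs. 15.8 and 15.9)."""
    z = 0.5 * math.log((1 + r) / (1 - r))      # transformation of Eq. 15.8
    half_width = z_alpha / math.sqrt(n - 3)
    # tanh inverts the transformation, i.e., solves for rho at each limit
    return math.tanh(z - half_width), math.tanh(z + half_width)

r = -0.779
t_16 = t_statistic(r, 16)
lower_16, upper_16 = correlation_confidence_interval(r, 16)
t_10 = t_statistic(r, 10)
lower_10, upper_10 = correlation_confidence_interval(r, 10)

print(f"n = 16: t = {t_16:.2f}, 95% CI {lower_16:.3f} to {upper_16:.3f}")
print(f"n = 10: t = {t_10:.2f}, 95% CI {lower_10:.3f} to {upper_10:.3f}")
```

The interval is not symmetrical around r, and it widens considerably with the smaller sample size.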
The x_i's may be any continuous or dichotomous variables, and the b_i's represent the
corresponding regression coefficients. Each b_i is "corrected" simultaneously for the
linear relationship between its corresponding x_i and every other x_i, as well as for the
linear relationship between the other x_i's and y. An overall r² can be calculated for
the model and represents the proportion of the total variance of y accounted for by
its linear relationship with all the x_i's. Multiple regression is commonly included
among the available techniques contained in standard statistical software packages.
Further details are available in standard texts [1-3].
As with other measures of exposure-outcome association, including differences
in means and proportions, relative risks, and odds ratios, extraneous variables may
modify the linear correlation between exposure and outcome without confounding
(biasing) the overall correlation. Two variables may be poorly correlated in the overall study sample but highly correlated within one or more subgroups, or the correlation may be "statistically significant" in both but quantitatively much greater in one
than the other. In such cases, the magnitudes and significance of the correlation
should be reported separately for the different subgroups.
$$r_s = 1 - \frac{6\sum d_i^2}{n^3 - n} \qquad (15.10)$$

Table 15.2. Rankings (from Table 15.1) of serum creatinine and hemoglobin concentrations in ten
subjects with chronic renal failure

Subject no.   Creatinine (rank)   Hemoglobin (rank)   d_i    d_i²
 1             5                   5                   0      0
 2             2                   6                  −4     16
 3             8                   2                  +6     36
 4             1                  10                  −9     81
 5             4                   8                  −4     16
 6            10                   1                  +9     81
 7             7                   3                  +4     16
 8             9                   4                  +5     25
 9             6                   7                  −1      1
10             3                   9                  −6     36
                                                  Σd_i² = 308
The magnitude of r_s can vary between −1 and +1 and is interpreted in the same
way as the Pearson correlation coefficient, r. Its statistical significance can be
assessed by consulting tabulated (see Appendix Table A.8) minimum values of r_s
required for the usual threshold P values at varying sample sizes (n's). For sample
sizes of ten or more, the sampling distribution for the r_s's under the null hypothesis
of no correlation is approximated by the standard normal distribution, and a z-test
can be used:
$$z = r_s\sqrt{n-1} \qquad (15.11)$$
$$r_s = 1 - \frac{6(308)}{1000 - 10} = -0.867$$
which is not far from the calculated value for r (−0.779). The corresponding
P value for this value of r_s can be seen from the Appendix Table A.8 to be < 0.01.
Or alternatively, using Eq. 15.11:
$$z = r_s\sqrt{n-1} = -0.867\sqrt{10-1} = -2.60$$
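The rank correlation can be reproduced with a short sketch. It ranks the values as transcribed from Table 15.1 (an assumption if your copy differs) and then applies Eqs. 15.10 and 15.11. The simple ranking helper is hypothetical and assumes there are no tied values, which holds for these data.

```python
import math

def ranks(values):
    """Rank values from 1 (lowest) to n (highest); assumes no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

# Values transcribed from Table 15.1 (subjects 1 to 10)
creatinine = [4.1, 2.8, 6.5, 2.4, 3.7, 8.0, 5.3, 7.9, 4.4, 3.2]
hemoglobin = [9.0, 9.5, 8.4, 11.7, 10.8, 8.2, 8.8, 8.9, 10.1, 11.5]

creatinine_ranks = ranks(creatinine)
hemoglobin_ranks = ranks(hemoglobin)
n = len(creatinine)

# Squared differences between the paired ranks
sum_d_squared = sum((c - h) ** 2 for c, h in zip(creatinine_ranks, hemoglobin_ranks))

r_s = 1 - 6 * sum_d_squared / (n ** 3 - n)   # Eq. 15.10
z = r_s * math.sqrt(n - 1)                   # Eq. 15.11

print(f"sum of squared rank differences = {sum_d_squared}")
print(f"r_s = {r_s:.3f}, z = {z:.2f}")
```

Because only the ranks enter the calculation, the result is unaffected by the actual spacing between the measured values.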
References
1. Armitage P (1971) Statistical methods in medical research. Blackwell Scientific, Oxford, pp 302-332
2. Snedecor GW, Cochran WG (1980) Statistical methods, 7th edn. Iowa State University Press, Ames, pp 334-364
3. Kleinbaum DG, Kupper LL (1978) Applied regression analysis and other multivariable methods. Duxbury, North Scituate, MA, pp 131-208
Part III
Special Topics
16.1 Introduction
Until about 100 years ago, the history and physical examination were the only
sources of information available to the clinician confronted with a diagnostic decision. He was thus limited to what he could see, hear, feel, smell, or taste - in other
words, to what his own senses could tell him. The development of radiology and
bacteriology around the turn of the century enabled him to amplify and extend his
sensory input. More recently, with the refinement of sophisticated radiographic, biochemical, and immunologic techniques, the diagnostic test has become an invaluable
tool in the detection and definition of disease.
On the other hand, not all tests are equally illuminating, and many are very
expensive. Modern clinicians are under increasing attack in many quarters for their
indiscriminate use of diagnostic tests. Humane patient care and limited economic
resources both demand a more thoughtful, critical approach to testing that examines
the relative merits of different tests and their respective costs and benefits.
If we accept the proposition that some tests are better than others for certain disease conditions, and that in certain situations there may be tests that are better left
undone, the question is: How do we go about making these comparisons and decisions? This chapter will provide an epidemiologic and statistical framework for evaluating the worth of diagnostic tests.
Diagnostic Tests
[Figure: distribution of test values; x-axis (Test Value) marked at μ − 2σ, μ − 1σ, μ, μ + 1σ, and μ + 2σ]
Fig. 16.2. Two nonoverlapping distributions [curves labeled Disease-free and Diseased; x-axis: Test Value]
syndrome" [2]. Ulysses, you will recall, underwent a two-year odyssey between the
end of the Trojan War and his return home, during which time he experienced a
number of needless, dangerous, but entertaining adventures. The Odyssey may make
good reading, but foisting such adventures on unsuspecting test subjects makes bad
medicine.
The statistical model easiest to deal with is that of two nonoverlapping independent distributions, which may or may not be Gaussian (Fig. 16.2). In this model,
there are two distinct populations, a disease-free one and a diseased one, without
overlap, so that a given test result will allow a certain decision as to the disease status of the subject tested. Unfortunately, although such a model avoids arbitrary
decisions, most diseases do not afford us this luxury. Genetic diseases due to missing
or abnormal enzymes may show this pattern. For example, patients with phenylketonuria lack the enzyme phenylalanine hydroxylase and are therefore unable to metabolize the amino acid phenylalanine to tyrosine. The distribution of such patients'
serum phenylalanine levels is much higher than, and does not overlap with, the distribution of levels in normal subjects.
A more attractive statistical model, and one that seems to pertain to many diseases, is that of a diseased population and a disease-free population with partially
overlapping (Gaussian or non-Gaussian) distributions. The overall distribution of
test results can take one of two forms, depending on the relative sizes of the two distributions and their degree of overlap: (a) a unimodal distribution skewed toward
the direction of abnormality; or (b) a bimodal distribution with recognizable
"peaks" corresponding to the modes of the diseased and disease-free populations.
With the unimodal skewed form (Fig. 16.3), it can be exceedingly difficult to distinguish the diseased and disease-free distributions by visual inspection, because they
are both "buried" in the overall observed distribution. When the two distributions
are both Gaussian, one approach to solving this problem is to plot the overall distribution on cumulative probability graph paper. By "squeezing" the extremely low and
high probabilities (y-axis) and "stretching out" the middle ones, a Gaussian distribution will be represented by a perfect straight line. A "buried" population with a different Gaussian distribution will appear as a linear deviation from the main line.
This procedure is illustrated in Fig. 16.4 with data from a study by Pethybridge et al.
of birth weights in southwest England in 1965 [3]. Above about 2500 g, the points
fall on a straight line, representing the "normal-birth-weight" ("disease-free") popu-
204
Diagnostic Tests
-o
c::
:eo
a.
o
Q..
Test Value
.99
c
.....0
~
0... (f)
0 .....
~u
CL.~
Q)n
>::J
-.:;(/)
('Ij-
::Jo
.95
.90
.70
.50
.30
.10
.05
.01
::J
1000
2000
3000
4000
lation. Below 2000 g, the points also fall on a fairly straight line and indicate a distinct "low-birth-weight" population. Between 2000 and 2500 g, the points appear
less linear. This range represents the area of overlap.
When the overall distribution is clearly bimodal (Fig. 16.5), separation of the two
populations is somewhat easier. There are some test values that rule out or rule in the
disease. In Fig. 16.5, values below A would lie only within the distribution of the disease-free population, thus ruling out the disease. Conversely, a value above B would
lie only within the distribution of the diseased population. The problem remains,
however, of how to classify subjects with test values between A and B. Some misclassification is inevitable regardless of what "cutoff" point is used, and the choice will
depend largely on the purposes to which the result will be put, as well as on the relative consequences and costs of the two types of misclassification. We will return to
this point in Section 16.6.
Fig. 16.5. Bimodal distribution [curve labeled Observed Distribution; x-axis: Test Value, with cutoff points A and B]
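Table 16.1. Two-by-two table for evaluating the validity of diagnostic tests

                       Disease
                  Present    Absent
Test  Positive    a          b          a + b
      Negative    c          d          c + d
                  a + c      b + d

a, true positives (TP); b, false positives (FP); c, false negatives (FN); d, true negatives (TN).

$$\text{Sensitivity} = \frac{TP}{TP + FN} = \frac{a}{a + c} \qquad (16.1)$$
$$\text{Specificity} = \frac{TN}{TN + FP} = \frac{d}{d + b} \qquad (16.2)$$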
The perfectly valid diagnostic test would have a sensitivity and specificity both equal
to 1. Few if any tests attain these lofty heights, however, and most involve a tradeoff between sensitivity and specificity. Usually, the more sensitive a test is (fewer
false negatives), the less specific (more false positives) and vice versa. In fact,
depending on the setting and purpose of the test (see Section 16.6), the cutoff point
for defining positivity or negativity can be changed to increase one or the other of
these indexes. These trade-offs will be addressed in greater detail in the following
subsection.
16.3.2 The Receiver Operating Characteristics (ROC) Curve
As discussed in Section 16.2, many test results are measured on a continuous scale.
The values are then often dichotomized into a normal (negative) or abnormal (positive) result. I have already mentioned the vagaries involved in choosing a cutoff
point for defining normal and abnormal. Any single cutoff is by necessity arbitrary,
with the consequence that subjects whose test results lie just below or above the
cutoff may be misclassified.
1 These statistical indexes can be expressed as either proportions or percentages. Equations 16.1 and
16.2 yield proportions; the corresponding percentages are obtained by multiplying by 100.
[Fig. 16.6. Receiver operating characteristics (ROC) curve; y-axis: Sensitivity (0 to 1.00); x-axis: 1 − Specificity (0 to 1.00)]
In tests for which higher values reflect greater degrees of abnormality, choosing
a lower cutoff will result in greater sensitivity, i. e., in missing fewer cases of the disease. Conversely, higher values of the cutoff will result in greater specificity, i. e., less
misclassification of people who do not have the disease. This reciprocal relationship
between sensitivity and specificity is always found when a cutoff point is chosen for
the value of a test measured on a continuous scale and can be represented by the
test's receiver operating characteristics curve, an example of which is shown in
Fig. 16.6. Sensitivity is graphed on the y-axis and 1 - specificity on the x-axis. The
greater the sensitivity, the lower the specificity and vice versa.
The term "receiver operating characteristics (ROC) curve" originated in describing performance characteristics of observers using mechanical devices, especially
radar detection instruments. Different points on the ROC curve represent different
choices of cutoff points, each balanced between maximizing sensitivity and specificity. Point A in Fig. 16.6 represents a point on the curve that results in a specificity of
1 but only .30 sensitivity. At the other extreme, point D represents a point on the
curve where sensitivity is 1 but specificity is only .20. Points B and C are intermediate.
It is important to point out that the choice of which of these cutoff points is most
appropriate for a diagnostic test may be governed by the purpose for which the test
is obtained (see Section 16.6). If sensitivity is important, a point on the curve near C
or D is appropriate. When specificity is more important, point A or B would be preferable. It is thus the curve itself, rather than a specific point on the curve, that is
characteristic of the test. In other words, the choice of one specific point on this
curve does not make the test better or worse. A better test (i.e., one with better overall sensitivity and specificity) would have an ROC curve that lies above and to the
left of the curve shown in the graph.
In fact, the position of the curve with respect to the diagonal (shown by the dashed line in Fig. 16.6) is an indicator of its informational value. The diagonal represents the line for which sensitivity = 1 - specificity. Rearranging terms, sensitivity + specificity = 1. A test for which sensitivity and specificity sum to 1 contributes no more information than pure chance. If our "diagnostic test" consisted of flipping a coin, with heads corresponding to a "positive" (abnormal) test and tails to a "negative" (normal) test, we would expect the test to label half the subjects as diseased and half as disease-free, regardless of whether they do or do not actually have the disease. Thus, the sensitivity and specificity of our "test" would both be 0.5.
Alternatively, if our "test" consisted of casting a die and calling a 6 "positive" and anything below 6 "negative," one out of six subjects would be labeled as diseased and five out of six as disease-free, regardless of their true status. The results of testing 60 diseased and 600 disease-free subjects are shown in Table 16.2. Sensitivity is 10/60 = 0.17, specificity is 500/600 = 0.83, and the total is 1.

Table 16.2. Die-casting as a "diagnostic test"

                     Disease
Die roll       Present    Absent     Total
6                 10        100       110
1-5               50        500       550
Total             60        600       660
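The arithmetic behind Table 16.2 can be checked in a few lines. The following Python sketch uses only the counts given in the table; the variable names are illustrative, and exact fractions are used so that the sum is not blurred by rounding.

```python
from fractions import Fraction

# Counts from Table 16.2 (a die roll of 6 is called "positive").
true_pos, false_pos = 10, 100    # roll of 6: diseased, disease-free
false_neg, true_neg = 50, 500    # roll of 1-5: diseased, disease-free

sensitivity = Fraction(true_pos, true_pos + false_neg)   # 10/60
specificity = Fraction(true_neg, true_neg + false_pos)   # 500/600

print(f"sensitivity = {float(sensitivity):.2f}")
print(f"specificity = {float(specificity):.2f}")
print(f"sum = {sensitivity + specificity}")
```

Because the die labels one-sixth of the subjects "positive" whether or not they are diseased, sensitivity is 1/6 and specificity is 5/6, and the two necessarily sum to 1.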
The closer a test's ROC curve lies to the diagonal, the less information it provides. The further above and to the left of the diagonal, the more informative it is.
(A test whose ROC curve lies below and to the right of the diagonal yields results
that are worse than those due to chance. In that case, the criteria for "positive" and
"negative" should be reversed!)
16.3.3 Spectrum and Bias
The sensitivity and specificity of a diagnostic test are often regarded as "fixed" characteristics of the test. Such is not the case, however. Many tests that appear highly
sensitive and specific when first described eventually prove considerably less so after
their introduction into the "real world" of clinical practice. Why does such disillusionment occur? The main source of the problem is the design of the studies in
which the tests are originally evaluated.
As described in an excellent article by Ransohoff and Feinstein [5], the defects in
design of these studies concern spectrum and bias. Spectrum means: Is the range of
patients or subjects tested adequate? A broad spectrum of cases, or persons with the
disease, is required to assess adequately the sensitivity of a test, while a broad spectrum of controls, or persons without the disease, is necessary for the adequate
assessment of its specificity. Bias means: Are the diagnosis and test result determined
independently of one another? Bias can affect both sensitivity and specificity.
Spectrum includes a number of components. First, we must consider the spectrum of cases studied. An inadequate case spectrum may result in a misleading sensitivity. Here we should examine the pathologic spectrum of the disease under study,
in other words, its extent, location, and histology. A test for cancer should thus
include both patients with local and patients with metastatic disease. The clinical
spectrum of a disease, that is, its chronicity and severity, is also of obvious importance. If a test is more related to cachexia than to cancer, it may look more sensitive
than it really is if only severely cachectic cancer patients are studied. The co-morbid
spectrum of a disease, that is, the presence of coexisting ailments, can also affect a
test's sensitivity. Test results may be different in cancer patients with and without
cardiovascular disease, for example.
The spectrum of the controls studied can have profound effects on the specificity
of a diagnostic test. An adequate control spectrum should include patients with the
same disease process as the cases, but in a different location (for example, patients
with breast cancer in a test for colon cancer) and should also include patients with
different disease processes in the same location (such as patients with ulcerative colitis in a test for colon cancer). The latter is particularly important, because it is precisely in those patients with similar symptoms (e.g., diarrhea, bloody stools, abdominal pain, and weight loss in the case of colon cancer) that the clinician will want to
use the test in practice.
The importance of spectrum in both cases and controls underlines an important
principle. The purpose of a diagnostic test, at least in the clinical setting, is to detect
a disease that is not otherwise obvious in patients with compatible symptoms. No
competent clinician requires a test to distinguish a person with advanced cancer
from one who is perfectly healthy. Yet many tests are originally evaluated in a study
sample containing just such extreme "cases" and "controls." In attempting to convince their audience of the diseased and disease-free status of their cases and controls, respectively, the original authors may obtain results for both sensitivity and
specificity far higher than those to be seen when the test is applied in the real clinical
world of patients with less obvious symptoms and early disease.
The second type of design defect in testing a test is bias. Bias can lead to falsely
high sensitivity or specificity; it can manifest itself in four ways. In workup bias, the
result of a test affects the intensity of the subsequent "workup" (i.e., further diagnostic procedures) of the patient, thus increasing the chances for diagnosing the disease. Nonblind diagnosis means that the persons making the diagnosis are aware of
the test result at the time of diagnosis. This is a potent source of bias when the diagnosis involves a subjective judgment. Nonblind test interpretation means that the persons interpreting the test results are aware of the true diagnosis at the time of test
\[
\text{positive predictive value} = \frac{TP}{TP + FP} = \frac{a}{a + b} \tag{16.3}
\]
Negative predictive value is the proportion of subjects with a negative test who are
disease free:
\[
\text{negative predictive value} = \frac{TN}{TN + FN} = \frac{d}{d + c} \tag{16.4}
\]
Positive and negative predictive value can thus be considered "horizontal" indexes,
since they are defined by row proportions, whereas the "vertical" indexes of sensitivity and specificity are defined by column proportions.
Unlike sensitivity and specificity, positive and negative predictive value are not
true indexes of validity, because they depend on the relative proportions of diseased
and disease-free persons being tested. Since they are governed by the ratio of true
and false positives (positive predictive value) or true and false negatives (negative
predictive value), a test with high specificity (few false positives among the disease-free) can have low positive predictive value if the ratio of disease-free to diseased subjects is high. Similarly, a test with high sensitivity (few false negatives among the diseased) can have low negative predictive value if the ratio of disease-free to diseased subjects is low (a very unlikely testing situation).
These relationships are illustrated in Table 16.3. In A, diseased (D) and disease-free (D̄) subjects are equally represented, i.e., D : D̄ = 1 : 1. Sensitivity, specificity, positive predictive value, and negative predictive value are all high. In B, disease-free subjects predominate (D : D̄ = 1 : 9). Sensitivity and specificity remain the same (since they are characteristics of the test in diseased and disease-free persons respectively), but positive predictive value falls, while negative predictive value rises. In fact, in the common clinical situation in which the disease being tested for is rare, D : D̄ may be lower than 1 : 9, even among patients for whom the clinician is suspicious enough to
request the test, with a consequent further reduction in the positive predictive value.
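The dependence of the predictive values on the mix of diseased and disease-free subjects can be reproduced numerically. The Python sketch below holds sensitivity at 0.90 and specificity at 0.80 and recomputes the predictive values for the three groups of 1000 subjects shown in Table 16.3; the function and variable names are illustrative choices, not part of the text.

```python
def predictive_values(sensitivity, specificity, n_diseased, n_disease_free):
    """Return (positive, negative) predictive value for a given subject mix."""
    true_pos = sensitivity * n_diseased
    false_neg = n_diseased - true_pos
    true_neg = specificity * n_disease_free
    false_pos = n_disease_free - true_neg
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv

# Diseased and disease-free counts for panels A, B, and C of Table 16.3.
panels = {"A": (500, 500), "B": (100, 900), "C": (900, 100)}

for label, (n_d, n_df) in panels.items():
    ppv, npv = predictive_values(0.90, 0.80, n_d, n_df)
    print(f"{label}: positive predictive value = {ppv:.2f}, "
          f"negative predictive value = {npv:.2f}")
```

With the test's sensitivity and specificity unchanged, the positive predictive value falls from 0.82 to 0.33 as the ratio of diseased to disease-free subjects goes from 1 : 1 to 1 : 9.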
Table 16.3. Sensitivity, specificity, and positive and negative predictive values with three different ratios of diseased (D) and disease-free (D̄) subjects

A. D : D̄ = 1 : 1

                  Disease
Test           D        D̄      Total
+            450      100       550
-             50      400       450
Total        500      500      1000

Sensitivity = 450/500 = 0.90
Specificity = 400/500 = 0.80
Positive predictive value = 450/550 = 0.82
Negative predictive value = 400/450 = 0.89

B. D : D̄ = 1 : 9

                  Disease
Test           D        D̄      Total
+             90      180       270
-             10      720       730
Total        100      900      1000

Sensitivity = 90/100 = 0.90
Specificity = 720/900 = 0.80
Positive predictive value = 90/270 = 0.33
Negative predictive value = 720/730 = 0.99

C. D : D̄ = 9 : 1

                  Disease
Test           D        D̄      Total
+            810       20       830
-             90       80       170
Total        900      100      1000

Sensitivity = 810/900 = 0.90
Specificity = 80/100 = 0.80
Positive predictive value = 810/830 = 0.98
Negative predictive value = 80/170 = 0.47
Bayes' Theorem

\[
\frac{\dfrac{a}{a+c} \cdot \dfrac{a+c}{N}}{\dfrac{a+b}{N}} = \frac{a/N}{(a+b)/N} = \frac{a}{a+b}
\]

\[
P(D \mid T+) = \frac{a}{a+b}, \qquad P(T+ \mid D) = \frac{a}{a+c}
\]
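\[
\frac{P(D \mid T+)}{P(\bar{D} \mid T+)} = \frac{P(T+ \mid D)\,P(D)/P(T+)}{P(T+ \mid \bar{D})\,P(\bar{D})/P(T+)} = \frac{P(T+ \mid D)}{P(T+ \mid \bar{D})} \times \frac{P(D)}{P(\bar{D})} \tag{16.8}
\]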
since the two P(T+) terms cancel. In other words, Eq. 16.8 says that the odds of disease given a positive test (also called the posterior odds) is the product of the odds of a positive test under the competing alternatives of disease and nondisease (the so-called likelihood ratio) and the relative proportion of diseased and disease-free subjects tested (the prior odds). The Bayesian "translation" of Eq. 16.8 is:
posterior odds = likelihood ratio × prior odds
It can be seen that the likelihood ratio is nothing more than the ratio of sensitivity to
1 - specificity.
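To illustrate the use of Eq. 16.8, let us work through the calculation of positive predictive accuracy for the test results shown in Table 16.3 B. The sensitivity of the test is 0.90, the specificity is 0.80, and the prior odds of disease is 100/900. Thus:
\[
\text{posterior odds of disease} = \frac{0.90}{1 - 0.80} \times \frac{100}{900} = 0.50
\]
Converting the posterior odds to a probability gives
\[
\frac{0.50}{0.50 + 1} = 0.33,
\]
in agreement with the positive predictive value shown in Table 16.3 B.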
When combined with the prior odds, LR+ and LR- provide an instant index to
the impact of a positive or negative test result. If the prior odds is low (e.g., testing
an asymptomatic, healthy "volunteer") and the test is highly discriminatory (high
LR+ and low LR-), a positive test result will yield a large increase in the posterior
odds (relative to the prior odds), but a negative test result will succeed only in making a remote possibility even more remote. Conversely, if the prior odds is very high
(e.g., a patient with "classic" signs and symptoms), a positive test will result in very
little change in the posterior odds, although a negative test will substantially reduce
it.
The overall clinical utility of a test is therefore greatest when the prior odds is
near 1, i.e., when the clinician is most uncertain (a virtual "toss-up" between disease
presence or absence) prior to the test. A positive test then makes the disease likely,
and a negative test makes it unlikely. This is consistent with common sense: the less
certain we are, the more we are swayed by new information.
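The effect of the prior odds on the impact of a test result can be sketched numerically. The Python fragment below is illustrative only: it reuses the sensitivity (0.90) and specificity (0.80) of Table 16.3, takes LR+ as sensitivity/(1 - specificity), and assumes the standard definition of LR- as (1 - sensitivity)/specificity. The three prior odds are hypothetical.

```python
def odds_to_probability(odds):
    """Convert odds to the corresponding probability."""
    return odds / (1 + odds)

sensitivity, specificity = 0.90, 0.80
lr_positive = sensitivity / (1 - specificity)     # about 4.5
lr_negative = (1 - sensitivity) / specificity     # about 0.125 (assumed standard definition)

results = []  # (prior probability, probability after +, probability after -)
for prior_odds in (1 / 99, 1.0, 99.0):            # hypothetical prior odds
    prior_p = odds_to_probability(prior_odds)
    after_pos = odds_to_probability(lr_positive * prior_odds)
    after_neg = odds_to_probability(lr_negative * prior_odds)
    results.append((prior_p, after_pos, after_neg))
    print(f"prior {prior_p:.2f}: positive test -> {after_pos:.3f}, "
          f"negative test -> {after_neg:.3f}")
```

In this sketch the gap between the probabilities after a positive and after a negative result is widest when the prior odds is 1, which is the sense in which the test is most useful when the clinician is most uncertain.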
Another advantage of the Bayesian approach to the interpretation of diagnostic
tests is that it is not necessarily tied to the dichotomous ("positive" vs "negative," "abnormal" vs "normal") characterization of the results (see Section 16.2). For diagnostic tests whose results are expressed on a continuous scale, no cutoff point is necessarily required. Instead, the actual result can be evaluated in terms of its differential diagnostic value, i.e., its consistency with disease vs nondisease. The likelihood ratio merely needs to be expressed in terms of the probabilities of obtaining the observed test result T_j under the competing hypotheses of disease and nondisease:
\[
LR = \frac{P(T_j \mid D)}{P(T_j \mid \bar{D})}
\]
with the probabilities estimated from the underlying distribution of values for the
diseased and disease-free populations.
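As a minimal sketch of this idea, suppose (purely for illustration) that test values are normally distributed in both populations. The means and standard deviations below are hypothetical, and for a continuous result the "probability" of an exact value is taken to be the probability density at that value.

```python
from statistics import NormalDist

# Hypothetical distributions of the test value in each population.
diseased = NormalDist(mu=120, sigma=15)
disease_free = NormalDist(mu=100, sigma=15)

def likelihood_ratio(result):
    """Ratio of the density of the observed result under disease vs nondisease."""
    return diseased.pdf(result) / disease_free.pdf(result)

for result in (90, 110, 130):
    print(f"result {result}: LR = {likelihood_ratio(result):.2f}")
```

A result midway between the two assumed means is equally consistent with disease and nondisease (LR = 1), while results further toward the diseased distribution yield progressively larger likelihood ratios, without any cutoff having been chosen.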
In addition to its use in evaluating diagnostic tests, Bayes' theorem has applications to other aspects of clinical decision-making and to causality inference. These
will be considered in Chapters 17 and 19 respectively.
Case-finding is the testing of patients for diseases unrelated to their specific complaint. A woman who consults her physician because of pain and stiffness in her
knees may have her blood pressure taken, not because the physician suspects hypertension as the cause of her symptoms, but because hypertension that is undetected
and untreated carries a significant risk of subsequent morbidity and mortality.
The major purpose of case-finding is early (presymptomatic) detection. Obviously, the disease tested for should have a treatment that does more good than harm
to those who are afflicted by it; there is no advantage to early detection of an
untreatable disease. A treatment that improves survival, reduces morbidity, or
improves physical or social functioning (performance) should therefore exist before
case-finding is undertaken. Merely advancing the time of diagnosis without delaying
death, morbidity, or functional impairment (the so-called zero-time shift or lead-time
bias) does not constitute an improvement in outcome. In fact, early detection by
itself may do more harm than good if it results in adverse psychological effects due
to "labeling."
The zero-time shift is illustrated in Fig. 16.7. The top diagram represents the natural history of a disease without effective treatment. The time axis runs from left to
right, and the usual sequence is seen of onset of disease, followed by onset of symptoms. The symptoms lead to a visit to a clinician who then establishes the diagnosis.
Death or morbidity occurs at some later time. The lower diagram shows what happens when a test leads only to early detection. Note that onset of disease, onset of
symptoms, and death or morbidity occur at exactly the same points along the time
axis. The sole change has been an earlier time of diagnosis. If survival time (or morbidity-free time) is measured from the time of diagnosis, the patient whose disease
was detected early will appear to go longer before experiencing death or morbidity.
We must be on the lookout for this artifact of early detection; such a change does
not qualify as a beneficial change in outcome.
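A small numerical sketch may help fix the idea. The ages below are hypothetical; the only assumption carried over from the text is that early detection changes the time of diagnosis and nothing else.

```python
# Hypothetical ages (in years) for a single patient.
age_at_death = 70.0
age_at_usual_diagnosis = 65.0    # diagnosis prompted by symptoms
age_at_early_diagnosis = 62.0    # diagnosis advanced by early detection

survival_usual = age_at_death - age_at_usual_diagnosis    # measured from diagnosis
survival_early = age_at_death - age_at_early_diagnosis    # measured from diagnosis

lead_time = age_at_usual_diagnosis - age_at_early_diagnosis
apparent_gain = survival_early - survival_usual

print(f"survival after usual diagnosis: {survival_usual} years")
print(f"survival after early diagnosis: {survival_early} years")
print(f"apparent gain: {apparent_gain} years (lead time: {lead_time} years)")
```

The apparent gain in survival equals the lead time exactly, while the age at death is unchanged.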
[Fig. 16.7. The zero-time shift (lead-time bias). Upper timeline, "Natural History": onset of disease, onset of symptoms, time of diagnosis, death or morbidity. Lower timeline, "Early Detection": the same events, with only the time of diagnosis moved earlier.]

Screening is the testing of asymptomatic subjects from the general population for the purpose of early detection of a particular disease. Although it is similar to case-finding in its aim to detect disease in asymptomatic subjects, it is different in several important respects. In case-finding, the patient seeks health care, and the clinician's main responsibility relates to the symptom or other problem that prompted the patient's visit. Early detection of an unrelated disease may be useful to the patient but is clearly secondary to the main "contract" [7].
In screening, on the other hand, early detection of asymptomatic disease is the
main goal. False-negative test results are far less acceptable with screening, since
failure to detect the disease screened for vitiates the principal objective. Consequently, the sensitivity of a screening test must be very high [8].
As with case-finding, screening requires the prior existence of a treatment with
an overall favorable effect on mortality, morbidity, or performance. Early detection
of an untreatable disease not only is of no benefit, but may even prolong the suffering caused by a patient's awareness of his diagnosis. Once again, the zero-time shift
artifact should be considered in evaluating any screening test to ensure that outcome
is truly improved.
One major difference from case-finding, however, is that even if a potentially
beneficial treatment exists, true improvement in outcome requires referral of the diseased subject to an appropriate clinician, a decision by the clinician to prescribe the
treatment, and adequate compliance by the patient. In case-finding, no referral is
required, the clinician can decide a priori to treat patients in whom she detects the
disease, and she may be selective in testing only those patients for whom her prior
experience indicates a high probability of treatment compliance. Although the distinction between case-finding and screening can sometimes be blurred, e.g., in the
case of the periodic health examination ("annual checkup"), their differences should
be kept in mind.
Diagnostic tests can also be used to measure disease incidence or prevalence as
part of an epidemiologic study. Whenever a given disease represents the study outcome, the results of a diagnostic test can be used to classify study subjects as diseased or disease-free. Descriptive surveys in representative samples of defined population groups can be used to provide incidence and prevalence rates for particular
diseases. This may be important for public health purposes, such as in allocating
resources, providing baseline data prior to some planned intervention, or supporting
(or refuting) claims of a perceived epidemic in communities exposed to a suspected
toxic agent. In analytic cohort studies and clinical trials, diagnostic tests can be used
to standardize surveillance and provide an unbiased outcome assessment in different
exposure groups. The main interest here is the rate of disease in the study groups
rather than the presence or absence of disease in any individual. Thus, reproducibility, sensitivity, and specificity are less important than in other settings. If these indexes are too low, however, the resulting misclassification may lead to erroneous rates
and inferences.
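How misclassification distorts a measured rate can be illustrated with a short calculation. The sketch below assumes that every subject is classified by the test alone, so that the apparent prevalence is the proportion of all subjects who test positive; the function name and the input values are illustrative.

```python
def apparent_prevalence(true_prevalence, sensitivity, specificity):
    """Proportion of subjects classified as diseased by an imperfect test."""
    true_positives = true_prevalence * sensitivity
    false_positives = (1 - true_prevalence) * (1 - specificity)
    return true_positives + false_positives

# Same mix as Table 16.3 B: true prevalence 0.10, sensitivity 0.90, specificity 0.80.
estimate = apparent_prevalence(0.10, 0.90, 0.80)
print(f"true prevalence 0.10, apparent prevalence {estimate:.2f}")
```

With these values the test labels 27% of subjects as diseased although only 10% are, which corresponds to the 270 positive results among 1000 subjects in Table 16.3 B.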
For all of the above test settings, the advantages and disadvantages of each test
should be carefully weighed before deciding whether or not to perform the test. In
addition to the probabilities of disease given the possible test results, which are
determined by the prevalence of the disease and the test's sensitivity and specificity,
the clinician or public health policy maker must also consider the consequences of
correct and incorrect disease classification, the values that individual patients and
society as a whole attach to these consequences, and the monetary costs involved.
Issues such as the acceptability (risk of serious adverse consequences, pain, convenience, embarrassment) and complexity (logistic and mechanical difficulties,
required expertise of personnel) of the test will weigh heavily in these evaluations [7,
8]. The technique for balancing the benefits and risks of available management
options is called decision analysis and forms the basis of the following chapter.
References
1. Elveback LR, Guillier CL, Keating FR (1970) Health, normality, and the ghost of Gauss. JAMA
211: 69-75
2. Rang M (1972) The Ulysses syndrome. Can Med Assoc J 106: 122-123
3. Pethybridge RJ, Ashford JR, Fryer JG (1974) Some features of the distribution of birth weight of
human infants. Br J Prev Soc Med 28: 10-18
4. Feinstein AR (1975) Clinical biostatistics. XXXI. On the sensitivity, specificity, and discrimination
of diagnostic tests. Clin Pharmacol Ther 17: 104-116
5. Ransohoff DF, Feinstein AR (1978) Problems of spectrum and bias in evaluating the efficacy of
diagnostic tests. N Engl J Med 299: 926-930
6. Sox HC (1986) Probability theory in use of diagnostic tests. Ann Intern Med 104: 60-66
7. Sackett DL, Holland WW (1975) Controversy in the detection of disease. Lancet 2: 170-172
8. Sackett DL (1975) Laboratory screening: a critique. Fed Proc 34: 2157-2161
The decision strategy most commonly used by practitioners is based on global "clinical judgment," by which the knowledge base and previous experience of a seasoned
clinician are somehow carefully weighed and considered to yield the proper course
of action. The factors considered by Mrs. Smith's cardiologist and surgeon would
probably include knowledge of her previous medical history, the extent of her current suffering, estimates of her probable survival with continued medical therapy,
the expertise and experience of the surgeon, Mrs. Smith's chances of surviving the
operation, and published studies in which the two treatments are compared. The
physicians would discuss the case between them, perhaps consulting the views of
other colleagues at a joint medical-surgical conference, and eventually reach a decision. The various considerations are mixed up together in the cauldron of the clinicians' brains, and the end product is a brew that hopefully represents the best decision for Mrs. Smith.
Another approach seeks to minimize the risk of the worst possible outcome. In statistical parlance, this strategy is called minimax; it chooses the decision with the
minimum probability of the maximum loss. In Mrs. Smith's case, the decision would
be to continue medical therapy. Although her condition is slowly deteriorating, her
risk of dying in the next few weeks or months is low with her current treatment.
Since she might not survive an operation, her risk of early death is much higher with
surgery. Even if a successful operation prolongs her life, avoidance of early death
mandates a decision for medical therapy.
17.1.3 "Go for Broke": The Gambling Approach
The opposite to the conservative (or minimax) decision strategy is the approach of
the gambler: "go for broke." The gambler chooses the decision that maximizes the
probability of the most favorable outcome. This approach is often frowned upon by
clinicians, who are understandably reluctant to recommend risk-taking by their patients, even in the face of substantial potential gains. Nonetheless, Mrs. Smith, along
with her cardiologist and surgeon, might decide that her expected course under
medical therapy is so inexorably downhill that the potential gain in survival with surgery is worth the risk of operative death.
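The contrast between the minimax and "go for broke" strategies can be expressed as two selection rules applied to the same estimates. The probabilities below are hypothetical placeholders for Mrs. Smith's case, not values from the text: "p_worst" stands for the probability of early death and "p_best" for the probability of long-term, morbidity-free survival.

```python
# Hypothetical probabilities for each decision (illustration only).
options = {
    "medical therapy": {"p_worst": 0.10, "p_best": 0.00},
    "surgery":         {"p_worst": 0.25, "p_best": 0.30},
}

# Minimax: choose the decision with the lowest probability of the worst outcome.
minimax_choice = min(options, key=lambda name: options[name]["p_worst"])

# "Go for broke": choose the decision with the highest probability of the best outcome.
gambler_choice = max(options, key=lambda name: options[name]["p_best"])

print(f"minimax strategy:  {minimax_choice}")
print(f"gambling strategy: {gambler_choice}")
```

Each rule looks at only one outcome and ignores the rest, which is why the two strategies can reach opposite decisions from the same estimates.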
17.1.4 Is P<0.05? The "Significance" Approach
A more "scientific" strategy would assemble evidence from the published literature,
preferably from randomized clinical trials comparing surgical and medical therapy
for mitral stenosis. If the evidence suggests that the outcome is different with one
therapy vs another, and sampling variation can be safely ruled out as an explanation
for the difference (i.e., P < 0.05), then this strategy would select the treatment associated with the better outcome. But what about Mrs. Smith's case? Does the published literature pertain to 75-year-old women with Mrs. Smith's poor left ventricular function and current symptoms? If (as is likely) medical treatment is significantly
better for short-term survival, while surgery is better for long-term survival, which is
better overall for Mrs. Smith? And how are her current suffering and impaired functioning to be weighed against the chances of either short- or long-term survival?
Unfortunately, the significance strategy does not provide answers to these questions.
17.1.5 Decision Analysis: Maximize Expected Utility
Decision analysis (or risk-benefit analysis) is a systematic strategy by which the ramifications of each possible decision are compared for all relevant outcomes [1-4].
After estimating the probability of these outcomes for each decision and assigning a
utility to each outcome, the decision is chosen that maximizes expected utility. For
Mrs. Smith, decision analysis would consider not only the short-term risks of sur-
Decision Analysis

[Fig. 17.1. The decision node: medical therapy vs surgery for Mrs. Smith]

[Fig. 17.2. Full decision tree: medical therapy vs surgery for Mrs. Smith. Medical therapy leads to five outcomes: death <1 yr; survival 1-5 yrs, morbidity; survival 1-5 yrs, no morbidity; survival >5 yrs, morbidity; survival >5 yrs, no morbidity. Surgery leads either to death at surgery (death <1 yr) or to survival of surgery, followed by the same five outcomes.]
right, from the largest branches representing the decisions to the smallest terminal
branches representing the final outcomes.
In our example, we shall simplify the decision tree by considering five final outcomes: death before 1 year, survival for 1-5 years with morbidity, survival for
1-5 years without (significant) morbidity, survival for more than 5 years with morbidity, and survival for more than 5 years without morbidity. Operative death will be
included in the "death before 1 year" category. The full decision tree incorporating
these outcomes is shown in Fig. 17.2.
Two of the branches shown in Fig. 17.2 are superfluous. Although we have not
yet discussed how we estimate the probabilities associated with each branch of the
tree, it is apparent that morbidity-free survival with medical therapy is virtually
impossible (i.e., probability = 0), since Mrs. Smith has severe symptoms now with
medical treatment. Consequently, the branches corresponding to morbidity-free survival for 1-5 years and >5 years with medical therapy can be "pruned." The final
decision tree obtained after pruning these branches is shown in Fig. 17.3.
The decision tree we have constructed has been simplified for heuristic purposes.
Outcomes have been limited to five categories, and neither the morbid sequelae of
surgery nor the side effects of medication have been considered. If the probabilities
and utilities of other outcomes might affect the decision, including them would be
important. The resulting tree would then be "bushier" (i.e., have more branches)
and hence more comprehensive.
It is also important to point out that decisions may involve diagnostic, as well as
therapeutic, choices, such as the decision to order a certain diagnostic test. Four
outcomes arise from each decision node involving a diagnostic test, corresponding
to the true and false positives and true and false negatives discussed in Chapter 16.
[Fig. 17.3. Pruned decision tree: medical therapy vs surgery for Mrs. Smith. Medical therapy leads to three outcomes: death <1 yr; survival 1-5 yrs, morbidity; survival >5 yrs, morbidity. The surgery branch is unchanged from Fig. 17.2.]

[Fig. 17.4. Decision tree for L:S ratio test vs immediate tocolysis in women with spontaneous rupture of membranes and preterm labor. Decision branches: tocolysis (successful or unsuccessful) and test (true positive, false positive, true negative, false negative). Chance events: RDS and sepsis. Final outcomes: death; survival, morbidity; survival, no morbidity.]
Consider the following example, the decision tree for which is shown in
Fig. 17.4. Pregnant women who present with spontaneous rupture of fetal membranes (i.e., the amniotic sac) and premature labor represent a difficult dilemma for
the obstetrician. If she lets nature take its course, many of these women will deliver a
premature infant, with a high risk of respiratory distress syndrome (RDS), a condition caused by lung immaturity and associated with significant morbidity and mortality. If the obstetrician administers tocolytic (labor-inhibiting) drugs, she may succeed in delaying delivery until the lungs have matured and thus avoid RDS. But
prolonged membrane rupture increases the risk of neonatal sepsis (systemic bacterial
infection), which has an even higher mortality than RDS. (We will assume for simplicity that a neonate who survives a bout of sepsis will have no residual morbidity,
and that the coincidence of RDS and sepsis is sufficiently unlikely to ignore.)
An alternative to immediate tocolysis is a diagnostic test of fetal lung maturity.
The amniotic fluid is tested for the ratio of lecithin to sphingomyelin (called the L:S
ratio). An L:S ratio ≤ 2:1 (a positive test) indicates immature fetal lungs and a high
risk of RDS. But, like other diagnostic tests, the L:S ratio is neither perfectly sensitive nor specific for RDS. In particular, many infants with a positive test will not
develop RDS (i.e., the positive predictive value of the test is not high). If the test is
positive (whether true positive or false positive), the obstetrician will institute tocolytic therapy, which may or may not be successful in delaying delivery until the lungs
have matured. In the case of the false positives, successful tocolytic therapy will have
unnecessarily increased the risk of sepsis. If the test is negative (true negative or false
negative), the obstetrician will attempt to deliver the baby as soon as possible, using
oxytocin augmentation if necessary, to minimize the risk of sepsis. In the case of
false negatives, however, this will result in the birth of some infants with RDS.
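The four branches that follow a diagnostic test can be given probabilities once the probability of disease and the test's sensitivity and specificity are specified. The Python sketch below shows the calculation; the numerical values are hypothetical and are not estimates for the L:S ratio or for RDS.

```python
def outcome_probabilities(prevalence, sensitivity, specificity):
    """Probabilities of the four test outcomes at a diagnostic-test node."""
    return {
        "true positive": prevalence * sensitivity,
        "false negative": prevalence * (1 - sensitivity),
        "false positive": (1 - prevalence) * (1 - specificity),
        "true negative": (1 - prevalence) * specificity,
    }

# Hypothetical inputs, for illustration only.
branches = outcome_probabilities(prevalence=0.20, sensitivity=0.90, specificity=0.70)

for outcome, probability in branches.items():
    print(f"{outcome}: {probability:.2f}")

positive_predictive_value = branches["true positive"] / (
    branches["true positive"] + branches["false positive"]
)
print(f"positive predictive value: {positive_predictive_value:.2f}")
```

With these made-up inputs fewer than half of the positive tests are true positives, the situation in which tocolysis would be given unnecessarily to a sizable share of test-positive pregnancies.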
The decision tree for this analysis is obviously more complicated than the one we
constructed for our first example involving Mrs. Smith. Even the tree shown in
Fig. 17.4 is an oversimplification, however, since it does not consider such issues as
amniocentesis prior to tocolytic therapy, the use of betamethasone (a corticosteroid)
to promote lung maturation, the differential morbidity and mortality of preterm and
full-term infants, or the range of possible morbidities. In fact, many clinical decisions would look almost impossibly complex if displayed in their full arboreal splendor.
What this means, though, is that the decisions themselves are complex. The decision tree merely displays the complexities; it does not create them. Decisions can be
(and usually are) made without decision trees. But all decisions weigh probabilities
and utilities, although most do so only implicitly. Even if the tree is not used to carry
out the full analysis, however, its construction can be helpful to the clinician by forcing her to consider all relevant consequences of her decision choices. Finally, many
complex trees can be made more manageable by careful pruning of nonessential
branches.
17.3.1 Probabilities
A variety of sources exist for estimating the probabilities of events occurring as consequences of the decisions being analyzed. Published clinical trials, observational
studies, and descriptive case series can all be used, with priority given to data representing the best combination of methodologic rigor and clinical relevance to the case
or cases being analyzed. In Mrs. Smith's case, for example, our preference would be
for a randomized trial in which surgery and medical therapy were compared for
short-term and long-term mortality, relief of symptoms, and functional performance, preferably in women around 75 years of age with similar past history, current symptomatology, and associated left ventricular dysfunction.
The chance of finding such a trial, of course, is slim. Instead, a variety of published literature must often be searched to provide the best probability estimates. In
the absence of reliable published data, it may be necessary to solicit the opinion of an
expert or panel of experts. Although such a process might appear dangerously similar
to the global introspection approach discussed earlier, there are important differences.
For one thing, the introspection is far less global. The expert is being asked for an
opinion concerning a specific probability, not a recommendation for an overall decision. By breaking down a global task into a series of small ones, each becomes more
manageable (i. e., the branch probability assessment is more reproducible and valid).
Second, the expert is consulted only in the area of his or her expertise. No one person
is being asked to know and properly weigh all the relevant facts.
Although such procedures for estimating probabilities may appear "sloppy" after
the epidemiologic and statistical principles outlined earlier in this text, there is no
readily available alternative. After all, some decision has to be made. Decision analysis provides a logical framework for weighing the available information, even if that
information is fuzzy. To be sure, the better the probability estimates, the more reliance we can place on the results of the analysis. As we shall see later on, however,
ranges of feasible probability estimates can be assessed to see if the preferred decision changes with different estimates.
Regardless of how the individual probability estimates are derived, their combination must conform to the rules of probability theory. Since the branches emanating from a given chance node represent mutually exclusive events, the probability of
either of two such events occurring is the sum of their individual probabilities. If the
two events are represented by A and B, then using the probability notation introduced in Chapter 16:
P(A or B) = P(A) + P(B)
(17.1)
Furthermore, since a chance node gives rise to a branch for each possible consequent event, and one of the events must occur, the probabilities of all the
branches emanating from a chance node must sum to 1. If there are n such branches,
P(A or B or C ... or n) = P(A) + P(B) + P(C) + ... + P(n) = 1
(17.2)
Returning to our example, let us estimate the probabilities for the branches at each
of the three chance nodes shown in the Fig. 17.3 decision tree. Let us say that, based
on our review of the literature and consultation with experts, the probabilities of
death before 1 year, survival for 1-5 years (with morbidity), and survival for more
than 5 years are 0.15, 0.60, and 0.25, respectively, for medical therapy. Note that
these three probabilities sum to 1. For surgery, the first chance node concerns the
probability of surviving surgery. The literature and expert opinion indicate that a
75-year-old woman in Mrs. Smith's current condition would have only a 70% (i. e.,
P= 0.70) chance of surviving a mitral valve commissurotomy. By the law of additivity (Eq. 17.2), the probability of operative death must be 0.30.
If she survives surgery, Mrs. Smith may experience any of the five possible outcomes. Let us say that our estimates of the probabilities of these outcomes, conditional on her surviving the operation, are 0.05 for death before 1 year, 0.10 for surviving 1-5 years with morbidity, 0.20 for surviving 1-5 years without morbidity,
0.25 for surviving more than 5 years with morbidity, and 0.40 for surviving more
than 5 years without morbidity. Note once again that these probabilities sum to 1.
The branch probabilities are often written into the decision tree just after the branch
point (chance nodes), as shown in Fig. 17.5.
The second rule for combining probabilities concerns events that occur subsequent to, and possibly conditional upon, prior events. This rule will tell us how to
calculate overall probabilities for the final outcomes representing the terminal
branches of the decision tree. If B is an event that can occur in a person who has
already experienced event A, then the probability that both A and B will occur is the
product of the probability of A and the conditional probability of B (given A). In the
notation of conditional probability,
P(A and B) = P(A) × P(B|A)
(17.3)
The probability of each branch of the tree is conditional on the probabilities of the
preceding larger branches, i. e., the branches lying to its left on the tree, all the way
back to the first chance node. Thus, the number of probabilities multiplied together
for each terminal branch is the same as the number of chance nodes occurring
between the decision node and the terminal branch.
In our example, the terminal branches representing the possible final outcomes
for surgery are conditional on surviving the operation. The overall probabilities
associated with these terminal branches are calculated by multiplying the probability
of surviving surgery (P= 0.70) by the conditional probability of each outcome.
Thus, for death before 1 year the overall probability is (0.70)(0.05) = 0.035. The
probabilities of each of the nine terminal branches of the decision tree in Fig. 17.5
are listed in Table 17.1. Note that, consistent with the additivity rule, the sum of the
overall probabilities for the five terminal branches for operative survival is 0.70, the
probability of operative survival itself.

Table 17.1. Overall probabilities of terminal branches in decision tree shown in Fig. 17.5

Terminal branch                              Probability
Medical therapy
  Death < 1 year                             0.15
  Survival 1-5 years, morbidity              0.60
  Survival > 5 years, morbidity              0.25
Surgery
  Dies at surgery                            0.30
  Survives surgery
    Death < 1 year                           0.035
    Survival 1-5 years, morbidity            0.070
    Survival 1-5 years, no morbidity         0.140
    Survival > 5 years, morbidity            0.175
    Survival > 5 years, no morbidity         0.280
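The two rules for combining probabilities can be checked with a short script. This is a minimal sketch that uses the branch probabilities quoted in the text; the variable names are illustrative and not part of the original analysis.

```python
# Branch probabilities quoted in the text for the Fig. 17.5 tree.
medical = {
    "death < 1 year": 0.15,
    "survival 1-5 years, morbidity": 0.60,
    "survival > 5 years, morbidity": 0.25,
}
p_survive_surgery = 0.70
# Probabilities conditional on surviving the operation.
after_surgery = {
    "death < 1 year": 0.05,
    "survival 1-5 years, morbidity": 0.10,
    "survival 1-5 years, no morbidity": 0.20,
    "survival > 5 years, morbidity": 0.25,
    "survival > 5 years, no morbidity": 0.40,
}

# Eq. 17.2: the branches leaving one chance node must sum to 1.
assert abs(sum(medical.values()) - 1.0) < 1e-9
assert abs(sum(after_surgery.values()) - 1.0) < 1e-9

# Eq. 17.3: P(A and B) = P(A) x P(B|A) for each terminal branch.
surgery = {"dies at surgery": 1.0 - p_survive_surgery}
for outcome, p_conditional in after_surgery.items():
    surgery[outcome] = p_survive_surgery * p_conditional

for outcome, p in surgery.items():
    print(f"{outcome:35s} {p:.3f}")
```

The six surgical terminal branches again sum to 1, and the five branches that follow operative survival sum to 0.70.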
17.3.2 Utilities
The other major constituents required for decision analysis are utilities. When only a
single dichotomous outcome is involved, utility assessment is straightforward. One
need only decide which of the two outcome categories is preferred, and the decision
that yields the highest average probability of that outcome category is favored. In
the case of Mrs. Smith, if 5-year survival (yes or no) were the single outcome of
interest, we would need "only" find out whether medical therapy or surgery is associated with a higher 5-year survival rate in women like Mrs. Smith, and then act
accordingly.
As we have seen, however, other outcomes are important. Mrs. Smith is already
75 years old and may care far more about 1-year survival than 5-year survival. Since
she is a widow who lives alone, she is also likely to prefer a treatment that reduces
her breathlessness and enables her to go out, walk up the stairs to her apartment,
and care for herself without assistance. Decision analysis requires that all outcomes
be rated on a single utility scale. How can this be achieved? In other words, how can
we consider all these outcomes simultaneously, and how do we go about weighting
their relative utilities?
The best way of answering these questions is to consult the persons who will be
directly affected by the decision under analysis. Whereas clinicians, researchers, and
other experts are required for estimating probabilities, patients and the "general
public" are often best able to assign utilities to various outcomes. In the case of
Mrs. Smith, the best decision for her will depend on the relative weight she places on
1-year survival, 5-year survival, symptoms, and functional independence. Similarly,
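Table 17.2. Assigned utilities (u_i's) of five possible final outcomes for Mrs. Smith

Outcome                                      u_i
1. Death < 1 year                            0
2. Survival 1-5 years, morbidity             0.25
3. Survival 1-5 years, no morbidity          0.50
4. Survival > 5 years, morbidity             0.50
5. Survival > 5 years, no morbidity          1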
Table 17.2). The combination of morbidity and mortality into a single utility scale is
analogous to a concept known as quality-adjusted life years. Instead of using a utility
scale from 0 to 1, we might have "translated" each outcome into an equivalent number of quality-adjusted life years and used the latter itself as the scale. The advantage of the 0-to-1 u-scale, however, is the equivalence to the "indifference" lottery
probabilities discussed earlier.
Terminal branch                              p_i × u_i
Medical therapy
  Death < 1 year                             0.15 × 0 = 0
  Survival 1-5 years, morbidity              0.60 × 0.25 = 0.150
  Survival > 5 years, morbidity              0.25 × 0.50 = 0.125
                                             Expected utility = Σ(p_i × u_i) = 0.275
Surgery
  Dies at surgery                            0.30 × 0 = 0
  Survives surgery
    Death < 1 year                           0.035 × 0 = 0
    Survival 1-5 years, morbidity            0.070 × 0.25 = 0.0175
    Survival 1-5 years, no morbidity         0.140 × 0.50 = 0.070
    Survival > 5 years, morbidity            0.175 × 0.50 = 0.0875
    Survival > 5 years, no morbidity         0.280 × 1 = 0.280
                                             Expected utility = Σ(p_i × u_i) = 0.455
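The expected utility of each decision is the sum of probability times utility over its terminal branches. The following sketch pairs the overall probabilities of Table 17.1 with the utilities of Table 17.2; the function name `expected_utility` is an illustrative choice.

```python
# (overall probability, utility) for each terminal branch.
medical = [(0.15, 0.0), (0.60, 0.25), (0.25, 0.50)]
surgery = [(0.30, 0.0), (0.035, 0.0), (0.070, 0.25),
           (0.140, 0.50), (0.175, 0.50), (0.280, 1.0)]

def expected_utility(branches):
    """Sum of p_i * u_i over the terminal branches of one decision."""
    return sum(p * u for p, u in branches)

eu_medical = expected_utility(medical)
eu_surgery = expected_utility(surgery)
print(f"medical therapy: {eu_medical:.3f}")
print(f"surgery:         {eu_surgery:.3f}")
```

With these inputs the script gives 0.275 for medical therapy and 0.455 for surgery.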
the analysis, especially when the expected utility of the preferred decision is only
slightly higher than the others. The process of varying the probabilities and utilities
is called sensitivity analysis.
Sensitivity analysis assesses whether the decision choice would change with (i. e.,
is sensitive to) feasible changes in the component probabilities or utilities. It is somewhat analogous to using confidence intervals around means, proportions, or relative
risks. We may be able to provide reasonable ranges for probabilities and utilities that
incorporate our uncertainties, and we may be far more comfortable with the range
than we are with any single point estimate. If the decision associated with the maximum expected utility remains unchanged, we can then be more confident that that
decision is the correct one.
Returning to our example, if the probability of dying at surgery were 0.50
instead of 0.30, would surgery still be preferred over medical therapy? The expected
utility for the surgical decision branch now becomes five sevenths (i. e., 0.50/0.70) of
0.455 (the value obtained with a probability estimate of surviving surgery of 0.70),
or 0.325. Since 0.325 is still higher than 0.275, surgery would still be preferred over
medical therapy for Mrs. Smith. In other words, the decision favoring surgery is not
sensitive to an operative mortality of 50%. If operative mortality were as high
as 70%, however, the average utility of surgery would be only (0.30/0.70) of
0.455 = 0.195, and medical therapy would be preferred.
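This one-way sensitivity analysis can be sketched in a few lines. The code assumes, as the text does, that operative death has a utility of 0 and that the expected utility among operative survivors (0.455/0.70 = 0.65) does not change as operative mortality varies.

```python
EU_MEDICAL = 0.275
# Expected utility conditional on surviving the operation.
EU_GIVEN_SURVIVAL = 0.455 / 0.70

def eu_surgery(operative_mortality):
    """Expected utility of surgery; operative death contributes utility 0."""
    return (1.0 - operative_mortality) * EU_GIVEN_SURVIVAL

for mortality in (0.30, 0.50, 0.70):
    eu = eu_surgery(mortality)
    choice = "surgery" if eu > EU_MEDICAL else "medical therapy"
    print(f"operative mortality {mortality:.2f}: EU = {eu:.3f} -> {choice}")
```

The preferred decision switches somewhere between an operative mortality of 50% and 70%, which is what the sensitivity analysis is designed to reveal.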
We can also test the sensitivity of the decision to changes in utilities. In
Mrs. Smith's case, we need only consider a reduction in the utility of morbidity-free
survival (since any increase in that utility would further favor surgery) and an
increase in the relative "disutility" of death in less than 1 year. If, for example, survival for at least 1 year were of paramount importance to Mrs. Smith, and subsequent survival with morbidity were only slightly less valuable to her than survival
without symptoms, the utilities of the five outcomes listed in Table 17.2 might be 0,
0.60, 0.70, 0.90, and 1, instead of 0, 0.25, 0.50, 0.50, and 1. If the probabilities
remained unchanged from those shown in Fig. 17.5, the expected utility for medical
Solving for x,
x = 0.275 / 0.65 = 0.423
Thus, at a surgical survival rate of 42.3%, the two decisions result in equal expected
utilities. This is useful information. It indicates that, assuming the other probabilities
and utilities are valid, any surgical survival rate above 42.3% should favor surgery. If
the surgeon is sure that Mrs. Smith's chance of surviving the operation is higher than
this figure, he should operate; if it is lower, he should not.
future inflation and discounted over the duration of the service to compensate for
the lost future earning power of the money expended. Sensitivity analysis can be
carried out to assess the extent to which the service becomes more or less cost beneficial with changes in the underlying assumptions about expected benefits, costs, and
discount rates.
As an example, consider annual mammography (breast roentgenography) screening for breast cancer. The anticipated benefits associated with early
diagnosis and treatment of breast cancer include lower mortality and less need for
subsequent hospitalization, surgery, radiation therapy, and chemotherapy. The
lower mortality must be valued in monetary terms by estimating the average increase
in earnings, plus the savings of costs that would have been incurred in paying someone else to carry out usual domestic tasks, associated with the number of women-years gained by screening. The costs are all the direct and indirect costs incurred by
the population-wide screening program. If the monetary value of the total anticipated benefits exceeds the costs (after adjustment for inflation and the discount
rate), the screening is declared to be cost-beneficial.
Cost-benefit analyses are best accomplished by collaboration among clinicians,
economists, and medical or public health administrators. Politicians are often particularly interested in such analyses, because it is they who have to determine budgets
and thus make choices about how much to spend on medical care in general and
specific services in particular. The major problem with cost-benefit analysis, however, arises in considering the value of human life and physical and psychological
suffering. The analyst must either give them a price tag or ignore them completely.
Neither of these solutions is entirely satisfactory, either ethically or scientifically.
since infant mortality is inversely proportional to birth weight, the IMR would be
correspondingly reduced. Tetanus immunization would not affect birth weight, but
birth weight-specific IMR should be reduced by eliminating the neonatal tetanus
that can occur after deliveries in unsanitary settings (often at home in developing
countries).
To carry out the analysis, each service must be assessed in terms of the target
benefit and the associated costs. It is often convenient to base the calculations on an
arbitrary number of persons served. For a population of 10000, the benefit of each
service would be the number of infants who would die without the service but live
with it. The cost of each service would be calculated as in cost-benefit analysis and
would include personnel, equipment, and indirect costs and would take both inflation and the discount rate into account. The service associated with the lower cost
per unit benefit is then declared more cost-effective. For our example, the public
health official would choose between caloric supplementation and tetanus immunization by comparing the cost of saving one infant life with one service vs the other.
Sensitivity analyses could then be carried out by varying estimates both of the costs
and the reduced IMR achieved for each service and observing the effect on the
overall result.
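The comparison of cost per unit benefit can be sketched as follows. Every figure in this example is hypothetical and chosen only to illustrate the arithmetic; the text gives no cost or mortality estimates for either service.

```python
# Hypothetical inputs for a population of 10,000 served; none of these
# figures come from the text.
services = {
    "caloric supplementation": {"cost": 400_000.0, "infant_deaths_averted": 50},
    "tetanus immunization": {"cost": 90_000.0, "infant_deaths_averted": 20},
}

def cost_per_death_averted(cost, infant_deaths_averted):
    """Cost per unit benefit: the quantity compared between services."""
    return cost / infant_deaths_averted

ratios = {name: cost_per_death_averted(**s) for name, s in services.items()}
preferred = min(ratios, key=ratios.get)

for name, ratio in ratios.items():
    print(f"{name}: {ratio:,.0f} per infant death averted")
print("more cost-effective:", preferred)
```

A sensitivity analysis would rerun the same comparison over plausible ranges of each cost and each number of deaths averted.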
As with decision analysis, morbidity, pain, suffering, and functional impairment
can be combined with mortality on a single health utility scale. The benefits can be
expressed in terms of quality-adjusted life years, for example, with the target population polled to derive the formula for quantitative adjustment. The fact that such
outcomes do not have to be valued in monetary terms makes cost-effectiveness analysis more palatable than cost-benefit analysis to many clinicians and lay persons.
Since procedures for estimating costs are identical to those used in cost-benefit analysis, consultation with a health economist is often essential to arrive at valid estimates.
References
1. Weinstein MC, Fineberg HV (1980) Clinical decision analysis. Saunders, Philadelphia
18.1 Introduction
In Chapter 6 we considered a variety of ways of analyzing the results of a cohort
study. When the outcome variable is continuous and the exposure is dichotomous,
the main comparison is the mean outcome in the two groups defined by exposure
status. We later devoted nearly an entire chapter (Chapter 13) to inferential statistical techniques used in testing an observed difference in means.
When the exposure is dichotomous and the outcome variable is categorical, the
main analysis is a comparison of rates in the two exposure groups. In the common
situation of a dichotomous outcome, the ratio of the two rates becomes a relative
risk, the statistical significance of which can be tested using a χ² or Fisher exact test
or by constructing an appropriate confidence interval (see Chapter 14).
In this chapter, we shall once again focus on cohort studies with dichotomous
outcomes. But the kind of cohort study we are considering here is a special, albeit
quite common, type in which the duration of follow-up varies among individuals in
the cohort. Rarity of "exposure" may require that enrollment of the study cohort be
spread over a considerable period of time, such as in studying survival among patients with a rare disease or the risks and benefits of a complex or costly treatment. If
the outcome requires many months or years to develop, study subjects will have
been followed for varying lengths of time. The first subject enrolled will have the
longest potential duration of follow-up (i. e., if he does not develop the outcome and
does not withdraw from the study), and the most recently enrolled subject will have
the shortest. Since the investigator must close the study at some date, she cannot
know if the subjects who have not yet developed the outcome at study termination
would have done so had they been followed up longer.
Life-table analysis (also called survival analysis) is a statistical technique that
allows the investigator to calculate a probability of developing a given outcome that
takes into account the duration of follow-up. It makes maximum use of all data on a
cohort, including those members who withdraw from the study or are lost to follow-up for other reasons. Although the technique owes its origin and name to vital
status as the study outcome (death vs survival, hence the terms life-table and survival
analysis), it can be used to examine the distribution of time to occurrence of any
dichotomous outcome. It can be used for either descriptive (single exposure group)
or analytic (two or more exposure groups) cohort studies and applies equally well to
observational and experimental (clinical trial) designs.
Before discussing the anatomy and physiology of life tables, I shall begin with a
clinical example and then examine the various ways in which the data could be analyzed. After a review of the limitations of each of the alternative strategies, the rationale for life-table analysis should be evident.
[Figure: calendar-time plot (1979-1985, from beginning to end of study) of follow-up for patients 1-15, showing each patient's status at last follow-up (alive, death, or withdrawn) and years followed. Years followed, as listed in the figure: 1.4, 4.4, 3.5, 5.8, 4.3, 4.1, 3.9, 3.6, 2.5, 2.6, 2.4, 2.3, 1.8, 1.2, 0.7; total = 44.5]
Fig. 18.1. Survival of 15 osteogenic sarcoma patients receiving new treatment regimen
impossible to calculate for the entire cohort unless at least half of the subjects are
known to have died at the time follow-up is terminated.
15
mistic, because one or both of the lost patients may have subsequently died without
our knowing it. But even in the absence of study losses, overall survival rate is unsatisfactory because it contains no information about the duration of follow-up. If
our patients had been followed up for a maximum of 2 years, for example, instead
of 5.8 years, all (i. e., 100%) would have been classified as survivors. Conversely, if
all had been followed for 100 years, survival (including that of the investigator!)
would have been 0%.
The most commonly used approach is to calculate a rate of survival for a given
length of time (less, of course, than the duration of the study). For our example, we
could calculate a 1-year, 2-year, or 5-year survival rate. This approach overcomes
the major objection to the overall survival rate, because it includes duration of follow-up in its definition. It does not resolve the "denominator problem," however, of
how we deal with patients who are lost to follow-up before n years elapse or with
those patients who are still alive at study termination but who have been followed
for fewer than n years.
For example, only one of the osteosarcoma patients in our example was known
to survive 5 years or more. Four more were known to have died in less than 5 years,
while the remaining ten either were lost (two) or were still alive at the end of the
study (eight). If the latter ten are included in the denominator, the 5-year survival
rate is only 1/15, or 6.7%. This is overly pessimistic, since one or more of the ten lost
or remaining patients might well have lived ≥ 5 years. Conversely, the 5-year mortality of 4/15, or 27%, is overly optimistic, since it assumes that the ten incompletely
followed patients would all have survived at least 5 years. If the ten patients are
excluded from the denominator, the 5-year survival rate becomes 1/5, or 20%. This is
also too low, however, because it does not take into account the known years of survival among the exclusions. The same result would have been obtained if all ten had
been lost immediately after enrollment and had had no observed survival.
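The denominator problem is easy to see when the competing 5-year rates are computed side by side. This sketch uses only the counts given in the text (one known 5-year survivor, four known deaths, ten incompletely followed patients); the variable names are illustrative.

```python
n_patients = 15
survived_5_years = 1      # known to have lived 5 years or more
died_before_5_years = 4   # known deaths within 5 years
incomplete = n_patients - survived_5_years - died_before_5_years  # lost or still alive

# Incompletely followed patients kept in the denominator.
survival_all = survived_5_years / n_patients        # too pessimistic
mortality_all = died_before_5_years / n_patients    # too optimistic

# Incompletely followed patients dropped from the denominator.
survival_complete_only = survived_5_years / (n_patients - incomplete)

print(f"incompletely followed: {incomplete}")
print(f"5-year survival, all 15 in denominator:  {survival_all:.1%}")
print(f"5-year mortality, all 15 in denominator: {mortality_all:.1%}")
print(f"5-year survival, completers only:        {survival_complete_only:.1%}")
```

None of the three figures uses the partial follow-up of the ten incompletely followed patients, which is the gap the methods described next are meant to close.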
18.2.5 Person-Years Approach
Another approach (discussed in Chapter 6) uses the length of follow-up for each
subject, sums this value for each member of the cohort to obtain a total number of
person-years of follow-up, and then uses this figure as the denominator. Subjects
lost to follow-up or remaining alive at study termination thus contribute to the
denominator and appropriately reduce the mortality rate. Consequently, this
method is generally superior to the four discussed previously. In our example, the
total number of person-years is 44.5, and the mortality would thus be 4/44.5, or
0.090 deaths per person-year.
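The person-years calculation can be reproduced from the individual follow-up times. In this sketch the times are read from Fig. 18.1 and the four deaths are placed at 2.4, 3.5, 4.3, and 4.4 years, as stated for this cohort; pairing each time with a death flag in this way is an assumption, although it reproduces the totals quoted in the text.

```python
# (years followed, died) for each of the 15 patients.
follow_up = [
    (0.7, False), (1.2, False), (1.4, False), (1.8, False), (2.3, False),
    (2.4, True), (2.5, False), (2.6, False), (3.5, True), (3.6, False),
    (3.9, False), (4.1, False), (4.3, True), (4.4, True), (5.8, False),
]

person_years = sum(years for years, _ in follow_up)
deaths = sum(died for _, died in follow_up)
rate = deaths / person_years

print(f"person-years: {person_years:.1f}")
print(f"deaths: {deaths}")
print(f"mortality: {rate:.3f} deaths per person-year")
```

Every patient contributes to the denominator, but each year counts the same whether it falls early or late in follow-up.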
The major limitation of the person-years approach arises in situations where the
risk of death (or other study outcome) is not constant over time. Since the same
100 person-years can accumulate from two subjects followed for 50 years or 50 subjects followed for 2 years, a mortality expressed in terms of person-years can be misleading. In our example, eight of the 15 patients contributed < 3 years to total follow-up time for the cohort. If relapses and death are expected to occur mostly after
3 years, the calculated mortality of 0.090 per person-year may be too optimistic.
This type of problem is particularly likely to occur for exposures with long latent
periods. If most radiation-induced cancers occur 20 years or longer after exposure,
for example, a cohort study with a sample size of 1000 but only a 15-year average,
and a 20-year maximum, follow-up might detect few or no excess cases of cancer
per 15 000 person-years. A cohort of 500 followed for an average of 30 years, however, would detect a much higher number of excess cases for the same 15 000 person-years.
18.2.6 Life-Table Analysis
Life-table analysis has many of the same attractive features as the person-years
approach. It utilizes information available on all study subjects, including those
withdrawn from the study, regardless of duration of follow-up. But it has one additional advantage: it does not require a constant risk over time. All person-years are
not treated as equivalent; those occurring soon after exposure are counted differently from those occurring later in the course of follow-up. It is thus the analysis of
choice whenever there are unequal durations of follow-up and study withdrawals
and when constancy of risk over time cannot be assumed.
There are two principal techniques for carrying out a life-table analysis: the actuarial method [1, 2] and the Kaplan-Meier (or product-limit) method [2-4]. These will
be described in turn in the following two sections.
The first requirement for any life-table analysis is a clear indication of the starting
point. This is often called the zero time and usually corresponds to the time of first
exposure. When the exposure is a treatment and the study is a randomized clinical
trial, the time of randomization is usually preferred.
When the exposure is a disease, ascertainment of zero time becomes problematic. Should it be the onset of symptoms? Time of diagnosis? First presentation for
treatment? Since patients often differ as to when symptoms are first noticed (or retrospectively recalled) in the course of their disease, onset of symptoms is usually a
poor choice for zero time. Similarly, time of diagnosis will vary with the intensity of
medical surveillance, the diagnostic acumen of the patient's clinician, and the use of
laboratory tests capable of early detection (i. e., screening or case-finding). Date of
first treatment for the disease is often used as zero time, because it is usually easy to
determine objectively and because treatment is often (although not always) begun at
a similar point in the disease's natural history.
The second requirement is a well-defined study outcome. Not only must it be
dichotomous; it must also not be subject to multiple episodes. Thus, death and
chronic diseases are outcomes ideally suited to analysis by life tables. Diseases subject to multiple remissions and relapses can be studied using this technique, providing the outcome is defined as the occurrence or nonoccurrence of a first relapse.
Examples of other outcomes for which life-table analysis is appropriate include first
metastasis, first hospitalization, and first physician visit. When several dichotomous
outcomes are involved, each must be analyzed separately or the outcome must be
defined in such a way as to incorporate a specific combination of interest, e. g.,
death or first relapse.
Regardless of what outcome event is chosen, a decision must be made about
whether all such events will be counted, only those from specific causes (e.g., death
from myocardial infarction), or only those "attributable" to exposure. Suppose, for
example, that one of the osteosarcoma patients in our example had died in an automobile accident. If the investigator is convinced that this death was totally unrelated
to the underlying disease, then it should probably be counted as a loss to follow-up
(withdrawal). Suppose further, however, that the patient was depressed because his
disease was progressing and that he deliberately crashed the family car into a telephone pole. Counting the death as a study withdrawal would lead to an overly optimistic estimate of survival for the cohort. In other words, the suicide was actually
caused by the osteosarcoma.
This leads us more generally to the third requirement for the life-table analysis:
losses to follow-up should be independent of the study outcome. If subjects who
drop out are those doing particularly well or particularly poorly, then their loss will
bias the results in the overall cohort. If our osteosarcoma patients had been transferred to a hospice facility as soon as their disease became unresponsive to treatment, all four deaths would have been counted as withdrawals, and survival would
have been calculated as 100%! Thus, life-table analysis assumes that lost subjects
have an identical prognosis to those remaining in the cohort at that time. Actually,
this assumption is also shared by other cohort analytic approaches (i. e., mean or
median survival, overall or n-year survival rates, and person-years) whenever losses
to follow-up occur.
The fourth requirement is that the risk of the outcome is independent of calendar time. In other words, the prognosis of subjects entering the study early should
be no different from that of those enrolled toward the end. Although life tables do
not assume that risk remains constant for any given subject over time, they do
assume no major secular changes in prognosis for the overall cohort. If advances in
supportive care or in treatment of adverse reactions to therapy resulted in a better
prognosis among our osteosarcoma patients enrolled since 1982, for example, the
overall survival of the cohort would largely reflect deaths occurring among patients
treated earlier and thus would be overly pessimistic.
The fifth and final requirement is that the risk of the study outcome remain constant within intervals used in constructing the life table (see following subsection).
Risk need not be constant from one interval to the next, but it must remain so within
each interval. This is not a major restriction, since intervals of any length can be
constructed and can vary within a given life table. Consequently, if the investigator
suspects a possible variation in risk within one or more intervals, these intervals
should be subdivided into smaller ones with constant risk, so that the constant risk
requirement is satisfied.
With these five requirements and assumptions in mind, we are now ready to
construct the life table.
The first step in constructing the life table is to refer the timing of all "events"
(including time to outcome, loss to follow-up, or end of study) to the zero time,
rather than to calendar time. Figure 18.2 makes this conversion for the 15 patients of
the osteosarcoma cohort. The corresponding life table is shown in Table 18.1; each
column of the table will be discussed in turn.
[Figure: follow-up of patients 1-15 replotted on a common time axis measured from zero time, with each patient's status at last follow-up (alive, death, or withdrawn)]
Fig. 18.2. Survival of 15 osteogenic sarcoma patients with respect to time of initiating treatment
(zero time)
Table 18.1. Life table for 15 patients with osteogenic sarcoma (actuarial method)

Columns:
(1) x: interval (years)
(2) l_x: subjects living at start of interval
(3) w_x: subjects withdrawn during interval
(4) r_x = l_x - w_x/2: subjects at risk during interval
(5) d_x: deaths during interval
(6) q_x = d_x/r_x: death rate
(7) p_x = 1 - q_x: survival rate
(8) S_xi = (p_x1)(p_x2) ... (p_xi): cumulative survival rate to end of interval

(1)     (2)    (3)    (4)     (5)    (6)      (7)      (8)
x       l_x    w_x    r_x     d_x    q_x      p_x      S_x
0-1     15     1      14.5    0      0        1        1
1-2     14     3      12.5    0      0        1        1
2-3     11     3      9.5     1      0.105    0.895    0.895
3-4     7      2      6       1      0.167    0.833    0.746
4-5     4      1      3.5     2      0.571    0.429    0.320
5-6     1      1      0.5     0      0        1        0.320
(~x).
(q_x = d_x / r_x).
Column (7): Survival Rate During Interval (Px)
Since the outcome is dichotomous, the probability of not developing the outcome
(e. g., of surviving) during the interval is simply 1 - qx. It, too, can be thought of as a
conditional probability, since it depends on a subject being free of the outcome at
the start of the interval.
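The columns of the actuarial life table can be generated mechanically from the individual follow-up times. The sketch below assumes the 15 times shown in Fig. 18.1, with deaths at 2.4, 3.5, 4.3, and 4.4 years and all other patients treated as withdrawn (lost or alive at study end); the function and variable names are illustrative.

```python
# (years followed, died) for each of the 15 patients.
patients = [
    (0.7, False), (1.2, False), (1.4, False), (1.8, False), (2.3, False),
    (2.4, True), (2.5, False), (2.6, False), (3.5, True), (3.6, False),
    (3.9, False), (4.1, False), (4.3, True), (4.4, True), (5.8, False),
]

def actuarial_life_table(data, width=1.0, n_intervals=6):
    """Build rows (interval, l_x, w_x, r_x, d_x, q_x, p_x, S_x)."""
    rows = []
    living = len(data)      # l_x for the first interval
    cumulative = 1.0        # running product of the p_x values
    for i in range(n_intervals):
        start, end = i * width, (i + 1) * width
        ending_here = [died for t, died in data if start <= t < end]
        d = sum(ending_here)              # deaths during the interval
        w = len(ending_here) - d          # withdrawn during the interval
        r = living - w / 2                # withdrawals at risk for half the interval
        q = d / r if r > 0 else 0.0       # death rate during the interval
        p = 1.0 - q                       # survival rate during the interval
        cumulative *= p
        rows.append((f"{start:g}-{end:g}", living, w, r, d, q, p, cumulative))
        living -= w + d
    return rows

table = actuarial_life_table(patients)
for label, l, w, r, d, q, p, s in table:
    print(f"{label:5s} {l:3d} {w:3d} {r:5.1f} {d:3d} {q:6.3f} {p:6.3f} {s:6.3f}")
```

Each cumulative survival value is the product of the interval survival rates up to and including that interval.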
The data from Figs. 18.1 and 18.2, depicting survival times in 15 osteosarcoma patients, have been analyzed using the Kaplan-Meier approach in Table 18.2. Patients
who are known to have died during the period of follow-up are ranked in ascending
order of the time of death. The columns are then calculated as follows:
Column (1): Time of the Next Occurring Death (t)
The shortest time after beginning treatment at which death was known to occur was
2.4 years. The other three deaths observed during follow-up occurred at 3.5, 4.3,
and 4.4 years after the start of treatment.
Column (2): Number at Risk for Death at Time t (r_t)
This includes all patients known to be alive just prior to time t and is thus equal to
the number known to be alive at t plus the number of deaths at t.
Table 18.2.

Columns:
(1) t: time (years)
(2) r_t: number at risk
(3) d_t: deaths
(4) q_t = d_t/r_t: death rate
(5) p_t = 1 - q_t: survival rate
(6) S_ti = (p_t1)(p_t2) ... (p_ti): cumulative survival rate

(1)     (2)    (3)    (4)      (5)      (6)
t       r_t    d_t    q_t      p_t      S_t
2.4     10     1      0.100    0.900    0.900
3.5     7      1      0.143    0.857    0.771
4.3     3      1      0.333    0.667    0.514
4.4     2      1      0.500    0.500    0.257
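The Kaplan-Meier columns can be computed directly from the individual follow-up times, with a new row at each observed death time. This sketch assumes the 15 times shown in Fig. 18.1, with deaths at 2.4, 3.5, 4.3, and 4.4 years; the function name is illustrative.

```python
# (years followed, died) for each of the 15 patients.
patients = [
    (0.7, False), (1.2, False), (1.4, False), (1.8, False), (2.3, False),
    (2.4, True), (2.5, False), (2.6, False), (3.5, True), (3.6, False),
    (3.9, False), (4.1, False), (4.3, True), (4.4, True), (5.8, False),
]

def kaplan_meier(data):
    """Return rows (t, r_t, d_t, q_t, p_t, S_t), one per distinct death time."""
    rows = []
    cumulative = 1.0
    for t in sorted({time for time, died in data if died}):
        r = sum(1 for time, _ in data if time >= t)             # alive just before t
        d = sum(1 for time, died in data if died and time == t)  # deaths at t
        q = d / r
        p = 1.0 - q
        cumulative *= p
        rows.append((t, r, d, q, p, cumulative))
    return rows

km = kaplan_meier(patients)
for t, r, d, q, p, s in km:
    print(f"{t:4.1f} {r:3d} {d:2d} {q:6.3f} {p:6.3f} {s:6.3f}")
```

Withdrawals never create a row of their own; they only reduce the number at risk at the next death time.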
[Figure: survival curves (% survival on the vertical axis, horizontal axis ticks 1-6) for the actuarial method and the Kaplan-Meier method]
Fig. 18.4. Comparison of Kaplan-Meier and actuarial survival curves for 15 osteogenic sarcoma
patients
Statistical Inference
actuarial and
be presented
two methods,
or exact time
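SE(S_x) = S_x √[(1 - S_x) / r_x]
(18.3)
Thus, the SE for the 3-year survival in Table 18.1 can be computed as follows:
SE(S_3) = 0.895 √[(1 - 0.895) / 9.5] = 0.094
Similarly, the SE for the 5-year survival is:
SE(S_5) = 0.320 √[(1 - 0.320) / 3.5] = 0.141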
As is often the case with longer durations of follow-up and correspondingly fewer
observations, the standard error is larger for the 5-year survival.
Once the standard error has been calculated, the standard normal (z-) distribution can be used to estimate a confidence interval around the S_x or S_t observed in the
study sample. (This assumes a normally distributed sampling distribution of S_x's or
S_t's.) The 100(1 - α)% confidence interval will include the "true" (target population)
S_x or S_t with a probability of 1 - α, where α is 0.05, 0.01, or some other chosen
value. The 95% confidence intervals for the 3- and 5-year actuarially derived survivals from our example are:
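The standard errors and confidence limits can be sketched as follows. The code assumes the standard error takes the form used in the worked example, S × √[(1 - S)/r], where r is the number at risk in the interval, and it uses z = 1.96 for a 95% interval; the function names are illustrative.

```python
import math

def standard_error(s, r):
    """SE of a cumulative survival rate s, with r subjects at risk in the interval."""
    return s * math.sqrt((1.0 - s) / r)

def confidence_interval(s, r, z=1.96):
    """Normal-approximation confidence limits: s +/- z * SE."""
    se = standard_error(s, r)
    return s - z * se, s + z * se

# 3-year and 5-year actuarial survival from Table 18.1.
for label, s, r in (("3-year", 0.895, 9.5), ("5-year", 0.320, 3.5)):
    se = standard_error(s, r)
    low, high = confidence_interval(s, r)
    print(f"{label}: S = {s:.3f}, SE = {se:.3f}, 95% CI = {low:.3f} to {high:.3f}")
```

With so few subjects at risk, the normal approximation can give limits outside the range 0 to 1 (the upper limit for the 3-year survival exceeds 1 here), so such limits are usually truncated at the boundary when reported.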
Drug therapy alone: 1*, 3, 6*, 7, 7*, 7*, 11, 14*, 15, 18, 24, 27*, 30, 32,
35*, 40, 42, 45*
Drug plus psychotherapy: 3*, 4, 7*, 9, 9*, 10*, 11*, 12*, 17, 19*, 20*, 22, 25*,
30*, 34, 38, 38*, 39*, 41*, 42, 42, 44*
The actuarial life tables and survival curves for the two groups are shown in
Table 18.3 and Fig. 18.5. Patients relapsing or withdrawing at the common boundaries between intervals (e.g., 6 months) are assumed to have been in remission at
that time but to have relapsed or withdrawn sometime in the succeeding month.
They are thus "credited" to the next succeeding interval.
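The construction of the actuarial life table from the raw follow-up times can be sketched as below. Two assumptions are made explicit: an asterisk is taken to mark a withdrawal (which is the reading that reproduces the withdrawal and relapse counts in Table 18.3), and the number at risk is taken as the number entering the interval minus half the withdrawals. The helper names `parse` and `actuarial_table` are hypothetical.

```python
def parse(listing):
    """Turn '1* 3 6*' into [(1, True), (3, False), (6, True)].

    The boolean is True for a withdrawal (asterisk), False for a relapse.
    """
    return [(int(tok.rstrip("*")), tok.endswith("*")) for tok in listing.split()]


def actuarial_table(subjects, width=6, n_intervals=8):
    """Actuarial life table with fixed-width intervals [start, start + width)."""
    rows, cumulative = [], 1.0
    for i in range(n_intervals):
        start, end = i * width, (i + 1) * width
        # A time falling exactly on a boundary is credited to the next interval.
        entering = sum(1 for t, _ in subjects if t >= start)
        withdrawn = sum(1 for t, w in subjects if start <= t < end and w)
        relapsed = sum(1 for t, w in subjects if start <= t < end and not w)
        at_risk = entering - withdrawn / 2
        q = relapsed / at_risk if at_risk > 0 else 0.0
        cumulative *= 1.0 - q
        rows.append({"interval": (start, end), "l": entering, "w": withdrawn,
                     "r": at_risk, "d": relapsed, "q": q, "p": 1.0 - q,
                     "S": cumulative})
    return rows


drug_alone = parse("1* 3 6* 7 7* 7* 11 14* 15 18 24 27* 30 32 35* 40 42 45*")
combined = parse("3* 4 7* 9 9* 10* 11* 12* 17 19* 20* 22 25* 30* 34 38 38* "
                 "39* 41* 42 42 44*")

alone_rows = actuarial_table(drug_alone)
combined_rows = actuarial_table(combined)
for row in alone_rows:
    print(row["interval"], row["l"], row["w"], row["r"], row["d"],
          f"{row['S']:.3f}")
```

Because the sketch carries full precision through the cumulative product, a few values may differ from Table 18.3 in the third decimal place, where the table appears to multiply rounded rates.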
Judging from Fig. 18.5 or from the last column of Table 18.3, the combination
treatment (drug plus psychotherapy) appears superior. How likely is it that the
observed difference is due to chance? In other words, what is the probability of
obtaining a difference at least as large as the one observed, under the null hypothesis
that the treatment groups represent random samples from target populations having
the same probability of survival?
There are two main approaches to testing two S_x's or S_t's for a statistically significant difference, i.e., to calculating the probability that sampling variation can
explain the observed difference under the null hypothesis of no difference. The first
assumes a normally distributed sampling distribution of S_x's or S_t's and involves a
z-test of the difference between two S_x's or S_t's at any given x_i or t_i:
$$z = \frac{S_{x_i,1} - S_{x_i,2}}{\sqrt{[SE(S_{x_i,1})]^2 + [SE(S_{x_i,2})]^2}} \qquad (18.4)$$
where $S_{x_i,1}$ and $S_{x_i,2}$ (or $S_{t_i,1}$ and $S_{t_i,2}$) are the S_x's (or S_t's) in the two groups being compared.
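As an illustration, the z-test can be sketched using the 24-month values from Table 18.3 (cumulative remission 0.654 with 9 at risk for drug therapy alone, 0.768 with 12 at risk for the combination). The function names are hypothetical, and the two-tailed P value is obtained from the standard normal distribution through `math.erfc`.

```python
import math


def survival_se(s, r):
    """Standard error of a cumulative survival s with r subjects at risk."""
    return s * math.sqrt((1.0 - s) / r)


def z_test(s1, r1, s2, r2):
    """z statistic and two-tailed P for the difference between two survivals."""
    se1, se2 = survival_se(s1, r1), survival_se(s2, r2)
    z = (s1 - s2) / math.sqrt(se1 ** 2 + se2 ** 2)
    p_two_tailed = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_two_tailed


z24, p24 = z_test(0.654, 9, 0.768, 12)
print(f"z = {z24:.3f}, two-tailed P = {p24:.3f}, one-tailed P = {p24 / 2:.3f}")
```

Halving the two-tailed value gives the one-tailed P when the direction of the difference was specified in advance.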
Table 18.3. Actuarial life tables for randomized clinical trial comparing drug therapy alone with
drug plus psychotherapy in 40 chronic schizophrenics

(1)         (2)    (3)    (4)     (5)    (6)      (7)      (8)
Interval    l_x    w_x    r_x     d_x    q_x      p_x      S_x
(months)                                                   (cumulative continued
                                                           remission rate)

1. Drug therapy alone
 0- 6       18     1      17.5    1      0.057    0.943    0.943
 6-12       16     3      14.5    2      0.138    0.862    0.813
12-18       11     1      10.5    1      0.095    0.905    0.736
18-24        9     0       9      1      0.111    0.889    0.654
24-30        8     1       7.5    1      0.133    0.867    0.567
30-36        6     1       5.5    2      0.364    0.636    0.361
36-42        3     0       3      1      0.333    0.667    0.241
42-48        2     1       1.5    1      0.667    0.333    0.080

2. Drug plus psychotherapy
 0- 6       22     1      21.5    1      0.047    0.953    0.953
 6-12       20     4      18      1      0.056    0.944    0.900
12-18       15     1      14.5    1      0.069    0.931    0.838
18-24       13     2      12      1      0.083    0.917    0.768
24-30       10     1       9.5    0      0        1        0.768
30-36        9     1       8.5    1      0.118    0.882    0.677
36-42        7     3       5.5    1      0.182    0.818    0.554
42-48        3     1       2.5    2      0.800    0.200    0.111
Fig. 18.5. Survival curves for RCT comparing drug therapy alone with drug plus psychotherapy in
40 chronic schizophrenics (actuarial method). Vertical axis: percent (20 to 100); horizontal axis: months (12 to 48)
$$SE(S_{24,1}) = 0.654 \sqrt{\frac{1 - 0.654}{9}} = 0.128$$
The corresponding two-tailed P value is 0.495, and thus the difference is not statistically significant. In other words, we cannot reject the null hypothesis. A one-tailed
test could be justified here, since there is no reason to think that drug therapy alone
would be more efficacious than drug plus psychotherapy. The one-tailed P value
would remain nonsignificant at P=0.247. A relative risk can also be calculated for
the outcome through the end of any interval x or at any time t:
$$RR_x = \frac{1 - S_{x,1}}{1 - S_{x,2}} \qquad RR_t = \frac{1 - S_{t,1}}{1 - S_{t,2}} \qquad (18.5)$$
For our example, the actuarial relative risk for relapse through 24 months is
$$RR_{24} = \frac{1 - 0.654}{1 - 0.768} = 1.49$$
Of course, the time at which the curves are to be tested should be established a priori, i. e., before the data are collected. Otherwise, the temptation would be great to
examine the two curves visually and choose the interval where they are farthest
apart. This would maximize the opportunity for finding a statistically significant difference, but the P value resulting from such post hoc significance testing would no
longer correspond to the probability of obtaining the observed result by chance
under the null hypothesis (see Chapter 12). If, for example, we had tested the two
schizophrenic treatment regimens at 36 months, instead of 24 months, we would
have obtained the following result:
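$$S_{36,1} = 0.361 \quad \text{and} \quad SE(S_{36,1}) = 0.361 \sqrt{\frac{1 - 0.361}{5.5}} = 0.123$$
$$S_{36,2} = 0.677 \quad \text{and} \quad SE(S_{36,2}) = 0.677 \sqrt{\frac{1 - 0.677}{8.5}} = 0.132$$
$$z = \frac{0.361 - 0.677}{\sqrt{(0.123)^2 + (0.132)^2}} = -1.751$$
$$RR_{36} = \frac{1 - 0.361}{1 - 0.677} = 1.98$$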
The corresponding two-tailed P value = 0.080, and the one-tailed P value = 0.040.
We thus might have (unfairly) rejected the null hypothesis.
The other approach to significance testing is called the log-rank test [2,4].
Despite its being a nonparametric test, the log-rank test is more efficient than the
z-test, because it compares the entire survival curve, rather than just a single point
on the curve (such as 24 months). It is also, therefore, less arbitrary. For each interval or time in the table, the observed (O) number of relapses (or deaths, or other
outcome) in each group is compared with the number expected (E) based on the
total number of relapses observed and the number of subjects at risk in each group:
$$E_1 = \left[\frac{r_{x_1}}{r_{x_1} + r_{x_2}}\right](O_1 + O_2), \qquad E_2 = \left[\frac{r_{x_2}}{r_{x_1} + r_{x_2}}\right](O_1 + O_2) \qquad (18.6)$$
Thus, if there are an equal number of subjects at risk for a given interval, half of the
observed relapses would be expected to occur in each group. The observed (O) and
expected (E) relapses for each group are then summed over all intervals in the table
and an overall χ² is calculated as follows:
$$\chi^2 = \frac{(\sum O_1 - \sum E_1)^2}{\sum E_1} + \frac{(\sum O_2 - \sum E_2)^2}{\sum E_2} \qquad (18.7)$$
Finally, the calculated value of χ² is compared with tabulated critical values at one
degree of freedom to obtain the corresponding P value.
The log-rank test will be illustrated using our same example. The calculations
using the actuarial life table (Table 18.3) are shown in Table 18.4. Despite the
improved efficiency of the log-rank test, the calculated value of χ² is only 1.573 and
does not achieve statistical significance, even with a one-tailed test. Similar results
are obtained using the Kaplan-Meier life table.
An overall relative risk can also be calculated using the log-rank approach:
$$RR_{overall} = \frac{\sum O_1 / \sum E_1}{\sum O_2 / \sum E_2} \qquad (18.8)$$
$$RR_{overall} = \frac{10/7.383}{8/10.617} = 1.80$$
This relative risk is probably clinically important. It indicates an 80% higher risk of
relapse with drug therapy alone as compared with drug plus psychotherapy. The fact
that this difference between the two treatments is not statistically significant, however, should make us concerned about inferring that the two treatments are equivalent. Although P is not low enough to warrant rejection of the null hypothesis, the
small sample size (low statistical power) has enabled a clinically important difference
to "escape" statistical significance. In other words, the risk of a Type II error is high
(see Chapter 12).
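The log-rank calculation for this example can be sketched as follows, using the numbers at risk and observed relapses per interval from the actuarial life tables. The function name `log_rank` is hypothetical, and the P value line relies on the fact that a χ² variable with one degree of freedom is the square of a standard normal variable.

```python
import math

# Per-interval numbers at risk and observed relapses (group 1 = drug alone,
# group 2 = drug plus psychotherapy), taken from the actuarial life tables.
r1 = [17.5, 14.5, 10.5, 9, 7.5, 5.5, 3, 1.5]
r2 = [21.5, 18, 14.5, 12, 9.5, 8.5, 5.5, 2.5]
o1 = [1, 2, 1, 1, 1, 2, 1, 1]
o2 = [1, 1, 1, 1, 0, 1, 1, 2]


def log_rank(r1, r2, o1, o2):
    """Summed expected relapses, chi-square (1 df), P value, and overall RR."""
    e1 = [a / (a + b) * (x + y) for a, b, x, y in zip(r1, r2, o1, o2)]
    e2 = [b / (a + b) * (x + y) for a, b, x, y in zip(r1, r2, o1, o2)]
    so1, so2, se1, se2 = sum(o1), sum(o2), sum(e1), sum(e2)
    chi2 = (so1 - se1) ** 2 / se1 + (so2 - se2) ** 2 / se2
    p_value = math.erfc(math.sqrt(chi2 / 2.0))   # upper tail, 1 df
    rr = (so1 / se1) / (so2 / se2)
    return se1, se2, chi2, p_value, rr


sum_e1, sum_e2, chi2, p_value, rr = log_rank(r1, r2, o1, o2)
print(f"sum E1 = {sum_e1:.3f}, sum E2 = {sum_e2:.3f}")
print(f"chi-square = {chi2:.3f}, P = {p_value:.3f}, RR = {rr:.2f}")
```

Carrying full precision gives a χ² that may differ from the tabulated 1.573 in the third decimal place, since the table sums expected values already rounded to three decimals.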
Table 18.4. Calculation of log-rank test for RCT comparing drug therapy alone (group 1) with
drug plus psychotherapy (group 2) in 40 chronic schizophrenics (actuarial method)

Expected relapses = [r_xi / (r_x1 + r_x2)] (O_1 + O_2)

Interval    Number at risk            Observed relapses      Expected relapses
(months)    r_x1    r_x2    Total     O_1    O_2    Total    E_1      E_2       Total

 0- 6       17.5    21.5    39         1      1      2       0.897     1.103     2
 6-12       14.5    18      32.5       2      1      3       1.338     1.662     3
12-18       10.5    14.5    25         1      1      2       0.840     1.160     2
18-24        9      12      21         1      1      2       0.857     1.143     2
24-30        7.5     9.5    17         1      0      1       0.441     0.559     1
30-36        5.5     8.5    14         2      1      3       1.179     1.821     3
36-42        3       5.5     8.5       1      1      2       0.706     1.294     2
42-48        1.5     2.5     4         1      2      3       1.125     1.875     3

Σ                                     10      8     18       7.383    10.617    18

RR_overall = (ΣO_1/ΣE_1) / (ΣO_2/ΣE_2) = (10/7.383) / (8/10.617) = 1.80
Just as in the case of comparing two means or two proportions, a comparison of two
survival curves may be biased by one or more confounding factors associated with
both exposure and (independently of exposure) outcome. There are two main methods used for controlling for such confounding effects.
The first method is stratification [4]. A separate life table is constructed for each
stratum defined by the confounder or combination of confounders. (Obviously,
continuous confounders must first be categorized.) Equation 18.6 is used to calculate stratum-specific expected values for each interval or time in the life table. The
stratum-specific observed and expected totals are then added together for all strata
to get an overall total of observed and expected for each exposure group. These
totals can be used in Eq. 18.7 to calculate an overall χ², which is referred to critical
values of χ² at one degree of freedom to derive the corresponding P value. This
technique is analogous to the Mantel-Haenszel procedure (see Sections 6.3
and 14.2.8). Finally, the overall relative risk can be estimated by applying the
observed and expected totals to Eq. 18.8.
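The stratified procedure can be written out in symbols. The notation below is an assumption introduced for illustration: s indexes strata, x indexes intervals within a stratum, and the first subscript (1 or 2) indexes the exposure group. Expected values are computed within each stratum and interval, and only then summed:

$$E_{1,s,x} = \left[\frac{r_{1,s,x}}{r_{1,s,x} + r_{2,s,x}}\right](O_{1,s,x} + O_{2,s,x}), \qquad \sum O_1 = \sum_s \sum_x O_{1,s,x}, \qquad \sum E_1 = \sum_s \sum_x E_{1,s,x}$$

with the corresponding expressions for group 2. The grand totals then enter the same formulas as in the unstratified case:

$$\chi^2 = \frac{(\sum O_1 - \sum E_1)^2}{\sum E_1} + \frac{(\sum O_2 - \sum E_2)^2}{\sum E_2}, \qquad RR_{overall} = \frac{\sum O_1 / \sum E_1}{\sum O_2 / \sum E_2}$$

Because subjects are compared only with others in the same stratum, a confounder that differs between exposure groups cannot contribute to the difference between observed and expected totals.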
The second method of controlling for confounding is a multivariate statistical
technique based on the proportional hazards model. As we have seen, life-table analysis does not require that the risk (hazard) of the outcome remain constant throughout the period of follow-up. Use of the actuarial method does assume a constant
risk within intervals defining the life table, but not necessarily between intervals.
That is why the slope of the actuarial survival curve changes from one interval to
another, as seen in Figs. 18.3 and 18.5.
References
1. Cutler SJ, Ederer F (1958) Maximum utilization of the life-table method in analyzing survival.
J Chronic Dis 8: 699-712
2. Coldman AJ, Elwood JM (1979) Examining survival data. Can Med Assoc J 121: 1065-1071
3. Kaplan EL, Meier P (1958) Nonparametric estimation from incomplete observations. JASA 53:
457-481
4. Peto R, Pike MC, Armitage P, Breslow NE, Cox DR, Howard SV, Mantel N, McPherson K,
Peto J, Smith PG (1977) Design and analysis of randomized clinical trials requiring prolonged
observation of each patient. II. Analysis and examples. Br J Cancer 35: 1-39
5. Cox DR (1972) Regression models and life tables. J R Stat Soc (Series B) 34: 187-202
6. Anderson A, Auquier A, Hauck WW, Oakes D, Vandaele W, Weisberg HI (1980) Statistical
methods for comparative studies: techniques for bias reduction. Wiley, New York, pp 214-230
It is important to realize that the terms "exposure" and "outcome" do not denote
distinct types of events or states. A variable considered an exposure in one situation
can serve as the putative outcome in another. For example, in a study of cigarette
smoking as a cause of lung cancer, the exposure variable is obviously cigarette smoking. In a study of some health education intervention intended to reduce smoking,
however, smoking is the outcome. Thus, deciding which variable is the exposure and
which is the outcome involves a choice by the clinician or investigator. The only a
priori constraint on this choice is temporality: exposure must be known to precede
outcome.
Given this understanding of the terms "exposure" and "outcome," the following
definition of cause can then be offered:
where E1 ≠ E2 and O1 ≠ O2.
Although the definition seems unobjectionable, it is different from some conventional notions of cause in that it insists on an alternative. No exposure-outcome relationship can be thought about in isolation. Before making a causal inference about
whether a given exposure is the cause of an outcome, one must ask: "Compared with
what?" For example, smoking one pack of cigarettes per day is a cause of lung cancer
compared with not smoking, but not compared with smoking two packs a day.
What is a "Cause"?
At first glance, the definition may appear truly operational, because it indicates
the steps that should be taken before making a causal inference. Change the exposure (E1 to E2) and observe the outcome; if the outcome changes (O1 to O2), then
causality can be inferred. The definition has intuitive appeal, because it is based on
an experimental paradigm. The experimenter changes one factor and observes the
effect on another.
But things are not as straightforward as they appear. The definition is based on
what would have occurred if the same exposed individuals had instead experienced
the comparative exposure. But this is obviously not possible, at least not at the same
period of time; the same individuals cannot experience two mutually exclusive exposures simultaneously. Even a crossover experiment in which the same individuals
successively experience different exposures cannot exclude the possibility that
something else changed either in the individuals or in the environment to
explain an observed change in outcome. It was just this impossibility, in fact, that
led the philosopher Hume to argue for the empirical nonverifiability of cause and
effect [1].
It is evident that the choice of the comparative exposure requires a choice by the
clinician or investigator. How should the comparative exposure differ from the
observed one? Should the difference be qualitative (e.g., a different agent) or
quantitative (a different dose of the same agent)? If quantitative, should the level be
higher or lower? By how much? Should it be total nonexposure? There are no right
or wrong answers to these questions. The choice of comparative exposure is necessarily subjective and fraught with uncertainty. But it is also unavoidable.
The choice of comparative exposure can and should be guided by prior notions
about what changes in exposure are feasible in the "real" world. After all, why is
causal inference important in the first place?
Two principal justifications can be offered. First, an understanding of cause is
essential for change. In fact, we even defined the causal relationship between exposure and outcome in terms of the change in the latter that occurs when the former is
altered. A deliberate intervention (change in exposure) will be successful in altering
outcome only to the extent that the exposure is a true cause of that outcome.
We need to understand cause in order to act in the best interest of individual
patients and of society at large. Engineers refer to this as "control" to distinguish it
from "prediction." Exposure can be an excellent marker, or predictor, of outcome
without necessarily being a true cause. But prediction does not necessarily imply
control. (This is another way of relating the familiar epidemiologic maxim that association does not prove causation, and I shall have more to say about this issue later
in discussing how causality assessments are actually made.)
So, either as clinicians intervening to improve the health of individual patients or
as a society implementing a policy to improve the public health, causal inference is
essential. This orientation toward change dictates which comparative exposures
should be contemplated. It is pointless to compare outcome between two exposures
unless both of those exposures can be feasibly implemented. For example, in
addressing the question of whether a serum cholesterol level of 350 mg/dl is a cause
of coronary artery disease, it makes little sense to compare individuals with a serum
cholesterol of 0 mg/ dl, since there is no real possibility of reducing serum cholesterol to that level.
Causality
The second justification for studying cause is to learn about mechanism. For
exposures that are not manipulable by man, change occurs, but it is nature's doing,
not ours. When nature is doing the controlling, causal inference should involve a
comparison of outcomes between two exposures that occur naturally. Thus, in inferring whether the presence of a valine residue at position 6 of the beta chain of
hemoglobin is the cause of sickle cell anemia, the appropriate comparison involves
otherwise identical individuals with a glutamic acid residue in position 6, since it is
glutamic acid (and not some other amino acid) that is usually found in this position.
Understanding fundamental biological processes is important not only to satisfy
human curiosity about what makes nature "tick," but also to enable us to adapt ourselves better to its requirements. Moreover, the history of science in general and of
medical science in particular has amply demonstrated that knowledge of underlying
causal mechanisms often serves as a basis for generating new hypotheses for interventions. For example, elucidation of the biochemical pathways of intermediary
metabolism has led to specific nutritional and pharmacologic interventions to correct, at least partially, a variety of inborn metabolic errors. Consequently, this second justification for understanding cause "feeds back" to the first. Epidemiologic
research and clinical practice have always benefited, and should continue to benefit,
from a knowledge of basic biologic mechanisms and the resulting improvement in
the ability of the human species to adapt to and change the world around us.
In biologic terms, we refer to such a relationship as synergism. Statistically, the relationship is identified by demonstrating an interaction between the two factors. (In a
multivariate statistical model, inclusion of the product of the two variables would
explain a significantly greater proportion of the total variance in outcome than a
model containing only the two variables by themselves.)
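As a sketch of what "inclusion of the product of the two variables" means, suppose (purely for illustration) that the multivariate model is linear and that the two factors are labeled A and B, with Y the outcome. These labels and the linear form are assumptions, not notation from the text. The model with interaction is then

$$Y = b_0 + b_1 A + b_2 B + b_3 (A \times B)$$

and the model without it is obtained by setting $b_3 = 0$. Interaction is inferred when the data support $b_3 \neq 0$, that is, when adding the product term explains significantly more of the variance in Y than A and B do by themselves.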
Intrauterine rubella infection at a critical time during the first trimester of pregnancy is sufficient to cause intrauterine growth retardation (IUGR) in all fetuses of
nonimmune women. No other factor is required; the infected mother may not be
short or undernourished and may not smoke or engage in other harmful practices
during pregnancy. Since these other factors, alone or in combination, can themselves
suffice to cause IUGR, however, intrauterine rubella infection cannot be considered
a necessary cause of IUGR.
For exposure to be both necessary and sufficient as a cause, its relationship with
outcome must be perfectly specific, i. e., one-to-one. In other words, an individual
can never be exposed without developing the outcome, and the outcome can never
occur in a person who has not been exposed. We can then in fact consider the exposure to be the cause of the outcome. This kind of exposure-outcome specificity is
extremely rare. The relationship between microorganisms and specific infectious diseases was originally thought to be a one-to-one correspondence and formed the
basis of the famous Koch postulates. But as we have seen, even the organism that
Koch discovered (the tubercle bacillus) does not automatically result in infection in
persons who are exposed to it. Certain (but not all) genetic mutations causing so-called inborn errors of metabolism, however, probably fit the bill; the inability to
synthesize a particular enzyme results in a metabolic derangement that is highly specific for the missing enzyme.
For many health outcomes, causality is multifactorial; causes are neither necessary nor sufficient for any given individual. Such is the case for many chronic diseases. For example, cigarette smoking, high blood pressure, a diet high in saturated
fat, insufficient exercise, high serum cholesterol, stress, and genetic predisposition
may all contribute to coronary artery disease. Coronary artery disease can occur in
the absence of any one of these factors, and none by itself may suffice. Each factor
may independently contribute to augment the risk, however, and thus each can be
considered a true cause. As we have seen, IUGR is another outcome with a complex, multifactorial "web of causation."
The multifactorial model can be represented as follows:
[Diagram: exposures E1, E2, E3, E4, ..., En each leading to outcome O]
As in the case of single exposure factors, the multifactorial model can be complicated by interactions (effect modification) among the various factors, as well as
between one or more of them and other variables.
Exposure may cause a certain outcome by first affecting an intermediate factor (also
called a mediating variable) that in turn leads to the outcome:
[Diagram: E → intermediate factor → O]
A series of causally linked variables is called a causal path or causal chain. Causal
paths can be contemplated either in terms of the individual or in terms of the group.
Very young maternal age, for example, appears to affect birth weight through its
impact on several mediating variables. Pregnant adolescents who have just recently
passed their menarche have not completed their physical growth and tend, therefore, to be shorter and thinner than older women. Their caloric intake may also be
less. Short stature, low weight-for-height, and low caloric intake then subsequently
lead to impaired intrauterine growth. In other words, young teenage mothers are
likely to have lighter babies because they are short and thin and consume an insufficient diet [3].
Socioeconomic status (SES) is a typical example of an exposure variable that
tends to lie somewhat removed from ("distal" to) the outcome in most causal paths.
For many health outcomes, persons of low SES fare worse than those of higher SES.
It is difficult to imagine a biologic mechanism, however, whereby low educational
attainment, income, or social standing has a direct influence on health. Rather, it
appears that low-SES persons have more crowded living conditions, consume
poorer diets, have less access to medical care, and experience more psychological
stress, and that these latter factors are the mediators (more "proximal" causes) of the
SES effects.
Many authors refer to exposures whose effect on an outcome is not known to be
mediated by other factors as direct causes [4]. Exposures lying more distal on the
causal path, i. e., those whose causal effects involve recognized mediating variables,
are called indirect causes [4]. In fact, however, if one continues to probe at a more
basic scientific level, most of the factors we consider "directly" causal are mediated
by physiologic or biochemical processes. In this narrower sense, only the last molecular event preceding an outcome can be called a direct cause, and even then such an
inference must remain tentative, pending possible discovery of further intermediate
steps in the final molecular pathway. Thus, the distinction between direct and indirect causes is somewhat artificial. It depends on both the "level" of factor (environmental agent, personal characteristic, biochemical reaction) and the state of knowledge at the time of inference. But within a given level (e.g., two environmental
factors such as SES and nutrition), such a distinction can be useful in constructing
causal paths and, hence, in understanding biologic mechanisms and contemplating
possible interventions.
Some exposures may operate through more than one causal path:
[Diagram: E leads to O by three paths: E → X → O, E → Y → O, and E → O directly]
In this diagram, exposure causes outcome through three different paths: one mediated by X, a second mediated by Y, and a third "direct" (i. e., without any identified
mediating variable). To return to our birth weight example, maternal cigarette smoking is believed to reduce intrauterine growth (as reflected in birth weight) by several
mechanisms, including carbon monoxide-mediated fetal hypoxia, nicotine-induced
uterine vasoconstriction, and appetite suppression (the latter also in part due to nicotine). A statistical technique known as path analysis is sometimes useful for testing
postulated causal paths. It is based on multiple linear regression, and interested
readers are referred to several standard references [5, 6].
19.3.2 Causal Networks
A description of all the known causal paths and effect modifiers leading to a given
outcome (in groups of individuals) is called a causal network or causal web. The
causal network for an outcome for which a given exposure constitutes a necessary
cause will consist of a single causal path (accompanied by relevant effect modifiers).
For multifactorial outcomes, causal networks may be represented by several exposure variables, linked through numerous causal paths and interacting with a variety
of effect modifiers.
A multifactorial causal network can also include one or more exposures that are
sufficient to cause the outcome. As we have seen, intrauterine growth retardation
(IUGR) can be caused by a variety of factors. First-trimester rubella infection is a
sufficient cause (assuming the mother is nonimmune), and thus the causal path to
IUGR is a single line without effect modifiers. Since rubella infection is not a necessary cause, however, the causal network for IUGR also contains numerous other
causal paths involving other exposures and their effect modifiers, such as maternal
short stature, low prepregnancy weight, insufficient caloric intake during pregnancy,
primiparity, and cigarette smoking [3].
probability (P) of 1 to such a statement. It is possible (although exceedingly improbable), however, that the window would have broken on its own at that very
moment, perhaps from some inherent structural defect, or that, unbeknownst to me,
someone else simultaneously fired a bullet at the same window.
Absolute proof of causality is thus elusive, and assessment of causality inevitably
involves a statement of probability, i. e., uncertainty, rather than certainty. The fact
that causality is more continuous than dichotomous, however, need not result in
nihilism or paralysis. The probability need not be 1 to justify action. Clinicians may
institute treatment to combat a cause that seems reasonably likely, such as beginning
antibiotic treatment for suspected bacterial meningitis before the diagnosis is confirmed by bacteriologic culture results one or two days later. Similarly, an industrial
plant may reduce potentially hazardous vapor or dust levels based on a preliminary
epidemiologic study demonstrating adverse health effects. In these situations and
many others like them, the probability of causality needs to be weighed in a decision
analysis (risk-benefit analysis) along with the efficacy and side effects of available
treatments, as well as the consequences of withholding treatment.
Deciding that exposure causes outcome, therefore, usually requires a probability
assessment. Even for genetic diseases where a given mutation appears to be both
necessary and sufficient, the laboratory evidence may not be completely unequivocal. The probability assessment is most useful when it is made quantitative, in the
sense of assigning a probability P between 0 and 1. Such a quantitative assessment
expresses the degree of belief in causality. It usually involves a subjective component,
but pretending that causality is either yes (P= 1) or no (P= 0) is usually neither
helpful for understanding nor necessary for action. Dichotomous causality thinking
can in fact be harmful, because it is likely to lead to errors in clinical or public health
decisions and consequent disillusionment with the clinician or scientific community
supplying the "evidence."
The probability of causality can be assessed in terms of three different questions
relating exposure to outcome [7, 8]:
1. Can it? (potential causality assessment): What is the probability that exposure can,
at least in certain persons under certain circumstances, cause the outcome?
2. Will it? (predictive causality assessment): What is the probability that the next person exposed will develop the outcome because of the exposure? In more general
terms, is the exposure a quantitatively important cause of the outcome?
3. Did it? (retrodictive causality assessment): What is the probability that a given person who has already developed the outcome did so because of exposure?
Analytic bias exists in four types (see Chapter 5): information bias, sample distortion
bias, confounding bias, and reverse causality ("cart-vs-horse") bias. Given an association between exposure and outcome, the evidence that the association is causal
will be strengthened to the extent that each of these sources of bias is eliminated or
reduced. Measurements of exposure and outcome should be reproducible and valid,
and neither measurement should be influenced by the other. Sloppy (imprecise)
measurements may obscure true causal relationships; systematically biased measurements may either create or obscure associations, depending on the direction of the
bias. Sample distortion bias can arise in assembling the study sample or from differential loss to follow-up.
Confounding bias is an ever-present danger in observational studies, and its control requires adequate design and statistical analytic techniques. Confounding results
in an exposure-outcome association because exposure and outcome are both caused
by a third factor X:
For example, anemia and iron deficiency have frequently been reported to be associated with low birth weight (birth weight < 2500 g). The evidence suggests, however, that iron deficiency is a result of generally poor nutritional status, and that
insufficient caloric intake is often accompanied by low iron intake. The deficient
diet causes both iron deficiency and low birth weight, but the iron deficiency itself
has no causal role [3]. Iron deficiency or anemia may thus be an indicator or marker
of risk for low birth weight, but it is not a cause. The distinction is important,
because iron supplementation will correct the iron deficiency and anemia but have
no effect on birth weight.
Confounding also occurs whenever the exposure factor is tightly linked to a
third factor that, although unrecognized, is the true cause of the outcome. This is
merely a special case of the general concept of confounding that can be indicated as
follows:
[Diagram: exposure E is tightly linked to a third factor X, and X (not E) leads to outcome O]
This type of confounding can occur, for example, when an adverse reaction attributed to the active pharmacologic ingredient of a drug is actually caused by a preservative or other component contained in the preparation administered. As another
illustration, consider the relationship between alcohol consumption and lung cancer.
Because heavy drinkers are likely to be cigarette smokers, failure to consider the
confounding effect of smoking might lead to the erroneous inference that heavy
drinking causes lung cancer. High alcohol consumption may be a marker of risk, but
it is not a true causal factor.
Randomized clinical trials (RCTs) provide the best protection against confounding bias. When combined with double blinding, standardized detection methods (to
protect against information bias), and vigorous follow-up (to reduce sample distortion occurring after randomization), RCTs provide the most convincing epidemiologic evidence of Can it? causality.
A given exposure can cause an outcome only if it precedes it. Sorting out which
is the cart and which is the horse is occasionally quite difficult, especially in cross-sectional studies. Unless the exposure factor is known to have been present since
birth (e.g., sex, blood type, or racial origin), uncertainty as to whether exposure preceded outcome or vice versa will result in a lower probability estimate for Can it?
Case-control studies can protect themselves against reverse causality bias, at least to
some extent, by using newly occurring (incident) outcomes and specifically inquiring about prior exposure. Such exposure histories, however, depend on adequate
records or valid subject recall. Cohort studies can avoid this problem if the study
sample is known to be free of the outcome at the time exposure begins. Once again,
clinical trials provide the best evidence, since exposure is assigned by the investigator.
19.5.2 Strength of Association
The strength of association between exposure and outcome relates to the size of the
effect on outcome produced by a given amount of exposure. For dichotomous exposures and outcomes, this refers to the relative risk. For dichotomous exposures and
continuous outcomes, the mean difference in outcome is the corresponding indicator. All else (i.e., other elements for weighing the epidemiologic evidence) being
equal, the larger the effect size, the greater the likelihood that exposure can cause
the outcome. Small relative risks or mean differences always raise the question as to
whether some hidden or incompletely controlled source of bias might explain the
results. Large effects are less likely to be entirely explained away by such factors.
19.5.3 Biologic Gradient
When exposure is ordinal (ranked) or continuous, the probability of Can it? causality is often increased by demonstrating a graded effect on outcome with different
degrees of exposure, i. e., a dose-response relationship. When the outcome is dichotomous, the relative risk should increase with higher categories of exposure. When
both exposure and outcome are continuous, the slope (regression coefficient) indicates the amount of change in outcome resulting from a given increase or decrease
in exposure. It should be emphasized, however, that threshold, ceiling, optimum
(inverted "U"), and nonlinear graded effects are possible. Consequently, steady
increases in relative risk or a constant slope are not necessary to demonstrate a biologic gradient. Furthermore, even the total absence of any dose-response relationship may not weigh heavily against the Can it? probability if the underlying biologic
mechanism is independent of the dose of exposure, such as with anaphylactic or idiosyncratic adverse drug reactions.
19.5.4 Statistical Significance
No matter how unbiased, strong, graded, and statistically significant a given exposure-outcome association appears to be from a single epidemiologic study, Can it?
causality is strengthened by replication. If several investigators in different settings
and (preferably) using different methods all find a significant association, the probability that exposure can cause outcome is increased. In Bayesian terms, positive
results of each previous study increase the prior odds favoring the alternative
hypothesis (HA), and new data favoring HA continue to raise its posterior odds.
Replication is particularly helpful in excluding chance as an explanation. Repeated
failure to control for sources of bias, however, can lead to consistent findings that
are invalid. Many studies from developing countries, for example, have reported an
association between maternal anemia and low birth weight. None that reported such
an association, however, controlled for the confounding effect of poor prepregnant
and gestational nutrition. Consistency alone, therefore, is an insufficient criterion of
causality.
19.5.6 Biologic Plausibility and Coherence
This is a corollary to the previous criterion and concerns the biologic similarity of
the exposure factor under assessment to another factor whose causal effect on the
outcome is well established. Suppose, for example, that an association is demonstrated between a newly marketed drug and neutropenia (a low concentration of
neutrophilic white blood cells). Knowledge that the new agent has a chemical structure very similar to that of one or more long-standing drugs with recognized neutropenic effects will increase the probability that the association with the new drug is
indeed causal.
It is the magnitude of the causal effect that will enter into risk-benefit calculations in making decisions about preventing an undesirable outcome or promoting a
desirable one (see Chapter 17). If a particular treatment or preventive maneuver is
dangerous, painful, or expensive, for example, the benefits will not be worth the
risks or costs if only a marginal change in outcome can be expected.
How do we go about assessing the importance, i.e., measuring the magnitude,
of a causal effect? Actually, this issue has already been discussed, to some degree, in
Chapters 6-9. The answer depends on whether the outcome under consideration is
categorical (usually dichotomous) or continuous, and these will be discussed separately.
For dichotomous outcomes, the importance of a cause can be gauged by assessing the probability that the next exposed person will develop the outcome because of
the exposure (or, equivalently, the number of persons out of the next 100 exposed
who will develop the outcome because of the exposure). In other words, Will it?
The Will it? probability is the difference between the probability of the outcome
occurring in an exposed person, i.e., the probability of outcome given exposure
[$P(O \mid E)$], and the probability of it occurring without exposure [$P(O \mid \bar{E})$]:
$P(O \mid E) - P(O \mid \bar{E})$. It can thus be estimated by the attributable risk, i.e., the difference in incidence of the outcome in otherwise similar exposed and nonexposed persons, which depends on both the relative risk of exposure and the incidence of the
disease in the unexposed population. In Table 6.3, we examined data on lung cancer
and cardiovascular disease mortality among smoking and nonsmoking male British
physicians. Despite a very high relative risk (32.43) for lung cancer death among
smokers, the attributable risk was only 2.20 per 1000 per year. In other words, smoking can be expected to add 2.2 lung cancer deaths per year for each 1000 smoking
male British physicians. The cardiovascular death attributable risk is higher at 2.61
per 1000 per year, despite a much lower relative risk (1.36), because cardiovascular
deaths occur far more frequently than lung cancer deaths (7.32 vs 0.07 per 1000 per
year among nonsmokers).
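The arithmetic behind these figures can be sketched in a few lines of Python. This is only an illustration: the function name `attributable_risk` is ours, and the inputs are the rounded rates and relative risks quoted above, so the recomputed cardiovascular figure (about 2.64) differs slightly from the quoted 2.61, presumably because the published value was computed from unrounded rates.

```python
def attributable_risk(incidence_unexposed, relative_risk):
    """Excess incidence among the exposed, in the same units as the input rate."""
    incidence_exposed = incidence_unexposed * relative_risk
    return incidence_exposed - incidence_unexposed


# Rates per 1000 per year among nonsmokers, and relative risks, as quoted in the text.
lung_cancer = attributable_risk(0.07, 32.43)
cardiovascular = attributable_risk(7.32, 1.36)

print(f"lung cancer:    {lung_cancer:.2f} per 1000 per year")
print(f"cardiovascular: {cardiovascular:.2f} per 1000 per year")
```

The sketch makes the point of the paragraph concrete: a modest relative risk applied to a common outcome can yield a larger attributable risk than a very high relative risk applied to a rare one.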
For continuous variables, the importance of exposure as a cause of outcome can
be formulated by the question How much will it? and is determined by the difference in outcome due to exposure. Although the expected difference in outcome
actually represents an entire probability distribution of expected differences in outcome between exposed and nonexposed persons, we usually base our estimate on
the mean difference. Thus, if the mean difference in birth weight between infants of
smoking and those of nonsmoking mothers that is attributable to smoking (i.e.,
unconfounded by other factors) is 150 g, the estimate of the expected smoking
effect will be -150 g.
[...] epidemiologic studies can be very useful in carrying out a Did it? probability assessment. In
fact, Bayes' theorem enables the merging of individual case information with
epidemiologic data in assessing the Did it? causality question.
Suppose that a given individual has developed the outcome. Without knowing
whether that individual was exposed, what is the probability that the exposure was
the cause? Although this question is posed in terms of the individual, the Did it?
probability can be estimated by measuring the proportion of persons developing the
outcome who do so because of exposure. This proportion is called the etiologic fraction (EF) or population attributable risk; it is already familiar to us from Chapter 6
and can be calculated as follows, using Eq. 6.4:
$$\mathrm{EF} = \frac{\pi(\mathrm{RR} - 1)}{\pi(\mathrm{RR} - 1) + 1}$$
where RR is the relative risk of the outcome in otherwise similar exposed vs nonexposed individuals and $\pi$ is the probability (i.e., prevalence) of exposure in the population of interest [10].
For a fixed relative risk, EF increases as the probability of exposure increases,
approaching (RR - 1)/RR when everyone is exposed (and hence 1 only when the relative risk is also very large). Consequently, even an exposure with a high
relative risk may not have a high EF if the exposure is rare. First-trimester intrauterine rubella infection is associated with a very high relative risk of intrauterine
growth retardation (IUGR), but because such infection is rare, the corresponding
EF is quite low. Conversely, maternal cigarette smoking may only double or triple
the risk of IUGR, but it is so common that a large proportion of IUGR may be
caused by it in populations where many women smoke during pregnancy [3].
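The contrast between a rare exposure with a high relative risk and a common exposure with a modest one can be sketched numerically. The function name `etiologic_fraction` and all of the prevalences and relative risks below are hypothetical values chosen for illustration; they are not estimates for rubella or smoking.

```python
def etiologic_fraction(prevalence, relative_risk):
    """Proportion of all cases of the outcome attributable to the exposure.

    prevalence    -- probability of exposure in the population (0 to 1)
    relative_risk -- risk in exposed divided by risk in nonexposed
    """
    excess = prevalence * (relative_risk - 1.0)
    return excess / (excess + 1.0)


# Hypothetical rare exposure with a very high relative risk.
rare_strong = etiologic_fraction(prevalence=0.001, relative_risk=20.0)

# Hypothetical common exposure with a modest relative risk.
common_weak = etiologic_fraction(prevalence=0.40, relative_risk=2.5)

# With everyone exposed, EF reduces to (RR - 1) / RR.
everyone_exposed = etiologic_fraction(prevalence=1.0, relative_risk=10.0)

print(f"rare, strong exposure: EF = {rare_strong:.3f}")
print(f"common, weak exposure: EF = {common_weak:.3f}")
print(f"universal exposure:    EF = {everyone_exposed:.3f}")
```

Under these assumed inputs the common exposure accounts for a far larger share of cases than the rare one, even though its relative risk is much smaller.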
When outcome and exposure are both known to have occurred in an individual,
an improved (over the EF) estimate of the probability that exposure caused outcome
in that individual can be derived using the etiologic fraction among the exposed (EFE)
obtained from epidemiologic data [11]. EFE represents the proportion of all exposed
persons in a population developing the outcome who do so because of exposure. It
can be calculated as follows:
$$\mathrm{EFE} = \frac{\mathrm{RR} - 1}{\mathrm{RR}} \tag{19.1}$$
Thus, in the absence of any other information about a specific individual from that
population who developed the outcome other than the fact that he or she was
exposed, EFE provides an estimate of the probability that the exposure caused the
outcome. If the relative risk of lung cancer is 10 in smokers vs nonsmokers, for
example, the probability that a smoker who develops lung cancer did so because of
smoking would be (10 - 1)/10 = 0.90.
Often, however, we know much more relevant information about a specific case
of an outcome than merely whether or not exposure occurred. We also probably
know something about the dose of exposure (if not purely dichotomous) and its timing, in addition to background factors concerning the subject's age, sex, socioeconomic status, and past medical history. How can these be used to refine our Did it?
probability assessment?
Various informal and formal methods have been used in this regard. In fact, clinicians usually take into account some, if not all, of the above factors in making a
causality assessment. But a clinician's "global introspection" tends to become less
reliable as the problem gets more complex. It is difficult or impossible for most clinicians to simultaneously consider and properly weigh all the relevant facts, let alone
possess those facts.
The construction of algorithms (branched logic trees) can be helpful in improving the reproducibility and validity of diagnostic judgments. One particular area that
has received considerable attention in this regard is that of adverse drug reactions
(ADRs). A variety of algorithms or equivalent checklists have been developed to
help clinicians, drug manufacturers, and regulatory agencies judge whether a given
drug caused an observed adverse event in specific cases [12]. Although these
schemes appear to yield more reproducible causality assessments than global introspection (even that of clinical pharmacology experts), their assessment procedures
are somewhat arbitrary and often lead to different results from one method to
another.
A potentially more rewarding approach is to use Bayesian techniques for manipulating conditional probabilities (see Chapter 16). The posterior odds that an outcome was caused by exposure can be decomposed into a prior odds and a likelihood
ratio (see Eq. 16.8). The knowledge that a specific case was exposed can be incorporated into the prior odds, i.e., the odds that exposure caused the outcome given only
the background information (B):
$$\text{prior odds} = \frac{P(E \rightarrow O \mid B)}{P(E \nrightarrow O \mid B)}$$
where $E \rightarrow O$ and $E \nrightarrow O$ denote the opposing propositions that exposure did and did
not, respectively, cause the outcome. As with the etiologic fraction in the exposed
(EFE), the prior odds is usually based on epidemiologic data derived from otherwise similar exposed individuals. In fact, $P(E \rightarrow O \mid B) = \mathrm{EFE}$ and $P(E \nrightarrow O \mid B) = 1 - P(E \rightarrow O \mid B) = 1 - \mathrm{EFE}$.
The specific case information (C) concerning dose, timing, background, and
other relevant factors can then be included in computing the likelihood ratio (LR):
$$\mathrm{LR} = \frac{P(C \mid E \rightarrow O)}{P(C \mid E \nrightarrow O)}$$
The posterior odds then incorporates both the background and case information:
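$$\frac{P(E \rightarrow O \mid B, C)}{P(E \nrightarrow O \mid B, C)} = \frac{P(E \rightarrow O \mid B)}{P(E \nrightarrow O \mid B)} \times \frac{P(C \mid E \rightarrow O)}{P(C \mid E \nrightarrow O)} \tag{19.2}$$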
For ADRs, for example, the likelihood term would include information about the
age, sex, and medical history of the patient; the dosage and timing of drug administration; and the results of dechallenge (stopping the drug) and rechallenge (restarting it).
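As a rough sketch of how Eq. 19.2 might be applied, the following Python fragment takes the prior probability from EFE (Eq. 19.1), converts it to odds, multiplies by a likelihood ratio for the case information, and converts back to a probability. The function names are ours, and the likelihood ratios are hypothetical: in practice each would have to be estimated from data on how often such case findings occur when the exposure is and is not the cause.

```python
def efe(relative_risk):
    """Etiologic fraction among the exposed, Eq. 19.1."""
    if relative_risk <= 1.0:
        raise ValueError("this sketch assumes a relative risk greater than 1")
    return (relative_risk - 1.0) / relative_risk


def posterior_probability(relative_risk, likelihood_ratio):
    """Posterior probability that exposure caused the outcome, via Eq. 19.2."""
    prior_probability = efe(relative_risk)
    prior_odds = prior_probability / (1.0 - prior_probability)  # equals RR - 1
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


# RR = 10 as in the smoking example; the likelihood ratios are hypothetical.
for lr in (0.5, 1.0, 3.0):
    p = posterior_probability(relative_risk=10.0, likelihood_ratio=lr)
    print(f"LR = {lr:>3}: posterior probability = {p:.3f}")
```

With a likelihood ratio of 1 the case information is uninformative and the posterior probability simply equals EFE (0.90 here); ratios above or below 1 shift it up or down.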
For continuous outcomes, Did it? usually means How much did it? If a smoking
mother delivers an infant whose birth weight is 2800 g, what is our best estimate of
what the weight would have been had the mother not smoked during pregnancy?
Once again, we should use all the relevant factors at our disposal to make this estimate. Based on the mean difference in birth weight in infants of smoking vs nonsmoking mothers, or far better, the decrease in birth weight per cigarette smoked
per day for the precise time during pregnancy that the mother smoked, an average
expected effect of her pregnancy smoking history can be estimated. This estimate
can be further refined by knowledge about any factors known to modify the effect
of exposure (effect modifiers).
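A minimal sketch of the How much did it? calculation follows. It assumes the illustrative mean effect of -150 g used earlier in the chapter; the dose-based figures (a decrement per cigarette per day and a daily consumption) are purely hypothetical placeholders, and the function names are ours.

```python
def outcome_without_exposure(observed_outcome, expected_effect):
    """Best estimate of the outcome had the exposure not occurred."""
    return observed_outcome - expected_effect


def dose_based_effect(effect_per_unit_dose, dose):
    """Expected effect when the effect is assumed proportional to dose."""
    return effect_per_unit_dose * dose


# Using the mean difference attributable to smoking (-150 g, illustrative).
estimate_from_mean = outcome_without_exposure(2800, -150)

# Using a hypothetical dose-response slope of -10 g per cigarette per day
# and a hypothetical consumption of 20 cigarettes per day.
effect = dose_based_effect(effect_per_unit_dose=-10, dose=20)
estimate_from_dose = outcome_without_exposure(2800, effect)

print(f"estimate from mean difference: {estimate_from_mean} g")
print(f"estimate from dose-response:   {estimate_from_dose} g")
```

The dose-based version is linear only by assumption; as noted in the discussion of biologic gradient, threshold and other nonlinear relationships are possible.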
Did it? causality assessments are assuming increasing prominence for a variety of
scientific and nonscientific reasons. One major reason is that clinicians are seeking
to improve on global introspection as a diagnostic method. Another reason is related
to liability. Harmful exposures or treatments can be caused by industry, government,
or individuals. Even if it is known (P= 1) that a given exposure can cause an outcome, exposed persons who develop the outcome will naturally wish to know
whether their exposure was the cause. They may even wish to sue the party they believe to be responsible for exposing them. A patient developing a serious adverse reaction to a drug, for example, may bring suit against the treating physician, the drug's
manufacturer, and perhaps even the government agency regulating its availability on
the market. Although the courts do not easily deal with concepts such as numerical
probabilities or average expected differences, Did it? causality assessment will probably become more important in the future for legal, as well as scientific, reasons.
In this regard, one major difference between Did it? and either Can it? or Will
it? is that the Did it? assessment cannot be used for prediction and therefore cannot
be tested. Sensitivity testing or distributional assumptions can be used to estimate a
range of Did it? probabilities (analogous to a confidence interval). The figures can
even be revised in the light of new epidemiologic evidence. But the probability estimate remains a hypothesis; it can never be confirmed or refuted.
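One simple form of sensitivity testing is to recompute the Did it? probability over a grid of plausible inputs and report the lowest and highest values. The sketch below does this for the Bayesian formulation, using the fact that the prior odds EFE/(1 - EFE) simplify to RR - 1. The ranges of relative risks and likelihood ratios are hypothetical, as are the function names.

```python
from itertools import product


def did_it_probability(relative_risk, likelihood_ratio):
    """Posterior probability of causation; prior odds are RR - 1."""
    posterior_odds = (relative_risk - 1.0) * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


def probability_range(relative_risks, likelihood_ratios):
    """Lowest and highest posterior probability over all input combinations."""
    values = [
        did_it_probability(rr, lr)
        for rr, lr in product(relative_risks, likelihood_ratios)
    ]
    return min(values), max(values)


# Hypothetical plausible ranges for the two inputs.
low, high = probability_range(
    relative_risks=[5.0, 10.0, 20.0],
    likelihood_ratios=[0.5, 1.0, 2.0],
)
print(f"Did it? probability lies between {low:.3f} and {high:.3f}")
```

The resulting interval describes how much the estimate depends on the assumed inputs; it does not make the underlying hypothesis testable.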
A final comment about Did it? causality assessment will bring us back full circle
to the introductory remarks made in Chapter 1. The discussion there focused on the
clinical vs the epidemiologic approaches to problem solving. Clinical reasoning is
fundamentally individualized and attempts to answer a question based on the facts
of a single case. Epidemiologic reasoning is probabilistic; it is founded on relationships in groups between exposure and outcome. In deciding whether a given exposure caused an observed outcome in a specific case, the Bayesian approach brings
epidemiologic reasoning to the clinical "bedside," the individual subject. The best
estimate of the prior odds of causality is usually based on the best epidemiologic,
probabilistic data available. But the facts of the individual case are then used to alter
the prior odds (through the likelihood ratio) to arrive at a final assessment of the
posterior odds of causation for that case.
Did it? causality assessment represents an excellent example of the essential
compatibility of the clinical and epidemiologic approaches. The marriage is quite
recent, and the two parties have much to learn from one another, but the prospects
for fertility appear excellent.
References
1. Hume D (1946) A treatise of human nature. Selby-Bigge LA (ed) Clarendon Press, Oxford
2. Miettinen OS (1974) Confounding and effect modification. Am J Epidemiol 100: 350-353
3. Kramer MS (1987) Determinants of low birth weight: methodological assessment and meta-analysis. Bull WHO 65: 663-737
4. Susser M (1973) Causal thinking in the health sciences. Oxford University Press, New York
5. Turner ME, Stevens CD (1959) The regression analysis of causal paths. Biometrics 15: 236-238
6. Duncan OD (1966) Path analysis: sociological examples. Am J Sociol 72: 1-16
7. Kramer MS, Hutchinson TA (1984) The Yale algorithm. Drug Inf J 18: 283-291
8. Lane DA (1984) A probabilist's view of causality assessment. Drug Inf J 18: 323-330
9. Hill AB (1977) A short textbook of medical statistics. Hodder and Stoughton, London, pp 285-296
10. Levin ML (1953) The occurrence of lung cancer in man. Acta Unio Int Contra Cancrum 9: 531-541
11. Miettinen OS (1974) Proportion of disease caused or prevented by given exposure trait or intervention. Am J Epidemiol 99: 325-332
12. Venulet J, Berneker G-C, Ciucci AG (1982) Assessing causes of adverse drug reactions with special references to standardized methods. Academic, London
Appendix Tables
73152
51453
01898
12074
39941
14511
11649
61414
98551
21225
85285
86348
83525
37895
93629
36009
76431
04231
93547
19574
95892
81594
13604
24769
71565
36962
95848
75339
09404
33413
67835
36738
11730
76548
56087
63314
25014
85423
05393
40875
50162
15460
60698
96770
13351
90474
28599
25254
28785
84725
41469
64109
16210
02760
86576
16812
09497
89717
24359
86944
81542
76235
65997
99410
93296
81652
41383
82667
77319
10081
45554
31555
74624
73408
82454
27931
12639
36348
58993
76810
93994
00619
44018
61098
52975
22375
22909
64732
04393
10324
00953
29563
93589
48245
15457
41059
67434
72766
92079
29187
66456
41045
68816
46784
40350
47679
82830
37643
66125
62533
66810
47617
19959
94932
73603
15941
36932
57550
64451
34075
84602
46728
49620
29275
16451
14493
71183
98480
57669
42885
65515
36345
25640
66658
03448
19251
41404
67257
30818
37390
41642
81110
18671
58353
96328
74220
03786
75085
09161
75707
17612
02407
55558
33015
48992
65522
06098
15520
19155
64998
80607
92917
27038
11715
87080
19184
40434
25471
00551
39333
64164
60602
76107
24909
00767
66962
82175
90832
31894
45637
82310
04470
10819
37774
12538
18163
78754
56797
37953
67439
63495
90775
33751
78837
94914
21333
65626
84380
46479
59847
48660
50061
07389
32072
97197
31288
42539
87891
80083
55147
00086
14812
76255
63868
76639
79889
48895
89604
70930
76971
75532
11196
41372
89654
55928
28704
34335
10837
05359
36441
62844
60492
66992
47196
95141
92337
70650
93183
12452
42333
99695
51108
56920
38234
67483
31416
82066
01850
32315
59338
11231
83436
42782
89276
42703
27904
67914
39202
89582
55198
57383
21465
18582
87138
80380
31852
99605
46214
16165
67067
69137
83114
99228
15984
97155
96667
97885
79541
21466
34160
14315
74440
78298
63830
85019
01007
99622
75404
30475
03527
31929
87912
63648
74729
78140
58089
61705
18914
11965
85251
27632
57285
98982
94089
48111
50987
30392
60199
34803
80936
91373
23660
99275
48941
81781
07736
75841
41967
69709
93248
20436
21931
35208
16784
67877
96130
04295
30357
44642
16498
73483
00875
76772
89761
31924
85332
09114
92656
66864
51315
24384
32101
62318
62803
79921
66121
53972
14509
37700
85466
96986
96642
16594
07688
59392
84844
24199
78883
65533
72722
93873
58080
43222
72126
15473
46352
35450
23093
23611
73295
92183
03482
58645
93993
49759
51152
66953
60257
01848
56157
85878
49251
89250
03910
60477
30490
63719
63266
38552
83284
15974
57615
90858
17472
56367
Source: Daniel WW (1974) Biostatistics: a foundation for analysis in the health sciences. John Wiley
& Sons, New York, p 417.
Table A.2. Areas in one tail (≥ +z or ≤ −z)

z     0.00   0.01   0.02   0.03   0.04   0.05   0.06   0.07   0.08   0.09
0.0   0.500  0.496  0.492  0.488  0.484  0.480  0.476  0.472  0.468  0.464
0.1   0.460  0.456  0.452  0.448  0.444  0.440  0.436  0.433  0.429  0.425
0.2   0.421  0.417  0.413  0.409  0.405  0.401  0.397  0.394  0.390  0.386
0.3   0.382  0.378  0.374  0.371  0.367  0.363  0.359  0.356  0.352  0.348
0.4   0.345  0.341  0.337  0.334  0.330  0.326  0.323  0.319  0.316  0.312
0.5   0.309  0.305  0.302  0.298  0.295  0.291  0.288  0.284  0.281  0.278
0.6   0.274  0.271  0.268  0.264  0.261  0.258  0.255  0.251  0.248  0.245
0.7   0.242  0.239  0.236  0.233  0.230  0.227  0.224  0.221  0.218  0.215
0.8   0.212  0.209  0.206  0.203  0.200  0.198  0.195  0.192  0.189  0.187
0.9   0.184  0.181  0.179  0.176  0.174  0.171  0.169  0.166  0.164  0.161
1.0   0.159  0.156  0.154  0.152  0.149  0.147  0.145  0.142  0.140  0.138
1.1   0.136  0.133  0.131  0.129  0.127  0.125  0.123  0.121  0.119  0.117
1.2   0.115  0.113  0.111  0.109  0.107  0.106  0.104  0.102  0.100  0.099
1.3   0.097  0.095  0.093  0.092  0.090  0.089  0.087  0.085  0.084  0.082
1.4   0.081  0.079  0.078  0.076  0.075  0.074  0.072  0.071  0.069  0.068
1.5   0.067  0.066  0.064  0.063  0.062  0.061  0.059  0.058  0.057  0.056
1.6   0.055  0.054  0.053  0.052  0.051  0.049  0.048  0.048  0.046  0.046
1.7   0.045  0.044  0.043  0.042  0.041  0.040  0.039  0.038  0.038  0.037
1.8   0.036  0.035  0.034  0.034  0.033  0.032  0.031  0.031  0.030  0.029
1.9   0.029  0.028  0.027  0.027  0.026  0.026  0.025  0.024  0.024  0.023
2.0   0.023  0.022  0.022  0.021  0.021  0.020  0.020  0.019  0.019  0.018
2.1   0.018  0.017  0.017  0.017  0.016  0.016  0.015  0.015  0.015  0.014
2.2   0.014  0.014  0.013  0.013  0.013  0.012  0.012  0.012  0.011  0.011
2.3   0.011  0.010  0.010  0.010  0.010  0.009  0.009  0.009  0.009  0.008
2.4   0.008  0.008  0.008  0.008  0.007  0.007  0.007  0.007  0.007  0.006
2.5   0.006  0.006  0.006  0.006  0.006  0.005  0.005  0.005  0.005  0.005
2.6   0.005  0.005  0.004  0.004  0.004  0.004  0.004  0.004  0.004  0.004
2.7   0.003  0.003  0.003  0.003  0.003  0.003  0.003  0.003  0.003  0.003
2.8   0.003  0.002  0.002  0.002  0.002  0.002  0.002  0.002  0.002  0.002
2.9   0.002  0.002  0.002  0.002  0.002  0.002  0.002  0.001  0.001  0.001
3.0   0.001
Table A.3. Areas in two tails (≥ +z plus ≤ −z)

z     0.00   0.01   0.02   0.03   0.04   0.05   0.06   0.07   0.08   0.09
0.0   1.000  0.992  0.984  0.976  0.968  0.960  0.952  0.944  0.936  0.928
0.1   0.920  0.912  0.904  0.897  0.889  0.881  0.873  0.865  0.857  0.849
0.2   0.841  0.834  0.826  0.818  0.810  0.803  0.795  0.787  0.779  0.772
0.3   0.764  0.757  0.749  0.741  0.734  0.726  0.719  0.711  0.704  0.697
0.4   0.689  0.682  0.674  0.667  0.660  0.653  0.646  0.638  0.631  0.624
0.5   0.617  0.610  0.603  0.596  0.589  0.582  0.575  0.569  0.562  0.555
0.6   0.549  0.542  0.535  0.529  0.522  0.516  0.509  0.503  0.497  0.490
0.7   0.484  0.478  0.472  0.465  0.459  0.453  0.447  0.441  0.435  0.430
0.8   0.424  0.418  0.412  0.407  0.401  0.395  0.390  0.384  0.379  0.373
0.9   0.368  0.363  0.358  0.352  0.347  0.342  0.337  0.332  0.327  0.322
1.0   0.317  0.312  0.308  0.303  0.298  0.294  0.289  0.285  0.280  0.276
1.1   0.271  0.267  0.263  0.258  0.254  0.250  0.246  0.242  0.238  0.234
1.2   0.230  0.226  0.222  0.219  0.215  0.211  0.208  0.204  0.201  0.197
1.3   0.194  0.190  0.187  0.184  0.180  0.177  0.174  0.171  0.168  0.165
1.4   0.162  0.159  0.156  0.153  0.150  0.147  0.144  0.142  0.139  0.136
1.5   0.134  0.131  0.129  0.126  0.124  0.121  0.119  0.116  0.114  0.112
1.6   0.110  0.107  0.105  0.103  0.101  0.099  0.097  0.095  0.093  0.091
1.7   0.089  0.087  0.085  0.084  0.082  0.080  0.078  0.077  0.075  0.073
1.8   0.072  0.070  0.069  0.067  0.066  0.064  0.063  0.061  0.060  0.059
1.9   0.057  0.056  0.055  0.054  0.052  0.051  0.050  0.049  0.048  0.047
2.0   0.046  0.044  0.043  0.042  0.041  0.040  0.039  0.038  0.038  0.037
2.1   0.036  0.035  0.034  0.033  0.032  0.032  0.031  0.030  0.029  0.029
2.2   0.028  0.027  0.026  0.026  0.025  0.024  0.024  0.023  0.023  0.022
2.3   0.021  0.021  0.020  0.020  0.019  0.019  0.018  0.018  0.017  0.017
2.4   0.016  0.016  0.016  0.015  0.015  0.014  0.014  0.014  0.013  0.013
2.5   0.012  0.012  0.012  0.011  0.011  0.011  0.010  0.010  0.010  0.010
2.6   0.009  0.009  0.009  0.009  0.008  0.008  0.008  0.008  0.007  0.007
2.7   0.007  0.007  0.007  0.006  0.006  0.006  0.006  0.006  0.005  0.005
2.8   0.005  0.005  0.005  0.005  0.005  0.004  0.004  0.004  0.004  0.004
2.9   0.004  0.004  0.004  0.003  0.003  0.003  0.003  0.003  0.003  0.003
3.0   0.003
Table A.4. Critical values of t required for certain P values, according to number of degrees of freedom (df)

P value
One-tailed   0.25    0.1     0.05    0.025   0.01    0.005   0.0025  0.001   0.0005
Two-tailed   0.5     0.2     0.1     0.05    0.02    0.01    0.005   0.002   0.001
df
1            1.000   3.078   6.314
2            0.816   1.886   2.920
3            0.765   1.638   2.353
4            0.741   1.533   2.132
5            0.727   1.476   2.015   2.571   3.365   4.032   4.773   5.893   6.869
6            0.718   1.440   1.943   2.447   3.143   3.707   4.317   5.208   5.959
7            0.711   1.415   1.895   2.365   2.998   3.499   4.029   4.785   5.408
8            0.706   1.397   1.860   2.306   2.896   3.355   3.833   4.501   5.041
9            0.703   1.383   1.833   2.262   2.821   3.250   3.690   4.297   4.781
10           0.700   1.372   1.812   2.228   2.764   3.169   3.581   4.144   4.587
11           0.697   1.363   1.796   2.201   2.718   3.106   3.497   4.025   4.437
12           0.695   1.356   1.782   2.179   2.681   3.055   3.428   3.930   4.318
13           0.694   1.350   1.771   2.160   2.650   3.012   3.372   3.852   4.221
14           0.692   1.345   1.761   2.145   2.624   2.977   3.326   3.787   4.140
15           0.691   1.341   1.753   2.131   2.602   2.947   3.286   3.733   4.073
16           0.690   1.337   1.746   2.120   2.583   2.921   3.252   3.686   4.015
17           0.689   1.333   1.740   2.110   2.567   2.898   3.222   3.646   3.965
18           0.688   1.330   1.734   2.101   2.552   2.878   3.197   3.610   3.922
19           0.688   1.328   1.729   2.093   2.539   2.861   3.174   3.579   3.883
20           0.687   1.325   1.725   2.086   2.528   2.845   3.153   3.552   3.850
21           0.686   1.323   1.721   2.080   2.518   2.831   3.135   3.527   3.819
22           0.686   1.321   1.717   2.074   2.508   2.819   3.119   3.505   3.792
23           0.685   1.319   1.714   2.069   2.500   2.807   3.104   3.485   3.767
24           0.685   1.318   1.711   2.064   2.492   2.797   3.091   3.467   3.745
25           0.684   1.316   1.708   2.060   2.485   2.787   3.078   3.450   3.725
26           0.684   1.315   1.706   2.056   2.479   2.779   3.067   3.435   3.707
27           0.684   1.314   1.703   2.052   2.473   2.771   3.057   3.421   3.690
28           0.683   1.313   1.701   2.048   2.467   2.763   3.047   3.408   3.674
29           0.683   1.311   1.699   2.045   2.462   2.756   3.038   3.396   3.659
30           0.683   1.310   1.697   2.042   2.457   2.750   3.030   3.385   3.646
40           0.681   1.303   1.684   2.021   2.423   2.704   2.971   3.307   3.551
60           0.679   1.296   1.671   2.000   2.390   2.660   2.915   3.232   3.460
120          0.677   1.289   1.658   1.980   2.358   2.617   2.860   3.160   3.373
∞            0.674   1.282   1.645   1.960   2.326   2.576   2.807   3.090   3.291

Source: Pearson ES, Hartley HO (eds) (1976) Biometrika tables for statisticians, vol. I. Biometrika Trust, London, p 146.
10
11
0
0
1
0
1
2
3
0
2
3
5
0
2
4
6
1
3
5
8
1
3
6
9
1
4
7
11
5
8
12
10
13
15
12
14
17
20
24
27
nl
2
3
4
5
6
7
8
9
10
11
15
18
21
11
12
13
14
15
16
17
18
19
20
to
14
15
16
17
18
19
20
3
7
12
18
3
8
14
19
3
9
15
20
4
9
16
22
4
10
17
13
2 2
6 7
10 11
15 16
4
11
18
25
16
19
23
27
31
17
21
26
30
34
19
24
28
33
37
21
26
31
36
41
23
28
33
39
44
25
30
36
42
48
26
33
39
45
51
28
35
41
48
55
34
38
42
42 46
47 51
51 56
61
50
55
61
66
54
60
65
71
77
83
57 61 65
64 68 72
70 75 80
77 82 87
83 88 94
89 95 101
96 102 109
109 116
123
12
2
5
9
13
72
23
30
37
44
51
58
32
39
47
54
62
69
77
84
92
100
107
115
123
130
138
nz
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
15
16
17
18
19
20
1
5
10
14
1
6
11
15
1
6
11
17
2
7
12
18
2
7
13
19
2
8
12
5
9
13
20
14
16
20
24
28
33
17
22
26
31
36
19
24
29
34
39
21
26
31
37
42
22
28
34
39
45
24
30
36
42
48
25
32
38
45
52
27
34
41
48
55
33
37
37
41
45
40
45
50
55
44
49
54
59
64
47
53
59
64
70
75
51
57
63
67
75
81
87
55 58 62
61 65 69
67 72 76
74 78 83
80 85 90
86 92 98
93 99 105
99 106 112
113 119
127
10
11
12
13
0
1
2
1
2
3
1
3
5
0
2
4
6
0
2
4
7
0
3
5
8
0
3
6
9
1
4
7
4
8
11
6
8
8
10
13
10
12
15
17
11
13
16
19
23
26
nl
2
3
4
5
14
14
17
20
23
30
18
22
26
29
13
10
0
2
0
1
3
0
2
4
1
3
5
4
6
6
8
10
7
9
11
14
14
15
16
17
18
19
20
0 0
2 2
5 6
9 10
0
3
7
11
0
3
7
12
0
4
8
13
0
4
9
14
1
4
9
15
5
10
16
13
15
19
24
28
33
16
21
26
31
36
18
23
28
33
38
19
24
30
36
41
20
26
32
38
44
22
28
34
40
47
34
38
43
47
37
42
47
51
56
41
44
49
55
60
66
71
77
47
53
59
65
70
76
82
88
11
12
1
3
6
4
7
2
5
8
8
11
13
16
19
9
12
15
18
22
11
14
17
21
24
12
16
20
23
27
25
28
31
31
35
39
13
nl
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
0
1
17
22
26
30
46
51
56
61
66
50 53
56 60
63 67
69 73
75 80
82 87
88 93
94 100
101 107
114
Source: SmartJV (1963) Elements of medical statistics. Charles C. Thomas, Springfield, MA,
pp 125-127.
Table A.6. Critical values of sums of ranks for certain P values, according to sample size (n) in each of two pair-matched groups (for Wilcoxon signed rank test). Entry before comma represents maximum value for lower sum; entry after comma is minimum value for higher sum

n (number    P value
of pairs)    One-tailed   0.025      0.01       0.005
             Two-tailed   0.05       0.02       0.01
6                         0, 21
7                         2, 26      0, 28
8                         3, 33      1, 35      0, 36
9                         5, 40      3, 42      1, 44
10                        8, 47      5, 50      3, 52
11                        10, 56     7, 59      5, 61
12                        13, 65     9, 69      7, 71
13                        17, 74     12, 79     9, 82
14                        21, 84     15, 90     12, 93
15                        25, 95     19, 101    15, 105
16                        29, 107    23, 113    19, 117
17                        34, 119    28, 125    23, 130
18                        40, 131    32, 139    27, 144
19                        46, 144    37, 153    32, 158
20                        52, 158    43, 167    37, 173
21                        58, 173    49, 182    42, 189
22                        66, 187    55, 198    48, 205
23                        73, 203    62, 214    54, 222
24                        81, 219    69, 231    61, 239
25                        89, 236    76, 249    68, 257
Table A.7. Critical values of χ² required for certain P values (two-tailed only), according to number of degrees of freedom (df)

      Two-tailed P value
df    0.10    0.05    0.01    0.001
1     2.71    3.84    6.63    10.83
2     4.61    5.99    9.21    13.82
3     6.25    7.81    11.34   16.27
4     7.78    9.49    13.28   18.47
5     9.24    11.07   15.09   20.52
6     10.64   12.59   16.81   22.46
7     12.02   14.07   18.48   24.32
8     13.36   15.51   20.09   26.13
9     14.68   16.92   21.67   27.88
10    15.99   18.31   23.21   29.59
11    17.28   19.68   24.73   31.26
12    18.55   21.03   26.22   32.91
13    19.81   22.36   27.69   34.53
14    21.06   23.68   29.14   36.12
15    22.31   25.00   30.58   37.70
16    23.54   26.30   32.00   39.25
17    24.77   27.59   33.41   40.79
18    25.99   28.87   34.81   42.31
19    27.20   30.14   36.19   43.82
20    28.41   31.41   37.57   45.32
21    29.62   32.67   38.93   46.80
22    30.81   33.92   40.29   48.27
23    32.01   35.17   41.64   49.73
24    33.20   36.42   42.98   51.18
25    34.38   37.65   44.31   52.62
Table A.8. Critical values of Spearman rank correlation coefficient ($r_s$) for certain P values, according to sample size (n)

n (number    P value
of pairs)    One-tailed   0.025   0.005
             Two-tailed   0.05    0.01
6                         0.886   1.000
7                         0.786   0.929
8                         0.738   0.881
9                         0.683   0.833
10                        0.648   0.794
11                        0.623   0.818
12                        0.591   0.780
13                        0.566   0.745
14                        0.545   0.716
15                        0.525   0.689
16                        0.507   0.666
17                        0.490   0.645
18                        0.476   0.625
19                        0.462   0.608
20                        0.450   0.591
21                        0.438   0.576
22                        0.428   0.562
23                        0.418   0.549
24                        0.409   0.537
25                        0.400   0.526
26                        0.392   0.515
27                        0.385   0.505
28                        0.377   0.496
29                        0.370   0.487
30                        0.364   0.478
Subject Index
selection of cases 94
selection of controls 94-95
Causality
Can it? 260-264
causal networks 259
causal paths 54, 258-259
definition 254-256
Did it? 260, 265-268
multifactorial 257
necessary cause 256-257
sufficient cause 256-257
weighing the evidence for 261-264
Will it? 260, 264-265
Central Limit Theorem 146-147, 159
Chi-square (X2) test
continuity correction 171-172, 174
for comparing three or more proportions
183-185
for comparing two proportions 167-178
for linear trend 184-185
for r × c contingency tables 185-186
for testing a single proportion 181-182
for testing goodness of fit 182
Mantel-Haenszel 176-177
McNemar 174-175
Clinical trials
advantages and disadvantages 91
assignment of treatment 80-84
crossover design 84
definition 6, 45, 58, 78
double-blinding 85
ethics 88-91
informed consent 89
parallel design 84
placebo treatment 79
randomization 55, 80-84
sequential design 90
treatment compliance 79
Coefficient of variation 128
Cohort analysis 116
Cohort effect 116
Cohort studies
advantages and disadvantages 76
analysis of results 63-69
attributable risk 65-67
baseline state 59
bias assessment and control 69-74
definition 40, 42-44
exposure 59-60
follow-up 61-62
matched 70-72
outcome 62
relative risk 65-69
sample selection 58-59
Conditional probability 144,213-216,
226-228, 267
Confidence interval
for correlation coefficient 194-195
for difference in means 154, 156
for difference in proportions 179
for odds ratio 175-176
for relative risk 175-176
for single mean 148-149
for single proportion 183
for survival rate 247-248
in statistical inference 122, 123
Confounding (see Bias)
Correlation coefficient 189-190, 192-194,
196-197
Cost-benefit analysis 7, 233-234
Cost-effectiveness analysis 7,234-235
Critical ratio 147
Cross-sectional studies
advantages and disadvantages 116-117
analysis of results 114-115
bias assessment and control 115
definition 40, 42-45
exposure 114
outcome 113-114
"pseudo-cohort" studies 115-116
sample selection in 113
Data sources 17-23
Decision analysis
decision trees 222-225
probabilities 225-228
sensitivity analysis 230-232
threshold analysis 232-233
utilities 228-231
vs other decision strategies 220-222
Degrees of freedom 127, 147-148, 170
Descriptive statistics 122, 124-136
Diagnostic tests
Bayes' theorem applied to 213-216
case-finding 217
criteria for "normal" and "abnormal" results
201-205
lead-time bias 217-218
predictive value 211-216
reproducibility 205
screening 217-218
sensitivity 205-210
specificity 205-210
spectrum and bias in interpretation of
208-210
validity 205
zero-time shift 217-218
Directionality (in research design) 39-40
Discriminant function analysis 177
Dose-response effect 60, 104-105, 184-185,
263
Ecological fallacy 23
Effect modification 74-76, 110, 164, 177-178,
196,256-257
Effectiveness 86-87
Efficacy 86-87
Epidemiology
classical 4
clinical 4-8
definition 3
history 8-10
Etiologic fraction 67, 105-106, 266
Experimental studies, definition 38
Exposure, definition 38-39, 59, 254
F-test 162
Factorial design 163
Fisher exact test 172-174
Frequency distributions 124-125, 129-135
Gaussian distribution (see Normal distribution)
Groups
dynamic 29-30, 65
fixed 27,65
Hawthorne effect 87-88
Hazard 243, 246, 252-253
Histogram 124-125
Hypergeometric distribution 172
Hypothesis testing 137-145
Incidence
density rate 30, 65
density ratio 68, 102
rate 27-31
Informed consent 89
Interaction (see Effect modification)
Latent period 30,40,61,68,76, 80,95, 114,
115,239-240
Life-table (survival) analysis
actuarial method 240-244
confidence intervals in 247-248
Kaplan-Meier (product-limit) method 245-247
log-rank test for comparing two groups 251-252
proportional hazards model 252-253
survival curve 244, 246, 248-253
vs other methods of analysis 237-240
z-test for comparing two groups 248-251
Likelihood ratio (see Bayes' theorem)
Linear correlation 187-190, 192-193
Linear regression 191-193, 195-196
Nonparametric tests
definition 159
Mann-Whitney U-test 160-161
rank (Spearman) correlation 196-197
sign test 162
Wilcoxon signed rank test 161-162
Normal distribution (also see z-Distribution) 129-130, 201-204
Null hypothesis 137-144
Sample size
definition 126
for comparing two means 157-159
for comparing two proportions 179-181
Scatter diagram 187-188
Sensitivity (see Diagnostic tests)
Sign test 162
Simpson's paradox 74
Skewness 126-127, 159
Spearman's rho 196-197
Specificity (see Diagnostic tests)
Standard deviation 127-130
Standard error of the mean 128, 146
Standard normal distribution (see z-Distribution)
Standardization (see Rates)
Statistical inference 122, 137-145
Statistical power 142-143, 157, 180
Statistical significance 138-141, 157-159, 263
Stratification 32-35, 56, 72-75, 83, 108-110, 164, 176-177, 252
Survival analysis (see Life-table analysis)
t-Distribution 147-148
t-Test
for correlation coefficient 194
one-sample t-test 149-150
paired t-test 154-156
two-sample t-test 150-154
Target population 37-38, 47-52, 121
Timing (in research design) 39, 42, 44
Treatment (see Exposure)
Two-by-two (fourfold) tables
matched 70-72, 108, 166-174
stratified 72-75, 108-110, 176-178
unmatched 64-66, 97-104, 174-175
Type I error 139-142, 157-159, 180-181
Type II error 141-143, 157-159, 179-181
Validity (also see Measurement)
external validity 48-49
internal validity 48-49
of diagnostic tests 205-206
of exposure-outcome associations 37, 47
of individual measurements 13-16
Variables
definition 11
dependent 189-190
independent 189-190
types of 11, 121
Variance 127, 129, 151
Vital statistics 21-22, 31
Wilcoxon signed rank test 161-162
z-Distribution 130-133
z-Test 150, 157, 248-251
Zero time 61, 217-218, 240, 242