Critical Thinking in Clinical Research
Applied Theory and Practice Using Case Studies
Edited by Felipe Fregni and Ben M. W. Illigens
Oxford University Press is a department of the University of Oxford. It furthers
the University’s objective of excellence in research, scholarship, and education
by publishing worldwide. Oxford is a registered trade mark of Oxford University
Press in the UK and certain other countries.
Printed by WebCom, Inc., Canada
This material is not intended to be, and should not be considered, a substitute for medical or
other professional advice. Treatment for the conditions described in this material is highly
dependent on the individual circumstances. And, while this material is designed to offer
accurate information with respect to the subject matter covered and to be current as of the
time it was written, research and knowledge about medical and health issues are constantly
evolving and dose schedules for medications are being revised continually, with new side
effects recognized and accounted for regularly. Readers must therefore always check the
product information and clinical procedures with the most up-to-date published product
information and data sheets provided by the manufacturers and the most recent codes of
conduct and safety regulation. The publisher and the authors make no representations or
warranties to readers, express or implied, as to the accuracy or completeness of this material.
Without limiting the foregoing, the publisher and the authors make no representations or
warranties as to the accuracy or efficacy of the drug dosages mentioned in the material.
The authors and the publisher do not accept, and expressly disclaim, any responsibility for any
liability, loss or risk that may be claimed or incurred as a consequence of the use and/or
application of any of the contents of this material.
To Lucca Fregni, the light of our lives.
Felipe Fregni
Contributors ix
3. Study Population 45
Sandra Carvalho and Felipe Fregni
5. Randomization 87
Juliana C. Ferreira, Ben M. W. Illigens, and Felipe Fregni
6. Blinding 105
Rita Tomás and Joseph Massaro
Index 493
CONTRIBUTORS
The whole history of science has been the gradual realization that events do not happen
in an arbitrary manner, but that they reflect a certain underlying order, which may or may
not be divinely inspired.
—Stephen W. Hawking
INTRODUCTION
The search for knowledge about ourselves and the world around us is a fundamental
human endeavor. Research is a natural extension of this desire to understand and to
improve the world in which we live.
This chapter focuses on the process of clinical trials, ethical issues involved in the
history of clinical research, and other issues that may be unique to clinical trials. As
clinical trials are, perhaps, the most regulated type of research—subject to provin-
cial, national, and international regulatory bodies—reference will be made to these
regulations where appropriate.
The scope of research is vast. On the purely physical side, it ranges from seeking
to understand the origins of the universe down to the fundamental nature of matter.
At the analytic level, it covers mathematics, logic, and metaphysics. Research
involving humans ranges widely, including attempts to understand the broad sweep
of history, the workings of the human body and the body politic, the nature of human
interactions, and the impact of nature on humans—the list is as boundless as the
human imagination.
CLINICAL RESEARCH
Clinical research is a branch of medical science that determines the safety and effec-
tiveness of medications, devices, diagnostic products, nutrition or behavioral changes,
* The first three authors contributed equally to the work.
and treatment regimens intended for human use. Clinical research is a structured pro-
cess of investigating facts and theories and exploring connections. It proceeds in a sys-
tematic way to examine clinical conditions and outcomes, to establish relationships
among clinical phenomena, to generate evidence for decision-making, and to provide
the impetus for improving methods of practice.
Clinical trials are a set of procedures in medical research and drug develop-
ment that are conducted to allow safety and efficacy data to be collected for health
interventions. Given that clinical trials are experiments conducted on humans, there
is a series of required procedures and steps for conducting a clinical trial. Clinical
trials address several goals, including whether the drug, therapy, or
procedure is safe and effective for people to use. The overall purpose of a clinical trial
is acquisition of new knowledge, not the treatment of patients per se.
A clinical trial, also known as patient-oriented research, is any investigation involving
participants that evaluates the effects of one or more health-related interventions on health
outcomes. Interventions include, but are not restricted to, drugs, radiopharmaceuticals,
cells and other biological products, surgical procedures, radiologic procedures, devices,
genetic therapies, natural health products, process-of-care changes, preventive care,
manual therapies, and psychotherapies. Clinical trials may also include questions that are
not directly related to therapeutic goals—for example, drug metabolism—in addition to
those that directly evaluate the treatment of participants.
Clinical trials are most frequently undertaken in biomedical research, although re-
search that evaluates interventions, usually by comparing two or more approaches, is
also conducted in related disciplines, such as psychology. The researcher leading a clin-
ical trial is often (but not always) a clinician, that is, a health-care provider (e.g., physi-
cian, dentist, naturopath, physiotherapist, etc.). Although various types and forms of
clinical trials have methodological differences, the ethical principles and procedures
are the same and are applicable to all.
2737 bce
Shen Nung, legendary emperor of China, is considered the father of Chinese medi-
cine. In addition to being credited with the technique of acupuncture, he purportedly
experimented with and classified hundreds of poisonous and medicinal herbs, which he
tested in a series of studies on himself.
Approximately 600 bce
The first experiment that could be considered a trial can be found in the Old
Testament. The book of Daniel describes how under King Nebuchadnezzar II,
children of royal blood and certain children from conquered Israel were recruited
to be trained as the king’s advisors over a period of three years, during which they
would be fed from the king’s meat and wine. Daniel, however, requested
from the officer in charge of the diet that he and three other Hebrew children would
be allowed to have only legumes and water. When the officer expressed concerns
about the “inferior” diet, Daniel suggested a 10-day trial period, after which the of-
ficer would assess both groups of children. At the end of this “pilot study,” Daniel’s
group was noticeably healthier than the group of children who were relegated to the
diet of wine and meat. Therefore Daniel and the other three children were permitted
to continue with their diet for the entire training period, after which they displayed
superior wisdom and understanding compared to all other advisors of the king (Old
Testament, Daniel 1:5–20).
1537
Ambroise Paré, a French surgeon during the Renaissance, accidentally carried out a
clinical study when he ran out of elderberry oil, which, after being boiled, was used as
the standard treatment for gunshot wounds at that time. He then used a mixture of egg
yolk, turpentine (a pine tree–derived oil), and rose oil instead, and he soon noticed
that patients treated with this mixture had less pain and better wound healing than
those patients who had received the standard treatment [1].
1747
The first reported systematic experiment of the modern era was conducted by James
Lind, a Scottish physician, when he was sailing on the Salisbury. After many of the
seamen developed signs of scurvy, Lind selected 12 similar sick sailors and split them
into six groups of two. All groups were given the same diet, but each group was treated
differently for scurvy: group 1 received cider, group 2 vitriol (sulfuric acid), group 3
vinegar, group 4 seawater, group 5 oranges and a lemon, and group 6 nutmeg and barley
water (British herbal tea). The group that had the fruits recovered from scurvy within
just six days. Of the other treatments, vinegar “showed the best effect” [James Lind,
A Treatise of the Scurvy (Edinburgh, 1753)]. (Lind’s experiment had little
short-term impact; he was reluctant to believe in fruits being a sole remedy for scurvy,
and citrus fruits were difficult to preserve and also expensive. It wasn’t until 1790 that
the Royal Navy made fresh lemons a standard supplement. In 1932 the link between
vitamin C and scurvy was finally proven.)
1863
Austin Flint (US physician, graduate of Harvard Medical School, class of
1833) is revered for having conducted the first study with a placebo. A placebo
is considered a substance or procedure with no therapeutic effect. In 1863 Flint
tested a placebo on prisoners with rheumatic fever and compared their response
to the response of patients who had received an active treatment (although not
in the same trial). (The Austin Flint murmur is a murmur associated with aortic re-
gurgitation. This trial is problematic both ethically [research conducted on
a vulnerable population] and methodologically [placebo and active treatment were
not tested at the same time, and the active treatment for rheumatic fever was of
questionable efficacy].)
1906
The 1906 Pure Food and Drug Act imposed purity standards on products and drugs
and mandated accurate labeling with content and dose.
1923
The idea of randomization was introduced to clinical trials in 1923. Randomization
means that each participant is assigned by chance to one of the treatments under
study, for example either a placebo or the new drug.
1943
Blind clinical trials—in which participants do not know which treatment they are
receiving—also emerged in the twentieth century. The first double-blind controlled
trial—patulin for the common cold—was conducted, and the first widely published
randomized clinical trial, of streptomycin as a treatment for pulmonary tuberculosis,
followed in 1948 [2].
1944
Multicenter clinical trials were introduced, in which the same study is conducted
at multiple sites under a common protocol, providing wider testing, better generaliza-
bility, and stronger statistical data.
1947
The Nuremberg Code was developed, which outlines 10 basic statements for the pro-
tection of human participants in clinical trials.
1964
The Declaration of Helsinki was developed, which outlines ethical codes for physicians
and for the protection of participants in clinical trials worldwide.
1988
The US Food and Drug Administration (FDA) was given more authority and ac-
countability over the approval of new drugs and treatments.
1990
The International Conference on Harmonization (ICH) was assembled to help elim-
inate differences in drug-development requirements for three global pharmaceutical
markets: the European Union, Japan, and the United States. The ICH initiatives pro-
mote increased efficiency in the development of new drugs, improving their availa-
bility to patients and the public.
2000
A Common Technical Document (CTD) was developed. The CTD acts as a standard
dossier used in Europe, Japan, and the United States for submitting data gathered in
clinical trials to the respective regulatory authorities.
The American Medical Association (AMA) was the first to publicly announce the toxicity
of the new compound and to warn physicians and patients against its lethal effects.
The S. E. Massengill Company was also notified; the company then sent telegrams
to distributors, pharmacists, and physicians, asking them to return the product, but
failed to explain the reason for the request, thus understating the urgency of the situa-
tion and the lethal effects of the product. At the request of the FDA, the company was
forced to send out a second announcement, which was clear about the toxicity of the
product and the importance of the situation.
The next step taken by the FDA was to make sure all of the products were returned
safely; to do so, they had to locate all of the stores, dispensers, pharmacists, physicians,
and buyers. This proved to be a difficult task: many of the company’s salesmen were
not willing to help by providing the information required to locate the recipients of
the Elixir; pharmacies had no clear record of buyers; and many physicians didn’t keep
documentation of the patients to whom the compound was prescribed, nor their
addresses. Some physicians decided to abstain from helping authorities and lied about
their prescription trail, afraid that they could be held liable for prescribing the med-
ication. In spite of these circumstances, the relentless efforts of the FDA and local
authorities, as well as the help of the AMA and the media, allowed for the recovery
of 234 of the 240 gallons of the drug that had been distributed. In several cases, legal
action through federal seizure was required. The FDA had to rely on the compound’s
brand name, “Elixir,” to file federal charges that would allow it to complete its
task. The misbranding charge was brought against the company for distributing a com-
pound as an elixir, implying it was dissolved in alcohol, when it was actually dissolved
in ethylene glycol.
The victims were many, including young children— most sick with throat
infections—young workers, older patients, mothers, and fathers. Most of the victims
would suffer the effects of the substance for 10–21 days before succumbing. The
symptoms were mainly those of severe renal failure and included
oliguria, edema, nausea, vomiting, abdominal pain, and seizures.
Thalidomide
Thalidomide is a derivative of glutamic acid, first synthesized in Europe by a Swiss
pharmaceutical firm in 1953, without commercial success. The German company Chemie
Grünenthal then marketed it in 1957 as an anticonvulsant. Given its sedative effects,
it was also commercialized as a sleeping aid. It became a very popular medication,
considered effective and safe, and was highly sought and prescribed due to its lack of
apparent toxicity.
complete new studies without FDA approval or even subjects’ consent, and even
worse, after approval, all new data regarding the drug were considered private. All of
these situations clearly showed that pharmaceutical companies had the upper hand.
Tuskegee Study
This study, also known as “The Tuskegee Study of Untreated Syphilis in the
Negro Male,” was conducted in Alabama by the United States Public Health Service
(USPHS) and the Tuskegee Institute between 1932 and 1972 [13]. During this pe-
riod of time, hundreds of African-American males were denied proper, standard
care for syphilis so that researchers could document the natural course of the infec-
tion when left untreated. During the 40 years that the study took place, many of the
enrolled subjects, who came from a poor, rural area in Alabama, died of syphilis, and
many of their descendants were born with congenital syphilis. Directors, researchers,
and collaborators of the study observed the tragic effects of the disease, completely
indifferent to the suffering of their subjects, and even decided to continue their study
after penicillin was proven to effectively treat and cure the infection [13,14].
In 1928, scientists from Oslo published a study conducted in white males
with untreated syphilis that refuted the long-held belief that the effects of syphilis
depended on the race of the affected. It was thought that the infection had more se-
vere neurologic effects on people of Caucasian descent and more severe cardiovas-
cular effects on people of African descent, but the Oslo study showed that most of
the infected white males had severe involvement of the cardiovascular system, while
very few ever developed neurosyphilis [15]. This finding surprised physicians and
researchers in the United States and led them to plan and execute a similar study, to
be carried out in a population with a high prevalence of the infection. American
scientists chose the city of Tuskegee because 35%–45% of
the population in the area was seropositive for syphilis. The initial design proposed
observing the untreated subjects for a period of 6–8 months, after which the subjects
would be treated with the standard care: salvarsan and bismuth, both fairly effective
but toxic. The initial purpose of the study was to benefit the health of the poor popu-
lation enrolled, as well as to understand and learn more about the disease, its preven-
tion, and its cure. This led to the support of many of the local hospitals and physicians,
including African-American doctors and organizations [14,16].
Researchers initially enrolled 600 men, 201 as healthy controls and 399 seropositive
for syphilis but not yet aware of their diagnosis. The subjects came from Macon
County, Alabama, where the city of Tuskegee is located; they were mostly illit-
erate men, lured into the study by the promise of free medical care, free daily meals,
and US$50 for burial expenses. Throughout their participation
in the study, the participants were not informed of their infection status, nor did they
receive treatment. In many instances, researchers used deception to ensure cooperation
and avoid dropouts, including making the burial policy contingent on prior
authorization of an autopsy. After the initial allotted time for the study was completed,
many of the participating researchers decided it was necessary to continue the study
and obtain more clinical information. As the study continued, the economic crisis of
1929 deepened into the Great Depression, leading to the withdrawal of the main funding source. Researchers
thought this would mean the end of the experiment, as it would be impossible to afford
treatment for all participants, but soon they proposed continuing the study without
offering standard care to patients, leading to a complete deviation from the initial pro-
posal and to the resignation of one of the initial creators of the study, Dr. Taliaferro
Clark [13,17].
It is important to note that during the 40 years of the Tuskegee experiment, the
study was never kept secret, and many articles were published in medical journals
describing initial discoveries and important information obtained from the research
[18–21]. Despite its controversial and irregular techniques, many argued that the con-
tribution of this study to science far outweighed its detrimental effects on the health of
the studied population. One of the main contributions of the experiment was the de-
velopment of the Venereal Disease Research Laboratory (VDRL), a non-treponemal
diagnostic test now widely used. This mark of research and medical progress was in-
strumental in establishing the renown of the United States on the interna-
tional research scene, and it fed the ambitions of many of the
participating researchers [13,17].
By 1947, penicillin had long been established as the most effective treatment for
syphilis and it was widely used for such purpose, leading to a significant decrease in
the prevalence of the disease. Its efficacy was so clear that many even argued that syph-
ilis would be completely eradicated in the near future. Nonetheless, researchers of
the Tuskegee study continued to deny proper treatment to their subjects, and they
were specifically warned against the use of penicillin and carefully shielded from re-
ceiving any information regarding its benefits. By the end of the study period, only
74 of the original 399 men were alive; 128 of them had died of syphilis and related
complications, 40 of their wives had contracted the disease, and 19 children were
diagnosed with congenital syphilis [16,22]. The relentless ambition of the Tuskegee
researchers continued in spite of the establishment of the Nuremberg Code in 1947,
the Declaration of Helsinki in 1964, and the position of the Catholic Church urging
physicians and scientists to always respect patients, superseding all scientific or re-
search objectives [16,23].
In 1966, Peter Buxtun became the first researcher to raise ethical
concerns about the Tuskegee study. He warned the Division of Venereal Diseases about the
practices and techniques being used in the study, but the Centers for Disease Control
(CDC), now in charge of the experiment, argued for the importance of completing the
study and obtained the support of local and national medical associations. In 1972,
Buxtun took the story to the national press, which led to widespread outrage [16].
Congressional hearings were held, many researchers and physicians testified, and the
deplorable objectives and practices of the study were exposed. The CDC appointed a
review committee for the study, which finally determined that it was not justifiable from
an ethical and medical standpoint, leading to its termination.
and horrendous experiments that took place during this time led to the death of most
of the participants, but the few survivors were able to narrate their suffering and leave
a record for history.
This period of science should be discussed and described in the light of the his-
torical and sociopolitical events that took place at the time. Mainly, we need to point
out two determining factors in order to understand the beginning of this period of
Nazi experimentation and the nature of such experiments: (1) the political structure
of Nazi Germany, based on a totalitarian system, and (2) the racial hygiene paradigm
that arose from both political and social movements at the time. The origin of the latter
preceded by roughly two decades that of the Nazi government; nonetheless, it was the
totalitarianism of the time that allowed for such an ideology to flourish and to give
rise to the scientific questions that were later addressed by researchers and physicians
of the Nazi regime. All of them took place in a setting in which no legal or ethical
boundaries existed, leading to the ideal conditions for such experimentation to take
place [26].
Based on these ideas, many German scientists and physicians, mainly geneticists,
found in the newly formed Nazi government the opportunity to put into practice
their theories and discoveries, while at the same time, the government found in the
researchers an opportunity to legitimize its political and social beliefs of racial superi-
ority. The research scenario was further darkened by the complete violation of the civil
rights of the Jewish population, who were then treated as freely available “guinea
pigs” for any research agenda. Resources were redirected to any scientific quest
that would improve the health of the superior race, and the focus of most research
programs became heredity and fitness [27]. The most striking aspect of all of the ex-
perimentation that took place in Nazi Germany is that there is no direct proof that
any of the researchers involved were forced to participate, or that any of the research
techniques and practices were merely imposed by the government [28,29].
The idea of biological inferiority led to unimaginable cruelty and disrespect for
the unconsenting subjects. It is impossible to name and include all of the examples of
such cruelty for the purpose of this review, but it is important to mention that most of
the experiments conducted in the concentration camps followed the strict guidelines
of clinical research of the time, some of them pursuing questions in accordance with
the scientific progress of the time, though some of them used obsolete or outdated
practices [26]. Some of the methods and results could even be considered innovative
and helpful, as certain experiments were later continued by the conquering armies,
aided by the same German physicians, but following the newly established laws of eth-
ical research. It is fair to say, however, that regardless of the results or objectives of each
study, their methods were always brutal, and researchers had a complete disregard for
human life and suffering [28]. The main justification for their actions was based on the
ideal of preserving the health and well-being of the population, at the same time that
new critical knowledge was gained from such endeavors.
Response: The Nuremberg Code
After the end of World War II, all of the participants and collaborators of the Nazi gov-
ernment were brought to trial. The judgment of Nazi physicians in Nuremberg, known
as the Nuremberg trial, is considered the precipitating event for the start of modern
research ethics. From this trial, the founding principles of ethical research were estab-
lished under the Nuremberg Code. It outlined 10 critical principles for the conduct
of research with human subjects [30]:
be associated with the problem. Again, the company denied the reports and dis-
missed them as a cheap attempt to discredit a perfectly safe drug. Due to public pressure,
the distribution of the medication was halted in Germany, but continued in other
countries. Only when the news arrived of birth deformities did each country estab-
lish restrictions on the medication. Canada was the last country to stop the sale of
thalidomide in 1962.
It is estimated that from the late 1950s to the early 1960s, over 13,000 babies were
born with severe deformities, including phocomelia, secondary to the use of thalidomide
during pregnancy. Many of them died soon after birth, but others lived long lives,
many surviving beyond 2010.
Phase II: Its main goal is to obtain preliminary data on the efficacy of the medica-
tion on a given population. Usually, the study will be conducted in a group of
diseased patients, which can range from a dozen to 300. The study should be
controlled, meaning that the diseased population receiving the new drug being
studied has to be compared to a control diseased population receiving either
placebo or any standard medication available. This phase continues to evaluate
for drug safety and short-term side effects.
Phase III: If evidence of effectiveness is shown in phase II studies, then the process
can continue to phase III. The main goal of this phase is to assess effectiveness
and safety. For this purpose, larger study populations should be evaluated and
“real-life” conditions emulated, in order to assess the behavior of the drug when
given at different doses, in heterogeneous populations or compared against the
standard of care. The number of patients can range from several hundred to
3,000–10,000.
Phase IV: This phase is also known as post-marketing surveillance. It takes place after
the drug has been approved by the FDA and has been put in the market.
Post-marketing surveillance and commitment studies allow the FDA to
collect further information on safety, efficacy, and tolerability profile of any
given drug.
After this terrible incident, which shut down human research at UPenn for several
months and led to a detailed investigation, UPenn paid the parents an amount of
money in settlement. Both the university and the principal investigator (PI) had se-
rious financial stakes.”
Finishing his thoughts, Prof. Geegs concluded, “The Gelsinger case was an impor-
tant setback for phase I studies in gene therapy.”
Dr. Stevenson, thinking about his family with Li-Fraumeni syndrome, said, “The
thought of benefits at any cost has brought up terrible lessons for humankind such
as the Tuskegee Study in 1932 [a syphilis experiment conducted between 1932 and
1972 in Tuskegee, Alabama, by the US Public Health Service in which impoverished
African-Americans with syphilis were recruited in order to study the natural progres-
sion of the untreated disease] or the thalidomide case in 1959 in Germany [a drug that
was used to inhibit morning sickness during pregnancy and resulted in thousands of
babies being born with abnormalities such as phocomelia]. We should not disregard
the issue of ethics and regulatory requirements in any phase of a drug trial, especially
phase I!”
Prof. Geegs looked at his watch and realized he was late to meet a group of
researchers from Japan who had come to visit his laboratory. He then wrapped up
the discussion, “Guys, let us continue this discussion tomorrow; and I also want you
to do a bit of research on the phases of a trial, so we can continue our discussion.”
Phases of a Trial
In the investigation of a new drug, sequences of clinical trials must be carried out. Each
phase of a trial seeks to provide different types of information about the treatment in
relation to dosage, safety, and efficacy of the investigational new drug (IND).
Preclinical research: Before using an IND in humans, tests should be taken in the
laboratory usually using animal models. If an IND shows good results in this phase,
then researchers are able to request permission to start studies in humans.
Phase I trial: The aim of this phase is to show that the IND is safe. Data are col-
lected on side effects, timing, and dosage. Usually the dosage is increased until a pre-
determined maximum is reached or adverse effects develop. It usually
requires a small sample size of subjects and it helps researchers to understand the
mechanism of action of a drug. Much of the pharmacokinetics and pharmacody-
namics of INDs are researched in this phase. Also during this phase, the drug is usually
tested in healthy subjects, except for some drugs such as oncologic and HIV drugs.
Phase II trial: Once an IND is found to be safe in humans, phase II trials focus on
demonstrating that it is effective. This is also done in relatively small sample sizes, in
studies often referred to as “proof-of-principle” studies. The response rate should be
at least the same as standard treatment to encourage further studies. These small trials
are usually placebo-controlled.
Phase III trial: Also referred to as pivotal studies, they represent large studies with
large samples and are usually (but not always) designed as a randomized, double-
blinded trial comparing the IND to the standard treatment and/or placebo. Successful
outcomes in two phase III trials would make a new drug likely to be approved by
the FDA.
Phase IV trial: Also referred to as post-marketing studies, in phase IV trials,
approved drugs are tested in other diseases and populations and usually in an open-
label fashion.
Our team at the University of Pennsylvania School of Medicine has recently reported
the first clinical test of a new gene therapy based on a disabled AIDS virus carrying
genetic material that inhibits HIV replication. In this first trial, we studied five subjects
with chronic HIV infection who had failed to respond to at least two antiretroviral
regimens, giving them a single infusion of their own immune cells that had been ge-
netically modified for HIV resistance. Viral loads of the patients remained
stable or decreased during the study, and one subject showed a sustained decrease
in viral load. T-cell counts remained steady or increased in four patients during the
nine-month trial. Additionally, in four patients, immune function specific to HIV
improved.” Prof. Geegs, who was extremely excited about these findings (and the ap-
proval for the paper’s publication in Proceedings of the National Academy of Sciences
(PNAS)), could not resist interrupting and added, “Overall, our results are significant,
because it is the first demonstration of safety in humans for a lentiviral vector (of which
HIV is an example) for any disease.” Although Dr. Wang was still jet-lagged from her
long trip to the United States, she added, “Thank you so much, Dr. Stevenson. In fact,
we appreciate the work of Prof. Geegs in Beijing and it is a wonderful opportunity to
be here in the lab. What is the next step now?” Prof. Geegs responded, “Our results are
good, but they are preliminary—meaning that we shall replicate it in a larger popula-
tion. We have much more work to do. In the study we are planning, each patient will
now be followed for 15 years.”
Dr. Stevenson filled in the details of this new study, “The new vector is a
lab-modified HIV that has been disabled to allow it to function as a ‘Trojan horse,’
carrying a gene that prevents new infectious HIV from being produced.” He con-
tinued, “Essentially, the vector puts a wrench in the HIV replication process. Instead
of chemical- or protein-based HIV replication blockers, this approach is genetic and
uses a disabled AIDS virus which carries an anti-HIV genetic payload. This approach
enables patients’ own T-cells, which are targets for HIV, to inhibit HIV replication—
via the HIV vector and its anti-viral cargo.”
Dr. Cameron, an extremely educated research fellow from Australia, then made a
comment, “I believe that it is wonderful to go in this direction instead of drugs only
as they have significant toxicity, but in the first trial, patients were still taking the
drug. Do you think patients would be able to stay off drugs with this gene therapy,
Prof. Geegs?”
Prof. Geegs liked to stimulate his fellows to think, and he asked Stevenson to
respond—which he quickly did, with a subtle smile, “That is an excellent point, which
is why, in this second trial using the new vector with HIV patients, we will select a
group of patients who are generally healthier and use six infusions rather than one—
we therefore want to evaluate the safety of multiple infusions and test the effect of
infusions on the patients’ ability to control HIV after removal of their anti-retroviral
drugs. The hope is that this treatment approach may ultimately allow patients to stay
off antiretroviral drugs for an extended period. This would be a great breakthrough for
this laboratory.”
Prof. Geegs quickly concluded, “But we should never forget the Gelsinger case as,
you know, fool me once, shame on you; fool me twice, shame on me . . . Our group should
then reflect on the ethical implications in this case. I want you guys thinking about
this subject tonight and send an email to the group with your conclusions. Looking
forward to hearing back from you!”
weekend and next Monday we will discuss this ethics issue again. By now, I just ask you
to reflect on combining ethics, benefits, and minimizing risks.”
CASE DISCUSSION
This case illustrates how ethical dilemmas can influence the design of any given study.
In particular, readers should pay special attention to the phases of a
study and to how to design a study of a novel intervention while keeping the safety of
subjects as the main concern. Readers also need to recognize that a clinical goal does
not translate directly into the design of a study; the clinician-scientist needs to wear a
different “hat” when designing and conducting a clinical study.
1. What challenges does Prof. Geegs face in choosing the next steps for his HIV
study?
2. What are Prof. Geegs’s main concerns?
3. What should he consider in making this decision?
FURTHER READING
Articles
Azeka E, Fregni F, Auler Junior JO. The past, present and future of clinical research. Clinics.
2011; 66(6): 931–2. [PMCID: PMC3129946] (Outstanding article in order to grasp the
outline of clinical research)
Bhatt A. Evolution of clinical research: a history before and beyond James Lind. Perspect Clin
Res. 2010 Jan–Mar; 1(1): 6–10. [PMCID: PMC3149409]
Blixen CE, Papp KK, Hull AL, et al. Developing a mentorship program for clinical researchers. J
Contin Educ Health Prof. 2007 Spring; 27(2): 86–93. [PMID: 17576629]
Drucker CB. Ambroise Paré and the Birth of the Gentle Art of Surgery. Yale J Biol Med. 2008
December; 81(4): 199–202. [PMCID: PMC2605308]
Glickman SW, McHutchison JG, Peterson ED, et al. Ethical and scientific implications of
the globalization of clinical research. N Engl J Med. 2009 Feb 19; 360(8): 816–823.
[PMID: 19228627]
Goffee R, Jones G. Managing authenticity: the paradox of great leadership. Harv Bus Rev. 2005
Dec; 83(12): 86–94, 153. [PMID: 16334584]
Herzlinger RE. Why innovation in health care is so hard. Harv Bus Rev. 2006 May; 84(5): 58–
66, 156. [PMID: 16649698]
Kottow MH. Should research ethics triumph over clinical ethics? J Eval Clin Pract. 2007 Aug;
13(4): 695–698. [PMID: 17683317]
Murgo AJ, Kummar S, Rubinstein L, Anthony JM, et al. Designing phase 0 cancer trials. Clin
Cancer Res. 2008 Jun 15; 14(12): 3675–3682. [PMCID: PMC2435428]
Umscheid CA, Margolis DJ, Grossman CE. Key concepts of clinical trials: a narrative review.
Postgrad Med. 2011 Sep; 123(5): 194–204.
Wilson JM. Lessons learned from the gene therapy trial for ornithine transcarbamylase defi-
ciency. Mol Genet Metab. 2009 Apr; 96(4): 151–157. [PMID: 19211285].
Books
Beauchamp TL, Childress JF. Principles of biomedical ethics, 7th ed. New York: Oxford University
Press; 2012. (7th edition of the first major American bioethics textbook, written by co-
author of the Belmont Report)
Gallin JI, Ognibene FP. Principles and practice of clinical research, 3rd ed. Burlington,
VT: Academic Press; 2012.
Groopman J. How doctors think. Boston, MA: Mariner Books; 2008.
Robertson D, Gordon HW. Clinical and translational science: principles of human research.
Burlington, VT: Academic Press; 2008.
Online
FDA CFRs for regulations in detail. www.fda.gov
http://cme.nci.nih.gov/
http://firstclinical.com/journal/2008/0806_IIR.pdf
http://ohsr.od.nih.gov/
http://www.availclinical.com/clinical-study/clinical-trials-history/
http://www.cioms.ch
http://www.fda.gov
http://www.james.com/beaumont/dr_life.htm
http://www.niehs.nih.gov/research/resources/bioethics/whatis/
http://www.wma.net
https://web.archive.org/web/20131021022424/http://www.innovation.org/index.cfm/InsideDrugDiscovery/Inside_Drug_Discovery
https://web.archive.org/web/20120526134445/http://www.nihtraining.com/cc/ippcr/current/downloads/HisClinRes.pdf
https://web.archive.org/web/20160401233046/http://www.cancer.org/treatment/treatmentsandsideeffects/clinicaltrials/whatyouneedtoknowaboutclinicaltrials/clinical-trials-what-you-need-to-know-phase0
REFERENCES
1. Bull JP. A study of the history and principles of clinical therapeutic trials. MD Thesis,
University of Cambridge, 1951. Available at http://jameslindlibrary.org/wp-data/
uploads/2014/05/bull-19511.pdf
2. Streptomycin Treatment of Pulmonary Tuberculosis. BMJ. 1948; 2(4582): 769–782.
3. Burley DM, Dennison TC, Harrison W. Clinical experience with a new sedative drug.
Practitioner. 1959; 183: 57–61.
4. Lasagna L. Thalidomide: a new nonbarbiturate sleep-inducing drug. J Chronic Dis. 1960;
11: 627–631.
5. Ghoreishni K. Thalidomide. In: Encyclopedia of toxicology. Burlington, VT: Academic
Press; 2014: 523–526.
6. Rice E. Dr. Frances Kelsey: Turning the thalidomide tragedy into Food and Drug
Administration reform. 2007. Available at http://www.section216.com/history/Kelsey.pdf
7. Kelsey FO. Inside story of a medical tragedy. U.S. News & World Report. 1962; 13: 54–55.
28. Roelcke V. Nazi medicine and research on human beings. Lancet. 2004; 364(Suppl 1): s6–7.
29. Roelcke V, Maio G. Twentieth century ethics of human subjects research. Stuttgart: Steiner; 2004.
30. Guraya SY, London NJM, Guraya SS. Ethics in medical research. Journal of Microscopy and
Ultrastructure. 2014; 2: 121–126.
31. Lenz W, Knapp K. Thalidomide embryopathy. Arch Environ Health. 1962; 5: 100–105.
32. National Institutes of Health. RFA-RM-07-007: Institutional Clinical and Translational
Science Award (U54). Mar 2007, Available at http://grants.nih.gov/grants/guide/rfa-
files/RFA-RM-07-007.html.
2
SELECTION OF THE RESEARCH QUESTION
The difficulty in most scientific work lies in framing the questions rather than in finding
the answers.
—Arthur E. Boycott (Pathologist, 1877–1938)
The grand aim of all science is to cover the greatest number of empirical facts by logical
deduction from the smallest number of hypotheses or axioms.
—Albert Einstein (Physicist, 1879–1955)
INTRODUCTION
The previous chapter provided the reader with an overview of the history of clinical
research, followed by an introduction to fundamental concepts of clinical research and
clinical trials. It is important to be aware of and to learn lessons from the mistakes of
past and current research in order to be prepared to conduct your own research. As
you will soon learn, developing your research project is an evolutionary process, and
research itself is a continuously changing and evolving field.
Careful conceptual design and planning are crucial for conducting a reproducible,
compelling, and ethically responsible research study. In this chapter, we will discuss
what should be the first step of any research project, that is, how to develop your own
research question. The basic process is to select a topic of interest, identify a research
problem within this area of interest, formulate a research question, and finally state
the overall research objectives (i.e., the specific aims that define what you want to
accomplish).
You will learn how to define your research question, starting from broad interests
and then narrowing these down to your primary research question. We will address
the key elements you will need to define for your research question: the study pop-
ulation, the intervention (x, independent variable[s]), and the outcome(s) (y,
dependent variable[s]). Later chapters in this volume will discuss popular study
designs and elements such as covariates, confounders, and effect modifiers (inter-
action) that will help you to further delineate your research question and your data
analysis plan.
Although this chapter is not a grant-writing tutorial, most of what you will learn
here has very important implications for writing a grant proposal. In fact, the most
important part of a grant proposal is the “specific aims” page, where you state your re-
search question, hypotheses, and objectives.
experience gives you the impression that a new intervention would be more effective
for your patients compared to standard treatment. For example, your results could
lead you to ask, “Does this drug really prolong life in patients with breast cancer?” or,
“Does this procedure really decrease pain in patients with chronic arthritis?”
Once you have identified a problem in the area you want to study, you can refine
your idea into a research question by gaining a firm grasp of “what is known” and
“what is unknown.” To better understand the research problem, you should learn as
much as you can about the background pertaining to the topic of interest and specify
the gap between current understanding and unsolved problems. As an early step, you
should consult the literature, using tools such as MEDLINE or EMBASE, to gauge
the current level of knowledge relevant to your potential research question. This is es-
sential in order to avoid spending unnecessary time and effort on questions that have
already been solved by other investigators. Meta-analyses and systematic reviews are
especially useful to understand the combined level of evidence from a large number of
studies and to obtain an overview of clinical trials associated with your questions. You
should also pay attention to unpublished results and the progress of important studies
whose results are not yet published. It is important to realize that there likely are nega-
tive results produced but never published. You can inspect funnel plots obtained from
meta-analyses or generated from your own research (see Chapter 14 in this volume
for more details) to estimate if there has been publication bias toward positive studies.
Also, be aware that clinical trials with aims similar to those of your study might still be
ongoing. To find this information, you can check the public registration of trials using
sites such as clinicaltrials.gov.
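To make the funnel-plot check mentioned above concrete, here is a minimal sketch in Python using numpy and matplotlib; the study effects and standard errors are synthetic, illustrative assumptions rather than real meta-analytic data. A roughly symmetric funnel is reassuring, while a gap where small negative studies should sit suggests publication bias:

```python
# Minimal funnel-plot sketch with synthetic data (illustrative only).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
n_studies = 30
se = rng.uniform(0.05, 0.5, n_studies)      # per-study standard errors
true_effect = 0.3                           # assumed pooled effect
effect = rng.normal(true_effect, se)        # observed study effects

plt.scatter(effect, se)
plt.gca().invert_yaxis()                    # most precise studies on top
# 95% pseudo-confidence limits around the assumed pooled effect
se_grid = np.linspace(se.min(), se.max(), 100)
plt.plot(true_effect - 1.96 * se_grid, se_grid, "k--")
plt.plot(true_effect + 1.96 * se_grid, se_grid, "k--")
plt.xlabel("Effect size")
plt.ylabel("Standard error")
plt.title("Funnel plot: asymmetry suggests possible publication bias")
plt.show()
```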
options limited, too complex or costly, or otherwise not satisfactory (e.g., limb re-
placement, face transplantation)? Does the research topic reflect a major problem
in terms of health policy, medical, social, and/or economic aspects (e.g., smoking,
hypertension, or obesity)?
• The intervention: Is it a new drug, procedure, technology, or medical device (e.g.,
stem-cell derived pacemaker or artificial heart)? Does it concern an existing drug
approved by the Food and Drug Administration (FDA) for a different indication
(e.g., is Rituximab, a drug normally indicated for malignant lymphoma, effective for
systemic lupus erythematosus or rheumatoid arthritis)? Is there new evidence for
application of an existing intervention in a different population (e.g., is Palivizumab
also effective in immunodeficient infants, not only in premature infants, to prevent
respiratory syncytial virus)? Have recent findings supported the testing of a new
intervention in a particular condition (e.g., is a β-blocker effective in preventing car-
diovascular events in patients with chronic renal failure)? Even a research question
regarding a standard of care intervention can be valuable if in the end it can improve
the effectiveness of clinical practice.
Feasibility
In short, be realistic: novice researchers tend to jump right away into very ambitious
projects. You should carefully assess the feasibility of your research idea to prevent
wasting precious resources such as time and money:
• Patients: Can you recruit the required number of subjects? Do you think your re-
cruitment goal is realistic? Rare diseases such as Pompe or Fabry’s disease will pose
a challenge in obtaining a sufficient sample size. Even for common diseases, recruit-
ment may be difficult, depending on your inclusion criteria and intervention regimen.
Does your hospital have enough patients? If not, you may have to consider a multi-
center study. What about protocol adherence and dropouts? Do you expect significant
deviations from the protocol? Do you need to adjust your sample size accordingly?
• Technical expertise: Are there any established measurements or diagnostic tools for
your study? Can the outcome be measured? Is there any established diagnostic
tool? Do we have any standard techniques for using the device (e.g., guidelines for
echocardiographic diagnosis for congenital heart disease)? Is there a defined op-
timal dose? Can you operate the device, or can the skill be learned appropriately
(e.g., a training manual for transcatheter aortic valve replacement)? A pilot study or
small preliminary study can be helpful at this stage to help answer these preliminary
questions.
• Time: Do you have the required time to recruit your patients? Is it possible to follow
up with patients for the entire time of the proposed study period (e.g., can you
follow preterm infant development at 3, 6, and 9 years of age)? When do you need
to have your results in order to apply for your next grant?
• Funding: Does your budget allow for the scope of your study? Are there any research
grants you can apply for? Do the funding groups’ interests align with those of your
study? How realistic are your chances of obtaining the required funding? If there
are available funds, how do you apply for the grant?
• Team: How about your research environment? Do your mentors and colleagues share
your interests? What kind of specialists do you need to invite for your research? Do
you have the staff to support your project (technicians, nurses, administrators, etc.)?
Answerability
New knowledge can only originate from questions that are answerable. A broad re-
search problem is still a theoretical idea, and even if it is important and feasible, it still
needs to be further specified. You should carefully investigate your research idea and
consider the following:
• Precisely define what is known or not known and identify what area your research
will address. The research question should demonstrate an understanding of the bi-
ology, physiology, and epidemiology relevant to your research topic. For example, you
may want to investigate the prevalence and incidence of stroke after catheterization
and its prognosis before you begin research on the efficacy of a new anticoagulant
for patients who received catheter procedures. Again, you may need to conduct
a literature review in order to clarify what is already known. Conducting surveys
(interviews or questionnaires) initially could also be useful to understand the current
status of your issues (e.g., how many patients a year are diagnosed with stroke after
catheterization in your hospital? What kind of anticoagulant is already being used for
these patients? How old are the patients? How long do the catheterization
procedures take? etc.).
• The standard treatment should be well known before testing a new treatment. Are
there any established treatments in your research field? Could your new treatment
potentially replace the standard treatment or be complementary to the current
treatment of choice? Guidelines can be helpful for discussion (e.g., American
College of Cardiology/American Heart Association guidelines for anticoagulant
therapy). Without knowing the current practice, your new treatment may never
find its clinical relevance.
• We also need information about clinical issues for diagnostic tests and interventions.
Are you familiar with the diagnoses and treatment of this disease (e.g., computerized
tomography or magnetic resonance imaging to rule out stroke after catheterization)?
Do you know the current guidelines?
Ethical Aspects
Ethical issues should be discussed before conducting research. Is the subject of your
research a controversial topic? The possible ethical issues will often depend largely
on whether the study population is considered vulnerable (e.g., children, pregnant
women, etc.; see Chapter 1) [1]. You must always determine the possible risks and
benefits of your study intervention [1].
Finally, you may want to ask for expert opinions about whether your research
question is answerable and relevant (no matter how strong your personal feelings may
be about the relevance). To this end, a presentation of your idea or preliminary results
at a study meeting early on in the project development can help refine your question.
I (Intervention)
The I of the acronym usually refers to “intervention.” However, a more general and
therefore preferable term would be “independent variable.” The independent variable
is the explanatory variable of primary interest, also declared as x in the statistical anal-
ysis. The independent variable can be an intervention (e.g., a drug or a specific drug
dose), a prognostic factor, or a diagnostic test. I can also be the exposure in an observa-
tional study. In an experimental study, I is referred to as the fixed variable (controlled
by the investigator), whereas in an observational study, I refers to an exposure that
occurs outside of the experimenter’s control.
The independent variable precedes the outcome in time or in its causal path, and
thus it “drives” the outcome in a cause-effect relationship.
C (Control)
What comparison or control is being considered? This is an important component
when comparing the efficacy of two interventions. The new treatment should be supe-
rior to the placebo when there is no standard treatment available. A placebo is a simulated
treatment that has no pharmacologic effect and is used to mask recipients against the poten-
tial expectation biases associated with participating in clinical trials. On the other hand,
active controls could be used when an established treatment exists and the efficacy of the
new intervention should be examined at least within the context of non-inferiority to
the standard treatment. In a single-group study, the control can also be the baseline measurement.
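To illustrate the non-inferiority logic sketched above, the following minimal Python example compares response rates between a new treatment and an active control against a pre-specified margin; the counts and the 10% margin are hypothetical, and a real trial would fix both in the protocol:

```python
# Minimal non-inferiority sketch with hypothetical response counts.
import math

def wald_ci_diff(x_new, n_new, x_std, n_std, z=1.96):
    """95% Wald CI for the difference in response rates (new - standard)."""
    p_new, p_std = x_new / n_new, x_std / n_std
    se = math.sqrt(p_new * (1 - p_new) / n_new + p_std * (1 - p_std) / n_std)
    diff = p_new - p_std
    return diff - z * se, diff + z * se

margin = -0.10  # non-inferiority margin: new may be at most 10% worse
lo, hi = wald_ci_diff(x_new=150, n_new=200, x_std=145, n_std=200)
print(f"95% CI for difference: ({lo:.3f}, {hi:.3f})")
# Non-inferiority is claimed only if the entire CI lies above the margin.
print("Non-inferior" if lo > margin else "Non-inferiority not demonstrated")
```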
O (Outcomes)
O is the dependent variable, or the outcome variable of primary interest; in the sta-
tistical analysis, it is also referred to as y. The outcome of interest is a random vari-
able and can be a clinical (e.g., death) or a surrogate endpoint (e.g., hormone level,
bone density, antibody titer). Selection of the primary outcome depends on several
considerations: What can you measure in a timely and efficient manner? Which meas-
urement will be relevant to understand the effectiveness of the new intervention?
What is routinely accepted and established within the clinical community? We will
discuss the outcome variable in more detail later in the chapter.
T (Time)
Time is sometimes added as another criterion and often refers to the follow-up time
necessary to assess the outcome or the time necessary to recruit the study sample.
Rather than viewing time as a separate aspect, it is usually best to consider time in
context with the other PICOT criteria.
The primary question is the most relevant question of your research that should be
driven by the hypothesis. Usually only one primary question should be defined at the
beginning of the study, and it must be stated explicitly upfront [3]. This question is rel-
evant for your sample size calculation (and in turn, for the power of your study—see
Chapter 11).
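For instance, here is a minimal sketch of such a sample size calculation in Python, assuming the primary question will be analyzed with a two-sample t-test; the effect size, significance level, and power shown are illustrative assumptions, not recommendations:

```python
# Minimal sample size sketch for a two-group comparison (assumed t-test).
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(
    effect_size=0.5,  # assumed standardized difference (Cohen's d)
    alpha=0.05,       # two-sided type I error rate
    power=0.80,       # desired power for the primary question
)
print(f"About {n_per_group:.0f} subjects per group are needed")  # ~64
```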
The specific aim is a statement of what you are proposing to do in your research project.
The primary hypothesis states your anticipated results by describing how the inde-
pendent variable will affect the dependent variable. Your hypothesis cannot be mere
speculation; rather, it must be grounded in the research you have performed and
must have a reasonable chance of being proven true.
We can define more than one question for a study, but aside from the primary
question, all others associated with your research are treated as secondary questions.
Secondary questions may help to clarify the primary question and may add some
information to the research study. What potential problems do we encounter with
secondary questions? Usually, they are not sufficiently powered to be answered be-
cause the sample size is determined based on the primary question. Also, type I errors
(i.e., false positives) may occur due to multiple comparisons if not adjusted for by the
proper statistical analysis. Therefore, findings from secondary questions should be
considered exploratory and hypothesis generating in nature, with new confirmatory
studies needed to further support the results.
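As a minimal sketch of such an adjustment in Python, the Bonferroni correction below is applied to a set of hypothetical secondary-endpoint p-values; other procedures (e.g., Holm or false-discovery-rate methods) may be preferable depending on the setting:

```python
# Minimal multiplicity-adjustment sketch with hypothetical p-values.
from statsmodels.stats.multitest import multipletests

p_secondary = [0.04, 0.01, 0.20, 0.03]  # unadjusted secondary p-values
reject, p_adj, _, _ = multipletests(p_secondary, alpha=0.05,
                                    method="bonferroni")
for p, pa, r in zip(p_secondary, p_adj, reject):
    print(f"raw p={p:.2f}  adjusted p={pa:.2f}  significant={r}")
```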
An ancillary study is a sub-study built into the main study design. Previous evi-
dence may convince you of the need to test a hypothesis within a sub-group ancillary
to the main population of interest (e.g., females, smokers). While this kind of study
enables you to perform a detailed analysis of the subpopulation, there are limitations
on the generalizability of an ancillary study since the population is usually more re-
stricted (see Further Readings, Examples of Ancillary Studies).
Variables
It is important to understand thoroughly the study variables when formulating the
study question. Here we will discuss some of the important concepts regarding the
variables, which will be discussed in more detail in Chapter 8.
We have already learned that the dependent variable is the outcome, and the in-
dependent variable is the intervention. For study design purposes, it is important
to also discuss how the outcome variables are measured. A good measurement
requires reliability (precision), validity (“trueness”), and responsiveness to change.
Reliability refers to how consistent the measurement is if it is repeated. Validity of
a measurement refers to the degree to which it measures what it is supposed to
measure. Responsiveness of a measurement means that it can detect differences proportional to the change in what is being measured, with clinical meaningfulness and statistical significance.
Covariates are independent variables of secondary interest that may influence the
relationship between the independent and dependent variables. Age, race, and gender
are well-known examples. Since covariates can affect the study results, it is critical to
control or adjust for them. Covariates can be controlled for by both planning (inclu-
sion and exclusion criteria, placebo and blinding, sampling and randomization, etc.)
and analytical methods (e.g., covariate adjustment [see Chapter 13], and propensity
scores [see Chapter 17]).
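To make the analytical route concrete, the following is a minimal sketch of covariate adjustment using simulated data; the variable names (treatment, age, outcome), effect sizes, and sample size are illustrative assumptions, not values from any study discussed here.

```python
# Hedged sketch: adjusting for a covariate (age) in a linear regression.
# All data are simulated; the coefficients below are arbitrary assumptions.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
treatment = rng.integers(0, 2, n)   # independent variable: 0 = control, 1 = active
age = rng.normal(60, 10, n)         # covariate that also influences the outcome
outcome = 5.0 - 2.0 * treatment + 0.3 * age + rng.normal(0, 3, n)  # dependent variable

# Unadjusted model: outcome ~ treatment
unadjusted = sm.OLS(outcome, sm.add_constant(treatment)).fit()
# Adjusted model: outcome ~ treatment + age (covariate adjustment)
adjusted = sm.OLS(outcome, sm.add_constant(np.column_stack([treatment, age]))).fit()

print("unadjusted treatment effect:", round(unadjusted.params[1], 2))
print("adjusted treatment effect:  ", round(adjusted.params[1], 2))
```

In this simulation the treatment is randomized independently of age, so the two estimates should agree on average; in observational data, where a covariate is associated with treatment assignment, the unadjusted and adjusted estimates can differ markedly.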
• Continuous (ratio and interval scale), discrete, ordinal, nominal (categorical, binary)
variables: Continuous data can take any numeric value (including fractional values, i.e., floating-point data) and are the most common type of raw data. Discrete data take only whole-number values (i.e., integer data type; e.g., number of hospitalizations). Ordinal data
are ordered categories (e.g., mild, moderate, severe). Nominal data can be either
categorical (e.g., race) or dichotomous/binary (e.g., gender). Compared to other
variables, continuous variables have more power, which is the ability of the study
to detect an effect (e.g., differences between study groups) when it is truly present,
but they don’t always reflect clinical meaningfulness and therefore make interpre-
tation more difficult. Ordinal and nominal data may better reflect the clinical signif-
icance (e.g., dead or alive, relapse or no relapse, stage 1 = localized carcinoma, etc.).
However, ordinal and categorical data typically have less power, and important in-
formation may be lost (e.g., if an IQ less than 70 is categorized as developmental
delay in infants, IQs of 50, 58, and 69 will all fall into the same category, while an IQ
of 70 or more is considered to be normal development, although the difference is
just 1 point). This approach is called categorization of continuous data, where a certain clinically meaningful threshold is set to make it easier to quickly assess study results; the cost in statistical power is illustrated in the simulation sketch after this list. It is important to note that some authors differentiate between continuous
and discrete variables by defining the former as having a quantitative characteristic
and the latter as having a qualitative characteristic. This is a somewhat problematic
classification, especially when it comes to ordinal data.
• Single and multiple variables: Having a single variable is simpler, as it is easier for clinical interpretation. Multiple variables are efficient because we can evaluate many variables within a single trial, but these can be difficult to disentangle and interpret.
Composite endpoints are combined multiple variables and are also sometimes used.
Because each clinical outcome may separately require a long duration and a large
sample size, combining many possible outcomes increases overall efficiency and
enables one to reduce sample size requirements and to capture the overall impact of
therapeutic interventions. Common examples include MACE (major adverse car-
diac events) and TVF (target vessel failure: myocardial infarction in target vessel,
target vessel reconstruction, cardiac death, etc.). Interpretation of the results has to
proceed with caution, however (see section on case-specific questions) [9].
• Surrogate variables (endpoints) and clinical variables (endpoints): Clinical variables di-
rectly assess the effect of therapeutic interventions on patient function and survival,
which is the ultimate goal of a clinical trial. Clinical variables may include mortality,
events (e.g., myocardial infarction, stroke), and occurrence of disease (e.g., HIV).
A clinical endpoint is the most definitive outcome to assess the efficacy of an inter-
vention. Thus, clinical endpoints are preferably used in clinical research. However,
it is not always feasible to use clinical outcomes in trials. The evaluation of clin-
ical outcomes presents some methodological problems since they require long-
term follow-up (with problems of adherence, dropouts, competing risks, requiring
larger sample sizes) and can make a trial more costly. At the same time, the clinical
endpoint may be difficult to observe. For this reason, clinical scientists often use
alternative outcomes to substitute for the clinical outcomes. So-called surrogate
endpoints are a more practical measure to reflect the benefit of a new treatment.
Surrogate endpoints (e.g., cholesterol levels, blood sugar, blood pressure, viral load)
are defined based on the understanding of the mechanism of a disease that suggests
a clear relationship between a marker and a clinical outcome [8]. In addition, a biological rationale should previously have been demonstrated by epidemiological data, other clinical trials, or animal data. A surrogate is frequently a continuous variable
that can be measured early and repeatedly and therefore requires shorter follow-
up time, smaller sample size, and reduced costs for conducting a trial. Surrogate
endpoints are often used to accelerate the process of new drug development and
early stages of development, such as in phase 2 [10]. As a word of caution, too
much reliance on surrogate endpoints alone can be misleading if the results are
not interpreted with regard to validation, measurability, and reproducibility (see
Further Reading) [4].
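The power cost of categorizing a continuous outcome, mentioned in the first bullet above, can be illustrated with a small simulation. This is a hedged sketch under assumed parameters (group means of 100 and 92, SD of 15, a threshold of 70, and 50 subjects per group); the exact power figures depend entirely on these assumptions.

```python
# Hedged simulation: power of a t-test on a continuous outcome vs. a test on the
# same outcome dichotomized at a clinical threshold (e.g., IQ < 70).
# Group means, SD, threshold, and sample size are arbitrary assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_per_group, n_sims, alpha = 50, 2000, 0.05
hits_continuous = hits_dichotomized = 0
for _ in range(n_sims):
    control = rng.normal(100, 15, n_per_group)   # e.g., IQ in control group
    exposed = rng.normal(92, 15, n_per_group)    # e.g., IQ in exposed group
    # Analysis 1: compare means of the continuous outcome
    if stats.ttest_ind(control, exposed).pvalue < alpha:
        hits_continuous += 1
    # Analysis 2: dichotomize at 70 (delay vs. normal) and compare proportions
    table = [[int((control < 70).sum()), int((control >= 70).sum())],
             [int((exposed < 70).sum()), int((exposed >= 70).sum())]]
    if stats.fisher_exact(table).pvalue < alpha:
        hits_dichotomized += 1

print(f"estimated power, continuous outcome:   {hits_continuous / n_sims:.2f}")
print(f"estimated power, dichotomized outcome: {hits_dichotomized / n_sims:.2f}")
```

Under these assumptions the continuous analysis detects the group difference far more often than the dichotomized one, because most of the between-group information lies well above the threshold of 70.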
situation, a correlation analysis is used. If there is more than one independent var-
iable associated with one dependent variable (e.g., smoking and drinking alcohol
are associated with lung cancer), it is called a complex associational question, and
multiple regression is used for statistical analysis.
• Basic/complex descriptive question: The data are described and summarized using
measures of central tendency (mean, median, and mode), variability, and percentage (prevalence, frequency). If there is only one variable, it is called a basic descriptive question (e.g., how many MRSA isolates occur after the 15th day of hospitalization?); for more than one variable, it is classified as a complex descriptive question.
Introduction
Defining the research question is, perhaps, the most important part of planning a research study. The wrong question will eventually lead to a poor study design, rendering the results useless; on the other hand, choosing an elegant, simple question will probably lead to a good study that is meaningful to the scientific community, even if the results are negative. In fact, the best research question is one that, regardless of the results (negative or positive), produces interesting findings.
In addition, a study should be designed with only one main question in mind.
However, choosing the most appropriate question is not always easy, as such a
question might not be feasible to be answered. For instance, when researching acute
MI, the most important question would be whether or not a new drug decreases
mortality. However, for economic and ethical reasons, such an approach can only be
considered when previous studies have already suggested that the new drug is a po-
tential candidate. Therefore, the investigator needs to deal with the important issue
of feasibility versus clinical relevance. Dr. Heart soon realized that her task would
not be an easy one, and also that this task may take some time; she kept thinking
about a quotation from an article she recently read: “One-third of a trial’s time between the germ of your idea and its publication in the New England Journal of Medicine should be spent fighting about the research question.”2
1. Dr. André Brunoni and Professor Felipe Fregni prepared this case. Course cases are developed solely as the basis for class discussion. The situation in this case is fictional. Cases are not intended to serve as endorsements or sources of primary data. All rights reserved to the authors of this case.
2. Riva JJ, Malik KM, Burnie SJ, Endicott AR, Busse JW. What is your research question? An introduction to the PICOT format for clinicians. J Can Chiropr Assoc. 2012 Sep; 56(3): 167–171.
She knows, for example, that the main funding agency in the United States, the NIH (National Institutes of Health), considers significance and innovation as important factors when funding grant applications. Dr. Heart also remembers something that her mentor used
to tell her at the beginning of her career: “A house built on a weak foundation will not
stand.” She knows that even if she has the most refined design and uses the optimal
statistical tests, her research will be of very little interest or utility if it does not advance
the field. But regarding this point, she is confident that her research will have a signif-
icant impact in the field.
CASE DISCUSSION
Dr. Heart is a busy and ambitious clinical scientist and wants to establish herself
within the academic ranks of her hospital. She has some background in statistics but
seems to be quite inexperienced in conducting clinical research. She is looking for
an idea to write up a research proposal and rightly conducts a literature search in her field of expertise, cardiovascular diseases. She finds an interesting article about
a compound that has been demonstrated to be effective in an animal model and safe
in healthy volunteers (results of a phase I trial). She now plans to conduct a phase II
trial, but struggles to come up with a study design. The most vexing problem for her is
formulating the research question.
Dr. Heart then reviews and debates aspects that have to be considered when
delineating a research question. The main points she ponders include the following: de-
termining the outcome with regard to feasibility (mainly concerning the time of
follow-up when using a clinical outcome) versus clinical relevance (when using a sur-
rogate outcome) and with regard to the data type to be used for the outcome (cate-
gorical vs. continuous); the importance of the research proposal (the need for a new
anticoagulant drug); whether to use a narrow versus broad study population; whether
to include only a primary or also secondary questions; and whether to use a basic
versus complex hypothesis. Important aspects that Dr. Heart has not considered include the following: whether to test against a control (although not mandatory in a phase II trial, it deserves consideration since she is investigating the effects of an anticoagulant and adverse events should therefore be expected, justifying the inclusion of a control arm) or to test several dosages (to observe a dose–response effect);
logistics; the budget; and the overall scope of her project.
All these aspects are important and need careful consideration, but you have to
wonder how this will help Dr. Heart come up with a compelling research question.
Rather than assessing each aspect separately and making decisions based on advantages
and disadvantages, it is recommended to start from a broad research interest and then
develop and further specify the idea into a specific research question.
While Dr. Heart should be applauded for her ambition, she should also try to
balance the level of risk of her research given her level of experience.
Finally, we should also question Dr. Heart’s motives for conducting this study.
What is her agenda?
FURTHER READING
Text
Haynes B, Sackett DL, Guyatt GH, Tugwell P. Forming research questions: part 1. In: Clinical epidemiology: how to do clinical practice research. 3rd ed.; 2006: 3–14.
Portney LG, Watkins MP. Foundations of clinical research: applications to practice. 3rd ed. Pearson;
2008: 121–139.
Surrogate Outcomes
D’Agostino RB. Debate: The slippery slope of surrogate outcomes. Curr Contr Trials Cardiovasc
Med. 2000; 1: 76–78.
Echt DS, Liebson PR, Mitchell LB, et al. Mortality and morbidity in patients receiving encainide, flecainide, or placebo: The Cardiac Arrhythmia Suppression Trial. N Engl J Med. 1991; 324: 781–788.
Feng M, Balter JM, Normolle D, et al. Characterization of pancreatic tumor motion using Cine-
MRI: surrogates for tumor position should be used with caution. Int J Radiat Oncol Biol Phys.
2009 July 1; 74(3): 884–891.
Katz R. Biomarkers and surrogate markers: an FDA perspective. NeuroRx. 2004 April;
1(2): 189–195.
Lonn E. The use of surrogate endpoints in clinical trials: focus on clinical trials in cardiovascular
diseases. Pharmacoepidemiol Drug Safety. 2001; 10: 497–508.
Composite Endpoint
Cordoba G, Schwartz L, Woloshin S, et al. Definition, reporting, and interpretation of com-
posite outcomes in clinical trials: systematic review. BMJ. 2010; 341: c3920.
Kip KE, Hollabaugh K, Marroquin OC, et al. The problem with composite end points in car-
diovascular studies. The story of major adverse cardiac events and percutaneous coronary
intervention. JACC. 2008; 51(7): 701–707.
Ancillary Studies
Krishnan JA, Bender BG, Wamboldt FS, et al. Adherence to inhaled corticosteroids: an ancillary study of the Childhood Asthma Management Program clinical trial. J Allergy Clin Immunol. 2012; 129(1): 112–118.
Udelson JE, Pearte CA, Kimmelstiel CD, et al. The Occluded Artery Trial (OAT) Viability
Ancillary Study (OAT-NUC): influence of infarct zone viability on left ventricular
remodeling after percutaneous coronary intervention versus optimal medical therapy alone.
Am Heart J. 2011 Mar; 161(3): 611–621.
Controls, Sham/Placebo
Finniss DG, Kaptchuk TJ, Miller F, et al. Placebo effects: biological, clinical and ethical advances. Lancet. 2010 February 20; 375(9715): 686–695.
Macklin R. The ethical problems with sham surgery in clinical research. N Engl J Med. 1999 Sep
23; 341(13): 992–996.
Pilot Studies
Lancaster GA, Dodd S, Williamson PR. Design and analysis of pilot studies: recommendations
for good practice. J Eval Clin Pract. 2002; 10(2): 307–312.
REFERENCES
1. The Belmont Report. Office of the Secretary. Ethical principles and guidelines for the protec-
tion of human subjects of research. The National Commission for the Protection of Human
Subjects of Biomedical and Behavioral Research. Washington, DC: U.S. Government
Printing Office, 1979.
Science is facts; just as houses are made of stones, so is science made of facts; but a pile of
stones is not a house and a collection of facts is not necessarily science.
—Henri Poincaré (1854–1912)
INTRODUCTION
The previous chapter provided an overview of the selection of the research questions.
Now that you have picked a topic and decided what to study, you have to think about whom you want to study in order to test your hypothesis. This chapter will begin with the general definition of the study population, followed by an introductory section about validity (internal and external) and sampling techniques (probability
and non-probability). Then you will be invited to think critically about a hypothetical
case study titled “Choosing the Study Population.” This chapter will conclude with
some practical exercises and review questions.
The next chapter will teach you how to answer your research question. You will learn
about basic study designs and how to choose the appropriate design for your study.
The design of a study often begins with the formulation of the research question,
followed by the decision about whom the researcher will study: the study
population.
The portion of the general population a researcher wants to draw robust conclusions
or inferences about is called the target population or reference population (Figure 3.1).
This population of ultimate interest is usually large and diverse (it may be from all
around the world), making it impractical (or even impossible) and cost-ineffective to
study it entirely.
Figure 3.1. The target population (the population of ultimate interest), the accessible population, and the study sample (the subset that can actually be studied).
The subset of the target population that is actually accessible to the researcher is
called the accessible population (see Figure 3.1). In order for this group to be represen-
tative of the target population, it is important to clearly characterize the target popu-
lation, and to define all elements within the target population that would have to be
equally represented in the accessible population. In other words, the accessible popu-
lation has to be (1) accessible; (2) representative of the target population; and (3) in
agreement with the criteria for patient and disease characteristics, with a good ratio
between risks and benefits (including both inclusion and exclusion criteria).
The group of individuals that are drawn from the accessible population and on
which the research is conducted is called the study sample (or simply sample). Often
only a certain number of people of the accessible population are enrolled to par-
ticipate in the study (by design, but also with respect to time, budget, or logistical
constraints) [1](Figure 3.1). However, in some cases it is possible that the sample and
the accessible population are exactly the same [see Sim and Wright, Research in Health
Care: Concepts, Designs and Methods].
FIRST CHALLENGE: DEFINING
THE TARGET POPULATION
The clinical investigator usually has a clear idea of the specific condition to be studied (e.g., patients with neuropathic pain, patients with stroke, patients with diabetes). The challenging task, however, is to define, among patients with stroke for instance, the characteristics of the patients to be included in the study.
There are two extremes when making this decision: (1) not defining any spe-
cific characteristic (including any patients with stroke); and (2) using a very large
number of characteristics to define the population to be studied (for instance, gender,
age, weight, race, marital status, socioeconomic status, smoking, alcohol use, other
comorbidities, specific characteristics of stroke, family history). Which extreme
should we use? Probably neither one. Choosing a broad strategy of inclusion criteria (any patient with stroke) will likely result in a study with significant heterogeneity and, consequently, a reduced chance of significant results. On the other hand, with a very narrow strategy of inclusion criteria it may not be possible to find enough patients with these characteristics who are willing to be part of the study.
The strategy for choosing the study population should first take into consideration the study phase (early, i.e., a phase I or II study; or later, i.e., phase III or IV) and then identify the factors associated with response to treatment and safety. With these factors well established, the investigator then needs to maximize the “cost-effectiveness” of the chosen strategy: the cost is the consequence of being too broad and increasing heterogeneity, or of being too strict and hurting recruitment; the effectiveness is maximizing both internal and external validity.
The study included men or women aged ≥18 years with ischemic stroke between 48
and 72 hours before randomization confirmed by computed tomography or MRI and
a National Institutes of Health Stroke Scale (NIHSS) score of ≥4 (total score) or of
≥2 on the arm or leg motor function scores of the NIHSS. Subjects had to be medi-
cally stable within the 24 hours before randomization. Excluded were subjects with
a transient ischemic attack or who were unable to take medication by mouth at the
time of baseline assessments. Thrombolysis during the acute phase of stroke before
study enrollment was allowed.
This study included 60 subjects. The authors restricted the population by defining the time window of stroke, baseline stroke severity, and age, likely hypothesizing a priori that these characteristics would increase response to this new intervention in stroke. It is interesting, however, that the authors concluded that
[c]utamesine was safe and well tolerated at both dosage levels. Although no signifi-
cant effects on functional end points were seen in the population as a whole, greater
improvement in National Institutes of Health Stroke Scale scores among patients
with greater pretreatment deficits seen in post hoc analysis warrants further inves-
tigation. Additional studies should focus on the patient population with moderate-
to-severe stroke.
Therefore the authors demonstrated that they should have in fact restricted this small
study to patients with increased stroke severity.
In a phase III study, on the other hand, the goal is to be as broad as possible so as to increase the generalizability of the study, as the overall goal of a phase III study is to gain knowledge for use in clinical practice. For example, a phase III trial in acute
stroke had the following inclusion criteria that were mainly “(i) Patients presenting
with acute hemispheric ischemic stroke; (ii) Age ≥18 years; (iii) National Institutes
of Health Stroke Score (NIHSS) of ≥4–26 with clinical signs of hemispheric infarc-
tion.” In fact, these criteria are broad and therefore increase the generalizability of this
study [3].
Figure 3.3. The relationship between internal and external validity and the study population. In the design phase, the factors used to choose the study population—(1) study phase (phase I, II, III, or IV), (2) response to treatment, and (3) safety issues—define the inclusion/exclusion criteria that narrow the target population down to the accessible population and the study population.
studied. However, the sample might be too narrow to generate benefits for clinical
care because the results are not generalizable or applicable to other samples or to the
general population. It should be noted that an experiment that is not internally valid
cannot be externally valid. If the sample is not representative of the target population,
both external and construct validities of the study are at risk, so it is not possible to
generalize safely. The only way to make sure that the sample is representative of the
broader population is to obtain direct information from the population [7].
SECOND CHALLENGE: SAMPLING
To maximize the generalizability of your study, you must ensure representative-
ness. You must first clearly define the target population, carefully specify inclusion
and exclusion criteria for the accessible study population, and structure the sampling
method in a way that the enrollment yields a random sample of the accessible popula-
tion (Figure 3.3). Remember, the more representative the sample, the more generalizable the study findings will be to the target population.
Here the investigator should know the difference between sampling and inclusion
criteria. A study can have broad inclusion criteria; however, if during the sampling process only patients with a certain characteristic enter the study, the generalizability of the findings is affected (see also Chapter 7 in this volume).
Compared with studying an entire population, sampling offers several advantages:
• Low costs
• Simplicity
• Reduced field time
• Increased accuracy
• Possible generalizability.
Figure 3.4. Normal distribution with confidence intervals based on the standard deviation: approximately 68.2% of observations fall within ±1 σ of the mean, 95.4% within ±2 σ, and 99.7% within ±3 σ.
After defining the study population, researchers need to select the method of sam-
pling (i.e., the method to select individuals). Sampling refers to the process by which researchers select a representative subset of individuals from the accessible population to become part of the study sample, recruiting enough individuals to warrant statistical power [e.g., 3,4] (see Chapter 11 for more information on sample size calculation). Greater heterogeneity of the target population (greater variability) requires a larger sample size to achieve the same precision on the variables under study (see Figure 3.4 for a representation of the normal distribution).
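As a rough numerical illustration of this point, the classic formula for the sample size needed to estimate a mean within a margin of error E at approximately 95% confidence is n = (1.96 σ / E)²; the σ values below are arbitrary assumptions.

```python
# Sketch: required n to estimate a mean within a margin of error E,
# n = (z * sigma / E)**2 — more heterogeneity (larger sigma) means larger n.
import math

z = 1.96        # ~95% confidence
margin = 2.0    # desired margin of error, in the units of the outcome
for sigma in (5.0, 10.0, 20.0):
    n = (z * sigma / margin) ** 2
    print(f"sigma = {sigma:5.1f} -> minimum n = {math.ceil(n)}")
```

Doubling σ quadruples the required sample size, which is why restricting heterogeneity (within the limits of generalizability) can make a study markedly more efficient.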
What can we do in order to select a study population that is representative of the
accessible population and, by extension, of the target population? One way to proceed
is to define clearly the eligibility criteria for the accessible population.
Sampling Bias
As stated before, a sample has to be representative to ensure generalizability from the
sample to the population. A sample is considered representative of a given population
when it represents, at least, a similar heterogeneity for the relevant characteristics. This
means that the variations found in the sample reflect to a high degree those from the
broader target population. However, the process of collecting the sample is open to systematic errors that can lead to a non-random, biased sample—this is called sampling bias (also known as ascertainment bias or systematic bias).
A sampling method is considered biased when the subjects who were recruited
from the accessible population favor certain characteristics or attributes over others
compared to the target population. This imbalance of characteristics or attributes
influences the outcome of the study—either because they are overrepresented or un-
derrepresented relative to others in the population [8,9]. Sampling bias is a threat to
external validity (generalizability) and cannot be accounted for by simply increasing
the sample size. In order to minimize the degree of bias when collecting the sample,
special sampling techniques should be employed (see discussion later in this chapter).
Sampling Error(s)
Sampling error or estimation error (or precision) refers to the standard error that gives
us the precision of our statistical estimate [12]. In this case, the degree from which an
observation differs from the expected value is called an error.
So, sampling error is a statistical error that is obtained from sample data, which
differs to a certain degree from the data that would be obtained if the entire population
were used [6]. Low sampling error means less variability in the sample distribution,
based on the standard deviation. A simple rule of thumb is that standard deviation and
sampling error will increase in the same fashion.
The sampling error depends on the sample size and the sampling distribution
(Figure 3.4), based on the characteristics of interest. Unlike sampling bias, sam-
pling error can be predicted and calculated by the researcher, taking into account the following:
SE = σ/√n (for a mean) or SE = √(p(1 − p)/n) (for a proportion)
These standard errors are used to calculate 95% confidence intervals (see also Chapter 14 for these calculations):
X̄ ± 2 × SE(X̄)
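A minimal sketch of these calculations; the values of σ, n, X̄, and p below are made up for illustration.

```python
# Standard error for a mean and for a proportion, with approximate 95% CIs
# computed as estimate ± 2 × SE (values below are illustrative only).
import math

sigma, n, x_bar = 12.0, 100, 50.0
se_mean = sigma / math.sqrt(n)          # SE = sigma / sqrt(n)
print(f"mean: {x_bar} ± {2 * se_mean:.2f}")

p = 0.30
se_prop = math.sqrt(p * (1 - p) / n)    # SE = sqrt(p(1 - p) / n)
print(f"proportion: {p} ± {2 * se_prop:.3f}")
```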
Sampling Techniques
There are two broad types of sampling designs in quantitative research: probability sam-
pling and non-probability sampling (Figure 3.5). With probability sampling techniques,
every subject has, in theory, an equal chance of being selected for the sample. In contrast, with non-probability sampling techniques, the chance of any given subject being selected is unknown [13,14].
Probability sampling methods are based on random selection. This means that every
person in the target population has an equal and independent chance of being selected
as a subject for the study [15]. This procedure decreases the probability of bias, even
when using a small sample size, because bias depends more on the selection proce-
dure than on the size of the sample. So, the main advantage of this method is that the
selection process is randomized, therefore minimizing bias (reducing sampling bias)
and thus increasing confidence in the representativeness of the sample [15]. Other
advantages of the probability sampling methods are objectiveness, requiring little
information from the population, and increasing accuracy of the statistical methods
after the study [e.g., 16]. On the other hand, the main disadvantages of these methods
are that the process is expensive and time-consuming, and a complete list of the en-
tire population is needed [7,16]. There are different types of probability sampling
Internal validity
• Refers to the extent to which we can accurately state that the independent variables (IVs) produced the observed effect.
• Internal validity is achieved when the effect on the dependent variable is due only to variation in the IVs.
• Threats to internal validity: confounding and bias.
Figure 3.5. Sampling techniques. Probability sampling: simple random, systematic, stratified, cluster, and disproportional sampling. Non-probability sampling: convenience, consecutive, quota, judgmental, and snowball sampling.
Advantages
It is an easy method of sampling.
It can be done manually.
This method is ideal for statistical purposes because a confidence interval around the sample estimate can be defined using statistical analyses.
Potential Problems
This method requires a complete list of the population, which may be very impractical and costly when individuals are scattered over a vast geographic area.
Difficulties completing the list of the entire population may systematically exclude
important cases in the population.
Example
Suppose you want to study the entire population with Parkinson’s disease who
received deep brain stimulation (DBS) in all hospitals in the United States.
First, you will need a list (organized by letters or numbers) of all Parkinson’s
patients who received DBS for each hospital in the United States that performs
these invasive procedures. These lists are named the sampling frame. After that,
you need to choose a process to select numbers randomly in each list, in a way
that every patient has the same chance of being selected.
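A minimal sketch of this procedure in code; the sampling frame of patient identifiers is hypothetical.

```python
# Simple random sampling from a sampling frame of hypothetical patient IDs.
import random

random.seed(1)
sampling_frame = [f"DBS-patient-{i:05d}" for i in range(1, 5001)]  # complete list
sample = random.sample(sampling_frame, k=100)  # every patient has an equal chance
print(len(sample), sample[:3])
```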
Systematic Sampling
Definition and Method
In systematic sampling, a set of study participants is systematically and randomly
selected from a complete and ordered list of people (N = population size). Then a sampling
interval is obtained by dividing the accessible population by the desired sample size.
So, suppose that each individual in the accessible population is organized by alphabet-
ical order. The researcher establishes a sampling interval (SI) to select subjects, which
is the distance between the selected elements. For instance, we can specify taking
every third name or every tenth name.
Advantages
It is an easy method of sampling.
It can be done manually, especially if lists are already organized into sections.
It is ideal for large target populations, where simple random sampling would be
difficult to perform.
It ensures that the selected individuals are from the entire range of the population.
Potential Problems
This method can be expensive and time-consuming if the sample is not conven-
iently located.
The selection method chosen could introduce bias when some members of the target population are systematically excluded. It is therefore inadequate when there is periodicity in the population.
Example
Suppose that the population size is N = 5,000 and the desired sample size is n = 100. The first step would be to organize the population in alphabetical order and then divide to obtain the sampling interval: k = 5,000/100 = 50. This means that every 50th person listed will be selected into the sample.
In summary, first you need to create a sampling frame (e.g., list of people),
then choose randomly a starting point on the sampling frame, and finally
pick a participant at constant and regular intervals in order to select sets of
participants.
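The same worked example as a short sketch: N = 5,000, n = 100, and sampling interval k = 50, with a random starting point within the first interval (the frame entries are hypothetical).

```python
# Systematic sampling: pick every k-th element after a random starting point.
import random

random.seed(2)
frame = [f"person-{i:04d}" for i in range(5000)]  # ordered sampling frame (N = 5,000)
n = 100
k = len(frame) // n                               # sampling interval: 5,000 / 100 = 50
start = random.randrange(k)                       # random start between 0 and k - 1
sample = frame[start::k]                          # every 50th person from the start
print(len(sample), sample[:3])
```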
Stratified Sampling
Definition and Method
With stratified sampling, the accessible population is first divided into two or more strata, or homogeneous subgroups, according to specific characteristics relevant to the study purpose, and individuals are then randomly selected within each stratum. The strata must not overlap. This method uses simple random or systematic sampling within defined strata in order to enhance the representativeness
of the sample [18]. Stratification is based on attributes that often are of relevance
for the study purpose, such as age, gender, race, diagnostic, duration of disease, geo-
graphic localization, socioeconomic status, education, and so on. With this method, each element within a stratum has an equal chance of being selected, and thus each subgroup is represented.
Advantages
This method increases representativeness with respect to the target population by ensuring that individuals from each stratum are included.
It is ideal when subgroups within the target population need to be analyzed sepa-
rately and/or compared (subgroup statistical analyses).
It is less expensive—great precision is obtained even with smaller samples.
Potential Problems
The process of selecting the sample is more complex and requires more informa-
tion about the population in order to classify and organize elements from the target
population. Sometimes, difficulties in identifying the main characteristics of the
target population may require some adjustments during the study. There is also another problem if the proportions in the target population are not reflected in the strata. For instance, if one stratum has twice as many representatives in the target population as another stratum, then the sample size of each stratum should reflect this. This process is called proportional stratified sampling, where the sample size in each stratum aims to reflect the proportion of those individuals in the target population. Please note that this should only be used when the proportions in the target population are different but not extremely so. If they are very disproportionate (for instance, group A, N = 100; group B, N = 2,000), disproportional sampling should be considered instead.
Example
Suppose you want to study the incidence and prevalence of disease A by
gender. After selecting the accessible population, you need to sample individuals randomly within each stratum (stratum 1: female; stratum 2: male). The sample would then have 50% of individuals in each gender group.
In summary, first you need to divide the target population into characteris-
tics of interest—named stratification factors—(gender, age, level of education,
etc.), and then the sample is selected randomly within each group.
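A minimal sketch of proportional stratified sampling; the frame composition (600 female and 400 male records) and the total sample size of 100 are illustrative assumptions.

```python
# Proportional stratified sampling: sample each stratum in proportion to its size.
import random

random.seed(3)
frame = [("F", f"F-{i:03d}") for i in range(600)] + \
        [("M", f"M-{i:03d}") for i in range(400)]
n_total = 100
strata = {"F": [pid for g, pid in frame if g == "F"],
          "M": [pid for g, pid in frame if g == "M"]}
sample = []
for members in strata.values():
    n_stratum = round(n_total * len(members) / len(frame))  # proportional allocation
    sample += random.sample(members, n_stratum)
print(len(sample))  # 60 from stratum F + 40 from stratum M = 100
```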
Cluster Sampling
Definition and Method
In cluster sampling, the accessible population is divided into groups (clusters), and sometimes all, but usually only a fixed number of, individuals will be randomly selected from a particular cluster to be studied. The clusters that provide the sample should be representative of (or similar to) the target population. Clusters may be geographic, racial, and so on.
The difference between cluster sampling and stratified sampling is that in stratified
sampling, the entire study population is divided into strata based on a certain char-
acteristic (covariate) and all strata are sampled. In cluster sampling, only a selected
number of clusters is included for sampling. Also, clusters are typically defined by geographical aspects, and the individuals within a cluster do not necessarily share a common biological characteristic (covariate).
Advantages
This method is ideal for large and disperse target populations.
There is reduced cost and it is less time-consuming (e.g., sampling all students in a
few schools vs. some students from all schools).
There is reduced variability.
Sampling frame is not required.
Potential Problems
The main problem of cluster sampling is a loss in precision. It can lead to biased samples when clusters are chosen based on flawed assumptions about the population—for instance, when the clusters are very similar to each other and therefore less likely to represent the whole population. Conversely, if clusters differ substantially from one another, sampling will lose efficiency.
Example
Suppose you want to study medical students; you may select randomly five
universities in the state of Massachusetts and then select randomly 500
students in each university.
In summary, first you need to identify the study units (clusters), and then you
recruit a fixed number of participants within each if they meet the criteria for
the study.
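The university example as a hedged sketch of two-stage cluster sampling; the number of candidate universities (20) and students per university (3,000) are made-up assumptions.

```python
# Two-stage cluster sampling: randomly pick 5 clusters (universities),
# then randomly pick 500 students within each selected cluster.
import random

random.seed(4)
universities = {f"university-{u:02d}": [f"u{u:02d}-student-{s:04d}" for s in range(3000)]
                for u in range(20)}                # hypothetical clusters
chosen = random.sample(sorted(universities), k=5)  # stage 1: sample clusters
sample = [student
          for u in chosen
          for student in random.sample(universities[u], k=500)]  # stage 2
print(len(sample))  # 5 clusters x 500 students = 2,500
```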
In non-probability sampling methods, subjects are selected from the accessible pop-
ulation by non-random selection. This is the most common method used in human
trials. But often it is difficult to assume that they produce accurate and representa-
tive samples of the population, and therefore generalizability to the target population
is limited. This method is often used in preliminary and exploratory studies, and in
studies in which it is difficult to access or identify all members of the population. So,
this method is used only when it is not possible (or not necessary) to use probability
sampling methods. The disadvantage of this technique is that it is less likely to produce
representative samples, which affects generalizability to the target population [7,16].
There are different types of non-probability sampling methods: convenience, consec-
utive, quota, judgmental, and snowball sampling.
Convenience Sampling
Definition and Method
In convenience sampling, participants are selected because they are easily accessible,
and the researcher is not concerned about the representativeness of the sample in the
population. This sample could also be random, but usually it is biased.
Advantages
It can be used when other sampling methods are not possible. In some cases, it may
be the only choice.
It is practical.
Potential Problems
There is usually sampling bias.
The sample is not representative of the population (low external validity, or none).
It is impossible to assess the degree of sampling bias.
Example
Probably one of the most common situations where convenience sampling
occurs is when university students or volunteers are recruited through adver-
tisement, an easy form of recruitment.
Another example would be to study all patients who present to a clinic
within a specific time frame.
Consecutive Sampling
Definition and Method
Consecutive sampling is very similar to convenience sampling. However, in this case, researchers include all subjects from the accessible population who meet the eligibility criteria over a specific time period or until a specified sample size is reached.
Advantages
Compared with convenience sampling, the sample better represents the population.
Potential Problems
There is poor representativeness of the entire population, with little potential to
generalize.
The sample is not based on random selection.
It is impossible to assess the degree of sampling bias.
Example
Suppose you want to study all patients who received treatment A during its first 6 months of use: all eligible patients are consecutively enrolled to form the sample.
Snowball Sampling
Advantages
It is ideal for very small population sizes.
It is ideal when participants are very difficult to locate or contact.
It is easy to implement.
In some cases it is the best sampling method to implement (e.g., when there are no
records of the population).
Potential Problems
The generalizability of the results is questionable.
It is impossible to assess the degree of sampling bias.
Example
Suppose you want to study the prevalence of a certain disease in homeless people or in a rare ethnic group. Asking the first subjects enrolled in the study to refer others could be the best or even the only option the researcher has to identify and access additional subjects.
Scheme 3.1. One possible algorithm for selecting the appropriate sampling method, starting from questions such as whether high precision is required [adapted from 2]. Depending on the answers, the candidate methods range from simple random, systematic, stratified, and cluster sampling to quota, judgmental, and convenience sampling.
Introduction
For obvious practical reasons, it is impossible to conduct a trial on the entire population. Therefore, when designing a clinical trial, it is necessary to choose who is going to be studied. Although it may seem a trivial task, choosing the study population—or sample—correctly might be the difference between success and failure, and will also influence how other researchers and clinicians see your trial.
There are important points that the researcher needs to consider when choosing the right population. First, it is advisable to exclude conditions that might mimic the disease under study but that do not respond to the therapeutic intervention or that need to be treated differently. For instance, a clinical trial testing a new antibi-
otic for bacterial pneumonia should not include patients with viral pneumonia, as
these patients will have a lower response rate or even no response to the new anti-
biotic. Second, there is the issue of competing risk, for instance, patients with other
comorbidities who might worsen because of these conditions and, therefore, con-
found the clinical trials results.
Although it is appropriate to select very carefully and to dedicate some time to choosing the study population, overdoing it by adding more and more inclusion or exclusion criteria might also be dangerous, as it restricts the generalizability, or external validity, of the study: the study population will then differ from the “real world” population seen in clinical practice.
1. Dr. André Brunoni and Professor Felipe Fregni prepared this case. Course cases are developed solely as the basis for class discussion. The situation in this case is fictional. Cases are not intended to serve as endorsements or sources of primary data. All rights reserved to the authors of this case.
There are other issues that need to be addressed when choosing the study
population—one of them is the feasibility of recruitment—would it be feasible to re-
cruit patients, or would the trial need to be interrupted for lack of subjects?
Another important point to consider is ensuring that the research team is able to apply the diagnostic criteria you have developed—for instance, a study might determine that a brain biopsy is necessary to diagnose a brain tumor. This would greatly limit the ability of investigators to enroll patients and might in fact exclude patients with a diagnosis of a brain tumor. The researcher needs to be extremely attentive to
situations in which a sample bias might occur (sample bias occurs when the study
population differs from the target population).
1. The trade-off of internal versus external validity: Prof. Anderson can choose anything
from enrolling all patients with RA to including only patients with severe, advanced
RA. However, if he chooses the first option the sample will be too heterogeneous—
for instance, there will be patients with mild RA that would get better even with a
simple analgesic (he is planning to compare RA-007 against a control group)—and
then the results would tend to go toward the null hypothesis. On the other hand,
targeting a strict population might be good to prove the efficacy of an intervention,
as patients with more severe disease are more likely to respond, but not to increase
the external validity of the study—therefore, this drug would only be approved
by the regulatory agencies or by the insurance companies for specific situations.
Recruiting the Subjects
It is a cold fall morning in Boston, and during the drive from his home to his office,
Prof. Anderson now realizes that he needs to address another important issue: recruit-
ment. On that morning, Prof. Anderson is starting a meeting with his recruitment
team. He sighs quietly, as he knows it will be a long day. “Recruiting is a very hard task,”
he thinks, while hearing the suggestions of his staff. The sample size estimation they
calculated is 1,800 subjects. His thoughts start wandering . . . “I could advertise in the
newspaper . . . or maybe on the Internet . . .” . . . “No, I think it’s best to talk with my
colleagues and ask their patients . . .” . . . “I could use the patients from my severe-RA outpatient clinic . . .” . . . “Or maybe patients from my office . . .” And he knows that he will
have to deal with the annual reports that he will need to submit to NIH describing the
status of his study and recruitment.
Prof. Anderson is aware that there is no perfect recruitment strategy. In addition,
the recruitment strategy will have an important influence on the study population.
The issues of generalizability and target population apply, in fact, at this point. If he
chooses to recruit patients from his outpatient clinic or from his colleagues, his recruitment yield will probably be higher—as these patients have very severe disease and, therefore, are willing to try a new treatment. Advertising will probably reach more patients,
but it’s expensive and also will bring a large number of patients who are not eligible for
the study—therefore increasing the study costs.
Also there is the issue of probability versus non-probability samples. The methods
mentioned would select non-probability samples—for example, only patients who frequent Brigham and Women’s Hospital, only patients with RA who read newspapers, or only those who access the web will be selected—therefore not representing a random sample of the entire RA population with the characteristics defined by the study criteria.
1. What are the challenges for Professor Anderson? Why are these challenges so
important?
Professor Anderson’s concerns are related to the study population for a phase III clin-
ical trial to test the efficacy of drug RA-007. He is so concerned with choosing the
study population because he is aware of the implications that low representativeness of his sample may have, especially when running a phase III study. So, let’s sum-
marize the main challenges: (1) Should he exclude medical conditions that mimic
the disease in study? What are the main risks if he enrolls subjects with conditions
with similar clinical manifestations to the ones that are the objective of the study?
(2) Should he exclude competing risk factors? Why would it be so important to ex-
clude such risk factors? (3) Should he restrict the inclusion criteria in order to have a
more homogenous sample? Or should he broaden the focus, accepting, for instance,
patients in different stages of the disease? (4) What would be the strategy used to di-
agnose subjects? (5) How should the recruitment be done? (6) How can he increase
or guarantee adherence?
These questions that Prof. Anderson posed are important because, depending
on the choices he makes, they will impact the feasibility and validity of the study, and
ultimately its generalizability. These questions include the target population and how
to define it, then how to define further what population will be studied by using in-
clusion and exclusion criteria, and finally, the trade-off between internal and external
validity.
2. Considering the trade-off between internal and external validity, do you think that
Prof. Anderson should consider enrolling a broader sample (for a more heteroge-
neous sample), or should he restrict the eligibility criteria (for a more homoge-
neous sample)?
Prof. Anderson needs to find a balance between how restrictive the eligibility criteria
are and how much he wants to compromise the generalizability of his study. Please
remember what impacts the degree to which results are generalizable from the sample
to the target population. In general, all experiments have some degree of artificiality to
them and it is not possible to create a “perfect” sample. Increasing homogeneity of the
sample may help to understand better a phenomenon in a specific group—however,
at the cost of being able to make reliable inferences about the more heterogeneous target population. In contrast, increasing heterogeneity, by having less restrictive eligibility criteria, may approximate the sample’s characteristics to those of the target population, with the trade-off of increasing the variability of results, introducing more chances of bias, and ultimately risking results that are more difficult to interpret. What is the best trade-off, and how can this be included in the study design?
of accessing correctly diagnosed patients and thus saving time by screening primarily
individuals that meet the inclusion criteria.
4. What type of sampling method do you think is more suitable in this case? Can you
provide reasons?
Prof. Anderson wants to ensure internal validity (controlling for confounders and
bias, like selection bias, recall bias, detection bias). And simultaneously he wants to
increase generalizability by ensuring the representativeness of the recruited sample.
So, probability-sampling techniques may be a more appropriate choice. Within
probability-sampling techniques, Prof. Anderson then may choose simple random,
systematic, stratified, cluster, or disproportional sampling. Concerning the objec-
tive, and each method’s advantages and disadvantages, what would be the more most
suitable method in this case? And, what about non-probability sampling techniques?
What are the main issues and advantages of choosing theses sampling procedures in
this specific case? Please remember that there is no universally applicable sampling
technique. Choosing the ideal sampling technique depends always on several factors,
such as the study objective, the budget available, time, and the accessibility of the pop-
ulation, among others.
FURTHER READING
Sim J, Wright C. Research in health care: concepts, designs and methods. Cheltenham: Nelson
Thornes; 2000.
REFERENCES
1. Gay LR, Mills GE, Airasian PW. Educational research: competencies for analysis and
applications. 9th ed. New York: Pearson; 2008.
2. Urfer R, et al. Phase II Trial of the Sigma-1 Receptor Agonist Cutamesine (SA4503) for
recovery enhancement after acute ischemic stroke. Stroke. 2014; 45: 3304–3310.
3. Ma H, et al. A multicentre, randomized, double-blinded, placebo-controlled Phase III
study to investigate Extending the time for Thrombolysis in Emergency Neurological
Deficits (EXTEND). Int J Stroke. 2012; 7(1): 74–80.
4. Kheirkhah A, et al. Effects of corneal nerve density on the response to treatment in dry eye
disease. Ophthalmology. 2015 Apr; 122(4): 662–668.
5. Cook TD, Campbell DT, Day A. Quasi-experimentation: design and analysis issues for field
settings. Boston: Houghton Mifflin; 1979.
6. Finger MS, Rand KL. Addressing validity concerns in clinical psychology research.
In: Roberts MC, Ilardi SS, eds. Handbook of research methods in clinical psychology. 2.
Malden, MA: Blackwell; 2003: 13–30.
7. Nieswiadomy RM. Foundations of nursing research. 6th ed. New York: Pearson; 2011.
8. Weisberg HI. Bias and causation: models and judgment for valid comparisons.
New York: Wiley; 2010.
9. Heckman JJ. Sample selection bias as a specification error. Econometrica. 1979; 47(1): 153–161.
10. Gaertner SL, Dovidio JF. Reducing intergroup bias: the common ingroup identity model.
Psychology Press; 2014 Apr 4.
11. Wallin P. Volunteer subjects as a source of sampling bias. Am J Sociol. 1949; 54(6): 539–44.
12. Särndal CE, Swensson B, Wretman J. Model assisted survey sampling. Berlin: Springer
Verlag; 2003.
13. Ary D, Jacobs LC, Sorensen C, Razavieh A. Introduction to research in education. 8th ed.
Belmont, CA: Wadsworth; 2009.
14. Parahoo K. Nursing research: principles, process and issues. Basingstoke, UK: Palgrave
Macmillan; 2006.
15. Polit DF, Beck CT. Nursing research: generating and assessing evidence for nursing practice. 8th
ed. Philadelphia: Lippincott, Williams, & Wilkins; 2008.
16. Bryman A, Bell E. Business research methods. 3rd ed. New York: Oxford University
Press; 2011.
17. Moore DS. The basic practice of statistics. 5th ed. New York: W. H. Freeman; 2009.
18. National Audit Office. A practical guide to sampling. London: National Audit Office; 2001.
4
BASIC STUDY DESIGNS
INTRODUCTION
This chapter provides an overview of basic study designs for interventional studies and
introduces important concepts for the design of clinical trials. Later, in Unit III of this
volume, you will learn about types of observational studies and the main differences
between an observational study and a clinical trial. More complex research designs,
such as adaptive designs, will also be covered in Unit III.
STUDY DESIGN
The study design delineates the methodology of how to obtain the answer to the re-
search questions. As Sackett et al. (1997) stated, the question being asked determines
the appropriate research architecture, strategy, and tactics to be used—not tradition,
authority, experts, paradigms, or schools of thought [1,2]. Knowing the advantages
and limitations of each design will play an important role, but the decision of which
one to choose is going to be based on which design can answer the defined research
question with the most compelling evidence—but at the same time, in the most
straightforward and fundamental way.
In its most basic form, the type of study can be described as either experimental or observational. The most important characteristic of an experimental study is the ma-
nipulation of the treatment variable (independent variable) using randomization to
control for confounding. Experimental studies look for a cause–effect relationship, where the investigator systematically introduces a specific change (intervention) and controls everything else to remain the same. Experimental studies and quasi-
experimental studies are interventional studies that differ in the concept of randomi-
zation. Even though randomized clinical trials are the source of the strongest evidence for evidence-based medicine, they are by no means the only or even the most appropriate approach for all clinical research questions [3]. On the other hand, in obser-
vational studies the independent variable (most commonly referred to as exposure)
is not controlled by the investigator; thus its relationship with the outcome (also re-
ferred to as disease) is usually confounded by other variables (see Chapter 16 for more
details). (See Figure 4.1 for a depiction of the main types of study design and their
relationship with manipulation of intervention).
The study design is the methodology used in order to answer the research question.
Experimental studies test the efficacy of a new intervention that can be either ther-
apeutic (e.g., drugs, devices, surgery) or preventive (e.g., vaccine, diet, behavioral
modification). In order to ensure the validity of a study, an attempt must be made to
optimize the design. Therefore, the intervention is usually tested against placebo or
a standard intervention. Another very important concept of an experimental study
is that the patients are allocated at random to each treatment group, including the
control arm. In fact, the administration of the intervention (the independent variable of the study), but not the allocation, is manipulated by the experimenter (for instance, patients receive a given intervention not because of clinical reasons but because of study assignment). Randomization is a sine qua non characteristic of an
experimental study because it is the best method to guarantee that all variables will be
equally distributed between groups, except, naturally, the intervention (see Chapter 5
for Randomization). Therefore, if at the end of the study there is a difference between
Figure 4.1. Main types of study design and their relationship with manipulation of the experimental variable (intervention).
groups, it shall be concluded that such difference occurred due to the intervention.
Other interventional studies not using randomization are considered quasi-experi-
mental studies (quasi: Latin for “almost”). In this type of study, allocation is made
using non-random methods such as allocation by the medical record number, date of
birth, or sequential inclusion. In some cases, however, the researcher can control study
allocation, thus introducing intrinsic bias to the study.
Note: The study design will pre-define what statistical methods you will use to analyze
the study data (see Unit II).
Experimental Designs
Parallel Group Designs
This design is the most common type in experimental studies. It compares two or
more groups that are established by random assignment.
(Figure: interventional studies are divided by whether there is randomization—yes: experimental; no: quasi-experimental, with historical or concurrent controls—and each branch is further classified by the number of independent variables.)
Box 4.1 Parallel group designs—disadvantages:
• Not powerful in accounting for within-patient variability
• Confounding
• Expensive
• Longitudinal, with a long follow-up period
• Requires larger samples
• Dropouts
• Very controlled conditions, different from medical practice
Subjects are randomized to
different treatment arms that can either be the experimental intervention or the con-
trol group—which can be another intervention or placebo, or a combination of both.
The groups are compared based on the measurement of the endpoint of the interven-
tion; it can be pre-test–post-test control group design, where the outcome is measured
at baseline and after the intervention in each group (e.g., visual analogue scale before
and after the intervention to assess pain), or post-test control group, where the out-
come is measured only after the intervention and is compared with the control (e.g.,
time [in days] of hospitalization after surgery). See Box 4.1 for the advantages and
disadvantages of parallel group designs.
The first type consists of one independent variable with different levels, or one
treatment in different formats (for instance, active drug and placebo drug).
Intervention against placebo. This design is used to detect a difference between the intervention and no intervention (i.e., placebo or sham surgery). An important disadvantage of this design may be the ethical concern of using placebo, since it may be unethical to administer a placebo intervention when an available standard treatment already exists. Another potential limitation of this design is the delay in recruitment, as not all subjects will agree to participate in a trial where there is a chance of receiving placebo.
Figure 4.3. Factorial design: two factors with two levels each = 2 x 2 factorial design. To get the main effect of A, we compare the mean of the active intervention A to the mean of the control. We use the same reasoning for the main effect of B, but here we compare the means of the rows' totals. On the other hand, to get the interaction, we focus on the differences within each of the columns and within each of the rows. If these differences are different, then there is an interaction effect.
The first two questions address the main effects, as in the previous designs with only one independent variable. The third question is something unique to this type of study; it allows us to determine whether the use of one intervention affects the other (interaction effect); that is, whether the effect of intervention A varies across the levels of intervention B.
Factorial designs can be very helpful as they can be used to gain efficiency when
studying two different treatments, or they can be used to study the interaction be-
tween two treatments. However, one very important concept here is that a factorial design cannot be used for both goals simultaneously. If there is a positive interaction, the main effect varies according to the value of the other variable; therefore the main effect of each variable can only be assessed when the variables are tested alone. Many factorial trials are not powered to detect this interaction, and a false negative conclusion about the interaction may be reached [4]. The most common use of factorial design is to test for the main effects and not for the interaction.
In factorial design, the options for the biostatistical plan include a two-way or
three-way analysis of variance, or multivariable regression modeling [3].
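To make the contrast arithmetic concrete, the following minimal sketch (not from the chapter; the cell means, standard deviation, and sample size are invented for illustration) computes the two main effects and the interaction contrast for a simulated 2 x 2 factorial trial:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100  # participants per cell (arbitrary)

# Simulated outcomes for the four cells of a 2 x 2 factorial trial.
# Assumed truth (illustration only): A adds 2 points, B adds 1, no interaction.
cells = {
    ("placebo A", "placebo B"): rng.normal(10.0, 3.0, n),
    ("active A",  "placebo B"): rng.normal(12.0, 3.0, n),
    ("placebo A", "active B"):  rng.normal(11.0, 3.0, n),
    ("active A",  "active B"):  rng.normal(13.0, 3.0, n),
}
m = {cell: values.mean() for cell, values in cells.items()}

# Main effect of A: mean of the active-A cells minus mean of the placebo-A cells.
main_A = ((m[("active A", "placebo B")] + m[("active A", "active B")]) / 2
          - (m[("placebo A", "placebo B")] + m[("placebo A", "active B")]) / 2)
# Main effect of B, by the same reasoning on the other factor.
main_B = ((m[("placebo A", "active B")] + m[("active A", "active B")]) / 2
          - (m[("placebo A", "placebo B")] + m[("active A", "placebo B")]) / 2)
# Interaction: does the effect of A differ across the levels of B?
effect_A_at_placebo_B = m[("active A", "placebo B")] - m[("placebo A", "placebo B")]
effect_A_at_active_B = m[("active A", "active B")] - m[("placebo A", "active B")]
interaction = effect_A_at_active_B - effect_A_at_placebo_B

print(f"Main effect of A: {main_A:.2f}")           # close to 2.0
print(f"Main effect of B: {main_B:.2f}")           # close to 1.0
print(f"Interaction contrast: {interaction:.2f}")  # close to 0 (no interaction)
```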
Example 1: Physicians’ Health Study (PHS) I. This study began in 1982 with two objectives:
to test whether aspirin prevented myocardial infarction and other cardiovascular events and
to examine whether beta-carotene prevented cancer.
Trial design was a 2 x 2 factorial design. Arms: active aspirin and active beta-carotene,
active aspirin and beta-carotene placebo, aspirin placebo and active beta-carotene, or aspirin
placebo and beta-carotene placebo.
The efficiency is gained by testing two interventions (two main-effect questions) in the same trial, saving resources by using the same pool of participants and methodology.
Main conclusions of PHS I:
1. Aspirin reduced the risk of a first myocardial infarction by 44% (P < 0.00001) [5].
2. Twelve years of supplementation with beta-carotene produced neither benefit nor harm in terms of the incidence of malignant neoplasms, cardiovascular disease, or death from all causes [6].
Some of the disadvantages of using this design to gain efficiency were the following:
• Many more participants were needed for the study to be adequately powered to detect both main-effect questions.
• The assumption of no interaction needed to be met. The main effects could not have been interpreted if the protective effect of aspirin had been modified by the amount of beta-carotene. There is a formal statistical test to assess interaction, but this test is not very powerful.
Non-parallel Design
Repeated measures design is a specific type of design in which one group of subjects is tested at baseline and then at repeated time points during/after the intervention. These studies are often referred to as longitudinal studies. Two main types of this design can be used: the between-subjects design, in which subjects receive only one intervention but are tested several times, and the within-subjects design, in which subjects receive all the interventions in the same study—also called a cross-over design. The investigator should note, however, that these designs usually involve (and whenever possible should involve) randomization.
Cross-over Design
The cross-over design is the simplest form of this design: a subject is assigned to one intervention, followed by measurement of the outcome variable, and is then assigned to the second intervention, followed again by measurement of the outcome variable. The order is systematically varied among the participants: we randomize participants to define which intervention they receive first (this is what makes it a randomized trial). The greatest advantages of this design are that it reduces the individual variance among participants and increases power, as each participant serves as his or her own control, decreasing the number of subjects needed to test an intervention.
The main weaknesses that are important to consider and address in cross-over
trials are the following:
• Carry-over effect: Subjects can have residual effects of the first intervention as they
undergo the second intervention. Usually this requires a wash-out period, a time
where participants do not receive any intervention, for them to come to the same
baseline before starting the new intervention.
• Practice (learning) effect: Subjects repeat the same measurement method over and over.
• Order effect: Depending on which intervention is being tested first, subjects may
respond differently to the second.
In cross-over design, the efficacy of the intervention over the control is assessed on
the basis of the within-subject difference between the two treatments with regard to
the outcome variable [7]. It can be analyzed by a paired t-test, or by a two-way analysis of variance with two repeated measures, if the data are parametric. Non-parametric data are analyzed by a Wilcoxon signed-rank test. The analysis should include preliminary testing to assess whether the wash-out period was long enough and whether any carry-over effect influenced the results.
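As an illustrative sketch of the analysis described above (the 20 subjects, the simulated pain scores, and the assumed 6-point treatment effect are all invented), a cross-over comparison reduces to a within-subject contrast that can be tested with a paired t-test or a Wilcoxon signed-rank test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 20  # participants, each measured under both conditions (arbitrary)

# Simulated within-subject outcomes, e.g., a pain score under each condition.
baseline = rng.normal(50, 10, n)                # between-subject variability
control = baseline + rng.normal(0, 4, n)        # measurement under control
treatment = baseline - 6 + rng.normal(0, 4, n)  # assumed 6-point improvement

# Parametric analysis: paired t-test on the within-subject differences.
t_stat, p_param = stats.ttest_rel(treatment, control)

# Non-parametric alternative: Wilcoxon signed-rank test on the same pairs.
w_stat, p_nonparam = stats.wilcoxon(treatment, control)

print(f"Paired t-test: t = {t_stat:.2f}, p = {p_param:.4f}")
print(f"Wilcoxon signed-rank: W = {w_stat:.1f}, p = {p_nonparam:.4f}")
```

Because each participant serves as his or her own control, the between-subject variability (the `baseline` term above) cancels out of the paired contrast, which is exactly where the design gains its power.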
Quasi-Experimental Designs
Designs in which group assignment is not randomized, or in which there is no control group at all, are considered quasi-experimental designs. These designs are otherwise very similar to experimental designs; we will describe two types specifically: the one-group design and the non-equivalent control group design.
One-Group Design
In the one-group design, one set of repeated measures is taken before and after treatment in one group of subjects. Here the outcome variable is compared between the two points of assessment (pre-test and post-test). It resembles a repeated measures design, but there is no randomization and no control group, as all subjects receive the intervention [3].
Other Designs
N-of-1
The common definition of an N-of-1 trial is a single or multiple cross-over study performed in a single individual. According to a systematic review [9], N-of-1 trials serve three purposes:
1. They bridge the gap between the broad probabilities established in large parallel
trials and treatments that work in an individual patient.
2. The individual treatment effects estimated from a series of N-of-1 trials can be combined across patients to provide an estimate of the average treatment effect across the patients participating in these trials. Therefore, N-of-1 trials can supplement or substitute for traditional parallel-group randomized controlled trials (RCTs) as a way to estimate the average treatment effect.
3. They provide an estimate of heterogeneity of treatment effects across patients.
SPECIAL CONSIDERATIONS
Designs for Rare Diseases: Are They Different?
One difficult challenge in clinical research is the study design for rare diseases (diseases
with low prevalence). The problem is not trivial; according to Ravani et al. [10], there
are approximately 6,000 rare diseases identified in the United States. In 1983 the
US Congress passed the “Orphan Drug Act” [11]. This landmark act instructs the
US Food and Drug Administration to label a disease as “rare” if it has a prevalence
of <200,000 persons in the United States. In recent years, the revolution of the Internet and the ability to bring together patients with rare disorders in centers and groups have provided new opportunities for their study.
Although RCTs are the gold standard, sometimes it is not possible to recruit enough patients to run an RCT for a rare condition. What should you do in this situation? In fact, Ravani et al. list some of the situations in which an RCT might not be used (for instance, large treatment effect, lack of equipoise, rare outcome) and also give some alternative designs. Box 4.2 summarizes the challenges and potential study designs for rare diseases.
recombinant human α-glucosidase every two weeks over a 3-year period. No infusion-
associated reactions were observed. Pulmonary function remained stable (n = 4) or
improved slightly (n = 1). Muscle strength increased. Only one patient approached
the normal range. Patients obtained higher scores on the Quick Motor Function Test.
None of the patients deteriorated. Follow-up data of two unmatched historical cohorts
of adults and children with Pompe disease were used for comparison. They showed an
average decline in pulmonary function of 1.6% and 5% per year. Data on muscle strength
and function of untreated children were not available. Further studies are required.”
Phase I
• Open-label parallel design: Phase I, Multicenter, Open-label, Dose-escalating, Clinical and Pharmacokinetic Study of PM01183 in Patients With Advanced Solid Tumors. NCT00877474 (19).
• Double-blind randomized parallel design: A Phase Ib Double-blind Randomized Placebo Controlled Age-de-escalating Trial of Two Virosome Formulated Anti-malaria Vaccine Components (PEV 301 and PEV 302) Administered in Combination to Healthy Semi-immune Tanzanian Volunteers. NCT00513669 (20).
• Crossover: A Phase I, Single-Centre, Double-Blind, Randomized, Placebo-Controlled, Three-Period, Three-Way Crossover Study of the Hemodynamic Interactions of Avanafil and Alcohol in Healthy Male Subjects. NCT01054859 (21).
• Factorial: A Phase I Lead-in to a 2x2x2 Factorial Trial of Dose Dense Temozolomide, Memantine, Mefloquine, and Metformin as Post-Radiation Adjuvant Therapy of Glioblastoma Multiforme. NCT01430351 (22).
• Historical controls: Phase I/II Multicenter Trial of Intra-Arterial Carboplatin and Oral Temozolomide for the Treatment of Recurrent and Symptomatic Residual Brain Metastases. NCT00362817 (23).
Phase II
• Open-label: An Open-label, Phase II Trial of ZD1839 (IRESSA) in Patients With Malignant Mesothelioma. NCT00787410 (24).
• Parallel (active control): A Randomised, Double-blind, Parallel Group, Multi-centre, Phase II Study to Assess the Efficacy and Safety of Best Supportive Care (BSC) Plus ZD6474 (Vandetanib) 300 mg, BSC Plus ZD6474 (Vandetanib) 100 mg, and BSC Plus Placebo in Patients With Inoperable Hepatocellular Carcinoma (HCC). NCT00508001 (25).
• Crossover: A Phase 2, Dose-finding, Cross-over Study to Evaluate the Effect of a NES/E2 Transdermal Gel Delivery on Ovulation Suppression in Normal Ovulating Women. NCT00796133 (26).
• Factorial: A Randomized, Double-Blind, Placebo-Controlled, 3/6 Factorial Design, Phase II Study to Evaluate the Antihypertensive Efficacy and Safety of the Combination of Fimasartan and Amlodipine in Patients With Essential Hypertension. NCT01518998 (27).
• N-of-1: Serial Controlled N-of-1 Trials of Topical Vitamin E as Prophylaxis for Chemotherapy-Induced Oral Mucositis in Pediatric Patients. NCT00311116 (28).
Phase III
• Parallel, crossover, and factorial designs are common.
• Historical controls and open label: Open Label, Phase III Study of NABI-IGIV 10% [Immune Globulin Intravenous (Human), 10%] in Subjects With Primary Immune Deficiency Disorders (PIDD). NCT00538915 (29). Primary outcome measure: to assess the efficacy of Nabi-IGIV 10% in preventing serious bacterial infections (SBIs) compared to historical control data.
Phase IV
• Again, parallel, crossover, and factorial designs are common.
• Open-label: An Open Label, Multi Centre Phase IV Study of Adefovir Dipivoxil in Korean Patients With Chronic Hepatitis B (CHB). NCT01205165 (30).
These examples illustrate how we can find different types of designs in all study
phases. However, you should know when it is common and scientifically acceptable to use certain designs. Again, the choice depends on what you are asking and in what population; as such, you can still have a Phase III open-label trial if the disease is rare or life threatening, or if there are no available controls, among other reasons.
Why do we need them?
Introduction
Experimental studies are designed to test the efficacy of a new intervention against pla-
cebo or a standard intervention. The most important aspect of an experimental study
is that the patients are allocated at random to each treatment group. In fact, the inde-
pendent variable of the study (for instance, the intervention) needs to be manipulated
by the experimenter (for instance, the patients receive a given intervention not be-
cause of clinical reasons but because of study assignment) and, in addition, a control
or comparison group is necessary. Randomization is a sine qua non characteristic of an
experimental study because it is the best method to guarantee that all variables will
be fairly distributed between groups, except, naturally, the intervention—therefore,
if, at the end of the study, there is a difference between groups, it shall be concluded
that such difference occurred due to the intervention. In fact, in experimental designs,
the goal is to reduce random variation and systematic error, and to increase preci-
sion. Other interventional studies not using randomization are considered quasi-
experimental studies.
There are several variations of RCTs, though the most frequently used is the de-
sign in which patients are allocated into two parallel groups and their endpoint scores
are compared between groups. There are other designs that can bring some benefits
according to the study design (such as a cross-over design)—in fact, the researcher needs to consider the advantages and disadvantages of each design before deciding on the final study design.
1. Dr. André Brunoni and Professor Felipe Fregni prepared this case. Course cases are developed solely as the basis for class discussion. The situation in this case is fictional.
Dr. Garden can be considered a successful clinician and researcher. With almost 40 articles published in journals of relatively high impact, he has acquired satisfactory experience in running clinical trials. Although he has run studies in a variety of dermatologic diseases, his great passion is psoriasis—it was the reason he chose dermatology as a specialty. Dr. Garden sent an email scheduling a meeting with his three postdoctoral fellows at his office to discuss the project. When they arrived at the conference room, they noticed that Dr. Garden had written the potential study designs on the whiteboard in large capital letters. They realized that this would be a long meeting.
Massimo Rossini, a postdoctoral fellow from Italy, begins, “Well . . . I would
go for a classic RCT to compare P-SOLVE against placebo (we use an inert skin
cream). We don’t have many patients and we might not achieve a significant effect
size with two active drugs. Besides, severe psoriasis is a disease with no satisfactory
treatment . . . so my idea is P-SOLVE versus placebo—this would be a cleaner and better strategy!”
flexible dosages to reach efficacy (e.g., lithium for bipolar disorder, which should be adjusted according to serum levels), then blinding might be an issue (although it is possible to have a blinded strategy for dose adjustment, this adds complication to the trial design). Another threat to blinding is that physicians could easily “guess” which patients were on the standard treatment. Finally, the target population should exclude patients who have already used the standard treatment; otherwise this design would favor the new treatment (since the standard treatment would be given to patients in whom it had already proven ineffective).
Dr. Garden is writing frenetically on the whiteboard. The postdoctoral fellows are
anxious, thinking about the pros and cons of their ideas. Suddenly, Dr. Garden stops
writing, walks to the window, and asks without turning his eyes away from the beau-
tiful garden in front of his office: “Although I liked your suggestions, let us explore all
the options. I would also like to hear your ideas on cross-over studies.”
Massimo immediately says, “Well, a cross-over might increase the efficiency of our
study as the within-subject variability is smaller than between-subject variability. But there
is the issue of carry-over effects and therefore data analysis should be planned carefully.”
CASE DISCUSSION
In order to select the best approach, it is important to put the disease into context.
Some of the main points Dr. Garden should consider are the following:
power to detect a difference. Will the number of patients limit the designs available to him?
3. What has been done for this disease? Examples of previous trials:
a. “Efficacy and safety results from the randomized controlled comparative
study of adalimumab vs. methotrexate vs. placebo in patients with psoriasis
(CHAMPION)” [15].
b. “Phase 3: A randomized, double-blind, double-dummy, placebo controlled,
multicenter study of subcutaneous Secukinumab to demonstrate efficacy after
twelve weeks of treatment, compared to placebo and Etanercept, and to assess
the safety, tolerability and long-term efficacy up to one year in subjects with
moderate to severe chronic plaque psoriasis. (ClinicalTrials.gov Identifier:
NCT01358578)” [16].
The advantages and disadvantages of each study design for Dr. Garden's trial are summarized in Table 4.1.
Conclusions: Based on our discussion, since it is a phase II trial, where the main
objective is to assess efficacy and safety and where there is a limitation of the number
of patients available, the simplest design would be to select a randomized, double-
dummy design of P-SOLVE versus placebo. Placebo would be an acceptable option
since the outcomes of the intervention would be tested fairly quickly and participants
would then receive additional interventions. If the phase II is positive, a phase III trial
design would be done to test P-SOLVE against other interventions available (parallel
design or factorial design). Theoretically, cross-over is an appealing option; however, the carry-over and order effects might be important confounders in a disease with rapid response to interventions.
REFERENCES
1. Wypij D, ed. Clinical trials: basic study design. Class Lecture. Principles and practice of clin-
ical research. Boston, MA. May, 2012.
2. Sackett DL, Wennberg JE. Choosing the best research design for each question. BMJ. 1997 Dec 20–27; 315(7123): 1636. PubMed PMID: 9448521. PubMed Central PMCID: 2128012.
3. Portney L, Watkins M. Foundations of clinical research: applications to practice, 3rd ed. Upper
Saddle River, NJ: Pearson Prentice Hall; 2009.
4. Green S, Liu PY, O’Sullivan J. Factorial design considerations. J Clin Oncol. 2002 Aug 15;
20(16): 3424–3430. PubMed PMID: 12177102.
5. Steering Committee of the Physicians' Health Study Research Group. Final report on the aspirin component of the ongoing Physicians' Health Study. N Engl J Med. 1989; 321(3): 129–135. PubMed PMID: 2664509.
6. Hennekens CH, Buring JE, Manson JE, Stampfer M, Rosner B, Cook NR, et al. Lack of effect of long-term supplementation with beta carotene on the incidence of malignant neoplasms and cardiovascular disease. N Engl J Med. 1996; 334(18): 1145–1149.
INTRODUCTION
Randomization is a key feature of randomized controlled trials (RCT), which are
considered the gold standard in evaluating the efficacy of new interventions. In this
chapter, we will discuss what randomization is, why it is important, methods of ran-
domization, and their advantages and disadvantages. We will also discuss a case that
illustrates the options faced by researchers when designing an RCT and choosing the
randomization method.
WHAT IS RANDOMIZATION?
Randomization is the process of allocating study participants to one of the study
groups, in which each participant has an equal chance of being allocated to the
treatment or control group [1]. When randomization is properly conducted, neither
the investigator nor the participants can foresee the group to which the participant
will be assigned, nor can they interfere with allocation. Randomization ensures that
treatment groups in a clinical trial are comparable in terms of known and unknown
risk factors, since participants with a given set of risk factors have equal chances of
being allocated to the control or the intervention (treatment) group [2].
In clinical practice, and in observational studies, treatment is determined by the
patient’s clinician, and/or the patient’s preferences. As a result, it is common that patients with more severe disease are treated with a more aggressive strategy than asymptomatic patients [1]. For example, in an observational study of the use of inhaled corticosteroids for asthma and the risk of asthma exacerbation, we could expect participants with moderate asthma to be more likely to be using daily inhaled corticosteroids than asymptomatic participants. So, if the study showed that the use of daily corticosteroids was associated with a greater risk of exacerbation, this finding could be erroneously attributed to the medication use rather than to the baseline disease severity (as sicker patients were given the medication). For non-randomized
studies evaluating invasive procedures, for example surgery, participants with better
overall health (such as younger participants with no comorbidities) might be more
likely treated with surgery, while older, sicker participants might be treated with a
less invasive strategy. In these two examples, the two groups are not comparable at
the beginning of the trial, and treatment group allocation is influenced by baseline
characteristics.
• The researcher and the participant must be unable to predict their allocation
group—what we call allocation concealment.
• The researcher must be unable to change a participant’s allocation, once he or she
has been randomized.
You may be thinking that if allocation is randomized, there is no way to predict the
next participant’s allocation. However, in order to use randomization in practice, a ran-
domization list must be generated, which contains the sequence of allocation for all
trial participants. As we will see when we discuss methods of randomization, this list
cannot be available to the researchers who are recruiting and registering participants
for the trial. It may be known only by a pharmacist who delivers medication, or it can
be used by someone not involved in participant recruitment to prepare numbered,
opaque sealed envelopes, or it can be created by a computerized system of registra-
tion and randomization of participants. In any case, if the researcher can guess the
allocation group for the next participant, he or she might be able to select participants.
For example, suppose that a researcher is conducting a trial to test a new rehabilitation
program for patients with traumatic brain injury, and preliminary data on a small se-
ries of patients led her to believe that the new treatment works. Now suppose she has
access to the randomization list and knows that the next participant will be allocated
to the control arm. If the next eligible participant is a young patient with a severe disa-
bility, whom she feels would benefit from the new treatment, she might not discuss the
trial with this patient until other participants were registered, and the next participant’s
group was known to be active treatment.
Therefore, for randomization to truly prevent selection bias, there needs to be al-
location concealment: allocation group of the next participant must be unknown to in-
vestigator and participants [4]. Moreover, once the participant is randomized to one
of the study groups, allocation cannot be changed by the investigator. For example,
suppose that randomization was being done by the researcher flipping a coin every
time a new participant was to be randomized to treatment or control. If the researcher
from the rehabilitation trial flipped the coin and it landed on tails (control), she might flip the coin again and again until it landed on heads (treatment). That is why the randomization procedure should be implemented in a way that this type of manipulation
is not possible. We will discuss methods for implementing randomization later in this
chapter.
METHODS OF RANDOMIZATION
There are many methods of randomization, each with its advantages and disadvantages.
In this chapter, we will discuss simple randomization, blocked randomization, strati-
fied randomization, and adaptive randomization.
Simple randomization is one of the most commonly used methods of randomi-
zation, because it is easily implemented and inexpensive. In simple randomization,
a random digit table, usually generated by a computer, is used to generate the ran-
domization list. Several computer programs, including Stata, can generate a random
digit table. The number of digits in the table is set as the number of participants to
be enrolled in the trial. The table contains digits from 0 to 9, and the sequence of the
digits is random. Each digit is then mapped to a study group. For example, suppose you determine that 0 to 4 corresponds to treatment (T) and 5 to 9 corresponds to control (C). Figure 5.1, panel 1A, depicts the random digit table generated
for a trial with 24 participants, and the randomization list resulting from it.
The advantages of simple randomization are that it is inexpensive and easy to implement, and that allocation concealment is a natural feature of simple
randomization, since every participant has an equal chance of being randomized to
treatment or control group. The major disadvantage of simple randomization is that,
for small sample sizes (<100), there is a considerable chance of imbalances in the
number of participants randomized for each group [1]. For trials of 20 participants,
for example, the chance of an imbalance of having 6 participants or less in one of the
groups is approximately 11%. Simple randomization can also lead to imbalances in
terms of important baseline covariates between groups, since it does not take any
baseline characteristic into account when randomizing participants. For example,
in a trial of a new drug to treat congestive heart failure (CHF) planning to include
1A
Random numbers: 8 8 4 8 3 0 5 4 8 1 3 6 3 7 6 2 5 9 8 2 9 5 6 7
Randomization list: C C T C T T C T C T T C T C C T C C C T C C C C
Subject ID: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
1B
Random numbers: 3 6 5 2 5 3
Randomization list: C T T C T C C T T C T C C T C T T C T C C T T C
Subject ID: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
1C
Random numbers: 7 3 17 1 6
Randomization list: C C C T T T C T T C T C C C T T C C T T T C C T
Subject ID: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
1D
Stage: moderate
Random numbers: 4 2 6
Randomization list: T T C C C T C T T C C T
Subject ID: 1 2 4 6 7 10 12 15 16 18 19 21
Stage: severe
Random numbers: 5 3 1
Randomization list: T C T C C T T C C C T T
Subject ID: 3 5 8 9 11 13 14 17 20 22 23 24
Figure 5.1. Methods of randomization for a two-arm trial comparing treatment (T) and control (C).
A) Simple randomization: A computer generates a sequence of 24 random numbers from 0 to 9. The
investigator pre-determines that 0 to 4 will correspond to treatment (T) and 5 to 9 will correspond to
control (C). This random sequence results in 15 subjects being assigned to the control arm and 9 to
the treatment arm;
B) Blocked randomization: Blocks have a size of 4. There are 6 possible combinations of the order
of randomization for these 4 patients: 1) CCTT; 2) CTCT; 3) CTTC; 4) TTCC; 5) TCTC; and
6) TCCT. A computer generates a list of 6 random numbers, from 1 to 6, each corresponding to one of the 6 possible blocks of 4 participants. This random sequence results in 12 subjects being assigned
to the control arm and 12 to the treatment arm;
C) Blocked randomization with variable block sizes: Blocks have a size of 4 or 6. There are 26 possible
combinations for blocks of 4 or 6 participants: 1) CCTT; 2) CTCT; 3) CTTC; 4) TTCC; 5) TCTC;
6) TCCT; 7) CCCTTT; 8) CCTCTT; 9) CCTTCT; 10) CCTTTC; 11) CTCCTT; 12) CTCTCT;
13) CTCTTC; 14) CTTCCT; 15) CTTCTC; 16) CTTTCC; 17) TCCCTT; 18) TCCTCT; 19) TCCTTC;
20) TCTCCT; 21) TCTCTC; 22) TCTTCC; 23) TTCCCT; 24) TTCCTC; 25) TTCTCC; 26) TTTCCC.
A computer generates a list of random numbers, from 1 to 26, each corresponding to one of the 26 possible blocks of 4 or 6 participants. This random sequence results in 12 subjects being assigned to the control arm
and 12 to the treatment arm, and allocation concealment is preserved.
D) Stratified randomization with blocks: Eligible patients are first separated into two strata according
to stage of disease. Each stratum has a separate randomization list. For each stratum, the computer
generates a list of 3 random numbers, from 1 to 6, each corresponding to one of the 6 possible blocks
of 4 participants. This random sequence results in 12 subjects being assigned to the control arm and
12 to the treatment arm. There is balance in sample size between groups, and balance for the stage of
disease between treatment groups.
40 participants, simple randomization might result in more participants with severe CHF
in the treatment group than in the control group just by chance. As a result, outcomes
might be worse for the treatment group than the control group even if the intervention
(the new drug) was effective, because participants in the treatment group were sicker
than those in the control group. Therefore, simple randomization is not a good option
for trials with small sample sizes, or when balance for key covariates is desired.
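The mechanics are easy to sketch in code. The snippet below is an illustration, not the book's own software (the seed is arbitrary, and the digit-to-group mapping follows the Figure 5.1 example); it generates a random digit table for 24 participants and also reproduces the roughly 11% imbalance probability quoted above for a 20-participant trial:

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(2024)
n = 24  # participants, as in the Figure 5.1 example

# Simple randomization: one random digit (0-9) per participant;
# digits 0-4 map to treatment (T), digits 5-9 to control (C).
digits = rng.integers(0, 10, size=n)
allocation = np.where(digits <= 4, "T", "C")
print("Random digits:     ", " ".join(map(str, digits)))
print("Randomization list:", " ".join(allocation))
print(f"T = {np.sum(allocation == 'T')}, C = {np.sum(allocation == 'C')}")

# Chance of a marked imbalance in a 20-participant trial: the probability
# that one group ends up with 6 participants or fewer (a 14:6 split or worse).
p_imbalance = binom.cdf(6, 20, 0.5) + (1 - binom.cdf(13, 20, 0.5))
print(f"P(6 or fewer in one group, n=20): {p_imbalance:.1%}")  # about 11.5%
```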
Blocked randomization is a randomization method that randomizes participants
within blocks, instead of randomizing each participant individually. The blocks have a
predetermined size, for example blocks of 4 or 6 participants, and within each block,
there is balance in terms of number of participants in each group. For example, suppose
that we use blocked randomization with block sizes of 4 for our trial of 24 participants.
Each block will have 4 participants—2 must be allocated to the treatment group and
2 to the control group. There are six possible combinations of the order of randomi-
zation for these 4 participants: CCTT, CTCT, CTTC, TTCC, TCTC, and TCCT. To
generate a randomization list, the randomization program will randomly select one of
the six combinations for each block of 4 participants. Therefore, the randomization list
is a random sequence of these 6 possible blocks. For a trial with 24 participants, the
program will randomly select one of the combinations six times. Figure 5.1, panel 1B,
depicts the random blocks generated for a trial with 24 participants, and the random-
ization list resulting from it.
The major advantage of blocked randomization is that it will result in balanced
groups in terms of number of participants per group. If the trial is terminated before
the completion of a block, an imbalance may occur, but it will be small. For our 24-participant trial, for example, the worst-case scenario if the last block was interrupted after 2 participants would be 10 participants in one group versus 12 participants in the other group.
Blocked randomization can have a disadvantage: if the trial is not blinded, and
researchers know the assignment of previous participants and the block size, they
might be able to predict the allocation of the next participant, compromising allo-
cation concealment and allowing for selection bias [2]. For example, suppose that
in a trial comparing surgery with medical treatment, the researcher knows that the
block size is four and first three participants were randomized to surgery, control,
control. He will then deduce that the next participant is going to be randomized to
surgery, since blocks are balanced (two treatments and two controls). To overcome
this problem, the block size should be variable within the trial (i.e., blocks can have
random sizes of 4 or 6, for example). Figure 5.1, panel 1C, depicts the random blocks
with variable block sizes generated for a trial with 24 participants, and the randomiza-
tion list resulting from it.
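A minimal sketch of this procedure follows (illustrative only; the function names and seed are arbitrary assumptions). It draws complete balanced blocks at random, with fixed blocks of 4 or variable blocks of 4 or 6 as described above:

```python
import itertools
import random

def balanced_blocks(size):
    """All distinct orderings of a block with size//2 T's and size//2 C's."""
    half = size // 2
    return sorted(set(itertools.permutations("T" * half + "C" * half)))

def blocked_randomization(n, block_sizes=(4,), seed=0):
    """Build a randomization list of length n from randomly chosen balanced blocks."""
    rng = random.Random(seed)
    allocation = []
    while len(allocation) < n:
        size = rng.choice(block_sizes)             # variable sizes hide block ends
        block = rng.choice(balanced_blocks(size))  # e.g., one of 6 blocks for size 4
        allocation.extend(block)
    return allocation[:n]

# Fixed blocks of 4 (6 possible blocks), as in Figure 5.1, panel 1B:
print("".join(blocked_randomization(24, block_sizes=(4,))))
# Variable blocks of 4 or 6 (26 possible blocks), as in panel 1C:
print("".join(blocked_randomization(24, block_sizes=(4, 6))))
```

With variable block sizes, the end of one block is harder for an unblinded researcher to infer, which is what protects allocation concealment.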
Blocked randomization will lead to balance in sample size between groups; how-
ever, it may still result in imbalances for important covariates between the groups,
since baseline characteristics are not taken into account in the randomization process.
Stratified randomization is an alternative method when baseline covariates have
a strong impact on the study outcome, and balance for these covariates between
the groups is important. In stratified randomization, eligible participants are first
separated into strata according to a baseline characteristic, for example, gender or stage of disease. Each stratum has a separate randomization list, and after being categorized into one of the strata, the participant is randomized to the control or treatment arm. With this method, each participant in each stratum has an equal chance of being allocated to the control or treatment group, and if the sample size is big enough, there will be balance in terms of the number of subjects allocated to control or treatment within each stratum.
Stratification is commonly associated with blocked randomization, so that
participants are first categorized into one of the strata, then randomized in blocks
within that stratum [5]. In this method, we can achieve balance in sample size be-
tween groups with the block, and balance for the covariate with the stratification.
Figure 5.1, panel 1D, depicts stratified, blocked randomization generated for a trial
with 24 participants, and the randomization list resulting from it. The block has a size
of four, and stratification is for stage of disease.
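As a rough sketch of stratified blocked randomization (assuming two strata and blocks of 4, mirroring Figure 5.1, panel 1D; the class and names are illustrative), each stratum simply keeps its own blocked list:

```python
import itertools
import random

rng = random.Random(1)
BLOCKS_OF_4 = sorted(set(itertools.permutations("TTCC")))  # the 6 balanced blocks

class StratifiedRandomizer:
    """Keeps one blocked randomization list per stratum (e.g., disease stage)."""
    def __init__(self, strata):
        self.pending = {stratum: [] for stratum in strata}

    def assign(self, stratum):
        # When a stratum's list is exhausted, append a freshly drawn balanced block.
        if not self.pending[stratum]:
            self.pending[stratum] = list(rng.choice(BLOCKS_OF_4))
        return self.pending[stratum].pop(0)

randomizer = StratifiedRandomizer(["moderate", "severe"])
for patient_id, stage in enumerate(["moderate", "severe", "moderate", "severe"], 1):
    print(f"Patient {patient_id} ({stage}): {randomizer.assign(stage)}")
```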
Stratification and blocks will result in balanced groups as long as there are not too
many strata, so that each stratum has a minimum number of subjects. Otherwise, it
can actually lead to imbalances. For example, for our trial of 24 participants, if we de-
cided to stratify by disease severity, gender, and age (greater or less than 65), there would be eight strata (2 x 2 x 2). But since there are probably fewer severe than moderate patients in
our accessible population, we might have only two older females with severe disease.
If the first block randomly chosen for this stratum was TTCC, it would be imbalanced, with both participants allocated to the treatment group [5].
Therefore, stratification should be done with few strata, only the most relevant
ones, preferably using dichotomous variables, instead of choosing an arbitrary cut-off
for continuous variables. Balance is expected to begin to fail with stratified blocked
randomization when the number of strata approaches one-half of the sample size.
Keeping the number of strata to a minimum is also important for the sample size calcu-
lation, because covariates used for stratification must be entered into the multivariate
analysis, if it is planned, and approximately 10 outcomes are needed for each variable
entered in the multivariate analysis model [6]. Variables typically used for stratified
randomization include study site in multicentric studies, and disease severity.
Adaptive randomization is a generic name for randomization methods that use algorithms taking baseline covariates and the allocation of previous participants into consideration when allocating the next participant [7,8]. We will discuss
the most popular of the adaptive methods, minimization.
The minimization method uses a computerized algorithm that lists the baseline
covariates of interest for the participant, evaluates the balance (or imbalance) of these
covariates among the participants already included in each of the study groups, and
then allocates the participant to the control or the treatment arm in order to minimize
the imbalances [7].
Let’s suppose for a hypothetical trial—with two covariates, gender and disease
severity—that there are 11 participants already included in the trial and the next par-
ticipant to be randomized is a male with severe disease (see Table 5.1). The minimi-
zation algorithm will take into account the total number of participants in each group,
as well as their distribution across the covariates, to assign this next participant to the
control or treatment group. In total, there are 6 participants in the treatment group and
5 participants in the control group, so if balance in terms of number of participants in
each group was the only important factor, this participant should be randomized to
the control group. However, there are two severe patients in the control group and
only one severe patient in the treatment group, so to obtain balance in the number of
93 Chapter 5. Randomization
severe patients in each group the participant should be randomized to the treatment
group. The algorithm takes into account the imbalances for each covariate and assigns
the participant to one of the groups in order to minimize imbalances. Some algorithms
assign scores for each covariate and add up the total score for each group in order to
make the decision; others use other strategies. For all the algorithms, the rationale is to weigh all the covariates and try to balance the groups as the trial progresses.
The minimization method allows for more covariates to be taken into account to in-
fluence randomization than traditional stratification, and usually generates balanced
groups in terms of most of the covariates.
A disadvantage of the minimization technique is that it requires the implementation of the algorithm using a computerized system for randomization. The randomization
list cannot be generated in advance and used to prepare sealed envelopes, since the
allocation of previous participants is used to determine the allocation of the next one.
As a result, the computerized system must be available at the moment when the par-
ticipant is considered eligible for the study and needs to be randomized. If the trial is
multicentric, the system must be web-based, online, and accessible to all centers at
all times. Another potential disadvantage for non-blinded trials is that investigators
may be able to predict the allocation of the next participant, leading to selection bias.
To overcome this problem, the algorithm may be programmed to use a “biased coin”
strategy: it determines the group allocation that would minimize imbalances, and
then randomizes the participant with a 70% or 60% chance of being allocated to that
group, but with 30% or 40% chance of being allocated to the other group. This strategy
protects allocation concealment, but may lead to small imbalances for trials with small
sample sizes.
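A hedged sketch of one simple minimization rule follows (the covariates, the scoring, and the 70/30 coin are illustrative assumptions; production systems differ in how they score imbalance). The algorithm totals the covariate imbalances that each possible assignment would create, prefers the group with the smaller total, and then applies the biased coin described above:

```python
import random

rng = random.Random(42)

# Running covariate counts per group (the covariate levels are illustrative).
counts = {
    "treatment": {"male": 0, "female": 0, "moderate": 0, "severe": 0},
    "control":   {"male": 0, "female": 0, "moderate": 0, "severe": 0},
}

def minimize(covariates, p_coin=0.7):
    """Assign the group that minimizes total covariate imbalance, with a biased coin."""
    def score(group):
        other = "control" if group == "treatment" else "treatment"
        # Total absolute imbalance across this participant's covariates,
        # computed as if he or she joined `group`.
        return sum(abs(counts[group][c] + 1 - counts[other][c]) for c in covariates)
    preferred = min(("treatment", "control"), key=score)
    other = "control" if preferred == "treatment" else "treatment"
    # Biased coin: choose the minimizing group with probability p_coin (e.g., 70%).
    assigned = preferred if rng.random() < p_coin else other
    for c in covariates:
        counts[assigned][c] += 1
    return assigned

# The next participant is a male with severe disease, as in the Table 5.1 example:
print(minimize(["male", "severe"]))
print(minimize(["female", "moderate"]))
```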
Figure 5.2. Flow chart of the options of randomization method based on sample size and the number of important baseline covariates to be balanced among the treatment arms (leading to simple, blocked, stratified, or adaptive randomization).
technology (IT) support. Figure 5.2 shows a flow chart of the many options of ran-
domization methods and potential limitations, especially in small sample size trials.
IMPLEMENTING THE RANDOMIZATION
METHOD: ENSURING ADEQUATE
ALLOCATION CONCEALMENT
After choosing the best randomization method for a particular study, investigators
need to plan how to implement it. For example, if sealed envelopes will be used, who
will be responsible for preparing the envelopes? If a telephone-based randomization
service is chosen, how will it be put into practice? In this section, we discuss some
practical issues related to randomization.
Envelope systems are widely used because this system can be developed locally,
with little cost. To put such a system into practice, the investigators should follow
these steps:
In some cases, a randomization list, without envelopes, can be available for a phar-
macist, for example. If that is the case, make sure that investigators have no access to
the list and that the person who has access to it understands the need for allocation
concealment.
Telephone and web-based randomization systems are usually provided by companies that offer their services for a fee. Computerized systems can also be
created by the trial staff, of course, as long as members of the team are experts in
computer programming and web design; they will have to work together with the
trial statistician.
CASE STUDY: RANDOMIZATION:
THE DILEMMA OF SIMPLICITY VERSUS
IMBALANCES IN SMALL STUDIES
Felipe Fregni
Professor Luigi Lombardi is sitting in his office after a full day of work. He is trying to
finish his grant proposal, but he is having trouble writing the randomization method
section. He had an unpleasant experience in the past when he submitted a paper from
one of his studies and his method of randomization was highly criticized by one of the
reviewers. He did not want to repeat the same mistake this time. He is therefore taking
extra time to write this section. He is staring at his computer screen and going over
all his options. It is almost 11 p.m. and the last train to his house is about to leave. He
packs his notes and starts walking to the train station, but his thoughts are focused on
how to resolve this issue.
Introduction
In clinical research, one of the goals is to compare two (or more) different groups of
subjects that have been exposed to different interventions or have different risk factors.
In clinical trials, these groups might receive two different interventions (e.g., drug A vs.
drug B, or a placebo vs. experimental treatment). If there are two groups, it becomes
critical to define how subjects are allocated to such groups. If the treating physician or
the researcher interacting with the subjects is allowed to decide allocation, then bias
will probably occur as sicker patients (or the patients with an increased likelihood of
response) might be preferentially allocated to the active, experimental treatment. This effect can
typically be observed in observational studies, where usually the treating physician
decides the treatment based on clinical characteristics and personal preferences, and
is called selection bias.
For instance, in a previous observational study comparing three antipsychotic
drugs (risperidone, olanzapine, and clozapine) for the management of chronic schizo-
phrenia, physicians chose the drug that participants were to receive. In fact, significant differences across the three treatment groups were seen for illness duration and number of hospitalizations: participants taking clozapine had a longer illness duration and a higher number of hospitalizations compared to those taking olanzapine.1
Therefore, unbiased methods of allocating participants are necessary to produce
comparable groups of intervention. In addition, most of the statistical tests are based
on the notion that groups are comparable; therefore randomization becomes a critical
issue to increase the internal validity.
However, randomization comes with a price: research subjects (and investigators)
need to accept the fact that they may not be aware of the treatment group until the trial
is over (if it is a blinded trial), and their treatment will be determined by randomization, not personal choice. In addition, randomization is not always easy to implement and can also add bias to a study.
1. Strous RD, Kupchik M, Roitman S, Schwartz S, Gonen N, Mester R, Weizman A, Spivak B. Comparison between risperidone, olanzapine, and clozapine in the management of chronic schizophrenia: a naturalistic prospective 12-week observational study. Human Psychopharmacology: Clinical and Experimental. 2006 Jun 1; 21(4): 235–243.
There are several methods of randomization. They can be divided into two main
categories: fixed allocation randomization and adaptive randomization. In the fixed
allocation randomization, subjects are allocated with a fixed probability to the study
groups, and this probability does not change throughout the study. Examples of fixed
allocation randomization are simple, blocked, and stratified randomization. For the
adaptive randomization, however, the allocation probability changes as the study
progresses and can be based on the group characteristics (baseline adaptive randomi-
zation) or response to the treatment (response adaptive procedure).
Additionally, Luigi has to consider ethical and power issues. For instance, he was recently considering using a 1:3 strategy of randomizing participants to placebo and active treatment. He then decides to review each strategy separately.
Simple Randomization: Simplicity
versus Imbalanced Groups
The first option Luigi considers is the simple randomization strategy, in which every
participant has a 50% chance of receiving either active or placebo treatment. This
option seems interesting to him in regard to feasibility, given his lack of staff and lim-
ited budget. This strategy is in fact commonly used in clinical trials and would not
compromise the statistical inference of the study. In addition, because the chance that the next participant will be randomized to either placebo or treatment is not affected by the allocation of previous participants, unblinding due to this randomization
strategy is not a concern.
However, despite the potential advantages, Prof. Lombardi is afraid that, due to the
small sample size, the risk of imbalanced groups is elevated.
CASE DISCUSSION
Randomization
Prof. Lombardi is writing a grant proposal to get funding for a small phase II placebo-
controlled, randomized trial to test a new drug for exercise-induced asthma attacks.
He thinks the drug is promising and is eager to start the trial. He has a large clinic
where he can recruit participants, and a small team to help him conduct this trial.
However, he wants to make sure the randomization method is the best for his study
design and characteristics.
This case is a classic example of the challenges inherent to choosing a randomiza-
tion method for a small sample size trial. Here the investigators need to weigh the pros
and cons of each possibility. Ultimately, investigators need to decide how important
covariates are for the study and whether it is necessary to have a stratified randomiza-
tion method.
Another important factor that needs to be considered in this study is the allocation
concealment (i.e., making sure that investigators and participants cannot anticipate to
which study arm participants will be randomized, in order to avoid selection bias). As
we discussed earlier, certain randomization methods may compromise allocation con-
cealment, such as blocked randomization with fixed block sizes, or some minimization
algorithms. Moreover, the procedure used with randomization to effectively reveal the allocation once participants have been randomized, such as sealed envelopes or computerized or telephone-based systems, needs to be implemented in a way that avoids manipulation of the allocation.
WEB RESOURCES
• http://www.randomization.com/
• http://www-users.york.ac.uk/~mb55/guide/randsery.htm
REFERENCES
1. Kang M, Ragan BG, Park JH. Issues in outcomes research: an overview of randomization
techniques for clinical trials. J Athletic Training. 2008; 43: 215–221.
2. Bridgman S, Dainty K, Kirkley A, Maffulli N. Practical aspects of randomization and
blinding in randomized clinical trials. Arthroscopy. 2003; 19: 1000–1006.
3. Schulz KF, Grimes DA. Generation of allocation sequences in randomised trials: chance,
not choice. Lancet. 2002; 359: 515–519.
4. Vickers AJ. How to randomize. J Soc Integr Oncol. 2006; 4: 194–198.
INTRODUCTION
One of the most important strategies in conducting a research study is to minimize
bias. A randomized controlled trial (RCT) using blinding (or masking) of patients and
study personnel to the study treatment the patient is receiving, along with randomiza-
tion, is now considered the gold standard in clinical research. Randomization is done
to greatly increase the likelihood that study groups are balanced at baseline (avoiding
selection bias, confounding). Blinding, on the other hand, encompasses the methods
and strategies used to keep study participants and key research personnel unaware of
treatment assignment status throughout the duration of the trial, as well as after com-
pletion of the trial during data analysis. Blinding can be achieved in the absence of
randomization, but they are usually used together.
The knowledge about treatment allocation can compromise the quality of the study,
hindering internal validity by increasing the risk of bias. Lack of blinding of health-care
providers, study staff, and subjects (patients) can alter expectations, change behaviors
toward the intervention (compliance, dropouts), and influence side-effect reporting as
well as the reporting and assessment of efficacy outcome variables.
Despite extensive efforts, some interventions make it very difficult, or nearly impossible, to achieve or maintain blinding of patients and providers (e.g., clinical trials of surgery, psychotherapy, rehabilitation). Nonetheless, investigators should counter this by blinding at least some important research personnel, such as outcome assessors (raters) or endpoint adjudicators.
Despite being extensively recommended by expert panels and advisory agencies,
blinding status is often not reported by authors. When it is mentioned at all, not enough information is commonly provided about who exactly was blinded, by what means blinding was accomplished, and how successful it was. It is therefore important that the
investigator really understands not only how to design studies with effective blinding,
but also the potential limitations when blinding is not possible or has some inherent
limitations.
HISTORY
Blinding and blinding assessment in the medical field have more than two centuries of
history. Blinding was initially used to expose fraud when the healing properties of mag-
netism, perkinism, and homeopathy were tested [1,2]. One of the first studies using
a sham intervention in a blinded manner was published by Austin Flint in 1863 [3].
At the beginning of the twentieth century, concerns were raised that bias could
be introduced through “patient’s expectations” and “physician’s personality.” This
initiated the use of blinding in the fields of physiology and pharmacology. In addition,
blinded trials promoted recruitment and decreased attrition in the comparison group;
instead of being offered “no treatment,” patients received a placebo/sham intervention
in a blinded manner [2].
A double-blind randomized clinical trial is sometimes defined as a study where nei-
ther the treating physician nor the subject knows the randomized treatment the subject
is receiving (this definition will be discussed later in the chapter). In Michigan, from
1926 to 1931 [4], sanocrysin was tested against distilled water to treat tuberculosis. One
of the authors was unaware of group allocation and presumably so were the patients.
This trial is considered to be the first study incorporating a double-blind approach [2].
DEFINITION
Blinding or masking is the methodological principle of concealing group allocation
(intervention or control/placebo/sham) from subjects and study staff after randomi-
zation [5]. This concept should not be confused with allocation concealment, which
refers to hiding group allocation during the randomization process, preventing selec-
tion bias [6]. Allocation concealment and blinding maintain “unawareness” of group
status at different points in time and guarantee the methodological soundness of clin-
ical trials. Allocation concealment protects the allocation sequence before and until
assignment, and blinding protects the sequence from then on (Figure 6.1). It can be
achieved in all clinical trials, in contrast with blinding, which sometimes can be unfea-
sible to achieve or to maintain.
1. Participant/subject/patient/volunteer: receives the trial intervention (one of them).
2. Health-care provider/clinician/attending physician: provides the intervention.
3. Outcome assessor/data collector/rater/evaluator/observer: provides outcome data.
4. Outcome adjudicator/endpoint committee member/judicial assessor: ensures that outcome data adhere to definitions set a priori.
5. Data handler/data entry clerk: enters data from patient files into the trial database.
6. Data analyst/statistician: conducts data analysis.
7. Manuscript writer: writes the paper with trial results.
Figure 6.1. Concealment of allocation versus blinding: allocation concealment operates from the population through sampling and randomization up to group assignment; blinding operates from assignment to the intervention or control group through assessment of outcomes.
Some groups may be unblinded due to the nature of their role in the study, such as
treatment manufacturer, pharmacist, and medical monitor. The medical monitor
deals with patient safety issues and often has access to the randomization schedule
containing the record of the randomized treatment each patient receives (this access is
needed in case an individual patient’s randomized treatment needs to be made known
in the event the patient experiences a serious adverse event), as do members of the
data safety monitoring board (DSMB).
BLINDING TERMINOLOGY
Blinding terminology can be confusing and ambiguous [1,5,10]. Many terms often
used in papers and study protocols are not universal and may not mean the same
to all members of the clinical research community. One study encountered 17 unique interpretations of the meaning of “double-blinding.” When textbooks were surveyed, nine different combinations of blinded groups were found. For the majority, “double-blinded” meant that both participants and health-care providers were unaware of group allocation, but “double-blinded” has also been interpreted as covering various other groups, such as data collectors, data analysts, and judicial assessors [10].
Instead of describing a trial vaguely as “blinded,” researchers are now encouraged
to describe the blinding status of all personnel involved in the trial, complying with the
CONSORT recommendations [11].
Nevertheless, some authors have proposed the adoption of standardized termi-
nology [12]. A tentative flowchart is described in Figure 6.2.
Figure 6.2. Tentative flowchart for blinding terminology: if data managers and biostatisticians were not blinded to treatment assignment, the study is termed double-blind; if they were blinded, it is termed triple-blind. In either case, the groups that were blinded should be listed, if applicable.
A single-blind trial is one in which a single group of interest in the study is unaware of group allocation. This is usually interpreted as subjects being blinded to the treatment assignment [10]. When only one group is blinded and it is not the participants, a more descriptive approach should be used rather than the designation "single-blind" [12].
A double-blind trial is usually understood as participants and investigators being
unaware of group assignment [12]. The term “investigator” is intentionally im-
precise, as it can mean either the health-care provider or the outcome assessor.
Interpretation of this term can lead to many different definitions, but the most common definition found in textbooks includes patients and evaluators [10]. In some studies, it is used to describe all three categories (patients, providers, and evaluators) [1].
In a triple-blind trial, three different groups are unaware of the treatment allocation,
but there is no consensus regarding which three. Usually, blinded groups are all those
involved in a double-blind trial plus the data managers and/or biostatisticians [1,12].
If the health-care provider and the evaluator are not the same person, this term can comprise these two staff members in addition to the participant [1].
Quadruple-blind trial is a term rarely used by authors. It can be applied when four different groups are blinded to treatment assignment: participant, health-care provider, evaluator, and data analyst [1].
When stating blinding methodology, a more qualitative approach is recommended
[11]. Researchers should clearly state which groups were blinded (rather than just
mentioning how many) and how that was accomplished.
A recent survey from the field of rehabilitation also found a significant increase in the percentage of papers reporting blinding status over the last 10 years, from 56% to 85% [18]. This may reflect the call to action, directed at researchers and journal editors, for better quality in the reporting of results, summarized in the CONSORT statement since 1996 [23].
CONSEQUENCES OF UNBLINDING
Lack of blinding increases the risk of bias, which can be a threat to the internal validity
of a clinical trial. A subject or an investigator who is aware of group allocation may
consciously or unconsciously change his or her attitude, beliefs, or behavior regarding
the study.
The main types of bias associated with unblinding are performance, detection/ob-
server, and attrition bias [5,8,24,25].
Other phenomena that could also be linked with unblinding include the following:
BLINDING VERSUS BIAS
The need for adequate blinding is essential to avoid bias. Importantly, it is not merely adding blinding that prevents bias, but how the blinding is implemented. The goal of the researcher is to understand potential flaws in the blinding method, and that of the reader is to assess whether the blinding method was adequate.
We discussed in the previous section the types of biases that can be found when blinding is not implemented or is flawed. Bias is defined as a "systematic error, or deviation from the truth, in results or inferences" [26]. The main problem with bias is that it cannot be estimated or corrected, and it can go in either direction. A study with biased results will likely have little or no value. Figure 6.3 summarizes the types of
biases according to each component of a randomized clinical trial.
Figure 6.3. Types of bias by trial component: if sampling from the accessible population to the sample/study subjects is not random, sampling bias is introduced; if allocation concealment at randomization to active and sham treatment is poor, selection bias is introduced.
When two treatments have different formulations (a tablet and a capsule), blinding can be preserved with a double-dummy approach: patients whose active treatment is the tablet receive the active tablet plus a placebo capsule, and patients whose active treatment is the capsule receive the active capsule plus a placebo tablet.
The most recent update of the CONSORT statement does not recommend that this assessment be done routinely [11,49].
Surveys found that only 2%–7% of all trials reported tests for the success of blinding [40,51]. One of the reasons for this low reporting is the lack of a standard method to accomplish it. Some questions remain without a definite answer: When is the best time to survey subjects and study personnel? Should "do not know" be offered as a category? Should success be a binary or a scale variable? What is the cut-off value for successful blinding? What statistical analysis should be done? [49,50].
In these surveys, three or five response categories can be used (e.g., "active," "control," and "do not know"). Among the advantages of this approach is the ability to detect not only the magnitude but also the direction of unblinding for each group [51].
BLINDING GUIDELINES
American and European advisory agencies recommend that studies should be blinded, including the subject, investigator, and sponsor staff involved in treatment and clinical evaluations. Ideally, all study personnel should be blinded, but if that is not practically or ethically possible, single-blind and open-label designs are also options. In this case, emphasis should be placed on random allocation and, when possible, assessment should be done by a blinded evaluator [53,54]. Also, in single-blind and open-label designs, it is our opinion that as many study personnel as possible should remain blinded.
The CONSORT expert panel, originally gathered in the mid-1990s, issued a statement recommending that researchers elaborate on blinding methodology when reporting their findings:
In 2001, the statement was updated to include a checklist of 22 items to mention when
reporting a clinical trial. This list included an item regarding blinding (#11):
In the most recent statement, from 2010, the panel goes into detail and recommends that more information about blinding be given, namely how similar the interventions were (#11a, #11b):
11a If done, who was blinded after assignment to interventions (for example,
participants, care providers, those assessing outcomes) and how; 11b If relevant, de-
scription of the similarity of interventions. [11]
The evaluation of the success of blinding, once included in the 2001 statement, was dropped because of its controversial nature [5,56].
To keep those analyzing the data blinded, the analysis plan must be pre-specified: the strategies for dealing with missing data (deletion, imputation, etc.), the handling of outliers, what to do with non-normally distributed data (transformation, violating test assumptions, dichotomizing variables), how variables will be analyzed (continuous or transformed into binary, and which cut-off for a dichotomous variable), and which subgroup analyses will be done all have to be addressed before any preliminary results are known [9].
Blinding in Acupuncture
One field where a "true" placebo is extremely difficult to use is acupuncture. There is no placebo for a needle, because needling, pinching, or even lightly touching someone's skin will always have some physiologic effect. Many approaches have been tried (e.g., blunt needles, non-penetrating needles, retractable needles, using non-standard acupuncture points, recruiting only naïve subjects, mock electro-acupuncture) [41,42,59].
Ultimately, it is impossible to blind the practitioner, so more emphasis should be put
on having blinded outcome assessors [59].
CONCLUSION
Blinding, along with randomization, protects patients and researchers from
contaminated findings and decreases the subjectivity of clinical trials [2,5]. One
should aim to blind all key categories of personnel in a trial. If blinding of all individuals
is not possible, an effort should be made to blind as many parties as possible. “Some”
is better than “none.” Researchers should report the blinding status of participants
and all study personnel and accurately describe how that was accomplished. Although its value is not unanimously agreed upon, assessment of blinding success can be done to quantify the
occurrence of unblinding throughout the study. “Intentionally inducing a state of ig-
norance” will minimize the risk of bias and therefore increase the internal validity of
the study [5].
Introduction
Current Development of RHINO-A
After promising results in pre-clinical and phase I trials, Dr. Bejout feels it is the proper time to run a phase II trial. In fact, Isabelle has been working on this project for several months and is eager to move on to the next step. The plan for this study is to perform a randomized parallel phase II trial testing RHINO-A versus placebo (an option not yet ruled out is to include a third group receiving an oral antihistamine) for four weeks in patients with perennial allergic rhinitis. Ethical issues regarding placebo use in this setting were extensively discussed with the IRB (institutional review board), with the conclusion that placebo use was acceptable due to the nature of the disease and the short duration of the trial.
The main goal of the study is to assess initial efficacy, as measured by decrease in
total nasal sign and symptom scale (TNSSS)—the main outcome, a composite scale
that consists of grading symptoms and signs of allergic rhinitis. Symptoms include congestion, runny nose, sneezing/itchy nose, and postnasal drip, and are self-rated by the patient.
1 Dr. Rui Imamura and Professor Felipe Fregni prepared this case. Course cases are developed solely as the basis for class discussion. The situation in this case is fictional.
2 Noseworthy JH, et al. The impact of blinding on the results of a randomized, placebo-controlled multiple sclerosis clinical trial. Neurology. 1994; 44(1): 16–20.
An important issue that researchers should be aware of is the perceived bias. In other
words, even if an unblinded study finds the true effect estimate of a given treatment, it
is not possible to prove that the result is not biased due to unblinding.
Although blinding should be used whenever possible, as it reduces the chance of intentional or unintentional bias being introduced into the study, there are some issues associated with it, such as increased study costs and challenges with feasibility, adherence, and adverse-effect monitoring.
As Dr. Bejout will be leaving soon for his long overdue family vacation, he knows
that he and Isabelle need to finalize the study plan in the next few days, and the last
outstanding issue is study blinding. He was able to cancel his morning administrative
meetings and calls Isabelle for a long meeting. He places a stack of paper and a pencil
on his table and asks his secretary to redirect his clinical calls to the on-call ENT phy-
sician. He then starts the meeting with Isabelle, “Let us review all the potential options
so that we can choose the best one.”
At one point, Dr. Bejout has to step out to answer an urgent call from a patient in need of a prescription. When he comes back to
the discussion, he concludes, “But Isabelle, we will not have funds to run three groups
of patients, so let us focus on the other options for this study and use this for perhaps
a follow-up study.”
3 Forder PM, Gebski VJ, Keech AC. Allocation concealment and blinding: when ignorance is bliss. Med J Aust. 2005; 182(2): 87–89.
They agree that they should meet the next day at 7 a.m. to make this decision and submit the final version
of the study. Isabelle leaves the office feeling energized and glad that she will finally be
able to start working on her PhD project.
CASE DISCUSSION
After a successful pre-clinical and phase I trial, Dr. Bejout is developing a project to run
a phase II trial for a new topical steroid drug for allergic rhinitis. The trial will last four
weeks, and because of its short duration and the nature of the disease being studied,
the IRB has agreed to the use of a placebo. His main outcome is going to be the Total
Nasal Symptoms Score Scale (TNSSS), a score given by reported symptoms (by pa-
tient) and signs assessed by an ENT physician. Symptoms and signs will be evaluated
at baseline and after study completion. Side effects also will be evaluated with a ques-
tionnaire, as a secondary endpoint. Like any scale of signs and symptoms, TNSSS
is subjective and more prone to bias than “hard,” objective outcomes. Knowledge
of group allocation may involuntarily influence the reporting of signs and symptoms by the patients and physician, leading to biased results. Dr. Bejout's challenge is to choose
a blinding procedure that will ensure a high-quality trial but also feasibility and cost
control. On the other hand, Dr. Bejout is worried about adverse effects of the steroid and thinks that the attending physician might underreport side effects if unaware of group allocation. Indeed, this trial illustrates the challenge of balancing trial feasibility against the prevention of bias.
As in any trial, five main groups of individuals can be blinded according to the CONSORT statement: patients, health-care providers, data collectors, outcome adjudicators, and data analysts. One must remember that the variable of interest is a sub-
jective, “soft” outcome and depends on the report of symptoms by the patient and an
evaluation of signs from the assessors.
Several options were discussed: single-blind, double-blind, and single-blind with a third, blinded rater.
In a single-blind study, the patient would be blinded, but the physician would still
know group allocation. This option has advantages, as it is relatively simple and keeps
the placebo effect, but still leaves the physician vulnerable to observer/detection bias
and differential treatment among patients from the two groups (performance bias).
A double-blind design (both patient and physician) is a well-balanced solution, but other issues such as adherence and adverse-event reporting may come into play. In order to keep both parties blinded, a placebo has to be developed. It should be identical in shape, color, size, and taste to the active drug. This adds cost to the study and sometimes is hard or nearly impossible to achieve. Unblinding may still occur if the active drug is highly effective in comparison to the placebo and physicians and patients become aware of who is in the active drug group. Attending physicians might be less motivated to recruit their patients to a study where they are kept "blind" to study allocation, and might underreport side effects while monitoring for adverse events.
A compromise solution is blinding the patient and having an external, blinded assessor. This would keep the attending physician aware of group allocation, therefore maintaining the possibility of the physician unintentionally influencing patient behavior, and thus the risk of performance bias.
FURTHER READING
In these references you will find an inventory of blinding methods in pharmacological and non-
pharmacological trials:
Boutron I, et al. Methods of blinding in reports of randomized controlled trials assessing phar-
macologic treatments: a systematic review. PLoS Med. 2006; 3(10): e425.
Boutron I, et al. Reporting methods of blinding in randomized trials assessing nonpharmacological
treatments. PLoS Med. 2007; 4(2): e61.
This paper summarizes the rationale for blinding and bias mechanisms:
Hrobjartsson A, Boutron I. Blinding in randomized clinical trials: imposed impartiality. Clin
Pharmacol Ther. 2011; 90(5): 732–736.
REFERENCES
1. Schulz KF, Chalmers I, Altman DG. The landscape and lexicon of blinding in randomized trials. Ann Intern Med. 2002; 136(3): 254–259.
2. Kaptchuk TJ. Intentional ignorance: a history of blind assessment and placebo controls in
medicine. Bull Hist Med 1998; 72(3): 389–433.
3. Flint A. Contribution toward the natural history of articular rheumatism; consisting of
a report of thirteen cases treated solely with palliative measures. Amer J Med Sci. 1863;
46: 17–36.
4. Amberson JB, McMahon BT, Pinner M. A clinical trial of sanocrysin in pulmonary tubercu-
losis. Amer Rev Tuberc. 1931; 24: 401–435.
5. Hrobjartsson A, Boutron I. Blinding in randomized clinical trials: imposed impartiality.
Clin Pharmacol Ther. 2011; 90(5): 732–736.
6. Forder PM, Gebski VJ, Keech AC. Allocation concealment and blinding: when ignorance
is bliss. Med J Aust. 2005; 182(2): 87–89.
7. Viera AJ, Bangdiwala SI. Eliminating bias in randomized controlled trials: importance of
allocation concealment and masking. Fam Med. 2007; 39(2): 132–137.
8. Schulz KF, Grimes DA. Blinding in randomised trials: hiding who got what. Lancet. 2002;
359(9307): 696–700.
9. Polit DF. Blinding during the analysis of research data. Int J Nurs Stud. 2011; 48(5): 636–641.
10. Devereaux PJ, et al. Physician interpretations and textbook definitions of blinding termi-
nology in randomized controlled trials. JAMA. 2001; 285(15): 2000–2003.
11. Moher D, et al. CONSORT 2010 explanation and elaboration: Updated guidelines for re-
porting parallel group randomised trials. J Clin Epidemiol. 2010; 63(8): e1–37.
12. Miller LE, Stewart ME. The blind leading the blind: use and misuse of blinding in
randomized controlled trials. Contemp Clin Trials. 2011; 32(2): 240–243.
13. Lang T. Masking or blinding? An unscientific survey of mostly medical journal editors on the great debate. MedGenMed. 2000; 2(1): E25.
14. Taylor WJ, Weatherall M. What are open-label extension studies for? J Rheumatol. 2006; 33(4): 642–643.
15. Psaty BM, Prentice RL. Minimizing bias in randomized trials: the importance of blinding. JAMA. 2010; 304(7): 793–794.
16. Montori VM, et al. In the dark: the reporting of blinding status in randomized controlled
trials. J Clin Epidemiol. 2002; 55(8): 787–790.
17. Haahr MT, Hrobjartsson A. Who is blinded in randomized clinical trials? A study of 200
trials and a survey of authors. Clin Trials. 2006; 3(4): 360–365.
18. Villamar MF, et al. The reporting of blinding in physical medicine and rehabilitation
randomized controlled trials: a systematic review. J Rehabil Med. 2012; 45(1): 6–13.
19. Balk EM, et al. Correlation of quality measures with estimates of treatment effect in meta-
analyses of randomized controlled trials. JAMA. 2002; 287(22): 2973–2982.
20. Califf RM, et al. Characteristics of clinical trials registered in ClinicalTrials.gov, 2007–2010.
JAMA. 2012; 307(17): 1838–1847.
21. Schulz KF, et al. Blinding and exclusions after allocation in randomised controlled
trials: survey of published parallel group trials in obstetrics and gynaecology. BMJ. 1996;
312(7033): 742–744.
22. Karanicolas PJ, et al. Blinding of outcomes in trials of orthopaedic trauma: an opportunity
to enhance the validity of clinical trials. J Bone Joint Surg Am. 2008; 90(5): 1026–1033.
23. Begg C, et al. Improving the quality of reporting of randomized controlled trials. The
CONSORT statement. JAMA. 1996; 276(8): 637–639.
24. Juni P, Altman DG, Egger M. Systematic reviews in health care: Assessing the quality of
controlled clinical trials. BMJ. 2001; 323(7303): 42–46.
25. Higgins JPT, Green S. Cochrane handbook for systematic review of interventions. London: John
Wiley & Sons; 2008.
26. http://bmg.cochrane.org/assessing-risk-bias-included-studies
27. Noseworthy JH, et al. The impact of blinding on the results of a randomized, placebo-
controlled multiple sclerosis clinical trial. Neurology. 1994; 44(1): 16–20.
28. Schulz KF, et al. Empirical evidence of bias: dimensions of methodological quality associ-
ated with estimates of treatment effects in controlled trials. JAMA. 1995; 273(5): 408–412.
29. Moher D, et al. Does quality of reports of randomised trials affect estimates of intervention
efficacy reported in meta-analyses? Lancet. 1998; 352(9128): 609–613.
30. Pildal J, et al. Impact of allocation concealment on conclusions drawn from meta-analyses
of randomized trials. Int J Epidemiol. 2007; 36(4): 847–857.
31. Wood L, et al. Empirical evidence of bias in treatment effect estimates in controlled trials
with different interventions and outcomes: meta-epidemiological study. BMJ. 2008;
336(7644): 601–605.
32. Hrobjartsson A, et al. Observer bias in randomised clinical trials with binary outcomes: sys-
tematic review of trials with both blinded and non-blinded outcome assessors. BMJ. 2012;
344: e1119.
33. de Craen AJ, et al. Effect of colour of drugs: systematic review of perceived effect of drugs
and of their effectiveness. BMJ. 1996; 313(7072): 1624–1626.
34. Desbiens NA. Lessons learned from attempts to establish the blind in placebo-controlled
trials of zinc for the common cold. Ann Intern Med. 2000; 133(4): 302–303.
35. Hemila H. Vitamin C, the placebo effect, and the common cold: a case study of how
preconceptions influence the analysis of results. J Clin Epidemiol. 1996; 49(10): 1079–
1084; discussion 1085, 1087.
36. Karlowski TR, et al. Ascorbic acid for the common cold. A prophylactic and therapeutic
trial. JAMA. 1975; 231(10): 1038–1042.
37. de Craen AJ, et al. Placebo effect in the acute treatment of migraine: subcutaneous placebos
are better than oral placebos. J Neurol. 2000; 247(3): 183–188.
38. Francis CW, et al. Comparison of ximelagatran with warfarin for the prevention of venous thromboembolism after total knee replacement. N Engl J Med. 2003; 349(18): 1703–1712.
39. Morello CM, et al. Randomized double-blind study comparing the efficacy of gabapentin with amitriptyline on diabetic peripheral neuropathy pain. Arch Intern Med. 1999; 159(16): 1931–1937.
40. Boutron I, et al. Methods of blinding in reports of randomized controlled trials assessing
pharmacologic treatments: a systematic review. PLoS Med. 2006; 3(10): e425.
41. Boutron I, et al. Reporting methods of blinding in randomized trials assessing
nonpharmacological treatments. PLoS Med. 2007; 4(2): e61.
42. Fregni F, et al. Challenges and recommendations for placebo controls in randomized trials
in physical and rehabilitation medicine: a report of the international placebo symposium
working group. Am J Phys Med Rehabil. 2010; 89(2): 160–172.
43. Boutron I, et al. Methodological differences in clinical trials evaluating nonpharmacological
and pharmacological treatments of hip and knee osteoarthritis. JAMA. 2003; 290(8):
1062–1070.
44. Kaptchuk TJ, et al. Sham device v inert pill: randomised controlled trial of two placebo
treatments. BMJ. 2006; 332(7538): 391–397.
45. Bang H, Park JJ. Blinding in clinical trials: a practical approach. J Altern Complement Med. 2012; 19(4): 367–369.
46. Senn SJ. Turning a blind eye: authors have blinkered view of blinding. BMJ. 2004;
328(7448): 1135–1136; author reply 1136.
47. Park J, Bang H, Canette I. Blinding in clinical trials, time to do it better. Complement Ther
Med. 2008; 16(3): 121–123.
48. Altman DG, Schulz KF, Moher D. Turning a blind eye: testing the success of blinding and
the CONSORT statement. BMJ. 2004; 328(7448): 1135; author reply 1136.
49. Fergusson D, et al. Turning a blind eye: the success of blinding reported in a random sample
of randomised, placebo controlled trials. BMJ. 2004; 328(7437): 432.
50. Hrobjartsson A, et al. Blinded trials taken to the test: an analysis of randomized clinical
trials that report tests for the success of blinding. Int J Epidemiol. 2007; 36(3): 654–663.
51. James KE, et al. An index for assessing blindness in a multi-centre clinical trial: disulfiram
for alcohol cessation: a VA cooperative study. Stat Med. 1996; 15(13): 1421–1434.
52. Bang H, Ni L, Davis CE. Assessment of blinding in clinical trials. Control Clin Trials. 2004;
25(2): 143–156.
53. US Food and Drug Administration. Guidance for Industry E9 Statistical Principles for
Clinical Trials. 1998 [cited 2012]; Available from: http://www.fda.gov/downloads/
Drugs/GuidanceComplianceRegulatoryInformation/Guidances/ucm073137.pdf.
54. European Medicines Agency. ICH Topic E 9 Statistical Principles for Clinical Trials. 2006
[cited 2012]; Available from: http://www.ema.europa.eu/docs/en_GB/document_li-
brary/Scientific_guideline/2009/09/WC500002928.pdf.
55. Moher D, Schulz KF, Altman DG. The CONSORT statement: revised recommendations
for improving the quality of reports of parallel group randomized trials. BMC Med Res
Methodol. 2001; 1: 2.
56. Sackett DL. Commentary: measuring the success of blinding in RCTs: don’t, must, can’t or
needn’t? Int J Epidemiol. 2007; 36(3): 664–665.
57. Gotzsche PC. Blinding during data analysis and writing of manuscripts. Control Clin Trials.
1996; 17(4): 285–290; discussion 290–293.
58. Beeh KM, Beier J, Donohue JF. Clinical trial design in chronic obstructive pulmonary
disease: current perspectives and considerations with regard to blinding of tiotropium.
Respir Res. 2012; 13: 52.
59. White A, Cummings M, Filshie J. An introduction to Western medical acupuncture. 1st ed.
Edinburgh: Churchill Livingstone Elsevier; 2008.
7
RECRUITMENT AND ADHERENCE
INTRODUCTION
This is the final chapter of Unit I, Basics of Clinical Research. In this unit, you have al-
ready learned how to select your research question (Chapter 2) and how to choose the
study population (Chapter 3) in which you will test your hypothesis. But how do you
identify and reach these potential study subjects? And how do you ensure that they ad-
here to the protocol? In this chapter we present the role of recruitment and adherence
of study participants in clinical trials. The term recruitment refers to the identification
and enrollment of study participants, including operational aspects such as adver-
tising, overcoming recruitment barriers, and management of financial, logistic, and
time-related aspects throughout the process of enrollment. The term adherence refers
to the compliance of study participants to act in accordance with the study protocol
and to remain in the study. Retention is often used synonymously with adherence, but
rather refers to actions aimed at keeping patients in the study so that they are available
at follow-up (alive and not lost to dropout or withdrawal). Attrition is defined as loss
of subjects during the course of the study, which can be due to death, stopping of the
assigned intervention, dropout, or intentional withdrawal. In this chapter we describe
the methodological principles of achieving effective recruitment, adherence, and re-
tention in clinical research.
RECRUITMENT
It frequently happens that during the study design phase, only limited thought is given
to recruitment strategies. In reality, though, one of the most difficult parts of a study
is the recruitment process, and it is this factor that often decides whether a study will
fail or succeed [1,2].
The two main objectives of the recruitment process are the following:
1. To recruit a study sample that is representative of the target population;
2. To recruit a sample that is large enough to fulfill the requirements of the sample size and power calculations (see Chapter 11) [3].
RECRUITMENT: STUDY SAMPLE,
DEFINITION, AND SIZE
Before enrolling individuals from your target population in your trial, you have to con-
sider the first important step in the process of participant recruitment: defining the target
population (see Chapter 3 for more details). The target population will be determined
based on the research question and the hypothesis to be tested. It is important to define
the target population by specifying clear inclusion/exclusion criteria. These criteria will
be used to screen subjects/patients for enrollment and to determine who will be entered
in the study. Once the screening criteria are defined, you will have to decide how many
subjects you want to recruit. The sample size depends not only on the desired power and
the effect size, but also on budgetary and logistic considerations. This decision is based
on several factors, and it is not always easy to find the right balance between them (see
Chapter 11 for more details). The main concern is that if the sample size is too small, the study result might be negative due to insufficient power, resulting in a type II error; on the other hand, if the sample size is too large, ethical issues become a concern, in addition to the unnecessary expenditure of time, resources, and labor.
It is important to consider that sample size goals are affected before the start of
a study by the recruitment response rate (with the potential of introducing non-
response bias) and during the study by the attrition rate. The response rate is the
number of screened subjects who ultimately agree to enroll in a study. During the
screening process, the pool of potential study subjects shrinks substantially from
initially 100% to 10%–15% eligible and finally as low as 1% enrolled [4]. This phe-
nomenon is mainly due to the fact that the number of subjects who are available and
willing to enroll is overestimated; this is referred to as the funnel effect, or Lasagna’s
Law (see Figure 7.1) [5,6]. But even from that small part of enrolled subjects, not
Figure 7.1. The distribution of total, ineligible, eligible and enrolled subjects according to the funnel effect.
131 Chapter 7. Recruitment and Adherence
all can be randomized. Many reasons can be attributed to a low response rate, such
as patients’ lack of motivation, as well as practical issues such as travel expenses. The
funnel effect demonstrates that, at the end, only a small portion of all identified po-
tential study patients or subjects will be enrolled in the study.
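A back-of-the-envelope sketch (Python; the rates are invented for illustration, not taken from the chapter) shows how the funnel effect translates into planning numbers: working backward from a target enrollment through assumed eligibility and consent rates gives the number of candidates who must be reached.

import math

def candidates_to_reach(target_enrolled, eligible_rate, consent_rate):
    # Invert the funnel: reached * eligible_rate * consent_rate >= target.
    return math.ceil(target_enrolled / (eligible_rate * consent_rate))

# e.g., 100 enrolled, ~12% of screened candidates eligible, ~10% of
# eligible candidates consenting: an overall yield of about 1.2%
print(candidates_to_reach(100, eligible_rate=0.12, consent_rate=0.10))  # 8334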
A candidate's decision of whether or not to participate in a research study is usually based on his or her perception of the risks/costs and benefits of enrolling in a clinical trial. The investigator should therefore be aware of the potential risks/costs and benefits associated with the investigation.
Potential sources and channels for identifying and reaching study candidates include the following:
• Medical records review
• Clinic log
• External referrals: primary care, specialists (collaborations)
• Clinical research centers
• Specialized clinics and general hospitals
• Registries
• Recruitment/call center
• Patient support groups
• Patients’ community websites
• Clinical trial registration sites
  – Government: clinicaltrials.gov, clinicaltrialsregister.eu
  – Patient advocacy groups: ciscrp.org (international)
  – Institutional resources: rsvpforhealth.org (local), http://searchclinicaltrials.org/
• Databases: AWARE for All, RSVP for Health
• Advertising: paper flyers, radio, newspapers, web.
After the target population has been chosen and defined by inclusion/exclusion
criteria, the next step of the study recruitment plan is to determine the best medium
of advertising to reach potential study candidates. Easy and effective ways to reach a
specific group of patients are health-care-provider-based strategies (also referred to as
targeted strategies), such as clinician invitation letters (invitation letters to physicians
treating patients who are potentially eligible for study participation, or the distribution
of leaflets and promotional posters in hospitals). However, these strategies may only be sufficient when a highly specific group of patients affected by the same (or similar) disease constitutes the target population, and they require the collaboration of clinicians to refer patients to the study. But if we identify an appropriate way to address the poten-
tial participants through community-based strategies (also referred to as broad-based
strategies), such as advertising through television or the Internet, how do we know
that this specific medium doesn’t exclude a group within this population that does
not have access to this medium? For example, what if we choose to perform a study in
patients with type II diabetes and decide to advertise our study through Internet ad-
vertisement on webpages of diabetes support groups? As older people commonly have
less access to the Internet, online advertising would be more likely to reach younger
type I diabetes patients than older type II diabetes patients.
Besides age-dependent media use, there are other important factors that should be considered when choosing the form of advertising. These influencing factors include the following:
• Leaflets
• Promotional posters
• Television
• Newspapers
• Internet (i.e., social or professional networks)
• Mail
• Email.
Both leaflets and promotional posters are useful to reach local target groups with low
costs, but may fail to reach larger populations that exceed regional borders.
While television can reach broad target groups in a short time, it is costly and therefore usually requires industry financial support. Regional newspaper adver-
tisement can be useful to reach high numbers of individuals within the scope of the
newspaper’s circulation, but might exclude subjects who are using the Internet or
television to get access to news. Internet ads are useful to reach a broad number of
subjects or patients, but also to address specific groups. However, as Internet usage in
most countries correlates with demographic factors such as age and income, the exclu-
sive use of the Internet for advertising might bias the selection of study participants.
Advertising by regular mail is time-intensive but may help to reach a very specific population. Advertising by email has several advantages over regular mail, but low visibility might be a problem, and an email might never even reach the attention of an individual if it is filtered out (spam/advertisement filters). Additionally, both mail and email require access to contact information (street addresses or email addresses, respectively), and due to data protection regulations, access to mailing lists may be restricted. Therefore, advertising by mail or email in a community-based setting may be limited to subjects or patients whose addresses are included in openly available address lists. Alternatively, public awareness campaigns through special-
ized medical societies can be an effective tool to raise awareness of the study among
physicians and patients, thus facilitating health-care-provider-based recruitment.
Figure 7.2. The consecutive steps in planning the study advertisement in consideration of the individual properties of different media: determining the contents of the advertising; determining the access of the target population to media forms; and characterizing subgroups within the target population.
difficult to understand. On the other hand, not naming significant potential adverse effects of the study intervention in an advertisement might make candidates feel that these effects were deceitfully "concealed" when they learn about them during the informed consent stage, and might further decrease the recruitment response rate.
In conclusion, before starting the recruitment of study participants, it is crucial to
carefully design a detailed recruitment strategy plan (including an advertisement plan),
considering both the properties of the medium of advertising and the feasibility of
conducting the advertisement, the latter mostly being restricted by financial resources. The process of planning study advertising is also illustrated in Figure 7.2.
THE EFFECTIVENESS
OF RECRUITMENT STRATEGIES
Which single or combined strategy is most effective depends on several factors, such as
the study design, the target population, and the intervention. A recent paper reported that "physician referrals and flyers were the most effective recruitment method" in the authors' trial [7]. The use of combined (multi-tiered) recruitment strategies was found
to be beneficial in enrolling those subjects who had been previously indecisive or
doubtful [8].
Human Factors
Once a potential candidate shows interest in participating in the advertised research
study, the individual skills and behavior of the research staff gain utmost importance in the process of recruiting. In order to successfully enroll a participant, it is crucial for the study staff to show empathy and professionalism, beginning with the first contact and the initially provided information. Gorkin et al. demonstrated that successful enrollment is higher among patients who read and understood the informed consent than among those who did not fully read and understand it [9]. An empathic way
of explaining the study to the potential participant is important in order to achieve a
high degree of understanding. Therefore, both empathy and sufficient clarification are
important contributors to successful recruitment.
ADHERENCE
Adherence in Clinical Practice
The WHO defined adherence in June 2001 as "the extent to which the patient follows medical instructions" [10]. However, this definition is somewhat limited: "medical" is too narrow a description, not including other forms of care, and "instructions" implies that the patient is a passive, obedient recipient rather than an active participant. In 2003 the definition of adherence was therefore modified to "the extent to which a person's behavior—taking medication, following a diet, and/or executing lifestyle changes—corresponds with agreed recommendations from a health care provider."
Another definition of adherence is the “active, voluntary, and collaborative in-
volvement of the patient in a mutually acceptable course of behavior to produce a
therapeutic result" [11,12]. Although the terms are often used synonymously, adherence is preferred over compliance, since the latter suggests that the patient is a passive, acquiescent recipient of a treatment protocol.
Adherence is particularly challenging in certain populations:
• HIV patients (side effects of drugs, multiple drugs, and food intolerance)
• Patients with arterial hypertension (asymptomatic condition)
• Patients with psychiatric conditions (behavioral difficulties)
• Pediatric populations (non-compliance during testing).
Problems related to regimen adherence can be classified as omission errors (e.g., taking a medication too late, failing to take a medication, or taking under-dosed amounts of a medication) and commission errors (e.g., taking an overdose of a medication, or taking a medication too frequently). Factors leading to decreased adherence can also be classified as random adherence problems (e.g., an unfavorable provider-participant interaction) and non-random adherence problems (e.g., fatigue of the patient due to the duration of the study).
Adherence problems in clinical research studies not only may put the study participant's health at risk, but also can prevent the investigator from successfully completing the trial or can increase the duration of data acquisition necessary to achieve sufficient statistical power (consequently increasing costs).
Data acquired through trials with small numbers of adherent subjects can have
higher statistical power than data obtained in studies with high numbers of non-
adherent subjects [19]. This effect is, in large part, due to the increased amount of missing data caused by adherence problems (e.g., missed follow-up visits; see also Chapter 13). The main issues of failed adherence are dropouts and premature withdrawals, which lead to several substantial problems in the conduct of a clinical trial.
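One practical consequence, shown in the minimal sketch below (Python; the dropout rate is an invented example), is that the calculated sample size is commonly inflated for expected attrition so that enough completers remain to preserve power. Note that this simple division does not remove any bias introduced by non-random dropout.

import math

def inflate_for_attrition(n_required, expected_dropout):
    # Enroll n_required / (1 - dropout) so completers still meet the target.
    return math.ceil(n_required / (1 - expected_dropout))

# 63 completers needed per group, 20% of enrollees expected to drop out
print(inflate_for_attrition(63, 0.20))  # 79 to enroll per group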
Techniques of Facilitation
There are several strategies to facilitate adherence. In general, patients act more compliantly when they think that the clinical trial may result in a potential improvement of their health or an effective treatment of their disease. Study participants may also value the additional monitoring and care from health providers throughout the study, or may show high adherence for altruistic reasons (e.g., improvement of medical treatment for future generations). Also, compensation and reimbursement for study-related expenses, such as parking and travel, can contribute to the degree of adherence.
Two specific techniques to increase adherence have been described by Robiner: pre-randomization screening and adherence-enhancing strategies [13].
In clinical research studies, pre-randomization screening can be used as a preventive measure to exclude from participation subjects who are at high risk of non-adherence.
The run-in approach includes assessment of compliance, by means of pill counts or quantitative analysis of the serum concentration of the administered agent, during a study pre-phase in which subjects receive a predefined prescription regimen (e.g., a certain number of placebo pills or low-dose active treatment at predefined times). This analysis
is undertaken before the actual study begins and allows for the identification of patients
who are likely to display insufficient compliance during the study. Furthermore, through
another approach referred to as test-dosing, researchers can identify candidates who
may drop out of the study because of dose-dependent adverse drug effects. In this pre-
randomization screening method, potential study participants also receive a low dose
of the experimental agent prior to the actual study. In contrast to the run-in method,
test-dosing does not include assessment of compliance, but detects individual adverse
effects to identify those subjects who have a high risk of dropping out due to intolerance.
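The pill-count arithmetic behind a run-in can be sketched as follows (Python; the 80% cut-off and the counts are illustrative assumptions, not values from the chapter).

def run_in_compliance(pills_dispensed, pills_returned, days, pills_per_day):
    # Compliance = pills actually taken / pills expected over the run-in.
    taken = pills_dispensed - pills_returned
    expected = days * pills_per_day
    return taken / expected

# Hypothetical 14-day placebo run-in at 2 pills/day: 30 dispensed, 6 returned
compliance = run_in_compliance(30, 6, days=14, pills_per_day=2)
print(round(compliance, 2), compliance >= 0.80)  # 0.86 True -> may be randomized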
Although pre-randomization screening methods are controversial because of their
negative impact on sensitivity, specificity, and external validity, they can increase sta-
tistical power by 20%–41% and might be a valid option to improve the design of
potentially underpowered studies. However, as data on these techniques are limited, a
consensus on their utility in clinical research has yet to be achieved [20].
Another approach to increase adherence is to perform adherence-enhancing
strategies during the trial. These strategies aim to influence the subject to be compliant
with the research protocol and instructions of the research staff. Adherence-enhancing
strategies include those listed in Table 7.1.
[Figure: overview of recruitment and adherence, from the target population through the accessible population to the study population.]
Recruitment goals: (1) ensure representativeness; (2) ensure that an appropriate number of subjects are enrolled.
Recruitment should focus on two overall strategies: increasing the reach to the accessible population (targeted and broad-based strategies), and improving the response rate of eligible patients who agree to participate by decreasing the factors that make participation difficult (e.g., by facilitating visits or decreasing their number). The recruitment yield is about 3%–6% (and could be as low as 1%).
Adherence for chronic-condition trials is on average 43%–78%. The adherence goal is to have as many patients as possible completing the study and following the protocol correctly.
Low adherence results in threats to internal validity: it threatens the statistical power of the trial, can introduce bias if dropouts are not randomly distributed across treatment groups, and threatens the perception of the trial results.
CHALLENGES FOR RECRUITMENT
Recruitment and Retention in Alzheimer’s Disease Trials
Currently there are no effective treatments or preventive strategies available for
Alzheimer’s disease. This might be one of the main reasons why it is extremely diffi-
cult to recruit and retain patients for research studies. The recruitment of 400 subjects requires the participation of more than 200 international centers to compensate for the low number of recruited patients and the high number of dropouts per center [21].
Issues that may affect recruitment and retention:
The problem of a multicenter setup is that the unit “center” adds another level of var-
iance and therefore makes results even more difficult to interpret. Despite the availa-
bility of several promising drug candidates, the methodological issue of recruitment
and retention is currently the major roadblock to finding successful treatment options
in Alzheimer’s disease.
wherein the problem lies: according to some studies, approximately 10,000 youth
are diagnosed with diabetes in Brazil each year. Given the size of Florianopolis (a city
with approximately 1 million inhabitants), Dr. Nasser’s estimate was that approxi-
mately 50 children are diagnosed yearly with diabetes in her city, and therefore would
qualify for the study. Although she was the director of the largest diabetes center in
Florianopolis, it would take several decades to complete this study.
2 Pocock SJ. Size of cancer clinical trials and stopping rules. Br J Cancer. 1978; 38(6): 757–766.
facilitate recruitment in different centers if you wish to do so. So, what would be our acces-
sible population?”
Dr. Nasser quickly responded that she wanted to enroll patients with newly
diagnosed type I diabetes aged 20 years or younger. She then asked him, “What would
be your ideas for broad-based strategies?”
“With a broad-based strategy, you could use advertisements in the media (TV, radio,
printed, and Internet) and deliver brochures in heavily concentrated areas, such as malls,
parks, and cultural events to reach a broader accessible population,” Dr. Correa responded,
and continued, “which would mean broader geographic diversity and larger absolute re-
cruitment yield. In addition, sometimes it is easier to recruit patients by attracting their
attention rather than trying to get referrals from our colleagues.”
“The only issue with this strategy,” Dr. Correa pondered, “is that you will get a
large population of non-eligible patients. And you know, it is not easy to deal with them!
Furthermore, the screening process will be time-consuming and very expensive. Cost-benefit
analysis should be taken into account, as it is not uncommon to have yields lower than 10%
of screened patients with this strategy.”3
Dr. Nasser thought about these issues for a couple of minutes and then
commented on generalizability, “Another advantage with this strategy is that we would
get a sample with patients from both outpatient and inpatient facilities and primary to
tertiary hospitals that we would not easily get using patient referral or examining medical
records. In addition, in Brazil, prompt access to medical care is not universal and many
patients with initial phases of the disease may not be accessible using other methods of
recruitment.”
Dr. Correa's secretary interrupted their conversation, as he had to run to another meeting. This meeting was very productive for Dr. Nasser's study. While she waited for a cab at Ipanema Beach outside Dr. Correa's office, she made her final notes in her precious research notebook before the next stop of this trip, Sao Paulo.
health facilities for patient lists and also sending invitation letters to physicians may
yield a reliable accessible population. In addition, because in your trial the intervention (ImmuneD) consists of only one application and the outcomes can be measured in other centers, there is no problem if patients are in other cities. The
advantage of this method is that it has a lower cost (emails or invitation letters) and it
provides a more reliable population (decreasing costs of screening), therefore increasing
recruitment yield. This would be the simplest and cheapest solution.”
In addition, Dr. Costa continued, "As the study involves direct tests in the context of standard therapy, physicians who are concerned with the integrity of their patients may find the study ethical, which would also facilitate referral, as it increases the buy-in for this study—in fact, if a patient is recruited through the Internet, the physician might advise the patient not to participate in the study."
However, Dr. Costa pointed out, “Even though this seems like a good strategy, you
need to be prepared for a low number of referrals by colleagues, as their perception of the
importance of the trial (having a busy agenda as the background) might be low. This would
be the main problem for you.”
Dr. Nasser reviewed the main points raised in this meeting, adding everything to her notebook. It was almost 9 p.m., and she was heading to the hotel; at least the traffic was better, and she only needed to worry about getting to the airport on time the next day.
CASE DISCUSSION
Dr. Angela Nasser intends to test a novel pharmacological add-on medication for the
treatment of patients with type I diabetes, which may help to reduce the necessary
daily dosages of insulin in these patients. As type I diabetes is not as highly prevalent
as type II diabetes, Dr. Nasser has to carefully consider her options of patient recruit-
ment in order to enroll a sufficient number of study participants. The first mandatory
step in this process is to define the target population and the accessible population in
consideration of the eligibility criteria defined.
In this case, the target population can be easily identified as newly diagnosed type
I diabetes patients of a certain age, whereas the population accessible to Dr. Nasser's recruitment efforts depends in large part on the advertisement strategy she chooses. The scope of advertising and consequently the size of the population she can
reach are restricted by both the financial resources available and the accessibility of
disease-specific medical institutions, such as clinical type I diabetes centers or type
I diabetes support groups. Accordingly, Dr. Nasser is facing the challenge of choosing a
recruitment strategy that allows for both enrolling a sufficient number of eligible study
participants and keeping a balance between internal validity and external generaliza-
bility. For example, restricting patient recruitment to a dedicated clinical type I diabetes center would be time- and cost-efficient and would result in high internal validity, but could fail to produce data that can be extrapolated to a larger population.
Furthermore, the generalizability of Dr. Nasser’s study findings is influenced by the
chosen eligibility criteria. Selecting only a few or very broad inclusion criteria will make it easier to enroll subjects, and results will have a higher degree of generalizability;
however, the sample will be more heterogeneous, which will affect sample size/power
calculations and will require adjustments in the statistical analysis such as covariate
adjustments [23]. On the other hand, too many or too restrictive eligibility criteria
make it difficult to enroll patients and may lead to a narrow study population with low
generalizability.
In addition to the considerations regarding the study population, Dr. Nasser needs
to select a strategy to advertise the trial among this population. Therefore, she must
select between various possible recruitment methods, each with different advantages, disadvantages, and outcomes: community-based strategies
(broad-based) ensure both fast progress in recruitment and a high degree of diversity
of patients, but might attract many non-eligible subjects, whereas health-care-provider-
based strategies (targeted) provide a more reliable population but highly depend on
the degree to which the health-care provider collaborates by referring patients to the
study. Consequently, the decision of using either a broad-based or targeted recruit-
ment strategy may influence the demographic constitution of the study population.
While referrals from private practice may include more patients with higher educa-
tion and social status, some community-based strategies such as advertisement in
newspapers or on Internet platforms such as “craigslist” may address more patients
with low income and lower education. In order to increase the heterogeneity of the
study population, Dr. Nasser might also consider a combination of broad-based and
targeted strategies.
Although there is no ultimate rule for the selection of the best technique to recruit participants in a clinical research study, it remains crucial for the success of the study to plan the recruitment strategy carefully in advance.
FURTHER READING
Recruitment
Campbell MK, Snowdon C, Francis D, Elbourne D, McDonald AM, Knight R, et al. Recruitment
to randomised trials: strategies for trial enrollment and participation study. The STEPS study.
Health Technol Assess. 2007; 11: ix–105.
Daley AJ, Crank H, Mutrie N, Saxton JM, Coleman R. Patient recruitment into a randomised
controlled trial of supervised exercise therapy in sedentary women treated for breast cancer.
Contemp Clin Trials. 2007; 28(5): 603–613.
Gillan MG, Ross S, Gilbert FJ, Grant AM, O’Dwyer PJ. Recruitment to multicentre trials: the
impact of external influences. Health Bull (Edinb) 2000; 58: 229–234.
Gorkin L, Schron EB, Handshaw K, Shea S, Kinney MR, Branyon M, et al. Clinical trial
enrollers vs. nonenrollers: the Cardiac Arrhythmia Suppression Trial (CAST) Recruitment
and Enrollment Assessment in Clinical Trials (REACT) project. Control Clin Trials. 1996;
17(1): 46–59.
Grunfeld E, Zitzelsberger L, Coristine M, Aspelund F. Barriers and facilitators to enrollment
in cancer clinical trials: qualitative study of the perspectives of clinical research associates.
Cancer. 2002; 95(7): 1577–1583.
Johnson MO, Remien RH. Adherence to research protocols in a clinical context: challenges
and recommendations from behavioral intervention trials. Am J Psychother. 2003;
57: 348–360.
Ngune I, Jiwa M, Dadich A, Lotriet J, Sriram D. Effective recruitment strategies in primary care research: a systematic review. Qual Prim Care. 2012; 20: 115–123.
Peto V, Coulter A, Bond A. Factors affecting general practitioner’s recruitment of patients into a
prospective study. Fam Pract. 1993; 10(2): 207–211.
Rengerink O, Opmeer BC, Loqtenberg SL, Hooft L, Bloemenkamp KW, Haak MC, et al.
Improving participation of patients in clinical trials—rationale and design of IMPACT.
BMC Med Res Methodol. 2010; 10: 85.
Spilker B, Cramer JA. Patient recruitment in clinical trials. New York: Raven Press; 1992.
Adherence
Connolly NB, Schneider D, Hill AM. Improving enrollment in cancer clinical trials. Oncol Nurs
Forum. 2004 May; 31(3): 610–614.
Ickovics JR, Meisler AW. Adherence in AIDS clinical trials: a framework for clinical research and
clinical care. J Clin Epidemiol. 1997; 50(4): 385–391.
Osterberg L, Blaschke T. Adherence to medication. N Engl J Med. 2005; 353(5): 487–497.
REFERENCES
1. Ashery RS, McAuliffe WE. Implementation issues and techniques in randomised trials of
outpatient psychosocial treatments for drug abusers: recruitment of subjects. Am J Drug
Alcohol Abuse. 1992; 18(3): 305–329.
2. Spilker B, Cramer JA (eds.). Patient recruitment in clinical trials. New York: Raven
Press; 1992.
3. Hulley SB, Cummings SR, Browner WS, et al. Designing clinical research: an epidemiologic
approach, 2nd ed. London: Lippincott Williams and Wilkins; 2001.
4. Cooley ME, Sarna L, Brown JK, Williams RD, Chernecky C, Padilla G, et al. Challenges of
recruitment and retention in multisite clinical research. Cancer Nursing. 2003; 26: 376–384.
5. Fedor C, Cola P, Pierre C (eds.). Responsible research: a guide for coordinators. London:
Remedica; 2006.
6. Sinackevich N, Tassignon J-P. Speeding the critical path. Appl Clin Trials. 2004; 13(1): 42–48.
7. Feman SPC, Nguyen LT, Quilty MT, Kerr CE, Nam BH, Conboy LA, et al. Effectiveness of
recruitment in clinical trials: an analysis of methods used in a trial for irritable bowel syn-
drome patients. Contemp Clin Trials. 2008; 29(2): 241–251.
8. Patel MX, Doku V, Tennakoon L. Challenges in recruitment of research participants. Adv
Psyichiatr Treat. 2003; 9: 229–238.
9. Gorkin L, Schron EB, Handshaw K, Shea S, Kinney MR, Branyon M, et al. Clinical trial
enrollers vs. nonenrollers: the Cardiac Arrhythmia Suppression Trial (CAST) Recruitment
and Enrollment Assessment in Clinical Trials (REACT) project. Control Clin Trials. 1996;
17(1): 46–59.
10. Sabaté E. WHO adherence meeting report. Geneva: World Health Organization, 2001.
11. Delamater AM. Improving patient adherence. Clin Diabetes. 2006; 24: 71–77.
12. Meichenbaum D, Turk DC. Facilitating treatment adherence: a practitioner’s guidebook.
New York: Plenum Press;1987.
13. Robiner WN. Enhancing adherence in clinical research. Contemp Clin Trials. 2005; 26: 59–77.
14. Claxton AJ, Cramer J, Pierce C. A systematic review of the associations between dose
regimens and medication compliance. Clin Ther. 2001; 23(8): 1296–1310.
15. Cramer J, Rosenheck R, Kirk G, Krol W, Krystal J. Medication compliance feedback and
monitoring in a clinical trial: predictors and outcomes. Value Health. 2003; 6: 566–573.
16. Waeber B, Leonetti G, Kolloch R, McInnes GT. Compliance with aspirin or placebo in the
Hypertension Optimal Treatment (HOT) study. J Hypertens. 1999; 17: 1041–1045.
17. Claxton AJ, Cramer J, Pierce C. A systematic review of the associations between dose
regimens and medication compliance. Clin Ther. 2001; 23: 1296–1310.
18. Osterberg L, Blaschke T. Adherence to medication. N Engl J Med. 2005; 353: 487–497.
19. Hunninghake DB. The interaction of the recruitment process with adherence.
In: Shumaker SA, Schron EB, Ockene JK, eds. The handbook of health behavior change.
New York: Springer; 1990.
20. Lang JM, Buring JE, Rosner B, Cook N, Hennekens CH. Estimating the effect of the run-in
on the power of the physicians’ health study. Stat Med. 1991; 10: 1585–1593.
21. Vellas B. Recruitment, retention and other methodological issues related to clinical trials for Alzheimer's disease. J Nutr Health Aging. 2012; 16(4): 330.
22. Duncanson K, Burrows T, Collins C. Study protocol of a parent-focused child feeding and
dietary intake intervention: the feeding healthy food to kids randomized controlled trial.
BMC Public Health 2012; 12: 564.
23. Roozenbeek B, Lingsma HF, Maas AI. New considerations in the design of clinical trials for
traumatic brain injury. Clin Investig. 2012; 2(2): 153–162.
UNIT II
Basics of Statistics
8
BASICS OF STATISTICS
INTRODUCTION
In Unit I, you learned the basics of clinical research. You learned how to select the
research question (What are you trying to study/prove?), define the study popula-
tion (Whom do you want to study to test/prove your question?), design your study
(How are you going to test/prove your question?). But what are you going to do
with the data your study will generate? How will you analyze the data, and how will
you interpret the results? Unit II will introduce you to statistics and will give you the
knowledge and tools to formulate a data analysis plan that will help you to answer
these questions.
The importance and impact of statistics in today’s world is immense. In TV and
other media, statistics is used all the time to support a certain message; it is used in
surveys, to analyze trends and to make predictions (e.g., the odds of a team winning
the next Super Bowl). It is also frequently misinterpreted (e.g., when an observed association between coffee consumption and a lower risk of diabetes is reported as if it were causal). More important, statistics plays an essential role in
many decision-making processes. Statistics is widely used in many fields, including
business, social science, psychology, and agriculture. When the focus is on the biolog-
ical and health sciences, the term biostatistics is used.
Your journey into the world of statistics starts with a hypothesis: Is a new drug
more effective than placebo in treating neuropsychiatric symptoms in patients with
Alzheimer’s disease? Is prostate-specific antigen screening efficient for early detection
of prostate cancer? Can the combination of PET and CT scanning predict the risk of
myocardial infarction in patients with coronary artery disease? Should a new social
program be implemented to reduce poverty among the elderly? To find answers to
these questions, we need to collect and analyze data from a representative sample from
a larger population (for more details, see Chapter 3). Statistics provides methods of
describing and summarizing the data that we have collected from a sample and allows
us to extrapolate results to make inferences about the population from which the
sample was drawn.
Statistics can be classified into two categories: descriptive and inferential statis-
tics. The term descriptive statistics refers to measures that summarize and characterize
a set of data that allow us to better understand the attributes of a group or pop-
ulation. As you will see, these measures can be graphical or numerical. Whereas
descriptive statistics examine the sample data, inferential statistics and hypothesis
testing aim to use sample data to learn about the population from which the sample
was drawn based on probability theory. Inferential statistics will be discussed in the
next chapter.
Suppose you want to answer the first question proposed in this chapter: “Is a new
drug more effective than placebo in treating neuropsychiatric symptoms in patients
with Alzheimer’s disease?” After randomizing our patients into the placebo or new
drug group and following allocation of the placebo or the treatment, we need to collect
information about the frequency of neuropsychiatric symptoms in the two groups be-
fore and after intervention and compare the data. Therefore the first important point
when learning and applying statistics is to understand the study variables and their
characteristics. In fact, the investigator needs to know the main characteristics of the
variables in order to know what to do with them [1].
This chapter will present the different types of data and the methods that can be
used to organize and display each type of data. We will then introduce basic proba-
bility theory and probability distributions, with a particular emphasis on the normal
distribution, which appears frequently in real life and plays a central role in many
statistical tests.
TYPES OF DATA
It is important to classify the types of data that you are working with, since the data
type dictates the method of data analysis that you will use. The different types of data
and their general characteristics are described in Table 8.1.
Nominal
Nominal data, also referred to as categorical data, represent unordered categories or
classes. For instance, one of the possible ways to categorize race in humans is “White,”
“Black,” and “Other races.” Numbers may be used to represent categories. White can
be arbitrarily coded as 0, Black as 1, and other races as 2. However, these numbers
do not express order or magnitude and are not meaningful in themselves. Dichotomous or binary variables are a special type of nominal data; the two terms are interchangeable and are used when the variable has only two distinct categories. In the example provided
in Table 8.1, gender and race are two examples of nominal data, but only gender is
considered dichotomous or binary.
Ordinal
When a natural order among categories exists, data are referred to as ordinal. The
New York Heart Association (NYHA) classification describes four categories of heart
failure according to severity of symptoms and degree of limitation in performing daily
activities [2]:
I. No limitation on physical activity. Ordinary physical activity does not cause fa-
tigue, palpitation, or dyspnea.
II. Slight limitation of physical activity. Comfortable at rest, but ordinary physical
activity results in fatigue, palpitation, or dyspnea.
III. Marked limitation of physical activity. Comfortable at rest, but less than ordinary
activity causes fatigue, palpitation, or dyspnea.
IV. Unable to carry out any physical activity without discomfort. Symptoms of car-
diac insufficiency are present at rest. If any physical activity is undertaken, discom-
fort is increased.
Note that severity of heart failure increases from class I (no symptoms) to class IV (se-
vere symptoms). However, the magnitude of the difference between adjacent classes
is not necessarily equivalent. The difference between classes III and IV is not neces-
sarily the same as the difference between classes I and II, even if both pairs are one
unit apart. As with nominal data, ordinal variables may be coded using numbers, but
these numbers are not meaningful; consequently, arithmetic operations should not be
performed on ordinal data.
Discrete
Discrete data are numerical values that represent measurable quantities. Discrete data
are restricted to whole values and are often referred to as count data. Examples of dis-
crete data include the number of deaths in the United States in 2012 and the number
of years a group of individuals has received formal education. Note that an ordering
exists among possible values, and the difference between the values one and two is the
same as the difference between the values five and six.
Arithmetic rules can be applied to discrete data; however, some arithmetic oper-
ations performed on two discrete values are not necessarily discrete. In our example,
suppose one individual has 3 years of education and the other one has 4 years; the
average number of years of education for the two individuals is 3.5, which is no longer
an integer.
Continuous Data
Continuous data also represent measurable quantities, but are not restricted to
whole values (integers) and may include fractional and decimal values. Therefore, the
difference between any two values can be arbitrarily small depending on the accuracy
of our measurement instrument. As with discrete data, the spacing between values is
meaningful. Arithmetic procedures can be applied. Examples of continuous data in-
clude temperature, weight, and cholesterol level.
CHOOSING A STATISTICAL TEST
The choice of outcome and the independent variables under consideration will influ-
ence the type of statistical test we can use to test the study hypothesis. You will learn in
the next chapter that if we use a continuous variable (e.g., BMI), a t-test is appropriate
to test for differences in mean BMI levels between two groups. If we use the binary
variable for obesity status created using the WHO’s obesity cut-off, a chi-square test of
homogeneity is appropriate. You will learn that parametric tests like t-tests have more
statistical power to detect possible differences in the outcome variable between two
populations than non-parametric tests like the chi-square test.
However, if BMI is treated as continuous, small differences in BMI that are detected
might have little clinical significance. To be specific, a small difference in BMI between
groups that is considered a statistically significant difference in our trial might have
limited impact on the patient’s health and quality of life. In this case, when designing
our study, we should carefully balance statistical power and a clinically meaningful
difference in BMI.
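To make this trade-off concrete, here is a minimal sketch in Python (not from the book; the simulated BMI values and sample sizes are illustrative) that runs both analyses on the same data: a t-test on continuous BMI and a chi-square test on obesity status dichotomized at the WHO cut-off of 30 kg/m².

```python
# Illustrative only: t-test on continuous BMI vs. chi-square on dichotomized BMI.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
bmi_a = rng.normal(27.0, 4.0, 100)   # hypothetical BMI values, group A
bmi_b = rng.normal(28.5, 4.0, 100)   # hypothetical BMI values, group B

# Continuous outcome: two-sample t-test comparing mean BMI
t_stat, p_continuous = stats.ttest_ind(bmi_a, bmi_b)

# Dichotomized outcome: obese (BMI >= 30) vs. non-obese, chi-square on the 2x2 table
obese_a, obese_b = int(np.sum(bmi_a >= 30)), int(np.sum(bmi_b >= 30))
table = [[obese_a, len(bmi_a) - obese_a],
         [obese_b, len(bmi_b) - obese_b]]
chi2, p_categorical, dof, expected = stats.chi2_contingency(table)

# The continuous analysis typically yields the smaller p-value (more power),
# while the categorical analysis maps directly onto a clinical definition.
print(f"t-test (continuous BMI):     p = {p_continuous:.4f}")
print(f"chi-square (obesity status): p = {p_categorical:.4f}")
```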
DESCRIPTIVE STATISTICS
The first step in data analysis is to describe or summarize the data that you have col-
lected through tables, graphs, and/or numerical values. This is an important step, be-
cause it will allow you to assess how the data are distributed and how the data should
be analyzed. When reporting the results of a study, including a description of the study
population is essential so that the findings can be generalized to other comparable
populations. You may also be interested in making inferences about the population
that your data were sampled from; this will be discussed in more depth in subsequent
chapters [3].
GRAPHICAL REPRESENTATION
Nominal and Ordinal Data
Nominal and ordinal data are summarized by the absolute and relative frequency of
observations. Using the previous example of NYHA classification for heart failure,
we can count the number of patients among each category. This is called the abso-
lute frequency. The relative frequency will be the proportion of the total number of
observations in each category; the cumulative relative frequency of a category is the sum of the relative frequencies of that category and all the categories below it. These quantities are shown in Table 8.2.
Table 8.2. Absolute, relative, and cumulative frequencies of the NYHA classification in 80 patients
Class | Absolute frequency | Relative frequency (%) | Cumulative relative frequency (%)
I | 10 | 12.50 | 12.50
II | 35 | 43.75 | 56.25
III | 25 | 31.25 | 87.50
IV | 10 | 12.50 | 100
Total | 80 | 100 | –
[Figures: bar charts (a: absolute frequency; b: relative frequency), pie charts, and frequency polygons of the NYHA classification data in Table 8.2.]
Ordinal data can additionally be presented by relative frequency and cumulative frequency polygons, as shown in Figure 8.3.
Discrete and Continuous Data
Discrete and continuous data are often grouped into intervals and summarized in a frequency table. Table 8.3 shows the systolic blood pressure of 117 patients, grouped into intervals of 20 mmHg.

Table 8.3. Absolute frequencies of systolic blood pressure in 117 patients
Systolic blood pressure (mmHg) | Number of patients
100–119 | 15
120–139 | 48
140–159 | 36
160–179 | 13
180–199 | 5
Total | 117
The histogram of these observations is shown in Figure 8.4. Note that the frequency
associated with each interval in a histogram is represented by the bar’s area. Therefore,
a histogram with unequal interval widths should be interpreted with caution.
A histogram is a quick way to make an initial assessment of your data by showing
you how the data are distributed. When studying a histogram, you might ask yourself
the following: What is the shape of the distribution? (The distribution is called uni-
modal if it has one major peak, bimodal if it has two major peaks, and multimodal if it
has more than two major peaks.) Is the histogram symmetric? (A bell-shaped distri-
bution is symmetric, and the tapered ends of the distribution are referred to as tails.
It is common to run into distributions that are unimodal where one tail is longer than
the other; this type of distribution is called skewed.) Is there a center? How are the
data points spread?
You will learn about ways to quantify the center and spread of a distribution in the
next section on summary statistics.
The frequency polygon already described for ordinal data can also be used to rep-
resent discrete and continuous data. In this case, the frequency polygon uses the same
Figure 8.4. Histogram representing absolute frequencies of systolic blood pressure for the data
shown in Table 8.3.
Figure 8.5. Frequency polygon: Absolute frequency of systolic blood pressure for the data shown in Table 8.3.
two axes as a histogram. It is constructed by placing a point at the center of each in-
terval and then connecting the points in adjacent bins by straight lines (Figure 8.5).
The cumulative frequency polygon can also be used with discrete and continuous
data, as shown in Figure 8.6.
Another way to summarize a set of discrete or continuous data graphically is the
box plot, as shown in Figure 8.7, which displays a sample of 444 measures of weight in
kilograms. The central box represents the interquartile range, which extends from the
25th percentile, Q1, to the 75th percentile, Q3; further explanations about percentiles
will be presented in the next section. The line inside the box marks the median (50th
percentile, Q2).
[Figure 8.6: cumulative frequency polygon—cumulative frequency (%) of systolic blood pressure for the data shown in Table 8.3.]
[Figure 8.7: box plot of 444 measures of weight in kilograms, marking the 25th percentile, the median, the 75th percentile, and outliers.]
The lines extending from the interquartile range are called whiskers.
They extend to the most extreme observations in the data set that are within 1.5 times
the interquartile range from the lower or upper quartile. To find these extreme values, find the largest data value that is smaller than Q3 + 1.5 × IQR, and similarly find the smallest data value that is larger than Q1 − 1.5 × IQR. All points outside the whiskers are considered outliers and are commonly represented by points, circles, or stars.
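As an illustration, this Python sketch (hypothetical weight values; note that NumPy's default percentile interpolation differs slightly from the hand rule given later in this chapter) computes the quantities that a box plot displays:

```python
# Illustrative computation of box plot quantities: quartiles, IQR, whiskers, outliers.
import numpy as np

weights = np.array([52, 58, 61, 63, 66, 70, 74, 79, 85, 120])  # hypothetical weights (kg)

q1, median, q3 = np.percentile(weights, [25, 50, 75])
iqr = q3 - q1

# Whiskers reach the most extreme observations within 1.5*IQR of the quartiles
upper_whisker = weights[weights <= q3 + 1.5 * iqr].max()
lower_whisker = weights[weights >= q1 - 1.5 * iqr].min()

# Everything beyond the whiskers is flagged as an outlier (here, the 120 kg value)
outliers = weights[(weights > q3 + 1.5 * iqr) | (weights < q1 - 1.5 * iqr)]
print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}, outliers={outliers}")
```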
When we are interested in showing the relationship between two different contin-
uous variables, a two-way scatter plot can be used. Each point on the graph represents
a pair of values. The scale for one measure is marked on the horizontal axis (x axis) and
the scale for the other on the vertical axis (y axis).
A scatter plot gives us a good idea of the level of correlation between the two
variables and also the nature of this correlation (linear, curvilinear, quadratic, etc.).
Figure 8.8 shows the relationship between weight (x axis) and height (y axis) in the
444 patients shown earlier.
Figure 8.8. Two-way scatter plot: Weight (in kilograms) versus height (in meters) in 444 observations.
A line graph is similar to a two-way scatter plot in that it can be used to illustrate
the relationship between two discrete or continuous variables. However, in a line
graph, each x value can have only a single corresponding y value. Adjacent points are
connected by straight line segments. Line graphs are often used to depict the change of one variable over time (with time represented on the x axis). Figure 8.9 shows the measures of glycemia in a hypothetical patient with diabetes.
SUMMARY STATISTICS
Graphs give a qualitative impression of the data; numerical summary measures provide single representative values that characterize a set of observations. These representative values often address central tendency (the location of the center around which the observations fall) and dispersion (variability, or spread, of the data).
Mean
The most common measure of central tendency for discrete and continuous data is the
mean, also referred to as the average. The mean of a variable is calculated by summing
all of the observations and dividing by the total number of observations. The mean is
represented by $\bar{x}$ (spoken: "x bar"), and its mathematical notation is

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Consider the following 10 measurements of systolic blood pressure (in mmHg): 110, 134, 126, 154, 168, 128, 168, 158, 170, 188. The mode, the value that occurs most frequently in a data set, is 168 mmHg in this example, since this value occurs twice, more than any other value. However, the mode is not really informative here, since it is based on only 2 of the 10 observations. The mean is more useful and is calculated as
$$\bar{x} = \frac{1}{10}\sum_{i=1}^{10} x_i = \frac{1}{10}(110 + 134 + 126 + 154 + 168 + 128 + 168 + 158 + 170 + 188) = 150.4 \text{ mmHg}$$
It may be useful to note that the value of 1/10th is applied to every data value when
calculating the mean. We can view 1/10th as the “weight” of each value, and we are
essentially applying equal weight to each observation since we do not have prior
knowledge about the distribution of the data.
The mean is very sensitive to extreme values. In other words, if a data set contains
an outlier or an observation that has a value that is very different from the others, the
163 Chapter 8. Basics of Statistics
mean will be highly affected by it. Suppose the last observation were wrongly recorded
as 1880 mmHg. The mean in this case is
$$\bar{x} = \frac{1}{10}(110 + 134 + 126 + 154 + 168 + 128 + 168 + 158 + 170 + 1880) = 319.6 \text{ mmHg}$$
This mean systolic blood pressure of 319.6 mmHg is more than twice that of the
previously calculated mean. A systolic blood pressure value of 1880 mmHg is im-
possible in human beings so we should question this value and correct it. However,
sometimes an error might not be so obvious, or an apparent error may not be an
error at all. If we want to summarize the entire set of observations in the presence
of outliers, we might prefer to use a measure that is not so sensitive to extreme
observations; we will see that the median is one such measure (see later discussion
in this chapter).
Sometimes we do not have access to individual measures in our data set, but only
have summarized data in frequency distribution tables, as shown in Table 8.3. Data
of this form are called grouped data. Because we do not have the entire data set, we
cannot calculate the mean, but we can calculate the grouped mean, which is a different
kind of average. To calculate the grouped mean, multiply the midpoint of each interval
by the corresponding frequency, add these products, and divide the resulting sum by
the total number of observations. The grouped mean is a weighted average of the in-
terval midpoints, where each midpoint value is weighted by the relative frequency
of the observations within each interval. (The relative frequency of an interval is the
number of observations in an interval divided by the total number of observations.)
The mathematical representation of the grouped mean is
$$\bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i}$$
where k is the number of intervals in the table, mi is the midpoint of the i-th interval,
and fi is the absolute frequency of the i-th interval.
From the example shown in Table 8.3:

$$\bar{x} = \frac{1}{117}[109.5(15) + 129.5(48) + 149.5(36) + 169.5(13) + 189.5(5)] = \frac{16391.5}{117} = 140.10 \text{ mmHg}$$
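A short Python sketch of the same calculation, using the midpoints and frequencies from Table 8.3, may help:

```python
# Grouped mean for the Table 8.3 data: interval midpoints weighted by frequency.
midpoints = [109.5, 129.5, 149.5, 169.5, 189.5]   # midpoint of each BP interval
freqs = [15, 48, 36, 13, 5]                        # number of patients per interval

grouped_mean = sum(m * f for m, f in zip(midpoints, freqs)) / sum(freqs)
print(f"Grouped mean: {grouped_mean:.2f} mmHg")    # 140.10 mmHg
```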
Median
The median is defined as the middle number in a list of values ordered from
smallest to largest. (If there is no middle number, the median is the mean of the
two middle values.) The median is a measure of central tendency that is not as
sensitive to outliers compared to the mean. It can be used to summarize discrete or
continuous data.
In the previous example, we first rank the 10 measurements of systolic blood pressure from smallest to largest:

110, 126, 128, 134, 154, 158, 168, 168, 170, 188

Since we have 10 observations, the median is the average of the two middle values, the 5th (154 mmHg) and 6th (158 mmHg) observations; therefore, it is 156 mmHg. Observe that the median divides the data into two halves; one half is less than the median, the other half is greater than the median.
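The following Python snippet (using the chapter's ten measurements) makes the contrast explicit: the erroneous 1880 mmHg value inflates the mean but leaves the median untouched.

```python
# Mean vs. median sensitivity to an outlier, using the chapter's ten BP values.
import statistics

bp = [110, 134, 126, 154, 168, 128, 168, 158, 170, 188]
bp_error = bp[:-1] + [1880]   # last value wrongly recorded as 1880 mmHg

print(statistics.mean(bp), statistics.median(bp))              # 150.4  156.0
print(statistics.mean(bp_error), statistics.median(bp_error))  # 319.6  156.0
```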
The most appropriate measure of central tendency to use depends on the distri-
bution of the values. If the distribution of the data is symmetric and unimodal, as
shown in Figure 8.10 a, the mean, median, and mode should be the same. In this sce-
nario, the mean is commonly preferred. When the data are not symmetric, the me-
dian is the best measure of the central tendency. The data in Figure 8.10 b are skewed
to the right since the right tail of the distribution is longer and fatter; similarly the
data in Figure 8.10 c are skewed to the left. Since the mean is sensitive to outliers, it is
pulled in the direction of the longer tail of the distribution. Therefore, in a unimodal
Figure 8.10. Possible distributions of the data: (a) Unimodal and symmetric; (b) unimodal and
right-skewed; and (c) unimodal and left-skewed. The solid and dotted lines represent the location of
the median and mean, respectively.
Figure 8.11. Two distributions with the same mean, median, and mode, but different measures of dispersion.
distribution, when the data are skewed to the right, the mean tends to lie to the right
of the median; when they are skewed to the left, the mean tends to lie to the left of
the median.
Measures of Dispersion
Although two different distributions may have the same mean, median, and mode,
they could be very different, as shown in Figure 8.11. Measures of dispersion are nec-
essary to further describe the data and complement the information provided by
measures of central tendency.
Range
The range of a group of observations is defined as the difference between the largest
observation and the smallest. The range is easy to compute and gives us a rough idea
of the spread of the data; however, the usefulness of the range is limited. The range is
highly sensitive to outliers since it considers only the two most extreme values of a
data set, the minimum and maximum values. In our previous example of 10 measures
of systolic blood pressure, the range is 78 mmHg in the first set of observations and
1770 mmHg when the erroneous measurement of 1880 mmHg is included!
Interquartile Range
The interquartile range (IQR) represents the middle 50% of the data. To calculate the
interquartile range, you must first find the 25th and 75th percentiles. The 25th percen-
tile, also called the first quartile and denoted Q1, is the value below which 25% of the
data fall, when the data are ordered from smallest to largest. Similarly, the 75th per-
centile, also referred to as the third quartile and denoted Q3, is the value below which
75% of the data fall. The interquartile range is found by taking the difference between
the 75th and 25th percentiles. The interquartile range is often reported with the me-
dian, as it is not affected by extreme values.
To calculate the 25th percentile of a set of measurements, first order the values
from smallest to largest. Then calculate the “position” of the 25th percentile, which is
equal to n(25)/100. In our example of 10 measures of blood pressure, the location of
the 25th percentile is 10(25)/100 = 2.5, which is not an integer. In this case, round up
to the next integer. Therefore, the 25th percentile is the 2 + 1 = 3rd smallest measure-
ment (or 3rd from the left), 128 mmHg. Similarly, the position of the 75th percentile
is 10(75)/100 = 7.5, which again is not an integer; rounding up to the nearest integer,
the 75th percentile is the 7 + 1 = 8th smallest measurement, 168 mmHg.
To find the kth percentile of a data set, we should begin by ranking the measurements
from smallest to largest. Next, to find the position of the kth percentile, calculate nk/100.
If nk/100 is an integer, the kth percentile is the average between the (nk/100)th
smallest number and the [(nk/100) + 1]th smallest number. If nk/100 is not an in-
teger, the kth percentile is the (j + 1)th smallest measurement, where j is the largest
integer that is less than nk/100.
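This rule translates directly into code; the Python sketch below (illustrative; Python's zero-based indexing absorbs the "j + 1" bookkeeping) reproduces the 25th and 75th percentiles found earlier.

```python
# Direct implementation of the percentile rule stated in the text.
def kth_percentile(values, k):
    data = sorted(values)
    n = len(data)
    pos = n * k / 100
    if pos == int(pos):                      # integer position: average two neighbors
        j = int(pos)
        return (data[j - 1] + data[j]) / 2   # (nk/100)th and (nk/100 + 1)th smallest
    return data[int(pos)]                    # (j + 1)th smallest, j = floor(nk/100)

bp = [110, 134, 126, 154, 168, 128, 168, 158, 170, 188]
print(kth_percentile(bp, 25), kth_percentile(bp, 75))   # 128 168
print(kth_percentile(bp, 50))                           # 156.0 (the median)
```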
Variance and Standard Deviation
The most commonly used measures of dispersion are the variance and the standard deviation. A natural first attempt at quantifying spread is to average the deviations of the observations from the mean:

$$\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})$$
It turns out that this expression is always equal to zero, since the sum of the deviations of the observations below $\bar{x}$ cancels the sum of the deviations of the observations above $\bar{x}$. To solve this problem, we might square the deviations from the mean and then average these squared values to get a single number. Note that this resulting number is a squared distance, but we are looking for a number that represents the typical distance between an observation and the mean, so it makes sense to take the square root of the
statistic. Thus a good candidate for the standard deviation is
$$\sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
It turns out that dividing by n–1 instead of n gives us a value that has better statistical
properties. Thus, the standard deviation, denoted s, is given by the following equation.
$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
$x_i$ | $x_i - \bar{x}$ | $(x_i - \bar{x})^2$
110 | −40.4 | 1632.16
134 | −16.4 | 268.96
126 | −24.4 | 595.36
154 | 3.6 | 12.96
168 | 17.6 | 309.76
128 | −22.4 | 501.76
168 | 17.6 | 309.76
158 | 7.6 | 57.76
170 | 19.6 | 384.16
188 | 37.6 | 1413.76
Sum | | 5486.40
For the 10 measures of systolic blood pressure, the mean is 150.4 mmHg, and the variance is the sum of the squared deviations divided by n − 1:

$$s^2 = \frac{5486.40}{10 - 1} = 609.6 \text{ mmHg}^2$$

Taking the square root gives the standard deviation, $s = \sqrt{609.6} = 24.69$ mmHg. Usually, the mean and the standard deviation are used to describe the characteristics of the entire distribution of values.
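As a quick check, the computation can be reproduced in a few lines of Python (note the division by n − 1):

```python
# Variance and standard deviation of the ten BP measurements, dividing by n - 1.
import math

bp = [110, 134, 126, 154, 168, 128, 168, 158, 170, 188]
n = len(bp)
mean = sum(bp) / n                        # 150.4
ss = sum((x - mean) ** 2 for x in bp)     # sum of squared deviations: 5486.40
variance = ss / (n - 1)                   # 609.6
sd = math.sqrt(variance)                  # 24.69
print(mean, ss, variance, round(sd, 2))
```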
Standard Error of the Mean
If we drew repeated samples of size n from the same population, the sample means would vary from sample to sample; the standard error of the mean (SEM) quantifies this variability:

$$SEM = \frac{s}{\sqrt{n}}$$

Since the SEM is equal to the standard deviation divided by the square root of the sample size, the SEM is always smaller than the SD.
Confidence Interval
As mentioned earlier, the mean of a sample is only an estimate of the true mean, μ,
from which the data were sampled. One can conceive that there is some error involved
with estimating the population by a mean of just one sample. We can create an interval
168 Unit II. Basics of Statistics
around the sample mean with a margin of error that is 2 times the standard error of the
mean (SEM), which is called a 95% confidence interval for the true population mean,
given by the following:
$$\bar{x} \pm 2 \times SEM = \bar{x} \pm 2\frac{s}{\sqrt{n}}$$
We say that “we are 95% confident that the true population mean falls in this interval.”
What this really means is the following: imagine that many samples of the same size
are drawn from a population; then 95% of these samples will have confidence intervals
that capture the true population mean.
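A small Python sketch (illustrative) ties together the standard deviation, the SEM, and the approximate 95% confidence interval for the same ten measurements:

```python
# SEM and approximate 95% CI (x-bar +/- 2*SEM) for the ten BP measurements.
import math

bp = [110, 134, 126, 154, 168, 128, 168, 158, 170, 188]
n = len(bp)
mean = sum(bp) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in bp) / (n - 1))
sem = sd / math.sqrt(n)                   # always smaller than sd

ci_low, ci_high = mean - 2 * sem, mean + 2 * sem
print(f"mean={mean:.1f}, SEM={sem:.2f}, 95% CI ~ ({ci_low:.1f}, {ci_high:.1f})")
```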
Coefficient of Variation
It is possible to compare the variability among two or more sets of data with different
units of measurement using a numerical measure known as the coefficient of variation.
It relates the standard deviation of a set of observations to its mean and is a measure of
relative variability. It is calculated using the following formula:
$$CV = \frac{s}{\bar{x}} \times 100\%$$

For the 10 measures of systolic blood pressure:

$$CV = \frac{24.69}{150.4} \times 100\% = 16.42\%$$
On its own, it is difficult to assess whether this value is small or large. The usefulness of
this measure is to compare two or more sets of data.
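The following Python sketch (the weight values are hypothetical) shows the kind of comparison the CV is designed for, across measurements with different units:

```python
# Comparing relative variability of two measurements with different units via the CV.
def cv(values):
    mean = sum(values) / len(values)
    sd = (sum((x - mean) ** 2 for x in values) / (len(values) - 1)) ** 0.5
    return sd / mean * 100

bp_mmhg = [110, 134, 126, 154, 168, 128, 168, 158, 170, 188]   # systolic BP (mmHg)
weight_kg = [61.0, 70.5, 58.2, 80.1, 66.3, 74.8]               # hypothetical weights (kg)

print(f"CV of blood pressure: {cv(bp_mmhg):.2f}%")   # ~16.42%
print(f"CV of weight:         {cv(weight_kg):.2f}%")
```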
PROBABILITY
Descriptive statistics are useful to summarize and evaluate a set of data, which is the
first step in statistical analysis. However, when we perform an experiment or observe
a phenomenon in a particular sample, we are interested in generalizing our findings
to the population from which the sample was drawn. This is achieved through statis-
tical inference. In the next four chapters, this concept will be highly utilized to explain
the basis of the statistical tests. The background necessary to understand statistical
inference is probability theory. The probability of an event is commonly defined as
the number of desired outcomes divided by the total number of possible outcomes.
Another common definition is the proportion of times the desired event occurs in an
infinitely large number of trials repeated under virtually identical conditions. Let us
see how the two definitions are reasonable by considering an example, the event of
getting a “head” in a fair coin toss. By the first definition, the probability of the event is
0.5, one head divided by two possible outcomes (head or tail). Now consider the latter
Figure 8.12. Probability distribution for a random variable, X, which is the number of heads that
appear in two coin flips.
definition and imagine tossing the coin repeatedly. If we flip the coin twice in a row, it is not guaranteed that we will see exactly one head, due to the random nature of a coin flip. However, the proportion of heads converges to 0.5 as the number of flips becomes increasingly large. In the same way, the proportion of tails will approach 0.5 if the coin is tossed a large enough number of times.
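A quick simulation in Python (illustrative; the seed is arbitrary) shows this convergence:

```python
# The proportion of heads approaches 0.5 as the number of fair-coin tosses grows.
import numpy as np

rng = np.random.default_rng(0)
for n in [10, 100, 10_000, 1_000_000]:
    tosses = rng.integers(0, 2, size=n)   # 1 = heads, 0 = tails
    print(f"{n:>9} tosses: proportion of heads = {tosses.mean():.4f}")
```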
A random variable is a variable that can assume different values such that any par-
ticular outcome is determined by chance. Every random variable has a corresponding
probability distribution, which describes the behavior of the random variable following
the theory of probability. It specifies all possible outcomes of the random variable
along with the probability that each will occur. The frequency distribution displays
each observed outcome and the number of times it appears in the data set. Similarly,
the probability distribution represents the relative frequency of occurrence of each
outcome in a large number of trials repeated under essentially identical conditions.
Since all possible values of the random variable are taken into account, the outcomes
are exhaustive, and the sum of their probabilities is equal to 1. For example, suppose
we flip two fair coins and let the random variable X denote the number of heads that
appear; the random variable X can take on values 0 through 2, since it is possible to ob-
serve no heads or, at the other extreme, two heads. The probability of getting no heads is ¼, the probability of getting exactly one head out of two coin flips is ½, and the probability of observing two heads is ¼; the probabilities of all the possible outcomes of the random variable X add up to 1. The probability distribution for this random variable
is displayed in Figure 8.12.
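This distribution can be verified by enumerating the four equally likely outcomes of two flips, as in this short Python sketch:

```python
# Probability distribution of X = number of heads in two fair coin flips.
from itertools import product
from collections import Counter

outcomes = list(product("HT", repeat=2))              # HH, HT, TH, TT
counts = Counter(flips.count("H") for flips in outcomes)
dist = {x: counts[x] / len(outcomes) for x in sorted(counts)}
print(dist)   # {0: 0.25, 1: 0.5, 2: 0.25}; the probabilities sum to 1
```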
Figure 8.13. The normal curve with mean µ and standard deviation σ: approximately 68%, 95%, and 99.7% of observations lie within 1, 2, and 3 standard deviations of the mean, respectively.
Suppose we draw two samples of size n from the same population and compute their sample means, $\bar{x}_1$ and $\bar{x}_2$; it is likely that $\bar{x}_1$ and $\bar{x}_2$ will be slightly different. If we continue to draw samples of size n from the population, we will end up with a sample of sample means. The probability distribution of these sample means is known as the sampling distribution of the sample mean. This distribution has a standard deviation that is equal to $\sigma/\sqrt{n}$, where σ is the true standard deviation of the entire population, and it is referred to as the standard error of the mean. Moreover, if n is large enough, the shape of the distribution is approximately normal.
1. Central limit theorem (CLT): The CLT states that if the sample size is large
enough, the distribution of sample means is approximately normal. The CLT
applies even if the distribution of the underlying data is not normal. However, the
farther the distribution departs from normality, the larger the sample size needs to be. A sample size of at least 30 observations is usually large enough if the departure from normality is small. Therefore, if we have a sufficiently large sample size, we may use parametric tests based on the CLT, even if the underlying population distribution is not normal (see the sketch after this list).
2. Transformation of the data: We can modify a variable so that its distribution is
more normal. Another reason for data transformation is to achieve constant var-
iance, which is required for the use of some parametric tests. Through transfor-
mation, a new variable X' is created by changing the scale of measurement for the
dependent variable X. The most commonly used transformations are the square root transformation (X′ = √X), the square transformation (X′ = X²), the log transformation (X′ = log X), and the reciprocal transformation (X′ = 1/X). The most important drawbacks of data transformations are some loss of interpretability and the possibility that the transformation fails to normalize the data. Regarding interpretability, if we choose, for example, to log transform our dependent variable X, all further results will be presented and interpreted on a logarithmic scale, which could be difficult for readers to interpret (the log transformation is also demonstrated in the sketch after this list).
3. Use of non-parametric tests: Another possible approach for non-normally distrib-
uted data is the use of non-parametric tests, which will be presented in Chapter 10.
These tests do not require any assumptions about the underlying distribution of
the data; however, they often have less power to detect true differences among the
compared groups, and the chance of false negative associations is higher compared
to when parametric tests are used. Therefore, usually a larger sample size is required
when we use non-parametric tests compared to parametric ones.
4. Categorization of the data: Another possibility is to transform continuous data
that are not normally distributed into nominal or ordinal variables. A continuous
variable that is recoded as categorical may be more clinically relevant, but the loss
of power is very significant as discussed before in this chapter.
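The sketch below (Python; a log-normal population is used as a stand-in for "non-normal data," so the specific numbers are illustrative) demonstrates options 1 and 2: means of samples of size 30 from a skewed population are much closer to normal, and a log transformation symmetrizes the raw data.

```python
# Option 1 (CLT) and option 2 (log transformation) on a skewed population.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
population = rng.lognormal(mean=3.0, sigma=0.6, size=100_000)  # right-skewed

# CLT: the distribution of means of repeated samples of size 30 is ~normal
sample_means = [rng.choice(population, 30).mean() for _ in range(2_000)]
print("skewness of raw data:    ", round(stats.skew(population), 2))
print("skewness of sample means:", round(stats.skew(sample_means), 2))  # much closer to 0

# Log transformation: the log of log-normal data is exactly normal
print("skewness after log:      ", round(stats.skew(np.log(population)), 2))  # near 0
```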
• Interval or ratio data (continuous variables): For both types, a difference of one unit (e.g., a 1°C difference in temperature) has the same meaning anywhere on the scale, whether comparing 35°C versus 36°C or 40°C versus 41°C. Ratio data additionally have an absolute zero; for instance, a height of zero means no height. Other examples include weight, temperature in Kelvin, and time since randomization. For ratio data, the ratio of two values has meaning.
• Ordinal variables: also referred to as expressing ranks (the concept of rank will be important for non-parametric tests). In this type of data, order is important, but the differences between adjacent ranks have no fixed meaning. For instance, the difference between "much improved" and "improved" may not be equal to the difference between the next pair of adjacent categories.
The investigator should know which type of data he or she is dealing with in
order to define the most appropriate and valid statistical test. In addition, the
type of data (e.g., continuous vs. categorical) will have a significant impact on the
study power.
Another important concept here is that, in some cases, data are originally nominal or ordinal—for instance, a researcher might be measuring death (yes/no), and in this case the plan has to be to use categorical data. However, if the data are continuous, one option is to categorize them—for instance, an investigator collecting data on blood pressure (which is continuous) may want to categorize it into low blood pressure and high blood pressure (after defining a cut-off). The main advantage of this approach is that a significant difference between high and low blood pressure will have a clear clinical meaning; however, the cost is a loss in efficiency and therefore a decrease in power. There are also critical considerations and limitations when using this approach (of categorizing continuous data) [7].
Another important point for continuous data is whether the data are normally dis-
tributed. This will have important implications for the choice of the statistical tests.
Statistical tests for normally distributed data (e.g., ANOVA and the t-test—the parametric tests) are usually more efficient than the corresponding options for data that are not normally distributed (the non-parametric tests).
An option for agitation is the use of cholinesterase inhibitors, which have shown initial positive effects in preliminary studies, but uncertainty remains as to whether they are effective when behavioral disturbance is severe. Prof. Rosseau is planning to conduct a clinical trial to test one such drug (CHOLIN001) for the treatment of agitation in patients
who have not responded to psychosocial treatment.
The Trial
The study Prof. Rosseau is planning is a single-center (there was a large dementia
center in the behavioral neurology unit), double-blinded, randomized, parallel group
trial in which patients would be assigned to receive placebo or CHOLIN001 for 12
weeks, after four weeks of failed psychosocial treatment. The main challenge in this
trial is to define the main variable in order to decide the statistical plan and sample size
calculation. The main outcome for this study is Cohen–Mansfield Agitation Inventory
(CMAI) scores at 12 weeks. The CMAI evaluates 29 different agitated behaviors in
patients with cognitive impairment and is carried out by caregivers. The frequency of
each symptom is rated on a seven-point ordinal scale (1–7), ranging from “never” to
“several times an hour.” A total score is obtained by summing the 29 individual scores,
yielding a total score from 29 to 203.
It is Monday afternoon and the weather forecast shows five inches of snow
during the evening and early morning; Prof. Rosseau knows that the traffic through
old Montreal will not be easy. He goes home to prepare for the first meeting with his
research team. In this meeting, there will be four of his research fellows: Catherine
Moreau—a neuropsychologist from Quebec city who just finished her PhD; Scott
Neil—a senior neurology resident from Toronto in his last year of residency; Hugo
Frances—a PhD student in cognitive neuroscience from Montreal; and Munir
Dinesh—a postdoctoral fellow from India who has recently arrived to work with
Professor Rosseau. The agenda for the morning meeting was to decide how to
use the variable CMAI in the study. Prof. Rosseau sends the following email to
the team:
Dear Team,
The agenda for tomorrow’s meeting will be the discussion to decide how to use the
CMAI variable in our study. I would like each one of you to come prepared according
to the following instructions:
Catherine – your task is to investigate if it is appropriate to use CMAI as a variable with parametric tests.
Scott – as I suspect that CMAI will not be normally distributed, please investigate the use of data transformation (log transformation) or use of the central limit theorem.
Hugo – please investigate the use of non-parametric approaches.
Munir – please investigate the categorization of data.
We will meet tomorrow at 8 a.m. in the conference room.
Looking forward
JR
clinical work only would make me crazy. I need additional intellectual stimulation.”
He sees this opportunity to work with Prof. Rosseau as the chance to get the necessary
training to become a future clinical researcher. He wants to impress the team in their
first meeting. He then starts, “I agree with Prof. Rosseau that data might not be nor-
mally distributed. The first step is to look at the sample size. Given the sample size cal-
culation in which we want to detect an average difference of a 6 point (SD 6) change
in agitation inventory score from baseline to 12 weeks between active treatment and
placebo with a power of 90% at the 5% (two sided) level of significance, we would
need a sample size of 22 in each group (total of 44 patients). These parameters are
based on the results of similar studies.” He then continues, “This number of patients
is not enough for the use of the central limit theorem. The idea of the central limit theorem is that, even for data sets that are not normally distributed, the use of parametric tests is acceptable as long as the samples are large: parametric tests are robust to deviations from normal distributions when samples are large. The only issue is to define what "large" is. Large in this case also depends on the nature of the particular non-normal distribution. Although this is controversial, most statisticians are comfortable using the central limit theorem with 70–100 data points." He then pauses and goes to the
whiteboard.
“Because our distribution does not deviate excessively from normality, if we
double the sample, we would be able to rely on this method and use parametric
tests (given the issue of the scale that was discussed before by Catherine).”
Prof. Rosseau then interrupts him, “This is excellent, Scott. Great job. Although
increasing the sample size would increase our costs and duration of this trial, I want
to try to use parametric tests in order to use some advanced modeling with the data
that requires normal distribution. In addition, increasing the sample size would
increase the power of our study, especially if we underestimated the sample size;
however, we need to have a strong justification to do so. But let us hear the next
option: use of log transformation.”
Another option is the use of log transformation of the data. Let me explain better: be-
cause we expect that most of the data would be concentrated near the cut-off point
and there might be some outliers, then our data will not be normally distributed.
In fact, let us assume that we will get the following CMAI scores: 50, 51, 52, 55, 56,
57, 66, 80, 82, 90. These data are not normally distributed. But if we log transform
these data, we would have: 1.70, 1.71, 1.72, 1.74, 1.75, 1.76, 1.82, 1.90, 1.91, 1.95. In
fact, the Shapiro-Wilk test (a test to detect whether the data are normally distributed)
shows that the first original set is not normally distributed while the second (logged
transformed) is normally distributed.
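Scott's check can be reproduced with a few lines of Python (SciPy's shapiro; the comments reflect what the text reports for these particular values rather than a general rule):

```python
# Shapiro-Wilk normality test on the hypothetical CMAI scores and their log10 values.
import numpy as np
from scipy import stats

cmai = np.array([50, 51, 52, 55, 56, 57, 66, 80, 82, 90])

w_raw, p_raw = stats.shapiro(cmai)            # per the text: evidence of non-normality
w_log, p_log = stats.shapiro(np.log10(cmai))  # per the text: consistent with normality

print(f"raw:   W = {w_raw:.3f}, p = {p_raw:.3f}")
print(f"log10: W = {w_log:.3f}, p = {p_log:.3f}")
```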
Prof. Rosseau is impressed with Scott. He quickly comments, “Thank you again, Scott.
This is very helpful. This seems an interesting strategy as it would not be necessary to
increase the sample size of our study and we would be able to use parametric tests.
However, the disadvantage here would be the interpretation of the data. At the end of
the study, if we find significant results, we would have to say that CHOLIN001 induces a significant decrease in log-transformed CMAI scores as compared to placebo. Well, this is certainly an option, depending again on the nature of the data, but let us go to Hugo and the use of non-parametric tests."
I will try to convince you all that the best approach might be the use of non-parametric tests—tests that make far fewer assumptions about the data, do not require that the data be normally distributed, and are also appropriate for ordinal data (an example is the Wilcoxon test). If we go with the non-parametric approach, we would not need to be concerned with the issues raised by Cathy, and we would not need to increase the sample size or use log-transformed data.
Prof. Rosseau then continues, "Thank you, Hugo. I agree with you; you laid out all the advantages. However, if we use a non-parametric approach, we would not be able to use some advanced models that require normally distributed data; also, using non-parametric tests, we would lose some power and would need to increase the sample size by 5% to 10%. But again, this is a good option. Let us hear Munir for the last option: categorizing the data."
last option: categorizing the data.”
One option is to categorize CMAI. I would say that a reduction of 30% or greater in CMAI can be considered clinically important. We can therefore classify patients as responders (≥30% reduction in CMAI scores) and non-responders and then compare the rate of responders between placebo and CHOLIN001. This approach would increase the clinical significance and would also eliminate the issue of parametric tests, as we would analyze the data using tests that compare proportions, such as the chi-square or Fisher's exact test.
He looks at his notes and continues, “The disadvantage of this method is that we
would lose some power. I therefore calculated the new sample size. Given that the rate
of response to placebo would be 30% and the rate of response to CHOLIN001 would
be 55%, and assuming 90% power at the 5% level of significance, we would need 81
participants in each group (total of 162 participants).”
After this detailed explanation, Prof. Rosseau thanks Munir and gives the
final words, “Well, we have four different methods, each one with advantages and
disadvantages. I now want you to think carefully about all of them and come tomorrow
to the meeting with an opinion about the best approach.”
Prof. Rosseau looks through his window. The snowfall has finally stopped and it is
possible to see some sun. He knows it will be a long winter, but his excitement about
this new study makes him forget about the next six months of cold weather, short days,
and longer nights.
CASE DISCUSSION
This case illustrates the importance of understanding the study variables well. The
first important point is to identify the variable type (i.e., nominal, ordinal, or contin-
uous). For most of the cases, this determination is fairly easy; however, in some cases,
as in this case study, there is some room for debate. It has been discussed that the “total
score” variable is built on ordinal items. Potential methods of variable transformation
also have been discussed. The investigator needs to first make careful assessments, then evaluate whether any of the transformation tools are advantageous, weighing the advantages and disadvantages of each method. Finally, the overall impact on the
study needs to be carefully considered.
FURTHER READING
Papers
• Grimes DA, Schulz KF. Descriptive studies: what they can and cannot do. Lancet. 2002; 359(9301): 145–149.
• Neely JG, Stewart MG, Hartman JM, Forsen JW Jr, Wallace MS. Tutorials in clinical research part VI: descriptive statistics. Laryngoscope. 2002; 112(7 Pt. 1): 1249–1255.
• Sonnad S. Describing data: statistical and graphical methods. Radiology. 2002; 225(3): 622–628.
Books
• Pagano M, Gauvreau K. Principles of biostatistics, 2nd ed. Pacific Grove, CA: Cengage
Learning; 2000.
• Portney LG, Watkins MP. Foundations of clinical research: applications to practice, 3rd ed.
Upper Saddle River, NJ: Pearson Prentice Hall; 2009.
REFERENCES
1. Cummings JL, Mega M, Gray K, Rosenberg-Thompson S, Carusi DA, Gornbein J. The Neuropsychiatric Inventory: comprehensive assessment of psychopathology in dementia. Neurology. 1994; 44: 2308–2314.
2. New York Heart Association. Diseases of the heart and blood vessels: nomenclature and criteria
for diagnosis. Little, Brown and Company; 1964.
3. Hobart JC, Cano SJ, Zajicek JP, Thompson AJ. Rating scales as outcome measures for
clinical trials in neurology: problems, solutions, and recommendations. Lancet Neurology.
2007; 6: 1094–1105.
4. Kirkwood BR. Essentials of medical statistics. Malden, MA: Blackwell Scientific
Publications; 1988.
5. Miller RG Jr. Beyond ANOVA: basics of applied statistics. CRC Press; 1997.
6. Dalgaard P. Introductory statistics with R. New York: Springer Science & Business
Media; 2008.
7. Kirsch I, Moncrieff J. Clinical trials and the response rate illusion. Contemp Clin Trials.
2007; 28: 348–351.
9
PARAMETRIC STATISTICAL TESTS
A man gets drunk on Monday on whisky and sodawater; he gets drunk on Tuesday on
brandy and sodawater, and on Wednesday on gin and sodawater. What causes his drunk-
enness? Obviously, the common factor, the sodawater.
—Anthony Standen, Science Is a Sacred Cow
INTRODUCTION
This chapter begins with the fundamentals of statistical testing, followed by an intro-
duction to the most common parametric tests. The next chapter will describe non-
parametric tests and will compare them to their parametric counterparts.
The previous chapter gave you an overview of different types of data, descriptive methods, and sample distributions. You also learned the basics of probability, which
leads us to what is most commonly done in applied medical statistics: hypothesis
testing.
HYPOTHESIS TESTING
When we conduct an experiment, how do we know if the data from our sample
group are different compared to the normal population or compared to another
group? What would we use to compare both groups, how could we describe the
differences of the groups? We have already learned the methods of descriptive
statistics in Chapter 8, which helped us to characterize our sample. Therefore, we
could, for example look at a measure of central tendency: We could select a sum-
mary statistic for our sample (commonly the sample mean) and then compare it to
the reference group data [1]. But if we indeed find a difference between the groups,
how do we know that this is a true difference? We have to consider that it could be
due to the following reasons:
1. A real effect of the intervention;
2. Chance (random error);
3. Bias; or
4. Confounding.
Bias is a systematic error that leads to a false measurement of the dependent variable and
therefore to a wrong conclusion.
A confounder is a variable that is related to both the dependent and the independent variable but does not lie on the causal pathway (a variable that does would be called an intermediary variable); it can create the appearance of a causal relationship between the dependent and independent variables that is either overestimated or spurious.
Statistical tests are generally designed to determine if the observed difference between
the groups is likely due to chance (and therefore not meaningful). This is based on the
initial assumption that, in most cases, groups are not different—unless, under the "rare" circumstance, there is a real effect (due to the intervention or, more generally, due to the study design). This initial belief is called the null hypothesis (H0: group 1 = group 2, stated as no difference between groups). In fact, remember that when you are running a study, you are doing so to add new knowledge (this is your alternative hypothesis). You start from current knowledge (this is your null hypothesis): for instance, currently there is no evidence to support a difference between a new intervention and a placebo, even though, based on other studies, you have hypothesized that there is a difference and need to run the study to confirm it [2].
Therefore, the null hypothesis is set up so that we can reject it if there is a real difference between the groups. If the null hypothesis is rejected, the alternative hypothesis (HA: group 1 ≠ group 2) is accepted.
Importantly, statistical tests cannot account for bias or, in most cases, for
confounding. The best strategy to exclude or limit those effects as much as possible is
through a proper study design (see Chapter 4). An example would be comparing two
groups of treatment (drug A vs. placebo), but during the trial there was no appropriate
blinding (leading to detection bias). Because of this, even if the statistical test shows a
difference between drug A versus placebo, this result is likely due to bias and is not valid.
The statistical test calculates the probability of observing the result obtained (or one more extreme) if the null hypothesis were true. If this probability is small enough, the investigator concludes that the difference is likely not due to chance and can reject the null hypothesis.
The p value provides an estimate of the probability of the event, and if it is small
enough, the result is called statistically significant [3].
The key question is, what is considered small enough? A threshold is set to separate what is considered likely due to chance from what is considered an actual effect. This threshold is called alpha (α), the level of significance. It is widely accepted to consider a p-value equal to or less than an α of 0.05 small enough (which corresponds to a chance of 1 out of 20). This value, though, is arbitrary. An interesting exercise can help you understand why 0.05 is extreme enough to be considered beyond chance: suppose
someone tosses a coin and tells you "it is heads" (1st trial), and then again, "it is heads again" (2nd trial), then again, "it is heads again" (3rd trial), then one more time, "oh, again, it is heads!" (4th trial)—at this point you become suspicious that the coin is biased and has only heads, but your colleague tosses the coin again and again it is heads (5th trial)! At this point you believe that the chance of 5 heads in a row is so extreme that the coin is likely biased. The probability of obtaining 5 heads consecutively is 0.5⁵ = 0.03125 (close to 0.05); thus you can understand why 0.05 is a reasonable threshold.
P-value definition: The probability of the observed result, or something more extreme, under the null hypothesis.
Statistical Errors
When conducting a statistical test, there are four possible outcomes:
1. Scenario 1 = the H0 is true (suppose it is possible to know this). You run the ex-
periment and indeed find a p-value greater than 0.05, thus failing to reject the H0.
Therefore the result of the experiment matches the truth (again, if it was possible
to know the truth).
2. Scenario 2 = the H0 is false (again, suppose it is possible to know this). You run
the experiment and indeed find a p-value smaller than 0.05, thus rejecting the H0.
Therefore the result of the experiment matches the truth (again, if it was possible
to know the truth).
In these two scenarios the experiment matches the truth. But what if it does not? The four possibilities can be laid out as follows:

Decision | H0 is true | H0 is false
Fail to reject H0 | Correct decision | Type II error (β)
Reject H0 | Type I error (α) | Correct decision (power)

As the table shows, two types of errors can occur when performing a statistical test:
1. Type I error (false positive): Rejecting the null hypothesis even though the null
is true (in other words, claiming a significant difference when in fact there is no
difference. Using the example of the coin, it is possible to get a normal coin and
toss it 5 times and get heads in all the times).
– Level of significance = α—probability of committing a type I error.
Most studies set an alpha of 0.05. An α of 0.05 means that you are accepting a 5%
maximum chance of incorrectly rejecting the null hypothesis (H0). The lower α is, the
lower this “permitted” chance will be.
2. Type II error (false negative): Failure to reject the null hypothesis when the null is false (in other words, claiming that there is no significant difference when in fact there is a difference; this happens when the experiment is underpowered).
– β = probability of committing a type II error; directly related to power (power = 1 − β; see Chapter 11).
Most studies set a β of 0.2. This means that your power will be 0.8 (80%), and that you are accepting a 20% maximum chance of failing to reject the null hypothesis (H0) when it is actually false.
The investigator needs to determine the dependent (or the outcome) and inde-
pendent variables. This has been discussed in the previous chapter (please go back
to this chapter if this concept is not clear). The next step is then to classify this vari-
able into continuous, ordinal, or categorical variable (this has been explained in detail
in the previous chapter). Finally, if the variable is continuous, then the investigator
needs to determine whether the distribution is normal (refer to Chapter 8 for a de-
tailed discussion).
For those who are beginning in statistics, it may help to visualize all the options for choosing a statistical test (see Table 9.1). If, given these initial steps (var-
iable determination and normality), the investigator finds out that one of the para-
metric statistical tests can be used (t-test, ANOVA, or linear regression), then the
investigator needs to check whether the other assumptions for using these tests are
met (as discussed in the second step in the following).
It is important also to check if other factors are met when using parametric tests
(see the list of assumptions in the following). The other important assumption is
whether the variances of the groups are roughly equal. Although the t-test and ANOVA are robust to some level of imbalance in variances across groups, it is recommended to check.
There are still other assumptions, such as that data are randomly selected and the
observations are independent. These assumptions are not exclusive to parametric
tests. They are important also for other statistical tests, and in fact the investigator
should make all possible efforts to meet these assumptions. Nevertheless, it is known
that most of the sampling method is actually non-random (see Chapter 6 for further
discussion).
In summary, parametric tests rely on certain assumptions. These assumptions are that the outcome variable is continuous, that the data are approximately normally distributed, that the variances of the groups are roughly equal, and that the observations are independent.

t-Test
The t-test compares the means of two groups. Conceptually, the t statistic relates the signal (the difference between the group means) to the noise (the variability of the data):

$$t = \frac{\text{signal}}{\text{noise}} \times \text{sample size}$$
This formula gives you a value that tells you how confident you can be about your
data. The higher the signal is, the more confident you can be in the difference, while
an increase in noise reduces the impact of the signal (sample size and power will be
discussed in Chapter 11).
This concept is also demonstrated in Figure 9.1.
Figure 9.1. This figure shows that two parameters are important: differences between the means
(difference between dashed lines) and variability (how wide is the spread of the data).
ANOVA
When more than two groups are compared, the analysis of variance (ANOVA) is used rather than multiple pairwise t-tests, which would inflate the type I error. The ANOVA tests the H0 that all means are equal. The H0 can be rejected when at least two means are not equal, but the ANOVA does not reveal between which groups the difference is significant. In order to determine where the difference lies, post hoc multiple comparison tests such as Bonferroni's or Tukey's can be used.
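A minimal sketch in Python (simulated data; SciPy and statsmodels) of a one-way ANOVA followed by Tukey's post hoc test to locate the pairwise differences:

```python
# One-way ANOVA (H0: all group means equal) plus Tukey's post hoc comparisons.
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(7)
g1 = rng.normal(120, 10, 30)   # hypothetical outcome, group 1
g2 = rng.normal(125, 10, 30)   # group 2
g3 = rng.normal(135, 10, 30)   # group 3

f_stat, p_val = stats.f_oneway(g1, g2, g3)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_val:.4f}")

# Post hoc: which specific pairs of groups differ?
values = np.concatenate([g1, g2, g3])
groups = ["g1"] * 30 + ["g2"] * 30 + ["g3"] * 30
print(pairwise_tukeyhsd(values, groups))
```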
Linear Regression
The final example here is linear regression, which uses the same type of outcome (measured on a continuous scale and normally distributed) but, unlike the t-test (and like ANOVA), accepts more than one independent variable. The difference between ANOVA and linear regression is that in linear regression the independent variables can be either categorical or continuous. Linear regression is useful in RCTs when there are covariates to adjust for—in other words, variables that are related to the outcome.
It is beyond the scope of this chapter to explain how to select variables appropriately
to be added in the linear regression (also called model selection).
Entering Data
It is important to input data correctly in order to run the analysis. Each variable should
have its own column. Here, the first column represents the drugs being tested (1 and
2, which represent drugs X and Y, respectively), while the second column portrays the
patient’s blood pressure (BP) after taking the medication.
Choosing the Analysis Test
You can perform the analysis by using the software MENU. Select the following
options:
Statistics > Summaries, tables and tests > Classical tests of hypothesis > t test
(mean-comparison test)
Running the Analysis
When running the analysis, the software will display some options of t-test from
which to choose. Since at this moment you are comparing means of two separate,
independent groups (group taking drug X and group taking drug Y), you can select
the second option, two-sample using groups. Next, it is important to identify correctly
where to place the dependent variable (outcome: BP) and the independent/explana-
tory variable (intervention: Drug), which can also be defined as the group variable. At
last, you can choose your desired confidence interval (which generally will be 95%).
Interpreting the Output
The output shows the two-sample t test. The software will initially provide the descrip-
tive statistics (second row), with the mean, standard error, and standard deviation for
each group, as well as the 95% confidence interval. Next, the same information regarding
summary statistics is provided for the entire sample, with no separation between groups
(third row—combined) and for the difference between groups (fourth row—diff).
Subsequently, the null hypothesis is provided (Ho: dif = 0): there is no difference in systolic blood pressure for patients taking drugs X and Y. The t statistic (0.5651) and the number of degrees of freedom (8) are also shown.
At the bottom of the table there are three options for alternative hypotheses. The generally used alternative is located in the center (1): this is the two-sided hypothesis, in which the investigator is simply testing whether the difference between groups is not equal to 0 (there is no directionality). In this case, you would fail to reject the null hypothesis, therefore concluding that there is no difference in BP between groups (p = 0.5875). In some cases, the investigator may be interested in testing a one-sided hypothesis, meaning that two different tests can be undertaken: you can test whether the difference between group means is smaller than zero (left) or whether it is larger than zero (right). For both of these tests, the result is also non-significant.
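The same unpaired comparison can be reproduced outside menu-driven software. Here is a minimal sketch in Python (scipy assumed; the BP values are hypothetical, not the data behind the output described above):

from scipy import stats

# Hypothetical systolic BP after treatment, five patients per drug
bp_drug_x = [142, 138, 150, 145, 139]
bp_drug_y = [140, 141, 148, 137, 143]

# Two-sided two-sample t-test; a large p-value means we fail to reject
# H0: no difference in mean BP between the two groups
t, p = stats.ttest_ind(bp_drug_x, bp_drug_y)
print(t, p)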
Entering Data
In the paired comparison, data has to be organized differently in the database, since
the outcome is shown in two columns, separated by time of assessment. Here, two
new variables have to be inserted, representing the different time points in which data
will be analyzed (BPpre and BPpost, representing the systolic blood pressure before
and after taking drug X, respectively). Each horizontal line represents one patient.
Choosing the Analysis Test
You can perform the analysis by using the software MENU. Select the following options:
Statistics > Summaries, tables and tests > Classical tests of hypothesis > t test
(mean-comparison test)
Running the Analysis
When running the analysis, the software will display some options of t-test from which
to choose. Since at this moment you are comparing means within the same group
(patients who are taking drug X), you can select the last option, Paired. Differently
from what happened with the t-test for independent groups (unpaired analysis), the
software will request a “first variable” and a “second variable,” where blood pressure be-
fore taking the drug (BPpre) and blood pressure after taking the drug (BPpost) should
be inserted (same measurement at different time points). At last, you can choose your
desired confidence interval (which generally will be 95%).
Interpreting the Output
The interpretation of the output for the paired t-test is very similar to the interpreta-
tion of the unpaired t-test, given the difference in application. The software will ini-
tially provide the descriptive statistics (second row), with the mean, standard error,
and standard deviation for the two different time moments (BPpre and BPpost) as
well as the 95% confidence interval. Next, the same information regarding summary
statistics is provided for the difference between groups (third row—diff).
Subsequently, the null hypothesis is provided (Ho: dif = 0), stating that
there is no difference in systolic blood pressure before and after taking drug X
(BPpre = BPpost). The t statistic (3.8528) and the number of degrees of freedom
(9) are also shown.
At the bottom of the table there are three options for alternative hypotheses. The generally used alternative is located in the center (1): this is the two-sided hypothesis, in which the investigator is simply testing whether the difference between BP before and after is not equal to 0 (there is no directionality). In this case, you would reject the null hypothesis, therefore concluding that drug X is able to significantly modify blood pressure (p = 0.0039).
In some cases, the investigator may be interested in testing a one-sided hypothesis, meaning that two different tests can be undertaken: you can test whether the difference in BP between the two time points is smaller than zero (left) or larger than zero (right). In this scenario, the alternative hypothesis on the right (2) was significant, meaning that mean blood pressure before the medication was significantly higher than blood pressure after it. We can therefore conclude that the drug is effective.
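For reference, a paired analysis of this kind can be sketched as follows in Python (scipy assumed; the values are hypothetical, and the alternative= option shown for the one-sided test requires a reasonably recent scipy release):

from scipy import stats

# Hypothetical BP for the same 10 patients before and after drug X
bp_pre = [150, 148, 155, 160, 152, 149, 158, 151, 154, 157]
bp_post = [142, 140, 150, 151, 147, 141, 149, 146, 148, 150]

# Two-sided paired t-test on the within-patient differences
t, p = stats.ttest_rel(bp_pre, bp_post)
print(t, p)

# One-sided version: H1 is that BP before is greater than BP after
t1, p1 = stats.ttest_rel(bp_pre, bp_post, alternative='greater')
print(t1, p1)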
Entering Data
Here, instead of analyzing solely drugs X and Y, blood pressure measurements for
drug Z will also be imputed, since the purpose is to compare the differences between
the three groups. The first column includes drugs X, Y, Z (1, 2, and 3, respectively),
while the second column includes patients’ systolic blood pressure after taking the
medication.
Choosing the Analysis Test
You can perform the analysis by using the software MENU. Select the following
options:
Statistics > Linear models and related > ANOVA/MANOVA > One-way ANOVA
(note that if you want to run a two- [or more] way ANOVA, you need to select another option).
Running the Analysis
When running the analysis, the software will ask for two variables: the “response var-
iable” and the “factor variable.” The response variable can be seen as the outcome, the
dependent variable that is being analyzed, while the factor variable is the independent
variable. Therefore, blood pressure (BP) should be inserted in the first box, while
Drug is inserted in the second box. Options for performing multiple-comparison tests
are also provided.
Interpreting the Output
The output for the ANOVA test is relatively simple to interpret. In this case, the null
hypothesis states that blood pressure means are equal throughout the three groups.
The result indicates that there is a significant difference between at least two of the
blood pressure means of the three groups (drugs X, Y, Z), with a p-value of 0.0193
(<0.05). As previously mentioned, ANOVA is not able to specify which group or groups were responsible for this difference, as this is a global test. Pairwise analysis in post hoc testing (e.g., Tukey's method) is therefore necessary in order to obtain this information (refer to the software steps later in the chapter). Additionally, the software provides the sum of squares (SS), degrees of freedom (df), and the mean squares (MS), which are used to calculate the F statistic, leading ultimately to the p-value.
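A minimal one-way ANOVA sketch in Python (scipy assumed; the three groups of BP values are hypothetical) illustrates this global test:

from scipy import stats

# Hypothetical BP after treatment for drugs X, Y, and Z
bp_x = [142, 138, 150, 145, 139]
bp_y = [130, 134, 128, 137, 132]
bp_z = [141, 144, 139, 147, 143]

# The F statistic tests H0: all three group means are equal; a significant
# result says only that at least two means differ, so a post hoc test
# (e.g., Tukey's HSD) is still needed to locate the difference
f, p = stats.f_oneway(bp_x, bp_y, bp_z)
print(f, p)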
Linear Regression
Finally, consider that you are now interested in looking only at the effect of drug X in blood
pressure levels when compared to placebo. However, at this time, you want to adjust for other
covariates, such gender and age. For this purpose, linear regression is the most suitable test,
since it enables adjustment for covariates. By doing so, your purpose is to obtain results that
will provide an unbiased estimate of the treatment effect.
Entering Data
In order to run a linear regression, the database has to be organized with your inde-
pendent variables (continuous and categorical outcomes) and dependent variable
(continuous outcome). Since the software will not recognize letters (e.g., X, Y, and Z)
or words (e.g., female or male), variables have to be coded in order for the analysis to
be performed. In this case, drug X is coded as 1, while placebo is coded as 0. Likewise,
females are coded 0, while males are coded 1. Each column represents a different vari-
able, while each row represents an individual patient (total of 30 patients).
Choosing the Analysis Test
You can perform the analysis by using the software MENU. Select the following options:
Statistics > Linear models and related > Linear Regression
Running the Analysis
When running the analysis, the software will provide a box for the dependent variable (out-
come) and the independent variables (predictors) you want to insert in your model. In this
case, blood pressure (continuous outcome) is inserted as the dependent variable, while the
age, sex, and the intervention drug (X or placebo) are inserted as the independent variables.
Interpreting the Output
The output for linear regression will initally provide descriptive statistics for the
model, including the sum of squares (SS), degrees of freedom (df), and mean squares
(MS). The number of observations (30) is stated on the right, as well as the R-squared.
The R-squared represents how much of the variance of the outcome can be explained
by your model (how useful the predictors are). In this case, only 10% of the variance in
blood pressure can be explained by the independent variables. The adjusted R-square
has the same meaning, but takes into account the number of variables that are inserted
in the analysis. On the bottom of the output, the key independent variable (Drug)
and all the other covariates you are interested in investigating (Age and Sex) are listed
in the left column, as well as the constant for the regression equation. For each of the
variables, the β coefficient, standard error, T statistic, p-value, and 95% confidence
interval are provided.
The β coefficient represents the increase or decrease in the predicted value of the dependent variable (outcome) for a 1-unit increase in the explanatory variable (predictor). For instance, if age were statistically significant, for every additional year there would be a decrease of 0.15 mmHg in blood pressure, adjusted for all the other covariates. In the case of categorical variables, this interpretation changes slightly, since values are only 0 or 1; a 1-unit increase therefore represents switching from one group to the other. The group coded 0 is considered the reference group (females), while males are coded 1. Therefore, based on the output results, if sex were statistically significant, males would be expected to have a blood pressure 1.03 mmHg higher than females, adjusted for all the other covariates.
What you are interested in here is knowing if there is evidence of a linear relation-
ship between your explanatory variable (Drug) and the response variable (BP), while
controlling for the other explanatory variables (Age and Sex). In this case, no signifi-
cance was found for any of the explanatory variables.
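A covariate-adjusted model of this kind can be sketched in Python with the statsmodels formula interface (all data below are hypothetical, coded with the same 0/1 scheme described above):

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical coded data: Drug (1 = drug X, 0 = placebo),
# Sex (1 = male, 0 = female), Age in years, BP in mmHg
df = pd.DataFrame({
    'BP':   [138, 142, 135, 150, 144, 139, 147, 141, 136, 148],
    'Drug': [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    'Age':  [54, 61, 47, 66, 52, 58, 63, 49, 55, 60],
    'Sex':  [0, 1, 1, 0, 1, 0, 0, 1, 1, 0],
})

# OLS regression of BP on Drug, adjusted for Age and Sex; the summary
# lists the beta coefficients, standard errors, p-values, 95% CIs,
# and the R-squared discussed above
model = smf.ols('BP ~ Drug + Age + Sex', data=df).fit()
print(model.summary())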
normality tests). One potential option for investigators working with large sample sizes is to use the central limit theorem to support the assumption of a normal distribution.
The central limit theorem states that, when dealing with large sample sizes, the sample mean of a random, independent variable can be assumed to be nearly normal. Additionally, as the sample size N increases, the distribution of the sample mean will increasingly resemble a normal distribution. And what can be considered a large sample size? Definitions vary, but with a sample size of 60–100 (or higher), the sample mean can generally be considered approximately normal.
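The theorem is easy to demonstrate by simulation. The minimal sketch below (Python, numpy assumed) draws repeated samples of n = 100 from a clearly skewed population and shows that the sample means cluster tightly and symmetrically around the population mean:

import numpy as np

rng = np.random.default_rng(0)
# A strongly right-skewed population (exponential, mean = 1)
population = rng.exponential(scale=1.0, size=100_000)

# Means of 2,000 samples of n = 100 are approximately normally distributed
sample_means = [rng.choice(population, size=100).mean() for _ in range(2_000)]
print(np.mean(sample_means))  # close to the population mean of 1
print(np.std(sample_means))   # close to SD/sqrt(n) = 1/10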
Paired T-Test
Analyze > Compare means > Paired Sample T Test
One-Way ANOVA
Analyze > Compare means > One-Way ANOVA
Linear Regression
Analyze > Regression > Linear
The Trial
Dr. Manetti had decided that he wanted to spend his last years before retiring
doing clinical research. He set up an orthopedic laboratory and started doing clinical
research in the field of knee surgery, which used to be one of his preferred surgeries.
He is especially interested in developing new treatments for alleviating postoperative
pain, and particularly, a new opioid agonist (OXY004) with high affinity for opioid
receptors, which seems to have a greater analgesic potency as compared with its re-
lated compound morphine. This drug is under regulatory board review, as it has been
shown to be safe and efficacious for the treatment of acute pain, but has not been
1 Dr. André Brunoni and Professor Felipe Fregni prepared this case. Course cases are developed
solely as the basis for class discussion. The situation in this case is fictional. Cases are not intended
to serve as endorsements or sources of primary data. All rights reserved to the author of this case.
Reproduction and distribution without permission is not allowed.
evaluated for the treatment of postoperative pain. In addition, its immediate release
form (IR) is an attractive alternative for this type of pain. Therefore, he decided to in-
vestigate the use of OXY004-IR for the treatment of postoperative pain in outpatients
undergoing knee arthroscopy.
This idea has been developed together with other professors in the United
Kingdom, the United States, and Brazil. Dr. Manetti decided to set up a workshop
to take place in Milan and invited his colleagues. After a very warm reception and
two days of intensive and productive debate, Dr. Manetti and his colleagues agreed
on the terms of the trial. The initial plan was to conduct a multicenter, double-
blind, randomized, placebo-controlled study in which patients are randomized to
receive OXY004-IR or placebo hourly as needed for up to eight hours after the
surgery to reduce pain (and in addition to standard analgesic therapy). The main
outcome is the sum of pain intensity (as assessed by VAS, the visual analog scale) during these initial eight hours post-surgery. On the VAS, the patient quantifies the amount of pain he or she is feeling on a linear scale from 0 to 10 (a reason why the VAS is considered a continuous rating scale). Regarding the ethical issues, the in-
stitutional review boards agreed with this study as placebo was offered for a short
period only (8 hours), with no risk of the condition worsening due to lack of an-
algesic, and patients could also decide to take other medications if they wished to
(but in this case they would be eliminated from the analysis). Also, patients would
be prescribed standard analgesic medication.
The calculated sample size was 100 patients—25 patients for each center. This
sample size would be sufficient to use parametric tests, even if data are not normally
distributed due to the central limit theorem, but they are also prepared to normalize
the data using logarithmic transformations if the data are excessively skewed. However,
even after these two days of meetings, they are still unsure which statistical test to
choose for the primary outcome.
Choosing the Statistical Test
Although the choice of statistical test depends on the research question, it is impor-
tant to know the characteristics of the statistical test so as to understand the limita-
tions and advantages of each test, and also to adjust the study design and choose the
optimal research question considering the resources and study feasibility. For
instance, if the researcher wants to use linear regression, he or she needs to know the
limitations of using this test, such as the assumptions associated with it (for instance,
requirement of normal distribution) and also the sample size requirements. Therefore,
an important step is the statistical analysis plan that needs to be done a priori (unless
the investigators plan to conduct an exploratory trial). Thus, researchers need to de-
termine the dependent and independent variables. The dependent variable is the out-
come of the study (in this case, pain), and the independent variables are the factors of
the study (in this case, one of the independent variables is the group—active drug or
placebo). Another important characteristic is whether the data are categorical (and
number of categories) or continuous. With this information, the investigator can
choose the analysis plan and therefore be able to adjust his or her study design and
research question.
After the two intensive days of work, the group of investigators decided to make
this final decision (choosing the statistical test) at a nice dinner at an outstanding
Italian restaurant near Corso Vittorio Emanuele II.
CASE DISCUSSION
This is a classical case in which a study design may allow different options of analysis. The investigator needs to be aware that the final design decision will impact the variables selected for the study and how they will be treated. This will also affect the power and sample size of a given study. In this example, to be didactic, we consider the main options of statistical tests; but if this were a confirmatory study, the investigators would need to have their main question well defined, and that definition would indicate which test is most appropriate. This reflects, therefore, the importance of defining the research question and how that definition can affect the statistical tests to be used.
1. What challenges do Dr. Manetti and colleagues face in choosing the statistical test
to be used in this trial?
2. What are their main concerns?
FURTHER READING
Papers
• Vickers AJ. Analysis of variance is easily misapplied in the analysis of randomized trials: a critique and discussion of alternative statistical approaches. Psychosom Med. 2005; 67: 652–655. https://insights.ovid.com/pubmed?pmid=16046383
Online Statistical Tests
• http://statpages.org/
Books
• Feinstein AR. Medical statistics. Boca Raton, London, New York, Washington, DC: Chapman & Hall/CRC; 2002.
• McClave JT, Sincich T. Statistics. 11th ed. Upper Saddle River, NJ: Pearson/Prentice Hall; 2009.
• De Veaux RD, Velleman PF, Bock DE. Stats: data and models. 3rd ed. Boston: Addison-Wesley/Pearson; 2009.
• Rosner B. Fundamentals of biostatistics. 7th ed. Boston, MA: Brooks/Cole, Cengage Learning; 2010.
Webpages
http://flowingdata.com/
www.phdcomics.com/comics.php
10
NON-PARAMETRIC STATISTICAL TESTS
INTRODUCTION
Another important class of statistical tests is non-parametric tests. The term non-
parametric indicates a class of statistical tests that does not make any assumption re-
garding the distribution of the data. In fact, the other class of statistical tests discussed
previously, parametric tests (i.e., t-test or ANOVA), requires that data are normally
distributed and also that population variances are equal.
As non-parametric tests make no assumptions regarding the distribution and variance of the data, they are always valid to use. However, the investigator should not simply default to these tests in order to guarantee validity, as non-parametric tests may have less power to detect significant differences when data are normally distributed and variances are roughly equal. For instance, when data are truly normal, using a non-parametric test rather than the optimal parametric test corresponds to roughly a 5% loss in effective sample size. Thus, not choosing the most efficient test may indeed increase the type II error of a study.
The investigator therefore needs to choose, first, a valid statistical test and, second, the test that is also the most efficient (i.e., has more power). However, this decision is not simple, as there are several situations in clinical research in which it is not easy to determine how much departure from normality is acceptable before the data should be considered non-normally distributed (and a non-parametric test chosen).
Besides data that are not normally distributed, there is another situation that requires the use of non-parametric tests: when data are classified as ordinal rather than continuous. Ordinal data, as reviewed in Chapter 8, are data that have an order but for which the interval between two units has no fixed meaning (for instance, symptom classification as poor, satisfactory, good, and outstanding). In this situation, a statistical test that analyzes data as ranks (i.e., a non-parametric test) is required.
Similarly, in some situations it may not be easy to determine with certainty that
a given variable cannot be classified as continuous. We will discuss some examples
in the following. Here we suggest that whenever there is a situation in which there is
no clear indication, the investigator needs to consider the conservative approach and
choose the non-parametric option.
The final important issue is that the investigator should not choose the statis-
tical test that provides the best result. Although the investigator needs to choose
the most effective test (across the valid options), this decision needs to be made
a priori, without looking at the data. If this decision comes after the investigator runs the statistical tests, then the analyses become exploratory and need to be acknowledged as such.
First we will discuss and give examples of situations in which a non-parametric test should be chosen, and then discuss each statistical test separately. The non-parametric tests discussed in this initial section are the Mann-Whitney (Wilcoxon Rank Sum), Wilcoxon Sign Rank, and Kruskal-Wallis tests. Although the names of these tests make them appear complicated, they are, quite the opposite, relatively simple tests to use and understand.
1. For instance, Luurila et al. evaluated the effect of erythromycin on the pharmaco-
kinetics and pharmacodynamics of diazepam. Six subjects ingested erythromycin
for one week, and on day 4 they ingested a dose of diazepam. All pharmacokinetic
and pharmacodynamic parameters were compared within the group using the
Wilcoxon matched pairs test, not a t-test [1].
2. Bopp et al. evaluated if intubated critical care unit (CCU) patients who received
twice-daily oral hygiene care with 0.12% chlorhexidine gluconate had less pneu-
monia incidence than those who received the standard oral care. Authors reported
that the small sample size prohibited the use of parametric statistical analysis or
hypothesis testing [2].
3. In a study to test the efficacy of a specific oral medication in patients with Sjögren’s
syndrome, Khurshudian et al. randomized patients to receive either 150 IU of interferon-alpha (8 patients) or placebo (4 patients) for 24 weeks, with 6-week re-
evaluations. Whole saliva (continuous outcome) was measured during each visit,
and symptoms were assessed by questionnaires and visual analog scales. The
Wilcoxon Sign Rank test was used to detect significant changes for each of the
evaluated variables [3].
4. In the study by Cruz-Correa et al. [4], five patients with familial adenomatous
polyposis with prior colectomy received supplements with curcumin 480 mg and
quercetin 20 mg orally 3 times a day. The number and size of polyps were assessed
at baseline and after therapy. The Wilcoxon Sign Rank test was used to determine
differences in the number and size of polyps.
One of the challenging issues is that when sample size is too small, non-parametric
tests are also ineffective. For instance, in order to use a non-parametric test such as
Kruskal-Wallis, it has been reported that no group should have fewer than five subjects
in order to have an accurate estimation [5–7]. But because of the small sample, par-
ametric tests are also not indicated. Therefore the investigator needs to consider the
utility of running a small sample size given that inference statistics will likely not be
valid. There are other reasons to run a small pilot study, but these are beyond the scope of this chapter.
For samples that are large enough, the central limit theorem can be invoked to support the use of a normal distribution. It is beyond the scope of this chapter to explain the theory behind the central limit theorem, but samples of over 70 subjects (some authors say 100 or 200) are usually considered large enough to support the use of parametric tests.
For samples in between (neither very small nor very large), tests to determine normality become important to verify whether data are normally distributed. There are several approaches, from statistical tests such as the Shapiro-Wilk test to graphical representations (histograms) and the analysis of distribution parameters (mean, median, kurtosis, and skewness).
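A minimal sketch of such a normality check in Python (scipy and numpy assumed; the data are simulated purely for illustration):

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.normal(loc=0, scale=1, size=80)  # hypothetical outcome data

# Shapiro-Wilk: H0 is that the data come from a normal distribution,
# so a small p-value suggests non-normality
w, p = stats.shapiro(data)
print(w, p)
print(stats.skew(data), stats.kurtosis(data))  # both near 0 for normal data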
As the reader will come to understand, methods to determine whether data are
normally distributed or not may not be clear-cut. Again, as recommended earlier,
in the case where there is not a clear-cut response, we recommend the use of
non-parametric tests.
a binary outcome and use logistic regression. The investigator needs to be cognizant of
the statistical test when designing the study.
The investigator does not need to know the mathematics behind a statistical test in order to use it appropriately; however, it is useful to know that these tests compare ranks. In other words, the outcome is ranked across all the groups; therefore, if one group has a much higher (or lower) rank sum than the other group, a significant difference would be expected, depending on the sample size. With this basic principle in mind, it is intuitive that rank-based tests easily handle outliers: what matters is not by how much one data point exceeds another, but only whether it is higher or lower (independent of the amount).
• Entering data: It is important to input data correctly in order to run the analysis.
Each variable should have its own column. Here, the first column represents the
drugs being tested (1 and 2, which represent drugs A and B, respectively), while the
second column portrays the level of patient satisfaction (1–5).
• Choosing the analysis test: You can perform the analysis by using the software
MENU. Select the following options:
Statistics > Non-parametric analysis > Test of hypothesis > Wilcoxon rank-sum
• Interpreting the output: The software will provide an output for the chosen test, and it is crucial not only to choose the correct test, but also to know how to interpret the given information. The table for the Mann-Whitney test provides three columns of information regarding drug A and drug B: the number of observations (the number of subjects in each group), the observed rank sums, and the expected values.
The rationale behind the Mann-Whitney/Wilcoxon Rank Sum test consists of ranking the outcomes for the given drugs (satisfaction level) and using the rank positions in the analysis, instead of the absolute values. All obtained values are ranked in order (1st, 2nd, 3rd, etc.). The ranks are then summed and compared between groups. If drug A is truly different from drug B, the rank sums of the two groups are expected to diverge more than chance alone would produce.
Since ties may be an issue, the software provides both the unadjusted variance and the variance adjusted for ties. A tie occurs when the same value appears in more than one observation. Next, the investigator should interpret the result of the test against the null hypothesis, using the p-value.
Null hypothesis (Ho): There is no difference in patient satisfaction between drugs
A and B.
P = 0.5219
Conclusion: Do not reject the null hypothesis. The researcher may conclude that patient satisfaction with drugs A and B is not significantly different.
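A minimal sketch of the same test in Python (scipy assumed; the satisfaction scores are hypothetical ordinal values from 1 to 5):

from scipy import stats

# Hypothetical satisfaction scores (1-5) for drugs A and B
drug_a = [3, 4, 2, 5, 3, 4, 3]
drug_b = [2, 3, 3, 4, 2, 3, 2]

# Two-sided Mann-Whitney/Wilcoxon rank-sum test on the ranked scores
u, p = stats.mannwhitneyu(drug_a, drug_b, alternative='two-sided')
print(u, p)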
• Entering data: In the paired comparison, data are organized differently, since the outcome is shown in two columns separated by time of assessment. Here, two new variables have to be entered, representing the different time points at which data will be analyzed (2 weeks and 4 weeks). Each horizontal line represents one patient. The first column represents the level of satisfaction with drug A at 2 weeks (Satisf2w), and the second at 4 weeks (Satisf4w) of treatment.
• Choosing the analysis test: You can perform the analysis by using the software
MENU. Select the following options:
Statistics > Non-parametric analysis > Test of hypothesis > Wilcoxon matched-pairs
Sign Rank test.
• Running the analysis: Sometimes, inputting data may not be the most intuitive pro-
cess, as the software may provide a different framework for each test. However, it is a
simple process. When running the paired analysis, the STATA software will ask for a
variable and for an expression. In this case, since you are comparing the satisfaction
level at 2 weeks with the satisfaction at 4 weeks, these two variables will be inserted.
Satisfaction level at 2 weeks (Satisf2w) should be inserted in the variable box. The ex-
pression box enables different arithmetic calculations to be made between variables;
however, in this case you should only insert Satisfaction level at 4 weeks (Satisf4w).
• Interpreting the output: The interpretation of the Wilcoxon Sign Rank test output is very sim-
ilar to the Mann-Whitney test output (refer to first item). However, in this case, instead of
comparing between different groups, you are comparing within the same group (Drug A).
The output table provides valuable information. In our example, we can see that
8 participants had a higher satisfaction level at 4 weeks when compared to 2 weeks,
0 patients had a lower satisfaction level with drug A at 4 weeks when compared to 2
weeks, and 2 patients did not report differences in satisfaction level between these two
time points. By examining the test statistics (see later discussion), we can understand
if these changes in satisfaction led to a statistically significant difference between 2 and
4 weeks post-treatment.
Null hypothesis (Ho): There is no difference in patient satisfaction between 2 and 4
weeks after taking drug A.
P = 0.0063
Conclusion: Reject the null hypothesis. The researcher may conclude that, for drug
A, there is a significant difference in patient satisfaction between 2 and 4 weeks after
taking the medication.
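For reference, the paired rank test can be sketched as follows in Python (scipy assumed; the scores are hypothetical, and pairs with a zero difference are dropped by scipy's default zero-handling, mirroring the ties noted above):

from scipy import stats

# Hypothetical satisfaction with drug A at 2 and 4 weeks (same 10 patients)
satisf_2w = [2, 3, 2, 3, 4, 2, 3, 3, 2, 4]
satisf_4w = [3, 4, 3, 4, 4, 3, 4, 4, 2, 5]

# Wilcoxon signed-rank test on the within-patient differences
w, p = stats.wilcoxon(satisf_2w, satisf_4w)
print(w, p)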
between the three groups. Therefore, possible values for drug being tested are 1, 2,
and 3, and for satisfaction level are 1–5.
• Choosing the analysis test: You can perform the analysis by using the software
MENU. Select the following options:
Statistics > Non-parametric analysis > Test of hypothesis > Kruskal-Wallis rank test.
• Running the analysis: Once again, the outcome variable will be the dependent var-
iable (Level of Satisfaction), while the variable defining groups is the independent
variable (Drug).
• Interpreting the output: The output table for the Kruskal-Wallis test provides infor-
mation regarding the number of observations (obs) and the total sum of ranks for
each group (Drug A, B, and C). Similarly, the software presents the unadjusted
and adjusted analysis, due to the presence of ties. It is always preferable to use the
adjusted analysis, as it yields estimates that are more precise.
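A minimal Kruskal-Wallis sketch in Python (scipy assumed; the satisfaction scores are hypothetical, and scipy applies the tie adjustment automatically):

from scipy import stats

# Hypothetical satisfaction scores (1-5) for drugs A, B, and C
drug_a = [3, 4, 2, 5, 3, 4]
drug_b = [2, 3, 3, 4, 2, 3]
drug_c = [4, 5, 4, 3, 5, 4]

# H0: the three groups come from the same distribution; as with ANOVA,
# a significant result does not identify which groups differ
h, p = stats.kruskal(drug_a, drug_b, drug_c)
print(h, p)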
Mann-Whitney/Wilcoxon Rank Sum
Analyze > Nonparametric Tests > Legacy Dialogs > 2 Independent Samples
Kruskal-Wallis
Analyze > Nonparametric Tests > Legacy Dialogs > K Independent Samples
INTRODUCTION
Dr. David Wang has been anxiously waiting for this message, and now it has finally arrived.
The sender was the American Journal of Obstetrics and the title was “Decision on
manuscript 10-982.” This was the email with the results of the peer review of his paper
on the study that he considers to be a major contribution to the field of postpartum
depression. He worked extremely hard during the past five years on this study and
now the fate of his study was one click away. He had mixed feelings of anxiety and
excitement.1
Dr. Wang is a senior psychiatrist in Beijing, China, from the Capital Medical
University, and he is also an experienced investigator. He has been working intensely
to publish a large clinical trial testing the efficacy of a new antidepressant for the
treatment of postpartum depression. According to the World Health Organization
(WHO), depression is a leading cause of disability worldwide, ranking fourth in the global burden of disease. In addition, women experience depression two to three times more often than men, and maternal depression is directly associated with adverse child cognitive and socio-emotional development. It is an important issue in Western countries, where the prevalence ranges from 10% to 15% of deliveries; the prevalence is even higher in developing countries such as South Africa and in Latin America, where it nears 40%, depending on the evaluation criteria used. The incidence of postpartum depression is greatest in the first six months after delivery; however, it can occur up to the end of the first year. These figures are related to psychological, social, and economic conditions. The risk factors include depression in a previous
pregnancy and/or personal or family history, single marital status (not having support
from the husband or partner), low education, having experienced violence, and being
unaware of pregnancy.
For Dr. Wang, postpartum depression is an extremely important issue, as he had seen cases in the past of depressed mothers hurting their babies. Another important issue is breastfeeding: because of depression, mothers stop breastfeeding their babies. The WHO recommends exclusive breastfeeding for at least the first four
to six months of life and its continuation for one to two years thereafter. Because of
Dr. Wang’s knowledge of this specific disease, most physicians and gynecologists refer
cases to him in which mothers need a drug therapy for the treatment of depression.
1 Dr. Suely R. Matsubayashi and Professor Felipe Fregni prepared this case. Course cases are de-
veloped solely as the basis for class discussion. The situation in this case is fictional. Cases are not
intended to serve as endorsements or sources of primary data. All rights reserved to the authors of
this case.
The Trial
It was the end of the afternoon and Dr. Wang was looking through his window—his
office in the Chaoyang district faced the amazing Olympic stadium that was built for
the 2008 Olympic games in Beijing. Dr. Wang was particularly proud of China and
in fact returned to China four years ago after a long period of research at Karolinska
Institute in Sweden. His first big project after his return to his native China was to run
this postpartum depression study. In fact, the view of the Olympic stadium was a con-
stant reminder of his mission to help China gain a leading place in science.
The goal of this trial that Dr. Wang is running is to evaluate the effects of a 10-week
home-based exercise program for the treatment of postpartum depression. Although the
benefits of exercise for depression treatment are well documented (not only behaviorally
but also via neurophysiological markers such as neurotransmitter levels), he and his team
believe that they have developed a better exercise program that will have greater efficacy.
Dr. Wang actually believes that his program will be a breakthrough for the treatment
of postpartum depression. This treatment, besides its low cost (which would be especially valuable in mainland China), has the important advantage of being safe and not interfering with lactation. In his trial, 50 women with postpartum depressed moods were randomly assigned to the 10-week home-based exercise program or usual care. The main outcome was the change in the Hamilton Rating Scale for Depression (HAM-D) between baseline and immediately post-treatment, comparing exercise versus usual care. Although the sample was not large, he showed that the exercise group had significantly lower scores after treatment as compared with the usual care group. He expected that these findings would guide future clinical trials and, hopefully, future clinical practice. Now, the decision was in that email.
Dear Dr. Wang,
Thank you for the submission of your manuscript to the American Journal of
Obstetrics (AJO). Unfortunately, it was not accepted for publication in AJO in its
current form. The manuscript was externally reviewed and was also reviewed by the
Associate Editor and the Board of Editors. The overall decision is made by the Board
of Editors and takes into account the Reviewers' comments, priority, relevance, and
space in the journal. Substantive issues and concerns were raised that precluded assigning your manuscript a priority score high enough to merit publication. However,
if you can satisfactorily and completely respond to the comments of the Reviewers
and the Editorial Board within three months (see the attached file), we would be
willing to review a revised version of your manuscript. The revised manuscript must
be submitted without exception within this timeframe. We offer no assurance that it
will be accepted after resubmission. We will do our absolute best to ensure a timely
re-review process so as not to cause you any delay after resubmission. We thank you
again for considering AJO.
Sincerely,
Editor-in-chief, AJO
Dr. Wang then read the attachment and realized that the chief complaint from reviewers
was that they used a parametric test (linear regression) to measure the main study
outcome: depression as assessed by Hamilton Depression Rating Scale (HDRS) over
the different time points, but the data seemed skewed, as most patients had mild to
moderate depression (and few patients had severe depression). After reading the re-
view, Dr. Wang’s initial reaction was panic and despair. Several questions were going
through his mind at the same time: “If I use a non-parametric approach, I am going to
lose efficiency and maybe my results might not become significant? How am I going
to show clinical significance with this approach? How am I going to control for age in
the model using a non-parametric approach? Even if I try to transform my data to a
normal distribution using mathematical transformation, this will not work as I have
seen before. Can I change the method of analysis now?”
He looks again through his window and sees the Olympic stadium, and he feels
inspired, as he knows that every big project has its challenges. He feels that he is ex-
tremely stressed and decides to call it a day and go home. He knows that
after a night of sleep, the situation will be clearer.
2 Please see the Chapter 8 case study to read more on this issue of using ordinal scales as continuous outcomes.
3 If you want to learn more about variable classification (also discussed in Chapter 8), including a discussion of ordinal variables, go to: http://onlinestatbook.com/2/introduction/levels_of_measurement.html (this link also provides a nice exercise at the end).
4 Cooper JP, Tomlinson M, Swartz L, et al. Post-partum depression and the mother-infant relationship in a South African peri-urban settlement. Br J Psychiatry. 1999; 175: 554–558.
information, as two numbers will be differentiated only by which one is larger and
not by the quantity of this difference; for instance 100 versus 1 would be theoretically
the same as 2 versus 1 in a rank test (given there are no other numbers in between).
This loss of information might then result in a loss of power/efficiency of the test and
therefore might change the results. However, on the other hand, large outliers might
reduce the power of a parametric approach, and this might not be a problem after all.
Dr. Wang is glad he could remember his statistical courses. The other important issue
is the clinical significance of the data when using a non-parametric approach, as this
approach will only give a p-value or statistical significance, but then how does one
determine whether the difference is clinically meaningful? When using continuous
data, it is possible to compare two means or to calculate the effect size of the intervention; but when using ordinal data, it becomes more complicated, as comparing medians might not be adequate, since medians can appear quite similar or quite different depending on how the data are distributed. One possibility here is to categorize the
data—in other words, find a cut-off for the data and classify patients as responders or
non-responders. In fact, it is common to define a patient as a responder if he or she has
a decrease in Hamilton scores of 50% or more in relation to baseline scores.
Finally, it is possible to adjust for one variable when using binary (or categorical) outcomes (as discussed earlier, if we categorize the outcome into responders and non-responders). To do that, it is necessary first to stratify by the variable to be adjusted for (in this case, age), creating two tables: one for adolescent pregnancy and the other for non-adolescent pregnancy. The next step is to classify the responders and non-responders in each group (exercise vs. usual care) within each table (adolescents and non-adolescents), and then to use the Cochran-Mantel-Haenszel (CMH) test to find an adjusted odds ratio between the two treatment groups considering these two strata.5 The null hypothesis of this test is that the response is conditionally independent of the treatment in any given stratum (in this case, adolescent and non-adolescent).
5 Cochran-Mantel-Haenszel Test may be a bit complicated to understand. If you want to read more,
go to this link to get a more detailed explanation: http://udel.edu/~mcdonald/statcmh.html
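Returning to the stratified analysis itself, here is a minimal sketch in Python (assuming statsmodels; the counts in the two 2x2 tables are entirely hypothetical):

import numpy as np
from statsmodels.stats.contingency_tables import StratifiedTable

# Hypothetical 2x2 tables, rows = exercise / usual care,
# columns = responder / non-responder, one table per age stratum
adolescent = np.array([[12, 8], [6, 14]])
non_adolescent = np.array([[15, 10], [9, 16]])

strat = StratifiedTable([adolescent, non_adolescent])
print(strat.oddsratio_pooled)         # Mantel-Haenszel pooled odds ratio
print(strat.test_null_odds().pvalue)  # CMH test of conditional independence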
After a full day in clinics, Dr. Wang returns to his office to get his briefcase and
go home, but before leaving, he looks again at the Olympic stadium, as it is eve-
ning and the lights are on. He feels motivated and knows now he will be able to
address the reviewers’ concerns and have a meaningful study. He feels inspired
by the chance of being able to offer something else for patients with postpartum
depression and also honored for his small contribution to the scientific progress
of China.
CASE DISCUSSION
This case discusses a classical example of a situation in which the use of parametric tests may be criticized. Here, the use of a scale based on ordinal items may be an issue, even though the researcher is using the total sum. The investigator needs to plan well and consider the potential drawbacks of the different approaches. If the investigator is not confident, it is always best to choose a statistical test with fewer assumptions, so as to make sure the final results are valid. The final choice should also depend on the main research question.
FURTHER READING
Callegari-Jacques SD. Bioestatística princípios e aplicações. Porto Alegre: Artmed; 2008: Chapter 11,
pp. 94–102.
Portney LG, Watkins MP. Foundations of clinical research: applications to practice. 3rd ed. Upper Saddle River, NJ: Pearson/Prentice Hall; 2015: Chapters 20, 22, 23, 24.
Zou KH, Tuncali K, Silverman SC. Correlation and simple linear regression. Radiology. 2003;
227: 617–628.
REFERENCES
1. Luurila H, Olkkola KT, Neuvonen PJ. Interaction between erythromycin and the
benzodiazepines diazepam and flunitrazepam. Pharmacol Toxicol. 1996; 78(2): 117–122.
2. Bopp M, Darby M, Loftin KC, Broscious S. Effects of daily oral care with 0.12%
chlorhexidine gluconate and a standard oral care protocol on the development of nosoco-
mial pneumonia in intubated patients: a pilot study. J Dent Hyg. 2006; 80(3): 9.
3. Khurshudian AV. A pilot study to test the efficacy of oral administration of interferon-alpha
lozenges to patients with Sjögren’s syndrome. Oral Surg Oral Med Oral Pathol Oral Radiol
Endod [Internet]. 2003; 95(1): 38–44. Available from: http://www.ncbi.nlm.nih.gov/
pubmed/12539025
4. Cruz-Correa M, Shoskes DA, Sanchez P, Zhao R, Hylind LM, Wexner SD, et al. Combination
treatment with curcumin and quercetin of adenomas in familial adenomatous polyposis.
Clin Gastroenterol Hepatol. 2006; 4(8): 1035–1038.
5. Ofungwu J. Statistical applications for environmental analysis and risk assessment.
New York: John Wiley & Sons; 2014.
6. Rosner B. Fundamentals of biostatistics. 7th ed. Boston: Cengage Learning; 2011.
7. Hothorn LA. Statistics in toxicology using R. Boca Raton, FL: CRC Press, Taylor and Francis Group; 2016.
8. Friedman H. The Oxford handbook of health psychology. Oxford: Oxford University
Press; 2014.
9. Knapp TR. Treating ordinal scales as interval scales: an attempt to resolve the controversy. Nurs Res. 1990 Mar-Apr; 39(2): 121–123.
10. Doering TR, Hubbard R. Measurements and statistics: the ordinal-interval controversy and geography. Area. 1979; 11(3): 237–243.
11
SAMPLE SIZE CALCULATION
Access to power must be confined to those who are not in love with it.
—Plato
INTRODUCTION
In previous chapters you were introduced to the basic concepts of study design and
statistics: how to frame the right research question, how to design a study to answer
this question (design, recruitment and randomization, blinding, etc.) and how to an-
alyze the data obtained from a study (types of variables, their description, and the ap-
propriate statistical tests). However, a crucial study design issue that has not yet been
addressed is what sample size is required (how many study subjects or observations
are needed) to have enough power to answer the study hypothesis with statistical
significance.
This chapter will highlight and review the issues of type I and type II error, power,
and significance level, and how these parameters are used in calculation of the sample
size required to conduct a successful research study (some of these concepts have also
been reviewed in Chapter 9).
The purpose of a research study is to make inferences about the target popula-
tion from results obtained in a sample drawn from an accessible population. While
it is important to draw a representative sample to limit systematic error (reducing or
eliminating bias, confounding, etc.), it is also vitally important to select an appro-
priate sample size to reduce random error. Sample size calculation is an integral part
of a statistical analysis plan to estimate the required number of study participants. In
fact, most research funding agencies require a formal sample size calculation to dem-
onstrate that a funded project would yield conclusive data. Similarly, International
Conference on Harmonization (ICH) Good Clinical Practice (GCP) guidelines
mandate outlining the details of how the number needed to conduct a trial was cal-
culated, including the level of significance set and other parameters used [1]. Finally,
most applications for human subject research at an institutional review board (IRB)
require specification of a number of participants to be enrolled, with appropriate
justifications. Next we will discuss the consequences of over- and underestimating
the sample size.
OVERESTIMATING THE SAMPLE SIZE
Overestimating the sample size has ethical implications, as it may expose to unnecessary risk subjects whose participation was not actually needed. Additionally, if the estimated sample size is very large, the researcher has to consider whether such a study is feasible (financially, logistically) and what issues arise with regard to recruitment (enrollment timeframe, recruitment strategies, available population). Studies with an overestimated sample size are a waste of resources, impose excessive strain on the research team, and potentially expose an unnecessary number of study subjects to risk and discomfort.
UNDERESTIMATING THE SAMPLE SIZE
Conversely, if a sample size is underestimated, a study will be statistically underpow-
ered to detect the pre-specified effect size, and study results might fail to reach sta-
tistical significance. Interpreting underpowered studies presents a challenge to the
researcher as differences that fail to reach statistical significance could represent ei-
ther a truly null effect or a false negative result. Conducting an underpowered study
therefore represents an impractical and unethical approach to study design, as it also
wastes resources and exposes subjects to unnecessary risks, given that the results of
the research will not be able to inform the study hypothesis. The problems with over-
and underestimation of sample size apply not only to experimental studies, but also to
many observational designs.
REPORTING SAMPLE SIZE
In spite of increasing attention to the a priori specification of sample size and power
by various research stakeholders, transparent reporting of sample size calculations
remains inadequate. Even in randomized controlled trials, sample size calculations are still often inadequately reported, frequently erroneous, and based on inaccurate assumptions [3]. When negative studies do not include sample size calculations, it is impossible for the reader to know whether the results were truly negative or whether
the study was just underpowered. It is imperative that researchers understand the
consequences of sample size and power when interpreting results and applying them
to clinical practice.
PROBABILITY OF ERRORS
In hypothesis testing, different types of errors were introduced (Figure 11.1).
Alpha (α): Type I error, also known as “level of significance.” Type I error means
rejecting the null hypothesis when it is in fact true, or a false positive. It is typically set
to 0.05 or 5%. Setting alpha to 0.05 means that the investigator will accept taking 5%
risk of a significant finding being attributable to chance alone. Any investigator would
want to minimize this type of error as much as possible, as committing type I error can
pose significant risk to the study subjects.

Figure 11.1. Possible outcomes of hypothesis testing:

                            Truth about the population
Decision based on sample    H0 true             Ha true
Reject H0                   Type I error        Correct decision
Accept H0                   Correct decision    Type II error

(The figure also illustrates that relaxing alpha from .05 to .20 moves the rejection threshold toward the null distribution and thereby increases power.)

While accepting a Type I error risk of 5% is most common in clinical research, it is not necessarily the most appropriate value
depending on the research objective and scope. In genetic and molecular studies, a
large number of candidates are often compared; thus the cost of getting many false
positives can be prohibitively high when screen positive hits need to be subsequently
verified. For instance, if a genetic sample of 500,000 SNPs are to be scanned, then
keeping alpha at 0.05 would correspond to 25,000 false positive samples that could
consume extensive research resources to verify; however, setting the alpha to 0.001
would yield only 500 false positive samples. Reducing the alpha for a study will require
a larger sample size.
Power (1-β): The second type of error is the Type II error (β). Type II error
refers to failing to reject the null hypothesis when it is actually false (failing to de-
tect a difference when it actually exists, or a false negative). Power is defined as 1-β,
or the true positive rate. In clinical research, power is often set at 80%, meaning
that the investigator will accept a risk of 20% of not detecting a difference when it
truly exists.
The threshold of 80% power is an arbitrary value, much like the standard value of 0.05 for alpha. However, regarding any study with less than 80% power as uninformative fails to recognize the potential value of the information gained; it may be appropriate, for example, to accept a lower degree of power for a pilot study. Clinical research is often constrained by pragmatic issues regarding feasibility, and studies must be designed in accordance with sensitivity analyses that examine a meaningful array of possible findings, following examples set by previous analogous studies [4].
MEASURE OF VARIATION
For continuous outcome variables, there is a need to estimate the population SD (σ).
How do you find this number? One option is to examine the published literature, which
can be used to estimate the standard deviation. In situations where no previous studies exactly matching the intervention–outcome relation are available, studies on analogous interventions and outcomes can be used to estimate the standard deviation.
Expert opinion and subject matter expertise are needed to identify appropriate analogies
in the literature. In the absence of available published estimates on the standard deviation,
an investigator may conduct a pilot study to estimate the standard deviation. However,
there is a risk that the pilot study may underestimate the SD for the outcome [5].
The higher the variability in the outcome, the larger the sample size required to detect a significant difference between study groups. To demonstrate how variation in the sample affects the requisite sample size, consider the following example, in which we wish to test whether the mean weights of two populations differ. If the difference between the mean weights is large and the within-group variability is small, then a relatively small sample will suffice to detect a significant difference between groups; a larger sample size will be required if the within-group variability is large.
• SE = SD/√n
Similarly, a reported 95% confidence interval can be used to back-calculate the SD with the following formulas:
• 95% CI = mean ± 1.96 × SE
• SE = SD/√n
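This back-calculation is simple arithmetic; here is a minimal sketch in Python (the reported n and 95% CI below are hypothetical):

import math

# Hypothetical published result: n = 50, 95% CI of the mean = 120 to 128
n = 50
ci_lower, ci_upper = 120.0, 128.0

# The width of the 95% CI is 2 * 1.96 * SE, and SD = SE * sqrt(n)
se = (ci_upper - ci_lower) / (2 * 1.96)
sd = se * math.sqrt(n)
print(se, sd)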
EFFECT ESTIMATES
The effect estimate is another important parameter for sample size calculation. The
effect estimate is the actual difference expected under the alternate hypothesis, or
the magnitude of treatment effect anticipated between groups. The effect estimate
should reflect a clinically meaningful and feasible difference based on preliminary or
published studies, and subject matter or clinical expertise.
In research, the effect estimate is measured by statistics that depend on the nature of the exposure, the outcome, and the research goals. Associations between the exposure and outcome can be measured using a correlation coefficient (r), a standardized mean difference, or a relative risk, among many other options. For example, in a case-control study, the odds ratio is the most appropriate way to estimate the association between the exposure or treatment and the outcome. In summary, the nature of the exposure, the outcome, and the study design and research methods will determine how the effect size is calculated.
If a smaller type I error is desired (= smaller α), a larger sample size (n) will be required.
If a smaller type II error is desired (= more power), a larger sample size (n) will be required.
1. Loss to follow-up: Sample size calculation should account for dropouts that can occur during the course of a study. There is no accepted standard figure to be added to the calculated sample size, as the degree of dropout is highly dependent on the nature of the clinical population being studied.
2. Type of outcome variable: It is important to realize that sample size varies
depending on the type of outcome variable. With all else held equal, categorical
outcomes can require a larger sample size as compared to continuous outcomes.
In addition, more information is lost when an outcome is set as categorical, so
reconsidering the functional form of the outcome variable is a strategy if the
estimated sample size turns out to be too large [8].
3. The study design is an important part of sample size estimation. Paired studies
(vs. independent comparison groups) can reduce the need for larger sample sizes.
Similarly, if subgroup or stratified analyses are of interest as a primary or secondary
hypothesis, the study should be powered appropriately for these analyses.
4. Another issue that frequently comes up when conducting pilot studies is the sample size of the pilot study itself. Sometimes an arbitrary number of subjects (e.g., 10 or 20 participants) is considered adequate for a pilot study [9]. However, there are increasing requirements to defend the pilot study sample size in order to justify the participation of study subjects and the research costs [10].
Calculating Study Power
In many research projects, your sample size may be limited by a fixed budget, logistical or time constraints, or low rates of disease incidence, treatment, or outcome events. When the sample size is fixed for any of these reasons, an investigator should calculate the power available given that sample size, the anticipated effect size, the desired alpha, and other appropriate inputs.
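A minimal sketch of such a fixed-n power calculation, assuming the statsmodels package and invented inputs (a standardized effect size, Cohen's d, of 0.5 and 40 subjects per arm):

from statsmodels.stats.power import TTestIndPower

# Power achievable for a two-sample t-test with a fixed sample size.
analysis = TTestIndPower()
power = analysis.power(effect_size=0.5, nobs1=40, alpha=0.05, ratio=1.0)
print(f"Power with 40 subjects per arm: {power:.2f}")  # approximately 0.60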
Sensitivity Analyses
To select appropriate values for sample size calculations, it is often helpful to conduct
a sensitivity analysis to evaluate how changing the alpha, power, and effect size will im-
pact the sample size estimate. Such analysis is of great use when there are issues related
to feasibility, time, and cost for a study (see Table 11.1).
Table 11.1 Sensitivity Analysis Using Different Power Values for Sample Size Calculation
for an Example of a Study Looking at Proportion Rates in Two Different Groups
Post-hoc calculation: the magnitude of the Type II error can inform how you interpret null results.
The formula used to calculate the sample size for comparative trials is the following [12]:

N = 4σ²(z_crit + z_pwr)² / D²

where σ² is the variance, z_pwr is the z-value corresponding to the chosen power, z_crit is the z-value corresponding to the chosen alpha, and D is the minimal expected difference between the two group means (a short code sketch follows the input list below).
The inputs required for the calculation are therefore:
• Alpha
• Beta
• Effect size
• Proportions (when the outcome is binary).
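A minimal sketch of the formula above in code, assuming a two-sided alpha and illustrative values for σ and D:

from scipy.stats import norm

def total_sample_size(sigma, diff, alpha=0.05, power=0.80):
    """Total N across both arms: N = 4 * sigma^2 * (z_crit + z_pwr)^2 / D^2."""
    z_crit = norm.ppf(1 - alpha / 2)  # z-value for the chosen (two-sided) alpha
    z_pwr = norm.ppf(power)           # z-value for the chosen power
    return 4 * sigma**2 * (z_crit + z_pwr)**2 / diff**2

# e.g., an outcome SD of 10 points and a minimal difference of 5 points:
print(round(total_sample_size(sigma=10, diff=5)))  # ~126 subjects in total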
Introduction
The size of a given study is one of the main parameters in designing a clinical trial and has a significant impact on study planning and design. Calculating the sample size accurately is crucial, as an adequate number of subjects is essential to support valid statistical inferences from the clinical results. In fact, underestimating a sample size might result in the premature and inappropriate rejection of new interventions that could be beneficial and might never be assessed again. On the other hand, overestimating a sample size might unnecessarily expose a large number of subjects to a less effective treatment (such as patients randomized to the placebo arm) and increase costs.

¹ Professor Felipe Fregni prepared this case. Course cases are developed solely as the basis for class discussion. The situation in this case is fictional. Cases are not intended to serve as endorsements or sources of primary data. All rights reserved to the authors of this case.
Despite its critical importance, sample size calculation is not easy, mainly because it is often challenging to determine the parameters that go into it. In other words, the investigator needs to hypothesize the difference between groups in order to calculate the sample size, and the main question is how to do so before the study is performed. There are several methods, such as using the minimal clinically significant difference, pilot studies, or previous literature; each has its advantages and disadvantages. Another important point is that clinical researchers should be comfortable performing sample size calculations so as to better understand the methodology of their studies. In fact, sample size calculation is often not performed, and the sample size ends up being determined by the duration of the trial and the ability to recruit subjects, which can have serious consequences for the trial.
The Planned Trial
Dr. Hoffman was planning a randomized clinical trial in which patients with well-
documented, previously treated Lyme disease—but with persistent musculoskeletal
pain, neurocognitive symptoms, or dysesthesia, often associated with fatigue—would
be randomized to receive either intravenous ceftriaxone for 30 days, followed by oral
doxycycline for 60 days, or matching intravenous and oral placebos. Because there is no
established treatment for chronic symptoms in patients with a history of Lyme disease,
the use of placebo in this situation is appropriate. The primary outcome measure would be improvement on the Short-Form General Health Survey (SF-36), a scale measuring health-related quality of life, on day 180 of the study.
Also, in order to increase the clinical significance of the study, Dr. Hoffman decided to categorize the results. She used the results of a previous study showing that a change of up to 7 points on the SF-36 could be considered normal variation; therefore, a patient with a change greater than 7 points would be considered a responder, and one with a smaller change a non-responder.
Recruitment for this study is not a problem: the infectious disease department at MGH and Dr. Hoffman are referral centers for Lyme disease in New England, and she sees at least two to three new suspected cases of persistent symptoms in patients with treated Lyme disease per week. However, it will be difficult to obtain funds to run a very large study, and she does not want to expose an unnecessarily large number of patients to this trial. So the calculation of an appropriate sample size is critical.
Dr. Hoffman knows that she needs to calculate the sample size very carefully. Although she has determined that the power of the study will be 90% and the alpha level will be 5% (the chance of a false-positive result), the main issue for her is how to estimate the difference between treatments. This will not be an easy task. She decides to spend the weekend at her family's small house in the charming town of Falmouth, on Cape Cod, so that she can concentrate on the sample size calculation for her study.
Dear John,
Remember the chronic Lyme disease clinical trial I want to conduct? I just ran the
sample size calculation using the results of my previous pilot study (with a power
of 90% and alpha of 5%) and got a sample size of 20 subjects. What do you think?
Although this is good, it may be too small.
Jen
Dr. Hoffman was not expecting to hear from him anytime soon, but he is also
working on Saturday and quickly responds to her from his Blackberry:
Dear Jen,
I do remember this study. Although this is a valid method as the methodology of your
pilot study is exactly the same as your proposed study, this seems too small to me and
you may be overestimating the results of your treatment. Remember the chances of
overestimating are larger with small sample sizes. You might have selected a sample
of patients in your pilot study that responded very well to this treatment. In addition,
estimating a population’s standard deviation based on small studies is known to un-
derestimate the population’s true variability. I would also suggest assessing other
methods before making your final decision.
John
Dr. Hoffman then decided to go ahead and assess the other options. But before that, she stopped to have something to eat at her favorite local eatery, a small French restaurant.
John, thanks so much for your help. I recalculated the sample size using the minimal
clinically significant difference and got a sample size of 180 patients. This would result
in a big burden for my study budget—what should I do now?
Jen
Jen, if you use this method, your calculation will certainly be better accepted and in
addition it does not hurt to have a larger sample size—it will increase the impact of
your study and facilitate statistical analysis. Also remember, you are using categorical
variables. However, if you are confident that the results of your pilot study are reliable,
then you might be throwing away resources and exposing an unnecessary number of
patients to your trial. Tough call!
John
It was getting dark and Dr. Hoffman was getting tired. She decided to call it
a night and continue the next day. She then went to the porch to have a glass of
wine and tried to relax so she would be sufficiently rested the next day to reach a
decision.
CASE DISCUSSION
This case is about the treatment of chronic Lyme disease and how to calculate an appropriate sample size. Dr. Hoffman is an expert in treating patients with Lyme disease, and although there is no standard therapy for chronic cases, she intends to test a new regimen to alleviate the symptoms of the chronic state. Chronic Lyme disease is not a common condition, and the cost of running the trial is an issue to consider as well. The main challenge is finding the best way to obtain the parameters needed for the sample size calculation. Dr. Hoffman had conducted a pilot study in a small number of patients with promising results, and through a literature search she had identified a study in which tetracycline was used to treat chronic Lyme disease. In addition, a difference of 7 points on the SF-36 scale is considered the minimal clinically significant difference. As this is a two-arm study with a continuous outcome, apart from the alpha, the beta, and the means of the two groups, standard deviations will be required to calculate the sample size.
Each of the three options mentioned can be used to estimate the standard deviation for the sample size calculation. All are reasonable, and all are used to obtain standard deviations in various clinical trials. However, each strategy has its advantages and disadvantages. Using parameters from a pilot study does seem logical and relevant, and the resulting sample size is reasonable and, more importantly, feasible for a trial of an uncommon disease. Nevertheless, this strategy raises several issues. First, the population of a pilot study is usually very small. Second, the pilot study population might be very homogeneous, so the results might not be applicable to a larger, more heterogeneous sample. This frequently leads to a small sample size estimate that will render the main study underpowered, increasing the chance of type II error; with a small sample size there is also a chance of type I error. When using pilot studies that yield small sample size estimates, one has to bear in mind that the treatment effect may have been overestimated.
The second option is to select the minimal clinically significant difference between the two treatments and then calculate the required sample size. This also seems relevant and reasonable, but it will lead to a large sample size. This can waste resources, be time-consuming and of questionable feasibility, and, more importantly, expose many patients to placebo, raising ethical issues and risking that the study might not receive grant approval. Dr. Hoffman could plan an interim analysis to assess the data and stop the study if continuation is not needed, but planning an interim analysis comes at the expense of the p-value set for the primary outcome (see Chapter 18).
The third option is to use historical values from published studies to extract the standard deviations and apply them to the calculation, a strategy frequently used by researchers. It is also a cheap solution and thus saves money and time. The researcher has to find a study with a similar design, outcomes, and treatment to use as a template for obtaining the required information. This is a difficult task, and a closely matched study is often not available unless the study has been replicated several times in the past.
In terms of sheer feasibility, if the literature search shows 90% improvement in the tetracycline arm and a hypothesized 35% response to placebo, the sample size would be 36 (18 patients per arm). By choosing this option, Dr. Hoffman saves time and will be able to recruit the patients needed, at two to three referrals per week, within 180 days. The calculated sample size is intermediate between those obtained from the pilot study and from the minimal clinically significant difference, and it limits the number of patients who would be unnecessarily exposed to the interventions under the larger sample size required by the minimal clinical difference approach.
Another option would be for Dr. Hoffman simply to “guesstimate” the standard deviations based on her experience in the field and perform the sample size calculation. Bias can be introduced this way, and the study could easily fail to reach significance. This option is not recommended unless the available data are scarce.
Study design has a significant impact on the outcome, and therefore the best design for the study should be chosen. For rare diseases, one strategy is to use unequal groups: allocating more patients to the active arm and then applying statistical techniques to adjust for the allocation ratio when drawing conclusions. This technique can also help offset the number of patients who drop out, as the duration of the study is 180 days. The dropout rate is another issue that should be considered, especially in long clinical trials, and it should be accounted for in the sample size calculation.
Considering the dropout rate, Dr. Hoffman thinks about using unequal allocation, allowing more patients to be recruited into the long-term antibiotic therapy arm; given the positive pilot data, this should not violate the principle of equipoise. But then she considers the minimal clinically significant difference approach, which would gather clinically meaningful data from a larger sample rather than collecting data on a small number of patients based on the pilot or literature values. As she decides how to move forward, she keeps thinking about time, money, dropouts, and so on, and says to herself that “there is much more to it than just plugging in values of alpha and power to calculate sample size.”
1. What challenges does Dr. Hoffman face in choosing the method to determine the difference between treatments?
2. What are her main concerns?
3. What should she consider in making the decision?
4. Do you have any other concerns that she should discuss for sample size (outcome
variables, internal/external validity, dropping out, budget, feasibility, etc.)?
FURTHER READING
Article/Topic Review
This article discusses in detail the challenges and strategies for doing sample size calculations for
different epidemiological studies:
Kasiulevičius V, Šapoka V, Filipavičiūtė R. Sample size calculation in epidemiological studies.
Gerontologija. 2006; 7(4): 225–231. Available online at http://www.gerontologija.lt/files/
edit_files/File/pdf/2006/nr_4/2006_225_231.pdf [last accessed on Jan. 16, 2013].
Grunkemeier GL, Jin R. The statistician’s page: power and sample size: how many patients do I need? Ann Thorac Surg. 2007; 83: 1934–1939.
Review
Dattalo P. A review of software for sample size determination. Eval Health Prof. 2009; 32(3): 229–248. doi: 10.1177/0163278709338556.
Commonly used software packages include PASS, NQuery, and EAST. Many websites also provide convenient sample size calculators, easing the process of calculation for researchers.
Books
The following reference books specifically discuss topics on sample size calculations with
formulas and tables given for researchers to perform sample size calculations:
Chow S-C, Wang H, Shao J. Sample size calculations in clinical research, 2nd ed. Chapman & Hall/
CRC Biostatistics Series; 2007.
Machin D, Campbell MJ, Tan S-B, Tan S-H. Sample size tables for clinical studies. 3rd
Edition. Oxford, UK: Wiley-Blackwell; 2008.
REFERENCES
1. ICH Expert working group. ICH harmonised tripartite guideline: guideline for good
clinical practice E6(R1). 1996. Available at: http://www.ich.org/fileadmin/Public_
Web_Site/ICH_Products/Guidelines/Efficacy/E6_R1/Step4/E6_R1__Guideline.pdf
[accessed on Jan. 15, 2013].
2. Machin D, Campbell MJ, Fayers PM, Pinol APY. Sample size tables for clinical studies, 2nd
ed. Oxford, London, Berlin: Blackwell Science; 1987: 1–315.
3. Charles P, Giraudeau B, Dechartres A, Baron G, Ravaud P. Reporting of sample size calcu-
lation in randomised controlled trials: review. BMJ. 2009; 338: b1732.
4. Bacchetti P. Current sample size conventions: flaws, harms, and alternatives. BMC Med.
2010; 8(1): 17.
5. Vickers AJ. Underpowering in randomized trials reporting a sample size calculation. J Clin
Epidemiol. 2003; 56(8): 717–720.
6. Prentice DA, Miller DT. When small effects are impressive. Psychol Bull. 1992;
112(1): 160–164.
7. Paulus J. Sample size calculation. Powerpoint presentation for Principles and practice of clin-
ical research course. 2012. Available online from course website access http://www.ppcr.
hms.harvard.edu [last accessed on Jan. 15, 2013].
8. Zhao LP, Kolonel LN. Efficiency loss from categorizing quantitative exposures into qualita-
tive exposures in case-control studies. Am J Epidemiol. 1992; 136(4): 464–474.
9. Julious SA. Sample size of 12 per group rule of thumb for a pilot study. Pharmaceut. Statist.
2005; 4: 287–291. Available online at http://research.son.wisc.edu/rdsu/sample%20
size%20pilot%20study12.pdf [last accessed on Jan. 13, 2013].
10. Johanson GA, Brooks GP. Initial scale development: sample size for pilot studies. Educ
Psychol Meas. 2010; 70(3): 394–400.
11. Chan A-W, Hróbjartsson A, Jørgensen KJ, Gøtzsche PC, Altman DG. Discrepancies in
sample size calculations and data analyses reported in randomised trials: comparison of
publications with protocols. BMJ. 2008; 337: a2299.
12. Eng J. Sample size estimation: how many individuals should be studied? Radiology. 2003;
227: 309–313.
12
SURVIVAL ANALYSIS
INTRODUCTION
Survival analysis denotes a specific set of standardized statistical analyses focused on time to event [1]; in other words, on how much time elapses from the exposure/intervention to the occurrence of an event. For instance, imagine that you are testing a new intervention to prevent hospitalization due to diabetic ketoacidosis (DKA), an acute, life-threatening condition that requires immediate medical attention and in which preventive care could have a major impact. During the trial design stage, you choose as the primary outcome measure of the intervention's effectiveness whether a patient has at least one hospitalization due to DKA during the 12 months following the new intervention. This metric, although suitable, can be misleading, as we will see in the example that follows.
The following 2 × 2 table presents hospitalization due to DKA under the new intervention versus standard care:

                    Hospitalized    Not Hospitalized
New Intervention         60                40
Standard Care            70                30
The odds ratio of being hospitalized would be 0.64 (0.36–1.16, 95% CI) or a risk ratio
of 0.86 (0.70–1.05, 95% CI). Note that in both situations the 95% CI overlaps with 1,
and thus there seems to be no significant difference in the number of hospitalizations
due to DKA complications between the new intervention and standard of care.
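These figures can be reproduced with a contingency-table routine; the sketch below assumes the statsmodels package (rows: new intervention, standard care; columns: hospitalized, not hospitalized):

from statsmodels.stats.contingency_tables import Table2x2

table = Table2x2([[60, 40],
                  [70, 30]])
print(f"OR = {table.oddsratio:.2f}, 95% CI = {table.oddsratio_confint()}")  # 0.64 (0.36, 1.16)
print(f"RR = {table.riskratio:.2f}, 95% CI = {table.riskratio_confint()}")  # 0.86 (0.70, 1.05)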
A survival analysis would view this problem differently, and would focus on the
time elapsed between the intervention and the outcome event (in this case, hospitali-
zation). So instead of focusing on how many patients were hospitalized after the inter-
vention, the main interest is how long it takes for them to develop DKA and require
hospitalization.
A table for a survival analysis would be arranged differently, in order to reflect
the time in which each patient experienced the first hospitalization episode due
to DKA.
Patient ID      Months until first DKA hospitalization
Patient 1                      5
Patient 67                     8
Patient 198                   12
Patient 4                      2
Patient 152                    4
Patient 178                    3
. . .                        . . .
Analyzing the time until first hospitalization shows that patients who received the new educational intervention had a median survival time of 8 months, while the other group had a median survival time of 3 months. Therefore, depending on the research question, survival analysis may be a more informative way to analyze data that depend on events (yes/no).
MEDIAN SURVIVAL TIME
The median survival time is the first important concept to understand in survival analysis: it is the time point by which 50% of patients have developed the event. In the DKA example, 50% of patients who received the new intervention were hospitalized within 8 months of the intervention, and 50% of patients under standard care were hospitalized within 3 months. The remaining 50% could be hospitalized at a later point, or not at all.
[Figure: Kaplan-Meier plot of time to DKA hospitalization. Y-axis: percent survival (0–100); x-axis: months elapsed (0–12); the median survival point is marked.]
The concept of median survival time thus gives us an estimate of when 50% of the subjects in our sample develop the outcome of interest. It is an especially useful metric when assessing time until death, recurrence of an important clinical outcome, or time from exposure until a specific outcome, among others.
Nonetheless, by itself, the median survival time is unable to tell us if the new in-
tervention is superior to the standard of care. Median survival time can be considered
along the Kaplan-Meier curve (see the following) as descriptive statistics, similar to
mean and standard deviation, respectively, for continuous data.
Month   At Risk   Surviving   Interval Survival    Cumulative Survival
1         100        90        90/100 = 0.90        0.90
2          90        82        82/90 = 0.91         0.90 × 0.91 = 0.82
3          82        46        46/82 = 0.56         0.82 × 0.56 = 0.46
Survival analysis focuses on the survival function, which is the cumulative probability of surviving (i.e., not experiencing) the event. The Kaplan-Meier estimator is a simple function that can be calculated by hand to estimate this cumulative probability. As mentioned, it is an important method for describing the data and providing an overall picture of what happened to all subjects in the trial.
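The product-limit calculation in the table above can be reproduced with a few lines of code; a minimal hand-calculation sketch, not a full Kaplan-Meier implementation:

def cumulative_survival(intervals):
    """intervals: list of (n_at_risk, n_surviving) pairs, one per time period."""
    s = 1.0
    for n_at_risk, n_surviving in intervals:
        s *= n_surviving / n_at_risk  # cumulative survival = product of interval survivals
        print(f"at risk {n_at_risk:3d}, surviving {n_surviving:3d}, cumulative {s:.2f}")

cumulative_survival([(100, 90), (90, 82), (82, 46)])  # prints 0.90, 0.82, 0.46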
Comparing two survival functions to determine whether they differ statistically would be similar to conducting a t-test when comparing the means of two groups (as measured with continuous data). This can be performed using the log rank test [3]. In this test, survival curves are compared against the null hypothesis that there is no difference between them. Thus, a p-value of less than 0.05 leads to rejection of the null hypothesis, and the survival functions are taken to be statistically significantly different.
In the trial comparing the new educational intervention against standard of care for DKA prevention, the new intervention significantly increased the time until DKA hospitalization compared with standard of care [χ2(1) = 6.64, p = .01]. In this case, we state that the difference between survival functions is statistically significant because the p-value (.01) is less than .05, and thus the null hypothesis is rejected.
The log rank test is the usual standard for comparing survival functions and is most powerful when the hazards are proportional (the ratio of event rates between the groups is constant over time). If this does not hold, another method, the Gehan-Breslow-Wilcoxon test, may be more suitable when there is no consistent hazard ratio [4]. Nonetheless, the Gehan-Breslow-Wilcoxon test requires that one group have consistently higher survival than the other. Moreover, it gives more weight to events at early time points, which can be misleading, especially with censoring at early stages.
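In practice the log rank test is rarely computed by hand. As a hedged sketch, the lifelines package (one common choice) provides it directly; the durations and event indicators below are invented for illustration:

from lifelines.statistics import logrank_test

new_durations = [8, 9, 12, 12, 7, 10]  # months until DKA hospitalization (or censoring)
std_durations = [3, 2, 5, 4, 3, 6]
new_events = [1, 1, 0, 0, 1, 1]        # 1 = hospitalized, 0 = right-censored
std_events = [1, 1, 1, 1, 1, 0]

result = logrank_test(new_durations, std_durations,
                      event_observed_A=new_events, event_observed_B=std_events)
print(result.test_statistic, result.p_value)  # chi-square statistic and p-value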
CENSORING
So far we have treated the time-to-event observations as if they were complete. In some cases, however, the event of interest cannot be observed during the study period. For instance, in our research question, if a patient is hospitalized due to DKA complications after the 12-month period following the intervention, that occurrence will not be captured. This is not considered missing data. Instead, the occurrence falls outside the specified time window (i.e., 12 months) and is a specific type of censoring, called right censoring. This is the most common type, and it occurs when the event happens after the time period specified for the survival analysis.
There are other types of censoring [5]. Left censoring occurs when the event has already happened before the observation period begins: we know only that it occurred before study entry, not exactly when. For instance, imagine that a patient is enrolled and no hospitalizations due to DKA occurred in the 12 months prior to enrollment; hospitalizations due to DKA appear to have stopped at some earlier point, but we do not know when. This can be an example of left censoring.
The last type is interval censoring, which happens when the event occurs between assessment points. For instance, imagine that in the DKA example we used questionnaires at study visits, rather than hospital records, to assess DKA occurrence. The event of interest could then happen between visits, and there would be no way to know exactly when it occurred.
Month   At Risk   Censored   Surviving   Interval Survival    Cumulative Survival
1         100         3         90        90/100 = 0.90        0.90
2          87         8         82        82/87 = 0.94         0.90 × 0.94 = 0.85
3          74        12         46        46/74 = 0.62         0.85 × 0.62 = 0.53
Note that censoring does not affect the survival function at the first time point, as the probability is the same with or without censoring. After that point, however, censoring changes the survival function: at the end of the second month, the three observations censored during the first month no longer count toward the population at risk, so the cumulative probability changes. In the first example, a patient had an 82% probability of not being hospitalized due to DKA complications by the end of the second month; with censoring, that probability increases to 85%.
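Using the same cumulative_survival sketch from earlier, the censored table is reproduced simply by feeding in the reduced risk sets:

cumulative_survival([(100, 90), (87, 82), (74, 46)])  # prints 0.90, 0.85, 0.53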
There are other options for dealing with censored data, including setting the censored observation to missing, or replacing it with zero, the minimum, the maximum, the mean, or a randomly assigned value. Although such methods can be sound in some cases, they are defensible only when the amount of censored data is small. Otherwise they can produce undesirable effects, such as biased statistical estimates and samples that are not representative of the general population, and important information may be discarded from the study.
ADJUSTING FOR COVARIATES: THE
USE OF COX PROPORTIONAL HAZARDS
(COX REGRESSION MODEL)
So far we have discussed survival analysis as if the hospitalization events were due solely to the intervention. Commonly, however, several interrelated factors contribute to increases or decreases in survival probability, especially in non-randomized studies. Imagine, for instance, that you want to explore the importance of risk factors such as diabetes type or a missed insulin dose in hospitalization due to DKA. These two factors can be covariates of the intervention, as they are also associated with the risk of the hospitalization event. Such covariates can influence the outcome and need to be accounted for in the analysis. This is often done using a specific statistical procedure, the Cox proportional hazards model (or Cox regression model); indeed, regression or multivariate analysis is used with different types of outcomes (e.g., linear regression for continuous variables, logistic regression for categorical variables, and here Cox proportional hazards for time-to-event variables). In the chapter introducing linear regression (Chapter 9), you learned that when the outcome is binomial, the appropriate model is logistic regression. Hospitalization due to DKA is also a binomial variable, so a logistic regression model could compare the presence or absence of hospitalization at a specific time point, but it cannot compare survival curves based on time to event, which is the focus of survival analysis. Thus, to assess the influence of one or multiple predictors in a time-to-event analysis, the most suitable regression model is the Cox proportional hazards model [6].
In this model, the aim is to construct the hazard function, which can be defined as the probability that a subject who has survived to time t will experience the event in the following instant. Logistic regression instead estimates the proportion of cases that develop the event by a specific time point. Thus, where logistic regression estimates odds ratios, Cox regression estimates hazard ratios. In simple terms, the Cox regression model describes the relative risk of developing the event at a given moment (i.e., t).

Although the focus of this chapter is not an extensive explanation of the assumptions underlying regression modeling, one concept is important: proportionality. In this model, the hazard function for one subject is a fixed proportion of the hazard function of another subject. The hazard for subject i is λi(t) = λ0(t) exp(β1X1i + β2X2i + . . .), where λ0(t) is the baseline hazard (the hazard when all covariates are set to zero); the ratio of the hazards for any two subjects therefore does not depend on time, but only on the effects of the predictor variables. This means that if, for instance, a predictor triples the risk of an event on one day, it will also triple the risk of an event on any other day.
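As a hedged sketch of how such a model is fitted in practice, assuming the lifelines package and invented data and column names:

import pandas as pd
from lifelines import CoxPHFitter

df = pd.DataFrame({
    "months": [5, 8, 12, 2, 4, 9, 3, 7],        # time to event or censoring
    "hospitalized": [1, 1, 0, 1, 1, 1, 1, 0],   # 1 = event observed, 0 = censored
    "intervention": [1, 1, 1, 0, 0, 1, 0, 0],   # covariates
    "type1_diabetes": [1, 0, 1, 1, 0, 0, 1, 0],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="months", event_col="hospitalized")
cph.print_summary()  # the exp(coef) column gives the hazard ratio per covariate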
Introduction: The Trial
Maria knocked softly and then walked through the door and took a seat at the confer-
ence table where the other three were already seated.
“For this study that we are working on and want you, Maria, to join us on, I want to
show that patients do better after mitral valve repair than replacement,” Dr. Feldman
said in his booming voice. In response to the puzzled look on Maria’s face, Dr. Sunder
¹ Munir Boodhwani and Felipe Fregni prepared this case. Course cases are developed solely as the basis for class discussion. The situation in this case is fictional. Cases are not intended to serve as endorsements or sources of primary data. All rights reserved to the author of this case. Reproduction and distribution without permission from the authors is not allowed.
began to explain, “The classical treatment for severe mitral valve insufficiency has
been replacement of the valve either with a biologic or mechanical prosthesis. Over
the past 15–20 years, under Dr. Feldman’s leadership, surgical techniques have been
developed to repair the mitral valve preserving the native valve tissue. We prefer re-
pair over replacement because the biologic prostheses degrade over time and don’t
seem to last more than 10 years or so and then the patients require reoperation for
a re-replacement of the valve. On the other hand, mechanical valves last for a long
time but interact with the blood and increase the chance of blood clots forming on
the valve, which can then go to different parts of the body and can lead to stroke or
ischemia of the different parts of the body. So, we have to give them blood thinners
to prevent that, but then there is a risk of bleeding complications. What we are in-
terested in finding out is whether valve repair has a long-term advantage over re-
placement, particularly in terms of survival.” Dr. Sunder went on to explain in more
detail the nature of mitral valve disease and the history of mitral valve surgery at the
University of Pennsylvania.
After understanding the main clinical characteristics associated with mitral valve repair, Maria asks, “That is quite an interesting clinical problem, but are we talking about a prospective randomized study—my guess is that would be difficult due to ethical issues and also the long follow-up time until patients start having issues related to the previous operation—or a retrospective study?” James Sunder quickly responds,
“That is a good point, Maria. You are right; a prospective study would not be feasible.
We want to analyze our data retrospectively. We have data from 3,728 patients who
underwent either mitral valve repair or replacement and we actually could follow most
of these patients as we are a renowned cardiac center in the area and patients rarely
move to other services. If they move to other cities, it is common that they maintain
appointments with their doctors in our hospital. In addition, we have a detailed elec-
tronic database that goes back to the 1980s.”
“That is wonderful,” Maria comments. In fact, this is what epidemiologists
need—large databases and an interesting and important clinical question. She then
comments, “Now we need to decide how we are analyzing the data. We have some
challenges ahead of us. The main question for us is how to analyze the data: using
other outcomes (such as quality of life and simple methods of analysis) or the use of
survival and more complicated models.”
Survival Analysis
Survival analysis is commonly used in medicine. In fact, its use has been increasing
in recent years. The main advantage of survival analysis is that this method allows
you to compare groups in which individuals have different lengths of observation.
For instance, if an investigator is measuring survival associated with cancer, but
length of follow-up is variable across patients, this would create a problem if this
investigator uses traditional methods of data analysis. In addition, treatments might
have survival rates that vary across time—for instance, surgical versus medical
treatments. While mortality would be higher initially for the surgical treatments,
it might be smaller after the first year as compared to medical treatments (which
would be associated with a more stable mortality over years). Therefore, a method
such as survival analysis, which can take into account variable lengths of follow-up
(including cases of censoring and loss to follow-up) and change in survival rates
across time, is important in medicine.
[Figure: mitral valve cases by year of surgery, 1980–2008, for the MV Repaired and MV Replaced groups.]
Mitral valve disease can be due to four causes: Barlow’s disease (occurs in young
patients), degenerative disease (older patients), endocarditis (valve infection), and
rheumatic valve disease. Endocarditis and rheumatic valve disease are significantly less
common, but these valves are much more difficult to repair and consequently have a
higher failure rate. Patients typically enjoy a good overall survival after mitral valve
surgery (~80%–90% at 10 years). There is a small rate of recurrent mitral insufficiency
requiring reoperation (~1%–2%/year). Other important determinants of outcome
after mitral valve surgery include the following:
Twenty minutes into the meeting, Dr. Feldman’s pager went off. It was an emer-
gency in the operating room. As he was leaving, he said, “Hey Maria. Can we look at
some raw results tomorrow?” Before she could reply, he was gone.
² All graphs in this case study were created solely for educational purposes, and the data are fictional.
[Figure: survival (%) over 14 years of follow-up for the repair and replacement groups; the replacement curve is annotated at 48%, 32%, and 19%.]
Dr. Feldman is extremely happy with the results presented. “I knew the patients were
doing better with the repair,” he commented. “This is going to be a great paper—
we should be able to publish in a top tier cardiovascular journal. So how should we
proceed?”
APPENDIX 1
Kaplan-Meier Estimates: “[T]he Kaplan-Meier Estimates is the most common method of de-
termining survival time, which does not depend on grouping data into specific time intervals.
This approach generates a step function, changing the survival estimate each time a patient dies
(or reaches the terminal event). Graphic displays of survival functions computed as a series of
steps of decreasing magnitude. This method can account for censored observations over time.
Confidence intervals can also be calculated.” Source: Portney L, Watkins M. Foundations of clin-
ical research: applications to practice, 3rd ed. (pp. 721–724). Pearson International Edition; 2009.
Cox Proportional Hazards Model: “Survival time is often dependent on many interrelated
factors that can contribute to increased or decreased probabilities of survival or failure. A regres-
sion model can be used to adjust survival estimates on the basis of several independent variables.
Standard multiple regression methods cannot be used because survival times are typically not
normally distributed—an important assumption in least squares regression. And of course, the
presence of censored observations presents a serious problem. The Cox proportional hazards model [is] conceptually similar to multiple regression, but without assumptions about the shape of distributions. For this reason, this analysis is often considered a nonparametric technique.” Source: Portney L, Watkins M. Foundations of clinical research: applications to practice, 3rd ed. (pp. 721–724). Pearson International Edition; 2009.
CASE DISCUSSION
Dr. Feldman, the chair of the surgery department at the University of Pennsylvania, wants to “show
that patients do better after mitral valve repair than replacement.” To accomplish that, he plans to
analyze retrospectively the data of 3,728 patients who underwent mitral valve replacement or repair.
This is an important and valuable research question that can be approached in different ways. One of them is simply to use a quality-of-life scale with an ANOVA, or a linear regression to control for some covariates; in this case, a linear regression would be more suitable than an ANCOVA because the focus would be on the outcome and not on the difference among groups.
This would be a simple and straightforward way of assessing the impact of mitral valve replacement or repair, and one that would certainly be welcomed by clinicians. However, this type of analysis focuses on one or several specific time points and does not provide information about time to event. An important event after mitral valve replacement or repair is death, and one intervention can be considered more effective than the other if patients live longer after the procedure.
The Kaplan-Meier method is easy to calculate by hand, and the resulting survival curves can be compared using a method such as the log rank test. The null hypothesis of this test is that the survival curves do not differ, so a p-value of less than 0.05 is used to reject it. The test performs best under proportional hazards: for interventions with similar event rates, the number of events (in this case, deaths) at a given moment should be proportional to the number of individuals in the population at risk.
It is also possible to include covariates in order to control for their potential effects on the time-to-event analysis. This can be achieved through statistical modeling with the Cox proportional hazards regression. In this model, the proportional impact of predictor factors is analyzed, and survival times are adjusted on the basis of those factors. The main difference between logistic regression and Cox proportional hazards regression is that the former accounts for the possible influence of covariates on the outcome at a specific time point, while the latter models the relative risk of developing the event over time.
FURTHER READING
Online Resources
http://vassarstats.net/survival.html
http://www.ats.ucla.edu/stat/sas/seminars/sas_survival/
http://data.princeton.edu/pop509
http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_Survival/BS704_Survival_
print.html
REFERENCES
1. Bewick V, Cheek L, Ball J. Statistics review 12: Survival analysis. Critical Care. 2004;
8(5): 389–394.
2. Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. JASA.
1958; 53(282): 457–481.
3. Bland JM, Altman DG. The logrank test. BMJ. 2004; 328(7447): 1073.
4. Machin D, Cheung YB, Parmar M. Survival analysis: a practical approach. 2nd ed. Wiltshire,
UK: Wiley; 2006.
INTRODUCTION
This chapter begins with an explanation of what is considered incomplete or lost data
and how to handle this issue using an intention-to-treat (ITT) approach, followed by
another topic (covariate adjustment). The next chapter will complete the statistics
unit, covering subgroup analysis and meta-analysis.
The previous chapters introduced you to statistics, covering hypothesis testing,
how to handle data, how to perform data analysis depending on the type of data you
collected (parametric vs. non-parametric tests), and sample size calculations. You
need to remember that without an adequate number of subjects in your study, you will
not have enough power to reach statistical significance. But what if you started your
study with a sufficient sample size and then had dropouts, adherence problems (cross-overs), missed appointments, or data entry/acquisition problems? This leads us to the first part of this chapter: what to do about incomplete or lost data.
When reporting a randomized controlled trial (RCT), you will need to clearly state
and explain the method used to handle missing data and you should use a flowchart
diagram to document how participants progressed throughout each phase of the study
(see Figure 13.1).
MISSING DATA
Missing data have been defined as “values that are not available and that would be
meaningful for analysis if they were observed” by the members of an expert panel
convened by the National Research Council (NRC) [1]. It is very common in clinical
research to have missing data. The most important thing is to account for all losses,
stating when and why they happened whenever this is possible.
[Figure 13.1: CONSORT flow diagram template. Enrollment: assessed for eligibility (n = ); excluded (n = ) for not meeting inclusion criteria, declining to participate, or other reasons; randomized (n = ). Analysis, for each arm: analyzed (n = ); excluded from analysis, with reasons (n = ).]

• Participants drop out, or withdraw from the study before its completion. Usually those who remain in the study are different from the ones who left (biasing the analysis). Participants may leave the study for several reasons (e.g., death, adverse reaction, unpleasant procedures, lack of improvement, or early recovery).
• Participants refuse the assigned treatment after allocation. Even if they signed the
informed consent and were aware of the possible treatments, they must be allowed
to withdraw at any time.
• Participants do not attend an appointment at which outcomes should have been
measured.
• Participants attend an appointment but do not provide relevant data.
• Participants fail to complete diaries or questionnaires.
• Participants cannot be located (lost to follow-up).
• The study investigators decide, usually inappropriately, to cease follow-up.
• Data or records are lost, or are unavailable for other reasons, despite being collected
successfully.
• Some enrolled participants were later found to be ineligible. This may happen if the
random assignment is prior to the point where eligibility can be determined.
1. Monotone missing data: each participant with a missing value has all subsequent measurements missing as well.
2. Non-monotone missing data: a participant may have one missing value but some observed values after it.
Missing data are an important source of bias, unbalancing the groups, decreasing the
sample size, and reducing precision and study power, which could all lead to invalid or
non-significant results.
Dealing with missing data is a challenge with three parts. First, you need to design the clinical trial so as to limit the occurrence of missing data (see also Chapter 7 on recruitment and adherence). Second, you can employ methods during data collection to prevent further losses; the expert panel convened by the NRC likewise emphasized the importance of design and conduct strategies that help minimize missing data. Finally, anticipating that some data will be missing nevertheless, you should plan a way to control for or assess their potential impact on the trial outcome. In the following, we present strategies for addressing this issue at these three stages of the study.
Per-protocol (PP) analysis (also called modified ITT [mITT], which may not be the best term to use) analyzes only subjects who did not deviate from the protocol. For instance, if a subject is included in a trial but the investigator realizes after randomization that the subject does not have the disease being treated, the PP analysis excludes this subject and includes all others. Other types of mITT include analyzing only subjects who received at least a certain amount of the intervention or who had at least one baseline assessment. Regardless of the variant, mITT can introduce bias, as it may end up comparing dissimilar groups.
Intention to Treat
In order to keep the benefit of randomization, we must adhere to it. This means that
all participants who were randomized need to be included in the analysis. We will later
explain several methods used to handle missing data. However, the analysis must also
include participants in the same group they were randomized to, regardless of what
happened to them afterward, even if they never started their allocated treatment or
crossed over to the other group. This approach is called intention to treat (ITT). But if
some participants do not have measurements, how can we use them in our analysis?
You will need to use imputation methods to complete the lost data [3].
ITT analysis may pull results toward the null hypothesis, against a treatment difference, thereby minimizing type I error. It has been criticized as too cautious, since it increases the possibility of type II error, but it also reflects the “real world” of clinical practice by accounting for the non-compliance and protocol deviations that occur with regular patients.
Note that because ITT analysis tends to bias the results toward no difference, it might not be appropriate for equivalence or non-inferiority trials.
You may find it useful to do both PP and ITT analyses. If the results are similar, you can have more confidence in your inference; if they are opposite, you should consider what factors might be biasing the results [4].
Advantages of Using ITT
• The Consolidated Standards of Reporting Trials (CONSORT) guidelines to help
authors improve their reports of RCTs state that the number of participants in each
group should be analyzed by the “intention-to-treat” principle.
• It reflects the reality of clinical practice. In a clinical scenario we will always have a group of patients who do not comply with our instructions. ITT gives results based on the initial random allocation and therefore provides an estimate of the treatment effect. In an RCT, non-compliance may be related to the response to treatment, such as non-response or adverse effects.
• It keeps the sample size necessary for the analysis, maintaining the desired power.
• It helps investigators to become aware of the reasons for non-compliance and
emphasizes the importance of good accountability of all the enrolled patients.
• It minimizes type I error; it is more conservative with results, making them easier to generalize. However, this may also depend on the method utilized.
Disadvantages of Using ITT
• If a non-compliant patient who did not actually receive the treatment is analyzed in the group assigned to receive it, results based on that patient do not really reflect the treatment’s efficacy.
• We usually have a dilution effect from the non-compliant participants, so type II
error increases; it has been said to be too cautious.
• Variance of results will be greater because compliant participants are grouped to-
gether with dropouts and non-compliant ones for the analysis [5].
3. Multiple imputation
4. Maximum likelihood techniques.
Advantages:
Disadvantages:
Disadvantage:
– It might introduce bias, depending on the reason for missing data.
Advantages:
• It is a basic method.
• It has a potential to reduce bias by using all study data to estimate the response of
missing participants.
Disadvantages:
Regression Imputation
This method replaces each missing value with one estimated from a regression model fitted to the non-missing values. The imputed value differs from case to case, based on each participant’s baseline characteristics.
Advantages:
• It is a basic method.
• It has a smaller impact on variance, but still reduces it.
• It has the potential to reduce bias by using all study data to estimate the response of
missing participants.
Disadvantages:
The regression model can include several baseline characteristics in the equation to obtain the most appropriate predicted outcome for each case, given the available data. For example:
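The sketch below illustrates the idea with invented variable names and data: fit a regression on the complete cases, then predict the missing outcomes from each participant's baseline characteristics.

import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "age": [54, 61, 47, 58, 66, 50],
    "baseline_score": [40, 35, 48, 42, 30, 45],
    "outcome": [55, 48, 62, None, 41, None],  # two missing outcome values
})

# Fit the imputation model on participants with observed outcomes.
complete = df.dropna()
model = LinearRegression().fit(complete[["age", "baseline_score"]], complete["outcome"])

# Replace each missing outcome with its predicted value.
missing = df["outcome"].isna()
df.loc[missing, "outcome"] = model.predict(df.loc[missing, ["age", "baseline_score"]])
print(df)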
Advantages:
• It has a potential to reduce bias by using all study data to estimate a response for
missing participants.
Disadvantages:
Advantages:
• It is a basic method.
• It is widely accepted.
• It is the most commonly used method.
• It is accepted and even recommended by the US Food and Drug Administration
(FDA) as a conservative method; however, this method is being less frequently
used.
• It mimics real-life scenarios of non-compliance.
Disadvantages:
• It may lead to biased estimates, which tend toward the null hypothesis or even inflate
results.
• Dropouts may be unbalanced across treatment groups.
• Dropouts may occur early during the intervention, so the last observation may not
really reflect what would have happened if the participant had finished.
• It assumes that patients who drop out would have maintained the same outcome, neither improving nor worsening.
• Time trends in the data, when combined with differential dropout rates between
groups, can introduce severe bias.
• This method also ignores the fact that, even if a participant’s disease state re-
mains constant, measurements of this state are unlikely to stay exactly the same,
introducing a spurious lack of random variability into the analysis [6].
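The method described in this list, last observation carried forward (LOCF), can be sketched in a few lines of pandas (data invented for illustration): within each participant, a missing visit is filled with the most recent observed value.

import pandas as pd

df = pd.DataFrame({
    "id":    [1, 1, 1, 2, 2, 2],
    "visit": [1, 2, 3, 1, 2, 3],
    "score": [10.0, 12.0, None, 8.0, None, None],  # missing after dropout
})

# Carry each participant's last observed score forward.
df["score_locf"] = df.groupby("id")["score"].ffill()
print(df)  # participant 1 carries 12.0 into visit 3; participant 2 carries 8.0 forward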
Advantages:
Disadvantages:
Advantages:
• It is a conservative method.
Disadvantages:
Single imputation methods are easier to perform; they replace missing values using only the known characteristics of the rest of the available data [8]. Various methods, for example regression, calculate predicted values for the missing ones [9]. However, because only the known values are used, the variance of those predicted values is smaller than the true variance, and this shrunken variance may bias other estimates [10].
Multiple Imputation
This is a more complex approach to the analysis of missing data, as it involves multiple calculations. In multiple imputation (MI), each missing value is replaced by a simulated value. This is done several times (3 to 10 times), producing multiple completed data sets by imputation. Each data set is analyzed by standard methods, and the results are combined to produce a single result for inference. This result incorporates the uncertainty due to missing data, having a standard deviation and standard error closer to those that would be obtained with a complete sample.
When we use MI, we incorporate an error term into the prediction equation. This error is drawn randomly from a standard normal distribution. The random error added to the prediction increases the variance, bringing it closer to the real one (i.e., the variance with no missing values). Ultimately, this avoids the bias that comes from having too small a variance.
To carry out MI there are three basic steps: (1) create several completed data sets by imputing each missing value, with random error added to the prediction; (2) analyze each completed data set with standard methods; and (3) combine (pool) the results into a single estimate for inference.
It is important to know that most of the software available to handle missing data using either multiple imputation or maximum likelihood requires the assumption that data are missing at random [9].
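A minimal numeric sketch of the three steps, with invented data and a deliberately simple imputation model:

import numpy as np

rng = np.random.default_rng(42)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, np.nan, 9.8, np.nan])  # two missing outcomes

obs = ~np.isnan(y)
slope, intercept = np.polyfit(x[obs], y[obs], 1)          # imputation model
resid_sd = np.std(y[obs] - (slope * x[obs] + intercept))  # residual spread

estimates = []
for _ in range(10):                      # step 1: create 10 imputed data sets
    y_imp = y.copy()
    pred = slope * x[~obs] + intercept
    y_imp[~obs] = pred + rng.normal(0, resid_sd, size=pred.size)  # add random error
    estimates.append(y_imp.mean())       # step 2: analyze each completed data set

print(np.mean(estimates), np.std(estimates))  # step 3: pool the results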
Advantages:
Disadvantages:
• Is the model used to impute each variable compatible with the model for the final analysis? The analysis may treat a variable as categorical while the imputation model treats it differently, perhaps as quantitative. Compatibility between the two models is called congeniality.
Advantages:
• It can even be used with data that are missing not at random (MNAR), if you have a correct model for the missingness mechanism. It is more efficient than multiple imputation.
• For the same data set, it always gives the same result. In contrast, multiple imputation gives a different result each time it is performed, because of its use of random numbers, so there is always a possibility of reaching different conclusions from the same data.
• It has fewer decisions to make than multiple imputation before performing the
technique. The results will not depend on your decisions.
• It uses a single model, so there will be no incompatibility between the imputation
and analysis model. All variables will be taken into account, as well as the linear or
nonlinear relations between them.
• ML is said to be more efficient than MI because it has smaller standard errors, and for
small samples or large amounts of missing data you would need too many data sets in MI.
Disadvantages:
Sensitivity Analysis
Other recommendations of the NCR expert panel, apart from the preventive strategies
cited earlier, were to recommend a sensitivity analysis, where you would compare
269 Chapter 13. Other Issues in Statistics I
your analysis to an extreme method and see how the results change. This will give you an idea of how robust your findings are, that is, how sensitive they are to the handling of missing data. The analysis measures the impact of different methods of handling missing data on the results, and it helps to justify the choice of the particular method applied. It should be planned and described in the protocol. If the sensitivity analysis shows consistent results and leads to reasonably similar estimates of the treatment effect, then you can say you have robust findings. Also recommended are model-based methods of analysis, or those that use appropriate weighting, as superior to complete-case analysis or single imputation methods such as last observation carried forward, because they require less restrictive assumptions about the missing data mechanism. One more important point is that the issues raised by the panel also apply to observational studies.
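A minimal sketch of such a sensitivity analysis is shown below: the same treatment-effect model is refitted under complete-case analysis, LOCF, and mean imputation, and the estimates are compared side by side. The simulated long-format data set and all variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate a hypothetical long-format trial: one row per patient visit,
# with weight missing (NaN) after dropout.
rng = np.random.default_rng(1)
rows = []
for pid in range(70):
    treat = pid % 2
    drop_after = rng.integers(2, 5)            # last observed visit
    w = rng.normal(95, 10)
    for visit in range(1, 5):
        w += rng.normal(-1.5 if treat else -0.8, 1.0)
        rows.append({"patient": pid, "visit": visit, "treatment": treat,
                     "weight": w if visit <= drop_after else np.nan})
df_long = pd.DataFrame(rows)

def treatment_effect(df):
    """OLS estimate of the treatment effect on the final visit."""
    last = df[df["visit"] == 4].dropna(subset=["weight"])
    fit = smf.ols("weight ~ treatment", data=last).fit()
    return fit.params["treatment"], fit.conf_int().loc["treatment"]

results = {}
results["Complete-case"] = treatment_effect(df_long)

# LOCF: carry each patient's last observed weight forward.
locf = df_long.sort_values(["patient", "visit"]).copy()
locf["weight"] = locf.groupby("patient")["weight"].ffill()
results["LOCF"] = treatment_effect(locf)

# Mean imputation within treatment arm and visit.
meanimp = df_long.copy()
meanimp["weight"] = meanimp.groupby(["treatment", "visit"])["weight"] \
                           .transform(lambda s: s.fillna(s.mean()))
results["Mean imputation"] = treatment_effect(meanimp)

for method, (est, ci) in results.items():
    print(f"{method}: effect = {est:.2f}, 95% CI ({ci[0]:.2f}, {ci[1]:.2f})")
```

If the three estimates are reasonably similar, the findings can be called robust; if they diverge, the choice of method matters and needs to be justified.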
COVARIATE ADJUSTMENT
Covariate adjustment is used because, even with randomization, the groups in your study may be imbalanced on certain characteristics, also called covariates [15]; it can also be used to improve the efficiency of the data analysis.
Randomization does not guarantee perfect balance across treatment arms with re-
spect to one or more baseline covariates, especially in small trials [16].
You can avoid imbalances and plan for adjusted analysis at the design stage of your study. First, by performing a stratified blocked randomization, you can ensure a reasonable balance across treatment groups for some baseline factors known to be strong predictors (see Chapter 6). These stratification factors should be entered into the analysis, unless there are too many of them. You should also pre-specify in the protocol which baseline covariates will be adjusted for and why. This planned approach is preferred to post hoc adjustment, because any unplanned analysis has to be declared exploratory. If not declared as exploratory, it may be considered a "fishing expedition,"
meaning you are playing with data to get the results you want. Multiple analyses may
yield a positive result simply by chance (see Chapters 9, 10, and 14), increasing the
chance of type I error. The FDA and the International Conference on Harmonization
of Technical Requirements for Registration of Pharmaceuticals for Human Use
(ICH) guidelines for clinical reports require that the selection of and adjustment for
any covariates should be an integral part of the planned analysis, and hence should be
set out in the protocol and explained in the reports.
In an unadjusted analysis, the baseline characteristics of participants (covariates)
are not taken into account to assess the outcome. In an adjusted analysis, the covariates
are taken into account, because it is possible that the estimates of treatment effect will
be influenced by these baseline differences between groups.
The final effect on the outcome will depend on how strongly each covariate is related to the outcome and on how imbalanced it is across the treatment arms.
You should also note that if a baseline covariate is strongly correlated with the outcome, there is an advantage in adjusting for it even if it is balanced across the treatment arms [17]. Another effect of this adjustment is an increase in the precision of the estimated treatment effect; this, however, only applies to linear regression models [16,17].
Note that covariate adjustment is not about learning how subgroups respond to treatment; its purpose is to increase power. The reader should understand that the goal of covariate adjustment is to make the statistical analysis more efficient, and that this only happens if the covariates are associated with the response to the treatment being tested.
• In randomized trials, the better therapy may depend on the value of a baseline risk or prognostic factor. You can adjust for this by subdividing the target population into subsets. In a two-arm randomized trial, it is customary to use three subsets: two regions of superiority of the treatment arms (one region for each arm) and a third region of uncertainty. The goal is to detect a treatment-by-prognostic-factor interaction.
For ANOVA
In an analysis where you would run an ANOVA, you should add the covariate as another variable, thereby effectively using ANCOVA.
In STATA if you wanted to adjust for gender, the command would be changed
from anova pain changes treatment to anova pain changes treatment gender. ANCOVA
tests whether certain factors have an effect on the outcome variable.
For Regression
In an analysis where you would run a regression, you should add another variable into
the equation.
In STATA if you wanted to adjust for gender, the command would be changed
from regress pain changes treatment to regress pain changes treatment gender. This allows
you to control for important covariates or potential confounders.
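In Python, the same unadjusted and adjusted analyses could be sketched with statsmodels as below; the data and the variable names (pain_change, treatment, gender) are hypothetical stand-ins for the Stata commands above.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical data set: change in pain score by treatment arm and gender.
df = pd.DataFrame({
    "pain_change": [4.1, 3.5, 2.2, 1.8, 3.9, 2.5, 1.2, 0.9],
    "treatment":   ["active", "active", "placebo", "placebo"] * 2,
    "gender":      ["F", "M", "F", "M", "M", "F", "M", "F"],
})

# Unadjusted analysis (cf. the unadjusted Stata command).
unadjusted = smf.ols("pain_change ~ C(treatment)", data=df).fit()

# Adjusted analysis, i.e. ANCOVA (cf. adding gender in Stata).
adjusted = smf.ols("pain_change ~ C(treatment) + C(gender)", data=df).fit()

print(sm.stats.anova_lm(adjusted, typ=2))          # ANCOVA table
print(adjusted.params["C(treatment)[T.placebo]"])  # adjusted treatment effect
```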
For Categorical Data
This method adjusts for covariates with categorical data by averaging over several strata. It compares two groups on a categorical response while adjusting, or controlling, for important covariates.
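A standard technique matching this description is the Cochran-Mantel-Haenszel approach, which pools 2x2 tables across strata of the covariate; the sketch below uses the statsmodels implementation, with made-up tables.

```python
import numpy as np
from statsmodels.stats.contingency_tables import StratifiedTable

# One hypothetical 2x2 table per stratum of the covariate (e.g., age group);
# rows: treatment/control, columns: response yes/no.
tables = [
    np.array([[20, 10],
              [12, 18]]),   # stratum 1: younger patients
    np.array([[15,  5],
              [ 9, 11]]),   # stratum 2: older patients
]

st = StratifiedTable(tables)
print(st.oddsratio_pooled)   # Mantel-Haenszel pooled odds ratio across strata
print(st.test_null_odds())   # test of association adjusted for the strata
print(st.test_equal_odds())  # test that the odds ratio is homogeneous
```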
So far, we have evaluated the use of covariate adjustment in randomized controlled
trials, because even randomization may not adequately balance groups. But what
happens in observational studies, where there is no randomization to begin with? In
this type of study, the compared groups are usually very different on the covariates. If
you do not adjust for this, your treatment effect will be biased.
Besides the method of addressing confounders using modeling or multivariate
analysis (which is the most common method), another method is called propensity
scores. We will briefly mention this method, as it is a way to help with covariate adjust-
ment in observational studies (for more details, see Chapter 19).
The propensity score (PS) is defined as the conditional probability of being treated given the individual's characteristics, or covariates. This means it will always be a fraction or percentage. The PS summarizes all the covariates used into a single number (it is usually estimated using logistic regression, where the outcome is the treatment and the covariates are the predictors). Once estimated, the propensity score can be used to reduce bias through matching, stratification, regression adjustment, or some combination of the three.
However, you must remember that, unlike randomization, propensity scores will only balance known covariates, not unobserved ones; moreover, only covariates measured before the treatment is given should be included in the propensity score. The value of the propensity score is that it gives us one number with which to compare the treatment and control groups, instead of having to compare several different covariates.
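A minimal sketch of this estimation step follows: treatment is regressed on pre-treatment covariates with logistic regression, and the fitted probabilities become the propensity scores, here stratified into quintiles. The data and covariate names are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical observational data: 'treated' is the exposure and the
# other columns are covariates measured before treatment.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "age": rng.normal(55, 10, n),
    "bmi": rng.normal(28, 4, n),
    "smoker": rng.integers(0, 2, n),
})
logit_p = -8 + 0.10 * df["age"] + 0.08 * df["bmi"] + 0.5 * df["smoker"]
df["treated"] = (rng.random(n) < 1 / (1 + np.exp(-logit_p))).astype(int)

# Propensity score: P(treated | covariates), estimated by logistic
# regression with treatment as the outcome and covariates as predictors.
X = sm.add_constant(df[["age", "bmi", "smoker"]])
ps_model = sm.Logit(df["treated"], X).fit(disp=0)
df["ps"] = ps_model.predict(X)   # a single number in (0, 1) per subject

# The score can then be used for matching, stratification, or adjustment;
# here, stratification into quintiles to check overlap between groups.
df["ps_quintile"] = pd.qcut(df["ps"], 5, labels=False)
print(df.groupby("ps_quintile")["treated"].mean())
```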
Introduction: When the Conventional
Methods of Data Imputation Fail
A few days later, Prof. Strong received an email from Ms. MacGyver. He became
anxious even before reading the message, as the email subject was “final dataset”:
Dear Prof. Strong, the database is complete. We had only 60% of patients finishing the study—42 patients: 22 on the Atkins diet and 20 on the standard diet. I would like to discuss with you the methods of missing data imputation, as I did not find them in the protocol. I think we can assume the data are missing at random. I am in Dallas with my family for the holidays.
My phone number is . . .
After reading Melissa's email, Prof. Strong did not know which statistical approach to choose. He was now sure that he should have decided on the method for handling missing data when he designed the trial, but he had been too optimistic and assumed that only 5% of patients would drop out of the study, so this would not be a critical issue. In this study, CCA seemed very problematic, as they had 40% dropouts, implying a huge loss of study power. Also, it was possible that the patients who dropped out were those who started to regain weight, became unmotivated to continue the diet, and finally left the study. LOCF, on the other hand, would assume that patients who dropped out maintained the same weight, which might be too optimistic. Therefore, he agreed with her that perhaps more sophisticated statistical methods for data imputation could be interesting. Fortunately, Ms. MacGyver was an expert on these methods.
The Trial
The Atkins diet, developed in the 1970s, is one of the most popular types of diet. Several books have been published on it, and millions of people have tried it. However, despite this popularity, very few studies have been performed on the subject, and it is still not clear whether or not the diet is effective. Based on this important question, Prof. Strong designed and ran a one-year, multi-center, randomized
controlled trial in which 70 patients were randomly assigned to receive the Atkins diet
or the conventional diet (low-calorie, high-carbohydrate, and low-fat diet). The main
outcome was weight at one year, and the hypothesis was that the Atkins diet would
induce a larger weight reduction as compared to the conventional diet.
Prof. Strong called Melissa: “Hi, Melissa! How are you? Happy 2009!”
She replied, “Happy 2009, Professor!”
After a few minutes chatting, they realized they would be on the same flight back
to Chicago, as Prof. Strong would make a connection in Dallas. They agreed to meet at the airport so they could discuss the potential method for handling the missing data while flying back to Chicago.
A Brainstorm—and a Thunderstorm
A few days later, Prof. Strong and Ms. MacGyver met while boarding the airplane at
Dallas Airport.
"Hi, Melissa! How are you? Are you prepared for the cold weather? I heard the temperature in Chicago is around zero Fahrenheit, and with the wind chill factor it would feel near –15 F."
She replied, “I heard about that too! But that’s fine for me. I prefer cold to warm.
I grew up in Minnesota! And how about you, Professor? How was it in Miami? And
where is Mrs. Strong?”
He said, “It was fine, thank you! She stayed with the kids; she has family in Fort
Lauderdale.”
They paused their conversation while the plane took off. After the announcement
that it was safe to use portable electronic devices, they did not hesitate and both got
their laptops out and started talking about their trial.
and I would have to use special statistical software to do it. However, the advantage of these tools is their potential to reduce bias, since they use all the study data to estimate responses for subjects with missing data.
"In our example, it is possible that a generalized regression model based on the non-missing variables would give us a prediction equation of this kind. In that case, each missing value would be replaced by a value calculated from the formula. This approach is an advance, as it slightly increases the standard deviation compared with mean substitution; however, even this approach is optimistic, as the SD would still be underestimated because the missing values are still estimated from the non-missing values (in fact, there is no possibility of increasing the SD with this technique).
"Thus, a third approach would be to add random variability to the imputed values. There are modern statistical methods that generate thousands of values, adding error to the estimated value; one of these values is then chosen at random by the statistical software to replace each missing value. However, this approach is not commonly used and might be questioned by reviewers. Moreover, it requires familiarity with statistics and specific training in software that can run these simulation methods."
After seeing all of these methods, Prof. Strong concludes, “Oh, well, it seems that
these methods decrease the standard deviation but might also give better estimates
depending on our assumptions. What else, Melissa?”
on the standard treatment. The advantage of this approach is that if the results are pos-
itive, they can be trusted, since they were obtained under the “worst-case scenario.”
However, this approach cannot be used in studies in which a high number of dropouts
was observed.
"Another technique is baseline carried forward—it assumes that all patients who dropped out, regardless of the treatment received, returned to their baseline levels: in our case, regained their baseline weight. This approach is not commonly used and might underestimate the effects of treatment, as it introduces a bias 'toward the null hypothesis': the same values are imputed for both groups, pulling the two means closer together. However, it may be interesting for this trial, as it is based on the idea that patients who drop out of the study will regain the weight and return to their baseline levels.
“But Professor Strong, there are two more options to consider. Although they are
more complicated, they yield the best results.”
Professor Strong replies, “Even more options and more complicated? I have not
heard of anything else than what we have discussed already!”
“Yes, Professor, although these methods are better, they are not well known. We
can go over the basics now and then you can decide.”
"Ok, Melissa, tell me about these new methods."
“The first is multiple imputation, or MI. With this, each missing value will be
replaced by a simulated value. This will be done several times (3 to 10 times), obtaining
multiple sets of completed data by imputation.”
Prof. Strong interrupts her: "So I would have 3 to 10 sets of data? Which one would I use for the final analysis?"
Melissa continues, “All data sets will be individually analyzed by standard methods,
and the results will be combined to produce a unique result for inference. This result
incorporates missing data uncertainty, having a standard deviation and standard error
closer to the one obtained with a complete sample.”
“So, Melissa, you would like to work 10 times more?” asks the professor.
“Fortunately, there are now better versions of software for this, and they are not
as difficult to find as before. So, although it is more elaborate to do, it is not 10 times
more difficult. But there is still one more method to consider: maximum likelihood, or ML. The principle of ML is to use the available values to find parameter estimates (the measures describing a population) that best fit the observed data. ML does not impute missing values, so your result will not be a complete data set. However, it uses the known characteristics of the individuals to better estimate the unknown parameters of the incomplete variable. To carry out ML you need to define the likelihood function, which quantifies how well the data fit the parameters. But you do not need to know this; you will only need to help me find the most appropriate variables to use. I will take care of the rest."
Prof. Strong once more interrupts her, “There is a lot to consider, Melissa.”
Fortunately, the airplane was soon able to land after a missed approach that scared
many of the passengers. Despite the bumpy ride, they considered the discussion very productive. One important issue they know is that they need to decide on the method before testing it on the data, so as not to increase the type I error. Though sensitivity analysis is a possible option, it is also difficult to make a decision
when discordant results are seen. They planned to meet one day later to finally decide
the best approach to use. There was a light at the end of the tunnel.
CASE DISCUSSION: CHOOSING
THE STATISTICAL TEST
First of all, Prof. Strong should have considered some methods to help avoid loss to follow-up. Also, he did not plan for any method of handling missing data, so this is the first challenge he will face: choosing a method. It is important to know that both
FDA and ICH guidelines say you should consider this in the protocol.
The next important challenge is that he has a big (40%) loss of patients to follow-up; he will need to keep this in mind when deciding on the method of imputation.
The next step in choosing the strategy is to find the mechanism of missing
data: is it MCAR, MAR, or MNAR? Should he just look at the reasons for missingness, or should he use a formal approach? Ms. MacGyver says they can assume the data are MAR, but why?
To assume MAR, the missing data can be related to the observed (independent) variables, but not to the unobserved outcome. So Prof. Strong needs to decide whether the patients who left the study were those who did not lose as much weight, or those who did lose weight (is the missingness related to the outcome?). For this, it would be necessary to know the reasons for dropping out, in addition to the baseline characteristics of the patients. Also, he could consider a formal method to decide among MCAR, MAR, and MNAR.
Only MCAR data allow a CCA analysis. If we assume MAR, then we cannot choose CCA; we should use one of the ITT methods. But if our data are MNAR, then there is only one method we can use: maximum likelihood, which, as noted earlier, is not really an imputation method, as you will not end up with a completed data set. The principle of maximum likelihood is to use the available values to find parameter estimates (the measures describing a population) that best fit the observed data. ML does not impute missing values, so your result will not be a complete data set. However, it uses the known characteristics of the individuals to better estimate the unknown parameters of the incomplete variable.
Now that we know we should use ITT, we need to decide on one method. Single imputation methods are simple, but most are not adequate when there is a large loss of data: they can cause a loss of power or increase the probability of type I error.
Planning in the protocol is very important because it gives you the opportunity to collect all the information needed to make a good decision about the type of missingness and the ITT method. For instance, if you want to use a regression model, you would need to collect the baseline characteristics you think would influence the missingness.
FURTHER READING
Papers
• Altman DG. Adjustment for covariate imbalance. In: Biostatistics in Clinical Trials. Chichester,
UK: John Wiley & Sons, 2001.
• Bernoulli D, Blower S. An attempt at a new analysis of the mortality caused by smallpox and
of the advantages of inoculation to prevent it. Rev Med Virol. 2004; 14: 275–288.
• Donders ART, et al. Review: a gentle introduction to imputation of missing values. J Clin Epidemiol. 2006; 59: 1087–1091.
• Frison L, Pocock SJ. Repeated measures in clinical trials: analysis using mean summary sta-
tistics and its implications for design. Stat Med. 1992; 11: 1685–1704.
• Gupta SK. Intention-to-treat concept: a review. Perspect Clin Res. 2011; 2: 109–112.
• Haukoos JS, Newgard CD. Advanced statistics: missing data in clinical research—part 1: an
introduction and conceptual framework. Acad Emerg Med. 2007 Jul; 14(7): 662–668.
• Hollis S, Campbell F. What is meant by intention to treat analysis? Survey of published
randomised controlled trials. BMJ. 1999; 319: 670.
• Laird NM. Missing data in longitudinal studies. Stat Med. 1988 Jan–Feb; 7(1–2): 305–315.
• Molenberghs G, Thijs H, Jansen I, et al. Analyzing incomplete longitudinal clinical trial data.
Biostatistics. 2004; 5: 445–464.
• Newgard CD, Haukoos JS. Advanced statistics: missing data in clinical research—part 2: mul-
tiple imputation. Acad Emerg Med. 2007 Jul; 14(7): 669–678.
• Pocock SJ, Assmann SE, Enos LE, et al. Subgroup analysis, covariate adjustment and base-
line comparisons in clinical trial reporting: current practice and problems. Stat Med. 2002;
21: 2917–2930.
• Senn SJ. Covariate imbalance and random allocation in clinical trials. Stat Med. 1989;
8: 467–475.
• Ware JH, Harrington D, Hunter DJ, D’Agostino RB. Missing data. N Engl J Med. 2012;
367: 1353–1354.
Online Statistical Tests
http://handbook.cochrane.org/index.htm#chapter_16/16_2_intention_to_treat_issues.htm
http://statpages.org/
Books
• Cochrane Handbook for Systematic Reviews of Interventions. Version 5.1.0 [updated
March 2011]. The Cochrane Collaboration, 2011. Available from www.handbook.
cochrane.org.
• European Agency for the Evaluation of Medical Products (EMEA). Committee for
Propietary Medicinal Products. Points to consider on missing data. Available from: http://
www.ema.europa.eu/docs/en_GB/document_library/Scientific_guideline/2009/09/
WC500003641.pdf
• FDA, Section 5.8 of the International Conference on Harmonization: Guidance on Statistical
Principles for Clinical Trials. Available from: http://www.fda.gov/cber/gdlns/ichclinical.
pdf. Accessed April 19, 2005.
• Little RJA, Rubin DB. Statistical analysis with missing data. New York: Wiley; 1987.
• Ting N. Carry-forward analysis. In: Chow SC, ed. Encyclopedia of biopharmaceutical statistics.
New York: Marcel Dekker; 2000: 103–109.
• Wang D, Bakhai A. Clinical trials: a practical guide to design, analysis and reporting.
England: Remedica Publishing; 2006.
• Weichung JS. Problems in dealing with missing data and informative censoring in clinical trials.
Curr Control Trials Cardiovasc Med. 2002; 3(1): 4. https://doi.org/10.1186/1468-6708-3-4.
REFERENCES
1. Ware JH, Harrington D, Hunter DJ, D’Agostino RB. Missing Data. N Engl J Med. 2012;
367: 1353–1354. http://www.nationalacademies.org/nrc/
2. Little RJA, Rubin DB. Statistical analysis with missing data. New York: John Wiley & Sons; 1987.
3. Hollis S, Campbell F. What is meant by intention to treat analysis? Survey of published randomised controlled trials. BMJ. 1999; 319: 670.
4. Haukoos JS, Newgard CD. Advanced statistics: missing data in clinical research—part 1: an introduction and conceptual framework. Acad Emerg Med. 2007 Jul; 14(7): 662–668.
5. Gupta SK. Intention-to-treat concept: a review. Perspect Clin Res. 2011; 2: 109–112.
6. Molenberghs G, Thijs H, Jansen I, et al. Analyzing incomplete longitudinal clinical trial
data. Biostatistics. 2004; 5: 445–464.
7. Frison L, Pocock SJ. Repeated measures in clinical trials: analysis using mean summary sta-
tistics and its implications for design. Stat Med. 1992; 11: 1685–1704.
8. Ting N. Carry-forward analysis. In: Chow SC, ed. Encyclopedia of biopharmaceutical statis-
tics. New York: Marcel Dekker; 2000: 103–109.
9. Haukoos JS, Newgard CD. Advanced statistics: missing data in clinical research—part 1: an
introduction and conceptual framework. Acad Emerg Med. 2007 Jul; 14(7): 662–668.
10. Donders ART, et al. Review: a gentle introduction to imputation of missing values. J Clin Epidemiol. 2006; 59: 1087–1091.
11. Allison PD. Paper 312-2012. Handling missing data by maximum likelihood. SAS Global
Forum. Haverford, PA: Statistical Horizons; 2012. http://www.statisticalhorizons.com/
wp-content/uploads/MissingDataByML.pdf
12. Allison PD. Missing data. Series: A SAGE University Papers Series on Quantitative
Applications in the Social Sciences, 07-136. Thousand Oaks, CA: Sage; 2001.
13. Laird NM. Missing data in longitudinal studies. Stat Med. 1988 Jan–Feb; 7(1–2): 305–315.
14. Bernoulli D, Blower S. An attempt at a new analysis of the mortality caused by smallpox and of the advantages of inoculation to prevent it. Rev Med Virol. 2004; 14: 275–288.
15. Altman DG. Adjustment for covariate imbalance. Biostatistics in clinical trials. Chichester,
UK: John Wiley & Sons, 2001.
16. Senn SJ. Covariate imbalance and random allocation in clinical trials. Stat Med. 1989;
8: 467–475.
17. Pocock SJ, Assmann SE, Enos LE, et al. Subgroup analysis, covariate adjustment and base-
line comparisons in clinical trial reporting: current practice and problems. Stat Med. 2002;
21: 2917–2930.
14
OTHER ISSUES IN STATISTICS II
SUBGROUP ANALYSIS AND META-ANALYSIS
INTRODUCTION
Clinical studies can provide invaluable information on the effects of a particular treatment on a population of research subjects. However, in the translation process
of applying data from clinical trials to the management of patients, clinicians fre-
quently face a very specific challenge: Among all the various available treatment
options for a disease, what specific therapeutic approach would be best for the in-
dividual patient? And what should be done when a clinician has to interpret con-
flicting data from different clinical studies on a particular subject? These questions
form the basis of this chapter in which subgroup analysis and meta-analysis will be
discussed.
The first section of this chapter will provide a step-by-step discussion of subgroup
analysis and meta-analysis. The second section will invite the reader to critically think
about one case study.
SUBGROUP ANALYSIS
Subgroup analysis is especially concerned with variability and how treatment effects can differ due to specific characteristics of the population (e.g., gender, age, smoking status). For instance, although endovascular coiling was superior in a trial of patients with ruptured intracranial aneurysms, older patients with MCA aneurysms seemed to benefit more from clipping.
It is important to distinguish subgroup analysis from covariate adjustment.
Covariate adjustment aims to decrease variability between groups by adjusting for
possible confounding effects of other variables, thus improving the precision of the
estimated overall treatment effect for the entire study population. Subgroup analysis,
on the other hand, aims to assess the effects of the intervention in specific subgroups,
in which there could be potential heterogeneous effects of a treatment related to dem-
ographics, pathophysiology, risks, response to therapy, potential clinical applications,
and even clinical practice [4].
Subgroup analysis can be the primary objective of a study, or it can be used to generate hypotheses for future studies. Subgroup effects are usually tested as an interaction term, that is, when the effect of one independent variable may depend on the level of another independent variable.
An interaction term can be tested using a regression model, in order to consider
the effects separately, as well as the interaction between them, as can be seen in the
following equation.
Y = β0 + β1(x1) + β2(x2) + β3(x1 × x2) + e
If the variables in the model were gender and age, the equation could be coded as Y = β0 + β1(gender) + β2(age) + β3(gender × age) + e. The interaction term here is β3(x1 × x2) (e.g., gender × age). In the absence of an interaction effect, the fit lines for each variable will be parallel (Figure 14.1A). However, if an interaction effect exists, β3 will be significantly different from 0, and the fit lines can cross (Figure 14.1B) (see Chapter 9 for linear regression).
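As a sketch, this interaction model can be fitted directly with a statsmodels formula, where gender * age expands to the two main effects plus the interaction term; the simulated data and variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data in which the effect of age depends on gender.
rng = np.random.default_rng(7)
n = 200
df = pd.DataFrame({"age": rng.uniform(20, 80, n),
                   "gender": rng.choice(["F", "M"], n)})
slope = np.where(df["gender"] == "F", 0.5, -0.2)   # built-in interaction
df["y"] = 10 + slope * df["age"] + rng.normal(0, 3, n)

# 'gender * age' expands to gender + age + gender:age, i.e.
# Y = b0 + b1*gender + b2*age + b3*(gender x age) + e.
fit = smf.ols("y ~ gender * age", data=df).fit()
print(fit.summary().tables[1])

# If b3 differs significantly from 0, the fit lines cross (Figure 14.1B).
print("interaction p-value:", fit.pvalues["gender[T.M]:age"])
```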
Figure 14.1. A: Parallel lines suggesting no interaction. B: The age and gender fit lines cross, thus suggesting a potential interaction.

There is a special consideration concerning qualitative heterogeneity, in which treatment effects in opposite directions are due to specific group characteristics [e.g., 5]; such opposite-direction effects cannot be directly translated into risks, or even into impact on quality of life.
α_corrected = α / k

For instance, with k = 5 subgroup comparisons and an overall α of 0.05, each comparison would be tested at α_corrected = 0.05/5 = 0.01.
When an unexpected subgroup result appears, one should first ask the following: Were there any differences in the sample size between the treatment and control arms within the subgroup? Could a covariate that is unequally distributed explain
part of the results? Was the statistical analysis the most appropriate? If no issues with the
protocol are exposed, or other confounding variables do not explain the unexpected
results, one should then examine the literature to determine if this finding (or trend)
has been previously observed in other studies. Finally, the finding has to be assessed
for biological plausibility. Nonetheless, there is often the real danger of ascribing a bi-
ological explanation to a spurious statistical finding.
This chapter has so far focused on dealing with heterogeneity across subgroups. Analysis of data from subgroups of patients with certain baseline characteristics is typically insufficient to change general clinical practice. Often the sample size of a subgroup is small and therefore lacks the required statistical power, which imposes serious limitations on the generalizability of the study results. Thus, these analyses will often be exploratory in nature, mainly laying the groundwork for future studies focusing on the particular subgroup.
META-A NALYSIS
Meta-analysis is a method of pooling data from several studies in order to quantify the
overall effect of an intervention or exposure. The advent of evidence-based medicine and the general acknowledgment of meta-analysis as the apex level of evidence have led to increasing interest in this specific type of quantitative systematic review. Meta-analysis can increase the precision of information on a specific topic by addressing the variability between multiple studies and by adjusting for the limitations of individual studies; ultimately, it can contribute to changing clinical practice.
A meta-analysis can be seen as a two-step literature review: qualitative and quantitative. The first critical step is the qualitative assessment of a given topic. Such an assessment usually takes the shape of a compendium of relevant studies in that particular area, exploring the topic thoroughly. Nonetheless, there are several areas (and the clinical area is one of them) in which a quantitative assessment of a phenomenon is critical (e.g., when choosing the most appropriate treatment for a certain disease). This quantification of the magnitude of the effect of the intervention or exposure is what distinguishes meta-analysis from other types of literature review.
Thus, a meta-analysis is an integrative summary of the relevant studies on a particular topic. It analyzes potential differences among studies while increasing the precision of the estimated effects, evaluating effects in subsets of patients, overcoming the limitations of small-sample studies, and analyzing clinical endpoints that require larger sample sizes, ultimately developing hypotheses for future studies [8].
Publication bias arises because positive studies are more likely to be published than negative ones. This will cause a bias toward an overall positive effect. Turner and colleagues analyzed this phenomenon in trials of antidepressant agents and showed that 97% of positive-outcome studies were published, against 12% of negative ones [13]. Similarly, journals are more likely to accept a manuscript that shows an effect than a study that was "unsuccessful." Additionally, in some industry-sponsored trials, the sponsor retains the publication rights and will not be very sympathetic toward publishing negative results. There can also be language bias (e.g., only articles published in English), multiple publication bias (i.e., the same study published more than once), or citation bias (i.e., positive studies are more likely to be cited).
In order to minimize potential publication bias, efforts to include the maximum
number of studies should be made, even the ones that were not published. This can be
especially tricky if there is no way of obtaining information about them. That is why
in recent years there has been an effort to have all clinical trials included in a registry,
prior to the enrollment of participants.
Study Selection
Selecting studies is not that different from selecting a study population: in both scenarios, the inclusion and exclusion criteria need to be clearly defined. For studies, the objectives, study population, sample size, study design (randomized vs. non-randomized), choice of treatment, criteria for enrollment of patients and controls, endpoints, length of follow-up, analysis and quality of the data, and, finally, a quality assessment of the study are used to define the inclusion and exclusion criteria.
Assessment of the quality of a study is essential when assessing the quality of the
evidence and even more so when performing a meta-analysis [14], despite the fact
that it is not hazard free [15]. Moher et al. [16] showed that incorporating the quality
assessment in a meta-analysis drastically changes the results, which can improve our
understanding of the “true results.” However, as mentioned before, this approach is
not hazard free, and thus, relevant methodological issues should be assessed individu-
ally, in order to allow a better understanding of their “real” contribution to the overall
outcome [15].
To overcome this issue, several study scoring systems have been developed, allowing relevant studies to be stratified according to pre-specified criteria and enforcing the studies' "coherency" within each stratum. Comparisons regarding randomization, blinding, patient selection, sample size, type of analysis, and outcomes, among others, can then be performed.
The critical step when performing a literature search and study selection is to develop a process that documents the decision tree and is reproducible. The keywords are part of this process, as is the flow diagram. The flow diagram can be built upon several "yes or no" questions that determine whether a study is included (see Figure 14.2). For instance, if a retrieved manuscript is a literature review or an observational study, and the question is "Is this study an RCT?," the answer will be no, and that study will be dropped from further analysis. Usually at this stage only abstracts are analyzed; full manuscripts are analyzed only in the last stages, when study quality is to be assessed.
Figure 14.2. Flow diagram (Y = yes; N = no). From 200 potentially relevant studies, only 20 were included in the primary analysis; exclusions included inaccessible outcomes (40), reviews (5), failure to meet the criteria for an RCT (15), recruitment of participants with unstable medical conditions (15), and low quality scores (25).
Data Synthesis
A critical issue when conducting or interpreting a meta-analysis is to make the realities across studies comparable. Comparing "apples and oranges" should ideally be avoided, and the chosen outcomes need to be comparable in order to be pooled into an overall effect. In some situations, the outcomes across studies are the same, and the comparison between them is straightforward. But sometimes the metrics being used differ, and a common metric is required—this is called standardization of outcomes.
One way to do this is to transform the data into the same type of outcome (e.g., death), or to combine scores into a Z score. The Z score is calculated from the individual score and expresses how many standard deviations that individual score deviates from the mean:

Z = (x − μ) / σ
Imagine that the individual score on a self-report questionnaire for depression is 36 (out of a maximum of 100 points), with a population mean of 24 and a standard deviation of 12. Using the previous equation, the Z score can be calculated as

Z = (36 − 24) / 12 = 1
Now imagine that you have a second questionnaire for depression (maximum of 50), in which the score is 18 (population mean of 12 and standard deviation of 4.5). The new Z score can be calculated as

Z = (18 − 12) / 4.5 = 1.33
As Z scores are unit free, they can be combined into a global score for that individual, or used as a comparison score across different metrics. While in the preceding example it would not be possible to compare the two questionnaires directly (because of the different scales), the Z value represents how many standard deviations that particular individual deviates from the population mean, and thus allows a direct comparison. For the first questionnaire, the subject deviated 1 SD from the mean (Z = 1), and for the second, the same subject deviated 1.33 SD from the mean (Z = 1.33), which now allows a direct comparison between the two scores. It is also possible to attribute different weights to these global scores based upon the strength of association with an endpoint (e.g., ORs < 1 = 0; 1 ≤ ORs ≤ 1.4 = 1; ORs > 1.4 = 2).
Figure 14.3. Forest plots of the combined effect size (d) with 95% CIs: A: fixed effects model; B: random effects model.
overall score, later in the chapter) along with the 95% confidence interval. The use of effect sizes has increased in recent years, especially after the harsh criticism surrounding p-values.
There are several methods for estimating effect sizes, depending on the nature of the variable. One of the best-known and most used methods for continuous outcomes is Cohen's d (also called the standardized mean difference). This score is easy to calculate: take the mean difference between populations (μ1 and μ2) and divide by the pooled standard deviation.
d = (μ1 − μ2) / σ
The magnitude of the effect is conventionally considered small (0.2), medium (0.5), or large (0.8). Be aware that these cut-off points are arbitrary; the quality of the study and the uncertainty of the estimate need to be carefully addressed before using them.
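The formula translates directly into code; the sketch below computes Cohen's d with the usual pooled standard deviation, on made-up data.

```python
import numpy as np

def cohens_d(x1, x2):
    """Standardized mean difference: (mean1 - mean2) / pooled SD."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    n1, n2 = len(x1), len(x2)
    pooled_var = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) \
        / (n1 + n2 - 2)
    return (x1.mean() - x2.mean()) / np.sqrt(pooled_var)

# Hypothetical groups; a d near 0.5 would be a 'medium' effect.
treated = [12.1, 14.3, 11.8, 15.0, 13.2, 12.7]
control = [10.9, 12.0, 11.1, 13.4, 10.5, 11.6]
print(f"d = {cohens_d(treated, control):.2f}")
```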
Meta-analyses can be performed using several types of outcomes. For dichotomous outcomes, risk ratios, odds ratios, or even risk differences can be used. Ordinal outcomes can be analyzed with the same strategy as dichotomous ones if they are summarized as dichotomous categories, or with continuous methods if the outcome is summarized using means or standardized mean differences.
In the fixed effects model, it is assumed that the effects of the studies included in the analysis share similar magnitudes and directions. In the random effects model, it is assumed that the true effect size can vary from study to study, but that the effects follow a given distribution. Fixed effects models should therefore be used preferentially when heterogeneity is low; if heterogeneity is high, random effects models should be used.
In detail, a fixed effects model assumes that all the studies share a common pooled effect size, with mean μ and variance σ². In this case there is only one source of error in the estimates: the random error within studies. If the sample size is large enough, this error tends toward zero. Because the random effects model assumes a distribution of true effects, it has two sources of potential error: as it tries to assess the "real" effect for the specific population, it is prone to the same within-study error that appears in the fixed effects model; but because it must also weight the mean across studies, it is additionally prone to random error between studies.
Figure 14.3 shows a fixed and a random effects model. Please note that the pooled effect size is slightly higher in the fixed effects model (Figure 14.3A), and that the confidence intervals are wider in the random effects model (Figure 14.3B).
Another important issue when conducting a meta-analysis is sample size. The sample size of a study can influence the results of a trial: a smaller study could lack the power (type II error) to detect an effect, or it could overinflate it. Rather than comparing small and large studies 1:1, the sample size is incorporated into the calculation. This means that in meta-analysis, rather than using a simple mean, we often calculate a weighted mean of the effect size, in which larger studies carry weight in proportion to their sample size. In other words, if the pooled sample size from a given number of studies is, for instance, 10,000, and one large RCT enrolled 3,000 patients, that study will carry 30% of the weight in the meta-analysis. This can introduce bias into the model, because larger studies will have more weight than smaller ones, which is why a stratification by sample size is sometimes used to reduce bias.
It is important to note that weighting studies by sample size or choosing random effects models in the presence of high heterogeneity does not address the source of the heterogeneity. As heterogeneity among studies is "inevitable" in meta-analysis, its quantification has been strongly recommended [12].
Quantifying Heterogeneity
Since there are clinical and methodological differences across studies, heterogeneity has been suggested to be inevitable [19]. Thus, instead of choosing models based on heterogeneity, one can test for the variability in the effect estimates that is due to heterogeneity rather than chance [12]:

I² = [(Q − df) / Q] × 100

in which Q is the chi-square statistic and df its degrees of freedom. Although there are no strict rules of thumb, and each analysis should be thoroughly assessed for the magnitude and direction of the effects, the convention is that a score below 40% does not represent important heterogeneity. Moderate heterogeneity may be present when the I² is between 30% and 60%, values between 50% and 90% may indicate substantial heterogeneity, and values from 75% to 100% indicate considerable heterogeneity among the studies included in the meta-analysis [12].
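The following sketch pulls these pieces together: it computes the inverse-variance fixed effects estimate, Cochran's Q, the I² statistic from the formula above, and a random effects estimate using the DerSimonian-Laird estimate of the between-study variance (one common choice, not necessarily the one used in any particular meta-analysis). The study effect sizes and variances are hypothetical.

```python
import numpy as np

# Hypothetical per-study effect sizes (d) and their variances.
d = np.array([0.55, 0.40, 0.72, 0.30, 0.65])
v = np.array([0.02, 0.05, 0.08, 0.03, 0.10])

# Fixed effects: inverse-variance weighted mean (larger studies weigh more).
w = 1 / v
d_fixed = np.sum(w * d) / np.sum(w)

# Cochran's Q and I^2 = (Q - df) / Q * 100 (clamped at 0).
Q = np.sum(w * (d - d_fixed) ** 2)
df_ = len(d) - 1
I2 = max(0.0, (Q - df_) / Q) * 100

# DerSimonian-Laird between-study variance tau^2, then random effects.
tau2 = max(0.0, (Q - df_) / (np.sum(w) - np.sum(w ** 2) / np.sum(w)))
w_re = 1 / (v + tau2)
d_random = np.sum(w_re * d) / np.sum(w_re)
se_random = np.sqrt(1 / np.sum(w_re))

print(f"fixed d = {d_fixed:.2f}, Q = {Q:.2f}, I^2 = {I2:.0f}%")
print(f"random d = {d_random:.2f} "
      f"(95% CI {d_random - 1.96 * se_random:.2f} "
      f"to {d_random + 1.96 * se_random:.2f})")
```

Note how the random effects CI is wider than the fixed effects CI whenever tau² is greater than zero, matching the behavior described for Figure 14.3.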
Sensitivity Analysis
An additional method of exploring heterogeneity, or if a study is of doubtful interest to
be included in a meta-analysis, is to conduct a sensitivity analysis. In this analysis the
pooled effect size is systematically plotted with or without a given study, along with
the confidence interval (CI). This will provide a more accurate estimate of the effects
of a single study on the overall results.
The CI is as important as the pooled effect size estimate for sensitivity analysis. The
CI is calculated based on the sample mean (μ) and standard error (SE).
When the CIs from individual studies overlap, the effect sizes are similar and heterogeneity is low. In some cases a line can be drawn at "no effect," with scores on the left favoring control and scores on the right favoring treatment. If some results lie on opposite sides of the "no effect" line, the results are considered inconsistent, and heterogeneity is therefore high. Even in the absence of results on opposite sides, the random effects model usually has wider CIs than the fixed effects model, owing to the specific assumptions underlying each model (see Figure 14.4).
In order to conduct a sensitivity analysis, the forest plot can be used to calculate the pooled effect estimate along with the pooled confidence interval. This allows further testing of heterogeneity: each study is removed sequentially, and the impact of its removal on the pooled effect estimate is assessed.
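The leave-one-out computation behind a plot like Figure 14.4 can be sketched as follows, reusing the hypothetical effect sizes and variances from the previous sketch.

```python
import numpy as np

d = np.array([0.55, 0.40, 0.72, 0.30, 0.65])  # hypothetical effect sizes
v = np.array([0.02, 0.05, 0.08, 0.03, 0.10])  # and their variances

def pooled(d, v):
    """Inverse-variance pooled estimate with its 95% CI."""
    w = 1 / v
    est = np.sum(w * d) / np.sum(w)
    se = np.sqrt(1 / np.sum(w))
    return est, est - 1.96 * se, est + 1.96 * se

print("all studies: %.2f (%.2f, %.2f)" % pooled(d, v))
for i in range(len(d)):
    keep = np.arange(len(d)) != i            # omit study i
    est, lo, hi = pooled(d[keep], v[keep])
    print(f"omitting study {i + 1}: {est:.2f} ({lo:.2f}, {hi:.2f})")
```

If the CIs of all leave-one-out estimates overlap with the overall pooled CI, no single study is driving the result.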
Figure 14.4. Sensitivity analysis: each horizontal score represents the mean (with the 95% CI) when that particular study is removed. The vertical bar at 0.66 represents the overall pooled effect (i.e., all studies), with 0.42 and 0.90 representing the 95% CI. As can be seen, omitting one study at a time does not seem to change the effect, as the 95% CIs overlap.
Figure 14.5. Funnel plot: A: symmetrical funnel plot; B: asymmetrical funnel plot. The y axis represents the effect size and the x axis the standard error. Larger studies have smaller standard errors and are therefore distributed toward the left of the x axis, while smaller studies are progressively distributed toward the right. Positive studies lie above the horizontal line (i.e., larger effect sizes), and negative studies lie below it (i.e., smaller effects).
1
Professor Felipe Fregni prepared this case. Course cases are developed solely as the basis for class
discussion. The situation in this case is fictional. Cases are not intended to serve as endorsements or
sources of primary data. All rights reserved to the author of this case. Reproduction and distribution
without permission are not allowed.
2. Piantadosi S. Clinical trials: a methodologic perspective. 2nd ed. New York: John Wiley & Sons; 2005.
Prof. Turner and his fellow had collected preliminary information on the trials of exercises following total knee arthroplasty and are ready to discuss it with Dr. Hamilton.
Dr. Hamilton arrived, and after leaving his bags at the hotel near the university—a
beautiful residential area in the heart of Sydney—he is ready for the challenge. He
goes to the university to meet Prof. Turner and his fellow. They decide then to take the
initial meeting to a nice restaurant in the harbor near the opera house. It is a beautiful
and warm evening in Australia.
“So, John, let us not waste too much time and go directly to the point as we have
a tight schedule ahead of us. I understand that the research question is decided: to
determine whether physiotherapy after discharge following a total knee arthroplasty
is effective. Well, now the work starts! First, we need to decide the retrieval method.
As you know we can—and should!—select a bunch of electronic databases. I would suggest MEDLINE, EMBASE, and the Cochrane Controlled Trials Registry. However, this
is not enough; we also need to look at the reference lists of the retrieved papers and
also check the abstract lists of conferences on the topic. But this still cannot address
the issue of publication bias—in other words, negative trials (especially small trials)
are often not published, and this might severely bias the results of a given meta-
analysis. To address this issue, alternatives are contacting experts in the field, writing
to authors of previous trials and asking them for any unpublished data and, in the
case of drug trials, even contacting the drug company for unpublished internal trials.
Besides that, it is also possible to run some tests to assess publication bias—such as
the funnel plot. So my first question to you: based on your resources and timeline, do
you want to search only on the electronic databases or do you want to perform an ex-
tensive search?”
But even before Prof. Turner has a chance to talk, Dr. Hamilton continues, “The
second important issue is the eligibility criteria for the inclusion of studies. This is an
important step, John. As you also know, it is important to be as inclusive as possible
but the risk is that by including small studies with low quality you might bias your
analysis.”
“What do you think, Cara?” Prof. Turner quickly turned the conversation to her
to get her more involved. Cara is still used to the hierarchical Chinese system and
hesitates before speaking, but as she was asked, she then starts, “Thank you very
much, Professor, for the honor of participating in this study. In the quick research
I performed, I see that a good option is to include trials that investigated physio-
therapy intervention compared with usual or standard care or compared two different
types of relevant physiotherapy intervention. The usual therapy consists of isometric
or simple strengthening exercises to regain range of movement, and stretches.” It is
Prof. Turner’s turn again, “Thank you, Cara. I think the challenge here is whether we
will also include the open label trials and whether we will include studies that have
different control groups.”
“Well—” Dr. Hamilton intervenes, “this is an important issue in meta-analysis: lim-
iting the inclusion criteria to have a more homogeneous group versus broadening the
inclusion criteria to have more data and a greater generalizability. Both options are
correct and have their advantages and disadvantages. Another option is to include
studies based on assessment of study quality, but although this is usually done in meta-
analyses, I think there are some problems with it, as it is difficult to assess study quality,
and excluding studies based on methodological issues might also bias the results of
301 Chapter 14. Other Issues in Statistics II
the meta-analysis.” They stopped as the food just arrived and Prof. Turner decided to
take a break, “Let us have our dinner and we can restart our discussions tomorrow.” As
Dr. Hamilton was also tired, he quickly agreed with this suggestion.
accomplished this initial step, since he believes that these results will change the use of
physiotherapy after knee surgery.
CASE DISCUSSION
Prof. Turner plans to perform a meta-analysis because “patients who undergo knee
arthroplasty may still experience considerable functional impairment postoperatively,
the effectiveness of physiotherapy after discharge is an important and valid question.”
After the initial stage of the formulation of the research question, Prof. Turner and
his team face the first challenge: how to identify and select relevant studies in order
to minimize the possibility of bias. This can be achieved by several methods that are
discussed, including the inclusion and exclusion criteria, and by assessing the quality
of the studies. After all these points are established, it is important to define the outcomes and how the variables will be defined, as this has a clear impact on the possible statistical analyses. Finally, the best strategies for disseminating the results and conclusions throughout the scientific and clinical community, and thus possibly changing practice, must be decided.
ONLINE RESOURCES
For more information about trial registry, go to http://www.clinicaltrials.gov/. For
a database of systematic literature review, consult the Cochrane Collaboration at
http://www.cochrane.org/cochrane-reviews. Also for the PRISMA guidelines and
supporting material, consult http://www.prisma-statement.org/.
REFERENCES
1. Molyneux AJ, Kerr RS, Yu LM, Clarke M, Sneade M, Yarnold JA, et al. International sub-
arachnoid aneurysm trial (ISAT) of neurosurgical clipping versus endovascular coiling in
2143 patients with ruptured intracranial aneurysms: a randomised comparison of effects
on survival, dependency, seizures, rebleeding, subgroups, and aneurysm occlusion. Lancet.
2005; 366(9488): 809–817.
2. Ryttlefors M, Enblad P, Kerr RS, Molyneux AJ. International subarachnoid aneurysm trial
of neurosurgical clipping versus endovascular coiling: subgroup analysis of 278 elderly
patients. Stroke. 2008; 39(10): 2720–2726.
303 Chapter 14. Other Issues in Statistics II
3. Wang R, Lagakos SW, Ware JH, Hunter DJ, Drazen JM. Statistics in medicine: reporting of
subgroup analyses in clinical trials. N Engl J Med. 2007; 357(21): 2189–2194.
4. Rothwell PM. Subgroup analysis in randomised controlled trials: importance, indications,
and interpretation. Lancet. 2005; 365(9454): 176–186.
5. Collins R, MacMahon S. Reliable assessment of the effects of treatment on mortality and
major morbidity, I: clinical trials. Lancet. 2001; 357(9253): 373–380.
6. Gopalan R, Berry DA. Bayesian multiple comparisons using Dirichlet process priors. J Am
Stat Assoc. 1998; 93(443): 1130–1139.
7. Jackson RD, LaCroix AZ, Gass M, Wallace RB, Robbins J, Lewis CE, et al. Calcium plus
vitamin D supplementation and the risk of fractures. N Engl J Med. 2006; 354(7): 669–683.
8. Walker E, Hernandez AV, Kattan MW. Meta-analysis: Its strengths and limitations. Cleve
Clin J Med. 2008; 75(6): 431–439.
9. LeLorier J, Grégoire G, Benhaddad A, Lapierre J, Derderian F. Discrepancies between
meta-analyses and subsequent large randomized, controlled trials. N Engl J Med. 1997;
337(8): 536–542.
10. Ioannidis J, Cappelleri J, Lau J. Meta-analyses and large randomized, controlled trials. N
Engl J Med. 1998; 338(1): 59–62.
11. Dickersin K, Scherer R, Lefebvre C. Systematic reviews: identifying relevant studies for
systematic reviews. BMJ. 1994; 309(6964): 1286–1291.
12. Higgins JPT, Green S, eds. Cochrane handbook for systematic reviews of interventions. 1st ed.
Chichester, UK: Wiley-Blackwell; 2008.
13. Turner EH, Matthews AM, Linardatos E, Tell RA, Rosenthal R. Selective publica-
tion of antidepressant trials and its influence on apparent efficacy. N Engl J Med. 2008;
358(3): 252–260.
14. Pogue J, Yusuf S. Overcoming the limitations of current meta-analysis of randomised
controlled trials. Lancet. 1998; 351(9095): 47–52.
15. Jüni P, Witschi A, Bloch R, Egger M. The hazards of scoring the quality of clinical trials for
meta-analysis. JAMA. 1999; 282(11): 1054–1060.
16. Moher D, Pham B, Jones A, Cook DJ, Jadad AR, Moher M, et al. Does quality of reports
of randomised trials affect estimates of intervention efficacy reported in meta-analyses?
Lancet. 1998; 352(9128): 609–613.
17. Moher D, Liberati A, Tetzlaff J, Altman DG. Preferred reporting items for systematic
reviews and meta-analyses: the PRISMA statement. BMJ. 2009; 339: b2535.
18. Field A. Discovering statistics using SPSS. London: Sage Publications; 2009.
19. Higgins JP, Thompson SG, Deeks JJ, Altman DG. Measuring inconsistency in meta-
analyses. BMJ. 2003; 327(7414): 557–560.
UNIT III
Practical Aspects
of Clinical Research
15
NON-INFERIORITY DESIGN
INTRODUCTION
Clinical trials are the gold standard of study designs to implement new therapies into
practice. The placebo-controlled trial is considered by many to be the ideal model to
demonstrate and estimate the effect of a new intervention. For many conditions, how-
ever, it is not ethically adequate to use placebo, so the standard of care is therefore
used as an active control over which the superiority of a new drug is shown. However,
superiority of effect is not the only purpose of a new treatment. More and more, lower
costs, less side effects, and lighter dosing regimens (with possible better adherence)
are also important goals to be sought, as long it is possible to preserve the effect at a
reasonable magnitude. A non-inferiority trial is the ideal design to address this issue. It
is possible to show that a new intervention (with presumable other advantages) is not
inferior in effect to the standard of care, which has been previously proven effective
over placebo [1].
This chapter will begin with the main aspects of three basic study designs: superi-
ority, equivalence and non-inferiority trials. The main guiding principles of the latter
will be reviewed.
SUPERIORITY TRIALS
The aim of this design is to demonstrate the superiority of a new intervention over placebo or the current standard of care. As demonstrated in previous chapters, the classical null hypothesis of this design states that there is no difference between treatments, while the alternative hypothesis states that the difference is statistically different from zero; that is, not only the point estimate but the entire 95% confidence interval of the difference must exclude zero (see Chapter 9) (see Figure 15.1, Table 15.1).
For sample size estimation, the predicted difference in effect and the estimated
variance are important factors to consider (see Chapter 11). When analyzing results,
the more conservative and usually preferable strategy is intention to treat (ITT), because including protocol violators and withdrawals in the analysis makes the treatments look more similar, making it more difficult to reject H0. Per-protocol analysis, on the other hand, tends to increase the estimate of effect, leading to an increased type I error [2].
Figure 15.1. Possible positions of the point estimate and 95% CI of the treatment difference relative to 0, −M, and +M, and the corresponding conclusions: superior vs. not superior (superiority trial); equivalent vs. not equivalent (equivalence trial); and non-inferior vs. not non-inferior, i.e., non-inferiority not shown (non-inferiority trial). H0 is rejected only when the CI lies in the corresponding region.
Table 15.1. Hypotheses and analysis in superiority, equivalence, and non-inferiority trials

| | Superiority | Equivalence | Non-inferiority |
| Desired conclusion | New treatment is superior to old treatment (or placebo) | Both treatments are the same or not unacceptably different | New treatment not unacceptably worse than old treatment |
| H0 | Difference = 0 | Difference ≠ 0 (difference < −M or difference > +M) | Difference ≤ −M (or ≥ +M) |
| HA | Difference ≠ 0 | Difference = 0 (−M ≤ difference ≤ +M) | Difference > −M (or < +M) |
| 95% CI of the difference in effect | CI must not include 0 (or 1, in case of ratios) | The entire CI must lie between −M and +M | Lower limit must lie above −M (the position of the upper limit is not of interest) |
| Sample size | + (depends on predicted difference) | ++++ (depends on M) | +++ (depends on M) |
| Type of analysis | ITT preferable | Both ITT and PP should be performed | Both ITT and PP should be performed |
EQUIVALENCE TRIALS
The aim of this design is to show that a new intervention is equivalent to another. It is a useful design to demonstrate, for instance, that a generic and an original drug are equally effective, or that a new formulation of a compound does not change the efficacy of the original one. It may also be desirable to show that the new treatment is not superior, because superiority might come with increased toxicity. In this setting, however, the null and alternative hypotheses are the opposite of those usually considered in superiority trials. While the null hypothesis states that the two treatments differ in effect, the alternative hypothesis states that there is no difference in effect between the treatments (difference = 0). Because a difference of exactly zero is virtually impossible due to the variance of the estimates, an equivalence margin (M) is established both below and above zero, within which the point estimate and confidence interval of the effect of the new intervention relative to the standard must lie (Figure 15.1, Table 15.1). Establishing an adequate M is not a simple task. It should ideally be smaller than any value considered a clinically meaningful difference. However, a very tight M would imply an unfeasibly large sample size, because in equivalence trials the margin of equivalence is the main factor in sample size estimation. Equivalence studies are therefore often very large trials.
When analyzing equivalence trials, ITT is not considered a conservative approach
as in superiority trials, because making treatments look more similar will, in this setting,
increase type I error [3]. Per-protocol analysis, although less conservative, may result
in wider confidence intervals, because it is based on fewer study participants. For the
previously mentioned reasons, both ITT and per-protocol analyses should be carried out and, ideally, equivalence should be demonstrated in both analyses.
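The dependence of the sample size on the margin can be illustrated with a standard normal-approximation formula for two parallel groups. The sketch below is not taken from this chapter; it assumes a continuous outcome, a true difference of zero, and a common standard deviation sigma:

from scipy.stats import norm

def n_per_group_equivalence(sigma, margin, alpha=0.05, power=0.80):
    """Approximate per-group sample size for an equivalence trial
    (two one-sided tests at level alpha, true difference = 0)."""
    z_alpha = norm.ppf(1 - alpha)           # one-sided alpha per test
    z_beta = norm.ppf(1 - (1 - power) / 2)  # beta is split across both tests
    return 2 * (sigma * (z_alpha + z_beta) / margin) ** 2

for m in (1.0, 0.5):
    print(m, round(n_per_group_equivalence(sigma=2.0, margin=m)))
# 1.0 -> ~69 per group; 0.5 -> ~274 per group

Because the margin enters the formula as 1/M², halving the equivalence margin quadruples the required sample size, which is why very tight margins quickly become unfeasible.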
NON-INFERIORITY TRIALS
In this trial design, the main purpose is to demonstrate that a new intervention is at
least as good as or not worse than the control treatment. Similarly to equivalence trials,
other advantages of the new treatment may justify its use, even if it is necessary to
sacrifice a small (and clinically non-significant) fraction of the effect of the standard
treatment. This fraction is termed the non-inferiority margin (M); in other words, it specifies by how much the active control may exceed the new treatment with the new treatment still being considered non-inferior to the active control [4].
In this study design, the null hypothesis states that the difference between the
treatments is larger than the non-inferiority margin (i.e., the point estimate and 95%
CI of the difference between control and treatment is less than –M, or more than
M) (see Figure 15.1, Table 15.1). The alternative hypothesis is that the difference
between treatments is smaller than the non-inferiority margin or, in other words, the
point estimate and 95% CI of the difference between control and treatment is more
than –M (or less than M). Note that the new treatment can even be superior to the
control, but this is not the primary concern of this study design. However, this par-
ticular feature of this design may lead to doubts when interpreting trial results. The goals of a non-inferiority trial are therefore twofold:
1. Directly demonstrate that the new intervention is not inferior to the active control.
2. Indirectly demonstrate that the new intervention is superior to placebo.
Figure 15.2. Possible results of a non-inferiority trial, shown as confidence intervals of the treatment difference relative to –M and 0.
To properly reach these goals, several assumptions must hold and several strategies must be adopted, as will be further explained in this chapter [6].
Assay Sensitivity
This refers to the ability of the trial to distinguish differences between treatments if they actually exist. A lack of assay sensitivity may lead to false conclusions in two ways:
– If there are clinically meaningful differences between the new treatment and the active control, but the trial is unable to detect them, the result will be a false claim of non-inferiority.
– If the differences between the active control and placebo are no longer detectable
by this trial, the new intervention may be claimed non-inferior when, in fact, it is
ineffective if compared to placebo.
Constancy Assumption
An important premise of non-inferiority trials is that the active control was superior to
placebo in previous trials. However, this superiority may have changed over time due
to several reasons, like resistance development (in case of antibiotics), better medical
practice, and so on. It is important to ensure that the active control would also be su-
perior to placebo in the setting of the new trial, because even if showing that the new
treatment is not inferior to the active control, this may just mean that both are equally
uneffective.
Again, a three-arm design including placebo would address this question. If it is not
possible due to ethical concerns, a way of minimizing this issue would be conducting
the non-inferiority trial using a method as similar as possible to the one used for the
trials that established the superiority of the active control over placebo. Inclusion/
exclusion criteria, blinding and randomization strategies, the treatment scheme, and
measuring methods are important parameters to consider [8].
Choice of the Non-Inferiority Margin
Two main principles guide the choice of the margin:
1. The margin cannot be larger than the entire effect of the active control over placebo, generally referred to as M1. Historical data on the effect of the active control over placebo are used to estimate M1.
Figure 15.3. The non-inferiority margins M1 (the entire effect of the active control over placebo) and M2 (the largest clinically acceptable loss of effect) relative to 0.
Here, it is essential to guarantee that the effect of the active control over placebo remains the same; otherwise, if M1 is used as the non-inferiority margin while the control's effect over placebo in the current trial is actually smaller, the new drug could be claimed non-inferior when it is possibly ineffective. Showing that the new treatment is non-inferior within a margin of M1 only tells us that its effect is greater than zero. However, it does not assure that the drug has a clinically meaningful effect.
2. It is usual and desirable to choose a smaller value than M1, in order to preserve
some estimate of the effect of the active control. This other estimation of the
margin (M2) represents the largest loss of effect of the active control that would
be acceptable (i.e., not clinically meaningful). If the lower bound of the 95% CI
for the effect of the new drug over the active control is above –M2, non-inferiority
is demonstrated (Figure 15.3). Some strategies for estimation of M2 are the
following:
Clinical judgment: reflects how much of the effect of the active control over pla-
cebo should be kept by the test drug in order to be considered non-inferior.
Investigators may ask a panel of experts—or even patients—how much effect they would be willing to sacrifice in order to get other potential benefits. In a regulatory setting, however, this choice may be subject to criticism.
Fixed margin approach: Set the non-inferiority margin (M2) to a proportion of M1.
So, if investigators concluded that it would be necessary for the new drug to
preserve 80% of the effect of the active control, then M2 should be set at 20% of
M1. However, one must keep in mind that this method does not consider that
M1 is a point estimate of the effect of the active control over placebo, which is
subject to uncertainty.
95%–95% method: Set the non-inferiority margin (–M2) to the upper bound of the
95% CI of the estimation of the effect of active control over placebo. Although
this method addresses the issue of variability, it is usually very stringent.
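As a numerical illustration of the last two strategies (illustrative Python with hypothetical numbers; the effect of the active control over placebo is expressed here as an absolute risk difference estimated from historical trials):

# Hypothetical historical effect of the active control over placebo.
m1_point = 0.10     # point estimate of the effect (M1)
m1_ci_low = 0.03    # 95% CI bound closest to "no effect"

# Fixed margin approach: set M2 at a fraction of M1. If the new drug
# must preserve 50% of the control's effect, M2 is half of M1.
m2_fixed = 0.50 * m1_point    # 0.050

# 95%-95% method: anchor the margin at the CI bound closest to no
# effect, which addresses the uncertainty in M1 but is more stringent.
m2_95_95 = m1_ci_low          # 0.030

# Non-inferiority: the lower bound of the 95% CI for the difference
# (new drug minus active control) must lie above -M2.
print(m2_fixed, m2_95_95)

These are the two margins sketched in Figure 15.4 (–M2* and –M2**, respectively).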
Biocreep
Consider that for a given disease there is a standard treatment (A), previously proven
effective against placebo, which is used as an active control in a non-inferiority trial for
a drug B, with fewer side effects. After the non-inferiority is shown, drug B becomes
the standard of care for some time, until a new drug C is developed. Investigators be-
lieve drug C has a lighter dosing regimen and want to test it against the standard of
care (drug B) in a non-inferiority trial. Non-inferiority is shown and drug C becomes
the standard of treatment. Biocreep is the term given to this effect, by which multiple
generations of non-inferiority trials—that used drugs that were not tested against
placebo as active controls—can result in demonstrating non-inferiority of a therapy
that is actually not superior to placebo. B is not inferior to A and C is not inferior to B, but C may well be inferior to A, and perhaps not even superior to placebo. The term technocreep is also used to refer to the same effect in the case of medical devices.
INTERIM ANALYSIS
The reasons to perform an interim analysis in superiority trials, as well as advantages
and disadvantages of doing so, are discussed in Chapter 18. These reasons may not al-
ways apply to non-inferiority trials.
MISSING DATA
While in superiority trials missing data may lead to failure to show superiority because of increased variance (type II error), in non-inferiority trials missing data may make the treatments look more similar and thereby increase the type I error (claiming non-inferiority when it does not exist) [11]. Approaches for handling missing data in non-inferiority trials therefore tend to be conservative.
SWITCHING DESIGNS
From Non-Inferiority to Superiority
Showing a possible superiority of the new treatment is not a primary concern of non-
inferiority trials. However, once non-inferiority has been demonstrated, if the lower
bound of the 95% CI is also above zero (Figure 15.2), superiority is also shown.
Because the analysis is made examining the same single confidence interval, no pen-
alty for multiple testing is necessary [12]. One should remember, however, that an ITT
analysis should be preferred in the setting of a superiority trial [13].
From Superiority to Non-Inferiority
A switch in the opposite direction, interpreting a failed superiority trial as showing non-inferiority, is acceptable only if the following conditions are met:
1. The non-inferiority margin was planned a priori; otherwise, the results may influence the clinical judgment of an adequate margin.
2. The active control group fulfills the criteria of an appropriate control for non-inferiority trials.
3. Assay sensitivity and constancy are satisfied.
4. Both ITT and per-protocol analyses are performed.
5. A high-quality trial was conducted (because poor quality may lead to a false claim of non-inferiority).
Figure 15.4. Strategies for determining the non-inferiority margin. –M2*: M2 set at 50% of the point
estimate of the effect of active control over placebo (M1); –M2**: M2 set at the upper bound of the
95% CI of the estimation of the effect of active control over placebo.
Prof. Perkins was excited by this news. He had dedicated his life to academia and this re-
cent study he had submitted to The Lancet had a special meaning for him. He and his team
had spent several years conducting this study and the results could have a significant im-
pact on the field. But after reading the reviewers’ comments, he realized that the challenges
to get this published were not over: it would not be easy to address reviewers’ comments.
But one thing he enjoys is the intellectual debate with reviewers, as he compares it with
chess play. In this case, he wanted to be the one to make the “checkmate” move.1
1. Dr. Imamura and Professor Fregni prepared this case. Course cases are developed solely as the
basis for class discussion. Although cases might be based on past episodes, the situation in this case
is fictional. Cases are not intended to serve as endorsements or sources of primary data. All rights
reserved to the author of this case. Reproduction and distribution without permission is not allowed.
It was dawn and the sky was just turning from orange to light blue. Old maple trees
that were particularly beautiful this time of year adorned his path. This morning he had
a lot to think about. Although the reviewers were positive regarding his trial, they also
had several comments and criticisms.
He had already sent an email to schedule a meeting with his research team—he
had taken almost his entire team to Toronto, and for this study, he had one postdoctoral fellow and two junior faculty members working with him. He cut his jog short and
went quickly to his office.
The main issue I have with this trial is its design—I am not convinced that non-
inferiority trial is the best design—the problems I have with this design are (i) we
do not know in this case whether the drug is equally good or equally ineffective—as
the authors know, assay sensitivity is extremely important. Given that this trial
was performed using state-of-the-art percutaneous intervention that we know is as-
sociated with a smaller rate of thrombotic events, the lack of differences between two
treatments might be due to the fact that the standard drug (heparin) in this case is not
different from placebo, as patients did not have an excess of thrombotic events, and
(ii) because of this issue, I believe that this was a waste of resources as the trial was
extremely large. For these reasons the authors should have pursued a superiority trial.
Jan Klose, a postdoctoral fellow who has been working with Prof. Perkins for many
years, is the first to speak. He has always been proactive and he quickly begins:
Sunil Kumar, an instructor from India who is very knowledgeable about clinical trials,
adds some considerations:
Well, this reviewer has a good point, we cannot prove assay sensitivity in our study.
Therefore our new drug might be similar to placebo if heparin (the standard drug)
had no significant effect in our study. Therefore, patients in the future might use our
drug when in fact it is ineffective. We certainly do not want this. I wonder now if we
should have considered another trial design—for instance, conduct a superiority trial
in patients who have an allergic reaction to heparin and therefore cannot take it. This
is certainly an important limitation of our study.
“Excellent points, Sunil and Jan. We should add some comments in the limitations
section in which we discuss the ethical issue of using placebo in this case and also the
potential limitation of assay sensitivity.” After a small pause, Prof. Perkins continues,
“Although he might have put our ‘king’ in ‘check,’ this was not a ‘checkmate.’ Let us
review now the main comment of reviewer 2.”
My main concern here is with the non-inferiority margin that was chosen for this
study. As the authors know, the non-inferiority margin is a critical piece for the non-
inferiority design, as a large margin might invalidate the results of a given study. In
the case of the current study, I think the margin is excessively large. In fact, this non-
inferiority trial provides a direct comparison of T (Thrombase), to C (heparin), but
not of Thrombase to placebo. Thus, one hopes to choose a non-inferiority margin
that will provide assurance that Thrombase is better than placebo. That concern is
especially true when the comparator drug C has small effect size over placebo. Please
address this important issue, especially because the new drug was on the inferior rather than the superior side when compared to heparin.
This is the main issue in the previous non-inferiority studies in which I have
participated: defining the NI margin. Although heparin (our standard drug) has a
large effect size as compared to placebo in historical controls, the issue of the non-
inferiority margin is still a challenging one. We have seen some non-inferiority
studies which were criticized for the use of margins that were considered inappro-
priately high (TARGET trial).2 And a very small margin means too many patients;
that would be unfeasible. But because we used the clinical judgment, which is an ac-
ceptable method but is subjective, how do we justify that our margin was adequate?
I believe that clinical judgment was adequate. We as clinicians might consider what
difference in event rates would make the two treatments no longer “therapeutically
2. Moliterno DJ, Yakubov SJ, DiBattiste PM, et al. Outcomes at 6 months for the direct comparison of tirofiban and abciximab during percutaneous coronary revascularization with stent placement: the TARGET follow-up study. Lancet. 2002; 360: 355–360.
equivalent.” Of course this judgment varies from setting to setting and it may not be
easy to reach a consensus. However, I believe that our number for the margin was
conservative and okay.
Sunil, the savvy clinical researcher, has been waiting to give his remarks:
I kind of predicted this issue when we designed our study; but most of you wanted
the clinical judgment to define the margin. But as Prof. Perkins likes to say: our king
has not been killed yet—we still have a chance. If we calculate the margin again using
other methods and show that our results do not change, we would be all set—this
would be a sensitivity analysis. The two methods I propose are: (i) using the effect
size (ES) of C (heparin) compared to P (placebo). So we can combine all available
evidence, a meta-analysis if possible, to obtain an estimate of the ES of C (over pla-
cebo) with a confidence interval. We may, then, choose the NI margin based on the
estimate of this effect size (ES). For example, we may set M equal to half the point
estimate of ES. The intent of this approach is to provide assurance that T (new drug)
provides at least half as much benefit as C (heparin), compared to placebo. My con-
cern with this approach is that large point estimates of ES may lead to unacceptably
large M as well. We know that estimates of ES are subject to uncertainty even if they
are obtained from meta-analysis. The actual ES may be smaller than stated, and if that
is the case, the use of a large M could lead to a false claim of non-inferiority. Another
option (ii) is to set the NI margin equal to half the lower limit of the confidence in-
terval for ES of C (also obtained from a meta-analysis of previous studies). This is
known as the 95–95 method and is usually suggested by FDA statisticians. This is
conservative in that it provides strong assurance that T (new drug) is superior to P
(placebo) if the NI trial is successful.
The problem with these methods, as I mentioned before, is that the lower the
margin, the more difficult to establish non-inferiority and the larger the sample size
needed. For example, reducing the NI margin from a 25% change in event rate to 20% increases the required sample size by a factor of about (5/4)² ≈ 1.56. The 95–95 method, in particular, not uncommonly requires sample sizes that are impossible to reach.
After listening to all the comments, Prof. Perkins concludes, “Well these are all im-
portant issues that need to be addressed. What we can do here is to do a sensitivity
analysis—see how the results would change if we change the method of defining the
margin and also mention that using these other methods, we would need a much larger
sample size that would be unfeasible. Also we should point out that the best method is
not established yet and point out that the method we used is often used in the literature.”
Prof. Perkins was happy with the outcome of this meeting—he thought to himself,
“Checkmate, reviewers!”
CASE DISCUSSION
Motivation of the Investigators to Run This Trial
Thrombase may have fewer side effects than heparin; it is also unethical to have a
placebo arm.
1. Choice of the trial design—is there enough assay sensitivity? Is the current
trial sensitive enough to detect a difference in effect of heparin over placebo
if a placebo would have been included? Another point: is this trial sensitive
enough to detect differences between heparin and Thrombase? If a difference
does exist but cannot be detected by this trial, a false claim of non-inferiority
will be made.
Possible ways to address this issue:
– Include a third arm with placebo? Is it ethical?
– Run this trial as similarly as possible to the trials that have shown superiority of
heparin over placebo.
– Constancy assumption: since the trials that showed superiority of heparin
over placebo were run, standards of care of these patients might have improved
and, in the current setting, heparin may not demonstrate such a benefit over
placebo as before. If this is true, showing that Thrombase is non-inferior to
heparin might just mean that they are both ineffective. One point to consider is that if the effect of heparin over placebo is large (an impressive risk reduction in ischemia), it is unlikely that it would not still be superior to placebo in the current trial.
2. Choice of the non-inferiority margin—selecting the margin: To better understand
the two options given in the case, we can suppose that, in a previous meta-analysis,
the relative risk of ischemic events for placebo over heparin was, for instance, 2.30
(95% CI 1.95–2.75). The interpretation would be: given an RR = 2.30, placebo
increases the rate of ischemic events by 130% (95% CI 95%–175%).
– First method (fixed margin approach): Set the NI margin at half the point estimate of the effect size of placebo over heparin. To be claimed non-inferior, Thrombase would have to stay within half the effect size of placebo over heparin: 130%/2 = 65%, which corresponds to an RR of 1.65 or, using the formula, 1 + [(RR − 1)/2].
– Second method (95–95 method): Anchor the margin at the lower bound of the 95% CI of the effect size of placebo over heparin (RR 1.95, i.e., a 95% excess) and keep half of that effect, yielding a margin RR of 1.475 with the same formula, 1 + [(RR − 1)/2]. The upper bound of the 95% confidence interval of the effect of Thrombase over heparin must be less than the selected NI margin (Figure 15.5). Note that the 95–95 method allows only a smaller loss of effect to claim non-inferiority; both methods are illustrated in the sketch below.
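A small illustrative Python sketch of these two calculations (the RR of 2.30 and its lower CI bound of 1.95 are the hypothetical meta-analytic values given above):

def rr_margin(rr, preserve=0.5):
    """NI margin on the RR scale that sacrifices (1 - preserve) of
    the excess risk: 1 + (rr - 1) * (1 - preserve)."""
    return 1 + (rr - 1) * (1 - preserve)

m_fixed = rr_margin(2.30)   # fixed margin approach: 1 + 1.30/2 = 1.65
m_95_95 = rr_margin(1.95)   # 95-95 method: 1 + 0.95/2 = 1.475

# Non-inferiority is claimed if the upper bound of the 95% CI for
# RR (Thrombase vs. heparin) stays below the chosen margin.
print(m_fixed, m_95_95)     # 1.65 1.475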
1. What are the issues involved in this case that should be considered in order to re-
spond to reviewers?
2. What are the concerns?
3. Was non-inferiority design the best one? Why? Why not?
4. Have you seen a similar situation? If you know similar cases (that can be disclosed),
please share with the group.
Figure 15.5. Two methods for choosing the NI margin. A: The effect size of placebo over heparin and the non-inferiority margins: Ma, set at half the point estimate of the effect size; Mb, set at half the effect size at the lower bound of the 95% CI. B: Transposing the margins to the NI trial comparing Thrombase with heparin. C: Fictitious result of the non-inferiority trial: non-inferiority would have been claimed if Ma were the chosen non-inferiority margin, but not Mb.
FURTHER READING
United States Food and Drug Administration. Guidance for Industry: Non-Inferiority Clinical Trials (2010): focuses on a deeper understanding of the methods for determining a non-inferiority margin that are recommended by the FDA.
Reporting of Noninferiority and Equivalence Randomized Trials: An Extension of the CONSORT Statement: a checklist for the elaboration and reporting of non-inferiority trials.
REFERENCES
1. Christensen E. Methodology of superiority vs. equivalence trials and non-inferiority trials.
J Hepatol. 2007 May; 46(5): 947–954.
2. D’Agostino RB, Sr., Massaro JM, Sullivan LM. Non-inferiority trials: design concepts
and issues—the encounters of academic consultants in statistics. Stat Med. 2003 Jan 30;
22(2): 169–186.
3. Evans SR. Clinical trial structures. J Exp Stroke Transl Med. 2010 Feb 9; 3(1): 8–18.
4. Evans S. Noninferiority clinical trials. CHANCE. 2009; 22(3).
5. US Department of Health and Human Services; Food and Drug Administration; Center
for Drug Evaluation and Research (CDER); Center for Biologics Evaluation and Research
(CBER). Guidance for industry non-inferiority clinical trials 2010.
6. Fleming TR, Odem-Davis K, Rothmann MD, Li Shen Y. Some essential considerations in
the design and conduct of non-inferiority trials. Clin Trials. 2011 Aug; 8(4): 432–439.
7. Schumi J, Wittes JT. Through the looking glass: understanding non-inferiority. Trials. 2011;
12: 106.
8. Nathan N, Borel T, Djibo A, Evans D, Djibo S, Corty JF, et al. Ceftriaxone as effective
as long-acting chloramphenicol in short-course treatment of meningococcal menin-
gitis during epidemics: a randomised non-inferiority study. Lancet. 2005 Jul 23–29;
366(9482): 308–313.
9. Yonemura M, Katsumata N, Hashimoto H, Satake S, Kaneko M, Kobayashi Y, et al.
Randomized controlled study comparing two doses of intravenous granisetron (1 and
3 mg) for acute chemotherapy-induced nausea and vomiting in cancer patients: a non-
inferiority trial. Jpn J Clin Oncol. 2009 Jul; 39(7): 443–448.
10. The Italian Group for Antiemetic Research. Dexamethasone, granisetron, or both for the
prevention of nausea and vomiting during chemotherapy for cancer. N Engl J Med. 1995 Jan
5; 332(1): 1–5.
11. Flandre P. Design of HIV non-inferiority trials: where are we going? AIDS. 2012 Oct 17.
12. Schiller P, Burchardi N, Niestroj M, Kieser M. Quality of reporting of clinical non-inferiority
and equivalence randomised trials: update and extension. Trials. 2012 Nov 16; 13(1): 214.
13. Wangge G, Klungel OH, Roes KC, de Boer A, Hoes AW, Knol MJ. Room for improvement
in conducting and reporting non-inferiority randomized controlled trials on drugs: a sys-
tematic review. PLoS One. 2010; 5(10): e13550.
16
OBSERVATIONAL STUDIES
INTRODUCTION
In this chapter, we will introduce you to the major designs of observational studies.
With this topic, we now enter the world of epidemiology.
Epidemiology is a field of research that studies “the distribution and determinants of
health-related states or events in specified populations” and applies “this study to con-
trol [ . . . ] health problems” [1]. Its objective is to measure parameters relevant to public
health, ranging from birth and death rates to cleanliness of drinking water to disease
occurrence. Another main purpose of epidemiological research is to identify risk factors
associated with disease. Finally, epidemiologists develop models to predict future disease
burden based on current data (e.g., the impact of hypertension on global public health
within the next 20 years given unchanged lifestyle). Recommendations on lifestyle factors
may be one consequence of such predictions. Importantly, where clinical research assigns interventions to study groups that are made as comparable as possible, epidemiological research observes unexposed and exposed individuals under “real-life conditions” without intervening itself.
We will provide you with the main tools to assess the quality of an observational
study and identify possible threats to the validity of the obtained results. The next
chapter discusses covariates and confounders in observational studies and how to
account for them (with emphasis on propensity scores).
In observational studies, data are collected through “observations” (interviews,
surveys, database queries, etc.), instead of “actively” being generated or altered.
Observational studies can be descriptive, when no comparison group is included
in the design. They can be applied to describe the frequency of events in a popula-
tion. Observational studies can also be analytical, when different study groups are
compared. Data can then be used for statistical inference, and relationships between
exposures (i.e., risk factors) and outcomes (i.e., diseases) can be investigated.
An exposure may be an environmental factor (such as air pollution), a self-chosen
habit (such as smoking), or an intervention that a patient receives independently of
the study. To draw a comparison with randomized controlled trials (RCTs): the “exposure” in an RCT would be the intervention, which is manipulated by the investigators, whereas it is not manipulated in observational studies. Due to the non-randomized nature of observational
studies, bias and confounding can largely influence study outcomes and must be
carefully controlled. Consequently, the study protocol should provide detailed in-
formation on the study sample, exposure and outcome variables, sources of data and
methods of assessment, sources of bias, and statistical analysis [2].
Ratio
A ratio is a division between values, which puts them into relation: ratio = a/b [3]. It
can be used to relate different subpopulations to each other, for example the males and
females who suffer from the same disease.
Proportion
A proportion relates a sub-entity to an entity. In its simplest form, it can be expressed as proportion = a/(a+b), where the numerator (a) is part of the denominator (a+b) [3]. Unlike ratios, proportions can only take values between 0.0 (if a equals zero or is vanishingly small) and 1.0 (if b equals zero or is vanishingly small).
Chance can alternatively be expressed in terms of odds. Odds represent the probability that an event occurs relative to the probability that it does not occur [4].
In epidemiology, a specially defined type of proportion is prevalence, which is the proportion of individuals who have a condition at a specified time within a population at risk. Disease prevalence can be determined as follows [4,5]:

prevalence = number of individuals with the condition at a specified time / number of individuals in the population at risk at that time
Rate
Disease frequency can be determined over a specified time period and, in that case, a
measure of time needs to be included in the ratio. Rate is a measure of how frequently an
event occurs in a population at risk within a given time period. Demographic rates, such
as birth or death rate, usually refer to a time period of one year, and the population at risk
at the midpoint of the year is commonly used as the reference population. Irrespective of
the size of the study population, rates are often expressed per 1,000 or 100,000 population
(by using a multiplier). The basic formula to determine rates is the following [3]:

rate = (number of events during a specified period / population at risk during that period) × multiplier

Box 16.1. The 2 × 2 contingency table

       D+   D–
E+     a    b
E–     c    d
For example, if 428,077 individuals die in 2012 in a country in which the mid-year
population comprised 61,153,780 individuals, the death rate is 700 in 100,000.
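The same arithmetic in a few lines of illustrative Python:

deaths = 428_077
mid_year_population = 61_153_780
rate = deaths / mid_year_population * 100_000
print(round(rate))  # 700 deaths per 100,000 population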
The terms ratio, proportion, and rate are the basis for the specific statistical
parameters that will be introduced in the context of study designs. Contingency tables
(see Box 16.1) can be helpful to calculate these parameters and the adequate formulas
will be provided throughout the text.
Case Reports and Case Series
Case reports are an established way to report adverse drug events [6,7]. A case series is a quantitative and
qualitative extension of the case report in that it summarizes a number of cases under
a clearly defined research question. It thus adds value to the clinical findings and may
even be used to build hypotheses on associations. Large case series, which for example
analyze data from disease registries, may comprise up to several hundred patients [8].
Statistical Analysis
A very limited amount of statistical analysis can be done in case series. One main pa-
rameter is symptom prevalence, which is the proportion of cases that have a certain
symptom, relative to the total number of cases [9].
Advantages
Case reports and case series detect novelty by describing yet unknown clinical
presentations or novel treatments [8,10]. When evidence accumulates from several
case reports or case series with similar findings, hypotheses on disease mechanisms,
associations with risk factors, or treatment effectiveness can be generated. Larger
studies with the potential for meaningful statistical analysis (e.g., case-control or co-
hort studies) may then be initiated to test these hypotheses. In this way, a case report
may be the first step to the discovery of a new disease. Similarly, a series of cases with
severe adverse events due to a drug may initiate the retraction of that drug from the
market [7]. Clinical education can be another positive aspect of case studies, as rare
diseases or rare manifestations of a common disease may hardly be seen in clinical
daily routine, yet should be recognized once they present.
Disadvantages
The major disadvantage of case reports and case series is the lack of a comparison group.
Therefore, they cannot be used to demonstrate the efficacy or safety of a treatment.
Also, there is a good chance that a found association is falsely positive. Consequently,
associations that are based on data from case series should be treated very carefully.
They can only be used to generate hypotheses, which must further be tested in a robust
study design with a comparison group, for example a case-control study [11].
Another disadvantage of case reports and case series is their proneness to publica-
tion bias (see the section “Bias” later in this chapter) [6]. As mentioned earlier, case
reports and case series favor the unusual, and this may lead to an overrepresentation
of certain rare cases in the literature.
These arguments explain why case reports and case series are considered to pro-
vide a rather low level of evidence. They alone usually do not change medical practice.
Cross-Sectional Studies
A cross-sectional study investigates the characteristics of a population sample at a
specific time point (Figure 16.1). Subjects are randomly sampled from a target pop-
ulation, and exposure and disease status are measured once and at a specific point
of time for each subject. The sampling process itself can take an extended period
of time (up to several years), depending on how narrow the study population and
how large the required sample size is. Data can be obtained from questionnaires or
interviews [13].
Statistical Analysis
On a descriptive level, cross-sectional studies can be used to determine odds and prev-
alence of the disease. On an analytical level, odds ratio and prevalence ratio can be
calculated to measure the association of exposure and disease.
We will show here how to calculate these parameters using a contingency table.
Let us examine the occurrence of type II diabetes mellitus (D) on a very small
Pacific island and investigate the potential association with a body-mass-index
(BMI) >35 (E).
Figure 16.1. Design of a cross-sectional study: (1) identify exposure status (exposed vs. unexposed); (2) identify disease status (disease vs. no disease) at the same point in time.
       D+    D–
E+     125   350
E–     55    850
One can calculate the prevalence ratio from the 2 × 2 table: prevalence ratio = [a/(a+b)] / [c/(c+d)]. The resulting value is a measure of the strength of association. Values >1.0 indicate an increased risk of the disease due to the exposure, whereas values of 1.0 or lower indicate no associated risk or a protective effect of the exposure, respectively. In our example, the prevalence ratio is (125/(125+350))/(55/(55+850)) = 4.33, which means that individuals with BMI >35 have 4.33 times the probability of having type II diabetes mellitus compared with individuals with a BMI of 35 or lower.
Alternatively, associations between exposure and disease can be determined from odds by calculating the prevalence odds ratio (POR), which is the odds ratio of disease. It is defined as the odds of disease among exposed relative to unexposed subjects: POR = (a/b)/(c/d) = (a × d)/(b × c).
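Both measures can be computed directly from the 2 × 2 table; the following illustrative Python uses the numbers of the diabetes example:

# E = BMI > 35, D = type II diabetes mellitus
a, b = 125, 350   # exposed:   D+, D-
c, d = 55, 850    # unexposed: D+, D-

# Prevalence ratio: prevalence among exposed / prevalence among unexposed
prevalence_ratio = (a / (a + b)) / (c / (c + d))

# Prevalence odds ratio: odds of disease among exposed / among unexposed
por = (a * d) / (b * c)

print(round(prevalence_ratio, 2), round(por, 2))  # 4.33 5.52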
Advantages
Cross-sectional studies are an efficient way to gather comprehensive data on common
health conditions like diabetes or cardiovascular diseases and are hence used for national
surveys [15]. They are used to determine disease and exposure prevalence and measure
associations between exposures and diseases. They can in principle be relatively fast and
cheap to perform, but this is highly dependent on the sampling process and method of
data acquisition, as stated earlier. Cross-sectional studies can easily be integrated into co-
hort studies to measure the characteristics of the study cohorts at baseline or any other
defined time point. Another major advantage is that no follow-up is needed [4].
Disadvantages
Cross-sectional studies in a general population are not suited to study rare diseases.
For rare diseases, a case series can be an alternative to measure exposure prevalence in
a subset of diseased patients [13].
The measurement of prevalence in cross-sectional studies can lead to an underrep-
resentation of cases, when a disease is of short duration, or to an overrepresentation of
cases, when a disease has a long duration (prevalence-incidence bias; see section “Bias”
for further explanation). In this situation, prevalence may not be a good estimate for
disease occurrence [15].
Since subjects are assessed at a single point in time, no temporality of an asso-
ciation can be determined. Therefore, it is extremely difficult to establish causality
with cross-sectional data [15]. Cross-sectional studies can, however, be used to build
hypotheses about causal relationships, which are tested in a case-control or cohort
design in a second step.
Figure 16.2. Design of a case-control study: (1) identify disease status (cases vs. controls); (2) determine exposure status (exposed vs. unexposed) retrospectively.
Case-Control Studies
Case-control studies (Figure 16.2) are studies in which individuals are selected from
a defined population based on outcome (i.e., disease) [17]. Individuals that have the
outcome are selected as “cases.” From the same population, individuals free of the
outcome are selected and serve as “controls.” Investigators then look back in time to
identify possible risk factors (exposure) for developing the disease. Information on
previous exposures is gathered directly from subjects (e.g., personal interviews, tele-
phone interviews, paper-based questionnaires) or from preexisting records (e.g., med-
ical charts) [14]. Usually, only one outcome (the one used for the selection of cases)
is investigated per study [18,19].
Statistical Analysis
Since the number of cases and controls is fixed by design, case-control studies do not
allow the estimation of disease occurrence. Instead, odds ratio can be calculated [4].
To further understand these parameters, let us analyze a fictive example of food-
borne disease after eating in a restaurant famous for seafood. We want to find out
whether eating seafood was a risk factor for developing food-borne disease in this
specific example. Therefore, both cases and controls are drawn from a population of
visitors to this restaurant. Exposed subjects had consumed seafood, whereas unex-
posed subjects had not.
       D+ (cases)   D– (controls)
E+     49            37
E–     31            43
The odds ratio of exposure is the odds of exposure among the diseased relative to the odds of exposure among the disease-free [15].
Using the 2 × 2 table, we can calculate OR = (a/c)/(b/d) = (a × d)/(b × c) [4]. The odds ratio of exposure in our example is (49/31)/(37/43) = (49 × 43)/(37 × 31) = 1.8. This means that the odds of having eaten seafood were 1.8 times higher among cases than among controls. Given the high exposure prevalence of seafood, we can suspect that other risk factors for food-borne disease exist in this restaurant.
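In code, with an approximate 95% confidence interval added on the log-odds scale (Woolf's method; the interval is a standard companion to the point estimate but is not computed in the chapter):

from math import exp, log, sqrt

a, b = 49, 37   # exposed:   cases, controls
c, d = 31, 43   # unexposed: cases, controls

odds_ratio = (a * d) / (b * c)                 # ~1.84

# Approximate 95% CI on the log-odds scale (Woolf's method)
se_log_or = sqrt(1/a + 1/b + 1/c + 1/d)
ci = (exp(log(odds_ratio) - 1.96 * se_log_or),
      exp(log(odds_ratio) + 1.96 * se_log_or))

print(round(odds_ratio, 2), tuple(round(x, 2) for x in ci))
# 1.84 (0.98, 3.45)

Because this interval includes 1.0, the association would not reach statistical significance (see the box on confidence intervals later in this chapter).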
Further parameters can be calculated when cases and controls are sampled from
a cohort, because only incident cases within the observation time are included [20].
In a case-cohort study, the probability that an initially disease-free subject develops
the outcome during the time of follow-up equals the incidence risk, or cumulative inci-
dence. Building a ratio between the exposed and unexposed group will result in relative
risk, or cumulative incidence ratio. Because controls are longitudinally sampled from the population still at risk in nested case-control studies, one can calculate incidence rates for the exposed and unexposed cohorts and build a ratio between them, the incidence rate ratio, or relative rate (see the section “Cohort Studies” for further explanation).
In order to control for confounders in the analysis, two main strategies exist: strat-
ification through cross-tabulation methods and regression modelling (see subsection
“Control of Confounding” for more details). Among the regression models, logistic re-
gression is commonly used for case-control designs because it is based on odds ratios.
Advantages
Case-control studies present a time- and cost-efficient way to examine a large number
of risk factors. Relatively smaller sample sizes are needed to investigate associations
in case-control studies than in cohort studies [4]. Case-control studies provide an
optimal design to investigate rare diseases and diseases with long latency, since cases
are enriched in the sample compared to the population the sample is drawn from.
Moreover, they are useful to investigate disease outbreaks [4]. Nested case-control
studies and case-cohort studies are well suited to conduct expensive and more elab-
orate analysis on a subset of a cohort, while benefiting from the prospective design
of the cohort (especially the lack of recall bias and the sampling of cases and controls
from the same population) [19].
Disadvantages
Selection of cases and controls is a major challenge in case-control studies [17]. Clear
diagnostic criteria must be provided to define cases. Controls should be drawn from
the same population as cases, as they should have an equal risk of becoming a case. It
is also crucial to select cases and controls independently of their exposure history in
order to prevent selection bias. Furthermore, if a disease affects survival, those cases
that already have died cannot be included in the study [15]. Exclusion of such cases
might lead to a serious imbalance in the exposure status of cases and controls.
Case-control studies are not suited to study rare exposures since, even in a very
large sample, none of the cases or controls might have experienced the exposure [17].
As exposure histories are obtained retrospectively in the case-control design (from
self-questionnaires, chart records, registries, etc.), incorrect or incomplete informa-
tion can lead to serious bias (recall bias; see section “Bias” for further explanation).
Cohort Studies
Cohort studies are to epidemiologists what the RCT is to biostatisticians. They sample subjects based on exposure status and follow them up to an outcome. The term cohort originates from the military, where it denoted a group of warriors marching forward; in epidemiology, a cohort marches forward in time. Typically, exposed and non-exposed subjects are followed in two parallel
groups. Depending on the time of data acquisition, cohort studies can be prospective or
retrospective (Figure 16.3) [23]. Prospective cohort studies examine past and present
exposures in a disease-free sample and follow subjects up in pre-defined intervals until
the main endpoint occurs or the subject becomes censored (due to study end, loss
to follow-up, or death). Retrospective, or historical, cohort studies use existing data
(e.g., from past medical records or cohorts that have been studied for other reasons).
While often both the exposures and outcomes lie in the past, the chronology of having
documented an exposure that is later followed by an outcome is maintained.
Statistical Analysis
Disease risk and rates are primary statistical measures in a cohort study. We will ex-
emplify the calculation of these parameters in a fictive cohort study that examines the
effect of air pollution on the development of asthma in children. Children who live in
an urban environment and are exposed to particulate matter <10 µm, which reaches
the respiratory tract, were considered exposed. Children from an urban environ-
ment without exposure to particulate matter <10 µm were considered unexposed and
served as controls. The diagnosis of asthma was a primary outcome; 4,003 subjects
Figure 16.3. Design of prospective (top) and retrospective (bottom) cohort studies: (1) identify exposure status (cohort A, exposed; cohort B, unexposed); (2) determine disease status during follow-up.
participated in the study and were followed up over five years. For simplicity, we con-
sider person-time in the study as 4,003 x 5 = 20,015.
                         D+    D–
E+ (exposed cohort)      36    1967
E– (unexposed cohort)    28    1972
Since all individuals in a cohort study are free of the outcome at the beginning
of the study, those individuals who develop the disease throughout the study will be
equivalent to “new cases.” Risk must be described in the context of time of follow-up
(e.g., 5-year risk). Risk can be calculated from the 2 × 2 table:
Risk among exposed subjects: a/(a + b)
Risk among unexposed subjects: c/(c + d)
In our example, five-year risk for asthma among children exposed to particulate matter
is 36/(36 + 1967) = 0.018 = 1.8%. Five-year risk for asthma among non-exposed chil-
dren is 28/(28 + 1972) = 0.014 = 1.4%.
Unlike risk, the incidence rate takes into account the duration for which an individual is “at risk” of the disease within the study period. This duration can vary largely among subjects, depending on the time point of entry into the study, dropout, or occurrence of the outcome (after which a subject is no longer “at risk”). In order to express the incidence rate in person-years, we multiply the number of participating subjects by the length of the follow-up period (= population at risk during the study period). A multiplier is included in the calculation to adjust the result to an easily comprehensible population size (e.g., 100,000 individuals). Hence, the definition of the incidence rate is the following [3]:

incidence rate = (number of new cases / person-time at risk) × multiplier
In the example of the asthma study, the incidence rate for asthma in children that are
exposed to particles <10 µm is 36/20,015 x 10,000 = 18 per 10,000 person-years.
The incidence rate for asthma in unexposed children is 28/20,015 x 10,000 = 14 per
10,000 person-years.
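The risks and rates of the asthma example can be reproduced with a few lines of illustrative Python (following the chapter's simplification of dividing each cohort's cases by the total person-time):

a, b = 36, 1967    # exposed cohort:   asthma, no asthma
c, d = 28, 1972    # unexposed cohort: asthma, no asthma
person_time = 20_015   # 4,003 children x 5 years (simplified)

risk_exposed = a / (a + b)        # 5-year risk among exposed: 0.018
risk_unexposed = c / (c + d)      # 5-year risk among unexposed: 0.014

rate_exposed = a / person_time * 10_000      # 18 per 10,000 person-years
rate_unexposed = c / person_time * 10_000    # 14 per 10,000 person-years

print(f"risks: {risk_exposed:.3f} vs {risk_unexposed:.3f}")
print(f"rates: {rate_exposed:.0f} vs {rate_unexposed:.0f} per 10,000 py")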
In order to measure strength of association, we can calculate the odds ratio of
disease from cohort studies, which is analogous to the calculation of odds ratio in
cross-sectional studies. However, be aware that outcomes are incident in cohort
studies while they are prevalent in cross-sectional studies.
The more important parameter for measuring associations in cohort studies is relative risk (RR), or risk ratio. It can only be derived from this study type because it is based on incident outcomes. RR is the risk of an exposed individual developing a disease, relative to the risk of an unexposed individual acquiring the same disease [15]:

RR = [a/(a + b)] / [c/(c + d)]
Confidence intervals are a measure of how precise the estimate of a population parameter is
(sampling distribution). Precision is a measure of exactness. If a sample is repeatedly meas-
ured with high precision, the true population mean can be found within a narrow confi-
dence interval at a pre-specified significance level. A 95% confidence interval would contain
the true value with 95% probability. Conversely, the true value would lie outside of a 95%
confidence interval with a probability of 5%. Confidence levels of 90%–99% are commonly
chosen, depending on the degree of imprecision that researchers are willing to accept.
The 95% confidence interval can be calculated by

sample estimate ± 1.96 × SE(sample estimate),

in which SE represents the standard error [15].
Confidence interval width is influenced by sample size, dispersion and confidence level
[57]. A large sample size increases precision and, therefore, the confidence interval becomes
narrower. High standard deviation or standard error results in a wider confidence interval.
And, by definition, a 99% confidence interval is wider than a 95% confidence interval.
Statistical significance can be inferred from confidence intervals [57,58]. If the value that represents the null hypothesis lies within the confidence interval (e.g., “0” in the case of a mean difference, or “1” in the case of an odds ratio or relative risk), the null hypothesis cannot be rejected and the result is considered statistically non-significant. If the entire confidence interval lies above
this value, in the example of a cohort study, it could be concluded that exposure to a certain
factor leads to a higher risk to develop a disease (or to a lower risk if the entire confidence
interval lies below this value). Thus, the direction of an effect can be determined, which may
also be meaningful in case of a statistically not significant result.
While the p-value allows conclusions about statistical significance within a sample, confidence intervals provide an estimate of the true value within the entire population from which the sample has been taken [58]. In conclusion, confidence intervals enable us to assess the clinical relevance of study outcomes.
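A minimal sketch of the box's formula, using hypothetical measurements (note that the 1.96 multiplier is the large-sample value; small samples such as this one would strictly call for a t quantile instead):

from math import sqrt

values = [5.1, 4.8, 5.6, 5.0, 4.7, 5.3, 5.2, 4.9]   # hypothetical data

n = len(values)
mean = sum(values) / n
sd = sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
se = sd / sqrt(n)                      # standard error of the mean

ci_95 = (mean - 1.96 * se, mean + 1.96 * se)
print(round(mean, 2), tuple(round(x, 2) for x in ci_95))
# 5.08 (4.87, 5.28)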
Odds ratio can be derived from both prevalent and incident data and can provide a useful
measure of association in cross-sectional, case-control, and cohort studies. In contrast, rela-
tive risk is based on incidence and can only be determined from cohort studies. Relative risk
is a more intuitive measure of association than odds ratio. If, hypothetically, the probability
to develop a disease outcome is 0.2 in a population, it is easy to grasp that, statistically, every
fifth subject randomly sampled from this population will develop the disease. At the same time, this would mean that the odds of developing the disease, that is, the probability of developing it (1/5) relative to the probability of not developing it (4/5), equal 1/4, or 0.25. This number is much harder to understand.
The same scenario will occur when we build ratios between probabilities (RR) versus ratios
between odds (OR). OR is sometimes incorrectly interpreted as RR, resulting in wrong
conclusions that are drawn from study results [25].
In some cases, the odds ratio will approximate the relative risk. If a disease occurs with low probability, which is reflected in low numbers for a and c, the denominators of OR and RR will be almost similar: RR = [a/(a+b)]/[c/(c+d)] ≈ OR = (a/b)/(c/d) [26]. However, if a disease is more common, the denominator of the OR will become smaller than the denominator of the relative risk, with the consequence that the OR will overestimate the relative risk. This problem is likely to occur when the event probability of a disease outcome is larger than 10% [25].
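A short illustrative comparison (hypothetical risks) makes the rare-disease approximation visible:

def rr_and_or(risk_exposed, risk_unexposed):
    """Relative risk and odds ratio for the given outcome risks."""
    rr = risk_exposed / risk_unexposed
    odds_ratio = ((risk_exposed / (1 - risk_exposed))
                  / (risk_unexposed / (1 - risk_unexposed)))
    return rr, odds_ratio

print(rr_and_or(0.02, 0.01))   # rare outcome:   RR = 2.0, OR ~ 2.02
print(rr_and_or(0.40, 0.20))   # common outcome: RR = 2.0, OR ~ 2.67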
When we attempt to describe the results of a cohort study, we can interpret relative
risk as “times the risk” and incidence rate ratio as “times the rate.” A more intuitive and
sometimes more meaningful way of expressing the same findings is the excess risk that
is associated with the exposure. It is an absolute measure and can be calculated as risk
difference or rate difference by subtracting the risk or incidence rate in the unexposed
group from the risk or incidence rate in the exposed group, respectively [15]:

risk difference = risk (exposed group) − risk (unexposed group)
rate difference = rate (exposed group) − rate (unexposed group)
Risk difference in the asthma study is 0.018 − 0.014 = 0.004 = 4/1,000. This means that children exposed to particulate matter <10 µm had 4 additional cases of asthma per 1,000 children during the five-year observation period compared with unexposed children. Rate difference is 18 per 10,000 person-years − 14 per 10,000 person-years = 4 per 10,000 person-years. Hence, children exposed to particulate matter <10 µm had 4 additional cases of asthma per 10,000 person-years compared with unexposed children.
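Both absolute measures follow directly from the cohort table used earlier (illustrative Python):

a, b, c, d = 36, 1967, 28, 1972
person_time = 20_015

risk_difference = a / (a + b) - c / (c + d)        # ~0.004
rate_difference = (a - c) / person_time * 10_000   # 4 per 10,000 py

print(f"{risk_difference * 1000:.0f} extra cases per 1,000 children in 5 years")
print(f"{rate_difference:.0f} extra cases per 10,000 person-years")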
Advantages
A major advantage of cohort studies is that the temporal sequence between an ex-
posure and the development of a disease can be directly observed [24].
Disadvantages
Among the observational study designs, cohort studies are the most expensive.
Depending on the research question, study duration usually lasts several years, in
exceptional cases even decades (see the following Example from the literature). In
order to handle large cohorts, an enormous effort of personnel resources and lo-
gistics, as well as monetary expenditure, can be required. Loss to follow-up is an
important issue in cohort studies, especially if it occurs unequally among cohorts
(see the section “Bias”). Furthermore, exposure status of a subject can change while
the study is running [24], for example, when a worker with an occupational expo-
sure leaves his working environment. Cohort studies are not useful for studying
rare outcomes, since the subjects enrolled in the study may never experience the
outcome [13].
Researchers will typically be interested in identifying the extent of a causal relationship. For this reason, they will investigate to what extent chance, bias, and/or confounding have skewed the results.
Bias
Bias is a systematic error that leads to an over-or underestimation of the true associ-
ation between an exposure and an outcome. It can present a serious threat to the in-
ternal validity of a study. Importantly, it is independent of sample size [29]. Two main
types of bias are relevant in observational studies: selection bias and information bias.
Selection Bias
Selection of subjects for observational studies can be very difficult, as randomization
is not feasible. Selection bias is present when selected subjects differ in relevant char-
acteristics among study groups, and these factors have an influence on the associations under investigation. Case-control studies are especially prone to selection bias, because they require careful consideration of which population controls should be drawn from, and based on which criteria, so that controls resemble cases as much as possible [15].
Several types of selection bias can be distinguished:
Attrition bias arises when subjects lost to follow-up substantially differ in certain
baseline characteristics from those subjects who remain in the study. Although initial
subject selection may have been unbiased, missing observation is the critical point
that leads to imbalances among groups in this type of bias. This can especially occur in
cohort studies that require long-term follow-up, but is not an issue in cross-sectional
studies [30].
Admission-rate bias (or Berkson bias) is present when hospital admission rates
differ between the case and control groups. Admission-rate bias becomes relevant
when the sample population is recruited from hospitalized patients. Some patients
may be more likely admitted for a disease when they had a certain exposure, for ex-
ample when they are carrying medical devices, which can be better handled by a
specialist. If these patients are compared to a control group recruited from the same
hospital, a relationship between exposure and disease may show in the measurements
where none exists [31,32].
If a study sample is recruited from volunteers, volunteer bias is likely to occur, as
volunteers tend to be healthier and less exposed to risk factors. Similarly, one should
be aware of non-respondent bias, for example in surveys from households, where non-
responders often have different demographic characteristics than responders [31,33].
If a disease leads to death or recovery within a short time, this may result in survivor
bias. Incidence and prevalence may differ significantly for such diseases. For example,
cases of an aortic aneurysm rupture may not be representative of the occurrence of the
disease, as many patients die quickly. This type of bias is also called incidence-prevalence
bias, or Neyman bias [29,34].
Publication bias is a systematic error that occurs when publication is directly re-
lated to the direction and significance of the results. In other words, publication bias
is present when some results (mostly positive ones) are more likely to get published
than others (mostly negative results).
Information Bias
Data collection is another process during epidemiological research where bias can
occur. Information bias arises if data on exposures or outcomes are acquired in a sys-
tematically different manner from study groups. It may be introduced by the subject
who provides information to the interviewer, by the interviewer him- or herself, or
by the instruments used to measure or diagnose exposures and diseases, respectively.
Knowledge of exposure or disease status increases the likelihood that information bias
is present [15].
In the following, we will describe different types of information bias:
Recall bias (or reporting bias) is important to consider in case-control studies
where disease status is known and data on exposures are acquired retrospectively.
The problem here is that cases are more likely to remember exposures. For ex-
ample, women with breast cancer may be more attentive to their personal history
of contraceptives use, substance use, menstruation, or family history than women
without breast cancer. This may lead to an overestimation of certain risk factors for
breast cancer [36].
Interviewer bias can occur when an investigator who obtains data on exposures is
aware of a subject’s disease status. An interviewer may be more careful in reviewing
personal history, may pay attention to even rare exposures, and may explain questions
differently if a patient is suffering from cancer. Similarly, determination of the disease
status may be influenced by knowledge of the exposure status in a prospective cohort
study [29].
Lead-time bias becomes relevant when a disease is diagnosed in two different stages.
Suppose a formerly asymptomatic patient suddenly dies from hypertrophic cardiomy-
opathy. In a screening, another family member is diagnosed with hypertrophic cardi-
omyopathy in an asymptomatic stage and thereafter receives treatment. He dies five
years later from congestive heart failure. One may perceive this as a clue that treatment
is beneficial. An alternative explanation could be that the patient was diagnosed five
years before clinical presentation of the disease (this interval would be the lead time)
and the treatment had no real effect on the course of the disease [29].
Misdiagnosis of cases or controls can produce serious misclassification bias, if it
affects case and control groups unequally. If a control is mistaken as a case, the associa-
tion between exposure and disease may be overestimated, and vice versa [4].
Performance bias occurs when subjects receive different care or treatment apart
from the treatment under investigation. This may be the case in an observational study
that compares two surgical procedures when operating surgeons have different levels
of skills to perform these procedures [37]. Another example may be different levels of
additional intensive care treatment in a multi-center study.
Detection bias is caused by uneven diagnostic procedures between study groups.
If exposure counts as a diagnostic criterion for a disease, its presence will lead to the
initiation of certain diagnostic procedures and, subsequently, a higher likelihood of
discovering the disease (diagnostic suspicion bias) [31,38]. In a case-control study, the
exposure status of cases will therefore be biased by selection. Similarly, exposure can
lead to a symptom that will direct the diagnostic process toward detection of a disease.
This type of bias is called unmasking (signal detection) bias [31,38].
In case-control studies, caution is warranted when choosing hospital-based controls, if the disease they are admitted for is associated with the ex-
posure under investigation [4,15].
If cases are drawn from a registry, population-based controls should be chosen.
These ideally represent the general population, if the registry is comprehensive. If
doubt exists about how appropriate controls are, more than one control group can be
defined [4].
In prospective cohort studies, special attention needs to be paid to subjects’ ad-
herence and all possible measures should be taken to minimize loss to follow-up (see
Chapter 7).
An attempt to reduce publication bias has been made by the International Committee
of Medical Journal Editors (ICMJE), which since September 2004 has required registration of any new
clinical trial in a publicly accessible registry in order to be considered later for publi-
cation [41]. Furthermore, publishers and journal editors have increasingly become
aware that the reporting of negative results needs to be stimulated and have started to
introduce negative-results sections, and have even created new journals that focus on neg-
ative trial results [42,43].
Two fundamental principles should be followed in order to minimize information
bias. First, data collection instruments should be as precise and objective as possible;
and second, the administration of these instruments (e.g., by interviewers, examining
physicians, or radiologists) should be consistently blinded. Both principles serve the
goal of leaving as little room for judgment as possible [11].
Interviewer bias can be controlled by both consistent blinding and the use of
standardized interviews [29]. If the interviewer is blinded to the disease status, he or she
will be more likely to interview cases and controls in a uniform way, and will not be
more thorough with case subjects. Interviewers should
receive extensive training before they first start to evaluate subjects, so that inter-rater
reliability can be ensured [4].
In order to control for recall bias, standardized questionnaires or protocols with spe-
cifically phrased, closed-ended, and preferably objective questions should be used.
Medical records or existing databases may be useful to gather more objective data.
Nevertheless, one should be aware that prior documentation itself might be incom-
plete or misleading. Where feasible and ethically acceptable, subjects can be kept
unaware of the exact study hypothesis [11]. In order to conceal risk
factors under investigation, “dummy” risk factors can be included in questionnaires
[4]. Another way to control recall bias is to include controls who suffer from a
disease different from the one under investigation. Compared to
healthy controls, such control subjects will be more likely to remember any relevant
exposure [29].
Strict criteria for diagnoses and exposures should be used in order to avoid
misclassification. Clinical diagnoses should be complemented by standard diag-
nostic tests, including laboratory tests, imaging, electrophysiological tests, and, if ap-
propriate, even invasive diagnostics. Use of multiple sources (e.g., records from both
hospitals and primary care physicians) can help to verify diagnoses or exposure to risk
factors [11].
Performance bias can be minimized by stratification of subjects by country, center,
or surgeon in a surgical study [37].
Confounding
Confounding creates the impression that a risk factor is associated with an outcome,
while the observed effect is in fact produced by a third, independent factor. A confounder is
both related to exposure and a risk factor for the outcome [15,44]. For example, in
a study that investigates the risk of coffee consumption for myocardial infarction,
smoking is a confounder, as smoking is positively correlated with coffee drinking
and is a known risk factor for myocardial infarction [4]. If a confounder is positively
correlated with the exposure and the outcome, this may lead to an overestimation of
the studied association. If it is negatively correlated to either the exposure or the out-
come, this may lead to an underestimation. Note that a confounder is also predictive of
the outcome in unexposed subjects. However, since it is related to exposure, it is more
likely to be present in the exposure group and thereby introduces bias [44].
Despite its relationship with the exposure and outcome, a confounder does not
lie on the causal pathway between them. Instead, a variable that positively or nega-
tively regulates existing associations in subgroups of a population, and is therefore
causally linked to the outcome, is called an effect modifier [4]. This results in different
magnitudes of effects among subgroups. For example, hypercholesterolemia is a link
between diet and coronary artery disease and in part explains how an unhealthy diet can
lead to an increased risk for coronary artery disease. Therefore, hypercholesterolemia
is in this case not a confounder but an effect modifier. The different relationships of
confounders and effect modifiers to exposure and outcome variables are visualized in
Figure 16.4.
One special type of confounding is called confounding by indication. It reflects the
fact that certain treatments (e.g., a surgical procedure or escalation treatment with a
medication that is very potent but may cause serious side effects) will more likely be
indicated in a carefully pre-selected subgroup (e.g., patients with a more severe
disease course). Outcomes, such as increased mortality, may be mistakenly attributed
to treatment, although they are in reality causally linked to disease severity [4,38].
One example is a prospective cohort study that assessed the risk of cardiovascular
disease in patients with rheumatoid arthritis exposed to glucocorticoids [45]. While
it seemed that glucocorticoids were associated with a higher incidence of cardiovas-
cular disease, this finding could not be confirmed after adjusting for disease activity
and severity. Rather, use of glucocorticoids was associated with higher disease ac-
tivity, which by itself seemed to be associated with a higher risk for cardiovascular
disease.
Confounders need to be anticipated in any observational study in order to
allow proper data interpretation. This is especially important in extreme cases of
confounding, such as Simpson’s paradox (see Box 16.4), in which the true direc-
tion of the association becomes reversed. P-values alone are not suited to identify confounders, because effect sizes may be minor, although statistically significant, and not necessarily clinically meaningful [15].

Box 16.4 Simpson's Paradox
Simpson’s paradox is a special situation in which the association between two variables
changes its direction due to the presence of a predictor [50]. This may occur if the size of
treatment arms is largely unbalanced with respect to this predictor. Simpson’s paradox has
been exemplified by Baker and Kramer using hypothetical data [51]: In a clinical trial that
compares treatment A and B, treatment A results in better survival than treatment B in both
men (60% vs. 50%) and women (95% vs. 85%; see Figure 16.5). However, when data for
women and men are aggregated into one dataset, treatment A seems to be associated with
worse survival than treatment B (72% vs. 80%). This happens because two conditions are
true: (1) women survive better than men in both treatment groups, and (2) a higher fraction
of women than men is subjected to treatment B.
Figure 16.5 Survival by treatment, overall and stratified by sex (hypothetical data from Baker and Kramer [51]; the sex-specific counts are implied by the percentages above and 300 patients per arm).

                        Survived   Died   Survival
Overall    Treatment A     215      85       72%
           Treatment B     241      59       80%
Men        Treatment A     120      80       60%
           Treatment B      20      20       50%
Women      Treatment A      95       5       95%
           Treatment B     221      39       85%
No statistical criterion can guide the decision whether the conclusions drawn from the
overall analysis or those drawn from the stratified data are the correct ones. Instead,
only the understanding of causal interactions between variables will help to determine clin-
ical meaning from data [52].
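The reversal in Box 16.4 can be reproduced in a few lines of code. The following minimal Python sketch (not from the original text; it uses the hypothetical counts from Figure 16.5) prints the stratified and pooled survival rates:

```python
# Simpson's paradox with the hypothetical counts from Figure 16.5:
# (survived, total) per (treatment, sex) stratum.
data = {
    ("A", "men"): (120, 200), ("A", "women"): (95, 100),
    ("B", "men"): (20, 40),   ("B", "women"): (221, 260),
}

# Within each sex stratum, treatment A shows better survival ...
for sex in ("men", "women"):
    a = data[("A", sex)]
    b = data[("B", sex)]
    print(sex, f"A: {a[0] / a[1]:.0%}", f"B: {b[0] / b[1]:.0%}")

# ... but pooling the strata reverses the direction of the association,
# because women (who survive better) are concentrated in arm B.
for arm in ("A", "B"):
    s = sum(v[0] for k, v in data.items() if k[0] == arm)
    n = sum(v[1] for k, v in data.items() if k[0] == arm)
    print("overall", arm, f"{s / n:.0%}")
```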
Control of Confounding
Confounding can be addressed both in the phase of study design (restriction, matching) and
in data analysis (stratification, statistical modeling) [29,46].
Restriction
Inclusion criteria can be restricted to exclude the confounding variable from the study
population [4,46]. For example, if tobacco use is expected to be a confounder, all
smokers will be excluded. This consequently reduces generalizability with respect to
the subpopulation in which the confounder is present. Additionally, it may become
difficult to recruit subjects, if the excluded confounder is highly prevalent in the study
population (e.g., smoking in patients with psychiatric diseases).
Matching
Matching is a sampling strategy that is most commonly used in case-control studies
[4]. It assures that potential confounding variables are equally distributed among
study groups. Cases and controls are sampled in a way that they have similar values
of potential confounders. For example, if age and sex are considered confounding
variables, one will include a 60-year-old female control subject when a 60-year-old
female case subject has been recruited. Matching is commonly done for constitu-
tional factors, such as age and gender. Especially when matching is done for too many
variables, it can be an expensive process in terms of costs and effort, and may even lead
to an exclusion of cases if no corresponding control can be found [4]. It is important
to avoid overmatching, which occurs when controls and cases become too similar with
respect to exposure [15]. It arises when (1) the variable used for matching is in reality
part of the causal pathway between exposure and disease, or (2) the variable is asso-
ciated with exposure (but not linked to the disease). Overmatching leads to a reduced
power to detect statistically significant differences [4].
Stratification
Stratification involves the separate analysis of data according to different levels of a
variable, which represent homogenous categories called strata [15]. Strata can be
built for both dichotomous (e.g., smoking) and continuous (e.g., age) variables. In the
latter case, strata boundaries should be chosen according to what is considered clinically meaningful,
while keeping levels of the confounding variable homogenous so that data within the
stratum can be considered unconfounded. Stratified variables and strata per variable
are generally limited to a low number; otherwise, there is danger that some strata may
remain empty or contain extremely sparse data, and a much larger sample size will be
required to prevent this.
Unlike regression-based methods, statistical analysis of stratified data is not
based on a hypothetical model of the nature of the association and is therefore less de-
pendent on assumptions [47]. Estimates can be calculated separately for each stratum,
and these stratum-specific estimates may be pooled to obtain an adjusted estimate
that is weighted by strata size. This is commonly achieved by using Mantel-Haenszel
odds ratios or risk ratios. Information on the strength of confounding can be derived
by comparing the crude (unadjusted) estimate with the adjusted estimate.
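To illustrate, here is a minimal Python sketch (our own, with hypothetical counts) that computes a Mantel-Haenszel pooled odds ratio across two strata and compares it with the crude odds ratio from the collapsed table; a clear difference between the two estimates suggests confounding by the stratification variable. Tested implementations exist as well, e.g., StratifiedTable in the statsmodels package.

```python
# Mantel-Haenszel pooled odds ratio over 2x2 strata.
# Each stratum is (a, b, c, d):
#   a = exposed cases,   b = exposed non-cases,
#   c = unexposed cases, d = unexposed non-cases.
def mh_odds_ratio(strata):
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

# Hypothetical data: stratum 1 = smokers, stratum 2 = non-smokers.
strata = [(40, 60, 20, 80), (10, 90, 10, 190)]
print(mh_odds_ratio(strata))          # adjusted OR, about 2.48

# Crude OR from the collapsed table (50, 150, 30, 270) is 3.0:
a, b, c, d = (sum(col) for col in zip(*strata))
print((a * d) / (b * c))              # inflated by confounding
```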
Statistical Modeling
Regression analysis uses a hypothetical model to estimate the relationship between
an outcome and several explanatory variables, one of which is the exposure. A main
advantage is that this procedure can be used to control for many variables simultane-
ously. For example, in a study that investigates the risk of hyperlipidemia for myocar-
dial infarction, a multiple logistic regression model may be employed to adjust for age,
sex, tobacco use, hypertension, diabetes, and family history of cardiovascular events.
Similarly, a Cox proportional hazards model could be used to adjust for all these
factors when investigating time-to-event (“survival”) outcomes [15].
In regression analysis, one can take advantage of the full information provided by
continuous variables, whereas categorizing them into strata usually discards part of
that information [48]. Importantly, control of confounding by regression modeling strongly
depends on how well the model reflects reality. If the model is based on incorrect
assumptions, confounding will remain, at least residually [14,45]. Additionally,
results from regression modeling cannot be intuitively understood because complex
relationships are condensed to few numbers, and therefore the choices made during
regression modeling should be well explained in any publication.
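To make the procedure concrete, the following minimal Python sketch (simulated, purely hypothetical data; the variable names and effect sizes are ours) builds a data set in which smoking confounds the association between hyperlipidemia and myocardial infarction, and then compares the crude and adjusted odds ratios from logistic regression:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({"age": rng.normal(60, 10, n),
                   "smoking": rng.integers(0, 2, n)})
# Smokers are made more likely to have hyperlipidemia -> confounding.
df["hyperlipidemia"] = rng.binomial(1, 0.2 + 0.3 * df["smoking"])
# True model: age, smoking, and hyperlipidemia all raise MI risk.
lp = -7 + 0.05 * df["age"] + 0.8 * df["smoking"] + 0.6 * df["hyperlipidemia"]
df["mi"] = rng.binomial(1, 1 / (1 + np.exp(-lp)))

crude = smf.logit("mi ~ hyperlipidemia", data=df).fit(disp=0)
adjusted = smf.logit("mi ~ hyperlipidemia + age + smoking", data=df).fit(disp=0)
print(np.exp(crude.params["hyperlipidemia"]))     # inflated by confounding
print(np.exp(adjusted.params["hyperlipidemia"]))  # close to exp(0.6) ~ 1.8
```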
Propensity scores are used to estimate the probability of receiving treatment for
each subject based on covariates, and to balance treatment groups to obtain less biased
estimates [46,48,49]. Matching, stratification, weighting, or regression modeling can
all be implemented in propensity score models. Additionally, confounding by indi-
cation can be addressed by this method. (More detailed information on propensity
scores is provided in Chapter 17.)
Adjusting for confounders usually results in considerably larger sample sizes; the
required increase depends, among other factors, on the prevalence of the confounder
and the strength of its association with the exposure and the outcome [4,53,54].
Another issue in sample size estimation is loss to follow-up of subjects during the
study. Its extent should be estimated a priori and incorporated into the sample size
calculation [55].
If required, sample size can be minimized by using continuous rather than di-
chotomous variables, using paired measurements, choosing more precise variables or
employing unequal group sizes [4,56].
Sample size calculation for observational studies can be achieved with two
main approaches: either it is based on power (when the analysis will be based on
statistical testing) or on precision (when the analysis will be based on confidence
intervals) [15].
To calculate sample size for means based on power, the following formula can be
used [56,59]:
\[ n = \frac{(Z_\beta + Z_{\alpha/2})^2\,\sigma^2}{(\mu_2 - \mu_1)^2}, \]
where n is sample size, Zβ is the Z statistic for the desired power, Zα/2 is the Z statistic
for the desired significance level, σ is the assumed standard deviation, and μ1 and μ2
are the two estimated means.
Sample size for proportions based on power can be derived using the formula
[55,59]:
\[ n = \frac{(Z_\beta + Z_{\alpha/2})^2\,[\,p_1(1 - p_1) + p_2(1 - p_2)\,]}{(p_2 - p_1)^2}, \]
where n is sample size, Zβ is the Z statistic for the desired power, Zα/2 is the Z statistic
for the desired significance level, and p1 and p2 are the two estimated proportions.
To calculate sample size for means based on precision, the following formula can be
used [56,60]:
\[ n = \frac{Z_{\alpha/2}^2\,\sigma^2}{d^2}, \]
where n is sample size, Zα/2 is the Z value for the corresponding significance level
(Z = 1.96 for the commonly used significance level of 0.05), σ is the assumed standard
deviation, and d is the desired precision (i.e., the half-width of the confidence interval).
Sample size for proportions based on precision can be calculated with the following
formula [56,60]:
\[ n = \frac{Z_{\alpha/2}^2\,p(1 - p)}{d^2}, \]
where n is sample size, Zα/2 is the Z value for the corresponding significance level
(Z = 1.96 for the commonly used significance level of 0.05), p is the estimated popula-
tion proportion, and d is the desired precision (i.e., the half-width of the confidence interval).
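The four formulas translate directly into code. Below is a minimal Python sketch (the function names are ours); the last lines also show the a priori inflation for anticipated loss to follow-up mentioned earlier:

```python
import math
from scipy.stats import norm

def z(p):                       # upper-tail standard normal quantile
    return norm.ppf(1 - p)

def n_means_power(mu1, mu2, sigma, alpha=0.05, power=0.80):
    return (z(1 - power) + z(alpha / 2)) ** 2 * sigma ** 2 / (mu2 - mu1) ** 2

def n_props_power(p1, p2, alpha=0.05, power=0.80):
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return (z(1 - power) + z(alpha / 2)) ** 2 * var / (p2 - p1) ** 2

def n_mean_precision(sigma, d, alpha=0.05):
    return z(alpha / 2) ** 2 * sigma ** 2 / d ** 2

def n_prop_precision(p, d, alpha=0.05):
    return z(alpha / 2) ** 2 * p * (1 - p) / d ** 2

n = n_props_power(0.10, 0.20)        # about 197 per group
print(math.ceil(n))
print(math.ceil(n / (1 - 0.15)))     # inflated for 15% expected dropout
```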
Critics of sham surgery point to the risks of the sham procedure itself and the need for general anesthesia in case of some procedures [65]. They argue that
the principle of minimizing harm for subjects in clinical studies would be violated.
Other researchers argue that it may even be unethical and a “double standard”
not to perform a rigorous RCT to answer important clinical questions in surgery, as
long as the equipoise principle is met and informed consent is carefully obtained [66,67].
Sham surgery would make blinding of patients possible and therefore abolish one im-
portant source of bias. However, these authors stress that the decision to choose sham
surgery as a control condition should be made individually for any study. Evaluation
of scientific value, methodological rationale for sham control, risk-benefit assessment,
informed consent, and the consequences of not performing a sham-controlled trial are
suggested criteria to guide this decision [66].
1. Without randomization, it can remain unclear whether a better outcome in patients assigned to the new treatment group was causally linked to the procedure
itself or to an initially better prognosis of this study group.
2. Blinding is difficult to achieve in surgical studies, and loss of blinding may result in
significant bias. Apart from the non-blinded operating surgeon and assisting staff,
study investigators and patients may guess group allocation due to the location
or dimension of the scar, different postoperative care, or clues from diagnostics,
such as radiographs [70]. Certainly, it should be a general rule to blind as many
individuals as possible: definitely the statisticians, and possibly the outcome raters,
any staff providing patient care who do not need to know treatment allocation (the
anesthesiology team, ward nurses, physiotherapists, pharmacists, etc.), and the pa-
tient. For example, scars can be concealed and, more creatively, radiographs can
be digitally altered, as long as this does not preclude proper assessment of the out-
come [71]. Finally, bias can be reduced by choosing “hard” outcome measures,
such as death or recurrence of a disease [62].
3. Skills of the performing surgeons can influence outcomes and introduce perfor-
mance bias, if they are unbalanced in the study groups [37]. This can even occur
if the same surgeon performs both surgical interventions in a study that compares
two surgical techniques, as the surgeon may be more used to performing one of
the procedures. This problem can be addressed, for example, by allocating study
groups by surgeons [37,61]. If only highly specialized surgeons perform a proce-
dure within a study, resulting data may not be generalizable. One solution to this
issue can be a multi-center study in which surgeons of different skill levels perform
the intervention [64].
At a planned interim analysis, the Data and Safety Monitoring Board (DSMB) of the COSS study analyzed completed outcomes of 194 participants (out of an estimated
sample size of 372 patients) for futility. The primary end point was “the combination of
(1) all stroke and death from surgery through 30 days after surgery and (2) ipsilateral is-
chemic stroke within 2 years of randomization” [73]. The two-year rates of recurrence
of ipsilateral stroke were 21.0% for the surgical group and 22.7% for the non-surgical
group. The confidence interval for the detected difference of 1.7% included the null hy-
pothesis (95% CI, −10.4% to 13.8%). Within 30 days after surgery, 14.4% subjects in
the surgical group and 2.0% subjects in the control group had an ipsilateral ischemic
stroke, which made a difference of 12.4% between groups (95% CI, 4.9% to 19.9%). The
trial was prematurely terminated in 2010 based on these results. The decision was fur-
ther explained by the authors in the discussion of the original publication: “The DSMB
considered redesigning the trial to detect a smaller absolute difference of 10% in favor
of surgery. This would have required increasing the overall sample size from 372 to 986
to achieve 80% power. The DSMB recommended stopping the trial, citing that (1) the
prespecified statistical boundary for declaring futility had been crossed using the de-
sign effect size and, (2) given the unexpected relatively low rate of observed primary
end points in the nonsurgical group, a clinically meaningful difference in favor of surgery
would not be detectable without a substantial increase in sample size, which was not
feasible” [73].
Stockholm—Sao Paulo
Professor Mauro Tufo was glad. Dr. Fernando Martins was excited. Dr. Frida Abba was
apprehensive.1
The three of them would meet in a couple of days. Everything started 15 years
ago when Professor Tufo did his postdoctoral fellowship in the Department of Public
Health at the University of Stockholm in Sweden just after finishing his PhD in Clinical
Epidemiology at the University of Sao Paulo in Brazil. At that time, he was already a
brilliant psychiatrist with a solid foundation in social psychiatry, interested in contin-
uing his studies in the Swedish Cohort of Mental Health. When he moved back to
Brazil, he carried with him the dream of making a similar study in his home country.
Many years passed, and finally after settling down with all the necessary support, he
was able to establish the Sao Paulo Mental Health Cohort (SP-MH) five years ago.
Since then, Prof. Tufo has published several excellent papers and has earned the re-
spect of the local and international community in the field of psychiatry.
Six months back, he felt that a cycle was complete when he received an email from
his former mentor, Bjørn Andersson: “Hi, Mauro. Great to meet you in California last
week. I have been thinking and believe that I have the perfect postdoctoral fellow to
help us in Brazil. My best student is Frida, whom you met in the conference. She was
delighted with the prospect of doing her postdoc with you in Brazil. She is really
brilliant. Initially I was reluctant to let her go but I finally agreed when I realized she
would certainly learn a lot with you and this would strengthen our relationship. Let me
know your thoughts. Best, Bjørn.”
After extensive email exchanges, Mauro Tufo, Frida Abba, and Bjørn Andersson
decided that Dr. Abba would spend two years with Prof. Tufo in the SP-MH cohort.
The summer was nearly ending. Dr. Abba would arrive in two days. Prof. Tufo asked
his doctoral student, Dr. Fernando Martins, to meet her at the airport and help her
out in the first few weeks in Sao Paulo. Fernando gladly agreed to receive Frida
at the airport.
Frida arrived tired after a 24-hour journey, flying from Stockholm to Sao Paulo
with a flight connection in Barcelona. The time zone difference was six hours. She
was frightened, too—“Two years abroad! What can I expect? Should I have stayed in
Stockholm? Should I have done my postdoc in Europe, or the United States? Well, let’s
try not to think about this now. Prof. Tufo said a student of his would meet me here
in the airport.”
1. Dr. Brunoni and Professor Fregni prepared this case. Course cases are developed solely as the basis
for class discussion. Although cases might be based on past episodes, the situation in this case is
fictional. Cases are not intended to serve as endorsements or sources of primary data. All rights re-
served to the author of this case. Reproduction and distribution without permission is not allowed.
It was not difficult for Fernando to find Frida. After a brief introduction, they
connected well, and both realized that the project could gain momentum extraordinarily
quickly; there was a sense of excitement in the air. The challenge of working on a real
observational study is the dream of an epidemiologist, and Frida was not an exception
to this rule.
Observational Studies
Observational studies offer a worthwhile alternative for clinical researchers when eth-
ical or feasibility issues preclude the performance of a randomized placebo-controlled
clinical trial. These are robust designs that can provide reliable results if carefully planned
and executed. Although there are key issues with observational studies, such as lack of
blinding and poor control for unmeasured confounding, results from well-designed
observational studies might be similar to those of placebo-controlled clinical trials. A recent
review published in the New England Journal of Medicine, in which authors searched
for meta-analyses of randomized controlled trials and meta-analyses of either co-
hort or case-control studies published in five leading medical journals, showed that
well-designed observational studies do not overestimate the magnitude of the effects
of treatment when compared to results of randomized controlled trials on the same
topic.2
2. Concato J, Shah N, Horwitz RI. Randomized, controlled trials, observational studies, and the hierarchy of research designs. N Engl J Med. 2000 Jun 22; 342(25): 1887–1892.
Finally, well-designed observational studies can provide data on long-term drug
effectiveness and safety. The main types of observational studies are (1) prevalence
survey or cross-sectional study; (2) case-control study; and (3) cohort study, which
can be either prospective or retrospective.
Cohort Studies
It was 2:30 p.m. when Frida arrived at the University Hospital. Prof. Tufo was tremen-
dously relieved to see her. She said, “I had to get a taxi! I tried to get the subway but
I could not find the yellow line that goes under Avenida Reboucas.” Fernando smiled
and said, “That’s because the yellow line does not exist yet. Did you not see on your
map that the line is dashed? That means it is under construction.” Everyone laughed
and started working on the Sao Paulo cohort study.
Frida Abba was genuinely impressed with the Sao Paulo Mental Cohort Study.
They were doing state-of-the-art research in psychiatry. For example, in one ancil-
lary study—the PSYCHOSP (Psychosis in Sao Paulo)—they had been following all
patients at risk of developing psychosis since 2005. The patients had neuroimaging
studies and blood tests repeated every year, as well as complete neuropsychological
batteries and follow-up with mental health professionals. As previously decided, Frida
would work on the PSYCHOSP project.
“As I told you before, Dr. Abba,” Prof. Tufo said, “PSYCHOSP is our main project.
We already have 2,150 patients in follow-up. Our cohort has completed five years
now and we are starting to see the initial results.” After a short pause, he concluded,
“Well, as you know this is the main caveat of cohort studies—they are very costly and
they demand a long time for follow-up. In this case, we are interested in observing the
natural history of psychosis. What is the prevalence in our population? What is the
one-year incidence in our city? And naturally, we have the fundamental question in
psychiatry, a question unresolved since the first studies of Kurt Schneider: how many
patients go on to develop schizophrenia? Therefore, the main goal of PSYCHOSP is to
increase our understanding of psychotic disorders.”
“Yes, Prof. Tufo, that is the main advantage of a cohort study,” Frida said, “We start
observing patients free of the disease—in this case, we are observing patients with a
high risk of developing psychosis. I imagine you use the common risk factors, such as
family history of schizophrenia, cannabis use, social withdrawal—is that right?” As
he agreed, she continued, “So, after that, we start the follow-up of these patients . . . and
we wait. It can take 5, 10, or even 15 years until the data can tell us anything.” “That
is right, Dr. Abba. And the data are already talking—we have almost one thousand
cases now, and we are comparing these cases with those patients who have not devel-
oped the disease. So, that is the main strength: we can infer causality—i.e., that the risk
factors developed before the disease—and can thereby distinguish cause from effect.”
“Indeed, Prof. Tufo, the level of clinical evidence of cohort studies is particularly
strong and even comparable to, if not better than, randomized clinical trials. Some
studies have shown that, for the same disease, cohort studies and randomized trials
provide the same results—but overall the confidence interval observed in cohort
studies is narrower, meaning that the results are more precise. But then, tell me—
how do you handle the dropouts? Because the studies are time-consuming, follow-
up is a serious issue in almost all cohort studies. Another question is about subject
selection—how is it done? That is a potential source of bias in this type of study. For
instance, you are following subjects at high risk of developing psychosis. How can you
be sure that these subjects do not have the disease at time zero? This would be a se-
rious bias. But, do not get me wrong, I think cohort studies are one of the best types
of observational studies; however, they are expensive and have their limitations. This
is the reason why I think we should consider other options.” Frida and Prof. Tufo kept
talking and sharing ideas until evening. Prof. Tufo then invited Frida and Fernando
for a dinner in a charming Thai restaurant nearby. As a competent psychiatrist—and
also an experienced observer—Prof. Tufo could observe some sparks between his
two students. Both were young, smart, and charismatic—but they also had some nar-
cissistic traits, outbursts of irritation, and some passive-aggressive behaviors. At least
they could enjoy a pleasant evening without talking about epidemiology.
risk factor. Therefore, the risk of developing the disease due to the exposure is deter-
mined. The odds ratio is different. First, odds are another way of expressing probability: the odds of an event are the ratio of the probability that it occurs to the probability that it does not. So, if the
probability of developing a disease is 10%, then the odds of developing the disease are
1 to 9 and the odds of not developing the disease are 9 to 1. The odds ratio is, there-
fore, the ratio of the odds between two groups; for instance, let’s take the classical ex-
ample of lung cancer. Suppose that 100 patients with lung cancer are compared to 100
patients without lung cancer. Ninety patients who had lung cancer smoked, while only
10 patients without lung cancer smoked. So the odds of smoking in cancer patients are
90 to 10, while the odds of smoking in non-cancer patients are 10 to 90. Dividing one
by the other, we obtain an odds ratio of 81—that is, the odds of being a smoker among lung
cancer patients are 81 times the odds among subjects without lung cancer—an extraordinarily
strong association. But the odds ratio approximates the risk ratio of developing cancer
only if the disease is rare (<1%). If it is not, the odds ratio will overestimate the association
between the disease and the risk factor, and a cohort study is needed to estimate
the risk directly.”
Prof. Tufo realized that he had accomplished enough for that meeting and decided to
end it. He returned to his office to check his emails. Bjørn, Frida’s mentor in Stockholm,
had just sent one, “Hello, Mauro! How are you? Any news from my student? I am con-
cerned. She did not answer my last emails! Is there something wrong there? Best, Bjørn.”
Prof. Tufo replied, “Hello, Bjørn! Everything is fine here. I think Frida is so involved
in our projects that she does not have enough time to check her personal emails!”
Prof. Tufo looked through the window to the room of postdoctoral students and
saw the students interacting. He continued his email, “By the way, do you remember
Dr. Fernando Martins? He is a brilliant student and I was thinking—perhaps he could
continue his postdoc with you, Bjørn? Let me know your thoughts. Best, Mauro.”
CASE DISCUSSION
This case deals with a very interesting situation: we are introduced to an already
running, long-term observational study and are asked to plan a new observational
study on the same disease background in parallel. Let us take a closer look: On the
one hand, we have Prof. Tufo’s PSYCHOSP study, which has been running over the
last five years. It has been designed to analyze the natural history of psychosis, deter-
mine risk factors, measure prevalence, and observe the development of schizophrenia
as a consequence of psychosis. More than 2,000 patients have so far been followed up,
while the study is still continuing.
On the other hand, Dr. Frida Abba, the new postdoc from Sweden, plans her own
study. She has two ideas, which are (1) to determine prevalence in the PSYCHOSP or
an extended, population-based sample, and (2) to compare neuroimaging in healthy
and psychotic as well as diagnosed schizophrenic subjects. Her postdoc is, according
to the current plan, limited to two years. Dr. Abba’s task is now to choose the best de-
sign for her study.
First, she could do a cross-sectional study. The advantage is that prevalence can
be well assessed in this study design. The time frame and expenses will depend on
whether she uses the PSYCHOSP sample or a population-based sample. However, she
will not be able to address causality between risk factors or neuroradiologic findings
and psychosis. Misclassification bias is a major threat in this study design because
psychotic symptoms are, among other features, a hallmark of schizophrenia, and therefore a
thorough diagnosis of schizophrenic patients is crucial.
Second, Dr. Abba could develop a case-control study. A major advantage is that the
number of cases is enriched due to the study design and rare diseases can be studied
well. As almost 1,000 cases have already been identified in the PSYCHOSP cohort,
Dr. Abba could perform a case-control study within this cohort. Depending on her
study question, she would have to decide between the designs of a nested case-control
study or a case-cohort study. This would also require a thorough reflection on the sam-
pling of controls. Misclassification of cases can lead to serious bias in this design. While
she cannot study disease prevalence in a case-control study, she would deal with inci-
dent data from the cohort, which will allow her to address other interesting questions
(e.g., how high the incident risk associated with an exposure is). Recall bias would be
an important issue if Dr. Abba recruited a sample independently of the PSYCHOSP
study. However, within the PSYCHOSP cohort, exposures have been identified at the
beginning of the cohort study so that recall bias will not be a concern.
Third, Dr. Abba could directly use data from the PSYCHOSP cohort study or de-
sign her own cohort study. A cohort study is best suited to assess temporal association
and thus causality. This might be of importance if Dr. Abba wanted to show a causal
association between distinct neuroradiological findings and the development of psy-
chotic disorders. However, cohort studies are generally very expensive. Moreover,
psychotic disorders may take years to develop after exposure to a risk factor. In any case, if
Dr. Abba wanted to start her own cohort apart from the PSYCHOSP study, she would
need convincing reasons to do so.
All in all, Dr. Abba’s decision is largely dependent on her exact research question,
a realistic time frame and cost estimation, and the potential limitations of the study
design that she is willing to accept.
FURTHER READING
Books
dos Santos Silva I. Cancer epidemiology: principles and methods. Lyon: International Agency for
Research on Cancer; 1999.
Hulley SB, et al. Designing clinical research, 4th ed. Philadelphia: Lippincott Williams &
Wilkins; 2013.
Journal Articles
Beral V, et al. Ovarian cancer and hormone replacement therapy in the Million Women Study.
Lancet (London, England). 2007; 369(9574): 1703–1710.
Danforth KN, et al. A prospective study of postmenopausal hormone use and ovarian cancer
risk. Br J Cancer. 2007; 96(1): 151–156. Available at: http://www.ncbi.nlm.nih.gov/pubmed/17179984; http://www.nature.com/bjc/journal/v96/n1/pdf/6603527a.pdf.
Freemantle N, et al. Making inferences on treatment effects from real world data: pro-
pensity scores, confounding by indication, and other perils for the unwary in observational
research. BMJ. 2013; 347: f6409. doi:10.1136/bmj.f6409
Higgins JPT, et al. The Cochrane Collaboration’s tool for assessing risk of bias in randomised
trials. BMJ (Clinical research ed.). 2011; 343: d5928. doi:10.1136/bmj.d5928
van der Woude FJ, et al. Analgesics use and ESRD in younger age: a case-control study. BMC
Nephrology. 2007; 8: 15.
Bias
Delgado-Rodriguez M, Llorca J. Bias. J Epidemiol Comm Health. 2004; 58(8): 635–41.
Confounding
McNamee R. Confounding and confounders. Occup Environ Med. 2003; 60(3): 227–234.
Simpson’s Paradox
Bickel PJ, Hammel EA, O’Connell JW. Sex bias in graduate admissions: data from Berkeley.
Science (New York, N.Y.). 1975; 187(4175): 398–404.
Surgical Research
McCulloch P, et al. No surgical innovation without evaluation: the IDEAL recommendations.
Lancet. 2009; 374(9695): 1105–12.
REFERENCES
1. Last J. A dictionary of epidemiology, 4th ed. New York: Oxford University Press; 2000.
2. von Elm E, et al. The Strengthening the Reporting of Observational Studies in Epidemiology
(STROBE) statement: guidelines for reporting observational studies. J Clin Epidemiol.
2008; 61(4): 344–349.
3. Friis RH, Sellers T. Epidemiology for public health practice, 5th ed. Burlington, MA: Jones &
Bartlett; 2013.
4. Hulley SB, et al. Designing clinical research, 4th ed. Philadelphia: Lippincott Williams &
Wilkins; 2013.
5. Szklo M, Nieto J. Epidemiology: beyond the basics, 3rd ed. Burlington, MA: Jones &
Bartlett; 2012.
6. Rao A, Ramam M. The case for case reports. Indian Dermatol Online J. 2014; 5(4):
413–415.
7. Vandenbroucke JP. In defense of case reports. Ann Intern Med. 2001; 134(4): 330–4.
8. Carey TS, Boden SD. A critical guide to case series reports. Spine. 2003; 28(15): 1631–1634.
9. Maida V, et al. Symptoms associated with malignant wounds: a prospective case series.
J Pain Symptom Manage. 2009; 37(2): 206–211.
10. Rao A, Ramam M. The case for case reports. Indian Dermatol Online J [serial online] 2014;
5: 413–415.
11. Buring JE. Epidemiology in medicine, Vol. 515, 1st ed. Philadelphia: Lippincott Williams &
Wilkins; 1987.
12. Hymes K, et al. Kaposi’s sarcoma in homosexual men: a report of eight cases. Lancet, 1981;
2(8247): 598–600.
13. Mann C. Observational research methods. Research design II: cohort, cross sectional, and
case-control studies. Emerg Med J. 2003; 20(1): 54–60.
14. Zocchetti C, Consonni D, Bertazzi PA. Relationship between prevalence rate ratios and
odds ratios in cross-sectional studies. Int J Epidemiol. 1997; 26(1): 220–223.
15. dos Santos Silva I. Cancer epidemiology: principles and methods. Lyon: International Agency
for Research on Cancer; 1999.
16. Weintraub D, et al. Impulse control disorders in Parkinson disease: a cross-sectional study
of 3090 patients. Arch Neurol. 2010; 67(5): 589–595.
17. Schulz KF, Grimes DA. Case-control studies: research in reverse. Lancet. 2002;
359(9304): 431–434.
18. Vandenbroucke JP, Pearce N. Case-control studies: Basic concepts. Int J Epidemiol. 2012;
41(5): 1480–1489.
19. Szklo M, Nieto J. Epidemiology: beyond the basics, 3rd ed. Burlington, MA: Jones &
Bartlett; 2012.
20. Rodrigues L, Kirkwood BR. Case-control designs in the study of common diseases: updates
on the demise of the rare disease assumption and the choice of sampling scheme for
controls. Int J Epidemiol. 1990; 19(1): 205–213.
21. Doll R, Hill AB. Smoking and carcinoma of the lung: preliminary report. Bull WHO. 1999;
77(1): 84–93.
22. Doll R, et al. Mortality in relation to smoking: 50 years’ observations on male British
doctors. BMJ (Clinical Research Ed.). 2004; 328(7455): 1519.
23. Grimes DA, Schulz KF. Cohort studies: marching towards outcomes. Lancet, 2002;
359(9303): 341–345.
24. Grimes DA, Schulz KF. Bias and causal associations in observational research. Lancet.
2002; 359(9302): 248–252.
25. Katz K. The (relative) risks of using odds ratios. Archives Dermatol. 2006; 142(6): 761–764.
26. Schmidt CO, Kohlmann T. When to use the odds ratio or the relative risk? Int J Public
Health, 2008; 53(3): 165–167.
27. Mahmood SS, et al. The Framingham Heart Study and the epidemiology of cardiovascular
disease: a historical perspective. Lancet, 2014; 383(9921): 999–1008. Available at: http://
dx.doi.org/10.1016/S0140-6736(13)61752-3.
28. Wolf PA, et al. Epidemiologic assessment of chronic atrial fibrillation and risk of stroke: the
Framingham study. Neurology. 1978; 28(10): 973–977.
29. Tripepi G, et al. Bias in clinical research. Kidney Int. 2008; 73(2): 148–153.
30. Krishnan E, et al. Attrition bias in rheumatoid arthritis databanks: a case study of
6346 patients in 11 databanks and 65,649 administrations of the Health Assessment
Questionnaire. J Rheumatol. 2004; 31(7): 1320–1326.
31. Sackett DL. Bias in analytic research. J Chronic Dis. 1979; 32(1–2): 51–63.
32. Berkson J. Limitations of the application of fourfold table analysis to hospital data. Int J
Epidemiol. 2014; 43(2): 511–515.
33. Criqui MH, Barrett-Connor E, Austin M. Differences between respondents and non-
respondents in a population-based cardiovascular disease study. Am J Epidemiol. 1978;
108(5): 367–372.
34. Neyman J. Statistics: servant of all sciences. Science (New York, N.Y.). 1955; 122(3166):
401–406.
35. Dickersin K. The existence of publication bias and risk factors for its occurrence. JAMA.
1990; 263(10): 1385–1389.
36. Skegg DCG. Potential for bias in case-control studies of oral contraceptives and breast
cancer. Am J Epidemiol. 1988; 127(2): 205–212.
37. Paradis C. Bias in surgical research. Annals Surgery. 2008; 248(2): 180–188.
38. Delgado-Rodriguez M, Llorca J. Bias. J Epidemiol Comm Health. 2004; 58(8); 635–641.
39. Higgins J, et al. Chapter 8: Assessing risk of bias in included studies. In: Higgins J, Green
S, eds. Cochrane handbook for systematic reviews of interventions. 2011. The Cochrane
Collaboration. Available at: www.handbook.cochrane.org
40. Reeves B, et al. Chapter 13: Including non-randomized studies. In: Higgins J, Green S, eds.
Cochrane handbook for systematic reviews of interventions. 2011. The Cochrane Collaboration.
Available at: www.handbook.cochrane.org.
41. Abaid LN, Grimes DA, Schulz KF. Reducing publication bias through trial registration.
Obstet Gynecol. 2007; 109(6): 1434–1437.
42. Dirnagl U, Lauritzen M. Fighting publication bias: introducing the Negative Results
section. J Cereb Blood Flow Metab. 2010; 30(7): 1263–1264.
43. Goodchild van Hilten L. Why it’s time to publish research “failures.” 2015. Available at:
https://www.elsevier.com/connect/scientists-we-want-your-negative-results-too [Accessed
September 10, 2016].
44. McNamee R. Regression modelling and other methods to control confounding. Occup
Environ Med. 2005; 62(7): 500–506
45. van Sijl AM, et al. Confounding by indication probably distorts the relationship between
steroid use and cardiovascular disease in rheumatoid arthritis: results from a prospective
cohort study. PLoS One. 2014; 9(1): e87965. doi:10.1371/journal.pone.0087965.
46. Jepsen P, et al. Interpretation of observational studies. Heart. 2004; 90(8): 956–960.
47. McNamee R. Confounding and confounders. Occup Environ Med. 2003; 60(3): 227–234.
48. Freeman TB, et al. Use of placebo surgery in controlled trials of a cellular-based therapy for
Parkinson’s disease. N Engl J Med. 1999; 341(13): 988–991.
49. Okoli GN, Sanders RD, Myles P. Demystifying propensity scores. Br J Anaesth. 2014;
112(1): 13–15.
50. Simpson EH. The interpretation of interaction in contingency tables. J Roy Stat Soc. Series B
(Methodological). 1951; 13(2): 238–241.
51. Baker SG, Kramer BS. Good for women, good for men, bad for people: Simpson’s paradox
and the importance of sex-specific analysis in observational studies. J Womens Health Gend
Based Med. 2001; 10(9): 867–872.
52. Pearl J. Simpson’s paradox, confounding and collapsibility. In Causality. Cambridge:
Cambridge University Press; pp. 269–274, 2009.
53. Drescher K, Timm J, Jöckel KH. The design of case-control studies: the effect of
confounding on sample size requirements. Stat Med. 1990; 9(7): 765–766.
54. Lui, K-J. Sample size determination for case-control studies: the influence of the joint dis-
tribution of exposure and confounder. Stat Med. 1990; 9(12): 1485–1493.
55. Whitley E, Ball J. Statistics review 4: sample size calculations. Critical Care (London). 2002;
6(4): 335–341.
56. Eng J. Sample size estimation: how many individuals should be studied? Radiology. 2003;
227(2): 309–313.
57. du Prel J-B, et al. Confidence interval or p-value?: part 4 of a series on evaluation of scien-
tific publications. Deutsches Ärzteblatt Int. 2009; 106(19): 335–339.
58. Akobeng AK. Confidence intervals and p-values in clinical decision making. Acta
Paediatrica. 2008; 97(8): 1004–1007.
59. Chow S-C. Sample size calculations for clinical trials. Wiley Interdisc Rev: Comp Stat. 2011;
3(5): 414–427.
60. Hajian-Tilaki K. Sample size estimation in epidemiologic studies. Caspian J Int Med. 2011;
2(4): 289–298.
61. Lilford R, et al. Trials in surgery. Br J Surgery. 2004; 91(1): 6–16.
62. McLeod RS. Issues in surgical randomized controlled trials. World J Surgery. 1999;
23(12): 1210–1214.
63. Cook JA. The challenges faced in the design, conduct and analysis of surgical randomised
controlled trials. Trials. 2009; 10: 9. doi:10.1186/1745-6215-10-9.
64. Ergina PL, et al. Challenges in evaluating surgical innovation. Lancet. 2009;
374(9695): 1097–1104.
65. Macklin R. The ethical problems with sham surgery in clinical research. N Engl J Med. 1998;
341(13): 992–996.
66. Miller FG. Sham surgery: an ethical analysis. Am J Bioeth. 2003; 3(4): 41–48.
67. Freeman TB, et al. Use of placebo surgery in controlled trials of a cellular-based therapy for
Parkinson’s disease. N Engl J Med. 1999 Sep 23; 341(13): 988–992.
68. Barkun JS, et al. Evaluation and stages of surgical innovations. Lancet. 2009; 374(9695):
1089–1096.
69. McCulloch P, et al. No surgical innovation without evaluation: the IDEAL recommendations.
Lancet. 2009; 374(9695): 1105–1112.
70. Demange MK, Fregni F. Limits to clinical trials in surgical areas. Clinics (Sao Paulo). 2011;
66(1): 159–161.
71. Karanicolas PJ, Farrokhyar F, Bhandari M. Blinding: who, what, when, why, how? Can J
Surgery. 2010; 53(5): 345–348.
72. Wilson CB. Adoption of new surgical technology. BMJ (Clinical Research Ed.). 2006;
332(7533): 112–114.
73. Powers WJ, et al. Extracranial-intracranial bypass surgery for stroke prevention in hemo-
dynamic cerebral ischemia: the Carotid Occlusion Surgery Study randomized trial. JAMA.
2011; 306(18): 1983–1992.
17
C O N F O U N D E R S A N D U S I N G T H E M ET H O D
OF PROPENSIT Y SCORES
Author: Chin Lin
Case study authors: Rui Imamura and Felipe Fregni
It is the mark of an educated mind to rest satisfied with the degree of precision which the na-
ture of the subject admits and not to seek exactness where only an approximation is possible.
—Aristotle
INTRODUCTION
In Unit III, you have been presented with several aspects of observational studies and
their basic designs. Important concepts concerning bias and confounders, as well as
methods to address them, were explored in Chapter 16.
One of the key aspects of an observational study is the fact that researchers have
no control over treatment assignment [1, 2]. In practice, large differences on observed
covariates may exist between treated and non-treated (control) groups. These
differences can lead to biased estimates of treatment effects: an apparent effect could
be established when actually there is none, or a true effect could remain hidden instead
of being observed [2, 3]. In contrast, randomizing patients to treatment allocation, as
is done in experimental studies, is a very efficient method to reduce bias and potential
confounding by balancing groups with regard to known and unknown variables, and
thus reduce their influence on the interpretation of results.
In order to decrease the influence of confounding variables, when planning an
observational study it is highly recommended to list the potential characteristics of
patients that may impact outcome, attempt to record them, and propose a method to
control the bias resulting from them [3].
In this chapter, we will discuss one of the most robust methods to reduce the im-
pact of bias generated by group imbalance, thereby greatly increasing the va-
lidity of observational studies: the propensity score.
DEFINITIONS
Propensity Score
In the historical article by Rosenbaum and Rubin [4], the authors provided this
definition: “The propensity score is the conditional probability of assignment to a par-
ticular treatment given a vector of observed covariates” (p. 41). Intuitively, it works as
a balancing score and measures the tendency of a subject to be in the “treated” group
(or more generally, in the group with exposure of interest) considering his or her
observed background (pre-treatment) covariates. This score is frequently estimated
by logistic regression where the treatment variable is the outcome and the covariates
are the predictor variables in the model [5]. The propensity score tries to mimic some
aspects of randomized trials by balancing patients’ characteristics: conditional on the
propensity score, the distribution of observed baseline covariates is expected to be
similar between the treated and untreated groups.
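As a concrete illustration, the following minimal Python sketch (simulated data; all names are ours) estimates propensity scores by logistic regression, with treatment assignment as the model outcome and the pre-treatment covariates as predictors:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                 # pre-treatment covariates
# Assignment is non-random: it depends on the first covariate.
z = rng.binomial(1, 1 / (1 + np.exp(-1.5 * X[:, 0])))

ps_model = LogisticRegression().fit(X, z)
e_x = ps_model.predict_proba(X)[:, 1]          # e(X) = Pr(Z = 1 | X)
print(e_x[:5])
```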
Bias
Bias is the systematic deviation (or error) of measurements or inferences/conclusions
that are different from the truth. In a clinical study, bias may be introduced during: (1)
conception and design of the study; (2) data handling, collection, analysis, interpreta-
tion, reporting, or review processes [10].
To control for confounding, there are two principal ways to achieve this goal: prevention in the planning phase by
restriction or matching; and statistical analysis adjustment in data handling by stratifica-
tion or multivariate regression modeling [4].
Restriction
In this method, also known as specification, the study population is restricted to those
subjects with a specific value of the confounding variable. We can perform the restriction
by determining specific exclusion criteria for the study; thus the potential confounders
are eliminated. A disadvantage of this method is that findings cannot be generalized to
those subjects left out by the restriction.
Matching
The matching process constrains subjects in different exposure groups to have the same values
of potential confounders. Samples are drawn conditionally from the populations,
ensuring that these characteristics are similarly distributed across groups (in propensity
score matching, on the basis of the propensity score). Matching is commonly used in case-control studies, but can also be used
in cohort studies. With increasing number of matching variables, the identification of
matched subjects becomes progressively demanding, and matching does not reduce
confounding by factors other than the covariates used for the matching.
Stratification
Stratification is also known as sub-classification. The basic idea is to divide study
subjects—treated and non-treated—into a number of subgroups (or strata) within
the covariate, so that subjects within a stratum will share the same characteristics.
Stratification is important because it provides a simple means to display data, to
measure an unconfounded estimate of the effect of interest, and to examine the
presence of effect modification.
Within each stratum, a simple comparative statistic is calculated, and the results
for both groups are compared. If there are many potential covariates, this
method quickly becomes impractical, due to the overwhelming number of required strata,
which may also impact the number of subjects within each stratum, as this method
requires that the resultant strata be large enough to yield conclusive results [7].
For both matching and stratification, there is an additional disadvantage when
dealing with continuous variables, as these variables have to be recoded into categories,
which may lead to the use of arbitrary criteria during the process.
Multivariate Regression Modeling
Multivariate regression models adjust simultaneously for several prognostic factors and potential covariates. The effect of the exposure on the outcome
is estimated, based on the similarity of the covariates between the exposed and refer-
ence patients. Frequently used methods are the Cox proportional hazard model (sur-
vival analysis), and the logistic and linear regression models [12].
An important disadvantage of these methods is the risk of extrapolation when too
many covariates are included in the analysis, which may result in errors in the estimation
of the effects of the treatment of interest. In the literature, a ratio of 10–15 subjects or
events per independent variable in the model is desired [12].
METHOD OF PROPENSITY SCORES
Theoretical Background
There are two basic steps to perform a propensity score (PS) analysis. First, a
model to predict the exposure is built (treatment model); then a model including pro-
pensity score information is constructed (outcome model) to evaluate the associa-
tion between exposure and outcome [13]. In the treatment model, each study
subject’s pre-treatment covariates are summarized by (replaced with) a single index. This “new covariate,” or
expected probability, is the person’s propensity score. In theory, it is expected that
with increasing sample size the pre-treatment covariates are balanced between study
subjects from the two exposure groups who have nearly identical PS [14].
Consider the formula
\[ e(X) = \Pr(Z = 1 \mid X), \]
where Z indicates treatment assignment (Z = 1 for treated subjects) and X is the vector of observed pre-treatment covariates; e(X) is the propensity score.
Matching
Matching is a technique used to select control subjects who are similar to the treated
subjects. This similarity across groups is achieved by controlling several baseline char-
acteristics that are thought to have a potential impact on the outcome. It is useful in
situations when there is a limited number of patients in the treated group and a larger
(often much larger) number of control patients [5].
It is often difficult to find subjects who are perfectly similar (i.e., that can be
matched) on all important covariates, even if there are only a few background
covariates of interest. Propensity score matching will then be a method that allows an
investigator to control simultaneously for many background covariates by matching
on a single scalar variable.
There are several matching techniques that can be performed. Mahalanobis
metric matching [15] is a common one. It is performed by first randomly ordering
the subjects, and then the distance between the first treated subject and all controls is
calculated. The distance between a treated subject (i) and a control subject (j) is de-
fined by the Mahalanobis distance $d(i, j) = (u - v)^{T} C^{-1} (u - v)$, where $u$ and $v$ are
the values of the matching variables for treated subject (i) and control subject (j), respectively, and C
is the sample covariance matrix of the matching variables from the full set of control
subjects. The control subject (j) with the minimum distance d (i, j) is chosen as the
match for treated subject (i), and both of them are removed from the pool. This pro-
cess is repeated until matches are found for all treated subjects.
The major disadvantage of this technique is the difficulty of finding close matches
when there are many covariates included in the model. As the number of covariates
increases, the average distance between observations increases as well.
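A minimal Python sketch of this greedy procedure is given below (our own illustration; it assumes the treated subjects have already been randomly ordered and that controls outnumber treated subjects):

```python
import numpy as np

def mahalanobis_match(treated_X, control_X):
    """Greedy 1:1 Mahalanobis metric matching without replacement."""
    C_inv = np.linalg.inv(np.cov(control_X, rowvar=False))
    available = list(range(len(control_X)))    # pool of unmatched controls
    matches = {}
    for i, u in enumerate(treated_X):
        diffs = control_X[available] - u
        # d(i, j) = (u - v)^T C^{-1} (u - v) for every remaining control j
        d2 = np.einsum("ij,jk,ik->i", diffs, C_inv, diffs)
        matches[i] = available.pop(int(np.argmin(d2)))
    return matches
```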
There are three techniques proposed by Rosenbaum and Rubin for constructing a
matched sample using the propensity scores:
a) Nearest available matching on the estimated propensity score
This method consists of first randomly ordering the subjects in the treated and non-
treated groups. Then the first treated subject is matched to a subject with the closest
propensity score from the non-treated group. After this, both subjects are removed
from the pool, and the next patient from the treated group is selected.
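A minimal Python sketch of this nearest available matching on the propensity score (our own illustration; subjects are assumed to be pre-shuffled):

```python
import numpy as np

def nearest_ps_match(ps_treated, ps_control):
    """Greedy 1:1 matching on the propensity score without replacement."""
    available = list(range(len(ps_control)))
    pairs = []
    for i, p in enumerate(ps_treated):
        dist = np.abs(np.asarray(ps_control)[available] - p)
        pairs.append((i, available.pop(int(np.argmin(dist)))))
    return pairs
```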
b) Mahalanobis metric matching including the propensity score
In this variant, Mahalanobis metric matching as described in the previous section is performed with the estimated propensity score added as one of the matching variables.
c) Nearest available Mahalanobis metric matching within calipers defined by propensity score
This method is a hybrid of the previous two techniques; first, subjects in the treated
group are randomly ordered, and then a subset of potential non-treated subjects
whose propensity scores are near to the ones on the treated group (“within calipers”)
is determined. The subject from the non-treated group is selected from this subset by
using nearest available Mahalanobis metric matching. The caliper size is determined
by the investigator, and the recommendation is to keep the size of the caliper to 1/4 of
the standard deviation of the logit of the propensity score.
Rosenbaum and Rubin suggested that nearest available matching on the estimated
propensity score is the easiest technique, and that nearest available Mahalanobis
metric matching within calipers defined by the propensity score produces the best
balance between the covariates in the treated and control groups [7].
Stratification
Stratification or subclassification consists of ordering subjects into subgroups (strata) de-
fined by certain background covariates. After the definition of the strata, treated and control
subjects who are in the same stratum can be compared directly.
According to Cochran [16], approximately 90% of the bias can be removed by creating five strata. However, there is a natural problem with subclassification [17]: the number of strata grows exponentially as the number of covariates increases [18].
The propensity score is a scalar summary of all the observed background covariates;
therefore, the stratification method can balance the distributions of the covariates in
the treated and control groups without the undesirable increase in number of strata.
Ideally, a perfect stratification based on the propensity score will produce strata in which the average treatment effect is an unbiased estimate of the true treatment effect. Usually, in order to perform this stratification, the propensity score is estimated by logistic regression (with treatment assignment as the binary dependent variable) or by discriminant analysis. The investigator must then determine the cut-off points for the boundaries of the strata, and also whether these will be based on the values of the propensity score in the treated and control groups combined or in the treated group alone. A common suggestion is to use the quintiles or deciles of the propensity score of the treatment and control groups combined.
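Following the quintile suggestion above, here is a minimal Python sketch with simulated data (all names and numbers are ours, not from the chapter): the propensity score is fitted by logistic regression with treatment assignment as the binary dependent variable, subjects are grouped into quintile strata, and treated and control subjects are compared within each stratum.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 1000
X = rng.normal(size=(n, 3))                               # background covariates
treated = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))     # confounded assignment
y = 2.0 * treated + X[:, 0] + rng.normal(size=n)          # outcome, true effect 2.0

# Propensity score: P(treated | covariates), fitted by logistic regression.
ps = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]

# Five strata at the quintiles of the propensity score
# (treated and control groups combined).
edges = np.quantile(ps, [0.2, 0.4, 0.6, 0.8])
stratum = np.digitize(ps, edges)

# Within each stratum, treated and control subjects are compared directly;
# the stratum-specific differences are then averaged (unweighted, for simplicity).
diffs = []
for s in range(5):
    m = stratum == s
    if treated[m].any() and (~treated[m].astype(bool)).any():
        diffs.append(y[m & (treated == 1)].mean() - y[m & (treated == 0)].mean())
print("stratified estimate of the treatment effect:", np.mean(diffs))
```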
A Life-Threatening Experience
Professor Minoru has recently returned to work after a three-month medical leave to treat his prostate cancer. He underwent a prostatectomy that did not go well: he developed a local infection that progressed to septicemia, and he spent almost a month in the ICU (intensive care unit). Although he made a good recovery and is now able to return to work, he was uncomfortable with the ICU management, as he felt that procedures there did not follow evidence-based medicine. Professor Minoru has dedicated his entire life to offering his patients treatments with the best evidence, and he was not comfortable with the treatment offered to him in the ICU.
Prof. Minoru is a calm and methodical physician—he is frequently described
as a cold person. He makes decisions and takes action wholly on the basis of logic.
Evidence-based medicine was therefore for him the foundation of medicine. Prof.
Minoru is a well-known professor of internal medicine in Osaka, Japan. He leads a
large team of physicians in the largest and busiest hospital in Osaka.
He decided to propose to the Japanese Ministry of Health the idea of a Task Force to determine the value of medical procedures in the ICU based on evidence-based medicine. His goal is to produce a document that could serve as a set of recommendations for medical care in ICUs throughout Japan. In fact, the government
was interested and helped gather first-class specialists from the entire country to join
this Task Force.
of pulmonary artery. The physician must therefore balance the pros and cons in each patient in order to decide whether or not to use it.
Prof. Minoru recognizes this opportunity as one of the most important of his career. A positive evaluation by his peers will definitely place him among the medical leaders of his country; conversely, a negative one may leave him in the shadows for a long time.
He schedules a meeting with Dr. Tanabe and his research team: one associate pro-
fessor of cardiology (Professor Shiro Yasuda, a young but experienced clinician) and
three of his postdoc students (Drs. Dan Yoshida, Hideki Ueno, and Liang Chen). He
has an idea of how to conduct a study to assess the evidence of RHC, but he wants
to discuss the problem with his research team and Dr. Tanabe. In addition, Professor
Minoru likes to challenge his postdoc fellows.
Dr. Ueno, feeling the pressure of being a new fellow, is afraid to disagree, but he decides to defend his position: "I understand that there are ethical concerns involved, but, as you know, observational studies lack the scientific rigor of randomized trials, which might lead to biased results. How do you guarantee comparability of the groups at baseline without randomization? We might have strong selection bias. It's like comparing apples to oranges. I believe we are at a crossroads between ethics and science. Which road should we take?"
Prof. Yasuda, a more experienced clinical researcher, then proceeds with a more
detailed explanation:
You got right to the point. If not interpreted carefully, observational studies may lead to biased results, and history has shown that faulty conclusions and recommendations for medical and public-health policy can follow. A typical example in the literature is the use of estrogen as hormone replacement therapy (HRT) in post-menopausal women. In 1985, the observational Nurses' Health Study reported that women taking estrogen had only a third as many heart attacks as women who did not. For the next several years, HRT became one of the most popular drug treatments in America. By the end of the last century and the beginning of this one, two clinical trials (HERS and WHI) concluded that, on the contrary, HRT constituted a potential risk for postmenopausal women, with increased risks of heart disease and stroke. How many women may have died prematurely or suffered heart attacks or strokes because they were taking HRT, which was supposed to protect them against heart disease, is unknown; tens of thousands would be a reasonable estimate. Why did the conclusions of these studies differ so much? To understand this, we have to consider the influence of confounders biasing the results. Our main task, then, if we decide to keep the retrospective cohort design, will be how to control for confounders.
In the Nurses' Health Study, the main issue was that the nurses who spontaneously adopted HRT were those with consciously healthy habits, and thus less prone to cardiac events. This is known as the healthy-user bias. Although the possibility of confounding was raised by the authors, because the magnitude of the effect between groups was large, the outcome (fewer cardiac events) was attributed to the exposure (HRT). The results of this study may have motivated the widespread use of HRT for protection against cardiac disease. This is in fact an important issue, especially for physicians who do not know what confounding is and how it may affect the results of a given trial.
Thank you, Shiro! Those were really helpful considerations. I agree with you: we may keep the retrospective cohort design, but we will have to control for confounders. Although other methods are available, outcome models and propensity score analysis are the most commonly used methods to achieve this goal. Briefly, outcome modeling is the way most statisticians address the issue. It allows one to calculate a coefficient for each identified risk factor, which represents the effect of that factor on the outcome, adjusting for the other factors in the model. Propensity score (PS) analysis, on the other hand, creates a model that reflects the risk factors' effects on the EXPOSURE (in our case, RHC would be the exposure). The propensity score becomes a single summary variable that predicts the probability of receiving the intervention as a function of the confounders. By the way, we have advanced quite a lot this morning. Can we take a break and return in the afternoon to continue this discussion?
Prof. Minoru was not happy with this interruption, but he decided to agree: "OK. I would like you, the postdoc fellows, to remind us of the methodology of propensity scores, and of the pros and cons of outcome regression versus propensity scores. Is that OK for everybody? Let us meet after lunch."
During the break, Drs. Ueno, Yoshida, and Chen went to their offices and started searching the Internet for the information they needed. In the afternoon, they were ready and eager to show their progress on the topic. Dr. Yoshida starts after waiting 10 minutes for Dr. Tanabe to arrive:
becoming comparable. So, after defining propensity scores for each study subject
we may:
1. Match on propensity scores, using some algorithm (greedy or optimal matching);
2. Stratify on propensity scores;
3. Control for propensity scores in an outcome model; and
4. Weight by propensity scores (a minimal weighting sketch follows this list).
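As a rough illustration of option 4, here is a minimal Python sketch (our own, with simulated data and hypothetical variable names; not from the case) that estimates the propensity score by logistic regression and weights each subject by the inverse probability of the treatment actually received:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n = 1000
X = rng.normal(size=(n, 3))                               # confounders
treated = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))     # confounded exposure
y = 2.0 * treated + X[:, 0] + rng.normal(size=n)          # true effect is 2.0

ps = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]

# Inverse-probability-of-treatment weights:
# 1/ps for treated subjects, 1/(1 - ps) for controls.
w = np.where(treated == 1, 1 / ps, 1 / (1 - ps))

est = (np.average(y[treated == 1], weights=w[treated == 1])
       - np.average(y[treated == 0], weights=w[treated == 0]))
print("IPW estimate of the treatment effect:", est)
```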
Prof. Minoru quickly gave positive feedback to Dr. Yoshida. “Great summary,
Dr. Yoshida. What about advantages and disadvantages of each method? Could you
tell us something about it, Dr. Chen?”
Dr. Chen, always concerned about speaking in public, goes ahead:
Dr. Ueno was itching to reply, and he finally had the chance to do so: "On the other hand, propensity scores obscure the identification of interactions between treatment and confounders. Furthermore, if matching by propensity scores is the method chosen, we do not use all patients in the analysis, only those who could be matched in both groups. That means we will lose power in the analysis. Advantages of outcome regression models include allowing us to estimate the effect size of each confounder and also to identify interaction effects between treatment and confounders."
Dr. Yoshida decides to side with Dr. Chen's position and adds,

That is true. However, there are some drawbacks of outcome regression models as well. Diagnostics for regression (residual plots, measures of influence, etc.) are not as straightforward as for propensity scores (i.e., simply checking for balance in baseline characteristics between comparison groups). Outcome regression models also do not allow a separation of modeling and outcome analysis, as propensity scores do. Moreover, because the model is built on the outcome, the outcome may influence the choice of covariates in the model and how they are used (squares, interactions, etc.). Manipulating covariates, in turn, may change the strength or even the direction of the effect of the intervention on the outcome. Furthermore, it is not straightforward to explain to a nontechnical audience how regression controls for confounders.
Dr. Chen, happy with the support, makes a final brief comment, "Finally, I would like to add that both methods share a limitation: neither is able to adjust for unmeasured confounders (hidden bias)."
Prof. Minoru, who does not usually show enthusiasm in public, makes an exception: "Great job, folks! I believe we are now much better prepared to decide which method we will use to analyze the role of RHC." At the end of the day, Prof. Minoru felt that his painful experience in the ICU could result in a great contribution to medicine. He took some comfort in that thought, as he had dedicated all of his life to medicine.
CASE DISCUSSION
Professor Minoru plans to determine the efficacy of right heart catheterization (RHC) in improving overall mortality rates in intensive care units. This is an example of a situation in which the use of an RCT can be problematic.
Alternatively, observational studies are not as controlled as RCTs, and therefore, if not carefully interpreted, conclusions derived from them can be misleading. One famous example is the Nurses' Health Study, a large cohort study in which women taking estrogen had only a third as many heart attacks as women who did not (for more details about this study, see Stampfer et al., 1991) [21]. This had a tremendous impact on health policy. But between the exposure to estrogen and the outcome there was also a factor that was not on the causal pathway: in this study, the nurses who were taking estrogen were also the ones with consciously healthy habits—this is known as the healthy-user bias.
The healthy-user bias is a clear example of a confounder: a covariate associated with both the exposure and the outcome, but not part of the causal relationship between them. Confounders can lead to unrealistic estimates of treatment effects, and they need to be addressed if accurate conclusions are to be drawn. This can be done with statistical modeling. Outcome modeling calculates a coefficient for each identified risk factor's influence on the outcome, adjusting for the other factors in the model. But it is based on the outcome, and, for instance, the choice of covariates or the way they are used may change the strength and/or direction of the association between intervention and outcome. Propensity scores follow a different approach and attempt to summarize, in a single variable, the probability of receiving the intervention based on a set of confounders. One major advantage of this method is that it is not necessary to take the outcome into consideration, which allows a separation between the modeling and the outcome analysis, ultimately preventing a deliberate choice of covariates that could bias the results. At the same time, propensity scores can obscure the relationship between exposure and outcome (by not looking at the outcome), and they will reduce the sample size if only matched patients are used.
Considering the strengths and limitations of both methods, now Prof. Minoru
and his research team have to decide which method they will use to analyze the role
of RHC.
FURTHER READING
Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983; 70: 41–55.
Rosenbaum PR, Rubin DB. Reducing bias in observational studies using subclassification on the propensity score. JASA. 1984; 79: 516–524.
Rosenbaum PR, Rubin DB. Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. Am Statistician. 1985; 39: 33–38.
These are the historical articles from the conception of the propensity score; the first introduces its theoretical and mathematical basis, and the others apply the method using the stratification and matching techniques.
D'Agostino RB Jr. Propensity score methods for bias reduction in the comparison of a treatment to a non-randomized control group. Stat Med. 1998; 17: 2265–2281.
A comprehensive and illustrative review of the propensity score.
Cook EF, Goldman L. Performance of tests of significance based on stratification by a multivariate confounder score or by a propensity score. J Clin Epidemiol. 1989; 42: 317–324.
Here the authors compare the performance and efficiency of methods for confounder control based on stratification, the multivariate confounder score, and the propensity score.
Winkelmayer WC, Kurth T. Propensity scores: help or hype? Nephrol Dial Transplant. 2004; 19: 1671–1673.
This editorial offers a critical review of the propensity score and also briefly discusses the issue of confounding.
Miettinen O, Cook F. Confounding: essence and detection. Am J Epidemiol. 1981; 114: 593–603.
In this article, the authors discuss confounding in different study designs—follow-up and case-control—illustrated by several examples.
REFERENCES
1. Mann CJ. Observational research methods. Research design II: cohort, cross sectional, and
case-control studies. EMJ. 2003; 20(1): 54–60.
2. Joffe MM, Rosenbaum PR. Invited commentary: propensity scores. Am J Epidemiol. 1999;
150(4): 327–333.
3. Cochran WG, Rubin DB. Controlling bias in observational studies: a review. Sankhyā: The Indian Journal of Statistics, Series A. 1973; 35(4): 417–446. www.jstor.org/stable/25049893.
4. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational
studies for causal effects. Biometrika. 1983; 70(1): 41–55.
5. D’Agostino RB Jr. Propensity scores in cardiovascular research. Circulation. 2007; 115(17):
2340–2343.
6. Rothman KJ. A pictorial representation of confounding in epidemiologic studies. J Chron
Dis. 1975; 28(2): 101–108.
7. Rosenbaum PR, Rubin DB. Constructing a control group using multivariate matched sam-
pling methods that incorporate the propensity score. Am Statistic. 1985; 39(1): 33–38.
8. McNamee R. Confounding and confounders. Occup Environ Med. 2003; 60(3): 227–234.
18
ADAPTIVE TRIALS AND INTERIM ANALYSIS
INTRODUCTION
Other chapters in this Unit have discussed special cases of randomized clinical trials (non-inferiority trials, for instance). This chapter discusses the reasons for and methods of interim analysis, adaptive designs (used during clinical trials to modify the trial design or statistical procedures based on preliminary results from an interim analysis), and the particularities of clinical trials with medical devices.
depending on the study, outcome, and disease characteristics. The analysis can be
planned to look for the following:
Stopping a trial early on the basis of interim analysis results can bring advantages: earlier publication, lower costs and resource utilization, and fewer patients exposed to unnecessary risk. However, the balance between clinical and statistical significance should be observed.
Statistical significance may be reached before clinical significance, leading to criticism from the scientific community that the results are not robust enough. In fact, one of the main issues is the perception that results from a small trial may not be clinically valid. Even when an important clinical difference is demonstrated, early termination may compromise the statistical strength of the trial and may limit the power to examine secondary outcomes.
To preserve the overall significance level, there are specific statistical stopping rules, which must be pre-established in the analysis plan.
The Haybittle-Peto Rule
In this approach, the trial is stopped only when there is overwhelming evidence favoring stopping; this threshold has been set at p < 0.001 [4,5].
Example: from Schulz KF, Grimes DA. Multiplicity in randomised trials II: subgroup and interim analyses. Lancet [7].
In a given study, a data monitoring committee does an interim analysis every 6 months for 5 years. At 18 months, the p value slips under 0.05, but never again attains significance at that level. An early decision by the committee to stop the trial based on this result might have led to an incorrect conclusion about the effectiveness of the intervention.
[Figure: p value from successive interim analyses (y-axis, 0–1.0) plotted against months of observation (x-axis, 0–60), with the p = 0.05 level marked; the curve dips below 0.05 only at 18 months.]
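The danger illustrated above can be checked numerically with a small simulation—our own sketch, not from the chapter—comparing naive repeated testing at p < 0.05 with Haybittle-Peto boundaries (p < 0.001 at interim looks, 0.05 at the final analysis), under a true null hypothesis:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n_trials, n_looks, n_per_look = 5000, 10, 30

naive = peto = 0
for _ in range(n_trials):
    # Null hypothesis is true: treatment-control differences have mean zero.
    data = rng.normal(size=n_looks * n_per_look)
    rej_naive = rej_peto = False
    for k in range(1, n_looks + 1):
        p = stats.ttest_1samp(data[: k * n_per_look], 0.0).pvalue
        rej_naive |= p < 0.05                   # naive: 0.05 at every look
        bound = 0.001 if k < n_looks else 0.05  # Haybittle-Peto boundaries
        rej_peto |= p < bound
    naive += rej_naive
    peto += rej_peto

print("overall type I error, p<0.05 at every look:", naive / n_trials)
print("overall type I error, Haybittle-Peto:      ", peto / n_trials)
```

In this simulation, the naive rule rejects a true null far more often than 5% (on the order of 20% with 10 looks), while the Haybittle-Peto rule stays close to the nominal level.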
On the basis of the number of interim analyses planned, these methods define the p-values for considering trial stoppage at an interim look while preserving the overall type I error (α; Table 18.1). The O'Brien-Fleming and Peto methods both adopt stringent criteria (low nominal p-values) during the interim analyses (Table 18.1). If the trial continues to the planned sample size, then all analyses proceed essentially as if no interim analyses had taken place. These procedures preserve not only the intended α level but also the power. As a general rule, investigators gain little by doing more than four or five interim analyses during a trial.
Table 18.1 Continued

Number of planned     Interim       Pocock    Peto     O'Brien-Fleming
interim analyses      analysis
4                     1             0.018     0.001    0.0001
                      2             0.018     0.001    0.004
                      3             0.018     0.001    0.019
                      4 (final)     0.018     0.05     0.043
5                     1             0.016     0.001    0.00001
                      2             0.016     0.001    0.0013
                      3             0.016     0.001    0.008
                      4             0.016     0.001    0.023
                      5 (final)     0.016     0.05     0.041

Overall α = 0.05.
The Peto (or Haybittle-Peto) approach is simpler to understand, implement, and describe. It uses constant but stringent stopping levels until the final analysis (Table 18.1). For some trials, however, investigators believe that early termination is too difficult with the Peto approach [7].
Additional Reading
Fosså and Skovlund reported, in a very elegant way, the penalties for investigators who do not follow the planned interim analysis [8]:
[ . . . ] The article by Negrier et al. (2000, Journal of Clinical Oncology) gives an example of misconduct of a clinical trial in this respect, as the investigators disregarded their
own study design and the role of the predefined interim analysis. Based on favorable
though preliminary phase II trial results with selected chemoimmunotherapy in meta-
static renal cell carcinoma, the combination of subcutaneous interleukin-2 (IL-2), inter-
feron alfa (IFN), and fluorouracil was compared with the combination of IL-2 and IFN.
This comparative study was planned before the final phase II results were known, which
proved the chosen chemoimmunotherapy to be ineffective in this malignancy. When
the phase III trial was designed, it was evident from the medical literature that only a few patients with metastatic renal cell carcinoma would benefit from immunotherapy: those
with a good performance status and minimal metastatic disease, preferably in lung and
lymph nodes. Oncologists treating these patients knew that the majority of the patients
would experience significant constitutional toxicity from IL-2/IFN–based immuno-
therapy without tumor response or prolongation of life. Because of this uncertainty, the
principal investigators wisely planned a randomized phase II design with 21 patients in
each arm and a subsequent interim analysis. The protocol contained clear stopping rules
to be based on the results of the interim analysis: if a ≤10% response rate was obtained
in either trial arm and a difference in response rates of greater than 15% between the
two alternatives were observed, then the trial would be discontinued. [ . . . ] Despite
their clearly defined stopping rules, Negrier et al. did not follow their own design: patient
inclusion was continued during the period of interim analysis until the trialists them-
selves required the premature end of the trial because of an unexpectedly low response
rate. At that time, 131 of the 182 planned patients had been entered, whereas the results
of the interim analysis would have led to closure of the trial after 42 patients. The continu-
ation of the trial is even more unexplainable, as the disappointing results of the preceding
phase II trial should have been suspected at the time when the interim analysis was due.
Proper interim analysis after 42 patients and the results of their own previous phase II
study would also have led to the consideration of another problem with the study by
Negrier et al: because of the very rapid inclusion rate in this phase III study, despite its
being a multicenter effort, the principal investigators should have suspected an inade-
quate selection of patients. In their final report, the authors correctly discuss this fact as a
reason for the low response rate. This problem could, however, have been largely avoided
by a proper interim analysis and discussion of inclusion rate with the trialists during
an investigator meeting. [ . . . ]
• Phase III clinical trials: a DSMB is required; it can be formed by the funding agency or by the local IRB, according to the level of risk entailed by the trial.
• Phase II clinical trials: a DSMB is not a requirement, but may be convened by the
funded institution according to the characteristics of the trial.
382 Unit III. Practical Aspects of Clinical Research
• Phase I clinical trials: a DSMB is not required, unless the trial entails the study of a new and potentially risky intervention. In most cases, thorough monitoring by the principal investigator and the local IRB is sufficient.
• Observational studies: the need for a DSMB is determined on a case-by-case basis,
according to the size and complexity of the study.
ADAPTIVE (FLEXIBLE) DESIGN
Adaptive design methods have become very popular in clinical research, mainly in industry studies, due to their flexibility and efficiency. Adaptive designs are used to modify the trial design or statistical procedures based on preliminary results from interim analyses, ideally without compromising the validity and integrity of the trial. However, the advantages of adaptive designs do not come without a cost: adaptive designs can introduce methodological shortcomings that invalidate a trial's results.
A common selection rule is to pick the most promising treatment at the interim stage, for example, the treatment with the numerically highest mean response. However, there is a concern regarding the overall type I error after the adaptations, which can result from a possible deviation from the original target population.
The main goal is to increase the success of clinical development, making studies more efficient and more likely to demonstrate the effect of a treatment.
The range of possible study design modifications must be planned in the written protocol. Adaptation has been used to change the following (Chow and Chang, 2008 [9]):
Therefore, although adaptive designs are attractive because they seem to increase the efficiency of a given trial, they should be considered carefully, as there is an important chance of introducing bias and thereby invalidating the results. Adaptive designs should in fact be used only in special situations and when there is a good rationale for their use [9]. It is not the goal of this chapter to discuss each type of adaptive design in depth, but rather to give the reader a general overview of the different types. References at the end of this chapter discuss each type of design and the uses and potential biases associated with them.
What are the differences between drugs and medical devices (for instance,
pacemakers, stents, deep brain stimulators) for clinical trial design?
Medical devices differ from drugs in several aspects. One is indication: drugs are used to treat patients in specific, clinically indicated populations, whereas devices are often used across wide indications and populations. Another is the user effect, which is pronounced for medical devices and influences outcomes, whereas drug effects are not, or only minimally, affected by the user.
Available evidence and evidence generation differs between medical devices and
drugs. The design and analysis of clinical studies of devices can be more challenging
than comparable studies of drugs, owing to ongoing device modifications, user
“learning curves,” and difficulties associated with blinding, randomization, and sample
size definition.
On the other hand, the current demand for clinical evidence on medical devices has been increasing, leading to specific solutions to meet this need. Interest in effectiveness has grown beyond efficacy, and likewise for health-care value, real-time data analysis, longitudinal follow-up, comparative effectiveness research (CER), patient-centered outcomes research (PCOR), and clinical registries.
Alternatives to RCTs are usually observational studies—case reports, case series, cross-sectional, case-control, and cohort studies—which avoid the sham procedures that may raise ethical concerns. Non-randomized clinical studies play an important role in this scenario. In addition, non-randomized clinical studies offer the possibility of comparing two groups using already collected data. However, there is a need to avoid bias (temporal and selection) and confounding, which arise very easily in surgical studies. The distribution of patient characteristics and risk factors must be balanced, and the quality of historical data assessed. The interpretation of the results needs to take into account bias, confounding, chance, and causality.
Despite the factors discussed in many papers, evidence regarding the safety and effectiveness of medical devices is still judged by the standards of drug evidence: the gold standard remains the randomized controlled trial, with adequate blinding and a control arm, usually placebo. The issues appear precisely at the blinding and placebo-control (sham) stage, since it is very difficult, and often impossible, to implement this type of design for medical devices.
In the following, we describe some features and issues exclusive to clinical trials with medical devices, and why such trials have to be differentiated from drug clinical trials [11].
Rationale for RCT
The highest level of evidence-based medicine originates from RCTs, since observational studies may be confounded. However, clinical trials with medical devices are especially vulnerable, due to operative covariates (user effect, learning curve) and the placebo effect.
Placebo Effect
The placebo effect has been attributed in part to cognitive dissonance, and it tends to be stronger the more invasive the intervention—and medical devices are more invasive. In addition, it is very difficult, and usually impossible, to blind the study and to develop a perfect "placebo device." By contrast, although it is easy to develop a sugar pill that serves as placebo in drug studies, the adverse effects of drugs may serve as potential unblinding factors.
An International Call
It was a warm night in Paris; the summer had just begun. Dr. Jean-Luc Richelieu was in a pleasant dream but was forced awake by the annoying ring of his mobile phone (he wished he had not chosen Beethoven's Ninth Symphony to alert incoming calls).1
“Bon soir . . . What time is it?” Jean-Luc said with a very sluggish voice.
“Good afternoon, Dr. Richelieu. This is John Williams, research assistant of
Professor Gregor Briggs. Prof. Briggs wants to talk with you right now about your
email to which he just replied. Is it possible?”
“Oh—hummm—of course, Mr. Williams! I was not doing anything important.”
Jean-Luc quickly ran to his computer and opened his mailbox. It was 3 a.m. in Paris.
“Good afternoon, Jean-Luc!” Prof. Briggs said, “I mean, good afternoon here. What
time is it there?”
“Oh, not to worry. I was looking forward to talking with you again! I read your
email and was responding,” said Jean-Luc, trying to open the file. (“Saved in docx, it
does not open!”)
"My apologies for disturbing you—I do not want to rush you. In fact, I am calling about something I had forgotten to address in the email—a crucial aspect of the project that we forgot to discuss last month when I was in Paris—we should plan an interim analysis!"
Jean-Luc panicked. He planned the trial so carefully—but he had completely for-
gotten this topic!
“Interim analysis. Yes, how did we forget? But do you think it is necessary? I mean,
we are studying insomnia . . . ”
"I know what you are going to say, Jean-Luc. We are studying insomnia, so an interim analysis is not necessary for ethical reasons. But as the PI, I need to wear the physician's hat, too. And although insomnia is not a life-threatening condition, I do think it is an important condition. We are testing against placebo, and the trial has a four-week duration. I do not think it is ethical to let people not sleep for four weeks. But I know this is a delicate matter. I would like to set up a meeting—but this time in Los Angeles. What do you think?"
As Jean-Luc agreed, Prof. Briggs continued, “But—in order to move quickly with
this study—if you can board tomorrow, it would be great. How about tomorrow at
6 p.m. Pacific Time—is that OK?”
1. Dr. Brunoni and Professor Fregni prepared this case. Course cases are developed solely as the basis for class discussion. Although cases might be based on past episodes, the situation in this case is fictional. Cases are not intended to serve as endorsements or sources of primary data. All rights reserved to the author of this case. Reproduction and distribution without permission are not allowed.
It was a tight schedule, but Jean-Luc knew that this project would be his pathway
to his greatest success—or his worst failure. At least, Prof. Briggs was right—having
insomnia is awful. Jean-Luc was now too excited to fall asleep again.
would allow us—in fact, obligate us—to stop the trial at this point, as the principle of equipoise would be violated and patients should therefore be offered the active treatment. Finally, in trials that compare an experimental treatment against the standard treatment (a drug-drug trial), the interim analysis can also reveal that the new drug is worse than the standard drug (at a statistically significant level), and that would be another reason to terminate the trial."
Helen said, "Dr. Richelieu, I am confused. Interim analysis sounds like an excellent idea! I mean, stopping the trial earlier is good, isn't it? It is almost impossible that our drug performs worse than placebo, so that is not something to worry about. Our pilot trials showed that Serenium is not associated with important adverse events. We planned the sample size calculation not only for statistical significance but also for clinical significance! So it is very possible that if we stop the trial at half the sample size, we will have a statistically significant result. That means we can publish sooner, we can expend fewer resources in this current economic crisis—and, most important, we can expose fewer patients to unnecessary risks. What is the catch here?"
"Yes, Helen, interim analysis is certainly a very good idea with important advantages, but, as you know, in clinical research almost everything you do has a cost. One point is the variability of results during a clinical trial. For instance, results might favor treatment A after 50 patients, then change to favor treatment B after 100 patients, and so on. In addition, you may reach the statistical significance level during the interim analysis but not the clinical significance level. This is a result of the lower stability of data in smaller samples. As a result, academics and editors of medical journals can argue that the results are not robust enough. Let me give you an example: suppose a trial that tests a new anticoagulant agent versus warfarin for stroke prophylaxis in patients with chronic atrial fibrillation. Such a trial would need to enroll a large number of patients—for instance, 2,000. Let me give you two scenarios then: with 1,000 patients, the new drug is significantly better than warfarin (p = 0.01) at avoiding ischemic stroke: 8% versus 6% incidence of stroke. What is the number needed to treat (NNT) in this case?"
Helen quickly calculated the NNT, a measure that she knows is very important for assessing the clinical utility of a given treatment. "The absolute risk reduction is 8% − 6% = 2%. NNT is 1/2%. Fifty?"
"Yes, NNT is 50. And if the drug is very expensive and is associated with an increased risk of an adverse event (e.g., hemorrhagic stroke), then its utility will certainly be jeopardized, since the NNT is very high—that means a physician would have to treat 50 patients with the new drug (and not warfarin) to avoid one ischemic stroke. Now, let me give you another scenario: suppose that the trial ends when it was planned and still shows statistical significance, and now the risks are 16% of stroke in the warfarin group and 4% in the new drug group—with more patients and more time, the differences between the drugs are clearer. In this scenario, the NNT is about 8. What can you conclude?"
“I understand. With this new NNT, the new drug is obviously better than warfarin
and should be chosen, even if it is more expensive. So you are saying that interim anal-
ysis might also hurt a study?”
Jean-Luc breaks into a smile before responding, "Yes—exactly—that is the catch! Another issue is that even if you show an important clinical difference between the two treatments at the interim analysis, the impact of your trial might be less significant, as you are presenting a trial with a sample size of n/2 (or n/3, n/4, . . .)."
After a brief pause, he continues, "There are other issues. First, studies are planned to address one primary hypothesis. In our study, we planned to enroll 300 patients. We calculated our alpha and beta levels for this sample size, not less; we will not have full statistical power with fewer subjects. And there is more: we have some secondary hypotheses, which are very prone to fail, as secondary hypotheses are naturally not as well powered as the primary hypothesis. And then the argument of exposing fewer subjects to unnecessary risks cuts both ways: I can argue that it is better to resolve all the issues in one trial than in two. Suppose that a study with an interim analysis has a small impact; a subsequent study will then be necessary. The same is true for the economic aspect: we are prepared to do this trial now, but we might not be in one year. So I agree that an interim analysis can save costs, but only in the short term. In the long run, doing two trials instead of one will certainly be more expensive and challenging. Helen, in fact, stopping a trial for early efficacy is usually not well accepted by academics. Moreover, it is difficult to establish that a given risk is unacceptable—this is subjective and depends on the individual patient; as you know, medicine is a risk-benefit analysis."
Jean-Luc stopped for a moment and then concluded, "So, Helen, I personally think that interim analysis is a Trojan horse: it is a beautiful idea, so you bring it into your trial, but it might end up destroying the trial!"
Helen replied, "I understand, Dr. Richelieu. But I still think that the advantages of interim analysis should not be underestimated."
"I know, Helen, I know." He sighed, "I have not made up my mind yet. Let's hear the opinion of Prof. Briggs. He is a brilliant researcher, and I want to discuss this matter very carefully with him. This project is very important, and it is critical to analyze all the options carefully."
avoid increasing the type I error. Let me explain: assume that the probability of a type I error was set at 5%, as most trials do. This 5% has to be shared—not necessarily equally—among all analyses. Otherwise, if you allow 5% for the interim analysis and 5% for the final analysis, then at the end you will have up to a 10% chance of a type I error. Therefore, the sum has to be 5%. So suppose that a trial is planned with one interim analysis. If we conduct the interim analysis at a p of 2.5%, then the final analysis also has to be at 2.5%. If we conduct the interim analysis at a p of 1%, then the final analysis would be at 4%. That is just an example for your understanding: the calculation is not simple arithmetic—it requires statistical methods and usually a statistician—but this is the logic of 'alpha spending.' "
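A rough numerical check of this sharing logic (our own sketch, not from the case; the sample sizes are made up): because the interim and final looks use overlapping data and are correlated, testing at 5% twice inflates the overall type I error to somewhat below the 10% upper bound mentioned above, while a 2.5%/2.5% split keeps it near 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n_trials, n_half = 20000, 150       # interim at n/2 = 150, final at n = 300

naive = split = 0
for _ in range(n_trials):
    data = rng.normal(size=2 * n_half)           # null: no treatment effect
    p_interim = stats.ttest_1samp(data[:n_half], 0.0).pvalue
    p_final = stats.ttest_1samp(data, 0.0).pvalue
    # Spending 5% at both looks: reject if either p-value is below 0.05.
    naive += (p_interim < 0.05) or (p_final < 0.05)
    # Sharing the alpha 2.5% / 2.5% keeps the overall level near 5%.
    split += (p_interim < 0.025) or (p_final < 0.025)

print("overall type I error, 5% at both looks :", naive / n_trials)
print("overall type I error, 2.5% + 2.5% split:", split / n_trials)
```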
"Prof. Briggs," Helen said, "should we distribute the p-value equally or unequally, as in your example of 1%/4%?"
"Well, Helen, it depends. If you believe that the study results will be accepted with a smaller sample, then you can divide the p-value equally—that increases your chances of obtaining a significant p-value at the earlier stages. So, if we are going to do an interim analysis, we should also decide which type of analysis is most appropriate—in statistical terms, the most appropriate 'alpha spending function.' " He continued, "The O'Brien-Fleming approach is more conservative at the earlier stages, because the 'boundaries' of the test are very wide—it uses a very small p-value at the earlier stages. The results must therefore be extreme to cross the boundaries and lead to trial termination. At the more advanced stages of the trial, the boundaries come quite close to the conventional significance level, and the penalty for conducting the interim analyses is not too high. The Pocock approach sets the same value for each interim analysis performed (it divides the p-value equally). Also, there is the option of conducting the interim analysis for safety only."
"So, I think we have several options here," Jean-Luc said: "(1) do nothing; (2) do an interim analysis for safety only (not looking at efficacy and therefore not paying the p-value penalty); or (3) do an interim analysis for safety and efficacy. Right? Also, if we go for safety and efficacy, we will have to decide which method to use for alpha spending—Pocock or O'Brien-Fleming."
Helen Curie looked at the wall clock. It was still 6:30 p.m. She was tired, thinking about how odd it was to have a day with 33 hours! But then she realized that when she went back to Paris her day would have only 15 hours! "There is no free lunch—an interim analysis implies a type I error penalty—gaining time today implies losing time tomorrow. We now need to decide whether an interim analysis here will be our Trojan horse."
CASE DISCUSSION
For the case discussion, the reader should consider the advantages and disadvantages
of having interim analysis.
Advantages:
– Reduce the duration of the trial, thereby reducing costs and patients' exposure to potential risks.
– Monitor treatment safety, an important concern, by assessing unexpected life-threatening adverse events related to the treatment.
– Anticipate efficacy issues, providing important information on whether to continue or stop the trial.
Disadvantages:
– The statistical significance level could be reached, but not clinical significance (easily criticized by reviewers and editors).
– The sample size planned at the beginning could be reduced, with a consequent loss of power.
– The NNT issue: clinical results in favor of the treatment can differ depending on whether the analysis is made in the middle or at the end of the trial (worked out below).
– Statistical issues: increased type I and type II errors due to underpowered analyses; further studies may then be needed, adding unnecessary costs.
– Secondary hypotheses are prone to fail.
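For reference, the NNT arithmetic from the case dialogue can be written out as follows (values as given in the dialogue; note that 1/0.12 rounds to roughly 8):

```latex
\mathrm{NNT} = \frac{1}{\mathrm{ARR}}, \qquad
\text{interim look: } \mathrm{ARR} = 8\% - 6\% = 2\%, \quad \mathrm{NNT} = \frac{1}{0.02} = 50,
\qquad
\text{final analysis: } \mathrm{ARR} = 16\% - 4\% = 12\%, \quad \mathrm{NNT} = \frac{1}{0.12} \approx 8.
```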
FURTHER READING
Papers
Observational Studies
• Avorn J. In defense of pharmacoepidemiology. Embracing the yin and yang of drug research. N Engl J Med. 2007 Nov 29; 357(22): 2219–2221. PMID: 18046025.
Interim Analysis
• Snapinn S, et al. Assessment of futility in clinical trials. Pharm Stat. 2006 Oct–Dec; 5(4): 273–281. PMID: 17128426.
Adaptive Designs
• Chow SC, Chang M. Adaptive design methods in clinical trials: a review. Orphanet J Rare Dis. 2008 May 2; 3: 11. PMID: 18454853.
• Gallo P, Chuang-Stein C, Dragalin V, Gaydos B, Krams M, Pinheiro J, PhRMA Working Group. Adaptive designs in clinical drug development: an executive summary of the PhRMA Working Group. J Biopharm Stat. 2006 May; 16(3): 275–283; discussion 285–291, 293–298, 311–312.
Medical Devices
• Bonangelino P, et al. Bayesian approaches in medical device clinical trials: a discussion with examples in the regulatory setting. J Biopharm Stat. 2011 Sep; 21(5): 938–953. PMID: 21830924.
• Li H, Yue LQ. Statistical and regulatory issues in nonrandomized medical device clinical studies. J Biopharm Stat. 2008; 18(1): 20–30. PMID: 18161539.
Online: Interim Analysis
• http://www.consort-statement.org/consort-statement/3-12---methods/item7b_interim-analyses-and-stopping-guidelines/
Data Safety Monitoring Boards (DSMBs)
• He P, Lai TL, Su Z. Design of clinical trials with failure-time endpoints and interim analyses: an update after fifteen years. Contemporary Clinical Trials. 2015.
• Chalmers I, Altman DG, McHaffie H, Owens N, Cooke RW. Data sharing among data monitoring committees and responsibilities to patients and science. Trials. 2013; 14: 102.
• Sartor O, Halabi S. Independent data monitoring committees: an update and overview. Urologic Oncol. 2015; 33: 145–148.
Books
• Kirkwood BR, Sterne JC. Essential medical statistics. Malden, MA: Blackwell Science; 2003.
• Portney LG, Watkins MP. Foundations of clinical research: applications to practice. 3rd ed.
Upper Saddle River, NJ: Pearson Prentice Hall; 2015.
• Rothman KJ. Epidemiology: an introduction. 2nd ed. Oxford: Oxford University Press; 2012.
• Williams OD. Data Safety and Monitoring Boards (DSMBs). In: Glasser SP, ed. Essentials of
clinical research. 2nd ed. Heidelberg: Springer; 2014.
REFERENCES
1. Fleming TR, DeMets DL. Monitoring of clinical trials: issues and recommendations.
Controlled Clin Trials. 1993; 14(3): 183–197.
2. Su HC, Sammel MD. Interim analysis in clinical trials. Fertil Steril. 2012; 97(3): e9.
PMID: 22285749.
3. O’Brien PC, Fleming TR. A multiple testing procedure for clinical trials. Biometrics. 1979;
35: 549–556.
4. Haybittle JL. Repeated assessments of results in clinical trials of cancer treatment. Brit J
Radiol. 1971; 44(526): 793–797.
5. Peto R, Pike MC, Armitage P, et al. Design and analysis of randomized clinical trials
requiring prolonged observation of each patient. I. Introduction and design. Brit J Cancer.
1976; 34(6): 585–612.
6. Pocock SJ. Group sequential methods in the design and analysis of clinical trials. Biometrika.
1977; 64(2): 191–199.
7. Schulz KF, Grimes DA. Multiplicity in randomised trials II: subgroup and interim analyses.
Lancet. 2005; 365(9471): 1657–1661. PMID: 15885299.
8. Fosså SD, Skovlund E. Interim analyses in clinical trials: why do we plan them? J Clin Oncol. 2000; 18(24): 4007–4008. PMID: 11118460.
9. Chow SC, Chang M. Adaptive design methods in clinical trials: a review. Orphanet J Rare
Dis. 2008; 3: 11. PMID: 18454853.
10. Chuang-Stein C, et al. Sample size reestimation: a review and recommendations. Drug
Inform J. 2006; 40: 475–484.
11. Li H, Yue LQ. Statistical and regulatory issues in nonrandomized medical device clinical
studies. J Biopharm Stat. 2008; 18(1): 20–30. PMID: 18161539.
UNIT IV
Study Designs
19
INTEGRITY IN RESEARCH
AUTHORSHIP AND ETHICS
INTRODUCTION
So far in this book we have had chapters focusing on research methodology, statis-
tical analysis, and trial design, among others. In this chapter we will focus on another
important concept: integrity in research. Integrity in research refers to the active
commitment to ethical principles, norms, regulations, and guidelines governing the
responsible conduct of research. Research integrity requires that the research process
is governed by honesty, objectivity, and verifiable methods, rather than preconceived
ideas and expectations.
Although practices of responsible conduct of research may vary from country to
country or even from one institution to another, there are some shared values, which
include, but are not restricted to, the following: honesty, accuracy, efficiency, objec-
tivity, confidentiality, and responsible publication of research findings [1]. These
shared values ensure the accuracy and replicability of study findings, reinforcing the
commitment to good practices in research among professionals. Integrity in research
governs all the stages of a research process—planning, implementation/execution, in-
terpretation of results, and report writing and publication. Therefore, before starting
any research process, all research members involved must be aware of professional
codes, government legislation, and institutional policies governing research with
human subjects and animals, research misconduct, and conflicts of interest.
In this chapter we will focus on several aspects of integrity in research: authorship,
conflict of interest, and ethics.
AUTHORSHIP
Publication of research findings in scholarly journals is one of the most important stages of the research process and of a career in academia. Research findings must be disseminated to readers and peers in a standard form, language, and style [2]. Publication must be done in the most accurate and honest way possible, so that research methodologies and findings can be replicated and can support future scientific advances. It is an ethical obligation for an investigator to make research findings accessible, in a timely manner and with sufficient detail, so that other investigators can replicate the study [3]. The ultimate objective of any research is to make its findings available to the community, and any publication must give appropriate credit and accountability to all authors who contributed to the scientific work.
Authorship credit is attributed to persons who have substantially and intellectually
contributed to the study and to the scientific report. According to the International
Committee of Medical Journal Editors (ICMJE), authorship provides credit for an
individual’s contributions to a research study, has important academic, social, and
financial implications, and carries accountability and responsibility [4]. Since there are no universally accepted standards governing authorship assignment, researchers should be aware of the specific practices, guidelines, or recommendations within their own institution. Because authorship order may be governed by different guidelines, some research journals require authors to state each author's specific contribution to the scientific report. This practice has the advantage of removing some of the ambiguity surrounding contributions by acknowledging each specific contribution. However, it does not resolve the problem of quantity and quality when assigning authorship.
The ICMJE developed guidelines with specific criteria for authorship. Authors
should be accountable for their contribution, and also should be able to identify the
contribution and responsibility of co-authors listed in the scientific report. This defi-
nition of authorship acknowledges an author’s accountability for his or her own work,
as well as co-authors' contributions. Therefore, according to the ICMJE, in order to be considered an author, an individual must meet all four of the following criteria:

1. Substantial contributions to the conception or design of the work, or to the acquisition, analysis, or interpretation of data for the work; AND
2. Drafting the work or revising it critically for important intellectual content; AND
3. Final approval of the version to be published; AND
4. Agreement to be accountable for all aspects of the work, ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.
Thus, every investigator who meets these four criteria should be listed as an author,
and those who do not meet the four criteria should be acknowledged (in the acknowl-
edgment section of the manuscript) for their contribution to the study. It is important
to stress that if the first criterion is met, individuals should be given the opportunity
to work on the report, including drafting, revising, and approving the final version
of the scientific report. This is the collective responsibility of the authors listed in
the manuscript, and it is not a responsibility of the journal where the work is going
to be published. Some journals require details about authorship (i.e., a list of each
specific author’s contribution) and some even require authors to sign a statement
on authorship responsibility, conflict of interests (COI) and funding, and copyright
transfer/publishing agreement. If an agreement among authors on authorship cannot
be reached, the institutions where the work was conducted should be requested to
investigate and find an appropriate solution. After manuscript submission or publication, any change in authorship—order, additions, deletions, or contributions being attributed differently—should be justified, and journal editors should require a signed statement of agreement to the requested change from all listed authors, including those being added or removed.
According to the American Psychological Association (APA) [5], people who provided mentorship, funding, or other resources to the project, but did not participate in the final report, do not necessarily qualify for authorship. Despite these efforts by the ICMJE and the APA, among others, to provide guidelines governing scientific publication, several institutions do not follow them. For instance, in some institutions it is common practice for heads of department to be listed as authors even though they were never directly involved in the research process and did not contribute to the final publication. These are the so-called guest or gift authors (authors who did not contribute significantly to the report) (Table 19.1), and this practice is not endorsed or allowed by many of the most important peer-reviewed journals.
Table 19.1 Different Forms of Authorship

First authorship: The main researcher and the main writer of the article. This person is usually responsible for writing the first draft of the paper and is often also the corresponding author.

Last authorship: An author who contributed with expertise and guidance. This person is usually a senior researcher, who critically revises the manuscript, ensuring that good quality standards of research and publication have been met. Typically, this person represents the institution in which most of the actual research was performed (if the study is not a multicenter trial). The last author can also be listed as corresponding author.

Corresponding author: The person who submits the paper to a journal and has the responsibility to review and answer reviewers' questions, as well as all correspondence related to the published paper (i.e., reprint requests or any contact with the research group). This person is listed in the manuscript with detailed contact information. This is not only an administrative role, but also a sign of seniority.

Gift or honorary authorship: Listing authors who did not contribute substantially to the manuscript and the research project. Example: awarding authorship credit to someone who has power and prestige, rather than for an intellectual and substantial contribution to the work. This is not endorsed by most medical journals or the ICMJE.

Ghost authorship: Failure to list as an author someone who meets the criteria for authorship. Example: a company (or a busy researcher) hires someone to write the paper.

Coercion authorship: Authorship is demanded or imposed rather than voluntarily awarded. Examples: a chair of the department who demands authorship on all manuscripts; a senior researcher forcing a junior researcher to include a gift or guest author in the manuscript; or a researcher forcing a specific authorship order (for instance, first or last position) when his or her work does not justify that position.

Group (corporate, organizational, or collective) authorship: For publications with a very large number of authors, a name for the group may be created, and every author who contributed to the published work is listed in the article text. The group name in this case represents a specific consortium, committee, or study group. If necessary, more than one group name can be created for the citation, or both the group name and author names can appear in the citation. Example: an organization that takes full responsibility for the creation of the scientific work; this can be an alternative to long author lists in multi-authored manuscripts.

Mutual support/admiration authorship: Authors agree to list each other's names on their own manuscripts despite minimal or no participation in the research project and manuscript. Examples: friends or colleagues who want to rapidly increase their number of publications agree to list each other's names in their own publications; authors agree to share the main authorship positions in the paper (first and senior positions), though their work does not fulfill the criteria for those positions.

(Based on [6, 7].)
Authorship Order
According to the ICMJE guidelines, authorship order should always be the co-authors' joint decision. Authors should be informed of the authorship order, as well as of the reasons for that particular order. In some cases, authors are listed alphabetically, with the justification that all authors made equal contributions to the study and to the publication. Whenever this happens, it is important to make that clear by adding a note in the manuscript.
A general recommendation for a young researcher who has made substantial intellectual contributions and may have drafted the manuscript is to be the first author of the paper and/or the corresponding author. For someone further along in his or her career, being the last author or the corresponding author usually means that this person is a senior author in the field or was the main person responsible for the contents of the manuscript and/or the study [8].
The following are some descriptive guideposts on authorship order to help in de-
ciding the sequence of authorship (based on [10]):
to write the first draft of the manuscript. If the student does not deliver or if he
or she completely fails to complete the first draft, the supervisor may then take
full responsibility for writing the manuscript (and therefore, will put his or her
name first).
3. The first author should be the person who contributed most to the work, including
manuscript writing. This person may be associated with the development of the
basic concept of the study, the main hypothesis, the study design, data collection, and/or data analysis. This person was certainly one of the major contributors
to the main data interpretation and discussion in the manuscript. It is also worth
noting here the possibility of co-first authorship, where two or more individuals
who equally contributed to the manuscript have the opportunity to share the pri-
mary credit. In this case, it is recommended that co-first authors be listed in al-
phabetical order. In cases of co-first authorship, and if this is made clear in the
manuscript, being listed as second or third or even fourth should not be seen with
prejudice. First authors, along with senior/last/corresponding authors, also typi-
cally assume primary responsibility and accountability of the reported results and
conclusions.
4. The last author is typically the one who plays a mentoring/stewardship role in the overall conduct of the study, supervising and providing overall guidance for the research project. He or she is typically the head of the laboratory that hosted most of the research. The last author is usually an established, senior researcher in the field of that particular work. Like all authors, the last author should meet all criteria for authorship in order to be listed as an author on the manuscript.
5. For the middle authors, there is less clarity about the significance of their contributions. Order may quantify contribution, meaning that authors are listed according to their overall contribution to the manuscript. In some research fields, the second author is the person who, after the first author, contributed most to the research project and manuscript writing, and the second-to-last author is also a senior author in the research field who has made substantial contributions to the manuscript.
One curious and interesting aspect is that other fields of science use authorship-order criteria that differ from those just discussed for the health sciences. In fact, promotion committees should pay more attention to authors' actual contributions as described at the end of articles, as requested by some journals; this may reduce, at least to some extent, authorship disputes.
Authorship Disputes
In theory, assigning the appropriate credit for intellectual contributions in a scientific work is a straightforward process. However, authorship disputes regarding authorship position are somewhat frequent. In most cases, these disputes happen because it may not be easy to define whether someone's contribution was substantial or not [6].
In order to minimize the likelihood of authorship disputes, it is generally
recommended that all potential authors in a research project discuss authorship
with the principal investigator (PI) while the study is still being planned [4]. It is the responsibility of both co-investigators and the PI to prioritize this conversation. If necessary, researchers can use a signed agreement, in the format of a contract regarding publication intent. In this case, researchers agree about their responsibilities/roles in the project and also about authorship order. The agreement can also specify that authorship order can be renegotiated if a researcher's responsibilities change substantially, or if a researcher fails to perform his or her role as previously agreed. Winston (1985) [11] suggested a procedure for determining authorship order in any research publication. The basic concept of this authorship instrument is that potential authors should complete it collaboratively, in a discussion that includes all contributors. This checklist helps facilitate the organization and delegation of responsibilities in the research project, and provides the opportunity to discuss and negotiate authorship and authorship order in a collaborative way.
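As a rough illustration only—this is a hypothetical sketch in the spirit of such instruments, not Winston's actual form—the worksheet can be thought of as a jointly completed score sheet whose totals seed the authorship-order discussion:

# Hypothetical sketch of a point-based authorship worksheet, loosely inspired
# by collaborative instruments such as Winston (1985). Task categories, the
# 0-3 scores (agreed on jointly by all contributors), and the names are
# illustrative assumptions; totals start the discussion, they do not end it.

contributions = {
    "Author A": {"concept": 3, "design": 2, "data collection": 0, "analysis": 1, "writing": 1},
    "Author B": {"concept": 1, "design": 2, "data collection": 3, "analysis": 3, "writing": 3},
    "Author C": {"concept": 0, "design": 1, "data collection": 2, "analysis": 1, "writing": 1},
}

# Total each contributor's points across all tasks.
totals = {name: sum(scores.values()) for name, scores in contributions.items()}

# Propose an order by total contribution; ties are resolved by discussion.
proposed_order = sorted(totals, key=totals.get, reverse=True)
print("Proposed authorship order:", proposed_order)

Any such tally is only a starting point; the final order must still satisfy the authorship criteria discussed earlier in this chapter.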
Even though authorship should be discussed or negotiated in advance, this good practice does not always prevent authorship conflicts. At the time of manuscript writing, authorship has to be reassessed. Therefore, it is important to ask the following types of questions: Have all investigators fulfilled their contributions according to what they agreed upon initially? Has the scope of the project changed during its course, and therefore the contributions of the participating research team members [12]?
In order to prevent authorship disputes, it is important to follow four basic principles:
1. Create and reinforce a culture of ethical authorship (be informed about the institution's policies on authorship, or propose one if none exists, and discuss it with your PI and research team);
2. Start discussing authorship when planning the research study (when possible, discuss it in a face-to-face meeting, so all authors will be aware of authorship decisions);
3. Reassess authorship during the course of the study (if there are any substantial changes in the roles of any author, authorship may be discussed as the project evolves); and
4. Decide authorship before manuscript writing (discuss expectations and responsibilities regarding manuscript writing, revision, and submission to a journal) [6].
Summary of Authorship
Everyone who makes a substantial intellectual contribution to a research project is a potential candidate for authorship. In addition to contributing to the research project, the researcher needs to make substantial intellectual contributions to the manuscript in order to be listed as an author. All persons who substantially contributed to the research project, or who are listed as authors on the manuscript, should have the opportunity to participate in the manuscript writing and to approve the final version to be published. It is the ultimate responsibility of the lead investigator(s) to
manage authorship credits and authorship order with integrity and honesty, and to
promote and facilitate discussions within the research team whenever authorship
disputes occur.
Finally, authorship has a great impact on the scientific career, because it means
credit and recognition for the work performed. However, it also involves responsi-
bility and accountability for the published work.
ETHICS IN RESEARCH
Scientific research is built on a foundation of trust and credibility. Both society and the
scientific community expect that every scientist is devoted to describing the world in
the most accurate and unbiased way. This trust in research conduct and research findings
has led to unparalleled scientific investments and productivity in the last centuries.
Nevertheless, it is important to stress that the history of science also includes examples of research misconduct and unethical procedures. Despite the negative consequences of these episodes, it is important to note that they also improved the quality of research, by prompting the creation of guidelines and rules governing research conduct. One famous example is the case of the Tuskegee syphilis study, conducted by the US Public Health Service between 1932 and 1972, in which hundreds of African American men with syphilis were observed without treatment and misinformed about their condition. To recruit and retain participants,
researchers used specific promotional campaigns with suggestive titles such as “Last
Chance for Special Free Treatment.”
The experiment continued in spite of the Henderson Act of 1943 (a public health law requiring testing and treatment for venereal disease) and also in spite of the World Medical Association's Helsinki Declaration of 1964 (see details about the Helsinki
Declaration later in this chapter and in [13–15]). In fact, even when penicillin was
introduced as a possible cure for syphilis in 1947, none of the subjects participating in
this study had access to this treatment or was informed about this available treatment. In 1972, when the study was exposed, a total of 28 men had died of syphilis and 100 had died of complications related to the disease. In addition, about 40 wives had been infected, and 19 children had contracted the disease at birth.
The study was ended on July 25, 1972, when Jean Heller of the Associated Press
broke the story, both in New York and Washington. (For more details about this study,
see Brandt et al. [16].)
Figure 19.2. According to the Belmont Report (1979), ethics in human research should be based on
three interrelated basic principles: respect for persons, beneficence, and justice.
Respect for persons in research refers to the basic ethical principle that participants
involved in the research study are volunteers and have the right to be informed about
the research goals (such as the objective of the study, benefits, risks, etc.). This basic
principle involves two important ethical considerations. The first one is that the par-
ticipant should be treated as an autonomous being (i.e., a person who has the right
to make decisions or deliberations about her or his personal goals and desires). The
second one is that persons who are not able to make decisions for themselves (any vul-
nerable populations, such as children, prisoners, people with some mental disorders
or impairments) should be protected from any type of coercion from others or any
activity that can cause any harm to them.
Beneficence refers to the obligation of maximizing possible benefits and
minimizing possible harm to the participants involved in the study. In this sense,
investigators and institutions have to plan to maximize benefits and minimize
risks to the participants, following the best judgment possible, with the available
knowledge. By following the principle of beneficence, investigators will use the available knowledge to decide whether there is another way to obtain the data/knowledge with lower risks to participants; ultimately, the benefits should outweigh the risks.
The principle of justice refers to the distribution of the benefits and burdens of experimentation, so that there is fairness and study participants are treated equitably. The principle of justice guides, for instance, the selection of participants: it prevents populations that are easily available, vulnerable, or easy to manipulate from being systematically recruited, rather than participants being chosen for reasons directly related to the research problem being studied.
4. Participants are given sufficient time to read, understand, and decide whether they want to participate in the study;
5. Any type of coercion or undue influence must be avoided when performing the informed consent process;
6. Participants must not be made to give up legal rights or any treatment in order to be involved in the study.
Selection of Subjects
According to the principle of justice, there must be fair procedures and outcomes
in the selection of research participants. Therefore, researchers must avoid exploita-
tion of vulnerable populations and avoid providing benefits only to populations that
they favor.
When confronting an ethical dilemma in research, the first option is always to carefully examine the situation and keep these three ethical principles in mind. This may help in clarifying some issues and making appropriate decisions. The Tuskegee syphilis study, presented in the preceding discussion, is an example of a clinical trial in which researchers violated all three of these principles, as participants were lied to about their condition, about the "treatment" they were receiving during their participation in the trial, and about the objectives of the study. Additionally, participants were selected based on race, gender, and economic class.
Research Misconduct
Research misconduct occurs when the standard codes, regulations, and ethical behaviors that govern scholarly research are violated [17]. The main purposes of research misconduct policies and guidelines are to provide clear definitions of research misconduct, to provide protection for those accused of research misconduct, and to outline standard procedures for reporting and investigating any research misconduct [1]. The Singapore Statement on Research Integrity, drafted in 2010 at the Second World Conference on Research Integrity in Singapore with 51 countries represented, articulates four common principles of research integrity: honesty in all aspects of research, accountability in the conduct of research, professional courtesy and fairness in working with others, and good stewardship of research on behalf of others. A finding of research misconduct, in turn, generally requires that:
1. The behavior represents a clear and significant deviation from accepted practices, policies, and guidelines governing research; and
2. The evidence can be established; and
3. There is clear evidence that the behavior was committed intentionally, knowingly, or recklessly, rather than through honest error or mere negligence. The primary author and other authors whose results are found culpable are accountable.
When there is any suspicion of research misconduct, all researchers who are involved with the specific data and publication are investigated. For a formal investigation to occur, an investigative committee is appointed by the associate provost. This committee has the responsibility to determine whether research misconduct has occurred, and to determine possible disciplinary sanctions for those involved in research misconduct. When federal funding is involved, that particular funding agency must be informed that a formal investigation of possible research misconduct has been initiated. The formal investigation can take several days and usually requires the examination of several research documents related to the subject being investigated, correspondence, and interviews. When the formal investigation is completed, the chair of the investigation prepares a detailed report to be sent to and discussed with the associate provost. The disciplinary procedure may vary depending on the status of the researcher (i.e., whether he or she is a faculty member, a research assistant, a student, etc.).
Examples of misconduct in research include data fabrication or falsification, pla-
giarism (both plagiarism-fabrication and self-plagiarism), ghost writing, data ma-
nipulation, and breaches of confidentiality (Box 19.1). Honest mistakes or divergent
opinions are not considered research misconduct, and therefore should be approached
in a different manner. Research misconduct needs to be proven by sufficient evidence,
and the behavior must be committed intentionally.
In 2008, the Office of Research Integrity of the US Department of Health and Human Services carried out a study to examine scientists' reports of suspected misconduct in biomedical research. In the final report, a total of 192 scientists reported 265 incidents of research misconduct, which were coded and evaluated based on the federal definition of research misconduct. Overall, 64 descriptions (24% of the total) did not meet the federal criteria for research misconduct. The remaining 201 reports were related to fabrication or falsification (60%) and plagiarism only (36%) [20]. However, in general it seems that researchers fail to report 37%–42% of suspected research misconduct findings. The reasons for this may include a lack of protection for whistleblowers, a lack of knowledge about what constitutes research misconduct, and the need for a system with clear policies and guidelines for reporting these allegations anonymously. Moreover, the researchers who are most likely to observe and report research misconduct are the ones who are most familiar with the institutional misconduct policy.
1. Bornstein NM, Norris JW. The unstable carotid plaque. Stroke. 1989 Aug 1; 20(8): 1104–1106.
2. Professor Felipe Fregni and Dr. Brunoni prepared this case. Course cases are developed solely as the basis for class discussion. The situation in this case is fictional. Cases are not intended to serve as endorsements or sources of primary data. All rights reserved to the author of this case. Reproduction and distribution without permission are not allowed.
and it was implicit that I would be the first author of this study; otherwise, I would
have gone to another lab.”
“Juan, things change, and science is a dynamic process. Also, we are a team of 12 researchers! We have been leading this research! I have also put a lot of work into this project. And thanks to Prof. Ferrucci, who earned a huge grant from the Ministerio de Salud, we were able to pay the scholarships of four postdoctoral fellows—including you, Juan—and to import equipment from overseas, avoiding the premature termination of the project. I am sure you remember this, Juan. In a nutshell, you are the arms of our study, but Prof. Ferrucci is the soul of our great, collective work. In addition, he had the idea for our work and he helped significantly with the design of this study; that's why he will be the first author. Besides, Juan, you should see the big picture here: you are at the beginning of your career and you will have lots of opportunities to get your first-author paper. We really need to think here about what is best for our group and the institution.”
Juan Guevara was furious. “But I did more than him—this project is my life. I dedicated so many years to it—this is really unfair! And I suppose you are the last author, aren't you?” Juan replied, becoming increasingly aggressive.
“Naturally, Juan, I am the mentor of this work,” she said. “That’s it, you are the
arms, I am the head, and Prof. Ferrucci is the soul. You can be the second author—but
I have good news, we also talked about writing an invited review on the topic for an
Argentinian science journal. You can write it, we will help you, and you will be the first
author for this one. You are young and one day you will understand. If you work with
us, we will help your career.”
Juan was perplexed. He could not recognize the person in front of him. He had always admired Isabel del Carpio, as for him she was the role model of a scientist—someone who had stayed in Latin America and had been able, despite all the challenges, to advance knowledge in her field of research. But he would not accept this. “Sorry, Isabel. I don't agree with you. I deserve to be the first author. You know it. This question is still unsettled.”
“Juan Guevara, you are such an idealist. You should focus more on science instead of politics. Please do not become a rebel here; we are trying to accommodate everyone in this situation. Besides, second authorship on a paper in a journal such as Nature is a great breakthrough for you.
“Again, you are very smart, and also very young. I am sure you are going to pub-
lish many papers in the future,” she continued, but her expression then became harsh.
“Besides, you don’t want to go against Prof. Ferrucci and me. You do not want to ruin
your career for a paper, do you?”
Authorship disputes can consume a significant amount of time and energy of the involved authors and result in significant damage to one's career. One important issue is that either small or large collaborations can lead to authorship disputes if not well planned.
The importance of authorship is not only to acknowledge someone’s work; it is
the critical piece for appointments in medical centers, promotion, grant support, and
participation in society committees. Therefore, it is crucial for academic life. In the
traditional authorship model, authorship order is of vital importance, and usually the
first and last author get most of the credit for the work performed.
There are other forces playing into authorship disputes, such as scientists’ self-
esteem. Currently when a paper is cited, it is cited usually as the first author’s name
followed by “et al.”—for instance, “Guevara et al. presented remarkable findings . . ..”
Though this is never noted as an official reason in debates, this certainly plays an im-
portant role. Scientists are in a sense very similar to artists, and some are known for
having an over-inflated ego.
One important point about authorship is that only researchers who have contributed intellectually to the work should be included in the list of authors. The manuscript should be seen as the intellectual product of the research. So, for instance, the clinician who only refers patients to the study or the technician who only performs the laboratory experiments does not qualify for authorship according to medical journals and the ICMJE. The problem then is how to acknowledge a clinician who has dedicated time to referring and finding patients for the study, as this person might have been critical for a clinical study to happen. That is one of the problems of clinical research—a lack of extrinsic motivators.
Radical Solutions
Juan was really stressed by this situation. He stopped all of his work and could not
function well—he spent hours on the Internet looking for similar cases. This project
was his life—something he had worked very hard for, and he could not let go of this
issue, even knowing that it could have detrimental consequences for his career.
While thinking about how to act on this issue, he was considering a more dramatic approach, such as filing a lawsuit. In fact, he knew about a recent dispute between two microbiologists at the University of Göttingen that ended up in court. Juan knew that it was a case similar to his. In this story, reported in Nature,3 the team leader removed the name of the postdoctoral fellow at the last minute, and the postdoctoral fellow decided to take legal action. In the end, the court ruled in favor of the postdoctoral fellow, based on the verbal agreement that both researchers had reached 14 months before the submission. According to the court, “this understanding constituted an implicit contract.” He was prepared to go in that direction if needed; however, he knew that this would have devastating consequences, as the
3. Dispute over first authorship lands researchers in dock. Nature. 2002; 419: 4.
academic world is a very small one, and he could be labeled a “difficult researcher” whom no one might agree to work with in the future. But, on the other hand, he kept remembering a famous Argentinian saying: “Hay que endurecerse” (in English: “one has to toughen up”).
Another radical solution would be to send an email to the editor of the journal expressing his disagreement with the authorship order. Usually, editors do not want to publish papers in which there is an authorship dispute. This could persuade his mentor to back down and agree with him about being the first author. However, he knew that this option would also bring much grievance and impact his career negatively. He also knew he needed a letter from his mentor to get a permanent academic position. He was feeling like a hostage in this situation. He decided then to try to cool off and wait some weeks before doing anything radical.
Diplomatic Solutions
Summer had long since finished in Buenos Aires, and things were not going well for Juan Guevara. He had tried several times to schedule a meeting with Prof. Ferrucci, but the professor always refused, rescheduled, or simply missed the appointment. Finally, when Juan sent him a firm email, Prof. Ferrucci agreed to meet, but they failed to settle the question. In fact, the professor was very rude, threatening Juan with losing his position and career if he continued to stand by his “rebel point of view.”
Juan then decided to act. He sent an email to Catarina Mendez, Dean of Research Integrity for the Malcondo Institute. A few days later, during a long, tense meeting, Juan explained to Dean Mendez what was going on in the Neurobiology Department. She listened carefully. One point that was not clear to her was the implicit agreement they had made. In Guevara's own words, “Dean Mendez, this is what every PhD student expects: that he or she will be first author on his or her main PhD project—if this was not the case, it should have been communicated to me beforehand.” After a pause, Dean Mendez commented, “Well, Juan, I understand, but again, this is a gray area that may be interpreted in different ways. But let me see what I can do.” Later, after she had listened to Prof. Ferrucci and Prof. del Carpio, she realized she had a time bomb in her hands and that she would need to address the situation very carefully. In fact, she realized the problem was too important for her to judge alone. She did not want Juan or the others to get into a personal war. One option would be to set up a committee. There were two committee options:
have the power to institute disciplinary actions if authorship abuse were found. The advantage would be to provide a clear, final solution to the matter. The disadvantage is that it would be an authoritarian solution that goes against the principles of academia. Also, such a committee would obviously not have “force of law”—that is, someone could get very angry with the solution and tell the media what is going on, possibly ruining the institution's reputation, or could leave the institution with some of the data, burying the paper's publication.
CASE DISCUSSION
This case discusses an authorship dispute between a postdoctoral researcher—Juan Guevara—and his PhD mentor—Prof. Isabel del Carpio. Juan has to decide whether to fight for what he believes is fair (i.e., being the first author), or to accept a compromise solution with his mentor and, therefore, be the second author. This decision is rather difficult, and both options have pros and cons. The first-author position is very important to both Juan and del Carpio—it has an important career impact and represents recognition for their hard work and contribution to the final work.
Cases like Juan Guevara’s remind us that ethical ideals and integrity in research
often bend to the reality of ego, power, and self-interest in the real world. Regardless
of how one thinks this case should ethically be resolved, we must acknowledge that
many times in practice, we fail to live up to the normative expectations we set for
ourselves. This case leaves open issues that allow for disagreement about the assignment of authorship. For example, if it were true that Carlos was the main idea generator for the research and was the key study designer, and that Juan executed Carlos's ideas while contributing less intellectually as results came to be known, it might make sense to assign co-first authorship to both. The point is, the only way we can ethically “solve” this dispute is for each party to honestly detail precisely what and how he or she contributed to the study. Of course, each party will infuse his or her
own contribution with as much substantive importance as possible and this is where
leadership from an impartial judge proves vital to maintaining procedural integrity.
Whether Dean Mendez or her faculty colleagues or others can fill this role is context
specific, but it should be clear that Isabel is no longer a “neutral” party. This case also
reveals the importance of “anticipatory authorship ethics.” When so much is at stake,
it behooves all junior investigators who have made a commitment to a career in scien-
tific investigation to proactively engage their mentors/senior project advisors on the
issue of how authorship is assigned on work coming out of the lab. Unfortunately, we
can no longer rely on “understanding” and “expectation” from customary practices.
Ideally, junior investigators should choose labs and mentors only after they have a clear
understanding of how their “boss” approaches authorship assignment. At a minimum,
junior investigators should have a clear understanding of how their “boss” will manage
specific potential authorship disputes.
This case is therefore important in making readers consider what they would do in this situation, and also what the scenario would be if Juan were right, or vice versa. This exercise can help resolve, and perhaps prevent, future authorship disputes.
FURTHER READING
Gopnik A. Facing history. New Yorker. April 9, 2012.
REFERENCES
1. Steneck NH. ORI: Introduction to the responsible conduct of research. Washington,
DC: Government Printing Office; 2007.
2. Derntl M. Basics of research paper writing and publishing. Int J Tech Enhanc Learn. 2014;
6(2): 105–123.
3. Graf C, Wager E, Bowman A, Fiack S, Scott-Lichter D, Robinson A. Best practice guidelines on publication ethics: a publisher's perspective. Int J Clin Pract. 2007; Suppl (152): 1–26.
4. International Committee of Medical Journal Editors (ICMJE). Uniform requirements for manuscripts submitted to biomedical journals: writing and editing for biomedical publication. Haematologica. 2004; 89(3): 264.
5. American Psychological Association. Publication practices & responsible authorship. Retrieved from http://www.apa.org/research/responsible/publication/
6. Albert T, Wager E. How to handle authorship disputes: a guide for new researchers. The
COPE Report 2003; 32–34.
7. Babor TF, McGovern T. Coin of the realm: practical procedures for determining author-
ship. In: Babor TF, Stenius K, Savva S, eds. Publishing addiction science: a guide for the per-
plexed. 2nd ed. London: Multi-Science Publishing Company; 2004: 110–123.
Money won’t buy happiness, but it will pay the salaries of a large research staff to study the
problem.
—Bill Vaughan (1915–1977)
INTRODUCTION
In previous chapters you have learned the main aspects of designing, planning, and
conducting a clinical study. By now it should be clear that clinical research is an activity that requires careful planning, the right methodology, and good execution. To accomplish this, several requirements need to be met, including a budget that allows the proper execution of the study. Thus, even though clinical research is mainly based in academia, at its core it is still a business and should be approached and managed as such.
In this business model, potential sources of funding (i.e., sponsors) need to be
identified and research funds secured prior to the execution of the research plan. The
budget is therefore a pivotal instrument for financing a research project and, depending on the sponsor, may require not only extensive justification but also some complex negotiations.
Most researchers are not aware of the extent to which the source of funding can
impact research. For instance, in the United States a researcher seeking funding from
the government for research activities can be awarded a grant or a contract. If the
researcher is awarded a grant, this means that the research will be developed for the public good; if instead the research is funded as a contract, it will be a means of procuring a service that benefits the contractor (in this case, the government).
In the previous example, the distinction was between grants and contracts provided by the government. This distinction would be even more significant for clinical research funded by corporate interests.
In 1930, the National Institutes of Health (NIH) was created in the United States, and in less than 20 years the NIH became the leading source of funding for biomedical research in academia. However, with the advent of large federal funding came one caveat: the ownership of discoveries and inventions made with taxpayers' money. At that point, everything that was a product of federally funded research was “owned” by the government. This was a major limitation on the involvement of industry in clinical research. The plummeting of NIH research funds in the 1970s, due mainly to the oil crisis, the stock market crash, and inflation, led academic researchers to look for industry sponsorship in order to conduct their research.
But it was not until 1980, with the Bayh-Dole Act, that the relationship between industry and academia changed. This act was intended as a competitiveness and economic revitalization initiative; it followed three controversial cases in which the government asserted ownership of products from research that it had funded (Gatorade, 5-fluorouracil, and the phenylketonuria test). Probably the most interesting was that of 5-fluorouracil, an “anti-neoplastic” or “cytotoxic” chemotherapy drug. The US government claimed title to the patent because US$120 in reagents had erroneously been charged to a federal grant instead of to the US$500,000 industry-sponsored grant from Roche.
This government effort to stimulate the relationship between academia and industry followed a simple premise: that in order to improve health care, the knowledge derived from academic research should also be applicable by industry. Thus the university becomes a unit of entrepreneurship, capitalizing on the knowledge generated by its members [2]. This change was so successful that an estimated 68% of US and Canadian universities now have a partnership with industry.
This has also meant that sponsors from industry are more willing to invest in research. For instance, in the United States the NIH is probably the most well-known source of funding, but industry—pharmaceutical and biotech companies and venture funds—is the largest investor in clinical research. In fact, data from 2007 suggest that industry was sponsoring 58% of ongoing biomedical research, compared to 33% from the NIH and other federal agencies [3]. Academia and industry can develop various forms of partnership, such as material transfer, clinical trial agreements, consortia, joint ventures, consulting, equipment loans/rentals, and spinoff companies, or by procuring a service by means of a contract.
One of the major motivations for industry-sponsored trials is that Food and Drug Administration (FDA) marketing approval requires phase III clinical trials demonstrating efficacy of the agent/device combined with reasonable safety in humans. To achieve this, sponsors from industry reach out to clinical researchers at academic research organizations (AROs) so that clinical trials can be conducted. The process of approval of a new drug or device is very long and requires multiple clinical trials until the new intervention translates “from the bench to the bedside.” An Investigational New Drug/Device (IND) application needs to be submitted at least 30 days prior to the start of the first clinical trial. If in that period, or at any time point during the clinical trial's execution, the FDA finds a problem with the IND, it can put the trial on “clinical hold” or interrupt it if it is ongoing. Only after several
trials have provided enough evidence of the safety and efficacy of the new drug/device to match FDA requirements for marketing approval can a New Drug/Device Application (NDA) be submitted. Taking into consideration all the steps required for an NDA, it is not surprising that sponsors from industry strive to protect their investments with patents or other sorts of intellectual property (IP) agreements. Therefore, depending on the specificity of the agreement and on the institution, the ownership of data can be considered part of the sponsoring company's IP. Very often the sponsor claims responsibility for data analysis and eventual publication rights. This sometimes leads to complex negotiations between the sponsor and the ARO, as some academic institutions require data ownership and no sponsor role in study design, data analysis, or publication.
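As a trivial sketch of the 30-day review window just described (the dates below are hypothetical assumptions), the earliest possible start of the first trial can be computed from the IND submission date:

# Hypothetical illustration of the 30-day IND review window: absent a
# clinical hold, the first trial may start 30 days after submission.
from datetime import date, timedelta

ind_submission = date(2018, 1, 2)  # assumed submission date
earliest_trial_start = ind_submission + timedelta(days=30)
print("Earliest trial start, absent a clinical hold:", earliest_trial_start)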
This has led to a boom in specialized private businesses—contract research organizations (CROs)—that manage clinical trials. CROs usually promise lower costs and speedy completion of the study by breaking the study into several steps, while emphasizing the speedy completion of each step [3]. But these lower costs may come with workforce qualification problems by which AROs are usually not affected. Moreover, some sponsors still prefer to use academic centers, due to their reputation, as well as the lead scientist's prestige in the scientific community.
There are mutual benefits to this academia–industry relationship. For academia, such partnerships allow discoveries made in basic science to be translated into clinical applications. The patents obtained during academic research can also provide a valuable source of funding for other research activities. And conducting quality clinical research is also an important training component in educating medical students and future researchers. For industry, the benefits include increased credibility of the clinical research; reduced costs, since the workforce and the laboratories are already in place; stimulation and strengthening of R&D (research and development) activities; competitive advantage due to access to cutting-edge technology and research; and tax credits for sponsoring academic centers (Table 20.1).
Table 20.1 Advantages and disadvantages of the academia–industry relationship, for academia and for industry.
Despite the challenges, this synergy can be beneficial for both sides, but careful and thoughtful consideration must be given when planning the agreement.
In 2000 the University of California faced a $7 million legal action from a biopharmaceutical company. The reason? Researchers had refused to include the company's statistical analysis in the manuscript—an analysis that was an attempt to prevent negative results from being published.
Box 20.1 Structure of a Clinical Trial Agreement Between Industry and Academia
Introduction
– Scope of Work
– Performance Period
Term
Cost and Payment
Responsibilities
Confidential Information
Proprietary Rights
Publications
Indemnification
Study Drug/Device and Materials
General
Amendments
Counterparts
Assignment
Compliance with Law
Arbitration
Insurance
Limitation of Liability
Parties’ Relationship
Term and Termination of Agreement
Notice
Disputes
Table 20.2 Continued
Type of Expense
Laboratory Costs: A clinical trial can use several surrogate markers in order to validate the effects of the intervention. The clinical trial can require blood, urine, or other types of biological/genetic samples, and thus it is important to include these costs in the budget.
Pharmacy Costs: An investigational drug can have several pharmacy costs, such as preparation, storage, dispensation, and accounting. Pharmacy quotes detailing all the costs and staff training (if required) should be included in the budget.
Equipment/Supply Costs: The equipment required to conduct the research that is not already available at the institution. If the equipment is already available, then its depreciation should be included. Supplies include reagents and any other type of consumables required to perform the clinical trial.
Travel/Missions: Include the required travel between sites, travel to present at conferences, or field work if required.
Publication Expenses: Any fees associated with language editing or open-access publication should be included in the budget.
Patient Follow-up: Assessing outcomes and serious adverse events has its costs, which should also be in the budget.
This budgeting process ends with the execution of the study contract. The study contract should weigh the number of patients to be enrolled and the level of effort of the research team, as well as any fiscal obligations that may occur during the trial. There should also be a payment schedule. Payment can be made upon the achievement of agreed-upon milestones, or at regular intervals. Also, will there be start-up money to initiate the study before the first patient is enrolled? For instance, if training is required for the research team, or if advertisement is needed before the first patient is enrolled, are there funds to start the protocol? There should also be a provision for screen failures: even though these patients are not enrolled in the study, there are costs associated with their eligibility evaluation. Finally, if it is a multi-year protocol (the most common case in clinical research), the costs should be corrected for inflation.
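As a minimal sketch of these last two points—milestone-based payments and inflation correction—consider the following illustration; the rate, amounts, and milestones are hypothetical assumptions, not terms from any actual contract:

# Hypothetical sketch: inflation-correcting a multi-year budget and splitting
# the total across payment milestones. All figures are illustrative only.

ANNUAL_INFLATION = 0.03        # assumed 3% yearly inflation rate
BASE_YEARLY_COST = 100_000.0   # assumed year-1 cost, in the contract currency

# Correct each subsequent year's cost for inflation over a 3-year protocol.
total_budget = sum(BASE_YEARLY_COST * (1 + ANNUAL_INFLATION) ** year
                   for year in range(3))

# An assumed milestone-based payment schedule; the fractions must sum to 1.
milestones = {
    "IRB approval": 0.25,
    "50% enrollment": 0.25,
    "80% enrollment": 0.25,
    "final enrollment and data transfer": 0.25,
}

for event, fraction in milestones.items():
    print(f"{event}: {fraction * total_budget:,.2f}")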
GOVERNMENT FUNDING MECHANISMS FOR INDUSTRY
So far in this chapter we have focused on the contract agreement between industry and academia, in which industry monetarily supports the development of a trial by an ARO or a CRO. But in some situations, centers from industry (especially small businesses) can apply for government-provided R&D funds. For instance, in the United States, small businesses can apply to the Small Business Innovation Research (SBIR) program in order to develop a product that has potential for commercialization. Similarly, in the Small Business Technology
Transfer (STTR) program, academia and small businesses develop joint applications
in order to help translate science from “the bench to the bedside.”
CONTRACT PROVISIONS
Although budget is essential in a clinical trial agreement, there will also be a series
of provisions that can be the source of conflicts between academic institutions
and industry sponsors. Typically, these contract provisions include intellec-
tual property, publication rights, medical care in case of adverse events, and
indemnification.
Intellectual Property
One of the main missions of academia revolves around the dissemination of new knowledge from R&D and training. In the majority of agreements, the industry sponsor retains the IP rights for what is specified in the protocol or investigator brochure (Box 20.2). So, new IP that may be developed during a study is not necessarily the property of the industry sponsor, if not otherwise specified in the contract. In cases where new IP is not clearly specified in the contract, general patent law applies, and thus the patent title holder will be the party that developed the new IP. As a gesture of good faith, the industry sponsor of a trial in which new IP was developed may be given the first option to enter into negotiations for ownership of the new IP, even in the absence of a provision clearly specifying that. Other options could be to give the sponsor a limited period to license the new IP, after which it forfeits any rights to it, or to reach a fair license agreement in which the relative contribution of each partner is recognized.
Industry has a different position on this matter, as its main goal is to commercialize inventions; thus, if an invention during a trial is related to the study drug/device, the company believes it should own all the rights related to it. Industry sponsors also generally believe that if they fail to license the product during the option period, the company should nonetheless retain the right to match a third party's license offer. This position is in line with the company's “vision” that, without access to the IND, the research team would not have been able to make the discovery, and thus the company should retain the right to at least match a license offer from a third party.
A contract should state
Scope of the definition of inventions
Disclosure of inventions/discoveries/improvements
Ownership of inventions/discoveries/improvements in the scope of the project
Ownership of inventions/discoveries/improvements not in the scope of the project (e.g., outside the protocol, or serendipitous in the course of following the protocol)
Allowed time to exercise the option for licensure/match an offer from a third party
Type of licensure (e.g., exclusive)
Who will be responsible for the patent costs
Statement about what is not covered by the contract
Publications
Publishing the results of a trial is how research is disseminated in science. The general understanding is that if the industry sponsor has more than a certain amount of control over the content and the decision to publish, then the article will not be accepted by high-impact peer-reviewed journals. So in academia the dominant vision is that the industry sponsor should not restrict publication in any way. If accepted, such restrictions would prevent the academic institution from reaching its goal: the public dissemination of knowledge.
Thus the principal investigator (PI) at the academic institution should have full access to the data, and will be held responsible for the integrity of the data, any analyses performed, and the conclusions of the trial. But academic institutions cannot willingly or knowingly jeopardize any IP of the sponsor if that is stated in the confidentiality agreement. Thus, the sponsor's objections to the contents of a manuscript should be related to what has been marked as “confidential information” or to what may affect the sponsor's IP or ability to protect any patents.
In the event of a sponsor objecting to data publication, the academic institution should undertake serious efforts to reverse that objection. The ARO can try to find a mediator for the dispute, with a pre-specified (usually brief) time for resolution; can decide to go ahead with the publication; or can try to mobilize a publication committee, in which one of the members is an industry representative but the majority of the committee consists of independent representatives.
Industry is generally interested in the timely communication of important results. It also has responsibilities regarding the study design, as well as the integrity of the data. In addition, industry owns the databases from the large multi-center trials it has sponsored. So industry usually shares this interest with academia, with the possible exception of basic science or exploratory trials whose primary purpose is to generate ideas for future research, in which results are not immediately released except for those of potentially significant medical importance.
A contract should state
Indemnification
Indemnification is the term designating that one party will be responsible for the costs of losses incurred by a second party. In research, this second party can be the research subjects, in case something harmful happens to them while they are being tested
with the IND. The general agreement is that, when testing an IND, AROs indemnify and hold the sponsor harmless for any misconduct, negligence, or intentional acts by their own employees/agents. On the other hand, the industry sponsor will be willing to indemnify the ARO for protocol-related injuries to patients.
A contract should state
List of indemnitees: whom the sponsor indemnifies and holds harmless
Conditions for indemnification (such as claims)
Exceptions to indemnification due to non-compliance
Scope of indemnification, insurance requirements, and survival of the obligation to indemnify
Who will control the defense in the event of a lawsuit, who pays, and under which conditions
A contract should state
Scope of medical expenses that will or will not be covered
The extent to which a subject's insurance coverage may or may not be used to pay for study-related health-care expenses
The sponsor's agreement about what to do in the event of a study-related adverse event
Circumstances under which the sponsor will decline any payment/costs
The sponsor's obligation to provide ongoing care for efficacious drugs in chronic disorders
Note: As per FDA requirements [8], serious adverse events should be reported as soon as possible, and no later than 15 days after the PI becomes aware of them; the IRB and the sponsor should also be notified (FDA Investigational New Drug Application [IND], 2017).
Goethe’s Faust
Jean-Luc would have been a brilliant scientist had he decided to stay in academia, but he was more interested in having a “rich future”—that was the opinion of Dr. Jean-Luc Richelieu's colleagues just after he finished his postdoctoral fellowship in Munich. Indeed, his CV was impressive: medical school at the Faculté de Médecine Paris Descartes, residency in neuropsychiatry at King's College (London), a doctorate in neuroscience from MIT (Boston), besides the postdoctoral fellowship in Munich. The decision to accept the job proposal from the medium-size pharmaceutical company Psychotics™ to become its medical director was beyond doubt for Dr. Richelieu: living in Paris, his beloved city, in a big house, with a big salary and a lot of glamour. But Dr. Richelieu soon realized there is no free lunch—three weeks after he was hired, the CEO of Psychotics™ invited him for a business talk. “Jean-Luc, I have big plans for you. You know we hired you because you are brilliant, studied at top-notch universities, have good influence in academic circles, and speak fluent French, English, and German. I will make you the golden boy of Psychotics™.”
“As you know, the aim of our pharmaceutical company is to increase our market share of psychiatric drugs in France. We are developing a new antipsychotic drug called Serenium to be used as a treatment for insomnia. Our pharmacists have been working with the first antipsychotic drug—chlorpromazine—which, as you know, was first tested in Parisian hospitals. They changed its molecular structure in order to enhance its sedative effects while diminishing its extrapyramidal side effects—we plan to re-launch the drug on the market after the confirmatory clinical trials.”
“Wait—” Jean-Luc interrupted. “Chlorpromazine was synthesized by the Rhone-Poulenc laboratories, which is now Sanofi-Aventis (a huge pharmaceutical company). Can we use their drug? Are we not violating intellectual property?”
The CEO answered gently, managing his anger, “Yes, it was synthesized by Rhone-Poulenc—60 years ago! As you know, drug patents are valid for only a limited time. In fact, drug patents grant 20 years of protection on average, and given that they are applied for before clinical trials start, the effective life of a drug patent tends to be shorter than that: between 7 and 12 years. So we are all set and it is OK to use—in addition, by changing the molecular structure of this drug we will gain a new patent. Besides, as the new compound is similar to the old one, we can also use some of the safety and efficacy data from the old drug, which were confirmed by our recent phase I and II trials. Now we need to go straight ahead to a big, multi-center, phase III trial—which you, Jean-Luc, are going to lead! Congratulations! This is a big opportunity to show us what you are able to do.”
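The CEO's patent arithmetic can be made concrete with a minimal sketch (the statutory term and development times below are illustrative assumptions): the effective life of a patent is roughly the statutory term minus the development time consumed before launch.

# Hypothetical illustration of effective patent life: the statutory term runs
# from the filing date, but clinical development consumes part of it before
# the drug reaches the market. All figures are illustrative assumptions.

STATUTORY_TERM_YEARS = 20  # typical protection from the filing date

def effective_patent_life(development_years: float) -> float:
    """Years of market exclusivity remaining after development ends."""
    return max(0.0, STATUTORY_TERM_YEARS - development_years)

# With roughly 8-13 years of development, about 7-12 years of effective life
# remain, matching the range quoted in the dialogue.
for dev_years in (8, 10, 13):
    print(dev_years, "years of development ->",
          effective_patent_life(dev_years), "years of effective patent life")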
Those words stayed in Jean-Luc's head for a while: “show us what you are able to do.” He knew this was the major test of his reputation, as the company was depending on the success of this trial in order to remain alive and healthy. He thought aloud, “Four or five years from now, if this does not work, I may be taking out a second mortgage on my house and selling my car to pay the bills.”
Intellectual property (IP) is a legal means of establishing that creations of the human mind in industrial, artistic, and scientific activities are protected much as physical objects are. In each country, a number of laws allow inventors to limit the use of their ideas.
The need for laws that protect IP is based on two main goals: to give statutory expression to the rights over these creations and to promote their economic and social applicability [8].
Intellectual property is usually divided into two major areas: copyright and industrial property. While copyright refers mainly to artistic and scientific productions, industrial property refers to inventions, industrial designs, trademarks, and trade names.
In the area of clinical research, the most common objects of intellectual property protection are categorized as patents, copyrights, and trademarks.
Although each country has its own laws and agencies for intellectual property protection, the application process generally begins with a request form that is examined by the patent office and, once it meets the applicable provisions, is registered by the responsible body.
He did not wait one more minute and contacted Prof. Briggs immediately, who was initially favorable to the idea but wanted to meet to talk about the details.
This agreement could be favorable to Prof. Briggs, as he had had a recent meeting with the dean of medicine—James Tarsy—in which the main topic was how to increase collaboration between academia and industry. In this meeting, the dean had said, “As you know, Greg, our university has a long history of collaboration with industry that has generated many products for the open market that have improved the quality of life of our citizens. And in fact, this is how I see the role of the university: to give back as much as possible to society. But in order to do that, we need to increase our collaboration with industry. That will be one of the flags of my tenure at UCLA. In addition, we are losing good faculty to industry, and this can be avoided. On the other hand, we need to defend our interests and guarantee that the agreements we make with industry are good for us as well.” Thus, Prof. Briggs saw the recent conversation with Jean-Luc as a great opportunity, but he knew he would need to be careful in this interaction.
The Negotiation Table
One month later, Psychotics™ sponsored a one-week symposium (Serenity, Serendipity, Serenium—because your patient can sleep tight) in a five-star hotel in Paris. Jean-Luc hosted the guests, who were among the most influential names in psychiatry. Before their arrival in Paris, Dr. Richelieu spoke to his staff: “This should be done perfectly, as we are hoping to invite our guests to participate in a multi-center randomized trial to test Serenium versus standard therapy. But before doing so, we need to have a detailed conversation with Prof. Briggs, as the main agreement will be decided with him. Let us reserve a nice conference room in the hotel to decide on the main aspects of this agreement. Let us do this the day before the conference, after Prof. Briggs's arrival.”
As planned, Prof. Briggs arrived after a long flight from the West Coast to Europe. Unfortunately, there were no direct flights, and the connections spoiled Prof. Briggs's chances of resting during the flights; adding in the jet lag, he was not at his best, but he was looking forward to the first round of negotiations.
The next morning, everyone was waiting anxiously for Prof. Briggs. When he arrived, Jean-Luc gave him a warm greeting: “My dear colleague, it is a pleasure to have you here. I hope everything is going well.” And, not waiting much longer, he continued, “As you know, we have a tight schedule; let us go directly to business.” After Prof. Briggs's initial words, Jean-Luc started with a summary of the project: “We aim to perform a large phase III multi-center trial comparing Serenium versus standard therapy. We are prepared to pay all the study-related costs, including personnel, patient fees, quality assurance, and training. We are going to train all researchers of the participating centers here in Paris to guarantee the internal validity of our study. We also expect to sponsor the researchers to present the results at international symposiums and local seminars. The role of the centers will involve recruitment and selection of the sample, and setting up a working research center. The idea is to start data collection in 9 months and conclude in 18 months. As you know, we need to launch this drug onto the market as soon as possible.”
After this initial introduction, Prof. Briggs spoke: “Thank you, Jean-Luc; we have a very good opportunity here. This is clearly a potential win-win situation, and we are looking forward to working with your company. But we need to discuss four important topics in detail: (1) intellectual property; (2) rights to the data and publication; (3) payments; and (4) the performance period, the time to complete the study. I know these are delicate topics, but we need to discuss them very carefully.”
He then continued, “Let us start with intellectual property. As you know, in the United States the Bayh-Dole Act grants universities permission to retain intellectual property rights to inventions resulting from federally supported research—and even to license these inventions to private industry for commercialization. As you know, the idea of the Bayh-Dole Act is to motivate researchers to continue investigating new compounds. Although your company may cover the expenses of this trial, most of the researchers involved will also have federal grants, and some of our equipment (for instance, computers and software) was bought with taxpayers' money. Finally, I would like to help with the design, indication, and dosage of this drug, as the treatment of insomnia with antipsychotics is the main line of my research; therefore, I think it would be fair if we shared the intellectual property of this trial.” This comment created a level of discomfort in the room. Jean-Luc was afraid that he was starting to lose control of the situation and quickly replied, “I think this is a good point. The main issue here is that our company created this compound and did the initial testing, so it would not be adequate to share the intellectual property, given that all the creation was done in our company. But I would suggest that we move to the next point: rights to the data and publication.”
Jean-Luc had decided to use a negotiation technique: when a topic is not progressing well, quickly move to the next topic so as to avoid an increase in tension. He then proceeded, “Regarding the data: because we are sponsoring the trial, all data should be immediately disclosed to us and, as you know, we want the data to be published, but we will write the manuscript and give the academic centers the opportunity to review the manuscripts within a period of 60 days. We will not allow independent publication, in order to avoid the disclosure of any confidential information.” It seemed that the mood in the room had not improved, as Prof. Briggs also quickly replied, “I understand your position, Jean. But this is not what we are used to, nor is it what we like to do. We usually do the opposite: we have the right to write and publish the results of the study, and prior to the submission of the publication we send the manuscript to you for your review; and we also would like to have the right to publish a small subset of the data if we want to do so.” It seemed that the negotiation was not going well, as they had reached another roadblock. Jean-Luc was betting that the remaining topics would improve the situation.
Jean-Luc then, feeling a bit frustrated, started, “Well, Greg, let us then see if we
agree on payments and deadlines! I think this will be an easy one as we are willing
to cover all the study-related costs. The protocol used by our company is that we pay
one-quarter of the budget after IRB approval and then pay the second quarter after
50% enrollment, the third quarter after 80% enrollment, and the last allotment after the
final enrollment and transfer of the data. We would need to have the enrollment done
in 9 months and we would withhold part of the budget if there are delays.” Judging
by Prof. Briggs’s expression, this did not seem to have gone well either. He then replied,
“We may have a problem here, too. I think 9 months is not enough for us, even
considering that we will have other centers. You know that it is becoming increasingly
difficult to have patients participating in trials, and in addition our ethics committee
might delay the start of the project. If we make the budget dependent on enrollment,
then we will have a problem with the fixed costs of the trials—such as salary for the
personnel involved in the trial, like the co-investigators and research coordinators.
Indeed, if the budget decreases with a delay of the trial, then we will have a big problem
with salaries in our institution. We also need to review this.”
After this initial round of conversation, both of them were emotionally drained.
The situation had not gone as planned by Jean-Luc. He felt discouraged but de-
cided to end this meeting and take Professor Briggs on a nice tour in Paris—that was
his last ace in the hole—perhaps Paris, the City of Light, would improve the chances of
reaching an agreement for both sides.
CASE DISCUSSION
Jean-Luc had the potential to become a gifted scientist, but chose instead to become
an executive in a pharmaceutical company. His first challenge, in his new role, is to
lead a trial to test a new drug named Serenium. This new drug is a modification of
chlorpromazine, the first antipsychotic drug, which was developed more than 60 years
ago. Previous phase I and II trials have already shown its efficacy and safety for in-
somnia, and thus the company thinks that now is the time to sponsor a large phase III,
multi-center, randomized clinical trial for regulatory approval.
To conduct this study, Jean-Luc invites Prof. Briggs, a world-renowned psychia-
trist, with a vast experience in clinical trials. This could be a win-win situation for both
parties: academia and industry. Industry benefits from the expertise of Prof. Briggs,
while academia will benefit from the resources of the industry to conduct a clinical
trial. Despite the advantageous situation, there are four key elements that distinguish
industry from academia: (1) intellectual property (IP); (2) rights to the data and pub-
lication; (3) payments; (4) performance period.
As already mentioned in this chapter, IP is an important matter for both parties.
Industry thinks that it needs to protect its IP and that new IP developed during the
course of an ongoing research should also belong to its IP portfolio. For academia, new
IP developed during the clinical trial does not necessarily belong to the sponsor, even
if the sponsor is the federal government (cf. the Bayh-Dole Act). So the question here is
what to do with new IP that can arise from this trial.
The second point on which industry and academia find themselves in opposing
trenches is publication rights. Who has the right to publish? Usually the sponsor
owns the database of the trial that it sponsored. But that does not necessarily mean
that it owns the data that arise from research. As we discussed previously, some in-
dustry sponsors want to keep control of publications that arise from sponsored trials,
as these will be determinant for the future commercialization of their product. But the
general vision in academia is that, apart from “confidential information” that could
jeopardize the sponsor’s IP, all the decisions about the manuscript should belong to
the academic research organization (ARO). The sponsor could be given a period to
review the manuscript, but should restrict its comments to topics that could limit its
ability to commercialize the product.
The budget and the payment plan can be another source of potential conflict be-
tween industry and academia, simply because of different goals and timings. Industry
is concerned with commercialization of the drug/device, so wants to disseminate the
results as soon as possible. To achieve that, industry sponsors attempt to impose
payment milestones on academia. Payment by objectives, or payment withholding,
are common tactics that industry employs with sponsored academic research centers.
But, again, the focus of academia is not to commercialize products, but to train people
and to create and disseminate knowledge. That means that getting paid by objectives
will limit academia’s ability to retain staff, especially if enrollment falls below what
was anticipated. This is interconnected with the last point: the performance period.
The industry sponsor plans
a duration of the trial, which very often is not realistic for the ARO. And there are
many possible reasons for that: the academic research center is focused on different
trials; the IRB/ethics committee may delay the start of the trial, even if amendments
to the original protocol are not required. For instance, if there is no agreement in place
between IRBs from different institutions, each performance site involved in a
multi-center trial may be required to secure independent IRB approval. If that is
the case, any modification required by one of the IRBs needs to be accepted by all
of them. This can be time-consuming, and thus can endanger the study performance
period.
FURTHER READING
Papers
To explore the translational research movement, including historical perspectives, funding
models, and career building:
• Translational research: getting the message across. Nature. 2008; 453(7197): 839.
doi:10.1038/453839a
• Nathan DG. Careers in translational clinical research— historical perspectives, future
challenges. JAMA. 2002 May 8; 287(18): 2424–2427.
• Porter P, Longmire B, Abrol A. Negotiating clinical trial agreements: bridging the gap between
institutions and companies. J Health Life Sci Law. 2009 Apr: 121.
• Mello M, Clarridge B, Studdert D. Academic medical centers’ standards for clinical trial
agreements with industry. N Engl J Med. 2005 May 26; 352: 2202–2210.
Online Information
• http://hms.harvard.edu/content/hmshsdm-fcoi-policy-sponsored-research—Example
from Harvard Medical School of its policy on the relationship between industry and
academia.
Books
• Gallin JI, Ognibene F. Principles and practice of clinical research, 2nd ed. New York: Elsevier;
2007: 341–350.
21
D E S I G N A N D A N A LY S I S O F S U RV E Y S
Quand on ne sait pas ce que l’on cherche, on ne voit pas ce que l’on trouve.
[If you do not know what you are looking for, you do not see what you have found.]
—Claude Bernard (French physiologist, 1813–1878)
INTRODUCTION
Surveys are often used in clinical research. A survey could be defined as a method
where information is obtained from a sample of individuals through a series of
questions. This definition already contains the key elements of a survey: the goal of
a survey is to gain knowledge on a certain topic; a sample is defined and selected as
a representative part of a target population; data are collected through a number of
questions using interviews or questionnaires.
In medical research, surveys are used in descriptive, exploratory, and experimental
studies to assess parameters such as quality of life, pain levels, and mental health.
While measurements in experimental and observational studies yield objective data
with explanatory weight, information collected through surveys is subjective and
mostly descriptive. This might be one reason why surveys are not given the credit
and attention they deserve. Another reason might be that researchers tend to assume
that the design and analysis of a survey are rather trivial. A further problem is the lack of
reporting guidelines for survey research [1], which makes it difficult to assess the
quality of a survey and the true implications of its findings.
In fact, survey research involves many methodological challenges, which may
strongly influence the quality of the survey results. Nevertheless, survey data can
generate many interesting new questions and provide new insights, consequently
leading to new hypotheses and further research studies.
In this chapter, we discuss the most important aspects of designing, administering,
and analyzing surveys in clinical research and highlight important points to consider.
We provide a general overview of each of the following main stages of survey research:
1. Sample design
2. Instrument design
   • Method for survey administration, data collection, and data capture
3. Data analysis
We also discuss problems and pitfalls, as well as legal and ethical issues when
conducting survey research. Since this chapter cannot replace an entire book about
survey research, we refer to external sources to complement this chapter. We hope
that at the end of the chapter you will be able to better interpret published surveys,
will have a higher appreciation of the information provided in surveys, and will be better
prepared for conducting your own survey research. Although very few investigators
will design a survey during their research career, most clinical researchers
will use a survey in their research; thus, learning the methodology of surveys will help
investigators use this instrument adequately.
DESIGN
Defining the Aim(s) of the Survey Study
As discussed in Chapter 2, you should start your research project by defining your research
question. What are your aims? What are the specific objectives? What is the purpose of
your study? Do you have a hypothesis? Do you want to explore a relationship, or do you
just want to describe a condition or trend? Defining the aims is necessary in order to select
the appropriate primary outcome, the target population, and a suitable survey design.
A common mistake is that researchers instead start with the instrument design, based on
the topic of interest, and then try to make the other parts of the survey design fit it.
An example of an experimental use of a survey is a study by Schron et al., where
a questionnaire was used to compare the quality of life (QoL) of patients on anti-
arrhythmic drug (AAD) therapy versus patients with an implantable cardioverter
defibrillator (ICD). While the survival benefit of ICDs is unquestionable, this study
tried to answer what impact each treatment has on patients’ QoL [2].
A recent study in the New England Journal of Medicine surveyed residency program
directors regarding the effect of the new Accreditation Council for Graduate Medical
Education (ACGME) rules one year after implementation [3]. This is an example of an
exploratory study. The purpose of this study was to evaluate whether there is a relationship between
implementation of the new ACGME rules and changes (good or bad) perceived by the
residency program directors in regard to patient care, resident education, and quality of life.
As discussed in the previous chapter, the investigator designing a survey must
have a clear idea of the objectives of that survey: What is it measuring? How is it going
to be used? In what population is the survey going to be used? Based on the goals, the
investigator can design an appropriate survey.
Instrument Design
Stage 1: Planning and Development
Through surveys, data are collected in a systematic way, generally based on a
standardized assessment instrument [4]. This instrument is either an interview or
a questionnaire.
Questions
A well-designed survey instrument consists of questions that are
• Brief, direct, and clear: Avoid complex questions that may be misunderstood and
questions that allow for more than one specific answer. Use neutral language and
avoid unclear definitions or use of uncommon terms.
• Unambiguous: Avoid double-barreled questions, which may lead to misunder-
standing and incorrect answers (e.g., “Do you have problems with climbing stairs
or do you have chest pain?” A person might give the expected answer when she has
congestive heart failure, but if she has merely sprained her leg, you will receive an
answer that leads you to a wrong conclusion).
• Directed to address the main research question: Non-specific questions lead to
lack of interest and influence the instrument validity.
• Valid and reliable: They measure what they are intended to measure (internal validity).
• Attention and interest catching: Questions should follow a sequence from
neutral and general items to more specific and sensitive ones, respecting a
logical and congruent order.
Questions should be written so that responses given will help answer the research
question. If the aim of the study is to test or confirm a specific hypothesis, attention
has to be paid not to bias answers by providing an answer choice that is more likely to
be chosen because it suggests the study’s hypothesis. Similarly, leading questions
that skew answer choices should be rephrased as unbiased questions by removing
leading phrases (doctors believe that acupuncture . . .) or judgmental wording (should,
ought to, bad, wonderful, etc.).
[Figure: panels illustrating reliable but not valid, valid but not reliable, and valid and reliable measurements.]
Answer Types
Nominal answers can be either categorical (e.g., race) or multiple choice (e.g., past
medical diseases).
Ordinal answers reflect a rank order (e.g., rank the following items from 1 to 7
as how important they are for your happiness, with 1 being the most important and
7 the least important item to you: money, child(ren), reputation, education, health,
spouse, food).
Interval answers reflect an order and are evenly spaced (e.g., age: 16–25, 26–35,
36–45).
Numerical answers are continuous variables that have a meaningful zero and are
usually open-ended (e.g., what is your height in centimeters?).
Response scales (e.g., Likert scales—scales with several rating options of agreement
or disagreement) are usually used to record attitudes or values in a series of questions
that ask for favorable and unfavorable characteristics [6].
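To make these answer types concrete, the following minimal sketch (our illustration, not part of the chapter; it assumes Python with the pandas library, and the item names and values are hypothetical) shows one way each type could be represented for analysis:

import pandas as pd

# Nominal: unordered categories
race = pd.Series(["White", "Black", "Asian"], dtype="category")

# Ordinal: ordered categories, e.g., a Likert-type agreement item
agreement = pd.Categorical(
    ["agree", "neutral", "strongly agree"],
    categories=["strongly disagree", "disagree", "neutral",
                "agree", "strongly agree"],
    ordered=True,
)

# Interval: ordered, evenly spaced age bands
age_band = pd.Categorical(
    ["16-25", "36-45"],
    categories=["16-25", "26-35", "36-45"],
    ordered=True,
)

# Numerical: continuous, open-ended, with a meaningful zero
height_cm = pd.Series([172.0, 158.5], dtype="float64")

Storing ordinal answers as ordered categories, rather than bare integers, keeps the rank information explicit for later analysis.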
When constructing the survey, consider who your target population is. Do you
think language might be a problem for them? Should you translate the survey or use
pictograms instead or in addition to words? It is also important to match the layout/
graphical presentation to your target population. The graphical design should motivate
participants to complete the survey and improve clarity and ease of answering (e.g.,
how many questions are presented on one page, whether answer choices are arranged
in a grid, and whether a slider or a progress bar would be helpful).
Structure
A brief introduction should summarize the purpose of the survey and include a
confidentiality statement. An estimate of the time required to complete the survey
should be provided (in electronic questionnaires a progress bar would be useful).
Following is the main body of the survey with the set of questions, best grouped
into subsets.
The order of questions is important. Demographic questions should be asked at
the end, taking into consideration that the participant may be exhausted after a long
survey, and less challenging questions can be more easily answered toward
the end. Asking for demographics might feel personal or intrusive and may make
participants defensive, thereby altering the answer pattern of the survey. On the
other hand, if demographic information is important for the analysis (e.g., subgroup
analysis, adjusting for gender), the questions might be put at the beginning, to make
sure that those questions are answered, in case there is a chance of not finishing a
survey.
The end of the survey can conclude with a short summary of how the answers will
be used and a thank you statement. If deemed useful, permission to re-contact can be
asked for.
Stage 2: Pre-Test
In the pre-test, the questionnaire or interview is applied to a small sample drawn from the target
population (5–10 subjects), following the same procedures that are defined for the
main survey. In a pre-test you will be able to identify existing flaws, thus identifying
in advance potential pitfalls of the main survey. These can be problems with your
instrument (e.g., wording, answer choices, length) but also with your study design
(e.g., mode of administration, response rate) [3].
A common strategy used for the development of an adequate instrument includes
using open-ended questions in the pilot phase to identify the most important answer
choices for inclusion and then design closed-ended questions in the finalized survey
instrument. In summary, the pilot study’s main purpose is to allow refinement of the
quality and validity of the data collection instrument and improvement of the overall study
design. Thus, despite the fact that a pilot study is time-consuming and increases the
cost of the research project, it deserves serious consideration.
There are some methods to validate a survey. As this is beyond the scope of this
chapter, we will only briefly cite the methods used for validation. They can be divided
into two categories: (1) based on judgment, in which other methods are used to
validate the survey; (2) based on checks against the data, in which the investigator
compares the data against data that are considered valid.
The methods to validate a survey using judgment are the following: (1) face
validity (or logical validity) where the investigator assesses whether the measurement
is logically consistent—for instance, assessing age by the birth certificate seems logical
and accurate; (2) content validity, which indicates that all aspects that are aimed to
be investigated are being assessed in the survey (e.g., if a survey aims to assess quality
of life, the investigator needs to ensure that all aspects of quality of life are being
measured); and (3) consensual validity, which occurs when experts in the field agree
that the instrument is valid.
The methods to validate surveys using data include the following: (1) Criterion
validity, in which the survey is checked against another survey or similar instrument.
For instance, blood pressure measured with a sphygmomanometer can be
checked against direct intra-arterial measurement of blood pressure. (2) Convergent and
discriminant validity, in which the new survey is checked against other surveys. The
goal is to find alternative methods that correlate with the new instrument, although the
correlation is not perfect. These are not the strongest methods on their own, because both
convergent and discriminant validity are needed in order to achieve (3) construct validity,
which is used when novel instruments are checked against a related variable; for instance, an
investigator developing a new instrument to measure angina correlates this instrument
with an imaging exam of the coronary arteries. The last two methods are (4) predictive
validity (measured against a future event, for instance mortality in the future) and
(5) responsiveness (when the new instrument is assessed in different conditions to
measure if it can change).
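As an illustration of a check against data (criterion or convergent validity), one could correlate scores from the new instrument with an established reference measure. The sketch below is our own, with hypothetical paired scores, and assumes Python with NumPy and SciPy:

import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical paired scores: new angina questionnaire vs. an established reference
new_instrument = np.array([12, 18, 25, 31, 40, 44, 52])
reference = np.array([10, 20, 22, 35, 38, 47, 55])

r, p = pearsonr(new_instrument, reference)          # linear association
rho, p_rank = spearmanr(new_instrument, reference)  # rank-based alternative for ordinal scores

# A high correlation supports criterion/convergent validity; for discriminant
# validity, one would expect a low correlation with an unrelated construct.
print(f"Pearson r = {r:.2f} (p = {p:.3f}); Spearman rho = {rho:.2f}")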
Sample Design
Sample design is actually one of the greatest challenges of survey research. You have to
define the target population, determine the accessible population, and finally, obtain
a representative sample from the accessible population. Which subjects should be
included? To what degree are they accessible and how can they be accessed? To what extent
do we want our results to be generalizable?
The population of interest is predetermined by the research question, but time
is well spent to clearly define it. While in other forms of research, inclusion and
exclusion criteria are critically considered and published, survey research is not
that transparent. Nevertheless, it is advisable to characterize the target population
as exactly as possible, so that you can define the criteria by which you select your
sample. Similarly to the sampling process in experimental research, a non-biased
sample must be selected from an accessible portion of the target population (see
Chapter 3 for more about study sample). In survey research especially, much
thought has to be spent on the degree of accessibility of the target population given
the mode of survey administration (e.g., if you choose to do a telephone survey,
will you be able to equally reach senior people, who often still have landlines, and
younger people, who are mostly cellphone users and therefore not registered in
a phone book?). Your ability to select a random and representative sample will be
essential to determine the generalizability of your findings.
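To illustrate the sampling step itself, here is a minimal sketch (ours, not the chapter’s; it assumes Python with pandas and a hypothetical sampling frame) of drawing a simple random sample and a stratified random sample from an accessible population:

import pandas as pd

# Hypothetical sampling frame listing the accessible population
frame = pd.DataFrame({
    "person_id": range(1, 10001),
    "age_group": ["18-34"] * 4000 + ["35-64"] * 4000 + ["65+"] * 2000,
})

# Simple random sample of n = 500 people
srs = frame.sample(n=500, random_state=42)

# Stratified random sample: 5% from each age group, preserving the
# population's age structure in the sample
stratified = frame.groupby("age_group").sample(frac=0.05, random_state=42)

Stratification guarantees that each subgroup is represented in its population proportion, which simple random sampling only achieves on average.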
Modes of survey administration: advantages, disadvantages, and typical response rates

Face-to-face interview
• Advantages: interaction between the interviewer and the responder; helps to interpret concepts and clarify doubts; can clarify misunderstandings; reduces non-responses; interviewers may reinforce confidentiality.
• Disadvantages: costly; time-consuming; requires training of the interviewers; may produce interviewer-induced bias and social desirability bias.
• Response rate: higher compared with other methods.

Telephone interview
• Advantages: less expensive and less time-consuming than face-to-face interviews; can help interpret concepts and clarify doubts; can clarify misunderstandings; reduces non-responses to individual questions; effective in terms of time.
• Disadvantages: limited interaction between the interviewer and the responder; high rate of non-responses; administration more difficult due to cell phone use.
• Response rate: higher than the postal mail method.

Postal questionnaire
• Advantages: reduced cost compared to face-to-face interviews (but still higher than telephone and email); bias can be minimized (e.g., social desirability bias); high level of confidentiality.
• Disadvantages: no contact between interviewer and responder; may not be time effective (it may take months to receive the surveys); requires a larger sample to address the non-response rate issue.
• Response rate: usually low.

Email questionnaire
• Advantages: low cost; quick and easy administration to a large number of individuals; more effective in terms of time; convenient and straightforward for Internet users; easy data capture.
• Disadvantages: all responders need to have Internet access; low response rate (spam filters, “survey fatigue”); increased chance of randomly/wrongly answered questions.
• Response rate: usually low.
DATA ANALYSIS
After design and administration of the survey, the next step is the analysis of the
collected data. This is, indeed, one of the most critical and time-consuming aspects
of the whole survey process. As previously stated, it is recommended to have a
data analysis plan written up at the beginning of the survey design process. This is
recommended because it will prepare you to design the survey in a way that data
obtained can be analyzed, and it prevents a data-driven analysis.
Before you can analyze your data you will have to code them (unless you have used
a data capture technique that already provides you with coded data). Coding means to
convert answers into data that can be handled by a statistics program.
In survey instruments with closed- ended questions, this is a relatively
straightforward process, as it is possible to code answers as strings or numerical
variables (binary, integers, floating points) for quantitative analysis.
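As a minimal sketch of such coding (our illustration, assuming Python with pandas; the items and labels are hypothetical), binary and ordinal answers can be mapped to numbers like this:

import pandas as pd

# Hypothetical raw answers to two closed-ended items
raw = pd.DataFrame({
    "smokes": ["yes", "no", "no", "yes"],
    "health": ["poor", "good", "fair", "good"],  # ordinal self-rated health
})

# Binary coding: yes/no -> 1/0
raw["smokes_code"] = raw["smokes"].map({"yes": 1, "no": 0})

# Ordinal coding: integers that preserve the rank order of the categories
order = ["poor", "fair", "good"]
raw["health_code"] = raw["health"].map({label: i for i, label in enumerate(order)})
print(raw)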
The next step is to edit/clean the data set. This step could be considered quality
control, where data entry problems (answers put in the wrong field, wrong units
used, etc.) can be detected, as well as outliers and missing data. Missing data can be
distinguished as data-missing and case-missing. Data-missingness means that some
responses are missing, while case-missingness occurs when an individual selected
for the sample either did not respond or dropped out [7]. Missing data have to
be planned for and addressed. (See Chapter 13 for methods of how to address
missing data.)
The actual data analysis step depends on the type of analysis we aim to conduct.
For a descriptive approach, summary statistics can be easily compiled, for instance
central tendency (mean, median, mode), dispersion (ranges), and frequencies.
For hypothesis testing, the appropriate statistical test has to be selected based on
data type and study design (parametric vs. non-parametric, paired test, chi-square,
correlation, etc.).
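For instance, the following sketch (hypothetical data, our own illustration, assuming Python with NumPy and SciPy) compiles a simple descriptive summary and then tests a yes/no item between two groups with a chi-square test:

import numpy as np
from scipy import stats

# Hypothetical coded responses (1 = yes, 0 = no) from two groups
group_a = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
group_b = np.array([0, 0, 1, 0, 0, 1, 0, 0, 1, 0])

# Descriptive summary: proportions answering "yes"
print("proportion A:", group_a.mean(), "proportion B:", group_b.mean())

# Hypothesis test: build a 2x2 contingency table and run chi-square
table = np.array([
    [group_a.sum(), len(group_a) - group_a.sum()],
    [group_b.sum(), len(group_b) - group_b.sum()],
])
chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, p = {p:.3f}")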
If the survey was designed with open-ended questions generating qualitative data,
the most common approach is to report answer frequencies for each item, generally
converted to percentages; other established methods, such as content
analysis, are also available [8].
The most frequently used type of analysis for surveys is the non-parametric approach,
given that most of the surveys are based on ordinal scales. However, as discussed in the
statistical section of this book, some survey results may be considered parametric and
are analyzed using parametric tests such as ANOVA or regression modeling.
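A minimal sketch of the non-parametric route (our illustration with hypothetical 5-point Likert responses, assuming SciPy) uses the Mann-Whitney U test for two independent groups:

from scipy.stats import mannwhitneyu

# Hypothetical 5-point Likert responses from two independent groups
treated = [4, 5, 3, 4, 5, 4, 2, 5]
control = [3, 2, 3, 4, 2, 3, 1, 3]

u, p = mannwhitneyu(treated, control, alternative="two-sided")
print(f"Mann-Whitney U = {u}, p = {p:.3f}")

Because the test compares ranks rather than means, it does not assume that the distance between adjacent Likert categories is constant.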
BIAS IN SURVEYS
As with any method in clinical research, surveys are subject to many types of
biases. One important type is non-response bias, which
occurs when responders differ from non-responders; results will then be biased,
as they will reflect the characteristics of responders only. The impact of this bias on
survey results will depend on how different the non-responders are from the responders,
and how that can affect the main results/main hypothesis.
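One common mitigation, not covered further in this chapter, is post-stratification weighting: responders are re-weighted so that their demographic mix matches the target population. A minimal sketch with hypothetical shares (plain Python):

# Hypothetical shares by age group: population vs. achieved responders
population_share = {"18-34": 0.30, "35-64": 0.50, "65+": 0.20}
responder_share = {"18-34": 0.15, "35-64": 0.45, "65+": 0.40}  # older people over-responded

# Post-stratification weight = population share / responder share
weights = {g: population_share[g] / responder_share[g] for g in population_share}
print(weights)  # {'18-34': 2.0, '35-64': 1.11..., '65+': 0.5}

# Note: weighting corrects only for the characteristics used to build the
# strata; responders may still differ from non-responders in unmeasured ways.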
Sampling bias is also a potentially important limitation of surveys. Did sampling bias
occur? How representative was your sample, and therefore how strong is the external
validity?
Recall bias can distort the reported information if the person does not remember
correctly or remembers certain (usually negative) experiences better than others. This
phenomenon is related to the issue of under- and over-reporting, which can easily
become very complex (e.g., a potential BMW buyer might find it more important that
he can sync his iPhone with the onboard computer, while a potential Toyota driver
might attach more importance to the fuel efficiency of a car).
REPORT OF RESULTS
Similar to the data analysis plan, you should already have drafted a report outline at
the design stage of your survey research project. This helps to write a concept-driven
report rather than a data-driven report. Reports should be aimed at a specific target
audience that you should have already had in mind when formulating your research
question. This will increase the chance that your study will have the impact and
recognition it deserves.
The final report of your survey should include the aim, instrument used,
administration process, data analysis, and results. If you used an established survey
instrument, justify the reason for its use in the context of your study. If you have
developed a new survey instrument, you have to submit proof of its validity. In both
cases you have to justify that the sample size you chose was appropriate. When
reporting your results, make sure to include confidence intervals and margin of
error. (For additional information regarding manuscript writing and submission, see
Chapter 23 of this book.)
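For a proportion estimated from a simple random sample, the margin of error is commonly approximated as z * sqrt(p(1 - p)/n). A minimal worked sketch (hypothetical numbers, plain Python; our illustration, not the chapter’s):

import math

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Normal-approximation margin of error for a sample proportion."""
    return z * math.sqrt(p * (1.0 - p) / n)

p_hat, n = 0.42, 600  # hypothetical: 42% of 600 responders agreed
me = margin_of_error(p_hat, n)
print(f"95% CI: {p_hat - me:.3f} to {p_hat + me:.3f}")  # about 0.38 to 0.46

Reporting the interval alongside the point estimate makes the survey’s precision explicit to the reader.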
ETHICAL AND LEGAL ISSUES
Survey research must respect participants’ rights to informed consent, confidentiality,
and privacy. Informed consent for a face-to-face interview can be obtained before the
interview in writing. For telephone interviews, complete survey information can be
disclosed before the interview and consent obtained orally. For mail questionnaires, a
cover letter explaining the survey or an informed consent form can be included. Return
of the questionnaire would imply consent [5]. Return of emailed questionnaires
would equally imply consent. For web-based questionnaires, a start page with the
survey disclosure can be used with the requirement to click “agree” to the terms of the
survey before proceeding.
Preserving confidentiality is not just essential due to legal reasons, but it also
increases the chance to obtain complete and honest answers.
An Emergency Meeting
Two hours later, Prof. Marley set up an emergency meeting with the faculty and
postdocs of the social sciences department. After briefing them on the talk with the
dean, Prof. Marley stated strongly that he did not want to “resize” the department;
nevertheless, they needed to come up with an alternative to increase the funds of
the department, and his idea was to submit a large research proposal to the National
Institutes of Health (NIH) to try to get some funds and also to show the dean that the
department was still producing good scientific papers. He concluded his remarks with
the phrase, “we need not only to get papers in high-impact journals, but also to get
onto the cover of Time or Newsweek.”

2. Professor Fregni and Dr. Brunoni prepared this case. Course cases are developed solely as the
basis for class discussion. The situation in this case is fictional. Cases are not intended to serve as
endorsements or sources of primary data. All rights reserved to the authors of this case. Reproduction
and distribution without permission are not allowed.
As social scientists, they quickly agreed that they would propose a survey study—
but on which topic? Professor Ford had an idea: “We are in California. We know a
sensitive and difficult topic here and throughout the US is the use of cannabis. In fact
a recent survey found that the US has the highest level of cocaine and cannabis use.3
The population of some areas seems to have a sympathetic view toward its use, but we
do not know if such populations also represent regular users of the drug, and therefore
if the observation is biased. Although there is a debate regarding the use of cannabis
for medical conditions, the use of cannabis is associated with significant mortality and
morbidity. It seems that the epidemiological profile of cannabis users is different here
in California: besides college students, the drug seems also to be utilized by older men
and women who were previously married. However, we know nothing for sure, and
that is an issue of public interest. We could conduct a very carefully designed survey
on cannabis.”
Prof. Marley and the others agreed—indeed, it is a very urgent, sensitive, and
difficult topic, with many medical, social, and legal implications. This will also have
a broad impact, especially with this new federal administration that is proposing a
radical health reform. “OK! How are we going to do it? How are we going to ask people
if they are cannabis users?”
3. Degenhardt L, Chiu W-T, Sampson N, Kessler RC, Anthony JC, Angermeyer M, et al. Toward a
global view of alcohol, tobacco, cannabis, and cocaine use: findings from the WHO World Mental
Health Surveys. PLoS Med. 2008; 5: e141.
boyfriend who was at one time my college friend.” There are other problems in doing
surveys: simplicity is a key factor, especially if we are going to use mail questionnaires.
For instance, if our questions are too wordy or academic or long, then people will get
confused or bored and will start to answer anything. We are in a state with many non-
native English speakers—should we also build our questionnaire in Spanish? Another
issue is using positively versus negatively worded or neutral versus non-neutral
questions—as one of the issues is the response set bias in which respondents tend
to simply agree with every question—for instance, we can ask: “Have you stopped
smoking cannabis in the last year?” or “Have you continued to smoke cannabis in the
last year?” There is also the problem of “double-barreled questions”—meaning that
we should keep to “one question to one idea”—for instance, if we ask: “Do you do drugs
when you are sad or happy?” It is possible that some people use only when they are sad, and others
only when they are happy—so it is better to ask two questions. Finally, there is an
important issue: should we include “no opinion” or “do not know” alternatives? The
issue here is that by including these options, we might give an easy way out for people
to respond instead of forcing them to think about the alternative. On the other hand,
not including them might create inaccurate responses, and respondents might mark
an alternative that is the least inaccurate one—but not the one reflecting their
opinion. This applies, for instance, to the question: “Under which circumstances should
the use of the drug cannabis be allowed?” Mary Jane finished her discussion.
“Good, Mary Jane,” said Prof. Marley, “So, your observation leads us to a second
question—should we do a pilot study first?”
options, like mail, telephone, face-to-face interview, and Internet.” As she had the floor
to herself, she continued, “I think that use of drugs is a delicate topic. People tend to lie
regarding substance use, especially if they do not feel comfortable with the interviewer.
In addition, although a face-to-face interview usually yields the best response rate with
a good representative sample, it is an expensive method. One less expensive method
I like is mail. We can mail the surveys to the subjects, with a
brief cover letter explaining the purpose and importance of our research, and then using
a small questionnaire (less than 10 minutes) with simple questions. We can assure the
respondents that we will guarantee anonymity and no personal information will be
revealed. However, the response rate might be moderate to poor—it might be difficult
to get a good response rate with this method. Finally, an intermediate solution would
be telephone interview—less expensive than face-to-face, less problematic regarding a
delicate topic, and might yield a higher response rate as compared with mailing. Also,
we should not forget that because we live in California, a high-technology state, we
could use other methods of interview: for instance, electronic mail, Internet websites,
text messages to cell phones, and so on. Therefore, we will be able to quickly reach a
large number of subjects at a relatively low cost.”
Pedro Mendonza, the other assistant professor, said, “Good ideas, Ursula. But
I think that because the topic is delicate, we cannot rely on some of the methods you
mentioned: people who feel comfortable with the use of cannabis will be precisely the
ones who will not waste their time answering a mail survey. Also, the subjects who do
not use cannabis will not waste their time either; therefore we might collect inaccurate
data using mail—an overestimated and biased sample.” Professor Mendonza
continued, “I know it is more expensive and more difficult, but my proposal is that we
go to the community and administer the questionnaires ourselves. We should train a dozen
interviewers—maybe some of our graduate students—to give them skills to show
empathy and reassurance and to establish trust in the subjects when they are being
asked these tough questions. For instance, it is easier to gain rapport when we validate
the behavior—it is useful to start our questions with a statement such as: “In college,
students suffer from a lot of pressure from professors, parents, bosses. Sometimes it
is difficult to deal with all the pressure, and a common form of relaxation and way to
unwind is to use cannabis.” Finally, they should be trained to assure a good inter-rater
reliability.”
about using cannabis and if we want to survey them—and we do—then the approach
should be carefully elaborated. In addition, the method to find these different groups
will vary.”
Prof. Ford continued, “OK—we discussed the what and the how of our study.
Now the question is where are our subjects? Are we going to use a random sample—
selecting among all people in California? Are we going to focus only on some cities of
the state—therefore performing a cluster sampling to cut costs? Or use a convenience
sample if we do not get NIH funding?”
Mary Jane tried to answer, “Of course, the best method is to use a random sample.
But it is also the most expensive one. Another method would be to stratify our
population in subgroups and then use random techniques in this stratified sample.
Using non-random samples is surely the easiest but most biased method, and we
might overestimate the number of users.”
CASE DISCUSSION
Using Surveys in Clinical Research: Signs of Smoke
In the case study Prof. Marley and his team are faced with a big problem in their
department, due to the current economic crisis. In order to face this situation, Prof.
Marley sets up an emergency meeting presenting his idea of submitting a large research
proposal to the NIH with the goal of trying to get some funds for his department. In
fact, every member of the study team agrees that they should propose a survey study
about the use of cannabis in California, mainly because this is a very urgent, sensitive,
and difficult topic, but also because it has implications for several
fields: medical, social, and legal.
evaluated it in the context of the case study. Options that are theoretically ideal cannot
always be considered in practice. In Prof. Marley’s case, a pilot study will provide the
opportunity to explore a variety of options to refine their own instrument for the
evaluation of cannabis use. Consequently, this will allow for a more complete data
collection, and will yield an instrument that will collect more reliable and valid data
from the sample population. Despite all the advantages of performing a pilot study,
Prof. Marley and his team must consider the required time for designing and planning
the survey, and the necessary budget. In fact, another valid option would be the use
of questionnaires that are already validated and published in the literature. With the
latter choice, it would be possible to submit the survey proposal more quickly and the
survey design and planning process would be shortened. In summary, it is important
to consider a pilot study in light of the existing literature, and the amount of time and
resources that the study team has.
The next point that is added to the discussion is about how to administer the
survey. According to McColl et al. (2001), the mode of administration is one of the
first decisions to be made in designing and conducting a survey. Basically, the main
decision is “between interviewer administration (either face-to-face or by telephone)
or self-completion by the respondent (with delivery of the questionnaire either by
post or to a ‘captive audience’)” [12]. These different methods have distinguishing
features: if we opt for an interviewer administration, we will have a high response rate
and the participant will more likely provide a “truthful” answer. On the other hand,
self-completion methods will have lower response rates, which may be due to a lack of
interest about the survey topic, perceived lack of time, misunderstanding of questions,
or overly long questionnaires. This is why, in Prof. Marley’s case, it is important to
understand the characteristics of the population the instrument will be applied to.
The most expensive method is the face-to-face interview; it is indeed the approach
that has the best response rate, due to interpersonal interaction. In contrast, the least
expensive approach is mail (with email being even cheaper). It is, however, difficult
to reach a good response rate through this technique. The telephone interview is the
best-balanced method in regard to cost and response rate, because it is less expensive
than face-to-face interview, less problematic regarding the mode of application, and
may yield a relatively high response rate. The topic “use of cannabis in California”
chosen by Prof. Marley’s team is a sensitive topic, which may make the participants
feel judged and, thus, may influence the survey results. Therefore, the mode of survey
administration has important implications. Additionally, consider the fact that the
target population is very broad. Therefore, a mixed-mode survey administration
according to the variety of subgroups in this population might be useful, for instance,
email questionnaires to young people, and interviews of parents or business people.
Still, we need to be aware of the potential drawbacks of combining administration
modes, such as complicated data analysis.
According to Berten et al. (2012), there are two types of cannabis users: the
common profile is the young high school or college student, and the second profile
is someone from the general population, such as parents, retired men, war veterans,
multiple substance users, or important business people [15]. In order to reach both
types of cannabis users, Prof. Marley’s team has to use a sampling method that allows
reaching a representative sample of this dispersed target population. Before choosing the
sampling technique that fits this constellation best, it is fundamental to decide whether a
probability (random) or a non-probability (convenience) sampling approach is feasible.
CASE QUESTIONS
1. What are the main issues involved in designing the methodology of the survey?
2. What are the implications of using an interview or a questionnaire as the data
collection instrument?
3. What are the main drawbacks of using open-ended questions versus closed-ended
questions?
4. What are the main advantages of performing a pilot study?
Books
• Aday L, Llewellyn JC. Designing and conducting health surveys: a comprehensive guide. 3rd ed.
San Francisco: Jossey-Bass; 2006.
• Andres L. Designing and doing survey research. Los Angeles, CA: Sage Publications; 2012.
• Czaja R, Blair J. Designing surveys: A guide to decisions and procedures. Thousand Oaks,
CA: Pine Forge Press; 2005: Chapters 2, 6, 7, 9.
• Dillman D, Smyth J, Christian LM. Internet, mail, and mixed-mode surveys: the tailored design
method, 3rd ed. New York, NY: John Wiley & Sons, Inc.; 2008.
• Everitt BS. Medical statistics from A to Z: a guide for clinicians and medical students, 2nd ed.
Cambridge: Cambridge University Press; 2006.
REFERENCES
1. Bennett C, Khangura S, Brehaut JC, Graham ID, Moher D, Potter BK, Grimshaw
JM. Reporting guidelines for survey research: an analysis of published guidance and
reporting practices. PLoS Med. 2011 Aug; 8(8): e1001069. doi: 10.1371/journal.pmed.1001069
2. Schron EB, Exner DV, Yao Q, Jenkins LS, Steinberg JS, Cook JR, Kutalek SP, Friedman PL,
Bubien RS, Page RL, Powell J, and the AVID Investigators. Quality of life in the antiarrhythmics
versus implantable defibrillators trial. Circulation. 2002; 105: 589–594.
3. Drolet BC, Khokhar MT, Fischer SA. The 2011 duty-hour requirements: a survey of
residency program directors. N Engl J Med. 2013 Feb 21; 368(8): 694–697. doi: 10.1056/NEJMp1214483
4. Czaja R, Blair J. Designing surveys: a guide to decisions and procedures. Thousand Oaks,
CA: Pine Forge Press; 1996.
5. Kelley K, Clark B, Brown V, Sitzia J. Good practice in the conduct and reporting of survey
research. Int J Qual Health Care. 2003; 15(3): 261–266.
22
ASSESSING RISK AND ADVERSE EVENTS
A man who has committed a mistake and doesn’t correct it is committing another mistake.
—Confucius
INTRODUCTION
This chapter provides an overview of safety assessments in clinical trials, including
challenges for designing and reporting safety studies. The previous chapters gave
you the methodology to design a clinical trial (Unit I), the best statistical approach
to analyze your data for the primary and secondary outcomes (Unit II), how to do
an interim analysis, and how to power this analysis in order to stop a trial before its
completion (Unit III).
• Phase III trials: In these studies a larger and broader sample of subjects is recruited,
allowing safety outcomes to be tested on a larger scale; thus rarer adverse effects can
be assessed.
• Phase IV postmarketing surveillance: In this phase, post-approval reports of adverse
effects are monitored and compiled in order to inform about adverse effects that
were not identified during the testing phase.
• Sample size: Most clinical trials do not have the resources and infrastructure to
include large sample sizes; therefore, safety is assessed in small samples, resulting
in a lack of power to detect rare adverse events.
• Trial duration: Clinical trials usually have a short duration due to the costs of
conducting large clinical trials and also to adherence issues; thus adverse events that
require a minimum time to develop may not be observed.
• Design: The design is also critical; for instance, cross-over trials may not be adequate
to assess safety when adverse events are long lasting or have a long latency.
• Biases: Similarly to efficacy, assessment of safety in clinical trials is vulnerable to
biases. For instance, measurement biases (due to lack of proper blinding) can result
in adverse events being overestimated in the treatment group and underestimated
in the placebo group.
• Sample characteristics: It is important to understand the biological basis of adverse
events so as to predict whether specific baseline or clinical characteristics (such as
use of other drugs) would accentuate or suppress adverse events.
• External validity: Results of clinical trials in a narrow or homogeneous population
may not be applicable to a larger population in terms of adverse events. Therefore
when safety is one of the main aims in a given study, the investigator needs to be
aware of this limitation.
Randomized trials have relevant strengths in safety determination. They are well
suited for the evaluation of safety outcomes that can be measured early during the
execution of the trial, especially if these outcomes have a high baseline incidence. Such
outcomes can be represented using statistical measurements, such as the absolute or
relative risk increase. Conversely, this type of study design is not helpful in assessing
rare or unexpected adverse events [5].
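As a worked illustration of these measures (hypothetical counts, not from any real trial): if r_t and r_c are the adverse-event risks in the treatment and control arms, the absolute risk increase is r_t - r_c and the relative risk is r_t / r_c. A minimal sketch in plain Python:

# Hypothetical counts: adverse events / arm size
events_treatment, n_treatment = 30, 1000  # risk 0.030
events_control, n_control = 12, 1000      # risk 0.012

risk_t = events_treatment / n_treatment
risk_c = events_control / n_control

absolute_risk_increase = risk_t - risk_c            # 0.018, i.e., 1.8 percentage points
relative_risk = risk_t / risk_c                     # 2.5
number_needed_to_harm = 1 / absolute_risk_increase  # about 56 patients

print(absolute_risk_increase, relative_risk, round(number_needed_to_harm))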
Given that in RCTs the intervention is well defined and it is randomly distributed
among study participants, it is possible to draw an unconfounded conclusion
by comparing groups. If allocation concealment is adequately guarded, then
randomization also offers a significant protection against selectively reporting or
diagnosing adverse events. Finally, certain reactions can be prospectively specified so
that they can be monitored specifically, in order to avoid ascertainment bias. This can
be done by using results from previous studies on adverse events, or by determining
possible reactions on the basis of pharmacological mechanisms.
DESIGNING A SAFETY STUDY
Investigators interested in designing trials to respond to the question of whether
the intervention is safe need to follow the same steps as described in Unit I (i.e.,
formulating a study question, choosing the most appropriate design, selecting the
population, and determining other methods such as randomization and blinding).
One important point, as explained in Chapter 2 on choosing the research question, is
that investigators need to be specific: for instance, a study cannot answer the question
of whether a drug is safe, but it can answer the question of whether a certain drug is
not associated with an increase in seizures as compared to placebo.
Another important point when designing a study is to determine whether safety is
the primary aim of the study and thus the study is powered and designed to answer a
safety question, or whether safety is a secondary outcome, in which case the study will
not be confirmatory in terms of the safety question.
REGULATORY ISSUES: REPORTING
ADVERSE EVENTS
Reporting of adverse events is a critical aspect of the drug development process,
specifically, safety determination. The surveillance of this process is highly regulated
and its legislation is always evolving, leading to stricter reporting parameters in
order to ensure drug safety. Worldwide, up to 5% of hospital admissions are due to
drug-related adverse events; however, only 10% or less of such events are reported to
the regulatory authorities or manufacturers. This fact highlights the need to understand
and clarify the regulations governing reports of adverse events [8].
Reporting of adverse events is based on their categorization according to three
main parameters (a schematic sketch follows the list):
• Seriousness: This refers to events that lead to negative outcomes such as death,
prolonged hospitalization, persistent or significant disability or incapacity,
or congenital anomalies. An event is also considered serious if it leads to the
requirement of medical or surgical intervention to prevent one of the previous
outcomes, or if it is categorized as life threatening (i.e., the patient was at risk of
death at the time the event occurred).
• Expectedness: This concept is based on whether or not the event was previously
observed or reported in the local product labeling. An event is considered
“unexpected” if it was previously unobserved and its nature and/or severity are
inconsistent with documented information. “Expected” events are not typically
reported to regulatory authorities on an expedited basis.
• Relatedness: This category refers to the likelihood that an event is or is not related
to the exposure. In order to make such a determination, factors like biological
plausibility and temporal relationship should be considered. This concept is usually
graded according to the possible degree of causality, such as certainly, probably,
possibly, or unlikely related. Nonetheless, there is no standard nomenclature scale. In
general, all voluntary reports are considered to carry a causal relationship.
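The sketch below (a schematic illustration only, not regulatory guidance; the field names and the decision rule are our simplifications) encodes the three parameters and the classic trigger for expedited reporting, namely a serious, unexpected, and plausibly related event:

from dataclasses import dataclass

@dataclass
class AdverseEventReport:
    serious: bool      # e.g., death, hospitalization, disability, life-threatening
    expected: bool     # already described in the product labeling
    relatedness: str   # e.g., "certain", "probable", "possible", "unlikely"

    def needs_expedited_report(self) -> bool:
        # Simplified rule of thumb: serious, unexpected events with a
        # reasonable possibility of a causal relationship are the classic
        # triggers for expedited reporting.
        related = self.relatedness in {"certain", "probable", "possible"}
        return self.serious and not self.expected and related

event = AdverseEventReport(serious=True, expected=False, relatedness="probable")
print(event.needs_expedited_report())  # True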
1. Professor Fregni and Dr. Imamura prepared this case. Course cases are developed solely as the
basis for class discussion. Although cases might be based on past episodes, the situation in this case
is fictional. Cases are not intended to serve as endorsements or sources of primary data. All rights
reserved to the authors of this case. Reproduction and distribution without permission are not allowed.
by almost 60% in 6 months. John at that point got a raise of $300,000—bringing his
salary to $1.3 million—and a bonus of $1.9 million. Everyone seemed to be enjoying
the success of MECAR.
THE AFTERMATH
John knows that this news could be devastating for the company. In fact, the idea of
MECAR increasing the risk of vascular accidents is plausible, as prostacyclin acts as
a vasodilator and platelet aggregation inhibitor; therefore an increase in the risk of
both myocardial infarction and strokes could be theoretically expected. That turned
out to be a real dilemma for Dr. Sullivan. After 10 years and hundreds of millions of
dollars already invested, the company was recovering its investments with MECAR; in
addition, there was no new drug in the pipeline, and a post-market failure of MECAR
could even indicate bankruptcy.
The pressure on Dr. Sullivan was quite high at that moment. He knew the issue
raised could turn into a major safety concern, and he was pressed by two antagonistic
forces. One came from his mind, saying, well, from my experience in research I know
that this finding can surely be a false alarm, a false positive. On the other hand, his
heart was telling him that there was something wrong and safety issues should always
be the main concern of any investigator or physician. In fact, if this drug proves to be
unsafe in future trials, then the company will also likely go bankrupt due to lawsuits.
The situation was not favorable, and he decided then to call an urgent meeting with the
company senior executives.
the traditional NSAID was not significant among the patients without indications for
aspirin to prevent MI.” She then became excited about her finding.
“Because the traditional NSAID used in this study inhibits the production
of thromboxane by 95 percent and inhibits platelet aggregation also by almost
90 percent, the use of this drug may be similar to that of aspirin. Therefore, what we
are seeing here is a protective effect of the traditional NSAID in high-risk patients.”
She realized that she had brought some of the morale back, as this could be good
news for everyone in the room. She then continued, “Nevertheless, we would need
to run another long-term follow-up trial to assess whether our drug increases the risk
of myocardial infarction and stroke. We actually do have a trial that might serve this
purpose: our longitudinal trial in which patients are receiving MECAR or placebo for
several months for the prevention of adenomatous polyps—in this study we can see
then if MECAR increases the risk of MI or stroke as compared to placebo. That
would provide definitive evidence.”
The last person to speak was actually Terry Morgan: “This is a good suggestion,
Alice. But I would propose a compromise solution between the proposal from Steve
to withdraw the drug and the proposal from Alice to run another study. I propose to
suggest to the FDA adding a warning label to our product, saying that for patients with
high risk of stroke and MI (as defined by a set of criteria), they are at increased risk of
another event if they use MECAR. Because this difference in the current study was
seen only in this population, we would be protecting our patients while not preventing
patients with no risk factors from using MECAR, which I know in some cases is the
only drug that is effective and tolerable.”
It is almost 7 p.m. on Friday. John then decides to bring the meeting to a close: “Let us
work with these three scenarios, and because time is of the essence here, as
patients will continue to take the drug during the weekend, let us reflect on these
proposals and decide on Monday morning the best option with our complete senior
management team—everyone is now aware, and some of our colleagues who are out
of the country were requested to fly back this weekend. Have a good weekend, everyone.”
Monday would be a decisive day for the fate of Pharmatec.
CASE DISCUSSION
Risk Assessment and Adverse Events: When to Pull the Trigger
This case discusses a potential problem that may occur in several studies: how to
determine whether adverse events are causally related to the intervention when the
trial is not designed for such assessment. In this case, it is important to think how to
analyze the data and what questions the investigator needs to
ask in order to decide whether the use of the drug needs to be interrupted—that is,
whether the investigator believes that this is a case of an adverse effect related to the
drug. In this case it is also important to consider how to design a trial.
Another interesting topic of discussion for this case is whether this adverse
effect of MECAR, if really linked to the drug, reflects a failure to design the trial
well, or whether it is something that could only be detected in phase IV
(post-marketing) trials. In fact, the twentieth century is full of cases in which drugs
were put on the market without sufficient safety mechanisms to prevent adverse
events or deleterious consequences.
You are encouraged to discuss further historical system failures, other famous cases of
pharmacovigilance issues, or recent concerns raised in your specialty or region. You
can also discuss whether you agree (and why) with recent market withdrawals. In
addition, are new surgical techniques monitored in a similar way? For example,
compare early papers on LASIK correction with contemporary reports on the
technique's limitations, improvements, and safety monitoring.
FURTHER READING
Macrae DJ. The Council for International Organizations of Medical Sciences (CIOMS)
guidelines on ethics of clinical trials. Proc Am Thorac Soc. 2007; 4: 176–179.
This article gives an overview of the history of ethics (Nuremberg Code, Helsinki Declaration),
international ethical guidelines (Belmont Report, International Conference on
Harmonization, Council for International Organizations of Medical Sciences), informed
consent, and the study of vulnerable groups.
WHO. Pharmacovigilance: ensuring the safe use of medicines. WHO Policy Perspectives on
Medicines, 2004. At: http://who-umc.org/graphics/24753.pdf, accessed December 2012.
This article gives a summary of pharmacovigilance, including its definition and aims,
monitoring and partners, the international program for drug monitoring, and the
growth of reporting and membership.
Yadav S. Status of adverse drug reaction monitoring and pharmacovigilance in selected
countries. Indian J Pharmacol. 2008; 40(Suppl 1): S4–9. At: https://www.ncbi.nlm.nih.gov/
pmc/articles/PMC3038524/, accessed November 2016.
This article gives an overview of how pharmacovigilance systems in selected countries
(Australia, Brazil, India, Malaysia, Singapore, and South Africa) identify and report
adverse drug reactions.
These links are interesting to explore and access the guidelines: safety report, risk management,
and so on:
• Council for International Organizations of Medical Sciences (CIOMS): http://www.cioms.ch/
• European Medicines Agency (EMA): http://www.emea.europa.eu/ema/
• US Food and Drug Administration (FDA): http://www.fda.gov/
23
MANUSCRIPT SUBMISSION
Writers may be classified as meteors, planets, and fixed stars. They belong not to one
system, one nation only, but to the universe. And just because they are so very far away, it
is usually many years before their light is visible to the inhabitants of this earth.
—Arthur Schopenhauer, Essays and Aphorisms (1970)
INTRODUCTION
Manuscript writing and submission can be considered the final steps in a re-
search project. Investigators need to publish their results, not just to inform the
scientific world about the work, but to expose their data to scrutiny and have
their findings applied to new projects and studies. A track record of published
papers is also necessary for career development and is an important criterion for
promotions.
However, fewer than 50% of scientific meeting abstracts actually result in publication,
and the proportion of all original work that reaches full publication is likely even
smaller [1]. Challenges to manuscript publication arise from both the writing and the
submission process. Both require careful consideration and preparation, but the key to
success is frequent practice and experience. At the same time, your study does not
have to be a randomized clinical trial (RCT) to be considered interesting or publish-
able. Nor should you give up on publishing your study if the results are negative; what
matters more is that the study had adequate statistical power. Furthermore, negative
studies play an important role in answering relevant research questions, even if the
results are unexpected or controversial.
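To make the power consideration concrete, here is a minimal sketch in Python, with entirely hypothetical numbers for the effect size, significance level, and sample size, of how one might check, via the usual normal approximation, whether a negative two-arm trial was adequately powered to detect the smallest clinically relevant difference:

# Rough power check for a two-sample comparison of means (normal approximation).
# All inputs are hypothetical placeholders, not values from any study in this book.
from scipy.stats import norm

alpha = 0.05        # two-sided significance level
effect = 0.5        # smallest clinically relevant difference, in SD units (Cohen's d)
n_per_arm = 64      # achieved sample size per arm

z_alpha = norm.ppf(1 - alpha / 2)
z_beta = effect * (n_per_arm / 2) ** 0.5 - z_alpha
power = norm.cdf(z_beta)
print(f"approximate power: {power:.2f}")  # about 0.80 for these inputs

A negative result from a study powered at roughly 0.80 or more for the smallest effect worth detecting is informative; with much lower power, the study mainly shows that the question remains open.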
The basic architecture of a scientific paper comprises Introduction, (Material and)
Methods, Results, and Discussion (with conclusions), also referred to as IMRaD. This
is the standard format for presenting data adopted by most medical journals, and it
also makes it easier to compare information between studies [2].
When drafting your manuscript, be aware that reviewers are asked to check
for originality, scientific accuracy, good composition, and interest to the readers.
Therefore, important questions to consider as you write include the following: For
which audiences are your research question and findings most relevant? How do your
findings add information to what we already know? Could your findings change med-
ical practice and, if so, how? What are other likely impacts of your study? The answers
to these questions will affect which journals are the best fit for your manuscript, and
whether journal editors will ultimately publish or reject it.
Another point to consider is the current publication landscape. On the one
hand, the number of publications is increasing at a faster rate than 10 years ago.
MEDLINE, the US National Library of Medicine's premier bibliographic database
and the primary component of PubMed, dates back to 1946 and contains over
19 million scientific references. Every day 2,000–4,000 completed citations are
added; in 2010, nearly 700,000 citations were added [3]. On the other hand, the
number of journals has not increased proportionally, which has made the publication
process increasingly competitive and burdensome for all parties involved.
The key to successfully publishing a manuscript in a high-impact-factor journal is having
high-quality data that demonstrate an important message, clearly presenting this
message and its evidence in the manuscript, and choosing the right journal with
matching characteristics and requirements. In 2010 the “Authors’ Submission
Toolkit: A Practical Guide to Getting Your Research Published” was created to
increase the efficiency of the submission process, accommodate the rising manuscript
volume, and reduce the resource demands on journals, peer reviewers, and
authors [4].
The main focus of this chapter is to prepare you to find the right journal, to under-
stand the quality requirements for reporting your investigation as well as the submis-
sion process, and to discuss submission strategies, pitfalls, and what to do when a
paper is rejected. However, we will begin with a short review of the structure of an
original research manuscript, highlighting key aspects that have major consequences
for the success or failure of a manuscript submission. Covering the entire manuscript
writing process would go beyond the scope of this chapter, but at the end we provide
resources that we think will be helpful. We also hope to provide you with some insight
into the current state of medical publishing and some of the current issues you need
to be aware of.
Title
The title is the “business card” of your manuscript. This is how you capture the curi-
osity and interest of your audience. Your title should be an accurate description of
your study, expressed in as few words as possible. It is important that the title is in
sync with the rest of your manuscript, so that it sends the right signals to editors,
reviewers, and readers. While it is important to begin writing your manuscript
with a title, this is a working title only—it will not be the final one. It is inevitable
that in writing up your research, you will come to a deeper understanding of your
work and its significance—and this deepened understanding needs to be reflected
in your title. We strongly recommend that at the end of the writing process you re-
examine your working title to see whether it is still the best fit for your manuscript.
It most likely won't be, and you will need to revise the title, sometimes slightly
and sometimes entirely, before you submit your work to the journal.
If the title sends the wrong message to an editor, he or she may reject the manu-
script out of hand. So craft it to fit the journal’s needs and those of its readers. An
accurate title will also make it easier for other researchers to find your work in
their searches.
Authors
The general consensus is to list the lead scientist who conducted most of the work
and drafted the manuscript as the first author and the mentor as the senior (last)
author. Importantly, everyone listed on the manuscript should have contributed
intellectually to it. According to the International Committee of Medical Journal
Editors (ICMJE), data collection alone does not qualify for authorship. (See
Chapter 19, Integrity in Research: Authorship and Ethics, for more information
regarding authorship.)
Keyword List
In addition to the words in the title, your keywords will also be indexed in scientific
search engines and databases. This will affect exposure of your paper to your peers and
can impact how frequently your paper is cited.
Abstract
The abstract is a short, stand-alone summary of your manuscript, covering
introduction, methods, results, and discussion, that, together with the title, is freely
accessible even if the journal itself requires a subscription. Everyone reads the
abstract first, and it is your best chance to attract a reader's
attention. Even editors and reviewers base their first impression of the manuscript on
the abstract. It needs to be very succinct, within the word limit set by the journal, and
to highlight the most important aspects of your study.
Introduction
This is your opportunity to win over the reader of your paper (be it a peer scientist,
the editor, or a reviewer) and to explain the motivation for doing the research. The
introduction should present important references that support your chain of
argument. It should summarize the current state of research and show the knowledge
gap that establishes the importance of your study. The introduction should logically
set up your research objective or hypothesis (see Table 23.1 later in this chapter).
Remember that the introduction should not be a literature review. Note also that
leading journals such as the New England Journal of Medicine prefer a short
introduction of two or three brief paragraphs.
Results
Report your findings succinctly, using tables and figures. Use words sparingly, mainly
to point out the main findings, whose details are given in the tables and figures.
Establish a logical sequence for reporting your findings, often following the order
used in the Methods section. In most clinical papers, the first table summarizes the
baseline characteristics of the study subjects.

Table 23.1 (excerpt) Steps that determine a manuscript's fate
Audience/Readers (general public, physicians, health authorities, investigators): you
must know to whom you plan to present the information; this also matters during the
writing process, and just stating “there is a new thing” isn't enough.
Type of journal/Selectivity
Consider timing

Remember what your primary outcome is, be-
cause this may be the key figure of your paper and should be given exposure accord-
ingly. Remember that you want to present your findings without any interpretation or
subjective assessment; save this for the Discussion.
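As an illustration of this convention, the following is a minimal sketch, built on invented data, of how a baseline-characteristics table by treatment arm might be assembled; the variables, values, and arm labels are hypothetical placeholders:

# Build a simple baseline-characteristics ("Table 1") summary by treatment arm.
# The data frame below is invented for illustration only.
import pandas as pd

df = pd.DataFrame({
    "arm": ["drug", "placebo", "drug", "placebo", "drug", "placebo"],
    "age": [61, 58, 67, 72, 55, 63],
    "female": [1, 0, 1, 1, 0, 1],   # 1 = female, 0 = male
})

table1 = df.groupby("arm").agg(
    n=("age", "size"),
    age_mean=("age", "mean"),
    age_sd=("age", "std"),
    pct_female=("female", lambda s: 100 * s.mean()),
).round(1)
print(table1)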
Discussion
The Discussion is where you interpret your results. What is the meaning of your
principal findings in the context of what is already known? Only elaborate on what is
supported by your results and don't overinterpret their meaning. The common struc-
ture of a discussion is a brief summary of the main findings, followed by a compar-
ison of your results with others in the literature and an account of any differences. This
is followed by a description of your study’s limitations and a brief reiteration of its
strengths. Your Discussion section should end with a conclusion stating the signif-
icance of your results and the implications for the field. One of the main critiques
of reviewers is “wordiness,” and the introduction and discussion sections are most
vulnerable to this.
References
It can’t be overstated how important a thorough literature search is when writing a
manuscript. The aim is not to write up a review paper of the topic, but to include “land-
mark” papers and relevant contemporary references (about 20 that are not older than
10 years). Keep in mind that the reviewers of your manuscript will most likely have
published in your field and will know the literature well. In addition, the formatting of
references is unique to each journal, and it is recommended that you use bibliograph-
ical software such as EndNote or Reference Manager to accommodate a journal’s
preferences.
Acknowledgments
Acknowledge funding sources and people who helped with the research work or the
manuscript but were not included in the author list (e.g., media services).
Disclosures
Be sure to disclose any personal or financial conflict of interest (COI) relevant to the
work presented in the manuscript. Although the individual journal’s definitions and
policies differ, most medical journals require disclosure of COI [8].
connection of sentences, flow, and elegance. If possible, have a native English speaker
read your work. A scientific copyeditor can also be useful if you have the budget
for one.
as an important metric for promotion criteria. Although it could be argued that
publishing in a high-impact journal may increase the chance of being cited, a strong
study with novel results can attract a large number of citations wherever it appears.
Thus one strategy researchers are pursuing is to publish in respected open access
journals so as to increase the readership of their research.
Where to Submit
Once the draft of your manuscript is at an advanced stage, you should start
thinking about the journal to which you plan to submit. This is not an easy decision,
and several aspects need to be considered. Please refer to Table 23.1, which depicts
the steps that determine a manuscript’s fate. Ultimately you have to prioritize what
is most important for you, and how you can match this with the submission pro-
cess. The case discussion that you will find after this section will review the thought
and decision process you will go through when preparing your manuscript for
submission.
In choosing a journal for your manuscript, go to its website and download its
“Instructions to Authors.” Follow these guidelines strictly when finalizing your man-
uscript. Not doing so sends a message to the editors that you cannot be bothered to
present the information in the way that they prefer, which can easily result in your
manuscript being rejected. At this stage, you should also review the cost of publi-
cation: while charges per page are rarely substantial, consider that color figures can
easily increase the cost by more than $1,000 per image.
When submitting your manuscript, you also need to include a cover letter. Briefly
state the main findings of your manuscript and why you think it should be published in
this journal. If an expert in the field has reviewed your paper before submission, state
it here and include the expert's name as a reference. It can also be extremely helpful
to suggest independent experts as potential reviewers and, when appropriate, to ask
that someone be excluded from reviewing the manuscript if you think that person's
assessment would likely not be objective.
Reporting Guidelines
Reporting guidelines are endorsed by a growing number of biomedical journals that
want to promote transparent reporting and allow assessment of the strengths and
weaknesses of studies reported in the medical literature, with the ultimate purpose of
improving the quality of publications and helping readers understand what was done
or missed during the investigation.
There are several such guidelines: CONSORT for RCTs, STROBE for observational
studies, QUOROM for meta-analyses and systematic reviews of RCTs, and MOOSE
for meta-analyses of observational studies in epidemiology.
CONSORT
CONSORT stands for Consolidated Standards of Reporting Trials and encompasses
various initiatives developed by the CONSORT Group to alleviate the problems
arising from inadequate reporting of randomized controlled trials. You can find more
information at http://www.consort-statement.org/.
The CONSORT statement includes a 25-item checklist.
The site contains a flow chart from assessing eligibility to analysis; it depicts the prog-
ress through the phases of a parallel randomized trial of two groups: enrollment, inter-
vention allocation, follow-up, and data analysis [17].
There is also a CONSORT guide for reporting abstracts.
STROBE
STROBE stands for STrengthening the Reporting of OBservational studies in
Epidemiology. It is an international, collaborative initiative of epidemiologists,
methodologists, statisticians, researchers, and journal editors involved in the conduct
and dissemination of observational studies.
The STROBE checklist for observational studies contains 22 items.
message clearer. The statistical analysis can be changed to accommodate the data
and avoid having to publish a negative study. The principal investigator can bend the
system and “pull strings” to get a paper into a journal [22].
The system is not perfect, but until it is replaced by a better one, you are advised
to understand its flaws in order to deal with it well, and hopefully to improve it by being
not just a good author but also a good reviewer and, eventually, a wise editor.
Publication Bias
In this book you have learned the basic principles of how to conduct a research study
with methodological rigor and procedural standardization to obtain unbiased and re-
producible results. But when it comes to publishing your results and comparing your
study with your peers' work, you will soon find out that there are other factors that will
determine whether and how research will enter the scientific stage. The current publi-
cation landscape reflects a well-documented trend, a bias toward publication of more
positive studies than negative ones [23,24]. In the case of clinical trials, this phenomenon
is accentuated by the fact that positive trials are published faster than results from neg-
ative trials [25,26].
In Chapters 4 and 14 you have already been introduced to a tool for identifying this
bias: the funnel plot. Use it not just when you write a meta-analysis; apply it also to
the topic of your own research study to get an idea of what the current trend is.
Many reasons contribute to the bias toward publishing positive studies. A major
reason is a lack of incentives for authors, journals, and sponsors to publish negative
results. We have also alluded in previous chapters to the role of pharmaceutical and
biotech companies and their impact on medical research. Medical ghostwriting is
one way in which publications from industry-sponsored studies can be skewed, often
resulting in an overemphasis of positive results at the expense of reporting possible ad-
verse events [27]. But there are other problems. A recent comment in Nature reported
that researchers at Amgen were unable to confirm the results of 47 of 53 “landmark
studies” in pre-clinical cancer research. The main problem identified was lack of repro-
ducibility, which suggests that most of the findings were false positives [28].
More possible methodological reasons were given in an investigation by
Ioannidis: “. . . a research finding is less likely to be true when the studies conducted in
a field are smaller; when effect sizes are smaller; when there is a greater number and
lesser preselection of tested relationships; where there is greater flexibility in designs,
definitions, outcomes, and analytical modes; when there is greater financial and other
interest and prejudice; and when more teams are involved in a scientific field in chase
of statistical significance” [29].
Attempts have been made to counter this trend, for example the Journal of
Negative Results in Biomedicine (http://www.jnrbm.com/) and The All Results Journal
(http://www.arjournals.com). Ultimately, however, it is in your hands to contribute
to a change in medical research. We hope that with this book, we have given you the
ideas and tools to conduct research in a responsible, representative, and reproducible
way that will be an inspiration for your peers. Following is one more case and some
exercises to complete this chapter. And after that, our best wishes are with you in your
endeavor as a clinical scientist.
Search for the truth is the noblest occupation of man; its publication is a duty.
—Madame de Staël (1766–1817)
ACKNOWLEDGMENTS
The authors are grateful to Harvard T. H. Chan School of Public Health writing
instructors Donald Halstead and Joyce LaTulippe for their critical review and
suggestions on this chapter.
Prof. Gunpta is respected worldwide for his contributions to the field of stent re-
search and for his innovative ideas; in fact, he was one of the first to show the potential
long-term complications of stents. He had recently been named a Howard Hughes
Professor—a prestigious award that is given to the 20 most influential researchers by
the Howard Hughes Medical Institute (HHMI) in the United States. These leading
researchers are awarded $1 million per year. He was also named one of America’s 25
best leaders by US News & World Report in 2008. He runs a very successful lab that
has two full-time faculty, 15 postdoctoral fellows, six PhD students, and five research
assistants (undergraduate level). Dr. Vengas knew that it was a great privilege to be
accepted into this laboratory.
such as “and,” “however,” and names. But these changes were not the final story; this
was the beginning of a long iterative process between Dr. Vengas and Prof. Gunpta.
uneasy at not having submitted the manuscript yet. The writing process kept coming
back to him in flashbacks. After arriving in Ecuador
he went to have dinner with his friends and had the famous typical dish there—guinea
pig (or cuy as it is called in Ecuador)—and also the potato soup that he missed very
much while in New York. But neither the company of his friends nor the guinea pig
could take him away from his thoughts and flashbacks of the discussion with Prof.
Gunpta.
during the night. He could not stop thinking about the entire process, and then he
remembered the issue with the Discussion section.
(1) Clear statement of what the principal findings were (first paragraph)
(2) Strengths and weaknesses of the study
(3) Comparison of the findings of our study with those of previous studies
(4) Clarification (possible explanations) regarding the similarities and differences
(from item 3)
(5) Clear and concise conclusion of the meaning of the study as it relates to clinical
practice or future research
(6) Proposal for future research.
thrombosis: a sham-controlled, double-blind trial.” But he also thought about more
provocative titles such as “Enhancing post-stent healing to improve outcomes in cor-
onary obstructions.”
CASE DISCUSSION
Dr. Andres Vengas, from Cuenca, Ecuador, is part of a team at Columbia University.
He has developed a phase II trial to investigate the role of a new drug-eluting stent
for the treatment of ischemic heart disease and has shown its effectiveness in reducing
the long-term complications associated with drug-eluting stents in patients with
coronary obstruction.
The study enrolled 140 patients with coronary lesions. The outcome was the rate of
thrombosis at one year. After three and a half years the study was completed
and the results showed a significant reduction in the rate of thrombosis with the new
drug-eluting stent compared to control. Dr. Vengas is preparing the manuscript and
will be the first author. His mentor, Dr. Gunpta, one of the world leaders in cardio-
vascular stent research, will be the senior author.
Dr. Vengas has difficulty writing the manuscript. Dr. Gunpta revised the first
draft and found several issues concerning the title, the long introduction, the long
discussion section, and whether a discussion of the limitations should be included. After
reading this chapter and the case, think about what you would consider if you were
part of this team. Also take this knowledge and apply it to the manuscripts you are
working on.
FURTHER READING
Books
Booth WC, Colomb G, Williams J. The craft of research, 3rd ed. Chicago: University of Chicago
Press; 2008.
Hall GM. Structure of a scientific paper. In: Hall GM, ed. How to write a paper, 3rd ed.
London: BMJ Books; 2003.
Papers
Bredan A, van Roy F. Writing readable prose: when planning a scientific manuscript, following a
few simple rules has a large impact. EMBO Reports. 2006; 7(9): 846–849.
Fanelli D. Negative results are disappearing from most disciplines and countries. Scientometrics.
2012; 90(3): 891–904.
Jefferson T, Rudin M, Brodney Folse S, Davidoff F. Editorial peer review for improving the
quality of reports of biomedical studies: a Cochrane evaluation of the effectiveness of the
peer-review system. Cochrane Database Syst Rev. 2007 Apr 18; (2): MR000016.
Kallestinova E. How to write your first research paper. Yale J Biol Med. 2011; 84: 181–190.
Kern MJ, Bonneau HN. Approach to manuscript preparation and submission: how to get your
paper accepted. Catheter Cardiovasc Interv 2003; 58: 391–396.
Kliewer MA. Writing it up: a step-by-step guide to publication for beginning investigators. AJR.
2005; 185: 591–596.
Pierson DJ. The top 10 reasons why manuscripts are not accepted for publication. Respir Care.
2004; 49(10): 1246–1252.
Provenzale JM. Ten principles to improve the likelihood. . . AJR. 2007; 188:1179–1182.
Schulz KF, Altman DG, Moher D, for the CONSORT Group. CONSORT 2010 Statement:
updated guidelines for reporting parallel group randomised trials. BMJ. 2010; 340: c332.
http://www.bmj.com/content/340/bmj.c332
Veness M. Point of view: Strategies to successfully publish your first manuscript. J Med Imaging
Radiat Oncol. 2010 Aug; 54(4): 395–400.
Von Elm E, Altman DG, Egger M, Pocock SJ, Gøtzsche PC, Vandenbroucke JP; STROBE
Initiative. The Strengthening the Reporting of Observational Studies in Epidemiology
(STROBE) statement: guidelines for reporting observational studies. Ann Intern Med.
2007 Oct 16; 147(8): 573–577. Erratum in: Ann Intern Med. 2008 Jan 15; 148(2): 168.
PMID: 17938396.
Wager E. Publishing clinical trial results: the future beckons. PLoS Clin Trials. 2006 October;
1(6): e31. PMCID: PMC1626095.
Online Resources
http://www.consort-statement.org
http://www.icmje.org. Look for authorship and manuscript submission and clinical trial regis-
tration ICMJE policy.
http://www.strobe-statement.org
www.ploscollections.org/ghostwriting
Manuscript Writing
http://www.scidev.net/en/practical-guides/how-do-i-write-a-scientific-paper-.html
Manuscript Submission
http://www.scidev.net/en/practical-guides/how-do-i-submit-a-paper-to-a-scientific-journal-.html
X Factor
http://occamstypewriter.org/scurry/2012/08/13/sick-of-impact-factors/
http://occamstypewriter.org/scurry/2012/08/19/sick-of-impact-factors-coda/
The Publication Landscape
http://articles.mercola.com/sites/articles/archive/2013/02/13/publication-bias.aspx
http://blogs.biomedcentral.com/bmcblog/2012/10/10/no-result-is-worthless-
the-value-of-negative-results-in-science/
http://www.theatlantic.com/magazine/archive/2010/11/lies-damned-lies-and-medical-
science/308269/?single_page=true
http://www.scilogs.com/the_gene_gym/on-publishing-negative-results/
http://www.ama-assn.org/amednews/2008/02/18/hlsb0218.htm
REFERENCES
1. Scherer RW, Langenberg P, Von Elm E. Full publication of results initially presented in
abstracts. Cochrane Database Syst Rev. 2007 Apr 18; (2): MR000005.
2. Jenicek M. How to read, understand, and write ‘Discussion’ sections. Med Sci Monit. 2006;
12(6): SR28–SR36.
3. Fact Sheet MEDLINE®, US National Library of Medicine, http://www.nlm.nih.gov/pubs/
factsheets/medline.html
4. Authors' submission toolkit: A practical guide to getting your research published.
Curr Med Res Opin. 2010 Aug; 26(8). doi:10.1185/03007995.2010.499344
5. Booth WC, Colomb G, Williams J. The craft of research, 3rd ed. Chicago: University of
Chicago Press; 2008.
6. Barron JP. The uniform requirements for manuscripts submitted to biomedical journals
recommended by the International Committee of Medical Journal Editors. Chest. 2006;
129: 1098–1099.
7. Bourne PE. Ten simple rules for getting published. PLoS Comput Biol. 2005; 1(5): e57.
doi:10.1371/journal.pcbi.0010057.
8. Blum JA, Freeman K, Dart RC, Cooper RJ. Requirements and definitions in conflict
of interest policies of medical journals. JAMA. 2009 Nov 25; 302(20): 2230–2234.
doi: 10.1001/jama.2009.1669.
9. Lawrence PA. The politics of publication. Nature. 2003 Mar 20; 422: 259–261. doi:10.1038/
422259a
10. Halstead D. A strategic approach to publishing research. Boston: Writing Program, Harvard
School of Public Health; 2011.
11. International Committee of Medical Journal Editors. Uniform requirements for
manuscripts submitted to biomedical journals: writing and editing for biomedical pub-
lication. Publication ethics: sponsorship, authorship, and accountability. J Pharmacol
Pharmacother. 2010 Jan–Jun; 1(1): 42–58. https://www.ncbi.nlm.nih.gov/pmc/articles/
PMC3142758/. Updated April 2010.
12. Vintzileos AM, Ananth CV. How to write and publish an original research article. Am J
Obstet Gynecol. 2009; 201: 344.e1–344.e6.
13. Lawrence PA. The politics of publication. Nature. 2003 Mar 20; 422: 259–261. doi:10.1038/
422259a
14. Byrne DW. Common reasons for rejecting manuscripts at medical journals: a survey of
editors and peer reviewers. Science Editor. 2000 March–April; 23(2).
15. Pierson DJ. The top 10 reasons why manuscripts are not accepted for publication. Respir
Care. 2004; 49(10): 1246–1252.
16. Brand RA. Editorial: Standards of Reporting: The CONSORT, QUOROM, and
STROBE Guidelines. Clin Orthop Relat Res. 2009 Jun; 467(6): 1393–1394. doi:10.1007/
s11999-009-0786-x
17. Moher D, Schulz KF, Altman DG. The CONSORT statement: revised recommendations
for improving the quality of reports of parallel-group randomized trials. Ann Intern Med.
2001; 134: 657–662.
18. Björk BC, Solomon D. Open access versus subscription journals: a comparison of scientific
impact. BMC Med. 2012 Jul 17; 10: 73. doi:10.1186/1741-7015-10-73
19. Kronick DA. Peer review in 18th-century scientific journalism. JAMA. 1990; 263:
1321–1322.
20. Jefferson TO, Alderson P, Wager E, Davidoff F. Effects of editorial peer review: a systematic
review. JAMA. 2002; 287: 2784–2786.
21. Bourne PE, Korngreen A. Ten simple rules for reviewers. PLoS Comput Biol. 2006;
2(9): e110. doi:10.1371/journal.pcbi.0020110
22. Lawrence P. The politics of publication. Nature. 2003 Mar 20; 422 (6929): 259–261.
23. Chalmers I. Underreporting research is scientific misconduct. JAMA. 1990; 263:
1405–1408.
24. Wager E. Publishing clinical trial results: the future beckons. PLoS Clin Trials. 2006 Oct;
1(6): e31. PMCID: PMC1626095.
25. Hopewell S, Loudon K, Clarke MJ, Oxman AD, Dickersin K. Publication bias in clinical
trials due to statistical significance or direction of trial results. Cochrane Database Syst Rev.
2009 Jan 21; (1): MR000006. doi:10.1002/14651858.MR000006.pub3.
26. Fanelli D. “Positive” results increase down the hierarchy of the sciences. PloS One. 2010
April 7; 5(4): e10068.
27. The PLoS Medicine Editors. Ghostwriting: the dirty little secret of medical publishing that
just got bigger. PLoS Med. 2009; 6(9): e1000156.
28. Begley CG, Ellis LM. Drug development: raise standards for preclinical cancer research.
Nature. 2012 Mar 29; 483: 531–533. doi:10.1038/483531a
29. Ioannidis JP. Why most published research findings are false. PLoS Med. 2005; 2(8): e124.
PMID: 16060722.