Diane L. Fairclough - Design and Analysis of Quality of Life Studies in Clinical Trials
DESIGN and
ANALYSIS of
QUALITY of LIFE
STUDIES in
CLINICAL TRIALS
Second Edition
Diane L. Fairclough
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any copyright material has
not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.
© 2010 by Taylor and Francis Group, LLC
What’s New?
The second edition of Design and Analysis of QOL Studies in Clinical Trials
incorporates answers to queries by readers, suggestions by reviewers, new
methodological advances, and more emphasis on the most practical methods.
The most frequent request that I had after the publication of my first edition
was for the datasets that were used in the examples. This was something I
had hoped to provide with the first edition, but time constraints prevented
that from happening. I have been able to obtain the necessary permissions
(data use agreements)∗ to do so in this edition. These datasets
are available solely for educational purposes to allow the readers to replicate
the analyses presented in this edition. In the last few years, more attention
has been focused on the protection of patients' protected health information (PHI). To protect both study participants and sponsors, a number of steps have been taken in the creation of these limited use datasets. First, all dates and other potentially identifying information have been removed from the datasets. Small
amounts of random variation have been added to other potential identifiers
such as age and time to death; categories with only a few individuals have
been combined. Finally, the datasets were constructed using bootstrap techniques (randomly sampling with replacement) and have a different number of participants than the original datasets. Anyone wishing to use the data for any purposes other than learning the techniques described in this book, especially if intended for presentations or publication, MUST obtain separate data use agreements from the sponsors. The datasets and additional documentation can be obtained from http://home.earthlink.net/∼dianefairclough/Welcome.html.
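The de-identification steps just described (dropping dates, jittering quasi-identifiers such as age, and bootstrap resampling so the released file has a different number of participants) can be sketched in a few lines. The book's worked examples use SAS, R, and SPSS; the following is an illustrative Python sketch only, with hypothetical field names (`age`, `enrollment_date`), and is not the actual procedure used to build the released datasets.

```python
import random

def deidentify(records, jitter_sd=1.0, new_n=None, seed=12345):
    """Illustrative masking sketch: drop dates, add small random noise
    to quasi-identifiers (here, age), then bootstrap-resample the rows
    (sampling with replacement) so the released dataset has a different
    number of participants than the original. Combining sparse
    categories, also described above, is omitted for brevity."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    masked = []
    for rec in records:
        rec = dict(rec)                                     # copy; leave input intact
        rec.pop("enrollment_date", None)                    # remove dates outright
        rec["age"] = rec["age"] + rng.gauss(0, jitter_sd)   # jitter age
        masked.append(rec)
    n = new_n if new_n is not None else len(masked)
    return [dict(rng.choice(masked)) for _ in range(n)]     # bootstrap resample

patients = [{"age": 55, "score": 70, "enrollment_date": "1995-03-01"},
            {"age": 62, "score": 40, "enrollment_date": "1995-04-12"}]
released = deidentify(patients, new_n=3)  # 3 resampled, masked records
```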
these studies, many of the same themes continued to occur. My intent was to
summarize that experience in this book.
There are numerous books that discuss the wide range of topics concerning
the evaluation of health-related quality of life. There still seemed to be a
need for a book that addresses design and analysis in enough detail to enable
readers to apply the methods to their own studies. To achieve that goal, I
have limited the focus of the book to the design and analysis of longitudinal
studies of health-related quality of life in clinical trials.
Intended Readers
My primary audience for this book is the researcher who is directly involved
in the design and analysis of HRQoL studies. However, the book will also
be useful to those who are expected to evaluate the design and interpret the
results of HRQoL research. More than any other field that I have been in-
volved with, HRQoL research draws investigators from all fields of study with
a wide range of training. This has included epidemiologists, psychologists,
sociologists, behavioral and health services researchers, clinicians and nurses
from all specialties, as well as statisticians. I expect that most readers will
have had some graduate-level training in statistical methods including multi-
variate regression and analysis of variance. However, that training may have
been some time ago and prior to some of the more recent advances in statisti-
cal methods. With that in mind, I have organized most chapters so that the
concepts are discussed in the beginning of the chapters and sections. When
possible, the technical details appear later in the chapter. Examples of SAS,
R, and SPSS programs are included to give the readers concrete examples of
implementation. Finally, each chapter ends with a summary of the important
points.
I expect that most readers will use this book for self-study or in discussion
groups. Ideally you will have data from an existing trial or will be in the
process of designing a new trial. As you read through the book, you will be able to contrast your trial with the studies used throughout and decide on the best approach(es) for your trial. The intent is that readers, by following the examples presented here, will be able to apply the same steps to their own studies.
Use in Teaching
This book was not designed as a course textbook and thus does not include
features such as problem sets for each chapter. But I have found it to be
extremely useful when teaching courses on the analysis of longitudinal data.
I can focus my lectures on the concepts and allow the students to learn the
details of how to implement methods from the examples.
The Future
One of my future goals is to identify (and obtain permission to use) data from
other clinical trials that illustrate designs and analytic challenges not covered
by the studies presented in this book. Perhaps the strategies of deidentification and random sampling of subjects utilized for the datasets obtained for this second edition can be used to generate more publicly accessible datasets
that can be used for educational purposes. If you think that you have data
from a trial that could be so used, I would love to hear from you.
Diane L. Fairclough
Acknowledgments
First, I would like to thank the sponsors of the trials who gave permission to
use data from their trial to generate the limited use datasets as well as the
participants and investigators whose participation was critical to those studies.
I would like to thank all my friends and colleagues for their support and help.
I would specifically like to thank Patrick Blatchford, Joseph Cappelleri, Luella
Engelhart, Shona Fielding, Dennis Gagnon, Sheila Gardner, Cindy Gerhardt,
Stephanie Green, Keith Goldfeld, Paul Healey, Mark Jaros, Carol Moinpour,
Eva Szigethy and Naitee Ting for their helpful comments on selected chapters.
In this initial chapter, I first present a brief introduction to the concept and
measurement of Health-Related Quality of Life (HRQoL). I will then introduce
the five clinical trials that will be used to illustrate the concepts presented
throughout the remainder of this book. The data from each of these trials
and all results presented in this book arise from derived datasets from actual
trials as described in the Preface. Access is provided to the reader solely for
the purpose of learning and understanding the methods of analysis. Please
note that the results obtained from these derived datasets will not match
published results.
The term health-related quality of life has been used in many ways. Al-
though the exact definition varies among authors, there is general agreement
that it is a multidimensional concept that focuses on the impact of disease and
its treatment on the well-being of an individual. In the broadest definition,
the quality of our lives is influenced by our physical and social environment as
well as our emotional and existential reactions to that environment. Kaplan
and Bush [1982] proposed the use of the term to distinguish health effects from
other factors influencing the subject’s perceptions including job satisfaction
and environmental factors. Cella and Bonomi [1995] state
Health-related quality of life refers to the extent to which one’s
usual or expected physical, emotional and social well-being are
affected by a medical condition or its treatment.
In some settings, we may also include other aspects like economic and ex-
istential well-being. Patrick and Erickson [1993] propose a more inclusive
definition which combines quality and quantity.
time within groups of patients. As a result, health status has most often been
used in clinical trials to facilitate the comparisons of therapeutic regimens.
Multi-attribute measures include the Quality of Well-Being (QWB) [Patrick et al., 1973], the Health Utility Index (HUI) [Feeny et al., 1992], the EuroQOL EQ-5D [Brooks, 1996] and the SF-6D [Brazier et al., 2002]. The multi-attribute measures are much less burdensome and have been
successfully utilized in clinical trials. However, they may not capture disease
specific issues and often have problems with ceiling effects.
∗ In general, scales constructed from multiple items have better reliability than single items.
How would you rate your overall health during the past week?
1 2 3 4 5 6 7
Very poor Excellent
How would you rate your overall quality of life during the past week?
1 2 3 4 5 6 7
Very poor Excellent
Likert Scale
How bothered were you by fatigue?
Not at all Slightly Moderately Quite a bit Greatly
0 1 2 3 4
diseases or treatments where there can be rapid changes will have a shorter
recall duration. HRQoL instruments designed for assessment of general pop-
ulations will often have a longer recall duration.
† Data presented here are derived (see Preface) from a trial conducted by the Eastern Co-
operative Oncology Group funded by the National Cancer Institute Grant CA-23318.
‡ CAF=cyclophosphamide, doxorubicin and 5-fluorouracil.
The patients eligible for the treatment trial had hormone receptor negative,
node-positive breast cancer. Enrollment in the HRQoL substudy started after
the initiation of the treatment trial. Patients registered on the treatment trial
who had not yet started therapy on the parent trial were eligible for the quality
of life study. Patients were also required to be able to read and understand
English to be eligible. Consent was obtained separately for the treatment trial
and the HRQoL substudy.
Patients were randomized to receive one of two regimens [Fetting et al., 1998]. Briefly, the standard therapy consisted of 28-day cycles with 14 days of oral therapy and 2 days of intravenous therapy (days 1 and 8). Thus, patients on this regimen had a 2-week break every 4 weeks. In contrast, the briefer
but more intensive experimental regimen consisted of weekly therapy. During
odd-numbered weeks patients received 7 days of oral therapy plus 2 days of
intravenous therapy. During even-numbered weeks, patients received 2 days
of intravenous therapy.
1. How often during the past 2 weeks have you felt worried or upset
as a result of thinning or loss of your hair?
(1) All of the time
(2) Most of the time
(3) A good bit of the time
(4) Some of the time
(5) A little of the time
(6) Hardly any of the time
(7) None of the time
2. How often during the past 2 weeks have you felt optimistic or
positive regarding the future?
(1) None of the time
(2) A little of the time
...
(7) All of the time
3. How often during the past 2 weeks have you felt your fingers were
numb or falling asleep?
(1) All of the time
(2) Most of the time
...
(7) None of the time
4. How much trouble or inconvenience have you had during the last
2 weeks as a result of having to come to or stay at the clinic or
hospital for medical care?
(1) A great deal of trouble or inconvenience
(2) A lot of trouble or inconvenience
(3) A fair bit of trouble or inconvenience
...
(7) No trouble or inconvenience
5. How often during the past 2 weeks have you felt low in energy?
(1) All of the time
(2) Most of the time
...
(7) None of the time
[Figure 1.2 plot: assessment times (symbols B, D, A) for each subject; x-axis: Weeks Post Randomization]
FIGURE 1.2 Study 1: Timing of observations in a study with an event-
driven design with assessments before (B), during (D) and 4 months after
(A) therapy. Data are from the breast cancer trial. Each row corresponds
to a randomized subject. Subjects randomized to the experimental regimen
appear in the upper half of the figure and subjects randomized to the standard
regimen are in the lower half of the figure.
The BCQ assessments were limited to one assessment before, during and after
treatment. The assessment prior to therapy was scheduled to be within 14
days of start of chemotherapy. The assessment during treatment was sched-
uled on day 85 of treatment. This was halfway through the CAF therapy
(day 1 of cycle 4) and three quarters of the way through the 16-week regimen
(day 1 of week 13). By day 85 it was expected that patients would be expe-
riencing the cumulative effects of both regimens without yet experiencing the
psychological lift that occurs at the end of treatment. The third assessment
was scheduled 4 months after the completion of therapy. Since the duration
of therapy differed between the two treatment regimens (16 vs. 24 weeks), the
third assessment occurred at different points in time for patients on the differ-
ent regimens, but at comparable periods relative to the completion of therapy.
Additional variability in the timing of the third assessment was introduced
for women who discontinued therapy earlier than planned or for those whose
time to complete treatment may have been extended or delayed because of
toxicity. The exact timing of the assessments is illustrated in Figure 1.2.
to evaluate the efficacy and safety of the experimental therapy versus placebo
in migraine prophylaxis in individuals with an established history consistent
with migraine. The primary outcomes of the trial§ include the frequency and
severity of migraines as reported on patient diaries. Secondary objectives
include the assessment of the impact of treatment on HRQoL.
§ Data presented here are completely simulated (see Preface). The impact of treatment on
the measures and the correlations between measures does not correspond to any actual trial
data. The correlations of assessments over time within each measure do mimic actual trial
data.
The 14-item MSQ is divided into the role restriction (RR), role prevention (RP), and emotional function (EF) domains. Patients were asked to answer each question in the MSQ using a standard six-point, Likert-type scale with the following choices: none of the time, a little of the time, some of the time, a good bit of the time, most of the time, and all of the time (Table 1.6). Responses to questions are reverse coded, averaged (or summed) and finally rescaled so that scores range from 0 to 100, with higher scores indicating better functioning.
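That reverse-code/average/rescale rule can be written out directly. The following is a generic illustration only (the item coding is a sketch, not the published MSQ scoring algorithm, and the book's own program examples are in SAS, R, and SPSS):

```python
def domain_score_0_100(responses, n_levels=6):
    """Generic sketch of the rule described above: reverse-code
    1..n_levels item responses, average them, and rescale the mean
    to 0-100 so that higher scores indicate better functioning."""
    reversed_items = [n_levels + 1 - r for r in responses]   # reverse code
    mean = sum(reversed_items) / len(reversed_items)         # average
    return (mean - 1) / (n_levels - 1) * 100                 # rescale to 0-100

domain_score_0_100([1, 1, 1])  # all items at level 1 reverse-code to 6 -> 100.0
domain_score_0_100([6, 6, 6])  # all items at level 6 reverse-code to 1 -> 0.0
```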
[Figure 1.3 plot: x-axis: Months Post Randomization]
FIGURE 1.3 Study 2: Timing of observations in the migraine prevention
trial. Each row represents a subject.
the MSQ subscales. Side effects (typically tingling and fatigue) are less likely
to affect the MSQ subscales (as the questions focus on migraine symptoms).
1.5.1 Treatment
All three treatment arms included cisplatin at the same dose. The traditional
treatment arm contained VP-16 with cisplatin (Control). The other two arms
contained low-dose Paclitaxel (Experimental 1) or high-dose Paclitaxel with
G-CSF (Experimental 2). The planned length of each cycle was 3 weeks.
Treatment was continued until disease progression or excessive toxicity. Of
the 525 patients randomized, 308 patients (59%) started four or more cycles
of therapy and 198 (38%) started six or more cycles.
¶ Data presented in this book are derived (see Preface) from a trial conducted by the Eastern Cooperative Oncology Group funded by the National Cancer Institute Grant CA-23318.
Each domain is scored as the sum of the items. If any of the items are
skipped and at least half of the items in the subscale are answered, scores are
calculated using the mean of the available items times the number of items in
the subscale. Examples and details of scoring are presented in Chapter 2. For
easier interpretation of the presentations in this book, all scales were rescaled
to have a possible range of 0 to 100.
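The half-rule for skipped items can be sketched as follows. This is an illustrative Python fragment of the stated rule only (the book's own examples use SAS, R, and SPSS, and Chapter 2 gives the full scoring details):

```python
def domain_sum(items):
    """Sketch of the rule described above: score the domain as the sum
    of its items; if some items are skipped (None) but at least half of
    the subscale was answered, use the mean of the available items times
    the number of items in the subscale; otherwise return None."""
    answered = [x for x in items if x is not None]
    if len(answered) == len(items):
        return sum(answered)                         # complete: simple sum
    if 2 * len(answered) >= len(items):              # at least half answered
        return sum(answered) / len(answered) * len(items)
    return None                                      # too much missing to score

domain_sum([3, 4, None, 5])        # mean 4.0 of 3 answered x 4 items -> 16.0
domain_sum([None, None, None, 1])  # fewer than half answered -> None
```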
[Figure 1.4 plot: assessment times (symbols A, B, C) for each subject; x-axis: Weeks Post Randomization]
FIGURE 1.4 Study 3: Timing of observations in a time-driven design with
four planned assessments. Data are from the lung cancer trial (Study 3). Each
row corresponds to a randomized patient in this study. Symbols represent the
actual timing of the HRQoL assessments relative to the date the patient was
randomized. A=Control Group, B=Experimental 1, C=Experimental 2.
months was selected as the best long-term follow up in this population because
it was several months before the expected median survival, which ensured that
a sufficient number of patients could be studied. Thus, the four assessments
were scheduled to be prior to the start of treatment, before the start of the
third and fifth courses of chemotherapy and at 6 months (week 26). If patients
discontinued the protocol therapy, then the second and third assessments were
to be scheduled 6 and 12 weeks after the initiation of therapy.
The actual timing of the HRQoL assessments showed much more variability
than the plan would suggest (Figure 1.4). Some of the variation was to be
expected. For example, when courses of therapy were delayed as a result of
toxicity, then the second and third assessments were delayed. Some variation
in the 6-month assessment was also expected as the HRQoL assessment was
linked to clinic visits that might be more loosely scheduled at the convenience
of the patient and medical staff. There was also an allowed window of 2 weeks
prior to the start of therapy for the baseline assessment.
Data presented in this book are derived (see Preface) from a trial conducted by Memorial
Sloan-Kettering Cancer Center and Eastern Cooperative Oncology Group funded by the
National Cancer Institute grant CA-05826.
data will follow in Chapter 4. As in the lung cancer trial, this suggests that
dropout in this study was likely to be non-random. Consequently, methods of
analysis described in Chapters 7 through 9 must be considered for this study.
∗∗ Data presented in this book are derived (see Preface) from a trial conducted by the University of Texas M.D. Anderson Cancer Center funded in part by the National Institutes of Health Grants No. R01 CA026582 and R21 CA109286.
[Figure 1.6 plot: assessment times (symbols B, D, A) for each subject; y-axis: Subject; x-axis: Days from CXRT Start]
FIGURE 1.6 Study 5: Timing of observations in the chemoradiation trial.
Each row represents a subject. Assessments during therapy are indicated by
a D and those after therapy by an A.
Symptoms and their impact on activities of daily living were assessed using
the M. D. Anderson Symptom Inventory (MDASI)[Cleeland et al., 2000]. The
MDASI is a patient-reported outcome tool to assess cancer related symptoms
and the impact of those symptoms on activities of daily living. The MDASI
includes 13 core symptoms: fatigue, sleep disturbance, pain, drowsiness, poor
appetite, nausea, vomiting, shortness of breath, numbness, difficulty remem-
bering, dry mouth, distress and sadness. Two additional symptoms (cough
and sore throat) were added in this study. The severity of these symptoms
during the previous 24 hours is assessed on a 0 to 10 point numerical scale, with 0 being “not present” and 10 being “as bad as you can imagine.” Interference with daily activities was assessed using six questions that describe how
much symptoms have interfered with general activity, mood, walking ability,
normal work, relations with other people and enjoyment of life. Interference
is also assessed on a 0 to 10 point scale, with 0 being “does not interfere” and
10 being “completely interferes.” A total score is computed as the mean of
the six item scores.
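The interference total just described, the mean of the six 0 to 10 item scores, is straightforward to compute. A minimal Python sketch, for illustration only (the book's examples use SAS, R, and SPSS):

```python
def interference_total(items):
    """Sketch of the interference total described above: six items,
    each rated 0 ("does not interfere") to 10 ("completely interferes");
    the total score is the mean of the six item scores."""
    if len(items) != 6:
        raise ValueError("interference total uses six items")
    if any(not 0 <= x <= 10 for x in items):
        raise ValueError("items are rated on a 0 to 10 scale")
    return sum(items) / 6

interference_total([2, 3, 0, 5, 4, 1])  # (2+3+0+5+4+1)/6 -> 2.5
```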
†† Data presented here are completely simulated. The impact of the hypothetical treatment
on the measure does not correspond to any actual trial data. Only the correlations of
assessments over time and the trajectories associated with the reason for dropout mimic
actual trial data.
relief with the need for rescue medication more common in the placebo arm
(Table 1.8.2). The missing data pattern was monotone with one exception of
a patient missing the baseline assessment.
[Figure 1.7 plot: y-axis: Subject; x-axis: Days Post Randomization]
FIGURE 1.7 Study 6: Timing of observations in the osteoarthritis trial.
Each row represents a subject.
1.9 Summary
• Health-Related Quality-of-Life (HRQoL)
• Measurement
2.1 Introduction
Implicit in the use of measures of HRQoL, in clinical trials and
in effectiveness research, is the concept that clinical interventions
such as pharmacologic therapies, can affect parameters such as
physical function, social function, or mental health. (Wilson and Cleary [1995])
√ Rationale for studying HRQoL and for the specific aspect of HRQoL to be measured
√ Explicit research objectives and endpoints (also see Chapters 13 and 14)
√ Strategies for minimizing the exclusion of subjects from the trial
√ Rationale for timing of assessments and off study rules
√ Rationale for instrument selection
√ Details for administration of HRQoL assessments to minimize bias and missing data
√ Analytic plan (see Chapter 16)
The protocol should include the critical elements of the analysis plan. This
plan needs to consider the standards that journals may require for publication.
The guidelines of the International Committee of Medical Journal Editors
(ICMJE) state that in publication the methods section should include only
information that was available at the time the plan or protocol for the study
was being written; all information obtained during the study belongs in the
results section. New standards require that the protocols for clinical trials be placed in the public domain by registering with entities such as ClinicalTrials.gov.
There are multiple motivations for these registries including awareness of other
research, but one relevant application is the use by reviewers to assess the
reported methods and results against those initially intended as stated in the
protocol. In trials that are intended for the approval of new drugs or expanded
claims for existing drugs that will be reviewed by regulatory bodies such as the
FDA or EMEA, the protocol is often supplemented with a statistical analysis
plan (SAP) that provides additional details of the planned analysis.
√ What are the specific constructs that will be used to evaluate the intervention?
- If HRQoL, is it the overall HRQoL or a specific dimension?
- If a symptom, is it severity, time to relief, impact on daily activities, etc.?
√ What is the goal of inclusion of HRQoL?
- Claim for a HRQoL benefit?
- Supportive evidence of superiority of intervention?
- Exploratory analysis of potential negative impacts?
- Pharmacoeconomic evaluation?
- Other?
√ What is the population of interest (inference)?
√ What is the time frame of interest?
√ Is the intent confirmatory or exploratory?
emotional functioning). If the condition is permanent and the drugs are intended for extended use, the objective could be restated as “To compare the impact of pain on daily activities while on a maintenance dose of Pain-Free vs. Relief.” Alternatively, if the pain associated with the condition was brief, the drug that provided earlier relief would be preferable and the objective stated as “To compare the time to 20% improvement in the severity of pain with Pain-Free vs. Relief.” These objectives now provide more guidance for
the rest of the protocol development. The construct of interest is explicitly de-
fined and will guide the selection of the measure and the timing of assessment.
In the first situation, assessments during maintenance are important and in
the second, the earlier assessments are the basis of defining the outcome and
should be frequent enough to detect differences between the two regimens.
intent is to make inferences about all subjects for the entire period of assess-
ment. Thus, assessment of HRQoL should be continued regardless of whether
the patient continues to receive the intervention as specified in the protocol.
HRQoL assessments should not stop when the patient goes off study.
In contrast, with an explanatory aim, we compare the HRQoL impact of
treatments given in a manner that is carefully specified in the protocol. This
is sometimes described as the analysis of the per-protocol subgroup. In this
setting, HRQoL assessments may be discontinued when subjects are no longer
compliant with the treatment plan. The analyses of these studies appear
simple on the surface, but there is a real chance of selection bias. The analyst
can no longer rely on the principle of randomization to avoid selection bias
since unmeasured patient characteristics that confound the results may be
unbalanced across the treatment arms.
analysis. It is not uncommon to read a protocol where the analyses are limited
to subjects with at least two HRQoL assessments (baseline and one follow-
up). This criterion may change the population to which the results can be
generalized by excluding all patients who drop out of the trial early. If this
is a very small proportion of patients it will not matter. But if a substantial
number of subjects on one or more arms of the trial drop out before the second
HRQoL assessment, this rule could have a substantial impact on the results.
notes that the side effects of chemotherapy in cancer patients may have a less
adverse effect on a patient’s HRQoL than similar side effects attributable to
the disease. This observation may be true in other disease conditions as well.
Testing immediately after toxicity occurs will emphasize that experience and
deemphasize the benefits of treatment and disease symptoms. It is important
not to pick a particular timing that will automatically bias the results against
one treatment arm. In studies where the timing and length of treatment differ
across arms this may be challenging, if not impossible.
than toward the end. However, in retrospect, even more assessments during
the early phase of therapy would have been informative as HRQoL changed
very rapidly during the early weeks of the treatment. Tang and McCorkle
[2002] recommend weekly assessments in terminal cancer patients because of
the short duration of survival and the dramatic changes in symptoms that
occur in some patients.
Assessments should not be more frequent than the period of recall defined
for the instrument. The quality of the patient’s life does not generally change
on an hourly or daily basis as one would expect for symptoms. HRQoL scales
often request the subjects to base their evaluation on the last 7 days or 4
weeks. Thus, if the HRQoL instrument is based on recall over the previous
month, assessments should not be weekly or daily. Scales assessing symptoms
(Study 5) where there can be more rapid changes generally have a shorter
recall duration.
selection bias and overly optimistic estimates. A treatment arm with a high rate
of dropout may appear artificially beneficial because only the healthiest of the
patients remain on the treatment. On the other hand, discontinuation of as-
sessment may make scientific sense in other disease settings. The conservative
approach is to continue HRQoL assessment; the off-therapy assessments can
always be excluded if later deemed uninformative with respect to the research
question. The opposite is not true; one can never retrospectively obtain the
off-therapy assessments at a later date if they are determined to be of interest.
√ Identification of the construct to be measured
√ Does the instrument measure what it proposes to measure?
√ Is the information relevant to the research question?
– How well does the instrument cover the important aspects of what is to be measured?
– Is a generic or disease-specific instrument more appropriate?
– Health status (rating scale) vs. patient preference (utility)?
√ Will the instrument discriminate among subjects in the study and will it detect change?
√ How well does the instrument predict related outcomes?
√ Are the questions appropriate for the subjects across time?
√ Are the format and mode of administration appropriate to the subjects and the trial?
√ Has the instrument been previously validated in the target or similar population? If not, what are the plans to do so for the current study?
√ If using a new instrument or items, what is the rationale and why are they indispensable?
clustered at either end of the scale, it may not be possible to detect change due to interventions because scores cannot get much higher or lower.
2.6.3 Appropriateness
Will the instrument discriminate among subjects in the study and will it
detect change in the target population? Ware et al. [1981] suggest two general
principles.
Are the questions appropriate and unambiguous for subjects? One cannot
always assume that a questionnaire that works well in one setting will work
well in all settings. For example, questions about the ability to perform the
tasks of daily living, which make sense to individuals who are living in their
own homes, may be confusing when administered to a patient who has been
in the hospital for the past week. Questions about work may be problematic
for students, homemakers and retired individuals. Similarly, questions about
the amount of time spent in bed provide excellent discrimination among
non-hospitalized subjects, but not among hospitalized patients.
Are the questions appropriate for the subjects across time? In cases where
the population is experiencing very different HRQoL over the length of the
study, very careful attention must be paid to the selection of the instrument
or instruments. Some studies will require difficult choices between the abil-
ity to discriminate among subjects during different phases of their disease
and treatment. In the adjuvant breast cancer trial (Study 1), the subjects
were free of any symptoms or detectable disease and at the time of the pre-
and post-treatment assessments, they were much like the general population.
During therapy, they were likely to be feeling ill from the side effects of the
treatment. In the example, at the time that the study was planned there
were very few choices of HRQoL instruments. The compromise was the se-
lection of the Breast Chemotherapy Questionnaire (BCQ), which was very
sensitive to chemotherapy side effects but may have been less sensitive to any
post-treatment differences.
If the trial is international, has the instrument been validated in other
languages and cultures? Simple translation is unlikely to be adequate. There
are numerous examples where investigators have found problems with certain
There are a number of other factors that can similarly influence the re-
sponses. The place and timing, such as asking patients to complete ques-
tionnaire after they have gone through testing or have received news about
their disease condition or bringing individuals back to settings (e.g. the hos-
pital) where they have experienced painful procedures or have other negative
memories [Noll and Fairclough, 2004], may influence responses. Answering
the questionnaire at home may result in different responses than in a hospi-
tal/clinic environment [Wiklund et al., 1990]. Smith et al. [2006] demonstrated
the influence of the content of an introduction during telephone surveys of
Parkinson’s disease patients on the responses.
the usual responsibilities associated with the clinical trial, ensuring that there
is someone who knows when the patient will arrive, will make sure the pa-
tient receives the questionnaire prior to undergoing diagnostic or therapeutic
procedures and has a quiet place to complete the assessment; and is responsi-
ble for implementing follow-up procedures when the patient is not available as
expected. At the time of the first assessment this key person should communi-
cate the importance to the investigators of obtaining the patient’s perspective,
review the instructions with the subject, emphasize that there are no correct
or incorrect responses, encourage the subjects to provide the best answer they
can to every question and remind the patient that they will be asked to re-
peat the assessment at later dates (if applicable). This person may have the
responsibility of reviewing the forms for missing responses, but care needs to
be taken to balance confidentiality with the need to minimize missing data. If
the assessment consists of an interview, it requires sufficient trained personnel
to schedule and conduct the interview.
Second, there needs to be a system that identifies when patients are due
for assessments. This may include preprinted orders in the patient’s chart
that identify which HRQoL assessments should be administered at each clinic
visit. This process may be assisted by support from a central data manage-
ment office where calendars and expectation notices are generated. Stickers
on the patient’s chart identifying them as part of a study may also be help-
ful. Other options include flow sheets, study calendars and patient tracking
cards [Moinpour et al., 1989].
Education
Education can be an important part of minimizing missing data. It must
start at the investigator level and include research assistants (often nurses)
as well the patient. Vehicles for education include the protocol (with strong
justifications for the HRQoL assessments), symposia, video and written ma-
terials. Videos may be valuable both as training vehicles for research staff
and for patients. Although there are often face-to-face training sessions at
the initiation of a study, research personnel can change over time. Training
tapes directed toward research personnel can deal with procedures in more
detail than is possible in the protocol. Examples would include how to handle
a patient who is not able to fill in the questionnaire and not letting family or
friends assist with the completion of the questionnaire. Training tapes are es-
pecially useful for providing positive ways of approaching the patient. Instead
of referring to participation as burdensome (e.g. “We have a lot of forms that
you’ll need to fill out”), the HRQoL assessment can be placed in a positive
light [Cella et al., 1993a]:
Hopwood et al. [1997] noted that, in three trials for lung and head and
neck cancer, staff considering the patient to be too ill to complete the HRQoL
assessments was the most commonly cited problem affecting the distribution
of questionnaires. However, patient refusal was the least cited problem. It is
understandable that study personnel are reluctant to approach patients when
they appear to be feeling particularly ill, but to minimize the bias from se-
lecting out these patients, all should be asked to complete the questionnaire.
There may be ways of encouraging ill patients, specifically by providing condi-
tions that make it as easy as possible for them to complete the questionnaire.
When a patient refuses, of course, that refusal must be respected.
Patient information sheets, which explain to the patient the rationale be-
hind the HRQoL assessments, will minimize missing data. These sheets can
contain messages about the importance of the patient’s perspective, that there
are no “correct” answers to the questions and the reasons it is important to
respond to every question and to complete the follow-up questionnaires. In
addition to the persuasive information, the fact that patients can refuse with-
out affecting their treatment or their relationship with their doctor should be
included.
Because different scales (and even subscales within the same instrument)
have different numbers of items and different ranges of responses, the
summated scores will have different ranges, making interpretation more
difficult. For example, the emotional well-being score of the FACT (version
2) has a range of 0 to 20; the functional well-being, physical well-being and
family social well-being scores have a range of 0 to 28 and the total score
has a range of 0-140. It is common practice to standardize the summated
scores to range from 0 to 100. This is done by subtracting the lowest possible
summated score (Smin ) from each score (Si ) and then multiplying the result
by 100/(Smax − Smin ).
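The rescaling just described can be sketched as a short function (Python is used here purely for illustration; the instrument ranges are those given in the FACT example above):

```python
def standardize(raw, s_min, s_max):
    """Rescale a summated score S to 0-100: (S - Smin) * 100 / (Smax - Smin)."""
    return (raw - s_min) * 100.0 / (s_max - s_min)

# FACT (version 2) emotional well-being has a raw range of 0 to 20,
# so a raw score of 15 becomes 75 on the 0-100 scale.
print(standardize(15, 0, 20))   # 75.0
# Functional well-being has a raw range of 0 to 28.
print(standardize(14, 0, 28))   # 50.0
```

After this transformation, the lowest possible score always maps to 0 and the highest to 100, regardless of the number of items.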
A much smaller number of HRQoL instruments use factor-analytic weights
to construct scales. The source of these weights should be very carefully ex-
amined before blindly accepting them. The weights need to be established
in very large representative samples of subjects that include all levels of dis-
ease (or lack of disease) and expected types of treatment. Fayers and Hand
[1997] point out how sensitive the factor structure and resulting weights are to
the population from which they were derived. Weights derived from another
clinical trial or a selected population are likely to reflect the association of
side effects of the specific treatments in that trial rather than the underlying
construct.
Below is a list of statements that other people with your illness have said are
important. By circling one number per line, please indicate how true
each statement has been for you during the past 7 days.
∗ When questions have a strict hierarchy the scale is referred to as a Guttman scale.
TABLE 2.4 Selected physical function questions from the SF-36 Health
Survey that have a roughly hierarchical structure.
The following items are about activities you might do during a typical day.
Does your health now limit you in these activities? If so, how much?
2.9 Summary
• It is critical to establish explicit research objectives during protocol de-
velopment. Details should include 1) constructs (domains) (e.g. Fatigue,
Physical Well-being, Global HRQoL) underlying the critical hypotheses,
2) population, and 3) time frame relevant to the research questions.
3.1 Introduction
In most clinical trials our ultimate goal is to compare the effects of interventions
between treatment groups. As the first step we will need to develop
a good model for the changes over time within groups defined by treatment
or other patient characteristics. In the previous chapter, event- or condition-
driven and time-driven designs are briefly described. These designs have cor-
responding analytic models: a repeated measures model and a growth curve
model. In this introduction, I will discuss the choice between the two mod-
els. In a repeated measures model, time is conceptualized as a categorical
variable. Each assessment must be assigned to one category. The remainder
of this chapter addresses the analysis of studies with event-driven designs.
In a growth curve model, time is conceptualized as a continuous variable.
The following chapter will describe the analysis of mixed-effects growth curve
models.
53
© 2010 by Taylor and Francis Group, LLC
54 Design and Analysis of Quality of Life Studies in Clinical Trials
appropriate for that question should be the primary consideration in the se-
lection between the models. Both models can be used to understand how
individuals change over time, how interventions affect that change and how
patient characteristics modify the treatment effects.
The adjuvant breast cancer trial is a clear example of a design that requires the
use of a repeated measures model. Each of the three assessments was planned
to measure HRQoL at a different phase of the study. There is variation in the
timing of these landmark events, especially the final assessments, scheduled 4
months after the completion of therapy, which did not occur at the same time
for the two treatment arms (see Figure 1.2). Some post-therapy observations
also occurred earlier than scheduled when a patient discontinued therapy early.
However, the intent of the design was to compare HRQoL when subjects had
been off therapy long enough to recover from the effects of acute toxicity.
Thus, the exact timing of the HRQoL assessment relative to the time they
started therapy is less relevant than the time since completing therapy.
The lung cancer trial is a good example of when either a repeated measures
or a growth curve model can be justified. With only four assessments that
are widely spaced in time, it is possible to classify each assessment with a
landmark time. There are no cases where an individual had more than four
assessments and only a few cases where there is some question about whether
an assessment is closer to the 12- or to the 26-week target (Figure 1.3). The
growth curve model is also reasonable, as treatment is continuous, continuing
as long as it is tolerable and effective.
A growth curve model is the practical choice in the renal cell carcinoma trial,
as the timing of assessments relative to randomization is frequent initially and
becomes more varied as time progresses (see Figure 1.4). Thus, it becomes
increasingly difficult to assign each observation to a landmark time. Forcing
this study into a repeated measures design will also produce unstable estimates
of covariance parameters during the later follow-up periods. It is also likely
in this study, with a maximum of 6 assessments per subject, that a growth
curve model will require fewer than 6 parameters to describe the trajectory of
the outcome measures over time, in contrast to the 6 parameters that would
be used in a repeated measures model.
Two models, one nested within the other, can be compared with a maximum
likelihood (ML) ratio test [Jennrich and Schluchter, 1986] or restricted maximum
likelihood (REML) ratio test [Littell et al., 1996, pg 278]. The statistics
are constructed by subtracting the values of -2 times the log-likelihood and
comparing the statistic with a χ2 distribution with degrees of freedom equal
to the difference in the number of parameters in the two covariance struc-
tures. Tests based on the REML are valid as long as the fixed effects in
both models are the same. Thus either ML or REML ratio tests are valid
when comparing nested covariance structures. However, likelihood ratio tests
of the fixed-effects (β’s) in nested models must be limited to the use of ML
because the restricted likelihood adjustment depends upon the fixed-effects
design matrix [Littell et al., 1996, pg 298, 502].
To identify nesting, consider whether a set of restrictions on the parameters
in one model can be used to define the other model. Typical restrictions are
constraining a parameter to be zero (βa = 0) or constraining two or more
parameters to be equal to each other (βa = βb = βc ).
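The construction of the test statistic can be sketched as follows (Python for illustration; the −2 log-likelihood values are illustrative, and the closed-form survival function shown is valid only for an even number of degrees of freedom):

```python
import math

def chi2_sf(x, df):
    """Chi-square survival function, closed form for even df:
    P(X > x) = exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!"""
    assert df % 2 == 0, "closed form used here requires even df"
    return math.exp(-x / 2.0) * sum((x / 2.0) ** k / math.factorial(k)
                                    for k in range(df // 2))

def lr_test(neg2ll_restricted, neg2ll_full, df_diff):
    """Likelihood ratio test for nested models: difference of the
    -2 log-likelihoods compared with a chi-square distribution."""
    stat = neg2ll_restricted - neg2ll_full
    return stat, chi2_sf(stat, df_diff)

# Illustrative values: restricted model -2logL = 1669.613 with 9
# parameters, full model -2logL = 1645.688 with 11 parameters.
stat, p = lr_test(1669.613, 1645.688, 2)
print(round(stat, 3))   # 23.925; p is far below 0.05
```

A large statistic relative to the chi-square distribution with df equal to the difference in parameter counts leads to rejection of the restricted (more structured) model.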
Other Statistics
Criteria such as Akaike’s Information Criterion (AIC) and the Bayesian Infor-
mation Criterion (BIC) are useful when comparing non-nested models. These
statistics are equal to the -2 log likelihood values with an added penalty that is
a function of the number of parameters estimated. The BIC imposes a greater
penalty for additional parameters than the AIC, favoring more parsimonious
models.
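These criteria can be sketched directly from their definitions (Python; the log-likelihood and sample-size values below are hypothetical, and software packages differ in whether n counts subjects or observations in the BIC):

```python
import math

def aic(neg2loglik, k):
    """AIC: -2 log-likelihood plus a penalty of 2 per parameter."""
    return neg2loglik + 2 * k

def bic(neg2loglik, k, n):
    """BIC: -2 log-likelihood plus a penalty of log(n) per parameter."""
    return neg2loglik + k * math.log(n)

# With n = 50, log(n) is about 3.9 > 2, so the BIC penalizes each
# additional parameter more heavily than the AIC.
print(aic(100.0, 3))                 # 106.0
print(round(bic(100.0, 3, 50), 3))   # 111.736
```

Smaller values are better; when the two criteria disagree, the BIC will tend to select the model with fewer parameters.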
The simplest model for this study is a cell mean model with the mean HRQoL
score estimated for each of the treatments by time combination (Table 3.1).
The equation for the model is:
where μhj is the average HRQoL score for the j th measurement of the hth
group.
∗ Indicator variables have a value of 1 when the condition is true and 0 when it is false.
3.3.4 Covariates
Other explanatory variables (covariates) can be added to the model when
needed. An expanded presentation of models that test for moderation (effect
modification) is presented in Chapter 5. Here we briefly address covariates
that explain a significant proportion of the variation in the outcome, but
are independent of the factors of interest in the trial. For example, if the
effect of age is independent of treatment and time, but is expected to explain
variation in the outcome (Y ), we can add age at the time of diagnosis (without
interactions) to the model. The equation for the cell mean model is:
Treatment Standard
Effect Evaluation Arm Estimate Error
FUNO*Trtment Pre-Tx Control 7.6821 0.1260
FUNO*Trtment Pre-Tx Experimental 7.5700 0.1266
FUNO*Trtment During Tx Control 6.7200 0.1536
FUNO*Trtment During Tx Experimental 6.1834 0.1558
FUNO*Trtment Post-Tx Control 8.0139 0.1181
FUNO*Trtment Post-Tx Experimental 7.9674 0.1212
With addition of the age covariate AGE EVAL, the MIXED statements be-
come:
PROC MIXED DATA=BREAST3 Method=ML;
* Cell Means Model *;
CLASS PatID FUNO Trtment;
MODEL BCQ=FUNO*Trtment AGE_EVAL/NOINT SOLUTION ddfm=KR;
The output appears as follows. Note that the first three estimates are iden-
tical to those estimated for the control group using the cell mean model and
the second three estimates represent the change from the control to the ex-
perimental group.
Solution for Fixed Effects
Standard
Effect Evaluation Estimate Error
FUNO Pre-Tx 7.6821 0.1260
FUNO During Tx 6.7200 0.1536
FUNO Post-Tx 8.0139 0.1181
Effects Models option from the Analyze menu. For all the models, Subjects will
be identified by PatID, Repeated by FUNO and Repeated Covariance Type will
be identified as Unstructured in the first screen of options. In the second
screen, the Dependent variable is identified as BCQ. We will select the ML
option from the Estimation menu and Parameter Estimates and Covariance
of Residuals from the Statistics menu in the second screen of options.§
The resulting parameters are the same as previously presented in Section 3.3.5.
The results are the same as displayed in the previous section for SAS.
§ SPSS Hint: After using the menu option to specify as much of the syntax as possible, it
is recommended that the user paste the resulting commands into a Syntax Window (e.g. a
*.sps file), add the additional statements and run the commands from the Syntax Window.
first checking the Include Intercept and adding the terms for T2, T3, Exp C*T2,
and Exp C*T3 into the model for the fixed effects.
The cell mean model is then formed by crossing the two factors (TrtGrp:FU)
and suppressing the default intercept (0 or -1) on the model argument of the
gls function.
R> CMean.Var1 = gls(model=BCQ ~0+ TrtGrp:FU, data=Breast3,
+ correlation=corSymm(form= ~1|PatID),
+ weights=varIdent(form=~1|FUNO),
+ na.action=na.exclude,method="ML")
R> list(CMean.Var1)
Coefficients:
TrtGrp1:FU1 TrtGrp2:FU1 TrtGrp1:FU2 TrtGrp2:FU2 TrtGrp1:FU3 TrtGrp2:FU3
7.682032 7.581269 6.718685 6.189271 8.015088 7.947676
The center point model is then formed by specifying each term of the model
argument of the gls function.
Coefficients:
(Intercept) Time2 Time3 Time2:Exp_C Time3:Exp_C
7.63197007 -1.17801300 0.34954553 -0.44908859 -0.01259171
The parameter associated with Intercept is the Pre-Tx mean, Time2 is the
average change from Pre-Tx to During Tx, Time3 is the average change from
Pre-Tx to Post-Tx, and Time2:Exp_C and Time3:Exp_C are the treatment
differences During Tx and Post-Tx.
Heterogeneous Toeplitz (7 parameters):
    σ1²   σ1σ2ρ1   σ1σ3ρ2   σ1σ4ρ3
          σ2²      σ2σ3ρ1   σ2σ4ρ2
                   σ3²      σ3σ4ρ1
                            σ4²
Heterogeneous Compound Symmetry (5 parameters):
    σ1²   σ1σ2ρ    σ1σ3ρ    σ1σ4ρ
          σ2²      σ2σ3ρ    σ2σ4ρ
                   σ3²      σ3σ4ρ
                            σ4²
Toeplitz (4 parameters):
    σ²    σ²ρ1     σ²ρ2     σ²ρ3
          σ²       σ²ρ1     σ²ρ2
                   σ²       σ²ρ1
                            σ²
First-order Autoregressive Moving Average [ARMA(1,1)] (3 parameters):
    σ²    σ²λ      σ²λρ     σ²λρ²
          σ²       σ²λ      σ²λρ
                   σ²       σ²λ
                            σ²
Compound Symmetry (2 parameters):
    σ²    σ²ρ      σ²ρ      σ²ρ
          σ²       σ²ρ      σ²ρ
                   σ²       σ²ρ
                            σ²
Note that structures progress from the least structured at the top
of the table to the most structured at the bottom of the table.
HRQoL measure at each time point (σ12 , σ22 , σ32 ) is allowed to be different.
The need for this type of flexible structure is illustrated in the adjuvant breast
cancer study (Table 3.4) where there is more variation in the HRQoL measure
while the subjects are on therapy (σ22 = 2.28) than before or after therapy
(σ12 = 1.43, σ32 = 1.32). The covariance of each pair of HRQoL measures
(σ12 , σ13 , σ23 ) is also different. This may be appropriate in many trials, as it
is not uncommon for assessments occurring during therapy to be more strongly
correlated with each other than with the off therapy assessments.
When the number of repeated measures increases, the number of parameters
increases dramatically. For example, as the number increases from 3 to 7
repeated measures the number of parameters increases from 6 to 28 in an
unstructured covariance (Table 3.3). In large datasets with nearly complete
follow-up, estimation of the covariance parameters is not a problem. But as
the dropout increases, especially in smaller studies, it becomes more difficult
to obtain stable estimates of all the covariance parameters. In these settings
it may be advisable to place restrictions on the covariance structure.
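The counts quoted above follow from the fact that an unstructured covariance for m repeated measures has m variances and m(m−1)/2 covariances; a quick check (Python, illustrative only):

```python
def unstructured_params(m):
    """m variances + m(m-1)/2 covariances = m(m+1)/2 parameters."""
    return m * (m + 1) // 2

for m in range(3, 8):
    print(m, unstructured_params(m))
# 3 repeated measures -> 6 parameters and 7 -> 28, as noted in the text.
```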
Heterogeneous structures allow the variances (σ1², σ2², σ3², σ4²) to vary across
assessments, whereas the homogeneous structures assume that the variance is
equal across assessments (σ1² = σ2² = σ3² = σ4²). The unstructured covariance
is a heterogeneous structure.
When HRQoL observations are taken closely in time, the correlation of the
residual errors is likely to be strongest for observations that are close in time
and weakest for observations that are the furthest apart. In the Toeplitz
structures, the correlations are equal for all adjacent assessments (ρ12 = ρ23 = ρ34),
for all paired assessments separated by one assessment (ρ13 = ρ24),
etc. Compound Symmetry assumes that the correlation among all visits is
equal (ρ12 = ρ23 = ρ34 = ρ13 = ρ24 = ρ14 ) regardless of how far apart in time
the observations are taken. An autoregressive (AR(1)) structure (not shown)
is a very restrictive structure. The covariance decays exponentially as a function
of the time separation between the observations (ρ2 = ρ1², ρ3 = ρ1³, etc.)
and implies that as the time between assessments increases, the correlation
will eventually disappear. It is rare that the correlation of HRQoL measures
declines this rapidly, or that measures far apart in time ever become
completely uncorrelated. The ARMA(1,1) structure provides slightly more
flexibility, with a less rapid decay.¶
More than one covariance structure is likely to provide a reasonable fit to the
data. Most structured covariance matrices are nested within the unstructured
matrix, as indicated by the restrictions described above and thus can be tested
using likelihood ratio tests. The information criteria can be used to provide
guidance. Unless the sample size for the study is very large, it is difficult
to choose definitively among the various covariance structures. In the breast
cancer trial, the estimates of this and the general unstructured covariance
are very similar and the differences are unlikely to affect the results of the
primary analyses (Table 3.4). The AR(1) structure is included in Table 3.4
to illustrate the rapid decrease in the correlation that generally will not be
observed in studies measuring HRQoL.
There are numerous other possible covariance structures that are variations
on these described here. Additional options for covariance structures are
described in detail in other sources including Jennrich and Schluchter [1986],
Jones [1993, chapter 3 and 6], Verbeke and Molenberghs [1997, chapter 3]
and in the SAS Proc Mixed documentation [1996].
The R and RCORR options request the output of the covariance and correlation
matrices for the first subject. (R=2 and RCORR=2 would request the matrices
for the second subject; useful if the first subject has incomplete data.)
The correlations estimated for the unstructured covariance range from 0.58
to 0.65, with the smallest correlation between the two observations furthest
apart in time. This suggests that the correlation may be slightly decreasing
as visits are spaced further apart. The increase in variation during the sec-
ond assessment (during chemotherapy) suggests heterogeneity of the variance
across the assessments.
As a final note, the most frequent error that I encounter when fitting any of
these covariance structures occurs when more than a single observation is
associated with one of the landmarks (ordered categories). In this example,
this would occur when a subject has more than one record associated with a
value of FUNO. The error message is
As previously discussed, each landmark can have only one assessment associ-
ated with it.
In summary, unless the data is sparse for some of the repeated measures,
choosing an unstructured covariance will be the preferred strategy. This struc-
ture will always fit the data, though it may not be the most parsimonious.
The analysis plan can be easily specified in advance and additional steps are
eliminated.
Σi = σ²ViCiVi (3.9)
Var(eij) = σ²([Vi]jj)² (3.10)
cor(eij, eik) = [Ci]jk (3.11)
While this output looks quite different from the SAS and SPSS output, the
results are essentially the same (Table 3.5). The heterogeneous and homoge-
neous compound symmetry models can be fit by changing the correlation
and weights options:
Note that because these three structures are nested, we can look at the
likelihood ratio tests. Again, the results suggest that the general and the
heterogeneous compound symmetric structures provide a similar fit, but the
homogeneous compound symmetric structure is not appropriate.
H0: μ21 = μ11 vs. HA: μ21 ≠ μ11
H0: μ22 = μ12 vs. HA: μ22 ≠ μ12
H0: μ23 = μ13 vs. HA: μ23 ≠ μ13
These equations can be rewritten, placing all parameters on one side of the
equality:
Thus putting the parameters in exactly the same order as they appear in the
SAS, SPSS or R output:
Alternative models such as the center point with common baseline gener-
ate parameters that reflect the hypotheses of interest, but do not generate
estimates for most of the treatment by time combinations. Thus these values
may also need to be estimated.
Standard
Label Estimate Error DF t Value Pr > |t|
Pre-Therapy Diff 0.1121 0.1786 200 0.63 0.5311
During Therapy Diff 0.5367 0.2188 173 2.45 0.0152
Post-Therapy Diff 0.04651 0.1692 168 0.27 0.7838
Notice that the CONTRAST statement contains the same syntax for identifying
the terms in the hypothesis as the two ESTIMATE statements, but they are
separated by a comma. The estimates of change in the outcome measure
while on therapy in the control and experimental arms represent declines of
approximately 0.6 and 1.1 standard deviations; for the BCQ measure, these
are moderate and strong effects respectively [Cohen, 1988].
Standard
Label Estimate Error DF t Value Pr > |t|
Change in Cntl pts -0.9621 0.1143 175 -8.42 <.0001
Change in Exp pts -1.3867 0.1180 179 -11.75 <.0001
Contrasts
Num Den
Label DF DF F Value Pr > F
During-Pre Change 2 177 104.47 <.0001
−0.743/1.29 and −1.438/1.29, where 1.29 = σ̂1 is the average baseline standard deviation.
Similarly, we can estimate and test the change from baseline to during
treatment.
* Estimates of Change from Pre-therapy to During therapy *;
ESTIMATE ‘Change in Cntl pts’ T2 1 T2*EXP_C -.5;
ESTIMATE ‘Change in Exp pts’ T2 1 T2*EXP_C .5;
* Test Change from Pre-therapy to During therapy *;
CONTRAST ‘During-Pre Change’ T2 1 T2*EXP_C -.5,
T2 1 T2*EXP_C .5;
Notice that when a test involves multiple degrees of freedom the individual
contrasts are separated by a comma, as in the last test presented above.
Then we compute θ̂, its variance and standard error, z-statistics and p-values
for a 2-sided test.
R> est.theta=C %*% est.beta # Compute theta
R> var.theta=C %*% var.beta %*% t(C) # Variance
R> se.theta=sqrt(diag(var.theta)) # Standard Error
R> zval.theta=est.theta/se.theta # Z-statistics
R> pval.theta=(1-pnorm(abs(zval.theta)))*2 # P-value
If the sample size was much smaller, a simple z-statistic might not be appro-
priate, but it is more than satisfactory for these examples.
Then we use the anova function to test a treatment effect during therapy,
after therapy, or simultaneously.
R> anova(CPnt.Var1,CPnt.Mn23)
Model df AIC BIC logLik Test L.Ratio p-value
CPnt.Var1 1 11 1667.688 1715.256 -822.8440
CPnt.Mn23 2 9 1687.613 1726.533 -834.8067 1 vs 2 23.92531 <.0001
R> anova(CPnt.Var1,CPnt.Mn23) # Overall Test
Model df AIC BIC logLik Test L.Ratio p-value
CPnt.Var1 1 11 1659.306 1706.715 -818.6531
CPnt.Mn23 2 9 1663.369 1702.158 -822.6845 1 vs 2 8.062832 0.0177
R> anova(CPnt.Var1,CPnt.Mn2) # T2 only
Model df AIC BIC logLik Test L.Ratio p-value
CPnt.Var1 1 11 1659.306 1706.715 -818.6531
CPnt.Mn2 2 10 1664.656 1707.755 -822.3278 1 vs 2 7.349348 0.0067
R> anova(CPnt.Var1,CPnt.Mn3) # T3 only
Model df AIC BIC logLik Test L.Ratio p-value
CPnt.Var1 1 11 1659.306 1706.715 -818.6531
CPnt.Mn3 2 10 1657.314 1700.413 -818.6571 1 vs 2 0.008048903 0.9285
3.6 Summary
• Event-driven designs are generally associated with repeated measures
models.
4.1 Introduction
This chapter focuses on analysis of longitudinal studies using mixed-effect
growth-curve models for trials where time or other explanatory variables are
conceptualized as continuous (e.g. time-driven designs). Strategies
for choosing between repeated measures and growth-curve models for lon-
gitudinal studies were discussed in the previous chapter. The most typical
approach for modeling growth-curve models uses a mixed-effects model.∗ The
term mixed refers to the mixture of fixed and random effects:
Yi = Xi β + Z i di + ei (4.1)
Fixed effects Random effects Residual error
The fixed-effects (Xi β) model the average trajectory. The fixed-effects are
illustrated in Figure 4.1 by the bold line. The fixed effects are also referred to
as the mean response, the marginal expectation of the response [Diggle et al.,
1994] and the average evolution [Verbeke and Molenberghs, 1997, 2000]. The
random effects model the variation among individuals relative to the average
trajectory. The difference between the bold and dashed lines in Figure 4.1
represent the random effects (Zi di ). In the figure there is both variation in the
initial values (intercepts) and the rates of change (slopes) among the subjects.
The variance of the random effects is also referred to as the between-subjects
variation. The final component is the residual error and is represented by
the difference between the symbols and the dashed lines. The variance of the
residual errors is also referred to as the within-subject variation.
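The decomposition in equation 4.1 can be illustrated with a small simulation (Python; all numeric values below are hypothetical, chosen only to show the three components):

```python
import random

random.seed(0)

def simulate_subject(times, beta=(70.0, -0.8),
                     sd_int=6.0, sd_slope=0.3, sd_err=4.0):
    """One subject's responses from Y = X*beta + Z*d + e: a fixed
    average trajectory, subject-specific random intercept and slope
    (between-subject variation), and residual error (within-subject
    variation)."""
    d0 = random.gauss(0, sd_int)    # subject's shift in intercept
    d1 = random.gauss(0, sd_slope)  # subject's shift in slope
    return [(beta[0] + d0) + (beta[1] + d1) * t + random.gauss(0, sd_err)
            for t in times]

y = simulate_subject([0, 4, 8, 17, 34])
print(y)
```

Averaging simulated trajectories over many subjects recovers the fixed-effects line (the bold line in Figure 4.1), while any single subject's dashed line is shifted by that subject's random effects.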
In the remainder of this chapter, two examples will be presented. The
first will be the renal cell carcinoma trial (Study 3). I will use this trial in
Sections 4.2-4.5 to present the development of a mixed-effects growth-curve
model. The second example will be the migraine prevention trial (Study 2)
and is presented in Section 4.6. I will use this trial to illustrate an alternative
model that will be useful for trials where patients have varied responses to a
treatment.
∗ Mixed-effect models are also referred to as linear mixed models, random effects models,
[Figure 4.1 axes: HRQoL Response vs. Time]
FIGURE 4.1 A simple mixed effects model. The solid line represents the
average response across all subjects, Xi β. The dashed lines represent the
trajectories of individual subjects, Xi β + Zi di . The stars and circles represent
the actual observed responses, Xi β + Zi di + eij .
carcinoma example. The higher order terms, such as quadratic (t2 ) and cubic
(t3 ) terms, allow the curves to depart from linearity.
[Figure 4.2 panels: Polynomial (left), Piecewise Linear (right); axes: HRQoL Response vs. Weeks]
FIGURE 4.2 Study 3: Growth curve models for the FACT-TOI score among
patients treated on the renal cell carcinoma trial (Study 3). (Control arm is
the solid line and the Experimental arm is the dashed line.)
The piecewise linear regression model avoids some of the previously mentioned
concerns of the polynomial models. The change in HRQoL is modeled as a
linear function over short intervals of time. Although we do not expect changes
in HRQoL to be strictly linear, it is reasonable to assume that changes are
approximately linear over short intervals of time. Figure 4.2 (right) illustrates
the use of a piecewise linear model for the renal cell carcinoma study. The
In this model, the higher order terms (t[c] ) allow the curves to depart from
linearity. But in contrast to the polynomial model, they model changes in the
slope at T [c]. Again, the maximum number of terms (including the intercept)
is equal to the number of assessments. Figure 4.3 illustrates a model with a
single change in the slope.† In this model, β0 is the intercept and β1 is the
initial slope, thus β0 + β1 t describes the initial trajectory. The slope is allowed
to change at 8 Weeks by adding a new variable (t[2] , T [2] = 8) indicating the
time beyond 8 Weeks. β2 is the change in the slope relative to β1 and the
sum of the two parameters, β1 + β2 , describes the rate of change after Week
8.
FIGURE 4.3 Study 3: A piecewise linear regression model with a change in
the slope at 8 Weeks. Yij(t) = β0 + β1 tij + β2 tij[2] + εij, where tij[2] = max(tij − 8, 0).
The intercept is defined by β0, the initial slope by β1 and the change in slope
after 8 Weeks by β2; thus the slope after Week 8 is β1 + β2.
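The mean response under this single-knot model can be computed directly; the change variable t[2] contributes only after the Week 8 knot (Python sketch with hypothetical coefficients):

```python
def piecewise_mean(t, b0, b1, b2, knot=8.0):
    """b0 + b1*t + b2*max(t - knot, 0): the slope is b1 before the
    knot and b1 + b2 after it."""
    return b0 + b1 * t + b2 * max(t - knot, 0.0)

# Hypothetical coefficients: intercept 80, initial slope -0.5,
# change in slope +0.4 (so the slope after Week 8 is -0.1).
print(piecewise_mean(4, 80.0, -0.5, 0.4))    # 78.0
print(piecewise_mean(12, 80.0, -0.5, 0.4))   # 75.6
```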
The selected points of change should correspond to the times when changes
in HRQoL might occur as the result of treatment or some other clinically rele-
vant process. In most trials, this will be the points in time where observations
are planned. In the renal cell carcinoma trial, we could consider terms that
allow the slope to change at 2, 8, 17 and 34 Weeks.
The first step is to specify a fully parameterized model for the mean. To
illustrate, consider the renal cell carcinoma trial (Study 4). We will start
with a piecewise linear model that allows changes in the slope at 2, 8, 17
and 34 Weeks and has a common baseline. To construct the piecewise linear
regression model, we create four new variables (tij[2], tij[8], tij[17] and tij[34], with tij[c] = max(tij − c, 0)) to model possible changes in the slope at 2, 8, 17 and 34
Weeks. For the two treatment groups:
Yhij = β0 + β1 tij + β3 tij[2] + β5 tij[8] + β7 tij[17] + β9 tij[34] + εhij ,   h = 1
Yhij = β0 + β2 tij + β4 tij[2] + β6 tij[8] + β8 tij[17] + β10 tij[34] + εhij ,   h = 2    (4.3)
The datafile RENAL3 has a unique record for each HRQoL assessment and
is created by merging RENAL1 and RENAL2 by the patient identifier, PatID.
TOI2 is the score for the FACT-TOI rescaled so that the possible range is 0
to 100, Weeks identifies the time of the assessment relative to randomization
and Trtment identifies the treatment arm.
where # is 2, 8, 17 or 34.
Then we fit the mixed effects model using the MIXED procedure. The first
part of the code for this model might appear as:
PROC MIXED DATA=WORK.RENAL3 COVTEST IC METHOD=ML;
The CLASS statement identifies two levels of treatment. The terms in the
MODEL statement create a design matrix corresponding to the model displayed
in equation 4.3. If we wished to relax the assumption of a common baseline,
Trtment is added to the MODEL statement along with the NOINT option.
COMPUTE Week2=MAX(Weeks-2,0).
COMPUTE Week8=MAX(Weeks-8,0).
COMPUTE Week17=MAX(Weeks-17,0).
COMPUTE Week34=MAX(Weeks-34,0).
EXECUTE.
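The knot-variable construction is easy to sanity-check outside SAS or SPSS. A minimal Python sketch (illustrative only; the variable names and example coefficients are hypothetical, not from the text):

```python
import numpy as np

# Planned assessment times (weeks) and the knots where the slope may change
weeks = np.array([0.0, 2.0, 8.0, 17.0, 34.0])
knots = [2, 8, 17, 34]

# t[c] = max(t - c, 0): one extra regressor per possible change in slope
basis = {f"Week{c}": np.maximum(weeks - c, 0.0) for c in knots}

# The slope over each interval is the cumulative sum of the slope coefficients
# (initial slope, change at 2, change at 8, change at 17)
coefs = np.array([-1.0, 0.6, 0.3, 0.05])  # hypothetical estimates
interval_slopes = np.cumsum(coefs)        # slopes on [0,2), [2,8), [8,17), [17,34)
```

Each `Week{c}` column equals zero up to its knot and then increases one unit per week, which is exactly what the COMPUTE statements above create.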
Then we fit the mixed effects model using the MIXED procedure. The first part
of the program for this model might appear as:
‡ An alternative function for mixed effects models is lmer from the lme4 library.
TABLE 4.1 Useful covariance structures for the random effects in a mixed-
effects model with two random effects.
Structure            # Parameters    G
Uncorrelated         2               [ ς1²  0 ; 0  ς2² ]
  SAS: TYPE=VC; SPSS: COVTYPE(VC)
A more typical model for longitudinal studies has two random effects. The
second random effect (di2 ) allows variation in the rate of change over time
among individuals. This model has a random intercept and random slope
for each individual (Yi = Xi β + di1 + di2 ti + ei ). In most examples, the two
random effects are allowed to be correlated
Zi = [1  ti] ,   G = Var(di1, di2) = [ ς1²  ς12 ; ς21  ς2² ] .   (4.8)
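The marginal covariance implied by this structure can be computed directly as Var(Yi) = Zi G Zi′ + σ²I. A small numerical sketch (the variance components below are made up for illustration):

```python
import numpy as np

t = np.array([0.0, 2.0, 8.0, 17.0])        # assessment times for one subject
Z = np.column_stack([np.ones_like(t), t])  # Zi = [1  ti]

G = np.array([[25.0, -1.5],                # Var(di1), Cov(di1, di2)
              [-1.5,  0.4]])               # Cov(di2, di1), Var(di2)
sigma2 = 16.0                              # residual variance (R = sigma^2 I)

V = Z @ G @ Z.T + sigma2 * np.eye(len(t))  # marginal covariance of Yi
```

Unlike a compound-symmetric structure, the diagonal of V grows with time (Var(Yij) = ς1² + 2t·ς12 + t²·ς2² + σ²), which is often realistic for HRQoL trajectories.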
In some trials, we might also expect variation among individuals' responses during the early, late or post-treatment phases. A third random effect (di3) allows us to incorporate this additional variation.
§ This term is often referred to as the random intercept, even though in this case the interpretation differs.
Simple (σ²I)             1    [R]jk = σ² if j = k, 0 otherwise
  SAS: TYPE=SIMPLE; SPSS: COVTYPE(ID)
Autoregressive           2    [R]jk = σ² ρ^|j−k|
(equal spacing)
  SAS: TYPE=AR(1); SPSS: COVTYPE(AR1)
Toeplitz                 4    [R]jk = σ² ρ|j−k| , with ρ0 = 1
(equal spacing)
  SAS: TYPE=TOEP; SPSS: COVTYPE(TP)
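These patterned matrices can be generated directly from their definitions, [R]jk = σ²ρ^|j−k| for AR(1) and [R]jk = σ²ρ|j−k| for Toeplitz. A quick Python sketch with hypothetical values of σ² and ρ:

```python
import numpy as np

def ar1_cov(sigma2, rho, n):
    """AR(1) for equally spaced assessments: [R]jk = sigma2 * rho**|j-k|."""
    idx = np.arange(n)
    return sigma2 * rho ** np.abs(idx[:, None] - idx[None, :])

def toeplitz_cov(sigma2, rhos):
    """Toeplitz: [R]jk = sigma2 * rho_|j-k|, with rho_0 = 1."""
    r = np.concatenate([[1.0], np.asarray(rhos, dtype=float)])
    idx = np.arange(len(r))
    return sigma2 * r[np.abs(idx[:, None] - idx[None, :])]

R_ar = ar1_cov(100.0, 0.5, 4)                 # 2 parameters: sigma2, rho
R_tp = toeplitz_cov(100.0, [0.5, 0.3, 0.1])   # 4 parameters: sigma2, rho1..rho3
```

The AR(1) correlations decay geometrically with lag, while the Toeplitz structure estimates a free correlation per lag.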
structure that does not have the desirable properties of the Cholesky decomposition.
I will drop that random effect. After fitting the random effects portion, I may
check for auto-correlation among the residual errors. In practice, once a good
random effects model is fit, there is rarely residual autocorrelation unless the assessments are very frequent (e.g., daily or weekly).
The second random effect is added to the RANDOM statement where Weeks defines Zhi2 = thi. TYPE=UN specifies an unstructured covariance of the random effects with 3 covariance parameters, UN(1,1), UN(1,2) and UN(2,2), which correspond to ς1², ς12 and ς2². Adding a third random effect follows the same pattern.
The residual error contribution to the covariance structure is defined by the
REPEATED statement. When this statement is omitted, a simple homogeneous
structure is assumed (R = σ 2 I) and displayed as Residual. The output for a
model with two random effects and uncorrelated homoscedastic residual errors
would have the following form:
Standard Z
Cov Parm Subject Estimate Error Value Pr Z
In this study, the almost perfect negative correlation of the random effects
tells an interesting story, suggesting that those patients who had the most
rapid decline in the first 8 weeks (second random effect) had the most rapid
improvement after 8 weeks (third random effect) and the individual curves
after 8 Weeks would be roughly parallel.
To test a model with autoregressive error structure for the residual vari-
ation, we would use the following procedure. Because the observations are
unequally spaced we specified this as TYPE=SP(POW)(Weeks) where Weeks is
the number of weeks since randomization. Note that Weeks has a unique value
for each HRQoL assessment within a particular patient.
* Autoregressive structure for unequal spacing of Observations *;
REPEATED /SUBJECT=PatID TYPE=SP(POW)(Weeks);
The labeling of the output is similar to SAS, using UN(1,1) UN(1,2) and
UN(2,2) to label the three parameters.
When the residual errors are assumed to be uncorrelated, the REPEATED=
subcommand is omitted. With equally spaced observations, the REPEATED=
subcommand can be used to test autocorrelation of the residual errors:
/REPEATED=FUNO | SUBJECT(PatID) COVTYPE(AR1)
The anova() function compares nested models. Results are similar to those
previously reported where the model with three random effects has the best
fit to the data.
Auto correlation of the residuals, where the power is proportional to the
time between observations, is added by specifying:
> PW.Var2AR=update(PW.Var2,correlation=corCAR1(form=~Weeks))
> anova(PW.Var2,PW.Var2AR)
As before, it was difficult to test autocorrelation for the model with three
random effects (results not shown), but clearly in the model with two random
effects there is no autocorrelation.
H0 : β1 = β2 , β3 = β4 , β5 = β6 , (4.10)
model. By contrasting the estimate of the AUC for each treatment group we
would be able to say which curve was generally higher or lower.
The equation to generate the estimated AUC can be obtained by integration using a single simple rule: ∫ab c tp ∂t = (c/(p+1)) t(p+1) |ab . So

AUC(T) = ∫0T Xβ̂ ∂t    (4.11)
       = ∫0T (β̂0 + β̂1 t + β̂3 t[2] + β̂5 t[8]) ∂t
       = β̂0 t |0T + (β̂1/2) t² |0T + (β̂3/2) (t[2])² |0T−2 + (β̂5/2) (t[8])² |0T−8
       = β̂0 T + (β̂1/2) T² + (β̂3/2) (T − 2)² + (β̂5/2) (T − 8)²
for the first treatment group. If we evaluate this over 6 months (26 Weeks),
then
AUC(26) = β̂0·26 + β̂1·(26²/2) + β̂3·(24²/2) + β̂5·(18²/2) ,   h = 1
        = β̂0·26 + β̂2·(26²/2) + β̂4·(24²/2) + β̂6·(18²/2) ,   h = 2
Note that this is a linear combination of the β̂s and can be easily estimated.
In our example, we obtain estimates of 1648 and 1536 for the standard and
experimental treatment groups, suggesting that the scores are higher on aver-
age over 26 Weeks in the standard treatment group (t173 = −2.18, p = 0.031).
These AUC estimates are not easily interpreted, but if we divide by T (or 26),
the scores of 63.4 and 59.1 can be interpreted as the average score over the
26 Week period and 4.3 points as the average difference between the curves
over the entire 26 Weeks.
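The AUC weights follow mechanically from the integration: the intercept gets weight T, and each slope term gets (T − c)²/2, where c is its knot (c = 0 for the initial slope). A quick check in Python (the β̂ values below are hypothetical, not estimates from the trial):

```python
T = 26.0
knots = [0.0, 2.0, 8.0]  # 0 marks the initial slope term

# weights multiplying (beta0, beta1, beta3, beta5) in AUC(T)
weights = [T] + [(T - c) ** 2 / 2.0 for c in knots]

beta_hat = [65.0, -1.2, 0.9, 0.25]  # hypothetical estimates for one arm
auc = sum(w * b for w, b in zip(weights, beta_hat))
avg_score = auc / T  # interpretable as the average score over the 26 weeks
```

Dividing every weight (or the final AUC) by T converts the area into the more interpretable average score over the period.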
Ŷ = β̂0 + β̂1 t + β̂3 t[2] + β̂5 t[8] + β̂9 t[34]
  = β̂0 + β̂1·2    t = 2    (4.12)
  = β̂0 + β̂1·8 + β̂3·6    t = 8    (4.13)
  = β̂0 + β̂1·26 + β̂3·24 + β̂5·18    t = 26    (4.14)
for group 1 and similarly for group 2. The estimates and their differences are
summarized in Table 4.3.
Rates of Change
When a piecewise linear model is used, we can generate estimates of the change in the outcome during different periods of time. Recall that the parameters of this model associated with t[2], etc., represent the changes in the slopes; thus the estimate of the slope for any specific interval is the sum of the parameters.
To compute AUC/26, the statements using the DIVISOR option would be:
/TEST=‘Std AUC/26’ Intercept 26 Trtment*Weeks 338 0
Trtment*Week2 288 0 Trtment*Week8 162 0 DIVISOR=26
/TEST=‘Exp AUC/26’ Intercept 26 Trtment*Weeks 0 338
Trtment*Week2 0 288 Trtment*Week8 0 162 DIVISOR=26
/TEST=‘Diff AUC/26’ Trtment*Weeks -338 338 Trtment*Week2 -288 288
Trtment*Week8 -162 162 DIVISOR=26.
> PW17.Var3=update(PW.Var3,fixed=TOI2~Weeks:TrtGrp+Week2:TrtGrp+
+ Week8:TrtGrp+Week34:TrtGrp)
> PW34.Var3=update(PW.Var3,fixed=TOI2~Weeks:TrtGrp+Week2:TrtGrp+
+ Week8:TrtGrp+Week17:TrtGrp)
> PW_b.Var3=update(PW.Var3,fixed=TOI2~Weeks:TrtGrp+Week2:TrtGrp+
+ Week8:TrtGrp)
> anova(PW.Var3,PW17.Var3,PW_b.Var3)
Model df AIC BIC logLik Test L.Ratio p-value
PW.Var3 1 18 5177.568 5258.125 -2570.784
PW17.Var3 2 16 5173.820 5245.427 -2570.910 1 vs 2 0.2524674 0.8814
PW_b.Var3 3 14 5171.300 5233.956 -2571.650 2 vs 3 1.4802727 0.4770
Then we define C. The following estimates the means at 8 weeks and the
AUC over 26 Weeks for the two treatment groups and the difference:
> C1=c(1, 8, 0, 6, 0, 0, 0) # Cntl 8 weeks
> C2=c(1, 0, 8, 0, 6, 0, 0) # Exp 8 weeks
> C3=c(0, -8, 8, -6, 6, 0, 0) # Diff 8 weeks
> C4=c(26, 338, 0, 288, 0, 162, 0) # Cntl AUC
> C5=c(26, 0, 338, 0, 288, 0, 162) # Exp AUC
> C6=c(0, -338, 338, -288, 288, -162, 162) # Diff AUC
> C7=C6/26 # Diff Avg
> C=rbind(C1,C2,C3,C4,C5,C6,C7)
> rownames(C)=c("Cntl 8wk","Exp 8wk","Diff 8wk",
+ "Cntl AUC","Exp AUC","Diff AUC","Diff Avg")
> Theta=cbind(est.theta,t(se.theta),zval.theta,pval.theta)
> dimnames(Theta)[[2]] =c("Theta","SE","z","p-val")
> Theta
Theta SE z p-val
Cntl 8wk 61.890487 1.941978 31.869823 0.00000000
Exp 8wk 55.829349 2.019748 27.641733 0.00000000
Diff 8wk -6.061139 2.657614 -2.280669 0.02256802
Cntl AUC 1649.391580 38.988330 42.304751 0.00000000
Exp AUC 1538.206442 40.711263 37.783314 0.00000000
Diff AUC -111.185138 51.003568 -2.179948 0.02926130
Diff Avg -4.276351 1.961676 -2.179948 0.02926130
I have written a function to facilitate this (see Appendix R). The syntax is:
> estCbeta(C,PW_b.Var3$coef$fix,PW_b.Var3$varFix)
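The computation inside such a function is just linear algebra on the fixed effects: estimates Cβ̂, standard errors from the diagonal of C V C′, and Wald z-tests. A self-contained Python sketch (the function name and toy numbers are hypothetical, not the estCbeta from Appendix R):

```python
import math
import numpy as np

def est_c_beta(C, beta, var_beta):
    """Estimate C*beta with standard errors and two-sided Wald z-tests."""
    C = np.atleast_2d(np.asarray(C, dtype=float))
    est = C @ beta
    se = np.sqrt(np.diag(C @ var_beta @ C.T))
    z = est / se
    # two-sided normal p-value: 2*(1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])
    return np.column_stack([est, se, z, p])

# toy example: two fixed effects with a known covariance matrix
beta = np.array([2.0, 1.0])
V = np.array([[0.25, 0.05],
              [0.05, 0.16]])
C = np.array([[1.0,  0.0],   # first coefficient alone
              [1.0, -1.0]])  # difference between the two
out = est_c_beta(C, beta, V)
```

Note that the standard error of a difference involves the covariance term: Var(β̂1 − β̂2) = V11 + V22 − 2V12.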
ε∗ij(t) = di1 + di2 tij + di3 tij[2] + εij ,   (4.19)
The thought behind adding the third random effect was that there would be variation among subjects in the initial change in the outcome during the titration period, but that the change would attenuate during the maintenance period. Examination of the correlation of the random effects confirmed that
expectation; the second and third random effects were almost perfectly neg-
atively correlated (ρ < −0.9)∗∗ suggesting that there was variation in the
change between baseline and the first follow-up, but not across the three
follow-up assessments.
Structure 2
This led to the consideration of an alternative parameterization that allowed
between subject variation of the initial assessments and of the follow-up as-
sessments (equation 4.20).
∗∗ Estimation procedures for two of the scales (MSQ-RR and MSQ-EF) resulted in non-positive definite covariance structures.
[Figure: panels for MSQ Role Restriction, MSQ Role Prevention and MSQ Emotional Functioning, plotted against months post randomization.]
where xBaseij is an indicator variable for the baseline assessments and xFUij for the follow-up assessments (xFUij = 1 − xBaseij). BIC statistics were smaller across all four measures for this second model when compared to both the model with three random effects and an unstructured repeated measures covariance structure. The correlation between these two random effects ranged from 0.5 to 0.6, indicating that those individuals with higher initial scores tended to have higher follow-up scores.
4.7 Summary
• Time-driven designs are associated with mixed effects growth curve
models.
- Either polynomial or piecewise linear models can be used to model
the average trajectory over time.
- One to three random effects typically explain the between-subject
variation.
- The residual errors (within-subject variation) are typically uncor-
related unless observations are very closely spaced.
• The recommended growth-curve model building process starts by defin-
ing a fully parameterized model for the means (Xi β), then identifying
the structure of Σi and finally simplifying the mean structure.
• Piecewise linear models can also be used for continuous covariates where the relationship with the outcome may not be linear over the entire range of covariate values.
5.1 Introduction
In clinical trials, the primary research question is generally the effect of treat-
ment on outcomes. Many researchers [Baron and Kenny, 1986, Holmbeck,
1997, Kraemer, 2002, Holmbeck, 2002, Donaldson et al., 2009] argue there
is much more that can be learned and understood about for whom and how
treatments are effective. This information can improve the next generation
of studies. We often pose questions of mediation or moderation in clinical
research, though we do not formalize objectives and outcomes in terms of these
concepts. These same researchers have argued for clear frameworks for these
concepts [Baron and Kenny, 1986, Holmbeck, 1997, Kraemer, 2002]. In this
chapter, I will present these frameworks and illustrate with several examples.
A moderator or effect modifier (B) is a variable that specifies conditions under which a given predictor (X) is related to an outcome (Y) [Holmbeck, 2002].
This is illustrated in Figure 5.1. Moderators are typically patient characteris-
tics (age, education, stage of disease). We might expect that age would affect
the relationship between disability and QOL. Time and treatment can also be
considered moderators. For example, treatment may affect the relationship
between time and QOL. Tests of moderation usually appear as interactions in
models. A moderator should either be an unchangeable characteristic or be a
condition that is measured prior to the intervention and thus not correlated
with the predictor [Kraemer, 2002]. The first half of this chapter will illustrate
tests of moderation in studies with both simple pre-post designs and more
extended longitudinal studies.
A mediation model seeks to identify or confirm the mechanism that un-
derlies an observed relationship between a predictor (e.g., treatment) and an
outcome (e.g., sleep disturbance) via the inclusion of a third explanatory vari-
able (e.g., pain), known as a mediator. Rather than hypothesizing a direct
causal relationship between the predictor (X) and the outcome (Y), a medi-
ation model hypothesizes that predictor affects the mediator, which in turn
affects the outcome. The mediator variable, therefore, serves to clarify the
nature of the relationship between the predictor and the outcome. This is
illustrated in Figure 5.1. In contrast to a moderator, a mediator will be cor-
related with the predictor [Kraemer, 2002]. For example, an intervention to
© 2010 by Taylor and Francis Group, LLC
FIGURE 5.1 Example of simple moderation; the moderator (B) affects the relationship between X and Y. [Diagram: B pointing to the arrow from X to Y.]
improve sleep quality may do so directly but also indirectly by its reduction
of pain [Russell et al., 2009], or anemia may mediate the relationship between
treatment and fatigue. Mediation can be demonstrated using regression tech-
niques including structural equation models (SEM). The medical literature
abounds with examples in which investigators speculate about the impact of
various interventions on quality of life. Some even demonstrate the interven-
tion has a positive impact on both clinical outcomes and measures of HRQoL,
but the formal demonstration of mediation is quite rare. The second half of
this chapter will illustrate methods demonstrating mediation using regression
techniques.
5.2 Moderation
Before we jump straight into moderation, let’s consider the general issue of
adding explanatory variables (covariates) to models. My experience is that
this is often done without careful thought about either the interpretation
of the added parameters or the impact on other parameters. For example,
consider the breast cancer trial (Study 1) and the explanatory variable age.
There are numerous clinically interesting questions that can be asked:
1. Does the covariate have an association with the outcome irrespective of time or treatment?
2. Does the association of the covariate with the outcome vary only with time? or Does the covariate modify the impact of time on the outcome (i.e., a covariate by time interaction)?
3. Does the covariate modify the impact of treatment over time on the outcome (i.e., a covariate by treatment by time interaction)?
These are obviously unique questions and require different treatment of co-
variates in the model.
yhij = β0 + β1 T2 + β2 T2∗Tx + β3 T3 + β4 T3∗Tx    (5.1)
+ εhij ,
Continuous Covariates
If we add age (xi ) without interactions to the model, we are addressing the first
question, “Does age have an association with the outcome, BCQ, irrespective
of time or treatment?” In the model parameterized as specified in equation 5.2,
α0 measures the association of age (xi ) with BCQ (yhij ) and is interpreted as
the change in the BCQ score for every unit change in age.
yhij = β0 + β1 T2 + β2 T2∗Tx + β3 T3 + β4 T3∗Tx    (5.2)
+ α0 xi
+ εhij .
yhij = β0 + β1 T2 + β2 T2∗Tx + β3 T3 + β4 T3∗Tx    (5.3)
+ α0 xi + α1 xi∗T2 + α2 xi∗T3
+ εhij .
yhij = β0 + β1 T2 + β2 T2∗Tx + β3 T3 + β4 T3∗Tx    (5.5)
+ α0 xi + α1 xi∗T2 + α2 xi∗T3 + α3 xi∗T2∗Tx + α4 xi∗T3∗Tx
+ εhij ,

or, with age centered at 50 (x50i = xi − 50),

yhij = β0 + β1 T2 + β2 T2∗Tx + β3 T3 + β4 T3∗Tx
+ α0 x50i + α1 x50i∗T2 + α2 x50i∗T3 + α3 x50i∗T2∗Tx + α4 x50i∗T3∗Tx
+ εhij .
Categorical Covariates
The strategy to test the impact of categorical covariates parallels that of
continuous covariates. When the categorical covariates are dichotomous, the
concept of “centering” is still appropriate. For example, the extent of disease
in breast cancer is often indicated by the number of positive nodes, with 4 or
more positive nodes indicating a higher risk. In this study 42.5% of the women
had four or more positive nodes. We could either center the covariate so that
the estimates of the other parameters reflected this proportion of subjects with
4+ positive nodes or a 50/50 split in the risk factor. Categorical variables that
are not dichotomous are more difficult to handle. Race is a good example.
In many clinical trials, the majority of subjects identify themselves as white
non-Hispanic. In this case, identifying this group as a reference group and
testing for interactions with indicators of specific ethnic/racial groups may be
an appropriate strategy. However, centering indicator variables for specific
subgroups results in estimates of treatment effects that can be interpreted as
the average effect for the entire sample.
If we have chosen an analysis of the change from baseline, we cannot answer the first question posed at the beginning of Section 5.2. We can address
the second question with the following model. Note that the term involving
α0 has been dropped. α1 and α2 have the same interpretation as before and
the estimates are similar.
We can also address the third question with the following model. α3 and α4
have the same interpretation as before and the estimates are similar.
+ α1 xi ∗ T2 + α2 xi ∗ T3 + α3 xi ∗ T2 ∗ T x + α4 xi ∗ T3 ∗ T x
+ εhij .
While this looks like an ANOVA model with main effects and interactions, it
does NOT have the same interpretation!
In contrast, if we use centered variables: Tx is a centered indicator of treat-
ment and AGE50 is a centered age variable, our model is:
yhij = β0 + β1 T2 + β2 T2∗Tx    (5.10)
+ α0 AGE50 + α1 AGE50∗T2 + α3 AGE50∗T2∗Tx
+ εhij .
This model has the traditional interpretation of main effects and interac-
tions. If misinterpreted, results from the two models appear very different
(Table 5.3). In this example, if we take the traditional interpretation, we
would conclude that there is a significant treatment effect during therapy
in the model with centered covariates (H0 : T2 ∗ T x = 0, p < 0.001), but
would make the opposite conclusion with the model with uncentered covari-
ates (H0 : T2 ∗ Exp = 0,p = 0.17). Similarly, we might conclude that age
modifies the effect of treatment (irrespective of treatment arm) in the centered
model (H0 : T2 ∗ Age50 = 0,p = 0.005) but make the opposite conclusion in
the uncentered model (H0 : T2 ∗ Age = 0,p = 0.11).
To further illustrate the problem, let us first consider a modifier such as
gender that is a categorical (dichotomous) variable. If there is an interaction
between the modifier (M) and another explanatory variable (X) such as treat-
ment, the relationship might appear as in Figure 5.4 (left side). The two levels
of the modifier (M) are labeled A and B; the two levels of the other explana-
tory variable (X) are labeled EXP and INTL. If we parameterize the model such
that the difference between the two treatment groups is estimated considering
either A or B to be a reference group we obtain very different estimates, 0.2
and 1.0 respectively. If we center the covariate, the estimates provide a more
realistic estimate of the difference. To center a dichotomous variable, we first
take a numeric variable that is coded such that the difference between the
two groups is 1. Typically the coding is 0 and 1, or 1 and 2. If we then subtract the mean of those values, we would convert them to −0.5 and 0.5 if 50% were in each group, or to −0.75 and 0.25 if 75% were in group B. The centered estimates of the treatment difference in Figure 5.4 are then 0.6 (with a 50/50 split) and 0.8 (with 75% in group B); these estimates accurately reflect the treatment difference averaged over all the subjects in the study.
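This averaging behavior can be verified numerically. A Python sketch (all numbers hypothetical, chosen to match the 0.2 and 1.0 subgroup effects discussed above): with reference-group (0/1) coding the treatment coefficient equals the effect in the reference subgroup, while with centered coding it equals the sample-weighted average effect.

```python
import numpy as np

# 200 subjects: 50 in subgroup A (m=0), 150 in subgroup B (m=1),
# each subgroup split 50/50 between control (x=0) and treatment (x=1)
m = np.r_[np.zeros(50), np.ones(150)]
x = np.r_[np.tile([0.0, 1.0], 25), np.tile([0.0, 1.0], 75)]

# noise-free outcome: treatment effect 0.2 in A, 1.0 in B
y = np.where(m == 0, 0.0, 0.3) + np.where(m == 0, 0.2, 1.0) * x

def x_coef(mod):
    """Coefficient of x in the model y ~ 1 + x + mod + x:mod."""
    X = np.column_stack([np.ones_like(y), x, mod, x * mod])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

b_ref = x_coef(m)             # reference-group coding -> effect in subgroup A
b_cen = x_coef(m - m.mean())  # centered (-0.75/0.25)  -> weighted average effect
```

With 75% of subjects in group B, the centered coefficient is 0.25·0.2 + 0.75·1.0 = 0.8, the sample-weighted average of the two subgroup effects.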
Note that there may be settings where the analyst chooses to establish a
reference group and not center a particular covariate. This is appropriate if
the meaning/interpretation of the parameters and associated tests is explicit and clearly communicated.
5.3 Mediation
We may speculate about how changes in disease status or symptoms affect HRQoL; however, it is much more satisfying to test hypotheses about the mechanism. We start with a conceptual model that has a clinical/theoretical basis. We cannot prove that the model is true, but we can test whether the observed data are consistent with the model. A generic single-mediator model is displayed
in Figure 5.5 with two paths. The first is the indirect or mediated path (ab)
and the second is the direct or unmediated path (c).
[Figure 5.5 diagram: X → M (path a), M → Y (path b), and X → Y (path c).]
FIGURE 5.5 Simple mediation with an indirect (ab) and direct (c) effect of X on Y. The symbol a represents the relationship between X and M, b the relationship between M and Y, and c the relationship between X and Y adjusting for M.
Baron and Kenny [1986] proposed a two-staged approach with four conditions to demonstrate mediation of the effect of X on Y by M:
1. The predictor (X) is associated with the mediator (M).
2. The mediator (M) is associated with the outcome (Y).
3. The predictor (X) is associated with the outcome (Y).
4. The association of the predictor (X) with the outcome (Y) is reduced when controlling for the mediator (M).
One can assess these conditions using the coefficients of the following three regression equations:
Y = i1 + τX + e1    (5.11)
M = i2 + aX + e2    (5.12)
Y = i3 + cX + bM + e3    (5.13)
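These three regressions can be fit with ordinary least squares, and the familiar identity τ̂ − ĉ = âb̂ then holds exactly. A simulated Python sketch (the data-generating values 0.5, 0.3 and 0.6 are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=n)
M = 0.5 * X + rng.normal(size=n)            # mediator depends on predictor (a)
Y = 0.3 * X + 0.6 * M + rng.normal(size=n)  # outcome: direct (c) plus mediated (b)

def coefs(y, *cols):
    """OLS coefficients of y on an intercept plus the given columns."""
    Xd = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(Xd, y, rcond=None)[0]

tau = coefs(Y, X)[1]         # total effect, eq. 5.11
a = coefs(M, X)[1]           # eq. 5.12
_, c, b = coefs(Y, X, M)     # eq. 5.13

prop_mediated = a * b / tau  # proportion of the total effect that is mediated
```

The identity τ̂ = ĉ + âb̂ is the omitted-variable decomposition for linear models; for the standard error of âb̂ in small samples, a bootstrap is preferable, as noted below.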
For small studies (n < 50) or those with multiple mediators, this approxi-
mation becomes less accurate and a bootstrap analysis is recommended when
precision is important.
There is some controversy [MacKinnon et al., 2007] about the necessity of
the third condition in settings where there is no direct effect of the predictor
on the outcome (c = 0). The requirement for the third condition affects the
power to detect mediation using this approach.
product of the square root of the frequency times the average severity. If the
frequency is 0 (and the severity is missing), a value of 0 is assigned. To illus-
trate the methods, we will examine the change from baseline to the average
of the available post-baseline assessments.
Closely related to the regression approach, we can examine simple and par-
tial correlations for evidence of mediation as illustrated in Table 5.4. We note
that all four conditions are satisfied: the mediator (MigScore) is correlated
with the predictor (Trtment) (Condition 1), MigScore is correlated with the
outcome (MSQ EF) (Condition 2), Trtment is correlated with MSQ EF (Condi-
tion 3), and the correlation of Trtment with MSQ EF is reduced from 0.127 to
0.067 when controlling for the mediator (Condition 4). Note that examina-
tion of the correlations supports the mediation model, but does not directly
allow us to quantitate the proportion of the change in the outcome that can
be attributed to the direct and mediated effects.
The regression approach for a single mediator begins with estimating the
parameters a, b, c, and τ (equations 5.11-5.13). The results are summarized
in Table 5.5. Again, we note that the four conditions are satisfied: we reject
the hypotheses a = 0, b = 0 and τ = 0 (Conditions 1-3) and note that c < τ
(Condition 4). The latter condition is more formally assessed by testing ab = 0, since τ − c = ab. Finally, we note that 50% of the effect of treatment on the
change in MSQ EF is mediated by the change in the frequency and severity of
migraines as measured by MigScore.
This regression approach can be extended to multiple mediators. For the
purposes of illustration, let's add measures of migraine frequency (MigAtt) and severity (MigSev) as mediators. The regression models and parameter estimates are summarized in Table 5.6. We note that the four conditions are satisfied for the measure of frequency (MigAtt) and its interaction
with severity (MigScore), but not for severity alone (MigSev): we reject the
hypotheses a1 = 0, a3 = 0, b1 = 0, b3 = 0 and τ = 0 (Conditions 1-3) and note
that c < τ (Condition 4).
At this point, several comments are appropriate. First is the issue of causal
inference. So far, all we have demonstrated is that the data are consistent
with the mediation model. Making a causal inference requires additional
conditions including a logical temporal sequence. Second is the interpretation of the indirect versus direct effects. While we label one of the paths as the direct effect, it probably contains two components: the direct effect and the unexplained effect. The latter may be either effects of the predictor on the outcome that are mediated by other factors or residual variation because the measure of the mediator(s) is imperfect.
where tij is the number of weeks since the beginning of therapy and uij is the number of weeks since the end of therapy (0 if still on-therapy). Thus β̂1 tij is the estimated change over time while on-therapy, and β̂1 tij + β̂2 uij is the estimated change over time after the end of therapy. The between and within subject variation are Var(di) and Var(εij) respectively. di can be thought of as the average distance between the predicted trajectory for all the subjects (β0 + β1 tij + β2 uij) and the ith subject's interference score. εij is the week-to-week variation and measurement error in the interference scores across the J measurements. The question of interest is how much of the change over time and the variation is explained by the symptom severity. To answer this question, we add the symptom severity scores (sijk) to the model, where k indicates the kth symptom.
Yij = β0 + β1 tij + β2 uij + Σ(k=1..K) β(2+k) sijk + di + εij    (5.16)
The results are summarized in Table 5.9. Virtually all (> 90%) of the esti-
mated change over time is explained by all fifteen symptoms. Similarly, most
of the between subject variation is also explained. Thus, subjects who re-
port more interference than average, are reporting more symptom severity.
However, only about half of the week-to-week variation is accounted for by
symptom severity. This is not totally unexpected because the residual er-
ror contains not only the week-to-week variation in interference but also the
measurement error.
points on the symptom scale, noting that the choice of these points is arbi-
trary other than dividing the symptom scale into 4 roughly equal parts. The
regression equation for a model with a single symptom would appear as:
Yij = β0 + β1 tij + β2 uij + β3 sij + β4 sij[2] + β5 sij[5] + β6 sij[7] + di + εij .   (5.17)
The estimates of the change in interference for each point change in the symptom score would be β̂3 over the range of 0-2, β̂3 + β̂4 over 3-5, β̂3 + β̂4 + β̂5 over 5-7, and β̂3 + β̂4 + β̂5 + β̂6 over 7-10. While we would not expect
the function to have abrupt changes at these points, we can approximate
the shape of the change. Table 5.10 summarizes the results for a few of the
symptoms. For these selected symptoms, two patterns emerge. Changes in
interference are greater in the upper ranges of fatigue, pain and shortness of
breath. One interpretation is that patients may be better able to accommo-
date increases in these symptoms in the lower range of symptoms severity.
In contrast, for distress and sadness the change is greater even in the very low range of severity, though a plateau occurs around scores of 7. This might suggest that, for these symptoms, scores in the lower range are more important than traditionally indicated as meriting intervention.
5.5 Summary
• Moderators (effect modifiers) are variables that specify conditions under
which a given predictor (typically treatment) is related to an outcome.
They are generally incorporated into regression models in the form of
interactions. The predictor affects the outcome differently depending on
the level of the moderator.
• Mediation occurs when predictors (treatment, disease status, time) in-
fluence the mediator which, in turn, influences the outcome.
6.1 Introduction
I will use three examples in this chapter. The lung cancer trial (Study 3)
illustrates a trial in which the missing data can be characterized for each of the
four assessments and is used throughout the chapter. The renal cell carcinoma trial (Study 4) illustrates how the methods may be adapted when the timing of assessments becomes more varied. The final example, the migraine prevention
trial (Study 2) illustrates different dropout patterns across the two treatment
arms.
6.1.1 Terminology
There is some controversy as to whether an assessment is missing if the assessment was not expected as defined in the protocol, either because the patient had died or had been removed from the trial. In this book, I will use the term missing to include both attrition and non-compliance, where attrition refers to death or termination of follow-up as specified in the protocol, and non-compliance refers to assessments that were expected according to the guidelines of the protocol but were not obtained. Non-compliance reflects deviations from the planned data collection. The combination of non-compliance and attrition impacts the analysis and interpretation of the results. The term dropout will refer to the discontinuation of the assessments, either as a result of attrition or non-compliance.
The seriousness of the problem depends on the reasons for the missing data,
the objectives of the study and the intended use of the results. Results from
trials influencing regulatory issues of drug approval or health policy decisions
will require more stringent criteria than those used to design future studies.
The challenge for the analyst is to provide either a convincing argument that
conclusions drawn from the analysis are insensitive to the missing data or
clear limits for interpretation of the results. The concepts and tools to do
that are the focus of this and the following chapters.
A common question is whether missing data can be ignored if there are similar
proportions of missing data across all study arms for the same reason. The
answer is often “No.” If the goals of the trial are to estimate rates of change
in the outcome within groups, the resulting estimates are often dependent on
the missing data. The sensitivity of within group estimates to dropout will
be illustrated in later chapters. Comparisons between treatment arms are less
sensitive to missing data. However, there is always the possibility that this is
not a safe assumption and sensitivity analyses, as described in later chapters,
are always advisable.
6.1.4 Prevention
Although analytic strategies exist for trials with missing data, they depend
on untestable assumptions so their use is much less satisfactory than initial
prevention. Some missing data, such as that due to death, is not preventable.
However, both primary and secondary prevention are desirable. In terms of
primary prevention, missing data should be minimized at both the design and
implementation stages of a clinical trial [Fairclough and Cella, 1996a, Young
and Maher, 1999]. Some of these strategies are discussed in Chapter 2.
Secondary prevention consists of gathering information that is useful in the
analysis and interpretation of the results. This includes collection of auxiliary
data that are strongly correlated with the value of the missing HRQoL mea-
sures and information about factors that may contribute to the occurrence of
missing assessments. Strategies for secondary prevention may include gather-
ing concurrent auxiliary data on toxicity, evaluations of health status by the
clinical staff or HRQoL assessments from a caretaker. Use of this type of data
is discussed in later chapters.
Prospective documentation of the reasons for missing data is useful. The
classifications that are used should be specified in a manner that helps the
analyst decide whether the reason is related to the individual’s HRQoL. For
example, “Patient refusal” does not clarify this, but reasons such as “Patient
refusal due to health” and “Patient refusal unrelated to health” will help
differentiate missing data that is ignorable from non-ignorable.
However, while these patterns are important, the reasons for the missing
data are also very relevant. As previously described (see Table 1.7), death
and patient refusal as well as staff oversight contributed to the missing data.
The design of this trial specified that assessments were to continue until 6
months, regardless of whether the patient was still receiving the treatment.
Death accounted for roughly half of the missing assessments; however, the
remaining missing cases are a mixture of those unrelated to the outcome
(e.g. administrative), related to the outcome (e.g. health of patient) and for
unknown reasons.
There are three major classes of missing data, differing by whether and how
missingness is related to the outcome (e.g. the subject's quality of life) [Little
and Rubin, 1987]. For example, if HRQoL data are missing because the pa-
tient moved out of town or the staff forgot to administer the assessment, the
missingness is unrelated to the HRQoL outcome and those assessments are
Missing Completely at Random (MCAR). At the other extreme, if data are
missing because of a positive or negative response to treatment or progres-
sion of disease (e.g. increased/decreased symptoms, progressive disease and
death), the missingness is likely to be related to the HRQoL outcome and the
data are Missing Not at Random (MNAR). Intermediate cases are referred
to as Missing at Random (MAR) where missingness depends on the observed
data, generally the most recently observed HRQoL. Table 6.4 presents an
overview. The term ignorable missing data is often equated with MAR and
MCAR and the term nonignorable missing data with MNAR. While they are
related, the terms are not strictly interchangeable. The distinction as well
as formal statistical definitions are presented in Rubin [1987, Chapter 2] and
Verbeke and Molenberghs [2000, pages 217-218].
The remainder of this chapter presents the general concepts of these three
mechanisms in more detail and suggests methods for distinguishing among
them. The subsequent chapters describe methods of analysis that can be
used under the various assumptions.
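The three mechanisms can be made concrete with a small simulation. The sketch below (hypothetical scores and cut-offs, not data from any study in this book) generates correlated baseline and follow-up scores and deletes the follow-up score under each mechanism; only under MCAR does the mean of the observed follow-up scores remain unbiased.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10000
# correlated baseline (y1) and follow-up (y2) scores, sd 10 around 50
y1 = rng.normal(50, 10, n)
y2 = 50 + 0.7 * (y1 - 50) + rng.normal(0, 10 * np.sqrt(1 - 0.7 ** 2), n)

# MCAR: missingness unrelated to any score
r_mcar = rng.random(n) < 0.3
# MAR: missingness depends on the *observed* baseline score
r_mar = rng.random(n) < 1 / (1 + np.exp((y1 - 45) / 5))
# MNAR: missingness depends on the *unobserved* follow-up score itself
r_mnar = rng.random(n) < 1 / (1 + np.exp((y2 - 45) / 5))

mean_full = y2.mean()              # target of estimation
mean_mcar = y2[~r_mcar].mean()     # approximately unbiased
mean_mar = y2[~r_mar].mean()       # biased upward: low scorers drop out
mean_mnar = y2[~r_mnar].mean()     # biased upward, even after conditioning on y1
```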
6.3.2 Notation
Consider a longitudinal study where ni assessments of HRQoL are planned for
each subject over the course of the study. As previously defined, Yij indicates
the j th observation of HRQoL on the ith individual.
Yi is the complete data vector of ni planned observations of the outcome
for the ith individual, which includes both the observed (Yiobs)
and missing (Yimis) observations of HRQoL.
Ri is a vector of indicators of the missing data pattern for the ith
individual where Rij = 1 if Yij is observed and Rij = 0 if Yij is
missing.
Note that the term complete data is defined as the set of responses that one
would have observed if all subjects completed all possible assessments. This is
in contrast to the term complete cases which is defined as the set of responses
on only those subjects who completed all possible assessments (Rij = 1 for all
possible Yij from the ith subject). In some of the following chapters, we will
differentiate data from these complete cases (YiC ) and data from incomplete
cases (YiI ). Table 6.5 is a summary of terms.
Example
If a study were planned to have four HRQoL assessments at 0, 4, 13 and 26
weeks and the HRQoL of the ith subject were missing at 4 and 26 weeks, then
that subject’s data might look like:
    Yi = (78, NA, 58, NA)′   and   Ri = (1, 0, 1, 0)′,
so that
    Yiobs = (78, 58)′   and   Yimis = (NA, NA)′.
Even though the dependent variable (Yi ) is missing at some time points, Xi is
fully observed if one assumes that the time of the observation is the planned
time of the second and fourth HRQoL assessments.
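The notation of this example can be sketched in a few lines; the values are those used above, and the array names are only illustrative.

```python
import numpy as np

# planned assessments at 0, 4, 13 and 26 weeks; 4- and 26-week values missing
Y_i = np.array([78.0, np.nan, 58.0, np.nan])   # complete data vector (NaN = NA)
R_i = (~np.isnan(Y_i)).astype(int)             # missingness indicators R_ij

Y_obs = Y_i[R_i == 1]   # observed component, Yiobs
Y_mis = Y_i[R_i == 0]   # missing component, Yimis (values unknown in practice)
```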
Graphical Approaches
Two examples of a graphical approach are displayed in Figures 6.1- 6.2 for
the lung cancer trial. Figure 6.1 displays the average observed FACT-Lung
TOI in groups of patients defined by their pattern of missing data. Because
the total number of patterns is large and the number of subjects with inter-
mittent or mixed patterns is small (Table 6.3), this plot was simplified by
grouping subjects by the time of their last HRQoL assessment. The result-
ing figure suggests two things: subjects who dropped out earlier had poorer
scores at baseline and scores were lower at the time of the assessment just prior
to dropout (the last observation). Since missingness depends on previously
observed FACT-Lung TOI scores, the data are not MCAR.
Figure 6.2 is a modification of one suggested by Hogan et al. [2004a]. It
contrasts the scores just prior to dropout of subjects who drop out (CI without
bars at end) with those from subjects with continued follow-up (CI with bars
at end); the overall mean is also displayed (* symbol). Subjects clearly have lower
scores just prior to dropout.
Formal Tests
Little [1988] proposed a single test statistic for testing the assumption of
FIGURE 6.1 Study 3: Average FACT-Lung TOI scores for control and
experimental arms stratified by time of last assessment.
MCAR vs. MAR. The basic idea is that if the data are MCAR, the means
of the observed data should be the same for each pattern of missing data. If
the data are not MCAR then the means will vary across the patterns (as is
observed in Figure 6.1). This test statistic is particularly useful when there are
a large number of comparisons either as a result of a large number of patterns
or differences in missing data patterns across multiple outcome variables.
Consider a study designed to obtain J measurements. Let P be the number
of distinct missing data patterns (Ri), where J{p} is the number of observed
assessments in the pth pattern and n{p} is the number of cases with the pth
pattern, Σp n{p} = N. Let M{p} be a J{p} × J matrix of indicators of the
observed variables in pattern p. The matrix has one row for each measure
present, consisting of 0s and 1s identifying the assessments with non-missing
data. For example, in the NSCLC example, if the first and third observations
were obtained in the 6th pattern, then

    M{6} = ⎡ 1 0 0 0 ⎤
           ⎣ 0 0 1 0 ⎦
Ȳ{p} is the J{p} × 1 vector of observed means for pattern p. μ̂ and Σ̂ are the
ML estimates of the mean and covariance of the pooled data assuming that
the missing data mechanism is ignorable. We then multiply these by M{p} to
select the appropriate rows and columns for the pth pattern. Thus, μ̂{p} =
M{p} μ̂ is the J{p} × 1 vector of ML estimates corresponding to the pth pattern
FIGURE 6.2 Study 3: Average FACT-Lung TOI scores for control and
experimental arms. Overall mean is indicated by * symbol. Mean and 95%
confidence interval (CI) indicated for those who drop out (no bar at end of
CI) and do not drop out (bar at end of CI) at the next assessment.
and Σ̃{p} = (N/(N − 1)) M{p} Σ̂ M{p}′ is the corresponding J{p} × J{p} covariance
matrix with a correction for degrees of freedom. Little's [1988] proposed test
statistic,∗ when Σ is unknown, takes the form:

            P
    χ2 =    Σ   n{p} (Ȳ{p} − μ̂{p})′ (Σ̃{p})−1 (Ȳ{p} − μ̂{p})        (6.1)
           p=1
Little shows that this test statistic is asymptotically χ2 distributed with
Σp J{p} − J degrees of freedom. In the lung cancer trial, there is consid-
erable evidence for rejecting the hypothesis of MCAR: χ2 = 61.7 with 23 d.f.,
p < 0.001 in the control group and χ2 = 76.8 with 23 d.f., p < 0.001 in the
experimental group.
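A sketch of the statistic in equation 6.1 on simulated MCAR data follows. For brevity the pooled mean and covariance here are taken from the complete cases rather than from the EM-based ML fit that the test formally requires; packaged implementations (such as the SPSS MVA command noted below) should be preferred in practice.

```python
import numpy as np

rng = np.random.default_rng(1)
N, J = 300, 3
cov = 0.5 * np.ones((J, J)) + 0.5 * np.eye(J)      # exchangeable correlation 0.5
Y = rng.multivariate_normal(np.zeros(J), cov, size=N)
R = np.ones((N, J), dtype=bool)
R[rng.random(N) < 0.3, J - 1] = False              # MCAR dropout at the last visit
Y[~R] = np.nan

# pooled estimates: complete cases give consistent values under MCAR,
# standing in for the ML estimates of the full procedure
cc = R.all(axis=1)
mu_hat = Y[cc].mean(axis=0)
Sigma_hat = np.cov(Y[cc], rowvar=False)

chi2, df = 0.0, 0
for pattern in np.unique(R, axis=0):
    rows = (R == pattern).all(axis=1)
    n_p = int(rows.sum())
    obs = np.flatnonzero(pattern)                  # assessments observed in pattern
    ybar_p = Y[np.ix_(rows, obs)].mean(axis=0)     # pattern-specific observed means
    mu_p = mu_hat[obs]                             # M{p} mu_hat
    Sig_p = (N / (N - 1)) * Sigma_hat[np.ix_(obs, obs)]
    d = ybar_p - mu_p
    chi2 += n_p * float(d @ np.linalg.solve(Sig_p, d))
    df += len(obs)
df -= J                                            # df = sum_p J{p} - J
```

With MCAR data the statistic should be small relative to its degrees of freedom; under MAR dropout it grows with the sample size.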
Listing and Schlittgen proposed a parametric [1998] and nonparametric [2003]
test for monotone patterns of missing data. The test statistic is derived from
a comparison of the scores from the last assessment of dropouts to the scores
at the same timepoint for those who complete all assessments. If the data are
not MCAR then the scores will differ (as is observed in Figure 6.2).
∗ SPSS includes this test in the MVA command (see Appendix P). SAS generates all the com-
Analytic Explorations
                              6 weeks   12 weeks  26 weeks
Baseline assessment
  Physical Well-being        -0.19∗∗∗  -0.22∗∗∗  -0.20∗∗∗
  Functional Well-being      -0.12∗∗∗  -0.17∗∗∗  -0.18∗∗∗
  Lung Cancer Subscale       -0.14∗∗∗  -0.14∗∗∗  -0.14∗∗∗
  Emotional Well-being       -0.04     -0.11∗∗   -0.03
  Social/Family Well-being    0.02     -0.04     -0.05
Previous assessment
  Physical Well-being        -0.19∗∗∗  -0.18∗∗∗  -0.17∗∗∗
  Functional Well-being      -0.12∗∗∗  -0.17∗∗∗  -0.21∗∗∗
  Lung Cancer Subscale       -0.14∗∗∗  -0.16∗∗∗  -0.24∗∗∗
  Emotional Well-being       -0.04     -0.16∗∗∗  -0.10∗
  Social/Family Well-being    0.12∗∗∗  -0.07     -0.05
∗ p < 0.05, ∗∗ p < 0.01, ∗∗∗ p < 0.001
We may also wish to confirm that the missingness depends on the observed
data after adjusting for the dependence on covariates. This can be tested by
first forcing the baseline covariates into a logistic model and then testing the
baseline or previous measures of HRQoL. A variation is described by Ridout
[1991] where the approach is to start with the observed data forced into the
logistic model, then to test if added covariates can explain the association
between dropout and the observed data.
Continuing with the lung cancer trial, we see the same results when using
logistic regression models (Table 6.8). For all follow-up assessments, poorer
baseline HRQoL, as measured by the FACT-Lung TOI score, is highly predic-
tive of missing data. Similarly, the previous assessment is also highly predic-
tive of missing data. The odds ratios for missing an assessment ranged from
0.63 to 0.74 for each 10 point increase in the FACT-Lung TOI score. These
analyses reinforce the evidence that the missingness is dependent on observed
HRQoL scores (Yiobs ) and that in the lung cancer trial we can not assume
that the observations are MCAR.
TABLE 6.8 Study 3: Odds ratios for missing assessments in lung cancer
trial with either baseline or previous FACT-L TOI score. Baseline
characteristics are forced into the models (O.R.s for baseline characteristics
not shown). O.R. estimated for 10 point difference in FACT-Lung TOI.
Odds Ratios (95% C.I.)
Time of Assessment
6 weeks 12 weeks 26 weeks
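The kind of logistic exploration summarized in Table 6.8 can be sketched as follows on hypothetical data; the score scale, true coefficient and the hand-rolled Newton-Raphson fit are all illustrative, not a reproduction of the trial analyses.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000
prev_score = rng.normal(65, 16, n)                  # hypothetical TOI-like scores
# generate missingness with a true odds ratio of 0.7 per 10-point increase
eta = -0.5 + np.log(0.7) * (prev_score - 65) / 10
missing = (rng.random(n) < 1 / (1 + np.exp(-eta))).astype(float)

# Newton-Raphson fit of logit P(missing) = b0 + b1 * (score / 10)
X = np.column_stack([np.ones(n), prev_score / 10])
beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    W = p * (1 - p)                                 # IRLS weights
    beta += np.linalg.solve((X * W[:, None]).T @ X, X.T @ (missing - p))

odds_ratio = np.exp(beta[1])   # O.R. for a 10-point difference, as in Table 6.8
```

An odds ratio below 1 indicates, as in the trial, that subjects with higher previous scores are less likely to miss the next assessment.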
on the HRQoL at the time of the planned assessment. For example, this
might occur because assessments are more likely to be missing when an
individual is experiencing side effects of therapy, such as nausea/vomiting or
mental confusion. Alternatively, subjects might be less willing to return for
follow-up (and more likely to have missing assessments) if their HRQoL has
improved as a result of the disappearance of their symptoms.
indicators of the dropout process were not measured in the trial or the analyst
failed to identify them.
Evidence for the potential for nonrandom missing data will come from
sources outside of the data. Clinicians and caregivers may provide useful
anecdotal information that suggests the presence of nonrandom missing data.
In the lung cancer trial, we have already shown that the missingness depends
on both the initial and the most recent HRQoL scores. The question that
remains is whether, given the observed HRQoL scores and possibly the base-
line covariates, there is additional evidence that missing assessments are more
frequent in individuals who are experiencing events likely to impact HRQoL.
Obviously, death is a perfect predictor of missingness. We also might expect
toxicity, disease progression and nearness to death to impact both HRQoL
and missingness.
We can continue with the exploratory approach that utilizes the logistic
regression model including measures of toxicity, disease progression and near-
ness to death. The final step is to determine if these events or outcomes are
associated with missingness after adjusting for the covariates and observed
HRQoL identified as associated with missingness. After forcing age, perfor-
mance status, treatment assignment and prior FACT-Lung TOI score into a
logistic regression model for missing assessments, progressive disease during
the first 6 cycles of therapy and death within 2 months of the planned as-
sessment are strong predictors of missing assessments. This suggests that the
data are MNAR. Unexpectedly, missing assessments were less likely among
individuals with more toxicity. One possible explanation is that these patients
are more likely to have follow-up visits and thus more likely to be available
for HRQoL assessments.
FIGURE 6.3 Study 3: Change in FACT-Lung TOI prior to death.
Other options to explore the relationship of missing data are only limited
by the knowledge and thoughtfulness of the analyst. In the lung cancer trial,
we can explore the association of observed scores with the proximity to death
by plotting scores during the months prior to death backwards from the time
of death (Figure 6.3). In this trial we observe a clear relationship with the
outcome measure. Fayers and Machin [2000] describe an alternative graphical
approach where the origin of the horizontal axis is the date of the last assess-
ment. The HRQoL is then plotted backward in time. The decreasing values
of the outcome just prior to dropout suggest that a downward trajectory is
likely to continue after dropout.
FIGURE 6.4 Study 4: Average FACT-BRM TOI scores for control (left)
and experimental (right) arms stratified by time of last assessment. Patients
with 25+ weeks of follow-up are represented by the solid line, with between
5 and 25 weeks with the short dashed lines, and with less than 5 weeks with
the long dashed lines.
the observed values differ across the groups defined by time of dropout; thus
missing assessments are not MCAR.
Using the same classifications, dropout during the early (0-5 weeks) and
middle (5-25 weeks) periods can be examined using logistic regression. In
both periods, the predictors of dropout are the baseline FACT-BRM TOI
scores and the duration of the patient’s survival (Table 6.10). The odds of
dropout are reduced by about a third in both periods with a 10 point increase
in the baseline FACT-BRM TOI scores. The odds are reduced by about half
in the early period for every doubling of the survival time.† As in the lung
cancer trial, the results suggest that dropout depends on observed data and is
likely to depend on missing data if we can assume that the measure decreases
as death approaches.
† Solely for the purposes of this exploratory analysis, censoring of the survival times was
ignored. Note that the majority had died (75%) and the minimum follow-up was 2 years,
thus the missing information would have only a moderate effect when the length of survival
was expressed on a log scale.
FIGURE 6.5 Study 2: Average MSQ RR scores for placebo (left) and ex-
perimental (right) arms stratified by time of last assessment. Patients with 4
assessments are represented by the solid line.
of this auxiliary data will be discussed in later chapters. Finally, Figures 6.5
and 6.6 suggest that the models used in the sensitivity analyses should consider
a different relationship between dropout and the response in the two treatment
arms.
6.9 Summary
• It is important to understand the missing data patterns and mechanisms
in HRQoL studies. This understanding comes from a knowledge of the
disease or condition under study and its treatment as well as statistical
information.
• MCAR vs. MAR can be tested when the data have a repeated measures
structure using the test described by Little [1988] (see Section 6.5.2).
• Graphical techniques are useful when examining the assumption of MCAR
for studies with mistimed assessments (Figures 6.1 and 6.4).
• It is not possible to test MAR vs. MNAR without assuming a specific
model (see Chapters 11 and 12), however, it may be useful to examine
FIGURE 6.6 Study 2: Average MSQ RR scores for placebo (left) and ex-
perimental (right) arms stratified by reason for dropout. Patients who com-
pleted the study are represented by the solid line and star symbol. Patients
who dropped out are represented by dashed lines with E indicating lack of
efficacy, S side-effects, and O all other reasons.
7.1 Introduction
In this chapter I will present an overview of methods that are used for the
analysis of clinical trials with missing data. The first section covers methods
that rely on the assumption that the missing data are missing completely at
random (MCAR); I will try to convince you that these methods should always
be avoided. The second section reviews methods that assume that the data
are ignorable or missing at random (MAR) given covariates and the observed
data. In the final section, I will present an overview of models that can be
considered when data are suspected to be non-ignorable.
© 2010 by Taylor and Francis Group, LLC
ally make stronger assumptions about the missing data. Use of GEEs should
be considered when the HRQoL measure has a very strong deviation from
normality, transformations are infeasible and the sample size is so small that
the parameter estimates and statistics no longer have a normal distribution.
In clinical trials, it is extremely rare that the sample sizes are so small that the
estimates and test statistics (not the distribution of scores) are not asymp-
totically normal.
The safest approach for analysis of studies with missing data is to avoid
these methods. Alternative methods include multivariate likelihood methods
using all the available data, such as mixed-effects models or repeated measures
for incomplete data. These are described in the following sections of this
chapter and previously in Chapters 3 and 4.
Hypothetical Example
To illustrate the limitations of methods that assume that the data are MCAR,
I will generate data from a hypothetical example where the data are actually
MAR. This example illustrates how the various methods perform when we
know the underlying missing data mechanism. Because all values are known
and the example is simple, it is possible to show how the analytic results are
affected by the missing data assumptions.
Assume there are 100 subjects with two assessments. The scores are gen-
erated from a standard normal distribution (μ = 0, σ = 1). Three sets of
data are generated such that the correlations between the assessments (ρ12 )
are 0.0, 0.5 and 0.9. Correlations of HRQoL assessments over time in clinical
trials are generally in the range of 0.4-0.7. The correlations of 0.0 to 0.9 are
outside this range but will serve to illustrate the concepts. All subjects have
the first assessment (T1), but 50% are missing the second assessment (T2).
This is a larger proportion of missing data than one would desire in a clinical
trial, but magnifies the effect we are illustrating. Finally, missing values are
generated in two ways. In the first, observations at the second assessment
are deleted in a completely random manner that is unrelated to the values at
either assessment∗ . Thus the data are MCAR. In the second, the probability
of a missing assessment at the second time point depends on the observed
values at the first time point† . Thus the data are MAR conditional on the ob-
served baseline data. The means for these hypothetical data are summarized
in Table 7.1. Data from the complete cases is denoted Y C and data from
the cases with any missing data is denoted Y I . For subjects with both the
T1 and T2 assessments, ȳ1C and ȳ2C are the averages of the T1 and T2 scores.
For subjects with only the T1 assessment, ȳ1I is the average of the observed
T1 scores. ȳ2I is the average of the deleted T2 scores, which is known only
because this is a simulated example.
only the HRQoL assessments at one point in time will be biased. This is illus-
trated in the hypothetical example for repeated univariate analysis (Table 7.2,
second row), where the estimates of μ̂2 utilize only the data available at T2
from the complete cases (ȳ2C ).
    ⎡ μ̂1 ⎤   ⎡ π ȳ1C + (1 − π)ȳ1I ⎤   ⎡ ȳ1  ⎤
    ⎣ μ̂2 ⎦ = ⎣        ȳ2C         ⎦ = ⎣ ȳ2C ⎦ .        (7.1)
where π is the proportion of subjects with complete data. Thus, when dropout
depends on the measure of HRQoL at T1 and responses are correlated (ρ ≠ 0),
there is significant bias in the estimated T2 mean.
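The bias in equation 7.1, and the adjustment that the ML approach provides (see equation 7.6 later in this chapter), can be checked by simulation. In this sketch the T2 score is deleted whenever the T1 score falls below its median, so the data are MAR given T1; all values are hypothetical, with ρ = 0.5 as in the text.

```python
import numpy as np

rng = np.random.default_rng(3)
n, rho = 100000, 0.5
t1 = rng.normal(0, 1, n)                # standard normal T1 scores
t2 = rho * t1 + np.sqrt(1 - rho ** 2) * rng.normal(0, 1, n)
complete = t1 >= np.median(t1)          # 50% dropout at T2, driven entirely by T1

pi = complete.mean()                    # proportion with complete data
y1C, y2C = t1[complete].mean(), t2[complete].mean()
y1I = t1[~complete].mean()

naive = y2C                             # complete-case estimate of mu2: biased
# regression slope of T2 on T1 among completers; with unit variances this
# estimates rho, and it is unaffected by selection on T1
rho_hat = np.cov(t1[complete], t2[complete])[0, 1] / np.var(t1[complete], ddof=1)
adjusted = y2C + (1 - pi) * rho_hat * (y1I - y1C)   # MAR adjustment of y2C
```

The true T2 mean is 0: the complete-case estimate is shifted upward by roughly ρ times the selection effect on T1, while the adjusted estimate recovers the correct value.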
There are other disadvantages to the repeated univariate approach. First,
the pool of subjects is changing over time. Thus, the inferences associated
with each comparison are relevant to a different set of patients. Second, the
analyses produce a large number of comparisons that often fail to answer the
clinical question but rather present a confusing picture. Further, the Type I
error rate increases as the number of comparisons increases.
estimation. The problem is that the baseline values of those individuals who
do not have a follow-up assessment are ignored.
Consider the hypothetical example. If the mean of the T2 scores is estimated
using only the baseline data from the subjects with follow-up assessments
(yi1C ), then the mean HRQoL scores at T2 will be overestimated because
ȳ1C overestimates the baseline mean, μ1.
This is illustrated in Table 7.2 in the row identified as Baseline (Naive).
In this example the estimates of μ2 for the hypothetical example are identical
to those obtained for the complete case and the univariate analyses without
covariates. An alternative model calculates ȳ1 using all baseline assessments;
the estimates will then be unbiased if dropout depends only on the baseline
assessment and not on intermediate assessments.
assessments. In the hypothetical example this is designated in Table 7.2 in
the row identified as Baseline (Correct).
Procedures that use a least squares means (LS means) technique to obtain
estimates averaged over nuisance variables (e.g. baseline covariates) generally
do not fix the problem. In the context of longitudinal datasets with missing
values, the software procedures do not default to the mean of all the baseline
assessments, but to the mean of the assessments included in the analysis. The
analyst is well advised to check how the LS means are computed in specific
software programs.
Hypothetical Example
In the hypothetical example, consider a repeated measures model for the two
possible assessments,
    ⎡ yi1 ⎤   ⎡ 1 0 ⎤ ⎡ μ1 ⎤   ⎡ ei1 ⎤
    ⎣ yi2 ⎦ = ⎣ 0 1 ⎦ ⎣ μ2 ⎦ + ⎣ ei2 ⎦ ,
      Yi        Xi      β        ei
It is difficult to see how the estimates are calculated when written in matrix
notation. Consider a special case where all subjects have their first assess-
ment but only a proportion, π, have their second assessment. The resulting
ML estimates are

    μ̂1 = ȳ1   and   μ̂2 = ȳ2C + (1 − π)ρ̂(ȳ1I − ȳ1C )        (7.6)

The estimate of the mean at T1 is the simple average of all T1 scores (ȳ1 ),
because there are no missing data for the first assessment. With missing data
at the second time point (T2), μ̂2 is not the simple mean of all observed
T2 assessments (ȳ2C ) but is a function of the T1 and T2 scores as well as the
correlation between them. The second term in equation 7.6 ((1−π)ρ̂(ȳ1I − ȳ1C ))
is the adjustment of the simple mean (ȳ2C ). If we use the information in
Table 7.1 for ρ = 0.5, we obtain an unbiased estimate:
the expected values of the missing data, conditional on the observed data, are

    E[Yimis |Yiobs ] = Ximis β + Σm,o (Σo,o )−1 (Yiobs − Xiobs β) .
While this formula seems complicated, the results are not. Consider a subject
whose early observations are lower than the average, Yiobs < Xiobs β. The term
[Yiobs − Xiobs β] will be negative and if the observations over time are positively
correlated (Σm,o (Σo,o )−1 > 0), the expected value of the missing data condi-
tional on the observed data (E[Yimis |Yiobs ]) will be less than the unconditional
expected value (Ximis β). This is relevant because the algorithms (e.g. EM
algorithm, method of scoring, and MCMC) that produce the MLEs in some
way incorporate this conditional expectation.
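Numerically, the conditional expectation works as in the following sketch, assuming a subject with the first two of four assessments observed and a hypothetical compound-symmetric covariance; the means and scores are invented for illustration.

```python
import numpy as np

beta_mean = np.array([50.0, 48.0, 46.0, 44.0])            # X_i beta: modeled means
Sigma = 100 * (0.6 * np.ones((4, 4)) + 0.4 * np.eye(4))   # variance 100, covariance 60
obs, mis = [0, 1], [2, 3]                                 # first two visits observed

y_obs = np.array([42.0, 40.0])              # observed scores, 8 points below the mean
S_oo = Sigma[np.ix_(obs, obs)]              # Sigma_{o,o}
S_mo = Sigma[np.ix_(mis, obs)]              # Sigma_{m,o}

# E[Y_mis | Y_obs] = X_mis beta + S_mo S_oo^{-1} (Y_obs - X_obs beta)
cond_mean = beta_mean[mis] + S_mo @ np.linalg.solve(S_oo, y_obs - beta_mean[obs])
# pulled below the unconditional means (46, 44), here to (40, 38)
```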
    Ŷi = Xi β̂ + Zi dˆi
       = Xi β̂ + Zi DZi′ Σi−1 (Yiobs − Xi β̂)
       = (I − Zi DZi′ Σi−1 )Xi β̂ + Zi DZi′ Σi−1 Yiobs
       = (Σi Σi−1 − Zi DZi′ Σi−1 )Xi β̂ + Zi DZi′ Σi−1 Yiobs
       = σw2 Σi−1 Xi β̂  +  Zi DZi′ Σi−1 Yiobs                (7.14)
          (Within)          (Between)
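The step to the within/between form of (7.14) rests on the identity Σi − Zi DZi′ = σw2 I for a random-effects covariance structure. A quick numerical check, assuming a hypothetical random-intercept model:

```python
import numpy as np

rng = np.random.default_rng(0)
ni = 4
Z = np.ones((ni, 1))                    # random-intercept design
D = np.array([[40.0]])                  # between-subject variance
sigma_w2 = 60.0                         # within-subject variance
Sigma = Z @ D @ Z.T + sigma_w2 * np.eye(ni)

Xbeta = np.array([50.0, 48.0, 46.0, 44.0])   # fixed-effect means, X_i beta
y = Xbeta + rng.normal(0, 8, ni)             # one subject's observed vector

Sinv = np.linalg.inv(Sigma)
blup1 = Xbeta + Z @ D @ Z.T @ Sinv @ (y - Xbeta)           # shrinkage form
blup2 = sigma_w2 * Sinv @ Xbeta + Z @ D @ Z.T @ Sinv @ y   # within/between form
```

The two forms agree to machine precision, confirming the algebra of (7.14).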
FIGURE 7.1 Study 3: Estimated FACT-Lung TOI scores for control (left)
and experimental (right) arms estimated using complete cases (C), repeated
univariate analyses (U) and MLE of all available data (M). The standard
deviation of the FACT-Lung TOI scores is roughly 16 points.
excludes individuals who do not have both a baseline assessment and at least
one follow-up assessment. As a result it makes more restrictive assumptions
about missing data and should be used cautiously if there are missing data in
either the baseline or early follow-up assessments. In general, this is not a good
strategy for randomized clinical trials, but may be useful for observational
studies.
or 2) there is no difference between the groups and a Type I error has oc-
curred. In the first case, it is appropriate to adjust the estimates by inclusion
of a covariate in the analysis, but in the second case the differences are ig-
nored. Unfortunately, it is not possible to distinguish the two cases. Again, a
sensitivity analysis is advisable.
Exclusion of Observations
Exclusion of observations from the analysis should be performed very cau-
tiously. In some settings, there is a valid conceptual reason to do so. For
example, in the adjuvant breast cancer study (Study 1), four assessments
were excluded from the pre-therapy assessments because they occurred after
therapy started and two assessments were excluded from the on-therapy as-
sessments because they occurred after therapy had been stopped. In other
settings, the artificial attempt to force observations into a repeated measures
In these models, the analyst must specify how dropout depends on the
outcome, Yi , or the random effects, di . Because the complete data are not
fully observed, we must make untestable assumptions about the form of
f (Mi |Y or di ).
In addition to adjusting the estimates for the missing data, these models
allow the investigator to explore possible relationships between HRQoL and
explanatory factors causing missing observations. This is particularly inter-
esting, for example, if death or disease progression was the cause of dropout.
The criticism of selection models is that the validity of the models for the
missing data mechanism is untestable because the model includes unobserved
data (Yimis or di ) as the explanatory variable.
Pattern mixture models are a special case of the mixture models. When
Mi = Ri , the missingness can be classified into patterns. Chapter 10 presents
examples of these models in detail. Where missing data are due to dropout,
the distributions, f (Yi |TiD ), may be a function of the time to dropout or an as-
sociated event. The random-effects mixture, f (Yi |di ) is a mixed-effects model
where the random-effects model includes the dropout time, TiD . Examples
include the conditional linear model proposed by Wu and Bailey [1989] and
the joint model proposed by Schluchter [1992] and DeGruttola and Tu [1994]
(see Chapter 11).
7.5 Summary
• Methods that exclude subjects or observations from the analysis should
be avoided unless one is convinced that the data are missing completely
at random (MCAR). This includes repeated univariate analyses which
ignore observations that occur at different times, MANOVA, which
deletes cases with any missing assessments and inclusion criteria for
the analyses such as requiring both a baseline and at least one follow-up
assessment.
• When the proportion of missing data is very small or the missingness is
thought to be ignorable, it is advisable to employ likelihood-based methods
(MLE) that use all of the observed data, such as repeated measures models
for incomplete data (Chapter 3) or mixed-effects models (Chapter 4).
• When missing data is suspected to be non-ignorable, it is advisable to
perform sensitivity analyses using one or more of the methods described
in Section 7.4.
The choice among methods should be made only after careful considera-
tion of why the observations are missing, the general patterns of change in
HRQoL over time, and the research questions being addressed. Without this
careful consideration, imputation may increase the bias of the estimates. For
example, if missing HRQoL observations are more likely in individuals who
are experiencing toxicity as a result of the treatment, the average value from
individuals with observations (who are less likely to be experiencing toxicity)
will likely overestimate the HRQoL of individuals with missing data.
a score, the mean of the observed responses is substituted for the missing
values. This is exactly the same as taking the mean of the observed responses
and rescaling that value to have the same potential range as the scale would
have if all the items were answered. For example, the social/family well-being
scale of the FACT-G has 7 questions with responses that range from 0-4. If
all items are answered, the total score is the sum of the responses, but can
also be calculated as the mean of the seven responses times 7. Following the
half rule, the summated score is also the mean of the responses times 7. To
illustrate, consider the three subjects presented in Table 8.2. The first subject
completed all items, so her score (the sum of the items) is 20. This is exactly
the same as the average multiplied by 7. Subject 2 completed 6 of 7 items.
The average of his scores is 3.0, so his total score is 21. Finally, Subject 3
completed only 4 items (more than half). Her average of the 4 items is 2.75,
thus her total summed score is 2.75x7=19.25. This particular scale includes
one question about sexual activity, making the half-rule particularly useful.
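The half-rule scoring just described can be written in a few lines. The item responses below are hypothetical, chosen only to reproduce the three scores in the text (20, 21 and 19.25).

```python
def half_rule_score(responses, n_items=7):
    """Prorated scale score; None marks a missing item (items scored 0-4)."""
    answered = [r for r in responses if r is not None]
    if len(answered) * 2 <= n_items:
        return None                      # half or fewer items answered: no score
    # mean of the answered items, rescaled to the full range of the scale
    return sum(answered) * n_items / len(answered)

s1 = half_rule_score([4, 3, 3, 3, 3, 2, 2])           # all 7 items: score 20
s2 = half_rule_score([3, 3, 3, 3, 3, 3, None])        # 6 items, mean 3.0: score 21
s3 = half_rule_score([3, 3, 3, 2, None, None, None])  # 4 items, mean 2.75: score 19.25
```

Note that the score depends only on the individual's own questionnaire, so the same responses yield the same score in every trial.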
One of the advantages of this method of computing scores in the presence of
missing items is that it is simple (can even be done by hand) and only depends
on the responses contained in that individual’s questionnaire. Thus the score
does not depend on the collection of data from other subjects and would not
differ from trial to trial. However, one wonders whether this method is the
most accurate. Would other regression-based methods improve the scores?
The answer is surprising: investigators have found it hard to improve
on this simple method with regression-based methods [Fairclough and Cella,
1996b]. As more research emerges using item response theory (IRT), this may
change. But my suspicions are that the small incremental improvements will
only be justifiable when the data are gathered electronically and can be scored
electronically.
A final comment: Investigators should be concerned about any individual
question that has more than 10% missing responses. For questions about
some topics, such as questions about sexual activity, a lower response rate is
expected and unavoidable. A high rate of missing responses may also indicate
that the question is poorly worded or not appropriate for some populations
or in some settings.
For example, the Breast Chemotherapy Questionnaire (BCQ) used in the
breast cancer trial (Study 1), asks questions about how much certain symp-
toms bother the respondent. It will be difficult to respond to that question if
you are not experiencing the symptom. This was particularly a problem when
the assessments were prior to or after the completion of therapy. Because of
the content of the questions of the BCQ, the instrument is very sensitive to
differences that occur during therapy, but is unlikely to be useful for long term
follow-up. Another example is an instrument that asks questions specifically
about work. If the respondent interprets the question as only referring to paid
work, they may not respond to this question, particularly if they are retired,
a student or work at home. In the FACT-G, the phrase “including work at
home” is added to expand the application of this question although it still
may not be applicable for all respondents.
Some of the problems with simple imputation cited above are illustrated by
examining the results of simple imputation in the lung cancer trial (Study 3)
presented in the remainder of this chapter.
the same as the analytic model (using only time and treatment in the analytic
model).
The distributions of observed scores and imputed scores at weeks 12 and
26 for the lung cancer trial are displayed in Figure 8.1. The imputed values
are centered relative to the observed scores, emphasizing the assumption that
the missing values are MCAR. This may be valid if the missing data are
due to administrative problems, but is hard to justify in most clinical trials.
Figure 8.1 emphasizes the distortion of the distributions of observations. In
the final section of this chapter, I will illustrate how this impacts estimation
of standard errors and severely inflates the Type I error rate (see Section 8.6).
[Figure 8.1 appears here: panels of "Percent of Scores" versus score at weeks 12 and 26 for the observed and imputed data.]
3. Remove variables from the above lists if they are frequently missing. If
Xi∗mis includes covariates that are missing, then we will be unable to
impute values for Yimis .
4. Identify the relationships that will be tested in the analytic model and
include the corresponding information in the imputation model regard-
less of the strength of the relationship in the observed data. These will
typically be indicators of the treatment arms in clinical trials. If the
sample size is large, develop separate imputation models for each treat-
ment group. Otherwise, force variables identifying the treatment groups
into the model. If models are not being developed separately for each
treatment group, evaluate interactions between treatment and potential
covariates. Failure to do this will bias the treatment comparisons toward
the null hypothesis.
When the intent of the analysis extends beyond treatment group comparisons,
all important explanatory variables on which inference is planned should be
included as explanatory variables in the imputation model to avoid biasing
the evidence toward the null hypothesis.
$$Y_{ij}^{*mis} = X_{ij}^{*mis} \hat{B}_{j}^{*}. \qquad (8.4)$$
This approach will result in unbiased estimates if the regression model satisfies
the MCAR assumption. In practice this approach will require the luck or
foresight to measure the patient characteristics and outcomes (auxiliary) that
explain the missing data mechanism.
The distributions of observed scores and imputed scores are displayed in
Figures 8.2 and 8.3. The assumptions about the missing data have been
relaxed slightly. Adding baseline covariates (Figure 8.2) results in a small
amount of variation in the imputed scores and a very small shift in the dis-
tribution. The addition of the auxiliary outcome information (best response
and death within 2 weeks) results in more variation and a perceptible shift
downwards of the imputed values (Figure 8.3). We are still assuming that
relationships between HRQoL and clinical outcomes such as response and
survival are the same for subjects with missing data as for those with ob-
served HRQoL. Note that the range of imputed values is still smaller than the
range of observed values. The implications of this are discussed later in this
chapter (see Section 8.6).
The SPSS MVA command with the REGRESSION option can be used to output
a dataset with the missing values replaced by imputed values. Random
errors are added to Yij∗mis, which somewhat mitigates the problems associated
with simple imputation but does not solve them completely [vonHippel, 2004].
For a mixed-effects model (equation 3.9), the estimates are a special case
Note that the observed data (Yiobs ) are now included in the equation used to
predict the missing values for each individual. When the observed HRQoL
scores for the ith subject are higher (or lower) than the average scores for sub-
jects with similar predicted trajectories, then the difference (Yiobs −Xi∗obs B̂ ∗ ) is
positive (or negative). As a result, the imputed conditional predicted value is
larger (or smaller) than the imputed unconditional value (Xi∗mis β̂). Further,
when HRQoL scores within the same individual are more strongly correlated,
the second term, $\hat{\Sigma}_{mo}\hat{\Sigma}_{oo}^{-1}(Y_i^{obs} - X_i^{*obs}\hat{B}^{*})$ in equation 8.5 or $Z_i^{*mis}\hat{d}_i^{*}$ in equation 8.7, is larger in magnitude and the difference between the conditional and
unconditional imputed values will increase.
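A scalar numeric sketch of this conditioning (hypothetical means and covariances in Python; one observed and one missing time point standing in for the matrices of equation 8.5):

```python
# One observed and one missing time point; all numbers hypothetical.
mu_obs, mu_mis = 60.0, 55.0        # unconditional model predictions X*B
var_oo, cov_mo = 225.0, 135.0      # SD 15 at each visit, correlation 0.6
y_obs = 75.0                       # this subject scored above prediction

# Scalar version of the conditional predicted value: the imputed score is
# pulled upward because the observed score exceeds the model prediction.
y_mis_conditional = mu_mis + cov_mo * (y_obs - mu_obs) / var_oo
print(y_mis_conditional)           # -> 64.0 (vs. 55.0 unconditionally)
```

With a stronger correlation (larger cov_mo), the pull toward the subject's own observed score grows, exactly as described above.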
The entire distribution of observed scores and imputed scores, predicted
by multivariate linear regression, are displayed in Figure 8.4. The same set
of explanatory variables used to generate the data in Figure 8.3 was used to
generate these values. However, the imputed values now include information
about the patient’s previous HRQoL. As a result the distribution of the im-
puted values appears to be shifted down and the range of values has greatly
expanded.
The imputed values can be obtained using existing software (e.g. SAS Proc
Mixed, SPSS Mixed/ Save=PRED(BLUP)). However, the standard errors of es-
timates using these values will be underestimated (see Section 8.6). The SPSS
MVA command with the EM option adds random errors to the expected values,
which somewhat mitigates the problems associated with simple imputation
but does not solve them completely [vonHippel, 2004].
approach has limited utility [Gould, 1980, Heyting et al., 1992, Little and
Yau, 1996, Revicki et al., 2001] and should be employed with great caution.
8.4.2 δ-Adjustments
Diehr et al. [1995] describe a variation on LVCF to address the problem of
differences in HRQoL between individuals who were able to complete HRQoL
assessments and those who were not. In the proposed procedure, a value (δ) is
subtracted from (or added to) the last observed value. If this value can be justified,
then this approach is a useful option in a sensitivity analysis. In Diehr’s ex-
ample, a value of 15 points on the SF-36 physical function scale was proposed,
where 15 points is justified as the difference in scores between individuals re-
porting their health as unchanged versus those reporting worsening [Ware et
al., 1993].
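A minimal Python sketch of the adjustment (the function name is ours; the 15-point decrement is the SF-36 example above):

```python
def delta_adjusted_lvcf(last_observed, delta=15.0):
    """Carry the last observed value forward, decremented by a justified
    amount delta (e.g. 15 points on the SF-36 physical function scale)."""
    return last_observed - delta

print(delta_adjusted_lvcf(70.0))   # -> 55.0
```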
This strategy for imputing missing data has the advantage that one only has
to assume a relative ordering of the HRQoL. For example, we are assuming that
the HRQoL of subjects who died is poorer than that of those who remain alive. But
even with this seemingly straightforward procedure, it is important to consider
the assumptions very carefully. In practice it will not be easy to classify
all dropouts into one of the three groups (2, 4 and 5) defined in Table 8.3.
Heyting et al. [1992] and Pledger and Hall [1982] describe situations where this
strategy may not be appropriate. Also note that when a large proportion of
the subjects expire, this approach becomes an approximation to the analysis
of survival rather than an analysis of HRQoL.
The exclusion of subjects with missing covariates from analyses may result in
selection bias. In most cases, missing covariates (particularly patient charac-
teristics) individually represent a very small proportion (1-2%) of the potential
observations. It is only when there are a large number of covariates, each with
a small amount of missing data that a problem arises and a more substantial
proportion (> 10%) of subjects will be removed from the analysis. When the
proportion of missing covariate data is small, the choice of the imputation
method will have virtually no impact on the results derived from the analytic
model. As the proportion increases, much greater care must be taken. When
the proportion is substantial, the first step is to determine if the covariate is
absolutely necessary. (Hopefully, the trial design and data management pro-
cedures were in place to ensure the capture of vitally important covariates.)
If it is determined that the covariate is absolutely necessary, one possible al-
ternative is a missing value indicator. Thus, if the covariate is categorical, one
additional category (unknown) would be added. While there has been some
criticism of this approach, it will still be appropriate in many settings. For
example, the absence of a lab value may indicate that the clinician decided
not to obtain the test because the result (normal vs. abnormal) was highly
predictable.
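A minimal Python sketch of the missing-indicator recoding (variable names are hypothetical):

```python
def with_unknown_category(values, label="unknown"):
    """Recode missing categorical covariate values as an explicit 'unknown'
    category so that cases are not dropped from the analytic model."""
    return [label if v is None else v for v in values]

lab_result = ["normal", None, "abnormal", None, "normal"]
print(with_unknown_category(lab_result))
# -> ['normal', 'unknown', 'abnormal', 'unknown', 'normal']
```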
values.∗ Finally, we assume that we have observed the HRQoL of all subjects
rather than just a subset of the subjects. As a result, test statistics and con-
fidence intervals based on a naive analysis of the observed and imputed data
will not be valid.
This is illustrated in the lung cancer trial. Consider the scores at 26 weeks;
the estimated standard deviation of the observed data is roughly 15-16 points,
where it is calculated as:
$$\hat{\sigma} = \sqrt{\sum_{i=1}^{n} (Y_{hij} - \bar{Y}_{hj})^2 / (n-1)}. \qquad (8.8)$$
When we impute values for the missing data using the mean value of the
observed data and use a naive estimate of the variance, we add nothing to
the squared terms because $Y_{hij}^{mis} - \bar{Y}_{hj} = 0$, but we do increase the apparent
number of observations. The effect of this is illustrated in Table 8.4, where
the naive estimate of the standard deviation decreases by almost 1/3 as we
increase the apparent number of observations by about 2-fold. In the simple
univariate case, the underestimation of the variance is roughly proportional
to the amount of missing data. The naive estimate of the variance of $y_i$ is
$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n_{obs}} (y_i^{obs} - \bar{y}^{obs})^2 + n_{mis}(\bar{y}^{obs} - \bar{y}^{obs})^2}{n_{obs} + n_{mis} - 1} = \frac{(n_{obs} - 1)\hat{\sigma}_{obs}^2 + 0}{n - 1} = \frac{n_{obs} - 1}{n - 1}\,\hat{\sigma}_{obs}^2.$$
When $\hat{\sigma}_{obs}^2 \approx \sigma^2$, then $E[\hat{\sigma}^2] \approx \frac{n_{obs}}{n}\sigma^2$. The standard deviation is underestimated
by a factor proportional to the square root of the proportion of missing
data. While it is straightforward to adjust the estimate of the variance using
mean imputation, it becomes a much more difficult task for other imputation
procedures.
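The shrinkage can be checked numerically; an illustrative Python sketch with simulated scores (the counts loosely mirror the lung cancer example discussed below):

```python
import random
import statistics

random.seed(1)
n_obs, n_mis = 191, 198                       # roughly the NSCLC counts
observed = [random.gauss(50, 15) for _ in range(n_obs)]

# Mean imputation: every missing score replaced by the observed mean.
imputed = observed + [statistics.mean(observed)] * n_mis

sd_obs = statistics.stdev(observed)
sd_naive = statistics.stdev(imputed)

# The naive SD shrinks by sqrt((n_obs - 1)/(n - 1)), about 30% here,
# because the imputed values add nothing to the sum of squares while
# inflating the apparent sample size.
print(sd_obs, sd_naive, sd_naive / sd_obs)
```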
The problem with the underestimation of the variance of the observations
is compounded when we attempt to estimate the standard errors of means
or regression parameters. This in turn affects test statistics and confidence
intervals. For example, the naive estimate of the standard error of the mean
($S.E.(\hat{\mu}) = \sqrt{\hat{\sigma}^2/n}$) assumes that we have information on all $n$ individuals,
rather than the $n_{obs}$ individuals who completed the HRQoL assessments. In
the 6-month estimates for survivors on the NSCLC study, we analyze a dataset
with 389 observations when only 191 subjects were observed. The effect is
illustrated in Table 8.4, where the naive standard errors are roughly half the
true standard errors. For the other simple imputation methods displayed in
∗ The simple imputation performed by the SPSS MVA command addresses the first, but not
Table 8.4, the estimates of the standard deviations are underestimated (for all
approaches but LVCF) and thus the standard errors are also underestimated.
The underestimation of standard errors can make a substantial difference in
the test statistics, inflating the Type I error rate. In the example, t-tests with
the standard error in the denominator would be inflated by 30-50%. Consider
a small difference of 3 points (1/5 S.D.) in the means of the two groups. With
no imputation, the t-statistic for a test of differences utilizing the standard
errors in Table 8.4 is 1.24 (p=0.22). With mean imputation, the t-statistic is
now 2.65 (p=0.008), a highly significant difference.
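The inflation of the test statistic follows directly from the naive standard error; a Python sketch with illustrative numbers (these are not the Table 8.4 values):

```python
import math

sd_true, n_obs, n_total = 15.0, 191, 389     # counts quoted in the text
sd_naive = sd_true * math.sqrt((n_obs - 1) / (n_total - 1))

se_true = sd_true / math.sqrt(n_obs)         # SE from the observed cases only
se_naive = sd_naive / math.sqrt(n_total)     # SE pretending all 389 observed

# Any t statistic with the SE in its denominator inflates by the same ratio.
print(se_true, se_naive, se_true / se_naive)
```

Here the naive standard error is roughly half the correct one, consistent with the comparison in Table 8.4.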
8.8 Summary
• Simple imputation such as the half rule is useful when a small number
of item responses are missing in a questionnaire.
• Simple imputation may be useful when a small proportion of multiple
covariates have missing values.
• Simple imputation has very limited usefulness for missing assessments
of HRQoL that are intended as outcome measures. The primary limita-
tion of simple imputation is the underestimation of the variance of any
estimate and the corresponding inflation of Type I errors.
• Last Value Carried Forward (LVCF), if used, should be well justified
and any underlying assumptions verified. This approach will not be
conservative in all cases and may in some settings bias the results in
favor of a treatment with more dropout associated with morbidity.
9.1 Introduction
The major criticism of simple imputation methods is the underestimation of
the variance (see previous chapter). Multiple imputation [Rubin and Schenker,
1986, Rubin, 1987] rectifies this problem by incorporating both the variability
of the HRQoL measure and the uncertainty about the missing observations.
Multiple imputation of missing values will be worth the effort only if there
is a substantial benefit that cannot be obtained using methods that assume
that missing data is ignorable such as maximum likelihood for the analysis of
incomplete data (Chapter 3 and 4). As mentioned in the previous chapter,
this requires auxiliary information, such as assessments by other observers
(caregivers) or clinical outcomes that are strongly correlated with the HRQoL
measure. The previous comments about the complexities of actually implementing
an imputation scheme when the trial involves both longitudinal data
and multiple measures are even more relevant. The following quote summarizes
the concern:
This multiple imputation procedure differs from the simple regression tech-
niques described in the previous chapter in two ways. First, because the true
parameters of the imputation model are unknown, random error is added to
the estimated parameters (B̂ ∗ ). These new values of the parameters (β (m) )
are then used to predict the score for a subject with specific characteristics
defined by the covariates (X ∗(mis) ). Then, additional random error is added
to these values to reflect the natural variability of the individual outcome
measures (V ar[Yi ]).
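The two layers of added randomness can be sketched in Python for a hypothetical one-covariate model (all numbers invented):

```python
import random

def mi_draw(x, beta_hat, se_beta, sigma, rng):
    """One imputed value: draw beta^(m) around the estimate to reflect
    parameter uncertainty, then add residual error to reflect the natural
    variability of individual scores."""
    beta_m = rng.gauss(beta_hat, se_beta)      # perturbed parameter beta^(m)
    return beta_m * x + rng.gauss(0.0, sigma)  # prediction plus residual noise

rng = random.Random(7)
draws = [mi_draw(x=1.0, beta_hat=50.0, se_beta=2.0, sigma=15.0, rng=rng)
         for _ in range(10)]
print(draws)   # ten distinct imputed values for the same subject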
where
noise added in equation 9.6. Thus, adding covariates that result in small
increases in the R2 is unlikely to improve the imputation procedure even
when the statistical significance of a particular parameter is large. Finally,
any covariate that will be incorporated into subsequent hypothesis testing
(e.g. treatment group) must be retained in order to avoid biasing the results
toward the null hypothesis (as previously discussed in Chapter 8).
$$Y_{i1}^{*obs} = X_{i1}^{*obs}\beta_{1}^{*} + \varepsilon_{i1}^{*} \qquad (9.8)$$
$$Y_{i2}^{*obs} = X_{i2}^{*obs}\beta_{2}^{*(m)} + Y_{i1}^{(m)}\beta_{2|1}^{*(m)} + \varepsilon_{i2}^{*} \qquad (9.9)$$
$$Y_{i3}^{*obs} = X_{i3}^{*obs}\beta_{3}^{*(m)} + Y_{i2}^{(m)}\beta_{3|2}^{*(m)} + Y_{i1}^{(m)}\beta_{3|1}^{*(m)} + \varepsilon_{i3}^{*} \qquad (9.10)$$
$$Y_{i4}^{*obs} = X_{i4}^{*obs}\beta_{4}^{*(m)} + Y_{i3}^{(m)}\beta_{4|3}^{*(m)} + Y_{i2}^{(m)}\beta_{4|2}^{*(m)} + Y_{i1}^{(m)}\beta_{4|1}^{*(m)} + \varepsilon_{i4}^{*} \qquad (9.11)$$
cases this will not exist. Despite this, the analyst will have to make choices and
defend them. I do not have any recommendations that would be generically
applicable across various settings and instruments except to think through
what makes conceptual sense and to keep it as simple as possible.
9.3.6 Assumptions
When we impute the missing observations using these models, we are making
at least two assumptions that are not testable. First, we are assuming that the
relationship between the explanatory variables and HRQoL is the same when
individuals complete the HRQoL assessments as when they do not complete
the assessments. Basically, we are assuming that Yi∗obs = Xi∗obs B ∗ + εi and
Yi∗mis = Xi∗mis B ∗ + εi are both true when B ∗ is estimated on the basis of
the observed data. Second, we are assuming that we have identified all the
important relevant information such that the missingness no longer depends
on the missing HRQoL value (Yimis ) after conditioning on Yi∗obs and Xi∗
(MAR).
Less critical assumptions in this procedure are that the residual errors (ε∗i )
and the parameter estimates (B̂ ∗ ) of the imputation model are normally dis-
tributed. The first assumption can be assessed by examining the residual
errors (ε∗i ) and the second will be true for studies with moderate sample sizes
for which parameter estimates become asymptotically normal.
more than 5% missing data were not considered. The second step was to
eliminate variables where the correlation with the TOI scores was weak and
explained less than 5% of the variation∗. Note that eliminating covariates
that explained less than 5% (ρ < 0.22, ρ² < 0.05) of the covariation between
the TOI scores and the covariate reduced the R² for a model with all
of the covariates by only 1-3% (Table 9.1). As a final note, when the prior scores
were eliminated, the proportion of the total variation that was explained was
roughly cut in half, illustrating the importance of addressing the longitudinal
nature of the study.
TABLE 9.1 Study 3: Covariate selection for the lung cancer trial. The criterion
for selection was based on estimates of bivariate R² (% of variance explained by
the potential covariate). - indicates that the covariate was not included; + indicates
inclusion.
Corr with Obs Score Corr with Miss Ind
Potential Assessment # Assessment #
Covariate 1 2 3 4 1 2 3 4
Treatment Arm X X X - +
Age - - - + - + - +
Gender - - - - - - - -
Performance status ++ + - - - + + -
Prior radiotherapy - + - - - + - -
Weight loss ++ + - - - - - -
Chronic disease - - - - - - - -
Primary disease symptoms + + - - - - - -
Metastatic symptoms + + + + - - - -
Systemic symptoms ++ - - + - - - -
Baseline TOI score +++ +++ +++ ++ ++ ++
6-week TOI score +++ +++ ++ ++
12-week TOI score +++ ++
Cycles of therapy +++ +++ + +++ +++ +++
Hematologic toxicity + + - - + +
Neurologic toxicity - - - ++ ++ +
Early progressive disease ++ ++ ++ + ++ ++
Survival (log) +++ +++ +++ +++ +++ +++
∗ The R2 for a simple linear regression is equivalent to the square of the Pearson correlation.
Thus correlations less than 0.1 (even if statistically significant) would correspond to R2
values of less than 1%.
• When there are conceptual models underlying the measures and the
trial, these should provide the primary guidance.
Figure 9.1 displays the observed and imputed scores at the third and fourth
assessment. The first feature is the greater variability of the scores when
contrasted with those for simple imputation methods (see Chapter 8). As
expected in this study, the distribution of the imputed scores is lower than the
observed scores. A very small proportion of the scores lie outside the possible
range of 0 to 100. Low values (< 0) represent 0.01, 0.03, 0.16 and 1.84% of
the scores at each of the four assessments. High values (> 100) represent 0.14,
0.22, 0.19 and 0.26% of the scores respectively. The recommendation is not
to replace these out-of-range scores with 0 and 100 (although the software
procedures allow this as an option).
9.3.8 Implementation
Implementation in SAS
The SAS MI Procedure implements a variety of MI models. All require that
the data for each subject be in a single record. The first step is to create a data
set (WORK.ONEREC) with one record per subject containing the four possible
FACT-Lung TOI scores (TOI1, TOI2, TOI3, TOI4) and the covariates to be
used in the MI procedure.
If the covariates were all observed and the assessment of the TOI scores
followed a strictly monotone missing data pattern, the following simple pro-
cedure could be used:
%let baseline= Wt_LOSS ECOGPS SX_Sys;
proc mi data=work.onerec out=work.regression nimpute=10 /*noprint*/;
by Trtment;
var &baseline cycles crpr pd_lt6 ctc_neu ln_surv toi1 toi2 toi3 toi4;
monotone method=regression;
run;
The MI procedure with a MONOTONE METHOD=REGRESSION; statement sequen-
tially imputes missing data using
$$Y_j = \beta_0 + \beta_1 Y_1 + \cdots + \beta_{j-1} Y_{j-1}$$
for the Y1 , . . . , Yj identified in the VAR statement with the restriction that the
missing data pattern is monotone among the variables for the order speci-
fied. The sequential regression of toi1, toi2, toi3, and toi4 can also be
tailored:
proc mi data=work.onerec out=work.regression nimpute=10 /*noprint*/;
by Trtment;
var &baseline cycles crpr pd_lt6 ctc_neu ln_surv toi1 toi2 toi3 toi4;
monotone reg(toi1=Wt_LOSS ECOGPS SX_Sys);
monotone reg(toi2=Cycles CRPR PD_lt6 ctc_neu Ln_surv toi1);
monotone reg(toi3=Cycles crpr PD_lt6 ctc_neu Ln_surv toi1 toi2);
monotone reg(Toi4=Cycles crpr PD_lt6 ctc_neu Ln_surv toi1 toi2 toi3);
run;
However, in the lung cancer trial (and in most settings) the missing data
pattern will not be strictly monotone and the procedure will require an ad-
ditional step to create a dataset with a monotone pattern. One strategy is
to use a technique that does not require a monotone pattern (Section 9.6).
When there are a small number of non-monotone missing values, we can fill
in the outcome variables (TOI1, TOI2, TOI3, TOI4) as follows:
proc mi data=work.onerec out=work.monotone nimpute=10 /*noprint*/;
by Trtment;
var toi1 toi2 toi3 toi4;
mcmc impute=monotone;
run;
This creates a new dataset comprised of 10 sets of data that have a monotone
pattern.
pattern. We can now use the regression procedure to impute the remaining
missing values. To do this we are going to rename the variable that identi-
fies these 10 sets from _Imputation_ to M and sort the data by imputation
number and treatment group.
From this point on, we are going to impute only one set of values for each of
the 10 sets generated in the first step. There are two changes to the procedure;
the imputation number ( M ) is added to the BY statement and the number of
imputations is changed to 1.
Implementation in SPSS
3. Predicted values are generated for subjects with both observed and miss-
ing data:
$$E[Y_i^{obs(m)}] = X_i^{obs}\beta^{(m)} \qquad (9.12)$$
$$E[Y_i^{mis(m)}] = X_i^{mis}\beta^{(m)} \qquad (9.13)$$
4. For each subject ($i'$) with a missing observation, the subject with the
closest predicted value, $E[Y_i^{obs(m)}]$, is selected and that subject's actual
observed value, $Y_i^{obs}$, is the imputed value for subject $i'$.
2. The parameters of the imputation model (β̂ (m) ) are then estimated
within each of the M samples.
3. Predicted values are generated for both the cases with observed and the
cases with missing data (equations 9.12 and 9.13).
4. The five nearest matches among the subjects with observed data are
identified for each subject with missing data and one of the five is se-
lected at random. The observed value of the HRQoL measure from that
subject is the imputed value.
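Steps 3 and 4 above can be sketched in Python (toy predicted values and scores; a fixed seed stands in for the random donor selection, and the function name is ours):

```python
import random

def pmm_impute(pred_mis, pred_obs, y_obs, k=5, seed=0):
    """For each case with a missing score, find the k observed cases whose
    predicted values are closest, then borrow one of their *observed*
    scores at random (predictive mean matching)."""
    rng = random.Random(seed)
    filled = []
    for p in pred_mis:
        donors = sorted(range(len(pred_obs)),
                        key=lambda j: abs(pred_obs[j] - p))[:k]
        filled.append(y_obs[rng.choice(donors)])
    return filled

pred_obs = [40.0, 45.0, 50.0, 55.0, 60.0, 70.0]  # predictions, observed cases
y_obs    = [38.0, 47.0, 52.0, 53.0, 61.0, 72.0]  # their observed scores
imp = pmm_impute([48.0, 66.0], pred_obs, y_obs, k=2)
print(imp)
```

Because every imputed value is borrowed from a donor, all values stay within the observed range, which also explains the ragged shape noted for Figure 9.2 when few donors are available.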
Figure 9.2 displays the distribution of scores for this procedure. All of the
scores are within the range of the observed data. The imputed values again
shift the distribution toward lower scores, as we would expect
given the characteristics of these patients. Note that the distribution of the
imputed scores has a ragged shape. This suggests that a limited number
of observations are available for sampling for certain ranges of the predicted
means and these values are being repeatedly sampled; this is referred to as
the multiple donor problem.
Implementation in SAS
This procedure can be implemented in SAS, by replacing REG with REGPMM in
the MONOTONE statements described in Section 9.3.8.
Implementation in SPSS
The SPSS MULTIPLE IMPUTATION command implements this procedure us-
ing only the closest match when the data have a strictly monotone struc-
ture. Details of the procedure for non-monotone data are presented in Sec-
tion 9.6. When data is strictly monotone, the procedure is identical except
that METHOD=FCS is replaced by METHOD=MONOTONE SCALEMODEL=PMM.
2. Fit a logistic regression model for the probability that a subject has
a missing value for the j th measure as a function of a set of observed
patient characteristics. Calculate the predicted probability (propensity
score) that each subject would have missing values, given the set of
covariates.
3. Based on the propensity score, divide the subjects into K groups (gen-
erally 5 groups).
$$Y^* \sim N(\mu^*, \Sigma^*)$$
Because the MCMC procedure relaxes the requirement for a monotone missing
data pattern, the imputation can be performed in one step.
proc mi data=work.onerec out=work.mcmc nimpute=20 /*noprint*/;
by Trtment;
var WT_loss ECOGPS SX_Sys cycles crpr pd_lt6 ctc_neu
ln_surv toi1 toi2 toi3 toi4;
mcmc IMPUTE=FULL;
run;
The distributions of the resulting data displayed in Figure 9.4 are very similar
to those obtained using the regression method (Figure 9.1). This should not
be surprising as both methods assume normality and the data from the lung
cancer trial have mostly a monotone missing data pattern.
COMPUTE Ln_Surv=LN(MAX(.1,SURV_DUR)).
EXECUTE.
Then the imputation procedure is implemented separately in each treatment
group, generating a dataset imputedData with the original data (Imputation_ = 0)
and the imputed datasets (Imputation_ = 1, ...).
/* Imputation is performed within each treatment group */
SORT CASES BY ExpTx.
SPLIT FILE BY ExpTX.
/* Descriptive Stats */
DATASET ACTIVE Lung3.
MULTIPLE IMPUTATION WT_loss ECOGPS SX_Sys cycles crpr pd_lt6 ctc_neu
ln_surv FACT_T2.1 FACT_T2.2 FACT_T2.3 FACT_T2.4
/IMPUTE METHOD=NONE
/MISSINGSUMMARIES VARIABLES.
/* MCMC Imputation */
DATASET DECLARE imputedData.
DATASET ACTIVE Lung3.
MULTIPLE IMPUTATION WT_loss ECOGPS SX_Sys cycles crpr pd_lt6 ctc_neu
ln_surv FACT_T2.1 FACT_T2.2 FACT_T2.3 FACT_T2.4
/IMPUTE METHOD=FCS NIMPUTATIONS=10
/IMPUTATIONSUMMARIES MODELS DESCRIPTIVES
/OUTFILE IMPUTATIONS = imputedData.
9.6.3 Implementation in R
The R procedure uses a Gibbs sampler algorithm [Schafer, 1997]. The impu-
tation model is the standard mixed-effects model (equation 3.9):
$$Y_i^* = X_i^* \beta + Z_i^* d_i + \varepsilon_i$$
The data have two pieces: the N by r matrix of the multivariate longitudinal
data, Y∗, and the N by p matrix of predictors, which includes the design
matrices for both the fixed and random effects, Xi∗ and Zi∗. N is the total
number of observations, thus each observation is a row in both matrices. This
is an advantage when there is wide variation in the timing of observations.
The algorithm requires a prior distribution for the variance parameters. The
prior is specified as a list of four components: a, Binv, c, and Dinv. For an
uninformative prior, a = r, c = r × q, Binv is an r × r identity matrix, and Dinv is
an rq × rq identity matrix, where q is the number of random effects.
The following code generates the first two imputed datasets for a single
longitudinal variable (FACT T2):
R> # Create Matrix of Predictors (X and Z)
R> Intcpt = matrix(1,ncol=1,nrow=length(Lung$PatID)) # Intercept
R> TxMONTHS=Lung$ExpTx*Lung$MONTHS # Interaction
R> pred=cbind(Intcpt,Lung$MONTHS,TxMONTHS,Lung$cycles,
+ Lung$PD_LT6,Lung$ln_surv)
R> xcol=c(1,2,3,4,5,6) # X columns
R> zcol=c(1,2) # Z columns
R> results1=pan(y,Lung$PatID,pred,xcol,zcol,prior,seed=13579,iter=1000)
R> Lung$Y1=results1$y
R> results2=pan(y,Lung$PatID,pred,xcol,zcol,prior,seed=77777,iter=1000)
R> Lung$Y2=results2$y
This is repeated for each of the M imputations.
The pan function also allows multiple longitudinal variables. The following
example simultaneously imputes values of the Functional Well-being, Physical
Well-being, and Additional Concerns subscales of the FACT-L:
R> # Multiple Longitudinal Variables
R> y=cbind(Lung$FUNC_WB2,Lung$PHYS_WB2,Lung$ADD_CRN2)
R> I2=cbind(c(1,0),c(0,1)) # 2x2 identity matrix (q=2 random effects)
R> I3=cbind(c(1,0,0),c(0,1,0),c(0,0,1)) # 3x3 identity matrix
R> I6=I2 %x% I3 # 6x6 identity matrix
R> prior <- list(a=3,Binv=I3,c=6,Dinv=I6)
R> results1=pan(y,Lung$PatID,pred,xcol,zcol,prior,seed=13579,iter=1000)
R> Y1=results1$y
R> results2=pan(y,Lung$PatID,pred,xcol,zcol,prior,seed=77777,iter=1000)
R> Y2=results2$y
This is repeated for each of the M imputations.
With both of these examples, the pan function for some seeds sometimes
took much longer to execute. Checking the results using the str function
indicated that there was a problem with convergence, though no error
message was generated. A new seed was used when this occurred.
$$\bar{\beta} = \frac{1}{M}\sum_{m=1}^{M}\hat{\beta}^{(m)} \qquad (9.14)$$
$$\bar{\theta} = \frac{1}{M}\sum_{m=1}^{M}\hat{\theta}^{(m)} \qquad (9.15)$$
Variance of Estimates
The total variance of the parameter estimates incorporates both the average
within imputation variance of the estimates (Ūβ and Ūθ ) and the between
imputation variability of the M estimates (Bβ and Bθ ). The total variance
(V ) is computed by the sum of the within-imputation component (Ū ) and
the between-imputation component (B) weighted by a correction for a finite
number of imputations (1 + M −1 ).
$$\bar{U}_{\beta} = \sum_{m=1}^{M} Var(\hat{\beta}^{(m)})/M \qquad (9.16)$$
$$B_{\beta} = \frac{1}{M-1}\sum_{m=1}^{M}(\hat{\beta}^{(m)} - \bar{\beta})^2 \qquad (9.17)$$
$$V_{\beta} = \bar{U}_{\beta} + \left(1 + \frac{1}{M}\right)B_{\beta} \qquad (9.18)$$
The procedure is identical for both β and θ. Finally, the standard errors are
simply the square root of the variances.
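Equations 9.14-9.18 are simple to apply; a Python sketch with five hypothetical imputed-data results:

```python
def pool_mi(estimates, variances):
    """Combine M imputed-data results: pooled estimate (eq 9.14), within-
    imputation variance (9.16), between-imputation variance (9.17), and
    total variance with the finite-M correction (9.18)."""
    M = len(estimates)
    beta_bar = sum(estimates) / M
    U_bar = sum(variances) / M
    B = sum((b - beta_bar) ** 2 for b in estimates) / (M - 1)
    V = U_bar + (1 + 1 / M) * B
    return beta_bar, V

est = [2.1, 2.4, 1.9, 2.6, 2.0]          # hypothetical treatment effects
var = [0.30, 0.28, 0.33, 0.29, 0.31]     # their squared standard errors
beta_bar, V = pool_mi(est, var)
print(beta_bar, V, V ** 0.5)             # pooled estimate, total variance, SE
```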
Tests of β̂ = β0 or θ̂ = θ0
For tests with a single degree of freedom (θ is a scalar), confidence interval
estimates and significance levels can be obtained using a t distribution with
ν degrees of freedom [Rubin and Schenker, 1986, Rubin, 1987].
$$t_{\beta} = (\bar{\beta} - \beta_0)/V_{\beta}^{1/2} \sim t_{\nu_{\beta}} \qquad (9.22)$$
This approximation assumes the dataset is large enough that if there were no
missing values, the degrees of freedom for standard errors and denominators
of F statistics are effectively infinity. Barnard and Rubin [1999] suggest an
adjustment for the degrees of freedom for small sample cases. However, in
most cases, if the dataset is large enough to use multiple imputation techniques
then this adjustment will not be necessary.
$$\nu_{m\beta}^{*} = \left(\frac{1}{\nu_{m\beta}} + \frac{1}{\nu_{obs}}\right)^{-1} \qquad (9.23)$$
$$\nu_{obs} = (1 - \gamma_{m\beta})\,\nu_{O}(\nu_{O} + 1)/(\nu_{O} + 3) \qquad (9.24)$$
The estimates, standard errors and associated test statistics of the fixed
effects and covariance parameters are reported for the pooled analysis. In
SPSS version 17.0.2 the estimates associated with the TEST option are not
pooled and the analyst will need to perform these calculations by hand using
the formulas presented at the beginning of this section.
R> est.beta=rbind(M1$coef$fix,M2$coef$fix,M3$coef$fix,M4$coef$fix,
+ M5$coef$fix)
R> est.beta
R> se.beta=sqrt(rbind(diag(M1$varFix),diag(M2$varFix),diag(M3$varFix),
+ diag(M4$varFix),diag(M5$varFix)))
R> se.beta
The function mi.inference from the cat package generates the pooled
estimates. Linear contrasts of the primary parameters can also be generated
(see Chapter 4) and pooled in the same manner.
library(cat)
pooled1=mi.inference(est.beta[,1],se.beta[,1]) # Beta1
pooled2=mi.inference(est.beta[,2],se.beta[,2]) # Beta2
pooled3=mi.inference(est.beta[,3],se.beta[,3]) # Beta3
pooled=cbind(pooled1,pooled2,pooled3) # Join results
dimnames(pooled)[[2]]=dimnames(se.beta)[[2]] # Add column headings
pooled # Print
[Figure 9.5 panels appear here: "Average Scores" versus follow-up, without (left) and with (right) auxiliary variables.]
FIGURE 9.5 Study 3: Observed and imputed values using the MCMC pro-
cedure with and without auxiliary variables (complete or partial response,
early progressive disease, number of cycles of therapy and log(time to death))
for the Control (upper) and Experimental (lower) arms by time of last assessment.
Observed data are indicated by the solid line; imputed data by the
dashed line.
9.9 Summary
• MI provides a flexible way of handling missing data from multiple un-
related causes or when the mechanism changes over time. For example,
separate models can be used for early and late dropout.
• MI will only provide a benefit when the analyst has additional informa-
tion (data) that is related to HRQoL both when the response is observed
and missing.
• MI will be difficult to implement in studies with mistimed observations
and where the sample size is small.
• Markov chain Monte Carlo (MCMC) methods will be useful when the
missing data pattern is not strictly monotone and the same set of vari-
ables can be used to explain dropout over time.
• Approximate Bayesian bootstrap (ABB) is not recommended for longi-
tudinal studies as it ignores the correlation of observations over time.
10.1 Introduction
Mixture models have been proposed for studies with missing data; they were briefly introduced in Chapter 7 (see Section 7.4.2). The most well known are pattern mixture models. The strength of these models is that the portion of the model specifying the missing data mechanism (f[M]) does not depend on the missing values (Y^mis). Thus, for these mixture models, we only need to know the proportion of subjects in each stratum and we do not need to specify how missingness depends on Yi^mis. This is balanced by other assumptions
that are described throughout this chapter. In pattern mixture models, the
strata are defined by the pattern of missing assessments. Other strategies are
used for the other mixture models. In concept the procedure is simple, but
as I will illustrate in this chapter there are two major challenges. The first is to estimate all of the model parameters within each stratum. The second is to justify the assumption that estimates within each stratum are unbiased.
There are special cases where mixture models are useful, but there are also
numerous situations where justifying the assumptions will be difficult.
For example, let us assume that the change in HRQoL among subjects in each stratum could be described using a stratum-specific intercept (β0^{p}) and slope (β1^{p}). This would allow patients in one stratum (e.g. those who drop out earlier) to have lower HRQoL scores initially and to decline more rapidly over time. The same patients may also have more or less variability in their scores (different variance) than patients in other strata, thus the variance may also
© 2010 by Taylor and Francis Group, LLC
differ across strata. The quantities of interest are the marginal values of the parameters averaged over the strata

β̂ = Σ_{p=1}^{P} π̂^{p} β̂^{p}    (10.2)

where π^{p} is the proportion of subjects observed within the pth stratum.
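As a numeric sketch of equation 10.2 (the proportions and stratum estimates below are invented for illustration):

```r
# Pool stratum-specific slope estimates using the observed stratum
# proportions (equation 10.2); three strata, invented values.
pi.hat   <- c(0.25, 0.40, 0.35)         # proportion of subjects per stratum
beta.hat <- c(-8.0, -3.5, -1.0)         # stratum-specific slope estimates
beta.pooled <- sum(pi.hat * beta.hat)   # marginal estimate averaged over strata
beta.pooled
```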
10.1.2 Illustration
Pauler et al. [2003] illustrate the use of a mixture model in a trial of pa-
tients with advanced stage colorectal cancer. The patients were to complete
the SF-36 questionnaire at baseline, 6 weeks, 11 weeks and 21 weeks post
randomization. While the authors describe the model as a pattern-mixture
model, they did not form the strata based on the patterns of observed data but
on a combination of survival and completion of the last assessment. Specifi-
cally, they proposed two strategies for their sensitivity analysis. In the first,
they defined two strata for each treatment group based on whether the pa-
tient survived to the end of the study (21 weeks). In the second, they split
the patients who survived 21 weeks based on whether they completed the last
assessment. They assumed that the trajectory within each strata was linear
and that the covariance structure is the same across all strata. Thus two
untestable assumptions were made: missing data are ignorable within each
strata and that linear extrapolation is reasonable for strata with no assess-
ments after 11 weeks. The first assumption implies that within the group of
patients who did not survive 21 weeks, there are no systematic differences be-
tween those who die early versus later, and within those who survived, there
are no differences between those who drop out early versus later within a
stratum.
When applying a model that assumes linearity within strata, two questions remain: how influential will the small number of later assessments be on the estimate of the slope, and is extrapolation of the slope reasonable? Unfortunately there are no formal tests to answer these questions and the analyst will have to rely on clinical opinion and intuition.
For the purposes of illustration, let us assume that we accept the model
and assumptions as reasonable. Figure 10.1 displays in the lower plots the
estimated linear trajectories. The estimates from strata B1, formed from
individuals who died in the first 26 weeks, are noticeably different from those
of the other two strata. If we combine the estimates using the proportions
displayed in Table 10.1 with estimates obtained within each of the strata, we
note that the decline over 26 weeks within treatment groups is dramatically
(2 to 3 times) greater when estimated using either definition A or B for the
proposed mixture models (Table 10.2) than when estimated using a simple
mixed effect model without stratification (first row). The differences between groups are much smaller across the different analyses; all tests of treatment differences would be non-significant. The greater sensitivity to dropout of
the within group estimates of change than the between group comparisons is
very typical of studies where all treatment arms have similar reasons for and
proportion of missing data.
∗ This type of design is not a good policy as it can result in additional selection bias.
Attempts to assess the patient response should continue even when the previous assessment
is missing unless the patient has requested to have no further assessments.
combining the subjects in the smaller stratum with another stratum. When
we do this, we are making one of two possible assumptions. Either the data
that are missing within the combined strata are ignorable or the proportion
is so small that it will have a minimal effect on the estimates of parameters within that stratum.
One of the most typical methods of combining strata is to pool groups by
the timing of the last assessment. Hogan et al. [2004a] go one step further in
a study with dropout at each of the 12 planned assessments. Their strategy
was to first fit a mixed model with interactions between all of the covariate
parameters and indicators of the dropout times. They then plotted the es-
timates of the covariate effects versus the dropout times. From these plots,
they attempted to identify by visual inspection, natural groupings where the
parameters were roughly constant. They acknowledge that the procedure is
subjective, and the sensitivity to different groupings should be examined.
The fixed effects and the covariance parameters may differ across the patterns. The pooled parameter estimates are

β̂ = Σ_{p=1}^{P} π̂^{p} β̂^{p}.    (10.4)
But even with this simple model, we can immediately see the challenges. The
two fixed-effect parameters (intercept and slope) can be estimated only in the
patterns with at least two observations per subject. Thus we need to impose
an additional restriction to estimate the parameters for patterns with fewer than two observations. Consider the patterns observed in the lung cancer trial after
collapsing the 15 observed patterns by time of dropout (Table 10.1, Definition
C). None of the parameters are estimable in the patients with no data and
the slope is not estimable in the pattern with only the baseline assessment
without additional assumptions.
There are a number of approaches that one can take which may or may
not be reasonable in different settings. One strategy is to collapse patterns
until there is sufficient follow-up to estimate both the intercept and slope within each stratum. In this example, we would combine those with only the baseline assessment (Stratum C1) with those who dropped out after the second assessment (Stratum C2).
β̂0^{1} = β̂0^{2} and β̂1^{1} = β̂1^{2}.    (10.6)
This solves the problem only if the resulting parameters are an unbiased rep-
resentation of the subjects in the pooled patterns. There are no formal tests to
assure this, but examining the estimates prior to pooling and understanding
the reasons for dropout will inform the decision. Another possible restriction
is to allow different intercepts (β̂0^{1} ≠ β̂0^{2}) and to assume that the slope for subjects with only the baseline assessment (Stratum C1) is the same as the slope for subjects with two assessments (Stratum C2):

β̂1^{1} = β̂1^{2}.    (10.7)
The plots in the upper half of Figure 10.2 illustrate this for the lung cancer
trial.
An alternative approach is to place parametric assumptions on the param-
eters using the time of dropout as a covariate [Curren, 2000, Michiels et al.,
1999]. For example, one might assume that there was an interaction between
the time of dropout (d^{p}) and the intercept and slope in each pattern. With
both linear and quadratic terms for the time of dropout the model might
appear as:
[Figure 10.2: predicted means versus weeks post randomization for the Control and Experimental arms.]

Yij = βXij + αWip + εij
where Wip is a known function of the P dropout times or strata. In this model,
we wish to estimate the parameters, β. The terms αWip model the pattern
specific deviations from βX. If we restrict the expected value of Wip to be
zero, the expected value of Yij is βXij and estimates of β̂ will come directly
from the model. Typically Wip will be constructed from centered indicator
variables for the time of dropout or the strata.
To illustrate the equivalence of the two models, consider a very simple example with a single group and two strata containing one-third and two-thirds of the subjects, respectively. Both models will assume that change is linear over time. If Di1 is an indicator of belonging to the first stratum, then the average of Di1 across all subjects is D̄1 = 1/3. The centered indicator variable Wi1 = Di1 − D̄1 equals 1 − 1/3 = 2/3 for subjects in the first stratum and 0 − 1/3 = −1/3 for subjects not in the first stratum. Table 10.3 illustrates the equivalence of the two strategies.
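The equivalence can also be checked numerically; the data below are invented, with stratum 1 holding one-third of the subjects (mean 40) and stratum 2 two-thirds (mean 55).

```r
# Centered-indicator construction for two strata (invented, noise-free data)
D1 <- c(rep(1, 10), rep(0, 20))   # indicator of stratum 1 membership
W1 <- D1 - mean(D1)               # centered indicator: 2/3 or -1/3
y  <- ifelse(D1 == 1, 40, 55)     # outcomes set to the stratum means

# Weighted-sum strategy (equation 10.2)
pooled <- (1/3) * 40 + (2/3) * 55

# Centered-indicator strategy: because mean(W1) = 0, the intercept of the
# regression on W1 is the marginal (pooled) mean
fit <- lm(y ~ W1)
c(pooled = pooled, intercept = unname(coef(fit)[1]))
```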
To illustrate the procedure for multiple dropout times, consider the lung
cancer trial where dropout is defined by the time of the last assessment (Def-
inition C, Table 10.1). Di would be a vector of indicator variables where
Dik = 1 if the last assessment occurred at the k th assessment and Dik = 0
otherwise. So for an individual who dropped out after the third assessment,
Di = (0, 0, 1, 0). Because the elements of Di always sum to one, we must drop
one of the indicator variables. It does not matter which, so for illustration we
drop the last indicator. We then center the indicator variables: Wi = Di − D̄i
where D̄i is the average of Di, or the proportions in each dropout group. Thus, for an individual in the experimental arm who dropped out after the third assessment, Wi = (0, 0, 1) − (D̄1, D̄2, D̄3).
where Wik , k = 2, 3, 4 are centered indicators of dropout after the 2nd, 3rd
and 4th assessments. Note that we did not include the term α12 tij Wi2 , which
would allow the slopes to be different in the 1st and 2nd strata. Omitting the
term imposes the restriction that the slopes are equal in those two patterns
(equation 10.7).
We can also implement the second restriction described by equation 10.8
as follows:
where W1i = TiD − T̄D and W2i = (TiD)² − mean[(TiD)²] are the centered time of dropout (TiD) and the centered squared time of dropout.
[Figure: predicted means versus weeks post randomization for the Control and Experimental arms under Definition C.]
One option uses the data after 17 weeks from strata C5/6 to extrapolate the curve for those in strata C3/4. It is immediately obvious that there is considerable variation in the
extrapolated estimates, with some trajectories dropping below the lower limit
of the scale. Perhaps some scenarios could be eliminated as clinically unre-
alistic, such as the assumption that scores would be maintained over time in
patients who drop out primarily due to disease progression. In summary, this
example illustrates the difficulty of identifying a set (or sets) of restriction(s)
that are clinically reasonable, especially if this is required prior to examining
the data.
[Figure: predicted means versus weeks post randomization for the Control (left) and Experimental (right) arms under three restrictions: mean carried forward (upper), slope extrapolated (middle), and last change from neighbor (lower).]
We have assumed that the proportions are known (fixed) quantities when
they are in fact unknown quantities that we have estimated. Thus, the result-
ing standard errors of the pooled estimates will underestimate the actual stan-
dard errors (see Section 10.5). The use of the macro variables will facilitate implementation of any bootstrapping procedures to obtain accurate estimates of the standard errors (or reuse of the same code as the study matures).
Note that because the variables that define the strata are centered, the
parameters associated with the terms Exp and Exp*Weeks are the same as
those obtained in the previous section.
10.3.5 Implementation in R
I will again use the mixture model with Definition B (Table 10.1) for the lung cancer trial to illustrate the weighted sum and centered indicator approaches in R. The first steps are required in both approaches.
We will then need the proportion of subjects in each strata to construct the
pooled estimates.
# Calculate proportions (Note equal # obs per subject)
Freqs=table(Lung$Exp,Lung$Strata)
Props=prop.table(Freqs,1)
print(Props,digits=3)
Prop1=Props[1,]
Prop2=Props[2,]
C7=cbind(null,null,Prop1*26*-1,Prop2*26); dim(C7)=c(1,12)
CStrat=rbind(C1,C2,C3,C4,C5,C6,C7)
rownames(CStrat)=c("T0 Cntl","T0 Exp","Wk26 Cntl","Wk26 Exp",
"Chg Cntl","Chg Exp","Diff")
CentModel=lme(fixed=FACT_T2~0 + ExpF+ExpF:WEEKS+
ExpF:M1+ExpF:WEEKS:M1 + ExpF:M2+ExpF:WEEKS:M2,
data=Lung,random=~1+WEEKS|PatID,na.action = na.exclude)
summary(CentModel) # Detailed listing of results
The linear combinations and associated test statistics of the parameter es-
timates are then obtained as follows:
# Construct C matrix to generate Theta=C*Beta
null=c(0,0,0,0)
C1=c(1, 0, 0, 0); C1=cbind(C1,null,null); dim(C1)=c(1,12)
C2=c(0, 1, 0, 0); C2=cbind(C2,null,null); dim(C2)=c(1,12)
C3=c(1, 0, 26, 0); C3=cbind(C3,null,null); dim(C3)=c(1,12)
The simplest case for longitudinal studies consists of two repeated measures,
typically a pre- and post-intervention measure. There are four possible patterns of missing data (Table 10.5). The first pattern, in which all responses
are observed, contains the complete cases. The second and third patterns
have one observation each. In the fourth pattern, none of the responses is
observed. In trials where the pre-intervention measure is required, only the
first two patterns will exist.
In each of the four patterns, there are five possible parameters to be estimated: two means (μ̂1^{p}, μ̂2^{p}) and three parameters for the covariance (σ̂11^{p}, σ̂12^{p}, σ̂22^{p}). We can estimate 9 of the 20 total parameters from the
data: all five parameters from pattern 1 and two each from patterns 2 and 3.
Thus the model is underidentified and some type of restriction (assumption)
is required to estimate the remaining parameters.
Since we cannot estimate the parameters in the second equation due to missing
data, we assume the same relationship holds in pattern 2 as in pattern 1.
β0[2·1]^{2} = β0[2·1]^{1},  β1[2·1]^{2} = β1[2·1]^{1}    (10.15)
† This term was used in the original article to indicate protecting against nonrandomly missing data.
σ̂11 = (1/N) Σi (Yi1 − μ̂1)²

σ̂22 = σ̂22^{1} + b(λ)² (σ̂11 − σ̂11^{1})

Vb = var[b(λ)] = (σ̂11^{1} σ̂22^{1} − (σ̂12^{1})²)(λ² σ̂22^{1} + 2λ σ̂12^{1} + σ̂11^{1})² / [n1 (λ σ̂12^{1} + σ̂11^{1})⁴]
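These computations can be sketched numerically. The form of the slope b(λ) below follows Little [1994] and should be checked against the text; all input values are invented for illustration.

```r
# Numeric sketch of the bivariate sensitivity analysis. The slope
# b(lambda) = (lambda*s22.1 + s12.1) / (lambda*s12.1 + s11.1) is an
# assumption here (from Little [1994]); all inputs are invented.
s11.1 <- 120; s12.1 <- 70; s22.1 <- 130   # pattern-1 covariance estimates
mu1.1 <- 65;  mu2.1 <- 60                 # pattern-1 means
mu1   <- 62;  s11   <- 125                # mean and variance of Y1, all subjects
n1    <- 100                              # number of complete cases

protective <- function(lambda) {
  b   <- (lambda * s22.1 + s12.1) / (lambda * s12.1 + s11.1)
  mu2 <- mu2.1 + b * (mu1 - mu1.1)        # adjusted mean at 12 weeks
  s22 <- s22.1 + b^2 * (s11 - s11.1)      # adjusted variance at 12 weeks
  Vb  <- (s11.1 * s22.1 - s12.1^2) *
         (lambda^2 * s22.1 + 2 * lambda * s12.1 + s11.1)^2 /
         (n1 * (lambda * s12.1 + s11.1)^4)
  c(b = b, mu2 = mu2, s22 = s22, Vb = Vb)
}

protective(0)     # CCMV restriction (lambda = 0)
protective(1e6)   # approximates Brown's protective restriction (lambda -> Inf)
```

Varying λ between these two extremes traces out the sensitivity analysis illustrated in Table 10.6.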
To illustrate, consider the initial and 12-week assessments in the lung cancer
trial using only subjects who had a baseline assessment. Application of the
sensitivity analysis proposed by Little is illustrated in Table 10.6. The two
special cases, the CCMV (λ = 0) and Brown’s protective (λ = ∞) restrictions,
are included. The estimates of the 12-week means (μ̂h2 ) and of the change over
time (μ̂h2 − μ̂h1 ) are very sensitive to the value of λ, with the estimated change
increasing in magnitude as the missingness is assumed to depend more heavily
on the missing 12-week values. The differences between the two treatments are less sensitive to the value of λ, and all differences are non-significant. Thus one
would feel much more confident making inferences about treatment differences
at 12 weeks than one would making inferences about the presence or absence
of change over time within each group. However, with approximately 50%
of the 12 week data missing, all conclusions about the differences should be
made cautiously. It is also not surprising that the estimated variance of the
parameters increases as the dependence shifts from the observed data (Yi1 ) to
the missing data (Yi2 ).
Little and Wang [1996] extend the bivariate case for normal measures to a
more general case for multivariate measures with covariates. However, their
extension is still limited to cases where there are only two patterns of missing
data, one of which consists of complete cases. For the non-ignorable missing data, the multivariate analog of the restriction is used (Θ[1·2]^{2} = Θ[1·2]^{1}).
The model is just identified when the number of missing observations exactly
equals the number of non-missing observations in the second pattern. Ex-
plicit expressions can be derived for the maximum likelihood estimates. The
model is overidentified when the number of non-missing observations exceeds
the number of missing observations. Additional restrictions are required if
the number of missing observations is greater than the number of non-missing
observations.
In the lung cancer trial, patients with one of the four patterns of data conform-
ing exactly to a monotone dropout pattern account for 85% of the patients.
In practice one is reluctant to omit 15% of the subjects from the analysis.
However, for the purpose of illustration, we will use only the patients with a
monotone dropout pattern. Note that each pattern has approximately one-fourth of the patients, so that no pattern contains fewer than 25 subjects (Table 10.7).
In the CCMV restriction, the data from the subjects in pattern 1 are used
to impute the means for the missing observations in the remaining patterns:
θ[4·123]^{2} = θ[4·123]^{1}    (10.30)

θ[34·12]^{3} = θ[34·12]^{1}    (10.31)

θ[234·1]^{4} = θ[234·1]^{1}    (10.32)
In the ACMV restriction, available data from subjects in all the patterns
are used to impute the means for the missing observations in the remaining
patterns. The restrictions for the patterns in Table 10.7 are:
θ[4·123]^{2} = θ[4·123]^{1}    (10.33)

θ[4·123]^{3} = θ[4·123]^{1},  θ[3·12]^{3} = θ[3·12]^{1,2}    (10.34)

θ[4·123]^{4} = θ[4·123]^{1},  θ[3·12]^{4} = θ[3·12]^{1,2},  θ[2·1]^{4} = θ[2·1]^{1,2,3}    (10.35)
This restriction is a bit more feasible than the CCMV restriction as more
observations are used to estimate some of these parameters. It is important
to note that the results using this restriction will be the same as MLE of
all available data [Curren, 2000]. While this restriction is important when
trying to understand methods, MLE of all available data is much easier to
implement.
FIGURE 10.5 Study 3: Observed and imputed means for control (left)
and experimental (right) arms under CCMV (upper), ACMV (middle), and
NCMV (lower) restrictions displayed by pattern of dropout. Observed means
indicated by solid line. Imputed means indicated by dashed lines.
Of the three restrictions, this is the one that might be the most useful. How-
ever, as will be demonstrated in the following example, the assumption that
the missing values are random conditional on the nearest neighbor may be
hard to justify for the later assessments as the neighboring cases are the subjects with complete data. Specifically, there is an assumption that the relationship between assessments is similar in those who have complete data and in those who drop out early in the trial.
Implementation
With four assessments, these restrictions result in six equations that must
be solved for the unknown means and variance parameters. Although this is
burdensome, solving the equations for the unknown parameters is straightforward. Deriving the appropriate variance for the pooled estimates is very complex. Curren [2000] suggests an analytic technique using multiple imputation (see Section 9.6) to avoid this problem.
The procedure for the NCMV restriction is as follows:
The imputed values (upper plots) under the CCMV restriction tend to increase initially after the last observed FACT-Lung TOI score, especially for patients who had only the initial assessments. This illustrates the consequences of the
implicit assumption with the CCMV restriction. Obviously, in the setting of
the lung cancer trial, we do not believe that the HRQoL of the individuals who
drop out early is likely to be similar to that of individuals who have all four
assessments. The imputed values (middle plot) under the ACMV restriction
no longer tend to increase, but rather tend to remain at the same level as the
last observed FACT-Lung TOI measure. This is a slight improvement over the
CCMV, but still seems to overestimate the HRQoL of patients who drop out.
The imputed values (lower plots) under the NCMV restriction tend to fall
initially over time, especially for subjects who dropped out early, but tend to
increase by the last assessment. This would seem to be the most appropriate
of the three restrictions, but it may still overestimate the HRQoL of subjects
who drop out of the study especially at the last assessment.
FIGURE 10.6 Study 3: Estimated means for the control (left) and experi-
mental arms (right) under CCMV, ACMV and NCMV restrictions for patients
with monotone dropout patterns. Curves from highest to lowest correspond
to CCMV, ACMV and NCMV.
If the estimated proportions are treated as fixed, the estimated variance will be smaller than the true variance, potentially inflating the Type I errors associated with any tests of hypotheses.
π̂^P and β̂^P are the stacked vectors of the r estimates of π^{p} and the c estimates of β^{p} from all P patterns. Ak is an r × c matrix of known constants (usually 0s and 1s) derived from the partial derivatives of equation 10.4 with respect to π^P and β^P that causes the appropriate estimates to be multiplied:

Ak = ∂²[Σp π̂^{p} β̂^{p}] / (∂π^{p} ∂β^{p})    (10.42)
When the proportions are known (and not estimated), the variance of β̂k is solely a function of the variance of the parameter estimates:

Var[β̂k] = (π̂^P)′ Ak Var[β̂^P] Ak′ π̂^P.    (10.43)
Using the delta method (Taylor series approximation) to approximate the variance of the pooled estimates:

Var[β̂k] = (π̂^P)′ Ak Var[β̂^P] Ak′ π̂^P + (β̂^P)′ Ak′ Var[π̂^P] Ak β̂^P    (10.44)
The columns that correspond to the intercept parameters are all zero as these parameters are not used in the estimate of the slope. Then if π^{p} = [πh^{1}, πh^{2}, πh^{3}, πh^{4}], the second derivative with respect to π is:

∂²βh1 / (∂π^{p} ∂β^{p}) = ∂[0, πh^{1}, 0, πh^{2}, 0, πh^{3} + πh^{4}, 0, 0] / ∂πh^{p}

    ⎡ 0 1 0 0 0 0 0 0 ⎤
  = ⎢ 0 0 0 1 0 0 0 0 ⎥
    ⎢ 0 0 0 0 0 1 0 0 ⎥
    ⎣ 0 0 0 0 0 1 0 0 ⎦ .
This is more work than one needs to do to derive the pooled estimates for a
single estimate, but it provides a general framework with wide applications.
This approximation is appropriate for large and moderate sized samples.
{p}
In very large samples, V ar(π̂h ) is very small and for all practical purposes
ignorable. Implementation is facilitated by software that allows matrix ma-
nipulation (e.g. SAS Proc IML or R).
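The matrix calculation in equation 10.44 can be sketched in R. The matrices below are invented and deliberately small (r = 2 strata, c = 2 stacked parameters), with Ak taken as the identity for simplicity.

```r
# Delta-method variance of a pooled estimate (equation 10.44), invented inputs
pi.hat   <- c(0.4, 0.6)                  # estimated stratum proportions
beta.hat <- c(-2.0, -5.0)                # stratum-specific slope estimates
A        <- diag(2)                      # matrix of known constants (0s and 1s)
V.beta   <- diag(c(0.30, 0.45))          # Var of the stacked parameter estimates
V.pi     <- diag(c(0.002, 0.002)) - 0.002 * (1 - diag(2))  # Var of proportions

# First term: contribution of the sampling variation in the beta estimates
term1 <- t(pi.hat) %*% A %*% V.beta %*% t(A) %*% pi.hat
# Second term: contribution of the sampling variation in the proportions
term2 <- t(beta.hat) %*% t(A) %*% V.pi %*% A %*% beta.hat
var.pooled <- term1 + term2
var.pooled
```

Dropping the second term gives equation 10.43, the variance when the proportions are treated as known.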
To obtain standard errors that account for the estimated proportions, we can bootstrap the entire procedure, sampling with replacement the subjects. In most cases, we will do this within each treatment group. As a word of caution, many statistical analysis packages contain procedures/functions that perform a bootstrap procedure; however, most are designed for settings where all observations are independent. In most clinical trials we are working with correlated longitudinal observations and need to sample patients rather than observations.
The basic procedure assumes that there are two sets of data, one with only a
single observation containing subject level data and the second with multiple
observations containing the longitudinal data. The procedure is as follows:
1. Sample with replacement N subjects from the first set of data, where N
is the number of subjects. Combine data from the strata and generate
a unique identifier for each of the N subjects in the bootstrap sample.
2. Merge the bootstrap sample with the longitudinal data.
3. Analyze the bootstrap sample as appropriate and save the estimates of
interest.
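The steps above can be sketched in base R on a toy data set; all identifiers and values below are invented, and a simple mean stands in for refitting the mixed model in step 3.

```r
# Subject-level bootstrap (steps 1-3 above) on an invented toy data set
set.seed(42)
subj <- data.frame(PatID = 1:6, Exp = rep(0:1, each = 3))   # subject-level data
long <- data.frame(PatID = rep(1:6, each = 4),              # longitudinal data
                   Week  = rep(c(0, 6, 12, 26), times = 6),
                   Score = rnorm(24, mean = 70, sd = 10))

boot.once <- function() {
  # 1. Sample subjects with replacement within each treatment group and
  #    assign a new unique identifier to each sampled subject.
  ids  <- unlist(lapply(split(subj$PatID, subj$Exp),
                        function(g) sample(g, length(g), replace = TRUE)))
  samp <- data.frame(PatID = ids, NewID = seq_along(ids))
  # 2. Merge the bootstrap sample with the longitudinal data; subjects drawn
  #    more than once contribute all of their rows once per draw.
  bs <- merge(samp, long, by = "PatID")
  # 3. Analyze the bootstrap sample; here a simple mean stands in for the model.
  mean(bs$Score)
}

boot.est <- replicate(200, boot.once())
sd(boot.est)   # bootstrap standard error of the statistic
```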
10.6 Summary
• Mixture models have the advantage that a model for the dropout mech-
anism does not need to be specified.
• The validity of the restrictions cannot be tested and the results may be sensitive to the choice of restrictions.
11.1 Introduction
In this chapter I present three models that all assume that there is random
variation among subjects that is related to the time of dropout. The models
incorporate the actual time of dropout or another outcome that is related to
dropout. In some trials, it is reasonable to believe that the rate of change over
time (slope) in HRQoL is associated with the length of time a subject remains
on the study. This is typical of patients with rapidly progressing disease where
more rapid decline in HRQoL is associated with earlier termination of the
outcome measurement.
The first model is the conditional linear model (CLM) proposed by Wu and
Bailey [1989] (see Section 11.2). Each individual’s rate of change in HRQoL
is assumed to depend on covariates and the time to dropout where the time
to dropout is known. While this model is rarely used in practice, it forms the
basis for this class of models. The second model is a varying coefficient model
(VCM) proposed by Hogan et al. [2004b] that expands this idea to include
variation in the intercept and uses a semi-parametric method to model the re-
lationship of the outcome to the time of dropout. The third model is the joint
model with shared parameters proposed by Schluchter [1992] and DeGruttola
and Tu [1994]. This model relaxes the assumption that the time to dropout is
observed in all subjects (allowing censoring) but assumes a parametric distri-
bution for the time to dropout. All of these models are appropriate to settings
where simple growth curve models describe the changes in the outcome and
there is variation in the rates of change among the individual subjects (Ta-
ble 11.1). Further distinctions between these models are discussed in more
detail later in the chapter.
Random Effects
In all three models, a practical requirement is non-zero variation in the random
effects, di . If the variance of the random effects is close to zero (V ar(di ) ≈ 0),
it is difficult, if not impossible, to estimate the association between the random
effects and the time to dropout. Prior to embarking on analyses using the
joint model, it is wise to check the estimates of variance in the simpler mixed-
effects model that you intend to use in the joint model. In SAS, the COVTEST option of the MIXED procedure provides estimates and tests of the variance of the random effects.
TABLE 11.1 General requirements of the conditional linear model (CLM), varying-coefficient model (VCM) and joint shared parameter model (Joint).

Model Characteristic     CLM            VCM                Joint
Repeated measures        No             No                 No
Growth curve model       Linear^1       Yes                Yes
Random effects           Slope only^2   Intercept + Slope  Flexible
Baseline missing         Not allowed    Allowed            Allowed
Mistimed observations    Allowed        Allowed            Allowed
Monotone dropout         Yes            Yes                Yes
Intermittent pattern     Yes if MAR     Yes if MAR         Yes if MAR
Censoring of T^D         No             No                 Yes

^1 Higher order polynomials possible but challenging.
^2 Random intercept is unrelated to dropout.
Alternatives to Dropout
In some settings, dropout may occur for various reasons, only some of which would be related to the subject's trajectory. In general we are looking for
a characteristic such that conditional on the observed HRQOL outcome and
that characteristic, the missing data are ignorable. Alternatives for time to
dropout might include time to death or disease progression in the cancer trials
or changes in the frequency of migraines in the migraine prevention trial.
with random variation of the intercept, βi1 , and rate of change (slope), βi2 .∗
Each individual’s slope may depend on M covariates (Vmi ), the initial value
of the outcome (Yi1 ), as well as a polynomial function of the time of dropout
(TiD ). The form of the relationship is allowed to vary across the h treatment
groups. The expected slope for the ith individual is

E[βi2 | TiD, Vmi, Yi1] = Σ_{l=0}^{L} γhl (TiD)^l + Σ_{m=1}^{M} γh(L+m) Vmi + γh(L+M+1) Yi1    (11.4)
where γh0 , . . . , γh(L+M+1) are the coefficients for the hth group and L is the
degree of the polynomial. The intercept for the ith individual, βi1 , is not
dependent on covariates or the time of dropout, thus, the expected intercept
is
E[βi1 ] = βh1 . (11.5)
The mean slope in the hth group is the expected value of the individual slopes
of the subjects in the hth group (i ∈ h):
βh2 = E_{i∈h}[βi2 | TiD, Vmi, Yi1].    (11.7)

β̂h2 = Σ_{l=0}^{L} γ̂hl (T̄hD)^l + Σ_{m=1}^{M} γ̂h(L+m) V̄hm + γ̂h(L+M+1) Ȳh1    (11.8)
where T̄hD is the mean dropout time in the hth group, V̄hm is the mean of the
mth covariate in the hth group and Ȳh1 is the mean of the baseline measure
in the hth group. In a randomized trial, pre-randomization characteristics are
theoretically the same in all treatment groups. To avoid introducing differ-
ences that are the result of random differences in these baseline characteristics,
V̄hm and Ȳh1 are estimated using all randomized subjects (V̄m and Ȳ1 ). Cen-
tering the values of TiD , Vmi , and Yi1 so that T̄hD = 0, V̄m = 0, and Ȳ1 = 0
facilitates the analysis as the estimates of the slopes are now the conditional
estimates.
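A small simulated check of this centering argument, with invented coefficients: when every regressor in equation 11.4 is centered, the fitted intercept estimates the mean slope directly.

```r
# Simulated check: with TiD, Vmi, and Yi1 centered, the intercept of the
# regression of individual slopes on these covariates is the mean slope.
# All coefficients and distributions below are invented.
set.seed(7)
n  <- 200
TD <- runif(n, 1, 8)                      # time of dropout
V  <- rbinom(n, 1, 0.4)                   # a baseline covariate
Y1 <- rnorm(n, 68, 10)                    # baseline outcome
slope <- -1.2 + 0.15 * TD - 0.01 * TD^2 + 0.3 * V + 0.05 * Y1 +
         rnorm(n, 0, 0.1)                 # individual slopes

# Every regressor centered: the fitted intercept equals the mean slope
fit <- lm(slope ~ I(TD - mean(TD)) + I(TD^2 - mean(TD^2)) +
                  I(V - mean(V)) + I(Y1 - mean(Y1)))
c(intercept = unname(coef(fit)[1]), mean.slope = mean(slope))
```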
11.2.1 Assumptions
There are several practical consequences and assumptions of this model:
3. All subjects must have a baseline measurement and complete data for
the selected covariates (V) or they will be excluded.
4. The time of dropout is known for all subjects or subjects with an assess-
ment at the last follow-up behave as if they dropped out immediately
after that point in time.
variance of the slopes is significant, but only moderately different from zero
(Table 11.3).
Step 2: Identify baseline covariates that might predict change and test the
significance of their interaction with time in a mixed-effects model
Model 2: Yij = βhi1 + γh0 tij + Σ_{m=1}^{M} γhm Vmi tij + εij    (11.10)
Symptoms of metastatic disease (SX MET) was the strongest predictor of the
slope among the available demographic† and disease measures prior to treat-
ment.‡ Note that the three way interaction of treatment, time and SX MET C
is added to the model, where SX MET C is centered (SX MET - the average
value of SX MET). Addition of symptoms of metastatic disease prior to treat-
ment explained a significant proportion of the variability of the outcome
(χ22 = 8.14, p = 0.02), but did not affect the estimates of the slopes in the two
treatment groups (Table 11.3).
Step 3: Add the baseline measure of the outcome variable to the model
and test for an interaction with time.
Model 3: Yij = βhi1 + γh0 tij + Σ_{m=1}^{M} γhm Vmi tij + γh(M+1) Yi1 tij + εij
There are several possibilities at this point for defining TiD , including the
planned time of the last assessment and the observed time of the last assess-
ment. In the following illustration, TiD was defined as the observed time in
months from randomization to the last HRQoL assessment centered around
the average time in the trial for each treatment group (LAST MO). Other possi-
bilities are the time to disease progression, termination of therapy, or death. In
the Lung Cancer example, a linear (Model 4: L = 1) and quadratic (Model 5:
L = 2) model for the time of dropout were tested. The quadratic model pro-
vided the best fit. Addition of the time of dropout also explained a significant
proportion of the variability of the slope (Model 3 vs. 4: χ²₂ = 16.2, p < 0.001;
Model 4 vs. 5: χ²₂ = 7.8, p = 0.02). In contrast with the previous models,
there was a dramatic effect on the estimates of the slopes, with a doubling of
the estimated rate of decline in both treatment groups when the linear inter-
action was added and a tripling of the rates when the quadratic interaction
was added (Table 11.3). The differences in the estimates of the slope between
Models 4 and 5 suggest that the estimates of change within each group can
be very sensitive to the form of Σ_{l=1}^{L} γhl (TiD)^l tij. (In the next section (Sec-
tion 11.3), an alternative model that creates a semi-parametric model for this
relationship addresses this concern.)
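The quoted likelihood-ratio p-values can be sanity-checked by hand; for a χ² statistic with 2 degrees of freedom, the upper-tail probability has the closed form exp(−x/2). A quick check in Python (an illustrative sketch, not the book's software):

```python
import math

# Upper-tail probability of a chi-square statistic with 2 df is exp(-x/2)
def chi2_2df_pvalue(x):
    return math.exp(-x / 2.0)

print(round(chi2_2df_pvalue(8.14), 2))    # 0.02   (adding SX_MET_C)
print(round(chi2_2df_pvalue(16.2), 4))    # 0.0003 (< 0.001, Model 3 vs. 4)
print(round(chi2_2df_pvalue(7.8), 2))     # 0.02   (Model 4 vs. 5)
```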
Of practical note, variation in the slopes may exist in Model 1 but disappear
as more variation of the slopes is explained. As the variance approaches
zero, problems with convergence of the algorithm can develop and the second
random effect may need to be dropped from the model. In the Lung Cancer
example, the addition of baseline as an interaction explained approximately
70% of the variability of the slopes (Table 11.3).
Table 11.5 contrasts the results of the conditional linear model with the
two pattern mixture models described in the previous chapter. The estimates
of change are between those estimated under the MAR assumptions and the
pattern mixture models.
Implementation in SAS
The SAS statements for Model 1 are:
PROC MIXED DATA=work.analysis METHOD=ML COVTEST;
CLASS Trtment PatID;
MODEL FACT_T2=Trtment Trtment*MONTHS/NOINT SOLUTION;
RANDOM INTERCEPT MONTHS/SUBJECT=PatID TYPE=UN;
ESTIMATE 'DIFF - SLOPE' Trtment*MONTHS -1 1;
RUN;
In Model 2, all the statements remain the same except for the addition of
the term Trtment*MONTHS*SX_MET_C to the MODEL statement:
MODEL FACT_T2=Trtment Trtment*MONTHS Trtment*MONTHS*SX_MET_C
/NOINT SOLUTION;
In Model 3 we add Trtment*BASELINE*MONTHS to the MODEL statement:
MODEL FACT_T2=Trtment Trtment*MONTHS Trtment*MONTHS*SX_MET_C
Trtment*MONTHS*BASELINE /NOINT SOLUTION;
In Model 4 we add a term for the interaction with a linear function of the
dropout time, Trtment*MONTHS*LstFU_C. Finally, a quadratic term,
Trtment*MONTHS*LstFU_C2, is added to Model 5:
MODEL FACT_T2=Trtment Trtment*MONTHS Trtment*MONTHS*SX_MET_C
      Trtment*MONTHS*BASELINE Trtment*MONTHS*LstFU_C
      Trtment*MONTHS*LstFU_C2 /NOINT SOLUTION;
Note that in Models 2-5, the covariates (SX_MET_C, BASELINE, etc.) are cen-
tered so that their expected value is 0. Thus, the parameters associated with
Trtment*MONTHS are the parameters of interest, E[βi2 | Vi = V̄, etc.].
Implementation in SPSS
The first steps are to center the covariates (See Appendix P).
The statements to fit the models 1 and 5 are:
Model 1:
/* Model 1 */
MIXED FACT_T2 BY ExpTx WITH Months
/FIXED=ExpTx ExpTx*Months|NOINT SSTYPE(3)
/PRINT G R SOLUTION
/RANDOM=Intercept Months|SUBJECT(PatID) COVTYPE(UN)
/TEST 'Difference in Slopes' ExpTx*Months -1 1 .
Model 5:
MIXED FACT_T2 BY ExpTx WITH Months Sx_Met_C Baseline LstFU_C LstFU_C2
/FIXED=ExpTx ExpTx*Months ExpTx*Months*Sx_Met_C ExpTx*Months*Baseline
ExpTx*Months*LstFU_C ExpTx*Months*LstFU_C2|NOINT SSTYPE(3)
/PRINT G R SOLUTION
/RANDOM=Intercept Months|SUBJECT(PatID) COVTYPE(UN)
/TEST 'Difference in Slopes' ExpTx*Months -1 1 .
Implementation in R
The first steps are to create a factor variable for the treatment groups, TrtGrp,
and to center the other explanatory variables (See Appendix R).
The statements to fit the models 1 and 5 are:
Model 1:
Model1=lme(fixed=FACT_T2~0+TrtGrp+TrtGrp:MONTHS,data=Lung,
random=~1+MONTHS|PatID, na.action = na.exclude,method="ML")
Model 5:
Model5=update(Model4,fixed=FACT_T2~0 + TrtGrp+ TrtGrp:MONTHS +
TrtGrp:MONTHS:Sx_Met_C + TrtGrp:MONTHS:Baseline+
TrtGrp:MONTHS:Last_FU_c + TrtGrp:MONTHS:Last_FU_c2)
The anova function can be used to compare the models:
anova(Model1,Model2,Model3,Model4,Model5)
βj = U γj + Baj , (11.14)
Implementation in SAS
The algorithm for calculating Xj B is in Appendix C. The SAS macro can be
found on the website. Because of computational requirements, the model is
generally fit separately for each treatment group. Implementation of this in
SAS appears as follows for a group with 80 distinct dropout times; zint1-zint78
and zslp1-zslp78 are the two components of Xj B:
proc mixed data=Zdata1 covtest;
class patid;
where trtment eq 0; * Fit model for control group *;
model Fact_t2=Weeks Last_weeks_C Weeks*Last_weeks_C/s;
random intercept weeks /subject=patid type=un;
random zint1-zint78/type=toep(1) s;
random zslp1-zslp78/type=toep(1) s;
run;
11.3.1 Assumptions
Two of the assumptions imposed by the conditional linear model have been
relaxed:
1. Trajectories over time are not required to be linear.
2. Subject-specific intercepts may be related to dropout times.
The following remain:
2. There is enough variation in the intercept and slope among subjects to
allow modeling of the variation.
3. The time of dropout is known for all subjects or if the time of the last
assessment is used, subjects behave similarly regardless of whether they
would have continued to have assessments or would have dropped out
before the next assessment if the follow-up had been extended.
FIGURE 11.1 Study 3: Association of intercept and slope with the time
to last assessment estimated with a varying coefficient model (VCM): Cu-
bic smoothing spline estimators of βj = Uγj + Baj for the intercept (left)
and slope (right). Uγj(TiD) is indicated by the solid lines without symbols.
Uγj(TiD) + Baj(TiD) is indicated by the dashed lines overlayed with symbols.
The control group is indicated by a circle (◦) and the experimental group by
a diamond (⋄).
11.3.2 Application
Continuing the example of the lung cancer trial (Study 3), let us further
examine the assumption that the slope (and the intercept) are linear functions
of the dropout time. Again we will use the time of the last assessment as a
surrogate for the time of dropout. In Figure 11.1, two components of the
varying coefficient model are plotted for each treatment group. The first
is the relationship of the fixed effect estimates of the intercept and slope
as a function of the dropout time (Σj (Xj U)γj). These are represented by
the solid lines in Figure 11.1. Both the intercept and slope increase with
increasing time to the last assessment. The second is the estimated semi-
parametric function Σj (Xj U)γj + Σj (Xj B)aj of the time to dropout. These
are represented by dashed lines overlayed by the symbols. Several patterns
arise. First, the assumption that the intercept is a linear function of the time
of the last assessment appears not to hold and there is a plateau that occurs
around 18 weeks. This is not surprising as the fourth assessment (planned at
26 weeks) is a mixture of subjects who would drop out after that assessment
and subjects who would continue if further assessments had been planned.
The slope for the control group follows a similar pattern but the slopes for
the experimental group seem to follow the linear trajectory.
Variables other than time to dropout can also be used in this model. Figure
11.2 illustrates the relationship of the intercept and slope with the log
time to death. The lines defined by Uγj(TiD) and Uγj(TiD) + Baj(TiD)
almost overlay, indicating that the log transformation is appropriate.

FIGURE 11.2 Study 3: Association of intercept and slope with the log time
to death estimated with a varying coefficient model (VCM): Cubic smoothing
spline estimators of βj = Uγj + Baj for the intercept (left) and slope (right).
Uγj(TiD) is indicated by the solid lines without symbols. Uγj(TiD) + Baj(TiD)
is indicated by the dashed lines overlayed with symbols. The control group is
indicated by a circle (◦) and the experimental group by a diamond (⋄).

It also makes sense that deaths that are farther out in time, especially those past
the duration of the trial, would have increasingly less impact on the outcome.
Note that this study had almost complete follow-up to death, so that the issue
of censoring was not relevant.
The predicted baseline scores are 59.0 and 65.7 for subjects surviving 3 months
(ln(0.25 years) = −1.386) and 12 months (ln(1.00 years) = 0.00) respectively.
The predicted change over time also depends on the survival time: the predicted
decline is greater (3.4 points per month) for a patient who survives 3 months
than for a patient surviving 12 months (1.8 points per month).
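These predictions imply that both the intercept and the rate of change are linear in the log survival time. A hedged sketch in Python: the coefficients below are backed out from the two quoted points, not taken from the fitted model itself, so the interpolated values are only illustrative.

```python
import math

def line_through(x1, y1, x2, y2):
    """Intercept and slope of the line through two points."""
    slope = (y2 - y1) / (x2 - x1)
    return y1 - slope * x1, slope

lnT_3mo, lnT_12mo = math.log(0.25), math.log(1.00)   # -1.386 and 0.0

# baseline score and monthly rate of change as functions of ln(survival, years)
b0 = line_through(lnT_3mo, 59.0, lnT_12mo, 65.7)
b1 = line_through(lnT_3mo, -3.4, lnT_12mo, -1.8)

# a patient surviving 6 months (ln(0.5) = -0.693) falls between the two
lnT_6mo = math.log(0.5)
print(round(b0[0] + b0[1] * lnT_6mo, 1))   # interpolated baseline score
print(round(b1[0] + b1[1] * lnT_6mo, 1))   # -2.6 points per month
```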
The second implication of the joint model is that we can express the ex-
pected time of dropout as a function of the initial HRQoL scores as well as the
rate of change over time. This is of interest when the focus of the investigation
is on the time to the event and the HRQOL measures improve the prediction
of the time to that event. More formally, the conditional distribution of the
dropout times is a function of the random effects.
E[f(TiD)|βi] = μt + σ′bt G⁻¹ (βi − β)        (11.18)
1. The joint model allows both the intercept and slope to be related to the
time of dropout.
2. The algorithms [Schluchter, 1992, DeGruttola and Tu, 1994] allow cen-
soring of TiD . This allows us to differentiate a subject who completes the
6-month evaluation, but dies shortly thereafter (e.g. TiD = 8 months)
from a subject who remains alive (e.g. TiD > 8 months). This frees
the analyst from assigning an arbitrary dropout time to subjects who
complete the study.
Yi = Xi β + Zi di + ei . (11.20)
They differ in the manner in which time is related to the random effects for
the HRQoL outcome. In the first alternative, the model for time to the event
is
f (Ti ) = μT + ri . (11.21)
The two models are joined by allowing the random effects (di ) to covary with
the residual errors of the time model (ri ) and thus with time to the event
itself.
[ di ]       ( [ 0 ]   [ G      σbt ] )
[ ri ]  ~  N ( [ 0 ] , [ σ′bt   τ²  ] )                          (11.22)

[ Yi    ]       ( [ Xiβ ]   [ ZiGZ′i + σ²I   Ziσbt ] )
[ f(Ti) ]  ~  N ( [ μT  ] , [ σ′btZ′i        τa²   ] )           (11.23)
In the second alternative, the random effects (di ) are included in the time
to event model:
f(Ti) = μT + λdi + εti.        (11.24)
[ Yi    ]       ( [ Xiβ ]   [ ZiGZ′i + σ²I   ZiGλ′      ] )
[ f(Ti) ]  ~  N ( [ μT  ] , [ λGZ′i          λGλ′ + τb² ] )      (11.25)

The two models are equivalent as the parameters of one can be written as
a function of the other: σ′bt = λG and τa² = λGλ′ + τb². Both alternatives
have specific uses. The first alternative may be more intuitive when the focus
is on the HRQoL outcome and corresponds to displays such as Figure 11.3.
The second alternative allows us to use one of the SAS procedures to obtain
maximum likelihood (ML) estimates of the parameters.
11.4.4 Implementation
Choice of f (TiD )
In the lung cancer trial, the protocol specified that HRQoL was to be collected
until the final follow-up, regardless of disease progression or discontinuation
of treatment. Thus, theoretically death is the event that censored the mea-
surement of HRQoL. In practice, it is difficult to follow patients after disease
progression for various reasons. So in addition to time to death, one might
consider time to the last HRQoL measurement as a candidate for the joint
model. Finally there is the possibility that the rate of change in HRQoL de-
pends on other clinical events, such as time to disease progression. The choice
in other trials will depend on the disease, treatment, study design and likely
relationship of the individual trajectories to the event.
Two considerations will influence the choice of the transformation f (TiD ):
the first is the distribution of Ti and the second is the relationship of TiD
and di . Examples in the literature have used both untransformed and log
transformed values of time. We can assess the fit by comparing the empirical
distribution (Kaplan-Meier estimates) with distributions estimated assuming
normal and lognormal distributions as displayed in Figures 11.4 and 11.5. Vi-
sual examination of the times of the last HRQoL measurement (Figure 11.4)
suggested that the distribution is roughly normal while time to death (Fig-
ure 11.5) more closely fits a log normal distribution.
FIGURE 11.4 Study 3: Weeks to last HRQoL assessment (Control Arm and Experimental Arm).
The algorithms for maximization of the likelihood function of the joint model
require initial estimates of the parameters. Good starting estimates speed up
the convergence of the program, and for some algorithms avoid non-positive
definite covariance matrices. Using multiple starting values avoids finding
local maxima.
One possible starting point assumes that there is no correlation between the
random effects and time: λ1 = λ2 = 0. The following procedure is suggested:
FIGURE 11.5 Study 3: Weeks to death (Control Arm and Experimental Arm).
1. Fit a mixed-effects model for the HRQoL data alone (SAS Proc MIXED
or R lme). In SAS, obtain the Cholesky decomposition of G (G = LL′) by
adding GC to the options of the RANDOM statement. (This is the default
in R.) We will use this to ensure that the estimates of G remain positive
definite. For a 2x2 symmetric matrix the Cholesky decomposition is:
[ G11  G12 ]   [ L11   0  ] [ L11  L21 ]
[ G21  G22 ] = [ L21  L22 ] [  0   L22 ]

               [ L11²       L11·L21     ]
             = [ L11·L21    L21² + L22² ]
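The 2x2 factorization can be verified numerically; a minimal sketch in Python with invented values for G (not output from the book's programs):

```python
import math

# Cholesky factors of a 2x2 covariance matrix G = L L' (values invented)
def chol2x2(g11, g12, g22):
    l11 = math.sqrt(g11)
    l21 = g12 / l11
    l22 = math.sqrt(g22 - l21 ** 2)
    return l11, l21, l22

l11, l21, l22 = chol2x2(4.0, 1.0, 2.0)
# reassemble G: G11 = L11^2, G12 = L11*L21, G22 = L21^2 + L22^2
print(round(l11 ** 2, 6), round(l11 * l21, 6), round(l21 ** 2 + l22 ** 2, 6))
# -> 4.0 1.0 2.0
```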
2. From the fitted mixed-effects model, obtain the empirical best linear
unbiased predictors (EBLUPs) of the random effects, d̂i:
The EBLUPs are merged with the time to event data. In SAS, the data
set is then transposed to create one record per subject and merged with
the dataset containing the time variable and its censoring indicator.
3. The parameters of the time to event model are then estimated with the
estimates of di as covariates:
Maximization of Likelihood
There are numerous ways to maximize the likelihood function for the joint
model. Schluchter [1992], Schluchter et al. [2001] describes an EM algorithm
and provides a link to a program that implements it in SAS. Guo and Carlin
[2004] present a Bayesian version implemented in WinBUGS. Vonesh et al. [2006]
uses the SAS NLMIXED procedure that uses numerical integration to maximize
an approximation to the likelihood integrated over the random effects assum-
ing Weibull and piecewise exponential distributions for the time to event.
We then specify the log of the likelihood for Yit|di, recalling that, assuming
a normal distribution:

log(f[Yit|di]) = log[ (2πσ²)^(−1/2) exp(−½ (Yit − (Xitβ + Zitdi))²/σ²) ]    (11.27)
             = −½ [ log(2π) + log(σ²) + (Yit − (Xitβ + Zitdi))²/σ² ]
Next we specify the log likelihood for g(TiD)|di when some of the times
are censored. If we define δi = 0 when the time is observed, δi = 1 when
right censored and δi = −1 when left censored, and let Ti be the observed or
censored value of g(TiD), then the general form of the log likelihood assuming
a normal distribution is:

log(f[g(TiD)|di]) = −½ [log(2π) + log(τb²) + zi²]    (δi = 0)     (11.28)
                  = log[1 − Φ(zi)]                   (δi = 1)     (11.29)
                  = log[Φ(zi)]                       (δi = −1)    (11.30)

where zi = (Ti − (μT + λdi)) / τb.
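Equations (11.28)-(11.30) can be sketched directly in code; the Python below is an illustration (Φ written via the error function; names are mine, not from the book's programs):

```python
import math

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ll_time(T, mu, tau2, delta):
    """Log likelihood contribution of one (possibly censored) event time."""
    z = (T - mu) / math.sqrt(tau2)
    if delta == 0:                        # time observed exactly
        return -0.5 * (math.log(2 * math.pi) + math.log(tau2) + z * z)
    if delta == 1:                        # right censored: Pr(T > observed)
        return math.log(1.0 - Phi(z))
    return math.log(Phi(z))               # delta == -1: left censored

# right censoring at the mean leaves probability 1/2 in the upper tail
print(round(ll_time(0.0, 0.0, 1.0, 1), 4))   # -0.6931 = log(0.5)
```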
The joint log likelihood is specified, adding the two components and speci-
fying the distribution of the random effects as multivariate normal.
*** Joint Log Likelihood ***;
model FACT_T2 ~ general(ll_Y+ll_T);
random d1 d2 ~ normal([0,0],[D11,D12,D22]) sub=patid;
Finally, we can request both linear and non-linear functions of the parame-
ters. In this example, we first contrast the estimated slopes for the two groups,
H0: β2 − β1 = 0. We can also estimate the correlation of the random effects
with the time to death or dropout:
*** Estimates and Contrasts ***;
estimate 'Diff in Slopes' b2-b1;
estimate 'Rho1T' (lambda1*D11+lambda2*D12)/(sqrt(D11)*
  sqrt(lambda1**2*D11+2*lambda1*lambda2*D12+lambda2**2*D22+tau2b));
estimate 'Rho2T' (lambda1*D12+lambda2*D22)/(sqrt(D22)*
  sqrt(lambda1**2*D11+2*lambda1*lambda2*D12+lambda2**2*D22+tau2b));
run;
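The correlations requested by the Rho1T and Rho2T statements follow from cov(d1, f(T)) = λ1 D11 + λ2 D12 and var(f(T)) = λ′Dλ + τb². A hedged Monte Carlo check in Python with invented parameter values (a sketch only, not the book's SAS macro):

```python
import math
import random

# invented values for D = [[D11, D12], [D12, D22]], lambda and tau2b
random.seed(42)
D11, D12, D22 = 4.0, 1.0, 2.0
lam1, lam2, tau2b = 0.5, -0.3, 1.0

# closed-form correlation, as coded in the 'Rho1T' ESTIMATE statement
rho = (lam1 * D11 + lam2 * D12) / (
    math.sqrt(D11)
    * math.sqrt(lam1**2 * D11 + 2 * lam1 * lam2 * D12 + lam2**2 * D22 + tau2b))

# simulate (d1, d2) via the Cholesky factor of D, then f(T) = lam'd + e
l11 = math.sqrt(D11); l21 = D12 / l11; l22 = math.sqrt(D22 - l21**2)
xs, ys = [], []
for _ in range(100_000):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    d1, d2 = l11 * z1, l21 * z1 + l22 * z2
    xs.append(d1)
    ys.append(lam1 * d1 + lam2 * d2 + random.gauss(0, math.sqrt(tau2b)))

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
sx = math.sqrt(sum((x - mx)**2 for x in xs) / n)
sy = math.sqrt(sum((y - my)**2 for y in ys) / n)
print(abs(cov / (sx * sy) - rho) < 0.02)   # True: simulation matches formula
```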
Results
Tables 11.7 and 11.8 summarize the results from a number of joint models for
the lung cancer study and compare the estimates to the corresponding mixed-
effects model. The results support the hypothesis that the random effects
of the longitudinal model are associated with the times to the events. The
associations were stronger for the time to death than for the last assessments.
The events were moderately correlated with the intercept (ρ in the range
of 0.40-0.44). The correlation with the slopes ranged from ρ = 0.46 for the
last HRQoL assessment to ρ = 0.75 for log time to death. The strong
correlation (ρ > 0.7) of change with time to death fits with the observation that
deterioration in physical and functional well-being accelerates in the months
prior to death.
Estimates of the intercept are insensitive to the choice of model. This is not
unexpected as there is minimal missing data at baseline. In contrast, the rate
of decline within each group roughly doubles when the outcome is modeled
jointly with either time to death or last assessment. Given the extensive miss-
ing data and the patterns observed in Figure 6.1, this is also expected. The
between group differences exhibit less variation among the models; the span
of estimates being roughly half the standard error. This is typical of studies
where treatment arms have roughly similar rates and reasons for dropout.
TABLE 11.8 Study 3: Joint Model for FACT-Lung TOI and various
measures of the time to dropout (T D ). Parameter estimates of intercept
(β0 ), slopes for control group (β1 ) and experimental group (β2 ) and the
difference in slopes (β2 − β1 ).
Estimates (s.e.)
Dropout Event β̂0 β̂1 β̂2 β̂2 − β̂1
None (MLE) 65.9 (0.66) -1.18 (0.29) -0.58 (0.19) 0.60 (0.31)
ln(Survival) 65.7 (0.66) -1.85 (0.30) -1.43 (0.24) 0.47 (0.31)
Last assessment 66.1 (0.66) -2.15 (0.39) -1.53 (0.31) 0.62 (0.32)
11.4.6 Implementation in R
The jointModel function in R uses a slightly different approach for the survival
portion of the joint model. Instead of incorporating the random effects
into the survival model, it incorporates Wi(t), the value of the longitudinal
outcome at time point t for the ith subject, evaluating Xiβ + Zidi at t. The
default survival function is a Weibull accelerated failure time model where ri
follows an extreme value distribution:
The function also allows the time to event portion to be modeled as a time-
dependent proportional hazards model [Wulfsohn and Tsiatis, 1997] or an
additive log cumulative hazard model [Rizopoulos et al., 2009].
Fit the longitudinal part of the model creating the object LME (see Appendix
R for required libraries).
Create a dataset that has one record for each subject with at least one
measurement for the survival part of the model. Then fit the Weibull model,
creating the object Surv2.
R> Joint=jointModel(lmeObject=LME,survObject=Surv2,timeVar="MONTHS")
R> summary(Joint)
Coefficients:
Longitudinal Process
Value Std.Err z-value p-value
(Intercept) 67.0653 0.5854 114.5730 <0.0001
MONTHS:TrtGrp0 -1.6903 0.1014 -16.6633 <0.0001
MONTHS:TrtGrp1 -1.6772 0.0947 -17.7072 <0.0001
Event Process
Value Std.Err z-value p-value
(Intercept) 3.5126 0.0555 63.3451 <0.0001
Assoct -0.0140 0.0008 -16.5566 <0.0001
log(scale) -0.9734 0.0485 -20.0543 <0.0001
The parameter for the association of the two models, α, is indicated as Assoct
and is highly significant. Note that while the estimates of the slope param-
eters, β, of the longitudinal model are similar to those obtained previously,
their standard errors are much smaller. As an interesting side note, when only
a single between-subject random effect was included (random ∼ 1) the results
were similar (with no problems with the Hessian).
FIGURE: Three panels with x-axis 'Day of Assessment'.
Again the first step is to develop working models for the longitudinal and
time to event processes. In the longitudinal model for pain intensity, a piecewise
linear model for the fixed effects includes an initial slope (Time) and a
change in slope at 7 days (Time7=max(Time-7,0)). The random effects have
a structure similar to that of the migraine prevention trial described in Section
4.6.2. Unlike the lung cancer trial, where patients continue to decline
throughout the period of observation, a model for the random effects with
indicators for baseline (i.e. initial pain intensity) and change from baseline
to follow-up (i.e. response) fits the observed data. The correlation between
the baseline and follow-up assessments is weak and the correlation among the
follow-up assessments is strikingly strong (Table 11.9).
For most patients, dropout occurs after the first assessment if it occurs at
all and there is no suggestion of an association of time with trajectories of the
pain scores. This suggests considering an alternative model for dropout. For
example, one might consider TiD to be an indicator variable for dropout and
g(TiD ) to be the logistic function. The conditional likelihood is:
log(f [g(TiD )|di ]) = TiD (μt + λdi ) − log(1 + e(μt +λdi ) ) (11.32)
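The Bernoulli log likelihood in (11.32) can be sketched directly; the Python below is illustrative (names are mine, not the book's code):

```python
import math

def ll_dropout(T_iD, eta):
    """log f(T_iD | d_i) for logistic dropout: T_iD in {0,1}, eta = mu_t + lambda'd_i."""
    return T_iD * eta - math.log(1.0 + math.exp(eta))

# with eta = 0 the model gives Pr(dropout) = 0.5, so both outcomes
# have log likelihood log(0.5)
print(round(ll_dropout(1, 0.0), 4), round(ll_dropout(0, 0.0), 4))   # -0.6931 -0.6931
```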
Covariance                    Correlation
1.07 0.11 0.11 0.11           1.00 0.04 0.04 0.04
0.11 6.40 5.47 5.47           0.04 1.00 0.85 0.85
0.11 5.47 6.40 5.47           0.04 0.85 1.00 0.85
0.11 5.47 5.47 6.40           0.04 0.85 0.85 1.00

The following code could be used with DO_LackEff and DO_SideEff as indicator
variables for dropout due to the two reasons and ExpTx an indicator of
the experimental arm:
The results from the longitudinal portion of the model are very similar
to those obtained when modeled separately. This is not unexpected given
the strong correlation among the follow-up assessments. The first follow-up
assessment has much of the information about the subsequent assessments
thus dropout after the first follow-up is likely to be MAR. The results are
sensitive to variations in the models, such as dropping the mu1a*ExpTx or
mu1b*ExpTx terms. There is still much to be learned about the performance
of these models.
11.5 Summary
• There is a rich class of models that relate the individual trajectories
(generally through random effects) to the time to dropout or other clin-
ically relevant events.
• All models are based on strong assumptions.
• These assumptions cannot be formally tested. Defense of the assump-
tions must be made on a clinical basis rather than statistically.
• Lack of evidence of non-ignorable missing data for any particular model
does not prove the missing data are ignorable.
• Estimates are not robust to model misspecification, thus more than one
model should be considered (a sensitivity analysis).
• The random effect dependent dropout models can be very useful as part
of a sensitivity analysis when non-ignorable dropout is suspected.
12.1 Introduction
In this chapter we examine a final model for non-ignorable missing data. As
with the models presented in the previous two chapters,
1. All models for non-ignorable data require the analyst to make strong
assumptions.
The term selection model was originally used to classify models with a
univariate response, yi , where the probability of being selected into a sample
depended on the response. The same idea is extended to longitudinal studies.
As previously described, the joint distribution of the outcome, Yi, and the
missing data mechanism, Mi, is factored into two parts.
The model for the outcome, f (Yi |Θ), does not depend on the missing data
mechanism. The model for the missing data mechanism, f (Mi |Yiobs , Yimis , Ψ),
may depend on the observed and missing outcome data. Θ and Ψ are the
parameters of the two models. We can expand this definition to differentiate
among selection models. Specifically, in outcome-dependent [Hogan and Laird,
1997b] selection models the mechanism depends directly on the elements of Y
and in random-effects [Hogan and Laird, 1997b] or random-coefficient [Little,
1995] selection models missingness depends on Y through the subject-specific
random effects, βi or di .
267
© 2010 by Taylor and Francis Group, LLC
268 Design and Analysis of Quality of Life Studies in Clinical Trials
The logistic linear model for the dropout process takes the form

logit[Pr(Mij = 1 | Yi)] = γ0j + γ1 Yij + γ2 Yij−1

Diggle and Kenward defined the dropout process in terms that correspond
to MCAR, MAR and MNAR. Completely random dropout (CRD) corresponds
to MCAR, where dropout is completely independent of the measurements:
γ1 = γ2 = 0. Random dropout (RD) corresponds to MAR, where dropout depends
only on previously observed measurements: γ1 = 0 and γ2 ≠ 0. Informative
dropout (ID) corresponds to MNAR, where dropout depends on the unobserved
current measurement: γ1 ≠ 0.
FIGURE 12.1 Contrasting two distributions with 20% missing observations.
Missing scores are indicated by the hatched pattern and observed scores are
indicated by the empty portion. In figure A (left) the complete (observed and
missing) data have a normal distribution. In figure B (right) the observed
scores have a normal distribution but the distribution of the complete data is
skewed.
This implies that 31% of the remaining subjects drop out at each of the
follow-up visits. The probability of completing the HRQoL assessment at
the final visit is the product of the probability of remaining in the study at
each followup. Thus, the model predicts that 69% of the subjects will have an
assessment at 6 weeks [(1− P̂2 ) = .69], 48% at 12 weeks [(1− P̂2 )(1− P̂3 ) = .48]
and 33% at 26 weeks [(1 − P̂2 )(1 − P̂3 ) ∗ (1 − P̂4 ) = .33]. The estimated
probability of dropout and of remaining in the study are compared to the
observed probabilities in Tables 12.1 and 12.2.
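The retention arithmetic is just the product of per-visit continuation probabilities; a quick check in Python (an illustrative sketch):

```python
# constant 31% dropout hazard at each follow-up visit
p_drop = 0.31
p_6wk = 1 - p_drop               # probability of an assessment at 6 weeks
p_12wk = p_6wk * (1 - p_drop)    # ... at 12 weeks
p_26wk = p_12wk * (1 - p_drop)   # ... at 26 weeks
print(round(p_6wk, 2), round(p_12wk, 2), round(p_26wk, 2))   # 0.69 0.48 0.33
```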
In model 2 (CRD2), we allow the probability to differ across the two treat-
ments by adding an indicator for the Experimental group (Xi2 = 0 if Control,
Xi2 = 1 if Experimental).
The estimated probability of dropout is higher among the subjects not as-
signed to the Experimental therapy (P̂k = .35) than among those assigned to
the Experimental therapy (P̂k = .29).
In model 3 (CRD3), the probability is the same across the two treatments
but differs by time (Xi1 = 1 if t2 , Xi2 = 1 if t3 , Xi3 = 1 if t4 , 0 otherwise).
The predicted dropout rate at 6 weeks (P̂k = .28) is very similar to 12 weeks
(P̂k = .28) and increases at 26 weeks (P̂k = .40). This model fit the ob-
served data better than the model where we assume a constant dropout rate
(Table 12.3).
Finally, in model 4 (CRD4), we allow the probability to differ across the
two treatments by adding indicators for the Experimental group at each time
(Xi4 = Xi5 = Xi6 = 0 if Control; Xi4 = Xi5 = Xi6 = 1 if Experimental).
This model fit the observed data better than the previous models (Table 12.3).
Note that none of the parameter estimates in the model for the HRQoL out-
come measure (β1 · · · β4) have changed. This is to be expected as the longi-
tudinal and dropout models are distinct (estimated separately) under the CRD
assumptions.
Note that this model assumes that this relationship is the same for each follow-
up assessment and the previous measurement. Thus, the relationship between
dropout at 6 weeks and the baseline value of the FACT-Lung TOI is assumed
to be the same as between dropout at 26 weeks and the 12-week value of the
FACT-Lung TOI.
This RD model fits the data much better than the CRD model, with strong
evidence to reject the CRD assumption (Table 12.4). As expected, there is still
no change in the mean and variance parameters for the longitudinal HRQoL
data. The RD model predicts that the probability of dropout will decrease
with increasing FACT-Lung TOI scores at the previous assessment. Thus,
for a patient assigned to the treatment that did not include the Experimental
therapy, the predicted probability of dropout at 6 weeks is 44%, 35% and 27%
for baseline scores of 55, 65 and 75 respectively.
P̂2 = e^(γ̂01 + γ̂2 Yij−1) / (1 + e^(γ̂01 + γ̂2 Yij−1)) = e^(1.872 − 0.0385·55) / (1 + e^(1.872 − 0.0385·55)) = .44    (Xi4 = 0, Yi1 = 55)
P̂2 = e^(γ̂01 + γ̂2 Yij−1) / (1 + e^(γ̂01 + γ̂2 Yij−1)) = e^(1.872 − 0.0385·65) / (1 + e^(1.872 − 0.0385·65)) = .35    (Xi4 = 0, Yi1 = 65)
P̂2 = e^(γ̂01 + γ̂2 Yij−1) / (1 + e^(γ̂01 + γ̂2 Yij−1)) = e^(1.872 − 0.0385·75) / (1 + e^(1.872 − 0.0385·75)) = .27    (Xi4 = 0, Yi1 = 75)
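These fitted probabilities are the inverse logit of γ̂01 + γ̂2·Y; a quick check in Python with the estimates quoted above (an illustrative sketch):

```python
import math

def expit(x):
    """Inverse logit."""
    return math.exp(x) / (1.0 + math.exp(x))

g01, g2 = 1.872, -0.0385    # estimates quoted in the text
print([round(expit(g01 + g2 * y), 2) for y in (55, 65, 75)])   # [0.44, 0.35, 0.27]
```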
It is also possible to let dropout depend on the observed scores lagging back
two observations. However, to do this in the current study we would have
had to restrict the data to subjects with the first two observations; less than
half of the subjects would remain. So, in this study, we are limited to a single
lag.∗
The ID4 model was not significantly different from the RD4 model. This is at
first rather surprising as there was evidence of non-ignorable missing data for
∗ The Oswald routine will run without error messages other than the note that the algorithm
failed to converge. The resulting parameters describing dropout at 6 weeks make no sense.
all other models considered in the previous chapters. This illustrates a very
important point. The failure to reject a hypothesis of informative dropout (ID
versus RD) is not conclusive proof that the dropout is ignorable (RD). The
assumption of random dropout (RD) is acceptable only if the ID model was
the correct alternative and the parametric form of the ID process is correct.
A number of factors could explain the lack of evidence. In this example,
the most likely explanation is that the assumption of a normal distribution
for both the observed and unobserved data may be the problem. Another
contributor may be the omission of the second random effect corresponding
to variation in the rates of change among individuals.
12.3 Summary
• All these models require the analyst to make strong assumptions.
• These assumptions cannot be formally tested. Defense of the assump-
tions must be made on a clinical basis rather than statistically.
13.1 Introduction
It is well known that performing multiple hypothesis tests and basing inference
on unadjusted p-values increase the overall probability of false positive results
(Type I errors). Multiple hypothesis tests in trials assessing HRQoL arise
from three sources: 1) multiple HRQoL measures (scales or subscales), 2)
repeated post-randomization assessments and 3) multiple treatment arms.
As a result, multiple testing is one of the major analytic challenges in these
trials [Korn and O’Fallon, 1990]. For example, in the lung cancer trial (Study
3), there are five primary subscales in the FACT-Lung instrument (physical,
functional, emotional and social/family well-being plus the disease specific
concerns). There are three follow-up assessments at 6, 12 and 26 weeks and
three treatment arms. If we consider the three possible pairwise comparisons
of the treatment arms at each of the three follow-ups for the five primary
subscales, we have 45 tests. Not only does this create concerns about type I
error, but reports containing large numbers of statistical tests generally result
in a confusing picture of HRQoL that is hard to interpret.
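The count of 45 tests, and the resulting inflation of the familywise type I error when each test is run at α = 0.05, can be made concrete; the sketch below assumes independent tests purely for illustration:

```python
# 3 pairwise treatment comparisons x 3 follow-ups x 5 subscales
n_tests = 3 * 3 * 5
print(n_tests)                                   # 45

# familywise error rate at alpha = 0.05 per test, assuming independence
alpha = 0.05
print(round(1 - (1 - alpha) ** n_tests, 2))      # 0.9
```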
Although widely used in the analysis of HRQoL in clinical trials [Schumacher
et al., 1991], univariate tests of each HRQoL domain or scale and
time point can seriously inflate the type I (false positive) error rate for the
overall trial, such that the researcher is unable to distinguish between the true
and false positive differences. Post hoc adjustment is often infeasible because,
at the end of analysis, it is impossible to determine the number of tests
performed.
In this and the following chapter, I will illustrate various strategies for
addressing the multiple comparisons problem. I will use two examples. The
first is the breast cancer trial of adjuvant therapy (Study 1), in which the seven
domains of the Breast Chemotherapy Questionnaire (BCQ) were measured at
three time points. For the purposes of illustration, I will assume that the aims
are to identify the relative impact of the two therapeutic regimens on each of
these seven components both on- and post-therapy. This example will be used
to illustrate single step (Section 13.4) and sequentially rejective (Section 13.5)
methods. The second example is the renal cell carcinoma trial (Study 4). For
this trial, I will illustrate designs that integrate both the traditional disease
© 2010 by Taylor and Francis Group, LLC
response and survival endpoints with HRQoL measures using closed testing
procedures that utilize gatekeeping strategies (Section 13.6).
Communication of Results
The requirements imposed for the reporting of results may also influence the
decisions. As will be discussed in more detail later in the chapter, global tests
are easy to perform but all that can be reported is a “Yes/No” decision based
on rejection/acceptance of hypothesis tests. At the other extreme, the Bonferroni
test allows one- and two-sided tests, and can produce adjusted p-values
and confidence intervals, but at the cost of being the most conservative of
the multiple comparisons procedures. In a trial that mandates additional
detail such as confidence intervals, the consequences of selecting a procedure
that does not provide those details will be problematic. Thus, it is important
to decide during the planning of the analysis whether the benefits of
additional power outweigh the loss of the ability to produce more detailed reporting.
Settings that are less common among HRQoL studies, such as dose-finding
studies with three or more ordered treatment arms, will not be addressed.
For a more complete discussion of multiple comparisons procedures, the reader
is advised to consult books and review articles devoted solely to this topic.
on all subjects (although all issues of non-random missing data are still ap-
plicable). This might occur if a subset of the K endpoints were not measured
because a translation was not available for some of the HRQoL measures. 4)
Non-parametric methods as well as parametric methods can be used.
As will be illustrated later in this chapter, a global test can be based on either
a set of univariate tests (e.g. the Bonferroni global test) or be constructed as
a multivariate test statistic. It is important to note that a global test allows
one to reject or accept the family of hypotheses, but does not allow inferences
to be made about individual hypotheses.
In the renal cancer study, the global test might compare a selected HRQoL
measure at 2, 8, 17 and 23 weeks. Based on the global test, we can conclude
that the outcome differs between the two treatment groups for at least one of
the time points, but we cannot say when they differ. Thus, when the global
test of H0 has been rejected, the question remains: which of the individual
hypotheses can be rejected?
Multivariate global tests: F7,169 = 4.83, p < 0.0001 (on-therapy); F7,159 = 1.03, p = 0.41 (post-therapy).
p̃B_k = min(1, p_k × K)   (13.3)
The Bonferroni adjusted p-values for the on-therapy comparisons are illustrated
in Table 13.3. Computation of confidence intervals is also straightforward,
with the usual α replaced by α/K. For example, if the unadjusted
95% confidence interval is
θ̂ ± t_(1−α/2) √(Var[θ̂]),   (13.4)
the adjusted interval replaces t_(1−α/2) with t_(1−α/(2K)).
The Bonferroni procedure results in strong control of the FWE, but is well
known to be quite conservative. If the K test statistics are uncorrelated
(the tests are independent) and the null hypotheses are all true, then the
probability of rejecting at least one of the K hypotheses is approximately∗
Kα when α is small. However, when the tests are correlated, the procedure
overcorrects. Another limitation is that the Bonferroni procedure focuses on
the detection of large differences in one or more endpoints and is insensitive
to a pattern of smaller differences that are all in the same direction. Options
that address this problem will be presented later in this chapter.
∗ Pr[min(p-value) ≤ α] = 1 − (1 − α)^K ≈ Kα
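The single-step adjustment of Equation (13.3) and the corresponding interval adjustment can be sketched in a few lines; the Python below is illustrative only (the function names are mine, and a normal critical value stands in for the t quantile that the actual analysis would use).

```python
from statistics import NormalDist

def bonferroni_adjust(pvals):
    # Equation (13.3): multiply each p-value by K, capping at 1.
    K = len(pvals)
    return [min(1.0, K * p) for p in pvals]

def bonferroni_ci(theta_hat, se, K, alpha=0.05):
    # Equation (13.4) with alpha replaced by alpha/K; the normal
    # quantile is used here in place of the t quantile for simplicity.
    z = NormalDist().inv_cdf(1 - alpha / (2 * K))
    return (theta_hat - z * se, theta_hat + z * se)
```

With K = 7 endpoints the two-sided critical value grows from about 1.96 to about 2.69, which is how the conservatism of the procedure shows up in the interval width.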
p̃Bw_k = min(1, p_k × (w_1 + ··· + w_K)/w_k).   (13.6)
When more than one of the null hypotheses are false, this procedure sub-
stantially increases the power. For example, it is much easier to reject the
hypothesis associated with the second smallest p-value with this procedure
than with the single-step procedure. The cost is that we cannot construct
confidence intervals that directly correspond to the procedure.
Step-Up Procedure
Hochberg proposed an alternative to Holm’s step-down procedure that is
slightly more powerful, though it relies on an assumption of independence
of the test statistics. The adjusted p-values for the step-up procedure are
defined in the reverse order.
p̃H_[K] = p_[K]   (13.8)
p̃H_[i] = min(p̃H_[i+1], p_[i] × (K − i + 1)),   i = K − 1, …, 1
The procedure is illustrated for the seven domains of the BCQ (Table 13.3).
The largest p-value, p_[7] = 0.80, remains unchanged, setting p̃H_[7] to 0.80.
The second largest is multiplied by 2 and compared to the largest; the minimum
of the two values is 0.80. The procedure continues, and in the last step p_[1] is
multiplied by 7 and compared to p̃H_[2].
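For readers who want to verify the stepwise adjustments numerically, here is an illustrative Python sketch (the function names are mine) applied to seven ordered p-values like those of the BCQ example:

```python
def holm(pvals):
    # Holm step-down: start from the smallest p-value with multiplier K,
    # decrease the multiplier by 1 at each step, and enforce monotonicity.
    K = len(pvals)
    order = sorted(range(K), key=lambda k: pvals[k])
    adj, running = [0.0] * K, 0.0
    for step, k in enumerate(order):
        running = max(running, min(1.0, (K - step) * pvals[k]))
        adj[k] = running
    return adj

def hochberg(pvals):
    # Hochberg step-up (Equation 13.8): start from the largest p-value,
    # multiply the i-th smallest by (K - i + 1), and enforce monotonicity.
    K = len(pvals)
    order = sorted(range(K), key=lambda k: pvals[k], reverse=True)
    adj, running = [0.0] * K, 1.0
    for step, k in enumerate(order):
        running = min(running, (step + 1) * pvals[k])
        adj[k] = min(1.0, running)
    return adj

pv = [0.0001, 0.0002, 0.081, 0.27, 0.60, 0.75, 0.80]
```

For these values Holm yields 0.0007, 0.0012, 0.405 and then 1.0 for the remaining four, while Hochberg yields 0.0007, 0.0012, 0.405 and 0.80 for the remaining four, illustrating its slightly greater power at the top of the ordering.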
p̃FDR_[K] = p_[K]   (13.9)
p̃FDR_[i] = min(p̃FDR_[i+1], p_[i] × K/i),   i = K − 1, …, 1
The procedure is illustrated for the seven domains of the BCQ (Table 13.3).
The largest p-value, p_[7] = 0.80, remains unchanged, setting p̃FDR_[7] to 0.80.
The second largest is multiplied in this procedure by 7/6 and compared to
the largest; the minimum of the two values is 0.80. The procedure continues,
and in the last step p_[1] is multiplied by 7/1 and compared to p̃FDR_[2].
13.5.4 Implementation in R
The R function p.adjust will calculate the adjusted p-values. If we have
created a vector of p-values called pvals, the following statements will generate
the same four adjustments:
> pvals=c(0.0001,0.0002,0.081,0.27,0.60,0.75,0.80)
> p.adjust(pvals,method="bonferroni")
> p.adjust(pvals,method="holm")
> p.adjust(pvals,method="hochberg")
> p.adjust(pvals,method="fdr")
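The same adjustments can be reproduced outside of R. As an illustrative sketch (the function name is mine), the Benjamini-Hochberg "fdr" adjustment of Equation (13.9) in Python:

```python
def fdr_bh(pvals):
    # Equation (13.9): step up from the largest p-value, multiplying the
    # i-th smallest by K/i and enforcing monotonicity.
    K = len(pvals)
    order = sorted(range(K), key=lambda k: pvals[k], reverse=True)
    adj, running = [0.0] * K, 1.0
    for step, k in enumerate(order):
        i = K - step  # rank of this p-value among the ordered values
        running = min(running, pvals[k] * K / i)
        adj[k] = min(1.0, running)
    return adj

pvals = [0.0001, 0.0002, 0.081, 0.27, 0.60, 0.75, 0.80]
```

For the vector above this returns 0.0007, 0.0007, 0.189, 0.4725, 0.80, 0.80, 0.80, noticeably less conservative than the Bonferroni values of 0.0007, 0.0014, 0.567, 1, 1, 1, 1, because the FDR criterion controls the expected proportion of false positives rather than the familywise error rate.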
p̃Hw_[1] = min(1, p_[1] × (w_[1] + ··· + w_[K])/w_[1])   (13.10)
applications simplify and the wide range of applications justifies the initial
effort to understand the procedure. To illustrate the notation, assume that
we have three endpoints. The corresponding marginal hypotheses are HA ,
HB , and HC . The intersection of the hypotheses associated with the first two
endpoints is designated as HAB and the intersection of all three hypotheses
as HABC. There are 2^K − 1 possible combinations. In the closed-testing
procedure, the adjusted p-value for each hypothesis is the maximum p-value
of the set (family) of hypotheses implied by the marginal hypothesis. For
example, HA would imply all combinations that contained A: HA , HAB ,
HAC , and HABC . The adjusted p-value would be the maximum of the p-
values associated with these four combinations. The simplification occurs
when it is sufficient to report that the test of a specific hypothesis has been
accepted (not significant at α); if any of the set of hypotheses is accepted,
testing the remaining hypotheses is unnecessary.
sections. The second column designates the Bonferroni adjusted p-value for
the respective intersection hypothesis. The next three columns indicate which
of the intersection hypotheses belong to the set of implied hypotheses. The
adjusted p-values for HA , HB , and HC are the maximum of the values in each
column. If the unadjusted p-values were pA = 0.08, pB = 0.02, and pC = 0.03,
then pABC = 3∗min(0.08, 0.02, 0.03) = 0.06, pAB = 2∗min(0.08, 0.02) = 0.04,
pAC = 2 ∗ min(0.08, 0.03) = 0.06, pBC = 2 ∗ min(0.02, 0.03) = 0.04. The
adjusted p-values would be the maximum value in the columns that indi-
cate the implied hypotheses: p̃A = max(pABC , pAB , pAC , pA ) = 0.08, p̃B =
max(pABC , pAB , pBC , pB ) = 0.06, and p̃C = max(pABC , pAC , pBC , pC ) =
0.06.
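The decision-matrix arithmetic above can be automated. A minimal Python sketch (the function name is mine) enumerates all nonempty subsets, applies the Bonferroni test to each intersection hypothesis, and takes the maximum over the implied hypotheses:

```python
from itertools import combinations

def closed_test_bonferroni(p):
    # p: dict of endpoint label -> unadjusted p-value.
    labels = list(p)
    # Bonferroni p-value for every nonempty intersection hypothesis.
    subset_p = {}
    for r in range(1, len(labels) + 1):
        for S in combinations(labels, r):
            subset_p[S] = min(1.0, len(S) * min(p[k] for k in S))
    # Adjusted p-value for endpoint k: maximum over all subsets containing k.
    return {k: max(v for S, v in subset_p.items() if k in S) for k in labels}
```

Calling closed_test_bonferroni({'A': 0.08, 'B': 0.02, 'C': 0.03}) reproduces the adjusted values 0.08, 0.06 and 0.06 from the worked example.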
example, disease response is the single primary endpoint. If we apply this de-
sign using the unadjusted p-values from the previous example, pABC = pAB =
pAC = pA = 0.08, pBC = 2 ∗ min(.02, .03) = 0.04. The adjusted p-values are
p̃A = max(pABC , pAB , pAC , pA ) = 0.08, p̃B = max(pABC , pAB , pBC , pB ) =
0.08, and p̃C = max(pABC , pAC , pBC , pC ) = 0.08. Thus, none of the three
hypotheses are rejected and the (adjusted) results are negative for all three
endpoints. If our trial design had designated survival as the primary endpoint,
both the hypotheses for survival and HRQoL would have been rejected.
occurs in the calculation of pAC and pBC. Because the gatekeeping procedure
requires that HC cannot be rejected unless both of the hypotheses in the
first family are rejected, the adjusted p-value cannot be smaller than the
largest in the first family; thus p̃C = 0.08. Thus in design 3, only the survival
hypothesis is rejected.
The composite score (e.g. FACT-BRM TOI, see Section 1.6.3) has four do-
mains (D1 , D2 , D3 , D4 ) that are secondary endpoints and will be tested only
if the hypothesis involving the composite score is rejected. One potential set
of rules for the weights of the intersection hypotheses proposed by Hommel
et al. is as follows:
Thus the three co-primary endpoints initially have equal importance. The
testing of Q has a greater gatekeeping role for the domain scores and if either
E1 or E2 is rejected, Q would be allocated more weight facilitating the testing
of the domain scores.
To illustrate the procedure assume that the unadjusted p-values are: pE1 =
0.082, pE2 = 0.007, pQ = 0.012, pD1 = 0.070, pD2 = 0.012, pD3 = 0.0038, and
pD4 = 0.007. In the first step, we test the intersection hypothesis for the three
primary endpoints (Table 13.8) and note that E1 is the minimum weighted
p-value and thus the identified endpoint. Following the rules, the weight
associated with this endpoint is transferred to Q. The composite QOL score,
Q, is the identified endpoint in the second step; its weight is transferred to the
four domains. The testing continues with D3 , D4 , D2 and D1 and finally E1
emerging as the identified endpoints. Note that the local p-values for D3 and
E1 are less than the proceeding adjusted p-values; to preserve the ordering of
the adjusted p-value it is set to the proceeding value.
The closed testing procedure can also be based on multivariate tests. This
strategy is limited to settings where a set of hypotheses can be jointly tested.
This is virtually impossible when different analysis methods (Cox regression,
logistic regression, linear regression) are used for the respective endpoints
(survival, disease response, QOL). It is possible in settings where hypotheses
for all the outcomes can be tested in the same model, although program-
ming these analyses is a bit of a burden. Consider the test of differences in
the 7 subscales between the two treatment arms of the breast cancer trial.
One approach is to test all possible 2^7 − 1 = 127 combinations and use a
decision matrix strategy to compute the adjusted p-values as previously de-
scribed (Table 13.9). As the number of endpoints increases this becomes a
bit cumbersome. The procedure can be slightly simplified if one does not
need the adjusted p-value for the null hypotheses that are not rejected. In
this example, because HH , HW , HT , HF , and HN are not rejected, it is not
necessary to test the combinations that only include those hypotheses. But
if we omit those tests, we can only state that the adjusted p-value for that
hypothesis is above a certain value (Table 13.9).
13.7 Summary
• Multiple endpoints and testing create a major analytic problem in clinical
trials with HRQoL assessments.
• Three strategies are generally necessary:
- Limiting the number of endpoints
- Summary measures and statistics (see Chapter 14)
- Multiple comparison procedures
• Multiple comparison procedures are most useful for measures of the
multiple domains of HRQoL, especially when there is a concern about
obscuring effects of components that move in opposite directions with
the use of summary measures.
14.1 Introduction
In most clinical trials, investigators assess HRQoL longitudinally over the pe-
riod of treatment and, in some trials, subsequent to treatment. Each assess-
ment involves multiple scales that measure the general and disease-specific
domains of HRQoL. For example, in the lung cancer trial, there are three
treatment arms, four assessments and five subscales of the FACT-Lung. As
a result, addressing the problem of multiple comparisons is one of the ana-
lytic challenges in these trials [Korn and O’Fallon, 1990]. Not only are there
concerns about Type I errors, but large numbers of statistical tests generally
result in a confusing picture of HRQoL that is difficult to interpret [DeK-
lerk, 1986]. As mentioned in the previous chapter, composite endpoints and
summary measures are one of three strategies that in combination will reduce
Type I errors, attempt to conserve power and improve interpretation. In this
chapter, the computations of composite endpoints and summary measures are
presented with details concerning how their derivation is affected by missing
data.
1987a, Matthews et al., 1990], area under the curve [Matthews et al., 1990,
Cox et al., 1992] and time to reach a peak or a pre-specified value [Pocock et
al., 1987a, Matthews et al., 1990].
Missing data are dealt with differently in the construction of these measures
and statistics [Fairclough, 1997]. For composite endpoints (Sections 14.3 and
14.5), we must develop a procedure to handle missing data at the subject
level possibly by using interpolation and extrapolation or by imputing missing
observations. For summary measures (Section 14.4), missing data handled by
the selection of the analytic model as described in Chapters 9-12.
Method: Summary measure (S_h = Σ_{j=1}^{J} w_j g(β_hj))
Strategy: 1. Fit multivariate model, estimating means (or parameters) for
             repeated measures or mixed-effects model.
          2. Compute summary statistic (generally a linear combination of
             parameters).
          3. Test hypothesis (H0: S_h = S_h′ or S_h − S_h′ = 0).
Advantage: Strategies for handling missing data are model based.
Disadvantage: Harder to describe procedure.
Increased Power
Composite endpoints and summary measures have greater power to detect
small but consistent differences that may occur over extended periods of time
or multiple domains of HRQoL, in contrast to a multivariate test (Hotelling’s
T). To illustrate, consider the two hypothetical examples displayed in Fig-
ure 14.1. In the first example (Figure 14.1 left), the measure of HRQoL is
consistently better in one treatment during all four post-baseline assessments.
The multivariate test of differences at the four follow-ups is non-significant
(F4,100 = 1.12, p = 0.35); however, the hypothesis based on the mean of the
four follow-ups is rejected (t100 = −2.11, p = 0.037). In the second example
(Figure 14.1 right), the second treatment has a negative impact (toxicity) 1
month post diagnosis, but this difference almost disappears by the third month
and begins to reverse by the ninth month. The results are reversed: the multivariate
test of differences at the four follow-ups is rejected (F4,100 = 4.3, p = 0.003);
however, the hypothesis based on the mean of the four follow-ups is not
rejected (t100 = −0.76, p = 0.45). Although the differences between the groups
in both examples are of clinical interest, in most clinical trials one would wish
to have test procedures that are more sensitive to (or have greater power to
detect) the consistent differences displayed in Figure 14.1 (left).
FIGURE 14.1 Two hypothetical examples of QoL response across four post-baseline assessments, with the multivariate F-test and the t-test of the summary (sum of the follow-ups) shown for each panel.
The selection also depends on the expected pattern of change across time
and patterns of missing data. Consider several possible patterns of change
in HRQoL across time (Figure 14.2). One profile is a steady rate of change
over time reflecting either a constant decline in HRQoL (Figure 14.2-A) or a
constant improvement (Figure 14.2-B). The first pattern is typical of patients
with progressive disease where standard therapy is palliative rather than cu-
rative. This is the pattern observed in the two lung cancer trials (Studies 3
and 5). This pattern of change suggests that the rate of change or slope is
a possible choice of a composite endpoint. A measure defined as the change
from baseline to the last measure might initially seem relevant, but may not
be desirable if patients who fail earlier and thus drop out from the study
earlier have smaller changes than those patients with longer follow-up.
An alternative profile is an initial rapid change with a subsequent plateau
after the maximum therapeutic benefit is realized (Figure 14.2-D). This might
occur for therapies where the dose needs to be increased slowly over time or
where there is a lag between the time therapy is initiated and the time maximal
benefit is achieved (Studies 2 and 6). This profile illustrates the importance
of identifying the clinically relevant question a priori. If the objective is to
identify the therapy that produces the most rapid improvement in HRQoL,
the time to reach a peak or pre-specified value is a good choice. If, in contrast,
the ultimate level of benefit is more important than the time to achieve the
benefit, then a measure such as the post-treatment mean or mean change
relative to baseline is desirable.
A third pattern of change could occur with a therapy that has transient
benefits or toxicity (Figures 14.2-E and F). For example, individuals may
experience transient benefits and then return to their baseline levels after
the effect of the therapy has ceased. Alternatively, a therapy for cancer may
significantly reduce HRQoL during therapy but ultimately result in a better
HRQoL following therapy than the patient was experiencing at the time of
diagnosis [Levine et al., 1988, Fetting et al., 1998] (Studies 1 and 4). For these
more complex patterns of change over time, a measure such as the area under
the curve might be considered as a summary of both early and continued
effects of the therapy.
FIGURE 14.2 Possible patterns of change in QOL scores across time: (A) steady decline, (B) steady improvement, (C) decline with plateau, (D) improvement with plateau, (E) temporary decline, (F) temporary improvement.
the orthogonal scoring algorithms, physical function, role physical and bod-
ily pain subscales make a modest negative contribution to the MCS and the
mental health and role-emotional subscales make a modest negative contribu-
tion to the PCS score (Table 14.3). These negative contributions can produce
surprising results. Simon et al. [1998] describe this in a study of antidepressant
treatment where there were modest positive effects over time on the physical
function, role-physical, bodily pain and general health, but a negative score
(non-significant) was observed for the PCS because of the very strong positive
effects in the remaining scales. The negative contribution of these remaining
subscales overwhelmed the smaller contributions of the first four subscales.
The point of this illustration is that investigators should be aware of the
weights that are used and examine the individual components descriptively
to fully understand the composite measures.
Looking at the above 7 questions, how much would you say your
PHYSICAL WELL-BEING affects your quality of life?
0 1 2 3 4 5 6 7 8 9 10
Not at all Very much so
Similar questions appear for the other subscales. Responses to these questions
could be used to weight the responses to each of the subscales.
Interestingly, these experimental questions have become optional in Version
4 and are not recommended for use in clinical trials. The developer cites two
basic reasons [Cella, 2001]. First, using weighted scores and unweighted scores
produce essentially the same results in all analyses examined. Second was the
concern that the respondents were not answering the question as intended.
Although some appeared to answer as a true weight, others seemed to answer
the question as a summary of their response to the items and as many as
15-25% did not seem to understand at all and left the questions blank or
responded in rather unusual ways. Other considerations may be 1) a more
complicated scoring system that would preclude hand scoring of the scale, 2)
requirements for additional validation studies and 3) non-equivalence of scales
from study to study.
very similar aspects of HRQoL will contribute less to the overall score than
two subscales that measure very different aspects of HRQoL. For example,
consider results from the lung cancer trial (Study 3). Table 14.4 displays the
correlations among the subscales. The physical well-being scores have the
strongest correlation with the other subscales (ρ̂ = 0.45 − 0.77) and thus that
scale has the smallest weight (Table 14.5). In contrast, the social well-being
scores have the weakest correlation with other scales and the largest weight.
The procedure is first to compute standardized scores (zik ) for each of the K
Composite endpoints with weights based on the inverse correlation are the
most powerful [O’Brien, 1984, Pocock et al., 1987a] and do not require specifi-
cation prior to analysis because the weights are determined by the data. The
disadvantages are that they vary from study to study [Cox et al., 1992] and
they may not reflect the importance that patients place on different domains.
Summary Measures
The weighted average of the individual values proposed by O’Brien [1984] also
extends to the construction of a summary measure with the use of a weighted
average of asymptotically normal test statistics such as the two-sample t-
statistic.
t_hk = θ̂_hk/σ̂(θ̂_hk),   t_h = (t_h1, …, t_hK)′,   (14.1)
Ŝ_h = J′R⁻¹t_h,   J = (1, …, 1)′   (14.2)
R is the estimated common correlation matrix of the raw data (Σ̂) or the
pooled correlation matrix of the estimated means (μ̂jk ). An alternate way to
express the summary statistic is
Ŝ_h = Σ_{k=1}^{K} w_k g(β̂_hk)   (14.3)
where g(β̂_hk) = μ̂_hk/σ̂(μ̂_hk) and (w_1, …, w_K) = J′R⁻¹. Because Ŝ_h is a linear
combination of asymptotically normal parameter estimates, the asymptotic
variance of the composite endpoint is
Var(Ŝ_h) = W′ Cov[g(β̂_hk)] W.   (14.4)
For two treatment groups, we can test the hypotheses S1 = S2 or θ = S1 −S2 =
0 using a t-test with N − 4 degrees of freedom
t_(N−4) = (Ŝ_1 − Ŝ_2)[Var(Ŝ_1 − Ŝ_2)]^(−1/2) = θ̂[Var(θ̂)]^(−1/2)   (14.5)
for small samples [Pocock et al., 1987a]. More generally, for large samples we
can test the hypothesis S_1 = S_2 = ··· = S_H using a Wald χ² statistic:
χ²_(H−1) = φ̂′[Var(φ̂)]⁻¹φ̂,   φ̂ = (Ŝ_2 − Ŝ_1, …, Ŝ_H − Ŝ_1)′.   (14.6)
14.4.1 Notation
The general procedure for the construction of summary measure is to obtain
parameter estimates for the hth group (β̂hj ) and then reduce the set of J
estimates to a single summary statistic:
Ŝ_h = Σ_{j=1}^{J} w_j g(β̂_hj).   (14.7)
Control: Ŝ_0 = ((β̂_0 + β̂_1) + (β̂_0 + β̂_2) + (β̂_0 + β̂_3))/3 − β̂_0
where β̂_0 + β̂_1, β̂_0 + β̂_2 and β̂_0 + β̂_3 are the means at Times 2, 3 and 4 and β̂_0 is the mean at Time 1.
These examples are simple and it is possible to figure out the summary
measures in one's head. But to ensure that it is done correctly when models
get more complicated, I recommend that you create tables of the form
displayed in Table 14.7 and double-check the order in which the parameters
are listed to ensure that the summary measure is correctly computed. When
additional covariates are added, they should also be included. Centering the
covariates (see Section 5.23) simplifies these computations, as they drop out
of the contrasts.
= (6 + 12 + 26)β̂_1/3 + (6 + 20)β̂_2/3
Experimental: Ŝ_1 = ((β̂_0 + 6β̂_3) + (β̂_0 + 12β̂_3 + 6β̂_4) + (β̂_0 + 26β̂_3 + 20β̂_4))/3 − β̂_0
(terms evaluated at Weeks 6, 12 and 26)
= (6 + 12 + 26)β̂_3/3 + (6 + 20)β̂_4/3
Difference: Ŝ_1 − Ŝ_0 = 44(β̂_3 − β̂_1)/3 + 26(β̂_4 − β̂_2)/3
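The contrast algebra is easy to get wrong when knots are involved, so a quick numerical check helps. The sketch below (the function name and the illustrative coefficient values are mine) computes the control-arm summary directly from the fitted values and compares it with the closed form:

```python
def mean_change_control(b0, b1, b2):
    # Piecewise-linear model with a knot at week 6: b1 is the initial slope
    # and b2 the change in slope after week 6. Summary measure: mean of the
    # predicted values at weeks 6, 12 and 26 minus the baseline mean b0.
    fitted = [b0 + 6 * b1,
              b0 + 12 * b1 + 6 * b2,
              b0 + 26 * b1 + 20 * b2]
    return sum(fitted) / 3 - b0

# The closed form (6 + 12 + 26)*b1/3 + (6 + 20)*b2/3 agrees, and the
# intercept b0 drops out of the contrast.
```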
Repeated Measures
When the model has been structured as a repeated measures analysis, the
AUC can be estimated for each of the H groups by using a trapezoidal ap-
proximation (Figure 14.3). The area of each trapezoid is equal to the product
of the height at the midpoint ((Yhj + Yh(j−1) )/2) and the width of the base
(tj − tj−1 ). The total area is calculated by adding areas of a series of trape-
zoids:
AUC_h(t_J) = Ŝ_h = Σ_{j=2}^{J} (μ̂_hj + μ̂_h(j−1))/2 × (t_j − t_{j−1})   (14.8)
The equation can be rewritten as a weighted function of the means :
AUC_h(t_J) = Ŝ_h = (t_2 − t_1)/2 μ̂_h1 + Σ_{j=2}^{J−1} (t_{j+1} − t_{j−1})/2 μ̂_hj + (t_J − t_{J−1})/2 μ̂_hJ   (14.9)
FIGURE 14.3 Calculation of the AUC using a trapezoidal approximation.
= (6 − 0)/2 μ̂_h1 + (12 − 0)/2 μ̂_h2 + (26 − 6)/2 μ̂_h3 + (26 − 12)/2 μ̂_h4
Dividing this quantity by tJ or scaling time so that tJ = 1 allows us to
interpret the summary measure as the average score over the period of interest.
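Equations (14.8) and (14.9) are algebraically identical, which is easy to confirm numerically. A Python sketch (the function names and the illustrative means are mine) using the lung cancer assessment times of 0, 6, 12 and 26 weeks:

```python
def auc_trapezoid(means, times):
    # Equation (14.8): sum the areas of the trapezoids between assessments.
    return sum((means[j] + means[j - 1]) / 2 * (times[j] - times[j - 1])
               for j in range(1, len(times)))

def auc_weights(times):
    # Equation (14.9): the same AUC expressed as a weighted sum of the means.
    J = len(times)
    w = [(times[1] - times[0]) / 2]
    w += [(times[j + 1] - times[j - 1]) / 2 for j in range(1, J - 1)]
    w.append((times[-1] - times[-2]) / 2)
    return w

times = [0, 6, 12, 26]
means = [50, 55, 60, 58]  # hypothetical group means
```

For these times the weights are 3, 6, 10 and 7, matching the coefficients (6 − 0)/2, (12 − 0)/2, (26 − 6)/2 and (26 − 12)/2; dividing the AUC by 26 then gives the time-averaged score.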
where β̂h0 is the estimate of the intercept for the hth group, β̂h1 is the linear
coefficient for the hth group, etc.
For example, in the renal cell carcinoma study, if we assumed a quadratic
model, the AUC is:
AUC_h(T) = Σ_{j=0}^{J} β̂_hj T^(j+1)/(j + 1)
         = β̂_h0 T¹/1 + β̂_h1 T²/2 + β̂_h2 T³/3
If we were interested in the AUC during the initial period of therapy, defined
as the first 8 weeks, the AUC is:
AUC_h = β̂_h0 8¹/1 + β̂_h1 8²/2 + β̂_h2 8³/3 = 8β̂_h0 + 32β̂_h1 + 170.67β̂_h2
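The polynomial integration above amounts to one line of code; an illustrative Python sketch (the function name is mine):

```python
def auc_polynomial(betas, T):
    # Integrate beta_0 + beta_1*t + beta_2*t^2 + ... from 0 to T:
    # each coefficient picks up the factor T**(j+1) / (j+1).
    return sum(b * T ** (j + 1) / (j + 1) for j, b in enumerate(betas))
```

For T = 8 the three coefficients are weighted by 8, 32 and 170.67, the multipliers shown above.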
Estimates for longer periods of time, such as 26 and 52 weeks, are generated
using the same procedure. In this study, the results would appear as follows
where T is rescaled to be equal to 1. The estimates are now on the same scale
as the original measure and have the alternative interpretation of the average
score over the period from 0 to T .
For a piecewise regression model, one could estimate the means at each knot
and then use a trapezoidal approximation. Integration provides a more direct
method:
AUC_h(T) = Ŝ_h = ∫_{t=0}^{T} Σ_{j=0}^{J} β̂_hj t^[j] ∂t
         = β̂_h0 T + β̂_h1 T²/2 + Σ_{j=2}^{J} β̂_hj max(T − T^[j], 0)²/2   (14.11)
where β̂_h0 is the estimate of the intercept for the hth group, β̂_h1 is the initial
slope, and β̂_hj, j ≥ 2, is the change in slope at the knot T^[j]; t^[0] = 1,
t^[1] = t, and t^[j] = max(t − T^[j], 0) for j ≥ 2.
In the renal cell carcinoma study, the piecewise regression model has two
knots, one at 2 weeks and one at 8 weeks. Thus, the estimated AUC (rescaled
to T=1) is
AUC_h(T) = (β̂_h0 T + β̂_h1 T²/2 + β̂_h2 max(T − 2, 0)²/2 + β̂_h3 max(T − 8, 0)²/2)/T
If we were interested in the early period defined as the first 8 weeks, then
AUC_h(8) = (β̂_h0 × 8 + β̂_h1 × 8²/2 + β̂_h2 × max(8 − 2, 0)²/2 + β̂_h3 × max(8 − 8, 0)²/2)/8
         = (8β̂_h0 + 32β̂_h1 + 18β̂_h2 + 0 × β̂_h3)/8.
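The piecewise calculation of Equation (14.11) can be sketched the same way (the function name is mine):

```python
def auc_piecewise(betas, knots, T):
    # betas: [intercept, initial slope, slope change at knot 1, ...].
    # Equation (14.11): the intercept contributes T, the initial slope
    # T^2/2, and each slope change max(T - knot, 0)^2 / 2.
    area = betas[0] * T + betas[1] * T ** 2 / 2
    area += sum(b * max(T - k, 0.0) ** 2 / 2
                for b, k in zip(betas[2:], knots))
    return area
```

With knots at 2 and 8 weeks and T = 8, the four coefficients are weighted by 8, 32, 18 and 0, as in the display above; dividing by T = 8 rescales to the average score.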
because simple univariate tests (e.g. t-tests) can be used for the analysis. For
example, one might compute the slope of the scores observed for each sub-
ject and test the hypothesis that the slopes differed among the experimental
groups. When the period of observation varies widely among subjects or data
collection stops for some reason related to the outcome, this construction of
composite endpoints is challenging. Easy fixes without careful thought only
hide the problem. Many of the procedures for constructing composite end-
points assume data are missing completely at random (MCAR) [Omar et al.,
1999] and are inappropriate in studies of HRQoL.
14.5.1 Notation
The construction of a composite endpoint that reduces the set of J measure-
ments (Yij ) on the ith individual to a single value (Si ), can be described as
a weighted sum of the measurements (Yij ) or a function of the measurements
(f (Yij )). The general form is
S_i = Σ_{j=1}^{J} w_j f(Y_ij)   (14.12)
Assigning zero is a valid approach for HRQoL scores that are explicitly
anchored at zero for the health state of death. These are generally scores
measured using multi-attribute, time trade off (TTO) or standard gamble
(SG) techniques to produce utility measures (see Chapter 15). However, the
majority of HRQoL instruments are developed to maximize discrimination
among patients. In these instruments a value of zero would correspond to
the worst possible outcome on every question. Even as the patients approach
death, this is unlikely for most scales. Assigning zero also has some statistical
implications. If the proportion of deaths is substantial, the observations may
mimic a binomial distribution and the results roughly approximate a Kaplan-
Meier analysis of survival.
When the changes over time are approximately linear, the average rate of
change (or slope) may provide an excellent summary of the effect of an inter-
vention. When there is virtually no dropout during the study or the dropout
occurs only during the later part of the study, it is feasible to fit a simple
regression model to the available data on each individual. This is often referred
to as the ordinary least squares (OLS) slope, estimated from a simple linear
regression:
β̂_i^OLS = Σ_{j=1}^{J} (X_ij − X̄_i)(Y_ij − Ȳ_i) / Σ_{j=1}^{J} (X_ij − X̄_i)² = (X_i′X_i)⁻¹X_i′Y_i   (14.13)
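The per-subject calculation in Equation (14.13) is ordinary simple-regression arithmetic; an illustrative Python sketch (the function name is mine):

```python
def ols_slope(x, y):
    # Equation (14.13): least-squares slope from one subject's
    # available assessment times x and scores y.
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    num = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    den = sum((xi - xbar) ** 2 for xi in x)
    return num / den
```

A subject observed at only a few early times gets a slope from this formula just as a completer does, but with a much larger standard error, which is the instability discussed below.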
Obviously, slopes for each individual can be estimated if all subjects have
two or more observations. However, the estimates of the slope will have a
large associated error when the available observations span a short period of
time relative to the entire length of the study. The wide variation in slopes
estimated with OLS is displayed in Figure 14.4. If there is a substantial
proportion of subjects with only one or two observations, it may be necessary
to use either imputed values for later missing values or the empirical Bayes
(EB) estimates from a mixed-effects model. Because this second estimate
also uses information from other individuals, it is more stable especially with
highly unbalanced longitudinal studies where some individuals have only a
few observations. The EB slopes display shrinkage toward the average slope
as the estimator is a weighted average of the observed data and the overall
average slope [Laird and Ware, 1982] (See Section 5.6). For individuals with
only a few observations, the EB slope is very close to the overall average. This
is illustrated by the narrow distribution of estimates displayed in Figure 14.4.
Note that both approaches assume the data are missing at random (MAR).
The distribution of the slopes derived using LVCF illustrates the tendency of
these values to center around zero; for subjects who drop out after the first
observation, the LVCF slope is zero by definition. Finally, the distribution of slopes is displayed
from the multiply imputed data. Values are more widely distributed than for
either the EB or LVCF strategies. Clearly, the distribution of the composite
endpoints can be sensitive to the method used to handle the missing data.
S_i = AUC_i = Σ_{j=2}^{J} (Y_ij + Y_i(j−1))/2 × (t_j − t_{j−1})   (14.14)
AUC_i = (t_2 − t_1)/2 Y_i1 + Σ_{j=2}^{J−1} (t_{j+1} − t_{j−1})/2 Y_ij + (t_J − t_{J−1})/2 Y_iJ   (14.15)
FIGURE 14.4 Distributions of individual slope estimates of QoL response over time computed using the OLS, EB, LVCF and MI strategies; the original figure reported the mean, standard deviation, mode, minimum, maximum and percent of observations excluded for each method.
The issue of selecting a strategy for computing the AUC also occurs when
patients die during the study. One strategy is to extrapolate the curve to zero
at the time of death. Other proposed strategies include assigning values of 1)
the minimum HRQoL score for that individual or 2) the minimum HRQoL
score for all individuals [Hollen et al., 1997]. One strategy will not work for
all studies. Whichever strategy is chosen, it needs to be justified and the
sensitivity of the results to the assumptions examined.
One might be inclined to present the AUC values calculated to the time of
censoring, as one would present survival data. This approach would appear
to have the advantages of displaying more information about the distribution
of the AUC values and accommodating administrative censoring. Unfortu-
nately, administrative censoring is informative on the AUC scale [Gelber et
al., 1998, Glasziou et al., 1990, Korn, 1993] and the usual Kaplan-Meier esti-
mates are biased. Specifically, if the missing data are due to staggered entry
and incomplete follow-up that is identical for the two groups, the group with poorer
HRQoL will have lower values of the AUC and will be censored earlier on the
AUC scale. Knowing when a subject is censored on the AUC scale gives us
some information about the AUC score and thus the censoring is informa-
tive. Korn [Korn, 1993] suggests an improved procedure to reduce the bias of
the estimator by assuming that the probability of censoring in short intervals
is independent of the HRQoL measures prior to that time. Although this
assumption is probably not true, if the HRQoL is measured frequently and
the relationship between HRQoL and censoring is weak, the violation may be
small enough that the bias in the estimator will also be small.
A practical problem with the use of the AUC or the mean of post-baseline
measures as a composite endpoint occurs when the baseline HRQoL scores
differ among groups. If one group contains more individuals who score their
HRQoL consistently higher than other individuals, small (possibly statistically
non-significant) differences are magnified over time in the composite endpoint.
One possible solution is to calculate the AUC relative to the baseline score.
$$AUC_i^{*} = AUC_i - Y_{i1}(t_J - t_1) \tag{14.16}$$
$$\qquad = \left( \frac{t_2 - t_1}{2} - (t_J - t_1) \right) Y_{i1} \;+\; \sum_{j=2}^{J-1} \frac{t_{j+1} - t_{j-1}}{2}\, Y_{ij} \;+\; \frac{t_J - t_{J-1}}{2}\, Y_{iJ}$$
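Equation 14.16 can be checked numerically: two subjects whose trajectories differ only by a constant baseline shift receive the same baseline-adjusted AUC. The scores below are hypothetical:

```python
def auc_adjusted(t, Y):
    """Equation 14.16: trapezoidal AUC minus the rectangle Y_i1 * (t_J - t_1),
    so subjects are compared relative to their own baseline."""
    auc = sum((t[j] - t[j - 1]) * (Y[j] + Y[j - 1]) / 2 for j in range(1, len(t)))
    return auc - Y[0] * (t[-1] - t[0])

# two hypothetical subjects with identical changes but baselines 10 points apart
print(auc_adjusted([0, 6, 12, 26], [50, 55, 60, 58]))   # 186.0
print(auc_adjusted([0, 6, 12, 26], [60, 65, 70, 68]))   # 186.0
```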
patients who remained on therapy. They would then be assigned the lowest
possible rank for measurements scheduled after death or during the time of
excessive toxicity. Other strategies that might be considered were discussed
in Chapter 6 (see Table 6.7).
* n_{gh} is the number of subjects in the gth stratum and the hth treatment arm; j_g is the number of observations used to compute the composite endpoint in the gth group; x_{gj} denotes the times when subjects in the gth group contribute observations.
14.6 Summary
• Composite endpoints and summary measures
15.1 Introduction
In this chapter, I will change the focus to outcomes that can be interpreted
on a time scale. Most studies of HRQoL consider measurement from one of
two perspectives, outcomes expressed in the metric of the QOL scale and out-
comes expressed in the metric of time. The latter group includes outcomes
such as quality adjusted survival (QAS), quality adjusted life years (QALYs),
and Q-TWiST that incorporate both quantity and quality of life. In some
trials, these outcomes are measured as one component of an economic analy-
sis. This is beyond the scope of this book and I will not attempt to address
the analytic issues that arise in economic analyses; the reader is advised to
consult books that have this as their sole focus [Drummond, 2001]. In other
trials, the interest is in balancing improvements in survival with the impact
of treatment on HRQoL. Questions of this nature are particularly relevant in
diseases that have relatively short expected survival and the intent of treat-
ment is palliative, such as advanced-stage cancer. Scientific investigation of
the balance between treatment options in diseases that have extended survival
is generally outside the context of clinical trials and typically utilizes larger
population-based observational studies.
This chapter will briefly present two approaches that might be encountered
in a clinical trial. In the first, measures of patient preferences are measured
repeatedly over time (Section 15.2). In the second approach, the average time
in various health states is measured and weighted using preference measures
that are specific to each of the health states (Section 15.3).
15.2 QALYs
This section describes calculation of quality-adjusted-life-years in trials where
patient preferences are measured repeatedly over time. Trials in which pa-
323
© 2010 by Taylor and Francis Group, LLC
324 Design and Analysis of Quality of Life Studies in Clinical Trials
tient preferences are measured directly using standard gamble (SG) or time-
trade-off (TTO) measures are difficult to implement. This is balanced by the
value of obtaining the patient’s own preferences for their current health state.
It is more common in clinical trials to obtain multi-attribute measures (e.g.
HUI, EQ-5D, QWB) or transform health status scales (e.g. SF-36) to utility
measures [Franks et al., 2003, Feeny, 2005]. Both of these methods rely on
transformations of measures of current health states using formulas derived
from the relationship between health states and utilities based on preferences
from the general population. There are valid arguments for both approaches.
The focus of the remainder of this section is on the analysis of longitudinal
data obtained from clinical trials.
The basis for all of the methods is estimation of the area under a curve
(AUC) generated by plotting the utility measure versus time. There are two
approaches. The first strategy estimates the average trajectory in each treat-
ment group and then calcuates the AUC using the parameter estimates (see
Chapter 14). The second strategy, which will be the focus of this section,
starts with the calculation of a value for each individual, QALYi , that is a
function of the utility scores and time. These values are then analyzed
as univariate measures.
$$QALY_i^R = \sum_{j=2}^{J_i} u_j\, t \;+\; \frac{u_{J_i}}{2}\,(t_D - t_{J_i}) \tag{15.1}$$

$$QALY_i^T = \sum_{j=2}^{J_i} \frac{(u_j + u_{j-1})}{2}\, t \;+\; \frac{u_{J_i}}{2}\,(t_D - t_{J_i}) \tag{15.2}$$
In both approaches, the contribution of the final interval that includes death
uses a trapezoidal function. The calculations are illustrated in Table 15.1.
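Under these ideal conditions (assessments exactly one recall period apart and follow-up to death), the two calculations can be sketched in Python; the utilities, times, and time of death below are hypothetical:

```python
def qaly_rect(u, recall, t_D, times):
    """Equation 15.1: rectangular rule, plus a trapezoid (u_J declining to 0)
    for the final interval ending at death."""
    return sum(u[j] * recall for j in range(1, len(u))) + u[-1] / 2 * (t_D - times[-1])

def qaly_trap(u, recall, t_D, times):
    """Equation 15.2: trapezoidal rule with the same final-interval term."""
    core = sum((u[j] + u[j - 1]) / 2 * recall for j in range(1, len(u)))
    return core + u[-1] / 2 * (t_D - times[-1])

u = [0.9, 0.8, 0.6, 0.4]      # hypothetical monthly utilities
times = [0, 1, 2, 3]          # assessments every month (recall period = 1)
print(round(qaly_rect(u, 1.0, 3.5, times), 3))   # 1.9
print(round(qaly_trap(u, 1.0, 3.5, times), 3))   # 2.15
```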
These conditions are obviously infeasible in most trials: the assessment schedule is less frequent, there are missing and mistimed assessments, and not all subjects are followed to death. So how do we calculate QALYs under more realistic conditions?
Limited Follow-Up
Most trials are not designed to follow all subjects to death. In these trials,
we need to estimate QALYs over a fixed period of time, t_C. The choice of
this period is influenced by the length of follow-up of those subjects
who have not experienced death. Computationally, the estimation of
QALYs will be easiest if we use the minimum follow-up time. It requires no
assumptions about surviving subjects after their last follow-up but ignores
data on individuals who survive longer. Equations 15.1 and 15.2 are modified
slightly for subjects who did not die during the follow-up period, where J_i
is now the number of assessments before the minimum follow-up time (t_C), and
u_C = u_{J_i} + (u_{J_i+1} - u_{J_i})/(t_{J_i+1} - t_{J_i}) · (t_C - t_{J_i}) is the estimated utility at
t_C (using a trapezoidal estimate for the partial follow-up time).
$$QALY_i^R = \sum_{j=2}^{J_i} u_j\, t \;+\; u_C\,(t_C - t_{J_i}) \tag{15.3}$$

$$QALY_i^T = \sum_{j=2}^{J_i} \frac{(u_j + u_{j-1})}{2}\, t \;+\; \frac{(u_{J_i} + u_C)}{2}\,(t_C - t_{J_i}) \tag{15.4}$$
At the other extreme is the maximum follow-up time of all survivors. This
procedure is unjustified both in terms of the precision of the estimate and
tractability of estimation procedures. Choosing the median follow-up time (or
some value close to it) is a reasonable compromise. This requires identifying
a strategy to deal with individuals who are censored prior to the median
follow-up time. A potential, though not necessarily recommended, method of
extrapolation would be a form of last value carried forward.
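The limited-follow-up calculation (equations 15.3-15.4) can be sketched in Python. The times and utilities are hypothetical, and assessments are assumed equally spaced one recall period apart:

```python
import bisect

def utility_at(t, u, t_C):
    """Linearly interpolate the utility at the censoring time t_C
    (the u_C of equations 15.3-15.4)."""
    j = bisect.bisect_right(t, t_C) - 1          # last assessment at or before t_C
    return u[j] + (u[j + 1] - u[j]) / (t[j + 1] - t[j]) * (t_C - t[j])

def qaly_trap_limited(t, u, t_C, recall):
    """Equation 15.4: trapezoidal QALY up to t_C for a surviving subject."""
    J = bisect.bisect_right(t, t_C)              # assessments at or before t_C
    u_C = utility_at(t, u, t_C)
    core = sum((u[j] + u[j - 1]) / 2 * recall for j in range(1, J))
    return core + (u[J - 1] + u_C) / 2 * (t_C - t[J - 1])

t = [0, 1, 2, 3, 4]                  # hypothetical monthly assessments
u = [0.9, 0.8, 0.7, 0.6, 0.5]        # hypothetical utilities
u_C = utility_at(t, u, 2.5)          # 0.65
q = qaly_trap_limited(t, u, 2.5, 1.0)   # 1.6 + 0.3375 = 1.9375
print(round(u_C, 4), round(q, 4))
```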
Timing of Assessments
The timing of the assessments is likely to vary from subject to subject and
will be less frequent than the period of recall (Figure 15.2). For the method
relying solely on trapezoidal approximation, the equations are only slightly
modified. In equation 15.6, t is replaced by the observed difference between
the assessments tj − tj−1 as illustrated in Table 15.2. Modifying equation 15.3
to obtain equation 15.5 is slightly more complicated. When the time between
assessments is less than the period of recall (t), the length of the interval is
tj − tj−1 . When the time between assessments is greater than the period of
recall, the value of the utility is extended back for the recall period and a
trapezoidal approximation is used for the period of time that precedes the
period of recall. This is illustrated in Figure 15.2 (left).
$$\begin{aligned} QALY_i^R = \sum_{j=2}^{J_i} \Big[\, & u_j \min(t,\; t_j - t_{j-1}) + \frac{(u_j + u_{j-1})}{2}\, \max\!\big(0,\; (t_j - t) - t_{j-1}\big) \Big] \\ & + u_C\, \min(t,\; t_C - t_{J_i}) + \frac{(u_C + u_{J_i})}{2}\, \max\!\big(0,\; (t_C - t) - t_{J_i}\big) \end{aligned} \tag{15.5}$$
$$QALY_i^T = \sum_{j=2}^{J_i} \frac{(u_j + u_{j-1})}{2}\,(t_j - t_{j-1}) \;+\; \frac{(u_{J_i} + u_C)}{2}\,(t_C - t_{J_i}) \tag{15.6}$$
If death occurs prior to t_C, then the last two terms of equation 15.5 and
the last term of equation 15.6 are replaced by (u_{J_i}/2)(t_D - t_{J_i}), as was done in
equations 15.1 and 15.2.
Missing Assessments
When the time between assessments is extended due to missing assessments,
the above approach extrapolates between the observed data. This is appro-
priate when the changes over time occur at a constant rate. If missing assess-
ments occur because of abrupt changes in health states, this approach may
result in biased estimates. This is one of the issues that the developers of the
Q-TWiST methodology cite as a motivation for developing that method
[Glasziou et al., 1990].
15.3 Q-TWiST
A second method to integrate quality and quantity of life is Q-TWiST. A fun-
damental requirement for this approach is that we can define distinct health-
states. In the original application of this method [Glasziou et al., 1990] in
breast cancer patients, four health states were defined:
TOX the period during which the patients were receiving ther-
apy and presumably experiencing toxicity;
TWiST the period after therapy during which the patients were
without symptoms of the disease or treatment;
REL the period between relapse (recurrence of disease) and
death;
DEATH the period following death.
The second assumption is that each health state is associated with a weight
or value of the health state relative to perfect health that is representative of
the entire time the subject is in that health state (e.g. utility or preference
measures). The assumption that the utility for each health state does not
vary with time has been termed utility independence [Glasziou et al., 1990].
A third assumption is that there is a natural progression from one health
state to another. In the above example, it was assumed that patients would
progress from TOX to TWiST to REL to DEATH. The possibility of skipping
health states, but not going backwards, is allowed. Thus, a patient might
progress from TOX directly to REL or DEATH, but not from TWiST
back to TOX or from REL back to TWiST. Obviously, exceptions could occur, and
if very rare they might be ignored. The quantity Q-TWiST is a weighted score
of the average time spent in each of these health states, where the weights are
based on preference scores. Additional assumptions are made in particular
analyses and will be discussed in the remainder of this section.
$$\Delta_k = T_k - T_{k-1}, \quad k = 1, \cdots, K \tag{15.7}$$

When there is no censoring (or the last observation is an event), most pro-
grams will provide estimates of the mean and standard error of T_k − T_0, but
not for T_k − T_{k−1}. The major difficulty is obtaining the variance of T_k − T_{k−1}. As
a consequence, a bootstrap procedure is used to estimate the means and stan-
dard errors associated with T_1, · · ·, T_K and their differences. (See Appendices
for additional details about bootstrap procedures.)
The procedure is as follows:
FIGURE 15.3 Study 1: Partitioned survival plots for the control and ex-
perimental groups of the breast cancer trial. Each plot shows the estimated
curves for TOX, DFS and Surv. Areas between the curves correspond to time
spent in the TOX, TWiST and REL health states.
Typically the weight for the period of time without symptoms, U_TWiST, is
fixed at a value of 1, implying no loss of QALYs, and the weight for the period after
death, U_DEATH, is fixed at a value of 0. Thus, for this example, there are only two
potentially unknown weights, U_TOX and U_REL.
If estimates of the standard errors of the utilities are available for
each health state, adding this to the bootstrap procedure requires very little
effort. In this illustration, we assume that our estimates are U_TOX = 0.8 and 0.7
(s.e. = 0.05) for the control and experimental groups respectively and U_REL = 0.5
(s.e. = 0.10). In addition to obtaining a random sample of the subjects, we
would also sample values of the utilities. For example, the values of U_TOX
would be sampled from a normal distribution with mean 0.8 and standard
deviation 0.05. Estimates of Q-TWiST obtained in this manner over a
four-year time span are displayed in Table 15.3. An alternative presentation of the
results examines the differences in the Q-TWiST scores as a function of time
(Figure 15.4).
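A minimal version of this bootstrap can be sketched in Python (the book's analyses use SAS). The per-patient months in each health state below are made-up stand-ins, not the breast cancer trial data; the utility means and standard errors follow the illustration in the text:

```python
import random

random.seed(42)

# Hypothetical months spent in each state for 10 patients in one arm
tox   = [3, 4, 2, 5, 3, 4, 3, 2, 4, 3]
twist = [20, 15, 25, 10, 18, 22, 16, 12, 19, 21]
rel   = [6, 8, 4, 9, 5, 7, 6, 8, 5, 6]

def q_twist(tox_m, twist_m, rel_m, u_tox, u_rel):
    # U_TWiST is fixed at 1 and U_DEATH at 0, as in the text
    return u_tox * tox_m + 1.0 * twist_m + u_rel * rel_m

boot = []
for _ in range(2000):
    idx = [random.randrange(len(tox)) for _ in tox]   # resample patients
    u_tox = random.gauss(0.8, 0.05)                   # sample the utilities
    u_rel = random.gauss(0.5, 0.10)                   #   with their s.e.'s
    vals = [q_twist(tox[i], twist[i], rel[i], u_tox, u_rel) for i in idx]
    boot.append(sum(vals) / len(vals))

boot.sort()
est = sum(boot) / len(boot)                           # bootstrap mean
lo, hi = boot[int(0.025 * len(boot))], boot[int(0.975 * len(boot))]
print(round(est, 1), round(lo, 1), round(hi, 1))
```

Sampling the utilities inside the loop propagates their estimation error into the standard error of Q-TWiST, as described above.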
FIGURE 15.5 Threshold plots for two hypothetical trials. The regions in
which the differences between the two treatments are not statistically signifi-
cant are indicated by A >= B and A <= B. The region where treatment A
is superior is indicated by A >> B.
value of Q-TWiST is equal for the two treatment arms. In our example it
would be the line where
Each of the regions defined by equation 15.9 is then divided into regions
where the differences are non-significant or significant at some prespecified
level, generally α = 0.05. The region of non-significance would correspond to
the regions
In the breast cancer trial, there are no possible values of U_TOX and U_REL
that satisfy equation 15.9 when the follow-up time is 48 weeks. So to illustrate
the principle of threshold plots, Figure 15.5 is based on hypothetical data. The
basic idea is that an individual would express their preferences with respect
to the toxicity associated with treatment (U_TOX) and relapse (U_REL). In the
plot on the left, only three regions are present, two in which the differences
between the two treatments are not statistically significant (α = 0.05) as
indicated by A >= B and A <= B and the lower portion (A >> B) where
treatment A is superior. Note that in this plot, the results are virtually
insensitive to the choice of U_REL because the time in this health state is very
similar in both treatment groups. In the plot on the right, all four regions
are present and the choice between the two treatments is more sensitive to
patients' preferences.
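The computation behind a threshold plot is a grid evaluation of the Q-TWiST difference as the unknown utilities vary. The sketch below uses hypothetical mean months in each state, not the trial's estimates:

```python
def qtwist_diff(u_tox, u_rel,
                a=(2.0, 22.0, 5.0),    # treatment A: mean months in (TOX, TWiST, REL)
                b=(6.0, 20.0, 6.0)):   # treatment B (both sets hypothetical)
    """Difference in mean Q-TWiST (A minus B) for given utility weights,
    with U_TWiST = 1 and U_DEATH = 0 as in the text."""
    qa = u_tox * a[0] + a[1] + u_rel * a[2]
    qb = u_tox * b[0] + b[1] + u_rel * b[2]
    return qa - qb

# evaluate over a grid of (U_TOX, U_REL) pairs in [0, 1] x [0, 1]
grid = [(ut / 10, ur / 10) for ut in range(11) for ur in range(11)]
a_preferred = [(ut, ur) for ut, ur in grid if qtwist_diff(ut, ur) > 0]
b_preferred = [(ut, ur) for ut, ur in grid if qtwist_diff(ut, ur) < 0]
print(len(a_preferred), len(b_preferred))
```

With these stand-in durations the difference is −4·U_TOX + 2 − U_REL, so patients who weight toxicity lightly prefer A and those who weight it heavily prefer B; the zero contour is the threshold line of the plot.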
As the number of unknown utilities increases, presenting the results becomes
increasingly complicated. The results in the breast cancer trial
based on the Breast Chemotherapy Questionnaire (BCQ) measures suggest
that the value of U_TOX differs between the two treatments (see Chapter 3)
and there are at least three unknown utilities. If the trial is designed with
primary endpoints such as Q-TWiST, it would be advisable to develop strate-
gies to obtain estimated values of the utilities using multi-attribute measures
or health status scales (e.g. SF-36) with validated conversions to utility mea-
sures [Franks et al., 2003, Feeny, 2005].
15.4 Summary
• QALY and Q-TWiST measures integrate quality and quantity of life;
these measures may be useful when there are tradeoffs associated with
the interventions being assessed in the clinical trial.
• When measures of patient preferences are measured repeatedly over
time, QALYs can be calculated in a number of ways. The major chal-
lenges occur when assessments are missing or follow-up is limited.
• The Q-TWiST approach is useful when distinct progressive health states
can be defined in which patients' preferences can be assumed to be con-
stant.
16.1 Introduction
The analysis plan is an essential part of any randomized clinical trial. Most
statisticians have considerable experience with writing adequate analysis plans
for studies that have one or two univariate endpoints. For example, clini-
cal trials of treatments for cancer have a primary and one to two secondary
endpoints, which are either the time-to-an-event (e.g. disease progression or
death) or a binary outcome such as a complete or partial response (measured
as the change in tumor size). The analysis of these univariate outcomes is
straightforward.
When multiple measures are assessed repeatedly over time, the choices for
analysis are much more complex. This is true whether the outcome is HRQoL
or another set of longitudinal measures. For example, an analgesic trial might
include longitudinal patient ratings of the average and worst pain as well as
use of additional medication for uncontrolled episodes of pain. Unfortunately,
HRQoL is often designated as a secondary endpoint and the development of
an analysis is postponed until after the trial begins. Because the analyses are
more complex and may require auxiliary information to be gathered if missing
data is expected, more attention is required during protocol development to
ensure an analysis can be performed that will answer the research objectives.
At the point that a detailed analysis plan is written, it becomes apparent
whether the investigators have a clear view of their research objectives.
Choices for strategies of handling multiple comparisons will require clarity
about the exact role of the HRQoL data. Will it be used in any decision about
the intervention or will it play only an explanatory or supportive role? The
choice to use a composite endpoint or summary measure or to perform tests
at each follow-up will require clarity about whether the intent is to explore the
patterns of differences over time or to identify a general benefit. The choice of
analysis methods and strategies for handling missing data will require clarity
about the population of inference. For example, are the conclusions intended
for all those started on the intervention (Intent-to-Treat) or conditional on
some criteria (e.g. survival, only while on therapy, etc.)?
This chapter focuses on selected issues that are critical to the development
of a detailed analysis plan for the HRQoL component of a clinical trial. Even
control the experimentwise error rate simultaneously for all endpoints, but may
include other options such as the use of summary measures (see Chapter 14)
or controlling error rates within clusters of related endpoints [Proschan and
Waclawiw, 2000].
The second step is to clarify the roles of the multiple dimensions of HRQoL.
Is the intent to claim a generic HRQoL benefit or only benefits in certain
dimensions? If generic, what are the criteria that will establish that infer-
ence? When the HRQoL assessments are considered supportive of primary
endpoints, then HRQoL is taking the role of a secondary endpoint. Less stringent
criteria are generally required for secondary endpoints; however, HRQoL
and other patient-reported outcomes are often held to a higher standard. It
is wise to consider both control of type I and II error rates as well as options to
reduce the multiplicity of endpoints as a means to improve interpretability of
the results.
If, for example, HRQoL will be examined only after the trial shows a survival
benefit, then a gatekeeper strategy may be warranted.
The design of the longitudinal assessment of any outcome measure will de-
termine the inferences that can be made from a trial. One aspect of this is
who is assessed and for how long. Ideally, all subjects are assessed at all of
the planned assessments; but this ideal can rarely be achieved in practice. In
addition to attrition due to the patient’s decision to drop out, subjects can be
excluded from the analysis as a result of the study design or implementation.
Some exclusions present no threat to the scope and validity of inference that
is possible at the end of the trial. For example, a very small proportion of
subjects may be excluded because appropriate translations are not available.
Because the exclusion is completely unrelated to both treatment assignment
and to future outcomes it can be safely ignored.
Trial designs and analysis plans often create exclusions that limit the ques-
tions that can be answered. At this point, it may be helpful to differentiate
between two research questions. The first type of question has the intent to
compare the outcomes of subjects on each treatment arm with the goal of
determining the superiority (or equivalence) of a particular treatment. The
inference associated with this question relies on randomization and intent-to-
treat (ITT) principles to avoid selection bias. However, designs that mandate
follow-up be discontinued when treatment is discontinued may induce a se-
lection bias in a randomized trial. Similarly, criteria that limit analysis to a
subset of the subjects, such as those based on the number of assessments or a
minimum dose of the intervention, may also induce a selection bias. Some ex-
clusions, such as failure to start therapy, are easily justified in a blinded study
if the exclusion is equally likely across treatment arms. Other exclusions may
depend on treatment and should be very carefully justified with plans for sen-
sitivity analysis and documentation of the impact of this exclusion on results.
In a very small number of trials, the research questions explicitly limit the
analysis to a subset of subjects who are responders, treatment compliant or
survivors. While these questions may be clinically relevant, causal inferences
based on comparisons of the groups are no longer possible. Differences (or lack
of differences) between the subjects selected from two treatment groups may
be attributable to selection rather than the effects of treatment. For example,
the HRQoL trajectories of survivors may be very similar among treatment
groups, but the proportions of subjects surviving may be quite different. Analysis
plans for this type of question should address the issues of selection bias and
the possible presence of confounders.
The basic strategies for the analysis of event-driven designs are presented in
Chapter 3. When missing data are a concern, sensitivity analyses for event-
driven designs include the following options. When the analyst has good
auxiliary information about the subject’s cause of dropout and status after
dropout that is related to the HRQoL measure, multiple imputation using
MCMC or a sequential procedure is a possible approach (Chapter 9). Mixture
models can also be used for repeated measures models (Chapter 10, Section
4). A sensitivity analysis that includes the CCMV restriction and Brown’s
protective estimate is a feasible strategy (see Section 10.4.1) for the simplest
design with only two assessments (pre-/post-) when missing data are limited
to follow-up.
The basic strategies for the analysis of trials with time-driven designs involve
using growth curve models. There are several options for sensitivity analyses.
The joint or shared parameter models rely on the presence of variation in
the slopes that is associated with the time of dropout or another event (see
Chapter 11). A mixture model with parametric restrictions is possible when
subgroups (strata) can be defined for which missing data are ignorable (see
Chapter 10).
Endpoints
√ Definitive research objectives and specific a priori hypotheses
√ Superiority versus equivalence
√ Define primary and secondary endpoints; specify summary measures
Scoring of HRQoL Instruments
√ Specify method (or reference if it contains specific instructions)
√ State how missing responses to individual items are handled
Primary Analysis
√ Power or Sample Size Requirements
√ Procedures for handling multiplicity of HRQoL measurements:
  √ What summary measures are proposed?
  √ Adjustment procedures for multiple comparisons
√ What statistical procedure is used to model the repeated measures/longitudinal data?
√ What assumptions are made about the missing data? What are the expected rates of missing data and how does that affect the analysis?
Sensitivity Analysis
√ What is the plan for sensitivity analyses?
√ What models were considered and why were the particular models selected?
√ Is the description specific enough for another analyst to implement the analysis?
Secondary Analysis
√ If some scales are excluded from the primary analysis, what exactly will be reported for these scales?
√ What is the justification for including these assessments if not part of the primary analysis?
√ What exploratory analyses are planned? (Psychometric characteristics of the HRQoL instrument, relationship between clinical outcomes and HRQoL measures, treatment effects in specific subgroups)
While this approach will provide a more accurate estimate of the required
sample size, the real usefulness is that it provides a method for estimating the
required sample size for any hypothesis that is based on linear functions of
the parameter estimates.
$$H_0: \theta = C\beta = 0 \quad \text{versus} \quad H_A: \theta = \delta_\theta,$$

$$N = (z_{\alpha/2} + z_\beta)^2\, \frac{\varsigma_\theta^2}{\delta_\theta^2}. \tag{16.1}$$
For any hypothesis of the form θ = Cβ, we can estimate the required sample
size if we can determine ςθ2 and δθ2 .
The first special case is the test of equality of means in two independent groups:

$$N = (z_{\alpha/2} + z_\beta)^2\, \frac{\varsigma_\theta^2}{\delta_\theta^2} = (z_{\alpha/2} + z_\beta)^2\, \frac{4\sigma_Y^2}{(\mu_A - \mu_B)^2} \tag{16.2}$$
or the more familiar formula for the number required in each group, n = N/2 = (z_{α/2} + z_β)² 2σ_Y²/(μ_A − μ_B)².
The second special case is the test of equality of means in a paired sample:
n = N, δ_θ = μ_A − μ_B, and ς_θ² = N × Var(μ̂_A − μ̂_B) = N × 2σ_Y²(1 − ρ)/N = 2σ_Y²(1 − ρ), so

$$N = (z_{\alpha/2} + z_\beta)^2\, \frac{\varsigma_\theta^2}{\delta_\theta^2} = (z_{\alpha/2} + z_\beta)^2\, \frac{2\sigma_Y^2 (1 - \rho)}{(\mu_A - \mu_B)^2}. \tag{16.3}$$
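The two special cases (equations 16.2 and 16.3) can be evaluated directly. This Python sketch (illustrative, not the book's SAS code) uses `statistics.NormalDist` for the normal quantiles; the half-standard-deviation effect size is hypothetical:

```python
from statistics import NormalDist

def n_total_two_group(delta, sigma, alpha=0.05, power=0.90):
    """Equation 16.2: total N for comparing two independent group means."""
    z = NormalDist().inv_cdf
    return (z(1 - alpha / 2) + z(power)) ** 2 * 4 * sigma ** 2 / delta ** 2

def n_paired(delta, sigma, rho, alpha=0.05, power=0.90):
    """Equation 16.3: N for a paired comparison with correlation rho."""
    z = NormalDist().inv_cdf
    return (z(1 - alpha / 2) + z(power)) ** 2 * 2 * sigma ** 2 * (1 - rho) / delta ** 2

# hypothetical half-standard-deviation difference
print(round(n_total_two_group(0.5, 1.0)))   # 168 (total over both groups)
print(round(n_paired(0.5, 1.0, 0.5)))       # 42
```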
Incomplete Designs
Clinical trials rarely have complete data, and calculation of ς_θ for endpoints
that summarize data across multiple assessments requires programs that can
accommodate the incomplete data. For incomplete designs, the sample size
approximation can be written as:
$$N = (z_{\alpha/2} + z_\beta)^2 \left[ C \left( \sum_{k=1}^{K} p_k X_k' \hat{\Sigma}_k^{-1} X_k \right)^{-1} C' \right] \bigg/ \delta_\theta^2, \tag{16.10}$$

where the term in square brackets is ς_θ².
Let us assume that dropout is equal in the two groups and that all missing
data are due to dropout. The rates of dropout are 1%, 5%, 10% and 20% at
the four assessments (T1-T4), respectively. Thus 64% will complete all assessments. If
subjects are equally allocated to both groups, the pattern of observations will
appear as displayed in Table 16.1.
In our example, the design matrix (cell means model) for the subjects in
group A with all four observations is:
$$X_k = \begin{bmatrix} 1&0&0&0&0&0&0&0 \\ 0&1&0&0&0&0&0&0 \\ 0&0&1&0&0&0&0&0 \\ 0&0&0&1&0&0&0&0 \end{bmatrix}, \quad p_k = .32.$$

The design matrix for the subjects in group A with only the first two obser-
vations is:

$$X_k = \begin{bmatrix} 1&0&0&0&0&0&0&0 \\ 0&1&0&0&0&0&0&0 \end{bmatrix}, \quad p_k = .05.$$
The first step is to generate a data set with one subject representing each
of the patterns. Each subject will be associated with a weight pk . While
the values of Yi are not involved in the calculation of ςθ , they can be used
to calculate δθ using the expected values under the alternative hypothesis.
Assuming that the standard deviation of Y is 1, we will use Yi = (0, 0, 0, 0)
for group A and Yi = (0, .4, .5, .6) for group B. The dataset would contain a
record for each observation observed in a particular pattern:
ID Group p_k Time Y ID Group p_k Time Y
2 A 0.025 1 0.0 7 B 0.025 1 0.0
3 A 0.050 1 0.0 8 B 0.050 1 0.0
3 A 0.050 2 0.0 8 B 0.050 2 0.4
4 A 0.100 1 0.0 9 B 0.100 1 0.0
4 A 0.100 2 0.0 9 B 0.100 2 0.4
4 A 0.100 3 0.0 9 B 0.100 3 0.5
5 A 0.320 1 0.0 10 B 0.320 1 0.0
5 A 0.320 2 0.0 10 B 0.320 2 0.4
5 A 0.320 3 0.0 10 B 0.320 3 0.5
5 A 0.320 4 0.0 10 B 0.320 4 0.6
Note that in the generated dataset, all four assessments appear only for
patterns 5 and 10. The fourth assessment is omitted in patterns 4 and 9; the
third and fourth assessments are missing in patterns 3 and 8; all but the initial
assessment are missing in patterns 2 and 7; and patterns 1 and 6 are omitted
completely.
If we have good estimates of the variance of the HRQoL measure over time
we can use them. But in their absence we will have to assume some covariance
structure for the repeated measures. Suppose the variance is constant over
time and ρ = 0.5. Then
$$\Sigma_i = \sigma_Y^2 \begin{bmatrix} 1&.5&.5&.5 \\ .5&1&.5&.5 \\ .5&.5&1&.5 \\ .5&.5&.5&1 \end{bmatrix}.$$
Note that we have fixed the variance parameters (noiter in the parms state-
ment). The values of interest are δθ (the Estimate column) and ςθ (the
Standard Error column):
Standard
Label Estimate Error
Theta 1 0.6000 2.3387
Theta 2 0.5000 1.7264
For a two-sided test with α = 0.05 and 90% power, z_{α/2} = −1.96 and
z_β = −1.282, the total sample sizes (equation 16.10) for the two endpoints are

$$\theta_1: \; N = (1.96 + 1.282)^2\, \frac{(2.3387)^2}{(.6)^2} = 159.7$$

$$\theta_2: \; N = (1.96 + 1.282)^2\, \frac{(1.7264)^2}{(.5)^2} = 125.3$$
Note that in the above example, Y and Σ are scaled so that Y_ijk is
standard normal (N(0, 1)) and PARMS specifies the correlation of Y. It is
not necessary to convert everything to a standard normal distribution; it is
critical that one consistently use either the unstandardized or standardized
values of Y (δ) and Σ.
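The ς_θ for θ₁ can also be reproduced with a short calculation: because X_k simply selects the observed cell means, each pattern adds p_k·Σ_J⁻¹ to its group's block of the information matrix in equation 16.10. The sketch below uses only the standard library (`solve()` is a small Gaussian-elimination helper written for this illustration) and recovers the 2.3387 and N ≈ 159.7 shown above:

```python
def cs_inverse(J, rho=0.5):
    """Inverse of a J x J compound-symmetry correlation matrix."""
    a = 1.0 / (1.0 - rho)
    b = -rho / ((1.0 - rho) * (1.0 - rho + J * rho))
    return [[(a + b) if i == j else b for j in range(J)] for i in range(J)]

def solve(A, rhs):
    """Gaussian elimination with partial pivoting (helper for this sketch)."""
    n = len(rhs)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

# Patterns per group: 1, 2, 3 or 4 observations with weights .025, .05, .10, .32
# (the .005 pattern contributes no observations).
M = [[0.0] * 8 for _ in range(8)]
for g in (0, 4):                       # columns 0-3: group A, 4-7: group B
    for J, p in zip((1, 2, 3, 4), (0.025, 0.05, 0.10, 0.32)):
        Sinv = cs_inverse(J)
        for i in range(J):
            for j in range(J):
                M[g + i][g + j] += p * Sinv[i][j]

# theta_1: difference between groups in the change from T1 to T4
C = [1.0, 0.0, 0.0, -1.0, -1.0, 0.0, 0.0, 1.0]
var_theta = sum(c * a for c, a in zip(C, solve(M, C)))   # C M^{-1} C'
sigma_theta = var_theta ** 0.5                           # 2.3387 in the text
N = (1.96 + 1.282) ** 2 * var_theta / 0.6 ** 2           # about 159.7
print(round(sigma_theta, 4), round(N, 1))
```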
Let us consider the sample size required for two additional endpoints: The
first hypothesis (H03 ) is the area under the HRQoL versus time curve (equa-
tion 16.11). The second hypothesis is the same as in the previous example:
the averages of the estimates at 6, 12 and 26 weeks minus the baseline as-
sessments (H04 ). The final hypothesis is the change from baseline to week 26
(H05 ).
$$H_{03}: \int_0^{26} \big(\beta_0 + \beta_{1A} t + \beta_{2A} t^{[6]} + \beta_{3A} t^{[12]}\big)\, dt = \int_0^{26} \big(\beta_0 + \beta_{1B} t + \beta_{2B} t^{[6]} + \beta_{3B} t^{[12]}\big)\, dt \tag{16.11}$$

$$H_{04}: \frac{\mu_{B6} + \mu_{B12} + \mu_{B26}}{3} - \mu_{B0} = \frac{\mu_{A6} + \mu_{A12} + \mu_{A26}}{3} - \mu_{A0} \tag{16.12}$$

$$H_{05}: (\mu_{B26} - \mu_{B0}) = (\mu_{A26} - \mu_{A0}) \tag{16.13}$$
Again we would set up a dataset with the essence of our expected patterns
of Xk and pk :
ID Group p_k Weeks weeks6 weeks12
2 A 0.025 0 0 0
3 A 0.050 0 0 0
3 A 0.050 6 0 0
...
10 B 0.320 0 0 0
10 B 0.320 6 0 0
10 B 0.320 12 6 0
10 B 0.320 26 20 14
The following SAS program is very similar to that used in the previous
example; the major difference is the setup of the variables used to model
change over time. We use the same method of defining the covariance struc-
ture, noting that the compound symmetry is identical to assuming a random
intercept in a mixed-effects model. A more complex variance structure, with
a random slope, could be used if there is sufficient information to estimate
the parameters.
proc mixed data=work.work2 maxiter=1 method=ml;
  class ID Group Time;
  weight p_k;
  model Y=Group*Weeks Group*Weeks6 Group*Weeks12/Solution;
  repeated Time/Subject=ID type=UN;
  parms (1)
        (.5) (1)
        (.5) (.5) (1)
        (.5) (.5) (.5) (1)/noiter;
  estimate 'Theta 3' Group*Weeks 338 -338 Group*Weeks6 200 -200
                     Group*Weeks12 98 -98/divisor=26 E;
  estimate 'Theta 4' Group*Weeks 44 -44 Group*Weeks6 26 -26
                     Group*Weeks12 14 -14/divisor=3 E;
  estimate 'Theta 5' Group*Weeks 26 -26 Group*Weeks6 20 -20
                     Group*Weeks12 14 -14 E;
run;
ID Group p_k Y1  Y2  Y3  Y4
 5 A .32   0   0   0   0
 6 B .005  .   .   .   .
 7 B .025  0   .   .   .
 8 B .05   0  .4   .   .
 9 B .1    0  .4  .5   .
10 B .32   0  .4  .5  .6
data work.work2;
set work.work1;
if ranuni(333333) lt .1 then delete; * Extra 10% *;
run;
Multivariate Tests
Sample size approximations can also be created for multivariate tests. For
example, we might wish to test simultaneously for differences in the change
from baseline in our example.
H0 : (μB2 − μB1 ) − (μA2 − μA1 ) = 0,
(μB3 − μB1 ) − (μA3 − μA1 ) = 0,
(μB4 − μB1 ) − (μA4 − μA1 ) = 0.
For large samples, one can generalize the univariate z-statistic to a multi-
variate χ²-statistic. The null hypothesis is H_0: θ = Cβ = G, where C (ν × p)
and G (ν × 1) are known. The χ² test statistic is θ̂′(Var(θ̂))⁻¹θ̂ and the power
of the test is:

$$Pr[\chi^2_\nu(\lambda) > \chi^2_{\nu,\alpha}] = 1 - \beta \tag{16.14}$$

where χ²_{ν,α} is the critical value* for a χ² distribution with ν degrees of freedom
and λ is the non-centrality parameter such that

$$\lambda = \theta' \big(Var(\hat{\theta})\big)^{-1} \theta = N\, \theta' \left( C \left[ \sum_{k=1}^{K} p_k X_k' \hat{\Sigma}_k^{-1} X_k \right]^{-1} C' \right)^{-1} \theta \tag{16.15}$$
* SAS: c_alpha = cinv(1-0.05, 3);
The previous sample size calculations assume that the sample size is sufficiently
large that the asymptotic approximation of the covariance of the parameters is
appropriate and that the loss of degrees of freedom when using t- and F-statistics
will not affect the results. When the sample sizes are small,
the above procedures can be used to obtain the first estimate of the sample
size. The estimation procedure is repeated, using the t- or F-distributions
with updated estimates of the degrees of freedom (ν), until the procedure
converges.
$$N = (t_{\nu,\alpha/2} + t_{\nu,\beta})^2\, \frac{\varsigma_\theta^2}{\delta_\theta^2} = (t_{\nu,\alpha/2} + t_{\nu,\beta})^2 \left[ C \left( \sum_{k=1}^{K} p_k X_k' \hat{\Sigma}_k^{-1} X_k \right)^{-1} C' \right] \bigg/ \delta_\theta^2. \tag{16.16}$$
Kenward and Roger [1997] provide the details for modifying the estimated
covariance of β̂ for Restricted Maximum Likelihood Estimation (REML) and
propose an F-distribution approximation for small sample inference.
Introduction
√ Rationale for assessing HRQoL in the particular disease and treatment
√ State specific a priori (pretrial) hypotheses
Methods
√ Justification for selection of instrument(s) assessing HRQoL (see Chapter 2). References to instrument and validation studies. Details of psychometric properties if a new instrument. Include copy of instrument in appendix if previously unpublished. Details of any modifications of the questions or formats.
√ Details of cross-cultural validation if relevant and previously unpublished
√ Method of administration (self-report, face-to-face, etc.)
√ Planned timing of the study assessments
√ Method of scoring, preferably by reference to a published article or scoring manual, with details of any deviations
√ Interpretation of scores. Do higher values indicate better outcomes?
√ Methods of analysis. What analyses were specified a priori and which were exploratory?
√ Which dimension(s) or item(s) of the HRQoL instruments were selected as endpoints prior to subject accrual?
√ What summary measures and multiple comparison procedures were used? Were they specified a priori?
Results
√ Timing of assessments and length of follow-up by treatment group
√ Missing data:
  √ Proportions with missing data and relevant patterns
  √ How were patients who dropped out of the study handled?
  √ How were patients who died handled?
√ Results for all scales specified in protocol/analysis plan (negative as well as positive results)
√ If general HRQoL benefit reported, summary of all dimensions
√ If no changes were observed, describe evidence of responsiveness to measures in related settings and the lack of floor and ceiling effects in current study
16.5 Summary
• The analysis plan is driven by explicit research objectives.
• Analyses of HRQoL data are often more complex than the analyses of traditional univariate outcomes because of their longitudinal and multidimensional nature.
The varying coefficient model described in Section 11.3 utilizes a cubic smoothing spline. The following procedure to create B, an r × r−2 design matrix for the smoothing function, was adapted from code provided on Xihong Lin's website. We start with T⁰, a vector of the r distinct values of T_i^D. Next create h⁰, an r−1 vector of the differences between the times (equation 17). Then calculate Q⁰, an r × r−2 matrix with values on the diagonal and two off-diagonal positions (equations 18-20), and R⁰, an r−2 × r−2 matrix with values on the diagonal and two off-diagonal positions (equations 21-23). Then G is the lower triangular component of the square root (Cholesky decomposition) of R (equation 24). Finally, L and B are calculated (equations 25 and 26).

    h^0_i = T^0_{i+1} − T^0_i,   i = 1, ..., r−1      (17)
    Q^0_{j,j} = 1/h^0_j,   j = 1, ..., r−2            (18)
    Q^0_{j+1,j} = −1/h^0_{j+1} − 1/h^0_j              (19)
    Q^0_{j+2,j} = 1/h^0_{j+1}                         (20)
    R^0_{j,j} = (h^0_j + h^0_{j+1})/3                 (21)
    R^0_{j,j−1} = h^0_j/6                             (22)
    R^0_{j,j+1} = h^0_{j+1}/6                         (23)
    R = GG'                                           (24)
    L = Q(G⁻¹)'                                       (25)
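The construction in equations (17)-(23) can be written out directly. The following Python sketch (an illustration; the author's code is in R) builds h⁰, Q⁰, and R⁰ from a vector of distinct times; G, L, and B would then follow via the Cholesky steps of equations (24)-(26).

```python
def spline_matrices(T):
    """Build h (eq 17), Q (r x r-2, eqs 18-20) and R (r-2 x r-2,
    eqs 21-23) from the r distinct event times T."""
    r = len(T)
    h = [T[i + 1] - T[i] for i in range(r - 1)]            # eq (17)
    Q = [[0.0] * (r - 2) for _ in range(r)]
    R = [[0.0] * (r - 2) for _ in range(r - 2)]
    for j in range(r - 2):
        Q[j][j] = 1.0 / h[j]                               # eq (18)
        Q[j + 1][j] = -1.0 / h[j + 1] - 1.0 / h[j]         # eq (19)
        Q[j + 2][j] = 1.0 / h[j + 1]                       # eq (20)
        R[j][j] = (h[j] + h[j + 1]) / 3.0                  # eq (21)
        if j > 0:
            R[j][j - 1] = h[j] / 6.0                       # eq (22)
        if j < r - 3:
            R[j][j + 1] = h[j + 1] / 6.0                   # eq (23)
    return h, Q, R

h, Q, R = spline_matrices([0, 1, 3, 6, 10])   # r = 5 distinct times
```

Q is banded with three nonzero entries per column and R is tridiagonal, so both are cheap to form even for many distinct times.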
© 2010 by Taylor and Francis Group, LLC
Appendix P: PASW/SPSS Notes
General Comments
The PASW/SPSS software focuses on menu-driven analyses. However, many of the procedures described in this book go beyond the options that are available through the menus. When I am unfamiliar with a command, I typically start with the menu, cut the code out of the output window, paste it into the syntax window, and finally tailor that code to my needs.
Licenses for PASW/SPSS vary with respect to the options available at each institution. For example, my university only subscribes to the basic version, which does not include the procedures tailored to missing data.‡ You may find that you also do not have access to all the commands. Version 17 has come out just as I am finishing my proofs; as I hunt through the Help files, I notice that there are additional functions, of which I was not previously aware, that may facilitate analyses not presented in this book. If you develop good examples applied to datasets in this book, I will post them on the web site.
Loading Datasets
Datasets that were used to generate the illustrations in the book are available
on http://home.earthlink.net/~dianefairclough/Welcome.html. They are in
SPSS *.sav format. Datasets can be loaded by clicking on the dataset, using
menu options within SPSS or embedding the following code into a program. To
load the data for Study 1 use the following with .... replaced by the directory
in which the data is saved.
GET FILE='C:\....\Breast3.sav'.
‡I was able to obtain an expanded license for a limited time as an author to develop some
of the code presented in this book.
DEFINE !fvars()
Item1, Item2r, Item3, Item4, Item5, Item6, Item7
!ENDDEFINE.
IF (NMISS(!fvars) LE 3) SumScore2=MEAN(!fvars)*7.
EXECUTE.
This checks that the number of missing responses is no more than half (≤ 3), takes the mean of the observed responses, and multiplies by the number of items.
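The same half-rule scoring logic can be written out in Python (an illustrative sketch of the rule, not part of the book's SPSS code); None plays the role of a missing response:

```python
def half_rule_score(items, n_items=7, max_missing=3):
    """Half rule: if no more than half the items are missing, score the
    scale as mean(observed) * n_items; otherwise return no score."""
    observed = [x for x in items if x is not None]
    if n_items - len(observed) > max_missing:
        return None                     # too many missing: no score
    return sum(observed) / len(observed) * n_items

score = half_rule_score([3, 2, None, 4, 3, None, 2])   # 5 of 7 observed
```

With five observed responses the score is the observed mean (2.8) scaled up to the 7-item total (19.6); with more than three missing, no score is produced.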
Centering Variables
The following code is an example of a procedure that can be used when there
are an equal number of records per subject:
AGGREGATE
/OUTFILE=* MODE=ADDVARIABLES
/BREAK=
/Var1_mean=MEAN(Var1).
COMPUTE Var1_c=Var1-Var1_mean.
EXECUTE.
/* Centering */
/* Calculate overall mean */
AGGREGATE
/OUTFILE=* MODE=ADDVARIABLES
/BREAK=
/Mean_Age=mean(Age_Tx).
/* Subtract overall mean */
COMPUTE Age_C=Age_Tx-Mean_age.
EXECUTE.
General Comments
R is a powerful, free, open-source language that consists mainly of user-defined functions. As a result, the capacity of the R language grows rapidly, and it is often at the forefront of methods development. The learning curve for a new user is steep, but once mastered, R provides the greatest flexibility of any language.
Loading Datasets
Datasets that were used to generate the illustrations in the book are available
on http://home.earthlink.net/~dianefairclough/Welcome.html. They are in
CSV format. The following code will load the dataset for Study 1 with ...
replaced by the directory where the data is saved.
# Breast Cancer data set
Breast = read.table("C:/..../breast3.csv",header=TRUE,sep=",")
str(Breast) # lists contents
363
© 2010 by Taylor and Francis Group, LLC
364 Design and Analysis of Quality of Life Studies in Clinical Trials
Helpful Computations
To convert a continuous indicator variable of the treatment groups (Trtment)
to a variable of the factor class (TrtGrp):
> Renal$TrtGrp=factor(Renal$Trtment) # Creates a CLASS variable
Parameters for the changes in slope for the piecewise linear models (Chapter
4):
> WEEK2=Renal3$Weeks-2       # WEEK2 = Weeks - 2
> WEEK2[WEEK2<0]=0           # Negative values set to zero
> Renal3$WEEK2=WEEK2         # Stored in Renal3
Centering Variables
The following code creates a centered variable for AGE TX assuming that all
subjects have the same number of assessments. The variable All indicates
the entire sample.
Lung$All=1
Lung$Age_C=Lung$AGE_TX-tapply(Lung$AGE_TX,Lung$All,mean)
Consider the simple situation where we are fitting a model with an intercept and slope for two treatment groups. For the two treatment groups, we can either use a continuous indicator variable (Trtment), where 0 indicates the control group and 1 indicates the experimental group, or a factor variable that has two levels (TrtGrp). The following statements generate identical models:
fixed=TOI2~ Trtment*Weeks
fixed=TOI2~ Trtment + Weeks + Trtment:Weeks
fixed=TOI2~ TrtGrp*Weeks
fixed=TOI2~ TrtGrp + Weeks + TrtGrp:Weeks
The output for the first two versions appears as:
Fixed: TOI2 ~ Trtment + Weeks + Trtment * Weeks
(Intercept) Trtment Weeks Trtment:Weeks
69.75192368 -5.14639184 -0.07676819 0.00623725
and for the last two versions as:
Fixed: TOI2 ~ TrtGrp + Weeks + TrtGrp:Weeks
(Intercept) TrtGrp1 Weeks TrtGrp1:Weeks
69.75192368 -5.14639184 -0.07676819 0.00623725
The resulting models are equivalent, though the labeling differs. The resulting
parameters have the interpretation of 1) the intercept in the control group, 2)
the difference in the intercept between the experimental and control group, 3)
the slope of the control group, and 4) the difference in the slope between the
experimental and control group. Note that when the "*" symbol is used, lower order terms are automatically generated. If we wished to generate a model in which the intercept and slope were estimated separately for each treatment group, we would use the following; adding "0" or "-1" to the model suppresses the default intercept:
fixed=TOI2~0+ TrtGrp+TrtGrp:Weeks
The results are as follows:
Fixed: TOI2 ~ TrtGrp + TrtGrp:Weeks - 1
TrtGrp0 TrtGrp1 TrtGrp0:Weeks TrtGrp1:Weeks
69.75192368 64.60553184 -0.07676819 -0.07053094
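The equivalence of the two parameterizations can be verified arithmetically from the printed estimates: the cell-means fit's group-1 intercept and slope equal the reference-cell intercept and slope plus the corresponding difference terms. A quick Python check using the values above:

```python
# Reference-cell estimates (first parameterization, from the output above)
intercept, trt_diff   = 69.75192368, -5.14639184
slope,     slope_diff = -0.07676819,  0.00623725

# Cell-means parameterization recovers per-group values directly
grp0_intercept = intercept                 # control-group intercept
grp1_intercept = intercept + trt_diff      # experimental-group intercept
grp0_slope     = slope                     # control-group slope
grp1_slope     = slope + slope_diff        # experimental-group slope
```

The sums reproduce the TrtGrp1 and TrtGrp1:Weeks estimates printed for the no-intercept model, confirming the two fits are reparameterizations of the same model.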
Functions
I have also created several functions. They can be found on the website http://home.earthlink.net/~dianefairclough/Welcome.html. To load the commands stored in an external file (RFunctions.r) into the working space of R, use the following after modifying the path:
source("C:/..../Rprgms/RFunctions.r")
Estimation of Θ = Cβ
This function estimates linear functions of β and tests the hypothesis that
Θ = Cβ = 0 assuming that the sample size is large enough that a z-statistic
is adequate.
The first step is to define C and add rownames as labels (the latter is optional), then to call the function. When the model has been fit using gls, est is the object named model$coef and var is the object named model$var. Thus the code might appear as:

estCbeta(C, model$coef, model$var)

Similarly, when the model has been fit using lme, est is the object named model$coef$fix and var is the object named model$varFix. Thus the code might appear as:

estCbeta(C, model$coef$fix, model$varFix)
estCbeta = function(C,est,var) {
est.beta = as.matrix(est); # Beta
dim(est.beta)=c(length(est.beta),1)
var.beta = var # var(Beta)
est.theta = C %*% est.beta # Theta=C*Beta
var.theta = C %*% var.beta %*% t(C) # Var(T)=C*Var(B)*C’
se.theta = sqrt(diag(var.theta))
dim(se.theta)=c(length(se.theta),1)
zval.theta= est.theta/se.theta
pval.theta=(1-pnorm(abs(zval.theta)))*2
results =cbind(est.theta,se.theta,zval.theta,pval.theta)
colnames(results)=c("Theta","SE","z","p-val")
rownames(results)=rownames(C)
results
}
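For readers working outside R, the computation is easy to reproduce. The sketch below is a pure-Python analogue of estCbeta (an illustration, not the book's code), taking C as a list of rows, beta as a list, and var_beta as a nested list:

```python
import math
from statistics import NormalDist

def est_cbeta(C, beta, var_beta):
    """For each row c of C compute theta = c'beta, SE = sqrt(c'Var(beta)c),
    the z-statistic, and a two-sided normal p-value (mirrors estCbeta)."""
    results = []
    for c in C:
        theta = sum(ci * bi for ci, bi in zip(c, beta))
        var = sum(ci * cj * var_beta[i][j]
                  for i, ci in enumerate(c)
                  for j, cj in enumerate(c))
        se = math.sqrt(var)
        z = theta / se
        p = 2 * (1 - NormalDist().cdf(abs(z)))
        results.append((theta, se, z, p))
    return results

# one contrast picking off the second coefficient
res = est_cbeta([[0.0, 1.0]], [5.0, 2.0], [[1.0, 0.0], [0.0, 0.25]])
```

As in the R version, the p-values rely on the large-sample normal approximation.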
DO I=1 TO DIM(REV);
REV[I]=4-REV[I];
END;
Centering Variables
To create a new dataset (work.centered) with variables centered within each treatment group (Trtment), using a dataset (work.patient) that has one record per subject:
PROC SQL;
create table work.centered as
select PatID,
(Var1 - mean(Var1)) as Var1_C,
(Var2 - mean(Var2)) as Var2_C,
(Var3 - mean(Var3)) as Var3_C,
(Var4 - mean(Var4)) as Var4_C
from work.patient
group by Trtment
order by PatID;
quit;
PatID is included for merging with other datasets; order by PatID is optional but avoids later sorting. To center variables across all subjects, drop the group by Trtment clause.
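The group-centering logic of the SQL step can be sketched in Python (illustrative only; the field names are hypothetical stand-ins for the SAS variables):

```python
from collections import defaultdict

def center_within(records, group_key, value_key):
    """Subtract each group's mean from its members' values, mirroring
    the 'Var - mean(Var) ... group by' logic of the SQL step."""
    totals = defaultdict(lambda: [0.0, 0])     # group -> [sum, count]
    for rec in records:
        t = totals[rec[group_key]]
        t[0] += rec[value_key]
        t[1] += 1
    return [dict(rec, centered=rec[value_key]
                 - totals[rec[group_key]][0] / totals[rec[group_key]][1])
            for rec in records]

patients = [
    {"PatID": 1, "Trtment": 0, "Var1": 10.0},
    {"PatID": 2, "Trtment": 0, "Var1": 14.0},
    {"PatID": 3, "Trtment": 1, "Var1": 20.0},
    {"PatID": 4, "Trtment": 1, "Var1": 30.0},
]
centered = center_within(patients, "Trtment", "Var1")
```

Within each treatment group the centered values sum to zero, just as they do after the PROC SQL remerge.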
%macro boot(B);
%do i=1 %to &B;
*** SAS statements for the analysis of the longitudinal data ***;
%analyze(indata,results);
%end;
%mend;
1. Sample with replacement N subjects from the first set of data, where N
is the number of subjects. Generate a unique identifier for each of the N
subjects in the bootstrap sample using the %BootSelect macro. If you
wish to keep the number of subjects in any subgroups (e.g. treatment
arms) constant, this step can be performed within each subgroup, using
a different &StartID so that subjects in the different subgroups have
different BootIDs. The datasets Frame1 and Frame2 contain the study
identifiers for control and experimental groups respectively.
2. Merge the bootstrap sample with the longitudinal data (Lung3) using the %BootMerge macro.
3. Analyze the bootstrap sample with the %Analyze macro and save the results using the %BootSave macro.
4. Repeat these three steps 1000 times. (This is adequate to estimate the standard error, but more repetitions may be necessary if the tails of the distribution need to be estimated more precisely.)
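The steps above can be sketched in Python (an illustration of the resampling logic, not the SAS macros themselves; the analyze function and data layout are hypothetical):

```python
import random

def bootstrap_se(subject_ids, long_data, analyze, reps=1000, seed=2010):
    """Resample subjects with replacement (step 1), relabel each draw with
    a unique BootID and attach that subject's longitudinal records (step 2),
    analyze (step 3), and return the SD of the estimates: the bootstrap SE."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        draws = [rng.choice(subject_ids) for _ in subject_ids]   # step 1
        boot = [(boot_id, rec)                                   # step 2
                for boot_id, subj in enumerate(draws, start=1)
                for rec in long_data[subj]]
        estimates.append(analyze(boot))                          # step 3
    mean = sum(estimates) / len(estimates)
    return (sum((e - mean) ** 2 for e in estimates)
            / (len(estimates) - 1)) ** 0.5

# toy example: bootstrap SE of the mean of one observation per subject
data = {s: [float(s)] for s in range(1, 21)}
se = bootstrap_se(list(data), data,
                  analyze=lambda boot: sum(rec for _, rec in boot) / len(boot),
                  reps=200)
```

Relabeling with BootID matters because the same subject can be drawn more than once and must enter the analysis as distinct records.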
Macro %Analyze
The following macro is specific to the analysis presented in Section 10.3.4.
*** Macro to Analyze the Bootstrap Sample ***;
%macro Analyze;
*** Center the Dropout Indicators within Treatment Group ***;
proc sql;
create table work.centered1 as select BootID,
(StrataB1 - mean(StrataB1)) as D1B,
(StrataB2 - mean(StrataB2)) as D2B
from work.select1 order by BootID;
create table work.centered2 as select BootID,
(StrataB1 - mean(StrataB1)) as D1B,
(StrataB2 - mean(StrataB2)) as D2B
from work.select2 order by BootID;
quit;
*** Merge Centered Indicators with Longitudinal Data ***;
data work.merged;
merge work.centered1 work.centered2 work.merged1 work.merged2;
by BootID;
run;
Putting it together
The backbone of a typical SAS program is as follows:
Test the procedure with 3 iterations, then run the full bootstrap procedure.
Macro %BootSelect
* Randomly Selects Subjects from the dataset &Indata *;
* Creates a dataset &Outdata *;
* The dataset &Indata must have only one rec/subject *;
* Seed is usually a function of the iteration # such as &i*7 *;
%macro BootSelect(indata,outdata,seed,startID);
data &outdata;
* Creates Random Variable from 1 to n *;
choice=Ceil(ranuni(&seed)*n);
set &indata point=choice nobs=n;
BootID=&startID+_N_; * ID for future analysis *;
if _N_>n then stop;
run;
%mend;
Macro %BootMerge
* Merges data from &Bootdata with &Longdata *;
* &ID is the subject ID on the original datasets *;
* The merged dataset will have a new ID: BootID *;
%macro BootMerge(bootdata,longdata,outdata,id);
proc sql;
create table &outdata
as select *
from &bootdata as l left join &longdata as r on l.&id=r.&id
order by BootID;
quit;
%mend;
Macro %BootSave
*** Saves the results from &Indata into &Outdata ***;
*** &Indata will usually have one record ***;
%macro BootSave(indata,outdata);
%if &I=1 %then %do; * Saves results in &outdata *;
data &outdata;
set &indata;
run;
%end;
%else %do; * Appends results to &outdata *;
proc append base=&outdata new=&indata force;
run;
%end;
%mend;
Gould AL. (1980) A new approach to the analysis of clinical drug trials with withdrawals. Biometrics, 36: 721-727.
Guo X, Carlin BP. (2004) Separate and joint modeling of longitudinal and
event time data using standard computer packages. The American Statis-
tician, 16-24.
Guyatt GH, Townsend M, Berman LB, Keller JL. (1987) A comparison of Likert and visual analogue scales for measuring change in function. Journal of Chronic Disease, 40: 1129-1133.
Guyatt G et al. (1991) Glossary. Controlled Clinical Trials, 12: 274S-280S.
Heitjan DA, Landis JR. (1994) Assessing secular trends in blood pressure: A
multiple imputation approach. Journal of the American Statistical Associ-
ation, 89: 750-759.
Heyting A, Tolboom JTBM, Essers JGA. (1992) Statistical handling of dropouts in longitudinal clinical trials. Statistics in Medicine, 11: 2043-2061.
Hicks JE, Lampert MH, Gerber LH, Glatstein E, Danoff J. (1985) Functional outcome update in patients with soft tissue sarcoma undergoing wide local excision and radiation (Abstract). Archives of Physical Medicine and Rehabilitation, 66: 542-543.
Hochberg Y. (1988) A sharper Bonferroni procedure for multiple significance
testing. Biometrika, 75: 800-803.
Hogan JW, Laird NM. (1997) Mixture models for the joint distribution of repeated measures and event times. Statistics in Medicine, 16: 239-257.
Hogan JW, Laird NM. (1997) Model-based approaches to analysing incom-
plete longitudinal and failure time data. Statistics in Medicine, 16: 259-272.
Hogan JW, Roy J, Korkontzelou C. (2004) Tutorial in biostatistics: Handling
drop-out in longitudinal studies. Statistics in Medicine, 23: 1455-1497.
Hogan JW, Lin X, Herman G. (2004) Mixtures of varying coefficient models
for longitudinal data with discrete or continuous non-ignorable dropout.
Biometrics, 60: 854-864.
Hollen PJ, Gralla RJ, Cox C, Eberly SW, Kris M. (1997) A dilemma in
analysis: Issues in serial measurement of quality of life in patients with
advanced lung cancer. Lung Cancer, 18: 119-136.
Holm S. (1979) A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6: 65-70.
Holmbeck GN. (1997) Toward terminological, conceptual and statistical clar-
ity in the study of mediators and moderators: Examples from the child-
O’Brien PC. (1984) Procedures for comparing samples with multiple end-
points. Biometrics, 40: 1079-1087.
Omar RZ, Wright EM, Turner RM, Thompson SG. (1999) Analyzing repeated measurements data: A practical comparison of methods. Statistics in Medicine, 18: 1587-1608.
Paik MC. (1997) The generalized estimating equation approach when data are not missing completely at random. Journal of the American Statistical Association, 92: 1320-1329.
Pater J, Osoba D, Zee B, Lofters W, Gore M, Dempsey E, Palmer M, Chin C. (1998) Effects of altering the time of administration and the time frame of quality of life assessments in clinical trials: An example using the EORTC QLQ-C30 in a large anti-emetic trial. Quality of Life Research, 7: 273-278.
Patrick DL, Bush JW, Chen MM. (1973) Methods for measuring levels of
well-being for a health status index. Health Services Research, 8: 228-245.
Patrick D, Erickson P. (1993) Health Status and Health Policy: Allocating
Resources to Health Care. Oxford University Press, New York.
Pauler DK, McCoy S, Moinpour C. (2003) Pattern mixture models for longitu-
dinal quality of life studies in advanced stage disease. Statistics in Medicine,
22: 795–809.
Piantadosi S. (1997) Clinical Trials: A Methodological Perspective. John Wiley
and Sons, New York.
Pinheiro JC, Bates DM. (2000) Mixed-Effects Models in S and S-PLUS.
Springer Verlag, New York, NY.
Pledger G, Hall D. (1982) Withdrawals from drug trials (letter to editor).
Biometrics, 38: 276-278.
Pocock SJ, Geller NL, Tsiatis AA. (1987a) The analysis of multiple endpoints
in clinical trials. Biometrics, 43: 487-498.
Pocock SJ, Hughes MD, Lee RJ.(1987b) Statistical problems in the reporting
of clinical trials. New England Journal of Medicine 317: 426-432.
Proschan MA, Waclawiw MA. (2000) Practical guidelines for multiplicity ad-
justment in clinical trials. Controlled Clinical Trials, 21: 527-539.
Raboud JM, Singer J, Thorne A, Schechter MT, Shafran SD. (1998) Estimat-
ing the effect of treatment on quality of life in the presence of missing data
due to dropout and death. Quality of Life Research, 7: 487-494.
Reitmeir J, Wassmer G. (1999) Resampling-based methods for the analysis of
multiple endpoints in clinical trials. Statistics in Medicine, 18: 3455-3462.
Revicki DA, Gold K, Buckman D, Chan K, Kallich JD, Woodley M. (2001)
Vonesh EF, Greene T, Schluchter MD. (2006) Shared parameter models for the
joint analysis of longitudinal data and event times. Statistics in Medicine,
25: 143-163.
von Hippel PT. (2004) Biases in SPSS 12.0 missing value analysis. The American Statistician, 58: 160-164.
Wang XS, Fairclough DL, Liao Z, Komaki R, Chang JY, Mobley GM, Cleeland
CS. (2006) Longitudinal study of the relationship between chemoradiation
therapy for non-small-cell lung cancer and patient symptoms. Journal of
Clinical Oncology, 24: 4485-4491.
Ware JE, Brook RH, Davies AR, Lohr KN. (1981) Choosing measures of health
status for individuals in general populations. American Journal of Public
Health, 71: 620-625.
Ware JE, Snow KK, Kosinski M, Gandek B. (1993) SF-36 Health Survey:
Manual and Interpretation Guide. The Health Institute, New England Med-
ical Center, Boston, MA.
Ware J, Kosinski M, Keller SD. (1994) SF-36 Physical and Mental Component
Summary Scales: A User’s Manual. The Health Institute, New England
Medical Center, Boston, MA.
Weeks J. (1992) Quality-of-life assessment: Performance status upstaged?
Journal of Clinical Oncology, 10: 1827-1829.
Wei LJ, Johnson WE. (1985) Combining dependent tests with incomplete
repeated measurements. Biometrika 72: 359-364.
Westfall PH, Young SS. (1989) p-value adjustment for multiple testing in multivariate binomial models. Journal of the American Statistical Association, 84: 780-786.
Wiklund I, Dimenas E, Wahl M. (1990) Factor of importance when evaluating
quality of life in clinical trials. Controlled Clinical Trials, 11: 169-179.
Wilson IB, Cleary PD. (1995) Linking clinical variables with health-related quality of life - a conceptual model of patient outcomes. Journal of the American Medical Association, 273: 59-65.
World Health Organization (1948) Constitution of the World Health Organi-
zation. Basic Documents, WHO, Geneva.
World Health Organization (1958) The First Ten Years of the World Health
Organization, WHO, Geneva.
Wu MC, Bailey KR. (1988) Analyzing changes in the presence of informative
right censoring caused by death and withdrawal. Statistics in Medicine, 7:
337-346.
Wu MC, Bailey KR. (1989) Estimation and comparison of changes in the pres-
ence of informative right censoring: Conditional linear model. Biometrics,
45: 939-955.
Wu MC, Carroll RJ. (1988) Estimation and comparison of changes in the
presence of informative right censoring by modeling the censoring process.
Biometrics, 44: 175-188.
Wulfsohn M, Tsiatis A. (1997) A joint model for survival and longitudinal
data measured with error. Biometrics, 53: 330-339.
Yabroff KR, Linas BP, Schulman K. (1996) Evaluation of quality of life for
diverse patient populations. Breast Cancer Research and Treatment, 40:
87-104.
Yao Q, Wei LJ, Hogan JW. (1998) Analysis of incomplete repeated measure-
ments with dependent censoring times. Biometrika, 85: 139-149.
Young T, Maher J. (1999) Collecting quality of life data in EORTC clinical
trials - what happens in practice? Psycho-oncology, 8: 260-263.
Yu M, Law NJ, Taylor JMG, Sandler HM. (2004) Joint longitudinal-survival-
cure models and their applications to prostate cancer. Statistica Sinica, 14:
835-862.
Zeger SL, Liang K-Y. (1992) An overview of methods for the analysis of
longitudinal data. Statistics in Medicine, 11: 1825-1839.
Zhang J., Quan H, Ng J, Stepanavage ME. (1997) Some statistical methods for
multiple endpoints in clinical trials. Controlled Clinical Trials, 18: 204-221.
Zwinderman AH. (1990) The measurement of change of quality of life in clinical trials. Statistics in Medicine, 9: 931-942.
A
ABB (approximate Bayesian bootstrap), see Multiple imputation
Age,
  missing data, 11, 133, 188
  and moderation, 105
Akaike's Information Criterion (AIC), 58
Analysis of variance (ANOVA), 110
Analysis plans, 337, 340, 341, 356
  models for longitudinal data, 341
  multiple comparisons, 339
  multiple endpoints, 275
  primary vs. secondary endpoints, 338
  role of HRQoL, 338
  sample size calculations, 342
  summary measures, 339
Analysis, complete case, 134, 151
Analysis, generalized estimating equations, 149
Analysis, MANOVA, 134, 151
Analysis, per-protocol, 34
Analysis, repeated univariate, 134, 151
Area under the curve (AUC), 96, 297, 299, 309, 315, 324, 350
  also see Summary measures
  integration, 310
  trapezoidal approximation, 309, 315, 324
Assessment, 33-38, 43-47
  assistance, 44
  discontinuation, 38
  duration, 38
  frequency, 37
  location, 44, 45
  mode, 44
  timing, 36, 37, 44, 47
Auxiliary data, 46, 146, 163, 181, 337, 341
Available Case Missing Value (ACMV) Restriction, 232
  also see Mixture models
Average of ranks, 312
Average rate of change (slope), 312
B
Baseline,
  change from, 61, 112, 153
  common, 61
Bayesian Information Criterion (BIC), 58
BCQ (Breast Chemotherapy Questionnaire), 9, 42, 107, 167, 276, 282, 334
Best linear unbiased estimate (BLUE), 156
Best linear unbiased predictor (BLUP), 156, 171, 207
Bias, 126, 149, 186
  selection, 34, 39, 44, 54, 159, 163, 176, 340
Bivariate data, mixture models, 226
Bonferroni adjustments, 277
  see Multiple comparisons adjustments
Bootstrap procedure, 116, 222, 230, 237, 248, 329, 330-331, 371
Breast cancer, trial of adjuvant therapy, 8-12, 42, 276
E
Effect modifier, see Moderation
EM algorithm, 207, 259
Empirical Bayes estimates, 156, 315
Endpoints, 1, 33, 275-294, 337-339, 343
  primary and secondary, 337-339, 343
  survival, 1
  efficacy, 33
  multiple, 275-294, 337, 339, 343, 356
  safety, 33
G
General linear model, 56-57
Generic instrument, 5, 40
Global index, 6
Goals, trial, 31
  also see Trial design and protocol development
Growth curve models, see Mixed effects models
Guttman scale, 50
H
Half rule, see Scoring
Health states, 313, 324, 328
  explanatory aims, 34
  follow-up procedures, 47
  frequency of assessments, 37
  goals and objectives, 31
  longitudinal designs, 35
  primary and secondary endpoints, 338
  role of HRQoL, 33, 337-338
  selection of measures, 39
  selection of subjects, 34
  timing of assessments, 36
Trial, goals, 338
  regulatory approval, 39
Trial objectives, 40
  exploratory, 297
TTO, time tradeoff, see Measures, time tradeoff
U
Utility independence, 328
Utility measure, see Measures, preference
V
Validation, 39
Validity and reliability, 41-42
  construct, 41
  convergent, 39, 41
  criterion, 41
  discriminant, 41
  divergent, 41
  face, 41
  responsiveness, 41
van der Waerden scores, 319
Varying coefficient model, 249-252, 357
  assumptions, 250
Visual analog scale, 7
W
Weibull distribution, 259
Wilcoxon Rank Sum test, 174
Wilson and Cleary, conceptual model, 2
World Health Organization, definition of health, 2
Z
Zero, assigning after death, 165, 313, 324, 328