This dissertation develops optimal design and analysis strategies for cluster randomized trials. Specifically, we address three common questions: whether to pair-match clusters, which causal parameter best captures the intervention effect, and how to select the adjustment set for the analysis. Chapter 1 introduces a formal framework for causal inference. Throughout, the Sustainable East Africa Research in Community Health (SEARCH) trial (NCT01864603) serves as the motivating example. SEARCH is an ongoing community randomized trial to evaluate the impact of immediate and streamlined antiretroviral therapy on HIV incidence in rural East Africa.
In Chapter 2, we consider pair-matching, an intuitive design strategy to protect study validity and to potentially increase power in randomized trials. In a common design, candidate units are identified, and their baseline characteristics are used to create the best n/2 matched pairs. Within the resulting pairs, the intervention is randomized, and the outcomes are measured at the end of follow-up. We consider this design to be adaptive, because the construction of the matched pairs depends on the baseline covariates of all candidate units. As a consequence, the observed data cannot be treated as n/2 independent, identically distributed (i.i.d.) pairs of units, as is commonly assumed. Instead, the observed data consist of n dependent units. Chapter 2 explores the consequences of adaptive pair-matching in randomized trials for estimation of the conditional average treatment effect (CATE): the intervention effect, given the measured covariates of the n study units. We contrast the unadjusted estimator with targeted maximum likelihood estimation (TMLE) and show substantial efficiency gains from matching and further gains with adjustment.
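The adaptive pair-matched design described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the matching algorithm used in SEARCH: the Euclidean distance, the exhaustive search over all pairings (practical only for a handful of units), and the function names `pair_cost`, `best_pairs` and `randomize_within_pairs` are all hypothetical choices made for the example.

```python
import random


def pair_cost(a, b):
    """Euclidean distance between two covariate vectors (an assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def best_pairs(covariates):
    """Return (total_cost, pairs) for the minimum-distance perfect matching.

    Exhaustive recursion over all pairings; feasible only for small n.
    Every pair depends on the covariates of ALL candidate units, which is
    why the design is adaptive.
    """
    if len(covariates) % 2:
        raise ValueError("need an even number of candidate units")

    def search(remaining):
        if not remaining:
            return 0.0, []
        first, rest = remaining[0], remaining[1:]
        best = (float("inf"), [])
        for k, partner in enumerate(rest):
            cost, pairs = search(rest[:k] + rest[k + 1:])
            cost += pair_cost(covariates[first], covariates[partner])
            if cost < best[0]:
                best = (cost, [(first, partner)] + pairs)
        return best

    return search(list(range(len(covariates))))


def randomize_within_pairs(pairs, seed=0):
    """Assign the intervention (1) to exactly one unit per pair, at random."""
    rng = random.Random(seed)
    assignment = {}
    for i, j in pairs:
        treated = rng.choice((i, j))
        assignment[i] = int(treated == i)
        assignment[j] = int(treated == j)
    return assignment


# Toy example: six candidate units with one baseline covariate each.
covariates = [(0.0,), (0.1,), (5.0,), (5.2,), (10.0,), (10.1,)]
total_cost, pairs = best_pairs(covariates)
assignment = randomize_within_pairs(pairs, seed=1)
```

Changing the covariate value of any single unit can change which pairs are formed, so the n/2 pairs are not independent draws.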
In Chapter 3, we compare three causal parameters: the population, conditional and sample average treatment effects. Using a structural causal model, we explicitly define each parameter, discuss interpretation, and formally examine identifiability. To the best of our knowledge, Chapter 3 is the first to propose using TMLE for estimation and inference of the sample effect. In most settings, the sample parameter will be estimated more efficiently than the conditional parameter, which will, in turn, be estimated more efficiently than the population parameter. Finite sample simulations illustrate the potential gains in precision and power from selecting the sample effect as the target of inference.
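For orientation, the three parameters can be written in standard counterfactual notation. The symbols here are assumptions for this sketch and may differ from the notation of Chapter 3: Y_i(a) denotes the counterfactual outcome of unit i under intervention level a, and W_i denotes its measured baseline covariates.

```latex
\begin{align*}
\text{Population effect (PATE):}\quad
  & \mathbb{E}\big[\,Y(1) - Y(0)\,\big] \\
\text{Conditional effect (CATE):}\quad
  & \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}\big[\,Y_i(1) - Y_i(0) \mid W_i\,\big] \\
\text{Sample effect (SATE):}\quad
  & \frac{1}{n}\sum_{i=1}^{n} \big[\,Y_i(1) - Y_i(0)\,\big]
\end{align*}
```

The population effect averages over the target population, the conditional effect fixes the covariates of the n study units, and the sample effect fixes the study units themselves.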
Finally, in Chapter 4, we discuss adjustment for measured covariates during the analysis to reduce variance and increase power in randomized trials. To avoid misleading inference, the analysis plan must be pre-specified. However, it is often unclear a priori which baseline covariates (if any) should be included in the analysis. In the SEARCH trial, for example, there are 16 matched pairs of communities and many potential adjustment variables, including region, HIV prevalence, male circumcision coverage and measures of community-level viral load. In Chapter 4, we propose a rigorous procedure to data-adaptively select the adjustment set that maximizes the efficiency of the analysis. Specifically, we use cross-validation to select, from a pre-specified library, the candidate TMLE that minimizes the estimated variance. For further gains in precision, we also propose a collaborative procedure for estimating the known exposure mechanism. Our small sample simulations demonstrate the promise of the methodology to maximize study power, while maintaining nominal confidence interval coverage. Our procedure is tailored to the scientific question (sample vs. population treatment effect) and study design (pair-matched or not) and alleviates many of the common concerns about covariate adjustment.
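The selection step can be illustrated with a simplified sketch. Several assumptions depart from the procedure of Chapter 4 and are flagged here: a regression-adjusted plug-in estimator with arm-specific linear fits stands in for TMLE, the cross-validation leaves out one unit rather than one matched pair, the treatment probability is fixed at 0.5, and the squared estimated influence curve serves as the loss. All function and variable names are hypothetical.

```python
from statistics import mean


def fit_arm(w, y):
    """Least-squares line y ~ a + b*w; with w=None, predict the arm mean."""
    if w is None:
        m = mean(y)
        return lambda _w: m
    wbar, ybar = mean(w), mean(y)
    sxx = sum((wi - wbar) ** 2 for wi in w)
    sxy = sum((wi - wbar) * (yi - ybar) for wi, yi in zip(w, y))
    slope = 0.0 if sxx == 0 else sxy / sxx
    return lambda w_new: ybar + slope * (w_new - wbar)


def fit_outcome(train, cov):
    """Fit one working outcome regression per arm on the training units."""
    models = {}
    for a in (0, 1):
        arm = [r for r in train if r["A"] == a]
        w = None if cov is None else [r[cov] for r in arm]
        models[a] = fit_arm(w, [r["Y"] for r in arm])
    return models


def predict(models, a, row, cov):
    return models[a](None if cov is None else row[cov])


def cv_risk(data, cov, g=0.5):
    """Leave-one-out estimate of the variance of the influence curve."""
    losses = []
    for i, held_out in enumerate(data):
        train = data[:i] + data[i + 1:]
        models = fit_outcome(train, cov)
        # Plug-in effect estimate from the training units.
        psi = mean(predict(models, 1, r, cov) - predict(models, 0, r, cov)
                   for r in train)
        q1 = predict(models, 1, held_out, cov)
        q0 = predict(models, 0, held_out, cov)
        a, y = held_out["A"], held_out["Y"]
        weight = a / g - (1 - a) / (1 - g)  # known randomization probability
        d = weight * (y - (q1 if a == 1 else q0)) + (q1 - q0) - psi
        losses.append(d ** 2)
    return mean(losses)


def select_adjustment(data, candidates):
    """Pick the candidate (None = unadjusted) with the smallest CV risk."""
    risks = {c: cv_risk(data, c) for c in candidates}
    return min(risks, key=risks.get), risks


# Toy data: Y is exactly linear in W1, so adjusting for W1 should win.
data = [{"W1": float(i), "W2": float((i * 7) % 5), "A": i % 2,
         "Y": 2.0 * i + (i % 2)} for i in range(12)]
best, risks = select_adjustment(data, [None, "W1", "W2"])
```

The candidate library, including the unadjusted option `None`, is fixed before looking at the outcome data, so the analysis plan remains pre-specified even though the chosen adjustment variable is data-adaptive.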