[figure]style=plain, subcapbesideposition=top [subfigure]position=top,justification=raggedright \floatsetup[table]style=plaintop [subfigure]position=top,justification=raggedright, singlelinecheck=off
Modeling the Ratio of Correlated Biomarkers Using Copula Regression
Abstract
Modeling the ratio of two dependent components as a function of covariates is a frequently pursued objective in observational research. Despite the high relevance of this topic in medical studies, where biomarker ratios are often used as surrogate endpoints for specific diseases, existing models are based on oversimplified assumptions, assuming e.g. independence or strictly positive associations between the components. In this paper, we close this gap in the literature and propose a regression model where the marginal distributions of the two components are linked by Frank copula. A key feature of our model is that it allows for both positive and negative correlations between the components, with one of the model parameters being directly interpretable in terms of Kendall’s rank correlation coefficient. We study our method theoretically, evaluate finite sample properties in a simulation study and demonstrate its efficacy in an application to diagnosis of Alzheimer’s disease via ratios of amyloid-beta and total tau protein biomarkers.
Keywords: Distributional regression, Frank copula, Gamma distribution, Negative dependence, Ratio outcome
1 Introduction
A common objective in observational research is to analyze the ratio of two (possibly dependent) components (Long et al., 2016). Typical examples are, among others, (i) the LDL/HDL cholesterol ratio in cardiovascular research (Natarajan et al., 2003), defined as the ratio of the low-density lipoprotein and the high-density lipoprotein concentrations in plasma or serum, (ii) the CD4/CD8 ratio in HIV research (Caby et al., 2016), which measures the ratio of T helper cells to cytotoxic T cells in the human immune system, (iii) the testosterone over epitestosterone (T/E) ratio in antidoping research (Sottas et al., 2007), and (iv) the GEFC/REFC ratio in ophthalmic research, corresponding to the green and red emission components in fundus autofluorescence imaging (Wintergerst et al., 2022). In many of such studies, biomarker ratios are used as early indicators or even as surrogate endpoints for a specific disease. In these cases, the focus is not only on the characterization of the marginal ratio distribution, but also on modeling this distribution as a function of a set of covariates . Usually, this amounts to specifying a regression model that includes the ratio as outcome variable (Berger et al., 2019).
When setting up a model relating the ratio outcome to the covariates , a common assumption is that both components follow either log-normal or gamma distributions, thereby accounting for the nonnegativity of the component values and the skewness of their distributions (Mitchell et al., 2015; Van Domelen et al., 2021). In the former case it is easily derived that the ratio is again log-normally distributed. The latter case, which will be dealt with in this paper, is considerably less straightforward but is often preferred in practice due to its increased efficiency (Firth, 1988; Wiens, 1999; Berger et al., 2019).
In the special case where and are independently gamma distributed, the ratio follows a generalized beta distribution of the second kind, in the following abbreviated by GB2 (Kleiber and Kotz, 2003). A regression approach for the GB2 distribution has been proposed by Tulupyev et al. (2013), who studied the determinants of alcohol abuse in HIV-positive persons using the framework of vector generalized additive models (VGAMs; Yee, 2015). More recent examples include Safari-Katesari and Zaroudi (2020), Bourguignon et al. (2021), Medeiros et al. (2023) and dos Santos et al. (2023). The case of correlated gamma distributed components has earlier been studied by Lee et al. (1979) and Tubbs (1986). Based on Kibble’s bivariate gamma distribution for (Kibble, 1941), Berger et al. (2019) developed the extended GB2 (eGB2) model for the ratio of two positively correlated gamma distributed components. Their model is characterized by three parameters, of which one is directly interpretable in terms of the Pearson correlation coefficient between the two components. Conceptually, the extended GB2 model can be seen as a distributional regression model embedded in the framework of generalized additive models for location, scale and shape (GAMLSS; Rigby and Stasinopoulos, 2005; Berger and Schmid, 2020).
Despite its major importance in biostatistics, no regression modeling strategy exists (to the best of our knowledge) for ratio outcomes with two negatively correlated gamma distributed components. Negatively correlated measurements are encountered in numerous applications, for example in dementia research, where ratios of cerebrospinal fluid (CSF) biomarkers are used for the early diagnosis of Alzheimer’s disease (Koyama et al., 2012). Importantly, measurements of the widely employed amyloid- 42 protein and total tau protein biomarkers are known to exhibit a negative correlation (Tapiola et al., 2009). In recent publications, the Gaussian regression model has been used for modeling ratios of CSF biomarkers (e.g., Xu et al., 2020). Clearly, this model neither accounts for the characteristics of the bivariate distribution of nor for the skewness in the distribution of the ratio outcome .
Motivated by these problems, and to address the current shortcomings in modeling ratio outcomes with negatively correlated gamma distributed components, we propose a regression model where the joint bivariate distribution of the two gamma distributed components is defined by Frank copula (Genest, 1987). By this choice, the model flexibly accounts for either negative or positive associations between the two components (measured by Spearman’s or Kendall’s rank correlation coefficient). It also allows for modeling different characteristics of the two marginal distributions, including possibly unequal rate and shape parameters. By relating the covariates to the parameters of the marginals, as well as to the association parameter defined by the copula, our model further allows to derive the conditional probability density function of as a function of covariates. This, in turn, allows for the analysis of conditional distributional parameters (like the expected value, median or quantiles), including valid inferential conclusions for these quantities.
We apply the new approach to data from a multi-center observational cohort study conducted by the German Dementia Network (DCN; Kornhuber et al., 2009). Study participants were diagnosed with either mild cognitive impairment (MCI), Alzheimer’s Disease (AD), or other dementia. The study aims at determining the diagnostic and prognostic power of clinical, laboratory and imaging methods. This task is considered to be a major challenge, as the period from the first clinical symptoms of AD to disease onset might take years to decades (Sperling et al., 2013). Consequently, as biomarker ratios like the amyloid- 42/total tau ratio are considered to be strong predictors of AD progression, it is of high interest to relate these measurements to patient characteristics like age, sex and educational level (Jack et al., 2015). As will be demonstrated in Section 4, the proposed copula regression model can be suitably applied to address this problem, resulting in meaningful descriptive and inferential findings regarding the associations between the biomarker ratio and individual patient characteristics.
The rest of the paper is organized as follows: Section 2 derives the distributional copula regression model, states novel theoretical results with implications for the interpretation of covariate effects, and presents estimation, prediction and inference. The efficacy of our approach is demonstrated empirically in a simulation study in Section 3 and in our main application to AD progression in Section 4. The main findings of the paper are discussed in Section 5.
2 Methods
Section 2.1 starts by deriving the distribution of the ratio of two gamma distributed components with dependence induced by a Frank copula. Details on model specification and fitting are given in Section 2.2. Section 2.3 covers the prediction of distributional parameters and inference.
2.1 Distributional concept
Let and be two gamma distributed random variables with probability density functions (PDF)
(1) |
where , denote the rate parameters and , denote the shape parameters of and , respectively. We allow for dependencies between and , and assume that their joint distribution can be described by Frank copula with copula function . By Sklar’s theorem, the joint distribution of is thus given by
(2) |
where , and denote the joint bivariate and marginal cumulative distribution functions (CDFs) of and , respectively. The parameter determines the association between and . It can be shown that Kendall’s rank correlation coefficient is a monotone increasing function of , given by
(3) |
(Harry, 2014). As a consequence, the CDF in (2.1) allows for (possibly highly) positive or negative correlations between the two components and . The joint PDF of is given by
(4) |
where is the PDF of Frank copula.
We derive the resulting PDF of the ratio , an interpretable representation thereof and the CDF in the following three propositions.
Proposition 1.
Let the PDF of be defined by (2.1). Then the PDF of the ratio is given by
(5) |
where denotes the absolute value function.
Proof.
Remark 1.
Figure 1 visualizes the PDF of for different values of the rate, shape and association parameters. The figure illustrates that the form of the PDF is strongly related to the ratio of marginal means , which is highest in the lower left panel () where the dispersion is very large, and lowest in the upper right panel () where the PDFs are heavily right-skewed. Figure 1 also describes the association between the PDF and the Kendall’s rank correlation coefficient. In each of the nine cases the mode of the PDF increases as increases. Of note, the median of does not vary with (being equal for the three PDFs in each panel).
Proposition 2.
Proof. The proof of Proposition 2 is given in Appendix A.
Remark 2.
By Proposition 2 the PDF of can be written as a function of the ratio of the rate parameters . This facilitates the interpretation of the proposed regression model, as it results in a sparser representation of the covariate effects. In particular, covariate effects can be investigated using one-dimensional hypothesis tests and p-values, see our application in Section 4. Figure 2 illustrates how the median of is related to . It is seen that the median decreases with independent of the values of the shape parameters and the association parameter.
Proposition 3.
2.2 Regression specification and estimation
To model the entire distribution of as a function of covariates , we propose to relate both the logarithmic rate parameters and and the association parameter to predictor functions of the form
(8) | ||||
(9) | ||||
(10) |
where , and are sets of real-valued coefficients. Analogous to classical gamma regression (Chapter 5.3 of Fahrmeir et al., 2022), the use of the logarithmic transformation in (8) and (9) ensures positivity of the rate parameters. Since , no transformation is needed for the association parameter. As a result of (8) and (9) it holds that .
Remark 3.
In principle, our approach allows to make use of the full flexibility of GAMLSS by relating all distributional parameters (including the shape parameters ) to the covariates and by including nonlinear effects in the predictor functions. However, in our application we found that the specification in (8) – (10) provides a sufficient fit, thereby meeting a compromise between model fit and model complexity. Furthermore, it greatly simplifies the interpretability of the results (as we will further elaborate in Section 4). Based on these considerations, our model assumes that the shape parameters and do not depend on , but can be treated as nuisance parameters.
Definition 1.
The regression model for the ratio with the distribution from Proposition 3 and with covariate-dependent parameters as specified in (8)–(10) will be termed Frank copula with Gamma Distributed Marginals (FCGAM) in the following. The FCGAM model imposes the constraint , to ensure that the two marginals both exhibit a unimodal, right-skewed distribution, which is the common form of biomarker distributions in medical applications.
Corollary 1.
For a set of i.i.d. observations with ratios and model coefficients , the log-likelihood function of the FCGAM model is given by
(11) |
Corollary 2.
Under the usual regularity assumptions, the estimator
(12) |
is consistent and asymptotically normal for .
Implementational details. Maximization of the log-likelihood function in (1) can be carried out using the R function FCGAMoptim(), which is part of the supplementary material to this paper. The optimization algorithm is based on the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm implemented in the R function optim(), setting the additional constraint , .
2.3 Prediction of distributional parameters and inference
Prediction. For a new observation with covariate values , predictions of the conditional PDF can be obtained by computing the maximum likelihood estimate (MLE) and by plugging the estimated parameters , and , in Equation (2). The predicted PDF can then be used to predict any distributional parameter of interest (like the expected value, median or quantiles). For example, denoting the predicted PDF by , the predicted median can be calculated by
(13) |
Inference. Despite the asymptotic results from Corollary 2, more reliable finite-sample confidence intervals have been established in additive models (Wood, 2017). This is particularly the case for the quantities of interest here (such as the median of above). The reason is that these are nonlinear transformations of the original model coefficients such that confidence intervals would show an additional finite-sample bias due to the application of the -rule. Following Wood (2017), we thus propose to construct confidence intervals of using a Bayesian approach, which we accordingly refer to as credible intervals. Assuming flat priors on , the posterior distribution of is given by
(14) |
where is the Hessian of the negative log-likelihood evaluated at (Equation (6.26) of Wood, 2017). Consequently, approximate credible intervals for the coefficients can be obtained by drawing a large sample from the posterior distribution (14) and by calculating the and percentiles from this sample (Wood, 2017, p. 293). In our simulations (Section 3) and in the analysis of the DCN study data (Section 4) we used samples of size 10,000 throughout.
3 Simulations
We conducted three simulation studies to investigate the performance of the FCGAM model. Our main aims were (a) to analyze the model fit and the coverage of the credible intervals, (b) to evaluate how the performance of the FCGAM approach is affected by the sample size and the choice of the association parameter , and (c) to benchmark our method against alternative ones, in particular against the extended GB2 model by Berger et al. (2019) which assumes the correlation between and to be positive.
3.1 Experimental design
In all simulations the ratio outcome was generated according to the PDF of the FCGAM model derived in Proposition 1. Similar to the application data in Section 4, we considered two standard normally distributed covariates and two binary covariates , which were equi-correlated with Pearson correlation coefficient 0.4. For each we simulated independent data sets.
In Simulation Study 1, we considered scenarios with fixed negative correlation (the case which motivated our development of the FCGAM model), setting and . This resulted in the respective rank correlation coefficients . The rate parameters were related to the four covariates through the coefficients and . The shape parameters were set to and in all scenarios.
In Simulation Study 2, we considered scenarios with fixed positive correlation (the case which has already been covered by the eGB2 model but also applies to the FCGAM model), setting and . This resulted in the respective rank correlation coefficients . To ensure that the outcome values were in a meaningful range (comparable to Simulation Study 1) we set the regression coefficients to and , and the shape parameters to and .
In Simulation Study 3, we evaluated how the model fit of the FCGAM model was affected when falsely assuming a dependence of on , or when ignoring a present dependence of on . For this we reconsidered the data sets from Simulation Study 1 with , and additionally considered scenarios where the association parameter was related to the four covariates through the coefficient vector (resulting in covariate-dependent rank correlation coefficients , with the remaining parameters as in Simulation Study 1). In both cases we fitted the FCGAM model with covariate-dependent (according to (10)) and with constant .
Benchmark methods. We evaluated the fits of the 1000 FCGAM models by computing the predictive log-likelihood values on 1000 independent test data sets. The test data sets (of size each) were also used to compare the FCGAM model to alternative models. To this purpose, we evaluated the predictive log-likelihood values of the following benchmark methods, where (ii), (iii) and (vi) are univariate regression models for , (iv) is a univariate regression model for , and (v) and (vii) are distributional regression models:
-
(i)
The copula-based FCGAM model introduced in Section 2.2.
-
(ii)
The extended GB2 model (Berger et al., 2019) (eGB2) assuming a positive correlation between and .
-
(iii)
The simple GB2 model (GB2) assuming zero Pearson correlation between and .
-
(iv)
A Gaussian regression model with log-transformed outcome values (LN).
-
(v)
A Gaussian GAMLSS with log-transformed outcome values, where both the mean and the standard deviation were related to the covariates (LN.LSS). The standard deviation was modeled using the log link.
-
(vi)
A Gamma regression model with the original outcome values (GA). The mean parameter was related to the covariates and was modeled using the log link.
-
(vii)
A Gamma GAMLSS with the original outcome values, where both the mean and the scale parameters were related to the covariates (GA.LSS) using the log link.
3.2 Results
Point estimates of the FCGAM coefficients. Figure 3 presents the coefficient estimates in Simulation Study 1 with negative (but covariate-independent) correlation between and . The boxplots show that on average the estimated coefficients are very close to the true ones, regardless of the association parameter . Accordingly, the finite-sample bias of the MLEs is small in all scenarios (with varying and ). From Figure 3 it can also be seen that, as expected, the variance of the estimates decreases with increasing sample size, in particular for the two binary covariates and . In contrast, the correlation (determined by the value of ) has only a small impact on the variance of the estimates. The coefficient estimates (presented in Supplementary Figure S1) exhibit even smaller variances in the scenarios with and .
The coefficient estimates and from Simulation Study 2 with positive correlation between and are shown in Supplementary Figures S2 and S3, respectively. In both cases the bias is small throughout all scenarios. Regarding the variance of the estimates, the results are largely the same as in Figure 3.
Coverage proportions of the credible intervals. The coverage proportions of the 95% credible intervals obtained from the FCGAM fits are presented in Table 1. They range between (Simulation Study 1) and between (Simulation Study 2), which is close to the nominal coverage of 95%. There were only minor differences with regard to sample size and the correlation coefficient. This result demonstrates that not only point estimation but also inference works well for highly positive or negative correlations and fairly small samples.
Simulation Study 1 | |||||||||
---|---|---|---|---|---|---|---|---|---|
=200 | 0.938 | 0.932 | 0.936 | 0.938 | 0.949 | 0.949 | 0.954 | 0.950 | |
0.958 | 0.942 | 0.955 | 0.941 | 0.943 | 0.935 | 0.946 | 0.952 | ||
0.937 | 0.938 | 0.932 | 0.947 | 0.952 | 0.946 | 0.940 | 0.949 | ||
=500 | 0.949 | 0.937 | 0.940 | 0.938 | 0.952 | 0.953 | 0.946 | 0.930 | |
0.954 | 0.938 | 0.928 | 0.935 | 0.952 | 0.932 | 0.935 | 0.949 | ||
0.934 | 0.948 | 0.939 | 0.951 | 0.955 | 0.947 | 0.958 | 0.943 | ||
=1000 | 0.949 | 0.937 | 0.955 | 0.947 | 0.944 | 0.946 | 0.948 | 0.937 | |
0.947 | 0.942 | 0.936 | 0.936 | 0.957 | 0.950 | 0.950 | 0.950 | ||
0.933 | 0.945 | 0.934 | 0.953 | 0.943 | 0.948 | 0.949 | 0.943 | ||
Simulation Study 2 | |||||||||
=200 | 0.935 | 0.929 | 0.934 | 0.935 | 0.947 | 0.933 | 0.959 | 0.954 | |
0.952 | 0.940 | 0.954 | 0.941 | 0.949 | 0.951 | 0.954 | 0.947 | ||
0.938 | 0.939 | 0.939 | 0.947 | 0.951 | 0.944 | 0.948 | 0.953 | ||
=500 | 0.941 | 0.948 | 0.936 | 0.928 | 0.940 | 0.942 | 0.946 | 0.940 | |
0.948 | 0.941 | 0.929 | 0.929 | 0.952 | 0.944 | 0.941 | 0.962 | ||
0.931 | 0.947 | 0.937 | 0.951 | 0.948 | 0.944 | 0.949 | 0.937 | ||
=1000 | 0.950 | 0.944 | 0.954 | 0.944 | 0.952 | 0.939 | 0.953 | 0.935 | |
0.955 | 0.942 | 0.934 | 0.934 | 0.956 | 0.952 | 0.949 | 0.947 | ||
0.928 | 0.946 | 0.934 | 0.955 | 0.935 | 0.956 | 0.948 | 0.945 |
Distributional prediction. The root mean squared error (RMSE) of the estimated conditional median values computed from (13) are given in Table 2. In Simulation Study 1 the performance is quite similar for all three values of , whereas in Simulation Study 2 the RMSE considerably decreases with increasing value of . This indicates that estimating the median value works best for highly positive correlations where the PDF of is rather diffuse with a large mode value (compare Figure 1). It is also seen from Table 2 that the means and the standard deviations of the RMSE decrease with increasing sample size.
Simulation Study 1 | |||
---|---|---|---|
= 200 | 0.076 (0.035) | 0.084 (0.039) | 0.081 (0.037) |
= 500 | 0.049 (0.021) | 0.053 (0.023) | 0.051 (0.021) |
= 1000 | 0.034 (0.015) | 0.038 (0.016) | 0.036 (0.016) |
Simulation Study 2 | |||
= 200 | 0.154 (0.056) | 0.101 (0.035) | 0.063 (0.021) |
= 500 | 0.097 (0.033) | 0.064 (0.022) | 0.040 (0.014) |
= 1000 | 0.069 (0.023) | 0.046 (0.015) | 0.029 (0.010) |
Comparison to alternative models. Figure 4 and Supplementary Figure S4 show the prediction accuracy (i.e. the predicted log-likelihood values on the test sets) of the FCGAM model and the benchmark methods (ii) to (vi). In Simulation Study 1 with negative correlation, it can be observed that the FCGAM model achieves the highest accuracy in all scenarios (Figure 4). The superiority is even more evident when the sample size and the value of the correlation coefficient are increased. The extended GB2 and simple GB2 methods yield similar performances as the Gaussian models with log-transformed outcome (LN and LN.LSS), whereas the Gamma regression models (GA and GA.LSS) result in the lowest accuracy. For both LN and GA the GAMLSS models are not superior to their simple counterparts.
In Simulation Study 2 with positive correlation, the results change considerably (Supplementary Figure S4). As expected, the performance of the FCGAM and eGB2 models is largely the same, as the eGB2 model also assumes gamma distributed components with positive correlation. The simple GB2 model (assuming uncorrelated components) and the Gaussian models with log-transformed outcomes (LN and LN.LSS) perform comparably well in the scenarios with , but deteriorated with increasing correlation ( and ). Again, the Gamma regression models (GA and GA.LSS) exhibit the worst performance by far.
RMSE | Covariate-dependent correlation | Fixed correlation | ||
---|---|---|---|---|
modeled | constant | modeled | constant | |
= 200 | 0.069 (0.031) | 0.082 (0.042) | 0.079 (0.032) | 0.077 (0.034) |
= 500 | 0.042 (0.018) | 0.058 (0.026) | 0.050 (0.020) | 0.047 (0.018) |
= 1000 | 0.031 (0.013) | 0.049 (0.021) | 0.036 (0.015) | 0.034 (0.014) |
Predictive | Covariate-dependent correlation | Fixed correlation | ||
log-likelihood | modeled | constant | modeled | constant |
= 200 | -6.769 (19.176) | -7.116 (19.133) | -20.165 (19.163) | -19.356 (19.056) |
= 500 | -10.736 (29.031) | -12.853 (29.087) | -44.585 (29.092) | -43.979 (29.050) |
= 1000 | -18.237 (41.177) | -23.330 (40.965) | -85.656 (40.989) | -85.071 (40.967) |
Misspecified models for the association parameter in Simulation Study 3. The RMSE of the estimated conditional median values and the predictive log-likelihood values of the FCGAM fits are summarized in Table 3. It is seen that ignoring the dependence of on the covariates (left part of Table 3) decreases both the predictive ability and the model fit.
In the scenario with (large sample size), the difference in predictive log-likelihood values of 5.093 suggests “considerably less” empirical support for the model with constant (according to the rules of thumb provided in Burnham and Anderson, 2002). On the other hand, when unnecessarily modeling the dependence of on (right part of Table 3) the predictive ability and the model fit are mostly unaffected (showing only negligible differences in the RMSE and the predictive log-likelihood values).
Overall summary. Taken together, we make the following key empirical observations:
-
1.
Point estimates from the FCGAM model are reliable and nearly unbiased even for small sample sizes.
-
2.
The FCGAM model outperforms the eGB2 model in case of negative correlation and is en par with the eGB2 model when the correlation is positive.
-
3.
In all scenarios, the Gamma regression models perform worst. In particular, they perform worse than the Gaussian models with log-transformed outcome.
-
4.
Falsely modeling the association parameter does not deteriorate predictive performance to a large degree, whereas the FCGAM model with covariate-dependent improves the fit when the true association depends on the covariates.
4 Cohort Study of the German Dementia Competence Network
Background. The multi-center cohort study conducted by the German Dementia Competence Network (DCN; Kornhuber et al., 2009) enrolled patients aged older than 50 years that were diagnosed with either mild cognitive impairment (MCI), Alzheimer’s disease (AD) or other dementia. Recruitment took place between 2003 and 2007. The main objective of the original study was to establish biomarkers for the diagnosis and prognosis of AD using clinical, laboratory and imaging measurements. Here, we investigate covariates that are potentially associated with amyloid- 42, amyloid- 40 and total tau protein concentrations measured in cerebrospinal fluid samples. These analyses are of high relevance for clinical routine in the neurosciences, since biomarkers enable the detection of AD pathology long before the occurrence of the first clinically obvious symptoms (Sperling et al., 2013). Thus, relating covariates to biomarker values provides insight into disease pathology and prevention at the individual patient level. In the neurosciences, amyloid- 42, amyloid- 40 and total tau protein concentrations are usually not analyzed separately but in terms of their ratios. More specifically, the amyloid- 42/40 ratio and amyloid- 42/total tau ratio are considered to be strong predictors of AD progression (Koyama et al., 2012). Therefore we focus on the group of MCI patients and relate their ratios to patient-related risk factors for dementia.
Description of the data. In the DCN study, amyloid- and total tau baseline concentrations were measured in 374 patients diagnosed with MCI. In all other MCI patients, CSF biosamples were not collected due to either logistic reasons or lack of consent to the invasive procedure of lumbar puncture. Exclusion of patients that did not meet the inclusion criterion (age 50 years; patients) and of patients with missing values in at least one of the considered risk factors ( patients) resulted in an analysis data set of patients. For details on the handling of missing values we refer to Berger et al. (2019). The unconditional distributions of the amyloid- 42/40 ratio and the amyloid- 42/total tau ratio are visualized in Figure 5. While the values of the amyloid- 42/40 ratios are all smaller than 0.3, the amyloid- 42/total tau ratios range between 0.2 and 13, exhibiting a heavily right-skewed distribution. Kendall’s rank correlation coefficient between the two components is measured to be (amyloid- 42/40) and (amyloid- 42/total tau). Thus, our analysis had to deal with both positive and negative correlations between the ratio components. As mentioned before, this problem was our main motivation for the development of the FCGAM model.
The risk factors included in the analysis are summarized in Table 4. These were: (i) sex, (ii) age in years, (iii) educational level (measured by the number of years of education), and (iv) a binary variable indicating whether a patient was a carrier of the apolipoprotein E4 (ApoE 4) allele, which is a strong genetic predictor of AD.
Variable | Summary statistics | ||||||
min | Q1 | median | Q3 | max | mean | sd | |
Amyloid- 42/40 | 0.03 | 0.08 | 0.10 | 0.14 | 0.26 | 0.11 | 0.04 |
Amyloid- 42/total tau | 0.19 | 0.91 | 2.13 | 3.72 | 12.95 | 2.70 | 2.34 |
Age (years) | 51 | 60 | 66 | 73 | 89 | 66.51 | 8.11 |
Education (years) | 2 | 11 | 11 | 13 | 19 | 12.18 | 2.96 |
Sex | male: | 194 (58.8%) | female: | 136 (41.2%) | |||
ApoE 4 | no: | 182 (55.2%) | yes: | 148 (44.8%) |
Model fitting I. In a preliminary analysis, we fitted GA and GA.LSS models for the components amyloid- 42, amyloid- 40 and total tau, where either the rate parameters only (GA) or both the rate and the shape parameters (GA.LSS) were related to the four covariates. According to the Bayesian information criterion (BIC) the simple GA models (BIC for amyloid- 42, BIC for amyloid- 40 andBIC for total tau) showed better fits than the respective GA.LSS models (BIC for amyloid- 42, BIC for amyloid- 40 and BIC for total tau). This result indicates that it is sufficient to relate the two rate parameters to the covariates. Furthermore, it supports the assumptions of the proposed FCGAM model, which treats the shape parameters and as nuisance parameters.
The fits of the FCGAM model with covariate-dependent association parameter are presented in Supplementary Table S1. According to the credible intervals given in columns 4 and 6, none of the risk factors is found to affect the association parameter . Applying Equation (3) yielded the mean estimated rank correlations (range: 0.21 to 0.49) for amyloid- 42/40 and (range: -0.41 to 0.03) for amyloid- 42/total tau. Both estimates are close to the respective unconditional rank correlations.
Model fitting II. Based on the above findings and to further reduce model complexity, we fitted FCGAM models with constant association parameter (setting the coefficients to zero). We then calculated the BIC from these reduced models along with their counterparts obtained from the models with covariate-dependent . For amyloid- 42/40, the BIC values were (constant ) and (modeled ). For amyloid- 42/total tau, the BIC values were (constant ) and (modeled ). This result suggests that the reduced models with constant meet a better compromise between model fit and model complexity than the respective full models with covariate-dependent . This observation is confirmed by randomized quantile residuals of the reduced FCGAM models (Figure 6) indicating only slight deviations from normality.
Main results. The results obtained from the reduced FCGAM models are shown in Table 5 and Supplementary Figures S5 and S6. The upper part of Table 5 refers to the parameter , reporting the differences . Note that the coefficient estimates are very similar to the respective estimates of the more complex model in Table S1. For example, for amyloid- 42/total tau one obtains (Table 5) and (Table S1). The credible intervals in Table 5 were obtained by drawing a sample of size 10,000 from the posterior distribution in (14) and by calculating the and percentiles from the sampled differences . According to the results of the FCGAM model, there is strong evidence for an effect of the risk factors age and ApoE 4 on theamyloid- 42/40 and amyloid- 42/total tau ratios. As depicted in Supplementary Figures S5(a) and S6(a), both the expected amyloid- 42/40 ratio and the expected amyloid- 42/total tau ratio decrease with increasing age, implying a higher risk of progression to AD in older patients. Similarly, the expected ratios of ApoE 4 carriers are strongly reduced compared to patients not carrying the allele (Supplementary Figures S5(d) and S6(d), confirming the important role of this genetic risk factor in AD progression). The figures also illustrate how the estimated median values as well as the and percentiles of the distributions change with the covariates. In contrast to age and ApoE 4, Table 5 shows no evidence for an effect of sex and educational level on the two ratio outcomes. These results are in full agreement with the findings by Berger et al. (2019), who fitted an eGB2 model with amyloid- 42/40 outcome to the DCN study data.
amyloid- 42/40 | amyloid- 42/total tau | ||||
Parameter | Covariate | CI | CI | ||
Age (years) | |||||
Education (years) | |||||
Sex (male) | . | . | . | . | |
Sex (female) | |||||
ApoE 4 (no) | . | . | . | . | |
ApoE 4 (yes) | |||||
5 Discussion
The main contribution of this work is a copula-based regression model that relates the ratio of two gamma distributed components to a set of covariates. Our model is primarily designed for the analysis of ratio outcomes in medical research, which is an important task, for instance, in neurology (Novellino et al., 2021), infectiology (Caby et al., 2016) and pharmacology (Cawley et al., 2022). Importantly, when biomarker ratios are used as clinical metrics or indicators of clinical outcomes, our model may be used to relate the respective ratio values to a set of risk factors and/or confounding variables. A prototypical example is given by the prognosis of AD progression considering ratios of amyloid- and total tau protein biomarkers, as presented in Section 4 of this paper.
Conceptually, the FCGAM model developed in this paper has the following advantages: First, by assuming the ratio components to follow marginal gamma distributions, the FCGAM model represents the two biomarkers by real-valued random variables with positive support and right-skewed (marginal) distributions. These distributional characteristics, which are common to most biomarkers encountered in medical research, are directly incorporated in the definition of the proposed copula model. As a consequence, the resulting ratio density incorporates the full information contained in the marginal densities of the components of the ratio. We emphasize that this property does not apply to simpler modeling approaches approximating the ratio by a single log-normal or gamma-distributed variable. In fact, without consideration of the paired components themselves, these approximations inevitably bear the risk of a loss of information (“neglected companion”; Kerkhof et al., 2019). This issue has been demonstrated by the results of our simulation study (Section 3), which resulted in an increased estimation accuracy of the proposed copula-based approach in all data-generating scenarios. We also stress that linking the two marginal distributions by a copula does in general not restrict our model to the use of two gamma distributions for the ratio components. In fact, although our model can be seen as the most relevant use case in many medical applications, the marginal distributions can in principle be replaced by arbitrary parametric distributions. For instance, our model can in a straightforward manner be extended to situations where one biomarker is discrete or ordinal.
Second, the proposed FCGAM model has a high flexibility regarding the direction of the association between the two ratio components. Importantly, by the choice of Frank copula, the FCGAM model allows for both positive and negative values of the (rank) correlation between the components and , thereby improving previous modeling approaches that restricted this correlation to be zero (Yee, 2015) or positive (Berger et al., 2019). As demonstrated in the simulation study in Section 3, the FCGAM model indeed performs better in terms of estimation accuracy when the association between and is negative. On the other hand, it does not perform worse than the aforementioned approach when the association between and is positive.
Third, although the proposed model incorporates the full information contained in the marginal densities and , it provides a rather simple interpretation of the associations between the ratio and the covariates. This is because the FCGAM model reduces the original five-parameter set (including all parameters of the marginal densities and the association parameter ) to the restricted set with . As a consequence, when treating , (and possibly also ) as nuisance parameters, the associations between and each of the covariates can be investigated using one-dimensional coefficient estimates and single-parameter hypothesis tests. Similarly, the association between the components and has a natural interpretation in terms of Kendall’s rank correlation, being related to by the one-to-one relationship given in Equation (3).
Finally, beside the flexibility in specifying other marginal distributions than the gamma distribution, the FCGAM model may be extended in many other ways. For example, Frank copula could be replaced by other copulas (noting that the results on ratio densities are also valid for other absolutely continuous copulas; see Ly et al., 2019). When there is particular interest in the tail dependencies of and , benchmark experiments to identify the best fitting copula and/or marginal distributions could be performed using resampling techniques (e.g. bootstrapping or subsampling). It should be noted, however, that other copulas from the literature might be less flexible regarding the range of (and thus also the range of possible associations between the components and ; see e.g. Ghosh et al., 2022, for a recent overview of copulas allowing for modeling negative dependence). For example, it is not possible to model negative associations between and using non-rotated Gumbel or Joe copulas.
6 Appendices
It contains the proof of Proposition 2 (Appendix A), additional simulation results (Appendix B) as well as further results of the analysis of the DCN study data (Appendix C).
7 Acknowledgments
Moritz Berger acknowledges support by the grant BE 7543/1-1 of the German research foundation (DFG). Nadja Klein acknowledges support by the Emmy Noether grant KL 3037/1-1 of the DFG. The analysis of the DCN study data was supported by the German Federal Ministry of Education and Research (Kompetenznetz Demenzen, grant 01GI0420)
References
- Berger and Schmid (2020) M. Berger and M. Schmid. Flexible modeling of ratio outcomes in clinical and epidemiological research. Statistical Methods in Medical Research, 29:2250–2268, 2020.
- Berger et al. (2019) M. Berger, M. Wagner, and M. Schmid. Modeling biomarker ratios with gamma distributed components. The Annals of Applied Statistics, 13:548–572, 2019.
- Bourguignon et al. (2021) M. Bourguignon, M. Santos-Neto, and M. de Castro. A new regression model for positive random variables with skewed and long tail. Metron, 79:33–55, 2021.
- Burnham and Anderson (2002) K. P. Burnham and D. A. Anderson. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Springer, New York, 2nd edition, 2002.
- Caby et al. (2016) F. Caby, A. Guihot, S. Lambert-Niclot, M. Guiguet, D. Boutolleau, R. Agher, M.-A. Valantin, R. Tubiana, V. Calvez, A.-G. Marcelin, B. Autran, D. Costagliola, and C. Katlama. Determinants of a low CD4/CD8 ratio in HIV-1–infected individuals despite long-term viral suppression. Clinical Infectious Diseases, 62:1297–1303, 2016.
- Cawley et al. (2022) A. Cawley, B. Keen, K. Tou, M. Elbourne, and J. Keledjian. Biomarker ratios. Drug Testing and Analysis, 14:983–990, 2022.
- dos Santos et al. (2023) K. H. dos Santos, T. L. Pereira, T. C. de Souza, and M. Bourguignon. A misspecification test for beta prime regression models. Communications in Statistics – Simulation and Computation, 52:4561–4574, 2023.
- Fahrmeir et al. (2022) L. Fahrmeir, T. Kneib, S. Lang, and B. D. Marx. Regression: Models, Methods and Applications. Springer, Berlin, Heidelberg, 2nd edition, 2022.
- Firth (1988) D. Firth. Multiplicative errors: Log-normal or gamma? Journal of the Royal Statistical Society, Series B, 50:266–268, 1988.
- Genest (1987) C. Genest. Frank’s family of bivariate distributions. Biometrika, 74:549–555, 1987.
- Ghosh et al. (2022) S. Ghosh, P. Bhuyan, and M. Finkelstein. On a bivariate copula for modeling negative dependence: Application to New York air quality data. Statistical Methods & Applications, 31:1329–1353, 2022.
- Harry (2014) J. Harry. Dependence Modeling with Copulas. Chapman & Hall, New York, 2014.
- Jack et al. (2015) C. R. Jack, H. J. Wiste, S. D. Weigand, D. S. Knopman, P. Vemuri, M. M. Mielke, V. Lowe, M. L. Senjem, J. L. Gunter, M. M. Machulda, B. E. Gregg, V. S. Pankratz, W. A. Rocca, and R. C. Petersen. Age, sex, and APOE 4 effects on memory, brain structure, and -amyloid across the adult life span. JAMA Neurology, 72:511–519, 2015.
- Kerkhof et al. (2019) P. L. M. Kerkhof, R. A. Peace, and N. Handly. Ratiology and a complementary class of metrics for cardiovascular investigations. Physiology, 34:250–263, 2019.
- Kibble (1941) W. F. Kibble. A two-variate gamma type distribution. Sankhya: The Indian Journal of Statistics, 5:137–150, 1941.
- Kleiber and Kotz (2003) C. Kleiber and S. Kotz. Statistical Size Distributions in Economics and Actuarial Sciences. Wiley, Hoboken, 2003.
- Kornhuber et al. (2009) J. Kornhuber, K. Schmidtke, L. Frölich, R. Perneczky, S. Wolf, H. Hampel, F. Jessen, I. Heuser, O. Peters, M. Weih, H. Jahn, C. Luckhaus, M. Hüll, H.-J. Gertz, J. Schröder, J. Pantel, O. Rienhoff, S. A. Seuchter, E. Rüther, F. Henn, W. Maier, and J. Wiltfang. Early and differential diagnosis of dementia and mild cognitive impairment: Design and cohort baseline characteristics of the German Dementia Competence Network. Dementia and Geriatric Cognitive Disorders, 27:404–417, 2009.
- Koyama et al. (2012) A. Koyama, O. I. Okereke, T. Yang, D. Blacker, D. J. Selkoe, and F. Grodstein. Plasma amyloid-beta as a predictor of dementia and cognitive decline – a systematic review and meta-analysis. Archives of Neurology, 69:824–831, 2012.
- Lee et al. (1979) R.-Y. Lee, B. S. Holland, and J. A. Flueck. Distribution of a ratio of correlated gamma random variables. SIAM Journal on Applied Mathematics, 36:304–320, 1979.
- Long et al. (2016) Q. Long, X. Zhang, Y. Zhao, B. A. Johnson, and R. M. Bostick. Modeling clinical outcome using multiple correlated functional biomarkers: A Bayesian approach. Statistical Methods in Medical Research, 25:520–537, 2016.
- Ly et al. (2019) S. Ly, K.-H. Pho, S. Ly, and W.-K. Wong. Determining distribution for the quotients of dependent and independent random variables by using copulas. Journal of Risk and Financial Management, 12:42, 2019.
- Medeiros et al. (2023) F. M. C. Medeiros, M. C. Araújo, and M. Bourguignon. Improved estimators in beta prime regression models. Communications in Statistics – Simulation and Computation, 52:5125–5138, 2023.
- Mekić et al. (2012) E. Mekić, M. Stefanović, P. Spalević, N. Sekulović, and A. Stanković. Statistical analysis of ratio of random variables and its application in performance analysis of multihop wireless transmissions. Mathematical Problems in Engineering, 2012:841092, 2012.
- Mitchell et al. (2015) E. M. Mitchell, R. H. Lyles, and E. F. Schisterman. Positing, fitting, and selecting regression models for pooled biomarker data. Statistics in Medicine, 34:2544–2558, 2015.
- Natarajan et al. (2003) S. Natarajan, H. Glick, M. Criqui, D. Horowitz, S. R. Lipsitz, and B. Kinosian. Cholesterol measures to identify and treat individuals at risk for coronary heart disease. American Journal of Preventive Medicine, 25:50–57, 2003.
- Novellino et al. (2021) F. Novellino, A. Donato, N. Malara, J. L. M. Madrigal, and G. Donato. Complete blood cell count-derived ratios can be useful biomarkers for neurological diseases. International Journal of Immunopathology and Pharmacology, 35:20587384211048264, 2021.
- Perri and Porporato (2022) S. Perri and A. Porporato. Environmental concentrations as ratios of random variables. Environmental Research Letters, 17:024011, 2022.
- Rigby and Stasinopoulos (2005) R. A. Rigby and D. M. Stasinopoulos. Generalized additive models for location, scale and shape (with discussion). Journal of the Royal Statistical Society, Series C, 54:507–554, 2005.
- Safari-Katesari and Zaroudi (2020) H. Safari-Katesari and S. Zaroudi. Count copula regression model using generalized beta distribution of the second kind. Statistics in Transition, New Series, 21:1–12, 2020.
- Sottas et al. (2007) P.-E. Sottas, N. Baume, C. Saudan, C. Schweizer, M. Kamber, and M. Saugy. Bayesian detection of abnormal values in longitudinal biomarkers with an application to T/E ratio. Biostatistics, 8:285–296, 2007.
- Sperling et al. (2013) R. A. Sperling, J. Karlawish, and K. A. Johnson. Preclinical Alzheimer disease – the challenges ahead. Nature Reviews Neurology, 9:54–58, 2013.
- Tapiola et al. (2009) T. Tapiola, I. Alafuzoff, S.-K. Herukka, L. Parkkinen, P. Hartikainen, H. Soininen, and T. Pirttilä. Cerebrospinal fluid -amyloid 42 and tau proteins as biomarkers of Alzheimer-type pathologic changes in the brain. Archives of Neurology, 66:382–389, 2009.
- Tubbs (1986) J. D. Tubbs. Moments for a ratio of correlated gamma variates. Communications in Statistics – Theory and Methods, 15:251–259, 1986.
- Tulupyev et al. (2013) A. Tulupyev, A. Suvorova, J. Sousa, and D. Zelterman. Beta prime regression with application to risky behavior frequency screening. Statistics in Medicine, 32:4044–4056, 2013.
- Van Domelen et al. (2021) D. R. Van Domelen, E. M. Mitchell, N. J. Perkins, E. F. Schisterman, A. K. Manatunga, Y. Huang, and R. H. Lyles. Gamma models for estimating the odds ratio for a skewed biomarker measured in pools and subject to errors. Biostatistics, 22:250–265, 2021.
- Wiens (1999) B. L. Wiens. When log-normal and gamma models give different results: A case study. The American Statistician, 53:89–93, 1999.
- Wintergerst et al. (2022) M. W. M. Wintergerst, N. R. Merten, M. Berger, C. Dysli, J. H. Terheyden, E. Poletti, F. G. Holz, V. S. Schäfer, M. Schmid, T. Ach, and R. P. Finger. Spectrally resolved autofluorescence imaging in posterior uveitis. Scientific Reports, 12:14337, 2022.
- Wood (2017) S. N. Wood. Generalized Additive Models: An Introduction with R. CRC press, Boca Raton, 2nd edition, 2017.
- Xu et al. (2020) W. Xu, L. Tan, B.-J. Su, H. Yu, Y.-L. Bi, X.-F. Yue, Q. Dong, and J.-T. Yu. Sleep characteristics and cerebrospinal fluid biomarkers of Alzheimer’s disease pathology in cognitively intact older adults: The CABLE study. Alzheimer’s & Dementia, 16:1146–1152, 2020.
- Yee (2015) T. W. Yee. Vector Generalized Linear and Additive Models: With an Implementation in R. Springer, New York, 2015.
Appendix Appendix A Proof of Proposition 2
By Proposition 1, the PDF of the ratio is given by
where denotes the lower incomplete gamma function.
Appendix Appendix B Further Simulation Results
Appendix Appendix C Further Results of the Analysis of the DCN Study Data
amyloid- 42/40 | amyloid- 42/total tau | ||||
Parameter | Covariate | CI | CI | ||
Age (years) | |||||
Education (years) | |||||
Sex (male) | . | . | . | . | |
Sex (female) | |||||
ApoE 4 (no) | . | . | . | . | |
ApoE 4 (yes) | |||||
Age (years) | |||||
Education (years) | |||||
Sex (male) | . | . | . | . | |
Sex (female) | |||||
ApoE 4 (no) | . | . | . | . | |
ApoE 4 (yes) | |||||
Age (years) | |||||
Education (years) | |||||
Sex (male) | . | . | . | . | |
Sex (female) | |||||
ApoE 4 (no) | . | . | . | . | |
ApoE 4 (yes) | |||||