Nonparametric Tests for Multivariate Association

Harrar, Solomon W.; Xu, Yan

doi:10.3390/sym14061112

by Solomon W. Harrar ¹,* and Yan Xu ²

¹ Dr. Bing Zhang Department of Statistics, College of Arts and Sciences, University of Kentucky, Lexington, KY 40506, USA

² Merck & Co., 351 North Sumneytown Pike, P.O. Box 1000, North Wales, PA 19454, USA

* Author to whom correspondence should be addressed.

Symmetry 2022, 14(6), 1112; https://doi.org/10.3390/sym14061112

Submission received: 21 March 2022 / Revised: 16 May 2022 / Accepted: 24 May 2022 / Published: 28 May 2022

(This article belongs to the Section Life Sciences)


Abstract:

Testing the existence of association between a multivariate response and predictors is an important statistical problem. In this paper, we present nonparametric procedures that make no specific distributional, regression function, and covariance matrix assumptions. Our test is motivated by recent results in MANOVA tests for a large number of groups. Two types of tests are proposed. While it is natural to consider the classical approach for constructing the test by jointly considering all the variables together, we also investigate a composite test where variable-by-variable univariate tests are combined to form a multivariate test. The asymptotic distributions of the test statistics are derived in a unified manner by deriving the asymptotic matrix variate normal distribution of random matrices involved in the construction of the statistics. The tests have good numerical performance in finite samples. The application of the methods is illustrated with gene expression profiling of bronchial airway brushings.

Keywords:

multivariate data; nonparametric; MANOVA; lack-of-fit test; large number of groups

1. Introduction

Testing association between variables is perhaps one of the most important problems in statistics. The utility of a statistical model to adequately describe a physical system and accurately predict future performance hinges on the associations between the variables in the system. Several parametric and nonparametric methods have been developed in the univariate case. For example, Pearson correlation measures the linear association between two continuous variables [1]. Its rank-based analog, known as Spearman’s correlation [2], is specifically useful to detect monotonic association between two variables, which may be ordinal or continuous. Kendall’s

τ

coefficient measures concordance or discordance between two variables [3]. While there exist multivariate versions of these measures, their utility for testing association is limited [4].

The ANOVA F-test is appropriate to test association between a continuous response and categorical predictor variables. This test may also be applied on residuals, and any association between the predictor and the residual variables would be indicative of lack-of-fit of the assumed model. Recent research studied the F-test in the ANOVA model under nonnormality as the number of groups tends to infinity but the numbers of observations in each group remain fixed, in both the balanced and unbalanced cases [5,6,7,8]. The heteroscedastic case and, especially, its application to lack-of-fit testing were developed by several authors, e.g., [9,10,11]. In a similar application, a nonparametric diagnostic test for homogeneity of variance was investigated by Wang and Zhou [12]. These tests do not require any distributional assumption on the errors other than independence across observational units or subjects. The central idea lies in checking whether the conditional moments of the response variable are different for different values of the predictor variables, much in the same way as ANOVA except that there are no replications for each value of the predictor variable. In essence, each value of the continuous predictor is treated as a factor level by creating moving windows containing nearest-neighbor values and grouping the values of the response variable by window membership. Roughly speaking, a one-way ANOVA layout with a large number of factor levels is ultimately constructed. However, the groups are not independent and, hence, the asymptotic theory is considerably more involved. Notwithstanding their novelty, all these works assume a univariate response variable. It is the aim of the present paper to develop a multivariate extension.

We adopt the idea of a moving window in order to propose two families of tests in the multivariate situation. The tests in one of the families consider all the responses jointly to construct a global test. The ones in the other family combine variable-by-variable test statistics for each of the p marginal hypotheses to develop a composite test, also referred to as a Multiple Contrast Test Procedure (MCTP). The proposed multivariate methods are nonparametric in the sense that they do not make any parametric assumption on the regression function (conditional mean vector) or the conditional covariance matrix. Furthermore, no assumption is made on the distributional form of the errors and the responses are allowed to be heterogeneous across subjects. Permutation tests that combine variable-by-variable tests have received considerable attention in the literature in the multiple regression [13] and multivariate analysis of variance (MANOVA) [14] contexts. For a general account of combined permutation tests, see Salmaso and Pesarin [15].

The remainder of the paper is organized as follows. The model and hypothesis of interest are introduced in Section 2.1. Section 2.2 explains the moving window approach and how it is used to construct the one-way MANOVA layout. In Section 2.3, the two tests, namely the global and composite tests of association, and their asymptotic theory are presented. Finite sample performance of the tests is evaluated with simulation studies in Section 3. In Section 4, the applications of the proposed methods are illustrated with gene expression data from the Genes-environments & Admixture in Latino Americans (GALA) II study. Further discussions of the results are provided in Section 5. All proofs are placed in Appendix A.

2. Methods

2.1. Model and Hypothesis

Suppose independent pairs of observations

(x_{i}, Y_{i}^{⊤})

are available from a units, where

x_{i}

is a fixed scalar and

Y_{i}^{⊤} = (Y_{i}^{(1)}, \dots, Y_{i}^{(p)})

is a p-dimensional random vector. Assume the nonparametric model

Y = m (x) + Σ^{1 / 2} (x) ϵ,

(1)

holds for each unit

i = 1, \dots, a

, where for each x the mean

m (x)

is an unknown vector valued function and the covariance

Σ (x)

is a

p \times p

unknown positive definite matrix valued function. The error vector

ϵ

is assumed to be identically and independently distributed across units with mean

0

and covariance matrix

I_{p}

.

The aim of this paper is to develop tests for no association in the nonparametric heteroscedastic regression model (1). More precisely, we consider the null hypothesis

H_{0} : m (x) = C,

(2)

for any x, where

C \in R^{p}

is an unknown vector of constants. We do not assume any functional forms for

m (x)

and

Σ (x)

. Furthermore, the distribution of the error

ϵ

is unspecified.

2.2. Moving Window One-Way Layout

Our approach is in essence similar to a lack-of-fit test in regression models. Lack-of-fit tests ideally require multiple replicates of the response variable per each value of the predictor variable. In observational studies or for predictors (covariates) measured prior to randomization, replicates per predictor variable values do not typically arise. For this situation, the idea of nearest neighborhood from nonparametric smoothing could be employed to construct artificial replicates. For

i = 1, \dots, a

, let the window

W_{i}

be the set of indices defined by,

W_{i} = \{j : | \hat{F} (x_{j}) - \hat{F} (x_{i}) | \leq \frac{n - 1}{2 a}\},

where

n < a

is an odd number,

\hat{F} (t) = a^{- 1} \sum_{k = 1}^{a} I (x_{k} \leq t)

, and

I (\cdot)

is the indicator function. The units in the i-th window

W_{i}

will constitute the replications in the i-th group, for a total of a groups. For the development of the theory in this paper, we assume that the group sizes are all equal to a fixed number n. To be precise, the groups at the extreme ends will have sizes smaller than n. However, the effect of this unbalancedness will be negligible in our asymptotic framework,

a \to \infty

.
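As an illustration, the window construction above can be sketched in Python. This is a minimal sketch rather than the authors' implementation: the function name `moving_windows` is hypothetical, and the predictor is assumed to have no ties. Because the empirical distribution function takes values that are multiples of 1/a, the window condition is checked on integer ranks.

```python
import numpy as np

def moving_windows(x, n):
    """Return index arrays W_i = {j : |F(x_j) - F(x_i)| <= (n - 1) / (2a)}."""
    x = np.asarray(x, dtype=float)
    a = x.size
    if n % 2 == 0 or n >= a:
        raise ValueError("n must be odd and smaller than a")
    # a * F(x_i) is the number of design points <= x_i (the rank when no ties).
    ranks = np.searchsorted(np.sort(x), x, side="right")
    half = (n - 1) // 2
    # Comparing integer ranks is equivalent to the condition on F and
    # avoids floating-point issues.
    return [np.flatnonzero(np.abs(ranks - ranks[i]) <= half) for i in range(a)]

x = np.arange(1, 11) / 10          # regular sequence x_i = i / a with a = 10
windows = moving_windows(x, n=3)
```

In this example the interior windows contain n = 3 indices, while the two windows at the extreme ends contain only two, which is the unbalancedness mentioned above.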

Roughly speaking, the test of association proposed in this paper examines if the mean vectors of the a groups are significantly different. In this setup, the large sample size in the original sample corresponds to the large number of groups in the moving window one-way layout. Asymptotic tests for MANOVA when the number of groups is large have been previously studied in parametric and nonparametric contexts [16,17,18,19]. However, these tests assume that the groups are independent and their results are not applicable to the moving window one-way layout, where the groups are not mutually independent.

Note that under the null hypothesis (2), within each window (group) i the response vectors, i.e.,

\{Y_{j} : j \in W_{i}\}

, have the same mean but different covariance matrices. Furthermore, under the assumptions

A 1

and

A 2

stated below, the within group covariances will be nearly constant from unit to unit, especially so when n is relatively small compared to a. Furthermore, when

H_{0}

is not true, the within group mean vectors would be nearly constant if

m (x)

is a smooth function. Therefore, test statistics developed for MANOVA with unequal group covariance e.g., [19] could potentially be sensitive for detecting departure from the null hypothesis (2). These ideas of a moving window in a one-way layout were previously used for lack-of-fit test [10,11] and test of homogeneity of variance [12,20] in the univariate setting.

Our theoretical results require some regularity conditions which are listed below.

A1:: $x_{1}, \dots, x_{a}$ are fixed design values on $[0, 1]$ where $x_{i}$ is the $(i / a)$ th quantile of some Lipschitz continuous positive density $r (x)$ on $[0, 1]$ .
A2:: The covariance $Σ (x)$ is a Lipschitz continuous function.
A3:: $E {(ϵ_{1}^{⊤} ϵ_{1})}^{2 + δ} < \infty$ for some $δ > 0$ .

The sequence

x_{1}, \dots, x_{a}

which satisfies assumption A1 is known as a regular sequence [12,21]. Assumption A1 stipulates that the design points

x_{i}

satisfy

\int_{0}^{x_{i}} r (x) d x = i / a

for

i = 1, \dots, a

. For example,

x_{i} = i / a

;

i = 1, \dots, a

; is a regular sequence with respect to the uniform distribution. The Lipschitz continuity in A2 is in the sense of the Frobenius norm,

{\| A \|}_{F} = {(\sum_{i, j} {| a_{i j} |}^{2})}^{1 / 2}

for matrix

A = (a_{i j})

. Together, assumptions A1 and A2 imply

{\| Σ (x_{j_{2}}) - Σ (x_{j_{1}}) \|}_{F} \leq K_{1} | x_{j_{2}} - x_{j_{1}} | \leq K_{2} | j_{2} - j_{1} |,

(3)

for

x_{j_{1}}, x_{j_{2}} \in [0, 1]

and universal constants

K_{1}

and

K_{2}

. Specifically,

{\| Σ (x_{j_{2}}) - Σ (x_{j_{1}}) \|}_{F} = O (n / a)

, if

x_{j_{1}}, x_{j_{2}} \in W_{i}

for any i. Therefore, assumptions A1 and A2 guarantee that the heterogeneity of the covariance matrices within each window is controlled. These assumptions motivate the application of ideas from high-dimensional (large number of groups) MANOVA to the moving window one-way MANOVA and also permit convenient expressions for the asymptotic results.
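The within-window rate can be spelled out in a short derivation. This is a sketch under an additional reading of A1, namely that the density is bounded away from zero on [0, 1]; the symbol r_min is notation introduced here and does not appear in the original.

```latex
% Assumption (ours): r_{\min} := \min_{x \in [0,1]} r(x) > 0.
% By A1, the design points are quantiles of r, so
\frac{|j_2 - j_1|}{a}
  = \left| \int_{x_{j_1}}^{x_{j_2}} r(x)\, dx \right|
  \ge r_{\min}\, |x_{j_2} - x_{j_1}|
\quad \Longrightarrow \quad
|x_{j_2} - x_{j_1}| \le \frac{|j_2 - j_1|}{a\, r_{\min}} .
% Two indices in the same window W_i differ by at most n - 1, so by A2
\| \Sigma(x_{j_2}) - \Sigma(x_{j_1}) \|_F
  \le K_1\, |x_{j_2} - x_{j_1}|
  \le \frac{K_1 (n - 1)}{a\, r_{\min}}
  = O(n / a) .
```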

2.3. Test Statistics

For testing the association hypothesis in (2), we follow two approaches. The first one uses omnibus (global) tests for heteroscedastic MANOVA proposed in the context of large number of factor levels e.g., [16,19]. The second approach is based on the idea of simultaneous inference where multiple univariate tests are combined to construct a composite multivariate test. A somewhat related idea to the latter was implemented in Zambom and Kim [20] to develop lack-of-fit test in univariate multiple regression.

Let the

p \times (a n)

data matrix for the augmented (moving window) one-way layout be denoted by

Y = (Y_{1}^{*}, \dots, Y_{a}^{*})

, where

Y_{i}^{*} = (Y_{j} : j \in W_{i})

is the matrix of data on the response vector for the ith group. Further, define the group sample mean vectors and covariance matrices as

{\bar{Y}}_{(i)} = \frac{1}{n} \sum_{j \in W_{i}} Y_{j} and S_{(i)} = \frac{1}{n - 1} \sum_{j \in W_{i}} (Y_{j} - {\bar{Y}}_{(i)}) {(Y_{j} - {\bar{Y}}_{(i)})}^{⊤},

and the overall mean as

\bar{Y} = a^{- 1} \sum_{i = 1}^{a} {\bar{Y}}_{(i)}

.

2.3.1. Omnibus Tests

Classical MANOVA tests assume that the number of treatments is fixed and observations in different treatment groups are independent. There has been extension of these tests for large number of treatment groups under general conditions in the parametric [16,18,22] and nonparametric [17,19,23] settings. In the univariate case, the usual F statistic for one-way ANOVA coincides with the regression lack-of-fit test when there are multiple replications for each observed value of the predictor variable [9]. In view of this, the large number of treatment asymptotic framework is the ideal setup for large sample inference for the lack-of-fit problem with moving window one-way layout. The multivariate extension of this testing problem from the MANOVA global testing point of view is considered in this section.

For testing the Hypothesis (2) under the model (1), consider the test statistic

T (A) = \sqrt{\frac{a}{n}} tr \{(MST - MSE) A\} .

(4)

Here

MST

is the treatment mean squares and cross products matrix and

MSE

is the error mean squares and cross products matrix for the augmented (moving window) one-way layout data described in Section 2.2. These matrices are defined by

MST = \frac{n}{a - 1} \sum_{i = 1}^{a} ({\bar{Y}}_{(i)} - \bar{Y}) {({\bar{Y}}_{(i)} - \bar{Y})}^{⊤} and MSE = \frac{1}{a} \sum_{i = 1}^{a} S_{(i)} .

(5)
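The computation of MST, MSE and the statistic T(A) can be sketched as follows. This is an illustrative sketch, not the authors' code: the function name `omnibus_statistic` is hypothetical, the rows of `Y` are assumed to be ordered by a tie-free predictor, n is assumed to be odd and at least 3, and the truncated windows at the two ends use their own sizes when averaging (a choice made here, since the text treats all groups as having size n).

```python
import numpy as np

def omnibus_statistic(Y, n, A=None):
    """Return T(A) = sqrt(a/n) * tr{(MST - MSE) A} together with MST and MSE.

    Y is an (a x p) array whose rows are ordered by the predictor.
    """
    Y = np.asarray(Y, dtype=float)
    a, p = Y.shape
    half = (n - 1) // 2
    means = np.empty((a, p))
    mse = np.zeros((p, p))
    for i in range(a):
        # Nearest neighbours in rank; the window is truncated at the ends.
        block = Y[max(0, i - half): min(a, i + half + 1)]
        means[i] = block.mean(axis=0)
        # Assumption: S_(i) uses (window size - 1) as divisor, which
        # equals n - 1 for every full window.
        mse += np.cov(block, rowvar=False).reshape(p, p) / a
    centered = means - means.mean(axis=0)   # group means minus overall mean
    mst = n / (a - 1) * centered.T @ centered
    if A is None:
        A = np.eye(p)
    stat = np.sqrt(a / n) * np.trace((mst - mse) @ A)
    return stat, mst, mse

rng = np.random.default_rng(0)
Y = rng.standard_normal((50, 3))            # a = 50 units, p = 3 responses
stat, mst, mse = omnibus_statistic(Y, n=7)
```

Passing `A = np.linalg.inv(mse)` or `A = p / np.trace(mse) * np.eye(p)` would give the two choices of A considered later in the simulation study.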

The introduction of the matrix

A

into the test statistic in (4) serves multiple purposes. It allows the test statistic to use the information in the off-diagonal elements (correlation information) of

MST - MSE

. In addition, with the appropriate choice of

A

, one can make the test affine invariant in the sense that the test is invariant to the transformation

B Y_{i} + c

for any fixed

p \times p

nonsingular matrix

B

and any vector

c \in R^{p}

. There are many reasonable choices for the matrix

A

. In the simulation study, we will consider two of them that correspond to Lawley–Hotelling’s [24] and Dempster’s [25] trace statistics, which are popular tests in multivariate analysis for low- and high-dimensional situations, respectively. From a theoretical standpoint, the Cramér–Wold device affords us a limiting matrix variate normal distribution for

{(a / n)}^{1 / 2} (MST - MSE)

if we establish asymptotic normality of

T (A)

for any fixed matrix

A \in R^{p \times p}

. To this end, Theorem 1 gives the asymptotic distribution of

T (A)

for any fixed

A

.

Theorem 1.

Under assumptions A1–A3 and the null hypothesis

H_{0}

,

T (A) \overset{d}{\to} N (0, τ^{2} (A)),

for any

p \times p

fixed matrix

A

, where

τ^{2} (A) = (2 / 3) (2 n - 1) {(n - 1)}^{- 1} \int_{0}^{1} tr {(Σ (x) A)}^{2} r (x) d x

and n is fixed.

As detailed in the proof of Theorem 1, the asymptotic variance

τ^{2} (A)

can be expressed as

τ^{2} (A) = \frac{2 (2 n - 1)}{3 (n - 1)} vec {(A^{⊤})}^{⊤} Ψ vec (A^{⊤}),

(6)

where

Ψ = \int_{0}^{1} (Σ (x) \otimes Σ (x)) r (x) d x

. A consistent estimator of

Ψ

may be constructed following the ideas of Dette and Munk [26]; see also [10,27]. Specifically, if

m (x)

is Lipschitz continuous,

\hat{Ψ} = \frac{1}{4 (a - 3)} \sum_{j = 2}^{a - 2} {(Y_{j} - Y_{j - 1}) {(Y_{j} - Y_{j - 1})}^{⊤}} \otimes {(Y_{j + 2} - Y_{j + 1}) {(Y_{j + 2} - Y_{j + 1})}^{⊤}}

(7)

is consistent for

Ψ

. Therefore, a reasonable estimator

{\hat{τ}}^{2} (A)

of the asymptotic variance

τ^{2} (A)

can be created by replacing

Ψ

in (6) with

\hat{Ψ}

in (7). The Lipschitz continuity requirement on

m (x)

makes it possible to control the finite sample bias in the estimation of

Ψ

. For a valid asymptotic test, one would reject

H_{0}

if

T (A) / \hat{τ} (A) > z_{α},

where

z_{α}

is the upper

α

th quantile of the standard normal distribution.
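The variance estimation and the rejection rule can be sketched as follows. This is a minimal sketch, not the authors' code: the function names are hypothetical, the rows of `Y` are assumed to be ordered by the predictor, and vec(·) is taken to stack columns, which is the convention under which the Kronecker product in Ψ matches the quadratic form in (6).

```python
import numpy as np
from statistics import NormalDist

def psi_hat(Y):
    """Difference-based estimator (7); rows of Y ordered by the predictor."""
    Y = np.asarray(Y, dtype=float)
    a, p = Y.shape
    d = np.diff(Y, axis=0)                 # d[j] = Y[j + 1] - Y[j] (0-based)
    psi = np.zeros((p * p, p * p))
    # For 1-based j = 2, ..., a - 2: Y_j - Y_{j-1} is d[j - 2] and
    # Y_{j+2} - Y_{j+1} is d[j]; the two differences share no observation.
    for j in range(2, a - 1):
        u, v = d[j - 2], d[j]
        psi += np.kron(np.outer(u, u), np.outer(v, v))
    return psi / (4 * (a - 3))

def tau2_hat(psi, A, n):
    """Plug-in version of (6): c * vec(A^T)^T psi vec(A^T)."""
    v = np.asarray(A, dtype=float).T.flatten(order="F")   # vec(A^T)
    return 2 * (2 * n - 1) / (3 * (n - 1)) * v @ psi @ v

def omnibus_reject(stat, tau2, alpha=0.05):
    """Reject H0 when T(A) / tau_hat(A) exceeds the upper alpha quantile."""
    return stat / np.sqrt(tau2) > NormalDist().inv_cdf(1 - alpha)

# Toy check with p = 1: unit-size alternating differences give psi = 1/4.
toy = psi_hat(np.tile([0.0, 1.0], 5).reshape(-1, 1))
```

For a symmetric A, such as the two choices used in the simulation study, the quadratic form in `tau2_hat` coincides with the trace expression in Theorem 1.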

2.3.2. Composite Tests

The Hypothesis (2) can be equivalently formulated as the intersection of p marginal hypotheses as

H_{0} : ⋂_{k = 1}^{p} H_{0, k},

(8)

where

H_{0, k} : m_{k} (x) = C_{k}

, and

m_{k} (x)

and

C_{k}

are the kth components of

m (x)

and

C

, respectively.

Let

{MST}^{(k)}

and

{MSE}^{(k)}

be the kth diagonal entries of

MST

and

MSE

, respectively. Wang et al. [10] studied the test statistic

T^{(k)} = \sqrt{\frac{a}{n}} ({MST}^{(k)} - {MSE}^{(k)}),

which is suitable for the marginal hypothesis

H_{0, k}

. Theorem 2 establishes the asymptotic joint distribution of

T = {(T^{(1)}, \dots, T^{(p)})}^{⊤}

.

Theorem 2.

Under the assumptions A1–A3 and the null hypothesis

H_{0}

,

T \overset{d}{\to} N_{p} (0, Ω),

as

a \to \infty

, where n is fixed and

Ω = (ω_{k ℓ})

is a

p \times p

positive definite matrix whose entries are defined by

ω_{k ℓ} = (2 / 3) (2 n - 1) {(n - 1)}^{- 1} \int_{0}^{1} σ_{k ℓ}^{2} (x) r (x) d x

and

σ_{k ℓ} (x)

is the

(k, ℓ)

th entry of

Σ (x)

.

An estimator

\hat{Ω} = ({\hat{ω}}_{k ℓ})

of the asymptotic covariance can be assembled by taking the corresponding entries from

\hat{Ψ}

. The result of Theorem 2 can be used to construct a multitude of test statistics. In the simulation study, we investigate

T_{\max} = \max_{k \in \{1, \dots, p\}} \frac{| T^{(k)} |}{\sqrt{{\hat{ω}}_{k k}}}

(9)

for its performance in finite samples. The critical value for the test statistic

T_{\max}

can be obtained from

P (T_{\max} \geq t_{α}) = α

. Equivalently,

t_{α}

must satisfy

P (- t_{α} < T^{(k)} / \sqrt{{\hat{ω}}_{k k}} < t_{α}; k = 1, \dots, p) = 1 - α .

The test based on

T_{\max}

falls under the class of multiple contrast test procedures (MCTP), e.g., [28]. In particular,

T_{\max}

makes it possible to identify which of the response variables are associated with the predictor. We propose using the numerical algorithm of Genz and Bretz [29] to determine

t_{α}

based on the asymptotic joint distribution in Theorem 2.
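The composite test can be sketched as follows. Two points are assumptions made for this illustration and are not taken from the paper. First, the entries of the covariance estimate are read off the Kronecker-structured estimate of Ψ at positions (kp + k, ℓp + ℓ) in 0-based indexing, which is where the squared (k, ℓ) covariance sits under column-stacking, with Ω read as having entries proportional to the integral of σ²_{kℓ}(x) r(x). Second, the critical value is approximated by plain Monte Carlo instead of the Genz and Bretz algorithm. All function names are hypothetical.

```python
import numpy as np

def omega_from_psi(psi, p, n):
    """Assemble Omega-hat from the entries of psi that estimate sigma_kl^2."""
    idx = np.arange(p) * (p + 1)           # positions k * p + k
    c = 2 * (2 * n - 1) / (3 * (n - 1))
    return c * psi[np.ix_(idx, idx)]

def t_max(T, omega):
    """Statistic (9): largest standardized marginal statistic."""
    return np.max(np.abs(T) / np.sqrt(np.diag(omega)))

def equicoordinate_critical_value(omega, alpha=0.05, draws=200_000, seed=0):
    """Monte Carlo approximation of t_alpha with P(max_k |Z_k| >= t) = alpha,
    where Z ~ N(0, R) and R is the correlation matrix implied by omega."""
    omega = np.asarray(omega, dtype=float)
    sd = np.sqrt(np.diag(omega))
    corr = omega / np.outer(sd, sd)
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(np.zeros(len(sd)), corr, size=draws)
    return np.quantile(np.abs(z).max(axis=1), 1 - alpha)

# Example with a known covariance function Sigma(x) = S for all x.
S = np.array([[2.0, 1.0], [1.0, 3.0]])
omega = omega_from_psi(np.kron(S, S), p=2, n=7)
crit = equicoordinate_critical_value(omega)
```

The hypothesis is rejected when `t_max(T, omega)` exceeds `crit`, and the coordinates whose standardized statistics exceed `crit` indicate the responsible response variables.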

3. Simulation Study

In this section, simulation results are presented to evaluate the finite sample performance of results in Theorems 1 and 2. Three test statistics are evaluated. These are the omnibus (global) test statistic in (4) with two choices of the matrix

A

and the composite or MCTP test in (9). We evaluate both type-I error rates and powers.

3.1. Simulation Design

We consider various practical scenarios that allow us to investigate the effects of sample size (a), the dimension of the response vector (p), the distribution of the error vector (

ϵ

) and the covariance structure (

Σ (x)

). For each setting, 5000 runs are conducted to compute the achieved type-I error rates and powers. Throughout, the level of significance is set to

α = 0.05

.

We investigate three values for the sample size,

a \in \{30, 100, 400\}

; small, medium, and large dimensions for the response vector,

p \in \{2, 3, 5\}

and three window sizes,

n \in \{7, 9, 11\}

. The data will be generated from the model

Y_{i} = m (x_{i}) + Σ^{1 / 2} (x_{i}) ϵ_{i},

for

i = 1, \dots, a,

where

x_{i} = i / a

. To investigate the effect of skewness or kurtosis compared to the normal distribution, we evaluate performance under three distributions:

MVN:: multivariate normal distribution,
MVT:: multivariate t distribution with 10 degrees of freedom and
MNM:: mixture of 90% multivariate normal distribution with mean $0$ and 10% multivariate normal distribution with mean $2 \cdot 1_{p}$ .

The distributions MVN and MVT are symmetric and MNM is skewed. The effect of diverse covariance structures is examined by generating both homogeneous and heterogeneous datasets. Four different homogeneous covariance structures

Σ_{1} = I_{p}, Σ_{2} = (1 - ρ) I_{p} + ρ J_{p}, Σ_{3} = {AR}_{ρ} (1) and Σ_{4} = {MA}_{ρ} (1),

are considered, where

I_{p}

and

J_{p}

are the

p \times p

identity matrix and matrix of all ones, respectively, and the notations

{AR}_{ρ} (1)

and

{MA}_{ρ} (1)

represent the autoregressive and moving average processes of order 1, respectively, with correlation parameter

ρ

. For all

Σ_{2}

–

Σ_{4}

, the correlation parameter

ρ

is set to

ρ = 0.3

. The heterogeneous covariance structures we consider are

Σ_{5} (x) = (1 - ρ (x)) I_{p} + ρ (x) J_{p}, Σ_{6} (x) = A R_{ρ (x)} (1) and Σ_{7} (x) = M A_{ρ (x)} (1),

where the correlation parameter is set as

ρ (x) = (x / 2) I (x < 1 / 2) + (1 - x) / 2 I (x > 1 / 2)

. We construct two global tests using

A_{1} = {MSE}^{- 1} and A_{2} = \frac{p}{tr (MSE)} I_{p},

for the matrix

A

which correspond to Lawley–Hotelling’s [24] and Dempster’s [25] trace statistics, respectively, and investigate the following three test statistics:

$T_{1}$ :: the omnibus test statistic with $A = A_{1}$ ,
$T_{2}$ :: the omnibus test statistic with $A = A_{2}$ and
$T_{3}$ :: the composite or MCTP test $T_{\max}$ .

Obviously,

T_{1}

is affine invariant. For

T_{3}

, the critical values are obtained by using equicoordinate quantiles from multivariate normal distribution [29].

Under the null hypothesis, there is no association between the predictor (x) and the response vector (

Y

). Without loss of generality, we will set

m (x) = 0

. To investigate the sensitivity of the three tests to departures from the null hypothesis, two different types of signals will be studied:

m_{1} (x) = {(exp (- x), x, x^{2}, x^{2}, sin (2 π x))}^{⊤} and m_{2} (x) = {(2 exp (x), 0, 0, 0, 0)}^{⊤},

where in each case the first p components are kept for

m (x)

.
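The data-generating design can be sketched as follows for normal errors. This is an illustrative sketch under stated assumptions, not the authors' simulation code: the function names are hypothetical, AR(1) is taken to have correlations ρ^{|k − ℓ|}, MA(1) is taken to have correlation ρ at lag one and zero beyond, and the square root of the covariance matrix is taken to be its Cholesky factor.

```python
import numpy as np

def cov_structure(kind, p, rho):
    """Correlation-type covariance matrices used in the design."""
    lag = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    if kind == "CS":                  # (1 - rho) I_p + rho J_p
        return (1 - rho) * np.eye(p) + rho * np.ones((p, p))
    if kind == "AR1":                 # assumed form: rho^{|k - l|}
        return rho ** lag
    if kind == "MA1":                 # assumed form: rho at lag 1, 0 beyond
        return np.where(lag == 0, 1.0, np.where(lag == 1, rho, 0.0))
    raise ValueError(kind)

def rho_x(x):
    """Correlation parameter of the heterogeneous structures."""
    return (x / 2) * (x < 0.5) + ((1 - x) / 2) * (x > 0.5)

def m1(x):
    return np.array([np.exp(-x), x, x**2, x**2, np.sin(2 * np.pi * x)])

def m2(x):
    return np.array([2 * np.exp(x), 0.0, 0.0, 0.0, 0.0])

def simulate(a, p, mean_fn, kind="CS", rho=0.3, seed=0):
    """Draw Y_i = m(x_i) + Sigma^{1/2}(x_i) eps_i with x_i = i / a (MVN errors)."""
    rng = np.random.default_rng(seed)
    x = np.arange(1, a + 1) / a
    Y = np.empty((a, p))
    for i, xi in enumerate(x):
        r = rho(xi) if callable(rho) else rho
        root = np.linalg.cholesky(cov_structure(kind, p, r))
        Y[i] = mean_fn(xi)[:p] + root @ rng.standard_normal(p)
    return x, Y

x, Y = simulate(a=30, p=3, mean_fn=m1, kind="AR1", rho=rho_x)
```

Passing a constant `rho` gives the homogeneous structures, and passing the function `rho_x` gives the heterogeneous ones; data under the null hypothesis can be generated with `mean_fn=lambda x: np.zeros(5)`.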

Exhaustively considering all possible combinations of the parameter settings discussed above will be cumbersome and, perhaps, unnecessary. For brevity, the strategy for our investigation is as follows. For each factor under investigation, we set other factors at fixed values which we refer to as the baseline. Table 1 contains a summary of the combination of parameter settings investigated.

The purpose of Setting 1 is to study the effect of the distribution of the error vector while controlling the covariance structure at the baseline (

Σ_{2}

). Settings 2 and 3 are designed to evaluate the effect of homogeneous and heterogeneous, respectively, covariance matrices while the distribution of the error vector is fixed at the baseline (MVT).

3.2. Simulation Results

The size simulation results are summarized in Table 2, Table 3 and Table 4. Generally, all the tests tend to perform reasonably well as a gets large, but

T_{1}

shows superior performance (compared to

T_{2}

and

T_{3}

) in all the cases. The three tests tend to have better performance under smaller window size when sample size is small and the dimension of the response vector has the usual negative effect. The tests

T_{2}

and

T_{3}

exhibit conservative and liberal behavior, respectively, especially for small sample size a. From Table 2, we notice a slight decline in performance under the multivariate t distribution. Comparing Table 3 and Table 4, the performances of the three tests do not vary much across the covariance structures considered, in the homogeneous as well as the heterogeneous situations.

In Table 5, the results of power study are presented. The three tests perform reasonably well in terms of power for a as small as 100. The differences in power among the three tests reflect the differences in the sizes. Therefore, the results do not necessarily inform inherent power differences of the tests. However, we see that for diffuse alternative

m_{1}

, the power tends to increase with the dimension p as each additional variable contains information useful for detecting departure from the null. Note that

m_{2}

represents a stronger departure from the null hypothesis compared to

m_{1}

. As one would expect

T_{3}

has an edge over

T_{1}

and

T_{2}

for the sparse alternative

m_{2}

.

In summary, the form of the error distribution has little, if any, effect on the size of the tests, and the covariance structure and heterogeneity do not appear to have much effect either. Overall,

T_{1}

appears to be the preferred test among the three in terms of size. The power study shows a reasonable performance for this test as well. However,

T_{3}

has the advantage that it can pinpoint which of the response variables are responsible for the rejection of

H_{0}

. In other words, it helps to identify which of the response variables are associated with the predictor, information not available from

T_{1}

or

T_{2}

.

4. Real Data Analysis

Cellular processes are often associated with changes in sets of genes that share common biological functions or attributes. A meaningful change in a gene set is more biologically reliable and interpretable than a change in a single gene [30]. Sustained extracellular signal-regulated kinase 1/2 (ERK1/2) activation may provide a mechanistic understanding of self-sustained biological processes in chronic illnesses such as asthma. In an investigation to develop a cellular model of sustained ERK1/2 activation, Liu et al. [31] noted that the gene set LIU_IL13_MEMORY_MODEL_DN (containing the genes BCL2L11, CBL, DUSP4, IL13RA1 and PFKFB2) was down-regulated in BEAS-2B cells (bronchial epithelium) when stimulated with IL13. The gene IL13 is the central mediator of allergic asthma [32]. An increase in Sprouty 2 (SPRTY2) gene expression is linked to the mechanism that leads to sustained activation of ERK1/2 [31]. Therefore, association of the gene set with SPRTY2 may shed light on the mechanistic understanding of self-sustained biological processes in asthma.

We use data from Genes-environments & Admixture in Latino Americans (GALA) II study (GEO152004 or phs001274.v2.p1) [33]. The dataset contains gene expressions for asthmatic (

n = 441

) and control (

n = 254

) subjects. The dependent variable

Y

consists of the expressions for the five genes (

p = 5

) in the gene set and the independent variable x is the expression for the Sprouty 2 gene.

A scatterplot matrix of the expressions of the gene set against the SPRTY2 gene for asthmatic and control subjects is shown in Figure 1. To help visualize the pattern of association, Loess smoothing was applied to each of the scatterplots. The observed trends, if any, are nonlinear and the different members of the gene set show different patterns of association with SPRTY2.

For the asthmatic subjects, the results of tests of association based on the test statistics

T_{1}

,

T_{2}

and

T_{\max}

are given in Table 6. At

α = 0.05

, the tests

T_{1}

and

T_{\max}

clearly indicate presence of association, while

T_{2}

does not pick up association. This result is consistent with the observations from the scatterplots and the conservative behavior of

T_{2}

in the simulation studies. However, the results for the three test statistics applied to the control subjects (not reported here to save space) indicated significant association.

The observed values of the test statistics

T^{(k)}

for the marginal hypotheses

H_{0, k}

, for gene

k \in \{BCL2L11, CBL, DUSP4, IL13RA1, PFKFB2\}

, and the equicoordinate quantile [29] cut-off values for the rejection regions corresponding to

n = 7, 9, 11

are given in Table 7. The results indicate that significant association is found at level

α = 0.05

only for gene BCL2L11. It is interesting to note from the critical values that the estimation of the asymptotic covariance of the vector of the univariate test statistics is stable over the values of window size (n).

5. Discussion

We investigated methods for a multivariate test of association without making any parametric assumptions on the distributional form of the errors or the form of the association. A moving window one-way layout, a smoothing-type approach, is used to develop a new nonparametric solution, especially applicable to multivariate heterogeneous datasets. Two types of test statistics, global and composite, are investigated. The latter combines multiple univariate tests to get a multivariate test. The asymptotic distributions of the two types of test statistics are derived. Simulations show that the proposed tests perform reasonably well under homogeneous as well as heterogeneous covariance structures, especially for moderate to large sample sizes. The effect of window size appears to be less important for large sample sizes. However, smaller window sizes should be used with smaller sample sizes. The tests are also mostly robust to the distribution of the data. In terms of power, the composite test performs better when the association is sparse, e.g., when a single dependent variable is associated with the predictor. In the unidimensional

(p = 1)

case, both the global and composite tests reduce to the univariate test studied elsewhere, e.g., [10]. The extension of the proposed methods for testing association between a response vector and a covariate in a multi-factorial MANOVA setting is relatively straightforward. We defer the details and the applications to a separate manuscript.

There are a few limitations to our tests. The simulation shows that when the sample size (a) is small, the tests do not have satisfactory performance. Resampling methods may be useful to overcome this problem. Generally, an increase in dimension tends to impose a larger requirement on the sample size. When not all the variables are active (the association is sparse) or the available sample size is not correspondingly large, one may apply variable selection methods to reduce the dimension before applying our tests. Optimal choice of window size is another problem that is worthy of investigation. For a data-adaptive selection, a cross-validation-based approach appears to be a promising avenue. Our tests assume that the design points (values of the predictor variable in the sample data) are quantiles of a continuous distribution function. This assumption precludes the possibility of tied values of the predictor variable in the data. Ties, when present, may cause windows to have unequal sizes. Investigation of the properties of the proposed tests under unequal window sizes is an open problem. Another important extension is the multivariate predictor situation. When there is more than one predictor, the moving-window-based approach should work in principle. However, scarcity of data in the predictor space could limit its practicality. We plan to investigate this extension in future research.

Author Contributions

Conceptualization, S.W.H. and Y.X.; methodology, S.W.H. and Y.X.; validation, S.W.H. and Y.X.; formal analysis, S.W.H. and Y.X.; investigation, S.W.H. and Y.X.; writing—original draft preparation, S.W.H. and Y.X.; writing—review and editing, S.W.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in this paper are publicly available at https://www.ncbi.nlm.nih.gov/geo/ (accessed on 30 March 2021) with accession number GEO152004.

Acknowledgments

The authors are grateful to the three anonymous reviewers for critically reading the original version of the manuscript and making valuable suggestions which have led to significant improvements. The authors are also thankful to the editor for the orderly handling of the manuscript. Y. Xu is sincerely thankful to the Dr. Bing Zhang Department of Statistics of the University of Kentucky where she obtained her Ph.D. degree.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

MDPI	Multidisciplinary Digital Publishing Institute
DOAJ	Directory of open access journals
MCTP	Multiple contrast test procedure
ANOVA	Analysis of variance
MANOVA	Multivariate analysis of variance
MST	Mean squares of treatment
MSE	Mean square of error
MVN	Multivariate normal distribution
MVT	Multivariate t distribution with 10 degrees of freedom
MNM	Mixture of 90% multivariate normal distribution centered at $0$ and 10% multivariate normal distribution with mean $2 \cdot 1_{p}$

Appendix A. Proofs

Proof of Theorem 1.

The proof contains three steps. First, we decompose the quadratic form of the test statistic $T (A)$ in (4) as a sum of two quadratic forms. Then we will prove that one of the quadratic forms is asymptotically negligible, i.e., it converges to zero in probability. Next, we derive the asymptotic distribution of the other quadratic form. Finally, the desired result follows by Slutsky’s Theorem.

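Since both MST and MSE are invariant to translations, without loss of generality we set $C = 0$. The matrix difference $MST - MSE$ can be decomposed as

$$MST - MSE = Y D_{1} Y^{⊤} + Y D_{2} Y^{⊤},$$

where $D_{1} = {(a n (a - 1))}^{- 1} (I_{a} - J_{a}) \otimes J_{n}$ and $D_{2} = {(a (n - 1))}^{- 1} I_{a} \otimes (J_{n} - I_{n})$. Here $J_{m}$ is the $m \times m$ matrix of ones, so that $D_{1}$ carries the off-diagonal blocks and $D_{2}$ the diagonal blocks of the combined quadratic form.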
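The decomposition above is an exact algebraic identity, so it can be checked numerically. The following is a minimal sketch, assuming the usual balanced one-way definitions of MST and MSE (between-group and within-group mean squares with a groups of size n) and taking $J_{m}$ to be the matrix of ones. It uses the scaling $1 / (a n (a - 1))$ for the off-diagonal blocks, which is the factor that appears in the expectation (A1). The helper names (`outer`, `add_scaled`, `mst_minus_mse`, `quadratic_forms`) and the toy sizes are hypothetical and not part of the paper.

```python
import random

def outer(u, v):
    """Outer product u v^T of two p-vectors, as a nested list."""
    return [[ui * vj for vj in v] for ui in u]

def add_scaled(M, N, c):
    """In-place update M += c * N for equally sized nested lists."""
    for r in range(len(M)):
        for s in range(len(M[r])):
            M[r][s] += c * N[r][s]

def mst_minus_mse(groups):
    """MST - MSE computed directly from group means (balanced layout).

    groups[i][j] is the p-vector of observation j in group i.
    """
    a, n, p = len(groups), len(groups[0]), len(groups[0][0])
    means = [[sum(y[k] for y in g) / n for k in range(p)] for g in groups]
    grand = [sum(m[k] for m in means) / a for k in range(p)]
    out = [[0.0] * p for _ in range(p)]
    for g, m in zip(groups, means):
        d = [m[k] - grand[k] for k in range(p)]
        add_scaled(out, outer(d, d), n / (a - 1))               # MST term
        for y in g:
            e = [y[k] - m[k] for k in range(p)]
            add_scaled(out, outer(e, e), -1.0 / (a * (n - 1)))  # minus MSE term
    return out

def quadratic_forms(groups):
    """Return (Y D1 Y^T, Y D2 Y^T) from the block patterns of D1 and D2."""
    a, n, p = len(groups), len(groups[0]), len(groups[0][0])
    q1 = [[0.0] * p for _ in range(p)]
    q2 = [[0.0] * p for _ in range(p)]
    cells = [(i, j) for i in range(a) for j in range(n)]
    for i1, j1 in cells:
        for i2, j2 in cells:
            yy = outer(groups[i1][j1], groups[i2][j2])
            if i1 != i2:      # off-diagonal blocks: only D1 is non-zero
                add_scaled(q1, yy, -1.0 / (a * n * (a - 1)))
            elif j1 != j2:    # within-block, off the main diagonal: only D2
                add_scaled(q2, yy, 1.0 / (a * (n - 1)))
    return q1, q2

random.seed(0)
a, n, p = 6, 3, 2  # toy sizes, chosen only for illustration
groups = [[[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
          for _ in range(a)]
lhs = mst_minus_mse(groups)
q1, q2 = quadratic_forms(groups)
max_abs_diff = max(abs(lhs[r][s] - q1[r][s] - q2[r][s])
                   for r in range(p) for s in range(p))
print(max_abs_diff)  # should be at the level of floating-point rounding
```

The double loop mirrors the block structure directly instead of building the Kronecker products, which keeps the sketch dependency-free; for realistic sizes a matrix library would be the more natural choice.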

By independence and the fact that $E (Y_{j}) = 0$ for all $j$ under the null hypothesis,

$$\begin{aligned} E \left(\sqrt{\frac{a}{n}} tr (Y D_{1} Y^{⊤} A)\right) &= - \sqrt{\frac{a}{n}} \frac{1}{a n (a - 1)} \sum_{j = 1}^{a} \sum_{i_{1} \neq i_{2}} E \left(tr (Y_{j} Y_{j}^{⊤} A)\right) I (j \in W_{i_{1}} \cap W_{i_{2}}) \\ &= O \left(\sqrt{\frac{a}{n}} \frac{1}{a n (a - 1)} a n (n - 1)\right) = o (1), \end{aligned} \tag{A1}$$

as $a \to \infty$ and for fixed $n$, where the second equality is due to the fact that $\sum_{j = 1}^{a} \sum_{i_{1} \neq i_{2}} I (j \in W_{i_{1}} \cap W_{i_{2}}) = O (a n (n - 1))$ and $\sup_{x \in [0, 1]} \| Σ (x) \|_{F} < \infty$ by continuity. Similarly,

$$\begin{aligned} E {\left(\sqrt{\frac{a}{n}} tr (Y D_{1} Y^{⊤} A)\right)}^{2} &= \frac{a}{n} \frac{1}{a^{2} n^{2} {(a - 1)}^{2}} \sum_{j_{1}, j_{2}} \sum_{i_{1} \neq i_{2}} \sum_{i_{3} \neq i_{4}} E \left(tr (Y_{j_{1}} Y_{j_{1}}^{⊤} A) \, tr (Y_{j_{2}} Y_{j_{2}}^{⊤} A)\right) I (j_{1} \in W_{i_{1}} \cap W_{i_{2}}, j_{2} \in W_{i_{3}} \cap W_{i_{4}}) \\ &\quad + \frac{a}{n} \frac{1}{a^{2} n^{2} {(a - 1)}^{2}} \sum_{j_{1}, j_{2}} \sum_{i_{1} \neq i_{2}} \sum_{i_{3} \neq i_{4}} E \left(tr (Y_{j_{1}} Y_{j_{2}}^{⊤} A) \, tr (Y_{j_{1}} Y_{j_{2}}^{⊤} A)\right) I (j_{1} \in W_{i_{1}} \cap W_{i_{3}}, j_{2} \in W_{i_{2}} \cap W_{i_{4}}) \\ &= O \left(\frac{a}{n} \frac{1}{a^{2} n^{2} {(a - 1)}^{2}} a^{2} n^{2} {(n - 1)}^{2}\right) = o (1), \end{aligned} \tag{A2}$$

where the second equality holds under assumptions A2 and A3, and the fact that $| (W_{i_{1}} \cap W_{i_{2}}) \cup (W_{i_{3}} \cap W_{i_{4}}) | = O (a^{2} n^{2} {(n - 1)}^{2})$. Therefore, we have

$${(a / n)}^{1 / 2} tr (Y D_{1} Y^{⊤} A) = o_{P} (1), \tag{A3}$$

as $a \to \infty$.

We now turn to the asymptotic distribution of ${(a / n)}^{1 / 2} tr (Y D_{2} Y^{⊤} A)$. First note that

$$E \left[\sqrt{\frac{a}{n}} tr (Y D_{2} Y^{⊤} A)\right] = \sqrt{\frac{a}{n}} \frac{1}{a (n - 1)} \sum_{i = 1}^{a} \sum_{j_{1} \neq j_{2}} E \left(tr (Y_{j_{1}} Y_{j_{2}}^{⊤} A)\right) I (j_{1}, j_{2} \in W_{i}) = 0 .$$

In the calculation above, note that the off-diagonal blocks and the main diagonal elements of $D_{2}$ are zeros and $E (Y_{j}) = 0$. Furthermore,

$$\begin{aligned} E {\left[\sqrt{\frac{a}{n}} tr (Y D_{2} Y^{⊤} A)\right]}^{2} &= \frac{1}{a n {(n - 1)}^{2}} Vec {(A^{⊤})}^{⊤} \sum_{i_{1} = 1}^{a} \sum_{i_{2} = 1}^{a} \sum_{j_{1} \neq j_{2}}^{a} E \left[Y_{j_{1}} Y_{j_{1}}^{⊤} \otimes Y_{j_{2}} Y_{j_{2}}^{⊤}\right] I (j_{1}, j_{2} \in W_{i_{1}} \cap W_{i_{2}}) \, Vec (A) \\ &= \frac{2 (2 n - 1)}{3 (n - 1)} \int_{0}^{1} tr {(Σ (x) A)}^{2} r (x) d x + o (1) \end{aligned}$$

as $a \to \infty$. The second equality follows by assumptions A1 and A2, the inequalities in (3) and the fact that (see also [10])

$$\sum_{i_{1}, i_{2} = 1}^{a} \sum_{j_{1} = 1, j_{1} \neq j_{2}}^{a} I (j_{1}, j_{2} \in W_{i_{1}} \cap W_{i_{2}}) = 2 (1^{2} + 2^{2} + 3^{2} + \dots + {(n - 1)}^{2}) = \frac{n (n - 1) (2 n - 1)}{3} .$$
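The counting identity can be checked by brute force. The sketch below rests on two assumptions that are not stated in this appendix: windows are taken to be symmetric index sets of odd size n centered at each design point, and the double sum is read as the count contributed by one interior window index $i_{1}$, summed over all $i_{2}$ (summed over every $i_{1}$ as well, the total would grow with a). The function names are hypothetical.

```python
def overlap_pair_count(n):
    """Ordered pairs (j1, j2), j1 != j2, in W_{i1} ∩ W_{i2}, summed over i2.

    Assumes W_i = {i - h, ..., i + h} with h = (n - 1) / 2 and n odd, and an
    interior index i1 so that no window is truncated at the boundary.
    """
    h = (n - 1) // 2
    i1 = 10 * n  # arbitrary interior index, far from any boundary
    w1 = set(range(i1 - h, i1 + h + 1))
    total = 0
    for i2 in range(i1 - n, i1 + n + 1):  # covers every window meeting W_{i1}
        m = len(w1 & set(range(i2 - h, i2 + h + 1)))
        total += m * (m - 1)              # ordered pairs of distinct indices
    return total

def sum_of_squares_form(n):
    """2 * (1^2 + 2^2 + ... + (n - 1)^2)."""
    return 2 * sum(k * k for k in range(1, n))

def closed_form(n):
    """n (n - 1) (2 n - 1) / 3, which is always an integer."""
    return n * (n - 1) * (2 * n - 1) // 3

# Window sizes used in the simulation tables.
checks = {n: (overlap_pair_count(n), sum_of_squares_form(n), closed_form(n))
          for n in (7, 9, 11)}
for n, values in checks.items():
    print(n, values)
```

Under these assumptions the three counts coincide for each window size, which is what the factor $(2 n - 1) / (3 (n - 1))$ in the limiting variance relies on.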

Finally, to obtain the asymptotic normality of $\sqrt{\frac{a}{n}} tr (Y D_{2} Y^{⊤} A)$, let us rewrite

$$tr (Y D_{2} Y^{⊤} A) = \frac{1}{a} \sum_{i = 1}^{a} F_{i},$$

where

$$F_{i} = \frac{1}{n - 1} \sum_{j_{1} \neq j_{2}}^{a} tr (Y_{j_{1}} Y_{j_{2}}^{⊤} A) I (j_{1}, j_{2} \in W_{i}) .$$

For $ℓ \leq m$ positive integers, let $\mathcal{F}_{ℓ}^{m} = σ \{F_{i} : i = ℓ, \dots, m\}$ be the $σ$-algebra generated by $F_{ℓ}, \dots, F_{m}$, and

$$α (k) = \sup_{ℓ} \sup_{A \in \mathcal{F}_{- \infty}^{ℓ}, B \in \mathcal{F}_{ℓ + k}^{\infty}} | P (A \cap B) - P (A) P (B) |$$

be the dependence coefficient. Clearly, $α (k) = 0$ for $k \geq n$. Therefore, the sequence $F_{1}, F_{2}, \dots$ is strong mixing and

$$\sum_{k = 0}^{\infty} {(k + 1)}^{2} α_{ℓ}^{\frac{δ}{4 + δ}} (k) \leq \frac{n (n + 1) (2 n + 1)}{6} .$$

The desired convergence in distribution occurs by the Central Limit Theorem of Ekström [34]. □
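The bound on the mixing sum at the end of the proof of Theorem 1 can be unpacked in three short steps. The derivation below is a sketch that assumes $δ > 0$ and uses only two facts: $α (k)$ vanishes for $k \geq n$, and a dependence coefficient of this form lies between 0 and 1.

```latex
\begin{align*}
\sum_{k=0}^{\infty} (k+1)^{2}\, \alpha^{\frac{\delta}{4+\delta}}(k)
  &= \sum_{k=0}^{n-1} (k+1)^{2}\, \alpha^{\frac{\delta}{4+\delta}}(k)
     && \text{since } \alpha(k) = 0 \text{ for } k \ge n \\
  &\le \sum_{k=0}^{n-1} (k+1)^{2}
     && \text{since } 0 \le \alpha(k) \le 1 \text{ and } \delta > 0 \\
  &= \sum_{m=1}^{n} m^{2} = \frac{n(n+1)(2n+1)}{6} < \infty .
\end{align*}
```

Because n is fixed, the right-hand side is a finite constant, which is the summability condition a central limit theorem for strong mixing sequences requires.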

Proof of Theorem 2.

Let $Y^{(k)} = {(Y_{i}^{(k)}, i \in W_{1}, \dots, Y_{i}^{(k)}, i \in W_{a})}^{⊤}$ be the $k$th row of $Y$. Applying (A3) for each $k$ by first setting $A = I_{p}$ and $p = 1$, we have $T - \tilde{T} = o_{p} (1)$ as $a \to \infty$, where $\tilde{T} = {({\tilde{T}}^{(1)}, \dots, {\tilde{T}}^{(p)})}^{⊤}$ and ${\tilde{T}}^{(k)} = {(a / n)}^{1 / 2} (Y^{(k)} D_{2} {Y^{(k)}}^{⊤})$. Finally, by Theorem 1 and the Cramér-Wold Theorem ([35], pp. 17–18), $\tilde{T}$ has a limiting matrix variate normal distribution with mean $0_{p \times p}$ and covariance $2 (2 n - 1) {(3 (n - 1))}^{- 1} \int_{0}^{1} (Σ (x) \otimes Σ (x)) r (x) d x$. Thus, ${(a / n)}^{1 / 2} T$ has the desired multivariate normal limiting distribution. □

References

1. Pearson, K. VII. Note on regression and inheritance in the case of two parents. Proc. R. Soc. Lond. 1895, 58, 240–242.
2. Spearman, C. The Proof and Measurement of Association between Two Things. Am. J. Psychol. 1987, 100, 441–471.
3. Kendall, M.G. A new measure of rank correlation. Biometrika 1938, 30, 81–93.
4. Nowak, C.P.; Konietschke, F. Simultaneous inference for Kendall’s tau. J. Multivar. Anal. 2021, 185, 104767.
5. Brownie, C.; Boos, D.D. Type I error robustness of ANOVA and ANOVA on ranks when the number of treatments is large. Biometrics 1994, 50, 542–549.
6. Boos, D.D.; Brownie, C. ANOVA and rank tests when the number of treatments is large. Stat. Probab. Lett. 1995, 23, 183–191.
7. Akritas, M.; Arnold, S. Asymptotics for analysis of variance when the number of levels is large. J. Am. Stat. Assoc. 2000, 95, 212–226.
8. Harrar, S.W.; Gupta, A.K. Asymptotic expansion for the null distribution of the F-statistic in one-way ANOVA under non-normality. Ann. Inst. Stat. Math. 2007, 59, 531–556.
9. Akritas, M.G.; Papadatos, N. Heteroscedastic one-way ANOVA and lack-of-fit tests. J. Am. Stat. Assoc. 2004, 99, 368–382.
10. Wang, L.; Akritas, M.G.; Van Keilegom, I. An ANOVA-type nonparametric diagnostic test for heteroscedastic regression models. J. Nonparametric Stat. 2008, 20, 365–382.
11. Zambom, A.Z.; Akritas, M.G. Nonparametric lack-of-fit testing and consistent variable selection. Stat. Sin. 2014, 20, 1837–1858.
12. Wang, L.; Zhou, X.H. A fully nonparametric diagnostic test for homogeneity of variances. Can. J. Stat. 2005, 33, 545–558.
13. Bonnini, S.; Cavallo, G. A study on the satisfaction with distance learning of university students with disabilities: Bivariate regression analysis using a multiple permutation test. Stat. Appl. Ital. J. Appl. Stat. 2021, 33, 143–162.
14. Bonnini, S.; Corain, L.; Marozzi, M.; Salmaso, L. Nonparametric Hypothesis Testing: Rank and Permutation Methods with Applications in R; John Wiley & Sons: Hoboken, NJ, USA, 2014.
15. Salmaso, L.; Pesarin, F. Permutation Tests for Complex Data: Theory, Applications and Software; John Wiley & Sons: Hoboken, NJ, USA, 2010.
16. Gupta, A.K.; Harrar, S.W.; Fujikoshi, Y. MANOVA for large hypothesis degrees of freedom under non-normality. Test 2008, 17, 120–137.
17. Bathke, A.C.; Harrar, S.W. Nonparametric methods in multivariate factorial designs for large number of factor levels. J. Stat. Plan. Inference 2008, 138, 588–610.
18. Gupta, A.K.; Harrar, S.W.; Fujikoshi, Y. Asymptotics for testing hypothesis in some multivariate variance components model under non-normality. J. Multivar. Anal. 2006, 97, 148–178.
19. Harrar, S.W.; Bathke, A.C. Nonparametric methods for unbalanced multivariate data and many factor levels. J. Multivar. Anal. 2008, 99, 1635–1664.
20. Zambom, A.Z.; Kim, S. A nonparametric hypothesis test for heteroscedasticity in multiple regression. Can. J. Stat. 2017, 45, 425–441.
21. Sacks, J.; Ylvisaker, D. Designs for regression problems with correlated errors. Ann. Math. Stat. 1966, 37, 66–89.
22. Harrar, S.W.; Bathke, A.C. A modified two-factor multivariate analysis of variance: Asymptotics and small sample approximations. Ann. Inst. Stat. Math. 2012, 64, 135–165.
23. Bathke, A.C.; Harrar, S.W. Rank-based inference for multivariate data in factorial designs. In Robust Rank-Based and Nonparametric Methods; Springer: Berlin/Heidelberg, Germany, 2016; pp. 121–139.
24. Anderson, T. An Introduction to Multivariate Analysis; John Wiley & Sons: New York, NY, USA, 2003.
25. Dempster, A.P. A high dimensional two sample significance test. Ann. Math. Stat. 1958, 29, 995–1010.
26. Dette, H.; Munk, A. Testing heteroscedasticity in nonparametric regression. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 1998, 60, 693–708.
27. Rice, J. Bandwidth choice for nonparametric regression. Ann. Stat. 1984, 12, 1215–1230.
28. Bretz, F.; Hothorn, T.; Westfall, P. Multiple Comparisons Using R; CRC Press: Boca Raton, FL, USA, 2016.
29. Genz, A.; Bretz, F. Computation of Multivariate Normal and T Probabilities; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2009; Volume 195.
30. Maleki, F.; Ovens, K.; Hogan, D.J.; Kusalik, A.J. Gene set analysis: Challenges, opportunities, and future research. Front. Genet. 2020, 11, 654.
31. Liu, W.; Tundwal, K.; Liang, Q.; Goplen, N.; Rozario, S.; Quayum, N.; Gorska, M.; Wenzel, S.; Balzar, S.; Alam, R. Establishment of extracellular signal-regulated kinase 1/2 bistability and sustained activation through Sprouty 2 and its relevance for epithelial function. Mol. Cell. Biol. 2010, 30, 1783–1799.
32. Wynn, T.A. IL-13 effector functions. Annu. Rev. Immunol. 2003, 21, 425–456.
33. Nishimura, K.K.; Galanter, J.M.; Roth, L.A.; Oh, S.S.; Thakur, N.; Nguyen, E.A.; Thyne, S.; Farber, H.J.; Serebrisky, D.; Kumar, R.; et al. Early-life air pollution and asthma risk in minority children. The GALA II and SAGE II studies. Am. J. Respir. Crit. Care Med. 2013, 188, 309–318.
34. Ekström, M. A general central limit theorem for strong mixing sequences. Stat. Probab. Lett. 2014, 94, 236–238.
35. Serfling, R.J. Approximation Theorems of Mathematical Statistics; John Wiley & Sons: Hoboken, NJ, USA, 2009; Volume 162.

Figure 1. Scatterplots of the expressions of the genes in the IL13 gene set versus the SPRY2 gene. The top panels are for the asthmatic group and the bottom panels are for the control group. The expression of the SPRY2 gene is plotted on the x-axis, and the five panels in each group correspond to the dependent variables BCL2L11, CBL, DUSP4, IL13RA1 and PFKFB2, in that order. A loess curve is added to each plot.

Table 1. Summary of simulation settings investigated. The column $Σ$ is for covariance structure. The error distribution MVN is for multivariate normal distribution, MVT is for multivariate t with 10 degrees of freedom and MNM is the multivariate normal distribution contaminated with 10% outliers from a multivariate normal distribution with mean $1_{p}$. The baseline setting is the setting used for one parameter when the effect of the other one is investigated.

Simulation	$Σ$	Error Distribution
Baseline	$Σ_{2}$	MVT
Setting 1	$Σ_{2}$	MVN, MNM
Setting 2	$Σ_{1}, Σ_{3}, Σ_{4}$	MVT
Setting 3	$Σ_{5}, Σ_{6}, Σ_{7}$	MVT

Table 2. Percentages of rejection under $H_{0}$ (×100%) for Baseline and Setting 1. Covariance $Σ_{2}$ is used. In the Error column, MVN is the multivariate normal distribution, MVT is multivariate t with 10 degrees of freedom and MNM is a mixture of two multivariate normals. a is sample size, n is window size and p is dimension. The test statistics $T_{1}$ and $T_{2}$ are global test statistics with $A_{1}$ and $A_{2}$, respectively, and $T_{3}$ is the composite test statistic $T_{\max}$. Level of significance is $α$ = 5%.

			$a = 30$			$a = 100$			$a = 400$
Error	$n$		p = 2	p = 3	p = 5	p = 2	p = 3	p = 5	p = 2	p = 3	p = 5
MVN	7	$T_{1}$	5.5	5.2	6.5	6.0	5.5	6.0	5.2	4.6	4.4
		$T_{2}$	4.3	3.1	2.1	5.1	4.2	3.9	4.8	4.5	4.0
		$T_{3}$	8.5	10.1	12.7	7.9	9.1	11.2	7.2	7.6	8.5
	9	$T_{1}$	3.4	3.1	3.2	5.1	4.8	5.0	4.8	4.6	4.1
		$T_{2}$	2.8	2.0	1.5	4.3	3.7	3.1	4.7	4.5	4.0
		$T_{3}$	6.2	7.3	9.5	7.5	8.7	11.0	7.1	7.7	8.2
	11	$T_{1}$	1.8	1.3	1.2	4.5	3.9	3.9	4.7	4.5	3.8
		$T_{2}$	1.7	1.0	0.7	4.1	3.3	2.6	4.7	4.1	3.6
		$T_{3}$	4.1	4.8	6.0	6.8	8.3	10.2	7.3	7.6	8.7
MVT	7	$T_{1}$	4.4	4.5	6.3	5.3	5.2	5.9	5.3	4.6	5.7
		$T_{2}$	3.8	2.9	1.9	4.5	4.4	3.8	5.1	4.6	4.7
		$T_{3}$	7.7	9.1	12.7	7.8	9.9	11.5	6.9	7.8	8.5
	9	$T_{1}$	3.0	2.4	2.7	4.7	4.6	4.7	5.0	4.1	5.4
		$T_{2}$	2.4	1.7	1.2	4.2	4.1	3.2	4.8	4.2	4.4
		$T_{3}$	5.8	6.5	9.0	7.5	9.7	10.9	6.7	7.4	8.7
	11	$T_{1}$	1.6	1.2	0.9	4.1	4.0	3.9	4.8	3.7	5.0
		$T_{2}$	1.5	0.9	0.5	3.7	3.6	3.0	4.7	4.1	4.3
		$T_{3}$	3.7	4.1	5.5	6.8	8.9	10.3	6.7	7.4	8.9
MNM	7	$T_{1}$	5.1	5.2	6.0	5.9	5.6	6.3	5.3	5.0	5.6
		$T_{2}$	4.1	3.6	3.0	5.4	5.0	4.9	5.5	5.1	4.9
		$T_{3}$	7.7	9.8	11.4	8.2	8.4	10.3	6.8	7.3	8.1
	9	$T_{1}$	3.3	3.2	2.7	5.2	4.9	5.0	5.1	4.9	5.0
		$T_{2}$	2.6	2.5	1.9	4.8	4.5	4.5	5.4	4.9	4.7
		$T_{3}$	5.6	7.1	8.0	7.7	7.9	9.8	6.5	7.4	8.0
	11	$T_{1}$	1.7	1.5	0.9	4.5	4.5	4.2	4.7	4.7	4.5
		$T_{2}$	1.6	1.4	0.9	4.5	4.1	4.3	5.3	4.8	4.6
		$T_{3}$	3.7	4.7	5.0	7.2	7.5	9.8	6.6	7.2	7.7

Table 3. Percentages of rejection under $H_{0}$ (%) for Setting 2, i.e., covariances are homogeneous with different structures. Multivariate t with 10 degrees of freedom (MVT) is used for the errors. a is sample size, n is window size and p is dimension. The test statistics $T_{1}$ and $T_{2}$ are global test statistics with $A_{1}$ and $A_{2}$, respectively, and $T_{3}$ is the composite test statistic $T_{\max}$. Level of significance is $α$ = 5%.

			$a = 30$			$a = 100$			$a = 400$
$Σ$	$n$	$T$	p = 2	p = 3	p = 5	p = 2	p = 3	p = 5	p = 2	p = 3	p = 5
$Σ_{1}$	7	$T_{1}$	4.4	4.6	6.4	5.3	5.1	5.9	5.3	4.6	5.7
		$T_{2}$	3.3	2.3	1.4	4.4	3.5	2.7	4.9	3.9	4.1
		$T_{3}$	7.8	9.7	12.9	7.8	9.9	12.9	6.9	7.6	8.7
	9	$T_{1}$	3.0	2.7	3.3	4.7	4.6	4.7	5.0	4.1	5.4
		$T_{2}$	2.2	1.4	0.8	4.0	3.2	2.1	4.5	3.7	3.9
		$T_{3}$	5.7	6.6	9.7	7.6	9.4	11.7	6.6	7.2	8.8
	11	$T_{1}$	1.6	1.2	1.3	4.1	4.0	3.9	4.8	3.7	5.0
		$T_{2}$	1.4	0.6	0.4	3.7	2.8	1.9	4.4	3.4	3.7
		$T_{3}$	3.5	4.3	6.2	6.8	8.5	10.5	6.8	7.1	9.2
$Σ_{3}$	7	$T_{1}$	4.4	4.8	7.1	5.3	5.2	5.9	5.3	4.6	5.7
		$T_{2}$	3.8	2.8	1.5	4.5	4.2	3.0	5.1	4.5	4.0
		$T_{3}$	7.7	9.4	14.0	7.8	9.8	11.9	6.9	8.2	8.6
	9	$T_{1}$	3.0	3.0	3.1	4.7	4.6	4.7	5.0	4.1	5.4
		$T_{2}$	2.4	1.6	0.9	4.2	3.8	2.4	4.8	4.1	3.8
		$T_{3}$	5.8	6.6	10.4	7.5	9.4	11.1	6.7	7.5	8.8
	11	$T_{1}$	1.6	1.4	1.1	4.1	4.0	3.9	4.8	3.7	5.0
		$T_{2}$	1.5	0.9	0.3	3.7	3.3	2.2	4.7	3.9	3.9
		$T_{3}$	3.7	4.5	6.6	6.8	8.6	10.6	6.7	7.5	9.1
$Σ_{4}$	7	$T_{1}$	4.4	4.8	7.0	5.3	5.2	5.9	5.3	4.6	5.7
		$T_{2}$	3.8	2.7	1.8	4.5	4.1	3.0	5.1	4.5	3.9
		$T_{3}$	7.7	10.2	14.2	7.8	9.7	12.1	6.9	8.1	8.6
	9	$T_{1}$	3.0	2.6	3.6	4.7	4.6	4.7	5.0	4.1	5.4
		$T_{2}$	2.4	1.5	1.0	4.2	3.8	2.3	4.8	4.2	3.8
		$T_{3}$	5.8	7.3	10.5	7.5	9.2	11.2	6.7	7.7	8.8
	11	$T_{1}$	1.6	1.2	1.2	4.1	4.0	3.9	4.8	3.7	5.0
		$T_{2}$	1.5	0.8	0.3	3.7	3.4	2.1	4.7	3.9	3.9
		$T_{3}$	3.7	4.5	6.5	6.8	8.4	10.8	6.7	7.5	9.2

Table 4. Percentages of rejection under $H_{0}$ (×100%) for Setting 3. The heterogeneous covariances $Σ_{5}$, $Σ_{6}$ and $Σ_{7}$ are used. The error distribution is set to multivariate t with 10 degrees of freedom (MVT). a is sample size, n is window size and p is dimension. The test statistics $T_{1}$ and $T_{2}$ are global test statistics with $A_{1}$ and $A_{2}$, respectively, and $T_{3}$ is the composite test statistic $T_{\max}$. Level of significance is $α$ = 5%.

			$a = 30$			$a = 100$			$a = 400$
$Σ$	$n$		p = 2	p = 3	p = 5	p = 2	p = 3	p = 5	p = 2	p = 3	p = 5
$Σ_{5}$	7	$T_{1}$	4.4	5.3	6.6	5.6	5.9	6.3	5.1	4.7	6.2
		$T_{2}$	3.2	2.8	1.5	4.7	3.9	3.2	4.9	4.3	4.2
		$T_{3}$	8.1	10.2	13.4	8.6	9.7	11.9	7.3	7.2	9.0
	9	$T_{1}$	2.8	2.8	2.9	5.1	5.4	4.7	5.2	4.4	5.2
		$T_{2}$	2.0	1.9	0.8	4.5	3.5	2.9	5.2	4.2	3.8
		$T_{3}$	5.6	7.7	9.8	8.2	9.3	11.0	7.1	6.8	9.1
	11	$T_{1}$	1.5	1.2	1.2	4.6	4.7	3.8	5.2	4.2	4.8
		$T_{2}$	1.2	0.8	0.3	4.3	3.2	2.5	5.0	4.0	3.6
		$T_{3}$	3.3	5.1	6.3	7.9	8.7	10.5	6.9	7.0	9.2
$Σ_{6}$	7	$T_{1}$	4.4	5.0	6.8	5.6	6.1	6.2	5.1	4.9	6.2
		$T_{2}$	3.2	2.6	1.7	4.7	3.9	2.9	4.9	4.2	4.2
		$T_{3}$	8.1	10.2	14.2	8.6	9.8	11.7	7.3	7.1	9.3
	9	$T_{1}$	2.8	2.9	3.4	5.1	5.4	4.6	5.2	4.5	5.3
		$T_{2}$	2.0	1.7	0.8	4.5	3.4	2.5	5.2	4.2	3.8
		$T_{3}$	5.6	7.4	10.6	8.2	9.3	11.0	7.1	6.9	9.4
	11	$T_{1}$	1.5	1.4	1.1	4.6	4.8	3.8	5.2	4.2	4.7
		$T_{2}$	1.2	0.8	0.3	4.3	3.1	2.3	5.0	4.0	3.6
		$T_{3}$	3.3	4.9	6.7	7.9	8.7	10.5	6.9	7.1	9.6
$Σ_{7}$	7	$T_{1}$	4.4	5.1	6.3	5.6	6.2	6.2	5.1	5.0	6.0
		$T_{2}$	3.2	2.8	1.4	4.7	3.9	2.9	4.9	4.2	4.3
		$T_{3}$	8.1	10.6	14.2	8.6	9.8	11.7	7.3	7.1	9.4
	9	$T_{1}$	2.8	3.1	3.3	5.1	5.4	4.6	5.2	4.5	5.3
		$T_{2}$	2.0	1.6	0.7	4.5	3.4	2.6	5.2	4.2	3.8
		$T_{3}$	5.6	7.6	10.4	8.2	9.3	11.0	7.1	6.9	9.6
	11	$T_{1}$	1.5	1.4	1.1	4.6	4.8	3.8	5.2	4.3	4.8
		$T_{2}$	1.2	0.9	0.2	4.3	3.1	2.2	5.0	3.9	3.5
		$T_{3}$	3.3	5.0	6.8	7.9	8.7	10.5	6.9	7.1	9.7

Table 5. Percentages of rejection under the alternatives $m_{1} (x)$ and $m_{2} (x)$; $m_{0} (x) = 0$ is the null case. The covariance is set to the homogeneous covariance $Σ_{2}$, and the error distribution is set to the multivariate t with 10 degrees of freedom (MVT). a is sample size, n is window size and p is dimension. The test statistics $T_{1}$ and $T_{2}$ are global test statistics with $A_{1}$ and $A_{2}$, respectively, and $T_{3}$ is the composite test statistic $T_{\max}$. Level of significance is $α$ = 5%.

			$a = 30$			$a = 100$
$m (x)$	$n$		p = 2	p = 3	p = 5	p = 2	p = 3	p = 5
$m_{0}$	7	$T_{1}$	4.7	5.2	6.7	5.7	5.2	6.1
		$T_{2}$	3.6	2.9	2.9	5.3	4.3	4.0
		$T_{3}$	7.7	10.4	13.7	7.7	9.4	12.8
	9	$T_{1}$	3.1	2.8	3.4	4.9	4.5	5.0
		$T_{2}$	2.3	1.8	1.7	4.6	4.0	3.4
		$T_{3}$	5.4	7.8	10.1	7.5	8.8	12.0
	11	$T_{1}$	1.6	1.3	1.2	4.5	3.8	4.1
		$T_{2}$	1.4	1.0	0.9	4.2	3.6	2.9
		$T_{3}$	3.5	4.9	6.3	7.1	8.4	11.2
$m_{1}$	7	$T_{1}$	19.2	26.0	90.1	48.3	61.2	100.0
		$T_{2}$	10.4	13.6	56.1	30.7	40.3	98.4
		$T_{3}$	19.2	27.7	85.2	38.5	50.5	99.8
	9	$T_{1}$	15.0	20.1	86.4	50.2	63.1	100.0
		$T_{2}$	8.1	10.5	52.2	32.2	43.3	99.3
		$T_{3}$	16.2	23.5	84.8	40.7	52.9	99.9
	11	$T_{1}$	11.4	14.0	76.7	51.3	63.7	100.0
		$T_{2}$	5.6	7.1	42.7	33.0	45.1	99.6
		$T_{3}$	12.4	18.3	81.1	42.0	54.3	100.0
$m_{2}$	7	$T_{1}$	87.4	85.3	84.0	100.0	100.0	99.9
		$T_{2}$	80.8	70.0	50.8	100.0	99.9	99.0
		$T_{3}$	90.1	88.7	89.3	100.0	100.0	100.0
	9	$T_{1}$	83.2	79.2	74.5	100.0	100.0	99.9
		$T_{2}$	75.9	64.2	43.2	100.0	100.0	99.4
		$T_{3}$	87.1	86.1	86.2	100.0	100.0	100.0
	11	$T_{1}$	77.2	70.5	61.7	100.0	100.0	100.0
		$T_{2}$	69.5	55.2	32.5	100.0	100.0	99.5
		$T_{3}$	83.1	82.4	81.1	100.0	100.0	100.0

Table 6. p-Values and observed values of $T_{1}$, $T_{2}$ and $T_{\max}$.

Test	$n = 7$		$n = 9$		$n = 11$
Test Stat	$T_{cal}$	p-Value	$T_{cal}$	p-Value	$T_{cal}$	p-Value
$T_{1}$	3.9	$4.7 \times 10^{- 5}$	4.1	$1.9 \times 10^{- 5}$	4.1	$1.9 \times 10^{- 5}$
$T_{2}$	1.4	$8 \times 10^{- 2}$	1.6	$5.1 \times 10^{- 2}$	1.8	$3.3 \times 10^{- 2}$
$T_{\max}$	4.4	$4.2 \times 10^{- 8}$	4.8	$5.9 \times 10^{- 9}$	5	$1.6 \times 10^{- 9}$

Table 7. Calculated values of the marginal test statistics and equicoordinate quantile of the multivariate normal distribution.

	$n = 7$	$n = 9$	$n = 11$
$T^{(BCL 2 L 11)}$	4.4	4.8	5.0
$T^{(CBL)}$	0.3	0.4	0.6
$T^{(DUSP 4)}$	−0.6	−0.6	−0.9
$T^{(IL 13 RA 1)}$	2.0	2.1	2.0
$T^{(PFKFB 2)}$	0.7	0.8	0.8
Crit. Value	2.3154	2.3153	2.3154
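Table 7 can be read as a simple decision rule: each marginal statistic is compared with the equicoordinate critical value, and the composite statistic is the largest marginal statistic. The sketch below encodes the table values as printed and applies that rule. The one-sided (upper-tail) comparison is an assumption made here for illustration, and the variable and function names are hypothetical.

```python
# Marginal statistics from Table 7, keyed by gene and window size n.
marginal = {
    "BCL2L11": {7: 4.4, 9: 4.8, 11: 5.0},
    "CBL":     {7: 0.3, 9: 0.4, 11: 0.6},
    "DUSP4":   {7: -0.6, 9: -0.6, 11: -0.9},
    "IL13RA1": {7: 2.0, 9: 2.1, 11: 2.0},
    "PFKFB2":  {7: 0.7, 9: 0.8, 11: 0.8},
}
# Equicoordinate critical values from the last row of Table 7.
critical = {7: 2.3154, 9: 2.3153, 11: 2.3154}

def flagged_genes(n):
    """Genes whose marginal statistic exceeds the critical value (assumed one-sided)."""
    return [gene for gene, stats in marginal.items() if stats[n] > critical[n]]

def t_max(n):
    """Composite statistic: the largest marginal statistic for window size n."""
    return max(stats[n] for stats in marginal.values())

for n in (7, 9, 11):
    print(n, t_max(n), flagged_genes(n))
```

With the printed values, the maxima agree with the $T_{\max}$ row of Table 6 for each window size.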

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Harrar, S.W.; Xu, Y. Nonparametric Tests for Multivariate Association. Symmetry 2022, 14, 1112. https://doi.org/10.3390/sym14061112
