1. Introduction
The problem of comparing two samples obtained in different measurements arises in a wide range of tasks, from physical research to social and political studies. The comparison includes tests of the samples' distributions and of their parameters, and its result specifies whether or not the samples were drawn from the same population.
For univariate samples, the problem is solved by a variety of methods: the two-sample Student's t-test and the Welch t-test (both for comparison of the means of normal distributions), the Fisher F-test (for comparison of the variances of normal distributions), the Wilcoxon rank-sum test and the paired permutation test (for comparison of locations other than the means), the Kolmogorov-Smirnov test (for comparison of continuous distributions), the Tukey-Duckworth test (for comparison of the samples' shift), and so on [1].
For multivariate samples, the problem is less studied and has been solved only for several specific cases. If the samples are drawn from populations with multivariate normal distributions with equal variances, then the comparison of the multivariate means is provided by the extension of Student's t-test, namely the two-sample Hotelling T²-test [2]. If the variances of the populations are different, then the comparison of the multivariate means of the samples can be conducted by a family of tests that implement the same extension of the Hotelling statistic [3], or by its different versions, including a test that handles missing data [4].
Finally, there exists a small number of methods that address the multivariate two-sample problem in which the samples are drawn from populations with unknown or non-normal distributions. A review of the methods based on interpoint distances appears in the thesis [5], and a review of the non-parametric methods in the thesis [6].
In particular, the most widely applied Baringhaus-Franz test [7] uses the Euclidean distances between the elements of different samples (inter-sample distances) and the distances between the elements within each sample (intra-sample distances). The resulting statistic is a normalized difference between the sum of the inter-sample distances and the average of the intra-sample distances. Since this statistic is not distribution-free, critical values are obtained by bootstrapping techniques [8].
In this paper, we follow the line of using inter- and intra-sample distances and propose a distribution-free test for comparison of the means of multivariate samples with unknown distributions. The proposed test uses the distances between the elements of the samples and the centroid of the pooled sample, and the distances between the elements of each sample and its own centroid. These distances are considered as random variables, and the test compares the distributions of these variables. Acceptance of the null hypothesis about the equivalence of the distributions indicates that the populations from which the samples were drawn are equivalent (in the means and in the forms of the distributions), and rejection of the null hypothesis indicates that the samples were drawn from populations with different means.
Thus, in the proposed test the multivariate data samples are reduced to univariate samples of distances, and then the distributions of these univariate samples are compared. If, similarly to the Baringhaus-Franz test, the proposed test uses the Euclidean metric, then the distances are interpreted as deviations of the samples' elements; however, the choice of the metric function is not crucial and can depend on the nature of the data. The comparison between the samples of distances is conducted using the standard two-sample Kolmogorov-Smirnov test.
The proposed test is illustrated by its application to simulated data and to the real-world Iris flowers [9] and Swiss banknotes [10] datasets, which are widely used as benchmarks.
2. Problem formulation
Let X and Y be two d-dimensional samples such that each observation x_i of X is a random vector x_i = (x_i1, ..., x_id), i = 1, ..., n_X, and each observation y_j of Y is a random vector y_j = (y_j1, ..., y_jd), j = 1, ..., n_Y. In other words, the samples are represented by random matrices X of size n_X × d and Y of size n_Y × d.
We assume that the numbers n_X and n_Y of observations appearing in the considered samples X and Y are equal or at least rather close.
Denote by F_X and F_Y the multivariate distributions of the populations P_X and P_Y from which the samples X and Y were, respectively, drawn.
The question is whether the samples X and Y were drawn from the same population, or whether the populations P_X and P_Y, from which the samples X and Y were respectively drawn, are statistically different.
If the populations P_X and P_Y are univariate, the samples X and Y are random vectors, and the problem is solved by the standard two-sample tests for various known or arbitrary unknown distributions F_X and F_Y. However, for multivariate populations a complete analytical solution, the two-sample Hotelling T²-test [2], has been suggested only for normal F_X and F_Y. In addition, in the last decade several multivariate two-sample tests [11,12] based on the multivariate version of the Kolmogorov-Smirnov test [13] have been suggested, but these and similar solutions either implement bootstrapping techniques or have certain limitations. For other directions in the treatment of the problem see, e.g., the work [14] and the references therein.
In this paper, we assume that the distributions F_X and F_Y are continuous with finite expectations μ_X and μ_Y, respectively, and consider the null and alternative hypotheses

$$ H_0: \mu_X = \mu_Y \quad \text{versus} \quad H_1: \mu_X \neq \mu_Y. $$

From the construction of the test it follows that acceptance of the null hypothesis H0 indicates that the populations P_X and P_Y have equal expectations, and rejection of the null hypothesis indicates that these populations are statistically different by the difference of their means.
Testing the statistical equivalence of the populations P_X and P_Y requires an additional test, which is conducted after acceptance of the null hypothesis H0 and considers the hypotheses

$$ H_0': F_X = F_Y \quad \text{versus} \quad H_1': F_X \neq F_Y, $$

given that μ_X = μ_Y. Acceptance of the null hypothesis H0′ indicates that the populations P_X and P_Y are equivalent, and rejection of this hypothesis indicates that the populations P_X and P_Y are different but have equal expectations.
3. Suggested solution
The proposed test includes two stages: first, the test reduces the multivariate data to univariate arrays, and second, it studies these arrays as realizations of certain random variables. For univariate data the first stage is omitted, and the analysis includes only the second stage.
Let X and Y be independent d-dimensional random samples drawn from the populations P_X and P_Y with distributions F_X and F_Y and finite expectations μ_X and μ_Y, respectively. Denote by Z the concatenation of the samples, such that if x_i ∈ X, i = 1, ..., n_X, and y_j ∈ Y, j = 1, ..., n_Y, then

$$ Z = \left( x_1, \ldots, x_{n_X}, \, y_1, \ldots, y_{n_Y} \right). $$

The expectation of the concatenated sample Z is

$$ \mu_Z = \frac{n_X \, \mu_X + n_Y \, \mu_Y}{n_X + n_Y}. $$
Now we introduce four univariate random vectors which represent the distances between the observations of X and Y and the corresponding expectations. The first two vectors,

$$ D_X = \left( \rho(x_1, \mu_X), \ldots, \rho(x_{n_X}, \mu_X) \right), \qquad D_Y = \left( \rho(y_1, \mu_Y), \ldots, \rho(y_{n_Y}, \mu_Y) \right), $$

where ρ stands for the chosen distance, are the vectors of distances between the observations and the expected values of these observations. The third vector is the concatenation of these two vectors,

$$ D_Z = \left( D_X, D_Y \right) = \left( d_1, \ldots, d_{n_X}, d_{n_X+1}, \ldots, d_{n_X+n_Y} \right), $$

in which d_i = ρ(x_i, μ_X), i = 1, ..., n_X, and d_{n_X+j} = ρ(y_j, μ_Y), j = 1, ..., n_Y. Finally, the fourth vector is the vector of distances between the observations and the expectation μ_Z of the concatenation Z of the vectors of observations,

$$ D_W = \left( w_1, \ldots, w_{n_X+n_Y} \right), $$

where w_i = ρ(x_i, μ_Z), i = 1, ..., n_X, and w_{n_X+j} = ρ(y_j, μ_Z), j = 1, ..., n_Y.
It is clear that the equivalence of the expectations μ_X and μ_Y implies the equivalence of the vectors D_Z and D_W, and vice versa. Hence, to check the hypothesis that μ_X = μ_Y it is enough to check whether the vectors D_Z and D_W are statistically equivalent.
Similarly to the Baringhaus-Franz test [7], assume that the indicated distances are Euclidean distances. Then the estimated distances are

$$ \hat{d}_i = \sqrt{ \sum_{k=1}^{d} \left( x_{ik} - \hat{\mu}_{X,k} \right)^2 }, \qquad \hat{d}_{n_X+j} = \sqrt{ \sum_{k=1}^{d} \left( y_{jk} - \hat{\mu}_{Y,k} \right)^2 }, $$

$$ \hat{w}_i = \sqrt{ \sum_{k=1}^{d} \left( x_{ik} - \hat{\mu}_{Z,k} \right)^2 }, \qquad \hat{w}_{n_X+j} = \sqrt{ \sum_{k=1}^{d} \left( y_{jk} - \hat{\mu}_{Z,k} \right)^2 }, $$

i = 1, ..., n_X, j = 1, ..., n_Y, where μ̂_{X,k}, μ̂_{Y,k} and μ̂_{Z,k}, k = 1, ..., d, are the elements of the multivariate estimated centers μ̂_X, μ̂_Y and μ̂_Z of the distributions F_X, F_Y and F_Z, respectively.
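The reduction of the multivariate samples to the univariate vectors of estimated distances can be sketched as follows (an illustrative NumPy sketch rather than the authors' MATLAB implementation; the function and variable names are ours):

```python
import numpy as np

def distance_vectors(X, Y):
    """Reduce d-dimensional samples X (n_X x d) and Y (n_Y x d) to the
    univariate distance vectors D_X, D_Y, D_Z and D_W described above."""
    mu_x = X.mean(axis=0)                   # estimated center of F_X
    mu_y = Y.mean(axis=0)                   # estimated center of F_Y
    Z = np.vstack([X, Y])                   # concatenated sample
    mu_z = Z.mean(axis=0)                   # estimated center of F_Z

    d_x = np.linalg.norm(X - mu_x, axis=1)  # distances of X to its own center
    d_y = np.linalg.norm(Y - mu_y, axis=1)  # distances of Y to its own center
    d_z = np.concatenate([d_x, d_y])        # concatenation of D_X and D_Y
    d_w = np.linalg.norm(Z - mu_z, axis=1)  # distances to the pooled center
    return d_x, d_y, d_z, d_w
```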
For the comparison of the vectors D_Z and D_W we apply the two-sample Kolmogorov-Smirnov test. Then, for the considered vectors D_Z and D_W and their empirical distributions F̂_{D_Z} and F̂_{D_W}, the Kolmogorov-Smirnov statistic is determined by the difference between the estimated centers of the distributions, μ̂_X, μ̂_Y and μ̂_Z.
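For completeness, the standard two-sample Kolmogorov-Smirnov statistic, written here in the notation introduced above (the notation itself is ours), is

$$ KS\left( D_Z, D_W \right) = \sup_{t} \left| \hat{F}_{D_Z}(t) - \hat{F}_{D_W}(t) \right|. $$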
Note that acceptance of the hypothesis H0 does not indicate the equivalence of the distributions F_X and F_Y. To finalize the test and to check the hypothesis H0′ (after acceptance of H0), we propose to apply the Kolmogorov-Smirnov test once more and to compare the distance vectors D_X and D_Y. Here the Kolmogorov-Smirnov statistic is determined by the difference between the distributions of the vectors D_X and D_Y. Acceptance of the hypothesis H0′, together with the accepted hypothesis H0, indicates that the distributions F_X and F_Y are statistically equivalent and the samples X and Y were drawn from the same population or from two statistically equivalent populations.
4. Examples of univariate and bivariate samples
To clarify the suggested method let us consider two simple examples. We start with the univariate two-sample problem.
Let the samples X and Y of lengths n_X and n_Y be drawn from a population with a normal distribution (with given expected value and standard deviation) and from a population with an exponential distribution, respectively. For simplicity, we rounded the values in the samples.
The distance vectors D_X, D_Y, D_Z and D_W are then computed as described in Section 3.
The Kolmogorov-Smirnov test at the chosen significance level rejects the hypothesis H0. Thus, it can be concluded that the expectations μ_X and μ_Y are different and the samples X and Y were drawn from different populations or, at least, are significantly shifted.
The same result is obtained by direct comparison of the samples X and Y: the Kolmogorov-Smirnov test at the same significance level rejects the hypothesis that they follow a common distribution.
Now let both samples X and Y be drawn from the same population with a normal distribution (with the same expected value and standard deviation).
As expected, the Kolmogorov-Smirnov test at the chosen significance level accepts the hypothesis H0 and then accepts the hypothesis H0′. Thus, it can be concluded that the samples X and Y were drawn from the same population, and direct comparison of the samples X and Y confirms this conclusion.
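A minimal sketch of such a univariate trial (the concrete sample values, lengths and significance level of the original example are not reproduced; the parameters below are assumptions chosen only for illustration):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=1.0, size=50)   # assumed normal sample
y = rng.exponential(scale=2.0, size=50)       # assumed exponential sample

# Univariate case: distances are absolute deviations from the means.
d_x = np.abs(x - x.mean())
d_y = np.abs(y - y.mean())
d_z = np.concatenate([d_x, d_y])
d_w = np.abs(np.concatenate([x, y]) - np.concatenate([x, y]).mean())

print(ks_2samp(d_z, d_w))   # small p-value -> rejection of H0 (different expectations)
print(ks_2samp(x, y))       # direct comparison of the raw samples, for reference
```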
Now let us consider an example of the bivariate two-sample problem. Assume that the samples are represented by matrices X and Y, where the first matrix was drawn from a normally distributed population and the second matrix was drawn from an exponentially distributed population, with different mean vectors. The values of the samples and their estimated centers μ̂_X and μ̂_Y are shown in Figure 1.
The distance vectors D_X, D_Y, D_Z and D_W are then computed as described in Section 3.
The Kolmogorov-Smirnov test at the chosen significance level rejects the hypothesis H0. Thus, it can be concluded that the expectations μ_X and μ_Y are different and the samples X and Y were drawn from different populations or, at least, are significantly shifted.
Note that direct comparison of the vectors D_X and D_Y results in acceptance of the hypothesis that they follow a common distribution, which, however, does not lead to additional conclusions about the expectations μ_X and μ_Y or about the distributions F_X and F_Y.
5. The algorithm of the two-sample test
For convenience, let us formulate the proposed test in the algorithmic form.
Algorithm: two-sample test of the difference of means for multivariate datasets

Input: two d-dimensional samples that are independent random matrices X (of size n_X × d) and Y (of size n_Y × d).
Output: conclusions about the difference between the expectations μ_X and μ_Y and about the difference between the distributions F_X and F_Y of the samples.

Compute the multivariate mean μ̂_X (d-dimensional vector) of the sample X.
Compute the multivariate mean μ̂_Y (d-dimensional vector) of the sample Y.
Compute the distance between each element of the sample X and its mean μ̂_X and combine these distances into the vector D_X.
Compute the distance between each element of the sample Y and its mean μ̂_Y and combine these distances into the vector D_Y.
Concatenate the vectors D_X and D_Y of the distances into the vector D_Z.
Concatenate the samples X and Y into the sample Z.
Compute the multivariate mean μ̂_Z (d-dimensional vector) of the sample Z.
Compute the distance between each element of the sample Z and its mean μ̂_Z and combine these distances into the vector D_W.
Apply the two-sample Kolmogorov-Smirnov test to the distributions of the vectors D_Z and D_W.
If the hypothesis of their equivalence is accepted, then
    Accept the hypothesis H0: μ_X = μ_Y,
    Apply the two-sample Kolmogorov-Smirnov test to the distributions of the vectors D_X and D_Y,
    If the hypothesis of their equivalence is accepted, then
        Accept the hypothesis H0′: F_X = F_Y,
    else
        Accept the hypothesis H1′: F_X ≠ F_Y,
    end if,
else
    Accept the hypothesis H1: μ_X ≠ μ_Y,
end if.
Return the accepted hypotheses.
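A compact rendering of this algorithm (an illustrative Python/SciPy sketch rather than the authors' MATLAB implementation; the function name, the default significance level and the returned labels are ours):

```python
import numpy as np
from scipy.stats import ks_2samp

def two_sample_mean_test(X, Y, alpha=0.05):
    """Distance-based two-sample test following the algorithm above.
    X and Y are n_X x d and n_Y x d arrays; alpha is the KS significance level."""
    # Stage 1: reduce the multivariate samples to univariate distance vectors.
    d_x = np.linalg.norm(X - X.mean(axis=0), axis=1)   # distances to the mean of X
    d_y = np.linalg.norm(Y - Y.mean(axis=0), axis=1)   # distances to the mean of Y
    d_z = np.concatenate([d_x, d_y])                   # concatenated distances D_Z
    Z = np.vstack([X, Y])                              # pooled sample
    d_w = np.linalg.norm(Z - Z.mean(axis=0), axis=1)   # distances to the pooled mean

    # Stage 2: compare the distance vectors with the two-sample KS test.
    if ks_2samp(d_z, d_w).pvalue < alpha:
        return "H1: different expectations"
    if ks_2samp(d_x, d_y).pvalue < alpha:
        return "H0 and H1': equal expectations, different distributions"
    return "H0 and H0': statistically equivalent populations"
```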
Note again that the numbers n_X and n_Y of observations in the samples X and Y should be equal or at least rather close.
The use of the squared differences between the samples' elements and the means leads to a certain similarity between the suggested method and the one-way analysis of variance [15], but without the crucial requirement of normal distribution of the samples.
6. Verification of the method
The suggested method was verified using real-world and simulated data. For the verification, we implemented the algorithm in MATLAB® and used the appropriate functions from its statistical toolbox. The same significance level was used in all two-sample Kolmogorov-Smirnov tests.
6.1. Trials on the simulated data
For the first trials we generated two multivariate samples with different distributions and parameters and then applied the suggested algorithm to these samples. Both samples include the same number of elements. Examples of the simulated bivariate normally distributed samples are shown in Figure 2. In the figure, the samples have different predefined means and different standard deviations. For simplicity, here we show samples whose means differ only in one dimension; in the other dimension the means are equal to zero.
The results of the tests of these samples by the suggested algorithm are summarized in Table 1.
As expected, for the first two samples the test accepted the hypothesis H0 of equal expectations and rejected the hypothesis H0′ because of the different standard deviations. In the next three cases, the test rejected the hypothesis H0, since the expectations were indeed different, and because of this difference the hypothesis H0′ was also rejected.
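A sketch of one such trial (the parameters are assumptions chosen for illustration; it relies on the two_sample_mean_test function sketched after the algorithm in Section 5):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 2                                         # assumed sample size and dimension

# Equal means, different standard deviations: acceptance of H0 and rejection of H0'
# would be expected, as in the first rows of Table 1.
X = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(n, d))
Y = rng.normal(loc=[0.0, 0.0], scale=3.0, size=(n, d))
print(two_sample_mean_test(X, Y))

# Means shifted in one dimension only: rejection of H0 would be expected.
Y_shifted = rng.normal(loc=[4.0, 0.0], scale=3.0, size=(n, d))
print(two_sample_mean_test(X, Y_shifted))
```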
In the next trials we compared the performance of the Hotelling T²-test [2] with the performance of the proposed test. The implementation of the Hotelling T²-test was downloaded from the MATLAB Central File Exchange [16].
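For reference, the two-sample Hotelling T²-statistic used for this comparison is the standard one (stated here in our notation, with sample means x̄ and ȳ and pooled sample covariance S_p):

$$ T^2 = \frac{n_X n_Y}{n_X + n_Y} \, (\bar{x} - \bar{y})^{\top} S_p^{-1} (\bar{x} - \bar{y}), \qquad S_p = \frac{(n_X - 1) S_X + (n_Y - 1) S_Y}{n_X + n_Y - 2}, $$

and, under normality with equal covariances, $\frac{n_X + n_Y - d - 1}{d \,(n_X + n_Y - 2)} \, T^2$ follows the $F_{d, \, n_X + n_Y - d - 1}$ distribution.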
Following the requirements of the Hotelling T²-test, in these trials we compared two samples drawn from normally distributed populations with varying standard deviations and with expectations μ_X and μ_Y such that the difference between them changes from zero (equal expectations) to values for which the samples are separated with certainty. The results of the trials are summarized in Table 2.
The obtained results demonstrate that for normally distributed samples the suggested test recognizes the differences between the samples as correctly as the Hotelling T²-test but, as expected, is less sensitive than the Hotelling T²-test. Thus, if it is known that the samples were drawn from populations with normal distributions, then the Hotelling T²-test is preferable, and if the distributions of the populations are non-normal or unknown, then the suggested test can be applied.
For validation of the suggested test on samples drawn from populations with non-normal distributions, it was trialed on several pairs of samples with different distributions. For example, Table 3 summarizes the results of the test on samples with uniform distributions.
In addition, from the results presented in Table 2 and Table 3 it follows that, similarly to any other statistical test, the sensitivity of the suggested test decreases as the spread of the data (the standard deviation in Table 2 and the interval width in Table 3) increases.
6.2. Trials on the real-world data
For additional verification, we applied the suggested algorithm to two widely known datasets. The first is the Iris flower dataset [9], which contains three samples of Iris plants: Iris setosa, Iris versicolour and Iris virginica. The plants are described by four numerical parameters: sepal length, sepal width, petal length and petal width. Each sample includes 50 elements.
The sample representing Iris setosa is linearly separable from the other two samples, Iris versicolour and Iris virginica, but these two samples are not linearly separable from each other.
The trial includes six independent two-sample tests. The first three tests compare the samples pairwise, each with each of the two others. In these tests it was expected that the suggested method would identify that the samples represent different populations.
The other three tests compared each of the samples with itself: we compared a subsample taken from the beginning of the sample with a subsample taken from its end. In these tests, we certainly expected that the method would identify that the compared parts of the same sample are statistically equivalent.
The results of the tests are summarized in Table 4.
As expected, the method correctly identified that the samples representing different types of Iris plants are statistically different: in all such comparisons the hypothesis H0 was rejected. Note that the method correctly identified the difference between the two linearly non-separable samples.
The method also correctly identified the statistical equivalence of the subsamples taken from the same samples: in these comparisons the method correctly accepted both the hypothesis H0 and the hypothesis H0′.
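One of the cross-species comparisons can be reproduced, for example, as follows (an illustrative sketch that loads the dataset from scikit-learn and assumes the two_sample_mean_test function sketched after the algorithm in Section 5):

```python
from sklearn.datasets import load_iris

iris = load_iris()
versicolour = iris.data[iris.target == 1]   # 50 x 4 matrix of Iris versicolour
virginica = iris.data[iris.target == 2]     # 50 x 4 matrix of Iris virginica

# two_sample_mean_test is the sketch from Section 5; rejection of H0 would be
# expected here, since the species differ in their mean measurements.
print(two_sample_mean_test(versicolour, virginica))
```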
The second dataset is the Swiss banknotes dataset [10], which includes 200 records about 100 genuine and 100 counterfeit banknotes, forming the samples X and Y, respectively. Each banknote is characterized by six numerical parameters specifying its geometrical sizes.
The suggested test correctly rejected the null hypothesis H0 and separated the records about genuine and counterfeit banknotes at the chosen significance level, with a p-value close to zero.
Note that the same result was reported for the two-sample Hotelling test, which also rejected the null hypothesis about the equivalence of the samples and separated the records with a p-value close to zero.
7. Conclusion
The proposed test for comparison of the means of multivariate samples with unknown distributions correctly identifies statistical equivalence and difference between the samples.
Since the test implements the Kolmogorov-Smirnov statistic, it does not require specific distributions of the samples and can be applied to any reasonable data.
In addition, the proposed method, in contrast to the existing tests, does not consider the pairwise relations between all elements of the samples and therefore requires less computational power.
The method was verified on simulated and real-world data and in all trials it demonstrated correct results.
Funding
This research has not received any grant from funding agencies in the public, commercial, or non-profit sectors.
Data Availability Statement
The data were obtained from open-access repositories; the links appear in the references.
Conflicts of Interest
The authors declare no conflict of interest.
Competing Interests
The authors declare no competing interests.
References
- Moore D.S., McCabe G.P., Craig B. Introduction to the Practice of Statistics. W. H. Freeman: New York, 2014.
- Hotelling H. The generalization of Student's ratio. Annals of Mathematical Statistics, 1931, 2(3), 360-378.
- Coombs W.T., Algina J., Oltman D.O. Univariate and multivariate omnibus hypothesis tests selected to control type I error rates when population variances are not necessarily equal. Review of Educational Research, 1996, 66(2), 137-179.
- Wu Y., Genton M.G., Stefanski L.A. A multivariate two-sample mean test for small sample size and missing data. Biometrics, 2006, 62(3), 877-885.
- Siluyele I.J. Power Studies of Multivariate Two-Sample Tests of Comparison. MSc Thesis. University of the Western Cape, Cape Town, SA, 2007.
- Lhéritier A. Nonparametric Methods for Learning and Detecting Multivariate Statistical Dissimilarity. PhD Thesis. Université Nice Sophia Antipolis, Nice, France, 2015.
- Baringhaus L., Franz C. On a new multivariate two-sample test. J. Multivariate Analysis, 2004, 88, 190-206.
- Efron B. Bootstrap methods: another look at the jackknife. Annals of Statistics, 1979, 7(1), 1-26.
- Fisher, R.A. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 1936, 7(2), 179-188. The Iris flower dataset was downloaded from the UCI Machine Learning Repository, https://archive.ics.uci.edu/dataset/53/iris (accessed 25 Nov 2023).
- The Two-Sample Hotelling's T-Square Test Statistic. In the course notes Applied Multivariate Statistical Analysis. Eberly College of Science, Pennsylvania State University. The Swiss banknotes dataset was downloaded from the page https://online.stat.psu.edu/stat505/lesson/7/7.1/7.1.15 (accessed 25 Nov 2023).
- Sadhanala V., Wang Y.-X., Ramdas A., Tibshirani R.J. A Higher-order Kolmogorov-Smirnov test. In Proc. 22nd Int. Conf. Artificial Intelligence and Statistics (AISTATS), 2019, Naha, Okinawa, Japan, 89, 2621-2630.
- Kesemen O., Tiryaki B.K., Tezel O., Özku E. A new goodness of fit test for multivariate normality. Hacet. J. Math. Stat., 2021, 50(3), 872 – 894.
- Justel A., Peña D., Zamar R. A multivariate Kolmogorov-Smirnov test of goodness of fit. Statistics & Probability Letters, 1997, 35, 251-259.
- Qiu Z., Chen J., Zhang J.-T. Two-sample tests for multivariate functional data with applications. Computational Statistics and Data Analysis, 2021, 157, 107160, 1-14.
- Tabachnick B.G., Fidell L.S., Ullman J.D. Using Multivariate Statistics. Pearson Education: London, UK, 2018.
- Trujillo-Ortiz A. Hotelling T2. MATLAB Central File Exchange, https://www.mathworks.com/matlabcentral/fileexchange/2844-hotellingt2 (accessed 25 Nov 2023).
Disclaimer/Publisher's Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).