Happ 2018
DOI: 10.1002/sim.7983

RESEARCH ARTICLE

1 Department of Mathematics, University of Salzburg, Salzburg, Austria
2 Department of Statistics, University of Kentucky, Lexington, Kentucky
3 Department of Medical Statistics, University of Göttingen, Göttingen, Germany

Correspondence
Arne C. Bathke, Department of Mathematics, University of Salzburg, 5020 Salzburg, Austria.
Email: [email protected]

Present Address
Arne C. Bathke, University of Salzburg, Hellbrunnerstrasse 34, 5020 Salzburg, Austria.

Funding information
Austrian Science Fund, Grant/Award Number: I 2697-N31

Abstract
There are many different proposed procedures for sample size planning for the Wilcoxon-Mann-Whitney test at given type-I and type-II error rates α and β, respectively. Most methods assume very specific models or types of data to simplify calculations (eg, ordered categorical or metric data, location shift alternatives, etc). We present a unified approach that covers metric data with and without ties, count data, ordered categorical data, and even dichotomous data. For that, we calculate the unknown theoretical quantities, such as the variances under the null and relevant alternative hypothesis, by considering the following "synthetic data" approach: we evaluate data whose empirical distribution functions match the theoretical distribution functions involved in the computations of the unknown theoretical quantities. Then, well-known relations for the ranks of the data are used for the calculations.
In addition to computing the necessary sample size N for a fixed allocation proportion t = n1/N, where n1 is the sample size in the first group and N = n1 + n2 is the total sample size, we provide an interval for the optimal allocation rate t, which minimizes the total sample size N. It turns out that, for certain distributions, a balanced design is optimal. We give a characterization of such distributions. Furthermore, we show that the optimal choice of t depends on the ratio of the two variances, which determine the variance of the Wilcoxon-Mann-Whitney statistic under the alternative. This is different from an optimal sample size allocation in case of the normal distribution model.

KEYWORDS
nonparametric relative effect, nonparametric statistics, optimal design, rank-based inference, sample size planning, Wilcoxon-Mann-Whitney test
1 INTRODUCTION
The comparison of two independent samples is widespread in medicine, the life sciences in general, and other fields of
research. Arguably, the most popular method is the unpaired t-test for two sample comparisons. However, its application
is limited. For heavy-tailed or very skewed distributions, use of the t-test is not recommended, especially for small sample
sizes. For ordered categorical data, comparing averages by means of t-tests is not appropriate at all. For those situations,
a nonparametric test such as the Wilcoxon-Mann-Whitney (WMW) test is much preferred.
This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
© 2018 The Authors. Statistics in Medicine Published by John Wiley & Sons Ltd.
In order to plan a study for this type of two-sample comparison, we need to know how many subjects are needed to detect a prespecified effect with probability at least 1 − β, where β denotes the type-II error probability. If the underlying
distributions are normal, a prespecified effect might be formulated as a difference of means. Within a general nonpara-
metric framework, the relative effect (see Section 2) is very often used. However, for a statistics practitioner, it is sometimes
difficult to state a relevant effect size to be detected in terms of the nonparametric relative effect. Therefore, we will be
using a slightly different approach. Based on prior information F1 regarding one group, eg, the standard treatment or the
control group, one can derive the distribution F2 under a conjectured (relevant) alternative in cooperation with a subject
matter expert. This distribution is established in such a way that it features what the subject matter expert would quantify
as a relevant effect. In other words, the expert may, but does not necessarily have to, provide a (standardized) difference
of means, or a relevant value for the nonparametric relative effect on which the WMW test is based. Or, alternatively,
the subject matter expert may simply provide information on a configuration that the expert would consider relevant in
terms of providing evidence in favor of the research hypothesis. This information will then be translated into a relevant
nonparametric effect. More details on deriving F2 based on an interpretable effect to compute the nonparametric effect
and the variances involved in the sample size planning are given in Section 4.
For the WMW test, there already exist many sample size formulas. However, most of them require special situations,
eg, either continuous data as used in the works of Bürkner et al,1 Wang et al,2 or Noether,3 or they require ordered cate-
gorical data as in the works of Fan,4 Tang,5 Lachin,6 Hilton and Mehta,7 or Whitehead.8 For a review of different methods,
we refer to the work of Rahardja et al.9 A rather well-known method for sample size calculation in case of continuous data is given by Noether,3 who approximated the variance under the alternative by the variance under the null hypothesis. A similar approximation was also used by Zhao et al,10 who generalized Noether's formula to allow for ties. For practical application, however, this approximation may not always be appropriate because the variances under the null hypothesis and under the alternative can be very different, thus potentially leading to an underpowered or overpowered study. See, eg, the work of Shieh et al11 for a comparison of Noether's formula with different alternative methods.
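For continuous data, Noether's approximation has a simple closed form: with the null variance σ0² = 1/(12 t(1 − t)) replacing the variance under the alternative, the total sample size becomes N ≈ (u1−α/2 + u1−β)² / (12 t(1 − t)(p − 1/2)²). A minimal sketch (illustrative Python; the function name is ours):

```python
from statistics import NormalDist

def noether_total_n(p, alpha=0.05, beta=0.2, t=0.5):
    """Noether-type approximate total sample size for the two-sided WMW
    test with continuous data: the variance under the alternative is
    replaced by the null variance sigma0^2 = 1 / (12 t (1 - t))."""
    z = NormalDist().inv_cdf
    return (z(1 - alpha / 2) + z(1 - beta)) ** 2 / (
        12 * t * (1 - t) * (p - 0.5) ** 2)

# Example: relative effect p = 0.65, balanced design, 80% power.
n_approx = noether_total_n(0.65)
```

For p = 0.65, α = 0.05, and 1 − β = 0.8 in a balanced design, this yields roughly 117 subjects in total; note that the formula ignores how different the variance under the alternative may be, which is exactly the limitation discussed above.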
In some other approaches, the sample size is only calculated under the assumption of a proportional odds model for
ordered categorical data (eg, the works of Kolassa12 or Whitehead8 ), or considering only location shift models for contin-
uous metric data (see, eg, the works of Rosner and Glynn,13 Chakraborti et al,14 Lesaffre et al,15 Hamilton and Collings,16
or Collings and Hamilton,17 among others). An advantage of our formula (9) in Section 2 for the sample size calculation
is its generality and practicality. It can be used for metric data as well as for ordered categorical data, and it even works
very well for dichotomous data. Furthermore, our formula does not assume any special model for the alternatives.
Within the published literature, the sample size formulas bearing most similarity to ours are those by Wang et al.2
However, their approach is limited to continuous distributions, whereas our approach is based on a unified approach
allowing for discrete, as well as continuous data.
A completely different way to approach optimality of WMW tests has been pursued by Matsouaka et al.18 They use a
weighted sum of multiple WMW tests and determine the optimal weight for each test. Their aim is not an optimal sample
size planning including optimization of the ratio of sample sizes, but instead they try to optimally combine a primary
endpoint with mortality.
In a two-sample setting, we sometimes can choose the proportion of subjects in the first group. That is, we can choose
t = n1 ∕N, where n1 is the number of subjects in the first group and N is the total number of subjects. The question that
arises is how to choose t in an optimal way. In the work of Bürkner et al,1 the optimal t is chosen such that the power of
the WMW test is maximized for a given sample size N. On the other hand, in practice, we prefer to choose t in such a way
that the total sample size N is minimized for a specified power 1 − 𝛽. For the two-sample t-test with unequal variances,
Dette and O'Brien19 showed that the optimal t to maximize the power of the test is approximately

t ≈ 1 / (1 + τ),

where τ = σ1/σ0 is the ratio of the standard deviations of the two groups under the hypothesis and under the alternative, respectively.
respectively. This means that, when applying the t-test, more subjects should be allocated to the group with the higher
variance. Bürkner et al1 showed for symmetric continuous distributions under a location shift model that a balanced
design is optimal for the WMW test. For general distributions, they observed in simulation studies that, in many situations,
the difference between using the optimal t and using a balanced design is negligible.
In most publications, the generation of the alternative from the reference group is not discussed, and instead, the dis-
tribution under the alternative is assumed to be known. Here, however, we want to discuss also how we can generate the
distribution under the alternative based on the distribution in the reference group and an interpretable relevant effect.
TABLE 1 Number of seizures for 28 subjects from the advance information X1,k ∼ F1(x), k = 1, …, 28, and for the relevant effect F2(x) = F1(x/q), where q = 0.5 denotes the percentage of the relevant reduction of seizures to be detected. This means X2,k = [q · X1,k] ∼ F2(x), where [u] denotes the largest integer ≤ u

Advance information, X1,1, …, X1,28 ∼ F1(x):
3, 3, 5, 4, 21, 7, 2, 12, 5, 0, 22, 4, 2, 12,
9, 5, 3, 29, 5, 7, 4, 4, 5, 8, 25, 1, 2, 12

Relevant alternative, X2,k ∼ F2(x) = F1(x/q):
1, 1, 2, 2, 10, 3, 1, 6, 2, 0, 11, 2, 1, 6,
4, 2, 1, 14, 2, 3, 2, 2, 2, 4, 12, 0, 1, 6
In order to motivate the method derived in this paper, let us consider an example with count data, as it appears that most publications on sample size planning focus on ordered categorical or continuous metric data. In Table 1, the data of the advance information F1 on a placebo in an epilepsy trial are given, where the outcome variable is the number of seizures. We
would like to base sample size planning for a new drug on the data X1,1 , … , X1,28 of the advance information F1 , which
comes from a study published by Leppik et al,20 as well as Thall and Vail.21 For these data, we cannot assume a location
shift model, as an absolute reduction of two seizures would be very good for someone with three seizures, but not really
helpful for someone with 20 or more seizures. More appropriate would probably be a reduction of the number of seizures
by some percentage q, for example q = 50%. Based on this specified relevant effect F2 (x) = F1 (x∕q), we artificially gen-
erate a new data set X2,1 , … , X2,28 whose empirical distribution function F ̂2 (x) is exactly equal to F2 (x). Basically, the
number n2 of the artificially generated data is arbitrary (here, n2 = 28) as long as F ̂2 (x) = F2 (x) = F1 (x∕q). We will refer
to such data as “synthetic” data.
Most of the methods mentioned before cannot be applied to data such as these as they have been derived under different
restrictive assumptions. In particular, methods assuming a location-shift model cannot be used here. However, application
of the method proposed in the present paper does not require specific types of data or a specific alternative because it is
based on the observed data and the generated synthetic data, which do not need to follow any particular model. See also
the chapter “Keeping Observed Data as a Theoretical Distribution” in the work of Puntanen et al22 for a similar approach
in the parametric case. More details regarding this data set and the sample size calculation can be found in Section 4.
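The construction of Table 1 can be reproduced in a few lines of code (illustrative Python; the variable names are ours): each prior observation is multiplied by q = 0.5 and rounded down, which realizes the relevant alternative F2(x) = F1(x/q) for the empirical prior distribution.

```python
import math

# Advance information: seizure counts from the placebo group (Table 1).
x1 = [3, 3, 5, 4, 21, 7, 2, 12, 5, 0, 22, 4, 2, 12,
      9, 5, 3, 29, 5, 7, 4, 4, 5, 8, 25, 1, 2, 12]

q = 0.5  # conjectured relevant effect: a 50% reduction in seizures

# Synthetic data for the alternative F2(x) = F1(x/q): X2,k = [q * X1,k].
x2 = [math.floor(q * x) for x in x1]
```

The size n2 of the synthetic sample is arbitrary (here n2 = 28, one synthetic value per prior observation); only the empirical distribution function of the synthetic data matters.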
The rest of this paper is now organized as follows. We first derive a general sample size formula and investigate the
behavior of the optimal t. That is, we show in which cases more subjects should be allocated to the first or second group.
Then, we apply this method to several data examples with different types of data and provide power simulations to show
that, with the sample size calculated by our method, the simulated power is at least 1 − 𝛽. Furthermore, we simulate how
the chosen type-I and type-II error rates affect the value of the optimal allocation rate t.
2 SAMPLE SIZE FORMULA

Let X1i ∼ F1 and X2j ∼ F2, i = 1, …, n1, j = 1, …, n2, be independent random samples obtained on N different subjects,
with N = n1 + n2 . The cumulative distribution functions (cdfs) F1 and F2 are understood as their normalized versions,
ie, Fi (x) = 12 (Fi+ (x) + Fi− (x)), where Fi+ denotes the right-continuous cdf and Fi− denotes the left-continuous cdf. By
using the normalized version, we can pursue a unified approach for continuous and discrete data; no separate formulas
“correcting for ties” are necessary. This unified approach results naturally in the usage of midranks in the formulas for
the test statistics; see the works of Ruymgaart,23 Akritas et al,24 and Akritas and Brunner25 for details. We denote by t the
proportion of the N subjects that is allocated to the first group. That is, n1 = tN and n2 = (1 − t)N. Without loss of
generality, X1i may be regarded as the reference group and the second group X2i as the (experimental) treatment group.
The WMW test is based on the nonparametric relative treatment effect

p = ∫ F1 dF2 = P(X11 < X21) + (1/2) P(X11 = X21),   (1)
which can be estimated in a natural way by its empirical analog p̂ = ∫ F̂1 dF̂2. Here, F̂i = (1/2)(F̂i− + F̂i+) is the normalized empirical cdf, with F̂i−(x) = ni^(−1) Σ_{j=1}^{ni} 1{Xij < x} and F̂i+(x) = ni^(−1) Σ_{j=1}^{ni} 1{Xij ≤ x} the left- and right-continuous empirical cdfs for i = 1, 2, respectively. Finally, 1{Xij < x} denotes the indicator function of the set {Xij < x}. Using the relation of the so-called placement P2k = n1 F̂1(X2k) to the overall rank R2k of X2k among all N = n1 + n2 observations and the internal rank R(2)2k of X2k only among the n2 observations within sample 2, it follows from the asymptotic equivalence theorem (see, eg, theorem 1.3 in the work of Brunner and Puri26) that
TN = √N (p̂ − p) = √N [ (1/n1) ( R̄2· − (n2 + 1)/2 ) − p ]   (2)

is asymptotically normal under slight regularity assumptions. Here, R̄2· = n2^(−1) Σ_{k=1}^{n2} R2k denotes the mean of the overall ranks R2k in the second sample. For a derivation, we refer, eg, to the works of Brunner and Munzel27 or Brunner and Puri,26 while the placements P2k are considered in more detail at the end of this section in (10). From this theorem, it follows that, asymptotically, the statistic
UN = √N ( n2^(−1) Σ_{j=1}^{n2} F1(X2j) − n1^(−1) Σ_{j=1}^{n1} F2(X1j) + 1 − 2p ),   (3)
which is based on independent random variables, has the same distribution as TN . Then, under the null hypothesis H0 ∶
F1 = F2 , the variance of UN can be written as
σ0² = (N² / (n1 n2)) σ² = σ² / (t(1 − t)),   (4)

where σ² = ∫ F1² dF1 − 1/4. This means that TN/σ0 has asymptotically the same distribution as UN/σ0, but the distribution of the latter is asymptotically standard normal. To compute the variance of TN in general, we again take advantage of the asymptotically equivalent statistic in (3) and obtain the asymptotic variance
σN² = (N / (n1 n2)) ( n2 σ1² + n1 σ2² ),   (5)

where

σ1² = ∫ F2² dF1 − (1 − p)²,   (6)

σ2² = ∫ F1² dF2 − p².   (7)

Clearly, the variance σN² under the alternative is a weighted sum of the two components σ1² and σ2². Both of these components are important for minimizing the sample size, as performed in Section 3, unlike in the parametric case for the t-test, where only the two variances σ0² under the null and σ1² under the alternative hypotheses are considered.
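The rank representation in (2) can be verified numerically: with midranks, the estimator p̂ = ∫ F̂1 dF̂2 coincides with (R̄2· − (n2 + 1)/2)/n1, also in the presence of ties. A small sketch (illustrative Python with made-up data; the function names are ours):

```python
def midranks(pool):
    # Midrank of x in pool: (# values < x) + ((# values = x) + 1) / 2,
    # so tied observations share the average of their ranks.
    return [sum(v < x for v in pool) + (sum(v == x for v in pool) + 1) / 2
            for x in pool]

def p_hat_counting(x1, x2):
    """p_hat = ∫ F̂1 dF̂2 via direct counting; the normalized cdfs give
    ties the weight 1/2."""
    n1, n2 = len(x1), len(x2)
    return sum(sum(a < b for a in x1) + 0.5 * sum(a == b for a in x1)
               for b in x2) / (n1 * n2)

def p_hat_ranks(x1, x2):
    """The same estimator via overall midranks, as in the rank form of (2):
    p_hat = (mean overall rank of sample 2 - (n2 + 1)/2) / n1."""
    n1, n2 = len(x1), len(x2)
    ranks = midranks(list(x1) + list(x2))
    r2_bar = sum(ranks[n1:]) / n2
    return (r2_bar - (n2 + 1) / 2) / n1

# Made-up data with ties across and within groups.
x1, x2 = [1, 2, 2, 5], [2, 3, 5]
```

Both functions return the same value here (17/24), illustrating that no separate tie correction is needed once midranks and normalized cdfs are used.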
Based on these considerations, an approximate sample size formula for the WMW test can be obtained similar to the one calculated by Wang et al2 for continuous data. Namely, we obtain

N = ( σ0 u_{1−α/2} + σN u_{1−β} )² / ( p − 1/2 )²,   (8)

where α and β denote the type-I and type-II error rates, respectively, and u_{1−α/2} is the 1 − α/2 quantile of the standard normal distribution.
The quantities p, 𝜎 0 , and 𝜎 N in Equation (8) are unknown in general. Moreover, 𝜎N2 is a linear combination of the two
unknown variances 𝜎12 and 𝜎22 in Equations (6) and (7). To compute these quantities from the distribution F1 of the prior
information in the reference group and the distribution F2 generated by an intuitive and easy to interpret relevant effect,
we proceed as follows.
We interpret the distributions of the data as fixed theoretical distributions, similar to the parametric case in the works of Seber28(p433) and Puntanen et al.22(pp27-28) Therefore, we denote the data from the prior information by X*11, …, X*1n1 and the synthetic data for the treatment group by X*21, …, X*2n2. The corresponding cdfs are denoted by F1*(x) = F̂1(x) and F2*(x) = F̂2(x), respectively. Here, F̂1(x) denotes the empirical distribution function of the available data X*11, …, X*1n1 in the reference group and F̂2(x) the empirical distribution function of the synthetic data X*21, …, X*2n2 in the treatment group. In this context, "synthetic" means that the data for F2 are artificially generated based on the prior information F1 and some interpretable relevant effect. We can generate data sets of arbitrary size for F1 and F2, as long as the relative frequencies or probabilities remain unchanged. Because we assume that our synthetic data represent fixed distributions and not a sample, we can calculate the variances σ1², σ2², and σ², as well as the relative effect p, exactly. To emphasize that these quantities are not estimators but rather the true parameters based on the synthetic data, we will denote these quantities by σ²*, σ1²*, σ2²*, and p*.
By using the relations Nt = n1 and N(1 − t) = n2, the sample size formula from Equation (8) is then rewritten as

N = ( σ* u_{1−α/2} + u_{1−β} √( t σ2²* + (1 − t) σ1²* ) )² / ( t(1 − t) ( p* − 1/2 )² ).   (9)
The variances and the relative effect can be easily calculated by using a simple relation between ranks and the so-called placements P1k = n2 F̂2(X1k) and P2k = n1 F̂1(X2k), which were introduced by Orban and Wolfe.29,30 The placements were first defined only for continuous distributions but were later generalized to include discrete distributions. For details, see, eg, the work of Brunner and Munzel.27 To this end, let R*ik denote the overall rank of X*ik among all n1 + n2 = N synthetic data, and R*(i)ik the rank of X*ik within the ith group, i = 1, 2. Furthermore, let R̄*i· = ni^(−1) Σ_{k=1}^{ni} R*ik, i = 1, 2, denote the rank means. Then, the placements P*ik can be represented by these ranks as

P*ik = R*ik − R*(i)ik,   (10)
i = 1, 2; k = 1, …, ni. Finally, by letting Fi*(x) = F̂i(x), the quantities in the sample size formula (9) can be calculated directly as follows:

p* = ∫ F1* dF2* = (1/N) ( R̄*2· − R̄*1· ) + 1/2,   (11)

σ²* = ∫ (F*)² dF* − 1/4 = (1/N³) Σ_{i=1}^{2} Σ_{k=1}^{ni} ( R*ik − (N + 1)/2 )²,   (12)

σ1²* = ∫ (F2*)² dF1* − (1 − p*)² = (1/(n1 n2²)) Σ_{k=1}^{n1} ( P*1k − P̄*1· )²,   (13)

σ2²* = ∫ (F1*)² dF2* − (p*)² = (1/(n1² n2)) Σ_{k=1}^{n2} ( P*2k − P̄*2· )².   (14)

The cdf F* is the distribution function of the combined synthetic data from both groups. Note that, for computing the variances, we do not divide by N − 1 or ni − 1, but rather by N or ni, i = 1, 2, because the distributions of the synthetic data are considered as fixed theoretical distributions, similar to the parametric case in the work of Puntanen et al.22(pp27-28)
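Equations (10) to (14) and formula (9) translate directly into code. The following illustrative Python re-implementation (not the WMWssp package; all names are ours) applies them to the epilepsy counts of Table 1:

```python
from statistics import NormalDist

def midranks(pool):
    # Midranks: tied values receive the average of their ranks.
    return [sum(v < x for v in pool) + (sum(v == x for v in pool) + 1) / 2
            for x in pool]

def synthetic_quantities(x1, x2):
    """p*, sigma^2*, sigma1^2*, sigma2^2* from Equations (10)-(14),
    treating x1 (prior data) and x2 (synthetic data) as fixed distributions."""
    n1, n2 = len(x1), len(x2)
    N = n1 + n2
    r = midranks(list(x1) + list(x2))                    # overall ranks R*_ik
    r1, r2 = r[:n1], r[n1:]
    place1 = [a - b for a, b in zip(r1, midranks(x1))]   # P*_1k, Eq. (10)
    place2 = [a - b for a, b in zip(r2, midranks(x2))]   # P*_2k
    p = (sum(r2) / n2 - sum(r1) / n1) / N + 0.5          # Eq. (11)
    s2 = sum((v - (N + 1) / 2) ** 2 for v in r) / N ** 3  # Eq. (12)
    m1, m2 = sum(place1) / n1, sum(place2) / n2
    s1_sq = sum((v - m1) ** 2 for v in place1) / (n1 * n2 ** 2)  # Eq. (13)
    s2_sq = sum((v - m2) ** 2 for v in place2) / (n1 ** 2 * n2)  # Eq. (14)
    return p, s2, s1_sq, s2_sq

def wmw_total_n(x1, x2, alpha=0.05, beta=0.2, t=0.5):
    """Total sample size N according to formula (9)."""
    p, s2, s1_sq, s2_sq = synthetic_quantities(x1, x2)
    z = NormalDist().inv_cdf
    num = (z(1 - alpha / 2) * s2 ** 0.5
           + z(1 - beta) * (t * s2_sq + (1 - t) * s1_sq) ** 0.5) ** 2
    return num / (t * (1 - t) * (p - 0.5) ** 2)

# Table 1: placebo counts and synthetic data for a 50% seizure reduction.
x1 = [3, 3, 5, 4, 21, 7, 2, 12, 5, 0, 22, 4, 2, 12,
      9, 5, 3, 29, 5, 7, 4, 4, 5, 8, 25, 1, 2, 12]
x2 = [x // 2 for x in x1]
N_balanced = wmw_total_n(x1, x2)
```

Rounding Nt and N(1 − t) up to the next integers then gives the two group sizes; the results for these data are discussed in Section 4.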
3 MINIMIZING N
Now, regarding the case 𝜎 1 = 𝜎 2 , it is clear from formula (9) that the optimal allocation rate is t0 = 1∕2 because the
numerator of N(t) does not depend on t, and t(1 − t) is maximized at t = 1∕2. For the case 𝜎 1 ≠ 𝜎 2 , we consider first
0 < 𝜎 1 < 𝜎 2 . Then, it is possible to show (see Supplementary Material, Result 2) that the sample size is minimized by a
t0 ∈ [I1 , I2 ] with I1 ≤ I2 < 1∕2. The minimizer is unique in the interval (0, 1), and the bounds I1 and I2 are given by
I1 = 1 / (κ + 1),   (15)

I2 = √z / ( √z + u_{1−α/2} σ √q + u_{1−β} σ2 ),   (16)

where κ = σ2/σ1, σ² = ∫ F1² dF1 − 1/4 as in (4), q = p(1 − p), and

z = ( u_{1−α/2} σ √q + u_{1−β} σ1 ) ( u_{1−α/2} σ √q + u_{1−β} σ2 ).

Furthermore, we have the equivalence

t0 < 1/2 ⟺ σ1 < σ2.   (17)
In the case 0 < 𝜎 2 < 𝜎 1 , we obtain an analogous result for the minimizer t0 ∈ [I2 , I1 ], where the bounds are the same
as before. Moreover, we have a similar equivalence, namely,
t0 > 1/2 ⟺ σ1 > σ2.   (18)
The derivation of these two equivalences can be found in the Supplementary Material in Results 2 and 3.
From the form of the interval [I1 , I2 ], we can see that, if 𝜅 ≈ 1, then t0 ≈ 1∕2. In most cases, this means that the minimum
total sample size N is obtained for allocation rates close to 1∕2, or the allocation rate is 1∕2 because of rounding. Larger
values for the type-I error rate 𝛼 or the power 1 − 𝛽 lead in general to more extreme values for t0 , ie, |1∕2 − t0 | gets larger.
This can be seen from the upper bound I2 . By increasing 𝛼 or the power 1 − 𝛽, the bound I2 decreases (or increases for
𝜎 1 > 𝜎 2 ). Typically, this means that the difference |1∕2 − t0 | tends to get larger. Note that I2 is bounded from below
(above), ie, t0 cannot become arbitrarily small (or large). The impact of 𝛼 and 𝛽 is demonstrated in simulations in Section 5.
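These statements can be probed numerically: for fixed quantities (p, σ, σ1², σ2²), one can evaluate N(t) from formula (9) on a grid of allocation rates and locate the minimizer, comparing it with the lower bound I1 = 1/(κ + 1) from (15). A sketch with hypothetical values (illustrative Python; all names and numbers are ours):

```python
from statistics import NormalDist

def n_of_t(t, p, sigma, s1_sq, s2_sq, alpha=0.05, beta=0.2):
    """N(t) according to formula (9) for fixed synthetic-data quantities."""
    z = NormalDist().inv_cdf
    num = (z(1 - alpha / 2) * sigma
           + z(1 - beta) * (t * s2_sq + (1 - t) * s1_sq) ** 0.5) ** 2
    return num / (t * (1 - t) * (p - 0.5) ** 2)

def t_opt(p, sigma, s1_sq, s2_sq, steps=999):
    """Grid search for the allocation rate minimizing N(t) on (0, 1)."""
    grid = [(k + 1) / (steps + 1) for k in range(steps)]
    return min(grid, key=lambda t: n_of_t(t, p, sigma, s1_sq, s2_sq))

# Hypothetical quantities with sigma1 < sigma2 (kappa = sqrt(2) > 1).
sigma = (1 / 12) ** 0.5               # continuous case, sigma^2 = 1/12
t0 = t_opt(p=0.65, sigma=sigma, s1_sq=0.03, s2_sq=0.06)
i1 = 1 / (1 + (0.06 / 0.03) ** 0.5)   # lower bound I1 = 1/(kappa + 1), Eq. (15)
```

Consistent with equivalence (17), the grid minimizer here falls below 1/2 (since σ1 < σ2) but not below I1; for σ1 = σ2, the numerator of N(t) no longer depends on t, and the grid minimum is exactly t = 1/2.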
Next, we consider the case 0 = 𝜎 1 < 𝜎 2 . In the same way as before, it is possible to construct an interval for the optimal
allocation rate t0 , which is given by [I1(0) , I2 ], where the lower bound is
I1(0) = u_{1−α/2} σ / ( 2 u_{1−α/2} σ + u_{1−β} σ2 ),   (19)
and the upper bound is the same as in the case 0 < 𝜎 1 . More details are given in the Supplementary Material in Result 4.
An analogous result can be obtained for 0 = 𝜎 2 < 𝜎 1 .
Therefore, the value of t0 is mainly determined by 𝜅, which is the ratio of the standard deviations 𝜎 1 and 𝜎 2 under the
alternative hypothesis. This is qualitatively different from the result of the work of Dette and O'Brien19 for the t-test in
a parametric location-scale model, where the optimal allocation value is determined by the ratio of standard deviations
under the null and under the alternative hypothesis. For the WMW test, the variance under the null hypothesis is not really important for determining t0; in case of continuous distributions, eg, it is determined by σ² = ∫ F1² dF1 − 1/4 = 1/12 (see (4)). Combining the equivalences (17) and (18), we obtain

t0 = 1/2 ⟺ σ1 = σ2.   (20)
Bürkner et al1 showed analytically that, for symmetric and continuous distributions with F2(x) = F1(x + a) and a ≠ 0, the minimal sample size is attained at t0 = 1/2. By (20), such distributions satisfy the integral equation

∫ F2² dF1 − (1 − p)² = ∫ F1² dF2 − p²,   (22)

ie, σ1² = σ2². However, the class of distributions satisfying Equation (22) is actually larger. Consider normalized cdfs F1, F2 for which an a ∈ ℝ exists such that, for all x ∈ ℝ, the following equality holds:

F1(a + x) = 1 − F2(a − x).   (23)

Furthermore, let us assume 1 − β > 0.5. Then, the minimum for N(t), t ∈ (0, 1), is attained at t0 = 1/2. This means that (23) is a sufficient but not necessary condition for t0 = 1/2. As an example of distributions that satisfy Equation (22) but not (23), consider F1 = F2 to be a nonsymmetric distribution.
Note that we do not assume for (23) that the distributions are stochastically ordered or symmetric. If we assume finite
third moments, then Equation (23) only implies that both distributions have the same variance and their skewness has
opposite signs, ie, 𝜈F1 = −𝜈F2 , if we denote with 𝜈Fi the skewness of the distribution with cdf Fi , i = 1, 2.
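Condition (23) can be checked empirically: if the second distribution is the mirror image of the first about some point a (here a = 0, ie, X2 has the distribution of −X1), then (23) holds, and the placement-based variances from (13) and (14) coincide, so t0 = 1/2 by (20). An illustrative Python check with made-up counts (all names are ours):

```python
def midranks(pool):
    # Midranks: ties receive the average of their ranks.
    return [sum(v < x for v in pool) + (sum(v == x for v in pool) + 1) / 2
            for x in pool]

def placement_variances(x1, x2):
    """sigma1^2* and sigma2^2* from Equations (13) and (14) via placements."""
    n1, n2 = len(x1), len(x2)
    r = midranks(list(x1) + list(x2))
    place1 = [a - b for a, b in zip(r[:n1], midranks(x1))]  # P*_1k, Eq. (10)
    place2 = [a - b for a, b in zip(r[n1:], midranks(x2))]  # P*_2k
    m1, m2 = sum(place1) / n1, sum(place2) / n2
    s1_sq = sum((v - m1) ** 2 for v in place1) / (n1 * n2 ** 2)
    s2_sq = sum((v - m2) ** 2 for v in place2) / (n1 ** 2 * n2)
    return s1_sq, s2_sq

# A skewed sample and its reflection about a = 0, so F1(x) = 1 - F2(-x):
x1 = [0, 1, 1, 3, 7]
x2 = [-v for v in x1]
s1_sq, s2_sq = placement_variances(x1, x2)
```

The two distributions are neither symmetric nor stochastically equal, yet the two variances agree exactly, in line with the discussion of (23).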
Obviously, for a large class of distributions, the optimal allocation rate is exactly 1∕2. Bürkner et al1 already noticed
the robustness of the WMW test regarding the optimal allocation rate. When the optimal t0 is not equal to 1∕2, it is often
close to 1∕2. Furthermore, the exact choice of t typically only has a small influence on the required total sample size. This
applies not only to continuous and symmetric distributions but in general to arbitrary distributions.
4 DATA EXAMPLES
The generality of the approach proposed in this paper is demonstrated using different data examples with continuous
metric, discrete metric, and ordered categorical data. In this section, we first describe the data sets. Then, the calculated
sample sizes along with the actual achieved power in comparison with other sample size calculation methods are given.
For all data sets, we used the prior information from one group (eg, from a previous study or from literature) to gener-
ate synthetic data for the second group based on an interpretable effect specified by a subject matter expert. For ordered
categorical data, such an effect might be that a certain percentage of subjects in each category are moved to a better or
worse category. For metric data, it is possible to simply use a location shift as the effect of interest. Regardless of how the effects are chosen, in the end they are all translated into the so-called nonparametric relative effect, which itself provides another interpretable effect quantification that might be useful for practitioners, in addition to, eg, a location shift effect.
For all examples, we used 𝛼 = 0.05 as the type-I error rate and provide the output from an R function, which shows
the optimal t, the sample size determined for each group, and the ratio 𝜅 = 𝜎 2 ∕𝜎 1 . Furthermore, we provide simula-
tion results to assess the actual achieved power. The R Code is given in the Supplementary Material. For calculating the
asymptotic WMW test, we used the function rank.two.samples from the R package rankFD.31 For all simulations
performed with the statistical software R, we generated 104 data sets and used 0 as our starting seed value for drawing data
sets from the synthetic data. To compute the optimal allocation rate t0 and the sample sizes for each group, the function
WMWssp_Minimize from the R package WMWssp can be used.
written in terms of 𝜎 ∗ (see formula (4)). Then, for this sample size formula (9), we still need to calculate the variances
𝜎 ∗ , 𝜎1∗ , and 𝜎2∗ . We can do that by first calculating the placements for the data according to Equation (10). Then, we use
(12), (13), and (14) to obtain the quantities needed for the sample size formula.
In order to have a power of at least 80%, we need 24 subjects in each group, according to our method. When using
the optimal t0 ≈ 0.49, we need n1 = 23 and n2 = 24 subjects. In this case, the optimal allocation only reduces the
total number of subjects needed by one, in comparison with a balanced design. Applying Noether's formula in this case
yields sample sizes n1 = n2 = 26. Table 2 presents results from a power simulation regarding the different sample size
recommendations. Here, Noether's formula would lead to a slightly overpowered study.
unbalanced design. Tang5 derived a sample size formula for ordered categorical data. If we use his method, we obtain that
86 rats per group are needed. The closeness of his result to ours may be taken as confirmation that our unified approach
produces appropriate results also in the case of ordered categorical data.
In the aforementioned four data examples, we have used 𝛼 = 0.05 and 1 − 𝛽 = 0.8 or 0.9 for the sample size calculation
and power simulation according to the examples from the literature. By formula (9) and the intervals for t0 (Equations (15)
and (16)) in Section 3.1, the choice of 𝛼 and 𝛽 has an influence not only on the total sample size N but also on the optimal
allocation rate t0 . In order to study the behavior of these two parameters, we have performed two simulation studies,
which are described in Section 5.
5 SIMULATIONS

In this section, we assess in different simulations the behavior of the optimal allocation rate t0 when changing the nominal type-I error rate α, the power 1 − β, and the ratio of standard deviations κ = σ2/σ1.
For simulating the influence of 𝛼, we used Beta(5, 5) and Beta(3, i) distributed random numbers in the first and second
group for i = 1, 2, 3. For each 𝛼 = 0.01, 0.02, … , 0.1, we generated 106 random numbers for each group and calculated
the optimal allocation rate t0 and the total sample sizes N(t0 ) and N(1∕2) (corresponding to a balanced design) to achieve
at least 80% power. From the formula for the upper bound I2 of t0 , we already saw (Section 3.1) that larger values for the
type-I error rate 𝛼 would lead to a larger difference |I2 − 1∕2|. While we cannot conclude from this directly that t0 will
be more extreme, the optimal allocation rate will more likely tend to more extreme values, ie, the difference |t0 − 1∕2|
tends to become larger. We can see this behavior confirmed in Figure 1. In this simulation, we had p ≈ 0.5 and κ = 1.35, implying t0 < 1/2, for the case i = 3 (red curve); p = 0.657 and κ = 1.53 for i = 2 (green curve); and p = 0.84 and κ = 1.98 for i = 1 (blue curve). Note that an effect of p ≈ 0.5 makes no sense in a realistic scenario, as the calculated sample size would be much too large to be of practical relevance, but we use this setting regardless just to demonstrate the behavior of t0 with regard to the effect p. The ratio κ = σ2/σ1 also has an influence on the value of t0. Hence, we chose the alternatives in such a way that κ > 1. This means that t0 < 1/2, and if we increase p, then κ also increases. From that, we saw that more extreme effects (or larger values of κ) led to larger differences |t0 − 1/2|. This can also be seen from the upper bound I2.
In the data examples, we already found very little difference between using a balanced design or the optimal design.
The simulation study yielded a similar observation where the maximal difference was at most 1 for the medium and large
FIGURE 1 The graphic shows the values of the optimal allocation rate t0 for different values of the type-I error rate α where the goal is to detect a relevant effect with at least 80% power. For the reference group, we used Beta(5, 5) distributions, and for the treatment group, we assumed Beta(3, i), where i = 1, 2, 3. The red line represents i = 3 (relative effect p ≈ 0.5); for the green curve, we have used i = 2 (p ≈ 0.65); and for the blue curve, i = 1 (p ≈ 0.84) [Colour figure can be viewed at wileyonlinelibrary.com]
FIGURE 2 The graphic shows the values of the optimal allocation rate t0 for different values of the power for α = 0.05. For the reference group, we used Beta(5, 5) distributions, and for the treatment group, we assumed Beta(3, i), where i = 1, 2, 3. The red line represents i = 3 (relative effect p ≈ 0.5); for the green curve, we have used i = 2 (p ≈ 0.65); and for the blue curve, i = 1 (p ≈ 0.84) [Colour figure can be viewed at wileyonlinelibrary.com]
relative effect p, ie, max |N(t) − N(1∕2)| = 1. For the small effect p ≈ 0.5, the maximal difference was larger but still negli-
gible because the total sample size was very large for this setting. The detailed results are provided in the Supplementary
Material.
In a second simulation, we investigated the behavior of t0 for increasing power (or decreasing 𝛽). We used 𝛼 = 0.05 and
the same distributions as before. Therefore, p and 𝜅 were the same as aforementioned for the three different alternatives.
As values for the power, we chose 1 − 𝛽 = 0.5, … , 0.95 and generated 106 random numbers for each 𝛽 to calculate the
optimal allocation rate t0 . The results are displayed in Figure 2. Obviously, for 1 − 𝛽 = 0.5, we had t0 = 1∕2 in all cases.
A larger power led to more extreme values for t0 , but the difference in required sample sizes between the balanced and
optimal design was again negligible. The difference was again at most 1 for the medium and large relative effect p. Similar
to the simulation from before, more extreme values of the relative effect led to larger differences |t0 − 1∕2|.
6 DISCUSSION
In this paper, we have proposed a unified approach to sample size determination for the WMW two-sample rank sum test.
Our approach does not assume any specific type of data or a specific alternative hypothesis. In particular, data distributions
may be discrete or continuous. Based on the general formula, we have also derived an optimal allocation rate to both
groups, ie, to choose a value for t = n1 ∕N such that N is minimized. The value of this optimal allocation rate t0 mainly
depends on the ratio 𝜅 = 𝜎2∕𝜎1 (see (13) and (14) for a definition of these variances) and on 𝛽. The variance under the
null hypothesis has no influence on t0. For 𝜅 > 1, we have t0 < 1∕2; for 𝜅 < 1, we have t0 > 1∕2; and for 𝜅 = 1, we
have exactly t0 = 1∕2, assuming u_{1−𝛽} > 0. The nominal type-I error rate 𝛼 has only a small impact on the value of t0:
the larger 𝛼 is, the larger the difference |t0 − 1∕2|.
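The qualitative dependence of t0 on 𝜅 can be illustrated numerically with a stylized sample-size function of the common two-term structure, ie, a null-variance term proportional to 1∕(t(1 − t)) plus an alternative-variance term 𝜎1²∕t + 𝜎2²∕(1 − t). This is only a hypothetical stand-in with illustrative constants, not the formula from (13) and (14):

```python
import numpy as np

# Stylized sample-size function (illustrative constants, not the paper's formula);
# z_a and z_b stand for the normal quantiles u_{1-alpha/2} and u_{1-beta}.
def N_of_t(t, s1, s2, s0=0.2887, delta=0.15, z_a=1.959964, z_b=0.841621):
    null_sd = s0 * np.sqrt(1.0 / (t * (1.0 - t)))
    alt_sd = np.sqrt(s1**2 / t + s2**2 / (1.0 - t))
    return (z_a * null_sd + z_b * alt_sd) ** 2 / delta**2

def optimal_t(s1, s2):
    """Grid search for the allocation rate t0 that minimizes N(t)."""
    t = np.linspace(0.01, 0.99, 9801)
    return float(t[np.argmin(N_of_t(t, s1, s2))])

t0 = optimal_t(0.30, 0.60)  # kappa = sigma2/sigma1 = 2 > 1, so t0 < 1/2
print(t0, N_of_t(0.5, 0.30, 0.60) / N_of_t(t0, 0.30, 0.60))
```

With 𝜅 = 2 the minimizer lies below 1∕2 and with 𝜅 = 1 it is exactly 1∕2, while the saving of the optimal over the balanced design stays within a few percent, consistent with the findings above.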
We can see from the interval [I1 , I2 ] for the optimal allocation rate t0 derived in Section 3.1 that t0 will typically be close
to 1∕2. This was also confirmed in some illustrative data examples in Section 4. Furthermore, the difference in required
sample size between using a balanced design and using the optimal allocation design appears practically negligible.
In other words, in most cases, a balanced design can be recommended for the WMW test. In extensive simulations,
we have confirmed that the new procedure actually meets the power at the calculated sample sizes quite well. In special
cases, our sample size formula yields basically the same results as those by Lachin6 and Tang5 for ordinal data or Noether3
for continuous data (see Section 4). Matching the established results in these special cases is a desirable property for a
generally valid sample size formula. However, note that, for Noether's formula, the variance under the alternative hypothesis is approximated by the variance under the null hypothesis; hence, a difference from our formula is to be expected even for continuous data (see, eg, Table 6). The advantage of our new sample size formula is that it can be used universally
for different types of data. We also provide details on how to generate synthetic data based on an interpretable effect.
The new procedure has been implemented in the R package WMWssp.
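As a minimal illustration of the rank relations our approach builds on (the relative effect computed from pooled midranks, as in the nonparametric Behrens-Fisher setting27), consider the following Python sketch; the data are made up for illustration, and this is not code from the WMWssp package:

```python
import numpy as np
from scipy.stats import rankdata  # default method="average" yields midranks

def relative_effect_ranks(x, y):
    """Estimate the relative effect p = P(X < Y) + 0.5 * P(X = Y) from
    pooled midranks via p_hat = (Rbar_2 - (n2 + 1)/2) / n1, where Rbar_2
    is the mean pooled midrank of the second sample."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n1, n2 = len(x), len(y)
    r = rankdata(np.concatenate([x, y]))  # midranks handle ties
    return float((r[n1:].mean() - (n2 + 1) / 2.0) / n1)

# Illustrative data with ties:
print(relative_effect_ranks([1, 2, 2, 3, 5], [2, 3, 3, 4]))  # 0.65
```

The same value results from counting the pairs with x < y plus half the tied pairs, which is exactly how ties enter the relative effect for discrete data.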
ACKNOWLEDGEMENT
This research was supported by the Austrian Science Fund (FWF), grant I 2697-N31.
REFERENCES
1. Bürkner P-C, Doebler P, Holling H. Optimal design of the Wilcoxon–Mann–Whitney-test. Biom J. 2017;59(1):25-40.
2. Wang H, Chen B, Chow SC. Sample size determination based on rank tests in clinical trials. J Biopharm Stat. 2003;13(4):735-751.
3. Noether GE. Sample size determination for some common nonparametric tests. J Am Stat Assoc. 1987;82(398):645-647.
4. Fan C, Zhang D. A note on power and sample size calculations for the Kruskal–Wallis test for ordered categorical data. J Biopharm Stat.
2012;22(6):1162-1173.
5. Tang Y. Size and power estimation for the Wilcoxon–Mann–Whitney test for ordered categorical data. Statist Med. 2011;30(29):3461-3470.
6. Lachin JM. Power and sample size evaluation for the Cochran–Mantel–Haenszel mean score (Wilcoxon rank sum test) and the
Cochran–Armitage test for trend. Statist Med. 2011;30(25):3057-3066.
7. Hilton JF, Mehta CR. Power and sample size calculations for exact conditional tests with ordered categorical data. Biometrics.
1993;49(2):609-616.
8. Whitehead J. Sample size calculations for ordered categorical data. Statist Med. 1993;12(24):2257-2271.
9. Rahardja D, Zhao YD, Qu Y. Sample size determinations for the Wilcoxon–Mann–Whitney test: a comprehensive review. Stat Biopharm
Res. 2009;1(3):317-322.
10. Zhao YD, Rahardja D, Qu Y. Sample size calculation for the Wilcoxon–Mann–Whitney test adjusting for ties. Statist Med.
2008;27(3):462-468.
11. Shieh G, Jan SL, Randles RH. On power and sample size determinations for the Wilcoxon–Mann–Whitney test. J Nonparametric Stat.
2006;18(1):33-43.
12. Kolassa JE. A comparison of size and power calculations for the Wilcoxon statistic for ordered categorical data. Statist Med.
1995;14(14):1577-1581.
13. Rosner B, Glynn RJ. Power and sample size estimation for the Wilcoxon rank sum test with application to comparisons of C statistics from
alternative prediction models. Biometrics. 2009;65(1):188-197.
14. Chakraborti S, Hong B, van de Wiel MA. A note on sample size determination for a nonparametric test of location. Technometrics.
2006;48(1):88-94.
15. Lesaffre E, Scheys I, Fröhlich J, Bluhmki E. Calculation of power and sample size with bounded outcome scores. Statist Med.
1993;12(11):1063-1078.
16. Hamilton MA, Collings BJ. Determining the appropriate sample size for nonparametric tests for location shift. Technometrics.
1991;33(3):327-337.
17. Collings BJ, Hamilton MA. Estimating the power of the two-sample Wilcoxon test for location shift. Biometrics. 1988;44:847-860.
18. Matsouaka RA, Singhal AB, Betensky RA. An optimal Wilcoxon–Mann–Whitney test of mortality and a continuous outcome. Stat Methods
Med Res. 2016;27(8):2384-2400.
19. Dette H, O'Brien TE. Efficient experimental design for the Behrens-Fisher problem with application to bioassay. Am Stat.
2004;58(2):138-143.
20. Leppik IE, Dreifuss FE, Bowman T, et al. A double-blind crossover evaluation of progabide in partial seizures. Neurology. 1985;35(4):285.
21. Thall PF, Vail SC. Some covariance models for longitudinal count data with overdispersion. Biometrics. 1990;46:657-671.
22. Puntanen S, Styan GPH, Isotalo J. Matrix Tricks for Linear Statistical Models: Our Personal Top Twenty. Berlin, Germany: Springer; 2011.
23. Ruymgaart FH. A unified approach to the asymptotic distribution theory of certain midrank statistics. In: Raoult JP, ed. Statistique non
Parametrique Asymptotique. Berlin, Germany: Springer; 1980:1-18.
24. Akritas MG, Arnold SF, Brunner E. Nonparametric hypotheses and rank statistics for unbalanced factorial designs. J Am Stat Assoc.
1997;92(437):258-265.
25. Akritas MG, Brunner E. A unified approach to rank tests for mixed models. J Stat Plan Inference. 1997;61(2):249-277.
26. Brunner E, Puri ML. Nonparametric methods in factorial designs. Stat Pap. 2001;42(1):1-52.
27. Brunner E, Munzel U. The nonparametric Behrens-Fisher problem: asymptotic theory and a small-sample approximation. Biom J.
2000;42(1):17-25.
28. Seber GAF. A Matrix Handbook for Statisticians. Hoboken, NJ: John Wiley & Sons; 2008.
29. Orban J, Wolfe DA. A class of distribution-free two-sample tests based on placements. J Am Stat Assoc. 1982;77(379):666-672.
30. Orban J, Wolfe DA. Distribution-free partially sequential placement procedures. Commun Stat Theory Methods. 1980;9(9):883-904.
31. Konietschke F, Friedrich S, Brunner E, Pauly M. rankFD: rank-based tests for general factorial designs. 2016. R package version 0.0.1.
SUPPORTING INFORMATION
Additional supporting information may be found online in the Supporting Information section at the end of the article.
How to cite this article: Happ M, Bathke AC, Brunner E. Optimal sample size planning for the
Wilcoxon-Mann-Whitney test. Statistics in Medicine. 2018;1–13. https://doi.org/10.1002/sim.7983