Wallenius' noncentral hypergeometric distribution
Introduction
In probability theory and statistics, Wallenius' noncentral hypergeometric distribution is a generalization of the hypergeometric distribution where items are sampled with bias.
This distribution can be illustrated as an urn model with bias. Assume, for example, that an urn contains m1 red balls and m2 white balls, totalling N = m1 + m2 balls. Each red ball has the weight ω1 and each white ball has the weight ω2, and the odds ratio is defined as ω = ω1 / ω2. We now take n balls, one by one, in such a way that the probability of taking a particular ball at a particular draw equals its share of the total weight of all balls remaining in the urn at that moment. The number of red balls x1 obtained in this experiment is a random variable with Wallenius' noncentral hypergeometric distribution.
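The urn experiment can be simulated directly. The following Python sketch is illustrative only: the function name `draw_wallenius` and its arguments are assumptions made here, and white balls are given weight 1 so that each red ball has weight ω.

```python
import random

def draw_wallenius(m1, m2, n, omega, rng=random):
    """Simulate one biased urn experiment; return the number of red balls taken."""
    if not 0 <= n <= m1 + m2:
        raise ValueError("n must lie between 0 and m1 + m2")
    red, white = m1, m2              # balls still in the urn
    x = 0                            # red balls taken so far
    for _ in range(n):
        red_weight = omega * red     # total weight of the remaining red balls
        total_weight = red_weight + white
        # Take a red ball with probability red_weight / total_weight.
        if rng.random() * total_weight < red_weight:
            red -= 1
            x += 1
        else:
            white -= 1
    return x
```

Repeating the call many times, for example with `rng = random.Random(1)`, gives an empirical estimate of the distribution of x1.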
The matter is complicated by the fact that there is more than one noncentral hypergeometric distribution. Wallenius' noncentral hypergeometric distribution is obtained if balls are sampled one by one in such a way that there is competition between the balls. Fisher's noncentral hypergeometric distribution is obtained if the balls are sampled simultaneously or independently of each other. Unfortunately, both distributions are known in the literature as "the" noncentral hypergeometric distribution. It is important to be specific about which distribution is meant when using this name.
The two distributions are both equal to the (central) hypergeometric distribution when the odds ratio is 1.
It is far from obvious why these two distributions are different. See the Wikipedia entry on noncentral hypergeometric distributions for a more detailed explanation of the difference between these two probability distributions.
Univariate distribution
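Parameters | $m_1, m_2 \in \mathbb{N}$; $N = m_1 + m_2$; $n \in [0, N)$; $\omega \in \mathbb{R}_+$ |
---|---|
Support | $x \in [x_{\min}, x_{\max}]$, with $x_{\min} = \max(0, n - m_2)$ and $x_{\max} = \min(n, m_1)$ |
PMF | $\binom{m_1}{x} \binom{m_2}{n-x} \int_0^1 (1 - t^{\omega/D})^{x} (1 - t^{1/D})^{n-x} \, dt$, where $D = \omega (m_1 - x) + (m_2 - (n - x))$ |
Mean | Approximated by the solution $\mu$ to $\frac{\mu}{m_1} + \left(1 - \frac{n - \mu}{m_2}\right)^{\omega} = 1$ |
Variance | $\approx \frac{N a b}{(N - 1)(m_1 b + m_2 a)}$, where $a = \mu (m_1 - \mu)$ and $b = (n - \mu)(\mu + m_2 - n)$ |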
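The approximate mean and variance can be evaluated numerically. The sketch below assumes the mean equation μ/m1 + (1 − (n − μ)/m2)^ω = 1 and the variance approximation N·a·b / ((N − 1)(m1·b + m2·a)) with a = μ(m1 − μ) and b = (n − μ)(μ + m2 − n). It solves for μ by bisection, which is a choice made here for simplicity rather than a method prescribed by the references. The function name is hypothetical, and m1 and m2 are assumed to be positive.

```python
def approx_mean_variance(m1, m2, n, omega, tol=1e-12):
    """Approximate mean and variance of the univariate Wallenius distribution."""
    N = m1 + m2
    lo = float(max(0, n - m2))       # smallest possible x
    hi = float(min(n, m1))           # largest possible x

    def g(mu):
        # Left side of the mean equation minus 1; increasing in mu.
        base = max(0.0, 1.0 - (n - mu) / m2)
        return mu / m1 + base ** omega - 1.0

    for _ in range(200):             # bisection on [lo, hi]
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    mu = 0.5 * (lo + hi)

    a = mu * (m1 - mu)
    b = (n - mu) * (mu + m2 - n)
    denom = (N - 1) * (m1 * b + m2 * a)
    variance = N * a * b / denom if denom > 0 else 0.0
    return mu, variance
```

For ω = 1 the result reduces to the mean and variance of the central hypergeometric distribution, which is a convenient sanity check.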
Wallenius' distribution is particularly complicated because each ball has a probability of being taken that depends not only on its weight, but also on the total weight of its competitors, and the weight of the competing balls depends in turn on the outcomes of all preceding draws.
This recursive dependency gives rise to a difference equation with a solution that is given in open form by the integral in the expression of the probability mass function in the table above.
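As an illustration of the numerical integration approach, the sketch below evaluates the integral form of the probability mass function, C(m1, x)·C(m2, n − x)·∫₀¹ (1 − t^(ω/D))^x (1 − t^(1/D))^(n−x) dt with D = ω(m1 − x) + (m2 − (n − x)), by Simpson's rule. The substitution t = u^D, the step count, and the function name are choices made here and are not taken from the cited papers. The sketch assumes D ≥ 1, which holds whenever ω ≥ 1 and n < N.

```python
from math import comb

def wallenius_pmf_integral(x, n, m1, m2, omega, steps=2000):
    """Approximate P(X = x) by Simpson's rule on the integral form of the PMF."""
    x2 = n - x                                   # white balls taken
    if x < 0 or x2 < 0 or x > m1 or x2 > m2:
        return 0.0                               # outside the support
    d = omega * (m1 - x) + (m2 - x2)             # total weight left in the urn
    if d < 1.0:
        raise ValueError("this sketch assumes D >= 1")

    def h(u):
        # Integrand after substituting t = u**d, so that dt = d * u**(d - 1) du.
        return d * u ** (d - 1.0) * (1.0 - u ** omega) ** x * (1.0 - u) ** x2

    steps += steps % 2                           # Simpson's rule needs an even count
    width = 1.0 / steps
    total = h(0.0) + h(1.0)
    for k in range(1, steps):
        total += (4.0 if k % 2 else 2.0) * h(k * width)
    return comb(m1, x) * comb(m2, x2) * total * width / 3.0
```

The substitution removes the fractional powers of t near zero, where a fixed-step rule would otherwise converge slowly. Production code would need a more careful treatment of small D and of very peaked integrands.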
Closed form expressions for the probability mass function exist (Lyons, 1980), but they are not very useful for practical calculations because of extreme numerical instability, except in degenerate cases.
Several other calculation methods are used, including recursion, Taylor expansion and numerical integration (Fog, 2007, 2008).
The most reliable calculation method is recursive calculation of f(x,n) from f(x,n-1) and f(x-1,n-1) using the recursion formula given below under properties. The probabilities of all (x,n) combinations on all possible trajectories leading to the desired point are calculated, starting with f(0,0) = 1 as shown in the figure to the right. The total number of probabilities to calculate is n(x+1) − x². Other calculation methods must be used when n and x are so big that this method is too inefficient.
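A minimal sketch of the recursive calculation follows. It is written in a forward form: starting from f(0,0) = 1, each probability f(x,k−1) is split between f(x+1,k) and f(x,k) according to the weights of the balls still in the urn. For simplicity it computes the whole distribution after n draws rather than only the trajectories leading to a single x. The function name is an assumption, and white balls are given weight 1.

```python
def wallenius_pmf_table(n, m1, m2, omega):
    """Return {x: P(X = x)} after n draws, built up one draw at a time."""
    if not 0 <= n <= m1 + m2:
        raise ValueError("n must lie between 0 and m1 + m2")
    probs = {0: 1.0}                             # f(0, 0) = 1
    for k in range(1, n + 1):                    # k = balls drawn so far
        new = {}
        for x, p in probs.items():               # x red among the first k - 1
            red = m1 - x                         # red balls left in the urn
            white = m2 - (k - 1 - x)             # white balls left in the urn
            total = omega * red + white
            if red > 0:                          # next ball is red
                new[x + 1] = new.get(x + 1, 0.0) + p * omega * red / total
            if white > 0:                        # next ball is white
                new[x] = new.get(x, 0.0) + p * white / total
        probs = new
    return probs
```

For example, `wallenius_pmf_table(2, 2, 1, 2.0)` contains only the keys 1 and 2, because at most one white ball can be taken. Every term added is non-negative, so no cancellation occurs.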
Properties
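Wallenius' distribution has fewer symmetry relations than Fisher's noncentral hypergeometric distribution has. Writing wnchypg(x; n, m1, m2, ω) for the probability mass function, the only symmetry relates to the swapping of colors:
$$\operatorname{wnchypg}(x; n, m_1, m_2, \omega) = \operatorname{wnchypg}(n - x; n, m_2, m_1, 1/\omega).$$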
Unlike Fisher's distribution, Wallenius' distribution has no symmetry relating to the number of balls not taken.
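The following recursion formula is useful for calculating the probability mass function wnchypg(x; n, m1, m2, ω):
$$\operatorname{wnchypg}(x; n, m_1, m_2, \omega) = \operatorname{wnchypg}(x - 1; n - 1, m_1, m_2, \omega) \, \frac{(m_1 - x + 1)\,\omega}{(m_1 - x + 1)\,\omega + m_2 + x - n} + \operatorname{wnchypg}(x; n - 1, m_1, m_2, \omega) \, \frac{m_2 + x - n + 1}{(m_1 - x)\,\omega + m_2 + x - n + 1}.$$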
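Another recursion formula, obtained by conditioning on the color of the first ball taken, is also known:
$$\operatorname{wnchypg}(x; n, m_1, m_2, \omega) = \operatorname{wnchypg}(x - 1; n - 1, m_1 - 1, m_2, \omega) \, \frac{m_1 \omega}{m_1 \omega + m_2} + \operatorname{wnchypg}(x; n - 1, m_1, m_2 - 1, \omega) \, \frac{m_2}{m_1 \omega + m_2}.$$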
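The second recursion can be sketched by conditioning on the first ball taken: with probability m1·ω / (m1·ω + m2) it is red, leaving an urn with m1 − 1 red balls, and otherwise it is white, leaving m2 − 1 white balls. The memoized helper below is a hypothetical illustration, not code from the cited software.

```python
from functools import lru_cache

def wallenius_pmf_first_draw(x, n, m1, m2, omega):
    """P(X = x) via the recursion that conditions on the first ball taken."""

    @lru_cache(maxsize=None)
    def f(x, n, m1, m2):
        if x < 0 or x > m1 or n - x < 0 or n - x > m2:
            return 0.0                           # outside the support
        if n == 0:
            return 1.0                           # nothing drawn: x must be 0
        total = m1 * omega + m2                  # weight of the full urn
        p = 0.0
        if m1 > 0:                               # first ball is red
            p += f(x - 1, n - 1, m1 - 1, m2) * m1 * omega / total
        if m2 > 0:                               # first ball is white
            p += f(x, n - 1, m1, m2 - 1) * m2 / total
        return p

    return f(x, n, m1, m2)
```

The recursion depth equals n, so this form is only suitable for moderate n.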
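The probability wnchypg(x; n, m1, m2, ω) is bounded by
$$f_1(x) \le \operatorname{wnchypg}(x; n, m_1, m_2, \omega) \le f_2(x) \quad \text{for } \omega < 1,$$
$$f_1(x) \ge \operatorname{wnchypg}(x; n, m_1, m_2, \omega) \ge f_2(x) \quad \text{for } \omega > 1,$$
where, with $x_2 = n - x$,
$$f_1(x) = \binom{m_1}{x} \binom{m_2}{n - x} \frac{n!}{(m_1 + m_2/\omega)^{\underline{x}} \, (m_2 + \omega (m_1 - x))^{\underline{n - x}}},$$
$$f_2(x) = \binom{m_1}{x} \binom{m_2}{n - x} \frac{n!}{(m_1 + (m_2 - x_2)/\omega)^{\underline{x}} \, (m_2 + \omega m_1)^{\underline{n - x}}},$$
and the underlined superscript indicates the falling factorial $a^{\underline{b}} = a (a - 1) \cdots (a - b + 1)$.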
Multivariate distribution
The distribution can be expanded to any number of colors c of balls in the urn. The multivariate distribution is used when there are more than two colors.
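Parameters | $c \in \mathbb{N}$; $\mathbf{m} = (m_1, \ldots, m_c) \in \mathbb{N}^c$; $N = \sum_{i=1}^{c} m_i$; $n \in [0, N)$; $\boldsymbol{\omega} = (\omega_1, \ldots, \omega_c) \in \mathbb{R}_+^c$ |
---|---|
Support | $S = \{\mathbf{x} \in \mathbb{Z}_{0+}^c : \sum_{i=1}^{c} x_i = n,\ x_i \le m_i\}$ |
PMF | $\left(\prod_{i=1}^{c} \binom{m_i}{x_i}\right) \int_0^1 \prod_{i=1}^{c} (1 - t^{\omega_i/D})^{x_i} \, dt$, where $D = \boldsymbol{\omega} \cdot (\mathbf{m} - \mathbf{x}) = \sum_{i=1}^{c} \omega_i (m_i - x_i)$ |
Mean | Approximated by the solution $\mu_1, \ldots, \mu_c$ to $\left(1 - \frac{\mu_1}{m_1}\right)^{1/\omega_1} = \left(1 - \frac{\mu_2}{m_2}\right)^{1/\omega_2} = \cdots = \left(1 - \frac{\mu_c}{m_c}\right)^{1/\omega_c}$, subject to $\sum_{i=1}^{c} \mu_i = n$ and $0 \le \mu_i \le m_i$ |
Variance | Approximated by variance of Fisher's noncentral hypergeometric distribution with same mean. |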
The probability mass function can be calculated by various Taylor expansion methods or by numerical integration (Fog, 2008).
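A simple numerical integration of the multivariate probability mass function can be sketched as follows. It assumes the integral form (∏ C(mi, xi)) ∫₀¹ ∏ (1 − t^(ωi/D))^xi dt with D = Σ ωi(mi − xi), and uses Simpson's rule after the substitution t = u^D. The function name, the substitution, and the step count are choices made here and do not reproduce the methods of Fog (2008). The sketch requires D ≥ 1, which holds when all weights are at least 1 and n < N.

```python
from math import comb, prod

def mwallenius_pmf_integral(x, m, omega, steps=2000):
    """Approximate the multivariate PMF at x; n is implied by sum(x)."""
    if not (len(x) == len(m) == len(omega)):
        raise ValueError("x, m and omega must have the same length")
    if any(xi < 0 or xi > mi for xi, mi in zip(x, m)):
        return 0.0                               # outside the support
    d = sum(w * (mi - xi) for w, mi, xi in zip(omega, m, x))
    if d < 1.0:
        raise ValueError("this sketch assumes D >= 1")

    def h(u):
        # Integrand after substituting t = u**d, so that dt = d * u**(d - 1) du.
        value = d * u ** (d - 1.0)
        for xi, w in zip(x, omega):
            value *= (1.0 - u ** w) ** xi
        return value

    steps += steps % 2                           # Simpson's rule needs an even count
    width = 1.0 / steps
    total = h(0.0) + h(1.0)
    for k in range(1, steps):
        total += (4.0 if k % 2 else 2.0) * h(k * width)
    coeff = prod(comb(mi, xi) for mi, xi in zip(m, x))
    return coeff * total * width / 3.0
```

With two colors and weights (ω, 1) the function reproduces the univariate distribution.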
A reasonably good approximation to the mean can be calculated using the equation given above. The equation is solved by defining θ so that
$$\mu_i = m_i \left(1 - e^{\omega_i \theta}\right)$$
and solving
$$\sum_{i=1}^{c} \mu_i = n$$
for θ by Newton-Raphson iteration.
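The Newton-Raphson step can be sketched as follows, assuming the parametrization μi = mi(1 − exp(ωi·θ)) with θ ≤ 0 and the constraint Σ μi = n. The function name, the starting point θ = 0, and the stopping rule are assumptions made for this illustration.

```python
from math import exp

def approx_mean_multivariate(m, omega, n, tol=1e-12, max_iter=100):
    """Approximate the mean vector of the multivariate Wallenius distribution."""
    if not 0 <= n < sum(m):
        raise ValueError("this sketch requires 0 <= n < N")
    theta = 0.0                                  # all mu_i are zero at theta = 0
    for _ in range(max_iter):
        e = [exp(w * theta) for w in omega]
        # F(theta) = sum_i mu_i(theta) - n, and its derivative with respect to theta.
        F = sum(mi * (1.0 - ei) for mi, ei in zip(m, e)) - n
        dF = -sum(mi * w * ei for mi, w, ei in zip(m, omega, e))
        step = F / dF
        theta -= step                            # Newton-Raphson update
        if abs(step) < tol:
            break
    return [mi * (1.0 - exp(w * theta)) for mi, w in zip(m, omega)]
```

Because F is concave and decreasing in θ, iterates started at θ = 0 approach the root from one side. Convergence may still be slow when n is close to N.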
No good way of calculating the variance is known. The best known method is to approximate the multivariate Wallenius distribution by a multivariate Fisher's noncentral hypergeometric distribution with the same mean, and insert the mean as calculated above in the approximate formula for the variance of the latter distribution.
Properties
The order of the colors is arbitrary so that any colors can be swapped.
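The weights can be arbitrarily scaled:
- $\operatorname{mwnchypg}(\mathbf{x}; n, \mathbf{m}, \boldsymbol{\omega}) = \operatorname{mwnchypg}(\mathbf{x}; n, \mathbf{m}, r\boldsymbol{\omega})$ for all $r \in \mathbb{R}_+$.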
Colors with zero number (mi = 0) or zero weight (ωi = 0) can be omitted from the equations.
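Colors with the same weight can be joined:
$$\operatorname{mwnchypg}(\mathbf{x}; n, \mathbf{m}, (\omega_1, \ldots, \omega_{c-1}, \omega_{c-1})) = \operatorname{mwnchypg}((x_1, \ldots, x_{c-1} + x_c); n, (m_1, \ldots, m_{c-1} + m_c), (\omega_1, \ldots, \omega_{c-1})) \cdot \operatorname{hypg}(x_c; x_{c-1} + x_c, m_c, m_{c-1} + m_c),$$
where $\operatorname{hypg}(x; n, m, N)$ is the (univariate, central) hypergeometric distribution probability.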
Complementary Wallenius' noncentral hypergeometric distribution
The balls that are not taken in the urn experiment have a distribution that is different from Wallenius' noncentral hypergeometric distribution, due to a lack of symmetry. The distribution of the balls not taken can be called the complementary Wallenius' noncentral hypergeometric distribution.
Probabilities in the complementary distribution are calculated from Wallenius' distribution by replacing n with N-n, xi with mi - xi, and ωi with 1/ωi.
Applications
- Wallenius' distribution is useful as a model of natural selection in population genetics and modern evolutionary theory. Natural selection occurs if individuals of a species compete for a limited resource and if there are different variants (phenotypes) of the species with different probabilities of survival.
- More generally, Wallenius' distribution can be used as a statistical model of competition for any limited resource if the outcome of the competition is random, but biased, and if the resource is quantized into discrete units.
- In models of biased sampling, Wallenius' distribution is useful for estimating bias or for correcting for a known bias.
- In statistical tests relating to contingency tables, Wallenius' distribution is useful for testing if there is bias or not. It is more common to use Fisher's noncentral hypergeometric distribution for this purpose in Fisher's exact test, but it may be more correct to use Wallenius' noncentral hypergeometric distribution in cases where margins are fixed prior to the experiment (Wallenius, 1963).
In many applications, there can be uncertainty about whether to use Fisher's noncentral hypergeometric distribution or Wallenius' noncentral hypergeometric distribution. Please see the entry for noncentral hypergeometric distributions for a discussion of the difference between these two distributions.
Software available
- An implementation for the R programming language is available as the package BiasedUrn. It includes univariate and multivariate probability mass functions, distribution functions, random variate generating functions, mean and variance.
- An implementation in C++ is available from www.agner.org.
See also
- Noncentral hypergeometric distributions
- Fisher's noncentral hypergeometric distribution
- Hypergeometric distribution
- Urn models
- Biased sample
- Bias
- Population genetics
References
Chesson, J. (1976), "A non-central multivariate hypergeometric distribution arising from biased sampling with application to selective predation", Journal of Applied Probability, vol. 13, pp. 795–797.
Fog, A. (2007), Random number theory.
Fog, A. (2008), "Calculation Methods for Wallenius' Noncentral Hypergeometric Distribution", Communications in Statistics, Simulation and Computation, vol. 37, no. 2, pp. (forthcoming).
Johnson, N. L.; Kemp, A. W.; Kotz, S. (2005), Univariate Discrete Distributions, Hoboken, New Jersey: Wiley and Sons.
Lyons, N. I. (1980), "Closed Expressions for Noncentral Hypergeometric Probabilities", Communications in Statistics, B, vol. 9, pp. 313–314.
Manly, B. F. J. (1974), "A Model for Certain Types of Selection Experiments", Biometrics, vol. 30, pp. 281–294.
Wallenius, K. T. (1963), Biased Sampling: The Non-central Hypergeometric Probability Distribution. Ph.D. Thesis, Stanford University, Department of Statistics.