Negative multinomial distribution: Difference between revisions



Revision as of 03:50, 7 November 2009

In probability theory and statistics, the Negative Multinomial Distribution (NMD) is a generalization of the two-parameter Negative Binomial distribution (NB(r,p)) to $m+1$ outcomes, where $m\geq 1$ ^[1],^[2]. Suppose we have an experiment that generates $m+1$ possible outcomes, $\{X_{0},\cdots ,X_{m}\}$ , each occurring with probability $\{p_{0},\cdots ,p_{m}\}$ , respectively, where $0<p_{i}<1$ and $\sum _{i=0}^{m}{p_{i}}=1$ . That is, $p_{0}=1-\sum _{i=1}^{m}{p_{i}}$ . If the experiment proceeds to generate independent outcomes until the default outcome $X_{0}$ has occurred exactly $k_{0}$ times, by which point $\{X_{1},\cdots ,X_{m}\}$ have occurred $\{k_{1},\cdots ,k_{m}\}$ times, then the distribution of the m-tuple of counts $\{X_{1},\cdots ,X_{m}\}$ is Negative Multinomial with parameter vector $(k_{0},\{p_{1},\cdots ,p_{m}\})$ . Notice that the number of free probability parameters here is actually m, not (m+1), because $p_{0}$ is determined by the other probabilities. That is why we only have a probability parameter vector of size m, not (m+1). This contrasts with the combinatorial interpretation of Negative Binomial, which is a special case of NMD with m=1:

$X\sim NegativeBinomial(r,p)$ : X = total number of experiments (n) needed to get r successes (and therefore n-r failures);

$X\sim NegativeMultinomial(k_{0},\{p_{0},p_{1}\})$ : X = total number of experiments (n) needed to get $k_{0}$ occurrences of the default outcome ( $X_{0}$ ) and $n-k_{0}$ occurrences of the one other possible outcome ( $X_{1}$ ).
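The stopping rule described above can be sketched as a small simulation. This is only an illustration under stated assumptions: the function name `sample_negative_multinomial` and its arguments are hypothetical and do not come from the cited sources; outcome 0 plays the role of the default outcome $X_{0}$.

```python
import random

def sample_negative_multinomial(k0, probs, rng=random):
    """Draw one vector of counts (X_1, ..., X_m).

    probs = [p_1, ..., p_m]; p_0 = 1 - sum(probs) is the probability of the
    default outcome. Trials are repeated until the default outcome has
    occurred k0 times.
    """
    p0 = 1.0 - sum(probs)
    if k0 <= 0 or p0 <= 0 or any(p <= 0 for p in probs):
        raise ValueError("need k0 > 0 and all probabilities in (0, 1)")
    outcomes = list(range(len(probs) + 1))   # 0 is the default outcome
    weights = [p0] + list(probs)
    counts = [0] * len(probs)
    defaults_seen = 0
    while defaults_seen < k0:
        outcome = rng.choices(outcomes, weights=weights)[0]
        if outcome == 0:
            defaults_seen += 1
        else:
            counts[outcome - 1] += 1
    return counts

# One draw with k0 = 10 and (p_1, p_2, p_3) = (0.2, 0.1, 0.2)
print(sample_negative_multinomial(10, [0.2, 0.1, 0.2], random.Random(0)))
```

Averaging many such draws should give values close to $k_{0}p_{i}/p_{0}$ for each count.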

Negative multinomial distribution example

The table below shows an example of 400 melanoma (skin cancer) patients, where the type and site of the cancer are recorded for each subject.

Type \ Site	Head and Neck	Trunk	Extremities	Row Totals
Hutchinson's melanomic freckle	22	2	10	34
Superficial	16	54	115	185
Nodular	19	33	73	125
Indeterminate	11	17	28	56
Column Totals	68	106	226	400

The sites (locations) of the cancer may be independent, but there may be positive dependencies among the types of cancer at a given location (site). For example, localized exposure to radiation implies that an elevated level of one type of cancer (at a given location) may indicate a higher level of another cancer type at the same location. The Negative Multinomial distribution may be used to model the site-specific cancer rates and help measure some of the cancer type dependencies within each location.

If $x_{i,j}$ denote the cancer counts for each site ( $0\leq i\leq 2$ ) and each type of cancer ( $0\leq j\leq 3$ ), then for a fixed site ( $i_{0}$ ) the counts of the cancer types are modeled jointly as a Negative Multinomial random vector. That is, for each column index (site) the column-vector X has the following distribution:

X=\{X_{1},X_{2},X_{3}\}\sim NMD(k_{0},\{p_{1},p_{2},p_{3}\}).

Different columns in the table (sites) are considered to be different instances of the negative multinomially distributed random vector, X. Then we have the following estimates of expected counts (frequencies of cancer):

{\hat {\mu }}_{i,j}={\frac {x_{i,.}\times x_{.,j}}{x_{.,.}}}

x_{i,.}=\sum _{j=0}^{3}{x_{i,j}}

x_{.,j}=\sum _{i=0}^{2}{x_{i,j}}

x_{.,.}=\sum _{i=0}^{2}\sum _{j=0}^{3}{x_{i,j}}

Example (Head and Neck site, i=0; Hutchinson's melanomic freckle type, j=0):

{\hat {\mu }}_{0,0}={\frac {x_{0,.}\times x_{.,0}}{x_{.,.}}}={\frac {68\times 34}{400}}=5.78
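The expected-count estimate can be checked directly from the table. The sketch below is illustrative only; the variable names (`counts`, `type_totals`, `site_totals`) are assumptions introduced here, and the numbers are copied from the melanoma table above.

```python
# Rows are cancer types; columns are sites (Head and Neck, Trunk, Extremities).
counts = {
    "Hutchinson's melanomic freckle": [22, 2, 10],
    "Superficial": [16, 54, 115],
    "Nodular": [19, 33, 73],
    "Indeterminate": [11, 17, 28],
}
sites = ["Head and Neck", "Trunk", "Extremities"]

# Marginal totals of the table
type_totals = {name: sum(row) for name, row in counts.items()}
site_totals = [sum(row[i] for row in counts.values()) for i in range(len(sites))]
grand_total = sum(type_totals.values())

def expected_count(cancer_type, site_index):
    """(type total) x (site total) / (grand total)."""
    return type_totals[cancer_type] * site_totals[site_index] / grand_total

mu_hat = expected_count("Hutchinson's melanomic freckle", 0)
print(round(mu_hat, 2))  # 5.78
```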

For the first site (Head and Neck, i=0), suppose that the observed counts are $X=\left\{X_{1}=5,X_{2}=1,X_{3}=5\right\}$ and $X\sim NMD(k_{0}=10,\{p_{1}=0.2,p_{2}=0.1,p_{3}=0.2\})$ . Then:

p_{0}=1-\sum _{i=1}^{3}{p_{i}}=0.5

NMD(X|k_{0},\{p_{1},p_{2},p_{3}\})=0.00465585119998784
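The probability value above can be reproduced with a few lines of code. This is a minimal sketch, assuming the mass function $\Gamma (k_{0}+\sum x_{i})/(\Gamma (k_{0})\prod x_{i}!)\,p_{0}^{k_{0}}\prod p_{i}^{x_{i}}$ ; the helper name `nmd_pmf` is hypothetical, and the computation is done on the log scale to avoid overflow for large counts.

```python
from math import exp, lgamma, log

def nmd_pmf(x, k0, probs):
    """Negative multinomial probability of the counts x = (x_1, ..., x_m)."""
    p0 = 1.0 - sum(probs)
    # log of Gamma(k0 + sum x) / Gamma(k0) * p0^k0
    log_pmf = lgamma(k0 + sum(x)) - lgamma(k0) + k0 * log(p0)
    # log of prod p_i^x_i / x_i!
    for xi, pi in zip(x, probs):
        log_pmf += xi * log(pi) - lgamma(xi + 1)
    return exp(log_pmf)

print(nmd_pmf([5, 1, 5], 10, [0.2, 0.1, 0.2]))  # approximately 0.0046558512
```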

cov[X_{1},X_{3}]={\frac {10\times 0.2\times 0.2}{0.5^{2}}}=1.6

\mu _{2}={\frac {k_{0}p_{2}}{p_{0}}}={\frac {10\times 0.1}{0.5}}=2.0

\mu _{3}={\frac {k_{0}p_{3}}{p_{0}}}={\frac {10\times 0.2}{0.5}}=4.0

corr[X_{2},X_{3}]=\left({\frac {\mu _{2}\times \mu _{3}}{(k_{0}+\mu _{2})(k_{0}+\mu _{3})}}\right)^{\frac {1}{2}}

and therefore,

corr[X_{2},X_{3}]=\left({\frac {2\times 4}{(10+2)(10+4)}}\right)^{\frac {1}{2}}=0.21821789023599242.
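The mean, covariance and correlation values in this example can be recomputed from the parameters. The following sketch assumes the covariance formulas $k_{0}p_{i}p_{j}/p_{0}^{2}$ (for $i\neq j$) and $k_{0}p_{i}(p_{i}+p_{0})/p_{0}^{2}$ (for $i=j$); the function name `nmd_moments` is an illustrative choice, not an established API.

```python
from math import sqrt

def nmd_moments(k0, probs):
    """Return the mean vector and covariance matrix of NMD(k0, probs)."""
    p0 = 1.0 - sum(probs)
    m = len(probs)
    mean = [k0 * p / p0 for p in probs]
    cov = [
        [k0 * probs[i] * (probs[j] + (p0 if i == j else 0.0)) / p0 ** 2
         for j in range(m)]
        for i in range(m)
    ]
    return mean, cov

mean, cov = nmd_moments(10, [0.2, 0.1, 0.2])
# Python indices are zero-based: X_2 is index 1 and X_3 is index 2.
corr_23 = cov[1][2] / sqrt(cov[1][1] * cov[2][2])
print(mean[1], mean[2], cov[0][2], corr_23)
```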

You can also use the interactive SOCR negative multinomial distribution calculator to compute these quantities.

Notice that the pair-wise NMD correlations are always positive, whereas the correlations between multinomial counts are always negative. As the parameter $k_{0}$ increases with the means $\mu _{i}$ held fixed, the paired correlations tend to zero. Thus, for large $k_{0}$ , the Negative Multinomial counts $X_{i}$ behave as independent Poisson random variables with respect to their means $\left(\mu _{i}=k_{0}{\frac {p_{i}}{p_{0}}}\right)$ .
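The decay of the correlation can be illustrated numerically. The sketch below uses the correlation formula from the example above with the means fixed at the example values $\mu _{2}=2$ and $\mu _{3}=4$ ; the function name `nmd_correlation` and the chosen grid of $k_{0}$ values are assumptions made for illustration.

```python
from math import sqrt

def nmd_correlation(mu_i, mu_j, k0):
    """Pairwise correlation expressed through the means and k0."""
    return sqrt(mu_i * mu_j / ((k0 + mu_i) * (k0 + mu_j)))

# The correlation stays positive and shrinks as k0 grows.
for k0 in (10, 100, 1000, 10000):
    print(k0, nmd_correlation(2.0, 4.0, k0))
```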

The marginal distribution of each of the $X_{i}$ variables is negative binomial, as the $X_{i}$ count (considered as success) is measured against all the other outcomes (failure). But jointly, the distribution of $X=\{X_{1},\cdots ,X_{m}\}$ is negative multinomial, i.e., $X\sim NMD(k_{0},\{p_{1},\cdots ,p_{m}\})$ .

Parameter estimation

Estimation of the mean (expected) frequency counts ( $\mu _{j}$ ) of each outcome ( $X_{j}$ ) using maximum likelihood is possible. If we have a single observation vector $\{x_{1},\cdots ,x_{m}\}$ , then ${\hat {\mu }}_{i}=x_{i}.$ If we have several observation vectors, as in this case where we have the cancer type frequencies for 3 different sites, then the maximum likelihood estimates of the mean counts are ${\hat {\mu }}_{j}={\frac {x_{.,j}}{I}}$ , where $0\leq j\leq J$ is the cancer-type index and the summation is over the I observed (sampled) vectors. For the cancer data above, we have the following maximum likelihood estimates for the expectations of the frequency counts:

Hutchinson's melanomic freckle type of cancer ( $X_{0}$ ) is ${\hat {\mu }}_{0}=34/3=11.33$ .

Superficial type of cancer ( $X_{1}$ ) is ${\hat {\mu }}_{1}=185/3=61.67$ .

Nodular type of cancer ( $X_{2}$ ) is ${\hat {\mu }}_{2}=125/3=41.67$ .

Indeterminate type of cancer ( $X_{3}$ ) is ${\hat {\mu }}_{3}=56/3=18.67$ .
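The four mean-count estimates are simply the type totals divided by the number of sites. A short sketch, with variable names (`type_totals`, `n_sites`, `mu_hat`) chosen here for illustration:

```python
# Type totals taken from the melanoma table; 3 sites act as 3 observed vectors.
type_totals = {
    "Hutchinson's melanomic freckle": 34,
    "Superficial": 185,
    "Nodular": 125,
    "Indeterminate": 56,
}
n_sites = 3

mu_hat = {name: total / n_sites for name, total in type_totals.items()}
for name, value in mu_hat.items():
    print(name, round(value, 2))
```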

There is no maximum likelihood estimate (MLE) for the NMD $k_{0}$ parameter ^[3],^[4]. However, there are approximate protocols for estimating the $k_{0}$ parameter using the chi-squared goodness of fit statistic. In the usual chi-squared statistic:

\mathrm {X} ^{2}=\sum _{i}{\frac {(x_{i}-\mu _{i})^{2}}{\mu _{i}}},

we can replace the expected means ( $\mu _{i}$ ) by their estimates, ${\hat {\mu }}_{i}$ , and replace the denominators by the corresponding negative multinomial variances. Then we get the following test statistic for negative multinomial distributed data:

\mathrm {X} ^{2}(k_{0})=\sum _{i}{\frac {(x_{i}-{\hat {\mu }}_{i})^{2}}{{\hat {\mu }}_{i}\left(1+{\frac {{\hat {\mu }}_{i}}{k_{0}}}\right)}}.

Next, we can estimate the $k_{0}$ parameter by varying the values of $k_{0}$ in the expression $\mathrm {X} ^{2}(k_{0})$ and matching the values of this statistic with the corresponding asymptotic chi-squared distribution. The following protocol summarizes these steps using the cancer data above.

DF: The degrees of freedom for the chi-squared distribution in this case are:

df = (# rows – 1)(# columns – 1) = (4-1)*(3-1) = 6

Median: The median of a chi-squared random variable with 6 df is 5.261948.

Mean Counts Estimates: The mean count estimates ( ${\hat {\mu }}_{j}$ ) for the 4 different cancer types are: ${\hat {\mu }}_{0}=34/3=11.33$ ; ${\hat {\mu }}_{1}=185/3=61.67$ ; ${\hat {\mu }}_{2}=125/3=41.67$ ; and ${\hat {\mu }}_{3}=56/3=18.67$ .

Thus, we can solve the equation above, $\mathrm {X} ^{2}(k_{0})=5.261948$ , for the single variable of interest, the unknown parameter $k_{0}$ . In the cancer example, suppose $x=\{x_{1}=5,x_{2}=1,x_{3}=5\}$ . Then, the solution is an asymptotic chi-squared distribution driven estimate of the parameter $k_{0}$ .

\mathrm {X} ^{2}(k_{0})=\sum _{i=1}^{3}{\frac {(x_{i}-{\hat {\mu }}_{i})^{2}}{{\hat {\mu }}_{i}\left(1+{\frac {{\hat {\mu }}_{i}}{k_{0}}}\right)}}.

\mathrm {X} ^{2}(k_{0})={\frac {(5-61.67)^{2}}{61.67(1+61.67/k_{0})}}+{\frac {(1-41.67)^{2}}{41.67(1+41.67/k_{0})}}+{\frac {(5-18.67)^{2}}{18.67(1+18.67/k_{0})}}=5.261948.

Solving this equation for $k_{0}$ provides the desired estimate for the last parameter.

Mathematica provides 3 distinct ( $k_{0}$ ) solutions to this equation: {-50.5466, -21.5204, 2.40461}. Since $k_{0}>0$ , only one candidate solution remains, $k_{0}\approx 2.40461$ .
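The equation can also be solved numerically without a computer algebra system. The sketch below is one possible approach, not the method used in the cited sources: it applies plain bisection on the positive axis, where each term of the statistic grows with $k_{0}$ . The function names `chi_square_stat` and `solve_k0`, the bracketing interval and the tolerance are all assumptions; the rounded mean estimates and the target value 5.261948 are taken from the text.

```python
def chi_square_stat(k0, x, mu):
    """X^2(k0) with negative multinomial variances in the denominators."""
    return sum((xi - mi) ** 2 / (mi * (1 + mi / k0)) for xi, mi in zip(x, mu))

def solve_k0(x, mu, target, lo=1e-9, hi=1e6, tol=1e-10):
    """Bisection for the positive k0 with chi_square_stat(k0) == target."""
    def f(k):
        return chi_square_stat(k, x, mu) - target
    if f(lo) > 0 or f(hi) < 0:
        raise ValueError("target is not bracketed on (lo, hi)")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid   # statistic too small: the root lies to the right
        else:
            hi = mid   # statistic too large: the root lies to the left
    return (lo + hi) / 2

k0_hat = solve_k0([5, 1, 5], [61.67, 41.67, 18.67], 5.261948)
print(k0_hat)  # close to 2.40461
```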

Estimates of probabilities: Assume $k_{0}=2$ and ${\frac {\mu _{i}}{k_{0}}}p_{0}=p_{i}$ , then (with the ratios rounded to integers):

{\frac {61.67}{k_{0}}}p_{0}\approx 31p_{0}=p_{1}

20p_{0}=p_{2}

9p_{0}=p_{3}

Hence, $1-p_{0}=p_{1}+p_{2}+p_{3}=60p_{0}$ , and $p_{0}={\frac {1}{61}}$ , $p_{1}={\frac {31}{61}}$ , $p_{2}={\frac {20}{61}}$ and $p_{3}={\frac {9}{61}}$ .

Therefore, the best model distribution for the observed sample $x=\{x_{1}=5,x_{2}=1,x_{3}=5\}$ is

X\sim NMD\left(2,\left\{{\frac {31}{61}},{\frac {20}{61}},{\frac {9}{61}}\right\}\right).
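The probability estimates follow from the constraint that all probabilities sum to one. A minimal sketch using exact fractions and the rounded ratios 31, 20 and 9 from the text; the helper name `nmd_probabilities` is hypothetical.

```python
from fractions import Fraction

def nmd_probabilities(ratios):
    """Given ratios[i] = mu_i / k0, return p_0 and [p_1, ..., p_m].

    Uses p_i = ratios[i] * p_0 together with p_0 + p_1 + ... + p_m = 1.
    """
    p0 = 1 / (1 + sum(ratios))
    return p0, [r * p0 for r in ratios]

p0, probs = nmd_probabilities([Fraction(31), Fraction(20), Fraction(9)])
print(p0, probs)
```

Passing unrounded ratios such as 61.67/2 to the same function gives slightly different probabilities, so the fractions above reflect the rounding used in the text.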

Related distributions

Online visualization tools

SOCR Interactive Negative Multinomial distribution calculator

References

[1] Le Gall, F. The modes of a negative multinomial distribution. Statistics & Probability Letters, Volume 76, Issue 6, 15 March 2006, Pages 619-624, ISSN 0167-7152, doi:10.1016/j.spl.2005.09.009.
[2] SOCR EBook, Probability and Statistics EBook, SOCR Publications, 2007.
[3] Le Gall, F. The modes of a negative multinomial distribution. Statistics & Probability Letters, Volume 76, Issue 6, 15 March 2006, Pages 619-624, ISSN 0167-7152, doi:10.1016/j.spl.2005.09.009.
[4] Zelterman, Daniel (2002). Advanced log-linear models using SAS. SAS Publishing. p. 196. ISBN 9781590470800.


Iwaterpolo (talk) 03:50, 7 November 2009 (UTC)


Negative Multinomial
Parameters	$(k_{0},\{p_{1},\cdots ,p_{m}\})$
Support	$X_{i}\in \{0,1,2,\ldots \},1\leq i\leq m$
PMF	$P(k_{1},\cdots ,k_{m})=\left(\sum _{i=0}^{m}{k_{i}}-1\right)!{\frac {p_{0}^{k_{0}}}{(k_{0}-1)!}}\prod _{i=1}^{m}{\frac {p_{i}^{k_{i}}}{k_{i}!}}$ , or equivalently $P(k_{1},\cdots ,k_{m})=\Gamma \left(\sum _{i=0}^{m}{k_{i}}\right){\frac {p_{0}^{k_{0}}}{\Gamma (k_{0})}}\prod _{i=1}^{m}{\frac {p_{i}^{k_{i}}}{k_{i}!}}$ , where $\Gamma (x)$ is the Gamma function
Mean	$\mu =\left({\frac {k_{0}p_{1}}{p_{0}}},\cdots ,{\frac {k_{0}p_{m}}{p_{0}}}\right)$
Variance	$cov[i,j]={\begin{cases}{\frac {k_{0}p_{i}p_{j}}{p_{0}^{2}}},&i\not =j,\\{\frac {k_{0}p_{i}(p_{i}+p_{0})}{p_{0}^{2}}},&i=j.\end{cases}}$