Poisson Formula
Poisson Distribution
Abstract
The geometric Poisson distribution (also called Polya-Aeppli) is a
particular case of the compound Poisson distribution. We propose
to express the general term of this distribution through a recurrence
formula leading to a linear algorithm for the computation of its cumulative
distribution function. A practical implementation with special
care for numerical computations is proposed and validated. The paper
ends with an example of application of these results to the computation
of pattern statistics in biological sequences modeled by a Markov
model.
Keywords: compound Poisson, Polya-Aeppli, confluent hypergeometric
function, Kummer
1 INTRODUCTION
The geometric Poisson distribution (or Polya-Aeppli) is a particular case
of the classical compound Poisson distribution where the contribution of
each term is distributed according to the geometric distribution. Johnson et
al (1972) propose a linear formula to compute the general term of a compound
Poisson distribution, which can be simplified (while remaining linear) in the
geometric Poisson case. This formula results in a quadratic complexity for
the computation of the cumulative distribution function (cdf), which is often
too slow for many problems.
We propose here a direct and straightforward proof of this formula and,
rewriting the result with Kummer's confluent hypergeometric function,
derive from it a recurrence relation leading to a linear computation
of the cdf.
In a first part we recall some definitions and elementary results, then
the recurrence relation is established. The practical implementation of
the resulting algorithm is then discussed, and an application to the
computation of pattern statistics on random texts generated by a Markov
chain is finally presented.
2 RECALLS
We first recall the

Definition 1 (Poisson distribution) The random variable $M$ is distributed
according to $\mathcal{P}(\lambda)$, the Poisson distribution of parameter $\lambda > 0$, if
$$P(M = k) = e^{-\lambda} \frac{\lambda^k}{k!} \qquad \forall k \in \mathbb{N}$$
and the

Definition 2 (compound Poisson distribution) The random variable $N$
is distributed according to $\mathcal{CP}((\lambda_k)_{k \in \mathbb{N}^*})$, the compound Poisson distribution
of parameters $(\lambda_k)_{k \in \mathbb{N}^*}$ such that all $\lambda_k > 0$ and $\sum_{k=1}^{\infty} \lambda_k = \lambda < \infty$, if
$$N = \sum_{m=1}^{M} K_m$$
where $M \sim \mathcal{P}(\lambda)$ is independent from the $K_m$, which are independent and
identically distributed according to
$$P(K = k) = \frac{\lambda_k}{\lambda} \qquad \forall k \in \mathbb{N}^*$$
We then have, for all $n \in \mathbb{N}^*$,
$$P(N = n) = \sum_{m=1}^{n} e^{-\lambda} \frac{\lambda^m}{m!} \sum_{k_1, \ldots, k_m \in \mathbb{N}^*} \mathbb{I}_{\{k_1 + \ldots + k_m = n\}} \frac{\lambda_{k_1} \ldots \lambda_{k_m}}{\lambda^m} \qquad (1)$$
and $P(N = 0) = e^{-\lambda}$.
We are now ready to introduce the

Definition 4 (geometric Poisson distribution) The random variable $N$
is distributed according to $\mathcal{GP}(\lambda, \theta)$, the geometric Poisson distribution of
parameter $\theta \in ]0, 1]$ for the geometric part and parameter $\lambda > 0$ for the Poisson
part, if $N \sim \mathcal{CP}((\lambda_k)_{k \in \mathbb{N}^*})$ with
$$\lambda_k = \lambda \theta (1-\theta)^{k-1} \qquad \forall k \in \mathbb{N}^* \qquad (2)$$
(in the degenerate case where $\theta = 1$, we simply fall back to the Poisson
distribution) and we recall that
$$E[N] = \frac{\lambda}{\theta} \qquad \text{and} \qquad V[N] = \frac{\lambda (2-\theta)}{\theta^2}$$
We can see on Fig. 1 the pdf of this distribution for some choices of
parameter $\theta$, with $\lambda$ chosen such that the expectation $\lambda/\theta = 10$. As pointed
out in the definition, we are in the Poisson case when $\theta = 1$. When the
aggregation parameter $\theta$ decreases, we can see that the tail of the distribution
becomes heavier.
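Definition 4 can be checked empirically by simulating $N$ directly as a compound sum. The sketch below (Python with NumPy, an assumption of this note and not part of the paper; the helper name is hypothetical) draws $M \sim \mathcal{P}(\lambda)$, sums $M$ geometric variables, and compares the empirical moments with $E[N] = \lambda/\theta$ and $V[N] = \lambda(2-\theta)/\theta^2$.

```python
import numpy as np

def rgp(lam, theta, size, rng):
    """Sample N ~ GP(lam, theta) straight from Definition 4:
    N = K_1 + ... + K_M with M ~ Poisson(lam) and K_m iid Geometric(theta)."""
    m = rng.poisson(lam, size=size)
    # numpy's geometric distribution lives on {1, 2, ...}, as required here
    return np.array([rng.geometric(theta, size=mi).sum() for mi in m])

rng = np.random.default_rng(0)
lam, theta = 2.0, 0.3
x = rgp(lam, theta, 50_000, rng)
print(x.mean(), lam / theta)                  # empirical vs E[N] = lam/theta
print(x.var(), lam * (2 - theta) / theta**2)  # empirical vs V[N]
```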
As stated by Johnson et al (1972, pp. 378), the geometric Poisson distribution
is also defined by

Proposition 5 If $N \sim \mathcal{GP}(\lambda, \theta)$ then for all $n \geqslant 1$ we have
$$P(N = n) = \sum_{k=1}^{n} e^{-\lambda} \frac{\lambda^k}{k!} (1-\theta)^{n-k} \theta^k \binom{n-1}{k-1}$$
where $\binom{n}{k}$ is the binomial coefficient, and $P(N = 0) = e^{-\lambda}$.

proof. Using relations (1) and (2) we get, for all $n \in \mathbb{N}^*$,
$$P(N = n) = \sum_{m=1}^{n} e^{-\lambda} \frac{\lambda^m}{m!} (1-\theta)^{n-m} \theta^m \underbrace{\sum_{k_1, \ldots, k_m \in \mathbb{N}^*} \mathbb{I}_{\{k_1 + \ldots + k_m = n\}}}_{A(n,m)}$$
so we only have to compute $A(n, m)$, the number of ways to write $n$ as a
sum of $m$ positive integers (where the order matters). Let us consider
the list $1, 2, \ldots, n$. This list contains $n-1$ commas, and a selection of $m-1$
of these commas obviously gives a decomposition of $n$ of the kind involved
in $A(n, m)$. It is therefore clear that $A(n, m) = \binom{n-1}{m-1}$, and renaming
the summation index $m$ into $k$ proves the proposition.
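Proposition 5 translates directly into code. The Python sketch below (helper name hypothetical) evaluates the explicit sum; note that it costs $O(n)$ per term, hence $O(n_0^2)$ for a full cdf, which is precisely what the recurrence of the next section avoids.

```python
from math import comb, exp, factorial

def gp_pmf_direct(n, lam, theta):
    """P(N = n) for N ~ GP(lam, theta), by the explicit sum of Proposition 5."""
    if n == 0:
        return exp(-lam)
    return sum(
        exp(-lam) * lam**k / factorial(k)
        * (1 - theta)**(n - k) * theta**k * comb(n - 1, k - 1)
        for k in range(1, n + 1)
    )

# sanity checks: total mass ~ 1 and mean ~ lam/theta = 12.5
probs = [gp_pmf_direct(n, 5.0, 0.4) for n in range(150)]
print(sum(probs), sum(n * p for n, p in enumerate(probs)))
```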
3 MAIN RESULTS
From now on, $N \sim \mathcal{GP}(\lambda, \theta)$ with $\lambda > 0$ and $\theta \in ]0, 1[$. It is then possible to
rewrite proposition 5 using Kummer's confluent hypergeometric
function $M$ (see Abramowitz and Stegun (1972, chapter 13); all further x.x.x
references will concern this book).

Proposition 6 For all $n \in \mathbb{N}^*$ we have
$$P(N = n) = e^{-\lambda} (1-\theta)^n z e^{-z} M(n+1, 2, z) \qquad (3)$$
where $M$ is Kummer's confluent hypergeometric function and
$$z = \frac{\lambda \theta}{1-\theta} \qquad (4)$$

proof. We start by considering $e^{-z} M(n+1, 2, z) = M(1-n, 2, -z)$ (thanks
to 13.1.27). The definition (13.1.2) easily gives that
$$M(1-n, 2, -z) = 1 + \frac{(n-1)}{2} \frac{z}{1!} + \frac{(n-1)(n-2)}{2 \times 3} \frac{z^2}{2!} + \ldots$$
$$= \sum_{k=1}^{n} \frac{(n-1) \ldots (n-k+1)}{k!} \frac{z^{k-1}}{(k-1)!} = \sum_{k=1}^{n} \binom{n-1}{k-1} \frac{z^{k-1}}{k!}$$
and so,
$$P(N = n) = \sum_{k=1}^{n} e^{-\lambda} (1-\theta)^n \frac{z^k}{k!} \binom{n-1}{k-1}$$
which gives the result by replacing $z$ by its value.
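Equation (3) can be cross-checked numerically. The sketch below assumes SciPy is available; `scipy.special.hyp1f1` evaluates Kummer's $M(a, b, z)$. The values agree with the closed form $P(N=1) = e^{-\lambda}(1-\theta)z$ (since $M(2,2,z) = e^z$) and with a total mass of 1.

```python
from math import exp
from scipy.special import hyp1f1   # Kummer's confluent hypergeometric M(a, b, z)

def gp_pmf_kummer(n, lam, theta):
    """P(N = n) for N ~ GP(lam, theta) via equation (3):
    e^{-lam} (1-theta)^n z e^{-z} M(n+1, 2, z), with z = lam*theta/(1-theta)."""
    if n == 0:
        return exp(-lam)
    z = lam * theta / (1 - theta)
    return exp(-lam) * (1 - theta)**n * z * exp(-z) * hyp1f1(n + 1, 2, z)

lam, theta = 5.0, 0.4
z = lam * theta / (1 - theta)                        # = 10/3
print(gp_pmf_kummer(1, lam, theta), exp(-lam) * (1 - theta) * z)
print(sum(gp_pmf_kummer(n, lam, theta) for n in range(150)))   # close to 1
```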
It is now finally time to establish our recurrence relation

Proposition 7 (recurrence) We have
$$P(N = 0) = e^{-\lambda} \qquad \text{and} \qquad P(N = 1) = e^{-\lambda} (1-\theta) z$$
and, for all $n \in \mathbb{N}$, $n \geqslant 2$,
$$P(N = n) = \frac{(2n-2+z)}{n} (1-\theta) P(N = n-1) + \frac{(2-n)}{n} (1-\theta)^2 P(N = n-2)$$
where $z$ is defined in equation (4).

proof. Using the definition 13.1.2, it is easy to show that $M(2, 2, z) = e^z$,
and this immediately gives $P(N = 1)$. 13.4.1 also implies the following
relation, for all $a, b \in \mathbb{R}$:
$$M(a+1, b, z) = \frac{1}{a} \left[ (2a - b + z) M(a, b, z) + (b - a) M(a-1, b, z) \right]$$
We simply apply this result to $a = n$ and $b = 2$ to obtain
$$M(n+1, 2, z) = \frac{(2n-2+z)}{n} M(n, 2, z) + \frac{(2-n)}{n} M(n-1, 2, z)$$
and thanks to relation (3) the proposition is proved.
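Proposition 7 yields all the probabilities up to a given index in linear time. A minimal Python sketch (helper name hypothetical):

```python
from math import exp

def gp_pmf_table(n_max, lam, theta):
    """All P(N = n) for n = 0..n_max, by the linear recurrence of Proposition 7."""
    z = lam * theta / (1 - theta)
    p = [exp(-lam), exp(-lam) * (1 - theta) * z]   # P(N=0), P(N=1)
    for n in range(2, n_max + 1):
        p.append((2 * n - 2 + z) / n * (1 - theta) * p[-1]
                 + (2 - n) / n * (1 - theta) ** 2 * p[-2])
    return p[:n_max + 1]

p = gp_pmf_table(150, 5.0, 0.4)
print(sum(p))                                  # total mass, close to 1
print(sum(n * pn for n, pn in enumerate(p)))   # mean, close to lam/theta = 12.5
```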
4 ALGORITHMS
In this part, we want to compute, for all $n_0 \in \mathbb{N}$, the following functions:
$$F(n_0) = \sum_{n=0}^{n_0} P(N = n) \qquad \text{and} \qquad G(n_0) = 1 - F(n_0 - 1) = \sum_{n=n_0}^{\infty} P(N = n)$$
Thanks to proposition 7, we can compute all $(P(N = n))_{n \leqslant n_0}$ in $O(n_0)$,
but we must face two numerical issues:
1. some terms could be out of the machine range and set to zero;
2. the relative error in the computation of an $F$ close to one becomes an
absolute error for the corresponding $G$ computed through the relation
$G = 1 - F$.
A simple solution for the first point is to compute the $L_n = \ln P(N = n)$
by adapting proposition 7 to obtain

Proposition 8 (log recurrence)
$$L_0 = -\lambda \qquad \text{and} \qquad L_1 = -\lambda + \ln((1-\theta) z)$$
and, for all $n \in \mathbb{N}$, $n \geqslant 2$,
$$L_n = L_{n-1} + \ln \left[ \frac{(2n-2+z)}{n} (1-\theta) + \frac{(2-n)}{n} (1-\theta)^2 \exp(L_{n-2} - L_{n-1}) \right]$$
A simple application of this proposition leads to the

Algorithm for lower tail
initialization: $L_0 = -\lambda$, $L_1 = -\lambda + \ln((1-\theta) z)$, $A_1 = -\lambda$ and $S_1 = 1 + (1-\theta) z$
main loop: for $n = 2 \ldots n_0$ do
    compute $L_n$ from $L_{n-1}$ and $L_{n-2}$ through proposition 8
    check range of sum: $S = S_{n-1} + \exp(L_n - A_{n-1})$
        if $S \in ]0, \infty[$ then $A_n = A_{n-1}$ and $S_n = S$
        else $A_n = A_{n-1} + \ln(S_{n-1})$ and $S_n = 1 + \exp(L_n - A_n)$
end: $F(n_0) = \exp(A_{n_0}) \times S_{n_0}$
(where $A_i$ and $S_i$ are auxiliary variables used to compute the cumulative
sum).
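Putting proposition 8 and the rescaling scheme together gives a direct implementation of the lower tail algorithm. The Python sketch below (function name hypothetical) keeps only the last two $L$ values and the pair $(A, S)$, so it runs in $O(n_0)$ time and $O(1)$ memory as claimed.

```python
from math import exp, log

def gp_lower_tail(n0, lam, theta):
    """F(n0) = P(N <= n0) in O(n0) time, via the log recurrence (Proposition 8)
    and the rescaled running sum (A_n, S_n) of the lower tail algorithm."""
    if n0 == 0:
        return exp(-lam)
    z = lam * theta / (1 - theta)
    Lp2 = -lam                             # L_0
    Lp1 = -lam + log((1 - theta) * z)      # L_1
    A = -lam                               # A_1
    S = 1.0 + (1 - theta) * z              # S_1
    for n in range(2, n0 + 1):
        Ln = Lp1 + log((2 * n - 2 + z) / n * (1 - theta)
                       + (2 - n) / n * (1 - theta) ** 2 * exp(Lp2 - Lp1))
        s = S + exp(Ln - A)
        if 0.0 < s < float("inf"):
            S = s
        else:                              # rescale to avoid overflow
            A += log(S)
            S = 1.0 + exp(Ln - A)
        Lp2, Lp1 = Lp1, Ln
    return exp(A) * S
```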
The second problem is illustrated by Fig. 2. As one can see, using
$F$ to compute $G = 1 - F$ is acceptable as long as the value of $G$ is not too
small ($> 10^{-12}$), but when its value comes closer to the relative error of the
computations ($10^{-16}$ for double precision on a 32-bit machine), the results are
no longer reliable.
The solution to this problem is obviously to compute $G$ directly using
the series but, doing so, how does one know when the series has converged?
One could think of using a rough Gaussian approximation to test
the convergence. Unfortunately, the geometric Poisson distribution differs
so much from the Gaussian one (for example, with $\lambda = 10.0$ and $\theta = 0.1$,
$P(N > 1000) = 8.765 \times 10^{-24}$ while the Gaussian approximation gives
$5.161 \times 10^{-95}$ for the same value) that this option must be abandoned.
Without any better criterion, we will add terms until we get numerical
convergence. We first choose a small $\varepsilon$ (typically $\varepsilon = 10^{-16}$, depending on
the relative precision of the computations) and we get the
Algorithm for upper tail
initialization: $L_0 = -\lambda$ and $L_1 = -\lambda + \ln((1-\theta) z)$
first loop: for $n = 2 \ldots n_0$ do
    compute $L_n$ from $L_{n-1}$ and $L_{n-2}$ through proposition 8
sum initialization: $A_{n_0} = L_{n_0}$ and $S_{n_0} = 1$
main loop: while the series has not converged do
    $n = n + 1$
    compute $L_n$ from $L_{n-1}$ and $L_{n-2}$ through proposition 8
    check convergence:
        if $L_n < \ln(\varepsilon) + A_{n-1} + \ln(S_{n-1})$ then the series has converged:
        set $n_1 = n - 1$ and go to end
        else the series has not yet converged:
            check range of sum: $S = S_{n-1} + \exp(L_n - A_{n-1})$
                if $S \in ]0, \infty[$ then $A_n = A_{n-1}$ and $S_n = S$
                else $A_n = A_{n-1} + \ln(S_{n-1})$ and $S_n = 1 + \exp(L_n - A_n)$
end: $G(n_0) = \exp(A_{n_1}) \times S_{n_1}$
(where, as in the previous algorithm, $A_i$ and $S_i$ are auxiliary variables).
Of course, such a naive approach could lead to errors if the convergence
rate of the series is too low but, as we can see on Fig. 3, the algorithm works
very well, with a relative error around $10^{-12}$ in the worst case.
Finally, let us point out that the complexity of these two algorithms is
$O(n_0)$ in time and $O(1)$ in memory (as only the last two values of $L_n$ and
the last values of $A_n$ and $S_n$ are needed at each step).
5 APPLICATION
We consider a homogeneous stationary Markov chain $X$ of length $\ell$ on
the finite alphabet $\mathcal{A}$ (e.g. $\mathcal{A} = \{a, c, g, t\}$ when DNA sequences are considered).
We denote by $N$ the number of occurrences of a given (non degenerate)
pattern (e.g. pattern aacaa or pattern gctgtctg).
Chrysaphinou and Papastavridis (1988) first introduced the geometric
Poisson approximation for $N$ in the independent case. Arratia et al (1990)
then proposed to use the Chen-Stein method to get a compound Poisson
approximation in the Markov case. This idea was then studied in detail by
Schbath (1995) and, more recently, Robin (2002) pointed out that the compound
Poisson distribution was in fact a simple geometric Poisson one in
the non degenerate case.
According to these references, $N$ is approximately distributed according
to a $\mathcal{GP}(\lambda, 1-a)$ distribution where
$$a = P(\text{pattern self-overlaps})$$
and
$$\lambda = (1 - a) \times \ell \times P(\text{pattern appears at a given position in } X)$$
If $a = 0$, then $\theta = 1 - a = 1.0$ and the distribution is a simple Poisson
one (and its cumulative distribution function is easily computed with the
incomplete gamma function; see Press et al (1992), chapter 6, for more details
on this point). From now on, we will then consider only cases where $a > 0$.
Let us consider the complete genome of the bacteria Escherichia coli
K12 (of length $\ell = 4\,639\,221$). We estimate an order 1 Markov model on its
sequence to get
$$\Pi = \begin{pmatrix}
0.2957922699 & 0.2247175468 & 0.2082510314 & 0.2712391519 \\
0.2756564177 & 0.2303218838 & 0.2939007929 & 0.2001209057 \\
0.2270901404 & 0.3262008455 & 0.2295111640 & 0.2171978501 \\
0.1857763808 & 0.2342592584 & 0.2824187007 & 0.2975456600
\end{pmatrix}$$
the transition matrix between the states $\{a, c, g, t\}$ (in this order), such that
$P(g|c) = \Pi(c, g) = \Pi(2, 3)$. The corresponding stationary distribution $\mu$
(such that $\mu \Pi = \mu$) is therefore given by
$$\mu = \begin{pmatrix} 0.2461911663 & 0.2542308865 & 0.2536579603 & 0.2459199869 \end{pmatrix}$$
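The estimated model can be checked numerically. The sketch below (NumPy assumed, not part of the paper) verifies that the rows of $\Pi$ are stochastic and that $\mu$ is stationary to the printed precision.

```python
import numpy as np

# Transition matrix Pi estimated on E. coli K12 (states a, c, g, t) and its
# stationary distribution mu, as given above.
Pi = np.array([
    [0.2957922699, 0.2247175468, 0.2082510314, 0.2712391519],
    [0.2756564177, 0.2303218838, 0.2939007929, 0.2001209057],
    [0.2270901404, 0.3262008455, 0.2295111640, 0.2171978501],
    [0.1857763808, 0.2342592584, 0.2824187007, 0.2975456600],
])
mu = np.array([0.2461911663, 0.2542308865, 0.2536579603, 0.2459199869])

print(Pi.sum(axis=1))              # each row sums to 1 (stochastic matrix)
print(np.abs(mu @ Pi - mu).max())  # mu is (numerically) stationary: mu Pi = mu
```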
For any given pattern, we can use $\Pi$ and $\mu$ to compute $a$ and $\lambda$. For
example, for the pattern gggg we get
$$a(\texttt{gggg}) = \Pi(g, g) = 0.2295111640$$
and
$$\lambda(\texttt{gggg}) = \ell \times \mu(g) \times \Pi(g, g)^3 \times (1 - a(\texttt{gggg})) = 10961.53$$
So $N(\texttt{gggg})$ is approximately distributed according to $\mathcal{GP}(10961.53, 0.7704888)$
and hence $P(N(\texttt{gggg}) \leqslant 8704) \simeq 10^{-356.942033} \simeq 1.142791496 \times 10^{-357}$.
Table 1 gives the ten smallest p-values, computed with the lower tail
algorithm, for the self-overlapping patterns of size 4 and 8 which are seen
less often than expected. Table 2 gives the p-values, computed with the upper tail
algorithm, for the self-overlapping patterns seen more often than expected. In both
cases, these p-values are compared with the ones obtained through precise
large deviations approximations which are, according to Nuel (2006), very
close to the exact p-values (the usual relative error is smaller than 0.5%).
In both tables, one should note that p-values smaller than $10^{-300}$ are
computed and, especially for the size 4 patterns, that large $n_0$ are considered
(whose corresponding cdf would have taken a very long time to compute with
the original quadratic algorithm).
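The gggg computation can be reproduced in a few lines. For a pattern made of a single repeated letter, the self-overlap probability reduces to $a = \Pi(g, g)$, as in the example above; the general pattern case (arbitrary overlap structure under the Markov model) is outside this sketch, and the variable names are hypothetical.

```python
ell = 4_639_221            # length of the E. coli K12 genome
mu_g = 0.2536579603        # stationary probability mu(g)
pi_gg = 0.2295111640       # transition probability Pi(g, g)

a = pi_gg                  # a(gggg): self-overlap probability
occ = mu_g * pi_gg ** 3    # P(gggg appears at a given position)
lam = (1 - a) * ell * occ  # lambda(gggg)
theta = 1 - a
print(lam, theta)          # approximately 10961.53 and 0.7704888
```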
6 CONCLUSION
We have presented here a simple and efficient way to compute the cumulative
distribution function of a geometric Poisson distribution. As its
complexity is linear, this new algorithm allows one to deal with cases where the
usual quadratic algorithm was not usable. Thanks to the logarithmic version
of our recurrence, this algorithm is also able to compute efficiently
very small p-values, such as those that usually arise when studying pattern
statistics on biological sequences.
One should note that the geometric assumption is here a key stage
of the proposed method, since it allows us to rewrite the complex sum on
$k_1, \ldots, k_m \in \mathbb{N}^*$