Stochastic Processes and Simulations - A Machine Learning Perspective
Sponsors
MLTechniques. Private, self-funded Machine Learning research lab and publishing company. Develop-
ing explainable artificial intelligence, advanced data animations in Python including videos, model-free
inference, and modern solutions to synthetic data generation. Visit our website, at MLTechniques.com.
Email the author at vincentg@MLTechniques.com to be listed as a sponsor.
Note: External links (in blue) and internal references (in red) are clickable throughout this document. Keywords
highlighted in orange are indexed; those in red are both indexed and in the glossary section.
Contents
About this Textbook 2
Target Audience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
About the Author . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Applications 12
2.1 Modeling Cluster Systems in Two Dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.1 Generalized Logistic Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.1.2 Illustrations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 Infinite Random Permutations with Local Perturbations . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 Probabilistic Number Theory and Experimental Maths . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.1 Poisson Limit of the Poisson-binomial Distribution, with Applications . . . . . . . . . . . 18
2.3.2 Perturbed Version of the Riemann Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4 Videos: Fractal Supervised Classification and Riemann Hypothesis . . . . . . . . . . . . . . . . . 22
2.4.1 Dirichlet Eta Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.4.2 Fractal Supervised Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.4 Spatial Statistics, Nearest Neighbors, Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4.1 Stochastic Residues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4.2 Inference for Two-dimensional Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4.3 Clustering Using GPU-based Image Filtering . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.4.4 Black-box Elbow Rule to Detect Outliers and Number of Clusters . . . . . . . . . . . . . 41
3.5 Boundary Effect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.5.1 Quantifying some Biases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.5.2 Extreme Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.6 Poor Random Numbers and Other Glitches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.6.1 A New Type of Pseudo-random Number Generator . . . . . . . . . . . . . . . . . . . . . . 48
4 Theorems 49
4.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2 Link between Interarrival Times and Point Count . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.3 Point Count Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.4 Link between Intensity and Scaling Factor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.5 Expectation and Limit Distribution of Interarrival Times . . . . . . . . . . . . . . . . . . . . . . 51
4.6 Convergence to the Poisson Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.7 The Inverse or Hidden Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.8 Special Cases with Exact Formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.9 Fundamental Theorem of Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Glossary 89
List of Figures 90
References 91
Index 94
new research developments and open problems. I focus on the methodology and principles, providing the reader
with solid foundations and numerous resources: theory, applications, illustrations, statistical inference, refer-
ences, glossary, educational spreadsheet, source code, stochastic simulations, original exercises, videos and more.
Below is a short selection highlighting some of the topics featured in the textbook. Some are research re-
sults published here for the first time.
GPU clustering Fractal supervised clustering in GPU (graphics processing unit) using image
filtering techniques akin to neural networks, automated black-box detection
of the number of clusters, unsupervised clustering in GPU using density (gray
levels) equalizer
Inference New test of independence, spatial processes, model fitting, dual confidence
regions, minimum contrast estimation, oscillating estimators, mixture and
superimposed models, radial cluster processes, exponential-binomial distri-
bution with infinitely many parameters, generalized logistic distribution
Nearest neighbors Statistical distribution of distances and Rayleigh test, Weibull distribution,
properties of nearest neighbor graphs, size distribution of connected compo-
nents, geometric features, hexagonal lattices, coverage problems, simulations,
model-free inference
Cool stuff Random functions, random graphs, random permutations, chaotic conver-
gence, perturbed Riemann Hypothesis (experimental number theory), attrac-
tor distributions in extreme value theory, central limit theorem for stochastic
processes, numerical stability, optimum color palettes, cluster processes on
the sphere
Resources 28 exercises with solution expanding the theory and methods presented in
the textbook, well documented source code and formulas to generate various
deviates and simulations, simple recipes (with source code) to design your
own data animations as MP4 videos – see ours on YouTube
This first volume deals with point processes in one and two dimensions, including spatial processes and
clustering. The next volume in this series will cover other types of stochastic processes, such as Brownian-related
and random, chaotic dynamical systems. The point process which is at the core of this textbook is called the
Poisson-binomial process (not to be confused with a binomial or a Poisson process) for reasons that will soon
become apparent to the reader. Two extreme cases are the standard Poisson process, and fixed (non-random)
points on a lattice. Everything in between is the most exciting part.
Target Audience
College-educated professionals with an analytical background (physics, economics, finance, machine learning,
statistics, computer science, quant, mathematics, operations research, engineering, business intelligence), stu-
dents enrolled in a quantitative curriculum, decision makers or managers working with data scientists, graduate
students, researchers and college professors, will benefit the most from this textbook. The textbook is also
intended for professionals interested in automated machine learning and artificial intelligence.
It includes many original exercises requiring out-of-the-box thinking, offered with solutions. Both
students and college professors will find them very valuable. Most of these exercises are an extension of the
core material. Also, a large number of internal and external references are immediately accessible with one
click, throughout the textbook: they are highlighted respectively in red and blue in the text. The material
is organized to facilitate the reading in random order as much as possible and to make navigation easy. It is
written for busy readers.
The textbook includes full source code, in particular for simulations, image processing, and video generation.
You don’t need to be a programmer to understand the code. It is well documented and easy to read, even for
people with little or no programming experience. Emphasis is on good coding practices. The goal is to help you
quickly develop and implement your own machine learning applications from scratch, or use the ones offered in
the textbook. The material also features professional-looking spreadsheets allowing you to perform interactive
statistical tests and simulations in Excel alone, without statistical tables or any coding. The code, data sets,
videos and spreadsheets are available on my GitHub repository.
The content in this textbook is frequently of graduate or post-graduate level and thus of interest to re-
searchers. Yet the unusual style of the presentation makes it accessible to a large audience, including students
and professionals with a modest analytic background (a standard course in statistics). It is my hope that it will
entice beginners and practitioners faced with data challenges, to explore and discover the beautiful and useful
aspects of the theory, traditionally inaccessible to them due to jargon.
1 Poisson-binomial or Perturbed Lattice Process
I introduce here one of the simplest point process models. The purpose is to illustrate, in simple English, the
theory of point processes using one of the most elementary and intuitive examples, keeping applications in mind.
Many other point processes will be covered in the next sections, both in one and two dimensions. Key concepts,
soon to be defined, include the intensity λ, the scaling factor s, the point count, and the interarrival times.
I also present several probability distributions that are easy to sample from, including logistic, uniform,
Laplace and Cauchy. I use them in the simulations. I also introduce new ones such as the exponential-binomial
distribution (the distribution of interarrival times), and a new type of generalized logistic distribution. One
of the core distributions is the Poisson-binomial with an infinite number of parameters. The Poisson-binomial
process is named after that distribution, attached to the point count (a random variable) counting the number
of points found in any given set. By analogy, the Poisson point process is named after the Poisson distribution
for its point count. Poisson-binomial processes are also known as perturbed lattice point processes. Lattices,
also called grids, are a core topic in this textbook, as well as nearest neighbors.
Poisson-binomial processes are different from both Poisson and binomial processes. However, as we shall
see and prove, they converge to a Poisson process when a parameter called the scaling factor (closely related
to the variance), tends to infinity. In recent years, there has been considerable interest in perturbed lattice
point processes, see [62, 68]. The Poisson-binomial process is lattice-based, and indeed, perturbed lattice point
processes and Poisson-binomial processes are one and the same. The name “Poisson-binomial” has historical
connotations and puts emphasis on its combinatorial nature, while “perturbed lattice” is more modern, putting
emphasis on topological features and modern applications such as cellular networks.
Poisson-binomial point processes with small scaling factor s are good at modeling lattice-based structures
such as crystals, exhibiting repulsion (also called inhibition) among the points, see Figure 3. They are also
widely used in cellular networks, see references in Section 2.1.
1.1 Definitions
A point process is a (usually infinite) collection of points, sometimes called events in one dimension, randomly
scattered over the real line (in one dimension), or over the entire space in higher dimensions. The points are
denoted as Xk with k ∈ Z in one dimension, or (Xh, Yk) with (h, k) ∈ Z² in two dimensions. The random
variable Xk takes values in R, known as the state space. In two dimensions, the state space is R². The points
are assumed to be independently distributed, though not identically distributed. Later in this textbook, it will
be evident from the context when we are dealing with the one or two dimensional case.
In one dimension, the Poisson-binomial process is characterized by infinitely many points Xk , k ∈ Z, each
centered around k/λ, independently distributed with
P(Xk < x) = F((x − k/λ)/s),   (1)
where
The parameter λ > 0 is called the intensity; it represents the granularity of the process. The expected
number of points in an interval of length 1/λ (in one dimension) or in a square of area 1/λ² (in two
dimensions), is equal to one. This generalizes to higher dimensions. The set Z/λ (or Z/λ × Z/λ in two
dimensions) is the underlying lattice space of the process (also called the grid), while Z (or Z2 in two
dimensions) is called the index space. The difference between state and lattice space is illustrated in
Figure 22.
The parameter s > 0 is the scaling factor, closely related to the variance. It determines the degree of
mixing among the Xk ’s. When s = 0, Xk = k/λ and the points are just the lattice points; there is no
randomness. When s is infinite, the process becomes a classic stationary Poisson point process of intensity
λ^d, where d is the dimension.
The cumulative distribution function (CDF) F (x) is continuous and belongs to a family of location-scale
distributions [Wiki]. It is centered at the origin (F(0) = 1/2), and symmetric (F(x) = 1 − F(−x)). Thus
it has zero expectation, assuming the expectation exists. Its derivative, denoted as f (x), is the density
function; it is assumed to be unimodal (it has only one maximum), with the maximum value attained at
x = 0.
In two dimensions, Formula (1) becomes
P[(Xh, Yk) < (x, y)] = F((x − h/λ)/s) · F((y − k/λ)/s).   (2)
Typical choices for F are
Uniform: F(x) = 1/2 + x/2 if −1 ≤ x ≤ 1, with F(x) = 1 if x > 1 and F(x) = 0 if x < −1
Laplace: F(x) = 1/2 + (1/2) sgn(x)(1 − exp(−|x|))
Logistic: F(x) = 1/(1 + exp(−x))
Cauchy: F(x) = 1/2 + (1/π) arctan(x)
where sgn(x) is the sign function [Wiki], with sgn(0) = 0. Despite the appearance, I use the standard form
of these well-known distributions, when the location parameter is zero, and the scaling factor is s = 1. It
looks unusual because I define them via their cumulative distribution function (CDF), rather than via the more
familiar density function. Throughout this textbook, I use the CDF and its inverse (the quantile function) for
simulation purposes.
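Since F is specified through its CDF, sampling is immediate by inverse transform: Xk = k/λ + s·F⁻¹(Uk), with Uk uniform on [0, 1]. Below is a minimal Python sketch of this recipe (my own illustration, not the textbook's official source code; the function name and default values are hypothetical), using the logistic quantile; replace it with 2u − 1 for the uniform case, or tan(π(u − 1/2)) for Cauchy.

```python
import numpy as np

def poisson_binomial_1d(lam=1.0, s=0.5, n=1000, quantile=None, seed=0):
    # Points X_k = k/lam + s * F^{-1}(U_k), k = -n..n, following Formula (1).
    rng = np.random.default_rng(seed)
    if quantile is None:
        quantile = lambda u: np.log(u / (1 - u))   # logistic quantile function
    k = np.arange(-n, n + 1)
    return k / lam + s * quantile(rng.uniform(size=k.size))

points = poisson_binomial_1d(lam=1.0, s=2.0, n=10000)
print(points[:5])
```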
Table 1 shows the relationship between s and the actual variance, for the distributions in question. I
use the notation Fs (x) = F (x/s) and fs (x) for its density, interchangeably throughout this textbook. Thus,
F(x) = F1(x) and f(x) = f1(x). In other words, F is the standardized version of Fs. In two dimensions, I use
F(x, y) = F(x)F(y), assuming independence between the two coordinates: see Formula (2).
Remark: The parameter s is called the scaling factor because it is proportional to the standard deviation of Fs, but
visually speaking, it represents the amount of repulsion among the points of the process. See visual impact of
a small s in Figure 3, and of a larger one in Figure 4.
pk = Fs(b − tk) − Fs(a − tk)
   = F((b − k/λ)/s) − F((a − k/λ)/s)   (3)
This easily generalizes to two dimensions based on Formula (2). As a consequence, the integer-valued random
variable N (B) counting the number of points of the process in a set B, known as the counting measure [Wiki]
or point count , has a Poisson-binomial distribution of parameters pk , k ∈ Z [Wiki]. The only difference with
a standard Poisson-binomial distribution is that here, we have infinitely many parameters (the pk ’s). Basic
properties of that distribution yield:
E[N(B)] = Σ_{k=−∞}^{∞} pk   (4)
Var[N(B)] = Σ_{k=−∞}^{∞} pk (1 − pk)   (5)
P[N(B) = 0] = Π_{k=−∞}^{∞} (1 − pk)   (6)
P[N(B) = 1] = P[N(B) = 0] · Σ_{k=−∞}^{∞} pk / (1 − pk)   (7)
It is more difficult, though possible, to obtain the higher moments E[N^r(B)] or P[N(B) = r] in closed form if
r > 2. This is due to the combinatorial nature of the Poisson-binomial distribution. But you can easily obtain
approximated values using simulations.
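For a concrete set B = [a, b], Formulas (4)–(7) can also be evaluated numerically by truncating the sums over k; the sketch below (my own code, with a logistic F and hypothetical defaults) mirrors the 20,000-term truncation mentioned in Section 1.3.

```python
import numpy as np

def point_count_stats(a, b, lam=1.0, s=0.5, K=20000, cdf=None):
    # Truncated Formulas (4)-(7), with p_k = F((b - k/lam)/s) - F((a - k/lam)/s).
    if cdf is None:
        cdf = lambda x: 0.5 * (1 + np.tanh(x / 2))   # logistic CDF, numerically stable
    k = np.arange(-K, K + 1)
    p = cdf((b - k / lam) / s) - cdf((a - k / lam) / s)
    mean = p.sum()                      # E[N(B)],   Formula (4)
    var = (p * (1 - p)).sum()           # Var[N(B)], Formula (5)
    p0 = np.prod(1 - p)                 # P[N(B)=0], Formula (6)
    p1 = p0 * (p / (1 - p)).sum()       # P[N(B)=1], Formula (7)
    return mean, var, p0, p1

print(point_count_stats(-0.75, 0.75, lam=1.0, s=10.0))
```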
Another fundamental, real-valued random variable, denoted as T or T (λ, s), is the interarrival times between
two successive points of the process, once the points are ordered on the real line. In two dimensions, it is replaced
by the distance between a point of the process, and its nearest neighbor. Thus it satisfies (see Section 4.2) the
following identity:
P (T > y) = P [N (B) = 0],
with B =]X0 , X0 + y], assuming it is measured at X0 (the point of the process corresponding to k = 0). See
Formula (38) for the distribution of T . In practice, this intractable exact formula is not used; instead it is
approximated via simulations. Also, the point X0 is not known, since the Xk ’s are in random order, and
retrieving k knowing Xk is usually not possible. The indices (the k’s) are hidden. However, see Section 4.7.
The fundamental question is whether using X0 or any Xk (say X5 ), matters for the definition of T . This is
discussed in Section 1.4 and illustrated in Table 4.
Finally, the point distribution is also of particular interest. In one dimension, this distribution can be
derived from the distribution of interarrival times: the distance between two successive points. For instance,
for a stationary Poisson process on the real line (that is, the intensity λ does not depend on the location), the
points in any given set B are uniformly and independently distributed in B, and the interarrival times have an
exponential distribution of expectation 1/λ. However, for Poisson-binomial processes, there is no such simple
result. If s is small, the points are more evenly spaced than the laws of pure randomness would dictate, see
Figure 3. Indeed, the process is called repulsive: it looks as if the points behave like electrical charges, all of the
same sign, exerting repulsive forces against each other. Despite this fact, the points are still independently
distributed. To the contrary, cluster processes later investigated in this textbook, exhibit point attraction: it
looks as if the points are attracted to each other.
Remark: A binomial process is defined as a finite set of points uniformly distributed over a domain B of finite
area. Usually, the number of points is itself random, typically with a binomial distribution.
1.3 Limiting Distributions, Speed of Convergence
I prove in Theorem 4.5 that Poisson-binomial processes converge to ordinary Poisson processes. In this section,
I illustrate the rate of convergence, both for the interarrival times and the point count in one dimension.
In Figure 1, we used λ = 1 and B = [−0.75, 0.75]; µ(B) = 1.5 is the length of B. The limiting values
(combined with those of Table 3), as s → ∞, are in agreement with N (B)’s moments converging to those
of a Poisson distribution of expectation λµ(B), and T ’s moments to those of an exponential distribution of
expectation 1/λ. In particular, it shows that P[N(B) = 0] → exp[−λµ(B)] and E[T²] → 2/λ² as s → ∞. These
limiting distributions are features unique to stationary Poisson processes of intensity λ.
Figure 1 illustrates the speed of convergence of the Poisson-binomial process to the stationary Poisson
process of intensity λ, as s → ∞. Further confirmation is provided by Table 3, and formally established by
Theorem 4.5. Of course, when testing data, more than a few statistics are needed to determine whether you are
dealing with a Poisson process or not. For a full test, compare the empirical moment generating function (the
estimated E[T r ]’s say for all r ∈ [0, 3]) or the empirical distribution of the interarrival times, with its theoretical
limit (possibly obtained via simulations) corresponding to a Poisson process of intensity λ. The parameter λ
can be estimated based on the data. See details in Section 3.
In Figure 1, the values of E[T 2 ] are more volatile than those of P [N (B) = 0] because they were estimated
via simulations; to the contrary, P [N (B) = 0] was computed using the exact Formula (6), though truncated to
20,000 terms. The choice of a Cauchy or logistic distribution for F makes almost no difference. But a uniform
F provides noticeably slower, more bumpy convergence. The Poisson approximation is already quite good with
s = 10, and only improves as s increases. Note that in our example, N (B) > 0 if s = 0. This is because Xk = k
if s = 0; in particular, X0 = 0 ∈ B = [−0.75, 0.75]. Indeed N (B) > 0 for all small enough s, and this effect is
more pronounced (visible to the naked eye on the left plot, blue curve in Figure 1) if F is uniform. Likewise,
E[T²] = 1 if s = 0, as T(λ, s) = 1/λ if s = 0, and here λ = 1.
The results discussed here in one dimension easily generalize to higher dimensions. In that case B is a
domain such as a circle or square, and T is the distance between a point of the process, and its nearest neighbor.
The limit Poisson process is stationary with intensity λ^d, where d is the dimension.
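The convergence can also be checked numerically. The sketch below (my own quick experiment, not the code behind Figure 1) simulates one realization per value of s, estimates E[T²] from gaps between successive points and P[N(B) = 0] from shifted copies of B, and lets you watch the values approach 2/λ² and exp[−λµ(B)] as s grows.

```python
import numpy as np

def poisson_limit_check(s_values, lam=1.0, a=-0.75, b=0.75, n=5000, seed=1):
    rng = np.random.default_rng(seed)
    for s in s_values:
        k = np.arange(-n, n + 1)
        u = rng.uniform(size=k.size)
        x = np.sort(k / lam + s * np.log(u / (1 - u)))       # logistic F
        gaps = np.diff(x)
        et2 = np.mean(gaps[n // 2: -n // 2] ** 2)            # stay away from the boundary
        shifts = np.arange(-n // 2, n // 2) / lam            # shifted copies of B = [a, b]
        p0 = np.mean(np.searchsorted(x, a + shifts) == np.searchsorted(x, b + shifts))
        print(f"s = {s:5.1f}   E[T^2] ~ {et2:.3f}   P[N(B)=0] ~ {p0:.3f}")

poisson_limit_check([0.5, 2, 10])   # targets: E[T^2] -> 2, P[N(B)=0] -> exp(-1.5) ~ 0.223
```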
1.4.1 Stationarity
There are various definitions of stationarity [Wiki] for point processes. The most common one is that the
distribution of the point count N (B) depends only on µ(B) (the length or area of B), but not on its location. The
Poisson-binomial process is not stationary. Assuming λ = 1, if s is small enough, the point count distribution
attached to (say) B1 = [0.3, 0.8] is different from that attached to B2 = [5.8, 6.3], despite both intervals having
the same length. This is obvious if s = 0: in that case N (B1 ) = 0, and N (B2 ) = 1. However, if B1 = [a, b] and
B2 = [a+k/λ, b+k/λ], then N (B1 ) and N (B2 ) have the same distribution, regardless of k ∈ Z; see Theorem 4.1
for a related result. So, knowing the theoretical distribution of N ([x, x + 1/λ]) for each 0 ≤ x < 1/λ is enough
to know the distribution of N (B) on any interval B. Since λ is unknown when dealing with actual data, it must
be estimated using techniques described in Section 3. This generalizes to two dimensions, with the interval
[x, x + 1/λ] replaced by the square [x, x + 1/λ] × [y, y + 1/λ], with 0 ≤ x, y < 1/λ. Statistical testing
is discussed in [55], also available online, here.
The interarrival time T faces fewer non-stationarity issues, as evidenced by Theorem 4.3, Table 4, and
Exercise 5. It should be favored over the point count N(B), when assessing whether your data fit with a
Poisson-binomial, or a Poisson point process model. In particular, it does not depend, for practical purposes,
on the choice of X0 in the definition of T in Section 1.2. The definition could be changed using (say) X5 , or
any other Xk instead of X0 , with no impact on the theoretical distribution.
1.4.2 Ergodicity
This brings us to the concept of ergodicity. It is heavily used in the active field of dynamical systems: see
[15, 19, 41] and my book [36] available here. I will cover dynamical systems in detail, in my upcoming book
on this topic. For Poisson-binomial point processes, ergodicity means that you can estimate a quantity in two
different ways:
using one very long simulation of the process (a large n in our case),
or using many small realizations of the process (small n), and averaging the statistics obtained in each
simulation
Ergodicity means that both strategies, at the limit, lead to the same value. This is best illustrated with the
estimation of E[T ], or its higher moments. The expectation of the interarrival times T is estimated, in most of
my simulations, as the average distance between a point Xk , and its nearest neighbor to the right, denoted as
Xk′. It is computed as an average of Xk′ − Xk over k = −n, . . . , n with n = 3 × 10^4, on a single realization of the
process. The same methodology is used in the source code provided in Section 6. Likewise, E[T 2 ] is estimated
as the average (Xk′ − Xk )2 in the same way.
Table 4 is an exception. There I used 10^4 realizations of a same Poisson-binomial process. In each realization
I computed, among others, T0 = X0′ − X0. This corresponds to the actual definition of T provided in Section 1.2.
Then I averaged these T0’s over the 10^4 realizations to get an approximate value for E[T]. It turns out that both
methods lead to the same result. This is thanks to ergodicity, as far as T is concerned. I may as well have
averaged T5 = X5′ − X5 over the 10^4 realizations, and end up with the same result for E[T]. Note that not all
processes are ergodic. The difference between stationarity and ergodicity is further explained here.
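The sketch below (my own illustration of the two strategies, with hypothetical sample sizes) contrasts them: one long realization versus 10,000 short ones, both targeting E[T]; under ergodicity the two estimates agree.

```python
import numpy as np

rng = np.random.default_rng(2)
lam, s = 1.0, 0.7
logit = lambda u: np.log(u / (1 - u))          # logistic quantile

# Strategy 1: one long realization, average gap between consecutive ordered points.
n = 30000
x = np.sort(np.arange(-n, n + 1) / lam + s * logit(rng.uniform(size=2 * n + 1)))
ET_long = np.mean(np.diff(x))

# Strategy 2: many short realizations; in each, T0 = distance from X0 to its
# nearest neighbor to the right, then average over realizations.
m, reps, T0 = 50, 10000, []
for _ in range(reps):
    xx = np.arange(-m, m + 1) / lam + s * logit(rng.uniform(size=2 * m + 1))
    x0 = xx[m]                                 # the point with index k = 0
    T0.append(np.min(xx[xx > x0]) - x0)
ET_short = np.mean(T0)

print(ET_long, ET_short)                       # both should be close to 1/lam
```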
1.4.4 Homogeneity
An ordinary Poisson point process (the limit, as s → ∞, of a Poisson-binomial process) is said to be homogeneous
if the intensity λ does not depend on the location. In the case of the Poisson process, homogeneity is equivalent
to stationarity. Even for non-homogeneous Poisson processes, the point counts N(B1) and N(B2), attached
to two disjoint sets B1, B2, are independently (though not identically) distributed. This is not the case for
Poisson-binomial processes, not even for those that are homogeneous.
Poisson-binomial processes investigated so far are homogeneous. I discuss non-homogeneous cases in Sec-
tions 1.5.3, 1.5.4 and 2.1. A non-homogeneous Poisson-binomial process is one where the intensity λ depends
on the index k attached to a point Xk .
can be standardized using the Mahalanobis transformation [Wiki], to remove stretching (so that variances are
identical for both coordinates) and to decorrelate the two coordinates, when correlation is present.
Xih = µi + h/λ + s · log(Uih / (1 − Uih))   (8)
Yik = µ′i + k/λ′ + s · log(Uik / (1 − Uik))   (9)
where Uij are uniformly and independently distributed on [0, 1] and −n ≤ h, k ≤ n. I chose n = 25 in
the simulation – a window much larger than that of Figure 2 – to avoid boundary effects in the picture. The
boundary effect is sometimes called edge effect. The unobserved data points outside the window of observations,
are referred to as censored data [Wiki]. Of course, in my simulations their locations and features (such as which
process they belong to) are known by design. But in a real data set, they are truly missing or unobservable,
and statistical inference must be adjusted accordingly [23]. See also Section 3.5.
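A minimal sketch of Formulas (8)–(9) follows (my own code, with illustrative shift vectors, and a single intensity λ for both coordinates for simplicity). Each pair (µ, µ′) produces one shifted Poisson-binomial process; plotting all of them together gives pictures in the spirit of Figure 2.

```python
import numpy as np

def superimposed_processes(shifts, lam=1.0, s=0.2, n=25, seed=3):
    # One 2-D Poisson-binomial process per shift (mu, mu'), following Formulas (8)-(9)
    # with a logistic F: X = mu + h/lam + s*logit(U), Y = mu' + k/lam + s*logit(U').
    rng = np.random.default_rng(seed)
    logit = lambda u: np.log(u / (1 - u))
    h, k = np.meshgrid(np.arange(-n, n + 1), np.arange(-n, n + 1))
    out = []
    for mu, mup in shifts:
        x = mu + h / lam + s * logit(rng.uniform(size=h.shape))
        y = mup + k / lam + s * logit(rng.uniform(size=k.shape))
        out.append(np.column_stack([x.ravel(), y.ravel()]))
    return out

# four superimposed processes; the shift values below are illustrative only
procs = superimposed_processes([(0, 0), (0.5, 0.5), (0.5, 0), (0, 0.5)], s=0.2, n=25)
```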
I discuss Figure 2 in Section 1.5.4. A simple introduction to mixtures of ordinary Poisson processes is found
on the Memming blog, here. In Section 3.4, I discuss statistical inference: detecting whether a realization of a
point process is Poisson or not, and detecting the number of superimposed processes (similar to estimating the
number of clusters in a cluster process, or the number of components in a mixture model). In Section 3.4.4, I
introduce a black-box version of the elbow rule to detect the number of clusters, of mixture components, or the
number of superimposed processes.
Figure 2: Four superimposed Poisson-binomial processes: s = 0 (left), s = 5 (right)
2 Applications
Applications of Poisson-binomial point processes (also called perturbed lattices point processes) are numerous.
In particular, they are widely used in cellular and sensor network modeling and optimization. They also have
applications in physics and crystal structures: see Figure 5 featuring man-made marble countertops. I provide
many references in Section 2.1.
Here I focus on two-dimensional processes, to model lattice-based clustering. It is different from traditional
clustering in two ways: clustering takes place around the vertices of the lattice space, and the number of clusters
is infinite (one per vertex). This concept is visualized in Figures 15 and 16, showing representations of these
cluster processes.
The processes in Section 2.1 are different from the mixtures or superimposed Poisson-binomial processes on
shifted rectangular lattices, discussed in Sections 1.5.4 and 3.4. The latter can produce clustering on hexagonal
lattice spaces, as pictured in Figure 2, with applications to cellular networks [Wiki]. See also an application to
number theory (sums of squares) in my article “Bernoulli Lattice Models – Connection to Poisson Processes”,
available here. Instead, the cluster processes discussed here are based on square lattices and radial densities.
In Section 2.1, I introduce radial processes (called child processes) to model the cluster structure. The
underlying distribution F attached to the points (Xh , Yk ) of the base process (called parent process) is the
logistic one. In Section 2.1.1, I discuss a new type of generalized logistic distribution, which is easy to handle
for simulation purposes, or to find its CDF (cumulative distribution function) and quantile function.
In Section 2.2, I focus on the hidden or inverse model (in short, the unobserved lattice). It leads to infinite,
slightly random permutations. The final section deals with what the Poisson-binomial process was first designed
for: randomizing mathematical series to transform them into random functions. The purpose is to study the
effect of small random perturbations. Here it is applied to the famous Riemann zeta function. It leads to a new
type of clusters called sinks, and awkward 2D Brownian motions with a very strong, unusual cluster structure,
and beautiful data animations (see Section 2.4).
Along the way, I prove Theorem 2.1, related to Le Cam’s inequality. It is a fundamental result about the
convergence of the Poisson-binomial distribution, to the Poisson distribution.
the child process. Poisson point processes with non-homogeneous radial intensities are discussed in my article
“Estimation of the Intensity of a Poisson Point Process by Means of Nearest Neighbor Distances” [35], freely
available online here.
Remark: By non-homogeneous intensity, I mean that the intensity λ depends on the location, as opposed to a
stationary Poisson process where λ is constant. Estimating the intensity function of such a process is equivalent
to a density estimation problem, using kernel density estimators [Wiki].
To simulate radial distributions (also called radial intensities in this case), I use a generalized logistic
distribution instead of the Gaussian one, for the child process. The generalized logistic distribution has nice
features: easy to simulate, easy to compute the CDF, and it has many parameters, offering a lot of flexibility
for the shape of the density. The peculiarity of the Poisson-binomial process offers two options:
Classic option: Child processes are centered around the points of the parent process, with exactly one
child process per point.
Ad-hoc option: Child processes are centered around the bivariate lattice locations (h/λ, k/λ), with exactly
one child process per location, and h, k ∈ Z.
In the latter case, if s is small, the child process attached to the index (h, k) has its points distributed around
(Xh, Yk) – a point of the parent process – thus it won’t be much different from the classic option. This is
because if s is small, then (h/λ, k/λ) is close to (Xh, Yk) on average. It becomes more interesting when s is
neither too small nor too large.
In my simulations, I used a random number of points (up to 15) for the child process, and the parameter
λ is set to one. I used a generalized logistic distribution for the radial distribution.
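Here is a rough sketch of such a simulation (my own code; Formulas (10)–(11) are not reproduced here, so the half-logistic radial distance from Section 2.1.2 stands in for the generalized logistic one, and all parameter values are illustrative).

```python
import numpy as np

def radial_cluster_process(lam=1.0, s=0.2, n=10, rho=0.3, max_children=15, seed=4):
    # Parent process: 2-D Poisson-binomial points with a uniform F (quantile 2u - 1).
    # Child process: per center, a random number of points (up to max_children),
    # at a half-logistic distance (scale rho) and a uniform angle from the center.
    rng = np.random.default_rng(seed)
    h, k = np.meshgrid(np.arange(-n, n + 1), np.arange(-n, n + 1))
    cx = h / lam + s * (2 * rng.uniform(size=h.shape) - 1)
    cy = k / lam + s * (2 * rng.uniform(size=k.shape) - 1)
    children = []
    for x0, y0 in zip(cx.ravel(), cy.ravel()):
        m = rng.integers(0, max_children + 1)
        u = rng.uniform(size=m)
        r = rho * np.log((1 + u) / (1 - u))          # half-logistic quantile
        theta = rng.uniform(0, 2 * np.pi, size=m)
        children.append(np.column_stack([x0 + r * np.cos(theta), y0 + r * np.sin(theta)]))
    return np.vstack(children), np.column_stack([cx.ravel(), cy.ravel()])

pts, centers = radial_cluster_process(s=0.2)
```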
Formulas (14) and (15) may be used to solve some integration problems. For instance, if a closed form can be
found for (15), then the integral in (14) has the same value. I just mention three results here; more details are
found in Exercise 3 in Section 5.
If α = 1, β = 1/6, then E[Z] = µ + ρ log τ − (ρ/2)(√3 π + 4 log 2 + 3 log 3), and the distribution is typically not symmetric.
If α = 1 and τ = e^(1/β), then β^(−1) E[(Z − µ)/ρ] → π²/6 as β → 0. This is a consequence of (41). However, the limiting distribution has zero expectation. See Exercise 4 in Section 5 for details.
If α = β = 1, then E[(Z − µ)/ρ] = log τ and Var[Z/ρ] = π²/3. The standard logistic distribution corresponds to τ = 1.
Finally, the moment generating function [Wiki] (MGF) can easily be computed using the quantile function, as
a direct application of the quantile theorem (Theorem 4.9). If µ = 0 and α = ρ = 1, we have:
E[exp(tZ)] = ∫₀¹ exp[ t log( τ u^(1/β) / (1 − u^(1/β)) ) ] du
           = τ^t ∫₀¹ u^(t/β) (1 − u^(1/β))^(−t) du
           = β τ^t ∫₀¹ v^(t+β−1) (1 − v)^(−t) dv
           = β τ^t B(β + t, 1 − t),   (16)
where B is the Beta function [Wiki]. Note that I made the change of variable v = u^(1/β) when computing the
integral. Unless α = τ = 1, it is clear from the moment generating function that this 5-parameter generalized
logistic distribution is different from the 4-parameter one described in [Wiki]. Another generalization of the
logistic distribution is the metalog distribution [Wiki].
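Formula (16) is easy to verify numerically: integrate exp(tZ) against the quantile function over [0, 1] and compare with βτ^t B(β + t, 1 − t). The check below is my own (it relies on SciPy; the parameter values are arbitrary, and the closed form requires −β < t < 1).

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import beta as beta_fn

def mgf_check(t, b=0.5, tau=2.0):
    # Z = log(tau * u**(1/b) / (1 - u**(1/b))), with mu = 0 and alpha = rho = 1.
    integrand = lambda u: (tau * u**(1 / b) / (1 - u**(1 / b)))**t
    lhs, _ = quad(integrand, 0, 1)                  # E[exp(tZ)] by numerical integration
    rhs = b * tau**t * beta_fn(b + t, 1 - t)        # closed form, Formula (16)
    return lhs, rhs

print(mgf_check(0.3))    # the two numbers should agree to several digits
```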
Remark: If α = 1, we face a model identifiability issue [Wiki]. This is because if τ1 exp(µ1 /ρ1 ) = τ2 exp(µ2 /ρ2 ),
the two CDF’s are identical even if these two subsets of parameters are different. That is, it is impossible to
separately estimate τ, µ and ρ. However, in practice, we use µ = 0.
2.1.2 Illustrations
Figures 3 and 4 show two extreme cases of the cluster processes discussed at the beginning of Section 2.1. The
parent process modeling the cluster centers, is Poisson-binomial. It is simulated with intensity λ = 1, using a
uniform distribution for F . The scaling factor is s = 0.2 for Figure 3, and s = 2 for Figure 4. The left plot is a
zoom-in. Around each center (marked with a blue cross in the picture), up to 15 points are radially distributed,
creating the overall cluster structure. These points are the actual, observed points of the process, referred to
as the child process. The distance between a point (X ′ , Y ′ ) and its cluster center (X, Y ) has a half-logistic
distribution [Wiki]. The simulations are performed using Formulas (10) and (11).
Figure 3: Radial cluster process (s = 0.2, λ = 1) with centers in blue; zoom in on the left
The contrast between Figures 3 and 4 is due to the choice of the scaling factor s. The value s = 0.2, close
to zero, strongly reveals the underlying lattice structure. Here this effect is strong because of the choice of F
(it has a very thin tail), and the relatively small variance of the distance between a point and its associated
cluster center. It produces repulsion among neighbor points: we are dealing with a repulsive process, also
called perturbed lattice point processes. When s = 0, all the randomness is gone: the state space is the lattice
space. See left plot in Figure 2. Modeling applications include optimum distribution of sensors (for instance
cell towers), crystal structures and bonding patterns of molecules in chemistry.
Figure 4: Radial cluster process (s = 2, λ = 1) with centers in blue; zoom in on the left
By contrast, s = 2 makes the cluster structure much more apparent. This time, there is attraction among
neighbor points: we are dealing with an attractive process. It can model many types of structures, associated to
human activities or natural phenomena, such as the distribution of galaxies in the universe. Figure 5 provides an
example, related to the manufacture of kitchen countertops. I discuss other types of cluster patterns generated
by Poisson-binomial processes, in Sections 2.4 and 3.4.
Figure 5 shows luxury kitchen countertops called “Inverness bronze Cambria quartz”, on the left. While the
quality (and price) is far superior to all other products from the same company, the rendering of marble veins is
not done properly. It looks man-made: not the kind of patterns you would find in real stones. The pattern is too
regular, as if produced using a very small value of the scaling factor s. An easy fix is to use patterns generated
by the cluster processes described here, incidentally called perturbed lattices. To increase randomness, increase
s. It will improve the design. I am currently talking to the company, as I plan to buy these countertops. The
picture on the right shows a more realistic rendering of randomness.
Figure 6: Locally random permutation σ; τ (k) is the index of Xk ’s closest neighbor to the right
Figure 7: Chaotic function (bottom), and its transform (top) showing the global minimum
what probabilistic number theory is about. In the process, I prove a version of Le Cam’s theorem: the fact that
under certain circumstances, the Poisson-binomial distribution tends to a Poisson distribution.
The second problem (Section 2.3.2) deals with the Riemann zeta function ζ and the famous Riemann
hypothesis (RH), featuring unusual, not well-known patterns. It leads to heuristic arguments supporting RH. I
then apply small perturbations to ζ (more specifically, to its sister function, the Dirichlet eta function η) using
a Poisson-binomial process, to see when and if the patterns remain. The purpose is to check whether RH can be
extended to a larger class of chaotic, random functions, unrelated to Dirichlet L-functions [Wiki]. Were such
an extension possible, it could offer new potential paths to proving RH. Unfortunately, while I exhibit such
extensions, they only occur when the perturbations are incredibly small. RH has a $1 million award attached
to it and offered by the Clay Institute, see here.
Zk ’s equal to zero, with 1 ≤ k ≤ m, is denoted as N (n, m) or simply N . In mathematical notations,
N = Σ_{k=1}^{m} χ(Zk = 0),   (17)
where χ is the indicator function [Wiki], equal to one if its argument is true, and to zero otherwise. Thus N ,
the counting random variable, has a Poisson-binomial distribution [Wiki] of parameters p1 , · · · , pm , similar to
that discussed in Formula (4). The goal is to prove that when n → ∞, the limiting distribution of N is Poisson
with expectation log α. I then discuss the implications of this result, regarding the distribution of large factors
in very large integers. The main result, Theorem 2.1, is a particular case of Le Cam’s inequality [Wiki]; see also
[73], available online here.
Theorem 2.1 As n → ∞ and m/n → α > 1, the discrete Poisson-binomial distribution of the counting random
variable N defined by Formula (17), tends to a Poisson distribution of expectation log α.
Proof
Clearly, pk = P (Zk = 0) = 1/(n + k). Let
q0 = Π_{k=1}^{m} (1 − pk)   and   µ = Σ_{k=1}^{m} pk / (1 − pk).
This corresponds to a Poisson distribution. It follows that µ = − log q0 . To complete the proof, I now show
that q0 → 1/α, as n → ∞ and m/n → α > 1. We have
log q0 = log Π_{k=1}^{m} (1 − pk) = Σ_{k=1}^{m} log(1 − pk) = −Σ_{k=1}^{m} pk + O(1/n)
       = −Σ_{k=1}^{m} 1/(n + k) + O(1/n) = −Σ_{k=1}^{m} 1/k + Σ_{k=1}^{n} 1/k + O(1/n)
       = −log m + log n + O(1/n) = −log(m/n) + O(1/n) = −log α + O(1/n),
and thus q0 → 1/α as n → ∞.
a mod (n + 2) ∈ {0, . . . , n + 1} is random,
..
.
a mod (m − 1) ∈ {0, . . . , m − 2} is random,
a mod m ∈ {0, . . . , m − 1} is random,
and the above m − n + 1 residues are mutually independent to some extent. The integer a has a factor in [n, m]
if and only if at least one of the above residues is zero. Thus Theorem 2.1 applies, at least approximately.
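A quick numerical check of Theorem 2.1 (my own sketch, with arbitrary n, α and sample sizes): draw large random integers, count their factors among the moduli listed above, and compare the empirical distribution with a Poisson of expectation log α.

```python
import numpy as np

def factor_count_experiment(n=1000, alpha=2.0, trials=20000, seed=5):
    # Count, for each random large integer a, the number of moduli j in ]n, m]
    # (m = alpha * n) such that a mod j = 0; compare with Poisson(log(alpha)).
    rng = np.random.default_rng(seed)
    moduli = np.arange(n + 1, int(alpha * n) + 1)
    a = rng.integers(10**12, 10**13, size=trials)
    counts = np.array([(x % moduli == 0).sum() for x in a])
    print("empirical mean:  ", counts.mean(), " vs log(alpha) =", np.log(alpha))
    print("empirical P[N=0]:", (counts == 0).mean(), " vs exp(-log alpha) =", 1 / alpha)

factor_count_experiment()
```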
ℜ[η(z)] = ℜ[η(σ + it)] = −Σ_{k=1}^{∞} (−1)^k cos(t log k) / k^σ   (18)
ℑ[η(z)] = ℑ[η(σ + it)] = −Σ_{k=1}^{∞} (−1)^k sin(t log k) / k^σ   (19)
Note that i represents the imaginary unit, that is i² = −1. I investigated two cases: σ = 1/2 and σ = 3/4.
I used a Poisson-binomial process with intensity λ = 1, scaling factor s = 10−3 and a uniform F to generate
the (Xk )’s and replace the index k by Xk in the two sums. I also replaced (−1)k by cos πk. The randomized
(perturbed) sums are
ℜ[ηs(z)] = ℜ[ηs(σ + it)] = −Σ_{k=1}^{∞} cos(πXk) · cos(t log Xk) / Xk^σ   (20)
ℑ[ηs(z)] = ℑ[ηs(σ + it)] = −Σ_{k=1}^{∞} cos(πXk) · sin(t log Xk) / Xk^σ   (21)
Proving the convergence of the above (random) sums is not obvious. The notation ηs emphasizes the fact that
the (Xk )’s have been created using the scaling factor s; if s = 0, then Xk = k and ηs = η.
Figure 8 shows the orbits of ηs (σ + it) in the complex plane, for fixed values of σ and s. The orbit consists
of the points P (t) = (ℜ[ηs (σ + it)], ℑ[ηs (σ + it)]) with 0 < t < 200, and t increasing by increments of 0.05. The
plots are based on a single realization of the Poisson-binomial process. The sums converge very slowly, though
there are ways to dramatically increase the convergence: for instance, Euler’s transform [Wiki] or Borwein’s
method [Wiki]. I used 10^4 terms to approximate the infinite sums.
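The sketch below (my own code, not the one behind Figure 8) generates the orbit points P(t) from the partial sums of Formulas (20)–(21), with λ = 1, a uniform F, and 10^4 terms.

```python
import numpy as np

def perturbed_eta_orbit(sigma=0.75, s=1e-3, n_terms=10**4, t_max=200, dt=0.05, seed=6):
    # X_k = k + s*(2U - 1): one realization of a Poisson-binomial process (lambda = 1,
    # uniform F). Returns the partial sums of Formulas (20)-(21) for t = dt, 2dt, ...
    rng = np.random.default_rng(seed)
    k = np.arange(1, n_terms + 1)
    xk = k + s * (2 * rng.uniform(size=n_terms) - 1)
    weight = np.cos(np.pi * xk) / xk**sigma        # plays the role of (-1)^k / k^sigma
    logx = np.log(xk)
    t_values = np.arange(dt, t_max, dt)
    re = np.array([-(np.cos(t * logx) * weight).sum() for t in t_values])
    im = np.array([-(np.sin(t * logx) * weight).sum() for t in t_values])
    return t_values, re, im   # plot (re, im) to draw the orbit in the complex plane

t, re, im = perturbed_eta_orbit(sigma=0.75, s=1e-3)
```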
Figure 8: Orbit of η in the complex plane (left), perturbed by a Poisson-binomial process (right)
Let’s look at the two plots on the left in Figure 8. A hole around the origin is noticeable when σ = 0.75.
This suggests that η has no root with real part σ = 0.75, at least if 0 < t < 200, as the orbit never crosses the
origin, and indeed stays far away from it at all times. For larger t’s the size of the hole may decrease, but with
appropriate zooming, it may never shrink to an empty set. This is conjectured to be true for any σ ∈ ]1/2, 1[ and
any t; indeed, this constitutes the famous Riemann Hypothesis. To the contrary, if σ = 0.5, the orbit crosses
the origin time and again, confirming the well-known fact that η has all its non-trivial zeros (infinitely many)
on the critical line σ = 1/2. This is the other part of the Riemann Hypothesis.
I noticed that the hole observed when σ = 0.75 shifts more and more to the left as σ decreases. Its size also
decreases, to the point that when σ = 1/2 (but not before), the hole has completely vanished, and its location
has shifted to the origin. For σ = 0.75, it seems that there is a point on the X-axis, to the left-hand side of the
hole but close to it, where the orbit goes through time and again, playing the same role as the origin does to
σ = 1/2. That special point, let’s name it h(σ), exists for any σ ∈ [1/2, 1[, and depends on σ. It moves to the right
as σ increases. At least that’s my conjecture, which is a generalization of the Riemann Hypothesis.
Let’s now turn to the two plots on the right in Figure 8. I wanted to check if the above features were
unique to the Riemann zeta function. If that is the case, it further explains why the Riemann Hypothesis is
so hard to prove (or possibly unprovable), and why it constitutes to this day one of the most famous unsolved
mathematical problems of all time. Indeed, there is very little leeway: only extremely small perturbations keep
these features alive. For instance, using s = 10−3 and a uniform F , that is, a microscopic perturbation, the
orbits shown on the right are dramatically changed compared to their left counterparts. Key features seem to
barely be preserved, and I suspect the hole, when σ = 0.75, no longer exists if you look at larger values of t: all
that remains is a lower density of crossings where the hole used to be, compared to no crossing at all in the
absence of perturbations (s = 0).
I will publish an eBook with considerably more details about the Riemann Hypothesis (and the twin prime
conjecture) in the near future. The reason, I think, why such little perturbations have a dramatic effect, is
because of the awkward chaotic convergence of the above series: see details with illustrations in Exercises 24
and 25, as well as here. The Riemann function gives rise to a number of interesting probability distributions,
some related to dynamical systems, some defined on the real line, and some on the complex plane. This will be
discussed in another upcoming book.
Remark: The conjecture that if σ ∈]1/2, 1[, the hole never shrinks to a single point no matter how large t is
(a conjecture weaker than the Riemann Hypothesis) must be interpreted as follows: it never shrinks to a point
in any finite interval [t, t + τ ]. If you consider an infinite interval, this may not be true due to the universality
of the Riemann zeta function [Wiki]. An approach to the Riemann hypothesis, featuring new developments,
and not involving complex analysis, can be found in my article “Fascinating Facts About Complex Random
Variables and the Riemann Hypothesis”, here. For an introduction to the Riemann zeta function and Dirichlet
series, see [44]. See also Section 2.4.1 in this textbook.
The two leftmost videos illustrate the beautiful, semi-chaotic convergence of the series attached to the
Dirichlet eta function η(z) [Wiki] in the complex plane. Details are in Section 2.4.1, including the connection
to the famous Riemann Hypothesis [Wiki]. The rightmost video shows fractal supervised clustering performed
in GPU (graphics processing unit), using image filtering techniques that act as a neural network. It is discussed
in Section 2.4.2. For a short beginner introduction on how to produce these videos, read my article “Data
Animation: Much Easier than you Think!”, here.
Thus, the function ζ can be uniquely extended to σ > 0, using ζ(z) = (1 − 2^(1−z))^(−1) η(z), while preserving
Formula (22) if σ > 1: the first series converges if and only if σ > 1, and the second one if and only if σ > 0.
Both functions, after the analytic continuation of ζ, have the same zeroes in the critical strip 0 < σ < 1. The
famous Riemann Hypothesis [Wiki] claims that all the infinitely many zeroes in the critical strip occur at σ = 1/2.
This is one of the seven Millennium Problems, with a $1 million prize, see here. For another one, “P versus NP”,
see Exercise 21, about finding the maximum cliques of a nearest neighbor graph.
More than 10^13 zeroes of ζ have been computed. The first two million are in Andrew Odlyzko’s table, here.
See the OEIS sequences A002410 and A058303. You can find zeroes with the free online version of Mathematica
using the FindRoot[] and Zeta[] functions, here. For fast computation, several methods are available, for
example the Odlyzko–Schönhage algorithm [Wiki]. The statistical properties are studied in Guilherme França
and André LeClair [28] (available online here), in André LeClair in the context of random walks [53] (avail-
able online here) and in Peter J. Forrester and Anthony Mays in the context of random matrix theory [27]
(available online here). I discuss recent developments about the Riemann Hypothesis in my article “Fascinating
Facts About Complex Random Variables and the Riemann Hypothesis”, here. See also my contributions on
MathOverflow: “More mysteries about the zeros of the Riemann zeta function” (here) and “Normal numbers,
Liouville function, and the Riemann Hypothesis” (here).
has a computational complexity that beats (by a long shot) any traditional classifier. It does not require the
computation of nearest neighbor distances.
The video medium also explains how the clustering is done, in better ways than any text description could
do. You can view the video (also called data animation) on YouTube, here. The source code and instructions
to help you create your own videos or replicate this one, is in Section 6.7.2. See Section 3.4.3 for a description
of the underlying supervised clustering methodology.
I use the word “fractal” because the shape of the clusters, and their boundaries in particular, is arbi-
trary. The boundary may be as fractal-like as a shoreline. It also illustrates the concept of fuzzy clustering
[Wiki]: towards the middle of the video, when the entire state space is eventually classified, constant cluster
re-assignments are taking place along the cluster boundaries. A point, close to the fuzzy border between clusters
A and B, is sometimes assigned to A in a given video frame, and may be assigned to B in the next one. By
averaging cluster assignments over many frames, it is possible to compute the probability that the point belongs
to A or B. Another question is whether the algorithm (the successive frames) converges or not. It depends on the
parameters, and in this case, stochastic convergence is observed. In other words, despite boundaries changing
all the time, their average location is almost constant, and the changes are small. Small portions of a cluster,
embedded in another cluster, don’t disappear over time.
data driven techniques. Several chapters in my book “Statistics: New Foundations, Toolbox, and Machine
Learning Recipes” [37] published in 2019 (available online here) deal with extensions and modern versions of
this methodology. I follow the same footsteps here, first discussing the general principles, and then showing how
it applies to estimating the intensity λ and scaling factor s of a Poisson-binomial process. As in Jesper Møller
[58], my methodology is based on minimum contrast estimation: see slides 114-116 here or here. See also [18]
for other examples of this method in the context of point process inference.
There are easier methods to estimate λ and s: I describe some of them in Section 3.2. However, the goal
here is to provide a general framework that applies to any multivariate parameter. I chose the parameters λ, s
as they are central to Poisson-binomial processes. By now, you should be familiar with them. They serve as
a test to benchmark the methodology. Yet, the standard estimator of λ is slightly biased, and the method in
this section provides an alternative to obtain unbiased estimates. It assumes that boundary effects are properly
handled. I describe how to deal with them in Section 3.1.2.
The idea behind minimum contrast estimation is to use proxy statistics as substitutes for the parameter
estimators. It makes sense here as it is not clear what combination of variables represents s.
where χ is the indicator function [Wiki] and N (Bk ) is the number of points in Bk . If there is a one-to-one
mapping between (λ, s) and (p, q), then one can easily compute (p, q) using Formula (26) applied to the observed
data, and then retrieve (λ, s) via the inverse mapping. It is even possible to build 2D confidence regions for the
bivariate parameter (λ, s). That’s it!
I now explain how to implement this generic method to our example. I also address some of the challenges.
First, the problem is to find good proxy statistics for the model parameters λ, s. I picked p and q because
it leads to an easy implementation in Excel. However, interarrival times (their mean and variance) are better,
requiring smaller samples to achieve the same level of accuracy. Next, we are not sure if the mapping in question
is one-to-one.
The scatterplot in Figure 10 illustrates the method. The X axis represents p, and the Y axis represents q.
There are two main features:
Observed data. The purple dots correspond to values of (p, q) derived from the observations, and
computed with Formula (26). I tested three sets of observations (thus the three purple dots), each with
20,001 points (that is, n = 10,000).
Theoretical model. The four overlapping clusters show the distribution of (p, q) for four different values
of (λ, s). Each cluster – identified by its color – has 100 points corresponding to 100 simulations. Each
simulation within a same cluster uses the same hand-picked (λ, s). Also, each simulation consists of 2n + 1
data points, to match the number of observations. The purpose of these simulations is to find the inverse
mapping via numerical approximations. Four colors is just a small beginning. In Table 2, each cluster
is summarized by two statistics: its computed center in the (p, q)–space, associated to the hand-picked
parameter vector (λ, s).
Point Estimates
Let us focus on the rightmost purple dot in Figure 10, corresponding to one of the three observations sets. Its
coordinates vector is denoted as (p0 , q0 ). The (p, q)–space is called the proxy space. In this case, it is equal to
[0, 1] × [0, 1]. If the proxy space contained only the four points (p, q) listed in Table 2, the estimated value
(λ0 , s0 ) of (λ, s) would be the center of the orange cluster. That is, (λ0 , s0 ) = (1.4, 0.6) because (0.3275, 0.4113)
is the closest cluster center to the purple dot (p0 , q0 ) in the proxy space.
But let’s imagine that I hand-picked 10^5 vectors (λ, s) instead of four, thus generating 10^5 cluster centers
and a very large Table 2 with 10^5 entries. Then again, the best estimator of (λ, s) would still be the one obtained
by minimizing the distance between the purple dot (p0, q0) computed on the observations, and the 10^5 cluster
centers. In practice, the hand-picking is automated (computerized) and leads to a black-box implementation of
the estimation procedure.
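The point-estimation step then reduces to a nearest-center lookup in the proxy space. The sketch below is my own minimal version of it (the observed value and two of the table entries are hypothetical; only the pair (1.4, 0.6) → (0.3275, 0.4113) comes from the text).

```python
import numpy as np

def nearest_center_estimate(p0, q0, mapping):
    # mapping: list of ((lam, s), (p, q)) pairs, i.e. cluster centers in the proxy space
    # obtained by simulation for hand-picked parameter vectors (as in Table 2).
    # Returns the (lam, s) whose center is closest to the observed (p0, q0).
    centers = np.array([pq for _, pq in mapping])
    d2 = np.sum((centers - np.array([p0, q0]))**2, axis=1)
    return mapping[int(np.argmin(d2))][0]

# The first entry is the one quoted in the text; the other two are hypothetical.
table2 = [((1.4, 0.6), (0.3275, 0.4113)),
          ((1.0, 0.5), (0.40, 0.38)),
          ((1.2, 1.0), (0.36, 0.40))]
print(nearest_center_estimate(0.33, 0.41, table2))    # -> (1.4, 0.6)
```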
Table 2: Extract of the mapping table used to recover (λ, s) from (p, q)
Thanks to the law of large numbers [Wiki], the cluster centers quickly converge to their theoretical value as n
increases. The cluster centers (p, q) in Table 2 can be computed as a function of (λ, s) using a mathematical
formula. It facilitates the construction of the inverse mapping, avoiding tedious simulations: see Section 3.1.2.
Confidence Regions
Again, for the sake of illustration, let us focus on the rightmost purple dot (p0 , q0 ). Imagine that contour lines
are drawn around each cluster center. A contour line of level γ (0 ≤ γ ≤ 1) is a closed curve (say an ellipse)
oriented in the same direction as the cluster in question, and centered at the cluster center. Its interior covers
a proportion γ of the points of the cluster, in the proxy space. In this case, the contour line of level γ, around
the cluster center (p, q) is obtained as follows.
First define
Hn(x, y, p, q) = [2n / (1 − ρp,q²)] · [ ((x − p)/σp)² − 2 ρp,q ((x − p)/σp)((y − q)/σq) + ((y − q)/σq)² ],   (27)
with
σp = √(p(1 − p)),   σq = √(q(1 − q)),   ρp,q = −pq / √(pq(1 − p)(1 − q)).   (28)
Then the contour line is the set of points (x, y) ∈ [0, 1] × [0, 1] satisfying Hn (x, y, p, q) = Gγ . Here Gγ is a
quantile of some Hotelling distribution [Wiki]. I included a table of the Gγ function, obtained by simulations,
in my spreadsheet (see next section); it is also pictured on the right plot in Figure 11.
This classic asymptotic result is a consequence of the central limit theorem, see here. For detailed expla-
nations, see Exercise 27. Note that Gγ does not depend on n, p or q. At least not asymptotically.
Figure 11: Confidence region for (p, q) – Hotelling’s quantile function on the right
We now have a mechanism to find any confidence region [Wiki] of level γ. It works – when n is not too
small – as follows:
Step 1. Let (p0 , q0 ) be the estimator of (p, q), computed on your observations set with Formula (26).
Step 2. Find all (x, y)’s satisfying Hn (x, y, p, q) = Gγ , where (p, q) is replaced by (p0 , q0 ). These (x, y)’s
form the boundary of your confidence region in the proxy space.
Step 3. Apply the inverse mapping described earlier (see Table 2 and spreadsheet section below) to map
(x, y) to (λ, s). Do it for all (x, y) on the boundary obtained in step 2.
The resulting (λ, s)’s obtained in step 3 form the boundary of your confidence region in the original parameter
space. The methodology described here is generic and applicable to any estimation problem involving multidi-
mensional parameters, regardless of the complexity. In Exercise 27, I introduce a new type of confidence region
called dual confidence region, obtained by swapping the roles of (p, q) and (x, y) in Formula (27). This new
concept is also discussed here and here.
Again, the choice of p, q as proxy statistics is not ideal, but it leads to an easy implementation in Excel,
offering educational value. A different choice may lead to more narrow confidence regions, that is, a higher
confidence level γ. Or to put it another way, it may require a smaller sample size [Wiki] (that is, smaller
observations sets) to produce the same level of confidence. This is true if you choose proxy statistics that,
unlike p and q, are independent.
Remark: Most authors use 1 − α for the confidence level, based on a long tradition. A different but related
concept called “significance level” is denoted as α: it is technically defined as “one minus the confidence level”
[Wiki]. Then the “critical value” is denoted as Zα . Here, I use γ instead of α or 1 − α, and a single term
“confidence level”, to avoid confusion.
boundary effects. Still, it is always good practice to quantify all potential sources of bias. On some occasions,
the pseudo-random number generator itself was one of the major sources of inaccuracies (see Section 3.6.1),
until it got identified and replaced by a better one. On other occasions, roundoff errors caused by numerical
instability were to blame. It got fixed by using more stable computations or high precision computing [Wiki].
Spreadsheet
The spreadsheet is available on my GitHub repository, here: PB independence.xlsx (click on the link to
access it). Look for the Confidence Region tab. I simulated N = 10,000 observations sets, each with n
observations. I used the values p0 , q0 in cells B1, B2, and a bivariate Bernoulli model with these values as
parameters, to generate the observations. The source code related to the Bernoulli model is in column Y. The
Bernoulli model is described in Exercise 27, as well as here and here.
Each row in the spreadsheet table represents one of the N sets, with the estimated proportions p, q in
columns D and E, then σp , σq , ρp,q in columns H, I, J, and Gγ in column G. These quantities were computed
using the source code in column Y, based on Formulas (26) and (27). The rows are sorted by the values in
column G. The confidence region featured in Figure 11 corresponds to the (p, q)’s in the first 9000 rows, after the
sorting in question. Thus the confidence level is γ = 90%. The corresponding Gγ = 4.595 is in cell G:9001.
By virtue of Theorem 4.1, ϕτ (t) = 1 if τ = 1/λ. More generally, regardless of τ , the function ϕτ (t) is periodic
of period 1/λ. That is, ϕτ (t) = ϕτ (t + 1/λ). This latter statement is also true for Var[Nτ (t)], P [Nτ (t) = 0],
and P [Nτ (t) = 1]. This fact is trivial if you look at Formulas (4), (5), (6) and (7), used to compute the four
quantities in question.
The amplitude of the oscillations is extremely small even with a scaling factor as low as s = 0.3 (assuming
F is logistic). It quickly tends to zero as s → ∞. So, the process is almost stationary unless s is very close to
zero. Thus, in most inference problems, the choice of the (non-overlapping) intervals has very little impact. In
particular, ϕτ (t) ≈ λτ . The small amplitude of ϕτ (t) is pictured in Figure 12.
Assuming that ϕτ(t) is constant and equal to λτ results in a tiny error, unless s is very close to zero. By contrast, boundary effects are a bigger source of bias, this time when s is large. Simulations can quantify
the amount of bias, see Section 3.5. See also the spreadsheet section below.
Spreadsheet
The functions ϕτ (t) = E[Nτ (t)], Var[Nτ (t)], P [Nτ (t) = 0] and P [Nτ (t) = 1] are tabulated in the spreadsheet
PB independence.xlsx. See columns D to I in the Periodicity tab. The parameters are λ = 1.4 (cell
B1) and s = 0.3 (cell B2). The source code to produce this table is in column AI. Here τ = 1.
Also, columns U to Z contain the computations to estimate λ based on a realization of a Poisson-binomial
process in column S. I generated 2n + 1 points, with n = 5000. The estimator, denoted as λ0 , is in column Z. I
computed different versions of λ0 = Nτ (t)/τ , based on different values of t and τ . The point counts Nτ (t) are
computed on the simulated realization. The true value λ = 1.4 (used for the simulation in column S) is stored
in cell B4, while s = 12 is stored in cell B5. The purpose is to find optimum t, τ that minimize the boundary
effects, to get an unbiased estimator λ0 of the intensity λ.
Figure 13 shows how λ0 (on the Y axis) varies depending on the choice of a parameter α (on the X axis,
and also in column U in the spreadsheet). The parameter α, with 0.96 < α ≤ 1 in the picture, determines the
interpercentile range [Lα , Uα ] = [t, t + τ [ used to compute λ0 . When α = 1 (the leftmost position on the X axis),
Lα is the minimum, and Uα is the maximum value among the points of the process. The bias is also maximum.
The smaller α, the fewer points used to compute λ0 , and the further away we are from the boundaries, thus
reducing the bias to almost zero if α is small enough. Yet the smaller α, the more unstable λ0 is. Thus one
needs to find the right balance between a too large and a too small value of α.
In my example, if you look at Figure 13, α = 0.992 achieves this goal, yielding an estimate λ0 = 1.400
correct to three digits. Note that α = 1 yields a biased value of λ0 between 1.380 and 1.390 depending on
the simulation (close to 1.380 in Figure 13). A technique such as the automated elbow rule, described in
Section 3.4.4, can be used to detect the optimum α, and thus the optimum λ0 .
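To make the procedure concrete, here is a minimal Python sketch of the interpercentile estimator, assuming a logistic F and the same parameter values as in the spreadsheet (λ = 1.4, s = 12, n = 5000); the function name and the scan over α are mine, not part of the spreadsheet.

import numpy as np

def intensity_estimate(points, alpha):
    # lambda_0 = N_tau(t) / tau, where [t, t + tau[ is the interpercentile range [L_alpha, U_alpha[.
    x = np.sort(np.asarray(points))
    half = (1.0 - alpha) / 2.0
    lo, hi = np.quantile(x, [half, 1.0 - half])
    count = np.sum((x >= lo) & (x < hi))
    return count / (hi - lo)

# Poisson-binomial realization with logistic F, lambda = 1.4, s = 12, n = 5000 (2n+1 points),
# then a scan over alpha, as in Figure 13.
rng = np.random.default_rng(0)
lam, s, n = 1.4, 12.0, 5000
k = np.arange(-n, n + 1)
points = k / lam + s * rng.logistic(size=k.size)
for alpha in (1.0, 0.996, 0.992, 0.98):
    print(alpha, round(intensity_estimate(points, alpha), 4))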
Let (Xk ) be the points of a Poisson-binomial process MA of intensity λ = 1 and scale factor s = 0.7, with
a logistic F . Exercise 10 shows – using theoretical arguments – that the point counts are not independent.
Here I establish the same conclusion via statistical testing. The purpose is to illustrate how the test works,
so that you can use it in other contexts. I chose three intervals B1 = [−1.5, −0.5[, B2 = [−0.5, 0.5[, and
B3 = [0.5, 1.5[. The data consists of m = 1000 realizations of the process in question, each one consisting of
41 points Xk , k = −20, . . . , 20. The number 41 is large enough in this case, to eliminate boundary effects.
The data, computations and results are in the spreadsheet PB independence.xlsx, described later in this
section.
The point count attached to a realization ω of the point process is denoted as Nω . The aggregated point
count over the m realizations is denoted as N , and the set of m realizations is denoted as Ω. Now, for i = 1, 2, 3
and j1 , j2 , j3 ∈ N, I can define the following quantities:
pi(j) = (1/m) Σ_{ω∈Ω} χ[Nω(Bi) = j],

p(j1, j2, j3) = (1/m) Σ_{ω∈Ω} ∏_{i=1}^{3} χ[Nω(Bi) = ji],    (29)

p′(j1, j2, j3) = (1/m³) ∏_{i=1}^{3} Σ_{ω∈Ω} χ[Nω(Bi) = ji],    (30)
where χ is the indicator function [Wiki]. For instance, p1 (3) = 0.043 means that in 43 realizations out of
m = 1000, the domain B1 contained exactly 3 points. Also, p′ (j1 , j2 , j3 ) = p1 (j1 )p2 (j2 )p3 (j3 ). The three point
counts N (B1 ), N (B2 ), N (B3 ) are independently distributed if and only if Formulas (29) and (30) represent the
same quantity when m = ∞. In other words, the three point counts are independently distributed if p → p′
pointwise [Wiki], as m → ∞.
To avoid future confusion, p and p′ are denoted as pA and p′A to emphasize the fact that they are attached
to the process MA . To test for independence, I simulated m realizations of a sister point process MB : one with
the same marginal distributions for the three point counts, using the estimates pi (j) obtained from MA , but
this time with guaranteed independence of the point counts, by design. Likewise, I define the functions pB and
p′B . Let ρA be the correlation between pA and p′A , computed across all triplets satisfying
I chose ϵ = 0. In my example, there were fewer than 7 × 7 × 7 = 343 such triplets. Finally, the statistic of the test is ρA².
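The sketch below shows one way to compute the statistic in Python, assuming the three intervals B1, B2, B3 and m simulated realizations as above. Since the exact condition on the triplets is not reproduced in this copy, the sketch simply keeps every triplet for which at least one of the two frequencies is positive (in the spirit of ϵ = 0); the function names are mine.

import numpy as np
from itertools import product

def independence_statistic(realizations, bins):
    # rho^2 for the point-count independence test: correlation between the joint frequencies p
    # (Formula (29)) and the products of marginal frequencies p' (Formula (30)).
    m = len(realizations)
    counts = np.array([[np.sum((x >= a) & (x < b)) for (a, b) in bins]
                       for x in realizations])                     # N_omega(B_i), shape (m, 3)
    jmax = counts.max() + 1
    p_marg = [np.bincount(counts[:, i], minlength=jmax) / m for i in range(3)]
    p_joint, p_prod = [], []
    for j1, j2, j3 in product(range(jmax), repeat=3):
        joint = np.mean((counts[:, 0] == j1) & (counts[:, 1] == j2) & (counts[:, 2] == j3))
        prod_ = p_marg[0][j1] * p_marg[1][j2] * p_marg[2][j3]
        if joint > 0 or prod_ > 0:          # keep triplets that actually occur
            p_joint.append(joint)
            p_prod.append(prod_)
    return np.corrcoef(p_joint, p_prod)[0, 1] ** 2

# Example: m = 1000 realizations of 41 points, logistic F, lambda = 1, s = 0.7.
rng = np.random.default_rng(1)
k = np.arange(-20, 21)
reals = [k / 1.0 + 0.7 * rng.logistic(size=k.size) for _ in range(1000)]
print(independence_statistic(reals, [(-1.5, -0.5), (-0.5, 0.5), (0.5, 1.5)]))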
3.2 Estimation of Core Parameters
It is assumed that the point process covers the entire state space R or R2 with infinitely many points, and that
only a finite number of points are observed through a finite (typically rectangular) window or interval. Here I
focus on the one-dimensional case. For processes in two dimensions, see Section 3.4.2.
Estimation of λ
There are various ways to estimate the intensity λ (more specifically, λ^d in d dimensions) using interarrival times
T , nearest neighbors (in two dimensions) or the point count N (B) computed on some interval B. A good
estimator with small variance, assuming boundary effects are mitigated (see Section 3.5), is the total number
of observed points divided by the area (or length, in one dimension) of the window of observations.
Another estimator is based on Theorem 4.3: the expected value of the interarrival time is 1/λ. Thus, if you average all the interarrival times across all the observed points (called events in one dimension), you get an unbiased estimator of 1/λ. Its multiplicative inverse is a slightly biased estimator of λ; if the number of points is large enough (say > 50), the bias is negligible.
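A minimal Python sketch of both estimators, assuming the points of one realization are available as a one-dimensional array; the helper name is mine.

import numpy as np

def estimate_lambda(points, window=None):
    # Two simple estimators of lambda in one dimension:
    #  - point count in the window divided by the window length;
    #  - inverse of the average interarrival time (Theorem 4.3: E[T] = 1/lambda).
    x = np.sort(np.asarray(points))
    a, b = window if window is not None else (x[0], x[-1])
    lambda_count = np.sum((x >= a) & (x <= b)) / (b - a)
    lambda_gap = 1.0 / np.diff(x).mean()    # slightly biased; negligible beyond ~50 points
    return lambda_count, lambda_gap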
Estimation of s
Once λ has been estimated, the scaling factor s can be estimated by leveraging Theorem 4.2. The strategy is as follows. Let λ0 be your estimate of λ. By virtue of Theorem 4.2, the interarrival times satisfy E[T^r(λ, s)] = E[T^r(1, λs)]/λ^r for any r > 0. This result does not depend on the distribution F. With r = 2, let
τ0 be your estimate of the average squared interarrival time, computed on your data set,
τ′ = (λ0)^r · τ0, where λ0 is your estimate of λ (see the above subsection),
s′ be the solution to E[T^r(1, s′)] = τ′.
Then s0 = s′/λ0 is an estimate of s.
Example: Here F is the logistic distribution, and I chose r = 2. Any r > 0 except r = 1 would work. If λ0 = 1.45 and τ0 = 0.77, then τ′ = (λ0)² τ0 ≈ 1.61. Looking at the E[T²(1, s′)] table, to satisfy E[T²(1, s′)] ≈ 1.61, you need s′ = 0.65. Thus s0 = s′/λ0 = 0.45. These numbers match those obtained by simulation. To view or download the table, look at the E[T²] tab in PB inference.xlsx.
The equation E[T²(1, s′)] = τ′, where s′ is the unknown, can be solved using numerical methods. The easiest way is to build a granular table of E[T²(1, s)] for various values of s, by simulating Poisson-binomial processes of intensity λ = 1 and scaling factor s. Then finding s′ consists in browsing and interpolating the table in question the old-fashioned way, to identify the value of s closest to satisfying E[T²(1, s)] = τ′. This can
of course be automated. There are two ways to perform the simulations in question:
generating one realization of each process with a large number of points (that is, one realization for each
0 < s < 20 with λ = 1 and s increments equal to 0.01),
or generating many realizations of each process, each one with a rather small number of points.
Either way, the results should be almost identical due to ergodicity if the same F is used in both cases. The
simulations also allow you to compute the theoretical variance of the estimators in question (at least a very good
approximation). This is useful when multiple estimators (based on different statistics) are available, to choose
the best one: the one with minimum variance. Simulations also allow you to compute confidence intervals for
your estimators, as discussed in Section 3.1. The source code for the simulations can be found in Section 6.2.
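Below is a Python sketch of the table-based approach, assuming a logistic F and using one long realization per value of s (the first of the two simulation strategies above); the names, grid resolution and the example values mirroring λ0 = 1.45 and τ0 = 0.77 are illustrative.

import numpy as np

def interarrival_moment_table(s_values, r=2, n=20000, lam=1.0, seed=0):
    # Granular table of E[T^r(1, s)]: one long simulated realization per value of s (logistic F);
    # points near the boundary are dropped before averaging.
    rng = np.random.default_rng(seed)
    k = np.arange(-n, n + 1)
    table = {}
    for s in s_values:
        x = np.sort(k / lam + s * rng.logistic(size=k.size))
        gaps = np.diff(x[n // 2: -(n // 2)])
        table[float(s)] = np.mean(gaps ** r)
    return table

def solve_for_s(table, target):
    # Browse the table for the value of s whose tabulated moment is closest to the target tau'.
    return min(table, key=lambda s: abs(table[s] - target))

# Example mirroring the text: lambda_0 = 1.45, tau_0 = 0.77, so tau' = lambda_0^2 * tau_0.
table = interarrival_moment_table(np.arange(0.05, 2.001, 0.01))
s_prime = solve_for_s(table, 1.45 ** 2 * 0.77)
print("s' =", s_prime, " s_0 =", s_prime / 1.45)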
Var[Nk ] ≤ 1 does not depend on k thanks to the choice of Bk (see Section 3.1.2). The variance is maximum
and equal to one when s = ∞.
It is possible, for any value of s and λ, to compute the theoretical variance v(λ, s) = Var[Nk ] using either
simulations or Formula (5) with a = 0 and b = 1/λ. It slightly depends on F , but barely. Now compute the
empirical variance of Nk as the average (Nk − 1)² across all the Bk’s, based on your observations, assuming λ is known or estimated. This empirical variance is denoted as v0(λ). The estimated value of s is the one that makes the empirical and theoretical variances identical, that is, the unique value of s that solves the equation
v(λ, s) = v0 (λ). This method easily generalizes to higher dimensions, see Section 3.4.2. The fact that E[Nk ] = 1
is a direct consequence of Theorem 4.1.
See the Nk tab in PB inference.xlsx, for a Poisson-binomial process simulation with a generalized
logistic F , and computation of E[Nk ] and Var[Nk ] in Excel. You can download the spreadsheet from the same
location.
That is, X = Xk with k = L(X). See definition of arg min here. This assumes that λ is known or estimated.
In this particular situation, assuming s is also known or estimated, the empirical distribution of s · (X − L(X))
computed over many points X, converges to F as the number of observed points tends to infinity. See also
Section 4.7 about the hidden process, and Exercise 12.
A more practical situation is when one has to decide which F provides the best fit to the data, given a
few potential candidates for F . In that case, one may compute (using simulations) the theoretical expectation
η(r, λ, s, F) = E[T^r(λ, s)] as a function of r > 0 for various F’s, and find which F provides the best fit to the estimated E[T^r(λ, s)], denoted as η0(r, λ, s, F) and computed on the data (the expectation being replaced by
an average when computed on the data). By best fit, I mean finding F that minimizes (say)
γ(F) = ∫_0^2 |η(r, λ, s, F) − η0(r, λ, s, F)| dr.    (31)
Again, s and λ should be estimated first. However, a simultaneous estimation of λ, s, F is feasible and consists
of finding the parameters λ, s, F minimizing γ(F ), now denoted as γ(λ, s, F ). See Section 3.2.1 to estimate λ
and s separately: this stepwise procedure is simpler and less prone to overfitting [Wiki].
The estimation technique introduced here, especially Formula (31), is sometimes referred to as minimum
contrast estimation. See slides 114–116 in the presentation entitled “Introduction to Spatial Point Processes
and Simulation-Based Inference”, by Jesper Møller [58], available online here or here.
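A Python sketch of minimum contrast estimation based on Formula (31), assuming λ and s have already been estimated; the candidate distributions, sampler functions and grid of r values are illustrative choices, not prescriptions.

import numpy as np

def eta_simulated(r_grid, sampler, lam, s, n=5000, seed=0):
    # eta(r, lambda, s, F) = E[T^r(lambda, s)] estimated from one long realization, on a grid of r;
    # sampler(rng, size) draws from the candidate distribution F.
    rng = np.random.default_rng(seed)
    k = np.arange(-n, n + 1)
    x = np.sort(k / lam + s * sampler(rng, k.size))
    gaps = np.diff(x[n // 2: -(n // 2)])
    return np.array([np.mean(gaps ** r) for r in r_grid])

def gamma_contrast(r_grid, eta_model, eta_data):
    # Discretized version of Formula (31): integral over 0 < r <= 2 of |eta - eta_0|.
    return np.trapz(np.abs(eta_model - eta_data), r_grid)

# eta_0 computed on the "observed" data (here simulated with a logistic F); the candidate F
# with the smallest gamma provides the best fit.
r_grid = np.linspace(0.05, 2.0, 40)
logistic = lambda rng, size: rng.logistic(size=size)
gaussian = lambda rng, size: rng.normal(size=size)
eta_data = eta_simulated(r_grid, logistic, 1.0, 0.7, seed=7)
for name, sampler in [("logistic", logistic), ("gaussian", gaussian)]:
    print(name, gamma_contrast(r_grid, eta_simulated(r_grid, sampler, 1.0, 0.7, seed=1), eta_data))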
Statistic      Formula (s = ∞)   Value (s = ∞)   Uniform (s = 39.85)   Logistic (s = 39.85)   Cauchy (s = 39.85)
E[N(B)]        λµ(B)             3/2             1.5019                1.5000                 1.4962
Var[N(B)]      λµ(B)             3/2             1.4738                1.4906                 1.4872
P[N(B) = 0]    exp(−λµ(B))       0.2231          0.2196                0.2221                 0.2230
E[T]           1/λ               1               1.0003                0.9999                 1.0010
Var[T]         1/λ²              1               0.9680                0.9888                 1.0029
E[√T]          (1/2)√(π/λ)       0.8862          0.8865                0.8862                 0.8873
Table 3 summarizes some statistics produced with the source code in Section 6.2, with λ = 1, r = 1/2 and
B = [a, b]. Here, a = −0.75 and b = 0.75. The notation µ(B) stands for b − a. In two dimensions, it represents
the area of the set B (typically, a square or a circle). In one dimension, when s = ∞, N (B) has a Poisson
distribution of expectation λµ(B), and T has an exponential distribution of expectation 1/λ. The limiting process is a stationary Poisson process of intensity λ. The exact formula for E[√T], when s = ∞, was obtained
with the online version of Mathematica: you can check the computation, here. In general, convergence to the
Poisson process, when s → ∞, is slower and more bumpy if F is uniform, compared to using a logistic or Cauchy
distribution for F .
k (in Xk)     −5     −4     −3     −2     −1      0      1      2      3      4      5
E[Tk]        0.99   0.98   1.01   0.99   1.01   1.00   1.00   1.02   1.00   1.00   1.01
E[Tk^1/2]    0.90   0.90   0.90   0.91   0.90   0.91   0.91   0.91   0.90   0.90   0.91
E[Tk^3/2]    1.24   1.24   1.24   1.27   1.22   1.29   1.27   1.26   1.26   1.24   1.27
E[Tk^2]      1.70   1.68   1.70   1.75   1.67   1.79   1.76   1.70   1.74   1.71   1.75
Table 4 displays various moments obtained by simulation, from averaging Tk^r across 10⁴ realizations of a Poisson-binomial process with a logistic F and s = λ = 1, for small values of k, yielding about 2 digits of accuracy. Each
realization consisted of 2n + 1 points X−n , X−n+1 , . . . , X0 , . . . , Xn−1 , Xn , with n = 30 large enough to avoid
significant boundary effects (see Section 3.5). The interarrival time Tk was defined as the distance between Xk and its closest neighbor Xk′ to the right. The purpose was to check whether the choice of k matters. The conclusion from looking at the table is that it does not. This empirically justifies the choice k = 0 in
our definition of T in Section 1.2.
Another way to measure T is by averaging the various Tk = Xk′ − Xk, say for −10⁴ < k < 10⁴, measured on a single realization of the same Poisson-binomial process, with a very large n, say n = 3 × 10⁴. Here Xk′ is the closest neighbor to Xk, to the right on the real axis. It yields the same result. The theoretical value for r = 1 is E[T] = 1/λ, according to Theorem 4.3. Also for r = 2, the theoretical value if s = ∞ is E[T²] = 2/λ² due to the Poisson process approximation. The value reported in Table 4 is around 1.72, and this is for s = 1.
We are not that far from the Poisson limit!
Figure 15: Radial cluster process (s = 0.5, λ = 1) with centers in blue; zoom in on the left
In this section, I explore some of these peculiarities. As a starter, let’s look at Figures 3 and 4. They clearly
represent two distinct models: lattice structure, versus random point distribution. But what about Figures 15
versus 16? Actually, all four feature the same model. The only difference is the choice of the scaling factor s.
The first two represent two extremes: s = 0.2 versus s = 2. But the last two correspond to in-between cases
(s = 0.5 versus s = 1), and look similar. Also, unless you have experience dealing with these processes, it is
not easy to tell whether or not the point pattern in Figure 16, despite looking a bit more “random” than in
Figure 15, corresponds to pure randomness (a stationary Poisson process). The answer is negative despite the
appearances: the points are too evenly spread to represent pure randomness.
Figure 16: Radial cluster process (s = 1, λ = 1) with centers in blue; zoom in on the left
Figure 17: Realization of a 5-interlacing with s = 0.15 and λ = 1: original (left), modulo 2/λ (right)
The point count in a square of side 1/λ has expectation equal to one, according to a multidimensional version of Theorem 4.1. So, one way to estimate λ is to partition the window of observations W into small squares Bh,k(λ) = [h/λ, (h+1)/λ[ × [k/λ, (k+1)/λ[ for various values of the (unknown) λ, compute the number of points Nh,k(λ) (called point count) in each of these squares, and find λ that minimizes the empirical variance

v(λ) = Σ_{h,k} (Nh,k(λ) − 1)²
computed on the observations. The sum is over h, k ∈ Z ∩ W ′ , where W is the window of observation, and W ′
is slightly smaller than W to mitigate boundary effects. In short, your estimate of the intensity λ is defined as
λ0 = arg min_λ v(λ).
The benefit of this approach is that it also allows you to easily estimate the scaling factor s. Since v(λ) also depends on the unknown s, let’s denote it as v(λ, s). Also, let V(λ, s) be the theoretical variance of the point count N(B) in B = [0, 1/λ[ × [0, 1/λ[, computed using simulations or via Formula (5). The estimated value of s, assuming λ0 is the estimate of λ, is the solution to the equation v(λ0, s) = V(λ0, s).
Another simple estimator, this time for λ^d, is the total number of observed points in the observation window W, divided by the area of W. Here d = 2 is the dimension of the state space. Estimators of λ and s
may also be obtained using nearest neighbor distances, just like I did with interarrival times in one dimension
in Section 3.2.1. I haven’t checked if the random variable S, defined as the size of the connected components
associated to the undirected nearest neighbor graph (see Exercise 20), is of any use to estimate s. Confidence
intervals can be built as in Section 3.1.
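Here is a minimal Python sketch of the variance-based estimator of λ in two dimensions; the window handling, the way the squares are aligned on the lattice, and the function names are my own simplifications.

import numpy as np

def empirical_variance_2d(points, lam, window):
    # v(lambda): sum of (N_{h,k}(lambda) - 1)^2 over squares of side 1/lambda aligned on the
    # lattice, restricted to a slightly shrunk window W' to mitigate boundary effects.
    (x0, x1), (y0, y1) = window
    side = 1.0 / lam
    xs = np.arange(np.ceil((x0 + side) * lam) / lam, x1 - 2 * side, side)   # left edges h/lambda
    ys = np.arange(np.ceil((y0 + side) * lam) / lam, y1 - 2 * side, side)   # bottom edges k/lambda
    pts = np.asarray(points)
    v = 0.0
    for a in xs:
        for b in ys:
            n_hk = np.sum((pts[:, 0] >= a) & (pts[:, 0] < a + side) &
                          (pts[:, 1] >= b) & (pts[:, 1] < b + side))
            v += (n_hk - 1) ** 2
    return v

def estimate_lambda_2d(points, window, candidates):
    # lambda_0 = arg min over candidate values of lambda of v(lambda).
    return min(candidates, key=lambda lam: empirical_variance_2d(points, lam, window))

# Usage sketch: points observed in [-10, 10]^2, candidate lambdas on a grid around a first guess.
# lam0 = estimate_lambda_2d(points, ((-10, 10), (-10, 10)), np.arange(0.5, 2.01, 0.05))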
usually not feasible if cluster overlap is substantial, at least not exactly. This is discussed in Section 3.4.3.
A black-box version of the elbow rule (the traditional tool to estimate the number of clusters) is discussed
in Section 3.4.4.
Shift vectors: They are discussed in Section 1.5.2 and 1.5.3 in the context of m-interlacings (a superimpo-
sition of m processes). Each of the m individual processes has a shift vector attached to it: it determines
the position of a cluster center modulo 1/λ. If these vectors are well separated and s is small, they can be
retrieved. See discussion in Section 3.4.3, and Figure 19, featuring 5 different shift vectors (m = 5) and
thus 5 clusters.
Homogeneity and stretching: In Section 1.5.3, I mention the fact that stretched processes are not homo-
geneous because different intensities apply to the X and Y coordinates: observations are stretched using
different stretching factors for each coordinate. More generally, the process is non-homogeneous if the
intensity depends on the location in the state space. Whether the process is homogeneous or not is thus
easy to test, using the point count statistic N (B) computed at various locations.
m-mixture versus m-interlacing: To decide whether you are dealing with a mixture rather than a super-
imposition of m point processes, one has to look at the point count distribution on a square Bλ of area
1/λ2 . If there is no stretching involved, the theoretical expectation of the point count is E[N (Bλ )] = m
if the process is an m-interlacing; in that case, the number of points in each Bλ is also very stable. The
first thing to do is to estimate λ (see the beginning of Section 3.4.2), then look at the empirical variance
of N (Bλ ) computed on the observations. When s is small enough, N (Bλ ) is almost constant (equal to m)
for an m-interlacing; it almost has a binomial distribution for an m-mixture; see also Exercise 12. Again,
simulations are useful to decide which model provides the best fit.
Size of connected components: An interesting problem is to identify the connected components in the
undirected graph of nearest neighbors associated to a point process, see Exercise 20. These connected
components are featured in Figure 2. Their size distribution is of particular interest: for instance, on the
left plot in Figure 2, corresponding to s = 0, there is only one connected component of infinite size; on
the right plot, there are infinitely many small connected components (about 50% only have two points).
It is still an open question as to whether or not this statistic can be used to discriminate between different
types of point processes, or whether its theoretical distribution is exactly the same for a large class of
point processes (that is, it is an attractor distribution) and thus of little practical value.
Below I discuss a statistical test that I used many times, to check how different a set of observed points is,
compared to one arising from a simple two-dimensional Poisson-binomial point process, or from a stationary
Poisson point process, or more generally from any kind of stochastic point process.
Rayleigh Test
The Rayleigh test is a generic statistical test to assess whether two data sets consisting of points in two
dimensions, arise from the same type of stochastic point process. It assumes that the underlying point process
model is uniquely characterized by the distribution of nearest neighbor distances. The most popular use is
when the assumed model is a stationary Poisson process: in that case, the statistic of the test has a Rayleigh
distribution. It generalizes to higher dimensions; in that case the Rayleigh distribution becomes a Weibull
distribution. In short, what the test actually does is compare the empirical distributions of nearest neighbor distances computed on the two datasets, possibly after standardization, to assess whether, from a statistical point of view, they are indistinguishable.
The test is performed as follows. Let’s say you have two data sets consisting of points in two dimensions,
observed through a window. You compute the empirical distribution of the nearest neighbor distances for both
datasets, based on the observations, after taking care of boundary effects. Let η1 (u) and η2 (u) be the two
distributions in question. The statistic of the test is
V = ∫_{−∞}^{∞} |η1(u) − η2(u)| du = ∫_0^1 |ν1(u) − ν2(u)| du,    (32)
where ν is the empirical quantile function, that is, the inverse of the empirical distribution. An alternative test is based on W = sup_u |η1(u) − η2(u)|, or on W′ = sup_u |ν1(u) − ν2(u)|. The test based on W is the traditional Kolmogorov-Smirnov test [Wiki] with known tabulated values. In Excel, it is easier to use the empirical quantile function, readily available as the PERCENTILE Excel function. In practice, the integral in Formula (32) is replaced by a sum computed over 100 equally spaced values of u ∈ [0, 1]. The advantage of W is that it is known (asymptotically) not to depend on the underlying (possibly unknown) point process model that the data originates from.
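A short Python sketch of the statistics V and W′, using empirical quantile functions of nearest neighbor distances; it assumes SciPy is available for the nearest neighbor search and ignores boundary corrections.

import numpy as np
from scipy.spatial import cKDTree

def nn_distances(points):
    # Distance from each point to its nearest neighbor (k=2: the first hit is the point itself).
    pts = np.asarray(points)
    d, _ = cKDTree(pts).query(pts, k=2)
    return d[:, 1]

def rayleigh_test_statistics(points1, points2, n_grid=100):
    # Discretized versions of V in Formula (32) and of W': compare the empirical quantile
    # functions of nearest neighbor distances of the two data sets.
    u = np.linspace(0.0, 1.0, n_grid)
    q1 = np.quantile(nn_distances(points1), u)
    q2 = np.quantile(nn_distances(points2), u)
    return np.mean(np.abs(q1 - q2)), np.max(np.abs(q1 - q2))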
I provide an illustration in PB inference.xlsx: see the “Rayleigh test” tab in the spreadsheet. I compare
two data sets, one from a simulation of a two-dimensional Poisson-binomial process with s = 20, and one with
Figure 18: Rayleigh test to assess if a point distribution matches that of a Poisson process
s = 0.4. In both cases, λ is set to 1.5 in the simulator; its estimated value on the generated data set is close
to 1.5. I then compare the nearest neighbor distances (their empirical quantile function) with the theoretical
distribution of a two-dimensional stationary Poisson process of intensity λ2 . The theoretical distribution is
Rayleigh of expectation 1/(2λ). The dataset with s = 20 is indistinguishable, at least using the Rayleigh test,
from a realization of a stationary Poisson process. This was expected: as s → ∞, the Poisson-binomial process
converges to a Poisson process by virtue of Theorem 4.5, and the convergence is very fast. But the data set
with s = 0.4 is markedly different from a Poisson point process realization, as seen by looking at the statistic
V or W ′ .
Tabulated values for the statistics V and W ′ can be obtained by simulations. For W , they have been known
since at least 1948, since W is the Kolmogorov-Smirnov statistic [26]. Here I simply used tabulated values
of the Rayleigh distribution since I was comparing the simulated data with a realization of stationary Poisson
process. Confidence bands [Wiki] for the empirical quantile function can be obtained using resampling methods
[Wiki]. Modern resampling methods are discussed in detail in my book “Statistics: New Foundations, Toolbox,
and Machine Learning Recipes” [37] available here; see the chapters “Model-free, Assumption-free Confidence
Intervals” and “Modern Resampling Techniques for Machine Learning”. See also Section 3.1 in this textbook.
Figure 18 illustrates the result of my test, using the empirical quantile function of the nearest neighbor
distances, and the statistic V for the test. No re-sampling or confidence bands were needed, the conclusion
is obvious: s = 0.4 provides a simulated data set markedly different from a Poisson point process realization
(the gray curve is way off) while s = 20 is indistinguishable from a Poisson point process (the red and blue
curves, representing the empirical quantile function of the nearest neighbor distances, are almost identical).
Interestingly, the scatterplot corresponding to s = 0.4 (rightmost in Figure 18) seems more random than with
s = 20 (middle plot), but actually, the opposite is true. The plot with s = 0.4 corresponds to a repulsive process, where points are farther away from each other than pure chance would dictate; thus it exhibits fewer big empty spaces and less clustering, falsely giving the impression of increased randomness.
3.4.3 Clustering Using GPU-based Image Filtering
In this section, I describe a methodology for very fast supervised and unsupervised clustering. The data is
first transformed into a 400 × 400 two-dimensional array called bitmap. The points are referred to as pixels,
and the array represents an image stored in GPU (the graphics processing unit) [Wiki]. The functions applied
to the bitmap are standard image processing techniques such as high pass filtering or histogram equalization
[Wiki]. The easy-to-read source code is in Section 6.6.2; it is accompanied by detailed comments about the
methodology. I encourage you to read it.
The input data consists of a realization (obtained by simulation) of an m-interlacing (that is, a super-
imposition of m shifted Poisson-binomial processes) with each individual process represented by a different
color: see Figure 17. The left plot in Figure 17 shows the data points observed through a small window
B = [−10, 10] × [−10, 10]. The right plot corresponds to a much bigger window, with all points taken modulo
2/λ. So, despite the bigger window, the point locations, after the modulo operation, are in [0, 2/λ] × [0, 2/λ]. I
chose λ = 1 for the intensity, in the simulations. The modulo operation (see Section 3.4.1) magnifies the cluster
structure, invisible on the left plot, and visible on the right plot.
The end result is displayed in Figure 19. The left plot corresponds to unsupervised clustering, including
locating the shift vectors attached to each individual process of the m-mixture. The right plot corresponds to
supervised clustering of the entire state space: the color of a point represents the individual point process it
belongs to; in this case the data set is the training set.
Remark: For the simulations, see source code PB NN.py in Section 6.4 (Part 2), or Formulas (8) and (9);
m-mixtures are described in Exercise 18 and Sections 1.5.3, 1.5.4 and 3.4. See [29] (available online, here) for
a similar use of GPU in the context of nearest neighbor clustering.
where arg max g(j) [Wiki] is the value of j that maximizes g(j), and χ[A] is the indicator function [Wiki]:
χ[A] = 1 if A is true, and 0 otherwise. The boundary problem (when x − u or y − v is outside the bitmap) is
handled in the source code. Have a look at my solution (Part 2 of source code in Section 6.6.2), though there
are many other ways to handle it.
After filtering the whole bitmap 3 times, thanks to the large size of the filtering window (21 × 21 pixels),
all pixels are assigned to a cluster (a color different from 255). This means that any future point (not in the
training set) can easily and efficiently be classified: first, find its location on the bitmap; then its cluster is the
color assigned to that location. It is worth asking whether convergence occurs (and to what solution) if you
were to filter the bitmap many times. I have not investigated this problem, however, I studied convergence for
a similar type of filter, in my paper “Simulated Annealing: A Proof of Convergence” [38].
While the algorithm is very fast, the bottleneck is the large size of the local filter window. The amount of
time required to color the bitmap is proportional to the size of that window: in our case, 21 × 21 pixels. There
is a way to accelerate this by a factor of about 20, using a caching mechanism. See Exercise 26.
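The sketch below illustrates one pass of the supervised filter in Python, consistent with the description of Formula (33): a distance-weighted vote among already-colored pixels in a local 21 × 21 window, with 255 used here as the “unassigned” color. The weights 1/√(1 + u² + v²) are borrowed from Formula (34) and may differ from the exact weights used in the source code of Section 6.6.2.

import numpy as np

UNASSIGNED = 255          # convention used here for pixels not yet attached to a cluster

def supervised_filter_pass(bitmap, half=10):
    # One pass of the local voting filter: each pixel receives the color j that maximizes a
    # distance-weighted count of color-j pixels in a (2*half+1) x (2*half+1) window.
    # Pixels near the border simply use the part of the window that falls inside the bitmap.
    h, w = bitmap.shape
    out = bitmap.copy()
    for x in range(h):
        for y in range(w):
            votes = {}
            for u in range(-half, half + 1):
                for v in range(-half, half + 1):
                    xx, yy = x - u, y - v
                    if 0 <= xx < h and 0 <= yy < w and bitmap[xx, yy] != UNASSIGNED:
                        wgt = 1.0 / np.sqrt(1.0 + u * u + v * v)
                        votes[bitmap[xx, yy]] = votes.get(bitmap[xx, yy], 0.0) + wgt
            if votes:
                out[x, y] = max(votes, key=votes.get)
    return out

# Three passes on a 400 x 400 bitmap, as in the text; pure Python is slow here, see
# Section 6.6.2 and Exercise 26 for faster versions.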
Unsupervised Clustering with Density Equalization
A similar filter is used for unsupervised clustering. Much of what I wrote for supervised clustering also applies here. I recommend that you first read the above section about supervised clustering. Indeed, both supervised
and unsupervised clustering are implemented in parallel in the source code, within the same loop. The main
difference is that the color (or cluster) c(x, y) attached to a pixel (x, y) is not known. Instead of colors, I use
gray levels representing the density of points at any location on the bitmap: the darker the pixel, the higher the density.
I start with a bitmap where c(x, y) = 1 if (x, y) corresponds to the location of an observed point on the bitmap,
and c(x, y) = 0 otherwise. Again, I filter the whole 400 × 400 bitmap 3 times with the same 20 × 20 filter size.
The new gray level assigned to pixel (x, y) at iteration t is now
c′(x, y) = Σ_{u=−20}^{20} Σ_{v=−20}^{20} c(x − u, y − v) · 10^{−t} / √(1 + u² + v²).    (34)
The first time this filter is applied to the whole bitmap, I use t = 0 in Formula (34); the second time I use
t = 1, and the third time I use t = 2. The purpose is to dampen the effect of successive filtering, otherwise the
image (left plot in Figure 19) would turn almost black everywhere after a few iterations, making it impossible
to visualize the cluster structure. The second and third iterations, with the dampening factor, provide an
improvement over using a single iteration only.
After filtering the image, I applied a final post-processing step to enhance the gray levels: see Part 4 of the
source code in Section 6.6.2. It is a purely cosmetic step consisting in binning and rescaling the histogram of
gray levels to make the image nicer and easier to interpret. This step, called equalization, can be automated;
I will discuss it in detail in an upcoming textbook. I chose a data set with significant overlap among the
clusters to show the power of the methodology. Indeed, if you look at the raw data (Figure 17, left plot), the
cluster structure is invisible to the naked eye. This algorithm was able to only partially recover the cluster
structure. The centers of the clusters visible in Figure 19 (left plot) roughly correspond to some of the shift
vectors attached to the m-mixture. Retrieving the shift vectors was one of the goals.
The dataset used here is produced by the program PB NN.py in Section 6.4. You can download the dataset
from my GitHub repository, here. The first column is the cluster number: an integer i ∈ {0, . . . , m − 1} with
m = 5; the fourth column is the X coordinate, and column 5 + i is the Y coordinate. To produce the images
and manipulate the palettes, I used the Pillow graphics library. See Section 6.6.2.
where Uk is uniform on [0, 1]. Also, λ > 0, and the random variables Uk, θk are all independently distributed. If γ > −1, then E[Rk] = (1/λ)Γ(1 + γ), where Γ is the gamma function [Wiki]. In order to standardize the process, I use λ = Γ(1 + γ). Thus, E[Rk] = 1 and, if γ > −1/2,

Var[Rk] = Γ(1 + 2γ)/Γ²(1 + γ) − 1.
1.00, 0.92, 0.77, 0.76, 0.71, 0.69, 0.63, 0.61, 0.60, 0.56, 0.55, 0.55.
Clearly, the third value 0.77 is pivotal, as the next ones stop dropping sharply, after an initial big drop at the
beginning of the sequence. So the “elbow signal” is strongest at m = 3, and the conclusion is that the first
two values (2 = m − 1) outshine all the other ones. The purpose of the black-box elbow rule algorithm, is to
automate the decision process: in this case deciding that the optimum is m = 3.
Note that in some instances, it is not obvious to detect an elbow, and there may be none. In my example,
the elbow signal is very strong, because I chose a rather large value γ = 2 in Formula (37), causing the Brownian
process to exhibit an unusually strong cluster structure, and large disparities among the top v(m)’s. A larger γ
would generate even stronger disparities. A negative value of γ, say γ = −0.75, also causes strong disparities,
well separated clusters, and an easy-to-detect elbow. The resulting process is not even Brownian anymore if
γ = −0.75, since in that case, Var[Rk ] = ∞. The standard Brownian motion corresponds to γ = 0 and can still
exhibit clusters depending on the realization. Finally, in our case, m = 3 also corresponds to the number of
clusters on the left plot in Figure 20. This is a coincidence, one that happens very frequently, because the top
v(m)’s (left to the elbow) correspond to unusually large values of Rk . Each of these very large values typically
gives rise to the building of a new cluster, in the simulations.
The elbow rule can be used recursively, first to detect the number of “main” clusters in the data set, then
to detect the number of sub-clusters within each cluster. The strength of the signal (the height of the red bar)
is typically very low if the v ′ (m)’s have a low variance. In that case, there is no set of values outshining all the
other ones, that is, no true elbow. For an application of this methodology to detect the number of clusters, see
a recent article of Chikumbo [14], available online here. An alternative to the elbow rule, to detect the number
of clusters, is the silhouette method [Wiki].
Figure 20: Elbow rule (right) finds m = 3 clusters in Brownian motion (left)
I now explain how the strength of the elbow signal (the height of the red bars in Figure 20) is computed.
First, compute the first and second order differences of the function v ′ (m): δ1 (m) = v ′ (m − 1) − v ′ (m) for
m > 1, and δ2 (m) = δ1 (m − 1) − δ1 (m) for m > 2. The strength of the elbow signal, at position m > 1, is
ρ1 (m) = max[0, δ2 (m + 1) − δ1 (m + 1)]. I used a dampened version of ρ1 (m), namely ρ2 (m) = ρ1 (m)/m, to
favor cluster structures with few large clusters, over many smaller clusters. Larger clusters can always be broken
down into multiple clusters, using the same clustering algorithm. The data, including formulas, charts, and sim-
ulation of the Brownian motion (done in Excel!), is on the Elbow Brownian tab, in the PB inference.xls
spreadsheet. You can modify the parameters highlighted in orange in the spreadsheet: in this case, γ in cell
B16. Note that λ is set to Γ(1 + γ) in cell B17.
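The elbow signal computation is easy to reproduce; the Python sketch below applies the definitions of δ1, δ2, ρ1 and ρ2 to the sequence of v(m)’s quoted earlier, and should single out m = 3.

import numpy as np

def elbow_strength(v):
    # v is the sequence v(1), v(2), ... (non-increasing). Returns the dampened signal
    # rho2(m) = max(0, delta2(m+1) - delta1(m+1)) / m, with delta1 and delta2 the first and
    # second order differences defined in the text.
    v = np.asarray(v, dtype=float)
    n = len(v)
    delta1 = {m: v[m - 2] - v[m - 1] for m in range(2, n + 1)}      # v(m-1) - v(m)
    delta2 = {m: delta1[m - 1] - delta1[m] for m in range(3, n + 1)}
    return {m: max(0.0, delta2[m + 1] - delta1[m + 1]) / m for m in range(2, n)}

v = [1.00, 0.92, 0.77, 0.76, 0.71, 0.69, 0.63, 0.61, 0.60, 0.56, 0.55, 0.55]
strength = elbow_strength(v)
print(max(strength, key=strength.get))      # expected: 3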
The left plot in Figure 21 represents the partial sums (Xk, Yk) of η(z) for the complex number z = σ + it, using the aforementioned formulas with k terms (k = 1, . . . , 10⁴). The X axis represents the real part, the
Y axis the imaginary part. In complex number notation, (Xk, Yk) is denoted as Xk + iYk. Here σ = 1/2 and t = 24,556.59. Not only is this value of σ + it on the critical line [Wiki] since σ = 1/2, but it is actually an excellent approximation to a non-trivial root [Wiki] of the Riemann zeta function. Thus, starting at (0, 0), after an infinite number of steps (k = ∞), we end up back at (0, 0), as shown on the left plot in Figure 21. In between, the path is pretty wild! No wonder a proof of the famous Riemann Hypothesis [Wiki] remains elusive.
I used the elbow rule to detect the number of sinks, denoted as m. A sink is when – on its path to
convergence – the iterations get stuck for a while around a center, circling many times before resuming the
normal path, creating the appearance of circular clusters. The final sink is centered at (0, 0) since σ + it is a
root of η. If σ > 0 is close to zero, and t is large, the number of sinks can be much larger, and you may need far more than 10⁴ iterations to reach the final sink, called “black hole”. For the elbow rule, I first computed the
empirical percentiles of the distance between (Xk , Yk ) and (Xk+τ , Yk+τ ) with τ = 100, ignoring the first 1000
points where the path is most erratic. Then, I chose the v(m)’s as follows: v(1) corresponds to the maximum
distance, v(2) to the 99-th percentile of the distances, v(3) to the 98-th percentile, and so on. The remaining
computations, once the v(m) are computed, are identical to those in the previous section. The method found
m = 8 sinks.
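For readers who prefer code to the spreadsheet, here is a Python sketch of the partial sums and of the v(m)’s fed to the elbow rule; it assumes the standard alternating series η(z) = Σ (−1)^{n+1} n^{−z} for the formulas referenced above.

import numpy as np

def eta_partial_sums(sigma, t, kmax=10**4):
    # Partial sums (X_k, Y_k) of eta(z) = sum_{n>=1} (-1)^(n+1) n^(-z) at z = sigma + i*t.
    n = np.arange(1, kmax + 1)
    terms = (-1.0) ** (n + 1) * np.exp(-(sigma + 1j * t) * np.log(n))
    z = np.cumsum(terms)
    return z.real, z.imag

X, Y = eta_partial_sums(0.5, 24556.59)
tau, skip = 100, 1000
d = np.hypot(X[skip:-tau] - X[skip + tau:], Y[skip:-tau] - Y[skip + tau:])
v = [np.percentile(d, 100 - (m - 1)) for m in range(1, 13)]   # v(1) = max, v(2) = 99th pct, ...
# These v(m)'s are then fed to the elbow rule of the previous section.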
The data, including formulas, charts, and iterative computations of (Xk, Yk) for k = 1, . . . , 10⁴ (done in
Excel), is on the Elbow Riemann tab, in the PB inference.xls spreadsheet. You can modify the parameters
highlighted in orange in the spreadsheet: in this case, σ in cell B16, and t in cell B17. The reason why “jumps”
appear in the sequence (Xk , Yk ) is explained and further illustrated for the one-dimensional case – the imaginary
part of the Dirichlet eta function η(z) – in Exercise 25. Tables of zeros of the Riemann zeta function (up to the
first two million), published by Andrew Odlyzko, are available here.
F          n     λ   s   N1(n)    N2(n)   N3(n)   ρ(n)
Logistic   100   1   5   38,712   1287    1689    3.2%
Logistic   100   1   1   39,814   186     589     0.5%
Logistic   50    1   5   9356     644     845     6.4%
Logistic   50    1   1   9907     93      294     0.9%
Uniform    100   1   5   39,600   400     801     1.0%
Uniform    100   1   1   40,000   0       401     0.0%
Uniform    50    1   5   9800     200     401     2.0%
Uniform    50    1   1   10,000   0       201     0.0%
question. This point will be generated by the simulator, but may not be included in statistical estimations, and
its creation is a waste of time. These are the two problems we face.
It turns out that some of the biases can be exactly computed, assuming you know the underlying model: in
our case, a Poisson-binomial point process. Let N = n, B(an ) = [−an , an ] × [−an , an ] be a square with an > 0
to be determined later, and ph,k (an ) = P [(Xh , Yk ) ∈ B(an )]. In two dimensions, we have:
ph,k(an) = [F((an − h/λ)/s) − F((−an − h/λ)/s)] × [F((an − k/λ)/s) − F((−an − k/λ)/s)].
Also, let In = {(h, k) ∈ Z², with max(|h|, |k|) ≤ n}, and

N1(n) = Σ_{(h,k)∈In} ph,k(an),
N2(n) = Σ_{(h,k)∉In} ph,k(an),
N3(n) = Σ_{(h,k)∈In} (1 − ph,k(an)) = (2n + 1)² − N1(n).
The quantities N1 (n), N2 (n), N3 (n) represent respectively the expected number of observed points in the small
window B(an ), the expected number of missing (unobserved) points in the same window, and the expected
number of points outside B(an ) that were generated by the simulator if (h, k) ∈ In . The bias, when counting
the points in B(an ) generated by the simulator, is thus N2 (n).
For a fixed n, it is possible to find an that minimizes N2 (n)/N1 (n), but in practice, an = n/λ is good enough.
Table 5 shows the bias N2 (n) obtained with λ = 1 and an = n. The ratio ρ(n) = N2 (n)/(N1 (n) + N2 (n)) is
the proportion of bias. Assuming λ = 1, the unbiased point count (expected value) is 4n²; the biased count is
N1 (n).
I used the CDF function in Section 6.2.4 to compute the statistics N1 , N2 and N3 in Table 5. The source
code, illustrating the use of a bivariate cumulative distribution function F , is as follows:
N1=0
N2=0
N=2000       # should be infinite, but 2000 is good enough
n=100
llambda=1
s=5
aa=n         # aa corresponds to a_n in the text
type="Logistic"

for h in range(-N,N+1):
    print(h)   # progress indicator
    for k in range(-N,N+1):
        ff=(CDF(type,llambda,s,h,aa)-CDF(type,llambda,s,h,-aa)) \
            * (CDF(type,llambda,s,k,aa)-CDF(type,llambda,s,k,-aa))
        if abs(h)<=n and abs(k)<=n:    # (h,k) in I_n
            N1+=ff
        else:
            N2+=ff

N3=(2*n+1)*(2*n+1)-N1
print("N1=",int(N1),"N3=",int(N3))
For a fixed, large n, as s increases, both N2 (n) and N3 (n) increase, but their ratio tends to 1 as s → ∞. This
is because as s → ∞, the Poisson-binomial process tends to a stationary Poisson process.
Remark: If an = n/λ, by virtue of Theorem 4.1 generalized to two dimensions, N1(n) + N2(n) = λ²µ(B(an)) = (2n)².
Figure 22: Each arrow links a point (blue) to its lattice index (red): s = 0.2 (left), s = 1 (right)
One question is how far a point can be from its lattice location, and how frequently such “extremes” occur.
Even more interesting is the reverse question, associated to the inverse or hidden model: can a point (Xh , Yk )
close to the origin, well within the small window of observations, have its lattice location (h, k) very far away?
Such a point will not be generated by the point process simulator. It will be unaccounted for, introducing a
bias; indeed, it is counted in N2 (n). This happens with increased frequency as s increases, requiring a larger
and larger observation window (that is, larger n and N ), as seen in Table 5.
Figure 23: Distance between a point and its lattice location (s = 1)
Unless F has a finite support domain (for instance, if F is uniform), unobserved points in the small window
of observations – even though their expected number is finite and rather small – can be attached to any arbitrary
lattice location, no matter how far away. In two dimensions, the probability P[R > r] that the distance R between a point and its lattice location is greater than r, is

P(R > r) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} χ(x² + y² > r) F(x/s) F(y/s) dx dy
where χ(A) is the indicator function, equal to one if A is true, and to zero otherwise.
The distance R corresponds to the length of the arrow, in Figure 22. If F is Gaussian, then R has a
Rayleigh distribution [Wiki]. In two dimensions, the distance between two nearest neighbor points, for a sta-
tionary Poisson point process, also has a Rayleigh distribution, see Section 3.4 and Exercise 15.
Distribution of Records
Now let Mn be the maximum distance between a point and its lattice location, measured over n points of the
process, randomly selected. In other words Mn = max(R1 , . . . , Rn ) where Ri (i = 1, . . . , n) is the distance
between the i-th point, and its lattice location. Depending on F , the standardized distribution of Mn is
asymptotically Weibull, Gumbel or Fréchet: these are the three potential attractor distributions in the context
of extreme value theory [Wiki]. The Rayleigh distribution is a particular case of the Weibull distribution.
Surprisingly, in d dimensions, the distribution of the nearest neighbor distances, for a stationary Poisson point
process, is also Weibull, see Section 3.4.
Figure 23 shows (on the Y-axis) the distance R between a point (Xh , Yk ) and its location (h/λ, k/λ) on the
lattice space. These are the same points as on the right plot in Figure 22; R represents the length of the arrows.
The points are ordered by how close they are to the origin (0, 0), and the X-axis represents their distance to the
origin, that is, their norm. By looking at Figure 23, it is easy to visualize the extreme values of R, and when
they occur on the X-axis.
3.6 Poor Random Numbers and Other Glitches
All machine learning and modeling techniques are subject to a number of issues. I discussed the boundary effect
in Section 3.5, creating biases in some statistical measurements, and how to address it. Perturbed lattice point
processes, referred to as Poisson-binomial processes in this textbook, are unusually stable structures. However
on occasions, one may face numerical stability or precision issues. For instance, the detection of connected
components (those generated by the nearest neighbors) can fail if the scaling factor s is zero. In that case, a
point can have multiple nearest neighbors, causing problems. This is addressed in Part 3 of the source code in
Section 6.4. Another example is caused by the chaotic convergence of some mathematical series: see Exercise 25,
with a solution. Limiting distributions near a singularity are another typical source of problems, see Exercise 4,
entitled small paradox. Iterative algorithms such as the filter-based classifier in Section 3.4, used to produce
Figure 19, may not converge or converge to a wrong solution depending on the parameters.
But generally speaking, iterative systems going awry are rare when dealing with lattice-based point pro-
cesses. This is in contrast to discrete dynamical systems, where a simple recursion such as xn+1 = 4xn(1 − xn) with 0 < x0 < 1 (called the chaotic logistic map) yields erroneous values with not a single correct digit after as few as n = 50 iterations, when using single-precision arithmetic [Wiki]. This is not an issue when computing average-based statistics, due to the ergodicity of the dynamical system, mimicking a stochastic process. It be-
comes an issue when looking at a single path, or when computing statistics such as long-range auto-correlations
to assess the randomness of the sequence.
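The precision issue is easy to reproduce; the short Python sketch below runs the chaotic logistic map in single and double precision and prints how quickly the two trajectories separate.

import numpy as np

# Chaotic logistic map x_{n+1} = 4 x_n (1 - x_n), iterated in single and double precision:
# the two trajectories share no correct digit after a few dozen iterations.
x32 = np.float32(0.123)
x64 = 0.123
for n in range(1, 61):
    x32 = np.float32(4.0) * x32 * (np.float32(1.0) - x32)
    x64 = 4.0 * x64 * (1.0 - x64)
    if n in (10, 30, 50, 60):
        print(n, float(x32), x64)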
Surprisingly, in some instances, using a faulty algorithm can be a blessing. For instance, to find the
global minimum of the chaotic curve pictured in Figure 7, standard optimization techniques such as the fixed
point algorithm [Wiki], fail. Instead, I used a fixed point algorithm that by design, never converges. Yet as
the iterations approach the (magnified) global minimum of the transformed function, it emits a signal before
moving away to nowhere. It is possible to retrieve the global minimum via the signal. This will be discussed in
an upcoming textbook.
In our context, since I heavily rely on massive simulations, in particular to estimate a number of theoretical
distributions with good enough accuracy or to compare two very similar empirical distributions, an excellent
pseudo-random number (PRNG) generator is paramount. Nowadays, most programming languages and even
Excel offer decent PRNGs. See also here for a discussion on this topic. I have used billions of binary digits of
peculiar transcendental numbers [Wiki] on many √occasions: they provide some of the best non periodic PRNGs.
You can get one million binary digits of (say) 2, online in less than one second, on the Sage symbolic math
calculator, here. I now discuss a situation where my PRNG dramatically failed, and a new type of PRNG that
I am currently developing.
The initial value is x1 , a positive integer; p1 , p2 and so on are the prime numbers, with p1 = 2. Also, ak , bk ∈
{−1, 0, 1}. The sequence is periodic, though the period may start after a large number of iterations. In general,
the larger r, the larger the period. This PRNG is further discussed here.
The parameter set in Table 6 yields a period equal to 643,032,390 = 2 × 3 × 5 × 7³ × 11 × 13 × 19 × 23.
Detecting the period of these PRNG’s, either via an algorithm or through theoretical considerations, is an
interesting problem in and of itself. The period grows exponentially fast with the number of prime numbers
involved. The number of iterations before the period starts to kick in can be very large. This makes it difficult
to detect the period. But to make things easier, the period typically has a simple form, involving the product
of consecutive primes. So one can try an integer q (a simple product of primes) and check if for some n large
enough, xn = xn+q , xn+1 = xn+q+1 , . . . , xn+200 = xn+q+200 . If this is the case, q is a potential candidate for the period.

k     1    2    3    4    5    6    7    8    9
pk    2    3    5    7   11   13   17   19   23
ak   −1    1    1    1   −1   −1    0    1   −1
bk   −1   −1   −1   −1    1    1    0   −1    1
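A small Python helper implementing this check; the sequence x is assumed to have been generated beforehand by the PRNG, and passing the test only makes q a candidate, not a proven period.

def is_period_candidate(x, q, n, run=200):
    # Check x[n+j] == x[n+q+j] for j = 0, ..., run; if it holds, q is a candidate period.
    # x is the sequence produced by the PRNG, n a large index past the transient phase.
    if n + q + run >= len(x):
        raise ValueError("sequence too short for this test")
    return all(x[n + j] == x[n + q + j] for j in range(run + 1))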
4 Theorems
The theorems presented here are selected for their practical and educational value. The proofs are usually short,
constructive, and sometimes subtle. These results are used in one way or another throughout this textbook,
including in the simulations. The reader is invited to try proving some of them on her own, before reading my
solutions. I have added many comments, which are just as important as the theorems or the proofs. Emphasis
is on making this material accessible to many practitioners as well as beginners, and hopefully, fun to read.
Remark: Unless otherwise specified, the theorems are valid for the one-dimensional case. Generalizations to
higher dimensions are provided for several theorems, following the proof.
4.1 Notations
I use the notation tk = k/λ and Fs (x) = F (x/s). The density attached to F , if it exists, is denoted as f . Also
B = [a, b] with a < b is an interval on the real line. In two dimensions, B may be a rectangle or a circle, and I
use the notation µ(B) for the area of B. The notation “Left ≡ Right” means that “Left” is a shorter notation
for “Right”: by definition, they both represent the same thing.
The random variable T (λ, s) measuring interarrival times is sometimes denoted as T . It represents the
distance between two successive points of the process, once the points are ordered by value on the real line.
In higher dimensions, T is the distance between a point of the process, and its closest neighbor. The random
variable counting the number of points of the process in B is denoted as N (B) and called point count.
P(T > y) = ∫_{−∞}^{∞} f(x) P(N(B0) = 0 | X0 = x) dx
         = ∫_{−∞}^{∞} f(x) ∏_{k≠0} [1 − P(Xk ∈ B0)] dx
         = ∫_{−∞}^{∞} [f(x)/(1 − p0(x, y))] ∏_{k∈Z} [1 − pk(x, y)] dx    (38)

where

pk(x, y) = P(Xk ∈ ]x, x + y]) = Fs(x + y − tk) − Fs(x − tk).    (39)
A different way to compute the distribution of the interarrival times is offered by Theorem 4.7. Note that T
depends on λ and s, since tk = k/λ. By analogy with the Poisson-binomial distribution attached to the counting
random variable N (B), the distribution of T is said to be exponential-binomial of parameters pk (x, y), k ∈ Z.
When s → ∞, the limit is a standard exponential distribution, as seen in Theorem 4.5.
I am now in a position to state and prove some important results. Unless otherwise specified, the theorems
apply to the one-dimensional case. Following each proof, when possible, I discuss how the result generalizes to
higher dimensions.
4.3 Point Count Arithmetic
Here is a pretty curious arithmetic-related result, easy to prove.
Theorem 4.1 Regardless of the distribution Fs , if λ · (b − a) is an integer, then E[N (B)] = λ · µ(B) = λ · (b − a).
Proof
For any function Fs, we have the following trivial equality. Assuming λ · (b − a) = 1,

En[B] ≡ Σ_{k=−n}^{n} [Fs(b − tk) − Fs(a − tk)] = Fs(b + tn) − Fs(a − tn).
Theorem 4.2 The interarrival times satisfy, in distribution,

T(λ, s) = T(1, λs)/λ.

In particular, this also holds when s = ∞, corresponding to the standard Poisson process.
Proof
After replacing tk by k/λ in (38), and since Fs (z) = F (z/s), we have:
pk(x, y) = F((x + y − k/λ)/s) − F((x − k/λ)/s).
The expression F ((x + y − k/λ)/s) can be rewritten as F ((λ · (x + y) − k/λ′ )/s′ ) with λ′ = 1 and s′ = λs. This
works too if y = 0. With the change of variable λ · (x + y) = x′ + y, we have dx = (dx′ )/λ and the expression
becomes F ((x′ + y − k/λ′ )/s′ ). The variables are x, x′ , and y is assumed to be fixed. Integral (38), after these
changes, must be updated as follows:
The dummy variable x is replaced by the dummy variable x′
The value of the integral is divided by λ because dx = (dx′ )/λ
The bounds are still from −∞ to ∞
λ is replaced by λ′ = 1 and s by s′ = λs
That is: P (T (λ, s) > y) = P (T (λ′ , s′ )/λ > y) = P (T (1, λs)/λ > y), thus T (λ, s) = T (1, λs)/λ
Theorem 4.2 has important practical implications. Instead of working with two parameters λ, s, when
dealing with interarrival times, you can replace T(λ, s) by T*(s′) = (1/λ) T(1, s′), with s′ = λs, thus reducing
the number of effective parameters from two to one. I use this fact in Section 3.2.1 to facilitate estimation
techniques, and to compute the empirical distribution [Wiki] of T more efficiently.
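Theorem 4.2 is easy to check by simulation; the Python sketch below compares quantiles of T(λ, s) with those of T(1, λs)/λ, assuming a logistic F and measuring T as the gap to the right of the point closest to the origin (a convenient, observable substitute for X0).

import numpy as np

def interarrival_sample(lam, s, n=500, n_real=2000, seed=0):
    # One value of T per realization: 2n+1 points X_k = k/lam + s*Z_k with Z_k logistic,
    # sorted; T is the gap to the right of the point closest to the origin.
    rng = np.random.default_rng(seed)
    k = np.arange(-n, n + 1)
    out = np.empty(n_real)
    for i in range(n_real):
        x = np.sort(k / lam + s * rng.logistic(size=k.size))
        j = np.argmin(np.abs(x))
        out[i] = x[j + 1] - x[j]
    return out

# Theorem 4.2 says T(lam, s) and T(1, lam*s)/lam have the same distribution:
lam, s = 2.0, 0.4
q = np.linspace(0.05, 0.95, 10)
print(np.quantile(interarrival_sample(lam, s, seed=1), q))
print(np.quantile(interarrival_sample(1.0, lam * s, seed=2) / lam, q))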
It would be interesting to see how Theorem 4.2 (and its proof) can be adapted to the two-dimensional
case, where interarrival times are replaced by distances between a point of the process and its nearest neighbor.
Simulations show that the situation is different. In two dimensions, x is replaced by (x1 , x2 ), and dx becomes
dx1 dx2 . The product over k becomes a double product over h, k. Also, Fs (x − k/λ) is replaced by Fs (x1 −
h/λ)Fs (x2 − k/λ), and dx1 = (dx′1 )/λ, dx2 = (dx′2 )/λ. This suggests that the denominator λ in Theorem 4.2
should be replaced by λ2 in two dimensions. See also Exercise 15.
4.5 Expectation and Limit Distribution of Interarrival Times
Here I discuss the one-dimensional case. For the two-dimensional case, see Exercise 15. The proof of the next
theorem justifies the choice of X0 as the reference point to define interarrival times; X5 (say) would have led to
the same distribution. We already know that if s = 0 then T = 1/λ, and if s = ∞ then T has an exponential
distribution of expectation 1/λ. If s is small enough and F ’s tail is not too thick, then E[T ] = 1/λ and T ’s
distribution is also independent from X0 , see Exercise 5. Now, the result below is valid for any s ≥ 0.
Theorem 4.3 If F has a finite expectation, then E[T (λ, s)] = 1/λ, regardless of F and s.
Proof
Let (Xk ) with k = −n, . . . , n be a finite version of a Poisson-binomial point process, with 2n + 1 points. One
of the points, say Xk1 , is the minimum, and another one, say Xk2 , is the maximum. The range for the Xk ’s
is Xk2 − Xk1 , with E[Xk2 ] = n/λ and E[Xk1 ] = −n/λ. So the expectation of the range is 2n/λ. Since there
are M = 2n interarrival times between Xk1 and Xk2 , the average interarrival time, that is the average distance
between two successive points, is (1/λ) · (2n/M) = 1/λ. This is true whether n is finite or infinite. To finalize the proof,
due to the symmetry of the problem (there is nothing special about X0 versus, say, X5 ), it does not matter, as
far as the theoretical expectation is concerned, whether T is defined as the distance between X0 and the next
point to the right, or between X5 (or any other point) and the next point.
If F is Cauchy, T ’s expectation may not exist. But in practice, we work with symmetric truncated Cauchy
distributions [Wiki], that have zero expectation. Since the choice of the point X0 does not matter in the definition
of T , one might replace X0 by the closest point to the origin. At least that point is known (observable) while
X0 is not. The next theorem, though surprisingly easy to prove, is much deeper than Theorem 4.3. I use it to
solve Exercise 6.
Theorem 4.4 If F has a density f , then
lim_{s→0} P[(T(λ, s) − 1/λ)/s < y] = ∫_{−∞}^{∞} F(y − x) f(x) dx.    (40)
Proof
Note that E[T(λ, s)] = 1/λ by virtue of Theorem 4.3. When s → 0, then Xk → k/λ. It is then easy to establish
(see Exercise 5) that
P(T < y) = ∫_{−∞}^{∞} F(x + (y − 1/λ)/s) f(x) dx.
This can be rewritten as
P[(T(λ, s) − 1/λ)/s < y] = ∫_{−∞}^{∞} F(y + x) f(x) dx = ∫_{−∞}^{∞} F(y − x) f(x) dx.
The last equality is justified by the fact that f is symmetric, thus f (x) = f (−x). The integral on the right
hand side of Formula (40) represents the self-convolution [Wiki] of F .
Step 1
From (39), we have pk(x, y) = ∫_a^b f(u) du, where f (the density) is the derivative of F, b = (λ(x + y) − k)/(λs), and a = (λx − k)/(λs). This integral has interval length b − a = y/s and midpoint (a + b)/2 = (2x + y)/(2s) − k/(λs). In particular,

pk(x, y) ∼ (y/s) · f((2x + y)/(2s) − k/(λs)) as s → ∞,

Jn ≡ Σ_{k=−n}^{n} pk(x, y) ∼ ∫_{−n}^{n} pν(x, y) dν = (y/s) ∫_{−n}^{n} f((2x + y)/(2s) − ν/(λs)) dν.
With the change of variable τ = −ν/(λs), we obtain

Jn ∼ (y/s) · λs ∫_{−n/(λs)}^{n/(λs)} f((2x + y)/(2s) + τ) dτ = λy ∫_{−n/(λs)}^{n/(λs)} f((2x + y)/(2s) + τ) dτ.

Here λ is fixed. When n → ∞, s → ∞ and n/s → ∞ (say s ∼ √n or s ∼ n/(log n)), we have

Jn → λy ∫_{−∞}^{∞} f((2x + y)/(2s) + τ) dτ = λy,
Step 2

Taking logarithms, Σ_{k∈Z} log(1 − pk(x, y)) ∼ −Σ_{k∈Z} pk(x, y) = −J∞ = −λy. Thus,

∏_{k∈Z} (1 − pk(x, y)) ∼ exp(−λy) as s → ∞.
This product does not (at the limit) depend on x. Finally, we get
P(T > y) ∼ exp(−λy) ∫_{−∞}^{∞} [f(x)/(1 − p0(x, y))] dx ∼ exp(−λy) ∫_{−∞}^{∞} f(x) dx = exp(−λy),
as f is a density and thus integrates to one. So, T has an exponential distribution of parameter λ as s → ∞.
This implies that the limiting point process must be Poisson of intensity λ.
The takeaway from the proof of Theorem 4.5 (see bottom of Step 1) is that to simulate a realistic Poisson process as a limit of a Poisson-binomial process (pretty much regardless of F), you generate your 2n + 1 points (k between −n and n), choosing a large n and a large s, but s must be an order of magnitude smaller than n, to make the boundary effect [Wiki] negligible. For instance, s = √n or s = n/(log n) will do.
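A minimal Python sketch of this recipe, with a logistic F and s = √n; the quick sanity check on the central interarrival times is mine.

import numpy as np

def near_poisson_realization(lam=1.0, n=10000, seed=0):
    # 2n+1 points of a Poisson-binomial process with logistic F and s = sqrt(n):
    # s is large but an order of magnitude smaller than n, as recommended above.
    rng = np.random.default_rng(seed)
    s = np.sqrt(n)
    k = np.arange(-n, n + 1)
    return np.sort(k / lam + s * rng.logistic(size=k.size))

# Sanity check on the central interarrival times: for a Poisson process of intensity lam,
# they are exponential, so the mean should be close to 1/lam and the variance to 1/lam^2.
x = near_poisson_realization()
gaps = np.diff(x[5000:15001])
print(gaps.mean(), gaps.var())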
Theorem 4.5 generalizes to higher dimensions. It is somewhat similar to the Central Limit Theorem [Wiki]
(CLT), in the sense that it works regardless of the continuous distribution F . Even if the index space Z (the
support domain for the index k) was relatively random, it would still work. What is remarkable is that even
if F is a Cauchy distribution, known to have no expectation nor variance, it still works, and convergence to
the Poisson process is even faster than if F was uniform. This is because the Cauchy distribution, with its
thick tail, does a great job at mixing the points of the process. By contrast, the standard CLT fails with a Cauchy distribution, as the sum of iid Cauchy random variables always has a Cauchy distribution, thus never converging to a Gaussian distribution. This is because the Cauchy distribution, like the Gaussian one, belongs to the family of stable distributions [Wiki].
In our case, convergence to a Poisson process is quite fast, with s = 40, assuming λ = 1, yielding an
excellent approximation regardless of F , see Table 3. Consequently, the interest here is in small values of s.
There might be a different way to prove Theorem 4.5, using Le Cam’s inequality [73] applied to the point count
distribution. It would amount to proving that as n → ∞, regardless of B, the Poisson-binomial distribution of
N (B) tends to a Poisson distribution of expectation λd µ(B), where d is the dimension, and µ(B) is the area of
B in two dimensions, or the length of the interval B in one dimension. See Theorem 2.1 for such a proof, in a
similar context.
Here, L(x) denotes the (hidden) index of the point of the process observed at location x, assuming that λ, s are known or estimated. Another random variable of interest, denoted as K and also taking on integer values (positive or negative), is the index of the point closest to X0, on its right-hand side on the real axis. Related material in the literature includes “Recovering the lattice from its random perturbations” by
Yakir [79] (2020) available online here and “Cloaking the Underlying Long-Range Order of Randomly Perturbed
Lattices” by Klatt [47], available here. See also how I use the function L in an application to generate locally
random permutations, in Section 2.2. Now I can state two new theorems.
Theorem 4.6 Let us assume that Fs has a derivative fs (the density), continuous and strictly positive everywhere. For any k ∈ Z, we have
$$P(L(x) = k) = C \cdot f_s(x - t_k), \quad \text{with } \frac{1}{C} = \sum_{h\in\mathbb{Z}} f_s(x - t_h).$$
Proof
Let Bϵ(x) = [x − ϵ, x + ϵ]. We have
$$\lim_{\epsilon\to 0} \frac{P(X_k \in B_\epsilon(x))}{P(X_h \in B_\epsilon(x))} = \frac{f_s(x - t_k)}{f_s(x - t_h)}.$$
Thus P(L(x) = k) ∝ fs(x − tk), and the proportionality constant is such that the sum over all k ∈ Z must be one.
Theorem 4.7 The interarrival time T and K are connected by the following formula:
$$P(T(\lambda, s) < y) = \sum_{k\neq 0} P(K = k) \int_{-\infty}^{\infty} F_s\Big(x + y - \frac{k}{\lambda}\Big)\, f_s(x)\,dx.$$
Proof
We have: K = k if and only if k is the smallest index such that Xk > X0. Thus,
$$P(T(\lambda, s) < y) = P(X_K - X_0 < y) = \sum_{k\neq 0} P(K = k)\, P(X_k - X_0 < y) = \sum_{k\neq 0} P(K = k) \int_{-\infty}^{\infty} F_s\Big(x + y - \frac{k}{\lambda}\Big)\, f_s(x)\,dx.$$
The last integral is the result of the convolution between the random variables Xk and −X0.
Theorem 4.7 provides us with a way to compute the P (K = k)’s. You need to solve a linear system with
an infinite number of variables and an infinite number of equations. The unknowns are the P (K = k)’s. In
practice, especially if λ = 1, you can just reduce it to −n ≤ k ≤ n, with k ̸= 0 and n = 10, as P (K = k) quickly
decays to zero when k becomes large in absolute value. Pick a different y in the integral for each of the 2n equations, to get an invertible system.
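Below is a minimal Python sketch of this procedure. The standard logistic choice for F, λ = 1, and all numerical settings (truncation at n = 10, the integration grid, the 2n values of y) are illustrative assumptions; the left-hand side P(T < y) is estimated by simulating the process, so the recovered P(K = k) values are noisy estimates rather than exact values.

# Minimal sketch: estimating P(K = k) via Theorem 4.7 (standard logistic F, lambda = 1 assumed)
import numpy as np

lam, s, n = 1.0, 0.4, 10
ks = np.array([k for k in range(-n, n + 1) if k != 0])          # truncation: -n <= k <= n, k != 0

F  = lambda x: 1.0 / (1.0 + np.exp(-x))                         # standard logistic CDF
Fs = lambda x: F(x / s)
fs = lambda x: np.exp(-x / s) / (s * (1.0 + np.exp(-x / s))**2) # its density, scaled by s

# Left-hand side: estimate P(T < y) by simulating many realizations of the process
rng = np.random.default_rng(1)
m, reps = 30, 20_000
idx = np.arange(-m, m + 1)
T = np.empty(reps)
for r in range(reps):
    x = idx / lam + s * rng.logistic(size=idx.size)
    gaps = x[idx != 0] - x[idx == 0]                            # signed distances from X_0
    T[r] = gaps[gaps > 0].min()                                 # nearest point to the right of X_0

ys = np.linspace(0.2, 2.1, 2 * n)                               # one value of y per equation
b = np.array([(T < y).mean() for y in ys])

# Right-hand side: integrals of Fs(x + y - k/lambda) fs(x) dx, computed on a grid
grid = np.linspace(-20 * s, 20 * s, 4001)
w = grid[1] - grid[0]
A = np.array([[np.sum(Fs(grid + y - k / lam) * fs(grid)) * w for k in ks] for y in ys])

p, *_ = np.linalg.lstsq(A, b, rcond=None)                       # least squares is more stable than solve()
for k, pk in zip(ks, p):
    if abs(pk) > 0.01:
        print(k, round(pk, 3))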
The distribution of the interarrival times is combinatorial in nature, and in principle you could use the
theory of order statistics [Wiki] to get the exact distribution. References on this topic include [7, 17]. When the random variables are independent but not identically distributed, one may use the Bapat-Beg theorem
[Wiki] to find the joint distribution of the order statistics of the sequence (Xk ), and from there, obtain the
theoretical distribution of T . This approach is difficult, and not recommended. Simulations are the preferred
option.
Proof (Theorem 4.8)
Let λ = 1, B = [a, b] and pk = Fs(b − k) − Fs(a − k). Here
Fs(x − k) = 1/2 + (x − k)/(2s), with −s ≤ x − k ≤ s, and s > 0.
We have two cases, each with three sub-cases.
If b − a ≤ 2s, then:
If a − s ≤ k ≤ b − s, then pk = 1/2 − (a − k)/(2s).
If b − s ≤ k ≤ a + s, then pk = (b − a)/(2s).
If a + s ≤ k ≤ b + s, then pk = 1/2 + (b − k)/(2s).
If b − a ≥ 2s, then:
If a − s ≤ k ≤ a + s, then pk = 1/2 − (a − k)/(2s).
If a + s ≤ k ≤ b − s, then pk = 1.
If b − s ≤ k ≤ b + s, then pk = 1/2 + (b − k)/(2s).
If k ∉ [a − s, b + s], then pk = 0. The above results can be used to compute (in closed form) the quantities
$$E[N(B)] = \sum_{k=-\infty}^{\infty} p_k, \qquad \text{Var}[N(B)] = \sum_{k=-\infty}^{\infty} p_k(1 - p_k).$$
In particular, if a = −s and b = s, there are some simplifications, and we obtain the result announced in the
theorem.
Note that if s/2 is an integer, the above result is compatible with Theorem 38 since E[N (B)] = 2s = b − a.
Also, as s → ∞, E[N (B)] ∼ 2s = b−a. In general though, E[N (B)] is not an exact function of µ(B) = λ·(b−a),
confirming that the Poisson-binomial process is different from a Poisson process, and very much so in particular
if F is the uniform distribution.
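As a sanity check, these sub-case formulas can be compared with a direct numerical evaluation of pk = Fs(b − k) − Fs(a − k). A minimal Python sketch, with λ = 1 and illustrative values of a, b, s:

# Minimal sketch: point count moments for uniform F on [-s, s], lambda = 1 (illustrative values)
import numpy as np

def Fs(x, s):                                   # CDF of the uniform distribution on [-s, s]
    return np.clip(0.5 + x / (2 * s), 0.0, 1.0)

a, b, s = -1.3, 2.7, 1.5
k = np.arange(int(np.floor(a - s)) - 1, int(np.ceil(b + s)) + 2)    # pk = 0 outside [a-s, b+s]
pk = Fs(b - k, s) - Fs(a - k, s)
print("E[N(B)]   =", pk.sum())
print("Var[N(B)] =", (pk * (1 - pk)).sum())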
If F is the Laplace distribution, an exact, closed-form formula can also be obtained for E[N (B)] and
Var[N (B)], and for higher moments. See Exercise 1 in Section 5.
Proof
Use the change of variable u = F(z) in the leftmost integral. Then Q(u) becomes Q(F(z)) = F⁻¹(F(z)) = z, du becomes dF(z) = f(z)dz, and the interval of integration changes from [0, 1] to the entire real line.
Here f is the density attached to F, assuming it exists. The rightmost equality is well known, but the leftmost is not. Surprisingly, this unnamed, little-known theorem, rarely if ever mentioned, plays a crucial role. It is routinely and unconsciously used by machine learning practitioners almost on a daily basis, at least in the version that applies to empirical, observation-based statistics. The above version applies to theoretical (mathematical) statistics. I suggest calling it the quantile theorem.
For instance, the moment generating function of Z is defined as E[exp(tZ)]. It can be computed via the
quantile function Q, using g(z) = exp(tz), see Formula (16) for the generalized logistic distribution. See also
Exercises 3 and 4 in Section 5, for a different application.
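Here is a minimal sketch of the quantile theorem in action, with the standard logistic distribution used as a stand-in for Formula (12): its quantile function is Q(u) = log(u/(1 − u)), and its moment generating function is known to be πt/sin(πt) for |t| < 1, which gives something to compare against.

# Minimal sketch: computing E[g(Z)] via the quantile function (standard logistic for illustration)
import numpy as np
from scipy.integrate import quad

t = 0.5
Q = lambda u: np.log(u / (1.0 - u))                   # quantile function of the standard logistic
mgf_via_quantile, _ = quad(lambda u: np.exp(t * Q(u)), 0.0, 1.0)
exact = np.pi * t / np.sin(np.pi * t)                 # known MGF of the standard logistic, |t| < 1
print(mgf_via_quantile, exact)                        # both approximately 1.5708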
5 Exercises, with Solutions
While the purpose of these exercises is to strengthen the learning experience and to generate out-of-the-box thinking, perhaps even more importantly, they provide additional methodological and technical material, complementing and extending the main text.
Starred exercises are more difficult. Several of the problems require only simulations, statistical analysis,
and testing hypotheses on a computer. They are marked as [S] and should help you hone your machine learning
and computing skills; they may not be easier or less challenging than the mathematical problems. Exercises
involving mathematics or probability theory are marked as [M], while those combining both simulations and
mathematics are marked as [MS]. Solutions or hints are provided for each problem.
Exercise 1 [M] Point count, Laplace distribution. If F is a Laplace distribution and λ = 1, find E[N (B)],
where B = [a, b] is an interval with ⌊a⌋ ≤ ⌊b⌋ < ⌊a⌋+1. Here the brackets represent the integer part function, and
Fs (x) = F (x/s). See Theorem 4.8, solving the same problem with a uniform rather than Laplace distribution.
Solution
Let pk = Fs(b − k) − Fs(a − k) with s > 0, and let sgn stand for the sign function, with sgn(0) = 0. Here
Fs(x − k) = 1/2 + (1/2) sgn(x − k)[1 − exp(−|x − k|/s)].
We have three cases:
If k ≤ a < b, then pk = (1/2)[exp(−(a − k)/s) − exp(−(b − k)/s)].
If a ≤ k ≤ b, then pk = 1 − (1/2)[exp(−(b − k)/s) + exp((a − k)/s)].
If a < b ≤ k, then pk = (1/2)[exp((b − k)/s) − exp((a − k)/s)].
If ⌊a⌋ ≤ ⌊b⌋ < ⌊a⌋ + 1, then the second case is empty and ⌊a⌋ = ⌊b⌋. As a result, the computations simplify to
$$2E[N(B)] = \alpha\sum_{k\le a}\phi^k - \beta\sum_{k\le a}\phi^k + \frac{1}{\beta}\sum_{k\ge b}\Big(\frac{1}{\phi}\Big)^{k} - \frac{1}{\alpha}\sum_{k\ge b}\Big(\frac{1}{\phi}\Big)^{k}$$
$$= (\alpha - \beta)\Big[\sum_{k\le a}\phi^k + \frac{1}{\alpha\beta}\sum_{k\ge b}\Big(\frac{1}{\phi}\Big)^{k}\Big] = (\alpha - \beta)\cdot\frac{\phi}{\phi - 1}\cdot\Big[\phi^{\lfloor a\rfloor} + \frac{1}{\alpha\beta}\Big(\frac{1}{\phi}\Big)^{\lfloor a\rfloor + 1}\Big],$$
where α = exp(−a/s) ≥ β = exp(−b/s) and φ = exp(1/s). We are dealing with geometric series, which are easily summable. The last equality is due to the fact that ⌊a⌋ = ⌊b⌋. Note that b cannot be an integer in this case, so a sum with integer index k ≥ b actually starts at k = ⌊b⌋ + 1 = ⌊a⌋ + 1. Also, when combining the various sums, make sure that their indices don't overlap, otherwise double counting will occur. This is not happening here.
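The closed form is easy to check against a direct summation of the pk's. A minimal Python sketch (the values of a, b, s are illustrative and satisfy ⌊a⌋ = ⌊b⌋; the Laplace CDF is the one written out at the beginning of the solution):

# Minimal sketch: Laplace point count, closed form versus direct summation (illustrative values)
import numpy as np

a, b, s = 0.2, 0.7, 1.3                            # floor(a) = floor(b), b is not an integer
def Fs(x):                                         # Laplace CDF used in the solution above
    return 0.5 + 0.5 * np.sign(x) * (1.0 - np.exp(-np.abs(x) / s))

k = np.arange(-200, 201)
direct = np.sum(Fs(b - k) - Fs(a - k))

alpha, beta, phi = np.exp(-a / s), np.exp(-b / s), np.exp(1 / s)
fa = np.floor(a)
closed = 0.5 * (alpha - beta) * phi / (phi - 1) * (phi**fa + (1 / (alpha * beta)) * (1 / phi)**(fa + 1))
print(direct, closed)                              # the two values agree (both approximately 0.49 here)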
Exercise 3 [M*] Limit of generalized logistic distribution. Compute the expectation of the generalized logistic distribution, when α = 1 and 1/β is a positive integer. If α = 1 and τ = e^{1/β}, prove that (1/(βρ))(E[Z] − µ) → π²/6 as β → 0. See Formula (13) for the cumulative distribution function.
Solution Instead of using Formula (14) to compute the expectation, I use Formula (15), with r = 1. Using Formula (12), the expectation can be rewritten as
$$E[Z] = \mu + \rho\int_0^1 \log\frac{\tau u^m}{1 - u^m}\,du = \mu - \rho\Big[m - \log\tau + \int_0^1\log(1 - u^m)\,du\Big],$$
0 1−u 0
where m = 1/β is an integer. Also, 1 − um is a polynomial of degree m, and its roots are the m-th roots of 1
in the complex plane (see here), that is
m−1
Yh 2kπi i
(1 − um ) = − u − exp .
m
k=0
Thus,
m−1
X h 2kπi i
log(1 − um ) = log(−1) + log u − exp .
m
k=0
R
Since log(−1) = πi and log(u − c)du = (u − c) log(u − c) − u if c is a constant (whether complex or real), we
finally have
m−1
Xh i
E[Z] = µ − ρm log τ − ρπi − ρ (1 − ck,m ) log(1 − ck,m ) − 1 + ck,m log(−ck,m )
k=0
where ck,m = exp(2kπi/m). This involves computing complex logarithms [Wiki]. When combining the real and
imaginary parts from all the terms, only real numbers are left. This tedious computation is best achieved using
some automated tool. See the result for µ = 0, τ = 1, ρ = 1 and m = 8, using the online version of Mathematica,
here. In this case, the final result is
$$-4\log 2 - \frac{\pi}{2}\cot(\pi/8) - \sqrt{2}\,\log\big(\cot(\pi/8)\big).$$
The value of ξ was obtained by replacing log(1 − u^m) by its Taylor series in the integral, then integrating term by term, and finally taking the limit as m → ∞. It can also be obtained with the change of variable v = u^m in the integral. Let α = 1 and τ = e^{1/β} = e^m. From (41), we get: E[Z] → µ and (1/(βρ))(E[Z] − µ) → π²/6 as β → 0.
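To double-check the m = 8 result numerically, one can do exactly what is described above: expand log(1 − u^m) into its Taylor series, integrate term by term, and compare with the closed form printed earlier. A minimal sketch (µ = 0, τ = 1, ρ = 1, m = 8, as in the Mathematica computation):

# Minimal sketch: numerical check of E[Z] for mu = 0, tau = 1, rho = 1, m = 8
import numpy as np

m = 8
# E[Z] = integral_0^1 [m log(u) - log(1 - u^m)] du; the first piece integrates to -m exactly,
# and the second one is integrated term by term using the Taylor series of log(1 - u^m).
series = sum(1.0 / (j * (m * j + 1)) for j in range(1, 200_000))
numeric = -m + series

closed = -4 * np.log(2) - (np.pi / 2) / np.tan(np.pi / 8) - np.sqrt(2) * np.log(1 / np.tan(np.pi / 8))
print(numeric, closed)                             # both approximately -7.811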
Exercise 4 [MS] Small paradox. Let Zβ be a random variable with generalized logistic distribution, with µ = 0, ρ = α = 1 and τ = e^{1/β}. Using simulations based on Formula (12) for the quantile function Q(u), with u a uniform deviate on [0, 1], try to guess the expectation and variance of the limit random variable Z∗ = limβ→0 β⁻¹Zβ. The exact values are respectively zero and one. Using numerical approximations, show that limβ→0 β⁻¹E[Zβ] ≈ 1.645. All of this can be done in Excel using the rand function. Then obtain the exact values for the three quantities in question. The last one, equal to π²/6, was computed in Exercise 3. Thus, in this case, the expectation and limit operators cannot be swapped (one yields the answer 0, the other one yields π²/6). This is because the limiting distribution P(Z∗ < z) is truncated, as shown in the solution below.
Solution Despite the appearance, this is an easy exercise. As in Exercise 3, let m = 1/β and m → ∞. The
problem here, when approximating
$$\beta^{-1}E[Z_\beta] = m\int_0^1 Q(u)\,du = m\int_0^1\Big[\log(\tau u^m) - \log(1 - u^m)\Big]du = -m\int_0^1 \log(1 - u^m)\,du,$$
is that the computation of log(1 − u^m) is numerically unstable [42] when 0 < u < 1 and m is large, resulting in erroneous results, whether you do it in Excel or Python. The problem is said to be ill-conditioned. To avoid this problem, use the change of variable v = u^m, yielding
$$-m\int_0^1 \log(1 - u^m)\,du = -\int_0^1 \frac{\log(1 - v)}{v^{1 - 1/m}}\,dv \to -\int_0^1 v^{-1}\log(1 - v)\,dv = \frac{\pi^2}{6} \quad \text{as } m\to\infty.$$
Base your Excel computations on the last integral, using a sum to approximate it. Now it works!
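The numerical issue and its fix are easy to reproduce (a minimal Python sketch; the value of m and the midpoint rule with 10^5 nodes are illustrative):

# Minimal sketch: naive versus stabilized computation of -m * integral of log(1 - u^m) du
import numpy as np

m, N = 10**6, 100_000
u = (np.arange(N) + 0.5) / N                            # midpoint rule on [0, 1]

naive = -m * np.mean(np.log(1.0 - u**m))                # u**m underflows to 0 for most u: wrong answer
stable = -np.mean(np.log(1.0 - u) / u**(1.0 - 1.0 / m)) # after the change of variable v = u^m
print(naive, stable, np.pi**2 / 6)                      # only the stabilized value is close to 1.6449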
The fact that E[Z∗] = 0 and Var[Z∗] = 1 will be apparent from your simulations involving the quantile function Q(u). However, to prove it rigorously, rather than using Q, it is easier to work with the CDF P(Zβ < z), with µ = 0, ρ = α = 1, β = 1/m and τ = e^m, using Formula (13):
$$P(Z_* < z) = \lim_{m\to\infty} P(Z_{1/m} < mz) = \lim_{m\to\infty}\frac{1}{\big(1 + \tau\exp(-mz)\big)^{1/m}} = \lim_{m\to\infty}\frac{1}{\big(1 + \exp[m(1 - z)]\big)^{1/m}}.$$
If z > 1, the above limit is one, otherwise it is equal to exp[−(1 − z)]. The density fZ∗ (the derivative of the CDF) is thus equal to fZ∗(z) = exp[−(1 − z)], with support domain z < 1. Thus we have E[Z∗] = ∫ z fZ∗(z)dz = 0 and Var[Z∗] = E[Z∗²] = ∫ z² fZ∗(z)dz = 1, the integrals running over z < 1.
Exercise 5 [M] Exact distribution of interarrival times. Find the distribution of the interarrival times
T (λ, s) if all the points are ordered, that is, if Xk < Xk+1 for all k ∈ Z. This happens when s is small enough,
and the tail of F is not too thick.
Solution
We have Xk = k/λ + sZk and Xk+1 = (k + 1)/λ + sZk+1, where Zk, Zk+1 are two independent random variables of distribution F. Thus the interarrival time Xk+1 − Xk has the same distribution as T = 1/λ + s(Zk+1 − Zk) and does not depend on k. You may as well use k = 0 for its computation. The result is
$$P(T < y) = P\Big(\frac{1}{\lambda} + s(Z_1 - Z_0) < y\Big) = P\Big(Z_1 - Z_0 < \frac{y - 1/\lambda}{s}\Big) = \int_{-\infty}^{\infty} F\Big(x + \frac{y - 1/\lambda}{s}\Big)\,f(x)\,dx,$$
where f is the density attached to F . Since f is symmetric and centered at the origin, the distribution P (T < y)
is the self-convolution of F [Wiki], also denoted as F ∗ F .
Examples:
If F is normal with zero mean and unit variance, then T is almost normal, with mean 1/λ and variance 2s², assuming s is small compared to 1/λ. But T's distribution cannot be exactly normal, because in this case, the Xk's cannot all be perfectly naturally ordered unless s = 0 (then Xk = k/λ).
If F is uniform on [−1, 1], then T has a symmetric triangular distribution [Wiki] of mean 1/λ and support domain [1/λ − 2s, 1/λ + 2s]. This is the exact solution if 0 ≤ s < 1/(2λ). In this case, the Xk's are all naturally ordered; see the simulation sketch below.
For a more formal result, see Theorem 4.4. See also Exercise 6.
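A quick simulation confirms the uniform example (a minimal sketch; λ and s are chosen so that s < 1/(2λ)):

# Minimal sketch: interarrival times for uniform F on [-1, 1], with s < 1/(2 lambda)
import numpy as np

lam, s, reps = 1.0, 0.2, 200_000
rng = np.random.default_rng(0)
z0, z1 = rng.uniform(-1, 1, reps), rng.uniform(-1, 1, reps)
T = 1 / lam + s * (z1 - z0)                  # T = X_{k+1} - X_k when the points remain ordered
print(T.mean(), T.min(), T.max())            # mean close to 1, support inside [0.6, 1.4]
# a histogram of T is triangular on [1/lam - 2s, 1/lam + 2s], as claimed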
Exercise 6 [M*] Retrieving F from the interarrival times distribution. I assume here that F has a
density f . Given the limit distribution of the standardized interarrival times, the purpose is to retrieve the
distribution of F . If you are familiar with the concept of characteristic function [Wiki], this exercise is easy. If
not, you should first get familiar with this concept. Thus this exercise is marked as difficult.
The standardized interarrival time is defined as (1/s)[T(λ, s) − 1/λ] and has zero expectation by virtue of Theorem 4.3. By virtue of Theorem 4.2, it can be rewritten as (1/(λs))[T(1, λs) − 1]. Its limit, as s → 0, is denoted as T∗. One of the simplest cases, besides Gaussian and Cauchy, is the following: if T∗ has a standard Laplace distribution [Wiki] (that is, symmetric, centered at zero, and with variance π²/3), show that F is a modified
Bessel distribution of the second kind (see reference [63], available online here). Note that as a consequence of
L’Hôpital’s rule [Wiki], T ∗ is the derivative of T (λ, s) with respect to s, evaluated at s = 0.
Solution
By virtue of Theorem 4.4, we have
$$P(T^* < y) = \int_{-\infty}^{\infty} F(y - x)\,f(x)\,dx,$$
which is a convolution of F with itself. Thus T∗ has the distribution of the sum of two independent random variables, say Z1, Z2, of distribution F. Its characteristic function is therefore
$$E[\exp(-itT^*)] = \frac{1}{1 + t^2} = E[\exp(-itZ_1)]\times E[\exp(-itZ_2)] = \big(E[\exp(-itZ_1)]\big)^2.$$
Thus E[exp(−itZ1)] = (1 + t²)^{-1/2}. Taking the inverse Fourier transform to retrieve the density of Z1, which is the density attached to F, one finds
$$f(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty}\frac{\cos(tx)}{\sqrt{1 + t^2}}\,dt = \frac{1}{\pi}K_0(x),$$
where K0 is the modified Bessel function of the second kind [Wiki]. More about the Laplace distribution
and its generalization can be found in [49]. The cases when T ∗ is Gaussian or Cauchy are easy because these
distributions belong to stable families of distributions [Wiki]: in that case, F is respectively Gaussian or Cauchy.
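The claimed density is easy to validate numerically: f(x) = K0(|x|)/π must integrate to one, and its characteristic function must match (1 + t²)^{-1/2}. A minimal sketch (the value of t is arbitrary):

# Minimal sketch: checking that f(x) = K0(|x|)/pi has characteristic function (1 + t^2)^(-1/2)
import numpy as np
from scipy.integrate import quad
from scipy.special import k0

total, _ = quad(lambda x: 2 * k0(x) / np.pi, 0, np.inf)              # integral of f over the real line
t = 1.7
cf, _ = quad(lambda x: 2 * np.cos(t * x) * k0(x) / np.pi, 0, np.inf)
print(total)                                    # approximately 1
print(cf, 1 / np.sqrt(1 + t**2))                # both approximately 0.507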
Exercise 7 [M*] Poisson limit of Poisson-binomial distribution. Theorem 2.1 shows that a particular
case of the Poisson-binomial distribution converges to the Poisson distribution. In the proof, I established the
values of P (N = 0), P (N = 1) for the counting random variable N . I also stated (without proving it), that as
n → ∞ and m/n → α, we have P(N = k) → q0µ^k/k! for all positive integers k. The purpose of this exercise
is to prove this latter statement. This in turn completes the proof of Theorem 2.1. The notations refer to the
theorem in question. In particular, q0 = P (N = 0) and µ = P (N = 1)/P (N = 0). This exercise reveals the
true combinatorial nature of the Poisson-binomial distribution, in all its complexity. This is also related to Le
Cam’s inequality.
Solution
For P(N = 0) and P(N = 1), see the proof of Theorem 2.1. Let pk = 1/(n + k) as in the proof of Theorem 2.1. We have, with the convention that a sum or product such as Σ_{k2≠k1} is over k2 = 1, . . . , k1 − 1, k1 + 1, . . . , m:
$$P(N = 2) = \sum_{k_1=1}^{m}\sum_{k_2\neq k_1} p_{k_1}p_{k_2}\prod_{k_1\neq k\neq k_2}(1 - p_k) = \sum_{k_1=1}^{m}\sum_{k_2\neq k_1}\frac{p_{k_1}}{1 - p_{k_1}}\cdot\frac{p_{k_2}}{1 - p_{k_2}}\prod_{k=1}^{m}(1 - p_k) = q_0\sum_{k_1=1}^{m}\sum_{k_2\neq k_1}\frac{p_{k_1}}{1 - p_{k_1}}\cdot\frac{p_{k_2}}{1 - p_{k_2}}, \qquad (42)$$
with
$$\mu = \sum_{k=1}^{m}\frac{p_k}{1 - p_k} = \sum_{k=1}^{m}\frac{1}{n + k - 1} = \sum_{k=n}^{m+n-1}\frac{1}{k} \to \log\alpha \quad \text{as } n\to\infty,\ \frac{m}{n}\to\alpha > 1,$$
and
$$\sum_{k=1}^{m}\Big(\frac{p_k}{1 - p_k}\Big)^2 = \sum_{k=1}^{m}\frac{1}{(n + k - 1)^2} \to 0 \quad \text{as } n\to\infty.$$
Note that the fraction 1/2 in Formula (43) is required to eliminate double counting of the products in the double summation. For P(N = 3), we have a triple summation over indices k1, k2, k3, and because there are 3! = 6 ways to re-arrange distinct k1, k2, k3, the fraction 1/2 = 1/2! becomes 1/3!; likewise, because of the triple product, µ² becomes µ³.
Now I proved that P(N = 2) = q0 µ²/2!, and provided hints as to why P(N = 3) = q0 µ³/3!. The general case is left to the reader.
Exercise 8 [M] A few simple theorems. Prove all theorems in Section 4, except Theorem 4.5.
Hint
Of course you can just look at the proof that I provided for each theorem. However, it is a much better
learning experience to try to prove them on your own without reading my solution, and possibly to generalize
the theorems, for instance from one dimension to any dimension, if applicable. Your proofs might even be
shorter, more rigorous, more complete, or more elegant than mine. In fact, mine are mostly sketch proofs. For each of
the theorems in question, some key trick is required to make progress towards a short, easy but subtle proof.
Exercise 9 [S] Ergodicity, independent increments. Test on simulated data (for various realizations of
Poisson-binomial processes) if and when the following assumptions are violated, depending on s, F and other
parameters.
Independent increments: point counts in non-overlapping intervals are independent.
In two dimensions, zero correlation between the X and Y coordinates of a point.
Ergodicity for interarrival times, in one dimension.
Homogeneity: the point density is statistically the same anywhere on the plane or on the real line.
Isotropy: the point density, even if non-homogeneous, does not show directional trends.
Stationarity: the point count in [a, b] is statistically the same as in [a + c, b + c], regardless of a, b, c.
Aperiodicity: the point density on the real line does not exhibit periodic behavior.
To perform the simulations to test the various assumptions, you can use the source code in Section 6.2.
Solution
Some of these assumptions (aperiodicity, stationarity) are violated when the scaling factor s is close enough to
zero. If s is large (say s = 40, λ = 1), the process is not statistically different from a stationary Poisson point
process, so no statistical test will be able to detect violations, even if present. That said, interarrival times
exhibit ergodicity and independence. For instance, while T is defined as the distance between X0 and its closest
neighbor to the right, you can replace X0 by X1 , X2 , or any Xk , and T ’s distribution remains unchanged. There
is also stationarity in the following sense, regardless of s: point counts in [a, b] and [a + c, b + c] have the same
distribution if c is a multiple of 1/λ.
The largest departure from stationarity occurs with small s, and using a uniform distribution for F . If F is
Cauchy, things look prettier (more stationary) as F has a thick tail and does a better job at mixing the points.
To illustrate the non-stationarity, use λ = 1.4, s = 0.15 with the logistic distribution for F . Let B1 = [0, 0.8]
and B2 = [6, 6.8]. Then Var[N (B1 )] ≈ 0.506 ̸= Var[N (B2 )] ≈ 0.312. Now if you increase the scaling factor from
s = 0.15 to s = 0.6, the process is almost stationary. In particular the two variances in question range from
0.878181 to 0.878193 depending on the interval. The same is true for other statistics. For all practical purposes,
the distribution of N ([a, b]) depends only on b − a if s ≥ 0.6. Statistical tests would not be able to detect the
minuscule lack of stationarity.
I used the exact Formula (5) to compute point count variances. This formula is implemented in the source
code in Section 6.2.1. In Exercise 10, I prove non-independence for point counts over non-overlapping domains.
Exercise 10 [M] Joint distribution of point counts. Let B1 , B2 be non-overlapping domains. For Poisson
point processes, the point counts N(B1) and N(B2) are independent random variables. Is this also true for
Poisson-binomial processes? Consider the one-dimensional case, with B1 = [a, b[, B2 = [b, c[.
Solution
The answer is negative, although the dependence is weak [Wiki]. Let B12 = B1 ∪ B2 , q1 = P [N (B1 ) = 0],
q2 = P [N (B2 ) = 0] and q12 = P [N (B12 ) = 0] = P [N (B1 ) = 0, N (B2 ) = 0]. It suffices to prove that in general,
q12 ̸= q1 q2 .
Let Xk , k ∈ Z be the points of the Poisson-binomial process. According to Formula (6), we have:
$$q_{12} = \prod_{k=-\infty}^{\infty}\Big[1 - F\Big(\frac{c - k/\lambda}{s}\Big) + F\Big(\frac{a - k/\lambda}{s}\Big)\Big],$$
$$q_1 = \prod_{k=-\infty}^{\infty}\Big[1 - F\Big(\frac{b - k/\lambda}{s}\Big) + F\Big(\frac{a - k/\lambda}{s}\Big)\Big],$$
$$q_2 = \prod_{k=-\infty}^{\infty}\Big[1 - F\Big(\frac{c - k/\lambda}{s}\Big) + F\Big(\frac{b - k/\lambda}{s}\Big)\Big].$$
There is no reason why we would have q12 = q1 q2 . For instance, if F is the logistic distribution, λ = 1.4, s = 0.29,
a = 0, b = 0.8 and c = 1.6, we have (approximately) q1 = 0.2329, q2 = 0.2306, q12 = 0.0177, and q1 q2 = 0.0537.
However the dependence is weak. Also, we have asymptotic independence, with full independence when
s = ∞, thanks to Theorem 4.5. Formula (6) is implemented in the source code, in Section 6.2.1. See also
Section 3.1.3, featuring a test of independence for the point counts.
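These numbers are easy to reproduce in a few lines of Python (a minimal sketch; the standard logistic is assumed for F, which matches the values quoted above):

# Minimal sketch: testing independence of point counts via Formula (6) (standard logistic F assumed)
import numpy as np

F = lambda x: 1.0 / (1.0 + np.exp(-x))          # standard logistic CDF
lam, s = 1.4, 0.29
a, b, c = 0.0, 0.8, 1.6
k = np.arange(-300, 301)

def q(lo, hi):                                  # P[N([lo, hi)) = 0]
    return np.prod(1.0 - F((hi - k / lam) / s) + F((lo - k / lam) / s))

q1, q2, q12 = q(a, b), q(b, c), q(a, c)
print(q1, q2, q12, q1 * q2)                     # approximately 0.233, 0.231, 0.018, 0.054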
Exercise 11 [S] Boundary effect. The purpose is to assess the impact of the boundary effect, in one
dimension. Assuming λ = 1, use a small value of n, say n = 300, to generate 2n + 1 points Xk , k = −n, . . . , n
of a Poisson-binomial process. Estimate E[T ], the expectation of the interarrival times, using all the 2n + 1
points. Do the same, this time using N = 10^4, to generate 2N + 1 points Xk, k = −N, . . . , N of the same
Poisson-binomial process. But only use the 2n + 1 innermost points (closest to zero) still with n = 300, in
your estimation of E[T ]. These 2n + 1 points won’t be the same as in the first simulation. Also, some closest
neighbors won’t be among the 2n + 1 innermost points but instead, in the larger set of 2N + 1 points. Now
your estimate takes into account nearest neighbors that were unobserved in the first simulation (called censored
data) because they were outside the boundary. Compare your two estimates of E[T ]. The first one is slightly
biased due to boundary effects, the latter one almost has no bias. Compare the impact of using a Cauchy
versus a uniform distribution for F , by looking at the loss of accuracy when estimating E[T ] based on a single
realization of the process.
Hint
Try a simulation with s = 0.5, and one with s = 10. A large s, a thick tail (Cauchy versus uniform F ), or a
small value of n, all magnify the boundary effect, resulting in loss of accuracy in the estimates. Source code to
compute E[T ] can be found in Section 6.2.2.
Exercise 12 [M] A curious, Poisson-like point process. If we use a uniform distribution for F, and s = 1/(2λ), is the resulting process a stationary Poisson process? What if we use a mixture of m such processes in equal proportions, called an m-mixture (see Exercises 18 and 19)? Assume here that we work with 2-dimensional Poisson-binomial point processes.
Solution
In this case, each point (Xh , Yk ) of the process is uniformly distributed on a square with sides of length 1/λ
and centered at (h/λ, k/λ). The support domains of these uniform distributions form a partition [Wiki] of
R2 : they don’t overlap, and there is also no empty space left. So the points of the process are uniformly and
independently distributed on each square B of area 1/λ2 . But there is only one point in any such B. The process
is a Poisson-binomial point process of intensity λ (by construction), but it can not be a standard Poisson process.
If we mix m such processes, the resulting process has a point count N (B) with identical binomial distributions
[Wiki] on any square B of area 1/λ², with N(B) ∈ {0, . . . , m} and E[N(B)] = 1. It is not a Poisson process
either since N (B) does not have a Poisson distribution, though it is getting close.
Exercise 13 [S*] Poisson-binomial process on the sphere. Build a Poisson-binomial process on the
sphere. You can start with a circle first, a cube or a torus. Study its properties, such as the distribution of
nearest neighbor distances or the size of connected components, via simulation. Note that in this case, the
point process has a finite number of points. See also “Nearest Neighbor and Contact Distance Distribution for
Binomial Point Process on Spherical Surfaces” [75], available online here.
Solution
The first step is to define a lattice on the sphere. One way to do it is to build an inscribed polyhedron inside
the sphere [Wiki], and use its vertices as the lattice locations in the lattice space. See [65], available online here.
An easier way is as follows:
plot longitudes and latitudes at equally spaced angles,
the points where they intersect are the lattice locations,
the angle between two successive parallels (latitudes or longitudes) determines the intensity.
The disadvantage of this method is that it creates two poles, and the lattice locations are not evenly distributed
on the sphere. The resulting process is not homogeneous. For a solution with evenly distributed lattice locations,
see here.
Now around each lattice location, generate a random point on the surface of the sphere. The point is
specified by two independent random variables: an angle θ uniformly distributed on [0, 2π], and a radius R
measuring the distance to the lattice location on the surface of the sphere. It makes sense to require R ≤ πρ,
where ρ is the radius of the sphere. The scaling factor can be defined as s = E[R]. Note that there are no
boundary effects here. The next step is to create clusters on the sphere. See [33], available online here. Also,
one can study the conditions to obtain convergence to a stationary Poisson point process on the sphere.
Another possible generalization is random lines. In two dimensions, a line is characterized by two quantities:
its distance R to the origin, and its orientation θ. A similar methodology can be used to produce a Poisson-
binomial line process, with the angle θ uniformly distributed on [0, 2π]. In this case, the lattice space could be
(Z/λ) × (Z/λ), where λ is the intensity. Also see “Generating stratified random lines in a square” [70], available
online here. This is a typical stochastic geometry problem.
Exercise 14 [S] Taxonomy of point processes. The purpose of this exercise is to prove that each type of
point process studied in detail in this textbook is unique. In other words, the overlap between the different
classes of point processes is small, despite model identifiability issues. Here, I ask you to verify, via examples,
that m-interlacings defined in Section 1.5.3 are different from m-mixtures, stationary Poisson processes, Poisson-
binomial point processes, and the radial cluster processes discussed in Section 2.1.
Solution
As usual, the differences are most striking when the scaling factor s is very small. In that case, for m-interlacings,
each lattice location in the lattice space has exactly m points of the process clustered around it. For Poisson-
binomial and m-mixtures, that number is one. For radial cluster processes (with a Poisson-binomial parent
process), the number in question is random and depends on the location. For Poisson point processes (the limit
of some of these processes when s → ∞) the underlying lattice space becomes meaningless.
Exercise 15 [MS] Distribution of nearest neighbor distances. In two dimensions, T (λ, s) represents the
distance between a point of the process and its nearest neighbor.
Prove that when s → ∞, the limiting distribution of T is Rayleigh [Wiki] of mean 1/(2λ).
Show by simulations or logical arguments, that unlike in the one dimensional case (see Theorem 4.3), T ’s
expectation depends on s.
Also, show that depending on F , the maximum nearest neighbor distance, computed over the infinitely
many points of the process, can have a finite expectation. Is this true too when s → ∞, that is, for
stationary Poisson point processes?
Finally, what is T ’s distribution if T is replaced by the distance between an arbitrary location in R2 , and
its closest neighbor among the points of the process?
Solution
In two dimensions, the fact that E[T(λ, s)] depends on s is obvious: if s = 0, it is equal to 1/λ, and if s = ∞, it is equal to 1/(2λ). Between these two extremes, there is a continuum of values, of course depending on s. The maximum nearest neighbor distance (over all the infinitely many points) always has a finite expectation if F is uniform, regardless of s < ∞. To the contrary, for a Poisson point process, the maximum is infinite, see here. Now let's prove that T has a Rayleigh distribution when s = ∞, corresponding to a Poisson process of intensity λ². We have P(T > y) = P[N(B) = 0], where B is a disc of radius y centered at an arbitrary point of the process, and N is the point count, with a Poisson distribution of mean λ²µ(B), where µ(B) = πy² is the area of B. Thus P(T > y) = exp(−λ²πy²), that is, P(T < y) = 1 − exp(−λ²πy²). This is the CDF of a Rayleigh distribution of mean 1/(2λ).
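A short simulation illustrates the Rayleigh limit (a minimal sketch; a stationary Poisson process of intensity λ² = 1 is simulated on a large square, and only points away from the border are used, to dampen boundary effects):

# Minimal sketch: nearest neighbor distances for a stationary Poisson process (lambda = 1)
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(2)
lam, L = 1.0, 60.0
pts = rng.uniform(0, L, size=(rng.poisson(lam**2 * L**2), 2))   # intensity lambda^2 points per unit area

tree = cKDTree(pts)
d, _ = tree.query(pts, k=2)                          # d[:, 1] is the distance to the nearest neighbor
inner = np.all((pts > 5) & (pts < L - 5), axis=1)    # ignore points close to the border
nn = d[inner, 1]
print(nn.mean(), 1 / (2 * lam))                      # both approximately 0.5
# the empirical CDF of nn should match 1 - exp(-lambda^2 pi y^2), the Rayleigh CDF above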
Exercise 16 [M] Cell networks: coverage problem. Points are randomly distributed on the plane, with
an average of λ points per unit area. A circle of radius R is drawn around each point. What is the proportion
of the plane covered by these (possibly overlapping) circles? What if R is a random variable, so that we are
dealing with random circles? Such stochastic covering problems are part of stochastic geometry [Wiki] [22, 74].
See also Hall’s book on coverings [39]. Applications include wireless networks [Wiki].
Solution
The points are distributed according to a Poisson point process of intensity λ. The probability that an arbitrary
location x in the plane is not covered by any circle, is the probability that there is zero point from the process,
in a circle of radius R centered at x. This is equal to exp(−λπR2 ). Thus the proportion of the plane covered
by the circles is 1 − exp(−λπR2 ). Now, let’s say that we have two types of circles: one with radius R1 , and one
with radius R2 , each type equally likely to be picked up. This is like having two independent, superimposed
Poisson processes (see Section 1.5.3), each with intensity λ/2, one for each type of circle. Now the probability
p that x is not covered by any circle is thus a product of two probabilities:
$$p = \exp\Big(-\frac{\lambda}{2}\pi R_1^2\Big)\times\exp\Big(-\frac{\lambda}{2}\pi R_2^2\Big) = \exp\Big(-\lambda\pi\,\frac{R_1^2 + R_2^2}{2}\Big).$$
You can generalize to m types of circles, each type with a radius Rk and probability pk to be picked up, with 1 ≤ k ≤ m. It leads to
$$1 - p = 1 - \exp\Big(-\lambda\pi\sum_{k=1}^{m} p_k R_k^2\Big), \qquad (44)$$
which is the proportion of the plane covered by at least one circle. If R, the radius of the circle, is a continuous random variable, the sum in Formula (44) must be replaced by E[R²]. A related topic is the smallest circle
problem [Wiki].
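The closed form is easy to confirm by simulation (a minimal sketch; the torus trick avoids edge effects, and the values of λ, R and the window size are illustrative):

# Minimal sketch: proportion of the plane covered by circles of radius R around Poisson points
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(3)
lam, R, L = 0.5, 0.8, 40.0
pts = rng.uniform(0, L, size=(rng.poisson(lam * L * L), 2))
tree = cKDTree(pts, boxsize=L)                       # periodic boundary (torus) to avoid edge effects
probes = rng.uniform(0, L, size=(200_000, 2))
d, _ = tree.query(probes, k=1)
print((d <= R).mean(), 1 - np.exp(-lam * np.pi * R**2))   # both approximately 0.634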
Exercise 17 [M] Optimum circle covering of the plane. This is an old problem, mentioned by Kershner
in 1939 [46], revisited in 1971 by Williams [78], and still active today, see [32] (available online here) and [67]
(available online here). Unlike in Exercise 16, the slightly overlapping circles of fixed radius, covering the entire
plane, have centers located on a lattice rather than being the points of a Poisson process; in other words, the
scaling factor s of the underlying Poisson-binomial process is zero (the point process reduces to its lattice space).
Applications include cellular network coverage, optimum location of sensor devices, and supply chain opti-
mization such as optimum packing [Wiki]. The circle covering problem [Wiki] consists of finding the lattice that
achieves optimum coverage: each location in the plane is covered by an average of p > 1 circles; the optimum is
reached when p is minimum. Compute p both for the hexagonal lattice, and for the square lattice. Note that
throughout this textbook, I worked with Poisson-binomial processes defined on a square lattice, except when
considering lattice rotations, stretching, and superimposition in Section 1.5.3.
Solution
Let’s start with circle centers located on a square lattice. For full coverage of the plane with as little overlapping
as possible, the circles must be the smallest ones covering a square: the four vertices of the square must lie on
the circle boundary, and the centers (both for the circle and the square) coincide. For a unit square, such a circle must have a radius equal to √2/2 and an area equal to π/2. It is easy to see that p = π/2 ≈ 1.571. This is illustrated here. For an hexagonal lattice [Wiki], the circle must be the smallest one covering an hexagon and having the same center as the inscribed hexagon [Wiki]. Computations (see [46]) show that p = 2π/√27. This
is indeed the minimum possible value for p. There are only five types of regular lattices, called Bravais lattices
[Wiki]. The hexagon is the regular polygon with the maximum number of sides, among those able to produce
a regular Voronoi tessellation [Wiki], and thus results in the optimum lattice and minimum p.
Exercise 18 [S] Interlaced lattices, lattice mixtures and nearest neighbors. This is an additive
number theory problem [Wiki], see also [64]. Let us consider a mixture (called m-mixture) or superimposition
(called m-interlacing) of m shifted two-dimensional Poisson-binomial processes M1 , . . . , Mm with scaling factor
s = 0. Thus, these are non-random processes, where the state space of the i-th process Mi corresponds to its
shifted lattice space: (Xih , Xik ) = (µi + h/λ, µ′i + k/λ) for each point (Xih , Xik ) of Mi , with (h, k) ∈ Z2 and
(µi , µ′i ) is the shift parameter vector of Mi , depending only on i. Assume that each Mi has intensity λ = 1.
Perform simulations to compare the distribution of nearest neighbor distances between m-interlacings and m-
mixtures. More specifically, we are interested in the number of unique values that it can take. Conclude from
this experiment that m-interlacings with small s, are less “random” than m-mixtures with the same s. Mixtures
and superimposition of shifted processes are discussed in Section 1.5.3 and 1.5.4. By nearest neighbors, I mean
among points of the m-mixture or m-interlacing, not between an arbitrary location and a point of the process,
nor within each individual Mi taken separately.
Solution
For m-interlacings with s = 0, we have exactly m points P1, . . . , Pm in the square [0, 1/λ[ × [0, 1/λ[ (or in any square
of same area, for that matter), and thus m pairs {Pi , Pi′ } (i = 1, . . . , m) where Pi′ is the nearest neighbor (NN)
to Pi . Thus we have at most m distinct NN distances ||Pi −Pi′ ||. So for m-interlacings with s = 0, the maximum
number of unique values for the NN distance is m.
For m-mixtures, the situation is different. Now we have between 1 and m points in the square B = [0, 1/λ[ × [0, 1/λ[,
assuming s = 0. Each of these points has one NN, possibly in the same square or in an adjacent square. For
instance, if Pi ∈ B, it has one NN: a point Pj ∈ B or a shifted version of Pj in an adjacent square. All
combinations i, j ∈ {1, . . . , m} are possible, and will necessarily show up (with probability one) in some squares
of same area 1/λ². Thus the number of unique NN distances is at least m² − 1, and at most m · (4m − 1). The
“minus one” is because a point can not be its NN, that is, Pi ̸= Pi′ .
Simulations confirm these findings, both for m-interlacings and m-mixtures. It is assumed here that the shift
vectors (µi , µ′i ) are arbitrary, as if they were randomly generated.
Exercise 19 [SM*] Lattice topology and algebra. Using a superimposition of m stretched shifted Poisson-
binomial processes M1 , . . . , Mm , denoted as M and called an m-interlacing in Exercise 18, build a point process
that has a regular hexagonal lattice as its lattice space, with m as small as possible. Note that each Mi has a
rectangular lattice space. Superimposed stretched shifted processes are defined in Section 1.5.3. When s = 0,
M is identical to its fixed (non-random) hexagonal lattice space, see left plot in Figure 2. It is also clear from
Figure 2 that each point of M has exactly 3 nearest neighbors. To the contrary, in a square lattice, each point
(called vertex in graph theory) has 4 nearest neighbors. In a rectangular (non-square) lattice, each vertex has
2 nearest neighbors. Is it possible to build a lattice where each vertex has 5 or 6 nearest neighbors? A line
joining two nearest neighbor vertices is called an edge. In Figure 2, all edges have the same unit length. Use
Formulas (8) and (9) to generate a realization of M . The challenge is to find the minimum m and then identify
the parameters λ, λ′ and µi , µ′i (i = 1, . . . , m) resulting in a regular hexagonal lattice when s = 0. By regular,
I mean that all edges have the same length, and only one regular polygon is used in the construction (in our
case, an hexagon).
Solution
The solution can be found in Section 1.5.3. I used m = 4, and I don't think you can use a smaller m. The parameters are λ = 1/3, λ′ = √3/3, µ1 = 0, µ2 = 1/2, µ3 = 2, µ4 = 3/2 and µ′1 = 0, µ′2 = √3/2, µ′3 = 0, µ′4 = √3/2. You won't be able to build a regular lattice based on a single regular polygon [Wiki] if each point has
exactly 5 or exactly 6 (or more) nearest neighbors. But many semi-regular lattices also called tilings [Wiki], such
as square-hexagonal [Wiki], exist. This also illustrates the fact that lattices form a group [Wiki], where shifting
(also called translation) corresponds to the addition operation, and stretching is the scalar multiplication [Wiki].
Each shift vector uniquely characterizes a lattice, and the other way around. Also, an infinite 2-D lattice shifted
by the vector (µ, µ′ ) = (h/λ, k/λ), regardless of h, k ∈ Z, is topologically unchanged. The two lattices are
congruent to each other modulo 1/λ, in the same sense that (in one dimension) the numbers 30.628 and 40.052
are congruent [Wiki] to each other modulo 2.356 (in the latter case, because 30.628 − 40.052 = −4 × 2.356 is a
multiple of 2.356).
Exercise 20 [MS**] Nearest neighbors and size distribution of connected components. Simulate
10 realizations of a stationary Poisson process of intensity λ = 1, each with n = 10^3 points distributed over a
square window. Identify the connected components [Wiki] and their size (the number of points in each connected
component). The purpose of the exercise is to study the distribution of the size, denoted as S. In particular,
what is the proportion of connected components with only 2 points (P [S = 2]), 3 points (P [S = 3]) and so on?
For connected components, use the undirected graph, that is: points Vi , Vj (also called vertices) are connected
if Vi is nearest neighbor to Vj , or the other way around. The questions are:
Estimate the probabilities in question via simulations. When computing the proportions using multiple
realizations of the same process, do we get a similar empirical distribution for S, across all realizations?
Does the empirical distribution seem to converge, when increasing n, say from n = 10^3 to n = 10^4 or n = 10^5?
Do the same experiment with a Poisson-binomial process, with λ = 1 and s = 0.15. Do we get the same
distribution for S? What about P [S = 2]?
Generate a particular type of random graph, called random NN graph, as follows. Let V1 , . . . , Vn be
the n vertices of the graph (their locations do not matter). For the “nearest neighbor” to vertex Vk
(k = 1, . . . , n), randomly pick up one of the n vertices except Vk itself. Two points (vertices) can have the
same nearest neighbor. Now study the distribution of S via simulations. Is it the same as for the graph
generated by the nearest neighbors in a stationary Poisson point process?
This is the most difficult part. Let P (S = k), k = 2, 3, . . . be the size distribution for connected components
of a stationary Poisson process; S is a random variable. Of course, it does not depend on λ. Does it
uniquely characterize the Poisson process, in the same way that the exponential distribution for interarrival
times uniquely characterizes the Poisson process in one dimension? Do we have P(S = 2) = 1/2, not only
for Poisson processes, but also for a much larger class of point processes?
Useful references about random graphs [Wiki] include “The Probabilistic Method” by Alon and Spencer [1]
(available online here), and “Random Graphs and Complex Networks” by Hofstad [77] (available online here).
See also here.
Hints
Beware of the boundary effect; to minimize the impact, use a uniform distribution for F (the distribution
attached to the points of the Poisson-binomial process) and n > 10^3. When the scaling factor s is zero, there
is only one connected component of infinite size (P [S = ∞] = 1): this is a singularity, as illustrated on the
left plot in Figure 2. But as soon as s > 0, all the connected components are of finite size and rather small.
The smallest ones have two points as each point has a nearest neighbor, thus P [S < 2] = 0. When s = ∞, the
process becomes a stationary Poisson process, see Theorem 4.5.
I conjecture that stationary Poisson processes and some other (if not all) Poisson-binomial processes share
the exact same discrete probability distribution for the size of connected components defined by nearest neigh-
bors, and abbreviated as CCS distribution. Thus, unlike the point count or nearest neighbor distance distri-
butions, the CCS distribution can not be used to characterize a Poisson process. For random graphs, the CCS
distribution is different from that of a Poisson process. I used a Kolmogorov-Smirnov test [Wiki] (see also [26]
available online here) to compare the two empirical CCS distributions – the one attached to Poisson processes
versus the one attached to random NN graphs – and concluded, based on my sample size (n = 10^4 points or
vertices), that they were statistically different.
To conclude, it appears that the CCS distribution can not be arbitrary. Many point processes seem to have
the same CCS distribution, called attractor distribution, and these processes constitute the domain of attraction
of the attractor. The concepts of domain of attraction and attractor are used in other contexts such as dynamical
systems [Wiki] or extreme value theory [Wiki] (also, see [7] page 317). The most well known analogy is the
Central Limit Theorem, where the Gaussian distribution is the main attractor, and the Cauchy distribution is
another one. In chapter 11 of “The Probabilistic Method” [1], dealing with the size of connected components in
random graphs, the author introduces a random variable Tc , also counting a number of vertices (called nodes
in the book). Its distribution has all the hallmarks of an attractor. See Theorem 11.4.2 (page 202) in the book
in question.
To find the connected components, you can use the source code in Section 6.5. To simulate point processes,
you can use the source code in Section 6.4: it produces an output file PB NN dist full.txt that can be used
as input, without any change, to the connected components algorithm in Section 6.5. Exercise 21 features a
similar problem, dealing with cliques rather than connected components.
Exercise 21 [M] Maximum clique problem. In undirected graphs [Wiki], a clique is a set of vertices (also
called nodes) all connected to each other. In nearest neighbor graphs, two points are connected if one of them
is a closest neighbor to the other one. How would you identify a clique of maximum size in such a graph? No
need to design an algorithm from scratch; instead, search the literature. Finding the maximum clique [Wiki]
is NP-hard [Wiki], and the problem is related to the “P versus NP” conjecture [Wiki]. The maximum clique
problem has many applications, in particular in social networks. Probabilistic properties of cliques in random
graphs are discussed in “Cliques in random graphs” [8] (available online here) and “On the evolution of random
graphs” [25] (available online here). See also [Wiki]. More recent articles include [30, 57], respectively available
here and here.
Solution
In two dimensions, in an undirected nearest neighbor graph, the minimum size of a maximum clique is 2 (as each
point has a nearest neighbor), and the maximum size is 3. A maximum clique must be a connected component.
See definition of connected component in Exercise 20. If each point has exactly one nearest neighbor, then a
connected component of size n > 1 has n or n − 1 edges (the arrows on the right plot in Figure 2), while a clique
of size n has exactly n(n − 1)/2 edges. This is why maximum cliques of size larger than 3 don't exist. But in d
dimensions, a maximum clique can be of size d + 1. The maximum clique can be found using the MaxCliqueDyn
algorithm [Wiki].
5.5 Miscellaneous
This section features problems that don’t fit well in any of the previous categories.
Exercise 22 [M] Computing moments using the CDF. The purpose is to prove a formula to compute the
moments of a random variable, using the cumulative distribution function (CDF), rather than the density. If
X is a univariate random variable with CDF F (x) = P (X < x), and r is a positive integer, prove the following:
If X is positive, then E[X^r] = r ∫_0^∞ x^{r−1}(1 − F(x)) dx.
If X is symmetric around the origin and r is even, then E[X^r] = 2r ∫_0^∞ x^{r−1}(1 − F(x)) dx. If r is odd, E[X^r] = 0.
Solution A solution for the general case, or when X is positive, can be found here. If X is symmetric around
the origin, then F (x) = 1 − F (−x), and the result follows easily.
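A quick numerical check of the first identity, using the exponential distribution as an example (a minimal sketch; the rate and the order r are arbitrary):

# Minimal sketch: E[X^r] = r * integral of x^(r-1) (1 - F(x)) dx, exponential example
import math
import numpy as np
from scipy.integrate import quad

rate, r = 2.0, 3
F = lambda x: 1 - np.exp(-rate * x)                    # CDF of the exponential distribution
moment, _ = quad(lambda x: r * x**(r - 1) * (1 - F(x)), 0, np.inf)
print(moment, math.factorial(r) / rate**r)             # both equal to 3!/2^3 = 0.75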
Exercise 23 [S] Simulations: generalized logistic distribution. Implement a routine that generates
deviates for the generalized logistic distribution, using the quantile function Q(u) in Formula (12), with a
uniform distribution on [0, 1] for u. Do the same for the Laplace distribution defined in Section 1.1. Simulate
1-D and 2-D Poisson-binomial point processes, using a Laplace and generalized logistic distribution for F . For
the generalized logistic distribution, try different values for the parameters α, β, τ, µ, λ.
Hint
Use inverse transform sampling to simulate Laplace deviates. That is, use the Laplace quantile function Q(u)
with uniform deviates on [0,1] for u; Q(u) is the inverse of the Laplace cumulative distribution function F listed
in Section 1.1.
Exercise 24 [S] Riemann Hypothesis. Refer to Section 2.3.2 for the material and notations discussed here.
The hole in Figure 8, on the top left plot corresponding to σ = 0.75 and s = 0, is observed when 0 ≤ t ≤ 200.
Try other intervals, say [t, t + τ ], for much larger values of t and (say) τ = 200. See if the hole gets any smaller.
Try s = 10^{-2}, instead of s = 10^{-3} in Formula (20) and (21): now the hole is entirely gone. This shows how
sensitive the η function is to small perturbations. Finally, find the first 40 values t = t1 , . . . , t40 , with t > 0,
solutions of ℑ[η(σ + it)] = 0, when σ = 1/2, using numerical techniques. How many of these roots are also solutions of ℜ[η(σ + it)] = 0? Such values of t correspond to the non-trivial complex zeros of the Riemann zeta function, on the critical line σ = 1/2.
Solution
The challenge here is the slow and chaotic convergence of the two series (real and imaginary parts) representing
the function η(σ + it) in Formula (18) and (19). I refer to t as the time. The larger t, the smaller the time
increments required to correctly plot the orbit. These increments can be as small as 0.01 if t ≈ 10^3, to not
miss any rare value, say t0 , resulting in η(σ + it0 ) unusually close to the origin when σ = 0.75. A convergence
acceleration technique is described in Exercise 25.
Exercise 25 [S*] Convergence acceleration. Design a basic algorithm for convergence acceleration of
alternating series [Wiki]. How does it perform, when computing the sum in Formula (19)? Try with s = 0.75
and t = 18265.2 (the correct value of the sum is about 0.292040897 if you ignore the sign, see Mathematica
computation here).
Solution
If Sn = a1 + a2 + · · · + an converges to S, and the ak ’s are alternating, then one can proceed as follows:
Let Sn′ = a′1 + a′2 + · · · + a′n with a′k = αak + (1 − α)ak+1 , and 0 ≤ α ≤ 1 chosen to maximize the speed
of convergence of Sn′ .
Let Sn′′ = a′′1 + a′′2 + · · · + a′′n where a′′k = α′ a′k + (1 − α′ )a′k+1 , and 0 ≤ α′ ≤ 1 chosen to maximize the
speed of convergence of Sn′′ .
One can continue iteratively with Sn′′′ and so forth, each new sum converging faster to S than the previous one.
Also, the sequence a1 , a′1 , a′′1 and so on, rapidly converges to S. However, it fails to work in our example.
The reason is because, despite the appearance, the series in Formula (19) is not an alternating one. Indeed,
hundreds, and even trillions of trillions of consecutive terms, depending on t, can have the same sign despite the
(−1)k factor attached to each term. This behavior creates numerical instability. The explanation is as follows:
If for some large k in Formula (18) or (19), the quantity t log(k + 1) − t log k ≈ t/k is close to an odd multiple
of π, then around that k, a lot of terms in the series will have the same sign and similar value (as opposed to
the regular alternating behavior). As a result, if k is not large enough (but not too small) when this happens,
a sum that seems to have converged, will suddenly experience a huge shift. This is what happens here, most
strikingly when k = 5814 and t = 18265.2, leading to t/k = 3.141589 . . . very close to π, and resulting in the
odd behavior around k = 5814, illustrated in Figure 24. The X-axis represents k and the Y-axis represents the
value of the partial sum computed using k terms.
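To visualize this, the minimal sketch below computes the partial sums of the alternating series for η(σ + it) and prints them around k = 5814. It assumes that Formulas (18) and (19) are the real and imaginary parts of the standard series Σ_{k≥1} (−1)^{k+1} k^{−σ−it}; adjust the sign convention if your version differs.

# Minimal sketch: partial sums of the Dirichlet eta series near the dip at k = 5814
# (assumes the standard alternating series eta(z) = sum over k of (-1)^(k+1) k^(-z))
import numpy as np

sigma, t = 0.75, 18265.2
k = np.arange(1, 20001)
terms = (-1.0)**(k + 1) * k**(-sigma) * np.exp(-1j * t * np.log(k))
partial = np.cumsum(terms)
for kk in (5000, 5700, 5800, 5814, 5830, 5900, 10000, 20000):
    print(kk, partial[kk - 1])                  # note the sudden shift around k = 5814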
There are various workarounds to deal with this issue. First, the Dirichlet eta function η has numerous
representations: you can choose one that is more suitable for computation purposes. But even if you want to
stick to Formula (19), you can improve it by splitting the sum into two parts:
One part that deals with the few dips and spikes, easy to identify. Here, the last one occurs at k = 5814.
The second part is to compute the first few hundred terms by traditional means.
Then combine both parts to get a good approximation of the final sum, in the end using much fewer
operations than brute force, and having a good sense as to when convergence is reached.
To prove the convergence of the series in Formulas (18) and (19) representing the Dirichlet eta function, one
can use the Dirichlet test [Wiki]. Note that without the factor (−1)k in Formulas (18) and (19), the series may
not converge.
Exercise 26 [S] Fast image filtering algorithm. The filtering algorithm described in Section 3.4.3 requires
a large moving window of 21 × 21 pixels, around each pixel. The size of this window is the bottleneck. How
can you make the algorithm about 20 times faster, still keeping the same window size?
Solution
When filtering the image, the window used at (x, y) and the next one at (x + 1, y) both have 21 × 21 = 441 pixels, and they differ by only 2 × 21 = 42 pixels (21 leave, 21 enter); the remaining 441 − 21 = 420 pixels are shared. So rather than visiting 441 pixels each time, the overlapping pixels can be kept in a 21 × 21 buffer. To update the buffer after visiting a pixel
and moving to the next one to the right, one only has to update 21 values in the buffer: overwrite the column
corresponding to the old 21 leftmost pixels, by the values derived from the new 21 rightmost pixels.
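A minimal sketch of the buffering idea, applied to a plain moving-window sum (the actual filter in Section 3.4.3 may combine the pixels differently; only the incremental column update is illustrated, and the image and window size are illustrative):

# Minimal sketch: sliding a 21x21 window one pixel to the right by updating one column at a time
import numpy as np

rng = np.random.default_rng(4)
img = rng.random((200, 300))
w, h = 21, 10                                       # window size and half-size (w = 2h + 1)

def row_of_window_sums(img, y):
    block = img[y - h : y + h + 1]                  # the 21 rows involved for this row of windows
    sums = np.empty(img.shape[1] - 2 * h)
    total = block[:, :w].sum()                      # full 21x21 sum for the first window only
    sums[0] = total
    for x in range(1, sums.size):                   # slide right: drop one column, add one column
        total += block[:, x + w - 1].sum() - block[:, x - 1].sum()
        sums[x] = total
    return sums

fast = row_of_window_sums(img, 100)
slow = np.array([img[90:111, x : x + w].sum() for x in range(img.shape[1] - w + 1)])
print(np.allclose(fast, slow))                      # True: same result, far fewer pixel visits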
Exercise 27 [M**] Confidence regions: theory and computations. What are the foundations justifying
the methodology used to build the confidence regions in Section 3.1.1? In particular, how would you proceed
to find the values of σp , σq , ρp,q in Formula (27)? Does Gγ depend on n, p or q? Why not? What justifies the
choice of an ellipse for the confidence region? What is the role of Hotelling's distribution in the methodology?
How would you tabulate Gγ via simulations? Can you think of a different type of confidence region?
The goal here is to identify references that answer the questions, rather than trying to prove everything on
your own. There is a considerable amount of tightly packed material in Section 3.1.1, presented at a high level.
I only ask you, in this exercise, to dig just one level beneath the surface.
Solution
Each of the 2n observations is a realization of a Bernoulli random vector (Uk, Vk) ∈ {(0, 0), (0, 1), (1, 0)}, with k = −n, . . . , n − 1. In particular, Uk = 1 if the interval Bk defined by Formula (25) contains exactly one point of the Poisson-binomial process, otherwise Uk = 0. Likewise, Vk = 1 if Bk contains exactly two points, otherwise Vk = 0. The statistic p (a random variable depending on n) is the proportion of 1 in the sequence (Uk), and q is the proportion of 1 in (Vk). From this, it follows that σp = √(p(1 − p)), σq = √(q(1 − q)), p + q ≤ 1, and Uk, Vk are negatively correlated. The proportions of (0, 0), (1, 0) and (0, 1) among the (Uk, Vk) are respectively 1 − (p + q), p and q. From this, it follows that ρp,q = −pq/√(pq(1 − p)(1 − q)).
If the random vectors (Uk , Vk ) were identically and independently distributed (iid), things would be easier
thanks to the multivariate central limit theorem [Wiki], despite the strong correlation between Uk and Vk .
Unfortunately, they are neither. Exercise 9 shows that the point counts are not identically distributed in
general, and Exercise 10 shows the lack of independence. But the dependencies are local and very weak. Also
a careful choice of non-overlapping Bk ’s in Formula (25) – inspired by Theorem 4.1 – makes the point counts
almost identically distributed. They are in fact asymptotically iid. The length of Bk , set by Formula (24),
is very well approximated by (and asymptotically equal to) 1/λ, to minimize any problem. Section 3.1.2
provides further reassurance. Finally, when n is large, the Bk ’s are on average far away from each other, further
dampening dependencies and related issues. And as a bonus, the bias caused by boundary effects tends to zero.
By asymptotically, I mean when n → ∞.
In the remainder of this discussion, I consider the previous issue as overcome. I proceed as if the (Uk, Vk) were iid. Thus, we can use the central limit theorem (CLT) as is. If we only had one statistic p and 2n observations, then the CLT states that Z = √(2n)·(p − µp)/σp → N(0, 1) as n → ∞. In two dimensions, σp² is replaced by the 2 × 2 symmetric covariance matrix, denoted as Σ or Σp,q. Its inverse (the analogue of 1/σp² in one dimension) is denoted as Σ⁻¹. The multivariate CLT implies that Z = √(2n)·(p − µp, q − µq)Σ^{-1/2} → N(0, I). That is, we have convergence to a bivariate normal distribution [Wiki] (also called multivariate Gaussian) of zero mean and identity covariance matrix I [Wiki]. Also,
$$\Sigma^{-1} = \frac{1}{1 - \rho_{p,q}^2}\cdot\begin{pmatrix}\sigma_p^{-2} & -\rho_{p,q}\,\sigma_p^{-1}\sigma_q^{-1}\\ -\rho_{p,q}\,\sigma_p^{-1}\sigma_q^{-1} & \sigma_q^{-2}\end{pmatrix}.$$
In one dimension, Z² has a chi-squared distribution with one degree of freedom at the limit as n → ∞. The Berry-Esseen theorem [Wiki] quantifies the stochastic error (that is, the quality of the approximation) when n is not infinite. In two dimensions, Z² is replaced by
$$Z\cdot Z^t = \Big[\sqrt{2n}\,(p - \mu_p, q - \mu_q)\,\Sigma^{-1/2}\Big]\times\Big[\sqrt{2n}\,(p - \mu_p, q - \mu_q)\,\Sigma^{-1/2}\Big]^t = 2n\,(p - \mu_p, q - \mu_q)\,\Sigma^{-1}\,(p - \mu_p, q - \mu_q)^t$$
$$= \frac{2n}{1 - \rho_{p,q}^2}\cdot\Big[\Big(\frac{p - \mu_p}{\sigma_p}\Big)^2 - 2\rho_{p,q}\,\frac{p - \mu_p}{\sigma_p}\cdot\frac{q - \mu_q}{\sigma_q} + \Big(\frac{q - \mu_q}{\sigma_q}\Big)^2\Big]$$
and still has a chi-squared distribution [Wiki], but this time with two degrees of freedom [Wiki]. This explains
the choice of Hn (x, y, p, q) in Formula (27), and why Gγ does not depend on p, q and quickly converges as
n → ∞. Here the symbol t denotes the transposition operator [Wiki], transforming a row vector into a column
vector. Also, Σ^{-1/2}·(Σ^{-1/2})^t = Σ⁻¹. The chi-squared limit is a particular case of Cochran's theorem [Wiki].
It assumes that the exact values of µp , µq , σp , σq , ρp,q are known. Unfortunately, this is not the case here: these
values are replaced by their estimates based on p and q. As a result, in two dimensions, the chi-squared must be
replaced by Hotelling's distribution [Wiki]. The proof of these results (the fact that the square of a Gaussian
is a chi-squared, and so on) is based on the characteristic functions of these distributions.
The ellipse is the best possible shape for the confidence region: given a confidence level γ, it is the one of
minimum area. To see why, start building a tiny, almost empty confidence region with γ ≈ 0. You need to start
at the maximum of the density function (here, a bivariate Gaussian by virtue of the central limit theorem). As
you increase γ, the confidence region expands. But to keep it expanding at the slowest possible rate (keeping
its area minimum at all times), you need to follow the contour lines of the density: the curves where the density
is constant. For the Gaussian distribution, these contour lines are ellipses. But the same principle is true for
any continuous bivariate density. In general, the shape will not be an ellipse.
There is an alternative definition of confidence region, called dual confidence region, leading to non-elliptic
shapes even for Gaussian distributions. It consists of computing the confidence region for all (p, q) in the proxy
space, rather than for your estimate (p0 , q0 ) only. If the confidence region of some (p, q) contains (p0 , q0 ), then
it is part of (p0 , q0 )’s newly defined confidence region. The boundary of the newly defined confidence region of
(p0 , q0 ) consists of the points (x, y) satisfying
$$\frac{2n}{1-\rho_{x,y}^2} \cdot \Big[\Big(\frac{p - x}{\sigma_x}\Big)^2 - 2\rho_{x,y}\,\frac{p - x}{\sigma_x} \cdot \frac{q - y}{\sigma_y} + \Big(\frac{q - y}{\sigma_y}\Big)^2\Big] = H_\gamma, \tag{45}$$
with (p, q) set to (p0 , q0 ). Compare Formula (27) with (45). Clearly, the latter does not correspond to the
equation of an ellipse; the former does. The roles of (p, q) and (x, y) have been swapped. Also note the
use of a different “scale” Hγ instead of Gγ . Yet in practice, the two methods yield almost identical results.
Both the standard and newly defined confidence regions are implemented in the spreadsheet. An example
using the standard region is featured on the left plot in Figure 11. Tables for Gγ and Hγ are provided in the
Confidence Region tab in the spreadsheet in question (PB Independence.xlsx): see columns F, G, and
K. I produced them via simulations, based on the code in column Y.
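As an illustration of the dual region, the short sketch below (my own, not the spreadsheet implementation) evaluates the left-hand side of (45) on a grid of candidate $(x, y)$ values in the proxy space and keeps those below $H_\gamma$; the value of $H_\gamma$ is assumed known, for instance read from the table in the spreadsheet.
# Minimal sketch (not the spreadsheet code): dual confidence region of Formula (45),
# obtained by scanning a grid of candidate (x, y) in the proxy space
import math
def sigma(t):                 # Bernoulli-type standard deviation
    return math.sqrt(t*(1-t))
def rho(x, y):                # correlation implied by the (0,0)/(1,0)/(0,1) structure
    return -x*y/(sigma(x)*sigma(y))
def dual_region(p0, q0, n, H_gamma, step=0.002):
    region = []
    x = step
    while x < 1:
        y = step
        while y < 1:
            if x + y < 1:     # admissible proxy values (p + q <= 1)
                r = rho(x, y)
                zx, zy = (p0-x)/sigma(x), (q0-y)/sigma(y)
                stat = 2*n*(zx*zx - 2*r*zx*zy + zy*zy)/(1 - r*r)
                if stat <= H_gamma:
                    region.append((x, y))
            y += step
        x += step
    return region
print(len(dual_region(p0=0.30, q0=0.50, n=500, H_gamma=6.0)))  # number of grid points in the region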
Last but not least, in the end the goal is to obtain confidence regions for the parameter $(\lambda, s)$ in the
parameter space, not for the proxy vector $(p, q)$ in the proxy space. The final step consists of using the inverse
mapping defined in Section 3.1.1, to map the confidence region built in the proxy space onto the parameter
space. The challenge here is to prove that the mapping is one-to-one. This is still an open question. Most
likely, the final confidence region in the parameter space won't be an ellipse. There is an easy formula for the
mapping from the parameter space to the proxy space (see Section 3.1.2), but the inverse mapping, needed
here, is harder to perform.
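One practical way to carry out this last step, assuming the forward mapping $(\lambda, s) \mapsto (p, q)$ of Section 3.1.2 is available as a Python function (called forward_map below, a hypothetical name, not defined in this textbook), is to invert it numerically, point by point, along the boundary of the proxy-space region. A crude grid-search sketch:
# Minimal sketch (hypothetical): map a proxy-space point (p, q) back to (lambda, s)
# by numerically inverting a user-supplied forward mapping forward_map(lam, s) -> (p, q)
def invert_point(p, q, forward_map, lam_grid, s_grid):
    best, best_err = None, float("inf")
    for lam in lam_grid:
        for s in s_grid:
            ph, qh = forward_map(lam, s)
            err = (ph - p)**2 + (qh - q)**2
            if err < best_err:
                best, best_err = (lam, s), err
    return best   # grid point whose image is closest to (p, q)
# Usage: apply invert_point to each boundary point of the proxy-space confidence region;
# the resulting set of (lambda, s) values approximates the confidence region in the parameter space.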
Exercise 28 [M*] Minimum set covering 90% of a distribution. This is related to the confidence
regions discussed in Exercise 27. It consists of (1) finding the shape of the 2D set of minimum area, covering a
proportion γ of the mass of a specific 2D probability distribution, and (2) determining its area.
Solution
Let Sγ be the set in question, and f (x, y) be the density attached to the distribution. I assume that the density
has one maximum only, and that it is continuous everywhere on R2 . Thus the problem consists of finding the
set $S_\gamma$ of minimum area, such that
$$\iint_{S_\gamma} f(x, y)\,dx\,dy = \gamma. \tag{46}$$
It is easy to see that the boundary of Sγ is a contour line of f (x, y). To build Sγ , you start at the maximum of
the density, and to keep the area minimum, the set must progressively be expanded, strictly following contour
lines, until (46) is satisfied. So
$$S_\gamma = \{(x, y) \in \mathbb{R}^2 : f(x, y) \ge G_\gamma\},$$
where $G_\gamma$ must be chosen so that (46) is satisfied. Assuming $\max f(x, y) = M$, the volume covered by $S_\gamma$ is
$$\gamma = z_\gamma \cdot |S_\gamma| + \int_{z_\gamma}^{M} |R(z)|\,dz, \tag{47}$$
where $R(z) = \{(x, y) \in \mathbb{R}^2 \mbox{ such that } f(x, y) \ge z\}$, and $|\cdot|$ denotes the area of a 2D domain. Clearly, $|S_\gamma| =
|R(z_\gamma)|$. So there is only one unknown in Equation (47), namely $z_\gamma$. Finally, $G_\gamma = z_\gamma$, and thus the value of $G_\gamma$
is found by solving (47). The area of $S_\gamma$ is thus $|S_\gamma| = |R(G_\gamma)|$.
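As a concrete check, the following sketch (my own illustration; the density used, a standard bivariate Gaussian, is an assumption for the example) finds $z_\gamma$ numerically by sorting density values over a grid and accumulating mass until the fraction $\gamma$ is reached; the area of $S_\gamma$ is then the number of retained grid cells times the cell area. For this particular density, the exact answer $|S_\gamma| = -2\pi \log(1-\gamma) \approx 14.47$ (with $\gamma = 0.9$) provides a sanity check.
# Minimal sketch (illustration only): numerically find z_gamma and |S_gamma| for a
# standard bivariate Gaussian density, accumulating mass over cells sorted by decreasing density
import math
gamma, step, L = 0.90, 0.02, 6.0           # target mass, grid step, half-width of the grid
def f(x, y):                               # standard bivariate Gaussian density
    return math.exp(-(x*x + y*y)/2)/(2*math.pi)
cells = []
x = -L
while x < L:
    y = -L
    while y < L:
        cells.append(f(x + step/2, y + step/2))
        y += step
    x += step
cells.sort(reverse=True)                   # fill S_gamma starting at the density's maximum
mass, count, cell_area = 0.0, 0, step*step
for z in cells:
    mass += z*cell_area
    count += 1
    if mass >= gamma:
        print("z_gamma ~", z, "  |S_gamma| ~", count*cell_area)   # area close to 14.47
        break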
6 Source Code, Data, Videos, and Excel Spreadsheets
My source code is available online at github.com/VincentGranville/Point-Processes, as well as in this textbook.
It is written using only basic data structures and manipulations available in all programming languages, such
as strings, arrays, stacks, subroutines, regular expressions and hash tables, to make it easy to read and rewrite
in Java, C++ or other languages. The visualizations are performed either in R with the Cairo graphics library
[Wiki] to create better scatterplots, or Python with the Pillow library to create PNG images pixel by pixel,
including density estimation and clustering via image filtering.
My source code is designed to bring as much educational value as possible, without jeopardizing efficiency.
It includes algorithms useful in many other contexts, such as the generation of random deviates from a logistic,
Cauchy or Laplace distribution, and a fast, compact algorithm to detect connected components in a graph. The
textbook version has detailed explanations. The source code repository is organized according to Table 8; to
access the code online, click on the filename:
PB clustering video.py: Generates the frames for the video imgPB.mp4 featuring fractal supervised
clustering. See Section 2.4.2.
Detailed descriptions are included in the relevant subsections. Table 9 lists the data sets produced by the
various programs, as well as the interactions between these programs. All these files are standard text files,
with tab-separated columns. They are also available on my GitHub repository: click on the filename to find an
example of the corresponding data set. The fields attached to each data set are described in the section covering
the source code that produces it: for instance, Section 6.4 for the data set PB NN dist full.txt, produced
by PB NN.py.
The spreadsheets accompanying this textbook are discussed in Section 6.1. They are also accessible from
the same GitHub repository, here.
The spreadsheets, PB independence.xlsx and PB inference.xlsx, are summarized in Table 10.
# PB_main.py
import math
import random
random.seed(100)
model=("Uniform","Logistic","Cauchy")
pi= 3.1415926535897932384626433
seed=4565 # allows for replicability (to produce same random numbers each time)
bb = 0.75 # see aa
r = 0.50 # to compute E[T^r], r>0
# E[T^r] tends to r!/(lambda)^r as s tends to infinity
n1 = 10000 # compute E[N(B)], Var[N(B)]: k between -n1 and +n1
# n1 much larger than s (if F has thick tail)
# reduce n1 if program too slow [speed ~ O(n1 log n1)]
n2 = 30000 # Simulation: Xk with index k between -n2 and +n2
#---
def main():
OUT.write(line)
s=s+0.2
OUT.close()
def E_and_Var_N(type,llambda,s,aa,bb,n):
variance=0
expectation=0
product=0
flag=0
for k in range(-n1,n1+1):
f1=CDF(type,llambda,s,k,bb)
f2=CDF(type,llambda,s,k,aa)
if 1-f1+f2 == 0:
flag=1
else:
product=product+math.log(1-f1+f2)
expectation=expectation+(f1-f2)
variance=variance+((f1-f2)*(1-f1+f2))
if flag==1:
product=0
else:
product=math.exp(product)
return[expectation,variance,product]
def var_T(type,llambda,s,r,n):
xs=[]
m=0
for k in range(-n,n+1):
ranx=random.random()
xs.append(deviate(type,llambda,s,k))
m=m+1
xs.sort()
expectation=0
variance=0
moment_r=0
k1=int(m/4)
k2=int(3*m/4)
for k in range(k1,k2+1):
dist=(xs[k]-xs[k-1])
expectation=expectation+dist
variance=variance+(dist*dist)
moment_r=moment_r+(dist**r)
expectation=expectation/(k2-k1+1)
variance=(variance/(k2-k1+1))-(expectation*expectation)
moment_r=moment_r/(k2-k1+1)
return[expectation,variance,moment_r]
def deviate(type,llambda,s,k):
ranx=random.random()
if type == "Logistic":
z=k/llambda+s*math.log(ranx/(1-ranx))
elif type == "Uniform":
z=k/llambda+2*s*(ranx-1/2)
elif type == "Cauchy":
z=k/llambda+s*math.tan(pi*(ranx-1/2))
return(z)
To obtain the CDF centered at the origin, set k to zero. The scaling factor s is a function of the variance σ². Table 1
provides the conversion table between s and σ².
def CDF(type,llambda,s,k,x):
if type == "Logistic":
z= 1/2+ (1/2)*math.tanh((x-k/llambda)/(2*s))
elif type == "Uniform":
z= 1/2 + (x-k/llambda)/(2*s)
if z<=0:
z=0
if z>1:
z=1
elif type == "Cauchy":
z= 1/2 +math.atan((x-k/llambda)/s)/pi;
return(z)
# PB_radial.py
import math
import random
random.seed(100)
s=10
pi=3.14159265358979323846264338
file=open(’PB_radial.txt’,"w")
for h in range(-30,31):
for k in range(-30,31):
ranx=random.random()
rany=random.random()
x=h+2*s*(ranx-1/2)
y=k+2*s*(rany-1/2)
line=str(h)+"\t"+str(k)+"\tCenter\t"+str(x)+"\t"+str(y)+"\n"
file.write(line)
M=int(15*random.random())
for m in range(M):
ran1=random.random()
ran2=random.random()
factor=math.log(ran2/(1-ran2))
x1=x+factor*math.cos(2*pi*ran1);
y1=y+factor*math.sin(2*pi*ran1);
line=str(h)+"\t"+str(k)+"\tLocal\t"+str(x1)+"\t"+str(y1)+"\n"
file.write(line)
file.close()
# PB_NN.py
# lambda = 1
import numpy as np
import math
import random
# PART 1: Initialization
for i in range(Nprocess) :
shiftX.append(random.random())
shiftY.append(random.random())
stretchX.append(1.0)
stretchY.append(1.0)
sstring.append(sep)
# i TABs separating x and y coordinates in output file for points
# originating from process i; Used to easily create a scatterplot in Excel
# with a different color for each process.
sep=sep + "\t"
processID=0
m=0 # number of points generated
height,width = (400, 400)
Part 2 generates a realization of m superimposed stretched shifted Poisson-binomial point processes, called m-
interlacing; m is represented by the variable Nprocess. The index space is limited to (h, k) ∈ {−25, . . . , 25} ×
{−25, . . . , 25}. The points of the process, along with their lattice index (h, k) and the individual process
they belong to (processID), are saved in the output file PB NN.txt. A subset of these points, those with
coordinates in [−20, 20] × [−20, 20], this time taken modulo 2/λ (with λ = 1), are saved in the bitmap array for
further processing as well as in the output file PB NN mod.txt.
The restriction to a subset is to mitigate boundary effects. Taking the modulo allows you to magnify the
patterns in the point distribution, to make statistical inference easier and to make the underlying shift-induced
clustering structure visible to the naked eye. The modulo function is defined as follows: $x \bmod \frac{2}{\lambda} = x - \frac{2}{\lambda}\lfloor x \cdot \frac{\lambda}{2}\rfloor$,
where the brackets represent the integer part function, also called floor function.
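A small sketch of this operation (matching the definition above; not taken verbatim from the textbook code):
import math
def mod_lattice(x, lam=1.0):
    # x modulo 2/lambda, as defined above: x - (2/lambda) * floor(x * lambda / 2)
    m = 2.0/lam
    return x - m*math.floor(x/m)
print(mod_lattice(-3.7), mod_lattice(5.2))   # both results fall in [0, 2/lambda)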
# PART 2: Generate point process, its modulo 2 version; save to bitmap and output files.
for h in range(-25,26):
for k in range(-25,26):
for processID in range(Nprocess):
ranx=random.random()
rany=random.random()
x=shiftX[processID]+stretchX[processID]*h+s*math.log(ranx/(1-ranx))
y=shiftY[processID]+stretchY[processID]*k+s*math.log(rany/(1-rany))
a.append(x) # x coordinate attached to point m
b.append(y) # y coordinate attached to point m
process.append(processID) # processID attached to point m
m=m+1
line=str(processID)+"\t"+str(h)+"\t"+str(k)+"\t"+str(x)+sstring[processID]+str(y)+"\n"
OUT.write(line)
# replace sstring[processID] by \t if you don’t care about Excel
Part 3 detects the nearest neighbor(s) to each point of the process, and computes the nearest neighbor distances.
Only points in [−20, 20] × [−20, 20] are considered, to mitigate boundary effects. The main loop is over all points
of the process. Per convention, variables with the keyword “hash” in their name represent hash tables. The
output file PB NN dist small.txt contains all that is needed to study the distribution of nearest neighbor
distances for model-fitting purposes (see Section 3.4).
The output file PB NN dist full.txt contains more fields, including the points and their nearest neigh-
bor(s); this information is used to compute the connected components in the program PB NN graph.py. Here
the variable m represents the number of points of the process. For each point i,
a[i], b[i] are the X and Y coordinate of point i.
NNidx[i] is a nearest neighbor to point i (usually unique unless s = 0), and NNx[i], NNy[i] are its
X and Y coordinates.
mindist is the distance between point i and its nearest neighbor point NNidx[i].
NNidxHash[i] is the list of points having i as nearest neighbor (separated by the character “˜”)
# PART 3: Find nearest neighbor points, and compute nearest neighbor distances.
if NNflag:
NNx=[]
NNy=[]
NNidx=[]
NNidxHash={}
for i in range(m):
NNx.append(0.0)
NNy.append(0.0)
NNidx.append(-1)
mindist=99999999
flag=-1
if a[i]>-20 and a[i]<20 and b[i]>-20 and b[i]<20:
flag=0;
for j in range(m):
dist=math.sqrt((a[i]-a[j])**2 + (b[i]-b[j])**2) # taxicab distance faster to compute
if dist<=mindist+epsilon and i!=j:
NNx[i]=a[j] # x-coordinate of nearest neighbor of point i
NNy[i]=b[j] # y-coordinate of nearest neighbor of point i
NNidx[i]=j # indicates that point j is nearest neighbor to point i
# NNidxHash[i] is the list of points having point i as nearest neighbor;
# these points are separated by "˜" (usually only one point in NNidxHash[i]
# unless the simulated points are exactly on a lattice, e.g. if s = 0)
if abs(dist-mindist) < epsilon:
NNidxHash[i]=NNidxHash[i]+"˜"+str(j)
else:
NNidxHash[i]=str(j)
mindist=dist
if i % 100 == 0:
print("Finding Nearest neighbors of point",i)
line=str(i)+"\t"+str(mindist)+"\n"
OUT.write(line)
line=str(i)+"\t"+str(NNidx[i])+"\t"+str(NNidxHash[i])+"\t"+str(a[i])+"\t"
line=line+str(b[i])+"\t"+str(NNx[i])+"\t"+str(NNy[i])+"\t"+str(mindist)+"\n"
OUTf.write(line)
OUTf.close()
OUT.close()
Part 4 produces the output file PB r.txt used by PB NN arrows.r to generate Figure 2. This file consists
of the points of the process, with for each point idx: its X and Y coordinates a[idx], b[idx], its nearest
neighbor point NNindex, the X and Y coordinates a[NNindex], b[NNindex] of point NNindex, and the
individual process process[idx] that idx belongs to (for coloring purposes).
# PART 4: Produce data to use in R code that generates the nearest neighbors picture.
if NNflag:
OUT = open("PB_r.txt","w")
OUT.write("idx\tnNN\tNNindex\ta\tb\taNN\tbNN\tprocessID\tNNprocessID\n")
OUT.close()
Part 5 consists of a single call to the function GD Maps in GD util.py (see Section 6.6.2) to produce two
images: one representing the point density of the point process (to identify cluster centers, corresponding to
the darkest gray level), and one representing (by a color) how each future, unobserved point should be classified
based on its X and Y coordinates in the context of supervised clustering. See Figure 17 (original point process),
and Figure 19 after clustering / density estimation.
The X and Y coordinates are taken modulo 2/λ; here λ = 1 (see Part 2 of this source code) and thus
cover the entire, infinite 2-D space. The choice of the modulus (here 2/λ, rather than 1/λ) is dictated by
the granularity of the underlying lattice space. The image is first processed in memory (the bitmap array)
before being saved to PNG files (pb-cluster3.png and pb-density3.png). A high-pass (sharpening)
kernel-based filter is applied nloop times to the entire bitmap image, using a p × p pixel filtering window. For
a large image of fixed size, filtering the entire image once is O(p²) but can be reduced to O(p) with a smarter
implementation (see Exercise 26). The variable window represents p. See Section 3.4 for details.
Part 1 reads the first two columns of PB dist full.txt produced by PB NN.py (see Section 6.4). The
first column represents the index idx of a point, and NNidx[idx] (in the second column) is the index of a
point that has point idx as nearest neighbor.
Then, it creates the undirected graph hash, as follows: if a point with index k is nearest neighbor to a point
with index idx, add point idx to hash[k], and add point k to hash[idx]. Thus hash[idx] contains all
the points (their indices) directly connected to point idx; the points are separated by the character “˜”.
# PB_NN_graph.py
#
# Compute connected components of nearest neighbor graph
# Input file has two tab-separated columns: idx and idx2
# idx is the index of a point, idx2 is the index of a nearest neighbor to idx
# Output file has two fields, for each connected component:
# the list of points it is made up of (separated by ~), and the number of points
# Example.
# Input:
# 100 101
# 100 103
# 101 100
# 101 102
# 103 100
# 103 102
# 102 101
# 102 100
# 102 103
# 102 104
# 104 102
# 106 105
# 105 107
# Output:
# ˜100˜103˜102˜104˜101 5
# ˜106˜105˜107 3
# PART 1: Initialization.
point=[]
NNIdx={}
idxHash={}
n=0
file=open(’PB_dist_full.txt’,"r") # input file
lines=file.readlines()
for aux in lines:
idx =int(aux.split(’\t’)[0])
idx2=int(aux.split(’\t’)[1])
if idx in idxHash:
idxHash[idx]=idxHash[idx]+1
else:
idxHash[idx]=1
point.append(idx)
NNIdx[idx]=idx2
n=n+1
file.close()
hash={}
for i in range(n):
idx=point[i]
if idx in NNIdx:
substring="˜"+str(NNIdx[idx])
string=""
if idx in hash:
string=str(hash[idx])
if substring not in string:
if idx in hash:
hash[idx]=hash[idx]+substring
else:
hash[idx]=substring
substring="˜"+str(idx)
if NNIdx[idx] in hash:
string=hash[NNIdx[idx]]
if substring not in string:
if NNIdx[idx] in hash:
hash[NNIdx[idx]]=hash[NNIdx[idx]]+substring
else:
hash[NNIdx[idx]]=substring
Part 2: Find the connected components. The algorithm is as follows. Browse the list of points. If a point idx
has not yet been assigned to a connected component, create a new connected component cliqueHash[idx]
containing idx; find the points connected to idx, add them to the stack (stack). Find the points connected
to the points connected to idx, and so on recursively, until no more points can be added. Each time a point
is added to cliqueHash, decrease the stack size by one. It takes about 2n steps to find all the connected
components, where n is the number of points. This algorithm does not use recursive functions; it uses a stack
instead, which emulates recursion.
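For readability, here is a compact, self-contained sketch of the same stack-based idea (my own condensed version, using a plain adjacency-list dictionary; the textbook listing follows):
# Compact sketch (not the textbook listing): stack-based traversal to find the
# connected components of an undirected graph stored as adjacency lists
def connected_components(neighbors):
    seen, components = set(), []
    for start in neighbors:
        if start in seen:
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:
            idx = stack.pop()          # take one point off the stack
            comp.append(idx)           # add it to the current connected component
            for idx2 in neighbors.get(idx, []):
                if idx2 not in seen:   # push unvisited neighbors onto the stack
                    seen.add(idx2)
                    stack.append(idx2)
        components.append(comp)
    return components
# With the example graph given in the header comments above, this returns the two
# components {100, 101, 102, 103, 104} (5 points) and {105, 106, 107} (3 points).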
i=0;
status={}
stack={}
onStack={}
cliqueHash={}
while i<n:
nstack=1
if i<n:
idx=point[i]
stack[0]=idx; # initialize the point stack, by adding idx
onStack[idx]=1;
size=1 # size of the stack at any given time
while nstack>0:
idx=stack[nstack-1]
if (idx not in status) or status[idx] != -1:
status[idx]=-1 # idx considered processed
if i<n:
if point[i] in cliqueHash:
cliqueHash[point[i]]=cliqueHash[point[i]]+"˜"+str(idx)
else:
cliqueHash[point[i]]="˜"+str(idx)
nstack=nstack-1
aux=hash[idx].split("˜")
aux.pop(0) # remove first (empty) element of aux
for idx2 in aux:
# loop over all points that have point idx as nearest neighbor
idx2=int(idx2)
if idx2 not in status or status[idx2] != -1:
# add point idx2 on the stack if it is not there yet
if idx2 not in onStack:
stack[nstack]=idx2
nstack=nstack+1
onStack[idx2]=1
Part 3 saves the result to output text file PB cc.txt. Each row corresponds to a connected component. The
first column stores the connected component, as a string of point indices separated by the character “~”. The
second column is the size (number of points) of the connected component in question.
file=open(’PB_cc.txt’,"w")
for clique in cliqueHash:
count=cliqueHash[clique].count(’˜’)
line=cliqueHash[clique]+"\t"+str(count)+"\n"
file.write(line)
file.close()
# install.packages(’Cairo’)
library(’Cairo’);
# CairoWin(6,6);
CairoPNG(filename = "c:/Users/vince/tex/PB-hexa2.png", width = 600, height = 600);
data<-read.table("c:/Users/vince/tex/PB_r.txt",header=TRUE);
a<-data$a; # x coordinate of point of the superimposed/mixture process
b<-data$b; # y coordinate of point of the superimposed/mixture process
aNN<-data$aNN; # x coordinate of nearest neighbor point to (a,b) across all processes
bNN<-data$bNN; # y coordinate of nearest neighbor point to (a,b) across all processes
processID<-data$processID;
plot(a,b,xlim=c(0,5),ylim=c(0,5),pch=20,cex=0,
col=rgb(0,0,0),xlab="",ylab="",axes=TRUE );
arrows(a, b, aNN, bNN, length = 0.10, angle = 10, code = 2,col=rgb(0.7,0.7,0.7));
aa<-data$a[processID == 0];
bb<-data$b[processID == 0];
points(aa,bb,col=rgb(1,0,0),pch=20,cex=1.75);
aa<-data$a[processID == 1];
bb<-data$b[processID == 1];
points(aa,bb,col=rgb(0,0,1),pch=20,cex=1.55);
aa<-data$a[processID == 2];
bb<-data$b[processID == 2];
points(aa,bb,col=rgb(1,0.7,0),pch=20,cex=1.75);
aa<-data$a[processID == 3];
bb<-data$b[processID == 3];
points(aa,bb,col=rgb(0,0,0),pch=20,cex=1.75);
aa<-data$a[processID == 4];
bb<-data$b[processID == 4];
points(aa,bb,col=rgb(0,0.7,0),pch=20,cex=1.75);
dev.off();
Part 1 initializes the color palette for the cluster image. The input parameters of the function GD Maps
are window, the size of the filtering window (see Section 3.4.3), nloop, the number of times the image is fil-
tered, img cluster and img density, the names of the output PNG images, and bitmap, a two-dimensional
400 × 400 array representing the point process in a format suitable for image processing.
Before describing bitmap, let’s quickly summarize the observed data. It consists of a simulation of m
superimposed stretched shifted Poisson-binomial point processes P1 , · · · , Pm , as described in Exercise 18 and
Sections 1.5.3, 1.5.4 and 3.4. An observed point (x, y) = (Xih , Yik ) in the state space is a point such that
(x, y) ∈ Pi , and (h, k) is the index in the index space, with h, k ∈ {−n, . . . , n}. I used n = 25 and m = 5
in PB NN.py, the parent Python script that calls GD Maps. For the simulation, see source code PB NN.py
(Section 6.4, Part 2), or Formulas (8) and (9).
Now I can describe bitmap. Initially, bitmap[pixelX][pixelY]=255, unless there is a point of the
process, say $(x, y) \in P_i$, such that pixelX=⌊200 × (x mod 2/λ)⌋ and pixelY=⌊200 × (y mod 2/λ)⌋. In that case,
bitmap[pixelX][pixelY]=processID, where processID is the variable representing $i - 1$ in the source
code. The brackets represent the floor function (also called integer function).
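A minimal sketch of this point-to-pixel mapping (my own illustration of the formula above, assuming λ = 1 and a 400 × 400 bitmap):
import math
def to_pixel(x, y, lam=1.0, scale=200):
    # map a point (x, y) of the process to a pixel, after taking coordinates modulo 2/lambda
    m = 2.0/lam
    xmod = x - m*math.floor(x/m)
    ymod = y - m*math.floor(y/m)
    return int(scale*xmod), int(scale*ymod)
print(to_pixel(-3.7, 5.2))   # (60, 240): a pixel of the 400 x 400 bitmap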
import math
from PIL import Image, ImageDraw # ImageDraw to draw rectangles etc.
def GD_Maps(method,bitmap,Nprocess,window,nloop,height,width,img_cluster,img_density):
col1=[]
col1.append((255,0,0,255))
col1.append((0,0,255,255))
col1.append((255,179,0,255))
col1.append((0,0,0,255))
col1.append((0,179,0,255))
for i in range(Nprocess,256):
col1.append((255,255,255,255))
oldBitmap = [[255 for k in range(height)] for h in range(width)]
densityMap= [[0.0 for k in range(height)] for h in range(width)]
for pixelX in range(0,width):
for pixelY in range(0,height):
processID=bitmap[pixelX][pixelY]
pix1[pixelX,pixelY]=col1[processID]
draw1.rectangle((0,0,width-1,height-1), outline ="black",width=1)
fname=img_cluster+’.png’
img1.save(fname)
Part 2 performs supervised clustering in bitmap by filtering the entire bitmap nloop times. It also creates
and filters densityMap, another bitmap with the same dimensions, this time for unsupervised clustering,
using a slightly different filter. Both filters take place within the same loop. The contribution $g(u, v)$ of a point
$(u, v)$ in the small moving window (the local filter) is a function of its distance to the center of the window:
$g(u, v) = (1 + u^2 + v^2)^{-1/2}$. Also, for unsupervised clustering, successive applications of the filter to the entire
image are increasingly dampened. The purpose is to get the algorithm to converge to a meaningful solution.
For details, see Section 3.4. Finally, boundary effects are taken care of.
print("loop",loop,"out of",nloop)
for pixelX in range(0,width):
for pixelY in range(0,height):
oldBitmap[pixelX][pixelY]=bitmap[pixelX][pixelY]
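The sketch below (my own condensed illustration, not the textbook listing) shows one filtering pass of the density map over a window × window neighborhood, using the kernel $g(u, v)$ defined above; pixels too close to the border are simply skipped, which is one simple way of handling boundary effects.
# Minimal sketch (illustration only): one pass of the density filter with the
# kernel g(u,v) = (1 + u^2 + v^2)^(-1/2), over a (2*half+1) x (2*half+1) moving window
import math
def filter_density(densityMap, width, height, window=5):
    half = window // 2
    newMap = [[0.0 for k in range(height)] for h in range(width)]
    for px in range(half, width - half):
        for py in range(half, height - half):
            total = 0.0
            for u in range(-half, half + 1):
                for v in range(-half, half + 1):
                    g = 1.0/math.sqrt(1.0 + u*u + v*v)    # contribution of pixel (px+u, py+v)
                    total += g*densityMap[px + u][py + v]
            newMap[px][py] = total
    return newMap
# Usage: repeat densityMap = filter_density(densityMap, 400, 400) nloop times,
# possibly with a dampening factor applied at each pass.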
Part 3 assigns the right color to each pixel of the supervised clustering image and generates the associated PNG
output file. It also generates a highly granular histogram of the density values observed in the unsupervised
clustering image. The histogram, used in Part 4, is stored in the hash table densityCountHash.
pix1[pixelX,pixelY]=col1[topProcessID]
About the cluster image (see right plot in Figure 19): each point in the state space (modulo 2/λ) colored in
red must be assigned to the red cluster, or in other words, classified as red. The same applies to the other colors,
and the assignment mechanism is extremely fast. Each color is attached to one of the individual processes of
the model; each process (its points generated via simulation) can be seen as a particular cluster of a training set
(see right plot in Figure 17). So the code performs a very fast supervised clustering of the entire state space;
the clustering algorithm is represented by the cluster image itself. See Section 3.4 for details.
Part 4 equalizes the density levels in the unsupervised clustering image, then allocates the gray levels in the
palette, and saves the density image (corresponding to unsupervised clustering) as a PNG file.
I manually selected the thresholds in the equalizer algorithm for best visual impact; this needs to be automated.
The result of the equalizer is this: the darkest areas in the image (left plot, Figure 19) correspond to the highest
concentration of points in the state space modulo 2/λ. This is where the mass of each cluster is concentrated.
The cluster centers (darkest in the image) are the estimators of the shift vectors used to build the superim-
posed point process (one shift vector per individual process). The unsupervised clustering is performed on the
observed data shown in the right plot in Figure 17, assuming the colors are not known. Detailed explanations
are in Section 3.4.
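A minimal sketch of such an equalization step (my own illustration; the textbook code uses the histogram stored in densityCountHash together with manually chosen thresholds): rank the distinct density values and map their ranks evenly onto gray levels, darkest for the highest density.
# Minimal sketch (illustration only): rank-based equalization of density values,
# mapping the highest densities to the darkest gray levels
def equalize(densityMap, width, height, n_levels=256):
    values = sorted({densityMap[x][y] for x in range(width) for y in range(height)})
    rank = {v: i for i, v in enumerate(values)}
    gray = {v: int((n_levels - 1)*(1 - rank[v]/max(1, len(values) - 1))) for v in values}
    return [[gray[densityMap[x][y]] for y in range(height)] for x in range(width)]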
# PART 4: Equalize gray levels in the density image; output image as a PNG file
# Also try https://www.geeksforgeeks.org/python-pil-imageops-equalize-method/
densityColorHash={}
col2=[]
size=len(densityCountHash) # number of elements in hash
counter=0
for pixelY in range(0,height):
density=densityMap[pixelX][pixelY]
color=densityColorHash[density]
pix2[pixelX,pixelY]=col2[color]
return()
Conclusion We accomplished the whole purpose: estimating the unknown shift vectors (or cluster centers)
associated with the observations, and inventing a new, very fast clustering technique (supervised or unsupervised)
that can be performed on GPU [Wiki]. See also [29] (available online, here) for a similar use of GPU in the
context of nearest neighbor clustering.
# install.packages(’Cairo’)
library(’Cairo’);
data<-read.table("c:/Users/vince/tex/av_demo_vg2cb.txt",header=TRUE);
k<-data$k;
x<-data$x;
y<-data$y;
x2<-data$x2;
y2<-data$y2;
col<-data$col;
for (n in 1:1000) {
plot(x,y,pch=20,cex=0,col=rgb(0,0,0),xlab="",ylab="",axes=FALSE );
rect(-60, -60, 90, 30, density = NULL, angle = 45,
col = rgb(0,0,0), border = NULL);
# You need to adjust the size of the rectangle to your data
Part 1 creates the training set consisting of 4 groups, via simulation. The variable ProcessID represents
the group label. It is transformed into an image and stored in memory as bitmap (a 2D array), for easy image
processing.
# PB_clustering_video.py
import math
import random
from PIL import Image, ImageDraw # ImageDraw to draw rectangles etc.
import moviepy.video.io.ImageSequenceClip # to produce mp4 video
for i in range(Nprocess) :
shiftX.append(random.random())
shiftY.append(random.random())
processID=0
height,width = (600, 600)
bitmap = [[255 for k in range(height)] for h in range(width)]
for h in range(-25,26):
for k in range(-25,26):
for processID in range(Nprocess):
ranx=random.random()
rany=random.random()
ranID=random.random()
if ranID < 0.20:
processID=0
elif ranID < 0.60:
processID=1
elif ranID < 0.90:
processID=2
else:
processID=3
x=shiftX[processID]+h+s*math.log(ranx/(1-ranx))
y=shiftY[processID]+k+s*math.log(rany/(1-rany))
if x>-3 and x<3 and y>-3 and y<3:
xmod=1+x-int(x) # x modulo 2/lambda
ymod=1+y-int(y) # y modulo 2/lambda
pixelX=int(width*xmod/2)
pixelY=int(height*(2-ymod)/2) # pixel (0,0) at top left corner
bitmap[pixelX][pixelY]=processID
Part 2 generates the first frame img 0.png corresponding to the training set stored in the bitmap array.
img1 = Image.new( mode = "RGBA", size = (width, height), color = (255, 255, 255) )
pix1 = img1.load() # pix[x,y]=col[n] to modify the RGB color of a pixel
draw1 = ImageDraw.Draw(img1,"RGBA")
col1=[]
col1.append((255,0,0,255))
col1.append((0,0,255,255))
col1.append((255,179,0,255))
col1.append((0,179,0,255))
col1.append((0,0,0,255))
for i in range(Nprocess,256):
col1.append((255,255,255,255))
Part 3 filters the bitmap image nloop times, generating the output frames img 1.png, img 2.png and so
on, up to img 251.png.
if topProcessID==255 or loop>50:
r=random.random()
if r<0.25:
x=x+1
if x>width-2:
x=x-(width-2)
elif r<0.5:
x=x-1
if x<1:
x=x+width-2
elif r<0.75:
y=y+1
if y>height-2:
y=y-(height-2)
else:
y=y-1
if y<1:
y=y+height-2
if loop>=50 and oldBitmap[x][y]==255:
x=pixelX
y=pixelY
topProcessID=oldBitmap[x][y]
bitmap[pixelX][pixelY]=topProcessID
pix1[pixelX,pixelY]=col1[topProcessID]
draw1.rectangle((0,0,width-1,height-1), outline ="black",width=1)
fname="img_"+str(loop+1)+’.png’
flist.append(fname)
img1.save(fname)
Glossary
Modulo Operator Sometimes, it is useful to work with point “residues” modulo 1/λ, instead of the
original points, due to the nature of the underlying lattice. It magnifies the patterns
of the point process. By definition, $X_k \bmod \frac{1}{\lambda} = X_k - \frac{1}{\lambda}\lfloor \lambda X_k \rfloor$, where the brackets
represent the integer part function. See pages 36, 38, 40, 63, 76, 78, 84
NN Graph Nearest neighbor graph. The vertices are the points of the process. Two vertices
(the points they represent) are connected if at least one of the two points is nearest
neighbor to the other one. This graph is undirected. See pages 22, 37, 65, 69, 78,
89
Point Count Random variable, denoted as N (B), counting the number of points of the process
in a particular set B, typically an interval [a, b] in one dimension, and a square or
circle in two dimensions. See pages 5, 7, 24, 30, 32, 33, 37, 38, 44, 49, 59, 60, 62,
64, 67, 69, 71
Point Distribution Random variable representing how a point of the process is distributed in a domain
B; for instance, for a stationary Poisson process, points are uniformly distributed
on any compact domain B (say, an interval in one dimension, or a square in two
dimensions). See pages 7, 25, 37, 76
Quantile function Inverse of the cumulative distribution function (CDF) F , denoted as Q. Thus if
P (X < x) = F (x), then P (X < Q(x)) = x. See pages 6, 13, 14, 38, 54, 57, 65
Scaling Factor Core parameter of the Poisson-binomial process. Denoted as s, proportional to the
variance of the distribution F attached to the points of the process. It measures the
level of repulsion among the points (maximum if s = 0, minimum if s = ∞). In d
dimensions, the process is stationary Poisson of intensity $\lambda^d$ if s = ∞, and coincides
with the fixed lattice space if s = 0. See pages 5, 6, 10, 11, 12, 15, 23, 25, 32, 37, 48,
59, 61, 62, 63, 64, 69, 73, 75
Shift vector The lattice attached to a 2-D Poisson-binomial process consists of the vertices $(\frac{h}{\lambda}, \frac{k}{\lambda})$
with $h, k \in \mathbb{Z}$. A shifted process has its lattice translated by a shift vector $(u, v)$.
The new vertices are $(u + \frac{h}{\lambda}, v + \frac{k}{\lambda})$. See pages 11, 36, 38, 40, 41, 63, 75, 84
Standardized Process Poisson-binomial process with intensity λ = 1, scaling factor s = 1, and shifted (if
necessary) so that the lattice space coincides with Z or Z2 . See page 10
State Space Space where the points of the process are located. Here, R or R2 . See also index
space and lattice space. See pages 6, 16, 23, 32, 36, 37, 38, 40, 44, 46, 51, 63, 82, 84
Stationarity Property of a point process: the point distributions in two sets of same shape and
area, are identical. The process is stochastically invariant under translations. See
pages 6, 8, 11, 24, 29, 63
List of Figures
1 Convergence to stationary Poisson point process of intensity λ . . . . . . . . . . . . . . . . . . . . 8
2 Four superimposed Poisson-binomial processes: s = 0 (left), s = 5 (right) . . . . . . . . . . . . . 12
3 Radial cluster process (s = 0.2, λ = 1) with centers in blue; zoom in on the left . . . . . . . . . . 15
4 Radial cluster process (s = 2, λ = 1) with centers in blue; zoom in on the left . . . . . . . . . . . 16
5 Manufactured marble lacking true lattice randomness (left) . . . . . . . . . . . . . . . . . . . . . 16
6 Locally random permutation σ; τ (k) is the index of Xk ’s closest neighbor to the right . . . . . . 17
7 Chaotic function (bottom), and its transform (top) showing the global minimum . . . . . . . . . 18
8 Orbit of η in the complex plane (left), perturbed by a Poisson-binomial process (right) . . . . . . 21
9 Data animations – click on a picture to start a video . . . . . . . . . . . . . . . . . . . . . . . . . 22
10 Minimum contrast estimation for (λ, s) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
11 Confidence region for (p, q) – Hotelling’s quantile function on the right . . . . . . . . . . . . . . . 27
12 Period and amplitude of ϕτ (t); here τ = 1, λ = 1.4, s = 0.3 . . . . . . . . . . . . . . . . . . . . . . 29
13 Bias reduction technique to minimize boundary effects . . . . . . . . . . . . . . . . . . . . . . . . 29
14 A new test of independence (R-squared version) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
15 Radial cluster process (s = 0.5, λ = 1) with centers in blue; zoom in on the left . . . . . . . . . . 35
16 Radial cluster process (s = 1, λ = 1) with centers in blue; zoom in on the left . . . . . . . . . . . 35
17 Realization of a 5-interlacing with s = 0.15 and λ = 1: original (left), modulo 2/λ (right) . . . . 36
18 Rayleigh test to assess if a point distribution matches that of a Poisson process . . . . . . . . . . 39
19 Unsupervised (left) versus supervised clustering (right) of Figure 17 . . . . . . . . . . . . . . . . 39
20 Elbow rule (right) finds m = 3 clusters in Brownian motion (left) . . . . . . . . . . . . . . . . . . 43
21 Elbow rule (right) finds m = 8 or m = 11 “jumps” in left plot . . . . . . . . . . . . . . . . . . . . 43
22 Each arrow links a point (blue) to its lattice index (red): s = 0.2 (left), s = 1 (right) . . . . . . . 46
23 Distance between a point and its lattice location (s = 1) . . . . . . . . . . . . . . . . . . . . . . . 47
24 Chaotic convergence of partial sums in Formula (19) . . . . . . . . . . . . . . . . . . . . . . . . . 66
References
[1] Noga Alon and Joel H. Spencer. The Probabilistic Method. Wiley, fourth edition, 2016. 64
[2] José M. Amigó, Roberto Dale, and Piergiulio Tempesta. A generalized permutation entropy for random
processes. Preprint, pages 1–9, 2012. arXiv:2003.13728. 17
[3] Luc Anselin. Point Pattern Analysis: Nearest Neighbor Statistics. The Center for Spatial Data Science,
University of Chicago, 2016. Slide presentation. 13
[4] Adrian Baddeley. Spatial point processes and their applications. In Weil W., editor, Stochastic Geometry.
Lecture Notes in Mathematics, pages 1–75. Springer, Berlin, 2007. 13
[5] Adrian Baddeley and Richard D. Gill. Kaplan-Meier estimators of distance distributions for spatial point
processes. Annals of Statistics, 25(1):263–292, 1997. 44
[6] David Bailey, Jonathan Borwein, and Neil Calkin. Experimental Mathematics in Action. A K Peters, 2007.
17
[7] N. Balakrishnan and C.R. Rao (Editors). Order Statistics: Theory and Methods. North-Holland, 1998. 47,
53, 64
[8] B. Bollobas and P. Erdös. Cliques in random graphs. Mathematical Proceedings of the Cambridge Philo-
sophical Society, 80(3):419–427, 1976. 64
[9] Miklos Bona. Combinatorics of Permutations. Routledge, second edition, 2012. 17
[10] Jonathan Borwein and David Bailey. Mathematics by Experiment. A K Peters, 2008. 17
[11] Bartlomiej Blaszczyszyn and Dhandapani Yogeshwaran. Clustering and percolation of point processes.
Preprint, pages 1–20, 2013. Project Euclid. 13
[12] Bartlomiej Blaszczyszyn and Dhandapani Yogeshwaran. On comparison of clustering properties of point
processes. Preprint, pages 1–26, 2013. arXiv:1111.6017. 13
[13] Bartlomiej Blaszczyszyn and Dhandapani Yogeshwaran. Clustering comparison of point processes with
applications to random geometric models. Preprint, pages 1–44, 2014. arXiv:1212.5285. 13
[14] Oliver Chikumbo and Vincent Granville. Optimal clustering and cluster identity in understanding high-
dimensional data spaces with tightly distributed points. Machine Learning and Knowledge Extraction,
1(2):715–744, 2019. 43
[15] Yves Coudène. Ergodic Theory and Dynamical Systems. Springer, 2016. 9
[16] Noel Cressie. Statistics for Spatial Data. Wiley, revised edition, 2015. 13
[17] H.A. David and H.N. Nagaraja. Order Statistics. Wiley, third edition, 2003. 53
[18] Tilman M. Davies and Martin L. Hazelton. Assessing minimum contrast parameter estimation for spatial
and spatiotemporal log-Gaussian Cox processes. Statistica Neerlandica, 67(4):355–389, 2013. 25
[19] Robert Devaney. An Introduction to Chaotic Dynamical Systems. Chapman and Hall/CRC, third edition,
2021. 9
[20] D.J.Daley and D. Vere-Jones. An Introduction to the Theory of Point Processes – Volume I: Elementary
Theory and Methods. Springer, second edition, 2013. 13
[21] D.J.Daley and D. Vere-Jones. An Introduction to the Theory of Point Processes – Volume II: General
Theory and Structure. Springer, second edition, 2014. 13
[22] David Coupier (Editor). Stochastic Geometry: Modern Research Frontiers. Wiley, 2019. 62
[23] Ding-Geng Chen (Editor), Jianguo Sun (Editor), and Karl E. Peace (Editor). Interval-Censored Time-to-
Event Data: Methods and Applications. Chapman and Hall/CRC, 2012. 11
[24] Bradley Efron. Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7(1):1–26, 1979.
24
[25] Paul Erdős and Alfréd Rényi. On the evolution of random graphs. In Publication of the Mathematical
Institute of the Hungarian Academy of Sciences, volume 5, pages 17–61, 1960. 64
[26] W. Feller. On the Kolmogorov-Smirnov limit theorems for empirical distributions. Annals of Mathematical
Statistics, 19(2):177–189, 1948. 39, 64
[27] Peter J. Forrester and Anthony Mays. Finite size corrections in random matrix theory and Odlyzko’s data
set for the Riemann zeros. Proceedings of the Royal Society A, 471:1–21, 2015. arXiv:1506.06531. 22
[28] Guilherme França and André LeClair. Statistical and other properties of Riemann zeros based on an
explicit equation for the n-th zero on the critical line. Preprint, pages 1–26, 2014. arXiv:1307.8395. 22
[29] Vincent Garcia, Eric Debreuve, and Michel Barlaud. Fast k nearest neighbor search using GPU. In IEEE
Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Anchorage, AK,
2008. 40, 85
[30] Minas Gjoka, Emily Smith, and Carter Butts. Estimating clique composition and size distributions from
sampled network data. Preprint, pages 1–9, 2013. arXiv:1308.3297. 64
[31] B.V. Gnedenko and A. N. Kolmogorov. Limit Distributions for Sums of Independent Random Variables.
Addison-Wesley, 1954. 42
[32] Michel Goemans and Jan Vondrák. Stochastic covering and adaptivity. In Proceedings of the 7th Latin
American Theoretical Informatics Symposium, pages 532–543, Valdivia, Chile, 2006. 62
[33] M. Golzy, M. Markatou, and Arti Shivram. Algorithms for clustering on the sphere: Advances & applica-
tions. In Proceedings of the World Congress on Engineering and Computer Science, volume 1, pages 1–6,
San Francisco, USA, 2016. 61
[34] R. Goodman. Introduction to Stochastic Models. Dover, second edition, 2006. 8
[35] Vincent Granville. Estimation of the intensity of a Poisson point process by means of nearest neighbor
distances. Statistica Neerlandica, 52(2):112–124, 1998. 14
[36] Vincent Granville. Applied Stochastic Processes, Chaos Modeling, and Probabilistic Properties of Numera-
tion Systems. Data Science Central, 2018. 9, 17, 42
[37] Vincent Granville. Statistics: New Foundations, Toolbox, and Machine Learning Recipes. Data Science
Central, 2019. 25, 28, 39
[38] Vincent Granville, Mirko Krivanek, and Jean-Paul Rasson. Simulated annealing: A proof of convergence.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 16:652–656, 1996. 40
[39] Peter Hall. Introduction to the theory of coverage processes. Wiley, 1988. 62
[40] K. Hartmann, J. Krois, and B. Waske. Statistics and Geospatial Data Analysis. Freie Universität Berlin,
2018. E-Learning Project SOGA. 31
[41] Jane Hawkins. Ergodic Dynamics: From Basic Theory to Applications. Springer, 2021. 9
[42] Nicholas J. Higham. Accuracy and Stability of Numerical Algorithms. Society for Industrial and Applied
Mathematics, 2002. 57
[43] Zhiqiu Hu and Rong-Cai Yang. A new distribution-free approach to constructing the confidence region for
multiple parameters. PLOS One, 8(12), 2013. 28
[44] Aleksandar Ivić. The Riemann’s Zeta Function: Theory and Applications. Dover, reprint edition, 2003. 22
[45] Timothy D. Johnson. Introduction to spatial point processes. Preprint, 2008. NeuroImaging Statistics
Oxford (NISOx) group. 13
[46] Richard Kershner. The number of circles covering a set. American Journal of Mathematics, 61(2):665–671,
1939. 62
[47] Michael A. Klatt, Jaeuk Kim, and Salvatore Torquato. Cloaking the underlying long-range order of ran-
domly perturbed lattices. Physical Review Series E, 101(3):1–10, 2020. 53
[48] Denis Kojevnikov, Vadim Marmer, and Kyungchul Song. Limit theorems for network dependent random
variables. Journal of Econometrics, 222(2):419–427, 2021. 13
[49] Samuel Kotz, Tomasz Kozubowski, and Krzystof Podgorski. The Laplace Distribution and Generalizations:
A Revisit with Applications to Communications, Economics, Engineering, and Finance. Springer, 2001. 58
[50] K. Krishnamoorthy. Handbook of Statistical Distributions with Applications. Routledge, second edition,
2015. 73
[51] Faraj Lagum. Stochastic Geometry-Based Tools for Spatial Modeling and Planning of Future Cellular
Networks. PhD thesis, Carleton University, 2018. 13
[52] Günther Last and Mathew Penrose. Lectures on the Poisson Process. Cambridge University Press, 2017.
13
[53] André LeClair. Riemann hypothesis and random walks: The zeta case. Symmetry, 13:1–13, 2021. 22
[54] G. Last M.A. Klatt and D. Yogeshwaran. Hyperuniform and rigid stable matchings. Random Structures
and Algorithms, 2:439–473, 2020. 13
[55] J. Mateu, C. Comas, and M.A. Calduch. Testing for spatial stationarity in point patterns. In International
Workshop on Spatio-Temporal Modeling, 2010. 9
[56] Jorge Mateu, Frederic P Schoenberg, and David M Diez. On distances between point patterns and their
applications. Preprint, pages 1–29, 2010. 13
[57] Natarajan Meghanathan. Distribution of maximal clique size of the vertices for theoretical small-world
networks and real-world networks. Preprint, pages 1–20, 2015. arXiv:1508.01668. 64
[58] Jesper Møller. Introduction to spatial point processes and simulation-based inference. In International
Center for Pure and Applied Mathematics (Lecture Notes), Lomé, Togo, 2018. 13, 25, 33
[59] Jesper Møller and Frederic Paik Schoenberg. Thinning spatial point processes into Poisson processes.
Random Structures and Algorithms, 42:347–358, 2010. 10
[60] Jesper Møller and Rasmus P. Waagepetersen. An Introduction to Simulation-Based Inference for Spatial
Point Processes. Springer, 2003. 13
[61] Jesper Møller and Rasmus P. Waagepetersen. Statistical Inference and Simulation for Spatial Point Pro-
cesses. CRC Press, 2007. 13
[62] S. Ghosh, N. Miyoshi, and T. Shirai. Disordered complex networks: energy optimal lattices and persistent
homology. Preprint, pages 1–44, 2020. arXiv:2009.08811. 5
[63] Saralees Nadarajah. A modified Bessel distribution of the second kind. Statistica, 67(4):405–413, 2007. 58
[64] Melvyn B. Nathanson. Additive Number Theory: The Classical Bases. Springer, reprint edition, 2010. 63
[65] D Noviyanti and H P Lestari. The study of circumsphere and insphere of a regular polyhedron. Journal
of Physics: Conference Series, 1581:1–10, 2020. 61
[66] Yosihiko Ogata. Cluster analysis of spatial point patterns: posterior distribution of parents inferred from
offspring. Japanese Journal of Statistics and Data Science, 3:367–390, 2020. 13
[67] Vamsi Paruchuri, Arjan Durresi, and Raj Jain. Optimized flooding protocol for ad hoc networks. Preprint,
pages 1–10, 2003. arXiv:cs/0311013v1. 62
[68] Yuval Peres and Allan Sly. Rigidity and tolerance for perturbed lattices. Preprint, pages 1–20, 2020.
arXiv:1409.4490. 5, 13
[69] Brian Ripley. Stochastic Simulation. Wiley, 1987. 73
[70] Peter Shirley and Chris Wyman. Generating stratified random lines in a square. Journal of Computer
Graphics Techniques, 6(2):48–54, 2017. 61
[71] Karl Sigman. Notes on the Poisson process. New York NY, 2009. IEOR 6711: Columbia University course.
9, 13
[72] Luuk Spreeuwers. Image Filtering with Neural Networks: Applications and Performance Evaluation. PhD
thesis, University of Twente, 1992. 40
[73] J. Michael Steele. Le Cam’s inequality and Poisson approximations. The American Mathematical Monthly,
101(1):48–54, 1994. 19, 52
[74] Dietrich Stoyan, Wilfrid S. Kendall, Sung Nok Chiu, and Joseph Mecke. Stochastic Geometry and Its
Applications. Wiley, 2013. 62
[75] Anna Talgat, Mustafa A. Kishk, and Mohamed-Slim Alouini. Nearest neighbor and contact distance
distribution for binomial point process on spherical surfaces. IEEE Communications Letters, 24(12):2659–
2663, 2020. 61
[76] Gerald Tenenbaum. Introduction to Analytic and Probabilistic Number Theory. American Mathematical
Society, third edition, 2015. 17
[77] Remco van der Hofstad. Random Graphs and Complex Networks. Cambridge University Press, 2016. 64
[78] Robert Williams. The Geometrical Foundation of Natural Structure: A Source Book of Design. Dover,
1979. 62
[79] Oren Yakir. Recovering the lattice from its random perturbations. Preprint, pages 1–18, 2020.
arXiv:2002.01508. 13, 53
[80] Ruqiang Yan, Yongbin Liu, and Robert Gao. Permutation entropy: A nonlinear statistical measure for
status characterization of rotary machines. Mechanical Systems and Signal Processing, 29:474–484, 2012.
17
[81] D. Yogeshwaran. Geometry and topology of the boolean model on a stationary point processes : A brief
survey. Preprint, pages 1–13, 2018. Researchgate. 13
[82] Tonglin Zhang. A Kolmogorov-Smirnov type test for independence between marks and points of marked
point processes. Electronic Journal of Statistics, 8(2):2557–2584, 2014. 30
Index
m-interlacing, 11, 24, 34–38, 40, 61, 63, 69, 75, 76 Fréchet, 23, 42
m-mixture, 35–38, 60, 61, 63 Gaussian, 67
generalized logistic, 5, 13–15, 33, 56, 65
anisotropy, 10, 17, 37 half-logistic, 15
attraction (point process), 7, 16 Hotelling, 26
attractor (distribution), 38, 42, 47, 64 Laplace, 54, 55, 58, 65
location-scale, 6, 10
Berry-Esseen theorem, 67 logistic, 11, 14, 73
Bessel function, 58 Lévy, 42
Beta function, 15 metalog, 15
bias, 44 modified Bessel, 58
binomial distribution, 7, 38, 60 Poisson, 18
boundary effect, 10–12, 17, 24, 25, 27, 29, 30, 32, 34, Poisson-binomial, 5, 7, 13, 18, 47
37, 38, 44, 46, 52, 60, 61, 64, 67, 76, 83 Rayleigh, 38, 47, 61
Brownian motion, 23, 41 stable distribution, 58
triangular, 57
Cauchy distribution, 42, 51
truncated, 51, 57
censored data, 11, 44, 60
uniform, 53, 73
central limit theorem, 26, 42, 52, 56
Weibull, 23, 38, 42, 47
multivariate, 67
domain of attraction, 64
chaotic convergence, 21, 65
dual confidence region, 24, 27, 68
characteristic function, 58, 67
dynamical systems, 9, 23, 43, 48, 64
chi-squared distribution, 67
child process, 13, 74 edge (graph theory), 63
clique (graph theory), 64 edge effect (statistics), 11, 44
cluster process, 11, 13, 36, 37 elbow rule, 11, 30, 36, 38, 41
on the sphere, 61 empirical distribution, 8, 17, 24, 31, 33, 38, 46, 48, 50,
clustering, 40 54, 64
fractal clustering, 24, 70 entropy, 17
fuzzy, 24 ergodicity, 9, 32, 33, 37, 48, 59
GPU-based, 24, 36 extreme values, 42, 46
supervised, 23, 36
unsupervised, 23, 36 filtering (image processing), 22–24, 40
Cochran’s theorem, 67 fixed point algorithm, 48
confidence band, 39 Fourier transform, 58
confidence interval, 32, 37 fractal clustering, 22, 24, 70
confidence level, 27 fractal dimension, 43
confidence region, 27, 32, 66, 68 Fréchet distribution, 23, 42
dual region, 24, 27, 68
connected components, 12, 37, 38, 48, 61, 63, 65, 69, Gamma function, 42
77, 78 Gaussian distribution, 67
contour line, 26 multivariate, 67
convergence acceleration, 65 GPU-based clustering, 23, 24, 36, 40
convolution of distributions, 51, 57, 58 graph, 12, 64
counting measure, 7 connected components, 12, 37, 63, 69, 77, 78
covariance matrix, 67 edge, 12
covering (stochastic), 62 nearest neighbor graph, 22, 37, 65, 69, 78
cross-validation, 28 node, 12, 64
path, 12
data animation, 24 random graph, 64
degrees of freedom, 67 random nearest neighbor graph, 64
density estimation, 14 undirected, 12, 37, 38, 63–65, 69, 78
deviate, 73 vertex, 12
Dirichlet eta function, 18, 20, 22, 43, 66, 69 graph theory, 12, 63
distribution grid, 5, 6
binomial, 7, 38, 60
Cauchy, 42, 51, 52, 73 hash table, 16, 69, 77
chi-squared, 67 hexagonal lattice, 13
empirical, 54 hidden model, 13, 16, 33, 46, 52
exponential-binomial, 5, 49 high precision computing, 28
histogram equalization, 40 order statistics, 46, 53
homogeneity, 9, 11, 14, 38, 61 outliers, 46
Hotelling distribution, 26, 67 overfitting, 33
random permutation, 17
random walk, 42
Rayleigh distribution, 38, 47, 61
Rayleigh test, 38
records, 46
renewal process, 9, 13
repulsion (point process), 5, 7, 15, 39
resampling, 28, 39
Riemann hypothesis, 21, 22
Riemann zeta function, 13, 18, 20, 43
sample size, 27
scaling factor, 5, 6, 11, 12, 15, 23, 25, 32, 37, 44, 46,
48, 59, 61–64, 69, 73, 75
shift vector, 11, 36, 38, 40, 41, 63, 75, 84
shifted process, 10, 40, 41, 63
simulation, 27, 48
spatial process, 36
spatial statistics, 13
stable distribution, 42, 52, 58
standardized arrival times, 58
standardized point process, 10, 38
state space, 6, 16, 23, 32, 36–38, 40, 44, 46, 51, 82, 84
stationarity, 6, 8, 11, 24, 29, 59, 63
stochastic convergence, 24
stochastic geometry, 61, 62
stochastic residues, 36
stretching (point process), 10, 12, 38, 75
superimposition (point processes), 11, 36
symbolic math, 48
tessellation, 62
thinning (point process), 10
tiling (spatial processes), 63
training set, 23
transcendental number, 48
truncated distribution, 51, 57