GSSSources.bib
Geodesic slice sampling on Riemannian manifolds
Abstract
We propose a theoretically justified and practically applicable slice sampling based Markov chain Monte Carlo (MCMC) method for approximate sampling from probability measures on Riemannian manifolds. The latter naturally arise as posterior distributions in Bayesian inference of matrix-valued parameters, for example belonging to either the Stiefel or the Grassmann manifold. Our method, called geodesic slice sampling, is reversible with respect to the distribution of interest, and generalizes Hit-and-run slice sampling on to Riemannian manifolds by using geodesics instead of straight lines. We demonstrate the robustness of our sampler’s performance compared to other MCMC methods dealing with manifold valued distributions through extensive numerical experiments, on both synthetic and real data. In particular, we illustrate its remarkable ability to cope with anisotropic target densities, without using gradient information and preconditioning.
1 Introduction
In statistical models for real world phenomena it is natural to incorporate geometric knowledge in terms of manifolds into state- or parameter-spaces. This allows to better capture dependencies, to easily embed additional constraints and to include expert/data guidance. Extracting information with Bayesian inference from such models requires the ability to sample (at least approximately) from the usually highly intractable posterior distribution on the manifold. However, due to the involved geometric features, standard approaches are not directly applicable. This highlights the importance of developing efficient sampling techniques specifically designed for manifolds.
In this paper we tackle this problem and consider a target measure on a general Riemannian manifold that has a density with respect to the Riemannian measure of the manifold, i.e., that is of the form
(1) |
where is integrable with respect to the Riemannian measure . Such frameworks are encountered in various applications, e.g., brain connectivity network analysis [mantoux2021understanding], dimensionality reduction [holbrook2016bayesian], computer vision [lui2012advances], texture analysis [kunze2004bingham] and protein conformation modeling [hamelryck2006sampling]. In most instances, the Riemannian manifolds that arise are matrix manifolds, such as the Stiefel manifold or the Grassmann manifold [edelman1998geometry].
One popular way to approximately sample from intractable distributions are Markov chain Monte Carlo (MCMC) methods. We follow this approach and propose a practical MCMC algorithm based on slice sampling techniques. This method that we refer to as geodesic slice sampler (GSS) incorporates the geometry of the underlying Riemannian manifold by using the geodesics. We briefly describe the transition mechanism of GSS. It crucially exploits the fact that on a Riemannian manifold for every pair , where and is an element of the tangent space at , there exists a unique geodesic emanating from in direction .
More precisely, given the current state , a single transition of the GSS targeting the distribution proceeds in three steps. First, a level is uniformly sampled from the interval . Second, a geodesic that passes through is randomly chosen. Lastly, a point is generated from the intersection of the geodesic and the level set . This final step presents the most significant challenge and requires special care to ensure invariance of the target distribution. To this end, we carefully adapt Neal’s stepping-out and shrinkage procedure (detailed in [neal2003slice]) to our manifold setting.
Metropolis-Hastings, Hamiltonian Monte Carlo and Langevin-type algorithms have already been adapted to the Riemannian manifold setting. Moreover, there exist several tailor-made algorithms for specific manifolds. Consult Section 2.3 for a literature review of MCMC-methods on Riemannian manifolds. However, to the best of our knowledge GSS represents the first practical slice sampling-based MCMC method applicable to general Riemannian manifolds. Following a slice sampling paradigm is appealing because, by design, the length of the transition step is fitly chosen for each transition. This is especially advantageous when efficiently exploring the target distribution requires varying the length of transition steps based on the position and direction of the move, e.g., due to anisotropy. It is worth noting that some slice sampling algorithms achieve this without the need for any tunable hyperparameters. However, a practical implementation usually introduces additional tunable parameters. Nonetheless, it is expected that the resulting slice sampler algorithms’ performance will be more robust with regard to the choice of these parameters compared to, for example, the sensitivity of a Metropolis-Hastings or Hamiltonian Monte Carlo algorithm to the selection of step size. We refer to [neal2003slice] for some more details on these properties of slice sampling; see also [murray2010elliptical] for a further discussion of advantageous and limitations of a slice sampling approach.
To conclude this introduction, our main contributions can be summarized as follows:
-
•
We propose a practical slice sampling based MCMC-method, which we call geodesic slice sampling, to target distributions of the form (1). It combines a geodesical Hit-and-run algorithm with a 1-dimensional slice sampler arriving at a method that generalizes Hit-and-run slice sampling to Riemannian manifolds.
-
•
We demonstrate the applicability of GSS in numerical experiments and evaluate its strong suits as well as its drawbacks. Its main feature being its capacity to be quite robust to the geometry of the target, without using gradient information and with an easy tuning of parameters using the diameter of the manifold.
-
•
We verify that GSS has the correct invariant distribution by showing reversibility with respect to .
The structure of the paper is as follows. First we provide an introduction to slice sampling on in Section 2.1, before turning to GSS on general Riemannian manifolds in Section 2.2. This is followed by a literature review of MCMC-methods on Riemannian manifolds in Section 2.3. Readers that wish more details on the differential geometry background used in Section 2 may find it in Appendix C. Section 3 is devoted to numerical experiments. The proof of the reversibility of GSS with respect to is given in Section 4. A formal treatment of the stepping-out and shrinkage procedure is also included in this section.
1.1 General notation
We introduce some general notation that is valid throughout the whole paper. Let be the set of strictly positive integers and call . We denote by the -dimensional Lebesgue measure on and by the Borel--algebra. Similar, for a set we write for the trace Borel--algebra on . Moreover, we set to be the -dimensional Euclidean unit sphere. For , let be its transpose, and write for the identity matrix. If is a finite set or satisfies , denote the discrete, respectively continuous, uniform distribution on as . Whenever we introduce random variables in the sequel, we assume them to be defined on some rich enough probability space . Let be a measurable space and let be an -valued random variable. Then we denote by
the distribution of . If for some probability measure , then we also write . Furthermore, let be a possibly different set and let be a map from to . For some set , we denote by the restriction of to . Now equip also with a -algebra and turn it into the measurable space . Additionally assume to be measurable and let be a measure on . We call
the push forward measure of under , and for we call
the restriction of to . To emphasize that a union is disjoint we write . Finally, for simplicity we also use and to denote the minimum respectively the maximum between two real numbers, i.e., and for .
2 Methodology: Geodesic slice sampling
2.1 Slice sampling on
In this section we revisit slice sampling on in order to provide a general introduction to the ideas of (uniform simple) slice sampling. The slice sampler, as all MCMC-methods, defines a Markov chain which, for a given measurable unnormalized density that satisfies , can be used to approximately sample from the probability measure
At its heart are the (super) level sets of defined by
which contain all points that have a function value with respect to that is greater than a specified value. The uniform simple slice sampler, which we also call idealized slice sampler, generates approximate samples from by drawing suitably from these level sets.
We denote by the Markov chain that corresponds to the idealized slice sampler. A transition from step to step works as follows:
-
1.
Draw a random level , call the result . This specifies a level set .
-
2.
Draw uniformly from this level set .
Consequently, for any this defines the conditional distributions
A graphic representation of the transition mechanism for can be found in Figure 1.
We can also describe the idealized slice sampler through its transition kernel given by
Since slice sampling was brought to the attention of the statistics community in [besag1993spatial], the properties of idealized slice sampling, including reversibility with respect to and conditions for ergodicity, have been investigated in several works, e.g., [mira2002efficiency, natarovskii2021quantitative, roberts2002convergence, rudolf2013positivity, rudolf2018comparison]. Exemplary, the following result illustrates a major advantage of slice sampling: In [natarovskii2021quantitative, Corollary 3.7], Natarovskii et. al. show that the spectral gap of only depends on the level set function , i.e., on the volume of the level sets, not their shape. This means that the performance of the idealized slice sampler is ignorant of the introduction of, e.g., multimodality, local modes or anisotropy as long as the volume of the level sets is not modified.
Unfortunately, each transition of the idealized slice sampler requires to sample from the uniform distribution , , of a level set. They are -dimensional, measurable sets and in general there is no efficient algorithm to tackle this problem. This is a major prevention for the implementation of the idealized slice sampler. One modification strategy to obtain a practical algorithm is called hybrid slice sampling, see [latuszyinski2014convergence]. Here, the uniform distribution on the level sets is replaced by a family of kernels such that for any (where it is well-defined) is invariant for . This leads to a transition kernel of the form
In order to get a better understanding of this strategy, we exemplary consider the case introduced in [neal2003slice]. There a stepping-out and a shrinkage procedure is proposed. In the following we aim to provide a basic understanding of the concepts behind these two schemes. To this end, we give a verbal description, and a visualization in Figure 3. The stepping-out procedure is treated in more detail in Section 4.1. This includes pseudocode and a careful definition of the generated distributions. For an extensive treatment of the shrinkage procedure, we refer to [ReversibilityEllipticalSliceSampler].
The stepping-out and shrinkage based hybrid slice sampler takes a point (current state of Markov chain) and a level set (level generated as in the first step of the transition mechanism of the idealized slice sampler), and proceeds in two steps.
- 1. Stepping-out:
-
Under the specification of two hyper parameters and , the stepping-out procedure chooses a random segment of containing . To this end, an interval of length is placed randomly around by sampling the left interval boundary point and setting the right interval boundary to . Then this interval is extended iteratively to the left by intervals of length until for the first time the left boundary leaves the level set , or the maximal number of stepping-out steps to the left is reached. Here is obtained by randomly splitting , the maximal number of total stepping-out steps, into two summands and . Similarly the interval is extended iteratively to the right by intervals of length until the right boundary hits for the first time, or steps have been performed. This provides a randomly generated interval , where
with
- 2. Shrinkage:
-
We generate a point from with the shrinkage procedure. Roughly speaking, it is an adaptive acceptance/rejection scheme that shrinks the proposal area with each rejection.111Note that we describe here a scheme that differs slightly from the original one in [neal2003slice] and rather resembles the one of the elliptical slice sampler, see [murray2010elliptical]. Crucially, we view here the interval as a circle, i.e., if one \ldqleaves\rdq the interval at the right boundary, it is immediately \ldqreentered\rdq at the left boundary. Observe that a set (viewed as a circle) is divided by two points into two segments, namely and . If contains the initial point , we set
to be the segment containing . The shrinkage procedure now builds a sequence of such segments. First we sample , and set . For , we let be a random variable with conditional distribution
that is, we draw the next proposal uniformly from the current segment. Observe that divides into two segments as described above. Set for , i.e., we keep the segment of that contains the initial point . This is continued until we generate a proposal that lies in . Hence, overall the shrinkage procedure yields a random point where . If the stepping-out and shrinkage scheme is embedded into a 1-dimensional hybrid slice sampler this point is then the next random variable of the chain.
We comment on the hyperparameters of the stepping-out procedure.
Remark 1.
The output of the stepping-out procedure can be viewed as a \ldqloose\rdq approximation of the set . The maximal possible length of this approximation is given by , but also the individual choice of and affects the quality of the approximation depending on the shape of . Choosing larger and smaller can lead to an interval that lies \ldqtighter\rdq around . However, if has \ldqholes\rdq, it is also more likely that parts of are \ldqcut off\rdq. Moreover, increasing increases the computational cost of the stepping-out procedure.
Now we have a practical slice sampling algorithm on . One way to lift the stepping-out and shrinkage approach to is to combine it with the Hit-and-run algorithm arriving at something called the Hit-and-run slice sampler222Hit-and-run slice sampling is already mentioned in [Mackay, Section 29.7]. Convergence and comparison results for this sampler are provided in [latuszyinski2014convergence, rudolf2018comparison], and it is used as a benchmark approach in [murray2010elliptical, schaer2023gibbsian]., which essentially samples a random line though the current point and then runs a slice sampler on this line. For and we define
to be the line through in direction , and for
to be the parameterized intersection of a straight line with a level set. The transition mechanism of the Hit-and-run slice sampler then proceeds as depicted in Figure 4.
For a complete algorithmic description see Algorithm 1 with being the uniform distribution on for all . Equivalently, the transition mechanism can be described as first sampling a direction uniformly from , and then running a 1-dimensional stepping-out and shrinkage based slice sampler for the unnormalized density with initial point 0.
Our interest into this framework arises from the fact that, by generalizing straight lines to geodesics, it can be leveraged to general Riemannian manifolds, which we discuss in the next section.
2.2 Geodesic slice sampling
We now turn to slice sampling on Riemannian manifolds. More precise, in the following we consider sampling from measures defined on a state space satisfying the following assumption:
Assumption A.
Let be a -dimensional, smooth, connected manifold. In addition we assume that is endowed with a Riemannian metric and is complete.
(For the sake of brevity, we keep the introduction of objects from differential and Riemannian geometry to a bare minimum here. References and details on certain aspects can be found in Appendix C.) First, we need to equip with a suitable reference measure. Under Assumption A there exists an atlas consisting of homeomorphisms such that for all the map is infinitely often continuously differentiable. We denote the tangent space to at as . The Riemannian metric is a smooth field of symmetric, positive definite covariant 2-tensors. Let be the Borel--algebra induced by the topology of . The measure on is defined for any as
(2) |
where is a partition of unity subordinate to , and for
with being the coordinate frames associated to . We call the Riemannian measure induced by . It can be viewed as an extension of the Lebesgue measure to Riemannian manifolds, see e.g. [AnalysisIII, Section XII.1]. We provide some examples for manifolds satisfying Assumption A.
Example 2.
-
1.
The most simple example for a -dimensional, smooth, connected manifold is . Since for each the tangent space to at is again , we can equip it with the Riemannian metric
rendering complete. The induced Riemannian measure is the Lebesgue measure.
-
2.
We equip the Euclidean unit sphere with the standard Riemannian metric induced by its embedding in , that is,
where is the map on the tangent spaces induced by . Then satisfies Assumption A. The corresponding Riemannian measure is the standard volume measure.
-
3.
Let with . The -dimensional, smooth, connected Stiefel manifold
consists of (ordered) -tuples of vectors in that from an orthonormal system. This means that each point on the Stiefel manifold describes (not uniquely) a -dimensional subspace of . To characterize the tangent space to a point , we need a matrix such that the columns of and form an orthonormal basis of . Then we have
If we introduce the Riemannian metric
where and Tr denotes the trace of a matrix, Assumption A holds true for . For more details see [edelman1998geometry].
The need to sample from measures defined on connected, complete Riemannian manifolds occurs in several applications in Bayesian statistics, see e.g. [holbrook2016bayesian, holbrook2020nonparametric, lieAccepteddimension, mantoux2021understanding].
We assume in this paper that the measure on admits a density with respect to the Riemannian measure , i.e.,
(3) |
Very much in parallel to the (special) case , we are now able to develop a slice sampling approach to target defined on . We can easily extend the uniform simple slice sampler (also called idealized slice sampler) from to . (In fact this works for every measure space, see Appendix B.) The level sets, containing all points where the unnormalized density is greater than a given value , take the form
The idealized slice sampler with initial point then proceeds, as in , by first sampling a level uniformly from . Then the next state of the chain is sampled according to the uniform distribution on , that is, according to . More precisely, the idealized slice sampler has transition kernel
The implementation of the kernel requires sampling from the manifold-equivalent of the uniform distribution on a level set . Doing this efficiently poses a problem, as in , because in general we have no knowledge about the shape of the level sets other than that they are -dimensional, measurable sets. Therefore, we propose a hybrid slice sampler that lifts Hit-and-run slice sampling from to the general Riemannian manifold . To this end, we need a generalization of straight lines to , the geodesics. A curve , where is an interval, is called a geodesic if the covariant derivative of its velocity vector field is zero, i.e., if the equation holds. Under Assumption A, we know that for all and all there exists a unique geodesic
(4) |
satisfying and . We may interpret as the geodesic emanating from in direction .
As in , we want to index the geodesics emanating from a point by a \ldqunit sphere of directions\rdq, which naturally is given by the unit tangent sphere in
The natural immersion of into the inner product space via the identity induces a Riemannian metric on . We call the normalization
of the Riemannian measure , induced by , uniform distribution on . Then for , and we immediately obtain
as the parameterized intersection of a geodesic with a level set. We now present the extension of the Hit-and-run slice sampler to manifolds replacing straight lines by geodesics. We call this sampler the geodesic slice sampler. Roughly speaking, we arrive at a transition mechanism which at each step randomly chooses a geodesic and then runs a stepping-out and shrinkage based 1-dimensional slice sampler on this geodesic: Given a point , a new point is generated by first sampling a level uniformly from , and a direction uniformly from , i.e., from . The sampled level defines a level set , and the sampled direction specifies a geodesic emanating from in direction . Now we use Neal’s stepping-out and shrinkage techniques described in Section 2.1 to generate a point from the intersection of the level set and the geodesic . The new point is then given by .
Input: point , hyperparameters and
Output: point
Input: point , direction , level , hyperparameters and
Output: two points such that
Input: point , direction , level and parameters
Output: point
A complete algorithmic description of the geodesic slice sampler in pseudo code can be found in Algorithm 1. It calls Algorithm 2 and Algorithm 3 representing the stepping-out and shrinkage procedure on the geodesic respectively. We also provide a description in terms of random variables.
Remark 3.
Let be the Markov chain corresponding to the geodesic slice sampler. For let and be a random variables with conditional distributions
For fixed hyperparameters and , let and be independent of all previous random variables. We set
and define . Let the random variable have conditional distribution
and set . Using the segment-notation from the description of the shrinkage procedure in Section 2.1, we define to have conditional distribution
and
for all . Then set , where
The next state of the geodesic slice sampler is then given by
We comment on the prerequisites of the geodesic slice sampler.
Remark 4.
-
1.
In order to implement Algorithm 1 we need to be able to perform the following operations:
-
•
Evaluation of the unnormalized density at every .
-
•
Sampling from for all . If we know an isometric isomorphism for all this is an easy task.
-
•
Evaluation of geodesics for all , and . Some cases where this is possible are provided in Example 5.
-
•
-
2.
The geodesic slice sampler takes two hyperparameters, namely and . They arise from the usage of Algorithm 2 (the stepping-out procedure), and their influence on the geodesic slice sampler is derived from their influence on the stepping-out procedure, see Remark 1. Roughly speaking, determines the maximal possible size of the neighborhood of the initial point that the geodesic slice sampler takes into account when performing its transition. This affects the reach of the algorithm as well as its ability to jump between modes of the target distribution . Choosing larger and smaller can be seen as increasing the likelihood that a smaller neighborhood of is considered for the transition, compared to the maximal possible reach of the algorithm. Depending on the shape of the unnormalized density , this can hamper the ability of the geodesic slice sampler to jump between modes or allow the consideration of more \ldqrelevant\rdq neighborhoods. Observe that larger leads to higher computational cost by increasing the cost of Algorithm 2.
We provide some illustrative scenarios.
Example 5.
-
1.
The Hit-and-run slice sampler on described in Section 2.1 fits into the framework of the geodesic slice sampler.
-
2.
We consider . For we have . The projection onto the subspace orthogonal to
where denotes the identity on , yields a simple formula for
The geodesics of are the great circles given by the explicit formula
for and . Of course we may apply the geodesic slice sampler for arbitrary hyperparameters as described in Algorithm 1. However, since all geodesics are periodic of a known period length (namely ), this renders Algorithm 2 (the stepping-out procedure) somehow superfluous. For wisely chosen hyperparameters (), the geodesic segment sampled by Algorithm 2 always equals exactly one winding of the great circle, and we can simply replace line 3 in Algorithm 1 by deterministically setting . The resulting algorithm is the geodesic shrinkage slice sampler on the sphere from [habeck2023geodesic].
-
3.
For the Stiefel manifold defined in Example 2.3, an isometric isomorphism between and the tangent space at a point is given by using the first components of to determine a skew symmetric matrix and the remaining ones to form a matrix . These two matrices determine an element of (after fixing ) by . We provide an explicit formula for the geodesic . To this end let be the compact QR-decomposition of . For set and to be
where denotes here the matrix exponential and is the matrix with all entries zero. Then
The derivation of these results can be found in [edelman1998geometry].
Finally, we present the Markov transition kernel corresponding to the geodesic slice sampler. To this end, we first give a rigorous specification of the unnormalized density :
Assumption B.
The unnormalized density is a lower semi-continuous333All level sets , , are open. function such that .
Remark 6.
Next we fix and . For simplicity we drop these two hyperparameters of the geodesic slice sampler in our subsequent notation. Let , and . We denote by
(5) |
the distribution of the output of Algorithm 2 and by
(6) |
where , the distribution of the output of Algorithm 3. Note that in (5) and (6) the right hand side only depends on , and through the set . A formal definition of these distributions and some of their properties can be found in Section 4.1 and Section 4.2. For , and we define the auxiliary Markov kernels
Then the Markov kernel
(7) |
corresponds to Algorithm 1. Observe that has the correct invariant distribution, as implied by the following theorem.
Theorem 7.
The proof of this statement can be found in Section 4.3.
2.3 Literature review of MCMC-methods on Riemannian manifolds
In this section, we aim to provide an overview of existing MCMC-methods on Riemannian manifolds in the literature. Roughly speaking they can be assigned to three different categories.
The first class of MCMC algorithms are defined on open sets of but consider non canonical Riemannian metrics. In [girolami2011riemann], Girolami and Calderhead generalize Hamiltonian Monte Carlo (HMC) to equipped with an arbitrary metric tensor obtaining Riemannian manifold HMC (RMHMC), which assumes knowledge about the Riemannian metric of the underlying manifold. They also introduce a MALA-type algorithm in this setting.
A second class of MCMC algorithms consists of methods defined on submanifolds of associated with the canonical metric introduced by the embedding in that are not necessary open sets. This includes Hamiltonian based MCMC-methods tailor-made for specific classes of manifolds that have been further developed upon RMHMC such as for implicitly defined manifolds [brubaker2012family], manifolds embedded in Euclidean space [byrne2013geodesic] and the sphere [lan2014spherical].
Distributions on submanifolds of can also be approximated by proposing a sample from the ambient projected to the manifold and then running a Metropolis-Hastings acceptance rejection step, see e.g. [mantoux2021understanding, zappa2018monte]. However, as usually the conditional distribution of the proposal is intractable and is therefore not taken into account in the acceptance rejection step, the resulting Metropolis-Hastings algorithm is biased. In [zappa2018monte], Zappa et. al propose a bias-free modification of this method for submanifolds of defined by inequalities and equality constraints. In the special case when the underlying manifold is a hypersphere equipped with the angular Gaussian distribution as reference measure, other specialized bias-free reprojected MCMC algorithms have been proposed such as reprojected preconditioned Crank–Nicolson algorithm or reprojected Elliptical Slice sampling, see [lieAccepteddimension].
The third class of MCMC-methods on Riemannian manifolds employs geodesic flows. In [mangoubi2018rapid], the authors analyze a geodesic random walk on Riemannian manifolds with positive bounded curvature, which is invariant with respect to the Riemannian measure. This analysis has been extended in [goyal2019sampling] to arbitrary target probability measures on manifolds with bounded non-negative curvature by adding a Metropolis-Hastings-like acceptance step resulting in a geodesic Metropolis-Hastings algorithm. Note that a similar approach has been taken before in [lee2017geodesic] to target the uniform distribution on a polytope by equipping it with a Hessian structure. In addition to this class of Metropolis-Hastings algorithms, there already exist MCMC-methods for specific manifolds that combine slice sampling with geodesic, that is, for distributions on there is Hit-and-Run slice sampling [latuszyinski2014convergence, Mackay], and for distributions on hyperspheres there is geodesic slice sampling on the sphere which uses great circles in stead of straight lines [habeck2023geodesic]. They can both be viewed as special cases of GSS.
3 Application
We numerically asses the performance of GSS (Algorithm 1) in comparison with other Riemannian MCMC algorithms. In our experiments, we consider the compact Stiefel manifold and the Grassmann manifold with, , . They find applications in shape analysis [hong2017regression], dimensionality reduction [holbrook2016bayesian] and computer vision [lui2012advances, nguyen2019neural]. We give a brief overview of the conducted experiments. All the code is available on GitHub444https://github.com/samuelgruffaz/Geodesic_Slice_Sampling_on_Riemannian_Manifold.git.
-
•
We first consider the case where the target distribution belongs to the class of von Mises-Fisher distributions, defining families of distributions on and . When examining the Stiefel manifold , we compare GSS with an adaptive random walk Metropolis Hastings (RMH) sampler. This comparison highlights the impact of the GSS approach compared to a well-chosen random walk. On the Grassmann manifold, we contrast the GSS with a gradient-informed sampler referred to as the geodesics Metropolis-adjusted Langevin algorithm (GeoMALA), inspired by the framework proposed by [byrne2013geodesic].
-
•
We compare GSS and RMH in inferring a latent variable model developed in [mantoux2021understanding]. This model is used for the analysis of brain network structures. Specifically, it encodes principal directions of adjacency matrices obtained from MRI as points on a Stiefel manifold.
-
•
Finally, we introduce a Bayesian von Mises-Fisher clustering model for action recognition in videos. We approximate the posterior distribution using GSS and compare our resulting model with other existing approaches [lin2017bayesian, sengupta2017bayesian].
We do not conduct comparisons with the Riemannian Hamiltonian Monte Carlo (RHMC) algorithm [girolami2011riemann], since it uses the expression of the metric of the manifold rather than its geodesics as GSS. Throughout this whole section we denote the Gaussian distribution on with mean and covariance matrix by . For , let be the diagonal matrix with diagonal entries given by the components of . Moreover, we define to be the vector and to be the matrix with all entries zero, and write for the trace of a square matrix .
We comment on the structures of the Stiefel and the Grassmann manifold needed to implement GSS.
Remark 8.
-
1.
(Stiefel manifold.) The Stiefel manifold , including an explicit formula for its geodesic and the uniform distribution on its tangent spheres, is introduced in Example 2-2 and 5-3. It is worth noting that the computation of a geodesic requires evaluating the matrix exponential map on skew symmetric matrices of size , which can be efficiently done using the eigenvalue decomposition of skew symmetric matrices.
-
2.
(Grassmann manifold.) Let with . The -dimensional Grassmann manifold can be defined as the set of all -dimensional subspaces of the Euclidean space (see [bendokat2020grassmann] for a complete overview), i.e.,
(8) The following representation allows for an efficient implementation of points on the Grassmann manifold. First, observe that for any subspace , there exists a (non-unique) orthonormal basis formed by the column of such that . Thus, by defining the equivalence relation on as if and only if , the Grassmann manifold can be identified with the quotient manifold . As a result, an element of can be represented by an element of .
Fixing such that the columns of and form an orthonormal basis of , the tangent space to at is given by
We equip with the Riemannian metric
such that is an isometric isomorphism between and . Let be the compact singular value decomposition (SVD) of an element . Then, the geodesic at with direction admits the following explicit expression
Further details and derivations can be found in [edelman1998geometry].
For the convenience of the reader we provide more details on the three samplers appearing in our numerical experiments. Each sampler defines a Markov chain , and we describe the respective transition mechanisms from to for .
- (a) adaptive random walk Metropolis Hastings (RMH).
-
Given and a step size this sampler uses the following Metropolis-Hastings (MH) like transition mechanism :
-
1.
Sample interpreted as a matrix in . Let be the projection on the Stiefel manifold, where is the SVD of . Then define .
-
2.
Sample independent of all previously appearing random variables. If , where
(9) then set , otherwise .
The hyperparameter is adaptively tuned to target an acceptance probability of , a constant proposed by optimal design analysis when using the same sampler in an Euclidean space [roberts2001optimal]. As remarked in [zappa2018monte], it is worth noting the proposal mechanism is not symmetric here. More precisely, if the conditional distribution of given admits a transition density denoted by , this function is not symmetric, i.e., for . Therefore, using (9) as the acceptance ratio leads to a biased MCMC algorithm, which is justified in [mantoux2021understanding] by the observation that, for sufficiently small , holds for all . However, this introduces a bias that grows with increasing . In Section 3.2, we compare adaptive RMH with GSS using the same application as discussed in [mantoux2021understanding].
It is worth noting that [zappa2018monte] proposes an MCMC algorithm that, by employing another projection and an additional accept-reject step, results in an unbiased MCMC algorithm. However, their approach does not exploit the specificity of a \ldqtractable geodesics\rdq-framework. -
1.
- (b) geodesic adaptive Metropolis Hastings (GeoRMH).
-
This methods is a bias free modification of RMH. It essentially uses the same transition mechanism, but replaces the proposal at step with where is distributed according to a standard Gaussian on . In [goyal2019sampling], a similar algorithm is proposed to sample uniformly from a convex subset of a manifold with non-negative curvature.
- (c) geodesic Metropolis adjusted Langevin algorithm (GeoMALA).
-
On the Grassmann manifold, GeoMALA is used and presented in [holbrook2016bayesian, Algorithm 1] to estimate parameters in Bayesian inference models involving dimensionality reduction. Fix a step and denote by for the projection onto the tangent space. The transition mechanism of GeoMALA given works as follows:
-
1.
Sample interpreted as a matrix in . Then define and .
-
2.
Define and , , as well as and .
-
3.
Sample independently. If then set , otherwise .
-
1.
Note that for all the algorithms that we implement, GSS, GeoRMH, and GeoMALA, we always reproject the (final) state of the Markov chain at each step onto the Stiefel manifold to mitigate numerical errors arising from the geodesic computations. This reprojection step ensures that the resulting samples lie on the manifold, preserving the desired manifold structure and improving the accuracy of the sampling algorithms.
3.1 Sampling the von Mises–Fisher distribution
In this section we present numerical experiments targeting the von Mises-Fisher distribution. Given a matrix-valued parameter , the von Mises–Fisher distribution on the Stiefel manifold has unnormalized density with respect to its Riemannian measure
(10) |
However, this expression can not be used for the Grassmann manifold since with does not imply . Note that an element of the Grassmann manifold can be identified with its orthogonal projector , which does not depend on the choice of the representative for . Therefore, given a positive semi-definite matrix , the von Mises–Fisher distribution (also called matrix Langevin distribution [chikuse2003concentrated]) on the Grassmann manifold has unnormalized density
(11) |
In the following, given and , we always choose the parameters and to be of the form
(12) |
where .
We consider three different experimental setups on . Firstly, is fixed at , we vary in the set and set in (12). Secondly, for the same choice of as in the first experiment, we fix and vary in the set . Thirdly, the pair is fixed at , but we choose , where varies in the set . This allows us to study the impact of the target distribution’s anisotropy on the samplers performance.
On the Grassmann manifold we repeat similar experiments. However in the first two setups, we use in (12). For the third experiment, we set and for . There, we also vary the stepping-out parameter of GSS in the set . In all the other experiments, we fix the hyperparameter of GSS to 1 for the sake of simplicity. The samplers run for iterations, and the stepsize is initialized to 0.01 and updated every 20 steps for RMH and GeoRMH.
We define the initialization by first sampling . Then, is projected onto using [mantoux2021understanding, Lemma 2] to obtain , thereby identifying a point on with its equivalence class when we work on . To evaluate the performance of the samplers, we compute the effective sample size (ESS) of , where denotes the samples generated by a single sampler. The results are averaged over 10 different resamplings, but the initialization is kept fixed since the use of a burn-in period has no significant impact.
(n,k) | (3,2) | (30,2) | (100,2) |
---|---|---|---|
GSS | |||
GSS | |||
GSS | |||
GSS | |||
GSS | |||
GSS | |||
RMH: | |||
GeoRMH: |
(n,k) | (30,5) | (30,10) | (30,20) |
---|---|---|---|
GSS | |||
GSS | |||
GSS | |||
RMH: | |||
GeoRMH: |
Varying on the Stiefel manifold.
First, we observe in Table 1 and 2 that the larger the value of , the higher the effective sample size (ESS). This is coherent with the fact that the average distance between the proposal and the current state increases with , leading to a higher ESS as the space is better explored. However, note that on a compact manifold like , following a geodesic for a long time may cause the proposal to return close to the starting point, similar to following the great circle of a sphere as shown in Figure 5. As a result, the gain in ESS reaches a plateau when .
The second observation is that for all samplers, as increases for fixed , the ESS increases, and inversely, as increases with fixed, the ESS decreases. Considering (12), the number of directions that impact the density is equal to since for any . Thus, as is small, the target density is \ldqflat\rdq in many directions and thus the risk of proposal rejection is small if the sampler attempts to move in these directions.
In Table 1, we see that RMH outperforms GSS and GeoRMH for , and remains competitive for . This, maybe at first glance surprising, performance of RMH can be explained by the fact that when is small, the number of constraints is low compared to the dimensionality of the space, making nearly Euclidean and the Stiefel projection nearly equal to the identity. Consequently, since the optimal acceptance rate of for the adaptive mechanism was found in an Euclidean space, it is reasonably explainable that RMH performs better in this scenario. This interpretation is further confirmed by Table 2, where we see that GSS() outperforms RMH and GeoRMH for any , though this advantage is less significant for GeoRMH due to its ability to explore the space according to its intrinsic geometry. Recall, to put our observations into perspective, that while RMH may outperform GSS and GeoRMH from ESS, it is fundamentally biased.
In Table 1, we observe that GeoRMH is not as efficient as RMH but performs comparably to GSS() when . This difference can be linked to the fact that GeoRMH and GSS only differ in the method they employ to move on a geodesic. The first uses a Metropolis Hastings mechanism whereas the latter runs a slice sampler. Using (uniform) slice sampling, can be viewed as first fixing an acceptance level and then drawing proposals (with the help of the stepping-out and shrinkage procedure) until acceptance is reached. This ensures that the sampled direction in the tangent space is not wasted, but it increases the computational cost of each transition step, since each proposal within the shrinkage procedure involves new computations. The parameter affects the average number of attempts in the shrinkage procedure since it widens the portion of the geodesic where the shrinkage proposal is sampled, thus increasing the possibility of not meeting the acceptance level. For example, in the case , there are, on average, attempts when , but attempts when . Therefore, there is a trade-off when selecting to optimize the time efficiency of sampling, as larger values of increase both the ESS and the computation time.
Varying on the Grassmann manifold.
In Table 3, we present only a subset of our experiments to convey the following message: When GeoMALA is well tuned, the gradient-informed sampler outperforms GSS regardless of the choice of . However, the tuning of GeoMALA is very sensitive. For instance, in the case with , the gradient information encourages to focus on high density area and does not explore enough to outperform GSS, but if we increase the stepsize to , GeoMALA is better than GSS. Regarding the role of the hyperparameter , the conclusions are consistent with those on the Stiefel manifold. Increasing beyond a certain point does not pay off, since the manifold is compact. The sweet spot appears to be around . Both methods have the same complexity, as in both cases, we need to sample a point on the tangent space and compute a geodesic using SVD.
(3,2) | (30,20) | (100,2) | |
---|---|---|---|
GSS | |||
GeoMALA Best |
Varying anisotropy on the Stiefel manifold.
1 | 10 | 100 | |
---|---|---|---|
GSS | |||
RMH | |||
GeoRMH |
In Table 4, the samplers’ performances worsen as increases, indicating the difficulty of exploring the space when the density is sharp in a specific direction.
We observe that RMH and GeoRMH outperform GSS only when , highlighting the effectiveness of the slice sampling approach of GSS in dealing with sharp densities.
In our experiments, we notice that the number of attempts in the shrinkage procedure is more sensitive to the variance of the target distribution than the value of the parameter . Therefore, in cases where GSS is a good fit for the sampling task, the computation time increases accordingly.
Varying the variance on the Grassmann manifold.
In Table 5, we observe that GSS outperforms GeoMALA when , indicating its advantage in situations where the density is concentrated. It highlights the fact that GeoMALA is not robust to the choice of the stepsize parameter.
Furthermore, we find that using a lower value for is more suitable when the distribution is sharp (e.g., ). Additionally, choosing a large value of the stepping out parameter can improve the performance when is chosen too small.
This experiment reaffirms the previous findings, demonstrating that GSS adapts to the geometry of the density, which is promising for practical applications. Moreover, the performance of GSS appears to be quite robust across different choices of , and taking a large value for strengthens this feature.
1 | 10 | 100 | |
---|---|---|---|
GSS | |||
GSS | |||
GSS | |||
GSS | |||
GeoMALA Best |
We discuss some further aspects of the conducted numerical experiments.
Remark 9.
(Complexity.) Up to this point, a discerning reader might consider the comparison of ESS between RMH, GSS, and GeoRMH unfair, as GSS involves additional computations due to the shrinkage procedure. While this is true, the main computational bottlenecks in all methods are the reprojection of the proposal on the Stiefel manifold at the end of each step using an SVD , the sampling on the tangent space using a QR decomposition , and the eigenvalues of the skew-symmetric matrices .
In the case of the shrinkage procedure in GSS, an eigenvalue decomposition is initially computed, enabling the generation of geodesics with some matrix products for the subsequent attempts. Thus, from a computational perspective, the only difference between GeoRMH and GSS is the cost of these matrix products .
However, when comparing RMH and GSS, we need to account for the additional computational cost of the QR decomposition and eigenvalue decomposition, both in . Notably, the computational cost related to constraints naturally increases with . In practice, without any engineering optimization of the codes, we observe that GSS takes between two to four times longer to execute than RMH (which is biased), but it is nearly equivalent to GeoRMH in terms of computation time when is not excessively large. The relative speed of RMH compared to GSS has to be weighted by its bias (of unknown size).
Remark 10.
(Choice of the hyperparameters.) The performance of GSS seems to be quite robust to the choice of the hyperparameters and as soon as (see Tables 1, 2, 5). The number coincides with twice the diameter of the Grassmann and the Stiefel manifold. Therefore, as a heuristic we propose to choose and such that is about twice the diameter for compact manifolds in general.
3.2 A practical case: Understanding the variability in graph data sets.
We consider in this section a model introduced in [mantoux2021understanding] that aims to infer the structure of adjacency matrices from functional connectivity networks of brains. Consider , adjacency matrices of different networks with nodes each. Let with . The model that we consider has parameter , , and , and is defined for as
(13) | ||||
where , is called pattern weight vector, and represents the symmetric residual noise (i.e., maps a vector in to the symmetric matrix in determined by its components). The unobserved variables and determine the individual-level specificity of network .
The original paper estimates the parameters using the Markov chain Monte Carlo-stochastic algorithm expectation maximization (MCMC-SAEM) procedure [kuhn2004coupling]. MCMC-SAEM is an extension of the expectation maximization (EM) algorithm, where the E-step involves approximating integrals using MCMC methods. In this context, the E-step samples the density , where is the current estimate for the parameter , within a Gibbs procedure using RMH (see [mantoux2021understanding, Algorithm 3]).
For the optimization procedure, we propose to replace RMH with GSS() to compare performances. Since our implementation of RMH is four times faster than GSS, we chose to multiply the number of MCMC iterations by four when using RMH for the E-step.
On a synthetic dataset.
We follow the same procedure as [mantoux2021understanding] to generate synthetic data with , and . The generation parameters are fixed as , , , and is chosen as the matrix with columns , where is sampled uniformly from , and , for any . The factor in this context serves as an anisotropy factor incorporated to examine how performance can be influenced by anisotropy. The optimization process involves random initialization of the parameters and , and we deterministically set . Then, is initialized either by performing 200 iterations of GSS on or 800 iterations of RMH. The results are averaged over 10 repetitions with different random initializations and generation parameters.
We run the MCMC-SAEM procedure for 100 iterations, and at each step of MCMC-SAEM, 20 iterations of MCMC are performed when GSS is used, while 80 iterations are performed when RMH is used for the E-step of the EM algorithm.
First, upon examining Figure 6, we can observe that the complete log-likelihood curve555Denoting by the model parameter, the log complete likelihood is and the log likelihood is . exhibits higher values when in contrast to . This can be attributed to the unobserved variables being more concentrated in the direction indicated by the first column of , making recovery easier.
Secondly, GSS outperforms RMH regarding the complete likelihood. The curve increases more rapidly with the number of iterations and attains a higher final value. This improvement is particularly striking for , which is consistent with the findings in Table 2. Moreover, as illustrated by the third graph of Figure 6, the relative root mean square error (rRMSE) of the estimated parameter to the true parameter is always smaller when using GSS, especially when .
Missing links imputation.
We follow the experimental setup proposed in [mantoux2021understanding, Section 5.1.2] for missing links imputation: A synthetic data set of adjacency matrices with nodes and is generated from the model specified by (13) with parameters given in Appendix D.1. In a first stage, the MCMC-SAEM algorithm with GSS is applied to perform the estimation of the parameters resulting in an estimator . Then, from the same model, we generate another 200 samples . For each of these samples, 16% of the edge weights corresponding to the interactions between the last eight nodes are masked. Denote by the resulting adjacency matrices. We then aim to reconstruct the missing links from the samples using the following three procedures:
-
•
The missing links are found from the maximum a posteriori (MAP) approximated as , where are the result of 4000 iterations of gradient ascent on the posterior density , the conditional density of given the masked observation in the model (13).
-
•
The missing links are found from the posterior mean (PM) defined as
(14) This distribution is approximated using and GSS or RMH within Gibbs sampler.
-
•
Finally, we consider a reconstruction using simply the link from .
When computing the PM and the MAP, GSS achieves a rRMSE of 50% ( 24% standard deviation over the dataset) and 52% ( 16%) on average respectively, in contrast to 57% ( 24%) and 58% ( 28%) for RMH, and 85% ( 10 %) for the mean sample. Then the same experiment is repeated, but this time, 40% of the edges are uniformly masked instead of constraining the mask. With GSS, we find a rRMSE of 30% ( 7% standard deviation over the dataset) and 35% ( 8%) on average for the PM and the MAP, compared to 34% ( 9%) and 35% ( 7%) when using RMH, and 75% ( 5 %) with the mean sample. This clearly indicates that GSS outperforms RMH on this example.
On a real dataset.
We use real data to perform a comparison similar to the one proposed in [mantoux2021understanding] where the authors use connectivity matrices with nodes and in their model to analyze subjects666We did not use the same data, since their dataset was unavailable to us.. Data were provided by the Human Connectome Project777https://www.humanconnectome.org/study/hcp-young-adult/document/extensively-processed-fmri-data-documentation. The dataset is composed of brain connectivity matrices generated from resting-state functional MRI (rs-fMRI) by following the pipeline described in this documentation888https://www.humanconnectome.org/storage/app/media/documentation/s1200/HCP1200-DenseConnectome+PTN+Appendix-July2017.pdf. On a subject which receives no stimulation (at rest), the rs-fMRI records fluctuations in blood oxygenation levels throughout the brain. By maximizing the signal coherence in each region of the brain with a spatial independent component analysis (ICA) [beckmann2004probabilistic], it yields a partition of the brain depending on its structure and variations from one individual to another. Finally, the temporal correlations between the mean blood oxygenation levels in each region are assembled into a matrix. This matrix is called the brain’s functional connectivity network. It should be noted that this network is not necessarily derived from physical reality, since it only represents correlations between brain regions. This is why the term “functional” is coined. In this study, the connectivity matrices are defined on a parcellation of the brain into regions999The network modeling is related to “netmats” in the documentation.. We chose the “recon2” group of the dataset leaving us 812 matrices of dimension to analyze.
We choose , and run 1000 MCMC-SAEM iterations with 20 MCMC steps per SAEM iteration when using GSS(), and 80 iterations when using RMH. The initialization procedure is the same as for synthetic data. To compare the samplers, we compute the rRMSE between PM and the observations similarly as in the previous paragraph. To asses the benefit of the model, we also compute the approximation given by the projection onto the subspace of the first five principal component analysis (PCA) components of the full data set, where each matrix has been vectorized.
The boxplots in Figure 7 show that GSS outperforms RMH since it reduces the tail of the error distribution. In addition, GSS achieves better results than RMH when looking at the evolution of the complete likelihood, as already observed on synthetic data. The model proves to be better suited than a simple PCA, as shown in [mantoux2021understanding], but often fails to offer a good representation of the observations in large dimensions, even for synthetic data.
3.3 ARMA model
Time series related to different types of data, such as dynamic textures, shape sequences and videos, are often modeled as auto-regressive and moving average (ARMA) models [doretto2003dynamic, aggarwal2004system, bissacco2001recognition, veeraraghavan2005matching]. Provided observations , the ARMA model equations are
where is the hidden state vector, the transition matrix, the measurement matrix and covariance matrices. We can wrap this model in a Bayesian framework by considering the priors and with , such that are seen as hyperparameters.
In this experiment, we compare the ESS related to the sampling of the posterior by using GSS and RMH. The ESS is computed from where are the samples. The expression of is given in (10) and can be computed using Kalman filter updates. The observations are synthetically generated from the model where the parameters are chosen randomly with and varying dimensions . We choose a low variance for the observation covariance matrix and a prior close to the true parameter in order to have a concentrated posterior.
In Table 6, GSS outperforms RMH, and a large increases the ESS. Surprisingly, increasing improves the ESS for GSS and not for RMH, contrary to the experiments with the von Mises-Fisher distribution in Section 3.1. This highlights the impact of the target distribution on the quality of the sampling and the robustness of GSS. The case reveals the influence of the parameter : increasing while reducing increases the performances. Sometimes, large transitions can be achieved only in specific contexts, and large allows this when necessary.
() | |||
---|---|---|---|
GSS : | [389,425,453] | [386,404,439] | [305,312,337] |
GSS : | [418,467,484] | [470,511,565] | [1146,1414,1779] |
GSS : | [506,549,589] | [436,484,527] | [1513,1966,2674] |
RMH : | [293,367,387] | [402,421,468] | [286,296,323] |
3.4 Bayesian clustering on the KTH video action dataset.
In this section, GSS is used to estimate parameters of a Bayesian clustering model on the KTH video action data [schuldt2004recognizing]. The pipeline used in [chakraborty2019statistics] is followed. For four different scenarios (called “d1”, “d2”, “d3” and “d4”), the dataset records 6 actions carried out by 25 humans, which yields 125 videos per scenario. From each video, a sequence of frames is extracted, and each frame is resized to , before computing its histogram of oriented gradients (HOG) [dalal2005histograms] features. Their dimension is . Finally, an ARMA model is used to model the sequence of HOG features by estimating the parameter with the closed-form formula given in [doretto2003dynamic]. For each video, let be the number of frames and be the matrix formed by stacking the HOG feature vectors from each frame. Let be the SVD of by taking only the first components, i.e., , and . Then the video is represented by where
Clustering models using mixture of von Mises-Fisher distributions have already proposed in [lin2017bayesian, sengupta2017bayesian], where the distribution on is a mixture of multivariate Gaussian distributions with diagonal covariance matrix. The number of clusters equals six as the number of actions. The parameter of each cluster is and the mixing weight is . The observations and their cluster assignment are assumed to follow this generation process:
(15) | |||
(16) |
where is the multinomial distribution with parameter . In this Bayesian setting, we take a uniform non-informative prior on every parameter. We want to assess if the clustering separates the actions. To this end, for each environment, the dataset is split into a training set (67% of the data) and a test set (33%). The clustering model is trained with the training set and evaluated on the test set.
The parameter estimation is performed on the training set by adapting the EM algorithm described in [sengupta2017bayesian] to the product space . We denote by the estimated parameters. The cluster label for a point in the test set is given by . The related results are reported in Table 7 as “vMF clustering EM”.
The result of the EM algorithm is then used as an initialization for the sampling of the posterior. However, the sampling is only done for some parameters for computational reasons and to highlight the use of GSS on the Stiefel manifold. The vMF distributions parameters are parameterized using their SVD belonging to and respectively. Then we approximately sample from the posterior associated with using GSS, and the resulting posterior samples yield pseudo-posterior cluster parameter samples where is the number of samples. Then, the samples are used to compute Bayesian assignment weights for each observation of the test set and each cluster . The label of each observation is finally assigned as . This procedure is referred to as “vMF clustering MCMC” in Table 7.
To demonstrate the benefit of our Bayesian approach, for each observation of the test set, we compute the empirical variance of the sample
(17) |
and remove the point if it is positive. The previous procedure is then applied to this reduced test set. The scores resulting from this procedure referred to as “vMF clustering MCMC lower variance”, are given in Table 7.
Scenario | d1 | d2 | d3 | d4 |
---|---|---|---|---|
vMF clustering EM : | 49.54 | 56.61 | 57.99 | 49.98 |
vMF clustering MCMC : | 52.21 | 56.61 | 57.99 | 49.98 |
vMF clustering MCMC lower variance : | 50.15 | 57.75 | 59.62 | 54.06 |
The Bayesian approach strengthens the estimation by averaging different plausible weights, as in the ensemble methods [dietterich2000ensemble], which always turns out to be better than the simple maximum likelihood estimator (MLE) in 7.
4 Validity
The aim of this section is to prove the reversibility of the geodesic slice sampler. To this end, we introduce a stepping-out distribution to describe Algorithm 2 and a shrinkage kernel to describe Algorithm 3. Their properties collected in Lemma 11, Lemma 12, Lemma 15 and Lemma 16 intuitively ensure ‘reversibility on every parameterized intersection of geodesic and levelset’. Together with the invariance of the Liouville measure under a certain map on the tangent bundle, both introduced in Section 4.3, they are the essential indigence for the proof of the reversibility of the geodesic slice sampler.
4.1 Stepping-out procedure
Throughout this section fix and . We consider a generalization of the stepping-out procedure described in Section 2.1 that targets an arbitrary set , see Algorithm 4.
Input: point , hyperparameters and
Output: two points such that
To formally describe the resulting distribution on we use stopped random variables. Let . For every let
(18) |
be two sequences of random variables. Observe that setting
(19) |
is consistent with this definition. By construction both sequences are strictly monotone, i.e.,
(20) |
We now define appropriate stopping times depending on the target set . To this end, let be a random variable that is independent of all previous objects satisfying , and set
(21) |
Note that and are finite stopping times with respect to the filtration generated by the sequence . More precisely, we have the bounds
(22) |
As and are finite stopping times,
(23) |
are random variables for all and , see [Klenke, Lemma 9.23]. We define the stepping-out distributions
on . Observe that the output of Algorithm 4 has distribution . Since Algorithm 2 is a special case of Algorithm 4 with and , this definition is coherent with (5). Moreover, note that is a transition kernel. More details on this can be found in Remark 21.
We collect two properties of the stepping-out distribution that are useful to show reversibility of the geodesic slice sampler. Their proof is postponed to Appendix A.
Lemma 11.
Let and let be a measurable function. We have for all
The previous lemma essentially states that, conditioned on the event that the initial point lies inside the resulting interval, the stepping-out distribution does not depend on the initial point. To formulate the second property we need the collection
(24) |
of linear functions index by . Intuitively, they express a U-turn at . Moreover, for we define
(25) |
The next lemma describes the behavior of the stepping-out distribution under U-turns.
Lemma 12.
Let and . We have
4.2 Shrinkage procedure
In this section, for every half open interval , we introduce an algorithm that generalizes Algorithm 3 approximating the uniform distribution on for an arbitrary open set . To express this scheme as a kernel, we employ the kernel of the shrinkage procedure, essentially operating on parameterized by , introduced in [ReversibilityEllipticalSliceSampler] and push it forward to an arbitrary interval as described in [rudolf2022robust, Appendix A]. For the convenience of the reader we quickly sketch the kernel from [ReversibilityEllipticalSliceSampler]. To this end set
We denote by the Dirac measure at . Let be a random variable on , and let be a sequence of random variables with conditional distributions
for with , and . Moreover, for every set which is open in , i.e., satisfies that for all there exists such that , define the stopping time
Then the kernel of the shrinkage procedure targeting is given by
To \ldqbend the interval onto \rdq, where such that , we use the family of maps
Note that due to the restriction of the domain, these functions are bijective and therefore have inverses . Let such that , and let be an open set such that . For such numbers and sets we define the shrinkage kernel as the push forward of the kernel for 101010Observe that is open in , since is open as a set in , and non-empty by assumption. under , i.e.,
Observe that this agrees with the definition made in (6), where we have and for , and .
We briefly discuss the measureability of the shrinkage kernel in the arguments .
Remark 13.
Let and be two random variables independent of and satisfying almost surely, and let , . By a disintegration argument, we have for all , that
Therefore
is measurable in . The equality above holds by definition of the shrinkage kernel, see also [ReversibilityEllipticalSliceSampler, Proof of Theorem 2.9]. Consequently
is a transition kernel.
Remark 14.
Note that the combination of stepping-out distribution and shrinkage kernel as in (7) is valid, i.e., the random interval generated by the stepping-out procedure can be used as an input for the shrinkage procedure. To see this, fix , and . Let and be as in (23). We need to verify that
-
•
is open,
-
•
almost surely. In particular this implies that this intersection is almost surely non-empty.
The lower semi-continuity of yields that is open. Since , we have . Moreover, we have almost surely by construction. Therefore 0 is also almost surely contained in the intersection of these two sets.
We provide two properties of the shrinkage kernel that are useful to derive the reversibility of the geodesic slice sampler. Both are essentially extensions of corresponding properties of .
Lemma 15.
Let and an open set such that . Then the kernel is reversible with respect to .
To obtain this result, we push the reversibility statement for formulated in [ReversibilityEllipticalSliceSampler] forward to the shrinkage kernel on arbitrary half open intervals.
Proof.
By [ReversibilityEllipticalSliceSampler, Theorem 2.9] we know that is reversible with respect . Observe that by [rudolf2022robust, Proposition 19] this implies that is reversible with respect to .
∎
The next lemma can be seen as describing the behavior of the shrinkage kernel under U-turns.
Lemma 16.
In order to leverage a similar property of the kernel for showing the above statement, we need the collection of maps
indexed by .
Proof.
We aim to apply [ReversibilityEllipticalSliceSampler, Lemma 2.10]. Let such that . Moreover, let be an open set such that , and . Observe that for we have
Therefore
and
for all , as . Since and , we get by [ReversibilityEllipticalSliceSampler, Lemma 2.10, note that ] that for
∎
4.3 Reversibility of the geodesic slice sampler
Note that the tangent bundle
of is a smooth, -dimensional, connected manifold, see e.g. [Sakai, Section I.2.2]. To prove Theorem 7, we employ the Riemannian structure of . We only briefly sketch how this Riemannian structure is introduced, for more details see [Sakai, Section II.4].
We denote by
the projection map from the tangent bundle onto . Observe that for the tangent space to at can be identified with the direct sum
For this identification allows us to introduce a \ldqcanonical\rdq metric on referenced to as the Sasaki metric
Together with , the tangent bundle is a Riemannian manifold. However, in fact we are more interested into a submanifold of , that is, the unit tangent bundle
We call the Riemannian measure , which is induced by the Sasaki metric onto the unit tangent bundle , the Liouville measure. Observe that the restriction is a Riemannian submersion, and for the fiber equipped with the metric induced by is . Note that additionally for all . Applying Fubini’s theorem for manifolds (see [Sakai, Theorem II.5.6]), this yields a nice expression for the Liouville measure, namely we have
(26) |
for all measurable functions .
In the following, we combine the Liouville measure with a family of maps indexed by which can be interpreted as \ldqwalking along the geodesic specified by with step length and then doing a U-turn\rdq. To this end, for we denote the geodesic flow by
For more details on the geodesic flow see [Sakai, Section II.4.II]. Moreover, we define a flip on the unit tangent bundle
Then we set
Observe that the Liouville measure is invariant under the geodesic flow (see [Sakai, Exercise II.16]) and under the flip (see [Paternain, Lemma 1.34]). Therefore we have due to the representation in (26) for all measurable functions and all that
(27) |
i.e., the Liouville measure is invariant under .
We shed some further light on the interaction of and the geodesics.
Remark 17.
Let , and . Using the rescaling property of geodesics (see [Lee, Lemma 5.18]) and the chain rule, we have
Since 111111This is a basic property of a flow, see e.g. [LeeSmooth, Chapter 9]. Observe that the geodesic flow is a flow on the (unit) tangent bundle., this yields
In particular this implies
as .
Now we can prove the reversibility of the geodesic slice sampler.
Proof of Theorem 7.
Observe that by virtue of [latuszyinski2014convergence, Lemma 1] it suffices to show that is reversible with respect to for all .
Let and . After introducing the uniform distribution on , we get by Lemma 11 that
Using (27), we obtain
For all , we have
and by Remark 17
If we also apply Remark 17 to and to the set in the 1-dimensional Lebesgue measure in the numerator, we overall obtain
Then Lemma 12 together with Remark 17 yields
Observe that
for all , and . Hence
Note that due to Remark 14 and Remark 17, we may apply Lemma 16, such that we obtain
This expression is symmetric in and , because of the reversibility of the shrinkage kernel. Namely Lemma 15 yields
Consequently,
∎
Acknowledgments
Data were kindly provided in part by the Human Connectome Project, WU-Minn Consortium (Principal Investigators: David Van Essen and Kamil Ugurbil; 1U54MH091657) funded by the 16 NIH Institutes and Centers that support the NIH Blueprint for Neuroscience Research; and by the McDonnell Center for Systems Neuroscience at Washington University.
We thank Rudrasis Chakraborty and Clément Mantoux for interesting discussions regarding the experiments on video actions and network data. We thank Eric Moulines, Randal Douc and Philip Schär for fruitful conversations regarding the theoretical part of this paper. MH and DR gratefully acknowledge support of the DFG within project 32680300 – SFB 1456 subproject B02. MH expresses her gratitude for the hospitality of École Polytechnique.
Appendix
Appendix A Properties of the stepping-out procedure
In this section we provide the proofs of Lemma 11 and Lemma 12. To this end, we suppose that we are in the setting of Section 4.1 summarized as follows:
Setting C.
In the proof of both lemmas we push a corresponding property of the sequences and to the stopped sequences. Observe that it is relatively simple to establish the required properties for and , because their joint distributions are available explicitly.
Lemma 18.
Assume we are in Setting C. Let and . We have for all that
Proof.
Let , and . Then, as , we have
∎
We present partitions of certain events that facilitate the extension of properties of the sequences and to the stepping-out distribution.
Lemma 19.
Suppose we are in Setting C. Then for all we have
-
1.
-
2.
-
3.
-
4.
for all .
Proof.
To 2: Let . Then there exists
In the first case, choosing and taking (22) into account, we get , and . Or, in the second case, , and . Hence
Conversely, let there exists
Then clearly .
The following lemma collects the properties of and and the stopped sequences which lead to the proof of Lemma 11.
Lemma 20.
Proof.
To 2: Let , and . If , the statement is true by (22), which implies that both probabilities are zero. We now consider . To this end, for all with and set
Then for holds
(28) |
We can use these sets to express events of the form . Namely, observe that by (22) we have . Together with the independence of from and , we get
(29) |
If we use this for and , we get
Applying statement 1 and then (28) yields
Then doing an index shift and using (29) for and , we obtain
To 3: Let . Using (22), we obtain
By virtue of Lemma 19.1 and 19.4 this yields
We apply statement 2 in the first sum straight forwardly. In the second sum we apply statement 2 with and and reversed roles of and . Relocating the summand for from the first to the second sum, we obtain
Then the lower bounds in (22) imply
Changing the order of summation, we obtain
Together with the upper bound in (22) this yields
By virtue of Lemma 19.2 to 19.4 we get
∎
Lemma 11 is now an easy corollary of the previous lemma.
Proof of Lemma 11.
Before addressing Lemma 12, we take a brief detour, because some of the notation of the proof of Lemma 20 is convenient to show measureability of the stepping-out distribution in the argument .
Remark 21.
We turn to the proof of Lemma 12. Its essential strategy is to leverage the following property of and to the stepping-out distribution.
Lemma 22.
Assume that we are in Setting C.
-
1.
Let , and . Then we have
-
2.
For all , , and holds
Appendix B Uniform simple slice sampling
Let be a -finite measure space. For illustrative purposes we introduce uniform simple slice sampling for probability measures on that are absolutely continuos with respect to . More precisely, let
be a measurable function that satisfies
We consider the probability measure
(30) |
that has unnormalized density with respect to . We define the level sets as
and the essential supremum norm of
By definition of and because
we have that for almost all holds . Consequently, the Markov kernel of the uniform simple slice sampler
(31) |
where
is well defined.
Lemma 23.
Proof.
Let . We have
This expression is symmetric in and , such that
∎
Appendix C Manifolds
We do not give a complete introduction to differential geometry in this section. The aim is rather to give a better understanding on the key objects used by the geodesic slice sampler and provide references for what is outside the scope of this paper.
C.1 Tangent space and Riemannian metric
We revise some selected basic concepts on manifolds. For a more thorough introduction to these objects see [Boothby, Sections I, III, IV, V]. Let be a -dimensional, smooth manifold that is connected as a set. The most defining property of such an object is the following: For all there exists an open set containing and an open set such that there is a homeomorphism . We call the tuple a coordinate neighborhood. Since is smooth, it is equipped with an atlas that contains coordinate neighborhoods for all points of , such that for all the map , which is a map from a subset of to a subset of , is a diffeomorphism. At all we can define the tangent space to at , which is the set of all linear mappings from the space of germs at to satisfying the Leibniz rule. Observe that each coordinate neighborhood of induces vector fields, the coordinate frames. Essentially a vector field associates to each point an element . The coordinate frames form a basis of the tangent spaces , turning it into a -dimensional vector space, at each . We denote the coordinate frames induced by as
If there exists a smooth field of symmetric, positive definite bilinear forms, the manifold is called Riemannian and is called a Riemannian metric on . Essentially this means that associates to each a symmetric, positive definite bilinear form , which turns into an inner product space.
C.2 Geodesics
We call a map from an interval in to the manifold a curve. The Riemannian structure on induces a special class of curves on , namely the geodesics. Intuitively, a geodesic can be thought of as a curve of constant velocity. For a formal definition of geodesics consult [Boothby, Section VII.5]. We say the manifold is geodesically complete if all geodesics can be extended such that their domain is . In this case, by virtue of [Lee, Corollary 4.28], for all and all there exists a unique geodesic
(32) |
satisfying and for the velocity vector field at zero. Intuitively, can be thought of as the geodesic through in direction . It can also be written in terms of the exponential map (see [Boothby, Section VII.6]) as
By the Hopf-Rinow Theorem (see e.g [Lee, Theorem 6.19]) geodesic completeness is equivalent to metric completeness.
C.3 The Riemannian measure
We now introduce a measure on which can be viewed as an extension of the Lebesgue measure to Riemannian manifolds, see also [Sakai, Section II.5]. The topology on induces as a Borel--algebra, which we denote as . We need the following family of functions: Given a coordinate neighborhood , we use the Gram matrix of the coordinate frames evaluated at to introduce the function
Moreover, as is by definition second countable, there exists a countable collection of coordinate neighborhoods such that .121212A space where every cover of open sets contains a countable subcover is called Lindelöf space. Second countable spaces are Lindelöf. Note that admits a partition of unity subordinate to , see [Sakai, Section I.2.1], i.e.,
-
•
is a -function131313A function is called a -function if for all coordinate neighborhoods the function is a smooth function. with support for all ,
-
•
is locally finite141414For all there exists a neighborhood of such that holds only for finitely many . ,
-
•
for all .
Then the Riemannian measure induced by the Riemannian metric
(33) |
defines a measure on the measure space . We provide some brief arguments that the Riemannian measure is well-defined and indeed a measure. Observe that is Borel measurable, and and are smooth for all and all . Hence the appearing Lebesgue integrals are defined. For independence of the construction from the choice of open covering and partition of unity see [Sakai, page 62]. The -additivity of is inherited from the -additivity of the Lebesgue integral. Applying standard extension arguments using the additivity of the Lebesgue integral and monotone convergence theorem, we can extend (33) to measurable functions yielding
C.4 The uniform distribution on the unit tangent spheres
Throughout this section we fix . We denote by
the unit tangent sphere to at . It is immersed into via the identity map . For all , this induces a map on the tangent spaces, see [Boothby, Theorem IV.1.2]. As is a -dimensional inner product space, the tangent space to at is again , see [Boothby, Section II.3] for more details on this construction. Because then becomes , we can exploit this to define the Riemannian metric on given by
see [Boothby, Corollary V.2.5]. As described in Appendix C.3, induces the (Riemannian) measure on . Since is compact, is finite, and we may define the probability measure
Note that, up to a push forward under an isometric isomorphism, is the uniform distribution on the Euclidean unit sphere .
Appendix D Experiments appendix
D.1 A practical case: Understanding the variability in graph data sets.
Parameter | ||||
---|---|---|---|---|
Synthetic data estimation | 0.1 | 2 | ||
Missing link imputation | 0.3 | 2 | [60,20,20,20,5] |