
Geodesic slice sampling on Riemannian manifolds

Alain Durmus (École Polytechnique, France), Samuel Gruffaz (Université Paris Saclay, France), Mareike Hasenpflug and Daniel Rudolf (University of Passau, Germany)

Abstract

We propose a theoretically justified and practically applicable slice sampling based Markov chain Monte Carlo (MCMC) method for approximate sampling from probability measures on Riemannian manifolds. Such measures naturally arise as posterior distributions in Bayesian inference of matrix-valued parameters, for example those belonging to either the Stiefel or the Grassmann manifold. Our method, called geodesic slice sampling, is reversible with respect to the distribution of interest, and generalizes Hit-and-run slice sampling on $\mathbb{R}^{d}$ to Riemannian manifolds by using geodesics instead of straight lines. We demonstrate the robustness of our sampler’s performance compared to other MCMC methods dealing with manifold-valued distributions through extensive numerical experiments, on both synthetic and real data. In particular, we illustrate its remarkable ability to cope with anisotropic target densities, without using gradient information or preconditioning.

1 Introduction

In statistical models for real world phenomena it is natural to incorporate geometric knowledge in terms of manifolds into state- or parameter-spaces. This allows one to better capture dependencies, to easily embed additional constraints and to include expert/data guidance. Extracting information with Bayesian inference from such models requires the ability to sample (at least approximately) from the usually highly intractable posterior distribution on the manifold. However, due to the geometric features involved, standard $\mathbb{R}^{d}$ approaches are not directly applicable. This highlights the importance of developing efficient sampling techniques specifically designed for manifolds.

In this paper we tackle this problem and consider a target measure $\pi$ on a general Riemannian manifold $\mathsf{M}$ that has a density with respect to the Riemannian measure $\nu_{\mathfrak{g}}$ of the manifold, i.e., that is of the form

\pi({\rm d}x)=\frac{p(x)}{\int_{\mathsf{M}}p(y)\,\nu_{\mathfrak{g}}({\rm d}y)}\,\nu_{\mathfrak{g}}({\rm d}x), \tag{1}

where $p:\mathsf{M}\to(0,\infty)$ is integrable with respect to the Riemannian measure $\nu_{\mathfrak{g}}$ . Such frameworks are encountered in various applications, e.g., brain connectivity network analysis [mantoux2021understanding], dimensionality reduction [holbrook2016bayesian], computer vision [lui2012advances], texture analysis [kunze2004bingham] and protein conformation modeling [hamelryck2006sampling]. In most instances, the Riemannian manifolds that arise are matrix manifolds, such as the Stiefel manifold or the Grassmann manifold [edelman1998geometry].

One popular way to approximately sample from intractable distributions is to use Markov chain Monte Carlo (MCMC) methods. We follow this approach and propose a practical MCMC algorithm based on slice sampling techniques. This method, which we refer to as the geodesic slice sampler (GSS), incorporates the geometry of the underlying Riemannian manifold by using its geodesics. We briefly describe the transition mechanism of GSS. It crucially exploits the fact that on a Riemannian manifold for every pair $(x,v)$, where $x\in\mathsf{M}$ and $v\in T_{x}\mathsf{M}$ is an element of the tangent space at $x$, there exists a unique geodesic $\gamma_{(x,v)}$ emanating from $x$ in direction $v$.

More precisely, given the current state $x\in\mathsf{M}$ , a single transition of the GSS targeting the distribution $\pi$ proceeds in three steps. First, a level $t$ is uniformly sampled from the interval $(0,p(x))$ . Second, a geodesic $\gamma_{(x,v)}$ that passes through $x$ is randomly chosen. Lastly, a point is generated from the intersection of the geodesic $\gamma_{(x,v)}$ and the level set $L(t):=\{y\in\mathsf{M}\mid p(y)>t\}$ . This final step presents the most significant challenge and requires special care to ensure invariance of the target distribution. To this end, we carefully adapt Neal’s stepping-out and shrinkage procedure (detailed in [neal2003slice]) to our manifold setting.

Metropolis-Hastings, Hamiltonian Monte Carlo and Langevin-type algorithms have already been adapted to the Riemannian manifold setting. Moreover, there exist several tailor-made algorithms for specific manifolds. Consult Section 2.3 for a literature review of MCMC-methods on Riemannian manifolds. However, to the best of our knowledge GSS represents the first practical slice sampling-based MCMC method applicable to general Riemannian manifolds. Following a slice sampling paradigm is appealing because, by design, the length of the transition step is chosen adaptively for each transition. This is especially advantageous when efficiently exploring the target distribution requires varying the length of transition steps based on the position and direction of the move, e.g., due to anisotropy. It is worth noting that some slice sampling algorithms achieve this without the need for any tunable hyperparameters. However, a practical implementation usually introduces additional tunable parameters. Nonetheless, it is expected that the performance of the resulting slice sampling algorithms will be more robust with regard to the choice of these parameters compared to, for example, the sensitivity of a Metropolis-Hastings or Hamiltonian Monte Carlo algorithm to the selection of the step size. We refer to [neal2003slice] for some more details on these properties of slice sampling; see also [murray2010elliptical] for a further discussion of advantages and limitations of a slice sampling approach.

To conclude this introduction, our main contributions can be summarized as follows:

• We propose a practical slice sampling based MCMC-method, which we call geodesic slice sampling, to target distributions of the form (1). It combines a geodesic Hit-and-run algorithm with a 1-dimensional slice sampler, arriving at a method that generalizes Hit-and-run slice sampling to Riemannian manifolds.

• We demonstrate the applicability of GSS in numerical experiments and evaluate its strong suits as well as its drawbacks. Its main feature is its robustness to the geometry of the target, which it achieves without using gradient information and with an easy tuning of parameters using the diameter of the manifold.

• We verify that GSS has the correct invariant distribution by showing reversibility with respect to $\pi$.

The structure of the paper is as follows. First we provide an introduction to slice sampling on $\mathbb{R}^{d}$ in Section 2.1, before turning to GSS on general Riemannian manifolds in Section 2.2. This is followed by a literature review of MCMC-methods on Riemannian manifolds in Section 2.3. Readers who wish for more details on the differential geometry background used in Section 2 may find them in Appendix C. Section 3 is devoted to numerical experiments. The proof of the reversibility of GSS with respect to $\pi$ is given in Section 4. A formal treatment of the stepping-out and shrinkage procedure is also included in this section.

1.1 General notation

We introduce some general notation that is valid throughout the whole paper. Let $\mathbb{N}$ be the set of strictly positive integers and set $\mathbb{N}_{0}:=\mathbb{N}\cup\{0\}$. We denote by $\mathrm{Leb}_{d}$ the $d$-dimensional Lebesgue measure on $\mathbb{R}^{d}$ and by $\mathcal{B}(\mathbb{R}^{d})$ the Borel-$\sigma$-algebra. Similarly, for a set $\mathsf{S}\in\mathcal{B}(\mathbb{R}^{d})$ we write $\mathcal{B}(\mathsf{S})$ for the trace Borel-$\sigma$-algebra on $\mathsf{S}$. Moreover, we set $\mathbb{S}^{d-1}:=\{x\in\mathbb{R}^{d}\mid\|x\|=1\}$ to be the $(d-1)$-dimensional Euclidean unit sphere. For $x\in\mathbb{R}^{d}$, let $x^{\top}$ be its transpose, and write $\mathrm{Id}_{d}\in\mathbb{R}^{d\times d}$ for the identity matrix. If $\mathsf{S}\in\mathcal{B}(\mathbb{R}^{d})$ is a finite set or satisfies $\mathrm{Leb}_{d}(\mathsf{S})\in(0,\infty)$, denote the discrete, respectively continuous, uniform distribution on $\mathsf{S}$ by $\mathrm{Unif}(\mathsf{S})$. Whenever we introduce random variables in the sequel, we assume them to be defined on some rich enough probability space $(\Omega,\mathcal{F},\mathbb{P})$. Let $(\mathsf{X},\mathcal{X})$ be a measurable space and let $R$ be an $(\mathsf{X},\mathcal{X})$-valued random variable. Then we denote by

\mathbb{P}^{R}(\mathsf{A}):=\mathbb{P}(R\in\mathsf{A}),\qquad\mathsf{A}\in\mathcal{X},

the distribution of $R$ . If $\mathbb{P}^{R}=\mu$ for some probability measure $\mu$ , then we also write $R\sim\mu$ . Furthermore, let $\mathsf{Y}$ be a possibly different set and let $f:\mathsf{X}\to\mathsf{Y}$ be a map from $\mathsf{X}$ to $\mathsf{Y}$ . For some set $\mathsf{S}\subseteq\mathsf{X}$ , we denote by $f|_{\mathsf{S}}$ the restriction of $f$ to $\mathsf{S}$ . Now equip also $\mathsf{Y}$ with a $\sigma$ -algebra and turn it into the measurable space $(\mathsf{Y},\mathcal{Y})$ . Additionally assume $f$ to be measurable and let $\mu$ be a measure on $(\mathsf{X},\mathcal{X})$ . We call

f_{\sharp}\mu(\mathsf{A}):=\mu\left(f^{-1}(\mathsf{A})\right),\qquad\mathsf{A}\in\mathcal{Y},

the push forward measure of $\mu$ under $f$ , and for $\mathsf{S}\in\mathcal{X}$ we call

\mu|_{\mathsf{S}}(\mathsf{A}):=\mu(\mathsf{S}\cap\mathsf{A}),\qquad\mathsf{A}\in\mathcal{X},

the restriction of $\mu$ to $\mathsf{S}$. To emphasize that a union is disjoint we write $\sqcup$. Finally, for simplicity we also use $\land$ and $\lor$ to denote the minimum and the maximum, respectively, of two real numbers, i.e., $r\land s:=\min\{r,s\}$ and $r\lor s:=\max\{r,s\}$ for $r,s\in\mathbb{R}$.

2 Methodology: Geodesic slice sampling

2.1 Slice sampling on $\mathbb{R}^{d}$

In this section we revisit slice sampling on $\mathbb{R}^{d}$ in order to provide a general introduction to the ideas of (uniform simple) slice sampling. The slice sampler, as all MCMC-methods, defines a Markov chain which, for a given measurable unnormalized density $p:\mathbb{R}^{d}\to(0,\infty)$ that satisfies $\int_{\mathbb{R}^{d}}p(y)\mathrm{Leb}_{d}({\rm d}y)\in(0,\infty)$ , can be used to approximately sample from the probability measure

\pi({\rm d}x)=\frac{p(x)}{\int_{\mathbb{R}^{d}}p(y)\,\mathrm{Leb}_{d}({\rm d}y)}\ \mathrm{Leb}_{d}({\rm d}x).

At its heart are the (super) level sets of $p$ defined by

L(t):=\{x\in\mathbb{R}^{d}\mid p(x)>t\},\qquad t\in(0,\infty),

which contain all points that have a function value with respect to $p$ that is greater than a specified value. The uniform simple slice sampler, which we also call idealized slice sampler, generates approximate samples from $\pi$ by drawing suitably from these level sets.

We denote by $(Y_{k})_{k\in\mathbb{N}}$ the Markov chain that corresponds to the idealized slice sampler. A transition from step $Y_{k}=x\in\mathbb{R}^{d}$ to step $Y_{k+1}$ works as follows:

1. Draw a random level $T_{k+1}\sim\mathrm{Unif}\big((0,p(x))\big)$, call the result $t$. This specifies a level set $L(t)$.

2. Draw $Y_{k+1}\sim\mathrm{Unif}(L(t))$ uniformly from this level set $L(t)$.

Consequently, for any $k\in\mathbb{N}$ this defines the conditional distributions

\begin{aligned}\mathbb{P}\left(T_{k+1}\in\cdot\mid Y_{1},\ldots,Y_{k},T_{1},\ldots,T_{k}\right)&=\mathbb{P}\left(T_{k+1}\in\cdot\mid Y_{k}\right)=\mathrm{Unif}\big((0,p(Y_{k}))\big),\\ \mathbb{P}\left(Y_{k+1}\in\cdot\mid Y_{1},\ldots,Y_{k},T_{1},\ldots,T_{k+1}\right)&=\mathbb{P}\left(Y_{k+1}\in\cdot\mid T_{k+1}\right)=\mathrm{Unif}\big(L(T_{k+1})\big).\end{aligned}

A graphic representation of the transition mechanism for $d=1$ can be found in Figure 1.

[Figure 1, panel (a): Sample the level $t$ uniformly from $(0,p(x))$.]
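The two steps of the idealized slice sampler can be sketched in a case where $\mathrm{Unif}(L(t))$ is available in closed form. The following is a minimal illustration, assuming the one-dimensional target $p(x)=\exp(-x^{2}/2)$, for which $L(t)=(-b,b)$ with $b=\sqrt{-2\log t}$; the function names `idealized_slice_step` and `run_chain` are hypothetical and chosen only for this example.

```python
import math
import random


def idealized_slice_step(x, rng):
    """One idealized slice sampler transition for p(x) = exp(-x^2 / 2)."""
    p_x = math.exp(-0.5 * x * x)
    # Step 1: draw the level t uniformly from (0, p(x)).
    t = rng.uniform(0.0, p_x)
    while t <= 0.0:  # guard against the measure-zero endpoint t = 0
        t = rng.uniform(0.0, p_x)
    # Step 2: for this p the level set L(t) is the interval (-b, b),
    # so the uniform distribution on L(t) can be sampled exactly.
    b = math.sqrt(-2.0 * math.log(t))
    return rng.uniform(-b, b)


def run_chain(n_steps, x0=0.0, seed=0):
    """Run the chain for n_steps transitions, starting from x0."""
    rng = random.Random(seed)
    chain = [x0]
    for _ in range(n_steps):
        chain.append(idealized_slice_step(chain[-1], rng))
    return chain
```

For general targets the level sets are not available in this explicit form, which is what motivates the hybrid strategies discussed next.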

We can also describe the idealized slice sampler through its transition kernel given by

\begin{aligned}H:\mathbb{R}^{d}\times\mathcal{B}(\mathbb{R}^{d})&\to[0,1]\\ (x,\mathsf{A})&\mapsto\frac{1}{p(x)}\int_{(0,p(x))}\frac{1}{\mathrm{Leb}_{d}\big(L(t)\big)}\int_{L(t)}\mathbbm{1}_{\mathsf{A}}(y)\ \mathrm{Leb}_{d}({\rm d}y)\,\mathrm{Leb}_{1}({\rm d}t).\end{aligned}

Since slice sampling was brought to the attention of the statistics community in [besag1993spatial], the properties of idealized slice sampling, including reversibility with respect to $\pi$ and conditions for ergodicity, have been investigated in several works, e.g., [mira2002efficiency, natarovskii2021quantitative, roberts2002convergence, rudolf2013positivity, rudolf2018comparison]. As an example, the following result illustrates a major advantage of slice sampling: In [natarovskii2021quantitative, Corollary 3.7], Natarovskii et al. show that the spectral gap of $H$ only depends on the level set function $t\mapsto\mathrm{Leb}_{d}(L(t))$, i.e., on the volume of the level sets, not their shape. This means that the performance of the idealized slice sampler is unaffected by the introduction of, e.g., multimodality, local modes or anisotropy as long as the volume of the level sets is not modified.

Unfortunately, each transition of the idealized slice sampler requires sampling from the uniform distribution $\mathrm{Unif}(L(t))$, $t>0$, on a level set. The level sets are $d$-dimensional, measurable sets and in general there is no efficient algorithm to tackle this problem. This is a major obstacle to the implementation of the idealized slice sampler. One modification strategy to obtain a practical algorithm is called hybrid slice sampling, see [latuszyinski2014convergence]. Here, the uniform distribution on the level sets is replaced by a family of kernels $(H_{t})_{t>0}$ such that for any $t>0$ (where it is well-defined) $\mathrm{Unif}(L(t))$ is invariant for $H_{t}$. This leads to a transition kernel of the form

(x,\mathsf{A})\mapsto\frac{1}{p(x)}\int_{(0,p(x))}H_{t}(x,\mathsf{A})\ \mathrm{Leb}_{1}({\rm d}t),\qquad x\in\mathbb{R}^{d},\ \mathsf{A}\in\mathcal{B}(\mathbb{R}^{d}).

In order to get a better understanding of this strategy, we consider as an example the case $d=1$ introduced in [neal2003slice]. There a stepping-out and a shrinkage procedure are proposed. In the following we aim to provide a basic understanding of the concepts behind these two schemes. To this end, we give a verbal description, and a visualization in Figure 3. The stepping-out procedure is treated in more detail in Section 4.1. This includes pseudocode and a careful definition of the generated distributions. For an extensive treatment of the shrinkage procedure, we refer to [ReversibilityEllipticalSliceSampler].

The stepping-out and shrinkage based hybrid slice sampler takes a point $x\in\mathbb{R}$ (current state of Markov chain) and a level set $L(t)$ (level generated as in the first step of the transition mechanism of the idealized slice sampler), and proceeds in two steps.

1. Stepping-out:

Under the specification of two hyper parameters $w>0$ and $m\in\mathbb{N}$ , the stepping-out procedure chooses a random segment of $\mathbb{R}$ containing $x$ . To this end, an interval of length $w$ is placed randomly around $x$ by sampling the left interval boundary point $L_{1}\sim\mathrm{Unif}\big{(}(x-w,x)\big{)}$ and setting the right interval boundary to $R_{1}=L_{1}+w$ . Then this interval is extended iteratively to the left by intervals of length $w$ until for the first time the left boundary leaves the level set $L(t)$ , or the maximal number of stepping-out steps to the left $\upiota$ is reached. Here $\upiota$ is obtained by randomly splitting $m+1$ , the maximal number of total stepping-out steps, into two summands $\upiota$ and $m+1-\upiota$ . Similarly the interval is extended iteratively to the right by intervals of length $w$ until the right boundary hits $\mathbb{R}\setminus L(t)$ for the first time, or $m+1-\upiota$ steps have been performed. This provides a randomly generated interval $I=(L^{\ast},R^{\ast})$ , where

L^{\ast}=L_{1}-\big(\uptau_{\ell}\land(\upiota-1)\big)w\qquad\text{and}\qquad R^{\ast}=R_{1}+\big(\uptau_{r}\land(m-\upiota)\big)w

with

\uptau_{\ell}:=\inf\{k\geqslant 0\mid L_{1}-kw\notin L(t)\}\qquad\text{and}\qquad\uptau_{r}:=\inf\{k\geqslant 0\mid R_{1}+kw\notin L(t)\}.

2. Shrinkage:

We generate a point from $I\cap L(t)$ with the shrinkage procedure. Roughly speaking, it is an adaptive acceptance/rejection scheme that shrinks the proposal area with each rejection.\footnote{Note that we describe here a scheme that differs slightly from the original one in [neal2003slice] and rather resembles the one of the elliptical slice sampler, see [murray2010elliptical].} Crucially, we view here the interval $I$ as a circle, i.e., if one ``leaves'' the interval at the right boundary, it is immediately ``reentered'' at the left boundary. Observe that a set $\mathsf{J}\subseteq\mathbb{R}$ (viewed as a circle) is divided by two points $y,z\in\mathsf{J}$ into two segments, namely $(y\land z,y\lor z)\cap\mathsf{J}$ and $\mathsf{J}\setminus(y\land z,y\lor z)$. If $\mathsf{J}$ contains the initial point $x$, we set

\mathbb{J}(y,z,\mathsf{J}):=\begin{dcases}(y\land z,y\lor z)\cap\mathsf{J},&\text{if }x\in(y\land z,y\lor z),\\ \mathsf{J}\setminus(y\land z,y\lor z),&\text{otherwise},\end{dcases}

to be the segment containing $x$ . The shrinkage procedure now builds a sequence of such segments. First we sample $Y_{1}\sim\mathrm{Unif}(I)$ , and set $\mathsf{J}_{1}=I$ . For $k\in\mathbb{N}$ , we let $Y_{k+1}$ be a random variable with conditional distribution

\mathbb{P}\left(Y_{k+1}\in\cdot\mid Y_{1},\ldots,Y_{k},\mathsf{J}_{1},\ldots,\mathsf{J}_{k}\right)=\mathrm{Unif}(\mathsf{J}_{k}),

that is, we draw the next proposal uniformly from the current segment. Observe that $Y_{k+1}$ divides $\mathsf{J}_{k}$ into two segments as described above. Set $\mathsf{J}_{k+1}=\mathbb{J}(Y_{k},Y_{k+1},\mathsf{J}_{k})$ for $k\in\mathbb{N}$ , i.e., we keep the segment of $\mathsf{J}_{k}$ that contains the initial point $x$ . This is continued until we generate a proposal that lies in $L(t)$ . Hence, overall the shrinkage procedure yields a random point $Y^{\ast}=Y_{\uptau}$ where $\uptau:=\inf\{k\in\mathbb{N}\mid Y_{k}\in L(t)\}$ . If the stepping-out and shrinkage scheme is embedded into a 1-dimensional hybrid slice sampler this point is then the next random variable of the chain.
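The stepping-out and shrinkage steps described above can be sketched for $d=1$ as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function names `step_out`, `shrink` and `slice_step` are hypothetical, the unnormalized density `p` is passed as a callable, and the circle view of the interval is realized through an offset `h` measured to the right of $x$ that wraps around at the right boundary.

```python
import math
import random


def step_out(p, x, t, w, m, rng):
    """Place an interval of length w randomly around x and extend it in steps of w."""
    left = x - rng.uniform(0.0, w)
    right = left + w
    iota = rng.randint(1, m)  # random split of the stepping-out budget
    i = j = 2
    while i <= iota and p(left) > t:  # at most iota - 1 steps to the left
        left -= w
        i += 1
    while j <= m + 1 - iota and p(right) > t:  # at most m - iota steps to the right
        right += w
        j += 1
    return left, right


def shrink(p, x, t, left, right, rng):
    """Shrinkage on (left, right) viewed as a circle; returns a point y with p(y) > t."""
    length = right - left

    def to_point(h):
        # Offsets that pass the right boundary re-enter at the left boundary.
        y = x + h
        return y - length if y > right else y

    h = rng.uniform(0.0, length)
    h_min = h_max = h
    y = to_point(h)
    while p(y) <= t:
        # Keep the segment of the circle that still contains x (offset 0).
        if h >= h_min:
            h_min = h
        else:
            h_max = h
        # Draw uniformly from (0, h_max) united with [h_min, length).
        u = rng.uniform(0.0, h_max + (length - h_min))
        h = u if u < h_max else h_min + (u - h_max)
        y = to_point(h)
    return y


def slice_step(p, x, w, m, rng):
    """One transition of the 1-dimensional hybrid slice sampler."""
    t = rng.uniform(0.0, p(x))
    left, right = step_out(p, x, t, w, m, rng)
    return shrink(p, x, t, left, right, rng)
```

A chain is obtained by iterating `x = slice_step(p, x, w, m, rng)`, for instance with `p = lambda y: math.exp(-0.5 * y * y)`, `w = 1.0`, `m = 8` and `rng = random.Random(0)`.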

We comment on the hyperparameters of the stepping-out procedure.

Remark 1.

The output of the stepping-out procedure can be viewed as a ``loose'' approximation of the set $L(t)$. The maximal possible length of this approximation is given by $mw$, but also the individual choice of $m$ and $w$ affects the quality of the approximation depending on the shape of $L(t)$. Choosing $m$ larger and $w$ smaller can lead to an interval that lies ``tighter'' around $L(t)$. However, if $L(t)$ has ``holes'', it is also more likely that parts of $L(t)$ are ``cut off''. Moreover, increasing $m$ increases the computational cost of the stepping-out procedure.

Now we have a practical slice sampling algorithm on $\mathbb{R}$. One way to lift the stepping-out and shrinkage approach to $\mathbb{R}^{d}$ is to combine it with the Hit-and-run algorithm, arriving at the so-called Hit-and-run slice sampler\footnote{Hit-and-run slice sampling is already mentioned in [Mackay, Section 29.7]. Convergence and comparison results for this sampler are provided in [latuszyinski2014convergence, rudolf2018comparison], and it is used as a benchmark approach in [murray2010elliptical, schaer2023gibbsian].}, which essentially samples a random line through the current point and then runs a slice sampler on this line. For $x\in\mathbb{R}^{d}$ and $v\in\mathbb{S}^{d-1}$ we define

\gamma_{(x,v)}(\theta)=x+\theta v,\qquad\theta\in\mathbb{R},

to be the line through $x$ in direction $v$ , and for $t>0$

L(x,v,t):=\{\alpha\in\mathbb{R}\mid p\left(\gamma_{(x,v)}(\alpha)\right)>t\}=\{\alpha\in\mathbb{R}\mid\gamma_{(x,v)}(\alpha)\in L(t)\}

to be the parameterized intersection of a straight line with a level set. The transition mechanism of the Hit-and-run slice sampler then proceeds as depicted in Figure 4.

For a complete algorithmic description see Algorithm 1 with $\sigma_{d-1}^{(x)}$ being the uniform distribution on $\mathbb{S}^{d-1}$ for all $x\in\mathbb{R}^{d}$ . Equivalently, the transition mechanism can be described as first sampling a direction $v$ uniformly from $\mathbb{S}^{d-1}$ , and then running a 1-dimensional stepping-out and shrinkage based slice sampler for the unnormalized density $\alpha\mapsto p\left(\gamma_{(x,v)}(\alpha)\right)$ with initial point 0.
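The reduction to a one-dimensional problem can be made concrete with a short sketch. It assumes points are stored as Python lists and uses the standard fact that a normalized standard Gaussian vector is uniformly distributed on $\mathbb{S}^{d-1}$; the helper names `uniform_direction` and `restrict_to_line` are hypothetical.

```python
import math
import random


def uniform_direction(d, rng):
    """Draw v uniformly from the unit sphere S^{d-1} by normalizing a Gaussian vector."""
    while True:
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(c * c for c in g))
        if norm > 0.0:  # reject the (measure-zero) zero vector
            return [c / norm for c in g]


def restrict_to_line(p, x, v):
    """Return alpha -> p(x + alpha * v), the density along the line through x in direction v."""
    def p_line(alpha):
        return p([xi + alpha * vi for xi, vi in zip(x, v)])
    return p_line
```

Any 1-dimensional stepping-out and shrinkage slice sampler can then be run on `p_line` with initial point 0, and the resulting `alpha` is mapped back to $x+\alpha v$.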

Our interest in this framework arises from the fact that, by generalizing straight lines to geodesics, it can be lifted to general Riemannian manifolds, which we discuss in the next section.

2.2 Geodesic slice sampling

We now turn to slice sampling on Riemannian manifolds. More precisely, in the following we consider sampling from measures $\pi$ defined on a state space $\mathsf{M}$ satisfying the following assumption:

Assumption A.

Let $\mathsf{M}$ be a $d$ -dimensional, smooth, connected manifold. In addition we assume that $\mathsf{M}$ is endowed with a Riemannian metric $\mathfrak{g}$ and is complete.

(For the sake of brevity, we keep the introduction of objects from differential and Riemannian geometry to a bare minimum here. References and details on certain aspects can be found in Appendix C.) First, we need to equip $\mathsf{M}$ with a suitable reference measure. Under Assumption A there exists an atlas $(\mathsf{U}_{i},\varphi_{i})_{i\in\mathbb{N}}$ consisting of homeomorphisms $\varphi_{i}:\mathsf{U}_{i}\to\mathbb{R}^{d}$ such that for all $i,j\in\mathbb{N}$ the map $\varphi_{i}\circ\varphi_{j}^{-1}:\varphi_{j}\left(\mathsf{U}_{j}\cap\mathsf{U}_{i}\right)\to\varphi_{i}\left(\mathsf{U}_{j}\cap\mathsf{U}_{i}\right)$ is infinitely often continuously differentiable. We denote the tangent space to $\mathsf{M}$ at $x\in\mathsf{M}$ as $T_{x}\mathsf{M}$. The Riemannian metric $\mathfrak{g}:x\mapsto\mathfrak{g}_{x}$ is a smooth field of symmetric, positive definite covariant 2-tensors. Let $\mathcal{B}(\mathsf{M})$ be the Borel-$\sigma$-algebra induced by the topology of $\mathsf{M}$. The measure $\nu_{\mathfrak{g}}$ on $\mathsf{M}$ is defined for any $\mathsf{A}\in\mathcal{B}(\mathsf{M})$ as

\nu_{\mathfrak{g}}(\mathsf{A}):=\sum_{i=1}^{\infty}\int_{\varphi_{i}(\mathsf{U}_{i})}\left(\rho_{i}\cdot\mathbbm{1}_{\mathsf{A}}\cdot\sqrt{\det(g,\varphi_{i})}\right)\circ\varphi_{i}^{-1}(z)\ \mathrm{Leb}_{d}({\rm d}z), \tag{2}

where $\{\rho_{i}\}_{i\in\mathbb{N}}$ is a partition of unity subordinate to $\{\mathsf{U}_{i}\}_{i\in\mathbb{N}}$ , and for $i\in\mathbb{N}$

\begin{aligned}\sqrt{\det(g,\varphi_{i})}:\mathsf{U}_{i}&\to[0,\infty)\\ x&\mapsto\sqrt{\det\left[\left(\mathfrak{g}_{x}(E_{j,x}^{\varphi_{i}},E_{k,x}^{\varphi_{i}})\right)_{\{1\leqslant j,k\leqslant d\}}\right]}\end{aligned}

with $E_{1}^{\varphi_{i}},\ldots,E_{d}^{\varphi_{i}}$ being the coordinate frames associated to $(\mathsf{U}_{i},\varphi_{i})$ . We call $\nu_{\mathfrak{g}}$ the Riemannian measure induced by $\mathfrak{g}$ . It can be viewed as an extension of the Lebesgue measure to Riemannian manifolds, see e.g. [AnalysisIII, Section XII.1]. We provide some examples for manifolds satisfying Assumption A.

Example 2.

The simplest example of a $d$-dimensional, smooth, connected manifold is $\mathbb{R}^{d}$. Since for each $x\in\mathbb{R}^{d}$ the tangent space to $\mathbb{R}^{d}$ at $x$ is again $\mathbb{R}^{d}$, we can equip it with the Riemannian metric

\mathfrak{g}_{x}(v_{1},v_{2})=v_{1}^{\top}v_{2},\qquad x\in\mathbb{R}^{d},\ v_{1},v_{2}\in T_{x}\mathbb{R}^{d},

rendering $\mathbb{R}^{d}$ complete. The induced Riemannian measure is the Lebesgue measure.

We equip the Euclidean unit sphere $\mathbb{S}^{d-1}$ with the standard Riemannian metric $\widehat{\mathfrak{g}}$ induced by its embedding $\mathrm{Id}:\mathbb{S}^{d-1}\to\mathbb{R}^{d}$ in $\mathbb{R}^{d}$ , that is,

\widehat{\mathfrak{g}}_{x}(v_{1},v_{2})=\left(\mathrm{Id}_{*}(v_{1})\right)^{\top}\mathrm{Id}_{*}(v_{2}),\qquad x\in\mathbb{S}^{d-1},\ v_{1},v_{2}\in T_{x}\mathbb{S}^{d-1},

where $\mathrm{Id}_{*}$ is the map on the tangent spaces induced by $\mathrm{Id}$ . Then $(\mathbb{S}^{d-1},\widehat{\mathfrak{g}})$ satisfies Assumption A. The corresponding Riemannian measure $\nu_{\widehat{\mathfrak{g}}}$ is the standard volume measure.

Let $k,n\in\mathbb{N}$ with $k\leqslant n$ . The $k(k-1)/2+k(n-k)$ -dimensional, smooth, connected Stiefel manifold

\mathcal{V}(n,k):=\{\Gamma\in\mathbb{R}^{n\times k}\mid\Gamma^{\top}\Gamma=\mathrm{Id}_{k}\},

consists of (ordered) $k$-tuples of vectors in $\mathbb{R}^{n}$ that form an orthonormal system. This means that each point on the Stiefel manifold describes (not uniquely) a $k$-dimensional subspace of $\mathbb{R}^{n}$. To characterize the tangent space $T_{\Gamma}\mathcal{V}(n,k)$ at a point $\Gamma\in\mathcal{V}(n,k)$, we need a matrix $\Gamma_{\perp}\in\mathbb{R}^{n\times(n-k)}$ such that the columns of $\Gamma$ and $\Gamma_{\perp}$ form an orthonormal basis of $\mathbb{R}^{n}$. Then we have

T_{\Gamma}\mathcal{V}(n,k):=\{\Gamma\Pi+\Gamma_{\perp}\Sigma\in\mathbb{R}^{n\times k}\mid\Pi\in\mathbb{R}^{k\times k}\text{ skew symmetric},\ \Sigma\in\mathbb{R}^{(n-k)\times k}\}.

If we introduce the Riemannian metric

\mathfrak{g}_{\Gamma}(\Delta_{1},\Delta_{2})=\mathrm{Tr}\big(\Delta_{1}^{\top}(\mathrm{Id}_{n}-\tfrac{1}{2}\Gamma\Gamma^{\top})\Delta_{2}\big)=\frac{1}{2}\mathrm{Tr}(\Pi_{1}^{\top}\Pi_{2})+\mathrm{Tr}(\Sigma_{1}^{\top}\Sigma_{2}),

where $\Delta_{1}=\Gamma\Pi_{1}+\Gamma_{\perp}\Sigma_{1},\ \Delta_{2}=\Gamma\Pi_{2}+\Gamma_{\perp}\Sigma_{2}\in T_{\Gamma}\mathcal{V}(n,k)$ and $\mathrm{Tr}$ denotes the trace of a matrix, Assumption A holds true for $\mathcal{V}(n,k)$. For more details see [edelman1998geometry].
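The equality of the two expressions for the Stiefel metric follows from the orthonormality of the columns of $\Gamma$ and $\Gamma_{\perp}$. The short derivation below is a sketch that uses only the definitions of this example; it is added for illustration and is not quoted from [edelman1998geometry].

```latex
\begin{align*}
&\Gamma^{\top}\Gamma=\mathrm{Id}_{k},\quad
 \Gamma_{\perp}^{\top}\Gamma_{\perp}=\mathrm{Id}_{n-k},\quad
 \Gamma^{\top}\Gamma_{\perp}=0
 \;\Longrightarrow\;
 \Gamma^{\top}\Delta_{i}=\Pi_{i},\quad
 \Delta_{1}^{\top}\Delta_{2}=\Pi_{1}^{\top}\Pi_{2}+\Sigma_{1}^{\top}\Sigma_{2},\\
&\mathrm{Tr}\big(\Delta_{1}^{\top}(\mathrm{Id}_{n}-\tfrac{1}{2}\Gamma\Gamma^{\top})\Delta_{2}\big)
 =\mathrm{Tr}(\Delta_{1}^{\top}\Delta_{2})
  -\tfrac{1}{2}\mathrm{Tr}\big((\Gamma^{\top}\Delta_{1})^{\top}(\Gamma^{\top}\Delta_{2})\big)\\
&\qquad=\mathrm{Tr}(\Pi_{1}^{\top}\Pi_{2})+\mathrm{Tr}(\Sigma_{1}^{\top}\Sigma_{2})
  -\tfrac{1}{2}\mathrm{Tr}(\Pi_{1}^{\top}\Pi_{2})
 =\tfrac{1}{2}\mathrm{Tr}(\Pi_{1}^{\top}\Pi_{2})+\mathrm{Tr}(\Sigma_{1}^{\top}\Sigma_{2}).
\end{align*}
```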

The need to sample from measures $\pi$ defined on connected, complete Riemannian manifolds occurs in several applications in Bayesian statistics, see e.g. [holbrook2016bayesian, holbrook2020nonparametric, lieAccepteddimension, mantoux2021understanding].

We assume in this paper that the measure $\pi$ on $\mathsf{M}$ admits a density with respect to the Riemannian measure $\nu_{\mathfrak{g}}$ , i.e.,

\pi({\rm d}x):=\frac{p(x)}{\int_{\mathsf{M}}p(y)\ \nu_{\mathfrak{g}}({\rm d}y)}\,\nu_{\mathfrak{g}}({\rm d}x). \tag{3}

Very much in parallel to the (special) case $\mathbb{R}^{d}$ , we are now able to develop a slice sampling approach to target $\pi$ defined on $\mathsf{M}$ . We can easily extend the uniform simple slice sampler (also called idealized slice sampler) from $\mathbb{R}^{d}$ to $\mathsf{M}$ . (In fact this works for every measure space, see Appendix B.) The level sets, containing all points where the unnormalized density $p$ is greater than a given value $t\in(0,\infty)$ , take the form

L(t):=\{x\in\mathsf{M}\mid p(x)>t\}.

The idealized slice sampler with initial point $x\in\mathsf{M}$ then proceeds, as in $\mathbb{R}^{d}$ , by first sampling a level $t$ uniformly from $\big{(}0,p(x)\big{)}$ . Then the next state of the chain is sampled according to the uniform distribution on $L(t)$ , that is, according to $\nu_{\mathfrak{g}}(L(t))^{-1}\nu_{\mathfrak{g}}|_{L(t)}$ . More precisely, the idealized slice sampler has transition kernel

\begin{aligned}H:\mathsf{M}\times\mathcal{B}(\mathsf{M})&\to[0,1]\\ (x,\mathsf{A})&\mapsto\frac{1}{p(x)}\int_{(0,p(x))}\frac{1}{\nu_{\mathfrak{g}}\big(L(t)\big)}\int_{L(t)}\mathbbm{1}_{\mathsf{A}}(y)\ \nu_{\mathfrak{g}}({\rm d}y)\,\mathrm{Leb}_{1}({\rm d}t).\end{aligned}

The implementation of the kernel requires sampling from the manifold-equivalent $\nu_{\mathfrak{g}}(L(t))^{-1}\nu_{\mathfrak{g}}|_{L(t)}$ of the uniform distribution on a level set $L(t)$. Doing this efficiently poses a problem, as in $\mathbb{R}^{d}$, because in general we have no knowledge about the shape of the level sets other than that they are $d$-dimensional, measurable sets. Therefore, we propose a hybrid slice sampler that lifts Hit-and-run slice sampling from $\mathbb{R}^{d}$ to the general Riemannian manifold $\mathsf{M}$. To this end, we need a generalization of straight lines to $\mathsf{M}$, the geodesics. A curve $\gamma:\mathsf{I}\to\mathsf{M}$, where $\mathsf{I}\subseteq\mathbb{R}$ is an interval, is called a geodesic if the covariant derivative of its velocity vector field is zero, i.e., if the equation $(D/dt)\big(({\rm d}\gamma)/({\rm d}t)\big)=0$ holds. Under Assumption A, we know that for all $x\in\mathsf{M}$ and all $v\in T_{x}\mathsf{M}$ there exists a unique geodesic

\gamma_{(x,v)}:\mathbb{R}\to\mathsf{M} \tag{4}

satisfying $\gamma_{(x,v)}(0)=x$ and $({\rm d}\gamma_{(x,v)})/({\rm d}t)|_{0}=v$ . We may interpret $\gamma_{(x,v)}$ as the geodesic emanating from $x$ in direction $v$ .
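For a concrete instance, consider the unit sphere with the embedded metric from Example 2, where geodesics are great circles and are available in closed form: $\gamma_{(x,v)}(\theta)=\cos(\theta)\,x+\sin(\theta)\,v$ for a unit tangent vector $v$ at $x$. The sketch below evaluates this formula; the function name `sphere_geodesic` and the list-based representation of points are assumptions made for this illustration.

```python
import math


def sphere_geodesic(x, v, theta):
    """Great circle through x with unit initial velocity v (v orthogonal to x), at parameter theta."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * xi + s * vi for xi, vi in zip(x, v)]
```

Here $\gamma_{(x,v)}(0)=x$, the velocity at $0$ is $v$, and the curve is $2\pi$-periodic, so on the sphere the parameter set $L(x,v,t)$ is itself periodic.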

As in $\mathbb{R}^{d}$, we want to index the geodesics emanating from a point $x\in\mathsf{M}$ by a ``unit sphere of directions'', which naturally is given by the unit tangent sphere in $T_{x}\mathsf{M}$

\mathbb{S}_{x}^{d-1}:=\{v\in T_{x}\mathsf{M}\mid\mathfrak{g}_{x}(v,v)=1\}.

The natural immersion of $\mathbb{S}^{d-1}_{x}$ into the inner product space $(T_{x}\mathsf{M},\mathfrak{g}_{x})$ via the identity induces a Riemannian metric $\widehat{\mathfrak{g}}_{x}$ on $\mathbb{S}_{x}^{d-1}$ . We call the normalization

\sigma_{d-1}^{(x)}:=\frac{1}{\nu_{\widehat{\mathfrak{g}}_{x}}(\mathbb{S}^{d-1}_{x})}\nu_{\widehat{\mathfrak{g}}_{x}},\qquad x\in\mathsf{M},

of the Riemannian measure $\nu_{\widehat{\mathfrak{g}}_{x}}$, induced by $\widehat{\mathfrak{g}}_{x}$, the uniform distribution on $\mathbb{S}_{x}^{d-1}$. Then for $x\in\mathsf{M}$, $v\in\mathbb{S}_{x}^{d-1}$ and $t>0$ we immediately obtain

L(x,v,t):=\{\alpha\in\mathbb{R}\mid p\left(\gamma_{(x,v)}(\alpha)\right)>t\}=\{\alpha\in\mathbb{R}\mid\gamma_{(x,v)}(\alpha)\in L(t)\}

as the parameterized intersection of a geodesic with a level set. We now present the extension of the Hit-and-run slice sampler to manifolds replacing straight lines by geodesics. We call this sampler the geodesic slice sampler. Roughly speaking, we arrive at a transition mechanism which at each step randomly chooses a geodesic and then runs a stepping-out and shrinkage based 1-dimensional slice sampler on this geodesic: Given a point $x\in\mathsf{M}$ , a new point $y\in\mathsf{M}$ is generated by first sampling a level $t$ uniformly from $(0,p(x))$ , and a direction $v$ uniformly from $\mathbb{S}^{d-1}_{x}$ , i.e., from $\sigma_{d-1}^{(x)}$ . The sampled level $t$ defines a level set $L(t)$ , and the sampled direction $v$ specifies a geodesic $\gamma_{(x,v)}$ emanating from $x$ in direction $v$ . Now we use Neal’s stepping-out and shrinkage techniques described in Section 2.1 to generate a point $\theta\in\mathbb{R}$ from the intersection $L(x,v,t)$ of the level set $L(t)$ and the geodesic $\gamma_{(x,v)}$ . The new point $y\in\mathsf{M}$ is then given by $y=\gamma_{(x,v)}(\theta)$ .

Algorithm 1 Geodesic slice sampler.

Input: point $x\in\mathsf{M}$ , hyperparameters $w\in(0,\infty)$ and $m\in\mathbb{N}$
Output: point $y\in\mathsf{M}$

1: Draw $T\sim\mathrm{Unif}\big((0,p(x))\big)$, call the result $t$
2: Draw $V\sim\sigma_{d-1}^{(x)}$, call the result $v$
3: Generate a realization of $(L,R)=\texttt{Step-out}_{w,m}(x,v,t)$, call the result $(\ell,r)$
4: Generate a realization of $\Theta=\texttt{Shrink}_{\ell,r}(x,v,t)$, call the result $\theta$
5: return $y=\gamma_{(x,v)}(\theta)$

Algorithm 2 Stepping-out procedure. Call as $\texttt{Step-out}_{w,m}(x,v,t)$.

Input: point $x\in\mathsf{M}$, direction $v\in\mathbb{S}_{x}^{d-1}$, level $t\in(0,p(x))$, hyperparameters $w\in(0,\infty)$ and $m\in\mathbb{N}$
Output: two points $\ell,r\in\mathbb{R}$ such that $\ell<0<r$

1: Draw $\Upsilon\sim\mathrm{Unif}\big{(}[0,w]\big{)}$, call the result $u$
2: Set $\ell:=-u$ and $r:=\ell+w$
3: Draw $J\sim\mathrm{Unif}(\{1,\ldots,m\})$, call the result $\upiota$
4: Set $i=2$ and $j=2$
5: while $i\leqslant\upiota$ and $p\left(\gamma_{(x,v)}(\ell)\right)>t$
6:   Set $\ell=\ell-w$
7:   Update $i=i+1$
8: end while
9: while $j\leqslant m+1-\upiota$ and $p\left(\gamma_{(x,v)}(r)\right)>t$
10:   Set $r=r+w$
11:   Update $j=j+1$
12: end while
13: return $(\ell,r)$
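To make the control flow of Algorithm 2 concrete, the following Python sketch mirrors it line by line. It is an illustration under stated assumptions rather than the reference implementation: the function name `step_out` and the callable `p_on_geodesic`, which is assumed to return $p(\gamma_{(x,v)}(\alpha))$ for a real parameter $\alpha$, are hypothetical names introduced here.

```python
import math
import random


def step_out(p_on_geodesic, t, w, m, rng=random):
    """Sketch of Algorithm 2 on a geodesic parameterized by alpha.

    p_on_geodesic(alpha) is assumed to return p(gamma_{(x,v)}(alpha)),
    so that alpha = 0 corresponds to the current point x.
    """
    u = rng.uniform(0.0, w)       # line 1: random position of the first bracket
    left = -u                     # line 2: bracket of width w containing 0
    right = left + w
    iota = rng.randint(1, m)      # line 3: uniform on {1, ..., m}
    i = 2                         # line 4
    j = 2
    # lines 5-8: extend to the left at most iota - 1 times
    while i <= iota and p_on_geodesic(left) > t:
        left -= w
        i += 1
    # lines 9-12: extend to the right at most m - iota times
    while j <= m + 1 - iota and p_on_geodesic(right) > t:
        right += w
        j += 1
    return left, right


# Usage on a toy one-dimensional density (an assumption for illustration).
bracket = step_out(lambda a: math.exp(-0.5 * a * a), t=0.1, w=0.5, m=8)
```

The two loops together perform at most $m-1$ extensions, so the returned bracket has width between $w$ and $mw$.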

Algorithm 3 Shrinkage procedure. Call as $\texttt{Shrink}_{\ell,r}(x,v,t)$.

Input: point $x\in\mathsf{M}$, direction $v\in\mathbb{S}_{x}^{d-1}$, level $t\in(0,p(x))$ and parameters $\ell<0<r$
Output: point $\theta\in L(x,v,t)\cap[\ell,r)$

1: Draw $\Theta\sim\mathrm{Unif}\big{(}(0,r-\ell)\big{)}$, call the result $\theta_{h}$
2: Set $\theta:=\theta_{h}-\mathbbm{1}_{\{\theta_{h}>r\}}(r-\ell)$
3: Set $\theta_{\min}:=\theta_{h}$
4: Set $\theta_{\max}:=\theta_{h}$
5: while $p\left(\gamma_{(x,v)}(\theta)\right)\leqslant t$
6:   if $\theta_{h}\in[\theta_{\min},r-\ell)$ then
7:     Set $\theta_{\min}=\theta_{h}$
8:   else
9:     Set $\theta_{\max}=\theta_{h}$
10:   end if
11:   Draw $\Theta\sim\mathrm{Unif}\big{(}(0,\theta_{\max})\cup[\theta_{\min},r-\ell)\big{)}$, call the result $\theta_{h}$
12:   Set $\theta=\theta_{h}-\mathbbm{1}_{\{\theta_{h}>r\}}(r-\ell)$
13: end while
14: return $\theta$
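The shrinkage procedure works on a wrapped parameter $\theta_{h}\in(0,r-\ell)$, in which the current point corresponds to both ends of the interval. The Python sketch below follows Algorithm 3 under the same assumptions as before: `shrink` and `p_on_geodesic` are hypothetical names, and `p_on_geodesic(alpha)` is assumed to return $p(\gamma_{(x,v)}(\alpha))$ with a value larger than $t$ at $\alpha=0$.

```python
import math
import random


def shrink(p_on_geodesic, t, left, right, rng=random):
    """Sketch of Algorithm 3 for a bracket left < 0 < right."""
    length = right - left

    def unwrap(theta_h):
        # lines 2 and 12: map the wrapped parameter back to [left, right]
        return theta_h - length if theta_h > right else theta_h

    theta_h = rng.uniform(0.0, length)   # line 1
    theta = unwrap(theta_h)
    theta_min = theta_h                  # line 3
    theta_max = theta_h                  # line 4
    while p_on_geodesic(theta) <= t:     # line 5: proposal lies outside the level set
        if theta_min <= theta_h < length:   # line 6
            theta_min = theta_h
        else:
            theta_max = theta_h
        # line 11: uniform draw from (0, theta_max) united with [theta_min, length)
        low_mass = theta_max
        high_mass = length - theta_min
        s = rng.uniform(0.0, low_mass + high_mass)
        theta_h = s if s < low_mass else theta_min + (s - low_mass)
        theta = unwrap(theta_h)
    return theta


# Usage on a toy one-dimensional density (an assumption for illustration).
theta = shrink(lambda a: math.exp(-0.5 * a * a), t=0.5, left=-3.0, right=4.0)
```

Since $\theta_{\min}$ only increases and $\theta_{\max}$ only decreases, the admissible set contracts towards the wrapped position of the current point after each rejection.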

A complete algorithmic description of the geodesic slice sampler in pseudo code can be found in Algorithm 1. It calls Algorithm 2 and Algorithm 3 representing the stepping-out and shrinkage procedure on the geodesic respectively. We also provide a description in terms of random variables.

Remark 3.

Let $(Y_{k})_{k\in\mathbb{N}}$ be the Markov chain corresponding to the geodesic slice sampler. For $k\in\mathbb{N}$ let $T_{k+1}$ and $V_{k+1}$ be random variables with conditional distributions

\mathbb{P}\left(T_{k+1}\in\cdot\mid Y_{1},\ldots,Y_{k}\right)=\mathrm{Unif}\big{(}(0,p(Y_{k}))\big{)}\qquad\text{and}\qquad\mathbb{P}\left(V_{k+1}\in\cdot\mid Y_{1},\ldots,Y_{k},T_{k+1}\right)=\sigma_{d-1}^{(Y_{k})}.

For fixed hyperparameters $w>0$ and $m\in\mathbb{N}$ , let $J_{k+1}\sim\mathrm{Unif}(\{1,\ldots,m\})$ and $\Upsilon_{k+1}\sim\mathrm{Unif}\big{(}(0,w)\big{)}$ be independent of all previous random variables. We set

$\displaystyle L^{\ast}_{k}=-\Upsilon_{k+1}-\big{(}\inf\{i\in\mathbb{N}_{0}\mid-\Upsilon_{k+1}-iw\notin L(Y_{k},V_{k+1},T_{k+1})\}\land(J_{k+1}-1)\big{)}w,$
$\displaystyle R^{\ast}_{k}=w-\Upsilon_{k+1}+\big{(}\inf\{i\in\mathbb{N}_{0}\mid w-\Upsilon_{k+1}+iw\notin L(Y_{k},V_{k+1},T_{k+1})\}\land(m-J_{k+1})\big{)}w,$

and define $\mathsf{I}^{(k+1)}=(L^{\ast}_{k},R^{\ast}_{k})$ . Let the random variable $\Theta_{1}^{(k+1)}$ have conditional distribution

\displaystyle\mathbb{P}\left(\Theta_{1}^{(k+1)}\in\cdot\mid Y_{1},\ldots,Y_{k},T_{k+1},V_{k+1},\mathsf{I}^{(k+1)}\right)=\mathrm{Unif}(\mathsf{I}^{(k+1)}),

and set $\mathsf{I}_{1}^{(k+1)}=\mathsf{I}^{(k+1)}$. Using the segment-notation $\mathbb{J}$ from the description of the shrinkage procedure in Section 2.1, we define $\Theta_{i+1}^{(k+1)}$ to have conditional distribution

\mathbb{P}\left(\Theta_{i+1}^{(k+1)}\in\cdot\mid Y_{1},\ldots,Y_{k},T_{k+1},V_{k+1},\mathsf{I}_{1}^{(k+1)},\ldots,\mathsf{I}_{i}^{(k+1)},\Theta_{1}^{(k+1)},\ldots,\Theta_{i}^{(k+1)}\right)=\mathrm{Unif}(\mathsf{I}_{i}^{(k+1)}),

and

\mathsf{I}_{i+1}^{(k+1)}=\mathbb{J}(\Theta_{i+1}^{(k+1)},\Theta_{i}^{(k+1)},\mathsf{I}_{i}^{(k+1)})

for all $i\in\mathbb{N}$. Then set $\Theta^{\ast}_{k}=\Theta_{\uptau_{k+1}}^{(k+1)}$, where

\uptau_{k+1}:=\inf\{i\in\mathbb{N}\mid\Theta_{i}^{(k+1)}\in L(Y_{k},V_{k+1},T_{k+1})\}.

The next state of the geodesic slice sampler is then given by

Y_{k+1}=\gamma_{(Y_{k},V_{k+1})}\left(\Theta^{\ast}_{k}\right).

We comment on the prerequisites of the geodesic slice sampler.

Remark 4.

1.
In order to implement Algorithm 1 we need to be able to perform the following operations:
- •
  
  Evaluation of the unnormalized density $p(x)$ at every $x\in\mathsf{M}$ .
- •
  
  Sampling from $\sigma_{d-1}^{(x)}$ for all $x\in\mathsf{M}$ . If we know an isometric isomorphism $\mathbb{R}^{d}\to T_{x}\mathsf{M}$ for all $x\in\mathsf{M}$ this is an easy task.
- •
  
  Evaluation of geodesics $\gamma_{(x,v)}(\theta)$ for all $x\in\mathsf{M}$ , $v\in\mathbb{S}^{d-1}_{x}$ and $\theta\in\mathbb{R}$ . Some cases where this is possible are provided in Example 5.
2.

The geodesic slice sampler takes two hyperparameters, namely $w\in(0,\infty)$ and $m\in\mathbb{N}$. They arise from the usage of Algorithm 2 (the stepping-out procedure), and their influence on the geodesic slice sampler is derived from their influence on the stepping-out procedure, see Remark 1. Roughly speaking, $mw$ determines the maximal possible size of the neighborhood of the initial point $x\in\mathsf{M}$ that the geodesic slice sampler takes into account when performing its transition. This affects the reach of the algorithm as well as its ability to jump between modes of the target distribution $\pi$. Choosing $m$ larger and $w$ smaller can be seen as increasing the likelihood that a smaller neighborhood of $x$ is considered for the transition, compared to the maximal possible reach of the algorithm. Depending on the shape of the unnormalized density $p$, this can hamper the ability of the geodesic slice sampler to jump between modes or allow the consideration of more “relevant” neighborhoods. Observe that larger $m$ leads to higher computational cost by increasing the cost of Algorithm 2.

We provide some illustrative scenarios.

Example 5.

1.

The Hit-and-run slice sampler on $\mathbb{R}^{d}$ described in Section 2.1 fits into the framework of the geodesic slice sampler.

2.

We consider $\mathbb{S}^{d}\subseteq\mathbb{R}^{d+1}$. For $x\in\mathbb{S}^{d}$ we have $\mathbb{S}^{d-1}_{x}=\{v\in\mathbb{S}^{d}\mid x^{\top}v=0\}$. The projection onto the subspace orthogonal to $x\in\mathbb{S}^{d}$

\displaystyle\psi_{x}:\mathbb{S}^{d}\to\mathbb{S}^{d-1}_{x},\qquad v\mapsto\left(\mathrm{Id}-xx^{\top}\right)v,

where $\mathrm{Id}$ denotes the identity on $\mathbb{S}^{d}$, yields a simple formula for

\sigma_{d-1}^{(x)}=\left(\psi_{x}\right)_{\sharp}\left(\frac{1}{\nu_{\widehat{\mathfrak{g}}}(\mathbb{S}^{d})}\nu_{\widehat{\mathfrak{g}}}\right).

The geodesics of $\mathbb{S}^{d}$ are the great circles given by the explicit formula

\gamma_{(x,v)}(\theta)=\cos(\theta)x+\sin(\theta)v,\qquad\theta\in\mathbb{R},

for $x\in\mathbb{S}^{d}$ and $v\in\mathbb{S}^{d-1}_{x}$. Of course we may apply the geodesic slice sampler for arbitrary hyperparameters as described in Algorithm 1. However, since all geodesics are periodic with the known period $2\pi$, Algorithm 2 (the stepping-out procedure) becomes somewhat superfluous. For wisely chosen hyperparameters ($w=2\pi$, $m=1$), the geodesic segment sampled by Algorithm 2 always equals exactly one winding of the great circle, and we can simply replace line 3 in Algorithm 1 by deterministically setting $(\ell,r)=(-\pi,\pi)$. The resulting algorithm is the geodesic shrinkage slice sampler on the sphere from [habeck2023geodesic].
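The two manifold-specific ingredients on the sphere, drawing a unit tangent direction and evaluating a great circle, can be sketched in plain Python. The function names are hypothetical. The direction is drawn by projecting a standard Gaussian vector onto the orthogonal complement of $x$ and normalizing it, a standard construction for the uniform distribution on the unit tangent sphere that we assume here for illustration.

```python
import math
import random


def sample_tangent_direction(x, rng=random):
    """Draw a unit vector orthogonal to the unit vector x."""
    g = [rng.gauss(0.0, 1.0) for _ in x]
    dot = sum(gi * xi for gi, xi in zip(g, x))
    # remove the component along x, i.e. apply (Id - x x^T)
    proj = [gi - dot * xi for gi, xi in zip(g, x)]
    norm = math.sqrt(sum(c * c for c in proj))
    return [c / norm for c in proj]


def sphere_geodesic(x, v, theta):
    """Great circle gamma_{(x,v)}(theta) = cos(theta) x + sin(theta) v."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * xi + s * vi for xi, vi in zip(x, v)]


# Usage: start at the north pole of the 2-sphere and move along a great circle.
x0 = [0.0, 0.0, 1.0]
v0 = sample_tangent_direction(x0)
y0 = sphere_geodesic(x0, v0, 1.3)
```

Because `sphere_geodesic` has period $2\pi$ in `theta`, a bracket $(-\pi,\pi)$ already covers the whole great circle, which is why the stepping-out procedure can be skipped in this setting.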

3.

For the Stiefel manifold defined in Example 2.3, an isometric isomorphism between $\mathbb{R}^{k(k-1)/2+k(n-k)}$ and the tangent space $T_{\Gamma}\mathcal{V}(n,k)$ at a point $\Gamma\in\mathcal{V}(n,k)$ is given by using the first $k(k-1)/2$ components of $v\in\mathbb{R}^{k(k-1)/2+k(n-k)}$ to determine a skew symmetric matrix $\Pi$ and the remaining ones to form a matrix $\Sigma\in\mathbb{R}^{(n-k)\times k}$. These two matrices determine an element of $T_{\Gamma}\mathcal{V}(n,k)$ (after fixing $\Gamma_{\perp}$) by $\Delta=\Gamma\Pi+\Gamma_{\perp}\Sigma$. We provide an explicit formula for the geodesic $\gamma_{(\Gamma,\Delta)}$. To this end let $QR=\Gamma_{\perp}\Sigma$ be the compact QR-decomposition of $\Gamma_{\perp}\Sigma=(\mathrm{Id}_{n}-\Gamma\Gamma^{\top})\Delta$. For $\theta\in\mathbb{R}$ set $N_{1}(\theta)\in\mathbb{R}^{k\times k}$ and $N_{2}(\theta)\in\mathbb{R}^{k\times k}$ to be

\begin{pmatrix}N_{1}(\theta)\\ N_{2}(\theta)\end{pmatrix}=\exp\left(\theta\begin{pmatrix}\Pi&-R^{\top}\\ R&\boldsymbol{0}\end{pmatrix}\right)\begin{pmatrix}\mathrm{Id}_{k}\\ \boldsymbol{0}\end{pmatrix},

where $\exp$ denotes here the matrix exponential and $\boldsymbol{0}\in\mathbb{R}^{k\times k}$ is the matrix with all entries zero. Then

\gamma_{(\Gamma,\Delta)}(\theta)=\Gamma N_{1}(\theta)+QN_{2}(\theta),\qquad\theta\in\mathbb{R}.

The derivation of these results can be found in [edelman1998geometry].
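As a sanity check of this formula, consider the special case $k=1$, where $\mathcal{V}(n,1)$ is the unit sphere in $\mathbb{R}^{n}$. This worked example is our own illustration and is not taken from [edelman1998geometry]. The only skew symmetric $1\times 1$ matrix is $\Pi=0$, so $\Delta=\Gamma_{\perp}\Sigma$ with $\Gamma^{\top}\Delta=0$. Assuming $\Delta\neq 0$, the compact QR-decomposition is $Q=\Delta/\|\Delta\|$ and $R=\|\Delta\|$, and the matrix exponential is a planar rotation,

\exp\left(\theta\begin{pmatrix}0&-R\\ R&0\end{pmatrix}\right)=\begin{pmatrix}\cos(R\theta)&-\sin(R\theta)\\ \sin(R\theta)&\cos(R\theta)\end{pmatrix},

so that $N_{1}(\theta)=\cos(\|\Delta\|\theta)$ and $N_{2}(\theta)=\sin(\|\Delta\|\theta)$. Hence

\gamma_{(\Gamma,\Delta)}(\theta)=\cos(\|\Delta\|\theta)\,\Gamma+\sin(\|\Delta\|\theta)\,\frac{\Delta}{\|\Delta\|},\qquad\theta\in\mathbb{R},

which for a unit direction, $\|\Delta\|=1$, is exactly the great circle formula of the sphere example.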

Finally, we present the Markov transition kernel corresponding to the geodesic slice sampler. To this end, we first give a rigorous specification of the unnormalized density $p$ :

Assumption B.

The unnormalized density $p:\mathsf{M}\to(0,\infty)$ is a lower semi-continuous function (i.e., all level sets $L(t):=\{x\in\mathsf{M}\mid p(x)>t\}$, $t\in\mathbb{R}$, are open) such that $\int_{\mathsf{M}}p(x)\ \nu_{\mathfrak{g}}({\rm d}x)\in(0,\infty)$.

We denote by

\|p\|_{\infty}:=\sup_{x\in\mathsf{M}}|p(x)|

the supremum norm of $p$ . Observe that Assumption B gives $\|p\|_{\infty}\in(0,\infty]$ .

Remark 6.

We impose lower semicontinuity of the unnormalized density $p$ in Assumption B to ensure that Algorithm 3 (the shrinkage procedure) terminates almost surely. This guarantees that its output indeed has a distribution on $\mathbb{R}$. For more details see [ReversibilityEllipticalSliceSampler].

Next we fix $w\in(0,\infty)$ and $m\in\mathbb{N}$ . For simplicity we drop these two hyperparameters of the geodesic slice sampler in our subsequent notation. Let $x\in\mathsf{M}$ , $v\in\mathbb{S}_{x}^{d-1}$ and $t\in(0,p(x))$ . We denote by

\xi_{L(x,v,t)}^{(0)}(\mathsf{A}):=\mathbb{P}(\texttt{Step-out}_{w,m}(x,v,t)\in\mathsf{A}),\qquad\mathsf{A}\in\mathcal{B}(\mathbb{R}^{2}),

(5)

the distribution of the output of Algorithm 2 and by

Q_{L(x,v,t)}^{\ell,r}(0,\mathsf{A}):=\mathbb{P}(\texttt{Shrink}_{\ell,r}(x,v,t)\in\mathsf{A}),\qquad\mathsf{A}\in\mathcal{B}(\mathbb{R}),

(6)

where $\ell<0<r$ , the distribution of the output of Algorithm 3. Note that in (5) and (6) the right hand side only depends on $x\in\mathsf{M}$ , $v\in\mathbb{S}_{x}^{d-1}$ and $t\in(0,p(x))$ through the set $L(x,v,t)$ . A formal definition of these distributions and some of their properties can be found in Section 4.1 and Section 4.2. For $x\in\mathsf{M}$ , $t\in(0,p(x))$ and $\mathsf{A}\in\mathcal{B}(\mathsf{M})$ we define the auxiliary Markov kernels

K_{t}(x,\mathsf{A}):=\int_{\mathbb{S}_{x}^{d-1}}\int_{\mathbb{R}^{2}}\int_{[\ell,r)}\mathbbm{1}_{\mathsf{A}}\left(\gamma_{(x,v)}(\theta)\right)\ Q_{L(x,v,t)}^{\ell,r}(0,{\rm d}\theta)\ \xi_{L(x,v,t)}^{(0)}\big{(}{\rm d}(\ell,r)\big{)}\ \sigma_{d-1}^{(x)}({\rm d}v).

Then the Markov kernel

\begin{split}K:\mathsf{M}\times\mathcal{B}(\mathsf{M})&\to[0,1]\\ (x,\mathsf{A})&\mapsto\frac{1}{p(x)}\int_{(0,p(x))}K_{t}(x,\mathsf{A})\ \mathrm{Leb}_{1}({\rm d}t)\end{split}

(7)

corresponds to Algorithm 1. Observe that $K$ has the correct invariant distribution, as implied by the following theorem.

Theorem 7.

Suppose Assumptions A and B are satisfied, and let $\pi$ be defined as in (3). Fix $w\in(0,\infty)$ and $m\in\mathbb{N}$. Then the Markov kernel $K$ given in (7) is reversible with respect to $\pi$.

The proof of this statement can be found in Section 4.3.

2.3 Literature review of MCMC-methods on Riemannian manifolds

In this section, we aim to provide an overview of existing MCMC-methods on Riemannian manifolds in the literature. Roughly speaking they can be assigned to three different categories.

The first class consists of MCMC algorithms that are defined on open sets of $\mathbb{R}^{d}$ but consider non-canonical Riemannian metrics. In [girolami2011riemann], Girolami and Calderhead generalize Hamiltonian Monte Carlo (HMC) to $\mathbb{R}^{d}$ equipped with an arbitrary metric tensor, obtaining Riemannian manifold HMC (RMHMC), which assumes knowledge about the Riemannian metric of the underlying manifold. They also introduce a MALA-type algorithm in this setting.

A second class of MCMC algorithms consists of methods defined on submanifolds of $\mathbb{R}^{d}$, which are not necessarily open sets, equipped with the canonical metric induced by the embedding in $\mathbb{R}^{d}$. This includes Hamiltonian-based MCMC-methods tailor-made for specific classes of manifolds that build upon RMHMC, such as for implicitly defined manifolds [brubaker2012family], manifolds embedded in Euclidean space [byrne2013geodesic] and the sphere [lan2014spherical].

Distributions on submanifolds of $\mathbb{R}^{d}$ can also be approximated by proposing a sample from the ambient $\mathbb{R}^{d}$ projected to the manifold and then running a Metropolis-Hastings acceptance-rejection step, see e.g. [mantoux2021understanding, zappa2018monte]. However, as the conditional distribution of the proposal is usually intractable and is therefore not taken into account in the acceptance-rejection step, the resulting Metropolis-Hastings algorithm is biased. In [zappa2018monte], Zappa et al. propose a bias-free modification of this method for submanifolds of $\mathbb{R}^{d}$ defined by inequality and equality constraints. In the special case when the underlying manifold is a hypersphere equipped with the angular Gaussian distribution as reference measure, other specialized bias-free reprojected MCMC algorithms have been proposed, such as the reprojected preconditioned Crank–Nicolson algorithm or reprojected Elliptical Slice sampling, see [lieAccepteddimension].

The third class of MCMC-methods on Riemannian manifolds employs geodesic flows. In [mangoubi2018rapid], the authors analyze a geodesic random walk on Riemannian manifolds with positive bounded curvature, which is invariant with respect to the Riemannian measure. This analysis has been extended in [goyal2019sampling] to arbitrary target probability measures on manifolds with bounded non-negative curvature by adding a Metropolis-Hastings-like acceptance step, resulting in a geodesic Metropolis-Hastings algorithm. Note that a similar approach has been taken before in [lee2017geodesic] to target the uniform distribution on a polytope by equipping it with a Hessian structure. In addition to this class of Metropolis-Hastings algorithms, there already exist MCMC-methods for specific manifolds that combine slice sampling with geodesics: for distributions on $\mathbb{R}^{d}$ there is Hit-and-Run slice sampling [latuszyinski2014convergence, Mackay], and for distributions on hyperspheres there is geodesic slice sampling on the sphere, which uses great circles instead of straight lines [habeck2023geodesic]. They can both be viewed as special cases of GSS.

3 Application

We numerically assess the performance of GSS (Algorithm 1) in comparison with other Riemannian MCMC algorithms. In our experiments, we consider the compact Stiefel manifold $\mathcal{V}(n,k)\subseteq\mathbb{R}^{n\times k}$ and the Grassmann manifold $\mathcal{G}(n,k)$ with $k,n\in\mathbb{N}$, $k<n$. They find applications in shape analysis [hong2017regression], dimensionality reduction [holbrook2016bayesian] and computer vision [lui2012advances, nguyen2019neural]. We give a brief overview of the conducted experiments. All the code is available on GitHub (https://github.com/samuelgruffaz/Geodesic_Slice_Sampling_on_Riemannian_Manifold.git).

•

We first consider the case where the target distribution belongs to the class of von Mises-Fisher distributions, defining families of distributions on $\mathcal{V}(n,k)$ and $\mathcal{G}(n,k)$. When examining the Stiefel manifold $\mathcal{V}(n,k)$, we compare GSS with an adaptive random walk Metropolis Hastings (RMH) sampler. This comparison highlights the impact of the GSS approach compared to a well-chosen random walk. On the Grassmann manifold, we contrast GSS with a gradient-informed sampler referred to as the geodesic Metropolis-adjusted Langevin algorithm (GeoMALA), inspired by the framework proposed by [byrne2013geodesic].
•

We compare GSS and RMH in inferring a latent variable model developed in [mantoux2021understanding]. This model is used for the analysis of brain network structures. Specifically, it encodes principal directions of adjacency matrices obtained from MRI as points on a Stiefel manifold.
•

Finally, we introduce a Bayesian von Mises-Fisher clustering model for action recognition in videos. We approximate the posterior distribution using GSS and compare our resulting model with other existing approaches [lin2017bayesian, sengupta2017bayesian].

We do not conduct comparisons with the Riemannian Hamiltonian Monte Carlo (RHMC) algorithm [girolami2011riemann], since it uses the expression of the metric of the manifold rather than its geodesics, as GSS does. Throughout this section we denote the Gaussian distribution on $\mathbb{R}^{d}$ with mean $x\in\mathbb{R}^{d}$ and covariance matrix $\Sigma\in\mathbb{R}^{d\times d}$ by $\mathcal{N}(x,\Sigma)$. For $\kappa\in\mathbb{R}^{d}$, let $\mathrm{diag}(\kappa)\in\mathbb{R}^{d\times d}$ be the diagonal matrix with diagonal entries given by the components of $\kappa$. Moreover, we define $\boldsymbol{0}_{d}\in\mathbb{R}^{d}$ to be the vector and $\boldsymbol{0}_{d,k}\in\mathbb{R}^{d\times k}$ to be the matrix with all entries zero, and write $\textrm{Tr}(\Sigma)$ for the trace of a square matrix $\Sigma$.

We comment on the structures of the Stiefel and the Grassmann manifold needed to implement GSS.

Remark 8.

1.

(Stiefel manifold.) The Stiefel manifold $\mathcal{V}(n,k)$ , including an explicit formula for its geodesic and the uniform distribution on its tangent spheres, is introduced in Example 2-2 and 5-3. It is worth noting that the computation of a geodesic requires evaluating the matrix exponential map on skew symmetric matrices of size $2k\times 2k$ , which can be efficiently done using the eigenvalue decomposition of skew symmetric matrices.

2.

(Grassmann manifold.) Let $n,k\in\mathbb{N}$ with $k\leqslant n$. The $k(n-k)$-dimensional Grassmann manifold $\mathcal{G}(n,k)$ can be defined as the set of all $k$-dimensional subspaces of the Euclidean space $\mathbb{R}^{n}$ (see [bendokat2020grassmann] for a complete overview), i.e.,

\mathcal{G}(n,k)=\{W\subseteq\mathbb{R}^{n}\mid W\text{ is a }k\text{-dimensional subspace}\}.

(8)

The following representation allows for an efficient implementation of points on the Grassmann manifold. First, observe that for any subspace $W\in\mathcal{G}(n,k)$, there exists a (non-unique) orthonormal basis formed by the columns of $\Gamma\in\mathcal{V}(n,k)$ such that $W=\operatorname{span}\Gamma$. Thus, by defining the equivalence relation $\sim$ on $\mathcal{V}(n,k)$ as $\Gamma_{1}\sim\Gamma_{2}$ if and only if $\operatorname{span}\Gamma_{1}=\operatorname{span}\Gamma_{2}$, the Grassmann manifold $\mathcal{G}(n,k)$ can be identified with the quotient manifold $\mathcal{V}(n,k)/\sim$. As a result, an element $W$ of $\mathcal{G}(n,k)$ can be represented by an element $\Gamma$ of $\mathcal{V}(n,k)$.

Fixing $\Gamma_{\perp}\in\mathbb{R}^{n\times(n-k)}$ such that the columns of $\Gamma$ and $\Gamma_{\perp}$ form an orthonormal basis of $\mathbb{R}^{n}$ , the tangent space to $\mathcal{G}(n,k)$ at $W$ is given by

T_{W}\mathcal{G}(n,k):=\{\Gamma_{\perp}\Sigma\in\mathbb{R}^{n\times k}\mid\Sigma\in\mathbb{R}^{(n-k)\times k}\}.

We equip $\mathcal{G}(n,k)$ with the Riemannian metric

\mathfrak{g}_{W}(\Delta_{1},\Delta_{2})=\textrm{Tr}(\Delta_{1}^{\top}\Delta_{2}),\qquad\Delta_{1},\Delta_{2}\in T_{W}\mathcal{G}(n,k),

such that $\Sigma\mapsto\Gamma_{\perp}\Sigma$ is an isometric isomorphism between $\mathbb{R}^{(n-k)\times k}$ and $T_{W}\mathcal{G}(n,k)$. Let $\Delta=\widehat{U}D\widehat{V}^{\top}$ be the compact singular value decomposition (SVD) of an element $\Delta\in T_{W}\mathcal{G}(n,k)$. Then, the geodesic at $W$ with direction $\Delta\in T_{W}\mathcal{G}(n,k)$ admits the following explicit expression

\gamma_{(W,\Delta)}(\theta)=\big{(}\Gamma\widehat{V}\cos(D\theta)+\widehat{U}\sin(D\theta)\big{)}\widehat{V}^{\top},\qquad\theta\in\mathbb{R}.

Further details and derivations can be found in [edelman1998geometry].

For the convenience of the reader we provide more details on the three samplers appearing in our numerical experiments. Each sampler defines a Markov chain $(X_{i})_{i\in\mathbb{N}}$ , and we describe the respective transition mechanisms from $X_{i}$ to $X_{i+1}$ for $i\in\mathbb{N}$ .

(a) adaptive random walk Metropolis Hastings (RMH).

Given $X_{i}\in\mathcal{V}(n,k)$ and a step size $h_{a}\in(0,\infty)$ this sampler uses the following Metropolis-Hastings (MH) like transition mechanism:

1.

Sample $V_{i+1}\sim\mathcal{N}(\boldsymbol{0}_{nk},\mathrm{Id}_{nk})$ interpreted as a matrix in $\mathbb{R}^{n\times k}$. Let $\operatorname{proj}_{\mathcal{V}(n,k)}:\mathbb{R}^{n\times k}\to\mathcal{V}(n,k),\ V=\widehat{U}D\widehat{V}^{\top}\mapsto\widehat{U}\widehat{V}^{\top}$ be the projection on the Stiefel manifold, where $\widehat{U}D\widehat{V}^{\top}$ is the SVD of $V$. Then define $\widetilde{X}_{i+1}=\operatorname{proj}_{\mathcal{V}(n,k)}(X_{i}+h_{a}V_{i+1})$.

2.

Sample $\Upsilon_{i+1}\sim\mathrm{Unif}\big{(}[0,1]\big{)}$ independent of all previously appearing random variables. If $\Upsilon_{i+1}<\upalpha(X_{i},\widetilde{X}_{i+1})$, where

\upalpha(X_{i},\widetilde{X}_{i+1})=\frac{p(\widetilde{X}_{i+1})}{p(X_{i})},

(9)

then set $X_{i+1}=\widetilde{X}_{i+1}$ , otherwise $X_{i+1}=X_{i}$ .

The hyperparameter $h_{a}$ is adaptively tuned to target an acceptance probability of $0.234$, a constant proposed by optimal design analysis when using the same sampler in a Euclidean space [roberts2001optimal]. As remarked in [zappa2018monte], it is worth noting that the proposal mechanism is not symmetric here. More precisely, if the conditional distribution of $\widetilde{X}_{i+1}$ given $X_{i}=\Gamma\in\mathcal{V}(n,k)$ admits a transition density denoted by $q$, this function is not symmetric, i.e., in general $q(\Gamma_{1}|\Gamma_{2})\neq q(\Gamma_{2}|\Gamma_{1})$ for $\Gamma_{1},\Gamma_{2}\in\mathcal{V}(n,k)$. Therefore, using (9) as the acceptance ratio leads to a biased MCMC algorithm, which is justified in [mantoux2021understanding] by the observation that, for sufficiently small $h_{a}>0$, $q(\Gamma_{1}|\Gamma_{2})\approx q(\Gamma_{2}|\Gamma_{1})$ holds for all $\Gamma_{1},\Gamma_{2}\in\mathcal{V}(n,k)$. However, this introduces a bias that grows with increasing $h_{a}$. In Section 3.2, we compare adaptive RMH with GSS using the same application as discussed in [mantoux2021understanding].
It is worth noting that [zappa2018monte] proposes an MCMC algorithm that, by employing another projection and an additional accept-reject step, results in an unbiased MCMC algorithm. However, their approach does not exploit the specificity of a “tractable geodesics” framework.
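To illustrate the RMH transition, the following Python sketch treats the special case $k=1$, chosen here only because it needs no linear algebra library. For a column vector the compact SVD is $V=(V/\|V\|)\cdot\|V\|\cdot 1$, so the projection onto $\mathcal{V}(n,1)$ reduces to normalization. The function name `rmh_step` is hypothetical, the step size is kept fixed (the adaptive tuning towards $0.234$ is not included), and the acceptance ratio is the uncorrected ratio (9), so the sketch inherits the bias discussed above.

```python
import math
import random


def rmh_step(x, p, h_a, rng=random):
    """One RMH transition on V(n, 1), i.e. the unit sphere in R^n.

    x is a list of floats with unit Euclidean norm, p the unnormalized density.
    Returns the next state and a flag telling whether the proposal was accepted.
    """
    # step 1: Gaussian perturbation in the ambient space, then projection
    proposal = [xi + h_a * rng.gauss(0.0, 1.0) for xi in x]
    norm = math.sqrt(sum(c * c for c in proposal))
    proposal = [c / norm for c in proposal]
    # step 2: accept with the uncorrected ratio (9)
    if rng.random() < p(proposal) / p(x):
        return proposal, True
    return x, False


# Usage with a von Mises-Fisher type density on the 2-sphere (toy parameters).
state = [1.0, 0.0, 0.0]
state, accepted = rmh_step(state, lambda z: math.exp(5.0 * z[0]), h_a=0.3)
```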

(b) geodesic adaptive Metropolis Hastings (GeoRMH).

This method is a bias-free modification of RMH. It essentially uses the same transition mechanism, but replaces the proposal at step $i\in\mathbb{N}$ with $\gamma_{(X_{i},V_{i+1})}(h_{a})$ where $V_{i+1}$ is distributed according to a standard Gaussian on $T_{X_{i}}\mathcal{V}(n,k)$. In [goyal2019sampling], a similar algorithm is proposed to sample uniformly from a convex subset of a manifold with non-negative curvature.

(c) geodesic Metropolis adjusted Langevin algorithm (GeoMALA).

On the Grassmann manifold, GeoMALA is used and presented in [holbrook2016bayesian, Algorithm 1] to estimate parameters in Bayesian inference models involving dimensionality reduction. Fix a step size $h\in(0,\infty)$ and denote by $\operatorname{proj}_{T_{W}\mathcal{G}(n,k)}:\mathbb{R}^{n\times k}\to T_{W}\mathcal{G}(n,k),\ V\mapsto(\mathrm{Id}_{n}-\Gamma\Gamma^{\top})V$ for $W=\mathrm{span}\ \Gamma\in\mathcal{G}(n,k)$ the projection onto the tangent space. The transition mechanism of GeoMALA given $X_{i}\in\mathcal{G}(n,k)$ works as follows:

1.

Sample $V_{i+1}\sim\mathcal{N}(\boldsymbol{0}_{nk},\mathrm{Id}_{nk})$ interpreted as a matrix in $\mathbb{R}^{n\times k}$ . Then define $\bar{V}_{i+1}=\operatorname{proj}_{T_{X_{i}}\mathcal{G}(n,k)}(V_{i+1})$ and $E_{0}=\log p(X_{i})-\mathrm{Tr}(\bar{V}_{i+1}^{\top}\bar{V}_{i+1})/2$ .
2.

Define $\widetilde{V}_{i+1}=\operatorname{proj}_{T_{X_{i}}\mathcal{G}(n,k)}(\bar{V}_{i+1}+h\nabla\log p(X_{i})/2)$ and $\bar{X}_{i+1}=\gamma_{(X_{i},\widetilde{V}_{i+1})}(h)$, $\bar{V}_{i+1}=\dot{\gamma}_{(X_{i},\widetilde{V}_{i+1})}(h)$, as well as $V_{i+1}^{*}=\operatorname{proj}_{T_{X_{i}}\mathcal{G}(n,k)}(\bar{V}_{i+1}+h\nabla\log p(\bar{X}_{i+1})/2)$ and $E_{1}=\log p(\bar{X}_{i+1})-\mathrm{Tr}(V_{i+1}^{*\top}V_{i+1}^{*})/2$.
3.

Sample $\Upsilon_{i+1}\sim\mathrm{Unif}\big{(}[0,1]\big{)}$ independently. If $\log(\Upsilon_{i+1})<E_{1}-E_{0}$ then set $X_{i+1}=\bar{X}_{i+1}$ , otherwise $X_{i+1}=X_{i}$ .

Note that for all the algorithms that we implement, GSS, GeoRMH, and GeoMALA, we always reproject the (final) state of the Markov chain at each step onto the Stiefel manifold to mitigate numerical errors arising from the geodesic computations. This reprojection step ensures that the resulting samples lie on the manifold, preserving the desired manifold structure and improving the accuracy of the sampling algorithms.

3.1 Sampling the von Mises–Fisher distribution

In this section we present numerical experiments targeting the von Mises-Fisher distribution. Given a matrix-valued parameter $F\in\mathbb{R}^{n\times k}$ , the von Mises–Fisher distribution $\operatorname{vMF}(F)$ on the Stiefel manifold $\mathcal{V}(n,k)$ has unnormalized density with respect to its Riemannian measure

p_{\operatorname{vMF}(F)}(\Gamma)=\exp\left(\textrm{Tr}(F^{\top}\Gamma)\right),\qquad\Gamma\in\mathcal{V}(n,k).

(10)

However, this expression cannot be used for the Grassmann manifold since $\operatorname{span}\Gamma_{1}=\operatorname{span}\Gamma_{2}$ with $(\Gamma_{1},\Gamma_{2})\in\mathcal{V}(n,k)^{2}$ does not imply $p_{\operatorname{vMF}(F)}(\Gamma_{1})=p_{\operatorname{vMF}(F)}(\Gamma_{2})$. Note that an element $W$ of the Grassmann manifold can be identified with its orthogonal projector $\Gamma\Gamma^{\top}$, which does not depend on the choice of the representative $\Gamma\in\mathcal{V}(n,k)$ for $W$. Therefore, given a positive semi-definite matrix $P\in\mathbb{R}^{n\times n}$, the von Mises–Fisher distribution $\operatorname{vMF}(P)$ (also called matrix Langevin distribution [chikuse2003concentrated]) on the Grassmann manifold $\mathcal{G}(n,k)$ has unnormalized density

p_{\operatorname{vMF}(P)}(W)=\exp\left(\textrm{Tr}(P^{\top}\Gamma\Gamma^{\top})\right),\quad\operatorname{span}\Gamma=W,\qquad(\Gamma,W)\in\mathcal{V}(n,k)\times\mathcal{G}(n,k).

(11)
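The difference between (10) and (11) can be checked numerically. The following Python sketch evaluates both log-densities with plain nested lists and compares two representatives of the same subspace. All function names and the concrete matrices are assumptions chosen for illustration, with $(n,k)=(3,2)$.

```python
import math


def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def frobenius_inner(a, b):
    """Tr(a^T b) for two matrices of the same shape."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def log_p_vmf_stiefel(F, gamma):
    """Logarithm of (10): Tr(F^T Gamma)."""
    return frobenius_inner(F, gamma)


def log_p_vmf_grassmann(P, gamma):
    """Logarithm of (11): Tr(P^T Gamma Gamma^T)."""
    return frobenius_inner(P, matmul(gamma, transpose(gamma)))


# Toy parameters: F has the block form (D, 0) with D = diag(1, 2), P = F F^T.
F = [[1.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
P = matmul(F, transpose(F))
gamma_1 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
phi = 0.7
rotation = [[math.cos(phi), -math.sin(phi)], [math.sin(phi), math.cos(phi)]]
gamma_2 = matmul(gamma_1, rotation)  # same span as gamma_1, different basis
```

Here `log_p_vmf_stiefel` takes different values at `gamma_1` and `gamma_2`, whereas `log_p_vmf_grassmann` agrees on both, since $\Gamma\Gamma^{\top}$ is unchanged when $\Gamma$ is multiplied from the right by an orthogonal matrix.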

In the following, given $n$ and $k$ , we always choose the parameters $F$ and $P$ to be of the form

F=\begin{pmatrix}D\\ \boldsymbol{0}_{n-k,k}\end{pmatrix}\in\mathbb{R}^{n\times k},\qquad P=FF^{\top},

(12)

where $D\in\mathbb{R}^{k\times k}$ .

We consider three different experimental setups on $\mathcal{V}(n,k)$. Firstly, $k$ is fixed at $2$, we vary $n$ in the set $\{3,30,100\}$ and set $D=\mathrm{diag}((1,\ldots,k)^{\top})\in\mathbb{R}^{k\times k}$ in (12). Secondly, for the same choice of $D$ as in the first experiment, we fix $n=30$ and vary $k$ in the set $\{5,10,20\}$. Thirdly, the pair $(n,k)$ is fixed at $(30,2)$, but we choose $D=\mathrm{diag}((1,\lambda)^{\top})\in\mathbb{R}^{2\times 2}$, where $\lambda$ varies in the set $\{1,10,100\}$. This allows us to study the impact of the target distribution’s anisotropy on the samplers’ performance.

On the Grassmann manifold $\mathcal{G}(n,k)$ we repeat similar experiments. However in the first two setups, we use $D=\mathrm{Id}_{k}$ in (12). For the third experiment, we set $(n,k)=(3,2)$ and $D=\sqrt{\lambda}\mathrm{Id}_{k}$ for $\lambda\in\{1,10,100\}$ . There, we also vary the stepping-out parameter $m$ of GSS in the set $\{1,3,10\}$ . In all the other experiments, we fix the hyperparameter $m$ of GSS to 1 for the sake of simplicity. The samplers run for $N=100,000$ iterations, and the stepsize $h_{a}$ is initialized to 0.01 and updated every 20 steps for RMH and GeoRMH.

We define the initialization $X_{1}$ by first sampling $\widetilde{X}_{1}\sim\mathrm{Unif}\left([0,1]^{n\times k}\right)$ . Then, $\widetilde{X}_{1}$ is projected onto $\mathcal{V}(n,k)$ using [mantoux2021understanding, Lemma 2] to obtain $X_{1}$ , thereby identifying a point on $\mathcal{V}(n,k)$ with its equivalence class when we work on $\mathcal{G}(n,k)$ . To evaluate the performance of the samplers, we compute the effective sample size (ESS) of $(\log(p_{\operatorname{vMF}(F)}(x_{i})))_{i\in\{1,\ldots,N\}}\in\mathbb{R}^{N}$ , where $(x_{i})_{i\in\{1,\ldots,N\}}$ denotes the samples generated by a single sampler. The results are averaged over 10 different resamplings, but the initialization is kept fixed since the use of a burn-in period has no significant impact.
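For readers who want to reproduce such a diagnostic, the following Python sketch computes an effective sample size for a scalar trace such as $(\log p(x_{i}))_{i}$. The estimator used for the reported numbers is not specified in the text, so this is an assumption: we use the common form $N/(1+2\sum_{k}\rho_{k})$ and truncate the sum at the first non-positive empirical autocorrelation. The function name `effective_sample_size` is hypothetical.

```python
def effective_sample_size(values):
    """ESS of a scalar chain via truncated empirical autocorrelations."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    var = sum(c * c for c in centered) / n
    if var == 0.0:
        # a constant trace carries no autocorrelation information
        return float(n)
    tau = 1.0  # integrated autocorrelation time estimate
    for lag in range(1, n):
        acov = sum(centered[i] * centered[i + lag] for i in range(n - lag)) / n
        rho = acov / var
        if rho <= 0.0:
            # heuristic truncation: stop at the first non-positive autocorrelation
            break
        tau += 2.0 * rho
    return n / tau


# Usage on a short toy trace of log-density values.
ess = effective_sample_size([0.1, 0.4, 0.35, 0.2, 0.5, 0.45, 0.3, 0.25])
```

Strongly autocorrelated chains yield a large `tau` and hence a small ESS, while nearly independent samples give an ESS close to $N$.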

Table 1: Experiment performances recorded with the effective sample size [min, median, max] over 10 repetitions for varying dimension $n$, on the Stiefel manifold $\mathcal{V}(n,k)$.

(n,k)	(3,2)	(30,2)	(100,2)
GSS $w=1$	$[1842,2156,2549]$	$[1668,2000,2254]$	$[1693,2033,2181]$
GSS $w=3$	$[10232,9315,11460]$	$[12290,13445,15014]$	$[14355,15786,18229]$
GSS $w=5$	$[11899,13986,18806]$	$[22756,26360,32073]$	$[31946,35769,42168]$
GSS $w=7$	$[13160,14734,\textbf{18276}]$	$[30215,33844,35793]$	$[43254,52119,57652]$
GSS $w=9$	$[12810,14159,16628]$	$[34512,38960,\textbf{48829}]$	$[47771,57845,68873]$
GSS $w=11$	$[\textbf{14477, 15615},16733]$	$[\textbf{35254, 39762},46525]$	$[51742,57700,62830]$
RMH:	$[10173,11980,13985]$	$[30329,39098,44068]$	$[\textbf{53187,60005, 72752}]$
GeoRMH:	$[7047,8399,10792]$	$[28481,34113,37743]$	$[45427,51125,56370]$

Table 2: Experiment performances recorded with the effective sample size [min, median, max] over 10 repetitions for varying dimension $k$, on the Stiefel manifold $\mathcal{V}(n,k)$.

(n,k)	(30,5)	(30,10)	(30,20)
GSS $w=1$	$[728,794,859]$	$[428,448,479]$	$[316,331,344]$
GSS $w=3$	$[3141,3628,4002]$	$[975,1111,1264]$	$[365,390,427]$
GSS $w=5$	$[\textbf{5327, 5843, 6581}]$	$[\textbf{1172, 1242, 1364}]$	$[\textbf{381, 397, 426}]$
RMH:	$[2803,3664,4609]$	$[1028,1140,1286]$	$[354,370,387]$
GeoRMH:	$[1006,4249,4704]$	$[1069,1148,1260]$	$[363,376,392]$

Varying $(n,k)$ on the Stiefel manifold.

First, we observe in Tables 1 and 2 that the larger the value of $w$, the higher the effective sample size (ESS). This is consistent with the fact that the average distance between the proposal and the current state increases with $w$, leading to a higher ESS as the space is better explored. However, note that on a compact manifold like $\mathcal{V}(n,k)$, following a geodesic for a long time may cause the proposal to return close to the starting point, similar to following the great circle of a sphere as shown in Figure 5. As a result, the gain in ESS reaches a plateau when $w\geqslant 7\geqslant 2\pi$.

The second observation is that for all samplers, as $n$ increases for fixed $k$, the ESS increases, and conversely, as $k$ increases with $n$ fixed, the ESS decreases. Considering (12), the number of directions that impact the density is equal to $k$ since $\textrm{Tr}(F^{\top}\Gamma)=\sum_{i=1}^{k}f_{i}^{\top}\Gamma_{i}$ for any $F=(f_{i})_{i\in\{1,\ldots,k\}},\Gamma=(\Gamma_{i})_{i\in\{1,\ldots,k\}}\in\mathbb{R}^{n\times k}$. Thus, when $k/n$ is small, the target density is “flat” in many directions and the risk of proposal rejection is small if the sampler attempts to move in these directions.

In Table 1, we see that RMH outperforms GSS and GeoRMH for $n=100$, and remains competitive for $n\in\{3,30\}$. This performance of RMH, perhaps surprising at first glance, can be explained by the fact that when $k/n$ is small, the number of constraints is low compared to the dimensionality of the space, making $\mathcal{V}(n,k)$ nearly Euclidean and the Stiefel projection nearly equal to the identity. Consequently, since the optimal acceptance rate of $0.234$ for the adaptive mechanism was found in a Euclidean space, it is plausible that RMH performs better in this scenario. This interpretation is further confirmed by Table 2, where we see that GSS ($w=5$) outperforms RMH and GeoRMH for any $k\in\{5,10,20\}$, though this advantage is less significant for GeoRMH due to its ability to explore the space according to its intrinsic geometry. Recall, to put our observations into perspective, that while RMH may outperform GSS and GeoRMH in terms of ESS, it is fundamentally biased.

In Table 1, we observe that GeoRMH is not as efficient as RMH but performs comparably to GSS($w=7$) when $n\in\{30,100\}$. This similarity can be linked to the fact that GeoRMH and GSS only differ in the method they employ to move on a geodesic: the former uses a Metropolis-Hastings mechanism, whereas the latter runs a slice sampler. Using (uniform) slice sampling can be viewed as first fixing an acceptance level and then drawing proposals (with the help of the stepping-out and shrinkage procedure) until acceptance is reached. This ensures that the sampled direction in the tangent space is not wasted, but it increases the computational cost of each transition step, since each proposal within the shrinkage procedure involves new computations. The parameter $w$ affects the average number of attempts in the shrinkage procedure since it widens the portion of the geodesic where the shrinkage proposal is sampled, thus increasing the chance of missing the acceptance level. For example, in the case $n=3$, there are, on average, $1.11$ attempts when $w=1$, but $1.41$ attempts when $w=5$. Therefore, there is a trade-off when selecting $w$ to optimize the time efficiency of sampling, as larger values of $w$ increase both the ESS and the computation time.

Varying $(n,k)$ on the Grassmann manifold.

In Table 3, we present only a subset of our experiments to convey the following message: when GeoMALA is well tuned, the gradient-informed sampler outperforms GSS regardless of the choice of $w$. However, GeoMALA is very sensitive to its tuning. For instance, in the case $(n,k)=(100,2)$ with $h=1$, the gradient information keeps the chain focused on high-density areas and it does not explore enough to outperform GSS, but if we increase the stepsize $h$ to $2$, GeoMALA is better than GSS. Regarding the role of the hyperparameter $w$, the conclusions are consistent with those on the Stiefel manifold. Increasing $w$ beyond a certain point does not pay off, since the manifold is compact. The sweet spot appears to be around $w=7\approx 2\pi$. Both methods have the same complexity, as in both cases we need to sample a point on the tangent space and compute a geodesic using an SVD.

Table 3: Experiment performances recorded with the effective sample size [min, median, max] on 10 repetitions for varying dimension $(n,k)$ with a fixed shape of distribution, on the Grassmann manifold. We show the results only for $w=7$ since it does not affect the comparison with GeoMALA, and the dependence according to $w$ is globally the same as on the Stiefel manifold. “GeoMALA Best $h$” means that we provide the result for the best stepsize parameter $h\in\{0.01,0.1,0.5,1,2\}$.

$(n,k)$	(3,2)	(30,20)	(100,2)
GSS $w=7$	$\small{[29520,37691,41409]}$	$\small{[22434,24668,28531]}$	$\small{[27294,30578,35995]}$
GeoMALA Best $h$	$\small{[\textbf{46086, 51165, 55630}]}$	$\small{[\textbf{39445,41776,48142}]}$	$\small{[\textbf{56437, 61601, 71250}]}$

Varying anisotropy on the Stiefel manifold.

Table 4: Experiment performances recorded with the effective sample size [min, median, max] on the 10 repetitions for varying anisotropy factor $\lambda$ when $(n,k)=(30,2)$, on the Stiefel manifold.

$\lambda$	1	10	100
GSS $w=5$	$[28375,34262,37405]$	$[\textbf{4901, 5283, 5477}]$	$[\textbf{1153, 1328, 1453}]$
RMH	$[\textbf{50772},54243,59906]$	$[1492,2314,3214]$	$[669,878,998]$
GeoRMH	$[49007,\textbf{57195, 68948}]$	$[1978,2336,3217]$	$[682,870,1075]$

In Table 4, the samplers’ performances worsen as $\lambda$ increases, indicating the difficulty of exploring the space when the density is sharp in a specific direction.

We observe that RMH and GeoRMH outperform GSS only when $\lambda=1$ , highlighting the effectiveness of the slice sampling approach of GSS in dealing with sharp densities.

In our experiments, we notice that the number of attempts in the shrinkage procedure is more sensitive to the variance of the target distribution than to the value of the parameter $w$. Therefore, in cases where GSS is a good fit for the sampling task, the computation time increases accordingly.

Varying the variance on the Grassmann manifold.

In Table 5, we observe that GSS outperforms GeoMALA when $\lambda\in\{10,100\}$ , indicating its advantage in situations where the density is concentrated. It highlights the fact that GeoMALA is not robust to the choice of the stepsize parameter.

Furthermore, we find that using a lower value for $w$ is more suitable when the distribution is sharp (e.g., $w=1,\lambda=100$ ). Additionally, choosing a large value of the stepping out parameter $m$ can improve the performance when $w$ is chosen too small.

This experiment reaffirms the previous findings, demonstrating that GSS adapts to the geometry of the density, which is promising for practical applications. Moreover, the performance of GSS appears to be quite robust across different choices of $w$ , and taking a large value for $m$ strengthens this feature.

Table 5: Experiment performances recorded with the effective sample size [min, median, max] on 10 repetitions for varying variance factor $\lambda$ when $(n,k)=(3,2)$, on the Grassmann manifold. “GeoMALA Best $h$” means that we provide the result for the best stepsize parameter $h\in\{0.01,0.1,0.5,1\}$.

$\lambda$	1	10	100
GSS $w=1,m=1$	$\small{[8877,10070,11010]}$	$\small{[10905,12282,13582]}$	$\small{[16111,19371,23164]}$
GSS $w=1,m=3$	$\small{[31097,35149,42968]}$	$\small{[13630,16001,19422]}$	$\small{[12298,20144,21981]}$
GSS $w=1,m=10$	$\small{[32075,36968,42086]}$	$\small{[\textbf{17346, 18913, 22446}]}$	$\small{[16390,19436,23697]}$
GSS $w=7,m=1$	$\small{[33768,36142,41443]}$	$\small{[14807,17197,20787]}$	$\small{[\textbf{18110, 19526},23424]}$
GeoMALA Best $h$	$\small{[\textbf{44628, 49368, 56009}]}$	$\small{[11173,12986,14294]}$	$\small{[395,2611,\textbf{43087}]}$

We discuss some further aspects of the conducted numerical experiments.

Remark 9.

(Complexity.) Up to this point, a discerning reader might consider the comparison of ESS between RMH, GSS, and GeoRMH unfair, as GSS involves additional computations due to the shrinkage procedure. While this is true, the main computational bottlenecks in all methods are the reprojection of the proposal on the Stiefel manifold at the end of each step using an SVD $O(nk^{2})$ , the sampling on the tangent space using a QR decomposition $O(k^{3})$ , and the eigenvalues of the skew-symmetric matrices $O((2k)^{3})$ .

In the case of the shrinkage procedure in GSS, an eigenvalue decomposition is initially computed, enabling the generation of geodesics with some matrix products for the subsequent attempts. Thus, from a computational perspective, the only difference between GeoRMH and GSS is the cost of these matrix products $O(n^{2.37})$ .

However, when comparing RMH and GSS, we need to account for the additional computational cost of the QR decomposition and eigenvalue decomposition, both in $O(k^{3})$. Notably, the computational cost related to constraints naturally increases with $k$. In practice, without any engineering optimization of the code, we observe that GSS takes between two and four times longer to execute than RMH (which is biased), but it is nearly equivalent to GeoRMH in terms of computation time when $n$ is not excessively large. The relative speed of RMH compared to GSS has to be weighed against its bias (of unknown size).

Remark 10.

(Choice of the hyperparameters.) The performance of GSS seems to be quite robust to the choice of the hyperparameters $m$ and $w$ as soon as $mw\geqslant 2\pi$ (see Tables 1, 2, 5). The number $2\pi$ coincides with twice the diameter of the Grassmann and the Stiefel manifold. Therefore, as a heuristic we propose to choose $m$ and $w$ such that $mw$ is about twice the diameter for compact manifolds in general.

3.2 A practical case: Understanding the variability in graph data sets.

We consider in this section a model introduced in [mantoux2021understanding] that aims to infer the structure of adjacency matrices from functional connectivity networks of brains. Consider $(\Upphi^{(j)})_{j\in\{1,\ldots,J\}}$ , $J$ adjacency matrices of different networks with $n\in\mathbb{N}$ nodes each. Let $k\in\mathbb{N}$ with $k\leqslant n$ . The model that we consider has parameter $\theta=(\sigma_{\kappa}^{2},\sigma^{2}_{\epsilon},\mu,F)$ , $\sigma_{\kappa}^{2},\sigma^{2}_{\epsilon}>0,\mu\in\mathbb{R}^{k}$ , and $F\in\mathbb{R}^{n\times k}$ , and is defined for $j\in\{1,\ldots,J\}$ as

\begin{equation}
\begin{aligned}
\Upphi^{(j)}&=\Gamma^{(j)}\operatorname{diag}(\kappa^{(j)})(\Gamma^{(j)})^{\top}+\mathcal{E}(\epsilon^{(j)}),\qquad\epsilon^{(j)}\overset{\text{i.i.d.}}{\sim}\mathcal{N}(0,\sigma_{\epsilon}^{2}\mathrm{Id}_{n(n+1)/2}),\\
\kappa^{(j)}&\overset{\text{i.i.d.}}{\sim}\mathcal{N}(\mu,\sigma_{\kappa}^{2}\mathrm{Id}_{k}),\qquad\Gamma^{(j)}\overset{\text{i.i.d.}}{\sim}\operatorname{vMF}(F),
\end{aligned}
\tag{13}
\end{equation}

where $\Gamma^{(j)}\in\mathcal{V}(n,k)$ , $\kappa^{(j)}\in\mathbb{R}^{k}$ is called pattern weight vector, and $\epsilon^{(j)}\in\mathbb{R}^{n(n+1)/2}$ represents the symmetric residual noise (i.e., $\mathcal{E}$ maps a vector in $\mathbb{R}^{n(n+1)/2}$ to the symmetric matrix in $\mathbb{R}^{n\times n}$ determined by its components). The unobserved variables $\Gamma^{(j)}\in\mathcal{V}(n,k)$ and $\kappa^{(j)}$ determine the individual-level specificity of network $j$ .
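To make the generative structure of (13) concrete, the following sketch draws one adjacency matrix. It rests on two assumptions that are not taken from the source: the ordering used by `vec_to_sym` to fill the symmetric matrix (row-major upper triangle), and the draw of $\Gamma$, which is taken uniform on $\mathcal{V}(n,k)$ via a sign-corrected QR decomposition as a stand-in for a proper $\operatorname{vMF}(F)$ sampler. The function names are hypothetical.

```python
import numpy as np

def vec_to_sym(eps, n):
    """Map a vector of length n(n+1)/2 to a symmetric n x n matrix.

    ASSUMPTION: components fill the upper triangle in row-major order.
    """
    E = np.zeros((n, n))
    E[np.triu_indices(n)] = eps
    return E + np.triu(E, 1).T

def sample_network(n, k, mu, sigma_kappa2, sigma_eps2, rng):
    """Draw (Phi, Gamma, kappa) following the structure of model (13)."""
    # ASSUMPTION: uniform frame on the Stiefel manifold instead of vMF(F).
    Qmat, Rmat = np.linalg.qr(rng.standard_normal((n, k)))
    Gamma = Qmat * np.sign(np.diag(Rmat))   # sign fix for a Haar-distributed frame
    kappa = mu + np.sqrt(sigma_kappa2) * rng.standard_normal(k)
    eps = np.sqrt(sigma_eps2) * rng.standard_normal(n * (n + 1) // 2)
    Phi = Gamma @ np.diag(kappa) @ Gamma.T + vec_to_sym(eps, n)
    return Phi, Gamma, kappa

rng = np.random.default_rng(0)
n, k = 30, 5
mu = np.array([10.0] + [2.0] * (k - 1))
Phi, Gamma, kappa = sample_network(n, k, mu, sigma_kappa2=2.0, sigma_eps2=0.1, rng=rng)
print(Phi.shape, np.allclose(Phi, Phi.T))
```

Replacing the uniform draw by a vMF sampler would only change the line that produces `Gamma`; the rest of the construction is unaffected.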

The original paper estimates the parameters using the Markov chain Monte Carlo stochastic approximation expectation maximization (MCMC-SAEM) procedure [kuhn2004coupling]. MCMC-SAEM is an extension of the expectation maximization (EM) algorithm, where the E-step involves approximating integrals using MCMC methods. In this context, the E-step samples from the density $p_{\Gamma}=p(\Gamma^{(j)}|\Upphi^{(j)},\kappa^{(j)},\theta_{n})$, where $\theta_{n}$ is the current estimate for the parameter $\theta$, within a Gibbs procedure using RMH (see [mantoux2021understanding, Algorithm 3]).

For the optimization procedure, we propose to replace RMH with GSS( $w=1,m=5$ ) to compare performances. Since our implementation of RMH is four times faster than GSS, we chose to multiply the number of MCMC iterations by four when using RMH for the E-step.

On a synthetic dataset.

We follow the same procedure as [mantoux2021understanding] to generate synthetic data $(\Upphi^{(j)}_{*},\Gamma^{(j)}_{*},\kappa^{(j)}_{*},\epsilon^{(j)}_{*})_{j\in\{1,\ldots,J\}}$ with $J=100$, $(n,k)=(30,5)$ and $(n,k)=(3,2)$. The generation parameters are fixed as $\sigma_{\epsilon}^{2}=0.1$, $\sigma_{\kappa}^{2}=2$, $\mu=(10,2,\ldots,2)\in\mathbb{R}^{k}$, and $F^{*}\in\mathbb{R}^{n\times k}$ is chosen as the matrix with columns $a_{1}f_{1},\ldots,a_{k}f_{k}\in\mathbb{R}^{n}$, where $(f_{i})_{i\in\{1,\ldots,k\}}\in\mathcal{V}(n,k)$ is sampled uniformly from $\mathcal{V}(n,k)$, and $a_{1}=\lambda\in\{1,100\}$, $a_{i}=1$ for any $i\in\{2,\ldots,k\}$. The factor $\lambda$ in this context serves as an anisotropy factor incorporated to examine how performance can be influenced by anisotropy. The optimization process involves random initialization of the parameters $F$ and $\kappa$, and we deterministically set $\sigma_{\kappa}^{2}=\sigma^{2}_{\epsilon}=1$. Then, $(\Gamma^{(j)})_{j\in\{1,\ldots,J\}}$ is initialized either by performing 200 iterations of GSS on $p_{\Gamma}$ or 800 iterations of RMH. The results are averaged over 10 repetitions with different random initializations and generation parameters.

We run the MCMC-SAEM procedure for 100 iterations, and at each step of MCMC-SAEM, 20 iterations of MCMC are performed when GSS is used, while 80 iterations are performed when RMH is used for the E-step of the EM algorithm.

First, upon examining Figure 6, we observe that the complete log-likelihood curve\footnote{Denoting by $\theta$ the model parameter, the complete log-likelihood is $\log p(\Upphi^{(j)},\kappa^{(j)},\Gamma^{(j)}|\theta)$ and the log-likelihood is $\log p(\Upphi^{(j)}|\theta)$.} exhibits higher values when $\lambda=100$ than when $\lambda=1$. This can be attributed to the unobserved variables $(\Gamma^{(j)})_{j\in\{1,\ldots,J\}}$ being more concentrated in the direction indicated by the first column of $F^{*}$, making recovery easier.

Secondly, GSS outperforms RMH regarding the complete likelihood. The curve increases more rapidly with the number of iterations and attains a higher final value. This improvement is particularly striking for $(n,k)=(30,5)$ , which is consistent with the findings in Table 2. Moreover, as illustrated by the third graph of Figure 6, the relative root mean square error (rRMSE) of the estimated parameter to the true parameter is always smaller when using GSS, especially when $\lambda=100$ .

Missing links imputation.

We follow the experimental setup proposed in [mantoux2021understanding, Section 5.1.2] for missing links imputation: A synthetic data set of $N=200$ adjacency matrices $\{\Upphi^{(i)}\}_{i=1}^{N}$ with $n=20$ nodes and $k=5$ is generated from the model specified by (13) with parameters $\theta=(\sigma_{\kappa}^{2},\sigma^{2}_{\epsilon},\mu,F)$ given in Appendix D.1. In a first stage, the MCMC-SAEM algorithm with GSS is applied to perform the estimation of the parameters resulting in an estimator $\hat{\mathbf{\theta}}$ . Then, from the same model, we generate another 200 samples $\{\bar{\Upphi}^{(i)}\}_{i=1}^{N}$ . For each of these samples, 16% of the edge weights corresponding to the interactions between the last eight nodes are masked. Denote by $\{\tilde{\Upphi}^{(i)}\}_{i=1}^{N}$ the resulting adjacency matrices. We then aim to reconstruct the missing links from the samples using the following three procedures:

•

The missing links are found from the maximum a posteriori (MAP) approximated as $\text{MAP}=\Gamma\operatorname{diag}(\kappa)(\Gamma)^{\top}$ , where $\Gamma,\kappa$ are the result of 4000 iterations of gradient ascent on the posterior density $p(\Gamma,\kappa|\tilde{\Upphi}^{(i)},\hat{\mathbf{\theta}})$ , the conditional density of $(\Gamma,\kappa)$ given the masked observation $\tilde{\Upphi}^{(i)}$ in the model (13).

•

The missing links are found from the posterior mean (PM) defined as

\text{PM}=\int_{\mathbb{R}^{d}}\int_{\mathcal{V}(n,k)}\Gamma\operatorname{diag}(\kappa)(\Gamma)^{\top}p(\Gamma,\kappa|\tilde{\Upphi}^{(i)},\hat{\mathbf{\theta}})\ \nu_{\mathfrak{g}}(\mathrm{d}\Gamma)\,\mathrm{Leb}_{d}(\mathrm{d}\kappa).

(14)

This integral is approximated using $\hat{\mathbf{\theta}}$ and GSS or RMH within a Gibbs sampler.

•

Finally, we consider a reconstruction using simply the link from $N^{-1}\sum_{i=1}^{N}\bar{\Upphi}^{(i)}$ .

For the PM and the MAP, GSS achieves an average rRMSE of 50% ($\pm$24% standard deviation over the dataset) and 52% ($\pm$16%), respectively, in contrast to 57% ($\pm$24%) and 58% ($\pm$28%) for RMH, and 85% ($\pm$10%) for the mean sample. The same experiment is then repeated with 40% of the edges masked uniformly at random instead of constraining the mask. With GSS, we find an average rRMSE of 30% ($\pm$7% standard deviation over the dataset) and 35% ($\pm$8%) for the PM and the MAP, compared to 34% ($\pm$9%) and 35% ($\pm$7%) when using RMH, and 75% ($\pm$5%) with the mean sample. Overall, GSS matches or outperforms RMH on this example, with the largest gain under the constrained mask.

On a real dataset.

We use real data to perform a comparison similar to the one proposed in [mantoux2021understanding], where the authors use connectivity matrices with $n=21$ nodes and $k=5$ in their model to analyze $N=1000$ subjects\footnote{We did not use the same data, since their dataset was unavailable to us.}. Data were provided by the Human Connectome Project\footnote{\url{https://www.humanconnectome.org/study/hcp-young-adult/document/extensively-processed-fmri-data-documentation}}. The dataset is composed of brain connectivity matrices generated from resting-state functional MRI (rs-fMRI) by following the pipeline described in this documentation\footnote{\url{https://www.humanconnectome.org/storage/app/media/documentation/s1200/HCP1200-DenseConnectome+PTN+Appendix-July2017.pdf}}. On a subject who receives no stimulation (at rest), the rs-fMRI records fluctuations in blood oxygenation levels throughout the brain. Maximizing the signal coherence in each region of the brain with a spatial independent component analysis (ICA) [beckmann2004probabilistic] yields a partition of the brain depending on its structure and variations from one individual to another. Finally, the temporal correlations between the mean blood oxygenation levels in each region are assembled into a matrix. This matrix is called the brain’s functional connectivity network. It should be noted that this network is not necessarily derived from physical reality, since it only represents correlations between brain regions. This is why the term “functional” is used. In this study, the connectivity matrices are defined on a parcellation of the brain into $n=25$ regions\footnote{The network modeling is related to “netmats” in the documentation.}. We chose the “recon2” group of the dataset, leaving us 812 matrices of dimension $25\times 25$ to analyze.

We choose $k=5$, and run 1000 MCMC-SAEM iterations with 20 MCMC steps per SAEM iteration when using GSS($w=1,m=10$), and 80 iterations when using RMH. The initialization procedure is the same as for synthetic data. To compare the samplers, we compute the rRMSE between PM and the observations similarly as in the previous paragraph. To assess the benefit of the model, we also compute the approximation given by the projection onto the subspace of the first five principal component analysis (PCA) components of the full data set, where each matrix $\Upphi^{(i)}$ has been vectorized.

The boxplots in Figure 7 show that GSS outperforms RMH since it reduces the tail of the error distribution. In addition, GSS achieves better results than RMH when looking at the evolution of the complete likelihood, as already observed on synthetic data. The model proves to be better suited than a simple PCA, as shown in [mantoux2021understanding], but often fails to offer a good representation of the observations in large dimensions, even for synthetic data.

3.3 ARMA model

Time series related to different types of data, such as dynamic textures, shape sequences and videos, are often modeled as auto-regressive and moving average (ARMA) models [doretto2003dynamic, aggarwal2004system, bissacco2001recognition, veeraraghavan2005matching]. Provided observations $z=(z_{t})_{t\in\{1,\ldots,T\}}\in(\mathbb{R}^{n})^{T}$ , the ARMA model equations are

\begin{align*}
z_{t}&=Hx_{t}+w_{t},\quad w_{t}\overset{\text{i.i.d.}}{\sim}\mathcal{N}(\boldsymbol{0}_{n},R),\\
x_{t+1}&=Bx_{t}+v_{t},\quad v_{t}\overset{\text{i.i.d.}}{\sim}\mathcal{N}(\boldsymbol{0}_{k},Q),\qquad t\in\{1,\ldots,T\},
\end{align*}

where $x=(x_{t})_{t\in\{1,\ldots,T\}}\in(\mathbb{R}^{k})^{T}$ is the hidden state vector, $B\in\mathbb{R}^{k\times k}$ the transition matrix, $H\in\mathcal{V}(n,k)$ the measurement matrix and $R\in\mathbb{R}^{n\times n},Q\in\mathbb{R}^{k\times k}$ covariance matrices. We can wrap this model in a Bayesian framework by considering the priors $x_{1}\sim\mathcal{N}(\boldsymbol{0}_{k},Q)$ and $H\sim\operatorname{vMF}(F)$ with $F\in\mathbb{R}^{n\times k}$ , such that $(F,B,R,Q)$ are seen as hyperparameters.

In this experiment, we compare the ESS related to the sampling of the posterior $p(H|z)\propto p(z|H)p(H)$ by using GSS and RMH. The ESS is computed from $(\textrm{Tr}(F^{\top}H_{i}))_{i\in\{1,\ldots,T\}}$ where $(H_{i})_{i\in\{1,\ldots,T\}}$ are the samples. The expression of $p(H)$ is given in (10) and $p(z|H)$ can be computed using Kalman filter updates. The observations $(z_{t})_{t\in\{1,\ldots,T\}}$ are synthetically generated from the model where the parameters are chosen randomly with $T=10$ and varying dimensions $(n,k)$ . We choose a low variance for the observation covariance matrix $R$ and a prior close to the true parameter in order to have a concentrated posterior.
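The evaluation of $p(z|H)$ through Kalman filter updates can be sketched as follows. This is a standard prediction-error decomposition written under the model equations above with prior $x_{1}\sim\mathcal{N}(\boldsymbol{0}_{k},Q)$; it is not the authors' implementation, and the function name `kalman_loglik` as well as the numerical values of $B$, $Q$, $R$ in the usage part are illustrative assumptions.

```python
import numpy as np

def kalman_loglik(z, H, B, Q, R):
    """Return log p(z | H) for observations z of shape (T, n)."""
    T, n = z.shape
    k = H.shape[1]
    m = np.zeros(k)   # predictive mean of x_1
    P = Q.copy()      # predictive covariance of x_1, from the prior x_1 ~ N(0, Q)
    loglik = 0.0
    for t in range(T):
        # Innovation and its covariance.
        e = z[t] - H @ m
        S = H @ P @ H.T + R
        _, logdet = np.linalg.slogdet(S)
        loglik += -0.5 * (n * np.log(2 * np.pi) + logdet + e @ np.linalg.solve(S, e))
        # Measurement update with gain K = P H^T S^{-1}.
        K = np.linalg.solve(S, H @ P).T
        m = m + K @ e
        P = P - K @ H @ P
        # Time update for the next hidden state.
        m = B @ m
        P = B @ P @ B.T + Q
    return loglik

# Usage on synthetic data (all parameter values below are hypothetical).
rng = np.random.default_rng(0)
n, k, T = 30, 2, 10
H, _ = np.linalg.qr(rng.standard_normal((n, k)))   # a point on V(n, k)
B, Q, R = 0.9 * np.eye(k), np.eye(k), 0.01 * np.eye(n)
x = rng.multivariate_normal(np.zeros(k), Q)
z = np.empty((T, n))
for t in range(T):
    z[t] = H @ x + rng.multivariate_normal(np.zeros(n), R)
    x = B @ x + rng.multivariate_normal(np.zeros(k), Q)
print(kalman_loglik(z, H, B, Q, R))
```

Adding the log-density of the prior $p(H)$ to this value gives the unnormalized log-posterior that a sampler on $\mathcal{V}(n,k)$ would target.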

In Table 6, GSS outperforms RMH, and a large $w$ increases the ESS. Surprisingly, increasing $n$ improves the ESS for GSS and not for RMH, contrary to the experiments with the von Mises-Fisher distribution in Section 3.1. This highlights the impact of the target distribution on the quality of the sampling and the robustness of GSS. The case $(n,k)=(30,5)$ reveals the influence of the parameter $m$: increasing $m$ while reducing $w$ improves performance. Large transitions are sometimes only possible in specific contexts, and a large $m$ allows them when necessary.

Table 6: Experiment performances recorded with the effective sample size [min, median, max] on 10 repetitions for varying dimensions $(n,k)$ of the Stiefel manifold.

( $n,k$ )	$(30,2)$	$(30,5)$	$(100,2)$
GSS $w=1,m=1$ :	[389,425,453]	[386,404,439]	[305,312,337]
GSS $w=5,m=2$ :	[418,467,484]	[470,511,565]	[1146,1414,1779]
GSS $w=10,m=1$ :	[506,549,589]	[436,484,527]	[1513,1966,2674]
RMH :	[293,367,387]	[402,421,468]	[286,296,323]

3.4 Bayesian clustering on the KTH video action dataset.

In this section, GSS is used to estimate parameters of a Bayesian clustering model on the KTH video action data [schuldt2004recognizing]. The pipeline used in [chakraborty2019statistics] is followed. For four different scenarios (called “d1”, “d2”, “d3” and “d4”), the dataset records 6 actions carried out by 25 humans, which yields 125 videos per scenario. From each video, a sequence of frames is extracted, and each frame is resized to $64\times 128$ before computing its histogram of oriented gradients (HOG) [dalal2005histograms] features. Their dimension is $d=3780$. Finally, an ARMA model is used to model the sequence of HOG features by estimating the parameter with the closed-form formula given in [doretto2003dynamic]. For each video, let $T$ be the number of frames and $f_{e}\in\mathbb{R}^{d\times T}$ be the matrix formed by stacking the HOG feature vectors from each frame. Let $f_{e}=\widehat{U}\mathrm{diag}(\lambda)\widehat{V}^{\top}$ be the SVD of $f_{e}$ truncated to its first $d_{l}=50$ components, i.e., $\widehat{U}\in\mathbb{R}^{d\times d_{l}}$, $\lambda\in\mathbb{R}^{d_{l}}$ and $\widehat{V}\in\mathbb{R}^{T\times d_{l}}$. Then the video is represented by $(\widehat{U},\lambda,\Sigma)\in\mathcal{V}(d,d_{l})\times\mathbb{R}^{d_{l}}\times\mathcal{V}(d_{l},d_{l})=:\mathcal{X}$ where

\Sigma=\mathrm{diag}(\lambda)\widehat{V}^{\top}M_{1}\widehat{V}(\widehat{V}^{% \top}M_{2}\widehat{V})^{-1}\mathrm{diag}(\lambda)^{-1},\quad M_{1}=\begin{% pmatrix}0&0\\ \mathrm{Id}_{T-1}&0\end{pmatrix},\,\,M_{2}=\begin{pmatrix}\mathrm{Id}_{T-1}&0% \\ 0&0\end{pmatrix}\in\mathbb{R}^{T\times T}.

Clustering models using mixtures of von Mises-Fisher distributions have already been proposed in [lin2017bayesian, sengupta2017bayesian], where the distribution on $\mathbb{R}^{d_{l}}$ is a mixture of multivariate Gaussian distributions with diagonal covariance matrix. The number of clusters equals six, the number of actions. The parameter of each cluster is $\theta_{k}=(F_{k}^{1},F_{k}^{2},\mu_{k},s_{k})\in(\mathbb{R}^{d\times d_{l}}\times\mathbb{R}^{d_{l}\times d_{l}}\times\mathbb{R}^{d_{l}}\times(0,\infty)^{d_{l}})$ and the mixing weight is $m_{k}\in[0,1]$. The observations $y_{i}=(\widehat{U}_{i},\lambda_{i},\Sigma_{i})\in\mathcal{X}$ and their cluster assignment $Z_{i}$ are assumed to follow this generation process:

\begin{align}
&Z_{i}\overset{\text{i.i.d.}}{\sim}\mathrm{Mult}((m_{k})_{k\in\{1,\ldots,6\}}),\quad y_{i}\overset{\text{i.i.d.}}{\sim}\sum_{k=1}^{6}m_{k}\,p(\cdot|Z_{i}=k,\theta_{k}),\tag{15}\\
&p(y_{i}|Z_{i}=k,\theta_{k})=p_{\operatorname{vMF}(F_{k}^{1})}(\widehat{U}_{i})\,p_{\operatorname{vMF}(F_{k}^{2})}(\Sigma_{i})\exp\left(-\frac{1}{2}(\lambda_{i}-\mu_{k})^{\top}\mathrm{diag}(s_{k})^{-1}(\lambda_{i}-\mu_{k})\right),\tag{16}
\end{align}

where $\mathrm{Mult}((m_{k})_{k\in\{1,\ldots,6\}})$ is the multinomial distribution with parameter $(m_{k})_{k\in\{1,\ldots,6\}}$ . In this Bayesian setting, we take a uniform non-informative prior on every parameter. We want to assess if the clustering separates the $6$ actions. To this end, for each environment, the dataset is split into a training set (67% of the data) and a test set (33%). The clustering model is trained with the training set and evaluated on the test set.

The parameter estimation is performed on the training set by adapting the EM algorithm described in [sengupta2017bayesian] to the product space $\mathcal{X}$. We denote by $(\hat{\theta}_{k})_{k\in\{1,\ldots,6\}}$ the estimated parameters. The cluster label for a point $y_{i}$ in the test set is given by $l_{i}=\operatorname*{arg\,max}_{k\in\{1,\ldots,6\}}\{p(y_{i}|\hat{\theta}_{k},Z_{i}=k)\}$. The related results are reported in Table 7 as “vMF clustering EM”.

The result of the EM algorithm is then used as an initialization for the sampling of the posterior. However, the sampling is only done for some parameters for computational reasons and to highlight the use of GSS on the Stiefel manifold. The vMF distribution parameters $(F_{k}^{1},F_{k}^{2})_{k\in\{1,\ldots,6\}}$ are parameterized using their SVD $F_{k}^{1}=\bar{U}_{k}\bar{D}_{k}\bar{V}_{k}^{\top},F_{k}^{2}=\widetilde{U}_{k}\widetilde{D}_{k}\widetilde{V}_{k}^{\top}$ belonging to $\mathcal{V}(d,d_{l})\times\mathbb{R}^{d_{l}}\times\mathcal{V}(d_{l},d_{l})$ and $\mathcal{V}(d_{l},d_{l})\times\mathbb{R}^{d_{l}}\times\mathcal{V}(d_{l},d_{l})$ respectively. Then we approximately sample from the posterior associated with $(\bar{U}_{k},\widetilde{U}_{k})_{k\in\{1,\ldots,6\}}$ using GSS, and the resulting posterior samples $(\bar{U}_{k}^{j},\widetilde{U}_{k}^{j})_{k\in\{1,\ldots,6\}}^{j\in\{1,\ldots,N\}}$ yield pseudo-posterior cluster parameter samples $(\tilde{\theta}_{k}^{j})^{j\in\{1,\ldots,N\}}_{k\in\{1,\ldots,6\}}$ where $N=500$ is the number of samples. Then, the samples $(\tilde{\theta}_{k}^{j})_{k\in\{1,\ldots,6\}}^{j\in\{1,\ldots,N\}}$ are used to compute Bayesian assignment weights $w_{i,k}=\sum_{j}p(y_{i}|\tilde{\theta}_{k}^{j},Z_{i}=k)/N$ for each observation $y_{i}\in\mathcal{X}$ of the test set and each cluster $k$. The label of each observation is finally assigned as $l_{i}=\operatorname*{arg\,max}_{k\in\{1,\ldots,6\}}\{w_{i,k}\}$. This procedure is referred to as “vMF clustering MCMC” in Table 7.

To demonstrate the benefit of our Bayesian approach, for each observation $y_{i}$ of the test set, we compute the empirical variance of the sample

L_{i}=(l_{i,j}=\operatorname*{arg\,max}_{k\in\{1,\ldots,6\}}p(y_{i}|\tilde{\theta}_{k}^{j},Z_{i}=k))_{j\in\{1,\ldots,N\}}

(17)

and remove the point $i$ if this variance is positive. The previous procedure is then applied to this reduced test set. The scores resulting from this procedure, referred to as “vMF clustering MCMC lower variance”, are given in Table 7.
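The assignment weights $w_{i,k}$, the labels $l_{i}$ and the variance-based filter built on (17) can be sketched as follows, assuming the likelihood values $p(y_{i}|\tilde{\theta}_{k}^{j},Z_{i}=k)$ have already been evaluated and stored in an array. The array layout, the function name `bayesian_labels` and the toy numbers are assumptions made for illustration.

```python
import numpy as np

def bayesian_labels(lik):
    """Compute labels and a stability mask from likelihood values.

    lik[j, i, k] holds p(y_i | theta_k^j, Z_i = k); shape (N, n_obs, K).
    """
    w = lik.mean(axis=0)                    # assignment weights w[i, k]
    labels = w.argmax(axis=1)               # l_i = argmax_k w[i, k]
    per_sample = lik.argmax(axis=2)         # l_{i,j}, stored with shape (N, n_obs)
    stable = per_sample.var(axis=0) == 0    # True where the label never changes
    return labels, stable

# Toy example: N = 2 posterior samples, 2 observations, K = 2 clusters.
lik = np.array([[[0.9, 0.1], [0.6, 0.4]],
                [[0.8, 0.2], [0.1, 0.9]]])
labels, stable = bayesian_labels(lik)
print(labels, stable)
```

Observations with `stable == False` are the ones that would be dropped before computing the “lower variance” scores.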

Table 7: Clustering results on the KTH action recognition dataset, reported as weighted F1 score (%).

Scenario	d1	d2	d3	d4
vMF clustering EM :	49.54	56.61	57.99	49.98
vMF clustering MCMC :	52.21	56.61	57.99	49.98
vMF clustering MCMC lower variance :	50.15	57.75	59.62	54.06

The Bayesian approach strengthens the estimation by averaging different plausible weights, as in ensemble methods [dietterich2000ensemble]; in Table 7 it is never worse than the simple maximum likelihood estimator (MLE).

4 Validity

The aim of this section is to prove the reversibility of the geodesic slice sampler. To this end, we introduce a stepping-out distribution to describe Algorithm 2 and a shrinkage kernel to describe Algorithm 3. Their properties collected in Lemma 11, Lemma 12, Lemma 15 and Lemma 16 intuitively ensure ‘reversibility on every parameterized intersection of geodesic and levelset’. Together with the invariance of the Liouville measure under a certain map on the tangent bundle, both introduced in Section 4.3, they are the essential ingredients for the proof of the reversibility of the geodesic slice sampler.

4.1 Stepping-out procedure

Throughout this section fix $w\in(0,\infty)$ and $m\in\mathbb{N}$ . We consider a generalization of the stepping-out procedure described in Section 2.1 that targets an arbitrary set $\mathsf{S}\in\mathcal{B}(\mathbb{R})$ , see Algorithm 4.

Algorithm 4 Stepping-out procedure targeting $\mathsf{S}\in\mathcal{B}(\mathbb{R})$

Input: point $\theta\in\mathbb{R}$ , hyperparameters $w\in(0,\infty)$ and $m\in\mathbb{N}$
Output: two points $\ell,r\in\mathbb{R}$ such that $\ell<\theta<r$

1: Draw $\Upsilon\sim\mathrm{Unif}([0,w])$, call the result $u$
2: Set $\ell:=\theta-u$ and $r:=\ell+w$
3: Draw $J\sim\mathrm{Unif}(\{1,\ldots,m\})$, call the result $\upiota$
4: Set $i=2$ and $j=2$
5: while $i\leqslant\upiota$ and $\ell\in\mathsf{S}$
6:     Set $\ell=\ell-w$
7:     Update $i=i+1$
8: end while
9: while $j\leqslant m+1-\upiota$ and $r\in\mathsf{S}$
10:     Set $r=r+w$
11:     Update $j=j+1$
12: end while
13: return $(\ell,r)$
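The steps of Algorithm 4 translate almost line by line into code. The sketch below represents the target set $\mathsf{S}$ by a membership function `in_S`, which is an assumption about how the set is supplied; the function name `stepping_out` and the example interval are likewise illustrative.

```python
import random

def stepping_out(theta, w, m, in_S, rng=random):
    """Return (left, right) following the steps of Algorithm 4.

    in_S is a callable t -> bool that tests membership in the target set S.
    """
    u = rng.uniform(0.0, w)          # step 1
    left = theta - u                 # step 2
    right = left + w
    iota = rng.randint(1, m)         # step 3, uniform on {1, ..., m}
    i = 2                            # step 4
    while i <= iota and in_S(left):  # steps 5 to 8
        left -= w
        i += 1
    j = 2
    while j <= m + 1 - iota and in_S(right):   # steps 9 to 12
        right += w
        j += 1
    return left, right               # step 13

# Usage with a hypothetical target set S = (-1, 2).
rng = random.Random(0)
print(stepping_out(theta=0.3, w=1.5, m=4, in_S=lambda t: -1.0 < t < 2.0, rng=rng))
```

The returned interval always has a width that is an integer multiple of $w$ between $w$ and $mw$, since at most $m-1$ extensions are performed in total.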

To formally describe the resulting distribution on $\mathbb{R}^{2}$ we use stopped random variables. Let $\Upsilon\sim\mathrm{Unif}([0,w])$ . For every $\theta\in\mathbb{R}$ let

\begin{split}L_{i}^{(\theta)}&:=\theta-\Upsilon-(i-1)w,\qquad i\in\mathbb{N},\\ R_{j}^{(\theta)}&:=\theta+(w-\Upsilon)+(j-1)w=\theta-\Upsilon+jw,\qquad j\in\mathbb{N},\end{split}

(18)

be two sequences of random variables. Observe that setting

\begin{split}L_{0}^{(\theta)}&:=R_{1}^{(\theta)}=\theta-\Upsilon-(-1)w\\ R_{0}^{(\theta)}&:=L_{1}^{(\theta)}=\theta-\Upsilon+0\cdot w\end{split}

(19)

is consistent with this definition. By construction both sequences are strictly monotone, i.e.,

\begin{split}L_{i+1}^{(\theta)}=\theta-\Upsilon-iw&<\theta-\Upsilon-(i-1)w=L_{i}^{(\theta)},\qquad i\in\mathbb{N}_{0},\theta\in\mathbb{R},\\ R_{j}^{(\theta)}=\theta-\Upsilon+jw&<\theta-\Upsilon+(j+1)w=R_{j+1}^{(\theta)},\qquad j\in\mathbb{N}_{0},\theta\in\mathbb{R}.\end{split}

(20)

We now define appropriate stopping times depending on the target set $\mathsf{S}\in\mathcal{B}(\mathbb{R})$ . To this end, let $J$ be a random variable that is independent of all previous objects satisfying $J\sim\mathrm{Unif}(\{1,\ldots,m\})$ , and set

\begin{split}\tau_{\mathsf{S}}^{(\theta)}&:=\inf\{i\in\mathbb{N}\mid L_{i}^{(\theta)}\notin\mathsf{S}\}\land J,\\ \mathfrak{T}_{\mathsf{S}}^{(\theta)}&:=\inf\{j\in\mathbb{N}\mid R_{j}^{(\theta)}\notin\mathsf{S}\}\land(m+1-J),\qquad\theta\in\mathbb{R},\mathsf{S}\in\mathcal{B}(\mathbb{R}).\end{split}

(21)

Note that $\tau_{\mathsf{S}}^{(\theta)}$ and $\mathfrak{T}_{\mathsf{S}}^{(\theta)}$ are finite stopping times with respect to the filtration generated by the sequence $(J,L_{1}^{(\theta)},\ldots,L_{n}^{(\theta)},R_{1}^{(\theta)},\ldots,R_{n}^{(\theta)})_{n\in\mathbb{N}}$. More precisely, we have the bounds

\begin{split}1\leqslant\tau_{\mathsf{S}}^{(\theta)}\leqslant J\leqslant m,\\ 1\leqslant\mathfrak{T}_{\mathsf{S}}^{(\theta)}\leqslant m+1-J\leqslant m+1-1=m,\\ 2\leqslant\tau_{\mathsf{S}}^{(\theta)}+\mathfrak{T}_{\mathsf{S}}^{(\theta)}\leqslant J+m+1-J\leqslant m+1.\end{split}

(22)

As $\tau_{\mathsf{S}}^{(\theta)}$ and $\mathfrak{T}_{\mathsf{S}}^{(\theta)}$ are finite stopping times,

\left(\boldsymbol{L}_{\mathsf{S}}^{(\theta)},\boldsymbol{R}_{\mathsf{S}}^{(\theta)}\right):=\left(L_{\tau_{\mathsf{S}}^{(\theta)}}^{(\theta)},R_{\mathfrak{T}_{\mathsf{S}}^{(\theta)}}^{(\theta)}\right)

(23)

are random variables for all $\theta\in\mathbb{R}$ and $\mathsf{S}\in\mathcal{B}(\mathbb{R})$ , see [Klenke, Lemma 9.23]. We define the stepping-out distributions

\xi_{\mathsf{S}}^{(\theta)}:=\mathbb{P}^{\big{(}\boldsymbol{L}_{\mathsf{S}}^{(\theta)},\boldsymbol{R}_{\mathsf{S}}^{(\theta)}\big{)}},\qquad\theta\in\mathbb{R},\mathsf{S}\in\mathcal{B}(\mathbb{R}),

on $(\mathbb{R}^{2},\mathcal{B}(\mathbb{R}^{2}))$ . Observe that the output of Algorithm 4 has distribution $\xi_{\mathsf{S}}^{(\theta)}$ . Since Algorithm 2 is a special case of Algorithm 4 with $\mathsf{S}=L(x,v,t)$ and $\theta=0$ , this definition is coherent with (5). Moreover, note that $(\theta,C)\mapsto\xi_{\mathsf{S}}^{(\theta)}(C)$ is a transition kernel. More details on this can be found in Remark 21.

We collect two properties of the stepping-out distribution that are useful to show reversibility of the geodesic slice sampler. Their proofs are postponed to Appendix A.

Lemma 11.

Let $\mathsf{S}\in\mathcal{B}(\mathbb{R}),$ and let $f:\mathbb{R}^{2}\to[0,\infty)$ be a measurable function. We have for all $\theta,\alpha\in\mathbb{R}$

\displaystyle\int_{\mathbb{R}^{2}}f(\ell,r)\mathbbm{1}_{(\ell,r)}(\alpha)\ \xi% _{\mathsf{S}}^{(\theta)}\big{(}{\rm d}(\ell,r)\big{)}=\int_{\mathbb{R}^{2}}f(% \ell,r)\mathbbm{1}_{(\ell,r)}(\theta)\ \xi_{\mathsf{S}}^{(\alpha)}\big{(}{\rm d% }(\ell,r)\big{)}.

The previous lemma essentially states that, conditioned on the event that the initial point lies inside the resulting interval, the stepping-out distribution does not depend on the initial point. To formulate the second property we need the collection

\Lambda_{\alpha}:\mathbb{R}\to\mathbb{R},\qquad\theta\mapsto\alpha-\theta

(24)

of linear functions indexed by $\alpha\in\mathbb{R}$. Intuitively, they express a U-turn at $\alpha\in\mathbb{R}$. Moreover, for $\alpha\in\mathbb{R}$ we define

\uplambda_{\alpha}:\mathbb{R}^{2}\to\mathbb{R}^{2},\qquad(\ell,r)\mapsto(% \alpha-r,\alpha-\ell).

(25)

The next lemma describes the behavior of the stepping-out distribution under U-turns.

Lemma 12.

Let $\mathsf{S}\in\mathcal{B}(\mathbb{R})$ and $\theta,\alpha\in\mathbb{R}$ . We have

\xi_{\Lambda_{\alpha}(\mathsf{S})}^{(\theta)}=(\uplambda_{\alpha})_{\sharp}\xi% _{\mathsf{S}}^{(\Lambda_{\alpha}(\theta))}.
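One way to build intuition for Lemma 12 is a pathwise coupling, sketched below. This is an illustration and not the proof given in Appendix A. If the randomness $(\Upsilon,J)$ of the stepping-out procedure is replaced by $(w-\Upsilon,m+1-J)$, the interval obtained for $\Lambda_{\alpha}(\mathsf{S})$ started at $\theta$ is exactly the image under $\uplambda_{\alpha}$ of the interval obtained for $\mathsf{S}$ started at $\Lambda_{\alpha}(\theta)$. The sketch assumes that $J$ is uniform on $\{1,\ldots,m\}$, so that $m+1-J$ has the same law as $J$. All function names are hypothetical, and rational arithmetic is used so that the comparison is exact.

```python
from fractions import Fraction


def stepping_out_det(theta, in_S, w, m, upsilon, J):
    """Stepping-out with the randomness (upsilon, J) passed in explicitly."""
    tau = 1
    while tau < J and in_S(theta - upsilon - (tau - 1) * w):
        tau += 1
    T = 1
    while T < m + 1 - J and in_S(theta - upsilon + T * w):
        T += 1
    return theta - upsilon - (tau - 1) * w, theta - upsilon + T * w


def u_turn_pair(alpha, interval):
    """The map (l, r) -> (alpha - r, alpha - l) from (25)."""
    left, right = interval
    return alpha - right, alpha - left


def check_u_turn(theta, alpha, in_S, w, m, upsilon, J):
    """Compare both sides of the U-turn identity under the coupling."""
    # Membership in the reflected set: z lies in it iff alpha - z lies in S.
    in_reflected_S = lambda z: in_S(alpha - z)
    lhs = stepping_out_det(theta, in_reflected_S, w, m, w - upsilon, m + 1 - J)
    rhs = u_turn_pair(alpha, stepping_out_det(alpha - theta, in_S, w, m, upsilon, J))
    return lhs == rhs
```

Since $w-\Upsilon$ is again uniform on $[0,w]$, the coupling suggests that both sides of the identity in Lemma 12 have the same distribution.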

4.2 Shrinkage procedure

In this section, for every half open interval $[\ell,r)\subseteq\mathbb{R}$, we introduce an algorithm that generalizes Algorithm 3 approximating the uniform distribution on $\mathsf{S}\cap(\ell,r)$ for an arbitrary open set $\mathsf{S}\in\mathcal{B}\big{(}\mathbb{R}\big{)}$. To express this scheme as a kernel, we employ the kernel of the shrinkage procedure, essentially operating on $\mathbb{S}^{1}$ parameterized by $[0,2\pi)$, introduced in [ReversibilityEllipticalSliceSampler] and push it forward to an arbitrary interval $[\ell,r)$ as described in [rudolf2022robust, Appendix A]. For the convenience of the reader we quickly sketch the kernel $Q_{\textgoth{S}}$ from [ReversibilityEllipticalSliceSampler]. To this end set

\mathbb{J}(\alpha,\theta):=\begin{dcases}[\alpha,\theta),&\alpha<\theta,\\ [0,\theta)\cup[\alpha,2\pi),&\alpha\geqslant\theta.\end{dcases}

We denote by $\delta_{z}$ the Dirac measure at $z\in\mathbb{R}$ . Let $\Uptheta$ be a random variable on $[0,2\pi)$ , and let $(\Upgamma_{n},\Upgamma_{n}^{\min},\Upgamma_{n}^{\max})_{n\in\mathbb{N}}$ be a sequence of random variables with conditional distributions

	$\displaystyle\mathbb{P}\big{(}(\Upgamma_{n+1},\Upgamma_{n+1}^{\min},\Upgamma_{% n+1}^{\max})\in\mathsf{C}\mid(\Upgamma_{n},\Upgamma_{n}^{\min},\Upgamma_{n}^{% \max})=(z,z^{\min},z^{\max}),\Uptheta=\theta\big{)}$
	$\displaystyle\qquad=\mathbbm{1}_{\mathbb{J}(z,z^{\max})}(\theta)\cdot\left(% \mathrm{Unif}\big{(}\mathbb{J}(z,z^{\max})\big{)}\otimes\delta_{z}\otimes% \delta_{z^{\max}}\right)(\mathsf{C})$
	$\displaystyle\qquad\qquad+\mathbbm{1}_{\mathbb{J}(z^{\min},z)}(\theta)\cdot% \left(\mathrm{Unif}\big{(}\mathbb{J}(z^{\min},z)\big{)}\otimes\delta_{z^{\min}% }\otimes\delta_{z}\right)(\mathsf{C})$

for $\theta,z,z^{\min},z^{\max}\in[0,2\pi)$ with $\theta,z\in\mathbb{J}(z^{\min},z^{\max})$ , $\mathsf{C}\in\mathcal{B}([0,2\pi)^{3})$ and $n\in\mathbb{N}$ . Moreover, for every set $\textgoth{S}\in\mathcal{B}([0,2\pi))$ which is open in $\mathbb{S}^{1}$ , i.e., satisfies that for all $\theta\in\textgoth{S}$ there exists $\varepsilon>0$ such that $\{\alpha\mod 2\pi\mid|\alpha-\theta|<\varepsilon,\alpha\in\mathbb{R}\}% \subseteq\textgoth{S}$ , define the stopping time

\mathcal{T}_{\textgoth{S}}:=\inf\{n\in\mathbb{N}\mid\Upgamma_{n}\in\textgoth{S}\}.

Then the kernel of the shrinkage procedure targeting $\mathrm{Unif}(\textgoth{S})$ is given by

Q_{\textgoth{S}}:\textgoth{S}\times\mathcal{B}(\textgoth{S})\to[0,1],\quad(\theta,\mathsf{A})\mapsto\mathbb{P}\left(\Upgamma_{\mathcal{T}_{\textgoth{S}}}\in\mathsf{A},\mathcal{T}_{\textgoth{S}}<\infty\mid\Uptheta=\theta\right).

To ``bend the interval $[\ell,r)$ onto $\mathbb{S}^{1}$'', where $\ell,r\in\mathbb{R}$ such that $\ell<r$, we use the family of maps

\displaystyle h_{\ell,r}:[\ell,r)\to[0,2\pi),\qquad\theta\mapsto\frac{2\pi}{r-% \ell}\theta\ \mod 2\pi.

Note that due to the restriction of the domain, these functions are bijective and therefore have inverses $h_{\ell,r}^{-1}$. Let $\ell,r\in\mathbb{R}$ such that $\ell<r$, and let $\mathsf{S}\in\mathcal{B}(\mathbb{R})$ be an open set such that $(\ell,r)\cap\mathsf{S}\neq\emptyset$. For such numbers $\ell,r$ and sets $\mathsf{S}$ we define the shrinkage kernel as the push forward of the kernel $Q_{\textgoth{S}}$ for $\textgoth{S}=h_{\ell,r}(\mathsf{S}\cap(\ell,r))$\footnote{Observe that $h_{\ell,r}(\mathsf{S}\cap(\ell,r))$ is open in $\mathbb{S}^{1}$, since $\mathsf{S}\cap(\ell,r)$ is open as a set in $\mathbb{R}$, and non-empty by assumption.} under $h_{\ell,r}^{-1}$, i.e.,

\displaystyle Q_{\mathsf{S}}^{\ell,r}(\theta,\mathsf{A}):=Q_{h_{\ell,r}(% \mathsf{S}\cap(\ell,r))}\big{(}h_{\ell,r}(\theta),h_{\ell,r}(\mathsf{A})\big{)% },\qquad\theta\in\mathsf{S}\cap(\ell,r),\mathsf{A}\in\mathcal{B}\big{(}\mathsf% {S}\cap(\ell,r)\big{)}.

Observe that this agrees with the definition made in (6), where we have $\theta=0$ and $\mathsf{S}=L(x,v,t)$ for $x\in\mathsf{M}$ , $v\in\mathbb{S}_{x}^{d-1}$ and $t\in(0,p(x))$ .
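The following sketch illustrates how a draw from $Q_{\mathsf{S}}^{\ell,r}(\theta,\cdot)$ could be simulated: the current point is mapped to the circle by $h_{\ell,r}$, the circular shrinkage is run there, and the accepted angle is mapped back by $h_{\ell,r}^{-1}$. The initialization (first proposal uniform on $[0,2\pi)$ with $\Upgamma^{\min}=\Upgamma^{\max}$ equal to it) is not specified in this section and is an assumption of the sketch. All function names are hypothetical.

```python
import math
import random

TWO_PI = 2.0 * math.pi


def h(theta, l, r):
    """Map [l, r) onto [0, 2*pi)."""
    return (TWO_PI / (r - l) * theta) % TWO_PI


def h_inv(y, l, r):
    """Inverse of h on [0, 2*pi), with values in [l, r)."""
    c = TWO_PI / (r - l)
    return l + ((y - c * l) % TWO_PI) / c


def in_arc(x, a, b):
    """Membership in J(a, b): [a, b) if a < b, else [0, b) united with [a, 2*pi)."""
    return a <= x < b if a < b else (x < b or x >= a)


def sample_arc(a, b, rng):
    """Uniform draw from J(a, b); J(a, a) is the whole circle."""
    length = (b - a) if a < b else (b - a + TWO_PI)
    return (a + rng.uniform(0.0, length)) % TWO_PI


def shrinkage(theta, l, r, in_S, rng=random, max_iter=100_000):
    """Sketch of one draw of the shrinkage procedure on (l, r) started at theta."""
    if not (l < theta < r and in_S(theta)):
        raise ValueError("theta must lie in the intersection of S and (l, r)")
    target = h(theta, l, r)
    z = rng.uniform(0.0, TWO_PI)  # ASSUMPTION: first proposal uniform on the circle
    z_min = z_max = z
    for _ in range(max_iter):
        candidate = h_inv(z, l, r)
        if l < candidate < r and in_S(candidate):
            return candidate
        # Keep the arc that still contains the image of the current point.
        if in_arc(target, z, z_max):
            z_min = z
        else:
            z_max = z
        z = sample_arc(z_min, z_max, rng)
    raise RuntimeError("no acceptance within max_iter proposals")
```

Because $\mathsf{S}\cap(\ell,r)$ is open and contains $\theta$, the shrinking arc eventually lies inside the image of this set; the cap `max_iter` is only a safeguard of the sketch.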

We briefly discuss the measurability of the shrinkage kernel in the arguments $(\theta,\ell,r)$.

Remark 13.

Let $L$ and $R$ be two random variables independent of $\Uptheta$ and $(\Upgamma_{n},\Upgamma_{n}^{\min},\Upgamma_{n}^{\max})_{n\in\mathbb{N}}$ satisfying $L<R$ almost surely, and let $\mathsf{S}\in\mathcal{B}(\mathbb{R})$, $\mathsf{A}\in\mathcal{B}(\mathsf{S})$. By a disintegration argument, we have for all $\theta\in[0,2\pi)$ and $\ell,r\in\mathbb{R}$ with $\ell<r$ that

	$\displaystyle f_{k}(\theta,\ell,r)$	$\displaystyle:=\mathbb{E}\left(\mathbbm{1}_{h_{\ell,r}(\mathsf{A}\cap\mathsf{S}\cap(\ell,r))}(\Upgamma_{k})\prod_{i=1}^{k-1}\mathbbm{1}_{[0,2\pi)\setminus h_{\ell,r}(\mathsf{S}\cap(\ell,r))}(\Upgamma_{i})\mid\Uptheta=\theta\right)$
		$\displaystyle=\mathbb{E}\left(\mathbbm{1}_{h_{L,R}(\mathsf{A}\cap\mathsf{S}\cap(L,R))}(\Upgamma_{k})\prod_{i=1}^{k-1}\mathbbm{1}_{[0,2\pi)\setminus h_{L,R}(\mathsf{S}\cap(L,R))}(\Upgamma_{i})\mid\Uptheta=\theta,L=\ell,R=r\right).$

Therefore

Q_{\mathsf{S}}^{\ell,r}(\theta,\mathsf{A}\cap(\ell,r))=\sum_{k=1}^{\infty}f_{k% }\left(h_{\ell,r}(\theta),\ell,r\right),\qquad\theta\in\mathsf{S}\cap(\ell,r),% \ell<r,

is measurable in $(\theta,\ell,r)$ . The equality above holds by definition of the shrinkage kernel, see also [ReversibilityEllipticalSliceSampler, Proof of Theorem 2.9]. Consequently

\{(\theta,\ell,r)\in\mathbb{R}^{3}\mid\ell<r,\theta\in\mathsf{S}\cap(\ell,r)\}\times\mathcal{B}(\mathsf{S})\to[0,1],\qquad\big{(}(\theta,\ell,r),\mathsf{B}\big{)}\mapsto Q_{\mathsf{S}}^{\ell,r}(\theta,\mathsf{B}\cap(\ell,r))

is a transition kernel.

Remark 14.

Note that the combination of stepping-out distribution and shrinkage kernel as in (7) is valid, i.e., the random interval generated by the stepping-out procedure can be used as an input for the shrinkage procedure. To see this, fix $x\in\mathsf{M}$ , $v\in\mathbb{S}_{x}^{d-1}$ and $t\in(0,p(x))$ . Let $\boldsymbol{L}_{L(x,v,t)}^{(0)}$ and $\boldsymbol{R}_{L(x,v,t)}^{(0)}$ be as in (23). We need to verify that

•

$L(x,v,t)$ is open,
•

$0\in L(x,v,t)\cap\big{(}\boldsymbol{L}_{L(x,v,t)}^{(0)},\boldsymbol{R}_{L(x,v,% t)}^{(0)}\big{)}$ almost surely. In particular this implies that this intersection is almost surely non-empty.

The lower semi-continuity of $p$ yields that $L(x,v,t)$ is open. Since $t\in(0,p(x))$ , we have $0\in L(x,v,t)$ . Moreover, we have $0\in\big{(}\boldsymbol{L}_{L(x,v,t)}^{(0)},\boldsymbol{R}_{L(x,v,t)}^{(0)}\big% {)}$ almost surely by construction. Therefore 0 is also almost surely contained in the intersection of these two sets.

We provide two properties of the shrinkage kernel $Q_{\mathsf{S}}^{\ell,r}$ that are useful to derive the reversibility of the geodesic slice sampler. Both are essentially extensions of corresponding properties of $Q_{\textgoth{S}}$ .

Lemma 15.

Let $\ell,r\in\mathbb{R}$ and let $\mathsf{S}\in\mathcal{B}(\mathbb{R})$ be an open set such that $(\ell,r)\cap\mathsf{S}\neq\emptyset$. Then the kernel $Q_{\mathsf{S}}^{\ell,r}$ is reversible with respect to $\mathrm{Unif}\big{(}\mathsf{S}\cap(\ell,r)\big{)}$.

To obtain this result, we push the reversibility statement for $Q_{\textgoth{S}}$ formulated in [ReversibilityEllipticalSliceSampler] forward to the shrinkage kernel on arbitrary half open intervals.

Proof.

By [ReversibilityEllipticalSliceSampler, Theorem 2.9] we know that $Q_{h_{\ell,r}(\mathsf{S}\cap(\ell,r))}$ is reversible with respect to $\mathrm{Unif}\big{(}h_{\ell,r}(\mathsf{S}\cap(\ell,r))\big{)}$. Observe that by [rudolf2022robust, Proposition 19] this implies that $Q_{\mathsf{S}}^{\ell,r}$ is reversible with respect to $(h_{\ell,r}^{-1})_{\sharp}\mathrm{Unif}\big{(}h_{\ell,r}(\mathsf{S}\cap(\ell,r))\big{)}=\mathrm{Unif}\big{(}\mathsf{S}\cap(\ell,r)\big{)}$.

∎

The next lemma can be seen as describing the behavior of the shrinkage kernel under U-turns.

Lemma 16.

Let $\ell,r\in\mathbb{R}$ such that $\ell<r$ and let $\mathsf{S}\in\mathcal{B}(\mathbb{R})$ be an open set such that $(\ell,r)\cap\mathsf{S}\neq\emptyset$ . For all $\alpha\in\mathsf{S}\cap(\ell,r)$ and $\mathsf{A}\in\mathcal{B}(\Lambda_{\alpha}(\mathsf{S}\cap(\ell,r)))$ we have

Q_{\Lambda_{\alpha}(\mathsf{S})}^{\uplambda_{\alpha}(\ell,r)}(0,\mathsf{A})=Q_% {\mathsf{S}}^{\ell,r}(\alpha,\Lambda_{\alpha}(\mathsf{A})),

where $\Lambda_{\alpha}$ and $\uplambda_{\alpha}$ are defined as in (24) and (25) respectively.

In order to leverage a similar property of the kernel $Q_{\textgoth{S}}$ for showing the above statement, we need the collection of maps

\displaystyle\widetilde{\Lambda}_{\alpha}:[0,2\pi)\to[0,2\pi),\qquad\theta% \mapsto\alpha-\theta\mod 2\pi

indexed by $\alpha\in[0,2\pi)$ .

Proof.

We aim to apply [ReversibilityEllipticalSliceSampler, Lemma 2.10]. Let $\ell,r\in\mathbb{R}$ such that $\ell<r$ . Moreover, let $\mathsf{S}\in\mathcal{B}(\mathbb{R})$ be an open set such that $(\ell,r)\cap\mathsf{S}\neq\emptyset$ , and $\alpha\in(\ell,r)\cap\mathsf{S}$ . Observe that for $\theta\in\mathsf{S}\cap(\ell,r)$ we have

	$\displaystyle\widetilde{\Lambda}_{h_{\ell,r}(\alpha)}\left(h_{\ell,r}(\theta)% \right)=\left(\frac{2\pi}{r-\ell}\alpha\mod 2\pi-\frac{2\pi}{r-\ell}\theta\mod 2% \pi\right)\mod 2\pi$
	$\displaystyle\qquad=\left(\frac{2\pi}{r-\ell}(\alpha-\theta)\right)\mod 2\pi=% \left(\frac{2\pi}{(\alpha-\ell)-(\alpha-r)}(\alpha-\theta)\right)\mod 2\pi$
	$\displaystyle\qquad=h_{\uplambda_{\alpha}(\ell,r)}\left(\Lambda_{\alpha}(% \theta)\right).$

Therefore

\widetilde{\Lambda}_{h_{\ell,r}(\alpha)}\Big{(}h_{\ell,r}\big{(}\mathsf{S}\cap% (\ell,r)\big{)}\Big{)}=h_{\uplambda_{\alpha}(\ell,r)}\Big{(}\Lambda_{\alpha}% \big{(}\mathsf{S}\cap(\ell,r)\big{)}\Big{)}=h_{\uplambda_{\alpha}(\ell,r)}\Big% {(}\Lambda_{\alpha}(\mathsf{S})\cap\big{(}\Lambda_{\alpha}(r),\Lambda_{\alpha}% (\ell)\big{)}\Big{)},

and

\widetilde{\Lambda}_{h_{\ell,r}(\alpha)}\left(h_{\ell,r}\big{(}\Lambda_{\alpha% }(\mathsf{A})\big{)}\right)=h_{\uplambda_{\alpha}(\ell,r)}(\mathsf{A})

for all $\mathsf{A}\in\mathcal{B}(\Lambda_{\alpha}(\mathsf{S}\cap(\ell,r)))$ , as $\Lambda_{\alpha}^{-1}=\Lambda_{\alpha}$ . Since $\widetilde{\Lambda}_{h_{\ell,r}(\alpha)}\left(h_{\ell,r}(\alpha)\right)=h_{% \ell,r}(\alpha)-h_{\ell,r}(\alpha)\mod 2\pi=0$ and $\widetilde{\Lambda}_{\alpha}^{-1}=\widetilde{\Lambda}_{\alpha}$ , we get by [ReversibilityEllipticalSliceSampler, Lemma 2.10, note that $g_{\theta}=\widetilde{\Lambda}_{\alpha}$ ] that for $\mathsf{A}\in\mathcal{B}(\Lambda_{\alpha}(\mathsf{S}\cap(\ell,r)))$

	$\displaystyle Q_{\Lambda_{\alpha}(\mathsf{S})}^{\uplambda_{\alpha}(\ell,r)}(0,% \mathsf{A})$	$\displaystyle=Q_{h_{\uplambda_{\alpha}(\ell,r)}\big{(}\Lambda_{\alpha}(\mathsf% {S})\cap(\Lambda_{\alpha}(r),\Lambda_{\alpha}(\ell))\big{)}}\left(0,h_{% \uplambda_{\alpha}(\ell,r)}(\mathsf{A})\right)$
		$\displaystyle=Q_{\widetilde{\Lambda}_{h_{\ell,r}(\alpha)}\big{(}h_{\ell,r}(% \mathsf{S}\cap(\ell,r))\big{)}}\left(\widetilde{\Lambda}_{h_{\ell,r}(\alpha)}% \left(h_{\ell,r}(\alpha)\right),\widetilde{\Lambda}_{h_{\ell,r}(\alpha)}\big{(% }h_{\ell,r}(\Lambda_{\alpha}(\mathsf{A}))\big{)}\right)$
		$\displaystyle=Q_{h_{\ell,r}(\mathsf{S}\cap(\ell,r))}\big{(}h_{\ell,r}(\alpha),% h_{\ell,r}(\Lambda_{\alpha}(\mathsf{A}))\big{)}$
		$\displaystyle=Q_{\mathsf{S}}^{\ell,r}(\alpha,\Lambda_{\alpha}(\mathsf{A})).$

∎
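The key identity of the preceding proof, $\widetilde{\Lambda}_{h_{\ell,r}(\alpha)}\big(h_{\ell,r}(\theta)\big)=h_{\uplambda_{\alpha}(\ell,r)}\big(\Lambda_{\alpha}(\theta)\big)$, can be checked numerically. The sketch below is only a floating-point sanity check with hypothetical function names; it compares both sides up to rounding, measuring the distance on the circle.

```python
import math

TWO_PI = 2.0 * math.pi


def h(theta, l, r):
    """Map [l, r) onto [0, 2*pi)."""
    return (TWO_PI / (r - l) * theta) % TWO_PI


def u_turn(alpha, theta):
    """The map theta -> alpha - theta from (24)."""
    return alpha - theta


def u_turn_pair(alpha, l, r):
    """The map (l, r) -> (alpha - r, alpha - l) from (25)."""
    return alpha - r, alpha - l


def circle_u_turn(a, y):
    """U-turn on the circle: y -> (a - y) mod 2*pi."""
    return (a - y) % TWO_PI


def circle_dist(a, b):
    """Distance between two angles on the circle."""
    d = abs(a - b) % TWO_PI
    return min(d, TWO_PI - d)


def identity_gap(alpha, theta, l, r):
    """Circular distance between the two sides of the identity."""
    lhs = circle_u_turn(h(alpha, l, r), h(theta, l, r))
    l2, r2 = u_turn_pair(alpha, l, r)
    rhs = h(u_turn(alpha, theta), l2, r2)
    return circle_dist(lhs, rhs)
```

The identity holds because $\uplambda_{\alpha}(\ell,r)$ has the same length $r-\ell$ as $(\ell,r)$, so both maps rescale by the same factor $2\pi/(r-\ell)$.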

4.3 Reversibility of the geodesic slice sampler

Note that the tangent bundle

T\mathsf{M}:=\bigcup_{x\in\mathsf{M}}\{x\}\times T_{x}\mathsf{M}

of $\mathsf{M}$ is a smooth, $2d$-dimensional, connected manifold, see e.g. [Sakai, Section I.2.2]. To prove Theorem 7, we employ the Riemannian structure of $T\mathsf{M}$. We only briefly sketch how this Riemannian structure is introduced; for more details, see [Sakai, Section II.4].

We denote by

\mathrm{proj}_{\mathsf{M}}:T\mathsf{M}\to\mathsf{M},\qquad(x,v)\mapsto x

the projection map from the tangent bundle onto $\mathsf{M}$ . Observe that for $u=(x,v)\in T\mathsf{M}$ the tangent space $T_{u}T\mathsf{M}$ to $T\mathsf{M}$ at $u$ can be identified with the direct sum

T_{u}T\mathsf{M}=T_{x}\mathsf{M}\oplus T_{x}\mathsf{M}.

For $u=(x,v)\in T\mathsf{M}$ this identification allows us to introduce a ``canonical'' metric on $T\mathsf{M}$ referred to as the Sasaki metric

\mathfrak{G}_{u}(\eta,\widetilde{\eta}):=\mathfrak{g}_{x}\left(\eta_{h},% \widetilde{\eta}_{h}\right)+\mathfrak{g}_{x}(\eta_{v},\widetilde{\eta}_{v}),% \qquad\eta=(\eta_{h},\eta_{v}),\widetilde{\eta}=(\widetilde{\eta}_{h},% \widetilde{\eta}_{v})\in T_{x}\mathsf{M}\oplus T_{x}\mathsf{M}.

Together with $\mathfrak{G}$, the tangent bundle is a Riemannian manifold. However, we are in fact more interested in a submanifold of $T\mathsf{M}$, namely the unit tangent bundle

U\mathsf{M}:=\bigcup_{x\in\mathsf{M}}U_{x}\mathsf{M}:=\bigcup_{x\in\mathsf{M}}% \{x\}\times\mathbb{S}^{d-1}_{x}.

We call the Riemannian measure $\nu_{\mathfrak{G}}$ , which is induced by the Sasaki metric $\mathfrak{G}$ onto the unit tangent bundle $U\mathsf{M}$ , the Liouville measure. Observe that the restriction $\mathrm{proj}_{\mathsf{M}}|_{U\mathsf{M}}$ is a Riemannian submersion, and for $x\in\mathsf{M}$ the fiber $\mathrm{proj}_{\mathsf{M}}|_{U\mathsf{M}}^{-1}(x)$ equipped with the metric induced by $\mathfrak{g}$ is $\left(\mathbb{S}^{d-1}_{x},\widehat{\mathfrak{g}}_{x}\right)$ . Note that additionally $\nu_{\widehat{\mathfrak{g}}_{x}}(\mathbb{S}^{d-1}_{x})=\nu_{\widehat{\mathfrak% {g}}}(\mathbb{S}^{d-1})$ for all $x\in\mathsf{M}$ . Applying Fubini’s theorem for manifolds (see [Sakai, Theorem II.5.6]), this yields a nice expression for the Liouville measure, namely we have

\int_{U\mathsf{M}}f(x,v)\ \nu_{\mathfrak{G}}\big{(}{\rm d}(x,v)\big{)}=\int_{% \mathsf{M}}\int_{\mathbb{S}_{x}^{d-1}}f(x,v)\ \nu_{\widehat{\mathfrak{g}}_{x}}% ({\rm d}v)\,\nu_{\mathfrak{g}}({\rm d}x)

(26)

for all measurable functions $f:U\mathsf{M}\to[0,\infty)$ .

In the following, we combine the Liouville measure with a family of maps $T^{(\theta)}:U\mathsf{M}\to U\mathsf{M}$ indexed by $\theta\in\mathbb{R}$ which can be interpreted as ``walking along the geodesic specified by $u\in U\mathsf{M}$ with step length $\theta$ and then doing a U-turn''. To this end, for $\theta\in\mathbb{R}$ we denote the geodesic flow by

\displaystyle\phi_{\theta}:U\mathsf{M}\to U\mathsf{M},\qquad(x,v)\mapsto\left(% \left.\gamma_{(x,v)}(\theta),\frac{{\rm d}\gamma_{(x,v)}}{{\rm d}t}\right|_{% \theta}\right).

For more details on the geodesic flow see [Sakai, Section II.4.II]. Moreover, we define a flip on the unit tangent bundle

\displaystyle\mathfrak{I}:U\mathsf{M}\to U\mathsf{M},\qquad(x,v)\mapsto(x,-v).

Then we set

T^{(\theta)}:=\mathfrak{I}\circ\phi_{\theta},\qquad\theta\in\mathbb{R}.

Observe that the Liouville measure is invariant under the geodesic flow (see [Sakai, Exercise II.16]) and under the flip $\mathfrak{I}$ (see [Paternain, Lemma 1.34]). Therefore we have due to the representation in (26) for all measurable functions $f:U\mathsf{M}\to[0,\infty)$ and all $\theta\in\mathbb{R}$ that

\begin{split}&\int_{\mathsf{M}}\int_{\mathbb{S}_{x}^{d-1}}f\left(T^{(\theta)}(% x,v)\right)\ \sigma_{d-1}^{(x)}({\rm d}v)\,\nu_{\mathfrak{g}}({\rm d}x)=\frac{% 1}{\nu_{\widehat{\mathfrak{g}}}(\mathbb{S}^{d-1})}\int_{U\mathsf{M}}f\left(T^{% (\theta)}(x,v)\right)\ \nu_{\mathfrak{G}}\big{(}{\rm d}(x,v)\big{)}\\ &\qquad=\frac{1}{\nu_{\widehat{\mathfrak{g}}}(\mathbb{S}^{d-1})}\int_{U\mathsf% {M}}f(x,v)\ \nu_{\mathfrak{G}}\big{(}{\rm d}(x,v)\big{)}=\int_{\mathsf{M}}\int% _{\mathbb{S}_{x}^{d-1}}f(x,v)\ \sigma_{d-1}^{(x)}({\rm d}v)\,\nu_{\mathfrak{g}% }({\rm d}x),\end{split}

(27)

i.e., the Liouville measure is invariant under $T^{(\theta)}$ .

We shed some further light on the interaction of $T^{(\theta)}$ and the geodesics.

Remark 17.

Let $x\in\mathsf{M}$ , $v\in\mathbb{S}_{x}^{d-1}$ and $\theta,\alpha\in\mathbb{R}$ . Using the rescaling property of geodesics (see [Lee, Lemma 5.18]) and the chain rule, we have

	$\displaystyle(\mathfrak{I}\circ\phi_{\theta})(x,v)$	$\displaystyle=\left(\gamma_{(x,v)}(\theta),-\left.\frac{{\rm d}\gamma_{(x,v)}}{{\rm d}t}\right|_{\theta}\right)=\left(\gamma_{(x,-v)}(-\theta),\left.\frac{{\rm d}\gamma_{(x,-v)}}{{\rm d}t}\right|_{-\theta}\right)$
		$\displaystyle=(\phi_{-\theta}\circ\mathfrak{I})(x,v).$

Since $\phi_{\theta}\circ\phi_{-\alpha}=\phi_{\theta-\alpha}$\footnote{This is a basic property of a flow, see e.g. [LeeSmooth, Chapter 9]. Observe that the geodesic flow is a flow on the (unit) tangent bundle.}, this yields

	$\displaystyle\gamma_{T^{(\alpha)}(x,v)}(\theta)$	$\displaystyle=\mathrm{proj}_{\mathsf{M}}\big{(}\phi_{\theta}\left(T^{(\alpha)}% (x,v)\right)\big{)}=\mathrm{proj}_{\mathsf{M}}\big{(}(\phi_{\theta}\circ% \mathfrak{I}\circ\phi_{\alpha})(x,v)\big{)}$
		$\displaystyle=\mathrm{proj}_{\mathsf{M}}\big{(}(\phi_{\theta}\circ\phi_{-% \alpha}\circ\mathfrak{I})(x,v)\big{)}=\mathrm{proj}_{\mathsf{M}}\big{(}(\phi_{% \theta-\alpha}\circ\mathfrak{I})(x,v)\big{)}$
		$\displaystyle=\mathrm{proj}_{\mathsf{M}}\left((\mathfrak{I}\circ\phi_{\Lambda_% {\alpha}(\theta)})(x,v)\right)=\mathrm{proj}_{\mathsf{M}}\left(\phi_{\Lambda_{% \alpha}(\theta)}(x,v)\right)=\gamma_{(x,v)}\big{(}\Lambda_{\alpha}(\theta)\big% {)}.$

In particular this implies

	$\displaystyle L\big{(}T^{(\alpha)}(x,v),t\big{)}$	$\displaystyle=\{\theta\mid p\left(\gamma_{T^{(\alpha)}(x,v)}(\theta)\right)>t% \}=\{\theta\mid p\left(\gamma_{(x,v)}\big{(}\Lambda_{\alpha}(\theta)\big{)}% \right)>t\}$
		$\displaystyle=\Lambda_{\alpha}\left(L(x,v,t)\right),$

as $\Lambda_{\alpha}^{-1}=\Lambda_{\alpha}$ .
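As a concrete instance of Remark 17, consider the unit sphere, where the geodesics are great circles, $\gamma_{(x,v)}(\theta)=\cos(\theta)x+\sin(\theta)v$. The sketch below checks $\gamma_{T^{(\alpha)}(x,v)}(\theta)=\gamma_{(x,v)}(\alpha-\theta)$ numerically for this special case. The choice of the sphere and all function names are assumptions made for illustration only.

```python
import math


def geodesic(x, v, theta):
    """Great circle through x with unit initial velocity v (v orthogonal to x)."""
    return tuple(math.cos(theta) * xi + math.sin(theta) * vi for xi, vi in zip(x, v))


def geodesic_flow(x, v, theta):
    """Position and velocity after moving for time theta along the great circle."""
    pos = geodesic(x, v, theta)
    vel = tuple(-math.sin(theta) * xi + math.cos(theta) * vi for xi, vi in zip(x, v))
    return pos, vel


def flip(x, v):
    """Reverse the direction: (x, v) -> (x, -v)."""
    return x, tuple(-vi for vi in v)


def T(x, v, alpha):
    """Walk along the geodesic for time alpha, then do a U-turn."""
    return flip(*geodesic_flow(x, v, alpha))


def max_gap(x, v, alpha, theta):
    """Largest coordinate difference between the two sides of the identity."""
    y, w = T(x, v, alpha)
    lhs = geodesic(y, w, theta)
    rhs = geodesic(x, v, alpha - theta)
    return max(abs(a - b) for a, b in zip(lhs, rhs))
```

In particular, evaluating the geodesic started at $T^{(\alpha)}(x,v)$ at time $\alpha$ returns the original point $x$, which is the mechanism used in the proof of Theorem 7.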

Now we can prove the reversibility of the geodesic slice sampler.

Proof of Theorem 7.

Observe that by virtue of [latuszyinski2014convergence, Lemma 1] it suffices to show that $K_{t}$ is reversible with respect to $\nu_{\mathfrak{g}}(L(t))^{-1}\nu_{\mathfrak{g}}|_{L(t)}$ for all $t\in(0,\|p\|_{\infty})$ .

Let $t\in(0,\|p\|_{\infty})$ and $\mathsf{A},\mathsf{B}\in\mathcal{B}(\mathsf{M})$. After introducing the uniform distribution on $L(x,v,t)\cap(\ell,r)$, we get by Lemma 11 that

	$\displaystyle\int_{L(t)\cap\mathsf{A}}K_{t}(x,\mathsf{B})\ \nu_{\mathfrak{g}}(% {\rm d}x)$
	$\displaystyle\quad=\int_{\mathsf{M}}\int_{\mathbb{S}_{x}^{d-1}}\int_{\mathbb{R% }^{2}}\int_{(\ell,r)}\mathbbm{1}_{L(t)\cap\mathsf{A}}(x)\mathbbm{1}_{\mathsf{B% }}\left(\gamma_{(x,v)}(\theta)\right)\ Q_{L(x,v,t)}^{\ell,r}(0,{\rm d}\theta)% \,\xi_{L(x,v,t)}^{(0)}\big{(}{\rm d}(\ell,r)\big{)}\,\sigma_{d-1}^{(x)}({\rm d% }v)\,\nu_{\mathfrak{g}}({\rm d}x)$
	$\displaystyle\quad=\int_{\mathbb{R}}\int_{\mathsf{M}}\int_{\mathbb{S}_{x}^{d-1% }}\int_{\mathbb{R}^{2}}\int_{(\ell,r)}\mathbbm{1}_{L(t)\cap\mathsf{A}}(x)% \mathbbm{1}_{\mathsf{B}}\left(\gamma_{(x,v)}(\theta)\right)\frac{1}{\mathrm{% Leb}_{1}(L(x,v,t)\cap(\ell,r))}\mathbbm{1}_{L(x,v,t)\cap(\ell,r)}(\alpha)$
	$\displaystyle\hskip 142.26378pt\times Q_{L(x,v,t)}^{\ell,r}(0,{\rm d}\theta)\ % \xi_{L(x,v,t)}^{(0)}\big{(}{\rm d}(\ell,r)\big{)}\ \sigma_{d-1}^{(x)}({\rm d}v% )\ \nu_{\mathfrak{g}}({\rm d}x)\ \mathrm{Leb}_{1}({\rm d}\alpha)$
	$\displaystyle\quad=\int_{\mathbb{R}}\int_{\mathsf{M}}\int_{\mathbb{S}_{x}^{d-1% }}\int_{\mathbb{R}^{2}}\int_{(\ell,r)}\mathbbm{1}_{L(t)\cap\mathsf{A}}(x)% \mathbbm{1}_{\mathsf{B}}\left(\gamma_{(x,v)}(\theta)\right)\frac{1}{\mathrm{% Leb}_{1}(L(x,v,t)\cap(\ell,r))}\mathbbm{1}_{L(x,v,t)}(\alpha)\mathbbm{1}_{(% \ell,r)}(0)$
	$\displaystyle\hskip 142.26378pt\times Q_{L(x,v,t)}^{\ell,r}(0,{\rm d}\theta)\ % \xi_{L(x,v,t)}^{(\alpha)}\big{(}{\rm d}(\ell,r)\big{)}\ \sigma_{d-1}^{(x)}({% \rm d}v)\ \nu_{\mathfrak{g}}({\rm d}x)\ \mathrm{Leb}_{1}({\rm d}\alpha).$

Using (27), we obtain

	$\displaystyle\int_{L(t)\cap\mathsf{A}}K_{t}(x,\mathsf{B})\ \nu_{\mathfrak{g}}(% {\rm d}x)$
	$\displaystyle\quad=\int_{\mathbb{R}}\int_{\mathsf{M}}\int_{\mathbb{S}_{x}^{d-1% }}\int_{\mathbb{R}^{2}}\int_{(\ell,r)}\frac{\mathbbm{1}_{L(t)\cap\mathsf{A}}% \left(\gamma_{(x,v)}(\alpha)\right)\mathbbm{1}_{\mathsf{B}}\left(\gamma_{T^{(% \alpha)}(x,v)}(\theta)\right)\mathbbm{1}_{L\left(T^{(\alpha)}(x,v),t\right)}(% \alpha)\mathbbm{1}_{(\ell,r)}(0)}{\mathrm{Leb}_{1}\big{(}L(T^{(\alpha)}(x,v),t% )\cap(\ell,r)\big{)}}$
	$\displaystyle\hskip 113.81102pt\times Q_{L\left(T^{(\alpha)}(x,v),t\right)}^{% \ell,r}(0,{\rm d}\theta)\,\xi_{L\left(T^{(\alpha)}(x,v),t\right)}^{(\alpha)}% \big{(}{\rm d}(\ell,r)\big{)}\,\sigma_{d-1}^{(x)}({\rm d}v)\,\nu_{\mathfrak{g}% }({\rm d}x)\,\mathrm{Leb}_{1}({\rm d}\alpha).$

For all $\alpha\in\mathbb{R}$ , we have

\gamma_{(x,v)}(\alpha)\in L(t)\quad\Leftrightarrow\quad p\left(\gamma_{(x,v)}(% \alpha)\right)>t\quad\Leftrightarrow\quad\alpha\in L(x,v,t),

and by Remark 17

\alpha\in L\big{(}T^{(\alpha)}(x,v),t\big{)}\quad\Leftrightarrow\quad p(x)=p% \left(\gamma_{(x,v)}(\alpha-\alpha)\right)=p\left(\gamma_{T^{(\alpha)}(x,v)}(% \alpha)\right)>t\quad\Leftrightarrow\quad x\in L(t).

If we also apply Remark 17 to $\mathbbm{1}_{\mathsf{B}}$ and to the set in the 1-dimensional Lebesgue measure in the denominator, we overall obtain

	$\displaystyle\int_{L(t)\cap\mathsf{A}}K_{t}(x,\mathsf{B})\ \nu_{\mathfrak{g}}(% {\rm d}x)$
	$\displaystyle\quad=\int_{\mathbb{R}}\int_{\mathsf{M}}\int_{\mathbb{S}_{x}^{d-1% }}\int_{\mathbb{R}^{2}}\int_{(\ell,r)}\frac{\mathbbm{1}_{L(x,v,t)}\left(\alpha% \right)\mathbbm{1}_{\mathsf{A}}\left(\gamma_{(x,v)}(\alpha)\right)\mathbbm{1}_% {\mathsf{B}}\left(\gamma_{(x,v)}\big{(}\Lambda_{\alpha}(\theta)\big{)}\right)% \mathbbm{1}_{L(t)}(x)\mathbbm{1}_{(\ell,r)}(0)}{\mathrm{Leb}_{1}\left(\Lambda_% {\alpha}(L(x,v,t))\cap(\ell,r)\right)}$
	$\displaystyle\hskip 113.81102pt\times Q_{L(T^{(\alpha)}(x,v),t)}^{\ell,r}(0,{% \rm d}\theta)\ \xi_{L(T^{(\alpha)}(x,v),t)}^{(\alpha)}\big{(}{\rm d}(\ell,r)% \big{)}\ \sigma_{d-1}^{(x)}({\rm d}v)\ \nu_{\mathfrak{g}}({\rm d}x)\ \mathrm{% Leb}_{1}({\rm d}\alpha).$

Then Lemma 12 together with Remark 17 yields

	$\displaystyle\int_{L(t)\cap\mathsf{A}}K_{t}(x,\mathsf{B})\ \nu_{\mathfrak{g}}(% {\rm d}x)$
	$\displaystyle\quad=\int_{\mathbb{R}}\int_{\mathsf{M}}\int_{\mathbb{S}_{x}^{d-1% }}\int_{\mathbb{R}^{2}}\int_{\Lambda_{\alpha}((\ell,r))}\frac{\mathbbm{1}_{L(x% ,v,t)}\left(\alpha\right)\mathbbm{1}_{\mathsf{A}}\left(\gamma_{(x,v)}(\alpha)% \right)\mathbbm{1}_{\mathsf{B}}\left(\gamma_{(x,v)}\big{(}\Lambda_{\alpha}(% \theta)\big{)}\right)\mathbbm{1}_{L(t)}(x)\mathbbm{1}_{\Lambda_{\alpha}((\ell,% r))}(0)}{\mathrm{Leb}_{1}\big{(}\Lambda_{\alpha}(L(x,v,t)\cap(\ell,r))\big{)}}$
	$\displaystyle\hskip 113.81102pt\times Q_{L(T^{(\alpha)}(x,v),t)}^{\uplambda_{% \alpha}(\ell,r)}(0,{\rm d}\theta)\ \xi_{L(x,v,t)}^{(0)}\big{(}{\rm d}(\ell,r)% \big{)}\ \sigma_{d-1}^{(x)}({\rm d}v)\ \nu_{\mathfrak{g}}({\rm d}x)\ \mathrm{% Leb}_{1}({\rm d}\alpha).$

Observe that

\displaystyle\mathrm{Leb}_{1}\big{(}\Lambda_{\alpha}(L(x,v,t)\cap(\ell,r))\big% {)}=\mathrm{Leb}_{1}\big{(}L(x,v,t)\cap(\ell,r)\big{)}

for all $x\in\mathsf{M}$, $v\in\mathbb{S}_{x}^{d-1}$ and $\alpha,\ell,r\in\mathbb{R}$. Hence

	$\displaystyle\int_{L(t)\cap\mathsf{A}}K_{t}(x,\mathsf{B})\ \nu_{\mathfrak{g}}(% {\rm d}x)$
	$\displaystyle\quad=\int_{\mathbb{R}}\int_{\mathsf{M}}\int_{\mathbb{S}_{x}^{d-1% }}\int_{\mathbb{R}^{2}}\int_{\Lambda_{\alpha}((\ell,r))}\frac{\mathbbm{1}_{L(x% ,v,t)}\left(\alpha\right)\mathbbm{1}_{\mathsf{A}}\left(\gamma_{(x,v)}(\alpha)% \right)\mathbbm{1}_{\mathsf{B}}\left(\gamma_{(x,v)}\big{(}\Lambda_{\alpha}(% \theta)\big{)}\right)\mathbbm{1}_{L(t)}(x)\mathbbm{1}_{\Lambda_{\alpha}((\ell,% r))}(0)}{\mathrm{Leb}_{1}(L(x,v,t)\cap(\ell,r))}$
	$\displaystyle\hskip 113.81102pt\times Q_{L(T^{(\alpha)}(x,v),t)}^{\uplambda_{% \alpha}(\ell,r)}(0,{\rm d}\theta)\ \xi_{L(x,v,t)}^{(0)}\big{(}{\rm d}(\ell,r)% \big{)}\ \sigma_{d-1}^{(x)}({\rm d}v)\ \nu_{\mathfrak{g}}({\rm d}x)\ \mathrm{% Leb}_{1}({\rm d}\alpha).$

Note that due to Remark 14 and Remark 17, we may apply Lemma 16, such that we obtain

	$\displaystyle\int_{L(t)\cap\mathsf{A}}K_{t}(x,\mathsf{B})\ \nu_{\mathfrak{g}}(% {\rm d}x)$
	$\displaystyle\quad=\int_{\mathbb{R}}\int_{\mathsf{M}}\int_{\mathbb{S}_{x}^{d-1% }}\int_{\mathbb{R}^{2}}\int_{(\ell,r)}\mathbbm{1}_{L(x,v,t)}\left(\alpha\right% )\mathbbm{1}_{\mathsf{A}}\left(\gamma_{(x,v)}(\alpha)\right)\mathbbm{1}_{% \mathsf{B}}\left(\gamma_{(x,v)}(\theta)\right)\frac{1}{\mathrm{Leb}_{1}(L(x,v,% t)\cap(\ell,r))}$
	$\displaystyle\hskip 56.9055pt\mathbbm{1}_{L(t)}(x)\mathbbm{1}_{\Lambda_{\alpha% }((\ell,r))}(0)\ Q_{L(x,v,t)}^{\ell,r}(\alpha,{\rm d}\theta)\ \xi_{L(x,v,t)}^{% (0)}\big{(}{\rm d}(\ell,r)\big{)}\ \sigma_{d-1}^{(x)}({\rm d}v)\ \nu_{% \mathfrak{g}}({\rm d}x)\ \mathrm{Leb}_{1}({\rm d}\alpha)$
	$\displaystyle\quad=\int_{\mathbb{R}}\int_{\mathsf{M}}\int_{\mathbb{S}_{x}^{d-1% }}\int_{\mathbb{R}^{2}}\int_{(\ell,r)}\mathbbm{1}_{L(x,v,t)}\left(\alpha\right% )\mathbbm{1}_{\mathsf{A}}\left(\gamma_{(x,v)}(\alpha)\right)\mathbbm{1}_{% \mathsf{B}}\left(\gamma_{(x,v)}(\theta)\right)\frac{1}{\mathrm{Leb}_{1}(L(x,v,% t)\cap(\ell,r))}$
	$\displaystyle\hskip 56.9055pt\mathbbm{1}_{L(t)}(x)\mathbbm{1}_{(\ell,r)}(% \alpha)\ Q_{L(x,v,t)}^{\ell,r}(\alpha,{\rm d}\theta)\ \xi_{L(x,v,t)}^{(0)}\big% {(}{\rm d}(\ell,r)\big{)}\ \sigma_{d-1}^{(x)}({\rm d}v)\ \nu_{\mathfrak{g}}({% \rm d}x)\ \mathrm{Leb}_{1}({\rm d}\alpha)$
	$\displaystyle\quad=\int_{\mathsf{M}}\int_{\mathbb{S}_{x}^{d-1}}\int_{\mathbb{R% }^{2}}\frac{1}{\mathrm{Leb}_{1}(L(x,v,t)\cap(\ell,r))}\int_{L(x,v,t)\cap(\ell,% r)}\int_{(\ell,r)}\mathbbm{1}_{\mathsf{A}}\left(\gamma_{(x,v)}(\alpha)\right)% \mathbbm{1}_{\mathsf{B}}\left(\gamma_{(x,v)}(\theta)\right)$
	$\displaystyle\hskip 56.9055pt\mathbbm{1}_{L(t)}(x)\ Q_{L(x,v,t)}^{\ell,r}(% \alpha,{\rm d}\theta)\ \mathrm{Leb}_{1}({\rm d}\alpha)\ \xi_{L(x,v,t)}^{(0)}% \big{(}{\rm d}(\ell,r)\big{)}\ \sigma_{d-1}^{(x)}({\rm d}v)\ \nu_{\mathfrak{g}% }({\rm d}x).$

This expression is symmetric in $\mathsf{A}$ and $\mathsf{B}$ , because of the reversibility of the shrinkage kernel. Namely Lemma 15 yields

	$\displaystyle\int_{L(t)\cap\mathsf{A}}K_{t}(x,\mathsf{B})\ \nu_{\mathfrak{g}}(% {\rm d}x)$
	$\displaystyle\quad=\int_{\mathsf{M}}\int_{\mathbb{S}_{x}^{d-1}}\int_{\mathbb{R% }^{2}}\frac{1}{\mathrm{Leb}_{1}(L(x,v,t)\cap(\ell,r))}\int_{L(x,v,t)\cap(\ell,% r)}\int_{(\ell,r)}\mathbbm{1}_{\mathsf{A}}\left(\gamma_{(x,v)}(\theta)\right)% \mathbbm{1}_{\mathsf{B}}\left(\gamma_{(x,v)}(\alpha)\right)$
	$\displaystyle\hskip 56.9055pt\mathbbm{1}_{L(t)}(x)\ Q_{L(x,v,t)}^{\ell,r}(% \alpha,{\rm d}\theta)\ \mathrm{Leb}_{1}({\rm d}\alpha)\ \xi_{L(x,v,t)}^{(0)}% \big{(}{\rm d}(\ell,r)\big{)}\ \sigma_{d-1}^{(x)}({\rm d}v)\ \nu_{\mathfrak{g}% }({\rm d}x).$

Consequently,

\displaystyle\int_{L(t)\cap\mathsf{A}}K_{t}(x,\mathsf{B})\ \nu_{\mathfrak{g}}(% {\rm d}x)=\int_{L(t)\cap\mathsf{B}}K_{t}(x,\mathsf{A})\ \nu_{\mathfrak{g}}({% \rm d}x).

∎

Acknowledgments

Data were kindly provided in part by the Human Connectome Project, WU-Minn Consortium (Principal Investigators: David Van Essen and Kamil Ugurbil; 1U54MH091657) funded by the 16 NIH Institutes and Centers that support the NIH Blueprint for Neuroscience Research; and by the McDonnell Center for Systems Neuroscience at Washington University.

We thank Rudrasis Chakraborty and Clément Mantoux for interesting discussions regarding the experiments on video actions and network data. We thank Eric Moulines, Randal Douc and Philip Schär for fruitful conversations regarding the theoretical part of this paper. MH and DR gratefully acknowledge support of the DFG within project 32680300 – SFB 1456 subproject B02. MH expresses her gratitude for the hospitality of École Polytechnique.

Appendix

Appendix A Properties of the stepping-out procedure

In this section we provide the proofs of Lemma 11 and Lemma 12. To this end, we suppose that we are in the setting of Section 4.1 summarized as follows:

Setting C.

Fix $w\in(0,\infty)$ and $m\in\mathbb{N}$ , and let $\mathsf{S}\in\mathcal{B}(\mathbb{R})$ . For all $\theta\in\mathbb{R}$ define the sequences $(L_{i}^{(\theta)})_{i\in\mathbb{N}}$ and $(R_{j}^{(\theta)})_{j\in\mathbb{N}}$ as in (18), and the stopping times $\tau_{\mathsf{S}}^{(\theta)}$ and $\mathfrak{T}_{\mathsf{S}}^{(\theta)}$ as in (21).

In the proofs of both lemmas we push a corresponding property of the sequences $(L_{i}^{(\theta)})_{i\in\mathbb{N}}$ and $(R_{j}^{(\theta)})_{j\in\mathbb{N}}$ to the stopped sequences. Observe that it is relatively simple to establish the required properties for $(L_{i}^{(\theta)})_{i\in\mathbb{N}}$ and $(R_{j}^{(\theta)})_{j\in\mathbb{N}}$, because their joint distributions are available explicitly.

Lemma 18.

Assume we are in Setting C. Let $i,j\in\mathbb{N}$ and $\theta\in\mathbb{R}$ . We have for all $\mathsf{A}_{1},\ldots,\mathsf{A}_{i},\mathsf{B}_{1},\ldots,\mathsf{B}_{j}\in% \mathcal{B}(\mathbb{R})$ that

	$\displaystyle\mathbb{P}\Big{(}L_{1}^{(\theta)}\in\mathsf{A}_{1},\ldots,L_{i}^{% (\theta)}\in\mathsf{A}_{i},R_{1}^{(\theta)}\in\mathsf{B}_{1},\ldots,R_{j}^{(% \theta)}\in\mathsf{B}_{j}\Big{)}$
	$\displaystyle\qquad=\frac{1}{w}\int_{0}^{w}\prod_{k=1}^{i}\mathbbm{1}_{\mathsf% {A}_{k}}\big{(}\theta-u-(k-1)w\big{)}\cdot\prod_{l=1}^{j}\mathbbm{1}_{\mathsf{% B}_{l}}(\theta-u+lw)\ \mathrm{Leb}_{1}({\rm d}u).$

Proof.

Let $i,j\in\mathbb{N}$ , $\theta\in\mathbb{R}$ and $\mathsf{A}_{1},\ldots,\mathsf{A}_{i},\mathsf{B}_{1},\ldots,\mathsf{B}_{j}\in% \mathcal{B}(\mathbb{R})$ . Then, as $\Upsilon\sim\mathrm{Unif}([0,w])$ , we have

	$\displaystyle\mathbb{P}\Big{(}L_{1}^{(\theta)}\in\mathsf{A}_{1},\ldots,L_{i}^{% (\theta)}\in\mathsf{A}_{i},R_{1}^{(\theta)}\in\mathsf{B}_{1},\ldots,R_{j}^{(% \theta)}\in\mathsf{B}_{j}\Big{)}$
	$\displaystyle\qquad=\mathbb{P}\Big{(}\theta-\Upsilon-(k-1)w\in\mathsf{A}_{k},% \,k\in\{1,\ldots,i\},\ \theta-\Upsilon+lw\in\mathsf{B}_{l},\,l\in\{1,\ldots,j% \}\Big{)}$
	$\displaystyle\qquad=\frac{1}{w}\int_{0}^{w}\prod_{k=1}^{i}\mathbbm{1}_{\mathsf% {A}_{k}}(\theta-u-(k-1)w)\cdot\prod_{l=1}^{j}\mathbbm{1}_{\mathsf{B}_{l}}(% \theta-u+lw)\ \mathrm{Leb}_{1}({\rm d}u).$

∎

We present partitions of certain events that facilitate the extension of properties of the sequences $(L_{i}^{(\theta)})_{i\in\mathbb{N}}$ and $(R_{j}^{(\theta)})_{j\in\mathbb{N}}$ to the stepping-out distribution.

Lemma 19.

Suppose we are in Setting C. Then for all $\theta,\alpha\in\mathbb{R}$ we have

1.

$\begin{aligned} \left\{L_{i}^{(\theta)}<\alpha<R_{j}^{(\theta)}\right\}&=\bigsqcup_{k=0}^{i-1}\left\{L_{k+1}^{(\theta)}<\alpha<L_{k}^{(\theta)}\right\}\ \sqcup\ \bigsqcup_{l=1}^{j-1}\left\{R_{l}^{(\theta)}<\alpha<R_{l+1}^{(\theta)}\right\}\\ &\qquad\sqcup\ \bigsqcup_{k=1}^{i-1}\left\{L_{k}^{(\theta)}=\alpha\right\}\ \sqcup\ \bigsqcup_{l=1}^{j-1}\left\{R_{l}^{(\theta)}=\alpha\right\}\text{ for all }i,j\in\mathbb{N},\end{aligned}$
2.

$\begin{aligned} \{\boldsymbol{L}_{\mathsf{S}}^{(\theta)}<\alpha<L_{0}^{(\theta% )}\}&=\bigsqcup_{i=1}^{m}\{L_{\tau_{\mathsf{S}}^{(\theta)}-i+1}^{(\theta)}<% \alpha<L_{\tau_{\mathsf{S}}^{(\theta)}-i}^{(\theta)}\,,\ \tau_{\mathsf{S}}^{(% \theta)}\geqslant i\}\\ &\qquad\sqcup\ \bigsqcup_{i^{\prime}=1}^{m-1}\{L_{i^{\prime}}^{(\theta)}=% \alpha,\ \tau_{\mathsf{S}}^{(\theta)}\geqslant i^{\prime}+1\},\end{aligned}$
3.

$\begin{aligned} \{R_{1}^{(\theta)}<\alpha<\boldsymbol{R}_{\mathsf{S}}^{(\theta% )}\}&=\bigsqcup_{j=1}^{m-1}\{R_{\mathfrak{T}_{\mathsf{S}}^{(\theta)}-j}^{(% \theta)}<\alpha<R_{\mathfrak{T}_{\mathsf{S}}^{(\theta)}-j+1}^{(\theta)}\,,\ % \mathfrak{T}_{\mathsf{S}}^{(\theta)}\geqslant j+1\}\\ &\qquad\sqcup\ \bigsqcup_{j^{\prime}=2}^{m-1}\left\{R_{j^{\prime}}^{(\theta)}=% \alpha,\mathfrak{T}_{\mathsf{S}}^{(\theta)}\geqslant j^{\prime}+1\right\},\end% {aligned}$
4.

$\mathbb{P}\left(L_{n}^{(\theta)}=\alpha\right)=\mathbb{P}\left(R_{n}^{(\theta)% }=\alpha\right)=0$ for all $n\in\mathbb{N}$ .

Proof.

To 1: Observe that the monotonicity property (20) of $(L_{k}^{(\theta)})_{k\in\mathbb{N}}$ and $(R_{l}^{(\theta)})_{l\in\mathbb{N}}$ implies the statement.

To 2: Let $\upomega\in\{\boldsymbol{L}_{\mathsf{S}}^{(\theta)}<\alpha<L_{0}^{(\theta)}\}$ . Then there exists

	$\displaystyle\widehat{k}$	$\displaystyle\in\{0,\ldots,\tau_{\mathsf{S}}^{(\theta)}(\upomega)-1\}\text{ % such that }L_{\widehat{k}+1}^{(\theta)}(\upomega)<\alpha<L_{\widehat{k}}^{(% \theta)}(\upomega),\text{ or}$
	$\displaystyle k^{\prime}$	$\displaystyle\in\{1,\ldots,\tau_{\mathsf{S}}^{(\theta)}(\upomega)-1\}\text{ % such that }L_{k^{\prime}}^{(\theta)}(\upomega)=\alpha.$

In the first case, choosing $k=\tau_{\mathsf{S}}^{(\theta)}(\upomega)-\widehat{k}$ and taking (22) into account, we get $k\in\{1,\ldots,m\}$ , $k\leqslant\tau_{\mathsf{S}}^{(\theta)}(\upomega)$ and $L_{\tau_{\mathsf{S}}^{(\theta)}-k+1}^{(\theta)}(\upomega)<\alpha<L_{\tau_{% \mathsf{S}}^{(\theta)}-k}^{(\theta)}(\upomega)$ . Or, in the second case, $k^{\prime}\in\{1,\ldots,m-1\}$ , $k^{\prime}+1\leqslant\tau_{\mathsf{S}}^{(\theta)}(\upomega)$ and $L_{k^{\prime}}^{(\theta)}(\upomega)=\alpha$ . Hence

\upomega\in\bigsqcup_{i=1}^{m}\{L_{\tau_{\mathsf{S}}^{(\theta)}-i+1}^{(\theta)% }<\alpha<L_{\tau_{\mathsf{S}}^{(\theta)}-i}^{(\theta)}\,,\ \tau_{\mathsf{S}}^{% (\theta)}\geqslant i\}\ \sqcup\ \bigsqcup_{i^{\prime}=1}^{m-1}\{L_{i^{\prime}}% ^{(\theta)}=\alpha,\tau_{\mathsf{S}}^{(\theta)}\geqslant i^{\prime}+1\}.

Conversely, suppose there exists

	$\displaystyle i$	$\displaystyle\in\{1,\ldots,m\}\text{ such that }\upomega\in\{L_{\tau_{\mathsf{S}}^{(\theta)}-i+1}^{(\theta)}<\alpha<L_{\tau_{\mathsf{S}}^{(\theta)}-i}^{(\theta)}\,,\ \tau_{\mathsf{S}}^{(\theta)}\geqslant i\},\text{ or}$
	$\displaystyle i^{\prime}$	$\displaystyle\in\{1,\ldots,m-1\}\text{ such that }\upomega\in\{L_{i^{\prime}}^{(\theta)}=\alpha,\tau_{\mathsf{S}}^{(\theta)}\geqslant i^{\prime}+1\}.$

Then clearly $\upomega\in\{\boldsymbol{L}_{\mathsf{S}}^{(\theta)}<\alpha<L_{0}^{(\theta)}\}$ .

To 3: We argue as for statement 2. Let $\upomega\in\{R_{1}^{(\theta)}<\alpha<\boldsymbol{R}_{\mathsf{S}}^{(\theta)}\}$ . Then there exists

	$\displaystyle\widehat{k}$	$\displaystyle\in\{1,\ldots,\mathfrak{T}_{\mathsf{S}}^{(\theta)}(\upomega)-1\}% \text{ such that }R_{\widehat{k}}^{(\theta)}(\upomega)<\alpha<R_{\widehat{k}+1% }^{(\theta)}(\upomega),\text{ or}$
	$\displaystyle k^{\prime}$	$\displaystyle\in\{2,\ldots,\mathfrak{T}_{\mathsf{S}}^{(\theta)}(\upomega)-1\}% \text{ such that }R_{k^{\prime}}^{(\theta)}(\upomega)=\alpha.$

In the first case, again choosing $k=\mathfrak{T}_{\mathsf{S}}^{(\theta)}(\upomega)-\widehat{k}$ and observing (22), we get $k\in\{1,\ldots,m-1\}$ , $k+1\leqslant\mathfrak{T}_{\mathsf{S}}^{(\theta)}(\upomega)$ and $R_{\mathfrak{T}_{\mathsf{S}}^{(\theta)}-k}^{(\theta)}(\upomega)<\alpha<R_{% \mathfrak{T}_{\mathsf{S}}^{(\theta)}-k+1}^{(\theta)}(\upomega)$ . Or, in the second case, we get $k^{\prime}\in\{2,\ldots,m-1\}$ , $k^{\prime}+1\leqslant\mathfrak{T}_{\mathsf{S}}^{(\theta)}(\upomega)$ and $R_{k^{\prime}}^{(\theta)}(\upomega)=\alpha$ . Thus

\upomega\in\bigsqcup_{j=1}^{m-1}\{R_{\mathfrak{T}_{\mathsf{S}}^{(\theta)}-j}^{% (\theta)}<\alpha<R_{\mathfrak{T}_{\mathsf{S}}^{(\theta)}-j+1}^{(\theta)}\,,\ % \mathfrak{T}_{\mathsf{S}}^{(\theta)}\geqslant j+1\}\ \sqcup\ \bigsqcup_{j^{% \prime}=2}^{m-1}\left\{R_{j^{\prime}}^{(\theta)}=\alpha,\mathfrak{T}_{\mathsf{% S}}^{(\theta)}\geqslant j^{\prime}+1\right\}.

Conversely, let

	$\displaystyle\upomega$	$\displaystyle\in\{R_{\mathfrak{T}_{\mathsf{S}}^{(\theta)}-j}^{(\theta)}<\alpha% <R_{\mathfrak{T}_{\mathsf{S}}^{(\theta)}-j+1}^{(\theta)},\mathfrak{T}_{\mathsf% {S}}^{(\theta)}\geqslant j+1\}\text{ for some }j\in\{1,\ldots,m-1\},\text{ or}$
	$\displaystyle\upomega$	$\displaystyle\in\left\{R_{j^{\prime}}^{(\theta)}=\alpha,\mathfrak{T}_{\mathsf{% S}}^{(\theta)}\geqslant j^{\prime}+1\right\}\text{ for some }j^{\prime}\in\{2,% \ldots,m-1\}.$

Then we have $\upomega\in\{R_{1}^{(\theta)}<\alpha<\boldsymbol{R}_{\mathsf{S}}^{(\theta)}\}$ , as $\mathfrak{T}_{\mathsf{S}}^{(\theta)}(\upomega)-j\geqslant j+1-j=1$ and $\mathfrak{T}_{\mathsf{S}}^{(\theta)}(\upomega)-j+1\leqslant\mathfrak{T}_{\mathsf{S}}^{(\theta)}(\upomega)-1+1=\mathfrak{T}_{\mathsf{S}}^{(\theta)}(\upomega)$ .

To 4: Let $n\in\mathbb{N}$ . We have by definition of $L_{n}^{(\theta)}$ that

	$\displaystyle\mathbb{P}\left(L_{n}^{(\theta)}=\alpha\right)$	$\displaystyle=\frac{1}{w}\int_{0}^{w}\mathbbm{1}_{\{\alpha\}}\big{(}\theta-u-(% n-1)w\big{)}\ \mathrm{Leb}_{1}({\rm d}u)$
		$\displaystyle\leqslant\frac{1}{w}\ \mathrm{Leb}_{1}\big{(}\{\theta-\alpha-(n-1% )w\}\big{)}=0.$

The statement for $R_{n}^{(\theta)}$ follows analogously. ∎
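
As a concrete instance of statement 1, take $i=j=2$. We read convention (19) as $L_{0}^{(\theta)}=R_{1}^{(\theta)}$; this is an assumption here, since (19) is not restated in this section, but it matches the way the case $q=0$ is used in Lemma 20. The partition then consists of three cells of width $w$ and two singleton events:

```latex
\begin{aligned}
\left\{L_{2}^{(\theta)}<\alpha<R_{2}^{(\theta)}\right\}
 &=\left\{L_{1}^{(\theta)}<\alpha<R_{1}^{(\theta)}\right\}
   \sqcup\left\{L_{2}^{(\theta)}<\alpha<L_{1}^{(\theta)}\right\}
   \sqcup\left\{R_{1}^{(\theta)}<\alpha<R_{2}^{(\theta)}\right\}\\
 &\qquad\sqcup\left\{L_{1}^{(\theta)}=\alpha\right\}
   \sqcup\left\{R_{1}^{(\theta)}=\alpha\right\}.
\end{aligned}
```

By statement 4 the two singleton events are null sets, so only the three open cells contribute when probabilities are computed.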

The following lemma collects the properties of $(L_{i}^{(\theta)})_{i\in\mathbb{N}}$ and $(R_{j}^{(\theta)})_{j\in\mathbb{N}}$ and the stopped sequences which lead to the proof of Lemma 11.

Lemma 20.

Assume we are in Setting C and let $\theta,\alpha\in\mathbb{R}$ .

1.

Let $i,j\in\mathbb{N}$ and $\mathsf{A}_{1},\ldots,\mathsf{A}_{i},\mathsf{B}_{1},\ldots,\mathsf{B}_{j}\in\mathcal{B}(\mathbb{R})$ . We have for all $q\in\{0,\ldots,i-1\}$ that

	$\displaystyle\mathbb{P}\left(L_{1}^{(\theta)}\in\mathsf{A}_{1},\ldots,L_{i}^{(\theta)}\in\mathsf{A}_{i},R_{1}^{(\theta)}\in\mathsf{B}_{1},\ldots,R_{j}^{(\theta)}\in\mathsf{B}_{j},L_{q+1}^{(\theta)}<\alpha<L_{q}^{(\theta)}\right)$
	$\displaystyle\quad=\mathbb{P}\left(R_{q}^{(\alpha)}\in\mathsf{A}_{1},\ldots,R_{1}^{(\alpha)}\in\mathsf{A}_{q},L_{1}^{(\alpha)}\in\mathsf{A}_{q+1},\ldots,L_{i-q}^{(\alpha)}\in\mathsf{A}_{i},\right.$
	$\displaystyle\quad\qquad\left.R_{q+1}^{(\alpha)}\in\mathsf{B}_{1},\ldots,R_{j+q}^{(\alpha)}\in\mathsf{B}_{j},R_{q}^{(\alpha)}<\theta<R_{q+1}^{(\alpha)}\right).$

Observe that for $q=0$ we use convention (19).
2.

Let $i,j\in\mathbb{N}$ . For all $\mathsf{A},\mathsf{B}\in\mathcal{B}(\mathbb{R})$ and $q\in\{0,\ldots,i-1\}$ we have

	$\displaystyle\mathbb{P}\left(\boldsymbol{L}_{\mathsf{S}}^{(\theta)}\in\mathsf{A},\boldsymbol{R}_{\mathsf{S}}^{(\theta)}\in\mathsf{B},L_{q+1}^{(\theta)}<\alpha<L_{q}^{(\theta)},\tau_{\mathsf{S}}^{(\theta)}=i,\mathfrak{T}_{\mathsf{S}}^{(\theta)}=j\right)$
	$\displaystyle\qquad=\mathbb{P}\left(\boldsymbol{L}_{\mathsf{S}}^{(\alpha)}\in\mathsf{A},\boldsymbol{R}_{\mathsf{S}}^{(\alpha)}\in\mathsf{B},R_{q}^{(\alpha)}<\theta<R_{q+1}^{(\alpha)},\tau_{\mathsf{S}}^{(\alpha)}=i-q,\mathfrak{T}_{\mathsf{S}}^{(\alpha)}=j+q\right).$
3.

For all $\mathsf{A},\mathsf{B}\in\mathcal{B}(\mathbb{R})$ we have

	$\displaystyle\mathbb{P}\left(\boldsymbol{L}_{\mathsf{S}}^{(\theta)}\in\mathsf{% A},\boldsymbol{R}_{\mathsf{S}}^{(\theta)}\in\mathsf{B},\boldsymbol{L}_{\mathsf% {S}}^{(\theta)}<\alpha<\boldsymbol{R}_{\mathsf{S}}^{(\theta)}\right)$
	$\displaystyle\qquad=\mathbb{P}\left(\boldsymbol{L}_{\mathsf{S}}^{(\alpha)}\in% \mathsf{A},\boldsymbol{R}_{\mathsf{S}}^{(\alpha)}\in\mathsf{B},\boldsymbol{L}_% {\mathsf{S}}^{(\alpha)}<\theta<\boldsymbol{R}_{\mathsf{S}}^{(\alpha)}\right).$

Proof.

To 1: Let $i,j\in\mathbb{N}$ , $q\in\{0,\ldots,i-1\}$ and $\mathsf{A}_{1},\ldots,\mathsf{A}_{i},\mathsf{B}_{1},\ldots,\mathsf{B}_{j}\in% \mathcal{B}(\mathbb{R})$ . By Lemma 18 we have

	$\displaystyle\mathbb{P}\left(L_{1}^{(\theta)}\in\mathsf{A}_{1},\ldots,L_{i}^{(% \theta)}\in\mathsf{A}_{i},R_{1}^{(\theta)}\in\mathsf{B}_{1},\ldots,R_{j}^{(% \theta)}\in\mathsf{B}_{j},L_{q+1}^{(\theta)}<\alpha<L_{q}^{(\theta)}\right)$
	$\displaystyle\qquad=\frac{1}{w}\int_{0}^{w}\prod_{k=1}^{i}\mathbbm{1}_{\mathsf% {A}_{k}}\big{(}\theta-u-(k-1)w\big{)}\cdot\prod_{l=1}^{j}\mathbbm{1}_{\mathsf{% B}_{l}}\big{(}\theta-u+lw\big{)}$
	$\displaystyle\qquad\qquad\qquad\qquad\cdot\mathbbm{1}_{(-\infty,\alpha)}(% \theta-u-qw)\mathbbm{1}_{(\alpha,\infty)}\big{(}\theta-u-(q-1)w\big{)}\ % \mathrm{Leb}_{1}({\rm d}u).$

Substituting $\widetilde{u}=u-\theta+\alpha+qw$ , we obtain

	$\displaystyle\mathbb{P}\left(L_{1}^{(\theta)}\in\mathsf{A}_{1},\ldots,L_{i}^{(% \theta)}\in\mathsf{A}_{i},R_{1}^{(\theta)}\in\mathsf{B}_{1},\ldots,R_{j}^{(% \theta)}\in\mathsf{B}_{j},L_{q+1}^{(\theta)}<\alpha<L_{q}^{(\theta)}\right)$
	$\displaystyle\qquad=\frac{1}{w}\int_{\mathbb{R}}\mathbbm{1}_{(0,w)}(\widetilde% {u}+\theta-\alpha-qw)\prod_{k=1}^{i}\mathbbm{1}_{\mathsf{A}_{k}}\big{(}\alpha-% \widetilde{u}-(k-q-1)w\big{)}$
	$\displaystyle\qquad\qquad\cdot\prod_{l=1}^{j}\mathbbm{1}_{\mathsf{B}_{l}}\big{% (}\alpha-\widetilde{u}+(l+q)w\big{)}\cdot\mathbbm{1}_{(-\infty,\alpha)}(\alpha% -\widetilde{u})\mathbbm{1}_{(\alpha,\infty)}\big{(}\alpha-\widetilde{u}+w\big{% )}\ \mathrm{Leb}_{1}({\rm d}\widetilde{u})$
	$\displaystyle\qquad=\frac{1}{w}\int_{\mathbb{R}}\mathbbm{1}_{(-\infty,\theta)}% (\alpha-\widetilde{u}+qw)\mathbbm{1}_{(\theta,\infty)}\big{(}\alpha-\widetilde% {u}+(q+1)w\big{)}\prod_{k^{\prime}=1}^{i-q}\mathbbm{1}_{\mathsf{A}_{k^{\prime}% +q}}\big{(}\alpha-\widetilde{u}-(k^{\prime}-1)w\big{)}$
	$\displaystyle\qquad\qquad\prod_{l^{\prime}=1}^{q}\mathbbm{1}_{\mathsf{A}_{q+1-% l^{\prime}}}\big{(}\alpha-\widetilde{u}+l^{\prime}w\big{)}\cdot\prod_{l^{% \prime}=q+1}^{j+q}\mathbbm{1}_{\mathsf{B}_{l^{\prime}-q}}\big{(}\alpha-% \widetilde{u}+l^{\prime}w\big{)}\cdot\mathbbm{1}_{(0,w)}(\widetilde{u})\ % \mathrm{Leb}_{1}({\rm d}\widetilde{u}).$

Applying again Lemma 18, we get

	$\displaystyle\mathbb{P}\left(L_{1}^{(\theta)}\in\mathsf{A}_{1},\ldots,L_{i}^{(% \theta)}\in\mathsf{A}_{i},R_{1}^{(\theta)}\in\mathsf{B}_{1},\ldots,R_{j}^{(% \theta)}\in\mathsf{B}_{j},L_{q+1}^{(\theta)}<\alpha<L_{q}^{(\theta)}\right)$
	$\displaystyle\qquad=\mathbb{P}\left(R_{q}^{(\alpha)}\in\mathsf{A}_{1},\ldots,R% _{1}^{(\alpha)}\in\mathsf{A}_{q},L_{1}^{(\alpha)}\in\mathsf{A}_{q+1},\ldots,L_% {i-q}^{(\alpha)}\in\mathsf{A}_{i},\right.$
	$\displaystyle\qquad\qquad\qquad\left.R_{q+1}^{(\alpha)}\in\mathsf{B}_{1},% \ldots,R_{j+q}^{(\alpha)}\in\mathsf{B}_{j},R_{q}^{(\alpha)}<\theta<R_{q+1}^{(% \alpha)}\right).$

To 2: Let $i,j\in\mathbb{N}$ , $q\in\{0,\ldots,i-1\}$ and $\mathsf{A},\mathsf{B}\in\mathcal{B}(\mathbb{R})$ . If $i+j>m+1$ , the statement is true by (22), which implies that both probabilities are zero. We now consider $i+j\leqslant m+1$ . To this end, for all $n,\widetilde{n}\in\mathbb{N}$ with $n+\widetilde{n}\leqslant m+1$ and $\upiota\in\{n,\ldots,m+1-\widetilde{n}\}$ set

	$\displaystyle\mathsf{A}_{k}^{\upiota,n,\widetilde{n}}=\mathsf{S},\quad k\in\{1% ,\ldots,n-1\}\qquad$	$\displaystyle\text{and}\qquad\mathsf{A}_{n}^{\upiota,n,\widetilde{n}}=\begin{% dcases}\mathsf{A},&\upiota=n\\ \mathsf{A}\cap(\mathbb{R}\setminus\mathsf{S}),&\upiota>n,\end{dcases}$
	$\displaystyle\mathsf{B}_{l}^{\upiota,n,\widetilde{n}}=\mathsf{S},\quad l\in\{1% ,\ldots,\widetilde{n}-1\}\qquad$	$\displaystyle\text{and}\qquad\mathsf{B}_{\widetilde{n}}^{\upiota,n,\widetilde{% n}}=\begin{dcases}\mathsf{B},&\upiota=m+1-\widetilde{n}\\ \mathsf{B}\cap(\mathbb{R}\setminus\mathsf{S}),&\upiota<m+1-\widetilde{n}.\end{dcases}$

Then for all $q\in\{0,\ldots,n-1\}$ we have

\begin{split}\mathsf{A}_{k}^{\upiota-q,n-q,\widetilde{n}+q}&=\mathsf{A}_{k+q}^% {\upiota,n,\widetilde{n}},\qquad\forall\ k\in\{1,\ldots,n-q\},\\ \mathsf{B}_{l}^{\upiota-q,n-q,\widetilde{n}+q}&=\begin{dcases}\mathsf{A}_{q-l+% 1}^{\upiota,n,\widetilde{n}},&l\leqslant q,\\ \mathsf{B}_{l-q}^{\upiota,n,\widetilde{n}},&l>q,\qquad\forall\ l\in\{1,\ldots,% \widetilde{n}+q\}.\end{dcases}\end{split}

(28)

We can use these sets to express events of the form $\{\boldsymbol{L}_{\mathsf{S}}^{(\theta)}\in\mathsf{A},\boldsymbol{R}_{\mathsf{% S}}^{(\theta)}\in\mathsf{B},\tau_{\mathsf{S}}^{(\theta)}=n,\mathfrak{T}_{% \mathsf{S}}^{(\theta)}=\widetilde{n}\}$ . Namely, observe that by (22) we have $\tau_{\mathsf{S}}^{(\theta)}\leqslant J\leqslant m+1-\mathfrak{T}_{\mathsf{S}}% ^{(\theta)}$ . Together with the independence of $J$ from $(L_{k}^{(\theta)})_{k\in\mathbb{N}}$ and $(R_{l}^{(\theta)})_{l\in\mathbb{N}}$ , we get

\begin{split}&\mathbb{P}\left(\boldsymbol{L}_{\mathsf{S}}^{(\theta)}\in\mathsf% {A},\boldsymbol{R}_{\mathsf{S}}^{(\theta)}\in\mathsf{B},\tau_{\mathsf{S}}^{(% \theta)}=n,\mathfrak{T}_{\mathsf{S}}^{(\theta)}=\widetilde{n}\right)\\ &\quad=\sum_{\upiota=n}^{m+1-\widetilde{n}}\mathbb{P}\left(\boldsymbol{L}_{% \mathsf{S}}^{(\theta)}\in\mathsf{A},\boldsymbol{R}_{\mathsf{S}}^{(\theta)}\in% \mathsf{B},\tau_{\mathsf{S}}^{(\theta)}=n,\mathfrak{T}_{\mathsf{S}}^{(\theta)}% =\widetilde{n},J=\upiota\right)\\ &\quad=\sum_{\upiota=n}^{m+1-\widetilde{n}}\mathbb{P}\left(L_{1}^{(\theta)}\in% \mathsf{A}_{1}^{\upiota,n,\widetilde{n}},\ldots,L_{n}^{(\theta)}\in\mathsf{A}_% {n}^{\upiota,n,\widetilde{n}},R_{1}^{(\theta)}\in\mathsf{B}_{1}^{\upiota,n,% \widetilde{n}},\ldots,R_{\widetilde{n}}^{(\theta)}\in\mathsf{B}_{\widetilde{n}% }^{\upiota,n,\widetilde{n}},J=\upiota\right)\\ &\quad=\frac{1}{m}\sum_{\upiota=n}^{m+1-\widetilde{n}}\mathbb{P}\left(L_{1}^{(% \theta)}\in\mathsf{A}_{1}^{\upiota,n,\widetilde{n}},\ldots,L_{n}^{(\theta)}\in% \mathsf{A}_{n}^{\upiota,n,\widetilde{n}},R_{1}^{(\theta)}\in\mathsf{B}_{1}^{% \upiota,n,\widetilde{n}},\ldots,R_{\widetilde{n}}^{(\theta)}\in\mathsf{B}_{% \widetilde{n}}^{\upiota,n,\widetilde{n}}\right).\end{split}

(29)

If we use this for $n=i$ and $\widetilde{n}=j$ , we get

	$\displaystyle\mathbb{P}\left(\boldsymbol{L}_{\mathsf{S}}^{(\theta)}\in\mathsf{% A},\boldsymbol{R}_{\mathsf{S}}^{(\theta)}\in\mathsf{B},L_{q+1}^{(\theta)}<% \alpha<L_{q}^{(\theta)},\tau_{\mathsf{S}}^{(\theta)}=i,\mathfrak{T}_{\mathsf{S% }}^{(\theta)}=j\right)$
	$\displaystyle\quad=\frac{1}{m}\sum_{\upiota=i}^{m+1-j}\mathbb{P}\left(L_{1}^{(% \theta)}\in\mathsf{A}_{1}^{\upiota,i,j},\ldots,L_{i}^{(\theta)}\in\mathsf{A}_{% i}^{\upiota,i,j},R_{1}^{(\theta)}\in\mathsf{B}_{1}^{\upiota,i,j},\ldots,R_{j}^% {(\theta)}\in\mathsf{B}_{j}^{\upiota,i,j},L_{q+1}^{(\theta)}<\alpha<L_{q}^{(% \theta)}\right).$

Applying statement 1 and then (28) yields

	$\displaystyle\mathbb{P}\left(\boldsymbol{L}_{\mathsf{S}}^{(\theta)}\in\mathsf{% A},\boldsymbol{R}_{\mathsf{S}}^{(\theta)}\in\mathsf{B},L_{q+1}^{(\theta)}<% \alpha<L_{q}^{(\theta)},\tau_{\mathsf{S}}^{(\theta)}=i,\mathfrak{T}_{\mathsf{S% }}^{(\theta)}=j\right)$
	$\displaystyle\quad=\frac{1}{m}\sum_{\upiota=i}^{m+1-j}\mathbb{P}\left(R_{q}^{(% \alpha)}\in\mathsf{A}_{1}^{\upiota,i,j},\ldots,R_{1}^{(\alpha)}\in\mathsf{A}_{% q}^{\upiota,i,j},L_{1}^{(\alpha)}\in\mathsf{A}_{q+1}^{\upiota,i,j},\ldots,L_{i% -q}^{(\alpha)}\in\mathsf{A}_{i}^{\upiota,i,j},\right.$
	$\displaystyle\quad\qquad\qquad\qquad\qquad\left.R_{q+1}^{(\alpha)}\in\mathsf{B% }_{1}^{\upiota,i,j},\ldots,R_{j+q}^{(\alpha)}\in\mathsf{B}_{j}^{\upiota,i,j},R% _{q}^{(\alpha)}<\theta<R_{q+1}^{(\alpha)}\right)$
	$\displaystyle\quad=\frac{1}{m}\sum_{\upiota=i}^{m+1-j}\mathbb{P}\left(L_{1}^{(% \alpha)}\in\mathsf{A}_{1}^{\upiota-q,i-q,j+q},\ldots,L_{i-q}^{(\alpha)}\in% \mathsf{A}_{i-q}^{\upiota-q,i-q,j+q},\right.$
	$\displaystyle\quad\qquad\qquad\qquad\qquad\left.R^{(\alpha)}_{1}\in\mathsf{B}_{1}^{\upiota-q,i-q,j+q},\ldots,R_{j+q}^{(\alpha)}\in\mathsf{B}_{j+q}^{\upiota-q,i-q,j+q},R_{q}^{(\alpha)}<\theta<R_{q+1}^{(\alpha)}\right).$

Then doing an index shift and using (29) for $n=i-q$ and $\widetilde{n}=j+q$ , we obtain

	$\displaystyle\mathbb{P}\left(\boldsymbol{L}_{\mathsf{S}}^{(\theta)}\in\mathsf{% A},\boldsymbol{R}_{\mathsf{S}}^{(\theta)}\in\mathsf{B},L_{q+1}^{(\theta)}<% \alpha<L_{q}^{(\theta)},\tau_{\mathsf{S}}^{(\theta)}=i,\mathfrak{T}_{\mathsf{S% }}^{(\theta)}=j\right)$
	$\displaystyle\quad=\frac{1}{m}\sum_{\upiota^{\prime}=i-q}^{m+1-(j+q)}\mathbb{P% }\left(L_{1}^{(\alpha)}\in\mathsf{A}_{1}^{\upiota^{\prime},i-q,j+q},\ldots,L_{% i-q}^{(\alpha)}\in\mathsf{A}_{i-q}^{\upiota^{\prime},i-q,j+q},\right.$
	$\displaystyle\quad\qquad\qquad\qquad\qquad\left.R^{(\alpha)}_{1}\in\mathsf{B}_{1}^{\upiota^{\prime},i-q,j+q},\ldots,R_{j+q}^{(\alpha)}\in\mathsf{B}_{j+q}^{\upiota^{\prime},i-q,j+q},R_{q}^{(\alpha)}<\theta<R_{q+1}^{(\alpha)}\right)$
	$\displaystyle\quad=\mathbb{P}\left(\boldsymbol{L}_{\mathsf{S}}^{(\alpha)}\in% \mathsf{A},\boldsymbol{R}_{\mathsf{S}}^{(\alpha)}\in\mathsf{B},R_{q}^{(\alpha)% }<\theta<R_{q+1}^{(\alpha)},\tau_{\mathsf{S}}^{(\alpha)}=i-q,\mathfrak{T}_{% \mathsf{S}}^{(\alpha)}=j+q\right).$

To 3: Let $\mathsf{A},\mathsf{B}\in\mathcal{B}(\mathbb{R})$ . Using (22), we obtain

	$\displaystyle\mathbb{P}\left(\boldsymbol{L}_{\mathsf{S}}^{(\theta)}\in\mathsf{% A},\boldsymbol{R}_{\mathsf{S}}^{(\theta)}\in\mathsf{B},\boldsymbol{L}_{\mathsf% {S}}^{(\theta)}<\alpha<\boldsymbol{R}_{\mathsf{S}}^{(\theta)}\right)$
	$\displaystyle\qquad=\sum_{i=1}^{m}\sum_{j=1}^{m+1-i}\mathbb{P}\left(L_{i}^{(% \theta)}\in\mathsf{A},R_{j}^{(\theta)}\in\mathsf{B},L_{i}^{(\theta)}<\alpha<R_% {j}^{(\theta)},\tau_{\mathsf{S}}^{(\theta)}=i,\mathfrak{T}_{\mathsf{S}}^{(% \theta)}=j\right).$

By virtue of Lemma 19.1 and 19.4 this yields

	$\displaystyle\mathbb{P}\left(\boldsymbol{L}_{\mathsf{S}}^{(\theta)}\in\mathsf{% A},\boldsymbol{R}_{\mathsf{S}}^{(\theta)}\in\mathsf{B},\boldsymbol{L}_{\mathsf% {S}}^{(\theta)}<\alpha<\boldsymbol{R}_{\mathsf{S}}^{(\theta)}\right)$
	$\displaystyle\quad=\sum_{i=1}^{m}\sum_{j=1}^{m+1-i}\left[\sum_{q=0}^{i-1}% \mathbb{P}\left(\boldsymbol{L}_{\mathsf{S}}^{(\theta)}\in\mathsf{A},% \boldsymbol{R}_{\mathsf{S}}^{(\theta)}\in\mathsf{B},L_{q+1}^{(\theta)}<\alpha<% L_{q}^{(\theta)},\tau_{\mathsf{S}}^{(\theta)}=i,\mathfrak{T}_{\mathsf{S}}^{(% \theta)}=j\right)\right.$
	$\displaystyle\qquad\qquad\quad+\left.\sum_{q=1}^{j-1}\mathbb{P}\left(% \boldsymbol{L}_{\mathsf{S}}^{(\theta)}\in\mathsf{A},\boldsymbol{R}_{\mathsf{S}% }^{(\theta)}\in\mathsf{B},R_{q}^{(\theta)}<\alpha<R_{q+1}^{(\theta)},\tau_{% \mathsf{S}}^{(\theta)}=i,\mathfrak{T}_{\mathsf{S}}^{(\theta)}=j\right)\right].$

We apply statement 2 directly to the first sum. In the second sum we apply statement 2 with $i+q$ and $j-q$ in place of $i$ and $j$ and with the roles of $\theta$ and $\alpha$ reversed. Relocating the summand for $q=0$ from the first to the second sum, we obtain

	$\displaystyle\mathbb{P}\left(\boldsymbol{L}_{\mathsf{S}}^{(\theta)}\in\mathsf{% A},\boldsymbol{R}_{\mathsf{S}}^{(\theta)}\in\mathsf{B},\boldsymbol{L}_{\mathsf% {S}}^{(\theta)}<\alpha<\boldsymbol{R}_{\mathsf{S}}^{(\theta)}\right)$
	$\displaystyle\quad=\sum_{i=1}^{m}\sum_{j=1}^{m+1-i}\left[\sum_{q=1}^{i-1}% \mathbb{P}\left(\boldsymbol{L}_{\mathsf{S}}^{(\alpha)}\in\mathsf{A},% \boldsymbol{R}_{\mathsf{S}}^{(\alpha)}\in\mathsf{B},R_{q}^{(\alpha)}<\theta<R_% {q+1}^{(\alpha)},\tau_{\mathsf{S}}^{(\alpha)}=i-q,\mathfrak{T}_{\mathsf{S}}^{(% \alpha)}=j+q\right)\right.$
	$\displaystyle\qquad\qquad\quad+\left.\sum_{q=0}^{j-1}\mathbb{P}\left(% \boldsymbol{L}_{\mathsf{S}}^{(\alpha)}\in\mathsf{A},\boldsymbol{R}_{\mathsf{S}% }^{(\alpha)}\in\mathsf{B},L_{q+1}^{(\alpha)}<\theta<L_{q}^{(\alpha)},\tau_{% \mathsf{S}}^{(\alpha)}=i+q,\mathfrak{T}_{\mathsf{S}}^{(\alpha)}=j-q\right)\right]$
	$\displaystyle\quad=\sum_{i=1}^{m}\sum_{j=1}^{m+1-i}\left[\sum_{q=1}^{i-1}% \mathbb{P}\left(\boldsymbol{L}_{\mathsf{S}}^{(\alpha)}\in\mathsf{A},% \boldsymbol{R}_{\mathsf{S}}^{(\alpha)}\in\mathsf{B},R_{\mathfrak{T}_{\mathsf{S% }}^{(\alpha)}-j}^{(\alpha)}<\theta<R_{\mathfrak{T}_{\mathsf{S}}^{(\alpha)}-j+1% }^{(\alpha)},\mathfrak{T}_{\mathsf{S}}^{(\alpha)}=j+q,\right.\right.$
	$\displaystyle\quad\hskip 284.52756pt\tau_{\mathsf{S}}^{(\alpha)}+\mathfrak{T}_% {\mathsf{S}}^{(\alpha)}=i+j\bigg{)}$
	$\displaystyle\quad\qquad\qquad+\left.\sum_{q=0}^{j-1}\mathbb{P}\left(% \boldsymbol{L}_{\mathsf{S}}^{(\alpha)}\in\mathsf{A},\boldsymbol{R}_{\mathsf{S}% }^{(\alpha)}\in\mathsf{B},L_{\tau_{\mathsf{S}}^{(\alpha)}-i+1}^{(\alpha)}<% \theta<L_{\tau_{\mathsf{S}}^{(\alpha)}-i}^{(\alpha)},\tau_{\mathsf{S}}^{(% \alpha)}=i+q,\right.\right.$
	$\displaystyle\quad\hskip 284.52756pt\tau_{\mathsf{S}}^{(\alpha)}+\mathfrak{T}_% {\mathsf{S}}^{(\alpha)}=i+j\bigg{)}\Bigg{]}.$

Then the lower bounds in (22) imply

	$\displaystyle\mathbb{P}\left(\boldsymbol{L}_{\mathsf{S}}^{(\theta)}\in\mathsf{% A},\boldsymbol{R}_{\mathsf{S}}^{(\theta)}\in\mathsf{B},\boldsymbol{L}_{\mathsf% {S}}^{(\theta)}<\alpha<\boldsymbol{R}_{\mathsf{S}}^{(\theta)}\right)$
	$\displaystyle\quad=\sum_{i=1}^{m}\sum_{j=1}^{m+1-i}\left[\mathbb{P}\left(% \boldsymbol{L}_{\mathsf{S}}^{(\alpha)}\in\mathsf{A},\boldsymbol{R}_{\mathsf{S}% }^{(\alpha)}\in\mathsf{B},R_{\mathfrak{T}_{\mathsf{S}}^{(\alpha)}-j}^{(\alpha)% }<\theta<R_{\mathfrak{T}_{\mathsf{S}}^{(\alpha)}-j+1}^{(\alpha)},\mathfrak{T}_% {\mathsf{S}}^{(\alpha)}\geqslant j+1\right.\right.,$
	$\displaystyle\quad\hskip 284.52756pt\tau_{\mathsf{S}}^{(\alpha)}+\mathfrak{T}_% {\mathsf{S}}^{(\alpha)}=i+j\bigg{)}$
	$\displaystyle\qquad\qquad\quad+\left.\mathbb{P}\left(\boldsymbol{L}_{\mathsf{S% }}^{(\alpha)}\in\mathsf{A},\boldsymbol{R}_{\mathsf{S}}^{(\alpha)}\in\mathsf{B}% ,L_{\tau_{\mathsf{S}}^{(\alpha)}-i+1}^{(\alpha)}<\theta<L_{\tau_{\mathsf{S}}^{% (\alpha)}-i}^{(\alpha)},\tau_{\mathsf{S}}^{(\alpha)}\geqslant i,\tau_{\mathsf{% S}}^{(\alpha)}+\mathfrak{T}_{\mathsf{S}}^{(\alpha)}=i+j\right)\right].$

Changing the order of summation, we obtain

	$\displaystyle\mathbb{P}\left(\boldsymbol{L}_{\mathsf{S}}^{(\theta)}\in\mathsf{% A},\boldsymbol{R}_{\mathsf{S}}^{(\theta)}\in\mathsf{B},\boldsymbol{L}_{\mathsf% {S}}^{(\theta)}<\alpha<\boldsymbol{R}_{\mathsf{S}}^{(\theta)}\right)$
	$\displaystyle\quad=\sum_{j=1}^{m}\sum_{i=1}^{m+1-j}\mathbb{P}\left(\boldsymbol% {L}_{\mathsf{S}}^{(\alpha)}\in\mathsf{A},\boldsymbol{R}_{\mathsf{S}}^{(\alpha)% }\in\mathsf{B},R_{\mathfrak{T}_{\mathsf{S}}^{(\alpha)}-j}^{(\alpha)}<\theta<R_% {\mathfrak{T}_{\mathsf{S}}^{(\alpha)}-j+1}^{(\alpha)},\mathfrak{T}_{\mathsf{S}% }^{(\alpha)}\geqslant j+1,\right.$
	$\displaystyle\quad\hskip 284.52756pt\tau_{\mathsf{S}}^{(\alpha)}+\mathfrak{T}_% {\mathsf{S}}^{(\alpha)}=i+j\bigg{)}$
	$\displaystyle\quad\qquad+\sum_{i=1}^{m}\sum_{j=1}^{m+1-i}\mathbb{P}\left(% \boldsymbol{L}_{\mathsf{S}}^{(\alpha)}\in\mathsf{A},\boldsymbol{R}_{\mathsf{S}% }^{(\alpha)}\in\mathsf{B},L_{\tau_{\mathsf{S}}^{(\alpha)}-i+1}^{(\alpha)}<% \theta<L_{\tau_{\mathsf{S}}^{(\alpha)}-i}^{(\alpha)},\tau_{\mathsf{S}}^{(% \alpha)}\geqslant i,\right.$
	$\displaystyle\quad\hskip 284.52756pt\tau_{\mathsf{S}}^{(\alpha)}+\mathfrak{T}_% {\mathsf{S}}^{(\alpha)}=i+j\bigg{)}.$

Together with the upper bound in (22) this yields

	$\displaystyle\mathbb{P}\left(\boldsymbol{L}_{\mathsf{S}}^{(\theta)}\in\mathsf{% A},\boldsymbol{R}_{\mathsf{S}}^{(\theta)}\in\mathsf{B},\boldsymbol{L}_{\mathsf% {S}}^{(\theta)}<\alpha<\boldsymbol{R}_{\mathsf{S}}^{(\theta)}\right)$
	$\displaystyle\quad=\sum_{j=1}^{m}\mathbb{P}\left(\boldsymbol{L}_{\mathsf{S}}^{% (\alpha)}\in\mathsf{A},\boldsymbol{R}_{\mathsf{S}}^{(\alpha)}\in\mathsf{B},R_{% \mathfrak{T}_{\mathsf{S}}^{(\alpha)}-j}^{(\alpha)}<\theta<R_{\mathfrak{T}_{% \mathsf{S}}^{(\alpha)}-j+1}^{(\alpha)},\mathfrak{T}_{\mathsf{S}}^{(\alpha)}% \geqslant j+1\right)$
	$\displaystyle\quad\qquad+\sum_{i=1}^{m}\mathbb{P}\left(\boldsymbol{L}_{\mathsf% {S}}^{(\alpha)}\in\mathsf{A},\boldsymbol{R}_{\mathsf{S}}^{(\alpha)}\in\mathsf{% B},L_{\tau_{\mathsf{S}}^{(\alpha)}-i+1}^{(\alpha)}<\theta<L_{\tau_{\mathsf{S}}% ^{(\alpha)}-i}^{(\alpha)},\tau_{\mathsf{S}}^{(\alpha)}\geqslant i\right).$

By virtue of Lemma 19.2 to 19.4 we get

	$\displaystyle\mathbb{P}\left(\boldsymbol{L}_{\mathsf{S}}^{(\theta)}\in\mathsf{% A},\boldsymbol{R}_{\mathsf{S}}^{(\theta)}\in\mathsf{B},\boldsymbol{L}_{\mathsf% {S}}^{(\theta)}<\alpha<\boldsymbol{R}_{\mathsf{S}}^{(\theta)}\right)$
	$\displaystyle\qquad=\mathbb{P}\left(\boldsymbol{L}_{\mathsf{S}}^{(\alpha)}\in% \mathsf{A},\boldsymbol{R}_{\mathsf{S}}^{(\alpha)}\in\mathsf{B},R_{1}^{(\alpha)% }<\theta<\boldsymbol{R}_{\mathsf{S}}^{(\alpha)}\right)+\mathbb{P}\left(% \boldsymbol{L}_{\mathsf{S}}^{(\alpha)}\in\mathsf{A},\boldsymbol{R}_{\mathsf{S}% }^{(\alpha)}\in\mathsf{B},\boldsymbol{L}_{\mathsf{S}}^{(\alpha)}<\theta<L_{0}^% {(\alpha)}\right)$
	$\displaystyle\qquad=\mathbb{P}\left(\boldsymbol{L}_{\mathsf{S}}^{(\alpha)}\in% \mathsf{A},\boldsymbol{R}_{\mathsf{S}}^{(\alpha)}\in\mathsf{B},\boldsymbol{L}_% {\mathsf{S}}^{(\alpha)}<\theta<\boldsymbol{R}_{\mathsf{S}}^{(\alpha)}\right).$

∎
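
The symmetry in statement 3 can be checked exactly on a grid. The Python sketch below does this under several assumptions that are not taken from this section: the reading of Setting C given in the docstring of `stepping_out`, a midpoint grid for $\Upsilon$ with `N` cells, and a choice of $\theta,\alpha$ such that $\alpha-\theta$ is a multiple of $w/N$, so that the substitution used in the proof of statement 1 maps grid points to grid points. All function names, sets and numerical values are hypothetical.

```python
from fractions import Fraction as F

def stepping_out(theta, u, J, w, m, in_S):
    """Stopped endpoints (L_S, R_S) for offset u and budget split J.

    Assumed reading of Setting C, which is not restated in this section:
      L_i = theta - u - (i-1)*w,  R_j = theta - u + j*w,
      tau = min(first i with L_i outside S, J),
      T   = min(first j with R_j outside S, m + 1 - J).
    """
    i = 1
    while i < J and in_S(theta - u - (i - 1) * w):
        i += 1
    j = 1
    while j < m + 1 - J and in_S(theta - u + j * w):
        j += 1
    return theta - u - (i - 1) * w, theta - u + j * w

def grid_probability(start, other, in_A, in_B, w, m, in_S, N):
    """Share of grid pairs (u, J) with L_S in A, R_S in B and L_S < other < R_S.

    u runs over the N midpoints of [0, w] and J over {1, ..., m}.
    """
    h = w / N
    hits = 0
    for n in range(N):
        u = (n + F(1, 2)) * h
        for J in range(1, m + 1):
            L, R = stepping_out(start, u, J, w, m, in_S)
            if L < other < R and in_A(L) and in_B(R):
                hits += 1
    return F(hits, N * m)

# Hypothetical test case: S is a union of two open intervals.
w, m, N = F(1), 5, 20
in_S = lambda x: -2 < x < F(3, 2) or 3 < x < 4
in_A = lambda l: l < -1
in_B = lambda r: r > 1
theta, alpha = F(0), F(7, 10)    # alpha - theta is a multiple of w / N

lhs = grid_probability(theta, alpha, in_A, in_B, w, m, in_S, N)
rhs = grid_probability(alpha, theta, in_A, in_B, w, m, in_S, N)
print(lhs == rhs)  # True
```

The midpoint grid keeps $\theta$ and $\alpha$ off the lattice $\{\theta-u+kw\}$, which mirrors the role of Lemma 19.4: boundary events do not contribute.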

Lemma 11 is now an easy corollary of the previous lemma.

Proof of Lemma 11.

Lemma 20.3 yields that the finite measures

\int_{\mathbb{R}^{2}}\mathbbm{1}_{\mathsf{C}}(\ell,r)\mathbbm{1}_{(\ell,r)}(% \alpha)\ \xi_{\mathsf{S}}^{(\theta)}\big{(}{\rm d}(\ell,r)\big{)},\qquad% \mathsf{C}\in\mathcal{B}(\mathbb{R}^{2}),

and

\int_{\mathbb{R}^{2}}\mathbbm{1}_{\mathsf{C}}(\ell,r)\mathbbm{1}_{(\ell,r)}(% \theta)\ \xi_{\mathsf{S}}^{(\alpha)}\big{(}{\rm d}(\ell,r)\big{)},\qquad% \mathsf{C}\in\mathcal{B}(\mathbb{R}^{2}),

agree on the intersection stable generator $\{\mathsf{A}\times\mathsf{B}\mid\mathsf{A},\mathsf{B}\in\mathcal{B}(\mathbb{R})\}$ of $\mathcal{B}(\mathbb{R}^{2})$ . By [Klenke, Lemma 1.42], this implies that these two measures are equal. ∎

Before addressing Lemma 12, we take a brief detour, because some of the notation from the proof of Lemma 20 is convenient for showing measurability of the stepping-out distribution in the argument $\theta$ .

Remark 21.

Assume we are in Setting C. Let $A,B\in\mathcal{B}(\mathbb{R})$ , and for all $n,\widetilde{n}\in\mathbb{N}$ with $n+\widetilde{n}\leqslant m$ and $\upiota\in\{n,\ldots,m+1-\widetilde{n}\}$ define $\mathsf{A}_{1}^{\upiota,n,\widetilde{n}},\ldots,\mathsf{A}_{n}^{\upiota,n,\widetilde{n}}$ , $\mathsf{B}_{1}^{\upiota,n,\widetilde{n}},\ldots,\mathsf{B}_{\widetilde{n}}^{\upiota,n,\widetilde{n}}$ as in the proof of Lemma 20 (with $A$ and $B$ in place of $\mathsf{A}$ and $\mathsf{B}$ ). Combining Tonelli’s theorem and Lemma 18, we get that

\theta\mapsto\mathbb{P}\left(L_{1}^{(\theta)}\in\mathsf{A}_{1}^{\upiota,n,% \widetilde{n}},\ldots,L_{n}^{(\theta)}\in\mathsf{A}_{n}^{\upiota,n,\widetilde{% n}},R_{1}^{(\theta)}\in\mathsf{B}_{1}^{\upiota,n,\widetilde{n}},\ldots,R_{% \widetilde{n}}^{(\theta)}\in\mathsf{B}_{\widetilde{n}}^{\upiota,n,\widetilde{n% }}\right)

is measurable for all $n,\widetilde{n}\in\mathbb{N}$ with $n+\widetilde{n}\leqslant m$ and $\upiota\in\{n,\ldots,m+1-\widetilde{n}\}$ . Thus also

	$\displaystyle\theta\mapsto$	$\displaystyle\xi_{\mathsf{S}}^{(\theta)}(A\times B)$
		$\displaystyle\quad=\frac{1}{m}\sum_{n=1}^{m}\sum_{\widetilde{n}=1}^{m-n}\sum_{% \upiota=n}^{m+1-\widetilde{n}}\mathbb{P}\left(L_{1}^{(\theta)}\in\mathsf{A}_{1% }^{\upiota,n,\widetilde{n}},\ldots,L_{n}^{(\theta)}\in\mathsf{A}_{n}^{\upiota,% n,\widetilde{n}},R_{1}^{(\theta)}\in\mathsf{B}_{1}^{\upiota,n,\widetilde{n}},% \ldots,R_{\widetilde{n}}^{(\theta)}\in\mathsf{B}_{\widetilde{n}}^{\upiota,n,% \widetilde{n}}\right)$

is measurable. It is now straightforward to verify that $(\theta,C)\mapsto\xi_{\mathsf{S}}^{(\theta)}(C)$ is a transition kernel.

We turn to the proof of Lemma 12. Its essential strategy is to transfer the following property of $(L_{i})_{i\in\mathbb{N}}$ and $(R_{j})_{j\in\mathbb{N}}$ to the stepping-out distribution.

Lemma 22.

Assume that we are in Setting C.

1.

Let $\theta,\alpha\in\mathbb{R}$ , $i,j\in\mathbb{N}$ and $\mathsf{A}_{1},\ldots,\mathsf{A}_{i},\mathsf{B}_{1},\ldots,\mathsf{B}_{j}\in\mathcal{B}(\mathbb{R})$ . Then we have

	$\displaystyle\mathbb{P}\left(L_{1}^{(\theta)}\in\Lambda_{\alpha}(\mathsf{A}_{1}),\ldots,L_{i}^{(\theta)}\in\Lambda_{\alpha}(\mathsf{A}_{i}),R_{1}^{(\theta)}\in\Lambda_{\alpha}(\mathsf{B}_{1}),\ldots,R_{j}^{(\theta)}\in\Lambda_{\alpha}(\mathsf{B}_{j})\right)$
	$\displaystyle\quad=\mathbb{P}\left(L_{1}^{\left(\Lambda_{\alpha}(\theta)\right)}\in\mathsf{B}_{1},\ldots,L_{j}^{\left(\Lambda_{\alpha}(\theta)\right)}\in\mathsf{B}_{j},R_{1}^{\left(\Lambda_{\alpha}(\theta)\right)}\in\mathsf{A}_{1},\ldots,R_{i}^{\left(\Lambda_{\alpha}(\theta)\right)}\in\mathsf{A}_{i}\right).$
2.

For all $i\in\{1,\ldots,m\}$ , $j\in\{1,\ldots,m+1-i\}$ , $\theta,\alpha\in\mathbb{R}$ and $\mathsf{A},\mathsf{B}\in\mathcal{B}(\mathbb{R})$ we have

	$\displaystyle\mathbb{P}\left(L_{i}^{(\theta)}\in\mathsf{A},R_{j}^{(\theta)}\in% \mathsf{B},\tau_{\Lambda_{\alpha}(\mathsf{S})}^{(\theta)}=i,\mathfrak{T}_{% \Lambda_{\alpha}(\mathsf{S})}^{(\theta)}=j\right)$
	$\displaystyle\qquad=\mathbb{P}\left(L_{j}^{(\Lambda_{\alpha}(\theta))}\in% \Lambda_{\alpha}^{-1}(\mathsf{B}),R_{i}^{(\Lambda_{\alpha}(\theta))}\in\Lambda% _{\alpha}^{-1}(\mathsf{A}),\tau_{\mathsf{S}}^{(\Lambda_{\alpha}(\theta))}=j,% \mathfrak{T}_{\mathsf{S}}^{(\Lambda_{\alpha}(\theta))}=i\right).$

Proof.

To 1: Let $\theta,\alpha\in\mathbb{R}$ , $i,j\in\mathbb{N}$ and $\mathsf{A}_{1},\ldots,\mathsf{A}_{i},\mathsf{B}_{1},\ldots,\mathsf{B}_{j}\in\mathcal{B}(\mathbb{R})$ . We have by Lemma 18

	$\displaystyle\mathbb{P}\left(L_{1}^{(\theta)}\in\Lambda_{\alpha}(\mathsf{A}_{1% }),\ldots,L_{i}^{(\theta)}\in\Lambda_{\alpha}(\mathsf{A}_{i}),R_{1}^{(\theta)}% \in\Lambda_{\alpha}(\mathsf{B}_{1}),\ldots,R_{j}^{(\theta)}\in\Lambda_{\alpha}% (\mathsf{B}_{j})\right)$
	$\displaystyle\quad=\frac{1}{w}\int_{0}^{w}\prod_{k=1}^{i}\mathbbm{1}_{\Lambda_% {\alpha}(\mathsf{A}_{k})}\big{(}\theta-u-(k-1)w\big{)}\prod_{l=1}^{j}\mathbbm{% 1}_{\Lambda_{\alpha}(\mathsf{B}_{l})}\big{(}\theta-u+lw\big{)}\ \mathrm{Leb}_{% 1}({\rm d}u)$
	$\displaystyle\quad=\frac{1}{w}\int_{0}^{w}\prod_{k=1}^{i}\mathbbm{1}_{\mathsf{% A}_{k}}\big{(}\alpha-\theta+u+(k-1)w\big{)}\prod_{l=1}^{j}\mathbbm{1}_{\mathsf% {B}_{l}}\big{(}\alpha-\theta+u-lw\big{)}\ \mathrm{Leb}_{1}({\rm d}u).$

Performing a change of variables $\widetilde{u}=-u+w$ yields

	$\displaystyle\mathbb{P}\left(L_{1}^{(\theta)}\in\Lambda_{\alpha}(\mathsf{A}_{1% }),\ldots,L_{i}^{(\theta)}\in\Lambda_{\alpha}(\mathsf{A}_{i}),R_{1}^{(\theta)}% \in\Lambda_{\alpha}(\mathsf{B}_{1}),\ldots,R_{j}^{(\theta)}\in\Lambda_{\alpha}% (\mathsf{B}_{j})\right)$
	$\displaystyle\quad=\frac{1}{w}\int_{0}^{w}\prod_{k=1}^{i}\mathbbm{1}_{\mathsf{% A}_{k}}\big{(}\alpha-\theta-\widetilde{u}+kw\big{)}\prod_{l=1}^{j}\mathbbm{1}_% {\mathsf{B}_{l}}\big{(}\alpha-\theta-\widetilde{u}-(l-1)w\big{)}\ \mathrm{Leb}% _{1}({\rm d}\widetilde{u}).$

Using again Lemma 18, we obtain

	$\displaystyle\mathbb{P}\left(L_{1}^{(\theta)}\in\Lambda_{\alpha}(\mathsf{A}_{1% }),\ldots,L_{i}^{(\theta)}\in\Lambda_{\alpha}(\mathsf{A}_{i}),R_{1}^{(\theta)}% \in\Lambda_{\alpha}(\mathsf{B}_{1}),\ldots,R_{j}^{(\theta)}\in\Lambda_{\alpha}% (\mathsf{B}_{j})\right)$
	$\displaystyle\quad=\mathbb{P}\left(L_{1}^{\left(\Lambda_{\alpha}(\theta)\right% )}\in\mathsf{B}_{1},\ldots,L_{j}^{\left(\Lambda_{\alpha}(\theta)\right)}\in% \mathsf{B}_{j},R_{1}^{\left(\Lambda_{\alpha}(\theta)\right)}\in\mathsf{A}_{1},% \ldots,R_{i}^{\left(\Lambda_{\alpha}(\theta)\right)}\in\mathsf{A}_{i}\right).$
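
Statement 1 can also be verified exactly on a midpoint grid, because the change of variables $\widetilde{u}=-u+w$ maps the grid onto itself. The Python sketch below assumes that $\Lambda_{\alpha}$ is the reflection $x\mapsto\alpha-x$, which is how it acts in the displayed computation above; the function names, the sets and the numerical values are hypothetical.

```python
from fractions import Fraction as F

def grid_joint(theta, w, A, B, N):
    """Midpoint-grid version of the integral in Lemma 18.

    A and B are lists of membership predicates: A[k-1] tests
    L_k = theta - u - (k-1)*w and B[l-1] tests R_l = theta - u + l*w.
    """
    h = w / N
    hits = 0
    for n in range(N):
        u = (n + F(1, 2)) * h
        ok_left = all(a(theta - u - k * w) for k, a in enumerate(A))
        ok_right = all(b(theta - u + (l + 1) * w) for l, b in enumerate(B))
        hits += ok_left and ok_right
    return F(hits, N)

def reflect(pred, alpha):
    """Membership test for Lambda_alpha(A), assuming Lambda_alpha(x) = alpha - x."""
    return lambda x: pred(alpha - x)

# Hypothetical example with i = 2 sets on the left and j = 1 set on the right.
theta, alpha, w, N = F(1, 3), F(5, 4), F(1), 50
A = [lambda x: x > F(1, 2), lambda x: 0 < x < 3]
B = [lambda x: x < F(1, 5)]

lhs = grid_joint(theta, w, [reflect(a, alpha) for a in A],
                 [reflect(b, alpha) for b in B], N)
rhs = grid_joint(alpha - theta, w, B, A, N)   # roles of A and B swapped
print(lhs == rhs)  # True
```

On the right-hand side the sets $\mathsf{B}_{l}$ constrain the left sequence and the sets $\mathsf{A}_{k}$ the right sequence, exactly as in the statement.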

To 2: Let $i\in\{1,\ldots,m\}$ , $j\in\{1,\ldots,m+1-i\}$ , $\theta,\alpha\in\mathbb{R}$ and $\mathsf{A},\mathsf{B}\in\mathcal{B}(\mathbb{R})$ . For $\upiota\in\{i,\ldots,m+1-j\}$ we define

	$\displaystyle\mathsf{A}_{k}^{\upiota}=\Lambda_{\alpha}(\mathsf{S}),\quad k\in% \{1,\ldots,i-1\}\qquad$	$\displaystyle\text{and}\qquad\mathsf{A}_{i}^{\upiota}=\begin{dcases}\mathsf{A}% ,&\upiota=i\\ \mathsf{A}\cap(\mathbb{R}\setminus\Lambda_{\alpha}(\mathsf{S})),&\upiota>i,% \end{dcases}$
	$\displaystyle\mathsf{B}_{l}^{\upiota}=\Lambda_{\alpha}(\mathsf{S}),\quad l\in\{1,\ldots,j-1\}\qquad$	$\displaystyle\text{and}\qquad\mathsf{B}_{j}^{\upiota}=\begin{dcases}\mathsf{B},&\upiota=m+1-j\\ \mathsf{B}\cap(\mathbb{R}\setminus\Lambda_{\alpha}(\mathsf{S})),&\upiota<m+1-j,\end{dcases}$

as well as

	$\displaystyle\widetilde{\mathsf{A}}_{k}^{\upiota}=\mathsf{S},\quad k\in\{1,% \ldots,i-1\}\qquad$	$\displaystyle\text{and}\qquad\widetilde{\mathsf{A}}_{i}^{\upiota}=\begin{% dcases}\Lambda_{\alpha}^{-1}(\mathsf{A}),&\upiota=i\\ \Lambda_{\alpha}^{-1}(\mathsf{A})\cap(\mathbb{R}\setminus\mathsf{S}),&\upiota>% i,\end{dcases}$
	$\displaystyle\widetilde{\mathsf{B}}^{\upiota}_{l}=\mathsf{S},\quad l\in\{1,% \ldots,j-1\}\qquad$	$\displaystyle\text{and}\qquad\widetilde{\mathsf{B}}^{\upiota}_{j}=\begin{% dcases}\Lambda_{\alpha}^{-1}(\mathsf{B}),&\upiota=m+1-j\\ \Lambda_{\alpha}^{-1}(\mathsf{B})\cap(\mathbb{R}\setminus\mathsf{S}),&\upiota>% m+1-j.\end{dcases}$

Observe that

\displaystyle\Lambda_{\alpha}(\widetilde{\mathsf{A}}_{k}^{\upiota})=\mathsf{A}% _{k}^{\upiota},\qquad\text{and}\qquad\Lambda_{\alpha}(\widetilde{\mathsf{B}}_{% l}^{\upiota})=\mathsf{B}_{l}^{\upiota},\qquad\textup{for }k\in\{1,\ldots,i\},l% \in\{1,\ldots,j\},

as $\Lambda_{\alpha}(\mathbb{R})=\mathbb{R}$. Since $J\sim\mathrm{Unif}(\{1,\ldots,m\})$ is independent of $(L_{k}^{(\theta)})_{k\in\mathbb{N}}$ and $(R_{l}^{(\theta)})_{l\in\mathbb{N}}$, combining (22) with statement 1 gives

	$\displaystyle\mathbb{P}\left(L_{i}^{(\theta)}\in\mathsf{A},R_{j}^{(\theta)}\in% \mathsf{B},\tau_{\Lambda_{\alpha}(\mathsf{S})}^{(\theta)}=i,\mathfrak{T}_{% \Lambda_{\alpha}(\mathsf{S})}^{(\theta)}=j\right)$
	$\displaystyle\qquad=\sum_{\upiota=i}^{m+1-j}\mathbb{P}\left(L_{1}^{(\theta)}% \in\mathsf{A}_{1}^{\upiota},\ldots,L_{i}^{(\theta)}\in\mathsf{A}_{i}^{\upiota}% ,R_{1}^{(\theta)}\in\mathsf{B}_{1}^{\upiota},\ldots,R_{j}^{(\theta)}\in\mathsf% {B}_{j}^{\upiota}\right)\mathbb{P}(J=\upiota)$
	$\displaystyle\qquad=\frac{1}{m}\sum_{\upiota=i}^{m+1-j}\mathbb{P}\left(L_{1}^{% (\theta)}\in\Lambda_{\alpha}(\widetilde{\mathsf{A}}_{1}^{\upiota}),\ldots,L_{i% }^{(\theta)}\in\Lambda_{\alpha}(\widetilde{\mathsf{A}}_{i}^{\upiota}),R_{1}^{(% \theta)}\in\Lambda_{\alpha}(\widetilde{\mathsf{B}}_{1}^{\upiota}),\ldots,R_{j}% ^{(\theta)}\in\Lambda_{\alpha}(\widetilde{\mathsf{B}}_{j}^{\upiota})\right)$
	$\displaystyle\qquad=\frac{1}{m}\sum_{\upiota=i}^{m+1-j}\mathbb{P}\left(L_{1}^{% (\Lambda_{\alpha}(\theta))}\in\widetilde{\mathsf{B}}_{1}^{\upiota},\ldots,L_{j% }^{(\Lambda_{\alpha}(\theta))}\in\widetilde{\mathsf{B}}_{j}^{\upiota},R_{1}^{(% \Lambda_{\alpha}(\theta))}\in\widetilde{\mathsf{A}}_{1}^{\upiota},\ldots,R_{i}% ^{(\Lambda_{\alpha}(\theta))}\in\widetilde{\mathsf{A}}_{i}^{\upiota}\right)$
	$\displaystyle\qquad=\mathbb{P}\left(L_{j}^{(\Lambda_{\alpha}(\theta))}\in% \Lambda_{\alpha}^{-1}(\mathsf{B}),R_{i}^{(\Lambda_{\alpha}(\theta))}\in\Lambda% _{\alpha}^{-1}(\mathsf{A}),\tau_{\mathsf{S}}^{(\Lambda_{\alpha}(\theta))}=j,% \mathfrak{T}_{\mathsf{S}}^{(\Lambda_{\alpha}(\theta))}=i\right).$

∎

Proof of Lemma 12.

Let $\mathsf{A},\mathsf{B}\in\mathcal{B}(\mathbb{R})$ . Using (22), we obtain

\begin{split}\xi_{\Lambda_{\alpha}(\mathsf{S})}^{(\theta)}(\mathsf{A}\times% \mathsf{B})&=\mathbb{P}\left(\boldsymbol{L}^{(\theta)}_{\Lambda_{\alpha}(% \mathsf{S})}\in\mathsf{A},\boldsymbol{R}^{(\theta)}_{\Lambda_{\alpha}(\mathsf{% S})}\in\mathsf{B}\right)\\ &=\sum_{i=1}^{m}\sum_{j=1}^{m+1-i}\mathbb{P}\left(L_{i}^{(\theta)}\in\mathsf{A% },R_{j}^{(\theta)}\in\mathsf{B},\tau_{\Lambda_{\alpha}(\mathsf{S})}^{(\theta)}% =i,\mathfrak{T}_{\Lambda_{\alpha}(\mathsf{S})}^{(\theta)}=j\right).\end{split}

Applying Lemma 22.2 and a reordering of the sum yields

	$\displaystyle\xi_{\Lambda_{\alpha}(\mathsf{S})}^{(\theta)}(\mathsf{A}\times% \mathsf{B})$	$\displaystyle=\sum_{j=1}^{m}\sum_{i=1}^{m+1-j}\mathbb{P}\left(L_{j}^{(\Lambda_% {\alpha}(\theta))}\in\Lambda_{\alpha}^{-1}(\mathsf{B}),R_{i}^{(\Lambda_{\alpha% }(\theta))}\in\Lambda_{\alpha}^{-1}(\mathsf{A}),\tau_{\mathsf{S}}^{(\Lambda_{% \alpha}(\theta))}=j,\mathfrak{T}_{\mathsf{S}}^{(\Lambda_{\alpha}(\theta))}=i\right)$
		$\displaystyle=\mathbb{P}\left(\boldsymbol{L}^{(\Lambda_{\alpha}(\theta))}_{% \mathsf{S}}\in\Lambda_{\alpha}^{-1}(\mathsf{B}),\boldsymbol{R}^{(\Lambda_{% \alpha}(\theta))}_{\mathsf{S}}\in\Lambda_{\alpha}^{-1}(\mathsf{A})\right)=\xi_% {\mathsf{S}}^{(\Lambda_{\alpha}(\theta))}\left(\uplambda_{\alpha}^{-1}(\mathsf% {A}\times\mathsf{B})\right),$

such that the probability measures $\xi_{\Lambda_{\alpha}(\mathsf{S})}^{(\theta)}$ and $(\uplambda_{\alpha})_{\sharp}\xi_{\mathsf{S}}^{(\Lambda_{\alpha}(\theta))}$ agree on the intersection-stable generator $\{\mathsf{A}\times\mathsf{B}\mid\mathsf{A},\mathsf{B}\in\mathcal{B}(\mathbb{R})\}$ of $\mathcal{B}(\mathbb{R}^{2})$. Then [Klenke, Lemma 1.42] gives the result. ∎

Appendix B Uniform simple slice sampling

Let $(\mathsf{X},\mathcal{X},\mu)$ be a $\sigma$-finite measure space. For illustrative purposes we introduce uniform simple slice sampling for probability measures on $(\mathsf{X},\mathcal{X})$ that are absolutely continuous with respect to $\mu$. More precisely, let

p:\mathsf{X}\to(0,\infty)

be a measurable function that satisfies

Z:=\int_{\mathsf{X}}p(x)\ \mu({\rm d}x)\in(0,\infty).

We consider the probability measure

\pi({\rm d}x):=\frac{1}{Z}p(x)\mu({\rm d}x)

(30)

that has unnormalized density $p$ with respect to $\mu$ . We define the level sets as

L(t):=\{x\in\mathsf{X}\mid p(x)>t\},\qquad t\in(0,\infty),

and the essential supremum norm of $p$

\|p\|_{\text{ess-}\infty}:=\inf_{\mathsf{N}\in\mathcal{X},\mu(\mathsf{N})=0}% \sup_{x\in\mathsf{X}\setminus\mathsf{N}}p(x).

By definition of $\|p\|_{\text{ess-}\infty}$ and because

	$\displaystyle\int_{0}^{\infty}\mu\big{(}L(t)\big{)}\mathrm{Leb}_{1}({\rm d}t)=% \int_{0}^{\infty}\int_{\mathsf{X}}\mathbbm{1}_{L(t)}(x)\ \mu({\rm d}x)\,% \mathrm{Leb}_{1}({\rm d}t)$
	$\displaystyle\qquad=\int_{\mathsf{X}}\int_{0}^{\infty}\mathbbm{1}_{(0,p(x))}(t% )\ \mathrm{Leb}_{1}({\rm d}t)\,\mu({\rm d}x)=\int_{\mathsf{X}}p(x)\ \mu({\rm d% }x)=Z<\infty,$

we have $\mu\big{(}L(t)\big{)}\in(0,\infty)$ for almost all $t\in(0,\|p\|_{\text{ess-}\infty})$. Consequently, the Markov kernel of the uniform simple slice sampler

H(x,\mathsf{A}):=\frac{1}{p(x)}\int_{0}^{p(x)}\mu_{t}(\mathsf{A})\ \mathrm{Leb% }_{1}({\rm d}t),\qquad x\in\mathsf{X},\mathsf{A}\in\mathcal{X},

(31)

where

\mu_{t}:=\frac{1}{\mu(L(t))}\mu|_{L(t)},\qquad t\in(0,\|p\|_{\text{ess-}\infty% }),

is well defined.

Lemma 23.

Let $(\mathsf{X},\mathcal{X},\mu)$ be a $\sigma$-finite measure space, and let $p:\mathsf{X}\to(0,\infty)$ be a measurable function with $Z:=\int_{\mathsf{X}}p(x)\ \mu({\rm d}x)\in(0,\infty)$. Then $H$, as defined in (31), is reversible with respect to $\pi$, given in (30).

Proof.

Let $\mathsf{A},\mathsf{B}\in\mathcal{X}$ . We have

	$\displaystyle\int_{\mathsf{B}}H(x,\mathsf{A})\ \pi({\rm d}x)=\int_{\mathsf{B}}\frac{p(x)}{Z\,p(x)}\int_{0}^{p(x)}\mu_{t}(\mathsf{A})\ \mathrm{Leb}_{1}({\rm d}t)\,\mu({\rm d}x)$
	$\displaystyle\qquad=\frac{1}{Z}\int_{0}^{\infty}\int_{\mathsf{B}}\mu_{t}(\mathsf{A})\mathbbm{1}_{(0,p(x))}(t)\ \mu({\rm d}x)\,\mathrm{Leb}_{1}({\rm d}t)=\frac{1}{Z}\int_{0}^{\infty}\frac{\mu(\mathsf{A}\cap L(t))\mu(\mathsf{B}\cap L(t))}{\mu(L(t))}\ \mathrm{Leb}_{1}({\rm d}t).$

This expression is symmetric in $\mathsf{A}$ and $\mathsf{B}$ , such that

\displaystyle\int_{\mathsf{B}}H(x,\mathsf{A})\ \pi({\rm d}x)=\int_{\mathsf{A}}% H(x,\mathsf{B})\ \pi({\rm d}x).

∎
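As a sanity check of Lemma 23, the kernel $H$ can be computed exactly in a toy case. The following Python sketch assumes a finite state space with $\mu$ the counting measure, which is an illustrative choice that is not made in the text above; the function name `slice_kernel_matrix` is hypothetical. On such a space the level sets $L(t)$ change only at the values taken by $p$, so the integral in (31) reduces to a finite sum.

```python
import numpy as np

def slice_kernel_matrix(p):
    """Transition matrix of the uniform simple slice sampler on {0, ..., n-1}.

    Assumption (illustration only): mu is the counting measure and p holds
    strictly positive unnormalized weights.
    """
    p = np.asarray(p, dtype=float)
    n = p.size
    # Values of t at which the level set L(t) = {y : p[y] > t} changes.
    levels = np.concatenate(([0.0], np.unique(p)))
    H = np.zeros((n, n))
    for lo, hi in zip(levels[:-1], levels[1:]):
        # For t in [lo, hi) the level set is constant and equals {y : p[y] > lo}.
        in_slice = p > lo
        size = in_slice.sum()  # counting measure of L(t)
        # Exactly the states x in this level set have (lo, hi) inside (0, p[x]).
        H[np.ix_(in_slice, in_slice)] += (hi - lo) / size
    # Prefactor 1 / p(x) from the definition of H.
    return H / p[:, None]

p = np.array([0.5, 2.0, 1.0, 2.0])
H = slice_kernel_matrix(p)
pi = p / p.sum()
flow = pi[:, None] * H  # entries pi(x) H(x, {y})
print(np.allclose(H.sum(axis=1), 1.0), np.allclose(flow, flow.T))
```

The symmetry of `flow` is the finite-state form of the identity shown in the proof, and it implies in particular that `pi @ H` returns `pi`.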

Appendix C Manifolds

We do not give a complete introduction to differential geometry in this section. The aim is rather to give a better understanding of the key objects used by the geodesic slice sampler and to provide references for what is outside the scope of this paper.

C.1 Tangent space and Riemannian metric

We review some selected basic concepts on manifolds. For a more thorough introduction to these objects see [Boothby, Sections I, III, IV, V]. Let $\mathsf{M}$ be a $d$-dimensional, smooth manifold that is connected as a set. The most defining property of such an object is the following: For all $x\in\mathsf{M}$ there exists an open set $\mathsf{U}\subseteq\mathsf{M}$ containing $x$ and an open set $\mathsf{U}^{\prime}\subseteq\mathbb{R}^{d}$ such that there is a homeomorphism $\varphi:\mathsf{U}\to\mathsf{U}^{\prime}$. We call the tuple $(\mathsf{U},\varphi)$ a coordinate neighborhood. Since $\mathsf{M}$ is smooth, it is equipped with an atlas $\{(\mathsf{U}_{i},\varphi_{i})\}_{i\in I}$ that contains coordinate neighborhoods for all points of $\mathsf{M}$, such that for all $i,j\in I$ the map $\varphi_{i}\circ\varphi_{j}^{-1}:\varphi_{j}\left(\mathsf{U}_{j}\cap\mathsf{U}_{i}\right)\to\varphi_{i}\left(\mathsf{U}_{j}\cap\mathsf{U}_{i}\right)$, which is a map from a subset of $\mathbb{R}^{d}$ to a subset of $\mathbb{R}^{d}$, is a diffeomorphism. At all $x\in\mathsf{M}$ we can define the tangent space $T_{x}\mathsf{M}$ to $\mathsf{M}$ at $x$, which is the set of all linear mappings from the space of germs at $x$ to $\mathbb{R}$ satisfying the Leibniz rule. Observe that each coordinate neighborhood $(\mathsf{U},\varphi)$ of $\mathsf{M}$ induces $d$ vector fields, the coordinate frames. Essentially, a vector field $E$ associates to each point $x\in\mathsf{M}$ an element $E_{x}\in T_{x}\mathsf{M}$. At each $x\in\mathsf{U}$, the coordinate frames form a basis of the tangent space $T_{x}\mathsf{M}$, which is therefore a $d$-dimensional vector space. We denote the coordinate frames induced by $(\mathsf{U},\varphi)$ as

E_{1}^{\varphi},\ldots,E_{d}^{\varphi}.

If there exists a smooth field $\mathfrak{g}$ of symmetric, positive definite bilinear forms, the manifold $\mathsf{M}$ is called Riemannian and $\mathfrak{g}$ is called a Riemannian metric on $\mathsf{M}$ . Essentially this means that $\mathfrak{g}$ associates to each $x\in\mathsf{M}$ a symmetric, positive definite bilinear form $\mathfrak{g}_{x}:T_{x}\mathsf{M}\times T_{x}\mathsf{M}\to\mathbb{R}$ , which turns $T_{x}\mathsf{M}$ into an inner product space.

C.2 Geodesics

We call a map from an interval in $\mathbb{R}$ to the manifold $\mathsf{M}$ a curve. The Riemannian structure on $\mathsf{M}$ induces a special class of curves on $\mathsf{M}$ , namely the geodesics. Intuitively, a geodesic can be thought of as a curve of constant velocity. For a formal definition of geodesics consult [Boothby, Section VII.5]. We say the manifold $\mathsf{M}$ is geodesically complete if all geodesics can be extended such that their domain is $\mathbb{R}$ . In this case, by virtue of [Lee, Corollary 4.28], for all $x\in\mathsf{M}$ and all $v\in T_{x}\mathsf{M}$ there exists a unique geodesic

\gamma_{(x,v)}:\mathbb{R}\to\mathsf{M}

(32)

satisfying $\gamma_{(x,v)}(0)=x$ and $\frac{{\rm d}\gamma_{(x,v)}}{{\rm d}t}|_{0}=v$ for the velocity vector field at zero. Intuitively, $\gamma_{(x,v)}$ can be thought of as the geodesic through $x$ in direction $v$ . It can also be written in terms of the exponential map $\mathrm{Exp}$ (see [Boothby, Section VII.6]) as

\gamma_{(x,v)}(\theta)=\mathrm{Exp}_{x}(\theta v),\qquad x\in\mathsf{M},v\in T% _{x}\mathsf{M},\theta\in\mathbb{R}.

By the Hopf-Rinow Theorem (see e.g. [Lee, Theorem 6.19]) geodesic completeness is equivalent to metric completeness.
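To make the map $\gamma_{(x,v)}(\theta)=\mathrm{Exp}_{x}(\theta v)$ concrete, the following Python sketch evaluates it on the Euclidean unit sphere with the metric inherited from $\mathbb{R}^{d}$, where geodesics are great circles and the exponential map has a closed form. The choice of manifold and the function name `sphere_geodesic` are assumptions made for illustration; the sketch expects a unit vector `x` and a tangent vector `v` orthogonal to `x`.

```python
import numpy as np

def sphere_geodesic(x, v, theta):
    """Evaluate gamma_(x,v)(theta) = Exp_x(theta * v) on the unit sphere.

    Assumptions (illustration only): x has Euclidean norm 1 and v is tangent
    at x, i.e. np.dot(x, v) == 0. The geodesic is the great circle through x
    with initial velocity v.
    """
    x = np.asarray(x, dtype=float)
    v = np.asarray(v, dtype=float)
    speed = np.linalg.norm(v)
    if speed == 0.0:
        # The geodesic with zero initial velocity is constant.
        return x.copy()
    return np.cos(theta * speed) * x + np.sin(theta * speed) * v / speed

x = np.array([0.0, 0.0, 1.0])
v = np.array([2.0, 0.0, 0.0])
start = sphere_geodesic(x, v, 0.0)
quarter = sphere_geodesic(x, v, np.pi / 4)  # travels an arc of length pi / 2
h = 1e-6
velocity = (sphere_geodesic(x, v, h) - sphere_geodesic(x, v, -h)) / (2 * h)
```

Here `start` recovers $x$, the central difference `velocity` approximates $v$, and every point on the curve has unit norm, in line with the two defining conditions stated after (32).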

C.3 The Riemannian measure

We now introduce a measure on $\mathsf{M}$ which can be viewed as an extension of the Lebesgue measure to Riemannian manifolds, see also [Sakai, Section II.5]. The topology on $\mathsf{M}$ induces a Borel $\sigma$-algebra, which we denote as $\mathcal{B}(\mathsf{M})$. We need the following family of functions: Given a coordinate neighborhood $(\mathsf{U},\varphi)$, we use the Gram matrix of the coordinate frames $E_{1,x}^{\varphi},\ldots,E_{d,x}^{\varphi}$ evaluated at $x\in\mathsf{U}$ to introduce the function

	$\displaystyle\sqrt{\det(g,\varphi)}:\mathsf{U}$	$\displaystyle\to[0,\infty)$
	$\displaystyle x$	$\displaystyle\mapsto\sqrt{\det\left[\left(\mathfrak{g}_{x}(E_{j,x}^{\varphi},E% _{k,x}^{\varphi})\right)_{\{1\leqslant j,k\leqslant d\}}\right]}.$

Moreover, as $\mathsf{M}$ is by definition second countable, there exists a countable collection $\{(\mathsf{U}_{i},\varphi_{i})\}_{i\in\mathbb{N}}$ of coordinate neighborhoods such that $\bigcup_{i\in\mathbb{N}}\mathsf{U}_{i}=\mathsf{M}$.\footnote{A space where every cover of open sets contains a countable subcover is called a Lindelöf space. Second countable spaces are Lindelöf.} Note that $\mathsf{M}$ admits a partition of unity $\{\rho_{i}\}_{i\in\mathbb{N}}$ subordinate to $\{\mathsf{U}_{i}\}_{i\in\mathbb{N}}$, see [Sakai, Section I.2.1], i.e.,

• $\rho_{i}:\mathsf{M}\to[0,\infty)$ is a $C^{\infty}$-function\footnote{A function $f:\mathsf{M}\to\mathbb{R}$ is called a $C^{\infty}$-function if for all coordinate neighborhoods $(\mathsf{U},\varphi)$ the function $f\circ\varphi^{-1}:\varphi(\mathsf{U})\to\mathbb{R}$ is a smooth function.} with support $\mathrm{supp}\,\rho_{i}\subseteq\mathsf{U}_{i}$ for all $i\in\mathbb{N}$,

• $\{\mathrm{supp}\,\rho_{i}\}_{i\in\mathbb{N}}$ is locally finite\footnote{For all $x\in\mathsf{M}$ there exists a neighborhood $\mathsf{U}$ of $x$ such that $\mathsf{U}\cap\mathrm{supp}\,\rho_{i}\neq\emptyset$ holds only for finitely many $i\in\mathbb{N}$.},

• $\sum_{i=1}^{\infty}\rho_{i}(x)=1$ for all $x\in\mathsf{M}$.

Then the Riemannian measure induced by the Riemannian metric $\mathfrak{g}$

\nu_{\mathfrak{g}}(\mathsf{A}):=\sum_{i=1}^{\infty}\int_{\varphi_{i}(\mathsf{U% }_{i})}\left(\rho_{i}\cdot\mathbbm{1}_{\mathsf{A}}\cdot\sqrt{\det(g,\varphi_{i% })}\right)\circ\varphi_{i}^{-1}(z)\ \mathrm{Leb}_{d}({\rm d}z),\qquad\mathsf{A% }\in\mathcal{B}(\mathsf{M}),

(33)

defines a measure on the measurable space $(\mathsf{M},\mathcal{B}(\mathsf{M}))$. We provide some brief arguments that the Riemannian measure is well-defined and indeed a measure. Observe that $\mathbbm{1}_{\mathsf{A}}\circ\varphi_{i}^{-1}$ is Borel measurable, and $\rho_{i}\circ\varphi_{i}^{-1}$ and $\sqrt{\det(g,\varphi_{i})}\circ\varphi_{i}^{-1}$ are smooth for all $\mathsf{A}\in\mathcal{B}(\mathsf{M})$ and all $i\in\mathbb{N}$. Hence the appearing Lebesgue integrals are defined. For independence of the construction from the choice of open covering and partition of unity see [Sakai, page 62]. The $\sigma$-additivity of $\nu_{\mathfrak{g}}$ is inherited from the $\sigma$-additivity of the Lebesgue integral. Applying standard extension arguments using the additivity of the Lebesgue integral and the monotone convergence theorem, we can extend (33) to measurable functions $f:\mathsf{M}\to[0,\infty)$ yielding

\int_{\mathsf{M}}f(x)\ \nu_{\mathfrak{g}}({\rm d}x)=\sum_{i=1}^{\infty}\int_{% \varphi_{i}(\mathsf{U}_{i})}\left(\rho_{i}\cdot f\cdot\sqrt{\det(g,\varphi_{i}% )}\right)\circ\varphi_{i}^{-1}(z)\ \mathrm{Leb}_{d}({\rm d}z).
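The role of the density $\sqrt{\det(g,\varphi)}$ can be illustrated numerically. The Python sketch below assumes the unit sphere in $\mathbb{R}^{3}$ with the metric inherited from $\mathbb{R}^{3}$ and a single spherical-coordinate chart, which covers the sphere up to a set of measure zero, so that the partition of unity can be ignored; these choices and the function names `sqrt_det_gram` and `sphere_area` are illustrative assumptions. It builds the Gram matrix of the two coordinate frames and integrates its root determinant by a midpoint rule, which should approximate the surface area $4\pi$.

```python
import numpy as np

def sqrt_det_gram(phi, lam):
    """Root determinant of the Gram matrix of the coordinate frames.

    Assumed chart inverse (illustration only):
    psi(phi, lam) = (sin(phi) cos(lam), sin(phi) sin(lam), cos(phi)).
    """
    # Coordinate frames, written as vectors in R^3: partial derivatives of psi.
    e_phi = np.stack([np.cos(phi) * np.cos(lam),
                      np.cos(phi) * np.sin(lam),
                      -np.sin(phi)], axis=-1)
    e_lam = np.stack([-np.sin(phi) * np.sin(lam),
                      np.sin(phi) * np.cos(lam),
                      np.zeros_like(phi)], axis=-1)
    frames = np.stack([e_phi, e_lam], axis=-2)       # shape (..., 2, 3)
    gram = frames @ np.swapaxes(frames, -1, -2)      # shape (..., 2, 2)
    # Clip tiny negative round-off before taking the square root.
    return np.sqrt(np.maximum(np.linalg.det(gram), 0.0))

def sphere_area(n_phi=200, n_lam=400):
    """Midpoint-rule approximation of the Riemannian measure of the sphere."""
    h_phi, h_lam = np.pi / n_phi, 2.0 * np.pi / n_lam
    phi = (np.arange(n_phi) + 0.5) * h_phi
    lam = (np.arange(n_lam) + 0.5) * h_lam
    P, L = np.meshgrid(phi, lam, indexing="ij")
    return np.sum(sqrt_det_gram(P, L)) * h_phi * h_lam

area = sphere_area()
```

In this chart the density reduces to $\sin(\phi)$, so the computation is a discretized version of the extended integral formula with $f\equiv 1$.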

C.4 The uniform distribution on the unit tangent spheres

Throughout this section we fix $x\in\mathsf{M}$. We denote by

\mathbb{S}_{x}^{d-1}:=\{v\in T_{x}\mathsf{M}\mid\mathfrak{g}_{x}(v,v)=1\}

the unit tangent sphere to $\mathsf{M}$ at $x$. It is immersed into $T_{x}\mathsf{M}$ via the identity map $\mathrm{Id}:\mathbb{S}_{x}^{d-1}\to T_{x}\mathsf{M}$. For all $v\in\mathbb{S}_{x}^{d-1}$, this induces a map $\mathrm{Id}_{*}:T_{v}\mathbb{S}_{x}^{d-1}\to T_{v}T_{x}\mathsf{M}$ on the tangent spaces, see [Boothby, Theorem IV.1.2]. As $(T_{x}\mathsf{M},\mathfrak{g}_{x})$ is a $d$-dimensional inner product space, the tangent space $T_{v}T_{x}\mathsf{M}$ to $T_{x}\mathsf{M}$ at $v\in T_{x}\mathsf{M}$ is again $T_{x}\mathsf{M}$, see [Boothby, Section II.3] for more details on this construction. Because $(\mathfrak{g}_{x})_{v}$ then becomes $\mathfrak{g}_{x}$, we can exploit this to define the Riemannian metric $\widehat{\mathfrak{g}}_{x}$ on $\mathbb{S}_{x}^{d-1}$ given by

\left(\widehat{\mathfrak{g}}_{x}\right)_{v}(\xi_{1},\xi_{2}):=\mathfrak{g}_{x}% \big{(}\mathrm{Id}_{*}(\xi_{1}),\mathrm{Id}_{*}(\xi_{2})\big{)},\qquad v\in% \mathbb{S}_{x}^{d-1},\xi_{1},\xi_{2}\in T_{v}\mathbb{S}_{x}^{d-1},

see [Boothby, Corollary V.2.5]. As described in Appendix C.3, $\widehat{\mathfrak{g}}_{x}$ induces the (Riemannian) measure $\nu_{\widehat{\mathfrak{g}}_{x}}$ on $\mathbb{S}_{x}^{d-1}$ . Since $\mathbb{S}_{x}^{d-1}$ is compact, $\nu_{\widehat{\mathfrak{g}}_{x}}\left(\mathbb{S}_{x}^{d-1}\right)$ is finite, and we may define the probability measure

\sigma_{d-1}^{(x)}:=\frac{1}{\nu_{\widehat{\mathfrak{g}}_{x}}(\mathbb{S}^{d-1}% _{x})}\nu_{\widehat{\mathfrak{g}}_{x}}.

Note that, up to a push forward under an isometric isomorphism, $\sigma_{d-1}^{(x)}$ is the uniform distribution on the Euclidean unit sphere $\mathbb{S}^{d-1}$ .
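A draw from $\sigma_{d-1}^{(x)}$ can be sketched in Python for a concrete case. The example assumes that $\mathsf{M}$ is the unit sphere in $\mathbb{R}^{d}$ with the metric inherited from $\mathbb{R}^{d}$, so that $T_{x}\mathsf{M}$ is the orthogonal complement of $x$ and $\mathfrak{g}_{x}$ is the Euclidean inner product; the manifold and the function name `sample_unit_tangent` are illustrative assumptions. A standard Gaussian vector is projected onto the tangent space and normalized, which yields a rotation-invariant direction within that space.

```python
import numpy as np

def sample_unit_tangent(x, rng):
    """Draw a uniform unit tangent vector at x on the Euclidean unit sphere.

    Assumptions (illustration only): x has Euclidean norm 1, and the tangent
    space at x is {v : np.dot(v, x) == 0} with the Euclidean inner product.
    """
    x = np.asarray(x, dtype=float)
    z = rng.standard_normal(x.shape)
    # Orthogonal projection onto the tangent space; the result is an
    # isotropic Gaussian within that subspace.
    z = z - np.dot(z, x) * x
    # Normalizing an isotropic Gaussian gives the uniform law on the sphere.
    return z / np.linalg.norm(z)

rng = np.random.default_rng(0)
x = np.array([0.0, 0.0, 1.0])
v = sample_unit_tangent(x, rng)
```

The returned `v` is orthogonal to `x` and has unit length, so it lies in $\mathbb{S}_{x}^{d-1}$ for this choice of manifold.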

Appendix D Experiments appendix

D.1 A practical case: Understanding the variability in graph data sets.

Table 8: Parameters used in experiments. $\operatorname{Unif}(n,k)$ denotes the uniform sampling on $\mathcal{V}(n,k)$ and $\cdot$ the column-wise multiplication.

Parameter	$\sigma_{\epsilon}$	$\sigma_{\kappa}$	$\mu$	$F$
Synthetic data estimation $(n,k,\lambda)$	0.1	2	$[10,2,\ldots]$	$[\lambda,1,\ldots]\cdot\operatorname{Unif}(n,k)$
Missing link imputation	0.3	2	$[20,10,5,2,-10]$	$[60,20,20,20,5]\cdot\operatorname{Unif}(20,5)$

\printbibliography
