Equivariant neural networks for inverse problems

Elena Celledoni; Matthias J Ehrhardt; Christian Etmann; Brynjulf Owren; Carola-Bibiane Schönlieb; Ferdia Sherry

doi:10.1088/1361-6420/ac104f

1. Introduction

Deep learning has recently had a large impact on a wide variety of fields; research laboratories have published state-of-the-art results applying deep learning to sundry tasks such as playing Go [1], predicting protein structures [2] and generating natural language [3]. In particular, deep learning methods have also been developed to solve inverse problems, with some examples being [4–7]. In this work we investigate the use of equivariant neural networks for solving inverse imaging problems, i.e. inverse problems where the solution is an image. Convolutional neural networks (CNNs) [8] are a standard tool in deep learning methods for images. By learning convolutional filters, CNNs naturally encode translational symmetries of images: if τ_h is a translation by h ∈ R^d, and k, f are functions on R^d, we formally have the following relation (translational equivariance)

$\begin{equation}{\tau }_{h}[k{\ast}f]=k{\ast}[{\tau }_{h}f].\end{equation} \tag{ 1 }$

This allows learned feature detectors to detect features regardless of their position (though not their orientation or scale) in an image. In many cases it may be desirable for these learned feature detectors to also work when images are transformed under other group transformations, i.e. one may ask that a property such as equation (1) holds for a more general group transformation than the group of translations {τ_h|h ∈ R^d}. If natural symmetries of the problem are not built into the machine learning method and are not present in the training data, in the worst case, it can result in catastrophic failure as illustrated in figure 1.

Figure 1. Refer to the following caption and surrounding text. — **Figure 1.** Roto-translationally ('equivariant') and just translationally ('ordinary') equivariant filters are trained to denoise on a single pair of ground truth and noisy images ('clean' and 'noisy' in the top row), giving perfect denoising results on the training example. In the bottom row, we see the result of testing the learned filters on a rotated version of the training image; the ordinary filter completely fails at recovering the ground truth, whereas the equivariant filter performs as well as it did on the training image.
Download figure:
Standard image High-resolution image

To some extent, this problem is circumvented by augmenting the training data through suitable transformations, but it has been shown in classification and segmentation tasks that it is still beneficial to incorporate known symmetries directly into the architecture used, especially if the amount of training data is small [9–11]. Furthermore, training on augmented data is not enough to guarantee that the final model satisfies the desired symmetries. There has recently been a considerable amount of work in this direction, in the form of group equivariant CNNs. Most of the focus has been on roto-translational symmetries of images [9, 11–13], though there is also some work on incorporating scaling symmetries [14, 15] and even on equivariance to arbitrary Lie group symmetries [16].

As mentioned before, we will concern ourselves with solving inverse imaging problems: given measurements y that are related to an underlying ground truth image u through a model

$\begin{equation}y=\mathfrak{N}(A(u)),\end{equation} \tag{ 2 }$

with A the so-called forward operator and $\mathfrak{N}$ a noise-generating process, the goal is to estimate the image u from the measurements y as well as possible. Typical examples of inverse imaging problems include the problem of recovering an image from its line integrals as in computerised tomography (CT) [17], or recovering an image from subsampled Fourier measurements as in magnetic resonance imaging (MRI) [18, 19]. The solution of an inverse problem is often complicated by the presence of ill-posedness: a problem is said to be well-posed in the sense of Hadamard [20] if it satisfies a set of three conditions (existence of a solution, its uniqueness, and its continuous dependence on the measurements), and ill-posed if any of these conditions fail.

It is a natural idea to try to apply equivariant neural networks to solve inverse imaging problems: there is useful knowledge about the relationship between a ground truth image and its measurements in the form of A and the symmetries in both the measurement and image domain (the range and domain of A respectively). Furthermore, training data tends to be considerably less abundant in medical and scientific imaging than in the computer vision and image analysis tasks that are typical of the deep learning revolution, such as ImageNet classification [21]. This suggests that the lower sample complexity of equivariant neural networks (as compared to ordinary CNNs) may be harnessed in this setting with scarce data to learn better reconstruction methods. Finally, end users of the methods, e.g. medical practitioners, are often skeptical of 'black-box' methods and guarantees on the behaviour of the method, such as equivariance of the method to certain natural image transformations, may alleviate some of the concerns that they have.

We investigate the use of equivariant neural networks within the framework of learned iterative reconstruction methods [5, 22], which constitute some of the most prototypical deep learning solutions to inverse problems. The designs of these methods are motivated by classical variational regularisation approaches [23, 24], which propose to overcome the ill-posedness of an inverse problem by estimating its solution as

$\begin{equation}\hat{u}={\mathrm{arg min}}_{u}\enspace d(A(u),y)+J(u),\end{equation} \tag{ 3 }$

with d a measure of discrepancy motivated by our knowledge of the noise-generating process $\mathfrak{N}$ and J is a regularisation functional incorporating prior knowledge of the true solution. Learned iterative reconstruction methods, also known as unrolled iterative methods, are designed by starting from a problem such as problem (3), choosing an iterative optimisation method to solve it, truncating that method to a finite number of iterations, and finally replacing parts of it (e.g. the proximal operators) by neural networks. We will show that these neural networks can naturally be chosen to be equivariant neural networks, and that doing so gives improved performance over choosing them to be ordinary CNNs. More precisely, our contributions in this work are as follows:

1.1. Our contributions

We show that invariance of a functional to a group symmetry implies that its proximal operator satisfies an equivariance property with respect to that group. This insight can be combined with the unrolled iterative method approach: it makes sense for a regularisation functional to be invariant to roto-translations if there is no prior knowledge on the orientation and position of structures in the images, in which case the corresponding proximal operators are roto-translationally equivariant.

Motivated by these observations, we build learned iterative methods using roto-translationally equivariant building blocks. We show in a supervised learning setting that these methods can outperform comparable methods that only use ordinary convolutions as building blocks, when applied to a low-dose CT reconstruction problem and a subsampled MRI reconstruction problem. This outperformance is manifested in two main ways: the equivariant method is better able to take advantage of small training sets than the ordinary one, and its performance is more robust to transformations that leave images in orientations not seen during training.

2. Notation and background on groups and representations

In this section, we give an overview of the main concepts regarding groups and representations that are required to follow the main text. By a group G, we mean a set equipped with an associative binary operation ⋅ : G × G → G (usually the dot is omitted in writing), furthermore containing a neutral element e, such that e ⋅ g = g ⋅ e = g for all g ∈ G and a unique inverse g⁻¹ for each group element g, such that g ⋅ g⁻¹ = g⁻¹ ⋅ g = e. Given groups G and H, we say that a map ϕ : G → H is a group homomorphism if it respects the group structures:

$\begin{equation*}\phi ({g}_{1}{g}_{2})=\phi ({g}_{1})\phi ({g}_{2})\quad \text{for}\enspace \text{any}\quad {g}_{1},{g}_{2}\in G.\end{equation*}$

Groups can be naturally used to describe symmetries of mathematical objects through the concept of group actions. Given a group G and set X, we say that G acts on X if there is a function T : G × X → X (the application of which we stylise as T_g[x] for g ∈ G, x ∈ X) that obeys the group structure in the sense that

$\begin{equation}{T}_{{g}_{1}}{\circ}{T}_{{g}_{2}}={T}_{{g}_{1}{g}_{2}}\quad \text{for}\enspace \text{any}\quad {g}_{1},{g}_{2}\in G\end{equation} \tag{ 4 }$

and T_e = id. That is, the group action can be thought of as a group homomorphism from G to the permutation group of X. If there is no ambiguity, the group action may just be written as T_g[x] = g ⋅ x = gx. An important type of group actions is given by the group representations. If V is a vector space, we will denote by GL(V) its general linear group, the group of invertible linear maps V → V, with the group operation given by composition. A representation ρ : G → GL(V) of a group G which acts on V is a group homomorphism, and so corresponds to a linear group action T of G on V: ρ(g)x = T_g[x] for x ∈ V and g ∈ G. Given a vector space V, any group G has a representation on V given by ρ(g) = I, which is the so-called trivial representation. If V is additionally a Hilbert space, we will call ρ a unitary representation if ρ(g) is a unitary operator for each g ∈ G, i.e. ||ρ(g)x|| = ||x|| for all x ∈ V. Given a finite group G = {g₁, ..., g_n}, we can define the so-called regular representation ρ of G on Rⁿ by

$\begin{equation*}\rho ({g}_{i}){e}_{j}={e}_{k},\end{equation*}$

where {e₁, ..., e_n} is a basis of Rⁿ and k is such that g_i g_j = g_k. With this representation, each ρ(g) is a permutation matrix, so ρ is a unitary representation if the basis {e₁, ..., e_n} is orthonormal.

In this work, the groups that we will consider take the form of a group of isometries on R^d. These groups are represented by a semi-direct product G = R^d ⋊ H, where H is a subgroup of the orthogonal group O(d) of rotations and reflections:

$\begin{equation*}\text{O}(d)=\left\{R\in \text{GL}({\mathbf{R}}^{d})\vert {R}^{\text{T}}={R}^{-1}\right\}.\end{equation*}$

An important subgroup of O(d) is the special orthogonal group SO(d) = {A ∈ O(d)| det(A) = 1}, which represents the set of pure rotations in O(d). Each element of the semi-direct product G can be identified with a unique pair (t, R) of t ∈ R^d, the translation component, and R ∈ H, the rotation (and potentially reflection). The semi-direct product can naturally be encoded as a matrix using homogeneous coordinates

$\begin{equation*}(t,R){\leftrightarrow}\left(\begin{matrix}\hfill R\hfill & \hfill t\hfill \\ \hfill 0\hfill & \hfill 1\hfill \end{matrix}\right),\end{equation*}$

so that the group product is given by a matrix product. G naturally acts on a point x ∈ R^d through T_(t,R)[x] = (t, R)x = Rx + t.

In the experiments that we consider later in this work, we will consider the case d = 2. In this case SO(2) has a simple description:

$\begin{equation*}\mathrm{S}\mathrm{O}(2)=\left\{\left(\begin{array}{cc}\mathrm{cos}(\theta )& -\mathrm{sin}(\theta )\\ \mathrm{sin}(\theta )& \mathrm{cos}(\theta )\end{array}\right)\left\vert \right.\theta \in [0,2\pi )\right\}.\end{equation*}$

We will identify the groups Z_m of integers modulo m with the subgroup of SO(2) given by

$\begin{equation*}{\mathbf{Z}}_{m}=\left\{\left(\begin{array}{cc}\mathrm{cos}(2\pi k/m)& -\mathrm{sin}(2\pi k/m)\\ \mathrm{sin}(2\pi k/m)& \mathrm{cos}(2\pi k/m)\end{array}\right)\left\vert \right.k\in \mathbf{Z}\right\}.\end{equation*}$

Given vector spaces V₁, V₂, we will denote by Hom(V₁, V₂) the vector space of linear operators A : V₁ → V₂. We will refer to a number of function spaces: L²(R^d, R^c) denotes the Hilbert space of square integrable functions f : R^d → R^c (where R^c carries the Euclidean norm), identified as usual up to equality almost everywhere, and ${C}_{c}^{\infty }({\mathbf{R}}^{d},{\mathbf{R}}^{c})$ denotes the vector space of infinitely smooth functions f : R^d → R^c that have compact support.

3. Learnable equivariant maps

The concept of equivariance is well-suited to describing the group symmetries that a function might obey:

Definition 1. Given a general group G, a function ${\Phi}:\mathcal{X}\to \mathcal{Y}$ and group actions ${T}^{\mathcal{X}},{T}^{\mathcal{Y}}$ of G on $\mathcal{X}$ and $\mathcal{Y}$ , Φ will be called equivariant if it satisfies

$\begin{equation}{\Phi}({T}_{g}^{\mathcal{X}}[f])={T}_{g}^{\mathcal{Y}}[{\Phi}(f)]\end{equation} \tag{ 5 }$

for all $f\in \mathcal{X}$ and g ∈ G.

Following the definition of equivariance, we see that equivariant functions have the convenient property that composing them results in an equivariant function, as long as the group actions on the inputs and outputs match in the appropriate way:

Lemma 1. Suppose that G is a group that acts on sets $\mathcal{X},\mathcal{Y}$ and $\mathcal{Z}$ through ${T}^{\mathcal{X}},{T}^{\mathcal{Y}}$ and ${T}^{\mathcal{Z}}$ . If ${\Phi}:\mathcal{X}\to \mathcal{Y}$ and ${\Psi}:\mathcal{Y}\to \mathcal{Z}$ are equivariant, then so is ${\Psi}{\circ}{\Phi}:\mathcal{X}\to \mathcal{Z}$ .

Based on this property it is clear that the standard approach to building neural networks (compose linear and nonlinear functions with learnable components in an alternating manner) can be used to build equivariant neural networks as long as linear and nonlinear functions with the desired equivariance can be constructed.

Example 1. Suppose that $\mathcal{X}={L}^{2}({\mathbf{R}}^{d},{\mathbf{R}}^{{c}_{\mathcal{X}}})$ and $\mathcal{Y}={L}^{2}({\mathbf{R}}^{d},{\mathbf{R}}^{{c}_{\mathcal{Y}}})$ , with the group G = R^d acting on $\mathcal{X}$ by ${T}_{h}^{\mathcal{X}}[f](x)=f(x-h)$ , and in a similar way on $\mathcal{Y}$ by ${T}^{\mathcal{Y}}$ . Ordinary CNNs [8], with convolutional linear layers and pointwise nonlinear functions, are equivariant in this setting.

In this work, we will consider the group G = R^d ⋊ H for some subgroup H of O(d) (see section 2 for some background), acting on vector-valued functions. To be more specific, we will let $\mathcal{X}={L}^{2}({\mathbf{R}}^{d},{\mathbf{R}}^{{d}_{\mathcal{X}}})$ be the Hilbert space of square-integrable ${\mathbf{R}}^{{d}_{\mathcal{X}}}$ -valued functions and assume that ${\mathbf{R}}^{{d}_{\mathcal{X}}}$ carries a representation ${\pi }_{\mathcal{X}}:H\to \text{GL}({\mathbf{R}}^{{d}_{\mathcal{X}}})$ . Similarly, we will define $\mathcal{Y}={L}^{2}({\mathbf{R}}^{d},{\mathbf{R}}^{{d}_{\mathcal{Y}}})$ and assume that ${\pi }_{\mathcal{Y}}:H\to \text{GL}({\mathbf{R}}^{{d}_{\mathcal{Y}}})$ is a representation of H. We define the group actions ${T}^{\mathcal{X}}$ and ${T}^{\mathcal{Y}}$ to be the induced representations, ${\rho }_{\mathcal{X}}$ and ${\rho }_{\mathcal{Y}}$ , of ${\pi }_{\mathcal{X}}$ and ${\pi }_{\mathcal{Y}}$ on $\mathcal{X}$ and $\mathcal{Y}$ respectively. In the setting that we are considering, these representations take a particularly simple form. As mentioned in section 2, since we assume that G takes the semi-direct product form R^d ⋊ H, each group element g ∈ G can be uniquely thought of as a pair g = (t, R) for some t ∈ R^d and R ∈ H. With this in mind, the representations ${\rho }_{\mathcal{X}}$ and ${\rho }_{\mathcal{Y}}$ can be written as follows for any $f\in \mathcal{Z},x\in {\mathbf{R}}^{d}$ and t ∈ R^d, R ∈ H:

$\begin{equation}{\rho }_{\mathcal{Z}}((t,R))[f](x)=\underset{(\text{a})}{\underbrace{{\pi }_{\mathcal{Z}}(R)}}\underset{(\text{b})}{\underbrace{f({(t,R)}^{-1}x)}}\quad \text{for}\quad \mathcal{Z}=\mathcal{X},\enspace \text{or}\enspace \mathcal{Z}=\mathcal{Y}.\end{equation} \tag{ 6 }$

These representations have a natural interpretation: to apply a group element (t, R) to a vector-valued function, we must move the vectors, as in part (b) of equation (6), and transform each vector accordingly, as in part (a) of equation (6).

3.1. Equivariant linear operators

It is well-established that equivariant linear operators are strongly connected to the concept of convolutions. Indeed, in a relatively general setting it has been shown that an integral operator is equivariant if and only if it is given by a convolution with an appropriately constrained kernel [25]. In the setting that we are considering, the more specific result in proposition 1 can be derived, as done in [11, 26] for the case d = 2 and [27] for the case d = 3.

Proposition 1. Suppose that ${\Phi}:\mathcal{X}\to \mathcal{Y}$ is an operator given by integration against a continuous kernel $K:{\mathbf{R}}^{d}{\times}{\mathbf{R}}^{d}\to \mathrm{H}\mathrm{o}\mathrm{m}({\mathbf{R}}^{{d}_{\mathcal{X}}},{\mathbf{R}}^{{d}_{\mathcal{Y}}})$ ,

$\begin{equation*}{\Phi}(f)(x)=\underset{{\mathbf{R}}^{d}}{\int }K(x,y)f(y)\enspace \mathrm{d}y.\end{equation*}$

Then the operator Φ is equivariant if and only if it is in fact given by a convolution satisfying an additional constraint: there is a continuous $k:{\mathbf{R}}^{d}\to \mathrm{H}\mathrm{o}\mathrm{m}({\mathbf{R}}^{{d}_{\mathcal{X}}},{\mathbf{R}}^{{d}_{\mathcal{Y}}})$

$\begin{equation*}{\Phi}(f)(x)=\underset{{\mathbf{R}}^{d}}{\int }k(x-y)f(y)\mathrm{d}y,\end{equation*}$

where k satisfies the additional condition

$\begin{equation*}k(Rx)={\pi }_{\mathcal{Y}}(R)k(x){\pi }_{\mathcal{X}}({R}^{-1})\quad \text{for}\quad x\in {\mathbf{R}}^{d},R\in H.\end{equation*}$

The derivation of this result proceeds by writing out the definitions of equivariance and using the invariances of the Lebesgue measure. The equivariance of Φ implies that we must the following chain of equalities for any $x\in {\mathbf{R}}^{d},f\in \mathcal{X},t\in {\mathbf{R}}^{d},R\in H$ and g = (t, R) ∈ G:

$\begin{align*}\underset{{\mathbf{R}}^{d}}{\int }{\pi }_{\mathcal{Y}}(R)K({g}^{-1}x,y)f(y)\enspace \mathrm{d}y& \enspace (\text{a})\enspace {=}{\pi }_{\mathcal{Y}}(R)\underset{{\mathbf{R}}^{d}}{\int }K({g}^{-1}x,y)f(x)\enspace \mathrm{d}y\\ & ={\rho }_{\mathcal{Y}}(g)[{\Phi}(f)](x)\\ & (\text{b}){\enspace =}\enspace {\Phi}({\rho }_{\mathcal{X}}(g)[f])(x)\\ & =\underset{{\mathbf{R}}^{d}}{\int }K(x,y){\rho }_{\mathcal{X}}g[f](y)\enspace \mathrm{d}y\\ & =\underset{{\mathbf{R}}^{d}}{\int }K(x,y){\pi }_{\mathcal{X}}(h)f({g}^{-1}y)\enspace \mathrm{d}y\\ & (\text{c}){\enspace =}\underset{{\mathbf{R}}^{d}}{\int }K(x,gy){\pi }_{\mathcal{X}}(h)f(y)\enspace \mathrm{d}y.\end{align*}$

Here the tags above the equality signs correspond to the following justifications:

(a)
Since ${\pi }_{\mathcal{Y}}$ is a group representation, ${\pi }_{\mathcal{Y}}(R)$ is a linear map and commutes with the integral,
(b)
Φ is assumed to be equivariant,
(c)
We make the substitution y ← gy and note that the Lebesgue measure is invariant to G.

Taking the left-hand side and right-hand side together, we find that

$\begin{equation*}\underset{{\mathbf{R}}^{d}}{\int }\left({\pi }_{\mathcal{Y}}(R)K({g}^{-1}x,y)-K(x,gy){\pi }_{\mathcal{X}}(R)\right)f(y)\mathrm{d}y=0,\end{equation*}$

and since this must hold for any $f\in \mathcal{X}={L}^{2}({\mathbf{R}}^{d},{\mathbf{R}}^{{d}_{\mathcal{X}}})$ , we conclude by testing on sequences converging to Dirac delta functions that

$\begin{equation}{\pi }_{\mathcal{Y}}(R)K({g}^{-1}x,y)=K(x,gy){\pi }_{\mathcal{X}}(R).\end{equation} \tag{ 7 }$

Specialising by setting R equal to the identity element, we see that

$\begin{equation*}K(x-t,y)=K({(t,I)}^{-1}x,y)=K(x,(t,I)y)=K(x,y+t),\end{equation*}$

or upon substituting x ← x + t, K(x, y) = K(x + t, y + t). Choosing t to be the translation that takes y to 0, we find that

$\begin{equation*}K(x,y)=K(x-y,0)= :\enspace k(x-y)\end{equation*}$

defines a convolution kernel $k:{\mathbf{R}}^{d}\to \mathrm{H}\mathrm{o}\mathrm{m}({\mathbf{R}}^{{d}_{\mathcal{X}}},{\mathbf{R}}^{{d}_{\mathcal{Y}}})$ . Now specialising equation (7) by letting R ∈ H and x ∈ R^d be arbitrary and t, y = 0, we obtain the condition ${\pi }_{\mathcal{Y}}(R)k({R}^{-1}x)=k(x){\pi }_{\mathcal{X}}(R)$ , or upon substituting x ← Rx and rearranging,

$\begin{equation}k(Rx)={\pi }_{\mathcal{Y}}(R)k(x){\pi }_{\mathcal{X}}({R}^{-1}).\end{equation} \tag{ 8 }$

Conversely, the above reasoning can be reversed to show that the condition in equation (8) (for all x ∈ R^d, R ∈ H) is sufficient to guarantee equivariance of Φ.

The condition in equation (8) is a linear constraint that is fully specified before training. Hence, if a basis is computed for the convolution kernels satisfying equation (8), a general equivariant linear operator can be learned by learning its parameters in that basis. Since the choices of H that we consider are all compact groups, any representation of H can be decomposed as a direct sum of irreducible representations of H (theorem 5.2 in [28]). As a result of this, we can give the following procedure to compute a basis for the convolution kernels satisfying the equivariance condition in equation (8) as soon as ${\pi }_{\mathcal{X}}$ and ${\pi }_{\mathcal{Y}}$ are specified:

Decompose ${\pi }_{\mathcal{X}}$ and ${\pi }_{\mathcal{Y}}$ as direct sum of irreducible representations; ${\pi }_{\mathcal{X}}={Q}_{\mathcal{X}}\enspace \mathrm{diag}({\pi }_{\mathcal{X}}^{1},\dots ,{\pi }_{\mathcal{X}}^{{k}_{\mathcal{X}}}){Q}_{\mathcal{X}}^{-1},{\pi }_{\mathcal{Y}}={Q}_{\mathcal{Y}}\enspace \mathrm{diag}({\pi }_{\mathcal{Y}}^{1},\dots ,{\pi }_{\mathcal{Y}}^{{k}_{\mathcal{Y}}}){Q}_{\mathcal{Y}}^{-1}$ (here diag constructs a block diagonal matrix with the diagonal elements given by the arguments supplied to diag).
For each i, j with $1{\leqslant}i{\leqslant}{k}_{\mathcal{X}},1{\leqslant}j{\leqslant}{k}_{\mathcal{Y}}$ find a basis for the convolution kernels k_i,j satisfying the equivariance condition
$\begin{equation*}{k}_{i,j}(Rx)={\pi }_{\mathcal{Y}}^{j}(R){k}_{i,j}(x){\pi }_{\mathcal{X}}^{j}({R}^{-1})\end{equation*}$
with the irreducible representations ${\pi }_{\mathcal{Y}}^{j}$ and ${\pi }_{\mathcal{X}}^{i}$ .
Given expansions of the k_i,j, compute the overall equivariant convolution kernel k by
$\begin{equation*}k={Q}_{\mathcal{Y}}\cdot {({k}_{i,j})}_{1{\leqslant}i{\leqslant}{k}_{\mathcal{X}},1{\leqslant}j{\leqslant}{k}_{\mathcal{Y}}}\cdot {Q}_{\mathcal{X}}^{-1}.\end{equation*}$

This procedure has been described in more detail in [11] and implemented in the corresponding software package for the groups G = R² ⋊ H, where H can be any subgroup of O(2).

Since the equivariant convolutions described above are implemented using ordinary convolutions, little extra computational effort required to use them compared to ordinary convolutions: during training, there is just an additional step of computing the basis expansion defining the equivariant convolution kernels (and backpropagating through it). When it is time to test the network, this step can be avoided by computing the basis expansion once and only saving the resulting convolution kernels, so that it is completely equivalent in terms of computational effort to using an ordinary CNN.

3.2. Equivariant nonlinearities

Although pointwise nonlinearities are translationally equivariant, some more care is needed when designing nonlinearities that satisfy the equivariance condition in equation (5) with our choices of groups. Examining the form of the induced representations in our setting, as given in equation (6), it is evident that for a pointwise nonlinearity ϕ : R → R to be equivariant (in the sense that $\phi ({\rho }_{\mathcal{X}}(g)[\enspace f])={\rho }_{\mathcal{X}}(g)[\phi (\enspace f)]$ , with ϕ applied pointwise) ϕ must commute with ${\pi }_{\mathcal{X}}(R)$ for every R ∈ H: with g = (t, R) for t ∈ R^d, R ∈ H we have

$\begin{align*}\phi ({\pi }_{\mathcal{X}}(R)f({g}^{-1}x))& =\phi ({\rho }_{\mathcal{X}}(g)[f])(x)={\rho }_{\mathcal{X}}(g)[\phi (f)](x)\\ & ={\pi }_{\mathcal{X}}(h)\phi (f({g}^{-1}x)).\end{align*}$

This can be ensured if ${\pi }_{\mathcal{X}}$ is the regular representation of H, since in that case each ${\pi }_{\mathcal{X}}(h)$ is a permutation matrix, giving the following guideline:

Lemma 2. Suppose that G = R^d ⋊ H with H a finite subgroup of O(d) and that ϕ : R → R is a given function. If ${\pi }_{\mathcal{X}}$ is the regular representation of H, then ${\Phi}:\mathcal{X}\to \mathcal{X}$ is equivariant, where Φ(f) (x) = ϕ(f(x)).

Another way to ensure that ϕ commutes with ${\pi }_{\mathcal{X}}$ is by choosing the trivial representation. Although the trivial representation may not be very interesting by itself, this gives rise to another form of nonlinearity called the norm nonlinearity. If ${\pi }_{\mathcal{X}}$ is a unitary representation, taking the pointwise norm satisfies an equivariance condition: with g = (t, R) for t ∈ R^d, R ∈ H

$\begin{equation*}{\Vert}{\rho }_{\mathcal{X}}(g)[f](x){\Vert}={\Vert}{\pi }_{\mathcal{X}}(R)f({g}^{-1}x)\left.\right){\Vert}={\Vert}f({g}^{-1}x){\Vert}.\end{equation*}$

The right-hand side transforms according to the trivial representation, so by the above comments we deduce that the nonlinearity f ↦ ϕ(||f||) satisfies an equivariance condition of the same form. To obtain the norm nonlinearity, which maps features of a given type to features of the same type, we then form the map ${\Phi}:\mathcal{X}\to \mathcal{X},f{\mapsto}f\cdot \phi ({\Vert}f{\Vert})$ : with g = (t, R) for t ∈ R^d, R ∈ H, we have

$\begin{align*}{\Phi}({\rho }_{\mathcal{X}}(g)[f])(x)& ={\pi }_{\mathcal{X}}(R)f({g}^{-1}x)\cdot \phi ({\Vert}f({g}^{-1}x){\Vert})\\ & ={\pi }_{\mathcal{X}}(R)\left(f({g}^{-1}x)\cdot \phi ({\Vert}f({g}^{-1}x){\Vert})\right)\\ & ={\pi }_{\mathcal{X}}(R)\left(f\cdot \phi ({\Vert}f{\Vert})\right)({g}^{-1}x)\\ & ={\rho }_{\mathcal{X}}(g)[{\Phi}(f)](x),\end{align*}$

where we used that ϕ(||f(g⁻¹ x)||) is a scalar. This shows that the norm nonlinearity Φ is indeed equivariant:

Lemma 3. Suppose that ${\pi }_{\mathcal{X}}$ is a unitary representation of H, and that ϕ : R → R is a given function. Then the norm nonlinearity ${\Phi}:\mathcal{X}\to \mathcal{X}$ with Φ(f)[x] = f(x)ϕ(||f(x)||) is equivariant.

4. Reconstruction methods motivated by variational regularisation

We consider the inverse problem of estimating an image u from noisy measurements y. We will assume that knowledge of the measurement process is available in the form of the forward operator A, which maps an image to ideal, noiseless measurements, and generally there were will be a reasonable idea of the process by which they are corrupted to give rise to the noisy measurements y. A tried and tested approach to solving inverse problems is the variational regularisation approach [23, 29]. In this approach, images are recovered from measurements by minimising a trade-off between the data fit and a penalty function encoding prior knowledge:

$\begin{equation}\hat{u}={\mathrm{arg min}}_{u}\enspace {E}_{y}(u)+J(u),\end{equation} \tag{ 9 }$

with E_y a data discrepancy functional penalising mismatch of the estimated image and the measurements and J the penalty function. Usually E_y will take the form E_y(u) = d(A(u), y), where d is a measure of divergence chosen based on our knowledge of the noise process.

4.1. Equivariance in splitting methods

Generally, problem (9) may be difficult to solve, and a lot of research has been done on methods to solve problems such as these. Iterative methods to solve it are often structured as splitting methods: the objective function is split into terms, and easier subproblems associated with each of these terms are solved in an alternating fashion to yield a solution to problem (9) in the limit. A prototypical example of this is the proximal gradient method (also known as forward-backward splitting) [30, 31], which has become a standard tool for solving linear inverse problems, particularly in the form of the FISTA algorithm [32]. In its basic form, the proximal gradient method performs the procedure described in algorithm 1.

Algorithm 1. Proximal gradient method.

inputs: measurements y, initial estimate u⁰

u ← u⁰

for i ← 1, ..., it do

$\quad u{\leftarrow}{\mathrm{prox}}_{{\tau }^{i}J}(u-{\tau }^{i}\nabla {E}_{y}(u))$ $\quad u{\leftarrow}{\mathrm{prox}}_{{\tau }^{i}J}(u-{\tau }^{i}\nabla {E}_{y}(u))$

end for

return u

Recall here that the proximal operator [33–35] prox_J is defined as follows:

Definition 2. Suppose that $\mathcal{X}$ is a Hilbert space and that $J:\mathcal{X}\to \mathbf{R}\cup \left\{+\infty \right\}$ is a lower semi-continuous convex proper functional. The proximal operator ${\mathrm{prox}}_{J}:\mathcal{X}\to \mathcal{X}$ is then defined as

$\begin{equation}{\mathrm{prox}}_{J}(u)={\mathrm{arg min}}_{{u}^{\prime }\in \mathcal{X}}\enspace \frac{1}{2}{\Vert}u-{u}^{\prime }{{\Vert}}^{2}+J({u}^{\prime }).\end{equation} \tag{ 10 }$

Although this definition of proximal operators assumes that the functional J is convex, this assumption is more stringent than is necessary to ensure that an operator defined by equation (10) is well-defined and single-valued. One can point for example to the classes of μ-semi-convex functionals (i.e. the set of J, such that $u{\mapsto}J(u)+\frac{\mu }{2}{\Vert}u{{\Vert}}^{2}$ is convex) on $\mathcal{X}$ for 0 < μ < 1, which include nonconvex functionals. In what follows, we will allow for such more general functionals by just asking that the proximal operator is well-defined and single-valued.

It is often reasonable to ask that the proximal operators prox_τJ satisfy an equivariance property; if the corresponding regularisation functional J is invariant to a group symmetry, the proximal operator will be equivariant:

Proposition 2. Suppose that $\mathcal{X}$ is a Hilbert space and ρ is a unitary representation of a group G on $\mathcal{X}$ . If a functional $J:\mathcal{X}\to \mathbf{R}\cup \left\{+\infty \right\}$ is invariant, i.e. J(ρ(g)f) = J(f), and has a well-defined single-valued proximal operator ${\mathrm{prox}}_{J}:\mathcal{X}\to \mathcal{X}$ , then prox_J is equivariant, in the sense that

$\begin{equation*}{\mathrm{prox}}_{J}(\rho (g)f)=\rho (g){\mathrm{prox}}_{J}(f)\end{equation*}$

for all $f\in \mathcal{X}$ and g ∈ G.

Proof. We have the following chain of equalities:

$\begin{align*}{\mathrm{prox}}_{J}(\rho (g)f)& ={\mathrm{arg min}}_{h}\enspace \frac{1}{2}{\Vert}\rho (g)f-h{{\Vert}}^{2}+J(h)\\ & (\text{a}){=}{\enspace \mathrm{arg min}}_{h}\enspace \frac{1}{2}{\Vert}\rho (g)(f-\rho ({g}^{-1})h){{\Vert}}^{2}+J(\rho ({g}^{-1})h)\\ & (\text{b}){=}{\enspace \mathrm{arg min}}_{h}\enspace \frac{1}{2}{\Vert}f-\rho ({g}^{-1})h{{\Vert}}^{2}+J(\rho ({g}^{-1})h)\\ & (\text{c}){=}\enspace \rho (g)\left[{\mathrm{arg min}}_{h}\enspace \frac{1}{2}{\Vert}f-h{{\Vert}}^{2}+J(h)\right]=\rho (g){\mathrm{prox}}_{J}(f).\end{align*}$

The three marked steps are justified as follows:

(a)
J is assumed to be invariant w.r.t. ρ,
(b)
The representation ρ is assumed to be unitary,
(c)
ρ(g) is invertible, and under the substitution h ← ρ(g)h, the minimiser transforms accordingly.

□

Example 2. As a prominent example of a regularisation functional satisfying the conditions of proposition 2, consider the total variation functional [36] on L²(R^d)

$\begin{equation*}\text{TV}(u)=\underset{\phi \in {C}_{c}^{\infty }({\mathbf{R}}^{d};{\mathbf{R}}^{d}),{\Vert}\phi {{\Vert}}_{\infty }{\leqslant}1}{\mathrm{sup}}\underset{{\mathbf{R}}^{d}}{\int }u\enspace \text{div}\enspace \phi ,\end{equation*}$

with the group G = SE(d) and the scalar field representation ρ(r)[f](x) = f(r⁻¹ x). Since the Lebesgue measure is invariant to G and the set of vector fields $\left\{\phi \in {C}_{c}^{\infty }({\mathbf{R}}^{d};{\mathbf{R}}^{d})\vert {\Vert}\phi {{\Vert}}_{\infty }{\leqslant}1\right\}$ is closed under G, TV is invariant w.r.t. ρ. As a result of this, proposition 2 tells us that prox_τTV is equivariant w.r.t. ρ for any τ ⩾ 0. Note that TV is not unique in satisfying these conditions; by a similar argument it can be shown, for example, that the higher order total generalised variation functionals [37] share the same invariance property (and hence also that their proximal operators are equivariant).

Remark 1. The above example, and all other examples that we consider in this work, are concerned with the case where the image to be recovered is a scalar field. Note, however, that proposition 2 is not limited to this type of field and that there are applications where it is natural to use more complicated representations ρ. A notable example is diffusion tensor MRI [38] in which case the image to be estimated is a diffusion tensor field and ρ should be chosen as the appropriate tensor representation.

4.1.1. Equivariance of the reconstruction operator

It is worth thinking about whether it is sensible to ask that the overall reconstruction method is equivariant, and how this should be interpreted. Thinking of the reconstruction operator as a map from measurements y to images $\hat{u}$ , it is hard to make sense of the statement that it is equivariant, since the measurement space generally does not share the symmetries of the image space (in the case where measurements may be incomplete). If we think instead of the reconstruction method as mapping a true image u to an estimated image $\hat{u}$ through (noiseless) measurements y = A(u), we might ask that a symmetry transformation of u should correspond to the same symmetry transformation of $\hat{u}$ . In the case of reconstruction by a variational regularisation method as in problem (9), this is too much to ask for even if the regularisation functional is invariant, since information in the (incomplete) measurements can appear or disappear under symmetry transformations of the true image. An example of this phenomenon when solving an inpainting problem is shown in figure 2.

Figure 2. Refer to the following caption and surrounding text. — **Figure 2.** An example demonstrating the non-equivariance of a general variational regularisation approach to image reconstruction, even when the corresponding regularisation functional J (as in problem (9)) is invariant. Here, A represents the application of an inpainting mask, R is an operator rotating the image by 20° and Φ is the solution map to problem (9) with E_y(u) = ||Au − y||² and J(u) = τTV(u).
Download figure:
Standard image High-resolution image

4.2. Learned proximal gradient descent

A natural way to use knowledge of the forward model in a neural network approach to image reconstruction is in the form of unrolled iterative methods [5, 22]. Starting from an iterative method to solve problem (9), the method is truncated to a fixed number of iterations and some of the steps in the truncated algorithm are replaced by learnable parts. As noted in the previous section, the proximal gradient method in algorithm 1 can be applied to a variational regularisation problem such as problem (9). Motivated by this and the unrolled iterative method approach, we can study learned proximal gradient descent as in algorithm 2 (where the variable s can be used as a memory state as is common in accelerated versions of the proximal gradient method [32]):

Algorithm 2. Learned proximal gradient method.

inputs: measurements y, initial estimate u⁰

u ← u⁰, s ← 0

for i ← 1, ..., it do

$\quad (u,s){\leftarrow}{\hat{\mathrm{prox}}}_{i}(u,s,\nabla {E}_{y}(u))$ $\quad (u,s){\leftarrow}{\hat{\mathrm{prox}}}_{i}(u,s,\nabla {E}_{y}(u))$

end for

return Φ(y) := u

Here ${\hat{\mathrm{prox}}}_{i}$ are neural networks, the architectures of which are chosen to model proximal operators. In this work, we choose ${\hat{\mathrm{prox}}}_{i}$ to be defined as

$\begin{equation}{\hat{\mathrm{prox}}}_{i}={K}_{\text{project},i}{\circ}(\text{id}+\phi {\circ}{K}_{\text{intermediate},i}){\circ}{K}_{\text{lift},i},\end{equation} \tag{ 11 }$

where each of the K_project,i, K_{intermediate,i} and K_lift,i are learnable affine operators (given by a convolution operation followed by adding a bias term) and ϕ is an appropriate nonlinear function. We can appeal to proposition 2 and model ${\hat{\mathrm{prox}}}_{i}$ as translationally equivariant (we will call the corresponding reconstruction method the ordinary method in what follows) or as roto-translationally equivariant (we will call the corresponding reconstruction method the equivariant method in what follows). Figure 3 gives a schematic illustration of the inputs and outputs of the learned proximal operators.

Figure 3. Refer to the following caption and surrounding text. — **Figure 3.** A schematic illustration of a single iteration of the learned proximal gradient method, algorithm 2, for a CT reconstruction problem. The choice of E_y is described in section 5.1.1. Knowledge of the forward model is incorporated into the reconstruction through ∇E_y, which is not an equivariant operator in general. Motivated by proposition 2, we know that ${\hat{\mathrm{prox}}}_{i}$ is naturally modelled as an equivariant operator.
Download figure:
Standard image High-resolution image

**Figure 3.** A schematic illustration of a single iteration of the learned proximal gradient method, algorithm 2, for a CT reconstruction problem. The choice of E_y is described in section 5.1.1. Knowledge of the forward model is incorporated into the reconstruction through ∇E_y, which is not an equivariant operator in general. Motivated by proposition 2, we know that ${\hat{\mathrm{prox}}}_{i}$ is naturally modelled as an equivariant operator.
Download figure:
Standard image High-resolution image

Recall that we consider groups of the form G = R^d ⋊ H for subgroups H of O(d) in this work. Since we apply the learned equivariant method to reconstruct scalar-valued images, the input and output types of each ${\hat{\mathrm{prox}}}_{i}$ should correspond to features carrying the trivial representation of H. For the equivariant method, K_lift,i are equivariant convolutions from a small number (2 + the number of channels used for the memory state) of input channels with the trivial representation of H to a larger number of intermediate channels with the regular representation of H, if H is a finite group, or various irreducible representations of H, if H is a continuous group. K_{intermediate,i} are chosen as equivariant convolutions mapping the output channels of K_lift,i to a set of channels of the same type. Finally, K_project,i are chosen as equivariant convolutions that map the output channels of K_{intermediate,i} to a small number (1 + the number of channels used for the memory states) of output channels with the trivial representation of H. For the implementation of the equivariant convolutions, recall the procedure described at the end of section 3.1.

For the ordinary method, K_lift,i are ordinary convolutions mapping a small number (equal to that of the equivariant method) of input channels to a larger number of intermediate channels, K_{intermediate,i} are ordinary convolutions mapping the output channels of K_lift,i to a set of channels of the same type, and K_project,i are ordinary convolutions mapping the many output channels of K_{intermediate,i} to a small number (equal to that of the equivariant method) of output channels.

Since the implementations of the equivariant convolutions are ultimately based on ordinary convolutions, a natural comparison can be made between the equivariant and ordinary method by matching the widths of the underlying ordinary convolutions. When the methods are compared in this way, they should take comparable computational effort to use and the ordinary method is a superset of the equivariant method in the sense that the parameters of the ordinary method can be chosen to reproduce the action of the equivariant method.

Remark 2. Both in the case of algorithms 1 and 2, we require access to the gradient ∇E_y, where E_y is a data discrepancy functional. In our case, E always takes the form E_y(u) = d(A(u), y) where A is the forward operator and d is a measure of divergence. As a result of this E_y can be differentiated by the chain rule as long as we have access to the gradient of d and can compute vector-Jacobian products of A. If the forward operator A is linear, its vector-Jacobian products are just given by the action of the adjoint of A.

5. Experiments

In this section, we demonstrate that roto-translationally equivariant operations can be incorporated into a learned iterative reconstruction method such as algorithm 2 to obtain higher quality reconstructions than those obtained using comparable reconstruction methods that only use translationally equivariant operations. We consider two different inverse problems: a subsampled MRI problem and a low-dose CT problem. The code that was used to produce the experimental results shown is freely available at https://github.com/fsherry/equivariant_image_recon [54].

5.1. Datasets

5.1.1. LIDC-IDRI dataset

We use a selection of chest CT images of size 512 × 512 from the LIDC-IDRI dataset [39, 40] for our CT experiments. We use a combination of L¹ norm and the TV functional as a simple way to screen out low-quality images. The details of this procedure can be found in the code repository associated with this work. The set is split into 5000 images that can be used for training, 200 images that can be used for validation and 1000 images that can be used for testing. For the experiments using this dataset, we use the ASTRA toolbox [41–43] to simulate a parallel beam ray transform $\mathcal{R}$ with 50 uniformly spaced views at angles between 0 and π. We simulate the measurements y as post-log data in a low-dose setting:

$\begin{equation*}y=-\frac{1}{\mu \enspace }\enspace \mathrm{log}\left(\mathrm{max}\left\{\frac{n}{{N}_{\text{in}}},\eta \right\}\right),\quad \text{where}\quad n\sim \text{Pois}({N}_{\text{in}}\enspace \mathrm{exp}(-\mu \mathcal{R}(u))).\end{equation*}$

Here N_in = 10 000 is the average number of photons per detector pixel (without attenuation), μ is a base attenuation coefficient connecting the volume geometry and attenuation strength, and η is a small constant to ensure that the argument of the logarithm is strictly positive, chosen as η = 10⁻⁸ in our experiments. Figure 4 shows some examples of the ground truth images and filtered backprojection (FBP) reconstructions from the corresponding simulated measurements. In these experiments, we will define the data discrepancy functional E_y as

$\begin{equation*}{E}_{y}(u)=\frac{1}{2}{\Vert}\mathcal{R}u-y{{\Vert}}_{2}^{2}.\end{equation*}$

Figure 4. Refer to the following caption and surrounding text. — **Figure 4.** Four samples of the images that were used to train the reconstruction operators in the CT experiments, and the results of applying FBP to the corresponding simulated sinograms. The images are clipped between upper and lower attenuation coefficient limits of −1024 HU and 1023 HU.
Download figure:
Standard image High-resolution image

5.1.2. FastMRI

We use a selection of axial T1-weighted brain images of size 320 × 320 from the FastMRI dataset [44, 45] for our MRI experiments. As in section 5.1.1, we screen the images to remove as many low-quality images as possible. The set is split into 5000 images that can be used for training, 200 images that can be used for validation and 1000 images that can be used for testing. For the experiments using this dataset, we simulate the measurements using a discrete Fourier transform $\mathcal{F}$ and a variable density Cartesian line sampling pattern $\mathcal{S}$ (simulated using the software package associated with the work in [46] and shown in figure 5):

$\begin{equation*}y=\mathcal{S}\mathcal{F}u+\varepsilon ,\end{equation*}$

where ɛ is complex-valued white Gaussian noise. In this setting, a complex-valued image is modelled as a real image with two channels, one for the real part and the other for the imaginary part. The corresponding data discrepancy functional (E_y in equation (9)) will be defined as

$\begin{equation*}{E}_{y}(u)=\frac{1}{2}{\Vert}\mathcal{S}\mathcal{F}u-y{{\Vert}}_{2}^{2}.\end{equation*}$

Figure 5. Refer to the following caption and surrounding text. — **Figure 5.** The sampling mask $\mathcal{S}$ used in the MRI experiments, sampling 20.3% of k-space, and two samples of the images that were used to train the reconstruction operators in the MRI experiments, and the zero-filling reconstructions from the corresponding simulated k-space measurements.
Download figure:
Standard image High-resolution image

5.2. Experimental setup

5.2.1. Learning framework

Although it is also possible to learn the parameters of the reconstruction methods in algorithm 2 in an unsupervised learning setting, all experiments that we consider in this work can be classified as supervised learning experiments: given a finite training set ${\left\{({u}_{i},{y}_{i})\right\}}_{i=1}^{N}$ of ground truth images u_i and corresponding noisy measurements y_i, we choose the parameters of Φ in algorithm 2 by solving the empirical risk minimisation problem

$\begin{equation*}\underset{{\Phi}}{\mathrm{min}}\enspace \frac{1}{N}\sum\limits _{i=1}^{N}{\Vert}{u}_{i}-{\Phi}({y}_{i}){{\Vert}}_{2}^{2}.\end{equation*}$

5.2.2. Architectures and initialisations of the reconstruction networks

We use the reconstruction networks defined in section 4.2, referring to the architecture described there with roto-translationally equivariant components as the equivariant method and referring to the architecture with translationally equivariant components as the ordinary method. To ensure fair comparisons between the various methods that we compare, we fix as many as possible of the aspects of the methods that are orthogonal to the point investigated in the experiments. To this end, every learned proximal gradient method has a depth of it = 8 iterations. Both for the CT and MRI experiment, the images being recovered are two-dimensional, so we use equivariant convolutions with respect to groups of the form R² ⋊ Z_m. Since the equivariant convolutions are implemented using ordinary convolutions, it is natural and straightforward to compare methods with the same width. The width of each network is the same (feature vectors that transform according to the regular representation take up |H| 'ordinary' channels, and we fix the size of the product |H| ⋅ n_channels = 96 where n_channels is the number of such feature vectors in the intermediate part of ${\hat{\mathrm{prox}}}_{i}$ in equation (11)). All convolution filters used are of size 3 × 3. We choose the initial reconstruction u⁰ = 0 and use a memory variable s of five scalar channels wide in the learned proximal gradient method (algorithm 2).

Furthermore we ensure that the initialisation of both types of methods are comparable. Referring back to equation (11), we choose to initialise K_{intermediate,i} equal to zero and let K_project,i and K_lift,i be randomly initialised using the He initialisation method [47], as implemented in PyTorch [48] for ordinary convolutions and generalised to equivariant convolutions in [26] and implemented in the software package https://github.com/QUVA-Lab/e2cnn [11]. For the practical implementation of the exact methods studied, the reader is advised to consult the code at https://github.com/fsherry/equivariant_image_recon [54].

5.2.3. Hyperparameters of the equivariant methods

In addition to the usual parameters of a CNN, the learned equivariant reconstruction methods have additional parameters related to the choice of the symmetry group its representations to use. In this work, we have chosen to work with groups of the form R² ⋊ Z_m, so a choice needs to be made which m ∈ N to consider.

In figure 6, we see the result of training and validating learned equivariant reconstruction methods on the CT reconstruction problem, with various orders m of the group H = Z_m. Each of the learned methods is trained on the same training set consisting of 100 images. The violin plots used give kernel density estimates of the distributions of the performance measures; for each one, we have omitted the top and bottom 5% of values so as not to be misled by outliers. Evidently, in this case, the groups of on-grid rotations significantly outperform the other choices, with m = 4 giving the best performance. Based on this result, all further experiments with the equivariant methods will use the group H = Z₄.

Figure 6. Refer to the following caption and surrounding text. — **Figure 6.** The reconstruction quality, as measured on a validation set, of learned proximal gradient methods trained on the CT reconstruction problem with varying orders of the group H. Note that when H is chosen to represent on-grid rotations (i.e. m = 2 or m = 4), the performance is significantly better than for any of the other choices of H.
Download figure:
Standard image High-resolution image

5.2.4. Training details

For both the equivariant and ordinary reconstruction methods, we train the methods using the Adam optimisation algorithm [49] with learning rate 10⁻⁴, β₁ = 0.9, β₂ = 0.999 and ɛ = 10⁻⁸. We use minibatches of size 1 and perform a total of 10⁵ iterations of the Adam algorithm to train each method, so that we perform the same total number of iterations for each training set, regardless of its size. Since we have chosen to use the finite group approach, with intermediate fields transforming according to their regular representation, we can use a pointwise nonlinearity for both the equivariant and ordinary reconstruction methods. In all experiments, we use the leaky ReLU function as the nonlinearity (ϕ in equation (11)), applied pointwise:

$\begin{equation*}\phi (x)=\begin{cases}x\quad & \quad \text{if}\enspace x{ >}0,\\ 0.01x\quad & \quad \text{else.}\end{cases}\end{equation*}$

Each training run is performed on a computer with an Intel Xeon Gold 6140 CPU and an NVIDIA Tesla P100 GPU. Training the equivariant methods requires slightly more computational effort than the ordinary methods: to begin with, given the specification of the architecture, bases need to be computed for the equivariant convolution kernels (this takes negligible effort compared to the effort expended in training). Besides this, each training iteration requires the computation of the convolutional filter from its parameters and the basis functions and the backpropagation through this basis expansion. To give an example of the extra computational effort required, we have timed 100 training iterations for comparable equivariant and ordinary methods for the MRI reconstruction problem: this took 35.5 s for the ordinary method and 41.9 s for the equivariant method, an increase of 18%. These times correspond to a total training time of 9.9 h and 11.6 h for the equivariant and ordinary methods respectively. Note that at test time, however, the ordinary and equivariant methods can be computed with the same effort.

5.2.5. Performance measures

We evaluate the performance of the learned reconstruction methods using two performance measures:

The peak signal to noise ratio (PSNR), defined for a ground truth signal u ∈ Rⁿ and reconstruction $\hat{u}\in {\mathbf{R}}^{n}$ as
$\begin{equation*}\text{PSNR}(\hat{u},u)=10\enspace {\mathrm{log}}_{10}\left(\frac{n\enspace \underset{1{\leqslant}i{\leqslant}n}{\mathrm{max}}\enspace \vert {u}_{i}{\vert }^{2}}{{\Vert}u-\hat{u}{{\Vert}}^{2}}\right).\end{equation*}$
The structural similarity index measure (SSIM) [50], defined initially on small windows of images, $u,\hat{u}\in {\mathbf{R}}^{w{\times}w}$ (with w odd) by
$\begin{equation*}\text{SSIM}(\hat{u},u)=\frac{2\overline{\hat{u}}\overline{u}+{c}_{1}}{{\overline{\hat{u}}}^{2}+{\overline{u}}^{2}+{c}_{1}}\cdot \frac{2{s}_{\hat{u},u}+{c}_{2}}{{s}_{\hat{u}}^{2}+{s}_{u}^{2}+{c}_{2}}\end{equation*}$
for small nonnegative constants c₁, c₂. In this formula, we have used the mean and variance statistics defined by
$\begin{equation*}\overline{u}=\frac{1}{{w}^{2}}\sum\limits _{1{\leqslant}i,j{\leqslant}w}{u}_{i,j},\hspace{25.0pt}{s}_{\hat{u},u}=\frac{1}{{w}^{2}}\sum\limits _{1{\leqslant}i,j{\leqslant}w}({\hat{u}}_{i,j}-\overline{\hat{u}})({u}_{i,j}-\overline{u}),\hspace{25.0pt}{s}_{u}={s}_{u,u}.\end{equation*}$
To obtain a performance measure for larger images $u,\hat{u}\in {\mathbf{R}}^{{n}_{1}{\times}{n}_{2}}$ with n₁, n₂ ⩾ w, we compute the SSIM on each of their subwindows and average:
$\begin{equation*}\text{SSIM}(\hat{u},u)=\frac{1}{({n}_{1}+1-w)({n}_{2}+1-w)}\sum\limits _{\begin{subarray}{c}1{\leqslant}i{\leqslant}{n}_{1}+1-w\\ 1{\leqslant}j{\leqslant}{n}_{2}+1-w\end{subarray}}\text{SSIM}({[\hat{u}]}_{i,j},{[u]}_{i,j}),\end{equation*}$
where [u]_i,j is the window ${({u}_{k,l})}_{i{\leqslant}k{< }i+w,j{\leqslant}l{< }j+w}$ . We use the implementation included in scikit-image [51], with the corresponding default parameter choices.

Both the PSNR and the SSIM have the property that higher values correspond to better reconstructions. Whereas the PSNR can immediately be applied to arbitrarily shaped signals (since the various locations in the signals do not interact), the SSIM in principle requires the input images to be regularly sampled to make sense of the subwindow statistics. One way in which the SSIM can be reasonably computed on segmented data is as follows: note that the subwindow SSIMs that are needed in the computation of the full SSIM define an image, the so-called SSIM map. If the input images are first padded on each side by ⌊w/2⌋ pixels (for example by reflection padding, as is done in the scikit-image implementation), the SSIM map computed from them will be of the same size as the original input images and will be aligned with them. The ordinary SSIM is computed by taking the average of such an SSIM map, so given a segmentation mask we can compute a segmented SSIM by instead taking the average of the values of the SSIM map over points that are inside the mask.

To apply either of the performance measures to the MRI images, which are complex-valued, we compute them on the absolute value images.

5.3. CT experiment: varying the size of the training set

In this experiment, we study the effect of varying the size of the training set on the performance of the equivariant and ordinary methods. We consider a range of training set sizes, as shown in figure 7, and test the learned reconstruction methods on images that were not seen during training time, both in the same orientation and randomly rotated images. In medical applications, one tends to be particularly interested in the lung regions of the chest CT images. Although the methods have not been trained with this specifically in mind, in this section we will consider their performance on the lung regions. For this purpose, we use an automatic lung CT segmentation tool from [52] to select the regions of interest. As can be seen in figure 8, the equivariant method does a better job at reconstructing the lung regions than the ordinary method when trained on smaller training sets, but does slightly worse with larger training sets. This can be explained by the fact that the equivariant method is subsumed by the ordinary method (recall that the equivariant method can be replicated by appropriately setting the weights of the ordinary method, but the converse does not hold). The violin plots displayed have the same interpretation as those shown in figure 6 and described in section 5.2.3. We see a slight deviation from a monotonic relationship between the training set size and reconstruction quality that would usually be expected. Small random variations in the test performance can be explained by various nondeterministic aspects of the training procedure: we use random initialisations of the network weights, the learning problem is nonconvex and there is randomness in how the examples are sampled during training. From this comparison, we see that the equivariant method is able to better take advantage of smaller training sets than the ordinary method. Furthermore, we see that the performance of the equivariant method does not suffer much when testing on images in unseen orientations, whereas the performance of the ordinary method drops significantly when testing on rotated images. Figure 8 shows some examples of test reconstructions made with the methods learned on a training set of size N = 50. In these reconstructions, it can be seen that the equivariant method does better at removing streaking artefacts than the ordinary method.

Figure 7. Refer to the following caption and surrounding text. — **Figure 7.** A comparison of the performance of equivariant and ordinary learned proximal gradient methods trained on training sets of various sizes for the CT reconstruction problem. The methods are tested on images that have not been seen during training time, both in the same orientations as were observed during training ('upright test images') and rotated at random angles ('rotated test images'). The performance is evaluated on the lung regions only.
Download figure:
Standard image High-resolution image

Figure 8. Refer to the following caption and surrounding text. — **Figure 8.** A random selection of test images corresponding to the plots shown in figure 7, with a training set of size N = 50. On each reconstruction, the top number is its SSIM and the bottom number is its PSNR w.r.t. the ground truth, with both performance measures computed on the lung regions only. The images are clipped between upper and lower attenuation coefficient limits of −1024 HU and 1023 HU.
Download figure:
Standard image High-resolution image

5.4. MRI experiment: varying the size of the training set

This experiment is similar to the experiment in section 5.3, but concerns the MRI reconstruction problem. A notable difference with the CT reconstruction problem is that, as a result of the Cartesian line sampling pattern, the forward operator is now less compatible with the rotational symmetry. Regardless of this, we have seen in section 4 that it is still sensible in this context to use equivariant neural networks in a method motivated by a splitting optimisation method. As in section 5.3, we evaluate the performance of the learned methods on regions of interest: in this case we use the foreground of the images, which we isolate by thresholding the ground truth images, followed by taking the convex hull of the result. The performance differential between the equivariant and ordinary methods is more subtle than in the CT reconstruction problems. An explanation for this can be found in the fact that the MRI reconstruction problem is, in a certain sense, easier than the CT reconstruction problem: the nonzero singular values of the MRI forward operator are constant, while those of the CT forward operator decay, complicating the inversion. Remarkably, it is observed that both methods perform better on the rotated images than they do on the upright images. This is an artefact of how the rotated images are created: rotated images are generated from the upright images by performing a rotation operation which necessarily includes an interpolation step. As a result of this, some of the high frequency details disappear after rotating, resulting in an easier reconstruction problem. Appendix A goes into more detail about this effect. In figure 9, we see that the equivariant method can again take better advantage of smaller training sets and is more robust to images dissimilar to those seen in training. Figure 10 shows examples of reconstructions made with the methods learned on a training set of size N = 100.

Figure 9. Refer to the following caption and surrounding text. — **Figure 9.** A comparison of the performance of equivariant and ordinary learned proximal gradient methods trained on training sets of various sizes for the MRI reconstruction problem. The methods are tested on images that have not been seen during training time and that have been rotated at random angles. The performance is evaluated on the foreground regions only.
Download figure:
Standard image High-resolution image

Figure 10. Refer to the following caption and surrounding text. — **Figure 10.** A random selection of test images corresponding to the plots shown in figure 9, with a training set of size N = 100. On each reconstruction, the top number is its SSIM and the bottom number is its PSNR w.r.t. the ground truth, with both performance measures computed on the foreground regions only.
Download figure:
Standard image High-resolution image

6. Conclusions and discussion

In this work, we have shown that equivariant neural networks can be naturally incorporated into learnable reconstruction methods for inverse problems. Doing so requires little extra effort and results in higher quality reconstructions when compared to similar methods that use ordinary CNNs. The main difference of this approach compared to existing approaches is that we model proximal operators in a learned reconstruction method as roto-translationally equivariant rather than just translationally equivariant, as is usually the case. Building the extra symmetries into the learned reconstruction method has the effect of lowering the method's sample complexity. Using roto-translationally equivariant neural networks as opposed to ordinary CNNs results in better performance when trained on smaller training sets and more robustness to rotations.

Let us now discuss some of the limitations of the proposed approach and potential improvements to be considered in future work.

As we saw in sections 5.3 and 5.4, the equivariant method outperforms the ordinary method for small training sets, but is slightly outperformed by the ordinary method for large training sets. This is a result of the equivariant method being a subset of the ordinary method. The equivariant method can be made more expressive by using a larger number of intermediate channels, but this comes at the expense of increased computational cost.

In section 5.2.3, we saw that that the learned methods perform best when the group H is chosen to be a group of on-grid rotations. In theory, one would expect better performance with a larger number of rotations, but in practice there is the issue of how the equivariant kernels are discretised. Indeed, when solving the constraint for equivariance in equation (8), the allowed kernels turn out to be circular harmonics multiplied by an arbitrary radial profile, and in practice we discretise these functions on 3 × 3 filters. An opportunity for future work on the use of equivariant neural networks can be found in how the combination of group and discretisation should be optimised.

All of the experiments shown in this work have dealt with two-dimensional images, but the methods described here can be applied equally well to three-dimensional images, as long as the two-dimensional equivariant convolutions are replaced by their three-dimensional counterparts. The representation theory of SO(3) is more complicated than that of SO(2), but it is similarly possible to design roto-translationally equivariant convolutions in three dimensions [27]. One potential application is mentioned in remark 1: in diffusion tensor MRI, the domain is three-dimensional, with the additional challenge that the image that is to be recovered is a tensor field rather than a scalar field.

In the experiments that we demonstrated in this work, we focused on a single type of learned reconstruction operator, the learned proximal gradient method. In fact, the framework that we describe is not limited to this form of reconstruction algorithm. As an example of another type of learned reconstruction operator, consider the learned primal-dual method of [53]. A small corollary to proposition 2 is that, when J is invariant and the Fenchel conjugate J* is well-defined, ${\mathrm{prox}}_{{J}^{{\ast}}}$ will be equivariant in the same way that prox_J is. As a result, assuming reasonable invariance properties of a data discrepancy term, a learned primal-dual method can be considered where both the primal and dual proximal operators are modelled as appropriate equivariant neural networks.

Acknowledgments

Data used in the preparation of this article were obtained from the NYU fastMRI Initiative database (fastmri.med.nyu.edu) [44, 45]. As such, NYU fastMRI investigators provided data but did not participate in analysis or writing of this report. A listing of NYU fastMRI investigators, subject to updates, can be found at fastmri.med.nyu.edu. The primary goal of fastMRI is to test whether machine learning can aid in the reconstruction of medical images. The authors acknowledge the National Cancer Institute and the Foundation for the National Institutes of Health, and their critical role in the creation of the free publicly available LIDC/IDRI Database used in this study [39, 40]. MJE acknowledges support from the EPSRC Grants EP/S026045/1 and EP/T026693/1, the Faraday Institution via EP/T007745/1, and the Leverhulme Trust fellowship ECF-2019-478. CE and CBS acknowledge support from the Wellcome Innovator Award RG98755. CBS acknowledges support from the Leverhulme Trust project on 'Breaking the non-convexity barrier', the Philip Leverhulme Prize, the EPSRC Grants EP/S026045/1 and EP/T003553/1, the EPSRC Centre No. EP/N014588/1, European Union Horizon 2020 research and innovation programmes under the Marie Skłodowska-Curie Grant agreement No. 777826 NoMADS and No. 691070 CHiPS, the Cantab Capital Institute for the Mathematics of Information and the Alan Turing Institute. FS acknowledges support from the Cantab Capital Institute for the Mathematics of Information. EC and BO thank the SPIRIT project (No. 231632) under the Research Council of Norway FRIPRO funding scheme.

Data availability statement

The data that support the findings of this study are openly available at the following URL/DOI: 10.5281/zenodo.4556410.

Appendix A.: The blurring effect of image rotations

In section 5.4, we made the remarkable observation that the learned reconstruction methods perform better for the MRI problem on rotated images than on upright images similar to those on which they were trained. It was mentioned there that this is an artefact of the way in which rotated images are created. As a simple test of this explanation, consider the comparison of the performance on the unaltered upright images and the performance on upright images that have been randomly rotated and then rotated back to be upright. If the hypothesised explanation for the difference in performance is correct, we would expect the methods to perform better on the images that have been rotated and rotated back than on the unaltered images. Figure A1 shows the result of doing this comparison, confirming that the MRI problem is significantly easier to solve for the learned reconstruction methods after the images have undergone the blurring effect of the rotation operation. Figure A2 shows the same comparison repeated for the CT problem. In this case the effect is still visible, but it is considerably weaker, which explains why it was not observed in section 5.3.

Figure A1. Refer to the following caption and surrounding text. — **Figure A1.** A comparison of the performance of the learned reconstruction methods on two types of upright images for the MRI problem: the original images ('unaltered') and otherwise identical images that have been rotated and rotated back ('rotated').
Download figure:
Standard image High-resolution image

Figure A2. Refer to the following caption and surrounding text. — **Figure A2.** A comparison of the performance of the learned reconstruction methods on two types of upright images for the CT problem: the original images and otherwise identical images that have been rotated and rotated back.
Download figure:
Standard image High-resolution image

Author e-mails

Author affiliations

Author notes

ORCID iDs

Dates