The Selective $G$-Bispectrum and its Inversion: Applications to $G$-Invariant Networks
Abstract
An important problem in signal processing and deep learning is to achieve invariance to nuisance factors not relevant for the task. Since many of these factors are describable as the action of a group $G$ (e.g. rotations, translations, scalings), we want methods to be $G$-invariant. The $G$-Bispectrum extracts every characteristic of a given signal up to group action: for example, the shape of an object in an image, but not its orientation. Consequently, the $G$-Bispectrum has been incorporated into deep neural network architectures as a computational primitive for $G$-invariance, akin to a pooling mechanism but with greater selectivity and robustness. However, the computational cost of the $G$-Bispectrum ($\mathcal{O}(|G|^2)$, with $|G|$ the size of the group) has limited its widespread adoption. Here, we show that the $G$-Bispectrum computation contains redundancies that can be reduced into a selective $G$-Bispectrum with $\mathcal{O}(|G|)$ complexity. We prove desirable mathematical properties of the selective $G$-Bispectrum and demonstrate how its integration in neural networks enhances accuracy and robustness compared to traditional approaches, while enjoying considerable speedups compared to the full $G$-Bispectrum.
1 Introduction
The visual world is rich with symmetries. For example, the identity of an object is invariant to its position in the visual field; vision has translational symmetry. Group theory is the mathematics used to describe transformations, their actions on objects, and the symmetries of those objects. As such, group theory has penetrated the fields of signal processing and deep learning alike. For example, the Fourier transform, a pillar of signal processing, has been adapted to the $G$-Fourier transform, with its spectrum decomposing a signal defined over a group $G$ into several frequencies. More recently, researchers have become interested in the properties of higher-order spectra such as the Bispectrum, and its generalization to signals over groups via the $G$-Bispectrum.
$G$-Bispectrum
The $G$-Bispectrum is the Fourier transform of the $G$-Triple Correlation ($G$-TC). Historically, higher-order spectra found initial applications in the context of classical signal processing as generalizations of the two-point autocorrelation [31, 3, 24]. The work of Kakarala [14] illuminated the relevance of the $G$-Bispectrum for invariant theory, as it is the lowest-degree spectral invariant that is complete [30]. Since then, it has appeared in diverse settings such as vision science [34], machine learning [18, 19], and 3D modeling [17].
Limitations of the $G$-Bispectrum for Deep Learning
The computational complexity of the $G$-Bispectrum has severely limited the reach of its applications. The most salient example of this limitation is in machine learning and deep learning. Convolutional Neural Networks (CNNs) [21, 22] reflect and exploit the translational symmetry of the visual world. Group-Equivariant CNNs ($G$-CNNs) [7, 20] do the same for more general symmetries, such as rotational symmetries, using group-equivariant convolutions. In both cases, one typically wants to preserve transformations throughout the layers of a network (i.e., to be group-equivariant), and remove them only at the end when “canonicalizing” an image for classification (i.e., to be group-invariant). While the theory of equivariant layers has been thoroughly developed [8, 33], less attention has been paid to the theory of invariant layers [12]. This is where the $G$-Bispectrum enters the picture, and where its computational cost has strongly limited its integration into deep learning.
Commonly, invariance in $G$-CNNs is achieved by simply taking an average or maximum over the transformation group (Average or Max $G$-Pooling, respectively). However, as noted by Sanborn & Miolane [27], this is a highly lossy operation that removes information about the structure of the signal. While the max operation is indeed invariant (the max of an image is the same as the max of a rotated copy of that image), it is excessively invariant: one could permute all of the pixels in the image without changing the maximum, yet retain none of the original structure (see Figure 7). To address this, Sanborn & Miolane [27] used the $G$-TC as a $G$-invariant layer that is complete, that is, it removes group transformations with no loss of signal structure. This approach achieves demonstrable gains in accuracy and robustness [27], but it is computationally expensive.
Indeed, the space complexity of the $G$-TC, i.e., its number of coefficients, scales as $\mathcal{O}(|G|^2)$, where $|G|$ is the size of the group. As each coefficient requires $\mathcal{O}(|G|)$ operations, the computational cost, or time complexity, of the $G$-TC is $\mathcal{O}(|G|^3)$. An alternative would be to use the $G$-Bispectrum as the pooling layer. However, both its space and time complexities are $\mathcal{O}(|G|^2)$. By comparison, the Max $G$-pooling layer has an $\mathcal{O}(|G|)$ computational cost and returns a scalar output. This raises the question of whether one can achieve complete invariance and adversarial robustness without sacrificing too much in terms of computational efficiency.
Contributions. In this work, we prove for the first time that we can significantly reduce the computational complexity of the $G$-Bispectrum. This result has important implications for signal processing and deep learning on groups, for which the $G$-Bispectrum is a foundational computational primitive. Our contributions are:
- We provide a general algorithm that reduces the computational complexity of the $G$-Bispectrum from $\mathcal{O}(|G|^2)$ to $\mathcal{O}(|G|)$ in space complexity and from $\mathcal{O}(|G|^2)$ to $\mathcal{O}(|G| \log |G|)$ in time complexity if an FFT is available on $G$. We term it the selective $G$-Bispectrum. The algorithm can be applied to any finite group.
- We prove that the selective $G$-Bispectrum is complete for the most important finite groups used in practice, i.e., all discrete commutative groups, the dihedral groups of any order, and the octahedral and full octahedral groups. This significantly extends the work of [10, 14, 26], who first showed this for some finite commutative groups, where it was demonstrated that the $G$-Bispectrum can be computed with only $\mathcal{O}(|G|)$ space complexity.
- We use the selective $G$-Bispectrum to propose a new $G$-invariant layer that strikes a balance between robustness and efficiency. In particular, it is more expensive than Max $G$-pooling, but cheaper than $G$-TC pooling. It is also cheaper than the full $G$-bispectral pooling, which has $\mathcal{O}(|G|^2)$ time and space complexity. The selective $G$-Bispectrum is more robust than Max $G$-pooling, and almost as robust as the $G$-TC.
- We run extensive experiments on the MNIST [23] and EMNIST [6] datasets to evaluate how each invariance layer (Max $G$-pooling, $G$-TC, selective $G$-Bispectrum) impacts accuracy and speed on classification tasks. We achieve the expected results: our layer is faster than the $G$-TC and the full $G$-Bispectrum, and more accurate than Max $G$-pooling.
- We present several findings important to the design of invariant layers to guide further advances in the field of geometric deep learning. In particular, we show that the accuracy and speed advantages of the selective $G$-Bispectrum are most striking for $G$-CNNs with a low number of convolutional filters. Conversely, increasing the number of filters in the $G$-Convolutions allows Max $G$-Pooling to catch up in accuracy. This demonstrates that $G$-bispectral pooling will be particularly interesting for neural networks operating under a smaller parameter budget.
We hope that the proposed reduction of the $G$-Bispectrum complexity will further open areas of research in signal processing on groups that were previously out of reach due to the high complexity of the operation.
2 Background: $G$-Triple Correlation and $G$-Bispectrum
The proposed selective $G$-Bispectrum operation is closely related to two other foundational operations on signals defined on groups: the $G$-Triple Correlation and the full $G$-Bispectrum, which we introduce here. The background on group theory, including the definitions of groups, group actions, equivariance and invariance, is presented in Appendix A.
The $G$-Triple Correlation
Given a real signal $\Theta$ defined on a finite group $G$, the $G$-Triple Correlation ($G$-TC) [14] is the lowest-order polynomial invariant that is complete, i.e., that conserves all of the information of the signal $\Theta$, up to group action by $G$.
Definition 2.1.
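The $G$-Triple Correlation of a real signal $\Theta: G \to \mathbb{R}$ is given by
$$T(\Theta)_{g_1, g_2} = \sum_{g \in G} \Theta(g)\, \Theta(g g_1)\, \Theta(g g_2), \qquad g_1, g_2 \in G. \tag{1}$$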
The original triple correlation was introduced for the classical setting of translations of a one-dimensional signal. The $G$-triple correlation from Definition 2.1 extends the original definition to any finite group $G$. In our setting, the signal will be obtained after the $G$-convolution of an input image (possibly with several channels) with a filter, and the $G$-TC will be applied channel-by-channel. Importantly, the $G$-TC layer has $\mathcal{O}(|G|^3)$ computational complexity and outputs $\mathcal{O}(|G|^2)$ coefficients.
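As an illustration of the $G$-TC, the following is a minimal sketch for the cyclic group $\mathbb{Z}/n\mathbb{Z}$, where the group product is addition modulo $n$ and the group action on a signal is a cyclic shift. The function name `triple_correlation` and the test signal are assumptions made for this example, not part of the original implementation. The triple loop over group elements hidden in the `einsum` call reflects the cubic time cost, and the output has $n^2$ entries.

```python
import numpy as np

def triple_correlation(theta):
    """G-TC on Z/nZ: T[g1, g2] = sum_g theta[g] * theta[g + g1] * theta[g + g2]."""
    n = len(theta)
    g = np.arange(n)
    # shifted[a, g] = theta[(g + a) mod n]
    shifted = theta[(g[None, :] + g[:, None]) % n]
    return np.einsum("g,ag,bg->ab", theta, shifted, shifted)

rng = np.random.default_rng(0)
theta = rng.normal(size=6)                           # a real signal on Z/6Z
T = triple_correlation(theta)
T_shifted = triple_correlation(np.roll(theta, 2))    # same signal after a group action

# The G-TC is unchanged by the group action.
print(np.allclose(T, T_shifted))
```

The check only demonstrates invariance on one example; completeness is a separate property and is not tested here.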
The $G$-Bispectrum
The $G$-TC operation has a Fourier equivalent: the $G$-Bispectrum. Indeed, the definition of the Discrete Fourier Transform (DFT) can be extended to any finite group $G$ (see, e.g., [9]), as recalled below.
Definition 2.2.
The $G$-Bispectrum is defined as the $G$-Fourier transform of the $G$-TC, evaluated over the group $G \times G$. Kakarala [14] proposed a closed-form expression for the $G$-Bispectrum directly in terms of the Fourier transform of the signal. We recall it in Theorem 2.3.
Theorem 2.3.
Complete $G$-Invariants
The $G$-TC and the $G$-Bispectrum are desirable computational primitives for signal processing and deep learning because they are complete $G$-invariants (for generic data). Indeed, this completeness property makes them very interesting for building invariance layers in $G$-CNNs, as they are selectively invariant. We define complete $G$-invariance next.
Theorem 2.4.
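[15, Thm. 3.2] The $G$-TC and the $G$-Bispectrum are complete $G$-invariants, i.e., for two signals $\Theta_1, \Theta_2$ on $G$ with Fourier coefficients that are nonsingular for all irreps $\rho$, the $G$-TCs (respectively the $G$-Bispectra) of $\Theta_1$ and $\Theta_2$ coincide if and only if there exists $h \in G$ such that $\Theta_2(g) = \Theta_1(hg)$ for all $g \in G$.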
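For the cyclic group $\mathbb{Z}/n\mathbb{Z}$, all irreps are one-dimensional and the $G$-Bispectrum reduces to the classical bispectrum $\beta_{k_1, k_2} = F_{k_1} F_{k_2} \overline{F_{k_1 + k_2}}$, where $F$ is the DFT of the signal. The sketch below assumes this classical form and NumPy's DFT sign convention; the function names are hypothetical. It checks two things numerically: that the bispectrum is the two-dimensional Fourier transform of the $G$-TC, and that it is invariant to cyclic shifts.

```python
import numpy as np

def triple_correlation(theta):
    """G-TC on Z/nZ: T[g1, g2] = sum_g theta[g] * theta[g + g1] * theta[g + g2]."""
    n = len(theta)
    g = np.arange(n)
    shifted = theta[(g[None, :] + g[:, None]) % n]   # shifted[a, g] = theta[g + a]
    return np.einsum("g,ag,bg->ab", theta, shifted, shifted)

def cyclic_bispectrum(theta):
    """Classical bispectrum: B[k1, k2] = F[k1] * F[k2] * conj(F[k1 + k2])."""
    n = len(theta)
    F = np.fft.fft(theta)
    k = np.arange(n)
    return F[:, None] * F[None, :] * np.conj(F[(k[:, None] + k[None, :]) % n])

rng = np.random.default_rng(1)
theta = rng.normal(size=5)
B = cyclic_bispectrum(theta)
B_from_tc = np.fft.fft2(triple_correlation(theta))   # Fourier transform over G x G
B_shifted = cyclic_bispectrum(np.roll(theta, 3))     # bispectrum after a group action

print(np.allclose(B, B_from_tc), np.allclose(B, B_shifted))
```

The full array `B` has $n^2$ entries, which is the redundancy that a selective version aims to remove.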
Application: $G$-invariance Layers
The $G$-CNN architecture, first proposed in [7], is illustrated in Figure 1. The input signal, typically an image, is processed through a $G$-Convolution layer using a set of filters. The output is a collection of feature maps, one per filter, each a real-valued signal with domain $G$. This $G$-Convolution layer is traditionally followed by a $G$-invariance layer. The most common is the Max $G$-Pooling layer. More recent works have proposed two alternatives based on the $G$-TC and the full $G$-Bispectrum: the $G$-TC Pooling [27] and the (full) $G$-Bispectrum [28], respectively, where the latter requires computing the Fourier transforms of the feature maps, preferably using a Fast Fourier Transform (FFT) algorithm on $G$ [9]. When testing the impact of the choice of $G$-invariance layer, the output of the invariance layer is typically fed to a Secondary Neural Network (NN) to perform the desired task, e.g., image classification. The Secondary NN often takes the form of a Multi-Layer Perceptron (MLP).
Experimental results have demonstrated the superior accuracy and adversarial robustness of the $G$-CNN equipped with a $G$-TC or $G$-Bispectrum invariance layer [28, 27]. However, both methods inherit the high space and time complexity of their respective operations. This raises the question of whether we can reduce this computational complexity.
3 Method: The Selective $G$-Bispectrum and its Inversion
The Selective $G$-Bispectrum
We introduce a novel tool for signal processing on groups: the selective $G$-Bispectrum. The selective $G$-Bispectrum is a subset of all coefficients of the $G$-Bispectrum (Theorem 2.3), conserving only well-chosen pairs of irreps $(\rho_1, \rho_2)$. Which pairs of irreps to select depends on the group of interest. This is possible due to redundancies and symmetries in the full object. Below, we provide an algorithmic procedure to compute the selective $G$-Bispectrum for any finite group $G$ that features at most $|G|$ coefficients. The procedure is summarized in Algorithm 1. We have the following proposition.
Proposition 3.1.
The selective $G$-Bispectrum from Algorithm 1 has at most $|G|$ coefficients.
Proof.
Inverting the Selective $G$-Bispectrum for completeness
Inverting the selective $G$-Bispectrum means reconstructing the $G$-Fourier coefficients of a signal from the coefficients in the list, up to the action of some $g \in G$ (the $G$-Bispectrum is $G$-invariant, hence the signal can be recovered at best up to group action). Once the Fourier coefficients are known, the signal can be obtained using the inverse Fourier transform. If the selective $G$-Bispectrum can be inverted, then, by definition, it is complete in the sense of Theorem 2.4.
4 Theory: Completeness of the Selective $G$-Bispectrum
Our main theoretical claim is that the selective $G$-Bispectrum can be inverted and is a complete $G$-invariant that drastically reduces the complexity of the $G$-Bispectrum. We prove this claim for many finite groups of interest in signal processing and deep learning in a sequence of theorems presented in this section.
Known Theorems
Previous authors have studied the $G$-Bispectrum inversion problem. It is well known that $|G|$ coefficients are enough for the cyclic group $\mathbb{Z}/n\mathbb{Z}$.
Theorem 4.1.
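[16] For cyclic groups $G = \mathbb{Z}/n\mathbb{Z}$, $n \in \mathbb{N}$, the $G$-Bispectrum can be inverted using $|G|$ coefficients if the Fourier coefficient of the signal is nonzero for all irreps $\rho$ of $G$.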
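To make the cyclic case concrete, here is a hedged sketch of one possible inversion on $\mathbb{Z}/n\mathbb{Z}$ using only $n$ bispectral coefficients: $\beta_{0,0}$ and $\beta_{1,k}$ for $k = 0, \dots, n-2$. The particular choice of coefficients, the function names and the phase-fixing step are assumptions made for this illustration; they are not claimed to match Algorithm 1 or the proofs in the cited works. The sketch assumes a real signal with $n \geq 3$ and nonzero Fourier coefficients, and recovers the signal up to a cyclic shift.

```python
import numpy as np

def selective_bispectrum(theta):
    """Return n coefficients: beta_{0,0} and beta_{1,k} for k = 0, ..., n-2."""
    n = len(theta)
    F = np.fft.fft(theta)
    k = np.arange(n - 1)
    b00 = F[0] * F[0] * np.conj(F[0])
    b1k = F[1] * F[k] * np.conj(F[k + 1])
    return b00, b1k

def invert(b00, b1k):
    """Recover a real signal with these coefficients, up to a cyclic shift."""
    n = len(b1k) + 1
    F = np.zeros(n, dtype=complex)
    F[0] = np.cbrt(b00.real)                  # beta_{0,0} = F_0^3 for a real signal
    F[1] = np.sqrt((b1k[0] / F[0]).real)      # beta_{1,0} = F_0 |F_1|^2; phase of F_1 set to 0
    for k in range(1, n - 1):
        # beta_{1,k} = F_1 F_k conj(F_{k+1})  =>  F_{k+1} = conj(beta_{1,k} / (F_1 F_k))
        F[k + 1] = np.conj(b1k[k] / (F[1] * F[k]))
    # A real signal requires F_{n-1} = conj(F_1); remove the residual linear phase.
    phi = np.angle(F[n - 1] * F[1]) / n
    F = F * np.exp(-1j * phi * np.arange(n))
    return np.fft.ifft(F).real

theta = np.array([1.0, 3.0, -2.0, 0.5, 4.0, 2.0, -1.0])   # signal on Z/7Z
b00, b1k = selective_bispectrum(theta)
recovered = invert(b00, b1k)
matches_some_shift = any(
    np.allclose(recovered, np.roll(theta, s)) for s in range(len(theta))
)
print(matches_some_shift)
```

Fixing the phase of $F_1$ amounts to choosing a representative in the orbit of the signal, which is why the comparison is made against every cyclic shift of `theta`.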
Similarly for a product of two such groups, we have the following theorem.
Theorem 4.2.
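[10] For a product of cyclic groups $G = \mathbb{Z}/n\mathbb{Z} \times \mathbb{Z}/m\mathbb{Z}$, $n, m \in \mathbb{N}$, the $G$-Bispectrum can be inverted using $|G|$ coefficients if the Fourier coefficient of the signal is nonzero for all irreps $\rho$ of $G$.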
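For a product of two cyclic groups, each irrep is indexed by a pair of frequencies and the Fourier transform is the two-dimensional DFT. The following sketch, with a hypothetical helper `product_bispectrum_coeff`, evaluates a single bispectral coefficient under the assumption that it takes the classical form $F_{k_1} F_{k_2} \overline{F_{k_1 + k_2}}$ with frequency pairs added componentwise, and checks that it is invariant to two-dimensional cyclic shifts. It does not implement the selection of coefficients or the inversion.

```python
import numpy as np

def product_bispectrum_coeff(theta, k1, k2):
    """One bispectral coefficient on Z/nZ x Z/mZ; k1 and k2 are frequency pairs."""
    n, m = theta.shape
    F = np.fft.fft2(theta)
    (a1, b1), (a2, b2) = k1, k2
    return F[a1, b1] * F[a2, b2] * np.conj(F[(a1 + a2) % n, (b1 + b2) % m])

rng = np.random.default_rng(2)
theta = rng.normal(size=(4, 3))                       # real signal on Z/4Z x Z/3Z
shifted = np.roll(theta, shift=(2, 1), axis=(0, 1))   # action of the group element (2, 1)

coeff = product_bispectrum_coeff(theta, (1, 2), (3, 1))
coeff_shifted = product_bispectrum_coeff(shifted, (1, 2), (3, 1))
print(np.isclose(coeff, coeff_shifted))
```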
New Theorems
From now on, we assume that the Fourier transform only features non-zero elements, or invertible matrices in the case of non-scalar Fourier coefficients. This assumption is supported by the zero probability of encountering this corner case (an arbitrarily small perturbation of any signal makes this assumption true).
We first extend the above results to all commutative groups. The proof relies on the fact that every finite commutative group is the direct sum of finitely many cyclic groups.
Theorem 4.3.
For finite commutative groups $G$, the $G$-Bispectrum can be inverted using $|G|$ coefficients if the Fourier coefficient of the signal is nonzero for all irreps $\rho$ of $G$.
See Appendix D for the proof and derivation of the inversion for the specific case of commutative groups. We note that our approach to inversion is symbolic, in that a solution can be expressed explicitly as a formula in terms of the input. Other approaches are also possible to determine an inverse, such as using least squares [11] or more recent spectral methods [5].
We now extend the result to dihedral groups. Dihedral groups are ubiquitous in signal processing and deep learning because they represent the group of rotations and reflections.
Theorem 4.4.
For any dihedral group $D_n$ (symmetries of the $n$-gon), $n \in \mathbb{N}$, we need at most bispectral matrix coefficients for inversion if the Fourier coefficient of the signal is invertible for all irreps $\rho$ of $D_n$. This corresponds to scalar values.
The proof is provided in Appendix E. We now extend the result to the octahedral and full octahedral groups, which are related to the symmetries of the octahedron. These groups are very important in signal processing and deep learning of 3D images.
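Since the dihedral case involves matrix-valued Fourier coefficients, it can help to see the group itself concretely. The sketch below builds $D_n$ as permutations of the $n$ vertices of a polygon, generated by one rotation and one reflection, and lists the dimensions of its irreps using the standard classification (two or four one-dimensional irreps depending on the parity of $n$, the rest two-dimensional). The helper names are hypothetical, and the code does not compute any bispectral coefficient.

```python
def dihedral_group(n):
    """Generate D_n as permutations of the vertices 0, ..., n-1."""
    identity = tuple(range(n))
    r = tuple((i + 1) % n for i in range(n))   # rotation by one vertex
    s = tuple((-i) % n for i in range(n))      # reflection fixing vertex 0

    def compose(p, q):
        """Apply q first, then p."""
        return tuple(p[q[i]] for i in range(n))

    elements, frontier = {identity}, [identity]
    while frontier:
        g = frontier.pop()
        for h in (r, s):
            gh = compose(g, h)
            if gh not in elements:
                elements.add(gh)
                frontier.append(gh)
    return elements, r, s, compose

def irrep_dimensions(n):
    """Dimensions of the irreps of D_n (standard classification)."""
    one_dim = 2 if n % 2 else 4
    two_dim = (n - 1) // 2
    return [1] * one_dim + [2] * two_dim

elements, r, s, compose = dihedral_group(4)
print(len(elements))                                 # order of D_4
print(compose(r, s) != compose(s, r))                # D_4 is non-commutative
print(sum(d * d for d in irrep_dimensions(4)))       # squared dimensions sum to |G|
```

The two-dimensional irreps are the reason the Fourier coefficients, and hence the bispectral coefficients, are matrices rather than scalars for this group.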
Theorem 4.5.
For the octahedral group $O$, which has 24 group elements and 5 irreps, we need only $G$-Bispectral coefficients in the selective $G$-Bispectrum. For the full octahedral group $O_h$, which has 48 elements, we only need $G$-Bispectral coefficients in the selective $G$-Bispectrum to perform inversion.
Only a sketch of the proof is provided in Appendix F, given the redundancy of the procedure. We see that the selective $G$-Bispectrum uses only coefficients, compared to coefficients needed for the full $G$-Bispectrum for the octahedral group. For the full octahedral group, it requires only coefficients compared to the coefficients of the full $G$-Bispectrum. In Figure 3, we compare the full and selective $G$-Bispectra of the dihedral group $D_4$ (symmetries of the square) and the octahedral group $O$.
5 Experimental results
Implementation and architecture
Our implementation of the selective $G$-Bispectrum layer is based on the gtc-invariance repository, which implements the $G$-CNN with $G$-convolution and $G$-TC layer [27] and itself relies on the escnn library [4, 32].
We assess the newly proposed selective $G$-Bispectrum layer by comparing it with Avg $G$-pooling, Max $G$-pooling, and the $G$-TC as invariance operations after the $G$-convolution of a $G$-CNN, on two classification problems: the MNIST dataset of handwritten digits [23] and the EMNIST dataset of handwritten letters [6], with the standard train-test division. These datasets have 10 and 26 classes, respectively. We obtain transformed versions of the datasets ($G$-MNIST/EMNIST) by applying a random action $g \in G$ to each image in the original dataset.
The objective of our experiments is to isolate the speed-up of the $G$-Bispectrum layer. Hence, we consider architectures that differ only by the invariance layer in the classification task, following the experimental setup of [27]. The neural network architecture is composed of a $G$-convolution, a $G$-invariance layer, and finally a Multi-Layer Perceptron (MLP), itself composed of three fully connected layers with ReLU nonlinearity. Finally, a fully connected linear layer is added to perform classification. The MLP’s widths are tuned to match the number of parameters across each neural network model. The details are given in Appendix G. We highlight here that the objective is to compare the differences in performance of the $G$-invariance layers, not to provide state-of-the-art accuracy on the datasets involved. Hence, we do not optimize the architectures to reach the highest possible accuracy. We use simple architectures that provide interpretable results for analysis. The experiments are performed using cores of an NVIDIA A30 GPU.
Training speed performance
Table 1 recalls the theoretical complexities of the different layers. The computational cost of computing the selective $G$-Bispectrum is $\mathcal{O}(|G| \log |G|)$ if an FFT algorithm is available on $G$ [9], and $\mathcal{O}(|G|^2)$ with a classical DFT. In Figure 4, we report the average training times on $G$-MNIST for 10 runs as the discretization of the group varies. In the first case, we use the FFT and observe that the Max $G$-pooling and $G$-Bispectrum training times scale linearly, whereas the $G$-TC training time scales quadratically. In the second case, we perform a classical DFT, so that the $G$-Bispectrum scales worse. However, an FFT could be implemented to speed up the process.
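Pooling layer | Computational complexity | Output size |
---|---|---|
$G$-TC | $\mathcal{O}(\lvert G \rvert^3)$ | $\mathcal{O}(\lvert G \rvert^2)$ |
Full $G$-Bsp. | $\mathcal{O}(\lvert G \rvert^2)$ | $\mathcal{O}(\lvert G \rvert^2)$ |
Select. $G$-Bsp. | $\mathcal{O}(\lvert G \rvert \log \lvert G \rvert)$ (with FFT) | $\mathcal{O}(\lvert G \rvert)$ |
Max $G$-pool. | $\mathcal{O}(\lvert G \rvert)$ | $1$ |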
Classification Performance
We compare the performance of the $G$-Bispectrum layer with that of the $G$-TC, Max $G$-pooling and Avg $G$-pooling models, trained on the $G$-MNIST/EMNIST datasets, and we assess the accuracy by averaging the validation accuracy over 10 runs. The classification accuracy is provided in Table 2. For the experiments in Table 2, the following pattern holds: at an equivalent number of parameters, the more computationally expensive the pooling layer, the better the accuracy. However, the use of the $G$-TC becomes prohibitive when $|G|$ increases. In the next section, we discuss the settings where each invariance layer should be preferred, and highlight each invariance layer’s strengths and weaknesses.
Dataset | Group | Pooling | Filters | Avg acc. | Std. dev. | Param. count |
---|---|---|---|---|---|---|
MNIST | | Avg $G$-pooling | 24 | 0.74 | | 50247 |
 | | Max $G$-pooling | 24 | 0.96 | | 50247 |
 | | Select. $G$-Bsp | 24 | 0.95 | | 49116 |
 | | $G$-TC | 24 | 0.96 | | 48385 |
 | | Avg $G$-pooling | 4 | 0.60 | | 147675 |
 | | Max $G$-pooling | 4 | 0.78 | | 147675 |
 | | Select. $G$-Bsp | 4 | 0.93 | | 143029 |
 | | $G$-TC | 4 | 0.96 | | 142220 |
EMNIST | | Avg $G$-pooling | 24 | 0.40 | | 50195 |
 | | Max $G$-pooling | 24 | 0.76 | | 50195 |
 | | Select. $G$-Bsp | 24 | 0.77 | | 49254 |
 | | $G$-TC | 24 | 0.80 | | 48494 |
 | | Avg $G$-pooling | 20 | 0.38 | | 48832 |
 | | Max $G$-pooling | 20 | 0.71 | | 48832 |
 | | Select. $G$-Bsp | 20 | 0.74 | | 47320 |
 | | $G$-TC | 20 | 0.79 | | 46954 |
Discussion on the choice of invariance layer
The first observation from Table 2 is that, although the selective $G$-Bispectrum is complete, the model obtains slightly lower accuracy than with the $G$-TC. This observation might be surprising at first, since we prove mathematically in Section 4 that the selective $G$-Bispectrum is complete, just as the full version is. An explanation lies in the practical limits of the Universal Approximation Theorem [13]: the fact that an arbitrarily large MLP can theoretically fit any function does not imply that a practical, limited MLP will do so. In practice, we hypothesize that the redundancy of the $G$-TC allows the MLP to distinguish inputs more easily. If the size of the model allows it, the $G$-TC or the full $G$-Bispectrum will provide better accuracy. However, when the size of the group is large, their use is often out of reach, while the selective $G$-Bispectrum remains scalable. In Table 2, we also notice that Max $G$-pooling performs well compared to the others even though it is not complete. This is because we have many filters that allow for refined classification. Indeed, assume are black-and-white images with pixels. In consequence, for . The Max $G$-pooling allows a maximum separation of classes. In practice, this value is not reached, but it explains why Max $G$-pooling performs well. Figure 5 highlights this dependency of Max $G$-pooling on the number of filters, since the accuracy drops to less than with 2 filters. In comparison, the $G$-TC and the selective $G$-Bispectrum, which are complete, keep an accuracy above with 2 filters.
Completeness
To conclude our numerical experiments, we study the robustness of the selective $G$-Bispectrum to adversarial attacks, following the analysis in Sanborn & Miolane [27, Figure 2]. Given an image and a filter, they numerically verified the robustness (i.e., completeness) of the $G$-TC by showing that
Indeed, Sanborn & Miolane [27, Figure 2] show that only images that are identical up to rotation/reflection can yield the same $G$-TC. That is, the $G$-CNN with $G$-TC cannot be “fooled”, since only inputs in the same orbit yield the same output. It is well known that the $G$-convolution is $G$-equivariant. Hence, a strictly equivalent experiment is to show that
In Figure 6, we show that the selective $G$-Bispectrum is robust to adversarial attacks by solving
(3) |
The signals are indeed recovered up to a translation, i.e., up to a group action. Moreover, although (3) only optimizes using the selective $G$-Bispectrum, the full $G$-Bispectrum is correctly recovered.
6 Conclusion and Future works
In this paper, we introduced a new type of complete invariant layer for $G$-invariant CNNs, called the selective $G$-Bispectrum layer, with the objective of increasing the accuracy and robustness of $G$-CNNs compared to those implemented with the initially proposed Max $G$-pooling. The $G$-TC layer also achieves this goal, but at an output cost of $\mathcal{O}(|G|^2)$ coefficients that prevents its application to large groups, while the selective $G$-Bispectrum layer only outputs $\mathcal{O}(|G|)$ coefficients. Building on the result of Kakarala [14] for cyclic groups, we have shown that the completeness of the selective $G$-Bispectrum layer holds for all commutative groups, all dihedral groups, and the octahedral and full octahedral groups. In a suite of experiments, we provided a global picture of the strengths and weaknesses of each invariance layer.
References
- Andre & Street [2006] Andre, J. and Street, R. An Introduction to Tannaka Duality and Quantum Groups, volume 1438, pp. 413–492. 11 2006. ISBN 978-3-540-54706-8. doi: 10.1007/BFb0084235.
- Bhatia [1997] Bhatia, R. Matrix Analysis, volume 169. Springer, 1997. ISBN 0387948465.
- Brillinger [1991] Brillinger, D. Some history of higher-order statistics and spectra. Stat. Sin., 1:465–476, 1991.
- Cesa et al. [2022] Cesa, G., Lang, L., and Weiler, M. A program to build E(n)-equivariant steerable CNNs. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=WE4qe9xlnQw.
- Chen et al. [2018] Chen, H., Zehni, M., and Zhao, Z. A spectral method for stable bispectrum inversion with application to multireference alignment. IEEE Signal Processing Letters, 25(7):911–915, 2018.
- Cohen et al. [2017] Cohen, G., Afshar, S., Tapson, J., and van Schaik, A. EMNIST: an extension of MNIST to handwritten letters, 2017.
- Cohen & Welling [2016] Cohen, T. and Welling, M. Group equivariant convolutional networks. In Balcan, M. F. and Weinberger, K. Q. (eds.), Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp. 2990–2999, New York, New York, USA, 20–22 Jun 2016. PMLR. URL https://proceedings.mlr.press/v48/cohenc16.html.
- Cohen et al. [2021] Cohen, T. et al. Equivariant convolutional networks. PhD thesis, Taco Cohen, 2021.
- Diaconis & Rockmore [1990] Diaconis, P. and Rockmore, D. N. Efficient computation of the fourier transform on finite groups. Journal of the American Mathematical Society, 3:297–332, 1990. URL https://api.semanticscholar.org/CorpusID:120893890.
- Giannakis [1989] Giannakis, G. B. Signal reconstruction from multiple correlations: frequency-and time-domain approaches. JOSA A, 6(5):682–697, 1989.
- Haniff [1991] Haniff, C. A. Least-squares Fourier phase estimation from the modulo 2π bispectrum phase. JOSA A, 8(1):134–140, 1991.
- Higgins et al. [2018] Higgins, I., Amos, D., Pfau, D., Racaniere, S., Matthey, L., Rezende, D., and Lerchner, A. Towards a definition of disentangled representations, 2018.
- Hornik et al. [1989] Hornik, K., Stinchcombe, M., and White, H. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989. ISSN 0893-6080. doi: https://doi.org/10.1016/0893-6080(89)90020-8. URL https://www.sciencedirect.com/science/article/pii/0893608089900208.
- Kakarala [1992] Kakarala, R. Triple correlation on groups. PhD thesis, UC Irvine, 1992.
- Kakarala [2009a] Kakarala, R. Completeness of bispectrum on compact groups. 2009a. URL https://api.semanticscholar.org/CorpusID:18425284.
- Kakarala [2009b] Kakarala, R. Bispectrum on finite groups. In 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3293–3296, 2009b. doi: 10.1109/ICASSP.2009.4960328.
- Kakarala [2012] Kakarala, R. The bispectrum as a source of phase-sensitive invariants for fourier descriptors: a group-theoretic approach. Journal of Mathematical Imaging and Vision, 44:341–353, 2012.
- Kondor [2007] Kondor, R. A novel set of rotationally and translationally invariant features for images based on the non-commutative bispectrum. arXiv preprint cs/0701127, 2007.
- Kondor [2008] Kondor, R. Group theoretical methods in machine learning. PhD thesis, Columbia University, 2008.
- Kondor & Trivedi [2018] Kondor, R. and Trivedi, S. On the generalization of equivariance and convolution in neural networks to the action of compact groups. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 2747–2755, 2018.
- Lecun & Bengio [1995] Lecun, Y. and Bengio, Y. Convolutional Networks for Images, Speech and Time Series, pp. 255–258. The MIT Press, 1995.
- Lecun et al. [1998] Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. doi: 10.1109/5.726791.
- LeCun et al. [2010] LeCun, Y., Cortes, C., and Burges, C. MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
- Nikias & Mendel [1993] Nikias, C. L. and Mendel, J. M. Signal processing with higher-order spectra. IEEE Signal processing magazine, 10(3):10–37, 1993.
- Norman [2012] Norman, C. Finitely Generated Abelian Groups and Similarity of Matrices over a Field. Springer London, 2012. ISBN 9781447127307. doi: 10.1007/978-1-4471-2730-7. URL http://dx.doi.org/10.1007/978-1-4471-2730-7.
- Sadler & Giannakis [1992] Sadler, B. M. and Giannakis, G. B. Shift-and rotation-invariant object reconstruction using the bispectrum. JOSA A, 9(1):57–69, 1992.
- Sanborn & Miolane [2023] Sanborn, S. and Miolane, N. A general framework for robust G-invariance in G-equivariant networks, 2023.
- Sanborn et al. [2023] Sanborn, S., Shewmake, C., Olshausen, B., and Hillar, C. Bispectral neural networks. In International Conference on Learning Representations (ICLR), 2023.
- Steinberg [2011] Steinberg, B. Representation Theory of Finite Groups: An Introductory Approach. Universitext. Springer New York, 2011. ISBN 9781461407751. URL https://books.google.com/books?id=uwggkgEACAAJ.
- Sturmfels [2008] Sturmfels, B. Algorithms in invariant theory. Springer Science & Business Media, 2008.
- Tukey [1953] Tukey, J. The spectral representation and transformation properties of the higher moments of stationary time series. Reprinted in The Collected Works of John W. Tukey, 1:165–184, 1953.
- Weiler & Cesa [2021] Weiler, M. and Cesa, G. General E(2)-equivariant steerable CNNs, 2021.
- Weiler et al. [2023] Weiler, M., Forré, P., Verlinde, E., and Welling, M. Equivariant and Coordinate Independent Convolutional Networks. 2023. URL https://maurice-weiler.gitlab.io/cnn_book/EquivariantAndCoordinateIndependentCNNs.pdf.
- Zetzsche & Krieger [2001] Zetzsche, C. and Krieger, G. Nonlinear mechanisms and higher-order statistics in biological vision and electronic image processing: review and perspectives. Journal of Electronic Imaging, 10(1):56–99, 2001.
Appendix A Background on groups
We introduce the fundamentals of group theory, which provide the foundation for the theory of $G$-CNNs. These notions can be found in [29].
Definition A.1.
A group is a pair $(G, \cdot)$ where $G$ is a set and $\cdot : G \times G \to G$ is an associative multiplication such that there is an identity element $e \in G$ (i.e., for all $g \in G$, $e \cdot g = g \cdot e = g$) and, for all $g \in G$, there is an inverse $g^{-1} \in G$ such that $g \cdot g^{-1} = g^{-1} \cdot g = e$.
A group is thus a set combined with a product preserving the characteristics of $G$. Here, the established term “product” can be misleading. It denotes any operation which makes Definition A.1 true given the set $G$. For instance, $(\mathbb{Z}, +)$ is a group. Another example is $GL_n(\mathbb{R})$, the real invertible $n \times n$ matrices, associated with the usual matrix product. This group is said to be non-commutative since in general $AB \neq BA$ for $A, B \in GL_n(\mathbb{R})$. An important group for us is the set $\{0, 1, \dots, n-1\}$ associated with addition modulo $n$. It is usually written $\mathbb{Z}/n\mathbb{Z}$ and called the cyclic group of order $n$. A single group can arise in different contexts under seemingly distinct forms. For instance, $\mathbb{Z}/4\mathbb{Z}$ and the rotations leaving the square unchanged in $\mathbb{R}^2$ are fundamentally the same object. This observation gives rise to representation theory, a branch of group theory studying how the same abstract idea of a group can emerge under different forms.
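The group axioms of Definition A.1 can be checked exhaustively for a small example. The sketch below, with hypothetical variable names, verifies them for $\mathbb{Z}/4\mathbb{Z}$ and then checks that the rotations of the plane by multiples of 90 degrees multiply in exactly the same way, which is the sense in which the two are the same object.

```python
import itertools
import numpy as np

n = 4
elements = list(range(n))

def op(a, b):
    """Group operation of Z/nZ: addition modulo n."""
    return (a + b) % n

closed = all(op(a, b) in elements for a, b in itertools.product(elements, repeat=2))
associative = all(
    op(op(a, b), c) == op(a, op(b, c))
    for a, b, c in itertools.product(elements, repeat=3)
)
has_identity = all(op(0, a) == a == op(a, 0) for a in elements)
has_inverses = all(any(op(a, b) == 0 for b in elements) for a in elements)

# Rotation by 90 degrees as an integer matrix; its powers form the same group.
R = np.array([[0, -1], [1, 0]])

def rot(k):
    return np.linalg.matrix_power(R, k)

same_structure = all(
    np.array_equal(rot(a) @ rot(b), rot(op(a, b)))
    for a, b in itertools.product(elements, repeat=2)
)
print(closed, associative, has_identity, has_inverses, same_structure)
```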
Definition A.2.
A representation of a group $G$ is a pair $(\rho, V)$ where $V$ is a vector space and $\rho : G \to GL(V)$ is a group homomorphism, i.e., for all $g_1, g_2 \in G$, $\rho(g_1 g_2) = \rho(g_1) \rho(g_2)$. If $V$ is equipped with an inner product $\langle \cdot, \cdot \rangle$ and if for all $g \in G$ and all $u, v \in V$, $\langle \rho(g) u, \rho(g) v \rangle = \langle u, v \rangle$, $\rho$ is unitary.
Remark A.3.
Throughout this paper, we use the shorthand $G$ to refer to the group $(G, \cdot)$ and $\rho$ to refer to a representation $(\rho, V)$.
To illustrate Definition A.2, a representation of $\mathbb{Z}/n\mathbb{Z}$ is given by the complex roots of unity, $\rho(k) = e^{2\pi i k / n}$, on the complex one-dimensional vector space $\mathbb{C}$. Every group also admits the trivial representation: $\rho(g) = 1$ for all $g \in G$. There is a specific subset of these representations called the irreducible representations, irreps for short, being those that cannot be expressed in a more compact form. The irreps are fundamental objects of group theory since they allow us to define an invertible Fourier transform on finite groups. The irreps are therefore needed to define the $G$-Bispectrum, i.e., the Fourier transform of the $G$-TC. The notion of irreps is derived from that of a $G$-invariant subspace, which we recall in Definition A.4.
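The roots-of-unity example can be verified numerically. The sketch below assumes the family $\rho_j(k) = e^{2\pi i j k / n}$, $j = 0, \dots, n-1$, as the irreps of $\mathbb{Z}/n\mathbb{Z}$, and one common convention for the Fourier transform on the group, $F_j = \sum_k \Theta(k) \overline{\rho_j(k)}$, which coincides with NumPy's DFT. Both the convention and the names `rho` and `freq` are assumptions of this example.

```python
import numpy as np

n = 6

def rho(freq, k):
    """Irrep of Z/nZ with frequency index `freq`, evaluated at group element k."""
    return np.exp(2j * np.pi * freq * k / n)

# Homomorphism property: rho(a + b) = rho(a) * rho(b), for every frequency.
homomorphism = all(
    np.isclose(rho(f, (a + b) % n), rho(f, a) * rho(f, b))
    for f in range(n) for a in range(n) for b in range(n)
)
# Unitarity in dimension one: |rho(k)| = 1.
unitary = all(np.isclose(abs(rho(f, k)), 1.0) for f in range(n) for k in range(n))

# Fourier transform built from the irreps, compared with the DFT.
rng = np.random.default_rng(3)
theta = rng.normal(size=n)
F = np.array([sum(theta[k] * np.conj(rho(f, k)) for k in range(n)) for f in range(n)])
print(homomorphism, unitary, np.allclose(F, np.fft.fft(theta)))
```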
Definition A.4.
Given a representation $(\rho, V)$, a subspace $W \subseteq V$ is $G$-invariant if $\rho(g) w \in W$ for all $g \in G$ and all $w \in W$.
The formal definition of the irreps is then stated as the representations with no non-trivial invariant subspace.
Definition A.5.
A non-zero representation $(\rho, V)$ of a group $G$ is irreducible if the only $G$-invariant subspaces of $V$ are $\{0\}$ and $V$ itself.
A single group acting on different spaces will have different representations. However, one can reveal the similarity between these representations by means of an equivalence relation.
Definition A.6.
Two representations $(\rho, V)$ and $(\rho', V')$ are equivalent if there exists an isomorphism $T : V \to V'$ such that for all $g \in G$, $T \rho(g) = \rho'(g) T$.
For the interested reader, the invertibility property of the Fourier transform is a consequence of the concepts of Pontryagin duality (commutative groups) and Tannaka-Krein duality (non-commutative groups); see, e.g., [1]. Our proofs will also rely on the notion of a generating set of a group, which we introduce here.
Definition A.7.
A generating set of a group is a subset such that every can be expressed as a finite combination of the elements in and their inverses under the group operation .
Remark A.8.
It can be shown that every group of size has a generating set of size at most .
Group Actions
A group represents a set of transformations, such as rotations, that can act on data such as images. We formally define how groups transform datasets through the concept of group action.
Definition A.9.
Given a group , a group action is a function satisfying i) Identity: , ii) Compatibility: for any and and where is the identity of .
Processing operations and neural networks can be designed so that they respect group actions: specifically, a group acting on the input (e.g., rotating an input image) should yield a group action on the output (e.g., a rotation of the output feature map). This is the notion of -equivariance.
Definition A.10.
A function is -equivariant if for all and all , where and are group actions on and , respectively.
For example, the -convolution layer is -equivariant by design [7]. An important problem in signal processing and deep learning is to achieve invariance to nuisance factors not relevant for the task. Many of these factors are describable as group actions (e.g. rotations, translations, scaling). Thus, we want processing methods and machine learning models to be -invariant:
Definition A.11.
A function is -invariant if for all and all .
For example, the Max -pooling () traditionally follows a -convolutional layer to remove the equivariance of the convolution and achieve -invariance. A -CNN is a neural network that consists of -convolutional layers and a pooling/invariance operation. The main application of our proposed selective -Bispectrum operation is to act as a -invariant pooling layer that can conveniently replace the classical Max -Pooling layer of -CNNs, as shown in the rest of the paper.
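The interplay between Definitions A.10 and A.11 can be sketched numerically for the cyclic group acting on 1D signals by cyclic shifts. The function names `g_conv`, `act` and `max_g_pool` are hypothetical helpers written for this illustration, and the convolution is implemented as a circular cross-correlation, which is one common convention.

```python
import numpy as np

def g_conv(f, w):
    """Z/nZ-convolution of signal f with filter w (circular cross-correlation)."""
    n = len(f)
    return np.array(
        [sum(f[(g + h) % n] * w[h] for h in range(n)) for g in range(n)]
    )

def act(t, f):
    """Action of t in Z/nZ on a signal: cyclic shift by t positions."""
    return np.roll(f, t)

def max_g_pool(f):
    """Max G-pooling: maximum over all group elements."""
    return np.max(f)

f = np.array([1.0, 3.0, -2.0, 0.0, 5.0])
w = np.array([0.5, -1.0, 2.0, 0.0, 0.0])

for t in range(len(f)):
    # Equivariance: shifting the input shifts the feature map.
    assert np.allclose(g_conv(act(t, f), w), act(t, g_conv(f, w)))
    # Invariance: pooling over the group removes the shift.
    assert np.isclose(max_g_pool(g_conv(act(t, f), w)), max_g_pool(g_conv(f, w)))
```

The pooled value is invariant but discards most of the feature map, which is the loss of selectivity that motivates a bispectral invariant layer.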
Clebsch-Gordan matrices
Given a group and a family of unitary irreps , the Clebsch-Gordan matrix is analytically defined for each pair as:
(4) |
-Bispectrum for Commutative Finite Groups
The computation of the -Bispectrum simplifies for commutative groups compared to Theorem 2.3, as recalled below.
Theorem A.12.
[16] If is a commutative group and , the -Bispectrum can be computed as
(5) |
For commutative groups, the -bispectral coefficients are complex scalars [29].
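For the cyclic group, the scalar bispectral coefficients can be sketched with a standard FFT. The snippet assumes the usual form of the commutative bispectrum, namely the product of two Fourier coefficients with the conjugate of the coefficient at the summed frequency; the placement of the conjugate and the FFT sign convention are assumptions and may differ from the paper's notation. The function name is hypothetical.

```python
import numpy as np

def bispectrum_cyclic(f):
    """Full bispectrum of a signal on Z/nZ as an n x n complex array.

    Entry (k1, k2) is F[k1] * F[k2] * conj(F[(k1 + k2) mod n]).
    """
    n = len(f)
    F = np.fft.fft(f)
    k = np.arange(n)
    return F[:, None] * F[None, :] * np.conj(F[(k[:, None] + k[None, :]) % n])

f = np.array([1.0, 2.0, 0.0, -1.0, 3.0, 0.5, -2.0, 1.5])
B = bispectrum_cyclic(f)

# Invariance: every cyclic shift of f has the same bispectrum.
for t in range(len(f)):
    assert np.allclose(bispectrum_cyclic(np.roll(f, t)), B)
```

The array has one entry per pair of irreps, which is the quadratic cost that the selective variant reduces.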
Appendix B Indeterminacy of -Bispectrum inversion problem
It is important to state precisely which information we can possibly retrieve from the -Bispectrum. A consequence of the -invariance of is that the -Bispectrum inversion problem is ill-posed. Recall that -invariance means that for all , we have . Given a function , a possible definition for the group action on is given by for all (see, e.g., [7]). Therefore, for all , we have
which shows that the -Fourier transform is -equivariant. In consequence, recovering from can at best be done up to an unknown factor . Moreover, as explained in [14, 19], the indeterminacy is not limited to . Take for instance . An indeterminacy factor corresponds to a translation of of the signal, . [14] showed that is not restricted to : it may take any value in . The Bispectrum is not only invariant to a discrete set of rotations, but to the continuous group of rotations . The factor can thus be written where .
Appendix C Selective -Bispectrum inversion: known results
From now on, we assume that the Fourier transform only features non-zero elements, or invertible matrices in the case of non-scalar Fourier transform. This assumption is supported by the zero probability of encountering this corner case.
Cyclic groups
We start with . Recall that the irreps are given by where for (see, e.g., [29]).
Theorem C.1.
[16] For cyclic groups , the -Bispectrum can be inverted using coefficients if for all .
Proof.
The Fourier coefficient associated to the trivial representation , is uniquely determined and can be recovered from Theorem A.12 by identifying phase and modulus:
(6) |
We proceed using Pontryagin duality: the irreps , form a group themselves, with . In the case of the cyclic group , notice that for all , . Leveraging (5), we can use to recover :
(7) |
Equation (7) leaves an indeterminacy on the phase of . This corresponds to the indeterminacy factor , of Appendix B. It is inherited from the -invariance of (it is not injective, hence inputs that have the same -Bispectrum cannot be distinguished). For now, let . The key to recovering all the other Fourier coefficients is to notice that is a generating set of . Therefore, computing sequentially
(8) |
for completely recovers . We are not done yet because the phase we fixed before is not a valid shift. A valid phase shift for is such that the shift with respect to the original signal has the form for . This valid phase shift is easy to find. It is the unique such that, if we define for all , then we have (i.e., with no imaginary part). Note that this is an explicit method to recover a valid phase, whereas [16] relies on its existence without giving an explicit method to find it. The method is summarized in Algorithm 2 and illustrated in Figure 8. Consequently, only the following -bispectral coefficients are needed for completeness: and for . This makes a total of coefficients. We summarize this result in Theorem C.1. ∎
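The steps of this proof can be sketched in a few lines for a real-valued signal on the cyclic group. This is an illustration under stated assumptions, not a transcription of Algorithm 2: the bispectrum convention (conjugate on the summed frequency), the function names, and the phase-correction rule (which uses the conjugate symmetry of the DFT of a real signal) are choices made here. All Fourier coefficients are assumed non-zero.

```python
import numpy as np

def selective_bispectrum_cyclic(f):
    """Return n coefficients: beta_{0,0} and beta_{1,k} for k = 0, ..., n-2."""
    n = len(f)
    F = np.fft.fft(f)
    b00 = F[0] * F[0] * np.conj(F[0])
    b1 = np.array([F[1] * F[k] * np.conj(F[(k + 1) % n]) for k in range(n - 1)])
    return b00, b1

def invert_selective_bispectrum_cyclic(b00, b1):
    """Recover a real signal, up to a cyclic shift, from its n coefficients."""
    n = len(b1) + 1
    F = np.zeros(n, dtype=complex)
    # beta_{0,0} = |F_0|^2 F_0 gives modulus and phase of F_0.
    F[0] = np.cbrt(np.abs(b00)) * np.exp(1j * np.angle(b00))
    # beta_{1,0} = F_0 |F_1|^2 gives |F_1|; its phase is provisionally set to 0.
    F[1] = np.sqrt(np.abs(b1[0]) / np.abs(F[0]))
    # beta_{1,k} = F_1 F_k conj(F_{k+1}) gives F_{k+1} sequentially.
    for k in range(1, n - 1):
        F[k + 1] = np.conj(b1[k] / (F[1] * F[k]))
    # Phase correction (assumes a real signal): make F_{n-1} equal conj(F_1).
    psi = -np.angle(F[n - 1]) / n
    F = F * np.exp(1j * psi * np.arange(n))
    return np.fft.ifft(F)

f = np.array([1.0, 2.0, 0.0, -1.0, 3.0, 0.5, -2.0, 1.5])
rec = invert_selective_bispectrum_cyclic(*selective_bispectrum_cyclic(f))
print(np.max(np.abs(rec.imag)))  # imaginary part vanishes up to rounding
```

The recovered signal agrees with the input only up to a cyclic shift, in line with the indeterminacy discussed in Appendix B.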
Appendix D Selective -Bispectrum inversion: commutative groups
Here, we prove the theorem stated in the main text and recalled below:
Theorem D.1.
For finite commutative groups , the -Bispectrum can be inverted using coefficients if for all .
Specifically, we extend Algorithm 2 to all commutative groups, based on Theorem D.2. That is, we design a method for the direct sum of finitely many cyclic groups.
Theorem D.2.
(see, e.g., [25]) Every finite commutative group is isomorphic to a finite direct sum of cyclic groups: where and for .
For all ( is an integer-valued vector of length ), the irreps are given by . The number of irreps is . We detail and prove in Theorem D.3 the procedure to invert the -Bispectrum on commutative groups. The procedure is summarized in Algorithm 3, where we use the following two notations.
1. denotes the basis vector in such that if and otherwise.
2. } for .
The sets are recursively constructed such that . For , the sets are represented in Figure 9.
Theorem D.3.
For finite commutative groups , the -Bispectrum can be inverted using coefficients if for all .
Proof.
Notice that for commutative groups, we keep the property where for . The first step is to obtain a generating set of the irreps, which is of size . It is given by the usual basis vectors for where
By Theorem A.12, it is sufficient to have the Fourier coefficients associated to each generating element in to recover all the Fourier coefficients. Indeed, knowing and allows us to compute . By definition of the generating set , we can thus recover for all . The moduli of the coefficients for can be computed as follows:
(9) |
where is the Fourier coefficient of the trivial representation (computed as in (6)). We claim that the phase can be fixed independently for each label , thus times. This is because only one factor remains among the independent factors in :
for all . Therefore, fixing an arbitrary phase in (9) only fixes . Again, the indeterminacy factor is not restricted to but can belong to . We will resolve this issue later. For now, we set the phase of to zero. Once the Fourier coefficients are known for all the generators of the group of the irreps , it remains to combine them to obtain all the elements in the group and, consequently, all the associated Fourier coefficients.
At this point, it helps to consider the problem geometrically. Each irrep can be associated to its integer coordinate inside a hyper-rectangle in , whose edge lengths are for . We combine the coordinates to obtain all the possible integer coordinates inside the hyper-rectangle. First, we can obtain the orthogonal edges of the hyper-rectangle. For , gives , gives , etc. This is in fact the procedure of Algorithm 1. Now, we combine the edges to generate the inside of the hyper-rectangle. We proceed iteratively. For , we define and . This construction is such that and . We generate the missing Fourier coefficients by combining the ones associated to the generating set of . For , compute
(10) |
for , for all . Intuitively for , we obtain first an edge, then a face and finally the full parallelepiped. To conclude, we reproduce the procedure from Algorithm 2 to find a valid phase shift in each basis direction . The last step is then to compute, for all ,
The procedure is summarized in Algorithm 3 and illustrated in Figure 10. It shows that the bispectral coefficients needed for completeness are: , for , and all . We recover exactly one Fourier coefficient per -bispectrum coefficient. This makes a total of exactly bispectral coefficients and proves the theorem. ∎
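The "edge, then face" filling step can be sketched for a product of two cyclic groups using a 2D FFT. To keep the example short, the phase-fixing step is skipped: the Fourier coefficients at the trivial irrep and at the two generators are taken directly from the true transform, which is a simplifying assumption. The function name, the bispectrum convention and the counter `used` are choices made for this illustration.

```python
import numpy as np

def fill_fourier_from_generators(f):
    """Recover all 2D Fourier coefficients of f on Z/n1 x Z/n2.

    Starts from F at (0,0), (1,0), (0,1) and uses one bispectral
    coefficient per newly recovered Fourier coefficient.
    """
    n1, n2 = f.shape
    F = np.fft.fft2(f)

    def beta(a, b):
        # Commutative bispectrum (assumed convention): F_a F_b conj(F_{a+b}).
        c = ((a[0] + b[0]) % n1, (a[1] + b[1]) % n2)
        return F[a] * F[b] * np.conj(F[c])

    R = np.zeros_like(F)
    R[0, 0], R[1, 0], R[0, 1] = F[0, 0], F[1, 0], F[0, 1]
    used = 3  # beta_{0,0}, beta_{0,e1}, beta_{0,e2} in the full procedure
    # Edge along the first generator.
    for k in range(1, n1 - 1):
        R[k + 1, 0] = np.conj(beta((1, 0), (k, 0)) / (R[1, 0] * R[k, 0]))
        used += 1
    # Face: extend every point of the edge along the second generator.
    for k1 in range(n1):
        for k2 in range(1 if k1 == 0 else 0, n2 - 1):
            R[k1, k2 + 1] = np.conj(beta((0, 1), (k1, k2)) / (R[0, 1] * R[k1, k2]))
            used += 1
    return R, F, used

rng = np.random.default_rng(0)
R, F, used = fill_fourier_from_generators(rng.normal(size=(3, 4)))
print(np.allclose(R, F), used)
```

For a 3 by 4 grid the counter reaches 12, one coefficient per group element.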
Appendix E Selective -Bispectrum inversion: dihedral groups
Dihedral group
The dihedral group is the group of all symmetries of the -gon. Mathematically, it is defined as
(11) |
where is the rotation and is the reflection, and they form a generating set of . We will only consider the case since the cases and are commutative groups covered by the previous subsection, while gives non-commutative groups. The 2D irreps of are given by
(12) |
where for . There are also or 1D irreps if is odd or even, respectively. We denote these 1D irreps by , and (see Appendix E.1). The last two exist only for even.
Theorem E.1.
For the family of dihedral groups , we need at most bispectral matrix coefficients for inversion if for all irreps of . This corresponds to scalar values.
Proof.
In view of Appendix C, we wish to show that there is an irrep that generates all the irreps of . As in the cyclic and commutative cases, we can first deduce from . Now, we have
(13) |
The novelty for this non-commutative group is that is a matrix. After computing the eigenvalue decomposition , we can choose
(14) |
for all unitary (i.e., ) to solve (13). In the case , the indeterminacy belonged to , the continuous set of 2D rotations. For , it belongs to , the continuous set of 2D rotations and reflections. For finite groups, Kakarala [14] ensures that the only choices for such that is a Fourier transform on are those for which is identical to the original signal up to some group action . For , we can then obtain
(15) |
It is shown in Appendix E.1.1 that appears in the tensor decomposition (4) of so that the for-loop can be applied. Moreover, we know from Appendix E.1.1 that appears in the decomposition of and, for even, in . Thus the iteration recovers the complete DFT . ∎
The procedure is summarized in Algorithm 4.
E.1 The 1D irreps of
We recall the definition of the dihedral group given in (11). The 1D irreps of can be found, e.g., in [29]. They are given by:
• for all .
•
• If even,
• If even,
To exemplify the 1D irreps, we give their values for in Table 3.
1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |
1 | 1 | 1 | 1 | -1 | -1 | -1 | -1 | |
1 | -1 | 1 | -1 | 1 | -1 | 1 | -1 | |
1 | -1 | 1 | -1 | -1 | 1 | -1 | 1 |
E.1.1 Generation of the coefficients of
Theorem E.1 makes two assertions that we verify explicitly in this appendix. First, it states that is in the tensor decomposition (4) (TD) of for so that the iteration of Algorithm 4 recovers all the Fourier coefficients associated to the 2D irreps. The second assertion to verify is that is in the TD of and, for even, in the TD of so that the Fourier coefficients associated to the 1D irreps are also recovered, which establishes the validity of the inversion procedure. We provide an analytical proof of these two assertions. The proof is based on the theory of character functions.
Definition E.2.
[29] Given a group and a representation , the character of is the function . is said to be an irreducible character if is an irreducible representation.
The character function is a class function on , i.e., is constant on a conjugacy class of . The space of class functions on a finite group , written , can be equipped with an inner product such that for , we have
(16) |
The irreducible characters form an orthonormal basis w.r.t for [29], i.e.,
(17) |
Therefore, for two irreps of , is in the TD of if and only if . Let us apply this to the 2D irreps of (). Let be three irreps defined as in (12). Notice that we have . This yields
Recall that . Therefore, if we assume without loss of generality, if and only if
Therefore, based on Definition 2.3, by utilizing , , , we can compute for . By iterating, if is known, gives . Then, can be leveraged to obtain . Continuing the procedure provides for by using . We have thus obtained the Fourier coefficients associated to all the 2D irreps.
It remains to show that the iteration also recovered the Fourier coefficients associated to the 1D irreps. This is because is in the TD of and, for even, are in the TD of . Indeed,
Moreover, for even and , we have
In conclusion, the procedure of Algorithm 4 recovers all the Fourier coefficients and the selective -Bispectrum.
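The character argument can be checked numerically for a small dihedral group. The sketch assumes the standard characters of the 2D irreps (twice a cosine on rotations, zero on reflections), which follows from the usual rotation-reflection form of these irreps; the function names and the choice of the group of order 14 are illustrative.

```python
import numpy as np

def dihedral_2d_character(n, k):
    """Character of the k-th 2D irrep of D_n, listed on the n rotations
    followed by the n reflections (assumed standard form)."""
    l = np.arange(n)
    rotations = 2 * np.cos(2 * np.pi * k * l / n)
    reflections = np.zeros(n)
    return np.concatenate([rotations, reflections])

def inner(phi, psi):
    """Inner product of class functions: average of phi * conj(psi) over the group."""
    return np.sum(phi * np.conj(psi)) / len(phi)

def multiplicity(n, k1, k2, chi):
    """Multiplicity of the irrep with character chi in rho_k1 (x) rho_k2."""
    product = dihedral_2d_character(n, k1) * dihedral_2d_character(n, k2)
    return int(round(inner(product, chi).real))

n = 7  # D_7 has three 2D irreps (k = 1, 2, 3) and two 1D irreps
trivial = np.ones(2 * n)
sign = np.concatenate([np.ones(n), -np.ones(n)])

print(multiplicity(n, 1, 1, dihedral_2d_character(n, 2)))  # rho_2 in rho_1 (x) rho_1
print(multiplicity(n, 1, 2, dihedral_2d_character(n, 3)))  # rho_3 in rho_1 (x) rho_2
print(multiplicity(n, 1, 1, trivial), multiplicity(n, 1, 1, sign))
```

A multiplicity of one means the corresponding Fourier coefficient appears as a block in the decomposition and can be read off during the iteration.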
E.2 The Clebsch-Gordan matrices on
The matrix algebra properties that we use in this subsection can be found, e.g., in [2]. Recall from Theorem A.12 the (implicit) definition of the Clebsch-Gordan matrices:
(18) |
where . We only consider the case of both 2D irreps of since otherwise, the Clebsch-Gordan matrix is the scalar . First notice that , is an orthogonal matrix. Indeed, using the properties of the Kronecker product, we obtain (“” omitted for clarity):
For a real orthogonal matrix (), and with , a real Schur decomposition of , it is known that is block diagonal with blocks of size or . These blocks are themselves orthogonal matrices. Therefore, the real Schur decomposition is the decomposition in (18) up to permutations. In order for to represent exactly the irreps from (12), the non-zero sub-diagonal elements should all be positive. If not, the symmetric element is positive and a permutation must be added to exchange their positions: . permutes the two columns of associated with the permuted block of .
Appendix F Bispectrum inversion for octahedral and full octahedral groups
We provide a sketch of the procedure to retrieve given for the octahedral group and the full octahedral group. These two groups are available in the escnn library. These groups are easier to deal with than the cyclic and dihedral groups presented in the paper, given that they do not come from a family of groups. Indeed, our proofs for the cyclic (resp. dihedral) groups needed to work for all cyclic groups , and for all dihedral groups , for all . The octahedral and full octahedral groups are only two groups.
F.1 Octahedral group
10000 | 01000 | 00100 | 00010 | 00001 | |
01000 | 11110 | 01111 | 01100 | 00100 | |
00100 | 01111 | 11110 | 01100 | 01000 | |
00010 | 01100 | 01100 | 10011 | 00010 | |
00001 | 00100 | 01000 | 00010 | 10000 |
The octahedral group has 24 elements and 5 irreps. We can compute its Kronecker table, either manually using characters or using a Python package such as escnn. We give its Kronecker table below (Table 4), where each column/row represents one irrep, labelled .
We apply the procedure from Algorithms 2, 3 and 4 to Table 4. This procedure relies on the use of Theorem 2.3. We first select the bispectral coefficient (we omit “” for clarity) to get the component where is the trivial representation. Next, we choose and use to obtain (we know from Table 4 that ) up to an indeterminacy which is a transformation in and corresponds to the indeterminacy factor from Appendix B. Then, we select to get the Fourier components . Lastly, we select to get the missing Fourier component .
In summary, we only need bispectral coefficients () instead of in order to get the five Fourier components, i.e., the full Fourier transform of the signal.
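The selection of coefficients from a Kronecker table can be sketched as a greedy search. The snippet reads each entry of Table 4 as a string of multiplicities, one digit per irrep, which is an interpretation assumed here (it is consistent with the first row acting as the identity). The function names and the greedy rule are hypothetical and only meant to illustrate the reasoning.

```python
TABLE_4 = [
    ["10000", "01000", "00100", "00010", "00001"],
    ["01000", "11110", "01111", "01100", "00100"],
    ["00100", "01111", "11110", "01100", "01000"],
    ["00010", "01100", "01100", "10011", "00010"],
    ["00001", "00100", "01000", "00010", "10000"],
]

def irreps_in_product(table, i, j):
    """Indices of the irreps appearing in the tensor product of irreps i and j."""
    return {k for k, m in enumerate(table[i][j]) if m != "0"}

def select_coefficients(table, start=1):
    """Greedily pick pairs (i, j) until every irrep is reachable.

    The pair (0, 0) yields the trivial Fourier coefficient and (0, start)
    yields the coefficient of the chosen starting irrep.
    """
    selected = [(0, 0), (0, start)]
    known = {0, start}
    while len(known) < len(table):
        pairs = [(i, j) for i in sorted(known) for j in sorted(known) if i <= j]
        best = max(pairs, key=lambda p: len(irreps_in_product(table, *p) - known))
        new = irreps_in_product(table, *best) - known
        if not new:
            raise ValueError("starting irrep does not generate all irreps")
        selected.append(best)
        known |= new
    return selected

print(select_coefficients(TABLE_4))
```

On this table the search returns four pairs, matching the four selection steps described above.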
F.2 Full octahedral group
1000000000 | 0100000000 | 0010000000 | 0001000000 | 0000100000 | 0000010000 | 0000001000 | 0000000100 | 0000000010 | 0000000001 | |
0100000000 | 1111000000 | 0111100000 | 0110000000 | 0010000000 | 0000001000 | 0000011110 | 0000001111 | 0000001100 | 0000000100 | |
0010000000 | 0111100000 | 1111000000 | 0110000000 | 0100000000 | 0000000100 | 0000001111 | 0000011110 | 0000001100 | 0000001000 | |
0001000000 | 0110000000 | 0110000000 | 1001100000 | 0001000000 | 0000000010 | 0000001100 | 0000001100 | 0000010011 | 0000000010 | |
0000100000 | 0010000000 | 0100000000 | 0001000000 | 1000000000 | 0000000001 | 0000000100 | 0000001000 | 0000000010 | 0000010000 | |
0000010000 | 0000001000 | 0000000100 | 0000000010 | 0000000001 | 1000000000 | 0100000000 | 0010000000 | 0001000000 | 0000100000 | |
0000001000 | 0000011110 | 0000001111 | 0000001100 | 0000000100 | 0100000000 | 1111000000 | 0111100000 | 0110000000 | 0010000000 | |
0000000100 | 0000001111 | 0000011110 | 0000001100 | 0000001000 | 0010000000 | 0111100000 | 1111000000 | 0110000000 | 0100000000 | |
0000000010 | 0000001100 | 0000001100 | 0000010011 | 0000000010 | 0001000000 | 0110000000 | 0110000000 | 1001100000 | 0001000000 | |
0000000001 | 0000000100 | 0000001000 | 0000000010 | 0000010000 | 0000100000 | 0010000000 | 0100000000 | 0001000000 | 1000000000 |
The full octahedral group has elements and irreps. Again, we can compute its Kronecker table using a Python package such as escnn. We give its Kronecker table below (Table 5), where each column/row represents one irrep, labelled .
We apply the procedure from Algorithms 2, 3 and 4 to Table 5. Again, this procedure relies on the use of Theorem 2.3. (we omit “” for clarity) allows us to compute directly, as in Algorithm 4. Then, from , we obtain up to an unknown group action in . Then, using and , we obtain , and . Next, leveraging , and , we obtain . Using , and , we obtain , , . Finally, with , and , we obtain the last coefficient . Hence we have recovered all the Fourier coefficients using only , , , , , , thus a total of bispectral coefficients instead of .
Appendix G Training of the -CNN architecture
The -MNIST/EMNIST datasets are obtained after applying random planar rotations on each image of the datasets MNIST [23] and EMNIST [6], respectively. In the case of -MNIST/EMNIST, in addition to a planar rotation, a reflection is applied with probability . The original size of each image is conserved. The sizes of the training sets are and for MNIST and EMNIST, respectively.
We conserve the architecture of [27]. For all invariance layers, being the -TC, the selective -Bispectrum and the Avg/Max -pooling, the architecture is composed of a -convolutional block with filters (see Table 2). Then, the invariance layer is applied before feeding the output to an MLP. The MLP is composed of fully-connected layers with ReLU non-linearity. A final fully-connected linear layer is applied for classification. The vector of the output sizes of these layers is given by respectively. and is equal to the number of classes of the dataset. is tuned to reach the parameter count from Table 2.