On The Principles of Parsimony and Self-Consistency For The Emergence of Intelligence


Ma, Tsao and Shum 1

Yi Ma†‡1 , Doris Tsao†2 , Heung-Yeung Shum†3
1Electrical Engineering and Computer Science Department, University of California, Berkeley, CA 94720, USA
2Department of Molecular & Cell Biology and Howard Hughes Medical Institute, University of California, Berkeley, CA 94720, USA
3International Digital Economy Academy, Shenzhen 518045, China
† E-mail: [email protected]; [email protected]; [email protected]
arXiv:2207.04630v1 [cs.AI] 11 Jul 2022

Abstract: Ten years into the revival of deep networks and artificial intelligence, we propose a theoretical framework that sheds
light on understanding deep networks within a bigger picture of Intelligence in general. We introduce two fundamental principles,
Parsimony and Self-consistency, that we believe to be cornerstones for the emergence of Intelligence, artificial or natural. While
these two principles have rich classical roots, we argue that they can be stated anew in entirely measurable and computable
ways. More specifically, the two principles lead to an effective and efficient computational framework, compressive closed-loop
transcription, that unifies and explains the evolution of modern deep networks and many artificial intelligence practices. While
we mainly use modeling of visual data as an example, we believe the two principles will unify understanding of broad families
of autonomous intelligent systems and provide a framework for understanding the brain.

Key words: Intelligence; Parsimony; Self-Consistency; Rate Reduction; Deep Networks; Closed-Loop Transcription

1 Context and Motivation

For an autonomous intelligent agent to survive and function in a complex environment, it must efficiently and effectively learn models that reflect both its past experiences and the current environment being perceived. Such models are critical for gathering information, making decisions, and taking actions. These models, generally referred to as world models, should be continuously improved based on how well their predictions agree with new observations and outcomes. They should incorporate both knowledge from past experiences (e.g., recognizing familiar objects) and mechanisms for interpreting immediate sensory inputs (e.g., detecting and tracking moving objects). Studies in neuroscience suggest that the brain's world model is highly structured anatomically (e.g., modular brain areas and columnar organization) and functionally (e.g., sparse coding (Olshausen and Field, 1996) and subspace coding (Chang and Tsao, 2017; Bao et al., 2020)). It is believed that such a structured model is the key to the brain's efficiency and effectiveness in perceiving, predicting, and making intelligent decisions (Barlow, 1961; Josselyn and Tonegawa, 2020).

In contrast, in the past decade, progress in artificial intelligence has largely relied on training homogeneous black-box models, such as deep neural networks (LeCun et al., 2015), using a brute-force engineering approach. While functional modularity may emerge from training, the learned feature representation remains largely hidden or latent and difficult to interpret. As we now know, such expensive brute-force training of an end-to-end black-box model not only results in ever-growing model size and high data/computation cost¹, but is also accompanied by many problems in practice: the lack of richness in final learned representations due to neural collapse (Papyan et al., 2020)²; lack of stability in training due to mode collapse (Srivastava et al., 2017); lack of adaptiveness and susceptibility to catastrophic forgetting (McCloskey and Cohen, 1989); and lack of robustness to deformations (Azulay and Weiss, 2018; Engstrom et al., 2017) or adversarial attacks (Szegedy et al., 2013).

¹ With model sizes frequently going beyond billions or trillions of parameters, even Google seems to recently have started worrying about the carbon footprint of such practices (Patterson et al., 2022)!
² This refers to the final representation for each class collapsing to a one-hot vector that carries no information about the input except its class label. Richer features might be learned inside the networks, but their structures are unclear and remain largely hidden.

‡ Corresponding author

Fig. 1 Overall framework for a universal learning engine: seeking a compact and structured model for sensory data via a
compressive closed-loop transcription: a (nonlinear) mapping f that maps high-dimensional sensory data with complicated low-
dimensional structures to a compact structured representation. The model needs to be self-consistent in the sense that it can
regenerate the original data via a map g such that f cannot distinguish the regeneration from the original despite its best effort.

A principled and unifying approach? We hypothesize that one of the fundamental reasons why these problems arise in the current practice of deep networks and artificial intelligence is a lack of systematic and integrated understanding about the functional and organizational principles of intelligent systems.

For instance, training discriminative models for classification and generative models for sampling or replaying has been largely separated in practice. Such models are typically open-loop systems that need to be trained end-to-end via supervision or self-supervision. A principle long learned in control theory is that such open-loop systems cannot automatically correct errors in prediction, and are not adaptive to changes in the environment. This led to the introduction of "closed-loop feedback" to controlled systems so that a system can learn to correct its errors (Wiener, 1948; Mayr, 1970). As we will argue in this paper, a similar lesson can be drawn here: once a discriminative model and a generative model are combined together to form a complete closed-loop system, learning can become autonomous (without exterior supervision), and more efficient, stable, and adaptive.

To understand any functional component that may be necessary in an intelligent system, such as a discriminative or a generative segment, we need to understand Intelligence from a more principled and unifying perspective. To this end, in this paper we introduce two fundamental principles: Parsimony and Self-consistency, which we believe govern the function and design of any intelligent system, artificial or natural. The two principles respectively aim to answer the following two fundamental questions regarding learning:

1. What to learn: what is the objective for learning from data, and how can it be measured?

2. How to learn: how can we achieve such an objective via efficient and effective computation?

As we will see, answers to the first question fall naturally into the realm of Information/Coding theory (Shannon, 1948), which studies how to accurately quantify and measure the information of the data and then to seek the most compact representations of the information. Once the objective of learning is clear and set, answers to the second question fall naturally into the realm of Control/Game theory (Wiener, 1948), which provides a universally effective computational framework, i.e., a closed-loop feedback system, for achieving any measurable objective consistently (Figure 1).

Basic ideas behind each of the two principles proposed in this paper can find their roots in classic works. Artificial (deep) neural networks, since their earliest inception as "perceptrons" (Rosenblatt, 1958), were conceived to efficiently store and organize sensory information. Back propagation (Kelley, 1960; Rumelhart et al., 1986) was later proposed as a mechanism to learn such models. Moreover, even before the inception of neural networks, Norbert Wiener had started to contemplate computational mechanisms for learning at a system level. In his famed book Cybernetics (Wiener, 1948), he studied possible roles of information compression for parsimony and feedback/games in a learning machine for consistency. But we are here to reunite and restate the two principles within the new context of data science and machine

learning, as they help better explain and unify many modern instances and practices of artificial intelligence, deep learning in particular.³ Different from earlier efforts, our restatement of these principles will be entirely measurable and computationally tractable – hence easily realizable by machines or in nature with limited resources. The purpose of this paper is to offer our overall position and perspective rather than to justify every claim technically. Nevertheless, we will provide many references to related work where readers can find convincing theoretical and compelling empirical evidence. They are based on a coherent series of past and recent developments in the study of machine learning and data science by the authors and their students (Ma et al., 2007; Wright et al., 2008; Chan et al., 2015; Yu et al., 2020; Chan et al., 2022; Baek et al., 2022; Dai et al., 2022; Tong et al., 2022; Pai et al., 2022; Wright and Ma, 2022).

Organization. In Section 2, we use visual data modeling as a concrete example to introduce the two principles and illustrate how they can be instantiated as computable objectives, architectures, and systems. In Section 3, we conjecture that they lead to a universal learning engine for broader perception and decision-making tasks. Finally, in Section 4, we discuss many implications of the proposed principles and their connections to neuroscience, mathematics, and higher-level intelligence.

2 Two Principles for Intelligence

In this section, we introduce and explain the two fundamental principles that can help answer the questions of what to learn and how to learn for an intelligent agent or system.

2.1 What to Learn: the Principle of Parsimony

"Entities should not be multiplied unnecessarily."
– William of Ockham

The Principle of Parsimony: The objective of learning for an intelligent system is to identify low-dimensional structures in observations of the external world and reorganize them in the most compact and structured way.

There is a fundamental reason why intelligent systems need to embody this principle: intelligence would be impossible without it! If observations of the external world had no low-dimensional structures, there would be nothing worth learning or memorizing, and nothing to rely upon for good generalization or prediction, both of which require new observations to follow the same low-dimensional structures. Thus this principle is not simply a convenience arising from the need for intelligent systems to be frugal with resources such as energy, space, time, and matter.

In some contexts, the above principle is also called the Principle of Compression. But Parsimony of Intelligence is not about achieving the best possible compression, but about obtaining compact and structured representations via computationally efficient means. There is no point for an intelligent system to try to compress data to the ultimate level of Kolmogorov complexity or Shannon information: these are not only intractable to compute (or even to approximate) but also result in completely unstructured representations. For example, representing data with the minimum description length (Shannon information) requires minimizing the Helmholtz free energy, say via a Helmholtz machine (Hinton et al., 1995), which is typically computationally intractable. Examined more closely, many commonly used mathematical or statistical "measures" of model goodness are either exponentially expensive to compute for general high-dimensional models or even become ill-defined for data distributions with low-dimensional supports. These measures include widely used quantities such as maximal likelihood, KL divergence, mutual information, and Jensen-Shannon and Wasserstein distances.⁴ It is commonplace in the practice of machine learning to resort to various heuristic approximations and empirical evaluations. As a result, performance guarantees and understanding are lacking.

Now we face a question: how can an intelligent system embody the Principle of Parsimony to identify and represent structures in observations in a computationally tractable and even efficient way? In theory, an intelligent system could use any family of desirable structured models for the world, as long as they are simple yet expressive enough to model useful structures in real-world sensory data. The system should be able to accurately and efficiently evaluate how good a learned model is, and the measure used should be basic, universal, and

³ As we will see, besides integrating discriminative and generative models, they lead to a closed-loop framework that works uniformly in supervised, incremental or unsupervised settings, without suffering many of the problems of open-loop deep networks.
⁴ More explanations about caveats associated with these measures can be found in (Ma et al., 2007; Dai et al., 2022).

Fig. 2 Seeking a linear and discriminative representation: mapping high-dimensional sensory data, typically distributed on many nonlinear low-dimensional submanifolds, onto a set of independent linear subspaces of the same dimensions as the submanifolds.

tractable to compute and to optimize. What is a good choice for a family of structured models with such a measure?

To see how we can model and compute parsimony, we use the motivating and intuitive example of modeling visual data.⁵ To make our exposition easy, we will start with a supervised setting in this section. Nevertheless, as we will see in the next section, with parsimony as the only "self-supervision," together with the second principle of self-consistency, a learning system can become fully autonomous and function without need for any exterior supervision.

Modeling and computing parsimony. Let us use x to denote the input sensory data, say an image, and z its internal representation. The sensory data sample x ∈ R^D is typically rather high-dimensional (millions of pixels) but has extremely low-dimensional intrinsic structures.⁶ Without loss of generality, we may assume it is distributed on some low-dimensional submanifolds, as illustrated in Figure 2. Then the purpose of learning is to establish a (usually nonlinear) mapping f, say in some parametric family θ ∈ Θ, from x to a much lower-dimensional representation z ∈ R^d:

    x ∈ R^D  −−f(x, θ)−→  z ∈ R^d,    (1)

such that the distribution of the features z is much more compact and structured. Being compact means economical to store; being structured implies efficient to access and use: in particular, linear structures are ideal for interpolation or extrapolation.

To be more specific and precise, we can formally instantiate the principle of Parsimony for visual data modeling as trying to find a (nonlinear) transform f that achieves the following goals:

• compression: map high-dimensional sensory data x to a low-dimensional representation z;

• linearization: map each class of objects distributed on a nonlinear submanifold to a linear subspace;

• sparsification: map different classes into subspaces with independent or maximally incoherent bases.⁷

In other words, we try to transform real-world data that may lie on a family of low-dimensional submanifolds in a high-dimensional space onto a family of independent low-dimensional linear subspaces, respectively. Such a model is called a linear discriminative representation (LDR) (Yu et al., 2020; Chan et al., 2022), and the compression process is illustrated in Figure 2. In some sense, one may even view the common practice of deep learning that maps each class to a "one-hot" vector as seeking a very special type of LDR model in which each target subspace is only one-dimensional and orthogonal to the others.

The idea of compression as a guiding principle for the brain for representing (sensory data of) the world has strong roots in neuroscience, going back to Barlow's efficient coding hypothesis (Barlow, 1961). Scientific studies have shown that visual object representations in the brain exhibit compact structures such as sparse codes

⁵ It is arguably true that among all senses, vision is the most complex to model.
⁶ For example, all images of a rotating pen trace out only a one-dimensional curve in the space of millions of pixels.
⁷ This is related to the notion of sparse dictionary learning (Zhai et al., 2020) or independent component analysis (ICA) (Hyvärinen, 1997; Hyvärinen and Oja, 1997). Once the bases of the subspaces are made independent or incoherent by the transform, the resulting representation becomes sparse and thus collectively compact and structured. For example, two sets of subspaces with the same dimensions have the same intrinsic complexity, but their extrinsic representations can be very different; see Figure 3. This illustrates why simply compressing data based on their intrinsic complexity is not sufficient for parsimony.

Fig. 3 Rate of all features R = log #(green spheres + blue spheres); average rate of features on the two subspaces Rc =
log #(green spheres); rate reduction is the difference between the two rates: ∆R = R − Rc .

(Olshausen and Field, 1996) and subspaces (Chang and Tsao, 2017; Bao et al., 2020). This supports the proposal that low-dimensional linear models are the preferred representations in the brain (at least for visual data).

Maximizing rate reduction. Somewhat remarkably, for the family of LDR models, there is a natural intrinsic measure of parsimony. Intuitively speaking, given an LDR, we can compute the total "volume" spanned by all features on all subspaces and the sum of the "volumes" spanned by the features of each class. Then the ratio between these two volumes gives a natural measure of how good the LDR model is: the larger, the better.

Figure 3 shows an example with features distributed on two subspaces S1 and S2. The models on the left and right have the same intrinsic complexity. Obviously, the configuration on the left is preferred, as the features of different classes are made independent and orthogonal – their extrinsic representations would be the most sparse. Hence, in terms of this basic volumetric measure, the best representation should be such that "the whole is maximally greater than the sum of its parts."

As per information theory, the volume of a distribution can be measured by its rate distortion (Cover and Thomas, 2006). Roughly speaking, the rate distortion is the logarithm of how many ε-balls or spheres one can pack into the space spanned by a distribution.⁸ The logarithm of the number of balls directly translates into how many binary bits one needs in order to encode a random sample drawn from the distribution subject to the precision ε. This is also known more generally as the description length (Rissanen, 1989; Ma et al., 2007).

Now let R be the rate distortion of the joint distribution of all features Z ≐ [z_1, . . . , z_n] of sampled data X ≐ [x_1, . . . , x_n] from all, say k, classes, and let Rc be the average of the rate distortions for the k classes: Rc(Z) = (1/k)[R(Z_1) + · · · + R(Z_k)], where Z = Z_1 ∪ · · · ∪ Z_k. Note that, because of the logarithm, the ratio between volumes becomes the difference between rates. Then the difference between the whole and the sum of the parts, called the rate reduction (Chan et al., 2022):

    ∆R(Z) ≐ R(Z) − Rc(Z),    (2)

gives a most basic, bean-counting-like measure for how good the feature representation Z is.⁹

Although for general distributions in high-dimensional spaces the rate distortion, like many other measures mentioned before, is intractable and actually NP-hard to compute (MacDonald et al., 2019), the rate distortion for data Z drawn from a Gaussian supported on a subspace has a closed-form formula (Ma et al., 2007):

    R(Z) ≐ (1/2) log det(I + αZZ*).    (3)

Hence, it can be efficiently computed and optimized!

The work of Chan et al. (2022) has shown that if one uses the rate distortion functions of Gaussians and chooses a generic deep network (say a ResNet) to model the mapping f(x, θ), then by maximizing the coding rate reduction, known as the MCR² principle:

    max_θ ∆R(Z(θ)) = R(Z(θ)) − Rc(Z(θ)),    (4)

⁸ Sphere packing gives almost a universal way to measure the volume of a space of arbitrary shape: to compare the volumes of two containers, one only has to fill them both with beans and then count and compare.
⁹ The rate reduction quantity also has a natural interpretation as "information gain" (Quinlan, 1986). It measures how much information is gained, in terms of bits saved, by specifying a sample on one of the parts, compared to drawing a random sample from the whole.
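As a concrete illustration (our own numpy sketch, not code from the cited works), the closed-form rate (3) and the rate reduction (2) take only a few lines; the scaling α = d/(nε²) is an assumption borrowed from the rate-reduction literature, and the toy data are hypothetical:

```python
import numpy as np

def rate(Z, eps=0.1):
    """Closed-form rate (3) for features Z (d x n): 1/2 logdet(I + alpha Z Z^T).
    The scaling alpha = d / (n * eps^2) is an assumed convention."""
    d, n = Z.shape
    alpha = d / (n * eps ** 2)
    return 0.5 * np.linalg.slogdet(np.eye(d) + alpha * Z @ Z.T)[1]

def rate_reduction(Z, labels):
    """Delta R in (2): rate of the whole minus the average rate of the parts."""
    classes = sorted(set(labels))
    Rc = sum(rate(Z[:, labels == j]) for j in classes) / len(classes)
    return rate(Z) - Rc

# Toy check: two classes on orthogonal 4-dim subspaces of R^32 score a
# much higher Delta R than the same two classes collapsed onto one subspace.
rng = np.random.default_rng(0)
d, r, n = 32, 4, 200
Q = np.linalg.qr(rng.standard_normal((d, d)))[0]   # random orthonormal basis
labels = np.repeat([0, 1], n)
Z_orth = np.hstack([Q[:, :r] @ rng.standard_normal((r, n)),
                    Q[:, r:2 * r] @ rng.standard_normal((r, n))])
Z_same = np.hstack([Q[:, :r] @ rng.standard_normal((r, n)),
                    Q[:, :r] @ rng.standard_normal((r, n))])
Z_orth /= np.linalg.norm(Z_orth, axis=0)           # normalize feature scale
Z_same /= np.linalg.norm(Z_same, axis=0)
print(rate_reduction(Z_orth, labels) > rate_reduction(Z_same, labels))  # True
```

The logdet is evaluated via `slogdet` for numerical stability; the normalization of feature scale mirrors the constraint ‖z‖ = 1 discussed around (5).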

Fig. 4 A basic way to construct the nonlinear mapping f: following the local gradient flow ∂∆R(Z)/∂Z of the rate reduction ∆R, we incrementally linearize and compress features on nonlinear submanifolds and push different submanifolds apart toward their respective orthogonal subspaces (the two dotted lines).

one can effectively map a multi-class visual dataset to multiple orthogonal subspaces. Notice that maximizing the first term of the rate reduction, R, expands the volume of all features. That is, it conducts "contrastive learning" for all features simultaneously, which can be much more effective than contrasting sample pairs as normally done in conventional contrastive methods (Hadsell et al., 2006; Oord et al., 2018). Minimizing the second term Rc compresses and linearizes the features in each class. This can be interpreted as conducting "contractive learning" (Rifai et al., 2011) for each class. The rate reduction objective unifies and generalizes these heuristics.

In particular, one can rigorously show that, by maximizing the rate reduction, features of different classes will be independent and features of each class will be distributed almost uniformly within each subspace (Chan et al., 2022). In contrast, the widely practiced cross-entropy objective for mapping each class to a one-hot label maps the final features of each class onto a one-dimensional singleton (Papyan et al., 2020).

Whitebox deep networks from unrolling optimization. Notice that in this context, the role of a deep network is simply to model the nonlinear mapping f between the external data x and the internal representation z. How should an intelligent system know what family of models to use for the map f in the first place? Is there a way to directly derive and construct such a mapping, instead of guessing and trying different possibilities?

Recall that our goal is to optimize the rate reduction ∆R(Z) as a function of the set of features Z. To this end, we may directly start with the original data Z_0 = X and optimize ∆R(Z) incrementally, say with a projected gradient ascent (PGA) scheme:¹⁰

    Z_{ℓ+1} ∝ Z_ℓ + η · ∂∆R/∂Z |_{Z_ℓ}   subject to ‖Z_{ℓ+1}‖ = 1.    (5)

That is, one can follow the gradient of the rate reduction to move the features so that the rate reduction increases. Such a gradient-based iterative deformation process is illustrated in Figure 4.

From the closed-form formula for the rate distortion (3), we can compute the gradient of ∆R = R − Rc in closed form too. For example, the gradient of the first term R is of the form (similarly for the second term Rc):

    ∂R(Z)/∂Z |_{Z_ℓ} = (1/2) · ∂ log det(I + αZZ*)/∂Z |_{Z_ℓ}    (6)
                     = α(I + αZ_ℓ Z_ℓ*)⁻¹ Z_ℓ ≐ E_ℓ Z_ℓ.         (7)

Similarly, we can compute the gradients for the k terms {R(Z_i)}, i = 1, . . . , k, in Rc and obtain k operators on Z_ℓ, named C_ℓ^i. Then the above gradient ascent operation (5) takes the following structured form:

    z_{ℓ+1} ∝ z_ℓ + η · [E_ℓ z_ℓ + σ([C_ℓ^1 z_ℓ, . . . , C_ℓ^k z_ℓ])]   subject to ‖z_{ℓ+1}‖ = 1,    (8)

where E_ℓ and the C_ℓ's are linear operators that are fully determined by the covariances of the features from the previous layer Z_ℓ (7).¹¹ Here σ is a soft-max operator that assigns z_ℓ to its closest class based on its distance to each class, measured by C_ℓ z_ℓ. A diagram of all the operators per iteration is given in Figure 5, left.

¹⁰ For a fair comparison of coding rates between two representations, we need to normalize the scale of the features, say ‖z‖ = 1.
¹¹ E is associated with the gradient of the first term R and stands for "expansion" of the whole set of features, whereas the C's are associated with the gradients of the multiple rate distortions in the second term Rc and stand for "compression" of the features in each class. See Chan et al. (2022) for details.
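One PGA step can be sketched in numpy (our own didactic sketch, not the released ReduNet code): the operators E and C_j are built from the current features via (7), the exact label-indexed gradient of ∆R is applied, and the features are renormalized. ReduNet proper replaces the label indicator below with the softmax σ of (8), so the layer also runs without labels; the step size η and the α scaling are assumptions:

```python
import numpy as np

def expansion_op(Z, eps=0.1):
    """Operator in (7): alpha (I + alpha Z Z^T)^{-1}, with the assumed
    scaling alpha = d / (n * eps^2)."""
    d, n = Z.shape
    alpha = d / (n * eps ** 2)
    return alpha * np.linalg.inv(np.eye(d) + alpha * Z @ Z.T)

def pga_layer(Z, labels, eta=0.02):
    """One projected-gradient-ascent step (5) on Delta R = R - Rc.
    The true labels select which compression operator C_j acts on each
    feature; (8) replaces this indicator with a softmax sigma."""
    E = expansion_op(Z)                    # "expansion" of all features
    grad = E @ Z
    classes = sorted(set(labels))
    for j in classes:                      # "compression" of each class
        m = labels == j
        Cj = expansion_op(Z[:, m])
        grad[:, m] -= (Cj @ Z[:, m]) / len(classes)
    Z_new = Z + eta * grad
    return Z_new / np.linalg.norm(Z_new, axis=0)   # project to the sphere

def rate(Z, eps=0.1):
    d, n = Z.shape
    alpha = d / (n * eps ** 2)
    return 0.5 * np.linalg.slogdet(np.eye(d) + alpha * Z @ Z.T)[1]

def dR(Z, labels):
    classes = sorted(set(labels))
    return rate(Z) - sum(rate(Z[:, labels == j]) for j in classes) / len(classes)

# Toy data: two classes near two orthogonal 3-dim subspaces, plus noise.
rng = np.random.default_rng(1)
d, r, n = 16, 3, 100
Q = np.linalg.qr(rng.standard_normal((d, d)))[0]
labels = np.repeat([0, 1], n)
Z = np.hstack([Q[:, :r] @ rng.standard_normal((r, n)),
               Q[:, r:2 * r] @ rng.standard_normal((r, n))])
Z += 0.3 * rng.standard_normal(Z.shape)
Z /= np.linalg.norm(Z, axis=0)
Z1 = pga_layer(Z, labels)
print(dR(Z1, labels) > dR(Z, labels))   # True: one "layer" raises Delta R
```

Stacking many such steps, with E and the C_j's recomputed per step, is exactly the forward-unrolled construction described next.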


Fig. 5 Building blocks of the nonlinear mapping f . Left: one layer of the ReduNet as one iteration of projected gradient ascent,
which precisely consists of expansive or compressive linear operators, a nonlinear softmax, plus a skip connection, and normalization.
Middle and Right: one layer of ResNet and ResNeXt, respectively.

Acute readers may have recognized that such a diagram bears a good resemblance to a layer of popular "tried-and-tested" deep networks such as ResNet (He et al., 2016) (Figure 5, middle), including parallel columns as in ResNeXt (Xie et al., 2017) (Figure 5, right) and Mixture of Experts (MoE) (Shazeer et al., 2017). This gives a natural and principled interpretation for an important class of deep neural networks from the perspective of unrolling an optimization scheme. Even before the rise of modern deep networks, iterative optimization schemes for seeking sparsity, such as ISTA or FISTA (Wright and Ma, 2022), had been interpreted as learnable deep networks, e.g., the work of Gregor and LeCun (2010) on Learned ISTA.¹² The class of networks derived from optimizing rate reduction has been named ReduNet (Chan et al., 2022).

Forward unrolling versus backward propagation. We see above that compression leads to an entirely constructive way of deriving a deep neural network, including its architecture and parameters, as a fully interpretable white box: its layers conduct iterative and incremental optimization of a principled objective that promotes parsimony. As a result, for so-obtained deep networks, the ReduNets, starting from the data X as input, each layer's operators and parameters (E_ℓ, C_ℓ) are constructed and initialized in an entirely forward unrolling fashion. This is very different from the popular practice in deep learning: starting with a randomly constructed and initialized network which is then tuned globally via back propagation (Rumelhart et al., 1986). It is widely believed that the brain is unlikely to utilize back propagation as its learning mechanism, due to the requirement for symmetric synapses and the complex form of the feedback. Here, the forward unrolling optimization only relies on operations between adjacent layers that can be hard-wired; hence, it would be much easier for nature to realize and exploit.

In addition, the parameters and operators of the so-constructed networks are amenable to further fine-tuning via another level of optimization, e.g., (stochastic) gradient descent realized by back propagation (Rumelhart et al., 1986).¹³ But one should not confuse the (stochastic) gradient descent used to fine-tune a network with the gradient-based optimization that the layers of the network are meant to realize.

Shift-invariance and nonlinearity. If we further wish the learned encoding f to be invariant (or equivariant) to all time shifts or space translations, we view every sample x(t), together with all its shifted versions {x(t − τ) ∀τ}, as belonging to the same equivalence class. If we compress and linearize them altogether into the same subspace, then all the linear operators E or C's in the above gradient operation (8) automatically become multi-channel convolutions (Chan et al., 2022)! As a result, the ReduNet naturally becomes a multi-channel convolutional neural network (CNN), of the kind originally proposed for shift-invariant recognition (Fukushima, 1980; LeCun et al., 1998).¹⁴

¹² Similarly, unfolding iterative optimization for sequential sparse recovery leads to recurrent networks (Wisdom et al., 2017).
¹³ It has been shown that the ReduNets have the same model capacity (say, to interpolate all training data precisely) as tried-and-tested deep networks such as ResNets (Chan et al., 2022).
¹⁴ In addition, due to special structures in such convolution operators E and C's, they are much more efficient to compute in the frequency domain than in the time/space domain: the computational complexity reduces from O(D³) to O(D) in the dimension D of the input signals (Chan et al., 2022).
networks that has taken place in the past decade. In particular, it helps explain why only a few have emerged on top through a process of artificial selection: going from general MLPs to CNNs to ResNets to Transformers. In comparison, random search of network structures, such as Neural Architecture Search (Zoph and Le, 2017; Baker et al., 2017) and AutoML (Hutter et al., 2019), has not resulted in any network architectures that are effective for general tasks. We speculate that the successful architectures are simply getting more and more effective and flexible at emulating iterative optimization schemes for data compression. Besides the aforementioned similarity between ReduNet and ResNet/ResNeXt, we here discuss a few more examples.

Notice that the gradient of a rate distortion term R(Z) is of the form (7): ∂R/∂Z = α(I + α Z_ℓ Z_ℓ*)⁻¹ Z_ℓ. Instead of viewing the matrix α(I + α Z_ℓ Z_ℓ*)⁻¹ as a linear operator E_ℓ acting on Z_ℓ, as was done in the ReduNet, we may rewrite the whole gradient term approximately as:

α(I + α Z_ℓ Z_ℓ*)⁻¹ Z_ℓ ≈ α(I − α Z_ℓ Z_ℓ*) Z_ℓ = α[Z_ℓ − α Z_ℓ (Z_ℓ* Z_ℓ)]. (9)

That is, the gradient operation for optimizing a rate distortion term depends mostly on the auto-correlation of the features, A = Z_ℓ* Z_ℓ, from the previous iteration. This is also known as "self-attention" or "self-expression" in some contexts (Vaswani et al., 2017; Vidal, 2022). If we consider applying an additional learnable linear transform U to each of the feature terms in the above expression (9) for the gradient, a gradient-based iteration to optimize rate distortion takes the general form:

Z_{ℓ+1} = Z_ℓ + U_o[Z_ℓ − α U_v Z_ℓ (U_k Z_ℓ)* (U_q Z_ℓ)]. (10)

This is of exactly the same form as the basic operation of each layer of a Transformer (Vaswani et al., 2017), also known as a self-attention (SA) head.

Moreover, very similar to ResNeXt versus ResNet, for tasks such as image classification, it is found empirically better to use not one but multiple, say k, SA heads in parallel in each layer (Dosovitskiy et al., 2021). In the context of rate reduction, these SA heads may be naturally interpreted as gradient terms associated with the multiple rate distortion terms in the rate reduction ∆R(Z) = R(Z) − [R(Z1) + ··· + R(Zk)]/k. The learned linear transforms in each SA head can be interpreted as "matched filters" or "sparsifying dictionaries"16 that select and transform token sets (on submanifolds) that belong to the same category (of signals or images). Hence, we conjecture that Transformers are emulating a more general family of gradient-based iterative schemes that optimize rate reduction of all the input token sets (on multiple submanifolds) by clustering, compressing, and linearizing them altogether.

Furthermore, gradient ascent or descent is merely the most basic type of optimization scheme. Networks based on unrolling such schemes (e.g., ReduNet) might not be the most efficient. One could anticipate that more advanced optimization schemes, such as accelerated gradient descent methods (Wright and Ma, 2022), could lead to more efficient deep network architectures in the future. Architecture-wise, these accelerated methods require the introduction of skip connections across multiple layers. This may help explain, from an optimization perspective, why additional skip connections have often been found to improve network efficiency in practice, e.g., in highway networks (Srivastava et al., 2015) or dense networks (Huang et al., 2017).

2.2 How to Learn: the Principle of Self-Consistency

"Everything should be made as simple as possible, but not any simpler." – Albert Einstein

The principle of Parsimony alone does not ensure a learned model will capture all important information in the data sensed about the external world. For example, mapping each class to a one-dimensional "one-hot" vector, via minimizing the cross entropy, may be viewed as a form of being parsimonious. It may learn a good classifier, but the features learned would collapse to a singleton, known as neural collapse (Papyan et al., 2020). The so learned features would no longer contain enough information to regenerate the original data. Even if we consider the more general class of LDR models, the rate reduction objective alone does not automatically determine the correct dimension of the ambient feature space. If the feature space dimension is too low, the model learned will under-fit the data; if too high, the model might over-fit.17

More generally, we take the view that perception is distinct from performance of specific tasks, and the goal of perception is to learn everything predictable about what is sensed. By this we mean the intelligent system should

16 Interested readers may see (Zhai et al., 2020) for more details about the topic of sparse dictionary learning.
17 The first expansive or contrastive term in the rate reduction might over-expand the features to fill the space, due to noises or other variations.
be able to regenerate the distribution of the observed data from the compressed representation, to a point that it itself cannot distinguish internally despite its best effort. This view distinguishes our framework from existing frameworks that are customized to a specific class of tasks, such as the information bottleneck for classification (recognition) (Tishby and Zaslavsky, 2015).18 To govern the process of learning a fully faithful representation, we introduce a second principle:

The Principle of Self-consistency: An autonomous intelligent system seeks a most self-consistent model for observations of the external world by minimizing the internal discrepancy between the observed and the regenerated.

The two principles of Self-consistency and Parsimony are highly complementary and should always be used together. The principle of Self-consistency alone does not ensure any gain in compression or efficiency. Mathematically and computationally, it is easy and even trivial to fit any training data with over-parameterized models19 or to ensure consistency by establishing one-to-one mappings between domains with the same dimensions without learning intrinsic structures in the data distribution.20 Only through compression can an intelligent system be compelled to discover intrinsic low-dimensional structures within the high-dimensional sensory data, and transform and represent them in the feature space in the most compact way for future use. Also, only through compression can we easily understand why over-parameterization, say by feature lifting with hundreds of channels as normally done in DNNs, will not lead to over-fitting if its sheer purpose is to compress in the higher-dimensional feature space: lifting helps reduce the nonlinearity in the data21, hence rendering it easier to compress and linearize.22 The role of subsequent layers is to perform compression (and linearization), and in general the more layers, the better compressed.23

So far we have established that we need a mechanism to determine if the compressed representation contains all the information that is sensed. In the remainder of this section, we will first introduce a general architecture for achieving this, a generative model, which can regenerate a sample from its compressed representation. A difficult problem then arises: how to sensibly measure the discrepancy between the sensed sample and the regenerated sample? We argue that for an autonomous system there is one and only one solution to this, namely, measuring their discrepancy in the internal feature space. Finally, we argue that the compressive encoder and the generator must learn together through a zero-sum game. Through these deductions, we derive a universal framework for learning that we believe is inevitable.

Auto-encoding and its caveats with computability. To ensure the learned feature mapping f and representation z have correctly captured low-dimensional structures in the data, one may check if the compressed feature z can reproduce the original data x, by some generating map g, parameterized by η:

x ∈ R^D --(f(x,θ))--> z ∈ R^d --(g(z,η))--> x̂ ∈ R^D, (11)

in the sense that x̂ = g(z, η) is close to x (according to a certain measure). This process is generally known as auto-encoding (Kramer, 1991; Hinton and Zemel, 1993). In the special case of compressing to a structured representation such as LDR, we call such an auto-encoding a transcription24 (Dai et al., 2022). However, this goal is easier said than done. The main difficulty lies in how to make this goal computationally tractable, hence physically realizable. More precisely, what is a principled measure for the difference between the distribution of x and that of x̂ that is both mathematically well-defined and efficiently computable? As we have mentioned before, when dealing with distributions in high-dimensional spaces with degenerate low-dimensional supports, which is almost always the case with real-world data (Ma et al., 2007; Vidal et al., 2016), conventional measures such as

18 Although in this section, for simplicity, we focus our discussions on modeling 2D imagery data, we will discuss perception of the 3D world in Section 3.1, as well as argue why perception needs to integrate recognition, reconstruction, and regeneration.
19 Having a photographic memory is not intelligence. It is the same with fitting all data in the world with a Big Model.
20 That is the case with many popular methods for learning generative models of data such as normalizing flows (Kobyzev et al., 2021), cycle GAN (Zhu et al., 2017), and diffusion probabilistic models (Ho et al., 2020), etc. Although so learned models might be useful for applications such as image generation or style transfer, they do not identify low-dimensional structures in the data distributions nor produce compact linear structures in the learned representations.
21 Say, as in the scattering transforms (Bruna and Mallat, 2013) or random filters (Chan et al., 2015, 2022).
22 As Lao Tzu famously said in Tao Te Ching: "That which shrinks must first expand."
23 This naturally explains a seeming mystery about deep networks: the "double-descent" phenomenon suggests a deep model's test error becomes smaller as it gets larger, after reaching its peak at a certain interpolation point (Belkin et al., 2019; Yang et al., 2020).
24 This is analogous to the memory-forming transcription process of Engram (Josselyn and Tonegawa, 2020) or the transcription process between functional proteins and DNA (genes).
Fig. 6 A compressive closed-loop transcription of nonlinear data submanifolds to an LDR, by comparing and minimizing the difference in z and ẑ internally. This leads to a natural pursuit-evasion game between the encoder/sensor f and the decoder/controller g, allowing the distribution of the decoded x̂ (the blue dotted curves) to chase and match that of the observed data x (the black solid curves).
the KL divergence, mutual information, Jensen-Shannon distance, Helmholtz free energy, and Wasserstein distances can be either ill-defined or intractable to compute, even for Gaussians (with support on subspaces) and their mixtures25. How can we resolve this fundamental and yet often unacknowledged difficulty in computability associated with comparing degenerate distributions in high-dimensional spaces?

Closed-loop data transcription for self-consistency. As we have seen in the previous section, the rate reduction ∆R gives a well-defined, principled distance measure between degenerate distributions. But it is computable (in closed form) only for mixtures of subspaces or Gaussians, not for general distributions! Yet we can only expect the distribution of the internal structured representation z to be a mixture of subspaces or Gaussians, not the original data x.

This leads to a rather profound question regarding learning a "self-consistent" representation: to verify the correctness of an internal model for the external world, does an autonomous agent really need to measure any discrepancy in the data space? The answer is actually no. The key is to realize that, to compare x and x̂, the agent only needs to compare their respective internal features z = f(x) and ẑ = f(x̂) via the same mapping f that intends to make z compact and structured:

x --(f(x,θ))--> z --(g(z,η))--> x̂ --(f(x,θ))--> ẑ. (12)

25 Many existing methods formulate their objectives based on these quantities. As a result, these methods typically rely on expensive brute-force sampling to approximate these quantities, or optimize their approximated lower bounds or surrogates, such as in variational auto-encoding (VAE) (Kingma and Welling, 2013). Fundamental limitations of these methods are often disguised by good empirical results obtained with clever heuristics and excessive computational resources.
26 Imagine someone colorblind: it is unlikely that his/her internal representation of the world requires minimizing discrepancy in the RGB values of the visual inputs x.

Measuring distribution difference in z space is in fact well-defined and efficient: it is arguably true that in the case of natural intelligence, learning to measure discrepancy internally is the only thing that the brain of a self-contained autonomous agent can do.26

This effectively leads to a "closed-loop" feedback system, and the overall process is illustrated as the diagram in Figure 6. The encoder f now plays an additional role as a discriminator that detects any discrepancy between x and x̂ through the difference between their internal features z and ẑ. The distance between the distribution of z and that of ẑ can be measured through the rate reduction (2) of their samples Z and Ẑ:

∆R(Z(θ), Ẑ(θ, η)) := R(Z ∪ Ẑ) − (1/2)[R(Z) + R(Ẑ)].

One can interpret popular practices for learning either a DNN classifier f or a generator g alone as learning an open-ended segment of the closed-loop system (Figure 6). This currently popular practice is very similar to open-loop control, which has long been known in the control community to be problematic and costly: training such a segment requires supervision on the desired output (e.g., class labels); and deployment of such an open-loop system is inherently not stable, robust, or adaptive if the data distributions, system parameters, or tasks change. For example, deep classification networks trained in supervised settings often suffer catastrophic forgetting if retrained for new tasks with new classes of data (McCloskey and Cohen, 1989). In contrast, closed-loop systems are inherently more stable and adaptive (Wiener, 1948). In fact, it has been suggested by Hinton et al. (1995) that the discriminative and generative seg-
ments need to be combined as the "wake" and the "sleep" phases, respectively, of a complete learning process.

Self-learning through a self-critiquing game. However, just closing the loop is not enough. It is tempting to think that now we only need to optimize the generator g so as to minimize the difference between z and ẑ,27 say in terms of the rate reduction measure:

min_η ∆R(Z(θ), Ẑ(θ, η)). (13)

Note that ∆R(Z, Ẑ) = 0 if Ẑ = f(g(Z)) = Z. That is, the optimal learned features Z should be a "fixed point" of the encoding-decoding loop28. But the encoder f performs significant dimension reduction and compression, so Ẑ = Z does not necessarily imply X̂ = X. To see this, consider the simplest case when X is already on a linear subspace (say of dimension k) and f and g are a linear projection and lifting, respectively (Pai et al., 2022). f will not be able to detect any difference in its (large) null space: X and any X̂ = X + null(f) have the same image under f.

How can Ẑ = Z imply X̂ = X then? In other words, how can satisfaction of the self-consistency criterion in the internal space guarantee that we have learned to faithfully regenerate the observed data? This is possible only when the dimension k is low enough and f can be further adjusted. Let us assume the dimension of X is k < d/2, where d is the dimension of the feature space. Then the image X̂ = g(f(X)) under a linear lifting g is a subspace of dimension at most k. The union of the two subspaces X and X̂ is of dimension at most 2k < d. Hence, if there is a difference between these two subspaces and f can be an arbitrary projection, we have f(X) ≠ f(X̂); i.e., X ≠ X̂ implies Z ≠ Ẑ.

Hence, after g minimizes the error ∆R in (13), f needs to actively adjust and detect, in its full capacity, if there is remaining discrepancy between X and X̂, say by maximizing the same measure ∆R. The process can repeat between the encoder f and the decoder g and results in a natural pursuit and evasion game, as illustrated in Figure 6.

27 This is very similar in spirit to the "sleep" phase of the wake-sleep scheme proposed by Hinton et al. (1995): it essentially tries to ensure that the encoding (recognition) network f produces a response ẑ to the regenerated x̂ = g(z) consistent with its origin z.
28 This can be viewed as a generalization of the "deep equilibrium models" (Bai et al., 2019) or the "implicit deep learning" models (El Ghaoui et al., 2021). Both interpret deep learning as conducting fixed-point computation from a feedback control perspective.

In the 1961 edition of his book Cybernetics, Wiener (1961) had added a supplementary chapter discussing learning through playing games. The games he described were mostly about an intelligent agent against an opponent or the world (which we will discuss in the next section). Here we advocate the need of an internal game-like mechanism for any intelligent agent to be able to conduct self-learning via self-critique! What abides is the notion of games as a universally effective way for learning: applying the current model or strategy repeatedly against an adversarial critique, hence continuously improving the model or strategy based on feedback received through a closed loop!

Within such a framework, the encoder f assumes a dual role: in addition to learning a representation z for the data x via maximizing the rate reduction ∆R(Z) (as done in Section 2.1), it should also serve as a feedback "sensor" that actively detects discrepancy between the data x and the generated x̂. The decoder g also assumes a dual role: it is a "controller" that corrects any discrepancy between x and x̂ detected by f, as well as a decoder that tries to minimize the overall coding rate needed to achieve this goal (subject to a given precision).

As a result, the optimal "parsimonious" and "self-consistent" representation tuple (z, f, g) can be interpreted as the equilibrium point of a zero-sum game between f(θ) and g(η) over a combined rate-reduction-based utility:

max_θ min_η ∆R(Z(θ)) + ∆R(Z(θ), Ẑ(θ, η)). (14)

Recent analysis has rigorously shown that, in the case when the input data X lie on multiple linear subspaces, the desired optimal representation for Z is indeed the Stackelberg equilibrium (Fiez et al., 2019; Jin et al., 2019) of a sequential maximin game over a rate reduction objective similar to the above (Pai et al., 2022). It remains an open problem for the case when X are on multiple nonlinear submanifolds. Nevertheless, compelling empirical evidence indicates that solving this game indeed gives very good auto-encoding for real-world visual datasets (Dai et al., 2022), and automatically determines a subspace with a proper dimension for each class. It does not seem to suffer from problems such as mode collapse in training conventional generative models such as GANs (Srivastava et al., 2017). The so-learned representation is simultaneously discriminative and generative.

Self-consistent incremental and unsupervised learning. So far we have mainly discussed the two princi-
Fig. 7 Incremental learning via a compressive closed-loop transcription. For a new data class Xnew, a new LDR memory Znew is learned via a constrained minimax game between the encoder and decoder, subject to the constraint that the memory of past classes Zold is preserved as a "fixed point" of the closed loop.
ples in the supervised setting. In fact, one of the main advantages of our framework is that it is most natural and effective for self-learning via self-supervision and self-critique. In addition, since rate reduction has sought explicit (subspace-type) representations for the learned structures,29 this makes it easy for past knowledge to be preserved when learning new tasks/data, as a prior (memory) to be kept self-consistent.

To be more clear, let us see how the closed-loop transcription framework above can be naturally extended to the case of incremental learning – that is, to learn to recognize one class of objects at a time instead of learning many classes jointly at the same time. While learning the representation Znew for a new class, one only needs to add the cost to the objective (14) and ensure the representation Zold learned before for old classes remains self-consistent (a fixed point) through the closed-loop transcription: Zold ≈ Ẑold = f(g(Zold)). In other words, the above maximin game (14) becomes a constrained one:

max_θ min_η ∆R(Z) + ∆R(Z, Ẑ) + ∆R(Znew, Ẑnew)
subject to ∆R(Zold, Ẑold) = 0. (15)

Such a constrained game makes the learning an incremental and dynamical process, so that the learned transcription can continuously adapt to new incoming data. This process is illustrated in Figure 7.

Recent empirical studies (Tong et al., 2022) have shown that this leads to arguably the first self-contained neural system with a fixed capacity that can incrementally learn good LDR representations without suffering catastrophic forgetting (McCloskey and Cohen, 1989). The forgetting, if any at all, is rather graceful with such a closed-loop system. In addition, when images of an old class are provided again to the system to review, the learned representation can be further consolidated – a characteristic very similar to that of human memory. In some sense, such a constrained closed-loop formulation essentially ensures that the visual memory forming can be Bayesian and adaptive – characteristics hypothesized to be desirable for the brain (Friston, 2009).

Note that this framework is fundamentally conceived to work in an entirely unsupervised setting. Thus even though for pedagogical purposes we presented the principles assuming class information is available, the framework can be naturally extended to the entirely unsupervised setting in which no class information is given for any data sample. In this case, we only have to view every new sample and its augmentations as one new class in (15). This can be viewed as one type of "self-supervision." Together with the "self-critiquing" game mechanism, a compressive closed-loop transcription can be easily learned. As shown in Figure 8, not only does the so-learned auto-encoding show good sample-wise consistency, but the learned features also demonstrate clear and meaningful local low-dimensional (thin) structures. More surprisingly, subspaces, or block-diagonal structures in feature correlation, start to emerge in features learned for the classes even without any class information provided during training at all (Figure 9)! Hence structures of so learned features resemble those of category-selective areas observed in a primate's brain (Kanwisher et al., 1997; Kanwisher, 2010; Kriegeskorte et al., 2008; Bao et al., 2020).

29 instead of a "hidden" or "latent" representation learned by a purely generative method such as GAN (Goodfellow et al., 2014), where the features are distributed randomly in the feature space.
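All the objectives above, from the basic game (14) to its constrained incremental form (15), are assembled from the same rate reduction measure ∆R, which has a closed form in terms of sample covariances. As a purely illustrative sketch (our own toy code, not the authors' released implementation; the precision ε and the random "features" are arbitrary placeholders), the coding rate and the rate-reduction distance between two feature sets can be computed as follows:

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    # R(Z) = 1/2 logdet(I + d/(n eps^2) Z Z^T) for a feature matrix Z
    # whose n columns are d-dimensional features.
    d, n = Z.shape
    _, logdet = np.linalg.slogdet(np.eye(d) + (d / (n * eps**2)) * (Z @ Z.T))
    return 0.5 * logdet

def rate_reduction(Z, Z_hat, eps=0.5):
    # Delta R(Z, Z_hat) = R(Z union Z_hat) - (R(Z) + R(Z_hat)) / 2:
    # a principled "distance" between two (possibly degenerate) feature
    # sets; it vanishes exactly when their sample covariances agree.
    union = np.concatenate([Z, Z_hat], axis=1)
    return coding_rate(union, eps) - 0.5 * (coding_rate(Z, eps) + coding_rate(Z_hat, eps))

rng = np.random.default_rng(0)
Z = rng.normal(size=(8, 100))    # placeholder feature set
W = rng.normal(size=(8, 100))    # a different placeholder feature set
print(abs(rate_reduction(Z, Z)) < 1e-9)  # True: self-distance is zero
print(rate_reduction(Z, W) > 0)          # True: distinct sets are separated
```

In the closed-loop game, the encoder ascends and the decoder descends objectives built from exactly such terms, which is what makes the utility in (14)–(15) computable in closed form.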
Fig. 8 Left: comparison between x and the corresponding decoded x̂ of the auto-encoding learned in the unsupervised setting for the CIFAR-10 dataset (with 50,000 images in 10 classes). Right: t-SNE of the unsupervised-learned features of the 10 classes, with visualization of several neighborhoods and their associated images. Notice the local thin (nearly 1-D) structures in the visualized features, projected from a feature space of hundreds of dimensions.
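What "block-diagonal structures in feature correlation" look like can be seen with a small synthetic stand-in (our own construction, not the learned CIFAR-10 features; dimensions and class count are arbitrary choices): unit-norm features supported on two different low-dimensional subspaces yield a correlation matrix |Z*Z| whose within-class blocks dominate the cross-class block, as in Fig. 9.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 32, 50   # hypothetical ambient feature dimension, samples per class

def class_features(ambient_dim, subspace_dim, num):
    # Unit-norm features drawn from a random low-dimensional subspace.
    basis, _ = np.linalg.qr(rng.normal(size=(ambient_dim, subspace_dim)))
    Z = basis @ rng.normal(size=(subspace_dim, num))
    return Z / np.linalg.norm(Z, axis=0)

Z = np.concatenate([class_features(d, 3, n), class_features(d, 3, n)], axis=1)
C = np.abs(Z.T @ Z)   # magnitudes of pairwise feature correlations

within = (C[:n, :n].mean() + C[n:, n:].mean()) / 2   # diagonal blocks
across = C[:n, n:].mean()                            # off-diagonal block
print(within > across)  # True: the correlation matrix is block-diagonal-like
```

Two random 3-D subspaces of R^32 are nearly orthogonal, so cross-class correlations stay small while within-class correlations remain large; this is the signature that emerges, without supervision, in the learned features.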
Fig. 9 Correlations between unsupervised-learned features for 50,000 images that belong to 10 classes (CIFAR-10) by the closed-loop transcription. Block-diagonal structures consistent with the classes emerge without any supervision.

3 Universal Learning Engines

"What I cannot create, I do not understand." – Richard Feynman

In the above section, we deduced from the first principles of Parsimony and Self-consistency the compressive closed-loop transcription framework, using the example of modeling visual imagery data. In the remaining two sections, we offer more speculative thoughts on the universality of this framework, extending it to 3D vision and reinforcement learning (the rest of this section)30 and projecting its implications for neuroscience, mathematics, and higher-level intelligence (Section 4).

"Unite and build" versus "divide and conquer." Within the compressive closed-loop transcription framework, we have seen why and how fundamental ideas and concepts from coding/information theory, feedback control, deep networks, optimization, and game theory all come together to become integral parts of a complete intelligent system that can learn. Although "divide and conquer" has long been a cherished tenet in scientific research, when it comes to understanding a complex system such as Intelligence, the opposite "unite and build" should be the tenet of choice. Otherwise we would forever be blind men with an elephant: each person would always believe a small piece is the whole world and tend to blow its significance out of proportion.31

Together, the two principles serve as the glue needed to combine many necessary pieces together for the jigsaw puzzle of Intelligence, with the role of deep networks naturally and clearly revealed as models for the nonlinear mappings between external observations and internal representations. Interestingly, the principles reveal computational mechanisms for learning systems that resemble some of the key characteristics observed in or hypothesized about the brain, such as sparse coding and subspace coding (Barlow, 1961; Olshausen and Field, 1996; Chang and Tsao, 2017), closed-loop feedback (Wiener, 1948), and free energy minimization (Friston, 2009), as we will discuss more in the next section.

Notice that closed-loop compressive architectures are ubiquitous in nature for all intelligent beings and at all scales, from the brain (which compresses sensory information) to spinal circuits (which compress muscle movements) down to DNA (which compresses functional information about proteins). We believe that compressive closed-loop transcription may be the universal learning engine behind all intelligent behaviors. It enables intelligent beings and systems to discover and distill low-dimensional structures from seemingly complex and

30 Our discussions on the two topics require familiarity with certain domain-specific terminology and knowledge. Readers who are less familiar with these topics may skip without much loss of continuity.
31 Hence all the superficial claims: "this or that is all you need."
Fig. 10 A closed-loop relationship between Computer Vision and Graphics for a compact and structured 3D model of the visual
inputs.
unorganized input and transform them to compact and organized internal structures to memorize and exploit. To illustrate the universality of such a framework, for the remainder of this section, we examine two more tasks: 3D perception and decision making, which are believed to be two key modules for any autonomous intelligent system (LeCun, 2022). We speculate on how, guided by the two principles, one can develop different perspectives and new insights to understand these challenging tasks.

3.1 3D Perception: Closing the Loop for Vision and Graphics

So far, we have demonstrated the success of closed-loop transcription to discover compact structures in datasets of 2D images. This has relied on the existence of statistical correlations among imagery data in each of the classes. We believe that the same compression mechanisms would be even more effective if the low-dimensional structures in the data are defined through hard physical or geometric constraints rather than soft statistical correlations.

In particular, if we believe the principles of Parsimony and Self-consistency also play a role in how the human brain develops mental models of the world from life-long visual inputs, then our sense of 3D space should be the result of such a closed-loop compression or transcription. The classic paradigm for 3D Vision laid out by David Marr in his influential book Vision (Marr, 1982) advocates a "divide and conquer" approach that partitions the task of 3D perception into several modular processes: from low-level 2D processing (e.g. edge detection, contour sketching), to mid-level 2.5D parsing (e.g. grouping, segmentation, figure and ground), to high-level 3D reconstruction (e.g. pose, shape) and recognition (e.g. objects). In contrast, the compressive closed-loop transcription proposed in this paper advocates an opposite "unite and build" approach.

Perception as a compressive closed-loop transcription? More precisely, a three-dimensional representation of shapes, appearances, and even dynamics of objects in the world should be the most compact and structured representation that our brain has developed internally to interpret all perceived visual observations consistently. If so, the two principles then suggest that a compact and structured 3D representation is directly the internal model to be sought for. This implies that we could and should unify Computer Vision and Computer Graphics within a single closed-loop computational framework, as illustrated in Figure 10.

Computer Vision has conventionally been interpreted as a forward process that reconstructs and recognizes an internal 3D model for all the 2D visual inputs (Ma et al., 2004; Szeliski, 2022), whereas Computer Graphics (Hughes et al., 2014) represents its inverse process that renders and animates the internal 3D model. There might be tremendous computational and practical benefits in combining these two processes directly into a closed-loop system: all the rich structures (e.g. sparsity and smoothness) in geometric shapes, visual appearances, and dynamics can be exploited together for a unified 3D model that is the most compact and consistent with all the visual inputs.

Indeed, the recognition techniques in computer vision could help computer graphics in building compact models in the spaces of shapes and appearance, enabling new ways of creating realistic 3D content. On the other hand, the 3D modeling and simulation techniques in computer graphics could predict, learn and verify the properties and behaviors of the real objects and scenes analyzed by computer vision algorithms. In fact, the approach of "analysis by synthesis" has long been practiced by the vision and graphics community, e.g., for efficient
Fig. 11 An autonomous intelligent agent that integrates perception (feedback), learning, optimization, and action in a closed loop to learn an optimal policy for a certain task. s_t or x_k is the state of the world model; r or g is the perceived reward or cost of action a_t or control u_k on the current state; J or V is the (learned) cost or value associated with each state; Q is the (learned) cost associated with each state-action pair. Here, we deliberately use terminologies from optimal control (OC) (Bertsekas, 2012) and reinforcement learning (RL) (Sutton and Barto, 2018) in parallel, for both comparison and unification.
online perception (Yildirim et al., 2020). Some recent and efficient framework for learning a good perceptual
examples of closing the loop for computer vision and model from visual inputs. At the next level, such a
graphics include a learned 3D rendering engine (Kulkarni perceptual model can then be used by an autonomous
et al., 2015) and 3D aware image synthesis (Chan et al., agent to achieve certain tasks in a complex dynamical
2020; Wood et al., 2021). environment. The overall process for the agent to learn
from perceived results or received rewards of its actions
Unified representations for appearance and shape? forms another closed loop at a higher level (Figure 11).
Image-based rendering (Levoy and Hanrahan, 1996; The principle of self-consistency is clearly at play
Gortler et al., 1996; Shum et al., 2011), in which a here: the role of the closed-loop feedback system is to
new view is generated by learning from a set of given ensure that the learned model and control policy by the
images, may be regarded as an early attempt to close the agent is consistent with the external world in such a way
gap between vision and graphics with the principles of that the model can make the best prediction of the state
parsimony and self-consistency. In particular, plenoptic (st ) transition, and the learned control policy πθ for the
sampling (Chai et al., 2000) showed that an anti-aliased action (at ) results in maximal expected reward R:
image (self-consistency) can be achieved with the mini- .
hX i
max R(θ) = Eat ∼πθ (st ) r(st , at ) . (16)
mum number of images required (parsimony). θ
t
Recent developments in modeling radiance fields
Note here the reward R plays a similar role as the rate
have provided more empirical evidence for this view
reduction objective (4) for LDR models, which measures
(Yu et al., 2021): directly exploiting low-dimensional
the “goodness” of the learned control policy π and guides
structures in the radiance field in 3D (sparse support and
its improvement.
spatial smoothness) leads to much more efficient and
The principle of Parsimony is the main reason for
effective solutions than brute-force training of black-box
the very success of modern reinforcement learning in
deep neural networks (Mildenhall et al., 2020). However,
tackling large-scale tasks such as Alpha-Go (Silver et al.,
it remains a challenge for the future to identify the right
2016, 2017) and playing video games (Berner et al.,
family of compact and structured 3D representations
2019; Vinyals et al., 2019). In almost all tasks that have
that can integrate shape geometry, appearance, and even
a state-action space of astronomical size or dimension,
dynamics in a unified framework that leads to the minimal
say D, practitioners always assume the optimal value
complexity in data, model, and computation.
function V ∗ , Q-function Q∗ , or policy π ∗ only depends
3.2 Decision Making: Closing the Loop for Percep- on a small number of, say d  D, features:
tion, Learning, and Action V ∗ (s) ≈ V̂ f (s, a) ,


Q∗ (s, a) ≈ Q̂ f (s, a) , (17)



So far in this paper, we have discussed how compres-

sive closed-loop transcription may lead to an effective

π (a | s) ≈ π̂ a; f (s, a) ,
16 Ma, Tsao and Shum

where f (s, a) ∈ Rd is a nonlinear mapping that learns the best known bounds on the sample complexity for re-
some low-dimensional features of the extremely large inforcement learning remains linear in cardinality of the
or high-dimensional state-action space. In the case of state space and action O(|S||A|) (Li et al., 2020), which
video games, the state dimension D is easily in the does not explain the empirically observed efficiency of
millions and yet the number of features d needed to learn RL in large-scale tasks (such as Alpha-Go and video
a good policy is typically only a few dozen or hundred! games) where the state or action spaces are astronomical.
Very often, these optimal control policy or value/reward We believe that the efficiency of RL in tackling many
functions sought in OC/RL are even assumed to be a of the practical large-scale tasks can come only from the
linear superposition of these features (Ng and Russell, intrinsic low-dimensionality in the system dynamics or
2000; Kakade, 2001): correlation between the optimal policy/control and the
states. For example, assuming the systems have bounded
ω > f (s, a) = ω1 · f1 (s, a) + · · · + ωd · fd (s, a). (18) eluder dimension (Osband and Roy, 2014) or the MDPs
are low rank (Uehara et al., 2021; Agarwal et al., 2020).
That is, the nonlinear mapping f is assumed to be able to The role of deep networks is again to identify and model
also linearize the dependency of the policy/value/reward such low-dimensional structure and hopefully linearize
functions on the learned features.32 it.
To conclude, for large-scale RL tasks, it is the two
Autonomous feature selection via a game? Notice principles together that make such a closed-loop sys-
that all these practices in RL are very similar in spirit tem of perception, learning, and action a truly efficient
to the learning objectives under the principle of Parsi- and effective learning engine. With such an engine, au-
mony stated in Section 2.1. Effectively exploiting the tonomous agents are able to discover low-dimensional
low-dimensional structures is the (only) reason why the structures if there are indeed such structures in the en-
learning can be so scalable with such a high-dimensional vironment and in the learning task, and eventually act
state-action space; and correctly identifying and lineariz- intelligently when the structures learned are good enough
ing such low-dimensional structures is the key for the and generalize well!
so-learned control policy to be generalizable33 . Nev-
ertheless, a proper choice in the number of features
4 A Broader Program for Intelligence
d remains heuristically designed by human in practice.
That makes the overall RL not autonomous. We believe “If I were to choose a patron saint for
that, for a closed-loop learning system to automatically cybernetics out of the history of science, I
determine the right number of features associated with a should have to choose Leibniz. The philosophy
reward/task, one has to extend the formulation of RL (16) of Leibniz centers about two closely related
to a certain maximin game34 , in a similar spirit as those concepts – that of a universal symbolism and
studied in Section 2.2 for achieving Self-consistency for that of a calculus of reasoning.”
visual modeling. – Norbert Wiener, Cybernetics, 1961

It has been ten years since the dramatic revival of


Data and computational efficiency of RL? In recent deep neural networks with the work of Krizhevsky et al.
years, there have been many theoretical attempts to ex- (2012), which has led to considerable enthusiasm about
plain the empirically observed efficiency of reinforcement artificial intelligence in both the technology industry and
learning in terms of sampling and computation complex- the scientific community. Subsequent theoretical studies
ity of Markov decision processes (MDP). However, any of deep learning often view deep networks themselves
theory based on unstructured generic MDPs and reward as the object of study (Roberts et al., 2022). We here
functions would not be able to provide pertinent explana- however have argued that deep networks are better un-
tions to such empirical successes. For example, some of derstood as a means to an end: they clearly arise to serve
32 It has been a common practice in systems theory to linearize any the purposes of identifying and transforming nonlinear
nonlinear dynamics before controlling them, through either nonlinear low-dimensional structures in high-dimensional data, a
mappings known as the Koopman operators (Koopman, 1931) or
universal task for learning from high-dimensional data
feedback linearization (Sastry, 1999).
33 otherwise, the learned model/policy tends to over-fit or under-fit. (Wright and Ma, 2022).
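The parsimony assumptions in (17) and (18) — that value/reward functions over an astronomically large state-action space depend only on d ≪ D features, and do so linearly — can be made concrete with a tiny self-contained sketch. This is our illustration, not code from the paper: the state space, feature map, and coefficients are all hypothetical.

```python
import math

# Hypothetical setup: states s are integers 0..999 (a "large" state space),
# actions a are 0 or 1, and the true Q-function secretly depends on only
# d = 2 features, linearly: Q*(s, a) = 3*f1(s, a) - 2*f2(s, a).
def features(s, a):
    # f(s, a) in R^2: a hand-picked low-dimensional feature map (an assumption)
    return [math.sin(0.01 * s), float(a)]

def true_q(s, a):
    f = features(s, a)
    return 3.0 * f[0] - 2.0 * f[1]

def fit_omega(samples):
    # Ordinary least squares for d = 2 via the normal equations G w = b,
    # where G = F^T F and b = F^T q over the sampled (s, a) pairs.
    g = [[0.0, 0.0], [0.0, 0.0]]
    b = [0.0, 0.0]
    for s, a in samples:
        f, q = features(s, a), true_q(s, a)
        for i in range(2):
            b[i] += f[i] * q
            for j in range(2):
                g[i][j] += f[i] * f[j]
    det = g[0][0] * g[1][1] - g[0][1] * g[1][0]
    return [(g[1][1] * b[0] - g[0][1] * b[1]) / det,
            (g[0][0] * b[1] - g[1][0] * b[0]) / det]

# A few dozen samples suffice, even though |S x A| = 2000.
omega = fit_omega([(s, a) for s in range(0, 1000, 37) for a in (0, 1)])
# omega recovers the true coefficients [3.0, -2.0] (up to float error).
```

Because the target is exactly linear in the two features, least squares recovers ω from a handful of samples, independently of how large the raw state-action space is — the efficiency comes entirely from the low-dimensional structure, as the text argues.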
More broadly, in this paper, we have proposed and argued that Parsimony and Self-consistency are two fundamental principles responsible for the emergence of Intelligence, artificial or natural. The two principles together lead to a closed-loop computational framework that unifies and clarifies many practices and empirical findings of deep learning and artificial intelligence. Furthermore, we believe they will guide us from now on to study Intelligence with a more principled and integrated approach. Only thus can we achieve a new level in understanding the science and mathematics of Intelligence.

4.1 Neuroscience of Intelligence

One would naturally expect any fundamental principle of intelligence to have major implications for the design of the most intelligent thing in the universe, the brain. As already mentioned, the principles of Parsimony and Self-consistency shed new light on several experimental observations concerning the primate visual system. Even more importantly, they shine light on what to look for in future experiments.

We have shown that seeking an internally parsimonious and predictive representation alone is enough “self-supervision” to allow structures to emerge automatically in the final representation learned through a compressive closed-loop transcription. For example, Figure 9 shows that unsupervised data transcription learns features that automatically distinguish different categories, providing an explanation for category-selective representations observed in the brain (Kanwisher et al., 1997; Kanwisher, 2010; Kriegeskorte et al., 2008; Bao et al., 2020). These features also provide a plausible explanation for the widespread observations of sparse coding (Olshausen and Field, 1996) and subspace coding (Chang and Tsao, 2017; Bao et al., 2020) in the primate brain. Furthermore, beyond the modeling of visual data, recent studies in neuroscience suggest that the emergence of other structured representations in the brain, such as “place cells,” might also be the result of coding spatial information in the most compressed way (Benna and Fusi, 2021).

Arguably, the maximal coding rate reduction (MCR2) principle (4) is similar in spirit to the “free energy minimization principle” from cognitive science (Friston, 2009), which attempts to provide a framework for Bayesian inference through energy minimization. But unlike the general notion of free energy, rate reduction is computationally tractable and directly optimizable, as it can be expressed in closed form. Furthermore, the interplay of our two principles suggests that autonomous learning of the correct model (class) should be done via a closed-loop maximin game over such a utility (14), instead of minimization alone. Thus we believe our framework provides a new perspective on how to practically implement Bayesian inference.

Our framework clarifies the overall learning architecture used by the brain. One important insight is that a feedforward segment can be constructed via unrolling an optimization scheme; learning from a random network via back propagation is not necessary. Furthermore, our framework suggests the existence of a complementary generative segment to form a closed-loop feedback system to guide learning. Finally, our framework sheds light on the elusive “prediction error” signal sought by many neuroscientists interested in brain mechanisms for “predictive coding,” a computational scheme with resonances to compressive closed-loop transcription (Rao and Ballard, 1999; Keller and Mrsic-Flogel, 2018): for reasons of computational tractability, the discrepancy between incoming and generated observations should be measured at the final stage of representation.

So far, many of the resemblances between this new framework and natural intelligence discussed in this paper are still speculative and remain to be corroborated by future scientific experiments and evidence. Nevertheless, these speculations offer neuroscientists ample new perspectives and hypotheses about why and how Intelligence could emerge in nature.

4.2 Mathematics of Intelligence

In terms of mathematical or statistical models for data analysis, one can view our framework as a generalization of PCA (Jolliffe, 1986), GPCA (Vidal et al., 2016), RPCA (Candès et al., 2011), and Nonlinear PCA (Kramer, 1991) to the case of multiple low-dimensional nonlinear submanifolds in a high-dimensional space. These classical methods largely model data with a single or multiple linear subspaces or with a single nonlinear submanifold. We have argued that the role of deep networks is mainly to model the nonlinear mappings that linearize and separate multiple low-dimensional submanifolds simultaneously. This generalization is necessary to connect these idealized classic models to the true structures of real-world data. Despite promising and exciting empirical evidence, the mathematical properties of the compressive closed-loop transcription process have not been so well studied and understood. To the best of our knowledge, only for the case when the original data x are assumed to lie on multiple linear subspaces has it been shown that the maximin game based on rate reduction gives the correct optimal solution (Pai et al., 2022). Little is known about the transcription of nonlinear submanifolds.

A rigorous and systematic investigation requires an understanding of high-dimensional probability distributions with low-dimensional supports on submanifolds (Fefferman et al., 2013). Hence, mathematically, it is crucial to study how such submanifolds in high-dimensional spaces can be identified, grouped or separated, deformed and flattened with minimal distortion to their original metric, geometry, and topology (Tenenbaum et al., 2000; Buchanan et al., 2020; Wang et al., 2021). Problems like these seem to fall in an understudied area between classical Differential Geometry and Differential Topology in mathematics.

In addition, we often also wish that, during the process of deformation, the probability measure of data on each submanifold is redistributed in a certain optimal way such that coding and sampling will be the most economical and efficient. This is related to topics such as Optimal Transport (Lei et al., 2017). For the case when the submanifolds are fixed linear subspaces, understanding the achievable extremals of the rate reduction, or the ratio of volumes of the whole and the parts, seems related to certain fundamental inequalities in analysis such as the Brascamp–Lieb inequalities (Bennett et al., 2007). More general problems like these seem to be related to the study of Metric Entropy and Coding Theory for distributions over more general compact structures.

Besides nonlinear low-dimensional structures, real-world data and signals are typically invariant to shift in time, translation in space, or more general group transformations. Wiener (1961) recognized that dealing with both nonlinearity and invariance simultaneously presents a major technical challenge. He had made early attempts to generalize Harmonic Analysis to nonlinear processes and systems.35 Indeed, the revival of deep learning has reignited strong interest in this important problem, and great progress has been made recently, including the work of Bruna and Mallat (2013); Wiatowski and Bölcskei (2018); Cohen and Welling (2016); Cohen et al. (2019). Our framework suggests that a more unifying approach to dealing with nonlinearity and invariance is through (incremental) compression. That has led to a natural derivation of structured deep networks such as the convolutional ReduNet (Chan et al., 2022). We believe compression provides a unifying perspective for modeling general nonlinear sequential data and processes too, which could lead to mathematically rigorous justification for popular models such as RNNs or LSTMs (Hochreiter and Schmidhuber, 1997).

But besides pure mathematical interest, we must require this time that the mathematical investigation lead to computationally tractable measures and scalable algorithms. One has to characterize the precise statistical and computational resources needed to achieve such tasks, in the same spirit as the research program set out for Compressive Sensing (Wright and Ma, 2022), because Intelligence needs to apply them to model high-dimensional data and solve large-scale tasks. This entails “closing the loop” between Mathematics and Computation, enabling the use of rich families of good geometric structures (e.g., sparse models, subspaces, grids, groups, or graphs; Figure 1, right) as compact archetypes for modeling real-world data, through efficiently computable nonlinear mappings that generalize deep networks; see, e.g., Bronstein et al. (2021).

4.3 Towards Higher-Level Intelligence

The two principles laid out in this paper are mainly for explaining the emergence of Intelligence in individual agents or systems, related to the notion of ontogenetic learning that Norbert Wiener first proposed (Wiener, 1948). It is probably not a coincidence that, after more than seventy years, we find ourselves in this paper “closing the loop” of the modern practice of Artificial Intelligence back to its roots in Cybernetics, interweaving the very same set of fundamental concepts that Wiener touched upon in his book while conducting inquiries into the jigsaw puzzle of Intelligence: information and compression, closed-loop feedback, learning via games, white-box modeling, nonlinearity, shift-invariance, etc.

As we see from this paper, the compressive closed-loop transcription is arguably the first computational framework that integrates many of these pieces coherently together. It is close in spirit to earlier frameworks (Hinton et al., 1995) but makes them computationally tractable and scalable. In particular, the learned nonlinear encoding/decoding mappings of the loop, often manifested as deep networks, essentially provide a much-needed “interface” between external unorganized raw sensory data (say, visual, auditory, etc.) and internal compact and structured representations.

However, the two principles proposed in this paper do not necessarily explain all aspects of Intelligence. Obviously, the computational mechanisms behind the emergence and development of high-level semantic, symbolic, or logic inferences remain elusive, although many foundational works have been set forth by pioneers like John McCarthy, Marvin Minsky, Allen Newell and Herbert Simon since the 1950s (Simon, 1969; Newell and Simon, 1972), and a comprehensive modern exposition can be found in Russell and Norvig (2020). To this day, there are still active and contentious debates about whether such high-level symbolic intelligence can emerge from continuous learning or has to be hard-coded (Marcus, 2020; LeCun and Browning, 2022).

In our view, structured internal representations such as subspaces are one necessary intermediate step toward the emergence of high-level semantic or symbolic concepts: each subspace corresponds to one discrete category (of objects). Additional statistical, causal or logical relationships among the so-abstracted discrete concepts can be further modeled parsimoniously as a compact and structured (say, sparse) graph, with each node representing a subspace/category. The graph can be learned via auto-encoding to ensure self-consistency, e.g. (Bear et al., 2020).

We conjecture that only on top of compact and structured representations learned by individual agents can the emergence and development of high-level intelligence (with shareable symbolic knowledge) eventually become possible. We suggest that new principles for the emergence of high-level intelligence, if any, should be sought through the need for efficient communication of information or transfer of knowledge among intelligent agents. This is related to the notion of phylogenetic learning that Wiener also discussed (Wiener, 1961).

Furthermore, any new principles needed for such higher-level intelligence must reveal why alignment and sharing of internal models/concepts across different individual agents is computationally possible, as well as reveal certain measurable gains in intelligence that a group of agents obtains from such symbolic abstraction and sharing.

Intelligence as interpretable and computable systems. Obviously, as we advance our inquiries into higher-level Intelligence, we want to set much higher standards this time. Whatever new principles there might remain to be discovered in the future, for them to truly play a substantial role in the emergence and development of Intelligence, they must share two characteristics with the two principles we have presented in this paper:

1. Interpretability: all principles together should help reveal the computational mechanisms of Intelligence as a white box36, including measurable objectives, associated computing architectures, and structures of learned representations.

2. Computability: any new principle for Intelligence must be computationally tractable and scalable, physically realizable by computing machines or nature, and ultimately corroborated with scientific evidence.

Only with such fully interpretable and truly realizable principles in place can we explain all existing intelligent systems, artificial or natural, as partial or holistic instantiations of these principles. Then they can help us discover effective architectures and systems for different intelligent tasks without relying on the current expensive and time-consuming “trial-and-error” approach to advance. We will also be able to characterize the minimal data and computation resources needed to achieve these tasks, instead of the current brute-force approach that simply advocates “the bigger, the better.” Intelligence should not be the privilege of the most resourceful, as that is not the way of nature. Instead, parsimony and autonomy are its main characteristics.37 Under a correct set of principles, anyone should be able to design and build future generations of intelligent systems, small or big, with autonomy, ability and efficiency eventually emulating, and even surpassing, those of animals and humans.

5 Conclusion

Through this paper, we hope to have convinced the reader that we are now in a much better place than people like Wiener and Shannon were seventy years ago when it comes to uncovering, understanding, and even exploiting the workings of Intelligence. We have proposed and argued that, under the two principles of Parsimony and Self-consistency, it is possible to assemble many necessary pieces of the puzzle of Intelligence into a unified computational framework that is easily implementable on machines or by nature. This unifying framework offers new perspectives on how we could further advance the study of perception, decision making, and intelligence in general.

34 By introducing a self-critique for the features selected and learned.
35 He used his analysis to explain brain waves (Wiener, 1961)!
36 Again, the phrase “white box” modeling has been conveniently borrowed from Wiener’s Cybernetics (Wiener, 1961).
37 A tiny ant, with merely a quarter of a million neurons consuming negligible energy, is arguably much more intelligent and independent than any legged robot in the world.

To draw an end to our proposal for a principled approach to Intelligence, we emphasize once again that all
scientific principles for Intelligence should not be merely philosophical guidelines, or merely conceptual frameworks formulated with mathematical/statistical quantities that are intractable to compute or can only be approximated heuristically. They should rely on the most basic and principled objectives that are measurable with finite observations and lead to computational systems that can be realized even with limited resources. This belief is probably best expressed through a quote from Lord Kelvin:38

“When you can measure what you are speaking about and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of the meager and unsatisfactory kind: it may be the beginning of knowledge, but you have scarcely, in your thoughts, advanced to the stage of science, whatever the matter may be.”
– Lord Kelvin, 1883

Afterword and Acknowledgement

Although the research of Yi and Harry focuses more on computer vision and computer graphics, they both happened to major in control and automation as undergraduate students. They started their collaboration many years ago at Microsoft Research Asia (MSRA) with a compression-based approach to data clustering and classification (Ma et al., 2007; Wright et al., 2008). In the past two years, they have had frequent discussions and debates about understanding (deep) learning and (artificial) intelligence. Their shared interests in Intelligence have brought all these fundamental ideas together and led to the recent collaboration on closed-loop transcription (Dai et al., 2022), and eventually to many of the views shared in this paper. Doris is deeply interested in whether and how the brain implements generative models for visual perception, and her group has been having intense discussions with Yi on this topic since moving to UC Berkeley a year ago.

The idea of writing this position paper was partly motivated by recent stimulating discussions among a group of researchers with very diverse backgrounds in artificial intelligence, applied mathematics, optimization, and neuroscience: Professors John Wright and Stefano Fusi of Columbia University, Professors Yann LeCun and Rob Fergus of New York University, and Dr. Xin Tong of MSRA. We realize that these perspectives might be interesting to broader scientific and engineering communities.

Some of the thoughts presented about integrating pieces of the puzzle for intelligent systems can be traced back to an advanced robotics course that Yi had led and organized jointly with Professors Jitendra Malik, Shankar Sastry, and Claire Tomlin as Berkeley EECS290-005: the Integration of Perception, Learning, and Control, in Spring 2021. The need for an integrated view or a “unite and build” approach seems to be a topic of increasing interest and importance for the study of Artificial Intelligence.

We would also like to thank many of our former and current students who, against extraordinary odds, have worked on projects under this new framework in the past several years, when some of the ideas were still in their infancy and seemed not in accordance with the mainstream, including Xili Dai, Yaodong Yu, Peter Tong, Ryan Chan, Chong You, Ziyang Wu, Christina Baek, Druv Pai, Brent Yi, Michael Psenka, and others. Many of the technical evidence and figures used in this position paper are conveniently borrowed from their recent research results.

38 A quote we learned from Professor Jitendra Malik of UC Berkeley.

References

Agarwal A, Kakade S, Krishnamurthy A, et al., 2020. FLAMBE: Structural complexity and representation learning of low rank MDPs. Advances in Neural Information Processing Systems, 33:20095-20107.
Azulay A, Weiss Y, 2018. Why do deep convolutional networks generalize so poorly to small image transformations? arXiv preprint arXiv:1805.12177.
Baek C, Wu Z, Ryan Chan TD, et al., 2022. Efficient maximal coding rate reduction by variational form. CVPR.
Bai S, Kolter JZ, Koltun V, 2019. Deep equilibrium models. Advances in Neural Information Processing Systems, 32.
Baker B, Gupta O, Naik N, et al., 2017. Designing neural network architectures using reinforcement learning. arXiv, abs/1611.02167.
Bao P, She L, McGill M, et al., 2020. A map of object space in primate inferotemporal cortex. Nature, 583:103-108.
Barlow HB, 1961. Possible principles underlying the transformation of sensory messages. Sensory Communication, 1(01).
Bear D, Fan C, Mrowca D, et al., 2020. Learning physical graph representations from visual scenes. Advances in Neural Information Processing Systems, 33.
Belkin M, Hsu D, Ma S, et al., 2019. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849-15854.
Benna MK, Fusi S, 2021. Place cells may simply be memory cells: Memory compression leads to spatial tuning and history dependence. Proceedings of the National Academy of Sciences, 118(51).
Berner C, Brockman G, Chan B, et al., 2019. Dota 2 with large scale deep reinforcement learning. Preprint arXiv:1912.06680.
Bertsekas D, 2012. Dynamic Programming and Optimal Control, volumes I and II. Athena Scientific.
Bronstein MM, Bruna J, Cohen T, et al., 2021. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. http://arxiv.org/abs/2104.13478
Bruna J, Mallat S, 2013. Invariant scattering convolution networks. PAMI.
Buchanan S, Gilboa D, Wright J, 2020. Deep networks and the multiple manifold problem. International Conference on Learning Representations. https://par.nsf.gov/biblio/10218695
Candès EJ, Li X, Ma Y, et al., 2011. Robust principal component analysis? Journal of the ACM (JACM), 58(3):1-37.
Chai JX, Tong X, Chan SC, et al., 2000. Plenoptic sampling. Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, p.307-318.
Chan E, Monteiro M, Kellnhofer P, et al., 2020. pi-GAN: Periodic implicit generative adversarial networks for 3D-aware image synthesis. arXiv.
Chan KHR, Yu Y, You C, et al., 2022. ReduNet: A white-box deep network from the principle of maximizing rate reduction. Journal of Machine Learning Research, 23(114):1-103. http://jmlr.org/papers/v23/21-0631.html
Chan TH, Jia K, Gao S, et al., 2015. PCANet: A simple deep learning baseline for image classification? IEEE Transactions on Image Processing.
Chang L, Tsao D, 2017. The code for facial identity in the primate brain. Cell.
Cohen T, Geiger M, Weiler M, 2019. A general theory of equivariant CNNs on homogeneous spaces. NeurIPS.
Cohen TS, Welling M, 2016. Group equivariant convolutional networks. CoRR, abs/1602.07576. http://arxiv.org/abs/1602.07576
Cover TM, Thomas JA, 2006. Elements of Information Theory. Wiley-Interscience.
Dai X, Tong S, Li M, et al., 2022. CTRL: Closed-loop data transcription to an LDR via minimaxing rate reduction. Entropy, arXiv:2111.06636.
Dosovitskiy A, Beyer L, Kolesnikov A, et al., 2021. An image is worth 16x16 words: Transformers for image recognition at scale. International Conference on Learning Representations. https://openreview.net/forum?id=YicbFdNTTy
El Ghaoui L, Gu F, Travacca B, et al., 2021. Implicit deep learning. SIAM Journal on Mathematics of Data Science, 3(3):930-958. https://doi.org/10.1137/20M1358517
Engstrom L, Tran B, Tsipras D, et al., 2017. A rotation and a translation suffice: Fooling CNNs with simple transformations. arXiv preprint arXiv:1712.02779.
Fefferman C, Mitter S, Narayanan H, 2013. Testing the manifold hypothesis. Preprint arXiv:1310.0425.
Fiez T, Chasnov B, Ratliff LJ, 2019. Convergence of learning dynamics in Stackelberg games. Preprint arXiv:1906.01217.
Goodfellow I, Pouget-Abadie J, Mirza M, et al., 2014. Generative adversarial nets. Advances in Neural Information Processing Systems, 27.
Gortler SJ, Grzeszczuk R, Szeliski R, et al., 1996. The lumigraph. Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, p.43-54.
Gregor K, LeCun Y, 2010. Learning fast approximations of sparse coding. Proceedings of the 27th International Conference on Machine Learning, p.399-406.
Hadsell R, Chopra S, LeCun Y, 2006. Dimensionality reduction by learning an invariant mapping. 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2006), p.1735-1742.
He K, Zhang X, Ren S, et al., 2016. Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, p.770-778.
Hinton GE, Zemel RS, 1993. Autoencoders, minimum description length and Helmholtz free energy. Proceedings of the 6th International Conference on Neural Information Processing Systems, San Francisco, CA, USA, p.3-10.
Hinton GE, Dayan P, Frey BJ, et al., 1995. The "wake-sleep" algorithm for unsupervised neural networks. Science, 268(5214):1158-1161. https://doi.org/10.1126/science.7761831
Ho J, Jain A, Abbeel P, 2020. Denoising diffusion probabilistic models. https://arxiv.org/abs/2006.11239
Hochreiter S, Schmidhuber J, 1997. Long short-term memory. Neural Computation.
Huang G, Liu Z, Van Der Maaten L, et al., 2017. Densely connected convolutional networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, p.4700-4708.
Hughes JF, van Dam A, McGuire M, et al., 2014. Computer Graphics: Principles and Practice, 3rd Edition. Addison-Wesley.
Automatic Machine Learning: Methods, Systems, Challenges, 2019. Springer.
Hyvärinen A, Oja E, 1997. A fast fixed-point algorithm for independent component analysis. Neural Computation, 9:1483-1492.
Hyvärinen A, 1997. A family of fixed-point algorithms for independent component analysis. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), p.3917-3920.
Jin C, Netrapalli P, Jordan MI, 2019. What is local optimality in nonconvex-nonconcave minimax optimization? Preprint arXiv:1902.00618.
Jolliffe I, 1986. Principal Component Analysis. Springer-Verlag, New York, NY.
Bennett J, Carbery A, Christ M, et al., 2007. The Brascamp–Lieb inequalities: Finiteness, structure and extremals. Geometric and Functional Analysis, 17.
Josselyn SA, Tonegawa S, 2020. Memory engrams: Recalling the past and imagining the future. Science, 367.
Kakade SM, 2001. A natural policy gradient. Advances in Neural
Friston K, 2009. The free-energy principle: a rough guide to the Information Processing Systems, 14.
brain? Trends in Cognitive Sciences, 13(7):293-301. Kanwisher N, McDermott J, Chun MM, 1997. The fusiform face
https://doi.org/https://doi.org/10.1016/j.tics.2009.04.005 area: A module in human extrastriate cortex specialized for
Fukushima K, 1980. Neocognitron: A self-organizing neural network face perception. Journal of Neuroscience, 17(11):4302-4311.
model for a mechanism of pattern recognition unaffected by https://www.jneurosci.org/content/17/11/4302
shift in position. Biol Cybernetics, 36:193-202. https://doi.org/10.1523/JNEUROSCI.17-11-04302.1997
Kanwisher N, 2010. Functional specificity in the human brain: A window into the functional architecture of the mind. Proceedings of the National Academy of Sciences, 107(25):11163-11170.
https://doi.org/10.1073/pnas.1005062107
Keller GB, Mrsic-Flogel TD, 2018. Predictive processing: A canonical cortical computation. Neuron, 100(2):424-435.
Kelley H, 1960. Gradient theory of optimal flight paths. ARS Journal, 30(10):947-954.
Kingma DP, Welling M, 2013. Auto-encoding variational Bayes. preprint arXiv:1312.6114.
Kobyzev I, Prince SJ, Brubaker MA, 2021. Normalizing flows: An introduction and review of current methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11):3964-3979.
https://doi.org/10.1109/tpami.2020.2992934
Koopman BO, 1931. Hamiltonian systems and transformation in Hilbert space. Proceedings of the National Academy of Sciences of the United States of America, 17(5):315-318.
Kramer MA, 1991. Nonlinear principal component analysis using autoassociative neural networks. AIChE Journal, 37:233-243.
Kriegeskorte N, Mur M, Ruff DA, et al., 2008. Matching categorical object representations in inferior temporal cortex of man and monkey. Neuron, 60(6):1126-1141.
Krizhevsky A, Sutskever I, Hinton GE, 2012. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, p.1097-1105.
Kulkarni TD, Whitney WF, Kohli P, et al., 2015. Deep convolutional inverse graphics network. Advances in Neural Information Processing Systems, 28.
LeCun Y, Browning J, 2022. What AI can tell us about intelligence. NOĒMA Magazine.
LeCun Y, Bottou L, Bengio Y, et al., 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE.
LeCun Y, Bengio Y, Hinton G, 2015. Deep learning. Nature, 521(7553):436-444.
LeCun Y, 2022. A path towards autonomous machine intelligence. preprint posted on OpenReview.
Lei N, Su K, Cui L, et al., 2017. A geometric view of optimal transportation and generative model. preprint arXiv:1710.05488.
Levoy M, Hanrahan P, 1996. Light field rendering. Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, p.31-42.
Li G, Wei Y, Chi Y, et al., 2020. Breaking the sample size barrier in model-based reinforcement learning with a generative model. Advances in Neural Information Processing Systems, 33:12861-12872.
Ma Y, Soatto S, Kosecka J, et al., 2004. An Invitation to 3D Vision: From Images to Geometric Models. Springer-Verlag.
Ma Y, Derksen H, Hong W, et al., 2007. Segmentation of multivariate mixed data via lossy data coding and compression. IEEE Transactions on Pattern Analysis and Machine Intelligence.
MacDonald J, Wäldchen S, Hauch S, et al., 2019. A rate-distortion framework for explaining neural network decisions. preprint arXiv:1905.11092.
Marcus G, 2020. The next decade in AI: four steps towards robust artificial intelligence. CoRR, abs/2002.06177.
https://arxiv.org/abs/2002.06177
Marr D, 1982. Vision. MIT Press.
Mayr O, 1970. The Origins of Feedback Control. MIT Press.
McCloskey M, Cohen NJ, 1989. Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation, 24:109-165.
Mildenhall B, Srinivasan PP, Tancik M, et al., 2020. NeRF: Representing scenes as neural radiance fields for view synthesis. ECCV.
Newell A, Simon HA, 1972. Human Problem Solving. Prentice Hall, New Jersey, USA.
Ng A, Russell S, 2000. Algorithms for inverse reinforcement learning. Proceedings of the Seventeenth International Conference on Machine Learning (ICML).
Olshausen BA, Field DJ, 1996. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607.
Oord Avd, Li Y, Vinyals O, 2018. Representation learning with contrastive predictive coding. preprint arXiv:1807.03748.
Osband I, Roy BV, 2014. Model-based reinforcement learning and the eluder dimension. Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, Cambridge, MA, USA, p.1466-1474.
Pai D, Psenka M, Chiu CY, et al., 2022. Pursuit of a discriminative representation for multiple subspaces via sequential games. preprint arXiv:2206.09120.
Papyan V, Han X, Donoho DL, 2020. Prevalence of neural collapse during the terminal phase of deep learning training. preprint arXiv:2008.08186.
Patterson D, Gonzalez J, Hölzle U, et al., 2022. The carbon footprint of machine learning training will plateau, then shrink. preprint arXiv:2204.05149.
Quinlan JR, 1986. Induction of decision trees. Machine Learning.
Rao RPN, Ballard DH, 1999. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2(1):79-87.
Rifai S, Vincent P, Muller X, et al., 2011. Contractive auto-encoders: Explicit invariance during feature extraction. International Conference on Machine Learning, p.833-840.
Rissanen J, 1989. Stochastic Complexity in Statistical Inquiry. World Scientific Publishing Co., Inc., USA.
Roberts DA, Yaida S, Hanin B, 2022. The Principles of Deep Learning Theory. Cambridge University Press (https://deeplearningtheory.com).
Rosenblatt F, 1958. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386-408.
Rumelhart DE, Hinton GE, Williams RJ, 1986. Learning representations by back-propagating errors. Nature.
Russell S, Norvig P, 2020. Artificial Intelligence: A Modern Approach. Pearson.
Sastry S, 1999. Nonlinear Systems: Analysis, Stability, and Control. Springer.
Shannon CE, 1948. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379-423.
https://doi.org/10.1002/j.1538-7305.1948.tb01338.x
Shazeer N, Mirhoseini A, Maziarz K, et al., 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. ICLR.
https://openreview.net/pdf?id=B1ckMDqlg
Shum HY, Chan S, Kang S, 2011. Image-Based Rendering. Springer US.
https://books.google.com/books?id=GeObcQAACAAJ
Silver D, Huang A, Maddison CJ, et al., 2016. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484-489.
Silver D, Schrittwieser J, Simonyan K, et al., 2017. Mastering the game of Go without human knowledge. Nature, 550(7676):354-359.
Simon HA, 1969. The Sciences of the Artificial. MIT Press, Cambridge, MA, USA.
Srivastava A, Valkoz L, Russell C, et al., 2017. VeeGAN: Reducing mode collapse in GANs using implicit variational learning. Advances in Neural Information Processing Systems, p.3310-3320.
Srivastava RK, Greff K, Schmidhuber J, 2015. Highway networks. preprint arXiv:1505.00387.
Sutton R, Barto A, 2018. Reinforcement Learning: An Introduction. MIT Press.
Szegedy C, Zaremba W, Sutskever I, et al., 2013. Intriguing properties of neural networks. preprint arXiv:1312.6199.
Szeliski R, 2022. Computer Vision: Algorithms and Applications, 2nd Edition. Springer-Verlag.
Tenenbaum JB, de Silva V, Langford JC, 2000. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319-2323.
Tishby N, Zaslavsky N, 2015. Deep learning and the information bottleneck principle. IEEE Information Theory Workshop.
Tong S, Dai X, Wu Z, et al., 2022. Incremental learning of structured memory via closed-loop transcription. preprint arXiv:2202.05411.
Uehara M, Zhang X, Sun W, 2021. Representation learning for online and offline RL in low-rank MDPs. preprint arXiv:2110.04652.
Vaswani A, Shazeer N, Parmar N, et al., 2017. Attention is all you need. preprint arXiv:1706.03762.
Vidal R, Ma Y, Sastry S, 2016. Generalized Principal Component Analysis. Springer-Verlag.
Vidal R, 2022. Attention: Self-expression is all you need.
https://openreview.net/forum?id=MmujBClawFo
Vinyals O, Babuschkin I, Czarnecki WM, et al., 2019. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350-354.
Wang T, Buchanan S, Gilboa D, et al., 2021. Deep networks provably classify data on curves. Advances in Neural Information Processing Systems, 34.
Wiatowski T, Bölcskei H, 2018. A mathematical theory of deep convolutional neural networks for feature extraction. IEEE Transactions on Information Theory.
Wiener N, 1948. Cybernetics. MIT Press, Cambridge, Mass.
Wiener N, 1961. Cybernetics, 2nd edition. MIT Press, Cambridge, Mass.
Wisdom S, Powers T, Pitton J, et al., 2017. Building recurrent networks by unfolding iterative thresholding for sequential sparse recovery. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), p.4346-4350.
Wood E, Baltrušaitis T, Hewitt C, et al., 2021. Fake it till you make it: Face analysis in the wild using synthetic data alone. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), p.3681-3691.
Wright J, Ma Y, 2022. High-Dimensional Data Analysis with Low-Dimensional Models: Principles, Computation, and Applications. Cambridge University Press (https://book-wright-ma.github.io).
Wright J, Tao Y, Lin Z, et al., 2008. Classification via minimum incremental coding length (MICL). Advances in Neural Information Processing Systems, p.1633-1640.
Xie S, Girshick R, Dollár P, et al., 2017. Aggregated residual transformations for deep neural networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), p.5987-5995.
Yang Z, Yu Y, You C, et al., 2020. Rethinking bias-variance trade-off for generalization of neural networks. International Conference on Machine Learning.
Yildirim I, Belledonne M, Freiwald W, et al., 2020. Efficient inverse graphics in biological face processing. Science Advances, 6(10):eaax5979.
Yu A, Fridovich-Keil S, Tancik M, et al., 2021. Plenoxels: Radiance fields without neural networks. preprint arXiv:2112.05131.
Yu Y, Chan KHR, You C, et al., 2020. Learning diverse and discriminative representations via the principle of maximal coding rate reduction. Advances in Neural Information Processing Systems, 33:9422-9434.
Zhai Y, Yang Z, Liao Z, et al., 2020. Complete dictionary learning via ℓ4-norm maximization over the orthogonal group. Journal of Machine Learning Research, 21(165):1-68.
Zhu JY, Park T, Isola P, et al., 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. Proceedings of the IEEE international conference on computer vision, p.2223-2232.
Zoph B, Le QV, 2017. Neural architecture search with reinforcement learning. preprint arXiv:1611.01578.