Representations of Space and Time in The Maximization of Information Flow in The Perception-Action Loop
Alexander S. Klyubin
[email protected]
Adaptive Systems Research Group, School of Computer Science,
University of Hertfordshire, Hatfield Herts AL10 9AB, U.K.
Daniel Polani
[email protected]
Adaptive Systems and Algorithms Research Groups, School of Computer Science,
University of Hertfordshire, Hatfield Herts AL10 9AB, U.K.
Chrystopher L. Nehaniv
[email protected]
Adaptive Systems and Algorithms Research Groups, School of Computer Science,
University of Hertfordshire, Hatfield Herts AL10 9AB, U.K.
1 Introduction
2 Representation is a loaded term (cf. Millikan, 2002; Usher, 2001). In this letter, the
“representation” of one variable by another means the probabilistic mapping or
correspondence between the two variables.
In section 6 the representation of time inside the agent and by the agent-
environment system as a whole is studied. A method to quantify the time
dependence of representations is proposed in section 7. Since some of the
representations appear to be factorizable (decomposable into independent
components), in section 8 the representations of space (initial position)
and time are factorized. Finally, overall discussion and conclusions are
presented in section 9.
In appendix A, the information-theoretic quantities used in the letter
are briefly defined. Appendix B remarks on modeling an agent’s controller
using causal Bayesian networks. Appendices C and D explain how infor-
mation flows are measured and how the controllers are evolved in this
letter. Appendix E proves the assertion for the grid-world model used in
this letter that when an agent follows the gradient and reaches the source,
most of the information about the agent’s initial position relative to the
source passes through its sensors. Appendix F continues the discussion
of time dependence of representation by considering two examples. Ap-
pendix G contains the details of the factorization method used in section 8
and discusses its relation to synergy and the parallel bottleneck case of the
multivariate information bottleneck method.
2 Background
2.1 Related Work
the assumption that it can be stored” are not appropriate for the theory of
perception (p. 242).
However, work in perceptual neural networks (see section 2.1.2), recent
work in control (Touchette & Lloyd, 2004), and implicitly also Ashby’s
work (1956) show that information theory can be productively applied to
perception-action without the need to assume any intentionality. Touchette
and Lloyd (2000, 2004) provided bounds on the efficiency of any sensor for
one-step control (entropy reduction). Their result for closed-loop control
can be interpreted as a generalized statement of Ashby’s law of requisite
variety, which states that only variety (entropy) in actions can reduce the
variety (entropy) in the environment.
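Ashby's law can be illustrated numerically. The sketch below is a hypothetical toy regulator, not taken from the cited papers: a disturbance with H(D) bits of variety is countered by a regulator with H(A) bits of action variety, and at best H(D) − H(A) bits of outcome variety remain.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy in bits of an empirical distribution."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values() if c)

def regulated_outcome_entropy(n_disturb, n_actions):
    """Disturbance d is uniform over n_disturb states; a regulator with
    n_actions options cancels what it can via a = d mod n_actions,
    leaving outcome o = d - a. The residual entropy is H(D) - H(A)."""
    outcomes = Counter(d - (d % n_actions) for d in range(n_disturb))
    return entropy(outcomes)

# H(D) = 3 bits, H(A) = 2 bits: at best 1 bit of outcome variety remains.
print(regulated_outcome_entropy(8, 4))  # → 1.0
```

The modular cancellation scheme is only an illustrative choice; any policy using its k actions uniformly achieves the same log2(k)-bit reduction here.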
2.2 Relation of Previous Work to This Letter. The prior work already
noted is related to this work as follows:
1. The minimalistic model of an agent employed in this letter (see section
3.1) is defined in terms of what the agent’s controller can access:
sensors, actuators, and memory, conforming to Gibson’s philosophy.
2. In this letter, the treatment of the perception-action loop by Ashby
and by Touchette and Lloyd is generalized by modeling an arbitrary
number of time steps.
3. The conceptual difference between this work and Linsker’s infomax
is that internal spatial structure in the processing architecture (i.e.,
receptive fields and layers of neurons) is replaced by natural con-
straints coming only from the agent-environment interaction (“em-
bodiment”) extended over many time steps.
4. Although information flow is maximized through time in this letter,
the perception-action loop infomax is conceptually different from the
temporal infomax of Ay and Wennekers. First of all, compositionality,
the existence of separate finely grained components, is central to their
3 Although the agent can have more than one sensor and actuator, from now on
they will be referred to in the singular. Multiple sensors can of course be treated as one
composite sensor. Similar reasoning applies to actuators.
random variable Y is denoted using the shorthand notation X → Y, which stands for the
Markov kernel X × Y → [0, 1], (x, y) ↦ Pr(Y = y | X = x).
3.2 Information Flow. In the material world, the concept of matter flow
is natural and widespread owing to the conservation laws and additivity
of matter. For example, in nuclear medicine, radioisotope tracers are in-
jected into a patient, and their flow through the organism is tracked. Once
one tries to formulate an analogous concept of information flow, one en-
counters a number of differences, one being the nonconservative nature of
classical information unlike matter. Classical information can be duplicated
or destroyed. Furthermore, information can be “encrypted,” making it dis-
appear unless one knows another “piece” of information that serves as the
“decryption” key.6
As an illustration, consider an organization where confidential infor-
mation is leaked to the outside and there are several suspects who have
access to the information. To identify which one is leaking the information,
one would provide each suspect with a different and deliberately made-up
piece of information and see what information gets out. The presence of the
6 Here encryption is used in the broad sense of the word. The one-time pad encryption
scheme (Schneier, 1995) is one example. Another example is the use of an address as a
key to access information on a storage medium.
7 In fact, information flow is nonzero only in the direction of the causal future (descen-
dants) of X.
8 The grid is large enough so that the agent never experiences the boundary conditions.
Without loss of generality, the grid can be considered infinite; however, only a finite
number of cells is ever reached.
Figure 2: Sensoric input S depending on the position R. In cells with only one
arrow, the sensor input is constant. In cells with multiple arrows, it is randomly
chosen from one of the arrows shown in the cell. The source is located at
the center of the grid.
the internal state or the memory of the agent’s controller with the random
variable M taking values from the set M = {1, 2, . . . , N}, where N is the
number of states; and the two-dimensional position of the agent with the
random variable R taking values from the set Z2 . S, A, M, and R completely
describe the state of the whole system at any instant.
At and Mt+1 depend directly only on (St , Mt ) for any t (see Figure 1).
The agent’s controller, regardless of its implementation, is performing a
probabilistic mapping (St , Mt ) → (At , Mt+1 ). The mapping is assumed to be
time invariant.
The probabilistic mapping can be implemented by a nondeterministic
Mealy finite state automaton operating with input set S, output set A,
and state set M. Preliminary experiments showed that the evolved map-
pings tend to be deterministic; hence, the experiments are constrained to
deterministic automata, allowing only deterministic mappings. However,
without any loss of generality, the approach can be used with stochastic
mappings. Determinism of the controller is an experimental choice, not a
limitation of the model.
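As a minimal illustration (the lookup table below is hypothetical, not one of the evolved controllers), such a deterministic Mealy automaton is just a table mapping (sensor input, memory state) to (action, next memory state):

```python
# A sketch of the kind of deterministic controller described above:
# a Mealy automaton whose transition/output function is a lookup table.
class MealyController:
    def __init__(self, table, initial_state=1):
        self.table = table          # {(s, m): (a, m_next)}
        self.state = initial_state  # the controller always starts in state 1

    def step(self, sensor_input):
        action, self.state = self.table[(sensor_input, self.state)]
        return action

# Hypothetical two-state controller over sensor inputs {'N', 'S'} and
# actions {'up', 'down'}: it echoes the input and toggles its memory.
table = {('N', 1): ('up', 2), ('S', 1): ('down', 2),
         ('N', 2): ('up', 1), ('S', 2): ('down', 1)}
ctrl = MealyController(table)
print([ctrl.step(s) for s in ['N', 'S', 'N']])  # → ['up', 'down', 'up']
```

A stochastic controller would replace each table entry with a distribution over (action, next state) pairs; the deterministic case used in the experiments is the special case where each distribution is concentrated on one pair.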
the initial position of the agent, using a controller with limited memory and
given limited time?
In finding a controller that maximizes the information flow9 from R0
into Mt , where t is the duration of the run (see Figure 3), the controller
can be seen as constructing a temporally extended communication channel
between these two temporally separated parts of the system, passing not
only through the sensor but also via the memory, the actuator, and the
environment.
Two types of controllers are evolved:10 (1) unconstrained, which are not
constrained in terms of the implemented mapping (see Figure 3a) and (2)
gradient following (see Figure 3b), which are constrained to follow the gra-
dient. Clearly, gradient-following controllers are a subset of unconstrained
controllers where the information flow from M to A is blocked.
To evaluate a controller, the agent’s initial position R0 is randomly and
uniformly distributed over a square with side d centered at the source. At
the beginning of the run, the controller is always in state 1. The agent moves
for t time steps, after which its fitness I (R0 ; Mt ) is evaluated.
I(R0; Mt) is the amount of information about the agent's initial position
captured by the state of the controller at time step t. It quantifies how well one
can predict the state of the controller at time step t from the initial position
of the agent; or, alternatively, how well one can deduce the initial position
of the agent knowing only the state of its controller at time step t.
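The fitness I(R0; Mt) can be estimated from samples of (initial position, final memory state) pairs. A minimal sketch, with a hypothetical toy distribution:

```python
import math
from collections import Counter

def mutual_information(pairs):
    """I(X;Y) in bits, estimated from a list of (x, y) samples."""
    n = len(pairs)
    pxy, px, py = Counter(pairs), Counter(), Counter()
    for x, y in pairs:
        px[x] += 1
        py[y] += 1
    return sum(c / n * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# Toy check: a memory that stores only the parity of the initial position
# captures exactly 1 bit about a uniform position over four cells.
samples = [(r0, r0 % 2) for r0 in range(4)]
print(mutual_information(samples))  # → 1.0
```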
Evolutionary search is not guaranteed to find optimal solutions. How-
ever, an upper bound for achievable performance is the amount of in-
formation about R0 in principle extractable from the cumulative sensoric
input S^c_{t−1} = (S0, S1, . . . , S_{t−1}) into Mt. Since R0, S^c_{t−1}, and Mt form a data
processing chain, it follows from the data processing inequality that no
controller can extract more information into Mt about R0 than it obtains via
S^c_{t−1}. However, since the cumulative sensoric input contains in the (tempo-
ral) limit all the information about the initial position, the main constraint
derives from the agent's controller having access only to the momentary
sensoric input, one at a time, and having limited memory.
The bound can be estimated for gradient-following controllers. Since
all gradient-following controllers by definition employ exactly the same
actuation strategy, that is, follow the gradient, the probability distribution
of the random variable S^c_{t−1} is exactly the same for all the controllers.
To calculate the bound, the information bottleneck (IB) principle (Tishby,
Pereira, & Bialek, 1999) can be used to find a mapping S^c_{t−1} → Mt that
maximizes the mutual information I(R0; Mt). The latter is an upper bound
for the amount of information about R0 any gradient-following controller
can capture in Mt.
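In a toy setting the same kind of bound can be obtained without the IB machinery, by exhaustively enumerating all deterministic maps from sensor histories to memory states. This brute-force substitute (practical only for tiny state spaces, and not the method used in the letter) makes the role of limited memory explicit:

```python
import math, itertools
from collections import Counter

def mutual_information(joint):
    """I(X;Y) in bits from a dict {(x, y): probability}."""
    px, py = Counter(), Counter()
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

def best_capture(p_r0_s, n_states):
    """Max of I(R0; M) over all deterministic maps from sensor history s
    to memory state m in {0, ..., n_states-1} (brute force)."""
    s_values = sorted({s for _, s in p_r0_s})
    best = 0.0
    for assignment in itertools.product(range(n_states), repeat=len(s_values)):
        m_of = dict(zip(s_values, assignment))
        joint = Counter()
        for (r0, s), p in p_r0_s.items():
            joint[(r0, m_of[s])] += p
        best = max(best, mutual_information(joint))
    return best

# Four equally likely starts, each with a distinct (hypothetical) sensor
# history: a one-bit memory can capture at most 1 of the 2 available bits.
p = {(i, ('hist', i)): 0.25 for i in range(4)}
print(best_capture(p, 2))  # → 1.0
```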
In Figure 4, the performance of best-evolved controllers is compared
to the size of their memory in bits. The controllers manage to extract in-
formation that is temporally “smeared” in sensoric input. Unconstrained
adjacent cells, the sensor captures only certain features of the world, and,
in the case of following the gradient, the agent’s controller is constrained to
strictly following the gradient.
The evolved maps R0 → M15 , and also M15 → R0 , are thus not arbitrary.
They reflect the structure of the agent’s sensorimotor apparatus, the corre-
lations between the actuator and the sensor—in other words, the agent’s
embodiment. The maps also reflect constraints on the agent’s information
processing capabilities. This is illustrated by the qualitative differences in
maps of gradient-following (constrained) and unconstrained controllers
(see Figure 6), as well as by quantitative differences between maps created
by controllers with different numbers of states.
5.4 Discussion
5.4.1 Experiments. The mappings from the controller’s state at the end
of the run M15 to the agent’s initial position R0 show what information
about the agent’s initial position gets captured by different controllers.
Gradient-following and unconstrained controllers as a class capture dif-
ferent features of the initial position (see Figure 6). The main qualitative
difference is that unconstrained controllers divide the initial position into
spatially solid chunks. Neighboring initial positions are generally coded
by the same state of the controller at the end of the run. Representa-
tions facilitated by gradient-following controllers lack the solidity of the
chunks. These representations tend to have stripe or checkerboard patterns.
The differences in representations derive from the differences in informa-
tion processing capabilities (unconstrained versus gradient-following) and
embodiment.
The results may thus be viewed as having extended and generalized
Linsker’s work on the infomax principle (Linsker, 1988) (see section 4.1)
but with fewer assumptions: this letter does not assume linearity, it does
not assume a particular information processing substrate for the agent’s
controller, and it allows for more than just one time step for information
processing. The unconstrained controllers were even allowed to perform ac-
tions based on the past inputs. The results, however, are similar to Linsker’s:
without being told what is important about the agent’s initial position, evo-
lution found controllers that extract various features of the initial position
as a result of maximizing information preservation. Despite the agent hav-
ing no concept of geometry, most of the features, especially the ones created
6 Representation of Time
Assume that during the experiment the state of the controller was ob-
served without knowing the time. Would the state of the controller contain
information about the time step? It turns out that it would, provided the
probability distribution of the controller’s state depends on time.
Time is modeled as a random variable T with values t in T =
{0, 1, . . . , 15}. A special case is a peaked distribution where p(t) = 1 only for
a single t and is 0 for all others, resulting in the observed state of memory
Mt . However, in general, the distribution of T may be arbitrary. Denote the
readout of the process (Mt )t=0,1,...,15 observed at an unknown time step T by
a random variable MT (see Figure 7) with values mT . Here p(mT ) is given
by
p(mT) = Pr(MT = mT) = Σ_{t∈T} Pr(T = t) Pr(MT = mT | T = t), (6.1)
where the probability distribution of the state of the readout for a particular
time step t is inherited from the memory variable at that time step:

Pr(MT = mT | T = t) = Pr(Mt = mT). (6.2)
The relation between MT , T, and the process (Mt )t=0,1,...,15 trivially follows
from equations 6.1 and 6.2:
Pr(MT = mT) = Σ_{t∈T} Pr(T = t) Pr(Mt = mT). (6.3)
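Equation 6.3 amounts to mixing the per-time-step memory distributions with the weights Pr(T = t). A small sketch with hypothetical distributions:

```python
from collections import Counter

def readout_distribution(p_t, p_m_given_t):
    """p(m_T) per equation 6.3: a mixture of the per-step memory
    distributions, weighted by the distribution of the unknown time T."""
    p_m = Counter()
    for t, pt in p_t.items():
        for m, pm in p_m_given_t[t].items():
            p_m[m] += pt * pm
    return dict(p_m)

# Hypothetical two-step process: memory is 'a' at t=0 and 'b' at t=1;
# observing at a uniformly random time mixes the two.
p_t = {0: 0.5, 1: 0.5}
p_m_given_t = {0: {'a': 1.0}, 1: {'b': 1.0}}
print(readout_distribution(p_t, p_m_given_t))  # → {'a': 0.5, 'b': 0.5}
```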
Figure 8: p(t|(s, m)_T): representation of time by the best two- and nine-
state controllers. Vertical axis: the joint sensor-memory states (s, m)_T;
horizontal axis: the time step t ∈ {0, 1, . . . , 15}. The darker the cell, the higher
the conditional probability p(t|(s, m)_T). I = I((S, M)_T; T), measured in bits.
11 If the controllers were evolved to capture as much information about time as possi-
ble, they would evolve into clocks. The simplest solution is to shut out the environment
and permute the controller’s state.
12 Strictly speaking, the reconstructed time is not metric time but an ordinal sequence.
from the top) are such examples. These states differentiate between odd
and even time steps. In Figure 8d there are states that capture time modulo
2, 3, and 4.
Another type of states captures global properties of time, such as being
close to the beginning, the middle, or the end of the experiment. Rows 1
and 4 in Figure 8 are an example.
There are also hybrid states, which capture both local and global proper-
ties of time. Row 1 and middle rows in Figure 8 are such states. They locate
the system in time in terms of being at a particular modulo 4 time step and
also in terms of being at the beginning or the end of the experiment.
The controllers were evolved to capture information about the agent’s
initial position at the end of 15 time steps. Why do the controllers capture
information about time? Is capturing information about time an evolutionarily
neutral side effect, or is it crucial to capturing information about the
agent’s initial position? For example, it may be that the processing and in-
tegration of information by the controller depend on time, or that different
methods of obtaining the information about the agent’s position are used
at different time steps, or that the controller represents the information in
a time-dependent way. The issue of time-dependence of representation is
addressed in the next section.
Section 5 showed how M15, the state of the controller at the end of the experiment,
represents R0, the agent's position at the beginning of the experiment.
It turns out that the representation of the initial position in the controller’s
memory throughout the experiment is time dependent. For illustration
purposes the representation used by the best eight-state unconstrained con-
troller is shown in Figure 9.
The degree of time dependence of the representation of the initial posi-
tion R0 in the process (Mt )t=0,1,...,15 is quantified as the probabilistic depen-
dence of the mapping MT → R0 on the time step T, where MT is the readout
of the process at an unknown time T. (See section 6 for the definition of T
and MT .)
For every observed state mT of the controller, the mapping MT → R0 ,
characterized by the conditional probability distribution p(r0 |mT ), describes
where and with what probability the agent could have started given that
the controller was found in that state at an unknown time step.
If, in addition to knowing mT , the time step t is known, the mapping
can be used to find out where the agent could have started. For example,
in the case of the best evolved eight-state unconstrained controller (see
Figure 9) knowing whether the time step is odd or even helps to significantly
reduce the uncertainty in the agent’s initial position given the state of the
controller. The difference in how the two conditional distributions (time-less
and time-dependent) describe the initial position can be used to quantify
the dependence of the mapping MT → R0 on the time step T.
Here the degree of time dependence of the representation is quantified as
the Kullback-Leibler distance from the conditional distribution p(r0 |mT , t)
to the conditional distribution p(r0 |mT ):13
D_{p(mT,t)}( p(r0|mT, t) || p(r0|mT) ) = H(R0|MT) − H(R0|MT, T)
= I(R0; T|MT).
13 The method for measuring time dependence of a mapping between two random
variables is quite general. In fact, equation 7.2 applies to the general case of having three
arbitrary random variables X, Y, Z and measuring the dependency of the mapping
X → Y on Z:

D_{p(x,z)}( p(y|x, z) || p(y|x) ) = I(Y; Z|X),

I(Y; Z|X) + I(X; Z) = I(X; Z|Y) + I(Y; Z).
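The identity in footnote 13 can be checked numerically from any joint distribution. The sketch below uses a hypothetical joint with z = x XOR y, in which all of the dependence between Y and Z is visible only conditionally on X:

```python
import math
from collections import Counter

def H(joint, idx):
    """Entropy in bits of the marginal of `joint` over the key indices idx."""
    marg = Counter()
    for key, p in joint.items():
        marg[tuple(key[i] for i in idx)] += p
    return -sum(p * math.log2(p) for p in marg.values() if p > 0)

def I_cond(joint, a, b, c):
    """Conditional mutual information I(A;B|C) in bits; a, b, c are
    tuples of key indices into the joint distribution."""
    return H(joint, a + c) + H(joint, b + c) - H(joint, a + b + c) - H(joint, c)

# Joint over (x, y, z) with z = x XOR y: an "encrypted" dependence.
joint = {(x, y, x ^ y): 0.25 for x in (0, 1) for y in (0, 1)}
X, Y, Z = (0,), (1,), (2,)
lhs = I_cond(joint, Y, Z, X) + I_cond(joint, X, Z, ())
rhs = I_cond(joint, X, Z, Y) + I_cond(joint, Y, Z, ())
print(round(lhs, 6), round(rhs, 6))  # → 1.0 1.0
```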
Since the initial position of the agent is not dependent on time, the last term,
I(R0; T), vanishes. The time dependence of the two representations and the
amount of information captured by the controller about time, I(MT; T), are
then related as

I(R0; T|MT) + I(MT; T) = I(MT; T|R0). (7.3)
8 Factorization of Representations
Y is to be split into two new random variables, Z1 and Z2 (see Figure 10),
such that they meet the following criteria.14
First, the two new variables should capture as much information about
X as possible. Hence, I (Z1 , Z2 ; X) should be as large as possible. Second,
the two variables should comprise an orthogonal coordinate system for X.
Hence, the two variables should be as independent as possible (I (Z1 ; Z2 )
close to zero), and the conditional probability distribution p(z1 , z2 |x) de-
scribing the projection of X onto Z1 and Z2 should be closely approximated
by the product of partial projections p(z1 |x) and p(z2 |x). The distance be-
tween p(z1 , z2 |x) and the product p(z1 |x) p(z2 |x) can be measured as the
Kullback-Leibler distance:
D_{p(x)}( p(z1, z2|x) || p(z1|x) p(z2|x) ) = I(Z1; Z2|X). (8.1)
The distance should be as close to zero as possible. Note that although the
distance is for conditional distributions, it still depends on the distribution
of X.
Minimizing I (Z1 ; Z2 |X) makes Z1 and Z2 as conditionally independent
given X as possible. This avoids synergetic or XOR-like coding schemes
where, taken separately, Z1 and Z2 contain less information about X than
when they are combined.15 (See section G.3 for more details.)
To summarize the task, given a joint distribution p(x, y), find two con-
ditional distributions p(z1 |y) and p(z2 |y) satisfying the following three
criteria:
1. I (Z1 , Z2 ; X) ≤ I (X; Y) is maximal.
2. I (Z1 ; Z2 ) ≥ 0 is minimal.
3. I (Z1 ; Z2 |X) ≥ 0 is minimal (goal: p(z1 , z2 |x) = p(z1 |x) p(z2 |x)).
14 Factorizing into three or more variables places more constraints on the solution. Two
variables are sufficient to start with. Later, each of the new variables can be factorized
further if necessary.
15 Synergy colloquially refers to the case when the whole is more than the sum of its
parts.
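The three criteria can be computed directly from a joint distribution p(x, y) and a candidate pair of deterministic maps y → z1, y → z2. A minimal sketch, using a hypothetical ideal case in which X = Y consists of two independent bits and each map reads off one bit:

```python
import math
from collections import Counter

def H(joint, idx):
    """Entropy in bits of the marginal of `joint` over the key indices idx."""
    marg = Counter()
    for key, p in joint.items():
        marg[tuple(key[i] for i in idx)] += p
    return -sum(p * math.log2(p) for p in marg.values() if p > 0)

def criteria(p_xy, z1_of, z2_of):
    """The three factorization criteria for deterministic maps y -> z1, z2:
    returns (I(Z1,Z2;X), I(Z1;Z2), I(Z1;Z2|X))."""
    joint = Counter()
    for (x, y), p in p_xy.items():
        joint[(x, z1_of[y], z2_of[y])] += p
    X, Z1, Z2 = (0,), (1,), (2,)
    i_capture = H(joint, Z1 + Z2) + H(joint, X) - H(joint, X + Z1 + Z2)
    i_dep = H(joint, Z1) + H(joint, Z2) - H(joint, Z1 + Z2)
    i_syn = (H(joint, Z1 + X) + H(joint, Z2 + X)
             - H(joint, X + Z1 + Z2) - H(joint, X))
    return i_capture, i_dep, i_syn

# Hypothetical ideal case: x = y = (bit1, bit2) uniform, and each map
# reads off one bit, so the criteria hit the ideal point (2, 0, 0).
p_xy = {((b1, b2), (b1, b2)): 0.25 for b1 in (0, 1) for b2 in (0, 1)}
z1_of = {y: y[0] for (_, y) in p_xy}
z2_of = {y: y[1] for (_, y) in p_xy}
print(tuple(round(v, 6) for v in criteria(p_xy, z1_of, z2_of)))  # → (2.0, 0.0, 0.0)
```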
“smeared” over a large number of states (|T | · |S| · |M|) and, being an av-
erage quantity, cannot be extracted without noticeable loss into two binary
variables. This hypothesis was confirmed by relaxing the factorization con-
straints and simply trying to extract as much of the information as possible
into just four states. The amount of the extracted information is a lower
bound for the distance from the ideal point the factorization algorithm can
achieve. Thus, the original factorization may have been less successful sim-
ply because the controllers were not evolved to capture information about
time in a convenient compressed way.
The factorization of the highly compressed information about the initial
position offers the advantage of viewing the captured features in terms
of two smaller independent features. The case of the best evolved eight-
state gradient-following controller in Figure 11 is a good example. There,
factorization creates two natural variables: left versus right and black versus
white cell of the checkerboard.
When factorizing the relatively scarce and widely “smeared” informa-
tion about time, the most useful feature of factorization is that relevance-
based compression is performed. This projects the information about time
from a high number of states to just a few. These are easier to interpret.
For example, it becomes obvious that the two most common features of
time captured are roughly (1) whether the time step is odd or even and (2)
whether roughly half of the run has elapsed.
9 Conclusion
even when one does not understand the underlying substrate. For example,
there is no requirement to assume finely grained components (e.g., neurons)
that perform the information processing. The model of computation is thus
more abstract and accommodating than that of traditional infomax. The
various information-theoretic tools presented here can provide a peek into
systems that do information processing in a nonobvious way.
Viewing information as the currency or essence of computation in the
perception-action loop provides a plausible mechanism, where a few as-
sumptions and constraints, such as embodiment with various bottlenecks
coming from the limited capacity of sensory and motor channels, and lim-
ited memory, can give rise to structures in a self-organized way. This paves
the way toward a principled and general understanding of the mechanisms
guiding the evolution of sensors, actuators, and information processing
substrates in nature and provides insights into the design of mechanisms
for artificial evolution.
The main contributions of this letter are (1) building up an information-
theoretic picture of the perception-action loop with a minimal set of assump-
tions, (2) maximizing information flow through the full perception-action
loop of an agent through time, as opposed to the well-established maxi-
mization of the flow through layers in neural networks, and (3) introducing
the use of information-theoretic tools to analyze the representations arising
from the maximization of information flow through the perception-action
loop.
H(X) := − Σ_{x∈X} p(x) log2 p(x).
D_{p(y)}( p1 || p2 ) = Σ_{y∈Y} p(y) D( p1(x|Y = y) || p2(x|Y = y) )
= Σ_{y∈Y} Σ_{x∈X} p(y) p1(x|y) log2 [ p1(x|y) / p2(x|y) ].
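The two definitions above translate directly into code. A minimal sketch, with hypothetical distributions:

```python
import math

def entropy(p):
    """H(X) = -sum_x p(x) log2 p(x), in bits; p is a dict {x: probability}."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

def conditional_kl(p_y, p1_x_given_y, p2_x_given_y):
    """D_p(y)(p1 || p2): the Kullback-Leibler distance between two
    conditional distributions p1(x|y) and p2(x|y), averaged over p(y)."""
    return sum(p_y[y] * p1y[x] * math.log2(p1y[x] / p2_x_given_y[y][x])
               for y, p1y in p1_x_given_y.items()
               for x in p1y if p1y[x] > 0)

# Hypothetical check: identical conditionals are at distance 0; a fair
# coin measured against a biased one is not.
p_y = {'y': 1.0}
p1 = {'y': {'h': 0.5, 't': 0.5}}
p2 = {'y': {'h': 0.25, 't': 0.75}}
print(round(conditional_kl(p_y, p1, p1), 6))  # → 0.0
print(round(conditional_kl(p_y, p1, p2), 6))  # → 0.207519
```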
The agent’s controller selects actions based on sensoric input and its own
memory. The assumptions about the agent’s controller are minimal: the
controller is finite time and finite state. Given the sets of sensoric states S,
actions A, and the memory states of the controller M, any such controller
To spot and avoid undersampling, for each of the possible initial states of
the system, 256 or more samples are produced depending on the quantities
measured. The number of samples is then increased by a factor of at least
16 to check whether the quantities of interest remain stable.
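The stability check described above can be sketched as follows. The tolerance and the quantity being estimated here are illustrative choices, not the letter's exact procedure:

```python
import math, random
from collections import Counter

def entropy_estimate(samples):
    """Plug-in entropy estimate in bits from a list of samples."""
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in Counter(samples).values())

def stable_estimate(draw, n=256, factor=16, tol=0.02):
    """Estimate a quantity at n samples, re-estimate at factor*n samples,
    and flag undersampling when the two disagree by more than tol bits."""
    small = entropy_estimate([draw() for _ in range(n)])
    large = entropy_estimate([draw() for _ in range(factor * n)])
    return large, abs(large - small) <= tol

# Hypothetical source: a uniform 2-bit variable (true entropy 2 bits).
random.seed(0)
est, stable = stable_estimate(lambda: random.randrange(4))
print(round(est, 3), stable)
```

The plug-in estimator is biased downward at small sample sizes, which is exactly what the comparison between the two sample sizes is meant to expose.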
This section proves the assertion in section 4.4 that when the agent follows
the gradient, most of the information about the agent’s initial position passes
through its sensors.
Denote the agent’s initial position with a random variable R0 . The en-
tropy of initial position is H(R0 ). This is the maximum amount of informa-
tion there is to know about the initial position.
Assume that the agent follows the gradient for t time steps. Denote the
agent’s position after t time steps with a random variable Rt .
Denote the sequence of sensoric input up to but excluding the time step
t with a random variable S^c_{t−1} = (S0, S1, . . . , S_{t−1}).
The goal is to prove that

I(S^c_{t−1}; R0) = H(R0) − ε, (E.1)

where ε → 0 with increasing t.
All actions move the agent deterministically over the grid. Given a final
position rt, the sequence of actions performed, a^c_{t−1}, identifies the exact
starting position r0. This means that the initial position is a deterministic
function of the final position and the sequence of actions:

H(R0 | Rt, A^c_{t−1}) = 0. (E.3)
When the agent follows the gradient, it does not end up exactly at the
source. Because there is no “do nothing” action, once the agent is at the
source, it cannot stay there and moves into one of the four adjacent cells at
the next time step. One time step later, it comes back following the gradient.
Thus, after a certain amount of time, the agent cycles through two states:
(1) being in the center and moving in one of the four randomly chosen
directions and (2) being in one of the four cells adjacent to the central cell
and moving back to the central cell.
Due to this fact, given enough time, the tails of sequences of actions will
consist of a repeating pattern: a random action, then the action opposite to
it—for example, a^c_t = (. . . , ↑, ↓, →, ←, ↑). When the agent is at the central
cell, it enters the loop of performing an action followed by its counteraction.
When not in the central cell, the agent performs an action followed by its
I(R0; Rt | A^c_{t−1}) = H(R0 | A^c_{t−1}) − H(R0 | A^c_{t−1}, Rt) (E.5)
≤ H(R0 | A^c_{t−1}), (E.6)
it follows that
I(R0; Rt | A^c_{t−1}) → 0 (E.7)
When the agent follows the gradient, its actions are completely deter-
mined by the sensoric input. The sequence of actions is in one-to-one cor-
respondence with the sequence of sensoric input. Hence, the sequence of
actions contains the same amount of information about the initial position
as the sequence of sensoric input:
I(S^c_{t−1}; R0) = I(A^c_{t−1}; R0). (E.9)
Thus,
I(S^c_{t−1}; R0) → H(R0) (E.10)

with increasing t.
I( (M^pos, M^clock)_T ; T | R0 ) = I( (M^clock)_T ; T ). (F.1)
F.3 Factoring Time Out. In the light of the two examples above, it is
possible to give an interpretation to equation 7.3, linking time dependence
of the mappings between R0 and MT and the amount of information about
time contained in MT . The state of the controller MT can be thought of as
containing two pieces of information about time. One is available without
reference to any other variable; its amount is quantified by I (MT ; T). The
other is “encrypted” through R0 . Knowledge of R0 provides I (R0 ; T|MT )
more bits of information about time. The total amount of information about
time extractable from MT with the help of R0 is then I (MT ; T|R0 ).
One could attempt to factor MT into two variables: one containing infor-
mation about time, the other containing information about the rest of MT . In
the clock example above, this type of factorization is easy. The new variables
correspond to M^clock_T and M^pos_T by definition. In the XOR example, though,
it is not possible to split MT due to the fact that all information about time
in MT is “encrypted” through R0 and is thus not “freely” available without
the knowledge of R0 . The term I (MT ; T) in equation 7.3 is the upper bound
to the amount of information about time that can be factored out of MT .
Factoring out or removing information about time from MT reduces
the time dependence of the R0 → MT mapping by exactly the amount of
information about time removed. However, this still does not guarantee that
the mapping will become time independent. At best the time dependence
of R0 → MT can be lowered to the time dependence of its inverse, MT →
R0 . However, the time dependence of the latter cannot be influenced by
removing information about time from MT .
G.1 Ideal Solutions. An ideal situation is when (1) all of the information
about X contained in Y is captured by (Z1 , Z2 ), (2) Z1 and Z2 are indepen-
dent, and (3) the mapping X → (Z1 , Z2 ) is the product of the mappings
X → Z1 and X → Z2 . The ideal point in terms of the left-hand side values
of the three criteria above is thus (I (X; Y), 0, 0).
It should be noted that in case Z1 or Z2 is large enough to accommodate
all the states of Y, it is possible to create a trivial factorization, such as where
Y → Z1 is information preserving and Y → Z2 preserves no information
about Y. This trivial factorization meets the ideal point criterion but is
not interesting. A similarly trivial factorization is known for numbers: any
number can be factorized into the product of itself and 1. To avoid trivial
factorization, both |Z1 | and |Z2 | were made smaller than |Y|.
where c = min( I(X; Y), log2(|Z1| |Z2|) ) is the maximum amount of informa-
tion that Z1 and Z2 could in principle capture about X. c is a constant for all
tuples with same sizes of Z1 and Z2 . The goal is to find a tuple with minimal
distance from the ideal point.
Here hill climbing is employed in the space of the tuples of mappings
( p(z1 |y), p(z2 |y)) to search for the best tuple. The search is restricted to de-
terministic mappings. The best solution is chosen from five independent
runs of the hill climbing search. Each search run is started with a randomly
chosen tuple and run for 50,000 steps. At each step, the tuple is modified
1 + (t mod 20) times, where t is the step. A modification is performed by
randomly choosing one of the two mappings and then deterministically
mapping a randomly chosen state y to a randomly chosen state z1 or z2 ,
depending on which of the two mappings is modified. If after the modifi-
cations the new tuple is closer to the ideal point than the old one, the tuple
is used as a base for the next step; otherwise, the new tuple is discarded,
and the old one is used at the next step.
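The search procedure can be sketched as follows. This simplified version (one mutation per step and a fixed random seed, rather than the 1 + (t mod 20) modifications and 50,000 steps described above) illustrates the idea on a hypothetical toy distribution:

```python
import math, random
from collections import Counter

def H(joint, idx):
    """Entropy in bits of the marginal of `joint` over the key indices idx."""
    marg = Counter()
    for key, p in joint.items():
        marg[tuple(key[i] for i in idx)] += p
    return -sum(p * math.log2(p) for p in marg.values() if p > 0)

def distance_to_ideal(p_xy, z1_of, z2_of, c):
    """Euclidean distance of (I(Z1,Z2;X), I(Z1;Z2), I(Z1;Z2|X)) from (c, 0, 0)."""
    joint = Counter()
    for (x, y), p in p_xy.items():
        joint[(x, z1_of[y], z2_of[y])] += p
    X, Z1, Z2 = (0,), (1,), (2,)
    i_cap = H(joint, Z1 + Z2) + H(joint, X) - H(joint, X + Z1 + Z2)
    i_dep = H(joint, Z1) + H(joint, Z2) - H(joint, Z1 + Z2)
    i_syn = (H(joint, Z1 + X) + H(joint, Z2 + X)
             - H(joint, X + Z1 + Z2) - H(joint, X))
    return math.sqrt((c - i_cap) ** 2 + i_dep ** 2 + i_syn ** 2)

def factorize(p_xy, n_z=2, steps=2000, restarts=5, seed=1):
    """Hill climbing over pairs of deterministic maps y -> z, best of
    several restarts; mutations that do not worsen the distance are kept."""
    rng = random.Random(seed)
    ys = sorted({y for _, y in p_xy})
    i_xy = H(p_xy, (0,)) + H(p_xy, (1,)) - H(p_xy, (0, 1))
    c = min(i_xy, math.log2(n_z * n_z))
    best = None
    for _ in range(restarts):
        z1 = {y: rng.randrange(n_z) for y in ys}
        z2 = {y: rng.randrange(n_z) for y in ys}
        d = distance_to_ideal(p_xy, z1, z2, c)
        for _ in range(steps):
            zm = rng.choice((z1, z2))
            y = rng.choice(ys)
            old = zm[y]
            zm[y] = rng.randrange(n_z)
            d_new = distance_to_ideal(p_xy, z1, z2, c)
            if d_new <= d:
                d = d_new
            else:
                zm[y] = old  # revert a worsening mutation
        if best is None or d < best[2]:
            best = (dict(z1), dict(z2), d)
    return best

# X = Y = two independent bits: an ideal factorization (one bit per map) exists.
p_xy = {((b1, b2), (b1, b2)): 0.25 for b1 in (0, 1) for b2 in (0, 1)}
z1, z2, dist = factorize(p_xy)
print(round(dist, 6))
```

Accepting mutations that leave the distance unchanged lets the search drift along plateaus, which matters because many distinct map pairs achieve the same distance from the ideal point.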
Now two differences between the factorization and the parallel bottleneck
are apparent. First, the factorization method in this letter can be seen
as using a constant parallel bottleneck β = 2. Second, the last term in the
two equations is different. The latter needs closer attention.
I (Z1 ; Z2 |X) = 0 requires that Z1 and Z2 are conditionally independent
given X. This avoids synergetic, XOR-like coding schemes where Z1 and
Z2 individually cannot be used to obtain information about X, where only
their combination is helpful. I (Z; Y|X) = 0 requires that Z and Y share no
information other than information about X. In other words, this requires
that Z1 and Z2 contain information only about X.
The two requirements are different. However, they are related. Keeping
in mind that Z = (Z1 , Z2 ) and that Z1 and Z2 are obtained from Y using
independent mappings,
Acknowledgments
We thank the Condor Team from the University of Wisconsin, whose high-
throughput computing system Condor provided a convenient way to run
large numbers of simulations on ordinary workstations.
References
Nadal, J.-P., & Parga, N. (1995). Information transmission by networks of nonlinear
neurons. In Proc. of the Third Workshop on Neural Networks: From Biology to High
Energy Physics (1994, Elba, Italy). International Journal of Neural Systems (Suppl.),
153–157.
Nolfi, S., & Marocco, D. (2002). Active perception: A sensorimotor account of object
categorization. In B. Hallam, D. Floreano, J. Hallam, G. Hayes, & J.-A. Meyer
(Eds.), From Animals to Animats 7, Proceedings of the VII International Conference on
Simulation of Adaptive Behavior (pp. 266–271). Cambridge, MA: MIT Press.
Pearl, J. (2001). Causality: Models, reasoning, and inference. Cambridge: Cambridge
University Press.
Polani, D., Nehaniv, C. L., Martinetz, T., & Kim, J. T. (2006). Relevant information
in optimized persistence vs. progeny strategies. In L. M. Rocha, M. Bedau, D.
Floreano, R. Goldstone, A. Vespignani, & L. Yaeger (Eds.), Proc. Artificial Life X
(pp. 337–343). Cambridge, MA: MIT Press.
Powers, W. T. (1974). Behavior: The control of perception. London: Wildwood House.
Schmidhuber, J. (1992). Learning factorial codes by predictability minimization.
Neural Computation, 4(6), 863–879.
Schmidhuber, J., Eldracher, M., & Foltin, B. (1996). Semilinear predictability mini-
mization produces well-known feature detectors. Neural Computation, 8(4), 773–
786.
Schneidman, E., Bialek, W., & Berry II, M. J. (2003). Synergy, redundancy, and inde-
pendence in population codes. Journal of Neuroscience, 23(37), 11539–11553.
Schneier, B. (1995). Applied cryptography: Protocols, algorithms, and source code in C
(2nd ed.). New York: Wiley.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical
Journal, 27, 379–423.
Tishby, N., Pereira, F. C., & Bialek, W. (1999). The information bottleneck method.
In Proceedings of the 37th Annual Allerton Conference on Communication, Control,
and Computing (pp. 368–377). Urbana-Champaign: Department of Electrical and
Computer Engineering, University of Illinois.
Touchette, H., & Lloyd, S. (2000). Information-theoretic limits of control. Phys. Rev.
Lett., 84, 1156.
Touchette, H., & Lloyd, S. (2004). Information-theoretic approach to the study of
control systems. Physica A, 331(1–2), 140–172.
Usher, M. (2001). A statistical referential theory of content: Using information theory
to account for misrepresentation. Mind and Language, 16(3), 311–334.
Wennekers, T., & Ay, N. (2005). Finite state automata resulting from temporal infor-
mation maximization. Neural Computation, 17(10), 2258–2290.