Principles of Neural Information Theory
Computational Neuroscience and Metabolic Efficiency
James V Stone
Sebtel Press: A Tutorial Introduction Book
ISBN 978-0-9933679-2-2
Contents

2. Information Theory
   2.1. Introduction
   2.2. Finding a Route, Bit by Bit
   2.3. Information and Entropy
   2.4. Maximum Entropy Distributions
   2.5. Channel Capacity
   2.6. Mutual Information
   2.7. The Gaussian Channel
   2.8. Fourier Analysis
   2.9. Summary

6. Encoding Time
   6.1. Introduction
   6.2. Linear Models
   6.3. Neurons and Wine Glasses
   6.4. The LNP Model
   6.5. Estimating LNP Parameters
   6.6. The Predictive Coding Model
   6.7. Estimating Predictive Coding Parameters
   6.8. Evidence for Predictive Coding
   6.9. Summary

Appendices
A. Glossary
References
Index
Chapter 1
In the Light of Evolution
1.1. Introduction
Just as a bird cannot fly without obeying the laws of physics, so a brain
cannot function without obeying the laws of information. And, just as
the shape of a bird’s wing is determined by the laws of physics, so the
structure of a neuron is determined by the laws of information.
Neurons communicate information, and that is pretty much all that
they do. But neurons are extraordinarily expensive to make, maintain,
and use 63 . Half of a child’s energy budget, and a fifth of an adult’s
budget, is required just to keep the brain ticking over 105 (Figure 1.1).
For both children and adults, half the brain’s energy budget is used
for neuronal information processing, and the rest is used for basic
maintenance 8 . The high cost of using neurons accounts for the fact
that only 2-4% of them are active at any one time 65. Given that
neurons are so expensive, we should be unsurprised to find that they
have evolved to process information efficiently.
All that we see begins with an image formed on the eye’s retina (Figure
1.2). Initially, this image is recorded by 126 million photoreceptors
within the retina. The outputs of these photoreceptors are then
repackaged, or encoded, via a series of intermediate connections, into a
sequence of digital pulses, or spikes, which travel along the one million
neurons of the optic nerve that connect the eye to the brain.
The fact that we see so well suggests that the retina must be
extraordinarily accurate when it encodes the image into spikes, and
the brain must be equally accurate when it decodes those spikes into
all that we see (Figure 1.3). But the eye and brain are not only good
at translating the world into spikes, and spikes into perception; they
are also efficient at transmitting information from the eye to the brain.
Precisely how efficient is the subject of this book.
Figure 1.1. (a) The human brain weighs 1300 g, contains about 10^10 neurons,
and consumes 12 watts of power. The outer surface seen here is the neocortex.
(b) Each neuron (plus its support structures) therefore accounts for an average
of 1200 pJ/s (1 pJ/s = 10^-12 J/s). From Wikimedia Commons.
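The per-neuron figure in the caption follows from dividing the brain's total power by its number of neurons:

$$\frac{12\ \mathrm{J/s}}{10^{10}\ \mathrm{neurons}} = 1.2 \times 10^{-9}\ \mathrm{J/s} = 1200\ \mathrm{pJ/s\ per\ neuron}.$$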
1.3. In the Light of Evolution
Figure 1.2. The human eye, showing the lens, the retina, and the optic nerve.
If a task is time-critical then each neuron must deliver information at a
high rate, which demands high coding efficiency (even if the energy cost
per bit is large). Conversely,
if a task is not time-critical then each neuron can deliver information
at a low rate. The laws of information dictate that if information rates
are low then the cost per bit can be low, so low information rates can
be achieved with high metabolic efficiency 66;109 . Both coding efficiency
and metabolic efficiency are defined formally in Chapters 3 and 4.
The idea of coding efficiency has been enshrined as the efficient
coding hypothesis. This hypothesis has been interpreted in various
ways but, in essence, it assumes that neurons encode sensory data
so as to transmit as much information as possible 82. The efficient coding
hypothesis has been championed over many years by Horace Barlow
(1959) 13, amongst others (e.g. Attneave, 1954 7; Atick, 1992 3),
and has had a substantial influence on computational neuroscience.
However, accumulating evidence, summarised in this text, suggests that
metabolic efficiency, rather than coding efficiency, may be the dominant
influence on the evolution of neural systems.
We should note that there are a number of different computational
models which collectively fall under the umbrella term ‘efficient coding’.
The results of applying the methods associated with these models tend
to be similar 93, even though the methods are different. These methods
include sparse coding 41 , principal component analysis, independent
component analysis 15;110 , information maximisation (infomax) 67 ,
redundancy reduction 5 , and predictive coding 90;107 .
Figure 1.3. Encoding and decoding (schematic). A rapidly changing
luminance (bold curve in b) is encoded as a spike train (a), which is decoded
to estimate the luminance (thin curve in b). See Chapters 6 and ??.
1.4. In Search of General Principles
Even though Marr did not address the role of information theory
directly, his analytic approach has served as a source of inspiration,
not only for this book, but also for much of the progress made within
computational neuroscience.
1.6. An Overview of Chapters
Note that the following section contains technical terms which are
explained fully in the relevant chapters, and in the Glossary.
To fully appreciate the importance of information theory for neural
computation, some familiarity with the basic elements of information
theory is required; these elements are presented in Chapter 2 (which
can be skipped on a first reading of the book). In Chapter 3, we use
information theory to estimate the amount of information in the output
of a spiking neuron, and also to estimate how much of this information
(i.e. mutual information) is related to the neuron’s input. This leads to
an analysis of the nature of the neural code; specifically, whether it is a
rate code or a spike timing code. We also consider how often a neuron
should produce a spike so that each spike conveys as much information
as possible, and we discover that the answer involves a vital property
of efficient communication (i.e. linear decodability).
In Chapter 4, we discover that one of the consequences of information
theory (specifically, Shannon’s noisy coding theorem) is that the cost
of information rises inexorably and disproportionately with information
rate. We consider empirical results which suggest that this steep rise
accounts for physiological values of axon diameter, the distribution of
axon diameters, mean firing rate, and synaptic conductance; values
which appear to be ‘tuned’ to minimise the cost of information.
In Chapter 5, we consider how the correlations between the outputs
of photoreceptors sensitive to similar colours threaten to reduce
information rates, and how this can be ameliorated by synaptic pre-
processing in the retina. This pre-processing makes efficient use of the
available neuronal infrastructure to maximise information rates, which
explains not only how, but also why, there is a red-green aftereffect,
but no red-blue aftereffect. A more formal account involves using
principal component analysis to estimate the synaptic connections
which maximise neuronal information throughput.
In Chapter 6, the lessons learned so far are applied to the problem
of encoding time-varying visual inputs. We explore how a standard
linear-nonlinear-Poisson (LNP) neuron model can be used as a model of
physiological neurons.
We then introduce a model based on predictive coding, which yields
Chapter 2
Information Theory
2.1. Introduction
Every image formed on the retina, and every sound that reaches the
ear, is sensory data, which contains information about some aspect of
the world. The limits on an animal's ability to capture information
from the environment depend on packaging (encoding) sensory data
efficiently, and on extracting (decoding) that information. The efficiency of
these encoding and decoding processes is dictated by a few fundamental
theorems, which represent the foundations on which information theory
is built. The theorems of information theory are so important that they
deserve to be regarded as the laws of information 91;100;112 .
The basic laws of information can be summarised as follows. For
any communication channel (Figure 2.1): 1) there is a definite upper
limit, the channel capacity, to the amount of information that can be
communicated through that channel, 2) this limit shrinks as the amount
of noise in the channel increases, 3) this limit can very nearly be reached
by judicious packaging, or encoding, of data.
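To make points 1) and 2) concrete, here is a short Python sketch (ours, not from the book) of the capacity of the Gaussian channel analysed in Section 2.7: capacity is finite, and it shrinks as noise power grows (see also Equation F.24 in the Key Equations appendix).

```python
import numpy as np

def gaussian_channel_capacity(signal_power, noise_power):
    """Capacity in bits per value of a Gaussian channel: C = 0.5 * log2(1 + S/N)."""
    return 0.5 * np.log2(1.0 + signal_power / noise_power)

signal_power = 1.0
for noise_power in [0.01, 0.1, 1.0, 10.0]:
    c = gaussian_channel_capacity(signal_power, noise_power)
    print(f"S/N = {signal_power / noise_power:7.2f} -> capacity = {c:.3f} bits per value")
```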
Further Reading
Bialek (2012) 16 . Biophysics. A comprehensive and rigorous account of
the physics which underpins biological processes, including neuroscience
and morphogenesis. Bialek adopts the information-theoretic approach,
so successfully applied in his previous book (Spikes, see below). The
writing style is fairly informal, but this book assumes a high level of
mathematical competence. Highly recommended.
Dayan P and Abbott LF (2001) 29. Theoretical Neuroscience. This
classic text is a comprehensive and rigorous account of computational
neuroscience, which demands a high level of mathematical competence.
Tutorial Material
Byrne JH, Neuroscience Online, McGovern Medical School University
of Texas. http://neuroscience.uth.tmc.edu/index.htm.
Frisby JP and Stone JV (2010). Seeing: The Computational Approach
to Biological Vision.
Laughlin SB (2006). The Hungry Eye: Energy, Information and
Retinal Function,
http://www.crsltd.com/guest-talks/crs-guest-lecturers/simon-laughlin.
Lay DC (1997). Linear Algebra and its Applications.
Pierce JR (1980). An Introduction to Information Theory: Symbols,
Signals and Noise.
Riley KF, Hobson MP and Bence SJ (2006). Mathematical Methods
for Physics and Engineering.
Scholarpedia. This online encyclopedia includes excellent tutorials on
computational neuroscience. www.scholarpedia.org.
Smith S (2013). Digital signal processing: A practical guide for
engineers and scientists. Freely available at www.dspguide.com.
Stone JV (2012). Vision and Brain: How We See The World.
Stone JV (2013). Bayes’ Rule: A Tutorial Introduction to Bayesian
Analysis.
Stone JV (2015). Information Theory: A Tutorial Introduction.
Appendix A
Glossary
Appendix B
Mathematical Symbols
∝ proportional to.
p(x_i) the probability that the random variable x has the value x_i.
p(x_i|y_i) the conditional probability that x = x_i given that y = y_i.
Appendix C
Correlation and Independence
We are not usually concerned with the value of the constant c_xy, but
for completeness, it is defined as c_xy = var(x) × var(y), where var(x)
and var(y) are the variances of x and y, respectively. Variance is a
measure of the ‘spread’ in the values of a variable. For example, the
variance in x is estimated as

$$\operatorname{var}(x) = \frac{1}{n}\sum_{t=1}^{n}(x_t - \bar{x})^2. \qquad (C.3)$$
$$= E[x_t\, y_t], \qquad (C.6)$$
In particular, if two signals are Gaussian and they have a joint Gaussian
distribution (as in Figure C.1b) then being uncorrelated means they are
also independent.
Figure C.1. (a) Joint probability density function p(x,y) for correlated
Gaussian variables x and y. The probability density p(x,y) is indicated by the
density of points on the ground plane at (x,y). The marginal distributions
p(x) and p(y) are shown on the side axes. (b) Joint probability density function
p(x,y) for independent Gaussian variables, for which p(x,y) = p(x)p(y).
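A minimal numerical sketch of these ideas (the values and variable names are ours, not the book's): draw jointly Gaussian variables x and y with a chosen correlation, then estimate var(x) as in Equation C.3 and the correlation coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Jointly Gaussian x and y with correlation rho (rho = 0 gives panel (b): independence).
rho = 0.8
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)

var_x = np.mean((x - x.mean()) ** 2)                  # Equation C.3
var_y = np.mean((y - y.mean()) ** 2)
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
corr = cov_xy / np.sqrt(var_x * var_y)

print(f"var(x) = {var_x:.3f}, correlation = {corr:.3f}")
```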
Appendix D
The single key fact about vectors and matrices is that each vector
represents a point located in space, and a matrix moves that point to
a different location. Everything else is just details.
Vectors. A number, such as 1.234, is known as a scalar, and a vector
is an ordered list of scalars. Here is an example of a vector with two
components a and b: w=(a,b). Note that vectors are written in bold
type. The vector w can be represented as a single point in a graph,
where the location of this point is by convention a distance of a from
the origin along the horizontal axis and a distance of b from the origin
along the vertical axis.
Adding Vectors. The vector sum of two vectors is the addition of
their corresponding elements. Consider the addition of two pairs of
scalars (x_1, x_2) and (a, b):

$$(a + x_1,\; b + x_2). \qquad (D.1)$$

Writing these pairs as the vectors x = (x_1, x_2) and w = (a, b), the vector sum is

$$\mathbf{z} = (a + x_1,\; b + x_2) \qquad (D.2)$$
$$\;\;= (x_1, x_2) + (a, b)$$
$$\;\;= \mathbf{x} + \mathbf{w}. \qquad (D.3)$$

Thus the sum of two vectors is another vector, which is known as the
resultant of those two vectors.

Subtracting Vectors. Subtracting vectors is similarly implemented
by the subtraction of corresponding elements, so that

$$\mathbf{z} = \mathbf{x} - \mathbf{w} \qquad (D.4)$$
$$\;\;= (x_1 - a,\; x_2 - b). \qquad (D.5)$$
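These definitions are easy to check numerically; the following NumPy sketch uses arbitrary example values.

```python
import numpy as np

x = np.array([2.0, 5.0])   # x = (x1, x2)
w = np.array([1.0, 3.0])   # w = (a, b)

z_sum = x + w              # (x1 + a, x2 + b): the resultant, Equations D.1-D.3
z_diff = x - w             # (x1 - a, x2 - b): Equations D.4-D.5

print(z_sum)               # [3. 8.]
print(z_diff)              # [1. 2.]
```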
The reason for having row and column vectors is that it is often
necessary to combine several vectors into a single matrix, which is then
used to multiply a single vector x, defined here as the column vector
x = (x_1, x_2)^T.
$$y = \mathbf{w}^{\mathsf{T}}\mathbf{x} \qquad (D.12)$$
$$\;\;= (a,\; b)\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \qquad (D.13)$$
$$\;\;= x_1 w_1 + x_2 w_2. \qquad (D.14)$$

Here, each (single-element) column y_{1t} is given by the inner product of
the corresponding column in x with the row vector w^T. This can now
be rewritten succinctly as y = w^T x. For two weight vectors w_1 and w_2,
the corresponding outputs are

$$y_1 = \mathbf{w}_1^{\mathsf{T}}\mathbf{x} \qquad (D.16)$$
$$y_2 = \mathbf{w}_2^{\mathsf{T}}\mathbf{x}, \qquad (D.17)$$
= W x. (D.23)
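The same operations in NumPy, again with arbitrary values: the inner product of Equations D.12-D.14, and a matrix-vector product in which each element of y = Wx is the inner product of one row of W with x.

```python
import numpy as np

w = np.array([1.0, 3.0])         # weight vector w = (a, b)
x = np.array([2.0, 5.0])         # input vector x = (x1, x2)

y = w @ x                        # inner product: a*x1 + b*x2 (Equations D.12-D.14)
print(y)                         # 17.0

W = np.array([[1.0, 3.0],        # each row of W is one weight vector w_i
              [4.0, 2.0]])
y_vec = W @ x                    # y_i = w_i^T x (Equations D.16, D.17, D.23)
print(y_vec)                     # [17. 18.]
```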
Appendix E
Neural Information Methods
where SNR is the signal-to-noise ratio, with equality if each variable is
independent and Gaussian. The mutual information can be estimated
using three broad strategies 20, which provide: 1) a direct estimate using
Equation E.1, 2) an upper bound using Equation E.3, and 3) a lower
bound using Equation E.2. For simplicity, stimulus values (s in the
main text) are represented as x here, so that y = g(x) + η, where g is a
neuron transfer function and η is a noise term.
resolution used to measure spikes. Dividing H(T, t) by T yields the
entropy rate, which converges to the entropy H(y) for large values of
T; specifically,

$$H(y) = \lim_{T \to \infty} \frac{H(T, t)}{T} \ \text{bits/s}. \qquad (E.4)$$
Strong et al (1998) 113 use arguments from statistical mechanics to show
that a graph of H(T, t)/T versus 1/T should yield a straight line (also
see Appendix A.8 in Bialek, 2012 16). The x-intercept of this line is at
$$H(y) = \lim_{T \to \infty} \frac{H(T, t)}{T} \qquad (E.6)$$
$$\;\;= \lim_{T \to \infty} \frac{1}{T} \sum_{i=1}^{m} p(y_i^T)\, \log \frac{1}{p(y_i^T)}, \qquad (E.7)$$
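A minimal Python sketch of this direct-method extrapolation (ours, not code from the book; the binary spike train, bin size, and word lengths are arbitrary assumptions):

```python
import numpy as np

def word_entropy(spikes, word_len):
    """Entropy (bits) of words of word_len consecutive bins, estimated from counts."""
    words = np.lib.stride_tricks.sliding_window_view(spikes, word_len)
    _, counts = np.unique(words, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(1)
dt = 0.002                                            # 2 ms bins (assumed resolution t)
spikes = (rng.random(100_000) < 0.05).astype(int)     # toy spike train, ~25 spikes/s

word_lens = np.array([1, 2, 3, 4, 5, 6])
T = word_lens * dt                                    # word durations in seconds
rates = np.array([word_entropy(spikes, L) for L in word_lens]) / T   # H(T, t)/T in bits/s

# H(T, t)/T is roughly linear in 1/T; its intercept at 1/T -> 0 estimates
# the true entropy rate H(y), as in Equations E.4 and E.6.
slope, intercept = np.polyfit(1.0 / T, rates, 1)
print(f"extrapolated entropy rate: {intercept:.1f} bits/s")
```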
The Lower Bound Method
Unlike previous methods, this method does not rely on repeated
presentations of the same stimulus, and can be used for spiking or
continuous outputs. In both cases, we can use the neuron inputs
x and outputs y to estimate a linear decoding filter w_d. When the
output sequence is convolved with this filter, it provides an estimate
x_est = w_d ⊗ y of the stimulus x, where ⊗ is the convolution operator. We
assume that x = x_est + ξ_est, so that the estimated noise in the estimated
stimulus sequence is ξ_est = x - x_est (Figure 1.3).
Assuming a bandwidth of W Hz and that values are transmitted at
the Nyquist rate of 2W Hz, we proceed by Fourier transforming the
stimulus sequence x to find the signal power X(f) at each frequency f,
and Fourier transforming ξ_est to find the power M(f) in the estimated
noise at each frequency. The mutual information is estimated by
summing over frequencies

$$\;\;= \sum_f \log X(f) - \sum_f \log M(f) \qquad (E.13)$$
$$\;\;= \sum_{f=0}^{W} \log \frac{X(f)}{M(f)} \ \text{bits/s}, \qquad (E.14)$$
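A rough Python sketch of this lower-bound calculation, assuming a continuous stimulus and a ready-made reconstruction x_est (in practice x_est would come from convolving the response with the decoding filter w_d); the stimulus, noise level, and bin width are arbitrary assumptions:

```python
import numpy as np

def lower_bound_info(x, x_est, dt):
    """Lower-bound information rate (bits/s) from signal and error power spectra."""
    noise = x - x_est                             # estimated noise, xi_est = x - x_est
    n = len(x)
    freqs = np.fft.rfftfreq(n, d=dt)
    X = np.abs(np.fft.rfft(x)) ** 2 / n           # signal power X(f)
    M = np.abs(np.fft.rfft(noise)) ** 2 / n       # estimated noise power M(f)
    df = freqs[1] - freqs[0]                      # width of each frequency bin (Hz)
    # Only frequencies where the signal exceeds the estimated noise contribute;
    # this stands in for the sum from f = 0 to the bandwidth W in Equation E.14.
    snr = np.maximum(X / M, 1.0)
    return np.sum(np.log2(snr)) * df

# Toy example: a band-limited 'stimulus' and a noisy stand-in for its reconstruction.
rng = np.random.default_rng(2)
dt = 0.001
x = np.convolve(rng.standard_normal(20_000), np.ones(20) / 20, mode="same")
x_est = x + 0.3 * rng.standard_normal(len(x))
print(f"lower-bound information: {lower_bound_info(x, x_est, dt):.1f} bits/s")
```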
Appendix F
Key Equations
Logarithms use base 2 unless stated otherwise.
Entropy

$$H(x) = \sum_{i=1}^{m} p(x_i) \log \frac{1}{p(x_i)} \ \text{bits} \qquad (F.1)$$

$$H(x) = \int_x p(x) \log \frac{1}{p(x)} \, dx \ \text{bits} \qquad (F.2)$$

Joint entropy

$$H(x,y) = \sum_{i=1}^{m} \sum_{j=1}^{m} p(x_i, y_j) \log \frac{1}{p(x_i, y_j)} \ \text{bits} \qquad (F.3)$$

$$H(x,y) = \int_y \int_x p(x,y) \log \frac{1}{p(x,y)} \, dx \, dy \ \text{bits} \qquad (F.4)$$
Conditional Entropy

$$H(x|y) = \sum_{i=1}^{m} \sum_{j=1}^{m} p(x_i, y_j) \log \frac{1}{p(x_i | y_j)} \ \text{bits} \qquad (F.6)$$

$$H(y|x) = \sum_{i=1}^{m} \sum_{j=1}^{m} p(x_i, y_j) \log \frac{1}{p(y_j | x_i)} \ \text{bits} \qquad (F.7)$$

$$H(x|y) = \int_y \int_x p(x,y) \log \frac{1}{p(x|y)} \, dx \, dy \ \text{bits} \qquad (F.8)$$

$$H(y|x) = \int_y \int_x p(x,y) \log \frac{1}{p(y|x)} \, dx \, dy \ \text{bits} \qquad (F.9)$$
from which we obtain the chain rule for entropy, H(x,y) = H(x) + H(y|x).
Mutual Information

$$I(x,y) = \sum_{i=1}^{m} \sum_{j=1}^{m} p(x_i, y_j) \log \frac{p(x_i, y_j)}{p(x_i)\,p(y_j)} \ \text{bits} \qquad (F.16)$$

$$I(x,y) = \int_y \int_x p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} \, dx \, dy \ \text{bits} \qquad (F.17)$$
If the channel input x has variance S, the noise η has variance N, and
both x and η are iid Gaussian variables, then I(x,y) = C, where

$$C = \frac{1}{2} \log\left(1 + \frac{S}{N}\right) \ \text{bits per value}, \qquad (F.24)$$
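As a sketch of how Equations F.1, F.3, and F.16 are evaluated in practice, here is a short Python example using a hypothetical joint probability table:

```python
import numpy as np

# Hypothetical joint distribution p(x, y) over two binary variables.
p_xy = np.array([[0.3, 0.2],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

H_x  = -np.sum(p_x * np.log2(p_x))                          # Equation F.1
H_xy = -np.sum(p_xy * np.log2(p_xy))                        # Equation F.3
I_xy = np.sum(p_xy * np.log2(p_xy / np.outer(p_x, p_y)))    # Equation F.16

print(f"H(x) = {H_x:.3f} bits, H(x,y) = {H_xy:.3f} bits, I(x,y) = {I_xy:.3f} bits")
```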
References
412(6849):787–792, 2001.
[39] AA Faisal and SB Laughlin. Stochastic simulations on the reliability
of action potential propagation in thin axons. PLoS computational
biology, 3(5):e79, 2007.
[40] LA Finelli, S Haney, M Bazhenov, M Stopfer, and TJ Sejnowski.
Synaptic learning rules and sparse coding in a model sensory system.
PLoS Comp Biol, 2008.
[41] P Foldiak and M Young. Sparse coding in the primate cortex. in The
Handbook of Brain Theory and Neural Networks, ed. Arbib, MA, pages
895–898, 1995.
[42] JP Frisby and JV Stone. Seeing: The computational approach to
biological vision. MIT Press, 2010.
[43] Kate S Gaudry and Pamela Reinagel. Benefits of contrast
normalization demonstrated in neurons and model cells. Journal of
Neuroscience, 27(30):8071–8079, 2007.
[44] W Gerstner and WM Kistler. Spiking neuron models: Single neurons,
populations, plasticity. CUP, 2002.
[45] David H Goldberg, Jonathan D Victor, Esther P Gardner, and
Daniel Gardner. Spike train analysis toolkit: enabling wider
application of information-theoretic techniques to neurophysiology.
Neuroinformatics, 7(3):165–178, 2009.
[46] D Guo, S Shamai, and S Verdú. Mutual information and minimum
mean-square error in Gaussian channels. IEEE Transactions on
Information Theory, 51(4):1261–1282, 2005.
[47] JJ Harris, R Jolivet, E Engl, and D Attwell. Energy-efficient
information transfer by visual pathway synapses. Current Biology,
25(24):3151–3160, 2015.
[48] HK Hartline. Visual receptors and retinal interaction. Nobel lecture,
pages 242–259, 1967.
[49] T Hosoya, SA Baccus, and M Meister. Dynamic predictive coding by
the retina. Nature, 436(7047):71–77, 2005.
[50] RAA Ince, RS Petersen, DC Swan, and S Panzeri. Python
for information theoretic analysis of neural data. Frontiers in
Neuroinformatics, 3, 2009.
[51] A L Jacobs, G Fridman, RM Douglas, NM Alam, PE Latham,
GT Prusky, and S Nirenberg. Ruling out and ruling in neural codes.
Proc National Academy of Sciences, 106(14):5936–5941, 2009.
[52] EC Johnson, DL Jones, and R Ratnam. A minimum-error, energy-
constrained neural code is an instantaneous-rate code. Journal of
computational neuroscience, 40(2):193–206, 2016.
[53] M Juusola, A Dau, and Diana R Lei Z. Electrophysiological method for
recording intracellular voltage responses of drosophila photoreceptors
and interneurons to light stimuli in vivo. JoVE, (112), 2016.
[54] M Juusola and GG de Polavieja. The rate of information transfer of
naturalistic stimulation by graded potentials. The Journal of general
physiology, 122(2):191–206, 2003.
[55] K Koch, J McLean, M Berry, P Sterling, V Balasubramanian, and
MA Freed. Efficiency of information transmission by retinal ganglion
cells. Current Biology, 14(17):1523 – 1530, 2004.
[56] K Koch, J McLean, R Segev, MA Freed, MJ Berry, V Balasubrama-
nian, and P Sterling. How much the eye tells the brain. Current biology,
16(14):1428–1434, 07 2006.
[57] L Kostal, P Lansky, and J-P Rospars. Efficient olfactory coding in the
pheromone receptor neuron of a moth. PLoS Comp Biol, 4(4), 2008.
[58] SW Kuffler. Discharge patterns and functional organization of
mammalian retina. Journal of Neurophysiology, 16:37–68, 1953.
[59] MF Land and DE Nilsson. Animal eyes. OUP, 2002.
[60] SB Laughlin. A simple coding procedure enhances a neuron’s
information capacity. Z Naturforsch, 36c:910–912, 1981.
[61] SB Laughlin. Matching coding to scenes to enhance efficiency. In
OJ Braddick and AC Sleigh, editors, Physical and biological processing
of images, pages 42–52. Springer, Berlin, 1983.
[62] SB Laughlin. Form and function in retinal processing. Trends in
Neurosciences, 10:478–483, 1987.
[63] SB Laughlin, RR de Ruyter van Steveninck, and JC Anderson. The
metabolic cost of neural information. Nature Neuro, 1(1):36–41, 1998.
[64] SB Laughlin, J Howard, and B Blakeslee. Synaptic limitations to
contrast coding in the retina of the blowfly calliphora. Proc Royal
Society London B, 231(1265):437–467, 1987.
[65] P Lennie. The cost of cortical computation. Current Biology, 13:493–
497, 2003.
[66] WB Levy and RA Baxter. Energy efficient neural codes. Neural
computation, 8(3):531–543, 1996.
[67] R Linsker. Self-organization in perceptual network. Computer, pages
105–117, 1988.
[68] P Liu and B Cheng. Limitations of rotational manoeuvrability
in insects and hummingbirds: Evaluating the effects of neuro-
biomechanical delays and muscle mechanical power. Journal of The
Royal Society Interface, 14(132):20170068, 2017.
[69] Y Liu, Y Yue, Y Yu, L Liu, and L Yu. Effects of channel blocking on
information transmission and energy efficiency in squid giant axons.
Journal of computational neuroscience, pages 1–13, 2018.
[70] M London, A Roth, L Beeren, M Häusser, and PE Latham. Sensitivity
to perturbations in vivo implies high noise and suggests rate coding in
cortex. Nature, 466(7302):123, 2010.
[71] ZF Mainen and TJ Sejnowski. Reliability of spike timing in neocortical
neurons. Science, 268(5216):1503, 1995.
[72] D Marr. Vision: A computational investigation into the human
representation and processing of visual information. 1982,2010.
[73] G Mather. Foundations of perception. Taylor and Francis, 2006.
[74] M Meister and MJ Berry. The neural code of the retina. Neuron,
22:435–450, 1999.
[75] A Moujahid, A D’Anjou, and M Graña. Energy demands of diverse
spiking cells from the neocortex, hippocampus, and thalamus. Frontiers
in Computational Neuroscience, 8:41, 2014.
[76] I Nemenman, F Shafee, and W Bialek. Entropy and inference, revisited.
In TG Dietterich, S Becker, and Z Ghahramani (Eds), Advances in
Neural Information Processing Systems 14, 2002, 2002.
[77] S Nirenberg, SM Carcieri, AL Jacobs, and PE Latham. Retinal
ganglion cells act largely as independent encoders. Nature,
411(6838):698–701, June 2001.
[78] JE Niven, JC Anderson, and SB Laughlin. Fly photoreceptors
demonstrate energy-information trade-offs in neural coding. PLoS
Biology, 5(4), 03 2007.
Index
scalar variable, 76
Shannon, C, 6, 9
signal to noise ratio, 24, 172
source coding theorem, 18
space constant, 30
spatial frequency, 119, 120
spike, 2, 28
spike cost, 49
spike speed, 55
spike timing precision, 39
spike-triggered average, 101
static linearity, 92, 172
Dr James Stone is an Honorary Reader in Vision and Computational
Neuroscience at the University of Sheffield, England.