Sparse Distributed Memory: The Science of Computing

© Copyright P. J. Denning.

Published in American Scientist 77 (July-August 1989), 333-335.

The Science of Computing

Sparse Distributed Memory

Peter J. Denning

ABSTRACT: Sparse Distributed Memory was proposed by Pentti Kanerva as a model of human long-term
memory. He presented it as an architecture that could store large patterns and retrieve them based on partial
matches with current sensory inputs. The architecture can be realized as a neural net or as an associative
memory. SDM exhibits behaviors, both in theory and in experiment, that resemble those previously
unapproachable by machines -- e.g., rapid recognition of faces or odors, discovery of new connections between
seemingly unrelated ideas, continuation of a sequence of events when given a cue from the middle, knowing
that one doesn’t know, or getting stuck with an answer on the tip of one’s tongue. These behaviors are now
within reach of machines that can be incorporated into the computing systems of robots capable of seeing,
talking, and manipulating. Kanerva’s theory is a new interpretation of learning and cognition that respects
biology and the mysteries of individual human beings.

Recognizing your mother’s face in a crowd. Experiencing a flood of old memories an instant after
sniffing an odor you haven’t smelled for years. Seeing a connection that no one ever taught you
between two concepts. Discovering that an idea that seemed to have occurred to you spontaneously
was actually given to you by a friend in a conversation last year. Recognizing that a particular
leaf is a maple. Humming the rest of a familiar tune when given a phrase from the middle. Knowing
that you don’t know the answer to a question. Knowing that you do know the answer to a question,
but that it is inaccessibly perched on the tip of your tongue.

These everyday phenomena illustrate capabilities of human beings that we do not know how to
reproduce with a machine but that would be very useful if we could. The failure of artificial
intelligence to produce machines with any of these capabilities after forty years of research is
not a failure of intention. It is a failure of the rationalistic philosophy deeply rooted in
Western thought (1). That philosophy has produced in many disciplines a search for models that
combine context-free (meaningless) elements into systems governed by formal laws. Not only have
information-processing models of cognition fallen short in computer science, corresponding formal
models have fallen short in anthropology, economics, linguistics, political science, psychology,
and other disciplines. These shortcomings have prompted a new examination of what it means to be
human, a search for a philosophy that respects the mystery of individuals and the biological roots
of all learning.

Against this background, the emergence of Pentti Kanerva’s theory of sparse distributed memory is
refreshingly welcome (2). Kanerva departs from the formalistic tradition to develop an architecture
of memory, inspired by biology, in which the phenomena I mentioned in the first paragraph can arise
holistically. Because his theory deals with patterns recalled statistically from patterns
previously stored across large regions of the memory space, he does not insist that anyone can ever
know precisely how the phenomena arise. In what follows, I will describe the central ideas of
sparse distributed memory; I would encourage you to read the details in Kanerva’s book.

The theory begins with an interpretation of human long-term memory as a storage system that
associates sensory input patterns quickly with actions that are appropriate for the situation. In
Kanerva’s model, sensory input is represented in the form of very long bit vectors containing
thousands or tens of thousands of bits. Because no two external situations are identical, the
memory must respond to partial matches between the current sensory pattern and previously stored
patterns. The measure of dissimilarity between patterns is the number of bits in which they differ,
a metric known as the Hamming distance. For example, the distance between 01101 and 10111 is 3
bits.

Kanerva illustrates his design with an example of 1,000-bit patterns, giving rise to a space of
2^1000 possible patterns. In this space, 1/1000 of the patterns are within 451 bits of any given
pattern, and all but 1/1000 of the patterns are within 549 bits. The extremely large number of
patterns that are so close (±49 bits) to the mean distance of 500 bits between two random patterns
is crucial to the memory’s ability to make connections between patterns that seemingly have little
to do with each other.

Ordinary (random-access) computer memories are designed around a simple idea: Within a few
nanoseconds after a memory cue (address) is presented for a read operation, the memory responds
with an output pattern (data). High speed is achieved by associating one physical location with
each possible address. Current technology limits the designs to about 25 address bits and 64 data
bits, nowhere near the pattern lengths needed for simulation of human long-term memory.

Kanerva proposes an architecture that encompasses an affordable number of physical locations (say
1,000,000) and a large pattern size (say 1,000 bits). Each location is assigned an address
(1,000-bit pattern) at random, and the set of location addresses constitutes a sparse subset of the
memory space. The memory has an input register for the cue (address) pattern and an input register
for the data pattern, and it has a register for an output pattern (these registers each hold 1,000
bits). Each location has an address decoder that compares its own address with the input cue,
selecting that location as a participant in the next storage or retrieval operation whenever the
cue is within distance d of the location’s address. Kanerva demonstrates that the address decoders
can be built of linear threshold circuits -- gates that produce a 1 at their output whenever the
number of 1s among their many inputs is at least 1,000 - d -- and notes a similarity of operation
between these circuits and neurons in the nervous systems of many animals.

Figure. This schematic diagram shows the relations among the components of sparse distributed
memory. The memory in this example stores and retrieves 256-bit patterns across 2,000 physical
locations. Each horizontal row is a location. The input pattern (cue) in the address register is
compared simultaneously to all 2,000 patterns in the memory address array; each line in the array
holds the address of one location. The distances from each address pattern are compared with the
memory’s built-in threshold radius and a subset of the locations is selected (shaded areas). The
256-bit data pattern is stored at the selected locations by adding 1 to each counter in the counter
array corresponding to each 1 in the pattern and subtracting 1 from each counter corresponding to a
0 in the pattern. A 256-bit pattern is retrieved by forming 256 sums from the corresponding
counters in each selected location and then forming a 1 output bit in the data-out register for
each sum that is nonnegative and a 0 for each sum that is negative. The retrieved pattern is a
statistical reconstruction determined from the contents of all selected locations. All selections
can be done in parallel, and all data bits can be handled in parallel, giving the memory great
speed over a wide range of pattern widths and physical locations.

Kanerva recommends d=451 for the 1,000,000-location memory of 1,000-bit patterns. With these
parameters, approximately 1/1000 of the physical locations will be selected by any given input cue.
How are storage and retrieval carried out with this arrangement?

To store a 1,000-bit data pattern at address A, the memory works as follows. The input cue pattern
A is presented to the memory, and all locations within 451 bits of A select themselves. This set of
selected locations is called the sphere selected by A. A copy of the input data pattern, which is
to be associated with A, is then entered into each of the selected locations. Because any given
location is within the spheres of selection of many distinct cue patterns, entering a new value
must not obliterate the previous contents of the location. This is accomplished by implementing
each location as a set of 1,000 counters, one for each bit position of the data. Data are entered
by adding 1 to each counter for which the corresponding data bit is 1, and subtracting 1 from each
counter for which the corresponding data bit is 0. Kanerva calculates that 8-bit counters are
adequate for most applications.

To retrieve a 1,000-bit pattern corresponding to input cue A, the memory works as follows. The
sphere of selected locations is formed as described above. A set of 1,000 output counter values is
constructed from all the selected locations by summing all the corresponding selected counters; for
example, the counter in output bit position 2 is the sum of the bit-2 counters of each selected
physical location. The 1,000-bit output pattern is constructed from the 1,000 output counters by a
threshold method: if an output counter is nonnegative, that output bit is 1, otherwise it is 0.

Figure. Each of the nine patterns at the top of the figure was stored in a simulated sparse
distributed memory by addressing the memory with the pattern itself. Each pattern is a 16x16 array
of bits that transforms into a 256-bit vector. The three figures at the bottom show the result of
an iterative search in which the result of the first retrieval was used as the input cue for the
second retrieval. The final output pattern was none of the patterns stored. Because each of the
nine stored patterns was constructed from an O with 20% of the bits randomly reversed, this
behavior may be interpreted as the memory’s ability to extract a signal from noise. Another
interpretation is that the memory formed a statistical interpolation among the stored patterns; the
new pattern is stable (it will retrieve itself) and thus serves as a conceptualization of the data.

The rationale for the name is now obvious: the memory is sparse because the physical locations are
a vanishingly small subset of the memory space; it is distributed because a pattern is stored in
many locations and retrieved by statistical reconstruction from many locations. Distribution
enables the memory to retrieve a stored pattern when the input cue only partially matches any
stored pattern, an ability that arises from the large overlap between the spheres of selected
locations of two similar cues. It also renders the memory robust in case of failure of portions of
the addressing or storage hardware.
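
The figures quoted for 1,000-bit patterns (about 1/1000 of the space lying within 451 bits of any
pattern) can be checked against the binomial distribution, since the distance from a fixed pattern
to a uniformly random one behaves like the number of heads in 1,000 fair coin flips. The Python
sketch below is only an illustration and is not taken from Kanerva's book; the function names
`hamming` and `fraction_within` are invented here, and it assumes location addresses are drawn
uniformly at random.

```python
from math import comb


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("patterns must have the same length")
    return sum(x != y for x, y in zip(a, b))


def fraction_within(n_bits: int, radius: int) -> float:
    """Fraction of all n_bits-bit patterns within `radius` bits of a fixed pattern.

    Exactly C(n_bits, k) patterns lie at distance k, out of 2**n_bits in total.
    """
    count = sum(comb(n_bits, k) for k in range(radius + 1))
    return count / 2 ** n_bits


print(hamming("01101", "10111"))  # prints 3, the example pair from the text

p = fraction_within(1000, 451)
print(f"fraction within 451 bits: {p:.5f}")
print(f"expected selections among 1,000,000 locations: {p * 1_000_000:.0f}")
```

Because the sum uses exact integer arithmetic before the final division, the result is not affected
by floating-point underflow even though 2^1000 is far beyond the range of ordinary machine integers.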

Each storage and retrieval can be carried
out with massively parallel operations among
the address decoders and counters, allowing the
memory to respond rapidly. At the NASA
Ames Research Center, David Rogers has built a
simulator of the sparse distributed memory
running on a 32,768-node Connection Machine 2
of the Thinking Machines Corporation; it
simulates 250,000 locations with 256-bit patterns,
with a cycle time of about half a second.
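
A software simulation far smaller than the one just described can be sketched in a few dozen lines
of Python. This is a minimal illustration of the store and retrieve rules described in this essay,
not Rogers's or Kanerva's code. The class and function names, the 256-bit width, the 2,000
locations, and the radius of 112 are all assumptions chosen for the example, and the counters are
unbounded Python integers rather than 8-bit hardware counters.

```python
import random


def hamming(a: int, b: int) -> int:
    """Hamming distance between two patterns held as Python integers."""
    return bin(a ^ b).count("1")


class SparseDistributedMemory:
    """Toy memory: random location addresses, one counter per data bit."""

    def __init__(self, n_bits, n_locations, radius, seed=0):
        rng = random.Random(seed)
        self.n_bits = n_bits
        self.radius = radius
        # Each location is assigned a fixed address at random.
        self.addresses = [rng.getrandbits(n_bits) for _ in range(n_locations)]
        self.counters = [[0] * n_bits for _ in range(n_locations)]

    def selected(self, cue):
        """Indices of locations whose address lies within `radius` of the cue."""
        return [i for i, addr in enumerate(self.addresses)
                if hamming(addr, cue) <= self.radius]

    def write(self, cue, data):
        """Add 1 to counters where the data bit is 1, subtract 1 where it is 0."""
        for i in self.selected(cue):
            row = self.counters[i]
            for bit in range(self.n_bits):
                row[bit] += 1 if (data >> bit) & 1 else -1

    def read(self, cue):
        """Sum counters over selected locations; a nonnegative sum gives a 1 bit.

        If no location is selected, every sum is 0 and the output is all 1s.
        """
        rows = [self.counters[i] for i in self.selected(cue)]
        out = 0
        for bit in range(self.n_bits):
            if sum(row[bit] for row in rows) >= 0:
                out |= 1 << bit
        return out


def flip_bits(pattern, n_bits, n_flips, rng):
    """Return a copy of `pattern` with `n_flips` distinct bits reversed."""
    for bit in rng.sample(range(n_bits), n_flips):
        pattern ^= 1 << bit
    return pattern


rng = random.Random(1)
memory = SparseDistributedMemory(n_bits=256, n_locations=2000, radius=112)
patterns = [rng.getrandbits(256) for _ in range(3)]
for p in patterns:
    memory.write(p, p)  # store each pattern at its own address

noisy = flip_bits(patterns[0], 256, 10, rng)
recalled = memory.read(noisy)
# Distance to the stored pattern before and after retrieval.
print(hamming(noisy, patterns[0]), hamming(recalled, patterns[0]))
```

For 256-bit patterns the mean distance between random patterns is 128 bits with a standard
deviation of 8, so a radius of 112 selects a few dozen of the 2,000 locations per cue. That overlap
between the spheres of the noisy cue and the original address is what lets the retrieval clean up
the reversed bits.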
Let us consider again the phenomena mentioned at the start of this essay. The memory’s ability to
retrieve patterns associated with sensory input quickly could allow it to recognize instantly your
mother’s face or a long-forgotten odor. The memory can form associations between patterns without
ever being explicitly taught those associations because the distance between two patterns is
sufficiently small that the one pattern retrieves the other. Similarly, the memory can retrieve a
forgotten pattern from some cue that seemingly had nothing to do with it, giving the impression of
generating a new pattern. It can retrieve the pattern corresponding to “maple leaves” that was
formed internally after storing many patterns encoding specific maple leaves. It can store patterns
in lists representing their temporal order, and begin an iterative retrieval from anywhere in the
list. Fast convergence of an iterative search can be interpreted as “knowing that you know” and
nonconvergence as “knowing that you don’t know”; the tip-of-the-tongue phenomenon would occur
somewhere between these two cases.

Figure. The six patterns at the top were stored as a list in a simulated sparse distributed memory
by storing each pattern as the data associated with the previous pattern in the sequence. The four
patterns at the bottom resulted from an iterative search, beginning with a noisy version of the
third pattern and culminating with a clean version of the sixth. This behavior may be interpreted
as the memory’s ability to locate the remainder of a temporal sequence given a pattern that is
similar to one of the members. This behavior will occur even when the sequence stored in the memory
is noisy, suggesting that the memory can generate an abstract form of a sequence.

It is important to remember that the theory predicts that these phenomena will occur in sparse
distributed memory, but it cannot predict the details. It cannot predict which connections you
might see between ideas, which concepts you will form, or what will be on the tip of your tongue.

Kanerva began to develop his theory in the early 1970s. He did so independently of James Albus and
David Marr, who developed similar theories from observation of the structure of the human nervous
system and the cerebellum (3,4). These theories have the distinguishing feature that they can be
readily tested; they have thus inspired much work with simulators that verify their mathematical
properties and predictions. Albus’s theory also emphasizes the hierarchical organization of the
nervous system and suggests that associative memory and sensory encoding may be organized into
levels. All these theories are consistent with the biological theory of learning proposed by
Maturana and Varela (5).

The sparse distributed memory is intended as an integral component of a larger system that includes
sensory apparatus and a scheme for encoding sensory input into binary patterns. Such a system also
includes motors that act when driven with stimulus patterns. Kanerva calls this an autonomous
learning system. It includes a component called the focus that contains a pattern updated
constantly from both sensory input and the contents of the sparse distributed memory and that
generates the patterns used to drive motors. The focus represents the current moment of
consciousness, which continuously changes as the sensory input and the context retrieved from
memory change.

A major research area is the design of sensory encoders. How does visual input get encoded so that
the patterns stored in memory are relatively insensitive to small rotations, translations, zooms,
and pans of the visual field? Or so that certain shapes are easily detectable within any visual
field? How does speech input get encoded so that the same word produces similar patterns
independently of the speaker? How does tactile input get encoded so that different surface textures
are distinguishable? These and similar questions are occupying Kanerva and his colleagues, who seek
to build prototypes of devices that recognize visual shapes, continuous speech, and fine textures.
A theory of Robert Erickson about how the power of visual systems arises from large numbers of
simple components illustrates a possible sensory-encoding system that might mesh well with sparse
distributed memory (6).

David Rogers has been studying the sparse distributed memory as a statistical inference machine. In
one experiment, he fed in a stream of patterns, each derived from a vector of measurements of 15
weather-related factors from a four-hour interval at a weather station in Darwin, Australia. There
were 50,000 vectors covering about 23 years of observations. The 15 components of each vector were
encoded as a 256-bit pattern that was the storage address of the single bit indicating rain in the
subsequent four-hour period. Rogers modified the operation of the memory so that the address array
was dynamically altered to add addresses similar to those associated with rain, and delete
addresses not associated with rain. At the end of the experiment, the address array identified the
combinations of bits that were the most reliable predictors of rain in the data.

However promising his theory is, Pentti Kanerva advises that it is not a final answer. It is only a
step in a line of investigation whose final outcomes cannot be predicted. His theory opens the
possibility that machines can perform some of the actions of which we are capable, while leaving
plenty of room for the biological roots of intelligence and the mysteries of each human being.

