Language, brain and representation
The problem of speech perception
Week 4
Three aspects of speech recognition

 The input signal:
– the acoustic structure of speech and how speech signals are processed by the human auditory system.
 The internal phonological representation:
– the way that words or phonological targets are stored in the speech recognition lexicon.
 The interface between (1) and (2) above:
– how the auditory input makes contact with the internal forms in the recognition lexicon.
The acoustic structure of speech: a spectrogram

[Spectrogram of the utterance "sheep likes soft grass", with segment labels aligned to the signal]
 The spectrogram shows the changing energy structure of the speech signal over time.
 Spectrograms as (a) ‘visible speech’ and (b) as ‘voice prints’.
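To make the idea concrete, the sketch below computes and plots a spectrogram with standard signal-processing tools; the filename and analysis settings (window length, overlap) are illustrative assumptions, not values from the slide.

# Minimal spectrogram sketch: short-time Fourier analysis of a speech file.
# "speech.wav" is a hypothetical mono recording; nperseg/noverlap are
# typical, not canonical, settings.
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from scipy.signal import spectrogram

rate, samples = wavfile.read("speech.wav")
freqs, times, power = spectrogram(samples, fs=rate, nperseg=256, noverlap=128)

# Plot energy (in dB) as a function of time and frequency.
plt.pcolormesh(times, freqs, 10 * np.log10(power + 1e-10), shading="auto")
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.title("Changing energy structure of the speech signal over time")
plt.show()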
Why speech recognition is difficult
 The segmentation problem
 The variability problem
– Speech sounds vary with context. Compare ‘sh’ in shoe vs sheep; ‘k’ in keep vs cool.
– The speaking environment and noise.
– Speakers’ vocal tracts vary.
– Effects of speech rate and style variation.
Connected speech processes

I should have thought … (in rapid casual speech, ‘should have’ reduces to something like [ʃʊtəf], with elision and devoicing)
Implications of connected speech processes
for speech recognition
 CSPs are more prevalent in fast and casual
speech.
 CSPs are motivated by ‘ease of articulation’.
 CSPs complicate the mapping from speech signal
to phonological target. Many intended gestures
may not gain full expression.
 CSPs increase the inferential burden on the
listener.
Rate of information transmission in speech

 8-10 phonemes, 3-4 syllables, 2-3 words per second.
 The rate of phoneme transmission posed a paradox for early researchers attempting to build an aural reading machine for the blind, based on an ‘auditory cipher’.
 Sounds presented at rates of more than 2-3 per
second could not be tracked by listeners.
 Is there something special about speech
perception?
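The arithmetic behind the paradox is worth spelling out; the sketch below just recaps the figures quoted above, taking mid-range values as an assumption.

# Back-of-the-envelope arithmetic for the rate-of-transmission paradox.
phonemes_per_sec = 9      # mid-range of the quoted 8-10 phonemes per second
nonspeech_limit = 2.5     # non-speech sounds trackable at ~2-3 per second

print(f"~{1000 / phonemes_per_sec:.0f} ms per phoneme in running speech")
print(f"Listeners decode speech ~{phonemes_per_sec / nonspeech_limit:.1f}x "
      "faster than they can track discrete non-speech sounds")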
Modularity and specialization in the
speech perception mechanism?
 Several proposals have been offered over the
years, variations on a theme:
 The ‘motor’ theory of speech perception
(Liberman & Cooper, 1957)
 The ‘speech mode’ hypothesis (Liberman et al.
1967)
 ‘Phonetic feature detectors’ (Eimas, 1972)
 ‘Mirror neurons’ (Rizzolatti & Arbib, 1998)
 The role of perceptual learning in speech
recognition vs specialization and modularity.
Lexical retrieval in speech perception

 Lexical retrieval: the process by which information about the form or meaning of lexical items is retrieved from long-term memory.
 What is the role of the lexicon in speech
recognition?
– Critically necessary: e.g., TRACE and other activation-based models.
– Necessary but not sufficient: e.g., ‘bottom-up’
processing models.
Phonological parsing prior to lexical
access
 A demonstration…
 Transcribe what you hear:

____________________________________
 How many ‘words’ in the utterance?
Removing sentence prosody
 Produced this stimulus.
 Compare the two stimuli.
 How the sentence prosody was removed (one possible method is sketched after this slide).
 Results:
– Subjects’ identification of syllables was superior with prosody
present.
 The explanation:
– Prosody and ‘information chunking’
 Conclusion:
– A substantial amount of phonological processing is possible without lexical access.
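The slides leave the removal method implicit. One widely used approach is to resynthesize the utterance with a flattened (monotone) pitch contour; the sketch below does this with Praat driven through the parselmouth library. The filename and pitch-range settings are illustrative assumptions, not the procedure actually used for the demonstration stimulus.

# Hedged sketch: flatten sentence prosody by PSOLA resynthesis with a
# monotone F0, via Praat/parselmouth. All settings are illustrative.
import parselmouth
from parselmouth.praat import call

snd = parselmouth.Sound("sentence.wav")              # hypothetical recording
manip = call(snd, "To Manipulation", 0.01, 75, 600)  # time step, F0 floor/ceiling
tier = call(manip, "Extract pitch tier")

# Replace the whole contour with a single point at the median F0.
median_f0 = call(snd.to_pitch(), "Get quantile", 0, 0, 0.5, "Hertz")
call(tier, "Remove points between", snd.xmin, snd.xmax)
call(tier, "Add point", (snd.xmin + snd.xmax) / 2, median_f0)
call([tier, manip], "Replace pitch tier")

monotone = call(manip, "Get resynthesis (overlap-add)")
monotone.save("sentence_monotone.wav", "WAV")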
Listening in ‘the speech mode’
 Bookstore anecdote
 Another demonstration:
 The original signal:
 This is ‘sine wave’ speech.
 The auditory system is induced to flip
between a ‘speech’ and ‘non-speech’ mode
of perceptual processing.
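Sine-wave speech is made by replacing an utterance's formant tracks with time-varying sinusoids. The sketch below shows only the synthesis idea; the three 'formant' contours are placeholder curves, not measurements from a real utterance.

# Sine-wave speech sketch: sum three sinusoids that follow (placeholder)
# formant-frequency tracks. Real stimuli use tracks measured from speech.
import numpy as np
from scipy.io import wavfile

fs = 16000
t = np.arange(0, 1.0, 1 / fs)                  # one second of signal

f1 = 500 + 100 * np.sin(2 * np.pi * 2.0 * t)   # pretend F1 contour (Hz)
f2 = 1500 + 300 * np.sin(2 * np.pi * 1.5 * t)  # pretend F2 contour (Hz)
f3 = 2500 + 150 * np.sin(2 * np.pi * 1.0 * t)  # pretend F3 contour (Hz)

signal = np.zeros_like(t)
for track in (f1, f2, f3):
    phase = 2 * np.pi * np.cumsum(track) / fs  # integrate frequency -> phase
    signal += np.sin(phase)

signal /= np.abs(signal).max()                 # normalize to [-1, 1]
wavfile.write("sine_wave_demo.wav", fs, (signal * 32767).astype(np.int16))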
Strong and weak versions of the
speech mode hypothesis.
 Weak version:
– We cannot help but engage in linguistic processing,
which involves accessing a vast store of specialized
tacit knowledge about speech.
 Strong version:
– Specialized perceptual mechanisms, dedicated to the task of speech recognition and in some sense hard-wired into the brain, are required for at least certain aspects of speech recognition.
What evidence is there for the
strong version of the SMH?
 We will consider evidence from several sources:
 Dichotic listening
 Categorical perception
 Duplex perception
 Perceptual magnet effects

 Does speech perception involve modular processing systems?
 Jerry Fodor (1983), The Modularity of Mind.
Modularity (Fodor)

Properties of modular systems        Examples
1. Domain specific                   phonetic feature detectors
2. Mandatory operation               the speech listening mode
3. Limited central access
4. Fast acting                       on-line word recognition
5. Informationally encapsulated
6. ‘Shallow’ analysis                extraction of ‘surface’ not ‘deep’ structure
7. Fixed neural architecture         cerebellum: motor control
8. Specific patterns of breakdown    aphasic syndromes
9. Maturational sequencing           language acquisition
Dichotic listening paradigm
 Dichotic vs binaural listening.
 Kimura’s original findings
 The anatomical basis of the REA for
speech sounds.
 A left ear advantage for chords, but not
melody recognition.
 Haskins investigations using synthetic
speech.
Haskins results: dichotic listening
Basis of the REA for some speech
sounds
 The type of acoustic cues needed to identify
stop consonants.
 Context invariant vs context sensitive cues.
 Synthetic stop consonant stimulus continuum
 Formant transitions as cues to place of
articulation in stop consonants /b d g/
 Do these cues require special neural machinery?
 A strong version of the SMH.
Categorical Perception
 What is categorical perception?
 Is categorical perception unique to speech and some
speech sounds?
 Construct a stimulus continuum.
 Determine identification functions for that continuum.
 Test for discrimination (one step, two step) along the continuum, using the ABX paradigm.
 Can peaks and troughs in discrimination performance be accounted for entirely by identification abilities?
 Or: Is there residual within-category discrimination?
 The strict definition of categorical perception:
Discrimination is entirely predictable from identification
(categorization).
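The strict definition has a simple quantitative form. Under the covert-labeling account used in the early Haskins work, predicted ABX accuracy for a stimulus pair is 0.5 * (1 + (pA - pB)^2), where pA and pB are the probabilities of assigning each stimulus to one category. The sketch below applies this to an invented logistic identification function; the continuum, boundary location, and slope are assumptions for illustration, not data from any experiment.

# Predict two-step ABX discrimination from identification alone, per the
# strict (covert-labeling) definition of categorical perception.
import numpy as np

steps = np.arange(1, 11)                    # 10-step synthetic continuum
p_id = 1 / (1 + np.exp(2 * (steps - 5.5)))  # invented P(label = /ba/) per step

p_a, p_b = p_id[:-2], p_id[2:]              # stimulus pairs two steps apart
predicted = 0.5 * (1 + (p_a - p_b) ** 2)

for i, p in zip(steps[:-2], predicted):
    print(f"pair {i}-{i + 2}: predicted ABX accuracy = {p:.2f}")
# Accuracy peaks at the category boundary and falls to chance (0.5) within
# categories; observed within-category discrimination above these values
# would count as residual discrimination.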
Categorical Perception of Phonetic
Features by Infants.
 Peter Eimas (1972).
 Non-nutritive sucking
paradigm.
 Discrimination of VOT.
 Greater sensitivity to
VOT differences
across a phoneme
boundary (pa~ba).
 Significance of
finding?
Patricia Kuhl (1975) and others
 Categorical tendencies in VOT perception in
chinchillas.
 A blow to the hypothesis of innate ‘phonetic’ feature
detector for ‘voicing’.
 Was it coincidental that chinchillas show higher
discrimination sensitivity at a region of the VOT
auditory continuum that English and many other
languages (see Lisker and Abramson, 1964, 1971)
use for making phonemic voicing contrasts among
stop consonants?
 Languages take advantage of a natural discontinuity
on a temporal dimension of auditory contrast that is
common to mammalian auditory systems in general.
Is Categorical Perception Unique to Speech Sound Perception?

 Beale and Keil (1993) used a graphical distortion technique known as ‘morphing’ to construct visual continua.
 Demonstrated categorical tendencies for face
recognition.
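As a rough illustration of how a visual continuum can be built, the sketch below cross-fades the pixels of two endpoint images. True morphing as used by Beale and Keil also warps facial geometry, so this is only a stand-in; the image filenames are hypothetical.

# Crude visual-continuum sketch: weighted pixel average between two faces.
import numpy as np
from PIL import Image

face_a = np.asarray(Image.open("face_a.png").convert("L"), dtype=float)
face_b = np.asarray(Image.open("face_b.png").convert("L"), dtype=float)

steps = 11                                 # e.g., an 11-step continuum
for i, w in enumerate(np.linspace(0.0, 1.0, steps)):
    blend = (1 - w) * face_a + w * face_b  # endpoint A -> endpoint B
    Image.fromarray(blend.astype(np.uint8)).save(f"morph_{i:02d}.png")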
Beale and Keil (1993) Results
 The strength of the
categorical effect was
directly related to the
familiarity of the faces.
 Learning and object
recognition obviously
play a key role in the
categorical
discrimination of these
morphed images.
Coarticulation Effects and Category
Boundary Shifts
 Phoneme category boundaries may be shifted by coarticulation effects:
– The strongest category boundary cue for the contrast between /t/ and /k/ is the second formant transition.
– The category boundary along the second formant transition is systematically shifted closer to the velar end of the continuum when the preceding sound is a palatal fricative /ʃ/ rather than an alveolar fricative /s/.
– (e.g., foolish tapes/capes vs Christmas tapes/capes; Mann and Repp, 1981).
Coarticulation Effects and Category
Boundary Shifts
 A similar category shift in the alveolar - velar
place of articulation boundary for stop
consonants is also observed when an /r/
precedes the target stop compared with an /l/
(e.g., ‘or done/gone’ vs ‘all done/gone’; Mann and Liberman, 1983).
 If ‘phonetic feature detectors’ mediate
perception of stop consonant place of
articulation, then their acoustic triggering
conditions must be dynamically tunable to
phonological context.
Coarticulation Effects and Category
Boundary Shifts
 These detectors would be highly complex
and specialized perceptual analysers, if
not ‘hard-wired’, then at least
‘programmed’ for the kinds of acoustic-
articulatory mappings encountered in
speech and probably only in speech
stimuli.
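One way to picture such a 'tunable' detector is as an identification function whose category boundary is offset by the preceding context. The toy sketch below does exactly that; the boundary frequencies, slope, and size of the shift are invented for illustration, not fitted to the Mann and Repp or Mann and Liberman data.

# Toy context-tunable category boundary: P(velar response) as a logistic
# function of a formant-onset cue, with the boundary offset by the
# preceding liquid. All numbers are invented for illustration.
import numpy as np

BOUNDARY_HZ = {"l": 1800.0, "r": 1950.0}  # hypothetical boundary per context
SLOPE = 0.02                              # steepness per Hz

def p_velar(cue_hz: float, context: str) -> float:
    """Probability of a velar (/g/-like) response given the onset cue."""
    return 1 / (1 + np.exp(SLOPE * (cue_hz - BOUNDARY_HZ[context])))

for cue in (1700, 1850, 2000):
    print(f"cue {cue} Hz: P(velar|l) = {p_velar(cue, 'l'):.2f}, "
          f"P(velar|r) = {p_velar(cue, 'r'):.2f}")
# The same acoustic token draws more velar responses after /r/ than after
# /l/: the detector's triggering condition depends on phonological context.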
Mann and Liberman (1983)
 Some provocative findings on category boundary shifts
for /d/ - /g/ contrasts in Japanese learners of English.
 Tested native English listeners and adult Japanese
learners of English with short and long exposure to L2,
for /d - g/ category boundary shifts following /ʃ/ or /s/
and, crucially, following /l/ or /r/.
 Japanese groups showed perceptual boundary
adaptations identical in form to those of the English
controls in their responses to stimuli on the /d - g/
continuum as a function of a preceding /l/ or /r/ sound.
 What does this mean?
Duplex Perception

 The dichotic presentation of the acoustic elements of a phonetic stimulus, separately to each ear.
 The invariant portions of
a [ba] [da] or [ga]
stimulus (called the base)
to the right ear, and the
variable portions of the
formant transitions to the
left ear.
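Physically, a duplex stimulus is just a stereo signal with the two components routed to different ears. A minimal sketch of the assembly step, assuming the base and transition have already been synthesized to the hypothetical files named below:

# Assemble a duplex-perception stimulus: base to the right ear, isolated
# formant-transition 'chirp' to the left ear. Input filenames are
# hypothetical 16-bit mono files sharing one sample rate.
import numpy as np
from scipy.io import wavfile

rate_b, base = wavfile.read("base.wav")         # invariant portion
rate_c, chirp = wavfile.read("transition.wav")  # variable F2/F3 transition
assert rate_b == rate_c, "components must share a sample rate"

n = max(len(base), len(chirp))
left = np.zeros(n, dtype=np.int16)
right = np.zeros(n, dtype=np.int16)
left[:len(chirp)] = chirp                       # chirp -> left ear
right[:len(base)] = base                        # base  -> right ear

wavfile.write("duplex.wav", rate_b, np.column_stack([left, right]))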
Duplex Perception
 The listener experiences the unusual sensation
of hearing both the unintegrated formant
transition, which sounds like a brief pitch glide
or ‘chirp’, in the left ear and, simultaneously, the
integrated phonetic stimulus - the syllable: [ba],
[da], or [ga] - in the right ear.
 By directing listeners’ attention to the phonetic
percept or to the formant transition (chirp), one
can investigate the effect of the processing task
upon the response characteristics of the auditory
system.
Duplex Perception
 It has been found that while perception of the
stop’s place of articulation is categorical,
perception of the chirps is non-categorical
(Liberman, Isenberg & Rakerd, 1981).
 When formant transitions are integrated with the
base components presented in the opposite ear,
they are treated categorically by the perceptual
system, but when processed as non-speech
auditory stimuli their perception is non-
categorical.
Conclusions: Is Speech Perception
Special?
 Three paradigms: dichotic listening, categorical
perception, and duplex perception, each originally
advanced as support for the strong version of the
speech mode hypothesis.
 Initially, dichotic listening and categorical perception provided converging evidence for the specialized nature of the perceptual machinery required for recognizing certain context-dependent phonetic features.
 Certain non-speech auditory perceptual tasks also showed evidence of ‘specialized processing’, raising the possibility that it is not speech per se that is special, but perhaps the level or type of perceptual processing involved.
What can neural imaging studies tell us about the
neural basis of speech perception abilities?

Shtyrov, Y., Pihko, E., & Pulvermüller, F. (2005). ‘Determinants of dominance: Is language laterality explained by physical or linguistic features of speech?’ NeuroImage, 27, 37-47.
 LH is specialized for:
– Rapid temporal processing
– Phonetic feature detection
– Word recognition?
An MEG study
 MEG imaging
 The response measure: MMN (mismatch
negativity)
– Elicited by rare deviant stimuli in a sequence of
standard stimuli.
– An ‘automatic’ change detection mechanism
– Elicited in absence of conscious attention
– Occurs approximately 100-150 msec post-stimulus.
– A selective left-lateralized increase in the magnitude of the MMN for native-language sounds compared with other acoustically similar sounds (Näätänen, 1997).
The stimuli
 Word
 Word
 Pseudo-word
 Non-speech
Method

 Acoustic stimulation (s = standard, d = deviant):
s…s…s…s…s…d…s…s…s…s…d…
 MEG recording 306 channel
 Data processing
– MMN: subtract ‘standard’ from ‘deviant’
response.
– Evaluate LH and RH responses separately
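The MMN computation itself is a simple subtraction of averaged evoked responses. The sketch below runs on random placeholder arrays standing in for real MEG epochs; the epoch counts, the left/right channel split, and the timing assumptions are all hypothetical.

# MMN sketch: average standard and deviant epochs, subtract, and compare
# hemispheres. Shapes are (trials x channels x time); data are placeholders.
import numpy as np

n_time = 300                                   # samples per epoch
standards = np.random.randn(500, 306, n_time)  # 500 standard-trial epochs
deviants = np.random.randn(100, 306, n_time)   # 100 deviant-trial epochs

mmn = deviants.mean(axis=0) - standards.mean(axis=0)  # difference wave

left_channels = np.arange(0, 153)              # hypothetical LH sensor set
right_channels = np.arange(153, 306)           # hypothetical RH sensor set

# Peak MMN amplitude per hemisphere in the ~100-150 ms post-stimulus window.
window = slice(100, 150)                       # assumes 1 sample = 1 ms
lh_peak = np.abs(mmn[left_channels][:, window]).max()
rh_peak = np.abs(mmn[right_channels][:, window]).max()
print(f"LH peak: {lh_peak:.2f}, RH peak: {rh_peak:.2f}")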
Results
• LH > RH MMN for words only.
• Greater amplitude of response to words than to pseudo-words.
• Strong bilateral MMN response to non-speech.
Discussion

 Basis of the stronger response and lateralization for words than pseudo-words.
– The role of the lexicon?
– The nature of the {-t} ending in Swedish?
 ‘The activation of memory networks for known items produced larger L-R asymmetry for the words than the other two stimuli.’
Localization of MMN source generators
