
8 Pronunciation Adaptation

The techniques of ASR and ASU have reached a level where they are ready
to be used in products. Nowadays people can travel easily to almost every
place in the world and due to the growing globalisation of business they
even have to. As a consequence public information systems are frequently
used by foreigners or people who are new to a city or region. If the sys-
tems are equipped with a speech interface, supporting several languages is
extremely costly since for each language AMs, pronunciation lexica and LMs
need to be built, all of which require large amounts of data for each lan-
guage. The number of languages has to be restricted somehow and as a
result many people will choose one of the supported languages and speak
with a foreign accent. Also the growing number of speech-enabled Internet
applications or various content selection tasks might require the speaker to
say titles or the like in a foreign language. Most of the systems nowadays
are not able to handle accented speech adequately and perform rather badly,
compared to the performance achieved when recognising speech of native
speakers, see [Wit99a, Cha97, Bon98].
While it was possible to capture some of the pronunciation variation in
high complexity HMM models for native speakers, this will not be possible
to the required extent for non-native speech, as was already indicated by the
findings of Jurafsky [Jur01]. Also acoustic adaptation alone, as described in
Chapter 6, will not be sufficient to deal with accents. This is also supported
by Woodland [Woo99].
State-of-the-art speech recognition systems use a pronunciation dictio-
nary to map the orthography of each word to its pronunciation, as we have
seen in Chapter 5. Usually, a ‘standard’ pronunciation (also called canonical
pronunciation or base form), as it can be found in published pronunciation
dictionaries, is used. These canonical pronunciations show the phonemic rep-
resentation of a word; that is, how it should be pronounced if it is spoken
in isolation. This is also called citation form, see [Pau98]. In isolated speech
this canonical pronunciation might come close to what speakers actually say,
but for some words there is a mismatch between this standard pronunciation
and its phonetic realisation. For continuous or even spontaneous speech, the
problem gets more severe. Due to co-articulation effects the pronounced words
tend to deviate more and more from the canonical pronunciation, especially
at higher speaking rates. A transcription of spontaneous American English
speech (Switchboard) revealed 80 variants of the word ‘the’ [Gre99], which
are certainly not captured in any standard American English pronunciation
dictionary. Jost and his co-authors [Jos97] estimated that in spontaneous
speech 40% of all words are not pronounced as in the standard pronunciation
dictionary. Also the pronunciation is very much speaker-specific, so that it is
not possible to determine one correct pronunciation. Many approaches were
presented that try to account for this variability by including more than one
pronunciation in the dictionary. Most of the research done in pronunciation
modelling, however, has focused solely on the derivation of native variants.
Since the problem is much more severe in the case of non-native speakers
speaking with a more or less strong accent, this particular problem will be
considered in the following sections. For this purpose a non-native database
is closely examined and the effect of including specialised non-native pronun-
ciations in the dictionary on recognition accuracy is demonstrated. So far the
biggest problem when modelling non-native pronunciation has been to obtain
accented speech databases. In this chapter a very flexible method is proposed
that makes it possible to derive pronunciation rules for non-native speakers
without using any accented speech data at all. It needs native data of the
two considered languages only and is thus applicable to any combination of
languages. Before these methods are described in detail, an overview of the
state of the art in pronunciation modelling is given in Section 8.1.

8.1 The State of the Art in Pronunciation Modelling


The dictionary relates the orthography of the words that are known to the
system – representing the desired recognition output – to their pronuncia-
tions – representing the input by the speaker. Of course it would be desirable
to automatically derive the pronunciation of a word from its orthography
without the need of human interference. The manual creation and/or cor-
rection of a pronunciation or segmentation of an utterance requires expert
knowledge and is a very time-consuming and costly task, especially for large
vocabularies of several tens of thousands of words. Adding new words to the vocabulary is
problematic, since the expert needs to be consulted again. A further problem
is that different trained specialists often do not produce exactly the same
transcription for the same utterance or consider the same variants for the
same word as important.
There is a consensus that including multiple pronunciation variants is im-
portant for ASR systems. Lamel and her co-authors [Lam96] showed that
a careful design of the pronunciation dictionary influences recognition per-
formance. In [AD99] a further investigation showed that not including cer-
tain variants introduces recognition errors. The main problem is that the
more alternatives are incorporated in the dictionary, the higher the
probability that one of them comes quite close to the pronunciation of a
different word. Thus the confusability in the dictionary increases. In many
publications the positive effect of including variants was more than nulli-
fied by the increased confusability, e.g. in [Noc98, MB98, AD99]. Lamel and
Adda-Decker [Lam96, AD99] furthermore found a (language dependent) re-
lation between the word frequency and the number of pronunciation variants
used, such that the number of variants increases with the frequency of the
words.
Also McAllaster [McA98] showed that, if all pronunciations in the corpus
are in the dictionary, a dramatic decrease in error rate can be achieved. He
showed this using simulated data that were automatically generated using
the pronunciation dictionary. Even if pronunciations were added that were
shared by different words, the error rates could drastically be reduced for a
spontaneous speech task. But he additionally found that if more variants than
actually occurred in the test data are included in the dictionary, performance
declined. So the consequence is that the dictionary needs to accurately reflect
the range of phonetic variation observed, otherwise performance is impaired.
If the same test was conducted on real data, an increase in error rate was
always observed, even if only variants that were known to appear in the
test data were added. He attributes this mainly to the fact that the HMM
models used were trained on data aligned with the canonical pronunciations,
thus resulting in ‘diffuse’ models. So both [McA98] and [AD99] argue that
introducing variants for the alignment of data that are used for HMM training
is necessary because it will produce more accurate AMs. But it needs to
be mentioned that there is considerable controversy over whether training
with more accurate transcriptions is better than training with the canonical
transcription only.
The majority of current speech recognisers use HMMs. High complexity
models, such as triphones with many mixtures, are already capable of mod-
elling pronunciation variation and co-articulation effects to a certain extent,
see [Hol99, AD99, Ril99]. Jurafsky and his co-authors investigated this issue
more closely [Jur01] and found that on one hand, some of the variation, such
as vowel reduction and phoneme substitution can indeed be handled by tri-
phones, provided that more training data for the cases under consideration
are available for triphone training. On the other hand, there are variations
like syllable deletions that cannot be captured by increased training data.
Recently He and Zhao [He01] found that for non-native English speakers
triphones perform worse than monophones. Ravishankar [Rav97] also states
that inferior pronunciations do not always cause misrecognitions because
they can to a certain extent be handled by the AMs, but they will lower
the acoustic likelihood of the sentence in which such a word occurs and will
increase the chances of an error elsewhere in the utterance. This is also con-
firmed by Fosler-Lussier [FL99], who found in experiments that not all words
for which the actual pronunciation differs from the one in the dictionary are
misrecognised. He states that not all variation can be appropriately captured
this way, especially if spontaneous speech is considered. Here factors such as
word frequency and speaking rate play an important role. If the speaking rate
is very high and the words are very frequent the syllabic distance from the
canonical pronunciation increases for many syllables. From the information-
theoretic viewpoint this makes sense, since high-frequency words are more
predictable. Thus more variation is allowed in their production at various
speaking rates because the listener will be able to reconstruct what was said
from the context and a few acoustic cues.
Greenberg [Gre99] investigated the influence of speaking rate on pronun-
ciations and found that at high speaking rates, more deletions take place.
Also [FL99] confirmed that infrequently used words, which carry a higher in-
formation value, tend to be pronounced canonically while frequently used
words (like pronouns, function words or articles) deviate from the canonical
pronunciation quite regularly.
All the results described indicate that variation in pronunciation cannot
be handled appropriately solely on the AM level (by increasing the data
used for triphone training or using speaker adaptation techniques) but that
a modification of the pronunciation dictionary becomes necessary. One of
the few publications known to the author that combines acoustic adaptation
(using MLLR) with the adaptation of the dictionary for non-native speech
is that of Huang [Hua00]. He also assumes that acoustic deviation is an
independent but complementary phenomenon and thus independently uses
MLLR adaptation for adapting the AMs and enhances the dictionary with
accented variants. He shows that the combination of the two methods yields
better results than either method alone. His work will be described in more
detail in this chapter.
The existing approaches for modelling pronunciation variation can be
classified into three broad classes, rule-based, data-driven and combined ap-
proaches. They will be presented in the following sections.

8.1.1 Rule-Based Approaches

In rule-based approaches a set of pronunciation rules is used to transform the
standard pronunciation into a pronunciation variant. They can for example
be of the form
/l @ n/ → /l n/

to account for the @-deletion at word endings in German. The rules are de-
rived using linguistic and phonetic knowledge on what kind of pronunciation
variation occurs in the kind of speech considered. These rules are then ap-
plied to the baseline dictionary. Thus for the entries in the dictionary a set
of alternative pronunciations is obtained and added. The advantage of this
approach is that it is completely task-independent, since it uses general lin-
guistic and phonetic rules and can thus be used across corpora and especially
for new words that are introduced to the system. The drawback, however, is
that the rules are often very general and thus too many variants are gener-
ated, some of which might not be observed very often. As we have already
seen, too many variants increase confusability. The difficulty is then to find
those rules that are really relevant for the task.
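
To make the rule-application step concrete, the following minimal Python
sketch applies a toy rule set to the entries of a baseline dictionary; the rule
shown is the /l @ n/ → /l n/ example from above, and the dictionary entry
is hypothetical.

    import re

    # Toy rule set: each rule rewrites a space-separated phoneme string.
    # Here: @-deletion at word endings, /l @ n/ -> /l n/.
    RULES = [
        (re.compile(r"l @ n$"), "l n"),
    ]

    def generate_variants(canonical):
        """Apply every rule to the canonical pronunciation once."""
        variants = set()
        for pattern, replacement in RULES:
            variant = pattern.sub(replacement, canonical)
            if variant != canonical:
                variants.add(variant)
        return variants

    dictionary = {"spielen": ["S p i: l @ n"]}  # hypothetical entry
    for word, prons in dictionary.items():
        for pron in list(prons):
            prons.extend(sorted(generate_variants(pron)))
    print(dictionary)  # {'spielen': ['S p i: l @ n', 'S p i: l n']}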
An example of a rule-based approach is that of Lethinen [Let98], who
showed that with even very rudimentary rules, obtained by segmenting a
graphemic string and simply converting the segments according to a German
grapheme-to-phoneme alphabet, a generated transcription together with its
application likelihood very often receives a higher ranking than the canonical
transcription. Also Wiseman and Downey [Wis98, Dow98] show that some
rules affect recognition accuracy while others do not.
In several publications [Wes96a, Wes96b, Kip96, Kip97], the authors present
their work on the Munich AUtomatic Segmentation system (MAUS). Pronuncia-
tion variants that were needed for segmenting the training data are generated
using a set of rules. In [Kip97] this rule-based approach is compared to a
statistical pronunciation model that uses micro pronunciation variants that
apply to a small number of phonemes. These rules are determined from a
hand-labelled corpus. The latter model achieves higher agreement with man-
ual transcriptions than the rule-based approach.

8.1.2 Data-Driven Approaches

In data-driven approaches, the alternative pronunciations are learned from
the speech data directly, so that it is possible to also compute application like-
lihoods from this data. These likelihoods are a measure for how frequently
a certain pronunciation is used. Additionally, in this case only pronuncia-
tions are generated that are really used. However, this is very much corpus
dependent and variants that are frequently used in one corpus do not nec-
essarily have to be used in another corpus as well. Another problem is that
this approach lacks any generalisation capability.
A method that is often used to derive the variants directly from the speech
data is to use a phoneme recogniser. A brief description of a phoneme recog-
niser is given at the end of this section. The problem with this approach is to cope
with the large number of phoneme errors that are introduced by the recog-
niser. The highest achievable phoneme recognition rates without using any
further constraints for the recogniser have been between 50 and 70% in the
past [dM98]. Further restrictions could be made by using phoneme bi- or
trigrams, but since the goal is to find unknown variants, the freedom of the
recogniser should be as high as possible.
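
The counting-and-pruning idea behind most data-driven methods can be
sketched in a few lines of Python. It assumes a phoneme recogniser (not
shown) has already decoded several utterances of the same word; rare
variants, which are likely to be recogniser errors, are discarded, and the
surviving ones keep their relative frequencies as application likelihoods. The
threshold value and the decodings are illustrative.

    from collections import Counter

    def select_variants(decodings, min_rel_freq=0.25):
        """Keep variants whose relative frequency exceeds the threshold."""
        counts = Counter(decodings)
        total = sum(counts.values())
        return {v: n / total for v, n in counts.items()
                if n / total >= min_rel_freq}

    # Five hypothetical decodings of the same word:
    decodings = ["l A: m", "l A: m", "b aI l A: m", "l A: m", "h Q I A: m"]
    print(select_variants(decodings))  # {'l A: m': 0.6}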
Hanna [Han99a] addressed the problem of phoneme recognition errors
by assigning different probabilities to insertions, deletions and substitutions.
This was done to avoid equally probable transcriptions if all insertion, dele-
tion and substitution probabilities are assigned the same values. To compute
separate probabilities, an iterative DP scheme and a confusion matrix were
used. While the most significant substitutions were retained, the number of
insignificant ones was reduced. In [Han99b] additionally pronunciations re-
sulting from co-articulation were removed first.
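
A minimal sketch of such a weighted alignment is given below: a standard
DP string alignment in which substitutions between confusable phoneme
pairs are made cheaper than arbitrary ones. The cost values are illustrative;
in [Han99a] they would be estimated iteratively from a confusion matrix.

    def align_cost(ref, hyp, sub_cost, ins_cost=1.0, del_cost=1.0):
        """Total cost of the cheapest alignment of two phoneme lists."""
        n, m = len(ref), len(hyp)
        d = [[0.0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            d[i][0] = d[i - 1][0] + del_cost
        for j in range(1, m + 1):
            d[0][j] = d[0][j - 1] + ins_cost
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d[i][j] = min(d[i - 1][j] + del_cost,      # deletion
                              d[i][j - 1] + ins_cost,      # insertion
                              d[i - 1][j - 1]
                              + sub_cost(ref[i - 1], hyp[j - 1]))
        return d[n][m]

    # Confusable pairs get a lower substitution cost than arbitrary ones.
    confusion = {("@", "3:"): 0.3}
    cost = lambda a, b: 0.0 if a == b else confusion.get((a, b), 1.5)
    print(align_cost("d e z @ t".split(), "d e z 3: t".split(), cost))  # 0.3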
Wester [Wes00a] uses decision trees to prune the variants generated by
a phoneme recogniser. Amdal [Ama00] uses a measure they call association
strength between phones. They use statistics on co-occurrences of phones and
use these for the alignment of the reference and alternative transcriptions,
which were generated using a phoneme recogniser. They create rules for speak-
ers separately and then merge them into one dictionary. Although this is a
promising approach for discarding phoneme recognition errors, it requires a
lot of data. Williams [Wil98c] uses CMs to select reliable pronunciations and
Mokbel [Mok98] groups variants that were obtained by a phoneme recogniser
and represents each group by one transcription.
Phoneme Recogniser. As described above, often a phoneme recogniser
is used to derive pronunciation variants from speech data directly, if the
only available information source is the speech signal and the orthography
of the word that was spoken. The phoneme recognition result then provides
a possible pronunciation for this word. A phoneme recogniser is a special
case in recognition. The dictionary does not consist of words as usual, but
of phonemes only. Therefore no words but only phoneme sequences can be
recognised (that can optionally later be mapped to words). An example of
an English phoneme recogniser is depicted in Figure 8.1. The search is either
not restricted at all, so that arbitrary phoneme sequences can be recognised
or it is restricted by a phoneme bi- or tri-gram LM. Then certain phoneme
sequences are favoured by the LM. Although a phoneme recogniser provides
the phoneme sequence best fitting the speech signal, and thus often achieves
higher phoneme scores than in word recognition tasks, phoneme recognition
rates are usually not very high, between 50 and 70%. But human expert
spectrogram readers were
also only able to achieve a phoneme recognition rate of around 69% [dM98].
This again shows that the context and meaning of a word or sentence plays an
important role for the recognition of speech both for humans and machines. A
phoneme recogniser was used for the derivation of non-native pronunciation
variants and the whole generation procedure will be described in Section 8.4.

8.1.3 Combined Approaches

It seems to be a good solution to combine the rule-based and data-driven
approaches, that is, to use speech corpora to derive a set of rules. In this way
phenomena really occurring in the speech can be covered while the possi-
bility to generalise to other tasks and corpora is retained. Of course, the
rules derived still depend on the corpus used. An example for such a com-
bined approach is given by Cremelie and Martens [Cre97, Cre99]. A simple
left-to-right pronunciation model was built by a forced alignment using the
standard transcription only. Then at all possible positions, deletion, insertion
and substitution transitions were inserted and a second alignment using this
model was conducted. From this, a number of candidate rules were generated
by finding the differences between the reference and this new transcription.
Then the rules were pruned and only the most likely ones were kept. So pro-
nunciation variants are obtained together with their application likelihoods.
A detailed analysis of the results revealed that most of the improvement was
caused by the co-articulation rules. This was also shown in [Yan00a, Yan00b].

[Figure: phoneme loop with Enter and Exit nodes; all English phonemes
(/Th/, /A:/, /U@/, ...) arranged in parallel between them]
Fig. 8.1. Schematic view of a phoneme recogniser
Another way to capture linguistic knowledge is to grow decision trees
using a set of linguistic and phonetic questions, see [Ril96, Ril99]. Similar
to the method proposed here, [Ril96, Ril99, Hum96] used decision trees to
learn a phone-to-phone mapping from the canonical pronunciations to the one
obtained from a database. While [Ril96, Ril99] extracted the variants from
hand-labelled corpora, [Hum96] obtained the variants by re-transcribing the
speech data with SI models and deriving typical substitutions. However, they
restricted the possible substitutions to vowels only. In all cases, the decision trees
were then used to generate the alternative pronunciations for the baseline
dictionary.
In general it can be stated that within-word variation which frequently
occurs in CSR is not so easy to derive just from a dictionary because it
depends very much on the speaking style, speaking rate and the speaker.
Cross-word variations, as for example in ‘going to’, which is often pronounced
like ‘gonna’, are easier to derive, because the changes that are made to the
pronunciation of one word depend on the preceding and following word to a
high extent (although still depending on the speaker as well). In contrast to
[Cre99, Yan00a], [Kes99] found that modelling cross-word variation actually
increases the WER if taken in isolation, whereas together with within-word
variation it improves the performance.

8.1.4 Miscellaneous Approaches

Apart from modelling pronunciation variation on the dictionary level it
can also be modelled on the HMM level. Some approaches exist that try
to account for the variability in pronunciation by the development of spe-
cialised HMMs that are trained or tailored for fast, slow or emotional speech,
[Pfa98, Fal99, Pol98, Eid99, Hei98, Wom97]. Another widely used approach
is to use NNs to model pronunciations, [MB98, Fuk98, Fuk97, Des96].
A common way to deal with cross-word variation is to use multi-words,
which are words joined together and added as single entries to the dictionary.
Those word sequences that occur frequently in spontaneous speech should
be covered without increasing the confusability with other words too much.
When added to the LM, they are often assigned the same probability as the
single word sequences. Multi-words were used e.g. by [Slo96], who generated
variants for words and multi-words by running a phoneme recogniser that
was constrained by a phoneme bi-gram. Relevant variants were determined
by a CM. The sets of multi-words were determined based on the frequency
of occurrence in the corpus. Both adding variants and multi-words could
decrease the WERs. This multi-word approach was also used by [Rav97,
Fin97, Noc98].
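
A minimal sketch of how such multi-words could be selected by corpus
frequency is given below; the corpus and threshold are illustrative. The
joined entry would then receive its own pronunciations, e.g. a reduced
variant such as /g V n @/ for ‘going to’ in addition to the concatenated
canonical forms.

    from collections import Counter

    def find_multiwords(sentences, min_count=2):
        """Return word pairs frequent enough to become dictionary entries."""
        bigrams = Counter()
        for words in sentences:
            bigrams.update(zip(words, words[1:]))
        return ["_".join(b) for b, n in bigrams.items() if n >= min_count]

    corpus = [["going", "to", "go"], ["i", "am", "going", "to", "leave"]]
    print(find_multiwords(corpus))  # ['going_to']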

8.1.5 Re-training the Acoustic Models

Most of the work presented so far concentrated on using pronunciation vari-
ants during recognition only. Even though there is a common agreement that
the modelling of pronunciation variation is an important topic, the improve-
ments in WER are rather modest in most of the cases. In [Wes00b] a detailed
analysis of the results was done, revealing that there are many differences in
the recognition result that are not reflected by the WER.
Another reason might be that usually the training of the acoustic models is
done using only the canonical pronunciation for the alignment of the training
data. The reason why it seems reasonable to re-train the acoustic models is
that if better transcriptions containing the actually spoken variants are used,
the number of alignment errors can be reduced and this in turn will result
in sharper models. An example of this was given already in Section 5.2. In
many publications, e.g. [Byr97, Slo96, Sar99, Fin97, Kes99] it has proven to
be beneficial to do a re-training of the acoustic models, using re-transcribed
data. An exception to this is [Hol99], where a re-training of the acoustic
models did not help.
When pronunciation variations are considered during recognition another
problem arises. If the LM remains unchanged (that means only the orthog-
raphy of a word is used in the LM), but for decoding a dictionary containing
all pronunciation variants is used, the consequence would be that all variants
are assigned the same a priori probability. This does not reflect reality. So it
would be better to include the variants with their corresponding probabili-
ties directly into the LM. However this increases complexity and requires a
corpus labelled with variants to obtain the LM probabilities for each of the
variants. Pousse [Pou97] showed that using contextual variants in a bi-gram
LM outperforms a LM that includes the orthographic representation of the
words only, but of course the complexity of the LM increases a lot. In [Kes99],
it was also shown that incorporating the variants into the LM is beneficial.
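
A minimal sketch of the difference: with variant-level priors, each dictionary
variant v of a word w is weighted with P(w, v) = P(w) P(v|w) instead of all
variants sharing one uniform prior. All numbers are illustrative.

    # Unigram word probability and variant likelihoods, e.g. estimated
    # from a corpus labelled with pronunciation variants.
    p_word = {"the": 0.05}
    p_variant_given_word = {"the": {"D @": 0.7, "D i:": 0.3}}

    for w, variants in p_variant_given_word.items():
        for v, pv in variants.items():
            # Joint a priori probability of word w realised as variant v.
            print(f"P({w}, /{v}/) = {p_word[w] * pv:.3f}")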
To summarise the state of the art in pronunciation modelling, it can be
stated that it is considered to be an important topic. However, results are
not as good as expected in most of the cases. One reason is the potentially
increased confusability, another one the use of ‘diffuse’ HMM models during
decoding. Some techniques, while improving the results on one database, do
not succeed on other databases, so it seems that considerable research is still
necessary concerning this topic.

8.2 Pronunciation Modelling of Accented and Dialect Speech

The main focus of the research in pronunciation modelling has so far been
native speech. When non-native or accented speech is considered, some more
problems arise. In the following description the native language of a speaker
will be called source language and the language that he is trying to speak will
be called target language. While in native speech, basically only insertions,
deletions and substitutions within the phoneme set of that particular lan-
guage have to be considered, for non-native speech phonemes of the source
languages are also used by many speakers. Witt [Wit99b] states that in gen-
eral learners of a second or third language tend to apply articulatory habits
and phonological knowledge of their native language. This is one reason why
many approaches assume a known mother tongue of the speaker, because
certain characteristics of the accented speech can then be predicted easily,
see [Bon98, Fis98, Fra99, Tra99, Her99, Wit99a, Tom00]. If no similar sound
of the target language exists in the source language, speakers tend to either
insert or delete vowels or consonants, in order to reproduce a syllable struc-
ture comparable to their native language. Furthermore, non-native speech is
often characterised by lower speech rates and different temporal character-
istics of the sounds, especially if the second language is not spoken fluently.
However, considering the intelligibility of non-native speech, [Wit99a] found
that spectral characteristics, such as F2 and F3 frequency locations, have a
stronger influence than the temporal ones. These frequency lo-
cations are influenced by changes in the tongue restriction centre (narrowing
the vocal tract due to the tongue). In contrast to this, F1 does not seem to
have such a strong influence on the intelligibility. F1 changes with the vocal
tract. So it seems that the tricky part when speaking in a foreign language
is the tongue movements.
As mentioned above, Humphries and his co-authors [Hum96] explicitly
modelled accented variants and achieved major reductions in WER.
Huang and his co-authors [Hua00] are one of the few researchers who
investigated the combination of MLLR speaker adaptation and the modifi-
cation of the pronunciation dictionary. They exploit the fact that between
speaker groups of different accents some clear tendencies of e.g. phoneme
substitutions can be observed. They obtained syllable level transcriptions us-
ing a syllable recogniser for Mandarin and aligned these with the reference
transcriptions to identify error pairs (mainly substitutions were considered).
These were used to derive transformation rules. New pronunciations were gen-
erated using these transformation rules and added to the canonical dictionary.
Using the extended dictionary and MLLR speaker adaptation alone, improves
the results. When both methods are combined even better improvements are
achieved, which indicates that acoustic deviation from the SI models and
pronunciation variation are at least partly independent phenomena.
This was also shown in an approach that combined pronunciation variant
modelling with VTLN, see [Pfa97]. Pfau and his co-authors use HMM models
specialised for different speaking rates and tested the effect of VTLN and
pronunciation modelling on these models. While both VTLN and the use of
pronunciation variants in the dictionary could improve the results, the best
results were achieved when both methods were combined.

8.3 Recognising Non-native Speech

In this section, some experiments are presented that tested the effect of
adding relevant non-native pronunciation variants to the pronunciation dic-
tionary for a non-native speech database. Recognition rates were computed
on the Interactive Spoken Language Education corpus (ISLE) (see [ISL]).
The speakers contained in this database are German and Italian learners
of English. The database was recorded for a project that aimed at devel-
oping a language learning tool that helps learners of a second language to
improve their pronunciation. The ISLE corpus is one of the few corpora
that exclusively contains non-native speakers with the same mother tongue,
in this case German and Italian. Often, if non-native speech is contained at
all in a database, it covers a diversity of source languages but only very few
speakers with the same source language. A more detailed description of the
ISLE corpus can be found in Appendix A.
However, the ISLE database is not optimal in the sense that very special
sentences were recorded. As already mentioned, the goal of the project was
to develop a language learning tool with special focus on evaluating the pro-
nunciation quality. So the sentences were carefully chosen so as to contain as
many as possible phoneme sequences that are known to be problematic for
Germans and Italians. The database is divided into blocks and while some
of the blocks contain very long sentences that were read from a book, other
parts contain only very short phrases, like e.g. ‘a thumb’. This was especially
a problem for the training of the LM for this task. The only textual data that
were available were the transcriptions that came with the ISLE database. A
bi-gram LM was trained using this data. The LM perplexity was 5.9. Due
to the very different structure of the training sentences, sub-optimal recogni-
tion rates are to be expected from such a LM. Fur-
thermore the fact that in contrast to our previous experiments this is a CSR
task, required the use of a different recogniser (HTK 3.0 was used for the ISLE experiments, see [HTK]), since the previously used
one can only handle isolated words and short phrases. The pre-processing of
the speech was the same as in all previous experiments. The baseline WER
using British English monophone models, trained on the British English Wall
Street Journal (WSJ), averaged over twelve test speakers was 40.7%. Unfor-
tunately no native speakers were available to evaluate to what extent the low
baseline recognition rates are caused by the foreign accent.
Part of the database is manually transcribed and contains the pronunci-
ations that were actually used by the speakers. These manual labels did not
consider the possibility that German or Italian phonemes were used; they
used the British English phoneme set only. All these non-native pronuncia-
tions were added to the baseline dictionary and tested. The results can be
seen in Figures 8.2 and 8.3 for Italian and German speakers, respectively.
The number of pronunciations per speaker was 1.2 and 1.8 if the German
and Italian rules were applied, respectively. In the baseline dictionary which
included a few native variants already, the average number was 1.03. In the
following figures, the speaker IDs are used together with the indices ‘ I’ (for
Italian) and ‘ G’ (for German). First of all, it should be noted that the base-
line recognition rates for Italian speakers are much worse than for German
speakers. This allows one to draw the conclusion that in general the En-
glish of the Italian speakers is worse than that of the German speakers in
terms of pronunciation errors. This was confirmed by the manual rating that
was available for the database and counted the phoneme error rates of the
speakers.
It can be seen that the recognition rates can be improved for almost all
speakers in the test set if the dictionary is enhanced with the correspond-
ing variants of the respective language (indicated by ‘GerVars’ and ‘ItaVars’,
[Figure: bar chart of % WER (40-90) per Italian speaker (41_I, 122_I, 123_I,
125_I, 129_I, 130_I) for BaseDict, GerVars and ItaVars]
Fig. 8.2. WERs for Italian speakers, using manually derived variants

[Figure: bar chart of % WER (40-80) per German speaker (12_G, 162_G,
163_G, 181_G, 189_G, 190_G) for BaseDict, GerVars and ItaVars]
Fig. 8.3. WERs for German speakers, using manually derived variants
respectively). Interestingly, for one German speaker (189 G) the Italian vari-
ants also improved the results and for some Italian speakers (41 I,122 I, 125 I,
130 I) the application of the German rules yielded improvements, indicating
that the necessary variants depend not only on the mother tongue of a speaker
but also on the speaker himself. Please note that manual phonetic transcrip-
tions were available only for two thirds of the data. That means that only
for words that occurred in these parts, pronunciation variants were added to
the dictionary. For all other words, no variants were included.
In a second step the canonical and the manual labels were used to man-
ually derive a set of pronunciation rules for each source language, thus ob-
taining a ‘German’ and an ‘Italian’ rule set.
For German, the most important rule from the phonological point of view
is the inability of Germans to reduce vowels. In English in unstressed sylla-
bles, vowels are very often reduced. The vowel that is mostly affected is /@/.
In German unstressed vowels are, with few exceptions, pronounced with their
full quality. To give an example, the correct British pronunciation of the En-
glish word ‘desert’ is /d e z @ t/. Many German speakers, however, pronounce
it /d e z 3: t/. The occurrence, i.e., how often the phoneme obeying the rule
was involved in errors, was 21.7%.
For Italian, the most important rule is to append a /@/ at word fi-
nal consonants, thus creating an open syllable, e.g. in the word ‘between’
/b @ t w i: n @/ for /b @ t w i: n/. The occurrence for this was 13.5%.
A more detailed description of all pronunciation rules can be found in Sec-
tion C.3 in Appendix C and in [Gor01b, Sah01]. The derived rules were used
to automatically generate German- and Italian-accented variants from the
canonical pronunciations, this time for all words in the dictionary.
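
The derivation step itself can be sketched as follows: canonical and manually
labelled pronunciations are compared position by position, and frequent
mismatches become candidate substitution rules. For brevity the sketch
assumes the string pairs are already aligned to equal length; the second
example pair is hypothetical.

    from collections import Counter

    def extract_rules(pairs, min_count=2):
        """Collect phoneme substitutions seen at least min_count times."""
        subs = Counter()
        for canonical, observed in pairs:
            for c, o in zip(canonical.split(), observed.split()):
                if c != o:
                    subs[(c, o)] += 1
        return [rule for rule, n in subs.items() if n >= min_count]

    pairs = [("d e z @ t", "d e z 3: t"),   # 'desert' with unreduced vowel
             ("k Q m @ n", "k Q m 3: n")]   # hypothetical further pair
    print(extract_rules(pairs))  # [('@', '3:')]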
First, simply all rules were used to derive the variants, yielding 3.2 vari-
ants per word. The corresponding recognition results (in % WER) are shown
in the second bar (‘AllRules’) of Figures 8.4 and 8.5, for the Italian and Ger-
man speakers, respectively. It can be seen that for four out of five Italian
speakers, but only for one German speaker, the baseline results can be im-
proved. The third bar (‘ItaRules’) shows the results if the Italian rules were
used, resulting in 2.5 variants per word. The baseline results can be improved
for all Italian speakers and the improvements are even bigger than before. For
all German speakers when using these Italian rules, performance decreases
drastically. This is exactly as expected, since we consider the Italian rules as
irrelevant for the German speakers. When applying the German rules, which
are depicted in the fourth bar (‘GerRules’) and yield 2.0 variants per word,
for the German speakers, the results are better than using all rules, but still
only for one speaker, the baseline recognition rate can be improved. Inter-
estingly, for some Italian speakers, the German rules can also improve the
baseline results, of course not as much as the Italian rules. Finally, when se-
lecting the rules speaker-wise (that means selecting from all German and all
Italian rules those that can improve the results for the respective speaker if
tested separately) to create the speaker-optimised dictionaries (‘optRules’),
the biggest improvements can be achieved for all German and four Italian
speakers.

[Figure: bar chart of % WER (40-80) per Italian speaker (41_I, 122_I, 123_I,
125_I, 129_I, 130_I) for BaseDict, AllRules, ItaRules, GerRules and OptRules]
Fig. 8.4. WERs for Italian speakers using speaker-optimised dictionaries

When we compare the results for the Italian and German rule set to the
results when the manually found variants were added directly to the dictio-
nary, we can observe that for many speakers the rules perform better than
the manual variants. One reason probably is that using the rules, we gener-
ate non-native variants for all words in the dictionary, whereas the manual
variants were only available for parts of the corpus. This allows one to draw
the conclusion that the rule set is able to generate relevant non-native pro-
nunciation variants.
The analysis of the recognition rates shows that a speaker-wise selection of
the rules is superior to adding all rules or rules that are typical for a speaker
of that mother tongue. This is shown by the fact that some of the German
rules improve the results for some Italian speakers and vice versa. We thus
conclude that using rules that do not reflect the speaker’s articulatory habits
can indeed lower the recognition rates, due to the increased confusability.
We further conclude that the closer a speaker’s pronunciation comes to the
native one, the more carefully the rules need to be selected. For most Italian
speakers applying all the rules or the Italian rules was already sufficient to
improve the baseline. This was not the case for the German speakers, where
only the speaker-optimised dictionaries achieved improvements.
As outlined above we assume that on one hand we have to account for the
‘non-native phoneme sequences’ that might occur, but on the other hand also
[Figure: bar chart of % WER (40-90) per German speaker (12_G, 162_G,
163_G, 181_G, 189_G, 190_G) for BaseDict, AllRules, ItaRules, GerRules and
OptRules]
Fig. 8.5. Recognition results for German speakers using speaker-optimised dictionaries

for the model mismatch that will occur for non-native speakers. Even if the
same phonemes exist in two languages, they will be different in sound colour.
This phenomenon cannot be captured by pronunciation adaptation alone but
by acoustic speaker adaptation. So we expect a combination of both methods
to be beneficial.
Figure 8.6 shows again the results averaged over all speakers and addi-
tionally the cases where the adaptation was applied to the baseline (‘Base-
Dict+MLLR’) as well as to the speaker-optimised dictionaries (‘optRules+-
MLLR’). MLLR can improve the baseline and the manually derived variants
(‘manVars’) but even more when the speaker-specific dictionaries are used.
Please note that the results for the combination of MLLR and the optimised
dictionaries are bigger than for either method alone; that implies that the
improvements are at least partly additive and thus the both methods should
be combined.
Even though the results clearly showed the necessity of explicitly mod-
elling non-native pronunciation variation in the dictionary, the rule-based
approach we used and which required the manual derivation of the rules
from an accented speech corpus, is not appropriate if we want to deal with a
variety of source language accents for different languages. Therefore, a new
method for generating pronunciation rules was developed and is presented in
the next section.
[Figure: bar chart of overall % WER (40-60) for BaseDict, BaseDict+MLLR,
ManVars, AllRules, optRules and optRules+MLLR]
Fig. 8.6. Overall recognition results if speaker-optimised dictionaries and MLLR
adaptation were used

8.4 Generating Non-native Pronunciation Variants


Neither the traditional rule-based nor the traditional data-driven methods
are suited for the problems we want to tackle, since they are too costly and
too inflexible. If we chose the rule-based approach we would need experts for
the various source languages we want to consider, who could generate rule-
sets for these source languages when the target language is to be spoken. If
we chose the data-driven approach we would have to collect large databases
in which several speakers from each source language are recorded speaking
the target language. Neither is feasible. This motivated the development of
a new approach that has been described in [Gor01a].
The following considerations apply to speakers who are not able to speak
the target language at all. They hear the words spoken by native speakers of
the target language several times and then try to reproduce them. The basic
assumption is that they will try to speak the words the way they hear them.
We again use the language pair German-English, but this time the setting is
just the other way round. Now the source language is English and the target
language is German. This switch was necessary since we wanted to exploit
the many repetitions of the same word that are only available in the German
command corpus, but not in the ISLE corpus.
Concretely that means that an English speaker will listen to one or several
German speakers speaking the same German word, and will then try to repeat
it with his English phoneme inventory. This process is shown in Figure 8.7. In
this example the English speaker tries to repeat the German word ‘Alarm’,
after having heard it several times, spoken by different German speakers.
[Figure: several native speakers of German say /? a l a: r m/; the English
speaker reproduces it as /k V p l A: m/]
Fig. 8.7. English speaker reproducing German speech

This procedure is simulated by using the HMM models that were trained
on the British English WSJ corpus to recognise German speech. This is done
using a phoneme recogniser and a phoneme bi-gram LM that was trained on
the English phoneme transcription files, so as to reflect the English phono-
tactic constraints. Using these constraints insertions and deletions that are
typical for English are expected to occur.
Recognising the German utterances with the English phoneme recogniser
provides us with several ‘English transcriptions’ for those words. These are
used to train a decision tree (as will be explained in Section 8.4.1) that is
then used to predict English-accented variants from the German canonical
one. This two-step procedure is again depicted in Figure 8.8.
The great advantage of the proposed method is that it needs only na-
tive speech data, in the case considered native English and native German
data. Usually one of the biggest problems is to acquire sufficient amounts of
accented data.
Some example phoneme recognition results for the German word ‘Alarm’
are given in Figure 8.9. In the database the word ‘Alarm’ was spoken 186
times by 186 different speakers. The left hand side shows the (canonical)
German pronunciation (which might not be exactly what was spoken, because
the database was aligned with the canonical pronunciation only), on the right
hand side some English phoneme recognition results are listed.
[Figure: step 1: German speech /? a l a: r m/ is decoded with the English
HMMs and an English phoneme n-gram, yielding English-accented variants
such as /Th A: l A: m/, /p { l A: m/ and /k V p l A: m/; step 2: the variants
predicted by the decision tree are added to the German dictionary used with
the German HMMs and LM for recognition]
Fig. 8.8. 1: Generating English pronunciations for German words and training the
decision tree 2: Applying the decision tree to the German baseline dictionary

Word ‘Alarm’ spoken by different     English phoneme
native speakers of German:           recognition results:

? a l a: r m                         T h A: l A: m
? a l a: r m                         sil b aI l A: m
? a l a: r m                         V I aU @ m
? a l a: r m                         p { l A: m
? a l a: r m                         aI l aU h V m u:
? a l a: r m                         h Q I A: m
? a l a: r m                         U k b { l A: u:

Fig. 8.9. English phoneme recognition result for the German word ‘Alarm’
We can see that the second part of the word is often recognised as /l A: m/.
For the beginning of the word there seems to be no consistency at all. The
reason in this particular case might be the fact that the word starts with a glot-
tal stop /?/ in German. Nothing comparable exists in English, which is
why it is hard to find a ‘replacement’. An auditory inspection of the results
showed that some English phoneme recognition results, although at first sight
seeming to be nonsense, were quite reasonable transcriptions when listening
to the phoneme segments. However, there were also many genuinely useless
results. By inducing decision trees, the hope is that the consistency that is
found in our example for the end of the word can be kept there. For the
word beginning, however, no phoneme sequence occurred twice in the exam-
ple, which lets us assume that it will be hard for the decision tree to predict
any reliable pronunciation at all. For these cases we added several instances of
the correct German pronunciation, so that if no English equivalent phoneme
sequence can be found, at least the German one will be retained for that
part. The best results for the non-native speakers were achieved when two
German pronunciations were added to (on the average) 100 English ones.
The trained tree was used to predict the accented variant from the German
canonical one for each entry in the baseline dictionary. To be able to use the
enhanced dictionary in our standard German recognition system, a mapping
of those English phonemes that do not exist in German was necessary. This
mapping can be found in Appendix C. Adding the non-native variants to
the dictionary performed better than replacing the canonical ones. However,
adding the variants means doubling the number of entries in the dictionary,
which might increase the confusability. Furthermore, mapping those English
phonemes that do not exist in German to the closest German phonemes performed better
than using a merged German/English HMM model set.
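
The mapping step can be sketched as a simple table lookup over the phoneme
string; the two mapping entries shown are illustrative, the actual mapping
used is listed in Appendix C.

    # English phonemes without a German counterpart are mapped to the
    # closest German phoneme; shared phonemes pass through unchanged.
    PHONE_MAP = {"{": "E", "A:": "a:"}

    def map_to_german(pron):
        return " ".join(PHONE_MAP.get(p, p) for p in pron.split())

    print(map_to_german("p { l A: m"))  # 'p E l a: m'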

8.4.1 Classification Trees

By using an English phoneme recogniser for decoding German speech, we
obtained English-accented pronunciations for the German words. Since we are
faced with high phoneme error rates (the phoneme error rate of the English
phoneme recogniser on an English test set was 47.3%), it is not possible to
simply add the generated variants to the dictionary as they are. The phoneme
sequences that are due to recogniser errors should be removed. In general it is
assumed that erroneous sequences will not appear as often as correct ones do
(remember that for several repetitions of the German words, English-accented
variants were generated), so only the correct parts should be retained. This
can be achieved by inducing decision or classification trees [Ril96, Ril99].
In this section a rather informal description of classification trees is given.
For a more detailed discussion on classification trees the interested reader is
referred to [Kuh93, Bre84].
Classification trees consist of inner nodes, arcs connecting these nodes and
finally leaves. The nodes are labelled with yes-no questions and depending
on the answer one or the other arc is followed. Using appropriate training
algorithms, such a tree can learn mappings from one data set to another
(of course decision trees can be applied to numerous other machine learn-
ing tasks). In the case considered it is provided with the canonical German
phoneme sequences and the English accented variants and is supposed to
learn the mapping between the two. Thus the tree can be applied to the
baseline dictionary after training to predict the accented variants. In the
experiments the commercial tool C5.0 [C5] was used to grow the tree.
The tree was induced from a set of training examples consisting of at-
tribute values, in this case the German phonemes in the vicinity of the
phoneme under consideration, along with the class that the training data
describes, in this case a single English target phoneme. A window moving
from the right to the left is considered, each time predicting one of the target
phonemes. Together with the phonemes in the vicinity of the source phoneme
the last predicted phoneme is also used as an attribute to make a prediction
for the following one. Each inner node of the tree is labelled with a question
like ‘Is the target phoneme a /{/?’ or ‘Is the phoneme two positions to the
right a /n/?’. The arcs of the tree are labelled with answers to the questions of
the originating nodes. Finally each leaf is labelled with a class to be predicted
when answering all questions accordingly from the root of the tree down to
that leaf, following the arcs. If the training data are non-deterministic, i.e.,
several classes are contained for a specific attribute constellation needed to
reach a leaf, the leaf can hold a distribution of the classes seen. In this case the class
that has most often been seen is predicted. After the tree is built from the
training data it can be used to predict new cases by following the questions
from the root node to the leaf nodes. In addition to the phoneme under con-
sideration itself, the last predicted phoneme and a phoneme context of three
to the left and three to the right was used, making up a total number of eight
attributes. A problem that arose was that of different lengths of the phoneme
strings. Before training the decision tree they needed to be aligned to have the
same length. An iterative procedure was used that starts with those entries
in the training data that have the same number of phonemes. Co-occurrences
are calculated on these portions and used to continuously calculate the align-
ment for all other entries, by inserting so-called ‘null-phonemes’ at the most
probable places (see [Rap98]) until the whole training set is aligned.
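
The following Python sketch illustrates the training setup described above,
with scikit-learn standing in for the commercial C5.0 tool that was actually
used. It assumes the German/English phoneme strings have already been
aligned to equal length with null-phonemes ‘-’; the single training pair is
illustrative, while the real training data contained many decodings per word.

    from sklearn.tree import DecisionTreeClassifier

    def windows(source, target):
        """Eight attributes per position: 3 left + source phoneme +
        3 right context, plus the previously predicted target phoneme."""
        src = ["-"] * 3 + source + ["-"] * 3
        samples, labels, prev = [], [], "-"
        for i, tgt in enumerate(target):
            samples.append(src[i:i + 7] + [prev])
            labels.append(tgt)
            prev = tgt
        return samples, labels

    # One aligned training pair: canonical German 'Alarm' and a
    # hypothetical English-accented decoding of it.
    german = "? a l a: r m".split()
    english = "- { l A: - m".split()
    X_sym, y = windows(german, english)

    # Encode phoneme symbols as integers, since sklearn trees expect
    # numeric features (C5.0 handles categorical attributes directly).
    vocab = {p: i for i, p in enumerate(
        sorted({p for row in X_sym for p in row}))}
    X = [[vocab[p] for p in row] for row in X_sym]
    tree = DecisionTreeClassifier().fit(X, y)
    print(tree.predict([X[1]]))  # ['{'] : German /a/ maps to English /{/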
One of the first questions that can be found in the generated tree is
the question whether the considered phoneme is a glottal stop /?/. If so, it is
removed in the English target output. This is very reasonable since in English
the glottal stop does not exist. In Figure 8.10 another example of a set of
questions is shown as a very small subtree that was generated for the German
phoneme /y:/ that occurs e.g. in the word ‘Büro’ /by:ro:/ (Engl. ‘office’). In
the case where the last predicted English phoneme was a /b/ and the last two
phonemes were /-/ (which is the sign representing the null-phonemes, here
very probably representing a word boundary), the predicted English phoneme
will be a /u:/. This makes sense, since English speakers are usually not able
to pronounce the German /y:/ and it is in practice very often substituted
by the phoneme /u:/. In the figure only those arcs of the subtree that were
labelled with ‘yes’ are shown in detail. The whole tree consisted of 16,992
nodes and 18,363 leaves.

y:
  is last predicted phone = ’b’?
    yes: is phone 1 to the left = ’-’?
      yes: is phone 2 to the left = ’-’?
        yes: predict u:

Fig. 8.10. Subtree for the German phoneme /y:/

8.4.2 Experiments and Results

A total number of 190 speakers from the German command set was used for
the phoneme recognition experiments that generated the English-accented
transcriptions. Eight native English speakers speaking German were excluded
to build the test set for the final word recognition experiments. A set of
experiments using a subset of the command corpus was conducted to find
the optimal settings w.r.t. the LM weight and word insertion penalties of the
phoneme recogniser. This was to avoid too many phoneme insertions, which
would occur frequently if no penalty was applied. From the speech files the
silence in the beginning and in the end was cut off to prevent the recogniser
from hypothesising fricatives or the like, where actually silence occurred.
We tried to balance insertions and deletions, however always keeping more
insertions than deletions, since we expect to find some typical insertions for
that source language. A decision tree was trained with the phoneme strings
generated this way and applied to the German dictionary to generate English-
accented German variants.
The goal was to improve the recognition results for English-accented Ger-
man speech using our standard German recognition system. Thus apart from
the dictionary, all settings remained unchanged. First the baseline WER re-
sults for eight native speakers of English (three American and five British
English speakers) were measured and compared to the slightly changed test
set of 15 native German speakers that was used in Chapter 6. The overall
result on the command corpus was 11.5% WER compared to 18% WER for
the non-native speakers, which corresponds to an increase in WER of 56.5%
relative.
The non-native speakers were then tested using the dictionary enhanced
with the variants. Figure 8.11 shows the results for the eight speakers using
the baseline dictionary (‘BaseDict’), the baseline dictionary combined with
our weighted MLLR adaptation (‘Base+MLLR’), the dictionary extended
with the generated English-accented variants (‘extdDict’) and finally the
extended dictionary combined with MLLR adaptation (‘extdMLLR’). The
American speakers are ID059, ID060 and ID075.

[Figure: bar chart of % WER (5-35) per speaker (ID022, ID054, ID059, ID060,
ID064, ID075, ID078, ID138) for BaseDict, Base+MLLR, extdDict and extdMLLR]
Fig. 8.11. WERs using the extended dictionary

Weighted MLLR adaptation can improve the baseline results for most of
the speakers. Relative improvements of up to 37% can be achieved. The ex-
tended dictionary can only improve the results for two speakers, when no ad-
ditional speaker adaptation is used. However, when combining the extended
dictionary with MLLR the results can be improved for most of the speakers
compared to the baseline dictionary. Further, for five out of the eight speakers
the extended dictionary with MLLR is better than the baseline dictionary
with MLLR and further improvements of up to 16% can be achieved. When
testing the enhanced dictionary on the native reference speakers only a slight
increase in WER from 11.5% to 11.8% can be observed, which might be due
to the increased confusability.
More results are shown in Figure 8.12. The experiments using the ex-
tended dictionary and weighted MLLR were repeated using the semi-super-
vised approach (i.e. weighted MLLR combined with the CM). It can be seen
that especially for those speakers with high initial WERs, the use of the semi-
supervised approach is superior to the unsupervised one, because using
misrecognised utterances for adaptation can be avoided. Improvements of up
to 22% can be achieved compared to unsupervised adaptation.

[Figure: bar chart of % WER (5-35) per speaker for BaseDict, extdMLLR and
extdMLLR+CM]
Fig. 8.12. WERs for unsupervised and semi-supervised adaptation, using the extended dictionary

The big differences in the baseline recognition rate indicate that the
strength of accent varies dramatically in the test set. An auditory test re-
vealed that while two speakers had only a very slight accent (ID022 and
ID064), one (ID078) had a very strong accent. This is reflected in the base-
line WERs, where ID022 and ID064 achieve the lowest WERs and ID078 is
among those who achieve high WERs. Furthermore mixing native British and
American speakers is also not optimal, but due to the lack of more British En-
glish speakers the American ones were used to have a more representative test
set. Figure 8.13 shows again the overall results, averaged over all speakers. It
can be seen that the extended dictionary combined with MLLR outperforms
the baseline, but not MLLR alone. When the CM is additionally used, the
weighted MLLR can be outperformed.

[Figure: bar chart of overall % WER (7.5-20) for BaseDict, Base+MLLR,
extdDict, extdMLLR and extdMLLR+CM]
Fig. 8.13. Overall results

The way the accented pronunciations are generated does not at all con-
sider possible differences in accents. As we have learned from the experiments
on the ISLE database, a speaker-wise selection of relevant pronunciations is
beneficial and sometimes even necessary to obtain improvements. Compar-
ing these results to the ISLE experiments, simply generating one non-native
variant for each entry in the dictionary that is the same for all speakers
corresponds to e.g. using all German rules to construct one dictionary for
all German speakers. The results showed that this could only improve the
performance for some of the speakers and was clearly outperformed by the
speaker-specific dictionary that used only selected rules for each speaker.
As a consequence instead of using the decision tree directly to generate
the variants as was done in the current experiments, rules should be derived
from the tree, such that a speaker-wise selection becomes possible. Then
improvements for more speakers are to be expected, also for the extended
dictionary alone. Unfortunately, at the time of writing this book it was
not possible to do so.
The results however indicate that the improvements that can be achieved
by acoustic adaptation using the weighted MLLR approach and by the pro-
posed pronunciation adaptation method are at least partly additive.

8.5 Summary

In this chapter the problem of recognising non-native speech was examined
and a new method for generating non-native pronunciations without using
non-native data was proposed. It was shown that performance decreases if a
recogniser that is optimised for native speech is exposed to non-native speech.
Investigations on the ISLE database showed that including specialised non-
native pronunciation variants in the dictionary can greatly improve the re-
sults. A set of rules was manually derived and the optimal set of rules was
determined for each speaker in the test set. The application of these spe-
cialised rules was clearly superior to using all rules or rules that are typical
for the native language of a speaker group. However, since the manual deriva-
tion is not feasible if the combination of several source and target languages
is to be covered, a new method was introduced that solely relies on native
speech to derive non-native variants automatically. The big advantage of the
proposed method is that it requires native data only, and can thus be re-
peated for all the languages for which HMM models trained on native speech
are available.
English native speakers who spoke German were investigated in more
detail. To derive the accented pronunciation variants, German speech was
decoded with an English phoneme recogniser. A decision tree was trained to
map the German canonical pronunciation to the English-accented variants
obtained. The trained tree was then used to generate an extended dictio-
nary, containing the native and the non-native variants. When combining
the enhanced dictionary with MLLR speaker adaptation, the results can be
improved for all speakers compared to the baseline dictionary and for the
majority of the speakers, compared to the baseline dictionary with MLLR
adaptation. While weighted MLLR alone can achieve up to 37% reduction
in WER, the additional usage of the enhanced dictionary achieves further
improvements of up to 16%. The use of the semi-supervised adaptation is su-
perior to the unsupervised approach, especially for speakers with high initial
WERs. Using the enhanced dictionary with the native speakers, the perfor-
mance remained almost stable.
However, it should be possible to further improve the results, if rules are
generated instead of using the tree directly, so that a speaker-wise selection
of the variants becomes possible. The experiments showed that the improve-
ments that can be achieved by acoustic adaptation (using the weighted MLLR
approach) and the proposed pronunciation adaptation method are at least
partly additive and should therefore be combined.
The approach is very promising because of its flexibility w.r.t. the consid-
ered languages. Only native data for the source and target language under
consideration is required to derive accented pronunciations of the target lan-
guage spoken with the accent of the source language. Even though in the
experiments only the language pair German (target language) and English
(source language) was considered, the proposed method is valid for arbitrary
language pairs and is able to overcome the problem of insufficient accented
speech data.
When pronunciation rules are to be selected specifically for each speaker,
this has so far been done by testing each rule separately. However in real
applications this is not feasible and the selection of rules will preferably be
conducted dynamically. For this purpose all previously generated rules could
be represented in a multi-dimensional feature space that represents different
‘ways of pronunciation’. The appropriate rules for the current speaker would
then be determined and chosen for dictionary adaptation. A method to con-
duct such a dynamic selection is currently under investigation and is briefly
outlined in Chapter 9.
