8 Pronunciation Adaptation
The techniques of ASR and ASU have reached a level where they are ready
to be used in products. Nowadays people can travel easily to almost every
place in the world and due to the growing globalisation of business they
even have to. As a consequence public information systems are frequently
used by foreigners or people who are new to a city or region. If the sys-
tems are equipped with a speech interface, supporting several languages is
extremely costly since for each language AMs, pronunciation lexica and LMs
need to be built, all of which require large amounts of data for each lan-
guage. The number of languages has to be restricted somehow and as a
result many people will choose one of the supported languages and speak
with a foreign accent. Also, the growing number of speech-enabled Internet
applications and various content selection tasks might require the speaker to
say titles or the like in a foreign language. Most of the systems nowadays
are not able to handle accented speech adequately and perform rather badly,
compared to the performance achieved when recognising speech of native
speakers, see [Wit99a, Cha97, Bon98].
While it was possible to capture some of the pronunciation variation in
high complexity HMM models for native speakers, this will not be possible
to the required extent for non-native speech, as was already indicated by the
findings of Jurafsky [Jur01]. Also, acoustic adaptation alone, as described in
Chapter 6, will not be sufficient to deal with accents. This is also supported
by Woodland [Woo99].
State-of-the-art speech recognition systems use a pronunciation dictio-
nary to map the orthography of each word to its pronunciation, as we have
seen in Chapter 5. Usually, a ‘standard’ pronunciation (also called canonical
pronunciation or base form), as it can be found in published pronunciation
dictionaries, is used. These canonical pronunciations show the phonemic rep-
resentation of a word; that is, how it should be pronounced if it is spoken
in isolation. This is also called citation form, see [Pau98]. In isolated speech
this canonical pronunciation might come close to what speakers actually say,
but for some words there is a mismatch between this standard pronunciation
and its phonetic realisation. For continuous or even spontaneous speech, the
problem gets more severe. Due to co-articulation effects the pronounced words
tend to deviate more and more from the canonical pronunciation, especially
S. Goronzy: Robust Adaptation to Non-native Accents, LNAI 2560, pp. 79-104, 2002.
Springer-Verlag Berlin Heidelberg 2002
to account for the @-deletion at word endings in German. The rules are de-
rived using linguistic and phonetic knowledge on what kind of pronunciation
variation occurs in the kind of speech considered. These rules are then ap-
plied to the baseline dictionary. Thus for the entries in the dictionary a set
of alternative pronunciations is obtained and added. The advantage of this
approach is that it is completely task-independent, since it uses general lin-
guistic and phonetic rules and can thus be used across corpora and especially
for new words that are introduced to the system. The drawback, however, is
that the rules are often very general and thus too many variants are gener-
ated, some of which might not be observed very often. As we have already
seen, too many variants increase confusability. The difficulty is then to find
those rules that are really relevant for the task.
An example of a rule-based approach is that of Lethinen [Let98], who
showed that even very rudimentary rules, obtained by segmenting a
graphemic string and simply converting the segments according to a
grapheme-to-phoneme alphabet for German, generate transcriptions that,
used together with their application likelihoods, very often receive a
higher ranking than the canonical transcription. Also, Wiseman and
Downey [Wis98, Dow98] show that some rules have an effect on the recognition
accuracy while others do not.
In several publications, [Wes96a, Wes96b, Kip96, Kip97] present their
work on the Munich AUtomatic Segmentation system (MAUS). Pronuncia-
tion variants that were needed for segmenting the training data are generated
using a set of rules. In [Kip97] this rule-based approach is compared to a
statistical pronunciation model that uses micro pronunciation variants that
apply to a small number of phonemes. These rules are determined from a
hand-labelled corpus. The latter model achieves higher agreement with man-
ual transcriptions than the rule-based approach.
like ‘gonna’, are easier to derive, because the changes that are made to the
pronunciation of one word depend on the preceding and following word to a
high extent (although still depending on the speaker as well). In contrast to
[Cre99, Yan00a], [Kes99] found that modelling cross-word variation actually
increases the WER if taken in isolation, whereas together with within-word
variation it improves the performance.
The main focus of the research in pronunciation modelling has so far been
native speech. When non-native or accented speech is considered, some more
problems arise. In the following description the native language of a speaker
will be called source language and the language that he is trying to speak will
be called target language. While in native speech, basically only insertions,
deletions and substitutions within the phoneme set of that particular lan-
guage have to be considered, for non-native speech phonemes of the source
languages are also used by many speakers. Witt [Wit99b] states that, in gen-
eral, learners of a second or third language tend to apply articulatory habits
and phonological knowledge of their native language. This is one reason why
many approaches assume a known mother tongue of the speaker, because
certain characteristics of the accented speech can then be predicted easily,
see [Bon98, Fis98, Fra99, Tra99, Her99, Wit99a, Tom00]. If no similar sound
of the target language exists in the source language, speakers tend to either
insert or delete vowels or consonants, in order to reproduce a syllable struc-
ture comparable to their native language. Furthermore, non-native speech is
often characterised by lower speech rates and different temporal character-
istics of the sounds, especially if the second language is not spoken fluently.
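The deviations just described can be made concrete by aligning a canonical with an accented pronunciation at the phoneme level, which labels each position as a match, substitution, insertion, or deletion. The following sketch uses a standard dynamic-programming alignment with unit costs; the example pronunciations are the ones from the Italian rule discussed later in this chapter, and the cost choices are illustrative, not the book's setup.

```python
def align(canonical, accented):
    """Edit-distance alignment over phoneme lists; returns the edit operations."""
    n, m = len(canonical), len(accented)
    # dp[i][j] = minimal number of edits mapping canonical[:i] to accented[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if canonical[i - 1] == accented[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost,  # match/substitute
                           dp[i - 1][j] + 1,         # delete
                           dp[i][j - 1] + 1)         # insert
    # backtrace to recover the operations
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (
                0 if canonical[i - 1] == accented[j - 1] else 1):
            kind = "match" if canonical[i - 1] == accented[j - 1] else "substitute"
            ops.append((kind, canonical[i - 1], accented[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(("delete", canonical[i - 1], None))
            i -= 1
        else:
            ops.append(("insert", None, accented[j - 1]))
            j -= 1
    return list(reversed(ops))

# Italian-accented 'between': a /@/ is appended after the word-final consonant
canonical = "b @ t w i: n".split()
accented = "b @ t w i: n @".split()
print(align(canonical, accented)[-1])  # ('insert', None, '@')
```

For non-native speech, the operations may of course involve source-language phonemes on the accented side, which this unit-cost scheme handles without modification.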
In this section, some experiments are presented that tested the effect of
adding relevant non-native pronunciation variants to the pronunciation dic-
tionary for a non-native speech database. Recognition rates were computed
on the Interactive Spoken Language Education corpus (ISLE) (see [ISL]).
The speakers contained in this database are German and Italian learners
of English. The database was recorded for a project that aimed at devel-
oping a language learning tool that helps learners of a second language to
improve their pronunciation. The ISLE corpus is one of the few corpora
that exclusively contains non-native speakers with the same mother tongue,
in this case German and Italian. Often, if non-native speech is contained at
all in a database, it covers a diversity of source languages but only very few
speakers with the same source language. A more detailed description of the
ISLE corpus can be found in Appendix A.
However, the ISLE database is not optimal in the sense that very special
sentences were recorded. As already mentioned, the goal of the project was
to develop a language learning tool with special focus on evaluating the pro-
nunciation quality. So the sentences were carefully chosen so as to contain
as many phoneme sequences as possible that are known to be problematic for
Germans and Italians. The database is divided into blocks and while some
of the blocks contain very long sentences that were read from a book, other
parts contain only very short phrases, e.g. ‘a thumb’. This was especially
a problem for the training of the LM for this task. The only textual data that
were available were the transcriptions that came with the ISLE database. A
bi-gram LM was trained using this data. The LM perplexity was 5.9. We
expected sub-optimal results from such an LM as far as the recognition rates
were concerned, due to the very different structure of the training sentences. Fur-
thermore, the fact that, in contrast to our previous experiments, this is a CSR
task required the use of a different recogniser¹, since the previously used
one can only handle isolated words and short phrases. The pre-processing of
the speech was the same as in all previous experiments. The baseline WER
using British English monophone models, trained on the British English Wall
Street Journal (WSJ), averaged over twelve test speakers was 40.7%. Unfor-
tunately no native speakers were available to evaluate to what extent the low
baseline recognition rates are caused by the foreign accent.
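As a side note, the perplexity of a bi-gram LM of the kind used here can be computed as in the following toy sketch. The add-one smoothing and the one-sentence corpus are purely illustrative; the actual LM was trained on the ISLE transcriptions with a different setup.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count bigram and history (unigram) frequencies over tokenised sentences."""
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        vocab.update(toks)
        unigrams.update(toks[:-1])            # histories
        bigrams.update(zip(toks, toks[1:]))
    return unigrams, bigrams, len(vocab)

def perplexity(sentences, unigrams, bigrams, v):
    """Per-token perplexity with add-one (Laplace) smoothing."""
    log_prob, n = 0.0, 0
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        for h, w in zip(toks, toks[1:]):
            p = (bigrams[(h, w)] + 1) / (unigrams[h] + v)
            log_prob += math.log2(p)
            n += 1
    return 2 ** (-log_prob / n)

unigrams, bigrams, v = train_bigram(["a thumb"])
print(round(perplexity(["a thumb"], unigrams, bigrams, v), 3))  # 2.5
```

A perplexity as low as 5.9 simply reflects that the test sentences were drawn from the same small, repetitive prompt set as the training transcriptions.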
Part of the database is manually transcribed and contains the pronunci-
ations that were actually used by the speakers. These manual labels did not
consider the possibility that German or Italian phonemes were used; they
used the British English phoneme set only. All these non-native pronuncia-
tions were added to the baseline dictionary and tested. The results can be
seen in Figures 8.2 and 8.3 for Italian and German speakers, respectively.
The number of pronunciations per speaker was 1.2 and 1.8 if the German
and Italian rules were applied, respectively. In the baseline dictionary which
included a few native variants already, the average number was 1.03. In the
following figures, the speaker IDs are used together with the suffixes ‘_I’ (for
Italian) and ‘_G’ (for German). First of all, it should be noted that the base-
line recognition rates for Italian speakers are much worse than for German
speakers. This allows one to draw the conclusion that in general the En-
glish of the Italian speakers is worse than that of the German speakers in
terms of pronunciation errors. This was confirmed by the manual rating that
was available for the database and counted the phoneme error rates of the
speakers.
It can be seen that the recognition rates can be improved for almost all
speakers in the test set if the dictionary is enhanced with the correspond-
ing variants of the respective language (indicated by ‘GerVars’ and ‘ItaVars’,
¹ HTK 3.0 was used for the ISLE experiments, see [HTK].
Fig. 8.2. WERs for Italian speakers, using manually derived variants (% WER,
speakers 41_I, 122_I, 123_I, 125_I, 129_I, 130_I; bars: BaseDict, GerVars, ItaVars)
Fig. 8.3. WERs for German speakers, using manually derived variants (% WER,
speakers 12_G, 162_G, 163_G, 181_G, 189_G, 190_G)
respectively). Interestingly, for one German speaker (189_G) the Italian vari-
ants also improved the results and for some Italian speakers (41_I, 122_I, 125_I,
130_I) the application of the German rules yielded improvements, indicating
that the necessary variants depend not only on the mother tongue of a speaker
but also on the speaker himself. Please note that manual phonetic transcrip-
tions were available only for two thirds of the data. That means that pronun-
ciation variants were added to the dictionary only for words that occurred in
these parts. For all other words, no variants were included.
In a second step the canonical and the manual labels were used to man-
ually derive a set of pronunciation rules for each source language, thus ob-
taining a ‘German’ and an ‘Italian’ rule set.
For German, the most important rule from the phonological point of view
is the inability of Germans to reduce vowels. In English, vowels in unstressed
syllables are very often reduced; the vowel most affected is /@/. In German,
unstressed vowels are, with few exceptions, pronounced with their full quality.
To give an example, the correct British pronunciation of the English word
‘desert’ is /d e z @ t/. Many German speakers, however, pronounce it
/d e z 3: t/. The occurrence rate, i.e. how often the phoneme affected by the
rule was involved in errors, was 21.7%.
For Italian, the most important rule is to append /@/ to word-final
consonants, thus creating an open syllable, e.g. in the word ‘between’:
/b @ t w i: n @/ for /b @ t w i: n/. The occurrence rate for this was 13.5%.
A more detailed description of all pronunciation rules can be found in Sec-
tion C.3 in Appendix C and in [Gor01b, Sah01]. The derived rules were used
to automatically generate German- and Italian-accented variants from the
canonical pronunciations, this time for all words in the dictionary.
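The expansion of a baseline dictionary with such rule-generated variants can be sketched as follows. The two rule implementations are deliberately simplified: the German rule replaces every /@/ without checking stress, and the vowel inventory is only partial, so the rule names and sets are illustrative rather than the book's exact rule set.

```python
# Partial vowel inventory, for deciding whether a word ends in a consonant
VOWELS = {"@", "e", "i:", "3:", "A:", "u:", "aI", "{"}

def german_rule(pron):
    """Germans tend not to reduce vowels: replace /@/ by the full vowel /3:/."""
    if "@" in pron:
        return ["3:" if p == "@" else p for p in pron]
    return None  # rule not applicable

def italian_rule(pron):
    """Italians tend to append /@/ after a word-final consonant (open syllable)."""
    if pron and pron[-1] not in VOWELS:
        return pron + ["@"]
    return None

def extend_dictionary(lexicon, rules):
    """Apply each rule to each canonical pronunciation and add the variants."""
    extended = {}
    for word, prons in lexicon.items():
        variants = list(prons)
        for pron in prons:
            for rule in rules:
                v = rule(pron)
                if v is not None and v not in variants:
                    variants.append(v)
        extended[word] = variants
    return extended

lexicon = {"desert": ["d e z @ t".split()],
           "between": ["b @ t w i: n".split()]}
ext = extend_dictionary(lexicon, [german_rule, italian_rule])
# 'desert' gains the German-accented variant /d e z 3: t/;
# 'between' gains the Italian-accented variant /b @ t w i: n @/.
```

Selecting rules speaker-wise, as done later in this section, then amounts to passing a different rule list per speaker.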
First, simply all rules were used to derive the variants, yielding 3.2 vari-
ants per word. The corresponding recognition results (in % WER) are shown
in the second bar (‘AllRules’) of Figures 8.4 and 8.5, for the Italian and Ger-
man speakers, respectively. It can be seen that for four out of five Italian
speakers, but only for one German speaker, the baseline results can be im-
proved. The third bar (‘ItaRules’) shows the results if the Italian rules were
used, resulting in 2.5 variants per word. The baseline results can be improved
for all Italian speakers and the improvements are even bigger than before. For
all German speakers, performance decreases drastically when these Italian
rules are used. This is exactly as expected, since we consider the Italian rules as
irrelevant for the German speakers. When applying the German rules, which
are depicted in the fourth bar (‘GerRules’) and yield 2.0 variants per word,
for the German speakers, the results are better than using all rules, but the
baseline recognition rate can still be improved for only one speaker. Inter-
estingly, for some Italian speakers, the German rules can also improve the
baseline results, of course not as much as the Italian rules. Finally, when se-
lecting the rules speaker-wise (that means selecting from all German and all
Italian rules those that can improve the results for the respective speaker if
Fig. 8.4. Recognition results for Italian speakers (% WER, speakers 41_I, 122_I,
123_I, 125_I, 129_I, 130_I)
When we compare the results for the Italian and German rule set to the
results when the manually found variants were added directly to the dictio-
nary, we can observe that for many speakers the rules perform better than
the manual variants. One reason probably is that using the rules, we gener-
ate non-native variants for all words in the dictionary, whereas the manual
variants were only available for parts of the corpus. This allows one to draw
the conclusion that the rule set is able to generate relevant non-native pro-
nunciation variants.
The analysis of the recognition rates shows that a speaker-wise selection of
the rules is superior to adding all rules or rules that are typical for a speaker
of that mother tongue. This is shown by the fact that some of the German
rules improve the results for some Italian speakers and vice versa. We thus
conclude that using rules that do not reflect the speaker’s articulatory habits
can indeed lower the recognition rates, due to the increased confusability.
We further conclude that the closer a speaker’s pronunciation comes to the
native one, the more carefully the rules need to be selected. For most Italian
speakers applying all the rules or the Italian rules was already sufficient to
improve the baseline. This was not the case for the German speakers, where
only the speaker-optimised dictionaries achieved improvements.
As outlined above, we assume that on the one hand we have to account for the
‘non-native phoneme sequences’ that might occur, but on the other hand also
Fig. 8.5. Recognition results for German speakers using speaker-optimised
dictionaries (% WER, speakers 12_G, 162_G, 163_G, 181_G, 189_G, 190_G;
bars: BaseDict, AllRules, ItaRules, GerRules, OptRules)
for the model mismatch that will occur for non-native speakers. Even if the
same phonemes exist in two languages, they will be different in sound colour.
This phenomenon cannot be captured by pronunciation adaptation alone but
by acoustic speaker adaptation. So we expect a combination of both methods
to be beneficial.
Figure 8.6 shows again the results averaged over all speakers and addi-
tionally the cases where the adaptation was applied to the baseline (‘Base-
Dict+MLLR’) as well as to the speaker-optimised dictionaries (‘optRules+-
MLLR’). MLLR can improve the baseline and the manually derived variants
(‘manVars’) but even more when the speaker-specific dictionaries are used.
Please note that the improvements for the combination of MLLR and the optimised
dictionaries are bigger than for either method alone; this implies that the
improvements are at least partly additive and thus both methods should
be combined.
Even though the results clearly showed the necessity of explicitly mod-
elling non-native pronunciation variation in the dictionary, the rule-based
approach we used, which required the manual derivation of the rules
from an accented speech corpus, is not appropriate if we want to deal with a
variety of source language accents for different languages. Therefore, a new
method for generating pronunciation rules was developed and is presented in
the next section.
Fig. 8.6. Overall results averaged over all speakers (% WER; bars: BaseDict,
BaseDict+MLLR, ManVars, AllRules, optRules, optRules+MLLR)
[Figure: the canonical German pronunciation /? a l a: r m/ is decoded with
English phonemes, e.g. as /k V p l A: m/]
This procedure is simulated by using the HMM models that were trained
on the British English WSJ corpus to recognise German speech. This is done
using a phoneme recogniser and a phoneme bi-gram LM that was trained on
the English phoneme transcription files, so as to reflect the English phono-
tactic constraints. Using these constraints, insertions and deletions that are
typical for English are expected to occur.
Recognising the German utterances with the English phoneme recogniser
provides us with several ‘English transcriptions’ for those words. These are
used to train a decision tree (as will be explained in Section 8.4.1) that is
then used to predict English-accented variants from the German canonical
one. This two step procedure is again depicted in Figure 8.8.
The great advantage of the proposed method is that it needs only na-
tive speech data, in the case considered native English and native German
data. Usually one of the biggest problems is to acquire sufficient amounts of
accented data.
Some example phoneme recognition results for the German word ‘Alarm’
are given in Figure 8.9. In the database the word ‘Alarm’ was spoken 186
times by 186 different speakers. The left hand side shows the (canonical)
German pronunciation (which might be not exactly what was spoken, because
the database was aligned with the canonical pronunciation only), on the right
hand side some English phoneme recognition results are listed.
Fig. 8.8. 1: Generating English pronunciations for German words and training the
decision tree. 2: Applying the decision tree to the German baseline dictionary
? a l a: r m  →  T h A: l A: m
? a l a: r m  →  sil b aI l A: m
? a l a: r m  →  V I aU @ m
? a l a: r m  →  p { l A: m
? a l a: r m  →  aI l aU h V m u:
? a l a: r m  →  h Q I A: m
? a l a: r m  →  U k b { l A: u:
Fig. 8.9. English phoneme recognition results for the German word ‘Alarm’ (left:
canonical German pronunciation; right: English phoneme recognition result)
We can see that the second part of the word is often recognised as /l A: m/.
For the beginning of the word there seems to be no consistency at all. The
reason in this particular case might be the fact that the word starts with a glot-
tal stop /?/ in German. Nothing comparable exists in English, which is
why it is hard to find a ‘replacement’. An auditory inspection of the results
showed that some English phoneme recognition results, although at first sight
seeming to be nonsense, were quite reasonable transcriptions when listening
to the phoneme segments. However, there were also many genuinely useless
results. By inducing decision trees, the hope is that the consistency found
in our example for the end of the word can be preserved. For the
word beginning, however, no phoneme sequence occurred twice in the exam-
ple, which lets us assume that it will be hard for the decision tree to predict
any reliable pronunciation at all. For these cases we added several instances of
the correct German pronunciation, so that if no English equivalent phoneme
sequence can be found, at least the German one will be retained for that
part. The best results for the non-native speakers were achieved when two
German pronunciations were added to (on average) 100 English ones.
The trained tree was used to predict the accented variant from the German
canonical one for each entry in the baseline dictionary. To be able to use the
enhanced dictionary in our standard German recognition system, a mapping
of those English phonemes that do not exist in German was necessary. This
mapping can be found in Appendix C. Adding the non-native variants to
the dictionary performed better than replacing the canonical ones. However,
adding the variants means doubling the number of entries in the dictionary,
which might increase the confusability. Furthermore, mapping those English
phonemes that do not exist in German to the closest German phonemes performed
better than using a merged German/English HMM model set.
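This mapping step can be sketched as follows. The mapping entries shown are assumed for illustration only and do not reproduce the actual mapping, which is given in Appendix C.

```python
# Hypothetical closest-phoneme choices for English phonemes absent from
# German; the real mapping is listed in Appendix C of the book.
EN_TO_DE = {"{": "E", "T": "s", "D": "z", "w": "v"}

def map_to_german(pron):
    """Replace English-only phonemes by their closest German counterpart."""
    return [EN_TO_DE.get(p, p) for p in pron]

print(map_to_german("T h A: l A: m".split()))  # ['s', 'h', 'A:', 'l', 'A:', 'm']
```

After this mapping, every variant in the extended dictionary can be scored with the standard German HMM set, which is why it outperformed the merged German/English model set.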
on the answer one or the other arc is followed. Using appropriate training
algorithms, such a tree can learn mappings from one data set to another
(of course decision trees can be applied to numerous other machine learn-
ing tasks). In the case considered it is provided with the canonical German
phoneme sequences and the English accented variants and is supposed to
learn the mapping between the two. Thus the tree can be applied to the
baseline dictionary after training to predict the accented variants. In the
experiments the commercial tool C5.0 [C5] was used to grow the tree.
The tree was induced from a set of training examples consisting of at-
tribute values, in this case the German phonemes in the vicinity of the
phoneme under consideration, along with the class that the training data
describes, in this case a single English target phoneme. A window moving
from the right to the left is considered, each time predicting one of the target
phonemes. Together with the phonemes in the vicinity of the source phoneme
the last predicted phoneme is also used as an attribute to make a prediction
for the following one. Each inner node of the tree is labelled with a question
like ‘Is the target phoneme a /{/?’ or ‘Is the phoneme two positions to the
right a /n/?’. The arcs of the tree are labelled with answers to the questions of
the originating nodes. Finally each leaf is labelled with a class to be predicted
when answering all questions accordingly from the root of the tree down to
that leaf, following the arcs. If the training data are non-deterministic, i.e.,
several classes are contained for a specific attribute constellation needed to
reach a leaf, it can hold a distribution of classes seen. In this case the class
that has most often been seen is predicted. After the tree is built from the
training data it can be used to predict new cases by following the questions
from the root node to the leaf nodes. In addition to the phoneme under con-
sideration itself, the last predicted phoneme and a phoneme context of three
to the left and three to the right were used, making up a total of eight
attributes. A problem that arose was that of the different lengths of the phoneme
strings. Before training the decision tree they needed to be aligned to have the
same length. An iterative procedure was used that starts with those entries
in the training data that have the same number of phonemes. Co-occurrences
are calculated on these portions and used to continuously calculate the align-
ment for all other entries, by inserting so-called ‘null-phonemes’ at the most
probable places (see [Rap98]) until the whole training set is aligned.
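The windowed attribute extraction and the leaf prediction described above can be sketched as follows. For brevity, a lookup table holding the majority class per attribute context stands in for the C5.0 tree; for attribute constellations seen in training this is exactly what a fully grown tree predicts, although the table cannot generalise to unseen contexts the way the tree's questions do. The toy alignment of ‘Alarm’ below is hypothetical.

```python
from collections import Counter, defaultdict

NULL = "-"  # the 'null-phoneme', also used here to pad the context window

def attributes(source, i, last_pred):
    """The eight attributes: the source phoneme itself, three context phonemes
    to the left and to the right, and the previously predicted target phoneme."""
    padded = [NULL] * 3 + list(source) + [NULL] * 3
    j = i + 3
    return tuple(padded[j - 3:j + 4]) + (last_pred,)

def train(pairs):
    """pairs: (german_phonemes, english_phonemes), pre-aligned to equal length
    with null-phonemes. The window moves from right to left, as in the text."""
    counts = defaultdict(Counter)
    for src, tgt in pairs:
        last = NULL
        for i in reversed(range(len(src))):
            counts[attributes(src, i, last)][tgt[i]] += 1
            last = tgt[i]
    # majority class per context: what a fully grown tree predicts at its leaf
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

def predict(model, src):
    out, last = [], NULL
    for i in reversed(range(len(src))):
        # back off to the source phoneme for unseen contexts (a simplification;
        # C5.0 generalises via its questions instead)
        pred = model.get(attributes(src, i, last), src[i])
        if pred != NULL:
            out.append(pred)
        last = pred
    return list(reversed(out))

# Toy alignment (hypothetical): the glottal stop and /r/ align to null-phonemes
german = "? a l a: r m".split()
english = "- { l A: - m".split()
model = train([(german, english)] * 3)
print(predict(model, german))  # ['{', 'l', 'A:', 'm']
```

Predicting a null-phoneme and dropping it from the output is how deletions, such as the removal of the glottal stop, are realised.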
One of the first questions that can be found in the generated tree is
the question whether the considered phoneme is a glottal stop /?/. If so, it is
removed in the English target output. This is very reasonable since the glottal
stop does not exist in English. In Figure 8.10, another example of a set of
questions is shown as a very small subtree that was generated for the German
phoneme /y:/ that occurs e.g. in the word ‘Büro’ /by:ro:/, (Engl. ‘office’). In
the case where the last predicted English phoneme was a /b/ and the last two
phonemes were /-/ (which is the sign representing the null-phonemes, here
very probably representing a word boundary), the predicted English phoneme
will be a /u:/. This makes sense, since English speakers are usually not able
to pronounce the German /y:/ and it is in practice very often substituted
by the phoneme /u:/. In the figure only those arcs of the subtree that were
labelled with ‘yes’ are shown in detail. The whole tree consisted of 16,992
nodes and 18,363 leaves.
[Fig. 8.10: subtree mapping the German phoneme /y:/ to the English phoneme /u:/]
A total number of 190 speakers from the German command set was used for
the phoneme recognition experiments that generated the English-accented
transcriptions. Eight native English speakers speaking German were excluded
to build the test set for the final word recognition experiments. A set of
experiments using a subset of the command corpus was conducted to find
the optimal settings w.r.t. the LM weight and word insertion penalties of the
phoneme recogniser. This was to avoid too many phoneme insertions, which
would occur frequently if no penalty was applied. From the speech files the
silence at the beginning and at the end was cut off to prevent the recogniser
from hypothesising fricatives or the like where actually silence occurred.
We tried to balance insertions and deletions, however always keeping more
insertions than deletions, since we expect to find some typical insertions for
that source language. A decision tree was trained with the phoneme strings
generated this way and applied to the German dictionary to generate English-
accented German variants.
The goal was to improve the recognition results for English-accented Ger-
man speech using our standard German recognition system. Thus apart from
the dictionary, all settings remained unchanged. First the baseline WER re-
sults for eight native speakers of English (three American and five British
English speakers) were measured and compared to the slightly changed test
set of 15 native German speakers that was used in Chapter 6. The overall
result on the command corpus was 11.5% WER compared to 18% WER for
the non-native speakers, which corresponds to an increase in WER of 56.5%
relative.
The non-native speakers were then tested using the dictionary enhanced
with the variants. Figure 8.11 shows the results for the eight speakers using
the baseline dictionary (‘BaseDict’), the baseline dictionary combined with
our weighted MLLR adaptation (‘Base+MLLR’), the dictionary extended
with the generated English-accented variants (‘extdDict’) and finally the
extended dictionary combined with MLLR adaptation (‘extdMLLR’). The
American speakers are ID059, ID060 and ID075.
Fig. 8.11. WERs for the eight non-native speakers (% WER, speakers ID022,
ID054, ID059, ID060, ID064, ID075, ID078, ID138; bars: BaseDict, Base+MLLR,
extdDict, extdMLLR)
Weighted MLLR adaptation can improve the baseline results for most of
the speakers. Relative improvements of up to 37% can be achieved. The ex-
tended dictionary can only improve the results for two speakers, when no ad-
ditional speaker adaptation is used. However, when combining the extended
dictionary with MLLR the results can be improved for most of the speakers
compared to the baseline dictionary. Further, for five out of the eight speakers
the extended dictionary with MLLR is better than the baseline dictionary
with MLLR and further improvements of up to 16% can be achieved. When
testing the enhanced dictionary on the native reference speakers only a slight
increase in WER from 11.5% to 11.8% can be observed, which might be due
to the increased confusability.
More results are shown in Figure 8.12. The experiments using the ex-
tended dictionary and weighted MLLR were repeated using the semi-super-
vised approach (i.e. weighted MLLR combined with the CM). It can be seen
that, especially for those speakers with high initial WERs, the semi-supervised
approach is superior to the unsupervised one, because using misrecognised
utterances for adaptation can be avoided. Improvements of up
to 22% can be achieved compared to unsupervised adaptation.
Fig. 8.12. WERs for unsupervised and semi-supervised adaptation, using the ex-
tended dictionary (% WER, speakers ID022, ID054, ID059, ID060, ID064, ID075,
ID078, ID138)
The big differences in the baseline recognition rate indicate that the
strength of accent varies dramatically in the test set. An auditive test re-
vealed that while two speakers had only a very slight accent (ID022 and
ID064), one (ID078) had a very strong accent. This is reflected in the base-
line WERs, where ID022 and ID064 achieve the lowest WERs and ID078 is
among those who achieve high WERs. Furthermore, mixing native British and
American speakers is also not optimal, but due to the lack of more British En-
glish speakers, the American ones were used to have a more representative test
set. Figure 8.13 shows again the overall results, averaged over all speakers. It
can be seen that the extended dictionary combined with MLLR outperforms
the baseline, but not MLLR alone. When the CM is additionally used, however,
the weighted MLLR can be outperformed.
Fig. 8.13. Overall results averaged over all speakers (% WER; bars: BaseDict,
Base+MLLR, extdDict, extdMLLR, extdMLLR+CM)
The way the accented pronunciations are generated does not at all con-
sider possible differences in accents. As we have learned from the experiments
on the ISLE database, a speaker-wise selection of relevant pronunciations is
beneficial and sometimes even necessary to obtain improvements. Compar-
ing these results to the ISLE experiments, simply generating one non-native
variant for each entry in the dictionary that is the same for all speakers,
corresponds to e.g. using all German rules to construct one dictionary for
all German speakers. The results showed that this could only improve the
performance for some of the speakers and was clearly outperformed by the
speaker-specific dictionary that used only selected rules for each speaker.
As a consequence, instead of using the decision tree directly to generate
the variants, as was done in the current experiments, rules should be derived
from the tree, such that a speaker-wise selection becomes possible. Then
improvements for more speakers are to be expected, also for the extended
dictionary alone. Unfortunately, at the time of the writing of this book, it was
not possible to do so.
The results however indicate that the improvements that can be achieved
by acoustic adaptation using the weighted MLLR approach and by the pro-
posed pronunciation adaptation method are at least partly additive.
8.5 Summary
In this section the problem of recognising non-native speech was examined
and a new method for generating non-native pronunciations without using
non-native data was proposed. It was shown that performance decreases if a
recogniser that is optimised for native speech is exposed to non-native speech.
Investigations on the ISLE database showed that including specialised non-
native pronunciation variants in the dictionary can greatly improve the re-
sults. A set of rules was manually derived and the optimal set of rules was
determined for each speaker in the test set. The application of these spe-
cialised rules was clearly superior to using all rules or rules that are typical
for the native language of a speaker group. However, since the manual deriva-
tion is not feasible if the combination of several source and target languages
is to be covered, a new method was introduced that solely relies on native
speech to derive non-native variants automatically. The big advantage of the
proposed method is that it requires native data only, and can thus be re-
peated for all the languages for which HMM models trained on native speech
are available.
English native speakers who spoke German were investigated in more
detail. To derive the accented pronunciation variants, German speech was
decoded with an English phoneme recogniser. A decision tree was trained to
map the German canonical pronunciation to the English-accented variants
obtained. The trained tree was then used to generate an extended dictio-
nary, containing the native and the non-native variants. When combining
the enhanced dictionary with MLLR speaker adaptation, the results can be
improved for all speakers compared to the baseline dictionary and for the
majority of the speakers, compared to the baseline dictionary with MLLR
adaptation. While weighted MLLR alone can achieve up to 37% reduction
in WER, the additional usage of the enhanced dictionary achieves further
improvements of up to 16%. The use of the semi-supervised adaptation is su-
perior to the unsupervised approach, especially for speakers with high initial
WERs. Using the enhanced dictionary with the native speakers, the perfor-
mance remained almost stable.
However, it should be possible to further improve the results, if rules are
generated instead of using the tree directly, so that a speaker-wise selection
of the variants becomes possible. The experiments showed that the improve-
ments that can be achieved by acoustic adaptation (using the weighted MLLR
approach) and the proposed pronunciation adaptation method are at least
partly additive and should therefore be combined.
The approach is very promising because of its flexibility w.r.t. the consid-
ered languages. Only native data for the source and target language under
consideration is required to derive accented pronunciations of the target lan-
guage spoken with the accent of the source language. Even though in the
experiments only the language pair German (target language) and English
(source language) was considered, the proposed method is valid for arbitrary