IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 9, NO. 4, MAY 2001

Data-Driven Approach to Designing Compound Words for Continuous Speech Recognition

George Saon and Mukund Padmanabhan, Senior Member, IEEE

Abstract—In this paper, we present a new approach to deriving compound words from a training corpus. The motivation for making compound words is that, under some assumptions, speech recognition errors occur less frequently in longer words. Furthermore, compound words also enable more accurate modeling of pronunciation variability at the boundary between adjacent words in a continuously spoken utterance. We introduce a measure based on the product between the direct and the reverse bigram probability of a pair of words for finding candidate pairs from which to create compound words. Our experimental results show that by augmenting both the acoustic vocabulary and the language model with these new tokens, the word recognition accuracy can be improved by an absolute 2.8% (7% relative) on a voicemail continuous speech recognition task. We also compare the proposed measure for selecting compound words with other measures that have been described in the literature.

I. INTRODUCTION

Fig. 1. Word error rate versus word length (expressed as number of phones in the word).

ONE of the observations that can be made about speech recognition systems is that short words are more frequently misrecognized. This is indicated in Fig. 1, which represents the number of errors made in all words of a specified length (length as defined by the average number of phones in the baseforms of the words). The results for this figure were obtained by decoding the training data of the voicemail corpus (representing 40 h of spontaneous telephone speech) in the following way. Two language models were trained, one from the transcriptions of the first 20 h (LMa) and the second from the transcriptions of the last 20 h (LMb). The first 20 h of the training data were then decoded using LMb and the last 20 h with LMa. These results are intuitively understandable: in a longer phone sequence, it is necessary to make more errors in order to get the word wrong. If we consider different words in the vocabulary as sequences of phones and adopt the following assumptions:

1) no phone sequence in the vocabulary is a subset of any other phone sequence in the vocabulary;
2) the probability of error for all phones is the same, $p$;
3) the majority of the phones in a baseform need to be erroneously decoded for the word to be wrong,

then the probability of making an error in a word with a baseform of length $n$ is given by

    P_e(n) = \sum_{k=\lfloor n/2 \rfloor + 1}^{n} \binom{n}{k} p^k (1-p)^{n-k}.

For values of $p$ around 0.3 (which is consistent with what we observed in the training data), $P_e(n)$ can be seen to decrease as $n$ increases, implying that longer words are less frequently misrecognized (with the exception of phone lengths between six and nine, where the tendency seems to be reversed).
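To make the error model above concrete, the following minimal sketch (not from the paper; it assumes independent per-phone errors and reads "majority" as a strict majority) tabulates $P_e(n)$ for $p = 0.3$:

```python
from math import comb

def word_error_prob(n_phones: int, p: float = 0.3) -> float:
    """P_e(n): probability that strictly more than half of the n phones in a
    baseform are misrecognized, assuming independent per-phone errors of rate p.
    This is one reading of assumptions 1)-3) above, not the authors' code."""
    k_min = n_phones // 2 + 1  # "majority of the phones" taken as a strict majority
    return sum(comb(n_phones, k) * p**k * (1 - p) ** (n_phones - k)
               for k in range(k_min, n_phones + 1))

if __name__ == "__main__":
    for n in range(1, 11):
        print(f"{n:2d} phones: P_e = {word_error_prob(n):.4f}")
```

With $p = 0.3$, the values trend downward as $n$ grows (apart from an even/odd alternation introduced by the strict-majority cutoff), which is the qualitative behavior described above.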
The second observation is that the pronunciation variability of words is greater in spontaneous, conversational speech compared to the case of carefully read speech, where the uttered words are closer to their canonical representations (baseforms). One can argue that, by increasing the vocabulary of alternate pronunciations of words (the acoustic vocabulary), most of the speech variability can be captured in the spontaneous case. However, an increase in the number of alternate pronunciations is usually followed by an increase in the confusability between words, since different words can end up having close or even identical pronunciation variants. Most coarticulation effects arise at the boundary between adjacent words and result in alterations of the last phones of the first word and the first few phones of the second word.

One method to model these changes is the use of crossword phonological rewriting rules as proposed in [5]; this provides a systematic way of taking into account coarticulation phenomena such as geminate or plosive deletion (e.g., WENT TO → W EH N T UW), palatalization (e.g., GOT YOU → G AO CH AX), etc.

An alternative way of dealing with coarticulation effects at word boundaries is to merge specific pairs of words into single compound words (also called multi-words [3], phrases [6], [8], [10], [11], or "sticky" pairs [2]) and to provide special coarticulated pronunciation variants for these new tokens. For instance, frequently occurring pairs such as "KIND OF," "LET ME," and "LET YOU" can be viewed as single words (KIND-OF, LET-ME, LET-YOU) which are often pronounced "K AY N D AX," "L EH M IY," or "L EH CH AX," respectively.

Manuscript received October 19, 1999; revised November 29, 2000. This work was supported in part by DARPA under Grant MDA972-97-C-0012. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Jerome R. Bellegarda. The authors are with the IBM T. J. Watson Research Center, Yorktown Heights, NY 10598 USA (e-mail: [email protected]). Publisher Item Identifier S 1063-6676(01)02736-5.

In this paper, we present a new approach to deriving compound words from a training corpus. Compound words have a fortiori longer phone sequences than their constituents; consequently, one would expect them to be misrecognized less frequently. Furthermore, they also enable more accurate modeling of pronunciation variability at the boundary between adjacent words in a continuously spoken utterance. We suggest and experiment with a number of acoustic and linguistic measures to select these compound words, and present results that indicate that up to a 7% relative improvement can be obtained by adding a small number of compound words to the vocabulary.

The rest of the paper is organized as follows. In Section II, we investigate the effect of adding compound words to the language model and describe the various measures that we used for deriving compound words. In Section III, we discuss the experiments and results. Concluding remarks are presented at the end of the paper.

II. MEASURES FOR DERIVING COMPOUND WORDS

Though the motivation for adding compound words to the vocabulary is clear, as mentioned previously, adding more tokens or pronunciation variants to the acoustic vocabulary and/or the language model could increase the confusability between words. Hence, the candidate pairs for compound words have to be chosen carefully in order to avoid this increase. Intuitively, such a pair has to meet several requirements [9].

1) The pair of words has to occur frequently in the training corpus. There is no gain in adding a pair with a low count to the vocabulary, since the chances of encountering that pair during the decoding of unseen data will be low. Besides, the compound word issued from this pair will contribute to the acoustic confusability with other words which are more likely according to the language model.

2) The words within the pair have to occur frequently together and more rarely in the context of other words. This requirement is necessary since one very frequent word, say $u$, can be part of several different frequent pairs, say $(u, v_1)$, $(u, v_2)$, etc. If all these pairs were to be added to the vocabulary, then the confusability between $u$ and the pair $u$-$v_1$ or $u$-$v_2$ would be increased, especially if word $u$ has a short phone sequence. This will result in insertions or deletions of the word $u$ when incorrectly decoding the word $u$ or the sequence $u\,v_1$ (or $u\,v_2$). A concrete example is given by the function word "THE," which can occur in numerous different contexts (such as "IN-THE," "OF-THE," "ON-THE," "AT-THE," etc.), all of which are frequent.

3) The words should ideally present coarticulation effects at the juncture, i.e., their continuous pronunciation should be different than when they are uttered in isolation. Unfortunately, this requirement is not always compatible with the previous ones; in other words, the word pairs which have strong coarticulation effects do not necessarily occur very often, nor do the individual words occur only together. Consider, for instance, the sequence "BYE-BYE," often pronounced "B AX B AY," which is relatively rare in our database, whereas the individual word "BYE" appears in most voicemail messages.

The use of compound words has been suggested by several researchers and has been shown to improve speech recognition performance for various tasks [1]–[3], [6], [8], [10]–[12]. We will make further references to the different approaches throughout this paper as we examine some possible metrics for selecting compound words. These measures can be broadly classified into language model oriented or acoustic oriented measures, depending on whether the information that is being used is entirely textual or includes acoustic confusability such as phone recognition rate or coarticulated versus non coarticulated baseform (or word pronunciation variant, or lexeme) recognition rate.

A. Effect of Compound Words on the Language Model

Before describing the methods related to selecting the compound words, it is instructive to see what effect the addition of these words has on the language model. Let us assume that the lexicon has been constructed, with the compound words selected according to some measure, and examine the effect on the language model. Language models are generally characterized by the log likelihood of the training data and the perplexity. The log likelihood of a sequence of words $w_1, \ldots, w_N$ (representing the training data) can be obtained simply as

    L = \log P(w_1, \ldots, w_N) = \sum_{i=1}^{N} \log P(w_i \mid w_1, \ldots, w_{i-1}).    (1)

The usual $n$-gram assumption limits the number of terms in the conditioning in (1) to $n-1$. Hence, the log likelihood of the training data assuming a unigram or bigram model would be, respectively,

    L_1 = \sum_{i=1}^{N} \log P(w_i), \qquad L_2 = \sum_{i=1}^{N} \log P(w_i \mid w_{i-1}).    (2)

We can also define an average log likelihood per word as $\bar{L} = L/N$. For the bigram case, this may also be written as

    \bar{L}_2 = \frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_{i-1}) = \sum_{u,v} \frac{c(u,v)}{N} \log P(v \mid u) \approx -H(w_i \mid w_{i-1}).    (3)

Hence, the average log likelihood per word is related to the conditional entropy of $w_i$ given $w_{i-1}$.

The perplexity of the language model is defined in terms of the inverse of the average log likelihood per word [7]. It is an indication of the average number of words that can follow a given word (a measure of the predictive power of the language model). Hence

    \text{Perplexity} = e^{-\bar{L}}.    (4)
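As an illustration of (1)–(4), the following sketch (our own, under assumptions not made in the paper: a toy corpus and unsmoothed maximum-likelihood bigram estimates) computes the average bigram log likelihood per word and the resulting perplexity directly from counts:

```python
import math
from collections import Counter

def bigram_avg_loglik_and_perplexity(text):
    """Average bigram log likelihood per word, as in (3), and the
    corresponding perplexity, as in (4), with ML estimates P(v|u) = c(u,v)/c(u)."""
    words = text.split()
    histories = Counter(words[:-1])                 # bigram history counts c(u)
    bigrams = Counter(zip(words[:-1], words[1:]))   # pair counts c(u, v)
    total_ll = sum(c * math.log(c / histories[u]) for (u, v), c in bigrams.items())
    avg_ll = total_ll / len(words)
    return avg_ll, math.exp(-avg_ll)

avg_ll, ppl = bigram_avg_loglik_and_perplexity(
    "let me know if you can let me know kind of soon")
print(f"average log likelihood per word = {avg_ll:.3f}, perplexity = {ppl:.2f}")
```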
1) Unigram Model—Difference in Log Likelihood: Consider the probability of a sequence of two words $u$ and $v$, and let $P(u)$ and $P(v)$ denote their unigram probabilities and $P(u, v)$ the probability of the pair. The probability of this word sequence assuming a unigram language model is given by

    P_1(u\,v) = P(u)\,P(v).    (5)

Now consider replacing the pair of words $u$ and $v$ in the original lexicon with the compound word $u$-$v$. The likelihood of the word sequence becomes

    P_1'(u\,v) = P(u\text{-}v) = P(u, v).    (6)

Comparing (5) and (6), the difference in log probability is given by

    \Delta L_1 = \log \frac{P(u, v)}{P(u)\,P(v)}.    (7)

This can be seen to represent the mutual information between the words $u$ and $v$, and it forms the basis of the first linguistic measure. A similar discussion of the link between the likelihood and the average mutual information between adjacent classes is provided by Brown et al. in [2].
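A minimal sketch of this unigram criterion (the counts below are hypothetical; the paper does not report such numbers):

```python
import math

def mutual_information(c_u, c_v, c_uv, n_tokens):
    """Difference in unigram log likelihood of (7): log P(u,v) - log P(u)P(v),
    with maximum-likelihood estimates taken from raw counts."""
    return math.log((c_uv / n_tokens) / ((c_u / n_tokens) * (c_v / n_tokens)))

# hypothetical counts for a pair like ("KIND", "OF") in a 400k-token corpus
print(mutual_information(c_u=600, c_v=12000, c_uv=450, n_tokens=400_000))
```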
2) Bigram Model—Difference in Log Likelihood: An analogous reasoning can be applied in the case of a bigram language model by considering the probability of a sequence of three words $u$, $v$, and $x$ conditioned on a preceding word $t$, where $(u, v)$ is the candidate pair. The probability of this word sequence assuming a bigram language model is

    P_2(u\,v\,x \mid t) = P(u \mid t)\,P(v \mid u)\,P(x \mid v).    (8)

As before, replacing the pair of words $u$ and $v$ in the original lexicon with the compound word $u$-$v$ changes the likelihood of the word sequence as follows:

    P_2'(u\,v\,x \mid t) = P(u\text{-}v \mid t)\,P(x \mid u\text{-}v).    (9)

Comparing (8) and (9), the difference in log likelihood is

    \Delta L_2 = \log \frac{P(u\text{-}v \mid t)\,P(x \mid u\text{-}v)}{P(u \mid t)\,P(v \mid u)\,P(x \mid v)}.    (10)

Substituting

    P(u\text{-}v \mid t) = P(u \mid t)\,P(v \mid t, u), \qquad P(x \mid u\text{-}v) = P(x \mid u, v)    (11)

we get

    \Delta L_2 = \log \frac{P(v \mid t, u)\,P(x \mid u, v)}{P(v \mid u)\,P(x \mid v)} = \log \frac{P(v \mid t, u)\,P(u \mid v, x)}{P(v \mid u)\,P(u \mid v)}    (12)

i.e., the compound word has the effect of incorporating a trigram dependency in a bigram language model. The denominator in (12) is the product of the forward and reverse bigram probability of $u$ and $v$, and the numerator is the product of the forward and reverse trigram probability of $u$ and $v$.

B. Language Model Measures

The first measure that we consider is the mutual information between two consecutive words [3], [6], [11], [12], which is defined as

    LM_1(u, v) = \log \frac{P(u, v)}{P(u)\,P(v)}.    (13)

From (7), this choice of compound words may be seen to be motivated by the desire to maximize the difference in log likelihood of the training data for the two lexicons when a unigram model is used. A weighted variant of the mutual information was proposed in [2] as a criterion for finding sticky pairs. Most authors, however, use it in its unweighted form; that is, they choose the pairs such as to maximize the mutual information between the words regardless of the frequency of the pairs (see, for example, [6] and [12]). In [11], the mutual information is used only to select candidate pairs; the final decision of turning pairs into compound words is made based on bigram perplexity reduction.

The second measure that we propose is based on defining a direct bigram probability between the words $u$ and $v$ as $P(v \mid u)$ and a reverse bigram probability between the words as $P(u \mid v)$. The reverse bigram probability as a standalone measure has been mentioned in [10] (called backward bigram) and in [1] (called left probability). Both the direct and the reverse bigrams can be simply estimated from the training corpus as follows:

    P(v \mid u) = \frac{c(u, v)}{c(u)}, \qquad P(u \mid v) = \frac{c(u, v)}{c(v)}    (14)

where $c(u)$, $c(v)$, and $c(u, v)$ denote the training counts of $u$, $v$, and of the pair $(u, v)$, respectively. The measure that we used is the geometrical average of the direct and the reverse bigram

    LM_2(u, v) = \sqrt{P(v \mid u)\,P(u \mid v)} = \frac{c(u, v)}{\sqrt{c(u)\,c(v)}}.

This measure has also been independently introduced in [1] (called mutual probability) and is similar to the correlation coefficient proposed recently by Kuo [8], which can be written as

    \rho(u, v) = \frac{P(u, v)}{\frac{1}{2}\,[P(u) + P(v)]}.

The similarity between the two arises from the fact that they divide the joint probability by the mean of the marginals, $P(u)$ and $P(v)$, with the main difference lying in the choice of an arithmetic versus a geometric mean of the marginals. Note that for every pair of words, $0 \le LM_2(u, v) \le 1$. A high value for $LM_2(u, v)$ means that both the direct and the reverse bigrams are high for $(u, v)$; in other words, the probabilities that $u$ is followed by $v$ and that $v$ is preceded by $u$ are high, which makes the pair a good candidate for a compound word according to our second requirement.

In our implementation, we selected all pairs of words for which this measure is greater than a fixed threshold and for which the raw count of the word pair exceeds another predefined threshold.
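A sketch of this selection rule (our own illustration: the measure threshold default below is the 0.2 value reported for $LM_2$ in Section III, while the count threshold default is an arbitrary placeholder):

```python
import math
from collections import Counter

def select_compound_pairs(words, measure_threshold=0.2, count_threshold=100):
    """Candidate pairs under the bigram product measure
    LM2(u, v) = c(u, v) / sqrt(c(u) * c(v)),
    kept only if both the measure and the raw pair count exceed their thresholds."""
    unigrams = Counter(words)
    pairs = Counter(zip(words[:-1], words[1:]))
    selected = []
    for (u, v), c_uv in pairs.items():
        lm2 = c_uv / math.sqrt(unigrams[u] * unigrams[v])
        if lm2 > measure_threshold and c_uv > count_threshold:
            selected.append((u, v, lm2))
    return sorted(selected, key=lambda item: -item[2])
```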
It may be seen that the mutual information measure has much in common with the bigram product measure. Intuitively, a high mutual information between two words means that they occur often together in the training corpus (the pair count is comparable with the individual counts), and in this sense it is similar to the bigram product measure. However, the bigram product measure imposes an additional constraint in that it not only requires $u$ and $v$ to occur together, but also prevents them from occurring in conjunction with other words.

Further, from (12), it is not apparent that the log likelihood improves with the use of compound words chosen by $LM_2$, because this measure maximizes the denominator term of the likelihood difference $\Delta L_2$. The log likelihood is generally directly related to the perplexity; however, perplexities cannot be compared for language models with different vocabularies. Some authors suggest the use of a normalized perplexity (where the average log likelihood of the training data is computed with respect to the original number of words [1], [11]) and even design the compound words such as to directly optimize this quantity [8], [10], [11]. This turns out to be equivalent to increasing the total likelihood of the training corpus.
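A short sketch of that normalization (our reading of the description above: the same total log likelihood is divided by the original token count rather than by the smaller count obtained after pair merging; the numbers are invented):

```python
import math

def perplexities(total_loglik, n_tokens_after_merging, n_original_tokens):
    """Standard vs. normalized perplexity for a lexicon augmented with compound words."""
    standard = math.exp(-total_loglik / n_tokens_after_merging)
    normalized = math.exp(-total_loglik / n_original_tokens)   # as in [1], [11]
    return standard, normalized

# e.g., a 400k-token training text that shrinks to 390k tokens after merging pairs
print(perplexities(total_loglik=-2.4e6, n_tokens_after_merging=390_000,
                   n_original_tokens=400_000))
```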
C. Acoustic Measures

Neither the bigram product measure nor the mutual information takes into account coarticulation effects at word boundaries, since they are language model oriented measures. These coarticulation effects have to be added explicitly for the pairs which become compound words according to these metrics, either by using phonological rewriting rules or by manually designing coarticulated baseforms where appropriate.

The second part of our study is centered around the use of explicit acoustic information when designing compound words. The first measure deals explicitly with coarticulation phenomena and can be summarized as follows. For the pairs of words in the training corpus which present such phenomena according to the applicability of at least one phonological rewriting rule [5], one can compare the number of times that a coarticulated baseform for the pair is preferred over a concatenation of the non-coarticulated individual baseforms of the words forming that pair in the training corpus. This can be estimated by doing a Viterbi alignment of all instances of the word pair in the training data, with the coarticulated pair baseform and with the concatenation of individual baseforms, and selecting the baseform which has the higher acoustic score. If $n_{\text{coart}}$ denotes the number of times that the coarticulated baseform is preferred, and $n_{\text{concat}}$ denotes the number of times that the concatenated baseform is preferred, the measure is defined as the ratio between these two counts

    AC_1(u, v) = \frac{n_{\text{coart}}(u, v)}{n_{\text{concat}}(u, v)}.

If this ratio is bigger than a threshold (which is set in practice to 1), then the pair is turned into a compound word.
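The following sketch shows how such a ratio could be accumulated; `viterbi_score` is a hypothetical stand-in for the recognizer's acoustic alignment score and is not an interface from the paper:

```python
def coarticulation_ratio(pair_instances, coart_baseform, concat_baseform, viterbi_score):
    """AC1 for one word pair: how often a Viterbi alignment prefers the
    coarticulated baseform over the concatenation of the individual baseforms.
    `pair_instances` holds the acoustic features of every occurrence of the pair;
    `viterbi_score(features, baseform)` is a hypothetical alignment-scoring hook."""
    n_coart = n_concat = 0
    for features in pair_instances:
        if viterbi_score(features, coart_baseform) > viterbi_score(features, concat_baseform):
            n_coart += 1
        else:
            n_concat += 1
    return float("inf") if n_concat == 0 else n_coart / n_concat
```

A pair would then be promoted to a compound word when the returned ratio exceeds 1, the threshold used in practice.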
nomena and can be summarized as follows. For the pairs of as was suggested by one reviewer.
words in the training corpus which present such phenomena ac- The size of the acoustic vocabulary for the application is 14 K
cording to the applicability of at least one phonological rewriting words. The results are reported on a set of 43 voicemail mes-
rule [5], one can compare the number of times that a coarticu- sages (roughly 2000 words).
lated baseform for the pair is preferred over a concatenation of The experimental setup is as follows. We started with a vocab-
non-coarticulated individual baseforms of the words forming ulary that had no compound words, and applied every measure
that pair in the training corpus. This can be estimated by doing a iteratively to increase the number of compound words in the vo-
Viterbi alignment of all instances of the word pair in the training cabulary. After one iteration, the word pairs that scored more than
data, with the coarticulated pair baseform and with the con- a threshold were transformed into compound words and all in-
catenation of individual baseforms, and selecting the baseform stances of the pairs in the training corpus were replaced by these
which has a higher acoustic score. If baseform new words. Both the acoustic vocabulary and the language model
denotes the number of times that the coarticulated baseform is vocabulary were augmented by these words after each step. In the
preferred, and baseform denotes the number of following tables, underlined compound words are meant to indi-
times that the concatenated baseform is preferred, the measure is cate that the compound words also had coarticulated baseforms
defined as the ratio between these two counts which were added to the acoustic vocabulary. Also, indicates
the number of compound words that were added to the vocabu-
baseform lary during the current iteration. We will first describe the results
baseform with as this gave us the best performance.

III. EXPERIMENTS AND RESULTS

All the experiments were performed on a telephony voicemail database comprising about 40 h of speech [9]. The language model is a conventional linearly interpolated trigram model [7] and was trained on approximately 400 K words of text. The effect of adding compound words was to increase the span of the LM beyond trigrams. We have not attempted, however, to compare a "weaker" LM (say a bigram LM) augmented with compound words with the corresponding trigram or $n$-gram LM, as was suggested by one reviewer.

The size of the acoustic vocabulary for the application is 14 K words. The results are reported on a set of 43 voicemail messages (roughly 2000 words).

The experimental setup is as follows. We started with a vocabulary that had no compound words and applied every measure iteratively to increase the number of compound words in the vocabulary. After one iteration, the word pairs that scored more than a threshold were transformed into compound words, and all instances of the pairs in the training corpus were replaced by these new words. Both the acoustic vocabulary and the language model vocabulary were augmented by these words after each step. In the following tables, underlined compound words are meant to indicate that the compound words also had coarticulated baseforms which were added to the acoustic vocabulary. We also indicate the number of compound words that were added to the vocabulary during each iteration. We will first describe the results for $LM_2$, as this gave us the best performance.
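A sketch of this iterative procedure (an assumed implementation, not the authors' code; `score_pairs` stands for any of the measures of Section II, and the threshold is whatever value is appropriate for that measure):

```python
def grow_compound_vocabulary(tokens, score_pairs, threshold, n_iterations=3):
    """Iteratively turn word pairs scoring above `threshold` into compound
    tokens and rewrite the training corpus, as described above."""
    compounds = []
    for _ in range(n_iterations):
        scores = score_pairs(tokens)                 # maps (u, v) -> measure value
        new_pairs = {p for p, s in scores.items() if s > threshold}
        if not new_pairs:
            break
        compounds.extend(new_pairs)
        rewritten, i = [], 0
        while i < len(tokens):                       # replace every instance of a selected pair
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in new_pairs:
                rewritten.append(tokens[i] + "-" + tokens[i + 1])
                i += 2
            else:
                rewritten.append(tokens[i])
                i += 1
        tokens = rewritten
    return tokens, compounds
```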
TABLE I
RECOGNITION SCORES AND PERPLEXITIES FOR MEASURE LM2

TABLE II
RECOGNITION SCORES AND PERPLEXITIES FOR MEASURES LM1, AC1, AND AC2

For the second language model measure, which was based on the product between the direct and the reverse bigram, the threshold was chosen to be 0.2, i.e., if $LM_2(u, v) > 0.2$, then $u$-$v$ would be made a compound word. This threshold was chosen so as to finally get approximately the same number of compound words as in the case where they were designed by hand. Table I summarizes the number of new compound words obtained after each iteration, examples of such words, and the word error rate as well as the perplexity of the test set (the normalized perplexity is denoted by a *).

The last line of Table I also indicates the beneficial effect of adding coarticulated baseforms to the vocabulary, even when the compound words are chosen strictly based on a linguistic measure. The only difference between Iterations 3 and 3b in Table I is that in the former case, baseforms were added to the vocabulary to account for the coarticulation in the selected compound words, whereas in the latter case (3b), the baseforms were simply a concatenation of the baseforms of the individual components. This seems to indicate that, though a significant gain can be obtained by selecting compound words based only on a linguistic measure, the gain can be further enhanced by allowing for a coarticulated pronunciation of these selected compound words.

For the remaining measures ($LM_1$, $AC_1$, and $AC_2$), the thresholds were set such as to obtain the same number of words (or pairs) after each iteration as for the $LM_2$ case. We believe that this facilitates a fair comparison between the performances of the different measures. The threshold on the pair count was set to 100 or 300, depending on the measure. The performances of these measures are illustrated in Table II. It may be seen that there is virtually no improvement by using any of these other measures. The bigram product measure $LM_2$ outperforms the mutual information metric $LM_1$, because the latter seems to pick words which co-occur frequently (i.e., the first condition in Section II) without paying heed to whether the same constituent words also co-occur frequently with other words (the second condition in Section II).

Another observation from Tables I and II is that, for the same number of pairs after the first iteration (42), the difference in perplexity is significant between the language models based on $LM_1$ and $LM_2$. Surprisingly, the better performance is obtained for the language model with the higher perplexity.¹

The poor performance of the acoustic measures can be explained by the fact that neither $AC_1$ nor $AC_2$ takes into account word pair frequency information. Besides, there is no measure of the degree of "stickiness" of a pair as in the case of the language model oriented measures (by "stickiness," we mean frequency of co-occurrence of the word pair, i.e., that word $u$ tends to "stick" to word $v$). This tends to increase the acoustic confusability between words in the vocabulary, since a frequent word can now be part of many pairs.

¹As was pointed out by the reviewers, perplexity cannot really be compared across different vocabularies. The normalized perplexity (also shown in the tables) is supposedly a better indicator of task complexity in this case, but our results did not seem to indicate any great correlation between the word error rate and the normalized perplexity either.

TABLE III
PERPLEXITY AND RECOGNITION PERFORMANCE USING MANUALLY DESIGNED COMPOUND WORDS

Finally, Table III shows the performance of a set of 58 manually designed compound words suited for the voicemail recognition task. It is generally the case that tuning the speech recognition system to a particular task (for instance, by manually selecting the compound words) is a process that does tend to improve performance on the task; however, this represents a tedious and time-consuming process. Consequently, it is encouraging to see that the statistically derived measure $LM_2$ (which can be implemented relatively easily on a new task) is able to approach the same performance, even though it uses a few more compound words.

IV. DISCUSSION

In this paper, we experimented with a number of methods to design compound words to augment the vocabulary of a speech recognition system. The motivation for combining pairs of words to form compound words is twofold: 1) experimental observations indicate that it is less likely that longer phone sequences are misrecognized and 2) compound words enable cross word coarticulation effects to be easily modeled. We experimented with both linguistic and acoustic measures for selecting these compound words. The linguistic measures were related to the mutual information between word pairs and to a new measure, the product of the forward and reverse bigram probability of the word pair. The acoustic measures were based on whether the word pair had a significant amount of cross word coarticulation. Our experimental results indicated that the second linguistic measure was particularly useful in selecting compound words. Even though we found that selecting compound words on the basis of acoustic measures was not useful, we found that, in the case where the compound words were selected based on the linguistic measure, it was beneficial to add coarticulated baseforms when necessary for the selected compound words. Experimental results show an overall relative improvement in word error rate of 7%, with performance comparable to that obtained with manually designed compound words. The main conclusion that can be drawn is that effective metrics for designing compound words should depend upon some language model information, such as the frequency of pairs and the degree of closeness of a pair (how often the words of a pair occur together). Once the pairs have been found, the modeling of coarticulation effects at word boundaries within the pairs (where applicable) may further improve the overall performance.

REFERENCES

[1] C. Beaujard and M. Jardino, "Language modeling based on automatic word concatenations," in Proc. Eurospeech '99, Budapest, Hungary, 1999.
[2] P. F. Brown, V. J. Della Pietra, P. V. DeSouza, J. C. Lai, and R. L. Mercer, "Class-based n-gram models of natural language," Comput. Linguist., vol. 18, no. 4, pp. 467–477, 1992.
[3] M. Finke and A. Waibel, "Speaking mode dependent pronunciation modeling in large vocabulary conversational speech recognition," in Proc. Eurospeech '97, Rhodes, Greece, 1997.
[4] M. Finke, "Flexible transcription alignment," in Proc. 1997 IEEE Workshop Speech Recognition and Understanding, Santa Barbara, CA, 1997.
[5] E. P. Giachin, A. E. Rosenberg, and C. H. Lee, "Word juncture modeling using phonological rules for HMM-based continuous speech recognition," Comput. Speech Lang., vol. 5, pp. 155–168, 1991.
[6] E. P. Giachin, "Phrase bigrams for continuous speech recognition," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Detroit, MI, 1995, pp. 225–228.
[7] F. Jelinek, Statistical Methods for Speech Recognition, Language, Speech and Communication Series. Cambridge, MA: MIT Press, 1999.
[8] H. K. J. Kuo and W. Reichl, "Phrase-based language models for speech recognition," in Proc. Eurospeech '99, Budapest, Hungary, 1999.
[9] M. Padmanabhan, G. Saon, S. Basu, J. Huang, and G. Zweig, "Recent improvements in voicemail transcription," in Proc. Eurospeech '99, Budapest, Hungary, 1999.
[10] K. Ries, F. D. Buo, and A. Waibel, "Class phrase models for language modeling," in Proc. Int. Conf. Speech Language Processing '96, Philadelphia, PA, 1996.
[11] B. Suhm and A. Waibel, "Toward better language models for spontaneous speech," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing '94, Yokohama, Japan, 1994.
[12] I. Zitouni, J. F. Mari, K. Smaili, and J. P. Haton, "Variable-length sequence language models for large vocabulary continuous dictation machine," in Proc. Eurospeech '99, Budapest, Hungary, 1999.

George Saon received the M.Sc. and Ph.D. degrees in computer science from the University Henri Poincare, Nancy, France, in 1994 and 1997.
From 1994 to 1998, he worked on stochastic modeling for off-line handwriting recognition at the Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA). He is currently with the IBM T. J. Watson Research Center, Yorktown Heights, NY, conducting research on large vocabulary conversational telephone speech recognition. His research interests are in pattern recognition and stochastic modeling.

Mukund Padmanabhan (S'89–M'89–SM'99) received the M.S. and Ph.D. degrees from the University of California, Los Angeles, in 1989 and 1992, respectively.
Since 1992, he has been with the Speech Recognition Group, IBM T. J. Watson Research Center, Yorktown Heights, NY, where he currently manages a group conducting research on aspects of telephone speech recognition. His research interests are in speech recognition and language processing algorithms, signal processing algorithms, and analog integrated circuits. He is coauthor of the book Feedback-Based Orthogonal Digital Filters: Theory, Applications, and Implementation.
