Use of Metadata To Improve Recognition of Spontaneous Speech and Named Entities


USE OF METADATA TO IMPROVE RECOGNITION OF SPONTANEOUS SPEECH AND NAMED ENTITIES


Bhuvana Ramabhadran, Olivier Siohan, Geoffrey Zweig

IBM T. J. Watson Research Center


Yorktown Heights, NY 10598, USA
{bhuvana, siohan, gzweig}@us.ibm.com

Abstract

With improved recognition accuracies for LVCSR tasks, it has become possible to search large collections of spontaneous speech for a variety of information. The MALACH corpus of Holocaust testimonials is one such collection, in which we are interested in automatically transcribing and retrieving portions that are relevant to named entities such as people, places, and organizations. Since the testimonials were gathered from thousands of people in countries throughout Europe, an extremely large number of named entities is possible, and this causes a well-known dilemma: increasing the size of the vocabulary allows more of these words to be recognized, but also increases confusability and can harm recognition performance. However, the MALACH corpus, like many other collections, includes side information or metadata that can be exploited to provide prior information on exactly which named entities are likely to appear. This paper proposes a method that capitalizes on this prior information to reduce named-entity recognition errors by over 50% relative, and simultaneously decrease the overall word error rate by 7% relative. The metadata we use derives from a pre-interview questionnaire that includes the names of friends, relatives, places visited, membership of organizations, synonyms of place names, and similar information. By augmenting the lexicon and language model with this information on a speaker-by-speaker basis, we are able to exploit the textual information that is already available in the corpus to facilitate much improved speech recognition.

1. INTRODUCTION

In a recent report, an international digital library working group called for the creation of systems capable of providing access to an estimated 100 million hours of culturally significant spoken word collections [12]. Achieving this will require two fundamental advances over the present state of the art: (1) improving the degree to which existing LVCSR and NLP techniques can be adapted to provide access to spontaneous conversational speech, and (2) a robust ability to identify spoken words and other useful features such as named entities in many types of collections. Several narrow-band and broadband speech collections are currently available [1, 13, 4], and carefully tuned Automatic Speech Recognition (ASR) systems are now able to achieve word error rates between 10% and 40%, depending on the difficulty of the collection. The Spoken Document Retrieval (SDR) track of the Text Retrieval Conferences (TREC) has demonstrated the feasibility of subject-based searching in non-spontaneous broadcast news collections in the presence of such word error rates [13]. However, none of these spontaneous speech corpora were designed to contain a substantive discussion of the same topic by multiple speakers or mentions of several named entities that could subsequently be searched for. The MALACH corpus described in [7, 14, 15] naturally lends itself as an excellent testbed for LVCSR as well as NLP and search applications.

Apart from spoken archives, many practical applications such as name-dialing and call-center applications require not only accurate recognition of spontaneous speech but also accurate recognition of names, places, digits, foreign words, and acronyms. Building such a system is complex due to the very large number of such entities that can occur, some more frequently than others. Name recognition has been the focus of many researchers, particularly in the context of directory assistance applications [3, 2, 5, 6], where ASR systems are designed to recognize between 200K and 2M names. A significant decrease in recognition performance with increasing vocabulary size has been noted in [3]. One of the techniques proposed to counter the adverse effects of a large lexicon is to include a diverse set of pronunciations to cover the acoustic variability [3]. While this helps, it has been shown that focusing on the discriminative segments in a multi-pass approach reduces the effect of confusability [2]. Approaches that include confidence measures and rejection thresholds have been shown to be useful in the accurate recognition of names [5].

ASR systems, including the ones mentioned above, typically focus on short-time information distributed over periods of 10-20 ms. It has been shown in [6] that capturing information distributed over longer periods of time, such as syllabic or word level time spans, can lead to substantial gains in name recognition accuracy. The number of different acoustic units required for a given recognition task is a function of the vocabulary size and the nature of the underlying acoustic units. For phonemes, the number of basic models (without context modeling) is fixed for a given language. However, when using syllable or word size units, the number increases in general with the vocabulary size. Many of these units are pronunciations of words which are not used frequently and will have poor coverage in the training data. For small vocabulary tasks such as alphabet or digit recognition, longer units (typically word level units) have been used successfully. Sparsity of training data has been the main obstacle to using longer acoustic units for LVCSR tasks. However, in [8] it was demonstrated that an LVCSR system which uses competing phonetic and mixed syllabic-phonetic paths in parallel can be built to improve recognition of names and concepts (by 17% relative). In the MALACH project this is particularly important for the search and retrieval of segments of speech relevant to the mention of a name, place or concept [14].

In this paper we report on the use of metadata to improve
recognition of named entities while reducing the overall word error rate (WER). The use of metadata in the form of a caller ID string associated with an incoming call to aid name recognition in a voice mail transcription task has been presented in [9]. The metadata for the MALACH corpus, as is the case for any other oral history archive, is available in the pre-interview questionnaire (PIQ) completed by the interviewees. This includes biographical data, person names, family relationships, locations and extensive demographic data. Not all of these named entities are equally likely to occur during every speaker’s testimony. Therefore, if the metadata can be used to select the subset of words that can occur with the highest probability on a per-interview basis, this can lead to significant improvements in recognition accuracy.

Section 2 describes the MALACH database and the difficulties associated with recognizing named entities for this data. Sections 3 and 4 describe the metadata and the technique used to condition the ASR system with the available metadata. Section 5 presents the experimental setup and improvements in recognition accuracies obtained when dynamically adapting the lexicon. Section 6 discusses the implications of better recognition on subsequent search and retrieval. The paper concludes with a summary and potential applications for this work in other collections.

2. MALACH

MALACH (Multilingual Access to Large Spoken Archives) is an ongoing effort that aims to improve access to the contents of large, multilingual, spoken archives by advancing the state of the art in automated speech recognition (ASR), information retrieval (IR) and other component technologies, by utilizing the world’s largest digital archive of video oral histories, collected by VHF (the Survivors of the Shoah Visual History Foundation) [7]. The MALACH corpus consists of unconstrained, natural speech filled with disfluencies, heavy accents, age-related coarticulations, uncued speaker and language switching and emotional speech, collected in the form of interviews from over 52000 speakers in 32 languages. Approximately 25000 of these testimonies are in English, spanning a wide range of accents, such as Hungarian, Polish, Yiddish, German, Italian, French, Czech, Hebrew, Croatian, Spanish, Ukrainian, etc. A good number of words uttered in this corpus are foreign words or sequences of words spoken in a foreign language, or unfamiliar names and places. The corpus consists of elderly speech, where the age of the interviewees ranges from 56 years to 90 years. In order to obtain training data for acoustic and language models, approximately 200 hours of the English portion of the MALACH corpus was manually transcribed and annotated with named entities. Transcription is challenging even for skilled annotators, who typically required 8 to 12 hours to transcribe a single hour of an English interview. The difficulties arise from unfamiliar names and places, multiple languages encountered during a single interview, coarticulations related to age, highly variable speaking rates, and heavily accented speech.

3. METADATA

The metadata from the PIQ is available on a per-speaker basis and therefore serves as the name and place authority for the mentions in the interview. Many of the place names constitute cities, streets and names of concentration camps as well as their synonyms that may be mentioned during the course of an interview. For illustration, we include here an example of the actual words spoken by different speakers during several segments, annotated with the appropriate named entities (shown in bold face).

    because there was no normal teacher in Blashova so there was a teacher that ran more or less like a high school teacher ...

    okay well we got to the point where I was in Peterboro on Flaten near near Peterboro ...

    I was no longer able to stay in Flaten and neither was my f- the son of the Cookland who was the same age as me and we both all came and went back to live in Rectory Road in Hackney then the question came as to what I was going to do with my life ...

Moreover, these names occur in many variations (Hebrew names, Yiddish names, diminutives, first names only, nicknames, etc.). Named entities have proven to be important to searchers of this collection [14] and hence it is important that the ASR systems hypothesize these words correctly. The MALACH data offers an opportunity to study this problem through its large database of personal identities (approx. 2.5 million names) that is populated with information taken from survey forms filled out by the subjects (PIQs), additional names of topics and concepts assigned by catalogers, and a large list of place names and their synonyms (over 20,000 locations). An important challenge in this work is that many key search terms comprising these named entities will be found only among the infrequently occurring words and phrases, and rare terms are inevitably modeled less well than more common ones.

The metadata also contains synonyms for named entities. Many street and city names have changed over time; for example, St. Petersburg was formerly known as Leningrad and Petrograd. Every interviewee provided their current name, name at birth, release name, Hebrew name, Yiddish name, nicknames and any other false names they worked under during their lives. All of these were also included in the dynamic graph that was built. For example, a person with the first name Alicia has Alicja, Chana, Alice, Jadwiga, Alushia, and Alla as possible variations of the first name. Table 1 illustrates the distribution of names, places and foreign words in the English portion of the MALACH corpus as a function of hours of speech.

Hours   Names and Places (%)   Foreign Words (%)
65      7.2                    4.1
200     10.6                   5.3

Table 1: Distribution of Names, Places and Foreign Words

4. APPROACH: METADATA IN ASR

Given the intended role of ASR to support information access, we are particularly interested in named entity recognition [10], especially the recognition of personal names and place names, which are both important search criteria. Therefore, the ASR lexicon was carefully constructed using a large database of personal identities populated with information taken from survey forms filled out by the interviewees, additional names assigned by catalogers, and a large list of place names. However, many
of these entities are rare terms and therefore cannot be modeled very well.

The manual transcriptions of 180 hours of training data were used to build language models using the modified Kneser-Ney algorithm [7]. The training data is relatively small (1.7M words); therefore, the language models built from the Broadcast News (BN) and Switchboard (SWB) corpora (158M and 3.4M words, respectively) were interpolated with the LM built from this collection. The interpolation weights were optimized to achieve minimum perplexity on the held-out data from this collection. The perplexity of this task on the held-out test set is 72.3. Although a lexicon can be built with the most frequently occurring words and by minimizing the OOV rate on a held-out set, as the number of interviews processed grows, many new words will need to be added. It is important for these words to be recognized accurately in order for subsequent search to be successful.

The PIQ database is indexed on the interview code. The approach presented in this paper adds the named entities contained in the metadata for the interview being decoded to the ASR’s lexicon and replaces the static decoding graph (Section 5.2) with a new graph appropriately weighted with the language model probabilities seen in the training data. If a mention of a named entity did not occur in the MALACH training material, its language model probability backs off to that of an unknown word. Many of the words added from the metadata occurred in our LM training material, which had been derived by interpolating MALACH data with Broadcast News and Switchboard material. At test time, the identity of the interviewee as defined by the interview code is used to derive the dynamic graph using the pre-defined metadata available for that interview code. In [9], a class-based language model was built from the metadata-defined names derived from the caller ID string, and a name network was composed with finite-state transducers for this specific caller. Given the spontaneous speech in the MALACH data, it is very difficult to derive a network for the usage of named entities; however, augmenting the ASR lexicon with possible realizations of names and places can be done.

5. EXPERIMENTS AND RESULTS

5.1. Training and test corpora

The English training corpus was generated using 15-minute segments of an interview from 720 randomly selected speakers. Thus, a total of 180 hours of data was selected for manual transcription to serve as training material for ASR systems. Male and female speakers in this corpus were more or less equally distributed and a wide range of accents was covered (e.g., Hungarian, Italian, Yiddish, German, and Polish). The ASR test set consists of 30-minute segments taken from 15 randomly chosen speakers. This test set was also appropriately annotated with named-entity tags (illustrated in the example in Section 3).

The audio signal was down-sampled to 16 kHz from 44.1 kHz and parameterized using 24-dimensional mel frequency cepstral coefficients (MFCC). Final acoustic features were derived using linear discriminant analysis (LDA) and maximum-likelihood based linear transformations (MLLT). Speaker-specific transformations (SAT and MLLR) were used by the final system that gave the best WER. This ASR system had a WER of 35.2% on the test set described in [7].

5.2. Decoding Strategy

The decoder used in our experiments is a Viterbi decoder operating on a fully flattened state-level HMM. A traditional decoding setup operates on a single HMM constructed from an overall language model. In contrast to this, in our experiments, the states are built for each speaker using a speaker-specific lexicon and language model. A detailed description of the Viterbi decoder used is presented in [11].

5.3. Results

The use of metadata was evaluated using the overall WER and the WER on named entities for different vocabulary sizes. Table 2 illustrates the large gains obtained when incorporating metadata information into the recognizer. The maximum gain (51% relative) in the recognition of named entities was obtained with a 30K vocabulary that was specialized to include person-specific metadata. On the other hand, the improvement in named entity recognition is fairly small if the size of the static lexicon is increased by a factor of three. If the names and places derived from the metadata are not added to the ASR’s lexicon, the named entity WER decreases marginally when the vocabulary is expanded from 30K to 90K (66.2% to 61.4%). However, when adding speaker-specific information, the named-entity WER goes down from 66.2% (with a static 30K vocab) to 32.3%, reducing the WER by about half. It can be seen that as the lexicon size increases, a small percentage of the gains obtained from the metadata is lost, probably offset by the added confusability, i.e., with a 90K lexicon, the named entity WER increases to 38.8% from 32.3%. This is consistent with the degradation in performance seen in the literature with increased lexicon sizes. Table 3 shows the decrease in overall WER (relative 7%) obtained with the use of metadata. The reduction in the overall WER that is obtained when tripling the lexicon size without the use of metadata is much smaller (relative 2.5%).

Vocab   Static Vocab (%)   Metadata-adapted Vocab (%)   Relative Gain (%)
30K     66.2               32.3                         51.2
60K     62.1               36.8                         40.7
90K     61.4               38.8                         36.8

Table 2: WER computed on the named entities for different vocabulary sizes

Vocab   Static Vocab (%)   Metadata-adapted Vocab (%)
30K     40.1               37.6
60K     39.4               36.7
90K     39.2               36.5

Table 3: Improvements in overall WER for different vocabulary sizes

Our goal is to select the vocabulary size that yields the best overall WER and named entity WER, and surprisingly, the 30K vocabulary coupled with the PIQ words (the best matching vocabulary) for an interview is the best choice. This is interesting because Table 4 shows that a 90K vocab actually has a much better overall OOV rate. However, the extra words increase confusability and our results show that this is detrimental.
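As a quick check on the figures quoted in Section 5.3, the relative gains can be recomputed from the absolute WERs in Tables 2 and 3. The sketch below does this in Python; the function and dictionary names are illustrative only and are not part of the system described in this paper.

```python
# WER values (%) are copied from Tables 2 and 3 as (static, metadata-adapted).
# All names here are hypothetical, chosen only for this illustration.

def relative_reduction(baseline: float, adapted: float) -> float:
    """Relative error reduction, in percent, of `adapted` over `baseline`."""
    return 100.0 * (baseline - adapted) / baseline


named_entity_wer = {"30K": (66.2, 32.3), "60K": (62.1, 36.8), "90K": (61.4, 38.8)}
overall_wer = {"30K": (40.1, 37.6), "60K": (39.4, 36.7), "90K": (39.2, 36.5)}


def gains(table: dict) -> dict:
    """Relative gain per vocabulary size, rounded to one decimal place."""
    return {
        vocab: round(relative_reduction(static, adapted), 1)
        for vocab, (static, adapted) in table.items()
    }


named_entity_gain = gains(named_entity_wer)
overall_gain = gains(overall_wer)

if __name__ == "__main__":
    for vocab in named_entity_wer:
        print(vocab, named_entity_gain[vocab], overall_gain[vocab])
```

The named-entity gains reproduce the last column of Table 2 (51.2, 40.7 and 36.8), and the overall gains fall between 6.2% and 6.9%, in line with the roughly 7% relative reduction reported for Table 3.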
Vocab   Named Entities OOV rate (%)   OOV rate w/ PIQ (%)
30K     25.5                          9.2
60K     16.4                          8.9
90K     13.2                          8.8

Table 4: OOV rate computed on the named entities

6. IMPLICATIONS FOR SEARCH AND RETRIEVAL

A test collection for the English data in the MALACH corpus was presented in [14]. The collection comprised 404 full interviews totaling over 600 hours of speech, with automated speech recognition transcripts and associated relevance judgments for 28 queries. These queries were built from over 600 written requests for materials from the collection from scholars, educators, documentary film makers and students. The mean average precision score obtained by the best search system was 0.09. An analysis of the retrieval systems indicated that for 25% of the queries, the keywords did not appear in the ASR transcripts, which resulted in failure of the system to retrieve the relevant segments. In about 30% of the queries, the keywords were recognized at least once in a segment of speech even though there were several mentions of the same keyword. All of these keywords were domain-specific names and places. This illustrates the importance of a high recognition accuracy on named entities for search and retrieval tasks.

7. CONCLUSIONS

This paper presents a technique to incorporate metadata information into a speech recognition system. The results show that, whenever available, adding domain-specific metadata not only provides substantial gains, of the order of 50% relative, in named-entity recognition accuracy, but also provides a modest improvement in overall WER that otherwise cannot be achieved by simply increasing the size of the static lexicon. The large gains in the recognition of named entities are crucial to search and retrieval tasks, particularly in the MALACH project. Analysis of real-user requests for this corpus indicates that topical requests account for approximately 53% of the requests while 89% of them are searches based on person names, organization, camp and city names. Therefore, for search to be successful it is more crucial to recognize these terms accurately than it is to improve the overall recognition accuracy. The technique presented here on the use of metadata achieves both simultaneously. The use of metadata is extremely promising and we plan to explore its impact on search by redecoding the test collection described in [14] with a lexicon derived from the metadata for each interview. The proposed algorithm also has applications in searching other spoken word collections such as recordings of meetings and lectures, and in call center mining.

8. Acknowledgments

This project is part of an on-going effort funded by NSF under the Information Technology Research (ITR) program, NSF IIS Award No. 0122466. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.

9. References

[1] Bacchiani, M., “Automatic Transcription of voice-mail at AT&T”, ICASSP, 2001.

[2] Junqua, J.-C, Valente, S., Fohr, D., and Mari J.-F, “An n-best strategy, dynamic grammars and selectively trained neural networks for real-time recognition of continuously spelled names over the telephone”, ICASSP, pp. 852-855, 1995.

[3] Gao, Y., Ramabhadran, B., Chen, J., Erdogan, H., and Picheny, M., “Innovative approaches for large vocabulary name recognition”, ICASSP, pp. 53-56, 2001.

[4] Glass, J., Hazen, T. J., Hetherington, L., and Wang, C., “Analysis and Processing of Lecture Audio Data: Preliminary Investigations”, Workshop on Interdisciplinary Approaches to Speech Indexing and Retrieval, HLT-NAACL04, 2004.

[5] Liao, Y.-F. and Rose, G., “Recognition of chinese names in continuous speech for directory assistance applications”, ICASSP, pp. 741-744, 2002.

[6] Sethy, A., Narayanan, S. and Parthasarthy, S., “A syllable based approach for improved recognition of spoken names”, Proceedings of the ISCA Pronunciation Modeling Workshop, Denver, 2002.

[7] Ramabhadran, B., Huang, J. and Picheny, M., “Towards Automatic Transcription of Large Spoken Archives - English ASR for the MALACH Project”, ICASSP 2003.

[8] Sethy, A., Ramabhadran, B., and Narayanan, S., “Improvements in English ASR for the MALACH project Using Syllable-Centric Models”, Proc. Automatic Speech Recognition and Understanding Workshop, ASRU, 2003.

[9] Maskey, S., Bacchiani, M., Roark, B., and Sproat, R., “Improved Name-Recognition with Meta-data dependent name networks”, ICASSP, 2004.

[10] McCarley, J. S. and Franz, M., “Influence of Speech Recognition Errors on Topic Detection”, Proceedings of the 23rd ACM SIGIR Conference on Information Retrieval, pp. 342-344, 2000.

[11] Saon, G., Zweig, G., Kingsbury, B., Mangu, L. and Chaudhari, U., “An Architecture for Rapid Decoding of Large Vocabulary Conversational Speech”, Eurospeech, 2003.

[12] EU-US Working Group on Spoken-Word Audio Collections, http://www.dcs.shef.ac.uk/spandh/projects/swag, 2003.

[13] Garofolo, S. J., Cedric, G., Auzanne, P. and Voorhees, E. M., “The TREC Spoken Document Retrieval Track: A Success Story”, The Eighth Text Retrieval Conference, TREC-8, 1999.

[14] Oard, D.W., Soergel, D., Murray, C. G., Doermann, D., Wang, J., Ramabhadran, B., Franz, M., and Gustman, S., “Building an Information Retrieval test Collection for Spontaneous Conversational Speech”, to appear in SIGIR, 2004.

[15] Byrne, W., Doermann, D., Franz, M., Gustman, S., Hajič, J., Oard, D., Picheny, M., Psutka, J., Ramabhadran, B., Soergel, D., Ward, T., and Zhu, W-J., “Automated Recognition of spontaneous speech for access to multilingual oral history archives”, to appear in IEEE Transactions on Speech and Audio Processing, July 2004.
