This document discusses using metadata to improve speech recognition of named entities in spontaneous speech collections. It proposes a method that uses metadata from pre-interview questionnaires to augment the lexicon and language model on a per-speaker basis. This allows the system to focus on named entities actually mentioned by each speaker. Evaluation on the MALACH Holocaust testimonial corpus showed this approach reduced named entity recognition errors by over 50% and decreased overall word error rate by 7%.
Copyright:
Attribution Non-Commercial (BY-NC)
Use of Metadata To Improve Recognition of Spontaneous Speech and Named Entities
Yorktown Heights, NY 10598, USA {bhuvana, siohan, gzweig}@us.ibm.com
Abstract

With improved recognition accuracies for LVCSR tasks, it has become possible to search large collections of spontaneous speech for a variety of information. The MALACH corpus of Holocaust testimonials is one such collection, in which we are interested in automatically transcribing and retrieving portions that are relevant to named entities such as people, places, and organizations. Since the testimonials were gathered from thousands of people in countries throughout Europe, an extremely large number of potential named entities is possible, and this causes a well-known dilemma: increasing the size of the vocabulary allows more of these words to be recognized, but also increases confusability and can harm recognition performance. However, the MALACH corpus, like many other collections, includes side information or metadata that can be exploited to provide prior information on exactly which named entities are likely to appear. This paper proposes a method that capitalizes on this prior information to reduce named-entity recognition errors by over 50% relative, and simultaneously decrease the overall word error rate by 7% relative. The metadata we use derives from a pre-interview questionnaire that includes the names of friends, relatives, places visited, membership of organizations, synonyms of place names, and similar information. By augmenting the lexicon and language model with this information on a speaker-by-speaker basis, we are able to exploit the textual information that is already available in the corpus to facilitate much improved speech recognition.

1. INTRODUCTION

In a recent report, an international digital library working group called for the creation of systems capable of providing access to an estimated 100 million hours of culturally significant spoken word collections [12]. Achieving this will require two fundamental advances over the present state of the art: (1) the degree to which existing LVCSR and NLP techniques can be adapted to provide access to spontaneous conversational speech and (2) the robust ability to identify spoken words and other useful features such as named entities in many types of collections. Several narrow-band and broadband speech collections are currently available [1, 13, 4], and carefully tuned Automatic Speech Recognition (ASR) systems are now able to achieve word error rates between 10% and 40%, depending on the difficulty of the collection. The Spoken Document Retrieval (SDR) track of the Text Retrieval Conferences (TREC) has demonstrated the feasibility of subject-based searching in non-spontaneous broadcast news collections in the presence of such word error rates [13]. However, none of these spontaneous speech corpora were designed to contain a substantive discussion of the same topic by multiple speakers or mentions of several named entities that could subsequently be searched for. The MALACH corpus described in [7, 14, 15] naturally lends itself as an excellent testbed for LVCSR as well as NLP and search applications.

Apart from spoken archives, many practical applications such as name-dialing and call-center applications require not only accurate recognition of spontaneous speech but also accurate recognition of names, places, digits, foreign words, and acronyms. Building such a system is complex due to the very large number of such entities that can occur, some more frequently than others. Name recognition has been the focus of many researchers, particularly in the context of directory assistance applications [3, 2, 5, 6], where ASR systems are designed to recognize between 200K and 2M names. A significant decrease in recognition performance has been noted when increasing vocabulary size in [3]. One of the techniques proposed to counter the adverse effects of a large lexicon is to include a diverse set of pronunciations to cover the acoustic variability [3]. While this helps, it has been shown that focusing on the discriminative segments in a multi-pass approach reduces the effect of confusability [2]. Approaches that include confidence measures and rejection thresholds have been shown to be useful in the accurate recognition of names [5].

ASR systems, including the ones mentioned above, typically focus on short-time information distributed over periods of 10-20 ms. It has been shown in [6] that capturing information distributed over longer periods of time, such as a syllabic or word-level time span, can lead to substantial gains in name recognition accuracy. The number of different acoustic units required for a given recognition task is a function of the vocabulary size and the nature of the underlying acoustic units. For phonemes, the number of basic models (without context modeling) is fixed for a given language. However, when using syllable or word size units, the number increases in general with the vocabulary size. Many of these units are pronunciations of words which are not used frequently and will have poor coverage in the training data. For small vocabulary tasks such as alphabet or digit recognition, longer units (typically word level units) have been used successfully. Sparsity of training data has been the main obstacle to using longer acoustic units for LVCSR tasks. However, in [8] it was demonstrated that an LVCSR system which uses competing phonetic and mixed syllabic-phonetic paths in parallel can be built to improve recognition of names and concepts (by 17% relative). In the MALACH project this is particularly important for the search and retrieval of segments of speech relevant to the mention of a name, place or concept [14].

In this paper we report on the use of metadata to improve recognition of named entities while reducing the overall word error rate (WER). The use of metadata in the form of a caller ID string associated with an incoming call to aid name recognition in a voice mail transcription task has been presented in [9]. The metadata for the MALACH corpus, as is the case for any other oral history archive, is available in the pre-interview questionnaire (PIQ) completed by the interviewees. This includes biographical data, person names, family relationships, locations and extensive demographic data. Not all of these named entities are equally likely to occur during every speaker’s testimony. Therefore, if the metadata can be used to select the subset of words that can occur with the highest probability on a per-interview basis, this can lead to significant improvements in recognition accuracy.

Section 2 describes the MALACH database and the difficulties associated with recognizing named entities for this data. Sections 3 and 4 describe the metadata and the technique used to condition the ASR system with the available metadata. Section 5 presents the experimental setup and improvements in recognition accuracies obtained when dynamically adapting the lexicon. Section 6 discusses the implications of better recognition on subsequent search and retrieval. The paper concludes with a summary and potential applications for this work in other collections.

2. MALACH

MALACH (Multilingual Access to Large Spoken Archives) is an ongoing effort that aims to improve access to the contents of large, multilingual, spoken archives by advancing the state of the art in automated speech recognition (ASR), information retrieval (IR) and other component technologies, by utilizing the world’s largest digital archive of video oral histories, collected by the VHF (the Survivors of the Shoah Visual History Foundation) [7]. The MALACH corpus consists of unconstrained, natural speech filled with disfluencies, heavy accents, age-related coarticulations, uncued speaker and language switching and emotional speech collected in the form of interviews from over 52000 speakers in 32 languages. Approximately 25000 of these testimonies are in English, spanning a wide range of accents, such as Hungarian, Polish, Yiddish, German, Italian, French, Czech, Hebrew, Croatian, Spanish, Ukrainian etc. A good number of words uttered in this corpus are foreign words or sequences of words spoken in a foreign language, unfamiliar names and places. The corpus consists of elderly speech, where the age of the interviewees ranges from 56 years to 90 years. In order to obtain training data for acoustic and language models, approximately 200 hours of the English portion of the MALACH corpus was manually transcribed and annotated with named entities. Transcription is challenging even for skilled annotators, who typically required 8 to 12 hours to transcribe a single hour of an English interview. The difficulties arise from unfamiliar names and places, multiple languages encountered during a single interview, coarticulations related to age, highly variable speaking rates, and heavily accented speech.

3. METADATA

This metadata from the PIQ is available on a per-speaker basis and therefore serves as the name and place authority for the mentions in the interview. Many of the place names constitute cities, streets and names of concentration camps as well as their synonyms that may be mentioned during the course of an interview. For illustration, we include here an example of the actual words spoken by different speakers during several segments, annotated with the appropriate named entities (shown in bold face).

    because there was no normal teacher in Blashova so there was a teacher that ran more or less like a high school teacher ...

    okay well we got to the point where I was in Peterboro on Flaten near near Peterboro ...

    I was no longer able to stay in Flaten and neither was my f- the son of the Cookland who was the same age as me and we both all came and went back to live in Rectory Road in Hackney then the question came as to what I was going to do with my life ...

Moreover, these names occur in many variations (Hebrew names, Yiddish names, diminutives, first names only, nicknames, etc.). Named entities have proven to be important to searchers of this collection [14] and hence it is important that the ASR systems hypothesize these words correctly. The MALACH data offers an opportunity to study this problem through its large database of personal identities (approx. 2.5 million names) that is populated with information taken from survey forms filled out by the subjects (PIQs), additional names of topics and concepts assigned by catalogers, and a large list of place names and their synonyms (over 20,000 locations). An important challenge in this work is that many key search terms comprising these named entities will be found only among the infrequently occurring words and phrases, and rare terms are inevitably modeled less well than more common ones.

The metadata also contains synonyms for named entities. Many street and city names have changed over a period of time; for example, St. Petersburg was formerly known as Leningrad and Petrograd. Every interviewee provided their current name, name at birth, release name, Hebrew name, Yiddish name, nicknames and any other false names they worked under during their lives. All of these were also included in the dynamic graph that was built. For example, a person with the first name Alicia has Alicja, Chana, Alice, Jadwiga, Alushia, and Alla as possible variations of the first name. Table 1 illustrates the distribution of names, places and foreign words on the English portion of the MALACH corpus as a function of hours of speech.

    Hours    Names and Places (%)    Foreign Words (%)
    65       7.2                     4.1
    200      10.6                    5.3

Table 1: Distribution of Names, Places and Foreign Words

4. APPROACH: METADATA in ASR

Given the intended role of ASR to support information access, we are particularly interested in named entity recognition [10], especially the recognition of personal names and place names, which are both important search criteria. Therefore, the ASR lexicon was carefully constructed using a large database of personal identities populated with information taken from survey forms filled out by the interviewees, additional names assigned by catalogers, and a large list of place names. However, many of these entities are rare terms and therefore cannot be modeled very well.

The manual transcriptions of 180 hours of training data were used to build language models using the modified Kneser-Ney algorithm [7]. The training data is relatively small (1.7M words); therefore, the language models built from Broadcast News (BN) and Switchboard (SWB) corpora (158M and 3.4M words, respectively) were interpolated with the LM built from this collection. The interpolation weights were optimized to achieve minimum perplexity on the held-out data from this collection. The perplexity of this task on the held-out test set is 72.3. Although a lexicon can be built with the most frequently occurring words and by minimizing the OOV rate on a held-out set, as the number of interviews processed grows, many new words will need to be added. It is important for these words to be recognized accurately in order for subsequent search to be successful.

The PIQ database is indexed on the interview code. The approach presented in this paper adds the named entities contained in the metadata for the interview being decoded to the ASR’s lexicon and replaces the static decoding graph (Section 5.2) with a new graph appropriately weighted with the language model probabilities seen in the training data. If a mention of a named entity did not occur in the MALACH training material, its language model probability backs off to that of an unknown word. Many of the words added from the metadata occurred in our LM training material that had been derived by interpolating MALACH data with Broadcast News and Switchboard material. During test time, the identity of the interviewee as defined by the interview code is used to derive the dynamic graph using the pre-defined metadata available for that interview code. In [9], a class-based language model was built from the metadata-defined names derived from the caller ID string and a name network was composed with finite-state transducers for this specific caller. Given the spontaneous speech in the MALACH data, it is very difficult to derive a network for the usage of named entities; however, augmenting the ASR lexicon with possible realizations of names and places can be done.

5. EXPERIMENTS AND RESULTS

5.1. Training and test corpora

The English training corpus was generated using 15-minute segments of an interview from 720 randomly selected speakers. Thus, a total of 180 hours of data was selected for manual transcription to serve as training material for ASR systems. Male and female speakers in this corpus were more or less equally distributed and a wide range of accents was covered (e.g., Hungarian, Italian, Yiddish, German, and Polish). The ASR test set consists of 30-minute segments taken from 15 randomly chosen speakers. This test set was also appropriately annotated with named entity tags (illustrated in the example in Section 3).

The audio signal was down-sampled to 16 kHz from 44.1 kHz and parameterized using 24-dimensional mel frequency cepstral coefficients (MFCC). Final acoustic features were derived using linear discriminant analysis (LDA) and maximum-likelihood based linear transformations (MLLT). Speaker-specific transformations (SAT and MLLR) were used by the final system that gave the best WER. This ASR system had a WER of 35.2% on the test set described in [7].

5.2. Decoding Strategy

The decoder used in our experiments is a Viterbi decoder operating on a fully flattened state-level HMM. A traditional decoding setup operates on a single HMM constructed from an overall language model. In contrast to this, in our experiments, the states are built for each speaker using a speaker-specific lexicon and language model. A detailed description of the Viterbi decoder used is presented in [11].

5.3. Results

The use of metadata was evaluated using the overall WER and the WER on named entities for different vocabulary sizes. Table 2 illustrates the large gains obtained when incorporating metadata information into the recognizer. The maximum gain (51% relative) on the recognition of named entities was obtained with a 30K vocabulary that was specialized to include person-specific metadata. On the other hand, the improvement in named entity recognition is fairly small if the size of the static lexicon is increased by a factor of three. If the names and places derived from the metadata are not added to the ASR’s lexicon, the named entity WER decreases marginally when the vocabulary is expanded from 30K to 90K (66.2% to 61.4%). However, when adding speaker-specific information, the named-entity WER goes down from 66.2% (with a static 30K vocab) to 32.3%, almost reducing the WER by half. It can be seen that as the lexicon size increases, a small percentage of the gains obtained from the metadata is lost, probably offset by the added confusability, i.e., with a 90K lexicon, the named entity WER increases to 38.8% from 32.3%. This is consistent with the degradation in performance seen in the literature with increased lexicon sizes. Table 3 shows the decrease in overall WER (relative 7%) obtained with the use of metadata. The reduction in the overall WER that is obtained when tripling the lexicon size without the use of metadata is much smaller (relative 2.5%).

    Vocab    Static Vocab (%)    Metadata adapted Vocab (%)    Relative Gain (%)
    30K      66.2                32.3                          51.2
    60K      62.1                36.8                          40.7
    90K      61.4                38.8                          36.8

Table 2: WER computed on the named entities for different vocabulary sizes

    Vocab    Static Vocab (%)    Metadata adapted Vocab (%)
    30K      40.1                37.6
    60K      39.4                36.7
    90K      39.2                36.5

Table 3: Improvements in overall WER for different vocabulary sizes

Our goal is to select the vocabulary size that yields the best overall WER and named entity WER, and surprisingly, the 30K vocabulary coupled with the PIQ words (the best matching vocabulary) for an interview is the best choice. This is interesting because Table 4 shows that a 90K vocab actually has a much better overall OOV rate. However, the extra words increase confusability and our results show that this is detrimental.

    Vocab    Named Entities OOV rate (%)    OOV rate w/PIQ (%)
    30K      25.5                           9.2
    60K      16.4                           8.9
    90K      13.2                           8.8

Table 4: OOV rate computed on the named entities

6. IMPLICATIONS FOR SEARCH AND RETRIEVAL

A test collection for the English data in the MALACH corpus was presented in [14]. The collection comprised 404 full interviews totaling over 600 hours of speech, with automated speech recognition transcripts and associated relevance judgments for 28 queries. These queries were built from over 600 written requests for materials from the collection from scholars, educators, documentary film makers and students. The mean average precision score obtained by the best search system was 0.09. An analysis of the retrieval systems indicated that for 25% of the queries, the keywords did not appear in the ASR transcripts and hence resulted in failure of the system to retrieve the relevant segments. In about 30% of the queries, the keywords were recognized at least once in a segment of speech even though there were several mentions of the same. All of these keywords were domain-specific names and places. This illustrates the importance of a high recognition accuracy on named entities for search and retrieval tasks.

7. CONCLUSIONS

This paper presents a technique to incorporate metadata information into a speech recognition system. The results show that, whenever available, adding domain-specific metadata not only provides substantial gains of the order of 50% relative in named-entity detection accuracy, but also provides a modest improvement in overall WER that otherwise cannot be achieved by simply increasing the size of the static lexicon. The large gains in the recognition of named entities are crucial to search and retrieval tasks, particularly in the MALACH project. Analysis of real-user requests for this corpus indicates that the topical requests account for approximately 53% of the requests while 89% of them are searches based on person names, organization, camp and city names. Therefore, for search to be successful it is more crucial to recognize these terms accurately than it is to improve the overall recognition accuracy. The technique presented here on the use of metadata achieves both simultaneously. The use of metadata is extremely promising and we plan to explore its impact on search by redecoding the test collection described in [14] with a lexicon derived from the metadata for each interview. The proposed algorithm also has applications in searching other spoken word collections such as recordings of meetings, lectures and call center mining.

8. Acknowledgments

This project is part of an on-going effort funded by NSF under the Information Technology Research (ITR) program, NSF IIS Award No. 0122466. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.

9. References

[1] Bacchiani, M., “Automatic Transcription of voice-mail at AT&T”, ICASSP, 2001.

[2] Junqua, J.-C, Valente, S., Fohr, D., and Mari J.-F, “An n-best strategy, dynamic grammars and selectively trained neural networks for real-time recognition of continuously spelled names over the telephone”, ICASSP, pp. 852-855, 1995.

[3] Gao, Y., Ramabhadran, B., Chen, J., Erdogan, H., and Picheny, M., “Innovative approaches for large vocabulary name recognition”, ICASSP, pp. 53-56, 2001.

[4] Glass, J., Hazen, T. J., Hetherington, L., and Wang, C., “Analysis and Processing of Lecture Audio Data: Preliminary Investigations”, Workshop on Interdisciplinary Approaches to Speech Indexing and Retrieval, HLT-NAACL04, 2004.

[5] Liao, Y.-F. and Rose, G., “Recognition of chinese names in continuous speech for directory assistance applications”, ICASSP, pp. 741-744, 2002.

[6] Sethy, A., Narayanan, S. and Parthasarthy, S., “A syllable based approach for improved recognition of spoken names”, Proceedings of the ISCA Pronunciation Modeling Workshop, Denver, 2002.

[7] Ramabhadran, B., Huang, J. and Picheny, M., “Towards Automatic Transcription of Large Spoken Archives - English ASR for the MALACH Project”, ICASSP 2003.

[8] Sethy, A., Ramabhadran, B., and Narayanan, S., “Improvements in English ASR for the MALACH project Using Syllable-Centric Models”, Proc. Automatic Speech Recognition and Understanding Workshop, ASRU, 2003.

[9] Maskey, S., Bacchiani, M., Roark, B., and Sproat, R., “Improved Name-Recognition with Meta-data dependent name networks”, ICASSP, 2004.

[10] McCarley, J. S. and Franz, M., “Influence of Speech Recognition Errors on Topic Detection”, Proceedings of the 23rd ACM SIGIR Conference on Information Retrieval, pp. 342-344, 2000.

[11] Saon, G., Zweig, G., Kingsbury, B., Mangu, L. and Chaudhari, U., “An Architecture for Rapid Decoding of Large Vocabulary Conversational Speech”, Eurospeech, 2003.

[12] EU-US Working Group on Spoken-Word Audio Collections, http://www.dcs.shef.ac.uk/spandh/projects/swag, 2003.

[13] Garofolo, S. J., Cedric, G., Auzanne, P. and Voorhees, E. M., “The TREC Spoken Document Retrieval Track: A Success Story”, The Eighth Text Retrieval Conference, TREC-8, 1999.

[14] Oard, D.W., Soergel, D., Murray, C. G., Doermann, D., Wang, J., Ramabhadran, B., Franz, M., and Gustman, S., “Building an Information Retrieval test Collection for Spontaneous Conversational Speech”, to appear in SIGIR, 2004.

[15] Byrne, W., Doermann, D., Franz, M., Gustman, S., Hajič, J., Oard, D., Picheny, M., Psutka, J., Ramabhadran, B., Soergel, D., Ward, T., and Zhu, W-J., “Automated Recognition of spontaneous speech for access to multilingual oral history archives”, to appear in IEEE Transactions on Speech and Audio Processing, July 2004.
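As an illustration of the per-interview lexicon augmentation described in Section 4, the following Python sketch shows one way the idea could be expressed. It is not the authors' implementation, which builds a speaker-specific decoding graph. The names `PIQRecord`, `metadata_words`, `build_speaker_lexicon`, `word_probability` and `relative_reduction`, the record fields, the lowercasing and whitespace tokenization, and the flat unigram table standing in for the interpolated n-gram language model are all assumptions made for illustration. The last helper reproduces the relative gain figures of Table 2.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class PIQRecord:
    """Hypothetical pre-interview questionnaire entry for one interview."""
    interview_code: str
    person_names: List[str] = field(default_factory=list)
    place_names: List[str] = field(default_factory=list)
    # Maps a name to its known variants (e.g. former city names, nicknames).
    synonyms: Dict[str, List[str]] = field(default_factory=dict)


def metadata_words(record: PIQRecord) -> Set[str]:
    """Collect every word token found in the record's names and synonyms."""
    entries = list(record.person_names) + list(record.place_names)
    for canonical, variants in record.synonyms.items():
        entries.append(canonical)
        entries.extend(variants)
    words: Set[str] = set()
    for entry in entries:
        # Assumption: lowercase, whitespace-separated tokens.
        words.update(entry.lower().split())
    return words


def build_speaker_lexicon(static_vocab: Set[str],
                          piq_db: Dict[str, PIQRecord],
                          interview_code: str) -> Set[str]:
    """Static vocabulary plus the named entities listed for this interview."""
    record = piq_db.get(interview_code)
    if record is None:
        # No metadata for this interview: fall back to the static lexicon.
        return set(static_vocab)
    return set(static_vocab) | metadata_words(record)


def word_probability(word: str, lm_probs: Dict[str, float],
                     unk_prob: float) -> float:
    """Words unseen in LM training back off to the unknown-word probability."""
    return lm_probs.get(word, unk_prob)


def relative_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Relative WER reduction in percent."""
    return 100.0 * (baseline_wer - adapted_wer) / baseline_wer


if __name__ == "__main__":
    static_vocab = {"the", "teacher", "school"}
    piq_db = {
        "int-001": PIQRecord(
            interview_code="int-001",
            person_names=["Alicia"],
            place_names=["Rectory Road", "Hackney"],
            synonyms={"Alicia": ["Alicja", "Chana"]},
        )
    }
    print(sorted(build_speaker_lexicon(static_vocab, piq_db, "int-001")))
    # Table 2, 30K row: static 66.2% vs. metadata-adapted 32.3%.
    print(round(relative_reduction(66.2, 32.3), 1))
```

Keying the metadata on the interview code mirrors how the PIQ database is indexed, so the lexicon can be rebuilt for each interview at test time while the static vocabulary stays small.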