recognizing everything anyone can say in multiple languages has yet to be achieved, research has been focused on smaller-scale approaches [4].

A. Theoretical Foundations

Aymen et al. [5] presented the theoretical foundations of Hidden Markov Models (HMM), which underpin most modern implementations of automatic speech recognition. The authors draw the distinction between speech recognition, which aims to recognize almost anyone's speech, and voice recognition, which creates systems trained for particular users. The model is constructed from a large corpus of recorded speech, annotated with the respective transcription. The HMM requires three different sub-models:
1) The acoustic model consists of different features for each utterance the system recognizes;
2) The lexical model tries to identify sounds considering the context;
3) The language model identifies the higher-level characteristics of speech, such as words and sentences.
The HMM searches the model for similar patterns that fit the given audio input, producing probable matches. The HMM's advantages over previous learning algorithms consist of easy implementation on a computer and automated training without human intervention. This stems from the assumption that, over short time ranges, the process is stationary, which vastly reduces the computational effort [5].
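As a rough illustration of how an HMM decoder scores competing hypotheses, the following minimal sketch runs the Viterbi algorithm over a toy two-state model. The states, observation symbols, and probabilities are invented for illustration only; they are not taken from the paper or from any Sphinx model.

```python
# Toy Viterbi decoding for a two-state HMM (illustrative values only).
# Each state could stand for a phone; observations for acoustic feature labels.

states = ["S1", "S2"]
start_p = {"S1": 0.6, "S2": 0.4}                 # initial probabilities
trans_p = {"S1": {"S1": 0.7, "S2": 0.3},         # transition probabilities
           "S2": {"S1": 0.4, "S2": 0.6}}
emit_p = {"S1": {"a": 0.5, "b": 0.5},            # emission probabilities
          "S2": {"a": 0.1, "b": 0.9}}

def viterbi(observations):
    """Return the most probable state sequence and its probability."""
    # best[t][s] = (probability of best path ending in s at time t, predecessor state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        layer = {}
        for s in states:
            prob, prev = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p) for p in states
            )
            layer[s] = (prob, prev)
        best.append(layer)
    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for layer in reversed(best[1:]):
        path.insert(0, layer[path[0]][1])
    return path, best[-1][last][0]

print(viterbi(["a", "b", "b"]))  # -> (['S1', 'S2', 'S2'], 0.04374)
```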
B. Implementations

There are several HMM implementations, but the most advanced are the HTK Toolkit [6] and the CMU Sphinx system [7].
The Hidden Markov Model Toolkit (HTK) is a set of libraries used for research in automatic speech recognition, implemented using HMM. The HTK codebase is owned by Microsoft but managed by the Cambridge University Engineering Department. Since HTK has been largely abandoned (the last release, v3.4.1, was made in 2009), the CMU Sphinx system has been getting more attention from the speech recognition community [6].
The original SPHINX was the first accurate Large Vocabulary Continuous Speech Recognition (LVCSR) system, built on HMM technology, that managed to be speaker independent [7]. The next version, SPHINX-II, created largely by the same authors, was both faster and more accurate. It was developed, from the beginning, as an open source project, creating a community around it [8]. The next version, SPHINX-III, is an offline version of the previous systems, with a different internal representation that allows for greater accuracy. The signals go through a much larger amount of pre-processing before they even reach the recognizer [7]. Current hardware is capable of running the SPHINX-III recognizer in almost real-time, but it is not suited to processing in such conditions. SPHINX-4 is a complete rewrite that aims for a more modular and flexible system, able to accept multiple data sources elegantly. It is a joint venture with Mitsubishi Electric Research Laboratories and Sun Microsystems, using the Java programming language. As with the third version, its intended use is offline processing, not real-time applications [9].

Vertanen [10] tested both the HTK and the Sphinx systems with the Wall Street Journal (WSJ) corpus and found no significant differences in error rate and speed. This conclusion is corroborated by other researchers [11].

Huggins-Daines et al. [12] optimized CMU Sphinx II for embedded systems, primarily those with ARM architecture. To balance the loss of precision required by other optimizations, the CMU Sphinx III Gaussian mixture model was back-ported. They managed to have a 1000-word vocabulary running at 0.87 times real-time on a 206 MHz embedded device, with an error rate of 13.95% [12]. “Times real-time” is a notation that indicates the amount of time required to process live data: in this case, the system can process 1 second of audio in 0.87 seconds, which makes it suitable for real-time recognition. This work led to the creation of the PocketSphinx project, an open source initiative to continue it. The project is in active development and has bindings for C and Python [13].

C. Practical Systems

Vijay [14] studied the problem of phonetic decomposition in lesser-studied languages, such as Native American and Roma language variants, using the PocketSphinx system. While the system does not implement the complex rules of these languages, it is possible to leverage it to recognize unknown languages using a relatively simple lookup table that maps sounds to phones [15]. Varela et al. [15] adapted the system to Mexican Spanish. The authors created a language model and an acoustic model, based on an auto-attendant telephone system, and achieved an error rate of 6.32% [15]. The same process was followed for other languages, such as Mandarin [16], Arabic [17], and Swedish [18]. These examples show that the PocketSphinx system is flexible enough that it is relatively easy for people with phonetics training to extend it to other languages.

Harvey et al. [4] researched how ASR systems could be integrated with their project aimed at developing a device to help the elderly, both inside and outside the home. The authors identified the following challenges associated with ASR systems used for voice command interfaces [4]:
1) Important differences between users;
2) Similarity between certain sounds;
3) Short words provide less data for the system to analyze, which may lead to increased error rates;
4) Different recognition languages lead to variable error rates with the same system.
Specific to their project, the authors found that medical conditions which frequently affect the elderly create different speech patterns, and that the users' tolerance to errors is quite low. With that in mind, the authors leveraged the Sphinx library for its maturity and features. Focusing on the creation of models and general optimization tasks, they managed to create a multilingual system with a 2-second processing time on embedded systems, but with error rates above 70% [4].

Kirchhoff et al. [19] suggested other methods to improve ASR systems' performance. One proposal consists of replacing the current feature-extraction algorithms with others specially designed to discriminate certain sound classes, depending on the intended use, or using noise reduction algorithms, which can improve the data
III. ARCHITECTURE
As mentioned before, the gastroenterologist needs to use a pedal to capture and save the frame; in the proposed solution, the pedal is replaced with a hands-free voice control module called MIVcontrol. This module was developed to tackle the problems that healthcare professionals face when performing an endoscopic procedure. It is part of the MIVbox device, which is integrated in the MyEndoscopy system.

[Figure: MIVcontrol block diagram — recovered labels: MIVacquisition, MIVcontrol, Image Capture, Video Acquisition Control, Feature Extraction, Model Comparison, Commands]

MyEndoscopy is the name of the global system
IV. IMPLEMENTATION
The creation of the speech model used in the MIVcontrol module comprises, from the higher to the lower level, three different phases, namely the language model, the dictionary, and the acoustic model.
A. Language Model
The language model is a high-level description of all valid phrases (i.e. combinations of words) in a certain language. Statistical language models try to predict all the valid utterances in a language by combining all the recognized words into every possible combination [22]. Context-Free Grammars are restricted forms of a language model that limit the recognized phrases to a predetermined set and discard those that do not fit that model [23].
The decision to adopt a certain language model depends mostly on the intended application. While statistical language models are useful for open-ended applications, like dictation and general-purpose recognition, context-free grammars are suitable for specific applications, like command-and-control systems.
SphinxBase requires the grammar to be defined in the Java Speech Grammar Format (JSGF), a platform-independent standard format for defining context-free grammars in a textual, human-readable representation [24]. The statistical language model is automatically created based on the command list.
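For illustration, a command-and-control grammar in JSGF could look like the following minimal sketch. The grammar name and the command words are hypothetical placeholders, not the actual command set used by MIVcontrol; the snippet simply writes the grammar to a file that a Sphinx decoder could load.

```python
# Minimal sketch: write a hypothetical JSGF grammar for a few voice commands.
# The grammar name and the commands below are illustrative placeholders only.

JSGF_GRAMMAR = """#JSGF V1.0;

grammar commands;

public <command> = capture | save | start | stop;
"""

with open("commands.gram", "w") as gram_file:
    gram_file.write(JSGF_GRAMMAR)
```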
B. Dictionary

The dictionary is a map between each command and the phonemes it contains. A phoneme is defined as the basic unit of phonology; phonemes can be combined to form words. The internal representation uses the ARPAbet to encode phonemes as ASCII characters. The ARPAbet cannot represent the entire International Phonetic Alphabet (IPA), but it is sufficient for small vocabularies, such as the one required by this application [25].
Since the list of required commands is small, all the dictionaries used were created manually.
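A dictionary of this kind maps each command to its ARPAbet phonemes, one entry per line. The sketch below writes a hypothetical dictionary for the same placeholder commands used above; the pronunciations follow the CMU dictionary convention, and the actual command list of the system is not reproduced here.

```python
# Minimal sketch: write a Sphinx-style pronunciation dictionary (word -> ARPAbet phones).
# The command words are illustrative placeholders; pronunciations follow CMUdict conventions.

DICTIONARY = {
    "CAPTURE": "K AE P CH ER",
    "SAVE":    "S EY V",
    "START":   "S T AA R T",
    "STOP":    "S T AA P",
}

with open("commands.dic", "w") as dic_file:
    for word, phones in sorted(DICTIONARY.items()):
        dic_file.write(f"{word} {phones}\n")
```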
C. Acoustic Model

The acoustic model is trained using SphinxTrain and maps audio features to the phonemes they represent, for those included in the dictionary. The training performed by SphinxTrain requires prior knowledge of the dictionary and a transcription for each utterance, in order to map each utterance to its corresponding phonetic information. It also requires the data to be in a particular audio format. To minimize clerical errors and cut the time needed to analyze the data to a minimum, all the technical considerations and index building were abstracted away in a script referred to as amCreate.
SphinxTrain requires the folder tree presented in Fig. 4, where “model” denotes the model name.

Fig. 4 Folder tree required by SphinxTrain.

The folder tree has two top-level folders, namely etc and wav.
The etc folder contains all the metadata and configuration parameters needed to train the acoustic model, as well as the dictionary. It contains both a list of all the phonemes used in the model and a list of filler phonemes, such as silences, that should be ignored. It also has a list of all the files to be used during both the training and testing phases, as well as a mapping between each audio file and its corresponding transcription. This mapping corresponds to the labeled data used as input to the HMM.
The wav folder simply contains all the collected data, as audio files, organized in subfolders by speaker identification, with a subfolder for each set of uttered commands.
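The folder layout described above can be reproduced with a few lines of scripting. The sketch below only illustrates creating that layout for a model; it is not the actual amCreate script, whose contents are not detailed here, and the model and speaker names are placeholders.

```python
# Minimal sketch: create the SphinxTrain folder layout described above
# (an "etc" folder for metadata/configuration and a "wav" folder for audio),
# with per-speaker subfolders under "wav". Names are placeholders; this is
# not the authors' amCreate script.
import os

def create_model_tree(model_name, speakers):
    os.makedirs(os.path.join(model_name, "etc"), exist_ok=True)
    for speaker in speakers:
        os.makedirs(os.path.join(model_name, "wav", speaker), exist_ok=True)

create_model_tree("model", ["speaker01", "speaker02"])
```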
The system processes continuous audio in real-time, splits it into commands, and produces a line of text for each recognized command. If the spoken command is not recognized, an empty line is produced. The MIVcontrol module runs on the MIVbox.
The audio picked up by the microphone is stored in a memory buffer. The first pre-processing stage involves splitting the incoming audio into different utterances, or sets of words, by tracking the silent periods between them. To account for noise present during recording, any audio with volume below a certain threshold is considered silence.
Each segmented utterance then goes through a similar process: the audio is processed to create a set of features, and the Semi-Continuous HMM then finds the most likely utterance contained in its dictionary. This is the final output, corresponding to a command given to the system.
If there is Internet access and the data to be recognized is not sensitive, it is possible to use an online speech recognition service, such as the Google Speech API [26], as a fallback mechanism.
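The segmentation and decoding loop described above can be outlined as in the sketch below. This is only an illustrative outline, assuming the SWIG-based PocketSphinx Python bindings and 16 kHz, 16-bit mono audio; the energy threshold, frame size, and model paths are invented placeholders rather than the values used in MIVcontrol.

```python
# Illustrative outline of the recognition loop described above:
# buffer microphone audio, split it into utterances at silent periods
# (frames whose energy falls below a threshold), and decode each utterance.
# Assumes the SWIG-based pocketsphinx Python bindings; paths, threshold and
# frame size are placeholders, not the values used by MIVcontrol.
import audioop

from pocketsphinx.pocketsphinx import Decoder

FRAME_BYTES = 2048      # ~64 ms of 16 kHz, 16-bit mono audio (placeholder)
SILENCE_RMS = 500       # energy below which a frame counts as silence (placeholder)
SILENT_FRAMES_END = 8   # consecutive silent frames that end an utterance (placeholder)

config = Decoder.default_config()
config.set_string("-hmm", "model/acoustic")        # acoustic model (placeholder path)
config.set_string("-dict", "model/commands.dic")   # pronunciation dictionary (placeholder path)
config.set_string("-jsgf", "model/commands.gram")  # JSGF command grammar (placeholder path)
decoder = Decoder(config)

def recognize(utterance):
    """Decode one utterance; return the command text or '' if nothing was recognized."""
    decoder.start_utt()
    decoder.process_raw(utterance, False, True)
    decoder.end_utt()
    hyp = decoder.hyp()
    return hyp.hypstr if hyp else ""

def command_lines(frames):
    """Yield one line of text per utterance found in a stream of audio frames."""
    utterance, silent = b"", 0
    for frame in frames:
        if audioop.rms(frame, 2) < SILENCE_RMS:  # low-energy frame counts as silence
            silent += 1
            if utterance and silent >= SILENT_FRAMES_END:
                yield recognize(utterance)       # empty string when unrecognized
                utterance = b""
        else:
            silent = 0
            utterance += frame
    if utterance:
        yield recognize(utterance)
```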
[2] N. Summerton, “Positive and negative factors in defensive medicine: a questionnaire study of general practitioners,” BMJ, vol. 310, no. 6971, pp. 27–29, Jan. 1995.
[3] J. M. Canard, J.-C. Létard, L. Palazzo, I. Penman, and A. M. Lennon, Gastrointestinal Endoscopy in Practice, 1st ed. Churchill Livingstone, 2011, p. 492.
[4] A. P. Harvey, R. J. McCrindle, K. Lundqvist, and P. Parslow, “Automatic speech recognition for assistive technology devices,” in Proc. 8th Intl Conf. Disability, Virtual Reality & Associated Technologies, Valparaíso, 2010, pp. 273–282.
[5] M. Aymen, A. Abdelaziz, S. Halim, and H. Maaref, “Hidden Markov Models for automatic speech recognition,” in 2011 International Conference on Communications, Computing and Control Applications (CCCA), 2011, pp. 1–6.
[6] S. Young, G. Evermann, D. Kershaw, G. Moore, J. Odell, D.
Ollason, V. Valtchev, and P. Woodland, “HTK FAQ.” [Online].
Available: http://htk.eng.cam.ac.uk/docs/faq.shtml. [Accessed:
03-Feb-2014].
[7] K.-F. Lee, H.-W. Hon, and R. Reddy, “An overview of the
SPHINX speech recognition system,” IEEE Trans. Acoust., vol.
38, no. 1, pp. 35–45, 1990.
[8] X. Huang, F. Alleva, H.-W. Hon, M.-Y. Hwang, K.-F. Lee, and
R. Rosenfeld, “The SPHINX-II speech recognition system: an
overview,” Comput. Speech Lang., vol. 7, no. 2, pp. 137–148,
Apr. 1993.
[9] P. Lamere, P. Kwok, E. Gouvea, B. Raj, R. Singh, W. Walker, M.
Warmuth, and P. Wolf, “The CMU SPHINX-4 speech
recognition system,” in IEEE Intl. Conf. on Acoustics, Speech
and Signal Processing (ICASSP 2003), Hong Kong, 2003, vol. 1,
pp. 2–5.
[10] K. Vertanen, “Baseline WSJ Acoustic Models for HTK and
Sphinx: Training recipes and recognition experiments,”
Cavendish Lab. Univ. Cambridge, 2006.
[11] G. Ma, W. Zhou, J. Zheng, X. You, and W. Ye, “A comparison
between HTK and SPHINX on chinese mandarin,” in IJCAI
International Joint Conference on Artificial Intelligence, 2009,
pp. 394–397.
[12] D. Huggins-Daines, M. Kumar, A. Chan, A. W. Black, M.
Ravishankar, and A. I. Rudnicky, “Pocketsphinx: A Free, Real-
Time Continuous Speech Recognition System for Hand-Held
Devices,” 2006 IEEE Int. Conf. Acoust. Speed Signal Process.
Proc., vol. 1, pp. I–185–I–188, 2006.
[13] D. Huggins-Daines, “PocketSphinx v0.5 API Documentation,”
2008. [Online]. Available:
http://www.speech.cs.cmu.edu/sphinx/doc/doxygen/pocketsphinx
/main.html. [Accessed: 20-Feb-2014].
[14] V. John, “Phonetic decomposition for Speech Recognition of
Lesser-Studied Languages,” in Proceeding of the 2009
international workshop on Intercultural collaboration - IWIC
’09, 2009, p. 253.
[15] A. Varela, H. Cuayáhuitl, and J. A. Nolazco-Flores, “Creating a
Mexican Spanish version of the CMU Sphinx-III speech
recognition system,” in Progress in Pattern Recognition, Speech
and Image Analysis, vol. 2905, A. Sanfeliu and J. Ruiz-
Shulcloper, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg,
2003, pp. 251–258.
[16] Y. Wang and X. Zhang, “Realization of Mandarin continuous
digits speech recognition system using Sphinx,” 2010 Int. Symp.
Comput. Commun. Control Autom., pp. 378–380, May 2010.
[17] H. Hyassat and R. Abu Zitar, “Arabic speech recognition using
SPHINX engine,” Int. J. Speech Technol., vol. 9, no. 3–4, pp.
133–150, Oct. 2008.
[18] G. Salvi, “Developing acoustic models for automatic speech
recognition,” 1998.
[19] K. Kirchhoff, G. A. Fink, and G. Sagerer, “Combining acoustic
and articulatory feature information for robust speech
recognition,” Speech Commun., vol. 37, no. 3–4, pp. 303–319,
Jul. 2002.
[20] J. Braga, I. Laranjo, D. Assunção, C. Rolanda, L. Lopes, J.
Correia-Pinto, and V. Alves, “Endoscopic Imaging Results: Web
based Solution with Video Diffusion,” Procedia Technol., vol. 9,
pp. 1123–1131, 2013.
[21] I. Laranjo, J. Braga, D. Assunção, A. Silva, C. Rolanda, L. Lopes,
J. Correia-Pinto, and V. Alves, “Web-Based Solution for
Acquisition, Processing, Archiving and Diffusion of Endoscopy
Studies,” in Distributed Computing and Artificial Intelligence,
vol. 217, Springer International Publishing, 2013, pp. 317–24.
[22] P. Clarkson and R. Rosenfeld, “Statistical language modeling using the CMU-Cambridge toolkit,” in 5th European Conference on Speech Communication and Technology, 1997, pp. 2707–2710.
[23] A. Bundy and L. Wallen, “Context-Free Grammar,” in Catalogue of Artificial Intelligence Tools, A. Bundy and L. Wallen, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 1984, pp. 22–23.
[24] A. Hunt, “JSpeech Grammar Format,” 2000.
[25] R. A. Gillman, “Automatic Verification of Hypothesized Phonemic Strings in Continuous Speech,” Arlington, Virginia, 1974.
[26] B. Ballinger, C. Allauzen, A. Gruenstein, and J. Schalkwyk, “On-Demand Language Model Interpolation for Mobile Speech Input,” in Eleventh Annual Conference of the International Speech Communication Association, 2010, pp. 1812–1815.