Safety, Security, and Convenience: The Benefits of Voice Recognition Technology
Abstract
We use our voices all the time to communicate with one another. Talking is the simplest and easiest means of information transmission. It could also be one of the simplest means of identification. Recent advances in computing technology have made voice recognition, a biometric technology based on the unique characteristics specific to an individual’s voice, more convenient, safer, and more secure than ever. In this paper, we review the current state of voice recognition technology and show how deep learning, the core of contemporary AI technology, is providing the key to unlock the power of biometrics. We will also look in some detail at NEC’s work in the field of voice recognition technology, which is at the forefront of worldwide efforts to make this technology accessible and reliable. Finally, we discuss potential industrial applications for voice recognition technology such as public safety solutions.
Keywords
speaker verification, speaker identification, speaker recognition, deep learning, speech recognition
NEC Technical Journal/Vol.13 No.2/Special Issue on Social Value Creation Using Biometrics 83
Core Technologies and Advanced Technologies to Support Biometrics
Fig. 1 Basic configuration of voice recognition system (one-to-one comparison): feature extraction followed by similarity calculation ("Same person?"). Models (λ, ν) to extract features and calculate similarity are determined by data through learning.

2.1 Technological components of voice recognition

Technically speaking, voice recognition is called speaker recognition or speaker verification. In many cases, it refers to a technology that uses one-to-one processing to compare two voices and determine whether they belong to the same person. Speaker identification, on the other hand, which seeks to identify an unknown individual by their voice, performs one-to-many processing. But even this ultimately boils down to multiple repetitions of one-to-one comparisons. Thus, the basic unit of processing is one-to-one processing, as shown in Fig. 1.

Today’s most popular framework for feature extraction is called i-vector1). Using a standard model of phonemes (various vowels and consonants) built from many speakers’ voices, the i-vector extracts the differences between the standard model and the input voice as a feature. However, if all the differences are extracted, the feature will be enormous, with potentially hundreds of thousands of dimensions. To avoid this problem, i-vector compresses this enormous feature to around 400 dimensions using factor analysis. To calculate similarity, a model called probabilistic linear discriminant analysis (PLDA)2) is often used. The PLDA is a probabilistic reformulation of linear discriminant analysis (LDA), a traditional machine learning method, and automatically selects the features best suited for identifying the speaker from the 400-dimension feature of the i-vector. Once the data has [...]

2.2 Incorporation of deep learning

[...] to this trend; research into deep learning got underway in 2014, and a paradigm shift is now taking place in this field, a shift that promises to bring voice recognition into the mainstream.

This shift is marked by the emergence of a system called deep speaker embedding, or x-vector, which substantially improves the accuracy of voice recognition. Researchers in the field have eagerly seized on x-vector as a new feature extractor with the potential to replace the conventional i-vector system3). Fig. 2 shows the concept of the x-vector system. First, a deep neural network (DNN) composed of a feature extractor and a discriminator is trained to correctly deduce speakers from their voices. The feature extractor of the DNN is designed so that it pulls from the voice only the information suitable for speaker identification.

[Fig. 2, panel 1) Training: speech data; speaker IDs (ground truth)]

Because speech is time-series data of variable length, the amount of data input to the neural network is also variable. This makes voices more difficult to handle than images. However, the x-vector can output a feature with a fixed number of dimensions by inserting a pooling layer, which aggregates the data along the time axis, at the end of the feature extractor.

Efforts to introduce deep learning into voice recognition do not stop at feature extraction; they range widely from the front end (speech/non-speech recognition and speech enhancement under noisy conditions) to the back end (similarity calculation). An end-to-end system has also emerged that trains the entire system at once by replacing all the technical components with a neural network4). This trend is likely to continue in the future.
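The pooling layer described in this section can be illustrated with a minimal sketch. It assumes the frame-level features are already available as plain lists of numbers, and it uses statistics pooling (per-dimension mean and standard deviation over time) as the aggregation; the paper itself only says that the layer aggregates data along the time axis, so both the aggregation choice and the function name `statistics_pooling` are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def statistics_pooling(frames: Sequence[Sequence[float]]) -> List[float]:
    """Collapse a (num_frames x dim) sequence into one vector of length 2 * dim.

    The output is the per-dimension mean followed by the per-dimension
    (population) standard deviation, so its size does not depend on how
    many frames the utterance contains.
    """
    if not frames:
        raise ValueError("need at least one frame")
    num_frames = len(frames)
    dim = len(frames[0])
    means = [sum(frame[d] for frame in frames) / num_frames for d in range(dim)]
    stds = [
        math.sqrt(sum((frame[d] - means[d]) ** 2 for frame in frames) / num_frames)
        for d in range(dim)
    ]
    return means + stds


# Two utterances of different length (2 frames vs. 4 frames), 2 dims per frame.
short_utterance = [[1.0, 2.0], [3.0, 4.0]]
long_utterance = [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0], [6.0, 1.0]]

short_embedding = statistics_pooling(short_utterance)
long_embedding = statistics_pooling(long_utterance)

# Both outputs have the same fixed length regardless of utterance duration.
print(len(short_embedding), len(long_embedding))
```

In a real x-vector network the frames would be the outputs of the frame-level DNN layers, and the pooled vector would be passed on to further layers before the speaker discriminator.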
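The relationship between one-to-one and one-to-many processing described in Section 2.1 can be sketched as follows. This is an illustration only: it assumes speaker features have already been extracted as fixed-length vectors, and it substitutes cosine similarity for the PLDA scoring mentioned in the text. The function names, the enrolled speakers, and the threshold of 0.7 are all hypothetical.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Similarity of two feature vectors (a stand-in for PLDA scoring)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("feature vectors must be non-zero")
    return dot / (norm_a * norm_b)


def verify(a: Sequence[float], b: Sequence[float], threshold: float = 0.7) -> bool:
    """One-to-one comparison: do the two voices belong to the same person?"""
    return cosine_similarity(a, b) >= threshold


def identify(
    probe: Sequence[float],
    enrolled: Dict[str, Sequence[float]],
    threshold: float = 0.7,
) -> Optional[str]:
    """One-to-many comparison, implemented as repeated one-to-one scoring.

    Returns the best-scoring enrolled speaker, or None if no score
    reaches the threshold.
    """
    best_name: Optional[str] = None
    best_score = threshold
    for name, feature in enrolled.items():
        score = cosine_similarity(probe, feature)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


# Hypothetical enrolled speakers with toy 3-dimensional features.
enrolled_speakers = {"alice": [1.0, 0.0, 0.0], "bob": [0.0, 1.0, 0.0]}

known_result = identify([0.9, 0.1, 0.0], enrolled_speakers)
unknown_result = identify([0.0, 0.0, 1.0], enrolled_speakers)
print(known_result, unknown_result)
```

The identification loop does nothing more than run the one-to-one scorer against every enrolled speaker, which is why the one-to-one comparison is the basic unit of processing.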
2.3 NEC’s Work in This Field Shows Promise

At NEC, we see voice recognition technology as one of the leading next-generation biometric modalities, closely following fingerprint and face recognition. Consequently, we have been working hard to develop this technology for practical use and have achieved results that have made us the world’s leader in this field.

We were the first to see the potential of deep learning and the first to start research in this promising area. That early research has paid off with the development of powerful unique technologies. These include a sophisticated filter that accurately detects voice activity by distinguishing speech from non-speech in noisy environments5), a noise reduction system that eliminates noise components from the features of noisy speech6), and a technology that restores to a short-duration utterance the same quantity of features as can be drawn from a long-duration utterance, which makes it easier to obtain information pertaining to individual characteristics7).

In addition, NEC has been actively participating in the Speaker Recognition Evaluation (SRE) series, a set of evaluations conducted by the U.S. National Institute of Standards and Technology (NIST)8). The SRE series is a competition in which teams from industry, academia, and government organizations around the world (more than 60 teams in SRE18) compete to test the speaker recognition accuracy of their systems using the same data set. We have repeatedly demonstrated our technological superiority in these competitions.

In SRE18, testing was conducted with two tasks: one to find a specific individual in telephone conversations marred by background noise and poor line conditions, and the other to find a specific individual among multiple individuals who appear in video segments on the Internet, such as on YouTube. Both tasks featured technically severe conditions with a high level of difficulty. In the telephone conversation task, for example, the accuracy of the baseline presented by NIST was only 88.8% (11.2% crossover error rate). This does not by any means suggest that the technical level of NIST’s baseline system was low. In fact, this baseline was a state-of-the-art system equipped with the above-mentioned x-vector feature extractor. Against this baseline, NEC’s system achieved an accuracy of 95.0% (5.0% crossover error rate), an error rate less than half that of the baseline.

When you are developing this kind of system, you have to push the quality and performance of every component to the limit. The remarkable improvement in accuracy that we were able to achieve was accomplished by developing an original feature extraction system that adds an auxiliary network called an attention mechanism to the x-vector. This new mechanism automatically selects those parts of the recording where individual voice characteristics are more prominent9). We also modified the deep learning process to enable effective learning without the massive amounts of training data usually required: we developed a new data augmentation method that converts limited voice data to multiply the apparent number of speakers several times.

3. Industrial Applications

Finally, let’s consider the potential benefits to society of voice recognition technology (Fig. 3).

Fig. 3 A wide range of scenarios where voice recognition can play a role. (Panels: Call center operation, E-commerce, Criminal investigation, Smart speaker, Robotics)

Fig. 4 Call center support: Quick confirmation of customer identification. (Dialogue shown: "Hello. I have a question about your product…" / "Thank you for calling.")

E-commerce: Signatures for small purchases with credit cards are rarely required any more. This lowers the barrier to purchasing for both buyers and sellers by streamlining and speeding up the payment process. Nowadays, convenience is just as important to consumers as security. Voice recognition meets both of these needs. The voice is a simple medium people use for everyday communication, so biometric authentication using voice provides users with a handy and easy means of individual identification. Voice recognition is an identification method ideal for individual identification in [...]
customers such as call centers. Some of the issues that have arisen include simplifying the individual identification procedure for important customers who make phone calls frequently (Fig. 4) and identifying problem customers, such as chronic complainers, at an early stage. Because voice recognition is the only biometric that can be used on the telephone, where participants are not visible to one another, it is ideal for call center operations: it makes it possible to identify customers in the course of a natural conversation.

Criminal investigation: Telephone-based fraud is a sophisticated and constantly evolving criminal enterprise, always adapting to the various measures taken to combat it. Voice recognition may prove helpful in investigating these crimes, providing an analytical tool to support the tracking of perpetrators. It can also support surveillance of organized crime on the telephone and the Internet. Voice analysis can also be used proactively to suppress crime, as it may be capable of picking up information pointing to potential criminal activity, more so even than the surveillance cameras that have become so commonplace on our streets in recent years.

Other: Voice-based individual identification is likely to be the biometric of choice for hearables such as smart speakers and smart earbuds as well as for user-friendly interfaces such as robots10).

4. Conclusion

References

798, May 2011.
2) Simon J. D. Prince et al., “Probabilistic Models for Inference about Identity,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.34, Jan. 2012.
3) David Snyder et al., “X-vectors: Robust DNN Embeddings for Speaker Recognition,” IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Apr. 2018.
4) Georg Heigold et al., “End-to-end Text-dependent Speaker Verification,” IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Mar. 2016.
5) Hitoshi Yamamoto et al., “Robust i-vector extraction tightly coupled with voice activity detection using deep neural networks,” Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Dec. 2017.
6) Shivangi Mahto et al., “I-vector Transformation Using Novel Discriminative Denoising Autoencoder for Noise-Robust Speaker Recognition,” INTERSPEECH, Aug. 2017.
7) Hitoshi Yamamoto et al., “Denoising Autoencoder-Based Speaker Feature Restoration for Utterances of Short Duration,” INTERSPEECH, Sep. 2015.
8) Speaker Recognition, National Institute of Standards and Technology (NIST), https://www.nist.gov/itl/iad/mig/speaker-recognition
9) Koji Okabe et al., “Attentive Statistics Pooling for Deep Speaker Embedding,” INTERSPEECH, Sep. 2018.
10) T. Arakawa, “Ear Acoustic Authentication Technology: Using Sound to Identify the Distinctive Shape of the Ear Canal,” NEC Technical Journal, Vol.13, No.2, Apr. 2019.