Safety, Security, and Convenience: The Benefits of Voice Recognition Technology

Special Issue on Social Value Creation Using Biometrics Core Technologies and Advanced Technologies to Support Biometrics
KOSHINAKA Takafumi, LEE Kong Aik
Abstract
We use our voices all the time to communicate with one another. Talking is the simplest and easiest means of information transmission. It could also be one of the simplest means of identification. Recent advances in computing technology have made voice recognition, a biometric technology based on the characteristics unique to an individual’s voice, more convenient, safer, and more secure than ever. In this paper, we review the current state of voice recognition technology and show how deep learning, the core of contemporary AI technology, is providing the key to unlock the power of biometrics. We will also look in some detail at NEC’s work in the field of voice recognition technology, which is at the forefront of worldwide efforts to make this technology accessible and reliable. Finally, we discuss potential industrial applications for voice recognition technology such as public safety solutions.
Keywords
speaker verification, speaker identification, speaker recognition, deep learning, speech recognition

1. Introduction

Communication is an integral part of our daily lives, and we use many different means to communicate with one another. However, none is more important than the human voice. Speaking and listening are fundamental, the basis for all other forms of communication. To speak and listen, you don’t need an electronic device; you don’t even need paper and pen. All you need is your voice. No other means of communication is simpler or easier.

The voice, which is the medium for spoken communication, is a type of human biometric. Because each person’s voice has characteristics that are unique to that voice alone, the voice can be used for biometric identification. Voice recognition offers users an easy and simple means of individual identification. Moreover, this type of authentication requires no special equipment; conventional microphones and telephones can be used, and no expensive, special sensor device is required. Setting up a voice recognition system is easy and relatively inexpensive.

What is voice recognition technology, how does it work, and what is its connection to deep learning, one of today’s core AI technologies? These are some of the questions we will try to answer in this paper. We will also take a look at the work NEC has been doing in this area, examine the world-class voice recognition technology the company has developed, and consider potential industrial applications for voice recognition technology such as public safety solutions.

2. Voice Recognition Technology

Everyone is used to guessing whose voice they are listening to even when they cannot see the speaker. As long as you know the person, you’re pretty likely to guess correctly. The unique characteristics of each individual’s voice are dictated by various physical features such as the shapes of the vocal cords and oral cavity (physical characteristics), as well as by speech habits particular to each of us (behavioral characteristics). Voice recognition technology identifies the speaker by extracting and analyzing the features that relate to these individual physical and behavioral characteristics.
NEC Technical Journal／Vol.13 No.2／Special Issue on Social Value Creation Using Biometrics 83

2.1 Technological components of voice recognition

Technically speaking, voice recognition is called speaker recognition or speaker verification. In many cases, it refers to a technology that uses one-to-one processing to compare two voices and determine whether they belong to the same person. Speaker identification, on the other hand, which seeks to identify an unknown individual by their voice, performs one-to-many processing. But even this ultimately boils down to multiple repetitions of one-to-one comparisons. Thus, the basic unit of processing is one-to-one processing, as shown in Fig. 1.

[Fig. 1 diagram: Feature extraction, Similarity calculation, decision (Same person?)]
Fig. 1 Basic configuration of voice recognition system (one-to-one comparison): Models (λ, ν) to extract features and calculate similarity are determined by data through learning.

The most popular framework for feature extraction today is i-vector 1). Using a standard model of phonemes (various vowels and consonants) built from many speakers’ voices, the i-vector extracts the differences between the standard model and the input voice as a feature. However, if all the differences are extracted, the feature will be enormous, with potentially hundreds of thousands of dimensions. To avoid this problem, i-vector compresses this enormous feature to around 400 dimensions using factor analysis. To calculate similarity, a model called probabilistic linear discriminant analysis (PLDA) 2) is often used. PLDA is a probabilistic reformulation of linear discriminant analysis (LDA), a traditional machine learning method, and automatically selects from the 400-dimension i-vector the features best suited for identifying the speaker. Once the data has been analyzed, the similarity is calculated as a likelihood ratio.

Both i-vector and PLDA are formulated using probabilistic models based on the assumption of a Gaussian (normal) distribution. The de facto standards for voice recognition, i-vector and PLDA incorporate various machine learning techniques. Capable of automated learning, they can generate optimal model parameters from a large amount of data.

2.2 Incorporation of deep learning

Recently, researchers in the fields of image and speech recognition have sought to improve accuracy by applying deep learning. Voice recognition is no exception to this trend; research into deep learning got underway in 2014, and a paradigm shift is now taking place in this field, a shift that promises to bring voice recognition into the mainstream.

This shift is marked by the emergence of a system called deep speaker embedding, or x-vector, which substantially increases the accuracy of voice recognition. Researchers in the field have eagerly seized on x-vector as a new feature extractor with the potential to replace the conventional i-vector system 3). Fig. 2 shows the concept of the x-vector system. First, a deep neural network (DNN) composed of a feature extractor and a discriminator is trained to correctly deduce speakers from their voices. The feature extractor of the DNN is designed so that it pulls from the voice only the information suitable for speaker identification.

[Fig. 2 panels: 1) Training: speech data with speaker IDs as ground truth (speakers A, B, C) passes through the feature extractor and the discriminator; 2) Feature extraction: the feature extractor alone outputs the feature X]
Fig. 2 Concept for feature extraction based on deep learning (x-vector).

Because speech is time-series data of variable length, the amount of data input to the neural network is also variable. This makes voices more difficult to handle than images. With the x-vector, however, it is possible to output a feature with a fixed number of dimensions by inserting a pooling layer, which aggregates the data along the time axis, at the end of the feature extractor.

Efforts to introduce deep learning to voice recognition do not stop at feature extraction; they range widely from the front end (speech/non-speech discrimination and speech enhancement under noisy conditions) to the back end (similarity calculation). An end-to-end system has also emerged that trains the entire system at once by replacing all the technical components with a neural network 4). This trend is likely to continue in the future.
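
The role of the pooling layer in Section 2.2 can be illustrated with a minimal sketch. The paper does not say which statistics the layer computes; the version below assumes one common choice, the per-dimension mean and standard deviation over time. The function name, the toy inputs, and the optional `weights` argument (a hypothetical hook for attention-style frame weighting of the kind mentioned in Section 2.3) are illustrative assumptions, not NEC’s implementation.

```python
import math


def statistics_pooling(frames, weights=None):
    """Pool a variable-length list of frame vectors into one fixed-size vector.

    frames: list of equal-length lists (one feature vector per time frame).
    weights: optional non-negative per-frame weights (a hypothetical hook for
             attention-style weighting); uniform weighting is used if omitted.
    Returns per-dimension weighted means followed by standard deviations,
    so the output length is 2 * dim no matter how many frames come in.
    """
    num_frames = len(frames)
    dim = len(frames[0])
    if weights is None:
        weights = [1.0] * num_frames
    total = float(sum(weights))
    norm = [w / total for w in weights]  # weights now sum to 1

    means = [sum(w * f[d] for w, f in zip(norm, frames)) for d in range(dim)]
    stds = [
        math.sqrt(sum(w * (f[d] - means[d]) ** 2 for w, f in zip(norm, frames)))
        for d in range(dim)
    ]
    return means + stds


# Two toy "utterances" of different duration but the same frame dimension.
short_utterance = [[0.1 * t, 1.0] for t in range(30)]    # 30 frames, 2-dim
long_utterance = [[0.1 * t, 1.0] for t in range(500)]    # 500 frames, 2-dim

print(len(statistics_pooling(short_utterance)))  # 4
print(len(statistics_pooling(long_utterance)))   # 4
```

Both utterances yield a vector of the same length, which is what allows the later layers of the network, and the similarity calculation after them, to work with a fixed input size.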
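
The one-to-one comparison of Fig. 1, the one-to-many search built from it, and the crossover (equal) error rate that is commonly used to report accuracy for such systems can be sketched as follows. This is a simplified illustration under stated assumptions: cosine similarity stands in for the PLDA likelihood ratio, and the thresholds, speaker names, feature vectors, and scores are all hypothetical.

```python
import math


def cosine_score(a, b):
    """Similarity between two fixed-size voice features (higher = more alike).

    Assumption: cosine similarity is used here in place of a PLDA likelihood ratio.
    """
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def verify(enrolled, probe, threshold):
    """One-to-one comparison: accept if the score reaches the threshold."""
    return cosine_score(enrolled, probe) >= threshold


def identify(gallery, probe, threshold):
    """One-to-many search carried out as repeated one-to-one comparisons.

    gallery: dict mapping a speaker name to that speaker's enrolled feature.
    Returns the best-scoring name, or None if no score reaches the threshold.
    """
    best_name, best_score = None, threshold
    for name, enrolled in gallery.items():
        score = cosine_score(enrolled, probe)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the crossover error rate by sweeping the decision threshold.

    At each observed score, compute the miss rate (same-speaker trials rejected)
    and the false-alarm rate (different-speaker trials accepted), and return
    their average at the threshold where the two rates are closest.
    """
    best_gap, eer = None, None
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        miss = sum(s < threshold for s in target_scores) / len(target_scores)
        false_alarm = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(miss - false_alarm)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (miss + false_alarm) / 2
    return eer


# Hypothetical enrolled features and an unknown probe.
gallery = {"alice": [0.9, 0.1, 0.2], "bob": [0.1, 0.8, 0.3]}
probe = [0.2, 0.7, 0.4]

print(verify(gallery["bob"], probe, threshold=0.8))   # True
print(identify(gallery, probe, threshold=0.8))        # bob

# Hypothetical trial scores: same-speaker pairs vs. different-speaker pairs.
print(equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1]))  # 0.25
```

The accuracy figures quoted in Section 2.3 appear to be 100% minus this rate (for example, 88.8% accuracy alongside an 11.2% crossover error rate).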

2.3 NEC’s Work in This Field Shows Promise

At NEC, we see voice recognition technology as one of the leading next-generation biometric modalities, closely following fingerprint and face recognition. Consequently, we have been working hard to develop this technology for practical use and have achieved results that have made us the world’s leader in this field.

We were the first to see the potential of deep learning and the first to start research in this promising area. That early research has paid off with the development of powerful, unique technologies. These include a sophisticated filter that accurately detects voice activity by distinguishing speech from non-speech in noisy environments 5), a noise reduction system that eliminates noise components from the features of noisy speech 6), and a technology that restores to a short-duration utterance the same quantity of features as can be drawn from a long-duration utterance, which makes it easier to obtain information pertaining to individual characteristics 7).

In addition, NEC has been actively participating in the Speaker Recognition Evaluation (SRE) series, evaluations conducted by the U.S. National Institute of Standards and Technology (NIST) 8). The SRE series is a competition in which teams from industry, academia, and government organizations around the world (more than 60 teams in SRE18) test the speaker recognition accuracy of their systems on the same data set. We have repeatedly demonstrated our technological superiority in these competitions.

In SRE18, testing was conducted with two tasks: one to find a specific individual in telephone conversations marred by background noise and poor line conditions, and the other to find a specific individual among multiple individuals who appear in video segments on the Internet, such as YouTube videos. Both tasks featured technically severe conditions with a high level of difficulty. In the telephone conversation task, for example, the accuracy of the baseline presented by NIST was only 88.8% (11.2% crossover error rate). This does not by any means suggest that the technical level of NIST’s baseline system was low. In fact, this baseline was a state-of-the-art system equipped with the above-mentioned x-vector feature extractor. Against this baseline, NEC’s system achieved an accuracy of 95.0% (5.0% crossover error rate), an error rate less than half that of the baseline.

When you are developing this kind of system, you have to push the quality and performance of every component to the limit. The remarkable improvement in accuracy that we were able to achieve was accomplished by developing an original feature extraction system that adds an auxiliary network called an attention mechanism to the x-vector. This new mechanism automatically selects those parts of the recording where individual voice characteristics are more prominent 9). We modified the deep learning process as well to enable effective learning without the massive amounts of training data usually required. Instead, we developed a new method for augmenting data by converting limited voice data to multiply the apparent number of speakers several times.

3. Industrial Applications

Finally, let’s consider the potential benefits to society of voice recognition technology (Fig. 3).

[Fig. 3 panels: Call center operation, E-commerce, Criminal investigation, Smart speaker, Robotics]
Fig. 3 A wide range of scenarios where voice recognition can play a role.

[Fig. 4 illustration: a caller says “Hello. I have a question about your product…” and the operator replies “Thank you for calling.”]
Fig. 4 Call center support: Quick confirmation of customer identification.

E-commerce: Signatures for small purchases with credit cards are rarely required any more. This lowers the barrier to purchasing for both buyers and sellers by streamlining and speeding up the payment process. Nowadays, convenience is just as important to consumers as security. Using voice recognition meets both these needs. The voice is a simple medium people use for everyday communication, so biometric authentication using voice provides users with a handy and easy means of individual identification. Voice recognition is an identification method ideal for individual identification in
commercial transactions such as e-commerce and Internet banking.

Call center operations: As more companies take customer-oriented approaches, they are continually striving to improve their services at contact points with customers such as call centers. Some of the issues that have arisen include simplifying the individual identification procedure for important customers who call frequently (Fig. 4) and identifying problem customers such as chronic complainers early. Because voice recognition is the only biometric that can be used on the telephone, where participants are not visible to one another, it is ideal for call center operations, as it makes it possible to identify customers in the course of a natural conversation.

Criminal investigation: Telephone-based fraud is a sophisticated and constantly evolving criminal enterprise, always adapting to the various measures taken to combat it. Voice recognition may prove helpful in investigating these crimes, providing an analytical tool to support the tracking of perpetrators. It can also support surveillance of organized crime over the telephone and the Internet. Voice analysis can also be used proactively to suppress crime, as it may be capable of picking up information pointing to potential criminal activity, more so even than the surveillance cameras that have become so commonplace on our streets in recent years.

Other: Voice-based individual identification is likely to be the biometric of choice for hearables such as smart speakers and smart earbuds as well as for user-friendly interfaces such as robots 10).

4. Conclusion

Voice recognition is clearly one of the easiest biometrics to implement and use. Now, thanks to the incorporation of deep learning in voice recognition systems, this technology is much more reliable and secure. NEC has established itself as a world leader in this field with superior technology that is setting the standard for accuracy and performance. Ideally suited for a broad range of applications such as e-commerce, call centers, and criminal investigation, voice recognition offers user-friendly convenience and high accuracy. NEC is committed to bringing the benefits of this technology to society and to enhancing and refining that technology.

References
1) Najim Dehak et al., “Front-End Factor Analysis for Speaker Verification,” IEEE Transactions on Audio, Speech, and Language Processing, Vol.19, pp.788-798, May 2011.
2) Simon J. D. Prince et al., “Probabilistic Models for Inference about Identity,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.34, Jan. 2012.
3) David Snyder et al., “X-vectors: Robust DNN Embeddings for Speaker Recognition,” IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Apr. 2018.
4) Georg Heigold et al., “End-to-end Text-dependent Speaker Verification,” IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Mar. 2016.
5) Hitoshi Yamamoto et al., “Robust i-vector extraction tightly coupled with voice activity detection using deep neural networks,” Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Dec. 2017.
6) Shivangi Mahto et al., “I-vector Transformation Using Novel Discriminative Denoising Autoencoder for Noise-Robust Speaker Recognition,” INTERSPEECH, Aug. 2017.
7) Hitoshi Yamamoto et al., “Denoising Autoencoder-Based Speaker Feature Restoration for Utterances of Short Duration,” INTERSPEECH, Sep. 2015.
8) Speaker Recognition, National Institute of Standards and Technology (NIST), https://www.nist.gov/itl/iad/mig/speaker-recognition
9) Koji Okabe et al., “Attentive Statistics Pooling for Deep Speaker Embedding,” INTERSPEECH, Sep. 2018.
10) T. Arakawa, “Ear Acoustic Authentication Technology: Using Sound to Identify the Distinctive Shape of the Ear Canal,” NEC Technical Journal, Vol.13, No.2, Apr. 2019.

Authors’ Profiles
KOSHINAKA Takafumi
Ph.D.
Senior Principal Researcher
Biometrics Research Laboratories

LEE Kong Aik
Ph.D.
Senior Principal Researcher
Biometrics Research Laboratories

* YouTube is a trademark or registered trademark of Google LLC.
* All other company names and product names that appear in
this paper are trademarks or registered trademarks of their
respective companies.