Speech Technologies For Data Mining Voice Analytics and Voice Biometry Slides


SPEECH DATA MINING, SPEECH ANALYTICS, VOICE BIOMETRY
www.phonexia.com
OVERVIEW
How to move speech technology from research labs to the market?
What are the current challenges in speech recognition research?

Phonexia introduction
Technology deployment use cases
Technologies and what is behind
Speech core and application interfaces
Grand challenges

WHAT IS IN SPEECH?
Speaker:
Gender, age
Speaker identity
Emotion, speaker origin
Education, relation
When the speaker speaks

Content:
Language, dialect
Keywords, phrases
Speech transcription
Topic
Data mining

Environment:
Where the speaker speaks
To whom the speaker speaks (dialog, reading, public talk)
Other sounds (music, vehicles, animals)

Equipment:
Device (phone/mike/...)
Transmit channels (landline/cell phone/Skype)
Codecs (gsm/mp3/...)
Speech quality

PHONEXIA
Goal: help clients automatically extract the maximum of valuable information from speech.
Founded in 2006 as a spin-off of Brno University of Technology
Seat and main office in Brno, Czech Republic, active worldwide
Customers in more than 20 countries (governmental agencies, call centers, banks, telco operators, broadcast service companies, ...)
Profitable, no external funding

FROM RESEARCH TO MARKET
Research:
Scientific papers, reports, experimental code (Matlab, Python, C++), lots of glue (shell scripts), data files
The goal is accuracy
Stability, speed, reproducibility and documentation are less important
Openness

Technologies:
The goal is stability (error handling, code verification, testing cycles at various levels) and speed
Regular development cycles and planning
Well-defined application interfaces (API)
Documentation, licensing

Products:
Development of new applications
Integration with clients' technologies and systems
The goal is functionality of the integrated solution
User interfaces

USE CASES

Call centers
Banks
Intelligence agencies

CALL CENTERS

Two main application areas:

1) Quality control
2) Data mining from voice traffic

CALL CENTERS QUALITY CONTROL
Supervisor is responsible for:
Team leading
Rating of calls
Evaluation of operators
Analysis of results
Reporting

By listening, only 3% of calls are analyzed
Using speech technologies, 100% of calls are analyzed and new statistics become available
Lower staff costs, lower operating costs
Higher customer satisfaction

CALL CENTERS QUALITY CONTROL II
Technologies:
VAD + discourse analysis
- To get important statistics about call progress (start time, speaker turns, speech speed, reaction times, ...)
Diarization
- Separation of a summed conversation into two channels
Keyword/phrase detector
- Detection of obligatory phrases, rude words, call script compliance
Speech transcription + search
- To search for important places in calls
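
The call-progress statistics listed for VAD + discourse analysis can be derived from time-stamped speaker segments once VAD and diarization have run. A minimal sketch, assuming segments arrive as (speaker, start, end) tuples in seconds; the function name, the tuple layout and the example values are illustrative assumptions, not Phonexia's API:

```python
from collections import defaultdict

def call_statistics(segments):
    """Compute simple call-progress statistics.

    segments: list of (speaker, start_sec, end_sec) tuples (hypothetical format).
    """
    segments = sorted(segments, key=lambda s: s[1])
    talk_time = defaultdict(float)
    turns = 0
    reaction_times = []
    prev_speaker, prev_end = None, None
    for speaker, start, end in segments:
        talk_time[speaker] += end - start
        if prev_speaker is not None and speaker != prev_speaker:
            turns += 1
            # Gap between the end of one speaker and the start of the other;
            # overlapping speech gives a negative gap, clipped to 0 here.
            reaction_times.append(max(0.0, start - prev_end))
        prev_speaker, prev_end = speaker, end
    return {
        "start_time": segments[0][1] if segments else None,
        "speaker_turns": turns,
        "talk_time": dict(talk_time),
        "mean_reaction_time": (sum(reaction_times) / len(reaction_times)
                               if reaction_times else None),
    }

# Made-up example call: operator, customer, operator.
example = [("operator", 0.5, 4.0), ("customer", 4.6, 9.0), ("operator", 9.2, 12.0)]
stats = call_statistics(example)
```

Speech speed would additionally need word or phoneme counts from a recognizer, so it is left out of this sketch.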

DATA MINING FROM INCOMING CALLS
Use cases:
Preventing call center overload (for example during a large power outage)
Added value information for business (big data)

Technology:
Speech transcription
Data mining tool
Search engine

BANKS

Use cases:
Banks have call centers
- Quality control
- Data mining from incoming traffic
Authentication of people using voice biometry
- Using a key phrase (text dependent speaker identification)
- Authentication in the background (text independent speaker identification)
Identification of fraud
- People with fake identities call repeatedly to request loans

INTELLIGENCE AGENCIES
Huge amount of information that cannot be processed manually
- public news, telecommunication networks, air communication, internet ...
Search for a needle in a haystack
Combination of all technologies
- language identification, gender identification, speaker identification, diarization, keyword spotting, speech transcription
- data mining tools
- correlation with other metadata
Operational and forensic speaker identification

TECHNOLOGIES

Voice activity detection


Language identification
Gender recognition
Speaker identification
Diarization
Keyword spotting
Speech transcription
Dialog analysis
Emotion recognition

VOICE ACTIVITY DETECTION

Stages (accuracy increases and speed decreases from left to right):
energy based VAD -> technical signal removal -> VAD based on f0 tracking -> neural network VAD

Energy based VAD: fast removal of low energy parts
Technical signal removal and noise filtering: removal of tones, removal of flat spectra signal, removal of stationary signals, filtering of pulse noise
VAD based on f0 tracking: removal of other non-speech signals
Neural network VAD: very accurate VAD based on phoneme recognition
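
The first and fastest stage, energy based VAD, can be sketched in a few lines. This is an illustration under stated assumptions, not the production algorithm: the frame length, the threshold relative to the loudest frame, and the function name are all made up for the example.

```python
import math

def energy_vad(samples, sample_rate=8000, frame_ms=25, threshold_db=-35.0):
    """Label each frame True (kept) or False (low energy, removed).

    threshold_db is relative to the loudest frame; all defaults are
    illustrative assumptions, not values from the slides.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    energies_db = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        # Small constant avoids log10(0) on digital silence.
        energies_db.append(10.0 * math.log10(energy + 1e-12))
    if not energies_db:
        return []
    peak = max(energies_db)
    return [e - peak > threshold_db for e in energies_db]

# Toy signal: 0.1 s of near silence followed by 0.1 s of a 200 Hz tone.
sr = 8000
silence = [0.0001] * 800
tone = [0.5 * math.sin(2 * math.pi * 200 * n / sr) for n in range(800)]
decisions = energy_vad(silence + tone, sample_rate=sr)
```

A steady tone passes this stage because it has high energy, which is why the later stages (technical signal removal, f0 tracking, neural network VAD) exist.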

VAD CHALLENGES

Important area of research, not fully solved; VAD is a key part of other technologies and directly affects their accuracy
Music/singing detector
Detectors of non-speech speaker sounds (cough, laugh)
Detectors of other environment sounds (transport vehicles, animals, electric tools, door slam)
Technical signal detectors
VAD for very noisy speech (SNR lower than 0 dB)
VADs for distorted channels
Non-parametric VADs
Distant mike VADs

LANGUAGE IDENTIFICATION
Automatic recognition of the language spoken.
50 languages + user can add new ones themselves
Can also be used for dialect recognition
iVector based technology, discriminative training, < 1kB language prints
Acoustic channel independent
Usage:
Crime is very often caused by small groups speaking specific languages
Call record forwarding (to operator / other technologies / archive ...)
Analysis of the audio archive
Insertion of advertisement to media
Language verification in broadcast signal distribution

LID SYSTEM ARCHITECTURE (IVECTOR BASED SYSTEM)
Processing chain:
feature extraction -> collection of UBM statistics -> projection to iVectors -> language classifier (MLR) -> score calib. / transform -> language scores
Parameters: UBM parameters, projection parameters, language parameters, calibration parameters
Part of the chain is prepared by Phonexia, part is fully trainable by the client

Language prints (iVectors) can be easily transferred over low capacity links
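
The language classifier stage can be illustrated with a small scoring function. The sketch below reads "MLR" as multiclass logistic regression, which is an assumption about the abbreviation, and uses made-up weights and a toy 3-dimensional language print. Real iVectors have hundreds of dimensions, and the parameters come from training.

```python
import math

def mlr_language_scores(ivector, weights, biases):
    """Multiclass logistic regression: one linear score per language, then softmax.

    weights: {language: list of floats}, biases: {language: float}.
    Returns {language: posterior probability}.
    """
    logits = {
        lang: sum(w * x for w, x in zip(weights[lang], ivector)) + biases[lang]
        for lang in weights
    }
    # Subtract the maximum logit for numerical stability before exponentiating.
    top = max(logits.values())
    exps = {lang: math.exp(v - top) for lang, v in logits.items()}
    total = sum(exps.values())
    return {lang: e / total for lang, e in exps.items()}

# Toy language print and hypothetical parameters for two languages.
ivec = [0.2, -1.0, 0.5]
W = {"cs": [1.0, -2.0, 0.0], "en": [-1.0, 0.5, 1.0]}
b = {"cs": 0.0, "en": 0.1}
scores = mlr_language_scores(ivec, W, b)
best = max(scores, key=scores.get)
```

Because the classifier only needs the compact language print, adding a language amounts to training one more weight vector on prints that can be collected on the client side.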

SPEAKER RECOGNITION
Several scenarios: speaker verification, speaker search, speaker spotting, link/pattern analysis
Text independent or text dependent mode
iVectors based technology, < 1kB voiceprints
Voiceprint extraction and scoring
Millions of comparisons in a fraction of a second
Diarization (speaker segmentation)
User-based system training, user-based calibration

SYSTEM ARCHITECTURE - VOICEPRINT EXTRACTION
Processing chain:
feature extraction -> collection of UBM statistics -> projection to iVectors -> extraction of spk. info (LDA) -> user normalization -> voiceprint
Parameters: UBM parameters, projection parameters (iVector), projection parameters (LDA), normalization parameters
Part of the chain is prepared by Phonexia, part is trainable by the user

iVector describes total variability inside a speech record
LDA removes non-speaker variability
User normalization helps the user to normalize to unseen channels (mean subtraction)

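The user normalization step is described as mean subtraction, which can be sketched directly. The function names and the 2-dimensional toy vectors are assumptions for illustration; in practice the vectors would be the LDA-projected iVectors from the user's own channel.

```python
def train_user_normalization(ivectors):
    """Estimate the mean vector of the user's own (target-channel) data."""
    dim = len(ivectors[0])
    n = len(ivectors)
    return [sum(v[d] for v in ivectors) / n for d in range(dim)]

def apply_user_normalization(ivector, mean):
    """Subtract the user-side mean to compensate for an unseen channel."""
    return [x - m for x, m in zip(ivector, mean)]

# Toy data standing in for vectors extracted from the user's recordings.
user_data = [[1.0, 2.0], [3.0, 4.0], [2.0, 0.0]]
mean = train_user_normalization(user_data)
normalized = apply_user_normalization([2.0, 2.0], mean)
```

The mean is estimated once from unlabeled recordings on the user side, so no speaker labels are needed for this step.
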
SYSTEM ARCHITECTURE - VOICEPRINT COMPARISON
Processing chain:
voiceprint 1, voiceprint 2 -> voiceprint comparator (WCCN+PLDA) -> length dependent calibration (piecewise LR) -> score transform (logistic function) -> score
Parameters: model parameters (comparator), calibration parameters (calibration, trainable by user)

Voiceprint comparator returns a log likelihood
Calibration ensures probabilistic interpretation of the score under different speech lengths
Score transform lets the user select a log likelihood ratio or a percentage score
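
The score transform from a calibrated log likelihood ratio to a percentage can be written with the logistic function. A minimal sketch, assuming a natural-log LLR and a configurable target prior (the prior parameter and the function name are assumptions; the slides only name the logistic function):

```python
import math

def llr_to_percentage(llr, prior=0.5):
    """Map a calibrated log likelihood ratio to a 0-100 percentage score."""
    log_prior_odds = math.log(prior / (1.0 - prior))
    z = llr + log_prior_odds
    # Two algebraically equal forms, chosen so exp() never overflows.
    if z >= 0:
        posterior = 1.0 / (1.0 + math.exp(-z))
    else:
        e = math.exp(z)
        posterior = e / (1.0 + e)
    return 100.0 * posterior

neutral = llr_to_percentage(0.0)    # no evidence either way
strong = llr_to_percentage(5.0)     # strong support for "same speaker"
weak = llr_to_percentage(-5.0)      # strong support for "different speakers"
```

With a prior of 0.5 the percentage is simply the logistic function of the LLR, so an LLR of 0 maps to 50%.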

LID AND SID CHALLENGES
LID/SID on very short records (< 3s) while keeping training at user side
How to ensure accuracy over a large number of acoustic channels and languages (SID)
Graphical tools for system training/calibration and evaluation at user side
LID/SID on Voice over IP networks
LID/SID from distant mikes

SPEAKER DIARIZATION

Processing chain:
random alignment of frames to speakers -> collection of GMM stats for each speaker -> estimation of spk. factors for each speaker -> conversion of spk. factors to spk. GMM, new align. -> generating of speaker labels
Output: spk1, spk2, spk1

Fully Bayesian approach with eigenvoice priors (Valente, Kenny)
Initial number of speakers is higher than the expected number of speakers
A duration model is used to prevent fast jumps among speakers
Target number of speakers can be chosen based on minimal speaker posterior probability, or pre-set by the user
Very accurate but slower and more memory consuming

DIARIZATION CHALLENGES
Diarization is a technology that still needs a lot of research
Very sensitive to initialization
Very sensitive to non-speech sounds (laugh, cough, environment sounds),
integration of a good VAD is necessary
Very sensitive to changes in transmit channels and language
Even with DER close to 1% there are recordings where current algorithms fail
completely (often two women speaking with high pitch)
Uses iterative approach, one iteration is equivalent to one run of SID, users
expect much faster run than SID
Diarization from distant mikes
Beam-forming and diarization from microphone arrays
How to accurately estimate number of speakers

KEYWORD SPOTTING

Two approaches:

KWS based on LVCSR:
Very accurate
Slower
Expensive for development
Speech transcription based on state-of-the-art acoustic and language models
Posteriors from confusion network used as confidences
About 100h of training data

Acoustic KWS:
Fast
Less accurate
Cheap development
Simple neural network based acoustic model
Simple language model (phone loop as background)
About 20h of training data
Phoneme-based calibration

SPEECH TRANSCRIPTION
PLP + bottle-neck features, HLDA
fast VTLN estimated using a set of GMMs
GMM or NN based system
Discriminative training
Speaker adaptation
3-gram language model
strings / lattices / confusion networks
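
To make the 3-gram language model item concrete, here is a toy trigram model estimated by counting. It is a sketch only: add-one smoothing is swapped in for brevity and is not claimed to be what the production recognizer uses, and the function names and the two-sentence corpus are made up.

```python
from collections import Counter

def train_trigram(sentences):
    """Count trigrams and their two-word histories over tokenized sentences."""
    tri, hist = Counter(), Counter()
    vocab = set()
    for words in sentences:
        padded = ["<s>", "<s>"] + list(words) + ["</s>"]
        vocab.update(padded)
        for i in range(2, len(padded)):
            hist[(padded[i - 2], padded[i - 1])] += 1
            tri[(padded[i - 2], padded[i - 1], padded[i])] += 1
    return tri, hist, len(vocab)

def trigram_prob(w1, w2, w3, tri, hist, vocab_size):
    """P(w3 | w1, w2) with add-one smoothing (illustrative choice)."""
    return (tri[(w1, w2, w3)] + 1) / (hist[(w1, w2)] + vocab_size)

corpus = [["please", "hold", "the", "line"], ["please", "hold", "on"]]
tri, hist, vocab_size = train_trigram(corpus)
p_seen = trigram_prob("please", "hold", "the", tri, hist, vocab_size)
p_unseen = trigram_prob("please", "hold", "line", tri, hist, vocab_size)
```

Smoothing keeps unseen word sequences at a non-zero probability, so the decoder can still hypothesize them when the acoustic evidence is strong.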

SPEECH TRANSCRIPTION CHALLENGES
Rather engineering than research challenges:
Accuracy
Speed
Lower memory consumption
How to train a new system fully automatically
How to run hundreds of recognizers in parallel
How to do channel normalization and speaker adaptation for any length of speech utterance

CONNECTION TO TEXT BASED DATA MINING TOOLS
It is much easier to sell speech transcription with a higher-level data mining tool
- There is too much text to read
- The text has too many errors (users will never be happy unless the text is 100% correct)
This can be overcome by integration with existing text-based data mining tools:
- Categorization of recordings
- Indexing and search using complex queries
- Exploration of new topics
- Content analysis
- Reaction to trends
Integration is done on confusion networks (alternative hypotheses, increased probability of finding specific information)

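Why searching confusion networks raises the chance of finding information can be shown with a toy example. The sketch assumes a confusion network is a list of slots, each mapping competing word hypotheses to posteriors; this data layout, the function names and the threshold are illustrative assumptions.

```python
def best_path(network):
    """1-best transcript: the most probable word in every slot."""
    return [max(slot, key=slot.get) for slot in network]

def search_confusion_network(network, query, min_posterior=0.1):
    """Return (slot_index, posterior) for every slot where `query` appears."""
    hits = []
    for index, slot in enumerate(network):
        posterior = slot.get(query, 0.0)
        if posterior >= min_posterior:
            hits.append((index, posterior))
    return hits

# Toy network: the recognizer slightly prefers "lawn" over "loan".
network = [
    {"my": 0.9, "by": 0.1},
    {"loan": 0.4, "lawn": 0.6},
    {"payment": 0.8, "patent": 0.2},
]
one_best = best_path(network)
hits = search_confusion_network(network, "loan")
```

A text search over the 1-best transcript misses "loan" here, while the search over alternative hypotheses finds it together with a confidence that a data mining tool can use for ranking.
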
TOVEK TOOLS CONTENT ANALYSIS

HOW TO MOVE OUR SOFTWARE TO USERS?
Decision in 2007 to write a new speech core: Brno Speech Core
Focus on stability, speed and proper error handling
Object oriented design, proper interfaces, no dependency among modules except through regular interfaces
One C++ compiler, binary compatibility of libraries with others
One code base for all technologies

BRNO SPEECH CORE
More than 250 objects covering a large range of speech algorithms (feature extraction, acoustic models, decoders, transforms, grammar compilers, ...)
More than a million lines of source code
Code versioning, automatic builds, test suites, licensing
Still easily maintainable:
- extension of functionality inside objects
- splitting of functionality into more objects
- replacement of objects (fixed interfaces)

FAST PROTOTYPING
Research to product transfer time is essential for commercial success
Research done using standard toolkits: STK, TNET, KALDI, Python scripts, ...

For production systems:
New system can be implemented in a few days
Often not a single line of C/C++ code is written
Only objects for new algorithms are implemented
Objects are connected through one configuration file to form data streams

BSCORE CONFIG
[source:SFileWaveformSourceI]

[fconvertor:SWaveformFormatConvertorI]
input_format_str=lin16
output_format_str=float
nchannels=1
...

[melbanks:SMelBanksI]
sample_freq=8000
vector_size=200
preem_coef=0.97
nbanks=15

[posteriors:SNNetPosteriorEstimatorI]

[decoder:SPhnDecoderI]

[output:STranscriptionNodeI]

[links]
source->fconvertor
fconvertor->melbanks
melbanks->posteriors
posteriors->decoder
decoder->output
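
How such a configuration describes a data stream can be illustrated by reading it with a generic INI parser and following the links. This is a sketch using Python's standard configparser, not the actual BSCore loader, whose parsing rules are not given in the slides. The embedded config is a shortened copy of the slide and the helper names are assumptions.

```python
import configparser

CONFIG_TEXT = """
[source:SFileWaveformSourceI]

[fconvertor:SWaveformFormatConvertorI]
input_format_str=lin16
output_format_str=float

[melbanks:SMelBanksI]
sample_freq=8000
nbanks=15

[posteriors:SNNetPosteriorEstimatorI]

[decoder:SPhnDecoderI]

[output:STranscriptionNodeI]

[links]
source->fconvertor
fconvertor->melbanks
melbanks->posteriors
posteriors->decoder
decoder->output
"""

def parse_stream_config(text):
    """Return ({object_name: (class_name, params)}, [(src, dst), ...])."""
    # Link lines have no "=", so keys without values must be allowed.
    parser = configparser.ConfigParser(allow_no_value=True, delimiters=("=",))
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(text)
    objects = {}
    for section in parser.sections():
        if section == "links":
            continue
        name, class_name = section.split(":", 1)
        objects[name] = (class_name, dict(parser[section]))
    links = [tuple(key.split("->")) for key in parser["links"]]
    return objects, links

def stream_order(links):
    """Follow a linear chain of links, starting at the node that is never a target."""
    nxt = dict(links)
    targets = set(nxt.values())
    node = next(n for n in nxt if n not in targets)
    order = [node]
    while node in nxt:
        node = nxt[node]
        order.append(node)
    return order

objects, links = parse_stream_config(CONFIG_TEXT)
order = stream_order(links)
```

The sketch assumes a single linear chain without cycles; a real stream graph may branch, which would need a proper topological sort.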

APPLICATION INTERFACES
Customers are used to working with specific programming tools and do not want to change their habits
GUI/command line/SDKs
C/C++ API: binary compatibility with many compilers
Java API: a middle layer created using Java Native Interface (JNI)
C# API: automatically generated using SWIG
MRCPv2: network interface for integration into telephone infrastructure (IVRs), through the UniMRCP open source project
REST server: platform, simple network interface for each technology
Supported OS: Windows/Linux (32/64 bit), Android

USUAL DESIGN OF SYSTEMS WITH WEB BASED GUI
Diagram components: net, web server, application server, speech server, data storage

Speech server (REST), application server, web server, database
Speech technology inside application server (Tomcat, JBoss) through Java API

CLIENT FOR REST SERVER

GRAND CHALLENGES

Training data collection


Guarantee of accuracy
Reduction of hardware cost

TRAINING DATA COLLECTION

We would like to offer cheap speech technology to anyone on this planet.
Hundreds of languages and thousands of dialects
The data is costly: about 30 000 EUR for an existing language, more than 100 000 EUR for collection and annotation of new corpora
The existing data often do not match the target dialect and acoustic channels

We can add only a few languages to our offer per year

Can we find a smarter and cheaper way to collect data?

DATA COLLECTION PROJECT

1. Collection of data from public sources (broadcast)


2. Automatic detection of phone calls within broadcast (high variability in
speakers, dialects and speaking style)
3. Language identification to verify language, speaker identification to
ensure speaker variability
4. Annotation through crowd sourcing platform
5. Fully automatic process including training of ASR, unsupervised
adaptation
Experience from data collection for language identification (LDC and NIST
adopted this process, now mainstream in Language ID)
SpokenData.com ready for online annotation (ReplayWell company)
Cost of collection of 100h of data, annotation and system training could be reduced below 15 000 EUR per language
Interested? Please write to [email protected]

GUARANTEE OF ACCURACY

More and more customers buy speech solutions, but each new installation brings new risks.
Speaker identification is not fully language independent
Language identification is not dialect independent
Speech transcription is not domain independent
None of the technologies is channel independent

How to estimate in advance whether an installation is going to be successful? What will be the target accuracy of each technology?

The data collection project can help
It can be extended to a World Map of Spoken Languages

HARDWARE COST
A speech solution is not only software, but also computational hardware, data storage, physical planning of the HW etc.
Computers and cooling consume electricity
The additional cost can be about 50% of the total project cost
Most research is directed at reaching maximal accuracy
Any improvement in speed can have a large effect on the success of your technology

Perfect optimization of software


Use of HW acceleration (GPU cards etc.)

Q&A

THANKS!

Phonexia s.r.o.
[email protected]
