

ALIZE 3.0 - Open Source Toolkit for State-of-the-Art Speaker Recognition

Anthony Larcher (1), Jean-Francois Bonastre (2), Benoit Fauve (3), Kong Aik Lee (1), Christophe Levy (2), Haizhou Li (1), John S.D. Mason (4), Jean-Yves Parfait (5)

(1) Institute for Infocomm Research - A*STAR, Singapore; (2) University of Avignon - LIA, France; (3) ValidSoft Ltd, UK; (4) Swansea University, UK; (5) Multitel, Belgium

[email protected]

Abstract

ALIZE is an open-source platform for speaker recognition. The ALIZE library implements a low-level statistical engine based on the well-known Gaussian mixture modelling. The toolkit includes a set of high-level tools dedicated to speaker recognition based on the latest developments in the field, such as Joint Factor Analysis, Support Vector Machines, i-vector modelling and Probabilistic Linear Discriminant Analysis. Since 2005, the performance of ALIZE has been demonstrated in a series of Speaker Recognition Evaluations (SREs) conducted by NIST, and the toolkit was used by many participants in the most recent NIST-SRE in 2012. This paper presents the latest version of the toolkit and its performance on the NIST-SRE 2010 extended task.

Index Terms: speaker recognition, open-source platform, i-vector

1. Introduction

As indicated by the number of applications developed recently, speech technologies have now reached a level of performance that makes them attractive for distributed and embedded applications. Following this trend, NIST speaker recognition evaluations (SREs) have seen their number of participants increase significantly since the first edition. These campaigns clearly illustrate the substantial improvements in performance that have been achieved in the last few years. Speaker verification systems have benefited from a number of developments in noise and channel robustness [34, 30, 3] and from new paradigms such as Joint Factor Analysis [21] and i-vectors [12]. At the same time, techniques developed in the field of speaker verification have been shown to be useful in other areas [13, 27, 35].

State-of-the-art techniques are now based on intensive use of corpora and computational resources [16, 11]. The continual improvement in performance calls for an enormous number of trials to maintain confidence in the results. For instance, the number of trials in the core task of the NIST-SRE evaluation has increased from about 24,000 in 2006 to more than 1.88 million in 2012 (or 88 million for the extended task), and participation in such an event has become a true engineering challenge. The rapidly growing effort needed to keep up to date with state-of-the-art performance has strongly motivated an increasing number of collaborations between sites. However, system development often remains a challenge and large-scale implementation is resource consuming. In this context, collaborative open-source software offers a viable solution as it can be used to reduce the individual development effort and to offer a baseline system implementation [26].

The ALIZE project was initiated in 2004 by the University of Avignon - LIA within the ELISA consortium [29] with the aim to create an open-source C++ library for speaker recognition. Since then, many research institutes and companies have contributed to the toolkit through research projects or by sharing source code. More recently, the development has been supported by the BioSpeak project, part of the EU-funded Eurostar/Eureka program (http://www.eurekanetwork.org/activities/eurostars). Based on the ALIZE core library, high-level functionalities dedicated to speaker recognition are available through the LIA_RAL package. All the code from the toolkit is distributed under open-source software licenses (LGPL) and has been tested on different platforms including Windows, Linux and Mac OS.

Recent developments include Joint Factor Analysis [21], i-vector modelling and Probabilistic Linear Discriminant Analysis [32]. These developments stem mainly from a collaboration between LIA and the Institute for Infocomm Research (I2R).

This paper presents an overview of the ALIZE toolkit. Section 2 gives a description of the collaborative tools and details about the toolkit implementation. In Section 3 we describe the main functions available in the LIA_RAL package. Section 4 presents the performance of i-vector systems based on ALIZE for the NIST-SRE 2010 extended task. Finally, Section 5 discusses the future evolution of the project.

2. ALIZE: an Open Source Platform

2.1. A Community of Users

A number of tools are available for dissemination, exchange and collaborative work through a web portal (http://alize.univ-avignon.fr). To federate the community, this portal collects and publishes scientific work and industrial realisations related to ALIZE. Users can register to the mailing list, which allows them to be informed of the latest developments and to share their experience with the community. A LinkedIn group also provides a way to know about the facilities and the people working in the field of speaker recognition.

Documentation, a wiki and tutorials are available on the website to get started with the different parts of the toolkit. The official release of the toolkit can be downloaded from the website and the latest version of the sources is available through an SVN server.

2.2. Source Code

The ALIZE software architecture is based on UML modelling and strict code conventions in order to facilitate collaborative development and maintenance of the code. An open-source and cross-platform test suite enables ALIZE's contributors to quickly run regression tests in order to increase the reliability of future releases and to make the code easier to maintain.
Test cases include low-level unit tests on the core ALIZE and the most important algorithmic classes, as well as an integration test level on the high-level executable tools. Doxygen documentation is available online and can be compiled from the sources.

The platform includes a Visual Studio solution and autotools for compilation under Windows and UNIX-like platforms. A large part of the LIA_RAL functions use parallel processing for speed. The multi-threaded implementation, based on the standard POSIX library, is fully compatible with the most common platforms. The LIA_RAL library can be linked to the well-known Lapack library (http://www.netlib.org/lapack/) for high accuracy in matrix manipulations.

All sources are available under the LGPL licence, which imposes minimal restrictions on the redistribution of the software.

3. High Level Executables

The goal of this paper is to show the steps to set up a state-of-the-art biometric engine using the different components of ALIZE. For more details, excellent tutorials on speaker recognition can be found in the literature [7, 2, 22].

Figure 1 shows the general architecture of a speaker recognition engine. The LIA_RAL high-level toolkit provides a number of executables that can be used to achieve the different functions depicted in this diagram. The rest of this section gives an overview of the main functionalities of the LIA_RAL toolkit with the corresponding executables.

[Figure 1: General structure of a speaker verification system. The diagram shows Front-End (feature extraction), Enrolment, Background Modelling, Pattern Matching, Score Normalization and Decision blocks, all covered by the ALIZE toolkit.]
3.1. Front-End

The first stage of a speaker recognition engine consists of a feature extraction module that transforms the raw signal into a sequence of low-dimension feature vectors. ALIZE interfaces to features generated by SPRO (http://www.irisa.fr/metiss/guig/spro/) and HTK (http://htk.eng.cam.ac.uk/) and also accepts raw format.

Once the acoustic features have been generated, they can be normalised with the function NormFeat to remove the contribution of slowly varying convolutive noises (mean subtraction) and to reduce their dynamic range (variance normalization).

Low-energy frames, corresponding to silence and noise, can then be discarded with EnergyDetector. This executable computes a bi- or tri-Gaussian model of the energy or log-energy distribution of the feature vectors and selects the features belonging to the distribution with the highest mean.

Finally, it is possible to smooth a selection of feature vectors. Applied to a single-channel recording, LabelFusion smooths the selection of frames with a morphological window. When applied to a two-channel recording, LabelFusion removes overlapping sections of high-energy features.
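To make this front-end concrete, the sketch below illustrates, in NumPy/scikit-learn and outside the toolkit, the kind of processing performed by NormFeat and EnergyDetector: per-utterance mean and variance normalization, then selection of the frames assigned to the highest-mean component of a bi-Gaussian model of the log-energy. All function names and parameters here are hypothetical and not part of the ALIZE API.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def normalize_features(feats):
    """Per-utterance mean/variance normalization (in the spirit of NormFeat).
    feats: (n_frames, n_coeffs) array of cepstral features."""
    return (feats - feats.mean(axis=0)) / feats.std(axis=0)

def select_high_energy_frames(log_energy):
    """Keep frames drawn from the highest-mean component of a bi-Gaussian
    model of the log-energy distribution (in the spirit of EnergyDetector)."""
    gmm = GaussianMixture(n_components=2).fit(log_energy.reshape(-1, 1))
    speech = int(np.argmax(gmm.means_))           # component with highest mean
    labels = gmm.predict(log_energy.reshape(-1, 1))
    return labels == speech                       # boolean mask of speech frames

# Toy usage: 300 frames of 20-dim features plus a synthetic log-energy track.
rng = np.random.default_rng(0)
feats = rng.normal(size=(300, 20))
log_e = np.concatenate([rng.normal(-2, 1, 150), rng.normal(2, 1, 150)])
mask = select_high_energy_frames(log_e)
clean = normalize_features(feats[mask])
```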
3.2. Enrolment

In speaker recognition, the enrolment module generates a statistical model from one or several sequences of features. Although it is possible to generate one model for each recording session, depending on the system's architecture, it is common to consider a single model to represent a speaker. By extension, we refer to this as the speaker model for the remainder of the paper.

State-of-the-art speaker recognition engines are mainly based on three types of speaker models. Although these three models are all related to a Gaussian mixture model (GMM), we distinguish between a first type of model that explicitly makes use of a GMM and two other types that represent the speaker or session as a fixed-length vector derived from a GMM. This vector can be a super-vector [8] (the concatenation of the mean parameters of the GMM) or a more compact representation known as an i-vector [12]. Each of the three models can be obtained from the corresponding executable of the LIA_RAL toolkit.

Robustness to inter-session variability, which may be due to channel or noise, is one of the main issues in speaker recognition. Therefore, ALIZE includes the most common solutions to this challenge for each type of model described below.

TrainTarget generates GMM models given one or more feature sequences. The GMMs can be adapted from a universal background model (UBM) M by using a maximum a posteriori (MAP) criterion [15, 33] or maximum likelihood linear regression (MLLR) [25]. Noise- and channel-robust representations can be obtained by using Factor Analysis (FA) based approaches. Factor Analysis for speaker recognition was introduced in [20] and assumes that the variabilities due to the speaker and to the channel lie in distinct low-dimension subspaces. Different flavours of FA have been proposed and two are available in ALIZE. The more general one is known as Joint Factor Analysis [21], in which the super-vector m(s,n) of the session- and speaker-dependent GMM is a sum of three terms given by Eq. (1):

m(s,n) = M + V y(s) + U x(s,n) + D z(s)    (1)

In this formulation, V and U are factor loading matrices and D is a diagonal MAP matrix, while y(s) and x(s,n) are respectively called speaker and channel factors. y(s), x(s,n) and z(s) are assumed to be independent and to have standard normal distributions. A simplified version of this model, often referred to as EigenChannel or Latent Factor Analysis [30], is also available in TrainTarget. In this version, the simplified generative equation becomes:

m(s,n) = M + U x(s,n) + D z(s)    (2)
ModeToSv extracts super-vectors from GMMs. A super-vector is the representation of a speaker in a high-dimension space that has been popularised by the development of Support Vector Machines (SVMs) for speaker recognition [8]. The LibSVM library [10] has been integrated into the ALIZE SVM executable. Nuisance Attribute Projection (NAP) [34] aims to attenuate the channel variability in the super-vector space by rejecting a subspace that contains most of the channel variability (nuisance effect). The most straightforward approach consists of estimating the within-session co-variance matrix W of a set of speakers and computing the projection m̂(s,n) of super-vector m(s,n) such that:

m̂(s,n) = (I - S S^t) m(s,n)    (3)

where S contains the first eigenvectors resulting from the singular value decomposition of W.
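As an illustration of Eq. (3), the projection fits in a few lines; the following is a hedged NumPy sketch (not the ALIZE implementation), where the nuisance subspace S is taken as the leading eigenvectors of a within-session covariance estimate:

```python
import numpy as np

def nap_project(m, S):
    """Nuisance Attribute Projection, Eq. (3): m_hat = (I - S S^t) m.
    m: (d,) super-vector; S: (d, k) orthonormal nuisance directions."""
    return m - S @ (S.T @ m)

def nuisance_subspace(W, k):
    """Leading k eigenvectors of a within-session covariance matrix W."""
    eigvals, eigvecs = np.linalg.eigh(W)      # eigenvalues in ascending order
    return eigvecs[:, -k:]                    # keep the largest ones

# Toy usage with a random symmetric positive semi-definite W.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 50))
W = A @ A.T
S = nuisance_subspace(W, k=5)
m_hat = nap_project(rng.normal(size=50), S)
```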
IvExtractor extracts a low-dimensional i-vector [12] from a sequence of feature vectors. An i-vector is generated according to

m(s,n) = M + T w(s,n)    (4)

where T is a low-rank rectangular matrix called the Total Variability matrix and the i-vector w(s,n) is the probabilistic projection of the super-vector m(s,n) onto the Total Variability space, defined by the columns of T.

Many normalization techniques have been proposed to compensate for the session variability in the Total Variability space. The IvNorm executable can be used to apply normalizations based on the Eigen Factor Radial (EFR) method [4]. EFR iteratively modifies the distribution of i-vectors such that it becomes standard normal and the i-vectors have a unitary norm. Given a development set of i-vectors, of mean μ and total co-variance matrix Σ, an i-vector is modified according to:

w ← Σ^(-1/2) (w - μ) / ||Σ^(-1/2) (w - μ)||    (5)

After this transformation has been applied to all i-vectors from the development set and from the test data, the mean, μ, and co-variance matrix, Σ, are re-estimated to perform the next iteration. Note that the length-norm proposed in [14] is equivalent to one iteration of the EFR algorithm.

A variation of EFR, proposed later in [3] as Spherical Nuisance Normalization (sphNorm), is also available in the ALIZE toolkit. For sphNorm, the total co-variance matrix Σ is replaced by the within-class co-variance matrix of the development set. After normalization of their norm, all i-vectors lie on a sphere and it is therefore difficult to estimate a relevant within-class co-variance matrix. Spherical Nuisance Normalization is then used to project the i-vectors onto a spherical surface while ensuring that there is no principal direction for the session variability.

Other standard techniques such as Within Class Co-variance Normalization (WCCN) and Linear Discriminant Analysis (LDA) [12] are also available in ALIZE.
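To make Eq. (5) concrete, here is a small NumPy sketch of the EFR iteration (an illustration under the definitions above, not the IvNorm code): each pass whitens the development i-vectors with the current mean and total covariance, length-normalises them, then re-estimates the statistics. Running it with n_iter=1 corresponds to the length-norm of [14].

```python
import numpy as np
from scipy.linalg import sqrtm

def efr_normalize(dev, n_iter=2):
    """Eigen Factor Radial normalization, Eq. (5).
    dev: (n, d) development i-vectors. Returns the normalized i-vectors
    and the list of (mean, whitener) pairs to re-apply to test data."""
    transforms = []
    w = dev.copy()
    for _ in range(n_iter):
        mu = w.mean(axis=0)
        Sigma = np.cov(w, rowvar=False)
        inv_sqrt = np.linalg.inv(sqrtm(Sigma)).real      # Sigma^(-1/2)
        w = (w - mu) @ inv_sqrt.T                        # whiten
        w /= np.linalg.norm(w, axis=1, keepdims=True)    # unit norm
        transforms.append((mu, inv_sqrt))
    return w, transforms

def apply_efr(w, transforms):
    """Apply the learned EFR transforms to test i-vectors (n, d)."""
    for mu, inv_sqrt in transforms:
        w = (w - mu) @ inv_sqrt.T
        w /= np.linalg.norm(w, axis=1, keepdims=True)
    return w
```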
3.3. Pattern Matching

Given a test utterance, X, and a target speaker model, the matching module returns a score that reflects the confidence of the system in X being spoken by the target speaker. The nature and computation of this score vary depending on the type of speaker model and the different assumptions made. As for the enrolment module, LIA_RAL includes three executables, each dedicated to a specific type of model.

ComputeTest, given a sequence of acoustic features X = {x_t}, t = 1..T, of length T, computes a log-likelihood ratio between the UBM and a speaker-dependent GMM. If no channel compensation method is applied, the log-likelihood of utterance X over a model s is computed as the average log-likelihood of the features x_t such that:

log P(X|s) = (1/T) Σ_{t=1}^{T} log Σ_{c=1}^{C} ω_c N(x_t; μ_c, Σ_c)    (6)

where C is the number of distributions in s and ω_c, μ_c and Σ_c are the weight, mean and co-variance matrix of the c-th distribution, respectively.
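A direct transcription of Eq. (6) for diagonal-covariance GMMs might look as follows; this is an illustrative sketch, not the ComputeTest source, and the log-likelihood ratio is simply the target-model score minus the UBM score:

```python
import numpy as np
from scipy.special import logsumexp

def gmm_avg_loglik(X, w, mu, var):
    """Average log-likelihood of Eq. (6) for a diagonal-covariance GMM.
    X: (T, d) features; w: (C,) weights; mu, var: (C, d)."""
    # log N(x_t; mu_c, var_c) for every frame/component pair -> (T, C)
    log_gauss = -0.5 * (np.log(2 * np.pi * var).sum(axis=1)
                        + ((X[:, None, :] - mu) ** 2 / var).sum(axis=2))
    return logsumexp(np.log(w) + log_gauss, axis=1).mean()

def llr(X, target, ubm):
    """Log-likelihood ratio between a speaker GMM and the UBM,
    each given as a (weights, means, variances) tuple."""
    return gmm_avg_loglik(X, *target) - gmm_avg_loglik(X, *ubm)
```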
For the case of Joint Factor Analysis, where it is difficult to integrate out the channel effect, ComputeTest can compute two approximations of the log-likelihood. The first one is adapted directly from [19] and uses a MAP point estimate of the channel factor, and the second is the linear scoring proposed in [17]. A detailed description of both approaches can be found in [17].

SVM returns a score that reflects the distance of a test super-vector to the hyper-plane defined by the classifier. Different kernels such as GLDS [9] or the GSL kernel derived from the Kullback-Leibler divergence [8] are available through the LibSVM library.

IvTest is dedicated to i-vector comparison. The i-vector paradigm offers an attractive low-dimensional representation of speech segments that enables standard classification techniques to be applied for speaker recognition. Amongst the most popular scoring methods, four have been implemented in IvTest: cosine [12] and Mahalanobis [4] scoring, two-co-variance scoring (2cov) [6], as well as two versions of the Probabilistic Linear Discriminant Analysis (PLDA) scoring [32].

In the remainder of this section, W, B and μ are respectively the within-class co-variance matrix, the between-class co-variance matrix and the mean of a large set of i-vectors.

Cosine similarity has been proposed in [12] to compute the similarity between two i-vectors w1 and w2. In the same paper, the authors compensate the session variability through Within Class Co-variance Normalization (WCCN) and Linear Discriminant Analysis (LDA). Considering that G is the Cholesky decomposition of the within-class co-variance matrix W calculated over a large data set, and that A is the LDA matrix computed on the same data set, the cosine similarity score is given by:

S(w1, w2) = <A^t G^t w1 | A^t G^t w2> / ( ||A^t G^t w1|| · ||A^t G^t w2|| )    (7)

Mahalanobis distance is a generalisation of the Euclidean distance for the case where the data do not follow a standard normal distribution. The Mahalanobis score is given by:

S(w1, w2) = (w1 - w2)^t W^(-1) (w1 - w2)    (8)
mean and 1-variance for a given utterance. A 2048-distribution
R
N (w1 |y, W ) N (w2 |y, W ) N (y|, B) dy
s= Q R (9) UBM with diagonal co-variance matrix is trained on 6,687 male
i=1,2 N (wi |y, W ) N (y|, B) dy
sessions from NIST-SRE 04, 05 and 06 telephone and micro-
PLDA [32] is one of the most recent addition to the AL- phone data. The same data augmented with Fisher and Switch-
IZE toolkit. The generative model of PLDA considers that an board databases (28,524 sessions) are used to train a Total Vari-
i-vector w is a sum of three terms: ability matrix of rank 500. All meta-parameters required for i-
vector normalization and scoring are estimated from 710 speak-
w(s,n) = + F(s) + G (s,n) +  (10) ers from NIST-SRE 04,05 and 06 with a total of 11,177 ses-
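Because every term of Eq. (9) is a Gaussian integral, the score has a closed form. The sketch below illustrates the algebra (it is not the IvTest code): it evaluates the log of Eq. (9) by completing the square in y, for an arbitrary set of enrolment i-vectors.

```python
import numpy as np

def _log_marginal(ws, mu, B_inv, W_inv, d):
    """log of integral prod_i N(w_i|y, W) N(y|mu, B) dy from Eq. (9),
    computed by completing the square in y. ws: list of (d,) i-vectors."""
    n = len(ws)
    Lam = n * W_inv + B_inv                        # posterior precision of y
    b = W_inv @ np.sum(ws, axis=0) + B_inv @ mu    # linear term
    quad = (sum(w @ W_inv @ w for w in ws) + mu @ B_inv @ mu
            - b @ np.linalg.solve(Lam, b))
    logdet = (n * np.linalg.slogdet(W_inv)[1] + np.linalg.slogdet(B_inv)[1]
              - np.linalg.slogdet(Lam)[1])
    return 0.5 * (logdet - quad) - 0.5 * n * d * np.log(2 * np.pi)

def two_cov_score(w1, w2, mu, B, W):
    """Log of Eq. (9): same-speaker evidence over the product of the
    single-i-vector evidences."""
    d = len(mu)
    B_inv, W_inv = np.linalg.inv(B), np.linalg.inv(W)
    return (_log_marginal([w1, w2], mu, B_inv, W_inv, d)
            - _log_marginal([w1], mu, B_inv, W_inv, d)
            - _log_marginal([w2], mu, B_inv, W_inv, d))
```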
PLDA [32] is one of the most recent additions to the ALIZE toolkit. The generative model of PLDA considers that an i-vector w is a sum of three terms:

w(s,n) = μ + F y(s) + G x(s,n) + ε    (10)

where F and G are low-rank speaker and channel factor loading matrices, and ε is a normally distributed additive noise with full co-variance matrix. The ALIZE implementation of PLDA follows the work of [18]. Two scoring methods, described by Figure 2, have been implemented. The first is based on the native PLDA scoring, while the second uses the mean of the L enrolment i-vectors, w̄. Note that both methods allow multiple enrolment sessions. More details can be found in [23] and in a companion paper [24].

[Figure 2: Graphical model of the two PLDA scoring implementations in ALIZE for L enrolment i-vectors, under the same-speaker and different-speakers hypotheses: (a) PLDA native scoring; (b) PLDA mean scoring.]

3.4. Background Modelling

Speaker recognition is a data-driven technology and all approaches implemented in ALIZE rely on background knowledge learned from a large quantity of development data. Estimation of this knowledge component is computationally intense. Efficient tools have been developed in the toolkit to optimize and simplify the development effort. The UBM can be trained efficiently by using TrainWorld, which uses a random selection of features to speed up the iterative learning process based on the EM algorithm. Meta-parameters of JFA and LFA models can be trained by using EigenVoice, EigenChannel and EstimateDmatrix, while TotalVariability has been especially optimised to deal with the computational constraints of learning the Total Variability matrix for i-vector extraction. The implementation follows the work described in [20] with the additional minimum-divergence step described in [5]. Nuisance Attribute Projection matrices can be trained using CovIntra and, for i-vector systems, normalization and PLDA meta-parameters can be trained by using IvNorm and PLDA respectively. PLDA estimation follows the algorithm described in [18].
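A toy analogue of TrainWorld's strategy is sketched below with scikit-learn (this is not the actual implementation, and the function name and frac parameter are hypothetical): EM is run on a random subset of the pooled background features rather than on all frames.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(features, n_components=2048, frac=0.1, seed=0):
    """EM training of a diagonal-covariance UBM on a random subset of the
    background features (features: (n_frames, dim)), in the spirit of
    TrainWorld's random frame selection."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(features), size=int(frac * len(features)),
                     replace=False)
    ubm = GaussianMixture(n_components=n_components,
                          covariance_type="diag", max_iter=25)
    return ubm.fit(features[idx])
```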
3.5. Score Normalization

Different combinations of score normalization based on Z- and T-norm are available through ComputeNorm [28, 1].
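The normalizations themselves are simple once impostor statistics are available; the following is a hedged sketch of the underlying arithmetic (not ComputeNorm's interface):

```python
import numpy as np

def z_norm(score, impostor_scores):
    """Z-norm: standardize a trial score with impostor scores obtained by
    testing the *target model* against impostor utterances."""
    return (score - np.mean(impostor_scores)) / np.std(impostor_scores)

def t_norm(score, cohort_scores):
    """T-norm: same standardization, but the statistics come from scoring
    the *test utterance* against a cohort of impostor models."""
    return (score - np.mean(cohort_scores)) / np.std(cohort_scores)
```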
4. Performance of i-Vector Systems

This section presents the performance of different i-vector systems based on the ALIZE toolkit on Condition 5 of the NIST-SRE10 extended task for male speakers [31]. 50-dimension MFCC vectors are used as input features (19 MFCC, 19 Δ, 11 ΔΔ and ΔE). High-energy frames are retained and normalized so that the distribution of each cepstral coefficient is 0-mean and 1-variance for a given utterance. A 2048-distribution UBM with diagonal co-variance matrices is trained on 6,687 male sessions from NIST-SRE 04, 05 and 06 telephone and microphone data. The same data, augmented with the Fisher and Switchboard databases (28,524 sessions), are used to train a Total Variability matrix of rank 500. All meta-parameters required for i-vector normalization and scoring are estimated from 710 speakers from NIST-SRE 04, 05 and 06 with a total of 11,177 sessions. The ranks of the F and G matrices of the PLDA model are set to 400 and 0 respectively. When applied, two iterations of Eigen Factor Radial and three iterations of Spherical Nuisance Normalization are performed.

[Figure 3: Performance of ALIZE i-vector systems on the NIST-SRE10 extended male tel-tel task (condition 5), given in terms of (% EER, minDCF2010).]

Figure 3 shows the performance of seven systems using different i-vector normalization and scoring functions. The performance of these systems is consistent with the current state of the art, considering that simple acoustic features have been used.

5. Discussion

We have described ALIZE, an open-source speaker recognition toolkit. This toolkit includes most of the standard algorithms recently developed in the field of speaker recognition, including Joint Factor Analysis, i-vector modelling and Probabilistic Linear Discriminant Analysis. The aim of the ALIZE collaborative project is to pool development efforts and to make efficient implementations of standard algorithms available to the community. In the future, efforts will be concentrated on the documentation of the toolkit through online help and tutorials.

6. References

[1] R. Auckenthaler, M. Carey, and H. Lloyd-Thomas, "Score normalization for text-independent speaker verification systems," Digital Signal Processing, vol. 10, pp. 42-54, 2000.
[2] F. Bimbot, J.-F. Bonastre, C. Fredouille, G. Gravier, I. Magrin-Chagnolleau, S. Meignier, T. Merlin, J. Ortega-Garcia, D. Petrovska-Delacretaz, and D. A. Reynolds, "A tutorial on text-independent speaker verification," EURASIP Journal on Applied Signal Processing, vol. 4, pp. 430-451, April 2004.
[3] P.-M. Bousquet, A. Larcher, D. Matrouf, J.-F. Bonastre, and O. Plchot, "Variance-spectra based normalization for i-vector standard and probabilistic linear discriminant analysis," in Odyssey Speaker and Language Recognition Workshop, 2012.
[4] P.-M. Bousquet, D. Matrouf, and J.-F. Bonastre, "Intersession compensation and scoring methods in the i-vectors space for speaker recognition," in Annual Conference of the International Speech Communication Association (Interspeech), 2011, pp. 485-488.
[5] N. Brummer, "The EM algorithm and minimum divergence," Agnitio Labs Technical Report, online: http://niko.brummer.googlepages.
[6] N. Brummer and E. de Villiers, "The speaker partitioning problem," in Odyssey Speaker and Language Recognition Workshop, 2010.
[7] J. P. Campbell, "Speaker recognition: a tutorial," Proceedings of the IEEE, vol. 85, no. 9, pp. 1437-1462, September 1997.
[8] W. M. Campbell, D. E. Sturim, D. A. Reynolds, and A. Solomonoff, "SVM based speaker verification using a GMM supervector kernel and NAP variability compensation," in Proc. ICASSP, vol. 1, 2006, pp. 97-100.
[9] W. Campbell, J. Campbell, D. Reynolds, E. Singer, and P. Torres-Carrasquillo, "Support vector machines for speaker and language recognition," Computer Speech & Language, vol. 20, pp. 210-229, 2006.
[10] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 1-27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[11] S. Cumani and P. Laface, "Memory and computation trade-offs for efficient i-vector extraction," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 5, pp. 934-944, 2013.
[12] N. Dehak, P. A. Torres-Carrasquillo, D. A. Reynolds, and R. Dehak, "Language recognition via i-vectors and dimensionality reduction," in Annual Conference of the International Speech Communication Association (Interspeech), 2011.
[13] C. Fredouille, G. Pouchoulin, A. Ghio, J. Revis, J.-F. Bonastre, A. Giovanni et al., "Back-and-forth methodology for objective voice quality assessment: from/to expert knowledge to/from automatic classification of dysphonia," EURASIP Journal on Advances in Signal Processing, vol. 2009, 2009.
[14] D. Garcia-Romero and C. Y. Espy-Wilson, "Analysis of i-vector length normalization in speaker recognition systems," in Annual Conference of the International Speech Communication Association (Interspeech), 2011, pp. 249-252.
[15] J.-L. Gauvain and C.-H. Lee, "Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 2, 1994, pp. 291-298.
[16] O. Glembek, L. Burget, P. Matejka, M. Karafiat, and P. Kenny, "Simplification and optimization of i-vector extraction," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2011, pp. 4516-4519.
[17] O. Glembek, L. Burget, N. Dehak, N. Brummer, and P. Kenny, "Comparison of scoring methods used in speaker recognition with joint factor analysis," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Taipei, Taiwan, 2009.
[18] Y. Jiang, K. A. Lee, Z. Tang, B. Ma, A. Larcher, and H. Li, "PLDA modeling in i-vector and supervector space for speaker verification," in Annual Conference of the International Speech Communication Association (Interspeech), 2012.
[19] P. Kenny, "Joint factor analysis of speaker and session variability: Theory and algorithms," CRIM, Tech. Rep., 2005.
[20] P. Kenny and P. Dumouchel, "Disentangling speaker and channel effects in speaker verification," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2004, pp. 37-40.
[21] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, "Joint factor analysis versus eigenchannels in speaker recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1435-1447, 2007.
[22] T. Kinnunen and H. Li, "An overview of text-independent speaker recognition: From features to supervectors," Speech Communication, vol. 52, no. 1, pp. 12-40, 2010.
[23] A. Larcher, K. A. Lee, B. Ma, and H. Li, "Phonetically-constrained PLDA modeling for text-dependent speaker verification with multiple short utterances," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2013.
[24] K. A. Lee, A. Larcher, C.-H. You, B. Ma, and H. Li, "Multi-session PLDA scoring of i-vector for partially open-set speaker detection," in Annual Conference of the International Speech Communication Association (Interspeech), 2013.
[25] C. J. Leggetter and P. C. Woodland, "Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models," Computer Speech and Language, vol. 9, no. 2, pp. 171-185, April 1995.
[26] H. Li and B. Ma, "TechWare: Speaker and spoken language recognition resources," IEEE Signal Processing Magazine, vol. 27, no. 6, pp. 139-142, 2010.
[27] H. Li, B. Ma, and K. A. Lee, "Spoken language recognition: from fundamentals to practice," Proceedings of the IEEE, 2013.
[28] K.-P. Li and J. E. Porter, "Normalizations and selection of speech segments for speaker recognition scoring," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, New York, USA, April 1988, pp. 595-598.
[29] I. Magrin-Chagnolleau, G. Gravier, and R. Blouet, "Overview of the 2000-2001 ELISA consortium research activities," in Odyssey Speaker and Language Recognition Workshop, 2001.
[30] D. Matrouf, N. Scheffer, B. Fauve, and J.-F. Bonastre, "A straightforward and efficient implementation of the factor analysis model for speaker verification," in Annual Conference of the International Speech Communication Association (Interspeech), 2007.
[31] NIST, "Speaker recognition evaluation plan," http://www.itl.nist.gov/iad/mig/tests/sre/2010/NIST_SRE10_evalplan.r6.pdf, 2010.
[32] S. J. Prince and J. H. Elder, "Probabilistic linear discriminant analysis for inferences about identity," in International Conference on Computer Vision (ICCV), 2007, pp. 1-8.
[33] D. A. Reynolds and R. C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Transactions on Speech and Audio Processing, vol. 3, no. 1, pp. 72-83, January 1995.
[34] A. Solomonoff, W. Campbell, and I. Boardman, "Advances in channel compensation for SVM speaker recognition," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, 2005, pp. 629-632.
[35] C. Vaquero, A. Ortega, and E. Lleida, "Intra-session variability compensation and a hypothesis generation and selection strategy for speaker segmentation," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2011, pp. 4532-4535.
