2018 3rd International Conference for Convergence in Technology (I2CT)

The Gateway Hotel, XION Complex, Wakad Road, Pune, India. Apr 06-08, 2018

Speaker Recognition Techniques: A review

Satyam P. Todkar, Snehal S. Babar, Rudrendra U. Ambike, Prasad B. Suryakar
Department of Computer Engineering
Sinhgad College of Engineering
Pune, India

Dr. J. R. Prasad
Department of Computer Engineering
Sinhgad College of Engineering
Pune, India

Abstract—Speaker recognition is the process of recognizing the speaker from the individual's speech biometrics. The voice characteristics of every speaker are different and can therefore be used to construct a model. This model is later used to recognize an enrolled speaker from the list of available speakers. The paper discusses different speaker modeling techniques such as Vector Quantization (VQ), Gaussian Mixture Model (GMM) and Neural Networks (NN), as well as techniques for extracting voice characteristics such as Mel Frequency Cepstral Coefficients (MFCC) and Linear Predictive Coding (LPC). Further, an in-depth analysis of the surveyed techniques is made to identify their advantages and limitations. Work in the field of speaker recognition systems began in the 1950s and has been evolving ever since; it has wide applications in security, forensics, authentication, etc.

Index Terms—Linear Predictive Coding (LPC); Mel Frequency Cepstral Coefficient (MFCC); Formants Wavelet Entropy (FWE); Vector Quantization (VQ); Hidden Markov Model (HMM); Gaussian Mixture Model (GMM); Neural Network (NN)

I. INTRODUCTION
Extensive work in the field of speaker recognition has been done in the past two to three decades; however, the goal of speaker recognition algorithms remains the same. They aim either to identify the speaker from among the available speakers or to verify a particular speaker. The voice of every individual sounds different because it is shaped by different features such as pitch, length of the vocal tract and sound frequency. The devised algorithms may use a single feature or a combination of features at different stages to perform the task of recognition. The idea behind an Automatic Speaker Recognition (ASR) system is to create a machine that will extract, characterize and then identify the speaker from the input voice samples.

Speaker recognition can be divided into two types: speaker identification and speaker verification. Speaker identification is the process of identifying a particular speaker from the set of enrolled speakers, whereas speaker verification deals with validating the claimed identity of the speaker. The difference between the two is that the user explicitly states the identity in the latter. The task of speaker identification can further be divided into text dependent and text independent. In a text dependent speaker recognition system, the speaker utters the same phrase on which the system was trained, whereas in a text independent speaker recognition system the speaker is identified irrespective of the spoken phrase.

The task of speaker identification is primarily composed of two modules: feature extraction and feature matching. Feature extraction deals with finding the feature vector for an input speech signal and is related to dimensionality reduction. Features are the unique attributes that characterize different speakers and hence can be used to model templates for the speaker in the training phase. In the testing phase, feature extraction is performed first and the extracted features are then matched to the speaker templates by the feature matching module.

Fig. 1. Voice Recognition Hierarchy

Fig. 1 shows the hierarchical representation of a voice recognition system. It also presents the different modeling techniques that are used to construct the speaker template.

Fig. 2. Block Diagram of Speaker Recognition System

Fig. 2 represents the block diagram of a speaker recognition system. It represents the different phases that are involved in

978-1-5386-4273-3/18/$31.00 ©2018 IEEE 1


the process to identify the speaker along with the algorithms that are used to implement them.

The rest of the paper is organized as follows. Section II describes the pre-processing step, Section III describes various feature extraction techniques, Section IV describes different feature matching techniques, and Section V deals with modern approaches that are used in the recognition field.

II. PRE PROCESSING
Pre-processing is one of the most important steps in recognizing a speaker [1]. The speech signal is nonstationary and consists of different components that may or may not be useful in identifying the speaker. This step removes unwanted components such as silent and unvoiced regions, leaving the voiced regions of the speech. This reduces the time complexity and the processing power required.

III. FEATURE EXTRACTION TECHNIQUES

A. Linear Predictive Coding (LPC)
LPC is one of the earliest techniques for deriving feature vectors, and it is both simple and popular. It can analyze speech and encode good-quality speech at a low bit rate. It is widely used, from standard telephony to military communication. LPC consists of a predictor that predicts the current output as a linear combination of previous outputs.

Fig. 3. LPC Block Diagram

The user's speech is taken as input to the pre-emphasis stage [1], whose output feeds the frame blocking stage. The frame blocking stage is used to block the signal into N frames. A windowing function is applied to remove the discontinuities at the frame boundaries. Next comes the autocorrelation step, in which the autocorrelation value is calculated for every windowed frame and the highest autocorrelation value is found. This gives the order of the LPC analysis, after which the LPC coefficients are derived. Fig. 3 represents the LPC block diagram and the various steps involved in it.

B. Mel Frequency Cepstrum Coefficients (MFCC)
MFCC is the most widely used algorithm for speaker recognition. It is resilient to a noisy environment and hence can recognize the speaker efficiently. It is more effective than the previously discussed LPC algorithm. The coefficients obtained in the final stage of MFCC form the acoustic vector for every speech utterance. Fig. 4 represents the MFCC block diagram.

The MFCC algorithm consists of various phases [2]. The pre-emphasis stage artificially boosts the higher frequencies and hence increases the signal-to-noise ratio. Framing divides the input voice signal into frames of equal size; this is done because voice can be treated as a stationary signal only over a short duration. The frame size must be chosen carefully since it affects both the time and the frequency resolution. A window function is applied to remove the discontinuities at the frame boundaries, which eliminates undesirable effects in the frequency response. The Fast Fourier Transform (FFT), a fast algorithm for computing the DFT, converts each frame from the time domain to the frequency domain. The next stage is Mel-frequency warping, which is simulated by a mel frequency filter bank with a triangular band-pass frequency response. The Mel-frequency scale has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. If f is the frequency of a signal in Hz, then Mel(f) is given by the formula:

Mel(f) = 2595 * log10(1 + f/700)

The final step is the Discrete Cosine Transform (DCT). It converts the log Mel spectrum back into the time domain. The result is called the Mel Frequency Cepstrum Coefficients, and the set of such coefficients forms the acoustic vector.

Fig. 4. MFCC Block Diagram

C. Formants Wavelet Entropy (FWE)

Fig. 5. FWE Block Diagram

FWE is a novel approach in the field of speaker recognition. It works by calculating the formants and wavelet entropy of the filtered input speech. FWE is more efficient than MFCC since it has a fixed number of feature coefficients: only twelve extracted coefficients (the five formants and seven wavelet entropies described below) are used to model the speaker voice template [3]. FWE works on partially recorded voice samples and is mainly used in forensics. FWE can work for both vowel dependent and vowel independent input speech; however, the recognition efficiency is better for the vowel dependent approach. The FWE block diagram is shown in Fig. 5.

FWE has two stages: recording and filtering the speech signals, and extracting features. First the input speech is recorded and passed to the filter bank, which filters the unwanted signals from the speech. Feature extraction is further divided into two parts, one for calculating formants and one for calculating entropies. Formants represent the acoustic resonances of the speaker's vocal tract. The Power Spectral Density (PSD) is used to find the first five formants, as they are easily distinguishable for every speaker. The entropies are then calculated using Wavelet Packets (WP): the Shannon entropy is calculated for seven nodes of the wavelet packet, thus enhancing the recognition rate.

IV. FEATURE MATCHING TECHNIQUES
The feature matching algorithms are used in both the training and the testing phase. In the training phase, the system is trained by using the extracted feature vectors to construct a speaker model. In the testing phase, this model is validated by recognizing speakers or voice samples that were not used in the training phase.

A. Vector Quantization (VQ)
VQ is one of the most popular and easy to use feature matching algorithms. It works by using the extracted feature vectors to construct a model [2]. VQ is a type of unsupervised learning algorithm that creates clusters, which represent the models of the enrolled speakers.

Initially, the feature vectors are obtained and classified into different clusters. Feature vectors belonging to the same cluster have similar properties and model the attributes of the speaker. When a new feature vector is obtained, it is compared with the centroids of all existing clusters by calculating the Euclidean distance. The feature vector is assigned to the cluster with the minimum distance and the centroid of that cluster is updated. The centroids are also termed code words, and the collection of such code words is called a codebook.

B. Hidden Markov Model (HMM)
HMM is a better and more efficient feature matching algorithm than the traditional VQ model. HMM is able to model the statistical variations of the features and gives a statistical representation of the way a speaker produces sound [4]. The applications of HMM are varied, viz. signal processing, speech recognition, bioinformatics and pattern recognition problems.

C. Gaussian Mixture Model (GMM)
Gaussian Mixture Model is one of the most successful speaker classification techniques, owing to its use of a Gaussian mixture probability density function. GMM is close to natural modeling techniques and hence can be used to model scenarios comprising higher dimensions [5].

Speech attributes can readily be treated as Gaussian distributed, and hence GMM can be used effectively. A Gaussian mixture density is a weighted sum of M component densities, where M represents the number of Gaussians. Each speaker can be effectively modeled and represented by a GMM and is referred to by the model associated with him/her. This model is denoted by λ.

D. VQ & GMM: A Hybrid Approach
The VQ model discussed previously is one of the efficient and easy approaches to identify the speaker. However, as the number of code words increases, the time complexity of the algorithm increases and the accuracy decreases. One modification to VQ is to combine it with a more sophisticated technique such as GMM [6]. Here, each technique first recognizes the speaker on its own. If both techniques recognize the same speaker, that speaker is readily accepted; if they disagree, the relative index is calculated and the confidence ratio is found. This ratio is used to identify the true speaker and may also help to detect outliers.

E. GMM & Pitch Detection Algorithm
GMM, a sophisticated unsupervised algorithm, is efficient in itself; however, with advancements in technology, efforts are being made to minimize the time complexity of the speaker recognition task. The pitch of a female voice is generally higher than that of a male voice. GMM is therefore coupled with a Pitch Detection Algorithm (PDA), and the gender of the speaker is identified using the pitch [7].

Pre-processing is an important step in speaker recognition as it improves the performance of the system. It includes down-sampling, which reduces the sampling rate of the signal; a pre-emphasis stage, which decreases the amplitude of low frequency bands and increases the amplitude of high frequency bands; and a silence removal stage. Human speech is not continuous and may contain parts where there is no speech utterance; eliminating such parts increases the speed of identification because the number of frames is greatly reduced.

The PDA uses the autocorrelation method, where autocorrelation is the correlation of a waveform with itself. The PDA estimates the pitch of an irregular periodic signal and hence reduces the time complexity by halving the number of comparisons.
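The autocorrelation method behind the PDA can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure of [7]: the function names, the 60 to 400 Hz search range and the 165 Hz gender threshold are all assumptions made for the example. The pitch is read off the lag at which a frame correlates most strongly with a shifted copy of itself.

```python
import math


def autocorrelation(frame, lag):
    """Correlation of the frame with a copy of itself shifted by `lag` samples."""
    return sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))


def estimate_pitch(frame, sample_rate, f_min=60.0, f_max=400.0):
    """Return the pitch (Hz) at the lag with the strongest autocorrelation.

    f_min and f_max bound the search; both values are assumptions.
    """
    lag_min = int(sample_rate / f_max)
    lag_max = min(int(sample_rate / f_min), len(frame) - 1)
    best_lag = max(range(lag_min, lag_max + 1),
                   key=lambda lag: autocorrelation(frame, lag))
    return sample_rate / best_lag


def guess_gender(pitch_hz, threshold_hz=165.0):
    """Label a voice by pitch; the 165 Hz threshold is an illustrative assumption."""
    return "female" if pitch_hz > threshold_hz else "male"


# Synthetic stand-in for a voiced frame: a pure 200 Hz tone sampled at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 200.0 * n / sample_rate) for n in range(800)]
pitch = estimate_pitch(tone, sample_rate)
label = guess_gender(pitch)
```

For this synthetic tone the strongest lag in the search range is 40 samples, so the estimate is 8000 / 40 = 200 Hz. Real speech frames are far less regular than a sine wave, so a practical detector would normally add voicing checks and smoothing across frames.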

F. Neural Networks (NN)
A neural network is an information processing system. It consists of processing elements that are highly interconnected with each other. It is used to solve pattern recognition problems through the process of learning. Approaches include the feed-forward neural network (FFNN) [3] and the probabilistic neural network (PNN) [4].

Feed-forward neural networks are among the earliest neural networks and are very simple to implement. A feed-forward network consists of three layers: an input layer, a hidden layer (if any) and an output layer. The flow of information is unidirectional, i.e. forward from the input layer to the hidden layer and then to the output layer. No cycles or closed loops are formed among the nodes of a feed-forward network.

The probabilistic neural network is an unsupervised feed-forward network. A PNN is composed of four different layers: input, pattern, summation and output. Statistical algorithms can be implemented with the help of a PNN. A Gaussian function can be used as the probabilistic function for each pattern node. The network weights are updated according to the input patterns, and a nearest neighborhood function is then used to classify the patterns. Tables I and II compare the different feature extraction and matching techniques.

TABLE I. COMPARISON OF FEATURE EXTRACTION TECHNIQUES

Sr. No.   Technique   Remarks
1.        LPC         Useful in synthesis of speech. Loss of compression information.
2.        MFCC        Immune to noise. Provides not so good correlation and smooth transition.
3.        FWE         Has a fixed number of feature vector coefficients. Provides more accuracy in vowel dependent speaker recognition.

TABLE II. COMPARISON OF FEATURE MATCHING TECHNIQUES

Sr. No.   Technique                       Remarks
1.        VQ                              Used for text dependent. Clustering technique and formation of a codebook for every speaker. Loss of temporal information causing system inaccuracy.
2.        HMM                             Temporal information is well modeled. Accuracy gets reduced by the various speaker and transmission related effects, and it also does not generalize well.
3.        GMM                             Signifies vocal tract features. Uses probabilistic model. Inefficient to handle high dimensional data.
4.        VQ & GMM: A Hybrid Approach     Used for text dependent. Uses relative index as confidence measure. Increases complexity of the system.
5.        GMM & Pitch Detection Algorithm 50% reduction in processing time. Achieves improved recognition rate. Gives erroneous results if a male has a high pitch or a female has a low pitch.
6.        NN                              Requires less statistical training. Convergence speed is slow, less generalizing performance, problems of over-fitting.

V. MODERN APPROACHES TO SPEAKER RECOGNITION

A. Denoising
The speaker recognition systems that have been proposed aim to recognize the speaker with a high degree of accuracy; however, factors such as noise, the inefficiency of the recording device and other environmental factors pose a challenge to these systems.

The efficiency of such speaker recognition systems can be increased if they are trained on pure voice samples. Pure voice samples can be obtained by subtracting the pure noise from the distorted voice signal; the distortion occurs due to stray pickups, etc. This task is implemented by a denoiser [8], which approximately recovers the pure speech without noise.

B. Wavelet Cepstral Coefficient (WCC)
The MFCC algorithm discussed so far is immune to noise; however, the Fourier transform used in MFCC is localized only in frequency, whereas the wavelet transform is localized in both time and frequency [9]. WCC is robust and can be used in noisy environments along with fuzzy logic systems to increase the speaker recognition accuracy.
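The time and frequency localization of the wavelet transform can be illustrated with a toy signal. The sketch below is not the WCC pipeline of [9]: it assumes a single-level Haar wavelet (chosen only because it is the simplest to write out) and a hypothetical 16-sample signal containing one click. The DFT spreads the click across every frequency bin, while the Haar detail coefficients flag only the position where it occurs.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitude of every DFT bin, computed directly from the definition."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n)
    ]


def haar_dwt(signal):
    """One level of the Haar wavelet transform: (approximation, detail)."""
    scale = 1.0 / math.sqrt(2.0)
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal) - 1, 2)]
    approx = [(a + b) * scale for a, b in pairs]
    detail = [(a - b) * scale for a, b in pairs]
    return approx, detail


# Hypothetical test signal: silence with a single click at sample 10.
click = [0.0] * 16
click[10] = 1.0

magnitudes = dft_magnitudes(click)
approx, detail = haar_dwt(click)

# Indices of the detail coefficients that respond to the click.
active_detail = [k for k, d in enumerate(detail) if abs(d) > 1e-12]
```

Every DFT magnitude equals 1 here, so the spectrum alone does not reveal when the click occurred. The only non-zero detail coefficient is index 5, which covers samples 10 and 11.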

REFERENCES

[1] Subhashini, P. P. S., & Pratap, T. “Text-independent speaker recognition using combined LPC and MFC coefficients”. International Journal of Research in Engineering and Technology, 2014.
[2] Martinez, J., Perez, H., Escamilla, E., & Suzuki, M. M. (2012, February). “Speaker recognition using Mel frequency Cepstral Coefficients (MFCC) and Vector quantization (VQ) techniques”. In Electrical Communications and Computers (CONIELECOMP), 2012 22nd International Conference on (pp. 248-251). IEEE.
[3] Daqrouq, K., & Tutunji, T. A. (2015). “Speaker identification using vowels features through a combined method of formants, wavelets, and neural network classifiers”. Applied Soft Computing, 27, 231-239.
[4] Ahmad, K. S., Thosar, A. S., Nirmal, J. H., & Pande, V. S. (2015, January). “A unique approach in text independent speaker recognition using MFCC feature sets and probabilistic neural network”. In Advances in Pattern Recognition (ICAPR), 2015 Eighth International Conference on (pp. 1-6). IEEE.
[5] Bagul, S. G., & Shastri, R. K. (2013, August). “Text independent speaker recognition system using gmm”. In Human Computer Interactions (ICHCI), 2013 International Conference on (pp. 1-5). IEEE.
[6] Desai, D., & Joshi, M. (2014). “Speaker recognition using MFCC and hybrid model of VQ and GMM”. In Recent Advances in Intelligent Informatics (pp. 53-63). Springer International Publishing.
[7] AboElenein, N. M., Amin, K. M., Ibrahim, M., & Hadhoud, M. M. (2016, May). “Improved text-independent speaker identification system for real time applications”. In Electronics, Communications and Computers (JEC-ECC), 2016 Fourth International Japan-Egypt Conference on (pp. 58-62). IEEE.
[8] Tkachenko, M., Yamshinin, A., Lyubimov, N., Kotov, M., & Nastasenko, M. (2017, September). “Speech Enhancement for Speaker Recognition Using Deep Recurrent Neural Networks”. In International Conference on Speech and Computer (pp. 690-699). Springer, Cham.
[9] Rathor, S., & Jadon, R. S. (2017, July). “Text independent speaker recognition using wavelet cepstral coefficient and butter worth filter”. In 2017 8th International Conference on Computing, Communication and Networking Technologies (ICCCNT) (pp. 1-5). IEEE.
