Biometrics Recognition Using Deep Learning: A Survey
Shervin Minaee (Snapchat, Machine Learning R&D)
Amirali Abdolrashidi (University of California, Riverside)
Hang Su (Facebook Research)
Mohammed Bennamoun (The University of Western Australia)
David Zhang (Chinese University of Hong Kong)

Abstract Deep learning-based models have been very successful in achieving state-of-the-art results in many computer vision, speech recognition, and natural language processing tasks in the last few years. These models seem a natural fit for handling the ever-increasing scale of biometric recognition problems, from cellphone authentication to airport security systems. Deep learning-based models have increasingly been leveraged to improve the accuracy of different biometric recognition systems in recent years. In this work, we provide a comprehensive survey of more than 120 promising works on biometric recognition (including face, fingerprint, iris, palmprint, ear, voice, signature, and gait recognition), which deploy deep learning models, and show their strengths and potential in different applications. For each biometric, we first introduce the available datasets that are widely used in the literature and their characteristics. We then discuss several promising deep learning works developed for that biometric, and show their performance on popular public benchmarks. We also discuss some of the main challenges of using these models for biometric recognition, and possible future directions in which research in this area is headed.

Keywords Biometric Recognition, Deep Learning, Face Recognition, Fingerprint Recognition, Iris Recognition, Palmprint Recognition.

1 Introduction

Biometric features¹ hold a unique place when it comes to recognition, authentication, and security applications [1], [2]. They cannot get lost, unlike token-based features such as keys and ID cards, and they cannot be forgotten, unlike knowledge-based features, such as passwords or answers to security questions [3]. In addition, they are almost impossible to perfectly imitate or duplicate. Even though there have been recent attempts to generate and forge various biometric features [4], [5], there have also been methods proposed to distinguish fake biometric features from authentic ones [6], [7], [8]. Many biometric features also change very little over time. For these reasons, they have been utilized in many applications, including cellphone authentication, airport security, and forensic science. Biometric features can be physiological, which are features possessed by any person, such as fingerprints [9], palmprints [10], [11], facial features [12], ears [13], irises [14], [15], and retinas [16], or behavioral, which are apparent in a person's interaction with the environment, such as signatures [17], gaits [18], and keystroke [19]. Voice/Speech contains both behavioral features, such as accent, and physiological features, such as voice pitch [20].

Face and fingerprint are arguably the most commonly used physiological biometric features. Fingerprint is the oldest, dating back to 1893 when it was used to convict a murder suspect in Argentina [21]. Face

¹ In this paper, we commonly refer to a biometric characteristic as biometric for short.
to the manner of walking, which has been gaining more attention in recent years. Due to the involvement of many joints and body parts in the process of walking, gait can also be used to uniquely identify a person from a distance [30]. Samples of various biometrics are shown in Figure 1.

Traditionally, the biometric recognition process involved several key steps. Figure 2 shows the block-diagram of traditional biometric recognition systems. First, the image data are acquired via cameras or optical sensors, and are then pre-processed so that the algorithm works on as much useful data as possible. Then, features are extracted from each image. Classical biometric recognition works were mostly based on hand-crafted features (designed by computer vision experts) to work with a certain type of data [37], [38], [39]. Many of the hand-crafted features were based on the distribution of edges (SIFT [40], HOG [41]), or were derived from a transform domain, such as Gabor [42], Fourier [43], and wavelet [44]. Principal component analysis is also used in many works to reduce the dimensionality of the features [45], [46]. Once the features are extracted, they are fed into a classifier to perform recognition.

Fig. 2 The block-diagram of most classical biometric recognition algorithms.

Many challenges arise in a traditional biometric recognition task. For example, the hand-crafted features that are suitable for one biometric will not necessarily perform well on others. Therefore, it would take a great number of experiments to find and choose the most efficient set of hand-crafted features for a certain biometric. Also, many of the classical models were based on multi-class SVMs trained in a one-vs-one fashion, which does not scale well when the number of classes is large.

However, a paradigm shift started to occur in 2012, when a deep learning-based model, AlexNet [47], won the ImageNet competition by a large margin. Since then, deep learning models have been applied to a wide range of problems in computer vision and Natural Language Processing (NLP), and achieved promising results. Not surprisingly, biometric recognition methods were not an exception, and were taken over by deep learning models (with a delay of a few years). Deep learning based models
provide an end-to-end learning framework, which can jointly learn the feature representation while performing classification/regression. This is achieved through multi-layer neural networks, also known as Deep Neural Networks (DNNs), which learn multiple levels of representations that correspond to different levels of abstraction, and are therefore better suited to uncover underlying patterns of the data (as shown in Figure 3). The idea of a multi-layer neural network dates back to the 1960s [48, 49]. However, their feasible implementation was a challenge in itself, as the training time would be too large (due to the lack of powerful computers at that time). Progress in processor technology, and especially the development of General-Purpose GPUs (GPGPUs), as well as the development of new techniques (such as Dropout) for training neural networks with a lower chance of over-fitting, enabled scientists to train very deep neural networks much faster [50]. The main idea of a neural network is to pass the (raw) data through a series of interconnected neurons or nodes, each of which emulates a linear or non-linear function based on its own weights and biases. These weights and biases are updated during training through back-propagation of gradients from the output [51]. The gradients usually result from the difference between the expected output and the actual current output, and the updates aim to minimize a loss function or cost function (the difference between the predicted and actual outputs according to some metric) [52]. We will talk about different deep architectures in more detail in Section 2.

Using deep models for biometric recognition, one can learn a hierarchy of concepts as we go deeper in the network. Looking at face recognition for example, as shown in Figure 3, starting from the first few layers of the deep neural network, we can observe learned patterns similar to the Gabor feature (oriented edges with different scales). The next few layers can learn more complex texture features and parts of the face. The following layers are able to capture more complex patterns, such as a high-bridged nose and big eyes. Finally, the last few layers can learn very abstract concepts and certain facial attributes (such as smile, roar, and even eye color).

In this paper, we present a comprehensive review of the recent advances in biometric recognition using deep learning frameworks. For each work, we provide an overview of the key contributions, network architecture, and loss functions developed to push state-of-the-art performance in biometric recognition. We have gathered more than 150 papers, which appeared between 2014 and 2019, in leading computer vision, biometric recognition, and machine learning conferences and journals. For each biometric, we provide some of the most popular datasets used by the computer vision community, and the most promising state-of-the-art deep learning works utilized in the area of biometric recognition. We then provide a quantitative analysis of well-known models for each biometric. Finally, we explore the challenges associated with deep learning-based methods in biometric recognition and research opportunities for the future.

The goal of this survey is to help new researchers in this field to navigate through the progress of deep learning-based biometric recognition models, particularly with the growing interest in multi-modal biometric systems [54]. Compared to the existing literature, the main contributions of this paper are as follows:

– To the best of our knowledge, this is the only review paper which provides an overview of eight popular biometrics proposed before and in 2019, including face, fingerprint, iris, palmprint, ear, voice, signature, and gait.
– We cover the contemporary literature with respect to this area. We present a comprehensive review of more than 150 methods, which have appeared since 2014.
– We provide a comprehensive review and an insightful analysis of different aspects of biometric recognition using deep learning, including the training data, the choice of network architectures, training strategies, and their key contributions.
– We provide a comparative summary of the properties and performance of the reviewed methods for biometric recognition.
– We provide seven challenges and potential future directions for deep learning-based biometric recognition models.

The structure of the rest of this paper is as follows. In Section 2, we provide an overview of popular deep neural network architectures, which serve as the backbone of many biometric recognition algorithms, including convolutional neural networks, recurrent neural networks, auto-encoders, and generative adversarial networks. Then in Section 3, we provide an introduction to each of the eight biometrics (Face, Fingerprint, Iris, Palmprint, Ear, Voice, Signature, and Gait), some of the popular datasets for each of them, as well as the promising deep learning based works developed for them. The quantitative results and experimental performance of these models for all biometrics are provided in Section 4. Finally, in Section 5, we explore the challenges and future directions for deep learning-based biometric recognition.
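The Introduction describes training as passing data through layers of weighted nodes and updating the weights and biases by back-propagating the gradient of a loss. The following is a minimal sketch of that loop. The toy 2-4-1 network, the XOR data, the mean squared error loss, and the fixed learning rate are all illustrative assumptions and are not taken from any of the surveyed works.

```python
import numpy as np


def train_tiny_network(steps=5000, learning_rate=0.1, seed=0):
    """Train a toy 2-4-1 network on XOR with full-batch gradient descent.

    Returns the list of loss values, one per training step.
    """
    rng = np.random.default_rng(seed)
    x = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
    y = np.array([[0.0], [1.0], [1.0], [0.0]])

    # Weights and biases of the hidden layer and the output layer.
    w1 = rng.normal(size=(2, 4))
    b1 = np.zeros(4)
    w2 = rng.normal(size=(4, 1))
    b2 = np.zeros(1)

    losses = []
    for _ in range(steps):
        # Forward pass: tanh hidden layer followed by a sigmoid output.
        hidden = np.tanh(x @ w1 + b1)
        output = 1.0 / (1.0 + np.exp(-(hidden @ w2 + b2)))
        losses.append(float(np.mean((output - y) ** 2)))

        # Backward pass: chain rule from the loss back to every parameter.
        grad_output = 2.0 * (output - y) / len(x)
        grad_z2 = grad_output * output * (1.0 - output)
        grad_w2 = hidden.T @ grad_z2
        grad_b2 = grad_z2.sum(axis=0)
        grad_hidden = grad_z2 @ w2.T
        grad_z1 = grad_hidden * (1.0 - hidden ** 2)
        grad_w1 = x.T @ grad_z1
        grad_b1 = grad_z1.sum(axis=0)

        # Gradient descent update.
        w1 -= learning_rate * grad_w1
        b1 -= learning_rate * grad_b1
        w2 -= learning_rate * grad_w2
        b2 -= learning_rate * grad_b2
    return losses


losses = train_tiny_network()
```

Practical systems replace the hand-written backward pass with automatic differentiation and use mini-batches, but the structure (forward pass, loss, gradients, update) is the same.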
Fig. 3 Illustration of the hierarchical concepts learned by a deep learning model trained for face recognition. Courtesy of [53].

2 Deep Neural Network Overview

In this section, we provide an overview of some of the most promising deep learning architectures used by the computer vision community, including convolutional neural networks (CNN) [55], recurrent neural networks (RNN) and one of their specific versions called long short term memory (LSTM) [56], auto-encoders, and generative adversarial networks (GANs) [57]. It is noteworthy that with the popularity of deep learning in recent years, several other deep neural architectures have been proposed (such as Transformers, Capsule Network, GRU, and spatial transformer networks), which we will not cover in this work.

2.1 Convolutional Neural Networks (CNN)

Convolutional Neural Networks (CNN) (inspired by the mammalian visual cortex) are one of the most successful and widely used architectures in the deep learning community (especially for computer vision tasks). CNN was initially proposed by Fukushima in a seminal paper, called "Neocognitron" [58], based on the model of the human visual system proposed by Nobel laureates Hubel and Wiesel. Later on, Yann LeCun and colleagues developed an optimization framework (based on back-propagation) to efficiently learn the model weights for a CNN architecture [55]. The block-diagram of one of the first CNN models developed by LeCun et al. is shown in Figure 4.

CNNs mainly consist of three types of layers: convolutional layers, where a sliding kernel is applied to the image (as in the image convolution operation) in order to extract features; nonlinear layers (usually applied in an element-wise fashion), which apply an activation function on the features in order to enable the modeling of non-linear functions by the network; and pooling layers, which take a small neighborhood of the feature map and replace it with some statistical information (mean, max, etc.) of the neighborhood. Nodes in the CNN layers are locally connected; that is, each unit in a layer receives input from a small neighborhood of the previous layer (known as the receptive field). The main advantage of CNNs is the weight sharing mechanism through the use of the sliding kernel, which goes through the images and aggregates the local information to extract the features. Since the kernel weights are shared across the entire image, CNNs have a significantly smaller number of parameters than a similar fully connected neural network. Also, by stacking multiple convolution layers, the higher-level layers learn features from increasingly wider receptive fields.

CNNs have been applied to various computer vision tasks such as: semantic segmentation [59], medical image segmentation [60], object detection [61], super-resolution [62], image enhancement [63], caption generation for images and videos [64], and many more. Some of the most well-known CNN architectures include AlexNet [47], ZFNet [65], VGGNet [66], ResNet [67], GoogLeNet [68], MobileNet [69], and DenseNet [70].
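The three layer types described in Section 2.1 (convolution with a shared sliding kernel, an element-wise nonlinearity, and pooling over a small neighborhood) can be sketched in a few lines of NumPy. This is only an illustration: the 6x6 input, the 3x3 edge kernel, ReLU as the activation, and 2x2 max pooling are assumptions chosen for readability, and the explicit loops stand in for the optimized routines used in practice.

```python
import numpy as np


def conv2d(image, kernel):
    """Slide one shared kernel over a 2-D image (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Each output unit only sees a kh x kw receptive field.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


def relu(x):
    """Element-wise nonlinearity."""
    return np.maximum(x, 0.0)


def max_pool(x, size=2):
    """Replace each size x size neighborhood by its maximum."""
    h = x.shape[0] // size * size
    w = x.shape[1] // size * size
    blocks = x[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))


image = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 "image"
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)          # horizontal-gradient kernel

conv_out = conv2d(image, kernel)                   # shape (4, 4)
features = max_pool(relu(conv_out))                # shape (2, 2)

# Weight sharing: the convolution uses 9 weights, while a fully connected
# layer mapping the same 36 inputs to the same 16 outputs would need 576.
conv_params = kernel.size
dense_params = image.size * conv_out.size
```

The parameter counts at the end illustrate the weight-sharing argument made above: the kernel is reused at every position, so the number of weights does not grow with the image size.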
strated in Figure 7. Auto-encoders are usually trained

[Figure: auto-encoder block diagram. Labels: Original Image, Encoder gφ, Latent Representation, Decoder fθ, Reconstructed Output, x.]

[Figure: GAN block diagram. Labels: Latent, Labels, Generator Network, Fake Images, Real Images, Discriminator Network.]
weights of the pre-trained model are not adapted to the new task. In the other approach, the whole network, or a subset of it, is fine-tuned on the new task. Therefore, the pre-trained model weights are treated as the initial values for the new task, and are updated during the training stage.

Many of the deep learning-based models for biometric recognition are based on transfer learning (except for voice because of the difference in the nature of the data, and face because of the availability of large-scale datasets), which we are going to explain in the following section.

3 Deep Learning Based Works on Biometric Recognition

3.1 Face Recognition

Face is perhaps one of the most popular biometrics (and the most researched one during the last few years). It has a wide range of applications, from security cameras in airports and government offices, to daily usage for cellphone authentication (such as FaceID in iPhones). Various hand-crafted features were used for recognition in the past, such as LBP, Gabor Wavelet, SIFT, HoG, and also sparsity-based representations [76], [77], [78], [79], [80]. Both 2D and 3D versions of faces are used for recognition [81], but most people have focused on 2D face recognition so far. One of the main challenges for facial recognition is the face's susceptibility to change over time due to aging or external factors, such as scars or medical conditions [24]. We will introduce some of the most widely used face recognition datasets in the next section, and then talk about the promising deep learning-based face recognition models.

3.1.1 Face Datasets

Due to the wide application of face recognition in industry, a large number of datasets have been proposed for that purpose. We will introduce some of the most popular ones here.

Yale and Yale Face Database B: The Yale face dataset is perhaps one of the earliest face recognition datasets [82]. It contains 165 grayscale images of 15 individuals. There are 11 images per subject, one per different facial expression or configuration (center-light, w/glasses, happy, left-light, w/no glasses, normal, right-light, sad, sleepy, surprised, and wink). Its extended version, Yale Face Database B [32], contains 5760 single light source images of 10 subjects, each seen under 576 viewing conditions (9 poses x 64 illumination conditions). For every subject in a particular pose, an image with ambient (background) illumination was also captured. Ten example images from the Yale face B dataset are shown in Figure 9.

CMU Multi-PIE: The CMU Multi-PIE face database contains more than 750,000 images of 337 people [83], [84]. Subjects were imaged under 15 view points and 19 illumination conditions while displaying a range of facial expressions.

Labeled Faces in the Wild (LFW): Labeled Faces in the Wild is a database of face images designed for studying unconstrained face recognition. The database contains more than 13,000 images of faces collected from the web. Each face has been labeled with the name of the person pictured [85]. 1680 of the people pictured have two or more distinct photos in the database. The only constraint on these faces is that they were detected by the Viola-Jones face detector. For more details on this dataset, we refer the readers to the database web-page.

PolyU NIR Face Database: The Biometric Research Centre at The Hong Kong Polytechnic University developed a NIR face capture device and used it to construct a large-scale NIR face database [86]. By using the self-designed data acquisition device, they collected NIR face images from 335 subjects. In each recording, 100 images from each subject are captured, and in total about 34,000 images were collected in the PolyU-NIRFD database.

YouTube Faces: This data set contains 3,425 videos of 1,595 different people. All videos were downloaded from YouTube. An average of 2.15 videos are available for each subject. The goal of this dataset was to produce a large scale collection of videos along with labels indicating the identities of a person appearing in each video [87]. In addition, they published benchmark
tests, intended to measure the performance of video pair-matching techniques on these videos.

VGGFace2: VGGFace2 is a large-scale face recognition dataset [88]. Images are downloaded from Google Image Search and have large variations in pose, age, illumination, ethnicity and profession. It contains 3.31 million images of 9131 subjects (identities), with an average of 362.6 images for each subject. The number of faces per identity varies from 87 to 843.

CASIA-WebFace: The CASIA WebFace facial dataset consists of 453,453 images over 10,575 identities after face detection [89]. This is one of the largest publicly available face datasets.

MS-Celeb: Microsoft Celeb is a dataset of 10 million face images harvested from the Internet for the purpose of developing face recognition technologies, from nearly 100,000 individuals [90].

CelebA: CelebFaces Attributes Dataset (CelebA) is a large-scale face attributes dataset with more than 200K celebrity images [91]. CelebA has a large diversity, large quantities, and rich annotations, including more than 10,000 identities, more than 202,599 face images, and 5 landmark locations and 40 binary attribute annotations per image. The dataset can be employed as the training and test sets for the following computer vision tasks: face attribute recognition, face detection, landmark (or facial part) localization, and face editing & synthesis.

IJB-C: The IJB-C dataset [92] contains about 3,500 identities with a total of 31,334 still facial images and 117,542 unconstrained video frames. The IJB-C testing protocols are designed to test detection, identification, verification and clustering of faces. In the 1:1 verification protocol, there are 19,557 positive matches and 15,638,932 negative matches.

MegaFace: MegaFace Challenge [93] is a publicly available benchmark, which is widely used to test the performance of facial recognition algorithms (for both identification and verification). The gallery set of MegaFace contains over 1 million images from 690K identities collected from Flickr [94]. The probe sets are two existing databases: FaceScrub and FGNet. The FaceScrub dataset contains 106,863 face images of 530 celebrities. The FGNet dataset is mainly used for testing age invariant face recognition, with 1002 face images from 82 persons.

Other Datasets: It is worth mentioning that there are several other datasets whose details we skipped because they are private or less popular, such as DeepFace (Facebook private dataset of 4.4M photos of 4k subjects), NTechLab (a private dataset of 18.4M photos of 200k subjects), FaceNet (Google private dataset of more than 500M photos of more than 10M subjects), WebFaces (a dataset of 80M photos crawled from the web) [93], and Disguised Faces in the Wild (DFW) [95], which contains over 11,000 images of 1,000 identities with variations across different types of disguise accessories.

3.1.2 Deep Learning Works on Face Recognition

There are countless works using deep learning for face recognition. In this survey, we provide an overview of some of the most promising works developed for face verification and/or identification.

In 2014, Taigman and colleagues proposed one of the earliest deep learning works for face recognition in a paper called DeepFace [96], and achieved state-of-the-art accuracy on the LFW benchmark [85], approaching human performance in the unconstrained condition for the first time ever (DeepFace: 97.35% vs. Human: 97.53%). DeepFace was trained on 4 million facial images. This work was a milestone in face recognition, and after that several researchers started using deep learning for face recognition.

In another promising work in the same year, Sun et al. proposed DeepID (Deep hidden IDentity features) [97] for face verification. DeepID features were taken from the last hidden layer of a deep convolutional network, which is trained to recognize about 10,000 face identities in the training set.

In a follow-up work, Sun et al. extended DeepID for joint face identification and verification, called DeepID2 [98]. By training the model for joint identification and verification, they showed that the face identification task increases the inter-personal variations by drawing DeepID2 features extracted from different identities apart, while the face verification task reduces the intra-personal variations by pulling DeepID2 features extracted from the same identity together. For identification, cross-entropy is used as the loss function (as defined in Equation 4), while for verification they proposed to use the loss function of Equation 5 to reduce the intra-class distances on the features and increase the inter-class distances.

$$L_{\mathrm{Ident}}(f, t, \theta_{id}) = -\sum_{i} p_i \log \hat{p}_i \qquad (4)$$

$$L_{\mathrm{Verif}}(f_i, f_j, y_{ij}, \theta_{vr}) = \begin{cases} \frac{1}{2}\|f_i - f_j\|_2^2, & \text{if } y_{ij} = 1 \\ \frac{1}{2}\max\left(1 - \frac{1}{2}\|f_i - f_j\|_2^2,\, 0\right), & \text{otherwise} \end{cases} \qquad (5)$$

As an extension of DeepID2, in DeepID3 [99] Sun et al. proposed a new model which has a higher dimensional hidden representation, and deploys VGGNet and GoogLeNet as the main architectures.
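As a small worked example of the two DeepID2 objectives, the sketch below evaluates the identification loss of Equation 4 and the verification loss of Equation 5 on toy vectors. It follows the equations in the form they appear in this survey (a fixed margin of 1 on half the squared distance); the original DeepID2 paper may parameterize the margin differently. The function names, the small `eps` added for numerical safety, and the toy inputs are assumptions made for illustration.

```python
import numpy as np


def identification_loss(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and predicted probabilities."""
    target = np.asarray(target, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # eps is an illustrative safeguard against log(0), not part of Equation 4.
    return float(-np.sum(target * np.log(predicted + eps)))


def verification_loss(f_i, f_j, same_identity):
    """Pairwise loss in the form of Equation 5 as given in this survey."""
    diff = np.asarray(f_i, dtype=float) - np.asarray(f_j, dtype=float)
    half_sq_dist = 0.5 * float(np.sum(diff ** 2))
    if same_identity:
        # Pull features of the same identity together.
        return half_sq_dist
    # Push features of different identities apart, up to the margin.
    return 0.5 * max(1.0 - half_sq_dist, 0.0)


ident = identification_loss([0.0, 1.0, 0.0], [0.2, 0.7, 0.1])
same_pair = verification_loss([1.0, 0.0], [0.0, 0.0], same_identity=True)
close_impostors = verification_loss([1.0, 0.0], [0.0, 0.0], same_identity=False)
far_impostors = verification_loss([2.0, 0.0], [0.0, 0.0], same_identity=False)
```

For a matching pair the loss grows with the distance, while for a non-matching pair it is positive only when the two features are still close, which is the pull-together and push-apart behavior described above.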
In 2015, FaceNet [100] trained a GoogLeNet model on a large private dataset. This work tried to learn a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. It adopted a triplet loss function based on triplets of roughly aligned matching/non-matching face patches generated by a novel online triplet mining method and achieved good performance on the LFW dataset (99.63%). Given features for a given sample $f(x_i^a)$, a positive sample $f(x_i^p)$ (matching $x_i^a$), and a negative sample $f(x_i^n)$, the triplet loss for a given margin $\alpha$ is defined as Equation 6:

$$L_{\mathrm{triplet}} = \sum_{i} \left[ \|f(x_i^a) - f(x_i^p)\|_2^2 - \|f(x_i^a) - f(x_i^n)\|_2^2 + \alpha \right]_+ \qquad (6)$$

In the same year, Parkhi et al. proposed a model called VGGface [101] (trained on a large-scale dataset collected from the Internet). It trained the VGGNet on this dataset and fine-tuned the networks via a triplet loss function, similar to FaceNet. VGGface obtained a very high accuracy rate of 98.95%.

In 2016, Liu and colleagues developed a "Large-Margin Softmax Loss" for CNNs [102], and showed its promise on multiple computer vision datasets, including LFW. They claimed that cross-entropy does not explicitly encourage discriminative learning of features, and proposed a generalized large-margin softmax loss, which explicitly encourages intra-class compactness and inter-class separability between learned features.

In the same year, Wen et al. proposed a new supervision signal, called "center loss", for the face recognition task [103]. The center loss simultaneously learns a center for the deep features of each class and penalizes the distances between the deep features and their corresponding class centers. With the joint supervision of softmax loss and center loss, they trained a CNN to obtain deep features that satisfy the two key learning objectives, inter-class dispersion and intra-class compactness, as much as possible.

In another work in 2016, Sun et al. proposed a face recognition model using a convolutional network with sparse neural connections [104]. This sparse ConvNet is learned in an iterative fashion, where each time one additional layer is sparsified and the entire model is re-trained given the initial weights learned in previous iterations (they found that training the sparse ConvNet from scratch usually fails to find good solutions for face recognition).

In 2017, in [105], Zhang and colleagues developed a range loss to reduce the overall intra-personal variations while simultaneously increasing inter-personal differences. In the same year, Ranjan and colleagues developed an "L2-constraint softmax loss function" and used it for face verification [106]. This loss function restricts the feature descriptors to lie on a hyper-sphere of a fixed radius. This work achieved state-of-the-art performance on the LFW dataset at the time, with an accuracy of 99.78%. In [107], Liu and colleagues developed a face recognition model based on the intuition that the cosine distance of face features in high-dimensional space should be close enough within one class and far away across categories. They proposed the congenerous cosine (COCO) algorithm to simultaneously optimize the cosine similarity among data.

In the same year, Liu et al. developed SphereFace [108], a deep hypersphere embedding for face recognition. They proposed an angular softmax (A-Softmax) loss function that enables CNNs to learn angularly discriminative features. Geometrically, A-Softmax loss can be viewed as imposing discriminative constraints on a hypersphere manifold, which intrinsically matches the prior that faces also lie on a manifold. They showed promising face recognition accuracy on the LFW, MegaFace, and YouTube Faces databases.

In 2018, in [109], Wang et al. developed a simple and geometrically interpretable objective function, called additive margin Softmax (AM-Softmax), for deep face verification. This work is heavily inspired by two previous works, Large-margin Softmax [102] and Angular Softmax [108].

CosFace [110] and ArcFace [111] are two other promising face recognition works developed in 2018. In [110], Wang et al. proposed a novel loss function, namely large margin cosine loss (LM-CL). More specifically, they reformulate the softmax loss as a cosine loss by L2 normalizing both features and weight vectors to remove radial variations, based on which a cosine margin term is introduced to further maximize the decision margin in the angular space. As a result, minimum intra-class variance and maximum inter-class variance are achieved by virtue of normalization and cosine decision margin maximization.

Ring-Loss [112] is another work focused on designing a new loss function. It applies soft normalization, gradually learning to constrain the norm to the scaled unit circle while preserving convexity, which leads to more robust features. The comparison of features learned by regular softmax and by the Ring-loss function is shown in Figure 10.

AdaCos [113], P2SGrad [114], UniformFace [115], and AdaptiveFace [116] are among the most promising works proposed in 2019. In AdaCos [113], Zhang et al. proposed a novel cosine-based softmax loss, AdaCos, which is hyperparameter-free and leverages an adaptive scale parameter to automatically strengthen the train-
[Figure: face recognition methods and loss functions named in the figure: Robust Coding, Joint Bayesian, Tom-vs-Pete, Sparsity-based, Pose Robust, Metric Learning, DeepFace, DeepID, DeepID2, DeepID3, FaceNet, VGGface, Range Loss, L2-softmax, Sparse Neural Net, AMS Loss, Arcface, CoCo Loss, AdaCos, P2SGrad, CosFace, UniformFace; Softmax Loss, Marginal Loss.]
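Two of the face recognition objectives discussed in Section 3.1.2 are easy to state in code: the triplet loss of Equation 6 and a cosine-margin softmax in the spirit of CosFace. The sketch below is illustrative rather than a reproduction of either paper. The scale value of 30, the margin values, the function names, and the toy inputs are assumptions, and the cosine-margin form (subtracting a margin from the target-class cosine before a scaled softmax) is the commonly cited formulation rather than something spelled out in this survey.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Sum over triplets of [d(a, p)^2 - d(a, n)^2 + margin]_+ (Equation 6)."""
    pos_dist = np.sum((anchor - positive) ** 2, axis=1)
    neg_dist = np.sum((anchor - negative) ** 2, axis=1)
    return float(np.sum(np.maximum(pos_dist - neg_dist + margin, 0.0)))


def cosine_margin_loss(features, weights, labels, scale=30.0, margin=0.35):
    """Softmax cross-entropy on L2-normalized features and class weights.

    A margin is subtracted from the cosine of the true class, so a sample
    must be closer (in angle) to its own class than to the others.
    """
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cosine = f @ w.T                                  # (samples, classes)
    rows = np.arange(len(labels))
    logits = scale * cosine
    logits[rows, labels] = scale * (cosine[rows, labels] - margin)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[rows, labels].mean())


anchor = np.array([[0.0, 0.0]])
positive = np.array([[0.0, 1.0]])
easy_triplet = triplet_loss(anchor, positive, np.array([[3.0, 0.0]]))
hard_triplet = triplet_loss(anchor, positive, np.array([[1.0, 0.0]]))

features = np.array([[1.0, 0.0], [0.0, 2.0]])
class_weights = np.array([[1.0, 0.0], [0.0, 1.0]])
labels = np.array([0, 1])
with_margin = cosine_margin_loss(features, class_weights, labels, margin=0.35)
without_margin = cosine_margin_loss(features, class_weights, labels, margin=0.0)
rescaled = cosine_margin_loss(10.0 * features, class_weights, labels, margin=0.35)
```

The triplet loss is zero once the negative is farther from the anchor than the positive by at least the margin. In the cosine-margin loss, normalization removes the effect of feature norm (the "radial variations" mentioned above), so rescaling the features leaves the loss unchanged.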
with various levels of pressure to generate significant intra-class variations.

NIST Fingerprint Dataset: NIST SD27 consists of 258 latent fingerprints and corresponding reference fingerprints [123].

3.2.2 Deep Learning Works on Fingerprint Recognition

There have been numerous works on using deep learning for fingerprint recognition. Here we provide a summary of some of the prominent works in this area.

In [124], Darlow et al. proposed a fingerprint minutiae extraction algorithm based on deep learning models, called MENet, and achieved promising results on fingerprint images from FVC datasets. In [125], Tang and colleagues proposed another deep learning-based model for fingerprint minutiae extraction, called FingerNet. This model jointly performs feature extraction, orientation estimation, and segmentation, and uses them to estimate the minutiae maps. The block-diagram of this model is shown in Figure 13.

In another work [126], Lin and Kumar proposed a multi-view deep representation (based on CNNs) for contact-less and partial 3D fingerprint recognition. The proposed model includes one fully convolutional network for fingerprint segmentation and three Siamese networks to learn multi-view 3D fingerprint feature representation. They show promising results on several 3D fingerprint databases. In [127], the authors develop a fingerprint texture learning method using a deep learning framework. They evaluate their models on several benchmarks, and achieve verification accuracies of 100%, 98.65%, 100% and 98% on the four databases of PolyU2D, IITD, CASIA-BLU and CASIA-WHT, respectively. In [128], Minaee et al. proposed a deep transfer learning approach to perform fingerprint recognition with very high accuracy. They fine-tuned a pre-trained ResNet model on a popular fingerprint dataset, and were able to achieve a very high recognition rate.

In [129], Lin and Kumar proposed a multi-Siamese network to accurately match contactless to contact-based fingerprint images. In addition to the fingerprint images, hand-crafted fingerprint features, e.g., minutiae and core point, are also incorporated into the proposed architecture. This multi-Siamese CNN is trained using the fingerprint images and extracted features.

There are also some works using deep learning models for fingerprint segmentation. In [130], Stojanovic and colleagues proposed a fingerprint ROI segmentation algorithm based on convolutional neural networks. In another work [131], Zhu et al. proposed a new latent fingerprint segmentation method based on convolutional neural networks ("ConvNets"). The latent fingerprint segmentation problem is formulated as a classification problem, in which a set of ConvNets are trained to classify each patch as either fingerprint or background. Then, a score map is calculated based on the classification results to evaluate the possibility of a pixel belonging to the fingerprint foreground. Finally, a segmentation mask is generated by thresholding the score map and used to delineate the latent fingerprint boundary.

There have also been some works on fake fingerprint detection. In [132], Kim et al. proposed a fingerprint liveness detection method based on statistical features learned from a deep belief network (DBN). This method achieves good accuracy on various sensor datasets of the LivDet2013 test. In [133], Nogueira and colleagues proposed a model to detect fingerprint liveness (whether fingerprints are real or fake) using a convolutional neural network, which achieved an accuracy of 95.5% on the fingerprint liveness detection competition 2015.

There have also been some works on using generative models for fingerprint image generation. In [134], Minaee et al. proposed an algorithm for fingerprint image generation based on an extension of GAN, called "Connectivity Imposed GAN". This model adds the total variation of the generated image to the GAN loss function, to promote the connectivity of generated fingerprint images. In [135], Tabassi et al. developed a framework to synthesize altered fingerprints whose characteristics are similar to true altered fingerprints, and used them to train a classifier to detect "Fingerprint alteration/obfuscation presentation attack" (i.e. intentional
12 Shervin Minaee, Amirali Abdolrashidi, Hang Su, Mohammed Bennamoun, David Zhang
tamper or damage to the real friction ridge patterns to avoid identification).

Fig. 13 The block diagram of the proposed FingerNet model for minutiae extraction. Courtesy of [125].

3.3 Iris Recognition

Iris images contain a rich set of features embedded in their texture and patterns which do not change over time, such as rings, corona, ciliary processes, freckles, and the striated trabecular meshwork of chromatophore and fibroblast cells, which is the most prevailing under visible light [136]. Iris recognition has gained a lot of attention in recent years in different security-related fields.

John Daugman developed one of the first modern iris recognition frameworks using the 2D Gabor wavelet transform [35]. Iris recognition started to rise in popularity in the 1990s. In 1994, Wildes et al. [137] introduced a device using iris recognition for personnel authentication. After that, many researchers started looking at the iris recognition problem. Early works used a variety of methods to extract hand-crafted features from the iris. Williams et al. [138] converted all iris entries to an “IrisCode” and used the Hamming distance between an input iris image’s IrisCode and those of the irises in the database as a metric for recognition. In [139], the authors proposed an iris recognition system based on “deep scattering convolutional features”, which achieved a high accuracy rate on the IIT Delhi dataset. This work does not strictly use deep learning, but uses a deep scattering convolutional network to extract hierarchical features from the image. The output images at different nodes of the scattering network denote the transformed image along different orientations and scales. The transformed images of the first and second layers of the scattering transform for a sample iris image are shown in Figure 14. These images are derived by applying a bank of filters of 5 different scales and 6 orientations.

It is worth mentioning that many of the classical iris recognition models perform several pre-processing steps such as iris detection, normalization, and enhancement, as shown in Figure 15. They then extract features from the normalized or enhanced image. Many of the modern works on iris recognition skip normalization and enhancement, and yet they are still able to achieve very high recognition accuracy. One reason is the ability of deep models to capture high-level semantic features from original iris images, which are discriminative enough to perform well for iris recognition.

3.3.1 Iris Datasets

Various datasets have been proposed for iris recognition in the past. Some of the most popular ones include:

CASIA-Iris-1000 Database: CASIA-Iris-1000 contains 20,000 iris images from 1,000 subjects, which were collected using an IKEMB-100 camera. The main sources of intra-class variations in CASIA-Iris-1000 are eyeglasses and specular reflections [142].

UBIRIS Dataset: The UBIRIS database has two distinct versions, UBIRIS.v1 and UBIRIS.v2. The first version of this database is composed of 1877 images collected from 241 eyes in two distinct sessions. It simulates less constrained imaging conditions [143]. The second version of the UBIRIS database has over 11000 images (and is continuously growing) and more realistic noise factors.

IIT Delhi Iris Dataset: The IIT Delhi iris database contains 2240 iris images captured from 224 different people. The resolution of these images is 320x240 pixels [144]. Iris images in this dataset have variable color distribution and different (iris) sizes.

ND Datasets: ND-CrossSensor-Iris-2013 consists of two iris databases, taken with two iris sensors: LG2200 and LG4000. The LG2200 dataset consists of 116,564
Fig. 14 The images from the first (on the left) and second (on the right) layers of the scattering transform.
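As a rough illustration of the IrisCode matching described in the overview of Section 3.3, the following Python sketch compares binary codes by their normalized Hamming distance and returns the closest enrolled identity. The optional occlusion masks, the `identify` helper, and the 0.32 decision threshold are illustrative assumptions and are not taken from [138].

```python
from typing import Dict, Optional, Sequence, Tuple


def hamming_distance(code_a: Sequence[int], code_b: Sequence[int],
                     mask_a: Optional[Sequence[int]] = None,
                     mask_b: Optional[Sequence[int]] = None) -> float:
    """Fraction of disagreeing bits among positions valid in both masks."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same length")
    n = len(code_a)
    # Assumption: a mask bit of 1 marks a usable (non-occluded) position.
    mask_a = mask_a if mask_a is not None else [1] * n
    mask_b = mask_b if mask_b is not None else [1] * n
    valid = 0
    differing = 0
    for a, b, ma, mb in zip(code_a, code_b, mask_a, mask_b):
        if ma and mb:
            valid += 1
            differing += int(a != b)
    if valid == 0:
        raise ValueError("no valid bits to compare")
    return differing / valid


def identify(probe: Sequence[int], gallery: Dict[str, Sequence[int]],
             threshold: float = 0.32) -> Tuple[Optional[str], float]:
    """Return the closest gallery identity, or None if it is too far away."""
    best_id, best_dist = None, float("inf")
    for identity, code in gallery.items():
        dist = hamming_distance(probe, code)
        if dist < best_dist:
            best_id, best_dist = identity, dist
    if best_dist <= threshold:  # hypothetical acceptance threshold
        return best_id, best_dist
    return None, best_dist


# Toy 8-bit codes; real iris codes are far longer.
gallery = {"alice": [1, 0, 1, 1, 0, 0, 1, 1], "bob": [0, 1, 0, 0, 1, 1, 0, 1]}
probe = [1, 0, 1, 1, 0, 0, 1, 0]
match = identify(probe, gallery)
```

A smaller distance means that more bits agree, so the probe is assigned to the enrolled code with the lowest distance, provided that distance stays below the chosen threshold.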
identification and verification problems. In [152], Hofbauer and colleagues proposed a CNN-based algorithm for segmentation of iris images, which can result in higher accuracies than previous models. In another work [153], Ahmad and Fuller developed an iris recognition model based on a triplet network, called ThirdEye. Their work directly uses the segmented, un-normalized iris images, and is shown to achieve equal error rates of 1.32%, 9.20%, and 0.59% on the ND-0405, UbirisV2, and IITD datasets, respectively. In a more recent work [154], Minaee and colleagues proposed an algorithm for iris recognition based on a deep transfer learning approach. They trained a CNN model (by fine-tuning a pre-trained ResNet model) on an iris dataset, and achieved very accurate recognition on the test set.

With the rise of deep generative models, there have been works that apply them to iris recognition. In [155], Minaee et al. proposed an algorithm for iris image generation based on a convolutional GAN, which can generate realistic iris images. These images can be used for augmenting the training set, resulting in better feature representation and higher accuracy. Four sample iris images generated by this work (over different training epochs) are shown in Figure 17.

Fig. 17 The generated iris images for 4 input latent vectors, over 140 epochs (on every 10 epochs), using the trained model on IIT Delhi Iris database. Courtesy of [155].

In [156], Lee and colleagues proposed a data augmentation technique based on GAN to augment the training data for iris recognition, resulting in a higher accuracy rate. They claim that traditional data augmentation techniques such as geometric transformations and brightness adjustment result in samples with very high correlation with the original ones, but using augmentation based on a conditional generative adversarial network can result in higher test accuracy.

3.4 Palmprint Recognition

Palmprint is another biometric which is gaining more attention recently. In addition to minutiae features, palmprints also consist of geometry-based features, delta points, principal lines, and wrinkles [11,157]. Each part of a palmprint has different features, including texture, ridges, lines, and creases. An advantage of palmprints is that the creases in a palmprint virtually do not change over time and are easy to extract [158]. However, sampling palmprints requires special devices, making their collection not as easy as other biometrics such as fingerprint, iris, and face. Classical works on palmprint recognition have explored a wide range of hand-crafted features such as PCA and ICA [159], Fourier transform [160], wavelet transform [161], line feature matching [162], and deep scattering features [163].

3.4.1 Palmprint Datasets

Several datasets have been proposed for palmprint recognition. Some of the most widely used datasets include:

PolyU Multispectral Palmprint Dataset: The images from the PolyU dataset were collected from 250 volunteers, including 195 males and 55 females. In total, the database contains 6,000 images from 500 different palms for one illumination [34]. Samples were collected in two separate sessions. In each session, the subject was asked to provide 6 images for each palm. Therefore, 24 images of each illumination from 2 palms were collected from each subject.

CASIA Palmprint Database: The CASIA Palmprint Image Database contains 5,502 palmprint images captured from 312 subjects. For each subject, they collect palmprint images from both left and right palms [164]. All palmprint images are 8-bit gray-level JPEG files captured by their self-developed palmprint recognition device.

IIT Delhi Touchless Palmprint Database: The IIT Delhi palmprint image database consists of the hand images collected from the students and staff at IIT Delhi, New Delhi, India [165]. This database has been acquired using a simple and touchless imaging setup. The currently available database is from 235 users. Seven images from each subject, from each of the left and right hands, are acquired in varying hand poses. Each image has a size of 800x600 pixels.

3.4.2 Deep Learning Works on Palmprint Recognition

In [166], Xin et al. proposed one of the early works on palmprint recognition using a deep learning framework. The authors built a deep belief net by top-down unsupervised training, and tuned the model parameters toward a robust accuracy on the validation set. Their experimental analysis showed a performance gain over classical models that are based on LBP, PCA, and other hand-crafted features.

In another work, Samai et al. proposed a deep learning-based model for 2D and 3D palmprint recognition [167]. They proposed an efficient biometric identification system combining 2D and 3D palmprint by fusing them at the matching score level. To exploit the 3D palmprint data, they converted them to grayscale images by using the Mean Curvature (MC) and the Gauss Curvature (GC). They then extracted features from the images using the Discrete Cosine Transform Net (DCT Net).

Zhong et al. proposed a palmprint recognition algorithm using a Siamese network [168]. Two VGG-16 networks (with shared parameters) were employed to extract features for two input palmprint images, and another network is used on top of them to directly obtain the similarity of the two input palmprints according to their convolutional features. This method achieved an Equal Error Rate (EER) of 0.2819% on the PolyU dataset. In [169], Izadpanahkakhk et al. proposed a transfer learning approach towards palmprint verification, which jointly extracts regions of interest and features from the images. They use a pre-trained convolutional network, along with an SVM, to make predictions. They achieved an IoU score of 93% and an EER of 0.0125 on the Hong Kong Polytechnic University Palmprint (HKPU) database.

In [170], Shao and Zhong proposed a few-shot palmprint recognition model using a graph neural network. In this work, the palmprint features extracted by a convolutional neural network are processed into nodes in the GNN. The edges in the GNN are used to represent similarities between image nodes. In a more recent work [171], Shao and colleagues proposed a deep palmprint recognition approach by combining hash coding and knowledge distillation. A deep hashing network is used to convert palmprint images to binary codes to save storage space and speed up the matching process. The architecture of the proposed deep hashing network is shown in Figure 18. They also proposed a database

palmprint recognition based on transfer convolutional autoencoder. Convolutional autoencoders were first used to extract low-dimensional features. A discriminator was then introduced to reduce the gap between the two domains. The autoencoders and discriminator were alternately trained, and finally the features with the same distribution were extracted.

In [173], Zhao and colleagues proposed a joint deep convolutional feature representation for hyperspectral palmprint recognition. A CNN stack is constructed to extract features from the entire set of spectral bands and generate a joint convolutional feature. They evaluated their model on a hyperspectral palmprint dataset consisting of 53 spectral bands with 110,770 images, and achieved an EER of 0.01%. In [174], Xie et al. proposed a gender classification framework using a convolutional neural network on palmprint images. They fine-tuned the pre-trained VGGNet on a palmprint dataset and showed that the proposed structure could achieve good performance for gender classification.

3.5 Ear Recognition

Ear recognition is a more recent problem that scientists are exploring, and the volume of biometric recognition works involving ears is expected to increase in the coming years. One of the more prominent aspects of ear recognition is the fact that the subject can be photographed from either side of their head and the ears are almost identical (suitable when the subject is not cooperating, or is hiding his/her face). Also, since there is no need for the subject’s proximity, images of the ear may be taken more easily. However, the ears of the subject may still be occluded by factors such as hair, hats, and jewelry, making it difficult to detect and use the ear image [175]. There are multiple classical methods to perform ear recognition: geometric methods, which try to extract the shape of the ear; holistic methods, which extract the features from the ear image as a whole; local methods, which specifically use a portion of the image; and hybrid methods, which use a combination of the others [176], [177].
AWE Ear Dataset: This database contains 1,000 images of 100 persons. Images were collected from the web using a semi-automatic procedure, and contain the following annotations: gender, ethnicity, accessories, occlusions, head pitch, head roll, head yaw, head side, and central tragus point [178].

Multi-PIE Ear Dataset: This dataset was created in 2017 [179] based on the Multi-PIE face dataset [84]. There are 17,000 ear images extracted from the profile and near-profile images of 205 subjects present in the face dataset. The ears in the images are in different illuminations, angles, and conditions, making it a decent dataset for a more generalized ear recognition approach.

USTB Ear Database: This dataset contains ear images of 60 volunteers captured in 2002 [180]. Three different images were taken of every volunteer: a normal frontal image, a frontal image with trivial angle rotation, and an image under a different lighting condition.

UERC Ear Dataset: The ear images in this dataset [181] are collected from the Internet in unconstrained conditions, i.e., from the wild. There is a total of 11,804 images from 3,706 subjects, of which 2,304 images from 166 subjects are for training, and the rest are for testing.

AMI Ear Dataset: This dataset [182] contains 700 images of size 492 x 702 from 100 subjects in the age range of 19 to 65 years old. The images are all taken in the same lighting condition and at the same distance, and from both sides of the subject’s head. The images, however, differ in focal length and in the direction the subject is looking (up, down, left, right).

CP Ear Dataset: One of the older datasets in this area, the Carreira-Perpinan dataset [183] contains 102 left ear images taken from 17 subjects in the same conditions.

WPUT Ear Dataset: The West Pomeranian University of Technology (WPUT) dataset [184] contains 2,071 images from 501 subjects (247 male and 254 female subjects), from different age groups and ethnicities. The images are taken in different lighting conditions, from various distances and two angles, and include ears with and without accessories, including earrings, glasses, scarves, and hearing aids.

3.5.2 Deep Learning Works on Ear Recognition

Ear recognition is not yet as popular as face, iris, and fingerprint recognition. Therefore, datasets used for this procedure are still limited in size. Based on this, Zhang et al. [185] proposed few-shot learning methods, where the network uses the limited training data and quickly learns to recognize the images. Dodge et al. [186], who proposed using transfer learning with deep networks for unconstrained ear recognition (Emersic and colleagues [181]), also proposed a deep learning-based averaging system to mitigate the overfitting caused by the small size of the datasets. In [187], the authors proposed the first publicly available CNN-based ear recognition method. They explored different strategies, such as different architectures, selective learning on pre-trained data, and aggressive data augmentation, to find the best configurations for their work.

In [188], the authors showed how ear accessories can disrupt the recognition process and even be used for spoofing, especially in a CNN-based method (e.g., VGG-16) as compared with a traditional method (e.g., local binary patterns (LBP)), and proposed methods to remove such accessories and improve the performance, such as “inpainting” and area coloring. Sinha et al. [189] proposed a framework which localizes the outer ear image using HOG and SVMs, and then uses CNNs to perform ear recognition. It aims to resolve the issues usually associated with appearance-based feature extraction techniques, namely the conditions in which the image was taken, such as illumination, angle, contrast, and scale, which are also present in other biometric recognition systems, e.g. for face. Omara et al. [190] proposed extracting hierarchical deep features from ear images, fusing the features using discriminant correlation analysis (DCA) (Haghighat et al. [191]) to reduce their dimensions, and, due to the lack of ear images per person, creating pairwise samples and using pairwise SVM [192] to perform the matching (since a regular SVM would not perform well due to the small size of the datasets). Hansley et al. [193] used a fusion of CNNs and handcrafted features for ear recognition which outperformed other state-of-the-art CNN-based works, reaching the conclusion that handcrafted features can complement deep learning methods.

3.6 Voice Recognition

Voice recognition (also known as speaker recognition) is the task of determining a person’s ID using the characteristics of one’s voice. In a way, speaker recognition includes both behavioral and physiological features, such as accent and pitch, respectively. Using automatic ways to perform speaker recognition dates back to the 1960s, when Bell Laboratories were approached by law enforcement agencies about the possibility of identifying callers who had made verbal bomb threats over the telephone [194]. Over the years, researchers have developed many models that can perform this task effectively, especially with the help of deep learning. In addition to security applications, it is also being used in virtual personal assistants, such as Google Assistant, so they can recognize and distinguish the phone owner’s voice from the others [195].

Speaker recognition can be classified into speaker identification and speaker verification. Speaker identification is the process of determining a person’s ID from a set of registered voices using a given utterance [196], whereas speaker verification is the process of accepting or rejecting a proposed identity claimed for a speaker [197]. Since these two tasks usually share the same evaluation process under commonly-used metrics, the terms are sometimes used interchangeably in referenced papers. Speaker recognition is also closely related to speaker diarization, where an input audio stream is partitioned into homogeneous segments according to the speaker identity [198].

VoxCeleb2 contains over a million utterances for 6,112 identities.

Apart from datasets designed purely for speaker recognition tasks, many datasets collected for automatic speech recognition can also be used for training or evaluation of speaker recognition systems. For example, the Switchboard dataset [206] and the Fisher Corpus [207], which were originally collected for speech recognition tasks, are also used for model training in NIST Speaker Recognition Evaluations. On the other hand, researchers may utilize existing speech recognition datasets to prepare their own speaker recognition evaluation dataset to prove the effectiveness of their research. For example, the Librispeech dataset [208] and the TIMIT dataset [209] are pre-processed by the authors in [210] to serve as an evaluation set for the speaker recognition task.
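The distinction between speaker identification and speaker verification drawn above can be made concrete with a small Python sketch. It assumes that each utterance has already been mapped to a fixed-length embedding by some trained model; the cosine scoring rule, the function names, and the 0.7 threshold are hypothetical choices for illustration, not the method of any particular cited work.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_u * norm_v)


def verify_speaker(utterance: Sequence[float], claimed: Sequence[float],
                   threshold: float = 0.7) -> bool:
    """Verification: accept or reject one claimed identity (1:1 decision)."""
    return cosine_similarity(utterance, claimed) >= threshold


def identify_speaker(utterance: Sequence[float],
                     enrolled: Dict[str, Sequence[float]]) -> str:
    """Identification: pick the best match among all enrolled speakers (1:N)."""
    return max(enrolled, key=lambda name: cosine_similarity(utterance, enrolled[name]))


# Toy 3-dimensional embeddings; real speaker embeddings are much larger.
enrolled = {"spk_a": [1.0, 0.0, 0.0], "spk_b": [0.0, 1.0, 0.0]}
utterance = [0.9, 0.1, 0.0]
best = identify_speaker(utterance, enrolled)
accepted = verify_speaker(utterance, enrolled["spk_a"])
rejected = verify_speaker(utterance, enrolled["spk_b"])
```

Both tasks rely on the same pairwise score, which is one reason they can share an evaluation process: identification takes the highest-scoring enrolled speaker, while verification compares a single score against a threshold.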
recognition performance compared to the traditional i-vector approach, with the help of data augmentation.

End-to-end approaches based on neural networks are also explored in various papers. In [215] and [216], neural networks are designed to take in pairs of speech segments, and are trained to classify match/mismatch targets. A specially designed triplet loss function is proposed in [217] to substitute a binary classification loss function. Generalized end-to-end (GE2E) loss, which is similar to triplet loss, is proposed in [218] for text-dependent speaker recognition on an in-house dataset.

In [219], a complementary optimizing goal called intra-class loss is proposed to improve deep speaker embeddings learned with triplet loss. It is shown in the paper that models trained using intra-class loss can yield a significant relative reduction of 30% in equal error rate (EER) compared to the original triplet loss. The effectiveness is evaluated on both VoxCeleb and VoxForge datasets.

In [210], the authors proposed a method for learning speaker embeddings from raw waveform by maximizing the mutual information. This approach uses an encoder-discriminator architecture similar to that of Generative Adversarial Networks (GANs) to optimize mutual information implicitly. The authors show that this approach effectively learns useful speaker representations, leading to superior performance on the VoxCeleb corpus when compared with an i-vector baseline and CNN-based triplet loss systems.

In [220], the authors combine a deep convolutional feature extractor, self-attentive pooling, and large-margin loss functions into their end-to-end deep speaker recognizers. The individual and ensemble models from this approach achieved state-of-the-art performance on VoxCeleb with a relative improvement of 70% and 82%, respectively, over the best reported results. The authors also proposed to use a neural network to substitute the PLDA classifier, which enables them to get state-of-the-art results on the NIST-SRE 2016 dataset.

3.7 Signature Recognition

Signature is considered a behavioral biometric. It is widely used in traditional and digital formats to verify the user’s identity for the purposes of security, transactions, agreements, etc. Therefore, being able to distinguish an authentic signature from a forged one is of utmost importance. Signature forgery can be performed as either a random forgery, where no attempt is made to make an authentic signature (e.g., merely writing the name [221]), or a skilled forgery, where the signature is made to look like the original and is performed with the genuine signature in mind [222].

In order to distinguish an authentic signature from a forged one, one may either store merely signature samples to compare against (offline verification), or also the features of the written signature such as the thickness of a stroke and the speed of the pen during the signing [223]. For verification, there are writer-dependent (WD) and writer-independent (WI) methods. In WD methods, a classifier is trained for each signature owner, whereas, in WI methods, one is trained for all owners [224].

3.7.1 Signature Datasets

Some of the popular signature verification datasets include:

ICDAR 2009 SVC: The ICDAR 2009 Signature Verification Competition contains simultaneously acquired online and offline signature samples [225]. The online dataset is called “NFI-online” and was processed and segmented by Louis Vuurpijl. The offline dataset is called “NFI-offline” and was scanned by Vivian Blankers from the NFI. The collection contains authentic signatures from 100 writers and forged signatures from 33 writers. The NLDCC-online signature collection contains in total 1953 online and 1953 offline signature files.

SVC 2004: Signature Verification Competition 2004 consists of two datasets for two verification tasks: one for pen-based input devices like PDAs and another one for digitizing tablets [226]. Each dataset consists of 100 sets of signatures, with each set containing 20 genuine signatures and 20 skilled forgeries.

Offline GPDS-960 Corpus: This offline signature dataset [227] includes signatures from 960 subjects. There are 24 authentic signatures for each person, and 30 forgeries performed by other people not in the original 960 (1920 forgers in total). Some works have used a subset of this public dataset, usually the images for the first 160 or 300 subjects, dubbing them GPDS-160 and GPDS-300, respectively.

3.7.2 Deep Learning Works on Signature Recognition

Before the rise of deep learning to its current popularity, there were a few works seeking to use it. For example, Ribeiro et al. [228] proposed a deep learning-based method to both identify a signature’s owner and distinguish an authentic signature from a fake, making use of the Restricted Boltzmann Machine (RBM) [229]. With more powerful computers and massively parallel architectures making deep learning mainstream, the number of deep learning-based works increased dramatically, including those involving signature recognition. Rantzsch et al. [230] proposed an embedding-based
WI offline signature verification, in which the input signatures are embedded in a high-dimensional space using a specific training pattern, and the Euclidean distance between the input and the embedded signatures will determine the outcome. Soleimani et al. [222] proposed Deep Multitask Metric Learning (DMML), a deep neural network used for offline signature verification, mixing WD methods, WI methods, and transfer learning. Zhang et al. [231] proposed a hybrid WD-WI classifier in conjunction with a DC-GAN network in order to learn to extract the signature features in an unsupervised manner. With signature being a behavioral biometric, it is imperative to learn the best features to distinguish an authentic signature from a forged one. Hafemann et al. [232] proposed a WI CNN-based system to learn features of forgeries from multiple datasets, which greatly reduced the equal error rate compared to that of the state-of-the-art. Wang et al. [233] proposed signature identification using a special GAN network (SIGAN) in which the loss value from the discriminator network is utilized as the threshold for the identification process. Tolosana et al. [234] proposed an online writer-independent signature verification method using Siamese recurrent neural networks (RNNs), including long short-term memory (LSTM) and gated recurrent units (GRUs).

3.8 Gait Recognition

Gait recognition is a popular pattern recognition problem and attracts a lot of researchers from different communities such as computer vision, machine learning, biomedical, forensic studies, and robotics. This problem also has great potential in industries such as visual surveillance, since gait can be observed from a distance without the need for the subject’s cooperation. Similar to other behavioral biometrics, it is difficult, though possible, to try to imitate someone else’s gait [235]. It is also possible for the gait to change due to factors such as the carried load, injuries, clothing, walking speed, viewing angle, and weather conditions [236], [237]. It is also a challenge to recognize a person among a group of walking people [238]. Gait recognition can be model-based, in which the structure of the subject’s body is extracted (meaning more compute demand), or appearance-based, in which features are extracted from the person’s movement in the images [237], [235].

3.8.1 Gait Datasets

Some of the widely used gait recognition datasets include:

CASIA Gait Database: The CASIA Gait Recognition Dataset contains 4 subsets: Dataset A (standard dataset) [31], Dataset B (multi-view gait dataset), Dataset C (infrared gait dataset), and Dataset D (gait and its corresponding footprint dataset) [239]. Here we give details of the CASIA B dataset, which is very popular. Dataset B is a large multi-view gait database, which was created in 2005. There are 124 subjects, and the gait data was captured from 11 views. Three variations, namely view angle, clothing, and carrying condition changes, are separately considered. Besides the video files, they also provide human silhouettes extracted from the video files. The reader is referred to [240] for more detailed information about Dataset B.

Osaka Treadmill Dataset: This dataset was collected in March 2007 at the Institute of Scientific and Industrial Research (ISIR), Osaka University (OU) [241]. The dataset consists of 4,007 persons walking on a treadmill surrounded by 25 cameras at 60 fps, 640 by 480 pixels. The datasets are distributed in the form of silhouette sequences registered and size-normalized to 88x128 pixels. There are four subsets of this dataset: dataset A (speed variation), dataset B (clothes variation), dataset C (view variations), and dataset D (gait fluctuation). Dataset B is composed of gait silhouette sequences of 68 subjects from the side view with clothes variations of up to 32 combinations. Detailed descriptions of all these datasets can be found in this technical note [242].

Osaka University Large Population (OULP) Dataset: This dataset [243] includes images from 4,016 subjects of different ages (up to 94 years old) taken from two surrounding cameras and 4 observation angles. The images are normalized to 88x128 pixels.

3.8.2 Deep Learning Works on Gait Recognition

Research on gait recognition based on deep learning has only taken off in the past few years. In one of the older works, Wolf et al. [237] proposed a gait recognition system using 3D convolutional neural networks which learns the gait from multiple viewing angles. This model consists of multiple layers of 3D convolutions, max pooling, and ReLUs, followed by fully-connected layers.

Zhang et al. [235] proposed a Siamese neural network for gait recognition, in which the sequences of images are converted into gait energy images (GEI) [244]. Next, they are fed to the twin CNN networks, which are trained with a contrastive loss. This encourages the system to minimize the distance between the representations of similar inputs and maximize it for different ones. The network for this work is shown in Figure 19. Battistone et al. [245] proposed gait recognition through a time-based graph LSTM network, which uses alternating recursive LSTM layers and dense layers to extract skeletons from the person’s images and learn their joint features. Zou et al. [246] proposed a hybrid CNN-RNN network which uses the data from smartphone sensors for gait recognition, particularly from the accelerometer and the gyroscope, and the subjects are not restricted in their walking in any way.

4 Performance of Different Models on Different Datasets

In this section, we are going to present the performance of different biometric recognition models developed over the past few years. We are going to present the results of each biometric recognition model separately, by providing the performance of several promising works on one or two widely used datasets of that biometric. Before getting into the quantitative analysis, we are going to first briefly introduce some of the popular metrics that are used for evaluating biometric recognition models.

erating characteristic (ROC) is also another classical metric used for verification performance. ROC essentially measures the true positive rate (TPR), which is the fraction of genuine comparisons that correctly exceed the threshold, and the false positive rate (FPR), which is the fraction of impostor comparisons that incorrectly exceed the threshold, at different thresholds. ACC (classification accuracy) is another metric used by LFW, which is simply the percentage of correct classifications. Many works also use TPR at a certain FPR. For example, IJB-A focuses on TPR@FAR=10^-3, while Megaface uses TPR@FPR=10^-6.

Closed-set identification can be measured in terms of closed-set identification accuracy, as well as rank-N detection and identification rate. Rank-N measures the percentage of probe searches that return samples from the probe’s gallery within the top N rank-ordered results (e.g. IJB-A/B/C focuses on the rank-1 and rank-5 recognition rates). The cumulative match characteristic (CMC) is another popular metric, which measures the percentage of probes identified within a given rank.
Confusion matrix is also a popular metric for smaller
4.1 Popular Metrics For Evaluating Biometrics Open-set identification deals with the cases where
Recognition Systems the recognition system should reject unknown/unseen
Various metrics are designed to evaluate the perfor- subjects (probes which are not present in gallery) at the
mance a biometric recognition systems. Here we provide test time. At present, there are very few databases cov-
an overview of some of the popular metrics for evalua- ering the task of open-set biometric recognition. Open-
tion verification and identification algorithms. set identification accuracy is a popular metrics for this
Biometric verification is relevant to the problem task. Some benchmarks also suggested to use the de-
of re-identification, where we want to see if a given cision error trade-off (DET) curve to characterize the
data matches a registered sample. In many cases the FNIR (false-negative identification rate) as a function
performance is measured in terms of verification accu- of FPIR (false-positive identification rate).
racy, specially when a test dataset is provided. Equal Performance of Models for Face Recognition: For
error rate (EER) is another popular metric, which is the face recognition, various metrics are used for verifica-
rate of error decided by a threshold that yields equal tion and identification. For face verification, EER is one
false negative rate and false positive rate. Receiver op- of the most popular metrics. For identification, various
metrics are used, such as closed-set identification accuracy and open-set identification accuracy. For open-set performance, many works use detection and identification accuracy at a certain false-alarm rate (mostly 1%).

Due to the popularity of face recognition, there are a large number of algorithms and datasets available. Here, we provide the performance of some of the most promising deep learning-based face recognition models, and their comparison with some of the promising classical face recognition models on three popular datasets.

As mentioned earlier, LFW is one of the most widely used datasets for face recognition. The performance of some of the most prominent deep learning-based face verification models on this dataset is provided in Table 1. We have also included the results of two very well-known classical face verification works. As we can see, models based on deep learning algorithms achieve superior performance over classical techniques by a large margin. In fact, many deep learning approaches have surpassed human performance and are already close to 100% (for the verification task, not identification).

Table 1 Accuracy of different face recognition models for face verification on LFW dataset.

Method                Architecture   Used Dataset                   Accuracy on LFW (%)
Joint Bayesian [247]  Classical      -                              92.4
Tom-vs-Pete [248]     Classical      -                              93.3
DeepFace [96]         AlexNet        Facebook (4.4M,4K)             97.35
DeepID2 [98]          AlexNet        CelebFaces+ (0.2M,10K)         99.15
VGGface [101]         VGGNet-16      VGGface (2.6M,2.6K)            98.95
DeepID3 [99]          VGGNet-10      CelebFaces+ (0.2M,10K)         99.53
FaceNet [100]         GoogleNet-24   Google (500M,10M)              99.63
Range Loss [105]      VGGNet-16      MS-Celeb-1M, CASIA-WebFace     99.52
L2-softmax [106]      ResNet-101     MS-Celeb-1M (3.7M,58K)         99.87
Marginal Loss [249]   ResNet-27      MS-Celeb-1M (4M,80K)           99.48
SphereFace [108]      ResNet-64      CASIA-WebFace (0.49M,10K)      99.42
AMS loss [109]        ResNet-20      CASIA-WebFace (0.49M,10K)      99.12
CosFace [110]         ResNet-64      CASIA-WebFace (0.49M,10K)      99.33
Ring loss [112]       ResNet-64      CelebFaces+ (0.2M,10K)         99.50
Arcface [111]         ResNet-100     MS-Celeb-1M (3.8M,85K)         99.45
AdaCos [113]          ResNet-50      WebFace                        99.71
P2SGrad [114]         ResNet-50      CASIA-WebFace                  99.82

As mentioned earlier, closed-set identification is another popular face recognition task. Table 2 provides a summary of the performance of some of the recent state-of-the-art deep learning-based works on the MegaFace challenge 1 (for both identification and verification tasks). The MegaFace challenge evaluates the rank-1 recognition rate as a function of an increasing number of gallery distractors (going from 10 to 1 million) for identification accuracy. For verification, they report TPR at FAR = 10^-6. Some of these reported accuracies are taken from [116], where they implemented the Softmax, A-Softmax, CosFace, ArcFace and the AdaptiveFace models with the same 50-layer CNN, for fair comparison. As we can see, the deep learning-based models in recent years achieve very high rank-1 identification accuracy even in the case where 1 million distractors are included in the gallery.

Deep learning-based models have achieved great performance on other facial analysis tasks too, such as facial landmark detection, facial expression recognition, face tracking, age prediction from face, face aging, part of face tracking, and many more. As this paper is mostly focused on biometric recognition, we skip the details of models developed for those works here.

Performance of Models for Fingerprint Recognition: It is common for fingerprint recognition models to report their results using either the accuracy or equal error rate (EER). Table 3 provides the accuracy of some of the recent fingerprint recognition works on PolyU, FVC, and CASIA databases. As we can see, deep learning-based models achieve very high accuracy rates on these benchmarks.

Table 3 Accuracy of several fingerprint recognition algorithms.

Method               Dataset    Performance
FingerNet [128]      PolyU      Acc=95.70%
Multi-Siamese [129]  PolyU      EER=8.39%
MENet [124]          FVC 2002   EER=0.78%
MENet [124]          FVC 2004   EER=5.45%
Deep CNN [250]       Composed   Acc=98.21%

Performance of Models for Iris Recognition: Many of the recent iris recognition works have reported their accuracy rates on different iris databases, making it hard to compare all of them on a single benchmark. The performance of deep learning-based iris recognition algorithms, and their comparison with some of the promising classical iris recognition models, are provided in Table 4. As we can see, models based on deep learning algorithms achieve superior performance over classical techniques. Some of these numbers are taken from [251] and [252].

Performance of Models for Palmprint Recognition: It is common for palmprint recognition papers to compare their work against others using the accuracy rate or equal error rate (EER). Table 5 displays the accuracy of some of the palmprint recognition works. As we can see, deep learning-based models achieve very high accuracy rates on the PolyU palmprint dataset.

Performance of Models for Ear Recognition: The results of some of the recent ear recognition models are provided in Table 6. Besides recognition accuracy, some of the works have also reported their rank-5 accuracy, i.e. if one of the first 5 outputs of the algorithm is correct, the algorithm has succeeded. Different deep learning-based models for ear recognition report their accuracy on different benchmarks. Therefore, we list some of the promising works, along with the respective datasets that they are evaluated on, in Table 6.

Performance of Models for Voice Recognition: The most widely used metric for evaluation of speaker recognition systems is Equal Error Rate (EER). Apart from EER, other metrics are also used for system evaluation. For example, the detection error trade-off curve (DET curve) is used in SRE performance evaluations to compare different systems. A DET curve is created by plotting the false negative rate versus the false positive rate, with logarithmic scale on the x- and y-axes. (EER corresponds to the point on a DET curve where the false negative rate and the false positive rate are equal.) Minimum detection cost is another metric that is frequently used in speaker recognition tasks [261]. This cost is defined as a weighted average of two normalized error rates. Not all of these metrics are reported in every research paper, but EER is the most important metric to compare different systems.

Table 7 records the performance of some of the best deep learning-based speaker recognition systems on the VoxCeleb1 dataset. As shown in the table, the progress made by researchers over the last two years is notable. All the systems shown in Table 7 are single systems, which means the performance can be boosted further with system combination or ensembles.

For SRE datasets, due to the large number of its series and the complexity of different evaluation conditions, it is hard to compile all results into one table. Also, different papers may present results on different sets or conditions, making it hard to compare the performance across different approaches.

The deep learning-based approaches discussed above have also been applied to other related areas, e.g. speech diarization, replay attack detection and language identification. Since this paper focuses on biometric recognition, we skip the details for these tasks.

Performance of Models for Signature Recognition: Most signature recognition works use EER as the performance metric, but sometimes they also report accuracy. Table 8 summarizes the EER of several signature verification methods on the GPDS dataset, where there are 12 authentic signature samples used for each person (except in [222] where it is 10 samples). In addition, Table 9 provides the reported accuracy of a few other works on other datasets.

Performance of Models for Gait Recognition: Likely due to the different configurations of the existing gait datasets, it is difficult to compare the deep learning-based gait recognition works. The results are reported in the form of accuracies and EER across different gallery view angles and cross-view settings. For gait recognition, it is common to compare rank-5 statistics as well as the usual rank-1 ones. We have gathered some of the averaged accuracy results reported in [270] in Table 10. Note that results using CASIA-B are col-
Table 4 The performance of iris recognition models on some of the most popular datasets.

Method                        Dataset               Model/Feature                              Performance
Elastic Graph Matching [253]  IITD                  -                                          Acc= 98%
SIFT Based Model [254]        CASIA, MMU,           SIFT features                              Acc= 99.05%
Deep CNN [151]                IITD                  -                                          Acc=99.8%
Deep CNN [151]                UBIRIS v2             -                                          Acc=95.36%
Deep Scattering [139]         IITD                  ScatNet3+Texture features                  Acc= 99.2%
Deep Features [147]           IITD                  VGG-16                                     Acc= 99.4%
SCNN [255]                    CASIA-v4, FRGC, FOCS  Semantics-assisted convolutional networks  R1-ACC= 98.4 (CASIA-v4)
Table 5 Accuracy of various palmprint recognition systems.

Method                   Dataset   Accuracy
RSM [256] (Classical)    PolyU     99.97%
JDCFR [173]              Hyper-    99.62%
DMRL [257]               PolyU     99.65%
MobileNetV2 [258]        PolyU     99.95%
Deform-invariant [259]   PolyU     99.98%
Deep Scattering [163]    PolyU     100%
MobileNetV2+SVM [258]    PolyU     100%

Table 6 Accuracy of select ear recognition algorithms.

Method               Dataset    Accuracy
Zhang et al [185]    UERC       62.48 ± 0.09%
Eyiokur et al [179]  UERC       63.62%
Zhang et al [185]    AMI        99.94 ± 0.05%
Omara et al [190]    IITD I     99.5%
Omara et al [190]    USTB II    99%
Sinha et al [189]    USTB III   97.9%
Tian et al [260]     USTB III   98.27%
Emersic et al [187]  Composed   62%

Table 8 Reported EER of selected signature recognition models on GPDS dataset (using 10-12 genuine samples).

Method                 Dataset    EER
Hafemann et al [267]   GPDS-160   10.70%
Yilmaz et al [268]     GPDS-160   6.97%
Souza et al [269]      GPDS-160   2.86%
Hafemann et al [232]   GPDS-160   2.63%
Soleimani et al [222]  GPDS-300   20.94%
Hafemann et al [267]   GPDS-300   12.83%
Souza et al [269]      GPDS-300   3.34%
Hafemann et al [232]   GPDS-300   3.15%

Table 9 Accuracy reported by some signature recognition models.

Method           Dataset            Accuracy
Embedding [230]  ICDAR (Japanese)   93.39%
Embedding [230]  ICDAR (Dutch)      81.76%
SIGAN [233]      Composed           91.2%

sor types still remain challenging. Also, the number of subjects/people in real-world scenarios should be in the order of tens of millions. Therefore, biometric datasets which contain a much larger number of classes (10M-100M), as well as a lot more intra-class variation, would be another big step towards supporting all real-world conditions.

5.2 Interpretable Deep Models

It is true that deep learning-based models have achieved an astonishing performance on many of the challenging benchmarks, but there are still several open questions about these models. For example, what exactly are deep learning models learning? Why are these models easily fooled by adversarial examples (while humans can detect many of those examples easily)? What is a minimal neural architecture which can achieve a certain recognition accuracy on a given dataset?

requirement, and developing near real-time yet accurate models would be very valuable.

5.6 Memory Efficient Models

Many of the deep learning-based models require a significant amount of memory even during inference. So far, most of the effort has focused on improving the accuracy of these models, but in order to fit these models in devices, the networks must be simplified. This can be done either by using a simpler model, using model compression techniques, or training a complex model and then using knowledge distillation techniques to compress that into a smaller network mimicking the initial complex model. Having a memory-efficient model opens the door for these models to be used even on consumer devices.
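As an illustration of the knowledge distillation idea mentioned in Section 5.6, the following is a minimal sketch of a commonly used distillation objective, in which a small "student" network is trained to match the temperature-softened outputs of a large "teacher" network while also fitting the ground-truth labels. This is not taken from any of the surveyed works: the function names, the temperature of 4.0 and the mixing weight alpha of 0.5 are assumptions chosen only for illustration, and the logits are plain NumPy arrays rather than outputs of real networks.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Row-wise softmax of logits divided by a temperature."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Hypothetical distillation objective (illustrative assumption).

    Combines (a) the KL divergence between the softened teacher and
    student distributions and (b) the usual cross-entropy of the
    student against the hard labels.
    """
    eps = 1e-12
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # Soft-target term: how far the student is from mimicking the teacher.
    kl = np.sum(p_teacher * (np.log(p_teacher + eps) - np.log(p_student + eps)),
                axis=-1)
    # Hard-target term: standard cross-entropy at temperature 1.
    p_hard = softmax(student_logits)
    labels = np.asarray(labels)
    ce = -np.log(p_hard[np.arange(len(labels)), labels] + eps)
    # The temperature**2 factor keeps the two terms on a comparable scale.
    return float(np.mean(alpha * temperature ** 2 * kl + (1.0 - alpha) * ce))


# Toy example: two samples, three identities (made-up logits).
teacher = np.array([[5.0, 1.0, 0.0], [0.0, 4.0, 1.0]])
untrained_student = np.zeros((2, 3))
labels = np.array([0, 1])
print(distillation_loss(untrained_student, teacher, labels))
```

In practice the student logits would come from a compact network trained by gradient descent on this loss, so that only the small model needs to be stored on the device at inference time.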
