Existing ear detection approaches exploit 2D images (including range images) or 3D point cloud data. This section reviews well-known and recent ear detection methods operating on 2D images or range images and highlights the contributions of this paper.
2.1. Ear Detection
Detection and recognition are the two major components of an automatic biometric system. In this section, a summary of ear detection approaches is provided. Most ear detection approaches rely on common morphological properties of the ear, such as its characteristic edges or frequency patterns. The first well-known ear detection approach was proposed by Burge and Burger [10]. They proposed a detection approach utilizing deformable contours that requires user interaction for contour initialization, so it is not a fully automatic ear detection approach. Hurley et al. [11] proposed an approach that has gained some popularity, the force field transform. However, it is only applicable when only a small background is present in the ear image. A 2D ear detection technique combining geodesic active contours and a new ovoid model was developed by Alvarez et al. [12]. The ear contour is estimated using a snake model and an ovoid model, but the method requires a manually initialized ear contour. Ansari and Gupta [13] used the Canny edge detector to extract the outer helix edges for localization of the ear in the profile image. The experiments were conducted on the Indian Institute of Technology Kanpur (IITK) database, which contains cut-out ear images; it is therefore questionable whether the approach works under realistic conditions. A tracking method combining a skin-color model with intensity contour information was proposed by Yuan and Mu [14] to detect and track the ear across sequential frames. Another ear detection technique finds the elliptical shape of the ear using a Hough Transform (HT), which provides tolerance to noise and occlusion [15]. In [16], Cummings et al. utilized the image ray transform to detect ears. The transform is capable of highlighting tubular structures such as the helix of the ear and spectacle frames, and the approach exploits the elliptical shape of the helix to perform ear localization. However, such methods fail when the assumption of an elliptical ear shape does not hold, which cannot be guaranteed for every subject in a real-world application. Prakash and Gupta [17] presented a rotation-, scale-, and shape-invariant technique for automatic localization of the ear in side face images. The approach makes use of the connected components of a graph obtained from the edge map of the profile image. However, this technique may not be robust to background noise or minor hair covering around the ear.
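To make this family of edge-based detectors concrete, the following Python sketch (using OpenCV) extracts Canny edges and fits ellipses to the longest contours. The function name, the Canny thresholds, and the area filter are illustrative choices of ours, and cv2.fitEllipse merely stands in for the Hough-style ellipse detection of [15]:

```python
import cv2

def locate_ear_candidates(gray, min_area=500):
    """Illustrative edge-based ear localizer in the spirit of [13,15]:
    Canny edges followed by elliptical fitting of large contours.
    Thresholds are placeholders, not tuned values from the papers."""
    edges = cv2.Canny(gray, 50, 150)
    contours, _ = cv2.findContours(edges, cv2.RETR_LIST, cv2.CHAIN_APPROX_NONE)
    candidates = []
    for c in contours:
        # fitEllipse needs at least 5 points; the area filter discards
        # short edge fragments that cannot be the helix contour.
        if len(c) >= 5 and cv2.contourArea(cv2.convexHull(c)) > min_area:
            (cx, cy), (w, h), angle = cv2.fitEllipse(c)
            candidates.append(((cx, cy), (w, h), angle))
    return candidates

# usage sketch:
# gray = cv2.imread("profile.png", cv2.IMREAD_GRAYSCALE)
# candidates = locate_ear_candidates(gray)
```

As the surveyed papers note, any such pipeline inherits the assumption that the helix appears roughly elliptical and unoccluded.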
Yan and Bowyer used a two-line landmark to perform detection [18]. One line was taken along the border between the ear and the profile, and the other from the top of the ear to the bottom. In a further approach, they exploited an automatic ear extraction method based on ear pit detection and an active contour algorithm [19]. The ear pit was first found using skin detection, curvature estimation, surface segmentation, and classification; an active contour algorithm was then applied to outline the ear region. However, because it has to locate the nose tip and the ear pit on the profile image, this algorithm may not be robust to pose variations or hair covering either. Deepak et al. [20] proposed an ear detection method invariant to background and pose that uses snakes as the active contour model. The method encompasses two stages, namely Snake-based Background Removal (SBR) and Snake-based Ear Localization (SEL): SBR removes the background from a face image, and SEL then localizes the ear. However, its computational time of 3.86 s per image cannot be ignored for an ear detection system.
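The snake idea behind approaches such as SEL [20] can be sketched as follows with scikit-image; the circular initialization around a coarse ear estimate and all parameter values are our illustrative assumptions, not the authors' exact procedure:

```python
import numpy as np
from skimage.filters import gaussian
from skimage.segmentation import active_contour

def snake_ear_outline(gray, center, radius, n_points=200):
    """Illustrative snake-based outlining: an initial circle around a
    coarse ear estimate (row, col) is relaxed onto strong image edges.
    `center` and `radius` would come from a prior stage such as
    background removal; alpha/beta/gamma are placeholder values."""
    s = np.linspace(0, 2 * np.pi, n_points)
    init = np.column_stack([center[0] + radius * np.sin(s),   # rows
                            center[1] + radius * np.cos(s)])  # cols
    # Smoothing the image first gives the snake a wider basin of
    # attraction around the ear boundary.
    return active_contour(gaussian(gray, sigma=3),
                          init, alpha=0.015, beta=10, gamma=0.001)
```

The returned contour can then be used to crop the ear region for the recognition stage.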
Chen and Bhanu presented an approach for ear detection utilizing the step edge magnitude [21]. They calculated the maximum depth difference between a center point and its neighbors within a small window of a range image to obtain a binary edge image of the ear helix. In [22], Chen and Bhanu detected the ear with a shape-model-based technique in side face range images. Step edges were extracted, dilated, thinned, and grouped into clusters, which were potential regions containing ears. For each cluster, the ear shape model was registered with the edges, and the region with the minimum mean registration error was declared the detected ear region. In a more recent work [23], Chen and Bhanu improved the step edge extraction procedure. They used a skin-color classifier to isolate the side face before extracting the step edges, and edges from the 2D color image were combined with step edges from the range image to locate regions of interest (ROIs) that might contain an ear. However, these ear extraction methods only work on profile images without rotation, pose, or scale variation and without occlusion. Ganesh and Krishna proposed an approach to detect ears in facial images under uncontrolled environments [24]. They proposed a technique, namely Entropic Binary Particle Swarm Optimization (EBPSO), which generates an entropy map whose highest value is used to localize the ear in a face image. In addition, Dual Tree Complex Wavelet Transform (DTCWT) based background pruning was used to eliminate most of the background in the face image. However, this method is computationally complex, taking 12.18 s on average to detect an ear.
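The step edge magnitude of [21] admits a compact implementation: for each pixel of the range image, take the largest depth difference to any neighbor in a small window and threshold it. The following sketch assumes a dense range image stored as a NumPy array; the window size and threshold are illustrative, not the paper's values:

```python
import numpy as np
from scipy.ndimage import maximum_filter, minimum_filter

def step_edge_map(range_img, win=5, thresh=10.0):
    """Illustrative step edge magnitude in the spirit of [21]: for each
    pixel, the largest absolute depth difference to any neighbor in a
    win x win window. Thresholding yields a binary edge map in which
    the sharp depth discontinuity along the ear helix stands out."""
    r = np.asarray(range_img, dtype=np.float64)
    local_max = maximum_filter(r, size=win)
    local_min = minimum_filter(r, size=win)
    # max over neighbors of |depth(neighbor) - depth(center)|
    magnitude = np.maximum(local_max - r, r - local_min)
    return magnitude > thresh
```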
Researchers have also presented ear detection approaches based on template matching. Anupam [25] utilized ear templates of different sizes to detect ears at different scales, but such templates may be unable to handle all situations encountered in practice. An automatic ear detection technique proposed by Prakash et al. [26] was based on a skin-color classifier and template matching. An ear template was created considering ears of various shapes and was automatically resized to a size suitable for detection. Nonetheless, it only works when the images contain facial parts alone; otherwise, other skin areas may lead to an incorrect ear localization result. Attarchi et al. [27] proposed an ear detection method based on the edge map and a mean ear template. The Canny edge detector is used to obtain the edges of the ear image, the longest path in the edge image is taken as the outer boundary of the ear, and the ear region is finally extracted using a predefined window computed from the mean ear template. The method works well when there is little background in the image, but its performance decreases when applied to a whole profile image. Halawani [28] proposed a shape-based ear localization approach in which a predefined binary ear template is matched to ear contours in a given edge image. To cope with changes in ear shapes and sizes, the template is allowed to deform, and a dynamic programming search algorithm accomplishes the matching process. In [29], an oval shape detection based approach was presented by Joshi for ear detection from 2D profile face images. The correctness of the detected ear was verified using a support vector machine.
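A minimal multi-scale template matcher in the spirit of [25,26] can be written with OpenCV's matchTemplate; the scale set and the normalized-correlation score below are our illustrative choices rather than the exact matching criteria of those papers:

```python
import cv2

def match_ear_template(gray, template, scales=(0.6, 0.8, 1.0, 1.2)):
    """Illustrative multi-scale template matching: a mean ear template
    is resized and slid over the image; the location with the best
    normalized correlation wins. Returns (score, top-left, (w, h))."""
    best = (-1.0, None, None)
    for s in scales:
        t = cv2.resize(template, None, fx=s, fy=s)
        # skip scales at which the template no longer fits the image
        if t.shape[0] > gray.shape[0] or t.shape[1] > gray.shape[1]:
            continue
        res = cv2.matchTemplate(gray, t, cv2.TM_CCOEFF_NORMED)
        _, max_val, _, max_loc = cv2.minMaxLoc(res)
        if max_val > best[0]:
            best = (max_val, max_loc, (t.shape[1], t.shape[0]))
    return best
```

As the survey above notes, such matchers are sensitive to ear shapes that deviate from the template and to skin-like background regions.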
The performance of ear detection approaches based on edges or templates may decline when the profile face is affected by partial occlusion, scaling, or rotation (pose variations). Therefore, ear detection approaches based on learning algorithms such as cascaded AdaBoost have been proposed to improve the performance of ear detection systems in such application scenarios. Islam et al. [30] used Haar-like rectangular features as weak classifiers. AdaBoost was utilized to select the best weak classifiers and combine them into strong classifiers, and a cascade of these classifiers was built as the detector. Nevertheless, training the classifier takes several days. Abaza et al. [31] modified the AdaBoost algorithm and reduced the training time significantly. Shih et al. [32] presented a two-step ear detection system utilizing arc-masking candidate extraction and AdaBoost polling verification: ear candidates are first extracted by an arc-masking edge search algorithm, and the ear is then located by a coarse AdaBoost polling verification. Yuan and Mu [33] used an improved AdaBoost algorithm to detect ears against complex backgrounds. They sped up the detection procedure and reported a good detection rate on three test data sets.
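Once trained, such cascades are straightforward to apply. The sketch below runs a pre-trained AdaBoost cascade of Haar-like features over a profile image with OpenCV; "ear_cascade.xml" is a placeholder for any trained ear cascade, and the detectMultiScale parameters are common defaults rather than values from [30,31]:

```python
import cv2

# Illustrative cascade-based detection: a trained AdaBoost cascade of
# Haar-like features is scanned over the image at multiple scales.
cascade = cv2.CascadeClassifier("ear_cascade.xml")  # placeholder model
gray = cv2.imread("profile.png", cv2.IMREAD_GRAYSCALE)
ears = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in ears:
    print(f"ear candidate at ({x}, {y}), size {w}x{h}")
```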
An overview of the ear detection methods mentioned above is presented in Table 1, along with the scale of the test databases and the reported accuracy rates. It is worth noting that most of the ear detection work in the table was tested on images photographed under controlled conditions. The detection rates may drop sharply when those systems are tested in a realistic scenario involving occlusion, illumination variation, scaling, and rotation. The table also shows that learning algorithms perform better than algorithms based on edge detection or template matching, but shallow learning models such as AdaBoost still lack robustness in practice.
2.2. Deep Learning in Computer Vision
Recently, convolutional neural networks (CNNs) have significantly pushed forward the development of image classification and object detection [34]. Krizhevsky et al. [35] trained a deep CNN model named AlexNet to classify the 1.2 million images of the ImageNet Large Scale Visual Recognition Challenge 2010 (ILSVRC-2010) contest into 1000 different classes. The network consists of five convolutional layers (some followed by max-pooling layers) and three fully-connected layers with a final 1000-way softmax layer. They employed a regularization method named 'dropout' to reduce over-fitting and accelerate convergence, and achieved top-1 and top-5 error rates of 37.5% and 17.0% on the test data. Simonyan and Zisserman [36] put forward the VGGNet deep model (Visual Geometry Group, Department of Engineering Science, University of Oxford) to investigate the effect of convolutional network depth on image classification accuracy. They showed that a significant improvement can be achieved by pushing the depth to 16-19 weight layers, and reported top-1 and top-5 classification error rates of 23.7% and 6.8% on ImageNet ILSVRC-2014. In the same year, Szegedy et al. [37] proposed an innovative deep CNN architecture codenamed Inception. They designed a 22-layer deep network called GoogLeNet, which was assessed in the ImageNet ILSVRC-2014 contest and achieved a top-5 classification error rate of 6.67%. Researchers found that network depth is of crucial importance, and the leading results on the challenging ImageNet dataset all exploit deep models. To address the degradation problem in very deep networks, He et al. [38] trained a 152-layer deep CNN called ResNet. Instead of learning unreferenced functions, they reformulated the layers as learning residual functions with reference to the layer inputs. These networks are easier to optimize and achieved a top-5 classification error rate of 3.57%.
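The residual reformulation of [38] is easy to state in code: the stacked layers learn a residual F(x), and the identity input is added back so the block outputs F(x) + x rather than an unreferenced mapping. The following PyTorch sketch shows a basic block with an identity shortcut (a simplified variant for illustration, not the full ResNet-152 architecture):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal basic block in the spirit of ResNet [38]."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)  # identity shortcut: F(x) + x
```

Because the shortcut carries the input unchanged, gradients propagate through very deep stacks, which is what makes 100+ layer networks trainable in practice.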
Benefiting from deep learning methods, the performance of object detection, as measured on the canonical Pattern Analysis, Statistical Modelling and Computational Learning Visual Object Classes (PASCAL VOC) challenge, has made great progress in the last few years. Girshick et al. [39] proposed a new object detection framework called Regions with CNN features (R-CNN). First, around 2000 bottom-up region proposals are extracted from an input image. Then the features of each proposal are extracted with a large convolutional neural network. Finally, class-specific linear Support Vector Machines (SVMs) are used to classify each region. The R-CNN approach achieved a mean average precision (mAP) of 53.7% on PASCAL VOC 2010. However, because it runs a ConvNet on each object proposal, the time spent computing region proposals and features (13 s/image on a Graphics Processing Unit (GPU) or 53 s/image on a CPU) cannot be ignored for an object detection system. Inspired by spatial pyramid pooling networks (SPPnets) [40], Girshick [34] proposed Fast R-CNN to speed up R-CNN by sharing computation. The network processes the whole image with a CNN to produce a convolutional feature map, from which a fixed-length feature vector is extracted for each object proposal. Each feature vector is fed into fully connected layers that output the bounding box of each object. Fast R-CNN processes images 213 times faster than R-CNN at test time and achieved a 65.7% mAP on PASCAL VOC 2012. Although the improved network reduced the running time of the detection network, computing the region proposals remained a bottleneck. A modified network called Faster R-CNN was therefore proposed by Ren et al. [9]. In this work, they introduced a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, enabling nearly cost-free region proposals. The RPN and Fast R-CNN are trained to share convolutional features with an alternating optimization. The detection system achieves a frame rate of 5 fps on a GPU while reaching 70.4% mAP on PASCAL VOC 2012.
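As a usage-level illustration of the RPN plus detection-head pipeline (not the ear detector developed in this paper), a generic Faster R-CNN can be run off the shelf via torchvision; the weights flag and the random input tensor below are placeholders:

```python
import torch
import torchvision

# Illustrative Faster R-CNN inference with a generic pre-trained model:
# the RPN proposes regions from shared convolutional features, and the
# detection head classifies and refines them in a single forward pass.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 600, 800)  # placeholder for a profile image in [0, 1]
with torch.no_grad():
    prediction = model([image])[0]

# boxes: (N, 4) tensor of detections; scores: their confidences
print(prediction["boxes"].shape, prediction["scores"][:5])
```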
In conclusion, schemes based on Faster R-CNN have achieved impressive performance on object detection in images captured in real-world situations, but to the best of our knowledge no biometric application of the Faster R-CNN algorithm has been reported so far.