Abstract
Facial expressions are an important way through which humans interact socially. Building a system capable of automatically recognizing facial expressions from images and video has been an intense field of study in recent years. Interpreting such expressions remains challenging, and much research is still needed on how they relate to human affect. This paper presents a general overview of automatic RGB, 3D, thermal and multimodal facial expression analysis. We define a new taxonomy for the field, encompassing all steps from face detection to facial expression recognition, and describe and classify the state-of-the-art methods accordingly. We also present the most important datasets and a benchmarking of the most influential methods. We conclude with a general discussion about trends, important questions and future lines of research.
Keywords: Facial Expression, Affect, Emotion Recognition, RGB, 3D, Thermal, Multimodal
1. INTRODUCTION
FACIAL expressions (FE) are vital signaling systems of affect, conveying cues about the emotional state of persons. Together with voice, language, hands and posture of the body, they form a fundamental communication system between humans in social contexts. Automatic FE recognition (AFER) is an interdisciplinary domain standing at the crossroads of behavioral science, neurology, and artificial intelligence.
Studies of the face were greatly influenced in premodern times by popular theories of physiognomy and creationism. Physiognomy assumed that a person’s character or personality could be judged by their outer appearance, especially the face [1]. Leonardo Da Vinci was one of the first to refute such claims stating they were without scientific support [2]. In the 17th century in England, John Bulwer studied human communication with particular interest in the sign language of persons with hearing impairment. His book Pathomyotomia or Dissection of the significant Muscles of the Affections of the Mind was the first consistent work in the English language on the muscular mechanism of FE [3]. About two centuries later, influenced by creationism, Sir Charles Bell investigated FE as part of his research on sensory and motor control. He believed that FE was endowed by the Creator solely for human communication. Subsequently, Duchenne de Boulogne conducted systematic studies on how FEs are produced [4]. He published beautiful pictures of sometimes strange FEs obtained by electrically stimulating facial muscles (see Figure 1). Approximately in the same historical period, Charles Darwin firmly placed FE in an evolutionary context [5]. This marked the beginning of modern research of FEs. More recently, important advancements were made through the works of researchers like Carroll Izard and Paul Ekman who, inspired by Darwin, performed seminal studies of FEs [6], [7], [8].
In recent years, excellent surveys on automatic facial expression analysis have been published [9], [10], [11], [12]. For a more processing-oriented review of the literature the reader is mainly referred to [10], [12]. For an introduction to AFER in natural conditions the reader is referred to [9]. Readers interested mainly in 3D AFER should refer to the work of Sandbach et al. [11].
In this survey, we define a comprehensive taxonomy of automatic RGB1, 3D, thermal, and multimodal computer vision approaches for AFER. The definition and choices of the different components are analyzed and discussed. This is complemented with a section dedicated to the historical evolution of FE approaches and an in-depth analysis of the latest trends. Additionally, we provide an introduction to affect inference from the face from an evolutionary perspective. We emphasize research produced since the last major review of AFER in 2009 [9]. By focusing on inferring affect, defining a comprehensive taxonomy and treating different modalities, we aim to propose a more general perspective on AFER and its current trends.
The paper is organized as follows: Section 2 discusses affect in terms of FEs. Section 3 presents a taxonomy of automatic RGB, 3D, thermal and multimodal recognition of FEs. Section 4 reviews the historical evolution in AFER and focuses on recent important trends. Finally, Section 5 concludes with a general discussion.
2. INFERRING AFFECT FROM FES
Depending on context FEs may have varied communicative functions. They can regulate conversations by signaling turn-taking, convey biometric information, express intensity of mental effort, and signal emotion. By far, the latter has been the one most studied.
2.1. Describing affect
Attempts to describe human emotion mainly fall into two approaches: categorical and dimensional description.
Categorical description of affect.
Classifying emotions into a set of distinct classes that can be recognized and described easily in daily language has been common since at least the time of Darwin. More recently, influenced by the research of Paul Ekman [7], [13], a dominant view of affect is based on the underlying assumption that humans universally express a set of discrete primary emotions which include happiness, sadness, fear, anger, disgust, and surprise (see Figure 2). Mainly because of its simplicity and its universality claim, the universal primary emotions hypothesis has been extensively exploited in affective computing.
Dimensional description of affect.
Another popular approach is to place a particular emotion into a space having a limited set of dimensions [15], [16], [17]. These dimensions include valence (how pleasant or unpleasant a feeling is), activation2 (how likely the person is to take action under the emotional state) and control (the sense of control over the emotion). Due to their higher dimensionality, such descriptions can potentially capture more complex and subtle emotions. Unfortunately, the richness of the space is more difficult to exploit in automatic recognition systems because it can be challenging to link such a description to a FE. Usually, automatic systems based on dimensional representations of emotion simplify the problem by dividing the space into a limited set of categories, such as positive vs. negative or the quadrants of the 2D space [9].
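Automatic systems often reduce a dimensional prediction to such coarse categories. As a minimal sketch (not taken from any of the cited systems), a predicted (valence, activation) point can be mapped to a quadrant label, with a small neutral region around the origin as an assumed design choice:

```python
# Minimal sketch: map a point in the 2D valence-activation space to a coarse
# category. The neutral radius is an assumed, illustrative parameter.
def quadrant_label(valence: float, activation: float, neutral_radius: float = 0.1) -> str:
    if valence ** 2 + activation ** 2 < neutral_radius ** 2:
        return "neutral"
    if valence >= 0:
        return "positive-active" if activation >= 0 else "positive-passive"
    return "negative-active" if activation >= 0 else "negative-passive"

print(quadrant_label(0.7, -0.4))  # -> "positive-passive"
```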
2.2. An evolutionist approach to FE of affect
At the end of the 19th century Charles Darwin wrote The Expression of the Emotions in Man and Animals, which largely inspired the study of FE of emotion. Darwin proposed that FEs are the residual actions of more complete behavioral responses to environmental challenges. Constricting the nostrils in disgust served to reduce inhalation of noxious or harmful substances. Widening the eyes in surprise increased the visual field to see an unexpected stimulus. Darwin emphasized the adaptive functions of FEs.
More recent evolutionary models have come to emphasize their communicative functions [18]. [19] proposed a process of exaptation in which adaptations (such as constricting the nostrils in disgust) became recruited to serve communicative functions. Expressions (or displays) were ritualized to communicate information vital to survival. In this way, two abilities were selected for their survival advantages. One was to automatically display exaggerated forms of the original expressions; the other was to automatically interpret the meaning of these expressions. From this perspective, disgust communicates potentially aversive foods or moral violations; sadness communicates request for comfort. While some aspects of evolutionary accounts of FE are controversial [20], strong evidence exists in their support. Evidence includes universality of FEs of emotion, physiological specificity of emotion, and automatic appraisal and unbidden occurrence [21], [22], [23].
Universality.
There is a high degree of consistency in the facial musculature among peoples of the world. The muscles necessary to express primary emotions are found universally [24], [25], [26], and homologous muscles have been documented in non-human primates [27], [28], [29]. Similar FEs in response to species-typical signals have been observed in both human and non-human primates [30].
Recognition.
Numerous perceptual judgment studies support the hypothesis that FEs are interpreted similarly at levels well above chance in both Western and non-Western societies. Even critics of strong evolutionary accounts [31], [32] find that recognition of FEs of emotion is universally above chance and in many cases considerably higher.
Physiological specificity.
Physiological specificity appears to exist as well. Using directed facial action tasks to elicit basic emotions, Levenson and colleagues [33] found that heart rate, galvanic skin response, and skin temperature systematically varied with the hypothesized functions of basic emotions. In anger, blood flow to the hands increased to prepare for fight. For the central nervous system, patterns of prefrontal and temporal asymmetry systematically differed between enjoyment and disgust when measured using the Facial Action Coding System (FACS) [34]. Left-frontal asymmetry was greater during enjoyment; right-frontal asymmetry was greater during disgust. These findings support the view that emotion expressions reliably signal action tendencies [35], [36].
Subjective experience.
While not critical to an evolutionary account of emotion, evidence exists as well for concordance between subjective experience and FE of emotion [37], [38]. However, more work is needed in this regard. Until recently, manual annotation of FE or facial EMG were the only means to measure FE of emotion. Because manual annotation is labor intensive, replication of studies is limited.
In summary, the study of FE initially was strongly motivated by evolutionary accounts of emotion. Evidence has broadly supported those accounts. However, FE more broadly figures in cultural bio-psycho-social accounts of emotion. Facial expression signals emotion, communicative intent, individual differences in personality, and psychiatric and medical status, and helps to regulate social interaction. With the advent of automated methods of AFER, we are poised to make major discoveries in these areas.
2.3. Applications
The ability to automatically recognize FEs and infer affect has a wide range of applications. AFER, usually combined with speech, gaze and standard interactions like mouse movements and keystrokes, can be used to build adaptive environments by detecting the user’s affective states [39], [40]. Similarly, one can build socially aware systems [41], [42], or robots with social skills like Sony’s AIBO and ATR’s Robovie [43]. Detecting students’ frustration can help improve e-learning experiences [44]. Gaming experience can also be improved by adapting difficulty, music, characters or mission according to the player’s emotional responses [45], [46], [47]. Pain detection is used for monitoring patient progress in clinical settings [48], [49], [50]. Detection of truthfulness or potential deception can be used during police interrogations or job interviews [51]. Monitoring drowsiness or the attentive and emotional status of the driver is critical for the safety and comfort of driving [52]. Depression recognition from FEs is a very important application in the analysis of psychological distress [53], [54], [55]. Finally, in recent years successful commercial applications like Emotient [56], Affectiva [57], RealEyes [58] and Kairos [59] perform large-scale internet-based assessments of viewer reactions to ads and related material for predicting buying behaviour.
3. A TAXONOMY FOR RECOGNIZING FES
In Figure 3 we propose a taxonomy for AFER, built along two main components: parametrization and recognition of FEs. These are important components of an automatic FE recognition system, regardless of the data modality.
Parametrization deals with defining coding schemes for describing FEs. Coding schemes may be categorized into two main classes. Descriptive coding schemes parametrize FE in terms of surface properties. They focus on what the face can do. Judgmental coding schemes describe FEs in terms of the latent emotions or affects that are believed to generate them. Please refer to Section 3.1 for further details.
An automatic facial analysis system for images or video usually consists of four main parts. First, faces have to be localized in the image (Section 3.2.1). Second, many methods require a face registration step. During registration, fiducial points (e.g., the corners of the mouth or the eyes) are detected, allowing the processing to be adapted to the particular pose and deformation of the face (Section 3.2.2). In a third step, features are extracted from the face with techniques that depend on the data modality. A common taxonomy is described for the three considered modalities: RGB, 3D and thermal. The approaches are divided into geometric or appearance based, global or local, and static or dynamic (Section 3.2.3). Other approaches use a combination of these categories. Finally, machine learning techniques are used to discriminate between FEs. These techniques can predict a categorical expression or represent the expression in a continuous output space, and may or may not model temporal information about the dynamics of FEs (Section 3.2.4).
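The following sketch makes the four-stage pipeline explicit. The helper names (detector, registrar, extractor, classifier) are placeholders for the concrete techniques discussed in the following sections, not functions from any particular library:

```python
# Schematic sketch of the generic AFER pipeline; each stage is a pluggable
# component corresponding to Sections 3.2.1-3.2.4.
def recognize_expressions(image, detector, registrar, extractor, classifier):
    faces = detector(image)                      # 1. face localization
    results = []
    for bbox in faces:
        landmarks = registrar(image, bbox)       # 2. face registration (fiducial points)
        features = extractor(image, landmarks)   # 3. modality-dependent feature extraction
        results.append(classifier(features))     # 4. expression classification / regression
    return results
```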
An additional step, multimodal fusion (Section 3.2.5), is required when dealing with multiple data modalities, usually coming from other sources of information such as speech and physiological data. This step can be approached in four different ways, depending on the stage at which it is introduced: direct, early, late and sequential fusion.
Modern FE recognition techniques rely on labeled data to learn discriminative patterns for recognition and, in many cases, feature extraction. For this reason, we introduce in Section 3.3 the main datasets for all three modalities. These are characterized based on the content of the labeled data, the capture conditions and the participant distribution.
3.1. Parameterization of FEs
Descriptive coding schemes focus on what the face can do. The most well known examples of such systems are the Facial Action Coding System (FACS) and the Facial Animation Parameters (FAP). Perhaps the most influential, FACS (1978; 2002) seeks to describe nearly all possible FEs in terms of anatomically-based facial actions [171], [172]. The FEs are coded in Action Units (AUs), which define the contraction of one or more facial muscles (see Figure 4). FACS also provides the rules for visual detection of AUs and their temporal segments (onset, apex, offset, ordinal intensity). For relating FEs to emotions, Ekman and Friesen later developed EMFACS (Emotion FACS), which scores facial actions relevant for particular emotion displays [173]. FAP is now part of the MPEG-4 standard and is used for synthesizing FE for animating virtual faces. It is rarely used to parametrize FEs for recognition purposes [136], [137]. Its coding scheme is based on the position of key feature control points in a mesh model of the face. The Maximally Discriminative Facial Movement Coding System (MAX) [174], another descriptive system, is less granular and less comprehensive. Brow raise in MAX, for instance, corresponds to two separate actions in FACS. It is a truly sign-based approach as it makes no inferences about underlying emotions.
Judgmental coding schemes, on the other hand, describe FEs in terms of the latent emotions or affects that are believed to generate them. Because a single emotion or affect may result in multiple expressions, there is no 1:1 correspondence between what the face does and its emotion label. A hybrid approach is to define emotion labels in terms of specific signs rather than latent emotions or affects. Examples are EMFACS and AFFEX [175]. In each, expressions related to each emotion are defined descriptively. As an example, enjoyment may be defined by an expression displaying an oblique lip-corner pull co-occurring with cheek raise. Hybrid systems are similar to judgment-based systems in that there is an assumed 1:1 correspondence between emotion labels and the signs that describe them. For this reason, we group hybrid approaches with judgment-based systems.
3.2. Recognition of FEs
An AFER system consists of four steps: face detection, face registration, feature extraction and expression recognition.
3.2.1. Face localization
We discuss two main face localization approaches. Detection approaches locate the faces present in the data, obtaining their bounding box or geometry. Segmentation assigns a binary label to each pixel. The reader is referred to [176] for an extensive review on face localization approaches.
For RGB images, the Viola & Jones detector [60] is still one of the most widely used algorithms [10], [61], [177]. It is based on a cascade of weak classifiers; while fast, it has problems with occlusions and large pose variations [10]. Some methods overcome these weaknesses by considering multiple pose-specific detectors combined with either a pose router [61] or a probabilistic approach [178]. Other approaches include Convolutional Neural Networks (CNN) [63] and Support Vector Machines (SVM) applied over HOG features [62]. While the latter achieves a lower accuracy than Viola & Jones, the CNN approach in [63] achieves comparable accuracies over a wide range of poses.
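For illustration, a minimal Viola & Jones-style detection sketch using OpenCV's bundled Haar cascade is shown below; the scale factor, neighbour count and minimum size are illustrative defaults rather than values used by the cited works:

```python
# Minimal sketch of cascade-based face detection with OpenCV (Viola & Jones).
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces(bgr_image):
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
    # Returns (x, y, w, h) bounding boxes for the detected faces.
    return face_cascade.detectMultiScale(gray, scaleFactor=1.1,
                                         minNeighbors=5, minSize=(30, 30))
```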
Regarding face segmentation, early works usually exploit color and texture information along with ellipsoid fitting [66], [67], [68]. A subsequent correction step is introduced in [69] to fill prediction gaps and relabel wrongly classified background pixels. Some works use segmentation to reduce the search space during face detection [179], while others use a Face Saliency Map (FSM) [70] to fit a geometric model of the face and perform a boundary correction procedure.
For 3D images, [64], [65] use curvature features to detect high-curvature areas such as the nose tip and eye cavities. Segmentation is also applied to 3D face detection. [73] uses k-means to discard the background and locates candidate faces through edge and ellipsoid detection, selecting the highest-probability fitting. In [72], Random Forests are used to assign a body part label to each pixel, including the face. This approach was later extended in [71], using Graph Cuts (GC) to optimize the Random Forest probabilities.
While RGB techniques are applicable to thermal images, segmenting the image according to the radiant emittance of each pixel [74], [75] is usually sufficient.
3.2.2. Face registration
Once the face is detected, fiducial points (also known as landmarks) are located (see Figure 5). This step is necessary in many AFER approaches in order to rotate or frontalize the face. Equivalently, in the 3D case the face geometry is registered against a 3D geometric model. A thorough review of this subject is out of the scope of this work. The reader is referred to [180] and [181] for 2D and 3D surveys respectively.
Different approaches are used for grayscale, RGB and near-infrared modalities, and for 3D. In the first case, the objective is to exploit visual information to perform feature detection, a process usually referred to as landmark localization or face alignment. In the 3D case, the acquired geometry is registered to a shape model through a process known as face registration, which minimizes the distance between both. While these processes are distinct, sometimes the same name is used in the literature. To prevent confusion, this work refers to them as 2D and 3D face registration.
2D face registration.
Active Appearance Models (AAM) [77] are among the most used methods for 2D face registration. AAM extends Active Shape Models (ASM) [76] by encoding both geometry and intensity information. 3D versions of AAM have also been proposed [78], but they make alignment much slower because shape and appearance fitting can no longer be decoupled. This limitation is circumvented in [79], where a 2D model is fit while a 3D one restricts its shape variations. Another possibility is to generate a 2D model from 3D data through a continuous, uniform sampling of its rotations [182].
The real-time method of [80] uses Conditional Regression Forests (CRF) over a dense grid, extracting intensity features and Gabor wavelets at each cell. A more recent set of real-time methods is based on regressing the shape through a cascade of linear regressors. As an example, Supervised Descent Method (SDM) [81] uses simplified SIFT features extracted at each landmark estimate.
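As a hedged example of cascaded-regression-style landmark localization, the sketch below uses dlib's ensemble-of-regression-trees predictor, a method in the same family as SDM; the 68-landmark model file is an external download and is assumed to be available locally:

```python
# Sketch of 2D face registration (landmark localization) with dlib.
import dlib

detector = dlib.get_frontal_face_detector()
# Assumed local path to the pre-trained 68-landmark model.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def localize_landmarks(gray_image):
    all_landmarks = []
    for rect in detector(gray_image, 1):          # detect faces (1 upsampling pass)
        shape = predictor(gray_image, rect)       # cascaded regression of 68 points
        all_landmarks.append([(p.x, p.y) for p in shape.parts()])
    return all_landmarks
```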
3D face registration.
In the 3D case, the goal is to find a geometric correspondence between the captured geometry and a model. Iterative Closest Point (ICP) [82] iteratively aligns the closest points of two shapes. In [83], visible patches of the face are detected and used to discard obstructions before using ICP for alignment. Non-rigid registration additionally allows the matched 3D model to deform. In [84], a correspondence is established manually between landmarks of the model and the captured data, using a Thin Plate Spline (TPS) model to deform the shape. [85] improves the method by using multi-resolution fitting, an adaptive correspondence search range, and enforcing symmetry constraints. [86] uses a coarse-to-fine approach based on the shape curvature. It initially locates the nose tip and eye cavities, afterwards localizing finer features. Similarly, [87] first finds the symmetry axis of the face in order to facilitate feature matching. Other techniques include registering a 3D Morphable Model (3DMM) [88], a 3D-ASM [89] or a deformable 2D triangular mesh [90], and registering a 3D model through Simulated Annealing (SA) [91].
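The core of rigid ICP can be sketched in a few lines: alternating nearest-neighbour correspondence search and a closed-form (SVD-based) rigid alignment. Real 3D registration pipelines add outlier rejection, sub-sampling and, as discussed above, non-rigid deformation; this sketch only illustrates the basic iteration of [82]:

```python
# Minimal rigid ICP sketch: nearest-neighbour matching + SVD (Kabsch) alignment.
import numpy as np
from scipy.spatial import cKDTree

def icp(source, target, iterations=20):
    """Align source (Nx3) points to target (Mx3); returns the transformed source."""
    src = source.copy()
    tree = cKDTree(target)
    for _ in range(iterations):
        _, idx = tree.query(src)                 # 1. closest target point per source point
        matched = target[idx]
        mu_s, mu_t = src.mean(0), matched.mean(0)
        H = (src - mu_s).T @ (matched - mu_t)    # 2. cross-covariance matrix
        U, _, Vt = np.linalg.svd(H)
        R = Vt.T @ U.T
        if np.linalg.det(R) < 0:                 # avoid reflections
            Vt[-1] *= -1
            R = Vt.T @ U.T
        t = mu_t - R @ mu_s
        src = src @ R.T + t                      # 3. apply the rigid transform
    return src
```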
3.2.3. Feature extraction
Extracted features can be divided into predesigned and learned. Predesigned features are hand-crafted to extract relevant information. Learned features are automatically learned from the training data. This is the case of deep learning approaches, which jointly learn the feature extraction and classification/regression weights. These categories are further divided into global and local, where global features extract information from the whole facial region, and local ones from specific regions of interest, usually corresponding to AUs. Features can also be split into static and dynamic, with static features describing a single frame or image and dynamic ones including temporal information.
Predesigned features can also be divided into appearance and geometrical. Appearance features use the intensity information of the image, while geometrical ones measure distances, deformations, curvatures and other geometric properties. This is not the case of learned features, for which the nature of the extracted information is usually unknown.
Geometric features describe faces through distances and shapes. These cannot be extracted from thermal data, since the low contrast of facial features hampers the precise localization of landmarks. Global geometric features, for both RGB and 3D modalities, usually describe the face deformation based on the location of specific fiducial points. For RGB, [114] uses the distance between fiducial points. The deformation parameters of a mesh model are used in [115], [116]. Similarly, for 3D data, [117] uses the distance between pairs of 3D landmarks, while [92] uses the deformation parameters of an EDM. Manifolds are used in [119] to describe the shape deformation of a fitted 3D mesh separately at each frame of a video sequence through Lipschitz embedding.
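A simple global geometric descriptor of this kind can be sketched as the set of normalized pairwise distances between fiducial points, in the spirit of [114] (2D) and [117] (3D); the eye-corner indices used for scale normalization follow the common 68-landmark convention and are an assumption here:

```python
# Sketch of a global geometric feature: pairwise landmark distances,
# normalized by the inter-ocular distance.
import numpy as np

def pairwise_distance_features(landmarks, left_eye_idx=36, right_eye_idx=45):
    pts = np.asarray(landmarks, dtype=float)        # (N, 2) or (N, 3) points
    diffs = pts[:, None, :] - pts[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    iod = dists[left_eye_idx, right_eye_idx]        # inter-ocular distance for scaling
    iu = np.triu_indices(len(pts), k=1)             # unique pairs only
    return dists[iu] / iod
```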
The use of 3D data allows generating 2D representations of facial geometry such as depth maps [120], [121]. In [122] Local Binary Patterns (LBP) are computed over different 2D representations, extracting histograms from them. Similarly, [123] uses SVD to extract the 4 principal components from LBP histograms. In [124], the geometry is described through the Conformal Factor Image (CFI) and Mean Curvature Image (MCI). [125] captures the mean curvatures at each location with Differential Mean Curvature Maps (DMCM), using HOG histograms to describe the resulting map.
In the dynamic case the goal is to describe how the face geometry changes over time. For RGB data, facial motions are estimated from color or intensity information, usually through Optical flow [126]. Other descriptors such as Motion History Images (MHI) and Free-Form Deformations (FFDs) are also used [127]. In the 3D case, much denser geometric data facilitates a global description of the facial motions. This is done either through deformation descriptors or motion vectors. [128] extracts and segments level curvatures, describing the deformation of each segment. FFDs are used in [129] to register the motion between contiguous frames, extracting features through a quad-tree decomposition. Flow images are extracted from contiguous frame pairs in [130], stacking and describing them with LBP-TOP.
In the case of local geometric feature extraction, deformations or motions in localized regions of the face are described. Because these regions are small, geometrically describing their deformations is difficult in the RGB case, being limited by the precision of the face registration step. As such, most RGB techniques are dynamic. For 3D data, where much denser geometric information is available, the opposite happens.
In the static case, some 3D approaches describe the curvature at specific facial regions, either using primitives [131] or closed curves [132]. Others describe local deformations through SIFT descriptors [120] extracted from the depth map or HOG histograms extracted from DMCM feature maps [125]. In [133] the Basic Facial Shape Components (BFSC) of the neutral face are estimated from the expressive one, subtracting the expressive and neutral face depth maps at rectangular regions around the eyes and mouth.
Most dynamic descriptors in the geometric, local case have been developed for the RGB modality. These are either based on landmark displacements, coded with Motion Units [134], [135], or the deformation of certain facial components such as the mouth, eyebrows and eyes, coded with FAP [136], [137]. One exception is the work in [138] over 3D data, where an EDM locates a set of landmarks and a motion vector is extracted from each landmark and pair of frames.
Although geometrical features are effective for describing FEs, they fail to detect subtler characteristics like wrinkles, furrows or skin texture changes. Appearance features are more robust to noise, allow the detection of a more complete set of FEs, and are particularly important for detecting microexpressions. These feature extraction techniques are applicable to both RGB and thermal modalities, but not to 3D data, which does not convey appearance information.
Global appearance features are based on standard feature descriptors extracted on the whole facial region. For RGB data, usually these descriptors are applied either over the whole facial patch or at each cell of a grid. Some examples include Gabor filters [99], [100], LBP [97], [98], Pyramids of Histograms of Gradients (PHOG) [93], [94], Multi-Scale Dense SIFT (MSDF) [94] and Local Phase Quantization (LPQ) [93]. In [102] a grid is deformed to match the face geometry, afterwards applying Gabor filters at each vertex. In [101] the facial region is divided by a grid, applying a bank of Gabor filters at each cell and radially encoding the mean intensity of each feature map. An approach called Graph-Preserving Sparse Non-negative Matrix Factorization (GSNMF) [95] finds the closest match to a set of base images and assigns its associated primary emotion. This approach is improved in [96], where Projected Gradient Kernel Non-negative Matrix Factorization (PGKNMF) is proposed.
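A typical grid-based LBP descriptor in the spirit of [97], [98] can be sketched as follows: uniform LBP codes are computed over the face patch, a histogram is built per grid cell, and the cell histograms are concatenated. The grid size and LBP parameters are illustrative choices:

```python
# Sketch of global appearance features: per-cell uniform LBP histograms.
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_grid_features(face_gray, grid=(7, 7), P=8, R=1):
    codes = local_binary_pattern(face_gray, P, R, method="uniform")
    n_bins = P + 2                                   # uniform codes range over 0..P+1
    h, w = codes.shape
    hists = []
    for i in range(grid[0]):
        for j in range(grid[1]):
            cell = codes[i * h // grid[0]:(i + 1) * h // grid[0],
                         j * w // grid[1]:(j + 1) * w // grid[1]]
            hist, _ = np.histogram(cell, bins=n_bins, range=(0, n_bins), density=True)
            hists.append(hist)
    return np.concatenate(hists)                     # grid[0] * grid[1] * n_bins features
```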
In the case of thermal images, the low contrast of the image makes it difficult to exploit the facial geometry. This means that, in the global case, the whole facial patch is used. The descriptors exploit the differences in temperature between regions. One of the first works [103] generated a series of Binary Differential Images (BDI), extracting as a feature the ratio of positive area relative to the mean ratio over the training samples. The 2D Discrete Cosine Transform (2D-DCT) is used in [74], [105] to decompose the frontalized face into cosine waves, from which a heuristic approach extracts features.
Dynamic global appearance descriptors are three-dimensional extensions of the static global descriptors described above. For instance, Local Binary Pattern histograms from Three Orthogonal Planes (LBP-TOP) are used for RGB data [106]. LBP-TOP is an extension of LBP computed over three orthogonal planes at each bin of a 3D volume formed by stacking the frames. [94] uses a combination of LBP-TOP and Local Phase Quantization from Three Orthogonal Planes (LPQ-TOP), a descriptor similar to LBP-TOP but more robust to blur. LPQ-TOP is also used in [107], along with Local Gabor Binary Patterns from Three Orthogonal Planes (LGBP-TOP). In [108], a combination of HOG, SIFT and CNN features is extracted at each frame. The first two are extracted from an overlapping grid, while the CNN extracts features from the whole facial patch. These are evaluated independently over time and embedded into Riemannian manifolds. For thermal images, [109] uses a combination of Temperature Difference Histogram Features (TDHFs) and Thermal Statistic features (StaFs). TDHFs consist of histograms extracted over the difference of thermal images. StaFs are a series of 5 basic statistical measures extracted from the same difference images.
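The principle behind LBP-TOP can be sketched by computing LBP histograms on the three orthogonal planes of a stacked frame volume; full implementations aggregate over many planes and spatial blocks, so this is a simplified illustration only, not the exact descriptor of [106]:

```python
# Simplified LBP-TOP sketch: LBP histograms on the central XY, XT and YT planes.
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_top(frames, P=8, R=1):
    vol = np.asarray(frames)                         # (T, H, W) stacked grayscale frames
    T, H, W = vol.shape
    planes = [vol[T // 2],                           # XY plane (appearance)
              vol[:, H // 2, :],                     # XT plane (horizontal motion)
              vol[:, :, W // 2]]                     # YT plane (vertical motion)
    n_bins = P + 2
    hists = []
    for plane in planes:
        codes = local_binary_pattern(plane, P, R, method="uniform")
        hist, _ = np.histogram(codes, bins=n_bins, range=(0, n_bins), density=True)
        hists.append(hist)
    return np.concatenate(hists)
```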
Local appearance features are not used as frequently as global ones, since they require prior knowledge to determine the regions of interest. In spite of that, some works use them for both RGB and thermal modalities. In the case of static features, [110] describes the appearance of grayscale frames by spreading an array of cells across the mouth and extracting the mean intensity from each. For thermal images, [75] generates eigenimages from each region of interest and uses the principal component values as features. In [111], Gray Level Co-occurrence Matrices (GLCMs) are extracted from the interest regions and second-order statistics are computed on them. GLCMs encode texture information by representing the occurrence frequencies of pairs of pixel intensities at a given distance. As such, they are also applicable to the RGB case. In [104] a combination of StaFs, 2D-DCT and GLCM features is used, extracting both local and global information.
Few works consider dynamic local appearance features. The only one to our knowledge, [112], describes thermal sequences by processing them with SIFT flow and chunking them into clips. Contiguous clip frames are warped and subtracted, spatially dividing the clip with a grid. The resulting cuboids with higher inter-frame variability for either radiance or flow are selected, extracting a Bag of Words histogram (BoW Hist.) from each.
Based on the observation that some AUs are better detected using geometrical features and others using appearance ones, it was suggested that a combination of both might increase recognition performance [127], [139], [183]. Feature extraction methods combining geometry and appearance are more common for RGB, but it is also possible to combine RGB and 3D. Because 3D data is highly discriminative and robust to problems such as shadows and illumination changes, the benefits of combining it with RGB data are small. Nevertheless, some works have done so [141], [142], [143]. It should also be possible to extract features combining 3D and thermal information, but to the best of our knowledge it has not been attempted.
In the static case, [139] uses a combination of Multistate models and edge detection to detect 18 different AUs on the upper and lower parts of the face in grayscale images. [140] uses both global geometry features and local appearance features, combining landmark distances and angles with HOG histograms centered at the barycenter of triangles specified by three landmarks. Other approaches use deformable models such as 3DMM [141] to combine 3D and intensity information. In [142], [143] SFAM describes the deformation of a set of distance-based, patch-based and grayscale appearance features encoded using LBP.
When analysing dynamic information, [140] uses RGB data to combine the landmark displacements between two frames with the change in intensity of pixels located at the barycenter defined by three landmarks.
Learned features are usually trained through a joint feature learning and classification pipeline. As such, these methods are explained in Section 3.2.4 along with learning. The resulting features usually cannot be classified as local or global. For instance, in the case of CNNs, multiple convolution and pooling layers may lead to higher-level features comprising the whole face, or to a pool of local features. This may happen implicitly, due to the complexity of the problem, or by design, due to the topology of the network. In other cases, this locality may be hand-crafted by restricting the input data. For instance, the method in [152] selects interest regions and describes each one with a Deep Belief Network (DBN). Each DBN is jointly trained with a weak classifier in a boosted approach.
3.2.4. FE classification and regression
FE recognition techniques are grouped into categorical and continuous depending on the target expressions [184]. In the categorical case there is a predefined set of expressions. Commonly, a classifier is trained for each expression, although other ensemble strategies could be applied. Some works detect the six primary expressions [99], [115], [116], while others detect expressions of pain, drowsiness and emotional attachment [48], [185], [186], or indices of psychiatric disorder [187], [188].
In the continuous case, FEs are represented as points in a continuous multidimensional space [9]. The advantages of this second approach are the ability to represent subtly different expressions and mixtures of primary expressions, and the ability to define expressions in an unsupervised manner through clustering. Many continuous models are based on the activation-evaluation space. In [157], a Recurrent Neural Network (RNN) is trained to predict the real-valued position of an expression inside that space. In [158] the feature space is scaled according to the correlation between features and target dimensions, clustering the data and performing Kernel regression. In other cases like [156], which uses a RNN for classification, each quadrant is considered as a class, along with a fifth neutral target.
Expression recognition methods can also be grouped into static and dynamic. Static models evaluate each frame independently, using classification techniques such as Bayesian Network Classifiers (BNC) [115], [134], [135], Neural Networks (NN) [103], [139], Support Vector Machines (SVM) [75], [99], [116], [120], [125], SVM committees [111] and Random Forests (RF) [140]. In [101] k-Nearest Neighbors (kNN) is used to separately classify local patches, performing a dimensionality reduction of the outputs through PCA and LDA and classifying the resulting feature vector.
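A static, categorical recognizer of this kind reduces to a standard classification pipeline over pre-extracted feature vectors. The sketch below, with illustrative hyper-parameters, trains one-vs-rest SVMs on primary-expression labels:

```python
# Sketch of a static categorical FE classifier over pre-extracted features.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

expression_clf = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", C=1.0, decision_function_shape="ovr", probability=True))

# X_train: (n_samples, n_features) feature vectors; y_train: expression labels.
# expression_clf.fit(X_train, y_train)
# predictions = expression_clf.predict(X_test)
```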
More recently, deep learning architectures have been used to jointly perform feature extraction and recognition. These approaches often use pre-training [189], an unsupervised layer-wise training step that allows much larger, unlabeled datasets to be used. CNNs are used in [144], [145], [146], [147], [148]. [149] proposes AU-aware Deep Networks (AUDN), where a common convolutional plus pooling step extracts an over-complete representation of expression features, from which receptive fields map the relevant features for each expression. Each receptive field is fed to a DBN to obtain a non-linear feature representation, using an SVM to detect each expression independently. In [152] a two-step iterative process is used to train Boosted Deep Belief Networks (BDBN) where each DBN learns a non-linear feature from a face patch, jointly performing feature learning, selection and classifier training. [151] uses a Deep Boltzmann Machine (DBM) to detect FEs from thermal images. Regarding 3D data, [150] transforms the facial depth map into a gradient orientation map and performs classification using a CNN.
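As a hedged illustration of jointly learned features and classification, the small PyTorch CNN below maps a 48x48 grayscale face patch to the six primary expressions; the topology is illustrative and not taken from any of the cited architectures:

```python
# Small CNN sketch for categorical FE recognition from grayscale face patches.
import torch.nn as nn

class SmallExpressionCNN(nn.Module):
    def __init__(self, n_classes=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 48 -> 24
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 24 -> 12
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)) # 12 -> 6
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(128 * 6 * 6, 256), nn.ReLU(),
            nn.Dropout(0.5), nn.Linear(256, n_classes))

    def forward(self, x):                     # x: (batch, 1, 48, 48)
        return self.classifier(self.features(x))
```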
Dynamic models take into account features extracted independently from each frame to model the evolution of the expression over time. Dynamic Bayesian Networks such as Hidden Markov Models (HMM) [127], [128], [129], [134], [136], [137], [153] and Variable-State Latent Conditional Random Fields (VSL-CRF) [113] are used. Other techniques use RNN architectures such as Long Short-Term Memory (LSTM) networks [126]. In other cases [154], [155], hand-crafted rules are used to evaluate the current frame expression against a reference frame. In [140] the transition probabilities between FEs given two frames are first evaluated with RF. The average of the transition probabilities from previous frames to the current one, and the probability of each expression given the individual frame, are averaged to predict the final expression. Other approaches classify each frame independently (e.g., with SVM classifiers [110]), using the prediction averages to determine the final FE.
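One common dynamic design, in the spirit of the LSTM-based approach of [126], feeds per-frame feature vectors to a recurrent network and classifies the final hidden state; the dimensions below are illustrative assumptions:

```python
# Sketch of a dynamic model: per-frame features -> LSTM -> expression class.
import torch.nn as nn

class SequenceExpressionLSTM(nn.Module):
    def __init__(self, feat_dim=136, hidden=64, n_classes=6):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, x):                     # x: (batch, n_frames, feat_dim)
        _, (h_n, _) = self.lstm(x)
        return self.out(h_n[-1])              # classify from the last hidden state
```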
In [115], [130] an intermediate approach is proposed where motion features between contiguous frames are extracted from interest regions, afterwards using static classification techniques. [108] encodes statistical information of frame-level features into Riemannian manifolds, and evaluates three approaches to classify the FEs: SVM, Logistic regression (LR) and Partial Least Squares (PLS).
More recently, dynamic, continuous models have also been considered. Deep Bidirectional Long Short-Term Memory Recurrent Neural Networks (DBLSTM-RNN) are used in [107]. While [159] uses static methods to make the initial affect predictions at each time step, it uses particle filters to make the final prediction. This both reduces noise and performs modality fusion.
3.2.5. Multimodal fusion techniques
Many works have considered multimodality for recognizing emotions, either by considering different visual modalities describing the face or, more commonly, by using other sources of information (e.g. audio or physiological data). Fusing multiple modalities has the advantage of increased robustness and conveying complementary information. Depth information is robust to changes in illumination, while thermal images convey information related to changes in the blood flow produced by emotions. It has been found that momentary stress increases the periorbital blood flow, while if sustained the blood flow to the forehead increases [197]. Joy decreases the blood flow to the nose, while arousal increases it to the nose, periorbital, lips and forehead [198].
The fusion approaches followed by these works can be grouped into three main categories: early, late and sequential fusion (see Figure 6). Early fusion merges the modalities at the feature level, while late fusion does so after applying expression recognition, at the decision level [199]. Early fusion directly exploits correlations between features from different modalities, and is especially useful when sources are synchronous in time. However, it forces the classifier/regressor to work with a higher-dimensional feature space, increasing the likelihood of over-fitting. On the other hand, late fusion is usually considered for asynchronous data sources, and can be trained on modality-specific datasets, increasing the amount of available data. A sequential use of modalities is also considered by some multimodal approaches [170].
It is also possible to directly merge the input data from different modalities, an approach referred to in this document as direct data fusion. This approach has the advantage of allowing the extraction of features from a richer data source, but is limited to input data that is correlated in the spatial and, if considered, temporal domains.
Regarding early fusion, the simplest approach is plain early fusion, which consists of concatenating the feature vectors from both modalities. This is done in [126], [160] to fuse RGB video and speech. Usually, a feature selection approach is applied. One such technique is Sequential Backward Selection (SBS), where the least significant feature is iteratively removed until some criterion is met. In [162] SBS is used to merge RGB video and speech. A more complex approach is to use the best-first search algorithm, as done in [161] to fuse RGB facial and body gesture information. Other approaches include using 10-fold cross-validation to evaluate different subsets of features [165] and an Analysis of Variance (ANOVA) [166] to independently evaluate the discriminative power of each feature. These two works both fuse RGB video, gesture and speech information.
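Plain early fusion followed by backward feature selection can be sketched with scikit-learn; SequentialFeatureSelector in backward mode approximates SBS, and the linear SVM and number of retained features are illustrative choices (exhaustive backward selection over large fused feature vectors is slow in practice):

```python
# Sketch of early fusion (feature concatenation) plus Sequential Backward Selection.
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.svm import LinearSVC

def early_fusion_with_sbs(X_face, X_speech, y, n_keep=50):
    X = np.hstack([X_face, X_speech])                # feature-level fusion
    selector = SequentialFeatureSelector(
        LinearSVC(max_iter=5000), n_features_to_select=n_keep,
        direction="backward", cv=5)
    X_selected = selector.fit_transform(X, y)        # keeps the n_keep most useful features
    return X_selected, selector
```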
An alternative to feature selection is to encode the dependencies between features. This can be done by using probabilistic inference models for recognition. A Bayesian Network is used in [163] to infer the emotional state from both RGB video and speech. In [164] a Multi-stream Fused HMM (MFHMM) models synchronous information on both modalities, taking into account the temporal component. The advantage of probabilistic inference models is that the relations between features are restricted, reducing the degrees of freedom of the model. On the other hand, it also means that it is necessary to manually design these relations. Other inference techniques are also used, such as Fuzzy Inference Systems (FIS), to represent emotions in a continuous 4-dimensional output space based on grayscale video, audio and contextual information [169].
Late fusion merges the results of multiple classifiers/regressors into a final prediction. The goal is either to obtain a final class prediction, a continuous output specifying the intensity/confidence for each expression, or a continuous value for each dimension in the case of continuous representations. Here the most common late fusion strategies used for emotion recognition are discussed, but since late fusion can be seen as an ensemble learning approach, many other machine learning techniques could be used. The simplest approach is the Maximum rule, which selects the maximum of all posterior probabilities. This is done in [162] to fuse RGB video and speech. This technique is sensitive to high-confidence errors: a classifier incorrectly predicting a class with high confidence would frequently be selected as winner even if all other classifiers disagree. This can be partially offset by using a combination of responses, as is the case of the Sum rule and Product rule. The Sum rule sums the confidences for a given class from each classifier, giving the class with the highest confidence as result [108], [161], [162]. The Product rule works similarly, but multiplies the confidences [161], [162]. While these approaches partially offset the single-classifier weakness problem, the strengths of each individual modality are not considered. The Weight criterion solves this by assigning a confidence to each classifier output, producing a weighted linear combination of the predictions [161], [162], [167], [200]. A rule-based approach is also possible, where a dominant modality is selected for each target class [168].
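These decision-level rules are straightforward to express over per-modality posterior vectors; the sketch below implements the maximum, sum, product and weighted rules discussed above:

```python
# Sketch of late-fusion rules over per-modality class posteriors.
import numpy as np

def late_fusion(posteriors, rule="sum", weights=None):
    P = np.asarray(posteriors)                       # shape: (n_modalities, n_classes)
    if rule == "max":
        scores = P.max(axis=0)
    elif rule == "sum":
        scores = P.sum(axis=0)
    elif rule == "product":
        scores = P.prod(axis=0)
    elif rule == "weighted":
        scores = (np.asarray(weights)[:, None] * P).sum(axis=0)
    else:
        raise ValueError(f"unknown rule: {rule}")
    return int(np.argmax(scores)), scores

# Example: two modalities, three classes.
label, _ = late_fusion([[0.2, 0.5, 0.3], [0.1, 0.7, 0.2]], rule="product")
```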
Bayesian Inference is used to fuse predictions of RGB, speech and lexical classifiers, simultaneously modeling time [98]. The Bayesian framework uses information from previous frames along with the predictions from each modality to estimate the emotion displayed at the current frame.
Sequential fusion applies the different modality predictions in sequential order, using the results of one modality to disambiguate those of another when needed. Few works use this technique; one example is [170], a rule-based approach that combines grayscale facial and speech information. The method uses acoustic data to distinguish candidate emotions, disambiguating the results with grayscale information.
3.3. FE datasets
We group datasets’ properties into three main categories, focusing on content, capture modality and participants. In the content category we refer to the type of content and labels the datasets provide. We signal the intentionality of the FEs (posed or spontaneous), the labels (primary expressions, AUs or others where applicable) and whether datasets contain still images or video sequences (static/dynamic). In the capture category we group datasets by the context in which data was captured (lab or non-lab) and the diversity in perspective, illumination and occlusions. The last category compiles statistical data about participants, including age, gender and ethnic diversity. In Figure 7 we show samples from some of the most well-known datasets. In Tables 1 and 2 the reader can find a complete list of RGB, 3D and Thermal datasets and their characteristics.
TABLE 1: RGB datasets.

| | | CK+ | MPIE | JAFFE | MMI | RU_FACS | SEMAINE | CASME | DISFA | AFEW | SFEW | AMFED |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Content | Intention (Posed/Spontaneous) | P | P | P | P | S | S | S | S | S | S | S |
| | Label (Primary/AU/DA) | P/AU | P | P | AU + T | P/AU | P/AU/DA¹ | P/AU | AU + I | P/² | P | P/AU/Smile |
| | Temporality (Static/Dynamic) | D | S | S | D | D | D | D | D | D | S | D |
| Capture | Environment (Lab/Non-lab) | L | L | L | L | L | L | L | L | N | N | N |
| | Multiple Perspective | ○ | ● | ○ | ● | ● | ○ | ○ | ○ | ● | ● | ● |
| | Multiple Illumination | ○ | ● | ○ | ● | ○ | ○ | ○ | ○ | ● | ● | ● |
| | Occlusions | ○ | ● | ○ | ○ | ● | ○ | ○ | ○ | ● | ● | ○ |
| Subjects | # of subjects | 201 | 337 | 10 | 75 | 100 | 150 | 35 | 27 | 220 | 68 | 5268 |
| | Ethnic Diverse | ● | ● | ○ | ● | ○ | ○ | ○ | ● | ● | ● | ● |
| | Gender (Male/Female (%)) | 31/69 | 70/30 | 0/100 | 50/50 | - | 62/38 | 37/63 | 44/56 | - | - | 58/42 |
| | Age | 18–50 | μ = 27.9 | - | 19–62 | 18–30 | 22–60 | μ = 22 | 18–50 | 1–70 | - | - |

● = Yes, ○ = No, - = Not enough information. DA = Dimensional Affect, I = Intensity labelling, T = Temporal segments.
¹ Other labels include Laughs, Nods and Epistemic states (e.g. Certain, Agreeing, Interested); refer to the original paper for details [205].
² Pose, Age and Gender labels; refer to the original paper for details [203].
TABLE 2: 3D datasets (BU-3DFE, BU-4DFE, Bosphorus, BP4D) and RGB+Thermal datasets (IRIS, NIST, NVIE, KTFE).

| | | BU-3DFE | BU-4DFE | Bosphorus | BP4D | IRIS | NIST | NVIE | KTFE |
|---|---|---|---|---|---|---|---|---|---|
| Content | Intention (Posed/Spontaneous) | P | P | P | S | P | P | S/P | S/P |
| | Label (Primary/AU) | P + I | P | P/AU | AU | P | P | P | P |
| | Temporality (Static/Dynamic) | S | D | S | D | S | S | D | D |
| Capture | Environment (Lab/Non-lab) | L | L | L | L | L | L | L | L |
| | Multiple Perspective | ● | ● | - | ● | ● | ● | ● | ● |
| | Multiple Illumination | ○ | ○ | ○ | ○ | ● | ● | ● | ● |
| | Occlusions | ● | ○ | ● | ○ | ● | ● | ● | ● |
| Subjects | # of subjects | 100 | 101 | 105 | 41 | 30 | 90 | 215 | 26 |
| | Ethnic Diverse | ● | ● | ○ | ● | ● | - | ○ | ○ |
| | Gender (Male/Female (%)) | 56/44 | 57/43 | 43/57 | 56/44 | - | - | 27/73 | 38/62 |
| | Age | 18–70 | 18–45 | 25–35 | 18–29 | - | - | 17–31 | 12–32 |

● = Yes, ○ = No, - = Not enough information. I = Intensity labelling.
RGB.
One of the first important datasets made public was the Cohn-Kanade (CK) dataset [190], later extended into what was called CK+ [191]. The first version is relatively small, consisting of posed primary FEs. It has limited gender, age and ethnic diversity and contains only frontal views with homogeneous illumination. In CK+, the number of posed samples was increased by 22% and spontaneous expressions were added. The MMI dataset was a major improvement [114]. It adds profile views of not only the primary expressions but most of the AUs of the FACS system. It also introduced temporal labeling of onset, apex and offset. Multi-PIE [193] increases the variability by including a very large number of views at different angles and diverse illumination conditions. GEMEP-FERA is a subset of the emotion portrayal dataset GEMEP, specially annotated using FACS. CASME [201] is an example of a dataset containing microexpressions. A limitation of most RGB datasets is the lack of intensity labels. This is not the case for the DISFA dataset [202]: participants were recorded while watching a video specially chosen for inducing emotional states, and 12 AUs were coded for each video frame on a 0 (not present) to 5 (maximum intensity) scale [202].
While the previous RGB datasets record FEs in controlled lab environments, the Acted Facial Expressions In The Wild Database (AFEW) [203], the Affectiva-MIT Facial Expression Dataset (AMFED) [204] and SEMAINE [205] contain faces in naturalistic environments. AFEW has 957 videos extracted from movies, labeled with the six primary expressions and additional information about pose, age, and gender of multiple persons in a frame. AMFED contains spontaneous FEs recorded in natural settings over the Internet. Metadata consists of frame-by-frame AU labelling and self-reporting of affective states. SEMAINE contains primitive FEs, FACS annotations, labels of cognitive states, laughs, nods and shakes during interactions with artificial agents.
3D.
The most well known 3D datasets are BU-3DFE [206], Bosphorus [195] (still images), BU-4DFE [207] (video) and BP4D [38] (video). In BU-3DFE, 6 expressions from 100 different subjects are captured at four different intensity levels. Bosphorus has low ethnic diversity but contains a much larger number of expressions, different head poses and deliberate occlusions. BU-4DFE is a high-resolution 3D dynamic FE dataset [207]. Video sequences, of 100 frames each, are captured from 101 subjects. It only contains primary expressions. BU-3DFE, BU-4DFE and Bosphorus all contain posed expressions. BP4D tries to address this issue with authentic emotion induction tasks [38]. Games, film clips and a cold pressor test for pain elicitation were used to obtain spontaneous FEs. Experienced FACS coders annotated the videos, and the annotations were double-checked against the subjects’ self-reports, FACS analysis and human observer ratings [38].
Thermal.
There are few thermal FE datasets, and all of them also include RGB data. The first ones, IRIS [208] and NIST/Equinox [209], consist of image pairs labeled with three posed primary emotions under various illuminations and head poses. Recently the number of labeled FEs has increased, and image sequences are now also included. The Natural Visible and Infrared facial Expression database (NVIE) contains 215 subjects, each displaying six expressions, both spontaneous and posed [210]. The spontaneous expressions are triggered through audiovisual media, but not all of them are present for each subject. In the Kotani Thermal Facial Emotion (KTFE) dataset, subjects display posed and spontaneous expressions, also triggered through audiovisual media [196].
4. HISTORICAL EVOLUTION AND CURRENT TRENDS
4.1. Historical evolution
The first work on AFER was published in 1978 [211]. It tracked the motion of landmarks in an image sequence. Mostly because of poor face detection and face registration algorithms and limited computational power, the subject received little attention throughout the next decade. The work of Mase and Pentland and of Paul Ekman marked a revival of this research topic at the beginning of the nineties [212], [213]. The interested reader can refer to some influential surveys of these early works [214], [215], [216].
In 2000, the CK dataset was published marking the beginning of modern AFER [139]. While a large number of approaches aimed at detecting primary FEs or a limited set of FACS AUs [99], [116], [134], [137], others focused on a larger set of AUs [114], [127], [139]. Most of these early works used geometric representations, like vectors for describing the motion of the face [134], active contours for describing the shape of the mouth and eyebrows [137], or deformable 2D mesh models [116]. Others focused on appearance representations like Gabor filters [99], optical flow and LBPs [97] or combinations between the two [139]. The publication of the BU-3DFE dataset [206] was a starting point for consistently extending RGB FE recognition to 3D. While some of the methods require manual labelling of fiducial vertices during training and testing [118], [131], [217], others are fully automatic [121], [124], [125], [133]. Most use geometric representations of the 3D faces, like principal directions of surface curvatures to obtain robustness to head rotations [131], and normalized Euclidean distances between fiducial points in the 3D space [118]. Some encode global deformations of facial surface (depth differences between a basic facial shape component and an expressional shape component) [133] or local shape representations [122]. Most of them target primary expressions [131] but studies about AUs were published as well [122], [218].
In the first part of the decade static representations were the primary choice in RGB [99], [139], 3D [118], [120], [125], [131], [133], [219] and thermal [111] alike. In later years various forms of dynamic representation were also explored, such as tracking geometrical deformations across frames in RGB [114], [116] and 3D [119], [128], or directly extracting features from RGB [127] and thermal frame sequences [196], [210].
Besides extended work on improving recognition of posed FEs and AUs, studies on expressions in ever more complex contexts were published. Works on spontaneous facial expression detection [115], [220], [221], [222], analysis of complex mental states [223], detection of fatigue [224], frustration [44], pain [185], [186], [225], severity of depression [53] and psychological distress [226], and including AFER capabilities in intelligent virtual agents [?] opened new territory in AFER research.
In summary, research in AFER started at the end of the 1970s, but for more than a decade progress was slow, mainly because of the limitations of face detection and face registration algorithms and the lack of sufficient computational power. From RGB static representations of posed FEs, approaches advanced towards dynamic representations and spontaneous expressions. In order to deal with the challenges raised by large pose variations, diversity in illumination conditions and the detection of subtle facial behaviour, alternative modalities like 3D and thermal have been proposed. While most of the research focused on primary FEs and AUs, analysis of pain, fatigue, frustration or cognitive states paved the way to new applications in AFER.
In Figure 8 we present a timeline of the historical evolution of AFER. In the next sections we will focus on current important trends.
4.2. Estimating intensity of facial expressions
While detecting FACS AUs facilitates a comprehensive analysis of the face, rather than only of a small subset of so-called primary FEs of affect, being able to estimate the intensity of these expressions would have even greater informational value, especially for the analysis of more complex facial behaviour. For example, differences in intensity and its timing can distinguish between posed and spontaneous smiles [227] and between smiles perceived as polite versus those perceived as embarrassed [228]. Moreover, intensity levels of a subset of AUs are important in determining the level of detected pain [229], [230].
In recent years estimating the intensity of facial expressions, and especially of AUs, has become an important trend in the community. As a consequence, the Facial Expression Recognition and Analysis (FERA) challenge added a special section for intensity estimation [231], [232]. This was recently facilitated by the publication of FE datasets that include intensity labels of spontaneous expressions in RGB [202] and 3D [38].
Even though attempts at estimating FE intensity existed before [233], the first seminal work was published in 2006 [234]. It observed a correlation between a classifier's output margin, in this case the distance to the hyperplane of an SVM classifier, and the intensity of the facial expression. Unfortunately this was only weakly observed for spontaneous FEs.
A number of studies question the validity of estimating intensity from the distance to the classification hyperplane [235], [236], [237]. In two works published in 2011 and 2012, Savran et al. made an excellent study of these techniques, providing solutions to their main weak points [235], [236]. They note that such approaches are designed for AU detection, not intensity estimation, and that the classifier margin does not necessarily incorporate only intensity information. More recently, [237] found that intensity-trained multiclass and regression models outperformed binary-trained classifier decision values on smile intensity estimation across multiple databases and methods for feature extraction and dimensionality reduction.
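The two strategies contrasted in these studies can be sketched side by side: a binary presence/absence SVM whose decision value is used as an intensity proxy, versus a regressor trained directly on intensity annotations (e.g. the 0-5 scale of DISFA). Variable names are placeholders and the hyper-parameters are illustrative:

```python
# Sketch: classifier-margin proxy vs. direct regression for AU intensity.
from sklearn.svm import SVC, SVR

# (a) Proxy: signed distance to the hyperplane of a presence/absence classifier.
au_classifier = SVC(kernel="linear")
# au_classifier.fit(X_train, au_present)             # binary presence labels
# proxy_intensity = au_classifier.decision_function(X_test)

# (b) Direct estimation: regression on intensity-coded labels.
au_regressor = SVR(kernel="rbf", C=1.0, epsilon=0.1)
# au_regressor.fit(X_train, au_intensity)            # 0-5 intensity annotations
# predicted_intensity = au_regressor.predict(X_test)
```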
Other works consider the possible advantage of using 3D information for intensity detection. [235] compares regression on SVM margins with regression on image features in RGB, 3D and their fusion. Gabor wavelets are extracted from RGB and curvature maps from 3D captures. A feature selection step is performed on each of the modalities and on their fusion. The main assumption is that for different AUs, either RGB or 3D representations could be more discriminative. Experiments show that 3D is not necessarily better than RGB; in fact, while 3D shows improvements on some AUs, it suffers performance drops on other AUs, in both the detection and the intensity estimation problems. However, when 3D is fused with RGB, the overall performance increases significantly. In [236], Savran et al. try different 3D surface representations. When evaluated comparatively, the RGB representation performs better on the upper face while the 3D representation performs better on the lower face, and there is an overall improvement if RGB and 3D intensity estimations are fused. This might be because 3D sensing noise can be excessive in the eye region and 3D misses the eye texture information. On the other hand, larger deformations on the lower face make 3D more advantageous. Nevertheless, correlations on the upper face are significantly higher than on the lower face for both modalities. This points to the difficulties in intensity estimation for the lower-face AUs (see Figure 4).
A different line of research analyzes the way geometric and appearance representations can be combined to optimize AU intensity estimation [49], [238]. [238] analyzes the representations best suited for specific AUs. The assumption is that geometric representations perform better for AUs related to deformations of the lips and eyes, and appearance features for other AUs (e.g., cheek deformations). Various descriptors are tested on a small subset of specially chosen AUs, but without a clear conclusion. On the other hand, [49] combines shape with global and local appearance features for continuous AU intensity estimation and continuous pain intensity estimation. A first conclusion is that appearance features achieve better results than shape features. Moreover, the fusion of the two appearance representations, DCT and LBP, gives the best performance, even though a proper alignment might improve the contribution of the shape representation as well. On the other hand, this approach is static, which would fail to distinguish between an eye blink and eye closure, and it does not exploit the correlations between occurrences of different AUs. In order to overcome such limitations, some works use probabilistic models of AU combination likelihoods and intensity priors to improve performance [239], [240].
In summary, estimating facial AU intensity has followed a few distinct approaches. First, some researchers made a critical analysis of the limitations of estimating intensity from classification scores [235], [236], [237]. As an alternative, direct estimation from features was analyzed. Further studies on optimal representations for intensity estimation of different AUs were published, either from the point of view of geometric vs. appearance representations [49], [238] or of the fusion between RGB and 3D [235], [236]. Finally, a third main research direction focused on modelling the correlations between AU occurrences and intensity priors [239], [240]. Some works treat a limited subset of AUs while others are more extensive. All the presented approaches use predesigned representations. While the vast majority of works perform global feature extraction, with or without feature selection, there are cases of sparse representations [241]. In this paper we have analyzed AU intensity estimation, but significant works on estimating the intensity of pain [49], [230] or smiles [242], [243] also exist.
4.3. Microexpression analysis
Microexpressions are brief FEs that people in high-stakes situations make when trying to conceal their feelings. They were first reported by Haggard and Isaacs in 1966 [244]. A microexpression usually lasts between 1/25 and 1/3 of a second and has low intensity. Microexpressions are difficult to recognize for an untrained person, and even after extensive training human accuracies remain low, making an automatic system highly useful. The presumed repressed character of microexpressions is valuable in detecting affective states that a person may be trying to hide.
Microexpressions differ from other expressions not only because of their short duration but also because of their subtleness and localization. These issues have been addressed by employing specific capturing and representation techniques. Because of their short duration, microexpressions may be better captured at frame rates greater than 30 fps. As with spontaneous FEs, which are shorter and less intense than exaggerated posed expressions, methods for recognizing microexpressions take into account the dynamics of the expression. For this reason, a main trend in microexpression analysis is to use appearance representations captured locally in a dynamic way [245], [246], [247]. In [248], for example, the face is divided into specific regions and posed microexpressions in each region are recognized based on 3D-gradient orientation histograms extracted from sequences of frames. [245], on the other hand, uses optical flow to detect the strain produced on the facial surface by nonrigid motion. After macroexpressions have been detected and removed from the detection pipeline, posed microexpressions are spotted without being classified [245], [246]. [249] is another method that first extracts macroexpressions before spotting microexpressions. Unlike other similar methods, the microexpressions are also classified into the 6 primary FEs.
A problem in the evolution of microexpression analysis has been the lack of spontaneous expression datasets. Before the publication of the CASME and SMIC datasets in 2013, methods were usually trained on posed, non-public data [245], [246], [248]. [247] proposes the first spontaneous microexpression recognition system. LBP-TOP, an appearance descriptor, is locally extracted from video cubes. Microexpression detection and classification with high recognition rates are reported even at 25 fps. Alternatively, existing datasets, such as BP4D, could be mined for microexpression analysis. One could identify the initial frames of discrete AUs to mimic the duration and dynamics of microexpressions.
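As an illustration of the kind of local, dynamic appearance descriptor mentioned above, the sketch below computes a simplified LBP-TOP histogram for one facial region (a single "video cube"). It keeps only the central XY, XT and YT slices and uses generic parameters, so it is an assumption-laden approximation of the descriptor rather than the exact configuration used in [247].

```python
# Sketch: simplified LBP-TOP descriptor for one spatio-temporal face region.
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_top(cube, P=8, R=1, bins=59):
    # cube: grayscale clip of shape (T, H, W). Full LBP-TOP aggregates histograms
    # over all XY/XT/YT slices; for brevity only the central slice of each plane
    # is described here.
    t, h, w = cube.shape
    planes = [cube[t // 2], cube[:, h // 2, :], cube[:, :, w // 2]]
    hists = []
    for plane in planes:
        codes = local_binary_pattern(plane, P, R, method="nri_uniform")
        hist, _ = np.histogram(codes, bins=bins, range=(0, bins))
        hists.append(hist / max(hist.sum(), 1))
    return np.concatenate(hists)  # concatenated XY + XT + YT histograms

clip = np.random.default_rng(2).integers(0, 256, size=(30, 64, 64)).astype(np.uint8)
print(lbp_top(clip).shape)  # (3 * 59,) descriptor for one region
```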
In summary, microexpressions are brief, low intensity FEs believed to reflect repressed feelings. Even highly trained human experts obtain low detection rates, so an automatic microexpression recognition system would be highly valuable for spotting feelings humans are trying to hide. Due to their briefness, subtleness and localization, most methods in recent years have used local, dynamic appearance representations extracted from high frame rate video for detecting and classifying posed [245], [246], [248] and, more recently, spontaneous microexpressions [247].
4.4. AFER for detecting non-primary affective states
Most AFER work has been used for predicting primary affective states of basic emotions, such as anger or happiness, but FEs have also been used for predicting non-primary affective states such as complex mental states [223], fatigue [224], frustration [44], pain [185], [186], [225], depression [53], [250], mood and personality traits [251], [252].
Approaches related to mood prediction from facial cues have pursued both descriptive (e.g., FACS) and judgmental approaches to affect. In a paper from 2009, Cohn et al. studied the difference between directly predicting depression from video by using a global geometric representation (AAM), indirectly predicting depression from video by analyzing previously detected facial AUs, and predicting depression from audio cues [187]. They concluded that specific AUs have higher predictive power for depression than others, suggesting the advantage of using indirect representations for depression prediction. The AVEC (Audio/Visual Emotion Challenge) is dedicated to dimensional prediction of affect (valence, arousal, dominance) and to depression level prediction. The approaches dedicated to depression prediction mainly use direct representations from video, without detecting primitive FEs or AUs [253], [254], [255], [256]. They are based on local, dynamic representations of appearance (LBP-TOP or variants) for modelling continuous prediction problems. Multimodality is central in such approaches, either by applying early fusion [255] or late fusion [256] with audio representations.
As humans rely heavily on facial cues to make judgments about others, it was assumed that personality could be inferred from FEs as well. Studies about personality are usually based on the Big Five personality trait model, which is organized along five factors: openness, conscientiousness, extraversion, agreeableness, and neuroticism. While there are works on detecting personality and mood from FEs only [251], [252], the dominant approach is to use multimodality, either by combining acoustic with visual cues [251], [257] or physiological with visual cues [258]. Visual cues can refer to eye gaze [259], [260], frowning, head orientation, mouth fidgeting [259], primary FEs [251], [252] or characteristics of primary FEs like presence, frequency or duration [251]. In [251], Biel et al. use the detection of 6 primary FEs and of smiles to build various measures of expression duration and frequency. They show that using FEs achieves better results than more basic visual activity measures like gaze activity and overall motion of the head and body; however, performance is considerably worse than when estimating personality from audio and especially from prosodic cues.
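The expression-statistics idea used in [251] can be sketched as follows: given a hypothetical per-frame smile detector output, simple duration and frequency measures are aggregated over a video. The boolean input series, the frame rate and the particular statistics are illustrative assumptions rather than the measures of the cited work.

```python
# Sketch: turning per-frame expression detections into duration/frequency statistics.
import numpy as np

def smile_statistics(smile_per_frame, fps=30.0):
    s = np.asarray(smile_per_frame, dtype=bool)
    # Count smile episodes as rising edges of the boolean detection series.
    onsets = np.flatnonzero(np.diff(s.astype(int)) == 1) + 1
    episodes = int(s[0]) + len(onsets)
    total_seconds = s.sum() / fps
    return {
        "smile_episodes_per_minute": episodes / (len(s) / fps / 60.0),
        "total_smile_seconds": total_seconds,
        "mean_episode_seconds": total_seconds / max(episodes, 1),
    }

frames = [0, 0, 1, 1, 1, 0, 0, 1, 1, 0] * 90   # toy 900-frame detection series
print(smile_statistics(frames))
```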
In summary, in recent years the analysis of non-primary affective states has mainly focused on predicting depression. For predicting levels of depression, local, dynamic representations of appearance were usually combined with acoustic representations [253], [254], [255], [256]. Studies of FEs for predicting personality traits have reached mixed conclusions so far. First, FEs were shown to correlate better than visual activity with personality traits [187]. Practically though, while many studies have shown improvements in prediction when FEs are combined with physiological or acoustic cues, FEs remain marginal in the study of personality trait prediction [251], [257], [259], [260].
4.5. AFER in naturalistic environments
Until recently, AFER was mostly performed in controlled environments. The publication of two important naturalistic datasets, AMFED and AFEW, marked an increasing interest in naturalistic environment analysis. AFEW, the Acted Facial Expressions in the Wild dataset, contains a collection of sequences from movies labelled for primitive FEs, pose, age and gender, among others [203]. Additional data about context is extracted from subtitles for persons with hearing impairment. AMFED, on the other hand, contains videos recording reactions to media content over the Internet. It mostly focuses on boosting research about how attitudes toward online media consumption can be predicted from facial reactions. Labels of AUs, primitive FEs, smiles, head movements and self-reports about familiarity, liking and willingness to rewatch the content are provided.
FEs in naturalistic environments are unposed, typically of low to moderate intensity, and may have multiple apexes (peaks in intensity). Large head pose and illumination diversity are common. Face detection and alignment are highly challenging in this context, but vital for separating rigid motion and head pose from facial expressions. Not surprisingly, in an analysis of errors in AU detection in three-person social interactions, [261] found that head yaw greater than 20 degrees was a prime source of error. Pixel intensity and skin color, by contrast, were relatively benign.
While approaches to FE detection in naturalistic environments using static representations exist [194], [262], dynamic representations are dominant [108], [113], [146], [147], [263], [264]. This follows the tendency in spontaneous FE recognition in controlled environments, where dynamic representations improve the ability to distinguish between subtle expressions. In [146], spatio-temporal manifolds of low-level features are modelled, [263] uses the maximum of a BoW (Bag of Words) pyramid over the whole sequence, [147] captures spatio-temporal information through autoencoders, and [113] uses CRFs to model expression dynamics.
Some of the approaches use predesigned representations [194], [262], [263], [264], [265], while recent successful approaches learn the best representation [146], [147], [152] or combine predesigned and learned features [108]. Because of the need to detect subtle changes in the facial configuration, predesigned representations use appearance features extracted either globally or locally. Gehrig et al., in their analysis of the challenges of naturalistic environments, use DCT, LBP and Gabor filters [262]; Sikka et al. use dense multi-scale SIFT BoWs, LPQ-TOP, HOG, PHOG and GIST to get additional information about context [263]; Dhall et al. use LBP, HOG and PHOG in their baseline for the SFEW dataset (static images extracted from AFEW) [194] and LBP-TOP in their baseline for the EmotiW 2014 challenge [265]; and Liu et al. use convolution filters to produce mid-level features [146].
Some representative approaches using learned representations were recently proposed [108], [146], [147], [152]. In [152], a BDBN framework for learning and selecting features is proposed, which is well suited for characterizing expression-related facial changes. [147] proposes a configuration obtained by late fusing spatio-temporal activity recognition with audio cues, a dictionary of features extracted from the mouth region and a deep neural network for FE recognition. In [108], predesigned (HOG, SIFT) and learned (deep CNN) representations are combined, and different image set models are used to represent the video sequences on a Riemannian manifold. In the end, late fusion of classifiers based on different kernel methods (SVM, Logistic Regression, Partial Least Squares) and different modalities (audio and video) is conducted for the final recognition results. Finally, [113] encodes dynamics with a Variable-State Latent Conditional Random Field (VSL-CRF) model that automatically selects the optimal latent states and their intensity for each sequence and target class.
Most of the presented approaches target primitive FEs. Methods for recognizing other affective states have also been proposed, namely cognitive states like boredom, confusion, delight, concentration and frustration [266], positive and negative affect in groups of people [267], and liking/disliking of online media for predicting buying behaviour for marketing purposes [268].
In summary, large head pose rotations and illumination changes make FE recognition in naturalistic environments particularly challenging. FEs are by definition spontaneous, usually have low intensity, can have multiple apexes and can be difficult to distinguish from facial displays of speech. Moreover, multiple persons can express FEs simultaneously. Because of the subtleness of the facial configurations, most predesigned representations extract appearance dynamically [262], [263], [264], [265]. Recent successful methods learn representations [108], [146], [147], [152] from sequences of frames. Most approaches target primitive FEs of affect, but others recognize cognitive states [266], positive and negative affect in groups of people [267] and liking/disliking of online media for predicting buying behaviour for marketing purposes [268].
5. DISCUSSION
By looking at faces humans extract information about each other, such as age, gender, race, and how others feel and think. Building automatic AFER systems would have tremendous benefits. Despite significant advances, automatic AFER still faces many challenges, like large head pose variations, changing illumination contexts and the distinction between facial displays of affect and facial displays caused by speech. Finally, even when one manages to build systems that can robustly recognize FEs in naturalistic environments, it still remains difficult to interpret their meaning. In this paper we have focused on providing a general introduction to the broad field of AFER. We started by discussing how affect can be inferred from FEs and its applications. An in-depth discussion about each step in an AFER pipeline followed, including a comprehensive taxonomy and many examples of techniques used on data captured with different video sensors (RGB, 3D, Thermal). Then, we presented important recent evolutions in the estimation of FE intensities, the recognition of microexpressions and non-primary affective states, and the analysis of FEs in naturalistic environments.
Face localization and registration.
When extracting FE information, techniques vary according to both modality and temporality. Regardless of the approach, a common pipeline followed by most methods has been presented, consisting of face detection, face registration, feature extraction and recognition itself. When combining multiple modalities, a fifth fusion step is added to the pipeline. Depending on the modality, this pipeline can vary slightly. For instance, face registration is not feasible for thermal imaging due to the dullness of the captured images, which in turn limits feature extraction to appearance-based techniques. The techniques applied to obtain the facial landmarks are different for RGB and 3D, these being feature detection and shape registration problems, respectively. The pipeline may also vary for some methods, which may not require face alignment for some global feature-extraction techniques, and may perform feature extraction implicitly with recognition, as is the case for deep learning approaches.
The first two steps of the pipeline, face localization and 2D/3D registration, are common to many facial analysis techniques, such as face and gender recognition, age estimation and head pose recovery. This work introduces them briefly, referring the reader to more specific surveys for each topic [176], [180], [181]. For face localization, two main families of methods have been found: face detection and face segmentation. Face detection is the most common approach, and is usually treated as a classification problem where a bounding box either contains a face or not. Segmentation techniques label the image at the pixel level. For face registration, 2D (RGB/thermal) and 3D approaches have been discussed. 2D approaches solve a feature detection problem where multiple facial features are to be located inside a facial region. This problem is approached either by directly fitting the geometry to the image, or by using deformable models that define a prototypical model of the face and its possible deformations. 3D approaches, on the other hand, consider a shape registration problem where a transform is to be found matching the captured shape to a model. Currently, the main challenge is to improve registration algorithms to robustly deal with naturalistic environments. This is vital for dealing with large rotations, occlusions and multiple persons and, in the case of 3D registration, it could also be used for synthesising new faces for training neural networks.
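A minimal sketch of this common pipeline, built from generic off-the-shelf components rather than from any specific method surveyed here, could look as follows; the crude crop-and-resize stands in for proper landmark-based registration, and the classifier is assumed to have been trained elsewhere on labelled expression data.

```python
# Sketch: detect -> (coarse) register -> describe -> recognize, with generic tools.
import cv2
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC

face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def expression_features(gray_image):
    faces = face_detector.detectMultiScale(gray_image, 1.1, 5)  # step 1: face detection
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    face = cv2.resize(gray_image[y:y + h, x:x + w], (96, 96))   # step 2: coarse registration
    return hog(face, orientations=8, pixels_per_cell=(8, 8),    # step 3: appearance features
               cells_per_block=(2, 2))

# Step 4: recognition with any conventional classifier (here an SVM), assuming
# X_train / y_train were built beforehand with the same feature function:
# clf = SVC(kernel="linear").fit(X_train, y_train)
# label = clf.predict([expression_features(frame)])
```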
Feature extraction.
There are many different approaches for extracting features. Predesigned descriptors are very common, although recently deep learning techniques such as CNNs and DBNs have been used, implicitly learning the relevant features along with the recognition model. While automatically learned features cannot be directly classified according to the nature of the described information, predesigned descriptors exploit either the facial appearance, the geometry or a combination of both. Regardless of their nature, many methods exploit information either at a local level, centering on interest regions sometimes defined by AUs based on the FACS/FAP coding, or at a global level, using the whole facial region. These methods can describe either a single frame or dynamic information. Usually, representing the differences between consecutive frames is done either through shape deformations or appearance variations. Other methods use spatio-temporal descriptors such as LBP-TOP for directly extracting features from sequences of frames.
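For instance, a dynamic geometric representation can be as simple as the displacement of detected landmarks between consecutive frames, as in the toy sketch below; the landmark tracks and their dimensions are hypothetical.

```python
# Sketch: frame-to-frame landmark displacements as a dynamic geometric descriptor.
import numpy as np

def landmark_dynamics(landmarks):
    # landmarks: array of shape (T, L, 2) with L 2D landmarks tracked over T frames.
    displacements = np.diff(landmarks, axis=0)          # (T-1, L, 2) shape deformations
    return displacements.reshape(len(landmarks) - 1, -1)  # one vector per transition

# Hypothetical 68-landmark track over 10 frames.
track = np.random.default_rng(5).normal(size=(10, 68, 2))
print(landmark_dynamics(track).shape)  # (9, 136)
```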
While these types of feature extraction methods are common to all modalities, it has been found that thermal images are not suited for extracting geometric information due to the dullness of the captured image. In the RGB case, geometric information is never extracted at the local static level. While it should be possible to do so, we hypothesise that current 2D registration techniques lack the level of precision required to extract useful information from local shape deformations. In the case of learned features, to the best of our knowledge, dynamic feature extraction has not been attempted. It is clearly possible to do so, though, and it has been done for other problems.
In the case of AU intensity estimation, many studies were published either from the point of view of geometric vs. appearance representations [49], [238] or of the fusion between RGB and 3D [235], [236]. Because of the scarcity of intensity-labelled data, to the best of our knowledge all approaches until now have used predesigned representations. While the vast majority of works perform global feature extraction, with or without feature selection, there are cases of sparse representations, most notably in the work of Jeni et al. [241]. Due to their briefness, subtleness and localization, most of the methods for detecting microexpressions use local, dynamic appearance representations extracted from high frame rate video. Detection and classification of posed [245], [246], [248] and, more recently, spontaneous microexpressions [247] have been proposed. For predicting levels of depression, local, dynamic representations of appearance were usually combined with acoustic representations [253], [254], [255], [256]. Because of the subtleness of facial configurations in naturalistic environments, most predesigned representations extract appearance dynamically [262], [263], [264], [265]. Recent successful methods in naturalistic environments learn representations [108], [146], [147], [152] from sequences of frames. As the amount of labelled data increases, learning the representations could be a future trend in intensity estimation. More complex representation schemes for recognizing spontaneous microexpressions, and approaches combining RGB with other modalities, especially 3D, for microexpression analysis, are also directions we foresee.
Recognition.
Recognition approaches infer emotions or mental states based on the extracted FE features. The vast majority of techniques use a multi-class classification model where a set of emotions (usually the six basic emotions defined by Ekman) or mental states are to be detected. A continuous approach is also possible. In the continuous case, emotions are represented as points in a pre-defined space, where usually each dimension corresponds to an expressive trait. This representation has advantages such as the ability to define emotions and mental states in an unsupervised manner and to discriminate subtle expression differences. The ease of interpretation of multi-class approaches has made continuous approaches less frequent. Recognition is also divided into static and dynamic approaches, with static approaches being dominated by conventional classification and regression methods for categorical and continuous problems, respectively. In the case of dynamic approaches, dynamic Bayesian network techniques are usually used, but also others such as conditional random fields and recurrent neural networks.
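The two recognition settings can be contrasted with a short sketch on hypothetical features: a multi-class classifier over the six basic emotions for the categorical case, and independent regressors over a valence-arousal space for the continuous case; the labels, features and models here are placeholders, not those of any cited method.

```python
# Sketch: categorical (6 basic emotions) vs. continuous (valence-arousal) recognition.
import numpy as np
from sklearn.svm import SVC, SVR

rng = np.random.default_rng(3)
X = rng.normal(size=(600, 64))            # hypothetical FE feature vectors
emotions = rng.integers(0, 6, size=600)   # 6 basic-emotion class labels
valence = rng.uniform(-1, 1, size=600)    # continuous affect labels
arousal = rng.uniform(-1, 1, size=600)

categorical = SVC().fit(X, emotions)                          # discrete recognition
val_reg, aro_reg = SVR().fit(X, valence), SVR().fit(X, arousal)  # continuous recognition

x = X[:1]
print("emotion class:", categorical.predict(x))
print("valence/arousal:", val_reg.predict(x), aro_reg.predict(x))
```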
Many methods focus on recognizing a limited set of primary emotions (usually 6) [115], [116], [123], [130], [137], [145], [146]. This is mainly due to a lack of more diverse datasets. Increasing the number of recognized expressions usually follows two main directions. First, expressions can be encoded based on FACS AUs [99], [113], [127], [139] instead of being directly classified. This provides a comprehensive coding of FEs without directly making a judgement about their intentionality. Other methods exploit the additional information provided by 3D facial data. Capturing depth information has important advantages over traditional RGB data: it is more invariant to rotation and illumination and captures more subtle changes on the face. This is useful for detecting microexpressions and facilitates recognizing a wider range of expressions, which would be more difficult with RGB alone.
In recent years, a critical analysis has been made of the limitations of estimating AU intensity from classification scores [235], [236], [237], and estimation directly from features was analysed. Research suggests that using classifier scores for predicting intensity is conceptually wrong and that intensity levels should be directly learned from the ground truth [237]. Some works treat a limited subset of AUs while others are more extensive. Most works address AU intensity estimation, but significant works on estimating the intensity of pain [49], [230] or smiles [242], [243] also exist. Starting with the publication of the BU-3DFE dataset, which provides four different intensity levels for every expression, advancements in recognizing primary expressions from 3D samples were made [118], [120], [124], [125], [131], [133], [217]. In naturalistic environments, most approaches target primitive FEs of affect. Methods for recognizing cognitive states [266], positive and negative affect in groups of people [267] or liking/disliking of online media for predicting buying behaviour for marketing purposes [268] are also common. A major trend in the future will probably be taking context into account and recognizing ever more complex FEs from multiple data sources. Additionally, a recent trend which remains to be further exploited is mapping faces to continuous emotional spaces.
Multimodal fusion.
Multimodality can enrich the representation space and improve emotion inference [269], [270], either by using different video sensors (RGB, Depth, Thermal) or by combining FEs with other sources such as body pose, audio, language or physiological information (brain signals, cardiovascular activity, etc.). Because the different modalities can be redundant, concatenating features might not be efficient. A common solution is to use fusion (see Section 3.2.5 for details). Four main fusion approaches have been identified: direct, early, late and sequential fusion, in most cases using conventional fusion techniques. Some more advanced late fusion techniques have been identified, such as fuzzy inference systems and Bayesian inference. The advantage of these methods lies in the introduction of complementary sources of information. For instance, the radiance at different facial regions, captured through thermal imaging, varies according to changes in the blood flow triggered by emotions [198], [210]. Context (situation, interacting persons, place, etc.) can also improve emotion inference [271], [272]. [273] shows that the recognition of FEs is strongly influenced by body posture and that this becomes more important as the FE is more ambiguous. In another study, it is shown that not only can emotional arousal be detected from visual cues, but voice can also provide indications of specific emotions through acoustic properties such as pitch range, rhythm, and amplitude or duration changes [156]. In the case of mood and personality trait prediction, the fusion of acoustic and visual cues has been extensively exploited, with mixed conclusions. First, FEs were shown to correlate better than visual activity with personality traits [187]. Practically though, while many studies have shown improvements in prediction when FEs are combined with physiological or acoustic cues, FEs remain marginal in the study of personality trait prediction [251], [257], [259], [260]. We think the years to come will probably bring improvements towards the integration of visual and non-visual modalities, like acoustic, language, gesture, or physiological data coming from wearable devices.
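Early and late fusion, the two most common strategies above, can be sketched as follows on hypothetical face and audio feature matrices; the equal weighting of the per-modality scores in the late-fusion branch is an arbitrary choice for illustration.

```python
# Sketch: early (feature-level) vs. late (score-level) multimodal fusion.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
face_feats = rng.normal(size=(400, 50))    # hypothetical facial features
audio_feats = rng.normal(size=(400, 20))   # hypothetical acoustic features
labels = rng.integers(0, 2, size=400)      # hypothetical binary affect labels

# Early fusion: concatenate features and learn a single model.
early = LogisticRegression(max_iter=1000).fit(
    np.hstack([face_feats, audio_feats]), labels)

# Late fusion: train one model per modality and combine their class probabilities.
face_clf = LogisticRegression(max_iter=1000).fit(face_feats, labels)
audio_clf = LogisticRegression(max_iter=1000).fit(audio_feats, labels)
late_scores = (0.5 * face_clf.predict_proba(face_feats)
               + 0.5 * audio_clf.predict_proba(audio_feats))
late_prediction = late_scores.argmax(axis=1)
```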
Acknowledgments
This work was supported in part by NIMH R01MH096951.
Biography
Ciprian Adrian Corneanu got his BSc in Telecommunication Engineering from Télécom SudParis, 2011. He got his MSc in Computer Vision from Universitat Autónoma de Barcelona. Currently he is a Ph.D. student at the Universitat de Barcelona and a fellow of the Computer Vision Center, UAB. His main research interests include face and behavior analysis, affective computing, social signal processing, human computer interaction.
Marc Oliu Simón finished his M.D in Computer Sciences and MSc in Artificial Intelligence at the Universitat Politecnica de Catalunya in 2014. Currently he is a Ph.D. student at the Universitat de Barcelona and works as a researcher at the Computer Vision Center, UAB. His main research interests include face and behaviour analysis, affective computing and neural networks.
Jeffrey F Cohn is Professor of Psychology and Psychiatry at the University of Pittsburgh and Adjunct Professor of Computer Science at the Robotics Institute at CMU. He leads interdisciplinary and inter-institutional efforts to develop advanced methods of automatic analysis and synthesis of facial expression and prosody; and applies those tools to research in human emotion, social development, non-verbal communication, psychopathology, and biomedicine. His research has been supported by grants from NIH, National Science Foundation, Autism Foundation, Office of Naval Research, and Defense Advanced Research Projects Agency.
Sergio Escalera Guerrero received his Ph.D. degree on Multiclass visual categorization systems at Computer Vision Center, UAB. He leads the HuPBA group. He is an associate professor at the Department of Applied Mathematics and Analysis, Universitat de Barcelona. He is member of the Computer Vision Center. He is director of ChaLearn Challenges in Machine Learning and vice-chair of IAPR TC-12. His research interests include, among others, statistical pattern recognition, visual object recognition, and HCI systems, with special interest in human pose recovery and behaviour analysis from multimodal data.
Footnotes
RGB: Additive color model in which red, green, and blue light are combined to reproduce a broad array of colors.
Also known as arousal.
Also known as the winner-takes-all rule.
Contributor Information
Ciprian A. Corneanu, Computer Vision Center, UAB, Barcelona, Spain, and with the Dept. of Applied Mathematics, University of Barcelona, Spain.
Marc Oliu, Computer Vision Center, UAB, Barcelona, Spain, and with the Dept. of Applied Mathematics, University of Barcelona, Spain.
Jeffrey F. Cohn, Robotics Institute, CMU, Pittsburgh, Pennsylvania, and with the Dept. of Psychology, University of Pittsburgh, Pennsylvania.
Sergio Escalera, Computer Vision Center, UAB, Barcelona, Spain, and with the Dept. of Applied Mathematics, University of Barcelona, Spain.
REFERENCES
- [1].Highfield R, Wiseman R, and Jenkins R, “How your looks betray your personality,” New Scientist, 2009. [Google Scholar]
- [2].Chastel A, Leonardo on Art and the Artist. Courier Corporation, 2002. [Google Scholar]
- [3].Greenblatt S et al. , “Toward a universal language of motion: reflections on a seventeenth century muscle man,” 1994. [Google Scholar]
- [4].de Boulogne G-BD and Cuthbertson RA, The Mechanism of Human Facial Expression. Cambridge University Press, 1990. [Google Scholar]
- [5].Darwin C, The expression of emotion in man and animals. Oxford University Press, 1872. [Google Scholar]
- [6].Izard CE, The face of emotion, 1971. [Google Scholar]
- [7].Ekman P, “Universal and cultural differences in facial expression of emotion,” Nebr. Sym. Motiv, vol. 19, pp. 207–283, 1971. [Google Scholar]
- [8].Ekman P and Oster H, “Facial expressions of emotion,” Annu. Rev. Psychol, no. 30, pp. 527–554, 1979. [Google Scholar]
- [9].Zeng Z, Pantic M, Roisman GI, and Huang TS, “A survey of affect recognition methods: Audio, visual, and spontaneous expressions,” TPAMI, vol. 31, no. 1, pp. 39–58, 2009. [DOI] [PubMed] [Google Scholar]
- [10].Salah AA, Sebe N, and Gevers T, “Communication and automatic interpretation of affect from facial expressions,” Affective Computing and Interaction: Psychological, Cognitive and Neuroscientific Perspectives, p. 157, 2010. [Google Scholar]
- [11].Sandbach G, Zafeiriou S, Pantic M, and Yin L, “Static and dynamic 3D facial expression recognition: A comprehensive survey,” Image Vision Comput, vol. 30, pp. 683–697, 2012. [Google Scholar]
- [12].Sariyanidi E, Gunes H, and Cavallaro A, “Automatic analysis of facial affect: A survey of registration, representation and recognition,” TPAMI, 2014. [DOI] [PubMed] [Google Scholar]
- [13].Ekman P, “Strong evidence for universals in facial expressions: A reply to Russell’s mistaken critique,” Psychol. Bull, vol. 115, no. 2, pp. 268–287, 1994. [DOI] [PubMed] [Google Scholar]
- [14].what-when-how.com.
- [15].Greenwald M, Cook E, and Lang P, “Affective judgment and psychophysiological response: Dimensional covariation in the evaluation of pictorial stimuli,” J. Psychophysiology, no. 3, pp. 51–64, 1989. [Google Scholar]
- [16].Russell J and Mehrabian A, “Evidence for a three-factor theory of emotions,” J. Research in Personality, vol. 11, pp. 273–294, 1977. [Google Scholar]
- [17].Watson D, Clark LA, and Tellegen A, “Development and validation of brief measures of positive and negative affect: The PANAS scales,” JPSP, vol. 54, pp. 1063–1070, 1988. [DOI] [PubMed] [Google Scholar]
- [18].Fridlund AJ, “The behavioral ecology and sociality of human faces,” in Emotion, 1997, pp. 90–121. [Google Scholar]
- [19].Shariff AF and Tracy JL, “What are emotion expressions for?” CDPS, vol. 20, no. 6, pp. 395–399, 2011. [Google Scholar]
- [20].Barrett LF, “Was Darwin wrong about emotional expressions?” CDPS, vol. 20, no. 6, pp. 400–406, 2011. [Google Scholar]
- [21].Eibl-Eibesfeldt I, “An argument for basic emotions,” in Cogn. Emot, 1992, pp. 169–200. [Google Scholar]
- [22].Keltner D and Ekman P, “Facial expression of emotion,” in Handbook of emotions, 2nd ed., 2000, pp. 236–249. [Google Scholar]
- [23].Matsumoto D, Keltner D, Shiota MN, O’Sullivan M, and Frank M, “Facial expressions of emotion,” in Handbook of Emotions, 2008, ch. 13, pp. 211–234. [Google Scholar]
- [24].Schmidt KL and Cohn JF, “Human facial expressions as adaptations: Evolutionary perspectives in facial expression research,” Yearbook of Physical Anthropology, vol. 116, pp. 8–24, 2001. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [25].Gray H and Goss CM, Anatomy of the human body, 28th ed Lea & Febiger, 1966. [Google Scholar]
- [26].Burrows A and Cohn JF, “Comparative anatomy of the face” in Handbook of biometrics, 2nd ed Springer, 2014, pp. 1–10. [Google Scholar]
- [27].Waller BM, Cray JJ, and Burrows AM, “Selection for universal facial emotion,” Emotion, vol. 8, no. 3, pp. 435–439, 2008. [DOI] [PubMed] [Google Scholar]
- [28].Waller BM, Lembeck M, Kuchenbuch P, Burrows AM, and Liebal K, “Gibbonfacs: A muscle-based facial movement coding system for hylobatids,” J. Primatol, vol. 33, pp. 809–821, 2012. [Google Scholar]
- [29].Waller BM, Parr LA, Gothard KM, Burrows AM, and Fuglevand AJ, “Mapping the contribution of single muscles to facial movements in the rhesus macaque,” Physiol. Behav, vol. 95, pp. 93–100, 2008. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [30].Eibl-Eibesfeldt I, Human ethology, 1989. [Google Scholar]
- [31].Russell JA, “Is there universal recognition of emotion from facial expression? A review of the cross-cultural studies,” Psychol. Bull, vol. 115, no. 1, pp. 102–141, 1994. [DOI] [PubMed] [Google Scholar]
- [32].Jack RE, Blais C, Scheepers C, Schyns PG, and Caldara R, “Cultural confusions show that facial expressions are not universal,” Current Biology, vol. 19, pp. 1–6, 2009. [DOI] [PubMed] [Google Scholar]
- [33].Levenson RW, Ekman P, and Friesen WV, “Voluntary facial action generates emotion-specific autonomic nervous system activity,” Psychophysiology, vol. 27, no. 4, pp. 363–384, 1990. [DOI] [PubMed] [Google Scholar]
- [34].Ekman P, Davidson RJ, and Friesen WV, “The Duchenne smile: Emotional expression and brain psychology ii,” JPSP, vol. 58, no. 2, pp. 342–353, 1990. [PubMed] [Google Scholar]
- [35].Frijda NH and Tcherkassof A, “Facial expressions as modes of action readiness,” in The psychology of facial expression, 2nd ed., 1997, pp. 78–102. [Google Scholar]
- [36].Niedenthal PM, “Embodying emotion,” Science, vol. 116, pp. 1002–1005, 2007. [DOI] [PubMed] [Google Scholar]
- [37].Ekman P and Rosenberg E, What the face reveals, 2nd ed., 2005. [Google Scholar]
- [38].Zhang X, Yin L, Cohn JF, Canavan S, Reale M, Horowitz A, Liu P, and Girard JM, “Bp4d-spontaneous: a high-resolution spontaneous 3D dynamic facial expression database,” IVC, vol. 32, no. 10, pp. 692–706, 2014. [Google Scholar]
- [39].Duric Z, Gray WD, Heishman R, Li F, Rosenfeld A, Schoelles MJ, Schunn C, and Wechsler H, “Integrating perceptual and cognitive modeling for adaptive and intelligent human-computer interaction,” Proceedings of the IEEE, vol. 90, no. 7, pp. 1272–1289, 2002. [Google Scholar]
- [40].Maat L and Pantic M, “Gaze-x: adaptive, affective, multimodal interface for single-user office scenarios,” in Artifical Intelligence for Human Computing. Springer, 2007, pp. 251–271. [Google Scholar]
- [41].Vinciarelli A, Pantic M, and Bourlard H, “Social signal processing: Survey of an emerging domain,” IVC, vol. 27, no. 12, pp. 1743–1759, 2009. [Google Scholar]
- [42].DeVault D, Artstein R, Benn G, Dey T, Fast E, Gainer A, and Morency LP, “A virtual human interviewer for healthcare decision support.” AAMAS, 2014. [Google Scholar]
- [43].Ishiguro H, Ono T, Imai M, Maeda T, Kanda T, and Nakatsu R, “Robovie: an interactive humanoid robot,” Industrial robot: An international journal, vol. 28, no. 6, pp. 498–504, 2001. [Google Scholar]
- [44].Kapoor A, Burleson W, and Picard RW, “Automatic prediction of frustration,” IJHCS, vol. 65, no. 8, pp. 724–736, 2007. [Google Scholar]
- [45].Bakkes S, Tan CT, and Pisan Y, “Personalised gaming,” JCT, vol. 3, 2012. [Google Scholar]
- [46].Tan CT, Rosser D, Bakkes S, and Pisan Y, “A feasibility study in using facial expressions analysis to evaluate player experiences,” in IE, 2012, p. 5. [Google Scholar]
- [47].Blom PM, Bakkes S, Tan CT, Whiteson S, Roijers D, Valenti R, and Gevers T, “Towards personalised gaming via facial expression recognition,” in AIIDE, 2014. [Google Scholar]
- [48].Lucey P, Cohn JF, Matthews I, Lucey S, Sridharan S, Howlett J, and Prkachin KM, “Automatically detecting pain in video through facial action units,” SMC-B, vol. 41, no. 3, pp. 664–674, 2011. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [49].Kaltwang S, Rudovic O, and Pantic M, “Continuous pain intensity estimation from facial expressions,” ISVC, pp. 368–377, 2012. [Google Scholar]
- [50].Irani R, Nasrollahi K, Simon MO, Corneanu CA, Escalera S, Bahnsen C, Lundtoft DH, Moeslund TB, Pedersen TL, Klitgaard M-L et al. , “Spatiotemporal analysis of rgb-dt facial images for multimodal pain level recognition,” CVPR Workshops, 2015. [Google Scholar]
- [51].Ryan A, Cohn JF, Lucey S, Saragih J, Lucey P, la Torre FD, and Ross A, “Automated facial expression recognition system,” in ICCST, 2009. [Google Scholar]
- [52].Vural E, Cetin M, Ercil A, Littlewort G, Bartlett M, and Movellan J, “Drowsy driver detection through facial movement analysis,” in Human–Computer Interaction, 2007, pp. 6–18. [Google Scholar]
- [53].Girard JM, Cohn JF, Mahoor MH, Mavadati SM, Hammal Z, and Rosenwald DP, “Nonverbal social withdrawal in depression: Evidence from manual and automatic analyses,” IVC, vol. 32, no. 10, pp. 641–647, 2014. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [54].Scherer S, Stratou G, Mahmoud M, Boberg J, Gratch J, Rizzo A, and Morency L-P, “Automatic behavior descriptors for psychological disorder analysis,” in Automatic Face and Gesture Recognition (FG), 2013 10th IEEE International Conference and Workshops on IEEE, 2013, pp. 1–8. [Google Scholar]
- [55].Joshi J, Dhall A, Goecke R, Breakspear M, and Parker G, “Neural-net classification for spatio-temporal descriptor based depression analysis,” in Pattern Recognition (ICPR), 2012 21st International Conference on. IEEE, 2012, pp. 2634–2638. [Google Scholar]
- [56]. www.emotient.com.
- [57]. www.affectiva.com.
- [58]. www.realeyesit.com.
- [59]. www.kairos.com.
- [60].Viola P and Jones M, “Rapid object detection using a boosted cascade of simple features,” in CVPR, 2001, pp. I–511. [Google Scholar]
- [61].Jones M and Viola P, “Fast multi-view face detection,” MERL, vol. 3, p. 14, 2003. [Google Scholar]
- [62].Dalal N and Triggs B, “Histograms of oriented gradients for human detection,” in CVPR, 2005, pp. 886–893. [Google Scholar]
- [63].Osadchy M, Cun YL, and Miller ML, “Synergistic face detection and pose estimation with energy-based models,” JMLR, vol. 8, pp. 1197–1215, 2007. [Google Scholar]
- [64].Colombo A, Cusano C, and Schettini R, “3D face detection using curvature analysis,” PR, vol. 39, no. 3, pp. 444–455, 2006. [Google Scholar]
- [65].Nair P and Cavallaro A, “3-d face detection, landmark localization, and registration using a point distribution model,” T. Multimedia, vol. 11, no. 4, pp. 611–623, 2009. [Google Scholar]
- [66].Sobottka K and Pitas I, “Segmentation and tracking of faces in color images,” in FG, 1996, pp. 236–241. [Google Scholar]
- [67].Sirohey SA, “Human face segmentation and identification,” U. Maryland, Tech. Rep, 1998. [Google Scholar]
- [68].Sobottka K and Pitas I, “A novel method for automatic face segmentation, facial feature extraction and tracking,” SPIC, vol. 12, no. 3, pp. 263–281, 1998. [Google Scholar]
- [69].Chai D and Ngan KN, “Face segmentation using skin-color map in videophone applications,” TCSVT, vol. 9, no. 4, pp. 551–564, 1999. [Google Scholar]
- [70].Li H and Ngan KN, “Saliency model-based face segmentation and tracking in head-and-shoulder video sequences,” JVCIR, vol. 19, no. 5, pp. 320–333, 2008. [Google Scholar]
- [71].Shotton J, Sharp T, Kipman A, Fitzgibbon A, Finocchio M, Blake A, Cook M, and Moore R, “Real-time human pose recognition in parts from single depth images,” CACM, vol. 56, no. 1, pp. 116–124, 2013. [DOI] [PubMed] [Google Scholar]
- [72].Hernández A, Zlateva N, Marinov A, Reyes M, Radeva P, Dimov D, and Escalera S, “Graph cuts optimization for multilimb human segmentation in depth maps,” in CVPR, 2012, pp. 726–732. [Google Scholar]
- [73].Pamplona Segundo M, Silva L, Bellon ORP, and Queirolo CC, “Automatic face segmentation and facial landmark detection in range images,” SMC-B, vol. 40, pp. 1319–1330, 2010. [DOI] [PubMed] [Google Scholar]
- [74].Koda Y, Yoshitomi Y, Nakano M, and Tabuse M, “A facial expression recognition for a speaker of a phoneme of vowel using thermal image processing and a speech recognition system,” in RO-MAN, 2009, pp. 955–960. [Google Scholar]
- [75].Trujillo L, Olague G, Hammoud R, and Hernandez B, “Automatic feature localization in thermal images for facial expression recognition,” in CVPR Workshops, 2005, pp. 14–14. [Google Scholar]
- [76].Cootes TF, Taylor CJ, Cooper DH, and Graham J, “Active shape models-their training and application,” CVIU, vol. 61, no. 1, pp. 38–59, 1995. [Google Scholar]
- [77].Cootes TF, Edwards GJ, and Taylor CJ, “Active appearance models,” TPAMI, vol. 23, no. 6, pp. 681–685, 2001. [Google Scholar]
- [78].Romdhani S and Vetter T, “Efficient, robust and accurate fitting of a 3D morphable model,” in ICCV, 2003, pp. 59–66. [Google Scholar]
- [79].Baker S and Matthews I, “Lucas-kanade 20 years on: A unifying framework,” IJCV, vol. 56, no. 3, pp. 221–255, 2004. [Google Scholar]
- [80].Dantone M, Gall J, Fanelli G, and Van Gool L, “Real-time facial feature detection using conditional regression forests,” in CVPR, 2012, pp. 2578–2585. [Google Scholar]
- [81].Xiong X and De la Torre F, “Supervised descent method and its applications to face alignment,” in CVPR, 2013, pp. 532–539. [Google Scholar]
- [82].Besl PJ and McKay ND, “Method for registration of 3-d shapes,” in Robotics-DL tentative, 1992, pp. 586–606. [Google Scholar]
- [83].Alyuz N, Gokberk B, and Akarun L, “Adaptive registration for occlusion robust 3D face recognition,” in ECCV, 2012, pp. 557–566. [Google Scholar]
- [84].Mao Z, Siebert JP, Cockshott WP, and Ayoub AF, “Constructing dense correspondences to analyze 3D facial change,” in ICPR, 2004, pp. 144–148. [Google Scholar]
- [85].Tena JR, Hamouz M, Hilton A, and Illingworth J, “A validated method for dense non-rigid 3D face registration,” in AVSS, 2006, pp. 81–81. [Google Scholar]
- [86].Szeptycki P, Ardabilian M, and Chen L, “A coarse-to-fine curvature analysis-based rotation invariant 3D face landmarking,” in BTAS, 2009, pp. 1–6. [Google Scholar]
- [87].Alyuz N, Gokberk B, and Akarun L, “Regional registration for expression resistant 3-d face recognition,” TIFS, vol. 5, no. 3, pp. 425–440, 2010. [Google Scholar]
- [88].Blanz V and Vetter T, “A morphable model for the synthesis of 3D faces,” in SIGGRAPH, 1999, pp. 187–194. [Google Scholar]
- [89].Fanelli G, Dantone M, and Van Gool L, “Real time 3D face alignment with random forests-based active appearance models,” in FG, 2013, pp. 1–8. [Google Scholar]
- [90].Savran A and Sankur B, “Non-rigid registration of 3D surfaces by deformable 2d triangular meshes,” in CVPR, 2008, pp. 1–6. [Google Scholar]
- [91].Queirolo CC, Silva L, Bellon OR, and Pamplona Segundo M, “3D face recognition using simulated annealing and the surface interpenetration measure,” TPAMI, vol. 32, pp. 206–219, 2010. [DOI] [PubMed] [Google Scholar]
- [92].Mpiperis I, Malassiotis S, Petridis V, and Strintzis MG, “3D facial expression recognition using swarm intelligence,” in ICASSP, 2008, pp. 2133–2136. [Google Scholar]
- [93].Dhall A, Asthana A, Goecke R, and Gedeon T, “Emotion recognition using phog and lpq features,” in FG, 2011, pp. 878–883. [Google Scholar]
- [94].Sun B, Li L, Zuo T, Chen Y, Zhou G, and Wu X, “Combining multimodal features with hierarchical classifier fusion for emotion recognition in the wild,” in ICMI, 2014, pp. 481–486. [Google Scholar]
- [95].Zhi R, Flierl M, Ruan Q, and Kleijn WB, “Graph-preserving sparse nonnegative matrix factorization with application to facial expression recognition,” SMC-B, vol. 41, pp. 38–52, 2011. [DOI] [PubMed] [Google Scholar]
- [96].Zafeiriou S and Petrou M, “Nonlinear nonnegative component analysis,” in CVPR, 2009, pp. 2860–2865. [Google Scholar]
- [97].Shan C, Gong S, and McOwan PW, “Facial expression recognition based on local binary patterns: A comprehensive study,” IVC, vol. 27, no. 6, pp. 803–816, 2009. [Google Scholar]
- [98].Savran A, Cao H, Nenkova A, and Verma R, “Temporal bayesian fusion for affect sensing: Combining video, audio, and lexical modalities,” CYB, 2014. [DOI] [PubMed] [Google Scholar]
- [99].Littlewort G, Bartlett MS, Fasel I, Susskind J, and Movellan J, “Dynamics of facial expression extracted automatically from video,” in CVPR Workshops, 2004, p. 80. [Google Scholar]
- [100].Littlewort G, Whitehill J, Wu T, Fasel I, Frank M, Movellan J, and Bartlett M, “The computer expression recognition toolbox (cert),” in FG, 2011, pp. 298–305. [Google Scholar]
- [101].Gu W, Xiang C, Venkatesh Y, Huang D, and Lin H, “Facial expression recognition using radial encoding of local gabor features and classifier synthesis,” PR, vol. 45, no. 1, pp. 80–91, 2012. [Google Scholar]
- [102].Lyons MJ, Budynek J, and Akamatsu S, “Automatic classification of single facial images,” TPAMI, vol. 21, no. 12, pp. 1357–1362, 1999. [Google Scholar]
- [103].Yoshitomi Y, Miyawaki N, Tomita S, and Kimura S, “Facial expression recognition using thermal image processing and neural network,” in RO-MAN, 1997, pp. 380–385. [Google Scholar]
- [104].Wang S, He M, Gao Z, He S, and Ji Q, “Emotion recognition from thermal infrared images using deep boltzmann machine,” FCS, vol. 8, no. 4, pp. 609–618, 2014. [Google Scholar]
- [105].Yoshitomi Y et al. , “Facial expression recognition for speaker using thermal image processing and speech recognition system,” in WSEAS, 2010, pp. 182–186. [Google Scholar]
- [106].Zhao G and Pietikainen M, “Dynamic texture recognition using local binary patterns with an application to facial expressions,” TPAMI, vol. 29, no. 6, pp. 915–928, 2007. [DOI] [PubMed] [Google Scholar]
- [107].He L, Jiang D, Yang L, Pei E, Wu P, and Sahli H, “Multimodal affective dimension prediction using deep bidirectional long short-term memory recurrent neural networks,” in AVEC, 2015, pp. 73–80. [Google Scholar]
- [108].Liu M, Wang R, Li S, Shan S, Huang Z, and Chen X, “Combining multiple kernel methods on riemannian manifold for emotion recognition in the wild,” in ICMI, 2014, pp. 494–501. [Google Scholar]
- [109].Liu Z and Wang S, “Emotion recognition using hidden markov models from facial temperature sequence,” in ACII, 2011, pp. 240–247. [Google Scholar]
- [110].Geetha A, Ramalingam V, Palanivel S, and Palaniappan B, “Facial expression recognition–a real time approach,” Expert Syst. Appl, vol. 36, no. 1, pp. 303–308, 2009. [Google Scholar]
- [111].Hernández B, Olague G, Hammoud R, Trujillo L, and Romero E, “Visual learning of texture descriptors for facial expression recognition in thermal imagery,” CVIU, vol. 106, no. 2, pp. 258–269, 2007. [Google Scholar]
- [112].Liu P and Yin L, “Spontaneous facial expression analysis based on temperature changes and head motions,” in FG, 2015. [Google Scholar]
- [113].Walecki R, Rudovic O, Pavlovic V, and Pantic M, “Variable-state latent conditional random fields for facial expression recognition and action unit detection,” FG, pp. 1–8, 2015. [Google Scholar]
- [114].Pantic M and Patras I, “Dynamics of facial expression: recognition of facial actions and their temporal segments from face profile image sequences,” SMC-B, vol. 36, pp. 433–449, 2006. [DOI] [PubMed] [Google Scholar]
- [115].Sebe N, Lew MS, Sun Y, Cohen I, Gevers T, and Huang TS, “Authentic facial expression analysis,” IVC, no. 12, pp. 1856–1863, 2007. [Google Scholar]
- [116].Kotsia I and Pitas I, “Facial expression recognition in image sequences using geometric deformation features and support vector machines,” TIP, vol. 16, pp. 172–187, 2007. [DOI] [PubMed] [Google Scholar]
- [117].Tang H and Huang TS, “3D facial expression recognition based on properties of line segments connecting facial feature points,” in FG, 2008, pp. 1–6. [Google Scholar]
- [118].Tang H and Huang T, “3D facial expression recognition based on automatically selected features,” in CVPR, 2008, pp. 1–8. [Google Scholar]
- [119].Chang Y, Vieira M, Turk M, and Velho L, “Automatic 3D facial expression analysis in videos,” AMFG, pp. 293–307, 2005. [Google Scholar]
- [120].Berretti S, Amor BB, Daoudi M, and Del Bimbo A, “3D facial expression recognition using sift descriptors of automatically detected keypoints,” TVC, vol. 27, no. 11, pp. 1021–1036, 2011. [Google Scholar]
- [121].Vretos N, Nikolaidis N, and Pitas I, “3D facial expression recognition using Zernike moments on depth images,” in ICIP, 2011, pp. 773–776. [Google Scholar]
- [122].Sandbach G, Zafeiriou S, and Pantic M, “Local normal binary patterns for 3D facial action unit detection,” in ICIP, 2012, pp. 1813–1816. [Google Scholar]
- [123].Hayat M, Bennamoun M, and El-Sallam AA, “Clustering of video-patches on grassmannian manifold for facial expression recognition from 3D videos,” in WACV, 2013, pp. 83–88. [Google Scholar]
- [124].Zeng W, Li H, Chen L, Morvan JM, and Gu XD, “An automatic 3D expression recognition framework based on sparse representation of conformal images,” in FG, 2013, pp. 1–8. [Google Scholar]
- [125].Lemaire P, Ardabilian M, Chen L, and Daoudi M, “Fully automatic 3D facial expression recognition using differential mean curvature maps and histograms of oriented gradients,” in FG, 2013, pp. 1–7. [Google Scholar]
- [126].Wöllmer M, Kaiser M, Eyben F, Schuller B, and Rigoll G, “Lstm-modeling of continuous emotions in an audiovisual affect recognition framework,” IVC, vol. 31, no. 2, pp. 153–163, 2013. [Google Scholar]
- [127].Koelstra S, Pantic M, and Patras I, “A dynamic texture-based approach to recognition of facial actions and their temporal models,” TPAMI, vol. 32, no. 11, pp. 1940–1954, 2010. [DOI] [PubMed] [Google Scholar]
- [128].Le V, Tang H, and Huang TS, “Expression recognition from 3D dynamic faces using robust spatio-temporal shape features,” in FG, 2011, pp. 414–421. [Google Scholar]
- [129].Sandbach G, Zafeiriou S, Pantic M, and Rueckert D, “A dynamic approach to the recognition of 3D facial expressions and their temporal models,” in FG, 2011, pp. 406–413. [Google Scholar]
- [130].Fang T, Zhao X, Shah SK, and Kakadiaris IA, “4d facial expression recognition,” in ICCV, 2011, pp. 1594–1601. [Google Scholar]
- [131].Wang J, Yin L, Wei X, and Sun Y, “3D facial expression recognition based on primitive surface feature distribution,” in CVPR, 2006, pp. 1399–1406. [Google Scholar]
- [132].Maalej A, Amor B, Daoudi M, Srivastava A, and Berretti S, “Shape analysis of local facial patches for 3D facial expression recognition,” PR, vol. 44, no. 8, pp. 1581–1589, 2011. [Google Scholar]
- [133].Gong B, Wang Y, Liu J, and Tang X, “Automatic facial expression recognition on a single 3D face by exploring shape deformation,” in ICM, 2009, pp. 569–572. [Google Scholar]
- [134].Cohen I, Sebe N, Chen L, Garg A, and Huang TS, “Facial expression recognition from video sequences: Temporal and static modelling,” in CVIU, 2003, pp. 160–187. [Google Scholar]
- [135].Cohen I, Sebe N, Gozman FG, Cirelo MC, and Huang TS, “Learning bayesian network classifiers for facial expression recognition both labeled and unlabeled data,” in CVPR, 2003, pp. I–595–I–601. [Google Scholar]
- [136].Pardas M and Bonafonte A, “Facial animation parameters extraction and expression detection using hmm,” in SPIC, 2002, pp. 675–688. [Google Scholar]
- [137].Aleksic PS and Katsaggelos AK, “Automatic facial expression recognition using facial animation parameters and multistream hmms,” TIFS, vol. 1, no. 1, pp. 3–11, 2006. [Google Scholar]
- [138].Yin L, Wei X, Longo P, and Bhuvanesh A, “Analyzing facial expressions using intensity-variant 3D data for human computer interaction,” in ICPR, 2006, pp. 1248–1251. [Google Scholar]
- [139].Tian Y-L, Kanade T, and Cohn JF, “Recognizing action units for facial expression analysis,” TPAMI, vol. 23, pp. 97–115, 2001. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [140].Dapogny A, Bailly K, and Dubuisson S, “Dynamic facial expression recognition by joint static and multi-time gap transition classification,” in FG, 2015. [Google Scholar]
- [141].Ramanathan S, Kassim A, Venkatesh Y, and Wah WS, “Human facial expression recognition using a 3D morphable model,” in ICIP, 2006, pp. 661–664. [Google Scholar]
- [142].Zhao X, Huang D, Dellandréa E, and Chen L, “Automatic 3D facial expression recognition based on a bayesian belief net and a statistical facial feature model,” in ICPR, 2010, pp. 3724–3727. [Google Scholar]
- [143].Zhao X, Dellandréa E, Zou J, and Chen L, “A unified probabilistic framework for automatic 3D facial expression analysis based on a bayesian belief inference and statistical feature models,” IVC, vol. 31, no. 3, pp. 231–245, 2013. [Google Scholar]
- [144].Ranzato M, Susskind J, Mnih V, and Hinton G, “On deep generative models with applications to recognition,” in CVPR, 2011, pp. 2857–2864. [Google Scholar]
- [145].Rifai S, Bengio Y, Courville A, Vincent P, and Mirza M, “Disentangling factors of variation for facial expression recognition,” in ECCV, 2012, vol. 7577, pp. 808–822. [Google Scholar]
- [146].Liu M, Shan S, Wang R, and Chen X, “Learning expressionlets on spatio-temporal manifold for dynamic facial expression recognition,” in CVPR, 2014, pp. 1749–1756. [Google Scholar]
- [147].Kahou SE, Pal C, Bouthillier X, Froumenty P, Gülçehre Ç, Memisevic R, Vincent P, Courville A, Bengio Y, Ferrari RC et al. , “Combining modality specific deep neural networks for emotion recognition in video,” in ICMI, 2013, pp. 543–550. [Google Scholar]
- [148].Song I, Kim H-J, and Jeon PB, “Deep learning for real-time robust facial expression recognition on a smartphone,” in ICCE, 2014, pp. 564–567. [Google Scholar]
- [149].Liu M, Li S, Shan S, and Chen X, “Au-aware deep networks for facial expression recognition,” in FG, 2013, pp. 1–6. [Google Scholar]
- [150].Ijjina EP and Mohan CK, “Facial expression recognition using kinect depth sensor and convolutional neural networks,” in ICMLA, 2014, pp. 392–396. [Google Scholar]
- [151].He S, Wang S, Lan W, Fu H, and Ji Q, “Facial expression recognition using deep boltzmann machine from thermal infrared images,” in ACII, 2013, pp. 239–244. [Google Scholar]
- [152].Liu P, Han S, Meng Z, and Tong Y, “Facial expression recognition via a boosted deep belief network,” in CVPR, 2014, pp. 1805–1812. [Google Scholar]
- [153].Wu C, Wang S, and Ji Q, “Multi-instance hidden markov model for facial expression recognition,” in FG, 2015. [Google Scholar]
- [154].Tsalakanidou F and Malassiotis S, “Real-time 2D+3D facial action and expression recognition,” PR, vol. 43, no. 5, pp. 1763–1775, 2010. [Google Scholar]
- [155].Tsalakanidou F and Malassiotis S, “Robust facial action recognition from real-time 3D streams,” in CVPR Workshops, 2009, pp. 4–11. [Google Scholar]
- [156]. Fragopanagos N and Taylor JG, “Emotion recognition in human–computer interaction,” Neural Net, vol. 18, no. 4, pp. 389–405, 2005.
- [157]. Caridakis G, Malatesta L, Kessous L, Amir N, Raouzaiou A, and Karpouzis K, “Modeling naturalistic affective states via facial and vocal expressions recognition,” in ICMI, 2006, pp. 146–154.
- [158]. Nicolle J, Rapp V, Bailly K, Prevost L, and Chetouani M, “Robust continuous prediction of human emotions using multiscale dynamic cues,” in ICMI, 2012, pp. 501–508.
- [159]. Savran A, Cao H, Shah M, Nenkova A, and Verma R, “Combining video, audio and lexical indicators of affect in spontaneous conversation via particle filtering,” in ICMI, 2012, pp. 485–492.
- [160]. Huang TS, Chen LS, Tao H, Miyasato T, and Nakatsu R, “Bimodal emotion recognition by man and machine,” in ATR Workshops, 1998.
- [161]. Gunes H and Piccardi M, “Affect recognition from face and body: early fusion vs. late fusion,” in ICSMC, 2005, pp. 3437–3443.
- [162]. Busso C, Deng Z, Yildirim S, Bulut M, Lee CM, Kazemzadeh A, Lee S, Neumann U, and Narayanan S, “Analysis of emotion recognition using facial expressions, speech and multimodal information,” in ICMI, 2004, pp. 205–211.
- [163]. Sebe N, Cohen I, Gevers T, and Huang TS, “Emotion recognition based on joint visual and audio cues,” in ICPR, 2006, pp. 1136–1139.
- [164]. Zeng Z, Tu J, Pianfetti BM, and Huang TS, “Audio–visual affective expression recognition through multistream fused HMM,” T. Multimedia, vol. 10, no. 4, pp. 570–577, 2008.
- [165]. Kessous L, Castellano G, and Caridakis G, “Multimodal emotion recognition in speech-based interaction using facial expression, body gesture and acoustic analysis,” JMUI, vol. 3, no. 1–2, pp. 33–48, 2010.
- [166]. D’Mello SK and Graesser A, “Multimodal semi-automated affect detection from conversational cues, gross body language, and facial features,” UMUAI, vol. 20, no. 2, pp. 147–187, 2010.
- [167]. Yoshitomi Y, Kim S-I, Kawano T, and Kitazoe T, “Effect of sensor fusion for recognition of emotional states using voice, face image and thermal image of face,” in RO-MAN, 2000, pp. 178–183.
- [168]. De Silva LC and Ng PC, “Bimodal emotion recognition,” in FG, 2000, pp. 332–335.
- [169]. Soladié C, Salam H, Pelachaud C, Stoiber N, and Séguier R, “A multimodal fuzzy inference system using a continuous facial expression representation for emotion detection,” in ICMI, 2012, pp. 493–500.
- [170]. Chen LS, Huang TS, Miyasato T, and Nakatsu R, “Multimodal human emotion/expression recognition,” in FG, 1998, pp. 366–371.
- [171]. Ekman P and Friesen WV, Facial Action Coding System: A Technique for the Measurement of Facial Movement. Consulting Psychologists Press, 1978.
- [172]. Ekman P, Friesen WV, and Hager JC, Facial Action Coding System: The Manual on CD ROM. A Human Face, 2002.
- [173]. Friesen WV and Ekman P, “EMFACS-7: Emotional facial action coding system,” U. California, vol. 2, p. 36, 1983.
- [174]. Izard CE, Maximally discriminative facial movement coding system (MAX). Instructional Resources Center, University of Delaware, 1983.
- [175]. Izard CE, Dougherty LM, and Hembree EA, A system for identifying affect expressions by holistic judgments. Instructional Resources Center, University of Delaware, 1983.
- [176]. Zhang C and Zhang Z, “A survey of recent advances in face detection,” Microsoft Research, Tech. Rep., 2010.
- [177]. De la Torre F and Cohn J, “Facial expression analysis,” in Visual Analysis of Humans, 2011, pp. 377–409.
- [178]. Wu B, Ai H, Huang C, and Lao S, “Fast rotation invariant multi-view face detection based on real AdaBoost,” in FG, 2004, pp. 79–84.
- [179]. Lakshmi HV and PatilKulakarni S, “Segmentation algorithm for multiple face detection in color images with skin tone regions using color spaces and edge detection techniques,” IJCTE, vol. 2, no. 4, pp. 1793–8201, 2010.
- [180]. Wang N, Gao X, Tao D, and Li X, “Facial feature point detection: A comprehensive survey,” arXiv, 2014.
- [181]. Tam GK, Cheng Z-Q, Lai Y-K, Langbein FC, Liu Y, Marshall D, Martin RR, Sun X-F, and Rosin PL, “Registration of 3D point clouds and meshes: a survey from rigid to nonrigid,” TVCG, pp. 1199–1217, 2013.
- [182]. Igual L, Perez-Sala X, Escalera S, Angulo C, and De la Torre F, “Continuous generalized Procrustes analysis,” PR, vol. 47, no. 2, pp. 659–671, 2014.
- [183]. Pantic M and Bartlett M, “Machine analysis of facial expressions,” in Face Recognition. I-Tech Education and Publishing, 2007, pp. 377–416.
- [184]. Martinez A and Du S, “A model of the perception of facial expressions of emotion by humans: Research overview and perspectives,” JMLR, vol. 13, no. 1, pp. 1589–1608, 2012.
- [185]. Ashraf AB, Lucey S, Cohn JF, Chen T, Ambadar Z, Prkachin KM, and Solomon PE, “The painful face – pain expression recognition using active appearance models,” IVC, vol. 27, no. 12, pp. 1788–1796, 2009.
- [186]. Littlewort GC, Bartlett MS, and Lee K, “Automatic coding of facial expressions displayed during posed and genuine pain,” IVC, vol. 27, no. 12, pp. 1797–1803, 2009.
- [187]. Cohn JF, Kruez TS, Matthews I, Yang Y, Nguyen MH, Padilla MT, Zhou F, and De la Torre F, “Detecting depression from facial actions and vocal prosody,” in ACII, 2009, pp. 1–7.
- [188]. Kohler CG, Martin EA, Stolar N, Barrett FS, Verma R, Brensinger C, Bilker W, Gur RE, and Gur RC, “Static posed and evoked facial expressions of emotions in schizophrenia,” Schizophr. Res., vol. 105, no. 1–3, pp. 49–60, 2008.
- [189]. Hinton GE, Osindero S, and Teh YW, “A fast learning algorithm for deep belief nets,” Neural Comput., vol. 18, no. 7, pp. 1527–1554, 2006.
- [190]. Kanade T, Cohn JF, and Tian Y, “Comprehensive database for facial expression analysis,” in FG, 2000, pp. 46–53.
- [191]. Lucey P, Cohn JF, Kanade T, Saragih J, Ambadar Z, and Matthews I, “The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression,” in CVPR Workshops, 2010, pp. 94–101.
- [192]. Pantic M, Valstar MF, Rademaker R, and Maat L, “Web-based database for facial expression analysis,” in ICME, 2005, pp. 317–321.
- [193]. Gross R, Matthews I, Cohn J, Kanade T, and Baker S, “Multi-PIE,” in FG, 2008.
- [194]. Dhall A, Goecke R, Lucey S, and Gedeon T, “Static facial expression analysis in tough conditions: Data, evaluation protocol and benchmark,” in ICCV Workshops, 2011, pp. 2106–2112.
- [195]. Savran A, Alyüz N, Dibeklioğlu H, Çeliktutan O, Gökberk B, Sankur B, and Akarun L, “Bosphorus database for 3D face analysis,” in BIOID, 2008, vol. 5372, pp. 47–56.
- [196]. Nguyen H, Kotani K, Chen F, and Le B, “A thermal facial emotion database and its analysis,” in PSIVT, 2014, pp. 397–408.
- [197]. Pavlidis I, Dowdall J, Sun N, Puri C, Fei J, and Garbey M, “Interacting with human physiology,” CVIU, vol. 108, no. 1, pp. 150–170, 2007.
- [198]. Ioannou S, Gallese V, and Merla A, “Thermal infrared imaging in psychophysiology: potentialities and limits,” Psychophysiology, vol. 51, no. 10, pp. 951–963, 2014.
- [199]. Wu L, Oviatt SL, and Cohen PR, “Multimodal integration – a statistical view,” T. Multimedia, vol. 1, pp. 334–341, 1999.
- [200]. De Silva LC, Miyasato T, and Nakatsu R, “Facial emotion recognition using multi-modal information,” in ICICS, 1997, pp. 397–401.
- [201]. Yan W-J, Wu Q, Liu Y-J, Wang S-J, and Fu X, “CASME database: A dataset of spontaneous micro-expressions collected from neutralized faces,” in FG, 2013, pp. 1–7.
- [202]. Mavadati SM, Mahoor MH, Bartlett K, Trinh P, and Cohn JF, “DISFA: A spontaneous facial action intensity database,” TAC, vol. 4, no. 2, pp. 151–160, 2013.
- [203]. Dhall A, Goecke R, Lucey S, and Gedeon T, “Acted facial expressions in the wild database,” Australian Nat. U., Tech. Rep., 2011.
- [204]. McDuff D, El Kaliouby R, Senechal T, Amr M, Cohn JF, and Picard R, “AM-FED facial expression dataset: Naturalistic and spontaneous facial expressions collected ‘in-the-wild’,” in CVPR Workshops, 2013, pp. 881–888.
- [205]. McKeown G, Valstar M, Cowie R, Pantic M, and Schröder M, “The SEMAINE database: Annotated multimodal records of emotionally colored conversations between a person and a limited agent,” TAC, vol. 3, no. 1, pp. 5–17, 2012.
- [206]. Yin L, Wei X, Sun Y, Wang J, and Rosato MJ, “A 3D facial expression database for facial behavior research,” in FG, 2006, pp. 211–216.
- [207]. Yin L, Chen X, Sun Y, Worm T, and Reale M, “A high-resolution 3D dynamic facial expression database,” in FG, 2008, pp. 1–6.
- [208]. http://www.vcipl.okstate.edu/otcbvs/bench/
- [209]. http://www.equinoxsensors.com/
- [210]. Wang S, Liu Z, Lv S, Lv Y, Wu G, Peng P, Chen F, and Wang X, “A natural visible and infrared facial expression database for expression recognition and emotion inference,” T. Multimedia, vol. 12, no. 7, pp. 682–691, 2010.
- [211]. Suwa M, Sugie N, and Fujimora K, “A preliminary note on pattern recognition of human emotional expression,” in IJCPR, 1978, pp. 408–410.
- [212]. Mase K and Pentland A, “Automatic lipreading by optical-flow analysis,” SCJ, vol. 22, 1991.
- [213]. Ekman P, Huang TS, Sejnowski TJ, and Hager JC, “Final report to NSF of the planning workshop on facial expression understanding,” Human Interaction Lab, vol. 378, 1993.
- [214]. Samal A and Iyengar PA, “Automatic recognition and analysis of human faces and facial expressions: A survey,” PR, vol. 25, no. 1, pp. 65–77, 1992.
- [215]. Pantic M and Rothkrantz LJM, “Automatic analysis of facial expressions: the state of the art,” TPAMI, vol. 22, no. 12, pp. 1424–1445, 2000.
- [216]. Fasel B and Luettin J, “Automatic facial expression analysis: a survey,” PR, vol. 36, no. 1, pp. 259–275, 2003.
- [217]. Soyel H and Demirel H, “Facial expression recognition using 3D facial feature distances,” in ICIAR, 2007, pp. 831–838.
- [218]. Savran A, Sankur B, and Bilge MT, “Comparative evaluation of 3D vs. 2D modality for automatic detection of facial action units,” PR, vol. 45, no. 2, pp. 767–782, 2012.
- [219]. Wang J and Yin L, “Facial expression representation and recognition from static images using topographic context,” Department of Computer Science, Tech. Rep., 2005.
- [220]. Lucey S, Ashraf AB, and Cohn JF, Investigating spontaneous facial action recognition through AAM representations of the face. INTECH, 2007.
- [221]. Valstar MF, Pantic M, Ambadar Z, and Cohn JF, “Spontaneous vs. posed facial behavior: automatic analysis of brow actions,” in ICMI, 2006, pp. 162–170.
- [222]. Zeng Z, Fu Y, Roisman GI, Wen Z, Hu Y, and Huang TS, “Spontaneous emotional facial expression detection,” JMM, vol. 1, no. 5, pp. 1–8, 2006.
- [223]. El Kaliouby R and Robinson P, “Real-time inference of complex mental states from facial expressions and head gestures,” in Real-time Vision for Human-Computer Interaction. Springer, 2005, pp. 181–200.
- [224]. Ji Q, Lan P, and Looney C, “A probabilistic framework for modeling and real-time monitoring human fatigue,” SMC-A, vol. 36, no. 5, pp. 862–875, 2006.
- [225]. Littlewort GC, Bartlett MS, and Lee K, “Faces of pain: automated measurement of spontaneous facial expressions of genuine and posed pain,” in ICMI, 2007, pp. 15–21.
- [226]. Lucas GM, Gratch J, Scherer S, Boberg J, and Stratou G, “Towards an affective interface for assessment of psychological distress,” in ACII, 2015.
- [227]. Cohn JF and Schmidt KL, “The timing of facial motion in posed and spontaneous smiles,” IJWMIP, 2004.
- [228]. Ambadar Z, Cohn JF, and Reed LI, “All smiles are not created equal: Morphology and timing of smiles perceived as amused, polite, and embarrassed/nervous,” J. Nonverbal Behav., vol. 33, no. 1, pp. 17–34, 2009.
- [229]. Prkachin KM and Solomon PE, “The structure, reliability and validity of pain expression: Evidence from patients with shoulder pain,” Pain, vol. 139, no. 2, pp. 267–274, 2008.
- [230]. Hammal Z and Cohn JF, “Automatic detection of pain intensity,” in ICMI, 2012, pp. 47–52.
- [231]. Valstar MF, Jiang B, Mehu M, Pantic M, and Scherer K, “The first facial expression recognition and analysis challenge,” in FG, 2011, pp. 921–926.
- [232]. Valstar M, Girard J, Almaev T, McKeown G, Mehu M, Yin L, Pantic M, and Cohn J, “FERA 2015 – second facial expression recognition and analysis challenge,” in FG, 2015.
- [233]. Pantic M and Rothkrantz LJ, “An expert system for recognition of facial actions and their intensity,” in AAAI/IAAI, 2000, pp. 1026–1033.
- [234]. Bartlett MS, Littlewort GC, Frank MG, Lainscsek C, Fasel IR, and Movellan JR, “Automatic recognition of facial actions in spontaneous expressions,” JMM, vol. 1, no. 6, pp. 22–35, 2006.
- [235]. Savran A, Sankur B, and Taha Bilge M, “Estimation of facial action intensities on 2D and 3D data,” in EUSIPCO, 2011, pp. 1969–1973.
- [236]. Savran A, Sankur B, and Bilge MT, “Regression-based intensity estimation of facial action units,” IVC, vol. 30, pp. 774–784, 2012.
- [237]. Girard JM, Cohn JF, and De la Torre F, “Estimating smile intensity: A better way,” PRL, 2014.
- [238]. Zaker N, Mahoor MH, Mattson W, Messinger DS, Cohn JF et al., “Intensity measurement of spontaneous facial actions: Evaluation of different image representations,” in ICDL, 2012, pp. 1–2.
- [239]. Sandbach G, Zafeiriou S, and Pantic M, “Markov random field structures for facial action unit intensity estimation,” in ICCV Workshops, 2013, pp. 738–745.
- [240]. Li Y, Mavadati SM, Mahoor MH, and Ji Q, “A unified probabilistic framework for measuring the intensity of spontaneous facial action units,” in FG, 2013, pp. 1–7.
- [241]. Jeni L, Girard JM, Cohn JF, De La Torre F et al., “Continuous AU intensity estimation using localized, sparse facial feature space,” in FG, 2013, pp. 1–7.
- [242]. Shimada K, Noguchi Y, and Kurita T, “Fast and robust smile intensity estimation by cascaded support vector machines,” IJCTE, vol. 5, no. 1, pp. 24–30, 2013.
- [243]. Dhall A and Goecke R, “Group expression intensity estimation in videos via Gaussian processes,” in ICPR, 2012, pp. 3525–3528.
- [244]. Haggard EA and Isaacs KS, “Micromomentary facial expressions as indicators of ego mechanisms in psychotherapy,” in Methods of Research in Psychotherapy. Springer, 1966, pp. 154–165.
- [245]. Shreve M, Godavarthy S, Manohar V, Goldgof D, and Sarkar S, “Towards macro- and micro-expression spotting in video using strain patterns,” in WACV, 2009, pp. 1–6.
- [246]. Shreve M, Godavarthy S, Goldgof D, and Sarkar S, “Macro- and micro-expression spotting in long videos using spatio-temporal strain,” in FG, 2011, pp. 51–56.
- [247]. Pfister T, Li X, Zhao G, and Pietikäinen M, “Recognising spontaneous facial micro-expressions,” in ICCV, 2011, pp. 1449–1456.
- [248]. Polikovsky S, Kameda Y, and Ohta Y, “Facial micro-expressions recognition using high-speed camera and 3D-gradient descriptor,” 2009.
- [249]. Wu Q, Shen X, and Fu X, “The machine knows what you are hiding: an automatic micro-expression recognition system,” in ACII, 2011, pp. 152–162.
- [250]. Lucas GM, Gratch J, Scherer S, Boberg J, and Stratou G, “Towards an affective interface for assessment of psychological distress,” in ACII, 2015.
- [251]. Biel J-I, Teijeiro-Mosquera L, and Gatica-Perez D, “FaceTube: predicting personality from facial expressions of emotion in online conversational video,” in ICMI, 2012, pp. 53–56.
- [252]. Sanchez-Cortes D, Biel J-I, Kumano S, Yamato J, Otsuka K, and Gatica-Perez D, “Inferring mood in ubiquitous conversational video,” in MUM, 2013, p. 22.
- [253]. Williamson JR, Quatieri TF, Helfer BS, Ciccarelli G, and Mehta DD, “Vocal and facial biomarkers of depression based on motor incoordination and timing,” in AVEC, 2014, pp. 65–72.
- [254]. Sidorov M and Minker W, “Emotion recognition and depression diagnosis by acoustic and visual features: A multimodal approach,” in AVEC, 2014, pp. 81–86.
- [255]. Senoussaoui M, Sarria-Paja M, Santos JF, and Falk TH, “Model fusion for multimodal depression classification and level detection,” in AVEC, 2014, pp. 57–63.
- [256]. Jain V, Crowley JL, Dey AK, and Lux A, “Depression estimation using audiovisual features and Fisher vector encoding,” in AVEC, 2014, pp. 87–91.
- [257]. Biel J-I, Tsiminaki V, Dines J, and Gatica-Perez D, “Hi YouTube!: Personality impressions and verbal content in social video,” in ICMI, 2013, pp. 119–126.
- [258]. Abadi MK, Correa JAM, Wache J, Yang H, Patras I, and Sebe N, “Inference of personality traits and affect schedule by analysis of spontaneous reactions to affective videos,” in FG, 2015.
- [259]. Batrinca LM, Mana N, Lepri B, Pianesi F, and Sebe N, “Please, tell me about yourself: automatic personality assessment using short self-presentations,” in ICMI, 2011, pp. 255–262.
- [260]. Biel J-I and Gatica-Perez D, “The YouTube lens: Crowdsourced personality impressions and audiovisual analysis of vlogs,” T. Multimedia, vol. 15, no. 1, pp. 41–55, 2013.
- [261]. Girard JM, Cohn JF, Sayette MA, Jeni LA, and De la Torre F, “Spontaneous facial expression can be measured automatically,” Beh. Res. Meth., 2014.
- [262]. Gehrig T and Ekenel HK, “Why is facial expression analysis in the wild challenging?” in ICMI Workshops, 2013, pp. 9–16.
- [263]. Sikka K, Dykstra K, Sathyanarayana S, Littlewort G, and Bartlett M, “Multiple kernel learning for emotion recognition in the wild,” in ICMI, 2013, pp. 517–524.
- [264]. Liu M, Wang R, Huang Z, Shan S, and Chen X, “Partial least squares regression on Grassmannian manifold for emotion recognition,” in ICMI, 2013, pp. 525–530.
- [265]. Dhall A, Goecke R, Joshi J, Sikka K, and Gedeon T, “Emotion recognition in the wild challenge 2014: Baseline, data and protocol,” in ICMI, 2014, pp. 461–466.
- [266]. Bosch N, D’Mello S, Baker R, Ocumpaugh J, Shute V, Ventura M, Wang L, and Zhao W, “Automatic detection of learning-centered affective states in the wild,” in IUI, 2015, pp. 379–388.
- [267]. Dhall A, Joshi J, Sikka K, Goecke R, and Sebe N, “The more the merrier: Analysing the affect of a group of people in images,” in FG, 2015.
- [268]. McDuff D, El Kaliouby R, Senechal T, Demirdjian D, and Picard R, “Automatic measurement of ad preferences from facial responses gathered over the internet,” IVC, vol. 32, no. 10, pp. 630–640, 2014.
- [269]. Kuncheva LI, Combining Pattern Classifiers: Methods and Algorithms. John Wiley & Sons, 2004.
- [270]. Russell JA, Bachorowski J, and Fernandez-Dols J, “Facial and vocal expressions of emotion,” Annu. Rev. Psychol., vol. 54, pp. 329–349, 2003.
- [271]. Castellano G, Leite I, Pereira A, Martinho C, Paiva A, and McOwan P, “Multimodal affect modelling and recognition for empathic robot companions,” IJHR, vol. 10, no. 1, 2013.
- [272]. Martínez HP and Yannakakis GN, “Mining multimodal sequential patterns: a case study on affect detection,” in ICMI, 2011, pp. 3–10.
- [273]. Van den Stock J, Righart R, and De Gelder B, “Body expressions influence recognition of emotions in the face and voice,” Emotion, vol. 7, no. 3, pp. 487–494, 2007.