

International Journal of Computer Vision

https://doi.org/10.1007/s11263-023-01761-6

SMG: A Micro-gesture Dataset Towards Spontaneous Body Gestures for Emotional Stress State Analysis

Haoyu Chen1 · Henglin Shi1 · Xin Liu2 · Xiaobai Li1 · Guoying Zhao1

Received: 4 May 2022 / Accepted: 3 January 2023


© The Author(s) 2023

Abstract
We explore using body gestures for hidden emotional state analysis. As an important non-verbal communicative fashion,
human body gestures are capable of conveying emotional information during social communication. In previous works,
efforts have been made mainly on facial expressions, speech, or expressive body gestures to interpret classical expressive
emotions. Differently, we focus on a specific group of body gestures, called micro-gestures (MGs), used in the psychology
research field to interpret inner human feelings. MGs are subtle and spontaneous body movements that are proven, together
with micro-expressions, to be more reliable than normal facial expressions for conveying hidden emotional information. In
this work, a comprehensive study of MGs is presented from the computer vision aspect, including a novel spontaneous micro-
gesture (SMG) dataset with two emotional stress states and a comprehensive statistical analysis indicating the correlations
between MGs and emotional states. Novel frameworks are further presented together with various state-of-the-art methods
as benchmarks for automatic classification, online recognition of MGs, and emotional stress state recognition. The dataset
and methods presented could inspire a new way of utilizing body gestures for human emotion understanding and bring a new
direction to the emotion AI community. The source code and dataset are made available at: https://github.com/mikecheninoulu/SMG.

Keywords Micro-gestures · Gesture recognition · Emotion recognition · Statistical modeling · Deep learning · Affective computing

Communicated by Maja Pantic.

Corresponding author: Guoying Zhao ([email protected]). Haoyu Chen ([email protected]), Henglin Shi ([email protected]), Xin Liu ([email protected]), Xiaobai Li ([email protected])

1 Center for Machine Vision and Signal Analysis (CMVS), University of Oulu, Oulu, Finland
2 Computer Vision and Pattern Recognition Laboratory, School of Engineering Science, Lappeenranta-Lahti University of Technology LUT, Lappeenranta, Finland

1 Introduction

Human beings are innately able to express and interpret emotional expressions via various non-verbal communication channels (Shiffrar et al. 2011), an ability that should also be an indispensable part of intelligent agents. As an important non-verbal communicative fashion, human body gestures are capable of conveying rich emotional information during social communication (Aviezer et al. 2012). However, as shown in Fig. 1a, when it comes to machines, the analyzed emotional cues have mostly been limited to human facial expressions and speech (El Ayadi et al. 2011; Li and Deng 2020).

Compared to other modalities, body gestures have several advantages in emotion recognition tasks. Firstly, the data acquisition of body gestures is more accessible, especially when high-resolution surveillance cameras or portable microphones are not available for capturing facial expressions or speech in public areas (e.g., airports, metros, or stadiums). With the recent success of deep learning on large-scale datasets, concerns about privacy protection and ethical issues have started to emerge (Oh et al. 2016); meanwhile, body gestures involve less identity information, which is promising. Lastly, studies (Ekman 2004) showed that when people were trying to hide their emotions, most of them would attempt to tune their facial expressions but could not suppress their micro-expressions, while only a few people mentioned the need to manage their body movements. Thus, it is encouraging to use gestures to capture people's suppressed/hidden emotions.

With the above observations, this study focuses on a specific group of gestures called micro-gestures (MGs) for emotional understanding. Unlike previous research that uses expressive body gestures to interpret classical expressive emotions, we propose a brand new research topic: analyzing people's hidden emotional states with MGs. MGs are defined as subtle and involuntary body movements that reveal people's suppressed/hidden emotions. They are often used in the psychology research field to interpret inner human feelings (Serge 1995). Although MGs cover a wide group of gestures (e.g., scratching the head, touching the nose, playing with clothes), they share one important attribute that differentiates them from other gestures: MGs are not performed for any illustrative or communicational purpose at all; they are spontaneous or involuntary body responses to the onset of certain stimuli, especially negative ones. Ordinary gestures, in contrast, are usually performed to facilitate communication, e.g., to illustrate specific semantic meanings or to explicitly express one's feelings or attitudes, and are referred to as illustrative or iconic gestures (Khan and Ibraheem 2012). As shown in Fig. 1b, in high-stake situations such as interviews and games, although the subjects try to conceal or suppress their true feelings, either to gain an advantage (win the game) or to avoid a loss (keep their social image), they spontaneously initiate body gestures in response to the stimuli. Studies (Pentland 2008) showed that these gestures are important clues for revealing people's hidden emotional status, especially negative feelings such as stress, nervousness, and fear, and can be used to detect anomalous mental status, e.g., for Alzheimer's or autism diagnosis. Expectedly, automatic MG recognition has great potential in applications such as human-computer interaction, social media, public safety, and health care (Krakovsky 2018).

This study aims to answer the following research question: how can a machine be trained to recognize and better understand hidden human emotions via body gestures like a trained expert? Specifically, we break this question down into several sub-problems with corresponding solutions. (1) Unlike the Action Units (AUs) in the Facial Action Coding System (FACS) (Ekman 1997), a common standard is absent for body gesture-based emotion measurement. The lack of this empirical guidance leaves even psychological professionals without complete agreement on annotating bodily expressions (Luo et al. 2020). Thus, we present a novel dataset of MGs, which was collected under objective proxy tasks to stimulate two states of emotional stress. (2) The high heterogeneity within the same gesture class makes the classification of MGs much more complicated than that of ordinary gestures. Thus, we provide various state-of-the-art models from recent top computer vision venues to establish the benchmark. (3) Accurately spotting MGs from unconstrained streams is another highly challenging task, as MGs are subtle and rapid body movements that can easily be submerged in other unrelated body movements. To this end, we propose a novel online detection method with a parameter-free attention mechanism to differentiate MGs from non-MGs adaptively. (4) The conventional paradigm that imposes an emotional state on each gesture does not resemble real-world scenarios, so we explore a new paradigm that achieves emotional understanding by holistically considering all the MGs.

Fig. 1 a Taxonomy of emotional cues. MGs serve as one of the non-verbal communicative cues for emotional understanding. b Example scenarios to which MG recognition can be applied. In the interview or game, the subjects tend to hide their intentions, while MGs can leak their hidden emotions


Fig. 2 The overview of the main research topics of this work. a A novel SMG dataset with a comprehensive statistical analysis. b Multiple benchmarks on the SMG dataset. c A novel online MG recognition framework for complicated gesture transition patterns. d Baselines and a newly proposed framework for emotional state recognition

As shown in Fig. 2, this work consists of four main research topics for comprehensively studying MGs from the computer vision aspect, and the contributions of each topic can be summarized as follows:

1. To the best of the authors' knowledge, this is the first work to investigate MGs with computer vision technologies for hidden emotional state analysis. A new MG dataset is built through interdisciplinary efforts, which contains rich MGs towards spontaneous body gesture-based emotional stress state understanding.
2. A comprehensive statistical analysis is conducted on the relationship between body gestures and emotional stress states, investigating the various features of MGs. Benchmark results for classifying and online recognizing MGs are reported based on multiple state-of-the-art methods.
3. A hidden Markov model (HMM) recurrent network for online MG recognition is proposed with a novel parameter-free attention mechanism. The method is intensively validated on three online gesture recognition datasets with competitive performances.
4. A novel paradigm is explored via a spectral graph-based model to infer the emotional states from the MG clues of holistic videos, instead of the previously prevailing one-gesture-one-emotion paradigm.

This research is based on our previous work (Chen et al. 2019), but extended in several aspects: 1) more comprehensive dataset statistical analysis, 2) extensive benchmark experimental results with state-of-the-art methods, 3) an HMM recurrent network with a novel parameter-free attention mechanism validated on three datasets, and 4) a spectral graph neural network as a baseline for emotional stress state recognition.

The rest of this paper is structured as follows. Section 2 reviews related work in the literature. The SMG dataset and its analysis are presented in Sect. 3. Benchmarks of MG classification are provided in Sect. 4. Section 5 focuses on the online MG recognition task with a newly proposed method. Body gesture-based emotional stress state recognition is conducted in Sect. 6, and we conclude the work in Sect. 7.

2 Related Work

2.1 Body Gesture Recognition in Computer Vision

Accurate recognition is the foundation of all further applications of body gestures, such as sign language recognition, human-robot interaction, and also emotional gesture recognition (Carreira and Zisserman 2017; Shahroudy et al. 2016; Soomro et al. 2012). Over the decades, human gesture recognition has been intensively researched in the field of computer vision. From the machine learning point of view, body gesture recognition can be sorted into two settings: 1) the classification of pre-segmented body gestures and 2) temporal body gesture detection and recognition on long non-stationary sequences. The former task, which conducts the classification of pre-segmented clips, draws more attention from researchers, and most of the existing state-of-the-art technologies can achieve considerably promising performances. For video-based resources such as RGB, depth, and optical flow data, classical models for body action and gesture classification mainly include 2D CNN families (Lin et al. 2019; Wang et al. 2018; Xu et al. 2019a) and 3D CNN families (Carreira and Zisserman 2017; Hara et al. 2018; Tran et al. 2015). Based on skeleton resources obtained from, for example, Kinect (Shotton et al. 2011) or OpenPose (Cao et al. 2019), state-of-the-art methods nowadays are mainly derived from graph-based convolutional networks (Cheng et al. 2020; Liu et al. 2020; Peng et al. 2020; Shi et al. 2019; Yan et al. 2018). Methods have also been proposed to fuse different resources and modalities (Crasto et al. 2019; Sun et al. 2018; Yu et al. 2020). When it comes to the latter online recognition setting, research efforts are relatively few due to the computational complexity (Chen et al. 2020; Li et al. 2016; Liu et al. 2018; Neverova et al. 2016; Wu et al. 2016; Xu et al. 2019b). Different from other gesture recognition tasks such as sign language recognition, the recognition of emotional body gestures and MGs has its specific challenges: (1) the duration ranges from several frames to hundreds of frames; (2) the kinetic scale varies from only subtle finger movements to overall body changes; (3) the variations of the movements associated with a gesture can be large due to the individual differences of subjects; and (4) meaningful emotional gestures are submerged within plenty of irrelevant body movements.


2.2 Human Emotion Recognition with Body Gestures

Recognizing emotional states through body movements has been researched for decades (Noroozi et al. 2018). Previous works are mainly based on one-gesture-one-emotion assumptions with two kinds of emotional modeling theories (Noroozi et al. 2018): the categorical and dimensional models. In the categorical model-based methods (Ginevra et al. 2008; Gunes and Piccardi 2006; Mahmoud et al. 2011), each emotion was associated with a meaningful gesture, and participants were asked to act out those emotions with their body gestures. Recently, some researchers have explored the possibility of analyzing bodily expression with a dimensional model (Kipp and Martin 2009; Luo et al. 2020). In the work of Luo et al. (2020), the emotions of body gestures collected from movie clips are defined by the dimensions of arousal and valence. However, an essential feature of emotional gestures is neglected in all of these works: not all body movements are highly emotion-driven (Pentland 2008), and body language can be interpreted differently across subjects (Yu 2008). Thus, it is neither convincing nor accurate to interpret each isolated gesture as an emotional state while ignoring subject differences. As expected, the agreement on the interpretation of one bodily expression between annotators is considerably low. For instance, during the emotion annotation in the work of Luo et al. (2020), annotators still primarily relied on facial expressions rather than gestures. This issue limits the extension of such research to real-world implementation.

2.3 Emotional Body Gesture Datasets

Compared to regular human gesture analysis, such as body pose, action, or sign language recognition, research efforts devoted to using gestural behaviors to interpret human emotion or affect are relatively few (Noroozi et al. 2018). The pioneering work for gesture-based emotion recognition in the computer vision field goes back more than 20 years (Ginevra et al. 2008; Gunes and Piccardi 2006; Schindler et al. 2008; Wallbott 1998). Wallbott (1998) collected 224 videos, in each of which an actor acted out a body gesture representing an emotional state through a scenario approach. In the work of Schindler et al. (2008), an image-based dataset was collected in which emotions were displayed by body language in front of a uniform background and different poses could express the same emotion. Gunes and Piccardi (2006) introduced a bimodal face and body gesture database, called FABO, including facial and gestural modalities. Different from the above laboratory settings, Kipp and Martin (2009) proposed a Theater corpus based on two movie versions of the play Death of a Salesman, trying to explore the correlations between basic gesture attributes and emotion. It also provided the emotion dimensions of pleasure, arousal, and dominance instead of emotion-specific discrete expression models. Similarly, Luo et al. (2020) collected a large-scale dataset called BoLD (Body Language Dataset) that includes both discrete emotions and dimensional emotions. In the BoLD dataset, each short video clip has been annotated for emotional body expressions as perceived by viewers via a crowd-sourcing strategy. However, those datasets were all designed for classical expressive emotions, and none of them is specifically for hidden emotional state understanding.

Fig. 3 Acquisition setup for the elicitation and recording of micro-gestures (participants, two synchronized Kinect V2 devices, multi-modality displaying, and real-time monitoring or replaying by the observer via monitoring screens)

3 The SMG Dataset

This section introduces the whole collection procedure and details of the SMG dataset, from the psychological background, the elicitation design, and the annotation to the final collected dataset and its statistics.

3.1 Psychological Background for Micro-gestures

The term "micro-gesture" was first used in psychological work (Serge 1995) for assisting doctors in diagnosing patients' mental conditions via body gestures, and it can later also be found in popular science works (Kuhnke 2009; Navarro and Karlins 2008). The first work formally studying spontaneous gestures for hidden emotion understanding can be traced back to Ekman (2004), where it was found that spontaneous body gestures (e.g., "a fragment of a shrugging gesture"), together with micro-expressions, are more reliable clues to interpret hidden human emotions than intentionally performed facial expressions. Furthermore, the fight, flight, and freeze system proposed by Gray (1982) mediates reactions to aversive stimuli and threats, which explains those spontaneous body movements from the aspect of brain science.


The three factors, fight, flight, and freeze, can cause specific human behaviors at the onset of certain stimuli, including a freezing body (e.g., holding the breath), distancing behaviors (e.g., putting hands or objects up to block the face or body), and guarding behaviors (e.g., puffing out the chest). Besides, to transfer from discomfort to comfort states, human beings develop a natural reaction, so-called pacifying actions, that tries to suppress the negative feelings induced by the above three factors (Panksepp 1998). Other psychological research related to MGs can also be found in early work (de Becker 1997; Burgoon et al. 1994) and the most recent work (Kita et al. 2017; Pouw et al. 2016).

In total, based on the above psychological theoretical support, we define the MG categories for computer vision study with the following criteria: (1) covering all MGs that could possibly occur in the SMG dataset, (2) corresponding to psychological theories and functions, and (3) being "properly specific" (e.g., "touching" would be too general, "scratching the left cheek" would be too specific) for a computational model to recognize. Finally, we summarized 16 types of MGs for our SMG dataset, including fight patterns (e.g., "folding arms"), flight patterns (e.g., "moving legs"), freeze patterns (e.g., "turtling neck and shoulder"), and pacifying patterns (e.g., "scratching head and hand rubbing"). Non-micro-gestures were also labeled as an independent category for illustrative gestures or sign gestures. The entire list of MGs and non-MGs that we collected and their psychological attributes are provided in Fig. 4a and Table 1. The 16 categories cover the most common MGs in the SMG dataset, but there might be some rare cases that were not observed in the current experimental scenario of the SMG dataset. We will further enrich the MG collection and make more comprehensive lists in future work.

Fig. 4 Overview of MGs labeled in our SMG dataset. a Examples of annotated MGs (ID 1 to ID 16) and non-micro-gestures; for privacy concerns, we mask the faces of the participants here. b The distribution of participants' demographics (age: <22: 30%, 22-28: 50%, >28: 20%; ethnicity: East Asia 37.5%, Middle East 35.0%, Europe 17.5%, others 10.5%; gender: male 67.5%, female 32.5%). c The four modalities collected in our SMG dataset: RGB video (1920 × 1080), depth video (512 × 424), silhouette (512 × 424), and skeleton (25 joint coordinates)

3.2 Elicitation of Micro-gestures

Referring to the above supporting psychological theories, we design the procedure for the elicitation of MGs to create our SMG dataset as follows.

Eliciting Tasks. We designed two proxy tasks for stimulating the corresponding emotional stress states and eliciting micro-gestures. Precisely, the two proxy tasks are (i) given a true story with a title and detailed content, repeating the content of the story, as the "baseline stimuli", and (ii) given an empty story with only a title and no content, making up a fake story off-the-cuff, as the "deviation stimuli". The stories are short newscasts or reports with an average of 141 words and with rich details (see more detailed design principles in "Appendix B"). Participants have to repeat (baseline stimuli) or make up (deviation stimuli) the content of the story, and they need to prove that they knew the story content, no matter which task they were assigned. Participants were told that there would be a punishment for them if they got caught, so they had to conceal their emotions, especially for the "deviation stimuli" ones. Compared to repeating a true story (baseline stimuli, which can be regarded as the counterpart of the placebo group in the psychological field), creating a fake story off-the-cuff (deviation stimuli) requires a higher mental load and more inner activity with mental presence and emotional involvement (Palena et al. 2018). In this way, the two emotional stress states are obtained for our SMG dataset as the hidden emotional states. For ease of reading, we denote the two states as NES (non-stressed emotional state) and SES (stressed emotional state) for short.

Participants. In total, 40 participants were recruited for our dataset collection (age: M = 25.8, SD = 4.87). They are 27 men and 13 women from multicultural backgrounds (16 countries). The distribution of participants' demographics is given in Fig. 4b. They were recruited via advertisements, and no specific educational major was required. Although some of the participants were familiar with machine learning and computer vision, none of the participants were privy to the workings of the machine learning algorithm of the study we were conducting.


Apparatus. Two Kinect V2 sensors were placed two meters away in front of the participants to capture their whole-body movements, with an RGB resolution of 1920 × 1080 pixels at 28 frames per second. The resulting modalities are RGB, silhouette, and depth videos, plus skeleton coordinates, as shown in Fig. 4c. The smoothing function of Kinect V2 was disabled to obtain detailed and subtle body movements as much as possible, due to the particularity of MGs.

Procedure. The data collection was carried out in a normal office room of a college, as shown in Fig. 3. Two participants took turns telling stories, and an observer monitored them behind the scenes to ensure that participants felt the need to conceal their true emotions. For one round of the experiment, the two participants were assigned two different (SES/NES) stimuli, respectively, and they needed to persuade the observer to believe that they knew the content. We ensured that the numbers of NES and SES rounds collected from each participant are the same; in other words, the NES and SES instances are evenly distributed in the SMG dataset. The time duration of one complete round is controlled within six minutes.

3.3 Data Annotation and Quality Control

Our SMG dataset's annotation contains two levels: (1) the temporal allocation and MG categories and (2) the emotional state categories.

MG Labeling. Four human annotators were assigned to go through long video sequences to spot and annotate all the 16 categories of MGs (as well as all the non-MGs). To guarantee the quality of annotation, we arranged two rounds of labeling. In the first round, the four annotators were trained on how to spot and classify MGs based on the MG list (see Table 1) and related psychological theories. After confirming the labeling criteria, they annotated the MGs separately based on their judgments of the collected video sequences. The MG category labels of the four annotators were summarized and cross-checked, and majority voting decided inconsistent cases. In the second round, the temporal labeling of MG clips was cross-checked to ensure that the labeling style of the start and end points of the MGs is unified at the frame level. Finally, we have all the MG clips with start and end points, and their categories, within the collected long video sequences.

Emotional Stress State Annotation. The emotional stress states in our SMG dataset are straightforward and objective based on the two proxy tasks, i.e., NES and SES are naturally assigned based on the corresponding task.

Fig. 5 Visualized distribution of MGs among different emotional stress states. The size of the blocks stands for the amount of the MGs. We can observe that MGs are rare and fine-grained compared to ordinary gestures. There are multiple types of fine-grained MGs under each coarse category. MGs can be easily submerged by non-MGs

3.4 SMG Dataset Statistics

Dataset Structure. The final SMG dataset comprises 414 long video instances (around one minute for each instance) from 40 participants, resulting in 821,056 frames in total (more than 488 min). Each long video instance has one of the two emotional states (NES or SES). The video instances are evenly distributed over the two emotional states (207 vs. 207). Among those 414 long video instances, 3712 MG clips were labeled, and the average length of those MGs is 51.3 frames (with the shortest MG at 8 frames), which is significantly shorter than the length of common gestures collected in other datasets, typically 100–300 frames (Escalera et al. 2015; Li et al. 2016). The distribution of MGs over the two emotional stress states can be seen in Fig. 5.

Correlations of MGs and Emotional States. We validate the MG distributions in the two emotional stress states using the t-test after a placebo-controlled study. The detailed statistical results are given in Table 2 as a quantitative report. Specifically, we deploy the paired-sample t-test, using the t-distribution (two-tailed), to compare MG distributions over the two emotional stress states among the 40 participants. From the last line of Table 2, we can see that there was a significant increase in the volume of MGs performed under SES (M = 38.58, SD = 32.51) compared to NES (M = 24.50, SD = 20.87), t(39) = 4.6300, p < 0.0001. The result rejected the null hypothesis; thus, a significant correlation between MGs and emotional stress states is found. When it comes to non-MGs, no significant change is found, with t(39) = 0.9198, p = 0.3633.
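To make the statistical protocol concrete, the following is a minimal sketch of the paired, two-tailed t-test described above, written with SciPy; the per-subject MG counts in it are randomly generated placeholders, not the SMG annotations, so the printed values will not match the reported statistics.

```python
# A minimal sketch of the paired-sample t-test reported above, using SciPy.
# The per-subject counts here are placeholders; the real values come from the
# SMG annotations (one MG count per participant under NES and under SES).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mg_counts_nes = rng.poisson(24.5, size=40)   # hypothetical NES counts, 40 subjects
mg_counts_ses = rng.poisson(38.6, size=40)   # hypothetical SES counts, 40 subjects

# Two-tailed paired t-test over the same 40 participants.
t_stat, p_value = stats.ttest_rel(mg_counts_ses, mg_counts_nes)
print(f"t({len(mg_counts_nes) - 1}) = {t_stat:.4f}, p = {p_value:.4f}")
```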
Visualized MG Distributions. We present a visualized distribution of MGs over the two emotional stress states in Fig. 5 and observe certain features of the MG patterns. The first and most prominent feature is that non-MGs and whole-body MGs occupy the majority of the body movements, demonstrating that it is challenging to efficiently distinguish rare MGs from unconstrained upcoming streams, as they can easily be submerged among the dominating amount of non-MGs. Secondly, although MGs cover a large range of body gestures, the major categories are extremely fine-grained: six kinds of MGs are "hand-body" interactions and four are "hand-hand" interactions. Thus, compared to other body gesture/action recognition tasks, MGs require a more fine-grained and accurate recognizing ability from the machine learning aspect.
Table 1 The list of MGs collected in the SMG dataset; MG IDs correspond to the indexes in Fig. 4

MG ID    Kinematic description                                  Psychological attribute   Number in SES/NES
1        Turtling neck and shoulder                             Freezing                  35/19
2        Rubbing eyes and forehead                              Freezing                  32/19
3        Folding arms                                           Fighting                  122/77
4        Touching or covering suprasternal notch                Pacifying                 6/5
5        Moving legs                                            Fleeing                   742/447
6        Touching or scratching neck                            Pacifying                 13/11
7        Folding arms behind body                               Freezing                  81/44
8        Rubbing hands and crossing fingers                     Pacifying                 248/192
9        Arms akimbo                                            Fighting                  38/13
10       Crossing legs                                          Pacifying                 28/23
11       Scratching some part of body                           Pacifying                 15/8
12       Scratching or touching facial parts other than eyes    Pacifying                 131/88
13       Playing or adjusting hair                              Pacifying                 41/21
14       Holding arms                                           Fighting                  46/22
15       Pulling shirt collar                                   Pacifying                 6/5
16       Playing with jewelry, and manipulating other objects   Pacifying                 10/13
Non-MGs  Illustrative hand gestures                             -                         593/518
Total    -                                                      -                         2187/1525

Psychological attribute stands for the psychological basis that drives the corresponding micro-gesture


Table 2 The statistical distribution of gestures over the two states of emotional stress

MG type    NES (S / M / SD)         SES (S / M / SD)         t value   p value
Non-MGs    518 / 12.95 / 12.61      589 / 14.72 / 13.81      0.9198    0.3633
MGs        980 / 24.50 / 20.87      1543 / 38.58 / 32.51     4.6300    <0.0001

S, M, and SD stand for "sum", "mean", and "standard deviation". The significance level equals 0.01 and the significant terms are marked in bold. Note that the total number of gestures is not equal to 3712, because some gestures occur in the transitions between the two emotional stress states and are not counted

Relationship Between MGs and Subjects. As mentioned, the use of body gestures to interpret emotions can be heavily affected by individual differences. Here, we conduct a qualitative analysis of the MG patterns of different subjects. Specifically, Pearson's correlation coefficient is used to measure the correlation between the MG performing patterns of the 40 subjects in our SMG dataset. The MG performing pattern of a given subject is represented by the frequency distribution of the 17 MG categories. Pearson's correlation coefficient varies from -1 to 1, and the higher it is, the stronger the evaluated correlation. According to the statistical calculation, the average Pearson's correlation coefficient over these 40 subjects is 0.456, with the highest at 0.966 and the lowest at -0.240. This indicates a trend that subjects share MG patterns, especially in the exposing frequency of MGs, while the individual inconsistency of the MG patterns is still not negligible. As a result, although the above t-test proves the effectiveness of SES for eliciting MGs, it is necessary to emphasize the inconsistency of MG performing patterns brought by different subjects.
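This subject-level analysis can be illustrated with the short sketch below, which computes pairwise Pearson correlations between per-subject MG frequency vectors; the frequency matrix is a hypothetical stand-in for the real annotation counts, so the printed statistics will not match the values reported above.

```python
# Sketch of the subject-similarity analysis: pairwise Pearson correlation
# between per-subject MG frequency vectors. The counts are hypothetical.
import numpy as np

n_subjects, n_categories = 40, 17          # 40 participants, 17 gesture categories
rng = np.random.default_rng(1)
freq = rng.integers(0, 50, size=(n_subjects, n_categories)).astype(float)

# np.corrcoef treats each row as one variable, so this yields a 40 x 40
# matrix of Pearson coefficients between subjects' frequency distributions.
corr = np.corrcoef(freq)
pairs = corr[np.triu_indices(n_subjects, k=1)]   # unique subject pairs
print(f"mean r = {pairs.mean():.3f}, max r = {pairs.max():.3f}, min r = {pairs.min():.3f}")
```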

4 Micro-gesture Classification

In this section, we focus on the task of classifying pre-segmented MG clips from our SMG dataset. Analogous to the classical action/gesture recognition task, algorithms need to classify a given sequential clip into the correct MG category, from a certain data modality, such as RGB, depth, optical flow, or body skeletons. In order to set up the benchmark of MG classification, we select over ten state-of-the-art models for the classical action recognition task from recent top venues like AAAI, ECCV, ICCV, CVPR, and TPAMI, and evaluate them on the SMG dataset, covering two representative modalities, RGB and skeleton. We first report the evaluation protocols and introduce the models used for MG classification on the two modalities. At last, we present the experimental results and related analysis.

Table 3 MG classification performance on the test set of the SMG dataset

Method                               Modality   Top-1   Top-5
ST-GCN (Yan et al. 2018)             Skeleton   41.48   86.07
2S-AGCN (Shi et al. 2019)            Skeleton   43.11   86.90
Shift-GCN (Cheng et al. 2020)        Skeleton   55.31   87.34
GCN-NAS (Peng et al. 2020)           Skeleton   58.85   85.08
MS-G3D (Liu et al. 2020)             Skeleton   64.75   91.48
R3D (Hara et al. 2018)               RGB        29.84   67.87
I3D (Carreira and Zisserman 2017)    RGB        35.08   85.90
C3D (Tran et al. 2015)               RGB        45.90   79.18
TSN (Wang et al. 2018)               RGB        50.49   82.13
TSM (Lin et al. 2019)                RGB        58.69   83.93
TRN (Xu et al. 2019a)                RGB        59.51   88.53
TSN* (Wang et al. 2018)              RGB        53.61   81.98
TRN* (Xu et al. 2019a)               RGB        60.00   91.97
TSM* (Lin et al. 2019)               RGB        65.41   91.48

In total, 11 state-of-the-art models for the RGB and skeleton modalities are reported (accuracy in %). Methods with stars are pretrained on large-scale datasets. Methods with the best performance are marked in bold

4.1 MG Classification Benchmark Setup

We propose the benchmark for classifying MGs on the SMG dataset with two modalities. Given the 3712 pre-segmented MG clips with their labels, the task is to achieve accurate classification among the 16 MG classes and the non-MG class. We implement a cross-subject protocol in which 2470+632 MG clips from 30+5 subjects are used for training+validating, and 610 clips from the remaining five subjects are used for testing. The overall accuracy on the testing set is reported as the result. Eleven state-of-the-art models are provided for this task, covering the RGB and skeleton modalities.
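A minimal sketch of such a cross-subject split is given below; the subject IDs, the Clip record, and the exact partition are illustrative assumptions rather than the released data format.

```python
# Minimal sketch of a cross-subject split; subject IDs and the clip record
# structure are illustrative assumptions, not the released file format.
from dataclasses import dataclass

@dataclass
class Clip:
    subject_id: int   # 0..39
    label: int        # 0..16 (16 MG classes + non-MG)
    path: str

def split_by_subject(clips, train_ids, val_ids, test_ids):
    """Assign each pre-segmented clip to a partition by its subject."""
    train = [c for c in clips if c.subject_id in train_ids]
    val = [c for c in clips if c.subject_id in val_ids]
    test = [c for c in clips if c.subject_id in test_ids]
    return train, val, test

# Example: 30 subjects for training, 5 for validation, 5 held out for testing.
train_ids, val_ids, test_ids = set(range(30)), set(range(30, 35)), set(range(35, 40))
```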
RGB-based MG Classification. For RGB modality-based gesture classification, we adopt six state-of-the-art models which are well known in the action recognition research field. These models can be sorted into two groups. The first group is 2D CNN-based models that capture the temporal information from features learned via 2D CNNs, including Temporal Segment Networks (TSN) (Wang et al. 2018), the Temporal Shift Module (TSM) (Lin et al. 2019), and Temporal Relation Networks (TRN) (Xu et al. 2019a).


The second group is the 3D CNN family that directly learns the temporal information from features learned through 3D CNNs, including 3D CNNs (C3D) (Tran et al. 2015), 3D ResNets (R3D) (Hara et al. 2018), and the Inflated 3D ConvNet (I3D) (Carreira and Zisserman 2017).

Skeleton-based MG Classification. For MG classification with the skeleton modality, Graph Convolutional Networks (GCNs) are the mainstream architectures for dealing with skeleton joint data. Here, we implement five recent graph convolutional methods that all achieved state-of-the-art performance on large-scale action datasets like NTU (Shahroudy et al. 2016) and Kinetics (Carreira and Zisserman 2017). The models include Spatial Temporal GCN (ST-GCN) (Yan et al. 2018), Two-Stream Adaptive GCN (2S-AGCN) (Shi et al. 2019), Shift-GCN (Cheng et al. 2020), GCNs with Neural Architecture Search (GCN-NAS) (Peng et al. 2020), and Multi-scale Unified Spatial-temporal GCN (MS-G3D) (Liu et al. 2020).

4.2 Evaluation Results

The experimental results are given in Table 3. As shown in Table 3, MS-G3D (Liu et al. 2020) achieves the best performance (top-1 64.75%, top-5 91.48%), higher than the RGB modality-based models (whose best model, TRN (Xu et al. 2019a), reaches top-1 59.51%, top-5 88.53%), and generally the skeleton-modality-based methods outperform the RGB-modality-based methods. Possible reasons include: (1) compared to the RGB modality, the skeleton data collected from Kinect contain more detailed and accurate depth information, which is critical for distinguishing subtle differences between MGs such as "touching or covering suprasternal notch" and "illustrative hand gestures"; (2) GCN-based models, with a compact network structure and efficient skeleton-based representations, can prevent overfitting and thus do not rely as heavily on a large number of training samples as 3D CNN-based models. This overfitting problem can also be spotted on R3D (Hara et al. 2018) (top-1 29.84%, top-5 67.87%) and I3D (Carreira and Zisserman 2017) (top-1 35.08%, top-5 85.90%), which might need pre-training on large-scale datasets. Thus, we further conducted extra experiments exploring the impact of the pretraining strategy on the performance of those models by selecting several representative models, including TSN, TRN, and TSM. The results are presented as the methods marked with stars in Table 3. From the results we can observe that, after initializing the models with weights trained on an action recognition dataset, the performances indeed increase to some extent. Lastly, we can see that even though the top-5 accuracy of MG classification can reach 90%, the top-1 accuracy of all methods is still below 66%. As shown, our SMG dataset is challenging, especially for inter-class and long-tail issue handling.

Fig. 6 The HMM model for online recognizing MGs. Our method can adaptively conduct the HMM decoding with a parameter-free attention mechanism

5 Online Micro-gesture Recognition

In this section, we take one step further by providing the benchmark of online MG recognition, i.e., processing raw, unsegmented sequences containing multiple body gestures, including MGs and non-MGs, on the SMG dataset. First, we discuss the specific challenges of online MG recognition. Then, we propose a novel HMM-DNN network for the task with a parameter-free attention mechanism. At last, the evaluation metrics, together with the evaluation results of various methods on three online gesture recognition datasets, are presented.

5.1 Challenges of Online MG Recognition

The online recognition of MGs has two parallel sub-tasks: detecting the potential body gestures from upcoming frames and classifying the ongoing body gestures into the corresponding MG categories. However, some challenges make the online recognition of MGs different from that of other ordinary gestures. First, although some existing methods (Liu et al. 2018; Wu et al. 2016; Xu et al. 2019b) can achieve the detection and classification of actions/gestures, they all need various redundant post-processing procedures to optimize the predictions, which is not practical for the online detection task. Meanwhile, it has been proven that sequential aligning models such as HMM and Connectionist Temporal Classification (CTC) can provide transition priors to reason about and enhance predictions from neural networks (Kuehne et al. 2019; Richard et al. 2018), which enables the online recognition of gestures/actions to be more robust and accurate. However, we argue that on a dataset with spontaneous MGs, like SMG, the prior learned by sequential aligning models from the training set can be biased and lead to inferior recognition results to some extent. For instance, there could be a lot of "rubbing hands" after "touching nose" in the training subjects, while the testing subjects could perform no "rubbing hands" at all.


Second, a "non-movement" interval, the so-called ergodic state, was introduced in most of the previous works (Neverova et al. 2016; Wu et al. 2016) to achieve accurate allocation and segmentation of gestures. Meanwhile, MGs usually occur continuously without any "non-movement" intervals and can sometimes be performed incompletely. Therefore, a more flexible and efficient transition scheme is needed. Lastly and most importantly, MGs are rare and subtle. How to keep the HMM decoding from getting stuck in local optima brought by the dominating amount of ergodic states (irrelevant/noisy body movements) and non-MGs is exceptionally challenging.

5.2 A Parameter-Free Ergodic-Attention HMM Network for Online MG Recognition

Mathematical Framework. We chose the sequences of the 3D skeletal stream as inputs because their lower dimensionality is suitable for online processing tasks, and because of the reliable performance this modality showed in the MG classification task. Similarly to the work of Chen et al. (2020), we model the local temporal dynamics with an attention-based bidirectional Long Short-Term Memory (BiLSTM) network (giving an initial prediction of the current frame) and use an HMM to enhance the inference reasoning (finalizing the prediction of the current frame with priors from the past frames), as shown in Fig. 6. The full probability of the model is specified as follows:

$$p(x_1, x_2, \ldots, x_T, h_1, h_2, \ldots, h_T) = p(h_1)\,p(x_1|h_1)\prod_{t=2}^{T} p(x_t|h_t)\,p(h_t|h_{t-1}), \tag{1}$$

where T is the total length of the sequence, and p(h) and p(x) stand for the probabilities of the hidden and observed states, respectively. p(h_t|h_{t-1}) is the transition matrix used to reason about the alignment over the long sequence. The emission probability p(x_t|h_t) can be expanded as:

$$p(x_t|h_t) = w(h_t|x_t)\,p(x_t)/p(h_t), \tag{2}$$

where p(h_t) is the prior probability of hidden states that corrects the prediction when the classes are imbalanced (we argue this raw prior is biased and insufficient, see the next section), and p(x_t) is a constant value that does not depend on the class. At last, w(h_t|x_t) is the posterior probability, which is estimated by a trained BiLSTM network:

$$w(h_t|x_t) = [p_1, p_2, \ldots, p_{M+1}]^{\top} = \begin{bmatrix} W_{1:M} \\ W_{M+1} \end{bmatrix}, \quad M = N \times C, \tag{3}$$

where C is the total number of gesture classes (17 in practice, including MGs and non-MGs), N is the number of HMM states used to represent one gesture (set to 5 for the best performance), and M is the resulting total number of HMM states (85 in practice). We take an additional HMM state M + 1 (86 in practice) as the "non-movement" state. Then, W_{1:M} and W_{M+1} stand for the probability distributions over the HMM states of all the gestures (MGs and non-MGs) and over the "non-movement" state, respectively.
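As a small illustration of this state layout (a sketch under the sizes stated above, not the authors' implementation), a gesture class and its sub-state can be mapped to a flat HMM state index as follows:

```python
# Sketch of the HMM state layout described above: C gesture classes, N
# sub-states per gesture, plus one extra "non-movement" state at the end.
C, N = 17, 5                 # 16 MG classes + non-MG; 5 HMM states per gesture
M = N * C                    # 85 gesture-related HMM states
NON_MOVEMENT = M             # index 85 (the 86th state) is "non-movement"

def hmm_state(class_id: int, sub_state: int) -> int:
    """Flat HMM state index for sub_state (0..N-1) of gesture class_id (0..C-1)."""
    return class_id * N + sub_state

def class_of(state: int) -> int:
    """Recover the gesture class from a flat HMM state (non-movement -> -1)."""
    return -1 if state == NON_MOVEMENT else state // N

assert hmm_state(16, 4) == M - 1   # last gesture-related state
```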
5.2 A Parameter-Free Ergodic-Attention HMM A Novel Parameter-free Attention Mechanism for Ergodic
Network for Online MG Recognition HMM Decoding. Based on the above HMM full proba-
bility, we find that although the prior p(h t ) is to correct
Mathematical Framework. We chose the sequences of the the data imbalance, this prior is not strong enough or even
3D skeletal stream as inputs because its lower dimensionality harmful. The MGs are still submerged in the dominating
is suitable for online processing tasks and the reliable per- noisy/irrelevant body movements. Thus, we propose a novel
formance shown in the MG classification task. Similarly to method to address this issue, called Attention-Based Ergodic
the work of Chen et al. (2020), we model the local temporal Decoding (AED), that has a parameter-free self-attention
dynamics with an attention-based Long Short-Term Memory mechanism to modeling the HMM alignment. It has two folds
(BiLSTM) network (giving an initial prediction of the current of improvements based on the conventional HMM frame-
frame) and use an HMM model to enhance inference reason- work (Wu et al. 2016): an attention mechanism on W1:M to lift
ing (finalizing the prediction of the current frame with priors the probability of meaningful gestures, and an inhibition on
in the past frames), shown in Fig. 6. The full probability of W M+1 for the probability of noisy body movements. Specifi-
the is specified as follows: cally, we exploit the AED by replacing w(h t |xt )/ p(h t ) with
a new form of posterior probability w  (h t |xt ) that have a
more effective prior ability:
p(x1 , x2 , . . . , x T , h 1 , h 2 , . . . , h T ) 
W1:M

T
(1) w  (h t |xt ) = 
W M+1
= p(h 1 ) p(x1 |h 1 ) p(xt |h t ) p(h t |h t−1 ), (4)
t=2 μ · so f tmax(W1:M  W1:M )  W1:M + W1:M
= λ .
W M+1

where T is the total length of the sequence, p(h) and p(x) 


For the top part W1:M in the formula, we obtain it by cal-
stand for the probabilities of hidden state and observed states, culating the self-attention map from the Hadamard product
respectively. p(h t |h t−1 ) is the transition matrix to reason for  of W1:M itself, weighing the softmax result of this atten-
the alignment on the long sequence. The emission probability tion map with a scale parameter μ, and then performing an
p(xt |h t ) can be expended as: element-wise sum operation with the original distribution
W1:M to obtain the updated distribution W1:M  . For the bot-

tom part W M+1 , we suppress it by adding W M+1 to the λth
p(xt |h t ) = w(h t |xt ) p(xt )/ p(h t ), (2) power. We do not use the dot product in the original attention
version (Vaswani et al. 2017) because the Hadamard product
has both the calculating efficiency and better performances
where p(h t ) is the prior probability of hidden states that under a non-parameter setting, while the dot product will lead
corrects the prediction when the classes are imbalanced (we to inferior results according to our experiments. In this way,
argue this raw prior is biased and insufficient, see next sec- we exploit the attention mechanism to the posterior probabil-
tion). p(xt ) is a constant value that does not depend on the ity and the problem of subject-dependent MG patterns were
class. At last, w(h t |xt ) is the posterior probability which is made possible.
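The reweighting of Eq. 4 can be sketched in a few lines of NumPy, as below; the values of mu and lam are placeholders, since the actual settings are only given in the paper's Appendix F.

```python
# Sketch of the AED reweighting of Eq. (4): boost gesture-state probabilities
# with a parameter-free self-attention term and suppress the "non-movement"
# state. mu and lam are placeholder values; the paper tunes them in Appendix F.
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def aed_reweight(w, mu=1.0, lam=2.0):
    """w: posterior over M gesture states + 1 non-movement state (length M+1)."""
    gestures, non_movement = w[:-1], w[-1]
    attention = softmax(gestures * gestures)          # Hadamard self-attention map
    boosted = mu * attention * gestures + gestures    # lifted gesture probabilities
    suppressed = non_movement ** lam                  # inhibited non-movement state
    return np.concatenate([boosted, [suppressed]])
```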


Inference. After the BiLSTM is trained to give an estimate for each upcoming frame as a softmax probability w(h_t|x_t) over the HMM states, we can conduct the inference together with the learnt transition probability p(h_t|h_{t-1}). During the testing phase, we want to solve the decoding of the hidden state sequence ĝ to obtain the most likely explanation (namely, the gesture alignment), which is determined as:

$$\hat{g} = \arg\max_{h}\, p(x_1, x_2, \ldots, x_T, h_1, h_2, \ldots, h_T) = \arg\max_{h}\, \pi_0 \prod_{t=2}^{T} w'(x_t|h_t)\, p(h_t|h_{t-1}), \tag{5}$$

where π_0 stands for the constant initial value. By using Eq. 5, we break down the problem of solving for the most probable explanation of a long non-stationary sequence into continuously solving the HMM state probabilities over the hidden states h_{1:T}. While the HMM states are aligned in real time, the testing sequence can be inferred for both segmentation (non-movement) and recognition (MGs and non-MGs). Finally, we improve the method proposed by Wu et al. (2016) by treating not only the "non-movement" state but also the middle HMM states of every gesture as ergodic states. In this way, the segmentation of several continuous, incomplete gestures becomes possible. The complete network structure and technical implementation details of our AED method, such as the values of μ and λ, are given in "Appendix F". Note that w'(h_t|x_t) is calculated based on w(h_t|x_t), which is given by the BiLSTM output without any fine-tuning. Thus, our proposed attention scheme can be used directly in the testing phase without extra training and extra parameters, which makes it parameter-free, and it can be plugged into other existing models for online gesture recognition.
free and can be plugged into other existing models for online
0.3 as default (see ablation studies of different αth values in
gesture recognition.
“Appendix F”).
Performances on the SMG Dataset. As a comparison
5.3 Evaluation on SMG Dataset
to the MG online recognition performance of our HMM
BiLSTM-AED, we also implemented four related methods
Evaluation Metrics. Following the protocols used in online
as baselines: FCN-sliding window (Chen et al. 2019), DBN-
action detection from the work of Li et al. (2016), we
HMM (Wu et al. 2016) and STABNet-MES (Chen et al.
jointly evaluate the detection and classification performances
2020). The results of online recognition of both our method
of algorithms by using the F1 score measurement defined
and the baselines compared are shown in Table 4. Our method
below:
is considerably effective in recognizing continuous gestures
2Pr ecision ∗ Recall in unconstrained long sequences (accuracy of 0.173, recall
F1scor e = , (6) of 0.245, and F1 score of 0.203). Technical implementa-
Pr ecision + Recall
tion details of all the compared methods are available in
given a long video sequence that needs to be evaluated, “Appendix E”.
Pr ecision is the fraction of correctly classified MGs among
all gestures retrieved in the sequence by algorithms, while 5.4 AED-BiLSTM on Other Datasets
Recall (or sensitivity) is the fraction of MGs that have been
correctly retrieved over the total amount of annotated MGs. We also evaluate our proposed AED-BiLSTM framework on
Also, we define a criterion to determine a correct detection two other existing online detection datasets, iMiGUE (Liu
with the overlapping ratio αth between the predicted gesture et al. 2021) and OAD (Li et al. 2016) to verify its generaliz-
intervals and ground truth intervals. The overlapping ratio ability.


Table 4 MG online recognition performances on the test sets of the SMG and iMiGUE datasets

Online recognition method                SMG dataset                      iMiGUE dataset
                                         Accuracy  Recall  F1-score      Accuracy  Recall  F1-score
FCN-sliding window (Chen et al. 2019)    0.082     0.110   0.094         0.059     0.067   0.063
DBN-HMM (Wu et al. 2016)                 0.128     0.167   0.145         -         -       -
STABNet (Chen et al. 2020)               0.206     0.164   0.182         0.137     0.082   0.103
AED-BiLSTM (Ours)                        0.173     0.245   0.203         0.166     0.177   0.171

Methods with the best performance are marked in bold

iMiGUE Dataset is a newly published dataset that also focuses on involuntary micro-gestures, occurring during the post-match interviews of tennis players. There are 359 videos of post-match press conferences. The videos' durations vary, with an average length of 350 s, and the total length is 2092 min. A total of 18,499 MG samples were labeled with a multi-label annotation, which means there can be multiple MGs labeled for one frame. It has more than 70 subjects covering 32 categories of MGs, with 25 joints estimated by OpenPose (Cao et al. 2019). We follow the same cross-subject protocol provided by Liu et al. (2021) that uses 255 long video sequences (with 13,936 MG samples) from 37 selected subjects for training and 104 sequences (with 4,563 MG samples) from the remaining 35 subjects for testing. We removed all the samples with null skeleton joints to allow robust training, for both the compared methods and ours, to ensure a fair comparison.

OAD Dataset includes 10 daily human action categories. It was captured as long skeleton sequences with Kinect v2. The annotations of start and end frames are provided within the peak duration (not a from-none-to-action pattern); similar to the work of Chen et al. (2020), we compensate 12 frames at the beginning of actions to learn pre-action information for better online recognition. "MovingPose" (Zanfir et al. 2013) is also adopted to generate features for each frame. There are more than 50 long sequences in total; 30 of them are used for training, 20 for testing, and the remaining sequences are for processing-speed validation. On the OAD dataset, we use the same protocol as Liu et al. (2018), which sets different observation ratios to validate the algorithm; thus, accuracy is reported for this dataset.

Table 5 The early online detection performances on the OAD dataset (accuracy is reported; methods with the best performance are marked in bold)

Method                           Observational ratio
                                 10%     50%     90%
ST-LSTM (Liu et al. 2016)        60.0%   75.3%   77.5%
Attention Net (Liu et al. 2017)  59.0%   75.8%   78.3%
JCR-RNN (Li et al. 2016)         62.0%   77.3%   78.8%
SSNet (Liu et al. 2018)          65.6%   79.2%   81.6%
STABNet (Chen et al. 2020)       87.2%   92.0%   93.1%
AED-BiLSTM (ours)                88.1%   93.4%   94.2%

Performance Discussion. The experimental results are presented in Tables 4 and 5. As shown, our AED-BiLSTM outperforms all other methods with significant margins (2.1% on SMG and 6.8% on iMiGUE) on the MG online recognition task. Our AED-BiLSTM brings a large improvement, especially on the iMiGUE dataset, because the skeleton joints in this dataset are extracted with OpenPose and are relatively noisy; by using our enhanced prior to suppress those noisy body movements, the results are effectively improved. From Table 5, we can see that our AED-BiLSTM framework can also efficiently improve the performance of online recognition for regular gestures (88.1%, 93.4%, and 94.2% at observational ratios of 10%, 50%, and 90%). As we can see, our method achieves superior results to STABNet on all metrics for the SMG and iMiGUE datasets, except for accuracy on SMG, where it is considerably lower than STABNet. Essentially, our AED module works as a regularization that suppresses non-MGs and puts attention on MGs. It tends to weight MGs while neglecting non-MGs, resulting in a higher recall of those MGs. The high recall naturally leads to a situation where many non-MGs are misclassified as MGs, resulting in a relatively low accuracy.

Failure Case Analysis. In Fig. 7, we visualize the HMM decoding paths to analyze the failure cases. As we can see, online recognition of in-the-wild body gestures is challenging due to the complicated transition patterns between gestures and the high requirements for accurate temporal allocation. Even so, our AED method, with its attention mechanism, has a better correcting performance than the raw prior. For instance, around frames 11,500-11,600 of the SMG case, AED helps to escape from the false positive prediction of "non-movement" intervals and gives potential MG predictions, while around frames 100-600 of the iMiGUE case, AED helps to emphasize the true positive prediction of the MGs with self-attention. At last, the visualization of the attention maps is presented in Fig. 8, which also shows that our AED can effectively suppress the biased priors brought by certain classes (yellow color means lower probability) and thus can better handle the long-tail class distribution.


Fig. 8 Visualized attention map. We present the attention map of sample #72 in the iMiGUE dataset, using the raw prior (top) and our AED (bottom). The x-axis represents time, and the y-axis represents the hidden states of all classes. The value of the matrix is the probability given by the networks; we take the log value for better visualization and for computational convenience in Viterbi decoding. The last row is the probability of the non-movement state. The white spots on the matrix stand for NaN values produced by the logarithm operation

6 Body Gesture-Based Emotional Stress State Recognition

In this section, we conduct experiments on body gesture-based recognition of the emotional stress state. The task is defined as predicting the emotional stress state (i.e., SES or NES) from the context of the body gestures in a given video sequence. We first introduce the benchmark for evaluation by implementing several state-of-the-art models. Then, we present a new graph-based network for this task with better performance compared to the others.

6.1 Evaluation Protocols and Human Evaluation

Two Evaluation Protocols. As discussed in Sect. 2 and Sect. 3, subject differences can bring considerable influence to gesture-based emotion recognition. Thus, we define two types of evaluation protocols: subject-independent (SI) and semi-subject-independent (semi-SI). In the SI evaluation, we use the same protocol as the classification and online recognition tasks, which splits the 40 subjects into 30+5 for training+validating (with 294 emotional state instances) and the remaining five for testing (with 90 instances). In the semi-SI evaluation, we select 294 emotional state instances from all 40 subjects for training+validating and the remaining 90 instances for testing. Each instance belongs to a specific emotional stress state (NES/SES). The emotional states of the instances are evenly distributed in the testing set, i.e., 45-45 for SES-NES. We report the emotional state recognition accuracy in percentage for each of these two protocols.

Fig. 9 The GUI of the human evaluation test for emotional state recognition. A screenshot of one sample is shown. For each video clip, evaluators are asked to go through the video and annotate the emotional state as a comparison to our methods

Human Evaluation. We assess the difficulty of the emotional state recognition task by enrolling human evaluators to observe the emotional instances and give their predictions. Sixteen ordinary college students with different academic majors were recruited as normal human evaluators. Another three university staff members were trained to recognize MGs, with related psychological background, as expert evaluators. These evaluators were offered both the skeleton and RGB videos to conduct the task (the skeleton modality was always presented first and then the RGB modality, to avoid any significant learning effect). The GUI of the human test is shown in Fig. 9, and the results are shown in Table 6.

The human evaluators were also interviewed after the evaluation test. Most of the testers claimed that it was tough to infer using only gestures (the skeleton modality), and that it amounted to a random guess. Meanwhile, for the RGB videos, people tended to use multiple cues such as facial expressions and even overall impressions (e.g., whether the subject looks confident) to determine the emotional stress states. We can also observe that the trained evaluators perform better than ordinary people (accuracy of 0.75 for emotional stress states), as they know how to utilize MGs as clues to infer emotional states. As discussed above, MGs are often neglected by humans in interactions. Thus, using body gestures for emotional state recognition, especially hidden emotional states, is a significantly challenging task.


Fig. 10 Spectral decomposition of the graph network for emotional state recognition (raw video sequence → graph of MG and natural-state nodes → regularized weighted graph NWN^T → graph Laplacian → classifier)

Table 6 Body gesture-based emotional state recognition results of human evaluators

Human evaluator | Modality | Emotion state recognition accuracy
Random guess | – | 0.50
Common people | Skeleton | 0.48
Common people | RGB | 0.53
Trained evaluators | Skeleton | 0.66
Trained evaluators | RGB | 0.75

6.2 Emotional State Recognition with State-of-the-Art Methods

As introduced in Sect. 3, instead of using a conventional paradigm that maps one gesture to one emotional status (Gu et al. 2013; Gunes and Piccardi 2006), we use two proxy tasks to present the emotional states. Thus, the task of emotional state recognition in the SMG dataset is to predict the corresponding emotional state (the state of the proxy task, NES/SES) for a given long video sequence. Intuitively, there are two directions to approach this problem: one is raw context-based recognition, which directly conducts the inference on the whole sequence, and the other is MG context-based recognition, which predicts the emotional states based on the MGs in the sequences. Here we provide six machine learning-based methods for emotional state recognition, covering both kinds of methods.

Raw Context Recognition. Three state-of-the-art models for the skeleton-based action recognition task, ST-GCN (Yan et al. 2018), NAS-GCN (Peng et al. 2020) and MS-G3D (Liu et al. 2020), are provided as baselines that infer the emotional state from the raw long instances. The input of the models is the full sequence of the body skeleton streams, which is used to validate whether the emotional patterns can be captured from body movements straightforwardly. The networks are end-to-end, and their hyper-parameters are the same as for the MG classification task mentioned in Sect. 4.1 aside from the output head dimensions (NES/SES). The performances of the three baseline methods are presented in the "Sequence+NN" group of Table 7. As shown in the table, the three baseline methods (46%, 46%, and 50%) cannot even exceed the random selection rate (50%). As expected, inference based on raw video sequences involves many redundant, irrelevant body movements and easily fails to capture the desired body movements (such as MGs) for emotional stress state recognition. Thus, conducting the recognition on long video sequences performs poorly (near random guessing) with existing state-of-the-art models.

MG-based Recognition. Unlike the above raw context recognition methods, we also present several MG-based methods for emotion understanding. A baseline strategy that uses a Bayesian network to encode the distribution vectors of MGs (with dimensions of 1 × N, where N is the MG number) was provided in our previous work (Chen et al. 2019). It experimentally validated the contribution that MGs can bring to the emotion understanding context. In the bottom part of Table 7 (the "MG+classifier" group), we can observe that micro-gestures are beneficial for the emotional state inference with the BayesianNet (0.59 and 0.66). Besides, we go one step further by encoding the MG relationships over a long sequence into a graph representation (with dimensions of N × N) so that the transitions of MGs are also captured as node relationships. Intuitively, this should bring more gains as the information in the feature increases, and we selected two state-of-the-art high-dimensional graph convolutional networks, L2GCN and BGCN (You et al. 2020; Zhang et al. 2019), to verify it. However, as shown in Table 7, we find that for these two high-dimensional models, the emotional state performances (0.44/0.47 and 0.54/0.53) are not as competitive as the simple BayesianNet. Thus, in the next section, we tackle this issue and propose a customized graph network to better mine the potential of the graph-based representations.

Table 7 Body gesture-based emotional state recognition results of the proposed method and compared baselines

Methods | Framework | Emotion state recognition accuracy (Subject-independent / Semi subject-independent)
Random guess | – | 0.50 / 0.50
ST-GCN (Yan et al. 2018) | Sequence+NN | 0.46 / 0.42
MS-G3D (Liu et al. 2020) | Sequence+NN | 0.46 / 0.49
NAS-GCN (Peng et al. 2020) | Sequence+NN | 0.50 / 0.52
MG+L2GCN (You et al. 2020) | MG+classifier | 0.44 / 0.47
MG+BGCN (Zhang et al. 2019) | MG+classifier | 0.54 / 0.53
MG+BayesianNet (Chen et al. 2019) | MG+classifier | 0.59 / 0.66
MG+WSGN (ours) | MG+classifier | 0.65 / 0.68

Note that "MG" includes both MG and non-MG instances as the input feature. Methods with the best performance are marked in bold

6.3 A Weighted Spectral Graph Network for Emotional State Recognition

We find that existing graph representation learning methods all rely on high-dimensional weight parameters. A limited sample amount easily leads to over-fitting in such models (Scarselli et al. 2008) (e.g., in our case, a graph with only 17 MG nodes), as shown in Table 7. Meanwhile, classical spectral graph methods like the Laplacian operator (de Lara and Pineau 2018) are suitable for insufficient samples, since they obtain node "gradients" without the need for high-dimensional weights. Thus, we utilize the strength of the classical Laplacian operator to obtain measurements of the "gradients" of each node and extend it to the directed, weighted graph case to better fit the task. The whole framework is presented in Fig. 10.

We give the mathematical definition of a graph as G = (V, E, W) to represent the relationships of MGs. With the N MGs as graph nodes V = {v_p | p = 1, ..., n} and the transitions between MGs as graph edges E = {e_q | q = 1, ..., m}, the input is the transition frequencies as the weights on the graph edges W = {w_{i,j} | i, j = 1, ..., n}, where w_{i,j} is obtained by counting the number of transitions between MG i and MG j. In this way, we map the distribution of MGs into raw graph data, with the dynamic transition patterns between MGs maintained by W. Specifically, to tackle the directed graph issue, consider the vertex space R^V with standard basis {e_1, ..., e_n}; an n × m matrix N can be defined as N = {n_i = e_j − e_k | i = 1, ..., m and j, k = 1, ..., n}. This matrix N is called the signed vertex-edge incidence matrix of the original G (with respect to the fixed orientation). The key fact is that the Laplacian L of G is the (transpose of the) Gram matrix of N, that is, L = N N^T, with which the directed graph can be handled. Now recall that W is the weight matrix of G. Then we can define the Laplacian of the weighted G as the matrix product N W N^T, where N is the signed vertex-edge incidence matrix of the underlying unweighted graph of G. In this way, the Laplacian operator can be exploited to extract "gradient" features from the MG graph representation. The resulting feature vectors from the Laplacian operator are fed into the classifiers to predict the final emotional state ĉ. Eventually, the whole formulation of our proposed weighted spectral graph network (WSGN) is given as follows:

ĉ = f_classifier(L(N W N^T)),   (8)

where for f_classifier we experimented with different standard classifiers combined with our spectral embedding, that is, a Multi-layer Perceptron with ReLU non-linearity (MLP) (Rumelhart et al. 1986), k-nearest neighbors (kNN) (Fix and Hodges 1989), Random Forest (RF) (Ho 1995), and Adaptive Boosting (AdaBoost) (Schapire 2013).

6.4 Discussion and Limitations

The experimental results for emotional state recognition are shown in Table 7. In practice, MLP outperforms the other classifiers and is the variant reported in Table 7 as the result of our proposed WSGN. The detailed experimental settings can be found in "Appendix G". Besides, extra experimental results (see "Appendix H") show that taking the natural states (the non-movement snippets) into account as an extra MG in the transition representation brings an improvement to the results, as does the Laplacian operation. In the last line of Table 7, we can observe that our proposed WSGN model outperforms all the compared methods, which further verifies that the MG-based analysis is beneficial to the final emotion understanding. By comparing the performances of the MG+classifier frameworks and the Sequence+NN frameworks, we can observe that the MG-based feature vectors are more effective at representing the emotional states. This shows that MG-based analysis, with its effective representation capability of the emotional state, can be a better option for emotional understanding. We believe that this can bring inspiration and new paradigms to the community for bodily emotion understanding.

The limitation of this experiment could be that the stakes of the subjects' emotional states were relatively low, which might decrease the distinction between baselines and deviations. Additionally, the sample size was relatively limited. Therefore, more research should explore how the proposed approach performs when more extensive samples are used.
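To make the construction in Eq. (8) concrete, the following minimal sketch builds the directed transition counts between MG labels, forms the signed vertex-edge incidence matrix, computes the weighted Laplacian, and feeds its flattened entries to an MLP. It assumes that the weight term in N W N^T acts as the diagonal matrix of per-edge transition weights (which makes the product well defined); the function names, toy labels, and classifier settings are illustrative and do not reproduce the authors' released implementation.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def transition_weights(mg_labels, num_states):
    """Directed transition counts w_ij between consecutive MG labels in one long video."""
    W = np.zeros((num_states, num_states))
    for src, dst in zip(mg_labels[:-1], mg_labels[1:]):
        if src != dst:
            W[src, dst] += 1.0
    return W

def weighted_laplacian(W):
    """L = N diag(w_e) N^T, with N the signed vertex-edge incidence matrix of the MG graph."""
    n = W.shape[0]
    edges = [(i, j) for i in range(n) for j in range(n) if W[i, j] > 0]
    N = np.zeros((n, max(len(edges), 1)))
    w_e = np.zeros(max(len(edges), 1))
    for k, (i, j) in enumerate(edges):
        N[i, k], N[j, k] = -1.0, 1.0     # tail -1, head +1
        w_e[k] = W[i, j]
    return N @ np.diag(w_e) @ N.T        # n x n "gradient" features of the weighted graph

def wsgn_features(sequences, num_states=17):  # 16 MG classes + the natural (non-movement) state
    return np.stack([weighted_laplacian(transition_weights(s, num_states)).ravel()
                     for s in sequences])

# toy usage with two labelled long videos (state 16 = non-movement)
X = wsgn_features([[16, 0, 16, 6, 16, 8, 16], [16, 3, 16, 3, 16, 12, 16]])
y = np.array([1, 0])                     # 1 = SES, 0 = NES
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0).fit(X, y)
print(clf.predict(X))
```

In this sketch the non-movement snippets are kept as a 17th node, following the observation in "Appendix H" that including the natural state in the transition representation improves the results.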

7 Conclusions and Future Work

We proposed a novel, psychology-based and reliable paradigm for body gesture-based emotion understanding with computer vision methods. To our knowledge, our effort is the first to interpret hidden emotional states via MGs, with both quantitative investigations of human body behaviors and machine vision technologies. A related spontaneous micro-gesture dataset towards hidden emotion understanding is collected. A comprehensive statistical analysis is performed, with significant findings for MGs and emotional body gestures. Benchmarks for MG classification, MG online recognition, and body gesture-based emotional stress state recognition are provided with state-of-the-art models. Our proposed AED-BiLSTM framework can efficiently provide a more robust correction to the prior with a parameter-free mechanism. Experiments show that AED-BiLSTM can efficiently improve online recognition performance in a setting closer to real-world practice. Moreover, a graph-based network is proposed for the MG pattern representations to better analyze the emotional states.

This work involves and bridges the interdisciplinary efforts of psychology, affective computing, computer vision, machine learning, etc. We wish to break the fixed research paradigm of emotional body gestures, which is limited to classical expressive emotions, and argue for more diverse research angles for emotional understanding. Thus, we propose our spontaneous micro-gestures for hidden emotion understanding. We believe that the SMG dataset and proposed methods could inspire new algorithms for the MG recognition tasks from the machine learning aspect, such as combining more non-verbal cues, e.g., facial expressions, with MGs using the RGB modality in the SMG dataset to improve emotion recognition performance. The work can also facilitate new advances in the emotion AI field and inspire new paradigms for analyzing human emotions with computer vision methods. The community can benefit from MGs, which have significant application potential in many fields, e.g., using machines to automatically detect MGs to enhance people's communicative skills, or assisting experts in conducting Alzheimer's disease and autism diagnoses.

Acknowledgements This work was supported by the Academy of Finland for Academy Professor project EmotionAI (Grants 336116, 345122), project MiGA (grant 316765), the University of Oulu & The Academy of Finland Profi 7 (grant 352788), Postdoc project 6+E (Grant 323287) and ICT 2023 project (grant 328115), and by Ministry of Education and Culture of Finland for AI forum project. As well, the authors wish to acknowledge CSC - IT Center for Science, Finland, for computational resources.

Funding Open Access funding provided by University of Oulu including Oulu University Hospital.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Appendix A SMG Evaluation Protocols

In the proposed SMG dataset, the criteria of the three benchmarks (MG classification, MG online recognition, and emotional state recognition) are provided. Specifically, for the MG classification and online recognition tasks, we utilized the subject-independent evaluation protocol, while for the emotional state recognition task, both the subject-independent and semi-subject-independent evaluation protocols are used.

Subject-independent protocol. In this protocol, we divide the 40 subjects into a training group of 30 subjects, a validating group of 5 subjects, and a testing group of 5 subjects. The subject IDs of training and testing are:
Training set: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30};
Validating set: {31, 32, 33, 34, 35};
Testing set: {36, 37, 38, 39, 40}.
Under this protocol, the MG classification task has 2417 MG clip samples for training, 632 for validating and 593 for testing (each around 50 frames); the MG online recognition task has 30 long MG sequences for training, five for validating and five for testing (each around 25,000 frames); and the emotional state recognition task has 294 videos (i.e., emotional state instances) for training, 60 for validating and 60 for testing (each around 8000 frames), respectively.

Semi-subject-independent protocol. In this protocol, we selected 294 + 60 videos (147 + 30 SES and 147 + 30 NES instances) from all the 40 subjects as the training + validating sets, and the remaining 60 videos (30 SES and 30 NES instances) as the testing set. The participants' emotional states (SES/NES) are recognized via analysis of micro-gestures.
The video IDs of training and testing are:
Training set: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 96, 97, 98, 99, 100, 101, 108, 109, 110, 111, 112, 113, 120, 121, 122, 123, 124, 125, 132, 133, 134, 135, 136, 137, 156, 157, 158, 159, 160, 161, 168, 169, 170, 171, 172, 173, 180, 181, 182, 183, 184, 185, 192, 193, 194,
195, 196, 197, 204, 205, 206, 207, 208, 209, 216, 217, 218, 219, 220, 221, 228, 229, 230, 231, 232, 233, 237, 238, 239, 243, 244, 245, 249, 250, 251, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376, 377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 162, 163, 164, 165, 166, 167, 174, 175, 176, 177, 178, 179, 186, 187, 188, 189, 190, 191, 198, 199, 200, 201, 202, 203, 210, 211, 212, 213, 214, 215, 222, 223, 224, 225, 226, 227, 234, 235, 236, 240, 241, 242, 246, 247, 248, 252, 253, 254};
Validating set: {144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 162, 163, 164, 165, 166, 167, 174, 175, 176, 177, 178, 179, 186, 187, 188, 189, 190, 191, 198, 199, 200, 201, 202, 203, 210, 211, 212, 213, 214, 215, 222, 223, 224, 225, 226, 227, 234, 235, 236, 240, 241, 242, 246, 247, 248, 252, 253, 254};
Testing set: {84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 102, 103, 104, 105, 106, 107, 114, 115, 116, 117, 118, 119, 126, 127, 128, 129, 130, 131, 138, 139, 140, 141, 142, 143, 150, 151, 152, 153, 154, 155, 162, 163, 164, 165, 166, 167, 174, 175, 176, 177, 178, 179, 186, 187, 188, 189, 190, 191}.

Appendix B Materials for Stress Emotional States

We set up two proxy tasks for stimulating the emotional stress states and eliciting micro-gestures based on findings in related research: (1) only a comparable truth condition ("baseline stimuli" and "deviation stimuli"), rather than casual small talk, can induce an emotional difference between truth tellers and story makers (Palena et al. 2018), and (2) complications of the truth (more details of the story) can help to differentiate truth tellers from story makers (Vrij et al. 2018, 2020). Thus, we selected five short reports and newscasts with full details as the materials. The five stories are "world's largest swimming pool" (119 words), "world's longest hair" (170 words), "world's biggest dog" (155 words), "world's hottest chilli" (133 words) and "world's largest pizza" (129 words). Excerpts from the story "world's longest hair" are as follows: "…She washes the hair once a week, using up to six bottles of shampoo at a time. Then it takes two days for the hair to dry – and they weigh 25 pounds when wet. She says that the extra weight of her hair makes her doctors very concerned. They seem to think that she has a curvature of her spine due to the length and weight of her hair." Before the experiment, participants were told that if they got caught they would receive a punishment, i.e., filling in a long questionnaire containing more than 500 questions, so they had to try their best when making up a story (deviation stimuli) or pretending to be telling/reading a given one (baseline stimuli). The long questionnaire works as the 'punishment' or 'pressure', aiming to stimulate and elicit the emotional states and micro-gestures; no actual punishment was conducted after the data collection.

Appendix C Relationship Between MGs and Subjects

We visualize Pearson's correlation coefficients of the MG performing patterns from the 40 subjects in our SMG dataset, as shown in Fig. 11.

Fig. 11 The correlation distribution of MG patterns between subject pairs in our SMG dataset. The correlation factor is calculated by Pearson's correlation coefficient based on the MG distribution of 40 subject pairs. We can see that the MG performing patterns can vary a lot over some subject pairs

Appendix D Experimental Settings for MG Classification on SMG

In the practical implementation of the RGB modality based baselines, we trained all the models on the SMG dataset with the same protocol: 120 epochs are trained for the TSN, TRN, and TSM models; the batch size is set to 64 for TSN and TRN, and to 32 for TSM; the base learning rate is set to 0.001 for all three models, and the learning rate is scaled by a factor
Table 8 The ablation study of AED-BiLSTM under different values of λ on the SMG dataset. μ is fixed as 0.0

λ | Raw prior | 0.7 | 1.0 | 2.0 | 2.1 | 2.2 | 3.0 | 4.0
F1-score | 0.1825 | 0.1461 | 0.1958 | 0.1986 | 0.2020 | 0.2006 | 0.1823 | 0.1749

We show the most representative values that cause the F1-score to change considerably. Methods with the best performance are marked in bold

Table 9 The ablation study of AED-BiLSTM under different values of μ on the SMG dataset. λ is set as 2.1 to obtain the best performance

μ | Raw prior | −5.0 | −4.0 | −2.5 | −0.5 | 0.0 | 1.0
F1-score | 0.1825 | 0.2006 | 0.2026 | 0.2030 | 0.2023 | 0.2020 | 0.2017

We show the most representative values that cause the F1-score to change considerably. Methods with the best performance are marked in bold

Table 10 The online recognition performances of AED-BiLSTM under different threshold values of αth on the SMG dataset

Threshold of αth | 10% | 30% | 50% | 70% | 90%
F1-score | 0.312 | 0.203 | 0.087 | 0.030 | 0.004

of 0.1 at epochs 50 and 100, respectively. For C3D, R3D, and I3D, 60 epochs are trained for each model; the batch size is set to 128 for C3D and R3D and to 48 for I3D; and the learning rate is set to 0.0002 for all models. The optimizer is SGD, consistent across all the models. The loss function is categorical cross entropy. The training platform was PyTorch (Paszke et al. 2019) with a single GPU: NVidia Titan (24 GB).

In the practical implementation of the skeleton modality based baselines, we trained all the models on the SMG dataset with the same protocol: 30 training epochs (all fully converged), batch sizes of 32, a base learning rate of 0.05, and weight decay of 0.0005. Preprocessing was conducted for all the baselines: null-frame padding, translating to the center joint, and paralleling the joints to the corresponding axes. The input length is set to 60 frames. For all the remaining network hyperparameters, we kept their original settings (e.g., for MS-G3D, the numbers of GCN scales and G3D scales are kept as 13 and 6). The optimizer is SGD, consistent across all the models. The loss function is categorical cross entropy. The training platform was a single GPU: NVidia Titan (24 GB).

For pretraining the models for the RGB modality, we used ResNet50 (He et al. 2016) as the backbone pretrained on the Something-Something v2 (Goyal et al. 2017) dataset, as it has been commonly used for all three methods and the trained weights are available. The hyperparameters are set the same as in the original work.

Appendix E Experimental Details for Online Recognition

We first conduct the same pre-processing of the skeleton streams on the three validating datasets. Since pre-segmented clips, and their global temporal information about ongoing gestures, are not available in online recognition tasks, an efficient local temporal feature extraction is demanded. For skeleton joint feature extraction, we followed the work of Zanfir et al. (2013): "MovingPose" features utilize the 3D position differences of joints to generate spatio-temporal information with efficient dimensional requirements.

For the training phase, our AED-BiLSTM network was trained with a batch size of 32 and a learning rate of 0.01 (LR reduction factor of 0.5, patience of 3 epochs) for 20 epochs on the SMG dataset, and with a batch size of 64 and a learning rate of 0.01 (LR reduction factor of 0.5, patience of 3 epochs) for 80 epochs on the iMiGUE dataset and 40 epochs on the OAD dataset. The optimizer is RMSprop, following the setting of Chen et al. (2020). The structure of AED-BiLSTM is consistent with STABNet (Chen et al. 2020). Specifically, RMSprop is used as the optimizer. The structure of STABNet is given as: a two-layer BiLSTM with 2000 GRUs and 1000 GRUs, respectively. A spatial attention layer and a temporal attention layer are attached before and between the two BiLSTM layers, respectively. A dense layer of 1000 units with sigmoid activation is stacked on the BiLSTM layers, followed by an output layer whose number of units equals the total hidden state number (MG class number × HMM state number used for representing each MG, 16 × 5 in practice). Since the skeleton joints of the iMiGUE dataset are extracted with OpenPose (Cao et al. 2019), which contains noise, we filtered out all the training samples with null skeleton joints. For the compared methods, we use the same training scheme and ensure the models converge. The training time of a single BiLSTM is around three hours with over 18,000/8,000 (training/validating) frame-level samples in our SMG dataset and around six hours with over 47,000/5,000 (training/validating) frame-level samples in the iMiGUE dataset. The loss function is categorical cross entropy. The training platform was TensorFlow with a single GPU: NVidia Titan (24 GB).
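To make the configuration above easier to follow, here is a minimal TensorFlow/Keras sketch of a STABNet-style stack trained with RMSprop and a reduce-on-plateau schedule (factor 0.5, patience 3). The spatial and temporal attention layers are omitted and the per-frame skeleton feature dimension (75) is an assumption, so this is an illustration of the described settings rather than the authors' implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_HIDDEN_STATES = 16 * 5   # MG class number x HMM states per MG, as stated above

# Simplified STABNet-style backbone (attention layers omitted in this sketch).
inputs = tf.keras.Input(shape=(None, 75))                       # (time, per-frame skeleton feature)
x = layers.Bidirectional(layers.LSTM(2000, return_sequences=True))(inputs)
x = layers.Bidirectional(layers.LSTM(1000, return_sequences=True))(x)
x = layers.Dense(1000, activation="sigmoid")(x)                 # frame-wise dense layer
outputs = layers.Dense(NUM_HIDDEN_STATES, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)

model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.01),
              loss="categorical_crossentropy")

reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=3)
# model.fit(train_frames, train_state_labels, batch_size=32, epochs=20,
#           validation_data=(val_frames, val_state_labels), callbacks=[reduce_lr])
```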


Table 11 Ablation study of WSGN

Methods | Emotion state recognition accuracy (k-NN / Random Forest / AdaBoost / MLP)
Basic graph | 0.50 / 0.45 / 0.48 / 0.45
+ LP | 0.50 / 0.50 / 0.48 / 0.45
+ SFFS | 0.50 / 0.50 / 0.50 / 0.50
+ TS | 0.56 / 0.51 / 0.38 / 0.51
+ LP & SFFS | 0.50 / 0.50 / 0.50 / 0.50
+ LP & TS | 0.57 / 0.42 / 0.48 / 0.50
+ SFFS & TS | 0.47 / 0.62 / 0.62 / 0.60
+ SFFS & TS & LP (full model) | 0.53 / 0.65 / 0.65 / 0.65

LP: Laplacian operation, TS: transition state embedding, and SFFS: sequential forward floating selection. Methods with the best performance are marked in bold

For testing and post-processing, the threshold on the minimal number of frames used to filter out noisy gestures is set to 14 frames for all the methods. The values of μ and λ are set as −2.5/2.1, −1.0/3.0 and −0.2/2.0 for the SMG, iMiGUE and OAD datasets, respectively.

Appendix F Ablation Study of Online Recognition

Although our AED is parameter-free and can be directly exploited for inference, the value settings of λ and μ affect the performance of the AED. Thus, we present the ablation study of AED-BiLSTM under different values of λ and μ, as shown in Tables 8 and 9.

Note that λ affects the inhibition of the "non-movement" states, which determines the segmentation results. Meanwhile, μ, the attention on the MGs, affects the classification results. Thus, we first fix μ as 0.0 in the ablation study to obtain the best value of λ, and then obtain the best value of μ with the obtained λ.

The online recognition performance of AED-BiLSTM under different threshold values of the overlapping ratio αth is shown in Table 10. The higher the αth, the more challenging the task, as it requires more accurate temporal allocation of the frame boundaries. When it comes to 90%, the temporal allocation of the MGs has to be extremely accurate, which is especially challenging due to the subtle and swift nature of MGs.

Appendix G Experimental Settings for Stress Emotional State Recognition

The RGB modality was not used, as it might bring unnecessary texture patterns, such as facial information, into the neural networks and make the analysis contested. We focus on the skeleton modality in order to specifically explore the relationship between gestures and stress states. All the settings used Cross Entropy as the loss function.

Full Context Recognition. In the practical implementation, we trained the baseline methods with the same protocols as the classification task (e.g., training epoch number, batch size, etc.). Besides, the input is the long skeleton sequence of an emotional state instance, linearly down-sampled to 90 frames. The dimensions of the output layers of the networks are modified to two, corresponding to the two emotional states.

MG-based Context Recognition. We construct the MG-based graph representation for hidden emotion recognition. The transition of the middle state (non-movements) is enabled, and the transition direction is enabled. The Bayesian prior is added. The Sequential Forward Floating Selection (SFFS) strategy was used for selecting the MGs with the largest contributions. From SFFS, "Turtling neck and shoulder", "Rubbing eyes and forehead", "Folding arms behind body" and "Arms akimbo" are the most contributing features for emotional state recognition in the subject-independent protocol; meanwhile, "Rubbing eyes and forehead", "Moving legs", "Arms akimbo" and "Scratching or touching facial parts other than eyes" are the most contributing features in the semi-subject-independent protocol.

Appendix H Extra Experimental Results of WSGN

Ablation study. We present the contribution of each component of the WSGN with an ablation study in the subject-independent protocol on the SMG dataset, as shown in Table 11. As we can see, the three components (the Laplacian operation, SFFS and the transition state embedding) jointly contribute to the performance.
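As a sketch of the MG-based context features described above (illustrative only; the variable names and the 16+1 state layout are assumptions consistent with the text), the snippet below computes the 1 × N MG distribution vector used by the BayesianNet-style baseline and the N × N directed transition matrix that keeps the non-movement state as an extra node; SFFS would then select the most contributing MG dimensions from such features.

```python
import numpy as np

NUM_MG_CLASSES = 16
NON_MOVEMENT = NUM_MG_CLASSES          # extra node for the natural (non-movement) state
NUM_STATES = NUM_MG_CLASSES + 1

def mg_distribution(mg_labels):
    """1 x N_states normalized histogram of MG occurrences in one long video."""
    counts = np.bincount(mg_labels, minlength=NUM_STATES).astype(float)
    return counts / max(counts.sum(), 1.0)

def mg_transitions(mg_labels):
    """N_states x N_states directed transition counts, keeping the transition direction
    and the non-movement state, as described above."""
    T = np.zeros((NUM_STATES, NUM_STATES))
    for src, dst in zip(mg_labels[:-1], mg_labels[1:]):
        T[src, dst] += 1.0
    return T

# toy usage: one annotated long video alternating between the natural state and a few MGs
labels = np.array([NON_MOVEMENT, 2, NON_MOVEMENT, 7, 7, NON_MOVEMENT, 2, NON_MOVEMENT])
dist = mg_distribution(labels)         # input for the BayesianNet-style baseline
trans = mg_transitions(labels)         # input for the graph-based models (WSGN, L2GCN, BGCN)
```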


References Gu, Y., Mai, X., & Luo, Y. (2013). Do bodily expressions compete with
facial expressions? Time course of integration of emotional signals
Aviezer, H., Trope, Y., & Todorov, A. (2012). Body cues, not facial from the face and the body. PLOS ONE, 8(7), 1–9.
expressions, discriminate between intense positive and negative Gunes, H., & Piccardi, M. (2006). A bimodal face and body ges-
emotions. Science, 338(6111), 1225–1229. ture database for automatic analysis of human nonverbal affective
Burgoon, J., Buller, D., & WG, W. (1994). Nonverbal communication: behavior. In 18th international conference on pattern recognition
The unspoken dialogue. Greyden Press. (vol. 1, pp. 1148–1153).
Cao, Z., Hidalgo, G., Simon, T., Wei, S. E., & Sheikh, Y. (2019). Open- Hara, K., Kataoka, H., & Satoh, Y. (2018). Can spatiotemporal 3d cnns
pose: realtime multi-person 2d pose estimation using part affinity retrace the history of 2D CNNs and imagenet? In Proceedings of
fields. IEEE Transactions on Pattern Analysis and Machine Intel- the IEEE/CVF conference on computer vision and pattern recog-
ligence, 43(1), 172–186. nition (pp. 6546–6555).
Carreira, J., & Zisserman, A. (2017). Quo vadis, action recognition? He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning
a new model and the kinetics dataset. In Proceedings of the for image recognition. In Proceedings of the IEEE conference on
IEEE/CVF conference on computer vision and pattern recogni- computer vision and pattern recognition (pp. 770–778).
tion (pp. 6299–6308). Ho, T. K. (1995). Random decision forests. In Proceedings of interna-
Chen, H., Liu, X., Li, X., Shi, H., & Zhao, G. (2019). Analyze tional conference on document analysis and recognition (vol. 1,
spontaneous gestures for emotional stress state recognition: A pp. 278–282).
micro-gesture dataset and analysis with deep learning. In Pro- Khan, R. Z., & Ibraheem, N. A. (2012). Hand gesture recognition: A
ceedings of the IEEE international conference on automatic face literature review. International Journal of Artificial Intelligence &
& gesture recognition (pp. 1–8). Applications, 3(4), 161.
Chen, H., Liu, X., Shi, J., & Zhao, G. (2020). Temporal hierarchical Kipp, M., & Martin, J. C. (2009). Gesture and emotion: Can basic
dictionary guided decoding for online gesture segmentation and gestural form features discriminate emotions? In International
recognition. IEEE Transactions on Image Processing, 29, 9689– conference on affective computing and intelligent interaction and
9702. workshops (pp. 1–8).
Cheng, K., Zhang, Y., He, X., Chen, W., Cheng, J., & Lu, H. (2020). Kita, S., Alibali, M., & Chu, M. (2017). How do gestures influence
Skeleton-based action recognition with shift graph convolutional thinking and speaking? the gesture-for-conceptualization hypoth-
network. In Proceedings of the IEEE/CVF conference on computer esis. Psychological Review, 124, 245–266.
vision and pattern recognition (pp. 183–192). Krakovsky, M. (2018). Artificial (emotional) intelligence. Communica-
Crasto, N., Weinzaepfel, P., Alahari, K., & Schmid, C. (2019). MARS: tions of the ACM, 61(4), 18–19.
Motion-augmented RGB stream for action recognition. In Pro- Kuehne, H., Richard, A., & Gall, J. (2019). A hybrid RNN-HMM
ceedings of the IEEE/CVF conference on computer vision and approach for weakly supervised temporal action segmentation.
pattern recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.
de Becker, G. (1997). The gift of fear. Dell Publishing. Kuhnke, E. (2009). Body language for dummies. Wiley.
de Lara, N., & Pineau, E. (2018). A simple baseline algorithm for graph Li, S., & Deng, W. (2020). Deep facial expression recognition: A survey.
classification. In Relational representation learning workshop, the IEEE Transactions on Affective Computing.
conference on neural information processing systems. Li, Y., Lan, C., Xing, J., Zeng, W., Yuan, C., & Liu, J. (2016). Online
Ekman, P. (2004). Darwin, deception, and facial expression. Annals of human action detection using joint classification-regression recur-
the New York Academy of Sciences, 1000, 205–21. rent neural networks. In Proceedings of the European conference
Ekman, R. (1997). What the face reveals: Basic and applied studies on computer vision.
of spontaneous expression using the Facial Action Coding System Lin, J., Gan, C., & Han, S. (2019). TSM: Temporal shift module for
(FACS). Oxford University Press. efficient video understanding. In Proceedings of the IEEE/CVF
El Ayadi, M., Kamel, M. S., & Karray, F. (2011). Survey on speech emo- international conference on computer vision (pp. 7083–7093).
tion recognition: Features, classification schemes, and databases. Liu, J., Shahroudy, A., Xu, D., & Wang, G. (2016). Spatio-temporal
Pattern Recognition, 44(3), 572–587. LSTM with trust gates for 3D human action recognition. In: Pro-
Escalera, S., Baró, X., Gonzàlez, J., Bautista, M.A., Madadi, M., Reyes, ceedings of the European conference on computer vision.
M., Ponce-López, V., Escalante, H.J., Shotton, J., & Guyon, I. Liu, J., Shahroudy, A., Wang, G., Duan, L.Y., & Kot, A.C. (2018).
(2015). Chalearn looking at people challenge 2014: Dataset and Ssnet: Scale selection network for online 3D action prediction.
results. In Proceedings of the European conference on computer In: Proceedings of the IEEE conference on computer vision and
vision (pp. 459–473). pattern recognition.
Fix, E., & Hodges, J. L. (1989). Discriminatory analysis. nonparametric Liu, J., Wang, G., Hu, P., Duan, L.Y., & Kot, A. C. (2017). Global
discrimination: Consistency properties. International Statistical context-aware attention LSTM networks for 3d action recognition.
Review/Revue Internationale de Statistique, 57(3), 238–247. In Proceedings of the IEEE conference on computer vision and
Ginevra, C., Loic, K., & George, C. (2008). Emotion recognition pattern recognition.
through multiple modalities: Face, body gesture, speech (pp. 92– Liu, X., Shi, H., Chen, H., Yu, Z., Li, X., & Zhao, G. (2021). imigue:
103). Springer. An identity-free video dataset for micro-gesture understanding and
Goyal, R., Ebrahimi Kahou, S., Michalski, V., Materzynska, J., West- emotion analysis. In Proceedings of the IEEE/CVF conference on
phal, S., Kim, H., Haenel, V., Fruend, I., Yianilos, P., & Mueller- computer vision and pattern recognition (pp. 10631–10642).
Freitag, M., et al. (2017). The” something something” video Liu, Z., Zhang, H., Chen, Z., Wang, Z., & Ouyang, W. (2020). Disentan-
database for learning and evaluating visual common sense. In: gling and unifying graph convolutions for skeleton-based action
Proceedings of the IEEE international conference on computer recognition. In Proceedings of the IEEE/CVF conference on com-
vision (pp. 5842–5850). puter vision and pattern recognition (pp. 143–152).
Gray, J. A. (1982). Précis of the neuropsychology of anxiety: An enquiry Luo, Y., Ye, J., Adams, R. B., Li, J., Newman, M. G., & Wang, J. Z.
into the functions of the septo-hippocampal system. Behavioral (2020). Arbee: Towards automated recognition of bodily expres-
and Brain Sciences, 5(3), 469–484. sion of emotion in the wild. International Journal of Computer
Vision, 128(1), 1–25.


Mahmoud, M., Baltrušaitis, T., Robinson, P., & Riek, L.D. (2011). 3D Soomro, K., Zamir, A. R., & Shah, M. (2012). Ucf101: A dataset of
corpus of spontaneous complex mental states. In International 101 human actions classes from videos in the wild. arXiv preprint
conference on affective computing and intelligent interaction (pp. arXiv:1212.0402.
205–214). Sun, S., Kuang, Z., Sheng, L., Ouyang, W., & Zhang, W. (2018). Optical
Navarro, J., & Karlins, M. (2008). What every BODY is saying: An flow guided feature: A fast and robust motion representation for
ex-FBI agent’s guide to speed reading people. Collins. video action recognition. In Proceedings of the IEEE/CVF confer-
Neverova, N., Wolf, C., Taylor, G., & Nebout, F. (2016). Moddrop: ence on computer vision and pattern recognition.
Adaptive multi-modal gesture recognition. IEEE Transactions on Tran, D., Bourdev, L., Fergus, R., Torresani, L., & Paluri, M. (2015).
Pattern Analysis and Machine Intelligence 38(8). Learning spatiotemporal features with 3D convolutional networks.
Noroozi, F., Kaminska, D., Corneanu, C., Sapinski, T., Escalera, S., & In Proceedings of the IEEE/CVF international conference on com-
Anbarjafari, G. (2018). Survey on emotional body gesture recog- puter vision (pp. 4489–4497).
nition. IEEE Transactions on Affective Computing. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez,
Oh, S. J., Benenson, R., Fritz, M., & Schiele, B. (2016). Faceless person A.N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need.
recognition: Privacy implications in social media. In Proceedings Advances in Neural Information Processing Systems 30.
of the European conference on computer vision (pp. 19–35). Vrij, A., Leal, S., Jupe, L., & Harvey, A. (2018). Within-subjects verbal
Palena, N., Caso, L., Vrij, A., & Orthey, R. (2018). Detecting decep- lie detection measures: A comparison between total detail and pro-
tion through small talk and comparable truth baselines. Journal of portion of complications. Legal and Criminological Psychology,
Investigative Psychology and Offender Profiling 15. 23(2), 265–279.
Panksepp, J. (1998). Affective neuroscience: The foundations of human Vrij, A., Mann, S., Leal, S., & Fisher, R. P. (2020). Combining verbal
and animal emotions. Oxford University Press. veracity assessment techniques to distinguish truth tellers from lie
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., tellers. European Journal of Psychology Applied to Legal Context,
Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. (2019) 13(1), 9–19.
Pytorch: An imperative style, high-performance deep learning Wallbott, H. G. (1998). Bodily expression of emotion. European Jour-
library. In Advances in neural information processing systems. nal of Social Psychology, 28(6), 879–896.
Peng, W., Hong, X., Chen, H., & Zhao, G. (2020). Learning graph con- Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., & Van
volutional network for skeleton-based human action recognition Gool, L. (2018). Temporal segment networks for action recognition
by neural searching. In Proceedings of the AAAI conference on in videos. IEEE Transactions on Pattern Analysis and Machine
artificial intelligence. Intelligence, 41(11), 2740–2755.
Pentland, A. (2008). Honest signals: How they shape our world. MIT Wu, D., Pigou, L., Kindermans, P.J., Le, N.D.H., Shao, L., Dambre, J.,
Press. & Odobez, J.M. (2016). Deep dynamic neural networks for mul-
Pouw, W. T., Mavilidi, M. F., Van Gog, T., & Paas, F. (2016). Gesturing timodal gesture segmentation and recognition. IEEE Transactions
during mental problem solving reduces eye movements, especially on Pattern Analysis and Machine Intelligence 38(8).
for individuals with lower visual working memory capacity. Cog- Xu, M., Gao, M., Chen, Y. T., Davis, L. S., & Crandall, D. J. (2019a).
nitive Processing, 17(3), 269–277. Temporal recurrent networks for online action detection. In Pro-
Richard, A., Kuehne, H., Iqbal, A., & Gall, J. (2018). Neuralnetwork- ceedings of the IEEE/CVF international conference on computer
viterbi: A framework for weakly supervised video learning. In vision (pp. 5532–5541).
Proceedings of the IEEE/CVF conference on computer vision and Xu, M., Gao, M., Chen, Y.T., Davis, L. S., & Crandall, D. J. (2019b).
pattern recognition. Temporal recurrent networks for online action detection. In Pro-
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning ceedings of the IEEE/CVF international conference on computer
representations by back-propagating errors. Nature, 323(6088), vision.
533–536. Yan, S., Xiong, Y., & Lin, D. (2018). Spatial temporal graph con-
Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., & Monfardini, volutional networks for skeleton-based action recognition. In
G. (2008). The graph neural network model. IEEE Transactions Proceedings of the AAAI conference on artificial intelligence
on Neural Networks, 20(1), 61–80. (vol. 32).
Schapire, R. E. (2013). Explaining adaboost. In Empirical inference You, Y., Chen, T., Wang, Z., & Shen, Y. (2020). L2-gcn: Layer-wise
(pp. 37–52). Springer. and learned efficient training of graph convolutional networks. In
Schindler, K., Van Gool, L., & De Gelder, B. (2008). Recognizing Proceedings of the IEEE/CVF conference on computer vision and
emotions expressed by body pose: A biologically inspired neu- pattern recognition (pp. 2127–2135).
ral model. Neural Networks, 21(9), 1238–1246. Yu, N. (2008). Metaphor from body and culture. The Cambridge hand-
Serge, G. (1995). International Glossary of Gestalt Psychotherapy. book of metaphor and thought (pp. 247–261).
FORGE. Yu, Z., Zhou, B., Wan, J., Wang, P., Chen, H., Liu, X., Li, S. Z., &
Shahroudy, A., Liu, J., Ng, T. T., & Wang, G. (2016). Ntu rgb+d: A large Zhao, G. (2020). Searching multi-rate and multi-modal temporal
scale dataset for 3D human activity analysis. In Proceedings of the enhanced networks for gesture recognition. IEEE Transactions on
IEEE/CVF conference on computer vision and pattern recognition. Image Processing.
Shi, L., Zhang, Y., Cheng, J., & Lu, H. (2019). Two-stream adaptive Zanfir, M., Leordeanu, M., & Sminchisescu, C. (2013). The moving
graph convolutional networks for skeleton-based action recogni- pose: An efficient 3D kinematics descriptor for low-latency action
tion. In Proceedings of the IEEE/CVF conference on computer recognition and detection. In Proceedings of the IEEE/CVF inter-
vision and pattern recognition (pp. 12026–12035). national conference on computer vision.
Shiffrar, M., Kaiser, M., & Chouchourelou, A. (2011). Seeing human Zhang, Y., Pal, S., Coates, M., & Ustebay, D. (2019). Bayesian graph
movement as inherently social. The Science of Social Vision. convolutional neural networks for semi-supervised classification.
Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, In Proceedings of the AAAI conference on artificial intelligence
R., Kipman, A., & Blake, A. (2011). Real-time human pose recog- (vol. 33, pp. 5829–5836).
nition in parts from single depth images. In Proceedings of the
IEEE/CVF conference on computer vision and pattern recogni-
tion (pp. 1297–1304). Publisher’s Note Springer Nature remains neutral with regard to juris-
dictional claims in published maps and institutional affiliations.
