SMG: A Micro-Gesture Dataset Towards Spontaneous Body Gestures for Emotional Stress State Analysis. International Journal of Computer Vision (2023)
https://doi.org/10.1007/s11263-023-01761-6
Abstract
We explore the use of body gestures for hidden emotional state analysis. As an important non-verbal communication channel, human body gestures can convey emotional information during social communication. Previous work has focused mainly on facial expressions, speech, or expressive body gestures to interpret classical expressive emotions. In contrast, we focus on a specific group of body gestures, called micro-gestures (MGs), used in the psychology research field to interpret inner human feelings. MGs are subtle and spontaneous body movements that are proven, together with micro-expressions, to be more reliable than ordinary facial expressions for conveying hidden emotional information. In this work, a comprehensive study of MGs is presented from the computer vision perspective, including a novel spontaneous micro-gesture (SMG) dataset with two emotional stress states and a comprehensive statistical analysis indicating the correlations between MGs and emotional states. Novel frameworks are further presented, together with various state-of-the-art methods as benchmarks, for automatic classification and online recognition of MGs, and for emotional stress state recognition. The dataset and methods presented could inspire a new way of utilizing body gestures for human emotion understanding and bring a new direction to the emotion AI community. The source code and dataset are made available at: https://github.com/mikecheninoulu/SMG.
Keywords Micro-gestures · Gesture recognition · Emotion recognition · Statistical modeling · Deep learning · Affective computing
1 Introduction
Fig. 2 The overview of the main research topics of this work. a A novel SMG dataset with a comprehensive statistical analysis. b Multiple benchmarks on the SMG dataset. c A novel online MG recognition framework for complicated gesture transition patterns. d Baselines and a newly proposed framework for emotional state recognition
The three factors, fight, flight, and freeze, can cause specific human behaviors at the onset of certain stimuli, including freezing the body (e.g., holding the breath), distancing behaviors (e.g., putting hands or objects up to block faces or bodies) and guarding behaviors (e.g., puffing out the chest). Besides, to transfer from discomfort to comfort states, human beings develop a natural reaction, so-called pacifying actions, that try to suppress the negative feelings induced by the above three factors (Panksepp 1998). Other psychological research related to MGs can also be found in early work (de Becker 1997; Burgoon et al. 1994) and the most recent work (Kita et al. 2017; Pouw et al. 2016).

(Figure: a A list of the micro-gestures (ID 1-16 and non-MG) in the SMG dataset; b participant demographics: Age < 22: 30%, Age > 28: 20%; East Asia: 37.5%, Middle East: 35.0%, Others: 10.5%)

In total, based on the above psychological theoretical support, we try to define the MG categories for computer vision study with criteria as (1) covering all MGs that could possibly
Table 1 The list of MGs collected in the SMG dataset; MG IDs correspond to the indexes in Fig. 4. For each MG, the table gives its kinematic description, its psychological attribute, and its number of occurrences in SES/NES.
"hand-hand" interactions. Thus, compared to other body gesture/action recognition tasks, MGs require a more fine-grained and accurate recognition ability from the machine learning aspect.

Relationship Between MGs and Subjects. As mentioned, the use of body gestures to interpret emotions can be heavily affected by individual differences. Here, we conduct a qualitative analysis of the MG patterns of different subjects. Specifically, Pearson's correlation coefficient is used to measure the correlation of the MG performing patterns of the 40 subjects in our SMG dataset. The MG performing pattern of a given subject is represented by that subject's frequency distribution over the 17 MGs. Pearson's correlation coefficient varies from -1 to 1, and the higher it is, the stronger the evaluated correlation. According to the statistics, the average Pearson's correlation coefficient over these 40 subjects is 0.456, with the highest being 0.966 and the lowest -0.240. This indicates a trend that subjects share MG patterns, especially in the exposing frequency of MGs, while the individual inconsistency of the MG patterns is still not negligible. As a result, although the above t-test proves the effectiveness of SES for eliciting MGs, it is necessary to emphasize the inconsistency of MG performing patterns brought by different subjects.

Table 3 MG classification performance on the test set of the SMG dataset

Method                              Modality    Accuracy
                                                Top-1   Top-5
ST-GCN (Yan et al. 2018)            Skeleton    41.48   86.07
2S-GCN (Shi et al. 2019)            Skeleton    43.11   86.90
Shift-GCN (Cheng et al. 2020)       Skeleton    55.31   87.34
GCN-NAS (Peng et al. 2020)          Skeleton    58.85   85.08
MS-G3D (Liu et al. 2020)            Skeleton    64.75   91.48
R3D (Hara et al. 2018)              RGB         29.84   67.87
I3D (Carreira and Zisserman 2017)   RGB         35.08   85.90
C3D (Tran et al. 2015)              RGB         45.90   79.18
TSN (Wang et al. 2018)              RGB         50.49   82.13
TSM (Lin et al. 2019)               RGB         58.69   83.93
TRN (Xu et al. 2019a)               RGB         59.51   88.53
TSN* (Wang et al. 2018)             RGB         53.61   81.98
TRN* (Xu et al. 2019a)              RGB         60.00   91.97
TSM* (Lin et al. 2019)              RGB         65.41   91.48

In total, 11 state-of-the-art models for the RGB and skeleton modalities are reported. Methods with stars are pretrained on large-scale datasets. Methods with the best performance are marked in bold.
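To make the per-subject analysis above concrete, the pairwise Pearson correlations of the per-subject MG frequency patterns can be computed as in the following sketch. This is an illustrative reconstruction rather than the authors' script; the array mg_counts (subjects x MG classes) is a hypothetical input.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical input: mg_counts[s, k] = how often subject s performed MG class k
# (40 subjects x 17 MG classes), normalized to a frequency distribution per subject.
mg_counts = np.random.randint(0, 20, size=(40, 17)).astype(float)
mg_freq = mg_counts / mg_counts.sum(axis=1, keepdims=True)

# Pearson correlation between every pair of subjects' MG frequency patterns.
n_subjects = mg_freq.shape[0]
corrs = []
for i in range(n_subjects):
    for j in range(i + 1, n_subjects):
        r, _ = pearsonr(mg_freq[i], mg_freq[j])
        corrs.append(r)

corrs = np.array(corrs)
print(f"mean r = {corrs.mean():.3f}, max r = {corrs.max():.3f}, min r = {corrs.min():.3f}")
```

With the real per-subject counts, the mean, maximum, and minimum of corrs correspond to the statistics reported above (0.456, 0.966, and -0.240).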
after "touching nose" in training subjects, while the testing subjects could perform no "rubbing hands" at all. Second, the "non-movement" interval, the so-called Ergodic state, was introduced in most of the previous works (Neverova et al. 2016; Wu et al. 2016) to achieve accurate allocation and segmentation of gestures. Meanwhile, MGs usually occur continuously without any "non-movement" intervals and can sometimes be incompletely performed. Therefore, a more flexible and efficient transition scheme is needed. Lastly and most importantly, MGs are rare and subtle. How to help the HMM decoding escape from the local optima brought by the dominating amount of Ergodic states (irrelevant/noisy body movements) and non-MGs is exceptionally challenging.

5.2 A Parameter-Free Ergodic-Attention HMM Network for Online MG Recognition

Mathematical Framework. We chose the sequences of the 3D skeletal stream as inputs because of its lower dimensionality, which is suitable for online processing tasks, and its reliable performance shown in the MG classification task. Similarly to the work of Chen et al. (2020), we model the local temporal dynamics with an attention-based bidirectional Long Short-Term Memory (BiLSTM) network (giving an initial prediction of the current frame) and use an HMM to enhance inference reasoning (finalizing the prediction of the current frame with priors from the past frames), as shown in Fig. 6. The full probability of the HMM is specified as follows:

p(x_1, x_2, \ldots, x_T, h_1, h_2, \ldots, h_T) = p(h_1)\, p(x_1|h_1) \prod_{t=2}^{T} p(x_t|h_t)\, p(h_t|h_{t-1}),   (1)

The frame-wise posterior over the HMM states, w(h_t|x_t), is estimated by a trained BiLSTM network:

w(h_t|x_t) = [p_1, p_2, \ldots, p_{M+1}]^{\top} = [W_{1:M};\ W_{M+1}], \quad M = N \times C,   (3)

where C is the total number of gesture classes (17 in practice, including MGs and non-MGs), N is the number of HMM states used to represent one gesture (set to 5 for the best performance) and M is the resulting total HMM state number (85 in practice). We take an additional HMM state M + 1 (86 in practice) as the "non-movement" state. Then, W_{1:M} and W_{M+1} stand for the probability distribution of the HMM states of all the gestures (MGs and non-MGs) and of the "non-movement" state, respectively.

A Novel Parameter-free Attention Mechanism for Ergodic HMM Decoding. Based on the above HMM full probability, we find that although the prior p(h_t) is meant to correct the data imbalance, this prior is not strong enough or is even harmful. The MGs are still submerged in the dominating noisy/irrelevant body movements. Thus, we propose a novel method to address this issue, called Attention-Based Ergodic Decoding (AED), which uses a parameter-free self-attention mechanism to model the HMM alignment. It brings two improvements over the conventional HMM framework (Wu et al. 2016): an attention mechanism on W_{1:M} to lift the probability of meaningful gestures, and an inhibition on W_{M+1} to lower the probability of noisy body movements. Specifically, we exploit the AED by replacing w(h_t|x_t)/p(h_t) with a new form of posterior probability w'(h_t|x_t) that has a more effective prior:

w'(h_t|x_t) = \frac{W_{1:M}}{W_{M+1}} = \lambda\, \frac{\mu \cdot \mathrm{softmax}(W_{1:M} W_{1:M}^{\top})\, W_{1:M} + W_{1:M}}{W_{M+1}}.   (4)
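The AED re-weighting of Eq. (4) can be sketched as below. This follows our reading of the reconstructed equation and is not the authors' implementation; the default values of lam and mu are placeholders (the paper tunes lambda and mu per dataset, see Appendix F), and the transpose in the self-attention term is an assumption.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def aed_reweight(w, lam=1.0, mu=1.0):
    """Parameter-free Attention-based Ergodic Decoding re-weighting (our
    reading of Eq. 4): lift the gesture-state probabilities W_1:M with a
    training-free self-attention term and divide by the "non-movement"
    probability W_M+1 to inhibit it.  lam and mu are placeholder constants."""
    W_gest, W_non = w[:-1], w[-1]
    attn = softmax(np.outer(W_gest, W_gest))   # (M, M) self-attention map
    lifted = mu * attn @ W_gest + W_gest       # emphasized gesture states
    return lam * lifted / W_non                # inhibited non-movement prior

# Toy frame posterior: 85 gesture HMM states (17 classes x 5) + 1 non-movement state.
frame_posterior = np.random.dirichlet(np.ones(86))
scores = aed_reweight(frame_posterior)
print(scores.shape)  # (85,)
```

Because the re-weighting only manipulates the BiLSTM softmax output, it can be applied at test time without any retraining.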
Inference. After the BiLSTM is trained to give an estimation of each upcoming frame with a SoftMax probability w(h_t|x_t) over the HMM states, we can conduct the inference together with the learnt transition probability p(h_t|h_{t-1}). During the testing phase, we want to solve the decoding of the hidden state sequence to obtain the most likely explanation (namely, the gesture alignment), which is determined as:

\hat{g} = \arg\max_{h} p(x_1, x_2, \ldots, x_T, h_1, h_2, \ldots, h_T) \cong \arg\max_{h} \pi_0 \prod_{t=2}^{T} w'(x_t|h_t)\, p(h_t|h_{t-1}),   (5)

where \pi_0 stands for a constant value. By using Eq. 5, we can break down the problem of solving the utmost probability of a long non-stationary sequence into continuously solving the HMM state probabilities with hidden states h_{1:T}. While the HMM states are aligned in real time, the testing sequence can be inferred for both segmentation (non-movement) and recognition (MGs and non-MGs). Finally, we improved the method proposed by Wu et al. (2016) by treating not only the "non-movement" state but also the middle HMM states of every gesture as ergodic states. In this way, the segmentation of several continuous incomplete gestures becomes possible. The complete network structures and technical implementation details of our AED method, such as the values of \mu and \lambda, are given in "Appendix F". Note that w'(h_t|x_t) is calculated based on w(h_t|x_t), which is given by the BiLSTM output without any fine-tuning. Thus, our proposed attention scheme can be used directly in the testing phase without extra training or extra parameters; it is parameter-free and can be plugged into other existing models for online gesture recognition.

5.3 Evaluation on SMG Dataset

Evaluation Metrics. Following the protocols used in online action detection from the work of Li et al. (2016), we jointly evaluate the detection and classification performances of algorithms by using the F1 score measurement defined below:

F1score = \frac{2\, Precision \cdot Recall}{Precision + Recall},   (6)

Given a long video sequence that needs to be evaluated, Precision is the fraction of correctly classified MGs among all gestures retrieved in the sequence by the algorithms, while Recall (or sensitivity) is the fraction of MGs that have been correctly retrieved over the total number of annotated MGs. Also, we define a criterion to determine a correct detection with the overlapping ratio \alpha_{th} between the predicted gesture intervals and the ground truth intervals. The overlapping ratio \alpha_{th} is defined as follows:

\alpha_{th} = \frac{|I_{gt} \cap I_{pred}|}{|I_{gt} \cup I_{pred}|},   (7)

where I_{pred} and I_{gt} denote the predicted gesture and ground truth intervals, respectively. If \alpha_{th} is greater than a threshold, we say that it is a correct detection. In practice, we set the threshold of \alpha_{th} to 0.3 by default (see ablation studies of different \alpha_{th} values in "Appendix F").

Fig. 7 Visualized HMM decoding of failure cases. We present the HMM decoding of sample sequence #36 in the SMG dataset and #72 in the iMiGUE dataset, using the raw prior (top) and our AED (bottom). The x-axis represents time, and the y-axis represents the hidden states of all classes. The cyan lines represent the highest probability given by the networks, while red lines denote the ground truth labels, and the blue lines are the predictions

Performances on the SMG Dataset. As a comparison to the MG online recognition performance of our AED-BiLSTM, we also implemented three related methods as baselines: FCN-sliding window (Chen et al. 2019), DBN-HMM (Wu et al. 2016) and STABNet-MES (Chen et al. 2020). The results of online recognition of both our method and the compared baselines are shown in Table 4. Our method is considerably effective in recognizing continuous gestures in unconstrained long sequences (accuracy of 0.173, recall of 0.245, and F1 score of 0.203). Technical implementation details of all the compared methods are available in "Appendix E".

5.4 AED-BiLSTM on Other Datasets

We also evaluate our proposed AED-BiLSTM framework on two other existing online detection datasets, iMiGUE (Liu et al. 2021) and OAD (Li et al. 2016), to verify its generalizability.
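As a sketch of how the detection criterion of Eq. (7) and the F1 score of Eq. (6) can be evaluated on a long sequence, the following helpers illustrate one possible implementation; the greedy one-to-one matching between predictions and ground truth is our simplification, not necessarily the exact protocol used in the paper.

```python
def interval_overlap_ratio(pred, gt):
    """Overlap ratio of Eq. (7): |I_gt ∩ I_pred| / |I_gt ∪ I_pred|,
    for frame intervals (start, end) with inclusive ends."""
    inter = max(0, min(pred[1], gt[1]) - max(pred[0], gt[0]) + 1)
    union = (pred[1] - pred[0] + 1) + (gt[1] - gt[0] + 1) - inter
    return inter / union

def detection_f1(predictions, ground_truth, alpha_th=0.3):
    """F1 score of Eq. (6).  `predictions` and `ground_truth` are lists of
    (label, start, end); a prediction counts as correct if its label matches
    an unmatched ground-truth gesture with overlap ratio above alpha_th."""
    matched, tp = set(), 0
    for label, ps, pe in predictions:
        for i, (gl, gs, ge) in enumerate(ground_truth):
            if i in matched or gl != label:
                continue
            if interval_overlap_ratio((ps, pe), (gs, ge)) > alpha_th:
                tp += 1
                matched.add(i)
                break
    precision = tp / len(predictions) if predictions else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```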
Table 4 MG online recognition performances on the test sets of the SMG and iMiGUE datasets

Online recognition method               SMG dataset                     iMiGUE dataset
                                        Accuracy  Recall  F1-score     Accuracy  Recall  F1-score
FCN-sliding window (Chen et al. 2019)   0.082     0.110   0.094        0.059     0.067   0.063
DBN-HMM (Wu et al. 2016)                0.128     0.167   0.145        -         -       -
STABNet (Chen et al. 2020)              0.206     0.164   0.182        0.137     0.082   0.103
AED-BiLSTM (Ours)                       0.173     0.245   0.203        0.166     0.177   0.171

Methods with the best performance are marked in bold
iMiGUE Dataset is a newly published dataset that also focuses on involuntary micro-gestures occurring during the post-match interviews of tennis players. There are 359 videos of post-match press conferences. The videos' duration varies, with an average length of 350 s, and the total length is 2092 min. A total of 18,499 MG samples were labeled with multi-label annotation, which means there can be multiple MGs labeled for one frame. It has more than 70 subjects and contains 32 categories of MGs, with 25 joints estimated by OpenPose (Cao et al. 2019). We follow the same cross-subject protocol provided by Liu et al. (2021), which uses 255 long video sequences (with 13,936 MG samples) from 37 selected subjects for training and 104 sequences (with 4,563 MG samples) from the remaining 35 subjects for testing. For a fair comparison, we removed all the samples with null skeleton joints to ensure robust training for both the compared methods and ours.

OAD Dataset includes 10 daily human action categories. It was captured as long skeleton sequences with Kinect v2. The annotations of start and end frames are provided within the peak duration (not a from-none-to-action pattern); similar to the work of Chen et al. (2020), we compensate 12 frames at the beginning of actions to learn pre-action information for better online recognition. "MovingPose" (Zanfir et al. 2013) is also adopted to generate features for each frame. There are more than 50 long sequences in total; 30 of them are used for training, 20 for testing, and the remaining sequences are used for processing speed validation. On the OAD dataset, we use the same protocol as Liu et al. (2018), which sets different observation ratios to validate the algorithm. Thus, accuracy is reported for this dataset.

Table 5 The early online detection performances on the OAD dataset

Method                            Observational ratio
                                  10%     50%     90%
ST-LSTM (Liu et al. 2016)         60.0%   75.3%   77.5%
Attention Net (Liu et al. 2017)   59.0%   75.8%   78.3%
JCR-RNN (Li et al. 2016)          62.0%   77.3%   78.8%
SSNet (Liu et al. 2018)           65.6%   79.2%   81.6%
STABNet (Chen et al. 2020)        87.2%   92.0%   93.1%
AED-BiLSTM (ours)                 88.1%   93.4%   94.2%

Results of accuracy are reported. Methods with the best performance are marked in bold

Performance Discussion. The experimental results are presented in Tables 4 and 5. As shown, our AED-BiLSTM outperforms all other methods with significant margins (2.1% on SMG and 6.8% on iMiGUE) on the MG online recognition task. Our AED-BiLSTM brings a large improvement especially on the iMiGUE dataset, because the skeleton joints of this dataset are extracted with OpenPose and are relatively noisy. By using our enhanced prior to suppress those noisy body movements, the results are effectively improved. From Table 5, we can see that our AED-BiLSTM framework can also efficiently improve the performance of online recognition for regular gestures (88.1%, 93.4%, 94.2% at observational ratios of 10%, 50%, 90%). As we can see, our method achieves superior results to STABNet on all metrics for the SMG and iMiGUE datasets, except for accuracy on SMG, where it is considerably lower than STABNet. Essentially, our AED module works as a regularizer that suppresses non-MGs and puts attention on MGs. It tends to weight MGs while neglecting non-MGs, resulting in a higher recall of MGs. This high recall naturally leads to many non-MGs being misclassified as MGs, resulting in a relatively low accuracy.

Failure Case Analysis. In Fig. 7, we visualize the HMM decoding paths to analyze the failure cases. As we can see, online recognition of in-the-wild body gestures is challenging due to the complicated transition patterns between gestures and the high requirements for accurate temporal allocation. Even so, our AED method, with its attention mechanism, has a better correcting performance than the raw prior. For instance, around frames 11,500-11,600 of the SMG case, AED helps to escape from the false positive prediction of "non-movement" intervals and gives potential MG predictions, while around frames 100-600 of the iMiGUE case, AED helps to emphasize the true positive predictions of the MGs with self-attention. Finally, the visualization of the attention maps is presented in Fig. 8, which also shows that our AED can effectively suppress the biased priors brought by certain classes (yellow color means lower probability) and can thus better handle the long-tail class distribution.
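The observation-ratio evaluation used for the OAD comparison above can be sketched as follows; `model.predict` is a hypothetical per-sequence classifier, and the truncation scheme reflects our reading of the protocol rather than the authors' exact code.

```python
def early_accuracy(model, sequences, labels, ratio=0.5):
    """Early online detection accuracy at a given observational ratio:
    each test action is truncated to the first `ratio` of its frames
    before being classified (cf. Table 5)."""
    correct = 0
    for seq, label in zip(sequences, labels):
        observed = seq[: max(1, int(len(seq) * ratio))]  # visible prefix only
        if model.predict(observed) == label:
            correct += 1
    return correct / len(sequences)
```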
Fig. 8 Log probabilities of the HMM states over time under the raw prior (top) and our AED (bottom); zooming in shows the suppressed non-movement states
Fig. 10 Spectral decomposition of the graph network for emotional state recognition
Table 6 Body gesture-based emotional state recognition results of human evaluators

Human evaluator      Modality   Emotion state recognition accuracy
Random guess         -          0.50
Common people        Skeleton   0.48
                     RGB        0.53
Trained evaluators   Skeleton   0.66
                     RGB        0.75

6.2 Emotional State Recognition with State-of-the-Art Methods

As introduced in Sect. 3, instead of using a conventional paradigm that maps one gesture onto one emotional status (Gu et al. 2013; Gunes and Piccardi 2006), we use two proxy tasks to present the emotional states. Thus, the task of emotional state recognition in the SMG dataset is to predict the corresponding emotional state of a given long video sequence (the state of the proxy task, NES/SES). Intuitively, there are two directions to approach this problem: one is raw context-based recognition, which directly conducts the inference on the whole sequence, and the other is MG context-based recognition, which predicts the emotional states based on the MGs in the sequences. Here we provide six machine learning-based methods for emotional state recognition, covering both kinds.

Raw Context Recognition. Three state-of-the-art models for the skeleton-based action recognition task, ST-GCN (Yan et al. 2018), NAS-GCN (Peng et al. 2020) and MS-G3D (Liu et al. 2020), are provided as baselines that infer the emotional state from the raw long instances. The input of the models is the full sequence of the body skeleton streams, which is to validate whether the emotional patterns can be captured straightforwardly from body movements. The networks are end-to-end, with the same hyper-parameters as for the MG classification task in Sect. 4.1, aside from the output head dimensions (NES/SES). The performances of the three baseline methods are presented in the "Sequence+NN" group of Table 7. As shown in the table, the three baseline methods (46%, 46%, and 50%) cannot even exceed the random selection rate (50%). As expected, inference based on raw video sequences involves many redundant, irrelevant body movements and easily fails to capture the desired body movements (such as MGs) for emotional stress state recognition. Thus, conducting the recognition on long video sequences performs poorly (near random guessing) with existing state-of-the-art models.

MG-based Recognition. Unlike the above raw context recognition methods, we also present several MG-based methods for emotion understanding. A baseline strategy that uses a Bayesian network to encode the distribution vectors of MGs (with dimensions 1 x N, where N is the MG number) was provided in our previous work (Chen et al. 2019). It experimentally validated the contribution that MGs can bring to the emotion understanding context. In the bottom part of Table 7 (the "MG+classifier" group), we can observe that micro-gestures are beneficial for emotional state inference with the BayesianNet (0.59 and 0.66). Besides, we go one step further by encoding the MG relationships of a long sequence into a graph representation (with dimensions N x N) so that the transitions of MGs are also involved as node relationships. Intuitively, this should bring more gains as the information in the feature increases, and we selected two state-of-the-art high-dimensional graph convolutional networks, L2GCN and BGCN (You et al. 2020; Zhang et al. 2019), to verify it. However, as shown in Table 7, we find that for these two high-dimensional models, the emotional state performances (0.44/0.47 and 0.54/0.53) are not as competitive as the simple BayesianNet. Thus, in the next section, we try to tackle this issue and propose a customized graph network to better mine the potential of graph-based representations.

6.3 A Weighted Spectral Graph Network for Emotional State Recognition

We find that existing graph representation learning methods all rely on high-dimensional weight parameters. A limited sample amount easily leads to over-fitting of these models (Scarselli et al. 2008) (e.g., in our case, a graph with
only 17 nodes of MGs), as shown in Table 7. Meanwhile, classical spectral graph handling methods like the Laplacian operator (de Lara and Pineau 2018) are suitable for insufficient samples to obtain node "gradients" without the need for high-dimensional weights. Thus, we utilize the strength of the classical Laplacian operator to obtain measurements of the "gradients" of each node and extend it to the directed, weighted graph case to better fit the task. The whole framework is presented in Fig. 10.

We give the mathematical definition of a graph as G = (V, E, W) to represent the relationships of MGs. With the MGs of number N as graph nodes V = {v_p | p = 1, ..., n} and the transitions between MGs as graph edges E = {e_q | q = 1, ..., m}, the input is therefore the transition frequency vectors as the weights on the graph edges W = {w_{i,j} | i, j = 1, ..., n}, where w_{i,j} is obtained by counting the number of transitions between MG i and MG j. In this way, we map the distribution of MGs into raw graph data, with the dynamic transition patterns between MGs maintained by W. Specifically, to tackle the directed graph issue, consider the vertex space R^V with standard basis {e_1, ..., e_n}; an n x m matrix N can be defined as N = {n_i = e_j - e_k | i = 1, ..., m and j, k = 1, ..., n}. This matrix N is called the signed vertex-edge incidence matrix of the original G (with respect to the fixed orientation). The key fact is that the Laplacian L of G is the (transpose of the) Gram matrix of N, that is, L = N N^T, with which the directed graph can be handled. Now recall that W is the weight matrix of G. Then we can define the Laplacian of G as the matrix product N W N^T, where N is the signed vertex-edge incidence matrix of the underlying unweighted graph of G. In this way, the Laplacian operator can be exploited to extract "gradient" features from the MG graph representation. The resulting feature vectors from the Laplacian operator are fed into classifiers to predict the final emotional state. Eventually, the whole formulation of our proposed weighted spectral graph network (WSGN) is given as follows:

\hat{c} = f_{classifier}(L(N W N^{\top})),   (8)

where for f_{classifier} we experimented with different standard classifiers combined with our spectral embedding, that is, a Multi-layer Perceptron with ReLU non-linearity (MLP) (Rumelhart et al. 1986), k-nearest neighbors (kNN) (Fix and Hodges 1989), Random Forest (RF) (Ho 1995), and Adaptive Boosting (AdaBoost) (Schapire 2013).

6.4 Discussion and Limitations

The experimental results for emotional state recognition are shown in Table 7. In practice, MLP outperforms the other classifiers, and it is reported in Table 7 as the result of our proposed WSGN. The detailed experimental settings can be found in "Appendix G". Besides, extra experimental results (see "Appendix H") show that taking natural states (the non-movement snippets) into account as an extra MG in the transition representation brings an improvement to the results, as does the Laplacian operation. In the last line of Table 7, we can observe that our proposed WSGN model outperforms all the compared methods, which further verifies that MG-based analysis is beneficial to the final emotion understanding. By comparing the performances of the MG+classifier frameworks and the Sequence+NN frameworks, we can observe that the MG-based feature vectors are more beneficial for representing the emotional states. This proves that MG-based analysis, with its effective representation capability of the emotional state, can be a better option for emotional understanding. We believe that this can bring inspiration and new paradigms to the community for bodily emotion understanding.

The limitation of this experiment could be that the stakes of the subjects' emotional states were relatively low. Thus, this might decrease the distinction between baselines and deviations. Additionally, the sample size was relatively limited. Therefore, more research should explore how the similarity scoring system performs when more extensive samples are used.
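To illustrate the MG-graph pipeline of Sect. 6.3, the sketch below builds the transition-weight matrix W from an MG sequence, forms the weighted Laplacian N W N^T from a signed incidence matrix, and feeds flattened Laplacian features to an MLP as in Eq. (8). The per-video MG index sequences, the flattening of the Laplacian into a feature vector, and the classifier settings are assumptions of this sketch, not the authors' implementation.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def transition_weights(mg_sequence, n_mgs):
    """Directed edge weights W: w[i, j] counts transitions from MG i to MG j
    within one long video (Sect. 6.3)."""
    W = np.zeros((n_mgs, n_mgs))
    for a, b in zip(mg_sequence[:-1], mg_sequence[1:]):
        W[a, b] += 1
    return W

def weighted_laplacian_features(W):
    """Weighted Laplacian N W_e N^T built from the signed vertex-edge
    incidence matrix N of the observed directed edges (our reading of Eq. 8)."""
    n = W.shape[0]
    edges = [(i, j) for i in range(n) for j in range(n) if W[i, j] > 0]
    N = np.zeros((n, len(edges)))
    We = np.zeros((len(edges), len(edges)))
    for k, (i, j) in enumerate(edges):
        N[i, k], N[j, k] = 1.0, -1.0   # signed incidence column e_i - e_j
        We[k, k] = W[i, j]             # edge weight = transition count
    L = N @ We @ N.T                   # n x n weighted graph Laplacian
    return L.flatten()

# Toy usage with hypothetical per-video MG index sequences and SES/NES labels.
videos = [np.random.randint(0, 17, size=30) for _ in range(20)]
labels = np.random.randint(0, 2, size=20)
X = np.stack([weighted_laplacian_features(transition_weights(v, 17)) for v in videos])
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500).fit(X, labels)
```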
7 Conclusions and Future Work

We proposed a novel, psychology-based and reliable paradigm for body gesture-based emotion understanding with computer vision methods. To our knowledge, our effort is the first to interpret hidden emotional states via MGs, with both quantitative investigations of human body behaviors and machine vision technologies. A related spontaneous micro-gesture dataset towards hidden emotion understanding is collected. A comprehensive statistical analysis is performed, with significant findings for MGs and emotional body gestures. Benchmarks for MG classification, MG online recognition, and body gesture-based emotional stress state recognition are provided with state-of-the-art models. Our proposed AED-BiLSTM framework can efficiently provide a more robust correction to the prior with a parameter-free mechanism. Experiments show that AED-BiLSTM can efficiently improve online recognition performance in a setting closer to real-world practice. Moreover, a graph-based network is proposed for the MG pattern representations to better analyze the emotional states.

This work involves and bridges the interdisciplinary efforts of psychology, affective computing, computer vision, machine learning, etc. We wish to break the fixed research paradigm of emotional body gestures, which is limited to classical expressive emotions, and argue for more diverse research angles for emotional understanding. Thus, we propose our spontaneous micro-gestures for hidden emotion understanding. We believe that the SMG dataset and the proposed methods could inspire new algorithms for the MG recognition tasks from the machine learning aspect, such as combining more non-verbal cues, such as facial expressions, with MGs using the RGB modality in the SMG dataset to improve emotion recognition performance. The work can also facilitate new advances in the emotion AI field and inspire new paradigms for analyzing human emotions with computer vision methods. The community can benefit from MGs with significant application potential in many fields, e.g., using machines to automatically detect MGs to enhance people's communicative skills, or to assist experts in conducting Alzheimer's and autism disease diagnoses.

Acknowledgements This work was supported by the Academy of Finland for Academy Professor project EmotionAI (Grants 336116, 345122), project MiGA (grant 316765), the University of Oulu & The Academy of Finland Profi 7 (grant 352788), Postdoc project 6+E (Grant 323287) and ICT 2023 project (grant 328115), and by the Ministry of Education and Culture of Finland for the AI forum project. As well, the authors wish to acknowledge CSC - IT Center for Science, Finland, for computational resources.

Funding Open Access funding provided by University of Oulu including Oulu University Hospital.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Appendix A SMG Evaluation Protocols

In the proposed SMG dataset, the criteria of the three benchmarks (MG classification, MG online recognition, and emotional state recognition) are provided. Specifically, for the MG classification and online recognition tasks, we utilized the subject-independent evaluation protocol, while for the emotional state recognition task, both subject-independent and -dependent evaluation protocols are used.

Subject-independent protocol. In this protocol, we divide the 40 subjects into a training group of 30 subjects, a validating group of 5 subjects, and a testing group of 5 subjects. The subject IDs of training and testing are:
Training set: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30};
Validating set: {31, 32, 33, 34, 35};
Testing set: {36, 37, 38, 39, 40}.
Under this protocol, the MG classification task has 2417 MG clip samples for training, 632 for validating and 593 for testing (each of around 50 frames); the MG online recognition task has 30 long MG sequences for training, five for validating and five for testing (each of around 25,000 frames); and the emotional state recognition task has 294 videos (i.e., emotional state instances) for training, 60 for validating and 60 for testing (each of around 8000 frames), respectively.

Semi-subject-independent Protocol. In this protocol, we selected 294 + 60 videos (147 + 30 SES and 147 + 30 NES instances) from all the 40 subjects as the training + validating sets, and the remaining 60 videos (30 SES and 30 NES instances) as the testing set. The participants' emotional states (SES/NES) are recognized via analysis of micro-gestures.
The video IDs of training and testing are:
Training set: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 96, 97, 98, 99, 100, 101, 108, 109, 110, 111, 112, 113, 120, 121, 122, 123, 124, 125, 132, 133, 134, 135, 136, 137, 156, 157, 158, 159, 160, 161, 168, 169, 170, 171, 172, 173, 180, 181, 182, 183, 184, 185, 192, 193, 194,
195, 196, 197, 204, 205, 206, 207, 208, 209, 216, 217, 218, 219, 220, 221, 228, 229, 230, 231, 232, 233, 237, 238, 239, 243, 244, 245, 249, 250, 251, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376, 377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 162, 163, 164, 165, 166, 167, 174, 175, 176, 177, 178, 179, 186, 187, 188, 189, 190, 191, 198, 199, 200, 201, 202, 203, 210, 211, 212, 213, 214, 215, 222, 223, 224, 225, 226, 227, 234, 235, 236, 240, 241, 242, 246, 247, 248, 252, 253, 254};
Validating set: {144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 162, 163, 164, 165, 166, 167, 174, 175, 176, 177, 178, 179, 186, 187, 188, 189, 190, 191, 198, 199, 200, 201, 202, 203, 210, 211, 212, 213, 214, 215, 222, 223, 224, 225, 226, 227, 234, 235, 236, 240, 241, 242, 246, 247, 248, 252, 253, 254};
Testing set: {84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 102, 103, 104, 105, 106, 107, 114, 115, 116, 117, 118, 119, 126, 127, 128, 129, 130, 131, 138, 139, 140, 141, 142, 143, 150, 151, 152, 153, 154, 155, 162, 163, 164, 165, 166, 167, 174, 175, 176, 177, 178, 179, 186, 187, 188, 189, 190, 191}.

…the hair to dry – and they weigh 25 pounds when wet. She says that the extra weight of her hair makes her doctors very concerned. They seem to think that she has a curvature of her spine due to the length and weight of her hair." Before the experiment, participants were told that if they got caught they would receive a punishment, i.e., to fill in a long questionnaire containing more than 500 questions, so they had to try their best when making up a story (deviation stimuli) or pretending to be telling/reading a given one (baseline stimuli). The long questionnaire works as the 'punishment' or 'pressure', aiming to stimulate and elicit the emotional states and micro-gestures, and no actual punishment was conducted after the data collection.

Appendix C Relationship Between MGs and Subjects

We visualize the Pearson's correlation coefficients of the MG performing patterns of the 40 subjects in our SMG dataset, as shown in Fig. 11.

Appendix D Experimental Settings for MG Classification on SMG

In the practical implementation of the RGB modality based baselines, we trained all the models on the SMG dataset with the same protocol: 120 epochs are trained for the TSN, TRN, and TSM models; the batch size is set to 64 for TSN and TRN, and to 32 for TSM; the base learning rate is set to 0.001 for all three models, and the learning rate is scaled with a factor
of 0.1 at epoch 50 and 100, respectively. For C3D, R3D, and I3D, 60 epochs are trained for each model; the batch size is set to 128 for C3D and R3D, and to 48 for I3D; and the learning rate is set to 0.0002 for all models. The optimizer is SGD, consistent with the settings of all the models. The loss function is Categorical Cross Entropy. The training platform was PyTorch (Paszke et al. 2019) with a single GPU: NVidia Titan (24 GB).

In the practical implementation of the skeleton modality based baselines, we trained all the models on the SMG dataset with the same protocol: 30 training epochs (all fully converged), batch sizes of 32, a base learning rate of 0.05, and weight decay of 0.0005. Preprocessing was conducted for all the baselines: null-frame padding, translating to the center joint, and paralleling the joints to the corresponding axes. The input length is set to 60 frames. For all the remaining network hyperparameters, we kept their original settings (e.g., for MS-G3D, the numbers of GCN scales and G3D scales are kept as 13 and 6). The optimizer is SGD, consistent with the settings of all the models. The loss function is Categorical Cross Entropy. The training platform was a single GPU: NVidia Titan (24 GB).

For pretraining the models for the RGB modality, we used ResNet50 (He et al. 2016) as the backbone, pretrained on the Something-Something v2 (Goyal et al. 2017) dataset, as it has been commonly used for all three methods and the trained weights are available. The hyperparameters are set the same as in the original work.

Appendix E Experimental Details for Online Recognition

We first conduct the same pre-processing of the skeleton streams on the three validating datasets. Since pre-segmented clips, and the global temporal information of ongoing gestures, are not available in online recognition tasks, an efficient local temporal feature extraction is demanded. For skeleton joint feature extraction, we followed the work of Zanfir et al. (2013). "MovingPose" features utilize 3D position differences of the joints to generate spatio-temporal information with efficient dimensional requirements.

For the training phase, our AED-BiLSTM network was trained with a batch size of 32 and a learning rate of 0.01 (LR reduction factor 0.5, patience 3 epochs) for 20 epochs on the SMG dataset, and with a batch size of 64 and a learning rate of 0.01 (LR reduction factor 0.5, patience 3 epochs) for 80 epochs on the iMiGUE dataset and 40 epochs on the OAD dataset. The optimizer is RMSprop, following the setting of Chen et al. (2020). The structure of AED-BiLSTM is consistent with STABNet (Chen et al. 2020). Specifically, RMSprop is used as the optimizer. The structure of STABNet is given as: a two-layer BiLSTM with 2000 GRUs and 1000 GRUs, respectively. A spatial attention layer and a temporal attention layer are attached before and between the two BiLSTM layers, respectively. A dense layer of 1000 units is stacked with sigmoid activation on the BiLSTM layers, followed by an output layer with as many units as the total hidden state number (MG class number x HMM state number used for representing each MG, 16 x 5 in practice). Since the skeleton joints of the iMiGUE dataset are extracted with OpenPose (Cao et al. 2019), which contain noise, we filtered out all the training samples with null skeleton joints. For the compared methods, we use the same training scheme and ensure the models converge. The training time of a single BiLSTM is around three hours with over 18,000/8,000 (training/validating) frame-level samples in our SMG dataset, and around six hours with over 47,000/5,000 (training/validating) frame-level samples in the iMiGUE dataset. The loss function is Categorical Cross Entropy. The training platform was Tensorflow with a single GPU: NVidia Titan (24 GB).

Table 10 The online recognition performances of AED-BiLSTM under different threshold values of αth on the SMG dataset

Threshold of αth   10%     30%     50%     70%     90%
F1-score           0.312   0.203   0.087   0.030   0.004
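A simplified stand-in for the frame-level BiLSTM described above is sketched below in PyTorch (the paper's implementation uses Tensorflow and the STABNet architecture with spatial and temporal attention, which are omitted here); the feature dimension, hidden size, and toy data are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FrameBiLSTM(nn.Module):
    """Simplified frame-level classifier over all HMM states: a two-layer
    bidirectional LSTM with a linear head over M gesture states plus the
    extra "non-movement" state (cf. Eq. 3).  Sizes are illustrative."""
    def __init__(self, feat_dim=60, hidden=256, n_classes=17, states_per_class=5):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden, n_classes * states_per_class + 1)

    def forward(self, x):          # x: (batch, frames, feat_dim)
        h, _ = self.lstm(x)
        return self.head(h)        # per-frame logits over the HMM states

model = FrameBiLSTM()
optim = torch.optim.RMSprop(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
logits = model(torch.randn(4, 60, 60))             # toy skeleton feature batch
targets = torch.randint(0, 17 * 5 + 1, (4 * 60,))  # toy frame-level HMM-state labels
loss = loss_fn(logits.reshape(-1, logits.shape[-1]), targets)
loss.backward()
optim.step()
```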
For testing and post-processing, the threshold on the minimal number of frames used to filter out noisy gestures is set to 14 frames for all the methods. The values of λ and μ are set to -2.5/2.1, -1.0/3.0 and -0.2/2.0 for the SMG, iMiGUE and OAD datasets, respectively.

Appendix F Ablation Study of Online Recognition

Although our AED is parameter-free and can be directly exploited at inference, the value settings of λ and μ will affect the performance of the AED. Thus, we present the ablation study of AED-BiLSTM under different values of λ and μ, as shown in Tables 8 and 9. Note that λ affects the inhibition of the "non-movement" states, which determines the segmentation results. Meanwhile, μ, controlling the attention on the MGs, affects the classification results. Thus, we fix μ at 0.0 to conduct the ablation study and obtain the best value of λ, and then obtain the best value of μ with the obtained λ.

The online recognition performances of AED-BiLSTM under different threshold values of the overlapping ratio αth are shown in Table 10. The higher αth is, the more challenging the task, as it requires more accurate temporal allocation of the frame boundaries. When it comes to 90%, the temporal allocation of the MGs must be extremely accurate. This is especially challenging due to the subtle and swift nature of MGs.

…between gestures and stress states. All the settings used Cross Entropy as the loss function.

Full Context Recognition. In the practical implementation, we trained the baseline methods with the same protocols as for the classification task (e.g., training epoch number, batch size, etc.). Besides, the input is the long skeleton sequence of an emotional state instance with a frame number of 90, obtained via linear down-sampling. The dimensions of the output layers of the networks are modified to two, in relation to the two emotional states.

MG-based Context Recognition. We construct the MG-based representation for hidden emotional recognition. The transition of the middle state (non-movements) is enabled, and the transition direction is enabled. A Bayesian prior is added. The Sequential Forward Floating Selection (SFFS) strategy was used to select the MGs with the most contributions. From SFFS, "Turtling neck and shoulder", "Rubbing eyes and forehead", "Folding arms behind body" and "Arms akimbo" are the most contributing features for emotional state recognition under the subject-independent protocol; meanwhile, "Rubbing eyes and forehead", "Moving legs", "Arms akimbo" and "Scratching or touching facial parts other than eyes" are the most contributing features under the semi-subject-independent protocol.
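As an illustration of the feature-selection step mentioned above, the sketch below performs a plain greedy forward selection over per-video MG frequency features; the floating (conditional exclusion) step of full SFFS and the actual classifier are omitted, and the data here are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def greedy_forward_selection(X, y, k=4):
    """Greedy forward selection of k MG features (simplified SFFS without
    the conditional exclusion step)."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k:
        scored = []
        for f in remaining:
            cols = selected + [f]
            score = cross_val_score(LogisticRegression(max_iter=1000),
                                    X[:, cols], y, cv=5).mean()
            scored.append((score, f))
        best_score, best_f = max(scored)
        selected.append(best_f)
        remaining.remove(best_f)
    return selected

# Placeholder data: per-video frequencies of the 17 MGs and SES/NES labels.
X = np.random.rand(100, 17)
y = np.random.randint(0, 2, 100)
print(greedy_forward_selection(X, y, k=4))
```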
References

Aviezer, H., Trope, Y., & Todorov, A. (2012). Body cues, not facial expressions, discriminate between intense positive and negative emotions. Science, 338(6111), 1225-1229.
Burgoon, J., Buller, D., & Woodall, W. G. (1994). Nonverbal communication: The unspoken dialogue. Greyden Press.
Cao, Z., Hidalgo, G., Simon, T., Wei, S. E., & Sheikh, Y. (2019). OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(1), 172-186.
Carreira, J., & Zisserman, A. (2017). Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 6299-6308).
Chen, H., Liu, X., Li, X., Shi, H., & Zhao, G. (2019). Analyze spontaneous gestures for emotional stress state recognition: A micro-gesture dataset and analysis with deep learning. In Proceedings of the IEEE international conference on automatic face & gesture recognition (pp. 1-8).
Chen, H., Liu, X., Shi, J., & Zhao, G. (2020). Temporal hierarchical dictionary guided decoding for online gesture segmentation and recognition. IEEE Transactions on Image Processing, 29, 9689-9702.
Cheng, K., Zhang, Y., He, X., Chen, W., Cheng, J., & Lu, H. (2020). Skeleton-based action recognition with shift graph convolutional network. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 183-192).
Crasto, N., Weinzaepfel, P., Alahari, K., & Schmid, C. (2019). MARS: Motion-augmented RGB stream for action recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
de Becker, G. (1997). The gift of fear. Dell Publishing.
de Lara, N., & Pineau, E. (2018). A simple baseline algorithm for graph classification. In Relational representation learning workshop, the conference on neural information processing systems.
Ekman, P. (2004). Darwin, deception, and facial expression. Annals of the New York Academy of Sciences, 1000, 205-21.
Ekman, R. (1997). What the face reveals: Basic and applied studies of spontaneous expression using the Facial Action Coding System (FACS). Oxford University Press.
El Ayadi, M., Kamel, M. S., & Karray, F. (2011). Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern Recognition, 44(3), 572-587.
Escalera, S., Baró, X., Gonzàlez, J., Bautista, M. A., Madadi, M., Reyes, M., Ponce-López, V., Escalante, H. J., Shotton, J., & Guyon, I. (2015). ChaLearn looking at people challenge 2014: Dataset and results. In Proceedings of the European conference on computer vision (pp. 459-473).
Fix, E., & Hodges, J. L. (1989). Discriminatory analysis. Nonparametric discrimination: Consistency properties. International Statistical Review/Revue Internationale de Statistique, 57(3), 238-247.
Ginevra, C., Loic, K., & George, C. (2008). Emotion recognition through multiple modalities: Face, body gesture, speech (pp. 92-103). Springer.
Goyal, R., Ebrahimi Kahou, S., Michalski, V., Materzynska, J., Westphal, S., Kim, H., Haenel, V., Fruend, I., Yianilos, P., Mueller-Freitag, M., et al. (2017). The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE international conference on computer vision (pp. 5842-5850).
Gray, J. A. (1982). Précis of the neuropsychology of anxiety: An enquiry into the functions of the septo-hippocampal system. Behavioral and Brain Sciences, 5(3), 469-484.
Gu, Y., Mai, X., & Luo, Y. (2013). Do bodily expressions compete with facial expressions? Time course of integration of emotional signals from the face and the body. PLOS ONE, 8(7), 1-9.
Gunes, H., & Piccardi, M. (2006). A bimodal face and body gesture database for automatic analysis of human nonverbal affective behavior. In 18th international conference on pattern recognition (Vol. 1, pp. 1148-1153).
Hara, K., Kataoka, H., & Satoh, Y. (2018). Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 6546-6555).
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).
Ho, T. K. (1995). Random decision forests. In Proceedings of the international conference on document analysis and recognition (Vol. 1, pp. 278-282).
Khan, R. Z., & Ibraheem, N. A. (2012). Hand gesture recognition: A literature review. International Journal of Artificial Intelligence & Applications, 3(4), 161.
Kipp, M., & Martin, J. C. (2009). Gesture and emotion: Can basic gestural form features discriminate emotions? In International conference on affective computing and intelligent interaction and workshops (pp. 1-8).
Kita, S., Alibali, M., & Chu, M. (2017). How do gestures influence thinking and speaking? The gesture-for-conceptualization hypothesis. Psychological Review, 124, 245-266.
Krakovsky, M. (2018). Artificial (emotional) intelligence. Communications of the ACM, 61(4), 18-19.
Kuehne, H., Richard, A., & Gall, J. (2019). A hybrid RNN-HMM approach for weakly supervised temporal action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Kuhnke, E. (2009). Body language for dummies. Wiley.
Li, S., & Deng, W. (2020). Deep facial expression recognition: A survey. IEEE Transactions on Affective Computing.
Li, Y., Lan, C., Xing, J., Zeng, W., Yuan, C., & Liu, J. (2016). Online human action detection using joint classification-regression recurrent neural networks. In Proceedings of the European conference on computer vision.
Lin, J., Gan, C., & Han, S. (2019). TSM: Temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 7083-7093).
Liu, J., Shahroudy, A., Xu, D., & Wang, G. (2016). Spatio-temporal LSTM with trust gates for 3D human action recognition. In Proceedings of the European conference on computer vision.
Liu, J., Shahroudy, A., Wang, G., Duan, L. Y., & Kot, A. C. (2018). SSNet: Scale selection network for online 3D action prediction. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Liu, J., Wang, G., Hu, P., Duan, L. Y., & Kot, A. C. (2017). Global context-aware attention LSTM networks for 3D action recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Liu, X., Shi, H., Chen, H., Yu, Z., Li, X., & Zhao, G. (2021). iMiGUE: An identity-free video dataset for micro-gesture understanding and emotion analysis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10631-10642).
Liu, Z., Zhang, H., Chen, Z., Wang, Z., & Ouyang, W. (2020). Disentangling and unifying graph convolutions for skeleton-based action recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 143-152).
Luo, Y., Ye, J., Adams, R. B., Li, J., Newman, M. G., & Wang, J. Z. (2020). ARBEE: Towards automated recognition of bodily expression of emotion in the wild. International Journal of Computer Vision, 128(1), 1-25.
Mahmoud, M., Baltrušaitis, T., Robinson, P., & Riek, L. D. (2011). 3D corpus of spontaneous complex mental states. In International conference on affective computing and intelligent interaction (pp. 205-214).
Navarro, J., & Karlins, M. (2008). What every BODY is saying: An ex-FBI agent's guide to speed reading people. Collins.
Neverova, N., Wolf, C., Taylor, G., & Nebout, F. (2016). ModDrop: Adaptive multi-modal gesture recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(8).
Noroozi, F., Kaminska, D., Corneanu, C., Sapinski, T., Escalera, S., & Anbarjafari, G. (2018). Survey on emotional body gesture recognition. IEEE Transactions on Affective Computing.
Oh, S. J., Benenson, R., Fritz, M., & Schiele, B. (2016). Faceless person recognition: Privacy implications in social media. In Proceedings of the European conference on computer vision (pp. 19-35).
Palena, N., Caso, L., Vrij, A., & Orthey, R. (2018). Detecting deception through small talk and comparable truth baselines. Journal of Investigative Psychology and Offender Profiling, 15.
Panksepp, J. (1998). Affective neuroscience: The foundations of human and animal emotions. Oxford University Press.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. (2019). PyTorch: An imperative style, high-performance deep learning library. In Advances in neural information processing systems.
Peng, W., Hong, X., Chen, H., & Zhao, G. (2020). Learning graph convolutional network for skeleton-based human action recognition by neural searching. In Proceedings of the AAAI conference on artificial intelligence.
Pentland, A. (2008). Honest signals: How they shape our world. MIT Press.
Pouw, W. T., Mavilidi, M. F., Van Gog, T., & Paas, F. (2016). Gesturing during mental problem solving reduces eye movements, especially for individuals with lower visual working memory capacity. Cognitive Processing, 17(3), 269-277.
Richard, A., Kuehne, H., Iqbal, A., & Gall, J. (2018). NeuralNetwork-Viterbi: A framework for weakly supervised video learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533-536.
Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., & Monfardini, G. (2008). The graph neural network model. IEEE Transactions on Neural Networks, 20(1), 61-80.
Schapire, R. E. (2013). Explaining AdaBoost. In Empirical inference (pp. 37-52). Springer.
Schindler, K., Van Gool, L., & De Gelder, B. (2008). Recognizing emotions expressed by body pose: A biologically inspired neural model. Neural Networks, 21(9), 1238-1246.
Serge, G. (1995). International glossary of Gestalt psychotherapy. FORGE.
Shahroudy, A., Liu, J., Ng, T. T., & Wang, G. (2016). NTU RGB+D: A large scale dataset for 3D human activity analysis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
Shi, L., Zhang, Y., Cheng, J., & Lu, H. (2019). Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 12026-12035).
Shiffrar, M., Kaiser, M., & Chouchourelou, A. (2011). Seeing human movement as inherently social. The Science of Social Vision.
Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R., Kipman, A., & Blake, A. (2011). Real-time human pose recognition in parts from single depth images. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 1297-1304).
Soomro, K., Zamir, A. R., & Shah, M. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.
Sun, S., Kuang, Z., Sheng, L., Ouyang, W., & Zhang, W. (2018). Optical flow guided feature: A fast and robust motion representation for video action recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
Tran, D., Bourdev, L., Fergus, R., Torresani, L., & Paluri, M. (2015). Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 4489-4497).
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems 30.
Vrij, A., Leal, S., Jupe, L., & Harvey, A. (2018). Within-subjects verbal lie detection measures: A comparison between total detail and proportion of complications. Legal and Criminological Psychology, 23(2), 265-279.
Vrij, A., Mann, S., Leal, S., & Fisher, R. P. (2020). Combining verbal veracity assessment techniques to distinguish truth tellers from lie tellers. European Journal of Psychology Applied to Legal Context, 13(1), 9-19.
Wallbott, H. G. (1998). Bodily expression of emotion. European Journal of Social Psychology, 28(6), 879-896.
Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., & Van Gool, L. (2018). Temporal segment networks for action recognition in videos. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(11), 2740-2755.
Wu, D., Pigou, L., Kindermans, P. J., Le, N. D. H., Shao, L., Dambre, J., & Odobez, J. M. (2016). Deep dynamic neural networks for multimodal gesture segmentation and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(8).
Xu, M., Gao, M., Chen, Y. T., Davis, L. S., & Crandall, D. J. (2019a). Temporal recurrent networks for online action detection. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 5532-5541).
Xu, M., Gao, M., Chen, Y. T., Davis, L. S., & Crandall, D. J. (2019b). Temporal recurrent networks for online action detection. In Proceedings of the IEEE/CVF international conference on computer vision.
Yan, S., Xiong, Y., & Lin, D. (2018). Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the AAAI conference on artificial intelligence (Vol. 32).
You, Y., Chen, T., Wang, Z., & Shen, Y. (2020). L2-GCN: Layer-wise and learned efficient training of graph convolutional networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 2127-2135).
Yu, N. (2008). Metaphor from body and culture. In The Cambridge handbook of metaphor and thought (pp. 247-261).
Yu, Z., Zhou, B., Wan, J., Wang, P., Chen, H., Liu, X., Li, S. Z., & Zhao, G. (2020). Searching multi-rate and multi-modal temporal enhanced networks for gesture recognition. IEEE Transactions on Image Processing.
Zanfir, M., Leordeanu, M., & Sminchisescu, C. (2013). The moving pose: An efficient 3D kinematics descriptor for low-latency action recognition and detection. In Proceedings of the IEEE/CVF international conference on computer vision.
Zhang, Y., Pal, S., Coates, M., & Ustebay, D. (2019). Bayesian graph convolutional neural networks for semi-supervised classification. In Proceedings of the AAAI conference on artificial intelligence (Vol. 33, pp. 5829-5836).

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.