Multi-Task Deep Learning For Real-Time 3D Human Pose Estimation and Action Recognition

Diogo C. Luvizon, David Picard, and Hedi Tabia

IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 43, No. 8, August 2021

D. C. Luvizon is with the SAMSUNG Research Institute, Campinas, SP 13097-104, Brazil. E-mail: [email protected].
D. Picard is with the LIGM, IMAGINE, Ecole des Ponts, Univ Gustave Eiffel, CNRS, 77455 Marne-la-Vallee, France. E-mail: [email protected].
H. Tabia is with the IBISC, Univ Evry, Universite Paris-Saclay, 91025 Evry, France. E-mail: [email protected].

Manuscript received 8 Feb. 2019; revised 12 Feb. 2020; accepted 16 Feb. 2020. Date of publication 24 Feb. 2020; date of current version 1 July 2021. (Corresponding author: Diogo C. Luvizon.) Recommended for acceptance by J. Sivic. Digital Object Identifier no. 10.1109/TPAMI.2020.2976014

Abstract—Human pose estimation and action recognition are related tasks, since both problems depend strongly on the representation and analysis of the human body. Nonetheless, most recent methods in the literature handle the two problems separately. In this article, we propose a multi-task framework for jointly estimating 2D or 3D human poses from monocular color images and classifying human actions from video sequences. We show that a single architecture can be used to solve both problems efficiently and still achieve state-of-the-art or comparable results on each task, while running at more than 100 frames per second. The proposed method benefits from a high degree of parameter sharing between the two tasks by unifying still-image and video-clip processing in a single pipeline, allowing the model to be trained with data from different categories simultaneously and in a seamless way. Additionally, we provide important insights for end-to-end training of the proposed multi-task model by decoupling key prediction parts, which consistently leads to better accuracy on both tasks. The reported results on four datasets (MPII, Human3.6M, Penn Action and NTU RGB+D) demonstrate the effectiveness of our method on the targeted tasks. Our source code and trained weights are publicly available at https://github.com/dluvizon/deephar.

Index Terms—Human action recognition, human pose estimation, multitask deep learning, neural networks

1 INTRODUCTION

Human action recognition has been intensively studied in recent years, both because it is a very challenging problem and because of the many applications that can benefit from it. Similarly, human pose estimation has rapidly progressed with the advent of powerful methods based on convolutional neural networks (CNNs) and deep learning. Despite the fact that action recognition benefits from precise body poses, the two problems are usually handled as distinct tasks in the literature [1], or action recognition is used as a prior for pose estimation [2], [3]. To the best of our knowledge, there is no recent method in the literature that tackles both problems jointly to the benefit of action recognition. In this paper, we propose a single end-to-end trainable multi-task framework that handles human pose estimation and action recognition jointly, as illustrated in Fig. 1.

One of the major advantages of deep learning methods is their capability to perform end-to-end optimization. This is all the more true for multi-task problems, where related tasks can benefit from one another, as suggested by Kokkinos [4]. Action recognition and pose estimation are, however, usually hard to stitch together into a beneficial joint optimization, typically requiring 3D convolutions [5] or heat map transformations [6]. Detection based approaches rely on the non-differentiable argmax function to recover the joint coordinates as a post-processing stage, which breaks the backpropagation chain needed for end-to-end learning. We propose to solve this problem by extending the differentiable soft-argmax [7], [8] to joint 2D and 3D pose estimation. This allows us to stack action recognition on top of pose estimation, resulting in a multi-task framework trainable end-to-end.

In comparison with our previous work [9], we propose a new network architecture carefully designed to predict poses and actions simultaneously at different feature map resolutions. Each prediction is supervised and re-injected into the network for further refinement. Differently from [9], where we first predict poses and then actions, here poses and actions are predicted in parallel and successively refined, strengthening the multi-task aspect of our method. Another improvement is the proposed depth estimation approach for 3D poses, which allows us to avoid learning costly volumetric heat maps while improving the overall accuracy of the method.

The main contributions of our work are as follows. First, we propose a new multi-task method for jointly estimating 2D/3D human poses and recognizing the associated actions. Our method is trained end-to-end for both tasks simultaneously with multimodal data, including still images and video clips. Second, we propose a new regression approach for 3D pose estimation from single frames, benefiting at the same time from "in-the-wild" images with 2D annotated poses and from 3D data. This has proven to be a very efficient way to learn good visual features, which is also very important for action recognition. Third, our action recognition approach is based only on RGB images, from which we extract 3D poses and visual information. Despite that, our multi-task method achieves state-of-the-art results in both 2D and 3D scenarios, even when compared with methods using ground-truth poses. Fourth, the proposed network architecture is scalable without any additional training procedure, which allows us to choose the right trade-off between speed and accuracy a posteriori. Finally, we show that the hard problem of multi-tasking pose estimation and action recognition can be tackled efficiently by a single, carefully designed architecture, handling both problems together and better than separately. As a result, our method provides acceptable pose and action predictions at more than 180 frames per second (FPS), while achieving its best scores at 90 FPS on a consumer GPU.

The remainder of this paper is organized as follows. In Section 2 we review the most relevant works related to our method. The proposed multi-task framework is presented in Section 3. Extensive experiments on both pose estimation and action recognition are presented in Section 4, followed by our conclusions in Section 5.
Fig. 1. The proposed multi-task approach for human pose estimation and action recognition. Our method provides 2D/3D pose estimation from single images or frame sequences. Pose and visual information are used to predict actions in a unified framework, and both predictions are refined by K prediction blocks.

2 RELATED WORK

In this section, we present some of the most relevant methods related to our work, divided into human pose estimation and action recognition. Since an extensive literature review is out of the scope of this paper, we encourage the reader to refer to the surveys in [10], [11] for pose estimation and action recognition, respectively.

2.1 Human Pose Estimation

2.1.1 2D Pose Estimation

The problem of human pose estimation has been intensively studied in the last years, from Pictorial Structures [12], [13], [14] to more recent CNN based approaches [15], [16], [17], [18], [19], [20], [21], [22], [23], [24]. We can identify in the literature two distinct families of methods for pose estimation: detection based and regression based methods. Recent detection methods handle pose estimation as a heat map prediction problem, where each pixel in a heat map represents the detection score of a given body joint being localized at this pixel [25], [26]. Exploring the concepts of stacked architectures, residual connections, and multiscale processing, Newell et al. [27] proposed the Stacked Hourglass (SHG) networks, which improved scores on 2D pose estimation challenges significantly. Since then, state-of-the-art methods have frequently proposed complex variations of the SHG architecture. For example, Chu et al. [28] proposed an attention model based on a conditional random field (CRF), and Yang et al. [29] replaced the residual unit from SHG by the Pyramid Residual Module (PRM). Very recently, [30] proposed a high-resolution network that keeps a high-resolution flow, resulting in more precise predictions. With the emergence of Generative Adversarial Networks (GANs) [31], Chou et al. [32] proposed to use a discriminative network to distinguish between estimated and target heat maps. This process can increase the quality of predictions, since the generator is stimulated to produce more plausible predictions. Another application of GANs in that sense is to enforce the structural representation of the human body [33].

However, all the previously mentioned detection based approaches do not provide body joint coordinates directly. To recover the body joints in (x, y) coordinates, predicted heat maps have to be converted to joint positions, generally using the argument of the maximum a posteriori probability (MAP), called argmax. On the other hand, regression based approaches use a nonlinear function to project the input image directly to the desired output, which can be the joint coordinates. Following this paradigm, Toshev and Szegedy [23] proposed a holistic solution based on cascaded body part regression, and Carreira et al. [34] proposed the Iterative Error Feedback. The limitation of current regression methods is that the regression function is frequently sub-optimal. In order to tackle this weakness, the soft-argmax function [7] has been proposed to compute body joint coordinates from heat maps in a differentiable way.

2.1.2 3D Pose Estimation

Recently, deep architectures have been used to learn 3D representations from RGB images [35], [36], [37], [38], [39], [40] thanks to the availability of highly precise 3D data [41], and are now able to surpass depth sensors [42]. Chen and Ramanan [43] divided the problem of 3D pose estimation into two parts: first, they target 2D pose estimation in the camera coordinates, and second, the 2D estimated poses are matched to 3D representations by means of a nonparametric shape model. However, this is an ill-defined problem, since two different 3D poses can have the same 2D projection. Other methods propose to regress the 3D relative position of joints, which usually presents a lower variance than the absolute position. For example, Sun et al. [44] proposed a bone representation of the human body. However, since the errors are accumulative, such a structural transformation might affect tasks that depend on the extremities of the human body, like action recognition.

Pavlakos et al. [45] proposed the volumetric stacked hourglass architecture, but the method suffers from a significant increase in the number of parameters and from the memory required to store all the gradients. A similar technique is used in [46], but instead of using argmax for coordinate estimation, the authors use a numerical integral regression, which is similar to the soft-argmax operation [9]. More recently, Yang et al. [47] proposed to use adversarial networks to distinguish between generated and ground truth poses, improving predictions in uncontrolled environments. Differently from our previous work in [9], we show that a volumetric representation is not required for 3D prediction. Similarly to methods on hand pose estimation [48] and on 3D human pose estimation [42], we predict 2D depth maps which encode the relative depth of each body joint.

Fig. 2. Overview of the proposed multi-task network architecture. The entry-flow extracts feature maps from the input images, which are fed through a sequence of CNNs composed of prediction blocks (PB), downscaling and upscaling units (DU and UU), and simple (skip) connections. Each PB outputs supervised pose and action predictions that are refined by further blocks and units. The information flows related to pose estimation and action recognition are independently propagated from one prediction block to another, depicted by blue and red arrows, respectively. See Fig. 3 and Fig. 4 for details about DU, UU, and PB.

2.2 Action Recognition

2.2.1 2D Action Recognition

In this section we revisit methods that exploit pose information for action recognition. For example, classical methods for feature extraction have been used in [49], [50], where the key idea is to use body joint locations to select visual features in space and time. 3D convolutions have been stated as the best option to handle the temporal dimension of image sequences [51], [52], [53], but they involve a high number of parameters and cannot efficiently benefit from the abundant still images during training. Another option to integrate the temporal aspect is to analyse motion from image sequences [1], [54], but these methods require the difficult estimation of optical flow. Unconstrained temporal and spatial analysis is also a promising approach to tackle action recognition, since it is very likely that, in a sequence of frames, some very specific regions in a few frames are more relevant than the remaining parts. Inspired by this observation, Baradel et al. [55] proposed an attention model called Glimpse Clouds, which learns to focus on specific image patches in space and time, aggregating the patterns and soft-assigning each feature to workers that contribute to the final action decision. The influence of occlusions can be alleviated by multi-view videos [56], and inaccurate pose sequences can be replaced by heat maps for better accuracy [57]. However, this improvement is not observed when pose predictions are sufficiently precise.

2D action recognition methods usually use the body joint information only to extract localized visual features [1], [49], as an attention mechanism. Methods that directly explore the body joints usually do not generate them [50] or present lower precision with estimated poses [51]. Our approach removes these limitations by performing pose estimation together with action recognition. As such, our model only needs the input RGB frames, while still performing discriminative visual recognition guided by the estimated body joints.

2.2.2 3D Action Recognition

Differently from video based action recognition, 3D action recognition is mostly based on skeleton data as the primary information [58], [59]. With depth sensors such as the Microsoft Kinect, it is possible to capture 3D skeletal data without the complex installation procedure frequently required for motion capture (MoCap) systems. However, due to the required infrared projector, depth sensors are limited to indoor environments, have a low range of operation, and are not robust to occlusions, frequently resulting in noisy skeletons. To cope with the noisy skeletons, Spatio-Temporal LSTM networks [60] have been widely used to learn the reliability of skeleton sequences or as an attention mechanism [61], [62]. In addition to the skeleton data, multimodal approaches can also benefit from visual cues [63]. In that direction, pose-conditioned attention mechanisms have been proposed [64] to focus on image patches centered around the hands.

Since our architecture predicts precise 3D poses from RGB frames, we do not have to cope with the noisy skeletons from Kinect. Moreover, we show in the experiments that, despite being based on temporal convolutions instead of the more common LSTM, our system is able to reach state-of-the-art performance on 3D action recognition, indicating that action recognition does not necessarily require long-term memory.

3 PROPOSED MULTI-TASK APPROACH

The goal of the proposed method is to jointly handle human pose estimation and action recognition, prioritizing the use of predicted poses for action recognition and benefiting from shared computations between the two tasks. For convenience, we define the input of our method as either a still RGB image $I \in \mathbb{R}^{H \times W \times 3}$ or a video clip (sequence of images) $V \in \mathbb{R}^{T \times H \times W \times 3}$, where $T$ is the number of frames in a video clip and $H \times W$ is the frame size. This distinction is important because we handle pose estimation as a single frame problem. The outputs of our method for each frame are a predicted human pose $\hat{p} \in \mathbb{R}^{N_j \times 3}$ and per body joint confidence scores $\hat{c} \in \mathbb{R}^{N_j \times 1}$, where $N_j$ is the number of body joints. When taking a video clip as input, the method also outputs a vector of action probabilities $\hat{a} \in \mathbb{R}^{N_a \times 1}$, where $N_a$ is the number of action classes. To simplify notation, in this section we omit batch normalization layers and ReLU activations, which are used between convolutional layers as a common practice in deep neural networks.
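To make this notation concrete, the snippet below sketches the corresponding tensor shapes with NumPy arrays; the particular values of T, N_j, and N_a are placeholders for the example only and are not part of the method.

```python
import numpy as np

# Placeholder sizes, chosen only for this example.
T, H, W = 8, 256, 256    # clip length and frame size
N_j, N_a = 16, 15        # number of body joints and of action classes

I = np.zeros((H, W, 3), dtype=np.float32)         # still RGB image
V = np.zeros((T, H, W, 3), dtype=np.float32)      # video clip

# Per-frame outputs: 3D pose and per-joint confidence scores.
p_hat = np.zeros((N_j, 3), dtype=np.float32)
c_hat = np.zeros((N_j, 1), dtype=np.float32)
# Clip-level output: action probabilities.
a_hat = np.zeros((N_a, 1), dtype=np.float32)

# In video-clip mode, the T frames are simply handled by the
# single-frame layers as a batch of T samples.
frames_as_batch = V.reshape(-1, H, W, 3)          # shape (T, H, W, 3)
```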
3.1 Network Architecture

Differently from our previous work [9], where poses and actions are predicted sequentially, here we want to strengthen the multi-task aspect of our method by predicting and refining poses and actions in parallel. This is implemented by the proposed architecture, illustrated in Fig. 2. Input images are fed through the entry-flow, which extracts low level visual features.

Fig. 3. Network elementary units: (a) residual unit (RU), (b) downscaling unit (DU), and (c) upscaling unit (UU). $N_{f_{in}}$ and $N_{f_{out}}$ represent the input and output number of features, $H_f \times W_f$ is the feature map size, and $k$ is the filter size.

The extracted features are then processed by a sequence of downscaling and upscaling pyramids indexed by $p \in \{1, 2, \ldots, P\}$, which are respectively composed of downscaling and upscaling units (DU and UU) and prediction blocks (PB), indexed by $l \in \{1, 2, \ldots, L\}$. Each PB is supervised on pose and action predictions, which are then re-injected into the network, producing a new feature map that is refined by further downscaling and upscaling pyramids. Downscaling and upscaling units are respectively composed of maxpooling and upsampling layers followed by a residual unit, which is a standard or a depthwise separable convolution [65] with a skip connection. These units are detailed in Fig. 3.

In order to handle human poses and actions in a unified framework, the network can operate in two distinct modes: (i) single frame processing or (ii) video clip processing. In the first operational mode (single frame), only the layers related to pose estimation are active, whose connections correspond to the blue arrows in Fig. 2. In the second operational mode (video clip), both pose estimation and action recognition layers are active. In this case, layers in the single frame processing part handle each video frame as a single sample in the batch. Independently of the operational mode, pose estimation is always performed from single frames, which prevents the method from depending on temporal information for this task. For video clip processing, the information flows from single frame processing (pose estimation) and from video clip processing (action recognition) are independently propagated from one prediction block to another, as illustrated in Fig. 2 by blue and red arrows, respectively.

3.1.1 Multi-Task Prediction Block

The main challenges related to the design of the network architecture are how to handle multimodal data (single frames and video clips) in a unified way and how to allow prediction refinement for both poses and actions. To this end, we propose a multi-task prediction block (PB), detailed in Fig. 4. In the PB, pose and action are simultaneously predicted and re-injected into the network for further refinement. In the global architecture, each PB is indexed by pyramid $p$ and level $l$, and produces the following three feature maps:

$$X_t^{p,l} \in \mathbb{R}^{H_f \times W_f \times N_f} \quad (1)$$
$$Z_t^{p,l} \in \mathbb{R}^{H_f \times W_f \times N_f} \quad (2)$$
$$Y^{p,l} \in \mathbb{R}^{T \times N_j \times N_v}. \quad (3)$$

Fig. 4. Network architecture of prediction blocks (PB) for a downscaling pyramid. With the exception of the PB in the first pyramid, all PB get as input features from the previous pyramid at the same level ($X_t^{p-1,l}$, $Y^{p-1,l}$), and features from lower or higher levels ($X_t^{p,l-1}$, $Y^{p,l-1}$), depending on whether they compose a downscaling or an upscaling pyramid, respectively.

Namely, $X_t^{p,l}$ is a tensor of single frame features, which is propagated from one PB to another, $Z_t^{p,l}$ is a tensor of multi-task (single frame) features used for both pose and action, and $Y^{p,l}$ is a tensor of video clip features, exclusively used for action predictions and also propagated from one PB to another. $t = \{1, \ldots, T\}$ is the index of single frames in a video clip, and $N_f$ and $N_v$ are respectively the sizes of single frame features and video clip features.

For pose estimation, prediction blocks take as input the single frame features $X_t^{p-1,l}$ from the previous pyramid and the features $X_t^{p,l-1}$ from lower or higher levels, respectively for downscaling and upscaling pyramids. A similar propagation of previous features $Y^{p-1,l}$ and $Y^{p,l-1}$ happens for action. Note that both $X_t^{p,l}$ and $Y^{p,l}$ feature maps are three-dimensional tensors (2D maps plus channels) that can be easily handled by 2D convolutions.

The tensor of multi-task features is defined by

$$Z_t'^{p,l} = \mathrm{RU}(X_t^{p-1,l} + \mathrm{DU}(X_t^{p,l-1})) \quad (4)$$
$$Z_t^{p,l} = W_z^{p,l} \ast Z_t'^{p,l}, \quad (5)$$

where DU is the downscaling unit (replaced by UU for upscaling pyramids), RU is the residual unit, $\ast$ is a convolution, and $W_z^{p,l}$ is a weight matrix. The choice of including a residual unit in Equation (4) was inspired by [27] and prevents $Z_t'^{p,l}$ from becoming a direct summation of its previous terms. Then, $Z_t^{p,l}$ is used to produce body joint probability maps

$$h_t^{p,l} = \Phi(W_h^{p,l} \ast Z_t^{p,l}), \quad (6)$$

and body joint depth maps

$$d_t^{p,l} = \mathrm{Sigmoid}(W_d^{p,l} \ast Z_t^{p,l}), \quad (7)$$

where $\Phi$ is the spatial softmax [7], and $W_h^{p,l}$ and $W_d^{p,l}$ are weight matrices. Probability maps and body joint depth maps encode, respectively, the probability of a body joint being at a given location and the depth with respect to the root joint, normalized in the interval $[0, 1]$. Both $h_t^{p,l}$ and $d_t^{p,l}$ have shape $\mathbb{R}^{H_f \times W_f \times N_j}$.
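As an illustration of Equations (4) through (7), the following NumPy sketch reproduces the prediction block computations for one frame. The residual and downscaling units are reduced to stand-in functions, the 1x1 convolutions become per-pixel matrix products, and all sizes and weights are arbitrary; this is a sketch of the equations, not the released implementation.

```python
import numpy as np

def conv1x1(x, w):
    # 1x1 convolution as a per-pixel linear map: (H, W, C_in) x (C_in, C_out).
    return x @ w

def spatial_softmax(x):
    # Softmax over the spatial dimensions, independently per channel (joint).
    e = np.exp(x - x.max(axis=(0, 1), keepdims=True))
    return e / e.sum(axis=(0, 1), keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def residual_unit(x):
    # Stand-in for RU (Fig. 3a); the real unit is a (separable) convolution
    # with a skip connection.
    return x + 0.1 * np.tanh(x)

def downscaling_unit(x):
    # Stand-in for DU; the real unit also changes the spatial resolution.
    return x

# Toy sizes (hypothetical, for illustration only).
Hf, Wf, Nf, Nj = 32, 32, 288, 16

# Inputs to the block: features from the previous pyramid (same level)
# and from the lower/higher level of the current pyramid.
X_prev_pyr = np.random.randn(Hf, Wf, Nf).astype(np.float32)
X_prev_lvl = np.random.randn(Hf, Wf, Nf).astype(np.float32)

W_z = np.random.randn(Nf, Nf).astype(np.float32) * 0.01
W_h = np.random.randn(Nf, Nj).astype(np.float32) * 0.01
W_d = np.random.randn(Nf, Nj).astype(np.float32) * 0.01

Z_prime = residual_unit(X_prev_pyr + downscaling_unit(X_prev_lvl))   # Eq. (4)
Z = conv1x1(Z_prime, W_z)                                            # Eq. (5)
h = spatial_softmax(conv1x1(Z, W_h))                                 # Eq. (6), probability maps
d = sigmoid(conv1x1(Z, W_d))                                         # Eq. (7), depth maps
assert h.shape == d.shape == (Hf, Wf, Nj)
```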
3.2 Pose Regression

Once a set of body joint probability maps and depth maps is computed from the multi-task features, we aim to estimate the corresponding 3D points by a differentiable and non-parametrized function.
For that, we decouple the problem into 2D pose estimation and depth estimation, and the final 3D pose is the concatenation of the intermediate parts.

3.2.1 The Soft-Argmax Layer for 2D Estimation

Given a 2D input signal, the main idea is to consider that the argument of the maximum (argmax) can be approximated by the expectation of the input signal after it has been normalized to have the properties of a distribution. Indeed, for a sufficiently pointy (leptokurtic) distribution, the expectation should be close to the maximum a posteriori (MAP) estimate. For a 2D heat map as input, the normalized exponential function (softmax) can be used, since it alleviates the undesirable influence of values below the maximum and increases the "pointiness" of the resulting distribution, producing a probability map, as defined in Equation (6).

Let us define a single probability map for the $j$th joint as $h_j$, such that $h \equiv [h_1, \ldots, h_{N_j}]$. Then, the expected coordinates $(x_j, y_j)$ are given by the function $\Psi$:

$$\Psi(h_j) = \left( \sum_{c=0}^{W_h} \sum_{r=0}^{H_h} \frac{c}{W_h} h_{r,c},\; \sum_{c=0}^{W_h} \sum_{r=0}^{H_h} \frac{r}{H_h} h_{r,c} \right), \quad (8)$$

where $H_h \times W_h$ is the size of the input probability map, and $r$ and $c$ are the row and column indexes of $h$. According to Equation (8), the coordinates $(x_j, y_j)$ are constrained to the interval $[0, 1]$, which corresponds to the normalized limits of the input image.
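A minimal NumPy sketch of this soft-argmax applied to a single heat map; the grid size and peak location below are arbitrary example values.

```python
import numpy as np

def soft_argmax_2d(heatmap):
    """Soft-argmax of one heat map (Eq. 8): spatial softmax followed by the
    expectation of the normalized x and y coordinates."""
    Hh, Wh = heatmap.shape
    # Spatial softmax -> probability map h_j.
    e = np.exp(heatmap - heatmap.max())
    prob = e / e.sum()
    # Normalized column (x) and row (y) coordinate grids in [0, 1].
    xs = np.arange(Wh, dtype=np.float32) / Wh
    ys = np.arange(Hh, dtype=np.float32) / Hh
    x = (prob.sum(axis=0) * xs).sum()
    y = (prob.sum(axis=1) * ys).sum()
    return x, y, prob

# Example: a peaky map whose maximum is at row 20, column 8 of a 32x32 grid.
hm = np.zeros((32, 32), dtype=np.float32)
hm[20, 8] = 20.0
x, y, prob = soft_argmax_2d(hm)
print(round(float(x), 3), round(float(y), 3))   # close to 8/32 = 0.25 and 20/32 = 0.625
```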
3.2.2 Depth Estimation

Differently from our previous work [9], where volumetric heat maps were required to estimate the third dimension of body joints, here we use an approach similar to [48], where specialized depth maps $d$ are used to encode the depth information. Similarly to the probability map decomposition from Section 3.2.1, here we define $d_j$ as a depth map for the $j$th body joint. Thus, the regressed depth coordinate $z_j$ is defined by

$$z_j = \sum_{c=0}^{W_h} \sum_{r=0}^{H_h} h_{j,r,c}\, d_{j,r,c}. \quad (9)$$

Since $h_j$ is a normalized, unitary, and positive probability map, Equation (9) represents a spatially weighted pooling of the depth map $d_j$ based on the 2D body joint location.

3.2.3 Body Joint Confidence Scores

The probability of a certain body joint being present (even if occluded) in the image is computed as the maximum value of the corresponding probability map. Considering a pose layout with $N_j$ body joints, the estimated joint confidence vector is represented by $\hat{c} \in \mathbb{R}^{N_j \times 1}$. If the probability map is very pointy, this score is close to 1. On the other hand, if the probability map is uniform or has more than one region with a high response, the confidence score drops.
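The toy snippet below illustrates Equation (9) together with the max-based confidence score of Section 3.2.3, again with made-up arrays purely for demonstration.

```python
import numpy as np

def joint_depth_and_confidence(prob, depth_map):
    """Eq. (9): depth of one joint as a spatially weighted pooling of its
    depth map by its probability map, plus the max-based confidence score."""
    z = (prob * depth_map).sum()     # z_j in [0, 1]
    conf = prob.max()                # close to 1 for a very peaky map
    return z, conf

# Toy example: a probability map concentrated on one pixel and a depth map
# with value 0.7 at that pixel (shapes and values are illustrative only).
prob = np.zeros((32, 32), dtype=np.float32)
prob[20, 8] = 1.0
depth_map = np.full((32, 32), 0.3, dtype=np.float32)
depth_map[20, 8] = 0.7

z, conf = joint_depth_and_confidence(prob, depth_map)
print(float(z), float(conf))   # 0.7 and 1.0
```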
3.2.4 Pose Re-Injection

As systematically noted in recent works [25], [26], [27], [45], prediction re-injection is a very efficient way to improve the precision of estimated poses. Differently from all previous methods based on direct heat map regression, our approach can benefit from prediction re-injection at different resolutions, since our pose regression method is invariant to the feature map resolution. Specifically, in each PB at a different pyramid and level, we compute a new set of features $X_t^{p,l}$ based on features from previous blocks and on the current prediction, as follows:

$$X_t^{p,l} = W_r^{p,l} \ast h_t^{p,l} + W_s^{p,l} \ast d_t^{p,l} + Z_t^{p,l} + Z_t'^{p,l}, \quad (10)$$

where $W_r^{p,l}$ and $W_s^{p,l}$ are weight matrices related to the re-injection of 2D pose and depth information, respectively. With this approach, further PBs at different pyramids and levels are able to refine predictions, considering different sets of features at different resolutions.
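In the same NumPy style as the prediction block sketch above, Equation (10) amounts to mapping the predicted maps back to the feature space with 1x1 convolutions and summing; the weight shapes and values are illustrative only.

```python
import numpy as np

def conv1x1(x, w):
    return x @ w   # per-pixel linear map, as in the prediction block sketch

# Shapes as in the prediction block sketch (illustrative values).
Hf, Wf, Nf, Nj = 32, 32, 288, 16
h = np.random.rand(Hf, Wf, Nj).astype(np.float32)        # probability maps
d = np.random.rand(Hf, Wf, Nj).astype(np.float32)        # depth maps
Z = np.random.randn(Hf, Wf, Nf).astype(np.float32)       # multi-task features
Z_prime = np.random.randn(Hf, Wf, Nf).astype(np.float32)

W_r = np.random.randn(Nj, Nf).astype(np.float32) * 0.01  # re-injection weights
W_s = np.random.randn(Nj, Nf).astype(np.float32) * 0.01

# Eq. (10): predictions are mapped back to the feature space and added to the
# multi-task features, producing the single frame features for the next block.
X_next = conv1x1(h, W_r) + conv1x1(d, W_s) + Z + Z_prime
assert X_next.shape == (Hf, Wf, Nf)
```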
Fig. 5. Extraction of (a) pose and (b) appearance features.

3.3 Human Action Recognition

Another important advantage of our method is its ability to integrate high level pose information with low level visual features in a multi-task framework. This characteristic allows sharing the single frame processing pipeline for both pose estimation and visual feature extraction. Additionally, visual features are trained using both action sequences and still images captured "in-the-wild", which has proven to be a very efficient way to learn robust visual representations. As shown in Fig. 4, the action prediction part takes as input two different sources of information: pose features and appearance features. Additionally, similarly to the pose prediction part, action features from previous pyramids ($Y^{p-1,l}$) and levels ($Y^{p,l-1}$) are also aggregated in each prediction.

3.3.1 Pose Features

In order to explore the rich information encoded in body joint positions, we convert a sequence of $T$ poses with $N_j$ joints each into an image-like representation. Similar representations were previously used in [64], [66]. We choose to encode the temporal dimension as the vertical axis, the joints as the horizontal axis, and the coordinates of each point ($(x, y)$ for 2D, $(x, y, z)$ for 3D) as the channels. With this approach, we can use classical 2D convolutions to extract patterns directly from the temporal sequence of body joints. The predicted coordinates of each body joint are weighted by their confidence scores, so that points that are not present in the image (and consequently cannot be correctly predicted) have less influence on action recognition. A graphical representation of pose features is presented in Fig. 5a.
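A small sketch of this image-like encoding, with placeholder values for the clip length and the number of joints:

```python
import numpy as np

# Illustrative sizes: T frames, Nj joints, 3D coordinates.
T, Nj = 8, 16
poses = np.random.rand(T, Nj, 3).astype(np.float32)   # (x, y, z) per joint
conf = np.random.rand(T, Nj, 1).astype(np.float32)    # per-joint confidence

# Image-like representation: time as rows, joints as columns, coordinates as
# channels, with each point weighted by its confidence score.
pose_features = poses * conf            # shape (T, Nj, 3)

# Being a regular 3-D tensor, it can be processed by ordinary 2D convolutions
# exactly like an image of size T x Nj with 3 channels.
print(pose_features.shape)              # (8, 16, 3)
```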
3.3.2 Appearance Features

In addition to the pose information, visual cues are very important for action recognition, since they bring contextual information. In our method, localized visual information is encoded as appearance features, which are extracted by a process similar to that of pose features, with the difference that the former relies on local visual information instead of joint coordinates. In order to extract localized appearance features, we multiply each channel of the tensor of multi-task features $Z_t^{p,l} \in \mathbb{R}^{H_f \times W_f \times N_f}$ by each channel of the probability maps $h_t \in \mathbb{R}^{H_f \times W_f \times N_j}$ (outer product over $N_f$ and $N_j$), which is learned as a byproduct of the pose estimation process. Then, the spatial dimensions are collapsed by a sum, resulting in appearance features of size $\mathbb{R}^{N_j \times N_f}$ for time $t$. For a sequence of frames, we concatenate the appearance feature maps for $t = \{1, 2, \ldots, T\}$, resulting in the video clip appearance features $\mathcal{V} \in \mathbb{R}^{T \times N_j \times N_f}$. To clarify this process, a graphical representation is shown in Fig. 5b.

We argue that our multi-task framework has two benefits for the appearance based part. First, it is computationally very efficient, since most of the computations are shared. Second, the extracted visual features are more robust, since they are trained simultaneously for different but related tasks and on different datasets.
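In tensor terms, this corresponds to a sum over the spatial dimensions of the product between the feature and probability tensors; a NumPy sketch with arbitrary sizes follows.

```python
import numpy as np

Hf, Wf, Nf, Nj, T = 32, 32, 288, 16, 8

def appearance_features(Z, h):
    """Multiply the multi-task features by each probability map and collapse
    the spatial dimensions by a sum, giving one Nf-vector per joint."""
    # Z: (Hf, Wf, Nf), h: (Hf, Wf, Nj)  ->  (Nj, Nf)
    return np.einsum('hwf,hwj->jf', Z, h)

# One frame (illustrative values).
Z_t = np.random.randn(Hf, Wf, Nf).astype(np.float32)
h_t = np.random.rand(Hf, Wf, Nj).astype(np.float32)
print(appearance_features(Z_t, h_t).shape)       # (16, 288)

# For a clip, the per-frame features are concatenated over time:
V = np.stack([appearance_features(Z_t, h_t) for _ in range(T)])
print(V.shape)                                   # (8, 16, 288), i.e., T x Nj x Nf
```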

3.3.3 Action Features Aggregation and Re-Injection

Some actions are hard to distinguish from others using only the high level pose representation. For example, the actions drink water and make a phone call are very similar if we take into account only the body joints, but they are easily separated if we have the visual information corresponding to the objects cup and phone. On the other hand, other actions are not directly related to visual information but rather to body movements, like salute and touch chest, and in this case the pose information can provide complementary cues. In our method, we combine visual cues and body movements by aggregating pose and appearance features. This aggregation is a straightforward process, since both feature types have the same spatial dimensions.

Similarly to the single frame feature re-injection mechanism discussed in Section 3.2.4, our approach also allows action feature re-injection, as detailed in the action prediction part of Fig. 4. We demonstrate in the experiments that this technique also improves action recognition results with no additional parameters.

3.3.4 Decoupled Action Poses

Since the multi-task architecture is trained simultaneously on pose estimation and on action recognition, we may have an effect of competing gradients from poses and actions, especially in the predicted poses, which are used as the output of the first task and as the input of the second task. To mitigate that influence, late in the training process we propose to decouple estimated poses (used to compute pose scores) from action poses (used by the action recognition part), as illustrated in Fig. 6.

Fig. 6. Decoupled poses for action prediction. The weight matrix $W_h'$ is initialized with a copy of $W_h$ after the main training process. The same is done for the depth maps ($W_d$ and $d$).

Specifically, we first train the network on pose estimation for about one half of the full training iterations, then we replicate only the last layers that project the multi-task feature map $Z$ to heat maps and depth maps (parameters $W_h$ and $W_d$), resulting in a "copy" of the probability maps $h'$ and depth maps $d'$. Note that this replica corresponds to a simple $1 \times 1$ convolution from the feature space to the number of joints, which is almost insignificant in terms of parameters and computation. The "copy" of this layer is a new convolutional layer with its weights $W'$ initialized with $W$. Finally, for the remaining training, the action recognition part propagates its loss through the replica poses. This process allows the original pose predictions to stay specialized on the first task, while the replicated poses partially absorb the action gradients and are optimized according to the action recognition task. Despite the replicated poses not being directly supervised in the final training stage (which corresponds to a few more epochs), we show in our experiments that they still remain coherent with the supervised estimated poses.
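Since the replicated layer is a simple 1x1 convolution, the decoupling step essentially reduces to copying a weight matrix; the schematic NumPy illustration below uses placeholder sizes and leaves the actual optimizer wiring to the training framework.

```python
import numpy as np

Nf, Nj = 288, 16

# The pose prediction layer is a 1x1 convolution, i.e., an (Nf, Nj) weight
# matrix (biases omitted here for brevity).
W_h = np.random.randn(Nf, Nj).astype(np.float32) * 0.01
W_d = np.random.randn(Nf, Nj).astype(np.float32) * 0.01

# Decoupling: replicate the layers once the pose estimator is trained. The
# copies receive the action-recognition gradients, while the originals keep
# being supervised on pose only.
W_h_action = W_h.copy()
W_d_action = W_d.copy()

# From here on, a training framework would update W_h / W_d with the pose
# loss and W_h_action / W_d_action with the action loss only.
```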
4 EXPERIMENTS

In this section, we present quantitative and qualitative results by evaluating the proposed method on two different tasks and two different modalities: human pose estimation and human action recognition in 2D and 3D scenarios. Since our method relies on body coordinates, we consider four publicly available datasets mostly composed of full poses, which are detailed as follows.

4.1 Datasets

MPII Human Pose Dataset [67] is a well known 2D human pose dataset composed of about 25K images collected from YouTube videos. 2D poses were manually annotated with up to 16 body joints. Human3.6M [41] is a 3D human pose dataset composed of videos with 11 subjects performing 17 different activities, all recorded simultaneously by 4 cameras. High precision 3D poses were captured by a MoCap system, from which 17 body joints are used for evaluation. Penn Action [68] is a 2D dataset for action recognition composed of 2,326 videos of people performing 15 different sports actions. Human poses were manually annotated with up to 13 body joints. NTU RGB+D [69] is a large scale 3D action recognition dataset composed of 56K Full HD videos with 60 actions performed by 40 different actors and recorded by 3 cameras in 17 different configurations. Each color video has an associated depth map video and 3D Kinect poses.

4.1.1 Evaluation Metrics

On 2D pose estimation, we evaluate our method on the MPII validation set composed of 3K images, using the probability of correct keypoints with respect to the head size (PCKh) measure [67]. On 3D pose estimation, we evaluate our method on Human3.6M by measuring the mean per joint position error (MPJPE) after alignment of the root joint. We follow the most common evaluation protocol [37], [39], [44], [45], [47] by taking five subjects for training (S1, S5, S6, S7, S8) and evaluating on two subjects (S9, S11), on one of every 64 frames.
We use ground truth person bounding boxes for a fair comparison with previous methods on single person pose estimation. We report results using a single cropped bounding box per sample.

On action recognition, we report results as the percentage of correctly classified actions. We use the proposed evaluation protocol for Penn Action [49], splitting the data 50/50 for training/testing, and the more realistic cross-subject scenario for NTU, in which 20 subjects are used for training and the remaining subjects are used for testing. Our method is evaluated on single-clip and/or multi-clip predictions. In the first case, we crop a single clip with $T$ frames in the middle of the video. In the second case, we crop multiple video clips temporally spaced $T/2$ frames apart, and the final predicted action is the average decision among all clips from one video.

In our experiments, we consider two scenarios: A) 2D pose estimation and action recognition, for which we use the MPII and Penn Action datasets, respectively, and B) 3D pose estimation and action recognition, using the MPII, Human3.6M, and NTU datasets.
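For reference, a simplified sketch of two of the metrics used below, root-aligned MPJPE and action classification accuracy; PCKh additionally normalizes the 2D joint error by the annotated head size and is omitted here.

```python
import numpy as np

def mpjpe(pred, gt, root=0):
    """Mean per joint position error (mm) after aligning the root joint."""
    pred = pred - pred[:, root:root + 1]
    gt = gt - gt[:, root:root + 1]
    return np.linalg.norm(pred - gt, axis=-1).mean()

def action_accuracy(logits, labels):
    """Percentage of correctly classified actions."""
    return 100.0 * (logits.argmax(axis=-1) == labels).mean()

# Toy check with 2 poses of 17 joints (millimeters) and 4 video clips.
pred = np.random.randn(2, 17, 3) * 100
gt = pred + 10.0                      # constant 10 mm offset per coordinate
print(mpjpe(pred, gt))                # 0.0: the offset is removed by root alignment
print(action_accuracy(np.eye(4), np.arange(4)))   # 100.0
```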
4.2 Implementation and Training Details

4.2.1 Loss Functions

For the pose estimation task, we train the network using the elastic net loss [70] on predicted poses:

$$L_p = \frac{1}{N_j} \sum_{j=1}^{N_j} \left( \| \hat{p}_j - p_j \|_1 + \| \hat{p}_j - p_j \|_2^2 \right), \quad (11)$$

where $\hat{p}_j$ and $p_j$ are respectively the estimated and ground truth positions of the $j$th body joint. The same loss is used for both the 2D and 3D cases, but only the available values ($(x, y)$ for 2D and $(x, y, z)$ for 3D) are taken into account for backpropagation, depending on the dataset. We use poses in the camera coordinate system, with $(x, y)$ lying on the image plane and $z$ corresponding to the depth, normalized in the interval $[0, 1]$, where the top-left image corner corresponds to $(0, 0)$ and the bottom-right image corner corresponds to $(1, 1)$. For depth normalization, the root joint is assumed to have $z = 0.5$, and a range of 2 meters is used to represent the remaining joints. If a given body joint falls outside the cropped bounding box during training, we set the ground truth confidence flag $c_j$ to zero; otherwise we set it to one. The ground truth confidence information is used to supervise the predicted joint confidence scores $\hat{c}$ with the binary cross entropy loss. Despite giving additional information, the supervision on confidence scores has negligible influence on the precision of estimated poses. For the action recognition part, we use the categorical cross entropy loss on predicted actions.
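A NumPy sketch of the three losses described above (elastic net on poses, binary cross entropy on confidences, categorical cross entropy on actions); the toy values only illustrate the quantities involved, and the relative loss weighting discussed in Section 4.2.3 is omitted.

```python
import numpy as np

def elastic_net_loss(p_hat, p):
    """Eq. (11): mean over joints of the L1 plus squared L2 pose error."""
    diff = p_hat - p
    l1 = np.abs(diff).sum(axis=-1)
    l2 = (diff ** 2).sum(axis=-1)
    return (l1 + l2).mean()

def binary_cross_entropy(c_hat, c, eps=1e-7):
    c_hat = np.clip(c_hat, eps, 1.0 - eps)
    return -(c * np.log(c_hat) + (1 - c) * np.log(1 - c_hat)).mean()

def categorical_cross_entropy(a_hat, a, eps=1e-7):
    return -(a * np.log(np.clip(a_hat, eps, 1.0))).sum(axis=-1).mean()

# Toy values: 16 joints in normalized (x, y, z) coordinates.
p = np.random.rand(16, 3)
p_hat = p + 0.01
print(elastic_net_loss(p_hat, p))   # small value, about 0.03
```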
4.2.2 Network Architecture

Since the pose estimation part is the most computationally expensive, we chose to use separable convolutions with kernel size $5 \times 5$ for single frame layers and standard convolutions with kernel size $3 \times 3$ for video clip processing layers (action recognition layers). We performed experiments with the network architecture using 4 levels and up to 8 pyramids ($L = 4$ and $P = 8$). No further significant improvement was noticed on pose estimation when using more than 8 pyramids. On action recognition, this limit was observed at 4 pyramids. For that reason, when using the full model with 8 pyramids, the action recognition part starts only at the 5th pyramid, reducing the computational load.

In our experiments, we used normalized RGB images of size $256 \times 256 \times 3$ as input, which are reduced to a feature map of size $32 \times 32 \times 288$ by the entry-flow network, corresponding to level $l = 1$. At each level, the spatial resolution is reduced by a factor of 2 and the number of features is arithmetically increased by 96. For action recognition, we used $N_v = 160$ and $N_v = 192$ features for Penn Action and NTU, respectively.
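Spelling out the per-level feature map sizes implied by these numbers (entry-flow output of 32 x 32 x 288 at level 1, spatial resolution halved and 96 extra features per level):

```python
# Per-level feature map sizes implied by the text above.
for level in range(1, 5):                      # L = 4 levels
    size = 32 // (2 ** (level - 1))
    feats = 288 + 96 * (level - 1)
    print(f"level {level}: {size} x {size} x {feats}")
# level 1: 32 x 32 x 288
# level 2: 16 x 16 x 384
# level 3: 8 x 8 x 480
# level 4: 4 x 4 x 576
```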
4.2.3 Multi-Task Training

For all the experiments, we first initialize the network by training pose estimation only, for about 32k iterations with mini batches of 32 images (equivalent to 40 epochs on MPII). Then, all the weights related to pose estimation are fixed and only the action recognition part is trained, for 2 and 50 epochs for Penn Action and NTU, respectively. Finally, the full network is trained in a multi-task scenario, simultaneously for pose estimation and action recognition, until the validation scores plateau. Training the network on pose estimation for a few epochs provides a good general initialization and a better convergence of the action recognition part. The intermediate training stage of action recognition has two objectives: first, it allows a good initialization of the action part, since it is built on top of the pre-initialized pose estimator; and second, it is about 3 times faster than performing multi-task training directly, while resulting in similar scores. This process is especially useful for NTU, due to the large amount of training data. The training procedure takes about one day for the pose estimation initialization, then two/three days for the remaining process for Penn Action/NTU, using a desktop GeForce GTX 1080Ti GPU.

For initialization on pose estimation, the network was optimized with RMSprop and an initial learning rate of 0.001. For action and multi-task training, we use RMSprop for Penn Action, with the learning rate reduced by a factor of 0.1 after 15 and 25 epochs, and, for NTU, vanilla SGD with Nesterov momentum of 0.9 and an initial learning rate of 0.01, reduced by a factor of 0.1 after 50 and 55 epochs. We weight the loss on body joint confidence scores and action estimations by a factor of 0.01, since the gradients from the cross entropy losses are much stronger than the gradients from the elastic net loss on pose estimation. This parameter was chosen empirically, and we did not observe a significant variation in the results with slightly different values (e.g., with 0.02). Each iteration is performed on 4 batches of 8 frames, composed of random images for pose estimation and video clips for action. We train the model by alternating one batch containing pose estimation samples only and another batch containing action samples only. This strategy resulted in slightly better results compared to batches composed of mixed pose and action samples. We augment training data by performing random rotations from $-40°$ to $+40°$, scaling from 0.7 to 1.3, video temporal subsampling by a factor from 3 to 10, random horizontal flipping, and random color shifting. On evaluation, we also subsampled Penn Action/NTU videos by a factor of 6/8, respectively.
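A schematic sketch of the alternating-batch multi-task stage; the two update functions are placeholders standing in for the framework-specific optimization steps, and the dummy batches are much smaller than the real inputs.

```python
import numpy as np

def train_on_pose_batch(images, poses):
    # Placeholder for the framework-specific pose update (e.g., RMSprop step).
    return float(np.random.rand())

def train_on_action_batch(clips, labels):
    # Placeholder for the framework-specific action update.
    return float(np.random.rand())

# Tiny dummy batches (real inputs are 256x256 RGB frames / 8-frame clips).
pose_batches = [(np.zeros((8, 64, 64, 3)), np.zeros((8, 16, 3)))] * 2
action_batches = [(np.zeros((2, 8, 64, 64, 3)), np.zeros((2,), dtype=int))] * 2

# Multi-task stage: alternate one pose-only batch with one action-only batch
# instead of mixing both kinds of samples in the same batch.
for (images, poses), (clips, labels) in zip(pose_batches, action_batches):
    pose_loss = train_on_pose_batch(images, poses)
    action_loss = train_on_action_batch(clips, labels)
    print(f"pose loss {pose_loss:.3f} | action loss {action_loss:.3f}")
```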
4.3 Evaluation on 3D Pose Estimation

Our results compared to previous approaches are shown in Table 1. Our multi-task method achieves a state-of-the-art average prediction error of 48.6 millimeters on Human3.6M for 3D pose estimation, improving on our previous work [9] by 4.6 mm. Considering only the pose estimation task, our average error is 49.5 mm, 0.9 mm higher than the multi-task result, which shows the benefit of multi-task training for 3D pose estimation. For the activity "Sit down", which is the most challenging case, we improve on previous methods (e.g., Yang et al. [47]) by 21 mm. The generalization of our method is demonstrated by qualitative results of 3D pose estimation on all datasets in Fig. 10. Note that a single model and a single training procedure were used to produce all the images and scores, including 3D pose estimation and 3D action recognition, as discussed in the following.

TABLE 1
Comparison With Previous Work on Human3.6M Evaluated Using the Mean Per Joint Position Error (MPJPE, in Millimeters) Metric on Reconstructed Poses

Methods                              Direction  Discuss  Eat   Greet  Phone  Posing  Purchase  Sitting
Pavlakos et al. [45]                 67.4       71.9     66.7  69.1   71.9   65.0    68.3      83.7
Mehta et al. [39]*                   52.5       63.8     55.4  62.3   71.8   52.6    72.2      86.2
Martinez et al. [37]                 51.8       56.2     58.1  59.0   69.5   55.2    58.1      74.0
Sun et al. [44]†                     52.8       54.8     54.2  54.3   61.8   53.1    53.6      71.7
Yang et al. [47]†                    51.5       58.9     50.4  57.0   62.1   49.8    52.7      69.2
Sun et al. [46]†                     –          –        –     –      –      –       –         –
3D heat maps (ours [9], only H36M)   61.7       63.5     56.1  60.1   60.0   57.6    64.6      75.1
3D heat maps (ours [9])†             49.2       51.6     47.6  50.5   51.8   48.5    51.7      61.5
Ours (single-task)†                  43.7       48.8     45.6  46.2   49.3   43.5    46.0      56.8
Ours (multi-task)†                   43.2       48.6     44.1  45.9   48.2   43.5    45.5      57.1

Methods                              Sit Down   Smoke    Photo  Wait  Walk   Walk Dog  Walk Pair  Average
Pavlakos et al. [45]                 96.5       71.4     76.9   65.8  59.1   74.9      63.2       71.9
Mehta et al. [39]*                   120.0      66.0     79.8   63.9  48.9   76.8      53.7       68.6
Martinez et al. [37]                 94.6       62.3     78.4   59.1  49.5   65.1      52.4       62.9
Sun et al. [44]†                     86.7       61.5     67.2   53.4  47.1   61.6      53.4       59.1
Yang et al. [47]†                    85.2       57.4     65.4   58.4  60.1   43.6      47.7       58.6
Sun et al. [46]†                     –          –        –      –     –      –         –          49.6
3D heat maps (ours [9], only H36M)   95.4       63.4     73.3   57.0  48.2   66.8      55.1       63.8
3D heat maps (ours [9])†             70.9       53.7     60.3   48.9  44.4   57.9      48.9       53.2
Ours (single-task)†                  67.8       50.5     57.9   43.4  40.5   53.2      45.6       49.5
Ours (multi-task)†                   64.2       50.6     53.8   44.2  40.0   51.1      44.0       48.6

* Method not using ground-truth bounding boxes. † Methods using extra 2D data for training.

4.4 Evaluation on Action Recognition

For action recognition, we evaluate our method in both 2D and 3D scenarios. For the first, a single model was trained using MPII for single frames (pose estimation) and Penn Action for video clips. In the second scenario, we use Human3.6M for 3D pose supervision, MPII for data augmentation, and NTU video clips for action. Similarly, a single model was trained for all the reported 3D pose and action results.

For 2D, the pose estimation was trained using mixed data from MPII (80 percent) and Penn Action (20 percent), using 16 body joints. Results are shown in Table 2. We reached a state-of-the-art action classification score of 98.7 percent on Penn Action, improving on our previous work [9] by 1.3 percent. Our method outperformed all previous methods, including the ones using ground truth (manually annotated) poses.

TABLE 2
Results for Action Recognition on Penn Action

Methods                 RGB   Optical Flow   Annot. poses   Estimated poses   Acc.
Nie et al. [49]         ✓     -              -              ✓                 85.5
Iqbal et al. [3]        -     -              -              ✓                 79.0
                        ✓     ✓              -              ✓                 92.9
Cao et al. [51]         ✓     -              ✓              -                 98.1
                        ✓     -              -              ✓                 95.3
Du et al. [54]*         ✓     ✓              -              ✓                 97.4
Liu et al. [57]†        ✓     -              ✓              -                 98.2
                        ✓     -              -              ✓                 91.4
Our previous work [9]   ✓     -              ✓              -                 98.6
                        ✓     -              -              ✓                 97.4
Ours (single-clip)      ✓     -              -              ✓                 98.2
Ours (multi-clip)       ✓     -              -              ✓                 98.7

Results are given as the percentage of correctly classified actions. Our method uses extra 2D pose data from MPII for training. * Including UCF101 data; † using additional deep features.

For 3D, we trained our multi-task network using mixed data from Human3.6M (50 percent), MPII (37.5 percent) and NTU (12.5 percent) for pose estimation, and NTU video clips for action recognition. Our results compared to previous methods are presented in Table 3. Our approach reached 89.9 percent of correctly classified actions on NTU, which is a strong result considering the hard task of classifying among 60 different actions in the cross-subject split. Our method improves on previous results by at least 3.3 percent and on our previous work by 4.4 percent, which shows the effectiveness of the proposed approach.
TABLE 3
Comparison Results on NTU Cross-Subject for 3D Action Recognition

Methods                 RGB   Kinect poses   Estimated poses   Acc. cross-subject
Shahroudy et al. [69]   -     ✓              -                 62.9
Liu et al. [60]         -     ✓              -                 69.2
Song et al. [62]        -     ✓              -                 73.4
Liu et al. [61]         -     ✓              -                 74.4
Shahroudy et al. [63]   ✓     ✓              -                 74.9
Liu et al. [57]         ✓     -              ✓                 78.8
Baradel et al. [64]     -     ✓              -                 77.1
                        ✓     *              -                 75.6
                        ✓     ✓              -                 84.8
Baradel et al. [71]     ✓     -              -                 86.6
Our previous work [9]   ✓     -              ✓                 85.5
Ours                    ✓     -              ✓                 89.9

Results are given as the percentage of correctly classified actions. Our method uses extra pose data from MPII and H36M for training. * Ground truth poses used at test time to select visual features.

4.5 Ablation Study

4.5.1 Network Design

We performed several experiments on the proposed network architecture in order to identify its best arrangement for solving both tasks with the best performance vs. computational cost trade-off. In Table 4, we show the results on 2D pose estimation and on action recognition considering different network layouts. For example, in the first line, a single PB is used at pyramid 1 and level 2. In the second line, a pair of full downscaling and upscaling pyramids is used, but with supervision only at the last PB. This results in 97.5 percent accuracy on action recognition and 84.2 percent PCKh on pose estimation. An equivalent network is used in the third line, but now with supervision on all PB blocks, which brings an improvement of 0.9 percent on pose and 0.6 percent on action, with the same number of parameters. Note that the networks in the second and third lines are exactly the same, but in the first case only the last PB is supervised, while in the latter all PBs receive supervision. Finally, the last line shows results with the full network, reaching 88.3 percent on MPII and 98.2 percent on Penn Action (single-clip), with a single multi-task model.

TABLE 4
The Influence of the Network Architecture on Pose Estimation and on Action Recognition, Evaluated Respectively on the MPII Validation Set (PCKh@0.5, Single-Crop) and on Penn Action (Classification Accuracy, Single-Clip)

Network                   Param.   No. PB   PCKh   Action acc.
Single-PB (p = 1, l = 2)  2M       1        74.3   97.2
Single-PB (p = 2, l = 1)  10M      1        84.2   97.5
Multi-PB (P = 2, L = 4)   10M      6        85.1   98.1
Multi-PB (P = 8, L = 4)   26M      24       88.3   98.2

Single-PB networks are indexed by pyramid p and level l; P and L represent the total number of pyramids and levels in the Multi-PB scheme.

4.5.2 Pose and Appearance Features

The proposed method benefits from both pose and appearance features, which are complementary for the action recognition task. Additionally, the confidence score $\hat{c}$ is also complementary to the pose itself and leads to marginal action recognition gains when used to weight pose predictions. Similar results are achieved if confidence scores are concatenated to poses. In Table 5, we present results on pose estimation and on action recognition for different feature extraction strategies. Considering pose features or appearance features alone, the results on Penn Action are respectively 97.4 and 97.9 percent, respectively 0.7 and 0.2 percent lower than combined features. We also show in the last row the influence of decoupled action poses, resulting in a small gain of 0.1 percent on action scores and 0.3 percent on pose estimation, which shows that decoupling action poses brings additional improvements, especially for pose estimation. When not considering decoupled poses, note that the best score on pose estimation happens when poses are not directly used for action, which also supports the evidence of competing losses.

TABLE 5
Results With Pose and Appearance Features Alone, Combined Pose and Appearance Features, and Decoupled Poses

Action features               MPII val. PCKh   Penn Action Acc.
Pose features only            84.9             97.7
Appearance features only      85.2             97.9
Combined                      85.1             98.1
Combined + decoupled poses    85.4             98.2

Experiments with a Multi-PB network with P = 2 and L = 4.

Fig. 7. Two sequences of RGB images (top), predicted supervised poses (middle), and decoupled action poses (bottom).

Fig. 8. Drift of decoupled probability maps from their original positions (head, hands and feet) used as an attention mechanism for appearance feature extraction. Bounding boxes are drawn here only to highlight the regions with high responses. Each color corresponds to a specific body part (see Fig. 7).
Additionally, we can observe that decoupled action poses remain coherent with the supervised poses, as shown in Fig. 7, which suggests that the initial pose supervision is a good initialization overall. Nonetheless, in some cases, decoupled probability maps can drift to regions in the image that are more relevant for action recognition, as illustrated in Fig. 8. For example, feet heat maps can drift to objects in the hands, since the latter are more informative with respect to the performed action.

4.5.3 Single-Task Versus Multi-Task

In this part we compare results on human action recognition considering single-task and multi-task training protocols. In Table 6, the first row shows results on Penn Action and NTU considering training with action supervision only, i.e., with the full network architecture (including pose estimation layers) but without pose supervision. In the second row we show the results when using the manually annotated poses from Penn Action for pose supervision. We did not use NTU (Kinect) poses for supervision since they are very noisy. From this, we can notice an improvement of almost 10 percent on Penn Action just by adding pose supervision. When mixing with MPII data, it further increases by 0.8 percent. On NTU, multi-tasking improves a significant 1.9 percent. We believe that the improvement of multi-tasking on Penn Action is much more evident because this is a small dataset, and therefore it is difficult to learn good representations for complex actions without explicit pose information. On the contrary, NTU is a large scale dataset, more suitable for learning approaches. As a consequence, the gap between single and multi-task on NTU is smaller, but still relevant.

TABLE 6
Results Comparing the Effect of Single and Multi-Task Training for Action Recognition

Training protocol                  Penn Action Acc.   NTU Acc.
Single-task (action only)          87.5               88.0
Multi-task (same dataset)          97.4               –
Multi-task (+MPII, +H36M for 3D)   98.2               89.9

4.5.4 Inference Speed

Once the network is trained, it can easily be cut to perform faster inference. For instance, the full model with 8 pyramids can be cut at the 4th or 2nd pyramid, which generally degrades the performance but allows faster predictions. To show the trade-off between precision and speed, we cut the trained multi-task model at different prediction blocks and estimate the throughput in frames per second (FPS), evaluating pose estimation precision and action recognition classification accuracy. We consider mini batches with 16 images for pose estimation and single video clips of 8 frames for action. The results are shown in Fig. 9. For both 2D and 3D scenarios, the best predictions are obtained at more than 90 FPS. For the 3D scenario, pose estimation on Human3.6M can be performed at more than 180 FPS and still reach a competitive result of 57.3 millimeters error, while for action recognition on NTU, at the same speed, we still obtain state-of-the-art results with 87.7 percent of correctly classified actions, or comparable results to recent approaches at more than 240 FPS. Finally, we show our results for both 2D and 3D scenarios compared to previous methods in Table 7, considering different inference speeds.

TABLE 7
Results on All Tasks With the Proposed Multi-Task Model Compared to Recent Approaches Using RGB Images and/or Estimated Poses: MPII PCKh on the Validation Set (Higher is Better), Human3.6M MPJPE (Lower is Better), and Penn Action and NTU RGB+D Action Classification Accuracy (Higher is Better)

Methods                 MPII PCKh   H36M MPJPE   Penn Action half/half   NTU RGB+D cross-sub.
Pavlakos et al. [45]    -           71.9         -                       -
Mehta et al. [39]       -           68.6         -                       -
Martinez et al. [37]    -           62.9         -                       -
Sun et al. [44]         -           59.1         -                       -
Yang et al. [47]        88.6        58.6         -                       -
Sun et al. [46]         87.3        49.6         -                       -
Nie et al. [49]         -           -            85.5                    -
Iqbal et al. [3]        -           -            92.9                    -
Cao et al. [51]         -           -            95.3                    -
Du et al. [54]          -           -            97.4                    -
Shahroudy et al. [63]   -           -            -                       74.9
Baradel et al. [71]     -           -            -                       86.6
Ours [9] @ 85 fps       -           53.2         97.4                    85.5
Ours 2D @ 240 fps       85.5        -            97.5                    -
Ours 2D @ 120 fps       88.3        -            98.7                    -
Ours 3D @ 240 fps       80.7        63.9         -                       86.6
Ours 3D @ 180 fps       83.8        57.3         -                       87.7
Ours 3D @ 90 fps        87.0        48.6         -                       89.9

Fig. 9. Inference speed of the proposed method considering 2D (a) and 3D (b, c) scenarios. A single multi-task model was trained for each scenario. The trained models were cut a posteriori for inference analysis. Markers with gradient colors from purple to red represent network inferences from faster to slower, respectively.
2762 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 43, NO. 8, AUGUST 2021

Fig. 10. Predicted 3D poses from RGB images for both 2D and 3D datasets.

5 CONCLUSION

In this work, we presented a new approach for human pose estimation and action recognition using multi-task deep learning. The proposed method for 3D pose provides highly precise estimations from low-resolution feature maps and avoids the expensive volumetric heat maps by predicting a specialized depth map per body joint. The proposed CNN architecture, along with the pose regression method, allows multi-scale pose and action supervision and re-injection, resulting in a highly efficient, densely supervised approach. Our method can be trained with mixed 2D and 3D data, benefiting from precise indoor 3D data as well as "in-the-wild" images manually annotated with 2D poses. This has demonstrated significant improvements for 3D pose estimation. The proposed method can also be trained with single frames and video clips simultaneously and in a seamless way.

More importantly, we show that the hard problem of multi-tasking human poses and action recognition can be handled by a carefully designed architecture, resulting in a better solution for each task than learning them separately. In addition, we show that jointly learning human poses results in a consistent improvement of action recognition. Finally, with a single training procedure, our multi-task model can be cut at different levels for pose and action predictions, resulting in a highly scalable approach.

ACKNOWLEDGMENTS

This work was partially supported by the Brazilian National Council for Scientific and Technological Development (CNPq) – Grant 233342/2014-1.

REFERENCES

[1] G. Cheron, I. Laptev, and C. Schmid, "P-CNN: Pose-based CNN features for action recognition," in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 3218–3226.
[2] A. Yao, J. Gall, and L. Van Gool, "Coupled action recognition and pose estimation from multiple views," Int. J. Comput. Vis., vol. 100, no. 1, pp. 16–37, Oct. 2012.
[3] U. Iqbal, M. Garbade, and J. Gall, "Pose for action - action for pose," in Proc. 12th IEEE Int. Conf. Autom. Face & Gesture Recognit., 2017, pp. 438–445.
[4] I. Kokkinos, "UberNet: Training a 'universal' convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 5454–5463.
[5] M. Zolfaghari, G. L. Oliveira, N. Sedaghat, and T. Brox, "Chained multi-stream networks exploiting pose, motion, and appearance for action classification and detection," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2923–2932.
[6] V. Choutas, P. Weinzaepfel, J. Revaud, and C. Schmid, "PoTion: Pose motion representation for action recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7024–7033.
[7] D. C. Luvizon, H. Tabia, and D. Picard, "Human pose regression by combining indirect part detection and contextual information," Comput. Graph., vol. 85, pp. 15–22, 2019.
[8] K. M. Yi, E. Trulls, V. Lepetit, and P. Fua, "LIFT: Learned invariant feature transform," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 467–483.
[9] D. C. Luvizon, D. Picard, and H. Tabia, "2D/3D pose estimation and action recognition using multitask deep learning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 5137–5146.
[10] N. Sarafianos, B. Boteanu, B. Ionescu, and I. A. Kakadiaris, "3D human pose estimation: A review of the literature and analysis of covariates," Comput. Vis. Image Understanding, vol. 152, pp. 1–20, 2016.
[11] S. Herath, M. Harandi, and F. Porikli, "Going deeper into action recognition: A survey," Image Vis. Comput., vol. 60, pp. 4–21, 2017.
[12] M. Andriluka, S. Roth, and B. Schiele, "Pictorial structures revisited: People detection and articulated pose estimation," in Proc. Comput. Vis. Pattern Recognit., 2009, pp. 1014–1021.
[13] M. Dantone, J. Gall, C. Leistner, and L. V. Gool, "Human pose estimation using body parts dependent joint regressors," in Proc. Comput. Vis. Pattern Recognit., 2013, pp. 3041–3048.
[14] L. Pishchulin, M. Andriluka, P. Gehler, and B. Schiele, "Poselet conditioned pictorial structures," in Proc. Comput. Vis. Pattern Recognit., 2013, pp. 588–595.
[15] G. Ning, Z. Zhang, and Z. He, "Knowledge-guided deep fractal neural networks for human pose estimation," IEEE Trans. Multimedia, vol. 20, no. 5, pp. 1246–1259, May 2018.
[16] I. Lifshitz, E. Fetaya, and S. Ullman, Human Pose Estimation Using Deep Consensus Voting. Cham, Switzerland: Springer, 2016, pp. 246–260.
[17] L. Pishchulin et al., "DeepCut: Joint subset partition and labeling for multi person pose estimation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 4929–4937.
[18] E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele, "DeeperCut: A deeper, stronger, and faster multi-person pose estimation model," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 34–50.
[19] U. Rafi, I. Kostrikov, J. Gall, and B. Leibe, "An efficient convolutional network for human pose estimation," in Proc. Brit. Mach. Vis. Conf., 2016, vol. 1, Art. no. 2.
[20] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, "Convolutional pose machines," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 4724–4732.
[21] V. Belagiannis, C. Rupprecht, G. Carneiro, and N. Navab, "Robust optimization for deep regression," in Proc. Int. Conf. Comput. Vis., 2015, pp. 2830–2838.
[22] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler, "Efficient object localization using convolutional networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 648–656.
[23] A. Toshev and C. Szegedy, "DeepPose: Human pose estimation via deep neural networks," in Proc. Comput. Vis. Pattern Recognit., 2014, pp. 1653–1660.
[24] T. Pfister, K. Simonyan, J. Charles, and A. Zisserman, "Deep convolutional neural networks for efficient pose estimation in gesture videos," in Proc. Asian Conf. Comput. Vis., 2014, pp. 538–552.
[25] A. Bulat and G. Tzimiropoulos, "Human pose estimation via convolutional part heatmap regression," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 717–732.
[26] G. Gkioxari, A. Toshev, and N. Jaitly, "Chained predictions using convolutional neural networks," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 728–743.
[27] A. Newell, K. Yang, and J. Deng, "Stacked hourglass networks for human pose estimation," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 483–499.
[28] X. Chu, W. Yang, W. Ouyang, C. Ma, A. L. Yuille, and X. Wang, "Multi-context attention for human pose estimation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1831–1840.
[29] W. Yang, S. Li, W. Ouyang, H. Li, and X. Wang, "Learning feature pyramids for human pose estimation," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 1290–1299.
[30] K. Sun, B. Xiao, D. Liu, and J. Wang, "Deep high-resolution representation learning for human pose estimation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 5693–5703.
[31] I. Goodfellow et al., "Generative adversarial nets," in Proc. Int. Conf. Neural Inf. Process. Syst., 2014, pp. 2672–2680.
[32] C. Chou, J. Chien, and H. Chen, "Self adversarial training for human pose estimation," in Proc. Asia-Pacific Signal Inf. Process. Assoc. Annu. Summit Conf., 2017, pp. 17–30.
[33] Y. Chen, C. Shen, X.-S. Wei, L. Liu, and J. Yang, "Adversarial PoseNet: A structure-aware convolutional network for human pose estimation," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 1221–1230.
[34] J. Carreira, P. Agrawal, K. Fragkiadaki, and J. Malik, "Human pose estimation with iterative error feedback," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 4733–4742.
[35] X. Zhou, M. Zhu, G. Pavlakos, S. Leonardos, K. G. Derpanis, and K. Daniilidis, "MonoCap: Monocular human motion capture using a CNN coupled with a geometric prior," IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 4, pp. 901–914, Apr. 2017.
[36] D. Tome, C. Russell, and L. Agapito, "Lifting from the deep: Convolutional 3D pose estimation from a single image," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 5689–5698.
[37] J. Martinez, R. Hossain, J. Romero, and J. J. Little, "A simple yet effective baseline for 3D human pose estimation," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2659–2668.
[38] B. Tekin, P. Marquez-Neila, M. Salzmann, and P. Fua, "Fusing 2D uncertainty and 3D cues for monocular body pose estimation," CoRR, vol. abs/1611.05708, 2016. [Online]. Available: http://arxiv.org/abs/1611.05708
[39] D. Mehta et al., "Monocular 3D human pose estimation in the wild using improved CNN supervision," CoRR, vol. abs/1611.09813, pp. 506–516, Oct. 2017.
[40] A.-I. Popa, M. Zanfir, and C. Sminchisescu, "Deep multitask architecture for integrated 2D and 3D human sensing," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 4714–4723.
[41] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu, "Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 7, pp. 1325–1339, Jul. 2014.
[42] D. Mehta et al., "VNect: Real-time 3D human pose estimation with a single RGB camera," ACM Trans. Graph., vol. 36, 2017, Art. no. 4.
[43] C.-H. Chen and D. Ramanan, "3D human pose estimation = 2D pose estimation + matching," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 5759–5767.
[44] X. Sun, J. Shang, S. Liang, and Y. Wei, "Compositional human pose regression," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2621–2630.
[45] G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis, "Coarse-to-fine volumetric prediction for single-image 3D human pose," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1263–1272.
[46] X. Sun, B. Xiao, F. Wei, S. Liang, and Y. Wei, "Integral human pose regression," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 536–553.
[47] W. Yang, W. Ouyang, X. Wang, J. S. J. Ren, H. Li, and X. Wang, "3D human pose estimation in the wild by adversarial learning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 5255–5264.
[48] U. Iqbal, P. Molchanov, T. Breuel, J. Gall, and J. Kautz, "Hand pose estimation via latent 2.5D heatmap regression," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 125–143.
[49] B. Xiaohan Nie, C. Xiong, and S.-C. Zhu, "Joint action recognition and pose estimation from video," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 1293–1301.
[50] H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black, "Towards understanding action recognition," in Proc. IEEE Int. Conf. Comput. Vis., 2013, pp. 3192–3199.
[51] C. Cao, Y. Zhang, C. Zhang, and H. Lu, "Body joint guided 3D deep convolutional descriptors for action recognition," IEEE Trans. Cybern., vol. 48, no. 3, pp. 1095–1108, Mar. 2018.
[52] J. Carreira and A. Zisserman, "Quo vadis, action recognition? A new model and the Kinetics dataset," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 4724–4733.
[53] G. Varol, I. Laptev, and C. Schmid, "Long-term temporal convolutions for action recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 6, pp. 1510–1517, Jun. 2017.
[54] W. Du, Y. Wang, and Y. Qiao, "RPAN: An end-to-end recurrent pose-attention network for action recognition in videos," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 3745–3754.
[55] F. Baradel, C. Wolf, J. Mille, and G. W. Taylor, "Glimpse clouds: Human activity recognition from unstructured feature points," in Proc. Comput. Vis. Pattern Recognit., 2018, pp. 469–478.
[56] D. Wang, W. Ouyang, W. Li, and D. Xu, "Dividing and aggregating network for multi-view action recognition," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 457–473.
[57] M. Liu and J. Yuan, "Recognizing human actions as the evolution of pose estimation maps," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 1159–1168.
[58] D. C. Luvizon, H. Tabia, and D. Picard, "Learning features combination for human action recognition from skeleton sequences," Pattern Recognit. Lett., vol. 99, pp. 13–20, 2017.
[59] L. L. Presti and M. L. Cascia, "3D skeleton-based human action classification: A survey," Pattern Recognit., vol. 53, pp. 130–147, 2016.
[60] J. Liu, A. Shahroudy, D. Xu, and G. Wang, "Spatio-temporal LSTM with trust gates for 3D human action recognition," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 816–833.
[61] J. Liu, G. Wang, P. Hu, L.-Y. Duan, and A. C. Kot, "Global context-aware attention LSTM networks for 3D action recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 3671–3680.
[62] S. Song, C. Lan, J. Xing, W. Zeng, and J. Liu, "An end-to-end spatio-temporal attention model for human action recognition from skeleton data," in Proc. 31st AAAI Conf. Artif. Intell., 2017, pp. 4263–4270.
[63] A. Shahroudy, T.-T. Ng, Y. Gong, and G. Wang, "Deep multimodal feature analysis for action recognition in RGB+D videos," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 5, pp. 1045–1058, May 2017.
[64] F. Baradel, C. Wolf, and J. Mille, "Pose-conditioned spatio-temporal attention for human action recognition," CoRR, vol. abs/1703.10106, 2017. [Online]. Available: http://arxiv.org/abs/1703.10106
[65] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1800–1807.
[66] Q. Ke, M. Bennamoun, S. An, F. Sohel, and F. Boussaid, "A new representation of skeleton sequences for 3D action recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 4570–4579.
[67] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele, "2D human pose estimation: New benchmark and state of the art analysis," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 3686–3693.
[68] W. Zhang, M. Zhu, and K. G. Derpanis, "From actemes to action: A strongly-supervised representation for detailed action understanding," in Proc. IEEE Int. Conf. Comput. Vis., 2013, pp. 2248–2255.
[69] A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang, "NTU RGB+D: A large scale dataset for 3D human activity analysis," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 1010–1019.
[70] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," J. R. Statist. Soc. Ser. B, vol. 67, pp. 301–320, 2005.
[71] F. Baradel, C. Wolf, J. Mille, and G. W. Taylor, "Glimpse clouds: Human activity recognition from unstructured feature points," in Proc. Comput. Vis. Pattern Recognit., 2018, pp. 469–478.

Diogo Carbonera Luvizon received the BSc degree in electrical engineering and the MSc degree in image processing and graphics from the Federal University of Technology - Paraná (Brazil), in 2015, and the PhD degree in computer science from Cergy Paris Université, France, in 2019. His main research interests include machine learning and deep learning algorithms for computer vision, humans, and 3D scene understanding.

David Picard received the MSc degree in electrical engineering, in 2005, the PhD degree in image and signal processing, in 2008, and the Habilitation in computer science, in 2017. He joined the ETIS laboratory at ENSEA Graduate School (France) in 2010 as an associate professor. Since 2019, he has been a senior research scientist at École des Ponts ParisTech (France). His research interests include computer vision and machine learning, with a focus on kernel methods, deep learning, and distributed learning.

Hedi Tabia received the MS degree in computer science from the INSA of Rouen (public school of engineers), France, in 2008, and the PhD degree in computer science from the University of Lille, in 2011. From October 2011 to August 2012, he held a postdoctoral research associate position at the IEF laboratory (University of Paris-Sud). From 2012 to 2019, he was an associate professor at ENSEA. Since September 2019, he has been a professor with Université Paris Saclay.