


Learning Multi-Human Optical Flow


Anurag Ranjan · David T. Hoffmann · Dimitrios Tzionas · Siyu Tang · Javier
Romero · Michael J. Black
arXiv:1910.11667v2 [cs.CV] 4 Dec 2019


Abstract The optical flow of humans is well known to be useful for the analysis of human action. Recent optical flow methods focus on training deep networks to approach the problem. However, the training data used by them does not cover the domain of human motion. Therefore, we develop a dataset of multi-human optical flow and train optical flow networks on this dataset. We use a 3D model of the human body and motion capture data to synthesize realistic flow fields in both single- and multi-person images. We then train optical flow networks to estimate human flow fields from pairs of images. We demonstrate that our trained networks are more accurate than a wide range of top methods on held-out test data and that they can generalize well to real image sequences. The code, trained models and the dataset are available for research.

* Anurag Ranjan and David Hoffmann contributed equally.
Anurag Ranjan*1, David T. Hoffmann*1, Dimitrios Tzionas1, Siyu Tang1, Javier Romero2, Michael J. Black1
1 Max Planck Institute for Intelligent Systems, Germany
2 Amazon Inc. This work was done when JR was at MPI-IS.
MJB has received research gift funds from Intel, Nvidia, Adobe, Facebook, and Amazon. While MJB is a part-time employee of Amazon, his research was performed solely at, and funded solely by, MPI. MJB has financial interests in Amazon and Meshcapade GmbH.

1 Introduction

A significant fraction of videos on the Internet contain people moving [4] and the literature suggests that optical flow plays an important role in understanding human action [5, 6]. Several action recognition datasets [6, 7] contain human motion as a major component. The 2D motion of humans in video, or human optical flow, is an important feature that provides a building block for systems that can understand and interact with humans. Human optical flow is useful for various applications including analyzing pedestrians in road sequences, motion-controlled gaming, activity recognition, human pose estimation, etc.

Despite this, optical flow has previously been treated as a generic, low-level, vision problem. Given the importance of people, and the value of optical flow in understanding them, we develop a dataset and trained models that are specifically tailored to humans and their motion. Such motions are non-trivial since humans are complex, articulated objects that vary in shape, size and appearance. They move quickly, adopt a wide range of poses, and self-occlude or occlude in multi-person scenarios.

Our goal is to obtain more accurate 2D motion estimates for human bodies by training a flow algorithm specifically for human movement. To do so, we create a large and realistic dataset of humans moving in virtual worlds with ground truth optical flow (Fig. 1(a)), called the Human Optical Flow dataset. This is comprised of two parts: the Single-Human Optical Flow dataset (SHOF), where the image sequences contain only one person in motion, and the Multi-Human Optical Flow dataset (MHOF), where images contain multiple people involving significant occlusion between them. We analyse the performance of SPyNet [2] and PWC-Net [3] by training (fine-tuning) them on both the SHOF and MHOF dataset. We observe that the optical flow performance of the networks improves on sequences containing human scenes, both qualitatively and quantitatively.

(a) Our dataset (b) Results on synthetic scenes (c) Results on real world scenes

Fig. 1 (a) We simulate human motion in virtual worlds creating an extensive dataset with images (top row) and flow fields (bottom row); color
coding from [1]. (b) We train SPyNet [2] and PWC-Net [3] for human motion estimation and show that they perform better when trained on our
dataset and (c) can generalize to human motions in real world scenes. Columns show single-person and multi-person cases alternately.

Furthermore, we show that the trained networks generalize to real video sequences (Fig. 1(c)). Several datasets and benchmarks [1, 8, 9] have been established to drive the progress in optical flow. We argue that these datasets are insufficient for the task of human motion estimation and, despite its importance, no attention has been paid to datasets and models for human optical flow. One of the main reasons is that dense human motion is extremely difficult to capture accurately in real scenes. Without ground truth, there has been little work focused specifically on estimating human optical flow. To advance research on this problem, the community needs a dataset tailored to human optical flow.

A key observation is that recent work has shown that optical flow methods trained on synthetic data [2, 10, 11] generalize relatively well to real data. Additionally, these methods obtain state-of-the-art results with increased realism of the training data [12, 13]. This motivates our effort to create a dataset designed for human motion.

To that end, we use the SMPL [14] and SMPL+H [15] models, which capture the human body alone and the body together with articulated hands respectively, to generate different human shapes including hand and finger motion. We then place humans on random indoor backgrounds and simulate human activities like running, walking, dancing etc. using motion capture data [16, 17]. Thus, we create a large virtual dataset that captures the statistics of natural human motion in multi-person scenarios. We then train optical flow networks on this dataset and evaluate their performance for estimating human motion. While the dataset can be used to train any flow method, we focus specifically on networks based on spatial pyramids, namely SPyNet [2] and PWC-Net [3], because they are compact and computationally efficient.

A preliminary version of this work appeared in [18] that presented a dataset and model for human optical flow for the single-person case with a body-only model. The present work extends [18] for the multi-person case, as images with multiple occluding people have different statistics. It further employs a holistic model of the body together with hands for more realistic motion variation. This work also extends training SPyNet [2] and PWC-Net [3] using the new dataset, in contrast to training only SPyNet in the earlier work [18]. Our experiments show both qualitative and quantitative improvements.

In summary, our major contributions in this extended work are: 1) We provide the Single-Human Optical Flow dataset (SHOF) of human bodies in motion with realistic textures and backgrounds, having 146,020 frame pairs for single-person scenarios. 2) We provide the Multi-Human Optical Flow dataset (MHOF), with 111,312 frame pairs of multiple human bodies in motion, with improved textures and realistic visual occlusions, but without (self-)collisions or intersections of body meshes. These two datasets together comprise the Human Optical Flow dataset. 3) We fine-tune SPyNet [18] on SHOF and show that its performance improves by about 43% (over the initial SPyNet), while it also outperforms existing state of the art by about 30%. Furthermore, we fine-tune SPyNet and PWC-Net on MHOF and observe improvements of 10-20% (over the initial SPyNet and PWC-Net). Compared to existing state of the art, improvements are particularly high for human regions. After masking out the background, we observe improvements of up to 13% for human pixels. 4) We provide the dataset files, dataset rendering code, training code and trained models1 for research purposes.

1 https://humanflow.is.tue.mpg.de

2 Related Work

Human Motion. Human motion can be understood from 2D motion. Early work focused on the movement of 2D joint locations [19] or simple motion history images [20]. Optical flow is also a useful cue. Black et al. [21] use principal component analysis (PCA) to parametrize human motion but use noisy flow computed from image sequences for training data. More similar to us, Fablet and Black [22] use a 3D articulated body model and motion capture data to project 3D body motion into 2D optical flow. They then learn a view-based PCA model of the flow fields. We use a more realistic body model to generate a large dataset and use this to train a CNN to directly estimate dense human flow from images.

Only a few works in pose estimation have exploited human motion and, in particular, several methods [23, 24] use optical flow constraints to improve 2D human pose estimation in videos. Similar work [25, 26] propagates pose results temporally using optical flow to encourage time consistency of the estimated bodies. Apart from its application in warping between frames, the structural information existing in optical flow alone has been used for pose estimation [27] or in conjunction with an image stream [28, 29].

Learning Optical Flow. There is a long history of optical flow estimation, which we do not review here. Instead, we focus on the relatively recent literature on learning flow. Early work looked at learning flow using Markov Random Fields [30], PCA [31], or shallow convolutional models [32]. Other methods also combine learning with traditional approaches, formulating flow as a discrete [33] or continuous [34] optimization problem.

The most recent methods employ large datasets to estimate optical flow using deep neural networks. Voxel2Voxel [35] is based on volumetric convolutions to predict optical flow using 16 frames simultaneously but does not perform well on benchmarks. Other methods [2, 10, 11] compute two-frame optical flow using an end-to-end deep learning approach. FlowNet [10] uses the Flying Chairs dataset [10] to compute optical flow in an end-to-end deep network. FlowNet 2.0 [11] uses stacks of networks from FlowNet and performs significantly better, particularly for small motions. Ranjan and Black [2] propose a Spatial Pyramid Network that employs a small neural network on each level of an image pyramid to compute optical flow. Their method uses a much smaller number of parameters and achieves similar performance as FlowNet [10] using the same training data. Sun et al. [3] use image features in a similar spatial pyramid network, achieving state-of-the-art results on optical flow benchmarks. Since the above methods are not trained with human motions, they do not perform well on our Human Optical Flow dataset.

Optical Flow Datasets. Several datasets have been developed to facilitate training and benchmarking of optical flow methods. Middlebury is limited to small motions [1], KITTI is focused on rigid scenes and automotive motions [8], while Sintel has a limited number of synthetic scenes [9]. These datasets are mainly used for evaluation of optical flow methods and are generally too small to support training neural networks.

To learn optical flow using neural networks, more datasets have emerged that contain examples on the order of tens of thousands of frames. The Flying Chairs [10] dataset contains about 22,000 samples of chairs moving against random backgrounds. Although it is not very realistic or diverse, it provides training data for neural networks [2, 10] that achieve reasonable results on optical flow benchmarks. Even more recent datasets [12, 13] for optical flow are especially designed for training deep neural networks. Flying Things [12] contains tens of thousands of samples of random 3D objects in motion. The Creative Flow+ Dataset [36] contains diverse artistic videos in multiple styles. The Monkaa and Driving scene datasets [12] contain frames from animated scenes and virtual driving, respectively. Virtual KITTI [13] uses graphics to generate scenes like those in KITTI and is two orders of magnitude larger. Recent synthetic datasets [37] show that synthetic data can train networks that generalize to real scenes.

For human bodies, some works [38, 39] render images with the non-learned, artist-defined MakeHuman model [40] for 3D pose estimation or person re-identification, respectively. However, statistical parametric models learned from 3D scans of a big human population, like SMPL [41], capture the real distribution of human body shape. The SURREAL dataset [42] uses 3D SMPL human meshes rendered on top of color images to train networks for depth estimation and body part segmentation. While not fully realistic, they show that this data is sufficient to train methods that generalize to real data. We go beyond these works to address the problem of optical flow.

3 The Human Optical Flow Dataset

Our approach generates a realistic dataset of synthetic human motions by simulating them against different realistic backgrounds. We use parametric models [15, 41] to generate synthetic humans with a wide variety of different human shapes. We employ Blender2 and its Cycles rendering engine to generate realistic synthetic image frames and optical flow. In this way we create the Human Optical Flow dataset, which is comprised of two parts. We first create the Single-Human Optical Flow (SHOF) dataset [18] using the body-only SMPL model [41] in images containing a single synthetic human. However, image statistics are different for the single- and multi-person case, as multiple people tend to occlude each other in complicated ways.

2 https://www.blender.org

Fig. 2 Pipeline for generating the RGB frames and ground truth optical flow for the Multi-Human Optical Flow dataset. The datasets used in this pipeline are listed in Table 1, while the various rendering components are summarized in Table 2.

For this reason we then create the Multi-Human Optical Flow (MHOF) dataset to better capture this realistic interaction. To make images even more realistic for MHOF, we replace SMPL [41] with the SMPL+H [15] model that models the body together with articulated fingers, to have richer motion variation. In the rest of this section, we describe the components of our rendering pipeline, shown in Figure 2. For easy reference, in Table 1 we summarize the data used to generate the SHOF and MHOF datasets, while in Table 2 we summarize the various tools, Blender passes and parameters used for rendering. In the rest of the section, we describe the modules used for generating the data.

3.1 Human Body Generation

Body Model. A parametrized body model is necessary to generate human bodies in a scene. In the SHOF dataset, we use SMPL [41] for generating human body shapes. For the MHOF dataset, we use SMPL+H [15], which parametrizes the human body together with articulated fingers, for increased realism. The models are parameterized by pose and shape parameters to change the body posture and identity, as shown in Figure 2. They also contain a UV appearance map that allows us to change the skin tone, face features and clothing texture of the resulting virtual humans.

Body Poses. The next step is articulating the human body with different poses, to create moving sequences. To find such poses, we use 3D MoCap datasets [43, 44, 45] that capture 3D MoCap marker positions, glued onto the skin surface of real human subjects. We then employ MoSh [16, 17], which fits our body model to these 3D markers by optimizing over parameters of the body model for articulated pose, translation and shape. The pose specifically is a vector of axis-angle parameters that describes how to rotate each body part around its corresponding skeleton joint.

For the SHOF dataset, we use the Human3.6M dataset [43], which contains five subjects for training (S1, S5, S6, S7, S8) and two for testing (S9, S11). Each subject performs 15 actions twice, resulting in 1,559,985 frames for training and 550,727 for testing. These sequences are subsampled at a rate of 16x, resulting in 97,499 training and 34,420 testing poses from Human3.6M.

For the MHOF dataset, we use the CMU [44] and HumanEva [45] MoCap datasets to increase motion variation. From the CMU MoCap dataset, we use 2,605 sequences of 23 high-level action categories. From the HumanEva dataset, we use more than 10 sequences performing actions from 6 different action categories. To reduce redundant poses and allow for larger motions between frames, sequences are subsampled to 12 fps, resulting in 321,873 poses. As a result, the final MHOF dataset has 254,211 poses for training, 32,670 for validation and 34,992 for testing.

Hand Poses. Traditionally, MoCap systems and datasets [43, 44, 45] record the motion of body joints, and avoid the tedious capture of detailed hand and finger motion. However, in natural settings, people use their body, hands and fingers to communicate social cues and to interact with the physical world. To enable our methods to learn such subtle motions, they should be represented in our training data. Therefore, we use the SMPL+H model [15] and augment the body-only MoCap datasets, described above, with finger motion. Instead of using random finger poses that would generate unrealistic optical flow, we employ the Embodied Hands dataset [15] and sample continuous finger motion to generate realistic optical flow.

We use 43 sequences of hand motion with 37,232 frames recorded at 60 Hz by [15]. Similarly to body MoCap, we subsample hand MoCap to 12 fps to reduce overlapping poses without sacrificing variability.

Body Shapes. Human bodies vary a lot in their proportions, since each person has a unique body shape. To represent this in our dataset, we first learn a gender-specific Gaussian distribution of shape parameters, by fitting SMPL to 3D CAESAR scans [46] of both genders. We then sample random body shapes from this distribution to generate a large number of realistic body shapes for rendering. However, naive sampling can result in extreme and unrealistic shape parameters, therefore we bound the shape distribution to avoid unlikely shapes.

For the SHOF dataset we bound the shape parameters to the range of [-3, 3] standard deviations for each shape coefficient and draw a new shape for every subsequence of 20 frames to increase variance.

For the MHOF dataset, we account explicitly for collisions and intersections, since intersecting virtual humans would result in generation of inaccurate optical flow. To minimize such cases, we use similar sampling as above with only small differences. We first use shorter subsequences of 10 frames for less frequent inter-human intersections. Furthermore, we bound the shape distribution to the narrower range of [-2.7, 2.7] standard deviations, since re-targeting motion to unlikely body shapes is more prone to mesh self-intersections.
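As a rough illustration of this bounded sampling, the sketch below draws shape coefficients from a gender-specific Gaussian and rejects draws outside the allowed range; the function and variable names (sample_shape, mean, std) are hypothetical and not taken from the released dataset code.

```python
import numpy as np

def sample_shape(mean, std, max_abs_stdev=3.0, num_betas=10, rng=None):
    """Draw one set of SMPL shape coefficients from a gender-specific Gaussian,
    rejecting draws outside +/- max_abs_stdev standard deviations."""
    rng = rng or np.random.default_rng()
    while True:
        z = rng.standard_normal(num_betas)       # standardized shape coefficients
        if np.all(np.abs(z) <= max_abs_stdev):   # bound to avoid extreme shapes
            return mean + std * z                # un-standardize to shape space

# Hypothetical usage with per-gender statistics fitted to CAESAR registrations:
# betas_shof = sample_shape(mean_f, std_f, max_abs_stdev=3.0)   # SHOF setting
# betas_mhof = sample_shape(mean_f, std_f, max_abs_stdev=2.7)   # MHOF setting
```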
Body Texture. We use the CAESAR dataset [46] to generate a variety of human skin textures. Given SMPL registrations to CAESAR scans, the original per-vertex color in the CAESAR dataset is transferred into the SMPL texture map. Since fiducial markers were placed on the bodies of CAESAR subjects, we remove them from the textures and inpaint them to produce a natural texture. In total, we use 166 CAESAR textures that are of good quality. The main drawback of CAESAR scans is their homogeneity in terms of outfit, since all of the subjects wore grey shorts and the women wore sports bras. In order to increase the clothing variety, we also use textures extracted from our 3D scans (referred to as non-CAESAR in the following), to which we register SMPL with 4Cap [52]. A total of 772 textures from 7 different subjects with different clothes were captured. We anonymized the textures by replacing the face by the average face in CAESAR, after correcting it to match the skin tone of the texture. Textures are grouped according to the gender, which is randomly selected for each virtual human.

For the SHOF dataset the textures were split into training and testing sets with a 70/30 ratio, while each texture dataset is sampled with a 50% chance. For the MHOF dataset, we introduce a more refined splitting with an 80/10/10 ratio for the train, validation and test sets. Moreover, since we also introduce finger motion, we want to favour sampling non-CAESAR textures, due to the bad quality of CAESAR texture maps for the finger region. Thus each texture is sampled with equal probability.

Hand Texture. Hands and fingers are hard to scan due to occlusions and measurement limitations. As a result, texture maps are particularly noisy or might even have holes. Since texture is important for optical flow, we augment the body texture maps to improve hand regions. For this we follow a divide and conquer approach. First, we capture hand-only scans with a 3dMD scanner [15]. Then, we create hand-only textures using the MANO model [15], getting 176 high resolution textures from 20 subjects. Finally, we use the hand-only textures to replace the problematic hand regions in the full-body texture maps.

We also need to find the best matching hand-only texture for every body texture. Therefore, we convert all texture maps to HSV space and compute the mean HSV value for each texture map from standard sampling regions. For full body textures, we sample face regions without facial hair, while for hand-only textures, we sample the center of the outer palm. Then, for each body texture map we find the closest hand-only texture map in HSV space, and shift the values of the latter by the HSV difference, so that the hand skin tone becomes more similar to the facial skin tone. Finally, this improved hand-only texture map is used to replace the pixels in the hand region of the full body texture map.
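A minimal sketch of this HSV-based matching is given below, assuming hypothetical inputs (full-body textures, face masks, hand-only textures, palm masks) and ignoring hue wrap-around for simplicity; it is not the authors' released implementation.

```python
import numpy as np
from matplotlib.colors import rgb_to_hsv, hsv_to_rgb

def mean_hsv(texture_rgb, region_mask):
    """Mean HSV value of a texture over a sampling region (RGB values in [0, 1])."""
    hsv = rgb_to_hsv(texture_rgb)
    return hsv[region_mask].mean(axis=0)

def match_hand_to_body(body_tex, face_mask, hand_textures, palm_masks):
    """Pick the hand-only texture closest to the body texture in mean HSV and
    shift it by the HSV difference so the hand tone matches the facial tone."""
    target = mean_hsv(body_tex, face_mask)
    means = np.stack([mean_hsv(t, m) for t, m in zip(hand_textures, palm_masks)])
    idx = int(np.argmin(np.linalg.norm(means - target, axis=1)))
    shifted = rgb_to_hsv(hand_textures[idx]) + (target - means[idx])  # shift HSV values
    return hsv_to_rgb(np.clip(shifted, 0.0, 1.0))
```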
(Self-)Collision. The MHOF dataset contains multiple virtual humans moving differently, so there are high chances of collisions and penetrations. This is undesirable because penetrations are physically implausible and unrealistic. Moreover, the generated ground truth optical flow might have artifacts. Therefore, we employ a collision detection method to avoid intersections and penetrations.

Instead of using simple bounding boxes for rough collision detection, we draw inspiration from [53] and perform accurate and efficient collision detection on the triangle level using bounding volume hierarchies (BVH) [51]. This level of detailed detection allows for challenging occlusions with small distances between virtual humans, as can commonly be observed for realistic interactions between real humans. This method is useful not only for inter-person collision detection, but also for self-intersections. This is especially useful for our scenarios, as re-targeting body and hand motion to people of different shapes might result in unrealistic self-penetrations. The method is applicable out of the box, with the only exception that we exclude checks of neighboring body parts that are always or frequently in contact, e.g. upper and lower arm, or the two thighs.

3.2 Scene Generation

Background texture. For the scene background in the SHOF dataset, we use random indoor images from the LSUN dataset [48].

Item | SHOF | MHOF | Purpose
MoCap data | Human3.6M [43] | CMU [47], HumanEva [45] | Natural body poses
MoCap → SMPL | MoSh [16, 17] | MoSh [16, 17] | SMPL parameters from MoCap
Training poses | 97,499 | 254,211 | Articulate virtual humans
Validation poses | – | 32,670 | Articulate virtual humans
Test poses | 34,420 | 34,992 | Articulate virtual humans
Hand pose dataset | – | Embodied Hands [15] | Natural finger poses
Body shapes | Sample Gaussian distr. (CAESAR), bounded within [-3, 3] st.dev. | Sample Gaussian distr. (CAESAR), bounded within [-2.7, 2.7] st.dev. | Body proportions of virtual humans
Textures | CAESAR, non-CAESAR | CAESAR (hands improved), non-CAESAR (hands improved) | Appearance of virtual humans
Background | LSUN [48] (indoor), 417,597 images | SUN397 [49] (indoor and outdoor), 30,022 images | Scene background

Table 1 Comparison of datasets and most important data preprocessing steps used to generate the SHOF and MHOF datasets. A short description of the respective part is provided in the last column.

Item | SHOF | MHOF | Purpose
Rendering | Cycles | Cycles | Synthetic RGB image rendering
Optical flow | Vector pass (Blender) | Vector pass (Blender) | Optical flow ground truth
Segment. masks | Material pass (Blender) | Material pass (Blender) | Body part segment. masks (Fig. 3)
Motion blur | Vector pass (Blender) | Vector pass (Blender) | Realistic motion blur artifacts
Imaging noise | Gaussian blur (pixel space), 1 px std.dev. for 30% of images | Gaussian blur (pixel space), 1 px std.dev. for 30% of images | Realistic image imperfections
Camera translation | Sampled for 30% of frames from Gaussian with 1 cm std.dev. | Sampled for 30% of subsequences from Gaussian with 1 cm std.dev. | Realistic perturbations of the camera (and resulting optical flow)
Camera rotation | Sampled per frame from Gaussian with 0.2 degrees std.dev. | – | Realistic perturbations of the camera (and resulting optical flow)
Illumination | Spherical harmonics [50] | Spherical harmonics [50] | Realistic lighting model
Subsequence length | 20 frames | 10 frames | Number of successive frames with consistent rendering parameters
Mesh collision | – | BVH [51] | Detect (self-)collisions on the triangle level to avoid defective optical flow

Table 2 Comparison of tools, Blender passes and parameters used to generate the SHOF and MHOF datasets. The last column provides a short description of the respective method.

This provides a good compromise between simplicity and the complex task of generating varied full 3D environments. We use 417,597 images from the LSUN categories kitchen, living room, bedroom and dining room. These images are placed as billboards, 9 meters from the camera, and are not affected by the spherical harmonics lighting.

In the MHOF dataset, we increase the variability in background appearance. We employ the SUN397 dataset [49] that contains images for 397 highly variable scenes that are both indoor and outdoor, in contrast to LSUN. For quality reasons, we reject all images with resolution smaller than 512 x 512 px, and also reject images that contain humans using Mask R-CNN [54, 55]. As a result, we use 30,222 images, split into 24,178 for the training set and 3,022 for each of the validation and test sets. Further, we increase the distance between the camera and background to 12 meters, to increase the space in which the multiple virtual humans can move without colliding frequently with each other, while still being close enough for visual occlusions.

Scene Illumination. We illuminate the bodies with Spherical Harmonics lighting [50] that defines basis vectors for light directions. This parameterization is useful for randomizing the scene light by randomly sampling the coefficients with a bias towards natural illumination. The coefficients are uniformly sampled between -0.7 and 0.7, apart from the ambient illumination, which has a minimum value of 0.3 to avoid extremely dark images, and the illumination direction, which is strictly negative to favour illumination coming from above.

Increasing Image Realism. In order to increase realism, we introduced three types of image imperfections. First, for 30% of the generated images we introduced camera motion between frames. This motion perturbs the location of the camera with Gaussian noise of 1 cm standard deviation between frames and rotation noise of 0.2 degrees standard deviation per dimension in an Euler angle representation.

Second, we add motion blur to the scene using the Vector Blur Node in Blender, integrated over 2 frames sampled with 64 steps between the beginning and end point of the motion. Finally, we add a Gaussian blur to 30% of the images with a standard deviation of 1 pixel.
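A minimal sketch of the camera jitter and Gaussian blur imperfections described above might look as follows; the helper names and the way parameters are passed are assumptions, and the Blender-side motion blur is omitted.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng()

def perturb_camera(location, euler_deg, p=0.3, trans_std=0.01, rot_std=0.2):
    """With probability p, jitter camera location (meters) and rotation (degrees)."""
    if rng.random() < p:
        location = location + rng.normal(0.0, trans_std, size=3)
        euler_deg = euler_deg + rng.normal(0.0, rot_std, size=3)
    return location, euler_deg

def maybe_blur(image, p=0.3, sigma=1.0):
    """With probability p, apply a 1-pixel-std Gaussian blur to an H x W x 3 image."""
    if rng.random() < p:
        image = gaussian_filter(image, sigma=(sigma, sigma, 0))
    return image
```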
Scene Compositing. For animating virtual humans, each
MoCap sequence is selected at least once. To increase vari-
ability, each sequence is split into subsequences. For the first
frame of each subsequence, we sample a body and back-
ground texture, lights, blurring and camera motion parame-
ters, and re-position virtual humans on the horizontal plane.
We then introduce a random rotation around the z-axis for
variability in the motion direction.
For the SHOF dataset, we use subsequences of 20 frames,
and at the beginning of each one the single virtual human is
re-positioned in the scene such that the pelvis is projected
onto the image center.
For the MHOF dataset, we increase the variability with
smaller subsequences of 10 frames and introduce more chal-
lenging visual occlusions by uniformly sampling the number
of virtual humans in the range [4, 8]. We sample MoCap sequences S_j with a probability of p_j = |S_j| / Σ_{i=1}^{|S|} |S_i|, where |S_j| denotes the number of frames of sequence S_j and |S| the number of sequences. In contrast to the SHOF dataset, for the MHOF dataset the virtual humans are not re-positioned at the center, as they would all collide. Instead, they are placed at random locations on the horizontal plane within camera visibility, making sure there are no collisions with other virtual humans or the background plane during the whole subsequence.
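As an illustration, the frame-count-weighted sequence sampling above can be written in a few lines; sample_sequence and the sequence representation (a list of frame lists) are hypothetical.

```python
import random

def sample_sequence(sequences, rng=random):
    """Pick a MoCap sequence with probability proportional to its frame count,
    i.e. p_j = |S_j| / sum_i |S_i|."""
    weights = [len(seq) for seq in sequences]
    return rng.choices(sequences, weights=weights, k=1)[0]
```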
3.3 Ground Truth Generation

Segmentation Masks. Using the material pass of Blender, we store for each frame the ground truth body part segmentation for our models. Although the body part segmentation for both models is similar, SMPL models the palm and fingers as one part, while SMPL+H has a different part segment for each finger bone. Figure 3 shows an example body part segmentation for SMPL+H. These segmentation masks allow us to perform a per body-part evaluation of our optical flow estimation.

Fig. 3 Body part segmentation for the SMPL+H model. Symmetrical body parts are labeled only once. Finger joints follow the same naming convention as shown for the thumb. (Best viewed in color)

Rendering & Ground Truth Optical Flow. For generating images, we use the open source suite Blender and its vector pass. The render pass is typically used for producing motion blur, and it produces the motion in image space of every pixel, i.e. the ground truth optical flow. We are mainly interested in the result of this pass, together with the color rendering of the textured bodies.

4 Learning

We train two different network architectures to estimate optical flow on both the SHOF and MHOF dataset. We choose compact models that are based on spatial pyramids, namely SPyNet [2] and PWC-Net [3], shown in Figure 4. We denote the models trained on the SHOF dataset by SPyNet+SHOF and PWC+SHOF. Similarly, we denote models trained on the MHOF dataset by SPyNet+MHOF and PWC+MHOF.

The spatial pyramid structure employs a convnet at each level of an image pyramid. A pyramid level works on a particular resolution of the image. The top level works on the full resolution and the image features are downsampled as we move to the bottom of the pyramid. Each level learns a convolutional layer d to perform downsampling of image features. Similarly, a convolution layer u is learned for decoding optical flow. At each level, the convnet G_k predicts optical flow residuals v_k at that level. These flow residuals get added at each level to produce the full flow V_K at the finest level of the pyramid.

In SPyNet, each convnet G_k takes a pair of images as inputs along with the flow V_{k-1} obtained by resizing the output of the previous level with interpolation. The second frame is

Fig. 4 Spatial Pyramid Network [2] (left) and PWC-Net [3] (right) for optical flow estimation. At each pyramid level, network Gk predicts flow at
that level which is used to condition the optical flow at the higher resolution level in the pyramid. Adapted from [3].

however warped using V_{k-1}, and the triplet {I^1_k, w(I^2_k, V_{k-1}), V_{k-1}} is fed as input to the convnet G_k.

In PWC-Net, a pair of image features {I^1_k, I^2_k} is input at a pyramid level, and the second feature map is warped using the flow V_{k-1} from the previous level of the pyramid. We then compute the cost volume c(I^1_k, w(I^2_k, V_{k-1})) over feature maps and pass it to network G_k to compute optical flow V_k at that pyramid level.

We use the pretrained weights as initializations for training both SPyNet and PWC-Net. We train both models end-to-end to minimize the average End Point Error (EPE).
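To make the coarse-to-fine structure concrete, the sketch below shows a SPyNet-style residual refinement with backward warping in PyTorch; the G_k networks, the pyramid lists and the warp helper are schematic assumptions rather than the released SPyNet or PWC-Net code (PWC-Net would additionally build a cost volume over warped features instead of feeding images directly to G_k).

```python
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Backward-warp img (B,C,H,W) with flow (B,2,H,W) given in pixels."""
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(img.device)      # (2,H,W), x then y
    coords = grid.unsqueeze(0) + flow                                # sampling locations
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0                    # normalize to [-1,1]
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    return F.grid_sample(img, torch.stack((coords_x, coords_y), dim=-1),
                         align_corners=True)

def pyramid_flow(levels, img1_pyr, img2_pyr):
    """Coarse-to-fine residual flow: each level refines an upsampled copy of
    the previous estimate. `levels` are the per-level convnets G_k; the image
    pyramids are ordered from coarsest to finest resolution."""
    flow = torch.zeros_like(img1_pyr[0][:, :2])                      # start with zero flow
    for G, I1, I2 in zip(levels, img1_pyr, img2_pyr):
        flow = 2.0 * F.interpolate(flow, size=I1.shape[-2:],
                                   mode="bilinear", align_corners=True)
        residual = G(torch.cat((I1, warp(I2, flow), flow), dim=1))   # predict v_k
        flow = flow + residual                                       # V_k = up(V_{k-1}) + v_k
    return flow
```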
Hyperparameters. We follow the same training procedure for SPyNet and PWC-Net. The only exception to this is the learning rate, which is determined empirically for each dataset and network from {10^-6, 10^-5, 10^-4}. For the SHOF we found 10^-6 to yield the best results for SPyNet. Predictions of PWC on the SHOF dataset do not improve for any of these learning rates. For training on MHOF, learning rates of 10^-6 and 10^-4 yield the best results for SPyNet and PWC-Net, respectively. We use Adam [56] to optimize our loss with β1 = 0.9 and β2 = 0.999. We use a batch size of 8 and run 400,000 training iterations. All networks are implemented in the PyTorch framework. Fine-tuning the networks from pretrained weights takes approximately 1 day on SHOF and 2 days on MHOF.

Data Augmentations. We also augment our data by applying several transformations and adding noise. Although our dataset is quite large, augmentation improves the quality of results on real scenes. In particular, we apply scaling in the range of [0.3, 3], and rotations in [-17°, 17°]. The dataset is normalized to have zero mean and unit standard deviation using [57].
β1 = 0.9 and β2 = 0.999. We use a batch size of 8 and run deployment for training neural networks. This also speeds up
400, 000 training iterations. All networks are implemented the rendering process in Blender for generating large amounts
in the Pytorch framework. Fine-tuning the networks from of data. We show the comparisons of processing time of dif-
pretrained weights takes approximately 1 day on SHOF and ferent models on the SHOF dataset in Table 4(a). For the
2 days on MHOF. MHOF dataset we increase the resolution to 640 × 640 px
to be able to reason about optical flow even in small body
Data Augmentations. We also augment our data by ap- parts like fingers, using SMPL+H. Our data is extensive,
plying several transformations and adding noise. Although containing a wide variety of human shapes, poses, actions
our dataset is quite large, augmentation improves the quality and virtual backgrounds to support deep learning systems.
of results on real scenes. In particular, we apply scaling in the
range of [0.3, 3], and rotations in [−17◦ , 17◦ ]. The dataset is Comparison on SHOF. We compare the average End
normalized to have zero mean and unit standard deviation Point Errors (EPEs) of optical flow methods on the SHOF
using [57]. dataset in Table 4, along with the time for evaluation. We

Dataset | # Train Frames | # Test Frames | Resolution
MPI Sintel [9] | 1,064 | 564 | 1024 x 436
KITTI 2012 [8] | 194 | 195 | 1226 x 370
KITTI 2015 [58] | 200 | 200 | 1242 x 375
Virtual KITTI [13] | 21,260 | – | 1242 x 375
Flying Chairs [10] | 22,232 | 640 | 512 x 384
Flying Things [12] | 21,818 | 4,248 | 960 x 540
Monkaa [12] | 8,591 | – | 960 x 540
Driving [12] | 4,392 | – | 960 x 540
SHOF (ours) | 135,153 | 10,867 | 256 x 256
MHOF (ours) | 86,259 | 13,236 | 640 x 640

Table 3 Comparison of the Human Optical Flow datasets, namely the Single-Human Optical Flow (SHOF) and the Multi-Human Optical Flow (MHOF) dataset, with previous optical flow datasets.

Method | AEPE | Time (s) | Learned | Fine-tuned on SHOF
Zero | 0.6611 | – | – | –
FlowNet [10] | 0.5846 | 0.080 | ✓ | ✗
PCA Layers [31] | 0.3652 | 10.357 | ✗ | ✗
PWC-Net [3] | 0.2158 | 0.024 | ✓ | ✗
PWC+SHOF | 0.2158 | 0.024 | ✓ | ✓
SPyNet [2] | 0.2066 | 0.022 | ✓ | ✗
Epic Flow [34] | 0.1940 | 1.863 | ✗ | ✗
LDOF [59] | 0.1881 | 8.620 | ✗ | ✗
FlowNet2 [11] | 0.1895 | 0.127 | ✓ | ✗
Flow Fields [60] | 0.1709 | 4.204 | ✗ | ✗
SPyNet+SHOF | 0.1164 | 0.022 | ✓ | ✓

Table 4 EPE comparisons and evaluation times of different optical flow methods on the SHOF dataset. Zero refers to the EPE when zero flow (no motion) is always used for evaluation. Evaluation times are based on the SHOF dataset with 256 x 256 image resolution. We time all GPU based methods using a Tesla V100-16GB GPU.
show visual comparisons in Figure 5. Human motion is complex and general optical flow methods fail to capture it. We observe that SPyNet+SHOF outperforms methods that are not trained on SHOF, and SPyNet [2] in particular. We expect more involved methods like FlowNet2 [11] to have a bigger performance gain than SPyNet when trained on SHOF.

We observe that FlowNet [10] shows poor generalization on our dataset. Since the results of FlowNet [10] in Tables 4 and 7 are very close to the zero flow (no motion) baseline, we cross-verify by evaluating FlowNet on a mixture of Flying Chairs [10] and Human Optical Flow and observe that the flow outputs on SHOF are quite random (see Figure 5). The main reason is that SHOF contains a significant amount of small motions and it is known that FlowNet does not perform very well on small motions. SPyNet+SHOF [2], however, performs quite well and is able to generalize to body motions. The results however look noisy in many cases.

Our dataset employs a layered structure where a human is placed against a background. As such, layered methods like PCA-Layers [31] perform very well on a few images (row 8 in Figure 5) where they are able to segment a person from the background. However, in most cases, they do not obtain a good segmentation into layers.

Previous state-of-the-art methods like LDOF [59] and EpicFlow [34] perform much better than others. They get a good overall shape, and smooth backgrounds. However, their estimation is quite blurred. They tend to miss the sharp edges that are typical of human hands and legs. They are also significantly slower.

In contrast, by fine-tuning on our dataset, the performance of SPyNet+SHOF improves by 40% over SPyNet on the SHOF dataset. We also find that fine-tuning PWC-Net on the SHOF does not improve the model. This could be because the SHOF dataset has predominantly small motions, which are handled better by the SPyNet [2] architecture. Empirically, we have seen that PWC-Net has state-of-the-art performance on standard benchmarks. This motivates the generation of the MHOF dataset, which includes larger motions and more complex scenes with occlusions.

A qualitative comparison to popular optical flow methods can be seen in Figure 5. Flow estimations of SPyNet+SHOF can be observed to be sharper than those of methods that are not trained on human motion. This can especially be seen for edges.

Comparison on MHOF. Training (fine-tuning) on the MHOF dataset improves SPyNet and PWC-Net on average, as can be seen in Table 5. In particular, PWC+MHOF outperforms SPyNet+MHOF and also improves over generic state-of-the-art optical flow methods. Large parts of the image are background, whose movements are relatively easy to estimate. However, we are particularly interested in human motions. Therefore, we mask out all errors of background pixels and compute the average EPE only on body pixels (see Table 5). For these pixels, light-weight networks like SPyNet and PWC-Net improve over almost all generic optical flow estimation methods using our dataset (SPyNet+MHOF and PWC+MHOF), including the much larger network FlowNet2. PWC+MHOF is the best performing method.
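The body-pixel and per-part EPE used here can be computed directly from the rendered part segmentation; the sketch below assumes label 0 marks the background, which is an assumption about the label convention.

```python
import torch

def epe_map(flow_pred, flow_gt):
    """Per-pixel End Point Error for flow tensors of shape (B, 2, H, W)."""
    return torch.norm(flow_pred - flow_gt, p=2, dim=1)          # (B, H, W)

def body_epe(flow_pred, flow_gt, part_labels, background_id=0):
    """Average EPE restricted to human pixels, masking out the background
    using the rendered body part segmentation."""
    epe = epe_map(flow_pred, flow_gt)
    return epe[part_labels != background_id].mean()

def per_part_epe(flow_pred, flow_gt, part_labels, part_id):
    """Average EPE for a single body part id, as used in the per-part analysis."""
    epe = epe_map(flow_pred, flow_gt)
    return epe[part_labels == part_id].mean()
```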
A more fine-grained analysis of EPE across body parts is shown in Table 7. We obtain the EPE of these body parts using the segmentation shown in Figure 3. It can be seen that improvements of PWC+MHOF over FlowNet2 are larger for body parts that are at the end of the kinematic tree (i.e. feet, calves, arms and in particular fingers). Differences are less strong for body parts close to the torso. One interpretation of these findings is that movements of the torso are easier to predict, while movements of body parts at the end of the kinematic tree are more complex and thus harder to estimate. In contrast, SPyNet+MHOF outperforms FlowNet2 on body parts close to the torso and does not learn to capture the more complex motions of limbs better than FlowNet2. We expect FlowNet2+MHOF to perform even better, but we do not include this here due to its long and tedious training process.

Frame1 Ground Truth FlowNet FlowNet2 LDOF PCA-Layers EpicFlow SPyNet SpyNet+SHOF PWC
Fig. 5 Visual comparison of optical flow estimates using different methods on the Single-Human Optical Flow (SHOF) test set. From left to right, we show Frame 1, Ground Truth flow, and the results of FlowNet [10], FlowNet2 [11], LDOF [59], PCA-Layers [31], EpicFlow [34], SPyNet [2], SPyNet+SHOF (ours) and PWC-Net [3].

Visual comparisons are shown in Figure 6. In particular, PWC+MHOF predicts flow fields with sharper edges than generic methods or SPyNet+MHOF. Furthermore, the qualitative results suggest that PWC+MHOF is better at distinguishing the motion of people, as people can be better separated on the flow visualizations of PWC+MHOF (Figure 6, row 3). Last, it can be seen that fine details, like the motion of distant humans or small body parts, are better estimated by PWC+MHOF.

The above observations are strong indications that our Human Optical Flow datasets (SHOF and MHOF) can be beneficial for the performance on human motion for other optical flow networks as well.

Real Scenes. We show a visual comparison of results on real-world scenes of people in motion. For visual comparisons of models trained on the SHOF dataset we collect these scenes by cropping people from real world videos as shown in Figure 7. We use DPM [61] for detecting people and compute bounding box regions in two frames using the ground truth of the MOT16 dataset [62]. The results for the SHOF dataset are shown in Figure 8. A comparison of methods on real images with multiple people can be seen in Figure 9.

The performance of PCA-Layers [31] is highly dependent on its ability to segment. Hence, we see only a few cases where it looks visually correct. SPyNet [2] gets the overall shape but the results look noisy in certain image parts. While LDOF [59], EpicFlow [34] and FlowFields [60] generally perform well, they often find it difficult to resolve the legs,

Frame1 Ground Truth FlowNet2 LDOF PCA-Layers EpicFlow SPyNet SPyNet+MHOF PWC PWC+MHOF
Fig. 6 Visual comparison of optical flow estimates using different methods on the Multi-Human Optical Flow (MHOF) test set. From left to right,
we show Frame 1, Ground Truth flow, results of FlowNet2 [11], LDOF [59], PCA-Layers [31], EpicFlow [34], SPyNet [2], SPyNet+MHOF (ours),
PWC-Net [3] and PWC+MHOF (ours).

Method | Average EPE | Average EPE on body pixels | Fine-tuned on MHOF
FlowNet | 0.808 | 2.574 | ✗
PCA Layers | 0.556 | 2.691 | ✗
Epic Flow | 0.488 | 1.982 | ✗
SPyNet | 0.429 | 1.977 | ✗
SPyNet+MHOF | 0.391 | 1.803 | ✓
PWC-Net | 0.369 | 2.056 | ✗
LDOF | 0.360 | 1.719 | ✗
FlowNet2 | 0.310 | 1.863 | ✗
PWC+MHOF | 0.301 | 1.621 | ✓

Table 5 Comparison using End Point Error (EPE) on the Multi-Human Optical Flow (MHOF) dataset. We show the average EPE and body-only EPE. For the latter, the EPE is computed only over segments of the image depicting a human body. Best results are shown in boldface. A comparison of body-part specific EPE can be found in Table 7.

Method | Average MCI (MHOF) | Average MCI (Real)
FlowNet | 287.328 | 401.779
PCA Layers | 201.594 | 423.332
Epic Flow | 129.252 | 234.037
SPyNet | 142.108 | 302.753
SPyNet+MHOF | 143.029 | 297.142
PWC-Net | 157.088 | 344.202
LDOF | 71.449 | 158.281
FlowNet2 | 145.732 | 303.799
PWC+MHOF | 152.314 | 351.567

Table 6 Comparison using Motion Compensated Intensity (MCI) on the Multi-Human Optical Flow (MHOF) dataset and a real video sequence. Example images for the real video sequence can be seen in Figure 9.

hands and head of the person. The results from models trained on our Human Optical Flow dataset look appealing especially while resolving the overall human shape, and various parts like legs, hands and the human head. Models trained on the Human Optical Flow dataset perform well under occlusion (Figure 8, Figure 9). Many examples including severe occlusion can be seen in Figure 9. Besides that, Figure 9 shows that the models trained on MHOF are able to distinguish motions of multiple people and predict sharp edges of humans.

Fig. 7 We use the DPM [61] person detector to crop out people from real-world scenes (left) and use SPyNet+SHOF to compute optical flow on the cropped section (right).

A quantitative evaluation on real data with humans is not possible, as no such dataset with ground truth optical flow annotation exists. To determine generalization of the models to real data, despite the lack of ground truth annotation, we can use the Motion Compensated Intensity (MCI) as an error metric. Given the image sequence I^1, I^2 and predicted flow V, the MCI error is given by

MCI(I^1, I^2, V) = ||I^1 - w(I^2, V)||_2,    (1)

where w warps the image I^2 according to flow V. This metric certainly has limitations. The motion compensated intensity assumes Lambertian conditions, i.e. the intensity of a point remains constant over time. The MCI error does not account for occlusions. Furthermore, MCI does not account for smooth flow fields over texture-less surfaces. Despite these shortcomings of MCI, we report these numbers to show that our models generalize to real data. However, it should be noted that EPE is a more precise metric to evaluate optical flow estimation.

To test whether MCI correlates with the EPEs in Table 5, we compute MCI on the MHOF dataset. The results can be seen in Table 6. We observe that methods like FlowNet and PCA-Layers, which have poor performance on the EPE metric, have a higher MCI error. For methods with lower EPE, the MCI errors do not exactly correspond to the respective EPEs. This is due to the limitations of the MCI metric, as described above. Finally, we compute MCI on a real video sequence from Youtube3. The MCI errors are shown in Table 6.

3 https://www.youtube.com/watch?v=2DiQUX11YaY
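A sketch of this metric in PyTorch is given below, reusing a backward-warping helper such as the warp function sketched in Section 4; whether the norm is taken per image or averaged over the batch is an assumption here.

```python
import torch

def mci(img1, img2, flow):
    """Motion Compensated Intensity error of Eq. (1): the L2 norm of the residual
    between frame 1 and frame 2 backward-warped by the predicted flow.
    `warp` is the backward-warping helper sketched in Section 4."""
    residual = img1 - warp(img2, flow)
    return torch.norm(residual.flatten(start_dim=1), p=2, dim=1).mean()
```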
6 Conclusion and Future Work

In summary, we created an extensive Human Optical Flow dataset containing images of realistic human shapes in motion together with ground truth optical flow. The dataset is comprised of two parts, the Single-Human Optical Flow (SHOF) and the Multi-Human Optical Flow (MHOF) dataset. We then train two compact network architectures based on spatial pyramids, namely SPyNet and PWC-Net. The realism and extent of our dataset, together with an end-to-end training scheme, allows these networks to outperform previous state-of-the-art optical flow methods on our new human-specific dataset. This indicates that our dataset can be beneficial for other optical flow network architectures as well. Furthermore, our qualitative results suggest that the networks trained on the Human Optical Flow dataset generalize well to real world scenes with humans. This is evidenced by results on a real sequence using the MCI metric. The trained models are compact and run in real time, making them highly suitable for phones and embedded devices.

The dataset and our focus on human optical flow open up a number of research directions in human motion understanding and optical flow computation. We would like to extend our dataset by modeling more diverse clothing and outdoor scenarios. A direction of potentially high impact for this work is to integrate it in end-to-end systems for action recognition, which typically take precomputed optical flow as input. The real-time nature of the method could support motion-based interfaces, potentially even on devices like cell phones with limited computing power. The dataset, dataset generation code, pretrained models, and training code are available, enabling researchers to use them for problems involving human motion.

Acknowledgements

We thank Yiyi Liao for helping us with optical flow evaluation. We thank Sergi Pujades for helping us with collision detection of meshes. We thank Cristian Sminchisescu for the Human3.6M MoCap marker data.

References

1. Simon Baker, Daniel Scharstein, JP Lewis, Stefan Roth, Michael J Black, and Richard Szeliski. A database and evaluation methodology for optical flow. International Journal of Computer Vision, 92(1):1–31, 2011.
2. Anurag Ranjan and Michael J Black. Optical flow estimation using a spatial pyramid network. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
3. Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In CVPR, 2018.
4. Donald Geman and Stuart Geman. Opinion: Science in the age of selfies. Proceedings of the National Academy of Sciences, 113(34):9384–9387, 2016.
5. Hueihan Jhuang, Juergen Gall, Silvia Zuffi, Cordelia Schmid, and Michael J. Black. Towards understanding action recognition. In IEEE International Conference on Computer Vision (ICCV), pages 3192–3199, Sydney, Australia, December 2013. IEEE.
6. Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
7. Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre. HMDB: a large video database for human motion recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2556–2563. IEEE, 2011.

Frame 1 Frame 2 PCA-Layers SPyNet EpicFlow LDOF FlowFields SPyNet+SHOF

Fig. 8 Single-Human Optical Flow visuals on real images using different methods. From left to right, we show Frame 1, Frame 2, and the results of PCA-Layers [31], SPyNet [2], EpicFlow [34], LDOF [59], FlowFields [60] and SPyNet+SHOF (ours).

8. Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
9. D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In A. Fitzgibbon et al. (Eds.), editor, European Conf. on Computer Vision (ECCV), Part IV, LNCS 7577, pages 611–625. Springer-Verlag, October 2012.
10. Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Caner Hazirbas, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, Thomas Brox, et al. FlowNet: Learning optical flow with convolutional networks. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 2758–2766. IEEE, 2015.
11. Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. arXiv preprint arXiv:1612.01925, 2016.
12. N. Mayer, E. Ilg, P. Häusser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2016. arXiv:1512.02134.
13. A. Gaidon, Q. Wang, Y. Cabon, and E. Vig. Virtual worlds as proxy for multi-object tracking analysis. In CVPR, 2016.
14. Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J. Black. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In Computer Vision – ECCV 2016, Lecture Notes in Computer Science. Springer International Publishing, October 2016.
15. Javier Romero, Dimitrios Tzionas, and Michael J. Black. Embodied hands: Modeling and capturing hands and bodies together. ACM Transactions on Graphics (Proc. SIGGRAPH Asia), 36(6), November 2017.
16. Matthew Loper, Naureen Mahmood, and Michael J Black. MoSh: Motion and shape capture from sparse markers. ACM Transactions on Graphics (TOG), 33(6):220, 2014.

Frame1 FlowNet2 FlowNet LDOF PCA-Layers EpicFlow SPyNet SPyNet+MHOF PWC PWC+MHOF
Fig. 9 Multi-Human Optical Flow visuals on real images. From left to right, we show Frame 1, results of FlowNet2 [11], FlowNet [10], LDOF
[59], PCA-Layers [31], EpicFlow [34], SPyNet [2], SPyNet+MHOF (ours), PWC-Net [3] and PWC+MHOF (ours).

17. Naureen Mahmood, Nima Ghorbani, Nikolaus F. Troje, Gerard Pons-Moll, and Michael J. Black. AMASS: Archive of motion capture as surface shapes. CoRR, abs/1904.03278, 2019.
18. Anurag Ranjan, Javier Romero, and Michael J. Black. Learning human optical flow. In 29th British Machine Vision Conference, September 2018.
19. Gunnar Johansson. Visual perception of biological motion and a model for its analysis. Perception & Psychophysics, 14(2):201–211, 1973.
20. J. W. Davis. Hierarchical motion history images for recognizing human motion. In Detection and Recognition of Events in Video, pages 39–46, 2001.
21. M. J. Black, Y. Yacoob, A. D. Jepson, and D. J. Fleet. Learning parameterized models of image motion. In IEEE Conf. on Computer Vision and Pattern Recognition, CVPR-97, pages 561–567, Puerto Rico, June 1997.
22. R. Fablet and M. J. Black. Automatic detection and tracking of human motion with a view-based representation. In European Conf. on Computer Vision, ECCV 2002, volume 1 of LNCS 2353, pages 476–491. Springer-Verlag, 2002.
23. K. Fragkiadaki, H. Hu, and J. Shi. Pose from flow and flow from pose. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, pages 2059–2066, June 2013.
24. Silvia Zuffi, Javier Romero, Cordelia Schmid, and Michael J Black. Estimating human pose with flowing puppets. In IEEE International Conference on Computer Vision (ICCV), pages 3312–3319, 2013.
25. Tomas Pfister, James Charles, and Andrew Zisserman. Flowing convnets for human pose estimation in videos. In ICCV, pages 1913–1921. IEEE Computer Society, 2015.
26. James Charles, Tomas Pfister, Derek R. Magee, David C. Hogg, and Andrew Zisserman. Personalizing human video pose estimation. In CVPR, pages 3063–3072. IEEE Computer Society, 2016.
27. Javier Romero, Matthew Loper, and Michael J. Black. FlowCap: 2D human pose from optical flow. In Pattern Recognition, Proc. 37th German Conference on Pattern Recognition (GCPR), volume LNCS 9358, pages 412–423. Springer, 2015.
28. Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR, pages 1933–1941. IEEE Computer Society, 2016.
29. Xuanyi Dong, Shoou-I Yu, Xinshuo Weng, Shih-En Wei, Yi Yang, and Yaser Sheikh. Supervision-by-registration: An unsupervised approach to improve the precision of facial landmark detectors. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
30. William T. Freeman, Egon C. Pasztor, and Owen T. Carmichael. Learning low-level vision. International Journal of Computer Vision, 40(1):25–47, 2000.
31. Jonas Wulff and Michael J Black. Efficient sparse-to-dense optical flow estimation using a learned basis and layers. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 120–130. IEEE, 2015.
32. D. Sun, S. Roth, JP Lewis, and M. J. Black. Learning optical flow. In ECCV, pages 83–97, 2008.
33. Fatma Güney and Andreas Geiger. Deep discrete flow. In Asian Conference on Computer Vision (ACCV), 2016.
34. Jerome Revaud, Philippe Weinzaepfel, Zaid Harchaoui, and Cordelia Schmid. EpicFlow: Edge-preserving interpolation of correspondences for optical flow. In Computer Vision and Pattern Recognition, 2015.
35. Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Deep End2End Voxel2Voxel prediction. In The 3rd Workshop on Deep Learning in Computer Vision, 2016.
36. Maria Shugrina, Ziheng Liang, Amlan Kar, Jiaman Li, Angad Singh, Karan Singh, and Sanja Fidler. Creative Flow+ dataset. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
37. Adrien Gaidon, Zaid Harchaoui, and Cordelia Schmid. Activity representation with motion hierarchies. International Journal of Computer Vision, 107(3):219–238, 2014.
38. Igor Barros Barbosa, Marco Cristani, Barbara Caputo, Aleksander Rognhaugen, and Theoharis Theoharis. Looking beyond appearances: Synthetic training data for deep CNNs in re-identification. Computer Vision and Image Understanding, 167:50–62, 2018.
39. Mona Fathollahi Ghezelghieh, Rangachar Kasturi, and Sudeep Sarkar. Learning camera viewpoint using CNN to improve 3D body pose estimation. In 2016 Fourth International Conference on 3D Vision (3DV), pages 685–693. IEEE, 2016.
40. MakeHuman: Open source tool for making 3D characters.
41. Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia), 34(6):248:1–248:16, October 2015.
42. Gül Varol, Javier Romero, Xavier Martin, Naureen Mahmood, Michael Black, Ivan Laptev, and Cordelia Schmid. Learning from synthetic humans. In CVPR, 2017.
43. Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1325–1339, July 2014.
44. Carnegie-Mellon MoCap database.
45. L. Sigal, A. Balan, and M. J. Black. HumanEva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. International Journal of Computer Vision, 87(1):4–27, March 2010.
46. Kathleen M. Robinette, Sherri Blackwell, Hein Daanen, Mark Boehmer, and Scott Fleming. Civilian American and European Surface Anthropometry Resource (CAESAR), final report, volume 1: Summary. Technical report, DTIC Document, 2002.
47. Ralph Gross and Jianbo Shi. The CMU Motion of Body (MoBo) database, 2001.
48. Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv:1506.03365, 2015.
49. Jianxiong Xiao, James Hays, Krista A. Ehinger, Aude Oliva, and Antonio Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 3485–3492. IEEE, 2010.
50. Robin Green. Spherical harmonic lighting: The gritty details. Archives of the Game Developers Conference, March 2003.
51. Matthias Teschner, Stefan Kimmerle, Bruno Heidelberger, Gabriel Zachmann, Laks Raghupathi, Arnulph Fuhrmann, Marie-Paule Cani, François Faure, Nadia Magnenat-Thalmann, Wolfgang Strasser, and Pascal Volino. Collision detection for deformable objects. In Eurographics, pages 119–139, 2004.
52. Gerard Pons-Moll, Javier Romero, Naureen Mahmood, and Michael J. Black. Dyna: A model of dynamic human shape in motion. ACM Transactions on Graphics (Proc. SIGGRAPH), 34(4):120:1–120:14, August 2015.
53. Dimitrios Tzionas, Luca Ballan, Abhilash Srikantha, Pablo Aponte, Marc Pollefeys, and Juergen Gall. Capturing hands in action using discriminative salient points and physics simulation. International Journal of Computer Vision (IJCV), 118(2):172–193, June 2016.
54. Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2980–2988. IEEE, 2017.
55. Waleed Abdulla. Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow. https://github.com/matterport/Mask_RCNN, 2017.
56. Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
57. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
58. Moritz Menze and Andreas Geiger. Object scene flow for autonomous vehicles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3061–3070, 2015.
59. Thomas Brox, Christoph Bregler, and Jitendra Malik. Large displacement optical flow. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 41–48. IEEE, 2009.
60. Christian Bailer, Bertram Taetz, and Didier Stricker. Flow Fields: Dense correspondence fields for highly accurate large displacement optical flow estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 4015–4023, 2015.
61. P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. TPAMI, 32(9):1627–1645, 2010.
62. Anton Milan, Laura Leal-Taixé, Ian D. Reid, Stefan Roth, and Konrad Schindler. MOT16: A benchmark for multi-object tracking. arXiv:1603.00831, 2016.
Parts EpicFlow LDOF FlowNet2 FlowNet PCA-Layers PWC-Net PWC+MHOF SPyNet SPyNet+MHOF
Average (whole image) 0.488 0.360 0.310 0.808 0.556 0.369 0.301 0.429 0.391
Average (body pixels) 1.982 1.719 1.863 2.574 2.691 2.056 1.621 1.977 1.803
global 1.269 1.257 1.337 2.005 1.920 1.389 1.163 1.356 1.236
head 1.806 1.328 1.626 2.681 2.808 1.881 1.445 1.708 1.519
leftCalf 2.116 1.802 1.787 2.420 2.711 2.109 1.476 1.991 1.796
leftFoot 3.089 2.346 2.476 2.987 3.393 3.002 2.142 2.701 2.566
leftForeArm 3.972 3.231 3.536 4.380 4.778 3.926 3.136 3.945 3.605
leftHand 5.777 4.422 4.823 5.928 6.531 5.634 4.385 5.547 5.040
leftShoulder 1.513 1.429 1.646 2.331 2.336 1.732 1.471 1.560 1.462
leftThigh 1.424 1.338 1.466 2.102 2.150 1.565 1.230 1.517 1.362
leftToes 3.147 2.573 2.755 3.065 3.307 3.100 2.524 2.830 2.784
leftUpperArm 2.215 1.947 2.288 3.005 3.139 2.376 1.955 2.307 2.076
lIndex0 6.199 4.900 5.334 6.254 6.785 6.124 4.861 5.925 5.472
lIndex1 6.367 5.159 5.672 6.340 6.829 6.303 5.212 6.087 5.727
lIndex2 6.315 5.253 5.878 6.203 6.670 6.270 5.433 6.028 5.784
lMiddle0 6.338 4.983 5.331 6.364 6.910 6.211 4.837 6.012 5.544
lMiddle1 6.498 5.239 5.632 6.435 6.927 6.383 5.176 6.143 5.767
lMiddle2 6.266 5.212 5.756 6.130 6.592 6.182 5.303 5.934 5.679
lPinky0 6.048 4.792 5.302 6.035 6.603 5.940 4.873 5.738 5.307
lPinky1 6.106 4.922 5.489 6.038 6.574 6.014 5.064 5.765 5.418
lPinky2 5.780 4.856 5.419 5.655 6.170 5.702 4.956 5.474 5.231
lRing0 6.388 4.973 5.281 6.413 7.010 6.218 4.834 6.064 5.552
lRing1 6.313 5.083 5.391 6.256 6.801 6.168 4.949 5.966 5.558
lRing2 6.047 5.035 5.515 5.924 6.409 5.942 5.067 5.710 5.441
lThumb0 5.415 4.318 4.673 5.473 6.072 5.316 4.329 5.212 4.809
lThumb1 5.636 4.527 5.065 5.698 6.232 5.612 4.685 5.449 5.065
lThumb2 5.825 4.749 5.388 5.820 6.323 5.802 5.005 5.629 5.314
neck 1.336 1.195 1.371 2.151 2.245 1.440 1.227 1.399 1.250
rightCalf 2.243 1.892 1.864 2.530 2.851 2.223 1.548 2.081 1.907
rightFoot 3.270 2.454 2.610 3.149 3.599 3.171 2.276 2.894 2.732
rightForeArm 3.990 3.242 3.554 4.381 4.759 3.928 3.190 4.029 3.641
rightHand 5.735 4.348 4.787 5.837 6.447 5.550 4.339 5.582 4.978
rightShoulder 1.547 1.431 1.670 2.390 2.340 1.735 1.477 1.573 1.462
rightThigh 1.477 1.374 1.512 2.158 2.226 1.624 1.263 1.556 1.407
rightToes 3.395 2.707 2.918 3.293 3.566 3.346 2.699 3.064 2.999
rightUpperArm 2.267 1.974 2.294 3.033 3.148 2.400 2.007 2.002 2.113
rIndex0 6.264 4.875 5.324 6.255 6.800 6.150 4.886 6.003 5.486
rIndex1 6.541 5.210 5.755 6.449 6.951 6.457 5.329 6.237 5.835
rIndex2 6.465 5.320 5.968 6.294 6.776 6.404 5.533 6.149 5.879
rMiddle0 6.509 5.056 5.454 6.470 7.014 6.354 4.967 6.211 5.662
rMiddle1 6.680 5.341 5.777 6.562 7.058 6.537 5.325 6.325 5.895
rMiddle2 6.394 5.261 5.838 6.209 6.713 6.274 5.366 6.038 5.739
rPinky0 5.983 4.750 5.372 5.952 6.504 5.855 4.845 5.741 5.262
rPinky1 6.076 4.905 5.566 5.979 6.533 5.943 5.025 5.809 5.402
rPinky2 5.789 4.813 5.403 5.645 6.220 5.662 4.903 5.532 5.232
rRing0 6.397 4.948 5.350 6.383 6.938 6.215 4.856 6.126 5.565
rRing1 6.395 5.108 5.465 6.290 6.841 6.212 5.019 6.066 5.615
rRing2 6.222 5.129 5.644 6.052 6.610 6.063 5.160 5.889 5.571
rThumb0 5.417 4.304 4.748 5.470 6.057 5.301 4.360 5.247 4.819
rThumb1 5.605 4.465 4.945 5.643 6.210 5.514 4.607 5.434 5.032
rThumb2 5.835 4.748 5.262 5.789 6.328 5.749 4.938 5.639 5.306
spine 1.233 1.271 1.325 1.941 1.856 1.360 1.168 1.322 1.221
spine1 1.330 1.369 1.421 2.028 1.957 1.460 1.268 1.417 1.322
spine2 1.329 1.308 1.439 2.089 2.049 1.480 1.276 1.387 1.309
Table 7 Comparison using End Point Error (EPE) on the Multi-Human Optical Flow (MHOF) dataset. We show the average EPE and body-part-specific EPE, where part labels follow Figure 3. The first two rows are repeated from Table 5.
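For reference, the end point error (EPE) reported in Table 7 is the per-pixel Euclidean distance between the estimated and ground-truth flow vectors, averaged either over all pixels of an image (first row) or over the pixels carrying a given part label (remaining rows). The NumPy sketch below illustrates this computation; it is a minimal illustration based on our reading of the table, and the names (endpoint_error, average_epe, part_map, HEAD_ID) are ours, not taken from the released code.

import numpy as np

def endpoint_error(flow_pred, flow_gt):
    # Per-pixel end point error: Euclidean distance between the predicted
    # and ground-truth 2D flow vectors; both arrays have shape (H, W, 2).
    return np.linalg.norm(flow_pred - flow_gt, axis=-1)

def average_epe(flow_pred, flow_gt, mask=None):
    # Average EPE over the whole image, or over a boolean pixel mask
    # (e.g. the pixels carrying one body-part label in a segmentation map).
    epe = endpoint_error(flow_pred, flow_gt)
    if mask is None:
        return float(epe.mean())
    return float(epe[mask].mean()) if mask.any() else float("nan")

# Example usage (hypothetical variables): EPE over the whole image and over
# the pixels labelled as the head in an integer part-label map `part_map`.
# whole_image_epe = average_epe(pred, gt)
# head_epe = average_epe(pred, gt, mask=(part_map == HEAD_ID))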
