Fig. 1 (a) We simulate human motion in virtual worlds creating an extensive dataset with images (top row) and flow fields (bottom row); color
coding from [1]. (b) We train SPyNet [2] and PWC-Net [3] for human motion estimation and show that they perform better when trained on our
dataset and (c) can generalize to human motions in real world scenes. Columns show single-person and multi-person cases alternately.
both qualitatively and quantitatively. Furthermore we show that the trained networks generalize to real video sequences (Fig. 1(c)).

Several datasets and benchmarks [1, 8, 9] have been established to drive the progress in optical flow. We argue that these datasets are insufficient for the task of human motion estimation and, despite its importance, no attention has been paid to datasets and models for human optical flow. One of the main reasons is that dense human motion is extremely difficult to capture accurately in real scenes. Without ground truth, there has been little work focused specifically on estimating human optical flow. To advance research on this problem, the community needs a dataset tailored to human optical flow.

A key observation is that recent work has shown that optical flow methods trained on synthetic data [2, 10, 11] generalize relatively well to real data. Additionally, these methods obtain state-of-the-art results with increased realism of the training data [12, 13]. This motivates our effort to create a dataset designed for human motion.

To that end, we use the SMPL [14] and SMPL+H [15] models, which capture the human body alone and the body together with articulated hands respectively, to generate different human shapes including hand and finger motion. We then place humans on random indoor backgrounds and simulate human activities like running, walking and dancing using motion capture data [16, 17]. Thus, we create a large virtual dataset that captures the statistics of natural human motion in multi-person scenarios. We then train optical flow networks on this dataset and evaluate their performance for estimating human motion. While the dataset can be used to train any flow method, we focus specifically on networks based on spatial pyramids, namely SPyNet [2] and PWC-Net [3], because they are compact and computationally efficient.

A preliminary version of this work appeared in [18], which presented a dataset and model for human optical flow for the single-person case with a body-only model. The present work extends [18] to the multi-person case, as images with multiple occluding people have different statistics. It further employs a holistic model of the body together with hands for more realistic motion variation. This work also extends training to SPyNet [2] and PWC-Net [3] using the new dataset, in contrast to training only SPyNet in the earlier work [18]. Our experiments show both qualitative and quantitative improvements.

In summary, our major contributions in this extended work are: 1) We provide the Single-Human Optical Flow dataset (SHOF) of human bodies in motion with realistic textures and backgrounds, containing 146,020 frame pairs for single-person scenarios. 2) We provide the Multi-Human Optical Flow dataset (MHOF), with 111,312 frame pairs of multiple human bodies in motion, with improved textures and realistic visual occlusions, but without (self-)collisions or intersections of body meshes. These two datasets together comprise the Human Optical Flow dataset. 3) We fine-tune SPyNet [18] on SHOF and show that its performance improves by about 43% (over the initial SPyNet), while it also outperforms the existing state of the art by about 30%. Furthermore, we fine-tune SPyNet and PWC-Net on MHOF and observe improvements of 10-20% (over the initial SPyNet and PWC-Net). Compared to the existing state of the art, improvements are particularly high for human regions. After masking out the background, we observe improvements of up to 13% for human pixels. 4) We provide the dataset files, dataset rendering code, training code and trained models for research purposes at https://humanflow.is.tue.mpg.de.
Fig. 2 Pipeline for generating the RGB frames and ground truth optical flow for the Multi-Human Optical Flow dataset. The datasets used in this pipeline are listed in Table 1, while the various rendering components are summarized in Table 2.
other in complicated ways. For this reason we then create the Multi-Human Optical Flow (MHOF) dataset to better capture this realistic interaction. To make images even more realistic for MHOF, we replace SMPL [41] with the SMPL+H [15] model that models the body together with articulated fingers, to obtain richer motion variation. In the rest of this section, we describe the components of our rendering pipeline, shown in Figure 2. For easy reference, in Table 1 we summarize the data used to generate the SHOF and MHOF datasets, while in Table 2 we summarize the various tools, Blender passes and parameters used for rendering. In the rest of the section, we describe the modules used for generating the data.

3.1 Human Body Generation

Body Model. A parametrized body model is necessary to generate human bodies in a scene. In the SHOF dataset, we use SMPL [41] for generating human body shapes. For the MHOF dataset, we use SMPL+H [15], which parametrizes the human body together with articulated fingers, for increased realism. The models are parameterized by pose and shape parameters to change the body posture and identity, as shown in Figure 2. They also contain a UV appearance map that allows us to change the skin tone, face features and clothing texture of the resulting virtual humans.

Body Poses. The next step is articulating the human body with different poses, to create moving sequences. To find such poses, we use 3D MoCap datasets [43, 44, 45] that capture 3D MoCap marker positions, glued onto the skin surface of real human subjects. We then employ MoSh [16, 17] that fits our body model to these 3D markers by optimizing over parameters of the body model for articulated pose, translation and shape. The pose specifically is a vector of axis-angle parameters that describes how to rotate each body part around its corresponding skeleton joint.
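For illustration, such an axis-angle pose vector can be turned into per-joint rotation matrices with the Rodrigues formula. The following NumPy sketch is ours, not part of the released code; the three-parameters-per-joint layout is the standard SMPL/SMPL+H convention, and the joint count is only indicative.

```python
import numpy as np

def axis_angle_to_rotation(aa):
    """Convert a single axis-angle vector (3,) to a 3x3 rotation matrix (Rodrigues)."""
    angle = np.linalg.norm(aa)
    if angle < 1e-8:
        return np.eye(3)
    axis = aa / angle
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])   # skew-symmetric cross-product matrix
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

# A SMPL-style pose vector stacks one axis-angle triplet per joint.
pose = np.zeros(52 * 3)   # illustrative SMPL+H joint count (body + fingers)
joint_rotations = [axis_angle_to_rotation(aa) for aa in pose.reshape(-1, 3)]
```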
For the SHOF dataset, we use the Human3.6M dataset [43], which contains five subjects for training (S1, S5, S6, S7, S8) and two for testing (S9, S11). Each subject performs 15 actions twice, resulting in 1,559,985 frames for training and 550,727 for testing. These sequences are subsampled at a rate of 16×, resulting in 97,499 training and 34,420 testing poses from Human3.6M.

For the MHOF dataset, we use the CMU [44] and HumanEva [45] MoCap datasets to increase motion variation. From the CMU MoCap dataset, we use 2,605 sequences of 23 high-level action categories. From the HumanEva dataset, we use more than 10 sequences performing actions from 6 different action categories. To reduce redundant poses and allow for larger motions between frames, sequences are subsampled to 12 fps, resulting in 321,873 poses. As a result, the final MHOF dataset has 254,211 poses for training, 32,670 for validation and 34,992 for testing.

Hand Poses. Traditionally, MoCap systems and datasets [43, 44, 45] record the motion of body joints, and avoid the tedious capture of detailed hand and finger motion. However, in natural settings, people use their body, hands and fingers to communicate social cues and to interact with the physical world. To enable our methods to learn such subtle motions, they should be represented in our training data. Therefore, we use the SMPL+H model [15] and augment the body-only MoCap datasets, described above, with finger motion. Instead of using random finger poses that would generate unrealistic optical flow, we employ the Embodied Hands dataset [15] and sample continuous finger motion to generate realistic optical flow.
We use 43 sequences of hand motion with 37,232 frames recorded at 60 Hz by [15]. Similarly to the body MoCap, we subsample the hand MoCap to 12 fps to reduce overlapping poses without sacrificing variability.

Body Shapes. Human bodies vary a lot in their proportions, since each person has a unique body shape. To represent this in our dataset, we first learn a gender specific Gaussian distribution of shape parameters, by fitting SMPL to 3D CAESAR scans [46] of both genders. We then sample random body shapes from this distribution to generate a large number of realistic body shapes for rendering. However, naive sampling can result in extreme and unrealistic shape parameters, therefore we bound the shape distribution to avoid unlikely shapes.

For the SHOF dataset, we bound the shape parameters to the range of [−3, 3] standard deviations for each shape coefficient and draw a new shape for every subsequence of 20 frames to increase variance.

For the MHOF dataset, we account explicitly for collisions and intersections, since intersecting virtual humans would result in the generation of inaccurate optical flow. To minimize such cases, we use similar sampling as above with only small differences. We first use shorter subsequences of 10 frames for less frequent inter-human intersections. Furthermore, we bound the shape distribution to the narrower range of [−2.7, 2.7] standard deviations, since re-targeting motion to unlikely body shapes is more prone to mesh self-intersections.
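As an illustration of the bounded sampling described above, the sketch below draws shape coefficients from a per-gender Gaussian and rejects draws outside the allowed range. The function and variable names are ours; the per-coefficient means and standard deviations would come from the CAESAR fits.

```python
import numpy as np

def sample_bounded_shape(mean, std, limit=3.0, rng=np.random.default_rng()):
    """Draw shape coefficients from N(mean, std^2), rejecting draws that fall
    outside +/- `limit` standard deviations per coefficient."""
    while True:
        z = rng.standard_normal(mean.shape)      # standardized coefficients
        if np.all(np.abs(z) <= limit):           # e.g. 3.0 for SHOF, 2.7 for MHOF
            return mean + std * z

# Hypothetical per-gender statistics learned from CAESAR registrations.
mean_female = np.zeros(10)
std_female = np.ones(10)
betas = sample_bounded_shape(mean_female, std_female, limit=2.7)
```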
Body Texture. We use the CAESAR dataset [46] to generate a variety of human skin textures. Given SMPL registrations to CAESAR scans, the original per-vertex color in the CAESAR dataset is transferred into the SMPL texture map. Since fiducial markers were placed on the bodies of CAESAR subjects, we remove them from the textures and inpaint them to produce a natural texture. In total, we use 166 CAESAR textures that are of good quality. The main drawback of CAESAR scans is their homogeneity in terms of outfit, since all of the subjects wore grey shorts and the women wore sports bras. In order to increase the clothing variety, we also use textures extracted from our 3D scans (referred to as non-CAESAR in the following), to which we register SMPL with 4Cap [52]. A total of 772 textures from 7 different subjects with different clothes were captured. We anonymized the textures by replacing the face by the average face in CAESAR, after correcting it to match the skin tone of the texture. Textures are grouped according to gender, which is randomly selected for each virtual human.

For the SHOF dataset, the textures were split into training and testing sets with a 70/30 ratio, while each texture dataset is sampled with a 50% chance. For the MHOF dataset, we introduce a more refined splitting with an 80/10/10 ratio for the train, validation and test sets. Moreover, since we also introduce finger motion, we want to favour sampling non-CAESAR textures, due to the bad quality of CAESAR texture maps for the finger region. Thus, each texture is sampled with equal probability.

Hand Texture. Hands and fingers are hard to scan due to occlusions and measurement limitations. As a result, texture maps are particularly noisy or might even have holes. Since texture is important for optical flow, we augment the body texture maps to improve hand regions. For this we follow a divide and conquer approach. First, we capture hand-only scans with a 3dMD scanner [15]. Then, we create hand-only textures using the MANO model [15], obtaining 176 high resolution textures from 20 subjects. Finally, we use the hand-only textures to replace the problematic hand regions in the full-body texture maps.

We also need to find the best matching hand-only texture for every body texture. Therefore, we convert all texture maps to HSV space, and compute the mean HSV value for each texture map from standard sampling regions. For full body textures, we sample face regions without facial hair, while for hand-only textures, we sample the center of the outer palm. Then, for each body texture map we find the closest hand-only texture map in HSV space, and shift the values of the latter by the HSV difference, so that the hand skin tone becomes more similar to the facial skin tone. Finally, this improved hand-only texture map is used to replace the pixels in the hand region of the full body texture map.
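A minimal sketch of this skin-tone matching, under our own assumptions about the array layout (textures as float RGB images in [0, 1] and boolean masks for the face and palm sampling regions); it is not the released implementation.

```python
import numpy as np
from matplotlib.colors import rgb_to_hsv, hsv_to_rgb

def mean_hsv(texture_rgb, region_mask):
    """Mean HSV value of a texture over a boolean sampling region."""
    hsv = rgb_to_hsv(texture_rgb)
    return hsv[region_mask].mean(axis=0)

def match_hand_texture(body_rgb, face_mask, hand_textures, palm_masks):
    """Pick the hand-only texture closest in HSV to the body's face region and
    shift its HSV values so the hand tone matches the facial tone."""
    target = mean_hsv(body_rgb, face_mask)
    means = np.stack([mean_hsv(t, m) for t, m in zip(hand_textures, palm_masks)])
    best = int(np.argmin(np.linalg.norm(means - target, axis=1)))
    shifted = rgb_to_hsv(hand_textures[best]) + (target - means[best])
    return hsv_to_rgb(np.clip(shifted, 0.0, 1.0))
```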
(Self-) Collision. The MHOF dataset contains multiple virtual humans moving differently, so there are high chances of collisions and penetrations. This is undesirable because penetrations are physically implausible and unrealistic. Moreover, the generated ground truth optical flow might have artifacts. Therefore, we employ a collision detection method to avoid intersections and penetrations.

Instead of using simple bounding boxes for rough collision detection, we draw inspiration from [53] and perform accurate and efficient collision detection at the triangle level using bounding volume hierarchies (BVH) [51]. This level of detailed detection allows for challenging occlusions with small distances between virtual humans, as can commonly be observed in realistic interactions between real humans. This method is useful not only for inter-person collision detection, but also for self-intersections. This is especially useful for our scenarios, as re-targeting body and hand motion to people of different shapes might result in unrealistic self-penetrations. The method is applicable out of the box, with the only exception that we exclude checks of neighboring body parts that are always or frequently in contact, e.g. the upper and lower arm, or the two thighs.

3.2 Scene Generation
| | SHOF | MHOF | Description |
|---|---|---|---|
| MoCap data | Human3.6M [43] | CMU [47], HumanEva [45] | Natural body poses |
| MoCap → SMPL | MoSh [16, 17] | MoSh [16, 17] | SMPL parameters from MoCap |
| Training poses | 97,499 | 254,211 | Articulate virtual humans |
| Validation poses | – | 32,670 | Articulate virtual humans |
| Test poses | 34,420 | 34,992 | Articulate virtual humans |
| Hand pose dataset | – | Embodied Hands [15] | Natural finger poses |
| Body shapes | Sample Gaussian distr. (CAESAR), bounded within [−3, 3] st.dev. | Sample Gaussian distr. (CAESAR), bounded within [−2.7, 2.7] st.dev. | Body proportions of virtual humans |
| Textures | CAESAR, non-CAESAR | CAESAR (hands improved), non-CAESAR (hands improved) | Appearance of virtual humans |
| Background | LSUN [48] (indoor), 417,597 images | SUN397 [49] (indoor and outdoor), 30,022 images | Scene background |

Table 1 Comparison of datasets and most important data preprocessing steps used to generate the SHOF and MHOF datasets. A short description of the respective part is provided in the last column.
Background texture. For the scene background in the SHOF dataset, we use random indoor images from the LSUN dataset [48]. This provides a good compromise between simplicity and the complex task of generating varied full 3D environments. We use 417,597 images from the LSUN categories kitchen, living room, bedroom and dining room. These images are placed as billboards, 9 meters from the camera, and are not affected by the spherical harmonics lighting.

In the MHOF dataset, we increase the variability in background appearance. We employ the SUN397 dataset [49] that contains images for 397 highly variable scenes that are both indoor and outdoor, in contrast to LSUN. For quality reasons, we reject all images with resolution smaller than 512 × 512 px, and also reject images that contain humans using Mask R-CNN [54, 55]. As a result, we use 30,222 images, split into 24,178 for the training set and 3,022 for each of the validation and test sets. Further, we increase the distance between the camera and background to 12 meters, to increase the space in which the multiple virtual humans can move without colliding frequently with each other, while still being close enough for visual occlusions.
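The paper filters the SUN397 images with the Keras/TensorFlow Mask R-CNN implementation [54, 55]. As an illustration only, an equivalent filter written against torchvision's pretrained detector might look as follows; the 0.7 score threshold is our own choice, not taken from the paper, and a recent torchvision version is assumed.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

PERSON_LABEL = 1  # COCO class index for "person" in torchvision detection models

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

def contains_person(image_path, score_threshold=0.7):
    """Return True if Mask R-CNN detects a person above the score threshold."""
    image = to_tensor(Image.open(image_path).convert("RGB"))
    with torch.no_grad():
        pred = model([image])[0]
    keep = pred["scores"] > score_threshold
    return bool((pred["labels"][keep] == PERSON_LABEL).any())

# backgrounds = [p for p in candidate_paths if not contains_person(p)]
```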
Scene Illumination. We illuminate the bodies with Spherical Harmonics lighting [50] that defines basis vectors for light directions. This parameterization is useful for randomizing the scene light by randomly sampling the coefficients with a bias towards natural illumination. The coefficients are uniformly sampled between −0.7 and 0.7, apart from the ambient illumination, which has a minimum value of 0.3 to avoid extremely dark images, and the illumination direction, which is strictly negative to favour illumination coming from above.

Increasing Image Realism. In order to increase realism, we introduce three types of image imperfections. First, for 30% of the generated images we introduce camera motion between frames. This motion perturbs the location of the camera with Gaussian noise of 1 cm standard deviation between frames and rotation noise of 0.2 degrees standard deviation per dimension in an Euler angle representation. Second, we
add motion blur to the scene using the Vector Blur Node in
Blender, and integrated over 2 frames sampled with 64 steps
between the beginning and end point of the motion. Finally,
we add a Gaussian blur to 30% of the images with a standard
deviation of 1 pixel.
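A compact sketch of the randomization described in the two paragraphs above (illumination coefficients and camera jitter), using our own variable names and a nine-coefficient spherical-harmonics parameterization as an assumption:

```python
import numpy as np

rng = np.random.default_rng()

def sample_scene_randomization(n_sh_coeffs=9, jitter_prob=0.3):
    """Sample spherical-harmonics lighting coefficients and optional camera jitter."""
    sh = rng.uniform(-0.7, 0.7, size=n_sh_coeffs)
    sh[0] = max(sh[0], 0.3)            # ambient term: keep a minimum brightness
    sh[1] = -abs(sh[1])                # direction term: bias light to come from above
    camera_jitter = None
    if rng.random() < jitter_prob:     # 30% of images get camera motion between frames
        camera_jitter = {
            "translation_m": rng.normal(0.0, 0.01, size=3),   # 1 cm std dev
            "rotation_deg": rng.normal(0.0, 0.2, size=3),     # 0.2 deg std per Euler angle
        }
    return sh, camera_jitter
```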
Scene Compositing. For animating virtual humans, each
MoCap sequence is selected at least once. To increase vari-
ability, each sequence is split into subsequences. For the first
frame of each subsequence, we sample a body and back-
ground texture, lights, blurring and camera motion parame-
ters, and re-position virtual humans on the horizontal plane.
We then introduce a random rotation around the z-axis for
variability in the motion direction.
For the SHOF dataset, we use subsequences of 20 frames,
and at the beginning of each one the single virtual human is
re-positioned in the scene such that the pelvis is projected
onto the image center.
For the MHOF dataset, we increase the variability with
smaller subsequences of 10 frames and introduce more chal-
lenging visual occlusions by uniformly sampling the number
of virtual humans in the range [4, 8]. We sample MoCap se-
quences $S_j$ with a probability of $p_j = \frac{|S_j|}{\sum_{i=1}^{|S|} |S_i|}$, where $|S_j|$ denotes the number of frames of sequence $S_j$ and $|S|$ the number of sequences. In contrast to the SHOF dataset, for
the MHOF dataset the virtual humans are not re-positioned at the center, as they would all collide. Instead, they are placed at random locations on the horizontal plane within camera visibility, making sure there are no collisions with other virtual humans or the background plane during the whole subsequence.
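For illustration, the length-proportional sequence sampling above amounts to a weighted choice; this small NumPy sketch is our own, not taken from the released code.

```python
import numpy as np

rng = np.random.default_rng()
# Placeholder pose sequences; in practice these are the subsampled MoCap sequences.
mocap_sequences = [np.zeros((120, 156)), np.zeros((300, 156)), np.zeros((80, 156))]

lengths = np.array([len(s) for s in mocap_sequences])   # |S_j| per sequence
probs = lengths / lengths.sum()                          # p_j = |S_j| / sum_i |S_i|
chosen = rng.choice(len(mocap_sequences), p=probs)       # index of the sampled sequence
```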
3.3 Ground Truth Generation

Segmentation Masks. Using the material pass of Blender, we store for each frame the ground truth body part segmentation for our models. Although the body part segmentation for both models is similar, SMPL models the palm and fingers as one part, while SMPL+H has a different part segment for each finger bone. Figure 3 shows an example body part segmentation for SMPL+H. These segmentation masks allow us to perform a per body-part evaluation of our optical flow estimation.

Rendering & Ground Truth Optical Flow. For generating images, we use the open source suite Blender and its vector pass. The render pass is typically used for producing motion blur, and it produces the motion in image space of every pixel, i.e. the ground truth optical flow. We are mainly interested in the result of this pass, together with the color rendering of the textured bodies.

Fig. 3 Body part segmentation for the SMPL+H model. Symmetrical body parts are labeled only once. Finger joints follow the same naming convention as shown for the thumb. (Best viewed in color)

4 Learning

We train two different network architectures to estimate optical flow on both the SHOF and MHOF datasets. We choose compact models that are based on spatial pyramids, namely SPyNet [2] and PWC-Net [3], shown in Figure 4. We denote the models trained on the SHOF dataset by SPyNet+SHOF and PWC+SHOF. Similarly, we denote the models trained on the MHOF dataset by SPyNet+MHOF and PWC+MHOF.

The spatial pyramid structure employs a convnet at each level of an image pyramid. A pyramid level works on a particular resolution of the image. The top level works on the full resolution and the image features are downsampled as we move to the bottom of the pyramid. Each level learns a convolutional layer $d$ to perform downsampling of image features. Similarly, a convolutional layer $u$ is learned for decoding optical flow. At each level, the convnet $G_k$ predicts optical flow residuals $v_k$ at that level. These flow residuals get added at each level to produce the full flow $V_K$ at the finest level of the pyramid.
Fig. 4 Spatial Pyramid Network [2] (left) and PWC-Net [3] (right) for optical flow estimation. At each pyramid level, network $G_k$ predicts flow at that level, which is used to condition the optical flow at the higher resolution level in the pyramid. Adapted from [3].
In SPyNet, each convnet $G_k$ takes a pair of images as inputs along with the flow $V_{k-1}$ obtained by resizing the output of the previous level with interpolation. The second frame is, however, warped using $V_{k-1}$ and the triplet $\{I_k^1, w(I_k^2, V_{k-1}), V_{k-1}\}$ is fed as input to the convnet $G_k$.

In PWC-Net, a pair of image features $\{I_k^1, I_k^2\}$ is input at a pyramid level, and the second feature map is warped using the flow $V_{k-1}$ from the previous level of the pyramid. We then compute the cost volume $c(I_k^1, w(I_k^2, V_{k-1}))$ over the feature maps and pass it to the network $G_k$ to compute the optical flow $V_k$ at that pyramid level.
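To make the pyramid recursion concrete, the sketch below shows one SPyNet-style level in PyTorch: upsample the coarser flow, warp the second image with it, and let a small convnet predict a residual that is added back. This is an illustrative sketch with our own helper names, not the released training code.

```python
import torch
import torch.nn.functional as F

def warp(image, flow):
    """Backward-warp `image` (B,C,H,W) with `flow` (B,2,H,W) via bilinear sampling."""
    b, _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(image.device)   # (2,H,W), x then y
    coords = grid.unsqueeze(0) + flow                               # sampling positions
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0                   # normalize to [-1, 1]
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)         # (B,H,W,2)
    return F.grid_sample(image, sample_grid, align_corners=True)

def pyramid_level(G_k, img1_k, img2_k, flow_coarse):
    """One spatial-pyramid level: upsample coarse flow, warp, predict residual, add."""
    up_flow = 2.0 * F.interpolate(flow_coarse, scale_factor=2,
                                  mode="bilinear", align_corners=True)
    residual = G_k(torch.cat([img1_k, warp(img2_k, up_flow), up_flow], dim=1))
    return up_flow + residual                                       # V_k = up(V_{k-1}) + v_k
```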
We use the pretrained weights as initializations for training both SPyNet and PWC-Net. We train both models end-to-end to minimize the average End Point Error (EPE).

Hyperparameters. We follow the same training procedure for SPyNet and PWC-Net. The only exception to this is the learning rate, which is determined empirically for each dataset and network from $\{10^{-6}, 10^{-5}, 10^{-4}\}$. For the SHOF dataset we found $10^{-6}$ to yield the best results for SPyNet. Predictions of PWC-Net on the SHOF dataset do not improve for any of these learning rates. For training on MHOF, learning rates of $10^{-6}$ and $10^{-4}$ yield the best results for SPyNet and PWC-Net, respectively. We use Adam [56] to optimize our loss with $\beta_1 = 0.9$ and $\beta_2 = 0.999$. We use a batch size of 8 and run 400,000 training iterations. All networks are implemented in the PyTorch framework. Fine-tuning the networks from pretrained weights takes approximately 1 day on SHOF and 2 days on MHOF.
2 days on MHOF. MHOF dataset we increase the resolution to 640 × 640 px
to be able to reason about optical flow even in small body
Data Augmentations. We also augment our data by ap- parts like fingers, using SMPL+H. Our data is extensive,
plying several transformations and adding noise. Although containing a wide variety of human shapes, poses, actions
our dataset is quite large, augmentation improves the quality and virtual backgrounds to support deep learning systems.
of results on real scenes. In particular, we apply scaling in the
range of [0.3, 3], and rotations in [−17◦ , 17◦ ]. The dataset is Comparison on SHOF. We compare the average End
normalized to have zero mean and unit standard deviation Point Errors (EPEs) of optical flow methods on the SHOF
using [57]. dataset in Table 4, along with the time for evaluation. We
Fig. 5 Visual comparison of optical flow estimates using different methods on the Single-Human Optical Flow (SHOF) test set. From left to right, we show Frame 1, Ground Truth flow, and the results of FlowNet [10], FlowNet2 [11], LDOF [59], PCA-Layers [31], EpicFlow [34], SPyNet [2], SPyNet+SHOF (ours) and PWC-Net [3].
FlowNet2+MHOF to perform even better, but we do not include this here due to its long and tedious training process.

Visual comparisons are shown in Figure 6. In particular, PWC+MHOF predicts flow fields with sharper edges than generic methods or SPyNet+MHOF. Furthermore, the qualitative results suggest that PWC+MHOF is better at distinguishing the motion of people, as people can be better separated in the flow visualizations of PWC+MHOF (Figure 6, row 3). Last, it can be seen that fine details, like the motion of distant humans or small body parts, are better estimated by PWC+MHOF.

The above observations are strong indications that our Human Optical Flow datasets (SHOF and MHOF) can be beneficial for the performance on human motion of other optical flow networks as well.

Real Scenes. We show a visual comparison of results on real-world scenes of people in motion. For visual comparisons of models trained on the SHOF dataset, we collect these scenes by cropping people from real world videos as shown in Figure 7. We use DPM [61] for detecting people and compute bounding box regions in two frames using the ground truth of the MOT16 dataset [62]. The results for the SHOF dataset are shown in Figure 8. A comparison of methods on real images with multiple people can be seen in Figure 9.

The performance of PCA-Layers [31] is highly dependent on its ability to segment. Hence, we see only a few cases where it looks visually correct. SPyNet [2] gets the overall shape but the results look noisy in certain image parts.
Fig. 6 Visual comparison of optical flow estimates using different methods on the Multi-Human Optical Flow (MHOF) test set. From left to right,
we show Frame 1, Ground Truth flow, results of FlowNet2 [11], LDOF [59], PCA-Layers [31], EpicFlow [34], SPyNet [2], SPyNet+MHOF (ours),
PWC-Net [3] and PWC+MHOF (ours).
| Method | Average EPE | Average EPE on body pixels | Fine-tuned on MHOF |
|---|---|---|---|
| FlowNet | 0.808 | 2.574 | no |
| PCA Layers | 0.556 | 2.691 | no |
| Epic Flow | 0.488 | 1.982 | no |
| SPyNet | 0.429 | 1.977 | no |
| SPyNet+MHOF | 0.391 | 1.803 | yes |
| PWC-Net | 0.369 | 2.056 | no |
| LDOF | 0.360 | 1.719 | no |
| FlowNet2 | 0.310 | 1.863 | no |
| PWC+MHOF | 0.301 | 1.621 | yes |

Table 5 Comparison using End Point Error (EPE) on the Multi-Human Optical Flow (MHOF) dataset. We show the average EPE and body-only EPE. For the latter, the EPE is computed only over segments of the image depicting a human body. Best results are shown in boldface. A comparison of body-part specific EPE can be found in Table 7.

| Method | Average MCI (MHOF) | Average MCI (Real) |
|---|---|---|
| FlowNet | 287.328 | 401.779 |
| PCA Layers | 201.594 | 423.332 |
| Epic Flow | 129.252 | 234.037 |
| SPyNet | 142.108 | 302.753 |
| SPyNet+MHOF | 143.029 | 297.142 |
| PWC-Net | 157.088 | 344.202 |
| LDOF | 71.449 | 158.281 |
| FlowNet2 | 145.732 | 303.799 |
| PWC+MHOF | 152.314 | 351.567 |

Table 6 Comparison using Motion Compensated Intensity (MCI) on the Multi-Human Optical Flow (MHOF) dataset and a real video sequence. Example images for the real video sequence can be seen in Figure 9.
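The exact MCI formula is not given in this excerpt; as a rough illustration only, the sketch below computes a warping-based photometric error (mean squared intensity difference between frame 1 and frame 2 warped by the estimated flow), which is one common way such a motion compensated intensity measure is defined. The nearest-neighbor lookup and all names are our own simplifications.

```python
import numpy as np

def motion_compensated_intensity(img1, img2, flow):
    """Rough MCI sketch: warp frame 2 towards frame 1 with the estimated flow
    (nearest-neighbor lookup) and average the squared intensity difference.
    img1, img2: (H, W) or (H, W, C) arrays; flow: (H, W, 2) with (dx, dy)."""
    h, w = img1.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    x_src = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    y_src = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    return float(np.mean((img1.astype(float) - img2[y_src, x_src].astype(float)) ** 2))
```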
While LDOF [59], EpicFlow [34] and FlowFields [60] generally perform well, they often find it difficult to resolve the legs, hands and head of the person. The results from models trained on our Human Optical Flow dataset look appealing especially while resolving the overall human shape, and various parts like legs, hands and the human head. Models trained on the Human Optical Flow dataset perform well under occlusion (Figure 8, Figure 9). Many examples including severe occlusion can be seen in Figure 9. Besides that, Figure 9 shows that
Fig. 8 Single-Human Optical Flow visuals on real images using different methods. From left to right, we show Frame 1, Frame 2, and the results of PCA-Layers [31], SPyNet [2], EpicFlow [34], LDOF [59], FlowFields [60] and SPyNet+SHOF (ours).
human motion recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2556–2563. IEEE, 2011.
8. Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
9. D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In A. Fitzgibbon et al. (Eds.), editor, European Conf. on Computer Vision (ECCV), Part IV, LNCS 7577, pages 611–625. Springer-Verlag, October 2012.
10. Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Caner Hazirbas, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, Thomas Brox, et al. FlowNet: Learning optical flow with convolutional networks. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 2758–2766. IEEE, 2015.
11. Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. arXiv preprint arXiv:1612.01925, 2016.
12. N. Mayer, E. Ilg, P. Häusser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2016. arXiv:1512.02134.
13. A. Gaidon, Q. Wang, Y. Cabon, and E. Vig. Virtual worlds as proxy for multi-object tracking analysis. In CVPR, 2016.
14. Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J. Black. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In Computer Vision – ECCV 2016, Lecture Notes in Computer Science. Springer International Publishing, October 2016.
15. Javier Romero, Dimitrios Tzionas, and Michael J. Black. Embodied Hands: Modeling and capturing hands and bodies together. ACM Transactions on Graphics (Proc. SIGGRAPH Asia), 36(6), November 2017.
16. Matthew Loper, Naureen Mahmood, and Michael J. Black. MoSh: Motion and shape capture from sparse markers. ACM Transactions on Graphics (TOG), 33(6):220, 2014.
Fig. 9 Multi-Human Optical Flow visuals on real images. From left to right, we show Frame 1, results of FlowNet2 [11], FlowNet [10], LDOF
[59], PCA-Layers [31], EpicFlow [34], SPyNet [2], SPyNet+MHOF (ours), PWC-Net [3] and PWC+MHOF (ours).
17. Naureen Mahmood, Nima Ghorbani, Nikolaus F. Troje, Gerard Pons-Moll, and Michael J. Black. AMASS: Archive of motion capture as surface shapes. CoRR, abs/1904.03278, 2019.
18. Anurag Ranjan, Javier Romero, and Michael J. Black. Learning human optical flow. In 29th British Machine Vision Conference, September 2018.
19. Gunnar Johansson. Visual perception of biological motion and a model for its analysis. Perception & Psychophysics, 14(2):201–211, 1973.
20. J. W. Davis. Hierarchical motion history images for recognizing human motion. In Detection and Recognition of Events in Video, pages 39–46, 2001.
21. M. J. Black, Y. Yacoob, A. D. Jepson, and D. J. Fleet. Learning parameterized models of image motion. In IEEE Conf. on Computer Vision and Pattern Recognition, CVPR-97, pages 561–567, Puerto Rico, June 1997.
22. R. Fablet and M. J. Black. Automatic detection and tracking of human motion with a view-based representation. In European Conf. on Computer Vision, ECCV 2002, volume 1 of LNCS 2353, pages 476–491. Springer-Verlag, 2002.
23. K. Fragkiadaki, H. Hu, and J. Shi. Pose from flow and flow from pose. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, pages 2059–2066, June 2013.
24. Silvia Zuffi, Javier Romero, Cordelia Schmid, and Michael J. Black. Estimating human pose with flowing puppets. In IEEE International Conference on Computer Vision (ICCV), pages 3312–3319, 2013.
25. Tomas Pfister, James Charles, and Andrew Zisserman. Flowing convnets for human pose estimation in videos. In ICCV, pages 1913–1921. IEEE Computer Society, 2015.
26. James Charles, Tomas Pfister, Derek R. Magee, David C. Hogg, and Andrew Zisserman. Personalizing human video pose estimation. In CVPR, pages 3063–3072. IEEE Computer Society, 2016.
27. Javier Romero, Matthew Loper, and Michael J. Black. FlowCap: 2D human pose from optical flow. In Pattern Recognition, Proc. 37th German Conference on Pattern Recognition (GCPR), volume LNCS 9358, pages 412–423. Springer, 2015.
28. Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR, pages 1933–1941. IEEE Computer Society, 2016.
29. Xuanyi Dong, Shoou-I Yu, Xinshuo Weng, Shih-En Wei, Yi Yang, and Yaser Sheikh. Supervision-by-registration: An unsupervised approach to improve the precision of facial landmark detectors. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
30. William T. Freeman, Egon C. Pasztor, and Owen T. Carmichael. Learning low-level vision. International Journal of Computer Vision, 40(1):25–47, 2000.
31. Jonas Wulff and Michael J. Black. Efficient sparse-to-dense optical flow estimation using a learned basis and layers. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 120–130. IEEE, 2015.
32. D. Sun, S. Roth, J. P. Lewis, and M. J. Black. Learning optical flow. In ECCV, pages 83–97, 2008.
33. Fatma Güney and Andreas Geiger. Deep discrete flow. In Asian Conference on Computer Vision (ACCV), 2016.
34. Jerome Revaud, Philippe Weinzaepfel, Zaid Harchaoui, and Cordelia Schmid. EpicFlow: Edge-preserving interpolation of correspondences for optical flow. In Computer Vision and Pattern Recognition, 2015.
35. Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Deep End2End Voxel2Voxel prediction. In The 3rd Workshop on Deep Learning in Computer Vision, 2016.
36. Maria Shugrina, Ziheng Liang, Amlan Kar, Jiaman Li, Angad Singh, Karan Singh, and Sanja Fidler. Creative Flow+ dataset. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
37. Adrien Gaidon, Zaid Harchaoui, and Cordelia Schmid. Activity representation with motion hierarchies. International Journal of Computer Vision, 107(3):219–238, 2014.
38. Igor Barros Barbosa, Marco Cristani, Barbara Caputo, Aleksander Rognhaugen, and Theoharis Theoharis. Looking beyond appearances: Synthetic training data for deep CNNs in re-identification. Computer Vision and Image Understanding, 167:50–62, 2018.
39. Mona Fathollahi Ghezelghieh, Rangachar Kasturi, and Sudeep Sarkar. Learning camera viewpoint using CNN to improve 3D body pose estimation. In 2016 Fourth International Conference on 3D Vision (3DV), pages 685–693. IEEE, 2016.
40. MakeHuman: Open source tool for making 3D characters.
41. Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia), 34(6):248:1–248:16, October 2015.
42. Gül Varol, Javier Romero, Xavier Martin, Naureen Mahmood, Michael Black, Ivan Laptev, and Cordelia Schmid. Learning from synthetic humans. In CVPR, 2017.
43. Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1325–1339, July 2014.
44. Carnegie Mellon MoCap database.
45. L. Sigal, A. Balan, and M. J. Black. HumanEva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. International Journal of Computer Vision, 87(1):4–27, March 2010.
46. Kathleen M. Robinette, Sherri Blackwell, Hein Daanen, Mark Boehmer, and Scott Fleming. Civilian American and European Surface Anthropometry Resource (CAESAR), final report, volume 1: Summary. Technical report, DTIC Document, 2002.
47. Ralph Gross and Jianbo Shi. The CMU Motion of Body (MoBo) database. 2001.
48. Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv:1506.03365, 2015.
49. Jianxiong Xiao, James Hays, Krista A. Ehinger, Aude Oliva, and Antonio Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 3485–3492. IEEE, 2010.
50. Robin Green. Spherical harmonic lighting: The gritty details. Archives of the Game Developers Conference, March 2003.
51. Matthias Teschner, Stefan Kimmerle, Bruno Heidelberger, Gabriel Zachmann, Laks Raghupathi, Arnulph Fuhrmann, Marie-Paule Cani, François Faure, Nadia Magnenat-Thalmann, Wolfgang Strasser, and Pascal Volino. Collision detection for deformable objects. In Eurographics, pages 119–139, 2004.
52. Gerard Pons-Moll, Javier Romero, Naureen Mahmood, and Michael J. Black. Dyna: A model of dynamic human shape in motion. ACM Transactions on Graphics (Proc. SIGGRAPH), 34(4):120:1–120:14, August 2015.
53. Dimitrios Tzionas, Luca Ballan, Abhilash Srikantha, Pablo Aponte, Marc Pollefeys, and Juergen Gall. Capturing hands in action using discriminative salient points and physics simulation. International Journal of Computer Vision (IJCV), 118(2):172–193, June 2016.
54. Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2980–2988. IEEE, 2017.
55. Waleed Abdulla. Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow. https://github.com/matterport/Mask_RCNN, 2017.
56. Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
57. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
58. Moritz Menze and Andreas Geiger. Object scene flow for autonomous vehicles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3061–3070, 2015.
59. Thomas Brox, Christoph Bregler, and Jitendra Malik. Large displacement optical flow. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 41–48. IEEE, 2009.
60. Christian Bailer, Bertram Taetz, and Didier Stricker. Flow Fields: Dense correspondence fields for highly accurate large displacement optical flow estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 4015–4023, 2015.
61. P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. TPAMI, 32(9):1627–1645, 2010.
62. Anton Milan, Laura Leal-Taixé, Ian D. Reid, Stefan Roth, and Konrad Schindler. MOT16: A benchmark for multi-object tracking. arXiv:1603.00831, 2016.
Parts Epic Flow LDOF FlowNet2 FlowNet PCA Layers PWC-Net PWC+MHOF SPyNet SPyNet+MHOF
Average (whole image) 0.488 0.360 0.310 0.808 0.556 0.369 0.301 0.429 0.391
Average (body pixels) 1.982 1.719 1.863 2.574 2.691 2.056 1.621 1.977 1.803
global 1.269 1.257 1.337 2.005 1.920 1.389 1.163 1.356 1.236
head 1.806 1.328 1.626 2.681 2.808 1.881 1.445 1.708 1.519
leftCalf 2.116 1.802 1.787 2.420 2.711 2.109 1.476 1.991 1.796
leftFoot 3.089 2.346 2.476 2.987 3.393 3.002 2.142 2.701 2.566
leftForeArm 3.972 3.231 3.536 4.380 4.778 3.926 3.136 3.945 3.605
leftHand 5.777 4.422 4.823 5.928 6.531 5.634 4.385 5.547 5.040
leftShoulder 1.513 1.429 1.646 2.331 2.336 1.732 1.471 1.560 1.462
leftThigh 1.424 1.338 1.466 2.102 2.150 1.565 1.230 1.517 1.362
leftToes 3.147 2.573 2.755 3.065 3.307 3.100 2.524 2.830 2.784
leftUpperArm 2.215 1.947 2.288 3.005 3.139 2.376 1.955 2.307 2.076
lIndex0 6.199 4.900 5.334 6.254 6.785 6.124 4.861 5.925 5.472
lIndex1 6.367 5.159 5.672 6.340 6.829 6.303 5.212 6.087 5.727
lIndex2 6.315 5.253 5.878 6.203 6.670 6.270 5.433 6.028 5.784
lMiddle0 6.338 4.983 5.331 6.364 6.910 6.211 4.837 6.012 5.544
lMiddle1 6.498 5.239 5.632 6.435 6.927 6.383 5.176 6.143 5.767
lMiddle2 6.266 5.212 5.756 6.130 6.592 6.182 5.303 5.934 5.679
lPinky0 6.048 4.792 5.302 6.035 6.603 5.940 4.873 5.738 5.307
lPinky1 6.106 4.922 5.489 6.038 6.574 6.014 5.064 5.765 5.418
lPinky2 5.780 4.856 5.419 5.655 6.170 5.702 4.956 5.474 5.231
lRing0 6.388 4.973 5.281 6.413 7.010 6.218 4.834 6.064 5.552
lRing1 6.313 5.083 5.391 6.256 6.801 6.168 4.949 5.966 5.558
lRing2 6.047 5.035 5.515 5.924 6.409 5.942 5.067 5.710 5.441
lThumb0 5.415 4.318 4.673 5.473 6.072 5.316 4.329 5.212 4.809
lThumb1 5.636 4.527 5.065 5.698 6.232 5.612 4.685 5.449 5.065
lThumb2 5.825 4.749 5.388 5.820 6.323 5.802 5.005 5.629 5.314
neck 1.336 1.195 1.371 2.151 2.245 1.440 1.227 1.399 1.250
rightCalf 2.243 1.892 1.864 2.530 2.851 2.223 1.548 2.081 1.907
rightFoot 3.270 2.454 2.610 3.149 3.599 3.171 2.276 2.894 2.732
rightForeArm 3.990 3.242 3.554 4.381 4.759 3.928 3.190 4.029 3.641
rightHand 5.735 4.348 4.787 5.837 6.447 5.550 4.339 5.582 4.978
rightShoulder 1.547 1.431 1.670 2.390 2.340 1.735 1.477 1.573 1.462
rightThigh 1.477 1.374 1.512 2.158 2.226 1.624 1.263 1.556 1.407
rightToes 3.395 2.707 2.918 3.293 3.566 3.346 2.699 3.064 2.999
rightUpperArm 2.267 1.974 2.294 3.033 3.148 2.400 2.007 2.002 2.113
rIndex0 6.264 4.875 5.324 6.255 6.800 6.150 4.886 6.003 5.486
rIndex1 6.541 5.210 5.755 6.449 6.951 6.457 5.329 6.237 5.835
rIndex2 6.465 5.320 5.968 6.294 6.776 6.404 5.533 6.149 5.879
rMiddle0 6.509 5.056 5.454 6.470 7.014 6.354 4.967 6.211 5.662
rMiddle1 6.680 5.341 5.777 6.562 7.058 6.537 5.325 6.325 5.895
rMiddle2 6.394 5.261 5.838 6.209 6.713 6.274 5.366 6.038 5.739
rPinky0 5.983 4.750 5.372 5.952 6.504 5.855 4.845 5.741 5.262
rPinky1 6.076 4.905 5.566 5.979 6.533 5.943 5.025 5.809 5.402
rPinky2 5.789 4.813 5.403 5.645 6.220 5.662 4.903 5.532 5.232
rRing0 6.397 4.948 5.350 6.383 6.938 6.215 4.856 6.126 5.565
rRing1 6.395 5.108 5.465 6.290 6.841 6.212 5.019 6.066 5.615
rRing2 6.222 5.129 5.644 6.052 6.610 6.063 5.160 5.889 5.571
rThumb0 5.417 4.304 4.748 5.470 6.057 5.301 4.360 5.247 4.819
rThumb1 5.605 4.465 4.945 5.643 6.210 5.514 4.607 5.434 5.032
rThumb2 5.835 4.748 5.262 5.789 6.328 5.749 4.938 5.639 5.306
spine 1.233 1.271 1.325 1.941 1.856 1.360 1.168 1.322 1.221
spine1 1.330 1.369 1.421 2.028 1.957 1.460 1.268 1.417 1.322
spine2 1.329 1.308 1.439 2.089 2.049 1.480 1.276 1.387 1.309
Table 7 Comparison using End Point Error (EPE) on the Multi-Human Optical Flow (MHOF) dataset. We show the average EPE and body-part specific EPE, where part labels follow Figure 3. The first two rows are repeated from Table 5.
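As an illustration of how such a per-part evaluation can be computed from the rendered part segmentation masks, under our own assumptions about array shapes (this is not the released evaluation code):

```python
import numpy as np

def per_part_epe(flow_pred, flow_gt, part_labels, part_ids):
    """End Point Error per body part.
    flow_pred, flow_gt: (H, W, 2) flow fields; part_labels: (H, W) integer
    segmentation where 0 is background; part_ids: {name: label} mapping."""
    epe = np.linalg.norm(flow_pred - flow_gt, axis=-1)        # per-pixel EPE
    results = {"whole image": epe.mean(),
               "body pixels": epe[part_labels > 0].mean()}
    for name, label in part_ids.items():
        mask = part_labels == label
        if mask.any():
            results[name] = epe[mask].mean()
    return results
```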