PROST: Parallel Robust Online Simple Tracking
Jakob Santner Christian Leistner Amir Saffari Thomas Pock Horst Bischof
Institute for Computer Graphics and Vision, Graz University of Technology
{santner,leistner,saffari,pock,bischof}@icg.tugraz.at
Zach et al. [24] achieved realtime performance by minimizing this energy on the GPU. The TV regularization favors sharp discontinuities, but also leads to a so-called staircase effect, where the flow field exhibits piecewise constant levels. In recent work, Werlberger et al. [23] replaced the TV norm with a Huber norm to tackle this problem: below a certain threshold the penalty is quadratic, leading to smooth flow fields for small displacements; above that threshold, the penalty becomes linear, allowing for sharp discontinuities. With the additional incorporation of a diffusion tensor for anisotropic regularization, their method (Huber-L1) is currently one of the most accurate optical flow algorithms according to the Middlebury evaluation website [4]. Figure 2 shows the difference between the method of Werlberger et al. and the algorithm of Horn and Schunck.

In order to use the dense flow field as input to a tracker, we estimate the object's translation from the flow vectors. We use a mean-shift procedure in the two-dimensional translation space, taking into account every flow vector within our tracking rectangle. In contrast to averaging the displacement vectors, mean shift handles occlusions more robustly. For simplicity, we estimate only the translation of our object throughout this work; note, however, that other motion models incorporating, e.g., rotation, scale, or affine motion could be estimated from the flow field.

Complementary to FLOW and NCC, we employ an adaptive appearance-based tracker based on online random forests (ORF). Random Forests [6] are ensembles of N recursively trained decision trees of the form f(x) : X → Y. For a forest F = {f_1, ..., f_N}, a decision for a class k is made by simply taking the maximum over the averaged individual tree probabilities,

    C(x) = arg max_{k ∈ Y} (1/N) Σ_{n=1}^{N} p_n(k|x),

where p_n(k|x) is the estimated density of class labels of the leaf of the n-th tree. In order to decrease the correlation of the trees, each tree is provided with a slightly different subset of the training data by subsampling with replacement from the entire training set, a.k.a. bagging. During training, each split node randomly selects binary tests from the feature vector and keeps the best according to an impurity measurement. The information gain after node splitting is usually measured as

    ΔH = − (|I_l| / (|I_l| + |I_r|)) H(I_l) − (|I_r| / (|I_l| + |I_r|)) H(I_r),    (3)

where I_l and I_r are the left and right subsets of the training data, H(I) = −Σ_{i=1}^{K} p_i^j log(p_i^j) is the entropy of the classes in the node, and p_i^j is the label density of class i in node j. The recursive training continues until a maximum depth is reached or no further information gain is possible.
Random Forests have several advantages that make them
particularly interesting for computer vision applications,
i.e., they are fast in both training and evaluation and yield
state-of-the-art classification results while being less noise-
sensitive compared to other classifiers (e.g., boosting). Ad-
ditionally, RFs are inherently multi-class and allow, due to
their parallel structure, for multi-core and GPU [18] imple-
mentations.
Recently, Saffari et al. [17] proposed an online version of RFs which allows using them as online classifiers in tracking-by-detection systems. Since recursive training of decision trees is hard to do in online learning, they propose a tree-growing procedure similar to evolving trees [15]. The algorithm starts with trees consisting only of root nodes with randomly selected node tests f_i and thresholds θ_i. Each node estimates an impurity measure based on the Gini index, G_j = Σ_{i=1}^{K} p_i^j (1 − p_i^j), online, where p_i^j is the label density of class i in node j. Then, after each online update, the possible information gain ΔG during a potential node split is measured. If ΔG exceeds a given threshold β, the node becomes a split node, i.e., it is not updated any more and generates two child leaf nodes. The growing proceeds until a maximum depth is reached. Even when the tree has grown to its full size, all leaf nodes are further updated online.

The method is simple to implement and has been shown to converge fast to its offline counterpart. Additionally, Saffari et al. [17] showed that the classifier is faster and more noise-robust compared to boosting, which makes it an ideal candidate for our tracking system.

Figure 3. Highly-flexible parts of our system take care of tracking, while the conservative parts correct the flexible ones when they have drifted away.

3. Tracker Combination

A tracker has to incorporate two conflicting properties: It has to (i) adapt to fast object appearance changes while (ii) being able to recover in case of drifting. In other words, we need a highly adaptive tracker that is corrected by system components that are more inertial. Therefore, we combine the three different tracking approaches discussed before in a simple fall-back cascade (see also Figure 3): In order to allow for fast changes, FLOW forms the main tracker. This implies that FLOW can also easily lose the target; hence, it can be overruled by ORF. NCC is employed to prevent ORF from making too many wrong updates. Our cascade can be summarized with the following simple rules:

1. FLOW is overruled by ORF if they are (i) not overlapping and (ii) ORF has a confidence above a given threshold.

2. ORF is updated only if it overlaps with NCC or FLOW.

4. Experiments

During the experiments, we compare our algorithm to current state-of-the-art methods on publicly available datasets. We also created several new challenging video sequences, which are available on our website together with ground truth annotation and results². The major conclusion from the experiments is that our algorithm is more adaptive and stable at the same time compared to other tracking-by-detection systems. Please note that we always use the same parameters throughout the experiments in this section.

4.1. Implementation

For FLOW, we employ the GPU-based implementation of Werlberger et al. [23], which is available online. NCC is based on the cvMatchTemplate() function implemented in the OpenCV library; ORF is based on the code of Saffari et al. [17], which is also publicly available. We achieve realtime performance with our system; however, NCC and especially ORF could benefit largely from being implemented on the GPU.

4.2. Quality Score

To evaluate the performance of their tracker, Babenko et al. [3] use a score representing the mean center location error in pixels. This is not a good choice, as the ground truth rectangles are fixed in size and axis-aligned whereas the sequences exhibit scale and rotational changes. Furthermore, their score does not take into account the different sizes of the objects in different sequences.

To overcome these problems, we additionally use a score based on the PASCAL challenge [8] object detection score: Given the detected bounding box ROI_D and the ground truth bounding box ROI_GT, the overlap score evaluates as

    score = area(ROI_D ∩ ROI_GT) / area(ROI_D ∪ ROI_GT).
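For axis-aligned boxes, this overlap score can be sketched as follows (a minimal illustration; the (x, y, w, h) box format is an assumption, not taken from the paper's code):

```python
def overlap_score(box_d, box_gt):
    """PASCAL-style overlap: intersection area over union area.

    Boxes are axis-aligned (x, y, w, h) tuples; the result lies in [0, 1].
    """
    (xd, yd, wd, hd), (xg, yg, wg, hg) = box_d, box_gt
    # Intersection rectangle (width/height clamp to 0 when disjoint).
    iw = max(0.0, min(xd + wd, xg + wg) - max(xd, xg))
    ih = max(0.0, min(yd + hd, yg + hg) - max(yd, yg))
    inter = iw * ih
    union = wd * hd + wg * hg - inter
    return inter / union if union > 0 else 0.0
```

Identical boxes score 1.0 and disjoint boxes score 0.0, so thresholding this value gives a scale-independent notion of a correctly tracked frame.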
² www.gpu4vision.org
By interpreting a frame as a true positive when this score exceeds 0.5, we can give a percentage of correctly tracked frames for each sequence.

4.3. Sequences

Throughout the experiments, we use ten challenging sequences (Table 1) featuring, e.g., moving cameras, cluttered background, occlusions, 3-D motion and illumination changes. The video data, ground truth and results of other methods for the first six sequences have been taken from Babenko et al. [3]. The other four videos (see Figure 7) have been created and annotated by ourselves.

Sequence    Adaboost  FragTrack  MILTrack  PROST
Girl        43.3      26.5       31.6      19.0
David       51.0      46.0       15.6      15.3
Sylvester   32.9      11.2        9.4      10.6
Faceocc     49.0       6.5       18.4       7.0
Faceocc2    19.6      45.1       14.3      17.2
Tiger1      17.8      39.6        8.4       7.2

Table 2. Mean distance of the tracking rectangle to the annotated ground truth; the best result is printed in bold face, the second best is underlined.

Sequence    Adaboost  FragTrack  MILTrack  PROST
Girl        24        70         70        89
David       23        47         70        80
Sylvester   51        74         74        73
Faceocc     35        100        93        100
Faceocc2    75        48         96        82
Tiger1      38        20         77        79

Table 3. Percentage of frames tracked correctly.

4.5.2 PROST dataset

To further demonstrate the capabilities of our system, we compare on the newly created sequences. Besides our own method (parametrized identically to the previous experiments), we benchmark the following algorithms:

- ORF with 100 trees of maximum depth 5 and a search region factor of 1.0. Similar to MILTrack [3], we use Haar-like features. This is exactly the online part of our tracker; thus this experiment directly shows the benefit of the complementary methods.

- FragTrack [1] with 16 bins and a search window half size of 25 to cope with the larger frame size.

- MILTrack [3], as provided on their webpage, with the search window size increased to 50. Similar to the previous experiment, we use the best out of 5 differently initialized runs.

The average pixel error for each method is given in Table 4, the PASCAL-based score in Table 5. Our approach yields the best score in three sequences, tracking correctly an average of 79.5% over all four sequences, followed by FragTrack (65.3%), MILTrack (48.5%) and ORF (27.3%). The pixel error graphs in Figure 6 directly show the benefits of our combined system over the online tracker ORF it is based on:

- ORF loses the object in every sequence after at least 400 frames. With the high-dynamic optical flow tracker increasing plasticity, our system loses the object far less often.

- When ORF has lost the track, it performs wrong updates until eventually drifting away from the object entirely. This happens in the sequences board, box and lemming. In liquor, it is able to recover the object three times. Although far less often, our system also loses the track several times, but is, except for the last frames of liquor, always able to recover the object.

With these experiments we show that the different algorithms can complement one another. FLOW is a good high-dynamic tracker, but needs correction from time to time to get rid of accumulating errors. ORF could do that, but needs a supervisor preventing it from making too many wrong updates. NCC is not suited to tracking on a per-frame basis, but gives strong cues when the object reappears similarly to the initial template.

Sequence   MILTrack  ORF    FragTrack  PROST
Board      51.2      154.5  90.1       37.0
Box        104.6     145.4  57.4       12.1
Lemming    14.9      166.3  82.8       25.4
Liquor     165.1     67.3   30.7       21.6

Table 4. Mean distance error to the ground truth.

Figure 6. Tracking results for the PROST dataset.
[14] M. Ozuysal, P. Fua, and V. Lepetit. Fast keypoint recognition in ten lines of code. In CVPR, 2007.
[15] J. Pakkanen, J. Iivarinen, and E. Oja. The evolving tree – a novel self-organizing network for data analysis. Neural Process. Lett., 20(3):199–211, 2004.
[16] D. A. Ross, J. Lim, R.-S. Lin, and M.-H. Yang. Incremental learning for robust visual tracking. Int. J. Comput. Vision, 77(1-3):125–141, 2008.
[17] A. Saffari, C. Leistner, J. Santner, M. Godec, and H. Bischof. On-line random forests. In 3rd IEEE ICCV Workshop on On-line Comput. Vision, 2009.
[18] T. Sharp. Implementing decision trees and forests on a GPU. In ECCV, 2008.
[19] J. Shi and C. Tomasi. Good features to track. In CVPR, 1994.
[20] D. Shulman and J.-Y. Herve. Regularization of discontinuous flow fields. In Proceedings Workshop on Visual Motion, 1989.
[21] B. Stenger, T. Woodley, and R. Cipolla. Learning to track with multiple observers. In CVPR, 2009.
[22] P. Viola, J. Platt, and C. Zhang. Multiple instance boosting for object detection. In Advances in Neural Information Processing Systems, volume 18, pages 1417–1424. MIT Press, 2006.
[23] M. Werlberger, W. Trobin, T. Pock, A. Wedel, D. Cremers, and H. Bischof. Anisotropic Huber-L1 optical flow. In Proc. of the British Machine Vision Conf., 2009.
[24] C. Zach, T. Pock, and H. Bischof. A duality based approach for realtime TV-L1 optical flow. In Pattern Recognition (Proc. DAGM), 2007.