Cross-Dataset Training For Class Increasing Object Detection
Yongqiang Yao, Yan Wang, Yu Guo, Jiaojiao Lin, Hongwei Qin, Junjie Yan
SenseTime
{yaoyongqiang,wanyan1,guoyu,linjiaojiao,qinhongwei,yanjunjie}@sensetime.com
the labels is unreasonable. The first reason is that labels may be duplicated, making it necessary to first merge the identical labels across datasets. The second reason is the possible conflicts among the positive and negative samples from different datasets. For example, the negative samples from a face detection dataset may contain a large number of human bodies, which makes the task very confusing if cross-trained with a human detection dataset. Considering these two aspects, we propose a novel cross-dataset training scheme specially designed for object detection, which is composed of four steps:

1) merge duplicated labels across datasets;
2) generate a hybrid dataset through label concatenation, but still keep the original partition information of every image;
3) build an avoidance relationship across partitions, such as face-negative versus human-positive (see the sketch after this list);
4) train the detector with this hybrid dataset, where the loss is calculated according to this avoidance relationship.
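To make steps 2) and 3) concrete, the sketch below shows one plausible way to represent the hybrid dataset and the avoidance relationship. The field names, dataset names and class names are illustrative assumptions, not the paper's actual data format.

```python
# A minimal sketch of steps 2)-3): keep each image's source partition and record
# which merged classes may not be treated as negatives for that partition
# (illustrative only; names are hypothetical).
hybrid_dataset = [
    {"image": "face_000001.jpg", "partition": "WIDER_FACE",       "boxes": [...], "labels": ["face"]},
    {"image": "ped_000001.jpg",  "partition": "WIDER_Pedestrian", "boxes": [...], "labels": ["pedestrian"]},
]

# Avoidance relationship: unlabeled regions in a face-only image may still contain
# people, so "pedestrian" negatives from WIDER_FACE images are ignored, and vice versa.
avoid_as_negative = {
    "WIDER_FACE":       {"pedestrian"},
    "WIDER_Pedestrian": {"face"},
}
```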
We experiment on several popular object detection datasets to evaluate cross-dataset training. First we choose WIDER FACE [30] and WIDER Pedestrian [1]. We train a baseline model for face detection on WIDER FACE and another for pedestrian detection on WIDER Pedestrian, respectively. When performing cross-dataset training using our training scheme, we get a single model that simultaneously achieves little or no accuracy loss on both datasets. Then, we conduct experiments on WIDER FACE and COCO, making it possible to detect the 80 COCO classes as well as faces with a single model, without labeling faces on COCO. We observe no accuracy loss on COCO and WIDER FACE.

We believe this flexible and general framework provides a solid approach to cross-dataset training and continual learning for both academic research and industrial applications.
2. Related Work

Object Detection In the past decade, object detection has been evolving rapidly. Deep CNNs (Convolutional Neural Networks) have brought object detection into a totally new era. The two-stage R-CNN series [10, 8, 26, 18, 11] and single-stage detectors like SSD [21], YOLO [24] and RetinaNet [19] are among the most popular frameworks. We choose RetinaNet as the detection framework because it achieves state-of-the-art performance with focal loss to balance the positive and negative samples. RetinaNet is also robust and effective on detection, instance segmentation and keypoint tasks. Backbones are also important for detection performance. ResNets [12] are widely used in state-of-the-art object detectors. For mobile-friendly applications, MobileNets [13] and their variants [32, 27] are widely adopted for benchmarking. Object scale is another important research topic in object detection, and various detection frameworks adopt various solutions for scale [21, 7, 25, 18]. Feature Pyramid Network [18] is an efficient multi-scale representation learning method for object detection that achieves consistent improvement on a number of detection frameworks. RetinaNet with ResNets and FPN is the current leading framework on benchmarks like COCO.

Cross-dataset Training In terms of cross-dataset training, the most related work is Recurrent Assistance [23], where cross-dataset training is used for frame-based action recognition during the pre-training stage. Since the number and identity of classes differ across datasets, the authors propose to generate a new dataset by label concatenation, where labels from different datasets are simply concatenated to form a hybrid dataset with more labels. They found that improved accuracy and faster convergence can be achieved by pre-training on similar datasets using label concatenation. A similar label concatenation is also adopted in our paper, but in the training stage we must consider the duplication and conflicts among different datasets instead of a straightforward concatenation.

Another related work is integrated face analytics networks [17], where multiple datasets annotated for different tasks (facial landmarks, facial emotion and face parsing) are used to train an integrated face analysis model, avoiding the need to build a fully labeled common dataset for all the tasks. The performance is boosted by explicitly modelling the interaction of different tasks, and the work belongs to the scope of multi-task learning. Although we have a similar motivation for cross-dataset training, we are the first to apply the general idea of cross-dataset training to object detection, where different problems need to be addressed compared with those previous works.

Multi-task Learning Another related research topic is multi-task learning [2, 22]. Among the seminal works, the most related one is MultiNet [28], where classification, detection and semantic segmentation are jointly trained on a single dataset. This is quite different from our scenario, where multiple datasets are jointly trained for a similar task but different object classes. In the multi-task learning community, there are still no appropriate algorithms for cross-dataset training.

3. Cross-dataset Training

3.1. Detection Baseline

Cross-dataset training for object detection aims to detect the union of all the classes across different existing datasets without additional labelling effort. Considering duplicate labels and possible conflicts across datasets, simply concatenating the labels of all datasets is unreasonable and may cause degraded performance. Using RetinaNet as the baseline, we propose a novel cross-dataset training scheme specially designed for object detection. The key components include label mapping and a dataset-aware classification loss, which is a revised version of the focal loss in RetinaNet. The overall structure of the proposed cross-dataset training method is shown in Figure 2.
Figure 2: Overall structure of the proposed cross-dataset training method. Existing datasets are merged into a hybrid dataset. Then RetinaNet is adopted as the detector. A shared regression loss is adopted over all classes after the box subnet, while a dataset-aware focal loss is used after the class subnet to enable training on the hybrid dataset. Different colors in the dataset-aware focal loss imply different classes from the merged dataset.
RetinaNet is a widely used one-stage object detector, where the focal loss is proposed to address the foreground-background class imbalance during training. RetinaNet contains two parts: the backbone network and two sub-networks for object classification and bounding box regression. As in the original paper, we adopt the Feature Pyramid Network (FPN) [18] to facilitate multi-scale detection, and the details of the pyramid configuration are exactly the same as in RetinaNet. Specifically, we use five pyramid levels of 256 channels, where the three levels with larger spatial sizes are calculated from the corresponding stages of ResNet-50 [12] and MobileNetV2 [27]. We choose these two backbones to demonstrate that the proposed cross-dataset training scheme can be applied to detectors with networks of various sizes. In Section 4, we observe similar results on both large and small backbones.
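For orientation only, a comparable ResNet-50 + FPN RetinaNet baseline can be instantiated with torchvision; this is a stand-in for the authors' implementation, and the merged class count shown is an assumption.

```python
# Illustrative stand-in for the baseline detector (not the authors' code): a RetinaNet
# with ResNet-50 backbone and FPN, built from a recent torchvision release.
import torchvision

num_hybrid_classes = 81  # assumption: e.g. 80 COCO classes plus one merged "face" class

model = torchvision.models.detection.retinanet_resnet50_fpn(num_classes=num_hybrid_classes)

# The structure mirrors Figure 2: a classification subnet and a box-regression subnet
# shared across the 256-channel FPN pyramid levels.
print(type(model.head.classification_head).__name__, type(model.head.regression_head).__name__)
```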
3.2. Label Mapping

As mentioned above, duplicated or semantically consistent labels across datasets need to be merged. Then a hybrid dataset can be generated through label concatenation, where the source dataset of each image is recorded in the hybrid dataset so that the dataset-aware focal loss can be applied. These two steps can be summarized as a label mapping process. All the old labels are mapped to a new set of labels where only unique labels are kept. A simple example of label mapping is given in Figure 3. Assume we have two datasets whose labels are l1, l2, l3, l4, l5 and m1, m2, m3, respectively. Labels l1 and m3 have the same or similar meaning, so they will be mapped to the same new label n2 in the hybrid dataset. By this label mapping procedure, we obtain a new hybrid dataset whose labels are consistent and contain no duplicates.
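As a small illustration of this mapping in code (only the l1/m3-to-n2 pair comes from the example above; the remaining entries and field names are hypothetical):

```python
# A minimal sketch of label mapping: duplicated or semantically identical labels
# from different datasets collapse onto one merged label.
label_map = {
    ("dataset_1", "l1"): "n2",  # l1 and m3 have the same meaning, so both map to n2
    ("dataset_2", "m3"): "n2",
    ("dataset_1", "l2"): "n1",  # remaining mappings are hypothetical placeholders
    ("dataset_2", "m1"): "n5",
}

def to_hybrid(annotation):
    """Re-label one annotation and keep its source dataset for the dataset-aware loss."""
    key = (annotation["dataset"], annotation["label"])
    return {
        "label": label_map[key],
        "source_dataset": annotation["dataset"],
        "box": annotation["box"],
    }
```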
3.3. Dataset-aware focal loss

The loss function for cross-dataset object detection needs to be carefully designed because of the possible conflicts between positive and negative samples. For example, negative samples from a face detection dataset may be positive samples for human bodies, which means the face detection dataset is a conflicting dataset for human detection. To accommodate this problem, we propose a new type of focal loss which is dataset-aware. The original focal loss for binary classification is

\[ FL(p_t) = -\alpha (1 - p_t)^{\gamma} \log(p_t) \tag{1} \]

\[ p_t = \begin{cases} p, & \text{if } y = 1 \\ 1 - p, & \text{otherwise} \end{cases} \tag{2} \]

In the above, y specifies the ground-truth class label and p is the estimated probability for the label y = 1.
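Equations (1)-(2) translate directly into code. The dataset-aware part shown afterwards is only a minimal sketch of the idea described in Section 1 (ignore negatives of classes that conflict with an image's source dataset), not the authors' exact formulation; all function and tensor names are assumptions.

```python
import torch

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Eq. (1)-(2): p are predicted probabilities, y are 0/1 targets of the same shape."""
    p_t = torch.where(y == 1, p, 1.0 - p)
    return -alpha * (1.0 - p_t) ** gamma * torch.log(p_t.clamp(min=1e-6))

def dataset_aware_focal_loss(p, y, source_dataset, conflict_classes):
    """conflict_classes[d] is a boolean (num_classes,) mask of classes whose negative
    samples cannot be trusted for images from dataset d (e.g. unlabeled people in a
    face-only dataset)."""
    loss = focal_loss(p, y)                                    # (num_anchors, num_classes)
    ignore = conflict_classes[source_dataset].unsqueeze(0) & (y == 0)
    loss = loss.masked_fill(ignore, 0.0)                       # drop conflicting negatives
    return loss.sum() / max(int((~ignore).sum()), 1)
```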
[Figure 3: label mapping example, in which labels l1-l5 and m1-m3 from two datasets are mapped to merged labels n1-n7.]

[Figure: (a) Overview of dataset-aware focal loss.]
Table 1: The single and cross-dataset training COCO results on minival subset. All models are trained and tested with 800
pixel (shorter edge) input. FACE indicates WIDER FACE dataset, while PED indicates WIDER Pedestrian dataset.
the anchor scales to get adaptive results on both COCO and WIDER FACE. For a fair comparison, we retrain the COCO results with the new anchor scales. Table 1 shows our baseline results. Our baseline result (35.4 AP) is very close to the result (35.7 AP) of the original paper.

PASCAL VOC: The PASCAL VOC dataset is used for cross-dataset training with the COCO dataset to verify our idea on the general object detection task. The anchor ratios are the same as the COCO settings. Table 5 shows the results of our baselines.

WIDER Pedestrian: We first follow the same anchor aspect ratio of {2.44} as used in [31] for accurate pedestrian detection with the RetinaNet framework. For a fairer comparison with cross-dataset training, we also add several anchor settings identical to the cross-dataset training ones as additional baselines.

Table 3: Different anchor settings results with a ResNet-50 backbone. F+C and F+P denote cross-dataset training on FACE+COCO and FACE+PED, respectively.

Data   ratios        scales    AP     Easy   Medium  Hard
FACE   [1.25]        [2,3]     -      94.7   93.27   87.60
FACE   [0.5,1,2]     [2,3,4]   -      95.14  93.80   86.48
FACE   [1.25,2.44]   [2,3]     -      95.21  93.7    87.88
COCO   [0.5,1,2]     [4,5,6]   35.5   -      -       -
COCO   [0.5,1,2]     [2,3,4]   35.4   -      -       -
PED    [2.44]        [2,3,4]   50.82  -      -       -
PED    [1.25,2.44]   [2,3]     48.37  -      -       -
PED    [0.5,1,2]     [2,3,4]   50.0   -      -       -
F+C    [0.5,1,2]     [2,3,4]   35.2   95.23  93.97   87.06
F+P    [1.25,2.44]   [2,3]     48.32  94.58  93.22   87.59
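To make the ratios and scales columns of Table 3 concrete, the following sketch enumerates anchor shapes for a single pyramid level in the usual RetinaNet fashion; the exact scale and base-size convention of the authors' implementation is an assumption here.

```python
# Illustrative anchor enumeration for one pyramid level (assumed convention:
# area = (base_size * scale)^2 and aspect ratio = height / width).
import math

def anchor_shapes(base_size, ratios, scales):
    shapes = []
    for scale in scales:
        side = base_size * scale
        for ratio in ratios:
            w = side / math.sqrt(ratio)
            h = side * math.sqrt(ratio)
            shapes.append((w, h))
    return shapes

# Tall anchors suit pedestrians (ratio 2.44); near-square anchors suit faces (ratio 1.25).
print(anchor_shapes(base_size=8, ratios=[1.25, 2.44], scales=[2, 3]))
```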
4.2.2 Cross-dataset training

WIDER FACE with WIDER Pedestrian: In these experiments, WIDER FACE and WIDER Pedestrian are mixed for cross-dataset training. From Table 2, it gets 48.32 AP, almost the same as the baseline result (48.37 AP) on the pedestrian detection task. For the face detection task, similar results are shown in Figure 5. The results clearly demonstrate that the cross-dataset training strategy works for face and pedestrian detection without degrading performance.

WIDER FACE with COCO: In these experiments, we mix WIDER FACE with COCO for cross-dataset training. As shown in Table 1, it gets an AP of 35.2 using the ResNet-50 backbone. Compared to the baseline results, cross-dataset training drops only slightly on the COCO minival set. As reported in Figure 5, it maintains performance on the three WIDER FACE subsets. For the MobileNetV2 backbone, similar results are shown in Table 1, where the AP on COCO minival is 29.5 vs. 29.5 for the cross-dataset setting and the baseline.

WIDER Pedestrian with COCO: Results for adding pedestrian to COCO are included in Table 2. Note that the person class in COCO is similar to pedestrian, so we need to merge the two classes. It gets 35.2 AP, similar to the original result. Moreover, it gets higher performance on the pedestrian detection task because the COCO dataset also contains pedestrians, which provides stronger pedestrian features.

WIDER Pedestrian, COCO and WIDER FACE: We add WIDER Pedestrian and WIDER FACE to the COCO dataset for cross-dataset training. We can conclude that the proposed cross-dataset training setting does not undermine the performance on these datasets, indicating that our training pipeline is a universal solution that does not depend on the redundancy of the network.
Figure 5: Precision-recall curves on the WIDER FACE val set. Results for baselines and cross-dataset training with two different backbones, ResNet-50 (res50) and MobileNetV2 (mv2), are drawn. Legends with "face1" indicate baseline settings for WIDER FACE with WIDER Pedestrian, while those with "face2" are for WIDER FACE with COCO.
Table 4: Merge labels results. C P indicates the COCO and PED datasets; F C P indicates the FACE, COCO and PED datasets.

Table 5: The COCO and VOC results on the val set.

Table 6: Different backbone results of face and pedestrian detection tasks. F, P indicate the FACE and Pedestrian datasets. All the anchor scales in these experiments are {2, 3}. Mv2 is short for MobileNetV2.

Data   Backbone  AP     Easy   Medium  Hard
FACE   Res50     -      95.21  93.7    87.88
FACE   Mv2       -      93.00  90.98   84.64
PED    Res50     48.37  -      -       -
PED    Mv2       43.73  -      -       -
F+P    Res50     48.32  94.58  93.22   87.59
F+P    Mv2       43.86  92.5   90.6    84.41
5. Applications

Cross-dataset training is a general concept. We demonstrate how it can be used to train a single unified object detector with multiple datasets. When data with new class labels arrives, the existing data does not need to be labeled again.

This method can be extended to other applications as well. As long as the datasets have different classes, or the same class but different domains, cross-dataset training can be generalized to train a single model. For example, 360-degree (in-plane rotation, i.e. roll) support is a difficult problem in face detection. Using data augmentation alone cannot solve this problem well. But with cross-dataset training, we can add orientation supervision during data augmentation: different orientations can be viewed as different classes, as sketched below. The trained model can then detect faces in 360 degrees and predict the face orientations.
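A minimal sketch of this idea, assuming four orientation bins and a user-supplied rotation routine (both the bin count and the label names are hypothetical choices, not the paper's setting):

```python
# Treat face orientation as extra classes during augmentation (illustrative only).
import random

ORIENTATION_BINS = [0, 90, 180, 270]  # assumed in-plane rotation bins

def augment_with_orientation(image, face_boxes, rotate_fn):
    """Rotate the image by a random bin and relabel each face as 'face_<angle>'."""
    angle = random.choice(ORIENTATION_BINS)
    rotated_image, rotated_boxes = rotate_fn(image, face_boxes, angle)
    labels = [f"face_{angle}" for _ in rotated_boxes]  # the orientation becomes the class
    return rotated_image, rotated_boxes, labels
```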
6. Conclusion

In this paper, we introduce cross-dataset training for object detection. It aims to jointly train on multiple datasets labeled with different object classes. We propose a training scheme using label mapping and a cross-dataset classification loss. Experiments on various popular datasets and backbones prove the effectiveness of our approach. With cross-dataset training, we make it possible to detect the class union of multiple datasets with a single model without accuracy loss. We expect this general training method to be used in three scenarios: 1) object detection research that utilizes existing object detection datasets, 2) industrial applications that are usually faced with increasing classes, and 3) lifelong learning where data with new class labels continuously arrives.
References

[1] WIDER Challenge. http://wider-challenge.org/, 2018. [Online; accessed 13-Nov-2018].
[2] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In Advances in Neural Information Processing Systems, pages 41-48, 2007.
[3] Z. Cai and N. Vasconcelos. Cascade R-CNN: Delving into high quality object detection. arXiv preprint arXiv:1712.00726, 2017.
[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248-255. IEEE, 2009.
[5] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303-338, 2010.
[6] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627-1645, 2010.
[7] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg. DSSD: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659, 2017.
[8] R. Girshick. Fast R-CNN. In ICCV, pages 1440-1448, 2015.
[9] R. Girshick. The generalized R-CNN framework for object detection. https://www.dropbox.com/s/rd095fo2lr7gtau/eccv2018_tutorial_ross_girshick.pptx?dl=0, 2018. [Online; accessed 13-Nov-2018].
[10] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580-587, 2014.
[11] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In ICCV, pages 2980-2988. IEEE, 2017.
[12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770-778, 2016.
[13] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[14] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32-73, 2017.
[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.
[16] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, T. Duerig, et al. The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. arXiv preprint arXiv:1811.00982, 2018.
[17] J. Li, S. Xiao, F. Zhao, J. Zhao, J. Li, J. Feng, S. Yan, and T. Sim. Integrated face analytics networks through cross-dataset hybrid training. In Proceedings of the 2017 ACM on Multimedia Conference, pages 1531-1539. ACM, 2017.
[18] T.-Y. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
[19] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[20] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740-755. Springer, 2014.
[21] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In ECCV, pages 21-37. Springer, 2016.
[22] M. Long, Z. Cao, J. Wang, and S. Y. Philip. Learning multiple tasks with multilinear relationship networks. In Advances in Neural Information Processing Systems, pages 1594-1603, 2017.
[23] T. Perrett and D. Damen. Recurrent assistance: Cross-dataset training of LSTMs on kitchen tasks. In ICCV Workshops, pages 1354-1362. IEEE, 2017.
[24] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In CVPR, pages 779-788, 2016.
[25] J. Redmon and A. Farhadi. YOLO9000: Better, faster, stronger. arXiv preprint, 2017.
[26] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91-99, 2015.
[27] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In CVPR, pages 4510-4520, 2018.
[28] M. Teichmann, M. Weber, M. Zoellner, R. Cipolla, and R. Urtasun. MultiNet: Real-time joint semantic reasoning for autonomous driving. In IEEE Intelligent Vehicles Symposium (IV), pages 1013-1020. IEEE, 2018.
[29] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In CVPR, volume 1, pages I-I. IEEE, 2001.
[30] S. Yang, P. Luo, C.-C. Loy, and X. Tang. WIDER FACE: A face detection benchmark. In CVPR, pages 5525-5533, 2016.
[31] S. Zhang, R. Benenson, and B. Schiele. CityPersons: A diverse dataset for pedestrian detection. In CVPR, 2017.
[32] X. Zhang, X. Zhou, M. Lin, and J. Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. arXiv preprint arXiv:1707.01083, 2017.