
Cross-dataset Training for Class Increasing Object Detection

Yongqiang Yao, Yan Wang, Yu Guo, Jiaojiao Lin, Hongwei Qin, Junjie Yan
SenseTime
{yaoyongqiang,wanyan1,guoyu,linjiaojiao,qinhongwei,yanjunjie}@sensetime.com
arXiv:2001.04621v1 [cs.CV] 14 Jan 2020

Abstract

We present a conceptually simple, flexible and general framework for cross-dataset training in object detection. Given two or more already labeled datasets that target different object classes, cross-dataset training aims to detect the union of the different classes, so that we do not have to label all the classes for all the datasets. By cross-dataset training, existing datasets can be utilized to detect the merged object classes with a single model. Furthermore, in industrial applications, the object classes usually increase on demand. When adding new classes, it is quite time-consuming to label the new classes on all the existing datasets. With cross-dataset training, we only need to label the new classes on the new dataset. We experiment on PASCAL VOC, COCO, WIDER FACE and WIDER Pedestrian with both solo and cross-dataset settings. Results show that our cross-dataset pipeline can achieve similarly impressive performance on all these datasets simultaneously, compared with training independently.

Figure 1: Motivation of cross-dataset training. A single model is trained across multiple existing datasets with different object classes and can be used to detect the union of the object classes from all the datasets, which saves the heavy burden of labelling new classes on all the existing datasets.

1. Introduction

Object detection is one of the most important computer vision tasks. Essentially, object detection is a joint task of classification and localization. For a long time, the community spent a lot of effort on the problem of face detection. In the last decade, however, general object detection has become a very active research topic.

This phenomenon results from three aspects: better representations, better detection frameworks and better supervision. The best bounding box mAP on COCO [20] in the year 2012 was 5 [9], while the mAP rose to 51.3 in 2018 [9, 11, 3]. The most rapid progress comes from better representations, especially in the deep learning era, from AlexNet [15] all the way to ResNet [12] and its variations. Detection frameworks have also made great progress, from Viola-Jones [29] and DPM [6] to the modern R-CNN series [10, 8, 26, 11, 18]. Better supervision results from the blossoming of datasets, for example Pascal VOC [5], COCO [20], WIDER FACE [30], ImageNet [4], Visual Genome [14] and the Open Images Dataset [16].

The trend is that more and more object classes are in demand, as well as larger amounts of data. In practice, new object classes need to be added continuously, but it costs time and human resources to label the new classes on all the existing data. It would be much more flexible if we could continuously add new data labeled only with the new classes. Eventually, a life-long learning system could utilize such incompletely labeled datasets. In this paper, we propose to solve this problem with a novel concept called cross-dataset training.

Cross-dataset training aims to utilize two or more datasets labeled with different object classes to train a single model that performs well on all the classes, as shown in Figure 1. Generally, a face detection model is trained on WIDER FACE to perform face detection only, and a general object detection model is trained on COCO to perform 80-class object detection. By cross-dataset training on WIDER FACE and COCO, our goal is a single model that has the same backbone and can detect 81 classes without accuracy loss.
For cross-dataset object detection, simply concatenating the labels is unreasonable. The first reason is that labels may be duplicated, making it necessary to first merge identical labels across datasets. The second reason is the possible conflict between positive and negative samples from different datasets. For example, the negative samples from a face detection dataset may contain a large number of human bodies, which makes the task very confusing if cross-trained with a human detection dataset. Considering these two aspects, we propose a novel cross-dataset training scheme specially designed for object detection, which is composed of four steps:

1) merge duplicated labels across datasets;
2) generate a hybrid dataset through label concatenation, while keeping the original partition information of every image;
3) build an avoidance relationship across partitions, such as face-negative versus human-positive;
4) train the detector on this hybrid dataset, where the loss is calculated according to this avoidance relationship.

We experiment on several popular object detection datasets to evaluate cross-dataset training. First we choose WIDER FACE [30] and WIDER Pedestrian [1]. We train a baseline model for face detection on WIDER FACE and another for pedestrian detection on WIDER Pedestrian, respectively. When performing cross-dataset training using our training scheme, we get a single model that simultaneously achieves little or no accuracy loss on both datasets. Then, we conduct experiments on WIDER FACE and COCO, making it possible to detect the 80 classes of COCO as well as faces with a single model, without labeling faces on COCO. We observe no accuracy loss on COCO or WIDER FACE.

We believe this flexible and general framework provides a solid approach to cross-dataset training and continual learning for academic research and industrial applications.

2. Related Work

Object Detection In the past decade, object detection has been evolving rapidly. Deep CNNs (Convolutional Neural Networks) have brought object detection into a totally new era. The two-stage R-CNN series [10, 8, 26, 18, 11] and single-stage detectors like SSD [21], YOLO [24] and RetinaNet [19] are among the most popular frameworks. We choose RetinaNet as the detection framework because it achieves state-of-the-art performance with focal loss balancing positive and negative samples. RetinaNet is also robust and effective on detection, instance segmentation and keypoint tasks. Backbones are also important for detection performance. ResNets [12] are widely used in state-of-the-art object detectors. For mobile-friendly applications, MobileNets [13] and their variants [32, 27] are widely adopted for benchmarking. Object scale is another important research topic in object detection, and various detection frameworks adopt various solutions for scale [21, 7, 25, 18]. Feature Pyramid Network [18] is an efficient multi-scale representation learning method for object detection and achieves consistent improvement on a number of detection frameworks. RetinaNet with ResNets and FPN is the current leading framework on benchmarks like COCO.

Cross-dataset Training In terms of cross-dataset training, the most related work is Recurrent Assistance [23], where cross-dataset training is used for frame-based action recognition during the pre-training stage. Since the number and identity of classes differ in each dataset, the authors propose to generate a new dataset by label concatenation, where labels from different datasets are simply concatenated to form a hybrid dataset with more labels. They found that improved accuracy and faster convergence can be achieved by pre-training on similar datasets using label concatenation. A similar label concatenation is also adopted in our paper, but during training we must consider the duplication and conflicts between datasets instead of a straightforward concatenation.

Another related work is integrated face analytics networks [17], where multiple datasets annotated for different tasks (facial landmarks, facial emotion and face parsing) are used to train an integrated face analysis model, avoiding the need of building a fully labeled common dataset for all the tasks. The performance is boosted by explicitly modelling the interaction of different tasks, so it belongs to the scope of multi-task learning. Although we have a similar motivation for cross-dataset training, we are the first to apply the general idea of cross-dataset training to object detection, where different problems need to be addressed compared with those previous works.

Multi-task Learning Another related research topic is multi-task learning [2, 22]. Among the seminal works, the most related one is MultiNet [28], where classification, detection and semantic segmentation are jointly trained on a single dataset. This is quite different from our scenario, where multiple datasets are jointly trained for a similar task but different object classes. In the multi-task learning community, there is still no appropriate algorithm for cross-dataset training.

3. Cross-dataset Training

3.1. Detection Baseline

Cross-dataset training for object detection aims to detect the union of all the classes across different existing datasets without additional labelling effort. Considering duplicated labels and possible conflicts across datasets, simply concatenating the labels of all datasets is unreasonable and may cause degraded performance. Using RetinaNet as the baseline, we propose a novel cross-dataset training scheme specially designed for object detection.
Figure 2: Overall structure of the proposed cross-dataset training method. Existing datasets are merged into a hybrid dataset. Then RetinaNet is adopted as the detector. A shared regression loss is adopted over all classes after the box subnet, while a dataset-aware focal loss is used to enable training on the hybrid dataset after the class subnet. Different colors in the dataset-aware focal loss imply different classes from the merged dataset.

The key components include label mapping and a dataset-aware classification loss, which is a revised version of the focal loss in RetinaNet. The overall structure of the proposed cross-dataset training method is shown in Figure 2.

RetinaNet is a widely used one-stage object detector, where focal loss is proposed to address the foreground-background class imbalance during training. RetinaNet contains two parts: the backbone network and two sub-networks for object classification and bounding box regression. As in the original paper, we adopt the Feature Pyramid Network (FPN) [18] to facilitate multi-scale detection, and the details of the pyramid configuration are exactly the same as in RetinaNet. Specifically, we use five pyramid levels of 256 channels, where the three levels with larger spatial sizes are calculated from the corresponding stages of ResNet-50 [12] and MobileNetV2 [27]. We choose these two backbones to demonstrate that the proposed cross-dataset training scheme can be applied to detectors with networks of various sizes. In Section 4, we observe similar results on both large and small backbones.

3.2. Label Mapping

As mentioned above, duplicated or semantically consistent labels across datasets need to be merged. Then a hybrid dataset can be generated through label concatenation, where the source dataset of each image is kept in the hybrid dataset so that the dataset-aware focal loss can be applied. These two steps can be summarized as a label mapping process. All the old labels are mapped to a new set of labels in which only unique labels are kept. A simple example of label mapping is given in Figure 3. Assume we have two datasets whose labels are l1, l2, l3, l4, l5 and m1, m2, m3, respectively. Labels l1 and m3 have the same or similar meaning, so they are mapped to the same new label n2 in the hybrid dataset. By this label mapping procedure, we obtain a new hybrid dataset whose labels are consistent and free of duplicates.
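To make the label mapping step concrete, the following is a minimal Python sketch of the bookkeeping described above. The dataset names, the alias table and the helper functions are illustrative assumptions made for this example only; they do not correspond to our released code.

```python
# Minimal sketch of label mapping for a hybrid dataset (illustrative, not the
# authors' released code). Duplicated labels such as "person" and "pedestrian"
# are mapped to one merged class id; every annotation keeps its source dataset
# so the dataset-aware focal loss can be computed later.

DATASET_LABELS = {
    "coco": ["person", "car", "dog"],        # truncated class list, for illustration
    "wider_pedestrian": ["pedestrian"],
    "wider_face": ["face"],
}
ALIASES = {"pedestrian": "person"}           # semantically identical classes to merge

def build_label_mapping(dataset_labels, aliases):
    """Return the merged class list and, per dataset, {local label -> merged id}."""
    merged, mapping = [], {}
    for dataset, labels in dataset_labels.items():
        mapping[dataset] = {}
        for name in labels:
            canonical = aliases.get(name, name)
            if canonical not in merged:
                merged.append(canonical)
            mapping[dataset][name] = merged.index(canonical)
    return merged, mapping

merged_classes, label_map = build_label_mapping(DATASET_LABELS, ALIASES)

def convert_annotation(ann, dataset_name):
    """Map one annotation into the hybrid label space, keeping the source dataset."""
    return {
        "bbox": ann["bbox"],
        "class_id": label_map[dataset_name][ann["label"]],
        "source_dataset": dataset_name,      # needed by the dataset-aware focal loss
    }

# A WIDER Pedestrian box ends up with the same merged class id as a COCO person.
print(convert_annotation({"bbox": [10, 20, 50, 120], "label": "pedestrian"},
                         "wider_pedestrian"))
```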
Figure 3: A simple example of label mapping. Two datasets with labels l1, l2, l3, l4, l5 (represented by circles) and m1, m2, m3 (represented by rectangles) are merged into a new hybrid dataset with labels n1, n2, ..., n7 (represented by rounded rectangles). The solid blue shapes indicate labels that belong to the same class and are merged into one class by the label mapping operation. Other labels stay unchanged during mapping.

Figure 4: (a) Overview of dataset-aware focal loss. The original dataset information is kept for all the samples in the hybrid dataset. Negative samples from the i-th dataset only contribute to the focal loss of object classes from the i-th dataset. (b) An example of dataset-aware focal loss for face and pedestrian datasets. These two datasets are considered conflicting because negative samples in one dataset may include objects labeled as ground truth in the other dataset. In this case, positive samples are generated according to the new labels after label mapping and are processed with the standard focal loss, but negative samples are not shared when calculating the focal loss of object classes from different datasets. Meanwhile, ground-truth patches from one dataset are added to the unshared negative examples of the other datasets.

3.3. Dataset-aware focal loss

The loss function for cross-dataset object detection needs to be carefully designed because of the possible conflicts between positive and negative samples. For example, negative samples from a face detection dataset may be positive samples for human bodies, which means the face detection dataset is a conflicting dataset for human detection. To accommodate this problem, we propose a new type of focal loss which is dataset-aware. The original focal loss for binary classification is

FL(p_t) = -\alpha (1 - p_t)^{\gamma} \log(p_t)    (1)

p_t = \begin{cases} p, & \text{if } y = 1 \\ 1 - p, & \text{otherwise} \end{cases}    (2)

where y specifies the ground-truth class label and p is the estimated probability for the label y = 1. Other notations follow the RetinaNet paper. The proposed dataset-aware focal loss is

FL(p_t) = -\alpha (1 - p_t)^{\gamma} \log(p_t)    (3)

p_t = \begin{cases} p, & \text{if } y = 1 \\ 1, & \text{if } y \neq 1 \text{ and the sample comes from a conflicting dataset} \\ 1 - p, & \text{otherwise} \end{cases}    (4)

The most important modification in dataset-aware focal loss is that loss values corresponding to negative samples from conflicting datasets are set to zero. When training across multiple datasets, negative samples from one dataset only contribute to the focal loss of the object classes from exactly the same dataset, as shown in Figure 4a. A simple illustration of dataset-aware focal loss is given in Figure 4b. There exist conflicts between the face and pedestrian datasets because their negative samples may include objects with labels from the other dataset, which causes confusion if a normal classification loss is used. In dataset-aware focal loss, negative samples are not shared across different datasets, so loss values of negative samples from the face dataset are set to zero when calculating the focal loss for the class pedestrian. Positive samples from different datasets are generated together according to their own ground-truth labels, so there are no conflicts and their loss values are calculated using the standard focal loss. Through this dataset-aware approach, we avoid the confusion caused by conflicting datasets. Furthermore, positive examples from one dataset are regarded as negatives for the other datasets, enriching the negative information for cross-dataset training.

In Section 4, we will show that the dataset-aware classification loss is necessary for stable training and good performance.
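For clarity, here is a minimal PyTorch-style sketch of Eqs. (3)-(4). The tensor layout, the per-class dataset ids and the conflict matrix are assumptions made for this illustration; they describe one possible way to realize the dataset-aware masking, not necessarily our exact implementation.

```python
import torch

def dataset_aware_focal_loss(logits, targets, class_dataset_id, sample_dataset_id,
                             conflict, alpha=0.25, gamma=2.0):
    """Sketch of the dataset-aware focal loss (Eqs. 3-4), illustrative only.

    logits:            (N, C) classification scores for N anchors and C merged classes
    targets:           (N, C) one-hot ground-truth labels in the merged label space
    class_dataset_id:  (C,)   dataset id that each merged class originates from
    sample_dataset_id: (N,)   dataset id of the image each anchor comes from
    conflict:          (D, D) bool; conflict[i, j] is True when negatives of dataset i
                              cannot be trusted for classes of dataset j (in the
                              strictest setting, conflict[i, j] = (i != j))
    """
    p = torch.sigmoid(logits)
    # Standard focal loss terms, with alpha_t as in RetinaNet
    # (alpha for positives, 1 - alpha for negatives).
    pos_term = -alpha * (1 - p) ** gamma * torch.log(p.clamp(min=1e-6))
    neg_term = -(1 - alpha) * p ** gamma * torch.log((1 - p).clamp(min=1e-6))
    loss = torch.where(targets == 1, pos_term, neg_term)

    # Dataset-aware masking: a negative term for a class whose source dataset
    # conflicts with the sample's dataset is zeroed, i.e. p_t is forced to 1.
    conflicting = conflict[sample_dataset_id][:, class_dataset_id]   # (N, C) bool
    loss = torch.where((targets == 0) & conflicting, torch.zeros_like(loss), loss)
    return loss.sum()
```

The final masking step is the whole modification: a negative anchor whose image comes from a dataset that conflicts with a class contributes nothing to that class, which corresponds to the y ≠ 1 branch of Eq. (4).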
4. Experiments

We conduct experiments to validate the feasibility of cross-dataset training with different detection tasks, and also provide a detailed ablation study in this section.

4.1. Experimental Setting

4.1.1 Datasets

In our experiments, three distinct detection tasks, i.e. face detection, general object detection and pedestrian detection, are considered. Each task is associated with one or two different datasets.

The task of face detection aims to detect faces in images. We use the popular WIDER FACE [30] dataset for our experiments. It consists of 32,203 images and 393,703 faces with large variability in scale, pose and occlusion. The dataset is split into training (40%), validation (10%) and test (50%) sets. Besides, the images are divided into three levels (easy, medium and hard subsets) according to the difficulty of detection. The images and annotations of the training and validation sets are available online, and all the face detection results in our experiments are reported on the validation set.

To verify our method on the general object detection task, we conduct a series of experiments on the COCO [20] detection dataset. For training, we use the standard coco-2017-train split, which contains 115k images. The evaluation results are reported on the 5k val images (coco-2014-minival). We also carry out experiments on the PASCAL VOC dataset to prove the effectiveness of our method when the number of classes increases considerably.

For the pedestrian detection task, the WIDER Pedestrian [1] dataset with 11,500 training images, 5,000 validation images and 3,500 testing images is used in our experiments. We report our results on the 5,000 validation images for thorough comparison.

4.1.2 Implementation Details

We choose RetinaNet as our detection baseline. ResNet-50 and MobileNetV2 are used as our backbones. Images are resized such that their scale (shorter edge) is 800 pixels, the same as in the original paper. Our models are trained using SGD with 0.9 momentum, 0.0001 weight decay and batch size 32. The first two epochs are used as warm-up for stable training. The maximum number of epochs is 20: we use a learning rate of 0.04 for the first 10 epochs, and then continue training for two further 5-epoch stages with learning rates of 0.004 and 0.0004.

Intuitively, the size of objects may vary a lot during cross-dataset training, so the default anchor scales may no longer be suitable for cross-dataset scenarios. For instance, when combining with face or pedestrian detection tasks, the default anchor scales of {4, 5.03, 6.35} in the COCO setting are too large to detect small faces or pedestrians, while larger anchor scales are needed to detect large objects in the COCO and PASCAL VOC evaluations. So adjustments of the training scales and ratios are necessary for cross-dataset training to ensure the performance. The final anchor scales are the product of the simplified anchor scales we propose and the strides of each pyramid level. As new anchor scales are introduced in our cross-dataset training experiments, we use both the default settings and the newly revised ones as baselines for a fair comparison.

According to the reason above, for COCO we use anchors at three aspect ratios {1:2, 1:1, 2:1}, with scales set to {2, 3, 4} and {4, 5.03, 6.35}. For PASCAL VOC, ratios of {1:2, 1:1, 2:1} and scales of {4, 5.03, 6.35} are selected as our baseline setting.

Similarly, we have three different baselines in our experiments for the face detection task. The corresponding combinations of ratios and scales are {{1.25}; {2, 3}}, {{1.25, 2.44}; {2, 3}} and {{0.5, 1, 2}; {2, 3, 4}}. For pedestrian detection, the ratio and scales are set to {2.44} and {2, 3}, respectively. Finally, we use ratios of {1.25, 2.44} and scales of {2, 3} in FACE-PED cross-dataset training, and adopt ratios of {0.5, 1, 2} and scales of {2, 3, 4} in PED-COCO and FACE-COCO-PED cross-dataset training.

4.2. Results and Comparison

We compare the performance of the proposed cross-dataset training with our baseline results. We consider several different cross-dataset training settings, including face with COCO, face with pedestrian, pedestrian with COCO, and all datasets mixed. The same network structure and the same training strategy are used for a fair comparison. As for evaluation metrics, we use the COCO-style Average Precision (AP) for the COCO general object detection task and the pedestrian detection task. For the WIDER FACE dataset, the validation set has easy, medium and hard subsets, which roughly correspond to large, medium and small faces, respectively.

4.2.1 Baselines

WIDER FACE: We use RetinaNet to train face detectors as our baselines. Different from the original configuration, we use several different settings for better performance, considering the relatively fixed aspect ratio of faces. Figure 5 and Table 3 show our results.

COCO: We first reproduce the original COCO RetinaNet results using PyTorch. However, the default anchor scales are not suitable for face detection, so we re-configure the anchor scales to obtain adaptive results on both COCO and WIDER FACE.
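As a concrete reading of the statement that the final anchor scales are the product of the simplified scales and the per-level strides, the short sketch below enumerates the resulting anchor side lengths. The P3-P7 strides of 8 to 128 are the usual FPN values and are an assumption here, not a table taken from this paper.

```python
# Sketch: final anchor side length = simplified scale * pyramid stride.
# FPN strides for P3-P7 are assumed to be 8, 16, 32, 64, 128.
FPN_STRIDES = [8, 16, 32, 64, 128]

def anchor_sizes(scales, strides=FPN_STRIDES):
    """Return the anchor side lengths (in pixels) generated at every pyramid level."""
    return {f"P{level}": [s * stride for s in scales]
            for level, stride in zip(range(3, 3 + len(strides)), strides)}

# Default COCO setting vs. the smaller scales used for cross-dataset training.
print(anchor_sizes([4, 5.03, 6.35]))   # e.g. P3 -> [32.0, 40.24, 50.8]
print(anchor_sizes([2, 3, 4]))         # e.g. P3 -> [16, 24, 32], small enough for faces
```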
Data Backbone ratios scales AP AP50 AP75 APS APM APL
COCO ResNet-50 [0.5, 1, 2] [4,5.03,6.35] 35.5 54.9 38.0 19.8 39.2 46.2
COCO ResNet-50 [0.5, 1, 2] [2, 3, 4] 35.4 56.4 37.3 19.8 38.7 45.6
COCO MobileNetV2 [0.5, 1, 2] [4, 5.03, 6.35] 30.0 48.6 31.4 16.2 31.9 39.3
COCO MobileNetV2 [0.5, 1, 2] [2, 3, 4] 29.5 48.0 30.6 16.1 31.3 39.3
COCO+FACE ResNet-50 [0.5, 1, 2] [2, 3, 4] 35.2 56.1 37.1 20.9 38.3 45.4
COCO+FACE MobileNetV2 [0.5, 1, 2] [2, 3, 4] 29.5 49.0 30.4 16.2 31.4 39.5
COCO+PED ResNet-50 [0.5, 1, 2] [2, 3, 4] 35.2 55.6 36.9 20.2 38.0 44.9
COCO+FACE+PED ResNet-50 [0.5, 1, 2] [2, 3, 4] 35.0 56.0 36.7 20.8 38.2 44.6

Table 1: The single and cross-dataset training COCO results on minival subset. All models are trained and tested with 800
pixel (shorter edge) input. FACE indicates WIDER FACE dataset, while PED indicates WIDER Pedestrian dataset.

Data Backbone ratios scales AP AP50 AP75 Easy Medium Hard


PED ResNet-50 [1.25,2.44] [2, 3] 48.37 84.26 50.24 - - -
PED MobileNetV2 [1.25,2.44] [2, 3] 43.73 81.94 41.1 - - -
FACE ResNet-50 [1.25,2.44] [2, 3] - - - 95.21 93.7 87.88
FACE MobileNetV2 [1.25,2.44] [2, 3] - - - 93.0 90.98 84.64
FACE+PED ResNet-50 [1.25, 2.44] [2, 3] 48.32 84.26 49.71 94.58 93.22 87.59
FACE+PED MobileNetV2 [1.25, 2.44] [2, 3] 43.86 81.0 42.1 92.24 90.41 84.63
COCO+PED ResNet-50 [0.5, 1, 2] [2, 3, 4] 52.46 85.64 56.6 - - -
COCO+FACE+PED ResNet-50 [0.5, 1, 2] [2, 3, 4] 51.44 85.03 54.9 95.02 93.76 86.66

Table 2: The PED and WIDER FACE results on val subset.

Data ratios scales AP Easy Medium Hard
FACE [1.25] [2,3] - 94.7 93.27 87.60
FACE [0.5,1,2] [2,3,4] - 95.14 93.80 86.48
FACE [1.25,2.44] [2,3] - 95.21 93.7 87.88
COCO [0.5,1,2] [4,5,6] 35.5 - - -
COCO [0.5,1,2] [2,3,4] 35.4 - - -
PED [2.44] [2,3,4] 50.82 - - -
PED [1.25,2.44] [2,3] 48.37 - - -
PED [0.5,1,2] [2,3,4] 50.0 - - -
F+C [0.5,1,2] [2,3,4] 35.2 95.23 93.97 87.06
F+P [1.25,2.44] [2,3] 48.32 94.58 93.22 87.59

Table 3: Results with different anchor settings, using a ResNet-50 backbone.

For a fair comparison, we retrain the COCO models with the new anchor scales. Table 1 shows our baseline results. Our baseline result (35.4 AP) is very close to the result (35.7 AP) of the original paper.

PASCAL VOC: The PASCAL VOC dataset is used for cross-dataset training with COCO to verify our idea on the general object detection task. The anchor ratios are the same as in the COCO settings. Table 5 shows the results of our baselines.

WIDER Pedestrian: We first follow the same anchor aspect ratio of {2.44} as used in [31] for accurate pedestrian detection with the RetinaNet framework. For a fairer comparison with cross-dataset training, we also add several anchor settings identical to those used in cross-dataset training as additional baselines.

4.2.2 Cross-dataset training

WIDER FACE with WIDER Pedestrian: In these experiments, WIDER FACE and WIDER Pedestrian are mixed for cross-dataset training. From Table 2, the joint model reaches 48.32 AP, almost the same as the baseline result (48.37 AP) on the pedestrian detection task. For the face detection task, similar results are shown in Figure 5. The results clearly demonstrate that the cross-dataset training strategy works well for face and pedestrian detection without degrading performance.

WIDER FACE with COCO: In these experiments, we mix WIDER FACE with COCO for cross-dataset training. As shown in Table 1, the model reaches an AP of 35.2 using the ResNet-50 backbone. Compared to the baseline results, cross-dataset training drops only slightly on the COCO minival set. As reported in Figure 5, it maintains performance on all three WIDER FACE subsets. For the MobileNetV2 backbone, similar results are shown in Table 1, where the AP on COCO minival is 29.5 vs. 29.5 for the cross-dataset setting and the baseline.

WIDER Pedestrian with COCO: Results for adding pedestrians to COCO are included in Table 2. Since the person class in COCO is similar to pedestrian, we merge the two classes. The model reaches 35.2 AP, similar to the original result. Moreover, it achieves higher performance on the pedestrian detection task because the COCO dataset also contains pedestrians, which provides stronger pedestrian features.

WIDER Pedestrian, COCO and WIDER FACE: We add WIDER Pedestrian and WIDER FACE to the COCO dataset for cross-dataset training.
Figure 5: Precision-recall curves on the WIDER FACE val set, shown for the (a) Easy, (b) Medium and (c) Hard subsets. Results for baselines and cross-dataset training with two different backbones, ResNet-50 (res50) and MobileNetV2 (mv2), are drawn; the number at the end of each legend entry is the corresponding AP. Legends with "face1" indicate baseline settings for WIDER FACE with WIDER Pedestrian, while those with "face2" are for WIDER FACE with COCO.

Data Backbone Merge COCO PED
CP R50 no 35.2 51.0
CP R50 yes 35.3 51.43
CP Mv2 no 29.1 46.75
CP Mv2 yes 29.1 47.21
FCP R50 no 35.1 50.99
FCP R50 yes 35.0 51.45

Table 4: Label merging results. CP indicates the COCO and PED datasets; FCP indicates the FACE, COCO and PED datasets.

Data Backbone COCO VOC
COCO R50 35.5 -
COCO Mv2 29.5 -
VOC R50 - 83.0
VOC Mv2 - 76.4
COCO+VOC R50 35.4 87.5
COCO+VOC Mv2 29.5 83.1

Table 5: The COCO and VOC results on the val sets.

We can conclude that adding more datasets for cross-dataset training does not harm the performance on any single dataset. Table 2 shows that the AP rises from 50.0 to 51.44 for PED and declines slightly for WIDER FACE. Table 1 shows that the AP for COCO suffers a minor drop from 35.4 to 35.0, which can be explained by the fact that mixing two partially overlapping pedestrian categories may introduce some disturbance for COCO detection.

COCO with PASCAL VOC: Finally, we add PASCAL VOC to the COCO dataset to validate the effectiveness of our training method when more classes are involved or merged in the cross-dataset training pipeline. We can observe from Table 5 that the COCO performance is maintained, while PASCAL VOC achieves even better performance in the cross-dataset scenario, benefiting from the additional data introduced by COCO.

The experiments above demonstrate the effectiveness of cross-dataset training on general object detection tasks. By cross-dataset training, we can utilize existing datasets to detect more classes without the extra work of labeling all the classes in all datasets.

4.2.3 Ablation Study

Is merging semantically identical classes necessary for cross-dataset training? In our cross-dataset training settings, classes with identical semantic meanings, such as person and pedestrian, are combined during training. We conduct experiments with and without merging similar classes to verify the validity of this strategy. From Table 4, our PED results gain about 0.5 AP in cross-dataset training after merging classes via semantic information. If we do not merge these similar categories, they are treated as mutually exclusive and the definition of negative examples becomes confusing, which harms the classification task.

How do the anchor scales influence the results of cross-dataset training? In this experiment, we explore the effects of different anchor scales on cross-dataset training. Each detection task has specific anchor scales for best performance. Compared to the face and pedestrian detection tasks, which require smaller anchor scales, the COCO general object detection task needs larger anchor scales to detect various objects. For a fairer comparison of experimental results, we keep the anchor scales and ratios the same for single-dataset and cross-dataset training. Table 4 shows that cross-dataset training maintains performance when we use the same anchor settings as the baseline.

How important are the backbones for cross-dataset training? We conduct experiments on two different backbones, i.e. ResNet-50 and MobileNetV2, to show that our cross-dataset training method works not because of redundant parameters but because it is a generally effective pipeline for multi-dataset training. Results in Table 6 support this: cross-dataset training maintains performance on both datasets, and similar results are obtained when we choose the smaller backbone. We can see from Table 6 that for backbones with both large (ResNet-50) and limited (MobileNetV2) parameter counts, the proposed cross-dataset training setting does not undermine the performance on either dataset, indicating that our training pipeline is a universal solution that does not depend on the redundancy of the network.
Figure 6: More results of our cross-dataset training on different datasets. The three rows show samples from the COCO, WIDER FACE and WIDER Pedestrian datasets, respectively.

Data Backbone AP Easy Medium Hard
FACE Res50 - 95.21 93.7 87.88
FACE Mv2 - 93.00 90.98 84.64
PED Res50 48.37 - - -
PED Mv2 43.73 - - -
F+P Res50 48.32 94.58 93.22 87.59
F+P Mv2 43.86 92.5 90.6 84.41

Table 6: Results for the face and pedestrian detection tasks with different backbones. F and P indicate the FACE and Pedestrian datasets. All the anchor scales in these experiments are {2, 3}. Mv2 is short for MobileNetV2.

5. Applications

Cross-dataset training is a general concept. We have demonstrated how it can be used to train a single unified object detector on multiple datasets. When data with new class labels arrives, the existing data does not need to be labeled again.

This method can be extended to other applications as well. As long as the datasets have different classes, or the same class in different domains, cross-dataset training can be generalized to train a single model. For example, 360-degree (in-plane rotation, i.e. roll) support is a difficult problem in face detection. Data augmentation alone cannot solve this problem well, but with cross-dataset training we can add orientation supervision during data augmentation. Different orientations can be viewed as different classes, and the trained model can then detect faces in 360 degrees and predict the face orientations.

6. Conclusion

In this paper, we introduce cross-dataset training for object detection. It aims to jointly train on multiple datasets labeled with different object classes. We propose a training scheme using label mapping and a cross-dataset classification loss. Experiments on various popular datasets and backbones prove the effectiveness of our approach. With cross-dataset training, we make it possible to detect the class union of multiple datasets with a single model without accuracy loss. We expect this general training method to be used in three scenarios: 1) object detection research that utilizes existing object detection datasets; 2) industrial applications that are usually faced with increasing numbers of classes; 3) life-long learning where data with new class labels continuously arrives.
References

[1] WIDER challenge. http://wider-challenge.org/, 2018. [Online; accessed 13-Nov-2018].
[2] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In Advances in Neural Information Processing Systems, pages 41–48, 2007.
[3] Z. Cai and N. Vasconcelos. Cascade R-CNN: Delving into high quality object detection. arXiv preprint arXiv:1712.00726, 2017.
[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[5] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
[6] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.
[7] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg. DSSD: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659, 2017.
[8] R. Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
[9] R. Girshick. The generalized R-CNN framework for object detection. https://www.dropbox.com/s/rd095fo2lr7gtau/eccv2018_tutorial_ross_girshick.pptx?dl=0, 2018. [Online; accessed 13-Nov-2018].
[10] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.
[11] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988. IEEE, 2017.
[12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[13] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[14] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.
[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[16] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, T. Duerig, et al. The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. arXiv preprint arXiv:1811.00982, 2018.
[17] J. Li, S. Xiao, F. Zhao, J. Zhao, J. Li, J. Feng, S. Yan, and T. Sim. Integrated face analytics networks through cross-dataset hybrid training. In Proceedings of the 2017 ACM on Multimedia Conference, pages 1531–1539. ACM, 2017.
[18] T.-Y. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
[19] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[20] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[21] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
[22] M. Long, Z. Cao, J. Wang, and S. Y. Philip. Learning multiple tasks with multilinear relationship networks. In Advances in Neural Information Processing Systems, pages 1594–1603, 2017.
[23] T. Perrett and D. Damen. Recurrent assistance: Cross-dataset training of LSTMs on kitchen tasks. In ICCV Workshops, pages 1354–1362. IEEE, 2017.
[24] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.
[25] J. Redmon and A. Farhadi. YOLO9000: Better, faster, stronger. arXiv preprint, 2017.
[26] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[27] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
[28] M. Teichmann, M. Weber, M. Zoellner, R. Cipolla, and R. Urtasun. MultiNet: Real-time joint semantic reasoning for autonomous driving. In 2018 IEEE Intelligent Vehicles Symposium (IV), pages 1013–1020. IEEE, 2018.
[29] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages I–I. IEEE, 2001.
[30] S. Yang, P. Luo, C.-C. Loy, and X. Tang. WIDER FACE: A face detection benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5525–5533, 2016.
[31] S. Zhang, R. Benenson, and B. Schiele. CityPersons: A diverse dataset for pedestrian detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[32] X. Zhang, X. Zhou, M. Lin, and J. Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. arXiv preprint arXiv:1707.01083, 2017.
