3.1. Datasets and Implementation Details
We trained our method using the FAIR1M [38] dataset and validated it on three oriented object detection datasets: DOTA-v1.0 [9], DOTA-v1.5, and DIOR-R [39]. For pretraining, all images from the FAIR1M dataset were used. For the DOTA-v1.0 and DOTA-v1.5 datasets, 80% of the original training data were randomly selected to form a training set, 10% of this training set was then randomly sampled for fine-tuning, and the original validation data were used as the test dataset. For the DIOR-R dataset, 10% of the original training data were randomly selected for fine-tuning, and the original validation data were used as the test dataset. The datasets are described in detail in Table 1.
FAIR1M is a remote sensing dataset for fine-grained oriented object detection. The images in the FAIR1M dataset were collected from different sensors and platforms, with spatial resolutions ranging from 0.3 m to 0.8 m. The dataset contains more than 16,488 images and more than 1 million instances. All objects in the FAIR1M dataset are annotated with oriented bounding boxes (OBBs) across 5 categories and 37 sub-categories. No labels from the dataset were used in the pretraining.
The DOTA-v1.0 images were collected from Google Earth and from the GF-2 and JL-1 satellites provided by the China Centre for Resources Satellite Data and Application. It is one of the largest datasets for oriented object detection in aerial images and contains 15 common categories, 2806 images, and 188,282 instances. Image sizes range from about 800 × 800 to 4000 × 4000 pixels. The object categories in DOTA-v1.0 are as follows: plane (PL), baseball diamond (BD), bridge (BR), ground track field (GTF), small vehicle (SV), large vehicle (LV), ship (SH), tennis court (TC), basketball court (BC), storage tank (ST), soccer field (SBF), roundabout (RA), harbor (HA), swimming pool (SP), and helicopter (HC). For training and testing, we cropped the original images into a series of 1024 × 1024 patches with a stride of 824.
DOTA-v1.5 uses the same images as DOTA-v1.0, but extremely small instances (less than 10 pixels) are also annotated, and a new category, container crane, was added. It contains 403,318 instances in total. Compared to DOTA-v1.0, DOTA-v1.5 is more challenging but also more stable during training. For training and testing, we cropped the original images into a series of 1024 × 1024 patches with a stride of 824, as sketched below.
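A minimal sketch of this sliding-window cropping, assuming zero padding for patches at the image border; the helper names are illustrative and this is not the exact splitting tool shipped with MMRotate:

```python
import numpy as np

def window_starts(length, window=1024, stride=824):
    """Window start offsets along one axis, ensuring the border is covered."""
    if length <= window:
        return [0]
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] + window < length:
        starts.append(length - window)  # final window flush with the border
    return starts

def crop_patches(image, window=1024, stride=824):
    """Yield (x0, y0, patch); patches smaller than the window are zero-padded."""
    h, w = image.shape[:2]
    for y0 in window_starts(h, window, stride):
        for x0 in window_starts(w, window, stride):
            patch = np.zeros((window, window) + image.shape[2:], dtype=image.dtype)
            crop = image[y0:y0 + window, x0:x0 + window]
            patch[:crop.shape[0], :crop.shape[1]] = crop
            yield x0, y0, patch
```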
The angle distribution of objects for each category in the cropped DOTA-v1.0 dataset is shown in Figure 3. The horizontal axis represents the categories contained in the DOTA-v1.0 dataset, while the vertical axis represents the proportion of objects in each angle interval relative to the total number of objects in that category. The colors in the legend correspond to the different angle intervals. The figure shows that this remote sensing dataset contains two main types of objects, illustrated in Figure 4: (a) ground-nonfixed objects and (b) ground-fixed objects. Ground-nonfixed objects include vehicles, planes, ships, etc.; these targets have arbitrary orientations and can appear in an image at any angle. Ground-fixed objects mainly include basketball courts, soccer fields, and ground track fields; these targets have regular orientations and appear at one or a few fixed angles. Remote sensing object detection datasets are typically collected from a few fixed remote sensing satellites. Due to the limitations of satellite trajectory and shooting angle, the angles of ground-fixed objects are concentrated in a few angle intervals, whereas the angle distribution of ground-nonfixed objects is more uniform across intervals because these objects are arbitrarily oriented.
DIOR-R is a challenging remote sensing dataset for oriented object detection, which shares the same images as the DIOR [40] dataset labeled with horizontal annotations. It contains a total of 23,463 images and 192,518 instances covering 20 classes. Each image is 800 × 800 pixels and the spatial resolutions range from 0.5 m to 30 m. The object categories in DIOR-R are as follows: airplane (APL), airport (APO), baseball field (BF), basketball court (BC), bridge (BR), chimney (CH), dam (DAM), expressway toll station (ETS), expressway service area (ESA), golf course (GF), ground track field (GTF), harbor (HA), overpass (OP), ship (SH), stadium (STA), storage tank (STO), tennis court (TC), train station (TS), vehicle (VE), and windmill (WM). The original image size was used for both training and testing.
Pretraining. Our method is based on MoCo-v2 [21]. We optimized the model using synchronized SGD with a weight decay of 0.0001 and a momentum of 0.9, with a batch size of 128 on a single GPU. The optimization took 200 epochs with an initial learning rate of 0.015 and a cosine learning rate schedule; the learning rate was multiplied by 0.1 at epochs 120 and 160. The backbone network was ResNet50 [41] and the temperature hyperparameter was set to 0.07.
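A brief PyTorch sketch of these pretraining settings; the MoCo-v2 momentum encoder, queue, and InfoNCE loss are omitted, and synchronized SGD would additionally require a distributed data-parallel setup:

```python
import torch
from torchvision.models import resnet50

backbone = resnet50()  # ResNet50 encoder; the projection head is omitted here
optimizer = torch.optim.SGD(backbone.parameters(),
                            lr=0.015, momentum=0.9, weight_decay=1e-4)

# Cosine learning-rate schedule over the 200 pretraining epochs.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)
# The stepwise decay mentioned above would instead be:
# scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[120, 160], gamma=0.1)

temperature = 0.07  # temperature of the InfoNCE (contrastive) loss
```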
Fine-tuning. The detector was fine-tuned using the RoI Transformer [8] method implemented in MMRotate-v0.3.4 [42]. The backbone network weights obtained from pretraining were used and kept frozen during training. We optimized the model using synchronized SGD with a weight decay of 0.0001 and a momentum of 0.9. The optimization took 200 epochs with an initial learning rate of 0.005 and a batch size of 4. We adopted a learning rate warm-up of 500 iterations, and the learning rate was divided by 10 at epochs 133 and 183.
3.2. Main Results
We compared our method to other current state-of-the-art methods on three separate remote sensing image datasets for oriented object detection.
Table 2, Table 3 and Table 4 show the experimental results on DOTA-v1.0, DOTA-v1.5, and DIOR-R, respectively. Our method showed the best performance compared to the baseline methods on all three datasets, with improvements of 0.39%, 0.33%, and 0.32%, respectively. Combining the results on DOTA-v1.0 and DOTA-v1.5 in Table 2 and Table 3 with the angle distribution for each category in Figure 3, we found that, compared to the baseline method, ground-fixed objects such as GTF, BC, and SBF achieved an average accuracy improvement of 2.56% on DOTA-v1.0 and DOTA-v1.5, which was 1.50% higher than the improvement for ground-nonfixed objects such as PL, BR, SV, and LV. Our method showed more significant improvements on ground-fixed objects, for which the available angle information is more unbalanced. This demonstrates that the Co-ECL method does not rely on the network fitting the angle distribution of oriented objects for feature learning, but genuinely learns better rotation equivariance through the covariant network. However, our method performed poorly on two circular object categories: ST and RA. We found that this was because circular objects lack obvious angle information: they present no distinct angle cues to the network, yet their annotations still carry different angles. This inconsistency is contradictory for network training, so the accuracy was reduced.
3.3. Rotation Equivariance Measurement
To verify whether our method could learn better rotation equivariance, we applied rotation transformations to the data and validated the method on multi-angle datasets. The validation data came from the test datasets of DOTA-v1.0, DOTA-v1.5, and DIOR-R. We applied rotation transformations to generate data at four angles: 0°, 90°, 180°, and 270°.
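A sketch of how such rotated copies can be generated for multiples of 90°; the helper is illustrative, and how the box angle is wrapped back into range depends on the dataset's angle convention (e.g., le90), which is left out here:

```python
import numpy as np

def rotate_image_and_boxes(image, boxes, k):
    """Rotate an image and OBB centres by k * 90 degrees counter-clockwise.

    `boxes` is an (N, 5) array of (cx, cy, w, h, theta in degrees); only the
    centres and the raw angle value are remapped in this sketch.
    """
    boxes = boxes.copy()
    for _ in range(k % 4):
        w = image.shape[1]
        image = np.rot90(image)                     # 90 deg CCW as displayed
        cx, cy = boxes[:, 0].copy(), boxes[:, 1].copy()
        boxes[:, 0], boxes[:, 1] = cy, w - cx       # centre follows the rotation
    boxes[:, 4] -= (k % 4) * 90                     # sign depends on the angle convention
    return image, boxes
```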
3.3.1. Important Regions
To verify whether the model could learn better rotation equivariance after pretraining, we visualized the regions that are important for predicting the concept on multi-angle images. We used the Grad-CAM [44] method for the visualizations. Grad-CAM is a popular CNN visualization approach that uses the global averages of gradients to compute the weights of feature maps, producing coarse localization maps that highlight the regions in an image that are important for predicting the concept. Specifically, we used the pretrained backbone network weights to obtain visualization results on the DOTA-v1.0 dataset and compared them to MoCo-v2.
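A condensed sketch of the Grad-CAM computation on a pretrained backbone; since a contrastively pretrained encoder has no classification logits, the scalar target used here (the norm of the pooled embedding) is our own stand-in rather than a detail from the method:

```python
import torch
import torch.nn.functional as F

def grad_cam(backbone, image, target_layer):
    """Grad-CAM heatmap for a (1, 3, H, W) image tensor.

    `target_layer` is typically the last convolutional stage
    (e.g., layer4 of ResNet-50).
    """
    feats, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

    backbone.zero_grad()
    embedding = backbone(image)          # pooled feature vector
    embedding.norm().backward()          # scalar surrogate target

    weights = grads['a'].mean(dim=(2, 3), keepdim=True)  # global average of gradients
    cam = F.relu((weights * feats['a']).sum(dim=1))      # weighted sum of feature maps
    cam = cam / (cam.max() + 1e-8)                       # normalise to [0, 1]
    h1.remove(); h2.remove()
    return cam
```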
Figure 5 shows that MoCo-v2 could only focus on the object region at certain angles, whereas our method focused on the object region in images at all rotation angles. This demonstrates that our method can attend to objects at different angles and learns better rotation equivariance.
3.3.2. Detection Accuracy
Since the angle distribution of oriented objects is fixed in a given remote sensing dataset, testing on such a dataset can only reflect the detection results for oriented objects within that fixed angle distribution. The angle distributions of the oriented objects in the training and test datasets were similar, so it is possible that the network achieved better detection results by fitting the angles through its strong learning ability. To more accurately measure the rotation equivariance learned by the network, we used multi-angle datasets to compare the overall accuracy levels and the degrees of deviation. Specifically, we tested each multi-angle dataset separately to obtain the detection results for oriented objects within each angle distribution. We then computed the average accuracy across the angle distributions and the coefficient of variation among them, as sketched below. The average accuracy represented the overall detection accuracy; a larger average indicated better overall detection results on the multi-angle datasets. The coefficient of variation is the ratio of the standard deviation to the mean and is used to compare the dispersion of data when the means are not equal. Here, the coefficient of variation represented the degree of detection deviation; a smaller coefficient of variation indicated less variation in detection across the multi-angle datasets.
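For concreteness, a small sketch of how the two metrics are obtained from the per-angle results; the mAP values below are placeholders, not measured numbers:

```python
import numpy as np

# Hypothetical mAP (%) on the 0/90/180/270 degree test sets of one dataset.
map_per_angle = np.array([73.2, 71.8, 72.5, 72.0])

mean_map = map_per_angle.mean()              # overall detection accuracy
cv = map_per_angle.std() / mean_map * 100    # coefficient of variation (%)
print(f"average mAP = {mean_map:.2f}%, CV = {cv:.2f}%")
```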
Table 5, Table 6 and Table 7 show that our method had the highest average accuracy on all three datasets, with improvements of 0.75%, 0.59%, and 0.12% compared to the best-performing method on each dataset. The improvement was most obvious on the DOTA-v1.0 dataset and smallest on the DIOR-R dataset. We found that this was because some categories in DIOR-R are forcibly annotated with horizontal boxes even though the objects are not exactly horizontal. This confuses the angle information that the network needs to learn and thus could have affected the network training and detection results. The coefficient of variation also decreased compared to the MoCo-v2 method, by 0.48%, 2.5%, and 0.26% on the three datasets, for an average decrease of 1.08%. The decrease in the degree of deviation was most obvious on the DOTA-v1.5 dataset and smallest on the DIOR-R dataset. A possible reason is that the DOTA-v1.5 dataset has more detailed annotations, providing a greater amount of quantifiable rotation information during testing. On datasets with higher difficulty levels in rotation-related tasks, the improvement of Co-ECL compared to the baseline methods became more obvious.
The optimization of these two metrics showed that the Co-ECL method could achieve good detection results on test data with different angle distributions while relying only on training data with a single angle distribution. In other words, the model produces accurate predictions even when the angle distribution of the test set differs from that of the training set. This demonstrates that the Co-ECL method truly learns good rotation-equivariant features of objects and improves the robustness of the model on rotation-related tasks.